# tratando_anomalias
In TP1 we saw that the dataset, besides containing meaningless and null values, has anomalies and is also imbalanced.

We say that it has anomalies and that it is imbalanced because:
- Many more properties were loaded in **December 2016** than in any other year-month. We consider this an **anomaly** in the dataset, since there is no apparent reason for the volume to quadruple in the last month. We suspect that some of these records do not actually belong to that year-month.
- Something similar happens with the provinces, where **Distrito Federal** has the most listings. This is not an anomaly: it is one of the most populous provinces in Mexico, so it makes sense for it to have more listings. Even so, it makes the dataset **imbalanced**, which can cause some algorithms to perform worse.
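
Both observations can be checked by looking at how many rows each group contributes. The sketch below is a minimal example on toy data, not on the real dataset: the helper name `overrepresented_groups`, the column name `provincia`, and the threshold of three times the median group count are assumptions made for illustration.

```python
import pandas as pd

def overrepresented_groups(series, factor=3.0):
    """Return the counts of groups that appear more than `factor` times the median group count."""
    counts = series.value_counts()
    return counts[counts > factor * counts.median()]

# Toy data (hypothetical): 201612 has four times as many rows as the other months.
toy = pd.DataFrame({
    'aniomes': [201610] * 5 + [201611] * 5 + [201612] * 20,
    'provincia': ['Distrito Federal'] * 18 + ['Jalisco'] * 6 + ['Puebla'] * 6,
})

flagged_months = overrepresented_groups(toy['aniomes'])
province_share = toy['provincia'].value_counts(normalize=True)
```

On the real data, `flagged_months` would be computed from `train['aniomes']`, and `province_share` gives the fraction of listings per province.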

In the following cells we propose some possible solutions to these problems:

In [1]:
import pandas as pd
import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 42

In [2]:
import nbimporter

import pre_processing
import feature_generation
import feature_selection

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb


<hr>

### Model to use:

In [3]:
import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14,
    'num_leaves': 120,
    #'learning_rate': 0.02,
    'learning_rate': 0.1,
    'verbose': 0, 
    'early_stopping_round': 100}
n_estimators=20000

<hr>

# TREATMENT OF YEAR-MONTH DEC 2016

## First alternative: fill the date fields with nulls for those records.
We will evaluate this option by applying the treatment to the train dataset only, and then to both datasets, and comparing the results.
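
Because the same treatment has to be applied first to train only and then to both datasets, it helps to wrap it in a function. The following is a minimal sketch on toy data: the helper name `null_date_features` and the toy column `mes` are hypothetical, and in the notebook the list of date features comes from `feature_generation.get_features()`.

```python
import pandas as pd

def null_date_features(df, date_features, aniomes=201612):
    """Return a copy of `df` whose date-derived features are NaN for rows of the given year-month."""
    out = df.copy()
    rows = out['aniomes'] == aniomes
    for feature in date_features:
        if feature in out.columns:
            # mask() replaces the selected rows with NaN and upcasts integer columns to float.
            out[feature] = out[feature].mask(rows)
    return out

# Toy frame (hypothetical columns), with one feature name that does not exist.
toy = pd.DataFrame({'aniomes': [201611, 201612], 'mes': [11, 12], 'precio': [1.0, 2.0]})
treated = null_date_features(toy, ['mes', 'not_a_column'])
```

The function returns a copy, so the untreated frame stays available for the baseline run.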

In [4]:
train,_ = pre_processing.load_featured_datasets()

In [5]:
features = feature_generation.get_features()
features_con_fecha = features['fecha']['all']
features_con_fecha.remove('aniomes')

In [6]:
train = feature_selection.get_selected_dataframe(train, aniomes=True)

In [7]:
train.shape

(240000, 143)

In [8]:
#train['precio'] = train['precio'].map(lambda x: math.log(x))
train_original = train.copy()

In [9]:
# Set every date-derived feature to NaN for the rows from December 2016.
dic2016 = train['aniomes'] == 201612
for feature in features_con_fecha:
    if feature in train.columns:
        train[feature] = train[feature].mask(dic2016)

In [10]:
X = train.drop('precio', axis=1)
Y = train['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [11]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, valid_sets=watchlist, verbose_eval=50)



Training until validation scores don't improve for 100 rounds
[50]	valid_0's l1: 527929
[100]	valid_0's l1: 505191
[150]	valid_0's l1: 498357
[200]	valid_0's l1: 492876
[250]	valid_0's l1: 489861
[300]	valid_0's l1: 487847
[350]	valid_0's l1: 485954
[400]	valid_0's l1: 484178
[450]	valid_0's l1: 482796
[500]	valid_0's l1: 481263
[550]	valid_0's l1: 480411
[600]	valid_0's l1: 479460
[650]	valid_0's l1: 478899
[700]	valid_0's l1: 478171
[750]	valid_0's l1: 477665
[800]	valid_0's l1: 476887
[850]	valid_0's l1: 476144
[900]	valid_0's l1: 475495
[950]	valid_0's l1: 475019
[1000]	valid_0's l1: 474557
[1050]	valid_0's l1: 474090
[1100]	valid_0's l1: 473792
[1150]	valid_0's l1: 473587
[1200]	valid_0's l1: 473160
[1250]	valid_0's l1: 472894
[1300]	valid_0's l1: 472452
[1350]	valid_0's l1: 472147
[1400]	valid_0's l1: 471859
[1450]	valid_0's l1: 471674
[1500]	valid_0's l1: 471394
[1550]	valid_0's l1: 471146
[1600]	valid_0's l1: 470961
[1650]	valid_0's l1: 470651
[1700]	valid_0's l1: 470434
[1750]

In [12]:
# Now, for comparison, we run the same method without the treatment applied:

In [13]:
X = train_original.drop('precio', axis=1)
Y = train_original['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [14]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, valid_sets=watchlist, verbose_eval=50)

[1]	valid_0's l1: 1.49313e+06
Training until validation scores don't improve for 100 rounds
[2]	valid_0's l1: 1.3816e+06
[3]	valid_0's l1: 1.28202e+06
[4]	valid_0's l1: 1.19444e+06
[5]	valid_0's l1: 1.11661e+06
[6]	valid_0's l1: 1.04835e+06
[7]	valid_0's l1: 988318
[8]	valid_0's l1: 935538
[9]	valid_0's l1: 889024
[10]	valid_0's l1: 848363
[11]	valid_0's l1: 812304
[12]	valid_0's l1: 780963
[13]	valid_0's l1: 753686
[14]	valid_0's l1: 729260
[15]	valid_0's l1: 707683
[16]	valid_0's l1: 688571
[17]	valid_0's l1: 672146
[18]	valid_0's l1: 657551
[19]	valid_0's l1: 643949
[20]	valid_0's l1: 632549
[21]	valid_0's l1: 622215
[22]	valid_0's l1: 612500
[23]	valid_0's l1: 604324
[24]	valid_0's l1: 597011
[25]	valid_0's l1: 589664
[26]	valid_0's l1: 583230
[27]	valid_0's l1: 577622
[28]	valid_0's l1: 572652
[29]	valid_0's l1: 568054
[30]	valid_0's l1: 563876
[31]	valid_0's l1: 560475
[32]	valid_0's l1: 556905
[33]	valid_0's l1: 554043
[34]	valid_0's l1: 551328
[35]	valid_0's l1: 548808
[36]	val

[306]	valid_0's l1: 487254
[307]	valid_0's l1: 487267
[308]	valid_0's l1: 487228
[309]	valid_0's l1: 487174
[310]	valid_0's l1: 487155
[311]	valid_0's l1: 487087
[312]	valid_0's l1: 487061
[313]	valid_0's l1: 487040
[314]	valid_0's l1: 486973
[315]	valid_0's l1: 486994
[316]	valid_0's l1: 486949
[317]	valid_0's l1: 486927
[318]	valid_0's l1: 486887
[319]	valid_0's l1: 486847
[320]	valid_0's l1: 486770
[321]	valid_0's l1: 486725
[322]	valid_0's l1: 486709
[323]	valid_0's l1: 486733
[324]	valid_0's l1: 486720
[325]	valid_0's l1: 486694
[326]	valid_0's l1: 486571
[327]	valid_0's l1: 486482
[328]	valid_0's l1: 486440
[329]	valid_0's l1: 486410
[330]	valid_0's l1: 486388
[331]	valid_0's l1: 486310
[332]	valid_0's l1: 486262
[333]	valid_0's l1: 486184
[334]	valid_0's l1: 486128
[335]	valid_0's l1: 486058
[336]	valid_0's l1: 485961
[337]	valid_0's l1: 485948
[338]	valid_0's l1: 485907
[339]	valid_0's l1: 485895
[340]	valid_0's l1: 485868
[341]	valid_0's l1: 485833
[342]	valid_0's l1: 485773
[

[611]	valid_0's l1: 478957
[612]	valid_0's l1: 478959
[613]	valid_0's l1: 478931
[614]	valid_0's l1: 478941
[615]	valid_0's l1: 478921
[616]	valid_0's l1: 478911
[617]	valid_0's l1: 478881
[618]	valid_0's l1: 478849
[619]	valid_0's l1: 478838
[620]	valid_0's l1: 478800
[621]	valid_0's l1: 478766
[622]	valid_0's l1: 478729
[623]	valid_0's l1: 478691
[624]	valid_0's l1: 478688
[625]	valid_0's l1: 478672
[626]	valid_0's l1: 478615
[627]	valid_0's l1: 478634
[628]	valid_0's l1: 478592
[629]	valid_0's l1: 478532
[630]	valid_0's l1: 478533
[631]	valid_0's l1: 478527
[632]	valid_0's l1: 478497
[633]	valid_0's l1: 478474
[634]	valid_0's l1: 478429
[635]	valid_0's l1: 478423
[636]	valid_0's l1: 478421
[637]	valid_0's l1: 478388
[638]	valid_0's l1: 478406
[639]	valid_0's l1: 478375
[640]	valid_0's l1: 478335
[641]	valid_0's l1: 478308
[642]	valid_0's l1: 478303
[643]	valid_0's l1: 478314
[644]	valid_0's l1: 478285
[645]	valid_0's l1: 478276
[646]	valid_0's l1: 478294
[647]	valid_0's l1: 478248
[

[919]	valid_0's l1: 474355
[920]	valid_0's l1: 474335
[921]	valid_0's l1: 474326
[922]	valid_0's l1: 474323
[923]	valid_0's l1: 474325
[924]	valid_0's l1: 474316
[925]	valid_0's l1: 474288
[926]	valid_0's l1: 474307
[927]	valid_0's l1: 474286
[928]	valid_0's l1: 474269
[929]	valid_0's l1: 474269
[930]	valid_0's l1: 474244
[931]	valid_0's l1: 474239
[932]	valid_0's l1: 474226
[933]	valid_0's l1: 474192
[934]	valid_0's l1: 474169
[935]	valid_0's l1: 474146
[936]	valid_0's l1: 474140
[937]	valid_0's l1: 474135
[938]	valid_0's l1: 474141
[939]	valid_0's l1: 474130
[940]	valid_0's l1: 474124
[941]	valid_0's l1: 474111
[942]	valid_0's l1: 474097
[943]	valid_0's l1: 474096
[944]	valid_0's l1: 474081
[945]	valid_0's l1: 474076
[946]	valid_0's l1: 474076
[947]	valid_0's l1: 474059
[948]	valid_0's l1: 474050
[949]	valid_0's l1: 474037
[950]	valid_0's l1: 474036
[951]	valid_0's l1: 474031
[952]	valid_0's l1: 474024
[953]	valid_0's l1: 474019
[954]	valid_0's l1: 474027
[955]	valid_0's l1: 474031
[

[1217]	valid_0's l1: 471939
[1218]	valid_0's l1: 471952
[1219]	valid_0's l1: 471941
[1220]	valid_0's l1: 471914
[1221]	valid_0's l1: 471925
[1222]	valid_0's l1: 471930
[1223]	valid_0's l1: 471920
[1224]	valid_0's l1: 471938
[1225]	valid_0's l1: 471943
[1226]	valid_0's l1: 471928
[1227]	valid_0's l1: 471936
[1228]	valid_0's l1: 471933
[1229]	valid_0's l1: 471912
[1230]	valid_0's l1: 471897
[1231]	valid_0's l1: 471905
[1232]	valid_0's l1: 471889
[1233]	valid_0's l1: 471862
[1234]	valid_0's l1: 471863
[1235]	valid_0's l1: 471869
[1236]	valid_0's l1: 471867
[1237]	valid_0's l1: 471874
[1238]	valid_0's l1: 471872
[1239]	valid_0's l1: 471874
[1240]	valid_0's l1: 471866
[1241]	valid_0's l1: 471871
[1242]	valid_0's l1: 471881
[1243]	valid_0's l1: 471842
[1244]	valid_0's l1: 471864
[1245]	valid_0's l1: 471846
[1246]	valid_0's l1: 471856
[1247]	valid_0's l1: 471838
[1248]	valid_0's l1: 471837
[1249]	valid_0's l1: 471834
[1250]	valid_0's l1: 471831
[1251]	valid_0's l1: 471827
[1252]	valid_0's l1:

[1511]	valid_0's l1: 470363
[1512]	valid_0's l1: 470360
[1513]	valid_0's l1: 470344
[1514]	valid_0's l1: 470329
[1515]	valid_0's l1: 470320
[1516]	valid_0's l1: 470327
[1517]	valid_0's l1: 470324
[1518]	valid_0's l1: 470340
[1519]	valid_0's l1: 470320
[1520]	valid_0's l1: 470300
[1521]	valid_0's l1: 470309
[1522]	valid_0's l1: 470315
[1523]	valid_0's l1: 470322
[1524]	valid_0's l1: 470314
[1525]	valid_0's l1: 470321
[1526]	valid_0's l1: 470292
[1527]	valid_0's l1: 470280
[1528]	valid_0's l1: 470271
[1529]	valid_0's l1: 470268
[1530]	valid_0's l1: 470288
[1531]	valid_0's l1: 470296
[1532]	valid_0's l1: 470278
[1533]	valid_0's l1: 470267
[1534]	valid_0's l1: 470261
[1535]	valid_0's l1: 470255
[1536]	valid_0's l1: 470246
[1537]	valid_0's l1: 470256
[1538]	valid_0's l1: 470263
[1539]	valid_0's l1: 470255
[1540]	valid_0's l1: 470249
[1541]	valid_0's l1: 470251
[1542]	valid_0's l1: 470234
[1543]	valid_0's l1: 470231
[1544]	valid_0's l1: 470215
[1545]	valid_0's l1: 470228
[1546]	valid_0's l1:

[1809]	valid_0's l1: 469379
[1810]	valid_0's l1: 469369
[1811]	valid_0's l1: 469348
[1812]	valid_0's l1: 469359
[1813]	valid_0's l1: 469370
[1814]	valid_0's l1: 469364
[1815]	valid_0's l1: 469343
[1816]	valid_0's l1: 469340
[1817]	valid_0's l1: 469346
[1818]	valid_0's l1: 469352
[1819]	valid_0's l1: 469324
[1820]	valid_0's l1: 469316
[1821]	valid_0's l1: 469328
[1822]	valid_0's l1: 469315
[1823]	valid_0's l1: 469333
[1824]	valid_0's l1: 469336
[1825]	valid_0's l1: 469333
[1826]	valid_0's l1: 469336
[1827]	valid_0's l1: 469340
[1828]	valid_0's l1: 469327
[1829]	valid_0's l1: 469332
[1830]	valid_0's l1: 469332
[1831]	valid_0's l1: 469332
[1832]	valid_0's l1: 469342
[1833]	valid_0's l1: 469330
[1834]	valid_0's l1: 469332
[1835]	valid_0's l1: 469335
[1836]	valid_0's l1: 469335
[1837]	valid_0's l1: 469336
[1838]	valid_0's l1: 469335
[1839]	valid_0's l1: 469339
[1840]	valid_0's l1: 469330
[1841]	valid_0's l1: 469324
[1842]	valid_0's l1: 469324
[1843]	valid_0's l1: 469320
[1844]	valid_0's l1:

[2106]	valid_0's l1: 468662
[2107]	valid_0's l1: 468666
[2108]	valid_0's l1: 468666
[2109]	valid_0's l1: 468676
[2110]	valid_0's l1: 468675
[2111]	valid_0's l1: 468679
[2112]	valid_0's l1: 468681
[2113]	valid_0's l1: 468673
[2114]	valid_0's l1: 468667
[2115]	valid_0's l1: 468674
[2116]	valid_0's l1: 468692
[2117]	valid_0's l1: 468679
[2118]	valid_0's l1: 468673
[2119]	valid_0's l1: 468669
[2120]	valid_0's l1: 468672
[2121]	valid_0's l1: 468676
[2122]	valid_0's l1: 468671
[2123]	valid_0's l1: 468665
[2124]	valid_0's l1: 468679
[2125]	valid_0's l1: 468679
[2126]	valid_0's l1: 468665
[2127]	valid_0's l1: 468664
[2128]	valid_0's l1: 468666
[2129]	valid_0's l1: 468681
[2130]	valid_0's l1: 468668
[2131]	valid_0's l1: 468670
[2132]	valid_0's l1: 468662
[2133]	valid_0's l1: 468667
[2134]	valid_0's l1: 468666
[2135]	valid_0's l1: 468682
[2136]	valid_0's l1: 468699
[2137]	valid_0's l1: 468700
[2138]	valid_0's l1: 468683
[2139]	valid_0's l1: 468677
[2140]	valid_0's l1: 468674
[2141]	valid_0's l1:

[2404]	valid_0's l1: 468293
[2405]	valid_0's l1: 468301
[2406]	valid_0's l1: 468295
[2407]	valid_0's l1: 468284
[2408]	valid_0's l1: 468273
[2409]	valid_0's l1: 468274
[2410]	valid_0's l1: 468263
[2411]	valid_0's l1: 468262
[2412]	valid_0's l1: 468253
[2413]	valid_0's l1: 468242
[2414]	valid_0's l1: 468248
[2415]	valid_0's l1: 468248
[2416]	valid_0's l1: 468247
[2417]	valid_0's l1: 468254
[2418]	valid_0's l1: 468252
[2419]	valid_0's l1: 468253
[2420]	valid_0's l1: 468242
[2421]	valid_0's l1: 468235
[2422]	valid_0's l1: 468231
[2423]	valid_0's l1: 468227
[2424]	valid_0's l1: 468227
[2425]	valid_0's l1: 468224
[2426]	valid_0's l1: 468218
[2427]	valid_0's l1: 468208
[2428]	valid_0's l1: 468214
[2429]	valid_0's l1: 468207
[2430]	valid_0's l1: 468221
[2431]	valid_0's l1: 468220
[2432]	valid_0's l1: 468209
[2433]	valid_0's l1: 468190
[2434]	valid_0's l1: 468204
[2435]	valid_0's l1: 468193
[2436]	valid_0's l1: 468199
[2437]	valid_0's l1: 468201
[2438]	valid_0's l1: 468201
[2439]	valid_0's l1:

[2700]	valid_0's l1: 467779
[2701]	valid_0's l1: 467791
[2702]	valid_0's l1: 467795
[2703]	valid_0's l1: 467799
[2704]	valid_0's l1: 467798
[2705]	valid_0's l1: 467800
[2706]	valid_0's l1: 467798
[2707]	valid_0's l1: 467797
[2708]	valid_0's l1: 467793
[2709]	valid_0's l1: 467799
[2710]	valid_0's l1: 467801
[2711]	valid_0's l1: 467790
[2712]	valid_0's l1: 467791
[2713]	valid_0's l1: 467795
[2714]	valid_0's l1: 467792
[2715]	valid_0's l1: 467782
[2716]	valid_0's l1: 467780
[2717]	valid_0's l1: 467782
[2718]	valid_0's l1: 467794
[2719]	valid_0's l1: 467791
[2720]	valid_0's l1: 467794
[2721]	valid_0's l1: 467782
[2722]	valid_0's l1: 467769
[2723]	valid_0's l1: 467766
[2724]	valid_0's l1: 467755
[2725]	valid_0's l1: 467749
[2726]	valid_0's l1: 467753
[2727]	valid_0's l1: 467755
[2728]	valid_0's l1: 467753
[2729]	valid_0's l1: 467746
[2730]	valid_0's l1: 467746
[2731]	valid_0's l1: 467736
[2732]	valid_0's l1: 467737
[2733]	valid_0's l1: 467735
[2734]	valid_0's l1: 467739
[2735]	valid_0's l1:

[2995]	valid_0's l1: 467371
[2996]	valid_0's l1: 467373
[2997]	valid_0's l1: 467368
[2998]	valid_0's l1: 467369
[2999]	valid_0's l1: 467360
[3000]	valid_0's l1: 467359
[3001]	valid_0's l1: 467363
[3002]	valid_0's l1: 467365
[3003]	valid_0's l1: 467366
[3004]	valid_0's l1: 467364
[3005]	valid_0's l1: 467377
[3006]	valid_0's l1: 467373
[3007]	valid_0's l1: 467373
[3008]	valid_0's l1: 467373
[3009]	valid_0's l1: 467367
[3010]	valid_0's l1: 467368
[3011]	valid_0's l1: 467372
[3012]	valid_0's l1: 467372
[3013]	valid_0's l1: 467366
[3014]	valid_0's l1: 467362
[3015]	valid_0's l1: 467358
[3016]	valid_0's l1: 467361
[3017]	valid_0's l1: 467355
[3018]	valid_0's l1: 467361
[3019]	valid_0's l1: 467361
[3020]	valid_0's l1: 467365
[3021]	valid_0's l1: 467367
[3022]	valid_0's l1: 467374
[3023]	valid_0's l1: 467371
[3024]	valid_0's l1: 467373
[3025]	valid_0's l1: 467375
[3026]	valid_0's l1: 467377
[3027]	valid_0's l1: 467377
[3028]	valid_0's l1: 467371
[3029]	valid_0's l1: 467368
[3030]	valid_0's l1:

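### Results of the first alternative:
With the date fields set to null, the validation MAE is worse at the same iteration: 475,019 versus 474,036 at iteration 950, and 472,894 versus 471,831 at iteration 1,250. This alternative is therefore not worth adopting.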
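
Comparing two runs by eye is error-prone when the logs are this long. The sketch below parses log text in the format shown above and lines the two runs up by iteration. It is a minimal illustration: the helper names `parse_log` and `compare_runs` are hypothetical, and incomplete log lines are simply skipped.

```python
import re

# Matches complete entries such as "[950]\tvalid_0's l1: 475019".
LOG_LINE = re.compile(r"^\[(\d+)\]\s+valid_0's l1: ([0-9.eE+]+)$")

def parse_log(text):
    """Map iteration -> validation MAE, skipping lines that are not complete log entries."""
    scores = {}
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            scores[int(match.group(1))] = float(match.group(2))
    return scores

def compare_runs(treated, baseline):
    """Return (iteration, treated MAE, baseline MAE, difference) for the shared iterations."""
    shared = sorted(set(treated) & set(baseline))
    return [(it, treated[it], baseline[it], treated[it] - baseline[it]) for it in shared]

# Entries copied from the two training logs above, including two truncated lines.
with_nulls = "[950]\tvalid_0's l1: 475019\n[1250]\tvalid_0's l1: 472894\n[1750]"
without_nulls = "[950]\tvalid_0's l1: 474036\n[1250]\tvalid_0's l1: 471831\n[1252]\tvalid_0's l1:"

rows = compare_runs(parse_log(with_nulls), parse_log(without_nulls))
```

A positive difference in the last field means the run with nulls has the higher (worse) MAE at that iteration.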

# Classification problem

In [5]:
train,_ = pre_processing.load_featured_datasets()

In [6]:
train = feature_selection.get_selected_dataframe(train, aniomes=True)

In [7]:
mean = train['precio'].mean()
std = train['precio'].std()

In [8]:
def bin_std(x, sup, inf):
    # 1 if x lies strictly between inf and sup, 0 otherwise.
    return 1 if inf < x < sup else 0

In [9]:
train['precio_confiable'] = train['precio'].map(lambda x: bin_std(x, mean+std, mean-std))

In [10]:
train['precio_confiable'].value_counts()

1    201030
0     38970
Name: precio_confiable, dtype: int64
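
The labelling step can also be written without a Python-level loop. This is a minimal sketch on toy prices (not data from the notebook) that marks as reliable every price strictly within one standard deviation of the mean; the helper name `label_within_one_std` is hypothetical.

```python
import pandas as pd

def label_within_one_std(prices):
    """Return 1 for prices strictly inside (mean - std, mean + std), else 0."""
    mean, std = prices.mean(), prices.std()
    return ((prices > mean - std) & (prices < mean + std)).astype(int)

# Toy prices: one value far above the rest falls outside the band.
toy_prices = pd.Series([100.0, 200.0, 300.0, 400.0, 5000.0])
labels = label_within_one_std(toy_prices)
```

In the notebook this would be applied as `label_within_one_std(train['precio'])`.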

In [11]:
# We will see that we predict much better for values within the reliable price range.

In [12]:
train_a = train.loc[train['precio_confiable'] == 1].drop('precio_confiable', axis=1).copy()
train_b = train.loc[train['precio_confiable'] == 0].drop('precio_confiable', axis=1).copy()

In [13]:
# Apply LGB to the dataset with reliable prices:

In [14]:
X = train_a.drop('precio', axis=1)
Y = train_a['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [15]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, valid_sets=watchlist, verbose_eval=50)



Training until validation scores don't improve for 100 rounds
[50]	valid_0's l1: 343902
[100]	valid_0's l1: 330180
[150]	valid_0's l1: 325279
[200]	valid_0's l1: 322444
[250]	valid_0's l1: 320314
[300]	valid_0's l1: 318392
[350]	valid_0's l1: 317074
[400]	valid_0's l1: 315854
[450]	valid_0's l1: 314904
[500]	valid_0's l1: 313950
[550]	valid_0's l1: 313310
[600]	valid_0's l1: 312655
[650]	valid_0's l1: 312118
[700]	valid_0's l1: 311618
[750]	valid_0's l1: 310943
[800]	valid_0's l1: 310263
[850]	valid_0's l1: 309711
[900]	valid_0's l1: 309347
[950]	valid_0's l1: 308924
[1000]	valid_0's l1: 308535
[1050]	valid_0's l1: 308102
[1100]	valid_0's l1: 307945
[1150]	valid_0's l1: 307689
[1200]	valid_0's l1: 307447
[1250]	valid_0's l1: 307273
[1300]	valid_0's l1: 307182
[1350]	valid_0's l1: 307094
[1400]	valid_0's l1: 306865
[1450]	valid_0's l1: 306641
[1500]	valid_0's l1: 306466
[1550]	valid_0's l1: 306310
[1600]	valid_0's l1: 306115
[1650]	valid_0's l1: 305946
[1700]	valid_0's l1: 305712
[1750]

**REMINDER:** Our best MAE so far is 474k. If we work only with the roughly 200k properties in the reliable price range, about 84% of the dataset, we get much better results: **an MAE of 304k**, an improvement of roughly 170k.

We will try to work on the noisier subset to improve its predictions.

In [16]:
# LGB for the dataset with noisy prices:

In [17]:
X = train_b.drop('precio', axis=1)
Y = train_b['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [18]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, valid_sets=watchlist, verbose_eval=50)

Training until validation scores don't improve for 100 rounds
[50]	valid_0's l1: 875761
[100]	valid_0's l1: 853897
[150]	valid_0's l1: 844271
[200]	valid_0's l1: 839484
[250]	valid_0's l1: 837976
[300]	valid_0's l1: 835196
[350]	valid_0's l1: 833524
[400]	valid_0's l1: 833095
[450]	valid_0's l1: 833207
[500]	valid_0's l1: 833582
Early stopping, best iteration is:
[426]	valid_0's l1: 832643


The **MAE is 832k**, far higher than the overall MAE. Still, we will try to improve it through feature engineering.
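
To get a rough idea of what the split could buy, the two validation MAEs can be combined, weighted by the size of each subset. This is a back-of-the-envelope sketch. It assumes that a perfect classifier routes every property to the right model and that each validation MAE carries over to its whole subset. The helper name `combined_mae` is hypothetical, and 304,000 is the rounded figure quoted above.

```python
def combined_mae(groups):
    """Weighted average of per-group MAEs; `groups` is a list of (n_rows, mae) pairs."""
    total_rows = sum(n for n, _ in groups)
    return sum(n * mae for n, mae in groups) / total_rows

# Subset sizes from value_counts() and MAEs from the two runs above.
estimate = combined_mae([(201030, 304_000), (38970, 832_643)])
```

Under those assumptions the estimate comes out at roughly 390k. It is best read as an optimistic reference rather than a forecast, because any classification error changes which model each property reaches.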

But before trying to improve it, we need to check whether this split is feasible.
To apply the method at prediction time, we first need a classifier with a very good AUC score, so that each property is routed to the right model. We will try a LightGBM binary classifier.

In [24]:
train = train.drop('precio', axis=1)

In [25]:
X = train.drop('precio_confiable', axis=1)
Y = train['precio_confiable']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [26]:
lightgbm = {'objective': 'binary',
            'num_leaves': 36,
            'metric':'auc',
            'n_estimators': 70,
            'min_split_gain': 0.01,
            'min_child_weight': 5.00001,
            'max_depth': 4,
            'learning_rate': 0.05,
            'lambda_l2': 0,
            'feature_fraction': 0.7000000000000001,
            'bagging_fraction': 1.0}

In [28]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(lightgbm, d_train, n_estimators, valid_sets=watchlist, verbose_eval=1)

[1]	valid_0's auc: 0.85031
[2]	valid_0's auc: 0.874947
[3]	valid_0's auc: 0.881422
[4]	valid_0's auc: 0.893386
[5]	valid_0's auc: 0.894878
[6]	valid_0's auc: 0.895301
[7]	valid_0's auc: 0.896227
[8]	valid_0's auc: 0.897843
[9]	valid_0's auc: 0.899719
[10]	valid_0's auc: 0.900571
[11]	valid_0's auc: 0.901358
[12]	valid_0's auc: 0.908662
[13]	valid_0's auc: 0.911072
[14]	valid_0's auc: 0.911752
[15]	valid_0's auc: 0.912758
[16]	valid_0's auc: 0.915742
[17]	valid_0's auc: 0.916175
[18]	valid_0's auc: 0.917268
[19]	valid_0's auc: 0.917971
[20]	valid_0's auc: 0.918791
[21]	valid_0's auc: 0.918843
[22]	valid_0's auc: 0.919597
[23]	valid_0's auc: 0.920013
[24]	valid_0's auc: 0.92022
[25]	valid_0's auc: 0.921097
[26]	valid_0's auc: 0.9214
[27]	valid_0's auc: 0.921728
[28]	valid_0's auc: 0.922246
[29]	valid_0's auc: 0.922502
[30]	valid_0's auc: 0.922759
[31]	valid_0's auc: 0.923277
[32]	valid_0's auc: 0.92331
[33]	valid_0's auc: 0.923787
[34]	valid_0's auc: 0.924288
[35]	valid_0's auc: 0.924361

In [73]:
Y_pred = reg.predict(X_val)

In [74]:
f = np.vectorize(lambda x: 1 if (x>0.5) else 0)

In [75]:
Y_pred = f(Y_pred)

In [76]:
df = X_val.copy()

In [77]:
df['target'] = Y_val

In [78]:
df['target_predicted'] = Y_pred

In [79]:
df['error'] = (df['target'] != df['target_predicted']).astype(int)

In [80]:
df['error'].value_counts()

0    43688
1     4312
Name: error, dtype: int64

In [81]:
# Fairly good results? About 91% accuracy (43,688 of 48,000 correct).
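
Accuracy alone can look flattering on an imbalanced target, so it is worth comparing it with always predicting the majority class and looking at the recall of each class. The following is a minimal sketch on toy labels, not on the notebook's predictions; the helper name `classification_summary` is hypothetical.

```python
from collections import Counter

def classification_summary(y_true, y_pred):
    """Return accuracy, majority-class baseline accuracy and per-class recall."""
    pairs = Counter(zip(y_true, y_pred))
    totals = Counter(y_true)
    correct = sum(n for (t, p), n in pairs.items() if t == p)
    return {
        'accuracy': correct / len(y_true),
        'baseline': max(totals.values()) / len(y_true),
        'recall': {label: pairs[(label, label)] / n for label, n in totals.items()},
    }

# Toy labels: eight positives, two negatives, one negative misclassified.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
summary = classification_summary(y_true, y_pred)
```

In the notebook, `y_true` and `y_pred` would be `df['target']` and `df['target_predicted']`. With 201,030 of 240,000 properties labelled 1, always predicting 1 already gives about 84% accuracy on the full training set.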

In [82]:
# Can we improve train_b?

In [83]:
train_b

id,antiguedad,habitaciones,garages,banos,metroscubiertos,metrostotales,idzona,lat,lng,gimnasio,...,antiguedad_binning_8_ohe1,antiguedad_binning_9_ohe1,usd_precio_promedio_mensual,usd_subio,volcan_cerca,idzona_meanencoding_m0,idzona_meanencoding_m1,idzona_meanencoding_m4,precio,aniomes
44962,1.0,2.0,1.0,1.0,58.0,,9010.0,,,0.0,...,0,0,13.205298,1,,3.640000e+05,9.830967e+05,1.697439e+06,310000.0,201401
134537,,,,,250.0,,59171.0,19.316000,-98.887000,0.0,...,0,0,20.521016,1,1.0,6.200000e+06,3.753892e+06,2.938523e+06,6200000.0,201612
103293,,3.0,2.0,4.0,256.0,,325095.0,,,0.0,...,0,0,14.532748,1,,6.558305e+06,6.155558e+06,5.319085e+06,7200000.0,201412
181436,,,2.0,4.0,250.0,231.0,47732.0,20.729601,-103.431993,0.0,...,0,0,13.619410,1,1.0,5.786302e+06,5.770186e+06,5.722781e+06,5300000.0,201411
73348,5.0,3.0,2.0,,127.0,127.0,50003995.0,,,0.0,...,0,0,18.894074,0,,4.090639e+06,4.089411e+06,4.085739e+06,4750000.0,201610
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174073,0.0,3.0,2.0,,370.0,252.0,126640.0,,,0.0,...,0,0,18.462743,1,,7.349857e+06,6.278964e+06,4.779714e+06,7849000.0,201602
41606,5.0,3.0,2.0,2.0,180.0,180.0,55570.0,,,0.0,...,0,0,20.521016,1,,6.185104e+06,6.159460e+06,6.084643e+06,6800000.0,201612
67435,1.0,3.0,3.0,3.0,260.0,,55552.0,19.360489,-99.310568,0.0,...,0,0,13.061600,0,0.0,5.752816e+06,5.740861e+06,5.705521e+06,5100000.0,201404
252709,,2.0,1.0,1.0,55.0,90.0,2479.0,32.507828,-116.838982,0.0,...,0,0,18.462743,1,0.0,3.945556e+05,6.081838e+05,1.051873e+06,357000.0,201602


# Code scrap bin

In [None]:
# Logarithmic prediction: undo the log transform before computing the MAE.

Y_pred = reg.predict(X_val.values)

Y_pred = np.exp(Y_pred)
Y_val = np.exp(Y_val.values)
mean_absolute_error(Y_val, Y_pred)