# TP2 - Organización de Datos
#### Notebook principal

<hr>

### Notebooks utilizados:

- ***pre_processing:*** notebook para el manejo inicial de los dataframes.
- ***feature_generation:*** primer etapa del pipeline. En este notebook se generarán nuevos features para luego, realizar un proceso de selección de los mejores features para cada modelo.
- ***feature_selection*** segunda etapa, donde se buscara encontrar los features con mayor importancia, es decir aquellos que aporten mayor informacion.
- ***parameter_tuning:*** tercer etapa, notebook donde se tunean los parámetros para cada modelo.
- ***predict:*** finalmente, una vez obtenidos los mejores parametros y features para cada modelo, este notebook se encargará de generar el csv con las predicciones finales para el modelo que se le indique.

<hr>


In [1]:
import pandas as pd
import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 7

In [2]:
import nbimporter

import pre_processing
import feature_generation
import feature_selection
import parameter_tuning
import predict

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb
Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from predict.ipynb


In [3]:
def escribir_respuesta(ids,predicciones):
    with open("respuesta.csv",'w') as archivo:
        archivo.write("id,target\n")
        for i in range(len(ids)):
            linea = f"{int(ids[i])},{predicciones[i]}"
            archivo.write(f"{linea}\n")

<hr>

# Resultados obtenidos

# area de testing:

In [4]:
import lightgbm as lgb

In [5]:
train,test = pre_processing.load_featured_datasets()

In [6]:
train['precio'] = train['precio'].map(lambda x: math.log(x))

In [7]:
features = feature_generation.get_features()

In [14]:
features['metros']

{0: ['metrostotales_confiables_alt'],
 1: ['metroscubiertos_alt', 'metrostotales_alt'],
 2: ['metroscubiertos_i1', 'metrostotales_i1'],
 3: ['metroscubiertos_alt', 'metrostotales_i2'],
 'all': ['metrostotales_confiables_alt',
  'metroscubiertos_alt',
  'metrostotales_alt',
  'metroscubiertos_i1',
  'metrostotales_i1',
  'metrostotales_i2']}

In [15]:
train_selected = train[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos', 'aniomes']\
                       +features["metros"][0]\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][3]\
                       +features["provincia"][0]\
                       +features["provincia"][3]\
                       +features["provincia"][5]\
                       +features["provincia"][6]\
                       +features["ciudad"][2]\
                       +features["ciudad"][3]\
                       +features["ciudad"][4]\
                       +features["fecha"][0]\
                       +features["descripcion"][0]\
                       +features["metricas"]['all']\
                       +features["habitaciones"]['all']\
                       +features["antiguedad"]['all']\
                       +features["extras"]['all']\
                       +features["volcanes"]['all']\
                       +features["idzona"][0]\
                       +features["idzona"][1]\
                       +features["idzona"][2]\
                       +features["idzona"][3]\
                       +features["idzona"][4]\
                       +["precio"]]

In [16]:
X = train_selected.drop('precio', axis=1)
Y = train_selected['precio']

In [17]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [18]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.1,
    'verbose': 0, 
    'early_stopping_round': 50,
    'n_jobs':2}
n_estimators=10000

In [19]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)



[1]	valid_0's l1: 0.637798
Training until validation scores don't improve for 50 rounds
[2]	valid_0's l1: 0.590846
[3]	valid_0's l1: 0.549523
[4]	valid_0's l1: 0.51271
[5]	valid_0's l1: 0.480274
[6]	valid_0's l1: 0.451318
[7]	valid_0's l1: 0.426066
[8]	valid_0's l1: 0.403509
[9]	valid_0's l1: 0.383537
[10]	valid_0's l1: 0.365994
[11]	valid_0's l1: 0.35086
[12]	valid_0's l1: 0.33751
[13]	valid_0's l1: 0.32564
[14]	valid_0's l1: 0.314824
[15]	valid_0's l1: 0.305485
[16]	valid_0's l1: 0.296982
[17]	valid_0's l1: 0.289712
[18]	valid_0's l1: 0.283107
[19]	valid_0's l1: 0.277311
[20]	valid_0's l1: 0.272378
[21]	valid_0's l1: 0.267824
[22]	valid_0's l1: 0.263773
[23]	valid_0's l1: 0.260357
[24]	valid_0's l1: 0.256846
[25]	valid_0's l1: 0.253928
[26]	valid_0's l1: 0.251387
[27]	valid_0's l1: 0.248956
[28]	valid_0's l1: 0.246875
[29]	valid_0's l1: 0.244887
[30]	valid_0's l1: 0.243139
[31]	valid_0's l1: 0.241392
[32]	valid_0's l1: 0.239989
[33]	valid_0's l1: 0.238565
[34]	valid_0's l1: 0.237229


[291]	valid_0's l1: 0.205227
[292]	valid_0's l1: 0.205179
[293]	valid_0's l1: 0.205127
[294]	valid_0's l1: 0.205087
[295]	valid_0's l1: 0.205063
[296]	valid_0's l1: 0.205005
[297]	valid_0's l1: 0.205004
[298]	valid_0's l1: 0.204963
[299]	valid_0's l1: 0.204917
[300]	valid_0's l1: 0.204908
[301]	valid_0's l1: 0.204885
[302]	valid_0's l1: 0.204863
[303]	valid_0's l1: 0.204834
[304]	valid_0's l1: 0.204826
[305]	valid_0's l1: 0.204761
[306]	valid_0's l1: 0.204738
[307]	valid_0's l1: 0.204734
[308]	valid_0's l1: 0.20473
[309]	valid_0's l1: 0.204708
[310]	valid_0's l1: 0.204692
[311]	valid_0's l1: 0.204662
[312]	valid_0's l1: 0.204652
[313]	valid_0's l1: 0.204651
[314]	valid_0's l1: 0.204626
[315]	valid_0's l1: 0.204595
[316]	valid_0's l1: 0.204565
[317]	valid_0's l1: 0.204543
[318]	valid_0's l1: 0.204544
[319]	valid_0's l1: 0.204542
[320]	valid_0's l1: 0.204528
[321]	valid_0's l1: 0.20448
[322]	valid_0's l1: 0.204446
[323]	valid_0's l1: 0.204429
[324]	valid_0's l1: 0.204398
[325]	valid_0's 

[575]	valid_0's l1: 0.199974
[576]	valid_0's l1: 0.199967
[577]	valid_0's l1: 0.199939
[578]	valid_0's l1: 0.199939
[579]	valid_0's l1: 0.199887
[580]	valid_0's l1: 0.199873
[581]	valid_0's l1: 0.199872
[582]	valid_0's l1: 0.199851
[583]	valid_0's l1: 0.199848
[584]	valid_0's l1: 0.19984
[585]	valid_0's l1: 0.19982
[586]	valid_0's l1: 0.19981
[587]	valid_0's l1: 0.19978
[588]	valid_0's l1: 0.199763
[589]	valid_0's l1: 0.199751
[590]	valid_0's l1: 0.199733
[591]	valid_0's l1: 0.199721
[592]	valid_0's l1: 0.199724
[593]	valid_0's l1: 0.199717
[594]	valid_0's l1: 0.199669
[595]	valid_0's l1: 0.199659
[596]	valid_0's l1: 0.199641
[597]	valid_0's l1: 0.19964
[598]	valid_0's l1: 0.199631
[599]	valid_0's l1: 0.199622
[600]	valid_0's l1: 0.199593
[601]	valid_0's l1: 0.19958
[602]	valid_0's l1: 0.199568
[603]	valid_0's l1: 0.199561
[604]	valid_0's l1: 0.199543
[605]	valid_0's l1: 0.199525
[606]	valid_0's l1: 0.199501
[607]	valid_0's l1: 0.199496
[608]	valid_0's l1: 0.199494
[609]	valid_0's l1: 

[864]	valid_0's l1: 0.196917
[865]	valid_0's l1: 0.196915
[866]	valid_0's l1: 0.196902
[867]	valid_0's l1: 0.196894
[868]	valid_0's l1: 0.196881
[869]	valid_0's l1: 0.196869
[870]	valid_0's l1: 0.196867
[871]	valid_0's l1: 0.196871
[872]	valid_0's l1: 0.196868
[873]	valid_0's l1: 0.196868
[874]	valid_0's l1: 0.196869
[875]	valid_0's l1: 0.196864
[876]	valid_0's l1: 0.196851
[877]	valid_0's l1: 0.196852
[878]	valid_0's l1: 0.196837
[879]	valid_0's l1: 0.196837
[880]	valid_0's l1: 0.196833
[881]	valid_0's l1: 0.196824
[882]	valid_0's l1: 0.19682
[883]	valid_0's l1: 0.196806
[884]	valid_0's l1: 0.19679
[885]	valid_0's l1: 0.196773
[886]	valid_0's l1: 0.196766
[887]	valid_0's l1: 0.196757
[888]	valid_0's l1: 0.196738
[889]	valid_0's l1: 0.196729
[890]	valid_0's l1: 0.196726
[891]	valid_0's l1: 0.196717
[892]	valid_0's l1: 0.19671
[893]	valid_0's l1: 0.196699
[894]	valid_0's l1: 0.196691
[895]	valid_0's l1: 0.196688
[896]	valid_0's l1: 0.196687
[897]	valid_0's l1: 0.196678
[898]	valid_0's l

[1151]	valid_0's l1: 0.194943
[1152]	valid_0's l1: 0.19493
[1153]	valid_0's l1: 0.194934
[1154]	valid_0's l1: 0.194922
[1155]	valid_0's l1: 0.194918
[1156]	valid_0's l1: 0.194916
[1157]	valid_0's l1: 0.194908
[1158]	valid_0's l1: 0.194905
[1159]	valid_0's l1: 0.194901
[1160]	valid_0's l1: 0.194898
[1161]	valid_0's l1: 0.194894
[1162]	valid_0's l1: 0.194893
[1163]	valid_0's l1: 0.194889
[1164]	valid_0's l1: 0.19488
[1165]	valid_0's l1: 0.194875
[1166]	valid_0's l1: 0.194874
[1167]	valid_0's l1: 0.194873
[1168]	valid_0's l1: 0.194855
[1169]	valid_0's l1: 0.194846
[1170]	valid_0's l1: 0.19484
[1171]	valid_0's l1: 0.194847
[1172]	valid_0's l1: 0.194843
[1173]	valid_0's l1: 0.194837
[1174]	valid_0's l1: 0.194834
[1175]	valid_0's l1: 0.194829
[1176]	valid_0's l1: 0.194828
[1177]	valid_0's l1: 0.194832
[1178]	valid_0's l1: 0.194828
[1179]	valid_0's l1: 0.194821
[1180]	valid_0's l1: 0.194812
[1181]	valid_0's l1: 0.194807
[1182]	valid_0's l1: 0.194807
[1183]	valid_0's l1: 0.194797
[1184]	valid_

[1430]	valid_0's l1: 0.19363
[1431]	valid_0's l1: 0.193626
[1432]	valid_0's l1: 0.193621
[1433]	valid_0's l1: 0.193604
[1434]	valid_0's l1: 0.1936
[1435]	valid_0's l1: 0.193596
[1436]	valid_0's l1: 0.193593
[1437]	valid_0's l1: 0.193599
[1438]	valid_0's l1: 0.193594
[1439]	valid_0's l1: 0.193583
[1440]	valid_0's l1: 0.193584
[1441]	valid_0's l1: 0.193582
[1442]	valid_0's l1: 0.19358
[1443]	valid_0's l1: 0.193573
[1444]	valid_0's l1: 0.193569
[1445]	valid_0's l1: 0.19357
[1446]	valid_0's l1: 0.193566
[1447]	valid_0's l1: 0.19356
[1448]	valid_0's l1: 0.193556
[1449]	valid_0's l1: 0.193551
[1450]	valid_0's l1: 0.193539
[1451]	valid_0's l1: 0.193537
[1452]	valid_0's l1: 0.193535
[1453]	valid_0's l1: 0.193525
[1454]	valid_0's l1: 0.19351
[1455]	valid_0's l1: 0.193503
[1456]	valid_0's l1: 0.193497
[1457]	valid_0's l1: 0.193496
[1458]	valid_0's l1: 0.193496
[1459]	valid_0's l1: 0.193487
[1460]	valid_0's l1: 0.193483
[1461]	valid_0's l1: 0.193476
[1462]	valid_0's l1: 0.193471
[1463]	valid_0's 

[1709]	valid_0's l1: 0.192709
[1710]	valid_0's l1: 0.192706
[1711]	valid_0's l1: 0.192701
[1712]	valid_0's l1: 0.192702
[1713]	valid_0's l1: 0.192696
[1714]	valid_0's l1: 0.192698
[1715]	valid_0's l1: 0.192703
[1716]	valid_0's l1: 0.192702
[1717]	valid_0's l1: 0.192696
[1718]	valid_0's l1: 0.1927
[1719]	valid_0's l1: 0.192682
[1720]	valid_0's l1: 0.19269
[1721]	valid_0's l1: 0.19269
[1722]	valid_0's l1: 0.19269
[1723]	valid_0's l1: 0.192685
[1724]	valid_0's l1: 0.192684
[1725]	valid_0's l1: 0.192683
[1726]	valid_0's l1: 0.192684
[1727]	valid_0's l1: 0.192678
[1728]	valid_0's l1: 0.19268
[1729]	valid_0's l1: 0.192682
[1730]	valid_0's l1: 0.192681
[1731]	valid_0's l1: 0.192681
[1732]	valid_0's l1: 0.192673
[1733]	valid_0's l1: 0.192673
[1734]	valid_0's l1: 0.192672
[1735]	valid_0's l1: 0.192674
[1736]	valid_0's l1: 0.192674
[1737]	valid_0's l1: 0.192674
[1738]	valid_0's l1: 0.192668
[1739]	valid_0's l1: 0.192665
[1740]	valid_0's l1: 0.192663
[1741]	valid_0's l1: 0.192667
[1742]	valid_0's

[1986]	valid_0's l1: 0.191948
[1987]	valid_0's l1: 0.191949
[1988]	valid_0's l1: 0.191946
[1989]	valid_0's l1: 0.191943
[1990]	valid_0's l1: 0.191938
[1991]	valid_0's l1: 0.191937
[1992]	valid_0's l1: 0.191931
[1993]	valid_0's l1: 0.191931
[1994]	valid_0's l1: 0.191931
[1995]	valid_0's l1: 0.191931
[1996]	valid_0's l1: 0.191927
[1997]	valid_0's l1: 0.191923
[1998]	valid_0's l1: 0.191924
[1999]	valid_0's l1: 0.191924
[2000]	valid_0's l1: 0.191921
[2001]	valid_0's l1: 0.191923
[2002]	valid_0's l1: 0.191919
[2003]	valid_0's l1: 0.191916
[2004]	valid_0's l1: 0.191919
[2005]	valid_0's l1: 0.191914
[2006]	valid_0's l1: 0.191915
[2007]	valid_0's l1: 0.191906
[2008]	valid_0's l1: 0.191905
[2009]	valid_0's l1: 0.191906
[2010]	valid_0's l1: 0.1919
[2011]	valid_0's l1: 0.191897
[2012]	valid_0's l1: 0.191895
[2013]	valid_0's l1: 0.191893
[2014]	valid_0's l1: 0.191894
[2015]	valid_0's l1: 0.191891
[2016]	valid_0's l1: 0.191888
[2017]	valid_0's l1: 0.191881
[2018]	valid_0's l1: 0.191884
[2019]	valid

[2269]	valid_0's l1: 0.191286
[2270]	valid_0's l1: 0.191283
[2271]	valid_0's l1: 0.191282
[2272]	valid_0's l1: 0.191283
[2273]	valid_0's l1: 0.191276
[2274]	valid_0's l1: 0.191272
[2275]	valid_0's l1: 0.191273
[2276]	valid_0's l1: 0.191275
[2277]	valid_0's l1: 0.191275
[2278]	valid_0's l1: 0.191273
[2279]	valid_0's l1: 0.191273
[2280]	valid_0's l1: 0.191272
[2281]	valid_0's l1: 0.191275
[2282]	valid_0's l1: 0.191271
[2283]	valid_0's l1: 0.19127
[2284]	valid_0's l1: 0.191271
[2285]	valid_0's l1: 0.191269
[2286]	valid_0's l1: 0.19127
[2287]	valid_0's l1: 0.191263
[2288]	valid_0's l1: 0.191258
[2289]	valid_0's l1: 0.191258
[2290]	valid_0's l1: 0.191256
[2291]	valid_0's l1: 0.191255
[2292]	valid_0's l1: 0.191249
[2293]	valid_0's l1: 0.191247
[2294]	valid_0's l1: 0.191247
[2295]	valid_0's l1: 0.191239
[2296]	valid_0's l1: 0.191236
[2297]	valid_0's l1: 0.191231
[2298]	valid_0's l1: 0.191233
[2299]	valid_0's l1: 0.191226
[2300]	valid_0's l1: 0.191214
[2301]	valid_0's l1: 0.191209
[2302]	valid

[2545]	valid_0's l1: 0.190759
[2546]	valid_0's l1: 0.190758
[2547]	valid_0's l1: 0.190756
[2548]	valid_0's l1: 0.190754
[2549]	valid_0's l1: 0.190749
[2550]	valid_0's l1: 0.190747
[2551]	valid_0's l1: 0.190743
[2552]	valid_0's l1: 0.190741
[2553]	valid_0's l1: 0.190734
[2554]	valid_0's l1: 0.19073
[2555]	valid_0's l1: 0.19073
[2556]	valid_0's l1: 0.190731
[2557]	valid_0's l1: 0.190731
[2558]	valid_0's l1: 0.19073
[2559]	valid_0's l1: 0.190727
[2560]	valid_0's l1: 0.19072
[2561]	valid_0's l1: 0.190718
[2562]	valid_0's l1: 0.190712
[2563]	valid_0's l1: 0.190712
[2564]	valid_0's l1: 0.190712
[2565]	valid_0's l1: 0.190709
[2566]	valid_0's l1: 0.190702
[2567]	valid_0's l1: 0.190703
[2568]	valid_0's l1: 0.190699
[2569]	valid_0's l1: 0.190698
[2570]	valid_0's l1: 0.190696
[2571]	valid_0's l1: 0.190697
[2572]	valid_0's l1: 0.190692
[2573]	valid_0's l1: 0.19069
[2574]	valid_0's l1: 0.190694
[2575]	valid_0's l1: 0.190693
[2576]	valid_0's l1: 0.19069
[2577]	valid_0's l1: 0.190685
[2578]	valid_0's

[2820]	valid_0's l1: 0.190343
[2821]	valid_0's l1: 0.190344
[2822]	valid_0's l1: 0.190346
[2823]	valid_0's l1: 0.190348
[2824]	valid_0's l1: 0.190347
[2825]	valid_0's l1: 0.190343
[2826]	valid_0's l1: 0.190339
[2827]	valid_0's l1: 0.190336
[2828]	valid_0's l1: 0.190341
[2829]	valid_0's l1: 0.190335
[2830]	valid_0's l1: 0.190338
[2831]	valid_0's l1: 0.190338
[2832]	valid_0's l1: 0.190339
[2833]	valid_0's l1: 0.190333
[2834]	valid_0's l1: 0.19033
[2835]	valid_0's l1: 0.190324
[2836]	valid_0's l1: 0.190321
[2837]	valid_0's l1: 0.19032
[2838]	valid_0's l1: 0.190318
[2839]	valid_0's l1: 0.19032
[2840]	valid_0's l1: 0.190311
[2841]	valid_0's l1: 0.190305
[2842]	valid_0's l1: 0.190305
[2843]	valid_0's l1: 0.190307
[2844]	valid_0's l1: 0.190308
[2845]	valid_0's l1: 0.190312
[2846]	valid_0's l1: 0.190309
[2847]	valid_0's l1: 0.19031
[2848]	valid_0's l1: 0.190312
[2849]	valid_0's l1: 0.190312
[2850]	valid_0's l1: 0.190314
[2851]	valid_0's l1: 0.190303
[2852]	valid_0's l1: 0.190304
[2853]	valid_0

[3097]	valid_0's l1: 0.190031
[3098]	valid_0's l1: 0.190029
[3099]	valid_0's l1: 0.19003
[3100]	valid_0's l1: 0.190032
[3101]	valid_0's l1: 0.190031
[3102]	valid_0's l1: 0.190034
[3103]	valid_0's l1: 0.190035
[3104]	valid_0's l1: 0.190022
[3105]	valid_0's l1: 0.190022
[3106]	valid_0's l1: 0.190023
[3107]	valid_0's l1: 0.190028
[3108]	valid_0's l1: 0.190025
[3109]	valid_0's l1: 0.19002
[3110]	valid_0's l1: 0.190022
[3111]	valid_0's l1: 0.190019
[3112]	valid_0's l1: 0.190021
[3113]	valid_0's l1: 0.190024
[3114]	valid_0's l1: 0.190017
[3115]	valid_0's l1: 0.190015
[3116]	valid_0's l1: 0.190016
[3117]	valid_0's l1: 0.190013
[3118]	valid_0's l1: 0.19001
[3119]	valid_0's l1: 0.190009
[3120]	valid_0's l1: 0.190004
[3121]	valid_0's l1: 0.190004
[3122]	valid_0's l1: 0.190006
[3123]	valid_0's l1: 0.190007
[3124]	valid_0's l1: 0.190005
[3125]	valid_0's l1: 0.190007
[3126]	valid_0's l1: 0.190009
[3127]	valid_0's l1: 0.190004
[3128]	valid_0's l1: 0.190004
[3129]	valid_0's l1: 0.190003
[3130]	valid_

[3373]	valid_0's l1: 0.189773
[3374]	valid_0's l1: 0.189774
[3375]	valid_0's l1: 0.189779
[3376]	valid_0's l1: 0.189781
[3377]	valid_0's l1: 0.189778
[3378]	valid_0's l1: 0.189781
[3379]	valid_0's l1: 0.189774
[3380]	valid_0's l1: 0.189771
[3381]	valid_0's l1: 0.189773
[3382]	valid_0's l1: 0.189773
[3383]	valid_0's l1: 0.189774
[3384]	valid_0's l1: 0.189774
[3385]	valid_0's l1: 0.189777
[3386]	valid_0's l1: 0.189779
[3387]	valid_0's l1: 0.189778
[3388]	valid_0's l1: 0.189782
[3389]	valid_0's l1: 0.18978
[3390]	valid_0's l1: 0.189777
[3391]	valid_0's l1: 0.189777
[3392]	valid_0's l1: 0.189776
[3393]	valid_0's l1: 0.189779
[3394]	valid_0's l1: 0.18978
[3395]	valid_0's l1: 0.189779
[3396]	valid_0's l1: 0.189779
[3397]	valid_0's l1: 0.189778
[3398]	valid_0's l1: 0.189778
[3399]	valid_0's l1: 0.189777
[3400]	valid_0's l1: 0.189779
[3401]	valid_0's l1: 0.189778
[3402]	valid_0's l1: 0.18977
[3403]	valid_0's l1: 0.189765
[3404]	valid_0's l1: 0.189763
[3405]	valid_0's l1: 0.189765
[3406]	valid_

[3654]	valid_0's l1: 0.189548
[3655]	valid_0's l1: 0.189549
[3656]	valid_0's l1: 0.189538
[3657]	valid_0's l1: 0.189539
[3658]	valid_0's l1: 0.189538
[3659]	valid_0's l1: 0.189537
[3660]	valid_0's l1: 0.189535
[3661]	valid_0's l1: 0.189537
[3662]	valid_0's l1: 0.189536
[3663]	valid_0's l1: 0.189533
[3664]	valid_0's l1: 0.189529
[3665]	valid_0's l1: 0.189528
[3666]	valid_0's l1: 0.189528
[3667]	valid_0's l1: 0.189527
[3668]	valid_0's l1: 0.189523
[3669]	valid_0's l1: 0.189524
[3670]	valid_0's l1: 0.189533
[3671]	valid_0's l1: 0.189531
[3672]	valid_0's l1: 0.189531
[3673]	valid_0's l1: 0.18953
[3674]	valid_0's l1: 0.189536
[3675]	valid_0's l1: 0.189535
[3676]	valid_0's l1: 0.189535
[3677]	valid_0's l1: 0.189533
[3678]	valid_0's l1: 0.189534
[3679]	valid_0's l1: 0.189533
[3680]	valid_0's l1: 0.189539
[3681]	valid_0's l1: 0.189532
[3682]	valid_0's l1: 0.189528
[3683]	valid_0's l1: 0.18953
[3684]	valid_0's l1: 0.189531
[3685]	valid_0's l1: 0.189532
[3686]	valid_0's l1: 0.189537
[3687]	valid

[3936]	valid_0's l1: 0.189364
[3937]	valid_0's l1: 0.189362
[3938]	valid_0's l1: 0.189363
[3939]	valid_0's l1: 0.189361
[3940]	valid_0's l1: 0.189359
[3941]	valid_0's l1: 0.189358
[3942]	valid_0's l1: 0.189361
[3943]	valid_0's l1: 0.18936
[3944]	valid_0's l1: 0.189359
[3945]	valid_0's l1: 0.189357
[3946]	valid_0's l1: 0.18936
[3947]	valid_0's l1: 0.189365
[3948]	valid_0's l1: 0.189363
[3949]	valid_0's l1: 0.18936
[3950]	valid_0's l1: 0.189361
[3951]	valid_0's l1: 0.189359
[3952]	valid_0's l1: 0.189359
[3953]	valid_0's l1: 0.189357
[3954]	valid_0's l1: 0.18936
[3955]	valid_0's l1: 0.189356
[3956]	valid_0's l1: 0.189353
[3957]	valid_0's l1: 0.189351
[3958]	valid_0's l1: 0.189352
[3959]	valid_0's l1: 0.189347
[3960]	valid_0's l1: 0.189344
[3961]	valid_0's l1: 0.189341
[3962]	valid_0's l1: 0.189339
[3963]	valid_0's l1: 0.189337
[3964]	valid_0's l1: 0.189339
[3965]	valid_0's l1: 0.189334
[3966]	valid_0's l1: 0.18934
[3967]	valid_0's l1: 0.189338
[3968]	valid_0's l1: 0.189333
[3969]	valid_0'

In [20]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

474799.4002208781

In [13]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

475592.8666470607

In [19]:
# Analisis de error...

In [49]:
df = X_val.copy()

df['precio'] = Y_val
df['precio_predicted'] = Y_pred
df['error'] = df.apply(lambda x: abs(x['precio']-x['precio_predicted']), axis=1).astype(int)

In [35]:
for columna in df.columns:
    print(columna)

antiguedad
habitaciones
garages
banos
metroscubiertos
metrostotales
idzona
lat
lng
gimnasio
usosmultiples
piscina
escuelascercanas
centroscomercialescercanos
metroscubiertos_alt
metrostotales_alt
metrostotales_confiables_alt
tipodepropiedad_le
es_Distrito Federal
es_Edo. de México
provincia_0_binary
provincia_1_binary
provincia_2_binary
provincia_3_binary
provincia_4_binary
provincia_5_binary
provincia_6_binary
ciudad_le
ciudad_0_binary
ciudad_1_binary
ciudad_2_binary
ciudad_3_binary
ciudad_4_binary
ciudad_5_binary
ciudad_6_binary
ciudad_7_binary
ciudad_top50_1_ohe
ciudad_top50_2_ohe
ciudad_top50_3_ohe
ciudad_top50_4_ohe
ciudad_top50_5_ohe
ciudad_top50_6_ohe
ciudad_top50_7_ohe
ciudad_top50_8_ohe
ciudad_top50_9_ohe
ciudad_top50_10_ohe
ciudad_top50_11_ohe
ciudad_top50_12_ohe
ciudad_top50_13_ohe
ciudad_top50_14_ohe
ciudad_top50_15_ohe
ciudad_top50_16_ohe
ciudad_top50_17_ohe
ciudad_top50_18_ohe
ciudad_top50_19_ohe
ciudad_top50_20_ohe
ciudad_top50_21_ohe
ciudad_top50_22_ohe
ciudad_top50_23_

In [53]:
df.groupby('aniomes')['error'].count()

aniomes
201201.0    154
201202.0    125
201203.0    112
201204.0    134
201205.0    177
201206.0    158
201207.0    205
201208.0    384
201209.0    251
201210.0    299
201211.0    155
201212.0    132
201301.0    189
201302.0    145
201303.0    168
201304.0    162
201305.0    213
201306.0    194
201307.0    280
201308.0    226
201309.0    307
201310.0    399
201311.0    400
201312.0    317
201401.0    251
201402.0    263
201403.0    317
201404.0    285
201405.0    314
201406.0    299
201407.0    295
201408.0    342
201409.0    385
201410.0    394
201411.0    468
201412.0    473
201501.0    463
201502.0    369
201503.0    383
201504.0    374
201505.0    367
201506.0    370
201507.0    391
201508.0    450
201509.0    508
201510.0    513
201511.0    491
201512.0    417
201601.0    545
201602.0    463
201603.0    514
201604.0    656
201605.0    533
201606.0    700
201607.0    609
201608.0    668
201609.0    639
201610.0    680
201611.0    622
Name: error, dtype: int64

In [55]:
df.groupby('es_Distrito Federal')['error'].agg('mean')

es_Distrito Federal
0    384327.814918
1    700878.962459
Name: error, dtype: float64

In [22]:
# respuesta para kaggle

In [23]:
ids = test_selected.index.values
X_test = test_selected.values

In [24]:
test_predict = reg.predict(X_test)

f = np.vectorize(math.exp)
test_predict = f(test_predict)

In [25]:
escribir_respuesta(ids, test_predict)

In [15]:
# en Y_pred estan los resultados predecidos y en Y_val los verdaderos. Analizaremos en donde nos
# estamos equivocando:

In [52]:
df = X_val.copy()

In [53]:
df['precio'] = Y_val

In [54]:
df['precio_predicted'] = Y_pred

In [56]:
df['error'] = df.apply(lambda x: abs(x['precio'] - x['precio_predicted']), axis=1)

In [61]:
df.loc[df['error'] < 100000]

Index(['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos',
       'metrostotales', 'idzona', 'lat', 'lng', 'gimnasio',
       ...
       'antiguedad_binning_2_3_ohe2', 'antiguedad_binning_2_4_ohe2',
       'usd_precio_promedio_mensual', 'usd_subio', 'volcan_cerca',
       'volcanes_cerca', 'idzona_mean_price', 'precio', 'precio_predicted',
       'error'],
      dtype='object', length=116)

In [72]:
df.groupby('dic2016')['error'].mean()

dic2016
0    469154.720787
1    558605.942472
Name: error, dtype: float64

In [71]:
df['dic2016'].value_counts()

0    42246
1     5754
Name: dic2016, dtype: int64

In [64]:
df.groupby('mes')['error'].mean()

mes
1     457792.162680
2     463169.944441
3     456001.596693
4     492672.976551
5     450777.269360
6     470895.831920
7     465610.862803
8     461132.009234
9     484955.481680
10    485033.971211
11    472161.605521
12    525364.342841
Name: error, dtype: float64

### Modelo: Regresion lineal

In [None]:
# ...

### Modelo: Regresion logistica

In [5]:
# ...

### Modelo: SVM

In [6]:
# ...

### Modelo: Decision Tree

In [7]:
# ...

### Modelo: RandomForest

In [4]:
from sklearn.ensemble import RandomForestRegressor

In [5]:
df = pre_processing.load_featured_appended_dataset()

In [10]:
features = feature_generation.get_features()

for feature in features:
    todas = []
    for opcion in features[feature]:
        valores = features[feature][opcion]
        for valor in valores:
            if (valor not in todas):
                todas.append(valor)
    features[feature]['todas'] = todas

In [11]:
# Como sabemos, este modelo no admite nulos, por lo que utilizaremos alguna tecnica de imputacion de los mismos
# para poder correrlo. Primero nos quedamos con las features mas importantes...
df = df[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos']\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][0]\
                       +features["provincia"][6]\
                       +features["ciudad"]['todas']\
                       +features["fecha"][18]\
                       +features["descripcion"][0]\
                       +features["metricas"][2]\
                       +features["habitaciones"][0]\
                       +features["antiguedad"][1]\
                       +features["extras"][2]\
                       +features["volcanes"][0]\
                       +features["idzona"][5]\
                       +["precio"]]

In [None]:
# Ahora utilizaremos una funcion definida en pre_processing para, utilizando otros regresores,
# predecir el valor que tienen las columnas con valores nulos.

df = pre_processing.fill_nans_lgb(df)

In [10]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [11]:
regressor = RandomForestRegressor(n_estimators = 100, random_state = seed, verbose=2, max_depth=10) 
regressor.fit(X_train, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.7s remaining:    0.0s


building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  5.1min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=7, verbose=2,
                      warm_start=False)

In [12]:
from sklearn import metrics

In [15]:
y_pred = regressor.predict(X_val)
print('MAE: ', int(metrics.mean_absolute_error(Y_val, y_pred)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  688548


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.6s finished


In [16]:
y_pred2 = regressor.predict(X_train)
print('MAE: ', int(metrics.mean_absolute_error(Y_train, y_pred2)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  663067


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.2s finished


In [20]:
names = train.columns.to_list()
print(sorted(zip(map(lambda x: round(x, 4), regressor.feature_importances_), names), reverse=True))

[(0.4881, 'metroscubiertos'), (0.2571, 'ciudad_le'), (0.0352, 'ciudad_muycara'), (0.0317, 'banos'), (0.0158, 'tipodepropiedad_1_pol'), (0.0141, 'dia'), (0.0129, 'precio_promedio_metrocubierto_mes'), (0.0125, 'antiguedad'), (0.0113, 'garages'), (0.0105, 'servicio'), (0.0096, 'es_Veracruz'), (0.0093, 'metroscubiertos_mean'), (0.009, 'precio'), (0.0085, 'intercept_pol'), (0.0069, 'tipodepropiedad_2_pol'), (0.0065, 'tipodepropiedad_0_pol'), (0.005, 'habitaciones'), (0.0042, 'aniomes'), (0.0033, 'tipodepropiedad_3_pol'), (0.0031, 'ciudad_barata'), (0.0025, 'es_apart'), (0.0024, 'tipodepropiedad_4_pol'), (0.002, 'tipodepropiedad_le'), (0.002, 'ciudad_cara'), (0.0019, 'tipodepropiedad_8_ohe'), (0.0017, 'lujo'), (0.0017, 'aniomes_scaled'), (0.0015, 'mes'), (0.0015, 'es_casa'), (0.0014, 'tipodepropiedad_7_pol'), (0.0014, 'hab_binning_1_ohe'), (0.0013, 'provincia_10_ohe'), (0.0013, 'gimnasio'), (0.0012, 'parrilla'), (0.0011, 'piscina'), (0.0011, 'es_Distrito Federal'), (0.001, 'hab_binning_7_ohe

### Modelo: XGBoost

_Generacion del dataset de train con sus features_

In [17]:
import xgboost
from sklearn.model_selection import GridSearchCV

In [18]:
train,test = load_featured_datasets()

In [19]:
train['precio'] = train['precio'].map(lambda x: math.log(x))

In [20]:
best_features = feature_selection.get_best_features_per_category()

In [21]:
features = feature_generation.get_features()

In [22]:
train_selected = train[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos']\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][0]\
                       +features["provincia"][6]\
                       +features["ciudad"][2]\
                       +features["fecha"][4]\
                       +features["descripcion"][0]\
                       +features["metricas"][2]\
                       +features["habitaciones"][0]\
                       +features["antiguedad"][1]\
                       +features["extras"][2]\
                       +features["volcanes"][0]\
                       +features["idzona"][0]\
                       +["precio"]]

test_selected = test[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos']\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][0]\
                       +features["provincia"][6]\
                       +features["ciudad"][2]\
                       +features["fecha"][4]\
                       +features["descripcion"][0]\
                       +features["metricas"][2]\
                       +features["habitaciones"][0]\
                       +features["antiguedad"][1]\
                       +features["extras"][2]\
                       +features["volcanes"][0]\
                       +features["idzona"][0]]

In [23]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [27]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [28]:
parametros = {
    'max_depth':[11,12,13,14,15],
    'n_estimators':[100,110,120,130,140],
    'learning_rate': [0.05,0.08,0.1,0.15,0.2,0.3],
    'subsample':[0.5,0.8,0.9,0.7],
    'min_child_weight':[5,10,15,20,30]
}

In [29]:
reg = xgboost.XGBRegressor(max_depth=17,n_estimators=240 ,learning_rate=0.06, verbosity=2,subsample=0.9, min_child_weight=10, n_jobs=2)
reg.fit(X_train,Y_train)

[20:09:07] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 56 extra nodes, 0 pruned nodes, max_depth=7
[20:09:09] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 66 extra nodes, 0 pruned nodes, max_depth=8
[20:09:10] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 64 extra nodes, 0 pruned nodes, max_depth=8
[20:09:11] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 66 extra nodes, 0 pruned nodes, max_depth=8
[20:09:12] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 68 extra nodes, 0 pruned nodes, max_depth=7
[20:09:13] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 70 extra nodes, 0 pruned nodes, max_depth=8
[20:09:14] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 78 extra nodes, 0 pruned nodes, max_depth=8
[20:09:15] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 80 extra nod

[20:10:52] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 7162 extra nodes, 0 pruned nodes, max_depth=17
[20:10:55] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6928 extra nodes, 0 pruned nodes, max_depth=17
[20:10:59] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 7462 extra nodes, 0 pruned nodes, max_depth=17
[20:11:01] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6518 extra nodes, 0 pruned nodes, max_depth=17
[20:11:05] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 7162 extra nodes, 0 pruned nodes, max_depth=17
[20:11:09] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 7204 extra nodes, 0 pruned nodes, max_depth=17
[20:11:11] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8340 extra nodes, 0 pruned nodes, max_depth=17
[20:11:13] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 

[20:15:54] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1830 extra nodes, 0 pruned nodes, max_depth=17
[20:15:59] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3068 extra nodes, 0 pruned nodes, max_depth=17
[20:16:03] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3698 extra nodes, 0 pruned nodes, max_depth=17
[20:16:08] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2018 extra nodes, 0 pruned nodes, max_depth=17
[20:16:12] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1414 extra nodes, 0 pruned nodes, max_depth=17
[20:16:17] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3068 extra nodes, 0 pruned nodes, max_depth=17
[20:16:21] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3336 extra nodes, 0 pruned nodes, max_depth=17
[20:16:27] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 

[20:20:23] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3606 extra nodes, 0 pruned nodes, max_depth=17
[20:20:25] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3286 extra nodes, 0 pruned nodes, max_depth=17
[20:20:27] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1394 extra nodes, 0 pruned nodes, max_depth=17
[20:20:29] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3344 extra nodes, 0 pruned nodes, max_depth=17
[20:20:31] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 3322 extra nodes, 0 pruned nodes, max_depth=17
[20:20:34] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4776 extra nodes, 0 pruned nodes, max_depth=17
[20:20:35] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1282 extra nodes, 0 pruned nodes, max_depth=17
[20:20:38] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.06, max_delta_step=0,
             max_depth=17, min_child_weight=10, missing=None, n_estimators=240,
             n_jobs=2, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=0.9, verbosity=2)

_Comprobacion contra el conjunto de validacion_

In [26]:
Y_pred = reg.predict(X_val)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val)
mean_absolute_error(Y_val,Y_pred)

473494.35926695546

In [32]:
# preparamos el csv de respuesta para kaggle

In [30]:
ids = test_selected.index.values
X_test = test_selected.values

In [31]:
test_predict = reg.predict(X_test)

f = np.vectorize(math.exp)
test_predict = f(test_predict)

In [33]:
escribir_respuesta(ids, test_predict)

### Modelo: CatBoost

In [None]:
#...

### Modelo: LightGBM

In [73]:
import lightgbm as lgb

In [74]:
train,test = load_featured_datasets()

In [75]:
features = feature_generation.get_features()

In [76]:
best_features = feature_selection.get_best_features_per_category()

In [98]:
features['metros']

{0: ['metroscubiertos_alt', 'metrostotales_alt'],
 1: ['metroscubiertos_alt',
  'metrostotales_alt',
  'metrostotales_confiables_alt'],
 2: ['metroscubiertos_i1', 'metrostotales_i1'],
 3: ['metroscubiertos_i1', 'metrostotales_i1', 'metrostotales_confiables_alt'],
 4: ['metroscubiertos_alt', 'metrostotales_i2'],
 5: ['metroscubiertos_alt',
  'metrostotales_i2',
  'metrostotales_confiables_alt']}

In [77]:
best_features

[('metros', 1),
 ('tipodepropiedad', 0),
 ('provincia', 6),
 ('ciudad', 2),
 ('fecha', 4),
 ('descripcion', 0),
 ('metricas', 2),
 ('habitaciones', 0),
 ('antiguedad', 1),
 ('extras', 2),
 ('volcanes', 0),
 ('idzona', 0)]

In [114]:
train_selected = train[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos']\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][0]\
                       +features["provincia"][6]\
                       +features["ciudad"][2]\
                       +features["fecha"][4]\
                       +features["descripcion"][0]\
                       +features["metricas"][2]\
                       +features["habitaciones"][0]\
                       +features["antiguedad"][1]\
                       +features["extras"][2]\
                       +features["volcanes"][0]\
                       +features["idzona"][0]\
                       +["precio"]]

In [115]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [116]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [117]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.2,
    'verbose': 0, 
    'early_stopping_round': 50}
n_estimators=10000

In [118]:
d_train = lgb.Dataset(X_train, label=Y_train)
d_valid = lgb.Dataset(X_val, label=Y_val)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)

[1]	valid_0's l1: 1.37848e+06
Training until validation scores don't improve for 50 rounds
[2]	valid_0's l1: 1.20026e+06
[3]	valid_0's l1: 1.06646e+06
[4]	valid_0's l1: 964527
[5]	valid_0's l1: 886962
[6]	valid_0's l1: 825819
[7]	valid_0's l1: 780005
[8]	valid_0's l1: 742121
[9]	valid_0's l1: 713473
[10]	valid_0's l1: 687841
[11]	valid_0's l1: 669070
[12]	valid_0's l1: 654806
[13]	valid_0's l1: 642552
[14]	valid_0's l1: 632085
[15]	valid_0's l1: 623036
[16]	valid_0's l1: 616622
[17]	valid_0's l1: 611148
[18]	valid_0's l1: 606440
[19]	valid_0's l1: 600293
[20]	valid_0's l1: 596400
[21]	valid_0's l1: 592895
[22]	valid_0's l1: 589279
[23]	valid_0's l1: 586698
[24]	valid_0's l1: 583480
[25]	valid_0's l1: 580945
[26]	valid_0's l1: 579340
[27]	valid_0's l1: 576913
[28]	valid_0's l1: 575237
[29]	valid_0's l1: 572096
[30]	valid_0's l1: 570468
[31]	valid_0's l1: 568829
[32]	valid_0's l1: 567254
[33]	valid_0's l1: 566310
[34]	valid_0's l1: 565160
[35]	valid_0's l1: 563258
[36]	valid_0's l1: 5621

[313]	valid_0's l1: 513172
[314]	valid_0's l1: 513125
[315]	valid_0's l1: 513092
[316]	valid_0's l1: 513154
[317]	valid_0's l1: 513044
[318]	valid_0's l1: 513021
[319]	valid_0's l1: 512951
[320]	valid_0's l1: 512956
[321]	valid_0's l1: 512850
[322]	valid_0's l1: 512802
[323]	valid_0's l1: 512664
[324]	valid_0's l1: 512586
[325]	valid_0's l1: 512481
[326]	valid_0's l1: 512538
[327]	valid_0's l1: 512486
[328]	valid_0's l1: 512340
[329]	valid_0's l1: 512216
[330]	valid_0's l1: 512170
[331]	valid_0's l1: 512064
[332]	valid_0's l1: 511962
[333]	valid_0's l1: 511925
[334]	valid_0's l1: 511863
[335]	valid_0's l1: 511841
[336]	valid_0's l1: 511708
[337]	valid_0's l1: 511644
[338]	valid_0's l1: 511596
[339]	valid_0's l1: 511572
[340]	valid_0's l1: 511551
[341]	valid_0's l1: 511514
[342]	valid_0's l1: 511462
[343]	valid_0's l1: 511476
[344]	valid_0's l1: 511431
[345]	valid_0's l1: 511429
[346]	valid_0's l1: 511443
[347]	valid_0's l1: 511366
[348]	valid_0's l1: 511306
[349]	valid_0's l1: 511244
[

[623]	valid_0's l1: 502016
[624]	valid_0's l1: 501982
[625]	valid_0's l1: 501985
[626]	valid_0's l1: 501975
[627]	valid_0's l1: 502002
[628]	valid_0's l1: 501962
[629]	valid_0's l1: 501937
[630]	valid_0's l1: 501924
[631]	valid_0's l1: 501931
[632]	valid_0's l1: 501862
[633]	valid_0's l1: 501834
[634]	valid_0's l1: 501766
[635]	valid_0's l1: 501745
[636]	valid_0's l1: 501754
[637]	valid_0's l1: 501773
[638]	valid_0's l1: 501770
[639]	valid_0's l1: 501770
[640]	valid_0's l1: 501752
[641]	valid_0's l1: 501748
[642]	valid_0's l1: 501765
[643]	valid_0's l1: 501754
[644]	valid_0's l1: 501685
[645]	valid_0's l1: 501644
[646]	valid_0's l1: 501590
[647]	valid_0's l1: 501591
[648]	valid_0's l1: 501581
[649]	valid_0's l1: 501612
[650]	valid_0's l1: 501543
[651]	valid_0's l1: 501540
[652]	valid_0's l1: 501484
[653]	valid_0's l1: 501469
[654]	valid_0's l1: 501456
[655]	valid_0's l1: 501450
[656]	valid_0's l1: 501455
[657]	valid_0's l1: 501469
[658]	valid_0's l1: 501405
[659]	valid_0's l1: 501433
[

[933]	valid_0's l1: 497524
[934]	valid_0's l1: 497537
[935]	valid_0's l1: 497519
[936]	valid_0's l1: 497482
[937]	valid_0's l1: 497475
[938]	valid_0's l1: 497485
[939]	valid_0's l1: 497456
[940]	valid_0's l1: 497468
[941]	valid_0's l1: 497450
[942]	valid_0's l1: 497469
[943]	valid_0's l1: 497474
[944]	valid_0's l1: 497489
[945]	valid_0's l1: 497451
[946]	valid_0's l1: 497457
[947]	valid_0's l1: 497434
[948]	valid_0's l1: 497449
[949]	valid_0's l1: 497436
[950]	valid_0's l1: 497355
[951]	valid_0's l1: 497334
[952]	valid_0's l1: 497332
[953]	valid_0's l1: 497336
[954]	valid_0's l1: 497291
[955]	valid_0's l1: 497268
[956]	valid_0's l1: 497305
[957]	valid_0's l1: 497286
[958]	valid_0's l1: 497280
[959]	valid_0's l1: 497245
[960]	valid_0's l1: 497257
[961]	valid_0's l1: 497253
[962]	valid_0's l1: 497228
[963]	valid_0's l1: 497214
[964]	valid_0's l1: 497177
[965]	valid_0's l1: 497152
[966]	valid_0's l1: 497171
[967]	valid_0's l1: 497128
[968]	valid_0's l1: 497134
[969]	valid_0's l1: 497099
[

[1230]	valid_0's l1: 494993
[1231]	valid_0's l1: 494972
[1232]	valid_0's l1: 494965
[1233]	valid_0's l1: 494977
[1234]	valid_0's l1: 494987
[1235]	valid_0's l1: 495000
[1236]	valid_0's l1: 494990
[1237]	valid_0's l1: 495010
[1238]	valid_0's l1: 495000
[1239]	valid_0's l1: 495021
[1240]	valid_0's l1: 495021
[1241]	valid_0's l1: 494993
[1242]	valid_0's l1: 494989
[1243]	valid_0's l1: 494952
[1244]	valid_0's l1: 494950
[1245]	valid_0's l1: 494952
[1246]	valid_0's l1: 494930
[1247]	valid_0's l1: 494900
[1248]	valid_0's l1: 494853
[1249]	valid_0's l1: 494854
[1250]	valid_0's l1: 494837
[1251]	valid_0's l1: 494848
[1252]	valid_0's l1: 494857
[1253]	valid_0's l1: 494868
[1254]	valid_0's l1: 494890
[1255]	valid_0's l1: 494880
[1256]	valid_0's l1: 494826
[1257]	valid_0's l1: 494789
[1258]	valid_0's l1: 494821
[1259]	valid_0's l1: 494785
[1260]	valid_0's l1: 494806
[1261]	valid_0's l1: 494812
[1262]	valid_0's l1: 494834
[1263]	valid_0's l1: 494836
[1264]	valid_0's l1: 494844
[1265]	valid_0's l1:

[1537]	valid_0's l1: 493350
[1538]	valid_0's l1: 493348
[1539]	valid_0's l1: 493358
[1540]	valid_0's l1: 493358
[1541]	valid_0's l1: 493375
[1542]	valid_0's l1: 493373
[1543]	valid_0's l1: 493371
[1544]	valid_0's l1: 493375
[1545]	valid_0's l1: 493378
[1546]	valid_0's l1: 493341
[1547]	valid_0's l1: 493364
[1548]	valid_0's l1: 493344
[1549]	valid_0's l1: 493331
[1550]	valid_0's l1: 493295
[1551]	valid_0's l1: 493269
[1552]	valid_0's l1: 493284
[1553]	valid_0's l1: 493272
[1554]	valid_0's l1: 493272
[1555]	valid_0's l1: 493305
[1556]	valid_0's l1: 493310
[1557]	valid_0's l1: 493321
[1558]	valid_0's l1: 493312
[1559]	valid_0's l1: 493321
[1560]	valid_0's l1: 493323
[1561]	valid_0's l1: 493287
[1562]	valid_0's l1: 493266
[1563]	valid_0's l1: 493243
[1564]	valid_0's l1: 493238
[1565]	valid_0's l1: 493239
[1566]	valid_0's l1: 493222
[1567]	valid_0's l1: 493172
[1568]	valid_0's l1: 493163
[1569]	valid_0's l1: 493123
[1570]	valid_0's l1: 493062
[1571]	valid_0's l1: 493061
[1572]	valid_0's l1:

[1832]	valid_0's l1: 491687
[1833]	valid_0's l1: 491692
[1834]	valid_0's l1: 491694
[1835]	valid_0's l1: 491679
[1836]	valid_0's l1: 491698
[1837]	valid_0's l1: 491671
[1838]	valid_0's l1: 491648
[1839]	valid_0's l1: 491662
[1840]	valid_0's l1: 491663
[1841]	valid_0's l1: 491659
[1842]	valid_0's l1: 491655
[1843]	valid_0's l1: 491636
[1844]	valid_0's l1: 491618
[1845]	valid_0's l1: 491602
[1846]	valid_0's l1: 491612
[1847]	valid_0's l1: 491623
[1848]	valid_0's l1: 491601
[1849]	valid_0's l1: 491599
[1850]	valid_0's l1: 491595
[1851]	valid_0's l1: 491606
[1852]	valid_0's l1: 491590
[1853]	valid_0's l1: 491588
[1854]	valid_0's l1: 491579
[1855]	valid_0's l1: 491599
[1856]	valid_0's l1: 491572
[1857]	valid_0's l1: 491566
[1858]	valid_0's l1: 491548
[1859]	valid_0's l1: 491553
[1860]	valid_0's l1: 491549
[1861]	valid_0's l1: 491544
[1862]	valid_0's l1: 491554
[1863]	valid_0's l1: 491565
[1864]	valid_0's l1: 491558
[1865]	valid_0's l1: 491568
[1866]	valid_0's l1: 491575
[1867]	valid_0's l1:

In [22]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

487268.28239609225

In [331]:
# preparamos el csv de respuesta para kaggle

In [46]:
ids = test_selected.index.values
X_test = test_selected.values

In [47]:
test_predict = reg.predict(X_test)
escribir_respuesta(ids, test_predict)

In [None]:
# best params so far
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 200}
n_estimators=20000

### Modelo: KNN

### Modelo: Neural Networks

In [3]:
# ...