# TP2 - Organización de Datos
#### Main notebook

<hr>

### Notebooks used:

- ***pre_processing:*** notebook for the initial handling of the dataframes.
- ***feature_generation:*** first stage of the pipeline. This notebook generates new features, from which the best ones are later selected for each model.
- ***feature_selection:*** second stage, which looks for the most important features, that is, those that contribute the most information.
- ***parameter_tuning:*** third stage, the notebook where the parameters of each model are tuned.
- ***predict:*** finally, once the best parameters and features for each model have been obtained, this notebook generates the CSV with the final predictions for the model it is given.

<hr>


In [1]:
import pandas as pd
import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 42

In [2]:
import nbimporter

import pre_processing
import feature_generation
import feature_selection
import parameter_tuning
import predict

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb
Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from predict.ipynb


In [3]:
def escribir_respuesta(ids,predicciones):
    with open("respuesta.csv",'w') as archivo:
        archivo.write("id,target\n")
        for i in range(len(ids)):
            linea = f"{int(ids[i])},{predicciones[i]}"
            archivo.write(f"{linea}\n")
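
A minimal sketch of the same writer using Python's standard `csv` module, which takes care of field quoting. The name `write_submission` and its `path` parameter are assumptions made for illustration; they are not part of the original notebook.

```python
import csv

def write_submission(ids, predictions, path="respuesta.csv"):
    """Write an `id,target` CSV with one row per prediction (hypothetical helper)."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as handle:
        # lineterminator="\n" matches the line endings of the hand-written version.
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["id", "target"])
        for row_id, prediction in zip(ids, predictions):
            writer.writerow([int(row_id), prediction])
```

Iterating with `zip` avoids positional indexing, so the helper works with lists, NumPy arrays, or any pair of equally sized sequences.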

<hr>

# Results obtained

### LightGBM

In [4]:
import lightgbm as lgb
from datetime import datetime

In [5]:
# best
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14,
    'num_leaves': 120,
    'learning_rate': 0.02,
    'verbose': 0, 
    'early_stopping_round': 1000}
n_estimators=99999999

In [6]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 7,
    'num_leaves': 40,
    'learning_rate': 0.1,
    'verbose': 0, 
    'early_stopping_round': 100}
n_estimators=5000

In [7]:
train,test = pre_processing.load_featured_datasets()

KeyboardInterrupt: 

In [None]:
train['precio'] = train['precio'].map(lambda x: math.log(x))
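
The target is trained in log space and mapped back with `exp` before scoring, so the `l1` values in the training logs and the final MAE figures live on different scales. The sketch below uses made-up prices to show the round trip and both error measures; `mae` is a small helper defined here for illustration, not the scikit-learn function.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error of two equally sized sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prices, for illustration only.
prices = [500_000.0, 1_200_000.0, 3_000_000.0]
log_prices = [math.log(p) for p in prices]

# Pretend the model over-predicts every log price by 0.18.
log_predictions = [lp + 0.18 for lp in log_prices]
predictions = [math.exp(lp) for lp in log_predictions]

log_mae = mae(log_prices, log_predictions)   # error in log space
price_mae = mae(prices, predictions)         # error in price units

# A constant offset in log space is a constant relative error in price space.
relative_errors = [p_hat / p - 1 for p, p_hat in zip(prices, predictions)]
```

Under this reading, an `l1` of about 0.18 in log space corresponds roughly to a typical multiplicative error of `exp(0.18)`, a bit under 20%, which is why the absolute MAE grows with the price level.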

In [None]:
features = feature_generation.get_features()

In [None]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

### Trying a few things...

In [None]:
train_selected.drop(['precio_bajo', 'precio_alto'], axis=1, inplace=True)
test_selected.drop(['precio_bajo', 'precio_alto'], axis=1, inplace=True)

In [None]:
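# 'precio_bajo' may already have been dropped in the previous cell; ignore it if missing
X = train_selected.drop(['precio', 'precio_bajo'], axis=1, errors='ignore')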
Y = train_selected['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    # parameters to control overfitting
    'max_depth': 14,
    'num_leaves': 50,
    'min_data_in_leaf': 1000,
    'min_split_gain': 0.1,
    'min_child_weight': 5,
    #'lambda_l2': 0.5,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.62,
    'bagging_freq': 5,
    # general parameters
    'learning_rate': 0.02,
    'verbose': 0, 
    'early_stopping_round': 1000}
n_estimators=9999999
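
With `n_estimators` set this high, the number of trees is in practice decided by `early_stopping_round`. The sketch below illustrates the stopping rule on a plain list of validation scores; `best_iteration_with_patience` is a hypothetical helper written for this explanation and is not LightGBM's implementation.

```python
def best_iteration_with_patience(valid_scores, patience):
    """Return (best_round, rounds_run) for a lower-is-better metric.

    Training stops once `patience` consecutive rounds fail to improve
    on the best validation score seen so far.
    """
    best_score = float("inf")
    best_round = 0
    rounds_run = 0
    for round_number, score in enumerate(valid_scores, start=1):
        rounds_run = round_number
        if score < best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            break
    return best_round, rounds_run

# Made-up validation scores, one per boosting round.
scores = [0.30, 0.25, 0.22, 0.23, 0.24, 0.225, 0.21]
best_round, rounds_run = best_iteration_with_patience(scores, patience=3)
```

In this toy run the loop stops before reaching the last score, which illustrates the trade-off: a small patience saves time but can miss a late improvement, while a large one (such as the 1000 rounds used here) keeps training through long plateaus.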

In [None]:
%%time
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid, d_train]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=500)



Training until validation scores don't improve for 1000 rounds
[500]	training's l1: 0.206903	valid_0's l1: 0.210969
[1000]	training's l1: 0.197844	valid_0's l1: 0.203809
[1500]	training's l1: 0.193155	valid_0's l1: 0.200697
[2000]	training's l1: 0.190029	valid_0's l1: 0.198908
[2500]	training's l1: 0.187699	valid_0's l1: 0.197834
[3000]	training's l1: 0.185846	valid_0's l1: 0.19695
[3500]	training's l1: 0.18429	valid_0's l1: 0.196323
[4000]	training's l1: 0.183069	valid_0's l1: 0.195822
[4500]	training's l1: 0.181981	valid_0's l1: 0.195422
[5000]	training's l1: 0.181058	valid_0's l1: 0.195124
[5500]	training's l1: 0.180279	valid_0's l1: 0.194862
[6000]	training's l1: 0.179512	valid_0's l1: 0.194643
[6500]	training's l1: 0.178844	valid_0's l1: 0.194455
[7000]	training's l1: 0.178197	valid_0's l1: 0.194246
[7500]	training's l1: 0.1776	valid_0's l1: 0.194025
[8000]	training's l1: 0.177063	valid_0's l1: 0.193878
[8500]	training's l1: 0.176543	valid_0's l1: 0.193736
[9000]	training's l1: 0.

### Partial results obtained (without parameter tuning)

**Using both features**:
- Training with the real data: **training's l1: 0.10173 | valid_0's l1: 0.166474**
- Training with the prediction: **to be tested**

**Using only precio_alto**:
- Training with the real data: **training's l1: 0.109293 | valid_0's l1: 0.170321**
- Training with the prediction: **to be tested**

**Using only precio_bajo**:
- Training with the real data: **training's l1: 0.113884 | valid_0's l1: 0.18273**
- Training with the prediction: **to be tested**

**Without using the features**:
- **training's l1: 0.112031 | valid_0's l1: 0.186319**

In [60]:
# with 0.05 of the data

In [17]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

472817.8933678778

In [18]:
Y_pred = reg.predict(X_train.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_train = f(Y_train.values)
mean_absolute_error(Y_train,Y_pred)

419017.8999970699

In [59]:
# with 0.2 of the data

In [51]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

418138.24884530925

In [52]:
Y_pred = reg.predict(X_train.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_train = f(Y_train.values)
mean_absolute_error(Y_train,Y_pred)

390797.02341466444

In [53]:
# Prepare the submission for Kaggle

In [61]:
ids = test_selected.index.values
X_test = test_selected.values
test_predict = reg.predict(X_test)
f = np.vectorize(math.exp)
test_predict = f(test_predict)
escribir_respuesta(ids, test_predict)

# Best submission to date -- 19/11

In [4]:
import lightgbm as lgb
from datetime import datetime

In [5]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14,
    'num_leaves': 120,
    'learning_rate': 0.02,
    'verbose': 0, 
    'early_stopping_round': 1000}
n_estimators=99999999

In [6]:
train,test = pre_processing.load_featured_datasets()

In [7]:
train['precio'] = train['precio'].map(lambda x: math.log(x))

In [8]:
train_selected = feature_selection.get_selected_dataframe(train)

In [9]:
X = train_selected.drop('precio', axis=1)
Y = train_selected['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1, random_state=seed)
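
Passing a fixed `random_state` makes the hold-out split repeatable across runs. A minimal sketch of that idea in plain Python follows; `holdout_split` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling or rounding.

```python
import random

def holdout_split(rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed and hold out a fraction of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(rows) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

# Made-up data: 100 rows identified by their index.
rows = list(range(100))
train_rows, val_rows = holdout_split(rows, test_size=0.1, seed=42)
```

Calling the helper again with the same seed returns the same partition, which is what makes validation scores from different parameter sets comparable.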

In [10]:
print(datetime.now())

2019-11-19 23:15:12.446117


In [11]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=500)



Training until validation scores don't improve for 1000 rounds
[500]	valid_0's l1: 0.197949
[1000]	valid_0's l1: 0.192564
[1500]	valid_0's l1: 0.189929
[2000]	valid_0's l1: 0.188237
[2500]	valid_0's l1: 0.186912
[3000]	valid_0's l1: 0.185966
[3500]	valid_0's l1: 0.185192
[4000]	valid_0's l1: 0.184344
[4500]	valid_0's l1: 0.183718
[5000]	valid_0's l1: 0.183126
[5500]	valid_0's l1: 0.182645
[6000]	valid_0's l1: 0.182254
[6500]	valid_0's l1: 0.181833
[7000]	valid_0's l1: 0.181479
[7500]	valid_0's l1: 0.181153
[8000]	valid_0's l1: 0.180848
[8500]	valid_0's l1: 0.1806
[9000]	valid_0's l1: 0.180359
[9500]	valid_0's l1: 0.180107
[10000]	valid_0's l1: 0.179954
[10500]	valid_0's l1: 0.179728
[11000]	valid_0's l1: 0.179515
[11500]	valid_0's l1: 0.179334
[12000]	valid_0's l1: 0.179182
[12500]	valid_0's l1: 0.17907
[13000]	valid_0's l1: 0.178926
[13500]	valid_0's l1: 0.178839
[14000]	valid_0's l1: 0.178763
[14500]	valid_0's l1: 0.178636
[15000]	valid_0's l1: 0.178544
[15500]	valid_0's l1: 0.178446

In [12]:
print(datetime.now())

2019-11-19 23:32:58.995210


In [13]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

449620.60086473916

In [14]:
# Prepare the submission for Kaggle

In [17]:
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [18]:
ids = test_selected.index.values
X_test = test_selected.values
test_predict = reg.predict(X_test)
f = np.vectorize(math.exp)
test_predict = f(test_predict)
escribir_respuesta(ids, test_predict)

# TESTING AREA

In [7]:
import lightgbm as lgb
from datetime import datetime

In [8]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    # Prevent overfitting:
    'max_bin': 100,
    'num_leaves': 120,
    'min_data_in_leaf':5000,
    'bagging_fraction':0.605,
    'bagging_freq': 3,
    'feature_fraction': 0.7,
    'min_gain_to_split':0.1,
    # General parameters:
    'max_depth': 12,
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 1000}
n_estimators=99999999

In [9]:
train,test = pre_processing.load_featured_datasets()

In [10]:
train['precio'] = train['precio'].map(lambda x: math.log(x))

In [11]:
train_selected = feature_selection.get_selected_dataframe(train)

In [12]:
X = train_selected.drop('precio', axis=1)
Y = train_selected['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1, random_state=seed)

In [13]:
print(datetime.now())

2019-11-27 19:35:24.278748


In [14]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid, d_train]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=500)



Training until validation scores don't improve for 1000 rounds
[500]	training's l1: 0.213136	valid_0's l1: 0.214525
[1000]	training's l1: 0.20647	valid_0's l1: 0.208747
[1500]	training's l1: 0.2028	valid_0's l1: 0.205831
[2000]	training's l1: 0.200274	valid_0's l1: 0.203974
[2500]	training's l1: 0.19836	valid_0's l1: 0.202622
[3000]	training's l1: 0.196984	valid_0's l1: 0.201671
[3500]	training's l1: 0.195789	valid_0's l1: 0.200955
[4000]	training's l1: 0.194816	valid_0's l1: 0.200341
[4500]	training's l1: 0.193961	valid_0's l1: 0.199806
[5000]	training's l1: 0.193248	valid_0's l1: 0.1994
[5500]	training's l1: 0.192529	valid_0's l1: 0.199025
[6000]	training's l1: 0.191979	valid_0's l1: 0.19869
[6500]	training's l1: 0.191367	valid_0's l1: 0.19846
[7000]	training's l1: 0.190866	valid_0's l1: 0.198202
[7500]	training's l1: 0.190411	valid_0's l1: 0.197928
[8000]	training's l1: 0.189964	valid_0's l1: 0.197746
[8500]	training's l1: 0.18955	valid_0's l1: 0.197565
[9000]	training's l1: 0.18916

In [15]:
print(datetime.now())

2019-11-27 19:44:47.102832


In [None]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

In [16]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

488561.10936908884

In [None]:
Y_pred = reg.predict(X_train.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_train = f(Y_train.values)
mean_absolute_error(Y_train,Y_pred)

In [17]:
# Best results to date:

In [13]:
Y_pred = reg.predict(X_val.values)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val.values)
mean_absolute_error(Y_val,Y_pred)

449620.60086473916

In [26]:
# Prepare the submission for Kaggle

In [17]:
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [18]:
ids = test_selected.index.values
X_test = test_selected.values
test_predict = reg.predict(X_test)
f = np.vectorize(math.exp)
test_predict = f(test_predict)
escribir_respuesta(ids, test_predict)

### Model: Linear regression

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

In [6]:
train,test = pre_processing.load_featured_datasets()

In [7]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [8]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [12]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [14]:
X = imp.fit_transform(X)
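
Mean imputation replaces each missing value with the average of the observed values in the same column. The sketch below shows the idea in plain Python on made-up data; `impute_column_means` is a hypothetical helper, and it assumes every column has at least one observed value.

```python
import math

def impute_column_means(rows):
    """Replace NaN entries with the mean of the non-NaN values in the same column."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        # Assumption: each column has at least one observed (non-NaN) value.
        means.append(sum(observed) / len(observed))
    return [
        [means[col] if math.isnan(value) else value for col, value in enumerate(row)]
        for row in rows
    ]

nan = float("nan")
X_demo = [[1.0, 10.0], [nan, 20.0], [3.0, nan]]
X_filled = impute_column_means(X_demo)
```

In the cells above the imputer is fitted on all of `X` before the split, so validation rows contribute to the column means; fitting it on `X_train` only and then transforming `X_val` would avoid that.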

In [15]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [16]:
reg = LinearRegression().fit(X_train,Y_train)

In [17]:
Y_predic = reg.predict(X_val)

In [20]:
mean_absolute_error(Y_predic,Y_val)

736737.7129180954

In [21]:
# The prediction is not very good.

### Model: Logistic regression

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

In [5]:
train,test = pre_processing.load_featured_datasets()

In [6]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [7]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [8]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [9]:
X = imp.fit_transform(X)

In [10]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [11]:
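# Raises MemoryError. A likely cause: LogisticRegression is a classifier, so every
# distinct value of 'precio' is treated as a separate class.
reg = LogisticRegression(random_state=0, solver='sag', multi_class='auto').fit(X_train, Y_train)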

MemoryError: 

### Model: SVM

### Model: Decision Tree

In [7]:
# ...

### Model: RandomForest

In [4]:
from sklearn.ensemble import RandomForestRegressor

In [5]:
df = pre_processing.load_featured_appended_dataset()

In [10]:
features = feature_generation.get_features()

for feature in features:
    todas = []
    for opcion in features[feature]:
        valores = features[feature][opcion]
        for valor in valores:
            if (valor not in todas):
                todas.append(valor)
    features[feature]['todas'] = todas
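
The loop above builds, for each category, the union of the columns of all its options while preserving first-seen order. A sketch of the same idea using `dict.fromkeys`, which avoids the repeated `not in` scan of a list; `merge_unique` and the sample dictionary are hypothetical, and the assumed shape is option id mapped to a list of column names.

```python
def merge_unique(option_lists):
    """Concatenate lists, keeping the first occurrence of each value in order."""
    merged = {}
    for values in option_lists:
        # Dictionaries keep insertion order, so existing keys stay in place.
        merged.update(dict.fromkeys(values))
    return list(merged)

# Made-up structure mimicking one category of `features`.
demo_category = {0: ["a", "b"], 1: ["a", "b", "c"], 2: ["d", "a"]}
all_columns = merge_unique(demo_category.values())
```

The result can then be stored under the `'todas'` key exactly as the loop does.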

In [11]:
# As we know, this model does not accept nulls, so we will use some imputation technique
# to be able to run it. First we keep the most important features...
df = df[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos']\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][0]\
                       +features["provincia"][6]\
                       +features["ciudad"]['todas']\
                       +features["fecha"][18]\
                       +features["descripcion"][0]\
                       +features["metricas"][2]\
                       +features["habitaciones"][0]\
                       +features["antiguedad"][1]\
                       +features["extras"][2]\
                       +features["volcanes"][0]\
                       +features["idzona"][5]\
                       +["precio"]]

In [10]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [11]:
regressor = RandomForestRegressor(n_estimators = 100, random_state = seed, verbose=2, max_depth=10) 
regressor.fit(X_train, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.7s remaining:    0.0s


building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  5.1min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=7, verbose=2,
                      warm_start=False)

In [12]:
from sklearn import metrics

In [15]:
y_pred = regressor.predict(X_val)
print('MAE: ', int(metrics.mean_absolute_error(Y_val, y_pred)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  688548


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.6s finished


In [16]:
y_pred2 = regressor.predict(X_train)
print('MAE: ', int(metrics.mean_absolute_error(Y_train, y_pred2)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  663067


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.2s finished


In [20]:
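# feature_importances_ follows the column order of X, which excludes 'precio'
names = train.drop('precio', axis=1).columns.to_list()
print(sorted(zip(map(lambda x: round(x, 4), regressor.feature_importances_), names), reverse=True))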

[(0.4881, 'metroscubiertos'), (0.2571, 'ciudad_le'), (0.0352, 'ciudad_muycara'), (0.0317, 'banos'), (0.0158, 'tipodepropiedad_1_pol'), (0.0141, 'dia'), (0.0129, 'precio_promedio_metrocubierto_mes'), (0.0125, 'antiguedad'), (0.0113, 'garages'), (0.0105, 'servicio'), (0.0096, 'es_Veracruz'), (0.0093, 'metroscubiertos_mean'), (0.009, 'precio'), (0.0085, 'intercept_pol'), (0.0069, 'tipodepropiedad_2_pol'), (0.0065, 'tipodepropiedad_0_pol'), (0.005, 'habitaciones'), (0.0042, 'aniomes'), (0.0033, 'tipodepropiedad_3_pol'), (0.0031, 'ciudad_barata'), (0.0025, 'es_apart'), (0.0024, 'tipodepropiedad_4_pol'), (0.002, 'tipodepropiedad_le'), (0.002, 'ciudad_cara'), (0.0019, 'tipodepropiedad_8_ohe'), (0.0017, 'lujo'), (0.0017, 'aniomes_scaled'), (0.0015, 'mes'), (0.0015, 'es_casa'), (0.0014, 'tipodepropiedad_7_pol'), (0.0014, 'hab_binning_1_ohe'), (0.0013, 'provincia_10_ohe'), (0.0013, 'gimnasio'), (0.0012, 'parrilla'), (0.0011, 'piscina'), (0.0011, 'es_Distrito Federal'), (0.001, 'hab_binning_7_ohe

### Model: XGBoost

_Generating the training dataset with its features_

In [13]:
import xgboost
from sklearn.model_selection import GridSearchCV

In [37]:
train,test = pre_processing.load_featured_datasets()

In [38]:
train['precio'] = train['precio'].map(lambda x: math.log(x))

In [39]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [40]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [26]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [41]:
reg = xgboost.XGBRegressor()

In [42]:
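reg = xgboost.XGBRegressor(max_depth=14, n_estimators=140, learning_rate=0.1, verbosity=2, subsample=0.9, min_child_weight=15, n_jobs=4)
# Fit on the full training set for the final submission. X contains the rows of X_val,
# so a validation score computed with this fitted model would be optimistic.
reg.fit(X,Y)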

[10:53:53] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 76 extra nodes, 0 pruned nodes, max_depth=8
[10:53:53] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 74 extra nodes, 0 pruned nodes, max_depth=8
[10:53:54] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 86 extra nodes, 0 pruned nodes, max_depth=8
[10:53:55] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 82 extra nodes, 0 pruned nodes, max_depth=7
[10:53:55] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 92 extra nodes, 0 pruned nodes, max_depth=8
[10:53:56] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 94 extra nodes, 0 pruned nodes, max_depth=9
[10:53:58] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=8
[10:53:59] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra n

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=14, min_child_weight=15, missing=None, n_estimators=140,
             n_jobs=4, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=0.9, verbosity=2)

_Checking against the validation set_

In [36]:
Y_pred = reg.predict(X_val)

f = np.vectorize(math.exp)
Y_pred = f(Y_pred)
Y_val = f(Y_val)
mean_absolute_error(Y_val,Y_pred)

312252.35149328614

In [43]:
# prepare the submission CSV for Kaggle

In [44]:
ids = test_selected.index.values
X_test = test_selected.values

In [45]:
test_predict = reg.predict(X_test)

f = np.vectorize(math.exp)
test_predict = f(test_predict)

In [46]:
escribir_respuesta(ids, test_predict)

### Model: CatBoost

In [None]:
#...

### Model: LightGBM

In [4]:
import lightgbm as lgb

In [5]:
train,test = pre_processing.load_featured_datasets()

In [75]:
features = feature_generation.get_features()

In [76]:
best_features = feature_selection.get_best_features_per_category()

In [98]:
features['metros']

{0: ['metroscubiertos_alt', 'metrostotales_alt'],
 1: ['metroscubiertos_alt',
  'metrostotales_alt',
  'metrostotales_confiables_alt'],
 2: ['metroscubiertos_i1', 'metrostotales_i1'],
 3: ['metroscubiertos_i1', 'metrostotales_i1', 'metrostotales_confiables_alt'],
 4: ['metroscubiertos_alt', 'metrostotales_i2'],
 5: ['metroscubiertos_alt',
  'metrostotales_i2',
  'metrostotales_confiables_alt']}

In [77]:
best_features

[('metros', 1),
 ('tipodepropiedad', 0),
 ('provincia', 6),
 ('ciudad', 2),
 ('fecha', 4),
 ('descripcion', 0),
 ('metricas', 2),
 ('habitaciones', 0),
 ('antiguedad', 1),
 ('extras', 2),
 ('volcanes', 0),
 ('idzona', 0)]

In [114]:
train_selected = train[['antiguedad', 'habitaciones', 'garages', 'banos', 'metroscubiertos', 'metrostotales',
                        'idzona', 'lat', 'lng', 'gimnasio', 'usosmultiples', 'piscina', 'escuelascercanas',
                        'centroscomercialescercanos']\
                       +features["metros"][1]\
                       +features["tipodepropiedad"][0]\
                       +features["provincia"][6]\
                       +features["ciudad"][2]\
                       +features["fecha"][4]\
                       +features["descripcion"][0]\
                       +features["metricas"][2]\
                       +features["habitaciones"][0]\
                       +features["antiguedad"][1]\
                       +features["extras"][2]\
                       +features["volcanes"][0]\
                       +features["idzona"][0]\
                       +["precio"]]

In [115]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [116]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [117]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.2,
    'verbose': 0, 
    'early_stopping_round': 50}
n_estimators=10000

In [118]:
d_train = lgb.Dataset(X_train, label=Y_train)
d_valid = lgb.Dataset(X_val, label=Y_val)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)

[1]	valid_0's l1: 1.37848e+06
Training until validation scores don't improve for 50 rounds
[2]	valid_0's l1: 1.20026e+06
[3]	valid_0's l1: 1.06646e+06
[4]	valid_0's l1: 964527
[5]	valid_0's l1: 886962
[6]	valid_0's l1: 825819
[7]	valid_0's l1: 780005
[8]	valid_0's l1: 742121
[9]	valid_0's l1: 713473
[10]	valid_0's l1: 687841
[11]	valid_0's l1: 669070
[12]	valid_0's l1: 654806
[13]	valid_0's l1: 642552
[14]	valid_0's l1: 632085
[15]	valid_0's l1: 623036
[16]	valid_0's l1: 616622
[17]	valid_0's l1: 611148
[18]	valid_0's l1: 606440
[19]	valid_0's l1: 600293
[20]	valid_0's l1: 596400
[21]	valid_0's l1: 592895
[22]	valid_0's l1: 589279
[23]	valid_0's l1: 586698
[24]	valid_0's l1: 583480
[25]	valid_0's l1: 580945
[26]	valid_0's l1: 579340
[27]	valid_0's l1: 576913
[28]	valid_0's l1: 575237
[29]	valid_0's l1: 572096
[30]	valid_0's l1: 570468
[31]	valid_0's l1: 568829
[32]	valid_0's l1: 567254
[33]	valid_0's l1: 566310
[34]	valid_0's l1: 565160
[35]	valid_0's l1: 563258
[36]	valid_0's l1: 5621

In [22]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

487268.28239609225

In [331]:
# prepare the submission CSV for Kaggle

In [46]:
ids = test_selected.index.values
X_test = test_selected.values

In [47]:
test_predict = reg.predict(X_test)
escribir_respuesta(ids, test_predict)

In [None]:
# best params so far
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 200}
n_estimators=20000

### Model: KNN

In [26]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer

In [16]:
train,test = pre_processing.load_featured_datasets()

In [17]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [18]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [27]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [28]:
X = imp.fit_transform(X)

In [29]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1, random_state=seed)

In [51]:
reg = KNeighborsRegressor(n_neighbors=10, algorithm='kd_tree', metric='minkowski', p=2)

In [52]:
reg.fit(X_train,Y_train)

KNeighborsRegressor(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='uniform')
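
With `metric='minkowski'`, `p=2` and uniform weights, the regressor predicts the plain average of the targets of the nearest training points under the Euclidean distance. A brute-force sketch of that rule on made-up data follows; `knn_predict` is a hypothetical helper and does not use a k-d tree.

```python
import math

def knn_predict(train_X, train_y, query, k=10):
    """Average the targets of the k training points closest to `query`
    under the Euclidean distance (Minkowski with p=2), with uniform weights."""
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Made-up training set: three nearby points and one far away.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
train_y = [100.0, 200.0, 300.0, 9000.0]
prediction = knn_predict(train_X, train_y, query=[0.1, 0.1], k=3)
```

Because the prediction depends only on distances, features with large numeric ranges dominate the neighbour search. In this section `X` is imputed but not scaled, which may contribute to the high MAE reported below.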

In [53]:
Y_pred = reg.predict(X_val)

In [54]:
mean_absolute_error(Y_val,Y_pred)

809110.8316791666

### Model: Neural Networks

In [63]:
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

In [57]:
train,test = pre_processing.load_featured_datasets()

In [58]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [59]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [68]:
## Handling nulls and scaling the data for the neural network

In [60]:
scaler = MinMaxScaler()

In [61]:
X = scaler.fit_transform(X)
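
Min-max scaling maps each column to the range [0, 1] using its observed minimum and maximum. scikit-learn's `MinMaxScaler` disregards NaNs when fitting and keeps them when transforming, which is why scaling can run before imputation here. A plain-Python sketch of the same behaviour for a single column follows; `min_max_scale` is a hypothetical helper.

```python
import math

def min_max_scale(column):
    """Scale values to [0, 1] using the min and max of the observed (non-NaN)
    entries; NaN entries are left as NaN."""
    observed = [v for v in column if not math.isnan(v)]
    low, high = min(observed), max(observed)
    span = high - low
    if span == 0:
        # Constant column: map every observed value to 0.0.
        return [v if math.isnan(v) else 0.0 for v in column]
    return [v if math.isnan(v) else (v - low) / span for v in column]

# Made-up column with one missing value.
scaled = min_max_scale([50.0, float("nan"), 100.0, 150.0])
```

The missing entry stays missing after scaling, so a later mean imputation fills it with the mean of the already scaled column.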

In [64]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [65]:
X = imp.fit_transform(X)

In [67]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1, random_state=seed)

In [74]:
reg = MLPRegressor(hidden_layer_sizes=(6,),activation='relu',solver='adam',learning_rate='adaptive',max_iter=1000,
            learning_rate_init=0.01,alpha=0.01, verbose = True)

In [75]:
reg.fit(X_train,Y_train)

Iteration 1, loss = 5479174818011.77636719
Iteration 2, loss = 5314735544892.87109375
Iteration 3, loss = 5038288825323.17285156
Iteration 4, loss = 4690582949306.25097656
Iteration 5, loss = 4298947223866.83154297
Iteration 6, loss = 3885482613340.61230469
Iteration 7, loss = 3469673774375.79101562
Iteration 8, loss = 3070898556496.04150391
Iteration 9, loss = 2706648010594.15332031
Iteration 10, loss = 2393534030732.16455078
Iteration 11, loss = 2143240739350.12890625
Iteration 12, loss = 1961527643323.32617188
Iteration 13, loss = 1843553682394.23071289
Iteration 14, loss = 1771439218489.66186523
Iteration 15, loss = 1721304165548.47656250
Iteration 16, loss = 1677377910150.12377930
Iteration 17, loss = 1634672471379.27929688
Iteration 18, loss = 1592086428120.01684570
Iteration 19, loss = 1549346120525.87890625
Iteration 20, loss = 1506515828888.14965820
Iteration 21, loss = 1463462610477.21337891
Iteration 22, loss = 1420088647230.22167969
Iteration 23, loss = 1376657569487.944580



MLPRegressor(activation='relu', alpha=0.01, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(6,), learning_rate='adaptive',
             learning_rate_init=0.01, max_iter=1000, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=True, warm_start=False)

In [76]:
Y_pred = reg.predict(X_val)

In [77]:
mean_absolute_error(Y_val,Y_pred)

621982.1997982272