# TP2 - Organización de Datos
#### Main notebook

<hr>

### Notebooks used:

- ***pre_processing:*** notebook for the initial handling of the dataframes.
- ***feature_generation:*** first stage of the pipeline. This notebook generates new features, from which the best ones are later selected for each model.
- ***feature_selection:*** second stage, which looks for the most important features, that is, those that contribute the most information.
- ***parameter_tuning:*** third stage, the notebook where the parameters of each model are tuned.
- ***predict:*** finally, once the best parameters and features for each model have been obtained, this notebook generates the csv with the final predictions for the specified model.

<hr>


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 7

In [2]:
import nbimporter

from pre_processing import load_featured_datasets
import feature_generation
import feature_selection
import parameter_tuning
import predict

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb
Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from predict.ipynb


In [3]:
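def escribir_respuesta(ids, predicciones):
    """Write one id,target row per prediction to respuesta.csv."""
    with open("respuesta.csv", "w") as archivo:
        archivo.write("id,target\n")
        # Pair each id with its prediction instead of indexing both sequences
        for id_, prediccion in zip(ids, predicciones):
            archivo.write(f"{int(id_)},{prediccion}\n")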
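
For comparison, a file with the same `id,target` layout can be produced with pandas. This is a minimal sketch, not the notebook's method: `escribir_respuesta_pandas` and its `ruta` parameter are hypothetical names introduced here.

```python
import pandas as pd


def escribir_respuesta_pandas(ids, predicciones, ruta="respuesta.csv"):
    """Write an id,target CSV with the same layout as escribir_respuesta."""
    respuesta = pd.DataFrame({
        "id": pd.Series(ids).astype(int),  # ids are written as integers
        "target": list(predicciones),
    })
    respuesta.to_csv(ruta, index=False)  # index=False keeps only the two columns
    return respuesta
```

Returning the frame makes it easy to inspect the first rows before uploading the file.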

<hr>

# Results obtained

# Testing area:

In [2]:
# ...

### Model: Linear regression

In [4]:
# ...

### Model: Logistic regression

In [5]:
# ...

### Model: SVM

In [6]:
# ...

### Model: Decision Tree

In [7]:
# ...

### Model: RandomForest

In [4]:
from sklearn.ensemble import RandomForestRegressor

In [5]:
train,test = load_featured_datasets()

In [6]:
train.shape

(240000, 137)

In [7]:
train.fillna(train.mean(), inplace = True)
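
The cell above replaces each missing value with the mean of its column. As a small illustration on made-up data (the `*_demo` frames and their values are invented for this sketch), the same training means can also be reused to fill a second frame, so that both are imputed with identical values. Whether `load_featured_datasets` already handles missing values in `test` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for train and test
train_demo = pd.DataFrame({
    "metroscubiertos": [100.0, np.nan, 200.0],
    "banos": [1.0, 2.0, np.nan],
})
test_demo = pd.DataFrame({
    "metroscubiertos": [np.nan, 80.0],
    "banos": [np.nan, 1.0],
})

# Column means computed on the training frame only
train_means = train_demo.mean(numeric_only=True)

# Fill both frames with the training means
train_filled = train_demo.fillna(train_means)
test_filled = test_demo.fillna(train_means)

print(train_filled)
print(test_filled)
```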

In [8]:
train.shape

(240000, 137)

In [10]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [11]:
regressor = RandomForestRegressor(n_estimators = 100, random_state = seed, verbose=2, max_depth=10) 
regressor.fit(X_train, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.7s remaining:    0.0s


building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  5.1min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=7, verbose=2,
                      warm_start=False)

In [12]:
from sklearn import metrics

In [15]:
y_pred = regressor.predict(X_val)
print('MAE: ', int(metrics.mean_absolute_error(Y_val, y_pred)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  688548


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.6s finished


In [16]:
y_pred2 = regressor.predict(X_train)
print('MAE: ', int(metrics.mean_absolute_error(Y_train, y_pred2)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  663067


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.2s finished
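
The two cells above report the MAE on the validation split and on the training split. Comparing them gives a rough idea of how much the forest overfits. The sketch below only does the arithmetic on the two printed values; `relative_gap` is a hypothetical helper.

```python
def relative_gap(mae_train, mae_val):
    """Relative difference between validation and training MAE."""
    return (mae_val - mae_train) / mae_train


# Values printed by the two cells above
mae_train = 663067
mae_val = 688548

gap = relative_gap(mae_train, mae_val)
print(f"Validation MAE is {gap:.1%} higher than training MAE")
```

A gap of a few percent is usually read as the model not overfitting strongly at `max_depth=10`, although a single split is limited evidence.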


In [20]:
names = train.columns.to_list()
print(sorted(zip(map(lambda x: round(x, 4), regressor.feature_importances_), names), reverse=True))

[(0.4881, 'metroscubiertos'), (0.2571, 'ciudad_le'), (0.0352, 'ciudad_muycara'), (0.0317, 'banos'), (0.0158, 'tipodepropiedad_1_pol'), (0.0141, 'dia'), (0.0129, 'precio_promedio_metrocubierto_mes'), (0.0125, 'antiguedad'), (0.0113, 'garages'), (0.0105, 'servicio'), (0.0096, 'es_Veracruz'), (0.0093, 'metroscubiertos_mean'), (0.009, 'precio'), (0.0085, 'intercept_pol'), (0.0069, 'tipodepropiedad_2_pol'), (0.0065, 'tipodepropiedad_0_pol'), (0.005, 'habitaciones'), (0.0042, 'aniomes'), (0.0033, 'tipodepropiedad_3_pol'), (0.0031, 'ciudad_barata'), (0.0025, 'es_apart'), (0.0024, 'tipodepropiedad_4_pol'), (0.002, 'tipodepropiedad_le'), (0.002, 'ciudad_cara'), (0.0019, 'tipodepropiedad_8_ohe'), (0.0017, 'lujo'), (0.0017, 'aniomes_scaled'), (0.0015, 'mes'), (0.0015, 'es_casa'), (0.0014, 'tipodepropiedad_7_pol'), (0.0014, 'hab_binning_1_ohe'), (0.0013, 'provincia_10_ohe'), (0.0013, 'gimnasio'), (0.0012, 'parrilla'), (0.0011, 'piscina'), (0.0011, 'es_Distrito Federal'), (0.001, 'hab_binning_7_ohe
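
In the listing above `precio` shows up as a feature, although it was dropped from `X` before training. `train.columns` has one more entry than `regressor.feature_importances_`, so `zip` pairs the importances with names from the full column list, and the names from `precio` onwards end up shifted by one position. A minimal sketch of an aligned pairing, using a made-up frame and made-up importance values (nothing here comes from the notebook's data):

```python
import numpy as np
import pandas as pd

# Made-up frame: the target 'precio' sits in the middle of the columns
train_demo = pd.DataFrame({
    "metroscubiertos": [80, 120, 200],
    "precio": [1.0e6, 1.5e6, 2.4e6],
    "banos": [1, 2, 3],
})

# Stand-in for regressor.feature_importances_ (one value per column of X)
importances = np.array([0.7, 0.3])

# Take the names from the same columns that were used to build X
feature_names = train_demo.drop("precio", axis=1).columns.to_list()

ranking = sorted(zip(map(float, importances.round(4)), feature_names), reverse=True)
print(ranking)
```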

### Model: XGBoost

_Generating the training dataset with its features_

In [4]:
import xgboost
from sklearn.model_selection import GridSearchCV

In [5]:
train,test = load_featured_datasets()

In [6]:
features = feature_generation.get_features()

In [7]:
train_selected = train[["habitaciones","metroscubiertos","antiguedad","garages","banos","gimnasio","piscina","usosmultiples", "centroscomercialescercanos", "escuelascercanas"]\
                       +features["metros"][1]\
                       +features["metros"][2]\
                       +features["tipodepropiedad"][5]\
                       +features["provincia"][2]\
                       +features["ciudad"][4]\
                       +features["fecha"][5]\
                       +features["descripcion"][0]\
                       +features["metricas"][0]\
                       +features["antiguedad"][0]\
                       +features["habitaciones"][0]\
                       +features["idzona"][0]
                       +["precio"]]

In [8]:
test_selected = test[["habitaciones","metroscubiertos","antiguedad","garages","banos","gimnasio","piscina","usosmultiples", "centroscomercialescercanos", "escuelascercanas"]\
                       +features["metros"][1]\
                       +features["metros"][2]\
                       +features["tipodepropiedad"][5]\
                       +features["provincia"][2]\
                       +features["ciudad"][4]\
                       +features["fecha"][5]\
                       +features["descripcion"][0]\
                       +features["metricas"][0]\
                       +features["antiguedad"][0]\
                       +features["habitaciones"][0]\
                       +features["idzona"][0]
                    ]
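
The train and test selections above repeat the same list of columns, so a change in one cell can silently diverge from the other. One way to keep them in sync is to build the list once. This is a sketch assuming, as the concatenations above suggest, that every `features[grupo][i]` entry is a list of column names; `build_columns`, `seleccion` and the toy `features_demo` dictionary are hypothetical.

```python
def build_columns(base, features, seleccion):
    """Concatenate the base columns with the chosen feature sets, in order."""
    columnas = list(base)
    for grupo, indice in seleccion:
        columnas += features[grupo][indice]
    return columnas


# Toy stand-in for feature_generation.get_features()
features_demo = {
    "metros": [["m_a"], ["m_b", "m_c"], ["m_d"]],
    "idzona": [["idzona_mean"]],
}
base = ["habitaciones", "banos"]
seleccion = [("metros", 1), ("metros", 2), ("idzona", 0)]

columnas = build_columns(base, features_demo, seleccion)
print(columnas)

# With the real frames, both selections would then share one list:
# train_selected = train[columnas + ["precio"]]
# test_selected = test[columnas]
```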

In [9]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [10]:
X.shape

(240000, 65)

In [487]:
#X_purged = X[:,feature_indices]

In [10]:
#X_train, X_val, Y_train, Y_val = train_test_split(X_purged, Y, test_size=0.2)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [11]:
parametros = {
    'max_depth':[11,12,13,14,15],
    'n_estimators':[100,110,120,130,140],
    'learning_rate': [0.05,0.08,0.1,0.15,0.2,0.3],
    'subsample':[0.5,0.8,0.9,0.7],
    'min_child_weight':[5,10,15,20,30]
}
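
The grid above contains 5 × 5 × 6 × 4 × 5 = 3000 combinations, so an exhaustive search such as `GridSearchCV` would fit that many models per cross-validation fold. The sketch below counts the combinations and draws a random subset of them. The budget of 20 and the use of random sampling are assumptions for illustration, not something this notebook does.

```python
import math
import random
from itertools import product

parametros = {
    'max_depth': [11, 12, 13, 14, 15],
    'n_estimators': [100, 110, 120, 130, 140],
    'learning_rate': [0.05, 0.08, 0.1, 0.15, 0.2, 0.3],
    'subsample': [0.5, 0.8, 0.9, 0.7],
    'min_child_weight': [5, 10, 15, 20, 30],
}

# Size of the full grid: product of the number of values per parameter
n_combinations = math.prod(len(values) for values in parametros.values())
print(n_combinations)  # 3000

# Enumerate every combination as a dict, then sample a fixed budget of them
all_combinations = [
    dict(zip(parametros, values)) for values in product(*parametros.values())
]
rng = random.Random(7)  # fixed seed so the sample is reproducible
sample = rng.sample(all_combinations, 20)
print(sample[0])
```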

In [11]:
reg = xgboost.XGBRegressor(max_depth=14,n_estimators=140 ,learning_rate=0.08, verbosity=2,subsample=0.9, min_child_weight=10, n_jobs=4)
reg.fit(X,Y)

[15:47:40] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 5404 extra nodes, 0 pruned nodes, max_depth=14
[15:47:40] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 5874 extra nodes, 0 pruned nodes, max_depth=14
[15:47:41] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6286 extra nodes, 0 pruned nodes, max_depth=14
[15:47:42] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6596 extra nodes, 0 pruned nodes, max_depth=14
[15:47:43] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6976 extra nodes, 0 pruned nodes, max_depth=14
[15:47:44] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 7516 extra nodes, 0 pruned nodes, max_depth=14
[15:47:45] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 7862 extra nodes, 0 pruned nodes, max_depth=14
[15:47:45] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.08, max_delta_step=0,
             max_depth=14, min_child_weight=10, missing=None, n_estimators=140,
             n_jobs=4, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=0.9, verbosity=2)

_Checking against the validation set_

In [37]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

510411.868680013
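
Note that `reg` was fitted on the full `X`, `Y`, and `X_val` is a subset of `X`, so the figure above is measured on rows the model has already seen and is likely optimistic. The sketch below illustrates the effect on synthetic data, with a scikit-learn decision tree standing in for XGBoost. The data, the model and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: one informative column plus noise
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=1.0, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

# Fitted on everything, then scored on rows it has already seen
seen = DecisionTreeRegressor(random_state=7).fit(X_demo, y_demo)
mae_seen = mean_absolute_error(y_va, seen.predict(X_va))

# Fitted on the training part only, then scored on held-out rows
held_out = DecisionTreeRegressor(random_state=7).fit(X_tr, y_tr)
mae_held_out = mean_absolute_error(y_va, held_out.predict(X_va))

print(f"seen: {mae_seen:.3f}  held-out: {mae_held_out:.3f}")
```

For this notebook, a comparable estimate would come from fitting on `X_train`, `Y_train` before scoring on `X_val`, and refitting on all rows only for the final submission.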

In [32]:
# prepare the response csv for Kaggle

In [12]:
ids = test_selected.index.values
X_test = test_selected.values


In [13]:
test_predict = reg.predict(X_test)

In [14]:
escribir_respuesta(ids, test_predict)

### Model: CatBoost

In [None]:
#...

### Model: LightGBM

In [20]:
import lightgbm as lgb

In [5]:
train,test = load_featured_datasets()

In [6]:
features = feature_generation.get_features()

In [15]:
train_selected = train[["habitaciones","metroscubiertos","antiguedad","garages","banos","gimnasio","piscina","usosmultiples", "centroscomercialescercanos", "escuelascercanas"]\
                       +features["metros"][1]\
                       +features["metros"][2]\
                       +features["tipodepropiedad"][5]\
                       +features["provincia"][2]\
                       +features["ciudad"][4]\
                       +features["fecha"][5]\
                       +features["descripcion"][0]\
                       +features["metricas"][0]\
                       +features["antiguedad"][0]\
                       +features["habitaciones"][0]\
                       +features["idzona"][0]
                       +["precio"]]

In [16]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [17]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [18]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.1,
    'verbose': 0, 
    'early_stopping_round': 50}
n_estimators=5000

# Tuning ideas for better accuracy:
# - Use large max_bin (may be slower)
# - Use small learning_rate with large num_iterations
# - Use large num_leaves (may cause over-fitting)
# - Use bigger training data
# - Try dart

In [21]:
d_train = lgb.Dataset(X_train, label=Y_train)
d_valid = lgb.Dataset(X_val, label=Y_val)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)



[1]	valid_0's l1: 1.48936e+06
Training until validation scores don't improve for 50 rounds
[2]	valid_0's l1: 1.3851e+06
[3]	valid_0's l1: 1.2942e+06
[4]	valid_0's l1: 1.21274e+06
[5]	valid_0's l1: 1.14139e+06
[6]	valid_0's l1: 1.07966e+06
[7]	valid_0's l1: 1.02518e+06
[8]	valid_0's l1: 976646
[9]	valid_0's l1: 934298
[10]	valid_0's l1: 896851
[11]	valid_0's l1: 865285
[12]	valid_0's l1: 836628
[13]	valid_0's l1: 812434
[14]	valid_0's l1: 790155
[15]	valid_0's l1: 770287
[16]	valid_0's l1: 753358
[17]	valid_0's l1: 736983
[18]	valid_0's l1: 722881
[19]	valid_0's l1: 709313
[20]	valid_0's l1: 697621
[21]	valid_0's l1: 686704
[22]	valid_0's l1: 677018
[23]	valid_0's l1: 668852
[24]	valid_0's l1: 660661
[25]	valid_0's l1: 653689
[26]	valid_0's l1: 647158
[27]	valid_0's l1: 641665
[28]	valid_0's l1: 636692
[29]	valid_0's l1: 631639
[30]	valid_0's l1: 627055
[31]	valid_0's l1: 623341
[32]	valid_0's l1: 619283
[33]	valid_0's l1: 616035
[34]	valid_0's l1: 612621
[35]	valid_0's l1: 609415
[36]	
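
The log line "Training until validation scores don't improve for 50 rounds" reflects the `early_stopping_round` setting: training stops once the validation metric has gone 50 rounds without improving, which can happen well before the 5000-round limit if the curve flattens. The following is a simplified sketch of that rule in plain Python. It illustrates the idea and is not LightGBM's implementation; `best_iteration` is a hypothetical helper and the loss values are made up.

```python
def best_iteration(scores, patience):
    """Return (iteration, score) of the best value seen before patience runs out.

    scores: validation losses, one per boosting round (lower is better).
    patience: rounds without improvement after which training stops.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score < scores[best_idx]:
            best_idx = i  # new best round
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx + 1, scores[best_idx]  # rounds are numbered from 1


# Made-up loss curve, not the notebook's log
losses = [10.0, 9.0, 8.0, 8.5, 8.2, 8.1, 7.0]
print(best_iteration(losses, patience=3))
```

With `patience=3` the loop stops before reaching the last value, which shows the trade-off: a short patience saves time but can miss a later improvement.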

In [22]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

487268.28239609225

In [331]:
# prepare the response csv for Kaggle

In [25]:
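# Predict on the same feature columns the model was trained on
test_selected = test[train_selected.columns.drop('precio')]
ids = test_selected.index.values
X_test = test_selected.values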

In [26]:
test_predict = reg.predict(X_test)
escribir_respuesta(ids, test_predict)

In [None]:
# best params so far
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 20}
n_estimators=5000

### Model: KNN

### Model: Neural Networks

In [3]:
# ...