# TP2 - Organización de Datos
#### Main notebook

<hr>

### Notebooks used:

- ***pre_processing:*** notebook for the initial handling of the dataframes.
- ***feature_generation:*** first stage of the pipeline. This notebook generates new features, which then go through a selection step that picks the best features for each model.
- ***feature_selection:*** second stage, which looks for the most important features, that is, those that contribute the most information.
- ***parameter_tuning:*** third stage, the notebook where the parameters of each model are tuned.
- ***predict:*** finally, once the best parameters and features for each model have been found, this notebook generates the CSV with the final predictions for the model it is given.
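As a rough illustration of the parameter-tuning stage, the sketch below runs a small grid search scored by MAE. The estimator, the grid, and the synthetic data are assumptions made for illustration; the actual `parameter_tuning` notebook is not shown here and may work differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical grid; a real grid depends on the model being tuned.
param_grid = {"max_depth": [3, 6], "n_estimators": [20, 40]}

search = GridSearchCV(
    RandomForestRegressor(random_state=7),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a plain MAE
print(best_params, best_mae)
```

The winning combination in `best_params` is what a later stage such as `predict` would presumably reuse when training the final model.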

<hr>


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 7

In [2]:
import nbimporter

from pre_processing import load_appended_dataset, separate_train_test, load_datasets
from feature_generation import load_featured_appended_dataset, load_featured_datasets
import feature_selection
import parameter_tuning
import predict

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb
Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from predict.ipynb


In [3]:
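def escribir_respuesta(ids, predicciones):
    """Write the predictions to respuesta.csv in `id,target` format."""
    with open("respuesta.csv", "w") as archivo:
        archivo.write("id,target\n")
        # One row per id, paired with its prediction.
        for id_, prediccion in zip(ids, predicciones):
            archivo.write(f"{int(id_)},{prediccion}\n")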
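The same `id,target` file could also be produced with pandas. The sketch below is one possible equivalent; `write_submission` and the sample values are hypothetical and not part of the project.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(ids, predictions, path):
    """Write an `id,target` CSV, mirroring the layout of escribir_respuesta."""
    frame = pd.DataFrame({"id": np.asarray(ids).astype(int), "target": predictions})
    frame.to_csv(path, index=False)


# Usage with made-up ids and predictions, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "respuesta.csv")
write_submission([10, 11, 12], [1500000.0, 820000.5, 2300000.0], demo_path)
written = pd.read_csv(demo_path)
print(written)
```

Reading the file back with `pd.read_csv` is a quick way to confirm the header and row count before uploading.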

<hr>

# Results

### Model: Linear regression

In [4]:
# ...

### Model: Logistic regression

In [5]:
# ...

### Model: SVM

In [6]:
# ...

### Model: Decision Tree

In [7]:
# ...

### Model: RandomForest

In [4]:
from sklearn.ensemble import RandomForestRegressor

In [5]:
train,test = load_featured_datasets()

In [6]:
train.drop(['titulo', 'descripcion', 'direccion', 'tipodepropiedad', 'ciudad',
               'provincia', 'fecha', 'metrostotales'], axis=1, inplace=True)
test.drop(['titulo', 'descripcion', 'direccion', 'tipodepropiedad', 'ciudad',
               'provincia', 'fecha', 'metrostotales'], axis=1, inplace=True)

In [7]:
train.shape

(240000, 227)

In [9]:
train.fillna(train.mean(), inplace = True)

In [10]:
train.shape

(240000, 227)
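Only `train` is imputed in this section. If `test` also contains missing values, one option is to compute the column means on the training data and reuse them for both frames. The sketch below uses toy frames with hypothetical column names (`feat_a`, `feat_b`) to show the pattern.

```python
import numpy as np
import pandas as pd

# Toy frames (assumption) standing in for the real train/test data.
train_demo = pd.DataFrame({"feat_a": [100.0, np.nan, 140.0],
                           "feat_b": [2.0, 3.0, np.nan]})
test_demo = pd.DataFrame({"feat_a": [np.nan, 90.0],
                          "feat_b": [np.nan, 1.0]})

# Compute the column means once, on the training data only...
train_means = train_demo.mean()

# ...and reuse them for both frames so train and test are imputed consistently.
train_filled = train_demo.fillna(train_means)
test_filled = test_demo.fillna(train_means)

print(test_filled)
```

Reusing the training means keeps the test rows from influencing the values the model was trained on.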

In [11]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3)
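The split above does not pass `random_state`, so each run produces a different partition and a slightly different MAE. A minimal sketch, on toy arrays, of how the `seed` defined at the top of the notebook could make the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 7  # same value as the notebook-level seed
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state...
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)

# ...return identical train/validation partitions.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```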

In [14]:
regressor = RandomForestRegressor(n_estimators = 100, random_state = seed, verbose=2, max_depth=10) 
regressor.fit(X_train, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.9s remaining:    0.0s


building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  7.0min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=7, verbose=2,
                      warm_start=False)

In [15]:
from sklearn import metrics

In [16]:
y_pred = regressor.predict(X_val)
print('MAE: ', int(metrics.mean_absolute_error(Y_val, y_pred)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  729748


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.8s finished
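For reference, the MAE reported above is the mean of the absolute differences between each prediction and its true value. A small sketch with made-up numbers, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices and predictions, for illustration only.
y_true = np.array([1_000_000.0, 2_000_000.0, 3_000_000.0])
y_hat = np.array([1_100_000.0, 1_800_000.0, 3_000_000.0])

# MAE = mean(|y_true - y_hat|) = (100000 + 200000 + 0) / 3
manual_mae = np.mean(np.abs(y_true - y_hat))
sklearn_mae = mean_absolute_error(y_true, y_hat)

print(manual_mae, sklearn_mae)  # both are 100000.0
```

Because the metric is in the same units as `precio`, an MAE of 729748 means the predictions are off by that amount on average.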


In [17]:
y_pred2 = regressor.predict(X_train)
print('MAE: ', int(metrics.mean_absolute_error(Y_train, y_pred2)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  699275


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    1.9s finished


In [12]:
# Very good results, because we are working with data that has no nulls.

### Model: XGBoost

_Generating the training dataset with its features_

In [4]:
import xgboost

In [5]:
train,test = load_featured_datasets()

In [6]:
train.drop(['titulo', 'descripcion', 'direccion', 'tipodepropiedad', 'ciudad',
               'provincia', 'fecha', 'metrostotales'], axis=1, inplace=True)
test.drop(['titulo', 'descripcion', 'direccion', 'tipodepropiedad', 'ciudad',
               'provincia', 'fecha', 'metrostotales'], axis=1, inplace=True)

In [7]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values


In [8]:
feature_selector = feature_selection.k_features_selector(150,train)
feature_indices = feature_selector.get_support(indices=True)

In [9]:
X_purged = X[:,feature_indices]
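`feature_selection.k_features_selector` lives in another notebook and is not shown here. Since its result is queried with `get_support(indices=True)`, one plausible shape for it is a fitted scikit-learn selector. The sketch below assumes `SelectKBest` with `f_regression`, which may not be what the project actually uses, and runs on synthetic data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): only the first two of six columns drive the target.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(500, 6))
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Hypothetical stand-in for k_features_selector: keep the k highest-scoring columns.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
demo_indices = selector.get_support(indices=True)

# Same slicing pattern as X_purged above.
X_demo_purged = X_demo[:, demo_indices]
print(demo_indices, X_demo_purged.shape)
```

`get_support(indices=True)` returns the positions of the kept columns in ascending order, which is why they can be used directly to slice the NumPy array.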

In [10]:
X_train, X_val, Y_train, Y_val = train_test_split(X_purged, Y, test_size=0.2)

In [11]:
reg = xgboost.XGBRegressor(max_depth=10,n_estimators=120 ,learning_rate=0.1, verbosity=2,subsample=0.9, min_child_weight=10)
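# For a fair validation MAE, fit on the training split only:
#reg.fit(X_train,Y_train)
# Final fit on the full dataset, used for the Kaggle submission:
reg.fit(X_purged,Y)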

[12:40:02] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1346 extra nodes, 0 pruned nodes, max_depth=10
[12:40:04] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1394 extra nodes, 0 pruned nodes, max_depth=10
[12:40:06] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1428 extra nodes, 0 pruned nodes, max_depth=10
[12:40:09] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1464 extra nodes, 0 pruned nodes, max_depth=10
[12:40:11] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1486 extra nodes, 0 pruned nodes, max_depth=10
[12:40:14] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1454 extra nodes, 0 pruned nodes, max_depth=10
[12:40:16] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1542 extra nodes, 0 pruned nodes, max_depth=10
[12:40:19] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=10, min_child_weight=10, missing=None, n_estimators=120,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=0.9, verbosity=2)

_Checking against the validation set_

In [47]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

580489.5362863693

In [19]:
# Prepare the submission CSV for Kaggle

In [12]:
ids = test.index.values
X_test = test.values
X_test_purged = X_test[:,feature_indices]
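Slicing `X_test` by position assumes that `test` has exactly the same column order as `train` without `precio`. A sketch of a more defensive variant that translates the positions into column names first; the toy frames and the column names `f1`, `f2`, `f3` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames (assumption); note that test has its columns in a different order.
train_demo = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "precio": [10, 20], "f3": [5, 6]})
test_demo = pd.DataFrame({"f3": [7, 8], "f1": [9, 10], "f2": [11, 12]})

feature_columns = train_demo.drop("precio", axis=1).columns
demo_indices = np.array([0, 2])  # positions, as returned by get_support(indices=True)

# Translate positions into names, then select by name in both frames.
selected_names = feature_columns[demo_indices]
X_train_sel = train_demo[selected_names].values
X_test_sel = test_demo[selected_names].values

print(list(selected_names))  # ['f1', 'f3']
```

Selecting by name also raises a `KeyError` if a feature is missing from `test`, instead of silently picking the wrong column.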

In [13]:
test_predict = reg.predict(X_test_purged)

In [15]:
escribir_respuesta(ids, test_predict)

### Model: CatBoost

In [None]:
#...

### Model: LightGBM

In [4]:
import lightgbm as lgb

In [5]:
train,test = load_featured_datasets()

In [6]:
train.drop(['titulo', 'descripcion', 'direccion', 'tipodepropiedad', 'ciudad',
               'provincia', 'fecha', 'metrostotales'], axis=1, inplace=True)
test.drop(['titulo', 'descripcion', 'direccion', 'tipodepropiedad', 'ciudad',
               'provincia', 'fecha', 'metrostotales'], axis=1, inplace=True)

In [8]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [9]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 20}
n_estimators=5000

# Use large max_bin (may be slower)
# Use small learning_rate with large num_iterations
# Use large num_leaves (may cause over-fitting)
# Use bigger training data
# Try dart

In [10]:
d_train = lgb.Dataset(X_train, label=Y_train)
d_valid = lgb.Dataset(X_val, label=Y_val)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)



[1]	valid_0's l1: 1.35741e+06
Training until validation scores don't improve for 20 rounds
[2]	valid_0's l1: 1.31249e+06
[3]	valid_0's l1: 1.26991e+06
[4]	valid_0's l1: 1.23092e+06
[5]	valid_0's l1: 1.19421e+06
[6]	valid_0's l1: 1.15957e+06
[7]	valid_0's l1: 1.12667e+06
[8]	valid_0's l1: 1.09604e+06
[9]	valid_0's l1: 1.06712e+06
[10]	valid_0's l1: 1.03962e+06
[11]	valid_0's l1: 1.01415e+06
[12]	valid_0's l1: 989750
[13]	valid_0's l1: 967206
[14]	valid_0's l1: 945563
[15]	valid_0's l1: 925980
[16]	valid_0's l1: 907251
[17]	valid_0's l1: 889518
[18]	valid_0's l1: 873086
[19]	valid_0's l1: 857338
[20]	valid_0's l1: 842220
[21]	valid_0's l1: 827668
[22]	valid_0's l1: 814688
[23]	valid_0's l1: 802137
[24]	valid_0's l1: 790622
[25]	valid_0's l1: 779530
[26]	valid_0's l1: 769315
[27]	valid_0's l1: 759194
[28]	valid_0's l1: 749558
[29]	valid_0's l1: 740502
[30]	valid_0's l1: 731764
[31]	valid_0's l1: 723795
[32]	valid_0's l1: 716222
[33]	valid_0's l1: 708587
[34]	valid_0's l1: 701986
[35]	vali

In [11]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

475851.1325406793
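The log line "Training until validation scores don't improve for 20 rounds" comes from the early-stopping rule set through `early_stopping_round`. To show the idea without depending on LightGBM, the sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in and applies the same patience rule by hand to its staged predictions. Unlike real early stopping, it trains all rounds first and only then scans them; the data and parameters are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(400, 5))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

# Stand-in booster (not LightGBM), trained for a fixed, generous number of rounds.
booster = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                    max_depth=3, random_state=7).fit(X_tr, y_tr)

patience = 20  # plays the role of early_stopping_round
best_mae, best_round = np.inf, 0
for round_number, staged in enumerate(booster.staged_predict(X_va), start=1):
    mae = mean_absolute_error(y_va, staged)
    if mae < best_mae:
        best_mae, best_round = mae, round_number
    elif round_number - best_round >= patience:
        break  # no improvement for `patience` rounds: stop scanning here

print(best_round, best_mae)
```

With a large `n_estimators` such as the 5000 used above, this rule is what actually decides how many trees the final model keeps.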

In [24]:
# Prepare the submission CSV for Kaggle

In [25]:
ids = test.index.values
X_test = test.values

In [26]:
test_predict = reg.predict(X_test)
escribir_respuesta(ids, test_predict)

In [None]:
# best params so far
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 20}
n_estimators=5000

### Model: KNN

In [3]:
# ...

### Model: Neural Networks

In [3]:
# ...