# TP2 - Organización de Datos
#### Notebook principal

<hr>

### Notebooks utilizados:

- ***pre_processing:*** notebook para el manejo inicial de los dataframes.
- ***feature_generation:*** primer etapa del pipeline. En este notebook se generarán nuevos features para luego, realizar un proceso de selección de los mejores features para cada modelo.
- ***feature_selection*** segunda etapa, donde se buscara encontrar los features con mayor importancia, es decir aquellos que aporten mayor informacion.
- ***parameter_tuning:*** tercer etapa, notebook donde se tunean los parámetros para cada modelo.
- ***predict:*** finalmente, una vez obtenidos los mejores parametros y features para cada modelo, este notebook se encargará de generar el csv con las predicciones finales para el modelo que se le indique.

<hr>


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 7

In [2]:
import nbimporter

from pre_processing import load_featured_datasets
import feature_generation
import feature_selection
import parameter_tuning
import predict

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb
Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from predict.ipynb


In [3]:
def escribir_respuesta(ids,predicciones):
    with open("respuesta.csv",'w') as archivo:
        archivo.write("id,target\n")
        for i in range(len(ids)):
            linea = f"{int(ids[i])},{predicciones[i]}"
            archivo.write(f"{linea}\n")

<hr>

# Resultados obtenidos

# area de testing:

In [5]:
train,test = load_datasets()

In [11]:
train.fecha.describe()

count                  240000
unique                   1830
top       2016-12-03 00:00:00
freq                     1416
first     2012-01-01 00:00:00
last      2016-12-31 00:00:00
Name: fecha, dtype: object

In [55]:
usd = pd.read_csv('data/usd_mxn_diario.csv')

In [57]:
def aniomes(x):
    #meses = {'Ene':'01', 'Feb':'02', 'Mar':'03', 'Abr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Ago':'08',
             #'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dic':'12'}
    #mes,anio = x.split()
    
    mes = x[3:5]
    anio = x[6:]
    
    return anio+mes
    

In [58]:
def numeric(x):
    entero, fraccion = x.split(',')
    return int(entero) + int(fraccion)/10000

In [59]:
usd['aniomes'] = usd['Fecha'].map(aniomes)

In [60]:
usd.columns

Index(['Fecha', 'Último', 'Apertura', 'Máximo', 'Mínimo', '% var.', 'aniomes'], dtype='object')

In [61]:
valores = ['Último', 'Apertura', 'Máximo', 'Mínimo']
for valor in valores:
    usd[valor] = usd[valor].map(numeric)

In [62]:
usd['daily_mean'] = usd.apply(lambda x: (x['Último'] + x['Apertura'])/2, axis=1)

In [64]:
usd = usd[['aniomes', 'daily_mean']]

In [65]:
usd.head(3)

Unnamed: 0,aniomes,daily_mean
0,201612,20.73335
1,201612,20.7471
2,201612,20.7571


In [68]:
usd = usd.groupby('aniomes')['daily_mean'].agg(['mean'])

In [71]:
usd.head(5)

Unnamed: 0_level_0,mean
aniomes,Unnamed: 1_level_1
201201,13.419316
201202,12.789329
201203,12.734559
201204,13.047955
201205,13.630126


In [73]:
usd.columns = ['usd_mxn']

In [74]:
usd['mxn_usd'] = usd['usd_mxn'].map(lambda x: 1/x)

In [79]:
usd.reset_index(inplace=True)

In [81]:
usd.to_csv('data/usd_mxn_featured.csv')

### Modelo: Regresion lineal

In [4]:
# ...

### Modelo: Regresion logistica

In [5]:
# ...

### Modelo: SVM

In [6]:
# ...

### Modelo: Decision Tree

In [7]:
# ...

### Modelo: RandomForest

In [4]:
from sklearn.ensemble import RandomForestRegressor

In [5]:
train,test = load_featured_datasets()

In [6]:
train.shape

(240000, 137)

In [7]:
train.fillna(train.mean(), inplace = True)

In [8]:
train.shape

(240000, 137)

In [10]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [11]:
regressor = RandomForestRegressor(n_estimators = 100, random_state = seed, verbose=2, max_depth=10) 
regressor.fit(X_train, Y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 100


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.7s remaining:    0.0s


building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  5.1min finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=7, verbose=2,
                      warm_start=False)

In [12]:
from sklearn import metrics

In [15]:
y_pred = regressor.predict(X_val)
print('MAE: ', int(metrics.mean_absolute_error(Y_val, y_pred)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  688548


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.6s finished


In [16]:
y_pred2 = regressor.predict(X_train)
print('MAE: ', int(metrics.mean_absolute_error(Y_train, y_pred2)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


MAE:  663067


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.2s finished


In [20]:
names = train.columns.to_list()
print(sorted(zip(map(lambda x: round(x, 4), regressor.feature_importances_), names), reverse=True))

[(0.4881, 'metroscubiertos'), (0.2571, 'ciudad_le'), (0.0352, 'ciudad_muycara'), (0.0317, 'banos'), (0.0158, 'tipodepropiedad_1_pol'), (0.0141, 'dia'), (0.0129, 'precio_promedio_metrocubierto_mes'), (0.0125, 'antiguedad'), (0.0113, 'garages'), (0.0105, 'servicio'), (0.0096, 'es_Veracruz'), (0.0093, 'metroscubiertos_mean'), (0.009, 'precio'), (0.0085, 'intercept_pol'), (0.0069, 'tipodepropiedad_2_pol'), (0.0065, 'tipodepropiedad_0_pol'), (0.005, 'habitaciones'), (0.0042, 'aniomes'), (0.0033, 'tipodepropiedad_3_pol'), (0.0031, 'ciudad_barata'), (0.0025, 'es_apart'), (0.0024, 'tipodepropiedad_4_pol'), (0.002, 'tipodepropiedad_le'), (0.002, 'ciudad_cara'), (0.0019, 'tipodepropiedad_8_ohe'), (0.0017, 'lujo'), (0.0017, 'aniomes_scaled'), (0.0015, 'mes'), (0.0015, 'es_casa'), (0.0014, 'tipodepropiedad_7_pol'), (0.0014, 'hab_binning_1_ohe'), (0.0013, 'provincia_10_ohe'), (0.0013, 'gimnasio'), (0.0012, 'parrilla'), (0.0011, 'piscina'), (0.0011, 'es_Distrito Federal'), (0.001, 'hab_binning_7_ohe

### Modelo: XGBoost

_Generacion del dataset de train con sus features_

In [4]:
import xgboost

In [5]:
train,test = load_featured_datasets()

In [8]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values

In [8]:
#feature_selector = feature_selection.k_features_selector(150,train)
#feature_indices = feature_selector.get_support(indices=True)

In [9]:
#X_purged = X[:,feature_indices]

In [10]:
X_purged.shape

(240000, 150)

In [10]:
#X_train, X_val, Y_train, Y_val = train_test_split(X_purged, Y, test_size=0.2)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

In [11]:
reg = xgboost.XGBRegressor(max_depth=10,n_estimators=120 ,learning_rate=0.1, verbosity=2,subsample=0.9, min_child_weight=10)
reg.fit(X_train,Y_train)
#reg.fit(X_purged,Y)

[04:13:30] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1602 extra nodes, 0 pruned nodes, max_depth=10
[04:13:32] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1618 extra nodes, 0 pruned nodes, max_depth=10
[04:13:34] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1652 extra nodes, 0 pruned nodes, max_depth=10
[04:13:36] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1656 extra nodes, 0 pruned nodes, max_depth=10
[04:13:39] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1654 extra nodes, 0 pruned nodes, max_depth=10
[04:13:41] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1680 extra nodes, 0 pruned nodes, max_depth=10
[04:13:44] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1678 extra nodes, 0 pruned nodes, max_depth=10
[04:13:46] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 

[04:15:59] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 442 extra nodes, 0 pruned nodes, max_depth=10
[04:16:01] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 564 extra nodes, 0 pruned nodes, max_depth=10
[04:16:03] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 294 extra nodes, 0 pruned nodes, max_depth=10
[04:16:05] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 426 extra nodes, 0 pruned nodes, max_depth=10
[04:16:07] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 456 extra nodes, 0 pruned nodes, max_depth=10
[04:16:10] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 490 extra nodes, 0 pruned nodes, max_depth=10
[04:16:12] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 466 extra nodes, 0 pruned nodes, max_depth=10
[04:16:15] INFO: /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=10, min_child_weight=10, missing=None, n_estimators=120,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=0.9, verbosity=2)

_Comprobacion contra el conjunto de validacion_

In [None]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

In [15]:
# preparamos el csv de respuesta para kaggle

In [16]:
ids = test.index.values
X_test = test.values
X_test_purged = X_test[:,feature_indices]

In [17]:
test_predict = reg.predict(X_test_purged)

In [18]:
escribir_respuesta(ids, test_predict)

### Modelo: CatBoost

In [None]:
#...

### Modelo: LightGBM

In [13]:
import lightgbm as lgb

In [14]:
train,test = load_featured_datasets()

In [15]:
X = train.drop('precio', axis=1).values
Y = train['precio'].values

In [16]:
feature_selector = feature_selection.k_features_selector(100,train)
feature_indices = feature_selector.get_support(indices=True)

  corr /= X_norms
  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


In [17]:
X_purged = X[:,feature_indices]

In [18]:
X_train, X_val, Y_train, Y_val = train_test_split(X_purged, Y, test_size=0.2, random_state=seed)
#X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [19]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 20}
n_estimators=5000

"""Use large max_bin (may be slower)
Use small learning_rate with large num_iterations
Use large num_leaves (may cause over-fitting)
Use bigger training data
Try dart"""

'Use large max_bin (may be slower)\nUse small learning_rate with large num_iterations\nUse large num_leaves (may cause over-fitting)\nUse bigger training data\nTry dart'

In [20]:
d_train = lgb.Dataset(X_train, label=Y_train)
d_valid = lgb.Dataset(X_val, label=Y_val)
watchlist = [d_valid]
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)



[1]	valid_0's l1: 1.55232e+06
Training until validation scores don't improve for 20 rounds
[2]	valid_0's l1: 1.50069e+06
[3]	valid_0's l1: 1.45181e+06
[4]	valid_0's l1: 1.40671e+06
[5]	valid_0's l1: 1.36443e+06
[6]	valid_0's l1: 1.32411e+06
[7]	valid_0's l1: 1.28737e+06
[8]	valid_0's l1: 1.25194e+06
[9]	valid_0's l1: 1.21965e+06
[10]	valid_0's l1: 1.18905e+06
[11]	valid_0's l1: 1.16e+06
[12]	valid_0's l1: 1.13357e+06
[13]	valid_0's l1: 1.10775e+06
[14]	valid_0's l1: 1.08441e+06
[15]	valid_0's l1: 1.06186e+06
[16]	valid_0's l1: 1.04165e+06
[17]	valid_0's l1: 1.02217e+06
[18]	valid_0's l1: 1.00433e+06
[19]	valid_0's l1: 986632
[20]	valid_0's l1: 970632
[21]	valid_0's l1: 954833
[22]	valid_0's l1: 940831
[23]	valid_0's l1: 926808
[24]	valid_0's l1: 914199
[25]	valid_0's l1: 901845
[26]	valid_0's l1: 890526
[27]	valid_0's l1: 879681
[28]	valid_0's l1: 869954
[29]	valid_0's l1: 859865
[30]	valid_0's l1: 851311
[31]	valid_0's l1: 843146
[32]	valid_0's l1: 835064
[33]	valid_0's l1: 826823
[34

[311]	valid_0's l1: 635851
[312]	valid_0's l1: 635735
[313]	valid_0's l1: 635618
[314]	valid_0's l1: 635556
[315]	valid_0's l1: 635505
[316]	valid_0's l1: 635494
[317]	valid_0's l1: 635431
[318]	valid_0's l1: 635297
[319]	valid_0's l1: 635204
[320]	valid_0's l1: 635179
[321]	valid_0's l1: 635178
[322]	valid_0's l1: 635127
[323]	valid_0's l1: 635097
[324]	valid_0's l1: 635067
[325]	valid_0's l1: 635046
[326]	valid_0's l1: 635047
[327]	valid_0's l1: 635016
[328]	valid_0's l1: 634953
[329]	valid_0's l1: 634852
[330]	valid_0's l1: 634798
[331]	valid_0's l1: 634745
[332]	valid_0's l1: 634607
[333]	valid_0's l1: 634585
[334]	valid_0's l1: 634578
[335]	valid_0's l1: 634569
[336]	valid_0's l1: 634559
[337]	valid_0's l1: 634535
[338]	valid_0's l1: 634452
[339]	valid_0's l1: 634394
[340]	valid_0's l1: 634255
[341]	valid_0's l1: 634190
[342]	valid_0's l1: 634168
[343]	valid_0's l1: 634149
[344]	valid_0's l1: 634097
[345]	valid_0's l1: 634036
[346]	valid_0's l1: 633914
[347]	valid_0's l1: 633902
[

KeyboardInterrupt: 

In [11]:
Y_pred = reg.predict(X_val)
mean_absolute_error(Y_val,Y_pred)

475851.1325406793

In [24]:
# preparamos el csv de respuesta para kaggle

In [25]:
ids = test.index.values
X_test = test.values

In [26]:
test_predict = reg.predict(X_test)
escribir_respuesta(ids, test_predict)

In [None]:
# best params so far
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'max_depth': 14, 
    'learning_rate': 0.05,
    'verbose': 0, 
    'early_stopping_round': 20}
n_estimators=5000

### Modelo: KNN

In [3]:
# ...

### Modelo: Neural Networks

In [3]:
# ...