# Mini-Project 4: Parameter Optimization

Welcome to the fourth (mini-)project of Acamica's Data Science program!

In this project we will keep working (for the last time) with the dataset of properties for sale published on the [Properati](www.properati.com.ar) portal. This time the goal is to optimize the parameters of the algorithms we used in the previous project.

The dataset is the same one as in project 3. Recall that the added columns are:

* `barrios_match`: 1 if the published neighborhood matches the geographic one, 0 otherwise.

* `PH`, `apartment`, `house`: binary variables indicating the property type.

* neighborhood dummies: binary variables set to 1 or 0 depending on the neighborhood.

The metric we will use to evaluate the models is RMSE (root mean squared error), defined as:

$$RMSE = \sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2}{n}}$$
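As a quick sanity check of the formula, here is a minimal sketch that computes RMSE by hand. The `rmse` helper and the three prices are made up for illustration; they are not part of the Properati dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, following the formula above."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(pred - true) ** 2 for true, pred in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices in USD, used only to illustrate the computation
y_true = [100000.0, 150000.0, 200000.0]
y_pred = [110000.0, 140000.0, 230000.0]

example_rmse = rmse(y_true, y_pred)
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, the single 30,000 USD miss dominates the result, which is why RMSE penalizes large errors more than a plain mean absolute error would.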

## Pandas - Loading the dataset

In [1]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
path_dataset = 'dataset/datos_properati_limpios_model.csv'
df = pd.read_csv(path_dataset)

**Split** the dataset into training (80%) and test (20%) sets, using the `price_aprox_usd` column as the target.

In [2]:
# Do the split in this cell
import numpy as np
from sklearn.model_selection import train_test_split 
X = df.drop(['price_aprox_usd'], axis=1)
y = df['price_aprox_usd']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print(X_train.shape[0], X_test.shape[0])

5100 1276
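The printed sizes can be reproduced with a little arithmetic. This sketch assumes the test size is rounded up and the training set takes the remainder, which matches the output above; the total row count is inferred from that output rather than read from the file.

```python
import math

n_samples = 5100 + 1276  # total rows implied by the printed split sizes
test_size = 0.2

# Assumption: the test fraction is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Note that the split above does not fix a random seed, so the exact rows in each set (and therefore the scores below) change from run to run; passing `random_state` to `train_test_split` would make the split reproducible.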


## Scikit-learn - Training

To review the decision tree parameters in Scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

First, let's see how to do cross-validation. For that we need to define the number of folds; in this case we will use 5.
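To make the idea of folds concrete, here is a minimal sketch of how 5 folds partition the training rows. The `k_fold_indices` helper is hypothetical and uses contiguous, unshuffled folds; it only illustrates the bookkeeping and is not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row when sizes are uneven
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# 5100 is the size of the training set printed earlier
folds = list(k_fold_indices(n_samples=5100, n_folds=5))
fold_sizes = [(len(train), len(val)) for train, val in folds]
print(fold_sizes)
```

Each row is used for validation exactly once and for training in the other four folds, so every candidate model is scored on data it was not fitted on.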

`GridSearchCV` lets us test every combination in a parameter search space and find the best one for a given estimator.

For example, here we try the maximum depth and the maximum number of features considered at each split, both between 1 and 5.
Recall that scikit-learn optimizes with the `neg_mean_squared_error` metric instead of `mean_squared_error`, because its scorers follow the convention that higher values are better.
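The sign flip is easier to see with numbers. In this sketch the three candidates and their MSE values are invented for illustration: negating the errors lets a search that always maximizes the score pick the model with the smallest error.

```python
# Hypothetical mean squared errors for three candidate models
mse_by_candidate = {"A": 9.0e8, "B": 7.5e8, "C": 8.2e8}

# Negate the errors so that "greater is better" holds
neg_mse = {name: -mse for name, mse in mse_by_candidate.items()}

# Maximizing the negated MSE selects the candidate with the lowest MSE
best = max(neg_mse, key=neg_mse.get)

# Undo the sign and take the square root to report the score as RMSE
best_rmse = (-neg_mse[best]) ** 0.5
print(best, round(best_rmse, 2))
```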

**Create** a `param_grid` variable with values from 1 to 5 for the `max_depth` and `max_features` parameters.

In [3]:
# Create the param_grid variable in this cell
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
param_grid = [
    {'max_depth': [1,2, 3, 4,5], 'max_features': [1,2, 3, 4,5]},
]

**Import** `GridSearchCV` and `DecisionTreeRegressor`.

**Create** a `grid_search` variable and assign it a `GridSearchCV` that searches the `param_grid` you created, using the `DecisionTreeRegressor` algorithm and `neg_mean_squared_error` as the scoring.

In [4]:
# Create the GridSearchCV in this cell (the imports are in the previous cell)
tree_reg = DecisionTreeRegressor()
grid_search = GridSearchCV(tree_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', 
                           return_train_score=True)

Next, run `fit` on `grid_search` with the training set.

In [5]:
# Fit the grid search in this cell
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                             max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort=False, random_state=None,
                                             splitter='best'),
             iid='warn', n_jobs=None,
             param_grid=[{'max_depth': [1, 2, 3, 4, 5],
                          'max_features': [1, 2, 3, 4, 5]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring

Let's review the results. Recall that they are not expressed as RMSE.

In [8]:
grid_search.scorer_

make_scorer(mean_squared_error, greater_is_better=False)

**Show** the grid scores obtained during the `grid_search` (they are stored in `cv_results_`).

In [9]:
np.sqrt(-grid_search.cv_results_['mean_train_score'])

array([31307.36740377, 31096.97581302, 31064.6137871 , 31237.85621584,
       29543.42847446, 30354.24988877, 31070.22714981, 30186.81600125,
       31110.48903673, 30073.87983227, 31204.30195365, 29542.72603594,
       28992.64069996, 29814.23910269, 27235.50558466, 30174.49318726,
       28955.31265446, 29972.16490805, 28442.74649968, 28132.31458905,
       29704.90652173, 28910.31533579, 28769.10611045, 28687.20667287,
       26962.71662975])

In [10]:
np.sqrt(-grid_search.cv_results_['mean_test_score'])

array([31320.72531343, 31111.1544259 , 31080.04559342, 31268.87658358,
       29650.88912907, 30461.00488424, 31119.15138626, 30183.54118705,
       31185.53646704, 30172.23067716, 31261.66299884, 29476.78667774,
       29115.49541057, 29960.36695567, 27456.69184855, 30404.50845312,
       29092.67636526, 30250.43118664, 28782.06379496, 28276.95181572,
       30001.91371718, 29169.31133948, 28785.99914748, 28821.77032657,
       27297.10898615])

In [11]:
# Show the grid_scores in this cell
grid_search.best_params_

{'max_depth': 5, 'max_features': 5}

So the best-performing combination (given the search space we defined) is `max_depth` 5 and `max_features` 5.
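To tie the cross-validated RMSE array from `In [10]` back to parameter combinations, the sketch below rebuilds the 25 candidates and looks up the lowest error. It assumes the candidates are ordered with `max_features` varying fastest; in a live session `grid_search.cv_results_['params']` gives the mapping directly, without that assumption.

```python
from itertools import product

max_depths = [1, 2, 3, 4, 5]
max_features = [1, 2, 3, 4, 5]

# Assumed candidate order: max_depth outer loop, max_features inner loop
candidates = list(product(max_depths, max_features))

# Cross-validated RMSE values copied from the output of In [10]
cv_rmse = [
    31320.72531343, 31111.1544259, 31080.04559342, 31268.87658358,
    29650.88912907, 30461.00488424, 31119.15138626, 30183.54118705,
    31185.53646704, 30172.23067716, 31261.66299884, 29476.78667774,
    29115.49541057, 29960.36695567, 27456.69184855, 30404.50845312,
    29092.67636526, 30250.43118664, 28782.06379496, 28276.95181572,
    30001.91371718, 29169.31133948, 28785.99914748, 28821.77032657,
    27297.10898615,
]

best_index = min(range(len(cv_rmse)), key=cv_rmse.__getitem__)
best_depth, best_features = candidates[best_index]
print(best_index, best_depth, best_features, round(cv_rmse[best_index], 2))
```

Under that ordering the lowest error is the last entry, which agrees with the `best_params_` output above and sits at the upper edge of both ranges. That suggests the search space may be too narrow, which motivates the wider grid explored later.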

**Show** the best score and the best parameters found by `grid_search`.

In [12]:
# Show the results in this cell
grid_search.cv_results_['mean_train_score']

array([-9.80151254e+08, -9.67021905e+08, -9.65010230e+08, -9.75803661e+08,
       -8.72814166e+08, -9.21380486e+08, -9.65359015e+08, -9.11243860e+08,
       -9.67862528e+08, -9.04438248e+08, -9.73708460e+08, -8.72772662e+08,
       -8.40573215e+08, -8.88888853e+08, -7.41772764e+08, -9.10500039e+08,
       -8.38410131e+08, -8.98330669e+08, -8.08989828e+08, -7.91427124e+08,
       -8.82381471e+08, -8.35806333e+08, -8.27661466e+08, -8.22955827e+08,
       -7.26988088e+08])

We convert to RMSE.

In [16]:
def nmsq2rmse(score):
    return np.round(np.sqrt(-score), 2)

In [17]:
grid_search.best_score_  
print("Best Scores_RMSE:", nmsq2rmse(grid_search.best_score_))

Best Scores_RMSE: 27297.11


__Find the best model for the given search space__

* `"min_samples_split": [2, 10, 20]`
* `"max_depth": [None, 2, 5, 10, 15]`
* `"min_samples_leaf": [1, 5, 10, 15]`
* `"max_leaf_nodes": [None, 5, 10, 20]`
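Before running a grid this size, it helps to count how many models it implies. This sketch enumerates the search space above with `itertools.product`; the 5 folds come from the `cv=5` setting used throughout the notebook.

```python
from itertools import product

search_space = {
    "min_samples_split": [2, 10, 20],
    "max_depth": [None, 2, 5, 10, 15],
    "min_samples_leaf": [1, 5, 10, 15],
    "max_leaf_nodes": [None, 5, 10, 20],
}

# Every combination of one value per parameter is one candidate model
candidates = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]

n_candidates = len(candidates)
n_folds = 5
n_fits = n_candidates * n_folds  # excludes the final refit on the full training set
print(n_candidates, n_fits)
```

The grid grows multiplicatively with each added parameter, which is the main reason to consider a randomized search when the space gets larger.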

In [18]:
# Create the param_grid variable in this cell
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
param_grid = [
    {'min_samples_split': [2, 10, 20], 'max_depth': [None, 2, 5, 10, 15],
     'min_samples_leaf': [1, 5, 10, 15], 'max_leaf_nodes': [None, 5, 10, 20]},
]

In [19]:
tree_reg = DecisionTreeRegressor()
grid_search = GridSearchCV(tree_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', 
                           return_train_score=True)

In [20]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                             max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort=False, random_state=None,
                                             splitter='best'),
             iid='warn', n_jobs=None,
             param_grid=[{'max_depth': [None, 2, 5, 10, 15],
                          'max_leaf_nodes': [None, 5, 10, 20],
                          'min_samples_leaf': [1, 5, 10, 15],
                          

In [21]:
grid_search.scorer_

make_scorer(mean_squared_error, greater_is_better=False)

In [22]:
np.sqrt(-grid_search.cv_results_['mean_train_score'])

array([  483.1111576 , 10637.6897186 , 14236.53900024, 13719.82029512,
       13722.14534468, 15600.0029534 , 16682.35193361, 16682.35193361,
       16682.35193361, 17887.77221277, 17887.77221277, 17887.77221277,
       24292.04536038, 24292.04536038, 24292.04536038, 24292.04536038,
       24292.04536038, 24292.04536038, 24292.04536038, 24292.04536038,
       24292.04536038, 24292.04536038, 24292.04536038, 24292.04536038,
       22340.71376253, 22340.71376253, 22340.71376253, 22340.71376253,
       22340.71376253, 22340.71376253, 22340.71376253, 22340.71376253,
       22340.71376253, 22340.71376253, 22340.71376253, 22340.71376253,
       21308.29344367, 21308.29344367, 21308.29344367, 21311.51960742,
       21311.51960742, 21311.51960742, 21313.68804818, 21313.68804818,
       21313.68804818, 21319.31187285, 21319.31187285, 21319.31187285,
       24910.50273428, 24910.50273428, 24910.50273428, 24910.50273428,
       24910.50273428, 24910.50273428, 24910.50273428, 24910.50273428,
      

In [23]:
np.sqrt(-grid_search.cv_results_['mean_test_score'])

array([25631.96984368, 23973.13752708, 22524.14224204, 22971.34677827,
       22978.35993374, 22163.02894719, 21709.06693172, 21721.03806477,
       21716.58415239, 21578.25579436, 21578.25579436, 21578.25579436,
       24458.2389823 , 24458.2389823 , 24458.2389823 , 24458.2389823 ,
       24458.2389823 , 24458.2389823 , 24458.2389823 , 24458.2389823 ,
       24458.2389823 , 24458.2389823 , 24458.2389823 , 24458.2389823 ,
       22807.05480143, 22807.05480143, 22807.05480143, 22807.05480143,
       22807.05480143, 22807.05480143, 22807.05480143, 22807.05480143,
       22807.05480143, 22807.05480143, 22807.05480143, 22807.05480143,
       22282.42857879, 22282.42857879, 22282.42857879, 22273.60463086,
       22273.60463086, 22273.60463086, 22251.38830455, 22251.38830455,
       22251.38830455, 22240.61653214, 22240.61653214, 22240.61653214,
       24966.89079942, 24966.89079942, 24966.89079942, 24966.89079942,
       24966.89079942, 24966.89079942, 24966.89079942, 24966.89079942,
      

In [24]:
grid_search.best_params_

{'max_depth': 10,
 'max_leaf_nodes': None,
 'min_samples_leaf': 15,
 'min_samples_split': 2}
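The train and validation arrays above also show how much each candidate overfits. The sketch below compares the first four entries of both arrays (values copied from `In [22]` and `In [23]`). It assumes these entries correspond to the least constrained trees, which would be consistent with their very low training error.

```python
# First four entries of the train and validation RMSE arrays shown above
train_rmse = [483.1111576, 10637.6897186, 14236.53900024, 13719.82029512]
val_rmse = [25631.96984368, 23973.13752708, 22524.14224204, 22971.34677827]

# A ratio far above 1 means the tree fits the training folds much better
# than the held-out fold, i.e. it is memorizing rather than generalizing
ratios = [val / train for train, val in zip(train_rmse, val_rmse)]

for train, val, ratio in zip(train_rmse, val_rmse, ratios):
    print(f"train={train:10.2f}  validation={val:10.2f}  ratio={ratio:6.2f}")
```

The first candidate has an almost perfect training score but the worst validation score of the four, which is why the search ends up preferring a tree constrained by `max_depth` and `min_samples_leaf`.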

In [25]:
grid_search.cv_results_['mean_train_score']

array([-2.33396391e+05, -1.13160443e+08, -2.02679043e+08, -1.88233469e+08,
       -1.88297273e+08, -2.43360092e+08, -2.78300866e+08, -2.78300866e+08,
       -2.78300866e+08, -3.19972395e+08, -3.19972395e+08, -3.19972395e+08,
       -5.90103468e+08, -5.90103468e+08, -5.90103468e+08, -5.90103468e+08,
       -5.90103468e+08, -5.90103468e+08, -5.90103468e+08, -5.90103468e+08,
       -5.90103468e+08, -5.90103468e+08, -5.90103468e+08, -5.90103468e+08,
       -4.99107491e+08, -4.99107491e+08, -4.99107491e+08, -4.99107491e+08,
       -4.99107491e+08, -4.99107491e+08, -4.99107491e+08, -4.99107491e+08,
       -4.99107491e+08, -4.99107491e+08, -4.99107491e+08, -4.99107491e+08,
       -4.54043369e+08, -4.54043369e+08, -4.54043369e+08, -4.54180868e+08,
       -4.54180868e+08, -4.54180868e+08, -4.54273298e+08, -4.54273298e+08,
       -4.54273298e+08, -4.54513059e+08, -4.54513059e+08, -4.54513059e+08,
       -6.20533146e+08, -6.20533146e+08, -6.20533146e+08, -6.20533146e+08,
       -6.20533146e+08, -

In [28]:
grid_search.best_score_ 
print("Best Scores_RMSE:", nmsq2rmse(grid_search.best_score_))

Best Scores_RMSE: 21491.95


Recall that `GridSearchCV` has `refit=True` as its default. This means that once the search finishes, the best model is refit on the entire dataset passed to `fit`. That way, we can predict directly with `best_estimator_`.

In [29]:
optimised_decision_tree = grid_search.best_estimator_
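A small, self-contained sketch of what `refit=True` gives us. Synthetic data from `make_regression` stands in for the Properati dataset here, and the names `X_demo`, `y_demo` and `demo_search` are only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the real dataset (assumption for this sketch)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=5,
    scoring="neg_mean_squared_error",
)  # refit=True is the default
demo_search.fit(X_demo, y_demo)

# best_estimator_ is already fitted on all of X_demo, so it can predict directly
best_tree = demo_search.best_estimator_
same_predictions = np.allclose(
    best_tree.predict(X_demo), demo_search.predict(X_demo)
)
print(demo_search.best_params_, same_predictions)
```

Calling `predict` on the search object delegates to the refitted best estimator, so both routes give the same predictions.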

## Trying out RandomizedSearch

In [33]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
param_grid = {'max_depth': [1,2, 3, 4,5], 'max_features': [1,2, 3, 4,5]}

In [34]:
tree_reg = DecisionTreeRegressor()
rand_search = RandomizedSearchCV(tree_reg, param_distributions=param_grid, cv=5,
                                 scoring='neg_mean_squared_error',
                                 n_iter=10,
                                 return_train_score=True)

rand_search.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=DecisionTreeRegressor(criterion='mse',
                                                   max_depth=None,
                                                   max_features=None,
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   presort=False,
                                                   random_state=None,
                                                   splitter='best'),
                   iid='warn', n_iter=10, n_jobs=None,
                   param_di

In [35]:
rand_search.best_params_

{'max_features': 3, 'max_depth': 4}
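With `n_iter=10`, the randomized search evaluates only a subset of the 25 combinations the grid search covered. The following sketch mimics that idea with the standard library; it assumes sampling without replacement, and the seed and `random.Random` generator are illustrative choices, not what scikit-learn uses internally.

```python
import random
from itertools import product

param_grid = {"max_depth": [1, 2, 3, 4, 5], "max_features": [1, 2, 3, 4, 5]}

# All combinations an exhaustive grid search would evaluate
all_candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# Draw 10 distinct combinations, as a randomized search with n_iter=10 would
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sampled = rng.sample(all_candidates, k=10)

coverage = len(sampled) / len(all_candidates)
print(len(all_candidates), len(sampled), coverage)
```

Since only 40% of the space is explored, the best combination found can differ between runs and may miss the grid search optimum, as happened above.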

In [36]:
rand_search.best_estimator_

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=3,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [37]:
np.sqrt(-rand_search.cv_results_['mean_train_score'])

array([27726.56798484, 30218.46455059, 30817.35650271, 29853.88516235,
       29474.74844103, 28558.38918743, 30345.67546846, 30027.13044112,
       29626.4068557 , 31260.10016165])

In [38]:
np.sqrt(-rand_search.cv_results_['mean_test_score'])

array([27995.07990935, 30126.33724685, 30945.39611261, 30031.38134999,
       29687.71506483, 28804.60883615, 30351.27925601, 30275.3517073 ,
       29694.92215438, 31314.32371036])

In [39]:
# Compare the results of RandomizedSearch and GridSearch
print("Mean GridSearch RMSE: ", np.sqrt(-grid_search.cv_results_['mean_test_score']).mean())
print("Mean RandomizedSearch RMSE: ", np.sqrt(-rand_search.cv_results_['mean_test_score']).mean())

Mean GridSearch RMSE:  23347.137638110893
Mean RandomizedSearch RMSE:  29922.63953478298
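Averaging over all candidates mixes good and bad parameter combinations, so it may be more informative to compare the best cross-validated score of each search. The sketch below does that with the numbers printed earlier. Keep in mind that `grid_search` at this point holds the second, wider search space while `rand_search` used the first one, so the comparison is not like-for-like.

```python
# Best cross-validated RMSE of the second grid search (output of In [28])
grid_best_rmse = 21491.95

# Cross-validated RMSE of the 10 randomized-search candidates (output of In [38])
rand_cv_rmse = [
    27995.07990935, 30126.33724685, 30945.39611261, 30031.38134999,
    29687.71506483, 28804.60883615, 30351.27925601, 30275.3517073,
    29694.92215438, 31314.32371036,
]

rand_best_rmse = min(rand_cv_rmse)
rand_mean_rmse = sum(rand_cv_rmse) / len(rand_cv_rmse)

print("Best GridSearch RMSE:", grid_best_rmse)
print("Best RandomizedSearch RMSE:", round(rand_best_rmse, 2))
print("Mean RandomizedSearch RMSE:", round(rand_mean_rmse, 2))
```

The gap between the two best scores mostly reflects the different search spaces: the randomized search was limited to shallow trees with at most 5 features per split.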


__Let's evaluate this model's performance on the test set.__

As we have been doing, the test result is the measurement we will use as a benchmark to compare this model with others in the future, since the model had no contact with the test dataset during tuning.

In [40]:
from sklearn.metrics import mean_squared_error
y_opt_pred = optimised_decision_tree.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_opt_pred))
np.round(rmse)

20733.0

Let's look at the first 10 predicted property values.

In [41]:
val_real = pd.Series(y_test.values)
val_pred = pd.Series(y_opt_pred)

In [42]:
predicciones = pd.concat([val_real.rename('Valor real'),val_pred.rename('Valor Pred') ,abs(val_real-val_pred).rename('Dif(+/-)')] ,  axis=1)

In [43]:
predicciones.head(10)

| | Valor real | Valor Pred | Dif(+/-) |
|---|---|---|---|
| 0 | 85000.0 | 134370.37 | 49370.37 |
| 1 | 172000.0 | 141206.667 | 30793.333 |
| 2 | 80000.0 | 123772.638 | 43772.638 |
| 3 | 120000.0 | 162834.615 | 42834.615 |
| 4 | 159000.0 | 150609.059 | 8390.941 |
| 5 | 175000.0 | 151900.716 | 23099.284 |
| 6 | 91300.0 | 77046.164 | 14253.836 |
| 7 | 87000.0 | 111984.0 | 24984.0 |
| 8 | 170000.0 | 135768.902 | 34231.098 |
| 9 | 169000.0 | 142176.792 | 26823.208 |
