# parameter_tuning
In this notebook we search for the optimal parameters (approximately optimal, since one can always tune for longer) for the models chosen to solve the problem. Because these searches are time-consuming, the results are returned hardcoded so that the searches do not have to be rerun.

In [1]:
import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 42

In [1]:
import nbimporter

import pre_processing
import feature_generation
import feature_selection
import parameter_tuning

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb
Importing Jupyter notebook from parameter_tuning.ipynb


<hr>

# Parameter Tuning for XGBoost

In [3]:
import xgboost
from sklearn.model_selection import RandomizedSearchCV

In [4]:
train,test = pre_processing.load_featured_datasets()

In [5]:
# Train on the natural log of the price (vectorized equivalent of mapping math.log)
train['precio'] = np.log(train['precio'])
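
Because the target is now on a log scale, the model's predictions and any error measured against them are in log units. The following is a minimal sketch, with made-up prices and made-up prediction errors (neither comes from the dataset), of how predictions could be mapped back with `exp` before measuring MAE in price units.

```python
import numpy as np

# Hypothetical prices, only for illustration (not taken from the dataset).
precio = np.array([1_000_000.0, 2_500_000.0, 4_000_000.0])

# Same transformation as the cell above: natural log of the target.
log_precio = np.log(precio)

# Pretend a model predicts on the log scale with small, made-up errors.
log_pred = log_precio + np.array([0.05, -0.02, 0.01])

# Undo the transformation before measuring the error in price units.
pred = np.exp(log_pred)

mae_log_units = np.mean(np.abs(log_precio - log_pred))
mae_price_units = np.mean(np.abs(precio - pred))

print(mae_log_units)    # error in log units, not comparable to prices
print(mae_price_units)  # error in the original price units
```

The two numbers differ by orders of magnitude, so an MAE obtained on the log target should not be compared directly with one obtained on raw prices.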

In [6]:
train_selected = feature_selection.get_selected_dataframe(train)
test_selected = feature_selection.get_selected_dataframe(test, precio=False)

In [7]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [8]:
reg = xgboost.XGBRegressor()

In [9]:
param_grid = {
    'max_depth':[13,14,15],
    'n_estimators':[120,130,140],
    'learning_rate': [0.05,0.1,0.3],
    'subsample':[0.5,0.8,0.9],
    'min_child_weight':[15,20]
}

In [10]:
randomsearch = RandomizedSearchCV(reg, param_grid, cv=4, scoring = 'neg_mean_absolute_error', n_iter=20)
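
To get a feel for how much of the search space this configuration explores, the sketch below counts the combinations in `param_grid` and the number of model fits implied by `n_iter=20` and `cv=4`. It only uses the values from the cells above. The variable names are illustrative.

```python
import math

param_grid = {
    'max_depth': [13, 14, 15],
    'n_estimators': [120, 130, 140],
    'learning_rate': [0.05, 0.1, 0.3],
    'subsample': [0.5, 0.8, 0.9],
    'min_child_weight': [15, 20],
}
n_iter = 20
cv = 4

# Total number of distinct parameter combinations in the grid.
n_combinations = math.prod(len(values) for values in param_grid.values())

# Each sampled combination is fitted once per cross-validation fold.
n_fits = n_iter * cv

# Fraction of the grid that the random search actually evaluates.
coverage = n_iter / n_combinations

print(n_combinations, n_fits, round(coverage, 3))  # 162 80 0.123
```

So the search evaluates roughly one in eight combinations. An exhaustive `GridSearchCV` over the same grid would need 162 × 4 = 648 fits.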

In [11]:
randomsearch.fit(X,Y)



RandomizedSearchCV(cv=4, error_score='raise-deprecating',
                   estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                          colsample_bylevel=1,
                                          colsample_bynode=1,
                                          colsample_bytree=1, gamma=0,
                                          importance_type='gain',
                                          learning_rate=0.1, max_delta_step=0,
                                          max_depth=3, min_child_weight=1,
                                          missing=None, n_estimators=100,
                                          n_jobs=1, nthread=None,
                                          objective='reg:linear',
                                          random_st...
                                          seed=None, silent=None, subsample=1,
                                          verbosity=1),
                   iid='warn', n_iter=20, n_jobs=None,

In [12]:
randomsearch.best_params_

{'subsample': 0.9,
 'n_estimators': 140,
 'min_child_weight': 15,
 'max_depth': 15,
 'learning_rate': 0.1}
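
Besides `best_params_`, a fitted search also exposes `best_score_`. With `scoring='neg_mean_absolute_error'` that score is the negated MAE, so the sign has to be flipped to read it as an error. The sketch below illustrates this on synthetic data with a `DecisionTreeRegressor` as a lightweight stand-in for `XGBRegressor`. The data, the stand-in estimator, its small grid, and the use of `random_state=seed` are assumptions made for the example and are not part of the run above.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

seed = 42
rng = np.random.RandomState(seed)

# Synthetic regression data, only so the example runs quickly.
X_demo = rng.rand(200, 5)
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.randn(200)

# DecisionTreeRegressor is a stand-in estimator, not the model tuned above.
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=seed),
    {'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5, 10]},
    n_iter=5,
    cv=4,
    scoring='neg_mean_absolute_error',
    random_state=seed,  # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MAE across folds; flip the sign to get MAE.
best_mae = -search.best_score_
print(search.best_params_, best_mae)
```

Passing `random_state` is what would make two runs sample the same combinations. The search above was run without it.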

### Results obtained

**Test details:**
- Method used: RandomizedSearchCV.
- n_iter: 20.
- Parameters to try:

```
param_grid = {
    'max_depth':[13,14,15],
    'n_estimators':[120,130,140],
    'learning_rate': [0.05,0.1,0.3],
    'subsample':[0.5,0.8,0.9],
    'min_child_weight':[15,20]
}
```

**Results (7h 48min)**:
```
{'subsample': 0.9,
 'n_estimators': 140,
 'min_child_weight': 15,
 'max_depth': 15,
 'learning_rate': 0.1}
```

<hr>

# Parameter Tuning for LightGBM
We will use a learning rate of 0.1 and 5000 estimators. Later this will be scaled to more estimators with better learning rates, but that should not change the result.

In [None]:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
train,test = pre_processing.load_featured_datasets()
train_selected = feature_selection.get_selected_dataframe(train)

In [None]:
X = train_selected.drop('precio', axis=1).values
Y = train_selected['precio'].values

In [None]:
param_grid = {
    'num_leaves': [55, 60, 65],
    'max_depth': [8,10,12],
    'min_gain_to_split':[0.1, 0.2], 
    'max_bin':[50, 100, 150],
    'min_data_in_leaf':[3000, 5000, 7000],
    'bagging_freq':[4,5,6],
    'bagging_fraction':[0.65, 0.7, 0.75],
    'feature_fraction':[0.7]
}

In [None]:
reg = lgb.LGBMRegressor(boosting_type='gbdt',  objective='regression', metric='mae', num_boost_round=5000,
                       verbose=0, learning_rate=0.1)

In [None]:
#gridsearch = GridSearchCV(reg, param_grid, cv=4, scoring = 'neg_mean_absolute_error')
gridsearch = RandomizedSearchCV(reg, param_grid, n_iter=120, n_jobs=3, cv=4, scoring = 'neg_mean_absolute_error')

In [None]:
%%time
gridsearch.fit(X,Y)

**Since this search was run in another notebook, the results obtained there are printed below:**

CPU times: user 11min 24s, sys: 1.72 s, total: 11min 25s
Wall time: 16h 57min 11s

RandomizedSearchCV(cv=4, error_score='raise-deprecating',
                   estimator=LGBMRegressor(boosting_type='gbdt',
                                           class_weight=None,
                                           colsample_bytree=1.0,
                                           importance_type='split',
                                           learning_rage=0.1, learning_rate=0.1,
                                           max_depth=-1, metric='mae',
                                           min_child_samples=20,
                                           min_child_weight=0.001,
                                           min_split_gain=0.0, n_estimators=100,
                                           n_jobs=-1, num_boost_round=5000,
                                           num_leaves=31...
                   param_distributions={'bagging_fraction': [0.65, 0.7, 0.75],
                                        'bagging_freq': [4, 5, 6],
                                        'feature_fraction': [0.7],
                                        'max_bin': [50, 100, 150],
                                        'max_depth': [8, 10, 12],
                                        'min_data_in_leaf': [3000, 5000, 7000],
                                        'min_gain_to_split': [0.1, 0.2],
                                        'num_leaves': [55, 60, 65]},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring='neg_mean_absolute_error',

In [None]:
gridsearch.best_params_

### Results obtained

**Test details:**
- Method used: RandomizedSearchCV.
- n_iter: 40.
- Parameters to try:

```
param_grid = {
    # Prevent overfitting:
    'num_leaves': [60, 80, 100, 120],
    'max_depth': [5,10,15],
    'min_gain_to_split':[0.1], 
    'max_bin':[100],
    'min_data_in_leaf':[5000],
    'bagging_freq':[3, 5],
    'bagging_fraction':[0.5, 0.605, 0.7],
    'feature_fraction':[0.7]
}
```
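
One thing worth checking in a grid like this is how `num_leaves` interacts with `max_depth`: a binary tree of depth `d` has at most `2**d` leaves, so a `num_leaves` value above that bound cannot be reached. The sketch below lists the combinations from the grid above where this happens. It is a sanity check written for illustration, not something the original search performed.

```python
from itertools import product

# Values taken from the grid above.
num_leaves_options = [60, 80, 100, 120]
max_depth_options = [5, 10, 15]

# A tree limited to depth d cannot have more than 2**d leaves, so any
# num_leaves above that bound is effectively capped by max_depth.
capped = [
    (max_depth, num_leaves)
    for max_depth, num_leaves in product(max_depth_options, num_leaves_options)
    if num_leaves > 2 ** max_depth
]

print(capped)  # [(5, 60), (5, 80), (5, 100), (5, 120)]
```

With `max_depth=5` the bound is 32 leaves, so the four `num_leaves` values should behave the same at that depth and those samples add little information. The best result below used `max_depth=10`, where the bound (1024) is not binding.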

**Results (4h 35min)**:
```
{'num_leaves': 60,
 'min_gain_to_split': 0.1,
 'min_data_in_leaf': 5000,
 'max_depth': 10,
 'max_bin': 100,
 'feature_fraction': 0.7,
 'bagging_freq': 5,
 'bagging_fraction': 0.7}
 
MAE = 502020k
```
 
<hr>

**Test details:**
- Method used: RandomizedSearchCV.
- n_iter: 120.
- Parameters to try:

```
param_grid = {
    'num_leaves': [55, 60, 65],
    'max_depth': [8,10,12],
    'min_gain_to_split':[0.1, 0.2], 
    'max_bin':[50, 100, 150],
    'min_data_in_leaf':[3000, 5000, 7000],
    'bagging_freq':[4,5,6],
    'bagging_fraction':[0.65, 0.7, 0.75],
    'feature_fraction':[0.7]
}
```

**Results (16h 57min)**:
```
{'num_leaves': 55,
 'min_gain_to_split': 0.2,
 'min_data_in_leaf': 3000,
 'max_depth': 12,
 'max_bin': 150,
 'feature_fraction': 0.7,
 'bagging_freq': 5,
 'bagging_fraction': 0.75}
 
MAE = 494138k
```
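
To put the two LightGBM runs side by side, the arithmetic below compares the reported MAE values and wall times. The numbers are taken as printed above (the trailing `k` is left uninterpreted). The variable names are illustrative.

```python
# Reported MAE of each run, as printed above.
mae_run_40 = 502020    # 40 iterations, 4h 35min
mae_run_120 = 494138   # 120 iterations, 16h 57min

# Absolute and relative improvement of the second run over the first.
abs_gain = mae_run_40 - mae_run_120
rel_gain = abs_gain / mae_run_40

# Wall time of each run in minutes, and how much longer the second one took.
minutes_run_40 = 4 * 60 + 35
minutes_run_120 = 16 * 60 + 57
time_ratio = minutes_run_120 / minutes_run_40

print(abs_gain, round(100 * rel_gain, 2), round(time_ratio, 1))  # 7882 1.57 3.7
```

The narrower grid with three times as many iterations lowered the reported MAE by about 1.57% at roughly 3.7 times the wall time.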

<hr>

# Returning the results

In [4]:
def get_best_params():
    '''Return the best parameters found by the random searches above.'''
    return {
        'xgboost':{
            'subsample': 0.9,
            'n_estimators': 140,
            'min_child_weight': 15,
            'max_depth': 15,
            'learning_rate': 0.1
        },
        'lightgbm':{'num_leaves': 55,
            'min_gain_to_split': 0.2,
            'min_data_in_leaf': 3000,
            'max_depth': 12,
            'max_bin': 150,
            'feature_fraction': 0.7,
            'bagging_freq': 5,
            'bagging_fraction': 0.75,
            # Parameters that were not optimized:
            'boosting_type':'gbdt',
            'objective':'regression',
            'metric':'mae',
            'num_boost_round':5000,
            'verbose':0,
            'learning_rate':0.1
         }
    }
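
Since the LightGBM entry mixes tuned parameters with fixed ones (`learning_rate`, `num_boost_round`), a caller that wants to scale to more estimators can override only the fixed ones. The sketch below shows one way to do that with a plain dict merge. The dictionary is a local copy of the `'lightgbm'` entry so the example is self-contained, and the override values are hypothetical placeholders, not settings used in this project.

```python
# Local copy of get_best_params()['lightgbm'], so the example runs on its own.
tuned = {
    'num_leaves': 55,
    'min_gain_to_split': 0.2,
    'min_data_in_leaf': 3000,
    'max_depth': 12,
    'max_bin': 150,
    'feature_fraction': 0.7,
    'bagging_freq': 5,
    'bagging_fraction': 0.75,
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'mae',
    'num_boost_round': 5000,
    'verbose': 0,
    'learning_rate': 0.1,
}

# Hypothetical final-training settings (placeholders, not from the notebook).
final_overrides = {'learning_rate': 0.01, 'num_boost_round': 50000}

# Later keys win, so the overrides replace the fixed values while the
# tuned parameters are kept unchanged. The original dict is not modified.
final_params = {**tuned, **final_overrides}

print(final_params['learning_rate'], final_params['num_boost_round'])  # 0.01 50000
```

Building a new dict instead of mutating the returned one keeps the hardcoded results intact for other callers.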