# Forecasting demand for future flights. Final comparison

In [39]:
# Load Libraries
%matplotlib inline
import numpy as np
from scipy import stats
import pandas as pd 
import multiprocessing
import random
import math



from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder
import timeit

## Dataset

Let us read the dataset we will use.

In [2]:
# Load the dataset 
datos = pd.read_csv("./datasets/datos_pred_demanda.csv", sep=';', decimal=',')
datos.shape

(943794, 24)

The dataset consists of:

- 943794 **rows** or instances
- 24 **columns** or variables.

Let us look at its contents.

In [3]:
datos.head()

Unnamed: 0,ant,ruta,aeropuerto_origen,aeropuerto_destino,fecha_salida,num_vuelo_operador,month,weekday,timezone,year,...,second,first_ratio,second_ratio,hit,first_weight,second_weight,global_first_ratio,global_second_ratio,first_p,second_p
0,11,ACEMAD,ACE,MAD,2017-02-02,3857,2,Thursday,Mediodia,2017,...,UX,3.14034,0.0,2,0.0,0.0,2.520623,0.0,0.623683,0.0
1,11,ACEMAD,ACE,MAD,2017-02-03,3857,2,Friday,Mediodia,2017,...,UX,3.61,0.0,2,0.0,0.0,2.210075,0.0,0.846071,0.0
2,11,ACEMAD,ACE,MAD,2017-02-04,3857,2,Saturday,Mediodia,2017,...,UX,0.0,3.252252,1,0.0,1.0,0.0,1.595346,0.0,0.981601
3,11,ACEMAD,ACE,MAD,2017-02-05,3857,2,Sunday,Mediodia,2017,...,UX,3.094286,3.252252,2,0.0,1.0,0.0,1.485467,0.658558,0.955942
4,11,ACEMAD,ACE,MAD,2017-02-06,3857,2,Monday,Mediodia,2017,...,UX,2.022857,0.0,2,0.0,0.0,2.648854,0.0,0.387667,0.0


- **ant**: lead time until the flight.

- **ruta**: outbound-and-return route, ordered alphabetically.

- **aeropuerto_origen**: airport where the plane takes off.

- **aeropuerto_destino**: airport where the plane lands.

- **fecha_salida**: departure date of the plane.

- **num_vuelo_operador**: flight number identifier.

- **month**: month of the flight.

- **weekday**: day of the week of the flight.

- **timezone**: time slot of the flight.

- **year**: year of the flight.

- **capacidad**: maximum number of passengers on each flight; it may vary from flight to flight.

- **demand**: tickets sold on each flight.

- **first**: code of the main competitor on the flight's route.

- **second**: code of the second competitor on the flight's route.

- **first_ratio**:

- **second_ratio**:

- **hit**:

- **first_weight**: importance weight of the first competitor on the flight's route relative to I2.

- **second_weight**: importance weight of the second competitor on the flight's route relative to I2.

- **global_first_ratio**:

- **global_second_ratio**:

- **first_p**:

- **second_p**:

Of these variables, we will use fecha_salida to make the train/validation/test split; demand is the target; num_vuelo_operador will be discarded because, together with fecha_salida, it only identifies the flight; and the remaining variables will be the input to our model.

We keep only the rows with lead time (ant) 11, which gives a dataset of manageable size for this class. The additional filter on a specific list of flights is left commented out.

In [4]:
# keep only lead time 11 so the dataset is manageable for this class; the flight-list filter below is disabled
# selected_num_vuelo = [3857, 3855, 3853, 3851, 3859, 3753, 3755]
# datos = datos.loc[datos ['num_vuelo_operador'].isin(selected_num_vuelo)]
datos = datos[datos.ant==11]


In [5]:
datos.shape

(34242, 24)

The dataset now consists of:

- 34242 **rows** or instances
- 24 **columns** or variables.

## Preprocessing

### Sorting the data

We sort the data by departure date so that the train, validation and test sets chosen later respect the time axis.

In [6]:
datos = datos.sort_values(by=['fecha_salida'], ascending=True)

We drop the variables that add no information to the model: num_vuelo_operador, which is only an identifier, and ant, which is constant after the filter above.

In [7]:
del datos["num_vuelo_operador"]
del datos["ant"]

### One-Hot Encoding

We will use the one-hot encoding technique.

<img src="../figures/oh.png" width="50%">
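As a minimal sketch of what the encoder does, the following plain-Python function turns one categorical column into one 0/1 column per distinct value. The helper name `one_hot` and the toy weekday values are illustrative assumptions; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 indicator column per distinct value."""
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Toy column, loosely modelled on the `weekday` variable (hypothetical values).
weekday = ["Thursday", "Friday", "Saturday", "Friday"]
categories, encoded = one_hot(weekday)

print(categories)  # ['Friday', 'Saturday', 'Thursday']
for row in encoded:
    print(row)
```

Every row contains exactly one 1, which marks the category that row belongs to.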

First, let us select the categorical variables.

In [8]:
categorical_vars = ['ruta', 'aeropuerto_origen', 'aeropuerto_destino', 'month', 'weekday', 'year', 
                    'nombre_blackout', 'first', 'second', 'timezone', 'hit']

We select the numerical variables.

In [9]:
numerical_vars = list(set(datos.columns) - set(categorical_vars))

In [10]:
print(categorical_vars)
print(numerical_vars)

['ruta', 'aeropuerto_origen', 'aeropuerto_destino', 'month', 'weekday', 'year', 'nombre_blackout', 'first', 'second', 'timezone', 'hit']
['second_weight', 'demand', 'capacidad', 'second_p', 'fecha_salida', 'second_ratio', 'first_weight', 'first_ratio', 'first_p', 'global_second_ratio', 'global_first_ratio']


We apply one-hot encoding to the categorical variables.

In [11]:
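# Fit the encoder once and expand every categorical variable into 0/1 columns.
# Note: in scikit-learn >= 1.2 use `sparse_output=False` and `get_feature_names_out()` instead.
ohe = OneHotEncoder(sparse = False)
X_ohe = pd.DataFrame(ohe.fit_transform(datos[categorical_vars]))
X_ohe.columns = ohe.get_feature_names()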

We append the numerical variables again.

In [12]:
datos = pd.concat((X_ohe, datos[numerical_vars].reset_index()), axis=1)
del datos['index']

### Standardization

Now we standardize the data, that is, rescale each variable to mean 0 and standard deviation 1.

<img src="../figures/tipify.png" width="50%">
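The transformation can be sketched in plain Python as z = (x - mean) / std. This is a minimal illustration, assuming the population standard deviation (dividing by n), which is what `sklearn.preprocessing.scale` uses by default; the helper name `standardize` and the seat counts are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = fmean(values)
    # Population standard deviation (divides by n, not n - 1).
    # A constant column would give std == 0 and needs special handling.
    std = pstdev(values)
    return [(v - mean) / std for v in values]


capacidad = [120.0, 150.0, 180.0, 150.0]  # hypothetical seat counts
z = standardize(capacidad)
print(z)
```

After the transformation the values have mean 0 and standard deviation 1 regardless of their original units, so variables measured on different scales become comparable.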

In [13]:
fecha_salida_values = datos['fecha_salida']
y = datos['demand']
del datos['fecha_salida']
del datos['demand']

In [14]:
datos_scale = pd.DataFrame(scale(datos))
datos_scale.columns = datos.columns

datos = datos_scale
datos['fecha_salida'] = fecha_salida_values
datos['demand'] = y


### Train/Validation/Test split

As an example, we will use the commonly recommended ratios:

• Train: 70%.

• Validation: 15%.

• Test: 15%.


In [15]:
perc_values = [0.7, 0.15, 0.15];

We select the feature matrix X and the target y.

In [16]:
y = datos['demand']
X = datos.loc[:, datos.columns != 'demand']

We create the train, validation and test sets with the chosen sizes while respecting the time axis.

In [17]:
# sizes of the train, validation and test sets
n_train = int(X.shape[0] * perc_values[0])
n_val = int(X.shape[0] * perc_values[1])
n_test = int(X.shape[0] * perc_values[2])

# train set: the earliest rows
X_train = X.iloc[:n_train]
y_train = y.iloc[:n_train]

# validation set: the rows that follow
X_val = X.iloc[(n_train):(n_train+n_val)]
y_val = y.iloc[(n_train):(n_train+n_val)]

# test set: all remaining rows
X_test = X.iloc[(n_train+n_val):]
y_test = y.iloc[(n_train+n_val):]
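The arithmetic behind these slices can be checked with a short sketch. Because `int()` truncates, the train and validation sizes are rounded down, and the test slice, which runs to the end of the data, absorbs the leftover row. The helper name `split_sizes` is an assumption made for illustration.

```python
def split_sizes(n_rows, fractions):
    """Sizes of consecutive train/validation/test blocks for n_rows rows."""
    n_train = int(n_rows * fractions[0])
    n_val = int(n_rows * fractions[1])
    # The test block takes whatever is left, so truncation never drops rows.
    n_test = n_rows - n_train - n_val
    return n_train, n_val, n_test


sizes = split_sizes(34242, [0.7, 0.15, 0.15])
print(sizes)  # (23969, 5136, 5137)
```

Since the rows were sorted by fecha_salida beforehand, these consecutive blocks correspond to consecutive periods of time.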

We print the size of the three subsets.

In [18]:
print('Train data size = ' + str(X_train.shape))
print('Train target size = ' + str(y_train.shape))
print('Validation data size = ' + str(X_val.shape))
print('Validation target size = ' + str(y_val.shape))
print('Test data size = ' + str(X_test.shape))
print('Test target size = ' + str(y_test.shape))

Train data size = (23969, 73)
Train target size = (23969,)
Validation data size = (5136, 73)
Validation target size = (5136,)
Test data size = (5137, 73)
Test target size = (5137,)


We print the departure-date range of each set to check that the split respects the time axis.

In [19]:
# date range of each set
print('Train min date data = ' + np.amin(X_train["fecha_salida"]))
print('Train max date data = ' + np.amax(X_train["fecha_salida"]))

print('Val min date data = ' + np.amin(X_val["fecha_salida"]))
print('Val max date data = ' + np.amax(X_val["fecha_salida"]))

print('Test min date data = ' + np.amin(X_test["fecha_salida"]))
print('Test max date data = ' + np.amax(X_test["fecha_salida"]))

Train min date data = 2017-02-02
Train max date data = 2019-05-02
Val min date data = 2019-05-03
Val max date data = 2019-09-19
Test min date data = 2019-09-19
Test max date data = 2020-02-18


Once these checks are done, we delete fecha_salida, since it is not a model variable.

In [20]:
del X_train['fecha_salida']
del X_val['fecha_salida']
del X_test['fecha_salida']

# Grid Search

We now run a grid search that compares all the regression models we have seen, namely:

![imagen.png](attachment:imagen.png)

We import all the models we are going to use; in this case, the regression libraries.

In [21]:
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor  
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

We import the metric, in this case MAE (mean absolute error).

In [22]:
from sklearn.metrics import mean_absolute_error as metric
#from sklearn.metrics import mean_squared_error as metric2
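For reference, MAE is the average absolute difference between the true and the predicted values. A minimal plain-Python sketch follows; the demand figures are hypothetical, and the notebook itself uses scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical demand values and predictions, in tickets sold.
y_true = [100, 120, 80]
y_pred = [110, 115, 95]
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 10.0
```

The result is expressed in the units of the target, so here an MAE of 10 means the predictions miss by 10 tickets on average.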

We define some general parameters.

In [23]:
random_state = 1;
nthread = multiprocessing.cpu_count() - 1;

We define the grid to search.

In [24]:
# Ridge (linear) regression
# Note: reg_values is not used below; the grid takes 'regularization' from alpha_values.
reg_values = [1e-05, 1e-04, 1e-03, 1e-02,1e-01,1e-00,1e01]

# SVM
C_values = [0.1, 1, 100];
gamma_svm_values = [0.01, 0.1, 1];
epsilon_values = [1, 0.1];

# Decision tree
max_depth_values = [None, 6, 20];
min_samples_split_values = [2, 5, 20];
min_samples_leaf_values = [1, 5, 20];
max_features_values = [None, 1, 2];

# Random Forest
ntree_values = [10, 100, 1000];

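# XGBoost
nrounds_values = [10, 100]
eta_values = [0.3, 0.99]
gamma_values = [0, 1]
# Note: this reassignment overwrites the decision-tree list above, so every model searches max_depth over [6, 20].
max_depth_values = [6, 20]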
min_child_weight_values = [1, 20]
subsample_values = [0.1, 1]
colsample_bytree_values = [0.1, 1]
num_parallel_tree_values = [6, 20]
lambda_values = [0, 1]
alpha_values = [0, 1]


In [25]:
params_values = [{'model': 'linear regression',
                  'regularization': alpha_values},
                 {'model': 'svm',
                  'C': C_values,
                 'gamma': gamma_svm_values,
                 'epsilon': epsilon_values},
                 {'model': 'decision tree',
                  'max_depth': max_depth_values,
                 'min_samples_split': min_samples_split_values,
                 'min_samples_leaf': min_samples_leaf_values,
                 'max_features': max_features_values},
                 {'model': 'random forest',
                  'n_trees': ntree_values,
                 'min_samples_leaf': min_samples_leaf_values,
                'min_samples_split': min_samples_split_values,
                 'max_features': max_features_values,
                 'max_depth': max_depth_values},
                 {'model': 'xgboost',
                  'nrounds': nrounds_values,
                  'eta': eta_values,
                 'gamma': gamma_values,
                 'max_depth': max_depth_values,
                 'min_child_weight': min_child_weight_values,
                 'subsample': subsample_values,
                 'colsample_bytree': colsample_bytree_values,
                 'num_parallel_tree': num_parallel_tree_values,
                 'lambda': lambda_values,
                 'alpha': alpha_values}]

In [26]:
params_values

[{'model': 'linear regression', 'regularization': [0, 1]},
 {'C': [0.1, 1, 100],
  'epsilon': [1, 0.1],
  'gamma': [0.01, 0.1, 1],
  'model': 'svm'},
 {'max_depth': [6, 20],
  'max_features': [None, 1, 2],
  'min_samples_leaf': [1, 5, 20],
  'min_samples_split': [2, 5, 20],
  'model': 'decision tree'},
 {'max_depth': [6, 20],
  'max_features': [None, 1, 2],
  'min_samples_leaf': [1, 5, 20],
  'min_samples_split': [2, 5, 20],
  'model': 'random forest',
  'n_trees': [10, 100, 1000]},
 {'alpha': [0, 1],
  'colsample_bytree': [0.1, 1],
  'eta': [0.3, 0.99],
  'gamma': [0, 1],
  'lambda': [0, 1],
  'max_depth': [6, 20],
  'min_child_weight': [1, 20],
  'model': 'xgboost',
  'nrounds': [10, 100],
  'num_parallel_tree': [6, 20],
  'subsample': [0.1, 1]}]

In [27]:
total_iteraciones = 0
for params in params_values:
    if params['model'] == 'linear regression':
        n = len(params['regularization'])
    elif params['model'] == 'svm':
        n = len(params['C'])*len(params['gamma'])*len(params['epsilon'])
    elif params['model'] == 'decision tree':
        n = len(params['max_depth'])*len(params['min_samples_split'])*len(params['min_samples_leaf'])*len(params['max_features'])
    elif params['model'] == 'random forest':
        n = len(params['n_trees'])*len(params['min_samples_leaf'])*len(params['max_features'])*len(params['max_depth'])*len(params['min_samples_split'])
    elif params['model'] == 'xgboost':
        n = len(params['nrounds'])*len(params['eta'])*len(params['gamma'])*len(params['max_depth'])*len(params['min_child_weight'])*len(params['subsample'])*len(params['colsample_bytree'])*len(params['num_parallel_tree'])*len(params['lambda'])*len(params['alpha'])
    total_iteraciones = total_iteraciones + n;
    print(str(n)+ ' iteraciones de ' + str(params['model']))
print(str(total_iteraciones)+ ' iteraciones en total')        

2 iteraciones de linear regression
18 iteraciones de svm
54 iteraciones de decision tree
162 iteraciones de random forest
1024 iteraciones de xgboost
1260 iteraciones en total
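The same counts can be obtained without a branch per model: the size of a grid is the product of the lengths of its value lists. The sketch below assumes a helper named `grid_size` (not part of the notebook) and copies two of the grids from the printed `params_values`.

```python
from math import prod


def grid_size(params):
    """Number of combinations in one grid: product of the list lengths."""
    return prod(len(values) for key, values in params.items() if key != "model")


# Two of the grids, copied from the printed `params_values`.
params_values = [
    {"model": "svm", "C": [0.1, 1, 100], "gamma": [0.01, 0.1, 1],
     "epsilon": [1, 0.1]},
    {"model": "decision tree", "max_depth": [6, 20], "max_features": [None, 1, 2],
     "min_samples_leaf": [1, 5, 20], "min_samples_split": [2, 5, 20]},
]

sizes = {params["model"]: grid_size(params) for params in params_values}
print(sizes)  # {'svm': 18, 'decision tree': 54}
```

The same rule explains why XGBoost dominates the total: ten hyperparameters with two values each give 2^10 = 1024 combinations.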


In [None]:
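# One row is added per evaluated configuration.
# Note: DataFrame.append (used below) was removed in pandas 2.0, so this cell needs pandas < 2.0.
grid_results = pd.DataFrame()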
num_iter = 0
for params in params_values:
    
    # Linear Regression
    if params['model'] == 'linear regression':
        for regularization in params['regularization']:
            # Update the iteration counter
            num_iter += 1

            # Log the start of this iteration
            print('Inicio de iteracion ' + str(num_iter) + 
                '. Regularizacion = ' + str(regularization) + 
                '\n')
                
           
            # Train the model
            model = Ridge(alpha = regularization, random_state = random_state)
            model.fit(X_train, np.array(y_train))

            # Generate predictions
            pred_train = model.predict(X_train)
            pred_val = model.predict(X_val)

            # Compute the evaluation metrics
            error_train = metric(y_train, pred_train)    
            error_val = metric(y_val, pred_val)                                           

            print('Fin de iteracion ' + str(num_iter) + 
                     '. Regularizacion = ' + str(regularization) + 
                      '. Error train = '  + str(error_train) + 
                      ' -  Error val = '  + str(error_val)  + 
                      '\n')
            grid_results = grid_results.append(pd.DataFrame(data={'model':'Linear Regression',
                                                                      'params': [{'regularization':[regularization]}],
                                                                                          'error_train':[error_train],
                                                                                          'error_val':[error_val]},
                                                                                           columns=['model', 'params','error_train', 'error_val']),
                                                   ignore_index=True)
     
    # SVM
    if params['model'] == 'svm':
        for C in params['C']:
            for gamma in params['gamma']:  
                for epsilon in params['epsilon']:  
                    # Update the iteration counter
                    num_iter += 1

                    # Log the start of this iteration
                    print('Inicio de iteracion ' + str(num_iter) + 
                          '. C = ' + str(C) + 
                          ', gamma = '  + str(gamma) +
                          ', epsilon = '  + str(epsilon) +
                          '\n')

                    # Train the model
                    model = SVR(C = C, gamma = gamma, epsilon=epsilon)
               
                    model.fit(X_train, np.array(y_train))

                    # Generate predictions
                    pred_train = model.predict(X_train)
                    pred_val = model.predict(X_val)

                    # Compute the evaluation metrics
                    error_train = metric(y_train, pred_train)    
                    error_val = metric(y_val, pred_val)                                          

                    print('Fin de iteracion ' + str(num_iter) + 
                         '. C = ' + str(C) + 
                          ', gamma = '  + str(gamma) +
                          ' -  error train = '  + str(error_train) + 
                          ' -  error val = '  + str(error_val)  + 
                          '\n')
                    grid_results = grid_results.append(pd.DataFrame(data={'model':'SVM',
                                                                         'params': [{'C':[C],
                                                                                  'gamma':[gamma],
                                                                                    'epsilon': [epsilon]}],
                                                                                          'error_train':[error_train],
                                                                                          'error_val':[error_val]},
                                                                                           columns=['model', 'params','error_train', 'error_val']),
                                                       ignore_index=True)
                
    # Decision Tree
    if params['model'] == 'decision tree':
        for max_depth in params['max_depth']:
            for min_samples_split in params['min_samples_split']:  
                for min_samples_leaf in params['min_samples_leaf']:  
                    for max_features in params['max_features']:  
                
                        # Update the iteration counter
                        num_iter += 1

                        # Log the start of this iteration
                        print('Inicio de iteracion ' + str(num_iter) + 
                              '. max_depth = ' + str(max_depth) + 
                              ', min_samples_split = '  + str(min_samples_split) +
                              ', min_samples_leaf = '  + str(min_samples_leaf) +
                              ', max_features = '  + str(max_features) +
                              '\n')

                        # Train the model
                        model = DecisionTreeRegressor(max_depth = max_depth,
                                                      min_samples_split = min_samples_split,
                                                      min_samples_leaf = min_samples_leaf,
                                                      max_features = max_features, random_state = random_state)

                        model.fit(X_train, np.array(y_train))

                        # Generate predictions
                        pred_train = model.predict(X_train)
                        pred_val = model.predict(X_val)

                        # Compute the evaluation metrics
                        error_train = metric(y_train, pred_train)    
                        error_val = metric(y_val, pred_val)  

                        print('Fin de iteracion ' + str(num_iter) + 
                             '. max_depth = ' + str(max_depth) + 
                              ', min_samples_split = '  + str(min_samples_split) +
                              ', min_samples_leaf = '  + str(min_samples_leaf) +
                              ', max_features = '  + str(max_features) +
                              ' -  error train = '  + str(error_train) + 
                              ' -  error val = '  + str(error_val)  + 
                              '\n')
                        grid_results = grid_results.append(pd.DataFrame(data={'model':'decision tree',
                                                                              'params': [{'max_depth':[max_depth],
                                                                                          'min_samples_split':[min_samples_split],
                                                                                          'min_samples_leaf':[min_samples_leaf],
                                                                                          'max_features':[max_features]}],
                                                                                          'error_train':[error_train],
                                                                                          'error_val':[error_val]},
                                                                                           columns=['model', 'params','error_train', 'error_val']),
                                                           ignore_index=True)  
                        
    
    # Random Forest
    if params['model'] == 'random forest':
        for n_trees in params['n_trees']:
            for max_depth in params['max_depth']:
                for min_samples_split in params['min_samples_split']:  
                    for min_samples_leaf in params['min_samples_leaf']:  
                        for max_features in params['max_features']:  
                
                            # Update the iteration counter
                            num_iter += 1

                            # Log the start of this iteration
                            print('Inicio de iteracion ' + str(num_iter) + 
                                  '. n_trees = ' + str(n_trees) + 
                                  ', max_depth = ' + str(max_depth) + 
                                  ', min_samples_split = '  + str(min_samples_split) +
                                  ', min_samples_leaf = '  + str(min_samples_leaf) +
                                  ', max_features = '  + str(max_features) +
                                  '\n')

                            # Train the model
                            model = RandomForestRegressor(n_estimators = n_trees,
                                                          max_depth = max_depth,
                                                          min_samples_split = min_samples_split,
                                                          min_samples_leaf = min_samples_leaf,
                                                          max_features = max_features, random_state = random_state)

                            model.fit(X_train, np.array(y_train))

                            # Generate predictions
                            pred_train = model.predict(X_train)
                            pred_val = model.predict(X_val)

                            # Compute the evaluation metrics
                            error_train = metric(y_train, pred_train)    
                            error_val = metric(y_val, pred_val)                                         

                            print('Fin de iteracion ' + str(num_iter) + 
                                 '. n_trees = ' + str(n_trees) + 
                                  ', max_depth = ' + str(max_depth) + 
                                  ', min_samples_split = '  + str(min_samples_split) +
                                  ', min_samples_leaf = '  + str(min_samples_leaf) +
                                  ', max_features = '  + str(max_features) +
                                  ' -  error train = '  + str(error_train) + 
                                  ' -  error val = '  + str(error_val)  + 
                                  '\n')
                            grid_results = grid_results.append(pd.DataFrame(data={'model':'random forest',
                                                                                  'params': [{'n_trees':[n_trees],
                                                                                              'max_depth':[max_depth],
                                                                                              'min_samples_split':[min_samples_split],
                                                                                              'min_samples_leaf':[min_samples_leaf],
                                                                                              'max_features':[max_features]}],
                                                                                          'error_train':[error_train],
                                                                                          'error_val':[error_val]},
                                                                                           columns=['model', 'params','error_train', 'error_val']),
                                                               ignore_index=True)  
    
    # XGBOOST
    if params['model'] == 'xgboost':
        for nrounds in params['nrounds']:
            for eta in params['eta']:
                for gamma in params['gamma']:
                    for max_depth in params['max_depth']:
                        for min_child_weight in params['min_child_weight']:
                            for subsample in params['subsample']:
                                for colsample_bytree in params['colsample_bytree']:
                                    for num_parallel_tree in params['num_parallel_tree']:
                                        for lamda in params['lambda']:
                                            for alpha in params['alpha']:

                                                # Update the iteration counter
                                                num_iter += 1

                                                # Log the start of this iteration
                                                print('Inicio de iteracion ' + str(num_iter) +
                                                      '. Parametro nrounds = ' + str(nrounds) +
                                                      ', parametro eta = ' + str(eta) + 
                                                      ', parametro gamma = '  + str(gamma) +
                                                      ', parametro max_depth = '  + str(max_depth) +
                                                      ', parametro min_child_weight = '  + str(min_child_weight) +
                                                      ', parametro subsample = '  + str(subsample) +
                                                      ', parametro colsample_bytree = '  + str(colsample_bytree) +
                                                      ', parametro num_parallel_tree = '  + str(num_parallel_tree) +
                                                      ', parametro lambda = '  + str(lamda) +
                                                      ', parametro alpha = '  + str(alpha) + 
                                                      '\n')
                                                # Train the model
                                                model = XGBRegressor(nthread = nthread,
                                                                      random_state = random_state,
                                                                      n_estimators = nrounds, 
                                                                      learning_rate = eta, 
                                                                      gamma = gamma,
                                                                      max_depth = max_depth,
                                                                      min_child_weight = min_child_weight ,
                                                                      subsample = subsample,
                                                                      colsample_bytree = colsample_bytree,
                                                                      num_parallel_tree  = num_parallel_tree,
                                                                      reg_lambda = lamda,
                                                                      reg_alpha = alpha)
                                                model.fit(X_train, np.array(y_train))

                                                # Generate predictions
                                                pred_train = model.predict(X_train)
                                                pred_val = model.predict(X_val)

                                                # Compute the evaluation metrics
                                                error_train = metric(y_train, pred_train)    
                                                error_val = metric(y_val, pred_val)                                            

                                                print('Fin de iteracion ' + str(num_iter) + 
                                                      '. Parametro nrounds = ' + str(nrounds) + 
                                                      ', parametro eta = ' + str(eta) + 
                                                      ', parametro gamma = '  + str(gamma) +
                                                      ', parametro max_depth = '  + str(max_depth) +
                                                      ', parametro min_child_weight = '  + str(min_child_weight) +
                                                      ', parametro subsample = '  + str(subsample) +
                                                      ', parametro colsample_bytree = '  + str(colsample_bytree) +
                                                      ', parametro num_parallel_tree = '  + str(num_parallel_tree) +
                                                      ', parametro lambda = '  + str(lamda) +
                                                      ', parametro alpha = '  + str(alpha) + 
                                                      ' -  error train = '  + str(error_train) + 
                                                      ' -  error val = '  + str(error_val)  + 
                                                      '\n')
                                                grid_results = grid_results.append(pd.DataFrame(data={'model':'xgboost',
                                                                                              'params': [{'nrounds':[nrounds],
                                                                                              'eta':[eta],
                                                                                              'gamma':[gamma],
                                                                                              'max_depth':[max_depth],
                                                                                              'min_child_weight':[min_child_weight],
                                                                                              'subsample':[subsample],
                                                                                              'colsample_bytree':[colsample_bytree],
                                                                                              'num_parallel_tree':[num_parallel_tree],
                                                                                              'lamda':[lamda],
                                                                                              'alpha':[alpha]}],
                                                                                              'error_train':[error_train],
                                                                                              'error_val':[error_val]},
                                                                                               columns=['model', 'params','error_train', 'error_val']), 
                                                                                   ignore_index=True)
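The pattern shared by all five blocks (enumerate the combinations, fit, score on train and validation, store one row) can be written once with `itertools.product`. The sketch below is not the notebook's code: it uses a toy one-feature ridge model in closed form and hypothetical data so that it runs on its own, and it collects the rows in a list rather than calling `DataFrame.append`, which was removed in pandas 2.0.

```python
from itertools import product


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for a one-feature model without intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data standing in for the train and validation sets.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
x_val, y_val = [5.0, 6.0], [10.1, 11.8]

grid = {"alpha": [0.0, 1.0, 10.0]}  # hypothetical grid

rows = []
names = list(grid)
# product() yields every combination of the value lists, in order.
for combo in product(*(grid[name] for name in names)):
    params = dict(zip(names, combo))
    slope = fit_ridge_1d(x_train, y_train, **params)
    rows.append({
        "params": params,
        "error_train": mae(y_train, [slope * v for v in x_train]),
        "error_val": mae(y_val, [slope * v for v in x_val]),
    })

# Model selection uses the validation error only.
best = min(rows, key=lambda row: row["error_val"])
print(best["params"])
```

With real models, the list of rows can be turned into a table in one step with `pd.DataFrame(rows)`, which also avoids copying the frame on every iteration.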

In [29]:
grid_results

Unnamed: 0,model,params,error_train,error_val
0,Linear Regression,{'regularization': [0]},21.524317,19.345895
1,Linear Regression,{'regularization': [1]},21.412824,18.952354
2,SVM,"{'C': [0.1], 'gamma': [0.01], 'epsilon': [1]}",22.652512,20.413609
3,SVM,"{'C': [0.1], 'gamma': [0.01], 'epsilon': [0.1]}",22.649043,20.427162
4,SVM,"{'C': [0.1], 'gamma': [0.1], 'epsilon': [1]}",25.198758,22.029860
5,SVM,"{'C': [0.1], 'gamma': [0.1], 'epsilon': [0.1]}",25.195731,22.038546
6,SVM,"{'C': [0.1], 'gamma': [1], 'epsilon': [1]}",25.404767,22.055721
7,SVM,"{'C': [0.1], 'gamma': [1], 'epsilon': [0.1]}",25.402461,22.090669
8,SVM,"{'C': [1], 'gamma': [0.01], 'epsilon': [1]}",19.103701,17.697900
9,SVM,"{'C': [1], 'gamma': [0.01], 'epsilon': [0.1]}",19.087004,17.718467


Let's look at the best result for each model family.

In [30]:
grid_results.groupby(['model'], sort=False)['error_val'].min().sort_values()

model
random forest        15.743116
SVM                  17.011897
decision tree        17.380361
xgboost              18.251872
Linear Regression    18.952354
Name: error_val, dtype: float64
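
The summary above only keeps the minimum validation error per family. To also see which hyperparameters produced it, one option is `idxmin` per group followed by `.loc`. The sketch below uses a small hypothetical stand-in (`toy_results`) built from a few rows of the table shown earlier, not the full `grid_results`.

```python
import pandas as pd

# Toy stand-in for grid_results: four rows copied from the table above
toy_results = pd.DataFrame({
    'model': ['Linear Regression', 'Linear Regression', 'SVM', 'SVM'],
    'params': [{'regularization': [0]},
               {'regularization': [1]},
               {'C': [0.1], 'gamma': [0.01], 'epsilon': [1]},
               {'C': [1], 'gamma': [0.01], 'epsilon': [1]}],
    'error_train': [21.524317, 21.412824, 22.652512, 19.103701],
    'error_val': [19.345895, 18.952354, 20.413609, 17.697900],
})

# Index label of the lowest validation error within each model family
best_idx = toy_results.groupby('model', sort=False)['error_val'].idxmin()

# Full rows (including params) for those labels, best family first
best_per_family = toy_results.loc[best_idx].sort_values('error_val')
print(best_per_family)
```

Applied to the real `grid_results`, the same two lines would return one row per family with its `params` dictionary.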

Let's compare the best and the worst result.

In [31]:
# idxmin/idxmax return index labels, so select with .loc
print(grid_results.loc[grid_results['error_val'].idxmin()])
print('-------------------------------------------')
print(grid_results.loc[grid_results['error_val'].idxmax()])

model                                              random forest
params         {'n_trees': [1000], 'max_depth': [20], 'min_sa...
error_train                                              11.5019
error_val                                                15.7431
Name: 212, dtype: object
-------------------------------------------
model                                              decision tree
params         {'max_depth': [20], 'min_samples_split': [2], ...
error_train                                              8.74496
error_val                                                25.0196
Name: 49, dtype: object


Let's look at the largest gap between error_train and error_val.

In [32]:
diff = grid_results['error_val'] - grid_results['error_train']
grid_results.loc[diff.idxmax()]

model                                                   SVM
params         {'C': [100], 'gamma': [1], 'epsilon': [0.1]}
error_train                                         1.96877
error_val                                           23.3996
Name: 19, dtype: object

**Overfitting:** this SVM fits the training set almost perfectly (error 1.97), but its validation error is 23.40.
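
To see how the train/validation gap compares across configurations, a minimal sketch is to add the gap as a column and sort by it. The three rows below are copied from the outputs shown above. The names `toy_results` and `gap` are only for illustration.

```python
import pandas as pd

# Three rows copied from the outputs above (best, worst, and largest-gap runs)
toy_results = pd.DataFrame({
    'model': ['random forest', 'decision tree', 'SVM'],
    'error_train': [11.5019, 8.74496, 1.96877],
    'error_val': [15.7431, 25.0196, 23.3996],
})

# Gap between validation and training error: larger means more overfitting
toy_results['gap'] = toy_results['error_val'] - toy_results['error_train']

# Rank configurations from most to least overfitted
ranked = toy_results.sort_values('gap', ascending=False)
print(ranked)
```

Sorting the full `grid_results` this way would show whether large gaps cluster around particular hyperparameter values, such as a high `C` together with a high `gamma` for the SVM.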

We keep the best result.

In [43]:
best_params = grid_results.loc[grid_results['error_val'].idxmin()]

We merge the training and validation sets to train the final model.

In [34]:
print('Train data size = ' + str(X_train.shape))
print('Train target size = ' + str(y_train.shape))
print('Validation data size = ' + str(X_val.shape))
print('Validation target size = ' + str(y_val.shape))

# Combine train and validation
X_train = pd.concat((X_train,X_val), axis = 0)
y_train = np.concatenate((y_train, y_val), axis = 0)

del X_val, y_val

print('Train data size = ' + str(X_train.shape))
print('Train target size = ' + str(y_train.shape))

Train data size = (23969, 72)
Train target size = (23969,)
Validation data size = (5136, 72)
Validation target size = (5136,)
Train data size = (29105, 72)
Train target size = (29105,)


Now we train the final model.


In [45]:
# Linear Regression (Ridge); the label must match the one stored in grid_results
if best_params['model'] == 'Linear Regression':

    # Build the model
    model = Ridge(alpha = best_params['params']['regularization'][0], random_state = random_state)

# SVM
elif best_params['model'] == 'SVM':

    model = SVR(C = best_params['params']['C'][0], 
                gamma = best_params['params']['gamma'][0],
                epsilon = best_params['params']['epsilon'][0])             


# Decision Tree
elif best_params['model'] == 'decision tree':
    model = DecisionTreeRegressor(max_depth = int(best_params['params']['max_depth'][0]),
                                  min_samples_split = int(best_params['params']['min_samples_split'][0]),
                                  min_samples_leaf = int(best_params['params']['min_samples_leaf'][0]),
                                  max_features = int(best_params['params']['max_features'][0]),
                                  random_state = random_state)


# Random Forest
elif best_params['model'] == 'random forest':
    # max_features may be None (use all features); cast to int otherwise
    max_features = best_params['params']['max_features'][0]
    if max_features is not None:
        max_features = int(max_features)
    model = RandomForestRegressor(n_estimators = int(best_params['params']['n_trees'][0]),
                                  max_depth = int(best_params['params']['max_depth'][0]),
                                  min_samples_split = int(best_params['params']['min_samples_split'][0]),
                                  min_samples_leaf = int(best_params['params']['min_samples_leaf'][0]),
                                  max_features = max_features,
                                  random_state = random_state)

# XGBoost
elif best_params['model'] == 'xgboost':
    model = XGBRegressor(nthread = nthread,
                         random_state = random_state,
                         n_estimators = int(best_params['params']['nrounds'][0]),
                         learning_rate = best_params['params']['eta'][0],
                         gamma = best_params['params']['gamma'][0],
                         max_depth = int(best_params['params']['max_depth'][0]),
                         min_child_weight = best_params['params']['min_child_weight'][0],
                         subsample = best_params['params']['subsample'][0],
                         colsample_bytree = best_params['params']['colsample_bytree'][0],
                         num_parallel_tree = int(best_params['params']['num_parallel_tree'][0]),
                         reg_lambda = best_params['params']['lamda'][0],
                         reg_alpha = best_params['params']['alpha'][0])

# Train the model
model.fit(X_train, np.array(y_train))

# Generate predictions
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

# Compute evaluation metrics
error_train = metric(y_train, pred_train)
error_test = metric(y_test, pred_test)

# Collect the final results in a one-row DataFrame
# (DataFrame.append was removed in pandas 2.0, so build the frame directly)
results = pd.DataFrame(data={'model': [best_params['model']],
                             'error_train': [error_train],
                             'error_test': [error_test]},
                       columns=['model', 'error_train', 'error_test'])


In [46]:
results

Unnamed: 0,model,error_train,error_test
0,random forest,11.132167,17.800839
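
One way to read this final table is to set the test error against the validation error that drove the model selection. The sketch below only does the arithmetic on numbers copied from the outputs in this notebook. The variable names are illustrative.

```python
# Values copied from the outputs shown in this notebook
error_val = 15.743116    # best validation error during the grid search (random forest)
error_test = 17.800839   # test error of the final model
error_train = 11.132167  # training error of the final model (train + validation)

# Absolute and relative increase from validation to test
val_test_gap = error_test - error_val
relative_increase = val_test_gap / error_val

print(f"test - validation: {val_test_gap:.3f}")
print(f"relative increase: {relative_increase:.1%}")
print(f"test - train:      {error_test - error_train:.3f}")
```

A validation error lower than the test error is a common outcome when the same validation set is used to pick the best of many configurations, so the test figure is the more conservative estimate.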
