# MODELING AND EXECUTION FOR REGRESSION

**IMPORTANT**: Remember to make a copy of this template so that you do not overwrite the original.

**IMPORTANT**: This template is designed for a full-scale run using The Ultimate Algo Machine framework. If you run into memory or performance problems, remember to shrink the problem by:

* Sampling
* Balancing by undersampling
* Reducing the number of algorithms to test
* Reducing the number of parameters to test
* Using random search and specifying a suitable `n_iter`
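The first option, sampling, can be sketched as follows. This is a minimal illustration on toy data, assuming `x` is a DataFrame and `y` a Series that share the same index; the `_demo` and `_muestra` names, the 10% fraction and the seed are all made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's x and y (assumption: they share an index)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])
y_demo = pd.Series(rng.normal(size=1000), index=x_demo.index)

# Draw a 10% random sample of the rows, then select the matching targets by index
x_muestra = x_demo.sample(frac=0.1, random_state=42)
y_muestra = y_demo.loc[x_muestra.index]

print(x_muestra.shape, y_muestra.shape)
```

Selecting `y` through the sampled index keeps predictors and target aligned row by row.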

## IMPORT PACKAGES

In [2]:
import os
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Data quality
from janitor import clean_names

# Variable transformation
from sklearn.preprocessing import OneHotEncoder

# Modeling
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.metrics import mean_absolute_percentage_error


# Fast autocompletion
%config IPCompleter.greedy=True

# Disable scientific notation
pd.options.display.float_format = '{:.2f}'.format

# Silence warnings
import warnings
warnings.filterwarnings("ignore")

## IMPORT THE DATA

Load `data.json`.

In [3]:
# Get the current working directory (taken here as the root directory)
directorio_raiz = os.getcwd()

# Name of the JSON file
nombre_archivo = "data.json"

# Full path to the JSON file
ruta_archivo = os.path.join(directorio_raiz, nombre_archivo)

# Open the file in read mode
with open(ruta_archivo, "r") as archivo:
    # Load the contents of the JSON file
    data = json.load(archivo)

Set the project path (read here from the `ruta_proyecto` key of the JSON file).

In [4]:
ruta_proyecto = data['ruta_proyecto']
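If the JSON file lacks the `ruta_proyecto` key, the cell above stops with a bare `KeyError`. The sketch below wraps the lookup in a small helper that reports which file was read. The helper name `leer_ruta_proyecto` and the example path are hypothetical, and the only key assumed to exist in the file is `ruta_proyecto`.

```python
import json
import os
import tempfile


def leer_ruta_proyecto(ruta_archivo):
    """Return the 'ruta_proyecto' entry of a JSON file, failing with a clear message if absent."""
    with open(ruta_archivo, "r", encoding="utf-8") as archivo:
        data = json.load(archivo)
    if "ruta_proyecto" not in data:
        raise KeyError(f"'ruta_proyecto' not found in {ruta_archivo}")
    return data["ruta_proyecto"]


# Demo with a throwaway file; the path value is invented for illustration
with tempfile.TemporaryDirectory() as tmp:
    ruta_demo = os.path.join(tmp, "data.json")
    with open(ruta_demo, "w", encoding="utf-8") as archivo:
        json.dump({"ruta_proyecto": "/ruta/de/ejemplo"}, archivo)
    ruta_proyecto_demo = leer_ruta_proyecto(ruta_demo)

print(ruta_proyecto_demo)
```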

Names of the data files.

In [5]:
nombre_x = 'x_preseleccionado.pickle'
nombre_y = 'y_preseleccionado.pickle'

Load the data.

In [6]:
x = pd.read_pickle(ruta_proyecto + '/02_Datos/03_Trabajo/' + nombre_x)
y = pd.read_pickle(ruta_proyecto + '/02_Datos/03_Trabajo/' + nombre_y)
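As an alternative to concatenating strings, the same paths can be assembled with `pathlib`, which avoids doubled or missing separators. This is only a sketch: the project root below is a made-up value, and `PurePosixPath` is used so the printed result is the same on every operating system (in the notebook itself `pathlib.Path` would be the natural choice, and `pd.read_pickle` accepts path objects).

```python
from pathlib import PurePosixPath

# Hypothetical project root; in the notebook this would come from ruta_proyecto
ruta_proyecto_demo = PurePosixPath('/ruta/de/ejemplo')

# Folder layout taken from the cell above
carpeta_trabajo = ruta_proyecto_demo / '02_Datos' / '03_Trabajo'
ruta_x = carpeta_trabajo / 'x_preseleccionado.pickle'
ruta_y = carpeta_trabajo / 'y_preseleccionado.pickle'

print(ruta_x)
print(ruta_y)
```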

## MODEL

### Hold out the validation dataset

In [7]:
train_x,val_x,train_y,val_y = train_test_split(x,y,test_size=0.3)
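The split above draws a different partition on every run, so validation scores can shift slightly between executions. A minimal sketch, assuming reproducibility is wanted, fixes `random_state`; the toy data, the `_demo` names and the seed value 42 are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for x and y
x_demo = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) * 2})
y_demo = pd.Series(np.arange(10) * 3)

# Two calls with the same seed return the same partition
split_1 = train_test_split(x_demo, y_demo, test_size=0.3, random_state=42)
split_2 = train_test_split(x_demo, y_demo, test_size=0.3, random_state=42)

train_x_demo, val_x_demo, train_y_demo, val_y_demo = split_1
print(len(train_x_demo), len(val_x_demo))
print(val_x_demo.index.equals(split_2[1].index))
```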

### Create the pipeline and the dictionary of algorithms, parameters, and values to test

Edit the grid to keep only the algorithms you want to test.

Adjust the parameters and their candidate values.

In [8]:
pipe = Pipeline([('algoritmo',XGBRegressor())])

grid = [{'algoritmo': [LinearRegression()],
         'algoritmo__n_jobs': [-1]},
            
        {'algoritmo': [XGBRegressor()],
         'algoritmo__n_jobs': [-1],
         'algoritmo__learning_rate': [0.01,0.025,0.05,0.1],
         'algoritmo__max_depth': [5,10,20],
         'algoritmo__reg_alpha': [0,0.1,0.5,1],
         'algoritmo__reg_lambda': [0.01,0.1,1],
         'algoritmo__n_estimators': [100,500,1000]},
       ]
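Before launching a search it can help to count how many candidates a grid contains. The sketch below does this with scikit-learn's `ParameterGrid` on a copy of the grid above, in which placeholder strings stand in for the estimator instances so the snippet runs without xgboost (`grid_demo` and `n_combinaciones` are names introduced for the example).

```python
from sklearn.model_selection import ParameterGrid

# Same structure as the notebook's grid; strings replace the estimator objects
grid_demo = [{'algoritmo': ['LinearRegression'],
              'algoritmo__n_jobs': [-1]},

             {'algoritmo': ['XGBRegressor'],
              'algoritmo__n_jobs': [-1],
              'algoritmo__learning_rate': [0.01, 0.025, 0.05, 0.1],
              'algoritmo__max_depth': [5, 10, 20],
              'algoritmo__reg_alpha': [0, 0.1, 0.5, 1],
              'algoritmo__reg_lambda': [0.01, 0.1, 1],
              'algoritmo__n_estimators': [100, 500, 1000]},
             ]

# 1 combination for the linear model + 4*3*4*3*3 = 432 for XGBoost
n_combinaciones = len(ParameterGrid(grid_demo))
print(n_combinaciones)  # 433
```

With 3-fold cross-validation an exhaustive grid search would fit 433 × 3 = 1299 models, whereas a random search with `n_iter = 25` fits 75.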

### Optimize the hyperparameters

Choose whether to use grid search or random search.

Comment out the option you are not going to use.

#### With grid search

In [9]:
# Disabled: uncomment this cell to use grid search instead of random search.
# grid_search = GridSearchCV(estimator = pipe,
#                            param_grid = grid,
#                            cv = 3,
#                            scoring = 'neg_mean_absolute_percentage_error',
#                            verbose = 0,
#                            n_jobs = -1)
#
# modelo = grid_search.fit(train_x,train_y)
#
# pd.DataFrame(grid_search.cv_results_).sort_values(by = 'rank_test_score')

#### With random search

In [10]:
random_search = RandomizedSearchCV(estimator = pipe,
                                   param_distributions = grid, 
                                   n_iter = 25, 
                                   cv = 3, 
                                   scoring = 'neg_mean_absolute_percentage_error', 
                                   verbose = 0,
                                   n_jobs = -1)

modelo = random_search.fit(train_x,train_y)

pd.DataFrame(random_search.cv_results_).sort_values(by = 'rank_test_score').head()

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_algoritmo__reg_lambda` | `param_algoritmo__reg_alpha` | `param_algoritmo__n_jobs` | `param_algoritmo__n_estimators` | `param_algoritmo__max_depth` | `param_algoritmo__learning_rate` | `param_algoritmo` | `params` | `split0_test_score` | `split1_test_score` | `split2_test_score` | `mean_test_score` | `std_test_score` | `rank_test_score` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.5 | 0.03 | 0.01 | 0.0 | 0.1 | 1.0 | -1 | 100 | 10 | 0.03 | `XGBRegressor(base_score=None, booster=None, ca...` | `{'algoritmo__reg_lambda': 0.1, 'algoritmo__reg...` | -0.13 | -0.14 | -0.13 | -0.13 | 0.0 | 1 |
| 13 | 1.09 | 0.01 | 0.02 | 0.01 | 0.1 | 0.1 | -1 | 500 | 5 | 0.01 | `XGBRegressor(base_score=None, booster=None, ca...` | `{'algoritmo__reg_lambda': 0.1, 'algoritmo__reg...` | -0.14 | -0.14 | -0.14 | -0.14 | 0.0 | 2 |
| 2 | 0.27 | 0.01 | 0.02 | 0.01 | 0.01 | 0.1 | -1 | 100 | 5 | 0.05 | `XGBRegressor(base_score=None, booster=None, ca...` | `{'algoritmo__reg_lambda': 0.01, 'algoritmo__re...` | -0.14 | -0.14 | -0.14 | -0.14 | 0.0 | 3 |
| 22 | 0.56 | 0.02 | 0.01 | 0.0 | 1.0 | 0.0 | -1 | 100 | 20 | 0.03 | `XGBRegressor(base_score=None, booster=None, ca...` | `{'algoritmo__reg_lambda': 1, 'algoritmo__reg_a...` | -0.14 | -0.14 | -0.14 | -0.14 | 0.0 | 4 |
| 24 | 0.42 | 0.01 | 0.01 | 0.01 | 1.0 | 0.1 | -1 | 100 | 10 | 0.05 | `XGBRegressor(base_score=None, booster=None, ca...` | `{'algoritmo__reg_lambda': 1, 'algoritmo__reg_a...` | -0.14 | -0.14 | -0.14 | -0.14 | 0.0 | 5 |
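The test scores are negative because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_absolute_percentage_error` reports the MAPE with its sign flipped. The sketch below shows how to recover the positive MAPE from `cv_results_`. It runs on synthetic data with a `Ridge` model swapped in for the notebook's algorithms so that it stays fast and needs no xgboost; the data, the `_demo` names, the `mape` column and the alpha values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with a target well away from zero
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(120, 3)), columns=['a', 'b', 'c'])
y_demo = 50 + 5 * x_demo['a'] + rng.normal(size=120)

pipe_demo = Pipeline([('algoritmo', Ridge())])
search_demo = RandomizedSearchCV(estimator=pipe_demo,
                                 param_distributions={'algoritmo__alpha': [0.1, 1.0, 10.0]},
                                 n_iter=3,
                                 cv=3,
                                 scoring='neg_mean_absolute_percentage_error',
                                 random_state=0)
search_demo.fit(x_demo, y_demo)

# Flip the sign to read the score as a plain MAPE (lower is better)
resultados = pd.DataFrame(search_demo.cv_results_)
resultados['mape'] = -resultados['mean_test_score']
print(resultados.sort_values(by='rank_test_score')[['param_algoritmo__alpha', 'mape']])
```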


## EVALUATE

### Predict on the validation split of the working dataset

In [11]:
pred = modelo.best_estimator_.predict(val_x)

### Evaluate on validation

In [12]:
mean_absolute_percentage_error(val_y, pred)


0.14004406703787614
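scikit-learn returns the MAPE as a fraction rather than a percentage, so a value of 0.14 corresponds to an average relative error of about 14%. As a sanity check on what the metric computes, the sketch below reproduces it by hand on three made-up values and compares the result with `mean_absolute_percentage_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up values: relative errors are 10%, 10% and 0%
y_real = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# MAPE = mean of |real - predicted| / |real|
mape_manual = np.mean(np.abs(y_real - y_pred) / np.abs(y_real))
mape_sklearn = mean_absolute_percentage_error(y_real, y_pred)

print(mape_manual, mape_sklearn)
```

Because each error is divided by the true value, the metric becomes unstable when the target contains values at or near zero.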

### Examine the best model

In [13]:
modelo.best_estimator_
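Beyond printing the whole pipeline, the fitted search object also exposes `best_params_`, and the winning algorithm can be pulled out of the pipeline through `named_steps`. The sketch below illustrates both on synthetic data, with `Ridge` standing in for XGBoost so it runs quickly; the data and the `_demo` and `ganador` names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with a target well away from zero
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(90, 2))
y_demo = 20 + 3 * x_demo[:, 0] + rng.normal(scale=0.1, size=90)

# Same pattern as the notebook: the 'algoritmo' step is itself a searched parameter
pipe_demo = Pipeline([('algoritmo', LinearRegression())])
grid_demo = [{'algoritmo': [LinearRegression()]},
             {'algoritmo': [Ridge()],
              'algoritmo__alpha': [0.1, 1.0]}]

modelo_demo = GridSearchCV(estimator=pipe_demo,
                           param_grid=grid_demo,
                           cv=3,
                           scoring='neg_mean_absolute_percentage_error').fit(x_demo, y_demo)

# The selected estimator and the parameter combination that produced it
ganador = modelo_demo.best_estimator_.named_steps['algoritmo']
print(type(ganador).__name__)
print(modelo_demo.best_params_)
```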