<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/marco-canas/Machine-Learning/blob/main/ML/classes/class_march_3/class_march_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

# Regression model to predict the tip amount paid

## [Supporting video]()

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt



In [None]:
df = pd.read_csv('tips.csv')
df.head() 

In [None]:
df.keys()

In [None]:
p = df[['total_bill', 'sex', 'day', 'time', 'tip']]

In [None]:
p.head() 

# Split into predictor attributes and labels

In [None]:
p_atributos = p.drop('tip', axis = 1)
p_labels = p.tip

# Methodology for building a regression model in supervised learning

## 1. Frame the question well.

* Regression or classification?

This is a regression task because the goal is to predict a numeric value (the tip amount), not a class.

## 2. Initial exploration.

* Make the target function explicit.
* State which attributes are used (with a brief description of each).

* Explore the data in tabular and graphical form.

In [None]:
p.info() 

In [None]:
p.sex.value_counts() 

In [None]:
p.day.value_counts() # We will apply OneHotEncoder encoding to the day attribute

In [None]:
p.time.value_counts() 

In [None]:
p.hist() 

## 3. Prepare the data for the learning algorithms.
* Make an initial split of the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
p_train_atributos, p_test_atributos, p_train_labels, p_test_labels = train_test_split(
    p_atributos,
    p_labels,
    test_size=0.2,
    random_state=42)
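
To make the idea of the split concrete, here is a minimal sketch of a shuffled 80/20 split written with NumPy only. The helper `split_indices` and the row count of 10 are hypothetical, chosen for illustration; the sketch uses a different random generator than `train_test_split`, so the selected rows will not match scikit-learn's.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx): disjoint row positions after a seeded shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # Round the test size up, so a fractional row count favors the test set.
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 10 rows: 8 go to training, 2 to testing.
train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state`: the same rows are held out on every run, so scores stay comparable between runs.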

* Explore linear correlations with the target variable.

In [None]:
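# numeric_only=True skips the text columns (sex, day, time); pandas 2.0 and later raise an error without it
p.corr(numeric_only=True).tip.sort_values(ascending=False)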

In [None]:
from pandas.plotting import scatter_matrix 
scatter_matrix(p[['total_bill', 'tip']])
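
The linear correlation reported by `corr()` is, by default, the Pearson coefficient. The sketch below computes it by hand so the quantity is less of a black box. The function `pearson_r` and the sample values are hypothetical and are not taken from `tips.csv`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical bills and tips where every tip is exactly 15% of the bill.
bills = [10.0, 20.0, 30.0, 40.0]
tips = [1.5, 3.0, 4.5, 6.0]
r = pearson_r(bills, tips)
print(r)
```

A perfectly proportional relationship gives a coefficient of 1 (up to floating-point error); real data scatter around the line and give a smaller value.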

# Encoding categorical variables

In [None]:
from sklearn.compose import ColumnTransformer 

In [None]:
lista_atributos_binarios = ['sex', 'time'] 
lista_atributos_multi_clase = ['day'] 

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

In [None]:
procesador = ColumnTransformer([
    ('num', StandardScaler(), ['total_bill']),
    ('bi',OrdinalEncoder(),lista_atributos_binarios),
    ('multi',OneHotEncoder(), lista_atributos_multi_clase)
])

In [None]:
X_train_preparados = procesador.fit_transform(p_train_atributos)
X_test_preparados = procesador.transform(p_test_atributos)

In [None]:
X_train_preparados.shape
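
As a rough illustration of what the two encoders produce, the following NumPy-only sketch encodes a small hypothetical sample of days both ways. It assumes categories are ordered alphabetically, which is the default behavior of `OrdinalEncoder` and `OneHotEncoder`; the array `days` is invented for the example.

```python
import numpy as np

# Hypothetical sample of the 'day' attribute.
days = np.array(["Sun", "Sat", "Thur", "Sun", "Fri"])

# Sorted unique categories: Fri, Sat, Sun, Thur.
categories = np.unique(days)

# Ordinal encoding: each value becomes the position of its category.
ordinal = np.searchsorted(categories, days)

# One-hot encoding: one column per category, with a single 1 in each row.
one_hot = (days[:, None] == categories[None, :]).astype(float)

print(ordinal)
print(one_hot)
```

Ordinal codes are harmless for two-valued attributes such as `sex` and `time`, because a single 0/1 column carries all the information. For `day`, integer codes would suggest an ordering and spacing between days that a linear model would take literally, which is why one-hot columns are the safer choice there.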

## 4. Model training and selection.

* Instantiate several models and train them on the prepared training data.
* Measure the performance of the models and compare them using cross-validation.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor 

In [None]:
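r_lineal = LinearRegression()
# Fixing random_state makes the tree-based results repeatable between runs
r_tree = DecisionTreeRegressor(random_state=42)
r_forest = RandomForestRegressor(random_state=42)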

In [None]:
r_lineal.fit(X_train_preparados, p_train_labels)
r_tree.fit(X_train_preparados, p_train_labels)
r_forest.fit(X_train_preparados, p_train_labels)

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# cross_val_score returns the negated MSE of each fold; negate it and take the square root to get RMSE
score_lineal = np.sqrt(-cross_val_score(r_lineal, X_train_preparados, p_train_labels,
                                        cv=10, scoring='neg_mean_squared_error'))
score_tree = np.sqrt(-cross_val_score(r_tree, X_train_preparados, p_train_labels,
                                      cv=10, scoring='neg_mean_squared_error'))
score_forest = np.sqrt(-cross_val_score(r_forest, X_train_preparados, p_train_labels,
                                        cv=10, scoring='neg_mean_squared_error'))

In [None]:
score_lineal.mean()

In [None]:
score_tree.mean()

In [None]:
score_forest.mean() 
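
The cross-validation scores above can be easier to interpret after seeing the procedure spelled out. The sketch below runs 10-fold cross-validation by hand for an ordinary least-squares fit, using NumPy only. The function `kfold_rmse` and the synthetic data are assumptions made for illustration. Unlike `cross_val_score` with an integer `cv`, this version shuffles the rows before forming the folds.

```python
import numpy as np

def kfold_rmse(X, y, k=10, seed=0):
    """Return the RMSE measured on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, which is held out for validation.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # A column of ones lets the least-squares fit learn an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[val_idx] - A_val @ coef
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(scores)

# Synthetic data: the tip is roughly 15% of the bill plus noise.
rng = np.random.default_rng(1)
X_demo = rng.uniform(5, 50, size=(100, 1))
y_demo = 0.15 * X_demo[:, 0] + rng.normal(0, 0.5, size=100)

scores = kfold_rmse(X_demo, y_demo)
print(scores.mean(), scores.std())
```

Looking at the spread of the fold scores, and not only their mean, gives a sense of how much a difference between two models could be due to chance.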

## 5. Fine-tune the model.

* Create a search grid of hyperparameters.
* Select the combination of hyperparameters that achieves the best score (the best model).

In [None]:
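# 'normalize' is omitted: it was removed from LinearRegression in scikit-learn 1.2,
# and total_bill is already standardized by the StandardScaler in the preprocessing step.
grid_param = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'positive': [True, False]
}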

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_search = GridSearchCV(r_lineal, grid_param, cv = 10, scoring = 'neg_mean_squared_error',\
                          return_train_score = True)

In [None]:
grid_search.fit(X_train_preparados, p_train_labels)

In [None]:
grid_search.best_params_


In [None]:
mejor_modelo = grid_search.best_estimator_

In [None]:
np.sqrt(-cross_val_score(mejor_modelo, X_train_preparados, p_train_labels, cv = 10, \
                        scoring = 'neg_mean_squared_error')).mean() 
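
To see how much work a grid search implies, the sketch below enumerates every combination of a small hypothetical grid with `itertools.product`. The dictionary `demo_grid` is an illustrative stand-in and not the grid used in this notebook.

```python
from itertools import product

# Hypothetical grid: two hyperparameters with two candidate values each.
demo_grid = {
    "fit_intercept": [True, False],
    "positive": [True, False],
}

names = list(demo_grid)
combinations = [dict(zip(names, values)) for values in product(*demo_grid.values())]

# With 10-fold cross-validation, every combination is trained 10 times.
n_fits = len(combinations) * 10

for combo in combinations:
    print(combo)
print(n_fits)
```

The number of fits grows multiplicatively with each hyperparameter added to the grid, so it is worth including only hyperparameters that can actually change the predictions.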

## 6. Present the solution.

* Show the performance on the test data.

In [None]:
p_test_atributos.head() 

In [None]:
X_test_prep = procesador.transform(p_test_atributos)

In [None]:
from sklearn.metrics import mean_squared_error 

In [None]:
p_test_predicciones = mejor_modelo.predict(X_test_prep)

In [None]:
np.sqrt(mean_squared_error(p_test_labels, p_test_predicciones))
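
The test score computed above is a root mean squared error (RMSE). As a small worked example, the sketch below computes it by hand on hypothetical values; the function `rmse` and the two lists are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical tips: two exact predictions and one that is off by 2.
demo_true = [2.0, 3.0, 5.0]
demo_pred = [2.0, 3.0, 3.0]
value = rmse(demo_true, demo_pred)
print(value)
```

Because of the square root, the result is expressed in the same units as the tip itself, which makes it easier to judge than the raw mean squared error.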

In [None]:
p_atributos.iloc[0,:].values

In [None]:
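# .iloc selects the label by position, so it matches row 2 of the prepared (shuffled) training array
mejor_modelo.predict([X_train_preparados[2]]), p_train_labels.iloc[2]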

In [None]:
mejor_modelo.predict([X_test_preparados[4]]),p_test_labels.iloc[4]
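
One possible way to avoid transforming rows by hand before each prediction is to chain the preprocessing and the model in a scikit-learn `Pipeline`. This is a sketch under stated assumptions and not the approach used in this notebook: the data are synthetic (not `tips.csv`), and the names `demo`, `demo_tip`, `pre` and `pipe` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in with the same column names as the notebook's attributes.
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "sex": rng.choice(["Female", "Male"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
})
demo_tip = 0.15 * demo["total_bill"] + rng.normal(0, 0.5, n)

# Same three preprocessing branches as the notebook's ColumnTransformer.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["total_bill"]),
    ("bi", OrdinalEncoder(), ["sex", "time"]),
    ("multi", OneHotEncoder(handle_unknown="ignore"), ["day"]),
])

# The pipeline applies the preprocessing and then the regression in one call.
pipe = Pipeline([("prep", pre), ("model", LinearRegression())])
pipe.fit(demo, demo_tip)

# Predict directly from a raw, untransformed row.
pred = pipe.predict(demo.iloc[[0]])
print(pred)
```

A pipeline like this can also be passed to `cross_val_score` or `GridSearchCV`, in which case the preprocessing is refit inside each fold instead of once on the whole training set.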

## References

* Tips dataset: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv

* The scikit-learn LinearRegression class: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

* The scikit-learn GridSearchCV class: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
from sklearn.datasets import fetch_openml 

In [None]:
mnist = fetch_openml(name = 'mnist_784', version = 1, as_frame = False) 

In [None]:
X,y = mnist['data'], mnist['target'] 

In [None]:
X.shape

In [None]:
X[0]

In [None]:
a = X[0].reshape(28,28)

In [None]:
a.shape 

In [None]:
a = a.ravel() 

In [None]:
a.shape 

In [None]:
import numpy as np 

In [None]:
A = np.array([[1,2], [3,4]])
A 

In [None]:
A.ravel() 