### Clustering and Regression Exercise

Using the **`fuel_consumption_co2_train.csv`** dataset:

**Part 1**:
1. Perform an **`Exploratory Data Analysis`** (**EDA**).
2. **Preprocess** the data.
3. Using **clustering** methods, **is there a way to "categorize" the data?**
4. Select an **"optimal"** number of **clusters** and create a column with the category assigned by the clustering.
5. With the dataset split into different "classes" or "categories", train regression models such as:
    - **LinearRegression**
    - **KNeighborsRegressor**
    - **RadiusNeighborsRegressor**
    - **DecisionTreeRegressor**
    - **RandomForestRegressor**
    - **SVR**
    - **AdaBoostRegressor**
    - **GradientBoostingRegressor**
6. Remember to run **`train_test_split`** on the data of each **cluster** so that metrics can be computed; the goal is to find the best **`r2_score`** for each model.
7. Apply the validation method best suited to the data; **it only needs to be done for the best model**.
8. Save the models to binary files.
9. Save the **DataFrame** to a **.csv** file named **`fuel_consumption_co2_cluster.csv`**.

**Part 2**:
1. If the model allows it, **tune** the best model using **`GridSearchCV`** (optional).
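
As a hint for Part 2, the following is a minimal sketch of how a `GridSearchCV` run might look. It assumes a `RandomForestRegressor` turns out to be the best model, uses synthetic data from `make_regression` as a stand-in for one cluster, and the parameter grid is a small illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the features and target of one cluster (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical, deliberately small grid; widen it for a real search.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="r2",  # same metric the exercise asks for
    cv=3,
)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best combination found in the grid
print(grid.best_score_)             # mean cross-validated R² of that combination
print(grid.score(X_test, y_test))   # R² of the refitted best model on held-out data
```

Because `refit=True` is the default, `grid` itself can be used for prediction after the search finishes.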

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re

from sklearn.preprocessing import OneHotEncoder, TargetEncoder, MinMaxScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.cluster import DBSCAN

In [None]:
df = pd.read_csv("../Data/fuel_consumption_co2_train.csv")

df.head(3)

In [None]:
# Drop the "Make" and "Model" columns
df = df.drop(["Make", "Model"], axis = 1)

# Split "Transmission" into its numeric part and its letter part.
# Rows without a match get NaN and are dropped below.
df["Transmission_numero"] = df["Transmission"].str.extract(r"(\d+)", expand = False).astype(float)
df["Transmission_letra"] = df["Transmission"].str.extract(r"([a-zA-Z]+)", expand = False)

df = df.dropna()
df = df.drop("Transmission", axis = 1)

In [None]:
# Adrián's code: normalize the "Vehicle Class" labels

df["Vehicle Class"] = df["Vehicle Class"].str.lower()
df["Vehicle Class"] = df["Vehicle Class"].str.strip()
df["Vehicle Class"] = df["Vehicle Class"].str.replace(" - ", " ")
df["Vehicle Class"] = df["Vehicle Class"].str.replace("-", " ")
df["Vehicle Class"] = df["Vehicle Class"].str.replace(":", "")
df["Vehicle Class"] = df["Vehicle Class"].str.replace("  ", " ")

In [None]:
for col in df.columns:
    sns.scatterplot(x = df[col], y = df["CO2 Emissions"])
    plt.show()

In [None]:
# OneHot "Fuel Type"

onehot_fuel_type = OneHotEncoder(sparse_output = False)
onehot_fuel_type.set_output(transform = "pandas")

df = pd.concat([df.drop("Fuel Type", axis = 1), onehot_fuel_type.fit_transform(df[["Fuel Type"]])], axis = 1)

# OneHot "Transmission_letra"
onehot_transmission = OneHotEncoder(sparse_output = False)
onehot_transmission.set_output(transform = "pandas")

df = pd.concat([df.drop("Transmission_letra", axis = 1), onehot_transmission.fit_transform(df[["Transmission_letra"]])], axis = 1)

df.head(3)

In [None]:
target_vehicle_class = {x : y for x, y in df.groupby(by = "Vehicle Class", as_index = False).agg({"CO2 Emissions" : "mean"}).values}

df["Vehicle Class"] = df["Vehicle Class"].map(target_vehicle_class)

df.head(3)
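
The mapping above averages `CO2 Emissions` over every row, including rows that later end up in a test split, so the encoded column carries some information about the target of the test rows. One possible alternative, sketched below on made-up toy data, is to compute the class means on the training rows only and fall back to the global training mean for classes not seen in training. The column name `Vehicle Class enc` and all values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real columns (values are made up).
toy = pd.DataFrame({
    "Vehicle Class": ["compact", "suv small", "compact", "mid size",
                      "suv small", "compact", "mid size", "suv small"],
    "CO2 Emissions": [180, 250, 190, 210, 260, 170, 220, 240],
})

train, test = train_test_split(toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

# Statistics are learned from the training rows only.
class_means = train.groupby("Vehicle Class")["CO2 Emissions"].mean()
global_mean = train["CO2 Emissions"].mean()

train["Vehicle Class enc"] = train["Vehicle Class"].map(class_means)
# Classes missing from the training split fall back to the global training mean.
test["Vehicle Class enc"] = test["Vehicle Class"].map(class_means).fillna(global_mean)

print(test)
```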

In [None]:
# # Jesús's code

# class_encoder = TargetEncoder(target_type="continuous")

# # Fit it on the column's data and transform the column
# class_encoder.fit_transform(df[["Vehicle Class"]], df["CO2 Emissions"])

### Clustering

In [None]:
X = df.copy().values

x_scaler = MinMaxScaler()
X = x_scaler.fit_transform(X)
X.shape

In [None]:
np.sqrt(20)

In [None]:
dbscan = DBSCAN(eps = 1.415, min_samples = X.shape[1]*2, metric = "euclidean")
dbscan.fit(X)
set(dbscan.labels_)
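
The `eps` value above is fixed by hand. A common heuristic for choosing it is the k-distance curve: sort the distance from each point to its `min_samples`-th neighbour and look for the elbow. The sketch below applies that idea to synthetic blobs from `make_blobs`, which are only a stand-in for the scaled `X`; the percentiles printed at the end are a rough numeric substitute for reading the elbow off a plot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the scaled feature matrix (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X_demo = MinMaxScaler().fit_transform(X_demo)

min_samples = X_demo.shape[1] * 2  # same rule of thumb as the DBSCAN call above

# Querying the fitted data returns each point as its own first neighbour
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the farthest of those neighbours: the k-distance curve.
k_distances = np.sort(distances[:, -1])

# Candidate eps values usually sit where the curve bends sharply upward.
print(np.percentile(k_distances, [50, 90, 95, 99]))
```

Plotting `k_distances` with `plt.plot` makes the elbow easier to spot than the percentiles alone.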

In [None]:
sns.scatterplot(x = df["Fuel Consumption City"], y = df["CO2 Emissions"], hue = dbscan.labels_)
plt.show()

In [None]:
df["cluster"] = dbscan.labels_

df.head(3)

In [None]:
df0 = df[df["cluster"] == 0]
df1 = df[df["cluster"] == 1]
df_outlier = df[df["cluster"] == -1]

In [None]:
df0.shape, df1.shape, df_outlier.shape

# Cluster 0

In [None]:
df0 = df0.drop("cluster", axis = 1)

X = df0.drop("CO2 Emissions", axis = 1)
y = df0["CO2 Emissions"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

modelos = [LinearRegression(), 
           KNeighborsRegressor(), 
           DecisionTreeRegressor(), 
           RandomForestRegressor(), 
           SVR(), 
           AdaBoostRegressor(), 
           GradientBoostingRegressor()]

datos_cluster0 = list()

for model in modelos:

    model.fit(X_train, y_train)
    
    yhat = model.predict(X_test)
    
    r2 = r2_score(y_test, yhat)
    mae = mean_absolute_error(y_test, yhat)
    mse = mean_squared_error(y_test, yhat)

    datos_cluster0.append([str(model), model, r2, mae, mse])

df_score = pd.DataFrame(data = datos_cluster0, columns = ["modelo_str", "modelo", "r2_score", "mae", "mse"])
df_score.sort_values("r2_score", ascending = False)

# Cluster 1

In [None]:
df1 = df1.drop("cluster", axis = 1)

X = df1.drop("CO2 Emissions", axis = 1)
y = df1["CO2 Emissions"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

modelos = [LinearRegression(), 
           KNeighborsRegressor(), 
           DecisionTreeRegressor(), 
           RandomForestRegressor(), 
           SVR(), 
           AdaBoostRegressor(), 
           GradientBoostingRegressor()]

datos_cluster1 = list()

for model in modelos:

    model.fit(X_train, y_train)
    
    yhat = model.predict(X_test)
    
    r2 = r2_score(y_test, yhat)
    mae = mean_absolute_error(y_test, yhat)
    mse = mean_squared_error(y_test, yhat)

    datos_cluster1.append([str(model), model, r2, mae, mse])

df_score = pd.DataFrame(data = datos_cluster1, columns = ["modelo_str", "modelo", "r2_score", "mae", "mse"])
df_score.sort_values("r2_score", ascending = False)
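
The Cluster 0 and Cluster 1 cells repeat the same loop. One way to avoid the duplication is a small helper that takes the features, the target and a list of models. The function name `evaluate_models` is hypothetical, and the sketch runs on synthetic data from `make_regression` instead of the real clusters.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_models(X, y, models, test_size=0.2, random_state=42):
    """Fit each model on one train/test split and return metrics sorted by R²."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    rows = []
    for model in models:
        model.fit(X_train, y_train)
        yhat = model.predict(X_test)
        rows.append({
            "modelo_str": str(model),
            "r2_score": r2_score(y_test, yhat),
            "mae": mean_absolute_error(y_test, yhat),
            "mse": mean_squared_error(y_test, yhat),
        })
    return (
        pd.DataFrame(rows)
        .sort_values("r2_score", ascending=False)
        .reset_index(drop=True)
    )


# Synthetic stand-in for one cluster (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
scores = evaluate_models(
    X_demo, y_demo, [LinearRegression(), RandomForestRegressor(random_state=0)]
)
print(scores)
```

With the real data the call would look like `evaluate_models(df0.drop("CO2 Emissions", axis = 1), df0["CO2 Emissions"], modelos)`, repeated for each cluster.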

# Whole DataFrame

In [None]:
X = df.drop(["CO2 Emissions", "cluster"], axis = 1)
y = df["CO2 Emissions"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = RandomForestRegressor()
model.fit(X_train, y_train)

yhat = model.predict(X_test)

r2_score(y_test, yhat)
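
# A single train/test split gives only one R² estimate. Step 7 of the
# exercise asks for a validation method; the sketch below uses shuffled
# 5-fold cross-validation, which is one reasonable choice and not
# necessarily the intended one. The data is synthetic (assumption).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X and y (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Shuffle before splitting so that any ordering in the rows does not bias the folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="r2",
)

print(scores)                       # one R² per fold
print(scores.mean(), scores.std())  # summary across folds
```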

In [None]:
df
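
# Steps 8 and 9 ask for the models and the DataFrame to be saved. The
# sketch below shows one way to do it with `pickle` and `to_csv`, on a
# toy DataFrame and a temporary directory. The file name
# `model_cluster0.pkl` is a hypothetical choice; only the CSV name comes
# from the exercise statement.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy DataFrame and model standing in for the real ones (assumption).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "cluster": [0, 0, 1, 1],
})
model = LinearRegression().fit(toy[["x"]], toy["y"])

out_dir = Path(tempfile.mkdtemp())  # temporary directory, only for the demo

# Save the fitted model to a binary file and load it back.
model_path = out_dir / "model_cluster0.pkl"  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# Save the DataFrame, including the cluster column, without the index.
csv_path = out_dir / "fuel_consumption_co2_cluster.csv"
toy.to_csv(csv_path, index=False)
reloaded = pd.read_csv(csv_path)

print(restored.predict(pd.DataFrame({"x": [5.0]})))
print(reloaded.shape)
```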
