![image info](https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/images/banner_1.png)

# Project 1 - Used vehicle price prediction

In this project you will put into practice your knowledge of tree-based and ensemble predictive models, and of model deployment. As you work on it, follow the instructions given in the "Guía del proyecto 1: Predicción de precios de vehículos usados" (Project 1 guide: used vehicle price prediction).

**Submission**: The project must be submitted during week 4. However, it is important that you make progress during week 3 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF format to the project submission activity in week 4, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/b8be43cf89c540bfaf3831f2c8506614).

## Data for used vehicle price prediction

This project uses the Car Listings dataset from Kaggle, in which each observation is a used-car listing with its price and attributes such as year, make, and model, among others. The goal is to predict the car's price. For more details, see the following link: [data](https://www.kaggle.com/jpayne/852k-used-car-listings).

## Example: predicting the test set for submission to Kaggle

This section shows the format in which the prediction results must be saved so that they can be uploaded to the Kaggle competition.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [27]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from math import sqrt
from sklearn.model_selection import RandomizedSearchCV

In [3]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')
dataTesting = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTest_carListings.zip', index_col=0)

## Transforming the training data

In [4]:
# Preview the training data
dataTraining.head()

| | Price | Year | Mileage | State | Make | Model |
|---|---|---|---|---|---|---|
| 0 | 34995 | 2017 | 9913 | FL | Jeep | Wrangler |
| 1 | 37895 | 2015 | 20578 | OH | Chevrolet | Tahoe4WD |
| 2 | 18430 | 2012 | 83716 | TX | BMW | X5AWD |
| 3 | 24681 | 2014 | 28729 | OH | Cadillac | SRXLuxury |
| 4 | 26998 | 2013 | 64032 | CO | Jeep | Wrangler |


#### Factorizing categorical variables


In [5]:
# Encode the categorical variables as integer codes
dataTraining['State'] = pd.factorize(dataTraining.State)[0]
dataTraining['Make'] = pd.factorize(dataTraining.Make)[0]
dataTraining['Model'] = pd.factorize(dataTraining.Model)[0]
dataTraining['Year'] = pd.factorize(dataTraining.Year)[0]

In [6]:
# Preview the test data
dataTesting.head()

| ID | Year | Mileage | State | Make | Model |
|---|---|---|---|---|---|
| 0 | 2014 | 31909 | MD | Nissan | MuranoAWD |
| 1 | 2017 | 5362 | FL | Jeep | Wrangler |
| 2 | 2014 | 50300 | OH | Ford | FlexLimited |
| 3 | 2004 | 132160 | WA | BMW | 5 |
| 4 | 2015 | 25226 | MA | Jeep | Grand |


In [7]:
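# Caution: pd.factorize assigns codes by order of appearance, so factorizing the
# test set on its own can give a category a different code than it got in training.
dataTesting['State'] = pd.factorize(dataTesting.State)[0]
dataTesting['Make'] = pd.factorize(dataTesting.Make)[0]
dataTesting['Model'] = pd.factorize(dataTesting.Model)[0]
dataTesting['Year'] = pd.factorize(dataTesting.Year)[0]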
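
Because `pd.factorize` numbers categories in order of appearance, the same make or state can end up with different codes in the two files. The following is a minimal sketch of one way to keep the codes aligned, assuming the mapping is learned on the training column and reused on the test column. The toy frames and the `Make_code` column are hypothetical and only stand in for `dataTraining` and `dataTesting`.

```python
import pandas as pd

# Toy frames standing in for dataTraining / dataTesting (hypothetical values).
train = pd.DataFrame({"Make": ["Jeep", "BMW", "Jeep", "Ford"]})
test = pd.DataFrame({"Make": ["Ford", "Jeep", "Nissan"]})

# Learn the category -> code mapping from the training column only.
codes, uniques = pd.factorize(train["Make"])
mapping = {category: code for code, category in enumerate(uniques)}

train["Make_code"] = codes
# Reuse the same mapping on test; categories unseen in training become -1.
test["Make_code"] = test["Make"].map(mapping).fillna(-1).astype(int)

print(train["Make_code"].tolist())  # [0, 1, 0, 2]
print(test["Make_code"].tolist())   # [2, 0, -1]
```

Using -1 for unseen categories is just one convention; tree-based models can split on it, but other strategies (for example a dedicated "other" bucket) may suit the data better.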

#### Defining X_train, y_train and X_test

In [8]:
# Split the training and test data into X and y
#X_train = dataTraining.drop(["Price"], axis = 1)  
#y_train = dataTraining.filter(["Price"], axis = 1)
#X_test = dataTesting

In [8]:
# Split the training data into features (X) and target (y)
XTotal = dataTraining.drop(["Price"], axis=1)
# A 1-D target avoids scikit-learn's column-vector conversion warning
yTotal = dataTraining["Price"]

# Hold out 20% of the training data to estimate the error metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(XTotal, yTotal, test_size=0.20, random_state=0)

### Bagging

In [10]:
# BaggingRegressor from scikit-learn with DecisionTreeRegressor as the base estimator
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, 
                          bootstrap=True, oob_score=True, random_state=1)

In [11]:
bagreg.fit(X_train, y_train)
bagreg_pred = bagreg.predict(X_test)

In [12]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, bagreg_pred)
print(f'MSE: {mse}')

MSE: 15877368.410204569


In [13]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

RMSE: 3984.6415660890466


In [14]:
mae = mean_absolute_error(y_test, bagreg_pred)
print(f"MAE: {mae}")

MAE: 2507.549804521262


In [15]:
r2 = r2_score(y_test, bagreg_pred)
print(f'R^2: {r2}')

R^2: 0.8626108020934592
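
The same four metrics are recomputed for every model in this notebook. As a hedged sketch, they could be gathered in one small helper; `regression_report` is a hypothetical name that does not appear in the original notebook, and the prices below are made up so the result can be checked by hand.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }


# Tiny hand-checkable example (hypothetical prices).
y_true = np.array([10000.0, 20000.0, 30000.0])
y_pred = np.array([11000.0, 19000.0, 30000.0])
report = regression_report(y_true, y_pred)
print(report)
```

With the notebook's variables, a call such as `regression_report(y_test, bagreg_pred)` would return the values printed in the cells above in a single dictionary.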


#### Hyperparameter tuning

In [16]:
# Define the hyperparameter search space
param_dist = {
    'n_estimators': [1, 100, 500],
    'max_samples': [0.5, 1.0],
    'max_features': [1, 5],
    'bootstrap': [True, False],
    'bootstrap_features': [True, False]
}

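# Create a RandomizedSearchCV instance (defaults: n_iter=10, 5-fold CV, scored with R^2)
# Note: bagreg was built with oob_score=True, so candidates with bootstrap=False
# raise an error when fitted and are scored as NaN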
random_search = RandomizedSearchCV(
    bagreg,
    param_distributions=param_dist,
    n_jobs=-1,
    random_state=0
)

# Fit the RandomizedSearchCV instance to the data
random_search.fit(X_train, y_train)

# Print the best parameters
print(random_search.best_params_)

{'n_estimators': 100, 'max_samples': 0.5, 'max_features': 5, 'bootstrap_features': True, 'bootstrap': True}
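
The search above relies on the defaults of `RandomizedSearchCV`. A minimal sketch of a more explicit setup follows, assuming synthetic data from `make_regression` in place of the car listings and a smaller, hypothetical grid so that it runs quickly. It sets `n_iter`, `cv` and an RMSE-based `scoring`, and leaves `oob_score` at its default so that `bootstrap=False` candidates can be fitted.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_train / y_train (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# oob_score stays False so that bootstrap=False candidates can be fitted.
base = BaggingRegressor(DecisionTreeRegressor(), random_state=1)

param_dist = {
    "n_estimators": [10, 25, 50],
    "max_samples": [0.5, 1.0],
    "max_features": [1, 5],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    base,
    param_distributions=param_dist,
    n_iter=8,                               # number of sampled candidates
    cv=3,                                   # folds per candidate
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.2f}")
```

Comparing `-search.best_score_` with the cross-validated RMSE of the untuned model is one way to check whether the tuned configuration is actually an improvement before refitting it.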


In [17]:
bagreg_cal = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, bootstrap_features=True, max_features=5, max_samples=0.5,
                          bootstrap=True, oob_score=True, random_state=1)

In [18]:
bagreg_cal.fit(X_train, y_train)
bagregpred_cal = bagreg_cal.predict(X_test)

In [19]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, bagregpred_cal)
print(f'MSE: {mse}')

MSE: 20492734.416370828


In [20]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

RMSE: 4526.8901484761955


In [21]:
mae = mean_absolute_error(y_test, bagregpred_cal)
print(f"MAE: {mae}")

MAE: 3148.748171144526


In [22]:
r2 = r2_score(y_test, bagregpred_cal)
print(f'R^2: {r2}')

R^2: 0.8226733630135199


### Random Forest

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Define a Random Forest model for a regression problem
rfreg = RandomForestRegressor()

In [10]:
#cross_val_score(clf, XTotal,yTotal, cv=10)
rfreg.fit(X_train, y_train)
predRF = rfreg.predict(X_test)

In [25]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, predRF)
print(f'MSE: {mse}')

MSE: 14938271.79953913


In [26]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

RMSE: 3865.006054269402


In [27]:
mae = mean_absolute_error(y_test, predRF)
print(f"MAE: {mae}")

MAE: 2427.384459804687


In [28]:
r2 = r2_score(y_test, predRF)
print(f'R^2: {r2}')

R^2: 0.8707369428217395


#### Hyperparameter tuning

In [11]:
# Define the hyperparameter search space
param_dist = {
    'n_estimators': [50, 100, 500],
    'max_depth': [None, 10, 50]
}

# Create a RandomizedSearchCV instance
random_search = RandomizedSearchCV(
    rfreg,
    param_distributions=param_dist,
    n_jobs=-1,
    random_state=0
)

# Fit the RandomizedSearchCV instance to the data
random_search.fit(X_train, y_train)

# Print the best parameters
print(random_search.best_params_)

{'n_estimators': 100, 'max_depth': None}


In [12]:
rfreg_cal = RandomForestRegressor(n_estimators=100, max_depth=None)

In [13]:
#cross_val_score(clf, XTotal,yTotal, cv=10)
rfreg_cal.fit(X_train, y_train)
predRF_cal = rfreg_cal.predict(X_test)

In [14]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, predRF_cal)
print(f'MSE: {mse}')

MSE: 14875433.67488388


In [15]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

RMSE: 3856.8683766605104


In [16]:
mae = mean_absolute_error(y_test, predRF_cal)
print(f"MAE: {mae}")

MAE: 2423.580718772989


In [17]:
r2 = r2_score(y_test, predRF_cal)
print(f'R^2: {r2}')

R^2: 0.8712806903321133


### AdaBoost

In [20]:
# Import and define the AdaBoostRegressor model
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor()

In [21]:
ada.fit(X_train, y_train)
predada = ada.predict(X_test)

In [22]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, predada)
print(f'MSE: {mse}')

MSE: 117777777.55703342


In [23]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

RMSE: 10852.547053896307


In [24]:
mae = mean_absolute_error(y_test, predada)
print(f"MAE: {mae}")

MAE: 9033.709756282879


#### Hyperparameter tuning

In [28]:
from scipy.stats import uniform

In [29]:
param_dist = {
    'n_estimators': range(50, 500),
    'learning_rate': uniform(0.01, 1)
}

# Define the randomized search
random_search = RandomizedSearchCV(
    ada,
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    n_jobs=-1,
    random_state=0
)

# Fit the randomized search to the training data
random_search.fit(X_train, y_train)

# Show the best parameters found
print(random_search.best_params_)

{'learning_rate': 0.014695476192547066, 'n_estimators': 175}
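
A detail worth keeping in mind when reading `uniform(0.01, 1)`: SciPy parameterizes the uniform distribution as `(loc, scale)`, so the sampled interval is `[loc, loc + scale]` rather than `[low, high]`. The short check below illustrates this; the `bounded` distribution is a hypothetical alternative for capping the learning rate at 1.0, not something the original search used.

```python
from scipy.stats import uniform

# scipy parameterizes uniform as (loc, scale): the support is [loc, loc + scale].
dist = uniform(0.01, 1)
low, high = dist.ppf(0), dist.ppf(1)
print(low, high)  # the upper end is loc + scale, i.e. about 1.01

# To sample within [0.01, 1.0] instead, set scale = upper - lower.
bounded = uniform(loc=0.01, scale=0.99)
samples = bounded.rvs(size=1000, random_state=0)
print(samples.min() >= 0.01, samples.max() <= 1.0)
```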


In [30]:
ada_cal = AdaBoostRegressor(learning_rate=0.014695476192547066,n_estimators=175)

In [31]:
ada_cal.fit(X_train, y_train)
predada_cal = ada_cal.predict(X_test)

In [32]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, predada_cal)
print(f'MSE: {mse}')

MSE: 79811400.80802599


In [33]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

RMSE: 8933.722673556975


In [34]:
mae = mean_absolute_error(y_test, predada_cal)
print(f"MAE: {mae}")

MAE: 7059.816202242784


In [35]:
r2 = r2_score(y_test, predada_cal)
print(f'R^2: {r2}')

R^2: 0.3093802412643737


### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
grb = GradientBoostingRegressor()

In [None]:
grb.fit(X_train, y_train)
predgrb = grb.predict(X_test)

In [None]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, predgrb)
print(f'MSE: {mse}')

In [None]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

In [None]:
mae = mean_absolute_error(y_test, predgrb)
print(f"MAE: {mae}")

In [None]:
r2 = r2_score(y_test, predgrb)
print(f'R^2: {r2}')

#### Hyperparameter tuning

In [None]:
from scipy.stats import uniform

In [36]:
# Define the parameter distributions for the randomized search
param_dist = {
    'n_estimators': [50, 100, 200],
    'learning_rate': uniform(0.01, 1),
    'max_depth': [1, 5, 10]
}

# Define the randomized search
random_search = RandomizedSearchCV(
    grb,
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    n_jobs=-1,
    random_state=0
)

# Fit the randomized search to the training data
random_search.fit(X_train, y_train)

# Show the best parameters found
print(random_search.best_params_)

SyntaxError: invalid syntax (3978930228.py, line 6)

In [None]:
# Placeholder: pass the values from random_search.best_params_ here once the search has run
grb_cal = GradientBoostingRegressor()

In [None]:
grb_cal.fit(X_train, y_train)
predgrb_cal = grb_cal.predict(X_test)

In [None]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, predgrb_cal)
print(f'MSE: {mse}')

In [None]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

In [None]:
mae = mean_absolute_error(y_test, predgrb_cal)
print(f"MAE: {mae}")

In [None]:
r2 = r2_score(y_test, predgrb_cal)
print(f'R^2: {r2}')

### XGBoost

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

xgb = XGBRegressor()
xgb

In [None]:
from sklearn import metrics
xgb.fit(X_train, y_train)
xgbpred = xgb.predict(X_test)

In [None]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, xgbpred)
print(f'MSE: {mse}')

In [None]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

In [None]:
mae = mean_absolute_error(y_test, xgbpred)
print(f"MAE: {mae}")

In [None]:
r2 = r2_score(y_test, xgbpred)
print(f'R^2: {r2}')

#### Tuning XGBoost hyperparameters

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the values to try for each parameter
param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'gamma': [0, 1, 5],
    'colsample_bytree': [0.3, 0.7, 1]
}

In [None]:
# Create an XGBRegressor object
xgb_cal = XGBRegressor()

# Create a GridSearchCV object
grid_search = GridSearchCV(xgb_cal, param_grid, cv=5)

In [None]:
# Run the grid search with cross-validation to find the best parameters
grid_search.fit(X_train, y_train)

In [None]:
# Show the best parameters found
print(grid_search.best_params_)

In [None]:
xgb_cal = XGBRegressor(learning_rate=1,gamma=0,colsample_bytree=1)
xgb_cal

In [None]:
# Fit the XGBRegressor model and predict on the hold-out set
xgb_cal.fit(X_train, y_train)
xgbpred_cal = xgb_cal.predict(X_test)

In [None]:
# Compute the MSE on the hold-out set
mse = mean_squared_error(y_test, xgbpred_cal)
print(f'MSE: {mse}')

In [None]:
rmse = sqrt(mse)
print(f'RMSE: {rmse}')

In [None]:
mae = mean_absolute_error(y_test, xgbpred_cal)
print(f"MAE: {mae}")

In [None]:
r2 = r2_score(y_test, xgbpred_cal)
print(f'R^2: {r2}')

In [None]:
# Test-set prediction: random numbers are generated here as a placeholder example
np.random.seed(42)
y_pred = pd.DataFrame(np.random.rand(dataTesting.shape[0]) * 75000 + 5000, index=dataTesting.index, columns=['Price'])

In [None]:
# Save the predictions in the format required by the Kaggle competition
y_pred.to_csv('test_submission.csv', index_label='ID')
y_pred.head()
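
The cell above writes random numbers only to illustrate the file format. As a hedged sketch of the same format filled with model output, the example below fits a small forest on a toy frame and builds a one-column `Price` table indexed by `ID`. The toy values, the two-feature setup and the `submission` name are assumptions for illustration, not the notebook's final pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical, already-numeric toy data standing in for the real frames.
train = pd.DataFrame({
    "Year": [2012, 2014, 2015, 2017],
    "Mileage": [83716, 28729, 20578, 9913],
    "Price": [18430, 24681, 37895, 34995],
})
test = pd.DataFrame(
    {"Year": [2014, 2017], "Mileage": [31909, 5362]},
    index=pd.Index([0, 1], name="ID"),
)

features = ["Year", "Mileage"]
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(train[features], train["Price"])

# One 'Price' column indexed by the test IDs, matching the format used above.
submission = pd.DataFrame(
    model.predict(test[features]), index=test.index, columns=["Price"]
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index_label="ID")
print(csv_text)
```

In the notebook itself, the equivalent step would be to predict on `dataTesting` with whichever tuned model performed best and pass a file name such as `'test_submission.csv'` to `to_csv`.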