![image info](https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/images/banner_1.png)

# Project 1 - Used vehicle price prediction

In this project you will put into practice your knowledge of tree-based and ensemble predictive models, and of model deployment. As you work on it, follow the instructions given in the "Guía del proyecto 1: Predicción de precios de vehículos usados" (the Project 1 guide).

**Submission**: The project must be submitted during week 4. However, it is important that you make progress during week 3 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF to the project submission activity found in week 4, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/b8be43cf89c540bfaf3831f2c8506614).

## Data for used vehicle price prediction

This project uses the Car Listings dataset from Kaggle, where each observation records the price of a car together with variables such as year, make, and model, among others. The goal is to predict the car's price. For more details, see the following link: [data](https://www.kaggle.com/jpayne/852k-used-car-listings).

## Example: test set prediction for Kaggle submission

This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Load the data from the (zipped) .csv files
dataTraining = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')
dataTesting = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTest_carListings.zip', index_col=0)

In [6]:
# Display the training data
dataTraining

Unnamed: 0,Price,Year,Mileage,State,Make,Model
0,34995,2017,9913,0,0,0
1,37895,2015,20578,1,1,1
2,18430,2012,83716,2,2,2
3,24681,2014,28729,1,3,3
4,26998,2013,64032,3,0,0
...,...,...,...,...,...,...
399995,29900,2015,25287,2,23,278
399996,17688,2015,17677,28,1,103
399997,24907,2014,66688,13,6,118
399998,11498,2014,37872,33,8,11


In [None]:
#dataTrainingNA=dataTraining.dropna(axis=0,how="any")
#dataTestingNA=dataTesting.dropna(axis=0,how="any")

In [8]:
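# Factorize categorical variables (integer codes assigned in order of first appearance)
dataTraining['State'] = pd.factorize(dataTraining.State)[0]
dataTraining['Make'] = pd.factorize(dataTraining.Make)[0]
dataTraining['Model'] = pd.factorize(dataTraining.Model)[0]
dataTraining['Year'] = pd.factorize(dataTraining.Year)[0]

In [9]:
# Caveat: the test set is factorized independently of the training set, so a given
# label is not guaranteed to receive the same code in both, and Year is factorized
# in training only.
dataTesting['State'] = pd.factorize(dataTesting.State)[0]
dataTesting['Make'] = pd.factorize(dataTesting.Make)[0]
dataTesting['Model'] = pd.factorize(dataTesting.Model)[0]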
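
One way to keep the integer codes aligned between the two sets is to learn the mapping on the training data and reuse it on the test data. The following is a minimal sketch on toy data, not the notebook's actual pipeline; the `train` and `test` frames and their values are hypothetical stand-ins for `dataTraining` and `dataTesting`.

```python
import pandas as pd

# Toy stand-ins for dataTraining / dataTesting (hypothetical values).
train = pd.DataFrame({'Make': ['Ford', 'Jeep', 'Ford', 'BMW']})
test = pd.DataFrame({'Make': ['BMW', 'Ford', 'Kia']})

# Learn the label -> code mapping from the training set only.
codes, uniques = pd.factorize(train['Make'])
train['Make'] = codes

# Reuse the same mapping on the test set; labels unseen in training become -1.
test['Make'] = pd.Index(uniques).get_indexer(test['Make'])

print(train['Make'].tolist())  # [0, 1, 0, 2]
print(test['Make'].tolist())   # [2, 0, -1]
```

With this approach `'Ford'` maps to `0` in both frames, whereas two independent `pd.factorize` calls would assign it `0` in training and `1` in testing.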

In [10]:
# Split the training data into features (X) and target (y)
XTotal = dataTraining.drop(["Price"], axis=1)
yTotal = dataTraining.filter(["Price"], axis=1)

# Hold out 20% of the training data to estimate the prediction error (RMSE)
from sklearn.model_selection import train_test_split
XTrain, XTest, yTrain, yTest = train_test_split(XTotal, yTotal, test_size=0.20, random_state=0)

In [None]:
yTotal

## Bagging

In [11]:
# Use BaggingRegressor from scikit-learn with DecisionTreeRegressor as the base estimator
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, 
                          bootstrap=True, oob_score=True, random_state=1)
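
The model above sets `oob_score=True`, but the notebook never reads the resulting out-of-bag estimate. The sketch below shows how it could be inspected after fitting. It runs on synthetic data from `make_regression` (an assumption, standing in for `XTrain`/`yTrain`) and uses 50 estimators rather than 10, so that nearly every row is left out of at least one bootstrap sample.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (assumption: 5 numeric features).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                       bootstrap=True, oob_score=True, random_state=1)
bag.fit(X, y)

# oob_score_ is the R^2 computed from out-of-bag predictions.
print(bag.oob_score_)

# oob_prediction_ holds the out-of-bag prediction for each training row,
# so an out-of-bag RMSE can be derived without a separate hold-out set.
oob_rmse = np.sqrt(mean_squared_error(y, bag.oob_prediction_))
print(oob_rmse)
```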

In [8]:
bagreg.fit(XTrain, yTrain)
y_pred = bagreg.predict(XTest)
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(yTest, y_pred))

3934.8087517976446

In [12]:
bagreg.fit(XTrain, yTrain)
y_pred = bagreg.predict(XTest)
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(yTest, y_pred))

3984.6415660890466

In [13]:
# Fit the model on all the training data and predict the test set for submission
bagreg.fit(XTotal, yTotal)
y_pred2 = bagreg.predict(dataTesting)
y_pred2G = pd.DataFrame(y_pred2, index=dataTesting.index, columns=['Price'])

In [14]:
y_pred2 #y_pred2G

array([32196.2, 42146.4, 36994.8, ..., 32527.4, 33582.7, 14116.7])

In [17]:
y_pred2G.to_csv('testBagging_default.csv', index_label='ID')

## Random forest

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Define a Random Forest model for a regression problem
rfreg = RandomForestRegressor()

In [10]:
#cross_val_score(clf, XTotal,yTotal, cv=10)
rfreg.fit(XTrain, yTrain)
predRF = rfreg.predict(XTest)
np.sqrt(mean_squared_error(yTest, predRF))

3813.358027437825
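
The RMSE above comes from a single 80/20 split. `cross_val_score` is imported in this section but only appears in a commented-out line; the following sketch shows how it could produce a k-fold RMSE estimate. The data is synthetic (an assumption, not the car listings), and the model size is reduced to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for XTotal / yTotal (assumption: 5 numeric features).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
neg_rmse = cross_val_score(model, X, y, cv=5,
                           scoring='neg_root_mean_squared_error')
rmse_per_fold = -neg_rmse

# Mean and spread of the RMSE across the 5 folds.
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

Averaging over folds gives a less split-dependent estimate than one hold-out set, at the cost of fitting the model once per fold.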

In [18]:
# Fit the model on all the training data and predict the test set for submission
rfreg.fit(XTotal, yTotal)
predRF2 = rfreg.predict(dataTesting)
predRF2G = pd.DataFrame(predRF2, index=dataTesting.index, columns=['Price'])

In [21]:
predRF2G.to_csv('testRandomForest_default.csv', index_label='ID')

In [57]:
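# Hyperparameter tuning
# Note: each list holds explicit candidate values, not (start, stop, step),
# so the repeated first value is simply evaluated twice.
param_gridRF = {
    'n_estimators': [10, 200, 10],
    'max_features': [1, len(XTrain.columns), 1],
    'max_depth': [1, 10, 1]
}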
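
If the lists in the grid were meant as `(start, stop, step)` ranges, they would need to be expanded explicitly. The sketch below contrasts the two readings and counts the resulting candidates with `ParameterGrid`. The value `n_features = 5` is an assumption matching the five feature columns in this notebook.

```python
from sklearn.model_selection import ParameterGrid

n_features = 5  # assumption: matches len(XTrain.columns) in this notebook

# A literal list is a set of candidates, not (start, stop, step).
literal_grid = {'n_estimators': [10, 200, 10]}

# range(start, stop, step) expands to every value in between; stop is exclusive.
range_grid = {
    'n_estimators': list(range(10, 201, 10)),        # 10, 20, ..., 200
    'max_features': list(range(1, n_features + 1)),  # 1, 2, ..., 5
    'max_depth': list(range(1, 11)),                 # 1, 2, ..., 10
}

print(len(ParameterGrid(literal_grid)))  # 3
print(len(ParameterGrid(range_grid)))    # 1000
```

A full range-based grid of 1000 combinations, each cross-validated, would be expensive on a training set of this size, so a coarser grid or a randomized search may be more practical.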

In [58]:
# GridSearchCV object
from sklearn.model_selection import GridSearchCV
grid_searchRF = GridSearchCV(rfreg, param_gridRF)

In [59]:
grid_searchRF.fit(XTrain, yTrain)

GridSearchCV(estimator=RandomForestRegressor(),
             param_grid={'max_depth': [1, 10, 1], 'max_features': [1, 5, 1],
                         'n_estimators': [10, 200, 10]})

In [61]:
print(grid_searchRF.best_params_)

{'max_depth': 10, 'max_features': 5, 'n_estimators': 200}
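
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for regressors is R², not the RMSE reported elsewhere in this notebook. A minimal sketch of selecting by RMSE instead, on synthetic data and with a deliberately small, hypothetical grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for XTrain / yTrain (assumption: 5 numeric features).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 5, None]},   # hypothetical small grid
    scoring='neg_root_mean_squared_error',    # rank candidates by RMSE
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
# best_score_ is the negated RMSE, so flip the sign to read it as an error.
print(-search.best_score_)
```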


In [62]:
rfregCal = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1, max_features=5, max_depth = 10)
rfregCal.fit(XTrain, yTrain)
predRFCal = rfregCal.predict(XTest)
np.sqrt(mean_squared_error(yTest, predRFCal))

5679.77438098646

In [68]:
rfregCal = RandomForestRegressor(n_estimators=350, random_state=0, n_jobs=-1, max_features=5, max_depth = 25)
rfregCal.fit(XTrain, yTrain)
predRFCal = rfregCal.predict(XTest)
np.sqrt(mean_squared_error(yTest, predRFCal))

3740.350634987885

In [None]:
# Tuned model fit on all the training data, and prediction for submission

In [69]:
rfCal = RandomForestRegressor(n_estimators=350, random_state=0, n_jobs=-1, max_features=5, max_depth = 25)
rfCal.fit(XTotal, yTotal)
predRFCal2 = rfCal.predict(dataTesting)
predRFCal2G = pd.DataFrame(predRFCal2, index=dataTesting.index, columns=['Price'])

In [73]:
predRFCal2G.to_csv('testRandomForest_calibra1.csv', index_label='ID')

## AdaBoost

In [22]:
# Import and define the AdaBoostRegressor model
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor()

In [23]:
ada.fit(XTrain, yTrain)
predada = ada.predict(XTest)
np.sqrt(mean_squared_error(yTest, predada))

11202.121005616384
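
By default, `AdaBoostRegressor` boosts shallow trees (`DecisionTreeRegressor` with `max_depth=3`). The sketch below shows how a deeper base tree could be supplied instead; whether this lowers the RMSE on the car listings data is not established here. The data is synthetic and the depth of 8 is an arbitrary, hypothetical choice.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for XTrain / yTrain (assumption: 5 numeric features).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# The base estimator is passed positionally, which avoids depending on the
# keyword name, since that name differs across scikit-learn versions.
ada_deep = AdaBoostRegressor(DecisionTreeRegressor(max_depth=8),
                             n_estimators=50, random_state=0)
ada_deep.fit(X, y)

pred = ada_deep.predict(X[:5])
print(pred)
```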

In [27]:
# Fit the model on all the training data and predict the test set for submission
ada.fit(XTotal, yTotal)
predada2 = ada.predict(dataTesting)
predada2G = pd.DataFrame(predada2, index=dataTesting.index, columns=['Price'])

In [30]:
predada2G.to_csv('testAdaB_default.csv', index_label='ID')

## Gradient Boosting

In [31]:
from sklearn.ensemble import GradientBoostingRegressor
grb = GradientBoostingRegressor()

In [32]:
grb.fit(XTrain, yTrain)
predgrb = grb.predict(XTest)
np.sqrt(mean_squared_error(yTest, predgrb))

6507.705325901701

In [33]:
# Fit the model on all the training data and predict the test set for submission
grb.fit(XTotal, yTotal)
predgrb2 = grb.predict(dataTesting)
predgrb2G = pd.DataFrame(predgrb2, index=dataTesting.index, columns=['Price'])

In [35]:
predgrb2G

ID,Price
0,29463.840780
1,41652.962384
2,27376.315519
3,8663.784714
4,25174.787328
...,...
99995,16314.841461
99996,31979.750609
99997,33680.810664
99998,26050.907044


In [36]:
predgrb2G.to_csv('testGradBoost_default.csv', index_label='ID')

## XGBoost

In [None]:
#pip install xgboost

In [1]:
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=None, ...)

In [37]:
xgb.fit(XTrain, yTrain)
predxgb = xgb.predict(XTest)
np.sqrt(mean_squared_error(yTest, predxgb))

3848.9132576103534

In [38]:
# Fit the model on all the training data and predict the test set for submission
xgb.fit(XTotal, yTotal)
predxgb2 = xgb.predict(dataTesting)
predxgb2G = pd.DataFrame(predxgb2, index=dataTesting.index, columns=['Price'])

In [40]:
predxgb2G

ID,Price
0,31735.732422
1,45641.714844
2,26509.072266
3,7990.861816
4,28900.083984
...,...
99995,16314.141602
99996,32044.556641
99997,31546.707031
99998,29846.750000


In [41]:
predxgb2G.to_csv('testXGBoost_default.csv', index_label='ID')

In [42]:
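# Hyperparameter tuning
# Note: each list holds explicit candidate values, not (start, stop, step),
# so a repeated value is simply evaluated twice.
param_grid = {
    'learning_rate': [0.1, 1, 0.1],
    'gamma': [0, 1, 5],
    'colsample_bytree': [0.1, 1, 0.1]
}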

In [45]:
# GridSearchCV object
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(xgb, param_grid)

In [46]:
grid_search.fit(XTrain, yTrain)

GridSearchCV(estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None, gpu_id=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None, max_bin=None,
                                    max_cat_threshold=None,
                                    max_cat_to_onehot=None, max_delta_step=None,
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                    

In [47]:
print(grid_search.best_params_)

{'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 1}


In [49]:
xgbCal = XGBRegressor(learning_rate=1,gamma=0,colsample_bytree=1)
xgbCal.fit(XTrain, yTrain)
predxgbCal = xgbCal.predict(XTest)
np.sqrt(mean_squared_error(yTest, predxgbCal))

3776.8545055039995

In [50]:
# Fit the tuned model on all the training data and predict the test set for submission
xgbCal.fit(XTotal, yTotal)
predxgbCal2 = xgbCal.predict(dataTesting)
predxgbCal2G = pd.DataFrame(predxgbCal2, index=dataTesting.index, columns=['Price'])


In [53]:
predxgbCal2G

ID,Price
0,31756.341797
1,52680.792969
2,24794.355469
3,11177.109375
4,40889.632812
...,...
99995,14810.829102
99996,32470.097656
99997,31321.871094
99998,31687.347656


In [54]:
predxgbCal2G.to_csv('testXGBoost_calibra1.csv', index_label='ID')

In [11]:
# Test set prediction - random numbers are generated here as an example
np.random.seed(42)
y_pred = pd.DataFrame(np.random.rand(dataTesting.shape[0]) * 75000 + 5000, index=dataTesting.index, columns=['Price'])

In [12]:
y_pred

ID,Price
0,33090.508914
1,76303.572981
2,59899.545636
3,49899.386315
4,16701.398033
...,...
99995,64422.862300
99996,63443.967139
99997,55584.005547
99998,42458.543350


In [None]:
# Save predictions in the format required by the Kaggle competition
y_pred.to_csv('test_submission.csv', index_label='ID')
y_pred.head()
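
Before uploading, it can help to confirm that a saved file has the two-column layout produced above (`ID` index label plus a `Price` column). The following is a small sketch of such a check. It writes to an in-memory buffer instead of a file, and the four predictions are hypothetical values, not model output.

```python
import io
import pandas as pd

# Hypothetical predictions for a 4-row test set.
preds = pd.DataFrame({'Price': [15000.0, 22000.5, 9800.0, 31000.0]},
                     index=pd.RangeIndex(4))

# Write exactly as the notebook does, but to an in-memory buffer.
buffer = io.StringIO()
preds.to_csv(buffer, index_label='ID')

# Read the result back and check the layout.
buffer.seek(0)
check = pd.read_csv(buffer)

header = buffer.getvalue().splitlines()[0]
print(header)  # ID,Price

columns_ok = list(check.columns) == ['ID', 'Price']
no_missing = bool(check['Price'].notna().all())
ids_unique = bool(check['ID'].is_unique)
print(columns_ok, no_missing, ids_unique)  # True True True
```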