# **Used Vehicle Price Prediction (Core)**
Implement and evaluate regression models, and select the best one based on the evaluation metrics.

<font color="blue">**ML performed after removing outliers in car prices**</font>

**EDA performed in the file DEA_CORE3.ipynb**

In [1]:
import pandas as pd

In [3]:
path = '/content/drive/MyDrive/Bootcamp-ML/Cores/Core3 Autos/vehicles_so_core3.csv'  # dataset after EDA, with outlier values removed
df = pd.read_csv(path)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  float64
 3   year          426880 non-null  float64
 4   manufacturer  426880 non-null  object 
 5   condition     426880 non-null  object 
 6   cylinders     426880 non-null  object 
 7   fuel          426880 non-null  object 
 8   odometer      426880 non-null  float64
 9   title_status  426880 non-null  object 
 10  transmission  426880 non-null  object 
 11  drive         426880 non-null  object 
 12  size          426880 non-null  object 
 13  type          426880 non-null  object 
 14  paint_color   426880 non-null  object 
 15  state         426880 non-null  object 
dtypes: float64(3), int64(1), object(12)
memory usage: 52.1+ MB


In [5]:
df.describe().T.round()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,426880.0,7311487000.0,4473170.0,7207408000.0,7308143000.0,7312621000.0,7315254000.0,7317101000.0
price,426880.0,17558.0,12150.0,1.0,7900.0,14988.0,25950.0,55605.0
year,426880.0,2011.0,9.0,1900.0,2008.0,2014.0,2017.0,2022.0
odometer,426880.0,97915.0,212780.0,0.0,38130.0,85548.0,133000.0,10000000.0


Split: features and target

In [30]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
X=df.drop(columns=['price','id'])
y=df['price']

In [8]:
# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# Define the numeric and nominal feature groups.
num_cols = ["year", "odometer"]
nom_cols = ["region","manufacturer", "condition", "cylinders","fuel","title_status","transmission","drive","size","type","paint_color","state"]

**Linear Regression**

In [22]:
preprocessorRL = ColumnTransformer(
    transformers=[
        ("num", "passthrough", num_cols),

        ("nom", OneHotEncoder(handle_unknown="ignore", sparse_output=False), nom_cols),
    ]
)

In [25]:
# Pipeline with linear regression
pipelineRL = Pipeline(steps=[
    ("preprocessing", preprocessorRL),
    ("regressor", LinearRegression())
])


In [26]:
# Train the model
pipelineRL.fit(X_train, y_train)

In [27]:
y_pred = pipelineRL.predict(X_test)

In [32]:
r2 = r2_score(y_test, y_pred)

In [31]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"RMSE: ${rmse:,.2f}")

RMSE: $9,118.57


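**Regression with a Decision Tree**

In [11]:
# Preprocessor: only the numeric columns are passed through; ColumnTransformer
# drops the remaining (nominal) columns by default.
preprocessor_tree = ColumnTransformer(transformers=[
    ("num", "passthrough", num_cols)
])

# Model.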
pipeline_tree = Pipeline([
    ("preprocessing", preprocessor_tree),
    ("model", DecisionTreeRegressor(max_depth=8, random_state=42))
])

In [12]:
pipeline_tree.fit(X_train, y_train)

In [13]:
# Prediction.
y_pred_tree = pipeline_tree.predict(X_test)

**Regression with KNN**

In [14]:
# Preprocessor.
preprocessor_knn = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols)  # KNN is distance-based, so features must be scaled
])

# Model.
pipeline_knn = Pipeline([
    ("preprocessing", preprocessor_knn),
    ("model", KNeighborsRegressor(n_neighbors=3))
])

In [15]:
# Training.
pipeline_knn.fit(X_train, y_train)

In [16]:
# Prediction.
y_pred_knn = pipeline_knn.predict(X_test)

**Regression with Random Forest**

In [17]:
# Preprocessor.
preprocessor_forest = ColumnTransformer(transformers=[
    ("num", "passthrough", num_cols)
])

# Model.
pipeline_forest = Pipeline([
    ("preprocessing", preprocessor_forest),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42))
])

In [18]:
# Training.
pipeline_forest.fit(X_train, y_train)

In [19]:
# Prediction.
y_pred_forest = pipeline_forest.predict(X_test)
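
The tree, KNN, and forest pipelines above receive only `num_cols`, because `ColumnTransformer` drops any column that is not listed (its default is `remainder="drop"`). As a hedged sketch, one way to give a tree-based model the nominal columns as well is `OrdinalEncoder`, which is already imported. The toy data and the `*_sketch` names are hypothetical, and no claim is made about how this would change the scores reported in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical values) with one numeric and one nominal feature.
X_toy = pd.DataFrame({
    "year": [2010, 2015, 2018, 2012, 2020, 2008],
    "fuel": ["gas", "diesel", "gas", "gas", "electric", "diesel"],
})
y_toy = pd.Series([5000.0, 12000.0, 15000.0, 7000.0, 30000.0, 4000.0])

preprocessor_sketch = ColumnTransformer(transformers=[
    ("num", "passthrough", ["year"]),
    # Categories unseen during fit are mapped to -1 instead of raising an error.
    ("nom", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["fuel"]),
])

pipeline_sketch = Pipeline([
    ("preprocessing", preprocessor_sketch),
    ("model", DecisionTreeRegressor(max_depth=8, random_state=42)),
])
pipeline_sketch.fit(X_toy, y_toy)

# Both columns reach the model: 1 numeric + 1 encoded nominal feature.
n_features = pipeline_sketch.named_steps["model"].n_features_in_
print(n_features)
```

Ordinal codes impose an arbitrary order on the categories, which tree splits can usually work around; for KNN or linear models, one-hot encoding is generally the safer choice.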

**Evaluation of the trained models for predicting car prices, using the R² and RMSE metrics**

In [37]:
r2 = r2_score(y_test, y_pred)
r2_score_tree = r2_score(y_test, y_pred_tree)
r2_score_knn = r2_score(y_test, y_pred_knn)
r2_score_forest = r2_score(y_test, y_pred_forest)

In [39]:
print(f"Linear Regression: {r2}")
print(f"Decision Tree Regression: {r2_score_tree}")
print(f"KNN Regression: {r2_score_knn}")
print(f"Random Forest Regression: {r2_score_forest}")

Linear Regression: 0.4379506771441768
Decision Tree Regression: 0.39312278290713465
KNN Regression: 0.5734284733705706
Random Forest Regression: 0.6343407903945526


In [33]:
mseRL = mean_squared_error(y_test, y_pred)
rmseRL = np.sqrt(mseRL)
mseTREE = mean_squared_error(y_test, y_pred_tree)
rmseTREE = np.sqrt(mseTREE)
mseKNN = mean_squared_error(y_test, y_pred_knn)
rmseKNN = np.sqrt(mseKNN)
mseFOREST = mean_squared_error(y_test, y_pred_forest)
rmseFOREST = np.sqrt(mseFOREST)

In [35]:
print(f"RMSE Linear Regression: ${rmseRL:,.2f}")
print(f"RMSE Decision Tree: ${rmseTREE:,.2f}")
print(f"RMSE KNN: ${rmseKNN:,.2f}")
print(f"RMSE Random Forest: ${rmseFOREST:,.2f}")

RMSE Linear Regression: $9,118.57
RMSE Decision Tree: $9,475.24
RMSE KNN: $7,943.93
RMSE Random Forest: $7,354.92


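**Linear Regression**
A low R² (0.438) means the model explains only about 43.8% of the variance in price.

High RMSE: $9,118.57, so a typical prediction error is large relative to the median price of $14,988.

A simple linear model, limited when relationships are non-linear. It is also the only model here that receives the one-hot encoded nominal columns; the other three see only `year` and `odometer`.

**Decision Tree**
Worst overall performance: lowest R² (0.393) and highest RMSE ($9,475.24).

It may be overfitting, but since this pipeline only receives `year` and `odometer` and its depth is capped at 8, underfitting from limited features is at least as likely; comparing train and test scores would tell the two apart.

**KNN**
Better than the previous two: acceptable R² (0.573) and a noticeably lower RMSE ($7,943.93).

It is sensitive to preprocessing, in particular feature scaling.

**Random Forest**
Best algorithm: highest R² (0.634) and lowest RMSE ($7,354.92).

Captures non-linear relationships and reduces overfitting by averaging many trees.

Well suited to tabular data such as car prices.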
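
The two metrics used above are tied together: R² equals 1 minus the ratio of the model's MSE to the variance of the target. The sketch below works this out by hand on toy numbers (hypothetical values, not taken from the dataset) using only the standard library.

```python
import math

# Toy values (hypothetical), standing in for y_test and a model's predictions.
y_true = [10000.0, 20000.0, 30000.0, 40000.0]
y_hat = [12000.0, 18000.0, 33000.0, 37000.0]

n = len(y_true)
mean_y = sum(y_true) / n

# Mean squared error and its square root (RMSE is in the same units as price).
mse_toy = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n
rmse_toy = math.sqrt(mse_toy)

# Variance of the target around its mean (population form).
var_y = sum((t - mean_y) ** 2 for t in y_true) / n

# R² compares the model's squared error with that of always predicting the mean.
r2_toy = 1 - mse_toy / var_y

print(f"RMSE: {rmse_toy:,.2f}  R²: {r2_toy:.3f}")
```

The same relation can be checked against the reported figures: with a price standard deviation of about 12,150 (from `df.describe()`) and a linear regression RMSE of 9,118.57, 1 − (9,118.57 / 12,150)² is about 0.437, close to the reported 0.438. The small gap is expected because that standard deviation is computed over the full dataset rather than the test split.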

# <font color="purple">**Exporting the Best Model**</font>


In [40]:
import joblib

path ='/content/drive/MyDrive/Bootcamp-ML/Cores/Core3 Autos/MModelo_RF.pkl'
joblib.dump(pipeline_forest, path)  # export the Random Forest pipeline, the best model above

['/content/drive/MyDrive/Bootcamp-ML/Cores/Core3 Autos/MModelo_RF.pkl']
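
A file written with `joblib.dump` can be read back with `joblib.load`. The sketch below is a minimal round trip on a toy model and a temporary file (both hypothetical stand-ins for the fitted pipeline and the Drive path), checking that the restored model predicts exactly like the original.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the fitted pipeline (hypothetical values).
X_toy = pd.DataFrame({
    "year": [2010, 2015, 2018, 2020],
    "odometer": [150000.0, 90000.0, 40000.0, 10000.0],
})
y_toy = [5000.0, 12000.0, 18000.0, 30000.0]
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Temporary file used only for this sketch; the notebook writes to Google Drive.
tmp_path = os.path.join(tempfile.mkdtemp(), "model_sketch.pkl")
joblib.dump(model, tmp_path)

# Reload and confirm the restored model predicts exactly like the original.
restored = joblib.load(tmp_path)
same = (restored.predict(X_toy) == model.predict(X_toy)).all()
print(same)
```

Pickled models are generally only safe to load with a scikit-learn version compatible with the one that wrote them, so recording the library version next to the file is a reasonable habit.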

# **Final conclusion**

**Random Forest** is the most accurate model on this dataset, with the highest R² and the lowest RMSE.

**KNN** is a reasonable second option; its features are already standardized, so further gains would have to come from tuning its hyperparameters.

***Linear Regression and the Decision Tree, as configured here, are not recommended for this problem.***

# **<font color="tomato">Hyperparameter Tuning with GridSearchCV</font>**

In [41]:
from sklearn.model_selection import GridSearchCV

# **KNN Tuning**

In [42]:
# Preprocessor.
preprocessor_knng = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols)
])

# Model.
knn_pipelineg = Pipeline([
    ("pp", preprocessor_knng),
    ("model", KNeighborsRegressor())
])

In [43]:
# Hyperparameter search space.
knn_params = {
    "model__n_neighbors": [2, 3, 5, 10]
}

knn_grid = GridSearchCV(knn_pipelineg, knn_params, cv=5, scoring="r2")
knn_grid.fit(X_train, y_train)

In [44]:
# Evaluation.
knn_bestg = knn_grid.best_estimator_  # pipeline refit on the training set with the best parameters

In [45]:
knn_bestg

In [46]:
y_pred_knng = knn_bestg.predict(X_test)

print("KNN Regressor")
print("Best parameters:", knn_grid.best_params_)  # the grid can be rerun with other values (e.g. 10, 15, 20, or smaller) until the best parameter stops changing
print("R²:", r2_score(y_test, y_pred_knng))

KNN Regressor
Best parameters: {'model__n_neighbors': 3}
R²: 0.5734284733705706
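
The test R² here matches the earlier KNN result exactly because the search selected `n_neighbors=3`, the same value used in the first KNN pipeline. To see how close the other candidates were, the `cv_results_` attribute of a fitted `GridSearchCV` can be turned into a table. The sketch below does this on synthetic data (hypothetical, only to keep the example runnable); the names `pipe`, `grid`, and `results` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical), only to make the sketch runnable.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 1, size=200)

pipe = Pipeline([("pp", StandardScaler()), ("model", KNeighborsRegressor())])
grid = GridSearchCV(pipe, {"model__n_neighbors": [2, 3, 5, 10]}, cv=5, scoring="r2")
grid.fit(X_toy, y_toy)

# One row per candidate, with the mean and spread of the cross-validated R².
results = pd.DataFrame(grid.cv_results_)[
    ["param_model__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If the best value sits at the edge of the grid, or the top candidates differ by less than their standard deviation, that is a hint to widen or refine the grid before drawing conclusions.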


# **Random Forest Tuning**

In [47]:
# Preprocessor.
preprocessor_forest = ColumnTransformer(transformers=[
    ("num", "passthrough", num_cols)
])

# Model.
pipeline_forest = Pipeline([
    ("pp", preprocessor_forest),
    ("model", RandomForestRegressor(random_state=42))
])

In [48]:
# Hyperparameter search space.
forest_params = {
    "model__n_estimators": [50, 100, 200]  # The prefix must match the name of the pipeline step ("model").
}

forest_grid = GridSearchCV(pipeline_forest, forest_params, cv=3, scoring="r2")
forest_grid.fit(X_train, y_train)

In [49]:
# Evaluation.
forest_best = forest_grid.best_estimator_
y_pred_forest = forest_best.predict(X_test)


In [50]:
print("Forest Regressor")
print("Best parameters:", forest_grid.best_params_)
print("R²:", r2_score(y_test, y_pred_forest))

Forest Regressor
Best parameters: {'model__n_estimators': 200}
R²: 0.6346746436436614
