# **Used Vehicle Price Prediction (Core)**
Implement and evaluate regression models, and select the best one based on the evaluation metrics.

<font color="blue">**ML performed keeping the car price outliers**</font>

**EDA performed in the file DEA_CORE3.ipynb**

In [2]:
import pandas as pd

In [5]:
path= '/content/drive/MyDrive/Bootcamp-ML/Cores/Core3 Autos/vehicles_clean_core3.csv'
df = pd.read_csv(path)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  float64
 3   year          426880 non-null  float64
 4   manufacturer  426880 non-null  object 
 5   condition     426880 non-null  object 
 6   cylinders     426880 non-null  object 
 7   fuel          426880 non-null  object 
 8   odometer      426880 non-null  float64
 9   title_status  426880 non-null  object 
 10  transmission  426880 non-null  object 
 11  drive         426880 non-null  object 
 12  size          426880 non-null  object 
 13  type          426880 non-null  object 
 14  paint_color   426880 non-null  object 
 15  state         426880 non-null  object 
dtypes: float64(3), int64(1), object(12)
memory usage: 52.1+ MB


In [7]:
df.describe().T.round()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 426880.0 | 7311487000.0 | 4473170.0 | 7207408000.0 | 7308143000.0 | 7312621000.0 | 7315254000.0 | 7317101000.0 |
| price | 426880.0 | 76469.0 | 12182275.0 | 1.0 | 7900.0 | 14988.0 | 26990.0 | 3736929000.0 |
| year | 426880.0 | 2011.0 | 9.0 | 1900.0 | 2008.0 | 2014.0 | 2017.0 | 2022.0 |
| odometer | 426880.0 | 97915.0 | 212780.0 | 0.0 | 38130.0 | 85548.0 | 133000.0 | 10000000.0 |


Split: features and target

In [13]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [10]:
X=df.drop(columns=['price','id'])
y=df['price']

In [11]:
# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [14]:
# Define the numeric and nominal column groups.
num_cols = ["year", "odometer"]
nom_cols = ["region","manufacturer", "condition", "cylinders","fuel","title_status","transmission","drive","size","type","paint_color","state"]
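Note that `nom_cols` is defined here but the preprocessors further down only pass `num_cols` to the models, so the categorical columns are never used. A minimal sketch of how nominal columns could be added to a `ColumnTransformer` with `OneHotEncoder` follows. The toy frame, its values, and the names `toy_num_cols`, `toy_nom_cols`, and `preprocessor_with_nominal` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real data (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "year": [2010.0, 2015.0, 2018.0],
    "odometer": [120000.0, 60000.0, 30000.0],
    "fuel": ["gas", "diesel", "gas"],
    "drive": ["fwd", "4wd", "rwd"],
})
toy_num_cols = ["year", "odometer"]
toy_nom_cols = ["fuel", "drive"]

preprocessor_with_nominal = ColumnTransformer(transformers=[
    # Numeric columns are passed through unchanged.
    ("num", "passthrough", toy_num_cols),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
    ("nom", OneHotEncoder(handle_unknown="ignore"), toy_nom_cols),
])

encoded = preprocessor_with_nominal.fit_transform(toy)
# 2 numeric columns + 2 fuel categories + 3 drive categories.
print(encoded.shape)
```

On the real data, columns with many distinct values would produce a wide matrix under one-hot encoding, so it may be worth checking `X[nom_cols].nunique()` before deciding which columns to encode this way.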

**Regression with a Decision Tree**

In [15]:
# Preprocessor.
preprocessor_tree = ColumnTransformer(transformers=[
    ("num", "passthrough", num_cols)
])

# Model.
pipeline_tree = Pipeline([
    ("preprocessing", preprocessor_tree),
    ("model", DecisionTreeRegressor(max_depth=8, random_state=42))
])

In [16]:
# Training.
pipeline_tree.fit(X_train, y_train)

In [17]:
# Prediction.
y_pred_tree = pipeline_tree.predict(X_test)

**Regression with KNN**

In [18]:
# Preprocessor.
preprocessor_knn = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols)  # KNN is distance-based, so features must be scaled
])

# Model.
pipeline_knn = Pipeline([
    ("preprocessing", preprocessor_knn),
    ("model", KNeighborsRegressor(n_neighbors=3))
])

In [19]:
# Training.
pipeline_knn.fit(X_train, y_train)

In [20]:
# Prediction.
y_pred_knn = pipeline_knn.predict(X_test)

**Regression with Random Forest**

In [21]:
# Preprocessor.
preprocessor_forest = ColumnTransformer(transformers=[
    ("num", "passthrough", num_cols)
])

# Model.
pipeline_forest = Pipeline([
    ("preprocessing", preprocessor_forest),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42))
])

In [22]:
# Training.
pipeline_forest.fit(X_train, y_train)

In [23]:
# Prediction.
y_pred_forest = pipeline_forest.predict(X_test)

**Evaluation of the models trained to predict car prices**

In [25]:
r2_score_tree = r2_score(y_test, y_pred_tree)
r2_score_knn = r2_score(y_test, y_pred_knn)
r2_score_forest = r2_score(y_test, y_pred_forest)

In [27]:
print(f"Decision Tree Regression: {r2_score_tree}")
print(f"KNN Regression: {r2_score_knn}")
print(f"Random Forest Regression: {r2_score_forest}")

Decision Tree Regression: -0.19612904450088275
KNN Regression: -0.09952470477289244
Random Forest Regression: -0.14086970013643052
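For context, R² is defined as 1 − SS_res / SS_tot, so it is exactly 0 for a model that always predicts the mean of the test target and negative for any model with a larger squared error than that baseline. The sketch below illustrates this on made-up numbers and also reports RMSE and MAE, which are expressed in price units. The helper `regression_report` and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return R², RMSE, and MAE for one set of predictions (illustrative helper)."""
    return {
        "r2": r2_score(y_true, y_pred),
        # RMSE is computed by hand to stay independent of the scikit-learn version.
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": mean_absolute_error(y_true, y_pred),
    }

# Made-up prices, not taken from the dataset.
y_true_toy = np.array([10000.0, 15000.0, 20000.0, 25000.0])

# A model that systematically overshoots does worse than the mean baseline.
y_pred_bad = np.full_like(y_true_toy, 40000.0)
# The baseline: always predict the mean of the true values.
y_pred_mean = np.full_like(y_true_toy, y_true_toy.mean())

report_bad = regression_report(y_true_toy, y_pred_bad)
report_mean = regression_report(y_true_toy, y_pred_mean)
print(report_bad)
print(report_mean)
```

The same helper could be called with `y_test` and each of `y_pred_tree`, `y_pred_knn`, and `y_pred_forest` to compare the three models on more than one metric.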


<font color="red">**None of the evaluated models performed well: with negative R² values there is effectively no predictive power. I will redo the analysis, but without the outlier values.**</font>
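One common way to drop price outliers is the interquartile range (IQR) rule. The sketch below is only an assumption about how the follow-up could be done, not the method the notebook actually uses later. The function name `drop_price_outliers_iqr`, the multiplier `k=1.5`, and the toy frame are hypothetical.

```python
import pandas as pd

def drop_price_outliers_iqr(frame, column="price", k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] (illustrative helper)."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default.
    return frame[frame[column].between(lower, upper)]

# Toy frame with one extreme price (values chosen for illustration).
toy_prices = pd.DataFrame({"price": [7900.0, 12000.0, 14988.0, 26990.0, 3736929000.0]})
filtered = drop_price_outliers_iqr(toy_prices)
print(len(toy_prices), "->", len(filtered))
```

If this were applied to the real data, computing the bounds on the training split only and then reusing them on the test split would avoid leaking information from the test set.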