The used-car sales service Rusty Bargain is developing an app to attract new customers. With the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the market value.
Rusty Bargain is interested in:
- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
import lightgbm as lgb
import time

In [2]:
# read data

df = pd.read_csv('/datasets/car_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
# drop irrelevant columns
df = df.drop(['DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)
df

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no
...,...,...,...,...,...,...,...,...,...,...,...
354364,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes
354365,2200,,2005,,0,,20000,1,,sonstige_autos,
354366,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no
354367,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              354369 non-null  int64 
 1   VehicleType        316879 non-null  object
 2   RegistrationYear   354369 non-null  int64 
 3   Gearbox            334536 non-null  object
 4   Power              354369 non-null  int64 
 5   Model              334664 non-null  object
 6   Mileage            354369 non-null  int64 
 7   RegistrationMonth  354369 non-null  int64 
 8   FuelType           321474 non-null  object
 9   Brand              354369 non-null  object
 10  NotRepaired        283215 non-null  object
dtypes: int64(5), object(6)
memory usage: 29.7+ MB


In [6]:
# rename the columns to lowercase
df.columns = df.columns.str.lower()

print(df.columns)

Index(['price', 'vehicletype', 'registrationyear', 'gearbox', 'power', 'model',
       'mileage', 'registrationmonth', 'fueltype', 'brand', 'notrepaired'],
      dtype='object')


In [7]:
# count missing values
df.isna().sum()

price                    0
vehicletype          37490
registrationyear         0
gearbox              19833
power                    0
model                19705
mileage                  0
registrationmonth        0
fueltype             32895
brand                    0
notrepaired          71154
dtype: int64

In [8]:
# fill missing values in vehicletype, model, notrepaired, and fueltype with 'Unknown'
fill_columns = ['vehicletype', 'model', 'notrepaired', 'fueltype']
df[fill_columns] = df[fill_columns].fillna('Unknown')

# check the remaining missing values
print(df.isna().sum())

price                    0
vehicletype              0
registrationyear         0
gearbox              19833
power                    0
model                    0
mileage                  0
registrationmonth        0
fueltype                 0
brand                    0
notrepaired              0
dtype: int64


In [9]:
# fill missing values in 'gearbox' using the most common value per category

# keep only rows where 'vehicletype' and 'model' are known
df_filtered_gearbox = df[df['vehicletype'] != 'Unknown']
df_filtered_gearbox = df_filtered_gearbox[df_filtered_gearbox['model'] != 'Unknown']

# group by 'vehicletype', 'brand', 'model' and take the mode of 'gearbox'
# note: pd.Series.mode returns an array (not a scalar) when a group has ties or no non-null values
gearbox_mode = df_filtered_gearbox.groupby(['vehicletype', 'brand', 'model'])['gearbox'].agg(pd.Series.mode).reset_index()

# merge the per-group mode back into the original data
df = df.merge(gearbox_mode, on=['vehicletype', 'brand', 'model'], how='left', suffixes=('', '_mode'))

# fill missing 'gearbox' values with the computed mode
df['gearbox'] = df['gearbox'].fillna(df['gearbox_mode'])

# drop the helper column 'gearbox_mode'
df.drop(columns=['gearbox_mode'], inplace=True)

# check how many missing values remain in 'gearbox'
print(df['gearbox'].isna().sum())

11089
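The group-wise imputation above can leave array values in `gearbox` when a group has tied modes or no known gearbox at all. A minimal sketch of one way to avoid that, assuming a hypothetical helper `first_mode` and toy data that only borrows the notebook's column names:

```python
import numpy as np
import pandas as pd

def first_mode(values: pd.Series):
    """Return the first (alphabetically smallest) mode, or NaN if the group has no known value."""
    modes = values.mode()  # drops NaN by default and returns sorted modes
    return modes.iloc[0] if not modes.empty else np.nan

# toy data (hypothetical), using the same column names as the notebook
toy = pd.DataFrame({
    'vehicletype': ['small', 'small', 'small', 'suv', 'suv', 'bus'],
    'brand': ['volkswagen'] * 3 + ['jeep'] * 2 + ['volkswagen'],
    'model': ['golf'] * 3 + ['grand'] * 2 + ['transporter'],
    'gearbox': ['manual', 'manual', None, 'auto', 'manual', None],
})

# broadcast one scalar mode per group back to every row of that group
group_mode = toy.groupby(['vehicletype', 'brand', 'model'])['gearbox'].transform(first_mode)
toy['gearbox'] = toy['gearbox'].fillna(group_mode)
```

With this variant every filled value is a plain string, and groups without any known gearbox stay missing so they can be labelled 'Unknown' in a later step.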


In [10]:
# fill the remaining missing values in 'gearbox'
df['gearbox'] = df['gearbox'].fillna('Unknown')
print(df.isna().sum())

price                0
vehicletype          0
registrationyear     0
gearbox              0
power                0
model                0
mileage              0
registrationmonth    0
fueltype             0
brand                0
notrepaired          0
dtype: int64


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354369 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   price              354369 non-null  int64 
 1   vehicletype        354369 non-null  object
 2   registrationyear   354369 non-null  int64 
 3   gearbox            354369 non-null  object
 4   power              354369 non-null  int64 
 5   model              354369 non-null  object
 6   mileage            354369 non-null  int64 
 7   registrationmonth  354369 non-null  int64 
 8   fueltype           354369 non-null  object
 9   brand              354369 non-null  object
 10  notrepaired        354369 non-null  object
dtypes: int64(5), object(6)
memory usage: 32.4+ MB


In [12]:
df.describe()

Unnamed: 0,price,registrationyear,power,mileage,registrationmonth
count,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645
std,4514.158514,90.227958,189.850405,37905.34153,3.726421
min,0.0,1000.0,0.0,5000.0,0.0
25%,1050.0,1999.0,69.0,125000.0,3.0
50%,2700.0,2003.0,105.0,150000.0,6.0
75%,6400.0,2008.0,143.0,150000.0,9.0
max,20000.0,9999.0,20000.0,150000.0,12.0
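The summary above shows values that cannot be real (registration years 1000 and 9999, power of 20000, prices of 0). The notebook trains on them as they are; the following is only a sketch of how such rows could be filtered, and the bounds `YEAR_RANGE`, `POWER_RANGE`, and `MIN_PRICE` are assumptions rather than values taken from the source:

```python
import pandas as pd

# toy rows (hypothetical) reproducing the kinds of outliers seen in describe()
toy = pd.DataFrame({
    'price': [0, 480, 18300, 9800],
    'registrationyear': [1000, 1993, 2011, 9999],
    'power': [0, 75, 190, 20000],
})

# assumed plausibility bounds; adjust them to the business context
YEAR_RANGE = (1950, 2016)
POWER_RANGE = (1, 1000)
MIN_PRICE = 1

mask = (
    toy['registrationyear'].between(*YEAR_RANGE)
    & toy['power'].between(*POWER_RANGE)
    & (toy['price'] >= MIN_PRICE)
)
cleaned = toy[mask].reset_index(drop=True)
```

Whether to drop such rows or treat the zeros as missing values is a judgment call that depends on how the listings were collected.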


## Model training

In [13]:
# label-encode categorical columns for linear regression and random forest

label_encoder = LabelEncoder()

categorical_columns = ['vehicletype', 'gearbox', 'model', 'fueltype', 'brand', 'notrepaired']
df[categorical_columns] = df[categorical_columns].astype(str)

for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])

# check the DataFrame after encoding
df.head()

Unnamed: 0,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired
0,480,0,1993,4,0,117,150000,0,7,38,0
1,18300,3,2011,4,190,26,125000,5,3,1,2
2,9800,7,2004,3,163,118,125000,8,3,14,0
3,1500,6,2001,4,75,117,150000,6,7,38,1
4,3600,6,2008,4,69,102,90000,7,3,31,1
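Label encoding assigns arbitrary integers to categories, which tree models tolerate but a linear model reads as an ordering. A small sketch of one-hot encoding as an alternative for the linear baseline, on toy data (hypothetical) with the notebook's column names:

```python
import pandas as pd

toy = pd.DataFrame({
    'gearbox': ['manual', 'auto', 'manual', 'Unknown'],
    'fueltype': ['petrol', 'gasoline', 'petrol', 'petrol'],
    'power': [75, 163, 69, 0],
})

# one indicator column per category; drop_first avoids perfectly collinear columns
encoded = pd.get_dummies(toy, columns=['gearbox', 'fueltype'], drop_first=True)
```

For a high-cardinality column such as `model` this produces many columns, so it is a trade-off rather than a strict improvement.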


In [14]:
# split into training and validation sets

target = df['price']
features = df.drop('price', axis=1)

# scale numeric features (note: the scaler is fit on the full dataset, before the split)
scaler = StandardScaler()
numeric_columns = ['registrationyear', 'power', 'mileage', 'registrationmonth']
features[numeric_columns] = scaler.fit_transform(features[numeric_columns])

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)
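The cell above fits the scaler on all rows before splitting, so validation statistics influence the transformation. A minimal sketch of the usual alternative, splitting first and fitting the scaler on the training part only; the data here is synthetic and only the variable names follow the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (hypothetical values)
rng = np.random.default_rng(12345)
features = pd.DataFrame({
    'registrationyear': rng.integers(1990, 2016, size=200),
    'power': rng.integers(50, 250, size=200),
    'mileage': rng.choice([50000, 100000, 150000], size=200),
}).astype(float)
target = pd.Series(rng.integers(500, 20000, size=200), name='price')

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)
features_train = features_train.copy()
features_valid = features_valid.copy()

numeric_columns = ['registrationyear', 'power', 'mileage']
scaler = StandardScaler()
# fit on the training rows only, then apply the same transformation to validation rows
features_train[numeric_columns] = scaler.fit_transform(features_train[numeric_columns])
features_valid[numeric_columns] = scaler.transform(features_valid[numeric_columns])
```

For ordinary linear regression and tree-based models the predictions should barely change, but the habit matters for models that are sensitive to feature scale.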


### Linear regression

In [15]:
# linear regression model
start = time.time()
linear_model = LinearRegression()
linear_model.fit(features_train, target_train)
train_time_linear = time.time() - start

start = time.time()
predictions_linear = linear_model.predict(features_valid)
predict_time_linear = time.time() - start

# evaluate with RMSE
rmse_linear = np.sqrt(mean_squared_error(target_valid, predictions_linear))
print(f"RMSE for linear regression: {rmse_linear}")

RMSE for linear regression: 4071.5587424687446


The linear regression model has an RMSE of about 4071.56 euros, so its typical prediction error is on the order of 4,000 euros. That is high given that the average car price in the dataset is about 4,416 euros, so this model serves only as a reference (sanity check).
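RMSE is not the same as the average absolute error: squaring gives large misses more weight. A small illustration with made-up prices (the arrays are hypothetical):

```python
import numpy as np

actual = np.array([1000.0, 2000.0, 3000.0, 4000.0])
predicted = np.array([1100.0, 1900.0, 3100.0, 5000.0])

errors = predicted - actual
mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

Here a single error of 1000 pushes the RMSE well above the MAE, which is why RMSE is better read as a typical error size that penalizes outliers than as a plain average.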

### Random forest

In [16]:
# random forest model
start = time.time()
rf_model = RandomForestRegressor(n_estimators=33, max_depth=11, random_state=12345)
rf_model.fit(features_train, target_train)
train_time_rf = time.time() - start

# predictions
start = time.time()
predictions_rf = rf_model.predict(features_valid)
predict_time_rf = time.time() - start

# evaluate with RMSE
rmse_rf = np.sqrt(mean_squared_error(target_valid, predictions_rf))
print(f"RMSE for random forest: {rmse_rf}")

RMSE for random forest: 1969.2164926796618


The random forest with 33 trees and a maximum depth of 11 achieved an RMSE of about 1969.

### LightGBM model

In [17]:
# create datasets for LightGBM

train_data = lgb.Dataset(features_train, label=target_train)
valid_data = lgb.Dataset(features_valid, label=target_valid, reference=train_data)


# parameters for LightGBM
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# train the model
# note: early_stopping_rounds as a train() argument works in LightGBM < 4.0;
# newer versions expect callbacks=[lgb.early_stopping(50)] instead
start = time.time()
lgb_model_1 = lgb.train(params, train_data, valid_sets=[valid_data], num_boost_round=1000, early_stopping_rounds=50)
train_time_lgb1 = time.time() - start

# predictions
start = time.time()
predictions_lgb1 = lgb_model_1.predict(features_valid, num_iteration=lgb_model_1.best_iteration)
predict_time_lgb1 = time.time() - start

# evaluate with RMSE
rmse_lgb1 = np.sqrt(mean_squared_error(target_valid, predictions_lgb1))
print(f"RMSE for LightGBM 1: {rmse_lgb1}")





You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 699
[LightGBM] [Info] Number of data points in the train set: 265776, number of used features: 10
[LightGBM] [Info] Start training from score 4413.365319
[1]	valid_0's rmse: 4367.61
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4251.38
[3]	valid_0's rmse: 4110.44
[4]	valid_0's rmse: 3979.58
[5]	valid_0's rmse: 3856.75
[6]	valid_0's rmse: 3742.49
[7]	valid_0's rmse: 3643.78
[8]	valid_0's rmse: 3542.64
[9]	valid_0's rmse: 3447.83
[10]	valid_0's rmse: 3359.29
[11]	valid_0's rmse: 3277
[12]	valid_0's rmse: 3200.78
[13]	valid_0's rmse: 3129.11
[14]	valid_0's rmse: 3062.45
[15]	valid_0's rmse: 3000.93
[16]	valid_0's rmse: 2943.24
[17]	valid_0's rmse: 2894.03
[18]	valid_0's rmse: 2843.61
[19]	valid_0's rmse: 2795.55
[20]	valid_0's rmse: 2750.55
[21]	valid_0's rmse: 2709.03
[22]	valid_0's rmse: 2669.19
[23]	valid_0's rmse: 2631.33
[24]	valid_0's rmse: 2597.17
[25]	valid_0's

[265]	valid_0's rmse: 1828.34
[266]	valid_0's rmse: 1827.81
[267]	valid_0's rmse: 1827.56
[268]	valid_0's rmse: 1827.14
[269]	valid_0's rmse: 1826.9
[270]	valid_0's rmse: 1826.62
[271]	valid_0's rmse: 1826.23
[272]	valid_0's rmse: 1826.06
[273]	valid_0's rmse: 1825.72
[274]	valid_0's rmse: 1825.39
[275]	valid_0's rmse: 1825.08
[276]	valid_0's rmse: 1824.8
[277]	valid_0's rmse: 1824.35
[278]	valid_0's rmse: 1823.82
[279]	valid_0's rmse: 1823.52
[280]	valid_0's rmse: 1823.01
[281]	valid_0's rmse: 1822.53
[282]	valid_0's rmse: 1822.14
[283]	valid_0's rmse: 1821.7
[284]	valid_0's rmse: 1821.58
[285]	valid_0's rmse: 1821.37
[286]	valid_0's rmse: 1821.19
[287]	valid_0's rmse: 1820.86
[288]	valid_0's rmse: 1820.51
[289]	valid_0's rmse: 1820.11
[290]	valid_0's rmse: 1819.69
[291]	valid_0's rmse: 1819.38
[292]	valid_0's rmse: 1819.23
[293]	valid_0's rmse: 1819.12
[294]	valid_0's rmse: 1818.76
[295]	valid_0's rmse: 1818.39
[296]	valid_0's rmse: 1818.05
[297]	valid_0's rmse: 1817.71
[298]	valid_0

[549]	valid_0's rmse: 1771.67
[550]	valid_0's rmse: 1771.47
[551]	valid_0's rmse: 1771.49
[552]	valid_0's rmse: 1771.51
[553]	valid_0's rmse: 1771.33
[554]	valid_0's rmse: 1771.07
[555]	valid_0's rmse: 1771
[556]	valid_0's rmse: 1770.94
[557]	valid_0's rmse: 1770.89
[558]	valid_0's rmse: 1770.78
[559]	valid_0's rmse: 1770.72
[560]	valid_0's rmse: 1770.73
[561]	valid_0's rmse: 1770.51
[562]	valid_0's rmse: 1770.39
[563]	valid_0's rmse: 1770.31
[564]	valid_0's rmse: 1770.17
[565]	valid_0's rmse: 1769.92
[566]	valid_0's rmse: 1769.86
[567]	valid_0's rmse: 1769.6
[568]	valid_0's rmse: 1769.47
[569]	valid_0's rmse: 1769.33
[570]	valid_0's rmse: 1769.18
[571]	valid_0's rmse: 1769.14
[572]	valid_0's rmse: 1769.13
[573]	valid_0's rmse: 1768.98
[574]	valid_0's rmse: 1768.76
[575]	valid_0's rmse: 1768.57
[576]	valid_0's rmse: 1768.45
[577]	valid_0's rmse: 1768.28
[578]	valid_0's rmse: 1768.15
[579]	valid_0's rmse: 1768
[580]	valid_0's rmse: 1767.81
[581]	valid_0's rmse: 1767.75
[582]	valid_0's r

[824]	valid_0's rmse: 1745.23
[825]	valid_0's rmse: 1745.15
[826]	valid_0's rmse: 1745.07
[827]	valid_0's rmse: 1745.11
[828]	valid_0's rmse: 1745.11
[829]	valid_0's rmse: 1745.09
[830]	valid_0's rmse: 1744.95
[831]	valid_0's rmse: 1744.89
[832]	valid_0's rmse: 1744.8
[833]	valid_0's rmse: 1744.66
[834]	valid_0's rmse: 1744.48
[835]	valid_0's rmse: 1744.45
[836]	valid_0's rmse: 1744.4
[837]	valid_0's rmse: 1744.4
[838]	valid_0's rmse: 1744.35
[839]	valid_0's rmse: 1744.27
[840]	valid_0's rmse: 1744.25
[841]	valid_0's rmse: 1744.22
[842]	valid_0's rmse: 1744.12
[843]	valid_0's rmse: 1744.06
[844]	valid_0's rmse: 1743.92
[845]	valid_0's rmse: 1743.83
[846]	valid_0's rmse: 1743.75
[847]	valid_0's rmse: 1743.72
[848]	valid_0's rmse: 1743.68
[849]	valid_0's rmse: 1743.7
[850]	valid_0's rmse: 1743.64
[851]	valid_0's rmse: 1743.53
[852]	valid_0's rmse: 1743.45
[853]	valid_0's rmse: 1743.35
[854]	valid_0's rmse: 1743.28
[855]	valid_0's rmse: 1743.2
[856]	valid_0's rmse: 1743.18
[857]	valid_0's

In [19]:
# second LightGBM model with modified parameters
train_data = lgb.Dataset(features_train, label=target_train)
valid_data = lgb.Dataset(features_valid, label=target_valid, reference=train_data)

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 64,             # more leaves = more complex trees
    'learning_rate': 0.03,        # slightly lower learning rate
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'max_depth': -1               # no depth limit
}

start = time.time()
lgb_model_2 = lgb.train(params, train_data, valid_sets=[valid_data], num_boost_round=1000, early_stopping_rounds=30)
train_time_lgb2  = time.time() - start

start = time.time()
predictions_lgb2 = lgb_model_2.predict(features_valid, num_iteration=lgb_model_2.best_iteration)
predict_time_lgb2  = time.time() - start


rmse_lgb2 = np.sqrt(mean_squared_error(target_valid, predictions_lgb2))
print(f"RMSE for LightGBM 2: {rmse_lgb2}")

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 699
[LightGBM] [Info] Number of data points in the train set: 265776, number of used features: 10
[LightGBM] [Info] Start training from score 4413.365319
[1]	valid_0's rmse: 4443.7
Training until validation scores don't improve for 30 rounds
[2]	valid_0's rmse: 4367.31
[3]	valid_0's rmse: 4274.58
[4]	valid_0's rmse: 4185.95
[5]	valid_0's rmse: 4117.66
[6]	valid_0's rmse: 4033.97
[7]	valid_0's rmse: 3958.56
[8]	valid_0's rmse: 3880.65
[9]	valid_0's rmse: 3805.78
[10]	valid_0's rmse: 3734.14
[11]	valid_0's rmse: 3664.96
[12]	valid_0's rmse: 3598.45
[13]	valid_0's rmse: 3534.1
[14]	valid_0's rmse: 3472.52
[15]	valid_0's rmse: 3418.52
[16]	valid_0's rmse: 3361.5
[17]	valid_0's rmse: 3315.92
[18]	valid_0's rmse: 3272.69
[19]	valid_0's rmse: 3221.33
[20]	valid_0's rmse: 3172.09
[21]	valid_0's rmse: 3124.59
[22]	valid_0's rmse: 3079.04
[23]	val

[267]	valid_0's rmse: 1811.3
[268]	valid_0's rmse: 1810.82
[269]	valid_0's rmse: 1810.55
[270]	valid_0's rmse: 1810.17
[271]	valid_0's rmse: 1809.68
[272]	valid_0's rmse: 1809.29
[273]	valid_0's rmse: 1808.86
[274]	valid_0's rmse: 1808.5
[275]	valid_0's rmse: 1808.14
[276]	valid_0's rmse: 1807.75
[277]	valid_0's rmse: 1807.34
[278]	valid_0's rmse: 1807.03
[279]	valid_0's rmse: 1806.66
[280]	valid_0's rmse: 1806.32
[281]	valid_0's rmse: 1805.8
[282]	valid_0's rmse: 1805.47
[283]	valid_0's rmse: 1805.1
[284]	valid_0's rmse: 1804.82
[285]	valid_0's rmse: 1804.39
[286]	valid_0's rmse: 1803.97
[287]	valid_0's rmse: 1803.55
[288]	valid_0's rmse: 1803.31
[289]	valid_0's rmse: 1802.84
[290]	valid_0's rmse: 1802.62
[291]	valid_0's rmse: 1802.45
[292]	valid_0's rmse: 1802.21
[293]	valid_0's rmse: 1801.75
[294]	valid_0's rmse: 1801.41
[295]	valid_0's rmse: 1800.78
[296]	valid_0's rmse: 1800.52
[297]	valid_0's rmse: 1800.15
[298]	valid_0's rmse: 1799.79
[299]	valid_0's rmse: 1799.46
[300]	valid_0'

[545]	valid_0's rmse: 1757.1
[546]	valid_0's rmse: 1757.09
[547]	valid_0's rmse: 1756.84
[548]	valid_0's rmse: 1756.59
[549]	valid_0's rmse: 1756.47
[550]	valid_0's rmse: 1756.24
[551]	valid_0's rmse: 1756.06
[552]	valid_0's rmse: 1755.95
[553]	valid_0's rmse: 1755.84
[554]	valid_0's rmse: 1755.74
[555]	valid_0's rmse: 1755.7
[556]	valid_0's rmse: 1755.62
[557]	valid_0's rmse: 1755.53
[558]	valid_0's rmse: 1755.45
[559]	valid_0's rmse: 1755.37
[560]	valid_0's rmse: 1755.23
[561]	valid_0's rmse: 1755.03
[562]	valid_0's rmse: 1754.86
[563]	valid_0's rmse: 1754.67
[564]	valid_0's rmse: 1754.58
[565]	valid_0's rmse: 1754.41
[566]	valid_0's rmse: 1754.25
[567]	valid_0's rmse: 1754.02
[568]	valid_0's rmse: 1753.84
[569]	valid_0's rmse: 1753.77
[570]	valid_0's rmse: 1753.59
[571]	valid_0's rmse: 1753.55
[572]	valid_0's rmse: 1753.44
[573]	valid_0's rmse: 1753.39
[574]	valid_0's rmse: 1753.26
[575]	valid_0's rmse: 1753.16
[576]	valid_0's rmse: 1753.1
[577]	valid_0's rmse: 1753.09
[578]	valid_0

[825]	valid_0's rmse: 1732.77
[826]	valid_0's rmse: 1732.71
[827]	valid_0's rmse: 1732.63
[828]	valid_0's rmse: 1732.54
[829]	valid_0's rmse: 1732.46
[830]	valid_0's rmse: 1732.35
[831]	valid_0's rmse: 1732.23
[832]	valid_0's rmse: 1732.04
[833]	valid_0's rmse: 1731.94
[834]	valid_0's rmse: 1731.84
[835]	valid_0's rmse: 1731.7
[836]	valid_0's rmse: 1731.61
[837]	valid_0's rmse: 1731.53
[838]	valid_0's rmse: 1731.44
[839]	valid_0's rmse: 1731.36
[840]	valid_0's rmse: 1731.33
[841]	valid_0's rmse: 1731.25
[842]	valid_0's rmse: 1731.17
[843]	valid_0's rmse: 1731.12
[844]	valid_0's rmse: 1731.09
[845]	valid_0's rmse: 1731.01
[846]	valid_0's rmse: 1730.95
[847]	valid_0's rmse: 1730.91
[848]	valid_0's rmse: 1730.86
[849]	valid_0's rmse: 1730.85
[850]	valid_0's rmse: 1730.79
[851]	valid_0's rmse: 1730.66
[852]	valid_0's rmse: 1730.6
[853]	valid_0's rmse: 1730.52
[854]	valid_0's rmse: 1730.4
[855]	valid_0's rmse: 1730.31
[856]	valid_0's rmse: 1730.24
[857]	valid_0's rmse: 1730.21
[858]	valid_0

The second LightGBM model, with more leaves and a lower learning rate, improved the RMSE by about 12 euros over the first LightGBM model (1722.47 vs. 1734.54). The run times still need to be compared.
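Each cell repeats the same `time.time()` pattern around `fit` and `predict`. One possible way to factor it out is a small helper; `timed` is a hypothetical name, and `time.perf_counter` is used here because it is meant for measuring short intervals:

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# usage with a stand-in function; in the notebook this could wrap model.fit or model.predict
total, elapsed = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={elapsed:.4f}s")
```

A single measurement can be noisy, so repeating the call a few times and keeping the median would give a steadier comparison between models.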

### CatBoost model

In [21]:
# configure the CatBoost model
catboost_model_1 = CatBoostRegressor(
    iterations=1000,          # number of iterations
    depth=6,                  # tree depth
    learning_rate=0.05,       # learning rate
    loss_function='RMSE',     # loss function
    cat_features=[features_train.columns.get_loc(col) for col in categorical_columns],
    verbose=100               # log progress every 100 iterations
)

# train the model
start = time.time()
catboost_model_1.fit(features_train, target_train)
train_time_cat1 = time.time() - start

# predictions
start = time.time()
predictions_cat1 = catboost_model_1.predict(features_valid)
predict_time_cat1 = time.time() - start

# evaluate with RMSE
rmse_cat1 = np.sqrt(mean_squared_error(target_valid, predictions_cat1))
print(f"RMSE for CatBoost: {rmse_cat1}")

0:	learn: 4368.1320155	total: 364ms	remaining: 6m 3s
100:	learn: 2001.7095522	total: 23.3s	remaining: 3m 27s
200:	learn: 1908.1172498	total: 45.1s	remaining: 2m 59s
300:	learn: 1859.4189888	total: 1m 7s	remaining: 2m 37s
400:	learn: 1827.8693972	total: 1m 30s	remaining: 2m 15s
500:	learn: 1806.0479287	total: 1m 52s	remaining: 1m 52s
600:	learn: 1788.6685267	total: 2m 14s	remaining: 1m 29s
700:	learn: 1775.8003925	total: 2m 37s	remaining: 1m 7s
800:	learn: 1764.0502882	total: 2m 59s	remaining: 44.7s
900:	learn: 1754.6603400	total: 3m 21s	remaining: 22.1s
999:	learn: 1746.7014200	total: 3m 43s	remaining: 0us
RMSE for CatBoost: 1786.1109571406228


In [23]:
# second CatBoost model

catboost_model_2 = CatBoostRegressor(
    iterations=900,           # fewer iterations
    depth=9,                  # deeper trees
    learning_rate=0.03,       # lower learning rate
    loss_function='RMSE',
    cat_features=[features_train.columns.get_loc(col) for col in categorical_columns],
    verbose=100
)

start = time.time()
catboost_model_2.fit(features_train, target_train)
train_time_cat2 = time.time() - start

start = time.time()
predictions_cat2 = catboost_model_2.predict(features_valid)
predict_time_cat2 = time.time() - start

rmse_cat2 = np.sqrt(mean_squared_error(target_valid, predictions_cat2))
print(f"RMSE for CatBoost: {rmse_cat2}")

0:	learn: 4416.9344263	total: 432ms	remaining: 6m 27s
100:	learn: 1990.4926530	total: 39.4s	remaining: 5m 11s
200:	learn: 1865.2402314	total: 1m 15s	remaining: 4m 22s
300:	learn: 1811.4214280	total: 1m 50s	remaining: 3m 40s
400:	learn: 1774.3072836	total: 2m 27s	remaining: 3m 3s
500:	learn: 1746.0902561	total: 3m 4s	remaining: 2m 26s
600:	learn: 1723.6981957	total: 3m 40s	remaining: 1m 49s
700:	learn: 1705.9277865	total: 4m 18s	remaining: 1m 13s
800:	learn: 1690.5767135	total: 4m 55s	remaining: 36.5s
899:	learn: 1675.6785892	total: 5m 31s	remaining: 0us
RMSE for CatBoost: 1756.2733001793104


The second CatBoost model, with fewer iterations but greater depth and a lower learning rate, achieved a better RMSE than the first CatBoost model (1756.27 vs. 1786.11).

## Model analysis

In [24]:
model_results = [
    {
        "Model": "Linear regression",
        "RMSE": rmse_linear,
        "Training time (s)": train_time_linear,
        "Prediction time (s)": predict_time_linear
    },
    {
        "Model": "Random forest",
        "RMSE": rmse_rf,
        "Training time (s)": train_time_rf,
        "Prediction time (s)": predict_time_rf
    },
    {
        "Model": "LightGBM 1",
        "RMSE": rmse_lgb1,
        "Training time (s)": train_time_lgb1,
        "Prediction time (s)": predict_time_lgb1
    },
    {
        "Model": "LightGBM 2",
        "RMSE": rmse_lgb2,
        "Training time (s)": train_time_lgb2,
        "Prediction time (s)": predict_time_lgb2
    },
    {
        "Model": "CatBoost 1",
        "RMSE": rmse_cat1,
        "Training time (s)": train_time_cat1,
        "Prediction time (s)": predict_time_cat1
    },
    {
        "Model": "CatBoost 2",
        "RMSE": rmse_cat2,
        "Training time (s)": train_time_cat2,
        "Prediction time (s)": predict_time_cat2
    }
]

# convert to a DataFrame for display
df_results = pd.DataFrame(model_results)

# sort by RMSE
df_results = df_results.sort_values(by="RMSE").reset_index(drop=True)

print(df_results)

               Model         RMSE  Training time (s)  Prediction time (s)
0         LightGBM 2  1722.471214          45.986643             6.421412
1         LightGBM 1  1734.539633         912.479229             4.606434
2         CatBoost 2  1756.273300         332.939108             0.506796
3         CatBoost 1  1786.110957         224.367124             0.345819
4      Random forest  1969.216493          12.983249             0.222264
5  Linear regression  4071.558742           0.043465             0.074935
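To put the table in perspective, the RMSE of each model can be expressed as a reduction relative to the linear baseline. A short sketch using the figures above rounded to two decimals; the column names `rmse`, `train_s`, `predict_s`, and `rmse_reduction_pct` are chosen here for illustration:

```python
import pandas as pd

# figures copied from the results table, rounded to two decimals
results = pd.DataFrame({
    'model': ['LightGBM 2', 'LightGBM 1', 'CatBoost 2', 'CatBoost 1',
              'Random forest', 'Linear regression'],
    'rmse': [1722.47, 1734.54, 1756.27, 1786.11, 1969.22, 4071.56],
    'train_s': [45.99, 912.48, 332.94, 224.37, 12.98, 0.04],
    'predict_s': [6.42, 4.61, 0.51, 0.35, 0.22, 0.07],
})

# percentage reduction in RMSE relative to the linear regression baseline
baseline = results.loc[results['model'] == 'Linear regression', 'rmse'].iloc[0]
results['rmse_reduction_pct'] = (100 * (1 - results['rmse'] / baseline)).round(1)

print(results[['model', 'rmse_reduction_pct', 'train_s', 'predict_s']])
```

Seen this way, all boosted models cut the baseline error by more than half, so training and prediction time become the main differentiators among them.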


## Conclusion

After preparing the data, several models were trained to estimate the value of the cars: linear regression, random forest, LightGBM, and CatBoost, with different parameters. The model with the lowest error was the second LightGBM run, which used more leaves (64 vs. 31), a lower learning rate, and bagging compared with the first. Compared with the sanity check (linear regression

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  The code has no errors
- [ ]  The cells with code are arranged in execution order
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of model speed and quality has been performed