## Used Vehicle Price Prediction (Core)

Implement and evaluate regression models, then select the best one based on the evaluation metrics.

- About Dataset
- Context
- Craigslist is the world's largest collection of used vehicles for sale, yet it's very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist.

- Content
- This data is scraped every few months. It contains almost all the relevant information that Craigslist provides on car sales, including columns such as price, condition, manufacturer, latitude/longitude, and 18 other categories. For ML projects, consider feature engineering on location columns such as long/lat. For previous listings, check older versions of the dataset.

## 1. Data Loading and Exploration

* Download and load the dataset.
* Run an initial exploration to understand the structure of the dataset.
* Identify missing values, duplicates, and outliers.
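
The cells that follow cover missing values. The duplicate and outlier checks from the list above could look like the sketch below. It runs on a small made-up frame (`toy`) rather than on `vehicles.csv`, and the 1.5 * IQR rule is an assumed convention, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical miniature of the listings table (all values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "price": [6000, 11900, 11900, 21000, 1500, 3_000_000],
    "odometer": [90000.0, 45000.0, 45000.0, 30000.0, 150000.0, 10.0],
})

# Duplicates: leave out the unique id so reposted listings are still caught.
dup_mask = toy.duplicated(subset=["price", "odometer"], keep="first")
n_duplicates = int(dup_mask.sum())

# Outliers: flag prices more than 1.5 * IQR beyond the quartiles (assumed rule).
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)

print(f"Duplicate rows: {n_duplicates}")
print(toy.loc[outlier_mask, ["id", "price"]])
```

On the real data, the `subset` passed to `duplicated` would need to be chosen with care, since `id` and `url` are unique per listing even when the same vehicle is posted more than once.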

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import utils

In [2]:
df = pd.read_csv('../../../data/vehicles.csv')
df.head().T

Unnamed: 0,0,1,2,3,4
id,7222695916,7218891961,7221797935,7222270760,7210384030
url,https://prescott.craigslist.org/cto/d/prescott...,https://fayar.craigslist.org/ctd/d/bentonville...,https://keys.craigslist.org/cto/d/summerland-k...,https://worcester.craigslist.org/cto/d/west-br...,https://greensboro.craigslist.org/cto/d/trinit...
region,prescott,fayetteville,florida keys,worcester / central MA,greensboro
region_url,https://prescott.craigslist.org,https://fayar.craigslist.org,https://keys.craigslist.org,https://worcester.craigslist.org,https://greensboro.craigslist.org
price,6000,11900,21000,1500,4900
year,,,,,
manufacturer,,,,,
model,,,,,
condition,,,,,
cylinders,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  pain

In [4]:
df.describe()

Unnamed: 0,id,price,year,odometer,county,lat,long
count,426880.0,426880.0,425675.0,422480.0,0.0,420331.0,420331.0
mean,7311487000.0,75199.03,2011.235191,98043.33,,38.49394,-94.748599
std,4473170.0,12182280.0,9.45212,213881.5,,5.841533,18.365462
min,7207408000.0,0.0,1900.0,0.0,,-84.122245,-159.827728
25%,7308143000.0,5900.0,2008.0,37704.0,,34.6019,-111.939847
50%,7312621000.0,13950.0,2013.0,85548.0,,39.1501,-88.4326
75%,7315254000.0,26485.75,2017.0,133542.5,,42.3989,-80.832039
max,7317101000.0,3736929000.0,2022.0,10000000.0,,82.390818,173.885502


In [5]:
df.isnull().sum()

id                   0
url                  0
region               0
region_url           0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
image_url           68
description         70
county          426880
state                0
lat               6549
long              6549
posting_date        68
dtype: int64

In [6]:
utils.calculate_na_statistics(df)

Unnamed: 0,datos sin NAs en q,Na en q,Na en %
county,0,426880,100.0
size,120519,306361,71.77
cylinders,249202,177678,41.62
condition,252776,174104,40.79
VIN,265838,161042,37.73
drive,296313,130567,30.59
paint_color,296677,130203,30.5
type,334022,92858,21.75
manufacturer,409234,17646,4.13
title_status,418638,8242,1.93
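
`utils` is a local helper module whose source is not shown here. Judging only from the output columns above, `calculate_na_statistics` could be approximated by the sketch below. The function name `na_statistics`, the sorting, and the rounding are assumptions, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def na_statistics(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column non-null count, NA count and NA percentage, most-missing first."""
    na_count = frame.isna().sum()
    stats = pd.DataFrame({
        "datos sin NAs en q": frame.notna().sum(),          # non-null count
        "Na en q": na_count,                                # NA count
        "Na en %": (na_count / len(frame) * 100).round(2),  # NA percentage
    })
    return stats.sort_values("Na en %", ascending=False)


# Hypothetical data: one fully empty column, one mostly empty, one complete.
toy = pd.DataFrame({
    "county": [np.nan, np.nan, np.nan, np.nan],
    "size": ["compact", np.nan, np.nan, np.nan],
    "price": [6000, 11900, 21000, 1500],
})
result = na_statistics(toy)
print(result)
```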


In [7]:
## Drop columns with too many null values
df = df.drop(['county', 'size'], axis=1)

In [8]:
categorical_cols = ['manufacturer', 'type', 'fuel', 'transmission', 'drive', 'paint_color']
numeric_cols = ['odometer', 'year', 'price']


In [9]:

# Build a filtered DataFrame with no missing values in the selected columns
df_filtered = df.dropna(subset=categorical_cols + numeric_cols)

# Show how many rows were dropped
num_original = df.shape[0]
num_filtrado = df_filtered.shape[0]
print(f"Original rows: {num_original}")
print(f"Rows after dropping missing values: {num_filtrado}")
print(f"Rows dropped: {num_original - num_filtrado}")



Original rows: 426880
Rows after dropping missing values: 208061
Rows dropped: 218819


In [10]:
# Group by the categorical columns and compute descriptive statistics
grouped_df = df_filtered.groupby(categorical_cols)[numeric_cols].agg(['mean', 'std', 'median', 'count'])

# Flatten the column MultiIndex into names such as 'odometer_mean'
grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]

# Reset the index so the grouping keys become regular columns
grouped_df = grouped_df.reset_index()


In [11]:
# Compute the coefficient of variation (std / mean) for the numeric columns
for col in numeric_cols:
    std_col = f"{col}_std"
    mean_col = f"{col}_mean"
    cv_col = f"{col}_cv"

    # Avoid division by zero: a zero mean becomes NaN (np.nan keeps the column numeric)
    grouped_df[cv_col] = grouped_df[std_col] / grouped_df[mean_col].replace(0, np.nan)

# Replace NaN coefficients of variation with 0
grouped_df = grouped_df.fillna({f"{col}_cv": 0 for col in numeric_cols})


In [12]:
# Save the grouped statistics to a CSV file
grouped_df.to_csv('craigslist_statistics_grouped.csv', index=False)
print("\nGrouped statistics saved to 'craigslist_statistics_grouped.csv'.")

# Show the groups with at least two observations (count > 1)
print("\nFirst rows of the grouped DataFrame:")
print(grouped_df[grouped_df[f"{numeric_cols[0]}_count"] > 1])



Grouped statistics saved to 'craigslist_statistics_grouped.csv'.

First rows of the grouped DataFrame:
      manufacturer   type   fuel transmission drive paint_color  \
0            acura    SUV    gas    automatic   4wd       black   
1            acura    SUV    gas    automatic   4wd        blue   
2            acura    SUV    gas    automatic   4wd       brown   
3            acura    SUV    gas    automatic   4wd      custom   
4            acura    SUV    gas    automatic   4wd       green   
...            ...    ...    ...          ...   ...         ...   
10501        volvo  wagon    gas    automatic   rwd       white   
10507        volvo  wagon    gas        other   fwd       black   
10508        volvo  wagon    gas        other   fwd         red   
10511        volvo  wagon  other    automatic   fwd       white   
10512        volvo  wagon  other        other   fwd         red   

       odometer_mean  odometer_std  odometer_median  odometer_count  ...  \
0    

In [None]:
# Keep groups with a low CV and a high count
filtered_groups = grouped_df[
    (grouped_df['odometer_cv'] < 0.5) & (grouped_df['odometer_count'] > 10)
]
print(filtered_groups)
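
To make the grouped statistics and the coefficient of variation (CV = standard deviation / mean) concrete, here is a small self-contained sketch on made-up data. The frame `toy` and its values are hypothetical. Only the `agg`, column-flattening and CV steps mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: three Acuras and a single Volvo.
toy = pd.DataFrame({
    "manufacturer": ["acura", "acura", "acura", "volvo"],
    "odometer": [50000.0, 100000.0, 150000.0, 80000.0],
})

# Same aggregation pattern as above, on a single grouping key.
stats = toy.groupby("manufacturer")[["odometer"]].agg(["mean", "std", "median", "count"])
stats.columns = ["_".join(col) for col in stats.columns]
stats = stats.reset_index()

# CV = std / mean; a zero mean is turned into NaN to avoid dividing by zero.
stats["odometer_cv"] = stats["odometer_std"] / stats["odometer_mean"].replace(0, np.nan)
# Single-row groups have an undefined (NaN) sample std, so their CV is set to 0.
stats["odometer_cv"] = stats["odometer_cv"].fillna(0)

print(stats)
```

A group with a single row gets a CV of 0 here only because its standard deviation is undefined, not because its values are homogeneous. That is why the filter above also requires a minimum count.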


In [13]:
# Convert the grouped DataFrame into a dictionary for fast lookups
grouped_dict = grouped_df.set_index(['manufacturer', 'type', 'fuel', 'transmission', 'drive', 'paint_color']).to_dict(orient='index')


Handle missing values in categorical columns

In [14]:
for col in ['condition', 'cylinders', 'fuel', 'drive', 'type', 'paint_color']:
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which pandas warns will stop working in pandas 3.0
    df[col] = df[col].fillna(df[col].mode()[0])


Handle missing values in numeric columns

In [17]:
def imputar_faltantes(row, col):
    key = (row['manufacturer'], row['type'], row['fuel'], row['transmission'], row['drive'], row['paint_color'])
    if key in grouped_dict and f"{col}_median" in grouped_dict[key]:
        return grouped_dict[key][f"{col}_median"]
    return row[col]  # Return the original value if the combination is not found


In [18]:
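# Note: grouped_dict only holds statistics for numeric_cols (odometer, year, price),
# so 'lat' and 'long' never find a median and are left unchanged by this loop.
for col in ['year', 'odometer', 'lat', 'long']: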
    df[col] = df.apply(lambda row: imputar_faltantes(row, col) if pd.isnull(row[col]) else row[col], axis=1)
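
Row-wise `apply` over roughly 427,000 rows can be slow. A vectorized alternative, sketched here on made-up data, fills each missing value with the median of its group via `groupby(...).transform('median')`. This is not identical to the lookup above: it computes medians from all rows of the frame being imputed rather than from `df_filtered`. As in the lookup, rows whose grouping keys are missing keep their NaN. The frame `toy` and the two-column key are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical listings with gaps in 'odometer' and one missing manufacturer.
toy = pd.DataFrame({
    "manufacturer": ["acura", "acura", "acura", "volvo", "volvo", None],
    "type": ["SUV", "SUV", "SUV", "wagon", "wagon", "SUV"],
    "odometer": [40000.0, np.nan, 60000.0, 90000.0, np.nan, np.nan],
})

group_keys = ["manufacturer", "type"]

# Median of each group, broadcast back to the original row positions.
group_median = toy.groupby(group_keys)["odometer"].transform("median")

# Fill only the missing entries; fillna aligns the two Series on the index.
toy["odometer_imputed"] = toy["odometer"].fillna(group_median)

print(toy)
```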


In [19]:
# Assign the result back to avoid the chained inplace fillna that pandas 3.0 will not support
df['VIN'] = df['VIN'].fillna('unknown')
df['description'] = df['description'].fillna('no_description')
df['image_url'] = df['image_url'].fillna('no_image')
df['posting_date'] = df['posting_date'].fillna('unknown_date')

In [21]:
utils.calculate_na_statistics(df)

Unnamed: 0,datos sin NAs en q,Na en q,Na en %
manufacturer,409234,17646,4.13
title_status,418638,8242,1.93
long,420331,6549,1.53
lat,420331,6549,1.53
model,421603,5277,1.24
transmission,424324,2556,0.6
year,425681,1199,0.28
odometer,425788,1092,0.26
drive,426880,0,0.0
state,426880,0,0.0
