# Residual value model

## Imports

In [1]:
import pandas as pd
import numpy as np
import re
from openai import OpenAI
import json
import os
import math

## 1.- Load data

In [2]:
makes=["acura","audi","bmw","buick","cadillac","chevrolet","chrysler","dodge","ford","gmc","honda","hyundai","infiniti","jeep","kia","land_rover","lexus",
     "lincoln","mazda","mercedes_benz","mini","mitsubishi","nissan","porsche","ram","subaru","tesla","toyota","volkswagen","volvo"]
df_list=[]
for make in makes:
    for batch in range(1,26):
        path='cars\\'+make+'\\cars_'+make+'_'+str(batch)+'.json'
        if os.path.exists(path):
            df_list.append(pd.read_json(path))

df=pd.concat(df_list, ignore_index=True,axis=0)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,mpg,fuel_type,transmission,engine,price_USD,url
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,22–31 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,21–27 Based on EPA mileage ratings. Use for co...,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,18–26 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,29–36 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,19–27 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...


## 2.- Data Exploration and Preprocesing

### 2.1.- Removing duplicates
Durante el proceso de extracción de datos, pueden surgir duplicidades. Se verifica su existencia y, en caso afirmativo, se eliminan las filas duplicadas del DataFrame.<br> <br>
During the data extraction process, duplicates may arise. Their presence is verified, and if detected, duplicate rows are removed from the DataFrame.

In [3]:
num_duplicated_rows=sum(df.duplicated())
print(f'The number of duplicated rows is {num_duplicated_rows}')

The number of duplicated rows is 8982


In [4]:
df = df.drop_duplicates()
num_duplicated_rows=sum(df.duplicated())
print(f'The number of duplicated rows is now {num_duplicated_rows}')

The number of duplicated rows is now 0


### 2.2.- Variable description

A continuación, se realiza un análisis de alto nivel de las variables disponibles, utilizando un diccionario de datos y evaluando la cantidad de valores ausentes, valores atípicos o categorías.

#### Diccionario de datos
- **year_manufacture**: Año de fabricación del vehículo, que generalmente coincide con el año de compra.
- **years**: Antigüedad del vehículo en años, calculada en 2025. Se obtiene como la diferencia entre el año 2025 y el año de fabricación.
- **make**: Marca del vehículo.
- **model**: Modelo del vehículo.
- **mileage**: Kilometraje del vehículo, expresado en millas.
- **stock_type**: Indicador de si el vehículo es nuevo o usado.
- **interior_color**: Color del interior del vehículo, expresado en lenguaje natural.
- **exterior_color**: Color del exterior del vehículo, expresado en lenguaje natural.
- **drive_train**: Tipo de tracción del vehículo (delantera, trasera, a las cuatro ruedas).
- **mpg**: Consumo de combustible, expresado en millas por galón. Puede incluir información sobre la diferencia entre el consumo en ciudad y en carretera.
- **fuel_type**: Tipo de motorización (diésel, gasolina, híbrido, eléctrico, etc.).
- **transmission**: Tipo de transmisión. Cadena de texto que indica si es manual o automática y el número de marchas. Transmission type.
- **engine**: Cadena de caracteres que describe el motor en lenguaje natural. Puede incluir información sobre la cilindrada y el volumen. Text string that describes the engine in natural language.
- **price_USD**: Precio de venta del vehículo a la fecha de extracción de datos, expresado en dólares estadounidenses.
- **url**: Dirección web de donde se ha extraído la información.
  
Datos obtenidos en la página web [Cars.com](https://www.cars.com/) entre los días 22/03/2025 y 28/03/2025.

Next, a high-level analysis of the available variables is conducted using a data dictionary and assessing the number of missing values, outliers, or categories.
#### Data dictionary
- **year_manufacture**: Year of vehicle manufacture, which generally corresponds to the year of purchase.
- **years**: Vehicle age in years, calculated in 2025. It is obtained as the difference between the year 2025 and the year of manufacture.
- **make**: Vehicle make.
- **model**: Vehicle model.
- **mileage**: Vehicle mileage, expressed in miles.
- **stock_type**: Indicator of whether the vehicle is new or used.
- **interior_color**: Interior color of the vehicle, expressed in natural language.
- **exterior_color**: Exterior color of the vehicle, expressed in natural language.
- **drive_train**: Vehicle drivetrain type (front, rear, all-wheel drive).
- **mpg**: Fuel consumption, expressed in miles per gallon. It may include information about the difference between city and highway consumption.
- **fuel_type**: Engine type (diesel, gasoline, hybrid, electric, etc.).
- **transmission**: Transmission type. Text string indicating whether it is manual or automatic and the number of gears.
- **engine**: Text string that describes the engine in natural language. It may include information about the engine displacement and volume.
- **price_USD**: Vehicle sale price at the data extraction date, expressed in US dollars.
- **url**: Web address from which the information was extracted.

Data obtained from the website [Cars.com](https://www.cars.com/) between the dates 03/22/2025 and 03/28/2025.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 141185 entries, 0 to 150166
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   year_manufacture  141185 non-null  int64  
 1   years             141185 non-null  int64  
 2   make              141185 non-null  object 
 3   model             141185 non-null  object 
 4   mileage           141185 non-null  int64  
 5   stock_type        141185 non-null  object 
 6   interior_color    141185 non-null  object 
 7   exterior_color    141185 non-null  object 
 8   drive_train       141185 non-null  object 
 9   mpg               131855 non-null  object 
 10  fuel_type         133860 non-null  object 
 11  transmission      141185 non-null  object 
 12  engine            141185 non-null  object 
 13  price_USD         141181 non-null  float64
 14  url               141185 non-null  object 
dtypes: float64(1), int64(3), object(11)
memory usage: 17.2+ MB


#### Check the Categorical and Numerical Columns

In [6]:
# Categorical variables
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical variables:',cat_col)
# Numerical variables
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical variables:',num_col)

Categorical variables: ['make', 'model', 'stock_type', 'interior_color', 'exterior_color', 'drive_train', 'mpg', 'fuel_type', 'transmission', 'engine', 'url']
Numerical variables: ['year_manufacture', 'years', 'mileage', 'price_USD']


In [7]:
#Number of unique values for categorical variables
df[cat_col].nunique()

make                  30
model               9201
stock_type            26
interior_color      4507
exterior_color      5674
drive_train           17
mpg                  931
fuel_type             29
transmission        1019
engine              5278
url               141027
dtype: int64

In [8]:
for col in cat_col:
    print(col)
    print('Sample of some unique values:', df[col].unique()[:20],'\n')

make
Sample of some unique values: ['acura' 'audi' 'bmw' 'buick' 'cadillac' 'chevrolet' 'chrysler' 'dodge'
 'ford' 'gmc' 'honda' 'hyundai' 'infiniti' 'jeep' 'kia' 'land_rover'
 'lexus' 'lincoln' 'mazda' 'mercedes_benz'] 

model
Sample of some unique values: ['Acura TSX Base' 'Acura RDX Technology Package'
 'Acura MDX 3.5L w/Technology Package' 'Acura Integra A-SPEC'
 'Acura RDX Technology & AcuraWatch Plus Package' 'Acura TLX V6 Tech'
 'Acura Integra A-SPEC w/ Technology' 'Acura TLX V6' 'Acura ZDX A-SPEC'
 'Acura MDX Sport Hybrid SH-AWD Sport Hybrid w/Technology Pkg'
 'Acura RLX Technology Package' 'Acura MDX SH-AWD Technology'
 'Acura TLX Base' 'Acura MDX SH-AWD' 'Acura ZDX A-Spec'
 'Acura RDX SH-AWD Technology' 'Acura MDX Sport Hybrid Technology Package'
 'Acura TSX Technology' 'Acura ILX Base' 'Acura MDX 3.5L'] 

stock_type
Sample of some unique values: ['Used' 'Acura Certified' 'Certified' 'Audi Certified' 'BMW Certified'
 'Buick Certified' 'Cadillac Certified' 'Chevrolet Certified

#### Feature engineering

##### MPG

Se puede observar que este campo está compuesto por un par de valores numéricos separados por un guion, seguidos de una nota informativa: 'Based on EPA mileage ratings. Use for comparison purposes only. Actual mileage will vary depending on driving conditions, driving habits, vehicle maintenance, and other factors.' Asumiremos que, en producción, se contará con los 'EPA mileage ratings' de los vehículos, por lo que esta variable puede seguir siendo útil. De ella, obtendremos los valores de consumo en ciudad y carretera si están disponibles.

It can be observed that this field consists of a pair of numerical values separated by a hyphen, followed by an informational note: 'Based on EPA mileage ratings. Use for comparison purposes only. Actual mileage will vary depending on driving conditions, driving habits, vehicle maintenance, and other factors.' We will assume that, in production, the 'EPA mileage ratings' of the vehicles will be available, so this variable may still be useful. From it, we will obtain the values for city and highway consumption if available.

In [9]:
def keep_numbers_and_en_dash(text):
    if text is None:
        return None  # Return None if the value is None
    return re.sub(r'[^0-9–]', '', text)  # Keep numbers and en dash (–)
def get_mpg_city(text):
    if text is None:
        return None  # Return None if the value is None
    if text == "–":
        return None
    if "–" not in text:
        return float(text) if text.replace('.', '', 1).isdigit() and text.count('.') < 2 else None
    return float(text.split('–')[0]) if text.split('–')[0].replace('.', '', 1).isdigit() and text.split('–')[0].count('.') < 2 else None

def get_mpg_highway(text):
    if text is None:
        return None  # Return None if the value is None
    if text == "–":
        return None
    if "–" not in text:
        return float(text) if text.replace('.', '', 1).isdigit() and text.count('.') < 2 else None
    return float(text.split('–')[1]) if text.split('–')[1].replace('.', '', 1).isdigit() and text.split('–')[1].count('.') < 2 else None

df.loc[df['mpg'].apply(lambda x: isinstance(x, float) and np.isnan(x)),'mpg']=None # substitue NaN for None
df['auxiliar'] = df['mpg'].apply(keep_numbers_and_en_dash)
df['mpg_city'] = df['auxiliar'].apply(get_mpg_city)
df['mpg_highway'] = df['auxiliar'].apply(get_mpg_highway)



In [10]:
# Check nulls in mpg_city: All null values in mpg_city are due to the fact that mpg is None or '–'
df[df['mpg_city'].isnull()]['mpg'].unique()

array([None, '–'], dtype=object)

In [11]:
# Check nulls in mpg_highway: All null values in mpg_highway are due to the fact that mpg is None or '–'
df[df['mpg_highway'].isnull()]['mpg'].unique()

array([None, '–'], dtype=object)

In [12]:
df.drop(columns=['mpg', 'auxiliar'], inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0


#### Transmission

De esta columna, queremos derivar otras dos columna: una que nos indique el tipo de transmisión (automática, manual, [manumática](https://es.wikipedia.org/wiki/Transmisi%C3%B3n_manum%C3%A1tica), [variable](https://es.wikipedia.org/wiki/Transmisi%C3%B3n_variable_continua) o [dual-clutch](https://es.wikipedia.org/wiki/Caja_de_cambios_de_doble_embrague) ), y otra que nos indique el número de marchas del vehículo.

From this column, we want to derive two other columns: one indicating the type of transmission (automatic, manual, [manumatic](https://en.wikipedia.org/wiki/Manumatic), [variable](https://en.wikipedia.org/wiki/Continuously_variable_transmission), or [dual-clutch](https://en.wikipedia.org/wiki/Dual-clutch_transmission)), and another indicating the number of gears/speeds of the vehicle.

In [32]:
def get_speeds_of_transmission(text):
    if text is None:
        return None  # Return None if the value is None
    if "SINGLE" in text.upper():
        return 1
    if "TWO" in text.upper():
        return 1
    if text in ('4L80E', '4l60e'):
        return 4
    if text in ('TH400', 'TH350', 'TH 400'):
        return 3
    if text == '6L70E':
      return 6

    substring=re.sub(r'[^0-9]', '', text)
    if substring=="":
        return None
    if len(substring)==1:
        return float(substring)
    if len(substring)>=2:
        if substring[0]=="1":
            if float(substring)==10006: #The general logic that the first numbers are the number of gears soesn't apply for Allison 1000 6-Speed Automatic 
                return 6
            if float(substring)==1503: # nor Borg-Warner T150 3 Speed Manual
                return 3
            return float(substring)
        else:
            return float(substring[0])
    return None

def keep_letters(text):
    return re.sub(r'[^a-zA-Z]', '', text)

def get_transmission_type(text):
    if "MANUAL" in text.upper() or "M/T" in text.upper() or "SMG" in text.upper(): # SMG for bmw
        return "Manual"
    if "AUTO" in text.upper() or "A/T" in text.upper() \
    or keep_letters(text).upper()=='A' or "AU" in text.upper() or "AT" in text.upper()\
    or '6L80' in text.upper() \
    or " A" in text.upper() or "ZF" in text.upper() or 'ECT' in text.upper() \
    or text in ('4L80E','TH400', 'TH350', 'TH 400', '4l60e', '6L70E', 'EFLITE SI-EVT', 'E4OD 4R100', '8HP75'): #ZF and 8HP75 for bmw, ECT for toyota, 4L80E, 6L70E,6L80, 4l60e for GM group, TH350 and TH400 for chevrolet and buick, EFLITE SI-EVT chrysler
        return "Automatic"
    if "VARIABLE" in text.upper() or "CVT" in text.upper() or text in ('i-VT', 'IVT'): #Continuously Variable Transmission, ivt for hyundai
        return "Variable"
    if "DUAL" in text.upper() or "DOUBLE" in text.upper() or "DC" in text.upper() or "S TRONIC" in text.upper() or "S-TRONIC" in text.upper() or "PDK" in text.upper(): # Dual-Clutch Transmission (DCT) stronic for audi, PDK for Porsche
        return "Dual-clutch"
    if "GEARTRONIC" in text.upper() or "SHIFTRONIC" in text.upper() \
    or "STEPTRONIC" in text.upper() or "TRIPTONIC" in text.upper() \
    or "TIPTRONIC" in text.upper() or "SPORTRONIC" in text.upper(): #Geartronic for volvo, steptronic for bmw, Tiptronic for volkswagen group, sportronic for Mitsubishi
        return "Manumatic"
    return None
df['num_speeds']=df['transmission'].apply(get_speeds_of_transmission)
df['transmission_type']=df['transmission'].apply(get_transmission_type)

In [33]:
# Check nulls in transmission_type: All null values in transmission_type are due to the fact that transmission variable doesn't give any clue of the transmission type
df[df['transmission_type'].isnull()]['transmission'].unique()

array(['Not Specified', '9-Speed', 'Single-speed transmission', '7 speed',
       '–', '6-Speed', '5-Speed', 'Single Speed', '3-Speed',
       'Transmission Overdrive Switch', 'Single-Speed Fixed Gear',
       'Single Reduction Gear', 'Unspecified', '10 Speed', 'Sequential',
       'SINGLE-SPEED TRANSMISSION', '10-SPEED TRANSMISSION',
       'NOT SPECIFIED', '448', '8', 'Standard', '8-Speed',
       'Single-Speed Reduction Gear', 'DFT', 'Single-Speed Fi', 'Other',
       '15S', 'MOD HYBRID TRANSMISSION', 'N/A',
       '6-Spd Sport Transmission', '5SP', '5 Speed',
       'Continuously Variabl', '9SP', 'Single Speed Reducer', '5M',
       '7-Speed', 'standard', '6 Speed', 'Single Speed Transmission', 'C',
       'SINGLE-SPEED FIXED GEAR', 'Single-Speed Fixed Gear Transmission',
       '4 Speed Transmission', '1-SPEED G', 'AWD', 'FWD', 'Drivetrai'],
      dtype=object)

In [34]:
# Check nulls in num_speeds: All null values in num_speeds are due to the fact that transmission variable doesn't indicate the number of gears/speeds
df[df['num_speeds'].isnull()]['transmission'].unique()

array(['Automatic', 'CVT', 'Manual', 'Variable', 'A/T',
       'Transmission w/Dual Shift Mode', 'Automatic CVT',
       'Continuously Variable Transmission', 'Not Specified',
       'Automatic w/OD', 'Auto-Shift Manual', 'CVT Transmission',
       'Continuously Variable', 'Transmission-Auto', 'AUTO',
       'Automatic with Tiptronic', 'quattroa? s tronica?', 'M/T',
       'quattroA? S tronicA?', 'quattroA? TiptronicA?', 'A', 'AUTOMATIC',
       'quattro, s-tronic', 'Automatic w/Tiptronic', '–',
       'quattro S tronic', 'CVT with Multitronic', 'FWD, s-tronic',
       'Auto, CVT Multitronic', 'Auto', 'Automatic w/Steptronic', 'auto',
       'Steptronic', 'Automatic w/Manual Shift', 'DCT',
       'STEPTRONIC AUTOMATIC', 'automatic', 'Automatic Automatic',
       'Dynaflow  Automatic', 'Automatic, CVT', 'AT',
       'Continuously Variable (CVT)', 'CONTINUOUSLY VARIABLE (CVT)',
       'AUTO Automatic', 'Transmission Overdrive Switch', 'CVT Automatic',
       'manual', 'Unspecified', 'CVT

In [35]:
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic


Later on we'll discuss how to handle missing data and impute values

In [37]:
df.groupby(['transmission_type','num_speeds'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,transmission_type,num_speeds,Count
0,Automatic,1.0,2944
1,Automatic,2.0,177
2,Automatic,3.0,77
3,Automatic,4.0,1122
4,Automatic,5.0,1571
5,Automatic,6.0,9985
6,Automatic,7.0,3884
7,Automatic,8.0,15761
8,Automatic,9.0,6306
9,Automatic,10.0,4438


#### Drive train

In [38]:
def get_drive_train(text):
    if "FRONT" in text.upper() or 'FWD' == text:
        return "Front-wheel Drive"
    if "ALL" in text.upper() or "FOUR" in text.upper() or '4' in text or 'AWD'==text:
        return "All-wheel Drive"
    if "REAR" in text.upper() or 'RWD'==text:
        return "Rear-wheel Drive"
    return None
df['drive_train_v2']=df['drive_train'].apply(get_drive_train)


In [39]:
# Check nulls in drive_train_v2: All null values in drive_train_v2 are due to the fact that drive_train variable doesn't provide the information
df[df['drive_train_v2'].isnull()]['drive_train'].unique()

array(['–', '2WD', 'Unknown'], dtype=object)

In [40]:
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,drive_train_v2
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,Front-wheel Drive
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,All-wheel Drive
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,All-wheel Drive
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,Front-wheel Drive
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,All-wheel Drive


#### Stock type

In [43]:
def stock_type(text):
    if "CERTIFIED" in text.upper():
        return "Certified"
    if "USED" in text.upper():
        return "Used"
    if "NEW" in text.upper():
        return "New"
    return None

df['stock_type_v2']=df['stock_type'].apply(stock_type)
df['stock_type_v2'].unique()

array(['Used', 'Certified'], dtype=object)

#### Model

In [320]:
df[(df['make']=='gmc')].groupby(['model_v2'], dropna=False).size().reset_index(name='Count').sort_values(by='Count', ascending=True)


Unnamed: 0,model_v2,Count
7,Panel,1
9,Rally,1
10,Safari,3
2,Caballero,3
19,Sprint,3
14,Sierra,5
8,Pickup,5
20,Suburban,7
18,Sonoma,8
6,Jimmy,11


In [337]:
df[(df['make']=='hyundai')&(df['model_v2'].isnull())]['model'].unique()

array([], dtype=object)

In [340]:
set({word.upper().split()[1] for word in list(df[(df['make']=='infiniti')&(df['model_v2'].isnull())]['model'].unique())})
# df[(df['make']=='audi')]['model'].unique()

{'EX35',
 'EX37',
 'FX35',
 'FX37',
 'FX45',
 'FX50',
 'G20',
 'G25',
 'G25X',
 'G35',
 'G35X',
 'G37',
 'G37X',
 'I30',
 'I35',
 'J30',
 'JX35',
 'M30',
 'M35',
 'M35H',
 'M35X',
 'M37',
 'M37X',
 'M45',
 'M45X',
 'M56',
 'M56X',
 'Q40',
 'Q45',
 'Q50',
 'Q60',
 'Q70',
 'Q70H',
 'Q70L',
 'QX30',
 'QX4',
 'QX50',
 'QX55',
 'QX56',
 'QX60',
 'QX70',
 'QX80'}

In [None]:
def get_model(row):
    text=row['model']
    match row['make']:
        case 'acura':
            models =['MDX', 'TSX', 'RDX', 'TLX', 'TL', 'MDX',  'ILX', 'ZDX', 'RLX','RL', 'NSX', 'RSX','Integra', 'CL', 'Legend']
            for model in models:
                if model in text:
                    return model
            return None
        case 'audi':
            models =['A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8',
                      'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8',
                      'RS', 'TT', 
                      'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'R8', 'allroad', 'e-tron']
            for model in models:
                if model in text:
                    return model
            return None
        case 'bmw':
            models =['Gran Coupe','ALPINA',  'Isetta', 'ActiveHybrid',
                     'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'XM',
                     'Z3', 'Z4', 'Z8',
                     'M2', 'M3', 'M4', 'M5', 'M6', 'M8', 'M440', 'M235' 'M240', 'M760',
                     'i3', 'i4', 'i5', 'i7', 'i8', 'iX',
                     '128', '135', 
                     '228', '230',
                     '318', '320', '323', '328', '325', '330', '335', '340',
                     '428', '430', '435', '440', '445',
                     '525', '528', '530', '535', '540','545', '550',
                     '633', '635', '640', '645', '650', 
                     '700', '740', '745', '750', '760',
                     '840', 'M'
                     ]
            for model in models:
                if model in text:
                    return model
            return None
        case 'buick':
            models =['Encore', 'Cascada', 'Verano', 'LaCrosse', 'Regal', 'Enclave', 'Envision',
                      'Lucerne', 'Envista', 'Riviera', 'Roadmaster', 'Electra', 'Super',
                      'Century', 'Skylark', 'LeSabre', 'Rendezvous', 'Terraza', 'Park Avenue',
                      'Reatta', 'Wildcat', 'Special', 'Rainier', 'Model 27', 'Series 50 Model 56']
            for model in models:
                if model in text:
                    return model
            return 'GranSport'
        case 'cadillac':
            models =['CT4', 'CT5', 'CT6', 'CTS',
                     'XT4', 'XT5', 'XT6', 'XTS',
                      'Escalade', 'Eldorado', 'DTS', 'SRX', 'Series 62','Series 61','Series 60','XLR',
                      'STS', 'ATS', 'Allante', 'DeVille', 'Seville', 'LYRIQ',
                      'Fleetwood', 'ELR', 'Catera', 'OPTIQ', 'Brougham']
            for model in models:
                if model in text:
                    return model
            return None
        case 'chevrolet':
            models =['Malibu', 'Corvette', 'Bolt', 'Monte Carlo',
                     'Silverado', 'Caprice', 'Camaro', 'Impala',
                      'Suburban', 'Equinox', 'Captiva', 'Spark', 'Suburban','Cobalt','Traverse','Colorado',
                      'Cruze', 'Avalanche', 'Express', 'Blazer', 'Sonic', 'Tahoe',
                      'Chevelle', 'Trailblazer', 'Nomad', 'Trax', 'Aveo', 'Master', 'Fleetline', 
                      '150', '1500', '210', '2500', '3100', '3500', 
                      'Apache', 'Astro', 'Bel',  'Lumina', 'Luv', 'Nova', 'Pickup', 'Prizm',
                      'Sportvan', 'Styleline', 'Tracker', 'Uplander', 'Van', 'Venture', 'Volt',
                      'C10/K10', 'C20/K20', 'C30/K30',
                      'Cavalier', 'Chevy', 'Corvair',
                      'El Camino',
                      'Fleetmaster',
                      'HHR', 'S-10', 'SS', 'SSR'
                      ]
            for model in models:
                if model in text:
                    return model
            return None
        case 'chrysler':
            models =['Cirrus', 'Concorde', 'Crossfire', 'Imperial', 'LeBaron', 'Newport',
                      'Pacifica', 'Prowler', 'Sebring', 'Town', 'Voyager', '300', '200',
                      'PT Cruiser', 'LHS', 'TC by Maserati', 'New Yorker', 'Aspen']
            for model in models:
                if model in text:
                    return model
            return None
        case 'dodge':
            models =[ 'Avenger', 'Caliber', 'Caravan', 'Challenger', 'Charger', 'Coronet', 'Custom', 'Dakota',
                        'Dart', 'Daytona', 'Durango', 'Dynasty', 'Hornet', 'Intrepid', 'Magnum', 'Monaco',
                        'Neon', 'Nitro', 'Polara', 'Ramcharger', 'Royal', 'Sprinter', 'Stealth', 'Stratus',
                        'Super', 'Viper', 'Ram 1500', 'Ram 2500', 'Ram 3500', 'Journey', 'Ram Van', 'Ram Wagon',
                        '3/4', '600', 'Aspen', 'D150', 'D250', 'W250']
            for model in models:
                if model in text:
                    return model
            return None
            
        case 'ford':
            models =[ 'Bronco', 'Club', 'Contour', 'Crown', 'Custom', 'E-Transit',
                     'EcoSport', 'Edge', 'Escape', 'Escort', 'Excursion', 'Expedition',
                    'Explorer', 'Fairlane', 'Falcon', 'Fiesta',
                    'Flex', 'Focus', 'Freestar', 'Freestyle', 'Fusion', 'Galaxie', 'Maverick', 'Mustang', 'Parklane',
                    'Pickup', 'Pinto', 'Ranch', 'Ranchero', 'Ranger', 'Sedan', 'Shelby', 'Taurus',
                    'Thunderbird', 'Torino', 'Transit', 'Utility', 'Victoria','Windstar','E100', 'E150', 'E250',
                    'E350', 'F-150', 'F-250', 'F-350', 'F-450', 'F100', 'Five Hundred', 'C-Max', 'Five Hundred',
                    'Coupe', 'Deluxe', 'Model A', 'Model T', 'Model 78']
            for model in models:
                if model in text:
                    return model
            return None
        case 'gmc':
            models =[  'Caballero', 'Canyon', 'Envoy','HUMMER', 'Jimmy', 'Panel', 'Pickup', 'Rally', 'Safari', 'Savana 1500', 'Savana 2500',
                      'Savana 3500', 'Sierra 1500', 'Sierra 2500', 'Sierra 3500','Sierra', '1500', 'Sonoma', 'Sprint', 'Suburban', 'Terrain', 'Yukon', 'Acadia']
            for model in models:
                if model in text:
                    return model
            return None
        case 'honda':
            models =['Accord', 'Civic', 'Clarity', 'Crosstour', 'Element', 'Insight', 'Odyssey',
                     'Passport', 'Pilot', 'Prelude', 'Prologue', 'Ridgeline', 'CR-V', 'CR-Z', 
                     'CRX', 'Fit', 'HR-V', 'S2000', 'del Sol']
            for model in models:
                if model in text:
                    return model
            return None
        
        case 'hyundai':
            models =['ACCENT', 'AZERA', 'ELANTRA', 'ENTOURAGE', 'EQUUS', 'GENESIS',
                     'IONIQ', 'KONA', 'NEXO', 'PALISADE',  'SONATA', 'TIBURON',
                     'TUCSON', 'VELOSTER', 'VENUE', 'VERACRUZ', 'XG350', 'SANTA FE', 'SANTA CRUZ']
            for model in models:
                if model in text.upper():
                    return model
            return None
        case 'infiniti':
            return None
        case 'jeep':
            return None
        case 'kia':
            return None
        case 'land_rover':
            return None
        case 'lexus':
            return None
        case 'lincoln':
            return None
        case 'mazda':
            return None
        case 'mercedes_benz':
            return None
        case 'mini':
            return None
        case 'mitsubishi':
            return None
        case 'nissan':
            return None
        case 'porsche':
            return None
        case 'ram':
            return None
        case 'subaru':
            return None
        case 'tesla':
            return None
        case 'toyota':
            return None
        case 'volkswagen':
            return None
        case 'volvo':
            return None
        case _:
            return "Unknown"
        
df['model_v2'] = df.apply(get_model, axis=1)

#### Engine

In [None]:
# TODO

#### Fuel type

For the sake of simplicity, we only consider 5 categories: Electric (fuel type not informed), Gasoline, Diesel, Hybrid and Gas (includes E85 and natural gas)

En aras de la simplicidad, consideramos que hay solo cinco categorías de tipo de motorización: eléctrico (cuando el tipo de combustible no está informado), gasolina, diesel, híbrido y gas (incluye E85 y gas natural)

In [45]:
df.loc[df['fuel_type'].apply(lambda x: isinstance(x, float) and np.isnan(x)),'fuel_type']=None # substitue NaN for None
df.loc[df['fuel_type'] == '–', 'fuel_type'] = None


def get_fuel_type(text):
    if text is None:
        return 'Electric'
    if "HYBRI" in text.upper():
        return "Hybrid"
    if "GASOL" in text.upper() or "UNLEADED" in text.upper():
        return "Gasoline"
    if "DIESEL" in text.upper():
        return "Diesel"
    if "GAS" in text.upper() or 'E85' in text.upper():
        return "Gas"
    return None

df['fuel_type_v2']=df['fuel_type'].apply(get_fuel_type)
df['fuel_type'].unique()

array(['Gasoline', None, 'Hybrid', 'E85 Flex Fuel', 'Gas',
       'Plug-In Hybrid', 'Diesel', 'Flexible Fuel',
       'Gasoline / Natural Gas', 'Gasoline fuel type',
       'Gasoline/Mild Electric Hybrid', 'Plug-in Gas/Electric Hybrid',
       'Other', 'Diesel (B20 capable)', 'Unspecified', 'Gaseous', 'Flex',
       'Plug-In Hybrid Fuel', 'Regular Unleaded',
       'Compressed Natural Gas', 'Natural Gas', 'Gasoline Fuel',
       'Premium Unleaded', 'Plug-in Hybrid Electric (PHEV)',
       'PHEV (plug-in hybrid electric vehicle)', 'PHEV Hybrid Fuel',
       'mild', 'MHEV (mild hybrid electric vehicle)',
       'Gasoline/Mild Electric Hybri'], dtype=object)

In [46]:
# Check nulls in fuel_type_v2: All null values in fuel_type_v2 are due to the fact that fuel_type variable doesn't provide the information
df[df['fuel_type_v2'].isnull()]['fuel_type'].unique()

array(['Flexible Fuel', 'Other', 'Unspecified', 'Flex', 'mild'],
      dtype=object)

In [47]:
df.groupby(['fuel_type','fuel_type_v2'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,fuel_type,fuel_type_v2,Count
0,Compressed Natural Gas,Gas,2
1,Diesel,Diesel,2880
2,Diesel (B20 capable),Diesel,4
3,E85 Flex Fuel,Gas,1492
4,Flex,,1
5,Flexible Fuel,,11
6,Gas,Gas,20
7,Gaseous,Gas,6
8,Gasoline,Gasoline,122148
9,Gasoline / Natural Gas,Gasoline,27


#### Color

In [48]:
color_list=list(set((list(df['interior_color'])+list(df['interior_color']))))
len(color_list)

4507

In [84]:
# Missing percentage
round((df.isnull().sum()/df.shape[0])*100,2)

Unnamed: 0,0
year_manufacture,0.0
years,0.0
make,0.0
model,0.0
mileage,0.0
stock_type,0.0
interior_color,0.0
exterior_color,0.0
drive_train,0.0
fuel_type,10.35
