# Residual value model

## Imports

In [1]:
import pandas as pd
import numpy as np
import re
import os
import matplotlib.pyplot as plt
import seaborn as sns
import ast

## 1.- Load data

In [2]:
makes=["acura","audi","bmw","buick","cadillac","chevrolet","chrysler","dodge","ford","gmc","honda","hyundai","infiniti","jeep","kia","land_rover","lexus",
     "lincoln","mazda","mercedes_benz","mini","mitsubishi","nissan","porsche","ram","subaru","tesla","toyota","volkswagen","volvo"]
df_list=[]
for make in makes:
    for batch in range(1,26):
        path='cars\\'+make+'\\cars_'+make+'_'+str(batch)+'.json'
        if os.path.exists(path):
            df_list.append(pd.read_json(path))

df=pd.concat(df_list, ignore_index=True,axis=0)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,mpg,fuel_type,transmission,engine,price_USD,url
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,22–31 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,21–27 Based on EPA mileage ratings. Use for co...,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,18–26 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,29–36 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,19–27 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...


## 2.- Data Exploration and Preprocesing

### 2.1.- Removing duplicates
Durante el proceso de extracción de datos, pueden surgir duplicidades. Se verifica su existencia y, en caso afirmativo, se eliminan las filas duplicadas del DataFrame.<br> <br>
During the data extraction process, duplicates may arise. Their presence is verified, and if detected, duplicate rows are removed from the DataFrame.

In [3]:
num_duplicated_rows=sum(df.duplicated())
print(f'The number of duplicated rows is {num_duplicated_rows}')

The number of duplicated rows is 8982


In [4]:
df = df.drop_duplicates()
num_duplicated_rows=sum(df.duplicated())
print(f'The number of duplicated rows is now {num_duplicated_rows}')

The number of duplicated rows is now 0


### 2.2.- Variable description

A continuación, se realiza un primer análisis a muy alto nivel de las variables disponibles, utilizando un diccionario de datos y evaluando la cantidad de valores ausentes, valores atípicos o categorías.

#### Diccionario de datos
- **year_manufacture**: Año de fabricación del vehículo, que generalmente coincide con el año de compra.
- **years**: Antigüedad del vehículo en años, calculada en 2025. Se obtiene como la diferencia entre el año 2025 y el año de fabricación.
- **make**: Marca del vehículo.
- **model**: Modelo del vehículo.
- **mileage**: Kilometraje del vehículo, expresado en millas.
- **stock_type**: Indicador de si el vehículo es nuevo o usado.
- **interior_color**: Color del interior del vehículo, expresado en lenguaje natural.
- **exterior_color**: Color del exterior del vehículo, expresado en lenguaje natural.
- **drive_train**: Tipo de tracción del vehículo (delantera, trasera, a las cuatro ruedas).
- **mpg**: Consumo de combustible, expresado en millas por galón. Puede incluir información sobre la diferencia entre el consumo en ciudad y en carretera.
- **fuel_type**: Tipo de motorización (diésel, gasolina, híbrido, eléctrico, etc.).
- **transmission**: Tipo de transmisión. Cadena de texto que indica si es manual o automática y el número de marchas.
- **engine**: Cadena de caracteres que describe el motor en lenguaje natural. Puede incluir información sobre la cilindrada, el volumen o los caballos.
- **price_USD**: Precio de venta del vehículo a la fecha de extracción de datos, expresado en dólares estadounidenses.
- **url**: Dirección web de donde se ha extraído la información.
  
Datos obtenidos en la página web [Cars.com](https://www.cars.com/) entre los días 22/03/2025 y 28/03/2025 mediante web scraping (ver el [código](https://github.com/jcllanu/residual-value/blob/main/web_scraping.py)).

Next, a high-level analysis of the available variables is conducted using a data dictionary and assessing the number of missing values, outliers, or categories.
#### Data dictionary
- **year_manufacture**: Year of vehicle manufacture, which generally corresponds to the year of purchase.
- **years**: Vehicle age in years, calculated in 2025. It is obtained as the difference between the year 2025 and the year of manufacture.
- **make**: Vehicle make.
- **model**: Vehicle model.
- **mileage**: Vehicle mileage, expressed in miles.
- **stock_type**: Indicator of whether the vehicle is new or used.
- **interior_color**: Interior color of the vehicle, expressed in natural language.
- **exterior_color**: Exterior color of the vehicle, expressed in natural language.
- **drive_train**: Vehicle drivetrain type (front, rear, all-wheel drive).
- **mpg**: Fuel consumption, expressed in miles per gallon. It may include information about the difference between city and highway consumption.
- **fuel_type**: Engine type (diesel, gasoline, hybrid, electric, etc.).
- **transmission**: Transmission type. Text string indicating whether it is manual or automatic and the number of gears.
- **engine**: Text string that describes the engine in natural language. It may include information about the engine displacement and volume.
- **price_USD**: Vehicle sale price at the data extraction date, expressed in US dollars.
- **url**: Web address from which the information was extracted.

Data obtained from the website [Cars.com](https://www.cars.com/) between the dates 03/22/2025 and 03/28/2025 using web scraping techniques (see the [code](https://github.com/jcllanu/residual-value/blob/main/web_scraping.py)).

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 141185 entries, 0 to 150166
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   year_manufacture  141185 non-null  int64  
 1   years             141185 non-null  int64  
 2   make              141185 non-null  object 
 3   model             141185 non-null  object 
 4   mileage           141185 non-null  int64  
 5   stock_type        141185 non-null  object 
 6   interior_color    141185 non-null  object 
 7   exterior_color    141185 non-null  object 
 8   drive_train       141185 non-null  object 
 9   mpg               131855 non-null  object 
 10  fuel_type         133860 non-null  object 
 11  transmission      141185 non-null  object 
 12  engine            141185 non-null  object 
 13  price_USD         141181 non-null  float64
 14  url               141185 non-null  object 
dtypes: float64(1), int64(3), object(11)
memory usage: 17.2+ MB


#### Check the Categorical and Numerical Columns

In [6]:
# Categorical variables
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical variables:',cat_col)
# Numerical variables
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical variables:',num_col)

Categorical variables: ['make', 'model', 'stock_type', 'interior_color', 'exterior_color', 'drive_train', 'mpg', 'fuel_type', 'transmission', 'engine', 'url']
Numerical variables: ['year_manufacture', 'years', 'mileage', 'price_USD']


In [7]:
#Number of unique values for categorical variables
df[cat_col].nunique()

make                  30
model               9201
stock_type            26
interior_color      4507
exterior_color      5674
drive_train           17
mpg                  931
fuel_type             29
transmission        1019
engine              5278
url               141027
dtype: int64

Se observa que algunas variables categóricas presentan una gran variedad de valores distintos, lo cual se debe a que están expresadas en lenguaje natural. Será necesario aplicar un procesamiento de datos riguroso para tratarlas adecuadamente.

It is observed that some categorical variables exhibit a wide range of distinct values, which is due to their expression in natural language. A thorough data processing approach will be required to handle them appropriately.

In [8]:
for col in cat_col:
    print('Variable:',col)
    print('Sample of some unique values:', df[col].unique()[:20],'\n')

Variable: make
Sample of some unique values: ['acura' 'audi' 'bmw' 'buick' 'cadillac' 'chevrolet' 'chrysler' 'dodge'
 'ford' 'gmc' 'honda' 'hyundai' 'infiniti' 'jeep' 'kia' 'land_rover'
 'lexus' 'lincoln' 'mazda' 'mercedes_benz'] 

Variable: model
Sample of some unique values: ['Acura TSX Base' 'Acura RDX Technology Package'
 'Acura MDX 3.5L w/Technology Package' 'Acura Integra A-SPEC'
 'Acura RDX Technology & AcuraWatch Plus Package' 'Acura TLX V6 Tech'
 'Acura Integra A-SPEC w/ Technology' 'Acura TLX V6' 'Acura ZDX A-SPEC'
 'Acura MDX Sport Hybrid SH-AWD Sport Hybrid w/Technology Pkg'
 'Acura RLX Technology Package' 'Acura MDX SH-AWD Technology'
 'Acura TLX Base' 'Acura MDX SH-AWD' 'Acura ZDX A-Spec'
 'Acura RDX SH-AWD Technology' 'Acura MDX Sport Hybrid Technology Package'
 'Acura TSX Technology' 'Acura ILX Base' 'Acura MDX 3.5L'] 

Variable: stock_type
Sample of some unique values: ['Used' 'Acura Certified' 'Certified' 'Audi Certified' 'BMW Certified'
 'Buick Certified' 'Cadillac C

#### Feature engineering

##### MPG

Se puede observar que este campo está compuesto por un par de valores numéricos separados por un guion, seguidos de una nota informativa: 'Based on EPA mileage ratings. Use for comparison purposes only. Actual mileage will vary depending on driving conditions, driving habits, vehicle maintenance, and other factors.' Asumiremos que, en producción, se contará con los 'EPA mileage ratings' de los vehículos, por lo que esta variable puede seguir siendo útil. De ella, obtendremos los valores de consumo en ciudad y carretera si están disponibles.

It can be observed that this field consists of a pair of numerical values separated by a hyphen, followed by an informational note: 'Based on EPA mileage ratings. Use for comparison purposes only. Actual mileage will vary depending on driving conditions, driving habits, vehicle maintenance, and other factors.' We will assume that, in production, the 'EPA mileage ratings' of the vehicles will be available, so this variable may still be useful. From it, we will obtain the values for city and highway consumption if available.

In [9]:
def keep_numbers_and_en_dash(text):
    if text is None:
        return None  # Return None if the value is None
    return re.sub(r'[^0-9–]', '', text)  # Keep numbers and en dash (–)

def get_mpg_city(text):
    if text is None:
        return None  # Return None if the value is None
    if text == "–":
        return None # No information available
    if "–" not in text:
        return float(text) if text.replace('.', '', 1).isdigit() and text.count('.') < 2 \
            else None # If the text is a decimal number, return the number
    return float(text.split('–')[0]) if text.split('–')[0].replace('.', '', 1).isdigit() \
        and text.split('–')[0].count('.') < 2 else None # Since the text contains an hyphen, 
                                                        # return what's on the left-hand
                                                        # side if it is a decimal number

def get_mpg_highway(text):
    if text is None:
        return None  # Return None if the value is None
    if text == "–":
        return None # No information available
    if "–" not in text:
        return float(text) if text.replace('.', '', 1).isdigit() and text.count('.') < 2 \
            else None # If the text is a decimal number, return the number
    return float(text.split('–')[1]) if text.split('–')[1].replace('.', '', 1).isdigit() \
        and text.split('–')[1].count('.') < 2 else None # Since the text contains an hyphen, 
                                                        # return what's on the right-hand
                                                        # side if it is a decimal number

df.loc[df['mpg'].apply(lambda x: isinstance(x, float) and np.isnan(x)),'mpg']=None # substitue NaN for None
df['auxiliar'] = df['mpg'].apply(keep_numbers_and_en_dash)
df['mpg_city'] = df['auxiliar'].apply(get_mpg_city)
df['mpg_highway'] = df['auxiliar'].apply(get_mpg_highway)

In [10]:
# Check nulls in mpg_city: All null values in mpg_city are due to the fact that mpg is None or '–'
df[df['mpg_city'].isnull()]['mpg'].unique()

array([None, '–'], dtype=object)

In [11]:
# Check nulls in mpg_highway: All null values in mpg_highway are due to the fact that mpg is None or '–'
df[df['mpg_highway'].isnull()]['mpg'].unique()

array([None, '–'], dtype=object)

In [12]:
# Remove useless colums and display the df
df.drop(columns=['mpg', 'auxiliar'], inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0


#### Transmission

De esta columna, queremos derivar otras dos columna: una que nos indique el tipo de transmisión (automática, manual, [manumática](https://es.wikipedia.org/wiki/Transmisi%C3%B3n_manum%C3%A1tica), [variable](https://es.wikipedia.org/wiki/Transmisi%C3%B3n_variable_continua) o [dual-clutch](https://es.wikipedia.org/wiki/Caja_de_cambios_de_doble_embrague) ), y otra que nos indique el número de marchas del vehículo.

From this column, we want to derive two other columns: one indicating the type of transmission (automatic, manual, [manumatic](https://en.wikipedia.org/wiki/Manumatic), [variable](https://en.wikipedia.org/wiki/Continuously_variable_transmission), or [dual-clutch](https://en.wikipedia.org/wiki/Dual-clutch_transmission)), and another indicating the number of gears/speeds of the vehicle.

In [13]:
def get_speeds_of_transmission(text):
    if text is None:
        return None  # Return None if the value is None
    
    # Check some particular cases individually
    if "SINGLE" in text.upper():
        return 1
    if "TWO" in text.upper():
        return 1
    if text in ('4L80E', '4l60e'):
        return 4
    if text in ('TH400', 'TH350', 'TH 400'):
        return 3
    if text == '6L70E':
      return 6

    # The general logic of the strings is that the first 
    # number of the string identifies the number of speeds
    substring = re.sub(r'[^0-9]', '', text) #Obtain the numbers in the text
    if substring=="":
        return None # Unable to identify the number of speeds
    if len(substring)==1:
        return int(substring) # return the number if it has 1 digit
    if len(substring)>=2:
        if substring[0]=="1":
            if int(substring)==10006: # The general doesn't apply for 'Allison 1000 6-Speed Automatic' 
                return 6
            if int(substring)==1503: # nor 'Borg-Warner T150 3 Speed Manual'
                return 3
            return int(substring) # Surprisingly, there are cars with 10 or more speeds
        else:
            return int(substring[0]) #If there are more than 2 digits, return the first one
    return None

def keep_letters(text):
    return re.sub(r'[^a-zA-Z]', '', text) #Auxiliar function to obtain just the letters of a string

def get_transmission_type(text):
    # General cases in which the transmission type is explicitly stated combined with
    # particular cases in which it has to be derived from the commercial name of the 
    # transmission 

    if "MANUAL" in text.upper() or "M/T" in text.upper() or "SMG" in text.upper(): # SMG for bmw
        return "Manual"
    
    if "AUTO" in text.upper() or "A/T" in text.upper() \
    or keep_letters(text).upper()=='A' or "AU" in text.upper() or "AT" in text.upper()\
    or '6L80' in text.upper() or " A" in text.upper() or "ZF" in text.upper() \
    or 'ECT' in text.upper() or text in ('4L80E','TH400', 'TH350', 'TH 400', \
    '4l60e', '6L70E', 'EFLITE SI-EVT', 'E4OD 4R100', '8HP75'): #ZF and 8HP75 for bmw, ECT for toyota, 4L80E, 
    # 6L70E,6L80, 4l60e for GM group, TH350 and TH400 for chevrolet and buick, EFLITE SI-EVT chrysler
        return "Automatic"
    
    if "VARIABLE" in text.upper() or "CVT" in text.upper() or text in ('i-VT', 'IVT'): 
        #Continuously Variable Transmission, ivt for hyundai
        return "Variable"
    
    if "DUAL" in text.upper() or "DOUBLE" in text.upper() or "DC" in text.upper() \
        or "S TRONIC" in text.upper() or "S-TRONIC" in text.upper() \
        or "PDK" in text.upper(): # Dual-Clutch Transmission (DCT), stronic for audi, PDK for Porsche
        return "Dual-clutch"
    
    if "GEARTRONIC" in text.upper() or "SHIFTRONIC" in text.upper() \
    or "STEPTRONIC" in text.upper() or "TRIPTONIC" in text.upper() \
    or "TIPTRONIC" in text.upper() or "SPORTRONIC" in text.upper(): 
        #Geartronic for volvo, steptronic for bmw, Tiptronic for 
        # volkswagen group, sportronic for Mitsubishi
        return "Manumatic"
    return None

df['num_speeds']=df['transmission'].apply(get_speeds_of_transmission)
df['transmission_type']=df['transmission'].apply(get_transmission_type)

In [14]:
# Check nulls in transmission_type: All null values in transmission_type are due to the fact that transmission variable doesn't give any clue of the transmission type
df[df['transmission_type'].isnull()]['transmission'].unique()

array(['Not Specified', '9-Speed', 'Single-speed transmission', '7 speed',
       '–', '6-Speed', '5-Speed', 'Single Speed', '3-Speed',
       'Transmission Overdrive Switch', 'Single-Speed Fixed Gear',
       'Single Reduction Gear', 'Unspecified', '10 Speed', 'Sequential',
       'SINGLE-SPEED TRANSMISSION', '10-SPEED TRANSMISSION',
       'NOT SPECIFIED', '448', '8', 'Standard', '8-Speed',
       'Single-Speed Reduction Gear', 'DFT', 'Single-Speed Fi', 'Other',
       '15S', 'MOD HYBRID TRANSMISSION', 'N/A',
       '6-Spd Sport Transmission', '5SP', '5 Speed',
       'Continuously Variabl', '9SP', 'Single Speed Reducer', '5M',
       '7-Speed', 'standard', '6 Speed', 'Single Speed Transmission', 'C',
       'SINGLE-SPEED FIXED GEAR', 'Single-Speed Fixed Gear Transmission',
       '4 Speed Transmission', '1-SPEED G', 'AWD', 'FWD', 'Drivetrai'],
      dtype=object)

In [None]:
# Check nulls in num_speeds: All null values in num_speeds are due to the fact that transmission variable doesn't explicitly indicate the number of gears/speeds
df[df['num_speeds'].isnull()]['transmission'].unique()

array(['Automatic', 'CVT', 'Manual', 'Variable', 'A/T',
       'Transmission w/Dual Shift Mode', 'Automatic CVT',
       'Continuously Variable Transmission', 'Not Specified',
       'Automatic w/OD', 'Auto-Shift Manual', 'CVT Transmission',
       'Continuously Variable', 'Transmission-Auto', 'AUTO',
       'Automatic with Tiptronic', 'quattroa? s tronica?', 'M/T',
       'quattroA? S tronicA?', 'quattroA? TiptronicA?', 'A', 'AUTOMATIC',
       'quattro, s-tronic', 'Automatic w/Tiptronic', '–',
       'quattro S tronic', 'CVT with Multitronic', 'FWD, s-tronic',
       'Auto, CVT Multitronic', 'Auto', 'Automatic w/Steptronic', 'auto',
       'Steptronic', 'Automatic w/Manual Shift', 'DCT',
       'STEPTRONIC AUTOMATIC', 'automatic', 'Automatic Automatic',
       'Dynaflow  Automatic', 'Automatic, CVT', 'AT',
       'Continuously Variable (CVT)', 'CONTINUOUSLY VARIABLE (CVT)',
       'AUTO Automatic', 'Transmission Overdrive Switch', 'CVT Automatic',
       'manual', 'Unspecified', 'CVT

In [16]:
df.drop(columns=['transmission'], inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic


Más adelante se abordará el tratamiento de los valores faltantes y la imputación correspondiente. No obstante, antes de ello, se puede observar que la mayoría de los valores ausentes en la variable num_speeds se concentra en los vehículos identificados como automáticos, lo cual resulta coherente.

The treatment of missing values and the corresponding imputation will be addressed later. However, prior to that, it can be observed that most of the missing values in the num_speeds variable are concentrated in vehicles identified as automatic, which is logically consistent.

In [17]:
df.groupby(['transmission_type','num_speeds'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,transmission_type,num_speeds,Count
0,Automatic,1.0,2944
1,Automatic,2.0,177
2,Automatic,3.0,77
3,Automatic,4.0,1122
4,Automatic,5.0,1571
5,Automatic,6.0,9985
6,Automatic,7.0,3884
7,Automatic,8.0,15761
8,Automatic,9.0,6306
9,Automatic,10.0,4438


#### Drive train

El tratamiento propuesto para esta variable consiste en reducir el conjunto de posibles valores a tres categorías: tracción delantera, tracción trasera y tracción a las cuatro ruedas.

The proposed treatment for this variable involves reducing the range of possible values to three categories: front-wheel drive, rear-wheel drive, and all-wheel drive.

In [18]:
def get_drive_train(text):
    if "FRONT" in text.upper() or 'FWD' == text:
        return "Front-wheel Drive"
    if "ALL" in text.upper() or "FOUR" in text.upper() or '4' in text or 'AWD'==text:
        return "All-wheel Drive"
    if "REAR" in text.upper() or 'RWD'==text:
        return "Rear-wheel Drive"
    return None
df['drive_train_v2']=df['drive_train'].apply(get_drive_train)


In [19]:
# Check nulls in drive_train_v2: All null values in drive_train_v2 are due to the fact that drive_train variable doesn't provide the information
df[df['drive_train_v2'].isnull()]['drive_train'].unique()

array(['–', '2WD', 'Unknown'], dtype=object)

In [20]:
# We remove the original variable and substitute it for the processed one
df.drop(columns=['drive_train'], inplace=True)
df.rename(columns={'drive_train_v2': 'drive_train'}, inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,fuel_type,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,drive_train
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Gasoline,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,Front-wheel Drive
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,Gasoline,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,All-wheel Drive
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,Gasoline,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,All-wheel Drive
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Gasoline,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,Front-wheel Drive
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,Gasoline,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,All-wheel Drive


#### Stock type

El tratamiento propuesto para esta variable consiste en agrupar los valores posibles en tres categorías: usado, nuevo y certificado.

1.- Inspeccionado – Ha pasado una inspección detallada de múltiples puntos (a menudo entre 100 y más de 200 elementos) realizada por el fabricante o el concesionario.

2.- Reacondicionado o restaurado – Cualquier problema encontrado durante la inspección ha sido corregido para cumplir con ciertos estándares de calidad.

3.- Respaldado por una garantía – Viene con una garantía extendida del fabricante o del concesionario, a menudo más allá de la garantía original de fábrica.

4.- Generalmente más nuevo y con menor kilometraje – Los autos certificados suelen tener una antigüedad limitada (por ejemplo, menos de 5 años) y un kilometraje restringido (por ejemplo, menos de 60,000 km).

Si bien la condición de stock no es una característica intrínseca del vehículo que pueda incluirse como variable en el modelo —ya que todos los datos de entrada corresponderán a vehículos nuevos y no es posible anticipar si, con el tiempo, pasarán a ser certificados—, no es necesario excluir los vehículos certificados del conjunto de datos históricos para la construcción del modelo.

The proposed treatment for this variable involves grouping the possible values into three categories: used, new, and certified. Certified vehicles are typically characterized by the following:

1.- Thorough inspection: They have undergone a detailed multi-point inspection (often covering 100 to over 200 items) performed by the manufacturer or dealership.

2.- Reconditioning or restoration: Any issues identified during the inspection have been addressed to meet specific quality standards.

3.- Extended warranty: They come with an additional warranty provided by the manufacturer or dealership, often extending beyond the original factory warranty.

4.- Generally newer and with lower mileage: Certified vehicles typically have limited age (e.g., less than 5 years) and mileage (e.g., under 60,000 miles).

Although stock status is not an intrinsic characteristic of the vehicle and therefore cannot be included as a model input—since all inputs will represent new vehicles, and it's not possible to know in advance whether they will later qualify as certified—it is not necessary to exclude certified vehicles from the historical dataset used to train the model.

In [21]:
def stock_type(text):
    if "CERTIFIED" in text.upper():
        return "Certified"
    if "USED" in text.upper():
        return "Used"
    if "NEW" in text.upper():
        return "New"
    return None

df['stock_type_v2']=df['stock_type'].apply(stock_type)
df['stock_type_v2'].unique()

array(['Used', 'Certified'], dtype=object)

In [22]:
# We remove the original variable and substitute it for the processed one
df.drop(columns=['stock_type'], inplace=True)
df.rename(columns={'stock_type_v2': 'stock_type'}, inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,interior_color,exterior_color,fuel_type,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,drive_train,stock_type
0,2006,19,acura,Acura TSX Base,76497,Biege,Green Pearl,Gasoline,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,Front-wheel Drive,Used
1,2021,4,acura,Acura RDX Technology Package,54931,Espresso,Majestic Black Pearl,Gasoline,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,All-wheel Drive,Used
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Ebony,Modern Steel Metallic,Gasoline,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,All-wheel Drive,Used
3,2024,1,acura,Acura Integra A-SPEC,17309,Ebony,Platinum White Pearl,Gasoline,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,Front-wheel Drive,Used
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Ebony,White Diamond Pearl,Gasoline,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,All-wheel Drive,Used


#### Vehicle model

A simple vista, se observa que el nombre de los modelos contiene un nivel de detalle excesivo, lo que puede dificultar la capacidad del modelo de aprendizaje automático para generalizar adecuadamente. Por ello, se propone reducir el conjunto de posibles valores de la variable model para cada marca a una lista elaborada con criterio experto, garantizando que todo modelo reciba una nueva clasificación. Es decir, para dos modelos como 'Acura RDX Technology Package' y 'Acura RDX Technology & AcuraWatch Plus Package', ambos, con make='Acura', se clasificarían bajo la nueva categoría model_v2 = 'RDX'. Esta simplificación tiene como objetivo mejorar la capacidad de generalización del modelo, reduciendo el número de categorías. No obstante, el número total de valores únicos para los modelos seguirá siendo elevado (alrededor de 700). Por esta razón, se propone extraer algunas características adicionales de cada modelo de vehículo. Para simplificar, en nuestro caso, se asignará un tipo de vehículo a cada modelo, como se detallará a continuación, y dicho tipo podrá tomar solo unos pocos valores: camioneta, compacto, deportivo, entre otros.

Se podría plantear la inclusión de otras variables derivadas de la variable model, como indicadores que indiquen si las cadenas de texto 'sport', 'luxury', 'technology' u otras aparecen como substrings dentro de model. Sin embargo, por simplicidad, se ha decidido prescindir de estas variables.

At first glance, it is evident that the model names contain excessive detail, which may hinder the machine learning model's ability to generalize effectively. Therefore, it is proposed to reduce the set of possible values for the model variable for each brand to a list created based on expert criteria, ensuring that all models receive a new classification. For instance, two models such as 'Acura RDX Technology Package' and 'Acura RDX Technology & AcuraWatch Plus Package', both with make='Acura', would be classified under the new category model_v2 = 'RDX'. This simplification is intended to improve the model's generalization by reducing the number of categories. However, the total number of unique model values will still be high (around 700). For this reason, it is proposed to extract additional features for each vehicle model. For simplicity, in our case, a vehicle type will be assigned to each model, as detailed below, and this type will only take a few possible values: SUV, compact, sport-car, among others.

The inclusion of additional variables derived from the model variable could be considered, such as indicators that specify whether the strings 'sport', 'luxury', 'technology', or others appear as substrings within model. However, for simplicity, it has been decided to omit these variables.

In [24]:
# Open the file containing the data
with open('model_vehicle_type.txt', 'r') as file:
    content = file.read()
models_dictionary = ast.literal_eval(content)
models_dictionary

{'acura': {'MDX': 'SUV',
  'TSX': 'Compact',
  'RDX': 'SUV',
  'TLX': 'Mid-size',
  'TL': 'Mid-size',
  'ILX': 'Compact',
  'ZDX': 'SUV',
  'RLX': 'Large',
  'RL': 'Large',
  'NSX': 'Sport-car',
  'RSX': 'Compact',
  'INTEGRA': 'Compact',
  'CL': 'Mid-size',
  'LEGEND': 'Large'},
 'audi': {'A2': 'Subcompact',
  'A3': 'Compact',
  'A4': 'Mid-size',
  'A5': 'Mid-size',
  'A6': 'Large',
  'A7': 'Large',
  'A8': 'Large',
  'Q3': 'SUV',
  'Q4': 'SUV',
  'Q5': 'SUV',
  'Q6': 'SUV',
  'Q7': 'SUV',
  'Q8': 'SUV',
  'RS': 'Sport-car',
  'TT': 'Sport-car',
  'S3': 'Compact',
  'S4': 'Mid-size',
  'S5': 'Mid-size',
  'S6': 'Large',
  'S7': 'Large',
  'S8': 'Large',
  'R8': 'Sport-car',
  'ALLROAD': 'Mid-size',
  'E-TRON': 'SUV'},
 'bmw': {'GRAN COUPE': 'Mid-size',
  'ALPINA': 'Mid-size',
  'ISETTA': 'Subcompact',
  'ACTIVEHYBRID': 'Mid-size',
  'X1': 'SUV',
  'X2': 'SUV',
  'X3': 'SUV',
  'X4': 'SUV',
  'X5': 'SUV',
  'X6': 'SUV',
  'X7': 'SUV',
  'XM': 'SUV',
  'Z3': 'Sport-car',
  'Z4': 'Sport-

In [25]:
def get_model(row):
    text=row['model'].upper()
    # For a given make, obtain all possible models from a closed list
    models =[model.upper() for model in list(models_dictionary[row['make']].keys())]
    for model in models:
        if model in text: # Check if the simplified model is contained in the
                          # original model and over-detailed text
            return model
        
    if row['make']=='buick': # Particular case
        return 'GRANSPORT'
    return None
      
df['model_v2'] = df.apply(get_model, axis=1)

In [26]:
# Check nulls in model_v2: There are not
df[(df['model_v2'].isnull())]['model'].unique()

array([], dtype=object)

In [30]:
original_num_models=len(df['model'].unique())
print(f"The original number of different car models was {original_num_models}")
num_models=len(df['model_v2'].unique())
print(f"The number of different car models considered now is {num_models}")

The original number of different car models was 9201
The number of different car models considered now is 704


#### Vehicle type

Como se anticipó, el número de modelos diferentes ha sido reducido considerablemente, de aproximadamente 9,200 a unos 700. Sin embargo, este número aún puede resultar elevado para utilizarlo como una variable final en el modelo. Por ello, se propone como alternativa incorporar las características de cada modelo como variables, siempre que sean conocidas y no puedan ser personalizadas por el comprador. Ejemplos de estas características incluyen: tipo de vehículo, altura, anchura, longitud, tonelaje, tamaño del maletero, número de asientos, entre otros. No obstante, se excluyen atributos como el color o los llamados 'extras' (llantas, lunas tintadas, asientos con ventilación...).

En nuestro caso, dada la escasez de datos, las limitaciones de tiempo y el alcance del proyecto, por simplicidad, hemos decidido derivar únicamente una variable del modelo de vehículo (y marca). Esta variable es el tipo de vehículo, que puede tomar los valores: subcompacto, compacto, mediano, grande, deportivo, SUV, furgoneta y camioneta.

Para etiquetar los datos, es decir, asignar a cada modelo de cada marca el valor de la variable vehicle_type, se ha utilizado un modelo LLM, con el objetivo de agilizar una tarea rutinaria y fácilmente automatizable. El prompt utilizado se puede encontrar [aquí](https://github.com/jcllanu/residual-value/blob/main/car_classification_prompt.txt), y el modelo LLM empleado es el 4o mini de ChatGPT. De manera manual, se ha verificado que esta clasificación es coherente para aquellos modelos de vehículos con mayor porcentaje de población, con el fin de corregir alucinaciones en caso de que existieran.

As anticipated, the number of distinct models has been significantly reduced, from approximately 9,200 to around 700. However, this number may still be too high to use as a final variable in the model. Therefore, as an alternative, it is proposed to incorporate characteristics of each model as variables, provided they are known and cannot be customized by the buyer. Examples of such characteristics include: vehicle type, height, width, length, weight, trunk size, number of seats, among others. However, attributes such as color or so called 'extras' (tinted windows, refrigerated seats, alloy wheels, etc. ) are excluded.

In our case, given the scarcity of data, time constraints, and the scope of the project, for simplicity, we have decided to derive only one variable from the vehicle model (and brand). This variable is vehicle type, which can take the following values: Subcompact, Compact, Mid-size, Large, Sport-car, SUV, Van, and Truck.

To label the data, i.e., to assign the vehicle_type value to each model of each brand, an LLM model was used to speed up a routine and easily automatable task. The prompt used can be found [here](https://github.com/jcllanu/residual-value/blob/main/car_classification_prompt.txt), and the LLM model employed is ChatGPT's 4o mini. A manual verification was performed to ensure the classification is consistent for those vehicle models with the highest population percentage, in order to correct any potential hallucinations.

In [31]:
def get_vehicle_type(row):
    # Search in the labeled data the type of vehicle for each model and make
    return models_dictionary[row['make']][row['model_v2']] 
      
df['vehicle_type'] = df.apply(get_vehicle_type, axis=1)

In [32]:
# Save the information in an excel file to check for possible hallucinations
df.groupby(['make','model_v2','vehicle_type'], dropna=False).size().reset_index(name='Count').to_excel('vehicle_types.xlsx', index=False)

In [33]:
# Most populated models with their vehicle type assignation
df.groupby(['make','model_v2','vehicle_type'], dropna=False).size().reset_index(name='Count').sort_values(by='Count', ascending=False).head(50)

Unnamed: 0,make,model_v2,vehicle_type,Count
573,mitsubishi,OUTLANDER,SUV,2754
619,ram,1500,Truck,2600
219,chrysler,PACIFICA,SUV,2161
455,land_rover,RANGE ROVER,SUV,2115
561,mini,HARDTOP,Compact,1762
559,mini,COUNTRYMAN,SUV,1726
103,buick,ENCORE,SUV,1687
501,mazda,CX-5,SUV,1687
4,acura,MDX,SUV,1578
427,jeep,WRANGLER,SUV,1491


In [39]:
# We remove the original variable and substitute it for the processed one
df.drop(columns=['model'], inplace=True)
df.rename(columns={'model_v2': 'model'}, inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,mileage,interior_color,exterior_color,fuel_type,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,drive_train,stock_type,model,vehicle_type,engine_displacement,engine_cylinders
0,2006,19,acura,76497,Biege,Green Pearl,Gasoline,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,Front-wheel Drive,Used,TSX,Compact,2.4,4.0
1,2021,4,acura,54931,Espresso,Majestic Black Pearl,Gasoline,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,All-wheel Drive,Used,RDX,SUV,2.0,4.0
2,2017,8,acura,103720,Ebony,Modern Steel Metallic,Gasoline,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,All-wheel Drive,Used,MDX,SUV,3.5,6.0
3,2024,1,acura,17309,Ebony,Platinum White Pearl,Gasoline,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,Front-wheel Drive,Used,INTEGRA,Compact,1.5,4.0
4,2017,8,acura,66552,Ebony,White Diamond Pearl,Gasoline,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,All-wheel Drive,Used,RDX,SUV,3.5,6.0


#### Engine

De la variable engine, que está expresada en lenguaje natural, se pretende extraer tres variables que están disponibles en algunos de los casos: la cilindrada, el número de cilindros y la potencia en caballos de fuerza.

From the engine variable, which is expressed in natural language, the goal is to extract three variables that are provided in some cases: engine displacement, the number of cylinders, and horsepower.

In [127]:
df['engine'].unique()[:50]

array(['2.4L I-4 DOHC, i-VTEC variable valve control, premium unleaded,',
       '2L I-4 gasoline direct injection, DOHC, VTEC variable valve cont',
       '3.5L V-6 gasoline direct injection, i-VTEC variable valve contro',
       '1.5L I-4 gasoline direct injection, DOHC, VTEC variable valve co',
       '3.5L V-6 i-VTEC variable valve control, premium unleaded, engine',
       '2.4L I4', '3.5L V6 24V GDI SOHC',
       '3.5L V-6 gasoline direct injection, variable valve control, prem',
       'Electric ZEV 490hp',
       '3.0L PGM-FI 24V SOHC i-VTEC V6 -inc: Variable Cylinder Managemen',
       'Gas', '2.0L I4 16V GDI DOHC Turbo', '3.5L V6 SOHC 24V',
       'Electric', 'Turbo Gas',
       '3L V-6 i-VTEC variable valve control, premium unleaded, engine w',
       '2.4L I-4 gasoline direct injection, DOHC, i-VTEC variable valve',
       '3.5L V6 24V MPFI SOHC',
       '3.2L V-6 variable valve control, premium unleaded, engine with 2',
       '3.5L V-6 variable valve control, premium unle

##### Engine displacement

La cilindrada (o desplazamiento del motor) es el volumen total de los pistones dentro de los cilindros del motor, expresado en litros. En la variable engine, esta información puede estar expresada en diferentes unidades: litros, centímetros cúbicos, pulgadas cúbicas, entre otras. Para extraer la información, se aplicará una lógica específica basada en expresiones regulares.

Engine displacement is the total volume of the pistons within the engine cylinders, expressed in liters. In the engine variable, this information may be expressed in different units: liters, cubic centimeters, cubic inches, among others. A specific logic based on regular expressions will be applied to extract the information.

In [None]:
def get_left_hand_side_engine_displacement(text):
    # Get the number on the left hand-side of the text if any
    text=text.replace(',','')
    result=""
    already_found_dot=False
    for char in text[::-1]: #iterate in the reversed list
        if char.isdigit():
            result = char + result
        elif char == '.' and not already_found_dot:
            result = char + result
            already_found_dot=True   
        else:
            break  # Stop when we encounter a non-number, non-dot character, or a second dot
    return result

def get_engine_displacement(text):
    if text.replace('.', '', 1).isdigit() and text.count('.') < 2 and float(text)<10:
        return float(text)
    possible_splitters=['CC', 'cc', '-liter', 'liter', 'T','I','CI', 'ci','c.i.', '/','L ','L', 'l ', ' ']
    for splitter in possible_splitters:
        for i in range(text.count(splitter)):
            liters=get_left_hand_side_engine_displacement(text.split(splitter)[i].strip())
            if liters.replace('.', '', 1).isdigit() and liters.count('.')< 2:
                if splitter in ('CC', 'cc'):
                    return float(liters)/1000 # Convert CC into liters
                elif splitter in ('CI', 'ci', 'c.i.'):
                    return float(liters)/61.024 # Convert CI into liters
                elif splitter == ' ':
                    if '.' in liters and float(liters) < 10: #Check that the engine displacement makes business sense
                        return float(liters)
                elif splitter in ('T', 'I', '/'):
                    if float(liters) < 10: #Check that the engine displacement makes business sense
                        return float(liters)
                else:
                    if float(liters) < 10: #Check that the engine displacement makes business sense
                        return float(liters)
    return None
df['engine_displacement']=df['engine'].apply(get_engine_displacement)

In [None]:
# Check nulls in engine_displacement: All null values in engine_displacement are due to the fact that engine variable doesn't provide that information
df[df['engine_displacement'].isnull()]['engine'].unique()

array(['Electric ZEV 490hp', 'Gas', 'Electric', 'Turbo Gas',
       '4 Cylinder Engine', '–', '4 Cylinder', 'Electric ZEV 358hp',
       '6 Cylinder', 'Electric ZEV 499hp', 'Electric Motor', 'V6', 'I4',
       'Engine: Dual Motor -inc: start/stop pushbutton', 'I-4 cyl',
       'V6 Cylinder Engine', 'Electric ZEV', 'Supercharged Gas',
       '355.0HP Electric Motor Electric Fuel System',
       'Electric 402hp 490ft. lbs.',
       'Dual Synchronous Electric Motors Engine', 'L Electric Motor',
       'L 4-Cyl Engine', '0 Cylinders',
       'Dual Asynchronous Electric Motors Engine',
       'Dual AC Electric Motors', 'Plug-in Hybrid', 'Not Specified', 'V8',
       'I6', 'Range-Extended Electric 168hp 184ft. lbs.',
       'Straight 6 Cylinder Engine', 'Electric LEV3-SULEV30 170hp',
       'ELECTRIC', '8 Cylinder Engine', 'AC Electric Motor',
       'Turbo Diesel', '39.5L Electric Motor', '8 Cylinder',
       'ELECTRIC MOTOR', 'Electric 536hp 549ft. lbs.',
       ': 5th Generation Electric 

#### Engine cylinders

Para extraer la información de la variable engine, se aplicará una lógica específica basada en expresiones regulares. Los pistones pueden estar organizados de diferentes formas: en línea, en V, en W, en H, entre otras. Esta topología del motor podría ser utilizada como una variable adicional, pero, por simplicidad, se ha decidido prescindir de ella.

To extract the information from the engine variable, a specific logic based on regular expressions will be applied. The pistons can be arranged in different configurations: inline, V, W, H, among others. This engine topology could also be used as an additional variable, but for the sake of simplicity, it has been decided to exclude it.

In [38]:
def get_right_hand_side_cylinders(text):
    # Get the number on the right hand-side of the text if any
    result=""
    for char in text:
        if char.isdigit():
            result += char
        elif char =='.':
            return "" #This number has nothing to do with cylinders
        elif char =='V':
            return "" #This is the number of valves
        else:
            break  # Stop when we encounter a non-number, non-dot character, or a second dot
    return result

def get_left_hand_side_cylinders(text):
    # Get the number on the left hand-side of the text if any
    result=""
    for char in text[::-1]:
        if char.isdigit():
            result = char + result
        elif char =='.':
            return "" #This number has nothing to do with cylinders
        else:
            break  # Stop when we encounter a non-number, non-dot character, or a second dot
    return result

def get_engine_cylinders(text):
    possible_splitters=['Inline-','Flat-', 'Inline','Flat','Transverse','I-', 'V-', 
                        'W-', 'H-', 'L-', 'l-', 'IC', 'VR', 'I', 'H', 'V',  'T', 'L', 'F']
    for splitter in possible_splitters:
        for i in range(text.count(splitter)):
            #Evaluate if there is a number in the right hand side of the splitters associated to cylinders
            cylinders=get_right_hand_side_cylinders(text.split(splitter)[i+1].strip()) 

            if cylinders.isdigit() and 0<=int(cylinders)<=16 and \
                "VALVE" not in text.split(splitter)[i+1].upper()[:9]:
                # Check that the number of cylinders makes business sense
                # and it is not the number of valves
                return int(cylinders)
            
    possible_splitters=['-CYL', 'CYL']
    for splitter in possible_splitters:
        for i in range(text.upper().count(splitter)):
            #Evaluate if there is a number in the left hand side of the splitters associated to cylinders
            cylinders=get_left_hand_side_cylinders(text.upper().split(splitter)[i].strip())
            if cylinders.isdigit() and 0<=int(cylinders)<=16:
                # Check that the number of cylinders makes business sense
                return int(cylinders) 
    return None
df['engine_cylinders']=df['engine'].apply(get_engine_cylinders)

Hay muchos valores de la variable model que no tienen información sobre el número de cilindros

There are many values in the model variable that do not contain information about the number of cylinders.

In [44]:
# Count of cars by number of cylinders
df.groupby(['engine_cylinders'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,engine_cylinders,Count
0,0.0,36
1,2.0,3
2,3.0,3397
3,4.0,62092
4,5.0,589
5,6.0,41684
6,8.0,17054
7,10.0,239
8,12.0,83
9,,16008


In [129]:
print("There are",str(len(df[df['engine_cylinders'].isnull()]['engine'])), "cars in which the number of cylinders hasn't been obtained")

df[df['engine_cylinders'].isnull()]['engine'].unique()

There are 16008 cars in which the number of cylinders hasn't been obtained


array(['Electric ZEV 490hp', 'Gas', 'Electric', 'Turbo Gas', '–',
       '2.4L DOHC 16V', 'Electric ZEV 358hp', '2.0L 16V DOHC', '2.4L',
       '3.5L', '3.7L', 'Electric ZEV 499hp', 'Electric Motor',
       '2.0L 16-Valve DOHC VTEC Turbo Engine', '2.0 L', '2.0L DOHC',
       'Engine: Dual Motor -inc: start/stop pushbutton', '3.5 L',
       '3.7 Liter', '3.5', '3.5L 273.0hp', '3.5L 290.0hp', '3.7',
       '3.0L TFSI', 'Electric ZEV', '2.0 Liter Turbo', '2.0L', '1.8L',
       '3.0L', '0.0', '2.9L', 'Supercharged Gas', '0.0L', '5.2L', '2.5L',
       '355.0HP Electric Motor Electric Fuel System',
       'Electric 402hp 490ft. lbs.', '2.0L Turbocharged',
       'Dual Synchronous Electric Motors Engine', 'L Electric Motor',
       '3.0L Supercharged', 'Dual Asynchronous Electric Motors Engine',
       '2.0L TFSI', 'Dual AC Electric Motors', '4.0L', 'Plug-in Hybrid',
       'Not Specified', '2.0-liter TFSI four-cylinder engine',
       '2.0L 252.0hp', '4.2L',
       'Range-Extended Electric 1

The majority of the above listed are electric or hybrid engines

In [42]:
df[df['engine_cylinders'].isnull()].groupby(['engine'], dropna=False).size().reset_index(name='Count').sort_values(by='Count', ascending=False).head(50)

Unnamed: 0,engine,Count
513,Gas,3382
412,Electric,3195
558,Turbo Gas,2716
471,Electric Motor,2695
578,–,1115
517,Hybrid,293
539,Plug-in Hybrid,149
409,ELECTRIC MOTOR,128
520,L Electric Motor,74
387,Dual AC Electric Motors,66


#### Engine Horse Power

En algunos vehículos, se encuentra información sobre los caballos de fuerza, medidos en HP o kW.

Some vehicles provide information about horsepower, measured in HP or kW.

In [45]:
def get_left_hand_side_hp(text):
    # Get the number on the left hand-side of the text if any
    text=text.replace(',','')
    result=""
    already_found_dot=False
    for char in text[::-1]:
        if char.isdigit():
            result = char + result
        elif char == '.' and not already_found_dot:
            result = char + result
            already_found_dot=True   
        else:
            break  # Stop when we encounter a non-number, non-dot character, or a second dot
    return result


def get_engine_HP(text):
    possible_splitters=['HP', 'KW']
    for splitter in possible_splitters:
        for i in range(text.upper().count(splitter)):
            #Evaluate if there is a number in the left hand side of the splitters associated to horse power
            horse_power=get_left_hand_side_hp(text.upper().split(splitter)[i].strip())
            if horse_power.replace('.', '', 1).isdigit() and horse_power.count('.')< 2:
                if splitter=='KW':
                    return float(horse_power)*1.34102 # Convert KW to HP
                else: 
                    return float(horse_power)
                
    # Find all the numbers (including decimals) in the string
    numbers = re.findall(r'\d+\.?\d*', text)
    numbers = [float(num) if '.' in num else int(num) for num in numbers]
    sorted_numbers = sorted(numbers, reverse=True)
    for number in sorted_numbers: 
        # We assume that the HP is the greatest number between 60 and 1200
        # that is not surrounded by a volume unit
        if 60<=number<=500:
            i=text.find(str(number))
            j=i+len(str(number))
            is_HP=True
            possible_other_units = ['CI', 'ci','c.i.','CC', 'cc']
            for unit in possible_other_units:
                if unit in text[j:j+6]:
                    is_HP=False
                    break
                if unit in text[i-6:i]:
                    is_HP=False
                    break
            if is_HP:
                return number
    return None

df['engine_HP']=df['engine'].apply(get_engine_HP)


In [46]:
# Check nulls in engine_HP: Taking a sample of the null values, it seams that non of them have information about the power of the car
df[df['engine_HP'].isnull()]['engine'].unique()[:60]

array(['2.4L I-4 DOHC, i-VTEC variable valve control, premium unleaded,',
       '2L I-4 gasoline direct injection, DOHC, VTEC variable valve cont',
       '3.5L V-6 gasoline direct injection, i-VTEC variable valve contro',
       '1.5L I-4 gasoline direct injection, DOHC, VTEC variable valve co',
       '3.5L V-6 i-VTEC variable valve control, premium unleaded, engine',
       '2.4L I4', '3.5L V6 24V GDI SOHC',
       '3.5L V-6 gasoline direct injection, variable valve control, prem',
       '3.0L PGM-FI 24V SOHC i-VTEC V6 -inc: Variable Cylinder Managemen',
       'Gas', '2.0L I4 16V GDI DOHC Turbo', '3.5L V6 SOHC 24V',
       'Electric', 'Turbo Gas',
       '3L V-6 i-VTEC variable valve control, premium unleaded, engine w',
       '2.4L I-4 gasoline direct injection, DOHC, i-VTEC variable valve',
       '3.5L V6 24V MPFI SOHC',
       '3.2L V-6 variable valve control, premium unleaded, engine with 2',
       '3.5L V-6 variable valve control, premium unleaded, engine with 2',
       

In [47]:
#The number of missing values is really high
df.groupby(['engine_HP'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,engine_HP,Count
0,60.0,1
1,61.0,1
2,62.0,10
3,63.0,1
4,70.0,22
...,...,...
397,834.0,7
398,845.0,5
399,1000.0,1
400,1020.0,11


In [49]:
# We remove the original variable
df.drop(columns=['engine'], inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,mileage,interior_color,exterior_color,fuel_type,price_USD,url,mpg_city,...,num_speeds,transmission_type,drive_train,stock_type,model,vehicle_type,engine_displacement,engine_cylinders,engine_HP,fuel_type_v2
0,2006,19,acura,76497,Biege,Green Pearl,Gasoline,10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,...,,Automatic,Front-wheel Drive,Used,TSX,Compact,2.4,4.0,,Gasoline
1,2021,4,acura,54931,Espresso,Majestic Black Pearl,Gasoline,27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,...,10.0,Automatic,All-wheel Drive,Used,RDX,SUV,2.0,4.0,,Gasoline
2,2017,8,acura,103720,Ebony,Modern Steel Metallic,Gasoline,17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,...,,Automatic,All-wheel Drive,Used,MDX,SUV,3.5,6.0,,Gasoline
3,2024,1,acura,17309,Ebony,Platinum White Pearl,Gasoline,30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,...,,Automatic,Front-wheel Drive,Used,INTEGRA,Compact,1.5,4.0,,Gasoline
4,2017,8,acura,66552,Ebony,White Diamond Pearl,Gasoline,18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,...,,Automatic,All-wheel Drive,Used,RDX,SUV,3.5,6.0,,Gasoline


#### Fuel type

En aras de la simplicidad, consideramos que hay solo cinco categorías de tipo de motorización: eléctrico (cuando el tipo de combustible no está informado), gasolina, diesel, híbrido y gas (incluye E85 y gas natural)

For the sake of simplicity, we only consider 5 categories: Electric (fuel type not informed), Gasoline, Diesel, Hybrid and Gas (includes E85 and natural gas)

In [48]:
df.loc[df['fuel_type'].apply(lambda x: isinstance(x, float) and np.isnan(x)),'fuel_type']=None # substitue NaN for None
df.loc[df['fuel_type'] == '–', 'fuel_type'] = None


def get_fuel_type(text):
    if text is None:
        return 'Electric'
    if "HYBRI" in text.upper():
        return "Hybrid"
    if "GASOL" in text.upper() or "UNLEADED" in text.upper():
        return "Gasoline"
    if "DIESEL" in text.upper():
        return "Diesel"
    if "GAS" in text.upper() or 'E85' in text.upper():
        return "Gas"
    return None

df['fuel_type_v2']=df['fuel_type'].apply(get_fuel_type)
df['fuel_type'].unique()

array(['Gasoline', None, 'Hybrid', 'E85 Flex Fuel', 'Gas',
       'Plug-In Hybrid', 'Diesel', 'Flexible Fuel',
       'Gasoline / Natural Gas', 'Gasoline fuel type',
       'Gasoline/Mild Electric Hybrid', 'Plug-in Gas/Electric Hybrid',
       'Other', 'Diesel (B20 capable)', 'Unspecified', 'Gaseous', 'Flex',
       'Plug-In Hybrid Fuel', 'Regular Unleaded',
       'Compressed Natural Gas', 'Natural Gas', 'Gasoline Fuel',
       'Premium Unleaded', 'Plug-in Hybrid Electric (PHEV)',
       'PHEV (plug-in hybrid electric vehicle)', 'PHEV Hybrid Fuel',
       'mild', 'MHEV (mild hybrid electric vehicle)',
       'Gasoline/Mild Electric Hybri'], dtype=object)

In [137]:
# Check nulls in fuel_type_v2: All null values in fuel_type_v2 are due to the fact that fuel_type variable doesn't provide the information
df[df['fuel_type_v2'].isnull()]['fuel_type'].unique()

array(['Flexible Fuel', 'Other', 'Unspecified', 'Flex', 'mild'],
      dtype=object)

En la siguiente tabla se muestra cómo se mapea el tipo de combustible, según la variable original, a la variable procesada.

The following table shows how the fuel type is mapped from the original variable to the processed variable.

In [138]:
df.groupby(['fuel_type','fuel_type_v2'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,fuel_type,fuel_type_v2,Count
0,Compressed Natural Gas,Gas,2
1,Diesel,Diesel,2880
2,Diesel (B20 capable),Diesel,4
3,E85 Flex Fuel,Gas,1492
4,Flex,,1
5,Flexible Fuel,,11
6,Gas,Gas,20
7,Gaseous,Gas,6
8,Gasoline,Gasoline,122148
9,Gasoline / Natural Gas,Gasoline,27


In [50]:
# We remove the original variable and substitute it for the processed one
df.drop(columns=['fuel_type'], inplace=True)
df.rename(columns={'fuel_type_v2': 'fuel_type'}, inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,mileage,interior_color,exterior_color,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,drive_train,stock_type,model,vehicle_type,engine_displacement,engine_cylinders,engine_HP,fuel_type
0,2006,19,acura,76497,Biege,Green Pearl,10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,Front-wheel Drive,Used,TSX,Compact,2.4,4.0,,Gasoline
1,2021,4,acura,54931,Espresso,Majestic Black Pearl,27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,All-wheel Drive,Used,RDX,SUV,2.0,4.0,,Gasoline
2,2017,8,acura,103720,Ebony,Modern Steel Metallic,17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,All-wheel Drive,Used,MDX,SUV,3.5,6.0,,Gasoline
3,2024,1,acura,17309,Ebony,Platinum White Pearl,30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,Front-wheel Drive,Used,INTEGRA,Compact,1.5,4.0,,Gasoline
4,2017,8,acura,66552,Ebony,White Diamond Pearl,18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,All-wheel Drive,Used,RDX,SUV,3.5,6.0,,Gasoline


#### Color

En el conjunto de datos se dispone del color interior y exterior del vehículo, expresado en lenguaje natural. Sin embargo, las descripciones asociadas a estos colores son, en muchos casos, no estándar, creativas o ambiguas. Para que estas variables puedan ser utilizadas en el modelo, es imprescindible codificarlas mediante el sistema RGB, que asigna a cada color tres valores enteros entre 0 y 255, correspondientes a su composición en términos de la intensidad de los colores primarios (rojo, verde y azul).

Para un modelo de aprendizaje automático, la codificación RGB es mucho más útil que una codificación categórica en colores, dado que el número de valores posibles que puede tomar la variable categórica es muy elevado. Con la codificación RGB, se pueden cuantificar las relaciones entre las diferentes tonalidades de los colores. Por ejemplo, (100, 50, 0) es marrón oscuro, (200, 100, 0) es marrón claro y (150, 75, 0) es marrón. Mientras que medir la similitud de cadenas de texto puede ser complicado, comparar vectores en un espacio tridimensional es mucho más sencillo.

Para etiquetar los datos, es decir, asignar a cada color en lenguaje natural su codificación RGB, se ha utilizado un modelo LLM con el objetivo de agilizar una tarea rutinaria y fácilmente automatizable. El prompt utilizado se puede encontrar [aquí](https://github.com/jcllanu/residual-value/blob/main/color_codification_prompt.txt), y el modelo LLM empleado es el 3o mini de ChatGPT con la opción "Razona" activada. De manera manual, se ha verificado que esta codificación es coherente para una muestra reducida. Para ello, se ha utilizado la siguiente [macro](https://github.com/jcllanu/residual-value/blob/main/excel_color_macro.txt) de Excel, que rellena las celdas del color indicado por la codificación RGB. A pesar de la dificultad de la tarea, debido a la ambigüedad en muchas de las descripciones de los colores en lenguaje natural, los resultados obtenidos por el modelo LLM son bastante satisfactorios.

The dataset includes both the interior and exterior colors of the vehicle, expressed in natural language. However, the descriptions associated with these colors are, in many cases, non-standard, creative, or ambiguous. To make these variables usable in the model, it is essential to encode them using the RGB system, which assigns three integer values between 0 and 255 to each color, corresponding to its composition in terms of the intensity of the primary colors (red, green, and blue).

For a machine learning model, RGB encoding is much more useful than categorical color encoding, as the number of possible values the categorical variable can take is very high. With RGB encoding, the relationships between different color shades can be quantified. For example, (100, 50, 0) is dark brown, (200, 100, 0) is light brown, and (150, 75, 0) is brown. While measuring the similarity of text strings can be difficult, comparing vectors in a three-dimensional space is much easier.

To label the data, i.e., to assign the RGB encoding to each color in natural language, an LLM model was used to streamline a routine and easily automatable task. The prompt used can be found [here](https://github.com/jcllanu/residual-value/blob/main/color_codification_prompt.txt), and the LLM model employed is the 3o mini of ChatGPT with the "Reason" option activated. Manual verification was carried out to ensure the consistency of the encoding for a reduced sample. For this, the following Excel [macro](https://github.com/jcllanu/residual-value/blob/main/excel_color_macro.txt) was used to fill in the cells with the RGB color encoding. Despite the difficulty of the task, due to the ambiguity in many of the natural language color descriptions, the results obtained by the LLM are quite satisfactory.

In [52]:
# List of all possible interior and exterior colors
color_list=list(set((list(df['interior_color'])+list(df['exterior_color']))))
sorted_color_list=sorted(color_list)
print(len(sorted_color_list))
sorted_color_list[1000:1500]

9872


['Black Blue Vescin/Cl Tcb2',
 'Black Bordeaux Red',
 'Black CLoth',
 'Black Carbon',
 'Black Ceramique',
 'Black Chalk',
 'Black Cherry',
 'Black Cherry Metallic',
 'Black Cherry Mica',
 'Black Cherry Pearl',
 'Black Clear Coat',
 'Black Clear Coat Px8',
 'Black Clear Coat/White Gold Clear Coat',
 'Black Clearcoat',
 'Black Clearcoat / Black Cloth Top',
 'Black Clearcoat/Black Cloth Top',
 'Black Clearcoat/Black Soft Top',
 'Black Clearcoat/Bright Silver Metallic Clearcoat',
 'Black Cloth',
 'Black Cloth Interior',
 'Black Cloth with Orange Stitching',
 'Black Cloudtex & Cloth',
 'Black Cloudtex Leatherette & Cloth',
 'Black Copper',
 'Black Copper Pearl',
 'Black Currant Metallic',
 'Black DINAMICA',
 'Black DINAMICA w/Red',
 'Black Dakota',
 'Black Dakota Leather',
 'Black Dakota, leather',
 'Black Diamond',
 'Black Diamond Mica',
 'Black Diamond P',
 'Black Diamond Pearl',
 'Black Diamond Tri-Coat',
 'Black Diamond Tricoat',
 'Black Diamond/Alloy Silver Roof',
 'Black Diamond/Deep 

In [55]:
#Load the data
ini=0
fin=499
df_list=[]
for _ in range(20):
    path='colors/colors_'+str(ini)+'_'+str(fin)+'.csv'
    df_list.append(pd.read_csv(path, delimiter=';'))
    ini+=500
    fin+=500
df_list.append(pd.read_csv('colors/other_colors.csv', delimiter=';'))

df_color_dictionary=pd.concat(df_list, ignore_index=True,axis=0)[['Color','R','G','B']]

#Include some colors manually due to CSV condification problems
df_auxiliar = pd.DataFrame({'Color': ["'", 'Ebony &#47; Terracotta', 'N/A', 'None', 'null'],
        'R': [None, 80, None, None, None,],
        'G': [None, 40, None, None, None,],
        'B': [None, 20, None, None, None,]})

df_color_dictionary=pd.concat([df_auxiliar, df_color_dictionary], ignore_index=True,axis=0)
df_color_dictionary.tail(20)

Unnamed: 0,Color,R,G,B
9864,Blue Grey,119.0,136.0,153.0
9865,Blue Ice,173.0,216.0,230.0
9866,Blue Ink,0.0,0.0,64.0
9867,Brooklyn Grey Metallic,128.0,128.0,128.0
9868,Championship White,255.0,255.0,255.0
9869,Creme Beige Leather,245.0,245.0,220.0
9870,Dark Slate/Medium Graystone,105.0,105.0,105.0
9871,Ice Silver Metallic,192.0,192.0,192.0
9872,Ice Storm Metallic,176.0,196.0,222.0
9873,Iced Mocha Premium Colorant,210.0,180.0,140.0


In [57]:
print('Number of entries before the left joins:', len(df)) 

# Left join each natural language interior vehicle color to its RGB codification
df_color_dictionary.columns=['interior_color', 'R_interior', 'G_interior', 'B_interior']
df=pd.merge(df, df_color_dictionary, how='left', on='interior_color')

# Left join each natural language exterior vehicle color to its RGB codification
df_color_dictionary.columns=['exterior_color', 'R_exterior', 'G_exterior', 'B_exterior']
df=pd.merge(df, df_color_dictionary, how='left', on='exterior_color')

print('Number of entries after the left joins:', len(df))

Number of entries before the left joins: 141185
Number of entries after the left joins: 141185


In [58]:
# We remove the original color variables
df.drop(columns=['interior_color', 'exterior_color'], inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,mileage,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,...,B_interior_x,R_exterior_x,G_exterior_x,B_exterior_x,R_interior_y,G_interior_y,B_interior_y,R_exterior_y,G_exterior_y,B_exterior_y
0,2006,19,acura,76497,10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,...,220.0,144.0,238.0,144.0,245.0,245.0,220.0,144.0,238.0,144.0
1,2021,4,acura,54931,27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,...,20.0,20.0,20.0,20.0,80.0,40.0,20.0,20.0,20.0,20.0
2,2017,8,acura,103720,17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,...,0.0,119.0,136.0,153.0,0.0,0.0,0.0,119.0,136.0,153.0
3,2024,1,acura,17309,30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,...,0.0,255.0,255.0,255.0,0.0,0.0,0.0,255.0,255.0,255.0
4,2017,8,acura,66552,18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,...,0.0,255.0,255.0,255.0,0.0,0.0,0.0,255.0,255.0,255.0


In [None]:
# Save final df
df.to_excel('processed_dataset.xlsx', index=False)

In [152]:
# Missing percentage
round((df.isnull().sum()/df.shape[0])*100,2)

year_manufacture        0.00
years                   0.00
make                    0.00
model                   0.00
mileage                 0.00
stock_type              0.00
interior_color          0.00
exterior_color          0.00
drive_train             0.00
fuel_type               6.18
transmission            0.00
engine                  0.00
price_USD               0.00
url                     0.00
mpg_city                9.83
mpg_highway             9.83
num_speeds             64.76
transmission_type       0.47
drive_train_v2          0.96
stock_type_v2           0.00
model_v2                0.00
vehicle_type            0.00
engine_displacement    11.68
engine_cylinders       11.34
engine_HP              94.78
fuel_type_v2            0.03
R_interior_x            0.01
G_interior_x            0.01
B_interior_x            0.01
R_exterior_x            0.00
G_exterior_x            0.00
B_exterior_x            0.00
R_interior_y            0.01
G_interior_y            0.01
B_interior_y  