# Residual value model

## Imports

In [1]:
import pandas as pd
import numpy as np
import re
from openai import OpenAI
import json
import os

## 1.- Load data

In [2]:
makes=["acura","audi","bmw","buick","cadillac","chevrolet","chrysler","dodge","ford","gmc","honda","hyundai","infiniti","jeep","kia","land_rover","lexus",
     "lincoln","mazda","mercedes_benz","mini","mitsubishi","nissan","porsche","ram","subaru","tesla","toyota","volkswagen","volvo"]
df_list=[]
for make in makes:
    for batch in range(1,26):
        # Build the path with os.path.join so it also works outside Windows
        path=os.path.join('cars',make,'cars_'+make+'_'+str(batch)+'.json')
        if os.path.exists(path):
            df_list.append(pd.read_json(path))
        
df=pd.concat(df_list, ignore_index=True,axis=0)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,mpg,fuel_type,transmission,engine,price_USD,url
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,22–31 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,21–27 Based on EPA mileage ratings. Use for co...,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,18–26 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,29–36 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,19–27 Based on EPA mileage ratings. Use for co...,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...


## 2.- Data Exploration and Preprocessing

### 2.1.- Removing duplicates
During data extraction, duplicate rows may arise. We check for them and, if any are found, remove them from the DataFrame.

In [3]:
num_duplicated_rows=sum(df.duplicated())
print(f'The number of duplicated rows is {num_duplicated_rows}')

The number of duplicated rows is 1985


In [4]:
df = df.drop_duplicates()
num_duplicated_rows=sum(df.duplicated())
print(f'The number of duplicated rows is now {num_duplicated_rows}')

The number of duplicated rows is now 0


### 2.2.- Variable description

Next, a high-level analysis of the available variables is conducted using a data dictionary and assessing the number of missing values, outliers, or categories.

#### Data dictionary
- **year_manufacture**: Year of vehicle manufacture, which generally corresponds to the year of purchase.
- **years**: Vehicle age in years, calculated in 2025. It is obtained as the difference between the year 2025 and the year of manufacture.
- **make**: Vehicle make.
- **model**: Vehicle model.
- **mileage**: Vehicle mileage, expressed in miles.
- **stock_type**: Indicator of whether the vehicle is new or used.
- **interior_color**: Interior color of the vehicle, expressed in natural language.
- **exterior_color**: Exterior color of the vehicle, expressed in natural language.
- **drive_train**: Vehicle drivetrain type (front, rear, all-wheel drive).
- **mpg**: Fuel economy, expressed in miles per gallon. It may include separate values for city and highway driving.
- **fuel_type**: Fuel or powertrain type (diesel, gasoline, hybrid, electric, etc.).
- **transmission**: Transmission type. Text string indicating whether it is manual or automatic and the number of gears.
- **engine**: Text string that describes the engine in natural language. It may include the engine displacement (in liters) and the cylinder configuration.
- **price_USD**: Vehicle sale price at the data extraction date, expressed in US dollars.
- **url**: Web address from which the information was extracted.

Data obtained from the website [Cars.com](https://www.cars.com/) between 03/22/2025 and 03/28/2025.
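The definition of `years` above can be checked directly against `year_manufacture`. The following is a minimal sketch of that check on a toy DataFrame; the constant name `REFERENCE_YEAR` and the variable names are assumptions introduced for illustration, and the four rows are copied from the preview shown earlier.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # year used in the data dictionary to compute vehicle age

# Toy rows standing in for the real listings (values taken from the preview above)
sample = pd.DataFrame({
    "year_manufacture": [2006, 2021, 2017, 2024],
    "years": [19, 4, 8, 1],
})

# Recompute the age and compare it with the stored column
recomputed_years = REFERENCE_YEAR - sample["year_manufacture"]
years_consistent = bool((recomputed_years == sample["years"]).all())
print(years_consistent)
```

On the full dataset, the same comparison could be run on `df` to confirm that both columns carry the same information.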

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24321 entries, 0 to 26305
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   year_manufacture  24321 non-null  int64  
 1   years             24321 non-null  int64  
 2   make              24321 non-null  object 
 3   model             24321 non-null  object 
 4   mileage           24321 non-null  int64  
 5   stock_type        24321 non-null  object 
 6   interior_color    24321 non-null  object 
 7   exterior_color    24321 non-null  object 
 8   drive_train       24321 non-null  object 
 9   mpg               23127 non-null  object 
 10  fuel_type         23569 non-null  object 
 11  transmission      24321 non-null  object 
 12  engine            24321 non-null  object 
 13  price_USD         24317 non-null  float64
 14  url               24321 non-null  object 
dtypes: float64(1), int64(3), object(11)
memory usage: 3.0+ MB


#### Check the Categorical and Numerical Columns

In [6]:
# Categorical variables
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical variables:',cat_col)
# Numerical variables
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical variables:',num_col)

Categorical variables: ['make', 'model', 'stock_type', 'interior_color', 'exterior_color', 'drive_train', 'mpg', 'fuel_type', 'transmission', 'engine', 'url']
Numerical variables: ['year_manufacture', 'years', 'mileage', 'price_USD']


In [7]:
#Number of unique values for categorical variables
df[cat_col].nunique()

make                  6
model              2555
stock_type            7
interior_color      982
exterior_color     1334
drive_train          14
mpg                 428
fuel_type            16
transmission        325
engine             1132
url               24296
dtype: int64
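
The `url` column has 24,296 unique values for 24,321 rows, so a few listings appear to share a URL even after exact duplicates were dropped. A minimal sketch of how such rows could be inspected is shown below, on toy data with hypothetical URLs and prices; whether those rows should be removed is a separate decision that this sketch does not make.

```python
import pandas as pd

# Toy listings: rows 0 and 2 share a URL but differ in price (hypothetical values)
listings = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
    "price_USD": [10995.0, 27985.0, 10500.0],
})

# Exact-row deduplication finds nothing because no row is fully identical
exact_duplicates = int(listings.duplicated().sum())

# keep=False flags every row whose URL appears more than once
shared_url_rows = listings[listings.duplicated(subset="url", keep=False)]
print(exact_duplicates, len(shared_url_rows))
```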

In [8]:
for col in cat_col:
    print(col)
    print('Sample of some unique values:', df[col].unique()[:20],'\n')

make
Sample of some unique values: ['acura' 'audi' 'bmw' 'toyota' 'volkswagen' 'volvo'] 

model
Sample of some unique values: ['Acura TSX Base' 'Acura RDX Technology Package'
 'Acura MDX 3.5L w/Technology Package' 'Acura Integra A-SPEC'
 'Acura RDX Technology & AcuraWatch Plus Package' 'Acura TLX V6 Tech'
 'Acura Integra A-SPEC w/ Technology' 'Acura TLX V6' 'Acura ZDX A-SPEC'
 'Acura MDX Sport Hybrid SH-AWD Sport Hybrid w/Technology Pkg'
 'Acura RLX Technology Package' 'Acura MDX SH-AWD Technology'
 'Acura TLX Base' 'Acura MDX SH-AWD' 'Acura ZDX A-Spec'
 'Acura RDX SH-AWD Technology' 'Acura MDX Sport Hybrid Technology Package'
 'Acura TSX Technology' 'Acura ILX Base' 'Acura MDX 3.5L'] 

stock_type
Sample of some unique values: ['Used' 'Acura Certified' 'Certified' 'Audi Certified' 'BMW Certified'
 'Volkswagen Certified' 'Volvo Certified'] 

interior_color
Sample of some unique values: ['Biege' 'Espresso' 'Ebony' 'Gray' '–' 'Red' 'BLACK' 'BROWN' 'Black' 'RED'
 'No Color' 'GRAY' 'Graysto

#### Feature engineering

##### MPG

This field consists of a pair of numerical values separated by an en dash (–), followed by an informational note: 'Based on EPA mileage ratings. Use for comparison purposes only. Actual mileage will vary depending on driving conditions, driving habits, vehicle maintenance, and other factors.' We will assume that, in production, the 'EPA mileage ratings' of the vehicles will be available, so this variable may still be useful. From it, we will obtain the city and highway values when they are available.
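
As an illustration of the parsing idea, here is a minimal sketch that extracts both values with a single regular expression. It is an alternative formulation, not the code used in the cell below; the names `MPG_PATTERN` and `parse_mpg` are assumptions introduced for this example.

```python
import re
import pandas as pd

# One integer, optionally followed by an en dash and a second integer,
# anchored at the start of the string
MPG_PATTERN = re.compile(r"^\s*(\d+)(?:\s*–\s*(\d+))?")

def parse_mpg(text):
    """Return (city, highway) as floats, or (None, None) when nothing parses."""
    if not isinstance(text, str):
        return (None, None)
    match = MPG_PATTERN.match(text)
    if match is None:
        return (None, None)
    city = float(match.group(1))
    # A single value is used for both city and highway
    highway = float(match.group(2)) if match.group(2) else city
    return (city, highway)

examples = pd.Series([
    "22–31 Based on EPA mileage ratings. Use for comparison purposes only.",
    "–",
    None,
])
parsed = examples.apply(parse_mpg)
print(parsed.tolist())
```

Anchoring the pattern at the start of the string avoids picking up digits that might appear later in the note.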

In [9]:
def keep_numbers_and_en_dash(text):
    if text is None:
        return None  # Return None if the value is None
    return re.sub(r'[^0-9–]', '', text)  # Keep numbers and en dash (–)
def get_mpg_city(text):
    if text is None:
        return None  # Return None if the value is None
    if text == "–":
        return None
    if "–" not in text:
        return float(text) if text.replace('.', '', 1).isdigit() and text.count('.') < 2 else None
    return float(text.split('–')[0]) if text.split('–')[0].replace('.', '', 1).isdigit() and text.split('–')[0].count('.') < 2 else None

def get_mpg_highway(text):
    if text is None:
        return None  # Return None if the value is None
    if text == "–":
        return None
    if "–" not in text:
        return float(text) if text.replace('.', '', 1).isdigit() and text.count('.') < 2 else None
    return float(text.split('–')[1]) if text.split('–')[1].replace('.', '', 1).isdigit() and text.split('–')[1].count('.') < 2 else None


df['auxiliar'] = df['mpg'].apply(keep_numbers_and_en_dash)
df['mpg_city'] = df['auxiliar'].apply(get_mpg_city)
df['mpg_highway'] = df['auxiliar'].apply(get_mpg_highway)



In [11]:
# Check nulls in mpg_city: All null values in mpg_city are due to the fact that mpg is None or '–'
df[df['mpg_city'].isnull()]['mpg'].unique()

array([None, '–'], dtype=object)

In [12]:
# Check nulls in mpg_highway: All null values in mpg_highway are due to the fact that mpg is None or '–'
df[df['mpg_highway'].isnull()]['mpg'].unique()

array([None, '–'], dtype=object)

In [13]:
df.drop(columns=['mpg', 'auxiliar'], inplace=True)
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0


##### Transmission

From this column, we want to derive two other columns: one indicating the type of transmission (automatic, manual, [manumatic](https://en.wikipedia.org/wiki/Manumatic), [variable](https://en.wikipedia.org/wiki/Continuously_variable_transmission), or [dual-clutch](https://en.wikipedia.org/wiki/Dual-clutch_transmission)), and another indicating the number of gears/speeds of the vehicle.
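
For the number of speeds, one possible approach is to look for a number directly attached to the word "speed". The sketch below illustrates this on a few strings of the kind found in the column; `SPEED_PATTERN` and `extract_num_speeds` are hypothetical names, and this is a stricter alternative to the digit-based heuristic in the cell that follows, not a description of it.

```python
import re

# A one- or two-digit number followed by "speed",
# e.g. "10-Speed", "7 speed", "4 Speed Transmission"
SPEED_PATTERN = re.compile(r"(\d{1,2})\s*-?\s*speed", re.IGNORECASE)

def extract_num_speeds(text):
    """Return the number of speeds as a float, or None if it is not stated."""
    if not isinstance(text, str):
        return None
    if "single" in text.lower():
        return 1.0  # "Single-speed transmission" and similar
    match = SPEED_PATTERN.search(text)
    return float(match.group(1)) if match else None

for label in ["10-Speed Automatic", "7 speed", "Single-Speed Fixed Gear", "Automatic"]:
    print(label, extract_num_speeds(label))
```

The trade-off is that a gear count written without the word "speed" would be missed, so both approaches leave some values empty.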

In [51]:
def get_speeds_of_transmission(text):
    if text is None:
        return None  # Return None if the value is None
    if "SINGLE" in text.upper():
        return 1
    substring=re.sub(r'[^0-9]', '', text)
    if substring=="":
        return None
    if len(substring)==1:
        return float(substring)
    if len(substring)==2:
        if substring[0]=="1":
            return float(substring)
        else:
            return float(substring[0])
    return None 

def keep_letters(text):
    return re.sub(r'[^a-zA-Z]', '', text)
    
def get_transmission_type(text):
    # Checks run in order, so a string matching several groups gets the first one
    # (e.g. 'CVT Automatic' is labeled "Automatic")
    upper = text.upper()
    if "MANUAL" in upper or "M/T" in upper or "SMG" in upper: # SMG for bmw
        return "Manual"
    if "AUTO" in upper or "A/T" in upper or keep_letters(text).upper()=='A' or "AU" in upper or " A" in upper \
        or "ZF" in upper or 'ECT' in upper: #ZF for bmw, ECT for toyota
        return "Automatic"
    if "VARIABLE" in upper or "CVT" in upper: #Continuously Variable Transmission
        return "Variable"
    if "DUAL" in upper or "DOUBLE" in upper or "DC" in upper or "S TRONIC" in upper or "S-TRONIC" in upper: # Dual-Clutch Transmission (DCT) stronic for audi
        return "Dual-clutch"
    if "GEARTRONIC" in upper or "STEPTRONIC" in upper or "TIPTRONIC" in upper: #Geartronic for volvo, steptronic for bmw, Tiptronic for volkswagen group
        return "Manumatic"
    return None
df['num_speeds']=df['transmission'].apply(get_speeds_of_transmission)
df['transmission_type']=df['transmission'].apply(get_transmission_type)

In [None]:
# Check nulls in transmission_type: All null values in transmission_type are due to the fact that transmission variable doesn't give any clue of the transmission type
df[df['transmission_type'].isnull()]['transmission'].unique()

array(['Not Specified', '9-Speed', 'Single-speed transmission', '7 speed',
       '–', '6-Speed', '5-Speed', '7-Speed', 'Single-Speed Fixed Gear',
       '4 Speed Transmission', '1-SPEED G', 'AWD', 'FWD', 'Drivetrai'],
      dtype=object)

In [52]:
# Check nulls in num_speeds: All null values in num_speeds are due to the fact that transmission variable doesn't indicate the number of gears/speeds
df[df['num_speeds'].isnull()]['transmission'].unique()

array(['Automatic', 'CVT', 'Manual', 'Variable', 'A/T',
       'Transmission w/Dual Shift Mode', 'Automatic CVT',
       'Continuously Variable Transmission', 'Not Specified',
       'Automatic w/OD', 'Auto-Shift Manual', 'CVT Transmission',
       'Continuously Variable', 'Transmission-Auto', 'AUTO',
       'Automatic with Tiptronic', 'quattroa? s tronica?', 'M/T',
       'quattroA? S tronicA?', 'quattroA? TiptronicA?', 'A', 'AUTOMATIC',
       'quattro, s-tronic', 'Automatic w/Tiptronic', '–',
       'quattro S tronic', 'CVT with Multitronic', 'FWD, s-tronic',
       'Auto, CVT Multitronic', 'Auto', 'Automatic w/Steptronic', 'auto',
       'Steptronic', 'Automatic w/Manual Shift', 'DCT',
       'STEPTRONIC AUTOMATIC', 'automatic',
       'Continuously Variable (ECVT)', 'eCVT', 'CVT Automatic',
       'Automatic, CVT', 'E-CVT Automatic', 'Automatic, CVTi',
       'Automatic, ECVT', 'DSG Automatic', 'manual', 'SMG other',
       'Manual Manual', 'Automatic with Geartronic', 'AWD', 'FWD

In [43]:
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic


Later on we'll discuss how to handle missing data and impute values.

In [54]:
df.groupby(['transmission_type','num_speeds'], dropna=False).size().reset_index(name='Count')

Unnamed: 0,transmission_type,num_speeds,Count
0,Automatic,1.0,310
1,Automatic,2.0,3
2,Automatic,3.0,1
3,Automatic,4.0,51
4,Automatic,5.0,274
5,Automatic,6.0,1183
6,Automatic,7.0,866
7,Automatic,8.0,4370
8,Automatic,9.0,272
9,Automatic,10.0,1049


##### Drive train

In [58]:
def get_drive_train(text):
    if "FRONT" in text.upper() or 'FWD' == text:
        return "Front-wheel Drive"
    if "ALL" in text.upper() or "FOUR" in text.upper() or '4' in text or 'AWD'==text:
        return "All-wheel Drive"
    if "REAR" in text.upper() or 'RWD'==text: 
        return "Rear-wheel Drive"
    return None
df['drive_train_v2']=df['drive_train'].apply(get_drive_train)


In [59]:
# Check nulls in drive_train_v2: All null values in drive_train_v2 are due to the fact that drive_train variable doesn't provide the information
df[df['drive_train_v2'].isnull()]['drive_train'].unique()

array(['–', '2WD', 'Unknown'], dtype=object)

In [61]:
df.head()

Unnamed: 0,year_manufacture,years,make,model,mileage,stock_type,interior_color,exterior_color,drive_train,fuel_type,transmission,engine,price_USD,url,mpg_city,mpg_highway,num_speeds,transmission_type,drive_train_v2
0,2006,19,acura,Acura TSX Base,76497,Used,Biege,Green Pearl,Front-wheel Drive,Gasoline,Automatic,"2.4L I-4 DOHC, i-VTEC variable valve control, ...",10995.0,https://cars.com/vehicledetail/d6115b1a-3830-4...,22.0,31.0,,Automatic,Front-wheel Drive
1,2021,4,acura,Acura RDX Technology Package,54931,Used,Espresso,Majestic Black Pearl,All-wheel Drive,Gasoline,10-Speed Automatic,"2L I-4 gasoline direct injection, DOHC, VTEC v...",27985.0,https://cars.com/vehicledetail/f9548b93-31b2-4...,21.0,27.0,10.0,Automatic,All-wheel Drive
2,2017,8,acura,Acura MDX 3.5L w/Technology Package,103720,Used,Ebony,Modern Steel Metallic,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 gasoline direct injection, i-VTEC var...",17981.0,https://cars.com/vehicledetail/06e7a2fc-13ec-4...,18.0,26.0,,Automatic,All-wheel Drive
3,2024,1,acura,Acura Integra A-SPEC,17309,Used,Ebony,Platinum White Pearl,Front-wheel Drive,Gasoline,Automatic,"1.5L I-4 gasoline direct injection, DOHC, VTEC...",30049.0,https://cars.com/vehicledetail/658c133c-dc51-4...,29.0,36.0,,Automatic,Front-wheel Drive
4,2017,8,acura,Acura RDX Technology & AcuraWatch Plus Package,66552,Used,Ebony,White Diamond Pearl,All-wheel Drive,Gasoline,Automatic,"3.5L V-6 i-VTEC variable valve control, premiu...",18950.0,https://cars.com/vehicledetail/79565df9-1fba-4...,19.0,27.0,,Automatic,All-wheel Drive


##### Stock type

In [None]:
# TODO

##### Engine

In [None]:
# TODO

##### Fuel type

In [60]:
# TODO
df.loc[df['fuel_type'] == '–', 'fuel_type'] = None
df['fuel_type'].unique()

array(['Gasoline', None, 'Hybrid', 'E85 Flex Fuel', 'Gas',
       'Plug-In Hybrid', 'Diesel', 'Flexible Fuel',
       'Gasoline / Natural Gas', 'Gasoline fuel type',
       'Gasoline/Mild Electric Hybrid', 'Plug-in Gas/Electric Hybrid',
       'Other', 'Unspecified', 'Premium Unleaded', 'Regular Unleaded'],
      dtype=object)
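
Several of the labels above look like variants of the same category. As a possible starting point for the pending step, the sketch below groups them with a lookup table. The grouping itself (for example, treating 'Premium Unleaded' as gasoline) is an assumption, and `FUEL_TYPE_MAP` is a hypothetical name; the final categories would need to be agreed before use.

```python
import pandas as pd

# Hypothetical grouping of the raw labels; the target categories are an assumption
FUEL_TYPE_MAP = {
    "Gasoline": "Gasoline",
    "Gas": "Gasoline",
    "Gasoline fuel type": "Gasoline",
    "Premium Unleaded": "Gasoline",
    "Regular Unleaded": "Gasoline",
    "Diesel": "Diesel",
    "Hybrid": "Hybrid",
    "Gasoline/Mild Electric Hybrid": "Hybrid",
    "Plug-In Hybrid": "Plug-in Hybrid",
    "Plug-in Gas/Electric Hybrid": "Plug-in Hybrid",
    "E85 Flex Fuel": "Flex Fuel",
    "Flexible Fuel": "Flex Fuel",
}

raw_fuel = pd.Series(["Gas", "Premium Unleaded", "Plug-In Hybrid", "Unspecified", None])

# Labels missing from the map (e.g. "Other", "Unspecified") become NaN
fuel_group = raw_fuel.map(FUEL_TYPE_MAP)
print(fuel_group.tolist())
```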

##### Color

In [66]:
color_list=list(set((list(df['interior_color'])+list(df['exterior_color']))))
color_list[:100]

['Cognac w/Contrast Stitch',
 'Cognac w/Granite Gray',
 'Tan',
 'Adelaide Grey',
 'RED PERFORATED MILANO PRE',
 'Black/Steel Blue',
 'Trade',
 'Plad',
 'BURGUNDY',
 'Calcite / Umbra',
 'Saddle Brown',
 '1st Edition Rwd',
 'BAIGE',
 'Greystone',
 'Titan Black Leather',
 'Cornsilk Beige w/Brown Piping',
 'Unspecified',
 'Off Black',
 'Mauro Brown w/Black',
 'Charcoal Ven Nap',
 'Blond/Off-Black',
 'Cognac Perforated SensaTec',
 'Charcoal Black',
 'Parchment Prem Perforated',
 'Off-Black/Softbeige Inlay',
 'Sandstone',
 '`prestige',
 'Blue and White',
 'Grey Melange',
 'Sand Beige',
 'Venetian Beige/Black',
 'Marrakesh Brown',
 'Black w/Exclusive Stitching',
 'Light Blue Gray/Black',
 'Maroon Brown Nappa Leathe',
 'CREAM',
 'Black and Blue',
 'Black w/Oyster Contrast Stitching',
 'BROWN',
 'Pearl Beige w/Agate Gray',
 'Red Leather',
 'Ebony Lthette Trmmed',
 'Titan Black W/Accents',
 'Atlantic Blue',
 'Titan Black - BV',
 'Moonrock Gray / Quartz',
 'Saddle',
 'Oyster and Black Nappa Leath
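
Because the color names are free text, one simple option is to map each string to a base color by keyword. The sketch below shows the idea on a few values from the list above; the `BASE_COLORS` table and the `base_color` function are hypothetical, the keyword lists are far from complete, and when a string mentions several colors the first group in the table wins.

```python
import re

# Hypothetical keyword table: base color -> keywords that indicate it
BASE_COLORS = {
    "black": ("black", "ebony", "charcoal"),
    "gray": ("gray", "grey"),
    "beige": ("beige", "biege", "baige", "tan", "sand", "parchment", "cream"),
    "brown": ("brown", "cognac", "saddle", "espresso"),
    "red": ("red", "burgundy", "maroon"),
    "blue": ("blue",),
}

def base_color(text):
    """Return the first base color whose keyword starts a word in text, else None."""
    if not isinstance(text, str):
        return None
    lowered = text.lower()
    for base, keywords in BASE_COLORS.items():
        # \b keeps "tan" from matching inside "titan" and "red" inside "perforated"
        if any(re.search(rf"\b{keyword}", lowered) for keyword in keywords):
            return base
    return None

for name in ["BAIGE", "Titan Black Leather", "Adelaide Grey", "Saddle Brown", "Trade"]:
    print(name, base_color(name))
```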

In [None]:
# Missing percentage
round((df.isnull().sum()/df.shape[0])*100,2)