# Used Car Price Regression Dataset - Kaggle Competition

## Overview
Welcome to the 2024 Kaggle Playground Series! We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

**Your goal:** Predict the price of used cars based on various attributes.

## About the dataset

**features**
- ```id```: row identifier for the car; it carries no predictive information and is dropped before modeling
- ```brand```: brand of the car
- ```model```: model of the car
- ```model_year```: year the model was made
- ```milage```: total miles on the car
- ```fuel_type```: type of fuel the car takes
- ```engine```: type of engine on the car
- ```transmission```: type of transmission on the car
- ```ext_col```: exterior color
- ```int_col```: interior color
- ```accident```: whether an accident or damage has been reported for the car
- ```clean_title```: whether the car has a clean title

**target variable**
- ```price```: price of the car


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [2]:
# read in training data
data = pd.read_csv('../data/raw/train.csv')
data.head()

Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,0,MINI,Cooper S Base,2007,213000,Gasoline,172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel,A/T,Yellow,Gray,None reported,Yes,4200
1,1,Lincoln,LS V8,2002,143250,Gasoline,252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel,A/T,Silver,Beige,At least 1 accident or damage reported,Yes,4999
2,2,Chevrolet,Silverado 2500 LT,2002,136731,E85 Flex Fuel,320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab...,A/T,Blue,Gray,None reported,Yes,13900
3,3,Genesis,G90 5.0 Ultimate,2017,19500,Gasoline,420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel,Transmission w/Dual Shift Mode,Black,Black,None reported,Yes,45000
4,4,Mercedes-Benz,Metris Base,2021,7388,Gasoline,208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,7-Speed A/T,Black,Beige,None reported,Yes,97500


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188533 entries, 0 to 188532
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            188533 non-null  int64 
 1   brand         188533 non-null  object
 2   model         188533 non-null  object
 3   model_year    188533 non-null  int64 
 4   milage        188533 non-null  int64 
 5   fuel_type     183450 non-null  object
 6   engine        188533 non-null  object
 7   transmission  188533 non-null  object
 8   ext_col       188533 non-null  object
 9   int_col       188533 non-null  object
 10  accident      186081 non-null  object
 11  clean_title   167114 non-null  object
 12  price         188533 non-null  int64 
dtypes: int64(4), object(9)
memory usage: 18.7+ MB


Numerical Features
- ```id```: row identifier for the car; it carries no predictive information and is dropped before modeling
- ```model_year```: year the model was made
- ```milage```: total miles on the car
- ```price```: price of the car

Categorical Features
- ```brand```: brand of the car (MULTICLASS: 57 brands of cars)
- ```model```: model of the car (MULTICLASS: 1897 models of cars)
- ```fuel_type```: type of fuel the car takes (MULTICLASS: 7 types of fuels)
- ```engine```: type of engine on the car (MULTICLASS: 1117 types of engines)
- ```transmission```: type of transmission on the car (MULTICLASS: 52 types of transmissions)
- ```ext_col```: exterior color (MULTICLASS: 319 types of colors)
- ```int_col```: interior color (MULTICLASS: 156 types of colors)
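- ```accident```: whether an accident or damage has been reported for the car (BINARY, with some missing values)
- ```clean_title```: whether the car has a clean title (only one observed value, ```Yes```; the rest are missing)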

In [4]:
# Get the number of unique values in every column (numerical and categorical)
unique_counts = data.nunique()

# Print the results
print(unique_counts)

id              188533
brand               57
model             1897
model_year          34
milage            6651
fuel_type            7
engine            1117
transmission        52
ext_col            319
int_col            156
accident             2
clean_title          1
price             1569
dtype: int64


## PLAN

Handle Missing Values

- ```fuel_type```: 5,083 values are missing. These rows will be dropped, which reduces the dataset by only about 2.7%.

- ```accident```: 2,452 values are missing. These rows will be dropped, which reduces the dataset by only about 1.3%.

- ```clean_title```: About 11% of the values are missing and the only observed value is ```Yes```, so the whole feature will be dropped.
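
The percentages above can be checked directly from the data. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not the competition data); the same two expressions should work on the real DataFrame.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", None, "Diesel", "Gasoline"],
    "accident": ["None reported", "None reported", None, "None reported"],
    "clean_title": ["Yes", None, None, "Yes"],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Share of rows lost if rows missing fuel_type or accident are dropped
rows_kept = len(toy.dropna(subset=["fuel_type", "accident"]))
rows_lost_pct = 100 * (1 - rows_kept / len(toy))
print(rows_lost_pct)
```

Because a row can be missing both columns at once, the combined loss can be smaller than the sum of the two per-column percentages.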

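Encode Categorical Values
- ```accident``` is already binary but needs to be encoded as 1s and 0s
- OneHotEncoding will be used for ```fuel_type```
- Frequency encoding will be used for the high-cardinality features (```brand```, ```model```, ```engine```, ```transmission```, ```ext_col```, ```int_col```)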

Feature Engineering
- ```model_year```: Consider creating new features such as car_age (current year minus model_year) to capture how the age of the car affects the price.
- ```milage```: This numeric feature can be used as-is but consider transformations (e.g., log transformation) if the distribution is skewed.
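
As a rough illustration of the two ideas above, the sketch below derives a car age and a log-transformed mileage on a toy frame built from the five rows shown by `data.head()`. The reference year of 2024 and the names `toy`, `REFERENCE_YEAR`, and `log_milage` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in using the model years and mileages shown in data.head()
toy = pd.DataFrame({
    "model_year": [2007, 2002, 2002, 2017, 2021],
    "milage": [213000, 143250, 136731, 19500, 7388],
})

REFERENCE_YEAR = 2024  # assumption: the competition year
toy["car_age"] = REFERENCE_YEAR - toy["model_year"]

# Sample skewness; a clearly positive value suggests a long right tail
milage_skew = toy["milage"].skew()

# log1p computes log(1 + x), so a mileage of 0 would stay finite
toy["log_milage"] = np.log1p(toy["milage"])
print(toy)
```

Whether the log transform actually helps depends on the model; tree-based models are largely insensitive to monotonic transforms, while linear models often benefit.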

Scaling Data
- Standardize ```model_year```, ```milage```, and ```car_age``` with ```StandardScaler```

## Handling Missing Values

In [5]:
data.drop('clean_title', axis=1, inplace=True)
data.drop('id', axis=1, inplace=True)

In [6]:
data.isna().sum()

brand              0
model              0
model_year         0
milage             0
fuel_type       5083
engine             0
transmission       0
ext_col            0
int_col            0
accident        2452
price              0
dtype: int64

In [7]:
# Drop rows where 'fuel_type' or 'accident' have missing values
data.dropna(subset=['fuel_type', 'accident'], inplace=True)


In [8]:
data.isna().sum()

brand           0
model           0
model_year      0
milage          0
fuel_type       0
engine          0
transmission    0
ext_col         0
int_col         0
accident        0
price           0
dtype: int64

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 181067 entries, 0 to 188532
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   brand         181067 non-null  object
 1   model         181067 non-null  object
 2   model_year    181067 non-null  int64 
 3   milage        181067 non-null  int64 
 4   fuel_type     181067 non-null  object
 5   engine        181067 non-null  object
 6   transmission  181067 non-null  object
 7   ext_col       181067 non-null  object
 8   int_col       181067 non-null  object
 9   accident      181067 non-null  object
 10  price         181067 non-null  int64 
dtypes: int64(3), object(8)
memory usage: 16.6+ MB


In [10]:
# Get the number of unique values in every remaining column
unique_counts = data.nunique()

# Print the results
print(unique_counts)

brand             57
model           1888
model_year        34
milage          6480
fuel_type          7
engine          1108
transmission      52
ext_col          319
int_col          156
accident           2
price           1569
dtype: int64


## Encode Categorical Values


In [11]:
# List of features to apply frequency encoding
features_to_encode = ['brand', 'model', 'engine', 'transmission', 'ext_col', 'int_col']

# Frequency encoding
for feature in features_to_encode:
    # Calculate frequency of each category
    freq_encoding = data[feature].value_counts() / len(data)
    
    # Map frequencies to the original feature
    data[feature] = data[feature].map(freq_encoding)
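
One thing to keep in mind with frequency encoding is that the frequencies are learned from the training data, so a category that only appears in new data has no entry in the mapping. The sketch below shows one possible way to handle that, assuming a fallback frequency of 0 for unseen categories; the frames `train` and `test` and the column `brand_freq` are made up for illustration.

```python
import pandas as pd

# Toy frames; "Tesla" never appears in the training data
train = pd.DataFrame({"brand": ["Ford", "Ford", "BMW", "MINI"]})
test = pd.DataFrame({"brand": ["Ford", "Tesla"]})

# Relative frequency of each category, learned from the training data only
freq = train["brand"].value_counts(normalize=True)

train["brand_freq"] = train["brand"].map(freq)

# Unseen categories map to NaN; here they are filled with 0.0 (an assumption)
test["brand_freq"] = test["brand"].map(freq).fillna(0.0)

print(test)
```

Reusing the training frequencies on the test set keeps the encoding consistent between the two, which matters when the processed files are later fed to the same model.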


In [12]:
data.head()

Unnamed: 0,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,price
0,0.005799,0.00317,2007,213000,Gasoline,0.001883,0.259534,0.005241,0.114797,None reported,4200
1,0.013205,0.00016,2002,143250,Gasoline,0.000436,0.259534,0.090353,0.131791,At least 1 accident or damage reported,4999
2,0.088829,0.000337,2002,136731,E85 Flex Fuel,0.004319,0.259534,0.076916,0.114797,None reported,13900
3,0.005252,0.000519,2017,19500,Gasoline,0.001022,0.105005,0.25949,0.570866,None reported,45000
4,0.104624,0.00285,2021,7388,Gasoline,0.003789,0.060602,0.25949,0.131791,None reported,97500


In [13]:
# Replace values in the 'accident' column
data['accident'] = data['accident'].map({
    'None reported': 0,
    'At least 1 accident or damage reported': 1
})

# Verify the transformation
print(data['accident'].value_counts())

accident
0    139724
1     41343
Name: count, dtype: int64
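
A dictionary passed to `Series.map` turns any label that is not in the dictionary into `NaN`, so it can be worth checking for unmapped labels after the replacement. A minimal sketch, using a made-up series with one unexpected label (`"Unknown"` is an assumption for illustration):

```python
import pandas as pd

accident = pd.Series([
    "None reported",
    "At least 1 accident or damage reported",
    "Unknown",  # hypothetical label that is not in the mapping
])

mapping = {
    "None reported": 0,
    "At least 1 accident or damage reported": 1,
}

encoded = accident.map(mapping)

# Labels that the mapping did not cover end up as NaN in `encoded`
unmapped = accident[encoded.isna()].unique()
print(unmapped)
```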


In [14]:

fuel_type = data[['fuel_type']]

cat_encoder = OneHotEncoder(sparse_output=False)
fuel_type_1hot = cat_encoder.fit_transform(fuel_type)

# sparse_output=False returns a dense NumPy array (zeros stored explicitly) instead of a SciPy sparse matrix
fuel_type_1hot

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

## Feature Engineering

In [15]:
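def feature_engineering(data):
    # Car age in years, relative to the 2024 competition year.
    # Note: the column is added to the DataFrame that was passed in.
    data["car_age"] = 2024 - data["model_year"]
    return data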

In [16]:
feature_engineered_data = feature_engineering(data)

## Scaling Numerical Features

In [17]:
numerical_features = ['model_year', 'milage', 'car_age']
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])
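
Since ```car_age``` is defined as 2024 minus ```model_year```, the two columns carry the same information, and after standardization one is simply the negative of the other. The sketch below demonstrates this on a toy frame (the four model years are taken from the rows shown earlier; the name `toy` is an assumption).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"model_year": [2002, 2007, 2017, 2021]})
toy["car_age"] = 2024 - toy["model_year"]

# Standardize both columns to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy[["model_year", "car_age"]])

# The scaled columns are exact mirror images of each other
print(np.allclose(scaled[:, 0], -scaled[:, 1]))
```

For linear models this perfect collinearity may be a reason to keep only one of the two columns; for tree-based models it is usually harmless.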

## Combine One-Hot Encoded Categories

In [18]:
# Convert the Encoded Data to a DataFrame
fuel_type_1hot_df = pd.DataFrame(fuel_type_1hot, columns=cat_encoder.get_feature_names_out(['fuel_type']))

In [19]:
data.isna().sum()

brand           0
model           0
model_year      0
milage          0
fuel_type       0
engine          0
transmission    0
ext_col         0
int_col         0
accident        0
price           0
car_age         0
dtype: int64

In [20]:
# Reset indices of both DataFrames
data.reset_index(drop=True, inplace=True)
fuel_type_1hot_df.reset_index(drop=True, inplace=True)

In [21]:
# Concatenate the Encoded Data with the Original DataFrame
data_processed = pd.concat([data, fuel_type_1hot_df], axis=1)


In [22]:
# Drop the original 'fuel_type' column now that its one-hot columns are present
data_processed.drop('fuel_type', axis=1, inplace=True)

In [23]:
data_processed

Unnamed: 0,brand,model,model_year,milage,engine,transmission,ext_col,int_col,accident,price,car_age,fuel_type_Diesel,fuel_type_E85 Flex Fuel,fuel_type_Gasoline,fuel_type_Hybrid,fuel_type_Plug-In Hybrid,fuel_type_not supported,fuel_type_–
0,0.005799,0.003170,-1.531749,2.927615,0.001883,0.259534,0.005241,0.114797,0,4200,1.531749,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.013205,0.000160,-2.413259,1.529983,0.000436,0.259534,0.090353,0.131791,1,4999,2.413259,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.088829,0.000337,-2.413259,1.399357,0.004319,0.259534,0.076916,0.114797,0,13900,2.413259,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.005252,0.000519,0.231270,-0.949687,0.001022,0.105005,0.259490,0.570866,0,45000,-0.231270,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.104624,0.002850,0.936478,-1.192384,0.003789,0.060602,0.259490,0.131791,0,97500,-0.936478,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181062,0.025394,0.003032,0.231270,-0.358574,0.015492,0.105005,0.232511,0.131791,0,27500,-0.231270,0.0,0.0,1.0,0.0,0.0,0.0,0.0
181063,0.104624,0.000596,0.407572,-0.767343,0.002402,0.111942,0.232511,0.570866,1,30000,-0.407572,0.0,0.0,1.0,0.0,0.0,0.0,0.0
181064,0.104624,0.000315,0.936478,-1.066907,0.002502,0.060602,0.232511,0.570866,0,86900,-0.936478,0.0,0.0,1.0,0.0,0.0,0.0,0.0
181065,0.058923,0.001441,1.112780,-1.061998,0.000243,0.000591,0.001116,0.570866,0,84900,-1.112780,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [25]:
train_data_processed = data_processed.drop('price', axis = 1)
train_labels_processed = data_processed['price']

In [26]:
train_data_processed.to_csv('../data/processed/train_data_processed.csv')
train_labels_processed.to_csv('../data/processed/train_labels_processed.csv')
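
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read with a plain `pd.read_csv`. The sketch below shows the difference using in-memory buffers instead of files; the toy frame is made up, and whether to pass `index=False` is a design choice that depends on how the processed files are loaded later.

```python
import io

import pandas as pd

toy = pd.DataFrame({"milage": [1.5, -0.9], "accident": [0, 1]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_default.columns.tolist())
print(reloaded_clean.columns.tolist())
```

If the index is kept in the file, reading it back with `index_col=0` restores it as the index rather than as a feature column.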