# Understand the problems
## Context
We have a CSV file containing information about car listings. The defined tasks are:
1. Build a model
- Work on the brands that make up 90% of the total cars -> Filtering by brand is probably a good option.
- Predict the price of used cars -> Regression task.
2. Define success metrics
- Metrics suited to regression tasks, including:
    - functional metrics -> MSE, R-squared
    - non-functional metrics -> inference time per request
3. Build an API
- The objective is to develop a RESTful API for the model above -> Since this is a demo, I will probably stick with FastAPI to keep it quick.
4. Data Analysis Questions
- How does mileage relate to car price? Is there a clear negative correlation? -> I think this question is data-centric rather than model-centric, so model-interpretation techniques such as SHAP values are probably not suitable. Statistical methods should be more general.
- Does the fuel type (petrol, diesel, etc.) have a noticeable impact on price? -> I can reframe this as: is the relationship between fuel type and price statistically significant? -> Probably use a statistical test from an existing library.
- How to compare two cars in the same segment? -> Still thinking (what does "segment" mean?)
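
The two statistical questions in the list can be approached without a trained model. Below is a minimal sketch, assuming SciPy is installed. The column names `mileage_v2`, `price`, and `fuel` come from the dataset, but the frame `toy` and all its values are synthetic. The choice of Spearman correlation and the Kruskal-Wallis test is my assumption (both are rank-based, which seems reasonable for skewed prices), not something the task prescribes.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the listings; values are made up for illustration only.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "mileage_v2": rng.integers(0, 200_000, size=n),
    "fuel": rng.choice(["petrol", "oil", "hybrid"], size=n),
})
# Price falls with mileage plus noise, so the sign of the correlation is known in advance.
toy["price"] = 800e6 - 2_000 * toy["mileage_v2"] + rng.normal(0, 50e6, size=n)

# Q1: rank correlation between mileage and price (robust to skew and outliers).
rho, rho_p = stats.spearmanr(toy["mileage_v2"], toy["price"])

# Q2: do price distributions differ across fuel types? (non-parametric one-way test)
groups = [g["price"].to_numpy() for _, g in toy.groupby("fuel")]
h_stat, h_p = stats.kruskal(*groups)

print(f"Spearman rho={rho:.3f} (p={rho_p:.3g})")
print(f"Kruskal-Wallis H={h_stat:.3f} (p={h_p:.3g})")
```

On the real data, a clearly negative `rho` with a small p-value would support the "negative correlation" reading, and a small `h_p` would suggest that at least one fuel type has a different price distribution. Neither says anything about causation.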

## Scoring objectives
- How well does your model perform? -> This is probably strongly tied to the chosen metrics.
- How effective is the selected metric in evaluating model performance? -> Do these metrics align with what people use on real-world problems?
- How could your model be improved? -> To be honest, both modeling and data preparation need work to improve the model's predictions.
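
To make the metric discussion concrete, here is a rough sketch of how the functional metrics (MSE, R-squared) and the non-functional one (inference time per request) could be measured together. Everything here is a placeholder: the data is synthetic, `LinearRegression` stands in for whatever model ends up being used, and RMSE is added only because it is in the same unit as the target.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data; stands in for the car features and prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)  # placeholder model

# Functional metrics on the held-out split.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))  # same unit as the target, easier to read than MSE
r2 = r2_score(y_test, y_pred)

# Non-functional metric: mean wall-clock latency of a single-row prediction.
n_requests = 200
start = time.perf_counter()
for i in range(n_requests):
    model.predict(X_test[i % len(X_test)].reshape(1, -1))
latency_ms = (time.perf_counter() - start) / n_requests * 1000

print(f"MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f} latency={latency_ms:.3f} ms/request")
```

The latency measured this way covers only the model call. An API request would add serialization and network overhead on top.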

# Import libraries

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import sys

In [7]:
# Load data
data_path = 'data/raw'
ref = pd.read_csv(f'{data_path}/car.csv') # for reference if needed
data = pd.read_csv(f'{data_path}/car.csv')

In [None]:
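# Drop exact duplicate rows from the reference copy, then count distinct values per column
unique_df = ref[~ref.duplicated()]
unique_df.nunique()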

id                  37545
list_id             37545
list_time           78507
manufacture_date       44
brand                  67
model                 521
origin                 10
type                    9
seats                  11
gearbox                 4
fuel                    4
color                  12
mileage_v2           3240
price                2095
condition               2
dtype: int64


# EDA and Preprocessing

In [3]:
# Check for the shape of the data
data.shape

(317636, 15)

In [None]:
# Check for the first 5 rows of the data
data.head()

Unnamed: 0,id,list_id,list_time,manufacture_date,brand,model,origin,type,seats,gearbox,fuel,color,mileage_v2,price,condition
0,148468232,108616925,1693378633111,1980,Jeep,A2,Mỹ,SUV / Cross over,4.0,MT,petrol,green,40000,380000000.0,used
1,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
2,149580046,109560282,1693462201000,2016,Kia,Rio,Hàn Quốc,,,AT,petrol,,78545,295000000.0,used
3,148601679,108727914,1693493126176,2020,Toyota,Vios,Việt Nam,Sedan,5.0,MT,petrol,white,99999,368000000.0,used
4,149530234,109517456,1693313503000,2001,Fiat,Siena,,,,MT,petrol,white,200000,73000000.0,used


In [5]:
# Check for the data types of the columns
data.dtypes

id                    int64
list_id               int64
list_time             int64
manufacture_date      int64
brand                object
model                object
origin               object
type                 object
seats               float64
gearbox              object
fuel                 object
color                object
mileage_v2            int64
price               float64
condition            object
dtype: object

What exactly are `list_id` and `list_time`? ._.

In [7]:
# Check for the unique values of the columns (id, list_id, list_time)
data['id'].nunique(), data['list_id'].nunique(), data['list_time'].nunique()

(37545, 37545, 78507)

In [8]:
# Check if the combination of these 3 create unique identifiers
data['id_list_id_list_time'] = data['id'].astype(str) + '_' + data['list_id'].astype(str) + '_' + data['list_time'].astype(str)
data['id_list_id_list_time'].nunique()

78793

In [13]:
data['id_list_id_list_time'].head()
unique_identifiers = data['id_list_id_list_time'].nunique()
print(f'The data has: {unique_identifiers} unique identifiers')

The data has: 78793 unique identifiers
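
String concatenation works for this check. As an alternative, the sketch below counts distinct key combinations with `drop_duplicates(subset=...)` and avoids adding a helper column. The frame `toy` and its values are hypothetical; on the notebook's data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Tiny stand-in frame; values are invented for illustration.
toy = pd.DataFrame({
    "id":        [1, 1, 2, 2, 3, 3],
    "list_id":   [10, 10, 20, 20, 30, 30],
    "list_time": [100, 100, 200, 201, 300, 300],
    "price":     [5.0, 5.0, 7.0, 7.5, 9.0, 9.5],
})

key_cols = ["id", "list_id", "list_time"]

# Number of distinct (id, list_id, list_time) combinations.
n_keys = len(toy.drop_duplicates(subset=key_cols))

# Does every id map to exactly one list_id?
one_to_one = bool((toy.groupby("id")["list_id"].nunique() == 1).all())

# Number of fully distinct rows, for comparison with the key count.
n_rows_distinct = len(toy.drop_duplicates())

print(n_keys, one_to_one, n_rows_distinct)
```

In this toy frame there are fewer distinct keys than distinct rows, because one key appears with two different prices. When that happens, the three columns together do not identify a unique row.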


In [16]:
# Check for duplicates
duplicates = data.duplicated()
print(f'The data has: {duplicates.sum()} duplicates')
duplicates

The data has: 236178 duplicates


0         False
1         False
2         False
3         False
4         False
          ...  
317631     True
317632     True
317633     True
317634     True
317635     True
Length: 317636, dtype: bool

In [8]:
# Try with a random id to see what happens
data[data.id==149864917]   

Unnamed: 0,id,list_id,list_time,manufacture_date,brand,model,origin,type,seats,gearbox,fuel,color,mileage_v2,price,condition
1,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
128,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
255,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
297,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
488,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
594,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
705,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used
801,149864917,109805135,1694308247000,2021,Honda,City,Nhật Bản,Sedan,5.0,AT,petrol,white,23000,455000000.0,used


It seems the data is heavily duplicated, so it is reasonable to drop the duplicates.

In [9]:
# keep=False flags every row that has at least one exact copy, including the first occurrence
duplicates = data[data.duplicated(keep=False)]
duplicates.sort_values(by=list(duplicates.columns))

Unnamed: 0,id,list_id,list_time,manufacture_date,brand,model,origin,type,seats,gearbox,fuel,color,mileage_v2,price,condition
206668,45885586,29890521,1695457603346,2023,Ford,Ranger,Việt Nam,Pick-up (bán tải),5.0,MT,oil,others,0,195000000.0,used
210744,45885586,29890521,1695457603346,2023,Ford,Ranger,Việt Nam,Pick-up (bán tải),5.0,MT,oil,others,0,195000000.0,used
215483,45885586,29890521,1695457603346,2023,Ford,Ranger,Việt Nam,Pick-up (bán tải),5.0,MT,oil,others,0,195000000.0,used
220510,45885586,29890521,1695457603346,2023,Ford,Ranger,Việt Nam,Pick-up (bán tải),5.0,MT,oil,others,0,195000000.0,used
225051,45885586,29890521,1695457603346,2023,Ford,Ranger,Việt Nam,Pick-up (bán tải),5.0,MT,oil,others,0,195000000.0,used
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199907,151369542,111096726,1698207185000,2022,Mitsubishi,Outlander,Việt Nam,SUV / Cross over,7.0,AT,petrol,white,56000,699000000.0,used
191306,151369657,111096801,1698207292000,2023,Kia,Sportage,Hàn Quốc,SUV / Cross over,6.0,AT,petrol,blue,0,799000000.0,new
191307,151369657,111096801,1698207292000,2023,Kia,Sportage,Hàn Quốc,SUV / Cross over,6.0,AT,petrol,blue,0,799000000.0,new
187682,151369827,111096948,1698207471000,2022,Honda,City,,,5.0,AT,petrol,,14000,526000000.0,used


Ok, it is very bad. So we have to delete them :(

In [10]:
data = data.drop_duplicates()
print(f'The data now has: {data.shape[0]} rows and {data.shape[1]} columns')

The data now has: 81458 rows and 15 columns
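
The numbers are consistent: 317636 rows minus 236178 flagged duplicates leaves 81458. The sketch below shows the same arithmetic on a hypothetical toy frame, along with the difference between `keep="first"` (the default used by `duplicated()` and `drop_duplicates()`) and `keep=False`.

```python
import pandas as pd

# Invented rows: three identical Honda rows, one Kia row, two identical Ford rows.
toy = pd.DataFrame({
    "brand": ["Honda", "Honda", "Honda", "Kia", "Ford", "Ford"],
    "price": [455, 455, 455, 295, 195, 195],
})

n_extra = int(toy.duplicated(keep="first").sum())   # copies beyond the first occurrence
n_involved = int(toy.duplicated(keep=False).sum())  # every row that has at least one twin
deduped = toy.drop_duplicates()

# Rows remaining after dedup = total rows - extra copies.
assert len(toy) - n_extra == len(deduped)
print(n_extra, n_involved, len(deduped))
```

Here `n_extra` is 3, `n_involved` is 5, and 3 rows remain, so the count printed by `duplicated().sum()` is exactly the number of rows that `drop_duplicates()` removes.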


In [None]:
data.nunique()

id                  37545
list_id             37545
list_time           78507
manufacture_date       44
brand                  67
model                 521
origin                 10
type                    9
seats                  11
gearbox                 4
fuel                    4
color                  12
mileage_v2           3240
price                2095
condition               2
dtype: int64

In [13]:
# Create a reference copy of the raw data
ref = pd.read_csv(f'{data_path}/car.csv')
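
The column summary reports 67 distinct brands, and the task asks to work on the brands that make up 90% of the cars. Below is a minimal sketch of one way to pick them: take the most frequent brands until their cumulative share reaches the threshold. The helper name `top_brands_by_coverage` and the toy counts are hypothetical; on the notebook's data the input would be `data["brand"]`.

```python
import pandas as pd

# Toy brand column with invented counts (shares: 50%, 25%, 17%, 5%, 3%).
brands = pd.Series(
    ["Toyota"] * 50 + ["Honda"] * 25 + ["Kia"] * 17 + ["Fiat"] * 5 + ["Jeep"] * 3
)

def top_brands_by_coverage(brand_col: pd.Series, coverage: float = 0.90) -> list:
    """Return the most frequent brands whose cumulative share first reaches `coverage`."""
    share = brand_col.value_counts(normalize=True)  # sorted descending by default
    cumulative = share.cumsum()
    # Keep brands up to and including the first one that crosses the threshold.
    n_keep = int((cumulative < coverage).sum()) + 1
    return share.index[:n_keep].tolist()

kept = top_brands_by_coverage(brands)
filtered = brands[brands.isin(kept)]
print(kept, len(filtered) / len(brands))
```

Whether the 90% should be measured before or after dropping duplicates is a separate decision. Since the raw file is heavily duplicated, computing the shares on the deduplicated rows seems safer.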

# Feature Engineering

# Modeling and Evaluation