## Used Cars Dataset

### Step 1 - Get The Data
- The data for this project was downloaded from Kaggle: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data?select=car+data.csv
- The dataset contains listings of used cars: name, year, selling price, kilometres driven, fuel type, seller type, transmission, and ownership history.
- It can serve many purposes, such as price prediction, which makes it a convenient example of linear regression in machine learning.
- After downloading the data and reading it into a DataFrame, we take a look at the data and its attributes (columns).
- The `info()` method gives a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values per column.
 

In [46]:
## import libraries
import pandas as pd
import numpy as np
# import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

In [47]:
# Sklearn imports
from sklearn.model_selection import train_test_split

In [48]:
# load the dataset
used_cars_df = pd.read_csv("../datasets/used_cars/used_car_data.csv")
used_cars_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [49]:
# a quick description of the data
used_cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           4340 non-null   object
 1   year           4340 non-null   int64 
 2   selling_price  4340 non-null   int64 
 3   km_driven      4340 non-null   int64 
 4   fuel           4340 non-null   object
 5   seller_type    4340 non-null   object
 6   transmission   4340 non-null   object
 7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB


In [50]:
# summary statistics for the numeric columns in the dataset
used_cars_df.describe()

Unnamed: 0,year,selling_price,km_driven
count,4340.0,4340.0,4340.0
mean,2013.090783,504127.3,66215.777419
std,4.215344,578548.7,46644.102194
min,1992.0,20000.0,1.0
25%,2011.0,208749.8,35000.0
50%,2014.0,350000.0,60000.0
75%,2016.0,600000.0,90000.0
max,2020.0,8900000.0,806599.0


### Step 2 - Exploratory Data Analysis (EDA)

In [51]:
## all column names
used_cars_df.columns

Index(['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner'],
      dtype='object')

In [52]:
## Count the missing (NA) values in each column
used_cars_df.isna().sum()

name             0
year             0
selling_price    0
km_driven        0
fuel             0
seller_type      0
transmission     0
owner            0
dtype: int64

In [53]:
# Add a column for the age of the car; age is easier to interpret in a regression than the calendar year
# max_year is the year after the newest car, so the newest cars get age 1
max_year = used_cars_df["year"].max() + 1
used_cars_df["age"] = max_year - used_cars_df["year"]
used_cars_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,age
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,14
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner,14
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner,9
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner,4
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner,7
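
Because age is only a shifted and sign-flipped version of `year`, a straight-line fit on either column describes the same line: the slope changes sign and the intercept moves from "price in year 0" to "price at age 0". The sketch below illustrates this with `np.polyfit` on the five rows shown by `head()`; the variable names and the use of `np.polyfit` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

# Five rows copied from the head() output above; not the full dataset.
year = np.array([2007, 2007, 2012, 2017, 2014], dtype=float)
price = np.array([60000, 135000, 600000, 250000, 450000], dtype=float)

reference_year = 2021  # newest year in the dataset (2020) + 1
age = reference_year - year

# Fit price = slope * x + intercept by ordinary least squares.
slope_year, intercept_year = np.polyfit(year, price, deg=1)
slope_age, intercept_age = np.polyfit(age, price, deg=1)

print(f"year: slope={slope_year:.1f}, intercept={intercept_year:.1f}")
print(f"age:  slope={slope_age:.1f}, intercept={intercept_age:.1f}")
```

The two slopes have the same magnitude with opposite signs, and the age-based intercept is the fitted price of a car of age 0, which is much easier to read than a price extrapolated back to year 0.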


In [58]:
## Numeric columns or real value columns
numeric_cols = ["age", "selling_price", "km_driven"]

# Categorical Columns
categorical_cols = ["fuel", "seller_type", "transmission", "owner"]

# Columns not required
not_required_cols = ["name", "year"]

In [55]:
# drop the columns that are not required (name and year), then reset the index of the dataframe
used_cars_df = used_cars_df.drop(not_required_cols, axis=1)
used_cars_df = used_cars_df.reset_index(drop=True)
used_cars_df

Unnamed: 0,selling_price,km_driven,fuel,seller_type,transmission,owner,age
0,60000,70000,Petrol,Individual,Manual,First Owner,14
1,135000,50000,Petrol,Individual,Manual,First Owner,14
2,600000,100000,Diesel,Individual,Manual,First Owner,9
3,250000,46000,Petrol,Individual,Manual,First Owner,4
4,450000,141000,Diesel,Individual,Manual,Second Owner,7
...,...,...,...,...,...,...,...
4335,409999,80000,Diesel,Individual,Manual,Second Owner,7
4336,409999,80000,Diesel,Individual,Manual,Second Owner,7
4337,110000,83000,Petrol,Individual,Manual,Second Owner,12
4338,865000,90000,Diesel,Individual,Manual,First Owner,5


In [56]:
# List Categorical Columns
used_cars_df.select_dtypes(include=["object"]).columns.to_list()

['fuel', 'seller_type', 'transmission', 'owner']
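
A common next step for categorical columns is to count how often each level occurs. The following is a minimal sketch, assuming a small hypothetical `sample_df` built from the five rows shown by `head()`; on the real data the same calls would be made on `used_cars_df`.

```python
import pandas as pd

# Hypothetical sample: the categorical values of the five head() rows.
sample_df = pd.DataFrame({
    "fuel": ["Petrol", "Petrol", "Diesel", "Petrol", "Diesel"],
    "seller_type": ["Individual"] * 5,
    "transmission": ["Manual"] * 5,
    "owner": ["First Owner"] * 4 + ["Second Owner"],
})

categorical_cols = ["fuel", "seller_type", "transmission", "owner"]

# value_counts() returns the frequency of each level, most frequent first.
level_counts = {col: sample_df[col].value_counts() for col in categorical_cols}

for col, counts in level_counts.items():
    print(f"{col}: {counts.to_dict()}")
```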

In [59]:
# Estimate of central tendency: the mean of each numeric column
used_cars_df[numeric_cols].mean()

age                   7.909217
selling_price    504127.311751
km_driven         66215.777419
dtype: float64
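
The median and mode can be computed in the same way as the mean. In the `describe()` output above, the mean `selling_price` (about 504,127) is well above the median (350,000), which suggests a right-skewed price distribution, so the median is worth checking alongside the mean. A minimal sketch, assuming a hypothetical `sample_df` holding the numeric values of the five `head()` rows:

```python
import pandas as pd

# Hypothetical sample: the numeric values of the five head() rows.
sample_df = pd.DataFrame({
    "age": [14, 14, 9, 4, 7],
    "selling_price": [60000, 135000, 600000, 250000, 450000],
    "km_driven": [70000, 50000, 100000, 46000, 141000],
})

# Median of each numeric column.
medians = sample_df.median()

# mode() returns every most-frequent value, so the result is a Series
# that can hold more than one entry when there are ties.
age_modes = sample_df["age"].mode()

print(medians)
print(age_modes.tolist())
```

On the full dataset the same calls would be `used_cars_df[numeric_cols].median()` and `used_cars_df[numeric_cols].mode()`.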

### Step 3 - Create a Test Set
- To create a test set, pick some instances randomly, typically 20% of the dataset, and set them aside.

In [None]:
# Train/test split using purely random sampling (no stratification)
train_set, test_set = train_test_split(used_cars_df, test_size=0.2, random_state=42)
train_set.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
227,Mahindra Scorpio S11 BSIV,2017,1500000,20000,Diesel,Individual,Manual,First Owner
964,Maruti Swift Dzire VDI,2018,500000,50000,Diesel,Individual,Manual,First Owner
2045,Maruti Alto 800 LXI,2013,92800,25000,Petrol,Individual,Manual,Second Owner
1025,Chevrolet Beat Diesel LS,2011,95000,70000,Diesel,Individual,Manual,First Owner
4242,Maruti Vitara Brezza LDi Option,2017,685000,72000,Diesel,Dealer,Manual,First Owner
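
The idea behind a random split can also be sketched with pandas alone: sample 20% of the rows for the test set and keep the remainder for training. This is only an illustration on a hypothetical ten-row `toy_df`; it is not how `train_test_split` is implemented, and it will not select the same rows.

```python
import pandas as pd

# Hypothetical toy data standing in for used_cars_df.
toy_df = pd.DataFrame({
    "row_id": range(10),
    "selling_price": [p * 1000 for p in range(10)],
})

# Draw 20% of the rows at random; a fixed seed makes the split repeatable.
test_part = toy_df.sample(frac=0.2, random_state=42)

# Everything that was not sampled becomes the training set.
train_part = toy_df.drop(test_part.index)

print(len(train_part), len(test_part))
```

Fixing the seed matters for the same reason `random_state=42` is passed above: rerunning the notebook should not move rows between the training and test sets.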
