# Car Price Prediction: Multiple Linear Regression

## Objective
* Importing the libraries needed to build and visualize a linear regression model
* Data Wrangling
    * Handling missing values according to each column's data type
    * Changing data types, if needed
* Exploratory Data Analysis (EDA)
    * Evaluating relationships with visualizations (e.g., regression plot, residual plot)
    * Using the appropriate statistics (e.g., Pearson correlation)
    * Exploring a variety of statistical tests (e.g., ANOVA)
* Model Development & Refinement
    * Applying appropriate transformations for polynomial regression
    * Building a model that is less prone to overfitting with Ridge regression and GridSearchCV
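
The model refinement step listed above can be sketched roughly as follows. This is only a minimal illustration on synthetic data, not the notebook's actual model: the arrays `X` and `y`, the step names (`scale`, `poly`, `ridge`) and the parameter grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): a target with a quadratic dependence on one feature.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(120, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale, expand to polynomial terms, then fit a Ridge (L2-regularized) regression.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Hypothetical grid: polynomial degree and regularization strength.
param_grid = {
    "poly__degree": [1, 2, 3],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}

# 5-fold cross-validated grid search, scored by R^2.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

test_r2 = search.score(X_test, y_test)
print(search.best_params_)
print(round(test_r2, 3))
```

Putting the scaler and the polynomial expansion inside the pipeline means they are refit on each training fold, so the cross-validation scores are not influenced by the held-out fold.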


## Import Libraries

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML, display

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

print('loaded')

loaded


## Import the dataset

In [3]:
filepath = 'data/CarPrice_Assignment.csv'

df = pd.read_csv(filepath)

In [4]:
df.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |


## Data Wrangling

Let us start by dropping the first column, `car_ID`, because it only serves to identify each vehicle.

In [5]:
# removing the first column because it is irrelevant
df = df.drop('car_ID', axis=1)

In [9]:
for column in df.columns:
    print(f"The number of missing values in {column}: {df[column].isna().sum()}")

The number of missing values in symboling: 0
The number of missing values in CarName: 0
The number of missing values in fueltype: 0
The number of missing values in aspiration: 0
The number of missing values in doornumber: 0
The number of missing values in carbody: 0
The number of missing values in drivewheel: 0
The number of missing values in enginelocation: 0
The number of missing values in wheelbase: 0
The number of missing values in carlength: 0
The number of missing values in carwidth: 0
The number of missing values in carheight: 0
The number of missing values in curbweight: 0
The number of missing values in enginetype: 0
The number of missing values in cylindernumber: 0
The number of missing values in enginesize: 0
The number of missing values in fuelsystem: 0
The number of missing values in boreratio: 0
The number of missing values in stroke: 0
The number of missing values in compressionratio: 0
The number of missing values in horsepower: 0
The number of missing values in peakrpm

We can see that there are no missing values in this dataset.
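
The same check can also be written without an explicit loop. The sketch below uses a small toy frame (its values are illustrative only and include gaps on purpose) to show `isna().sum()` for per-column counts and a single boolean for the whole frame.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data; values are illustrative only.
toy = pd.DataFrame({
    "horsepower": [111, np.nan, 154],
    "fueltype": ["gas", "diesel", None],
    "price": [13495.0, 16500.0, 16500.0],
})

# One vectorized call counts the missing values in every column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single boolean answers "is anything missing at all?"
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the real data, `df.isna().sum()` would return one Series with the per-column counts that the loop prints line by line.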

## Exploratory Data Analysis

In [6]:
df.describe()

| | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329756 | 3.255415 | 10.142537 | 104.117073 | 5125.121951 | 25.219512 | 30.75122 | 13276.710571 |
| std | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.270844 | 0.313597 | 3.97204 | 39.544167 | 476.985643 | 6.542142 | 6.886443 | 7988.852332 |
| min | -2.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.15 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7788.0 |
| 50% | 1.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.58 | 3.41 | 9.4 | 116.0 | 5500.0 | 30.0 | 34.0 | 16503.0 |
| max | 3.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |


In [7]:
df.describe(include='all')

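| | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205.0 | 205.0 | ... | 205.0 | 205 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| unique | NaN | 147 | 2 | 2 | 2 | 5 | 3 | 2 | NaN | NaN | ... | NaN | 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | toyota corona | gas | std | four | sedan | fwd | front | NaN | NaN | ... | NaN | mpfi | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 6 | 185 | 168 | 115 | 96 | 120 | 202 | NaN | NaN | ... | NaN | 94 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.834146 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 98.756585 | 174.049268 | ... | 126.907317 | NaN | 3.329756 | 3.255415 | 10.142537 | 104.117073 | 5125.121951 | 25.219512 | 30.75122 | 13276.710571 |
| std | 1.245307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.021776 | 12.337289 | ... | 41.642693 | NaN | 0.270844 | 0.313597 | 3.97204 | 39.544167 | 476.985643 | 6.542142 | 6.886443 | 7988.852332 |
| min | -2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 86.6 | 141.1 | ... | 61.0 | NaN | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 94.5 | 166.3 | ... | 97.0 | NaN | 3.15 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7788.0 |
| 50% | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 97.0 | 173.2 | ... | 120.0 | NaN | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 102.4 | 183.1 | ... | 141.0 | NaN | 3.58 | 3.41 | 9.4 | 116.0 | 5500.0 | 30.0 | 34.0 | 16503.0 |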


In [8]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   symboling         205 non-null    int64  
 1   CarName           205 non-null    object 
 2   fueltype          205 non-null    object 
 3   aspiration        205 non-null    object 
 4   doornumber        205 non-null    object 
 5   carbody           205 non-null    object 
 6   drivewheel        205 non-null    object 
 7   enginelocation    205 non-null    object 
 8   wheelbase         205 non-null    float64
 9   carlength         205 non-null    float64
 10  carwidth          205 non-null    float64
 11  carheight         205 non-null    float64
 12  curbweight        205 non-null    int64  
 13  enginetype        205 non-null    object 
 14  cylindernumber    205 non-null    object 
 15  enginesize        205 non-null    int64  
 16  fuelsystem        205 non-null    object 
 1

In [22]:
df.shape

(205, 25)

Let us first look at the car price grouped by each categorical variable:
* `fueltype`
* `aspiration`
* `doornumber`
* `carbody`
* `drivewheel`
* `enginelocation`
* `enginetype`
* `cylindernumber`
* `fuelsystem`

In [48]:
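# Display several DataFrames side by side in one notebook cell

def horizontal(dfs):
    """Render each DataFrame in `dfs` as HTML inside a single flex container."""
    html = '<div style="display:flex">'
    for frame in dfs:  # `frame` avoids shadowing the global `df`
        html += '<div style="margin-right: 32px">'
        html += frame.to_html()
        html += '</div>'
    html += '</div>'
    display(HTML(html))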
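
The helper above renders directly into the notebook. As a minimal sketch, assuming the markup is also wanted outside a notebook (for example to save it to a file), a variant can return the HTML string instead of displaying it. The name `horizontal_html` and its `gap_px` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd


def horizontal_html(frames, gap_px=32):
    """Return one HTML string that lays the given DataFrames out side by side."""
    parts = ['<div style="display:flex">']
    for frame in frames:
        parts.append(f'<div style="margin-right: {gap_px}px">')
        parts.append(frame.to_html())
        parts.append('</div>')
    parts.append('</div>')
    return ''.join(parts)


# Two tiny illustrative frames (made-up values).
left = pd.DataFrame({"price": [1.0, 2.0]}, index=["diesel", "gas"])
right = pd.DataFrame({"price": [3.0, 4.0]}, index=["std", "turbo"])

html = horizontal_html([left, right])
print(html.count("<table"))  # number of tables embedded in the markup
```

In a notebook, the returned string can still be passed to `HTML(...)` and `display(...)` to get the same side-by-side view.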

In [20]:
# Selecting the numeric columns and the object (categorical) columns
num_only_df = df.select_dtypes(include=np.number)
obj_only_df = df.select_dtypes(include='object')
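
Once the numeric columns are separated, the Pearson correlation named in the objectives can be computed against `price`. The following is a minimal sketch on a toy frame with made-up values, not the notebook's actual result.

```python
import numpy as np
import pandas as pd

# Toy frame with two numeric features, one categorical column, and a price.
toy = pd.DataFrame({
    "enginesize": [97, 109, 130, 152, 181],
    "citympg": [31, 24, 21, 19, 17],
    "carbody": ["hatchback", "sedan", "convertible", "hatchback", "sedan"],
    "price": [6500.0, 13950.0, 13495.0, 16500.0, 21000.0],
})

# Keep numeric columns only; Pearson correlation is undefined for strings.
numeric = toy.select_dtypes(include=np.number)

# Correlation of every numeric feature with price, strongest positive first.
corr_with_price = (
    numeric.corr(method="pearson")["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

On the real data the same expression would be applied to `num_only_df`. Pearson correlation only measures linear association, so a plot is still worth checking.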

In [51]:
# Summarize price for every categorical column except CarName (the first object column)
df_list = []
for column in obj_only_df.columns[1:]:
    df_list.append(df[[column, 'price']].groupby(column).agg(['mean', 'max', 'min', 'count']))

horizontal(df_list)

| fueltype | price mean | price max | price min | price count |
|---|---|---|---|---|
| diesel | 15838.15 | 31600.0 | 7099.0 | 20 |
| gas | 12999.7982 | 45400.0 | 5118.0 | 185 |

| aspiration | price mean | price max | price min | price count |
|---|---|---|---|---|
| std | 12611.270833 | 45400.0 | 5118.0 | 168 |
| turbo | 16298.166676 | 31600.0 | 7689.0 | 37 |

| doornumber | price mean | price max | price min | price count |
|---|---|---|---|---|
| four | 13501.152174 | 40960.0 | 6229.0 | 115 |
| two | 12989.924078 | 45400.0 | 5118.0 | 90 |

| carbody | price mean | price max | price min | price count |
|---|---|---|---|---|
| convertible | 21890.5 | 37028.0 | 11595.0 | 6 |
| hardtop | 22208.5 | 45400.0 | 8249.0 | 8 |
| hatchback | 10376.652386 | 31400.5 | 5118.0 | 70 |
| sedan | 14344.270833 | 41315.0 | 5499.0 | 96 |
| wagon | 12371.96 | 28248.0 | 6918.0 | 25 |

| drivewheel | price mean | price max | price min | price count |
|---|---|---|---|---|
| 4wd | 11087.463 | 17859.167 | 7603.0 | 9 |
| fwd | 9239.308333 | 23875.0 | 5118.0 | 120 |
| rwd | 19910.809211 | 45400.0 | 6785.0 | 76 |

| enginelocation | price mean | price max | price min | price count |
|---|---|---|---|---|
| front | 12961.097361 | 45400.0 | 5118.0 | 202 |
| rear | 34528.0 | 37028.0 | 32528.0 | 3 |

| enginetype | price mean | price max | price min | price count |
|---|---|---|---|---|
| dohc | 18116.416667 | 35550.0 | 9298.0 | 12 |
| dohcv | 31400.5 | 31400.5 | 31400.5 | 1 |
| l | 14627.583333 | 18150.0 | 5151.0 | 12 |
| ohc | 11574.048426 | 41315.0 | 5195.0 | 148 |
| ohcf | 13738.6 | 37028.0 | 5118.0 | 15 |
| ohcv | 25098.384615 | 45400.0 | 13499.0 | 13 |
| rotor | 13020.0 | 15645.0 | 10945.0 | 4 |

| cylindernumber | price mean | price max | price min | price count |
|---|---|---|---|---|
| eight | 37400.1 | 45400.0 | 31400.5 | 5 |
| five | 21630.469727 | 31600.0 | 13295.0 | 11 |
| four | 10285.754717 | 22625.0 | 5118.0 | 159 |
| six | 23671.833333 | 41315.0 | 13499.0 | 24 |
| three | 5151.0 | 5151.0 | 5151.0 | 1 |
| twelve | 36000.0 | 36000.0 | 36000.0 | 1 |
| two | 13020.0 | 15645.0 | 10945.0 | 4 |

| fuelsystem | price mean | price max | price min | price count |
|---|---|---|---|---|
| 1bbl | 7555.545455 | 10295.0 | 5399.0 | 11 |
| 2bbl | 7478.151515 | 11245.0 | 5118.0 | 66 |
| 4bbl | 12145.0 | 13645.0 | 10945.0 | 3 |
| idi | 15838.15 | 31600.0 | 7099.0 | 20 |
| mfi | 12964.0 | 12964.0 | 12964.0 | 1 |
| mpfi | 17754.60284 | 45400.0 | 7957.0 | 94 |
| spdi | 10990.444444 | 14869.0 | 7689.0 | 9 |
| spfi | 11048.0 | 11048.0 | 11048.0 | 1 |
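
When the per-variable summaries need to be compared or filtered programmatically rather than viewed side by side, they can be stacked into one tidy frame. The sketch below is one possible approach using named aggregation on a toy frame. The column names `variable` and `level` are assumptions chosen for the example.

```python
import pandas as pd

# Toy frame with two categorical columns and a price (made-up values).
toy = pd.DataFrame({
    "fueltype": ["gas", "gas", "diesel", "diesel"],
    "aspiration": ["std", "turbo", "std", "turbo"],
    "price": [5000.0, 7000.0, 9000.0, 11000.0],
})

categorical_columns = ["fueltype", "aspiration"]

summaries = []
for column in categorical_columns:
    # Named aggregation gives flat column names instead of a MultiIndex.
    summary = (
        toy.groupby(column)["price"]
        .agg(mean="mean", max="max", min="min", count="count")
        .reset_index()
        .rename(columns={column: "level"})
    )
    # Record which categorical variable each row belongs to.
    summary.insert(0, "variable", column)
    summaries.append(summary)

# One long table: one row per (variable, level) pair.
price_summary = pd.concat(summaries, ignore_index=True)
print(price_summary)
```

A long table like this can be sorted by `mean` or filtered by `count` in a single expression, which is harder to do with a list of separate frames.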


In [40]:
obj_only_df['fuelsystem'].value_counts().to_frame()

| fuelsystem | count |
|---|---|
| mpfi | 94 |
| 2bbl | 66 |
| idi | 20 |
| 1bbl | 11 |
| spdi | 9 |
| 4bbl | 3 |
| mfi | 1 |
| spfi | 1 |
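
Several fuel systems appear only a handful of times. As a minimal sketch, the counts from the table above can be turned into shares and used to flag sparse levels. The cutoff `min_count = 5` is an arbitrary assumption for illustration, and the notebook itself does not say how rare levels should be treated.

```python
import pandas as pd

# Counts copied from the value_counts output above.
counts = pd.Series(
    {"mpfi": 94, "2bbl": 66, "idi": 20, "1bbl": 11,
     "spdi": 9, "4bbl": 3, "mfi": 1, "spfi": 1},
    name="count",
)

# Share of the cars that use each fuel system.
share = (counts / counts.sum()).round(3)
print(share)

# Hypothetical threshold below which a level is considered too sparse.
min_count = 5
rare_levels = counts[counts < min_count].index.tolist()
print(rare_levels)
```

On the real frame, `value_counts(normalize=True)` gives the shares directly.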
