<img src="https://miro.medium.com/max/647/1*ZOcUPrSXLYucFxppoI-dYg.png">

# Problem definition


This project uses a car dataset to predict the selling price of a car from its features.
Because the target is a continuous value, this is a regression problem.
We will use linear regression to solve it.

General equation of Multiple Linear Regression:
$$y = \beta_0 + \beta_{1}x_1 + \beta_{2}x_2 + \beta_{3}x_3 + \beta_{4}x_4 + ... + \beta_{n}x_n$$
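To make the equation concrete, here is a minimal sketch that recovers the coefficients with ordinary least squares in NumPy. The toy data and the `true_beta` values are made up for illustration and are not taken from the car dataset.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features (not from the car dataset).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
true_beta = np.array([1.5, 2.0, -0.5])  # beta_0, beta_1, beta_2 (assumed values)

# Build y exactly from y = beta_0 + beta_1 * x_1 + beta_2 * x_2.
y = true_beta[0] + X @ true_beta[1:]

# Prepend a column of ones so the intercept beta_0 is estimated as well.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares solution of X_design @ beta = y.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)
```

Because the toy target has no noise, the estimated coefficients match `true_beta` up to floating-point error.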

# Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Gathering

In [2]:
df = pd.read_csv("car_dataset.csv")

# Data Preparation

In [3]:
df.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


In [5]:
df.describe()

Unnamed: 0,Year,Selling_Price,Present_Price,Kms_Driven,Owner
count,301.0,301.0,301.0,301.0,301.0
mean,2013.627907,4.661296,7.628472,36947.20598,0.043189
std,2.891554,5.082812,8.644115,38886.883882,0.247915
min,2003.0,0.1,0.32,500.0,0.0
25%,2012.0,0.9,1.2,15000.0,0.0
50%,2014.0,3.6,6.4,32000.0,0.0
75%,2016.0,6.0,9.9,48767.0,0.0
max,2018.0,35.0,92.6,500000.0,3.0


In [6]:
df.isnull().sum()

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

# Feature Engineering

<ul>Fuel_Type feature:
    <li>Fuel is Petrol if Fuel_Type_Diesel = 0, Fuel_Type_Petrol = 1</li>
    <li>Fuel is Diesel if Fuel_Type_Diesel = 1, Fuel_Type_Petrol = 0</li>
    <li>Fuel is CNG if Fuel_Type_Diesel = 0, Fuel_Type_Petrol = 0</li>
   </ul>
<ul>Transmission feature:
    <li>Transmission is Manual if Transmission_Manual = 1</li>
    <li>Transmission is Automatic if Transmission_Manual = 0</li></ul>
<ul>Seller_Type feature:
    <li>Seller_Type is Individual if Seller_Type_Individual = 1</li>
    <li>Seller_Type is Dealer if Seller_Type_Individual = 0</li></ul>
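The rules above can be checked on a tiny example. This is only a sketch: the three-row frame `toy` is hypothetical, and `dtype=int` is my choice so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical mini-frame with the same three categorical columns.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Seller_Type": ["Dealer", "Individual", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

# One indicator column per category, stored as 0/1 integers.
toy_dummy = pd.get_dummies(toy, dtype=int)

# Only the columns needed to read back the rules listed above.
cols = ["Fuel_Type_Diesel", "Fuel_Type_Petrol",
        "Transmission_Manual", "Seller_Type_Individual"]
print(toy_dummy[cols])
```

The CNG row has both `Fuel_Type_Diesel` and `Fuel_Type_Petrol` equal to 0, which is how the third fuel category is identified.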
    


In [7]:
df.Fuel_Type.value_counts(dropna=False)

Petrol    239
Diesel     60
CNG         2
Name: Fuel_Type, dtype: int64

In [8]:
df.Transmission.value_counts()

Manual       261
Automatic     40
Name: Transmission, dtype: int64

In [9]:
df.Seller_Type.value_counts()

Dealer        195
Individual    106
Name: Seller_Type, dtype: int64

In [10]:
df.Year.value_counts()

2015    61
2016    50
2014    38
2017    35
2013    33
2012    23
2011    19
2010    15
2008     7
2009     6
2006     4
2005     4
2007     2
2003     2
2018     1
2004     1
Name: Year, dtype: int64

In [11]:
df.Owner.value_counts()

0    290
1     10
3      1
Name: Owner, dtype: int64

In [12]:
df["Age"] = df.Year.max() - df.Year

In [13]:
df = df.drop(columns=["Car_Name", "Year"])

In [14]:
df.head()

Unnamed: 0,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner,Age
0,3.35,5.59,27000,Petrol,Dealer,Manual,0,4
1,4.75,9.54,43000,Diesel,Dealer,Manual,0,5
2,7.25,9.85,6900,Petrol,Dealer,Manual,0,1
3,2.85,4.15,5200,Petrol,Dealer,Manual,0,7
4,4.6,6.87,42450,Diesel,Dealer,Manual,0,4


In [15]:
df_dummy = pd.get_dummies(df)

### Features and target variable

In [16]:
df_dummy.head()

Unnamed: 0,Selling_Price,Present_Price,Kms_Driven,Owner,Age,Fuel_Type_CNG,Fuel_Type_Diesel,Fuel_Type_Petrol,Seller_Type_Dealer,Seller_Type_Individual,Transmission_Automatic,Transmission_Manual
0,3.35,5.59,27000,0,4,0,0,1,1,0,0,1
1,4.75,9.54,43000,0,5,0,1,0,1,0,0,1
2,7.25,9.85,6900,0,1,0,0,1,1,0,0,1
3,2.85,4.15,5200,0,7,0,0,1,1,0,0,1
4,4.6,6.87,42450,0,4,0,1,0,1,0,0,1


In [17]:
X = df_dummy.drop("Selling_Price", axis=1)
y = df_dummy.Selling_Price

### Splitting data into training and testing

In [18]:
from sklearn.preprocessing import PolynomialFeatures

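**I compared degrees 1 through 4; degree 2 gave the best test scores.**

**The scores for every degree are listed at the end of the notebook.**

**Note that the cell outputs printed below match the Degree=4 table at the end of the notebook, not degree 2.**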

In [19]:
poly_converter = PolynomialFeatures(degree=2, include_bias=False)

In [20]:
poly_features = poly_converter.fit_transform(X)
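It helps to see how quickly the polynomial expansion grows. The sketch below first expands a two-feature toy row so the new columns are easy to read, then counts the generated columns for 11 input features, which is the width of `X` here. The arrays and the `counts` dictionary are illustrative names only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy features [a, b] = [2, 3]; degree 2 adds a^2, a*b and b^2.
toy = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(toy)
print(expanded)

# Count the generated columns for 11 input features at each degree.
n_features = 11
counts = {}
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly.fit(np.zeros((1, n_features)))
    counts[degree] = poly.n_output_features_
print(counts)
```

With 11 inputs the expansion produces 77 columns at degree 2 and 1364 at degree 4. Since a 70/30 split of 301 rows leaves roughly 210 training rows, the higher degrees give an unregularized model far more coefficients than observations.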

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=42)

### Scaling

In [23]:
from sklearn.preprocessing import StandardScaler

In [24]:
scaler = StandardScaler()

In [25]:
scaler.fit(X_train)

StandardScaler()

In [26]:
X_train = scaler.transform(X_train)

In [27]:
X_test = scaler.transform(X_test)
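The scaler is fitted on the training rows only and then reused for the test rows. A small sketch on random data (all names here are hypothetical) shows what that means for the resulting statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # toy "training" rows
X_te = rng.normal(loc=50.0, scale=10.0, size=(40, 3))   # toy "test" rows

toy_scaler = StandardScaler().fit(X_tr)  # mean and std come from training rows only
X_tr_s = toy_scaler.transform(X_tr)
X_te_s = toy_scaler.transform(X_te)      # reuses the training mean and std

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
print(X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns are only close to that, which is expected because no test statistics were used.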

## Model Building (Linear Regression)

In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
lm = LinearRegression()

In [30]:
lm.fit(X_train, y_train)

LinearRegression()

In [31]:
y_pred = lm.predict(X_test)

In [32]:
y_train_pred = lm.predict(X_train)

In [33]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [34]:
def eval_metrics(actual, pred):
    """Print R2, MAE, MSE and RMSE for the given true and predicted values."""
    mae = mean_absolute_error(actual, pred)
    mse = mean_squared_error(actual, pred)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    R2_score = r2_score(actual, pred)
    # The header text is fixed, so it is also printed for training data.
    print("Model testing performance:")
    print("--------------------------")
    print(f"R2_score \t: {R2_score}")
    print(f"MAE \t\t: {mae}")
    print(f"MSE \t\t: {mse}")
    print(f"RMSE \t\t: {rmse}")

In [35]:
eval_metrics(y_test, y_pred)

Model testing performance:
--------------------------
R2_score 	: -0.5471908374391437
MAE 		: 2.307544430325751
MSE 		: 44.05502224022357
RMSE 		: 6.637395742324212
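A negative R2 may look surprising, so here is a hand calculation of the four metrics in NumPy. The `actual` and `pred` arrays are made up, with deliberately poor predictions.

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([4.0, 3.0, 2.0, 1.0])  # deliberately bad: reversed order

errors = actual - pred
mae = np.mean(np.abs(errors))          # mean absolute error
mse = np.mean(errors ** 2)             # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # spread around the mean
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Here `ss_res` is 20 and `ss_tot` is 5, so R2 is -3. R2 drops below zero whenever the model's squared error is larger than that of always predicting the mean of the target.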


In [36]:
eval_metrics(y_train, y_train_pred)

Model testing performance:
--------------------------
R2_score 	: 0.9946650821498522
MAE 		: 0.21991648076534087
MSE 		: 0.13093294785386583
RMSE 		: 0.36184658054742735


# Model Evaluation

### Cross Validation

In [37]:
from sklearn.model_selection import cross_validate, cross_val_score

In [38]:
model = LinearRegression()
scores = cross_validate(
    model, X_train, y_train,
    scoring=["r2", "neg_mean_absolute_error",
             "neg_mean_squared_error", "neg_root_mean_squared_error"],
    cv=10,
)

In [39]:
scores = pd.DataFrame(scores, index=range(1, 11))  # one row per fold
# Skip the first two columns (fit_time, score_time) and average the metrics.
scores.iloc[:, 2:].mean()

test_r2                            -1.257517e+23
test_neg_mean_absolute_error       -2.327676e+11
test_neg_mean_squared_error        -2.324479e+24
test_neg_root_mean_squared_error   -7.587299e+11
dtype: float64
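The `neg_` scorers return negated errors, so the means above are negative by design. The sketch below, on synthetic data, shows how the columns are named and how to flip the sign back. Wrapping the scaler and the model in a pipeline is my own variation, so that scaling is refitted inside each fold; it is not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical, not the car dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=120)

# The scaler is refitted on the training part of every fold.
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_out = cross_validate(pipe, X_toy, y_toy, cv=10,
                        scoring=["r2", "neg_root_mean_squared_error"])
cv_df = pd.DataFrame(cv_out)
print(cv_df.columns.tolist())

# Negate the "neg_" scorer to read it as an ordinary RMSE.
summary = pd.Series({
    "r2": cv_df["test_r2"].mean(),
    "rmse": -cv_df["test_neg_root_mean_squared_error"].mean(),
})
print(summary)
```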

In [40]:
eval_metrics(y_test, y_pred)

Model testing performance:
--------------------------
R2_score 	: -0.5471908374391437
MAE 		: 2.307544430325751
MSE 		: 44.05502224022357
RMSE 		: 6.637395742324212


In [41]:
lm_scores = {"lm_train": {"rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred)),
    "mae" : mean_absolute_error(y_train, y_train_pred),
    "mse" : mean_squared_error(y_train, y_train_pred),
    "R2" : r2_score(y_train, y_train_pred)}, 

    "lm_test": {"rmse" : np.sqrt(mean_squared_error(y_test, y_pred)),
    "mae" : mean_absolute_error(y_test, y_pred),
    "mse" : mean_squared_error(y_test, y_pred),
    "R2" : r2_score(y_test, y_pred)}}
ls = pd.DataFrame(lm_scores)
ls

Unnamed: 0,lm_train,lm_test
rmse,0.361847,6.637396
mae,0.219916,2.307544
mse,0.130933,44.055022
R2,0.994665,-0.547191
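The same score dictionary is rebuilt for every model later on. One possible way to avoid the repetition is a small helper like the one below. `score_table` is a hypothetical name and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def score_table(name, y_train, y_train_pred, y_test, y_test_pred):
    """Return a DataFrame with rmse/mae/mse/R2 columns for train and test."""
    def column(actual, pred):
        mse = mean_squared_error(actual, pred)
        return {"rmse": np.sqrt(mse),
                "mae": mean_absolute_error(actual, pred),
                "mse": mse,
                "R2": r2_score(actual, pred)}
    return pd.DataFrame({f"{name}_train": column(y_train, y_train_pred),
                         f"{name}_test": column(y_test, y_test_pred)})

# Toy check: perfect predictions on "train", mean-only predictions on "test".
toy_scores = score_table("lm", [1.0, 2.0, 3.0], [1.0, 2.0, 3.0],
                         [1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(toy_scores)
```

With such a helper, each model's table could be produced by a single call such as `score_table("ridge_cv", y_train, y_train_pred, y_test, y_pred)`.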


# Regularization

## Ridge

In [42]:
from sklearn.linear_model import RidgeCV

In [43]:
alpha_space = np.linspace(0.01, 1, 100)

In [44]:
ridge_cv_model = RidgeCV(alphas = alpha_space, cv = 10, scoring = "neg_root_mean_squared_error")

In [45]:
ridge_cv_model.fit(X_train, y_train)

RidgeCV(alphas=array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11,
       0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22,
       0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33,
       0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44,
       0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55,
       0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66,
       0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
       0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88,
       0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99,
       1.  ]),
        cv=10, scoring='neg_root_mean_squared_error')

In [46]:
ridge_cv_model.alpha_

1.0
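The selected alpha of 1.0 is the largest value in `alpha_space`, so a stronger penalty might have scored even better. The sketch below shows a simple way to flag that situation and to search a wider, log-spaced grid. `on_grid_edge`, `wide_alphas` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def on_grid_edge(alpha, grid):
    """Return True if alpha is the smallest or largest value in the grid."""
    grid = np.asarray(grid)
    return bool(np.isclose(alpha, grid.min()) or np.isclose(alpha, grid.max()))

# Same grid as in the notebook, checked against the alpha reported above.
alpha_space = np.linspace(0.01, 1, 100)
hit_edge = on_grid_edge(1.0, alpha_space)
print(hit_edge)

# Hypothetical wider grid covering several orders of magnitude.
wide_alphas = np.logspace(-3, 3, 25)

# Synthetic data, only to show the call pattern.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 5))
y_toy = X_toy @ np.array([2.0, -1.0, 0.0, 0.5, 1.5]) + rng.normal(scale=0.5, size=80)

toy_ridge = RidgeCV(alphas=wide_alphas, cv=5).fit(X_toy, y_toy)
print(toy_ridge.alpha_, on_grid_edge(toy_ridge.alpha_, wide_alphas))
```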

In [47]:
y_pred = ridge_cv_model.predict(X_test)

In [48]:
eval_metrics(y_test, y_pred)

Model testing performance:
--------------------------
R2_score 	: 0.6620246447467675
MAE 		: 0.9162191174189288
MSE 		: 9.623578056455674
RMSE 		: 3.1021892360808154


In [49]:
y_train_pred = ridge_cv_model.predict(X_train)
eval_metrics(y_train, y_train_pred)

Model testing performance:
--------------------------
R2_score 	: 0.9912013955774566
MAE 		: 0.3052990798835174
MSE 		: 0.21594094724659357
RMSE 		: 0.4646944665547391


In [50]:
ridge_cv_scores = {"ridge_cv_train": {"rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred)),
    "mae" : mean_absolute_error(y_train, y_train_pred),
    "mse" : mean_squared_error(y_train, y_train_pred),
    "R2" : r2_score(y_train, y_train_pred)}, 

    "ridge_cv_test": {"rmse" : np.sqrt(mean_squared_error(y_test, y_pred)),
    "mae" : mean_absolute_error(y_test, y_pred),
    "mse" : mean_squared_error(y_test, y_pred),
    "R2" : r2_score(y_test, y_pred)}}
rcs = pd.DataFrame(ridge_cv_scores)
rcs

Unnamed: 0,ridge_cv_train,ridge_cv_test
rmse,0.464694,3.102189
mae,0.305299,0.916219
mse,0.215941,9.623578
R2,0.991201,0.662025


In [51]:
pd.concat([ls, rcs], axis = 1)

Unnamed: 0,lm_train,lm_test,ridge_cv_train,ridge_cv_test
rmse,0.361847,6.637396,0.464694,3.102189
mae,0.219916,2.307544,0.305299,0.916219
mse,0.130933,44.055022,0.215941,9.623578
R2,0.994665,-0.547191,0.991201,0.662025


## Lasso

In [52]:
from sklearn.linear_model import LassoCV

In [53]:
lasso_cv_model = LassoCV(alphas = alpha_space, cv = 10, max_iter = 100000)

In [54]:
lasso_cv_model.fit(X_train, y_train)

LassoCV(alphas=array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11,
       0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22,
       0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33,
       0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44,
       0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55,
       0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66,
       0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
       0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88,
       0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99,
       1.  ]),
        cv=10, max_iter=100000)

In [55]:
lasso_cv_model.alpha_

0.09
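Unlike Ridge, the L1 penalty can set coefficients exactly to zero, which matters when the polynomial expansion creates many columns. A minimal sketch on synthetic data, where only the first two of ten features drive the target (the data and `alpha=0.5` are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only the first two influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

# Count the coefficients the penalty left non-zero.
n_kept = int(np.sum(toy_lasso.coef_ != 0))
print(f"{n_kept} of {X_toy.shape[1]} coefficients are non-zero")
```

The same count on the fitted notebook model would be `np.sum(lasso_cv_model.coef_ != 0)`.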

In [56]:
y_train_pred = lasso_cv_model.predict(X_train)
eval_metrics(y_train, y_train_pred)

Model testing performance:
--------------------------
R2_score 	: 0.9678968976740915
MAE 		: 0.5857289816167638
MSE 		: 0.7878947606792187
RMSE 		: 0.8876343620428507


In [57]:
y_pred = lasso_cv_model.predict(X_test)
eval_metrics(y_test, y_pred)

Model testing performance:
--------------------------
R2_score 	: 0.8729436979180589
MAE 		: 0.9522568065403407
MSE 		: 3.6178266303885405
RMSE 		: 1.9020585244383361


In [58]:
lasso_cv_scores = {"lasso_cv_train": {"rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred)),
    "mae" : mean_absolute_error(y_train, y_train_pred),
    "mse" : mean_squared_error(y_train, y_train_pred),
    "R2" : r2_score(y_train, y_train_pred)}, 

    "lasso_cv_test": {"rmse" : np.sqrt(mean_squared_error(y_test, y_pred)),
    "mae" : mean_absolute_error(y_test, y_pred),
    "mse" : mean_squared_error(y_test, y_pred),
    "R2" : r2_score(y_test, y_pred)}}
lcs = pd.DataFrame(lasso_cv_scores)
lcs

Unnamed: 0,lasso_cv_train,lasso_cv_test
rmse,0.887634,1.902059
mae,0.585729,0.952257
mse,0.787895,3.617827
R2,0.967897,0.872944


In [59]:
pd.concat([ls, rcs, lcs], axis = 1)

Unnamed: 0,lm_train,lm_test,ridge_cv_train,ridge_cv_test,lasso_cv_train,lasso_cv_test
rmse,0.361847,6.637396,0.464694,3.102189,0.887634,1.902059
mae,0.219916,2.307544,0.305299,0.916219,0.585729,0.952257
mse,0.130933,44.055022,0.215941,9.623578,0.787895,3.617827
R2,0.994665,-0.547191,0.991201,0.662025,0.967897,0.872944


## Elastic-Net 

In [60]:
from sklearn.linear_model import ElasticNet, ElasticNetCV

In [61]:
elastic_cv_model = ElasticNetCV(alphas = alpha_space, l1_ratio=[0.1, 0.5, 0.7,0.9, 0.95, 1], cv = 10, max_iter = 100000)

In [62]:
elastic_cv_model.fit(X_train, y_train)

ElasticNetCV(alphas=array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11,
       0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22,
       0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33,
       0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44,
       0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55,
       0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66,
       0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
       0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88,
       0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99,
       1.  ]),
             cv=10, l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1], max_iter=100000)

In [63]:
elastic_cv_model.alpha_

0.45

In [64]:
elastic_cv_model.l1_ratio_

0.1
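For reference, scikit-learn documents the Elastic-Net objective as follows, where $\alpha$ is `alpha`, $\rho$ is `l1_ratio` and $m$ is the number of training samples (the symbols $\rho$ and $m$ are my notation, not the notebook's):

$$\min_{\beta} \; \frac{1}{2m}\lVert y - X\beta \rVert_2^2 + \alpha\rho\lVert \beta \rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert \beta \rVert_2^2$$

With the values selected above ($\alpha = 0.45$, $\rho = 0.1$), the L1 term has weight $0.45 \times 0.1 = 0.045$ and the squared L2 term has weight $0.45 \times 0.9 / 2 = 0.2025$, so the chosen model behaves mostly like Ridge.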

In [65]:
y_train_pred = elastic_cv_model.predict(X_train)
eval_metrics(y_train, y_train_pred)

Model testing performance:
--------------------------
R2_score 	: 0.9587349723380985
MAE 		: 0.6549131909795064
MSE 		: 1.0127525609216996
RMSE 		: 1.0063560805806757


In [66]:
y_pred = elastic_cv_model.predict(X_test)
eval_metrics(y_test, y_pred)

Model testing performance:
--------------------------
R2_score 	: 0.842635069182998
MAE 		: 1.0276310238164565
MSE 		: 4.480840604284508
RMSE 		: 2.1167996136348166


In [67]:
elastic_cv_scores = {"elastic_cv_train": {"rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred)),
    "mae" : mean_absolute_error(y_train, y_train_pred),
    "mse" : mean_squared_error(y_train, y_train_pred),
    "R2" : r2_score(y_train, y_train_pred)}, 

    "elastic_cv_test": {"rmse" : np.sqrt(mean_squared_error(y_test, y_pred)),
    "mae" : mean_absolute_error(y_test, y_pred),
    "mse" : mean_squared_error(y_test, y_pred),
    "R2" : r2_score(y_test, y_pred)}}
ecs = pd.DataFrame(elastic_cv_scores)
ecs

Unnamed: 0,elastic_cv_train,elastic_cv_test
rmse,1.006356,2.1168
mae,0.654913,1.027631
mse,1.012753,4.480841
R2,0.958735,0.842635


# Choosing Model

**To do this, I restarted the kernel with different degrees and checked the scores.**
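The same comparison can also be run in one loop, without restarting the kernel. The sketch below uses synthetic data and a pipeline, so it illustrates the idea rather than reproducing the notebook's run. The data, the names and the choice to expand features after the split are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic target with genuinely quadratic structure (hypothetical data).
rng = np.random.default_rng(42)
X_toy = rng.uniform(-2, 2, size=(300, 3))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] * X_toy[:, 2] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

results = {}
for degree in range(1, 5):
    # Expansion, scaling and the model are refitted for each degree.
    pipe = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                         StandardScaler(),
                         LinearRegression())
    pipe.fit(X_tr, y_tr)
    results[degree] = {"train_R2": r2_score(y_tr, pipe.predict(X_tr)),
                       "test_R2": r2_score(y_te, pipe.predict(X_te))}

print(pd.DataFrame(results))
```

Swapping `LinearRegression()` for `RidgeCV`, `LassoCV` or `ElasticNetCV` in the pipeline would extend the loop to the regularized models.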

### With Degree=1 (Age included)

In [68]:
pd.concat([ls, rcs, lcs, ecs], axis = 1)

Unnamed: 0,lm_train,lm_test,ridge_cv_train,ridge_cv_test,lasso_cv_train,lasso_cv_test,elastic_cv_train,elastic_cv_test
rmse,1.695972,1.881953,1.696115,1.887694,1.741942,1.995606,1.73525,1.99896
mae,1.146981,1.269836,1.147553,1.272388,1.121672,1.319979,1.160606,1.317664
mse,2.876321,3.541749,2.876805,3.563388,3.034362,3.982445,3.011094,3.99584
R2,0.882803,0.875616,0.882783,0.874856,0.876364,0.860138,0.877312,0.859668


### With Degree=2 (Age included)

In [68]:
pd.concat([ls, rcs, lcs, ecs], axis = 1)

Unnamed: 0,lm_train,lm_test,ridge_cv_train,ridge_cv_test,lasso_cv_train,lasso_cv_test,elastic_cv_train,elastic_cv_test
rmse,0.615075,1.119265,0.61408,1.121275,0.689535,1.081935,0.689535,1.081935
mae,0.439206,0.642369,0.416676,0.641896,0.452941,0.659212,0.452941,0.659212
mse,0.378318,1.252755,0.377095,1.257258,0.475458,1.170583,0.475458,1.170583
R2,0.984585,0.956004,0.984635,0.955846,0.980627,0.95889,0.980627,0.95889


### With Degree=3 (Age included)

In [68]:
pd.concat([ls, rcs, lcs, ecs], axis = 1)

Unnamed: 0,lm_train,lm_test,ridge_cv_train,ridge_cv_test,lasso_cv_train,lasso_cv_test,elastic_cv_train,elastic_cv_test
rmse,0.431924,3.883108,0.510404,1.089686,0.573435,1.088201,0.573376,1.138123
mae,0.271853,1.163168,0.346686,0.609201,0.386883,0.595537,0.3911,0.609599
mse,0.186558,15.078528,0.260512,1.187416,0.328828,1.184182,0.32876,1.295324
R2,0.992399,0.470449,0.989385,0.958299,0.986602,0.958412,0.986605,0.954509


### With Degree=4 (Age included)

In [68]:
pd.concat([ls, rcs, lcs, ecs], axis = 1)

Unnamed: 0,lm_train,lm_test,ridge_cv_train,ridge_cv_test,lasso_cv_train,lasso_cv_test,elastic_cv_train,elastic_cv_test
rmse,0.361847,6.637396,0.464694,3.102189,0.887634,1.902059,1.006356,2.1168
mae,0.219916,2.307544,0.305299,0.916219,0.585729,0.952257,0.654913,1.027631
mse,0.130933,44.055022,0.215941,9.623578,0.787895,3.617827,1.012753,4.480841
R2,0.994665,-0.547191,0.991201,0.662025,0.967897,0.872944,0.958735,0.842635


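**Degree 2 gives the best test scores: Lasso and Elastic-Net reach a test R2 of 0.95889 (RMSE 1.081935), and the plain linear model reaches 0.956004. At degrees 3 and 4 the plain linear model overfits, with test R2 dropping to 0.470449 and -0.547191.**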