The "mpg" dataset (short for "miles per gallon") contains information about various car models and their characteristics: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and fuel efficiency in miles per gallon (mpg).

Here's a brief explanation of each column:

- mpg: Miles per gallon, representing the fuel efficiency of the car. This is the target variable.
- cylinders: Number of cylinders in the engine.
- displacement: Engine displacement, the total volume swept by all of the pistons of the engine.
- horsepower: The power of the engine, measured in horsepower (hp).
- weight: Weight of the car, measured in pounds.
- acceleration: Time in seconds for the car to accelerate from 0 to 60 miles per hour (mph).
- model_year: Model year of the car, stored as a two-digit year (for example, 70 for 1970).
- origin: Region the car comes from, a categorical variable. The seaborn copy loaded below stores it as the strings usa, japan, and europe.
- name: The name of the car model.

This dataset is commonly used for regression tasks, where the goal is to predict the fuel efficiency (mpg) of a car from its other characteristics.

In [493]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [494]:
df=sns.load_dataset('mpg')

In [495]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [496]:
df.drop('name', axis=1, inplace=True)
# Drop the car name: it only identifies the car and carries no pattern the model can learn from.

In [497]:
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

In [498]:
df.shape

(398, 8)

In [499]:
df['horsepower'].median()

93.5

In [500]:
df['horsepower']=df['horsepower'].fillna(df['horsepower'].median())
# Use the median rather than the mean: outliers have not been treated yet, and the median is robust to them.
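To see why the median is the safer fill value while outliers are still present, here is a minimal sketch on made-up numbers (the values below are hypothetical and are not taken from the mpg dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower-like values with one missing entry and one extreme outlier.
hp = pd.Series([88.0, 90.0, 95.0, 100.0, np.nan, 400.0])

mean_value = hp.mean()      # NaN is skipped; the outlier pulls the mean upward
median_value = hp.median()  # NaN is skipped; the outlier barely affects the median

# Fill the missing entry with each statistic to compare the imputed values.
filled_with_mean = hp.fillna(mean_value)
filled_with_median = hp.fillna(median_value)

print(f"mean = {mean_value}, median = {median_value}")
print(f"imputed with mean: {filled_with_mean[4]}, imputed with median: {filled_with_median[4]}")
```

The single extreme value drags the mean well above the typical values, while the median stays close to them.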

In [501]:
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64

In [502]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
dtypes: float64(4), int64(3), object(1)
memory usage: 25.0+ KB


In [503]:
df['origin']

0         usa
1         usa
2         usa
3         usa
4         usa
        ...  
393       usa
394    europe
395       usa
396       usa
397       usa
Name: origin, Length: 398, dtype: object

In [504]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
dtype: object

In [505]:
df['origin'].unique()

array(['usa', 'japan', 'europe'], dtype=object)

In [506]:
df['origin'].value_counts()

origin
usa       249
japan      79
europe     70
Name: count, dtype: int64

In [507]:
df['origin']=df['origin'].map({'usa':1, 'japan':2, 'europe':3})
# Label encoding. Note: integer codes impose an arbitrary order (usa < japan < europe) on a nominal variable.

In [508]:
df['origin']=df['origin'].astype(int)

In [509]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin            int32
dtype: object

In [510]:
df['origin'].value_counts()

origin
1    249
2     79
3     70
Name: count, dtype: int64
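Because origin has no natural order, one-hot encoding is a common alternative to integer codes for linear models. The sketch below uses a small stand-in frame (`toy` is a hypothetical name, not part of this notebook) rather than changing `df`:

```python
import pandas as pd

# Small stand-in for the 'origin' column before it was mapped to integers.
toy = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

# One indicator column per category. drop_first=True drops the first category
# (alphabetically 'europe') so the columns are not redundant with an intercept.
encoded = pd.get_dummies(toy, columns=["origin"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding each region gets its own coefficient relative to the dropped category, instead of a single slope across arbitrary codes 1, 2, 3.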

In [511]:
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,18.0,8,307.0,130.0,3504,12.0,70,1
1,15.0,8,350.0,165.0,3693,11.5,70,1
2,18.0,8,318.0,150.0,3436,11.0,70,1
3,16.0,8,304.0,150.0,3433,12.0,70,1
4,17.0,8,302.0,140.0,3449,10.5,70,1
...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,1
394,44.0,4,97.0,52.0,2130,24.6,82,3
395,32.0,4,135.0,84.0,2295,11.6,82,1
396,28.0,4,120.0,79.0,2625,18.6,82,1


In [512]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int32  
dtypes: float64(4), int32(1), int64(3)
memory usage: 23.4 KB


In [513]:
# Separate the features (X) from the target (y)
X=df.drop('mpg', axis=1)
y=df['mpg']

In [514]:
X

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,8,307.0,130.0,3504,12.0,70,1
1,8,350.0,165.0,3693,11.5,70,1
2,8,318.0,150.0,3436,11.0,70,1
3,8,304.0,150.0,3433,12.0,70,1
4,8,302.0,140.0,3449,10.5,70,1
...,...,...,...,...,...,...,...
393,4,140.0,86.0,2790,15.6,82,1
394,4,97.0,52.0,2130,24.6,82,3
395,4,135.0,84.0,2295,11.6,82,1
396,4,120.0,79.0,2625,18.6,82,1


In [515]:
y

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

In [516]:
# Train/test split: hold out 30% of the rows for testing
from sklearn.model_selection import train_test_split

In [517]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3, random_state=1)

In [518]:
X_train.shape, X_test.shape

((278, 7), (120, 7))

In [519]:
y_train.shape, y_test.shape

((278,), (120,))

In [520]:
# Multiple linear regression (ordinary least squares, no regularization)
from sklearn.linear_model import LinearRegression

regression_model=LinearRegression()

In [521]:
regression_model

In [522]:
regression_model.fit(X_train, y_train)

In [523]:
regression_model.coef_

array([-0.31761423,  0.02623748, -0.01827076, -0.00748775,  0.05040673,
        0.84709514,  1.51909584])

In [524]:
regression_model.intercept_

-23.085380742316662

In [525]:
X_train.columns

Index(['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
       'model_year', 'origin'],
      dtype='object')

In [526]:
X_train

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
350,4,105.0,63.0,2215,14.9,81,1
59,4,97.0,54.0,2254,23.5,72,3
120,4,121.0,112.0,2868,15.5,73,3
12,8,400.0,150.0,3761,9.5,70,1
349,4,91.0,68.0,1985,16.0,81,2
...,...,...,...,...,...,...,...
393,4,140.0,86.0,2790,15.6,82,1
255,4,140.0,88.0,2720,15.4,78,1
72,8,304.0,150.0,3892,12.5,72,1
235,4,97.0,75.0,2265,18.2,77,2


In [527]:
for i, col_name in enumerate(X_train): 
    print(i, col_name)

0 cylinders
1 displacement
2 horsepower
3 weight
4 acceleration
5 model_year
6 origin


In [528]:
for i, col_name in enumerate(X_train.columns): 
    print(i, col_name)

0 cylinders
1 displacement
2 horsepower
3 weight
4 acceleration
5 model_year
6 origin


In [529]:
regression_model.coef_

array([-0.31761423,  0.02623748, -0.01827076, -0.00748775,  0.05040673,
        0.84709514,  1.51909584])

In [530]:
for i, col_name in enumerate(X_train.columns): 
    print(f"the coefficient for {col_name} is {regression_model.coef_[i]}")

the coefficient for cylinders is -0.3176142302799369
the coefficient for displacement is 0.026237482599078946
the coefficient for horsepower is -0.018270764913124595
the coefficient for weight is -0.007487750398361897
the coefficient for acceleration is 0.0504067346197138
the coefficient for model_year is 0.8470951427061365
the coefficient for origin is 1.5190958387975024


In [531]:
# Most coefficients are small in magnitude, but the features are on very different scales
# (weight is in the thousands, cylinders is a single digit), so raw coefficient size says little about importance.
# Small coefficients do NOT mean that all features are equally important; standardize the features before comparing them.
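Raw coefficient size depends on the units of each feature, so it is worth checking how the picture changes after standardization. The following is a minimal sketch on synthetic data (the variables `heavy`, `small`, `X_toy`, and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
heavy = rng.normal(3000.0, 500.0, size=n)  # large-scale feature (weight-like)
small = rng.normal(0.0, 1.0, size=n)       # unit-scale feature
y_toy = -0.005 * heavy + 1.0 * small + rng.normal(0.0, 0.1, size=n)

X_toy = np.column_stack([heavy, small])

# Fit once on the raw features and once on standardized features.
raw_model = LinearRegression().fit(X_toy, y_toy)
scaled_model = LinearRegression().fit(StandardScaler().fit_transform(X_toy), y_toy)

print("raw coefficients:         ", raw_model.coef_)
print("standardized coefficients:", scaled_model.coef_)
```

On the raw scale the weight-like feature has the smaller coefficient, yet after standardization it has the larger one: it moves the prediction more per standard deviation.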

In [532]:
from sklearn.metrics import r2_score

In [533]:
y_pred_linear=regression_model.predict(X_test)

In [534]:
r2_linear=r2_score(y_test, y_pred_linear)

In [535]:
print(f"r2 of linear regression {r2_linear}")

r2 of linear regression 0.8348001123742285
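For reference, `r2_score` computes 1 - SS_res / SS_tot. A minimal sketch on a few made-up values (`y_true` and `y_hat` are hypothetical) reproduces it by hand:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

An R2 of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the target.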


## Ridge Regression

In [536]:
from sklearn.linear_model import Ridge

In [537]:
ridge_reg_model=Ridge(alpha=0.1)

In [538]:
ridge_reg_model.fit(X_train, y_train)

In [539]:
for i, col_name in enumerate(X_train.columns): 
    print(f"the coefficient for {col_name} is {ridge_reg_model.coef_[i]}")

the coefficient for cylinders is -0.31700321010070404
the coefficient for displacement is 0.026213249757984638
the coefficient for horsepower is -0.018263252481448694
the coefficient for weight is -0.0074873260502131955
the coefficient for acceleration is 0.05036896947443444
the coefficient for model_year is 0.8470062938903199
the coefficient for origin is 1.5174528285654085


In [540]:
# for ridge regression evaluation: 
y_pred_ridge=ridge_reg_model.predict(X_test)
r2_ridge=r2_score(y_test, y_pred_ridge)
print(f"R2 for ridge regression: {r2_ridge}")

R2 for ridge regression: 0.8348084889168355


In [541]:
# With alpha=0.1 the ridge coefficients and R2 are almost identical to plain linear regression: the penalty is too weak to matter here.
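The effect of the ridge penalty becomes visible as alpha grows. Below is a minimal sketch on synthetic data (the data and the alpha values are arbitrary choices for illustration, not tuned for the mpg dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=100)

# Fit the same model with increasingly strong penalties and record the coefficient norm.
norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>7}: coef={np.round(model.coef_, 3)}, norm={norms[alpha]:.3f}")
```

Ridge shrinks all coefficients toward zero as alpha increases, but it does not set them exactly to zero.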

## Lasso Regression

In [542]:
from sklearn.linear_model import Lasso

In [543]:
lasso_reg_model=Lasso(alpha=0.5)

In [544]:
lasso_reg_model.fit(X_train, y_train)

In [545]:
for i, col_name in enumerate(X_train.columns): 
    print(f"the coefficient for {col_name} is {lasso_reg_model.coef_[i]}")

the coefficient for cylinders is -0.0
the coefficient for displacement is 0.006208198888300358
the coefficient for horsepower is -0.011058382987169565
the coefficient for weight is -0.0069826731680230885
the coefficient for acceleration is 0.0
the coefficient for model_year is 0.744654952003819
the coefficient for origin is 0.0


In [546]:
# Three coefficients (cylinders, acceleration, origin) are exactly 0: Lasso performs feature selection.
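The same zeroing behaviour can be reproduced on synthetic data where the irrelevant features are known in advance. This is a sketch under that assumption (the data below is made up, and alpha=0.5 simply mirrors the value used above):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 5))
# Only the first two features drive the target; the other three are pure noise.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0.0, 0.5, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Ordinary least squares assigns small nonzero weights to the noise features, whereas the L1 penalty sets them exactly to zero and shrinks the two real coefficients.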

In [547]:
y_pred_lasso =lasso_reg_model.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"R-squared score for Lasso Regression: {r2_lasso}")

R-squared score for Lasso Regression: 0.8277934716635554


## Elastic Net Regression

In [548]:
from sklearn.linear_model import ElasticNet

In [549]:
Elastic_net_model=ElasticNet(alpha=1, l1_ratio=0.5)

# The l1_ratio parameter ranges from 0 to 1 and sets the mix of the two penalties:
# l1_ratio = 1: pure L1 regularization (Lasso).
# l1_ratio = 0: pure L2 regularization (Ridge).
# l1_ratio = 0.5: an equal mix of L1 and L2 regularization.
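To make the role of `l1_ratio` concrete, the penalty term in scikit-learn's documented ElasticNet objective is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper below (`elastic_net_penalty` is a hypothetical name written for this illustration, not a scikit-learn function) evaluates it for a fixed coefficient vector:

```python
import numpy as np

def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Penalty term of the ElasticNet objective for a given coefficient vector."""
    coef = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(coef))       # L1 norm
    l2_squared = np.sum(coef ** 2)  # squared L2 norm
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

w = [1.0, -2.0, 0.0]
print(elastic_net_penalty(w, l1_ratio=1.0))  # L1 part only: 3.0
print(elastic_net_penalty(w, l1_ratio=0.0))  # L2 part only: 2.5
print(elastic_net_penalty(w, l1_ratio=0.5))  # half of each: 2.75
```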

In [550]:
Elastic_net_model.fit(X_train, y_train)

In [551]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficient for {col_name} is {Elastic_net_model.coef_[i]}")

The coefficient for cylinders is -0.0
The coefficient for displacement is 0.005888869953667563
The coefficient for horsepower is -0.012403874933570126
The coefficient for weight is -0.006934550516257631
The coefficient for acceleration is 0.0
The coefficient for model_year is 0.7133150744603874
The coefficient for origin is 0.0


In [552]:
y_pred_elastic_net=Elastic_net_model.predict(X_test)

In [553]:
r2_elastic_net=r2_score(y_test, y_pred_elastic_net)

In [554]:
print(f"R-squared score for Elastic Net Regression: {r2_elastic_net}")

R-squared score for Elastic Net Regression: 0.8284840073256804


### Lasso Cross Validation

In [555]:
from sklearn.linear_model import LassoCV
lasso_cv=LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)

In [556]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficient for {col_name} is {lasso_cv.coef_[i]}")

The coefficient for cylinders is -0.0
The coefficient for displacement is -0.0
The coefficient for horsepower is -0.013327222357463604
The coefficient for weight is -0.0067165624427067425
The coefficient for acceleration is 0.0
The coefficient for model_year is 0.32784617728496573
The coefficient for origin is 0.0


In [557]:
y_pred_lasso_cv=lasso_cv.predict(X_test)

In [558]:
r2_lasso_cv=r2_score(y_test, y_pred_lasso_cv)

In [559]:
r2_lasso_cv

0.808280598384475

In [560]:
lasso_cv.get_params()

{'alphas': None,
 'copy_X': True,
 'cv': 5,
 'eps': 0.001,
 'fit_intercept': True,
 'max_iter': 1000,
 'n_alphas': 100,
 'n_jobs': None,
 'positive': False,
 'precompute': 'auto',
 'random_state': None,
 'selection': 'cyclic',
 'tol': 0.0001,
 'verbose': False}
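`get_params()` lists only the constructor settings. The penalty that cross-validation actually selected is stored on the fitted estimator as `alpha_`. A minimal sketch on synthetic data (the `toy_` names are made up here so the notebook's own variables are not overwritten):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(150, 4))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0.0, 0.5, size=150)

toy_lasso_cv = LassoCV(cv=5).fit(X_toy, y_toy)

# alpha_ is the penalty chosen by cross-validation; alphas_ is the grid that was searched.
print("selected alpha:", toy_lasso_cv.alpha_)
print("searched range:", toy_lasso_cv.alphas_.min(), "to", toy_lasso_cv.alphas_.max())
```

The same attribute is available on the `lasso_cv` model fitted above, which helps explain why its coefficients differ from the fixed `alpha=0.5` model.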

### Ridge Cross Validation

In [561]:
from sklearn.linear_model import RidgeCV
ridge_cv=RidgeCV(cv=5)
ridge_cv.fit(X_train, y_train)

In [562]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficient for {col_name} is {ridge_cv.coef_[i]}")

The coefficient for cylinders is -0.26552702695754654
The coefficient for displacement is 0.024091023557804115
The coefficient for horsepower is -0.01759935748391101
The coefficient for weight is -0.007448836506766948
The coefficient for acceleration is 0.046979944761985144
The coefficient for model_year is 0.8388152960503655
The coefficient for origin is 1.3705907328959879


In [563]:
y_pred_ridge_cv=ridge_cv.predict(X_test)

In [564]:
r2_ridge_cv=r2_score(y_test, y_pred_ridge_cv)

In [565]:
r2_ridge_cv

0.835414524750205

In [566]:
ridge_cv.get_params()

{'alpha_per_target': False,
 'alphas': (0.1, 1.0, 10.0),
 'cv': 5,
 'fit_intercept': True,
 'gcv_mode': None,
 'scoring': None,
 'store_cv_values': False}
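The parameters above show that `RidgeCV` searches only three alphas by default, (0.1, 1.0, 10.0). A wider grid can be passed through `alphas`. The sketch below uses synthetic data, and the log-spaced grid is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([1.5, -0.5, 0.0]) + rng.normal(0.0, 0.5, size=120)

# 13 candidate penalties, log-spaced between 0.001 and 1000.
alpha_grid = np.logspace(-3, 3, 13)
toy_ridge_cv = RidgeCV(alphas=alpha_grid, cv=5).fit(X_toy, y_toy)

print("selected alpha:", toy_ridge_cv.alpha_)
print("coefficients:  ", np.round(toy_ridge_cv.coef_, 3))
```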

### Elastic Net Cross Validation

In [567]:
from sklearn.linear_model import ElasticNetCV

In [568]:
elastic_cv=ElasticNetCV(cv=5)

In [569]:
elastic_cv.fit(X_train, y_train)

In [570]:
for i, col_name in enumerate(X_train.columns):
    print(f"The coefficient for {col_name} is {elastic_cv.coef_[i]}")

The coefficient for cylinders is -0.0
The coefficient for displacement is -0.0010567633437045254
The coefficient for horsepower is -0.017573518450910947
The coefficient for weight is -0.006567865294895822
The coefficient for acceleration is 0.0
The coefficient for model_year is 0.22190807898841178
The coefficient for origin is 0.0


In [571]:
y_pred_elastic_cv=elastic_cv.predict(X_test)

In [572]:
r2_elastic_net_cv=r2_score(y_test, y_pred_elastic_cv)

In [573]:
r2_elastic_net_cv

0.792863401804916

In [574]:
elastic_cv.get_params()

{'alphas': None,
 'copy_X': True,
 'cv': 5,
 'eps': 0.001,
 'fit_intercept': True,
 'l1_ratio': 0.5,
 'max_iter': 1000,
 'n_alphas': 100,
 'n_jobs': None,
 'positive': False,
 'precompute': 'auto',
 'random_state': None,
 'selection': 'cyclic',
 'tol': 0.0001,
 'verbose': 0}
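By default `ElasticNetCV` tunes only alpha and keeps `l1_ratio` fixed at 0.5, as the parameters above show. Passing a list of ratios lets cross-validation choose the mix as well. A minimal sketch on synthetic data (the candidate ratios are an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(5)
X_toy = rng.normal(size=(150, 4))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(0.0, 0.5, size=150)

# Cross-validate over both the penalty strength and the L1/L2 mix.
ratios = [0.1, 0.5, 0.9, 1.0]
toy_enet_cv = ElasticNetCV(l1_ratio=ratios, cv=5).fit(X_toy, y_toy)

print("selected l1_ratio:", toy_enet_cv.l1_ratio_)
print("selected alpha:   ", toy_enet_cv.alpha_)
```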

In [None]:
#keep learning .. keep exploring>> 