### **Table of Contents**
1. [Data Preparation](#section-a)
2. [Data Visualization](#section-b)
3. [Building Regression Model](#section-c)

<a id="section-a"></a>
#### **Section A: Data Preparation**

* Importing the necessary libraries
* Reading the file and storing it in a master dataframe
* Checking for occurrences of missing data and duplicated data


In [343]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [307]:
df = pd.read_csv('../input/2022-fuel-consumption-ratings/MY2022 Fuel Consumption Ratings.csv')
df.head()

In [308]:
print(f"Total no. of null values in dataset : {df.isnull().sum().sum()}")

In [338]:
print(f'No. of duplicated rows = {df.duplicated().sum()}')

In [341]:
print(f"Shape of dataset: {df.shape}")
print(" - ")
print(f"No. of features: {len(df.columns)}")
print(" - ")
print(f"List of features : {list(df.columns)}")
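The checks above report a single total for missing values and for duplicates. As a minimal sketch of how those checks could be broken down per column and acted on, the cell below uses a small hypothetical frame (`toy` is illustrative only and is not the fuel consumption dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the fuel consumption dataset
toy = pd.DataFrame({
    "Make": ["Acura", "Acura", "BMW", "BMW"],
    "Engine Size(L)": [2.4, 2.4, np.nan, 3.0],
})

# Missing values per column rather than one grand total
null_counts = toy.isnull().sum()
print(null_counts)

# duplicated() flags every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicated rows: {n_duplicates}")

# One possible cleanup: drop repeated rows, then rows with any missing value
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```

Whether dropping rows is appropriate depends on the data; if the real dataset reports zero nulls and zero duplicates, no cleanup step is needed.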

<a id="section-b"></a>
#### **Section B: Data Visualization**

Examining each variable by plotting its distribution.

In [310]:
df['Model Year'].value_counts().plot(kind='bar',rot=0,color='orange',title='Model Year')

In [311]:
df['Make'].value_counts().sort_values(ascending=False).iloc[0:10].plot(kind='bar',figsize=(20,5),rot=0,title = 'Top 10 Makes')
plt.savefig('make.jpg')

In [312]:
df['Model'].value_counts().sort_values(ascending=False).iloc[0:10].plot(kind='bar',rot=0,figsize=(20,5),color='pink',title='Top 10 Models')

In [313]:

df['Vehicle Class'].value_counts().sort_values(ascending=False).iloc[0:10].plot(kind='bar',rot=0,figsize=(20,5),color='green',title='Top 10 Vehicle Classes')
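A note on selecting the top N categories: `value_counts()` already returns counts sorted from most to least frequent, and the end of a positional slice is exclusive. The sketch below illustrates both points on a hypothetical series (`makes` is made up for the example).

```python
import pandas as pd

# Hypothetical toy column with 12 distinct makes; "Make 0" appears most often
makes = pd.Series(
    [f"Make {i}" for i in range(12) for _ in range(12 - i)],
    name="Make",
)

counts = makes.value_counts()   # already sorted, most frequent first
top_nine = counts.iloc[0:9]     # slice end is exclusive: 9 rows
top_ten = counts.head(10)       # 10 rows

print(len(top_nine), len(top_ten))
print(top_ten.index[0], top_ten.iloc[0])
```

`head(10)` states the intended number of bars directly, which makes an off-by-one between the slice and the plot title harder to introduce.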

In [314]:
df['Engine Size(L)'].value_counts().sort_values(ascending=False).plot(kind='bar',rot=0,figsize=(20,5),color='purple',title='Composition of cars by Engine Size')
#df['Engine Size(L)'].hist()

In [315]:
df['Cylinders'].value_counts().sort_values(ascending=False).plot(kind='bar',rot=0,figsize=(20,5),color='grey',title='Composition of cars by #Cylinders')
#df['Cylinders'].hist()

In [316]:
df['Transmission'].value_counts().sort_values(ascending=False).plot(kind='bar',rot=0,figsize=(20,5),color='black',title='Composition of cars by Transmission Types')


In [317]:
temp = df['Fuel Type']
temp = temp.replace({'Z':'Z (Premium Gasoline)', 'X':'X (Regular Gasoline)', 'D':'Diesel','E':'E85'})
temp.value_counts().sort_values(ascending=False).plot(kind='bar',rot=0,figsize=(20,5),color='brown',title='Composition of cars by Fuel Types')

In [318]:
ax1 =plt.hist(df['Fuel Consumption (City (L/100 km)'])
plt.title('Distribution by Fuel Economy (City) (L/100 km)')

In [319]:
print('Average Fuel Economy (City) = ', round(df['Fuel Consumption (City (L/100 km)'].mean(),2))

In [320]:
ax =plt.hist(df['Fuel Consumption(Hwy (L/100 km))'])
plt.title('Distribution by Fuel Economy (Highway) (L/100 km)')

In [321]:
ax =plt.hist(df['CO2 Emissions(g/km)'])
plt.title('Distribution by CO2 Emissions (g/km)')

In [322]:
ax =plt.hist(df['CO2 Rating'])
plt.title('Distribution by CO2 Rating')


In [323]:
# Sorting in descending order puts the makes that consume the MOST fuel first
highest_5_make = df[['Make','Fuel Consumption (City (L/100 km)']].groupby(by ='Make').mean().sort_values(by='Fuel Consumption (City (L/100 km)',ascending=False).iloc[0:5]
print("Top 5 makes with the highest average city fuel consumption (least fuel-efficient)")
highest_5_make.rename(columns ={'Fuel Consumption (City (L/100 km)' : 'Average Fuel Consumption, City (L/100 km)'})

In [324]:
# Sorting in ascending order puts the makes that consume the LEAST fuel first
lowest_5_make = df[['Make','Fuel Consumption (City (L/100 km)']].groupby(by ='Make').mean().sort_values(by='Fuel Consumption (City (L/100 km)',ascending=True).iloc[0:5]
print("Top 5 makes with the lowest average city fuel consumption (most fuel-efficient)")
lowest_5_make.rename(columns ={'Fuel Consumption (City (L/100 km)' : 'Average Fuel Consumption, City (L/100 km)'})

In [325]:
df[['Vehicle Class','Fuel Consumption (City (L/100 km)']].groupby(by ='Vehicle Class').mean().sort_values(by='Fuel Consumption (City (L/100 km)',ascending=False)

In [326]:
df[['Fuel Type','Fuel Consumption (City (L/100 km)']].groupby(by ='Fuel Type').mean().sort_values(by='Fuel Consumption (City (L/100 km)',ascending=False)
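Because L/100 km measures consumption, a higher group mean indicates a less efficient group. The grouped averages above can also be ranked with `nlargest` and `nsmallest`; the sketch below uses hypothetical values (`toy` and its column name are illustrative, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy frame, NOT the fuel consumption dataset
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "B", "C"],
    "City (L/100 km)": [8.0, 10.0, 12.0, 14.0, 6.0],
})

# Mean consumption per make
avg_by_make = toy.groupby("Make")["City (L/100 km)"].mean()

# Highest consumption = least efficient; lowest consumption = most efficient
thirstiest = avg_by_make.nlargest(2)
most_efficient = avg_by_make.nsmallest(2)

print(thirstiest)
print(most_efficient)
```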

<a id="section-c"></a>
#### **Section C: Building the Linear Regression Model**
* Encoding the categorical features
* Creating the feature and target dataframes
* Splitting the data into training and test data sets
* Building the Linear Regression Model
* Predicting on the test set and calculating the mean squared error and R2 score
* Finding the intercept and coefficient values of the linear regression model

Encoding the 'Transmission' and 'Fuel Type' columns before using them to build the regression model.

In [342]:
encoder = OrdinalEncoder()
# Fit on both categorical columns and overwrite them with their numeric codes
df[['Transmission', 'Fuel Type']] = encoder.fit_transform(df[['Transmission', 'Fuel Type']])
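To make the encoding step concrete, here is a minimal sketch of what `OrdinalEncoder` produces on a hypothetical single-column frame (`toy` is illustrative only). With default settings the categories are sorted, and each label is replaced by its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy values, NOT the real dataset
toy = pd.DataFrame({"Fuel Type": ["X", "Z", "D", "X", "E"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[["Fuel Type"]])

# categories_ holds the sorted labels; a label's position is its code
print(encoder.categories_[0])
print(codes.ravel())

# inverse_transform maps the codes back to the original labels
restored = encoder.inverse_transform(codes)
print(restored.ravel())
```

Note that the resulting codes are arbitrary with respect to fuel consumption: a linear model will treat them as ordered numeric values, so the fitted coefficient for an ordinal-encoded column should be interpreted with care.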

In [336]:
X = df[['Engine Size(L)', 'Transmission', 'CO2 Emissions(g/km)', 'Smog Rating']]
y = df['Fuel Consumption(Comb (L/100 km))']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)


In [337]:
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))
print("--------------------")
intercept = round(float(reg.intercept_), 2)
# Round each fitted coefficient; the order matches the columns of X
coeffs = [round(float(c), 2) for c in reg.coef_]
print(f"Optimum Parameters: {coeffs}")
print(f"Intercept = {intercept}")
print(f"Linear Regression Model:\n h(x) = {intercept} + ({coeffs[0]})*Engine Size(L) + ({coeffs[1]})*Transmission + ({coeffs[2]})*CO2 Emissions(g/km) + ({coeffs[3]})*Smog Rating")
print("--------------------")
print('Weights Against the variables: ')
for coeff, column in zip(coeffs, X.columns):
    print(f"{coeff} : {column}")
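As a sanity check on what the two reported metrics mean, the sketch below computes the mean squared error and R2 score by hand on hypothetical values (`y_true` and `y_hat` are made up for the example) and compares them with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, NOT model output
y_true = np.array([8.0, 10.0, 12.0, 14.0])
y_hat = np.array([8.5, 9.5, 12.5, 13.5])

residuals = y_true - y_hat

# MSE: average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: manual={mse_manual:.2f}, sklearn={mean_squared_error(y_true, y_hat):.2f}")
print(f"R2:  manual={r2_manual:.2f}, sklearn={r2_score(y_true, y_hat):.2f}")
```

An R2 close to 1 means the residual sum of squares is small relative to the total variation in the target.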


In [333]:
sns.regplot(x=y_test, y=y_pred, ci=None, color='blue')
plt.xlabel('Actual')
plt.ylabel('Predicted')