# PROCESS

1. Define the goal of the project
2. Data Collection
3. EDA - Exploratory Data Analysis
4. Feature Engineering
5. Splitting the dataset into training and testing datasets
6. Model Training
7. Model Testing
8. Model Performance Evaluation
9. Tuning for improvement

## 1. Goal of the Project

Train a machine learning model to predict carbon dioxide (CO2) emissions from motor vehicles.

## 2. Data Collection

Data Source: Kaggle

Path: https://www.kaggle.com/datasets/ahmettyilmazz/fuel-consumption

File Type: CSV

## 3. Exploratory Data Analysis

3.1 Importing required Python libraries  
3.2 Loading the data in the IDE  
3.3 High level summary of the dataset  
3.4 Dealing with dirty data (Missing Values, Wrong Format, Outliers)

#### 3.1 Importing required Python libraries

In [1]:
import numpy as np
import pandas as pd
import sklearn as sl

#### 3.2 Loading the data in the IDE

In [2]:
# csv file
data = pd.read_csv('C:/Users/HIAMNSHU/Downloads/FuelConsumption.csv')

#### 3.3 High level summary of the dataset

In [3]:
data.shape

(1067, 13)

Rows: 1067  
Columns: 13

In [4]:
data.head()

| | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   MODELYEAR                 1067 non-null   int64  
 1   MAKE                      1067 non-null   object 
 2   MODEL                     1067 non-null   object 
 3   VEHICLECLASS              1067 non-null   object 
 4   ENGINESIZE                1067 non-null   float64
 5   CYLINDERS                 1067 non-null   int64  
 6   TRANSMISSION              1067 non-null   object 
 7   FUELTYPE                  1067 non-null   object 
 8   FUELCONSUMPTION_CITY      1067 non-null   float64
 9   FUELCONSUMPTION_HWY       1067 non-null   float64
 10  FUELCONSUMPTION_COMB      1067 non-null   float64
 11  FUELCONSUMPTION_COMB_MPG  1067 non-null   int64  
 12  CO2EMISSIONS              1067 non-null   int64  
dtypes: float64(4), int64(4), object(5)
memory usage: 108.5+ KB


Observation: No null values in the dataset.

In [6]:
data.duplicated().sum()

0

#### 3.4 Dealing with dirty data (Missing Values, Wrong Format, Outliers)

1. Based on the observations so far, the dataset has no missing values and no duplicate rows.
2. To check for wrong formats and outliers, we only need to check the columns required for the goal rather than every column.

## 4. Feature Engineering

In [7]:
# available features
data.columns

Index(['MODELYEAR', 'MAKE', 'MODEL', 'VEHICLECLASS', 'ENGINESIZE', 'CYLINDERS',
       'TRANSMISSION', 'FUELTYPE', 'FUELCONSUMPTION_CITY',
       'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB',
       'FUELCONSUMPTION_COMB_MPG', 'CO2EMISSIONS'],
      dtype='object')

#### The following 4 features are expected to be the most relevant to CO2 emissions from motor vehicles.
1. ENGINESIZE
2. CYLINDERS
3. FUELTYPE
4. FUELCONSUMPTION_COMB

#### Correlation Matrix

In [8]:
# excluding fueltype as it has string values
matrx=data[["ENGINESIZE","CYLINDERS","FUELCONSUMPTION_COMB","CO2EMISSIONS"]]
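FUELTYPE is left out here because `corr()` needs numeric columns. If it is needed later as a model input, one common option is one-hot encoding. The following is a minimal sketch, assuming a small made-up `sample` frame; the fuel type codes and values are illustrative and do not come from the full dataset.

```python
import pandas as pd

# Hypothetical sample rows; the FUELTYPE codes are illustrative only.
sample = pd.DataFrame({
    "FUELTYPE": ["Z", "X", "Z", "D"],
    "CO2EMISSIONS": [196, 221, 136, 255],
})

# Replace the string column with one 0/1 indicator column per category.
encoded = pd.get_dummies(sample, columns=["FUELTYPE"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each indicator column is numeric, so it can be passed to `corr()` or to a regression model.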

In [9]:
matrx

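| | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_COMB | CO2EMISSIONS |
|---|---|---|---|---|
| 0 | 2.0 | 4 | 8.5 | 196 |
| 1 | 2.4 | 4 | 9.6 | 221 |
| 2 | 1.5 | 4 | 5.9 | 136 |
| 3 | 3.5 | 6 | 11.1 | 255 |
| 4 | 3.5 | 6 | 10.6 | 244 |
| ... | ... | ... | ... | ... |
| 1062 | 3.0 | 6 | 11.8 | 271 |
| 1063 | 3.2 | 6 | 11.5 | 264 |
| 1064 | 3.0 | 6 | 11.8 | 271 |
| 1065 | 3.2 | 6 | 11.3 | 260 |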


In [10]:
corel = matrx.corr()
print(corel)

                      ENGINESIZE  CYLINDERS  FUELCONSUMPTION_COMB  \
ENGINESIZE              1.000000   0.934011              0.819482   
CYLINDERS               0.934011   1.000000              0.776788   
FUELCONSUMPTION_COMB    0.819482   0.776788              1.000000   
CO2EMISSIONS            0.874154   0.849685              0.892129   

                      CO2EMISSIONS  
ENGINESIZE                0.874154  
CYLINDERS                 0.849685  
FUELCONSUMPTION_COMB      0.892129  
CO2EMISSIONS              1.000000  
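All three numeric features correlate strongly with CO2EMISSIONS (0.85 to 0.89), and ENGINESIZE and CYLINDERS also correlate strongly with each other (0.934), so they carry overlapping information. To make clear what `corr()` computes by default, here is a small sketch of the Pearson coefficient worked out by hand on the five rows from the `head()` output; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The five rows shown in the head() output above.
sample = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5],
    "CO2EMISSIONS": [196, 221, 136, 255, 244],
})

x = sample["ENGINESIZE"].to_numpy()
y = sample["CO2EMISSIONS"].to_numpy()

# Pearson r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pandas uses the same Pearson definition by default.
r_pandas = sample["ENGINESIZE"].corr(sample["CO2EMISSIONS"])

print(r_manual, r_pandas)
```

The two values should agree up to floating point error. On five rows the number itself says little; the coefficients for the full dataset are the ones in the matrix above.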


In [11]:
matrx.describe()

| | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_COMB | CO2EMISSIONS |
|---|---|---|---|---|
| count | 1067.0 | 1067.0 | 1067.0 | 1067.0 |
| mean | 3.346298 | 5.794752 | 11.580881 | 256.228679 |
| std | 1.415895 | 1.797447 | 3.485595 | 63.372304 |
| min | 1.0 | 3.0 | 4.7 | 108.0 |
| 25% | 2.0 | 4.0 | 9.0 | 207.0 |
| 50% | 3.4 | 6.0 | 10.9 | 251.0 |
| 75% | 4.3 | 8.0 | 13.35 | 294.0 |
| max | 8.4 | 12.0 | 25.8 | 488.0 |


In [12]:
# We just have to check these 4 features for wrong format or outliers
# usual range of engine size values (globally) is 0.6 to 8.0 Litre
# usual range of number of cylinders in a vehicle (globally) is 2 to 12
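The range checks described in these comments could be sketched as below. This is a hedged example: `sample` is a made-up frame standing in for `matrx`, and `valid_ranges` simply restates the limits from the comments, which are rules of thumb rather than hard constraints.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook this would be `matrx`.
sample = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 9.5],
    "CYLINDERS": [4, 4, 4, 6, 16],
})

# Assumed plausible ranges, taken from the comments above.
valid_ranges = {"ENGINESIZE": (0.6, 8.0), "CYLINDERS": (2, 12)}

# For each column, keep the rows whose value falls outside the range
# (between() is inclusive on both ends by default).
out_of_range = {
    col: sample[~sample[col].between(low, high)]
    for col, (low, high) in valid_ranges.items()
}

for col, rows in out_of_range.items():
    print(col, "rows outside range:", list(rows.index))
```

Note that the `describe()` output above reports a maximum ENGINESIZE of 8.4, slightly above the 8.0 limit, so a check like this would flag at least one row of the real data for review rather than automatic removal.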

# MODEL 01 (Engine-CO2)

## 5. Splitting the dataset into training & testing (validation) datasets

**sklearn.model_selection module**  
This Scikit-Learn module provides the `train_test_split` function, which randomly splits the dataset into training and test sets according to a given proportion.

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
x_train, x_test, y_train, y_test = train_test_split(matrx[['ENGINESIZE']], matrx[['CO2EMISSIONS']], test_size=0.25, random_state=0)

In [15]:
x_train.shape

(800, 1)

In [17]:
x_test.shape

(267, 1)
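The 800/267 split follows from `test_size=0.25` on 1067 rows: the test count is rounded up (267) and the remainder goes to training (800). A small sketch on synthetic data of the same length, assuming a made-up `demo` frame, shows this and the effect of fixing `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset.
n_rows = 1067
rng = np.random.default_rng(0)
demo = pd.DataFrame({"ENGINESIZE": rng.uniform(1.0, 8.4, n_rows)})
demo["CO2EMISSIONS"] = 40 * demo["ENGINESIZE"] + 120  # made-up relationship

x_tr, x_te, y_tr, y_te = train_test_split(
    demo[["ENGINESIZE"]], demo["CO2EMISSIONS"], test_size=0.25, random_state=0
)
print(x_tr.shape, x_te.shape)

# The same random_state reproduces the same rows in each subset.
x_tr2, x_te2, _, _ = train_test_split(
    demo[["ENGINESIZE"]], demo["CO2EMISSIONS"], test_size=0.25, random_state=0
)
print(x_te.index.equals(x_te2.index))
```

Fixing `random_state` makes the reported score reproducible between runs; changing it gives a different split and usually a slightly different score.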

## 6. Model Training

Since the target variable (CO2 emissions) is a continuous value, we use a regression model: linear regression.  
  
**sklearn.linear_model module**  
This Scikit-Learn module provides the `LinearRegression` class, which fits a straight line to the training data and learns the relationship between the feature and the target.

In [19]:
from sklearn.linear_model import LinearRegression

#### Creating an object of the LinearRegression class from the linear_model module

In [20]:
L1=LinearRegression()

#### Fitting the model on training dataset using 'fit' method

In [21]:
L1.fit(x_train, y_train)
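After `fit`, the learned line is available through the `coef_` (slope) and `intercept_` attributes. A minimal sketch, assuming made-up points that lie exactly on the line y = 40x + 120, shows that the fit recovers those two numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 40x + 120 (not taken from the dataset).
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 40 * x_demo.ravel() + 120

model = LinearRegression()
model.fit(x_demo, y_demo)

# Slope and intercept of the fitted line.
print(model.coef_[0], model.intercept_)
```

In the notebook, `y_train` is a one-column DataFrame rather than a 1-D array, so `L1.coef_` has shape (1, 1) and `L1.intercept_` has shape (1,).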

## 7. Model Testing

#### Testing the model on testing dataset using 'predict' method

In [22]:
y_pred = L1.predict(x_test)
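For a linear regression, `predict` evaluates the fitted line, that is, slope times feature plus intercept. The sketch below checks this on four rows from the `head()` output; the new engine sizes in `x_new` are arbitrary illustrative values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four (ENGINESIZE, CO2EMISSIONS) pairs from the head() output above.
x_demo = np.array([[1.5], [2.0], [2.4], [3.5]])
y_demo = np.array([136.0, 196.0, 221.0, 255.0])

model = LinearRegression().fit(x_demo, y_demo)

# Arbitrary new engine sizes to predict for.
x_new = np.array([[3.0], [5.0]])

pred_api = model.predict(x_new)
pred_manual = x_new[:, 0] * model.coef_[0] + model.intercept_

print(np.allclose(pred_api, pred_manual))
```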

## 8. Model Performance Evaluation

#### R2 Score

In [24]:
L1.score(x_test,y_test)

0.7202776851136601
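For a regressor, `score` returns the coefficient of determination R², so engine size alone explains about 72% of the variance in CO2 emissions on the test set. The sketch below shows how R² is computed, together with two error metrics in the units of the target; the `y_hat` values are made-up predictions used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Actual values (from the head() output) and made-up predictions.
y_true = np.array([196.0, 221.0, 136.0, 255.0, 244.0])
y_hat = np.array([200.0, 215.0, 150.0, 250.0, 240.0])

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = ((y_true - y_hat) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

# Error metrics expressed in the same units as CO2EMISSIONS.
mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(((y_true - y_hat) ** 2).mean())

print(r2_manual, r2_score(y_true, y_hat), mae, rmse)
```

In the notebook, the equivalent call would be `r2_score(y_test, y_pred)`, which should match `L1.score(x_test, y_test)`.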

# MODEL 02 (Number of Cylinders-CO2)

## Splitting the dataset into training & testing (validation) datasets

## Model Training

## Model Testing

## Model Performance Evaluation
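Models 02 and 03 repeat the split, train, test and evaluate steps from Model 01 with a different feature. One way to avoid repeating the cells is a small helper. This is only a sketch: `single_feature_r2` is a hypothetical function, and `demo` is synthetic data standing in for `matrx`, so the printed score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def single_feature_r2(frame, feature, target="CO2EMISSIONS"):
    """Split, fit and score a one-feature linear regression (hypothetical helper)."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        frame[[feature]], frame[target], test_size=0.25, random_state=0
    )
    model = LinearRegression().fit(x_tr, y_tr)
    return model.score(x_te, y_te)  # R2 on the held-out rows


# Synthetic stand-in for `matrx` with a made-up linear relationship plus noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"CYLINDERS": rng.integers(3, 13, 200)})
demo["CO2EMISSIONS"] = 30 * demo["CYLINDERS"] + 80 + rng.normal(0, 10, 200)

score = single_feature_r2(demo, "CYLINDERS")
print(score)
```

With the real data, the call would look like `single_feature_r2(matrx, "CYLINDERS")` or `single_feature_r2(matrx, "FUELCONSUMPTION_COMB")`, using the same split settings as Model 01 so the scores are comparable.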

# MODEL 03 (Fuel Consumption-CO2)

## Splitting the dataset into training & testing (validation) datasets

## Model Training

## Model Testing

## Model Performance Evaluation

# MODEL 04 (Fuel Type-CO2)