# Challenge 1 - Multiple Regression Analysis

# Before you start:

    Read the README.md file
    Comment as much as you can and use the resources (README.md file)
    Happy learning!

In this lab exercise we will work with the vehicles.csv dataset. 
You can find a copy of the dataset in the GitHub folder. 
The objective of this exercise is to predict CO2 emissions for vehicles. 
Please follow the steps and provide your code along with comments.

In [1]:
## Import the libraries for loading the data set 
import pandas as pd
## Load the dataset 
data = pd.read_csv('vehicles.csv')
data.head()

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550


In [2]:
## You can use the warnings library to ignore warnings that might show when you run the code
import warnings
warnings.filterwarnings('ignore')
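
Ignoring every warning can also hide messages that matter, such as deprecation notices. As a hedged alternative, the sketch below silences only one warning category with the standard-library warnings module. The choice of FutureWarning and the two example messages are illustrative assumptions, not something this lab requires.

```python
import warnings

# Record warnings inside this block so we can inspect which ones get through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only FutureWarning; every other category is still reported.
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.warn("this behaviour will change", FutureWarning)  # hypothetical message
    warnings.warn("something else happened", UserWarning)       # hypothetical message

categories = [w.category.__name__ for w in caught]
print(categories)
```

Only the UserWarning is recorded, so a narrower filter keeps unrelated warnings visible.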

### In this exercise, we are not going to use all the variables in the dataset to make the predictions. Here is the list of the numerical and categorical variables that we will use:
    
    Numerical : Engine Displacement, Fuel Barrels/Year, Combined MPG
    Categorical : Cylinders, Fuel Type, Drivetrain

Together, the numerical and categorical variables make up the predictor variables. The target variable is the CO2 emissions of the vehicles (the "CO2 Emission Grams/Mile" column).

### Data Pre-processing (Handling Numerical variables)

#### First we will perform the data pre-processing operations on the specified numerical variables

In [3]:
## Store the specified numerical columns data as a separate dataframe. Give it the name "numerics"

numeric_cols = ['Engine Displacement', 'Fuel Barrels/Year', 'Combined MPG']
numerics = data[numeric_cols]
numerics.head()

Unnamed: 0,Engine Displacement,Fuel Barrels/Year,Combined MPG
0,2.5,19.388824,17
1,4.2,25.354615,13
2,2.5,20.600625,16
3,4.2,25.354615,13
4,3.8,20.600625,16


### MinMax scaler

Hint: Since we are using "numerics" to store the numerical variables, we can pass "numerics" directly
to the scaler, as in MinMaxScaler().fit(numerics)
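
As a rough illustration of what min-max scaling does to a single column, the sketch below applies (x - min) / (max - min) in plain Python. The helper name min_max_scale and the toy values are assumptions made for this example; they are not part of scikit-learn or of vehicles.csv.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy column, not taken from vehicles.csv
engine_sizes = [1.0, 2.5, 4.0, 5.5]
scaled = min_max_scale(engine_sizes)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, which is why every scaled column ends up between 0 and 1.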

In [4]:
# Import the required library
# Perform the scaling and store the results inside "numerical"

from sklearn.preprocessing import MinMaxScaler

# "numerics" already holds only the selected columns, so it can be passed directly
transformer = MinMaxScaler().fit(numerics)
numerical = transformer.transform(numerics)

In [5]:
# Convert "numerical" into a dataframe so that it can be used later with the dataframe of categorical variables
numerical = pd.DataFrame(numerical, columns=numeric_cols)
numerical.head()

Unnamed: 0,Engine Displacement,Fuel Barrels/Year,Combined MPG
0,0.24359,0.411014,0.204082
1,0.461538,0.537873,0.122449
2,0.24359,0.436782,0.183673
3,0.461538,0.537873,0.122449
4,0.410256,0.436782,0.183673


### Data Pre-processing (Handling Categorical variables)

In this step we will perform the data pre-processing operations on the specified categorical variables

In [6]:
# Similar to numerical variables, store the specified categorical columns data as a dataframe. 
# Give it the name "cats"
cat_cols = ['Cylinders', 'Fuel Type', 'Drivetrain']
cats = data[cat_cols]

In [7]:
# Check if "cats" is actually a dataframe using cats.head(3)
cats.head(3)

Unnamed: 0,Cylinders,Fuel Type,Drivetrain
0,4.0,Regular,2-Wheel Drive
1,6.0,Regular,2-Wheel Drive
2,4.0,Regular,Rear-Wheel Drive


### Using One Hot Encoding 

In [8]:
# Perform One hot encoding and store the results (one hot encoded dataframe) into "categorical"

categorical = pd.get_dummies(cats, columns=cat_cols)

In [9]:
# Check what the new one-hot encoded data looks like using the head() function

categorical.head()

Unnamed: 0,Cylinders_2.0,Cylinders_3.0,Cylinders_4.0,Cylinders_5.0,Cylinders_6.0,Cylinders_8.0,Cylinders_10.0,Cylinders_12.0,Cylinders_16.0,Fuel Type_CNG,...,Fuel Type_Regular Gas and Electricity,Fuel Type_Regular Gas or Electricity,Drivetrain_2-Wheel Drive,"Drivetrain_2-Wheel Drive, Front",Drivetrain_4-Wheel Drive,Drivetrain_4-Wheel or All-Wheel Drive,Drivetrain_All-Wheel Drive,Drivetrain_Front-Wheel Drive,Drivetrain_Part-time 4-Wheel Drive,Drivetrain_Rear-Wheel Drive
0,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
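
To make the encoding concrete, here is a minimal plain-Python sketch of one-hot encoding a single column. The helper one_hot and the toy values are hypothetical and only mirror the "column_category" naming seen in the output above; this is not how pandas implements get_dummies.

```python
def one_hot(values, prefix):
    """Return one dict per row with a 0/1 indicator for every distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

# Hypothetical toy column, not taken from vehicles.csv
fuel = ["Regular", "Premium", "Regular", "Diesel"]
rows = one_hot(fuel, "Fuel Type")
for row in rows:
    print(row)
```

Each row has exactly one indicator set to 1, namely the one for that row's category.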


Now we have pre-processed our specified numerical and categorical data. 
In this next step we will combine the two dataframes (numerical and categorical)

You can use the following code to combine / concatenate the two dataframes

In [10]:
X = pd.concat([numerical,categorical],axis=1)

In [11]:
# Store the target variable (the "CO2 Emission Grams/Mile" column) as 'Y'; selecting one column returns a pandas Series

Y = data['CO2 Emission Grams/Mile']

In [12]:
# Import the libraries required for regression model 
# Fit the linear regression model on the data

from sklearn import linear_model

lin_model = linear_model.LinearRegression()
model = lin_model.fit(X,Y)
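
For intuition about what fitting by least squares means, the sketch below solves the one-predictor case in closed form. This is a deliberate simplification: the model in this lab uses many predictors, and the helper fit_line and the toy data are assumptions made for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical toy data lying exactly on y = 2 + 3x
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```

With several predictors the same idea applies, but the coefficients are found jointly rather than one at a time.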

In [13]:
# Make predictions on the dataset (the same data the model was fitted on), store the results in "predictions"

predictions = lin_model.predict(X)

#### Print the measures of accuracy of the model - MSE, RMSE, and R2 score
    Hint: Use from sklearn.metrics import mean_squared_error, r2_score
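
As a reference for what these three measures compute, here is a small plain-Python sketch using their standard definitions. The function name regression_metrics and the toy values are hypothetical; in the lab itself the scikit-learn functions from the hint are used.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares and total sum of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2

# Hypothetical toy values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mse, rmse, r2)
```

MSE is in squared units of the target, RMSE is back in the target's own units, and R2 is unitless.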

In [14]:
from sklearn.metrics import mean_squared_error, r2_score

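# MSE, computed on the same data the model was fitted on
print(mean_squared_error(Y, predictions))
# R2 score (displayed because it is the last expression in the cell)
r2_score(Y, predictions)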

54.694284168183394


0.9961415166969217
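
The output above shows the MSE and the R2 score but not the RMSE. Since RMSE is the square root of MSE, it can be derived from the printed value. The short sketch below does that with the standard library, assuming the MSE shown above.

```python
import math

mse = 54.694284168183394  # MSE value printed in the output above
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f} grams/mile")  # prints "RMSE: 7.40 grams/mile"
```

Because the target is measured in grams per mile, the RMSE is in grams per mile too. Both figures are computed on the data used for fitting, so they describe in-sample fit.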