# External Lab 

Each question carries 1 mark.

# Multiple Linear Regression

## Problem Statement

Use multiple linear regression to **predict the consumption of petrol** from four variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('petrol.csv')
df.index = df.index + 1 
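Later cells select columns as `' dl'` and `' consumption'`, which suggests that the header row of the file carries a space after each comma. A minimal sketch, assuming that header layout (the inline sample text is a hypothetical stand-in for `petrol.csv`, and `demo` is an illustrative name), of normalizing the column names once after loading:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for petrol.csv; the header spacing is an assumption.
sample_csv = (
    "tax, income, highway, dl, consumption\n"
    "9.0,3571,1976,0.525,541\n"
    "9.0,4092,1250,0.572,524\n"
)

demo = pd.read_csv(io.StringIO(sample_csv))
print(list(demo.columns))  # names still carry their leading spaces

# Strip surrounding whitespace from every column name.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

If the names were stripped this way, every later cell would have to refer to `'dl'` and `'consumption'` without the leading space.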

In [3]:
df.head(5)

Unnamed: 0,tax,income,highway,dl,consumption
1,9.0,3571,1976,0.525,541
2,9.0,4092,1250,0.572,524
3,9.0,3865,1586,0.58,561
4,7.5,4870,2351,0.529,414
5,8.0,4399,431,0.544,410


In [4]:
df.describe()

Unnamed: 0,tax,income,highway,dl,consumption
count,48.0,48.0,48.0,48.0,48.0
mean,7.668333,4241.833333,5565.416667,0.570333,576.770833
std,0.95077,573.623768,3491.507166,0.05547,111.885816
min,5.0,3063.0,431.0,0.451,344.0
25%,7.0,3739.0,3110.25,0.52975,509.5
50%,7.5,4298.0,4735.5,0.5645,568.5
75%,8.125,4578.75,7156.0,0.59525,632.75
max,10.0,5342.0,17782.0,0.724,968.0


# Question 2 - Cap outliers 

Find the outliers and remove them. Use (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the max cap. Keep only the data points that fall within this range. A data point outside the range is an outlier, and the entire row containing it must be removed.

In [5]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
min_cap = Q1 - 1.5*IQR
max_cap = Q3 + 1.5*IQR
# print (((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))))

In [6]:
# Remove the outlier value rows
df_new = df[~((df < min_cap) | (df > (max_cap))).any(axis=1)]
# Re-set the index after removing rows
df_new.index = np.arange(1, len(df_new) + 1)
df_new

Unnamed: 0,tax,income,highway,dl,consumption
1,9.0,3571,1976,0.525,541
2,9.0,4092,1250,0.572,524
3,9.0,3865,1586,0.58,561
4,7.5,4870,2351,0.529,414
5,8.0,4399,431,0.544,410
6,8.0,5319,11868,0.451,344
7,8.0,5126,2138,0.553,467
8,8.0,4447,8577,0.529,464
9,7.0,4512,8507,0.552,498
10,8.0,4391,5939,0.53,580
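The question title says "cap" while the cell above removes rows, so it may help to see both options side by side. The following is a minimal sketch on a hypothetical one-column frame (`toy` and its values are invented for illustration), not the notebook's data:

```python
import pandas as pd

# Hypothetical toy column with one obvious high outlier.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop every row that has a value outside the fences.
removed = toy[~((toy < low) | (toy > high)).any(axis=1)]

# Option 2: keep every row but clip values to the fences.
capped = toy.clip(lower=low, upper=high, axis=1)

print(len(removed))
print(capped["x"].tolist())
```

Removal shrinks the sample, while clipping keeps all rows but replaces extreme values with the fence values.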


# Question 3 - Independent variables and collinearity 
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [7]:
df_new.corr()

Unnamed: 0,tax,income,highway,dl,consumption
tax,1.0,-0.109537,-0.390602,-0.314702,-0.446116
income,-0.109537,1.0,0.051169,0.150689,-0.347326
highway,-0.390602,0.051169,1.0,-0.016193,0.034309
dl,-0.314702,0.150689,-0.016193,1.0,0.611788
consumption,-0.446116,-0.347326,0.034309,0.611788,1.0


### Observation
The correlation values show that `dl` (the proportion of the population with driver's licenses) has the strongest association with consumption (0.61). Tax has a moderate negative association (-0.45), income a weaker negative one (-0.35), and highway is close to zero (0.03).

Insights:
As tax increases, consumption tends to decrease.
As the proportion of licensed drivers increases, consumption tends to increase.
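The ranking of features by strength of association can be sketched as follows. The values are copied from the `consumption` column of the correlation table above; the names `corr_with_target` and `ranked` are illustrative.

```python
import pandas as pd

# Correlations with consumption, copied from the table above.
corr_with_target = pd.Series(
    {"tax": -0.446116, "income": -0.347326, "highway": 0.034309, "dl": 0.611788}
)

# Order by absolute value so strong negative associations rank alongside positive ones.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting on the absolute value matters here because the second strongest association (tax) is negative.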

# Question 4 - Transform the dataset 
Divide the data into feature(X) and target(Y) sets.

In [17]:
#X = df_new.drop(' consumption', axis='columns')
X = df_new[['tax',' dl']]
Y = df_new[[' consumption']]

# Question 5 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [18]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
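The question also asks for the shapes of the training and test feature sets. A minimal sketch of that step, using a hypothetical 40-row frame (the row count and the values are assumptions for illustration, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 40-row frame; row count and values are illustrative only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"tax": rng.uniform(5, 10, 40), "dl": rng.uniform(0.45, 0.72, 40)})
Y_demo = pd.DataFrame({"consumption": rng.uniform(340, 970, 40)})

xd_train, xd_test, yd_train, yd_test = train_test_split(
    X_demo, Y_demo, test_size=0.20, random_state=1
)

# Shapes of the training and test feature sets.
print(xd_train.shape, xd_test.shape)
```

In the notebook itself the equivalent call would be `print(x_train.shape, x_test.shape)`.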

# Question 6 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [19]:
from sklearn.linear_model import LinearRegression
regression_model_1 = LinearRegression()
regression_model_1.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [20]:
for idx, col_name in enumerate(x_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model_1.coef_[0][idx]))

The coefficient for tax is -30.70924254754727
The coefficient for  dl is 892.8862087487333
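The question asks for a dataframe with the coefficients and `X.columns` as columns, which the loop above prints but does not construct. One way to build it is sketched below on hypothetical data generated from a known linear relation (`X_demo`, `Y_demo`, and the coefficients 2, -3, and 5 are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data from an exact linear relation, so the fit is easy to check.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=30), "b": rng.normal(size=30)})
Y_demo = pd.DataFrame({"y": 2.0 * X_demo["a"] - 3.0 * X_demo["b"] + 5.0})

model = LinearRegression().fit(X_demo, Y_demo)

# With a 2-D target, coef_ has shape (1, n_features), so it maps to one row.
coef_df = pd.DataFrame(model.coef_, columns=X_demo.columns)
print(coef_df)
```

Applied to the notebook's model, the same pattern would be `pd.DataFrame(regression_model_1.coef_, columns=x_train.columns)`.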


# R-Square 

# Question 7 - Evaluate the model 
Calculate the accuracy score (R-squared, as returned by `score`) for the above model on the test and training sets.

In [32]:
regression_model_1.score(x_test, y_test)

0.2876056314158515

In [33]:
regression_model_1.score(x_train, y_train)

0.4657867429910155
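For a regressor, `score` returns the coefficient of determination, R-squared, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on hypothetical synthetic data (all names and values are illustrative) that computes it by hand and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy linear data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=40)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, model.score(X_demo, y_demo))
```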

# Question 8: Repeat the same multiple linear regression modelling, adding both the Income and Highway features
Find the R-squared value.


In [23]:
X1 = df_new.drop(' consumption', axis='columns')
Y1 = df_new[[' consumption']]

In [28]:
x1_train, x1_test, y1_train, y1_test = train_test_split(X1, Y1, test_size=0.20, random_state=1)
regression_model_2 = LinearRegression()
regression_model_2.fit(x1_train, y1_train)
regression_model_2.score(x1_train, y1_train)

0.6407622941321002

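In [31]:
# Caution: this call refits the model on the test split, so the score below is an
# in-sample R-squared on the test rows, not a held-out score. To measure generalization,
# omit the fit() call and score the model that was fitted on x1_train.
regression_model_2.fit(x1_test, y1_test)
regression_model_2.score(x1_test, y1_test)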

0.851493648026156
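The cell above calls `fit` on the test split before scoring it, so the reported value is not a held-out score. The difference between the two procedures can be sketched on hypothetical synthetic data (all names and values below are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: four features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

# Held-out evaluation: fit on the training split only, then score the test split.
model = LinearRegression().fit(X_tr, y_tr)
heldout_r2 = model.score(X_te, y_te)

# Refitting on the test split and scoring the same rows gives an in-sample value.
refit_r2 = LinearRegression().fit(X_te, y_te).score(X_te, y_te)

print(heldout_r2, refit_r2)
```

Because least squares minimizes the residual sum of squares on the rows it is fitted to, the refit value can never be lower than the held-out value on the same rows, which is why it overstates how well the model generalizes.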

# Question 9: Print the coefficients of the multilinear regression model

In [29]:
for idx, col_name in enumerate(x1_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model_2.coef_[0][idx]))

The coefficient for tax is -39.411583621415424
The coefficient for  income is -0.06262814005687901
The coefficient for  highway is -0.0030219870395790096
The coefficient for  dl is 950.8827441430783


# Question 10 
In one or two sentences, explain the R-squared values on the basis of the above findings.

Answer:

In [None]:
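# R-squared measures the proportion of the variability of the response data around its mean
# that the model explains. On the training data of a linear model with an intercept it lies
# between 0 and 1 (0% and 100%); on held-out data it can be negative.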

In [None]:
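# regression_model_1 explains 46.58% of the variability in consumption on the training data
# (28.76% on the test data) using tax and the proportion of licensed drivers.
# regression_model_2, which also uses income and the number of miles of paved highway,
# explains 64.08% on the training data. Its 85.15% figure was obtained after refitting the
# model on the test split, so it is an in-sample value and not a held-out score.
# Adding the two features raises the training R-squared, which is expected because training
# R-squared cannot decrease when features are added to a least-squares model.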
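One caution when comparing the two models by R-squared: for least-squares fits with an intercept, adding features can never lower the training R-squared, even when the extra features carry no signal. A minimal sketch on hypothetical synthetic data, where the last two columns are pure noise (all names and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: only the first two columns influence the target.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(40, 4))
y_demo = 1.0 * X_full[:, 0] - 2.0 * X_full[:, 1] + rng.normal(scale=1.0, size=40)

X_two = X_full[:, :2]

# Training R-squared for the two-feature and four-feature models.
r2_two = LinearRegression().fit(X_two, y_demo).score(X_two, y_demo)
r2_four = LinearRegression().fit(X_full, y_demo).score(X_full, y_demo)

print(r2_two, r2_four)
```

A higher training R-squared for the larger model is therefore not evidence on its own that the added features help; a held-out score is the more informative comparison.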