# MULTIPLE LINEAR REGRESSION
Multiple linear regression (MLR) is a supervised learning algorithm for finding an association between a dependent variable (also called the response variable or outcome variable) and several independent variables (also called explanatory variables, predictor variables, or features).

The functional form of MLR is given by

<img src="q.png" />


The regression coefficients $b_1, b_2, \ldots, b_k$ are called partial regression coefficients, since the relationship between an explanatory variable and the response (outcome) variable is calculated after removing (or controlling for) the effect of all the other explanatory variables (features) in the model.

The assumptions made in the multiple linear regression model are as follows:
1. The regression model is linear in regression parameters (b-values).
2. The residuals follow a normal distribution and the expected value (mean) of the residuals is zero.
3. In time series data, residuals are assumed to be uncorrelated.
4. The variance of the residuals is constant for all values of $X_i$. Constant residual variance across values of $X_i$ is called homoscedasticity; non-constant residual variance is called heteroscedasticity.
5. There is no high correlation between the independent variables in the model. High correlation between independent variables is called multi-collinearity; it can destabilize the model and result in incorrect estimates of the regression parameters.

The partial regression coefficients are estimated by minimizing the sum of squared errors (SSE). We will
explain the multiple linear regression model by using the example of auction pricing of players in the
Indian Premier League (IPL).
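To make "minimizing the SSE" concrete, here is a minimal sketch on synthetic data (not the IPL dataset). The true coefficients, the noise level, and all variable names are assumptions chosen only for illustration; `np.linalg.lstsq` returns the coefficient vector that minimizes the sum of squared residuals.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3*x1 - 1.5*x2 + small noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: the b that minimizes ||y - Xb||^2, i.e. the SSE
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b_hat
sse = float(residuals @ residuals)

print("estimated coefficients:", np.round(b_hat, 2))
print("SSE:", round(sse, 3))
```

Any other coefficient vector gives a larger SSE on the same data, which is what makes `b_hat` the least-squares estimate.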

### Predicting the SOLD PRICE (Auction Price) of Players
The Indian Premier League (IPL) is a professional league for Twenty20 (T20) cricket championships that
was started in 2008 in India. IPL was initiated by the BCCI with eight franchises comprising players from
across the world. 

The first IPL auction was held in 2008 for ownership of the teams for 10 years, with
a base price of USD 50 million. The franchises acquire players through an English auction that is conducted every year. However, there are several rules imposed by the IPL. For example, only international
players and popular Indian players are auctioned.


The performance of the players can be measured through several metrics. Although the IPL follows the Twenty20 format of the game, it is possible that the players' performance in the other formats of the game, such as Test and One-Day matches, could influence player pricing. A few players had excellent records in Test matches, but their records in Twenty20 matches were not very impressive. The performances of 130 players who played in at least one season of the IPL (2008−2011), measured through various performance metrics, are provided in the table below.

<img src="w.png" />
<img src="e.png" />

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm

In [2]:
df = pd.read_csv('IPL IMB381IPL2013.csv')
df.head()

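| | Sl.NO. | PLAYER NAME | AGE | COUNTRY | TEAM | PLAYING ROLE | T-RUNS | T-WKTS | ODI-RUNS-S | ODI-SR-B | ... | SR-B | SIXERS | RUNS-C | WKTS | AVE-BL | ECON | SR-BL | AUCTION YEAR | BASE PRICE | SOLD PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Abdulla, YA | 2 | SA | KXIP | Allrounder | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 307 | 15 | 20.47 | 8.9 | 13.93 | 2009 | 50000 | 50000 |
| 1 | 2 | Abdur Razzak | 2 | BAN | RCB | Bowler | 214 | 18 | 657 | 71.41 | ... | 0.0 | 0 | 29 | 0 | 0.0 | 14.5 | 0.0 | 2008 | 50000 | 50000 |
| 2 | 3 | Agarkar, AB | 2 | IND | KKR | Bowler | 571 | 58 | 1269 | 80.62 | ... | 121.01 | 5 | 1059 | 29 | 36.52 | 8.81 | 24.9 | 2008 | 200000 | 350000 |
| 3 | 4 | Ashwin, R | 1 | IND | CSK | Bowler | 284 | 31 | 241 | 84.56 | ... | 76.32 | 0 | 1125 | 49 | 22.96 | 6.23 | 22.14 | 2011 | 100000 | 850000 |
| 4 | 5 | Badrinath, S | 2 | IND | CSK | Batsman | 63 | 0 | 79 | 45.93 | ... | 120.71 | 28 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2011 | 100000 | 800000 |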


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Sl.NO.         130 non-null    int64  
 1   PLAYER NAME    130 non-null    object 
 2   AGE            130 non-null    int64  
 3   COUNTRY        130 non-null    object 
 4   TEAM           130 non-null    object 
 5   PLAYING ROLE   130 non-null    object 
 6   T-RUNS         130 non-null    int64  
 7   T-WKTS         130 non-null    int64  
 8   ODI-RUNS-S     130 non-null    int64  
 9   ODI-SR-B       130 non-null    float64
 10  ODI-WKTS       130 non-null    int64  
 11  ODI-SR-BL      130 non-null    float64
 12  CAPTAINCY EXP  130 non-null    int64  
 13  RUNS-S         130 non-null    int64  
 14  HS             130 non-null    int64  
 15  AVE            130 non-null    float64
 16  SR-B           130 non-null    float64
 17  SIXERS         130 non-null    int64  
 18  RUNS-C    

We can build a model to understand which features of players influence their SOLD PRICE, or to predict players' auction prices in the future. However, not all columns are features. For example, Sl.NO. is just a serial number and cannot be considered a feature of the player. We will build the model using only the players' statistics, so BASE PRICE can also be removed. We will create a variable xfeatures that contains the list of features used for building the model and ignore the rest of the columns of the DataFrame.

The following code lists all the columns of the DataFrame, from which the features will be selected.


In [4]:
xfeatures = df.columns
xfeatures

Index(['Sl.NO.', 'PLAYER NAME', 'AGE', 'COUNTRY', 'TEAM', 'PLAYING ROLE',
       'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'CAPTAINCY EXP', 'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C',
       'WKTS', 'AVE-BL', 'ECON', 'SR-BL', 'AUCTION YEAR', 'BASE PRICE',
       'SOLD PRICE'],
      dtype='object')

Most of the features in the dataset are numerical (ratio scale), whereas AGE, COUNTRY, PLAYING ROLE and CAPTAINCY EXP are categorical. Categorical variables cannot be included directly in the regression model; they must be encoded using dummy variables before model building.


In [5]:
xfeatures = ['AGE', 'COUNTRY', 'PLAYING ROLE',
       'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'CAPTAINCY EXP', 'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C',
       'WKTS', 'AVE-BL', 'ECON', 'SR-BL']

### Encoding Categorical Features
Qualitative variables or categorical variables need to be encoded using dummy variables before incorporating
them in the regression model. 


The following Python code lists the unique values of the column PLAYING ROLE, which are Allrounder, Bowler, Batsman and W. Keeper:

In [6]:
df['PLAYING ROLE'].unique()

array(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'], dtype=object)

The variable can be converted into four dummy variables, with the dummy variable matching the player's role set to 1 and the others set to 0. This can be done using the pd.get_dummies() method. We will first create dummy variables only for PLAYING ROLE to understand the method, and then create dummy variables for the rest of the categorical variables.

In [7]:
pd.get_dummies(df['PLAYING ROLE'])[:5]

| | Allrounder | Batsman | Bowler | W. Keeper |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |


The pd.get_dummies() method has created four dummy variables and, in each sample, has set the one corresponding to the player's role to 1.

Whenever we have n levels (or categories) for a qualitative variable (categorical variable), we will use
(n − 1) dummy variables, where each dummy variable is a binary variable used for representing whether
an observation belongs to a category or not. 

The reason why we create only (n − 1) dummy variables is that including dummy variables for all categories together with the constant in the regression equation creates perfect multi-collinearity. To drop one category, the parameter drop_first should be set to True.
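The effect of drop_first can be seen on a small toy column. This is only an illustrative sketch: the five-row Series below is made up, and `dtype=int` is passed so that the dummies print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column containing the four roles found in the dataset
roles = pd.Series(
    ["Allrounder", "Bowler", "Batsman", "W. Keeper", "Bowler"],
    name="PLAYING ROLE",
)

# One dummy per category: every row has exactly one 1
all_dummies = pd.get_dummies(roles, dtype=int)

# n - 1 dummies: the first category (alphabetically) is dropped
reduced = pd.get_dummies(roles, drop_first=True, dtype=int)

print(list(all_dummies.columns))
print(list(reduced.columns))
print(reduced)
```

With the first category dropped, an Allrounder is represented by a row of zeros, so that category acts as the baseline against which the other roles are compared.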


We must create dummy variables for all categorical (qualitative) variables present in the dataset.

In [8]:
categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']

encoded_df = pd.get_dummies(df[xfeatures], 
                            columns=categorical_features, 
                            drop_first=True)

encoded_df.columns

Index(['T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS', 'AVE-BL',
       'ECON', 'SR-BL', 'AGE_2', 'AGE_3', 'COUNTRY_BAN', 'COUNTRY_ENG',
       'COUNTRY_IND', 'COUNTRY_NZ', 'COUNTRY_PAK', 'COUNTRY_SA', 'COUNTRY_SL',
       'COUNTRY_WI', 'COUNTRY_ZIM', 'PLAYING ROLE_Batsman',
       'PLAYING ROLE_Bowler', 'PLAYING ROLE_W. Keeper', 'CAPTAINCY EXP_1'],
      dtype='object')

In [9]:
xfeatures = encoded_df.columns

In [10]:
encoded_df.head()

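| | T-RUNS | T-WKTS | ODI-RUNS-S | ODI-SR-B | ODI-WKTS | ODI-SR-BL | RUNS-S | HS | AVE | SR-B | ... | COUNTRY_NZ | COUNTRY_PAK | COUNTRY_SA | COUNTRY_SL | COUNTRY_WI | COUNTRY_ZIM | PLAYING ROLE_Batsman | PLAYING ROLE_Bowler | PLAYING ROLE_W. Keeper | CAPTAINCY EXP_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 214 | 18 | 657 | 71.41 | 185 | 37.6 | 0 | 0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 571 | 58 | 1269 | 80.62 | 288 | 32.9 | 167 | 39 | 18.56 | 121.01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 284 | 31 | 241 | 84.56 | 51 | 36.8 | 58 | 11 | 5.8 | 76.32 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 63 | 0 | 79 | 45.93 | 0 | 0.0 | 1317 | 71 | 32.93 | 120.71 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |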


In [11]:
from sklearn.model_selection import train_test_split

X = sm.add_constant(encoded_df)

Y = df['SOLD PRICE']

xtrain, xtest, ytrain, ytest = train_test_split(X, 
                                                Y, 
                                                train_size=0.8, 
                                                random_state=42)

### Building the Model on the Training Dataset
We will build the MLR model using the training dataset and analyze the model summary. The summary
provides details of the model accuracy, feature significance, and signs of any multi-collinearity effect.

In [12]:
model1 = sm.OLS(ytrain, xtrain).fit()

model1.summary2()

| | | | |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.362 |
| Dependent Variable: | SOLD PRICE | AIC: | 2965.2841 |
| Date: | 2022-05-14 18:09 | BIC: | 3049.9046 |
| No. Observations: | 104 | Log-Likelihood: | -1450.6 |
| Df Model: | 31 | F-statistic: | 2.883 |
| Df Residuals: | 72 | Prob (F-statistic): | 0.000114 |
| R-squared: | 0.554 | Scale: | 110340000000.0 |

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 375827.1991 | 228849.9306 | 1.6422 | 0.1049 | -80376.7996 | 832031.1978 |
| T-RUNS | -53.7890 | 32.7172 | -1.6441 | 0.1045 | -119.0096 | 11.4316 |
| T-WKTS | -132.5967 | 609.7525 | -0.2175 | 0.8285 | -1348.1162 | 1082.9228 |
| ODI-RUNS-S | 57.9600 | 31.5071 | 1.8396 | 0.0700 | -4.8482 | 120.7681 |
| ODI-SR-B | -524.1450 | 1576.6368 | -0.3324 | 0.7405 | -3667.1130 | 2618.8231 |
| ODI-WKTS | 815.3944 | 832.3883 | 0.9796 | 0.3306 | -843.9413 | 2474.7301 |
| ODI-SR-BL | -773.3092 | 1536.3334 | -0.5033 | 0.6163 | -3835.9338 | 2289.3154 |
| RUNS-S | 114.7205 | 173.3088 | 0.6619 | 0.5101 | -230.7643 | 460.2054 |
| HS | -5516.3354 | 2586.3277 | -2.1329 | 0.0363 | -10672.0855 | -360.5853 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.891 | Durbin-Watson: | 2.244 |
| Prob(Omnibus): | 0.64 | Jarque-Bera (JB): | 0.638 |
| Skew: | 0.19 | Prob(JB): | 0.727 |
| Kurtosis: | 3.059 | Condition No.: | 84116.0 |


Based on the p-values (< 0.05), only the features HS, AGE_2, AVE and COUNTRY_ENG have come out significant. The model says that none of the other features influence SOLD PRICE (at a significance level of 0.05). This is not very intuitive and could be a result of multi-collinearity among the variables.

### Multi-Collinearity and Handling Multi-Collinearity
When the dataset has a large number of independent variables (features), it is possible that a few of these independent variables are highly correlated. The existence of a high correlation between independent variables is called multi-collinearity. The presence of multi-collinearity can destabilize the multiple linear regression model, so it is necessary to identify it.


Multi-collinearity can have the following impact on the model:
1. The standard error of estimate, $S_e(b)$, is inflated.
2. A statistically significant explanatory variable may be labelled as statistically insignificant due to a large p-value. This is because an inflated standard error of estimate results in an underestimated t-statistic.
3. The sign of the regression coefficient may be reversed; that is, instead of a negative regression coefficient we may get a positive one, and vice versa.
4. Adding or removing a variable, or even an observation, may result in large variation in the regression coefficient estimates.
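The first two effects in the list above can be seen in a small simulation. The sketch below uses made-up data and a hypothetical helper, `coef_std_errors`, that computes OLS standard errors from the usual formula (residual variance times the diagonal of the inverse of X'X); it compares the standard error of the same coefficient when the second predictor is independent of the first and when it is almost a copy of it.

```python
import numpy as np

def coef_std_errors(X, y):
    """Return OLS coefficients and their standard errors for design matrix X."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    return b, np.sqrt(sigma2 * np.diag(xtx_inv))

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                    # unrelated to x1
x2_corr = x1 + rng.normal(scale=0.05, size=n)    # almost a copy of x1
noise = rng.normal(size=n)

se_b1 = {}
for label, x2 in [("independent", x2_indep), ("correlated", x2_corr)]:
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + noise
    b, se = coef_std_errors(X, y)
    se_b1[label] = float(se[1])
    print(label, "-> SE of b1:", round(se_b1[label], 3))
```

Because the t-statistic is the coefficient divided by its standard error, the much larger standard error in the correlated case pushes the t-statistic towards zero and the p-value up.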

### Variance Inflation Factor (VIF)
Variance Inflation Factor (VIF) is a measure used for identifying the existence of multi-collinearity. 

For example, consider two independent variables $X_1$ and $X_2$ and the regression between them.

<img src="ee.png" />

Let $R_{12}^2$ be the R-squared value of this model. Then the VIF, which is a measure of multi-collinearity, is
given by

<img src="rr.png" />

$\sqrt{\text{VIF}}$ is the factor by which the t-statistic is deflated. A VIF value greater than 4 requires further
investigation to assess the impact of multi-collinearity. One approach to eliminate multi-collinearity is
to remove one of the variables from the model building.
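As a worked illustration of the two-variable case, the sketch below regresses one synthetic variable on another, takes the R-squared of that regression, and applies the usual definition VIF = 1 / (1 − R²). The data, the helper name `r_squared`, and the variable names are assumptions made for illustration only.

```python
import numpy as np

def r_squared(x_target, x_other):
    """R-squared from regressing x_target on x_other (with an intercept)."""
    A = np.column_stack([np.ones(len(x_other)), x_other])
    coef, *_ = np.linalg.lstsq(A, x_target, rcond=None)
    resid = x_target - A @ coef
    sst = np.sum((x_target - x_target.mean()) ** 2)
    return 1.0 - (resid @ resid) / sst

# Two synthetic, strongly related variables (hypothetical data)
rng = np.random.default_rng(2)
x2 = rng.normal(size=500)
x1 = 0.9 * x2 + rng.normal(scale=0.3, size=500)

r2_12 = r_squared(x1, x2)
vif = 1.0 / (1.0 - r2_12)

print("R-squared of x1 on x2:", round(float(r2_12), 3))
print("VIF:", round(float(vif), 2))
print("sqrt(VIF):", round(float(np.sqrt(vif)), 2))
```

The closer the R-squared between the two variables is to 1, the faster the VIF grows, which is why a modest-looking threshold such as 4 already corresponds to an R-squared of 0.75.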


The variance_inflation_factor() function available in the statsmodels.stats.outliers_influence module can be
used to calculate VIF for the features.

The following method is written to calculate VIF and assign the
VIF to the columns and return a DataFrame:
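A helper of this kind might look like the sketch below. The name `get_vif_factors`, the column labels of the returned DataFrame, and the synthetic demo data are assumptions made for illustration. Note that `variance_inflation_factor(exog, exog_idx)` regresses the chosen column on the remaining columns exactly as given and does not add an intercept on its own.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_vif_factors(X):
    """Return a DataFrame with one row per column of X and its VIF."""
    X_matrix = X.to_numpy(dtype=float)
    vif = [variance_inflation_factor(X_matrix, i)
           for i in range(X_matrix.shape[1])]
    return pd.DataFrame({"column": X.columns, "VIF": vif})

# Hypothetical demo data: 'b' is nearly a copy of 'a', 'c' is unrelated
rng = np.random.default_rng(5)
a = rng.normal(size=400)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=400),
    "c": rng.normal(size=400),
})

vif_table = get_vif_factors(demo)
print(vif_table)
```

In the demo, the two nearly identical columns receive large VIFs while the unrelated column stays close to 1.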

Now, calling the above method with the X features will return the VIF for the corresponding features.

#### Checking Correlation of Columns with Large VIFs
We can generate a correlation heatmap to understand the correlation between the independent variables, which can be used to decide which features to include in the model. We will first select the features that have a VIF value of more than 4.
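The correlation step can be sketched as follows, assuming the columns with large VIFs have already been collected into a DataFrame. The frame `features`, its column names, and the 0.8 cut-off are hypothetical and stand in for the real high-VIF columns.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical stand-in for the columns whose VIF exceeded the threshold
rng = np.random.default_rng(3)
n = 300
base = rng.normal(size=n)
features = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.2, size=n),   # strongly tied to f1
    "f3": rng.normal(size=n),                      # unrelated to both
})

# Pairwise Pearson correlations between the selected columns
corr = features.corr()

# List each pair once and keep those with a strong correlation
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > 0.8
]

print(corr.round(2))
print(high_pairs)
```

The matrix `corr` is what a heatmap would display; for instance, it can be passed to seaborn as `sns.heatmap(corr, annot=True)` if seaborn is available.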

#### Observations
1. T-RUNS and ODI-RUNS-S are highly correlated, and so are ODI-WKTS and T-WKTS.
2. Batting features such as RUNS-S, HS, AVE and SIXERS are highly correlated with one another, as are bowling features such as AVE-BL, ECON and SR-BL.


To avoid multi-collinearity, we can keep only one column from each group of highly correlated variables
and remove the others. 

Now which one to keep and which one to remove depends on the understanding
of the data and the domain.


We have decided to remove the following features. Please note that it may take multiple iterations before arriving at a final set of variables that do not have multi-collinearity. These iterations have been omitted here for simplicity.

The VIFs on the final set of variables indicate that there is no multi-collinearity present any more (VIF values are less than 4). We can now proceed to build the model with this set of variables.

### Building a New Model after Removing Multi-collinearity

The p-values of the estimated coefficients show whether the variables are statistically significant in influencing the response variable. If the p-value is less than the significance level (α), the feature is statistically significant; otherwise it is not. The value of α is usually selected as 0.05; however, it may be chosen based on the context of the problem.


Based on the p-values, only the variables COUNTRY_IND, COUNTRY_ENG, SIXERS, CAPTAINCY
EXP_1 have come out statistically significant. So, the features that decide the SOLD PRICE are

1. Whether the player is from India or England (that is, the origin country of the player).
2. How many sixes the player has hit in previous seasons of the IPL.
3. Whether the player has any previous captaincy experience.

Let us create a new list called significant_vars to store the column names of the significant variables and build
a new model.

The following inferences can be derived from the latest model, ipl_model_3:

1. All the variables are statistically significant, as their p-values are less than 0.05.
2. The overall model is significant, as the p-value for the F-statistic is also less than 0.05.
3. The model explains 71.5% of the variance in SOLD PRICE, as the R-squared value is 0.715; the adjusted R-squared value is 0.704. Adjusted R-squared is calculated after normalizing SSE and SST by their corresponding degrees of freedom.
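The degrees-of-freedom adjustment can be written out directly. With n observations and k features, dividing SSE by (n − k − 1) and SST by (n − 1) gives adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). The sketch below, with a hypothetical helper name, checks this formula against the figures reported in the summary of model1 earlier (R-squared 0.554, 104 observations, Df Model 31).

```python
def adjusted_r_squared(r2, n_obs, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Figures taken from the model1 summary: R-squared, No. Observations, Df Model
adj = adjusted_r_squared(0.554, 104, 31)
print(round(adj, 3))  # 0.362, matching the reported Adj. R-squared
```

The gap between 0.554 and 0.362 shows how strongly the adjustment penalizes a model with 31 features fitted on only 104 observations.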