# IPL Player Price Prediction

# MULTIPLE LINEAR REGRESSION
Multiple linear regression (MLR) is a supervised learning algorithm for modelling the association
between a dependent variable (also called the response variable or outcome variable) and
several independent variables (also called explanatory variables, predictor variables, or features).

The functional form of MLR is given by

<img src="qqq.png" />


The regression coefficients $b_1, b_2, \ldots, b_k$ are called partial regression coefficients, since the relationship
between an explanatory variable and the response (outcome) variable is calculated after removing (or
controlling for) the effect of all the other explanatory variables (features) in the model.

The assumptions made in the multiple linear regression model are as follows:
1. The regression model is linear in regression parameters (b-values).
2. The residuals follow a normal distribution and the expected value (mean) of the residuals is zero.
3. In time-series data, residuals are assumed to be uncorrelated.
4. The variance of the residuals is constant for all values of $X_i$. Constant variance of the residuals across values of $X_i$ is called homoscedasticity; non-constant variance is called heteroscedasticity.
5. The independent variables in the model are not highly correlated with one another. High correlation among them is called multi-collinearity; it can destabilize the model and result in incorrect estimates of the regression parameters.

The partial regression coefficients are estimated by minimizing the sum of squared errors (SSE). We will
explain the multiple linear regression model using the example of auction pricing of players in the
Indian Premier League (IPL).
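To make the idea of minimizing the SSE concrete, here is a minimal sketch on synthetic data. The data, the true coefficients and all variable names are assumptions made for illustration and are not part of the IPL dataset. It uses `np.linalg.lstsq`, which returns the coefficient vector with the smallest sum of squared residuals.

```python
import numpy as np

# Synthetic data (hypothetical): y = 1.0 + 2.0*x1 - 3.0*x2 + small noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first coefficient acts as the intercept b_0
X_design = np.column_stack([np.ones(n), X])

# Least squares: find b that minimizes ||y - X_design @ b||^2, i.e. the SSE
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# SSE at the estimated coefficients
residuals = y - X_design @ b_hat
sse = float(np.sum(residuals ** 2))

print("Estimated coefficients:", b_hat)
print("SSE:", sse)
```

Any other choice of coefficients gives an SSE at least as large as the one at `b_hat`, which is what "estimated by minimizing the SSE" means in practice.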

### Predicting the SOLD PRICE (Auction Price) of Players
The Indian Premier League (IPL) is a professional league for Twenty20 (T20) cricket championships that
was started in 2008 in India. IPL was initiated by the BCCI with eight franchises comprising players from
across the world. 

The first IPL auction was held in 2008 for ownership of the teams for 10 years, with
a base price of USD 50 million. The franchises acquire players through an English auction that is conducted every year. However, there are several rules imposed by the IPL. For example, only international
players and popular Indian players are auctioned.


The performance of the players could be measured through several metrics. Although the IPL follows the Twenty20 format of the game, it is possible that the performance of the players in the other
formats of the game such as Test and One-Day matches could influence player pricing. 

A few players
had excellent records in Test matches, but their records in Twenty20 matches were not very impressive.
The performances of 130 players who played in at least one season of the IPL (2008−2011), measured
through various performance metrics, are provided in the table below.

<img src="ww.png" />
<img src="ee.png" />




In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [2]:
ipl_df = pd.read_csv("IPLdata.csv")
ipl_df.head()


In [4]:
ipl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Sl.NO.         130 non-null    int64  
 1   PLAYER NAME    130 non-null    object 
 2   AGE            130 non-null    int64  
 3   COUNTRY        130 non-null    object 
 4   TEAM           130 non-null    object 
 5   PLAYING ROLE   130 non-null    object 
 6   T-RUNS         130 non-null    int64  
 7   T-WKTS         130 non-null    int64  
 8   ODI-RUNS-S     130 non-null    int64  
 9   ODI-SR-B       130 non-null    float64
 10  ODI-WKTS       130 non-null    int64  
 11  ODI-SR-BL      130 non-null    float64
 12  CAPTAINCY EXP  130 non-null    int64  
 13  RUNS-S         130 non-null    int64  
 14  HS             130 non-null    int64  
 15  AVE            130 non-null    float64
 16  SR-B           130 non-null    float64
 17  SIXERS         130 non-null    int64  
 18  RUNS-C    

We can build a model to understand which features of players influence their SOLD PRICE, or to
predict players' auction prices in the future. However, not all columns are features. For example, Sl.NO. is just a serial number and cannot be considered a feature of the player. We will build the model using only
the players' statistics.

So, BASE PRICE can also be removed. We will create a variable X_features which will
contain the list of features that we will finally use for building the model, and ignore the rest of the columns
of the DataFrame.


The features included in model building are collected in the list X_features, defined below.


In [5]:
ipl_df['PLAYING ROLE'].unique()

array(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'], dtype=object)

In [6]:
pd.get_dummies(ipl_df['PLAYING ROLE'])[0:5]

   Allrounder  Batsman  Bowler  W. Keeper
0           1        0       0          0
1           0        0       1          0
2           0        0       1          0
3           0        0       1          0
4           0        1       0          0


In [None]:
X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'T-RUNS', 'T-WKTS',
              'ODI-RUNS-S', 'ODI-SR-B','ODI-WKTS', 'ODI-SR-BL',
              'CAPTAINCY EXP', 'RUNS-S','HS', 'AVE', 'SR-B', 'SIXERS',
              'RUNS-C', 'WKTS','AVE-BL', 'ECON', 'SR-BL']

### Encoding Categorical Features
Qualitative variables or categorical variables need to be encoded using dummy variables before incorporating
them in the regression model. 


Finding unique values of column PLAYING ROLE shows the values: Allrounder, Bowler, Batsman,
W. Keeper. 

The following Python code is used to encode a categorical or qualitative variable using dummy
variables:

In [None]:
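# Columns treated as categorical and replaced by dummy variables
categorial_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']
# Keep only the selected features; drop_first=True drops one level per variable
ipl_encoded_df = pd.get_dummies(ipl_df[X_features], columns=categorial_features, drop_first=True)

ipl_encoded_df.columns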
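To see what `drop_first=True` does, here is a small sketch on a toy DataFrame. The rows and values are made up for illustration; only the column name PLAYING ROLE and its four levels come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with one categorical and one numeric column
toy_df = pd.DataFrame({
    "PLAYING ROLE": ["Allrounder", "Bowler", "Batsman", "W. Keeper", "Bowler"],
    "SIXERS": [12, 0, 30, 8, 1],
})

# One dummy column per level, except the first level in sorted order ("Allrounder")
encoded = pd.get_dummies(toy_df, columns=["PLAYING ROLE"], drop_first=True)

print(list(encoded.columns))
```

With four levels, only three dummy columns are created. The dropped level, Allrounder, becomes the baseline: a row with all three dummies equal to zero is an all-rounder. Keeping all four dummies alongside an intercept would make the columns perfectly collinear.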

Based on the p-values (< 0.05), only the features
HS, AGE_2, AVE and COUNTRY_ENG come out as statistically significant.

The model suggests that none of the other
features influence SOLD PRICE (at a significance level of 0.05). This is not very intuitive and could
be a result of multi-collinearity among the variables.

### Multi-Collinearity and Handling Multi-Collinearity
When the dataset has a large number of independent variables (features), it is possible that a few of these
independent variables (features) are highly correlated.

The existence of a high correlation between
independent variables is called multi-collinearity.

The presence of multi-collinearity can destabilize the multiple linear regression model. Thus, it is necessary to identify the presence of multi-collinearity.


Multi-collinearity can have the following impact on the model:
1. The standard error of estimate, $S_e(b)$, is inflated.
2. A statistically significant explanatory variable may be labelled as statistically insignificant due to a large p-value. This is because an inflated standard error of estimate results in an underestimation of the t-statistic value.
3. The sign of the regression coefficient may be reversed, that is, instead of a negative value for the regression coefficient, we may get a positive one and vice versa.
4. Adding or removing a variable, or even an observation, may result in large variation in the regression coefficient estimates.

### Variance Inflation Factor (VIF)
Variance Inflation Factor (VIF) is a measure used for identifying the existence of multi-collinearity. 

For example, consider two independent variables $X_1$ and $X_2$ and the regression between them.

<img src="eee.png" />

Let $R_{12}^2$ be the R-squared value of this model. Then the VIF, which is a measure of multi-collinearity, is
given by

<img src="rrr.png" />

$\sqrt{VIF}$ is the factor by which the t-statistic value is deflated. A VIF value greater than 4 requires further
investigation to assess the impact of multi-collinearity. One approach to eliminating multi-collinearity is
to remove one of the variables from the model building.
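The calculation can be sketched numerically for two variables, assuming the usual definition VIF = 1 / (1 − R²), where R² comes from regressing one independent variable on the other. The data below are synthetic and the variable names are chosen for illustration.

```python
import numpy as np

# Synthetic, strongly related variables (hypothetical): x1 is roughly 0.9 * x2
rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + rng.normal(scale=0.3, size=n)

# Regress x1 on x2 (with an intercept) by least squares
A = np.column_stack([np.ones(n), x2])
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
fitted = A @ coef

# R-squared of that regression
ss_res = np.sum((x1 - fitted) ** 2)
ss_tot = np.sum((x1 - x1.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# VIF and the factor by which the t-statistic is deflated
vif = 1.0 / (1.0 - r_squared)
t_deflation = np.sqrt(vif)

print("R-squared:", r_squared)
print("VIF:", vif)
print("sqrt(VIF):", t_deflation)
```

Because the two variables are strongly related, the VIF here lands well above the threshold of 4 mentioned above.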


The variance_inflation_factor() function available in the statsmodels.stats.outliers_influence module can be
used to calculate VIF for the features.

The following method is written to calculate VIF and assign the
VIF to the columns and return a DataFrame:
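A minimal sketch of such a helper is given here, assuming the features are already numeric (for example, after dummy encoding). The name `get_vif_factors` and the toy data are hypothetical. `variance_inflation_factor(exog, exog_idx)` does not add an intercept itself, so the sketch passes the columns as they are.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def get_vif_factors(X):
    """Return a DataFrame with one row per column of X and its VIF."""
    X_matrix = X.to_numpy(dtype=float)
    # VIF of column i: regress column i on all the other columns
    vif = [variance_inflation_factor(X_matrix, i)
           for i in range(X_matrix.shape[1])]
    return pd.DataFrame({"column": X.columns, "vif": vif})


# Toy data (hypothetical): x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
toy_X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=n),
    "x3": rng.normal(size=n),
})

vif_factors = get_vif_factors(toy_X)
print(vif_factors)
```

On the IPL data, the same helper would be called on the encoded feature columns, and columns with a VIF above 4 would be the candidates for further investigation.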

#### Observations
1. T-RUNS and ODI-RUNS-S are highly correlated, as are ODI-WKTS and T-WKTS.
2. Batsman features such as RUNS-S, HS, AVE and SIXERS are highly correlated with one another, as are bowler features such as AVE-BL, ECON and SR-BL.
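Groups of correlated columns like these can be found from the pairwise correlation matrix. The sketch below lists pairs whose absolute correlation exceeds a threshold. The helper name `highly_correlated_pairs`, the threshold of 0.8 and the toy columns are assumptions for illustration, not values taken from the analysis above.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once (no self-pairs)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data (hypothetical): b closely tracks a, c is unrelated
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.2, size=n),
    "c": rng.normal(size=n),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

The result is a Series indexed by column pairs. A heatmap of `df.corr()` shows the same information visually.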


To avoid multi-collinearity, we can keep only one column from each group of highly correlated variables
and remove the others. 

Which one to keep and which to remove depends on the understanding
of the data and the domain.


We have decided to remove the following features. Please note that it may take multiple iterations
before arriving at a final set of variables that do not exhibit multi-collinearity.
