# Regression Analyses with Cricket Data

In week 1, we took a brief look at the match statistics of the Indian Premier League in 2018 (IPL2018teams dataset). This week, we will look at player-level statistics. In particular, we are interested in whether players' performance affects their salaries.

### Import useful libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm

### Import cricket data
In our data repository, there is a data set “IPL18Player.csv” which contains performance statistics as well as salary information of cricket players in the Indian Premier League in 2018.


In [2]:
IPLPlayer=pd.read_csv("../Data/IPL18Player.csv")
IPLPlayer.head()

| | player_id | long_scorecard_name | Salary | team | matches | wins | team_runs_for | team_runs_against | matches_keeper | byes_conceded | ... | bowling_dot_balls | bowling_sixes | no_balls | balls_bowled_1_to_6 | runs_conceded_1_to_6 | balls_bowled_7_to_14 | runs_conceded_7_to_14 | balls_bowled_15_to_20 | runs_conceded_15_to_20 | event_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8931 | AT Rayudu | 343750.0 | Chennai Super Kings | 16 | 11 | 2809 | 2750 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 254771 | D Shorey | 31250.0 | Chennai Super Kings | 1 | 1 | 128 | 127 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 44613 | DJ Bravo | 1000000.0 | Chennai Super Kings | 16 | 11 | 2809 | 2750 | 0 | 0 | ... | 90 | 29 | 0 | 0 | 0 | 126 | 160 | 195 | 373 | 1 |
| 3 | 214425 | DJ Willey | NaN | Chennai Super Kings | 3 | 2 | 484 | 483 | 0 | 0 | ... | 20 | 3 | 0 | 24 | 38 | 6 | 10 | 30 | 47 | 1 |
| 4 | 258155 | DL Chahar | 125000.0 | Chennai Super Kings | 12 | 9 | 2117 | 2068 | 0 | 0 | ... | 118 | 10 | 2 | 194 | 236 | 37 | 42 | 0 | 0 | 1 |


## Data Exploration and Preparation

In [3]:
IPLPlayer.shape

(149, 35)

### Missing Values

In [4]:
IPLPlayer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 35 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   player_id               149 non-null    int64  
 1   long_scorecard_name     149 non-null    object 
 2   Salary                  141 non-null    float64
 3   team                    149 non-null    object 
 4   matches                 149 non-null    int64  
 5   wins                    149 non-null    int64  
 6   team_runs_for           149 non-null    int64  
 7   team_runs_against       149 non-null    int64  
 8   matches_keeper          149 non-null    int64  
 9   byes_conceded           149 non-null    int64  
 10  moms                    149 non-null    int64  
 11  innings                 149 non-null    int64  
 12  not_outs                149 non-null    int64  
 13  runs                    149 non-null    int64  
 14  balls_faced             149 non-null    in

_There are missing values in the salary variable. We will drop observations with missing values._

In [5]:
IPLPlayer=IPLPlayer.dropna()
IPLPlayer.shape

(141, 35)
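Before dropping rows, it can help to confirm which columns contain the missing values. The sketch below uses a small made-up frame (not the IPL data) to show one way of counting missing values per column and of restricting the drop to the salary column with `dropna(subset=...)`. In the notebook above, the plain `dropna()` removes the same rows, since the 141 remaining rows match the 141 non-null `Salary` entries reported by `info()`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the player table (all values are made up).
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "Salary": [343750.0, np.nan, 125000.0, 31250.0],
    "runs": [602, 10, 50, 8],
})

# Count the missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop only the rows where Salary is missing; other columns are not checked.
toy_clean = toy.dropna(subset=["Salary"])
print(toy_clean.shape)
```

Restricting the drop to `Salary` guards against silently losing observations if another column, unrelated to the analysis, happens to contain missing values.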

## Create useful variables
### Create dummy variables to indicate the role of the players.
- Create a variable to indicate whether a player batted.

 The variable "innings" records the number of innings in which a player batted.

In [6]:
IPLPlayer['batsman']=np.where(IPLPlayer['innings']> 0, 1, 0)
IPLPlayer['batsman'].describe()

count    141.000000
mean       0.943262
std        0.232165
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: batsman, dtype: float64

- Create a variable to indicate whether a player bowled.

In [7]:
IPLPlayer['bowler']=np.where(IPLPlayer['matches_bowled']> 0, 1, 0)
IPLPlayer['bowler'].describe()

count    141.000000
mean       0.631206
std        0.484198
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: bowler, dtype: float64

One role that is not captured by the batsman and bowler indicators is wicket keeper. In the dataset, the variable "matches_keeper" records the number of matches in which a player kept wicket.
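A wicket-keeper indicator could be built in the same way as the batsman and bowler indicators. The following is a minimal sketch on a made-up frame that reuses the dataset's column names; the `keeper` column name is an assumption and does not appear in the original notebook. It also cross-tabulates the batsman and bowler indicators to show how many players fall into each combination.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as the player table (values are made up).
toy = pd.DataFrame({
    "innings": [14, 0, 9, 12],
    "matches_bowled": [0, 10, 16, 0],
    "matches_keeper": [0, 0, 0, 12],
})

# Role indicators: 1 if the player did the activity at least once, else 0.
toy["batsman"] = np.where(toy["innings"] > 0, 1, 0)
toy["bowler"] = np.where(toy["matches_bowled"] > 0, 1, 0)
toy["keeper"] = np.where(toy["matches_keeper"] > 0, 1, 0)  # hypothetical column name

# Rows are batsman (0/1), columns are bowler (0/1); cell (1, 1) counts all-rounders.
role_counts = pd.crosstab(toy["batsman"], toy["bowler"])
print(role_counts)
```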


### Performance Measures
1. batting average = runs / number of outs
2. batting strike rate = (runs * 100) / balls faced
3. bowling average = runs conceded / wickets taken
4. bowling strike rate = number of balls bowled / wickets taken

_Notice that if a batsman has scored runs but has never been dismissed, his batting average is technically infinite. Similarly, if a player did not face any balls, his batting strike rate is undefined (zero divided by zero), and if a bowler conceded runs without taking a wicket, his bowling average and bowling strike rate would be infinite._

We cannot run a regression when our variables contain infinite or undefined values.

There are two alternatives we will consider to deal with this issue.
1. Add 1 to the number of outs, balls faced, and wickets taken in calculating the above variables.
2. Instead of creating the above measures, we can simply include total runs, total number of outs, and balls faced to measure a batsman's performance, and include runs conceded, number of balls bowled, and wickets taken to measure a bowler's performance.
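To see the problem and the first workaround concretely, the sketch below computes a batting average on three made-up players: one dismissed four times, one who scored runs without being dismissed, and one who neither scored nor was dismissed. The frame and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy batting records (values are made up).
toy = pd.DataFrame({
    "runs": [120, 30, 0],
    "outs": [4, 0, 0],
})

# Conventional batting average: zero outs gives inf (30/0) or NaN (0/0).
raw_average = toy["runs"] / toy["outs"]
print(raw_average)

# Alternative 1: add 1 to the denominator so every value is finite.
adjusted_average = toy["runs"] / (toy["outs"] + 1)
print(adjusted_average)
```

The adjustment keeps every observation usable, at the cost of understating the conventional measure: the first player's average falls from 30 to 24.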

In [8]:
IPLPlayer['outs']=np.where(IPLPlayer['batsman']==1, IPLPlayer['innings']-IPLPlayer['not_outs'], 0)
IPLPlayer['outs'].describe()

count    141.000000
mean       5.000000
std        4.605897
min        0.000000
25%        1.000000
50%        4.000000
75%        9.000000
max       16.000000
Name: outs, dtype: float64

Create the batting average, batting strike rate, bowling average, and bowling strike rate variables. Add 1 to the number of outs, balls faced, and wickets taken in calculating these variables.

In [9]:
IPLPlayer['batting_average']=IPLPlayer['runs']/(IPLPlayer['outs']+1)
IPLPlayer['batting_strike']=IPLPlayer['runs']/((IPLPlayer['balls_faced']+1))*100
IPLPlayer['bowling_average']=IPLPlayer['runs_conceded']/(IPLPlayer['wickets']+1)
IPLPlayer['bowling_strike']=IPLPlayer['balls_bowled']/(IPLPlayer['wickets']+1)

In [10]:
IPLPlayer['batting_average'].describe()

count    141.000000
mean      15.093066
std       13.761819
min        0.000000
25%        4.000000
50%       12.500000
75%       23.000000
max       65.000000
Name: batting_average, dtype: float64

In [11]:
IPLPlayer['batting_strike'].describe()

count    141.000000
mean     104.164456
std       53.873378
min        0.000000
25%       73.913043
50%      118.446602
75%      139.669421
max      250.000000
Name: batting_strike, dtype: float64

In [12]:
IPLPlayer['bowling_average'].describe()

count    141.000000
mean      17.493864
std       16.108488
min        0.000000
25%        0.000000
50%       20.052632
75%       27.466667
max       72.000000
Name: bowling_average, dtype: float64

In [13]:
IPLPlayer['bowling_strike'].describe()

count    141.000000
mean      11.478621
std       10.295591
min        0.000000
25%        0.000000
50%       12.500000
75%       19.600000
max       42.000000
Name: bowling_strike, dtype: float64

## Regression Analyses
### First, let's run a regression of salary on the type of player: batsman, bowler, and all-rounder.

In [14]:
reg_IPL1=sm.ols(formula = 'Salary ~ batsman+ bowler+ batsman*bowler', data= IPLPlayer, missing="drop").fit()
print(reg_IPL1.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     4.379
Date:                Mon, 29 Nov 2021   Prob (F-statistic):             0.0143
Time:                        16:42:59   Log-Likelihood:                -2069.2
No. Observations:                 141   AIC:                             4144.
Df Residuals:                     138   BIC:                             4153.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       2.859e+05   1.11e+05      2.
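In the formula syntax used by `statsmodels`, `batsman*bowler` already expands to `batsman + bowler + batsman:bowler`, so the formula above is equivalent to `Salary ~ batsman*bowler`. The sketch below checks this on simulated data; the data-generating coefficients are arbitrary assumptions chosen for illustration, not estimates from the IPL data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated players with random role indicators (not the IPL data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "batsman": rng.integers(0, 2, n),
    "bowler": rng.integers(0, 2, n),
})
toy["Salary"] = (
    100
    + 50 * toy["batsman"]
    + 30 * toy["bowler"]
    + 80 * toy["batsman"] * toy["bowler"]  # extra amount for all-rounders
    + rng.normal(0, 10, n)
)

# The '*' operator includes both main effects and their interaction.
short = sm.ols(formula="Salary ~ batsman*bowler", data=toy).fit()
long = sm.ols(formula="Salary ~ batsman + bowler + batsman:bowler", data=toy).fit()

print(short.params)
print(np.allclose(short.params.values, long.params.values))
```

The coefficient on `batsman:bowler` measures the additional salary associated with being both a batsman and a bowler, beyond the sum of the two separate effects. The summary above reports `Df Model: 2` although the formula has three terms, which would be consistent with one of the four batsman/bowler combinations being empty in this sample.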

### Next, we will focus on the performance of batsmen.
We will start by using the total number of runs, the number of not outs, and the number of balls faced to measure players’ performance.

In [15]:
reg_IPL2=sm.ols(formula = 'Salary ~ runs', data= IPLPlayer).fit()
print(reg_IPL2.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.267
Model:                            OLS   Adj. R-squared:                  0.261
Method:                 Least Squares   F-statistic:                     50.57
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           5.54e-11
Time:                        16:44:04   Log-Likelihood:                -2051.7
No. Observations:                 141   AIC:                             4107.
Df Residuals:                     139   BIC:                             4113.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.878e+05   5.39e+04      7.198      0.0

In [16]:
reg_IPL3=sm.ols(formula = 'Salary ~ runs+not_outs', data= IPLPlayer).fit()
print(reg_IPL3.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.318
Model:                            OLS   Adj. R-squared:                  0.308
Method:                 Least Squares   F-statistic:                     32.15
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           3.45e-12
Time:                        16:45:17   Log-Likelihood:                -2046.6
No. Observations:                 141   AIC:                             4099.
Df Residuals:                     138   BIC:                             4108.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    2.88e+05   6.07e+04      4.747      0.0

In [17]:
reg_IPL4=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced', data= IPLPlayer).fit()
print(reg_IPL4.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.321
Model:                            OLS   Adj. R-squared:                  0.306
Method:                 Least Squares   F-statistic:                     21.60
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           1.62e-11
Time:                        16:46:01   Log-Likelihood:                -2046.3
No. Observations:                 141   AIC:                             4101.
Df Residuals:                     137   BIC:                             4112.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    3.013e+05   6.29e+04      4.791      

In the next regressions, we will use the modified batting average and batting strike variables to measure player performance.

In [18]:
reg_IPL5=sm.ols(formula = 'Salary ~ batting_average', data= IPLPlayer).fit()
print(reg_IPL5.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.233
Model:                            OLS   Adj. R-squared:                  0.227
Method:                 Least Squares   F-statistic:                     42.13
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           1.40e-09
Time:                        16:47:15   Log-Likelihood:                -2054.9
No. Observations:                 141   AIC:                             4114.
Df Residuals:                     139   BIC:                             4120.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        3.072e+05   6.52e+04     

In [19]:
reg_IPL6=sm.ols(formula = 'Salary ~ batting_average+batting_strike', data= IPLPlayer).fit()
print(reg_IPL6.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.234
Model:                            OLS   Adj. R-squared:                  0.223
Method:                 Least Squares   F-statistic:                     21.12
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           9.96e-09
Time:                        16:56:53   Log-Likelihood:                -2054.7
No. Observations:                 141   AIC:                             4115.
Df Residuals:                     138   BIC:                             4124.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.668e+05   9.69e+04     

### We will now turn to bowlers' performance.

Again, we will first use number of runs conceded, number of balls bowled, and number of wickets taken to measure bowlers' performance.

In [20]:
reg_IPL7=sm.ols(formula = 'Salary ~ runs_conceded', data= IPLPlayer).fit()
print(reg_IPL7.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     3.200
Date:                Mon, 29 Nov 2021   Prob (F-statistic):             0.0758
Time:                        16:57:57   Log-Likelihood:                -2072.0
No. Observations:                 141   AIC:                             4148.
Df Residuals:                     139   BIC:                             4154.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      5.438e+05   6.53e+04      8.322

In [21]:
reg_IPL8=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled', data= IPLPlayer).fit()
print(reg_IPL8.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     3.026
Date:                Mon, 29 Nov 2021   Prob (F-statistic):             0.0518
Time:                        17:02:01   Log-Likelihood:                -2070.5
No. Observations:                 141   AIC:                             4147.
Df Residuals:                     138   BIC:                             4156.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      5.565e+05   6.54e+04      8.514

In [22]:
reg_IPL9=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL9.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     2.329
Date:                Mon, 29 Nov 2021   Prob (F-statistic):             0.0772
Time:                        20:47:06   Log-Likelihood:                -2070.1
No. Observations:                 141   AIC:                             4148.
Df Residuals:                     137   BIC:                             4160.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      5.543e+05   6.54e+04      8.472

In the next regression, we will use the modified bowling average and bowling strike variables to measure player performance.

In [23]:
reg_IPL10=sm.ols(formula = 'Salary ~ bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL10.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.054
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     3.912
Date:                Mon, 29 Nov 2021   Prob (F-statistic):             0.0223
Time:                        20:47:10   Log-Likelihood:                -2069.7
No. Observations:                 141   AIC:                             4145.
Df Residuals:                     138   BIC:                             4154.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        6.535e+05   7.33e+04     

### Lastly, we will incorporate performance measures of both batsmen and bowlers in the same regression.
We will first use the original variables, total number of runs, number of not outs, number of balls faced, number of runs conceded, number of balls bowled, and number of wickets in the regression.

In [24]:
reg_IPL11=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced+runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL11.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.408
Model:                            OLS   Adj. R-squared:                  0.382
Method:                 Least Squares   F-statistic:                     15.41
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           2.20e-13
Time:                        20:47:18   Log-Likelihood:                -2036.6
No. Observations:                 141   AIC:                             4087.
Df Residuals:                     134   BIC:                             4108.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      1.458e+05   7.32e+04      1.993

We will also use the modified batting average, batting strike, bowling average, and bowling strike variables to measure the player performance.

In [25]:
reg_IPL12=sm.ols(formula = 'Salary ~ batting_average+batting_strike+bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL12.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.308
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                     15.16
Date:                Mon, 29 Nov 2021   Prob (F-statistic):           2.85e-10
Time:                        20:47:43   Log-Likelihood:                -2047.6
No. Observations:                 141   AIC:                             4105.
Df Residuals:                     136   BIC:                             4120.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         1.37e+05   1.14e+05     
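With twelve models estimated, it may be easier to compare them side by side. The sketch below collects the R-squared and adjusted R-squared values copied from the summaries above into one table and ranks the models by adjusted R-squared. The table layout and column names are assumptions made for illustration.

```python
import pandas as pd

# R-squared and adjusted R-squared as reported in the regression summaries above.
fit = pd.DataFrame(
    [
        ("reg_IPL1", "batsman, bowler, interaction", 0.060, 0.046),
        ("reg_IPL2", "runs", 0.267, 0.261),
        ("reg_IPL3", "runs, not_outs", 0.318, 0.308),
        ("reg_IPL4", "runs, not_outs, balls_faced", 0.321, 0.306),
        ("reg_IPL5", "batting_average", 0.233, 0.227),
        ("reg_IPL6", "batting_average, batting_strike", 0.234, 0.223),
        ("reg_IPL7", "runs_conceded", 0.023, 0.015),
        ("reg_IPL8", "runs_conceded, balls_bowled", 0.042, 0.028),
        ("reg_IPL9", "runs_conceded, balls_bowled, wickets", 0.049, 0.028),
        ("reg_IPL10", "bowling_average, bowling_strike", 0.054, 0.040),
        ("reg_IPL11", "all six raw batting and bowling counts", 0.408, 0.382),
        ("reg_IPL12", "all four modified averages and strike rates", 0.308, 0.288),
    ],
    columns=["model", "regressors", "r_squared", "adj_r_squared"],
)

# Rank models by adjusted R-squared, which penalizes additional regressors.
ranked = fit.sort_values("adj_r_squared", ascending=False).reset_index(drop=True)
print(ranked)
```

In this table, the batting measures account for far more of the variation in salary than the bowling measures, and the model with all six raw counts (`reg_IPL11`) has the highest adjusted R-squared.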

## Self Test
- Run a regression of salary as a function of the interaction of batsman and runs and the interaction of bowler and wickets taken.
- Interpret your regression results.
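As a starting point for the formula syntax only, the sketch below fits a regression with two interaction terms on simulated data. The `:` operator includes an interaction without the corresponding main effects. The data and coefficients are made up, so the estimates say nothing about the IPL players.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated players (not the IPL data); all values are arbitrary.
rng = np.random.default_rng(1)
n = 150
toy = pd.DataFrame({
    "batsman": rng.integers(0, 2, n),
    "bowler": rng.integers(0, 2, n),
    "runs": rng.integers(0, 500, n),
    "wickets": rng.integers(0, 25, n),
})
toy["Salary"] = (
    1e5
    + 1000 * toy["batsman"] * toy["runs"]
    + 10000 * toy["bowler"] * toy["wickets"]
    + rng.normal(0, 5e4, n)
)

# 'a:b' adds only the product term a*b, with no separate a or b terms.
reg_toy = sm.ols(formula="Salary ~ batsman:runs + bowler:wickets", data=toy).fit()
print(reg_toy.params)
```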