# Introduction
Cricket is played between two teams of eleven players each. Each team is a blend of batsmen,
bowlers and allrounders. The batsmen’s role is to score as many runs as possible, while the
bowlers try to take wickets and restrict the other team’s scoring. Allrounders can both bat and
bowl, and contribute with runs as well as wickets. A player’s performance varies with factors
such as his current form, the team he is playing against and the ground at which the match is
being played, so it is important to select the players most likely to perform well in a given
match. The team management, the coach and the captain analyze each player’s characteristics,
abilities and past stats to select the best playing XI for a given match.
In other words, they try to predict the players’ performance for each match.
In this project, we predict players’ performance in One Day International (ODI) matches by
analyzing their characteristics and stats using supervised machine learning techniques. Batsmen
and bowlers are handled separately: we predict how many runs a batsman will score and how many
wickets a bowler will take in a particular match.


# Overview
This project has 4 main parts:
1. Batsman Data Engineering
2. Bowler Data Engineering
3. Batsman Prediction
4. Bowler Prediction

Each part is implemented in a separate notebook. The first two cover data engineering and the other two cover model implementation and prediction.

# Acknowledgements
* This project is based on a research paper by Kalpdrum Passi and Niravkumar Pandey
**Department of Mathematics and Computer Science**
**Laurentian University, Sudbury, Canada**

* The dataset was scraped from espncricinfo.com


# Approach
Player stats such as average and strike rate are not directly available for each game, so we
calculated these attributes from the innings-by-innings list using aggregate functions and
mathematical formulae. The formulae and values are taken from the research paper. These
attributes are generally used to measure a player’s performance. They are as follows:
### Batting Attributes
1. No. of Innings: 
The number of innings in which the batsman has batted till the day of the match.
This attribute signifies the experience of the batsman. The more innings the batsman has played,
the more experienced the player is.

2. Batting Average: 
Batting average commonly referred to as average is the average number of runs
scored per innings. This attribute indicates the run scoring capability of the player.
**Average = Runs Scored / Number of times dismissed**

3. Strike Rate (SR): 
Strike rate is the average number of runs scored per 100 balls faced. In limited
overs cricket, it is important to score runs at a fast pace. Because each team has a limited
number of overs, runs scored slowly can hurt the team. This attribute indicates how quickly
the batsman can score runs.
**Strike Rate = (Runs Scored / Balls Faced) * 100**

4. Centuries: 
Number of innings in which the batsman scored 100 or more runs. This attribute
indicates the capability of the player to play longer innings and score more runs.

5. Fifties: 
Number of innings in which the batsman scored at least 50 but fewer than 100
runs. This attribute indicates the capability of the player to play longer innings and score more runs.

6. Zeros: 
Number of innings in which the batsman was dismissed without scoring a single run. This
attribute shows how many times the batsman failed to score, so it is a negative
factor and lowers the batsman’s predicted performance.
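
The batting attributes above can be sketched on a small innings list. This is a minimal illustration, assuming a hypothetical table with `runs`, `balls` and `dismissed` columns; the data and column names are made up and are not the notebook's dataset.

```python
import pandas as pd

# Hypothetical innings-by-innings record for one batsman (illustrative data).
innings = pd.DataFrame({
    "runs":      [45, 0, 102, 67, 12, 0],
    "balls":     [50, 3, 110, 70, 20, 1],
    "dismissed": [True, True, False, True, True, True],
})

n_innings = len(innings)
total_runs = innings["runs"].sum()
dismissals = innings["dismissed"].sum()

# Average = runs scored / number of times dismissed
batting_average = total_runs / dismissals if dismissals else float("nan")
# Strike rate = (runs scored / balls faced) * 100
strike_rate = total_runs / innings["balls"].sum() * 100

centuries = (innings["runs"] >= 100).sum()
fifties = innings["runs"].between(50, 99).sum()   # inclusive on both ends
# A zero counts only when the batsman was dismissed without scoring.
zeros = ((innings["runs"] == 0) & innings["dismissed"]).sum()
```

Note that the average divides by dismissals rather than by innings, so a not-out innings adds runs without adding to the denominator.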


### Bowling Attributes
1. No. of Innings:
The number of innings in which the bowler bowled at least one ball. It represents
the bowling experience of a player. The more innings the player has played, the more experienced
the player is.

2. Overs:
The number of overs bowled by a bowler. This attribute also indicates the experience
of the bowler. The more overs the bowler has bowled, the more experienced the bowler
is.

3. Bowling Average:
Bowling average is the number of runs conceded by a bowler per
wicket taken. This attribute indicates the bowler’s ability to restrict the batsmen
from scoring runs while also taking wickets. Lower values of bowling average
indicate a more capable bowler.
**Bowling Average = Number of runs conceded / Number of wickets taken**

4. Bowling Strike Rate:
Bowling strike rate is the number of balls bowled per wicket taken.
This attribute indicates the wicket taking capability of the bowler. Lower values mean
that the bowler is capable of taking wickets quickly.
**Bowling Strike Rate = Number of balls bowled / Number of wickets taken**

5. Four/Five Wicket Haul:
Number of innings in which the bowler has taken four or more
wickets. This attribute indicates the capability of the bowler to take more wickets in an
innings. The higher the value, the more capable the player.
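
The bowling attributes can be sketched the same way. This is an illustration only: the table, its column names and the haul threshold of four wickets are assumptions, and overs are kept whole to avoid dealing with partial-over notation.

```python
import pandas as pd

# Hypothetical innings-by-innings record for one bowler (illustrative data).
innings = pd.DataFrame({
    "overs":   [10.0, 8.0, 10.0, 9.0],   # whole overs only, for simplicity
    "runs":    [42, 55, 30, 61],         # runs conceded
    "wickets": [2, 0, 4, 1],
})

n_innings = len(innings)
total_overs = innings["overs"].sum()
balls_bowled = int(total_overs * 6)
total_wickets = innings["wickets"].sum()

# Bowling average = runs conceded / wickets taken (lower is better)
bowling_average = innings["runs"].sum() / total_wickets
# Bowling strike rate = balls bowled / wickets taken (lower is better)
bowling_strike_rate = balls_bowled / total_wickets
# Haul count, assuming a haul means four or more wickets in an innings
hauls = (innings["wickets"] >= 4).sum()
```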


In [1]:
import numpy as np 
import pandas as pd 
import re
import os

In [2]:
dataset=pd.read_csv('Player_Innings_Stats.csv')

In [3]:
dataset.columns

Index(['Innings Player', 'Innings Runs Scored', 'Innings Runs Scored Num',
       'Innings Minutes Batted', 'Innings Batted Flag', 'Innings Not Out Flag',
       'Innings Balls Faced', 'Innings Boundary Fours',
       'Innings Boundary Sixes', 'Innings Batting Strike Rate',
       'Innings Number', 'Opposition', 'Ground', 'Innings Date', 'Country',
       '50's', '100's', 'Innings Runs Scored Buckets', 'Innings Overs Bowled',
       'Innings Bowled Flag', 'Innings Maidens Bowled',
       'Innings Runs Conceded', 'Innings Wickets Taken', '4 Wickets',
       '5 Wickets', '10 Wickets', 'Innings Wickets Taken Buckets',
       'Innings Economy Rate'],
      dtype='object')

In [4]:
batsman=dataset[dataset['Innings Overs Bowled'].isnull()]

In [5]:
#removing unnecessary columns
drop=['Innings Overs Bowled',
       'Innings Bowled Flag', 'Innings Maidens Bowled',
       'Innings Runs Conceded', 'Innings Wickets Taken', '4 Wickets',
       '5 Wickets', '10 Wickets', 'Innings Wickets Taken Buckets',
       'Innings Economy Rate','Innings Runs Scored Num', 'Innings Minutes Batted', 'Innings Batted Flag'
     ,'Innings Not Out Flag']
batsman=batsman.drop(drop, axis=1)

In [6]:
batsman['Innings_Runs_Score']=0
batsman=batsman[(batsman['Innings Runs Scored']!='DNB') & (batsman['Innings Runs Scored']!='TDNB')]

In [7]:
#writing regular expressions to extract runs scored
runs = r'([0-9]*)'
index_2=batsman.columns.get_loc('Innings Runs Scored')
index_runs=batsman.columns.get_loc('Innings_Runs_Score')
for row in range(0,len(batsman)):
    run=re.search(runs,batsman.iat[row,index_2]).group()
    if run!='':
        batsman.iat[row,index_runs]=int(run)
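
The row-by-row regex loop above can also be written with pandas string methods. The sketch below assumes the raw score strings look like `"102*"` (digits with an optional not-out asterisk), which is a guess based on the regex; the sample values are made up.

```python
import pandas as pd

# Hypothetical raw score strings; '*' is assumed to mark a not-out innings.
raw = pd.Series(["45", "102*", "0", "7*", "-"])

# Take the leading digits; rows without digits become NaN, then 0.
digits = raw.str.extract(r"^(\d+)", expand=False)
runs = pd.to_numeric(digits).fillna(0).astype(int)

# The same column also tells us whether the innings ended not out.
not_out = raw.str.endswith("*")
```

As in the loop, rows with no digits fall back to 0.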

## Manipulating features
We manipulate the features as part of the data cleaning and processing step.

In [8]:
#number of 4's

batsman['Innings_Boundary_Fours']=0

In [9]:
batsman['Innings Boundary Fours']= np.where(batsman['Innings Boundary Fours']==' ',
                                            0,batsman['Innings Boundary Fours'])

In [10]:
index_3=batsman.columns.get_loc('Innings Boundary Fours')
index_fours=batsman.columns.get_loc('Innings_Boundary_Fours')

In [11]:
for row in range(0,len(batsman)):
    fours= batsman.iat[row,index_3]
    if fours!='-':
        batsman.iat[row,index_fours]=int(fours)
    

In [12]:
#number of 6's

batsman['Innings_Boundary_Sixes']=0
batsman['Innings Boundary Sixes']=np.where(batsman['Innings Boundary Sixes']==' ',
                                           0,batsman['Innings Boundary Sixes'])
index_3=batsman.columns.get_loc('Innings Boundary Sixes')
index_sixes=batsman.columns.get_loc('Innings_Boundary_Sixes')

for row in range(0,len(batsman)):
    sixes= batsman.iat[row,index_3]
    if sixes!='-':
        batsman.iat[row,index_sixes]=int(sixes)

In [13]:
# current innings strike rate
batsman['Innings_Batting_Strike_Rate']=0.0
index_3=batsman.columns.get_loc('Innings Batting Strike Rate')
index_sr=batsman.columns.get_loc('Innings_Batting_Strike_Rate')

for row in range(0,len(batsman)):
    sr= batsman.iat[row,index_3]
    if sr!='-':
        batsman.iat[row,index_sr]=float(sr)

In [14]:
# Innings played

batsman['Innings_Number']=0
index_3=batsman.columns.get_loc('Innings Number')
index_in=batsman.columns.get_loc('Innings_Number')

for row in range(0,len(batsman)):
    inn= batsman.iat[row,index_3]
    if inn!='-':
        batsman.iat[row,index_in]=int(inn)

In [15]:
#Innings Balls Faced
batsman['Innings_Balls_Faced']=0
index_3=batsman.columns.get_loc('Innings Balls Faced')
index_in=batsman.columns.get_loc('Innings_Balls_Faced')

for row in range(0,len(batsman)):
    inn= batsman.iat[row,index_3]
    if inn!='-':
        batsman.iat[row,index_in]=int(inn)
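
The cells above all follow the same pattern: create a zero-filled column, then copy over every value that is not the `'-'` placeholder. A vectorized sketch of that pattern is shown below; `clean_numeric` is a hypothetical helper and the sample frame is made up.

```python
import pandas as pd

# Hypothetical raw columns with '-' as the missing-value placeholder.
df = pd.DataFrame({
    "Innings Boundary Fours": ["4", "-", "12"],
    "Innings Batting Strike Rate": ["88.50", "-", "120.00"],
})

def clean_numeric(series, dtype):
    """Turn unparseable entries such as '-' into 0 and cast to dtype."""
    return pd.to_numeric(series, errors="coerce").fillna(0).astype(dtype)

df["Innings_Boundary_Fours"] = clean_numeric(df["Innings Boundary Fours"], int)
df["Innings_Batting_Strike_Rate"] = clean_numeric(df["Innings Batting Strike Rate"], float)
```

One helper keeps the placeholder handling in a single place instead of repeating the loop for each column.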

In [16]:
#Extracting names of Opposition teams

index_3=batsman.columns.get_loc('Opposition')
opp = r'[^v][A-Z]+[a-z]*[" "]*[A-Z]*[a-z]*'
for row in range(0,len(batsman)):
    opps=re.search(opp,batsman.iat[row,index_3]).group()
    batsman.iat[row,index_3]=opps
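
Assuming the raw opposition values look like `"v Australia"` (which the `[^v]` in the pattern suggests but the notebook does not show), the team name can be obtained by stripping the prefix. This is a sketch on made-up values.

```python
import pandas as pd

# Hypothetical raw values, assumed to carry a leading "v " prefix.
opposition = pd.Series(["v Australia", "v New Zealand", "v West Indies"])

# Remove the prefix and any stray surrounding whitespace.
cleaned = opposition.str.replace(r"^v\s+", "", regex=True).str.strip()
```

Stripping whitespace matters here because team names are used as group keys later in the notebook.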

In [17]:
#Extracting year of match from date feature

batsman['Year']=0
years=r'([0-9]{4})'
index_3=batsman.columns.get_loc('Innings Date')
index_year=batsman.columns.get_loc('Year')

for row in range(0,len(batsman)):
    year=re.search(years,batsman.iat[row,index_3]).group()
    batsman.iat[row,index_year]=int(year)

In [18]:
#Extracting month in which match was played from date feature

batsman['Month']=0
batsman['Innings Date']=pd.to_datetime(batsman['Innings Date'])
index_month=batsman.columns.get_loc('Month')

for row in range(0,len(batsman)):
    batsman.iat[row,index_month]=int(batsman.iat[row, index_3].month)

In [19]:
#Extracting day name from date feature

batsman['Day']=''
index_day=batsman.columns.get_loc('Day')

import calendar
for row in range(0,len(batsman)):
    batsman.iat[row,index_day]=calendar.day_name[batsman.iat[row, index_3].weekday()]
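
Once the date column is a datetime, the year, month and day name can be read through the `.dt` accessor without a loop. A small sketch on illustrative dates:

```python
import pandas as pd

# Illustrative match dates (not taken from the dataset).
dates = pd.to_datetime(pd.Series(["2018-01-14", "2018-06-16", "2017-02-21"]))

features = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day_name(),   # e.g. "Sunday"
})
```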

In [20]:
#number of 50s and 100s scored

batsman['50s']=0
batsman['100s']=0

fifty=batsman.columns.get_loc('50s')
hundred=batsman.columns.get_loc('100s')

index_fifty=batsman.columns.get_loc("50's")
index_hundred=batsman.columns.get_loc("100's")

for row in range(0,len(batsman)):
    fifties= batsman.iat[row,index_fifty]
    hundreds=batsman.iat[row,index_hundred]
    if fifties!='-':
        batsman.iat[row,fifty]=int(fifties)
    if hundreds!='-':
        batsman.iat[row,hundred]=int(hundreds)

In [21]:
#Number of zeros is needed for further feature engineering.
#How this feature is built:
# whenever a row has 0 runs, zeros is set to that row's Innings_Number value;
# the value is then carried forward to the following rows until the next 0-run row.

batsman['0s']=0
index_0=batsman.columns.get_loc('0s')
index_runs=batsman.columns.get_loc('Innings_Runs_Score')
index_inn=batsman.columns.get_loc('Innings_Number')
zeros=0

for row in range(len(batsman)):
    if batsman.iat[row,index_runs]==0:
        zeros=0+batsman.iat[row,index_inn]
    batsman.iat[row,index_0]=zeros
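
For comparison, a running count of zero-run innings per player can be sketched with a grouped cumulative sum. This is an alternative under stated assumptions, not the computation the cell above performs: the frame and column names are hypothetical, rows are assumed to be in chronological order, and not-out zeros are not distinguished.

```python
import pandas as pd

# Hypothetical innings in chronological order for two players.
df = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "runs":   [0, 30, 0, 12, 0],
})

# Flag zero-run innings, then accumulate the flags within each player.
df["is_zero"] = (df["runs"] == 0).astype(int)
df["zeros_so_far"] = df.groupby("player")["is_zero"].cumsum()
```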

In [22]:
drop=['Innings Runs Scored', 'Innings Balls Faced','Innings Boundary Fours',
       'Innings Boundary Sixes', 'Innings Batting Strike Rate',
       'Innings Number','Innings Date', "50's", "100's"]
batsman=batsman.drop(drop, axis=1)

In [23]:
#creating batting average

batsman['Batting_Average']=0.0
index_ba=batsman.columns.get_loc("Batting_Average")
index_in=batsman.columns.get_loc("Innings_Number")
index_inruns=batsman.columns.get_loc("Innings_Runs_Score")

for row in range(len(batsman)):
    inumber=batsman.iat[row,index_in]
    inruns=batsman.iat[row,index_inruns]
    batsman.iat[row,index_ba]=inruns/inumber

# Data Augmentation
Here, we use another dataset to extract the batting style feature and enhance our current dataset.

In [24]:
dataset=pd.read_csv('/kaggle/input/project/personal_male.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/project/personal_male.csv'

In [None]:
#function to extract Initials of players' name.
#We are doing this in order to match it with the names column of our original dataset.

def name(s): 
  
    # split the string into a list  
    l = s.split() 
    new = "" 
  
    # traverse in the list  
    for i in range(len(l)-1): 
        s = l[i] 
          
        # adds the capital first character  
        new += (s[0].upper()) 
          
    # l[-1] gives last item of list l. We 
    # use title to print first character in 
    # capital. 
    new=new+" "+l[-1].title() 
      
    return new  
      
# Driver code             
index_name=dataset.columns.get_loc("fullName")
dataset['New_name']=""
index_new=dataset.columns.get_loc("New_name")

for row in range(len(dataset)):
    cname=name(dataset.iat[row,index_name])
    dataset.iat[row,index_new]=cname

In [None]:
batsman['Name']=batsman['Innings Player']
dataset['Name']=dataset['New_name']

In [None]:
drop=['name', 'fullName', 'dob', 'country', 'birthPlace', 'nationalTeam',
       'teams','bowlingStyle', 'New_name']
dataset.drop(drop, axis=1,inplace=True)

In [None]:
#merging with original dataset
batsman=pd.merge(batsman,dataset,on='Name', how='inner')
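
An inner merge silently drops players whose abbreviated names do not match. One way to see what would be lost is a left merge with `indicator=True`. The sketch below uses made-up names, and the `battingStyle` column name is an assumption.

```python
import pandas as pd

# Hypothetical frames; names and the battingStyle column are illustrative.
batsman = pd.DataFrame({"Name": ["A Batter", "B Hitter", "C Unknown"],
                        "runs": [180, 120, 5]})
personal = pd.DataFrame({"Name": ["A Batter", "B Hitter"],
                         "battingStyle": ["Right-hand bat", "Left-hand bat"]})

# A left merge keeps every batsman row; _merge records where each row came from.
check = pd.merge(batsman, personal, on="Name", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Name"].tolist()
```

Inspecting `unmatched` before switching to `how='inner'` shows how many players the name-matching step fails to link.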

# Attribute Derivation
Here, we are deriving new attributes to further enhance the dataset

## Consistency
This attribute describes how experienced the player is and how consistent he has been throughout
his career. All the traditional attributes used in this formula are calculated over the entire career of
the player. 

**Consistency = 0.4262 x average + 0.2566 x no. of innings + 0.1510 x SR + 0.0787 x Centuries + 0.0556 x Fifties – 0.0328 x Zeros**
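
As a worked example of the formula, the weighted sum can be written as a small function. The weights are the ones stated above; the input values are hypothetical, and `weighted_score` is an illustrative helper, not a function from the notebook.

```python
# Weights from the formula above; Zeros enters with a negative sign.
WEIGHTS = {
    "average": 0.4262,
    "innings": 0.2566,
    "strike_rate": 0.1510,
    "centuries": 0.0787,
    "fifties": 0.0556,
    "zeros": -0.0328,
}

def weighted_score(stats, weights=WEIGHTS):
    """Return the weighted sum of the attributes named in weights."""
    return sum(weights[key] * stats[key] for key in weights)

# Hypothetical attribute values for one player.
stats = {"average": 4, "innings": 3, "strike_rate": 4,
         "centuries": 2, "fifties": 3, "zeros": 1}
score = weighted_score(stats)   # 1.7048 + 0.7698 + 0.6040 + 0.1574 + 0.1668 - 0.0328
```

With these inputs the terms add up to 3.37.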


In [36]:
def attribute(df,col_name):
    df['Average']=0.0
    index_ba=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=inruns/inumber

    df['Strike_rate']=0.0
    index_ba=df.columns.get_loc("Strike_rate")
    index_in=df.columns.get_loc("Innings_Balls_Faced")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]= (inruns/inumber)*100  
     
    index_new=df.columns.get_loc(col_name)
    index_sr=df.columns.get_loc("Strike_rate")
    index_av=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_100=df.columns.get_loc("100s")
    index_50=df.columns.get_loc('50s')
    index_0=df.columns.get_loc('0s')

    for row in range(len(df)):
        f=0.4262*(df.iat[row,index_av])
        f=f+0.2566*(df.iat[row,index_in])
        f+=0.1510*(df.iat[row,index_sr])
        f+=0.0787*(df.iat[row,index_100])
        f+=0.0556*(df.iat[row,index_50])
        f=f-(0.0328*(df.iat[row,index_0]))
        df.iat[row,index_new]=f
    
    return(df)
    
    
    
g=batsman.groupby('Innings Player')
df=g.sum()
df['consistency']=0.0
df=attribute(df,'consistency')
df['Average_Career']=df['Average']
df['Strike_rate_Career']=df['Strike_rate']
drop=['Innings_Runs_Score','Innings_Boundary_Fours',
       'Innings_Boundary_Sixes', 'Innings_Batting_Strike_Rate',
       'Innings_Number', 'Innings_Balls_Faced', 'Year', 'Month', '50s', '100s',
       '0s', 'Batting_Average','Average','Strike_rate']
df.drop(drop,axis=1,inplace=True)
batsman=pd.merge(batsman,df,on='Innings Player', how='inner')



KeyError: "['Innings_Boundary_Fours' 'Innings_Boundary_Sixes'] not found in axis"

## Form
Form describes a player’s performance over the last year. All the traditional attributes used in
this formula are calculated over the matches played by the player in the 12 months before the day of
the match.

**Form = 0.4262 x average + 0.2566 x no. of innings + 0.1510 x SR + 0.0787 x Centuries +
0.0556 x Fifties – 0.0328 x Zeros**
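
The description refers to a 12-month window ending at the match day, whereas the notebook's cell groups by calendar `Year`. A sketch of the windowed selection is shown below; `last_12_months` is a hypothetical helper and the data is made up.

```python
import pandas as pd

# Hypothetical innings for one player.
innings = pd.DataFrame({
    "date": pd.to_datetime(["2016-11-01", "2017-03-10", "2017-09-05", "2018-01-14"]),
    "runs": [20, 55, 101, 180],
})

def last_12_months(df, match_day):
    """Innings before match_day and no more than 12 months earlier."""
    match_day = pd.Timestamp(match_day)
    start = match_day - pd.DateOffset(years=1)
    return df[(df["date"] >= start) & (df["date"] < match_day)]

recent = last_12_months(innings, "2018-01-14")
```

Excluding the match day itself keeps the innings being predicted out of its own form feature.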

In [26]:
def attribute(df,col_name):
    df['Average']=0.0
    index_ba=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=inruns/inumber

    df['Strike_rate']=0.0
    index_ba=df.columns.get_loc("Strike_rate")
    index_in=df.columns.get_loc("Innings_Balls_Faced")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=(inruns/inumber)*100  
     
    index_new=df.columns.get_loc(col_name)
    index_sr=df.columns.get_loc("Strike_rate")
    index_av=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_100=df.columns.get_loc("100s")
    index_50=df.columns.get_loc('50s')
    index_0=df.columns.get_loc('0s')

    for row in range(len(df)):
        f=0.4262*(df.iat[row,index_av])
        f=f+0.2566*(df.iat[row,index_in])
        f+=0.1510*(df.iat[row,index_sr])
        f+=0.0787*(df.iat[row,index_100])
        f+=0.0556*(df.iat[row,index_50])
        f=f-(0.0328*(df.iat[row,index_0]))
        df.iat[row,index_new]=f
    
    return(df)
    


g=batsman.groupby(['Innings Player','Year'])
df=g.sum()
df['form']=0.0
df=attribute(df,'form')
df['Average_Yearly']=df['Average']
df['Strike_rate_Yearly']=df['Strike_rate']
drop=['Innings_Runs_Score','Innings_Boundary_Fours',
       'Innings_Boundary_Sixes', 'Innings_Batting_Strike_Rate',
       'Innings_Number', 'Innings_Balls_Faced', 'Month', '50s', '100s',
       '0s', 'Batting_Average','consistency','Average_Career','Strike_rate_Career','Average','Strike_rate']
df.drop(drop,axis=1,inplace=True)
on=['Innings Player','Year']
batsman=pd.merge(batsman,df,on=on, how='inner')



## Opposition
Opposition describes a player’s performance against a particular team. All the traditional attributes
used in this formula are calculated over all the matches played by the player against the opposition
team in his entire career till the day of the match. 

**Opposition = 0.4262 x average + 0.2566 x no. of innings + 0.1510 x SR + 0.0787 x Centuries +
0.0556 x Fifties – 0.0328 x Zeros** 

In [27]:
def attribute(df,col_name):
    df['Average']=0.0
    index_ba=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=inruns/inumber

    df['Strike_rate']=0.0
    index_ba=df.columns.get_loc("Strike_rate")
    index_in=df.columns.get_loc("Innings_Balls_Faced")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=(inruns/inumber)*100  
     
    index_new=df.columns.get_loc(col_name)
    index_sr=df.columns.get_loc("Strike_rate")
    index_av=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_100=df.columns.get_loc("100s")
    index_50=df.columns.get_loc('50s')
    index_0=df.columns.get_loc('0s')

    for row in range(len(df)):
        f=0.4262*(df.iat[row,index_av])
        f=f+0.2566*(df.iat[row,index_in])
        f+=0.1510*(df.iat[row,index_sr])
        f+=0.0787*(df.iat[row,index_100])
        f+=0.0556*(df.iat[row,index_50])
        f=f-(0.0328*(df.iat[row,index_0]))
        df.iat[row,index_new]=f
    
    return(df)
    


g=batsman.groupby(['Innings Player','Opposition'])
df=g.sum()
df['opposition']=0.0
df=attribute(df,'opposition')
df['Average_Opposition']=df['Average']
df['Strike_rate_Opposition']=df['Strike_rate']
drop=['Innings_Runs_Score','Innings_Boundary_Fours',
       'Innings_Boundary_Sixes', 'Innings_Batting_Strike_Rate',
       'Innings_Number', 'Innings_Balls_Faced', 'Year', 'Month', '50s', '100s',
       '0s', 'Batting_Average','consistency', 'form','Average_Career','Strike_rate_Career','Average_Yearly',
     'Strike_rate_Yearly','Average','Strike_rate']
df.drop(drop,axis=1,inplace=True)
on=['Innings Player','Opposition']
batsman=pd.merge(batsman,df,on=on, how='inner')



## Venue
Venue describes a player’s performance at a particular venue. All the traditional attributes used in
this formula are calculated over all the matches played by the player at the venue in his entire
career till the day of the match. 

**Venue = 0.4262 x average + 0.2566 x no. of innings + 0.1510 x SR + 0.0787 x Centuries +
0.0556 x Fifties + 0.0328 x HS** 

Here HS is the player’s highest score at the venue.

In [28]:
def attribute(df,col_name):
    df['Average']=0.0
    index_ba=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=inruns/inumber

    df['Strike_rate']=0.0
    index_ba=df.columns.get_loc("Strike_rate")
    index_in=df.columns.get_loc("Innings_Balls_Faced")
    index_inruns=df.columns.get_loc("Innings_Runs_Score")
    for row in range(len(df)):
        inumber=df.iat[row,index_in]
        inruns=df.iat[row,index_inruns]
        df.iat[row,index_ba]=(inruns/inumber)*100  
     
    index_new=df.columns.get_loc(col_name)
    index_sr=df.columns.get_loc("Strike_rate")
    index_av=df.columns.get_loc("Average")
    index_in=df.columns.get_loc("Innings_Number")
    index_100=df.columns.get_loc("100s")
    index_50=df.columns.get_loc('50s')
    index_HS=df.columns.get_loc('Innings_Runs_Score')

    for row in range(len(df)):
        f=0.4262*(df.iat[row,index_av])
        f=f+0.2566*(df.iat[row,index_in])
        f+=0.1510*(df.iat[row,index_sr])
        f+=0.0787*(df.iat[row,index_100])
        f+=0.0556*(df.iat[row,index_50])
        f=f+(0.0328*(df.iat[row,index_HS]))
        df.iat[row,index_new]=f
    
    return(df)
    


g=batsman.groupby(['Innings Player','Ground'])
df=g.max()
df['venue']=0.0
df=attribute(df,'venue')
df['Average_venue']=df['Average']
df['Strike_rate_venue']=df['Strike_rate']
drop=['Innings_Runs_Score','Innings Runs Scored Buckets','Innings_Boundary_Fours',
       'Innings_Boundary_Sixes', 'Innings_Batting_Strike_Rate',
       'Innings_Number', 'Innings_Balls_Faced', 'Year', 'Month', '50s', '100s',
       '0s', 'Batting_Average','consistency', 'form','Average_Career','Strike_rate_Career','Average_Yearly',
     'Strike_rate_Yearly','Average','Strike_rate','opposition','Average_Opposition','Strike_rate_Opposition']
df.drop(drop,axis=1,inplace=True)
on=['Innings Player','Ground']
batsman=pd.merge(batsman,df,on=on, how='inner')



## Obtaining columns for total historical data

In [29]:
g=batsman.groupby(['Innings Player'])
df=g.sum()
drop=['Innings_Runs_Score','Innings_Boundary_Fours',
       'Innings_Boundary_Sixes', 'Innings_Batting_Strike_Rate',
       'Year', 'Month', 'Batting_Average','consistency', 'form','Average_Career','Strike_rate_Career','Average_Yearly',
     'Strike_rate_Yearly','opposition','Average_Opposition','venue','Strike_rate_Opposition','Average_venue','Strike_rate_venue']
df.drop(drop,axis=1,inplace=True)
on=['Innings Player']
batsman=pd.merge(batsman,df,on=on, how='inner')

In [31]:
batsman['50s']=batsman['50s_y']
batsman['100s']=batsman['100s_y']
batsman['0s']=batsman['0s_y']
batsman['Innings_Balls_Faced']=batsman['Innings_Balls_Faced_y']
batsman['Innings_Number']=batsman['Innings_Number_y']

drop=['Opposition_y','Innings_Boundary_Fours','Innings_Boundary_Sixes','Day_y','Country_y','50s_x',
      '100s_x','0s_x','Innings_Balls_Faced_x','Innings_Number_x','50s_y',
      '100s_y','0s_y','Innings_Balls_Faced_y','Innings_Number_y']
batsman.drop(drop,axis=1,inplace=True)

In [32]:
batsman

Unnamed: 0,Innings Player,Opposition_x,Ground,Country_x,Innings Runs Scored Buckets,Innings_Runs_Score,Innings_Batting_Strike_Rate,Year,Month,Day_x,...,Average_Opposition,Strike_rate_Opposition,venue,Average_venue,Strike_rate_venue,50s,100s,0s,Innings_Balls_Faced,Innings_Number
0,JJ Roy,Australia,Melbourne,England,150-199,180,119.20,2018,1,Sunday,...,31.259259,115.775034,62.853900,90.0,119.205298,54,27,182,9603,372
1,JJ Roy,Australia,Melbourne,England,150-199,180,119.20,2018,1,Sunday,...,31.259259,115.775034,62.853900,90.0,119.205298,54,27,182,9603,372
2,JJ Roy,Australia,Melbourne,England,150-199,180,119.20,2018,1,Sunday,...,31.259259,115.775034,62.853900,90.0,119.205298,54,27,182,9603,372
3,JJ Roy,Australia,Cardiff,England,100-149,120,111.11,2018,6,Saturday,...,31.259259,115.775034,89.711288,153.0,126.446281,54,27,182,9603,372
4,JJ Roy,Australia,Cardiff,England,100-149,120,111.11,2018,6,Saturday,...,31.259259,115.775034,89.711288,153.0,126.446281,54,27,182,9603,372
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129815,Fareed Ahmad,Zimbabwe,Harare,Afghanistan,0-49,0,0.00,2017,2,Tuesday,...,0.333333,16.666667,3.779100,0.5,20.000000,0,0,12,18,9
129816,Fareed Ahmad,Zimbabwe,Harare,Afghanistan,0-49,1,100.00,2017,2,Friday,...,0.333333,16.666667,3.779100,0.5,20.000000,0,0,12,18,9
129817,Fareed Ahmad,Zimbabwe,Harare,Afghanistan,0-49,0,0.00,2017,2,Tuesday,...,0.333333,16.666667,3.779100,0.5,20.000000,0,0,12,18,9
129818,Fareed Ahmad,Zimbabwe,Harare,Afghanistan,0-49,1,100.00,2017,2,Friday,...,0.333333,16.666667,3.779100,0.5,20.000000,0,0,12,18,9


# Rating The Attributes
Here, we created bins/ratings for various attributes. We used these ratings instead of actual values of the measures, in the formulae of derived attributes. The measures are rated as follows:

### For Consistency:
* 1 – 49: 1
* 50 – 99: 2
* 100 – 124: 3
* 125 – 149: 4
* Consistency>=150: 5 

In [34]:
dataset



In [38]:
batsman.isnull().sum()

Innings Player                   0
Opposition_x                     0
Ground                           0
Country_x                        0
Innings Runs Scored Buckets      0
Innings_Runs_Score               0
Innings_Batting_Strike_Rate      0
Year                             0
Month                            0
Day_x                            0
Batting_Average                  0
consistency                     12
Average_Career                   0
Strike_rate_Career              12
form                           131
Average_Yearly                   0
Strike_rate_Yearly             131
opposition                     336
Average_Opposition               0
Strike_rate_Opposition         336
venue                          871
Average_venue                    0
Strike_rate_venue              871
50s                              0
100s                             0
0s                               0
Innings_Balls_Faced              0
Innings_Number                   0
dtype: int64
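
The missing values sit in the strike-rate columns and the scores derived from them, which is what a 0 runs / 0 balls division would produce. A guarded division avoids creating them in the first place. This is a sketch on made-up data; filling with 0 is one choice, while the notebook drops the affected rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregates; the second row has no balls faced.
df = pd.DataFrame({"runs": [30, 0, 5], "balls": [40, 0, 10]})

# Replace a zero denominator with NaN so the division never hits 0/0,
# then decide explicitly what an undefined strike rate should be.
balls = df["balls"].replace(0, np.nan)
df["strike_rate"] = (df["runs"] / balls * 100).fillna(0.0)
```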

In [46]:
batsman = batsman.dropna()
#reset the index of the cleaned frame itself (not the intermediate df)
batsman = batsman.reset_index(drop=True)
batsman.isnull().sum()

Innings Player                 0
Opposition_x                   0
Ground                         0
Country_x                      0
Innings Runs Scored Buckets    0
Innings_Runs_Score             0
Innings_Batting_Strike_Rate    0
Year                           0
Month                          0
Day_x                          0
Batting_Average                0
consistency                    0
Average_Career                 0
Strike_rate_Career             0
form                           0
Average_Yearly                 0
Strike_rate_Yearly             0
opposition                     0
Average_Opposition             0
Strike_rate_Opposition         0
venue                          0
Average_venue                  0
Strike_rate_venue              0
50s                            0
100s                           0
0s                             0
Innings_Balls_Faced            0
Innings_Number                 0
dtype: int64

In [49]:
#rate consistency on a 1-5 scale; bins are applied in ascending order
batsman.loc[batsman['consistency']<=49, 'consistency']=1
batsman.loc[(batsman['consistency']>49) & (batsman['consistency']<=99), 'consistency']=2
batsman.loc[(batsman['consistency']>99) & (batsman['consistency']<=124), 'consistency']=3
batsman.loc[(batsman['consistency']>124) & (batsman['consistency']<=149), 'consistency']=4
batsman.loc[batsman['consistency']>149, 'consistency']=5
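
The same rating can be expressed in one step with `pd.cut`, which avoids overwriting a column while it is still being compared against. The bin edges follow the thresholds used in the cell above; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Illustrative consistency scores.
consistency = pd.Series([12.0, 49.0, 50.0, 110.5, 149.0, 300.0])

# Right-closed bins: (-inf, 49], (49, 99], (99, 124], (124, 149], (149, inf)
rating = pd.cut(
    consistency,
    bins=[-np.inf, 49, 99, 124, 149, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)
```

Because the ratings go into a new Series, the order in which the bins are listed has no effect on the result.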

### For Form:
* 1 – 4: 1
* 5 – 9: 2
* 10 – 11: 3
* 12 – 14: 4
* Form>=15: 5 


In [52]:
#rate form on a 1-5 scale; bins are applied in ascending order
batsman.loc[batsman['form']<=4, 'form']=1
batsman.loc[(batsman['form']>4) & (batsman['form']<=9), 'form']=2
batsman.loc[(batsman['form']>9) & (batsman['form']<=11), 'form']=3
batsman.loc[(batsman['form']>11) & (batsman['form']<=14), 'form']=4
batsman.loc[batsman['form']>14, 'form']=5

### For Opposition:
* 1 – 2: 1
* 3 – 4: 2
* 5 – 6: 3
* 7 – 9: 4
* Opposition>=10: 5 


In [53]:
#rate opposition on a 1-5 scale; bins are applied in ascending order
batsman.loc[batsman['opposition']<=2, 'opposition']=1
batsman.loc[(batsman['opposition']>2) & (batsman['opposition']<=4), 'opposition']=2
batsman.loc[(batsman['opposition']>4) & (batsman['opposition']<=6), 'opposition']=3
batsman.loc[(batsman['opposition']>6) & (batsman['opposition']<=9), 'opposition']=4
batsman.loc[batsman['opposition']>9, 'opposition']=5

### For Venue:
* 1: 1
* 2: 2
* 3: 3
* 4: 4
* Venue>=5: 5 


In [54]:
#rate venue on a 1-5 scale; bins are applied in ascending order
batsman.loc[batsman['venue']<=1, 'venue']=1
batsman.loc[(batsman['venue']>1) & (batsman['venue']<=2), 'venue']=2
batsman.loc[(batsman['venue']>2) & (batsman['venue']<=3), 'venue']=3
batsman.loc[(batsman['venue']>3) & (batsman['venue']<=4), 'venue']=4
#use >4 rather than >=5 so that values between 4 and 5 are not left unrated
batsman.loc[batsman['venue']>4, 'venue']=5

### Batting Average (for all derived attributes):
* 0.0 - 9.99: 1
* 10.00 - 19.99: 2
* 20.00 - 29.99: 3
* 30.00 - 39.99: 4 
* Batting Average>=40: 5

In [55]:
def average(df,col_name):
    #rate a batting-average column in place on a 1-5 scale
    #half-open bins avoid gaps such as 9.99 < x < 10; bins are applied in ascending order
    df.loc[df[col_name]<10, col_name]=1
    df.loc[(df[col_name]>=10) & (df[col_name]<20), col_name]=2
    df.loc[(df[col_name]>=20) & (df[col_name]<30), col_name]=3
    df.loc[(df[col_name]>=30) & (df[col_name]<40), col_name]=4
    df.loc[df[col_name]>=40, col_name]=5
    

average(batsman,'Batting_Average')
average(batsman,'Average_Career')
average(batsman,'Average_Yearly')
average(batsman,'Average_Opposition')
average(batsman,'Average_venue')  
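
For left-closed bins such as the batting-average ratings, `np.digitize` gives the rating directly from the lower bin edges. A sketch on made-up averages, using the bin edges listed above:

```python
import numpy as np

# Illustrative batting averages, including values near the bin edges.
averages = np.array([0.0, 9.995, 10.0, 29.99, 30.0, 40.0, 55.3])

# digitize returns how many edges are <= each value (0..4); add 1 for a 1-5 rating.
ratings = np.digitize(averages, [10, 20, 30, 40]) + 1
```

A value such as 9.995 lands in rating 1 here, whereas bounds written as `<=9.99` and `>=10.00` would leave it unrated.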

### Batting Strike Rate (for all derived attributes):
* 0.0 - 49.99: 1
* 50.00 - 59.99: 2
* 60.00 - 79.99: 3
* 80.00 - 100.00: 4
* Strike Rate>100.00: 5 


In [56]:
def SR(df,col_name):
    #rate a strike-rate column in place on a 1-5 scale
    #half-open bins avoid gaps such as 49.99 < x < 50; bins are applied in ascending order
    df.loc[df[col_name]<50, col_name]=1
    df.loc[(df[col_name]>=50) & (df[col_name]<60), col_name]=2
    df.loc[(df[col_name]>=60) & (df[col_name]<80), col_name]=3
    df.loc[(df[col_name]>=80) & (df[col_name]<=100), col_name]=4
    df.loc[df[col_name]>100, col_name]=5
        
    
SR(batsman,'Innings_Batting_Strike_Rate')
SR(batsman,'Strike_rate_Career')
SR(batsman,'Strike_rate_Yearly')
SR(batsman,'Strike_rate_Opposition')
SR(batsman,'Strike_rate_venue')

### Target variable binning
Runs are predicted in five classes:
* 0 – 24: 1
* 25 – 49: 2
* 50 – 74: 3
* 75 – 99: 4
* Runs>=100: 5

In [57]:
#bin the target (runs scored) into five classes; bins are applied in ascending order
col='Innings_Runs_Score'
batsman.loc[batsman[col]<=24, col]=1
batsman.loc[(batsman[col]>24) & (batsman[col]<=49), col]=2
batsman.loc[(batsman[col]>=50) & (batsman[col]<=74), col]=3
batsman.loc[(batsman[col]>74) & (batsman[col]<=99), col]=4
batsman.loc[batsman[col]>=100, col]=5

## Saving final csv file

In [58]:
batsman.to_csv('cricket_batsman_information.csv', header=True, index=False)