# Cricket Prediction - IIT Madras IPL Hackathon April 2021

## Problem Statement
- Predict the score of an innings after the first 6 overs

## Dataset
- Dataset downloaded from [CricSheet](https://cricsheet.org/downloads/t20s_male_csv2.zip)

### Dataset - Analysis
- It's a zip file containing ball-by-ball data for all men's T20 matches.
- There are 999 files, each containing the 22 columns below.
   - match_id
   - season
   - start_date
   - venue
   - innings
   - ball
   - batting_team
   - bowling_team
   - striker
   - non_striker
   - bowler
   - runs_off_bat
   - extras
   - wides
   - noballs
   - byes
   - legbyes
   - penalty
   - wicket_type
   - player_dismissed
   - other_wicket_type
   - other_player_dismissed
   
- Since the goal is to predict the score after the first 6 overs, the dataset needs to be prepared as follows:
  - ✔ Unzip the file (t20s_male_csv2.zip)
  - ✔ Merge the data from all the files and load it into a dataframe
  - ✔ Keep only the first six overs of each innings
  - ✔ Add a column for the score per ball: score = runs_off_bat + extras
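
The preparation steps above can be sketched on a tiny hand-made sample. The rows below are illustrative values in the Cricsheet column layout, not real data, and the final per-innings aggregation is an assumption about how the prediction target would be built. The sketch filters with `ball < 6.0` so that any extra deliveries in the sixth over are kept.

```python
import pandas as pd

# Tiny hand-made sample in the Cricsheet column layout (values are illustrative).
balls = pd.DataFrame({
    "match_id":     [1, 1, 1, 1, 1, 1],
    "innings":      [1, 1, 1, 2, 2, 2],
    "ball":         [0.1, 5.6, 6.1, 0.1, 0.2, 7.3],
    "runs_off_bat": [4, 1, 6, 0, 2, 4],
    "extras":       [0, 1, 0, 1, 0, 0],
})

# Keep only deliveries from the first six overs (overs 0 to 5 in Cricsheet numbering).
powerplay = balls[balls["ball"] < 6.0].copy()

# Runs scored on each delivery.
powerplay["score"] = powerplay["runs_off_bat"] + powerplay["extras"]

# One row per innings: the total after six overs (assumed prediction target).
targets = (
    powerplay.groupby(["match_id", "innings"], as_index=False)["score"]
    .sum()
    .rename(columns={"score": "total_score"})
)
print(targets)
```
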

- For prediction, an input file is provided with 6 columns (venue, innings, batting_team, bowling_team, batsmen, bowlers)
  - The dataset has no batsmen or bowlers columns, so we need to add two columns holding the lists of batsmen and bowlers who appeared in the first 6 overs
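
One possible way to derive these two columns is to group the ball-by-ball rows by match and innings and collect the distinct player names in order of appearance. This is a minimal sketch on made-up player names; the helper names `unique_in_order` and `players_per_innings` are hypothetical, and storing the players as Python lists is an assumption about the input file's format.

```python
import pandas as pd

# Illustrative ball-by-ball rows (already limited to the first six overs).
balls = pd.DataFrame({
    "match_id":    [1, 1, 1, 1],
    "innings":     [1, 1, 1, 1],
    "striker":     ["A", "B", "A", "C"],
    "non_striker": ["B", "A", "B", "A"],
    "bowler":      ["X", "X", "Y", "Y"],
})

def unique_in_order(values):
    """Return distinct values, preserving first-appearance order."""
    return list(dict.fromkeys(values))

def players_per_innings(df):
    """Build one row per (match_id, innings) with batsmen and bowlers lists."""
    rows = []
    for (match_id, innings), grp in df.groupby(["match_id", "innings"], sort=False):
        # Both the striker and the non-striker have batted in the innings.
        batsmen = unique_in_order(grp[["striker", "non_striker"]].to_numpy().ravel())
        bowlers = unique_in_order(grp["bowler"])
        rows.append({
            "match_id": match_id,
            "innings": innings,
            "batsmen": batsmen,
            "bowlers": bowlers,
        })
    return pd.DataFrame(rows)

players = players_per_innings(balls)
print(players)
```
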
  
- Feature Engineering
  - Drop null values
  - One-hot encode the non-numerical columns such as venue, batting team, bowling team, batsmen, bowlers
  - Convert the date to a datetime object
  - Split the dataset based on time: train on matches before 2015, test on matches after 2015
  - Remove unwanted columns: match_id, season, venue, innings, player_dismissed, other_wicket_type, other_player_dismissed etc.
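
The feature engineering steps could look roughly like the sketch below. It works on a made-up per-innings table, and the exact cutoff date (`2015-01-01`, which places 2015 matches in the test set) is an assumption, since the notes only say "before" and "after" 2015.

```python
import pandas as pd

# Illustrative per-innings rows; the values are made up for the sketch.
innings_df = pd.DataFrame({
    "start_date":   ["2012-03-01", "2014-07-15", "2016-01-10", "2019-11-05", "2018-05-02"],
    "venue":        ["Eden Park", "Lord's", "Eden Park", "Wankhede Stadium", None],
    "batting_team": ["Australia", "England", "New Zealand", "India", "India"],
    "bowling_team": ["New Zealand", "India", "Australia", "England", "Australia"],
    "total_score":  [45, 38, 52, 49, 41],
})

# Drop rows with missing values in the feature columns.
innings_df = innings_df.dropna(subset=["venue", "batting_team", "bowling_team"])

# Convert the date strings to datetime objects.
innings_df["start_date"] = pd.to_datetime(innings_df["start_date"])

# Time-based split; the cutoff date is an assumption of this sketch.
cutoff = pd.Timestamp("2015-01-01")
is_train = innings_df["start_date"] < cutoff

# One-hot encode the categorical columns before splitting so that
# the train and test frames share the same set of columns.
encoded = pd.get_dummies(innings_df, columns=["venue", "batting_team", "bowling_team"])

train = encoded[is_train].drop(columns=["start_date"])
test = encoded[~is_train].drop(columns=["start_date"])
print(train.shape, test.shape)
```

Encoding before splitting keeps the column sets identical, at the cost of letting test-only categories appear as all-zero columns in the training frame.
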

- Training
  - As the objective is to predict a score (a numeric value), it's a regression problem in machine learning
  - A model could be trained with multiple algorithms such as
      - Linear regression
      - Ridge regression
      - Decision tree
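
As a rough illustration, the three algorithms listed above can be fitted with scikit-learn as follows. The feature matrix here is synthetic random data standing in for the one-hot encoded features, and the hyperparameters (`alpha=1.0`, `max_depth=4`) are arbitrary assumptions rather than tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one-hot encoded features and six-over scores.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 5)).astype(float)
true_weights = np.array([5.0, -3.0, 2.0, 0.0, 1.0])
y_train = 40 + X_train @ true_weights + rng.normal(0, 1, size=200)

# Candidate regressors; hyperparameters are arbitrary for this sketch.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# fit() returns the estimator itself, so the dict holds the fitted models.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}

for name, model in fitted.items():
    print(name, model.predict(X_train[:3]).round(1))
```
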

- Testing
  - Test the model on the split dataset 

- Score
  - Evaluate the R² score. A higher value (closer to 1) is preferred
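
To make the metric concrete, R² can be computed by hand as 1 - SS_res / SS_tot. The sketch below uses made-up scores; the function name `r2_score_manual` is hypothetical. A perfect model scores 1, and a model that always predicts the mean scores 0.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [40, 50, 60]    # made-up six-over scores
good = [42, 49, 59]      # predictions close to the actual scores
baseline = [50, 50, 50]  # always predicting the mean

print(round(r2_score_manual(actual, good), 2))      # 0.97
print(round(r2_score_manual(actual, baseline), 2))  # 0.0
```
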
  
- Prediction
  - Make prediction by providing data present in input file
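
One practical detail at prediction time is that the one-hot encoded input must have exactly the same columns as the training data. A minimal sketch, assuming the training column names were saved in a list (here the hypothetical `train_columns`), is to encode the input row and then reindex it:

```python
import pandas as pd

# Column layout the model was trained on (hypothetical names for this sketch).
train_columns = [
    "innings",
    "venue_Eden Park",
    "venue_Lord's",
    "batting_team_Australia",
    "batting_team_India",
]

# A single row as it might appear in the provided input file.
input_df = pd.DataFrame({
    "venue": ["Eden Park"],
    "innings": [1],
    "batting_team": ["India"],
})

# Encode the input, then align it to the training columns:
# categories missing from the input become 0, unseen ones are dropped.
encoded = pd.get_dummies(input_df, columns=["venue", "batting_team"])
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The aligned frame can then be passed to the `predict` method of a fitted model.
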


   



In [3]:
import zipfile
import pandas as pd

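# Open the Cricsheet archive (path is relative to the notebook)
ziptrain = zipfile.ZipFile('dataset/t20s_male_csv2.zip')

# Read every CSV file in the archive into its own dataframe
files_df = [pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() if f.endswith('.csv')]

# Stack all matches into a single dataframe
concatenated_df = pd.concat(files_df)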

In [4]:
concatenated_df.head()

Unnamed: 0,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,...,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed
0,211048,2004/05,2005-02-17,Eden Park,1,0.1,Australia,New Zealand,AC Gilchrist,MJ Clarke,...,1,1.0,,,,,,,,
1,211048,2004/05,2005-02-17,Eden Park,1,0.2,Australia,New Zealand,AC Gilchrist,MJ Clarke,...,1,,,,1.0,,,,,
2,211048,2004/05,2005-02-17,Eden Park,1,0.3,Australia,New Zealand,MJ Clarke,AC Gilchrist,...,0,,,,,,,,,
3,211048,2004/05,2005-02-17,Eden Park,1,0.4,Australia,New Zealand,MJ Clarke,AC Gilchrist,...,0,,,,,,,,,
4,211048,2004/05,2005-02-17,Eden Park,1,0.5,Australia,New Zealand,AC Gilchrist,MJ Clarke,...,0,,,,,,,,,


In [5]:
concatenated_df.tail()

Unnamed: 0,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,...,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed
250,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,18.5,Netherlands,Nepal,BFW de Leede,AJ Staal,...,0,,,,,,,,,
251,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,18.6,Netherlands,Nepal,BFW de Leede,AJ Staal,...,0,,,,,,,,,
252,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,19.1,Netherlands,Nepal,AJ Staal,BFW de Leede,...,0,,,,,,caught,AJ Staal,,
253,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,19.2,Netherlands,Nepal,BFW de Leede,PA van Meekeren,...,0,,,,,,,,,
254,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,19.3,Netherlands,Nepal,BFW de Leede,PA van Meekeren,...,0,,,,,,,,,


In [7]:
concatenated_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229334 entries, 0 to 254
Data columns (total 22 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   match_id                229334 non-null  int64  
 1   season                  229334 non-null  object 
 2   start_date              229334 non-null  object 
 3   venue                   229334 non-null  object 
 4   innings                 229334 non-null  int64  
 5   ball                    229334 non-null  float64
 6   batting_team            229334 non-null  object 
 7   bowling_team            229334 non-null  object 
 8   striker                 229334 non-null  object 
 9   non_striker             229334 non-null  object 
 10  bowler                  229334 non-null  object 
 11  runs_off_bat            229334 non-null  int64  
 12  extras                  229334 non-null  int64  
 13  wides                   7105 non-null    float64
 14  noballs                

In [8]:
concatenated_df.isna()

Unnamed: 0,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,...,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed
0,False,False,False,False,False,False,False,False,False,False,...,False,False,True,True,True,True,True,True,True,True
1,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,False,True,True,True,True,True
2,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,True,True
3,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,True,True
4,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
250,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,True,True
251,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,True,True
252,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,False,False,True,True
253,False,False,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,True,True


In [9]:
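# Filter data to the first 6 overs (balls 0.1 to 5.6).
# Note: extra deliveries in the sixth over (numbered 5.7 and above after wides
# or no-balls) are excluded by this filter; `ball < 6.0` would keep them.
# .copy() avoids a SettingWithCopyWarning when columns are added later.

concatenated_df = concatenated_df[concatenated_df['ball'] <= 5.6].copy()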

print(concatenated_df.head())

   match_id   season  start_date      venue  innings  ball batting_team  \
0    211048  2004/05  2005-02-17  Eden Park        1   0.1    Australia   
1    211048  2004/05  2005-02-17  Eden Park        1   0.2    Australia   
2    211048  2004/05  2005-02-17  Eden Park        1   0.3    Australia   
3    211048  2004/05  2005-02-17  Eden Park        1   0.4    Australia   
4    211048  2004/05  2005-02-17  Eden Park        1   0.5    Australia   

  bowling_team       striker   non_striker  ... extras  wides  noballs  byes  \
0  New Zealand  AC Gilchrist     MJ Clarke  ...      1    1.0      NaN   NaN   
1  New Zealand  AC Gilchrist     MJ Clarke  ...      1    NaN      NaN   NaN   
2  New Zealand     MJ Clarke  AC Gilchrist  ...      0    NaN      NaN   NaN   
3  New Zealand     MJ Clarke  AC Gilchrist  ...      0    NaN      NaN   NaN   
4  New Zealand  AC Gilchrist     MJ Clarke  ...      0    NaN      NaN   NaN   

   legbyes  penalty  wicket_type  player_dismissed other_wicket_type

In [10]:
concatenated_df.tail()

Unnamed: 0,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,...,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed
167,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,5.2,Netherlands,Nepal,TP Visee,BN Cooper,...,0,,,,,,,,,
168,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,5.3,Netherlands,Nepal,TP Visee,BN Cooper,...,0,,,,,,stumped,TP Visee,,
169,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,5.4,Netherlands,Nepal,BFW de Leede,BN Cooper,...,0,,,,,,,,,
170,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,5.5,Netherlands,Nepal,BN Cooper,BFW de Leede,...,1,1.0,,,,,,,,
171,1257948,2021,2021-04-20,Tribhuvan University International Cricket Gro...,2,5.6,Netherlands,Nepal,BN Cooper,BFW de Leede,...,0,,,,,,,,,


In [11]:
concatenated_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73979 entries, 0 to 171
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   match_id                73979 non-null  int64  
 1   season                  73979 non-null  object 
 2   start_date              73979 non-null  object 
 3   venue                   73979 non-null  object 
 4   innings                 73979 non-null  int64  
 5   ball                    73979 non-null  float64
 6   batting_team            73979 non-null  object 
 7   bowling_team            73979 non-null  object 
 8   striker                 73979 non-null  object 
 9   non_striker             73979 non-null  object 
 10  bowler                  73979 non-null  object 
 11  runs_off_bat            73979 non-null  int64  
 12  extras                  73979 non-null  int64  
 13  wides                   2706 non-null   float64
 14  noballs                 335 non-null    

In [12]:
# Add a column for the score on each ball
# score = runs_off_bat + extras

concatenated_df['score'] = concatenated_df['runs_off_bat'] + concatenated_df['extras']

print(concatenated_df.head())

   match_id   season  start_date      venue  innings  ball batting_team  \
0    211048  2004/05  2005-02-17  Eden Park        1   0.1    Australia   
1    211048  2004/05  2005-02-17  Eden Park        1   0.2    Australia   
2    211048  2004/05  2005-02-17  Eden Park        1   0.3    Australia   
3    211048  2004/05  2005-02-17  Eden Park        1   0.4    Australia   
4    211048  2004/05  2005-02-17  Eden Park        1   0.5    Australia   

  bowling_team       striker   non_striker  ... wides  noballs  byes  legbyes  \
0  New Zealand  AC Gilchrist     MJ Clarke  ...   1.0      NaN   NaN      NaN   
1  New Zealand  AC Gilchrist     MJ Clarke  ...   NaN      NaN   NaN      1.0   
2  New Zealand     MJ Clarke  AC Gilchrist  ...   NaN      NaN   NaN      NaN   
3  New Zealand     MJ Clarke  AC Gilchrist  ...   NaN      NaN   NaN      NaN   
4  New Zealand  AC Gilchrist     MJ Clarke  ...   NaN      NaN   NaN      NaN   

   penalty  wicket_type  player_dismissed  other_wicket_type  

In [13]:
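# Running total within each innings; a plain cumsum() would keep accumulating across matches
concatenated_df['total score'] = concatenated_df.groupby(['match_id', 'innings'])['score'].cumsum()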
print(concatenated_df.head())

   match_id   season  start_date      venue  innings  ball batting_team  \
0    211048  2004/05  2005-02-17  Eden Park        1   0.1    Australia   
1    211048  2004/05  2005-02-17  Eden Park        1   0.2    Australia   
2    211048  2004/05  2005-02-17  Eden Park        1   0.3    Australia   
3    211048  2004/05  2005-02-17  Eden Park        1   0.4    Australia   
4    211048  2004/05  2005-02-17  Eden Park        1   0.5    Australia   

  bowling_team       striker   non_striker  ... noballs  byes  legbyes  \
0  New Zealand  AC Gilchrist     MJ Clarke  ...     NaN   NaN      NaN   
1  New Zealand  AC Gilchrist     MJ Clarke  ...     NaN   NaN      1.0   
2  New Zealand     MJ Clarke  AC Gilchrist  ...     NaN   NaN      NaN   
3  New Zealand     MJ Clarke  AC Gilchrist  ...     NaN   NaN      NaN   
4  New Zealand  AC Gilchrist     MJ Clarke  ...     NaN   NaN      NaN   

   penalty  wicket_type  player_dismissed  other_wicket_type  \
0      NaN          NaN               Na

In [30]:
# Add bowlers and batsmen columns
# First step: list the unique bowlers of a single match (match_id 211048)

unique_bowlers = concatenated_df[concatenated_df['match_id'] == 211048]['bowler'].unique()
print(unique_bowlers[0])

DR Tuffey
