In [1]:
import pandas as pd
import numpy as np
df = pd.read_pickle('./data/batters.pkl')

In [3]:
df.head()

Unnamed: 0,Innings Player,Innings Runs Scored,Innings Runs Scored Num,Innings Minutes Batted,Innings Batted Flag,Innings Not Out Flag,Innings Balls Faced,Innings Boundary Fours,Innings Boundary Sixes,Innings Batting Strike Rate,Innings Number,50's,100's,Innings Runs Scored Buckets
0,L Hutton,364,364,797,1.0,0.0,847,35,0,42.97,1,0.0,1.0,200+
1,WR Hammond,336*,336,318,1.0,1.0,-,34,10,-,2,0.0,1.0,200+
2,GA Gooch,333,333,628,1.0,0.0,485,43,3,68.65,1,0.0,1.0,200+
3,A Sandham,325,325,600,1.0,0.0,640,28,0,50.78,1,0.0,1.0,200+
4,JH Edrich,310*,310,532,1.0,1.0,450,52,5,68.88,1,0.0,1.0,200+


# Innings Runs Scored
Possible values:
* Numeric: Runs scored
* Numeric followed by `*`: Runs scored, not out (e.g. `336*`)
* DNB: Did Not Bat
* absent: Probably retired hurt
* TDNB: Team Did Not Bat
* sub: Substitute
* NaN: For bowler rows

**Use**: No. *We're interested in things like scores and averages, which need a numeric column; see `Innings Runs Scored Num`.*

In [8]:
df[(df['Innings Runs Scored'].str.isnumeric() == False) & (df['Innings Runs Scored'].str.endswith("*") == False)]['Innings Runs Scored'].unique()

array(['DNB', 'absent', 'TDNB', 'sub'], dtype=object)

# Innings Runs Scored Num
Possible values:
* Numeric: Runs scored
* \- (Dash): For `DNB`, `absent`, `TDNB`, `sub` of `Innings Runs Scored` column
* NaN: For bowler rows

**Use**: Yes, but we'll replace `-` with `NaN` first (with `replace` or `pd.to_numeric`; `fillna` won't do it, because `-` is a string and not a missing value). We don't care why a batter didn't bat, only whether they did. With `-` replaced by `NaN`, the column can be given a numeric dtype (float, or pandas' nullable `Int64`) and used for number crunching.

In [12]:
df[(df['Innings Runs Scored Num'].str.isnumeric() == False)]['Innings Runs Scored'].unique()

array(['DNB', 'absent', 'TDNB', 'sub'], dtype=object)
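
The clean-up described above could look something like the sketch below. It runs on a small made-up Series, not the real `batters.pkl` data, and `pd.to_numeric(..., errors="coerce")` is just one way to do it; `replace('-', np.nan)` followed by `astype(float)` should work too.

```python
import pandas as pd

# Toy stand-in for the raw column (assumption: values are strings, '-' or missing).
raw = pd.Series(["364", "336", "-", None, "12"], name="Innings Runs Scored Num")

# errors="coerce" turns anything that does not parse as a number into NaN.
runs = pd.to_numeric(raw, errors="coerce")

# Optional: the nullable integer dtype keeps whole numbers and still allows missing values.
runs_int = runs.astype("Int64")

print(runs)
print(runs_int)
```

The same call should apply unchanged to the other dash-bearing columns (`Innings Minutes Batted`, `Innings Balls Faced`, the boundary columns).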

# Innings Minutes Batted
Possible values:
* Numeric: Minutes
* \- (Dash): Possibly for not recorded
* NaN: For bowler rows

**Use**: Yes. Again, we'll replace `-` with `NaN`; see `Innings Runs Scored Num` for the explanation.

In [18]:
df[df['Innings Minutes Batted'].str.isnumeric() == False]['Innings Minutes Batted'].unique()

array(['-'], dtype=object)

# Innings Batted Flag
Possible values:
* 1: Batted
* 0: Not batted. For `DNB`, `absent`, `TDNB`, `sub` of `Innings Runs Scored` column

**Use**: No. I think the `Innings Runs Scored Num` column already does the job.

In [24]:
df[(df['Innings Batted Flag'] == 0)]['Innings Runs Scored'].unique()

array(['DNB', 'absent', 'TDNB', 'sub'], dtype=object)

# Innings Not Out Flag
Possible values:
* 1: Not out
* 0: Out
* NaN: For bowler rows

**Use**: Yes.

In [31]:
df[(df['Innings Not Out Flag'].isna() == False)]['Innings Not Out Flag'].unique()

array([0., 1.])
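
One reason to keep this flag: a batting average divides runs by dismissals, not by innings, so not-out innings add runs without adding to the denominator. A minimal sketch on made-up rows for a single hypothetical batter, assuming `Innings Runs Scored Num` has already been converted to a numeric column:

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "Innings Runs Scored Num": [364.0, 336.0, 10.0, float("nan")],
    "Innings Not Out Flag": [0.0, 1.0, 0.0, 0.0],
})

# Only innings with a numeric score count (drops DNB-style rows).
batted = toy.dropna(subset=["Innings Runs Scored Num"])

# Dismissals are the innings where the batter was out.
dismissals = (batted["Innings Not Out Flag"] == 0).sum()

# Average is undefined if the batter was never dismissed.
total_runs = batted["Innings Runs Scored Num"].sum()
average = total_runs / dismissals if dismissals else float("nan")

print(average)
```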

# Innings Balls Faced
Possible values:
* Numeric: Balls faced
* \- (Dash): Possibly for not recorded
* NaN: For bowler rows

**Use**: Yes. Again, we'll replace `-` with `NaN`; see `Innings Runs Scored Num` for the explanation.

In [32]:
df[df['Innings Balls Faced'].str.isnumeric() == False]['Innings Balls Faced'].unique()

array(['-'], dtype=object)

# Innings Boundary Fours
Possible values:
* Numeric: Number of fours hit
* \- (Dash): Possibly for not recorded
* NaN: For bowler rows

**Use**: Yes. Again, we'll replace `-` with `NaN`; see `Innings Runs Scored Num` for the explanation.


In [16]:
df[df['Innings Boundary Fours'].str.isnumeric() == False]['Innings Boundary Fours'].unique()

array(['-'], dtype=object)

# Innings Boundary Sixes
See `Innings Boundary Fours`

In [17]:
df[df['Innings Boundary Sixes'].str.isnumeric() == False]['Innings Boundary Sixes'].unique()

array(['-'], dtype=object)

# Innings Batting Strike Rate
Possible values:
* Numeric: Strike rate
* \- (Dash): Possibly for not recorded
* NaN: For bowler rows

**Use**: No. This is a computed field: `100 * Innings Runs Scored Num / Innings Balls Faced`, i.e. runs per 100 balls (L Hutton's 364 off 847 balls gives roughly 43, in line with the 42.97 in the data).
When recomputing it, keep in mind that `Innings Runs Scored Num` can be numeric while `Innings Balls Faced` is `-` / `NaN` (because it wasn't recorded). In that case the strike rate can't be computed. See `Innings Balls Faced`.
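
A rough sketch of recomputing the strike rate, assuming the usual definition of runs per 100 balls (which is what the `df.head()` values imply). The toy rows are copied from the `df.head()` output above; the `balls > 0` guard is my own addition to avoid dividing by zero.

```python
import pandas as pd

# Rows taken from the df.head() output; '-' means balls faced was not recorded.
toy = pd.DataFrame({
    "Innings Runs Scored Num": ["364", "336", "333"],
    "Innings Balls Faced": ["847", "-", "485"],
})

runs = pd.to_numeric(toy["Innings Runs Scored Num"], errors="coerce")
balls = pd.to_numeric(toy["Innings Balls Faced"], errors="coerce")

# where() blanks out zero-ball innings so we get NaN instead of inf;
# NaN balls propagate to a NaN strike rate automatically.
strike_rate = 100 * runs / balls.where(balls > 0)

print(strike_rate)
```

The figures in the original column look truncated rather than rounded to two decimals (68.88 for 310 off 450), so the last digit can differ from a rounded recomputation.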

# Innings Number
Possible values:
* Numeric: Innings number within the match (the outputs in this notebook show 1, 2 and 3)
* \- (Dash): If `Innings Runs Scored` = `TDNB`, for both batters and bowlers. We'll replace `-` with `NaN`.

**Use**: Yes.

In [30]:
df[(df['Innings Number'] == '-') & (df['Innings Runs Scored'] != 'TDNB')]

Unnamed: 0,Innings Player,Innings Runs Scored,Innings Runs Scored Num,Innings Minutes Batted,Innings Batted Flag,Innings Not Out Flag,Innings Balls Faced,Innings Boundary Fours,Innings Boundary Sixes,Innings Batting Strike Rate,...,Innings Overs Bowled,Innings Bowled Flag,Innings Maidens Bowled,Innings Runs Conceded,Innings Wickets Taken,4 Wickets,5 Wickets,10 Wickets,Innings Wickets Taken Buckets,Innings Economy Rate


# 50's
Possible values:
* 1: 49 < `Innings Runs Scored Num` < 100
* 0: Not a half century
* NaN: For bowler rows

**Use**: No. Computed field

In [37]:
df[(df["50's"] == 0)]

Unnamed: 0,Innings Player,Innings Runs Scored,Innings Runs Scored Num,Innings Minutes Batted,Innings Batted Flag,Innings Not Out Flag,Innings Balls Faced,Innings Boundary Fours,Innings Boundary Sixes,Innings Batting Strike Rate,Innings Number,50's,100's,Innings Runs Scored Buckets
0,L Hutton,364,364,797,1.0,0.0,847,35,0,42.97,1,0.0,1.0,200+
1,WR Hammond,336*,336,318,1.0,1.0,-,34,10,-,2,0.0,1.0,200+
2,GA Gooch,333,333,628,1.0,0.0,485,43,3,68.65,1,0.0,1.0,200+
3,A Sandham,325,325,600,1.0,0.0,640,28,0,50.78,1,0.0,1.0,200+
4,JH Edrich,310*,310,532,1.0,1.0,450,52,5,68.88,1,0.0,1.0,200+
5,AN Cook,294,294,773,1.0,0.0,545,33,0,53.94,2,0.0,1.0,200+
6,RE Foster,287,287,419,1.0,0.0,-,37,0,-,2,0.0,1.0,200+
7,PBH May,285*,285,600,1.0,1.0,-,25,2,-,3,0.0,1.0,200+
8,DCS Compton,278,278,287,1.0,0.0,-,34,1,-,2,0.0,1.0,200+
9,AN Cook,263,263,836,1.0,0.0,528,18,0,49.81,2,0.0,1.0,200+


# 100's
Possible values:
* 1: 99 < `Innings Runs Scored Num`
* 0: Not a century
* NaN: For bowler rows

**Use**: No. Computed field
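
Since both flags are computed fields, they can be rebuilt from the runs column when needed. A minimal sketch on made-up scores, using the thresholds listed above; keeping non-numeric rows as `NaN` (rather than 0) is my assumption, and I haven't checked how the original columns treat `DNB` rows.

```python
import pandas as pd

# Made-up scores, including a dash and a missing value.
runs = pd.to_numeric(pd.Series(["364", "75", "49", "100", "-", None]), errors="coerce")

# Comparisons against NaN evaluate to False, so mask afterwards
# to keep rows without a score as NaN instead of 0.
fifties = runs.between(50, 99).astype(float).where(runs.notna())
hundreds = (runs >= 100).astype(float).where(runs.notna())

print(pd.DataFrame({"runs": runs, "50's": fifties, "100's": hundreds}))
```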

# Innings Runs Scored Buckets
Possible values:
* 200+
* 150-199
* 100-149
* 50-99
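* 0-49
* \- (Dash): Present in the data; presumably for rows with no numeric score (e.g. `DNB`)
* NaN: Presumably for bowler rows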


In [39]:
df['Innings Runs Scored Buckets'].unique()

array(['200+', '150-199', '100-149', '50-99', '0-49', '-', nan],
      dtype=object)
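
Like the `50's` and `100's` flags, this column looks derivable from `Innings Runs Scored Num`. A sketch using `pd.cut` on made-up scores; the bin edges are read off the bucket labels above, and leaving non-numeric rows as `NaN` is my assumption.

```python
import numpy as np
import pandas as pd

# Made-up scores covering each bucket boundary, plus a dash.
runs = pd.to_numeric(pd.Series(["364", "150", "149", "50", "0", "-"]), errors="coerce")

# right=False gives left-closed bins: [0, 50), [50, 100), ..., [200, inf)
edges = [0, 50, 100, 150, 200, np.inf]
labels = ["0-49", "50-99", "100-149", "150-199", "200+"]
buckets = pd.cut(runs, bins=edges, labels=labels, right=False)

print(buckets)
```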