# T20 Cricket Score Predictor

In a T20 cricket score predictor, various machine learning algorithms can be used depending on the specific requirements and the nature of the data. 

Here are some commonly used algorithms:
1. Linear Regression
2. Random Forest
3. Support Vector Machines (SVM)
4. XGBoost
5. Neural Networks

Dataset Link: https://www.kaggle.com/datasets/harshmishraandheri/t20i-cricket-matches-ball-by-ball-info-dataset

1. Linear Regression:
  Use Case: Predicting the final score from the runs scored so far, wickets lost, overs remaining, etc.
  Why: Simple and interpretable; useful for understanding the relationship between the input features and the target variable.

2. Random Forest:
  Use Case: Predicting the score from features such as the batting team, bowling team, venue, etc.
  Why: Handles non-linear relationships well and can capture interactions between features. Averaging over many trees also makes it less prone to overfitting than a single decision tree.

3. Support Vector Machines (SVM):
  Use Case: Predicting outcomes where the decision boundary between classes (e.g., predicting win/loss based on score) is complex. For predicting the score itself, the regression variant (support vector regression, SVR) applies.
  Why: Effective in high-dimensional spaces and works well with non-linear data through the kernel trick.

4. XGBoost:
  Use Case: Advanced regression tasks in score prediction, especially when the dataset is large and complex.
  Why: Known for high performance, flexibility, and the ability to handle a variety of data types. It’s often used in competitive machine learning tasks.

5. Neural Networks:
  Use Case: Predicting cricket scores by learning complex patterns and relationships in the data.
  Why: Capable of modeling complex relationships, especially when there is a large amount of data with many features.

In [27]:
import numpy as np
import pandas as pd

We will import the dataset and take a first look at it.

In [28]:
df = pd.read_csv('t20i_info.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
0,0,2,Australia,Sri Lanka,0.1,0,0,,Melbourne Cricket Ground
1,1,2,Australia,Sri Lanka,0.2,0,0,,Melbourne Cricket Ground
2,2,2,Australia,Sri Lanka,0.3,1,0,,Melbourne Cricket Ground
3,3,2,Australia,Sri Lanka,0.4,2,0,,Melbourne Cricket Ground
4,4,2,Australia,Sri Lanka,0.5,0,0,,Melbourne Cricket Ground


We need to create some columns and extract a few others to get the desired data.

Eventually we want our data to have these columns:
- batting_team
- bowling_team
- city
- current_score
- balls_left
- wickets_left
- current_run_rate (stored as crr)
- last_five

We already have some of the necessary columns in our dataset, such as the batting team and bowling team. Additionally, we have a column for the city, but it contains some null values that need to be addressed. For the remaining columns, we'll need to perform some data manipulation.

Now we will begin our feature extraction with the city column. To fill in the missing values, we'll use the venue column.

In [29]:
df[df['city'].isnull()]['venue'].value_counts()

Dubai International Cricket Stadium        2969
Pallekele International Cricket Stadium    2066
Melbourne Cricket Ground                   1453
Sydney Cricket Ground                       749
Adelaide Oval                               498
Harare Sports Club                          372
Sharjah Cricket Stadium                     249
Sylhet International Cricket Stadium        128
Carrara Oval                                 64
Name: venue, dtype: int64

Here, we're examining the values in the 'venue' column where the city column has null values. If we look closely, we'll notice that the first word in the venue name typically corresponds to the city where the venue is located, such as "Dubai" in "Dubai International Cricket Stadium" or "Melbourne" in "Melbourne Cricket Ground."

In [30]:
cities = np.where(df['city'].isnull(), df['venue'].str.split().apply(lambda x:x[0]), df['city'])
df['city'] = cities
df.isnull().sum()

Unnamed: 0          0
match_id            0
batting_team        0
bowling_team        0
ball                0
runs                0
player_dismissed    0
city                0
venue               0
dtype: int64

We build the cities array by keeping the existing city where one is present and falling back to the first word of the venue otherwise, then assign it back to the city column. Now, our dataset no longer has any null values. However, there's still one more thing to address. Our dataset is a ball-by-ball dataset: each row represents one delivery, so roughly 63,000 rows mean roughly 63,000 balls bowled and played.
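The venue-based fill above can be checked on a few made-up rows. This is a minimal sketch, not the notebook's own code: it uses `Series.fillna` as an alternative to the `np.where` call, and the toy frame is an assumption for illustration.

```python
import pandas as pd

# Toy data (hypothetical rows): two deliveries have no city recorded.
toy = pd.DataFrame({
    "city": [None, "Colombo", None],
    "venue": [
        "Melbourne Cricket Ground",
        "R Premadasa Stadium",
        "Dubai International Cricket Stadium",
    ],
})

# First word of each venue name, used only where city is missing.
first_word = toy["venue"].str.split().str[0]
toy["city"] = toy["city"].fillna(first_word)

print(toy["city"].tolist())
```

The first-word rule is only a heuristic: for a venue such as "R Premadasa Stadium" it would produce "R", which is why it helps to inspect the venues with missing cities before applying it.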

In [31]:
df['city'].value_counts()

Colombo          4086
Mirpur           3420
Johannesburg     3331
Dubai            2969
Auckland         2532
                 ... 
Nairobi           123
Potchefstroom     122
Dharamsala        122
Ahmedabad         121
Carrara            64
Name: city, Length: 86, dtype: int64

This indicates that some cities have only a few deliveries played. Therefore, we can disregard those cities and focus solely on the ones where more than 600 deliveries have been played.

In [32]:
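city_counts = df['city'].value_counts()
eligible_cities = city_counts[city_counts > 600].index.tolist()
# .copy() gives an independent frame, so later column assignments do not warn about views
df = df[df['city'].isin(eligible_cities)].copy()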
df

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
0,0,2,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground
1,1,2,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground
2,2,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground
3,3,2,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground
4,4,2,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground
...,...,...,...,...,...,...,...,...,...
63883,121,964,Sri Lanka,Australia,19.3,1,0,Colombo,R Premadasa Stadium
63884,122,964,Sri Lanka,Australia,19.4,0,0,Colombo,R Premadasa Stadium
63885,123,964,Sri Lanka,Australia,19.5,0,DM de Silva,Colombo,R Premadasa Stadium
63886,124,964,Sri Lanka,Australia,19.6,2,0,Colombo,R Premadasa Stadium
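The threshold filter can also be expressed without building a list of eligible cities. The sketch below is one possible variant on toy data; the frame and the `min_deliveries` value are assumptions chosen so the effect is visible on nine rows.

```python
import pandas as pd

# Toy data (hypothetical): 5 deliveries in Colombo, 3 in Dubai, 1 in Carrara.
toy = pd.DataFrame({"city": ["Colombo"] * 5 + ["Dubai"] * 3 + ["Carrara"]})
min_deliveries = 2  # hypothetical threshold; the notebook uses 600

# Per-row count of deliveries played in that row's city.
deliveries_in_city = toy.groupby("city")["city"].transform("count")
filtered = toy[deliveries_in_city > min_deliveries].copy()

print(sorted(filtered["city"].unique()))
print(len(filtered))
```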


With the city column now complete, we move on to the current_score column. This can be easily extracted from the runs column using the cumsum() function, which calculates the cumulative sum of the column. We compute it per match_id so that the running total restarts for each match.

In [33]:
# Select 'runs' before cumsum so that only this numeric column is accumulated
df['current_score'] = df.groupby('match_id')['runs'].cumsum()
df


Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score
0,0,2,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground,0
1,1,2,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground,0
2,2,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground,1
3,3,2,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground,3
4,4,2,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground,3
...,...,...,...,...,...,...,...,...,...,...
63883,121,964,Sri Lanka,Australia,19.3,1,0,Colombo,R Premadasa Stadium,125
63884,122,964,Sri Lanka,Australia,19.4,0,0,Colombo,R Premadasa Stadium,125
63885,123,964,Sri Lanka,Australia,19.5,0,DM de Silva,Colombo,R Premadasa Stadium,125
63886,124,964,Sri Lanka,Australia,19.6,2,0,Colombo,R Premadasa Stadium,127
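To see how a grouped cumulative sum restarts for every match, here is a minimal sketch on made-up deliveries. The match ids and run values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): three deliveries in match 1, two in match 2.
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2],
    "runs":     [0, 4, 1, 6, 2],
})

# Running total of runs, restarting at the first delivery of each match.
toy["current_score"] = toy.groupby("match_id")["runs"].cumsum()

print(toy["current_score"].tolist())  # [0, 4, 5, 6, 8]
```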


Next, we need to create a balls_left column. To do this, we’ll first create two new columns, over and ball_no, which indicate how many overs have been completed and which ball of the current over this delivery is, respectively. Both are obtained by splitting the ball value (for example 19.3) at the decimal point.

In [34]:
df['over'] = df['ball'].apply(lambda x : str(x).split(".")[0])
df['ball_no'] = df['ball'].apply(lambda x: str(x).split(".")[1])
df

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no
0,0,2,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground,0,0,1
1,1,2,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground,0,0,2
2,2,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground,1,0,3
3,3,2,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground,3,0,4
4,4,2,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground,3,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...
63883,121,964,Sri Lanka,Australia,19.3,1,0,Colombo,R Premadasa Stadium,125,19,3
63884,122,964,Sri Lanka,Australia,19.4,0,0,Colombo,R Premadasa Stadium,125,19,4
63885,123,964,Sri Lanka,Australia,19.5,0,DM de Silva,Colombo,R Premadasa Stadium,125,19,5
63886,124,964,Sri Lanka,Australia,19.6,2,0,Colombo,R Premadasa Stadium,127,19,6


Next, we can create a balls_bowled column to represent the total number of balls bowled. This can be calculated using the formula:

balls_bowled = (over × 6) + ball_no

In [35]:
df['balls_bowled'] = (df['over'].astype('int')*6 + df['ball_no'].astype('int'))
df

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled
0,0,2,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground,0,0,1,1
1,1,2,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground,0,0,2,2
2,2,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground,1,0,3,3
3,3,2,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground,3,0,4,4
4,4,2,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground,3,0,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
63883,121,964,Sri Lanka,Australia,19.3,1,0,Colombo,R Premadasa Stadium,125,19,3,117
63884,122,964,Sri Lanka,Australia,19.4,0,0,Colombo,R Premadasa Stadium,125,19,4,118
63885,123,964,Sri Lanka,Australia,19.5,0,DM de Silva,Colombo,R Premadasa Stadium,125,19,5,119
63886,124,964,Sri Lanka,Australia,19.6,2,0,Colombo,R Premadasa Stadium,127,19,6,120


Finally, we can create the balls_left column by subtracting balls_bowled from 120, as there are a total of 120 balls in an innings. In cases where the ball count exceeds 120 due to extras (such as wides or no-balls), we can simply set the balls_left value to 0.

In [36]:
# Clamp at 0 for deliveries numbered beyond the 120th
df['balls_left'] = (120 - df['balls_bowled']).clip(lower=0)
df.head()

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled,balls_left
0,0,2,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground,0,0,1,1,119
1,1,2,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground,0,0,2,2,118
2,2,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground,1,0,3,3,117
3,3,2,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground,3,0,4,4,116
4,4,2,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground,3,0,5,5,115
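The arithmetic behind balls_bowled and balls_left can be traced on single values. The helper below is a hypothetical illustration, not part of the notebook; it takes the ball label as a string, an assumption made here to sidestep float formatting.

```python
def balls_left(ball_label: str, total_balls: int = 120) -> int:
    """Deliveries remaining, given an 'over.ball' label such as '19.3'."""
    over, ball_no = ball_label.split(".")
    bowled = int(over) * 6 + int(ball_no)  # balls_bowled = over * 6 + ball_no
    return max(total_balls - bowled, 0)    # never report a negative count


print(balls_left("0.1"))   # 119
print(balls_left("19.3"))  # 3
print(balls_left("19.7"))  # 0
```

The first two values agree with the tables above: ball 0.1 leaves 119 deliveries, and ball 19.3 corresponds to 117 balls bowled.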


The player_dismissed column contains either a 0 or the name of the player who got out on that particular ball. We'll first replace all player names with 1, then apply the cumsum() function to count the total wickets lost. Finally, we'll subtract this value from 10 to get the wickets_left column.

In [37]:
df['player_dismissed'] = df['player_dismissed'].apply(lambda x : 1 if x != '0' else 0)
df.sample(5)

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled,balls_left
13876,12,193,Australia,Sri Lanka,1.6,4,0,Adelaide,Adelaide Oval,20,1,6,12,108
44335,115,675,Pakistan,Bangladesh,18.3,1,0,Mirpur,Shere Bangla National Stadium,120,18,3,111,9
17368,123,246,India,New Zealand,19.4,1,0,Wellington,Westpac Stadium,161,19,4,118,2
62265,47,927,India,West Indies,7.5,2,0,Mumbai,Wankhede Stadium,65,7,5,47,73
54375,21,816,Pakistan,Australia,3.4,0,0,Dubai,Dubai International Cricket Stadium,18,3,4,22,98


In [38]:
df['player_dismissed'] = df['player_dismissed'].astype('int')
# Running count of wickets lost within each match
df['player_dismissed'] = df.groupby('match_id')['player_dismissed'].cumsum()
df['wickets_left'] = 10 - df['player_dismissed']
df.sample(5)


Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled,balls_left,wickets_left
19431,95,309,Pakistan,England,15.3,1,3,Manchester,Old Trafford,139,15,3,93,27,7
20731,23,431,West Indies,New Zealand,2.8,5,0,Auckland,Eden Park,51,2,8,20,100,10
3876,4,55,Bangladesh,Sri Lanka,0.5,4,0,Colombo,R Premadasa Stadium,5,0,5,5,115,10
11206,42,144,Bangladesh,West Indies,7.1,0,2,Lauderhill,Central Broward Regional Park Stadium Turf Ground,76,7,1,43,77,8
29170,52,520,Sri Lanka,Pakistan,7.1,1,0,London,Lord's,74,7,1,43,77,10
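The two wicket steps (flagging dismissals, then counting them per match) can be sketched together on toy rows. The names "A Batter" and "B Batter" are placeholders invented for this example.

```python
import pandas as pd

# Toy data (hypothetical): '0' means no dismissal on that delivery.
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2],
    "player_dismissed": ["0", "DM de Silva", "0", "A Batter", "0", "B Batter"],
})

# 1 where a player was dismissed, 0 otherwise.
is_wicket = (toy["player_dismissed"] != "0").astype(int)

# Wickets lost so far in each match, subtracted from the 10 available.
toy["wickets_left"] = 10 - is_wicket.groupby(toy["match_id"]).cumsum()

print(toy["wickets_left"].tolist())  # [10, 9, 9, 8, 10, 9]
```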


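Next, we will create the current_run_rate column (stored as crr). This is the number of runs scored per over so far: current_score divided by the overs bowled, where the overs bowled equal balls_bowled / 6.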

In [39]:
df['crr'] = (df['current_score']*6 / df['balls_bowled'])
df.head()

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr
0,0,2,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground,0,0,1,1,119,10,0.0
1,1,2,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground,0,0,2,2,118,10,0.0
2,2,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground,1,0,3,3,117,10,2.0
3,3,2,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground,3,0,4,4,116,10,4.5
4,4,2,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground,3,0,5,5,115,10,3.6
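As a worked check of the run-rate formula, the hypothetical helper below reproduces two crr values from the table above (3 runs after 4 balls and 3 runs after 5 balls).

```python
def current_run_rate(current_score: int, balls_bowled: int) -> float:
    """Runs per over so far: runs divided by overs bowled (balls / 6)."""
    return current_score * 6 / balls_bowled


print(current_run_rate(3, 4))  # 4.5
print(current_run_rate(3, 5))  # 3.6
```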


We need to create a column that shows the total runs scored in the last five overs. We approximate this with a rolling window of 30 deliveries within each match. The first 29 rows of every match will therefore have null values, as there isn't enough data yet to fill the window.

In [40]:
groups = df.groupby('match_id')

match_ids = df['match_id'].unique()
last_five = []
for match_id in match_ids:
    # Rolling 30-delivery sum of runs within this match
    last_five.extend(groups.get_group(match_id)['runs'].rolling(window=30).sum().values.tolist())


In [41]:
df['last_five'] = last_five
df.sample(5)

Unnamed: 0.1,Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr,last_five
33738,58,565,South Africa,New Zealand,9.4,1,2,Barbados,"Kensington Oval, Bridgetown",64,9,4,58,62,8,6.62069,24.0
27621,115,499,Australia,South Africa,18.5,0,6,Johannesburg,New Wanderers Stadium,149,18,5,113,7,4,7.911504,58.0
12941,39,161,West Indies,India,6.3,1,3,Kolkata,Eden Gardens,33,6,3,39,81,7,5.076923,22.0
62901,77,951,India,Bangladesh,12.4,0,3,Mirpur,Shere Bangla National Stadium,79,12,4,76,44,7,6.236842,37.0
31535,65,543,South Africa,England,10.3,1,0,Centurion,SuperSport Park,134,10,3,63,57,10,12.761905,70.0
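The per-match rolling sum can also be written without an explicit loop. The sketch below is one possible alternative using `groupby(...).transform`; the toy data and the window of 3 (instead of 30) are assumptions so the result fits on one line.

```python
import pandas as pd

# Toy data (hypothetical): four deliveries in match 1, three in match 2.
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2, 2],
    "runs":     [1, 2, 3, 4, 10, 20, 30],
})
window = 3  # the notebook uses 30 deliveries

# Rolling sum computed separately inside each match; the result keeps the
# original row order, so it can be assigned straight back as a column.
toy["last_n"] = toy.groupby("match_id")["runs"].transform(
    lambda s: s.rolling(window=window).sum()
)

print(toy["last_n"].tolist())
```

The first `window - 1` rows of each match come out as NaN, which mirrors the null values in last_five for the early part of every innings.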


Finally, we need to create the last column, which will be our target column: the total runs scored in that innings.

In [42]:
# Total runs per match (runs_x after the merge), joined back onto every delivery
final_df = df.groupby('match_id')['runs'].sum().reset_index().merge(df, on='match_id')
final_df.head()


Unnamed: 0.1,match_id,runs_x,Unnamed: 0,batting_team,bowling_team,ball,runs_y,player_dismissed,city,venue,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr,last_five
0,2,168,0,Australia,Sri Lanka,0.1,0,0,Melbourne,Melbourne Cricket Ground,0,0,1,1,119,10,0.0,
1,2,168,1,Australia,Sri Lanka,0.2,0,0,Melbourne,Melbourne Cricket Ground,0,0,2,2,118,10,0.0,
2,2,168,2,Australia,Sri Lanka,0.3,1,0,Melbourne,Melbourne Cricket Ground,1,0,3,3,117,10,2.0,
3,2,168,3,Australia,Sri Lanka,0.4,2,0,Melbourne,Melbourne Cricket Ground,3,0,4,4,116,10,4.5,
4,2,168,4,Australia,Sri Lanka,0.5,0,0,Melbourne,Melbourne Cricket Ground,3,0,5,5,115,10,3.6,
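Attaching the innings total to every delivery does not strictly need a merge. As a sketch of an alternative (not the notebook's approach), `transform("sum")` broadcasts each match's total back to its rows; the toy data and the column name `total_runs` are assumptions.

```python
import pandas as pd

# Toy data (hypothetical): match 1 totals 5 runs, match 2 totals 8.
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2],
    "runs":     [0, 4, 1, 6, 2],
})

# Every row receives the total of its own match.
toy["total_runs"] = toy.groupby("match_id")["runs"].transform("sum")

print(toy["total_runs"].tolist())  # [5, 5, 5, 8, 8]
```

This avoids the runs_x / runs_y suffixes that the merge introduces.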


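We will now drop all the columns that are not needed for our model and keep only the ones we've created, together with the target runs_x. We'll also drop the rows where last_five is null. Additionally, we'll shuffle the data to avoid any potential bias from the row order.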

In [43]:
final_df = final_df[['batting_team','bowling_team','city','current_score','balls_left','wickets_left','crr','last_five','runs_x']]
# Drop the early-innings rows where last_five is undefined
final_df = final_df.dropna()
final_df.isnull().sum()

batting_team     0
bowling_team     0
city             0
current_score    0
balls_left       0
wickets_left     0
crr              0
last_five        0
runs_x           0
dtype: int64

In [44]:
final_df = final_df.sample(final_df.shape[0])
final_df

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
46129,New Zealand,Pakistan,Wellington,111,46,6,9.000000,47.0,196
16222,New Zealand,India,Johannesburg,87,52,7,7.676471,37.0,190
14968,Sri Lanka,New Zealand,Auckland,51,72,6,6.375000,26.0,115
38878,New Zealand,Sri Lanka,Pallekele,47,77,8,6.558140,35.0,142
5720,Bangladesh,India,Colombo,121,12,4,6.722222,27.0,139
...,...,...,...,...,...,...,...,...,...
17849,India,Australia,Durban,82,51,8,7.130435,45.0,188
32283,West Indies,England,Pallekele,84,67,10,9.509434,47.0,179
37937,South Africa,Australia,Centurion,89,27,4,5.741935,21.0,128
38710,Sri Lanka,England,London,183,0,3,9.150000,61.0,183


With this, we conclude the feature extraction phase of the project. After a lot of effort, we now have the precise data we needed from the beginning.

Let's now begin the model-building process. First, we'll split our dataset into training and testing sets using the train_test_split function from sklearn.model_selection.

In [45]:
from sklearn.model_selection import train_test_split

x = final_df.drop(columns=['runs_x'])
y = final_df['runs_x']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [46]:
x_train

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five
50463,Sri Lanka,Australia,Colombo,88,33,3,6.068966,26.0
476,New Zealand,Bangladesh,Mount Maunganui,150,18,7,8.823529,71.0
31774,Pakistan,New Zealand,Pallekele,63,76,9,8.590909,48.0
5057,India,South Africa,Johannesburg,138,42,7,10.615385,40.0
5791,Sri Lanka,Bangladesh,Colombo,85,67,9,9.622642,33.0
...,...,...,...,...,...,...,...,...
46041,Pakistan,New Zealand,Hamilton,146,12,5,8.111111,50.0
34046,West Indies,England,Nottingham,100,31,7,6.741573,43.0
5826,Sri Lanka,Bangladesh,Colombo,148,34,7,10.325581,52.0
5768,Sri Lanka,Bangladesh,Colombo,62,89,9,12.000000,56.0
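Because every delivery of an innings carries the same target, a row-level random split can place rows from one match on both sides of the split. One way to guard against that is to split by match. The sketch below uses scikit-learn's `GroupShuffleSplit` on synthetic data; it assumes a `match_id` array is still available (the notebook drops that column from final_df), and the features and targets are random placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 hypothetical innings with 10 deliveries each.
match_id = np.repeat(np.arange(20), 10)
X = rng.normal(size=(match_id.size, 3))          # placeholder features
y = np.repeat(rng.integers(100, 200, 20), 10)    # one final score per innings

# Hold out 20% of the matches (not 20% of the rows) for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, test_idx = next(splitter.split(X, y, groups=match_id))

# No match contributes rows to both sides of the split.
shared = set(match_id[train_idx]) & set(match_id[test_idx])
print(len(shared))  # 0
```

Scores measured on such a split tend to be a more honest estimate of how the model would do on an unseen match.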


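Some preprocessing steps are necessary at this stage. We'll apply one-hot encoding to the categorical features (batting_team, bowling_team, and city) and pass the numeric features through unchanged. Then, we'll create a pipeline that chains the encoder, a StandardScaler, and our machine learning model. The scaler standardizes each feature to zero mean and unit variance. Tree-based models such as XGBoost do not depend on feature scale, so this step mainly matters if you swap in a scale-sensitive model such as linear regression or an SVM.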


For our model, I'll be using the XGBoost algorithm. However, you can experiment with other regression algorithms and select the one that delivers the best results.

In [47]:
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

!pip install xgboost
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error

Collecting xgboost
  Downloading xgboost-2.1.1-py3-none-win_amd64.whl (124.9 MB)
     ------------------------------------ 124.9/124.9 MB 877.2 kB/s eta 0:00:00
Installing collected packages: xgboost
Successfully installed xgboost-2.1.1


In [48]:
trf = ColumnTransformer(
    [
        # sparse_output replaces the older `sparse` argument from scikit-learn 1.2 onward
        ('trf', OneHotEncoder(sparse_output=False, drop='first'), ['batting_team', 'bowling_team', 'city'])
    ],
    remainder='passthrough',
)
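To make the effect of `drop='first'` concrete, here is a minimal sketch on a single toy column. The team values are just examples; the encoder is used on its own here rather than inside the ColumnTransformer, and the default sparse output is converted with `.toarray()` so the snippet does not depend on the `sparse` / `sparse_output` naming.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column (hypothetical rows) with three distinct teams.
toy = pd.DataFrame({"batting_team": ["Australia", "India", "Pakistan", "India"]})

enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(toy).toarray()

# Categories are sorted; the first one (Australia) is dropped, so it is
# represented by a row of zeros and only two indicator columns remain.
print(enc.categories_[0].tolist())
print(encoded.tolist())
```

Dropping one category per feature removes a redundant column, since its value is implied by the others.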

In [49]:
pipe = Pipeline(steps=[
    ('step1', trf),
    ('step2', StandardScaler()),
    ('step3', XGBRegressor(n_estimators=1000, learning_rate=0.2, max_depth=12, random_state=1))
])

Now that our model is ready, it's time to evaluate its performance on the test set by checking the R² score and the mean absolute error (MAE, in runs).

In [50]:
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)
print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))



0.98765680007059
1.6494032573303412


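The model reaches an R² score of about 0.99 with a mean absolute error of roughly 1.65 runs.

While this result looks impressive, it raises concerns about how well it generalizes. Every delivery of an innings carries the same target (runs_x), and the random row-level split places deliveries from the same match in both the training and test sets, so the score is likely inflated: a model with 1000 deep trees (max_depth=12) can partly memorize individual matches. I encourage you to hold out whole matches for testing and to experiment with tuning the hyperparameters to see how this impacts the model's performance.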
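As a starting point for that experimentation, the sketch below shows a small grid search with scikit-learn's `GridSearchCV`. To keep it self-contained it uses `GradientBoostingRegressor` on synthetic data as a stand-in; this is an assumption for illustration, and in the notebook the same pattern would wrap the pipeline with its XGBRegressor step instead. The grid values are arbitrary examples, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data (placeholder for the real features and target).
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Small, arbitrary grid over two common boosting hyperparameters.
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best setting
```

Comparing the cross-validated error across settings gives a clearer picture of generalization than a single train/test score.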