# <b><u>Uses of Machine Learning to calculate probabilities in a simulation.</u></b>

#### <b><u>Project by: Nishit Sonawane.</u></b>

## <p style="line-height: 1.5;">Predicting the future sounds like magic whether it be detecting in advance the intent of a potential customer to purchase your product or figuring out where the price of a stock is headed. If we can reliably predict the future of something, then we own a massive advantage. Machine learning has only served to amplify this magic and mystery.</p>
### <p style="line-height: 1.5;">•<u>This article shows how Machine Learning could be used to calculate probabilities in a simulation. It does not attempt to actually get the results right, since the data used is not enough for that (or maybe the event itself is simply not predictable). I used World Cup matches just because it’s a cool and up-to-date subject.</u></p>
### <p style="line-height: 1.5;">•<u>The 2019 ICC Men’s Cricket World Cup began on Thursday, 30th May. This 12th edition of the Cricket World Cup ran for almost one and a half months in England and Wales. The tournament was contested by 10 teams who played in a single round-robin group, with the top four at the end of the group phase progressing to the semi-finals.</u></p>
### <p style="line-height: 1.5;"><u>What is the meaning of Round-Robin?</u> • In the context of scheduling and organizing events or competitions, a "round-robin group" is a format in which participants or teams compete against each other in a predetermined sequence or rotation.</p>
### <p style="line-height: 1.5;">• In a round-robin group, each participant or team competes against every other participant or team in the group exactly once. This ensures that every participant faces all the others, so nobody advances or drops out because of a lucky or unlucky draw.</p>
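
#### <p style="line-height: 1.5;">• As a small illustration of the format described above, the sketch below lists every pairing of a single round-robin group. The variable names (teams, round_robin) are only for this example and are not used elsewhere in the notebook; the order of the pairs is not the real tournament schedule.</p>

```python
from itertools import combinations

# The ten teams of the 2019 tournament
teams = ['England', 'South Africa', 'West Indies', 'Pakistan', 'New Zealand',
         'Sri Lanka', 'Afghanistan', 'Australia', 'Bangladesh', 'India']

# In a single round-robin, every unordered pair of teams meets exactly once
round_robin = list(combinations(teams, 2))

print(len(round_robin))   # 10 * 9 / 2 = 45 group stage matches
print(round_robin[:3])    # first few pairings (not the real fixture order)
```

#### <p style="line-height: 1.5;">• The count of 45 matches is the same number of league stage games that the fixtures are sliced to later on.</p>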

## <b><u>Applications</u></b>

### <p style="line-height: 1.5;">The main objective of sports prediction is to improve team performance and enhance the chances of winning the game. The value of a win takes many forms: it trickles down to fans filling the stadium seats, television contracts, fan store merchandise, parking, concessions, sponsorships, enrollment and retention.</p>

## <b><u>Predictor</u></b>

#### <u>The libraries and dependencies used in the model are:</u>
#### <u>Pandas</u> -  It helps us work with and analyze data in tables (like Excel spreadsheets) called DataFrames.
#### <u>Numpy</u> -  It provides support for working with arrays and performing mathematical operations on them.
#### <p style="line-height: 1.5;"><u>SKlearn</u> – scikit-learn, also known as sklearn, is a popular open-source machine learning library for Python. It provides a wide range of tools and algorithms for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, and more.</p>
#### <p style="line-height: 1.5;">• These import statements ensure that we have access to the necessary tools and functionalities provided by these libraries for our data analysis and machine learning tasks.</p>

In [1]:
# Import all libraries and dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


## <b><u>Data</u></b>

### <p style="line-height: 1.5;">• Real-world data is dirty. We can’t expect nicely formatted, clean data like the datasets provided by Kaggle. Data pre-processing is therefore crucial: it can occupy 40%-70% of the whole workflow, just to clean the data before it is fed to your models.</p>
### <p style="line-height: 1.5;">• I scraped three datasets from the Cricbuzz website: the team rankings as of May 2019, the fixtures of the 2019 World Cup, and each team’s history in previous World Cups. I stored them in three separate csv files. For the fourth file, I took an ODI dataset of matches played between 1975 and 2017 from Kaggle, as another csv file. From this file I removed all the data from 1975 to 2010, since only the results of the last few years should matter for the prediction. Then I cleaned the data manually, as needed, to build a machine learning model from it.</p>
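
#### <p style="line-height: 1.5;">• The year-based trimming described above was done by hand, but it could also be sketched in pandas. The rows below are made-up placeholders (Team A, Team B, ...), and the date format is assumed to match the "Jan 4, 2010" style seen in the results file; neither is taken from the actual Kaggle dataset.</p>

```python
import pandas as pd

# Made-up placeholder rows, only to show the filtering step
toy = pd.DataFrame({
    'Match Date': ['Mar 3, 1999', 'Jan 4, 2010', 'Jun 18, 2017'],
    'Team_1': ['Team A', 'Team B', 'Team A'],
    'Team_2': ['Team C', 'Team C', 'Team B'],
})

# Parse the text dates, then keep only matches played in 2010 or later
toy['Match Date'] = pd.to_datetime(toy['Match Date'], format='%b %d, %Y')
recent = toy[toy['Match Date'].dt.year >= 2010].reset_index(drop=True)

print(recent)
```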

#### <p style="line-height: 1.5;">• After importing the libraries and the dependencies, I loaded the csv file containing the details of each team’s history in previous world cups and also loaded the csv file containing the results of matches played between 2010 and 2019.</p>

In [2]:
# Load Data
wc=pd.read_csv('WC_2019_Dataset.csv')
results= pd.read_csv('results_19.csv')


In [3]:
wc.head()


Unnamed: 0,Team,Previous Appearances,Previous Titles,Previous Finals,Previous Semifinals,Current Rank
0,England,11,0,3,5,1
1,South Africa,6,0,0,4,4
2,West Indies,11,2,3,4,9
3,Pakistan,11,1,2,6,6
4,New Zealand,11,0,1,7,3


In [4]:
results.head()


Unnamed: 0,Match Date,Team_1,Team_2,Winner,Margin,Ground
0,"Jan 4, 2010",Bangladesh,Sri Lanka,Sri Lanka,7 wickets,Dhaka
1,"Jan 5, 2010",India,Sri Lanka,Sri Lanka,5 wickets,Dhaka
2,"Jan 7, 2010",Bangladesh,India,India,6 wickets,Dhaka
3,"Jan 8, 2010",Bangladesh,Sri Lanka,Sri Lanka,9 wickets,Dhaka
4,"Jan 10, 2010",India,Sri Lanka,India,8 wickets,Dhaka


## <b><u>1. Data Cleaning and Formatting</u></b>

#### <p style="line-height: 1.5;">• To look at the matches played by India, the code filters the original DataFrame to select the rows where India took part, creates a new DataFrame from those rows, and shows its first rows to give a glimpse of the data.</p>

In [5]:
df=results[(results['Team_1']=='India')|(results['Team_2']=='India')]
india=df.iloc[:]
india.head()


Unnamed: 0,Match Date,Team_1,Team_2,Winner,Margin,Ground
1,"Jan 5, 2010",India,Sri Lanka,Sri Lanka,5 wickets,Dhaka
2,"Jan 7, 2010",Bangladesh,India,India,6 wickets,Dhaka
4,"Jan 10, 2010",India,Sri Lanka,India,8 wickets,Dhaka
5,"Jan 11, 2010",Bangladesh,India,India,6 wickets,Dhaka
6,"Jan 13, 2010",India,Sri Lanka,Sri Lanka,4 wickets,Dhaka


## <b><u>2. Exploratory Data analysis</u></b>

#### <p style="line-height: 1.5;">• The code filters the original results DataFrame to select the matches involving at least one World Cup team. It concatenates the matches where a World Cup team is the first team with those where a World Cup team is the second team, calls drop_duplicates() (the result is not assigned back, so matches between two World Cup teams remain listed twice in df_teams), and finally provides the count of available data for each column in the resulting DataFrame.</p>

In [6]:
# Narrowing to teams participating in the world cup
wc_teams=['England','South Africa','West Indies',
          'Pakistan','New Zealand','Sri Lanka',
          'Afghanistan', 'Australia','Bangladesh','India']
df_teams_1=results[results['Team_1'].isin(wc_teams)]
df_teams_2=results[results['Team_2'].isin(wc_teams)]
df_teams=pd.concat((df_teams_1,df_teams_2))
df_teams.drop_duplicates()
df_teams.count()


Match Date    1897
Team_1        1897
Team_2        1897
Winner        1897
Margin        1828
Ground        1897
dtype: int64
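
#### <p style="line-height: 1.5;">• A toy sketch of the filtering step, assuming a tiny made-up table and a two-team list (toy, wc, both, deduped and either are names invented for this example). It shows that drop_duplicates() returns a new DataFrame, and that a single boolean mask with | avoids creating the duplicates in the first place.</p>

```python
import pandas as pd

# Tiny made-up table; 'wc' stands in for the list of World Cup teams
toy = pd.DataFrame({
    'Team_1': ['India', 'India', 'Kenya', 'Canada'],
    'Team_2': ['Sri Lanka', 'Kenya', 'India', 'Kenya'],
})
wc = ['India', 'Sri Lanka']

# Concatenating the two filters lists row 0 twice (both of its teams are in wc)
both = pd.concat((toy[toy['Team_1'].isin(wc)], toy[toy['Team_2'].isin(wc)]))

# drop_duplicates() does not modify 'both'; the cleaned frame must be assigned
deduped = both.drop_duplicates()

# Alternative: one mask selects each qualifying row exactly once
either = toy[toy['Team_1'].isin(wc) | toy['Team_2'].isin(wc)]

print(len(both), len(deduped), len(either))   # 4 3 3
```

#### <p style="line-height: 1.5;">• Replacing | with & in the mask would keep only the matches where both sides are World Cup teams.</p>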

In [7]:
df_teams.head()


Unnamed: 0,Match Date,Team_1,Team_2,Winner,Margin,Ground
0,"Jan 4, 2010",Bangladesh,Sri Lanka,Sri Lanka,7 wickets,Dhaka
1,"Jan 5, 2010",India,Sri Lanka,Sri Lanka,5 wickets,Dhaka
2,"Jan 7, 2010",Bangladesh,India,India,6 wickets,Dhaka
3,"Jan 8, 2010",Bangladesh,Sri Lanka,Sri Lanka,9 wickets,Dhaka
4,"Jan 10, 2010",India,Sri Lanka,India,8 wickets,Dhaka


#### <p style="line-height: 1.5;"> • I deleted the columns like date of the match, margin of victory and the ground on which the match was played. These features don’t look important for our prediction.</p>

In [8]:
# Dropping columns that will not affect match outcomes
df_teams_2010=df_teams.drop(['Match Date','Margin','Ground'], axis=1)
df_teams_2010.head()


Unnamed: 0,Team_1,Team_2,Winner
0,Bangladesh,Sri Lanka,Sri Lanka
1,India,Sri Lanka,Sri Lanka
2,Bangladesh,India,India
3,Bangladesh,Sri Lanka,Sri Lanka
4,India,Sri Lanka,India


## <b><u>3. Feature Engineering and selection</u></b>

### <p style="line-height: 1.5;">This is probably the most important part in the machine learning workflow. Since the algorithm is totally dependent on how we feed data into it, feature engineering should be given topmost priority for every machine learning project.</p>
### <p style="line-height: 1.5;"><u>Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.</u></p>
### <b>– <u>Advantages of feature engineering</u></b>
#### • <u>Reduces overfitting</u>: Less redundant data means less opportunity to make decisions based on noise.
#### • <u>Improves Accuracy</u>: Less misleading data means modeling accuracy improves.
#### • <u>Reduces training time</u>: Fewer features reduce algorithm complexity, so algorithms train faster.
#### <p style="line-height: 1.5;">Continuing with the work, I prepared the prediction label. If team-1 won the match, I assigned it label 1; if team-2 won, I assigned it label 2. Note that the code below then drops the 'Winning_Team' column again, so the displayed DataFrame keeps only 'Winner' (the name of the winning team) as the outcome column.</p>
#### ----------------------------------------------------------------------------

#### • The prediction label: the "Winning_Team" column shows "1" if Team 1 has won and "2" if Team 2 has won.

In [9]:
df_teams_2010=df_teams_2010.reset_index(drop=True)
df_teams_2010.loc[df_teams_2010.Winner==df_teams_2010.Team_1,'Winning_Team']=1
df_teams_2010.loc[df_teams_2010.Winner==df_teams_2010.Team_2,'Winning_Team']=2
df_teams_2010=df_teams_2010.drop(['Winning_Team'],axis=1)

df_teams_2010.head()


Unnamed: 0,Team_1,Team_2,Winner
0,Bangladesh,Sri Lanka,Sri Lanka
1,India,Sri Lanka,Sri Lanka
2,Bangladesh,India,India
3,Bangladesh,Sri Lanka,Sri Lanka
4,India,Sri Lanka,India
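
#### <p style="line-height: 1.5;">• A minimal sketch of the labelling rule, using three of the rows shown in results.head() and np.where instead of two .loc assignments. This is an alternative way of writing the same idea, not the code used in the notebook; the name toy is invented for the example.</p>

```python
import numpy as np
import pandas as pd

# Rows 0, 2 and 4 of results.head(), reduced to the three relevant columns
toy = pd.DataFrame({
    'Team_1': ['Bangladesh', 'Bangladesh', 'India'],
    'Team_2': ['Sri Lanka', 'India', 'Sri Lanka'],
    'Winner': ['Sri Lanka', 'India', 'India'],
})

# Label 1 when Team_1 won, otherwise 2
toy['Winning_Team'] = np.where(toy['Winner'] == toy['Team_1'], 1, 2)

print(toy['Winning_Team'].tolist())   # [2, 2, 1]
```

#### <p style="line-height: 1.5;">• One caveat of this shortcut: any row where the winner is not Team_1 (including a tie or a no-result, if present in the data) also receives label 2, so such rows would need to be handled separately.</p>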


### <b>–  Converting team-1 and team-2 from categorical variables to numerical inputs</b>

#### <p style="line-height: 1.5;">• I converted team-1 and team-2 from categorical variables to numerical inputs using the pandas function pd.get_dummies. It creates a new dataframe with one indicator column per team and position (for example Team_1_India or Team_2_Sri Lanka). Each indicator holds 1 if that team played in that position in the match and 0 otherwise.</p>
#### • I also split the data into a training set (70%) and a test set (30%).
#### ---------------------------
#### <p style="line-height: 1.5;">• In summary, the code prepares the cricket match data by converting team names into numerical dummy variables, separates the dummy variables as input features and the 'Winner' column as the target variable, and splits the data into training and testing sets for training a model and evaluating its performance.</p>

In [10]:
# Get Dummy Variables
final=pd.get_dummies(df_teams_2010, prefix=['Team_1','Team_2'],columns=['Team_1','Team_2'])

# Separate X and y sets
X=final.drop(['Winner'],axis=1)
y=final['Winner']

# Separate train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=42)


In [11]:
final.head()


Unnamed: 0,Winner,Team_1_Afghanistan,Team_1_Australia,Team_1_Bangladesh,Team_1_Canada,Team_1_England,Team_1_Hong Kong,Team_1_India,Team_1_Ireland,Team_1_Kenya,...,Team_2_Kenya,Team_2_Netherlands,Team_2_New Zealand,Team_2_Pakistan,Team_2_Scotland,Team_2_South Africa,Team_2_Sri Lanka,Team_2_U.A.E.,Team_2_West Indies,Team_2_Zimbabwe
0,Sri Lanka,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,Sri Lanka,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
2,India,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Sri Lanka,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,India,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
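
#### <p style="line-height: 1.5;">• To make the encoding concrete, here is a two-row sketch built from the first two matches shown earlier. The names toy and encoded are invented for the example, and dtype=int is passed only so that the indicators print as 0/1 regardless of the pandas version.</p>

```python
import pandas as pd

# First two matches from results.head(), reduced to three columns
toy = pd.DataFrame({
    'Team_1': ['Bangladesh', 'India'],
    'Team_2': ['Sri Lanka', 'Sri Lanka'],
    'Winner': ['Sri Lanka', 'Sri Lanka'],
})

# One indicator column per (position, team) combination seen in the data
encoded = pd.get_dummies(toy, prefix=['Team_1', 'Team_2'],
                         columns=['Team_1', 'Team_2'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

#### <p style="line-height: 1.5;">• Every row has exactly one 1 among the Team_1_* columns and exactly one 1 among the Team_2_* columns; the 'Winner' column is left untouched.</p>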


## <b><u>What is Random Forest?</u></b>

### <p style="line-height: 1.5;">The <b>random forest</b> combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final prediction of the random forest is made by combining the predictions of the individual trees: averaging them for regression, and for classification averaging the trees' class probabilities (as scikit-learn does) or taking a majority vote.</p>
### <p style="line-height: 1.5;">RFs train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit on the training data.</p>
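
#### <p style="line-height: 1.5;">• The combining step can be checked on synthetic data. The sketch below uses make_classification to generate an artificial dataset (nothing to do with cricket) and compares the forest's predict_proba with the plain average of its trees' predict_proba. X_demo, y_demo, forest and tree_average are names invented for this illustration.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Artificial two-class data, only for demonstration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
tree_average = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)

# The forest's own probabilities should match that average
print(np.allclose(tree_average, forest.predict_proba(X_demo)))
```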
#### ------------------------------------------------------

## <b><u>4. Perform hyperparameter tuning on the best model</u></b>

#### <p style="line-height: 1.5;">• I trained a <b>Random Forest Classifier model (rf)</b> on the training data. The model is configured with <b>100 decision trees</b> and a <b>maximum depth of 20 levels</b>; these hyperparameters are set by hand rather than found by a search. The <b>random_state=0 ensures reproducibility of the results</b>. After training, the accuracy of the model is evaluated on both the training and testing sets.</p>

In [12]:
rf=RandomForestClassifier(n_estimators=100,max_depth=20,
                          random_state=0)

rf.fit(X_train,y_train)

score=rf.score(X_train,y_train)
score2=rf.score(X_test,y_test)

print("Training set accuracy: ",'%.3f'%(score))
print("Test set accuracy: ",'%.3f'%(score2))


Training set accuracy:  0.698
Test set accuracy:  0.616


#### <p style="line-height: 1.5;">• The training set accuracy is found to be 0.698, meaning the model correctly predicts the outcomes of about <b>69.8%</b> of the matches it was trained on. The test set accuracy is 0.616, indicating that the model performs slightly less accurately on unseen data, correctly predicting the outcomes of about <b>61.6%</b> of the matches in the testing set.</p>
#### • The gap of about 8 percentage points between the two scores suggests that the model fits the training data somewhat better than it generalizes to unseen matches.
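
#### <p style="line-height: 1.5;">• The model above uses fixed values for n_estimators and max_depth. One possible way to actually tune them is a cross-validated grid search. The sketch below runs on synthetic data from make_classification, and the grid values are arbitrary choices for illustration, so the printed numbers say nothing about the cricket model.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Artificial data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameter values (arbitrary, for illustration only)
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 20]}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

#### <p style="line-height: 1.5;">• In the notebook, the same call could be made with X_train and y_train, keeping X_test aside for the final accuracy check.</p>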

#### -------------------------------------------------------
### <b><u>– The popularity of the Random Forest model is explained by its various advantages:</u></b>
### • Accurate and efficient when running on large databases.
### • Multiple trees reduce the variance compared with a single tree.
### • More resistant to overfitting than a single decision tree.
### • Can handle thousands of input variables without variable deletion.
### • Can estimate what variables are important in classification.
### • Provides effective methods for estimating missing data.
### • Maintains accuracy when a large proportion of the data is missing.
#### -------------------------------------------------------
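
#### <p style="line-height: 1.5;">• As an example of the "variable importance" point in the list above, a fitted scikit-learn forest exposes a feature_importances_ attribute. The sketch uses synthetic data and invented column names (feature_0 ... feature_5); on the notebook's model the index would be X_train.columns instead.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Artificial data with 6 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
cols = [f'feature_{i}' for i in range(6)]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```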

## <b><u>5. Evaluate the best model on the testing set</u></b>

### • <b>Adding ICC Rankings and fixtures of the match</b>

#### <p style="line-height: 1.5;">• I loaded two more datasets, the ICC rankings and the upcoming World Cup fixtures, and initialized an empty list called "pred_set" that will hold one entry per group stage match to be predicted.</p>

In [13]:
# Loading new datasets
ranking=pd.read_csv('icc_rankings_19.csv')
fixtures=pd.read_csv('fixtures_19.csv')

# List for storing the group stage games
pred_set=[]

#### <p style="line-height: 1.5;">• Next, I added new columns with the ranking position of each team and sliced the dataset to the first 45 games, since there are 45 league stage games in total.</p>
#### <p style="line-height: 1.5;">• The code enhances the 'fixtures' dataset by adding two new columns that represent the ranking positions of 'Team_1' and 'Team_2'. These ranking positions are obtained by mapping the team names to the 'ranking' dataset. The code selects only the group stage games from the 'fixtures' dataset. It does this by taking the first 45 rows of the dataset and keeping all the columns.</p>

In [14]:
# Create new column with ranking position of each team
fixtures.insert(1, 'first_position',fixtures['Team_1'].map(ranking.set_index('Team')['Position']))
fixtures.insert(2, 'second_position',fixtures['Team_2'].map(ranking.set_index('Team')['Position']))

# We only need the group stage games, so we have to slice the dataset
fixtures=fixtures.iloc[:45, :]
fixtures.tail()


Unnamed: 0,Round Number,first_position,second_position,Date,Location,Team_1,Team_2,Group
40,1,1.0,3.0,03/07/19,"Riverside Ground, Chester-le-Street",England,New Zealand,Group A
41,1,10.0,9.0,04/07/19,"Headingley, Leeds",Afghanistan,West Indies,Group A
42,1,6.0,7.0,05/07/19,"Lord's, London",Pakistan,Bangladesh,Group A
43,1,8.0,2.0,06/07/19,"Headingley, Leeds",Sri Lanka,India,Group A
44,1,5.0,4.0,06/07/19,"Emirates Old Trafford, Manchester",Australia,South Africa,Group A
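
#### <p style="line-height: 1.5;">• The lookup used for the two position columns can be seen in isolation on a small example. The positions below are the ones visible in the output above; ranking_demo, fixtures_demo and lookup are names invented for this sketch.</p>

```python
import pandas as pd

# Small excerpt: positions as shown in the fixtures output above
ranking_demo = pd.DataFrame({
    'Team': ['England', 'India', 'New Zealand', 'Sri Lanka'],
    'Position': [1, 2, 3, 8],
})
fixtures_demo = pd.DataFrame({
    'Team_1': ['England', 'Sri Lanka'],
    'Team_2': ['New Zealand', 'India'],
})

# A Series indexed by team name acts as a dictionary for .map()
lookup = ranking_demo.set_index('Team')['Position']
fixtures_demo['first_position'] = fixtures_demo['Team_1'].map(lookup)
fixtures_demo['second_position'] = fixtures_demo['Team_2'].map(lookup)

print(fixtures_demo)
```

#### <p style="line-height: 1.5;">• A team name that is missing from the lookup is mapped to NaN, which forces the whole column to floating point. That is one possible reason the positions above are displayed as 1.0, 3.0 and so on.</p>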


#### • Then I added the teams to a new prediction dataset based on the ranking position of each team.
#### <p style="line-height: 1.5;">• The loop goes through each row of the 'fixtures' dataset and appends a dictionary to 'pred_set', placing the better-ranked team (lower position number) as 'Team_1' and the other as 'Team_2', with the winning team set to None. The 'pred_set' list is then converted to a DataFrame, and its first few rows are displayed. The resulting DataFrame serves as a starting point for making predictions about the outcomes of the group stage matches.</p>


In [15]:
#Loop to add items to the new prediction dataset based on the ranking position of each team.
for index, row in fixtures.iterrows():
    if row['first_position']<row['second_position']:
        pred_set.append({'Team_1':row['Team_1'], 'Team_2':row['Team_2'],'winning_team': None})
    else:
        pred_set.append({'Team_1':row['Team_2'], 'Team_2':row['Team_1'],'winning_team': None})
        
pred_set=pd.DataFrame(pred_set)
backup_pred_set=pred_set
pred_set.head()

Unnamed: 0,Team_1,Team_2,winning_team
0,England,South Africa,
1,Pakistan,West Indies,
2,New Zealand,Sri Lanka,
3,Australia,Afghanistan,
4,South Africa,Bangladesh,


#### <p style="line-height: 1.5;">• After that, I created the dummy variables and added the columns that were missing compared to the model's training dataset.</p>
#### <p style="line-height: 1.5;">• This bit of code performs the necessary data transformations on the 'pred_set' dataset to prepare it for prediction using the trained model. It creates dummy variables for categorical columns, adds missing columns, aligns the column order, and drops the 'Winner' column. The resulting 'pred_set' dataset is now ready to be used for making predictions based on the trained model.</p>

In [16]:
#Get dummy variables and drop winning_team column
pred_set=pd.get_dummies(pred_set,prefix=['Team_1','Team_2'],columns=['Team_1','Team_2'])

#Add missing columns compared to the model's training dataset
missing_cols=set(final.columns)-set(pred_set.columns)
for c in missing_cols:
    pred_set[c]=0
pred_set=pred_set[final.columns]

pred_set=pred_set.drop(['Winner'],axis=1)
pred_set.head()

Unnamed: 0,Team_1_Afghanistan,Team_1_Australia,Team_1_Bangladesh,Team_1_Canada,Team_1_England,Team_1_Hong Kong,Team_1_India,Team_1_Ireland,Team_1_Kenya,Team_1_Netherlands,...,Team_2_Kenya,Team_2_Netherlands,Team_2_New Zealand,Team_2_Pakistan,Team_2_Scotland,Team_2_South Africa,Team_2_Sri Lanka,Team_2_U.A.E.,Team_2_West Indies,Team_2_Zimbabwe
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
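
#### <p style="line-height: 1.5;">• The "add missing columns, then reorder" step can also be written with DataFrame.reindex. This is an alternative sketch, not the notebook's code; train_cols is a shortened, hypothetical stand-in for the training columns, and pred_demo and aligned are names invented for the example.</p>

```python
import pandas as pd

# Hypothetical, shortened stand-in for the training feature columns
train_cols = ['Team_1_England', 'Team_1_India',
              'Team_2_Pakistan', 'Team_2_South Africa']

# One fixture to predict, one-hot encoded on its own
pred_demo = pd.get_dummies(
    pd.DataFrame({'Team_1': ['England'], 'Team_2': ['South Africa']}),
    prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'], dtype=int)

# reindex adds the missing columns (filled with 0) and fixes the column order
aligned = pred_demo.reindex(columns=train_cols, fill_value=0)

print(aligned)
```

#### <p style="line-height: 1.5;">• Keeping the same columns in the same order matters because the model identifies features by their position in the input table.</p>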


## <b><u>6. Interpret the model results</u></b>

### Finally, the code below gets the results for every league stage match.
#### -------------------------------------------------

#### <p style="line-height: 1.5;">• These lines use the trained model to predict the winners of the group stage matches. The code iterates through each match, prints the team names, and displays the predicted winner.</p>

In [17]:
#Group Matches
predictions=rf.predict(pred_set)
for i in range(fixtures.shape[0]):
    print(backup_pred_set.iloc[i,1] + " and " + backup_pred_set.iloc[i,0])
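    # Caution: rf was fitted on the 'Winner' column, so predictions[i] is a team name.
    # The comparison with 1 is then never true, and the else branch (Team_1) is printed.
    if predictions[i] == 1: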
        print("Winner: " + backup_pred_set.iloc[i,1])
    else:
        print("Winner: " + backup_pred_set.iloc[i,0])
    print("")

South Africa and England
Winner: England

West Indies and Pakistan
Winner: Pakistan

Sri Lanka and New Zealand
Winner: New Zealand

Afghanistan and Australia
Winner: Australia

Bangladesh and South Africa
Winner: South Africa

Pakistan and England
Winner: England

Afghanistan and Sri Lanka
Winner: Sri Lanka

South Africa and India
Winner: India

Bangladesh and New Zealand
Winner: New Zealand

West Indies and Australia
Winner: Australia

Sri Lanka and Pakistan
Winner: Pakistan

Bangladesh and England
Winner: England

Afghanistan and New Zealand
Winner: New Zealand

Australia and India
Winner: India

West Indies and South Africa
Winner: South Africa

Sri Lanka and Bangladesh
Winner: Bangladesh

Pakistan and Australia
Winner: Australia

New Zealand and India
Winner: India

West Indies and England
Winner: England

Sri Lanka and Australia
Winner: Australia

Afghanistan and South Africa
Winner: South Africa

Pakistan and India
Winner: India

West Indies and Bangladesh
Winner: Bangladesh

Afg
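
#### <p style="line-height: 1.5;">• Since the aim of the article is to calculate probabilities, it may help to see how a scikit-learn classifier reports them. The sketch below trains a small forest on made-up results between placeholder teams A, B and C (not real data) and reads both the predicted label and the per-team probabilities. All names (toy, model, match, pred, probs) are invented for the example.</p>

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up results between placeholder teams (not real matches)
toy = pd.DataFrame({
    'Team_1': ['A', 'A', 'B', 'A', 'B', 'A'],
    'Team_2': ['B', 'C', 'C', 'B', 'C', 'C'],
    'Winner': ['A', 'A', 'C', 'B', 'B', 'A'],
})
X_demo = pd.get_dummies(toy[['Team_1', 'Team_2']], dtype=int)
y_demo = toy['Winner']

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

match = X_demo.iloc[[0]]           # the encoded fixture "A vs B"
pred = model.predict(match)[0]     # a team name, because y holds team names

# predict_proba returns one column per entry of model.classes_
probs = pd.Series(model.predict_proba(match)[0], index=model.classes_)

print(pred)
print(probs)
```

#### <p style="line-height: 1.5;">• When the target column contains team names, the prediction is itself a team name and can be printed directly; the probabilities could then feed a simulation instead of a single hard pick.</p>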

#### <b><u>The four teams to march to the semi finals are New Zealand, India, England and South Africa.</u></b>

In [18]:
#List of semi-final matches, one tuple per match
semi=[('New Zealand', 'India'),
      ('England','South Africa')]

#### <p style="line-height: 1.5;">• And then I created a function to repeat the above work. This is the <b>final function to predict the winner of ICC Cricket World Cup 2019.</b></p>
#### <p style="line-height: 1.5;">• The code defines a function called clean_and_predict that takes four arguments: matches, ranking, final, and logreg. It cleans the data, prepares it for prediction, and then uses a trained model to make predictions on the prepared data. Despite its name, logreg is simply whichever trained model is passed in; here it is the random forest rf.</p>

In [19]:
def clean_and_predict(matches, ranking, final, logreg):

    # Initialization of auxiliary list for data cleaning
    positions = []

    # Loop to retrieve each team's position according to ICC ranking
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0],'Position'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1],'Position'].iloc[0])
    
    # Creating the DataFrame for prediction
    pred_set = []

    # Initializing iterators for while loop
    i = 0
    j = 0

    # 'i' will be the iterator for the 'positions' list, and 'j' for the list of matches (list of tuples)
    while i < len(positions):
        dict1 = {}

        # If position of first team is better then this team will be the 'Team_1' team, and vice-versa
        if positions[i] < positions[i + 1]:
            dict1.update({'Team_1': matches[j][0], 'Team_2': matches[j][1]})
        else:
            dict1.update({'Team_1': matches[j][1], 'Team_2': matches[j][0]})

        # Append updated dictionary to the list, that will later be converted into a DataFrame
        pred_set.append(dict1)
        i += 2
        j += 1
        
    # Convert list into DataFrame
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables and drop winning_team column
    pred_set = pd.get_dummies(pred_set, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

    # Add missing columns compared to the model's training dataset
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    pred_set = pred_set.drop(['Winner'], axis=1)

    # Predict!
    predictions = logreg.predict(pred_set)
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        if predictions[i] == 1:
            print("Winner: " + backup_pred_set.iloc[i, 1])
        else:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        print("")

#### •<u> I ran the same function for the semi finals.</u>

In [20]:
clean_and_predict(semi, ranking, final, rf)

New Zealand and India
Winner: India

South Africa and England
Winner: England



#### • <u>Hence, the two finalists are India and England, which is not surprising as they are considered the favourites to win this year.</u>

In [21]:
#Finals
finals = [('India', 'England')]

In [22]:
clean_and_predict(finals, ranking, final, rf)

India and England
Winner: England



## <b><u>According to this model,  England is likely to win this World Cup!</u></b>

# <b><u>Summary</u></b>

## <p style="line-height: 1.5;">• In conclusion, this project aimed to show how Machine Learning could be used in a simulation to calculate which team is most likely to win the 2019 World Cup. The goal was to leverage historical data and team rankings to predict the outcomes of the tournament matches.</p>
## <p style="line-height: 1.5;">• I started by collecting and preprocessing the necessary datasets, including past match results, team rankings, and fixture information. Through data exploration and cleaning, I ensured the data was in a suitable format for analysis.</p>
## <p style="line-height: 1.5;">• Next, I applied a machine learning algorithm, specifically a Random Forest Classifier, to train the prediction model. The only input features were the one-hot encoded identities of the two teams; the match date, margin and ground were dropped, and the ICC rankings were used only to decide which side is 'Team_1' in each fixture.</p>
## <p style="line-height: 1.5;">• After training the model, I evaluated its performance using a train-test split approach. The model achieved an accuracy of <b>approximately 69.8%</b> on the training set and <b>61.6%</b> on the test set. These results indicate that the model has learned patterns from the training data and can make reasonable predictions on unseen test data.</p>
## <p style="line-height: 1.5;">• Using the trained model, I then made predictions for the group stage matches of the 2019 Cricket World Cup. I considered the teams' rankings and assigned the better-ranked team as 'Team_1' and the other team as 'Team_2' for each match. The model predicted the winning team for each match based on the trained patterns and provided insights into potential outcomes.</p>
## <p style="line-height: 1.5;">• Overall, this project demonstrates the potential of machine learning in predicting cricket match outcomes. I developed a prediction model that can provide valuable insights and aid in decision-making for the 2019 Cricket World Cup.</p>
#### ----------------------------------------------------------------------------

# <b><u>THANK YOU FOR READING!!!</u></b>