# World Cup 2022 Predictions with Machine Learning
**Author**: [Justin Schubeck](https://www.linkedin.com/in/justinschubeck/) (jschubeck7@gmail.com)\
**Date**: November 15, 2022

---

## Setup

Import the necessary packages. The trailing comments record the package versions used.

In [1]:
import numpy as np # 1.21.5
import pandas as pd # 1.3.5
import matplotlib.pyplot as plt # 3.5.1
import seaborn as sns # 0.11.2
# sklearn # 1.0.2
# scipy # 1.7.3
import warnings

Import the results of international soccer matches played from 6/14/2018 (start of the FIFA 2018 World Cup) to 9/27/2022 (final game before the FIFA 2022 World Cup).

In [2]:
games = pd.read_csv('results.csv')
games

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament
0,6/14/2018,Russia,Saudi Arabia,5,0,FIFA World Cup
1,6/15/2018,Egypt,Uruguay,0,1,FIFA World Cup
2,6/15/2018,Morocco,Iran,0,1,FIFA World Cup
3,6/15/2018,Portugal,Spain,3,3,FIFA World Cup
4,6/16/2018,France,Australia,2,1,FIFA World Cup
...,...,...,...,...,...,...
3617,9/27/2022,Albania,Iceland,1,1,UEFA Nations League
3618,9/27/2022,Norway,Serbia,0,2,UEFA Nations League
3619,9/27/2022,Sweden,Slovenia,1,1,UEFA Nations League
3620,9/27/2022,Kosovo,Cyprus,5,1,UEFA Nations League


Import FIFA International World Rankings from 6/7/2018 to 10/6/2022.

In [3]:
ranks = pd.read_csv('fifa_ranking.csv')
ranks

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
0,145,Afghanistan,AFG,188.00,199.00,5,AFC,6/7/2018
1,146,Afghanistan,AFG,1161.00,1161.00,0,AFC,7/1/2018
2,145,Afghanistan,AFG,1068.00,1068.00,0,AFC,8/16/2018
3,146,Afghanistan,AFG,1068.00,1068.00,1,AFC,9/20/2018
4,145,Afghanistan,AFG,1068.00,1068.00,-1,AFC,10/25/2018
...,...,...,...,...,...,...,...,...
7354,122,Zimbabwe,ZIM,1138.56,1138.44,1,CAF,2/10/2022
7355,122,Zimbabwe,ZIM,1138.56,1138.56,0,CAF,3/31/2022
7356,123,Zimbabwe,ZIM,1138.56,1138.56,1,CAF,6/23/2022
7357,123,Zimbabwe,ZIM,1138.56,1138.56,0,CAF,8/25/2022


---
## Data Cleaning

The date columns are converted from strings to the pandas datetime format.

In [4]:
games['date'] = pd.to_datetime(games['date'])
ranks['rank_date'] = pd.to_datetime(ranks['rank_date'])
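
As a minimal sketch of the conversion above, the date format can also be stated explicitly so that pandas does not have to infer it. The month/day/year layout is an assumption based on the values shown in the tables (for example `6/14/2018`), and `raw`, `parsed`, and `bad` are hypothetical names used only for illustration.

```python
import pandas as pd

# Two date strings in the same layout as the 'date' column shown above.
raw = pd.Series(["6/14/2018", "9/27/2022"])

# An explicit format avoids any ambiguity between month-first and day-first.
parsed = pd.to_datetime(raw, format="%m/%d/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2018-06-14', '2022-09-27']

# With errors="coerce", a string that does not fit the format becomes NaT
# instead of raising, which makes malformed rows easy to find.
bad = pd.to_datetime(pd.Series(["14/6/2018"]), format="%m/%d/%Y", errors="coerce")
print(bad.isna().tolist())  # [True]
```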

Team names that are spelled differently in the two datasets are changed in the rankings so that they match the names used in the match results.

In [5]:
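# Silence FutureWarnings, such as the pandas 1.3 notice that the default
# value of `regex` in Series.str.replace will change.
warnings.simplefilter(action='ignore', category=FutureWarning)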
ranks['country_full'] = ranks['country_full']\
                        .str.replace('IR Iran', 'Iran')\
                        .str.replace('Korea Republic', 'South Korea')\
                        .str.replace('USA', 'United States')\
                        .str.replace('Curacao', 'Curaçao')\
                        .str.replace('FYR Macedonia', 'North Macedonia')\
                        .str.replace('Cabo Verde', 'Cape Verde')\
                        .str.replace('Cape Verde Islands', 'Cape Verde')\
                        .str.replace('St. Vincent / Grenadines', 'Saint Vincent and the Grenadines')\
                        .str.replace('St. Vincent and the Grenadines', 'Saint Vincent and the Grenadines')\
                        .str.replace('Swaziland', 'Eswatini')\
                        .str.replace('Sao Tome e Principe', 'São Tomé and Príncipe')\
                        .str.replace('Türkiye', 'Turkey')\
                        .str.replace('Congo DR', 'DR Congo')\
                        .str.replace('Korea DPR', 'North Korea')\
                        .str.replace('Kyrgyz Republic', 'Kyrgyzstan')\
                        .str.replace('US Virgin Islands', 'United States Virgin Islands')\
                        .str.replace('Côte d\'Ivoire', 'Ivory Coast')\
                        .str.replace('St. Lucia', 'Saint Lucia')\
                        .str.replace('Chinese Taipei', 'Taiwan')\
                        .str.replace('St. Kitts and Nevis', 'Saint Kitts and Nevis')\
                        .str.replace('Brunei Darussalam', 'Brunei')
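
One way to check that the renaming covers every team is to compare the sets of names in both datasets. The sketch below uses small hypothetical frames (`games_toy`, `ranks_toy`) and a hypothetical `name_map` dictionary. It also swaps in `Series.replace` with a dictionary, which replaces whole values exactly, rather than the chained substring `str.replace` calls used in the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
games_toy = pd.DataFrame({
    "home_team": ["Iran", "South Korea", "Turkey"],
    "away_team": ["United States", "Iran", "Ivory Coast"],
})
ranks_toy = pd.DataFrame({
    "country_full": ["IR Iran", "Korea Republic", "USA", "Turkey", "Côte d'Ivoire"],
})

def unmatched_teams(games, ranks):
    """Return team names that appear in the games but not in the rankings."""
    game_teams = set(games["home_team"]) | set(games["away_team"])
    return sorted(game_teams - set(ranks["country_full"]))

before = unmatched_teams(games_toy, ranks_toy)

# Exact, whole-value replacement: names not in the mapping are left untouched.
name_map = {
    "IR Iran": "Iran",
    "Korea Republic": "South Korea",
    "USA": "United States",
    "Côte d'Ivoire": "Ivory Coast",
}
ranks_toy["country_full"] = ranks_toy["country_full"].replace(name_map)

after = unmatched_teams(games_toy, ranks_toy)
print(before)  # ['Iran', 'Ivory Coast', 'South Korea', 'United States']
print(after)   # []
```

An empty list after the mapping suggests that every team in the match results can be found in the rankings by name.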

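The rankings are restructured so that every country has a ranking for each day: each country's rows are resampled to daily frequency, and the days between ranking releases are forward-filled with the most recent values.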

In [7]:
ranks = ranks.set_index(['rank_date'])\
             .groupby(['country_full'], group_keys=False)\
             .resample('D')\
             .first()\
             .fillna(method='ffill')\
             .reset_index()

In [17]:
ranks

Unnamed: 0,rank_date,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation
0,2018-06-07,145.0,Afghanistan,AFG,188.00,199.00,5.0,AFC
1,2018-06-08,145.0,Afghanistan,AFG,188.00,199.00,5.0,AFC
2,2018-06-09,145.0,Afghanistan,AFG,188.00,199.00,5.0,AFC
3,2018-06-10,145.0,Afghanistan,AFG,188.00,199.00,5.0,AFC
4,2018-06-11,145.0,Afghanistan,AFG,188.00,199.00,5.0,AFC
...,...,...,...,...,...,...,...,...
322927,2022-10-02,123.0,Zimbabwe,ZIM,1138.56,1138.56,0.0,CAF
322928,2022-10-03,123.0,Zimbabwe,ZIM,1138.56,1138.56,0.0,CAF
322929,2022-10-04,123.0,Zimbabwe,ZIM,1138.56,1138.56,0.0,CAF
322930,2022-10-05,123.0,Zimbabwe,ZIM,1138.56,1138.56,0.0,CAF
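
The daily expansion can be illustrated on a toy table. This is a minimal sketch with made-up countries and ranks (`toy` and `daily` are hypothetical names), and it resamples a single column with `.ffill()` rather than reproducing the notebook's exact call chain.

```python
import pandas as pd

# Two hypothetical countries with rankings published on irregular dates.
toy = pd.DataFrame({
    "rank_date": pd.to_datetime(
        ["2018-06-07", "2018-06-10", "2018-06-07", "2018-06-09"]
    ),
    "country_full": ["A", "A", "B", "B"],
    "rank": [10, 12, 3, 2],
})

# Within each country, create one row per calendar day and carry the
# most recent published rank forward until the next release.
daily = (
    toy.set_index("rank_date")
       .groupby("country_full")["rank"]
       .resample("D")
       .ffill()
       .reset_index()
)
print(daily)
```

Each country is only filled between its own first and last ranking dates, so a match played after a country's final ranking release would have no daily row to join against.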


The home team's FIFA ranking data on each match date is merged into the game results as new right-most columns. The join keys `rank_date` and `country_full` are dropped afterwards because they duplicate `date` and `home_team`.

In [18]:
df_wc_ranked = games.merge(ranks[['country_full', 
                                  'total_points', 
                                  'previous_points', 
                                  'rank', 
                                  'rank_change', 
                                  'rank_date']], 
                            left_on=['date', 
                                     'home_team'], 
                            right_on=['rank_date', 
                                      'country_full'])\
                    .drop(['rank_date', 'country_full'], 
                          axis=1)

In [20]:
df_wc_ranked

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,total_points,previous_points,rank,rank_change
0,2018-06-14,Russia,Saudi Arabia,5,0,FIFA World Cup,457.00,493.00,70.0,4.0
1,2018-06-15,Egypt,Uruguay,0,1,FIFA World Cup,649.00,636.00,45.0,-1.0
2,2018-06-15,Morocco,Iran,0,1,FIFA World Cup,686.00,681.00,41.0,-1.0
3,2018-06-15,Portugal,Spain,3,3,FIFA World Cup,1274.00,1306.00,4.0,0.0
4,2018-06-16,France,Australia,2,1,FIFA World Cup,1198.00,1166.00,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...
3617,2022-09-27,Albania,Iceland,1,1,UEFA Nations League,1361.81,1361.81,66.0,0.0
3618,2022-09-27,Norway,Serbia,0,2,UEFA Nations League,1488.57,1488.57,36.0,0.0
3619,2022-09-27,Sweden,Slovenia,1,1,UEFA Nations League,1563.44,1563.44,20.0,0.0
3620,2022-09-27,Kosovo,Cyprus,5,1,UEFA Nations League,1183.90,1183.90,106.0,0.0
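
Because `merge` performs an inner join by default, a game whose home team has no ranking row on the match date would be dropped without any message. A minimal sketch of how such rows could be detected with a left join and `indicator=True` follows. The frames are hypothetical, and "Atlantis" is an invented team that is deliberately missing from the rankings.

```python
import pandas as pd

games_toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-14", "2018-06-15"]),
    "home_team": ["Russia", "Atlantis"],  # "Atlantis" is invented
    "away_team": ["Saudi Arabia", "Russia"],
})
ranks_toy = pd.DataFrame({
    "rank_date": pd.to_datetime(["2018-06-14", "2018-06-14", "2018-06-15"]),
    "country_full": ["Russia", "Saudi Arabia", "Russia"],
    "rank": [70, 67, 70],
})

keys = dict(left_on=["date", "home_team"], right_on=["rank_date", "country_full"])

# Default inner join: the unmatched game disappears.
inner = games_toy.merge(ranks_toy, **keys)

# Left join with an indicator column: unmatched games are kept and flagged.
checked = games_toy.merge(ranks_toy, how="left", indicator=True, **keys)
missing_home = checked.loc[checked["_merge"] == "left_only", "home_team"].tolist()

print(len(inner))    # 1
print(missing_home)  # ['Atlantis']
```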


The away team's FIFA ranking data on each match date is merged in the same way. Because the ranking columns now exist on both sides of the merge, the `suffixes` argument labels them `_home` and `_away`, and the duplicate join keys are dropped again.

In [22]:
df_wc_ranked = df_wc_ranked.merge(ranks[['country_full', 
                                         'total_points', 
                                         'previous_points', 
                                         'rank', 
                                         'rank_change', 
                                         'rank_date']], 
                                  left_on=['date', 
                                           'away_team'], 
                                  right_on=['rank_date', 
                                            'country_full'], 
                                  suffixes=('_home', '_away'))\
                            .drop(['rank_date', 
                                   'country_full'], 
                                  axis=1)

In [23]:
df_wc_ranked

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,total_points_home,previous_points_home,rank_home,rank_change_home,total_points_away,previous_points_away,rank_away,rank_change_away
0,2018-06-14,Russia,Saudi Arabia,5,0,FIFA World Cup,457.00,493.00,70.0,4.0,465.00,462.00,67.0,0.0
1,2018-06-15,Egypt,Uruguay,0,1,FIFA World Cup,649.00,636.00,45.0,-1.0,1018.00,976.00,14.0,-3.0
2,2018-06-15,Morocco,Iran,0,1,FIFA World Cup,686.00,681.00,41.0,-1.0,708.00,727.00,37.0,1.0
3,2018-06-15,Portugal,Spain,3,3,FIFA World Cup,1274.00,1306.00,4.0,0.0,1126.00,1162.00,10.0,2.0
4,2018-06-16,France,Australia,2,1,FIFA World Cup,1198.00,1166.00,7.0,0.0,718.00,700.00,36.0,-4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3617,2022-09-27,Albania,Iceland,1,1,UEFA Nations League,1361.81,1361.81,66.0,0.0,1379.61,1379.61,63.0,0.0
3618,2022-09-27,Norway,Serbia,0,2,UEFA Nations League,1488.57,1488.57,36.0,0.0,1549.53,1549.53,25.0,0.0
3619,2022-09-27,Sweden,Slovenia,1,1,UEFA Nations League,1563.44,1563.44,20.0,0.0,1372.48,1372.48,65.0,0.0
3620,2022-09-27,Kosovo,Cyprus,5,1,UEFA Nations League,1183.90,1183.90,106.0,0.0,1180.52,1180.52,108.0,1.0
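
The effect of `suffixes` in the second merge can be seen on a single game. This is a minimal sketch with hypothetical frames that reuses the ranks shown for Russia and Saudi Arabia in the first row of the table above.

```python
import pandas as pd

games_toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-14"]),
    "home_team": ["Russia"],
    "away_team": ["Saudi Arabia"],
})
ranks_toy = pd.DataFrame({
    "rank_date": pd.to_datetime(["2018-06-14", "2018-06-14"]),
    "country_full": ["Russia", "Saudi Arabia"],
    "rank": [70, 67],
})

# First merge: no column names clash, so 'rank' is added without a suffix.
home = games_toy.merge(
    ranks_toy, left_on=["date", "home_team"], right_on=["rank_date", "country_full"]
).drop(columns=["rank_date", "country_full"])

# Second merge: 'rank' exists on both sides, so the suffixes are applied.
both = home.merge(
    ranks_toy,
    left_on=["date", "away_team"],
    right_on=["rank_date", "country_full"],
    suffixes=("_home", "_away"),
).drop(columns=["rank_date", "country_full"])

print(list(both.columns))
# ['date', 'home_team', 'away_team', 'rank_home', 'rank_away']
```

Suffixes are only added to column names that appear in both frames, which is why the first merge leaves the ranking columns unsuffixed until the second merge renames them.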


The cleaned dataframe is assigned the shorter name `df`.

df = df_wc_ranked

## Feature Engineering