<a id="0"></a> <br>
 # Table of Contents  
1. [Data set: Analyze results from all matches: 1872-2022](#1)   
    1. [Remove rows with missing data](#1A) 
1. [Data set: Analyze results from FIFA team rankings: 1992-2022](#2) 
1. [Data set: Data for World Cup Groups: 2022](#3)
    1. [Match country names across the 3 data frames](#3A)
1. [Data set: Data from FiveThirtyEight Soccer Power Index (SPI): 2022](#4)     
    1. [Remove target column](#5) 
1. [Feature Scaling](#6)     
1. [First Model](#8)     
    1. [Evaluation Metrics for Training set](#9)     
    1. [Evaluation Metrics for Validation set](#10)     
    1. [First Submission](#11) 
1. [Selecting Models](#12)       
    1. [Helper Functions to Try New Models](#13)      
    1. [Split to the Small Data for Evaluating Models Fast](#14)     
    1. [ML Models](#15)         
        1. [XGBoost](#16)             
            1. [Training](#17)

In [1]:
import pandas as pd
import numpy as np

<a id="1"></a> 
# 1. Analyze results from all matches: 1872-2022

In [2]:
all_results_df = pd.read_csv("../data/kaggle-1872-2022-results.csv")

all_results_df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0.0,0.0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4.0,2.0,Friendly,London,England,False
2,1874-03-07,Scotland,England,2.0,1.0,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2.0,2.0,Friendly,London,England,False
4,1876-03-04,Scotland,England,3.0,0.0,Friendly,Glasgow,Scotland,False


In [3]:
all_results_df.tail()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
44055,2022-09-27,Norway,Serbia,0.0,2.0,UEFA Nations League,Oslo,Norway,False
44056,2022-09-27,Sweden,Slovenia,1.0,1.0,UEFA Nations League,Stockholm,Sweden,False
44057,2022-09-27,Kosovo,Cyprus,5.0,1.0,UEFA Nations League,Pristina,Kosovo,False
44058,2022-09-27,Greece,Northern Ireland,3.0,1.0,UEFA Nations League,Athens,Greece,False
44059,2022-09-30,Fiji,Solomon Islands,,,MSG Prime Minister's Cup,Luganville,Vanuatu,True


In [4]:
all_results_df.dtypes

date           object
home_team      object
away_team      object
home_score    float64
away_score    float64
tournament     object
city           object
country        object
neutral          bool
dtype: object

In [5]:
all_results_df.shape

(44060, 9)

In [6]:
all_results_df.isnull().sum()

date          0
home_team     0
away_team     0
home_score    1
away_score    1
tournament    0
city          0
country       0
neutral       0
dtype: int64

In [7]:
all_results_df[all_results_df.isnull().any(axis=1)]

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
44059,2022-09-30,Fiji,Solomon Islands,,,MSG Prime Minister's Cup,Luganville,Vanuatu,True


<a id="1A"></a> 
## A. Remove rows with missing data


In [8]:
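# drop rows where either score is missing; dropna returns a new frame,
# so assigning to its columns later does not act on a slice of the original
all_results_df = all_results_df.dropna(subset=["home_score", "away_score"])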

In [9]:
all_results_df.isnull().sum()

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64

In [10]:
all_results_df.shape

(44059, 9)

In [11]:
all_results_df["date"] = pd.to_datetime(all_results_df["date"])
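The date conversion can also happen while the file is read. The following is a minimal sketch under stated assumptions, not the notebook's own approach: it uses an in-memory toy CSV (two rows copied from the output above, with fewer columns) in place of the real file, and the name `toy_results_df` is hypothetical.

```python
import io

import pandas as pd

# Toy CSV standing in for ../data/kaggle-1872-2022-results.csv.
# The two rows are taken from the tail() output shown earlier.
toy_csv = io.StringIO(
    "date,home_team,away_team,home_score,away_score\n"
    "2022-09-27,Norway,Serbia,0,2\n"
    "2022-09-30,Fiji,Solomon Islands,,\n"
)

# parse_dates converts the column during the read,
# so no separate pd.to_datetime call is needed afterwards.
toy_results_df = pd.read_csv(toy_csv, parse_dates=["date"])

# Drop rows where either score is missing.
toy_results_df = toy_results_df.dropna(subset=["home_score", "away_score"])

print(toy_results_df.dtypes)
print(toy_results_df.shape)
```

Parsing on load keeps the cleaning steps in one place, at the cost of a slightly slower read on large files.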

<a id="2"></a> 
# 2. Analyze results from FIFA team rankings: 1992-2022


The rankings were scraped with the Python BeautifulSoup package from the [FIFA Men's Rankings website](https://www.fifa.com/fifa-world-ranking/men?dateId=id13792).

In [12]:
team_rankings_df = pd.read_csv('../data/fifa-team-ranks-1992-2022.csv')
team_rankings_df.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
0,1,Germany,GER,57.0,0.0,0,UEFA,1992-12-31
1,96,Syria,SYR,11.0,0.0,0,AFC,1992-12-31
2,97,Burkina Faso,BFA,11.0,0.0,0,CAF,1992-12-31
3,99,Latvia,LVA,10.0,0.0,0,UEFA,1992-12-31
4,100,Burundi,BDI,10.0,0.0,0,CAF,1992-12-31


In [13]:
team_rankings_df.tail()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
63911,74,El Salvador,SLV,1330.51,1333.48,3,CONCACAF,2022-10-06
63912,75,Oman,OMA,1320.29,1323.03,0,AFC,2022-10-06
63913,76,Israel,ISR,1316.55,1316.35,0,UEFA,2022-10-06
63914,78,Georgia,GEO,1307.34,1296.46,-4,UEFA,2022-10-06
63915,211,San Marino,SMR,762.22,763.82,0,UEFA,2022-10-06


In [14]:
team_rankings_df.dtypes

rank                 int64
country_full        object
country_abrv        object
total_points       float64
previous_points    float64
rank_change          int64
confederation       object
rank_date           object
dtype: object

In [15]:
team_rankings_df.shape

(63916, 8)

In [16]:
team_rankings_df.isnull().sum()

rank               0
country_full       0
country_abrv       0
total_points       0
previous_points    0
rank_change        0
confederation      0
rank_date          0
dtype: int64

In [18]:
# convert rank_date to datetime; it is the key used to align rankings with the match results dataframe
team_rankings_df["rank_date"] = pd.to_datetime(team_rankings_df["rank_date"])

In [20]:
team_rankings_df.sort_values(by='rank_date').head(300)

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
0,1,Germany,GER,57.0,0.0,0,UEFA,1992-12-31
95,74,Madagascar,MAD,18.0,0.0,0,CAF,1992-12-31
96,2,Italy,ITA,57.0,0.0,0,UEFA,1992-12-31
97,3,Brazil,BRA,56.0,0.0,0,CONMEBOL,1992-12-31
98,4,Sweden,SWE,56.0,0.0,0,UEFA,1992-12-31
...,...,...,...,...,...,...,...,...
199,151,Cuba,CUB,1.0,1.0,7,CONCACAF,1993-08-08
200,78,Syria,SYR,21.0,11.0,-18,AFC,1993-08-08
201,136,St. Lucia,LCA,4.0,6.0,18,CONCACAF,1993-08-08
202,115,Suriname,SUR,9.0,12.0,20,CONCACAF,1993-08-08


<div class="alert alert-block alert-info"> <b>Note:</b> We truncate the results data to start on 1992-12-31, the date of the first ranking, so that both data sets cover the same period. </div>

In [22]:
team_rankings_df = team_rankings_df[(team_rankings_df["rank_date"] >= "1992-12-31")].reset_index(drop=True)

In [25]:
all_results_df = all_results_df[(all_results_df["date"] >= "1992-12-31")].reset_index(drop=True)
all_results_df.sort_values(by='date').head(300)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1993-01-01,Ghana,Mali,1.0,1.0,Friendly,Libreville,Gabon,True
1,1993-01-02,Gabon,Burkina Faso,1.0,1.0,Friendly,Libreville,Gabon,False
2,1993-01-02,Kuwait,Lebanon,2.0,0.0,Friendly,Kuwait City,Kuwait,False
3,1993-01-03,Burkina Faso,Mali,1.0,0.0,Friendly,Libreville,Gabon,True
4,1993-01-03,Gabon,Ghana,2.0,3.0,Friendly,Libreville,Gabon,False
...,...,...,...,...,...,...,...,...,...
293,1993-05-07,Bangladesh,Sri Lanka,3.0,0.0,FIFA World Cup qualification,Dubai,United Arab Emirates,True
294,1993-05-07,Hong Kong,Bahrain,2.0,1.0,FIFA World Cup qualification,Beirut,Lebanon,True
297,1993-05-08,United States,Colombia,1.0,2.0,Friendly,Miami,United States,False
298,1993-05-09,Bahrain,South Korea,0.0,0.0,FIFA World Cup qualification,Beirut,Lebanon,True
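The cutoff date above is hard-coded. As a hedged alternative, the sketch below derives it from the earliest ranking date, so the two frames stay aligned if the rankings file changes. The toy frames and the names `toy_rankings_df`, `toy_results_df`, `cutoff`, and `truncated_df` are illustrative assumptions; only the column names follow the notebook.

```python
import pandas as pd

# Toy frames: column names follow the notebook, rows are illustrative.
toy_rankings_df = pd.DataFrame(
    {
        "country_full": ["Germany", "Germany"],
        "rank_date": pd.to_datetime(["1992-12-31", "1993-08-08"]),
    }
)
toy_results_df = pd.DataFrame(
    {
        "home_team": ["Scotland", "Ghana"],
        "away_team": ["England", "Mali"],
        "date": pd.to_datetime(["1872-11-30", "1993-01-01"]),
    }
)

# The earliest ranking date becomes the cutoff for the match results.
cutoff = toy_rankings_df["rank_date"].min()

# Keep only matches played on or after the cutoff.
truncated_df = toy_results_df[toy_results_df["date"] >= cutoff].reset_index(drop=True)

print(cutoff.date())
print(truncated_df)
```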


In [30]:
#rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().fillna(method='ffill').reset_index()
#print(rank.shape)
#print(rank.dtypes)


In [31]:
#print(rank["country_full"].value_counts().sort_index().to_string())
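The commented-out cells above appear to expand the rankings to one row per day and forward-fill the gaps. One alternative for attaching the most recent ranking to each match, without building the daily table, is `pd.merge_asof`. The sketch below is an assumption about how that could look, not the notebook's method: the ranking rows for Syria and Germany are taken from the outputs above, while the two match dates and the frame names `rankings`, `matches`, and `with_rank` are made up for illustration.

```python
import pandas as pd

# Ranking rows copied from the outputs above.
rankings = pd.DataFrame(
    {
        "rank_date": pd.to_datetime(["1992-12-31", "1993-08-08", "1992-12-31"]),
        "country_full": ["Syria", "Syria", "Germany"],
        "rank": [96, 78, 1],
    }
)

# Hypothetical match dates, chosen to fall between ranking releases.
matches = pd.DataFrame(
    {
        "date": pd.to_datetime(["1993-05-01", "1993-09-01"]),
        "home_team": ["Syria", "Syria"],
    }
)

# merge_asof requires both frames to be sorted by their time keys.
rankings = rankings.sort_values("rank_date")
matches = matches.sort_values("date")

# For each match, take the latest ranking of the home team
# published on or before the match date.
with_rank = pd.merge_asof(
    matches,
    rankings,
    left_on="date",
    right_on="rank_date",
    left_by="home_team",
    right_by="country_full",
    direction="backward",
)

print(with_rank[["date", "home_team", "rank_date", "rank"]])
```

The same call with `left_by="away_team"` would attach the away team's ranking. Matches played before a team's first ranking get missing values rather than an error.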

<a id="3"></a> 
# 3. Data for World Cup Groups: 2022

In [50]:
world_cup_groups_df = pd.read_csv("../data/qatar-2022-groups.csv")

world_cup_groups_df.head()

Unnamed: 0,Group,Flag_Image,Team,Country_Name_Short,First match against,Second match against,Third match against
0,A,https://cloudinary.fifa.com/api/v3/picture/fla...,Qatar,QAT,Ecuador,Senegal,Netherlands
1,A,https://cloudinary.fifa.com/api/v3/picture/fla...,Ecuador,ECU,Qatar,Netherlands,Senegal
2,A,https://cloudinary.fifa.com/api/v3/picture/fla...,Senegal,SEN,Netherlands,Qatar,Ecuador
3,A,https://cloudinary.fifa.com/api/v3/picture/fla...,Netherlands,NED,Senegal,Ecuador,Qatar
4,B,https://cloudinary.fifa.com/api/v3/picture/fla...,England,ENG,Iran,USA,Wales


In [36]:
world_cup_groups_df.columns

Index(['Group', 'Flag_Image', 'Team', 'Country_Name_Short',
       'First match against', 'Second match against', 'Third match against'],
      dtype='object')

In [37]:
world_cup_groups_df.shape

(32, 7)

<a id="3A"></a> 
## A. Match country names across the 3 data frames

In [41]:
np.array(sorted(world_cup_groups_df["Team"].unique()))

array(['Argentina', 'Australia', 'Belgium', 'Brazil', 'Cameroon',
       'Canada', 'Costa Rica', 'Croatia', 'Denmark', 'Ecuador', 'England',
       'France', 'Germany', 'Ghana', 'Iran', 'Japan', 'Korea Republic',
       'Mexico', 'Morocco', 'Netherlands', 'Poland', 'Portugal', 'Qatar',
       'Saudi Arabia', 'Senegal', 'Serbia', 'Spain', 'Switzerland',
       'Tunisia', 'USA', 'Uruguay', 'Wales'], dtype='<U14')

In [42]:
np.array(sorted(all_results_df["home_team"].unique()))

array(['Abkhazia', 'Afghanistan', 'Albania', 'Alderney', 'Algeria',
       'American Samoa', 'Andalusia', 'Andorra', 'Angola', 'Anguilla',
       'Antigua and Barbuda', 'Arameans Suryoye', 'Argentina', 'Armenia',
       'Artsakh', 'Aruba', 'Australia', 'Austria', 'Aymara', 'Azerbaijan',
       'Bahamas', 'Bahrain', 'Bangladesh', 'Barawa', 'Barbados',
       'Basque Country', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Biafra', 'Bolivia', 'Bonaire',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brittany', 'Brunei',
       'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cambodia', 'Cameroon', 'Canada', 'Canary Islands', 'Cape Verde',
       'Cascadia', 'Catalonia', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Chagos Islands', 'Chameria',
       'Chile', 'China PR', 'Colombia', 'Comoros', 'Congo',
       'Cook Islands', 'Corsica', 'Costa Rica', 'County of Nice',
       'Croatia', 'Cub

In [43]:
np.array(sorted(team_rankings_df["country_full"].unique()))

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
       'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
       'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Cape Verde Islands', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Chile', 'China PR',
       'Chinese Taipei', 'Colombia', 'Comoros', 'Congo', 'Congo DR',
       'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao',
       'Curaçao', 'Cyprus', 'Czech Republic', 'Czechoslovakia',
       "Côte d'Ivoire", 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England',
       'Equato
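Comparing the three sorted arrays by eye is error-prone. A minimal sketch of finding the mismatches with a set difference follows. The helper name `missing_names` is hypothetical, and the short lists are toy data rather than the full columns.

```python
def missing_names(reference_names, candidate_names):
    """Return the names in reference_names that are absent from candidate_names, sorted."""
    return sorted(set(reference_names) - set(candidate_names))


# Toy lists for illustration; the real inputs would be the DataFrame columns.
groups_teams = ["Argentina", "Korea Republic", "USA", "Wales"]
results_teams = ["Argentina", "South Korea", "United States", "Wales"]

unmatched = missing_names(groups_teams, results_teams)
print(unmatched)  # → ['Korea Republic', 'USA']
```

In the notebook the call could look like `missing_names(world_cup_groups_df["Team"], all_results_df["home_team"])`, repeated against `team_rankings_df["country_full"]`. An empty list would mean every World Cup team has a matching name.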

In [53]:
# correct the country names for 2 countries so they match the other data frames
world_cup_groups_df = world_cup_groups_df.replace({"Korea Republic" : "South Korea", "USA": "United States"})

In [54]:
world_cup_groups_df = world_cup_groups_df.set_index('Team')
world_cup_groups_df.head()

Team,Group,Flag_Image,Country_Name_Short,First match against,Second match against,Third match against
Qatar,A,https://cloudinary.fifa.com/api/v3/picture/fla...,QAT,Ecuador,Senegal,Netherlands
Ecuador,A,https://cloudinary.fifa.com/api/v3/picture/fla...,ECU,Qatar,Netherlands,Senegal
Senegal,A,https://cloudinary.fifa.com/api/v3/picture/fla...,SEN,Netherlands,Qatar,Ecuador
Netherlands,A,https://cloudinary.fifa.com/api/v3/picture/fla...,NED,Senegal,Ecuador,Qatar
England,B,https://cloudinary.fifa.com/api/v3/picture/fla...,ENG,Iran,United States,Wales


<a id="4"></a> 
# 4. Data set: Data from FiveThirtyEight Soccer Power Index (SPI): 2022

SPI data from the [fivethirtyeight.com website](https://fivethirtyeight.com/features/how-our-2022-world-cup-predictions-work/).

In [56]:
spi_ratings_df = pd.read_csv("../data/fivethirtyeight-spi-index.csv")

spi_ratings_df.head()

Unnamed: 0,forecast_timestamp,team,group,spi,global_o,global_d,sim_wins,sim_ties,sim_losses,sim_goal_diff,...,group_1,group_2,group_3,group_4,make_round_of_16,make_quarters,make_semis,make_final,win_league,timestamp
0,2022-11-20 18:01:09 UTC,Brazil,G,93.54699,3.22213,0.29634,2.11717,0.59686,0.28597,4.46233,...,0.72109,0.19069,0.06774,0.02048,0.91178,0.68446,0.46037,0.32259,0.21689,2022-11-20 18:02:33 UTC
1,2022-11-20 18:01:09 UTC,Spain,E,89.50604,2.80203,0.38627,1.76786,0.69627,0.53587,2.89908,...,0.47131,0.33878,0.15173,0.03818,0.81009,0.56018,0.30576,0.19005,0.10784,2022-11-20 18:02:33 UTC
2,2022-11-20 18:01:09 UTC,France,D,87.70516,2.77362,0.47923,1.78685,0.73514,0.47801,2.97987,...,0.55701,0.27156,0.12059,0.05084,0.82857,0.54091,0.3299,0.17365,0.08682,2022-11-20 18:02:33 UTC
3,2022-11-20 18:01:09 UTC,Argentina,C,87.20776,2.62755,0.4317,1.83665,0.73167,0.43168,3.17458,...,0.5965,0.24498,0.11228,0.04624,0.84148,0.53055,0.32701,0.15601,0.0838,2022-11-20 18:02:33 UTC
4,2022-11-20 18:01:09 UTC,Portugal,H,87.77456,2.78861,0.48293,1.74272,0.74756,0.50972,2.80294,...,0.53465,0.27828,0.13315,0.05392,0.81293,0.46101,0.26312,0.15374,0.07754,2022-11-20 18:02:33 UTC


In [57]:
np.array(sorted(spi_ratings_df["team"].unique()))

array(['Argentina', 'Australia', 'Belgium', 'Brazil', 'Cameroon',
       'Canada', 'Costa Rica', 'Croatia', 'Denmark', 'Ecuador', 'England',
       'France', 'Germany', 'Ghana', 'Iran', 'Japan', 'Mexico', 'Morocco',
       'Netherlands', 'Poland', 'Portugal', 'Qatar', 'Saudi Arabia',
       'Senegal', 'Serbia', 'South Korea', 'Spain', 'Switzerland',
       'Tunisia', 'USA', 'Uruguay', 'Wales'], dtype='<U12')

In [59]:
# apply the same country-name mapping to the SPI data; only "USA" changes here, since this file already uses "South Korea"
spi_ratings_df = spi_ratings_df.replace({"Korea Republic" : "South Korea", "USA": "United States"})
np.array(sorted(spi_ratings_df["team"].unique()))

array(['Argentina', 'Australia', 'Belgium', 'Brazil', 'Cameroon',
       'Canada', 'Costa Rica', 'Croatia', 'Denmark', 'Ecuador', 'England',
       'France', 'Germany', 'Ghana', 'Iran', 'Japan', 'Mexico', 'Morocco',
       'Netherlands', 'Poland', 'Portugal', 'Qatar', 'Saudi Arabia',
       'Senegal', 'Serbia', 'South Korea', 'Spain', 'Switzerland',
       'Tunisia', 'United States', 'Uruguay', 'Wales'], dtype='<U13')
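The same mapping is applied to both the groups data and the SPI data. One way to keep the two in sync is to define the mapping once and reuse it. This is a sketch only: `TEAM_NAME_FIXES`, `normalize_team_names`, and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd

# Single source of truth for the name corrections used in the notebook.
TEAM_NAME_FIXES = {"Korea Republic": "South Korea", "USA": "United States"}


def normalize_team_names(df, fixes=TEAM_NAME_FIXES):
    """Return a new frame in which every cell equal to a key in fixes is replaced."""
    return df.replace(fixes)


# Toy frames mirroring the two naming schemes seen above.
toy_groups_df = pd.DataFrame({"Team": ["Korea Republic", "USA"], "Group": ["H", "B"]})
toy_spi_df = pd.DataFrame({"team": ["South Korea", "USA"], "group": ["H", "B"]})

print(normalize_team_names(toy_groups_df))
print(normalize_team_names(toy_spi_df))
```

Because `DataFrame.replace` works on every cell, opponent columns such as `First match against` are corrected along with the team column. Keys that do not occur in a frame are ignored.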

In [60]:
# index by country 
spi_ratings_df = spi_ratings_df.set_index('team')
spi_ratings_df.head()

team,forecast_timestamp,group,spi,global_o,global_d,sim_wins,sim_ties,sim_losses,sim_goal_diff,goals_scored,...,group_1,group_2,group_3,group_4,make_round_of_16,make_quarters,make_semis,make_final,win_league,timestamp
Brazil,2022-11-20 18:01:09 UTC,G,93.54699,3.22213,0.29634,2.11717,0.59686,0.28597,4.46233,6.3048,...,0.72109,0.19069,0.06774,0.02048,0.91178,0.68446,0.46037,0.32259,0.21689,2022-11-20 18:02:33 UTC
Spain,2022-11-20 18:01:09 UTC,E,89.50604,2.80203,0.38627,1.76786,0.69627,0.53587,2.89908,5.47503,...,0.47131,0.33878,0.15173,0.03818,0.81009,0.56018,0.30576,0.19005,0.10784,2022-11-20 18:02:33 UTC
France,2022-11-20 18:01:09 UTC,D,87.70516,2.77362,0.47923,1.78685,0.73514,0.47801,2.97987,5.31621,...,0.55701,0.27156,0.12059,0.05084,0.82857,0.54091,0.3299,0.17365,0.08682,2022-11-20 18:02:33 UTC
Argentina,2022-11-20 18:01:09 UTC,C,87.20776,2.62755,0.4317,1.83665,0.73167,0.43168,3.17458,5.33762,...,0.5965,0.24498,0.11228,0.04624,0.84148,0.53055,0.32701,0.15601,0.0838,2022-11-20 18:02:33 UTC
Portugal,2022-11-20 18:01:09 UTC,H,87.77456,2.78861,0.48293,1.74272,0.74756,0.50972,2.80294,5.17278,...,0.53465,0.27828,0.13315,0.05392,0.81293,0.46101,0.26312,0.15374,0.07754,2022-11-20 18:02:33 UTC
