# <I>Predicting the FIFA World Cup 2018 Winner with Machine Learning</I>

<b>Aim:</b> To predict the outcome of the FIFA World Cup 2018

<b>Method:</b>
1. Use data from Kaggle to model the outcome of pairings between teams, given their rank, points and weighted point difference relative to the opponent.
2. Use this model to predict the outcome of the group rounds and then the single-elimination phase.

<b>Data</b>

We use three datasets:
1. FIFA rankings from 1993-2018: used to get each team's FIFA rank and points. The ranking is updated monthly and has previously been shown to be a decent predictor of team performance.
2. International football matches from 1872-2018: used to find out how much the differences in points and ranks, and the current rank of each team, affect the outcome of a match.
3. FIFA World Cup 2018 dataset: used to get the upcoming matches.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## <I>Get the Ranking Data</I>

In [2]:
rankings = pd.read_csv('https://raw.githubusercontent.com/rajeevratan84/data-analyst-bootcamp/master/fifa_ranking.csv')
rankings = rankings[['rank','country_full','country_abrv','rank_date','cur_year_avg_weighted','two_year_ago_weighted','three_year_ago_weighted']]

rankings = rankings.replace({'IR Iran':'Iran'})
# Get weighted points by summing the current-year, two-years-ago and three-years-ago weighted values
rankings['weighted_points'] = rankings['cur_year_avg_weighted'] + rankings['two_year_ago_weighted']+ rankings['three_year_ago_weighted']
rankings['rank_date'] = pd.to_datetime(rankings['rank_date'])
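
As a quick sanity check of the sum above, the three components listed for Argentina in the 2018-06-07 ranking (printed later in this notebook) can be added by hand. This is only an illustrative check; the standalone variables below are not part of the original code.

```python
# Components for Argentina on 2018-06-07, copied from the rankings output in this notebook
cur_year_avg_weighted = 404.07
two_year_ago_weighted = 248.99
three_year_ago_weighted = 183.59

# Same sum as the weighted_points column
weighted_points = cur_year_avg_weighted + two_year_ago_weighted + three_year_ago_weighted
print(round(weighted_points, 2))
```

The result agrees with the `weighted_points` value of 836.65 shown for Argentina in the World Cup rankings table.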

## <I>Get the Match Data</I>

In [3]:
matches = pd.read_csv('https://raw.githubusercontent.com/rajeevratan84/data-analyst-bootcamp/master/results.csv')
matches = matches.replace({'Germany DR': 'Germany','China PR':'China'})
matches['date'] = pd.to_datetime(matches['date'])

## <I>Get the World Cup Fixture Data</I>

In [4]:
world_cup = pd.read_csv('https://raw.githubusercontent.com/rajeevratan84/data-analyst-bootcamp/master/WorldCup2018Dataset.csv')
world_cup = world_cup[['Team','Group','First match \nagainst','Second match\n against','Third match\n against']]
world_cup = world_cup.dropna(how = 'all')

world_cup = world_cup.replace({'IRAN':'Iran',
                              'Costarica':'Costa Rica',
                              'Porugal':'Portugal',
                              'Columbia':'Colombia',
                              'Korea':'Korea Republic'})
world_cup = world_cup.set_index('Team')

In [5]:
rankings.head()

Unnamed: 0,rank,country_full,country_abrv,rank_date,cur_year_avg_weighted,two_year_ago_weighted,three_year_ago_weighted,weighted_points
0,1,Germany,GER,1993-08-08,0.0,0.0,0.0,0.0
1,2,Italy,ITA,1993-08-08,0.0,0.0,0.0,0.0
2,3,Switzerland,SUI,1993-08-08,0.0,0.0,0.0,0.0
3,4,Sweden,SWE,1993-08-08,0.0,0.0,0.0,0.0
4,5,Argentina,ARG,1993-08-08,0.0,0.0,0.0,0.0


In [6]:
matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [7]:
world_cup.head()

Team,Group,First match \nagainst,Second match\n against,Third match\n against
Russia,A,Saudi Arabia,Egypt,Uruguay
Saudi Arabia,A,Russia,Uruguay,Egypt
Egypt,A,Uruguay,Russia,Saudi Arabia
Uruguay,A,Egypt,Saudi Arabia,Russia
Portugal,B,Spain,Morocco,Iran


#  <I>Feature extraction</I>

We join the matches with the FIFA rankings of both teams on the match date.

We then extract some features:

- point and rank differences
- whether the game had something at stake (i.e. was not a friendly), on the naive assumption that friendly matches are harder to predict

In [8]:
# We want a rank for every team on every day.
# Resample each country's rankings to daily frequency, then forward-fill
# so that each day carries the most recently published ranking
rankings = rankings.set_index(['rank_date'])\
            .groupby(['country_full'], group_keys=False)\
            .resample('D').first()\
            .ffill()\
            .reset_index()
  
rankings.head()

Unnamed: 0,rank_date,rank,country_full,country_abrv,cur_year_avg_weighted,two_year_ago_weighted,three_year_ago_weighted,weighted_points
0,2003-01-15,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
1,2003-01-16,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
2,2003-01-17,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
3,2003-01-18,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
4,2003-01-19,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
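
To make the daily expansion easier to follow, here is a minimal sketch on a toy frame with a single country and two ranking dates. The dates and rank values are made up for illustration, and the per-country `groupby` is omitted to keep the example small.

```python
import pandas as pd

# Hypothetical rankings: two publication dates three days apart
toy = pd.DataFrame({
    'rank_date': pd.to_datetime(['2018-05-17', '2018-05-20']),
    'country_full': ['Germany', 'Germany'],
    'rank': [1, 2],
})

# Resample to one row per day, then carry the last known ranking forward
daily = (toy.set_index('rank_date')
            .resample('D').first()
            .ffill()
            .reset_index())
print(daily)
```

The days between the two publication dates inherit the earlier rank, which is why `rank` shows up as a float column after the gaps have been filled.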


## <I>Adding FIFA Rankings to Match Data</I>

In [9]:
matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [10]:
# join the ranks
# First we do it for the Home team
matches = matches.merge(rankings, 
                        left_on=['date', 'home_team'], 
                        right_on=['rank_date', 'country_full'])
matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,rank_date,rank,country_full,country_abrv,cur_year_avg_weighted,two_year_ago_weighted,three_year_ago_weighted,weighted_points
0,1993-08-08,Bolivia,Uruguay,3,1,FIFA World Cup qualification,La Paz,Bolivia,False,1993-08-08,59.0,Bolivia,BOL,0.0,0.0,0.0,0.0
1,1993-08-08,Brazil,Mexico,1,1,Friendly,Maceió,Brazil,False,1993-08-08,8.0,Brazil,BRA,0.0,0.0,0.0,0.0
2,1993-08-08,Ecuador,Venezuela,5,0,FIFA World Cup qualification,Quito,Ecuador,False,1993-08-08,35.0,Ecuador,ECU,0.0,0.0,0.0,0.0
3,1993-08-08,Guinea,Sierra Leone,1,0,Friendly,Conakry,Guinea,False,1993-08-08,65.0,Guinea,GUI,0.0,0.0,0.0,0.0
4,1993-08-08,Paraguay,Argentina,1,3,FIFA World Cup qualification,Asunción,Paraguay,False,1993-08-08,67.0,Paraguay,PAR,0.0,0.0,0.0,0.0


In [11]:
# Next we do it for the Away teams
matches = matches.merge(rankings, 
                        left_on=['date', 'away_team'], 
                        right_on=['rank_date', 'country_full'], 
                        suffixes=('_home', '_away'))
matches.head().T

Unnamed: 0,0,1,2,3,4
date,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00
home_team,Bolivia,Brazil,Ecuador,Guinea,Paraguay
away_team,Uruguay,Mexico,Venezuela,Sierra Leone,Argentina
home_score,3,1,5,1,1
away_score,1,1,0,0,3
tournament,FIFA World Cup qualification,Friendly,FIFA World Cup qualification,Friendly,FIFA World Cup qualification
city,La Paz,Maceió,Quito,Conakry,Asunción
country,Bolivia,Brazil,Ecuador,Guinea,Paraguay
neutral,False,False,False,False,False
rank_date_home,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00
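
The two merges can be hard to picture, so here is a small sketch with made-up teams and ranks. It assumes the same column names as above. Note that `merge` defaults to an inner join, so a match is dropped when either team has no ranking for that date.

```python
import pandas as pd

# Hypothetical data: team 'C' has no ranking entry
matches_toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-06-01', '2018-06-01']),
    'home_team': ['A', 'C'],
    'away_team': ['B', 'A'],
})
ranks_toy = pd.DataFrame({
    'rank_date': pd.to_datetime(['2018-06-01', '2018-06-01']),
    'country_full': ['A', 'B'],
    'rank': [10, 20],
})

# First merge attaches the home team's ranking (the match with home team 'C' is dropped)
home = matches_toy.merge(ranks_toy,
                         left_on=['date', 'home_team'],
                         right_on=['rank_date', 'country_full'])

# Second merge attaches the away team's ranking; overlapping columns get suffixes
both = home.merge(ranks_toy,
                  left_on=['date', 'away_team'],
                  right_on=['rank_date', 'country_full'],
                  suffixes=('_home', '_away'))
both['rank_difference'] = both['rank_home'] - both['rank_away']
print(both[['home_team', 'away_team', 'rank_home', 'rank_away', 'rank_difference']])
```

Only the A vs. B match survives both joins, with `rank_home` and `rank_away` created by the suffixes on the second merge.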


In [12]:
# feature generation
matches['rank_difference'] = matches['rank_home'] - matches['rank_away']
matches['average_rank'] = (matches['rank_home'] + matches['rank_away'])/2
matches['point_difference'] = matches['weighted_points_home'] - matches['weighted_points_away']
matches['score_difference'] = matches['home_score'] - matches['away_score']
matches['is_won'] = matches['score_difference'] > 0 # take draw as lost
matches['is_stake'] = matches['tournament'] != 'Friendly'


matches.head().T

Unnamed: 0,0,1,2,3,4
date,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00
home_team,Bolivia,Brazil,Ecuador,Guinea,Paraguay
away_team,Uruguay,Mexico,Venezuela,Sierra Leone,Argentina
home_score,3,1,5,1,1
away_score,1,1,0,0,3
tournament,FIFA World Cup qualification,Friendly,FIFA World Cup qualification,Friendly,FIFA World Cup qualification
city,La Paz,Maceió,Quito,Conakry,Asunción
country,Bolivia,Brazil,Ecuador,Guinea,Paraguay
neutral,False,False,False,False,False
rank_date_home,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00,1993-08-08 00:00:00


## <I>Modeling</I>

In [13]:
from sklearn import linear_model
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import classification_report

# We look at only 4 features
X, y = matches[['average_rank', 'rank_difference', 'point_difference','is_stake']], matches['is_won']

# Create our test and train datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Use a logistic regression model on degree-2 polynomial features
logreg = linear_model.LogisticRegression(C=1e-5)
features = PolynomialFeatures(degree=2)
model = Pipeline([
    ('polynomial_features', features),
    ('logistic_regression', logreg)
])
model = model.fit(X_train, y_train)


predicted = model.predict(X_test)

print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

       False       0.69      0.71      0.70      1904
        True       0.67      0.65      0.66      1733

    accuracy                           0.68      3637
   macro avg       0.68      0.68      0.68      3637
weighted avg       0.68      0.68      0.68      3637
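
To put the 0.68 accuracy in context, it helps to compare it with a baseline that always predicts the majority class. The sketch below uses only the support counts from the report above; the variable names are illustrative.

```python
# Support counts copied from the classification report
support_false = 1904  # home team did not win (draws count as not won)
support_true = 1733   # home team won

# Accuracy of always predicting the more frequent class
baseline_accuracy = max(support_false, support_true) / (support_false + support_true)
print("{:.3f}".format(baseline_accuracy))
```

The majority-class baseline comes out at roughly 0.52, so the model is about 16 percentage points better than always predicting "not won" on this test split.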



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
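
The warning suggests either scaling the data or raising `max_iter`. A minimal sketch of both options follows, assuming a `StandardScaler` step is placed after the polynomial expansion. It runs on synthetic data rather than the real matches, so it only shows the pipeline shape and says nothing about how the scores above would change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the four features (not the real match data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * [50, 40, 300, 1]
y_demo = X_demo[:, 1] + rng.normal(scale=20, size=200) < 0

# Same pipeline as above, with a scaling step and a higher iteration limit
scaled_model = Pipeline([
    ('polynomial_features', PolynomialFeatures(degree=2)),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression(C=1e-5, max_iter=1000)),
])
scaled_model.fit(X_demo, y_demo)
proba = scaled_model.predict_proba(X_demo)
print(proba.shape)
```

Scaling matters here because the polynomial terms of the point differences are several orders of magnitude larger than the rank features, which slows down the solver.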


## <I>World Cup simulation</I>

In [14]:
# let's define a small margin so that we predict a draw when the home-win probability lies between 0.48 and 0.52
margin = 0.020

# let's define the rankings at the time of the World Cup (so we keep only the most recent ranking date)
world_cup_rankings = rankings.loc[(rankings['rank_date'] == rankings['rank_date'].max()) & 
                                    rankings['country_full'].isin(world_cup.index.unique())]
world_cup_rankings = world_cup_rankings.set_index(['country_full'])
world_cup_rankings.head()

country_full,rank_date,rank,country_abrv,cur_year_avg_weighted,two_year_ago_weighted,three_year_ago_weighted,weighted_points
Argentina,2018-06-07,5.0,ARG,404.07,248.99,183.59,836.65
Australia,2018-06-07,36.0,AUS,366.6,98.16,59.79,524.55
Belgium,2018-06-07,3.0,BEL,629.98,158.94,186.58,975.5
Brazil,2018-06-07,2.0,BRA,558.95,168.06,162.38,889.39
Colombia,2018-06-07,16.0,COL,292.09,199.73,166.38,658.2


In [15]:
from itertools import combinations

opponents = ['First match \nagainst', 'Second match\n against', 'Third match\n against']

world_cup['points'] = 0
world_cup['total_prob'] = 0.0  # float, since probabilities are accumulated here

# We iterate through each game in the schedule, doing each group at a time
for group in set(world_cup['Group']):
    print('\n___Starting group {}:___\n'.format(group))

    for home, away in combinations(world_cup.query('Group == "{}"'.format(group)).index, 2):
        print("{} vs. {}: ".format(home, away), end='')
        row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, True]]), columns=X_test.columns)

        # Get features for each team competing in the fixture
        home_rank = world_cup_rankings.loc[home, 'rank']
        home_points = world_cup_rankings.loc[home, 'weighted_points']
        opp_rank = world_cup_rankings.loc[away, 'rank']
        opp_points = world_cup_rankings.loc[away, 'weighted_points']
        row['average_rank'] = (home_rank + opp_rank) / 2
        row['rank_difference'] = home_rank - opp_rank
        row['point_difference'] = home_points - opp_points
        
        # get the predicted probability of the home team winning
        home_win_prob = model.predict_proba(row)[:,1][0]
        world_cup.loc[home, 'total_prob'] += home_win_prob
        world_cup.loc[away, 'total_prob'] += 1-home_win_prob
        
        # Allocate points: 3 for a win, 1 each for a draw
        if home_win_prob <= 0.5 - margin:
            print("{} wins with {:.2f}".format(away, 1-home_win_prob))
            world_cup.loc[away, 'points'] += 3
        elif home_win_prob >= 0.5 + margin:
            print("{} wins with {:.2f}".format(home, home_win_prob))
            world_cup.loc[home, 'points'] += 3
        else:
            print("Draw")
            world_cup.loc[home, 'points'] += 1
            world_cup.loc[away, 'points'] += 1


___Starting group E:___

Brazil vs. Switzerland: Brazil wins with 0.54
Brazil vs. Costa Rica: Brazil wins with 0.63
Brazil vs. Serbia: Brazil wins with 0.67
Switzerland vs. Costa Rica: Switzerland wins with 0.60
Switzerland vs. Serbia: Switzerland wins with 0.65
Costa Rica vs. Serbia: Costa Rica wins with 0.55

___Starting group D:___

Argentina vs. Iceland: Argentina wins with 0.60
Argentina vs. Croatia: Argentina wins with 0.60
Argentina vs. Nigeria: Argentina wins with 0.71
Iceland vs. Croatia: Croatia wins with 0.52
Iceland vs. Nigeria: Iceland wins with 0.62
Croatia vs. Nigeria: Croatia wins with 0.63

___Starting group A:___

Russia vs. Saudi Arabia: Saudi Arabia wins with 0.55
Russia vs. Egypt: Egypt wins with 0.66
Russia vs. Uruguay: Uruguay wins with 0.82
Saudi Arabia vs. Egypt: Egypt wins with 0.64
Saudi Arabia vs. Uruguay: Uruguay wins with 0.82
Egypt vs. Uruguay: Uruguay wins with 0.73

___Starting group G:___

Belgium vs. Panama: Belgium wins with 0.73
Belgium vs. Tunisia
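
The draw-margin rule can be isolated into a small function and checked against Group A. This is a sketch only: `match_points` is a hypothetical helper, and the home-win probabilities are the rounded values implied by the printed output (one minus the printed away-win probability).

```python
def match_points(home_win_prob, margin=0.020):
    """Return (home_points, away_points) under the draw-margin rule."""
    if home_win_prob <= 0.5 - margin:
        return 0, 3   # away team wins
    if home_win_prob >= 0.5 + margin:
        return 3, 0   # home team wins
    return 1, 1       # too close to call: draw

# Group A fixtures with rounded home-win probabilities from the output above
group_a = [
    ('Russia', 'Saudi Arabia', 0.45),
    ('Russia', 'Egypt', 0.34),
    ('Russia', 'Uruguay', 0.18),
    ('Saudi Arabia', 'Egypt', 0.36),
    ('Saudi Arabia', 'Uruguay', 0.18),
    ('Egypt', 'Uruguay', 0.27),
]

table = {}
for home, away, prob in group_a:
    home_pts, away_pts = match_points(prob)
    table[home] = table.get(home, 0) + home_pts
    table[away] = table.get(away, 0) + away_pts
print(table)
```

The tally gives Uruguay 9, Egypt 6, Saudi Arabia 3 and Russia 0, which agrees with the `points` column for Group A in the `world_cup` table.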

# <I>Single-elimination rounds</I>

In [16]:
world_cup

Team,Group,First match \nagainst,Second match\n against,Third match\n against,points,total_prob
Russia,A,Saudi Arabia,Egypt,Uruguay,0,0.977404
Saudi Arabia,A,Russia,Uruguay,Egypt,3,1.092322
Egypt,A,Uruguay,Russia,Saudi Arabia,6,1.564348
Uruguay,A,Egypt,Saudi Arabia,Russia,9,2.365926
Portugal,B,Spain,Morocco,Iran,9,1.908832
Spain,B,Portugal,Iran,Morocco,6,1.754351
Morocco,B,Iran,Portugal,Spain,0,1.113372
Iran,B,Morocco,Spain,Portugal,3,1.223446
France,C,Australia,Peru,Denmark,7,1.661675
Australia,C,France,Denmark,Peru,0,0.978362


In [17]:
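# Hardcode the round-of-16 bracket as row positions in the group-sorted table
# (A1, A2, B1, B2, ...), so that A1 plays B2, C1 plays D2, and so on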
pairing = [0,3,4,7,8,11,12,15,1,2,5,6,9,10,13,14]

world_cup = world_cup.sort_values(by=['Group', 'points', 'total_prob'], ascending=False).reset_index()
next_round_wc = world_cup.groupby('Group').nth([0, 1]) # select the top 2
next_round_wc = next_round_wc.reset_index()
next_round_wc = next_round_wc.loc[pairing]
next_round_wc = next_round_wc.set_index('Team')

next_round_wc.sort_values(by='Group')

Team,Group,First match \nagainst,Second match\n against,Third match\n against,points,total_prob
Uruguay,A,Egypt,Saudi Arabia,Russia,9,2.365926
Egypt,A,Uruguay,Russia,Saudi Arabia,6,1.564348
Spain,B,Portugal,Iran,Morocco,6,1.754351
Portugal,B,Spain,Morocco,Iran,9,1.908832
Denmark,C,Peru,Australia,France,7,1.746665
France,C,Australia,Peru,Denmark,7,1.661675
Croatia,D,Nigeria,Argentina,Iceland,6,1.556164
Argentina,D,Iceland,Croatia,Nigeria,9,1.906269
Brazil,E,Switzerland,Costa Rica,Serbia,9,1.836395
Switzerland,E,Brazil,Serbia,Costa Rica,6,1.707431


In [18]:
finals = ['round_of_16', 'quarterfinal', 'semifinal', 'final']
labels = list()
odds = list()

# for each knockout stage
for f in finals:
    print("\n___Starting of the {}___\n\t".format(f))
    iterations = int(len(next_round_wc) / 2)
    winners = []

    for i in range(iterations):
        # get the teams playing
        home = next_round_wc.index[i*2]
        away = next_round_wc.index[i*2+1]
        print("{} vs. {}: ".format(home,
                                   away), 
                                   end='')
        
        # get the features for each team
        row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, True]]), columns=X_test.columns)
        home_rank = world_cup_rankings.loc[home, 'rank']
        home_points = world_cup_rankings.loc[home, 'weighted_points']
        opp_rank = world_cup_rankings.loc[away, 'rank']
        opp_points = world_cup_rankings.loc[away, 'weighted_points']
        row['average_rank'] = (home_rank + opp_rank) / 2
        row['rank_difference'] = home_rank - opp_rank
        row['point_difference'] = home_points - opp_points

        # Get the winner
        home_win_prob = model.predict_proba(row)[:,1][0]
        
        # Display results (no draws in the knockout rounds, so the threshold is 0.5)
        if home_win_prob <= 0.5:
            print("{0} wins with probability {1:.2f}".format(away, 1-home_win_prob))
            winners.append(away)
        else:
            print("{0} wins with probability {1:.2f}".format(home, home_win_prob))
            winners.append(home)

        # Record a label showing each team's implied decimal odds (1 / probability)
        labels.append("{}({:.2f}) vs. {}({:.2f})".format(world_cup_rankings.loc[home, 'country_abrv'], 
                                                        1/home_win_prob, 
                                                        world_cup_rankings.loc[away, 'country_abrv'], 
                                                        1/(1-home_win_prob)))
        odds.append([home_win_prob, 1-home_win_prob])
                
    next_round_wc = next_round_wc.loc[winners]
    print("\n")


___Starting of the round_of_16___
	
Uruguay vs. Spain: Spain wins with probability 0.54
Denmark vs. Croatia: Denmark wins with probability 0.56
Brazil vs. Mexico: Brazil wins with probability 0.58
Belgium vs. Colombia: Belgium wins with probability 0.60
Egypt vs. Portugal: Portugal wins with probability 0.81
France vs. Argentina: Argentina wins with probability 0.53
Switzerland vs. Germany: Germany wins with probability 0.62
England vs. Poland: Poland wins with probability 0.52



___Starting of the quarterfinal___
	
Spain vs. Denmark: Denmark wins with probability 0.51
Brazil vs. Belgium: Belgium wins with probability 0.52
Portugal vs. Argentina: Portugal wins with probability 0.52
Germany vs. Poland: Germany wins with probability 0.59



___Starting of the semifinal___
	
Denmark vs. Belgium: Belgium wins with probability 0.57
Portugal vs. Germany: Germany wins with probability 0.57



___Starting of the final___
	
Belgium vs. Germany: Germany wins with probability 0.55
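
The `labels` list built in the loop stores `1/probability` for each team, i.e. fair decimal odds with no bookmaker margin. As a worked example, the sketch below rebuilds the label for the final from the rounded probability printed above; `decimal_odds` is a hypothetical helper, not part of the original code.

```python
def decimal_odds(prob):
    """Fair decimal odds implied by a win probability."""
    return 1 / prob

germany_prob = 0.55              # rounded value printed for the final
belgium_prob = 1 - germany_prob  # Belgium is the 'home' side in this fixture

label = "{}({:.2f}) vs. {}({:.2f})".format('BEL', decimal_odds(belgium_prob),
                                           'GER', decimal_odds(germany_prob))
print(label)
```

A probability of 0.55 corresponds to decimal odds of about 1.82, and 0.45 to about 2.22, so the favourite always carries the smaller number in these labels.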




# <I>Actual results for comparison</I>

![](https://i.imgur.com/h1uA9WV.png?1)