# FIFA Prediction
**Objective**: Dev ML model to predict First, Second and Third Place for 2018 FIFA worldcup<br>
**Features**: Kaggle historical data for past matches including friendly games, Eloranking by country<br>
**Features**: Submissions for DBS internal competition (Due 25/06/2018)

## 1) Data prep
Process the history kaggle data from results.csv <br>
Possibly with elorating afterwards

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Read .csv files from kaggle
results = pd.read_csv('datasets/results.csv')

In [3]:
# observe results
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland
1,1873-03-08,England,Scotland,4,2,Friendly,London,England
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland
3,1875-03-06,England,Scotland,2,2,Friendly,London,England
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland


##### Through the exploration of data we need to find the absolute difference in score and the winning team and append the corresponding results to the newly created columns

In [4]:
# Adding new column for winner of each match
winner = []
for i in range(len(results['home_team'])):
    if results['home_score'][i] > results['away_score'][i]:
        winner.append(results['home_team'][i])
    elif results['home_score'][i] < results['away_score'][i]:
        winner.append(results['away_team'][i])
    else:
        winner.append('Tie')
results['winning_team'] = winner

# Adding new column for goal difference in matches
results['goal_difference'] = np.absolute(results['home_score'] - results['away_score'])

# view new sample header
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,Tie,0
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,England,2
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,Scotland,1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,Tie,0
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,Scotland,3


In [5]:
# scope current worldcup team qualifing teams
wc_teams = ['Australia', ' Iran', 'Japan', 'Korea Republic', 
            'Saudi Arabia', 'Egypt', 'Morocco', 'Nigeria', 
            'Senegal', 'Tunisia', 'Costa Rica', 'Mexico', 
            'Panama', 'Argentina', 'Brazil', 'Colombia', 
            'Peru', 'Uruguay', 'Belgium', 'Croatia', 
            'Denmark', 'England', 'France', 'Germany', 
            'Iceland', 'Poland', 'Portugal', 'Russia', 
            'Serbia', 'Spain', 'Sweden', 'Switzerland']

# Filter the 'results' dataframe to show only teams in this years' world cup, from 1930 onwards
# we only care about teams that qualify
df_teams_home = results[results['home_team'].isin(wc_teams)]
df_teams_away = results[results['away_team'].isin(wc_teams)]
df_teams = pd.concat((df_teams_home, df_teams_away))
df_teams.drop_duplicates()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,England,2
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,Tie,0
6,1877-03-03,England,Scotland,1,3,Friendly,London,England,Scotland,2
10,1879-01-18,England,Wales,2,1,Friendly,London,England,England,1
11,1879-04-05,England,Scotland,5,4,Friendly,London,England,England,1
16,1881-02-26,England,Wales,0,1,Friendly,Blackburn,England,Wales,1
17,1881-03-12,England,Scotland,1,6,Friendly,London,England,Scotland,5
24,1883-02-03,England,Wales,5,0,Friendly,London,England,England,5
25,1883-02-24,England,Northern Ireland,7,0,Friendly,Liverpool,England,England,7
26,1883-03-10,England,Scotland,2,3,Friendly,Sheffield,England,Scotland,1


##### Slicing our data from 1930 onwards, due to limitation of elo ranking dataset 

In [6]:
# Loop for creating a new column 'year'
year = []
for row in df_teams['date']:
    year.append(int(row[:4]))
df_teams['match_year'] = year

# Slicing the dataset to see how many matches took place from 1930 onwards (the year of the first ever World Cup)
df_teams30 = df_teams[df_teams.match_year >= 1930]
df_teams30.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference,match_year
1230,1930-01-01,Spain,Czechoslovakia,1,0,Friendly,Barcelona,Spain,Spain,1,1930
1231,1930-01-12,Portugal,Czechoslovakia,1,0,Friendly,Lisbon,Portugal,Portugal,1,1930
1237,1930-02-23,Portugal,France,2,0,Friendly,Porto,Portugal,Portugal,2,1930
1238,1930-03-02,Germany,Italy,0,2,Friendly,Frankfurt am Main,Germany,Italy,2,1930
1240,1930-03-23,France,Switzerland,3,3,Friendly,Colombes,France,Tie,0,1930


##### Dropping unused column

In [7]:
df_teams30 = df_teams30.drop(['date', 'home_score', 'away_score', 'tournament', 'city', 'country', 'goal_difference', 'match_year'], axis=1)
df_teams30.head(5)

Unnamed: 0,home_team,away_team,winning_team
1230,Spain,Czechoslovakia,Spain
1231,Portugal,Czechoslovakia,Portugal
1237,Portugal,France,Portugal
1238,Germany,Italy,Italy
1240,France,Switzerland,Tie


## 2) Building our ML Model
Before building the model, we split the data into x,y (x variable like home vs away pair| y variable who wins)<br>
probably converting y into 1 hot vector to avoid bias

In [8]:
# rename winning team string as integer
df_teams30df_teams  = df_teams30.reset_index(drop=True)
df_teams30.loc[df_teams30.winning_team == df_teams30.home_team, 'winning_team']= 2
df_teams30.loc[df_teams30.winning_team == 'Tie', 'winning_team']= 1
df_teams30.loc[df_teams30.winning_team == df_teams30.away_team, 'winning_team']= 0

df_teams30.head()

Unnamed: 0,home_team,away_team,winning_team
1230,Spain,Czechoslovakia,2
1231,Portugal,Czechoslovakia,2
1237,Portugal,France,2
1238,Germany,Italy,0
1240,France,Switzerland,1


In [9]:
#using sklearn to have 1 liner cross validation
from sklearn.model_selection import train_test_split

# Get dummy variables (get unique pairs of home and away aka unique x)
final = pd.get_dummies(df_teams30, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Separate X and y sets
X = final.drop(['winning_team'], axis=1)
y = final["winning_team"]
y = y.astype('int')

# Separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

##### basic logistic regression

In [10]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
score = logreg.score(X_train, y_train)
score2 = logreg.score(X_test, y_test)

print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Training set accuracy:  0.571
Test set accuracy:  0.555


##### Noticed we only have 50+ % accuracy not very good for a ML model which should at least hit 70. We will be including elo ranking dataset as briefly mentioned above to improve our model accuracy 

## 3) Creating prediction sets from current 2018 data
before final prediction we will have to clean up the dataset and merge accordingly

In [11]:
# Loading new datasets
ranking = pd.read_csv('datasets/fifa_rankings.csv') # Obtained from https://us.soccerway.com/teams/rankings/fifa/?ICID=TN_03_05_01
fixtures = pd.read_csv('datasets/fixtures.csv') # Obtained from https://fixturedownload.com/results/fifa-world-cup-2018

# List for storing the group stage games
pred_set = []

##### include fix ranking within our 2018 group stage fixture

In [12]:
# Create new columns with ranking position of each team
fixtures.insert(1, 'first_position', fixtures['Home Team'].map(ranking.set_index('Team')['Position']))
fixtures.insert(2, 'second_position', fixtures['Away Team'].map(ranking.set_index('Team')['Position']))

# We only need the group stage games, so we have to slice the dataset
# the slice can be read as get till row 48 for all columns
fixtures = fixtures.iloc[:48, :]
fixtures.tail()

Unnamed: 0,Round Number,first_position,second_position,Date,Location,Home Team,Away Team,Group,Result
43,3,6.0,25.0,27/06/2018 21:00,Nizhny Novgorod Stadium,Switzerland,Costa Rica,Group E,
44,3,60.0,10.0,28/06/2018 17:00,Volgograd Stadium,Japan,Poland,Group H,
45,3,28.0,16.0,28/06/2018 17:00,Samara Stadium,Senegal,Colombia,Group H,
46,3,55.0,14.0,28/06/2018 21:00,Saransk Stadium,Panama,Tunisia,Group G,
47,3,13.0,3.0,28/06/2018 21:00,Kaliningrad Stadium,England,Belgium,Group G,


##### predicting which team proceeds to next stage

In [13]:

# Loop to add teams to new prediction dataset based on the ranking position of each team# Loop  
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'home_team': row['Home Team'], 'away_team': row['Away Team'], 'winning_team': None})
    else:
        pred_set.append({'home_team': row['Away Team'], 'away_team': row['Home Team'], 'winning_team': None})
        
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set

pred_set.head()

Unnamed: 0,away_team,home_team,winning_team
0,Saudi Arabia,Russia,
1,Egypt,Uruguay,
2,Morocco,Iran,
3,Spain,Portugal,
4,Australia,France,


## 4)  Deploy Model
Using our previously trained model

In [15]:
# Get dummy variables and drop winning_team column
pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Add missing columns compared to the model's training dataset
missing_cols = set(final.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[final.columns]

# Remove winning team column
pred_set = pred_set.drop(['winning_team'], axis=1)

pred_set.head()

Unnamed: 0,home_team_Afghanistan,home_team_Albania,home_team_Algeria,home_team_Andorra,home_team_Angola,home_team_Argentina,home_team_Armenia,home_team_Aruba,home_team_Australia,home_team_Austria,...,away_team_Venezuela,away_team_Vietnam,away_team_Vietnam Republic,away_team_Wales,away_team_Western Australia,away_team_Yemen,away_team_Yemen DPR,away_team_Yugoslavia,away_team_Zambia,away_team_Zimbabwe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
predictions = logreg.predict(pred_set)
for i in range(fixtures.shape[0]):
    print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
    if predictions[i] == 2:
        print("Winner: " + backup_pred_set.iloc[i, 1])
    elif predictions[i] == 1:
        print("Tie")
    elif predictions[i] == 0:
        print("Winner: " + backup_pred_set.iloc[i, 0])
    print('Probability of ' + backup_pred_set.iloc[i, 1] + ' winning: ', '%.3f'%(logreg.predict_proba(pred_set)[i][2]))
    print('Probability of Tie: ', '%.3f'%(logreg.predict_proba(pred_set)[i][1]))
    print('Probability of ' + backup_pred_set.iloc[i, 0] + ' winning: ', '%.3f'%(logreg.predict_proba(pred_set)[i][0]))
    print("")

Russia and Saudi Arabia
Winner: Russia
Probability of Russia winning:  0.695
Probability of Tie:  0.195
Probability of Saudi Arabia winning:  0.110

Uruguay and Egypt
Winner: Uruguay
Probability of Uruguay winning:  0.578
Probability of Tie:  0.323
Probability of Egypt winning:  0.099

Iran and Morocco
Tie
Probability of Iran winning:  0.315
Probability of Tie:  0.366
Probability of Morocco winning:  0.318

Portugal and Spain
Tie
Probability of Portugal winning:  0.332
Probability of Tie:  0.345
Probability of Spain winning:  0.323

France and Australia
Winner: France
Probability of France winning:  0.605
Probability of Tie:  0.232
Probability of Australia winning:  0.163

Argentina and Iceland
Winner: Argentina
Probability of Argentina winning:  0.821
Probability of Tie:  0.141
Probability of Iceland winning:  0.039

Peru and Denmark
Winner: Peru
Probability of Peru winning:  0.470
Probability of Tie:  0.169
Probability of Denmark winning:  0.361

Croatia and Nigeria
Winner: Croatia
P

##### hard codethe group stage tuple all the way to the finals, this sort of flexibility instead of code driven function allows us to modify who proceed to quater finals so on and so forth based on actual results (Ideally, our model should predict the outcode for every match correctly but hey nothing is perfect right?)

In [17]:
# List of tuples before we arrange the teams in home and away
group_16 = [('Uruguay', 'Spain'),
            ('France', 'Croatia'),
            ('Brazil', 'Mexico'),
            ('England', 'Colombia'),
            ('Portugal', 'Russia'),
            ('Argentina', 'Peru'),
            ('Germany', 'Switzerland'),
            ('Poland', 'Belgium')]

##### Function to clean tuple dataset from fixture and order by ranking (A not so nice approach to give higher ranking teams as homedue to higher win rate)

In [18]:
def clean_and_predict(matches, ranking, final, logreg):

    # Initialization of auxiliary list for data cleaning
    positions = []

    # Loop to retrieve each team's position according to FIFA ranking
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0],'Position'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1],'Position'].iloc[0])
    
    # Creating the DataFrame for prediction
    pred_set = []

    # Initializing iterators for while loop
    i = 0
    j = 0

    # 'i' will be the iterator for the 'positions' list, and 'j' for the list of matches (list of tuples)
    while i < len(positions):
        dict1 = {}

        # If position of first team is better, he will be the 'home' team, and vice-versa
        if positions[i] < positions[i + 1]:
            dict1.update({'home_team': matches[j][0], 'away_team': matches[j][1]})
        else:
            dict1.update({'home_team': matches[j][1], 'away_team': matches[j][0]})

        # Append updated dictionary to the list, that will later be converted into a DataFrame
        pred_set.append(dict1)
        i += 2
        j += 1

    # Convert list into DataFrame
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables and drop winning_team column
    pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

    # Add missing columns compared to the model's training dataset
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    # Remove winning team column
    pred_set = pred_set.drop(['winning_team'], axis=1)

    # Predict!
    predictions = logreg.predict(pred_set)
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        if predictions[i] == 2:
            print("Winner: " + backup_pred_set.iloc[i, 1])
        elif predictions[i] == 1:
            print("Tie")
        elif predictions[i] == 0:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        print('Probability of ' + backup_pred_set.iloc[i, 1] + ' winning: ' , '%.3f'%(logreg.predict_proba(pred_set)[i][2]))
        print('Probability of Tie: ', '%.3f'%(logreg.predict_proba(pred_set)[i][1])) 
        print('Probability of ' + backup_pred_set.iloc[i, 0] + ' winning: ', '%.3f'%(logreg.predict_proba(pred_set)[i][0]))
        print("")

In [19]:
# actual run
clean_and_predict(group_16, ranking, final, logreg)

Spain and Uruguay
Winner: Spain
Probability of Spain winning:  0.644
Probability of Tie:  0.191
Probability of Uruguay winning:  0.165

France and Croatia
Winner: France
Probability of France winning:  0.459
Probability of Tie:  0.229
Probability of Croatia winning:  0.312

Brazil and Mexico
Winner: Brazil
Probability of Brazil winning:  0.687
Probability of Tie:  0.204
Probability of Mexico winning:  0.109

England and Colombia
Winner: England
Probability of England winning:  0.556
Probability of Tie:  0.304
Probability of Colombia winning:  0.140

Portugal and Russia
Tie
Probability of Portugal winning:  0.347
Probability of Tie:  0.355
Probability of Russia winning:  0.299

Argentina and Peru
Winner: Argentina
Probability of Argentina winning:  0.677
Probability of Tie:  0.248
Probability of Peru winning:  0.074

Germany and Switzerland
Winner: Germany
Probability of Germany winning:  0.677
Probability of Tie:  0.186
Probability of Switzerland winning:  0.136

Belgium and Poland
Win

##### based on previous result proceed

In [20]:
# List of matches
quarters = [('Spain', 'France'),
            ('Portugal', 'Argentina'),
            ('Brazil', 'England'),
            ('Germany', 'Belgium')]

In [21]:
clean_and_predict(quarters, ranking, final, logreg)

France and Spain
Winner: France
Probability of France winning:  0.390
Probability of Tie:  0.269
Probability of Spain winning:  0.341

Portugal and Argentina
Tie
Probability of Portugal winning:  0.320
Probability of Tie:  0.343
Probability of Argentina winning:  0.338

Brazil and England
Winner: Brazil
Probability of Brazil winning:  0.501
Probability of Tie:  0.256
Probability of England winning:  0.242

Germany and Belgium
Winner: Germany
Probability of Germany winning:  0.600
Probability of Tie:  0.244
Probability of Belgium winning:  0.156



In [24]:
# List of matches
semi = [('France', 'Brazil'),
        ('Argentina', 'Germany')]

In [25]:
clean_and_predict(semi, ranking, final, logreg)

Brazil and France
Winner: Brazil
Probability of Brazil winning:  0.651
Probability of Tie:  0.181
Probability of France winning:  0.168

Germany and Argentina
Winner: Germany
Probability of Germany winning:  0.460
Probability of Tie:  0.284
Probability of Argentina winning:  0.256



In [26]:
# The final game# The big 
finals = [('Brazil', 'Germany')]

In [27]:
clean_and_predict(finals, ranking, final, logreg)

Germany and Brazil
Winner: Germany
Probability of Germany winning:  0.404
Probability of Tie:  0.242
Probability of Brazil winning:  0.354



## 5) Project Review 

Better handling of draw to proceed in matches after group stage<br>
Also ranking should be used for training and not just sorting teams between home and away<br>
Finally, betting historic data would have been helpful<br><br>

Limitation of this model would be the inability to predict score but instead predicting win or lose<br>
A separate linear regression algorithm would be use to predict the score instead