# FIFA Prediction
**Objective**: Dev ML model to predict First, Second and Third Place for 2018 FIFA worldcup<br>
**Features**: Kaggle historical data for past matches including friendly games, Eloranking by country<br>
**Purpose**: Submissions for DBS internal competition (Due 25/06/2018)

## 1) Data prep
Process the history kaggle data from results.csv (1930 onwards due to limitation of elorating data)<br>
Possible integration with elorating (Used custom javascript to crawl data from https://www.eloratings.net/)

In [1]:
# import libraries for data manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# example making new class predictions for a classification problem
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, TimeDistributed, AveragePooling1D, Flatten
from keras.utils import to_categorical
from keras.optimizers import Adam, RMSprop

# back up model graph
from keras.models import load_model

# using sklearn to have 1 liner cross validation
from sklearn.model_selection import train_test_split

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
# Read .csv files from kaggle
results = pd.read_csv('datasets/results.csv')

In [3]:
# observe results
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland
1,1873-03-08,England,Scotland,4,2,Friendly,London,England
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland
3,1875-03-06,England,Scotland,2,2,Friendly,London,England
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland


##### Through the exploration of data we need to find the absolute difference in score and the winning team<br>Append the corresponding results to the newly created columns [wining_team]<br>And finally, keep data of teams that make it to the group stage while dropping the rest

In [4]:
# Adding new column for winner of each match
winner = []
for i in range(len(results['home_team'])):
    if results['home_score'][i] > results['away_score'][i]:
        winner.append(results['home_team'][i])
    elif results['home_score'][i] < results['away_score'][i]:
        winner.append(results['away_team'][i])
    else:
        winner.append('Tie')
results['winning_team'] = winner

# Adding new column for goal difference in matches
results['goal_difference'] = np.absolute(results['home_score'] - results['away_score'])

# view new sample header
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,Tie,0
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,England,2
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,Scotland,1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,Tie,0
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,Scotland,3


In [5]:
# scope current worldcup team qualifing teams
wc_teams = ['Australia', ' Iran', 'Japan', 'Korea Republic', 
            'Saudi Arabia', 'Egypt', 'Morocco', 'Nigeria', 
            'Senegal', 'Tunisia', 'Costa Rica', 'Mexico', 
            'Panama', 'Argentina', 'Brazil', 'Colombia', 
            'Peru', 'Uruguay', 'Belgium', 'Croatia', 
            'Denmark', 'England', 'France', 'Germany', 
            'Iceland', 'Poland', 'Portugal', 'Russia', 
            'Serbia', 'Spain', 'Sweden', 'Switzerland']

# Filter the 'results' dataframe to show only teams in this years' world cup, from 1930 onwards
# we only care about teams that qualify
df_teams_home = results[results['home_team'].isin(wc_teams)]
df_teams_away = results[results['away_team'].isin(wc_teams)]
df_teams = pd.concat((df_teams_home, df_teams_away))
df_teams.drop_duplicates()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,England,2
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,Tie,0
6,1877-03-03,England,Scotland,1,3,Friendly,London,England,Scotland,2
10,1879-01-18,England,Wales,2,1,Friendly,London,England,England,1
11,1879-04-05,England,Scotland,5,4,Friendly,London,England,England,1
16,1881-02-26,England,Wales,0,1,Friendly,Blackburn,England,Wales,1
17,1881-03-12,England,Scotland,1,6,Friendly,London,England,Scotland,5
24,1883-02-03,England,Wales,5,0,Friendly,London,England,England,5
25,1883-02-24,England,Northern Ireland,7,0,Friendly,Liverpool,England,England,7
26,1883-03-10,England,Scotland,2,3,Friendly,Sheffield,England,Scotland,1


##### As mentioned earlier, slicing our data from 1930 onwards, due to limitation of elo ranking dataset

In [6]:
# Loop for creating a new column 'year'
year = []
for row in df_teams['date']:
    year.append(int(row[:4]))
df_teams['match_year'] = year

# Slicing the dataset to see how many matches took place from 1930 onwards (the year of the first ever World Cup)
df_teams30 = df_teams[df_teams.match_year >= 1930]
df_teams30.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,winning_team,goal_difference,match_year
1230,1930-01-01,Spain,Czechoslovakia,1,0,Friendly,Barcelona,Spain,Spain,1,1930
1231,1930-01-12,Portugal,Czechoslovakia,1,0,Friendly,Lisbon,Portugal,Portugal,1,1930
1237,1930-02-23,Portugal,France,2,0,Friendly,Porto,Portugal,Portugal,2,1930
1238,1930-03-02,Germany,Italy,0,2,Friendly,Frankfurt am Main,Germany,Italy,2,1930
1240,1930-03-23,France,Switzerland,3,3,Friendly,Colombes,France,Tie,0,1930


##### Dropping unused column to reduce dimension needed for training (Speed up training)

In [7]:
df_teams30 = df_teams30.drop(['date', 'home_score', 'away_score', 'tournament', 'city', 'country', 'goal_difference', 'match_year'], axis=1)
df_teams30.head(5)

Unnamed: 0,home_team,away_team,winning_team
1230,Spain,Czechoslovakia,Spain
1231,Portugal,Czechoslovakia,Portugal
1237,Portugal,France,Portugal
1238,Germany,Italy,Italy
1240,France,Switzerland,Tie


##### Map ELO rating based on team name, a major mistake made during the collection of ELO rating <br>(Only qualifier data was collected, in order to have a balance historical view even for teams that did not make it to the qualifiers we only took 2018 rating for missing countries)

In [8]:
# Read .csv files from elo rating
elorating = pd.read_csv('datasets/EloRating.csv', encoding = 'ISO-8859-1')

ELODict={}
for index, row in elorating.iterrows():
    ELODict[row["Team"]]= row["Rating"]

# map rating information into our dataframe    
df_teams30['home_team_rating']=df_teams30['home_team'].map(ELODict)
df_teams30['away_team_rating']=df_teams30['away_team'].map(ELODict)
df_teams30['rating_diff'] = df_teams30['home_team_rating'] - df_teams30['away_team_rating']
df_teams30['rating_diff'] = df_teams30['rating_diff'].astype('int')

df_teams30.head()

Unnamed: 0,home_team,away_team,winning_team,home_team_rating,away_team_rating,rating_diff
1230,Spain,Czechoslovakia,Spain,2038,1882,156
1231,Portugal,Czechoslovakia,Portugal,1976,1882,94
1237,Portugal,France,Portugal,1976,1999,-23
1238,Germany,Italy,Italy,2077,1850,227
1240,France,Switzerland,Tie,1999,1890,109


## 2) Building our ML Model
Before building the model, we split the data into x,y (X variable like home vs away pair and Y variable who wins)<br>
probably should convert Y into 1 hot vector to avoid bias<br>
Also swap out X variable of country name into ID via hashmap (Keras Input limitation)

In [9]:
# rename winning team string as integer
df_teams30 = df_teams30.reset_index(drop=True)
df_teams30.loc[df_teams30.winning_team == df_teams30.home_team, 'winning_team']= 0
df_teams30.loc[df_teams30.winning_team == 'Tie', 'winning_team']= 2
df_teams30.loc[df_teams30.winning_team == df_teams30.away_team, 'winning_team']= 1 
df_teams30.head()

Unnamed: 0,home_team,away_team,winning_team,home_team_rating,away_team_rating,rating_diff
0,Spain,Czechoslovakia,0,2038,1882,156
1,Portugal,Czechoslovakia,0,1976,1882,94
2,Portugal,France,0,1976,1999,-23
3,Germany,Italy,1,2077,1850,227
4,France,Switzerland,2,1999,1890,109


In [10]:
# convert all teams into unique ID via dictionary
finalX = df_teams30['home_team'].append(df_teams30['away_team'])
CountryDict = dict([(y,x+1) for x,y in enumerate(sorted(set(finalX)))])      
df_teams30["home_team"].replace(CountryDict, inplace=True)
df_teams30["away_team"].replace(CountryDict, inplace=True)

# Separate X and y sets
X = df_teams30.drop(['winning_team'], axis=1)
y = df_teams30["winning_team"]
y = y.astype('int')

# Separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

# 1 hotvectorisation of our Y column to avoid bias (Also for model to predict category)
y_test_1hot = to_categorical(y_test).astype(np.int)
y_train_1hot = to_categorical(y_train).astype(np.int)

##### basic logistic regression

In [11]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
score = logreg.score(X_train, y_train)
score2 = logreg.score(X_test, y_test)

print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Training set accuracy:  0.536
Test set accuracy:  0.541


##### Noticed model is only 50+% accuracy which is pretty low, slightly better than guessing even after using elo ranking dataset as briefly mentioned above to improve our model accuracy instead of just country ID

##### AlternatIve model in keras
Current stack consist input layer with 5 dimension (with elorating for away and home team)<br>
Next dense fully connected dense layers with built in activation relu<br>
Finally softmax to squash output prediction as probability between representing win, draw or tie

In [12]:
### define and fit the final model
model = Sequential()
model.add(Dense(32, input_dim=5, activation='relu'))
# add noise to model to avoid bais
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu',kernel_initializer='normal'))
model.add(Dense(128, activation='relu',kernel_initializer='normal'))
model.add(Dense(64, activation='relu',kernel_initializer='normal'))
model.add(Dense(32, activation='relu',kernel_initializer='normal'))
#model.add(Dense(3))
# squash result as probability
model.add(Dense(3, activation='softmax'))


#optimisation to converge faster
#epsilon so that division is not 0 think of it as bias recommended 10^-8
#rmsprop = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])



# toggle verbose to print text
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, verbose=0,batch_size=32)
#model.fit(X_train, y_train_1hot, validation_data=(X_test,y_test_1hot), epochs=1000, verbose=0)


<keras.callbacks.History at 0x221e863c780>

In [13]:
# backup our model
model.save('model/my_model_current.h5')

# show summary of model
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32)                192       
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                2112      
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
dense_4 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_5 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 99        
Total para

##### Noticed we only have 50+ % accuracy not very good for a ML model which should at least hit 70. We will be including elo ranking dataset as briefly mentioned above to improve our model accuracy 

In [14]:
scores = model.evaluate(X_train, y_train, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

#print(y_test)
# sampling our model prediction
test_sample=np.array([145,183,1902,1795,178]).reshape(1,5)
# predict output index 
sample_output=model.predict_classes(test_sample)
print(sample_output)

#import numpy
#from numpy import unravel_index
#numpy.set_printoptions(threshold=numpy.nan)
# result verification

#y_pred_probability = model.predict_proba(X_test)
#y_pred = model.predict_classes(X_test)

#print(y_pred)

#predictions = numpy.argmax(model.predict(X_test), axis=1)
#for i in range(1000):
#    print(predictions[i])
#plt.scatter(X_test['home_team'], y_pred)
#plt.show()



categorical_accuracy: 70.76%
[0]


## 3) Creating prediction sets from current 2018 data
before final prediction we will have to clean up the dataset and merge accordingly

In [15]:
# Loading new datasets
ranking = pd.read_csv('datasets/fifa_rankings.csv') # Obtained from https://us.soccerway.com/teams/rankings/fifa/?ICID=TN_03_05_01
fixtures = pd.read_csv('datasets/fixtures.csv') # Obtained from https://fixturedownload.com/results/fifa-world-cup-2018

# List for storing the group stage games
pred_set = []



##### include fix ranking within our 2018 group stage fixture further sort by ranking

In [16]:
# Create new columns with ranking position of each team
fixtures.insert(1, 'first_position', fixtures['Home Team'].map(ranking.set_index('Team')['Position']))
fixtures.insert(2, 'second_position', fixtures['Away Team'].map(ranking.set_index('Team')['Position']))

# We only need the group stage games, so we have to slice the dataset
# the slice can be read as get till row 48 for all columns
fixtures = fixtures.iloc[:48, :]
fixtures.tail()

Unnamed: 0,Round Number,first_position,second_position,Date,Location,Home Team,Away Team,Group,Result
43,3,6.0,25.0,27/06/2018 21:00,Nizhny Novgorod Stadium,Switzerland,Costa Rica,Group E,
44,3,60.0,10.0,28/06/2018 17:00,Volgograd Stadium,Japan,Poland,Group H,
45,3,28.0,16.0,28/06/2018 17:00,Samara Stadium,Senegal,Colombia,Group H,
46,3,55.0,14.0,28/06/2018 21:00,Saransk Stadium,Panama,Tunisia,Group G,
47,3,13.0,3.0,28/06/2018 21:00,Kaliningrad Stadium,England,Belgium,Group G,


##### predicting which team proceeds to next stage

In [17]:

# Loop to add teams to new prediction dataset based on the ranking position of each team# Loop  
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'home_team': row['Home Team'], 'away_team': row['Away Team'], 'winning_team': None})
    else:
        pred_set.append({'home_team': row['Away Team'], 'away_team': row['Home Team'], 'winning_team': None})
        
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set

pred_set.head()

Unnamed: 0,away_team,home_team,winning_team
0,Saudi Arabia,Russia,
1,Egypt,Uruguay,
2,Morocco,Iran,
3,Spain,Portugal,
4,Australia,France,


## 4)  Deploy Model
Prepare match pairs from fixtures dataset (feed in eloranking from our dictionary)<br>
Using our previously trained model to predict outcome

In [43]:
#load our previously trained model
model = load_model('model/my_model_current.h5')

# convert our group stage data into tuples
groupstage=pred_set.drop(['winning_team'], axis=1)
groupstagetuples = [tuple(x) for x in groupstage.values]


def prepare_predict(matches):
    
    wc_x_pred = []
    # data preprocessing
    for matchPairs in matches:
        home_team_id = CountryDict[matchPairs[0]]
        away_team_id = CountryDict[matchPairs[1]]
        home_team_elorating = ELODict[matchPairs[0]]
        away_team_elorating = ELODict[matchPairs[1]]
        elodifference = ELODict[matchPairs[0]]-ELODict[matchPairs[1]]
        
        # transform array
        matchset = [home_team_id, away_team_id, home_team_elorating, away_team_elorating, elodifference]
        wc_x_pred.append(matchset)
    
    # convert prediction set into numpy array
    wc_x_pred = np.array(wc_x_pred)
    
    # return y_prediction probability
    wc_y_pred = model.predict_proba(wc_x_pred)
    # iterate results
    for index, outcome in enumerate(wc_y_pred):
        outcome_win = str(outcome[0])
        outcome_draw = str(outcome[2])
        outcome_lose = str(outcome[1])
        outcome_final = np.where(outcome == outcome.max())
        
        
        print('Probability of ' + matches[index][0]+ ' winning: ' + outcome_win)
        print('Probability of Tie: '+ outcome_draw) 
        print('Probability of ' + matches[index][1] + ' winning: ' + outcome_lose)
        if(outcome_final[0]==0):
            print("Final Winner: "+ matches[index][0])
        elif(outcome_final[0]==1):
            print("Final Winner: "+ matches[index][1])
        else:
            print("Draw")
        print("\n")
        
    
prepare_predict(groupstagetuples)


Probability of Saudi Arabia winning: 0.3777896
Probability of Tie: 0.28834853
Probability of Russia winning: 0.3338619
Final Winner: Saudi Arabia


Probability of Egypt winning: 0.259684
Probability of Tie: 0.25883275
Probability of Uruguay winning: 0.48148325
Final Winner: Uruguay


Probability of Morocco winning: 0.38790226
Probability of Tie: 0.2882633
Probability of Iran winning: 0.3238345
Final Winner: Morocco


Probability of Spain winning: 0.48158658
Probability of Tie: 0.27462465
Probability of Portugal winning: 0.24378887
Final Winner: Spain


Probability of Australia winning: 0.26497513
Probability of Tie: 0.26096773
Probability of France winning: 0.47405714
Final Winner: France


Probability of Iceland winning: 0.3103413
Probability of Tie: 0.27836928
Probability of Argentina winning: 0.4112895
Final Winner: Argentina


Probability of Denmark winning: 0.47184268
Probability of Tie: 0.27724323
Probability of Peru winning: 0.25091413
Final Winner: Denmark


Probability of Nige

##### hardcoded the group stage tuple all the way to the finals, this sort of flexibility instead of code driven function allows us to modify who proceed to quater finals so on and so forth based on actual results (Ideally, our model should predict the outcode for every match correctly but hey nothing is perfect right?)

Note: Replaced some of the wrongly predicted outcomes argentina with respective country<Br>
**Group knockoff**:<br>
1A vs 2B<br>
1C vs 2D<br>
1E vs 2F<br>
1G vs 2H<br>
<br>
1B vs 2A<br>
1D vs 2C<br>
1F vs 2E<br>
1H vs 2G<br>

In [44]:
# List of tuples before we arrange the teams in home and away
group_16 = [('Uruguay', 'Portugal'),
            ('France', 'Argentina'),
            ('Brazil', 'Germany'),
            ('Belgium', 'Japan'),
            ('Spain', 'Russia'),
            ('Croatia', 'Denmark'),
            ('Mexico', 'Switzerland'),
            ('Senegal', 'England')]

##### Function to clean tuple dataset from fixture and order by ranking (A not so nice approach to give higher ranking teams as homedue to higher win rate)

In [45]:
# actual run
prepare_predict(group_16)

Probability of Uruguay winning: 0.44903073
Probability of Tie: 0.28807977
Probability of Portugal winning: 0.26288947
Final Winner: Uruguay


Probability of France winning: 0.48491195
Probability of Tie: 0.27295083
Probability of Argentina winning: 0.24213725
Final Winner: France


Probability of Brazil winning: 0.51013225
Probability of Tie: 0.26428628
Probability of Germany winning: 0.22558145
Final Winner: Brazil


Probability of Belgium winning: 0.6177788
Probability of Tie: 0.22252059
Probability of Japan winning: 0.15970063
Final Winner: Belgium


Probability of Spain winning: 0.6461022
Probability of Tie: 0.21068195
Probability of Russia winning: 0.14321584
Final Winner: Spain


Probability of Croatia winning: 0.4617924
Probability of Tie: 0.28146687
Probability of Denmark winning: 0.25674075
Final Winner: Croatia


Probability of Mexico winning: 0.45148328
Probability of Tie: 0.28707248
Probability of Switzerland winning: 0.26144427
Final Winner: Mexico


Probability of Senegal

##### based on previous result proceed

In [47]:
# List of matches
quarters = [('Uruguay', 'France'),
            ('Brazil', 'Belgium'),
            ('Spain', 'Croatia'),
            ('Mexico', 'England')]

In [48]:
prepare_predict(quarters)

Probability of Uruguay winning: 0.43582198
Probability of Tie: 0.29039332
Probability of France winning: 0.27378476
Final Winner: Uruguay


Probability of Brazil winning: 0.58410466
Probability of Tie: 0.23638967
Probability of Belgium winning: 0.17950565
Final Winner: Brazil


Probability of Spain winning: 0.5283174
Probability of Tie: 0.25785416
Probability of Croatia winning: 0.21382849
Final Winner: Spain


Probability of Mexico winning: 0.38348788
Probability of Tie: 0.28775227
Probability of England winning: 0.32875988
Final Winner: Mexico




In [49]:
# List of matches
semi = [('Uruguay', 'Brazil'),
        ('Spain', 'Mexico')]

In [50]:
prepare_predict(semi)

Probability of Uruguay winning: 0.27910528
Probability of Tie: 0.26725912
Probability of Brazil winning: 0.4536356
Final Winner: Brazil


Probability of Spain winning: 0.54798007
Probability of Tie: 0.25092888
Probability of Mexico winning: 0.20109102
Final Winner: Spain




In [55]:
# List of matches
losersfinals = [('Uruguay', 'Mexico')]

In [56]:
prepare_predict(losersfinals)

Probability of Uruguay winning: 0.4739217
Probability of Tie: 0.2778399
Probability of Mexico winning: 0.24823843
Final Winner: Uruguay




In [57]:
# The final game# The big 
finals = [('Brazil', 'Spain')]

In [58]:
prepare_predict(finals)

Probability of Brazil winning: 0.53221256
Probability of Tie: 0.25645423
Probability of Spain winning: 0.2113332
Final Winner: Brazil




## 5) FIFA 2018 Score Prediction

Using simple linear regression to predict the Total Goals for the Final/SemiFinal Match<br>
After which a difference of winner loser linear regression is used based on the predicted Total Goals to retrieve the end result<br>
For more detail please refer to: 2018 world cup score prediction/(Total)<br>

<img style="margin-left: 0;" src="images/3rd place.jpg">
**The predicted Score for the 3rd place playoff: 4-3**


<img style="margin-left: 0;" src="images/Final Score.jpg">
**The predicted Score for the Finals: 2-1**


## 6) Project Review 
**The experiments**:<br>
Elo rating did improve prediction from 50~% to 70~%<br>
found that LSTM model would not help with model converging<br>
the score prediction was separated out from winners and losers to avoid dependancies

**For future work**:<br>
Improve handling of draw to proceed in matches after group stage<br>
Also ranking could be used for training and not just sorting teams between home and away<br>
Finally, betting historic data would have been helpful<br>

**Limitation**<br>
The limitation of this model would be the inability to predict score but instead predicting win or lose<br>
A separate linear regression algorithm would be use to predict the score instead

**Results**<br>
1st Brazil, 2nd Spain, 3rd Uruguay<br>
Final Match: 2-1<br>
Losers Final: 4-3

## 7) References
https://machinelearningmastery.com/evaluate-performance-deep-learning-models-keras/<br>
https://medium.com/@siavash_37715/how-to-predict-bitcoin-and-ethereum-price-with-rnn-lstm-in-keras-a6d8ee8a5109<br>
https://towardsdatascience.com/using-lstms-to-forecast-time-series-4ab688386b1f<br>
https://github.com/charliechurches/Russia_World_Cup_Prediction/blob/master/Russia_World_Cup_predict.ipynb<br>
https://www.eloratings.net/<br>
https://www.youtube.com/watch?v=Cltt47Ah3Q4