# Training and predicting

**IF YOU SEE MESSY TABLE OUTPUTS PLEASE RUN THE CELL BELOW**

(some CPDs are really long so we need to tweak the config to disable wrapping text and add a horizontal scrollbar to the output area)

In [1]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

In [2]:
import pandas as pd
import numpy as np
import random
from pgmpy.models import BayesianModel
from pgmpy.estimators import BayesianEstimator
from ourFunctions import *

In [3]:
# read csv file
df = pd.read_csv('data_discretized.csv')
dfBackUp = pd.read_csv('data_discretized.csv')

In [4]:
df

Unnamed: 0,HomeTeamName,AwayTeamName,Result,HomeAverageGoals,AwayAverageGoals,Home5LastGames,Away5LastGames,Stage,City,Year
0,France,Yugoslavia,Lost,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Semi-finals,Paris,1960
1,Czechoslovakia,Soviet Union,Lost,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Semi-finals,Marseille,1960
2,Czechoslovakia,France,Won,Low,High,Bad,Bad,Third place play-off,Marseille,1960
3,Soviet Union,Yugoslavia,Won,High,High,Good,Good,Final,Paris,1960
4,Spain,Hungary,Won,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Semi-finals,Madrid,1964
...,...,...,...,...,...,...,...,...,...,...
281,Germany,Italy,Draw,High,Mid,Good,Good,Quarter-finals,Bordeaux,2016
282,France,Iceland,Won,High,High,Good,Good,Quarter-finals,Saint-Denis,2016
283,Portugal,Wales,Won,Mid,High,Good,Good,Semi-finals,Décines-Charpieu,2016
284,Germany,France,Lost,Mid,High,Good,Good,Semi-finals,Marseille,2016


## (a) and (b) Build a model using pgmpy, predict results of some games and calculate the conditional probability of winning the Euro championship of each country

We create a training data set by selecting only the necessary columns.

In [5]:
data = df[['HomeAverageGoals', 'AwayAverageGoals', 'Home5LastGames', 'Away5LastGames', 'City', 'Result']]
data

Unnamed: 0,HomeAverageGoals,AwayAverageGoals,Home5LastGames,Away5LastGames,City,Result
0,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Paris,Lost
1,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Marseille,Lost
2,Low,High,Bad,Bad,Marseille,Won
3,High,High,Good,Good,Paris,Won
4,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Madrid,Won
...,...,...,...,...,...,...
281,High,Mid,Good,Good,Bordeaux,Draw
282,High,High,Good,Good,Saint-Denis,Won
283,Mid,High,Good,Good,Décines-Charpieu,Won
284,Mid,High,Good,Good,Marseille,Lost


Creating the Bayesian model.

In [6]:
r = 'Result'
ha = 'HomeAverageGoals'
aa = 'AwayAverageGoals'
hl = 'Home5LastGames'
al = 'Away5LastGames'
c = 'City'

model = BayesianModel([(ha, hl), (aa, al),
                       (hl, r), (al, r), 
                       (c, ha), (c, aa),
                       (ha, r), (aa, r)])
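
To see what this edge list implies, the sketch below rebuilds the parent sets in plain Python (no pgmpy needed) and counts how many parent configurations the Result CPD has. The state counts (4 for the average-goals nodes, 3 for the last-5-games nodes) are read off the CPD tables printed further down, and the variable names are ours, not part of pgmpy.

```python
from collections import defaultdict
from math import prod

# Same edge list as in the cell above, written with the full column names.
edges = [
    ('HomeAverageGoals', 'Home5LastGames'), ('AwayAverageGoals', 'Away5LastGames'),
    ('Home5LastGames', 'Result'), ('Away5LastGames', 'Result'),
    ('City', 'HomeAverageGoals'), ('City', 'AwayAverageGoals'),
    ('HomeAverageGoals', 'Result'), ('AwayAverageGoals', 'Result'),
]

# Number of states per node, as seen in the printed CPDs
# (assumption: no other states occur in the data).
n_states = {
    'HomeAverageGoals': 4, 'AwayAverageGoals': 4,  # High, Low, Mid, NoGamePlayed
    'Home5LastGames': 3, 'Away5LastGames': 3,      # Bad, Good, Neutral
}

# Collect the parents of every node from the directed edges.
parents = defaultdict(list)
for parent, child in edges:
    parents[child].append(parent)

result_parents = sorted(parents['Result'])
# Each combination of parent states is one column of the Result CPD.
result_columns = prod(n_states[p] for p in result_parents)

print(result_parents)
print(result_columns)  # 4 * 4 * 3 * 3 = 144
```

With 144 columns of three probabilities each, the Result table is far wider than the screen, which is why the scrollbar tweak at the top of the notebook helps.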

Let the model learn from data.

In [7]:
model.fit(data, estimator=BayesianEstimator, prior_type="BDeu")

CPDs at nodes after training.

(some tables are really long so please use the scrollbar)

In [8]:
print(model.get_cpds(r))

(CPD table of Result; the output is too wide and was cut off in this export)

In many columns the probabilities of Draw, Lost and Won are 1/3 each. This mostly happens where *NoGamePlayed* is paired with *Low/Mid/High*, a combination that doesn't exist in our data frame (NoGamePlayed always pairs with NoGamePlayed). With no observations for such a parent configuration, the estimate falls back to the uniform prior. Since these cases never occur in our data, we can ignore the 0.3333... entries for now.
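
The uniform columns follow from the BDeu prior. As a rough sketch (assuming pgmpy's default equivalent sample size of 5, which the notebook does not override), every cell of a CPD receives a pseudo-count of `ess / (r * q)`, where `r` is the number of states of the node and `q` the number of parent configurations. The function name `bdeu_column` is our own, and the count of 78 games is inferred from the printed `Home5LastGames` table rather than stated in the notebook.

```python
def bdeu_column(counts, n_parent_configs, ess=5.0):
    """Posterior mean of one CPD column under a BDeu prior."""
    r = len(counts)                         # number of states of the node
    pseudo = ess / (r * n_parent_configs)   # pseudo-count added to every cell
    total = sum(counts) + r * pseudo
    return [(c + pseudo) / total for c in counts]

# A parent configuration that never occurs in the data:
# Draw, Lost and Won all get 1/3 (Result has 4 * 4 * 3 * 3 = 144 configurations).
unseen = bdeu_column([0, 0, 0], n_parent_configs=144)

# Home5LastGames given HomeAverageGoals = NoGamePlayed (4 parent configurations),
# assuming 78 such games, all of them Neutral. Counts are (Bad, Good, Neutral).
no_game = bdeu_column([0, 0, 78], n_parent_configs=4)

print(unseen)
print(no_game)
```

Under these assumptions the first value of `no_game` comes out at about 0.00526, in line with the value printed for `Home5LastGames(Bad)` under `NoGamePlayed`.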

In [9]:
print(model.get_cpds(hl))

+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| HomeAverageGoals        | HomeAverageGoals(High) | HomeAverageGoals(Low) | HomeAverageGoals(Mid) | HomeAverageGoals(NoGamePlayed) |
+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| Home5LastGames(Bad)     | 0.05644302449414271    | 0.7134238310708899    | 0.15867944621938232   | 0.005257623554153523           |
+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| Home5LastGames(Good)    | 0.8104366347177848     | 0.07993966817496231   | 0.4781682641107561    | 0.005257623554153523           |
+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| Home5LastGames(Neutral) | 0.13312034078807242    | 0.2066365

In [10]:
print(model.get_cpds(al))

+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| AwayAverageGoals        | AwayAverageGoals(High) | AwayAverageGoals(Low) | AwayAverageGoals(Mid) | AwayAverageGoals(NoGamePlayed) |
+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| Away5LastGames(Bad)     | 0.04104104104104103    | 0.6048265460030166    | 0.1285551763367463    | 0.005257623554153523           |
+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| Away5LastGames(Good)    | 0.7977977977977977     | 0.0799396681749623    | 0.4425483503981797    | 0.005257623554153523           |
+-------------------------+------------------------+-----------------------+-----------------------+--------------------------------+
| Away5LastGames(Neutral) | 0.16116116116116114    | 0.3152337

In [11]:
print(model.get_cpds(ha))

(CPD table of HomeAverageGoals given City; the output is too wide and was cut off in this export)

In [12]:
print(model.get_cpds(aa))

(CPD table of AwayAverageGoals given City; the output is too wide and was cut off in this export)

We have now built and trained our Bayesian network. Let's use it to predict the results of the games in Groups A, B, C and D in 2016.

We write a function that removes the Result column from the data, lets the model predict it, and then prints the predictions next to the real-life results.

In [13]:
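def predictAndCompare(start, end, df):
    # Predict Result for rows start..end-1 and print it next to the real outcome.
    dfDropResult = df.drop('Result', axis=1)
    # Keep only the feature columns the model was trained on.
    predictDf = dfDropResult[start:end][[ha, aa, hl, al, c]]
    print('Prediction:')
    print(model.predict(predictDf))
    print('\n')
    print('Real life results:')
    print(df['Result'][start:end])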

Group A first:


In [14]:
df[235:241]

Unnamed: 0,HomeTeamName,AwayTeamName,Result,HomeAverageGoals,AwayAverageGoals,Home5LastGames,Away5LastGames,Stage,City,Year
235,France,Romania,Won,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group A,Saint-Denis,2016
236,Albania,Switzerland,Lost,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group A,Lens,2016
237,Romania,Switzerland,Draw,Mid,Mid,Bad,Good,Group A,Paris,2016
238,France,Albania,Won,High,Low,Good,Bad,Group A,Marseille,2016
239,Romania,Albania,Lost,Mid,Low,Bad,Bad,Group A,Décines-Charpieu,2016
240,Switzerland,France,Draw,Mid,High,Good,Good,Group A,Villeneuve-d'Ascq,2016


We are not going to predict all 6 games, only the last 4. In the first 2 games both teams have the *NoGamePlayed* value, so they look identical to our model (both NoGamePlayed - Neutral), and the predicted results for those games would not be reliable.

In [15]:
predictAndCompare(237, 241, df)

Prediction:


100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 400.60it/s]


  Result
0   Draw
1    Won
2   Lost
3   Draw


Real life results:
237    Draw
238     Won
239    Lost
240    Draw
Name: Result, dtype: object


4 out of 4 predictions are correct! We didn't expect it to be this accurate, but let's continue with the other groups to test our model further.
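
One caveat: `model.fit(data, ...)` was run on every row of `data`, so the 2016 games predicted here were also part of the training data, which may explain part of this accuracy. A stricter check would hold the test games out. The sketch below shows the idea in plain Python on four rows copied from the data frame at the top of the notebook; `split_by_year` and `unseen_values` are hypothetical helpers, not part of pandas or pgmpy.

```python
def split_by_year(rows, test_year):
    """Train on games played before test_year, test on games from that year."""
    train = [row for row in rows if row['Year'] < test_year]
    test = [row for row in rows if row['Year'] == test_year]
    return train, test

def unseen_values(train, test, column):
    """Values of `column` that occur in the test rows but never in training."""
    seen = {row[column] for row in train}
    return sorted({row[column] for row in test} - seen)

# Four rows copied from the data frame shown earlier (only the columns needed here).
rows = [
    {'City': 'Paris', 'Year': 1960, 'Result': 'Lost'},
    {'City': 'Madrid', 'Year': 1964, 'Result': 'Won'},
    {'City': 'Bordeaux', 'Year': 2016, 'Result': 'Draw'},
    {'City': 'Décines-Charpieu', 'Year': 2016, 'Result': 'Won'},
]

train, test = split_by_year(rows, test_year=2016)
print(len(train), len(test))
# Cities that appear only in the held-out games.
print(unseen_values(train, test, 'City'))
```

A discrete network has no CPD entry for a state it never observed, so held-out rows with unseen cities would likely need to be dropped or mapped to a known state before predicting.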

Group B:

In [16]:
df[241:247]

Unnamed: 0,HomeTeamName,AwayTeamName,Result,HomeAverageGoals,AwayAverageGoals,Home5LastGames,Away5LastGames,Stage,City,Year
241,Wales,Slovakia,Won,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group B,Bordeaux,2016
242,England,Russia,Draw,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group B,Marseille,2016
243,Russia,Slovakia,Lost,Mid,Mid,Neutral,Bad,Group B,Villeneuve-d'Ascq,2016
244,England,Wales,Won,Mid,High,Neutral,Good,Group B,Lens,2016
245,Russia,Wales,Lost,Mid,High,Bad,Neutral,Group B,Toulouse,2016
246,Slovakia,England,Draw,High,High,Neutral,Good,Group B,Saint-Étienne,2016


Now predicting:

In [17]:
predictAndCompare(243, 247, df)

Prediction:


100%|██████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1335.87it/s]


  Result
0   Lost
1   Draw
2   Lost
3   Draw


Real life results:
243    Lost
244     Won
245    Lost
246    Draw
Name: Result, dtype: object


Only 1 false prediction (3 out of 4 are correct). Our model is still pretty reliable.

Group C:

In [18]:
df[247:253]

Unnamed: 0,HomeTeamName,AwayTeamName,Result,HomeAverageGoals,AwayAverageGoals,Home5LastGames,Away5LastGames,Stage,City,Year
247,Poland,Northern Ireland,Won,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group C,Nice,2016
248,Germany,Ukraine,Won,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group C,Villeneuve-d'Ascq,2016
249,Ukraine,Northern Ireland,Lost,Low,Low,Bad,Bad,Group C,Décines-Charpieu,2016
250,Germany,Poland,Draw,High,Mid,Good,Good,Group C,Saint-Denis,2016
251,Ukraine,Poland,Lost,Low,Low,Bad,Good,Group C,Marseille,2016
252,Northern Ireland,Germany,Lost,Mid,Mid,Neutral,Good,Group C,Paris,2016


Predicting:

In [19]:
predictAndCompare(249, 253, df)

Prediction:


100%|██████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1550.29it/s]

  Result
0   Lost
1   Lost
2   Lost
3   Lost


Real life results:
249    Lost
250    Draw
251    Lost
252    Lost
Name: Result, dtype: object





Again, just 1 false prediction. Our model is doing well.

The last Group:

In [20]:
df[253:259]

Unnamed: 0,HomeTeamName,AwayTeamName,Result,HomeAverageGoals,AwayAverageGoals,Home5LastGames,Away5LastGames,Stage,City,Year
253,Turkey,Croatia,Lost,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group D,Paris,2016
254,Spain,Czech Republic,Won,NoGamePlayed,NoGamePlayed,Neutral,Neutral,Group D,Toulouse,2016
255,Czech Republic,Croatia,Draw,Low,Mid,Bad,Good,Group D,Saint-Étienne,2016
256,Spain,Turkey,Won,Mid,Low,Good,Bad,Group D,Nice,2016
257,Czech Republic,Turkey,Lost,Mid,Low,Bad,Bad,Group D,Lens,2016
258,Croatia,Spain,Won,High,High,Good,Good,Group D,Bordeaux,2016


Predicting:

In [21]:
predictAndCompare(255, 259, df)

Prediction:


100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 983.08it/s]

  Result
0   Lost
1   Lost
2   Lost
3    Won


Real life results:
255    Draw
256     Won
257    Lost
258     Won
Name: Result, dtype: object





However, only 2 of the 4 predictions are correct this time.

The result of a football match depends on many other factors, such as the weather, the players' form or even luck, while our data only includes basic information such as average goals, recent results and cities. Therefore, it is unlikely that our model's predictions will match real life perfectly.
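
The four group results above can be tallied in a few lines. The lists are copied from the printed predictions and real-life results; `count_correct` is our own helper name.

```python
def count_correct(predicted, actual):
    """Number of positions where the prediction equals the real result."""
    return sum(p == a for p, a in zip(predicted, actual))

# (predicted, actual) for the last four games of each group, as printed above.
groups = {
    'A': (['Draw', 'Won', 'Lost', 'Draw'], ['Draw', 'Won', 'Lost', 'Draw']),
    'B': (['Lost', 'Draw', 'Lost', 'Draw'], ['Lost', 'Won', 'Lost', 'Draw']),
    'C': (['Lost', 'Lost', 'Lost', 'Lost'], ['Lost', 'Draw', 'Lost', 'Lost']),
    'D': (['Lost', 'Lost', 'Lost', 'Won'], ['Draw', 'Won', 'Lost', 'Won']),
}

correct = {g: count_correct(p, a) for g, (p, a) in groups.items()}
total_games = sum(len(a) for _, a in groups.values())
accuracy = sum(correct.values()) / total_games

print(correct)   # {'A': 4, 'B': 3, 'C': 3, 'D': 2}
print(accuracy)  # 0.75
```

Over the 16 games that is 12 correct predictions, keeping in mind that these games were also in the data the model was fitted on.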

### Probability of winning the Euro championship of each country given that this data frame is true

In this problem, we create a new model and a new data frame. The data frame has one row per team and, for every feature, holds the value that occurs most often for that team. For example, if a team's Result column contains 5 "Won", 4 "Lost" and 2 "Draw", we put "Won" in its Result column.
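
As a small illustration of "the value that occurs most often", here is the example from the paragraph above in plain Python. `most_frequent` is a hypothetical helper; skipping `NoGamePlayed` mirrors what the notebook's `getMost` does for the feature columns.

```python
from collections import Counter

def most_frequent(values, ignore=('NoGamePlayed',)):
    """Return the value that occurs most often, skipping the ignored ones."""
    counts = Counter(v for v in values if v not in ignore)
    if not counts:
        return None  # the team has no usable observations
    return counts.most_common(1)[0][0]

# 5 "Won", 4 "Lost" and 2 "Draw", as in the example above.
results = ['Won'] * 5 + ['Lost'] * 4 + ['Draw'] * 2
print(most_frequent(results))  # Won
```

Returning `None` for a team with no usable values is a design choice of this sketch; it avoids the `ValueError` that `max` raises on an empty collection.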

In [22]:
def getMost(col, teamName, df):
    # Most frequent value of `col` for a team, counting its home and away games.
    counts = {}
    
    for i, row in df.iterrows():
        if row['HomeTeamName'] == teamName:
            value = row['Home' + col]
        elif row['AwayTeamName'] == teamName:
            value = row['Away' + col]
        else:
            continue
        
        # do not count NoGamePlayed
        if value == 'NoGamePlayed':
            continue
        
        counts[value] = counts.get(value, 0) + 1
    return max(counts, key=counts.get)

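def isChampion(teamName, df):
    # Result is given from the home team's point of view,
    # so 'Lost' in a Final means the away team won the title.
    # Note: a Final recorded as 'Draw' marks both finalists as champions.
    for i, row in df.iterrows():
        if row['Stage'] == 'Final':
            if row['HomeTeamName'] == teamName:
                if row['Result'] == 'Won' or row['Result'] == 'Draw':
                    return True
            elif row['AwayTeamName'] == teamName:
                if row['Result'] == 'Lost' or row['Result'] == 'Draw':
                    return True
    return False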

In [23]:
namelist = list(df['HomeTeamName'])
namelist.extend(df['AwayTeamName'])
nameSet = list(set(namelist))

# find the most frequent AverageGoals / 5LastGames value and the champion flag of each team
averageGoals = []
last5Games = []
champion = []
for name in nameSet:
    averageGoals.append(getMost('AverageGoals', name, df))
    last5Games.append(getMost('5LastGames', name, df))
    champion.append(isChampion(name, df))
    
teamStat = pd.DataFrame(getResultAllTeam(df).idxmax(axis=1))
teamStat = teamStat.rename(columns={0: 'Result'})
teamStat['AverageGoals'] = averageGoals
teamStat['5LastGames'] = last5Games
teamStat['IsChampion'] = champion
teamStat

Unnamed: 0,Result,AverageGoals,5LastGames,IsChampion
Spain,Won,High,Good,True
CIS,Draw,Mid,Neutral,False
Poland,Draw,Low,Neutral,False
Norway,Won,Mid,Neutral,False
Albania,Lost,Low,Bad,False
Iceland,Won,Mid,Neutral,False
Denmark,Lost,Low,Neutral,True
Slovenia,Draw,High,Neutral,False
Ukraine,Lost,Low,Neutral,False
Turkey,Lost,Low,Bad,False


In [24]:
a = 'AverageGoals'
l = '5LastGames'
w = 'IsChampion'

model2 = BayesianModel([(a, w), (l, w), (r, w)])

data2 = teamStat.reset_index(drop=True)
model2.fit(data2, estimator=BayesianEstimator, prior_type="BDeu")

Here is our prediction.

In [25]:
predictData2 = data2.drop(w, axis=1)
prediction = model2.predict_probability(predictData2)
prediction['TeamName'] = nameSet
prediction = prediction[['TeamName', 'IsChampion_True']]
prediction

Unnamed: 0,TeamName,IsChampion_True
0,Spain,0.847938
1,CIS,0.078125
2,Poland,0.078125
3,Norway,0.343023
4,Albania,0.02907
5,Iceland,0.343023
6,Denmark,0.210714
7,Slovenia,0.042373
8,Ukraine,0.210714
9,Turkey,0.02907


## (c) Precision score and recall score

First, we let the model predict the whole data set, but drop the rows that have 'NoGamePlayed', for the same reason as mentioned above.

In [26]:
data_dropNoGamePlay = data[data[ha] != 'NoGamePlayed'].reset_index(drop=True)
actualResults = data_dropNoGamePlay[r]
predictData = data_dropNoGamePlay.drop('Result', axis=1)
predictedResults = (model.predict(predictData))[r]
predictedResults

100%|████████████████████████████████████████████████████████████████████████████████| 194/194 [00:02<00:00, 96.09it/s]


0       Won
1       Won
2      Lost
3       Won
4       Won
       ... 
203    Lost
204     Won
205    Draw
206    Draw
207    Draw
Name: Result, Length: 208, dtype: object

Now we write a function that calculates the precision and recall of one result class from the predicted results and the actual ones.

In [27]:
def precisionRecallScore(result, predictedDf, actualDf):
    truePositive = 0
    falsePositive = 0
    trueNegative = 0
    falseNegative = 0
    for i in range(len(predictedDf.index)):
        predict = predictedDf[i]
        actual = actualDf[i]
        
        if predict == result:
            if actual == result:
                truePositive += 1
            else:
                falsePositive += 1
        else:
            if actual != result:
                trueNegative += 1
            else:
                falseNegative += 1
    precisionScore = truePositive / (truePositive + falsePositive)
    recallScore = truePositive / (truePositive + falseNegative)
    print('Precision score and recall score of "' + result + '":')
    print((precisionScore, recallScore))
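
The same definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), can be written for plain lists with a guard for the case where a class is never predicted or never occurs (the function above would raise `ZeroDivisionError` there). This is a sketch with made-up labels, not the notebook's data, and `precision_recall` is a hypothetical name.

```python
def precision_recall(label, predicted, actual):
    """Precision and recall of one class; returns 0.0 when a ratio is undefined."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)  # true positives
    fp = sum(p == label and a != label for p, a in pairs)  # false positives
    fn = sum(p != label and a == label for p, a in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: five games.
predicted = ['Won', 'Won', 'Lost', 'Won', 'Lost']
actual    = ['Won', 'Lost', 'Lost', 'Won', 'Won']

print(precision_recall('Won', predicted, actual))
print(precision_recall('Draw', predicted, actual))  # class never predicted nor observed
```

Returning 0.0 for an undefined ratio is one common convention; raising an error or returning `None` would be equally defensible.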

In [28]:
precisionRecallScore('Won', predictedResults, actualResults)

Precision score and recall score of "Won":
(0.6521739130434783, 0.5357142857142857)


In [29]:
precisionRecallScore('Lost', predictedResults, actualResults)

Precision score and recall score of "Lost":
(0.5833333333333334, 0.6447368421052632)


In [30]:
precisionRecallScore('Draw', predictedResults, actualResults)

Precision score and recall score of "Draw":
(0.4727272727272727, 0.5416666666666666)


The scores are pretty decent: five of the six are above 50%. The precision of "Won" and the recall of "Lost" even reach about 65%.

## (d) Adjust the structure of the model slightly to see if it improves the scores

Because the City variable takes many distinct values (there are so many cities), we try removing the City node. Next, we remove the two edges HomeAverageGoals -> Result and AwayAverageGoals -> Result. Therefore, rather than depending on both the average goals and the last 5 games, the result now depends only on the last 5 games of both teams.

In [31]:
adjustedModel = BayesianModel([(ha, hl), (aa, al),
                            (hl, r), (al, r)])

adjustedModel.fit(data, estimator=BayesianEstimator, prior_type="BDeu")
predictedResults_adjusted = (adjustedModel.predict(predictData.drop(c, axis=1)))[r]

100%|█████████████████████████████████████████████████████████████████████████████████| 56/56 [00:00<00:00, 764.57it/s]


In [32]:
precisionRecallScore('Won', predictedResults_adjusted, actualResults)

Precision score and recall score of "Won":
(0.46616541353383456, 0.7380952380952381)


In [33]:
precisionRecallScore('Lost', predictedResults_adjusted, actualResults)

Precision score and recall score of "Lost":
(0.5272727272727272, 0.3815789473684211)


In [34]:
precisionRecallScore('Draw', predictedResults_adjusted, actualResults)

Precision score and recall score of "Draw":
(0.45, 0.1875)


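Most of the scores drop sharply, especially the recall of "Draw", which falls by about 35 percentage points (from 54% to 19%). However, the recall of "Won" rises to about 74% while its precision falls to 47%, which means the adjusted model predicts "Won" much more often. This suggests that 5LastGames is strongly associated with the "Won" result, although it does not show causality.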
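
To weigh precision against recall in a single number, the F1 score (their harmonic mean) can be computed from the values printed above. This is an added illustration, not part of the original analysis; the scores are copied from the outputs and rounded to four decimals, and the function names are ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(scores):
    """Unweighted average of the per-class F1 scores."""
    return sum(f1(p, r) for p, r in scores.values()) / len(scores)

# (precision, recall) per class, copied from the printed outputs.
original = {'Won': (0.6522, 0.5357), 'Lost': (0.5833, 0.6447), 'Draw': (0.4727, 0.5417)}
adjusted = {'Won': (0.4662, 0.7381), 'Lost': (0.5273, 0.3816), 'Draw': (0.4500, 0.1875)}

print(round(macro_f1(original), 3))
print(round(macro_f1(adjusted), 3))
```

The macro F1 comes out at roughly 0.57 for the original model and 0.43 for the adjusted one.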

**Conclusion:** For this data set, our original model performs better than the adjusted one.