# Predicting the outcome of soccer matches

In this notebook we will use scikit-learn's `MLPClassifier` to predict the outcome of soccer matches.

We will use some preprocessed data from the "European Soccer Database" (https://www.kaggle.com/hugomathien/soccer).

#### 1. Read in the files `matches.csv` and `players.csv`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("./data/matches.csv", index_col='id')
p = pd.read_csv("./data/players.csv")

In [3]:
df.head()

id,home_team_goal,away_team_goal,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,...,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11
146,2,1,38327.0,67950.0,67958.0,67959.0,37112.0,36393.0,148286.0,67898.0,...,38293.0,148313.0,104411.0,148314.0,37202.0,43158.0,9307.0,42153.0,32690.0,38782.0
154,1,3,36835.0,37047.0,37021.0,37051.0,104386.0,32863.0,37957.0,37909.0,...,21812.0,11736.0,37858.0,38366.0,37983.0,39578.0,38336.0,52280.0,27423.0,38440.0
156,2,0,34480.0,38388.0,26458.0,13423.0,38389.0,30949.0,38393.0,38253.0,...,37886.0,37903.0,37889.0,94030.0,37893.0,37981.0,131531.0,130027.0,38231.0,131530.0
163,2,1,38327.0,67950.0,67958.0,38801.0,67898.0,37112.0,67959.0,148286.0,...,38388.0,38389.0,31316.0,164694.0,30949.0,38378.0,38383.0,38393.0,38253.0,37069.0
169,0,0,37900.0,37886.0,37100.0,37903.0,37889.0,37893.0,37981.0,131531.0,...,38247.0,16387.0,94288.0,94284.0,45832.0,26669.0,33671.0,163670.0,37945.0,33622.0


In [4]:
players = {}

for i, k in enumerate(p['player_api_id']):
    players[k] = i
    
ix_players = {v:k for k,v in players.items()}

#### 2. Create a vector representation of the match. 

Each row is one match and each column is one player. An entry is 1 if the player was on the home team in that match, -1 if the player was on the away team, and 0 if the player did not take part.

In [5]:
n_players = len(players)
n_matches = df.shape[0]
team_length = 11

In [6]:
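def vectorize_match(match_id):
    # Note: match_id is the row position in df (used with iloc),
    # not the value of the 'id' index.
    # Initialize vector
    vector = np.zeros(n_players)
    
    # Mark home players with 1 and away players with -1.
    # Columns 0 and 1 hold the goals, so the player columns start at position 2.
    for j in range(team_length):
        vector[players[df.iloc[match_id,j+2]]] = 1
        vector[players[df.iloc[match_id,j+2+team_length]]] = -1        
    return vector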

In [7]:
vec = np.zeros(n_players)

In [8]:
vectorize_match(1)

array([0., 0., 0., ..., 0., 0., 0.])

In [9]:
X = np.array([vectorize_match(match_id) for match_id in range(df.shape[0])])
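To make the encoding concrete, here is a minimal sketch on made-up data. The player IDs, the team size of two, and the helper `encode_match` are hypothetical and only illustrate the +1/-1 scheme described in this step; they are not part of the dataset or of the notebook's own code.

```python
import numpy as np

# Hypothetical toy data: six player IDs and teams of two (not from the dataset).
toy_player_ids = [101, 102, 103, 104, 105, 106]
toy_index = {pid: i for i, pid in enumerate(toy_player_ids)}

def encode_match(home_ids, away_ids, index):
    """Return a vector with +1 for home players and -1 for away players."""
    vector = np.zeros(len(index))
    for pid in home_ids:
        vector[index[pid]] = 1
    for pid in away_ids:
        vector[index[pid]] = -1
    return vector

toy_vector = encode_match(home_ids=[101, 104], away_ids=[102, 106], index=toy_index)
print(toy_vector.tolist())  # [1.0, -1.0, 0.0, 1.0, 0.0, -1.0]
```

Players who did not take part in the match keep the value 0, which is why almost all entries of a real match vector are zero.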

#### 3. Create a target with three categories: win, draw and lose.

In [10]:

# Sign of the goal difference: 1 = home win, 0 = draw, -1 = away win
y = np.sign(df['home_team_goal']-df['away_team_goal']).values.astype(float)

In [11]:
y

array([ 1., -1.,  1., ..., -1.,  1.,  1.])
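The target is the sign of the goal difference from the home team's point of view. The sketch below shows that mapping on made-up scores (the arrays `home_goals` and `away_goals` are illustrative, not taken from `matches.csv`) and counts how often each outcome occurs, which is a useful check before training a classifier.

```python
import numpy as np

# Made-up scores for five matches (illustrative, not from the dataset).
home_goals = np.array([2, 1, 0, 3, 1])
away_goals = np.array([1, 3, 0, 0, 1])

# Sign of the goal difference: 1 = home win, 0 = draw, -1 = away win.
toy_y = np.sign(home_goals - away_goals)
print(toy_y.tolist())  # [1, -1, 0, 1, 0]

# Count how often each outcome occurs.
labels, counts = np.unique(toy_y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {-1: 1, 0: 2, 1: 2}
```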

#### 4. Split into training and test data.

Since the matches happen sequentially, it makes more sense to use the first 80% of the matches for training and the remaining 20% for testing than to split at random.

In [12]:
# Split into training and test data (chronological, by row)
# X.shape[0] is the number of matches; X.shape[1] is the number of players
train_idx = int(0.8*X.shape[0])
X_train = X[0:train_idx,:]
X_test = X[train_idx:,:]
y_train = y[0:train_idx]
y_test = y[train_idx:]
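A chronological split cuts along the rows (matches), so the cut-off is computed from the number of rows. The following sketch wraps that idea in a small helper and runs it on toy arrays; the function name `chronological_split` and the toy data are assumptions made for illustration.

```python
import numpy as np

def chronological_split(X, y, train_fraction=0.8):
    """Split rows in order: the first part for training, the rest for testing."""
    n_train = int(train_fraction * X.shape[0])  # rows are matches
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Hypothetical toy data: 10 matches with 4 features each.
toy_X = np.arange(40).reshape(10, 4)
toy_y = np.arange(10)

X_tr, X_te, y_tr, y_te = chronological_split(toy_X, toy_y)
print(X_tr.shape, X_te.shape)  # (8, 4) (2, 4)
```

Because no shuffling takes place, every test match comes after every training match, provided the rows are sorted by date.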

In [13]:
X.shape

(21374, 11060)

#### 5. Create and train the model 

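We will use three hidden layers of sizes 5, 5 and 3. `MLPClassifier` adds the output layer automatically, with one unit per class (here three: win, draw and lose).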

In [14]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(5,5,3))

clf.fit(X_train,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 5, 3), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
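To see how `hidden_layer_sizes` translates into the fitted network, the sketch below trains the same architecture on a small synthetic dataset and inspects the learned classes and weight shapes. The data is random and purely illustrative, and `max_iter=50` and `random_state=0` are arbitrary choices made to keep the run short and repeatable.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data, for illustration only: 60 samples, 4 features, labels -1/0/1.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 4))
toy_y = np.tile([-1, 0, 1], 20)

toy_clf = MLPClassifier(hidden_layer_sizes=(5, 5, 3), max_iter=50, random_state=0)
toy_clf.fit(toy_X, toy_y)  # may emit a ConvergenceWarning on such a short run

print(toy_clf.classes_)                   # the three class labels
print([w.shape for w in toy_clf.coefs_])  # one weight matrix per layer transition
```

The last weight matrix maps the final hidden layer (3 units) to the output layer, whose size is set by the number of classes found in the training labels.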

#### 6. Evaluate the model.

We first generate raw predictions and then compare those outcomes with the observed results.

In [15]:
raw_preds = clf.predict(X_test)

In [16]:
# The labels are scalars (-1, 0 or 1), so compare them directly;
# np.argmax is only meaningful for one-hot or probability vectors.
outcomes_guessed = np.mean(raw_preds == y_test)*100
print("Percentage of outcomes guessed: ", outcomes_guessed)
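An accuracy figure is easier to judge next to a simple baseline. The sketch below, on made-up labels, computes the share of correctly predicted outcomes and compares it with always predicting the most frequent outcome. Note that `np.argmax` applied to a single scalar label always returns 0, so it should only be used when the labels are one-hot or probability vectors. In a real evaluation the baseline class would normally be chosen from the training labels; the toy arrays here are assumptions for illustration.

```python
import numpy as np

# Made-up labels for illustration: 1 = home win, 0 = draw, -1 = away win.
toy_true = np.array([1, -1, 0, 1, 1, -1, 0, 1])
toy_pred = np.array([1, 1, 0, 1, -1, -1, 1, 1])

# Accuracy: share of matches where the predicted label equals the observed one.
accuracy = np.mean(toy_pred == toy_true) * 100

# Baseline: always predict the most frequent observed outcome.
labels, counts = np.unique(toy_true, return_counts=True)
baseline = counts.max() / toy_true.size * 100

print(accuracy, baseline)  # 62.5 50.0
```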


#### 7. Extensions

- Can you improve the model by increasing the number of epochs (the `max_iter` parameter of `MLPClassifier`) or changing the architecture of the network?
- How could you modify the model to predict the goal difference instead of the win/draw/lose outcome?
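As one possible starting point for the second question, the sketch below swaps the classifier for scikit-learn's `MLPRegressor` and fits it to a numeric goal difference. This is only an assumed direction, not the notebook's own solution: the data is random, and the architecture and `max_iter=50` are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Random toy data, for illustration only: 80 matches, 6 player columns.
rng = np.random.default_rng(0)
toy_X = rng.choice([-1.0, 0.0, 1.0], size=(80, 6))
toy_goal_diff = rng.integers(-3, 4, size=80)  # home goals minus away goals

toy_reg = MLPRegressor(hidden_layer_sizes=(5, 5, 3), max_iter=50, random_state=0)
toy_reg.fit(toy_X, toy_goal_diff)  # may emit a ConvergenceWarning on such a short run

# Predicted goal differences are real numbers; rounding and taking the sign
# maps them back to the win/draw/lose encoding if needed.
pred_diff = toy_reg.predict(toy_X[:5])
pred_outcome = np.sign(np.round(pred_diff))
print(pred_diff.shape, pred_outcome.shape)  # (5,) (5,)
```

With a regression target, accuracy is no longer the natural metric; an error measure such as the mean absolute error between predicted and observed goal difference would be a more fitting choice.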