# Implementation of baseline using Matrix Factorization with EPL

Necessary files: testing_functions.py, matrix_factorization_functions.py, EPL_18.csv, EPL_19.csv

In [None]:
import testing_functions
import matrix_factorization_functions
import pandas as pd
import numpy as np
from sklearn.model_selection import ParameterGrid

I used the matrix factorization implementation from https://albertauyeung.github.io/2017/04/23/python-matrix-factorization.html/#source-code.
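To make the idea concrete, here is a minimal sketch of biased matrix factorization trained with stochastic gradient descent. It follows the same general update rule as the referenced implementation but is not that code: the function name `factorize`, its default values, and the convention that zeros mark missing entries are assumptions of this sketch. How `matrix_factorization_functions` builds its matrix from match results is not shown in this notebook.

```python
import numpy as np


def factorize(R, K, alpha=0.01, beta=0.01, iterations=200, seed=0):
    """Approximate R by b + b_u + b_i + P @ Q.T using SGD.

    Assumption of this sketch: zero entries of R are treated as missing.
    K is the number of latent factors, alpha the learning rate and
    beta the L2 regularization strength.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    P = rng.normal(scale=1.0 / K, size=(n_rows, K))
    Q = rng.normal(scale=1.0 / K, size=(n_cols, K))
    b_u = np.zeros(n_rows)
    b_i = np.zeros(n_cols)
    rows, cols = np.nonzero(R)      # indices of the observed entries
    b = R[rows, cols].mean()        # global bias

    for _ in range(iterations):
        # Visit the observed entries in a fresh random order each pass.
        for idx in rng.permutation(len(rows)):
            i, j = rows[idx], cols[idx]
            e = R[i, j] - (b + b_u[i] + b_i[j] + P[i] @ Q[j])
            b_u[i] += alpha * (e - beta * b_u[i])
            b_i[j] += alpha * (e - beta * b_i[j])
            p_old = P[i].copy()     # keep the old row for the Q update
            P[i] += alpha * (e * Q[j] - beta * P[i])
            Q[j] += alpha * (e * p_old - beta * Q[j])

    # Full matrix of predictions, including the missing entries.
    return b + b_u[:, None] + b_i[None, :] + P @ Q.T


# Toy usage on a small matrix with missing (zero) entries.
R_toy = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [1, 0, 0, 4],
                  [0, 1, 5, 4]], dtype=float)
print(np.round(factorize(R_toy, K=2, alpha=0.05), 2))
```

A learning rate that is too large can make the updates diverge, so it is worth checking that the predictions stay finite.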

I used the 2017/18 season as the training and validation dataset: the first half of the season for training and the second half for validation. The baseline was then tested on the second half of the 2018/19 season.

I searched for the best values of the hyperparameters threshold and K using the 2017/18 season. For each hyperparameter combination, the model is trained on the first half of that season and evaluated on the second half. The combination with the highest validation accuracy is chosen.
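This notebook does not define what threshold does inside `matrix_factorization_functions`. One plausible reading, assumed in the sketch below, is that it acts as a draw margin: a match is called a draw when the predicted home-minus-away margin is within the threshold of zero. The helper names `outcome_from_margin` and `accuracy` and the labels 'H', 'D', 'A' are hypothetical and serve only to illustrate how such a threshold would affect accuracy.

```python
def outcome_from_margin(predicted_margin, threshold):
    """Map a predicted home-minus-away margin to 'H', 'D' or 'A'.

    Assumption: margins within [-threshold, threshold] count as a draw.
    """
    if predicted_margin > threshold:
        return 'H'
    if predicted_margin < -threshold:
        return 'A'
    return 'D'


def accuracy(predicted_margins, actual_outcomes, threshold):
    """Fraction of matches whose predicted outcome equals the actual one."""
    predictions = [outcome_from_margin(m, threshold) for m in predicted_margins]
    hits = sum(p == a for p, a in zip(predictions, actual_outcomes))
    return hits / len(actual_outcomes)


# Toy usage with made-up margins and results.
print(accuracy([0.3, -0.1, -0.5, 0.05], ['H', 'D', 'A', 'H'], threshold=0.2))
```

Under this reading, a threshold of 0 almost never predicts a draw, while a large threshold predicts draws too often, which is why it is worth tuning.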

In [None]:
param_grid = {
    'threshold': [0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25],
    'K': [2, 3, 4, 5, 6, 7, 8, 9, 10]
}
param_comb = ParameterGrid(param_grid)

val_scores = []
for params in param_comb:
    # Training on the first half of the season and predicting the second half.
    score = matrix_factorization_functions.predict_season('EPL_18.csv', **params)
    val_scores.append(score)
    print(params, ' ', score)

best_params = param_comb[np.argmax(val_scores)]
best_params

{'K': 2, 'threshold': 0}   0.45263157894736844
{'K': 2, 'threshold': 0.025}   0.4631578947368421
{'K': 2, 'threshold': 0.05}   0.45789473684210524
{'K': 2, 'threshold': 0.075}   0.5052631578947369
{'K': 2, 'threshold': 0.1}   0.4842105263157895




{'K': 2, 'threshold': 0.125}   0.0
{'K': 2, 'threshold': 0.15}   0.4631578947368421




{'K': 2, 'threshold': 0.175}   0.0
{'K': 2, 'threshold': 0.2}   0.4631578947368421
{'K': 2, 'threshold': 0.225}   0.4789473684210526
{'K': 2, 'threshold': 0.25}   0.41578947368421054
{'K': 3, 'threshold': 0}   0.4789473684210526
{'K': 3, 'threshold': 0.025}   0.4263157894736842
{'K': 3, 'threshold': 0.05}   0.43157894736842106
{'K': 3, 'threshold': 0.075}   0.4473684210526316
{'K': 3, 'threshold': 0.1}   0.48947368421052634
{'K': 3, 'threshold': 0.125}   0.45789473684210524
{'K': 3, 'threshold': 0.15}   0.46842105263157896
{'K': 3, 'threshold': 0.175}   0.4368421052631579
{'K': 3, 'threshold': 0.2}   0.45789473684210524
{'K': 3, 'threshold': 0.225}   0.4842105263157895
{'K': 3, 'threshold': 0.25}   0.4789473684210526
{'K': 4, 'threshold': 0}   0.45263157894736844
{'K': 4, 'threshold': 0.025}   0.48947368421052634
{'K': 4, 'threshold': 0.05}   0.45789473684210524
{'K': 4, 'threshold': 0.075}   0.4789473684210526
{'K': 4, 'threshold': 0.1}   0.48947368421052634
{'K': 4, 'threshold': 0.12

{'threshold': 0.2, 'K': 6}

The best hyperparameters are threshold = 0.2 and K = 6.

I used the second half of the 2018/19 season as the testing dataset and the first half of that season as the initial training data. For evaluation, I divided the testing data approximately into rounds. The model is trained on the training data and evaluated on one round of the testing dataset; that round is then added to the training data, and the process repeats for the next round.
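The round-by-round procedure can be sketched as follows. This is an illustration under stated assumptions, not the code of `test_season`: the function `rolling_accuracy`, the callables `fit` and `predict`, the `'result'` key, and the toy majority-class model are all hypothetical stand-ins for the matrix factorization model.

```python
from collections import Counter


def rolling_accuracy(train, test_rounds, fit, predict):
    """Retrain before every round, score that round, then absorb it.

    train       -- list of matches available before the first test round
    test_rounds -- list of rounds, each a list of matches
    fit         -- callable(train) -> model              (hypothetical)
    predict     -- callable(model, match) -> outcome     (hypothetical)
    """
    train = list(train)                  # do not mutate the caller's list
    hits = total = 0
    for round_matches in test_rounds:
        model = fit(train)               # uses only matches seen so far
        for match in round_matches:
            hits += predict(model, match) == match['result']
            total += 1
        train.extend(round_matches)      # this round becomes training data
    return hits / total if total else float('nan')


# Toy stand-in model: always predict the most frequent past result.
def fit_majority(train):
    return Counter(m['result'] for m in train).most_common(1)[0][0]


def predict_majority(model, match):
    return model


first_half = [{'result': 'H'}, {'result': 'H'}, {'result': 'A'}]
rounds = [[{'result': 'A'}, {'result': 'A'}],
          [{'result': 'A'}, {'result': 'H'}]]
print(rolling_accuracy(first_half, rounds, fit_majority, predict_majority))
```

Retraining before each round means every prediction uses all matches played up to that point, which mirrors how a forecast would be made during a real season.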

In [None]:
matrix_factorization_functions.test_season('EPL_19.csv', **best_params)



0.45789473684210524

The testing accuracy is 45.79%.