# Implementation of baseline using Logistic Regression with Win-loss feature set and EPL

Necessary files: win_loss_functions.py, logistic_regression_functions.py, testing_functions.py, EPL_11.csv, EPL_14.csv, EPL_15.csv, ..., EPL_19.csv

In [None]:
import win_loss_functions
import logistic_regression_functions
import testing_functions
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

The baseline follows the paper Predicting Football Match Results with Logistic Regression by D. Prasetio and M. Harlili (2016). Only features derived from the results of previous matches are used, and the model is evaluated on the EPL dataset.

The authors obtained their best results with 5 training seasons, so I also used 5 seasons for training: 2010/11 and 2013/14 to 2016/17 (seasons 2011/12 and 2012/13 are skipped because some ratings are missing). Season 2017/18 serves as the validation set and the second half of season 2018/19 as the test set. The validation set is used to find the best value of the threshold that determines when a draw is predicted.

In [None]:
# The draws are dropped for training, but not for validation and testing
X_train, y_train = win_loss_functions.create_data(['EPL_17.csv', 'EPL_16.csv',
                                'EPL_15.csv', 'EPL_14.csv', 'EPL_11.csv'],
                                drop_draws=True)
results_val, matches_per_round = win_loss_functions.create_data_single('EPL_18.csv', ['EPL_11.csv', 'EPL_14.csv',
                                'EPL_15.csv', 'EPL_16.csv', 'EPL_17.csv'])
# Dates are returned as well for dividing testing season into slices
results_test, matches_per_round = win_loss_functions.create_data_single('EPL_19.csv', ['EPL_11.csv', 'EPL_14.csv',
                                'EPL_15.csv', 'EPL_16.csv', 'EPL_17.csv', 'EPL_18.csv'],
                                return_dates=True)

Processing EPL_17.csv season file.
Processing EPL_16.csv season file.
Processing EPL_15.csv season file.
Processing EPL_14.csv season file.
Processing EPL_11.csv season file.


In [None]:
# Originally the results are encoded as 1 for home win, 0 for draw and -1 for
# away win. Logistic Regression predicts probabilities between 0 and 1, so home
# win stays 1, away win becomes 0 and draw becomes 2.
y_train.replace(-1, 0, inplace=True)
results_val.replace(0, 2, inplace=True)
results_val.replace(-1, 0, inplace=True)
results_test.replace(0, 2, inplace=True)
results_test.replace(-1, 0, inplace=True)
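
The relabelling described in the comment above can also be written as a single mapping applied to the label column alone. This is only a sketch, not the notebook's code: `RESULT_CODES` and `encode_results` are hypothetical names, and the example data is made up. It assumes the label column is called `FTR`, as in the later cells. A one-pass mapping does not depend on the order of the replacements and leaves all other columns untouched.

```python
import pandas as pd

# Hypothetical mapping: original code -> code used for Logistic Regression
# 1 (home win) -> 1, -1 (away win) -> 0, 0 (draw) -> 2
RESULT_CODES = {1: 1, -1: 0, 0: 2}

def encode_results(ftr):
    """Return a re-encoded copy of a Series of match results."""
    return ftr.map(RESULT_CODES)

# Made-up example: home win, draw, away win, home win
example = pd.Series([1, 0, -1, 1], name="FTR")
encoded = encode_results(example)
print(encoded.tolist())
```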

The best threshold value is found using the validation season. The model is trained once on the training data, and its predictions are then evaluated on the validation dataset for each candidate threshold. The threshold with the best validation accuracy is chosen.
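
To make the role of the threshold concrete, here is a minimal sketch of one way such a rule could work. It is an assumption, not the contents of `logistic_regression_functions.evaluate`: a draw is predicted when the home-win probability lies strictly within `threshold` of 0.5, and the names `predict_outcome` and `accuracy_with_draws` are hypothetical.

```python
def predict_outcome(p_home, threshold):
    """Map a home-win probability to 1 (home win), 0 (away win) or 2 (draw).

    Assumed rule: predict a draw when the probability is strictly closer
    to 0.5 than `threshold`.
    """
    if abs(p_home - 0.5) < threshold:
        return 2
    return 1 if p_home >= 0.5 else 0

def accuracy_with_draws(probs, labels, threshold):
    """Fraction of matches whose predicted outcome equals the true label."""
    preds = [predict_outcome(p, threshold) for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)

# Made-up probabilities and labels, for illustration only
probs = [0.70, 0.30, 0.51]
labels = [1, 0, 2]
print(accuracy_with_draws(probs, labels, 0.05))
print(accuracy_with_draws(probs, labels, 0))
```

Under this assumed rule, a threshold of 0 never predicts a draw, so every draw in the evaluation data counts as an error.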

In [None]:
X_val = results_val.drop('FTR', axis=1)
y_val = results_val['FTR']
clf = LogisticRegression(random_state=42, max_iter=1000).fit(X_train, y_train)
# The predicted probabilities do not depend on the threshold, so compute them once
preds = clf.predict_proba(X_val)[:, 1]
best_score = 0
best_threshold = 0
for threshold in [0, 0.0125, 0.025, 0.0375, 0.05, 0.0625, 0.075, 0.0875, 0.1, 0.1125, 0.125]:
  score = logistic_regression_functions.evaluate(preds, y_val, threshold)
  print("threshold ", threshold, " score ", score)
  if score > best_score:
    best_score = score
    best_threshold = threshold
print("best threshold ", best_threshold, " best score ", best_score)

threshold  0  score  0.4526315789473684
threshold  0.0125  score  0.44999999999999996
threshold  0.025  score  0.4447368421052632
threshold  0.0375  score  0.4368421052631579
threshold  0.05  score  0.41315789473684206
threshold  0.0625  score  0.4078947368421053
threshold  0.075  score  0.4
threshold  0.0875  score  0.3894736842105263
threshold  0.1  score  0.3789473684210526
threshold  0.1125  score  0.368421052631579
threshold  0.125  score  0.37631578947368416
best threshold  0  best score  0.4526315789473684


The best threshold is 0.

To use all available data, the validation season and the first half of the testing season are added to the training data.

In [None]:
X_test_to_append, y_test_to_append = testing_functions.prepare_test_to_append(results_test,
                                                                              int(results_test.shape[0] / 2),
                                                                              drop_draws=True)

For evaluation on the testing dataset, I divided the testing data approximately into rounds. The model is trained on the training dataset and evaluated on one round of the testing dataset; that round is then added to the training dataset before the next round is evaluated.

In [None]:
# Adding validation season and 1st half of testing season to the training data
X_train = pd.concat([X_train, X_val, X_test_to_append])
y_train = pd.concat([y_train, y_val, y_test_to_append])
# Rounds of the testing dataset
slices = testing_functions.get_slices(results_test, matches_per_round,
                                      int(results_test.shape[0] / 2))
weighted_sum = 0
n_matches = 0  # named n_matches to avoid shadowing the built-in sum()
for slc in slices:
  # Ignore draws for training
  test_to_append = slc[slc['FTR'] != 2]
  clf = LogisticRegression(random_state=42, max_iter=1000).fit(X_train, y_train)
  X_test = slc.drop(['FTR', 'Date'], axis=1)
  y_test = slc['FTR']
  preds = clf.predict_proba(X_test)[:, 1]
  # Weight each round's accuracy by its number of matches
  weighted_sum += (logistic_regression_functions.evaluate(preds, y_test, best_threshold) * len(y_test))
  n_matches += len(y_test)
  # Add the round (without draws) to the training dataset
  X_new = test_to_append.drop(['FTR', 'Date'], axis=1)
  y_new = test_to_append['FTR']
  X_train = pd.concat([X_train, X_new])
  y_train = pd.concat([y_train, y_new])
print(weighted_sum / n_matches)

0.5526315789473685


The testing accuracy is 55.26%.
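
The reported figure is a size-weighted average of the per-round accuracies, which is the same as the overall fraction of correctly predicted matches. The sketch below illustrates why the weighting matters when rounds differ in size. The function name `pooled_accuracy` is hypothetical and the per-round numbers are made up, not taken from the experiment.

```python
def pooled_accuracy(round_results):
    """Combine (accuracy, n_matches) pairs into one overall accuracy."""
    weighted_sum = sum(acc * n for acc, n in round_results)
    total = sum(n for _, n in round_results)
    return weighted_sum / total

# Made-up rounds: (accuracy in the round, number of matches in the round)
rounds = [(0.6, 10), (0.5, 10), (0.25, 8)]

pooled = pooled_accuracy(rounds)
unweighted = sum(acc for acc, _ in rounds) / len(rounds)
print(pooled, unweighted)
```

Here 13 of 28 matches are predicted correctly, so the pooled accuracy is 13/28, while the plain mean of the three round accuracies is 0.45 and gives the short round too much influence.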