# Implementation of baseline using Logistic Regression with Original feature set and EPL

Necessary files: logistic_regression_functions.py, testing_functions.py, EPL_11.csv, EPL_14.csv, EPL_15.csv, ..., EPL_19.csv

In [None]:
import logistic_regression_functions
import testing_functions
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

The baseline is implemented according to the 2016 paper "Predicting Football Match Results with Logistic Regression" by D. Prasetio and M. Harlili. The original features described in the paper are used, and the model is evaluated on the EPL dataset.

The authors had the best results with 5 training seasons, so I used 5 seasons for training as well. I used seasons 2010/11 and 2013/14 to 2016/17 for training (some ratings were missing for the 2011/12 and 2012/13 seasons), season 2017/18 as the validation set, and the second half of season 2018/19 for testing. The validation set is used to find the best value of the threshold, which determines when a draw is predicted.

In [None]:
# The draws are dropped for training, but not for validation and testing
X_train, y_train = logistic_regression_functions.create_data(13, [11, 14, 15, 16, 17], 
                               ['EPL_11.csv', 'EPL_14.csv', 'EPL_15.csv',
                                'EPL_16.csv', 'EPL_17.csv'],
                               logistic_regression_functions.team_names_map_epl,
                               logistic_regression_functions.secondary_team_names_map_epl)
results_val, matches_per_round = logistic_regression_functions.create_data_single(13, 18, 'EPL_18.csv',
                                                     logistic_regression_functions.team_names_map_epl, 
                                                     logistic_regression_functions.secondary_team_names_map_epl,
                                                     drop_draws=False)
# Dates are returned as well for dividing testing season into slices
results_test, matches_per_round = logistic_regression_functions.create_data_single(13, 19, 'EPL_19.csv',
                                                     logistic_regression_functions.team_names_map_epl, 
                                                     logistic_regression_functions.secondary_team_names_map_epl,
                                                     return_dates=True,
                                                     drop_draws=False)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['FTR'].replace('H', 1, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['FTR'].replace('A', 0, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['FTR'].replace('D', 2, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning
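
The warnings above are raised when `replace(..., inplace=True)` is called on a column of a DataFrame that is itself a slice of another frame. A minimal sketch of one way to avoid them, assuming the label encoding quoted in the warning text (`H` to 1, `A` to 0, `D` to 2). The helper name `encode_results` and the toy frame are hypothetical and are not taken from `logistic_regression_functions.py`:

```python
import pandas as pd

# Encoding taken from the replace() calls quoted in the warnings.
FTR_CODES = {"H": 1, "A": 0, "D": 2}

def encode_results(matches: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: select columns and encode FTR without chained assignment."""
    # .copy() makes the selection an independent frame, so later writes
    # cannot be mistaken for writes to a view of `matches`.
    results = matches[columns].copy()
    # Assign the mapped column back instead of mutating it in place.
    results["FTR"] = results["FTR"].map(FTR_CODES)
    return results

# Toy data for illustration only.
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "C"],
    "FTR": ["H", "D", "A"],
    "Referee": ["x", "y", "z"],
})
encoded = encode_results(toy, ["HomeTeam", "FTR"])
print(encoded)
```

The original frame is left unchanged, so the same source data can be re-encoded with or without draws.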

The best value of the threshold is found using the validation season. The model is trained once on the training data, and each candidate threshold is then evaluated on the validation dataset. The threshold with the best validation accuracy is chosen.

In [None]:
X_val = results_val.drop('FTR', axis=1)
y_val = results_val['FTR']
# Fit once; the threshold only changes how probabilities are turned into predictions
clf = LogisticRegression(random_state=42, max_iter=1000).fit(X_train, y_train)
# Predicted probability of class 1 (home win)
preds = clf.predict_proba(X_val)[:, 1]
best_score = 0
best_threshold = 0
for threshold in [0, 0.0125, 0.025, 0.0375, 0.05, 0.0625, 0.075, 0.0875, 0.1, 0.1125, 0.125]:
  score = logistic_regression_functions.evaluate(preds, y_val, threshold)
  print("threshold ", threshold, " score ", score)
  if score > best_score:
    best_score = score
    best_threshold = threshold
print("best threshold ", best_threshold, " best score ", best_score)

threshold  0  score  0.5263157894736843
threshold  0.0125  score  0.5236842105263158
threshold  0.025  score  0.5184210526315789
threshold  0.0375  score  0.5184210526315789
threshold  0.05  score  0.5263157894736843
threshold  0.0625  score  0.531578947368421
threshold  0.075  score  0.5210526315789474
threshold  0.0875  score  0.5342105263157895
threshold  0.1  score  0.5157894736842106
threshold  0.1125  score  0.5184210526315789
threshold  0.125  score  0.5184210526315789
best threshold  0.0875  best score  0.5342105263157895


The best threshold is 0.0875.
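
`logistic_regression_functions.evaluate` is not shown in this notebook. As a rough sketch of how a draw threshold of this kind can work, the version below assumes that a draw is predicted when the home-win probability lies within `threshold` of 0.5, and a home or away win otherwise. The function name `evaluate_sketch` and this decision rule are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np

HOME_WIN, AWAY_WIN, DRAW = 1, 0, 2  # label encoding used in this notebook

def evaluate_sketch(home_win_probs, y_true, threshold):
    """Accuracy when probabilities close to 0.5 are turned into draw predictions."""
    probs = np.asarray(home_win_probs, dtype=float)
    y_true = np.asarray(y_true)
    # Default decision: home win if p >= 0.5, otherwise away win.
    preds = np.where(probs >= 0.5, HOME_WIN, AWAY_WIN)
    # Assumed rule: inside the band (0.5 - threshold, 0.5 + threshold) predict a draw.
    preds[np.abs(probs - 0.5) < threshold] = DRAW
    return float(np.mean(preds == y_true))

# Toy probabilities and labels for illustration only.
probs = [0.80, 0.52, 0.30, 0.45]
labels = [HOME_WIN, DRAW, AWAY_WIN, HOME_WIN]
acc_no_draws = evaluate_sketch(probs, labels, 0.0)      # no draw is ever predicted
acc_with_band = evaluate_sketch(probs, labels, 0.0875)  # two matches fall in the band
print(acc_no_draws, acc_with_band)
```

Under this rule a threshold of 0 never predicts a draw, so widening the band trades correctly predicted wins for correctly predicted draws.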

To use all the available data, the validation season and the first half of the testing season were added to the training data.

In [None]:
X_test_to_append, y_test_to_append = testing_functions.prepare_test_to_append(results_test,
                                                                              int(results_test.shape[0] / 2),
                                                                              drop_draws=True)

For evaluation on the testing dataset, I divided the testing data approximately into rounds. The model is trained on the training dataset and evaluated on one round of the testing dataset; that round is then added to the training dataset before the next round is evaluated.
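
This round-by-round procedure is a form of expanding-window (walk-forward) evaluation. A minimal self-contained sketch on synthetic data, assuming fixed-size rounds and plain accuracy without a draw threshold. The names `walk_forward_accuracy` and `round_size` are hypothetical, and the notebook's own `testing_functions.get_slices` may split rounds differently:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def walk_forward_accuracy(X_train, y_train, X_test, y_test, round_size):
    """Train, score one round, add that round to the training data, repeat."""
    correct = 0
    total = 0
    for start in range(0, len(X_test), round_size):
        X_round = X_test[start:start + round_size]
        y_round = y_test[start:start + round_size]
        # Refit on everything seen so far.
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        # Counting correct predictions is the same as weighting each
        # round's accuracy by its number of matches.
        correct += int(np.sum(clf.predict(X_round) == y_round))
        total += len(y_round)
        # The scored round becomes training data for the next round.
        X_train = np.vstack([X_train, X_round])
        y_train = np.concatenate([y_train, y_round])
    return correct / total

# Synthetic data for illustration only: the label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
accuracy = walk_forward_accuracy(X[:100], y[:100], X[100:], y[100:], round_size=10)
print(accuracy)
```

Each round is scored before the model has seen it, so the reported accuracy never uses future matches for training.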

In [None]:
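# Adding validation season and 1st half of testing season to the training data
# Note: X_val and y_val come from results_val, which was created with drop_draws=False,
# so the validation rows added here still include draws (FTR == 2)
X_train = pd.concat([X_train, X_val, X_test_to_append])
y_train = pd.concat([y_train, y_val, y_test_to_append])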
# Rounds of the testing dataset
slices = testing_functions.get_slices(results_test, matches_per_round,
                                      int(results_test.shape[0] / 2))
weighted_sum = 0
total_matches = 0  # named to avoid shadowing the built-in sum()
for slc in slices:
  # Draws (FTR == 2) are excluded from the rows later added to the training data
  test_to_append = slc[slc['FTR'] != 2]
  clf = LogisticRegression(random_state=42, max_iter=1000).fit(X_train, y_train)
  X_test = slc.drop(['FTR', 'Date'], axis=1)
  y_test = slc['FTR']
  preds = clf.predict_proba(X_test)[:, 1]
  # Weight each round's accuracy by its number of matches
  weighted_sum += (logistic_regression_functions.evaluate(preds, y_test, best_threshold) * len(y_test))
  total_matches += len(y_test)
  # Add the round (without draws) to the training dataset
  X_round = test_to_append.drop(['FTR', 'Date'], axis=1)
  y_round = test_to_append['FTR']
  X_train = pd.concat([X_train, X_round])
  y_train = pd.concat([y_train, y_round])
print(weighted_sum / total_matches)

0.5473684210526316


The testing accuracy on the second half of season 2018/19 is 54.74%.