# Baseline implementation: Logistic Regression with the Original + Win-loss + Feature vectors feature set on the SA dataset

Necessary files: win_loss_functions.py, logistic_regression_functions.py, testing_functions.py, feature_vectors_functions.py, SA_13.csv, SA_14.csv, ..., SA_19.csv, A_SA.csv, H_SA.csv

In [None]:
import win_loss_functions
import logistic_regression_functions
import testing_functions
import feature_vectors_functions
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import requests
from bs4 import BeautifulSoup

The baseline follows the paper Predicting Football Match Results with Logistic Regression by D. Prasetio and M. Harlili (2016). The feature set combines the original features from the paper, features based only on the results of previous matches, and feature vectors. The model is evaluated on the SA dataset.

The authors obtained their best results with 5 training seasons, so I used 5 seasons for training as well: 2012/13 to 2016/17. Season 2017/18 serves as the validation set and the second half of season 2018/19 as the test set. The validation set is used to find the best value of the threshold, which determines when a draw is predicted.
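To make the role of the threshold concrete, here is a minimal sketch. The actual rule lives in `logistic_regression_functions.evaluate`, which is not shown in this notebook, so the function below (`predict_with_draw_band`, a hypothetical name) only assumes one plausible rule: a draw is predicted whenever the predicted home-win probability lies strictly within `threshold` of 0.5. The probabilities and labels are made up.

```python
def predict_with_draw_band(home_win_probs, threshold):
    """Map home-win probabilities to labels: 1 home win, 0 away win, 2 draw.

    Assumed rule (not taken from the original helper module): predict a draw
    when the probability is strictly closer than `threshold` to 0.5.
    """
    labels = []
    for p in home_win_probs:
        if abs(p - 0.5) < threshold:
            labels.append(2)
        elif p >= 0.5:
            labels.append(1)
        else:
            labels.append(0)
    return labels


def accuracy(predicted, actual):
    """Fraction of matches whose predicted label equals the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Made-up probabilities and outcomes, for illustration only
probs = [0.80, 0.52, 0.45, 0.10]
actual = [1, 2, 2, 0]

no_draws = predict_with_draw_band(probs, 0)      # threshold 0: a draw is never predicted
with_draws = predict_with_draw_band(probs, 0.1)  # wider band: close calls become draws
```

Under this assumed rule, a threshold of 0 reduces the model to a plain two-class home/away predictor, so every actual draw counts as a miss.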

In [None]:
# The draws are dropped for training, but not for validation and testing
X_train_win_loss, y_train = win_loss_functions.create_data(['SA_17.csv', 'SA_16.csv', 'SA_15.csv',
                                'SA_14.csv', 'SA_13.csv'],
                                drop_draws=True, return_names=True)
results_val_win_loss, matches_per_round = win_loss_functions.create_data_single('SA_18.csv', ['SA_13.csv', 'SA_14.csv', 'SA_15.csv',
                                'SA_16.csv', 'SA_17.csv'], return_names=True)
results_test_win_loss, matches_per_round = win_loss_functions.create_data_single('SA_19.csv', ['SA_13.csv', 'SA_14.csv', 'SA_15.csv',
                                'SA_16.csv', 'SA_17.csv', 'SA_18.csv'],
                                return_dates=True, return_names=True)

Processing SA_17.csv season file.
Processing SA_16.csv season file.
Processing SA_15.csv season file.
Processing SA_14.csv season file.
Processing SA_13.csv season file.


In [None]:
# Originally the results are represented as 1 for home win, 0 for draw and -1 for
# away win. Logistic Regression predicts probabilities between 0 and 1, so home win
# stays 1, away win becomes 0 and draw becomes 2.
y_train.replace(-1, 0, inplace=True)
results_val_win_loss.replace(0, 2, inplace=True)
results_val_win_loss.replace(-1, 0, inplace=True)
results_test_win_loss.replace(0, 2, inplace=True)
results_test_win_loss.replace(-1, 0, inplace=True)
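The order of the `replace` calls above matters: draws have to be moved from 0 to 2 before away wins are moved from -1 to 0, otherwise the two classes would collide. Also, `DataFrame.replace` acts on every column of the frame, not only on the label. A minimal sketch of an alternative, assuming the label is stored in a column named `FTR` (as in the later cells) and using a made-up feature column, remaps only the label in a single pass:

```python
import pandas as pd

# Made-up results in the original encoding: 1 home win, 0 draw, -1 away win.
raw = pd.DataFrame({
    "FTR": [1, 0, -1, 1, 0],
    "some_feature": [0, 3, -1, 2, 0],  # hypothetical feature column
})

# Target encoding used in this notebook: 1 home win, 0 away win, 2 draw.
label_map = {1: 1, 0: 2, -1: 0}

remapped = raw.copy()
# Only the label column is touched; feature values equal to 0 or -1 stay as they are.
remapped["FTR"] = raw["FTR"].map(label_map)
```

Because `map` looks every value up in the dictionary independently, the result does not depend on the order of the entries.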

In [None]:
# The draws are dropped for training, but not for validation and testing
X_train_originals, y_train = logistic_regression_functions.create_data(31, [17, 16, 15, 14, 13], 
                               ['SA_17.csv', 'SA_16.csv', 'SA_15.csv',
                                'SA_14.csv', 'SA_13.csv'],
                               logistic_regression_functions.team_names_map_sa,
                               logistic_regression_functions.secondary_team_names_map_sa)
results_val_originals, matches_per_round = logistic_regression_functions.create_data_single(31, 18, 'SA_18.csv',
                                                     logistic_regression_functions.team_names_map_sa, 
                                                     logistic_regression_functions.secondary_team_names_map_sa,
                                                     drop_draws=False)
# Dates are returned as well for dividing testing season into slices
results_test_originals, matches_per_round = logistic_regression_functions.create_data_single(31, 19, 'SA_19.csv',
                                                     logistic_regression_functions.team_names_map_sa, 
                                                     logistic_regression_functions.secondary_team_names_map_sa,
                                                     return_dates=True,
                                                     drop_draws=False)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['FTR'].replace('H', 1, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['FTR'].replace('A', 0, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results['FTR'].replace('D', 2, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning

In [None]:
# Concatenating the two feature sets together
X_train_mix = pd.concat([X_train_originals, X_train_win_loss], axis=1)
results_val_mix = pd.concat([results_val_originals, results_val_win_loss], axis=1)
results_test_mix = pd.concat([results_test_originals, results_test_win_loss], axis=1)
results_val_mix = results_val_mix.loc[:,~results_val_mix.columns.duplicated()].copy()
results_test_mix = results_test_mix.loc[:,~results_test_mix.columns.duplicated()].copy()
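Both validation and testing frames carry their own copy of the shared columns (at least `FTR`, judging by the later cells), which is why duplicated column names are filtered out after the concatenation. A toy example with made-up column names and values shows what `columns.duplicated()` keeps:

```python
import pandas as pd

# Two made-up feature frames that share the label column "FTR".
originals = pd.DataFrame({"rating_diff": [3, -2], "FTR": [1, 0]})
win_loss = pd.DataFrame({"home_streak": [2, 0], "FTR": [1, 0]})

mixed = pd.concat([originals, win_loss], axis=1)
# mixed now has two columns named "FTR"; duplicated() marks every
# occurrence of a column name after the first one.
deduplicated = mixed.loc[:, ~mixed.columns.duplicated()].copy()
```

Note that `pd.concat(..., axis=1)` aligns rows on the index, so this only works as intended when both frames list the same matches under the same index.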

In [None]:
A = pd.read_csv('A_SA.csv')
H = pd.read_csv('H_SA.csv')

In [None]:
# Adding the feature vectors to the features
X_train = feature_vectors_functions.add_feature_vector(X_train_mix, A, H)
results_val = feature_vectors_functions.add_feature_vector(results_val_mix, A, H)
results_test = feature_vectors_functions.add_feature_vector(results_test_mix, A, H)

The best value of the threshold is found using the validation season. The model is trained once on the training data, and each candidate threshold is then evaluated on the validation dataset. The threshold with the best validation accuracy is chosen.

In [None]:
X_val = results_val.drop('FTR', axis=1)
y_val = results_val['FTR']
clf = LogisticRegression(random_state=42, max_iter=10000).fit(X_train, y_train)
best_score = 0
best_threshold = 0
# The predicted home-win probabilities do not depend on the threshold,
# so they are computed once before the loop
preds = clf.predict_proba(X_val)[:, 1]
for threshold in [0, 0.0125, 0.025, 0.0375, 0.05, 0.0625, 0.075, 0.0875, 0.1, 0.1125, 0.125]:
  score = logistic_regression_functions.evaluate(preds, y_val, threshold)
  print("threshold ", threshold, " score ", score)
  if score > best_score:
    best_score = score
    best_threshold = threshold
print("best threshold ", best_threshold, " best score ", best_score)

threshold  0  score  0.5868421052631578
threshold  0.0125  score  0.5842105263157895
threshold  0.025  score  0.581578947368421
threshold  0.0375  score  0.5710526315789474
threshold  0.05  score  0.5684210526315789
threshold  0.0625  score  0.5684210526315789
threshold  0.075  score  0.5684210526315789
threshold  0.0875  score  0.5657894736842105
threshold  0.1  score  0.5578947368421052
threshold  0.1125  score  0.5526315789473684
threshold  0.125  score  0.5473684210526315
best threshold  0  best score  0.5868421052631578


The best threshold is 0.

To use all available data, the validation season and the first half of the testing season were added to the training data.

In [None]:
X_test_to_append, y_test_to_append = testing_functions.prepare_test_to_append(results_test,
                                                                              int(results_test.shape[0] / 2),
                                                                              drop_draws=True)

For evaluation on the testing dataset, I divided the testing data approximately into rounds. The model is trained on the training dataset, evaluated on one round of the testing dataset, and that round is then added to the training dataset before the next round is evaluated.

In [None]:
# Adding validation season and 1st half of testing season to the training data
X_train = pd.concat([X_train, X_val, X_test_to_append])
y_train = pd.concat([y_train, y_val, y_test_to_append])
# Rounds of the testing dataset
slices = testing_functions.get_slices(results_test, matches_per_round,
                                      int(results_test.shape[0] / 2))
weighted_sum = 0
total_matches = 0  # number of evaluated matches (avoids shadowing the built-in sum)
for slc in slices:
  # Ignore draws for training
  test_to_append = slc[slc['FTR'] != 2]
  clf = LogisticRegression(random_state=42, max_iter=10000).fit(X_train, y_train)
  X_test = slc.drop(['FTR', 'Date'], axis=1)
  y_test = slc['FTR']
  preds = clf.predict_proba(X_test)[:, 1]
  # Weight the accuracy of each round by its number of matches
  weighted_sum += (logistic_regression_functions.evaluate(preds, y_test, best_threshold) * len(y_test))
  total_matches += len(y_test)
  # Add the round (without draws) to the training dataset
  X_to_append = test_to_append.drop(['FTR', 'Date'], axis=1)
  y_to_append = test_to_append['FTR']
  X_train = pd.concat([X_train, X_to_append])
  y_train = pd.concat([y_train, y_to_append])
print(weighted_sum / total_matches)

0.49473684210526314
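Weighting the accuracy of each round by its number of matches makes the final figure equal to the accuracy over all evaluated test matches pooled together, so a round with fewer matches does not count more than it should. A small check with made-up round sizes and hit counts:

```python
# Made-up rounds: (number of matches, number of correct predictions).
rounds = [(10, 6), (10, 4), (5, 3)]

weighted_sum = 0.0
total_matches = 0
for n_matches, n_correct in rounds:
    round_accuracy = n_correct / n_matches
    weighted_sum += round_accuracy * n_matches
    total_matches += n_matches

# Match-weighted mean of the per-round accuracies
weighted_accuracy = weighted_sum / total_matches
# Accuracy over all matches pooled together
pooled_accuracy = sum(c for _, c in rounds) / total_matches
# Plain mean of the per-round accuracies, for comparison
unweighted_mean = sum(c / n for n, c in rounds) / len(rounds)
```

Here the weighted and pooled values agree, while the unweighted mean differs because the short third round would be given the same weight as the full rounds.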
