# Implementation of baseline using PCA and Naive Bayes with Original feature set and BSA

Necessary files: testing_functions.py, pca_nb_functions.py, BRA.csv

In [None]:
import testing_functions
import pca_nb_functions
import pandas as pd
from datetime import datetime
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

The baseline follows the paper "Predicting The Dutch Football Competition Using Public Data: A Machine Learning Approach" by N. Tax and Y. Joustra (2015). It uses the original features described in the paper, except for a few that are not available for the BSA, and the model is evaluated on the BSA dataset.

The authors used data from 13 seasons. Unfortunately, data for the BSA are available only from 2012 onward, so I used seasons 2012 to 2017 for training, season 2018 for validation, and the second half of season 2019 for testing. The validation part is not strictly necessary here, but I wanted to keep the dataset split as close as possible to that of the other baselines and models.

In [None]:
full_dataset = pd.read_csv('BRA.csv')
full_dataset.rename(columns = {'Home':'HomeTeam', 'Away': 'AwayTeam',
                               'HG': 'FTHG', 'AG': 'FTAG', 'Res': 'FTR',
                               'PH': 'B365H', 'PD': 'B365D',
                               'PA': 'B365A'}, inplace = True)
for season in full_dataset['Season'].unique():
  dataset = full_dataset[full_dataset['Season'] == season]
  dataset.to_csv('BSA_' + str(season)[-2:] + '.csv', index=False)

In [None]:
X_train_stable, y_train_stable = pca_nb_functions.create_data(['BSA_17.csv', 'BSA_16.csv',
                             'BSA_15.csv', 'BSA_14.csv',
                             'BSA_13.csv',
                             'BSA_12.csv'], pca_nb_functions.teams_bsa,
                             include_shots_fauls=False)
results_val, matches_per_round = pca_nb_functions.create_data_single('BSA_18.csv', 'BSA_17.csv',
                            ['BSA_16.csv',
                             'BSA_15.csv', 'BSA_14.csv', 'BSA_13.csv',
                             'BSA_12.csv'], pca_nb_functions.teams_bsa,
                             include_shots_fauls=False)
# Dates are returned as well for dividing testing season into slices
results_test, matches_per_round = pca_nb_functions.create_data_single('BSA_19.csv', 'BSA_18.csv',
                            ['BSA_17.csv', 'BSA_16.csv',
                             'BSA_15.csv', 'BSA_14.csv', 'BSA_13.csv',
                             'BSA_12.csv'], pca_nb_functions.teams_bsa,
                             return_dates=True, include_shots_fauls=False)

Processing BSA_17.csv season file.
Processing BSA_16.csv season file.
Processing BSA_15.csv season file.
Processing BSA_14.csv season file.
Processing BSA_13.csv season file.


In [None]:
X_val = results_val.drop('FTR', axis=1)
y_val = results_val['FTR']

In the paper, the highest accuracy was achieved with 3 PCA components, so I used 3 components as well.
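The number of components is taken from the paper here rather than tuned. As a minimal sketch of how that choice could be checked on a validation split, the example below compares several component counts with a PCA + Gaussian Naive Bayes pipeline. The data is synthetic (`make_classification`), and the helper `validation_accuracy` is a hypothetical name for illustration. Neither is part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the match features (3 classes, like H/D/A).
# This is NOT the BSA data; it only makes the sketch runnable.
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, y_tr = X[:450], y[:450]   # "training" part
X_va, y_va = X[450:], y[450:]   # "validation" part

def validation_accuracy(n_components):
    """Fit PCA + Gaussian NB on the training part, score on the validation part."""
    model = make_pipeline(PCA(n_components=n_components), GaussianNB())
    model.fit(X_tr, y_tr)
    return model.score(X_va, y_va)

# Validation accuracy for 1 to 7 components, and the best-scoring count
scores = {k: validation_accuracy(k) for k in range(1, 8)}
best_k = max(scores, key=scores.get)
print(scores)
print('Best number of components on this synthetic split:', best_k)
```

With the real data, `X_tr`/`y_tr` would correspond to `X_train_stable`/`y_train_stable` and `X_va`/`y_va` to `X_val`/`y_val`.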

To use all the data available, the validation season and the first half of the testing season were added to the training data.

In [None]:
# Some rounds in the beginning are ignored, this is the correct index
# of the start of the second half of the season
start_test_index = 13 * matches_per_round

In [None]:
X_test_to_append, y_test_to_append = testing_functions.prepare_test_to_append(results_test,
                                                                              start_test_index)

For evaluation, I divided the testing data into slices that approximately correspond to rounds. The model is trained on the training dataset and evaluated on one round of the testing dataset; that round is then added to the training dataset before the next round is evaluated.
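The slicing itself is done by `testing_functions.get_slices`, whose code is not shown here. The sketch below illustrates one way such approximate rounds could be built, assuming the matches are sorted by date and are simply cut into consecutive chunks of `matches_per_round` rows. The function `split_into_rounds` and the toy data are hypothetical, and the real helper may work differently.

```python
import pandas as pd

def split_into_rounds(matches, matches_per_round, start_index):
    """Hypothetical helper: cut the rows from start_index onward into
    consecutive chunks of matches_per_round rows (the last may be shorter)."""
    tail = matches.iloc[start_index:]
    return [tail.iloc[i:i + matches_per_round]
            for i in range(0, len(tail), matches_per_round)]

# Toy season: 25 matches in date order, made-up results
toy = pd.DataFrame({
    'Date': pd.date_range('2019-01-01', periods=25, freq='D'),
    'FTR': ['H', 'D', 'A', 'H', 'H'] * 5,
})

# Skip the first 5 matches, then form rounds of 10 matches each
rounds = split_into_rounds(toy, matches_per_round=10, start_index=5)
print([len(r) for r in rounds])
```

Fixed-size chunks match real rounds only when every round has the same number of matches and none are postponed, which would be one reason the division is only approximate.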

In [None]:
# Adding validation season and 1st half of testing season to the training data
X_train = pd.concat([X_train_stable, X_val, X_test_to_append])
y_train = pd.concat([y_train_stable, y_val, y_test_to_append])
# Rounds of the testing dataset
slices = testing_functions.get_slices(results_test, matches_per_round,
                                      start_test_index)
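weighted_sum = 0   # sum of per-round accuracies weighted by round size
total_matches = 0  # number of evaluated matches (avoids shadowing built-in sum)
for slc in slices:
  # Refit PCA and the classifier on all data seen so far
  pca = PCA(n_components=3)
  X_train_pca = pca.fit_transform(X_train)
  clf = GaussianNB().fit(X_train_pca, y_train)
  X_test = slc.drop(['FTR', 'Date'], axis=1)
  y_test = slc['FTR']
  # Project the round with the PCA fitted on the training data only
  X_test_pca = pca.transform(X_test)
  weighted_sum += (clf.score(X_test_pca, y_test) * len(y_test))
  total_matches += len(y_test)
  # Add the round to the training dataset
  X_train = pd.concat([X_train, X_test])
  y_train = pd.concat([y_train, y_test])
print(weighted_sum / total_matches)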

0.4631578947368421


The testing accuracy is 46.32%.
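This figure is a size-weighted average of the per-round accuracies, which is the same as the share of correctly predicted matches over all evaluated rounds. The sketch below shows the arithmetic with made-up per-round numbers; `pooled_accuracy` is a hypothetical name and the values are not results from this notebook.

```python
def pooled_accuracy(round_results):
    """round_results: list of (accuracy, n_matches) pairs, one per round."""
    weighted_sum = sum(acc * n for acc, n in round_results)
    total_matches = sum(n for _, n in round_results)
    return weighted_sum / total_matches

# Made-up rounds for illustration: 5/10, 4/10 and 3/5 correct predictions
example = [(0.5, 10), (0.4, 10), (0.6, 5)]

weighted = pooled_accuracy(example)                         # 12 correct out of 25
unweighted = sum(acc for acc, _ in example) / len(example)  # plain mean of rounds
print(weighted, unweighted)
```

The two averages differ whenever rounds have different sizes. Weighting by round size keeps a short final round from counting as much as a full one.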