# Baseline implementation using PCA and Naive Bayes with the original feature set on the EPL dataset

Necessary files: testing_functions.py, pca_nb_functions.py, EPL_06.csv, EPL_07.csv, ..., EPL_19.csv

In [None]:
import testing_functions
import pca_nb_functions
import pandas as pd
from datetime import datetime
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

The baseline follows the paper "Predicting The Dutch Football Competition Using Public Data: A Machine Learning Approach" by N. Tax and Y. Joustra (2015). It uses the original features described in the paper, and the model is evaluated on the EPL dataset.

The authors used data from 13 seasons, so I did the same: seasons 2006/07 to 2016/17 for training, season 2017/18 for validation, and the second half of season 2018/19 for testing. The validation set is not needed for this baseline, but I kept the split as close as possible to the one used for the other baselines and models.

In [None]:
X_train, y_train = pca_nb_functions.create_data(['EPL_17.csv', 'EPL_16.csv',
                                'EPL_15.csv', 'EPL_14.csv', 'EPL_13.csv',
                                'EPL_12.csv', 'EPL_11.csv', 'EPL_10.csv',
                                'EPL_09.csv', 'EPL_08.csv', 'EPL_07.csv',
                                'EPL_06.csv'], pca_nb_functions.teams_epl)
results_val, matches_per_round = pca_nb_functions.create_data_single('EPL_18.csv', 'EPL_17.csv',
                            ['EPL_16.csv',
                             'EPL_15.csv', 'EPL_14.csv', 'EPL_13.csv',
                             'EPL_12.csv', 'EPL_11.csv', 'EPL_10.csv',
                             'EPL_09.csv', 'EPL_08.csv', 'EPL_07.csv',
                             'EPL_06.csv'], pca_nb_functions.teams_epl)
# Dates are returned as well for dividing testing season into slices
results_test, matches_per_round = pca_nb_functions.create_data_single('EPL_19.csv', 'EPL_18.csv',
                            ['EPL_17.csv', 'EPL_16.csv',
                             'EPL_15.csv', 'EPL_14.csv', 'EPL_13.csv',
                             'EPL_12.csv', 'EPL_11.csv', 'EPL_10.csv',
                             'EPL_09.csv', 'EPL_08.csv', 'EPL_07.csv',
                             'EPL_06.csv'], pca_nb_functions.teams_epl,
                             return_dates=True)

Processing EPL_17.csv season file.
Processing EPL_16.csv season file.
Processing EPL_15.csv season file.
Processing EPL_14.csv season file.
Processing EPL_13.csv season file.
Processing EPL_12.csv season file.
Processing EPL_11.csv season file.
Processing EPL_10.csv season file.
Processing EPL_09.csv season file.
Processing EPL_08.csv season file.
Processing EPL_07.csv season file.


In [None]:
X_val = results_val.drop('FTR', axis=1)
y_val = results_val['FTR']

In the paper, the highest accuracy was achieved with 3 PCA components, so I used 3 components as well.
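As an illustration of this setup, the following is a minimal sketch of a 3-component PCA followed by Gaussian Naive Bayes. The feature matrix and labels are random stand-ins rather than the EPL data, and the label values `H`, `D`, `A` are an assumption about how `FTR` is encoded. Wrapping both steps in a scikit-learn pipeline is a convenience chosen for the sketch; the notebook itself calls `PCA` and `GaussianNB` separately.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 200 matches with 10 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = rng.choice(["H", "D", "A"], size=200)  # assumed FTR encoding

# Project onto 3 principal components, then fit Gaussian Naive Bayes
model = make_pipeline(PCA(n_components=3), GaussianNB())
model.fit(X_demo, y_demo)

# The fitted PCA step maps the 10 features to 3 components
reduced = model.named_steps["pca"].transform(X_demo)
predictions = model.predict(X_demo[:5])
```

Because the pipeline applies the fitted PCA before predicting, the projection learned on the training data is reused on any new matches.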

To use all the available data, the validation season and the first half of the testing season were added to the training data.

In [None]:
# Some rounds at the beginning of the season are ignored, so this is
# the correct index of the start of the second half of the season
start_test_index = 13 * matches_per_round

In [None]:
X_test_to_append, y_test_to_append = testing_functions.prepare_test_to_append(results_test,
                                                                              start_test_index)

For evaluation, I divided the testing data approximately into rounds. The model is trained on the training dataset and evaluated on one round of the testing dataset; that round is then added to the training dataset before the next round is evaluated.
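The round-by-round procedure can be sketched independently of the actual model. In the example below, a majority-label predictor stands in for PCA plus Naive Bayes, and the labels and rounds are made up; `majority_label` and `expanding_window_accuracy` are hypothetical names used only for this illustration.

```python
from collections import Counter


def majority_label(labels):
    """Stand-in 'model': predict the most frequent label seen so far."""
    return Counter(labels).most_common(1)[0][0]


def expanding_window_accuracy(train_labels, test_rounds):
    """Evaluate round by round, growing the training set after each round."""
    seen = list(train_labels)
    correct = 0
    total = 0
    for round_labels in test_rounds:
        # "Fit" on everything seen so far, then score the current round
        prediction = majority_label(seen)
        correct += sum(label == prediction for label in round_labels)
        total += len(round_labels)
        # The evaluated round joins the training data for the next round
        seen.extend(round_labels)
    return correct / total


# Made-up labels for illustration only
train = ["H", "H", "A", "D"]
rounds = [["H", "A"], ["A", "A", "D"], ["A", "H"]]
accuracy = expanding_window_accuracy(train, rounds)
```

Each round is scored before it is added to the training data, so the model never sees a match it is evaluated on.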


In [None]:
# Adding validation season and 1st half of testing season to the training data
X_train = pd.concat([X_train, X_val, X_test_to_append])
y_train = pd.concat([y_train, y_val, y_test_to_append])
# Rounds of the testing dataset
slices = testing_functions.get_slices(results_test, matches_per_round,
                                      start_test_index)
# Accumulate per-round accuracy weighted by the number of matches in the round
weighted_sum = 0
n_matches = 0  # renamed from `sum` to avoid shadowing the built-in
for slc in slices:
  # Refit PCA and the classifier on all data available before this round
  pca = PCA(n_components=3)
  X_train_pca = pca.fit_transform(X_train)
  clf = GaussianNB().fit(X_train_pca, y_train)
  X_test = slc.drop(['FTR', 'Date'], axis=1)
  y_test = slc['FTR']
  X_test_pca = pca.transform(X_test)
  weighted_sum += (clf.score(X_test_pca, y_test) * len(y_test))
  n_matches += len(y_test)
  # Add the round to the training dataset
  X_train = pd.concat([X_train, X_test])
  y_train = pd.concat([y_train, y_test])
print(weighted_sum / n_matches)

0.5


The accuracy on the test set, the second half of the 2018/19 season, is 50%.
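The reported figure is a match-weighted average of the per-round accuracies, which equals the share of all test matches predicted correctly. The sketch below checks this with made-up round sizes and correct counts; the numbers are not taken from the EPL results.

```python
# Made-up rounds for illustration: matches played and matches predicted correctly
round_sizes = [10, 10, 9]
round_correct = [6, 4, 5]

# Per-round accuracy, as clf.score would return for each round
round_accuracy = [c / n for c, n in zip(round_correct, round_sizes)]

# Match-weighted average of the per-round accuracies (what the loop computes)
weighted = sum(a * n for a, n in zip(round_accuracy, round_sizes)) / sum(round_sizes)

# Pooled accuracy over all matches
pooled = sum(round_correct) / sum(round_sizes)

# Plain unweighted mean of the per-round accuracies, for comparison
unweighted = sum(round_accuracy) / len(round_accuracy)
```

Here `weighted` and `pooled` agree, while `unweighted` differs slightly because the rounds do not all contain the same number of matches.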