# Building a simple machine learning program on Serie A 2018/19 Season Stats - Using TensorFlow

Let us begin by importing the necessary packages.

In [1]:
import numpy as np
import pandas as pd

Defining a general function that will take a dataset of football/soccer results and output the number of wins, draws, and losses. Note that the 'data' should be in Panda DataFrame format, with FTR being a category/heading of a column. In our datasets, this is a given.

In [2]:
# General function that isolates the full time result to see record of wins/losses/draws

def winRecords(data, leagueName):
    
    ftr = data['FTR']

    homeWins=0
    awayWins=0
    draws=0
    
    for i in range(0, len(ftr.to_numpy())-1):
    
        if (ftr.to_numpy()[i] == 'H'):
            homeWins += 1
            #print(j)
        elif (ftr.to_numpy()[i] == 'A'):
            awayWins += 1
        else:
            draws +=1

    return leagueName, 'HomeWins: %s' % homeWins, 'Away Wins: %s' % awayWins, 'Draws: %s' % draws

Now, let's read the files and print out the stats for Serie A leagues in several seasons (and the EPL season).

In [3]:
print(winRecords(pd.read_csv("serieA_season-1516.csv"), 'SerieA_15_16'))
print(winRecords(pd.read_csv("serieA_season-1617.csv"), 'SerieA_16_17'))
print(winRecords(pd.read_csv("serieA_season-1819.csv"), 'SerieA_18_19'))
print(winRecords(pd.read_csv("epl_season-1819.csv"), 'EPL_18_19'))

('SerieA_15_16', 'HomeWins: 175', 'Away Wins: 109', 'Draws: 95')
('SerieA_16_17', 'HomeWins: 183', 'Away Wins: 116', 'Draws: 80')
('SerieA_18_19', 'HomeWins: 165', 'Away Wins: 106', 'Draws: 108')
('EPL_18_19', 'HomeWins: 181', 'Away Wins: 127', 'Draws: 71')


# Let's start training/learning/evaluating/testing our Model

Beginning with Serie A 2018/19 dataset, we must first 'clean' the data. This is done by dropping the useless categories. In our case, this is the division and the date of the games. We don't care when the games were played (they're already ordered chronologically), but who played whom and when. We also want to drop the Full Time Home and Away Goals Scored and the Half Time Home & Away Goals and Result categories, as these will obviously influence our data.

In [4]:
dataSerieA1819 = pd.read_csv("serieA_season-1819.csv")

# Dropping the Div and Dates

dataSerieA1819 = dataSerieA1819.drop('Div', axis=1)
dataSerieA1819 = dataSerieA1819.drop('Date', axis=1)

# Dropping the Full Time Home Goals Scored, FTAG

dataSerieA1819 = dataSerieA1819.drop('FTAG', axis=1)
dataSerieA1819 = dataSerieA1819.drop('FTHG', axis=1)

# Dropping the Half Time Home Goals Scored, HTAG, and the Half time result

dataSerieA1819 = dataSerieA1819.drop('HTHG', axis=1)
dataSerieA1819 = dataSerieA1819.drop('HTAG', axis=1)
dataSerieA1819 = dataSerieA1819.drop('HTR', axis=1)

dataSerieA1819['FTR'].unique()

array(['A', 'H', 'D'], dtype=object)

We will take a look at the column of interest, which is FTR (Full Time Result), and change it to 0s and 1s for our system to be able to classify/predict these outcomes easily. 0 if the Home Team draws or loses, 1 if the Home Team wins.

In [5]:
# Change the string for Full Time Result (FTR), AKA the outcome to 0s and 1s. 1 if the Home Team wins, 0 if the Away Team won or if a Draw was achieved

def fix_outcome(outcome):
    if outcome == 'H':
        return 1
    else:
        return 0

dataSerieA1819['FTR'] = dataSerieA1819['FTR'].apply(fix_outcome)

Import the train_test_split function from the sklearn package.

In [6]:
from sklearn.model_selection import train_test_split

Now we can Train_Test_Split our data

In [7]:
# Train Test Split Data

x_data = dataSerieA1819.drop('FTR', axis=1)
y_labels = dataSerieA1819['FTR']
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size=0.3, random_state=101)

Import TensorFlow package

In [8]:
import tensorflow as tf

Create the feature columns. In this notebook, this is split up into a section for the categorical values, and a section for the continuous/numerical values. This is because they use different lines of code.

In [9]:
# Create tf.feature_columns for Categorical Values

HomeTeam = tf.feature_column.categorical_column_with_hash_bucket('HomeTeam', hash_bucket_size=1000)
AwayTeam = tf.feature_column.categorical_column_with_hash_bucket('AwayTeam', hash_bucket_size=1000)

Note that for these values, there is not an 'FTR' column.

In [10]:
# Create tf.feature_columns for Numerical/Continuous Values

HS   = tf.feature_column.numeric_column('HS')   # Home Shots
AS   = tf.feature_column.numeric_column('AS')   # Away Shots
HST  = tf.feature_column.numeric_column('HST')  # Home Shots on Target
AST  = tf.feature_column.numeric_column('AST')  # Away Shots on Target
HF  = tf.feature_column.numeric_column('HF')    # Home Fouls
AF  = tf.feature_column.numeric_column('AF')    # Away Fouls
HC  = tf.feature_column.numeric_column('HC')    # Home Corners
AC  = tf.feature_column.numeric_column('AC')    # Away Corners
HY  = tf.feature_column.numeric_column('HY')    # Home Yellow Cards
AY  = tf.feature_column.numeric_column('AY')    # Away Yellow Cards
HR  = tf.feature_column.numeric_column('HR')    # Home Red Cards
AR  = tf.feature_column.numeric_column('AR')    # Away Red Cards

Compile into one features column.

In [11]:
# Compile into one features column

feature_cols = [HomeTeam,AwayTeam,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR]

Using the tensor flow estimator method, let's build our input function. Note that the batch_size is 100 and the number of epochs is "None". The batch size is the number of samples that are processed before updating the model, while the epochs are the number of complete passes through the training dataset.

In [12]:
# Building an input function

input_fnc = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=100, num_epochs=None, shuffle=True)

Using TensorFlow Estimator, with Linear Classifier, let's create our model.

In [13]:
# Using Linear Classifier to create a model with tf.Estimator

model = tf.estimator.LinearClassifier(feature_columns=feature_cols)

model.train(input_fn=input_fnc, steps=5000)

W0917 20:28:28.041749 140735565427584 estimator.py:1811] Using temporary folder as model directory: /var/folders/tw/j24f0q85731142mf21hrd98c0000gn/T/tmpw9otpb5m
W0917 20:28:28.087664 140735565427584 deprecation.py:323] From //anaconda3/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0917 20:28:28.114566 140735565427584 deprecation.py:323] From //anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/inputs/queues/feeding_queue_runner.py:62: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
W0917 20:28:28.118185

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x1a481d0588>

Using model.predict and a new input function pred_fn, generate predictions. Use list() to put those predictions in a list.

In [None]:
# Now to make 'predictions'

pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size = len(X_test), shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))

What do the predictions look like? 

In [None]:
print(predictions[0])

That's interesting. Each prediction has a logit, logistic, probability, class_id, etc. associated with it. We care most about the class_ids as that will indicate whether the Home Team won the match or not. In this model, draws are considered losses. Iterate over the list of predictions, appending only the relevant category, class_ids, to a new list.

In [None]:
final_preds= []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

In [None]:
print(final_preds[:8])

Time to evaluate, using sklearn metrics, how accurate our model actually is! Import Classification_report from the sklearn metrics package.

In [None]:
from sklearn.metrics import classification_report

In [None]:
# Import Classification report from SkLearn-Metrics to evaluate the performance of the test model

print(classification_report(y_test, final_preds))

Cleaning up the code, I create separate functions for each process. cleanData drops the irrelevant columns and/or rows.

In [1]:
# Let's start with the training/learning/evaluating/testing Machine Learning Method! First, we need to clean the data.
# Note: This script was written assuming all the data files will have the same default columns. This is true for the datasets that I've downloaded from the internet.
# I removed (in some of these files) the betting odds, as I figure these may influence results in ways outside of the determinants of the game. 
# Odds are set by bookkeepers, based on their own analysis.


def cleanData(dataFileCSV):

    data = pd.read_csv(dataFileCSV)

    # Dropping the Div and Dates

    data = data.drop('Div', axis=1)
    data = data.drop('Date', axis=1)

    # Dropping the Full Time Home Goals Scored, FTAG

    data = data.drop('FTAG', axis=1)
    data = data.drop('FTHG', axis=1)

    # Dropping the Half Time Home Goals Scored, HTAG, and the Half time result

    data = data.drop('HTHG', axis=1)
    data = data.drop('HTAG', axis=1)
    data = data.drop('HTR', axis=1)

    data['FTR'].unique()

    return data

The _fix outcome_ function stays the same as above, so I won't include it here. _trainModel_ takes in clean data, creates and trains a model. This function returns a _trained model_.

In [2]:
def trainModel(data):
    data['FTR'] = data['FTR'].apply(fix_outcome)

    # Train Test Split Data

    x_data = data.drop('FTR', axis=1)
    y_labels = data['FTR']
    X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size=0.3, random_state=101)


    # Create tf.feature_columns for Categorical Values

    HomeTeam = tf.feature_column.categorical_column_with_hash_bucket('HomeTeam', hash_bucket_size=1000)
    AwayTeam = tf.feature_column.categorical_column_with_hash_bucket('AwayTeam', hash_bucket_size=1000)
    

    # Create tf.feature_columns for Numerical/Continuous Values

    HS   = tf.feature_column.numeric_column('HS')   # Home Shots
    AS   = tf.feature_column.numeric_column('AS')   # Away Shots
    HST  = tf.feature_column.numeric_column('HST')  # Home Shots on Target
    AST  = tf.feature_column.numeric_column('AST')  # Away Shots on Target
    HF  = tf.feature_column.numeric_column('HF')    # Home Fouls
    AF  = tf.feature_column.numeric_column('AF')    # Away Fouls
    HC  = tf.feature_column.numeric_column('HC')    # Home Corners
    AC  = tf.feature_column.numeric_column('AC')    # Away Corners
    HY  = tf.feature_column.numeric_column('HY')    # Home Yellow Cards
    AY  = tf.feature_column.numeric_column('AY')    # Away Yellow Cards
    HR  = tf.feature_column.numeric_column('HR')    # Home Red Cards
    AR  = tf.feature_column.numeric_column('AR')    # Away Red Cards


    # Compile into one features column
    
    feature_cols = [HomeTeam,AwayTeam,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR]

    # Building an input function

    input_fnc = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=100, num_epochs=None, shuffle=True)

    # Using Linear Classifier to create a model with tf.Estimator

    model = tf.estimator.LinearClassifier(feature_columns=feature_cols)

    model.train(input_fn=input_fnc, steps=5000)

    return model, X_test, y_test

Now, we define a function that makes predictions and evaluates/tests the model using sklearn-classification-report

In [3]:
# Now to make 'predictions'

def predict(data):

    model, X_test, y_test = trainModel(data)

    pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size = len(X_test), shuffle=False)
    predictions = list(model.predict(input_fn=pred_fn))

    print(predictions[0])

    final_preds= []
    for pred in predictions:
        final_preds.append(pred['class_ids'][0])

    print(final_preds[:10])

    # Import Classification report from SkLearn-Metrics to evaluate the performance of the test model

    print(classification_report(y_test, final_preds))
    return final_preds

Let's create a main() function that compiles all these other functions so that we only have to write in one function every time we want to test new data! This is fairly reminiscent of OOP principles, something I'd normally do when I'm using Java or C. I'm not sure yet whether it's better notation to just have one function and have all these sub-functions as sub-functions, or just place them as lines of code in one large main function, or to have them split up as such. For now, I'll keep them split up. At the very least, it helps break down the code and make it easier to comprehend.

In [5]:
# Compiling everything into a 'main' function - this is reminiscent of OOP principles, something I'd do in Java - but it's nice to keep things compact, even in Python

def main(data):
    return predict(cleanData(data))

# I suppose I could very easily just create one main function and wrap the other 'predict' and 'cleanData' methods into the main function, but I'm still undecided as to whether that is better documentation than splitting it up.

In [None]:
Using the same (serieA_season-1819.csv) data as above, let's see if we get the same results. Hint: if we return the right datatypes for each function, we should.