# Building a simple machine learning program on Serie A 2018/19 Season Stats - Using TensorFlow

Let us begin by importing the necessary packages.

In [1]:
import numpy as np
import pandas as pd

Defining a general function that will take a dataset of football/soccer results and output the number of wins, draws, and losses. Note that the 'data' should be in Panda DataFrame format, with FTR being a category/heading of a column. In our datasets, this is a given.

In [2]:
# General function that isolates the full time result to see record of wins/losses/draws

def winRecords(data, leagueName):
    
    ftr = data['FTR']

    homeWins=0
    awayWins=0
    draws=0
    
    for i in range(0, len(ftr.to_numpy())-1):
    
        if (ftr.to_numpy()[i] == 'H'):
            homeWins += 1
            #print(j)
        elif (ftr.to_numpy()[i] == 'A'):
            awayWins += 1
        else:
            draws +=1

    return leagueName, 'HomeWins: %s' % homeWins, 'Away Wins: %s' % awayWins, 'Draws: %s' % draws

Now, let's read the files and print out the stats for Serie A leagues in several seasons (and the EPL season).

In [3]:
print(winRecords(pd.read_csv("serieA_season-1516.csv"), 'SerieA_15_16'))
print(winRecords(pd.read_csv("serieA_season-1617.csv"), 'SerieA_16_17'))
print(winRecords(pd.read_csv("serieA_season-1819.csv"), 'SerieA_18_19'))
print(winRecords(pd.read_csv("epl_season-1819.csv"), 'EPL_18_19'))

('SerieA_15_16', 'HomeWins: 175', 'Away Wins: 109', 'Draws: 95')
('SerieA_16_17', 'HomeWins: 183', 'Away Wins: 116', 'Draws: 80')
('SerieA_18_19', 'HomeWins: 165', 'Away Wins: 106', 'Draws: 108')
('EPL_18_19', 'HomeWins: 181', 'Away Wins: 127', 'Draws: 71')


# Let's start training/learning/evaluating/testing our Model

Beginning with Serie A 2018/19 dataset, we must first 'clean' the data. This is done by dropping the useless categories. In our case, this is the division and the date of the games. We don't care when the games were played (they're already ordered chronologically), but who played whom and when.

In [4]:
dataSerieA1819 = pd.read_csv("serieA_season-1819Dd.csv")
dataSerieA1819 = dataSerieA1819.drop('Div', axis=1)
dataSerieA1819 = dataSerieA1819.drop('Date', axis=1)

We will take a look at the column of interest, which is FTR (Full Time Result), and change it to 0s and 1s for our system to be able to classify/predict these outcomes easily. 0 if the Home Team draws or loses, 1 if the Home Team wins.

In [5]:
# Change the string for Full Time Result (FTR), AKA the outcome to 0s and 1s. 1 if the Home Team wins, 0 if the Away Team won or if a Draw was achieved

dataSerieA1819['FTR'].unique()

def fix_outcome(outcome):
    if outcome == 'H':
        return 1
    else:
        return 0

dataSerieA1819['FTR'] = dataSerieA1819['FTR'].apply(fix_outcome)

Import the train_test_split function from the sklearn package.

In [6]:
from sklearn.model_selection import train_test_split

Now we can Train_Test_Split our data

In [7]:
# Train Test Split Data

x_data = dataSerieA1819.drop('FTR', axis=1)
y_labels = dataSerieA1819['FTR']
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels, test_size=0.3, random_state=101)