# Predicting the winning football team


![](https://cdn.sportsbettingdime.com/app/uploads/global-sports-betting-market-header.jpg)


Sports betting is a market worth more than 200 billion dollars (Source: https://www.sportsbettingdime.com/guides/finance/global-sports-betting-market/).


Can we guess which team is going to win based on the historical data we have?

### Steps for predicting who's going to win based on historical data

1. Collect the historical data.
2. Clean the collected data.
3. Find relevant features from that data.
4. Train different classifiers and find which is the best.
5. Use the best classifier to predict which team is going to win.

In the 2014 World Cup, Bing correctly predicted the outcomes of 15 of the 16 games in the knockout round.

They are never going to share their models with the world, so let's build our own.

### Dataset:

1. The English Premier League is the most popular football league in the world.
2. Its match data can be downloaded from http://football-data.co.uk/data.php

From this dataset, we will predict whether the home team wins, the away team wins, or the match ends in a draw.

The dataset has a number of columns whose acronyms are explained at  
https://rstudio-pubs-static.s3.amazonaws.com/179121_70eb412bbe6c4a55837f2439e5ae6d4e.html

In [1]:
# Importing the dependencies
import pandas as pd

import xgboost as xgb
# XGBoost implements gradient boosting: it builds a prediction model as an
# ensemble of weak prediction models, typically decision trees.

from sklearn.linear_model import LogisticRegression
# Logistic regression is used when the response variable is categorical.

from sklearn.ensemble import RandomForestClassifier
# A random forest is a meta estimator that fits a number of decision tree classifiers
# on various sub-samples of the dataset and uses averaging to improve the predictive
# accuracy and control over-fitting.

from sklearn.svm import SVC
# A support vector classifier: a discriminative classifier formally defined by a separating hyperplane.

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
import time



# Preprocessing

In [2]:
# Loading the dataset
df = pd.read_csv('football_data.csv', index_col=False)
df.head().T

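| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Div | E0 | E0 | E0 | E0 | E0 |
| Date | 10-08-2018 | 11-08-2018 | 11-08-2018 | 11-08-2018 | 11-08-2018 |
| HomeTeam | Man United | Bournemouth | Fulham | Huddersfield | Newcastle |
| AwayTeam | Leicester | Cardiff | Crystal Palace | Chelsea | Tottenham |
| FTHG | 2 | 2 | 0 | 0 | 1 |
| FTAG | 1 | 0 | 2 | 3 | 2 |
| FTR | H | H | A | A | A |
| HTHG | 1 | 1 | 0 | 0 | 1 |
| HTAG | 0 | 0 | 1 | 2 | 2 |
| HTR | H | H | A | A | A |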


In [3]:
df.shape

(380, 23)

In [4]:
# Checking each column for null values
df.isnull().sum()

Div         0
Date        0
HomeTeam    0
AwayTeam    0
FTHG        0
FTAG        0
FTR         0
HTHG        0
HTAG        0
HTR         0
Referee     0
HS          0
AS          0
HST         0
AST         0
HF          0
AF          0
HC          0
AC          0
HY          0
AY          0
HR          0
AR          0
dtype: int64

## Feature selection and Normalization

In [5]:
df.columns

Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR'],
      dtype='object')

In [6]:
# Dropping unnecessary columns
df.drop(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'Referee', 'HTR'],
        axis=1,
        inplace=True)

In [7]:
# Number of trainable features: total columns minus one, since FTR is the label we predict
print('Trainable features: ', df.shape[1] - 1)

Trainable features:  16
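
Two of these 16 features, FTHG and FTAG, are the full-time goal counts, and the label FTR follows directly from them. The sketch below assumes the usual reading of the dataset's acronyms (FTHG = full-time home goals, FTAG = full-time away goals, FTR = full-time result). `full_time_result` is a hypothetical helper, not part of the notebook; it reproduces the label for the first five matches printed by `df.head().T`.

```python
def full_time_result(home_goals, away_goals):
    """Return 'H', 'A' or 'D' from the full-time goal counts.

    Hypothetical helper: assumes FTR encodes home win / away win / draw.
    """
    if home_goals > away_goals:
        return 'H'
    if home_goals < away_goals:
        return 'A'
    return 'D'


# (FTHG, FTAG, FTR) for the first five rows printed by df.head().T
first_five = [(2, 1, 'H'), (2, 0, 'H'), (0, 2, 'A'), (0, 3, 'A'), (1, 2, 'A')]

derived = [full_time_result(hg, ag) for hg, ag, _ in first_five]
matches = [d == ftr for d, (_, _, ftr) in zip(derived, first_five)]
print(derived)
print(all(matches))
```

If that reading of the columns is right, a model given both goal columns can recover the result from the final score, so scores obtained with these features likely describe how well a classifier reconstructs a finished match rather than how well it forecasts an upcoming one.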


In [8]:
# Normalize the features with Min-Max scaling; the label (FTR) is dropped first
data = df.drop(['FTR'], axis=1)
data = (data - data.min()) / (data.max() - data.min())
data.head()

| | FTHG | FTAG | HTHG | HTAG | HS | AS | HST | AST | HF | AF | HC | AC | HY | AY | HR | AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 0.166667 | 0.25 | 0.0 | 0.222222 | 0.478261 | 0.428571 | 0.333333 | 0.478261 | 0.277778 | 0.125 | 0.357143 | 0.333333 | 0.2 | 0.0 | 0.0 |
| 1 | 0.333333 | 0.0 | 0.25 | 0.0 | 0.333333 | 0.347826 | 0.285714 | 0.083333 | 0.478261 | 0.333333 | 0.4375 | 0.285714 | 0.166667 | 0.2 | 0.0 | 0.0 |
| 2 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.416667 | 0.347826 | 0.428571 | 0.75 | 0.391304 | 0.444444 | 0.3125 | 0.357143 | 0.166667 | 0.4 | 0.0 | 0.0 |
| 3 | 0.0 | 0.5 | 0.0 | 0.666667 | 0.166667 | 0.478261 | 0.071429 | 0.333333 | 0.391304 | 0.277778 | 0.125 | 0.357143 | 0.333333 | 0.2 | 0.0 | 0.0 |
| 4 | 0.166667 | 0.333333 | 0.25 | 0.666667 | 0.416667 | 0.565217 | 0.142857 | 0.416667 | 0.478261 | 0.5 | 0.1875 | 0.357143 | 0.333333 | 0.4 | 0.0 | 0.0 |
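
The normalization cell scales every column with the minimum and maximum of the full dataset. A common variant computes those statistics on the training rows only and reuses them for the test rows, and also guards against columns whose minimum equals their maximum (the division would otherwise produce NaN). The following is a minimal sketch of that idea; `min_max_fit` and `min_max_apply` are hypothetical helper names and the small frames hold made-up values.

```python
import pandas as pd


def min_max_fit(train):
    """Return per-column minimum and range computed from the training frame."""
    col_min = train.min()
    col_range = train.max() - col_min
    # A constant column has range 0; use 1 so it scales to 0.0 instead of NaN
    col_range = col_range.replace(0, 1)
    return col_min, col_range


def min_max_apply(frame, col_min, col_range):
    """Scale a frame with statistics obtained from min_max_fit."""
    return (frame - col_min) / col_range


# Made-up values: HS varies, HR is constant in the training rows
train = pd.DataFrame({'HS': [8, 12, 16], 'HR': [0, 0, 0]})
test = pd.DataFrame({'HS': [10, 20], 'HR': [0, 1]})

col_min, col_range = min_max_fit(train)
train_scaled = min_max_apply(train, col_min, col_range)
test_scaled = min_max_apply(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

With training-only statistics, test values can fall outside the [0, 1] interval (here the second test row has HS scaled to 1.5), which is expected.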


In [9]:
# Now we will join our label back again to the new dataframe and use this dataframe as our dataset for building the model
data = data.join(other=df['FTR'], how='left')
data.head()

| | FTHG | FTAG | HTHG | HTAG | HS | AS | HST | AST | HF | AF | HC | AC | HY | AY | HR | AR | FTR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 0.166667 | 0.25 | 0.0 | 0.222222 | 0.478261 | 0.428571 | 0.333333 | 0.478261 | 0.277778 | 0.125 | 0.357143 | 0.333333 | 0.2 | 0.0 | 0.0 | H |
| 1 | 0.333333 | 0.0 | 0.25 | 0.0 | 0.333333 | 0.347826 | 0.285714 | 0.083333 | 0.478261 | 0.333333 | 0.4375 | 0.285714 | 0.166667 | 0.2 | 0.0 | 0.0 | H |
| 2 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.416667 | 0.347826 | 0.428571 | 0.75 | 0.391304 | 0.444444 | 0.3125 | 0.357143 | 0.166667 | 0.4 | 0.0 | 0.0 | A |
| 3 | 0.0 | 0.5 | 0.0 | 0.666667 | 0.166667 | 0.478261 | 0.071429 | 0.333333 | 0.391304 | 0.277778 | 0.125 | 0.357143 | 0.333333 | 0.2 | 0.0 | 0.0 | A |
| 4 | 0.166667 | 0.333333 | 0.25 | 0.666667 | 0.416667 | 0.565217 | 0.142857 | 0.416667 | 0.478261 | 0.5 | 0.1875 | 0.357143 | 0.333333 | 0.4 | 0.0 | 0.0 | A |


# Building the model 

In [10]:
X = data.drop(['FTR'], axis=1)
Y = data['FTR']
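
`Y` holds the string labels 'H', 'D' and 'A'. scikit-learn classifiers accept string labels, but recent xgboost releases expect integer class labels starting at 0, so depending on the installed version the labels may need to be encoded first. A minimal sketch with scikit-learn's `LabelEncoder`, using a made-up label series in place of `data['FTR']`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for data['FTR']
labels = pd.Series(['H', 'H', 'A', 'D', 'A'], name='FTR')

encoder = LabelEncoder()
# Classes are sorted alphabetically: A -> 0, D -> 1, H -> 2
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform maps integer predictions back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```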

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.5, random_state=9)
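
With only 380 matches and a 50/50 split, the share of home wins, draws and away wins can differ between the two halves. `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. A sketch on made-up labels (the 60/20/20 class mix is an assumption for illustration, not a property of the dataset):

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with a 60/20/20 class mix
y = np.array(['H'] * 60 + ['D'] * 20 + ['A'] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=9, stratify=y)

# Count the classes in each half; both keep the 60/20/20 proportions
train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(train_counts)
print(test_counts)
```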

In [12]:
def train_predict(model, X_train, X_test, y_train, y_test):
    """
    Fit the model, print the timings plus the weighted F1 score and accuracy
    for the training and testing sets, and return the fitted model.
    """
    print('Training a {} on training size of {}...\n'.format(
        model.__class__.__name__, len(X_train)))
    s = time.time()
    model.fit(X_train, y_train)
    print("Trained model in {:.2f} seconds".format(time.time() - s))

    # Time the prediction on the test set and keep the result for scoring
    s = time.time()
    test_pred = model.predict(X_test)
    print("Model predicted in {:.2f} seconds".format(time.time() - s))

    # Print the results of prediction for both training and testing
    train_pred = model.predict(X_train)
    f1 = f1_score(y_train, train_pred, average='weighted')
    acc = accuracy_score(y_train, train_pred)
    print("F1 score and accuracy score for training set: {:.2f} , {:.2f}.".format(
        f1, acc))

    f1 = f1_score(y_test, test_pred, average='weighted')
    acc = accuracy_score(y_test, test_pred)
    print(
        "F1 score and accuracy score for testing set: {:.2f} , {:.2f}.".format(f1, acc))
    print()
    return model
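
`train_predict` reports the F1 score with `average='weighted'`, that is, the per-class F1 scores averaged with each class weighted by its number of true samples. The pure-Python sketch below spells out that computation on made-up labels; `weighted_f1` is a hypothetical helper written for illustration and is meant to mirror, not replace, `f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its number of true samples
        total += f1 * n_true
    return total / len(y_true)


# Made-up labels and predictions
y_true = ['H', 'H', 'H', 'D', 'A', 'A']
y_pred = ['H', 'H', 'D', 'D', 'A', 'H']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(weighted_f1(y_true, y_pred), 4), round(accuracy, 4))
```

In this toy case every class has an F1 of 2/3, so the weighted F1 and the accuracy coincide; in general the two numbers differ.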

In [13]:
# Here we will consider 4 classification algorithms for building the model 
model_1 = LogisticRegression(random_state=9)
model_2 = SVC(kernel='rbf', random_state=81)
model_3 = RandomForestClassifier(random_state=72)
model_4 = xgb.XGBClassifier(random_state=90)

# Prediction

In [14]:
model_1 = train_predict(model_1, X_train, X_test, y_train, y_test)
model_2 = train_predict(model_2, X_train, X_test, y_train, y_test)
model_3 = train_predict(model_3, X_train, X_test, y_train, y_test)
model_4 = train_predict(model_4, X_train, X_test, y_train, y_test)

Training a LogisticRegression on training size of 190...

Trained model in 0.01 seconds
Model predicted in 0.00 seconds
F1 score and accuracy score for training set: 0.80 , 0.84.
F1 score and accuracy score for testing set: 0.79 , 0.83.

Training a SVC on training size of 190...

Trained model in 0.00 seconds
Model predicted in 0.00 seconds
F1 score and accuracy score for training set: 0.69 , 0.78.
F1 score and accuracy score for testing set: 0.72 , 0.79.

Training a RandomForestClassifier on training size of 190...

Trained model in 0.02 seconds
Model predicted in 0.00 seconds
F1 score and accuracy score for training set: 1.00 , 1.00.
F1 score and accuracy score for testing set: 0.84 , 0.83.

Training a XGBClassifier on training size of 190...

Trained model in 0.11 seconds
Model predicted in 0.01 seconds
F1 score and accuracy score for training set: 1.00 , 1.00.
F1 score and accuracy score for testing set: 0.98 , 0.98.
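
To put scores like these in context, it helps to compare them against a baseline that always predicts the most frequent class. A sketch using scikit-learn's `DummyClassifier` on made-up labels (the class mix below is an assumption for illustration; on the real data one would pass `X_train`, `y_train`, `X_test` and `y_test` instead):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: home wins are the most frequent class in both parts
y_train_toy = np.array(['H'] * 5 + ['D'] * 2 + ['A'] * 3)
y_test_toy = np.array(['H'] * 4 + ['D'] * 3 + ['A'] * 3)
# The baseline ignores the features, so a column of zeros is enough here
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_acc = accuracy_score(y_test_toy, baseline_pred)
print(baseline_pred.tolist())
print(baseline_acc)
```

A classifier is only informative to the extent that it beats this number.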



