# ML Final Project (give it a fancy name!)

## Goals of the Project

This project attempts to quantify how predictable Bundesliga football games are, based on the information contained in game statistics from season 2006/07 onwards. In particular,

* we use a deep neural network to examine the correlations between different features derived from the statistics of individual games.
* we ask the question: "to what accuracy can one determine the outcome and score of any Bundesliga game, given the statistics of games from a number of previous seasons?".
* we compare the predictions of our model with the results of the Bundesliga season 2016/17 up to the present matchday. We also predict the final Bundesliga table for season 2016/17.

Studying the predictive power of football statistics for the outcome of games is an interesting and challenging problem, relevant both to sports and to developing an intuition about the inner workings of machine learning techniques. The problem is made harder by genuine outlier events in the data, for example a team that performs better in a given game but nevertheless loses to its opponent through bad luck. Such events must be accounted for by any realistic model, yet the neural network can easily treat them as noise or bias, which increases both the in-sample and out-of-sample errors.

## Data and Features

We use a manually compiled dataset of football games, collected from data provided at ... and kicker.de. We then modified the data by introducing new features as combinations of the existing ones, while dropping all irrelevant ones.

In [12]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils

import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

from matplotlib import pyplot as plt
%matplotlib inline

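# load data
feats_import = pd.read_csv('All_Data_2006_2016.csv')
# drop the unnamed column (typically a saved index) if the CSV contains one
try:
    feats_import = feats_import.drop(['Unnamed: 0'], axis=1)
    print("Reshape successful")
except (KeyError, ValueError):
    # column not present; the exception type depends on the pandas version
    print("Successful import")

# Each data point is represented as a vector containing the following features:
print(feats_import.columns.values)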

Reshape successful
['Season' 'Gameday' 'TID_H' 'TID_A' 'FTHG' 'HTHG' 'HS' 'AS' 'HST' 'AST'
 'HF' 'AF' 'HC' 'AC' 'HY' 'AY' 'HR' 'AR' 'HGA' 'AGA' 'FTGD' 'HTGD' 'Odds'
 'HP3' 'AP3']


|  feature name | meaning |  feature name | meaning |  feature name | meaning |
|---|---|---|---|---|---|
|Season|season number|TID_H|home team ID|TID_A|away team ID|
|Gameday|match day number|HS|home team shots|AS|away team shots|
|HTGD|half-time goal difference|HST|home team shots on target|AST|away team shots on target|
|HTHG|half-time goals scored by home team|HF|home team fouls committed|AF|away team fouls committed|
|FTGD|full-time goal difference|HC|home team corner kicks|AC|away team corner kicks|
|Odds|-|HY|home team yellow cards|AY|away team yellow cards|
|HP3|-|HR|home team red cards|AR|away team red cards|
|AP3|-| | | | |


Finally, each feature was normalised to unity with respect to all games in the data set.
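
The text does not say which normalisation is meant. As a minimal sketch, assuming "normalised to unity" means dividing every feature by its largest absolute value over all games (so that all values lie in [-1, 1]), it could look as follows. The function name `normalise_to_unity` and the toy values are hypothetical, not part of the project code.

```python
import pandas as pd

def normalise_to_unity(df):
    """Scale every column by its largest absolute value over all rows.

    Assumption: this is one possible reading of "normalised to unity";
    the notebook does not show the normalisation actually used.
    """
    scale = df.abs().max()          # one scale factor per column
    scale = scale.replace(0, 1)     # leave all-zero columns unchanged
    return df / scale               # divides column-wise (aligned on column names)

# toy stand-in values, not Bundesliga data
toy = pd.DataFrame({"HS": [10, 20, 5], "AS": [8, 4, 16], "FTGD": [-2, 1, 4]})
normed = normalise_to_unity(toy)
print(normed)
```

Applied to the real data, the call would presumably be `normalise_to_unity(feats_import)`, with the scale factors computed on the training games only if information from the test season should not leak into training.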

To give a better understanding of the data used to train the deep neural net, below we show histograms of selected features that experts often use to intuitively estimate and argue for the outcome of a particular fixture:

1. Shots (again, two hists: H and A)

2. Shots on target (let's plot the two hists (H and A) on top of each other with some transparency)

3. Attendance

4. 5, 6....
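
Overlaying the home and away histograms with some transparency, as suggested for the shots on target, could be sketched as below. The helper `plot_home_away_hist` is hypothetical, and the demo uses synthetic Poisson-distributed counts as stand-ins; with the real data one would pass columns such as `feats_import['HST']` and `feats_import['AST']`.

```python
import numpy as np
from matplotlib import pyplot as plt

def plot_home_away_hist(home, away, label, ax=None):
    """Overlay two histograms of integer counts on shared, integer-centred bins."""
    home, away = np.asarray(home), np.asarray(away)
    if ax is None:
        _, ax = plt.subplots()
    lo = min(home.min(), away.min())
    hi = max(home.max(), away.max())
    bins = np.arange(lo, hi + 2) - 0.5      # one bin per integer value
    ax.hist(home, bins=bins, alpha=0.5, label="home")
    ax.hist(away, bins=bins, alpha=0.5, label="away")
    ax.set_xlabel(label)
    ax.set_ylabel("number of games")
    ax.legend()
    return ax

# synthetic stand-in data, NOT taken from the Bundesliga data set
rng = np.random.default_rng(0)
ax = plot_home_away_hist(rng.poisson(5, 300), rng.poisson(4, 300), "shots on target")
```

Sharing the bin edges between the two histograms keeps the bars aligned, so the overlap region is directly comparable.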

### Training and Test Datasets

Using the data points discussed above, we now put together a training and a test data set as follows: 
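
The construction itself is not given in the text. The following is a minimal sketch under two assumptions: the most recent season is held out as the test set, and the target is the full-time goal difference `FTGD`, clipped to `[-cutoff_GD, cutoff_GD]` and one-hot encoded into `2*cutoff_GD+1` classes (which matches the size of the output layer in the model code). The helper names, the season encoding and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_season(df, test_season):
    """Hold out one season for testing; train on all earlier seasons."""
    train = df[df["Season"] < test_season]
    test = df[df["Season"] == test_season]
    return train, test

def goal_difference_to_onehot(gd, cutoff_GD):
    """Clip goal differences to [-cutoff_GD, cutoff_GD] and one-hot encode them."""
    # shift so that class 0 corresponds to a goal difference of -cutoff_GD
    classes = np.clip(np.asarray(gd), -cutoff_GD, cutoff_GD) + cutoff_GD
    return np.eye(2 * cutoff_GD + 1)[classes]

# toy stand-in values; the real 'Season' encoding may differ
toy = pd.DataFrame({"Season": [2014, 2015, 2016, 2016], "FTGD": [-5, 0, 2, 1]})
train, test = split_by_season(toy, test_season=2016)
y_train = goal_difference_to_onehot(train["FTGD"], cutoff_GD=3)
print(y_train)
```

Splitting by season rather than at random keeps the evaluation closer to the stated goal of predicting a season from previous ones.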

## Model: Deep Neural Network

Our model is a deep ReLU neural network with the following architecture:

|layer|number of neurons|
|---|---|
|input|`len(X[0])` (number of input features)|
|hidden 1 (ReLU)|50|
|hidden 2 (ReLU)|500|
|hidden 3 (ReLU)|500|
|hidden 4 (ReLU)|50|
|output (softmax)|`2*cutoff_GD+1`|

We train with minibatches using the 'adam' optimiser, and minimise the categorical cross-entropy loss. Additionally, we apply 'dropout' regularisation after hidden layer 4.
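
To make the loss concrete, here is a small NumPy sketch of a softmax output followed by the categorical cross-entropy averaged over a minibatch. It illustrates the quantity being minimised and is not the Keras implementation; the function names and the example numbers are chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean over the minibatch of -sum_k y_true[k] * log(y_pred[k])."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1)))

# two samples, three classes; targets are one-hot encoded
logits = np.array([[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs, loss)
```

For a one-hot target the loss reduces to the negative log-probability assigned to the correct class, so a uniform prediction over K classes costs log(K).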

In [None]:
### set up deep neural net with Keras
# X, dropout_p and cutoff_GD must be defined before this cell is run
# initiate model
model = Sequential()
# hidden layer 1 (input_dim sets the size of the input layer)
model.add(Dense(50, input_dim=len(X[0]), init='lecun_uniform', activation='relu'))
# hidden layer 2
model.add(Dense(500, activation='relu'))
# hidden layer 3
model.add(Dense(500, activation='relu'))
# hidden layer 4
model.add(Dense(50, activation='relu'))
# dropout regularisation after hidden layer 4
model.add(Dropout(dropout_p))
# output layer: linear units followed by softmax (no ReLU before the softmax)
model.add(Dense(2*cutoff_GD+1))
model.add(Activation('softmax'))
### compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
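
As a rough size check for this architecture, the number of trainable parameters of a stack of dense layers can be counted directly from the layer widths. The input width and `cutoff_GD` used below are hypothetical placeholders, since neither value is fixed in the text.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_features = 20   # hypothetical number of input features
cutoff_GD = 3     # hypothetical cutoff, giving 2*3+1 = 7 output classes
sizes = [n_features, 50, 500, 500, 50, 2 * cutoff_GD + 1]
n_params = count_dense_parameters(sizes)
print(n_params)
```

Most of the parameters sit in the 500-by-500 connection between hidden layers 2 and 3, which is worth keeping in mind when comparing the model size with the number of games available for training.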

## Results

show some plots here