# Barclays Premier League Fantasy Football Data Analysis

## Introduction

I've been really into BPL Fantasy Football this season, and have been building my fantasy team since the season started last year.

I'm not very good at picking the right squad based on my instincts, so my basic goal is to use the data from the present and past weeks to determine the best picks for the upcoming week.

## How Fantasy Football Works

I play Fantasy Football off the <a href='http://fantasy.premierleague.com'> official BPL Fantasy site</a>. For the uninitiated, Fantasy Football is an online game where users pick a squad of 15 real-life football players from a league. In this case, I'm playing the BPL, which is the English top division. Points are scored and collated depending on a player's actions in the actual game. The aim of the game is to amass the highest number of points each game week. The rules are <a href='http://fantasy.premierleague.com/rules/'> here</a>. 

In general, players earn points for scoring goals, notching assists, keeping clean sheets and saving penalties. They lose points for yellow cards, red cards, missed penalties and goals conceded.

Friends tend to form mini-competitions to see who's better at picking the best squad week after week. At the end of the season, whoever has the highest points total in their league gets the honour of having the best fantasy football instincts (although there's usually money involved in mini-leagues amongst friends). There's a budget allocated to each user, so you can't always choose the best player because he might be too expensive. Players get transferred in and out of teams weekly, so there's an entire transfer market as well.

## My Objective

The aim of my analysis is to predict a team of 15 players who will score the highest fantasy points in the upcoming gameweek, within a budget of £100m.

## Fantasy Football Data

Data is available from an API from the official site on players' basic information, team information, fixture information and performance stats. Each player's performance is aggregated every gameweek to give an overall view of the players' performance. There are nested data sets on each player's weekly performance history, as well as performance in past seasons.

The dataset contains a database of 550 players who are distributed amongst the 20 teams in the BPL. There are a total of 63 data attributes for any single player. Each player has a unique player id, and data is updated once every <i>gameweek</i>, that is, after a full round of fixtures played amongst the 20 teams. Fixtures sometimes get postponed from one gameweek to another, so I'm only focusing on gameweeks with the full 10 games being played.

In essence, the gameweek view reflects the player's (a) cumulative performance up to that gameweek and (b) the player's performance in that particular gameweek. 

Above that, there are some <b>nested</b> attributes that give us a whole lot more information.

The dataset contains the "fixture history" of a player, i.e. the performance of each and every player in each and every game he has played so far. This is a <i>t</i> by 20 dataframe, where *t* is the gameweek number. I will be tapping into the fixture history of each individual player to construct features about their form.

Given the wide number of features available, I'm not going to describe every single one of them. Nonetheless, here's a brief overview of the more important variables that are available in the data set:

- <b>Personal Attributes</b>
    - Player ID - Unique Identifier
    - First Name, Last name
    - Position, ie. Goalkeeper, Midfielder
- <b>Team Attributes</b>
    - Team name, Team ID
- <b>Real Life Performance Attributes</b> (measured up to current gameweek)
    - Total Goals scored
    - Assists
    - Goals Conceded
    - Total minutes played
- <b>Fantasy Performance Attributes</b> 
    - Total points scored
    - Points per game
    - Total Bonus points
    - Number of points scored in the most recent gameweek
    - Number of Bonus points scored in the most recent gameweek

[<i>These characterise the fantasy points, measured up to current gameweek, that each player has scored according to the fantasy system, and not in "real life". Bonus points are tallied in a separate system, where more "minor actions" in a game, such as a shot on target, a tackle or a clearance can be tallied. These do not directly contribute towards fantasy points.</i>]
 
- <b> Transfer Market Attributes</b>
    - Total Transfers In
    - Total Transfers Out
- <b> Injury Attributes </b>
    - % Chance of playing in this round
    - % Chance of playing in the next round
    
    

### Model Specification & Feature Selection

I conducted some preliminary modelling with a sample data set from Gameweek 26. The main purpose of my analysis was to uncover a base model specification and select a set of features for my model. 

I came to the following main conclusions in my preliminary analysis.

1. **Form matters more than cumulative performance**
    - Based on a regression that I ran on Gameweek 26 data, I inferred that features on cumulative performance are not as significant as "form" variables. These measure the streakiness of a player's performance.
    - However, defining a form variable is tricky and highly arbitrary. I'm going to have to restrict myself to a particular specification of "form" for a start.
2. **Individual performance matters more than team performance**
    - I found that team performance variables, like team form or whether a team is playing home or away, are not as significant as individual performance variables.

I will specify the model generally as a classification problem. My outcome variable $Y_t$ is a binary variable which is 1 if a player scores more than *r* points in the *t-th* gameweek, and 0 otherwise.
$$ Y_t =
\begin{cases}
1 & \text{fantasypoints}_t > r\\
0 & \text{fantasypoints}_t \le r
\end{cases} $$
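
As a small illustration of this outcome variable, here is a sketch that labels a handful of made-up point totals. The function name `points_scorer` and the sample values are only for the example; the threshold follows the strict "more than *r*" wording, with *r* = 5 as used later in the notebook.

```python
def points_scorer(fantasy_points, r=5):
    """Return 1 when a player scored more than r points in the gameweek, else 0."""
    return int(fantasy_points > r)


# Made-up gameweek scores for four players
sample_points = [20, 5, 6, 0]
labels = [points_scorer(p) for p in sample_points]
print(labels)
```

A score exactly equal to *r* is labelled 0 under this convention.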

I will only consider features of lag 1 for a start. This means that I am assuming that a player's performance in gameweek *t-1* is relevant to his performance in gameweek *t*. As a basic model, the logistic regression is specified as follows:

$$ \log\frac{P(Y_{t,i}=1 \mid \mathbf{X}_{t-1,i})}{1-P(Y_{t,i}=1 \mid \mathbf{X}_{t-1,i})} = \beta^T \mathbf{X}_{t-1,i} $$

for players $i = 1,\dots,550$. The randomness enters through the binary outcome $Y_{t,i}$ itself, so the log-odds equation carries no additive error term.
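
To make the log-odds form concrete, here is a minimal sketch that turns a linear predictor into a probability. The function `predict_proba` and the coefficient values are hypothetical and chosen only for illustration; they are not fitted values from this notebook.

```python
import math


def predict_proba(features, coefficients, intercept=0.0):
    """P(Y=1 | x) under a logistic model: the sigmoid of the linear predictor."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficient: log-odds of ln(3) correspond to odds of 3 to 1
p = predict_proba([1.0], [math.log(3)])
print(round(p, 2))
```

Inverting the relationship, `math.log(p / (1 - p))` recovers the linear predictor.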



### Approach

The general approach to predicting my set of 15 players for the next gameweek would thus be as follows:

1. Use the model to predict a set of top-performing players for the *t+1-th* gameweek, defined as players who satisfy $Y_{t+1} = 1$, where *r* is chosen arbitrarily.
2. Obtain *m* players. If *m* < 15, decrease *r* and repeat step 1.
3. Narrow down to 15 players using linear optimisation, with the objective of filling 15 positions subject to the constraints that there have to be 2 goalkeepers, 5 defenders, 5 midfielders and 3 forwards, and that all must be fit to play in the *t+1-th* gameweek.
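
Step 3 can be sketched on a toy pool. This is only an illustration under stated assumptions: it uses exhaustive search as a stand-in for a linear-programming solver, the dictionary keys (`name`, `position`, `cost`, `fit`, `score`) and the function `pick_squad` are made up for the example, and a budget cap is included because my objective sets one. Costs are in units of £100,000, so the £100m budget would be 1000.

```python
from itertools import combinations, product

# Squad quotas from step 3; the toy example below passes smaller ones.
FULL_QUOTAS = {"Goalkeeper": 2, "Defender": 5, "Midfielder": 5, "Forward": 3}


def pick_squad(players, quotas, budget):
    """Return the highest-scoring squad meeting the quotas and budget, or None.

    Exhaustive search: only practical for small candidate pools.
    """
    fit = [p for p in players if p["fit"]]
    per_position = []
    for position, count in quotas.items():
        pool = [p for p in fit if p["position"] == position]
        per_position.append(list(combinations(pool, count)))

    best, best_score = None, float("-inf")
    for choice in product(*per_position):
        squad = [p for group in choice for p in group]
        cost = sum(p["cost"] for p in squad)
        score = sum(p["score"] for p in squad)
        if cost <= budget and score > best_score:
            best, best_score = squad, score
    return best


def player(name, position, cost, score, fit=True):
    return {"name": name, "position": position, "cost": cost,
            "score": score, "fit": fit}


# Made-up candidates; F is the top scorer but is not fit to play.
pool = [
    player("A", "Goalkeeper", 50, 4), player("B", "Goalkeeper", 45, 3),
    player("C", "Defender", 60, 6), player("D", "Defender", 50, 5),
    player("E", "Defender", 40, 2),
    player("F", "Forward", 110, 9, fit=False),
    player("G", "Forward", 90, 8), player("H", "Forward", 55, 5),
]
toy_quotas = {"Goalkeeper": 1, "Defender": 2, "Forward": 1}
best = pick_squad(pool, toy_quotas, budget=230)
print(sorted(p["name"] for p in best))
```

For the full 550-player pool the number of combinations is far too large for this approach, which is why a proper integer or linear programme is the intended tool.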

These are the main features in my dataset:

Attributes|Type|Description
-|-|-
id|int|Unique Player ID
first_name|str| Player's First Name
second_name|str| Player's Second Name
team_name|str| Team which player belongs to
team_id|int| Unique ID number of the team that the player belongs to
type_name|str| Categorical. Whether the player is a Goalkeeper, Defender, Midfielder or Forward
now_cost|int| Cost of player in £100,000 in the *t-th* gameweek
cost_t1|int| Cost of player in £100,000 in the *t-1th* gameweek
chance_of_playing_next_round|float| Percentage likelihood that the player will play in the *t+1th* gameweek
chance_of_playing_this_round|float| Percentage likelihood that the player will play in the *t-th* gameweek
minutes|int| Total number of minutes played up to the *t-th* gameweek
form|float64| A player's form, as calculated by BPL over the last 30 days in the latest gameweek. 
bonus_form_t1|int| The sum of a player's bonus points for the *t-6th* to the *t-1th* gameweek
fantasy_form_t1|int| The sum of a player's fantasy points for the *t-6th* to the *t-1th* gameweek
own_team_form|int| Total number of points that the player's team has garnered from the *t-6th* to the *t-1th* gameweek
home|boolean| Whether the *t-th* game will be played at the player's team home stadium
next_team_t|str| The name of the team that the player is up against in the *t-th* week
next_team_pos|int| The league position of the team that the player is up against in the *t-th* week

I'll be focusing on the bottom 6 features for my model. 

In [51]:
required_fields=['id','first_name','second_name','team_name','team_id','type_name','now_cost','chance_of_playing_next_round','chance_of_playing_this_round','minutes','form']
print (len(required_fields))
full_fields = ['id','first_name','second_name','team_name','team_id','type_name','now_cost','chance_of_playing_next_round','chance_of_playing_this_round','minutes','form','bonus_form_t1','fantasy_form_t1','cost_t1','own_team_form','home','next_team_t','next_team_pos','fantasy_points_t']
print (len(full_fields))

11
19


In [27]:
import re 
import requests as rq
import csv 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
import numpy as np

In [None]:
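## Build a lookup of each club's league positions, keyed by three-letter club code
TeamPositions = {}
TeamIndex = [1,3,4,6,7,8,11,13,14,20,21,31,35,43,45,56,57,80,91,110]  # club IDs used in the URL below
for i in TeamIndex:
    resp = rq.get('http://www.premierleague.com/ajax/league-table/date-timeline/expanded/2015-2016/28-02-2016/%d/CLUB.json' %(i))
    D = resp.json()
    # League position at each point of the timeline
    TeamPos = [perf['Pos'] for perf in D['clubDetails']['performance']]
    # Club code, which matches the first three letters of the fixture strings used later
    TeamName = D['clubDetails']['playersToWatch']['byRatingType']['EA_SPORTS_PLAYER_PERFORMANCE_INDEX'][0]['club']['clubCode']
    TeamPositions[TeamName] = TeamPos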

In [78]:
## This is the code I used to extract and compile 20 weeks worth of gameweek data in the csv files that I have pushed to my project folder in the git

for gw in range(6, 26):
    data=[]
    for i in range(1,551):
        r = rq.get('http://fantasy.premierleague.com/web/api/elements/%d' %(i)) ## Pulls data from the api ##
        d = r.json()
        p_data = [d[field] for field in required_fields] ## These "required fields" are the player id, name, team name etc. ##
        p_hist = d['fixture_history']['all'] ## This contains the fixture history of the i-th player amongst 550 players.##
        bonus_form_t1 = sum([p_hist[j][16] for j in range((gw-6),(gw-1))])
        fantasy_form_t1 = sum([p_hist[j][19] for j in range((gw-6),(gw-1))]) ## form variables are the sum of points from gw-1th to the gw-6th
        cost_t1 = p_hist[gw-1][18]
        next_team_t = p_hist[gw][2].split()[0][0:3]
        next_team_pos = TeamPositions[next_team_t][gw]
        fantasy_points_t = p_hist[gw][19]
        t_form=[] # use regex to derive the "team form" feature#
        for j in range((gw-6),(gw-1)):
            g = re.match("(.*)(...) ([0-9]+)?-?([0-9]+)",p_hist[j][2])
            first, second = int(g.groups()[2]), int(g.groups()[3]) ## compare scores as numbers, not strings
            if first == second:
                result = 1
            elif first > second:
                result = 3
            else:
                result = 0
            t_form.append(result) ## inside the loop, so every match in the window counts
        own_team_form = sum(t_form)
        home = re.match("(.*)(...) ([0-9]+)?-?([0-9]+)",p_hist[gw][2]).groups()[1] == '(H)'
        p_data.extend([bonus_form_t1,fantasy_form_t1,cost_t1,own_team_form,home,next_team_t,next_team_pos,fantasy_points_t])
        data.append(p_data)
    df = pd.DataFrame(data=data,columns=full_fields) # convert to pandas dataframe
    df.to_csv('Fantasy Gameweek %d.csv' %(gw), encoding = 'utf-8') # Writes the csv
    
## I'd really appreciate some feedback on how I can run this more efficiently! Using list comprehension creates a lot of loops, and it takes me a few hours to run these.
## I'm in a little trouble if I need to go back and do feature transformations, or extract new features from the underlying data set.

IndexError: list index out of range
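
One possible speed-up, sketched under assumptions: the loop above requests every player's record again for each gameweek (550 requests times 20 gameweeks), although each response already carries the whole fixture history. Fetching each player once and reusing the cached response should cut the requests to 550. The helper names `fetch_all` and `window_sum` are hypothetical, and `fake_fetch` stands in for the real API call so that the sketch runs offline.

```python
def fetch_all(player_ids, fetch):
    """Call fetch(player_id) once per player and keep every response in memory."""
    return {pid: fetch(pid) for pid in player_ids}


def window_sum(history, gw, column, window=5):
    """Sum one column over history rows gw-1-window .. gw-2, like range(gw-6, gw-1)."""
    return sum(row[column] for row in history[gw - 1 - window:gw - 1])


# Toy stand-in for the API: 10 fixtures per player, points in column 0.
calls = []


def fake_fetch(pid):
    calls.append(pid)
    return {"fixture_history": {"all": [[pid * n] for n in range(10)]}}


cache = fetch_all([1, 2], fake_fetch)

# Build the form feature for several gameweeks without any further requests.
forms = {
    gw: {pid: window_sum(d["fixture_history"]["all"], gw, column=0)
         for pid, d in cache.items()}
    for gw in range(6, 9)
}
print(forms)
print(len(calls))
```

With the responses cached, new features can also be derived later without going back to the API.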

In [82]:
### I'm going to do some modelling on Gameweeks 6 and 7

GW6 = pd.read_csv('Fantasy Gameweek 6.csv')
GW6.head()

Unnamed: 0.1,Unnamed: 0,id,first_name,second_name,team_name,team_id,type_name,now_cost,chance_of_playing_next_round,chance_of_playing_this_round,minutes,form,bonus_form_t1,fantasy_form_t1,cost_t1,own_team_form,home,next_team_t,next_team_pos,fantasy_points_t
0,0,1,Wojciech,Szczesny,Arsenal,1,Goalkeeper,50,0.0,0.0,0,0.0,0,0,50,3,False,LEI,6,0
1,1,2,David,Ospina,Arsenal,1,Goalkeeper,47,100.0,100.0,0,0.0,0,0,48,3,False,LEI,6,0
2,2,3,Petr,Cech,Arsenal,1,Goalkeeper,59,,,2430,5.3,109,28,55,3,False,LEI,6,2
3,3,4,Laurent,Koscielny,Arsenal,1,Defender,63,100.0,100.0,2035,3.3,107,17,60,3,False,LEI,6,1
4,4,5,Per,Mertesacker,Arsenal,1,Defender,52,100.0,100.0,1637,0.5,36,3,54,3,False,LEI,6,1


In [13]:
## Top 10 players for GW6

GW6.sort_values(by='fantasy_points_t',ascending=False).head(10)

Unnamed: 0.1,Unnamed: 0,id,first_name,second_name,team_name,team_id,type_name,now_cost,chance_of_playing_next_round,chance_of_playing_this_round,minutes,form,bonus_form_t1,fantasy_form_t1,cost_t1,own_team_form,home,next_team_t,next_team_pos,fantasy_points_t
12,12,13,Alexis,Sánchez,Arsenal,1,Midfielder,110,100.0,100.0,1486,2.5,36,15,110,3,False,LEI,6,20
148,148,149,Romelu,Lukaku,Everton,6,Forward,87,100.0,100.0,2241,4.0,79,24,82,3,False,WBA,15,16
237,237,238,Juan,Mata,Man Utd,10,Midfielder,81,,,2080,2.8,101,27,86,3,True,SUN,20,14
530,530,531,Rudy,Gestede,Aston Villa,2,Forward,55,100.0,100.0,1177,1.0,38,11,59,1,False,LIV,7,13
198,198,199,Daniel,Sturridge,Liverpool,8,Forward,99,50.0,100.0,352,2.0,0,0,104,1,True,AVL,18,13
169,169,170,Jamie,Vardy,Leicester,7,Forward,77,100.0,100.0,2358,6.8,115,25,62,3,True,ARS,4,12
339,339,340,Erik,Lamela,Spurs,14,Midfielder,68,100.0,100.0,1553,2.5,30,8,68,3,True,MCI,2,12
272,272,273,Ayoze,Pérez Gutiérrez,Newcastle,11,Forward,51,,,1492,1.0,14,4,52,1,True,CHE,15,12
312,312,313,Dusan,Tadic,Southampton,13,Midfielder,66,,,1500,0.8,98,26,69,1,True,SWA,11,11
364,364,365,Jonathan,Walters,Stoke,15,Midfielder,56,100.0,100.0,1418,2.5,14,3,58,1,True,BOU,16,10


In [65]:
##Drop redundant players who have played 0 minutes up to GW 6. 

gw6=GW6[GW6.minutes!=0].copy()  # copy so that new columns are not set on a slice of GW6


In [74]:
# Simple Linear Regression 
import statsmodels.formula.api as smf
lm= smf.ols(formula= 'fantasy_points_t ~ bonus_form_t1 + fantasy_form_t1 + own_team_form + home + next_team_pos',data=gw6)

results = lm.fit()
results.summary()

0,1,2,3
Dep. Variable:,fantasy_points_t,R-squared:,0.107
Model:,OLS,Adj. R-squared:,0.096
Method:,Least Squares,F-statistic:,10.29
Date:,"Fri, 04 Mar 2016",Prob (F-statistic):,2.45e-09
Time:,01:25:17,Log-Likelihood:,-1050.4
No. Observations:,437,AIC:,2113.0
Df Residuals:,431,BIC:,2137.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,0.2964,0.458,0.647,0.518,-0.604 1.196
home[T.True],0.2788,0.266,1.048,0.295,-0.244 0.802
bonus_form_t1,-0.0141,0.007,-1.883,0.060,-0.029 0.001
fantasy_form_t1,0.1519,0.034,4.463,0.000,0.085 0.219
own_team_form,0.0644,0.137,0.470,0.639,-0.205 0.334
next_team_pos,0.0428,0.022,1.948,0.052,-0.000 0.086

0,1,2,3
Omnibus:,235.595,Durbin-Watson:,1.836
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1452.676
Skew:,2.32,Prob(JB):,0.0
Kurtosis:,10.632,Cond. No.,227.0


In [105]:
## Let's try some classifiers...

from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

In [96]:
## This is my binary variable, with r = 5
gw6['points_scorer'] = (gw6.fantasy_points_t>5).astype(int)
print(sum(gw6['points_scorer']))


X_t= gw6[['bonus_form_t1','fantasy_form_t1','own_team_form', 'home', 'next_team_pos']]
y_t= gw6.points_scorer
print(len(gw6))

46
437


In [97]:
## I'm going to use Gameweek 7 data as test set

GW7 = pd.read_csv('Fantasy Gameweek 7.csv')
gw7 = GW7[GW7.minutes!=0].copy()  # copy so that the new column is not set on a slice of GW7
len(gw7)

# Define the binary outcome variable for the test set
gw7['points_scorer'] = (gw7.fantasy_points_t>5).astype(int)
print(sum(gw7['points_scorer']))

# Create the test feature set and outcome vector
X_t1= gw7[['bonus_form_t1','fantasy_form_t1','own_team_form', 'home', 'next_team_pos']]
y_t1= gw7.points_scorer

45


In [98]:
# Recycling a function from the lab

def accuracy_report(clf,xtrain,ytrain,xtest,ytest):
    # Compute the accuracy on the training and test datasets
    training_accuracy = clf.score(xtrain, ytrain)
    test_accuracy = clf.score(xtest, ytest)

    # Test accuracy is printed as a percentage, training accuracy as a fraction
    print("Accuracy: %0.2f%%" % (100 * test_accuracy))
    print("Accuracy on training data: %0.2f" % (training_accuracy))

In [101]:
## Try some classifiers..
clf_bn=BernoulliNB().fit(X_t,y_t)
print("Fit Accuracy for Bernoulli Naive Bayes")
accuracy_report(clf_bn,X_t,y_t,X_t1,y_t1)

clf_gauss = GaussianNB().fit(X_t,y_t)
print("Fit Accuracy for Gaussian Naive Bayes")
accuracy_report(clf_gauss,X_t,y_t,X_t1,y_t1)

clf_log=LogisticRegression().fit(X_t,y_t)
print("Fit Accuracy for Logistic Regression")
accuracy_report(clf_log,X_t,y_t,X_t1,y_t1)

# RBF-kernel SVM: no label is printed, so its two accuracy lines come last in the output
rbf_svm=SVC(kernel='rbf').fit(X_t,y_t)
accuracy_report(rbf_svm,X_t,y_t,X_t1,y_t1)

Fit Accuracy for Bernoulli Naive Bayes
Accuracy: 89.70%
Accuracy on training data: 0.89
Fit Accuracy for Gaussian Naive Bayes
Accuracy: 88.10%
Accuracy on training data: 0.88
Fit Accuracy for Logistic Regression
Accuracy: 89.24%
Accuracy on training data: 0.89
Accuracy: 89.70%
Accuracy on training data: 0.96


In [106]:
confusion_matrix(y_t1,rbf_svm.predict(X_t1))

## My classifier isn't predicting any true positives for Y_t = 1 (more than 5 points)! It simply predicts that every player scores 5 points or fewer.

array([[392,   0],
       [ 45,   0]])
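
The confusion matrix also explains the accuracy figure. As a rough check, the sketch below recomputes accuracy, the majority-class baseline and recall directly from the matrix, assuming scikit-learn's layout of true classes in rows and predicted classes in columns. The function name `summarise` is made up for the example.

```python
def summarise(confusion):
    """Accuracy, majority-class baseline and recall from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy of always predicting the more common class
        "majority_baseline": max(tn + fp, fn + tp) / total,
        # Share of actual points scorers that the classifier found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


summary = summarise([[392, 0], [45, 0]])
print(summary)
```

The 89.70% accuracy is exactly what a model gets by labelling all 437 players as non-scorers (392 / 437), and recall on the 45 actual points scorers is zero, so accuracy alone is a misleading yardstick on data this imbalanced.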