![image](https://user-images.githubusercontent.com/92790663/189541779-82e3ea35-da9c-444d-b5d6-3dfe4456c09c.png)

### BUSINESS PROBLEM

Approximately two million managers play Fantasy Premier League (FPL) across 38 game weeks every season. Every game week, the big question for these managers is which soccer players will provide the maximum return on investment (ROI). A model that predicts each player's weekly return is therefore valuable to FPL managers. This project investigates such a model, built on historical data of players' performance against their opponents.

#### BUSINESS OBJECTIVES
- Create a model that predicts points for each player weekly and evaluate the model's accuracy.
- Predict and select players with high returns on fantasy points before every game week.
- Compare players using analytics.

#### DATA SOURCES 
[Link 1](https://www.fantasynutmeg.com)

- This source provides historical data from the 2016 season to the current season. The extracted data contains only data on players with double-digit fantasy points, across every fixture in the respective seasons. It should be possible to extract every player's performance for every fixture from 2016 to the current season; this is left to resolve in the optimization phase of this project.


[Link 2](https://fantasy.premierleague.com/api/)

- This source is the official FPL API. It contains data only for the current season: players' performance, players' positions and all fixtures.


#### PERFORMANCE METRICS
- Accuracy.
- R-Squared ($R^2$) score (Coefficient of determination).
- RMSE (Root Mean Squared Error).
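
As a small illustration of the two regression metrics above, the sketch below computes RMSE and $R^2$ with scikit-learn. The `y_true` and `y_pred` arrays are made-up values for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted fantasy points for five player-fixtures.
y_true = np.array([2, 6, 1, 9, 3])
y_pred = np.array([3, 5, 1, 7, 4])

# RMSE: square root of the mean squared error, in the same units as the target (points).
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2: share of the target's variance explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

A lower RMSE and an $R^2$ closer to 1 both indicate predictions that track the actual points more closely.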

### IMPORT NECESSARY LIBRARIES

In [1]:
import requests
import numpy as np
import pandas as pd

# sklearn
from sklearn.svm import SVR
from sklearn import linear_model
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor

### ACCESSING DATA

In [2]:
# Read data.
df_allseasons = pd.read_csv('cleaned_merged_seasons.csv', index_col = 'Unnamed: 0')
df_allseasons.head()


Unnamed: 0,season_x,name,position,team_x,assists,bonus,bps,clean_sheets,creativity,element,...,team_h_score,threat,total_points,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards,GW
0,2016-17,Aaron Cresswell,DEF,,0,0,0,0,0.0,454,...,2.0,0.0,0,0,0,0,55,False,0,1
1,2016-17,Aaron Lennon,MID,,0,0,6,0,0.3,142,...,1.0,0.0,1,0,0,0,60,True,0,1
2,2016-17,Aaron Ramsey,MID,,0,0,5,0,4.9,16,...,3.0,23.0,2,0,0,0,80,True,0,1
3,2016-17,Abdoulaye Doucouré,MID,,0,0,0,0,0.0,482,...,1.0,0.0,0,0,0,0,50,False,0,1
4,2016-17,Adam Forshaw,MID,,0,0,3,0,1.3,286,...,1.0,0.0,1,0,0,0,45,True,1,1


In [3]:
# Print all columns.
df_allseasons.columns

Index(['season_x', 'name', 'position', 'team_x', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'ict_index', 'influence', 'kickoff_time', 'minutes',
       'opponent_team', 'opp_team_name', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
       'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW'],
      dtype='object')

In [4]:
# Descriptive information on features.
df_allseasons.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98402 entries, 0 to 98401
Data columns (total 37 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   season_x           98402 non-null  object 
 1   name               98402 non-null  object 
 2   position           98402 non-null  object 
 3   team_x             48930 non-null  object 
 4   assists            98402 non-null  int64  
 5   bonus              98402 non-null  int64  
 6   bps                98402 non-null  int64  
 7   clean_sheets       98402 non-null  int64  
 8   creativity         98402 non-null  float64
 9   element            98402 non-null  int64  
 10  fixture            98402 non-null  int64  
 11  goals_conceded     98402 non-null  int64  
 12  goals_scored       98402 non-null  int64  
 13  ict_index          98402 non-null  float64
 14  influence          98402 non-null  float64
 15  kickoff_time       98402 non-null  object 
 16  minutes            984

In [5]:
# Check for any missing values.
df_allseasons.isnull().values.any()

True

In [6]:
# Check missing values for each feature.
df_allseasons.isna().sum()

season_x                 0
name                     0
position                 0
team_x               49472
assists                  0
bonus                    0
bps                      0
clean_sheets             0
creativity               0
element                  0
fixture                  0
goals_conceded           0
goals_scored             0
ict_index                0
influence                0
kickoff_time             0
minutes                  0
opponent_team            0
opp_team_name            0
own_goals                0
penalties_missed         0
penalties_saved          0
red_cards                0
round                    0
saves                    0
selected                 0
team_a_score            49
team_h_score            49
threat                   0
total_points             0
transfers_balance        0
transfers_in             0
transfers_out            0
value                    0
was_home                 0
yellow_cards             0
GW                       0
dtype: int64

In [7]:
df_allseasons.team_x

0                NaN
1                NaN
2                NaN
3                NaN
4                NaN
            ...     
98397      Leicester
98398      Newcastle
98399    Southampton
98400       Brighton
98401       West Ham
Name: team_x, Length: 98402, dtype: object

In [8]:
# Check for duplicates on each row.
df_allseasons.duplicated().value_counts()

False    98402
dtype: int64

In [9]:
# Check for unique values.
df_allseasons.nunique()

season_x                 6
name                   989
position                 4
team_x                  23
assists                  5
bonus                    4
bps                    113
clean_sheets             2
creativity             860
element                737
fixture                380
goals_conceded          10
goals_scored             5
ict_index              273
influence              528
kickoff_time          1428
minutes                 91
opponent_team           20
opp_team_name           31
own_goals                2
penalties_missed         2
penalties_saved          3
red_cards                2
round                   47
saves                   14
selected             65713
team_a_score             9
team_h_score            10
threat                 149
total_points            31
transfers_balance    32217
transfers_in         24344
transfers_out        26734
value                  100
was_home                 2
yellow_cards             2
GW                      47
dtype: int64

In [10]:
# Descriptive statistics.
df_allseasons.describe()

Unnamed: 0,assists,bonus,bps,clean_sheets,creativity,element,fixture,goals_conceded,goals_scored,ict_index,...,team_a_score,team_h_score,threat,total_points,transfers_balance,transfers_in,transfers_out,value,yellow_cards,GW
count,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,...,98353.0,98353.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0,98402.0
mean,0.045873,0.122599,6.825359,0.120993,5.352928,311.321701,196.188248,0.542845,0.051279,1.963898,...,1.262097,1.491708,6.127121,1.541798,1318.144,13950.5,12631.75,52.49687,0.057814,20.718309
std,0.22768,0.520794,10.252218,0.326121,11.305636,181.148434,108.6632,0.995002,0.247819,3.218001,...,1.224245,1.310472,14.476371,2.658725,58594.15,50342.2,42870.93,13.123029,0.233392,11.605966
min,0.0,0.0,-18.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-7.0,-1857821.0,0.0,0.0,37.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,157.0,103.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,-1721.0,64.0,185.0,45.0,0.0,11.0
50%,0.0,0.0,0.0,0.0,0.0,306.0,200.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,-78.0,587.0,1422.0,49.0,0.0,21.0
75%,0.0,0.0,12.0,0.0,3.9,459.0,290.0,1.0,0.0,2.9,...,2.0,2.0,4.0,2.0,144.0,5894.0,8614.75,55.0,0.0,30.0
max,4.0,3.0,128.0,1.0,170.9,737.0,380.0,9.0,4.0,35.8,...,9.0,9.0,186.0,29.0,1983733.0,2104464.0,1872898.0,136.0,1.0,47.0


#### Observation
- Some values are missing (`team_x`, `team_a_score` and `team_h_score`).
- There are no duplicate observations.

#### FEATURE ENGINEERING

In [11]:
# Make a copy of the original piece of data.
df_allseasons_clean = df_allseasons.copy()

> To engineer two new features named `club_name` and `form`, we collect data from the `fantasynutmeg` API (Link 1 under Data Sources), compare its columns with those of the `df_allseasons` dataframe and extract the two features.

In [12]:
# Get yearly historic data from the endpoint for the available seasons and identify the keys in each dictionary, using 2016 as an example.
Y2016= requests.get('https://www.fantasynutmeg.com/api/history/season/2016-17').json()
Y2017= requests.get('https://www.fantasynutmeg.com/api/history/season/2017-18').json()
Y2018= requests.get('https://www.fantasynutmeg.com/api/history/season/2018-19').json()
Y2019= requests.get('https://www.fantasynutmeg.com/api/history/season/2019-20').json()
Y2020= requests.get('https://www.fantasynutmeg.com/api/history/season/2020-21').json()
Y2021= requests.get('https://www.fantasynutmeg.com/api/history/season/2021-22').json()
Y2022= requests.get('https://www.fantasynutmeg.com/api/history/season/2022-23').json()

Y2016.keys()

dict_keys(['dd_agg_fixture', 'dd_agg_player', 'dd_hauls', 'history'])

In [13]:
# Convert history data dictionary to a pandas dataframe.
hist16_df = pd.DataFrame(Y2016['history'])
hist17_df = pd.DataFrame(Y2017['history'])
hist18_df = pd.DataFrame(Y2018['history'])
hist19_df = pd.DataFrame(Y2019['history'])
hist20_df = pd.DataFrame(Y2020['history'])
hist21_df = pd.DataFrame(Y2021['history'])

In [14]:
# Engineer a feature that labels each season.
hist16_df['year'] = "2016-17"
hist17_df['year'] = "2017-18"
hist18_df['year'] = "2018-19"
hist19_df['year'] = "2019-20"
hist20_df['year'] = "2020-21"
hist21_df['year'] = "2021-22"

In [15]:
# Concatenate all history data across years.
hist_df = [hist16_df, hist17_df, hist18_df, hist19_df, hist20_df, hist21_df]

hist = pd.concat(hist_df, axis = 0, ignore_index=True)

In [16]:
# Preview history data.
hist.head()

Unnamed: 0,assists,bonus,bps,chance_of_playing_next_round,chance_of_playing_this_round,clean_sheets,code,cost_change_event,cost_change_event_fall,cost_change_start,...,influence_rank,influence_rank_type,threat_rank,threat_rank_type,corners_and_indirect_freekicks_order,corners_and_indirect_freekicks_text,direct_freekicks_order,direct_freekicks_text,penalties_order,penalties_text
0,0,0,18,100,100,0,48844,0,0,-3,...,,,,,,,,,,
1,0,2,660,100,100,12,11334,0,0,-1,...,,,,,,,,,,
2,1,19,723,0,75,10,51507,0,0,1,...,,,,,,,,,,
3,0,0,5,100,100,0,17127,0,0,-2,...,,,,,,,,,,
4,0,2,296,75,100,5,158074,0,0,-2,...,,,,,,,,,,


In [17]:
# Engineer a `form` feature: season total points averaged over the 38 game weeks.
hist['form'] = hist['total_points'] / 38

In [18]:
# Print all columns.
hist.columns

Index(['assists', 'bonus', 'bps', 'chance_of_playing_next_round',
       'chance_of_playing_this_round', 'clean_sheets', 'code',
       'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
       'cost_change_start_fall', 'creativity', 'dreamteam_count', 'ea_index',
       'element_type', 'ep_next', 'ep_this', 'event_points', 'first_name',
       'form', 'goals_conceded', 'goals_scored', 'ict_index', 'id',
       'in_dreamteam', 'influence', 'loaned_in', 'loaned_out', 'loans_in',
       'loans_out', 'minutes', 'news', 'now_cost', 'own_goals',
       'penalties_missed', 'penalties_saved', 'photo', 'points_per_game',
       'position', 'red_cards', 'saves', 'second_name', 'selected_by_percent',
       'special', 'squad_number', 'status', 'team', 'team_code', 'team_name',
       'threat', 'total_points', 'transfers_in', 'transfers_in_event',
       'transfers_out', 'transfers_out_event', 'value_form', 'value_season',
       'web_name', 'yellow_cards', 'year', 'news_added', 

We have the historical data for the 2016-17 to 2021-22 seasons. We proceed as follows:
1. In the historical data, create a key column from the first name, last name and season.
2. In the `df_allseasons` dataframe, combine the name and the season into the same key.
3. Map the two on this key and extract the needed features.
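
The three steps can be sketched on a toy example. The frames `hist_demo` and `fixtures_demo`, and the `Unknown Player` row, are hypothetical stand-ins for the real data; the point is to show how a shared key transfers a season-level feature onto fixture-level rows, and why unmatched keys are worth counting.

```python
import pandas as pd

# Hypothetical season-level history (one row per player per season).
hist_demo = pd.DataFrame({
    "first_name": ["Mohamed", "Marcus"],
    "second_name": ["Salah", "Rashford"],
    "year": ["2018-19", "2020-21"],
    "team_name": ["LIV", "MUN"],
})

# Hypothetical fixture-level data (one row per player per game week).
fixtures_demo = pd.DataFrame({
    "name": ["Mohamed Salah", "Mohamed Salah", "Marcus Rashford", "Unknown Player"],
    "season_x": ["2018-19", "2018-19", "2020-21", "2020-21"],
})

# Steps 1 and 2: build the same "<name>_<season>" key on both sides.
hist_demo["name_season"] = (
    hist_demo["first_name"] + " " + hist_demo["second_name"] + "_" + hist_demo["year"]
)
fixtures_demo["name_season"] = fixtures_demo["name"] + "_" + fixtures_demo["season_x"]

# Step 3: look up the season-level feature through the shared key.
lookup = hist_demo.set_index("name_season")["team_name"]
fixtures_demo["club_name"] = fixtures_demo["name_season"].map(lookup)

# Keys without a match become NaN, so the match rate is worth checking.
unmatched = fixtures_demo["club_name"].isna().sum()
print(fixtures_demo[["name_season", "club_name"]])
print("unmatched rows:", unmatched)
```

Any spelling difference between the two sources (accents, middle names) leaves the mapped value as NaN, which is one way missing values can enter the engineered columns.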

In [19]:
# Engineer a key combining each player's name and the season they played in.
hist['name_season'] = hist['first_name'] + ' ' + hist['second_name'] + '_' + hist['year']

In [20]:
# Display a sample of name_season column.
hist.name_season.head()

0                David Ospina_2016-17
1                   Petr Cech_2016-17
2           Laurent Koscielny_2016-17
3             Per Mertesacker_2016-17
4    Gabriel Armando de Abreu_2016-17
Name: name_season, dtype: object

In [21]:
# Data Quality Checks.
subset  = ['Mohamed Salah_2018-19']
check = hist[hist.name_season.isin(subset)]
check.form

1582    6.815789
Name: form, dtype: float64
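
A quick arithmetic check, assuming `form` was computed as `total_points / 38` as in cell In [17]: multiplying the printed value back by 38 recovers the season total it implies. The constant name `GAME_WEEKS` is introduced here for illustration only.

```python
# Value printed by the check above for 'Mohamed Salah_2018-19'.
form = 6.815789
GAME_WEEKS = 38  # divisor used when `form` was engineered

# Invert form = total_points / 38 to recover the implied season total.
implied_total_points = round(form * GAME_WEEKS)
print(implied_total_points)
```

Because `form` is a season-level average, it is a single value per player per season rather than a rolling measure of recent matches.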

In [22]:
# Engineer a key combining each player's name and the season they played in.
df_allseasons_clean['name_season'] = df_allseasons_clean['name'] + '_' + df_allseasons_clean['season_x']

In [23]:
# Engineer a feature to highlight the club of the player.
teams=dict(zip(hist.name_season, hist.team_name))

df_allseasons_clean['club_name'] = df_allseasons_clean['name_season'].map(teams)

In [24]:
# Engineer a feature to highlight the form of the player.
teams=dict(zip(hist.name_season, hist.form))

df_allseasons_clean['form'] = df_allseasons_clean['name_season'].map(teams)

In [25]:
# Preview dataframe.
df_allseasons_clean.head()

Unnamed: 0,season_x,name,position,team_x,assists,bonus,bps,clean_sheets,creativity,element,...,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards,GW,name_season,club_name,form
0,2016-17,Aaron Cresswell,DEF,,0,0,0,0,0.0,454,...,0,0,0,55,False,0,1,Aaron Cresswell_2016-17,WHU,1.578947
1,2016-17,Aaron Lennon,MID,,0,0,6,0,0.3,142,...,0,0,0,60,True,0,1,Aaron Lennon_2016-17,EVE,0.578947
2,2016-17,Aaron Ramsey,MID,,0,0,5,0,4.9,16,...,0,0,0,80,True,0,1,Aaron Ramsey_2016-17,ARS,1.473684
3,2016-17,Abdoulaye Doucouré,MID,,0,0,0,0,0.0,482,...,0,0,0,50,False,0,1,Abdoulaye Doucouré_2016-17,WAT,1.0
4,2016-17,Adam Forshaw,MID,,0,0,3,0,1.3,286,...,0,0,0,45,True,1,1,Adam Forshaw_2016-17,MID,2.026316


In [26]:
# Data Quality Checks.
subset  = ['Marcus Rashford_2020-21']
check = df_allseasons_clean[df_allseasons_clean.name_season.isin(subset)]
check.form

50237    4.578947
50787    4.578947
51344    4.578947
51917    4.578947
52611    4.578947
53207    4.578947
53807    4.578947
54408    4.578947
55010    4.578947
55566    4.578947
56157    4.578947
56764    4.578947
57374    4.578947
57988    4.578947
58498    4.578947
59044    4.578947
59480    4.578947
60327    4.578947
60328    4.578947
61026    4.578947
61674    4.578947
62336    4.578947
63001    4.578947
63780    4.578947
64523    4.578947
65608    4.578947
65609    4.578947
65890    4.578947
66643    4.578947
67606    4.578947
68293    4.578947
68985    4.578947
69666    4.578947
70880    4.578947
70881    4.578947
70882    4.578947
72475    4.578947
73181    4.578947
Name: form, dtype: float64

In [27]:
# Print all columns.
df_allseasons_clean.columns

Index(['season_x', 'name', 'position', 'team_x', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'ict_index', 'influence', 'kickoff_time', 'minutes',
       'opponent_team', 'opp_team_name', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
       'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW', 'name_season', 'club_name', 'form'],
      dtype='object')

In [28]:
# Engineer feature to highlight the game dates from kickoff_time.
df_allseasons_clean['game_date'] = df_allseasons_clean['kickoff_time'].str.replace('T', ' ')
df_allseasons_clean['game_date'] = df_allseasons_clean['game_date'].str.replace(':00Z', '')

In [29]:
# Preview series.
df_allseasons_clean.game_date.head()

0    2016-08-15 19:00
1    2016-08-13 14:00
2    2016-08-14 15:00
3    2016-08-13 14:00
4    2016-08-13 14:00
Name: game_date, dtype: object

In [30]:
# Convert game_date feature to appropriate dtype.
df_allseasons_clean['game_date'] = pd.to_datetime(df_allseasons_clean['game_date'])

In [31]:
# Preview series.
df_allseasons_clean.game_date.head()

0   2016-08-15 19:00:00
1   2016-08-13 14:00:00
2   2016-08-14 15:00:00
3   2016-08-13 14:00:00
4   2016-08-13 14:00:00
Name: game_date, dtype: datetime64[ns]

In [32]:
# Engineer a season-of-year feature from the game month (1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov).
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]

month_to_season = dict(zip(range(1,13), seasons))
df_allseasons_clean['game_weather'] = df_allseasons_clean.game_date.dt.month.map(month_to_season) 

In [33]:
# Data Quality Check.
df_allseasons_clean.game_weather.value_counts()

1    36939
2    26810
4    24533
3    10120
Name: game_weather, dtype: int64
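
The month-to-season lookup from the cell above can be checked on a handful of dates. The dates below are hypothetical, chosen so that each of the four codes appears once.

```python
import pandas as pd

# Same month-to-season lookup as in the cell above (first entry = January).
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
month_to_season = dict(zip(range(1, 13), seasons))

# Hypothetical kickoff dates: August, October, December and April.
dates = pd.Series(pd.to_datetime(["2016-08-13", "2016-10-22", "2016-12-26", "2017-04-01"]))

# Extract the month of each date and translate it into its season code.
codes = dates.dt.month.map(month_to_season)
print(codes.tolist())
```

Note that the codes are ordinal labels only; a model that treats them as numbers will assume code 4 is "further" from code 1 than code 2 is, even though December and September are adjacent seasons.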

In [34]:
# Engineer a feature that flags games starting before 13:00 (early starts, 0) and games starting at or after 13:00 (late starts, 1).
df_allseasons_clean['start_label'] = np.where((df_allseasons_clean['game_date'].dt.hour) < 13, 0, 1)

In [35]:
# Quality Check.
df_allseasons_clean[['game_date', 'start_label']].head(20)

Unnamed: 0,game_date,start_label
0,2016-08-15 19:00:00,1
1,2016-08-13 14:00:00,1
2,2016-08-14 15:00:00,1
3,2016-08-13 14:00:00,1
4,2016-08-13 14:00:00,1
5,2016-08-14 15:00:00,1
6,2016-08-15 19:00:00,1
7,2016-08-14 15:00:00,1
8,2016-08-13 14:00:00,1
9,2016-08-14 15:00:00,1


In [36]:
# Engineer a feature to highlight the game year only.
df_allseasons_clean['year'] = df_allseasons_clean.game_date.dt.year

In [37]:
# Check unique years.
df_allseasons_clean['year'].unique()

array([2016, 2017, 2018, 2019, 2020, 2021, 2022])

In [38]:
# Descriptive information on all features.
df_allseasons_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98402 entries, 0 to 98401
Data columns (total 44 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   season_x           98402 non-null  object        
 1   name               98402 non-null  object        
 2   position           98402 non-null  object        
 3   team_x             48930 non-null  object        
 4   assists            98402 non-null  int64         
 5   bonus              98402 non-null  int64         
 6   bps                98402 non-null  int64         
 7   clean_sheets       98402 non-null  int64         
 8   creativity         98402 non-null  float64       
 9   element            98402 non-null  int64         
 10  fixture            98402 non-null  int64         
 11  goals_conceded     98402 non-null  int64         
 12  goals_scored       98402 non-null  int64         
 13  ict_index          98402 non-null  float64       
 14  influe

In [39]:
# Check missing values for each feature.
df_allseasons_clean.isna().sum()

season_x                 0
name                     0
position                 0
team_x               49472
assists                  0
bonus                    0
bps                      0
clean_sheets             0
creativity               0
element                  0
fixture                  0
goals_conceded           0
goals_scored             0
ict_index                0
influence                0
kickoff_time             0
minutes                  0
opponent_team            0
opp_team_name            0
own_goals                0
penalties_missed         0
penalties_saved          0
red_cards                0
round                    0
saves                    0
selected                 0
team_a_score            49
team_h_score            49
threat                   0
total_points             0
transfers_balance        0
transfers_in             0
transfers_out            0
value                    0
was_home                 0
yellow_cards             0
GW                       0
...

In [40]:
# Check the NaN values in team_a_score and team_h_score.
filt = df_allseasons_clean['team_a_score'].isna()
df_allseasons_clean.loc[filt, 'team_h_score']

44426   NaN
44428   NaN
44430   NaN
44444   NaN
44450   NaN
44453   NaN
44458   NaN
44465   NaN
44485   NaN
44490   NaN
44494   NaN
44498   NaN
44502   NaN
44505   NaN
44510   NaN
44514   NaN
44521   NaN
44535   NaN
44549   NaN
44559   NaN
44572   NaN
44574   NaN
44589   NaN
44602   NaN
44616   NaN
44622   NaN
44624   NaN
44627   NaN
44630   NaN
44656   NaN
44658   NaN
44681   NaN
44691   NaN
44700   NaN
44714   NaN
44728   NaN
44736   NaN
44740   NaN
44761   NaN
44769   NaN
44785   NaN
44807   NaN
44839   NaN
44856   NaN
44866   NaN
44870   NaN
44872   NaN
44902   NaN
44917   NaN
Name: team_h_score, dtype: float64

#### Observation
- `team_x` can be dropped, since another feature (`club_name`) that gives each player's club has been engineered.
- `team_a_score` and `team_h_score` are missing for the same 49 observations, and `club_name` and `form` also have missing values. The affected observations are a small share of the data and can be dropped.
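
Whether gaps in several columns fall on the same rows can be checked directly. The sketch below uses a small hypothetical frame (`demo`) rather than the project data: if the "all missing" and "any missing" counts are equal, the gaps coincide; if they differ, dropping incomplete rows removes more observations than any single column's count suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same four columns and a few gaps.
demo = pd.DataFrame({
    "team_a_score": [1.0, np.nan, 2.0, 0.0],
    "team_h_score": [0.0, np.nan, 1.0, 3.0],
    "club_name": ["ARS", "LIV", None, "EVE"],
    "form": [1.5, 2.0, np.nan, 0.6],
})

cols = ["team_a_score", "team_h_score", "club_name", "form"]
missing = demo[cols].isna()

# Rows where every listed column is missing vs. rows where at least one is missing.
all_missing = missing.all(axis=1).sum()
any_missing = missing.any(axis=1).sum()
print(all_missing, any_missing)
```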

#### Data Quality
Data quality issues generally fall into four categories:
- Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
- Validity: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
- Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect.
- Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

After assessing the data, we have the following issues:

1. Missing data (`team_x`, `team_a_score`, `team_h_score`, `club_name`, `form`).
2. Erroneous data types (`team_a_score`, `team_h_score`).
3. Redundant features (`opponent_team` and `opp_team_name`, `kickoff_time` and `game_date`)

#### Data Tidiness
There are three main requirements for tidiness.

1. Each variable forms a column,
2. Each observation forms a row, and
3. Each type of observational unit forms a table.

The dataset largely meets the three criteria above.

### CLEANING DATA

In [41]:
# Make a copy of the original piece of data.
df_allseasons_final = df_allseasons_clean.copy()

#### QUALITY ISSUES

#### Issue #1:
- Missing data (`team_x`, `team_a_score`, `team_h_score`, `club_name`, `form`)

#### Define
- Drop `team_x` column.
- Drop all missing observations.

#### Code

In [42]:
# Drop feature.
df_allseasons_final.drop('team_x', axis = 1, inplace=True)

# Drop all missing observations.
df_allseasons_final.dropna(inplace=True)

#### Test

In [43]:
# Print all columns.
df_allseasons_final.columns

Index(['season_x', 'name', 'position', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'ict_index', 'influence', 'kickoff_time', 'minutes',
       'opponent_team', 'opp_team_name', 'own_goals', 'penalties_missed',
       'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
       'team_a_score', 'team_h_score', 'threat', 'total_points',
       'transfers_balance', 'transfers_in', 'transfers_out', 'value',
       'was_home', 'yellow_cards', 'GW', 'name_season', 'club_name', 'form',
       'game_date', 'game_weather', 'start_label', 'year'],
      dtype='object')

In [44]:
# Check missing values for each feature.
df_allseasons_final.isna().sum()

season_x             0
name                 0
position             0
assists              0
bonus                0
bps                  0
clean_sheets         0
creativity           0
element              0
fixture              0
goals_conceded       0
goals_scored         0
ict_index            0
influence            0
kickoff_time         0
minutes              0
opponent_team        0
opp_team_name        0
own_goals            0
penalties_missed     0
penalties_saved      0
red_cards            0
round                0
saves                0
selected             0
team_a_score         0
team_h_score         0
threat               0
total_points         0
transfers_balance    0
transfers_in         0
transfers_out        0
value                0
was_home             0
yellow_cards         0
GW                   0
name_season          0
club_name            0
form                 0
game_date            0
game_weather         0
start_label          0
year                 0
dtype: int64

#### Issue #2:
- Erroneous data types (`team_a_score`, `team_h_score`).

#### Define
- Convert features to their appropriate data types (int).

#### Code

In [45]:
# Change dtypes.
df_allseasons_final['team_h_score'] = df_allseasons_final['team_h_score'].astype(int)
df_allseasons_final['team_a_score'] = df_allseasons_final['team_a_score'].astype(int)

#### Test

In [46]:
# Check dtype.
df_allseasons_final[['team_h_score', 'team_a_score', 'year']].dtypes

team_h_score    int64
team_a_score    int64
year            int64
dtype: object

#### Issue #3:
- Redundant features (`opponent_team` and `opp_team_name`, `kickoff_time` and `game_date`)

#### Define
- Drop `opponent_team` and `kickoff_time`.

#### Code

In [47]:
# Drop features.
df_allseasons_final.drop(['opponent_team', 'kickoff_time'], axis = 1, inplace=True)

#### Test

In [48]:
# Print all columns.
df_allseasons_final.columns

Index(['season_x', 'name', 'position', 'assists', 'bonus', 'bps',
       'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
       'goals_scored', 'ict_index', 'influence', 'minutes', 'opp_team_name',
       'own_goals', 'penalties_missed', 'penalties_saved', 'red_cards',
       'round', 'saves', 'selected', 'team_a_score', 'team_h_score', 'threat',
       'total_points', 'transfers_balance', 'transfers_in', 'transfers_out',
       'value', 'was_home', 'yellow_cards', 'GW', 'name_season', 'club_name',
       'form', 'game_date', 'game_weather', 'start_label', 'year'],
      dtype='object')

#### REFACTORING DATA

- Drop features that are not needed for modeling (`season_x`, `name`, `name_season`, `fixture`, `game_date`, `round`)
- Drop all observations where the player had zero playtime.
- Convert the dataframe to a time series by making `year` the index.

In [49]:
# Drop features.
df_allseasons_final.drop(['season_x', 'name', 'name_season', 'fixture', 'game_date', 'round'], axis=1, inplace=True)

# Descriptive information.
df_allseasons_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98315 entries, 0 to 98401
Data columns (total 35 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   position           98315 non-null  object 
 1   assists            98315 non-null  int64  
 2   bonus              98315 non-null  int64  
 3   bps                98315 non-null  int64  
 4   clean_sheets       98315 non-null  int64  
 5   creativity         98315 non-null  float64
 6   element            98315 non-null  int64  
 7   goals_conceded     98315 non-null  int64  
 8   goals_scored       98315 non-null  int64  
 9   ict_index          98315 non-null  float64
 10  influence          98315 non-null  float64
 11  minutes            98315 non-null  int64  
 12  opp_team_name      98315 non-null  object 
 13  own_goals          98315 non-null  int64  
 14  penalties_missed   98315 non-null  int64  
 15  penalties_saved    98315 non-null  int64  
 16  red_cards          983

In [50]:
# Drop all players with zero playtime.
zero_minutes = df_allseasons_final[df_allseasons_final.minutes == 0].index
df_allseasons_final.drop(zero_minutes, axis = 0, inplace=True)

# Descriptive info.
df_allseasons_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49194 entries, 1 to 98399
Data columns (total 35 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   position           49194 non-null  object 
 1   assists            49194 non-null  int64  
 2   bonus              49194 non-null  int64  
 3   bps                49194 non-null  int64  
 4   clean_sheets       49194 non-null  int64  
 5   creativity         49194 non-null  float64
 6   element            49194 non-null  int64  
 7   goals_conceded     49194 non-null  int64  
 8   goals_scored       49194 non-null  int64  
 9   ict_index          49194 non-null  float64
 10  influence          49194 non-null  float64
 11  minutes            49194 non-null  int64  
 12  opp_team_name      49194 non-null  object 
 13  own_goals          49194 non-null  int64  
 14  penalties_missed   49194 non-null  int64  
 15  penalties_saved    49194 non-null  int64  
 16  red_cards          491

In [51]:
# Make year the index.
df_allseasons_final.set_index('year', inplace=True)

df_allseasons_final.head()

year,position,assists,bonus,bps,clean_sheets,creativity,element,goals_conceded,goals_scored,ict_index,...,transfers_in,transfers_out,value,was_home,yellow_cards,GW,club_name,form,game_weather,start_label
2016,MID,0,0,6,0,0.3,142,0,0,0.9,...,0,0,60,True,0,1,EVE,0.578947,3,1
2016,MID,0,0,5,0,4.9,16,3,0,3.0,...,0,0,80,True,0,1,ARS,1.473684,3,1
2016,MID,0,0,3,0,1.3,286,1,0,0.3,...,0,0,45,True,1,1,MID,2.026316,3,1
2016,MID,1,2,33,0,33.7,205,3,1,14.2,...,0,0,70,False,1,1,LIV,3.657895,3,1
2016,GK,0,0,16,0,0.0,450,2,0,3.0,...,0,0,50,False,0,1,WHU,1.684211,3,1


#### SPLIT TIME SERIES DATA.

In [52]:
# Sort index (just in case).
df_allseasons_final.sort_index(inplace=True)

# Assign features and target variable.
features = df_allseasons_final.drop(['total_points'], axis = 1)
target = df_allseasons_final['total_points']

In [53]:
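# Time series split (expanding training window).
# Each iteration overwrites the previous split, so the variables keep the last fold.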
tss = TimeSeriesSplit(n_splits = 3)

for train_index, test_index in tss.split(features):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = features.iloc[train_index, :], features.iloc[test_index,:]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

TRAIN: [    0     1     2 ... 12297 12298 12299] TEST: [12300 12301 12302 ... 24595 24596 24597]
TRAIN: [    0     1     2 ... 24595 24596 24597] TEST: [24598 24599 24600 ... 36893 36894 36895]
TRAIN: [    0     1     2 ... 36893 36894 36895] TEST: [36896 36897 36898 ... 49191 49192 49193]
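
To make the behaviour of the loop above concrete, here is a minimal sketch on hypothetical toy data (the names `X_toy`, `tss_toy` and `folds` are illustrative and not part of the notebook). It shows that `TimeSeriesSplit` grows the training window fold by fold, and that a loop which reassigns the same variables keeps only the final fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 8 rows that are already in time order.
X_toy = np.arange(8).reshape(-1, 1)

tss_toy = TimeSeriesSplit(n_splits=3)

# Collect every fold instead of overwriting the previous one.
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tss_toy.split(X_toy)
]

for train_idx, test_idx in folds:
    print("TRAIN:", train_idx, "TEST:", test_idx)

# The last fold is what a reassigning loop would leave behind.
last_train_idx, last_test_idx = folds[-1]
```

Each test block always comes after its training block in time, which is why this splitter suits season-ordered data better than a random split.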


In [54]:
X_train

year,position,assists,bonus,bps,clean_sheets,creativity,element,goals_conceded,goals_scored,ict_index,...,transfers_in,transfers_out,value,was_home,yellow_cards,GW,club_name,form,game_weather,start_label
2016,MID,0,0,6,0,0.3,142,0,0,0.9,...,0,0,60,True,0,1,EVE,0.578947,3,1
2016,MID,0,0,2,0,1.8,181,1,0,0.4,...,277,2612,49,True,0,9,LEI,1.526316,4,1
2016,DEF,0,0,18,1,0.1,570,0,0,1.3,...,7048,1525,60,True,1,9,CHE,3.473684,4,1
2016,GK,0,0,11,0,0.0,242,4,0,1.3,...,8617,28641,55,False,0,9,MUN,3.578947,4,1
2016,MID,0,0,12,1,13.9,43,0,0,3.1,...,337,173,47,True,1,9,BOU,1.289474,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021,MID,0,0,11,0,0.9,304,3,0,1.8,...,213,1889,48,False,0,17,NEW,1.631579,1,1
2021,MID,0,0,23,0,25.7,138,1,1,15.9,...,213788,25083,76,True,0,17,CHE,4.447368,1,1
2021,DEF,1,0,28,0,21.5,342,2,0,6.7,...,900,2026,48,False,0,17,SOU,2.157895,1,1
2021,DEF,0,0,17,0,2.7,243,1,0,2.0,...,1605,3144,51,True,0,17,LIV,0.868421,1,1


### ENCODING CATEGORICAL FEATURES

- Encoding will be carried out with `DictVectorizer`, a feature extraction class in sklearn.
- The vectorizer is fitted on `X_train` only; `X_test` is then transformed with the same fitted vectorizer so that both share the same columns.
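
As a small illustration of this encoding step, the sketch below uses hypothetical records (`train_records`, `test_records` and `toy_dv` are made-up names, not notebook variables). It shows how string values become one-hot columns, how numeric and boolean values pass through as single columns, and what happens to a category that was not seen during fitting.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records in the same shape as DataFrame.to_dict(orient='records').
train_records = [
    {"position": "MID", "minutes": 90, "was_home": True},
    {"position": "GK", "minutes": 45, "was_home": False},
]
test_records = [
    {"position": "FWD", "minutes": 60, "was_home": True},  # 'FWD' is unseen
]

toy_dv = DictVectorizer(sparse=False)

# Fit on the training records only, then reuse the fitted vectorizer.
train_encoded = toy_dv.fit_transform(train_records)
test_encoded = toy_dv.transform(test_records)

feature_names = toy_dv.feature_names_
print(feature_names)
print(train_encoded)
print(test_encoded)
```

The unseen `position=FWD` has no column, so that test row gets zeros in every `position=` column. Fitting on the training data alone keeps the column layout identical for both sets.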

#### For X_train

In [55]:
# Descriptive info of categorical features.
X_train[['position', 'opp_team_name', 'club_name', 'was_home']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36896 entries, 2016 to 2021
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   position       36896 non-null  object
 1   opp_team_name  36896 non-null  object
 2   club_name      36896 non-null  object
 3   was_home       36896 non-null  bool  
dtypes: bool(1), object(3)
memory usage: 1.2+ MB


In [56]:
# Convert dataframe to a dictionary.
X_train_dict = X_train.to_dict(orient='records')

In [57]:
# Print sample observation.
X_train_dict[0]

{'position': 'MID',
 'assists': 0,
 'bonus': 0,
 'bps': 6,
 'clean_sheets': 0,
 'creativity': 0.3,
 'element': 142,
 'goals_conceded': 0,
 'goals_scored': 0,
 'ict_index': 0.9,
 'influence': 8.2,
 'minutes': 15,
 'opp_team_name': 'Spurs',
 'own_goals': 0,
 'penalties_missed': 0,
 'penalties_saved': 0,
 'red_cards': 0,
 'saves': 0,
 'selected': 13918,
 'team_a_score': 1,
 'team_h_score': 1,
 'threat': 0.0,
 'transfers_balance': 0,
 'transfers_in': 0,
 'transfers_out': 0,
 'value': 60,
 'was_home': True,
 'yellow_cards': 0,
 'GW': 1,
 'club_name': 'EVE',
 'form': 0.5789473684210527,
 'game_weather': 3,
 'start_label': 1}

In [58]:
# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
dv = DictVectorizer(sparse=False)

# Fit the vectorizer on the training records and encode them.
X_train_encoded = dv.fit_transform(X_train_dict)

In [59]:
X_train_encoded

array([[ 1.,  0.,  0., ..., 60.,  1.,  0.],
       [ 9.,  0.,  0., ..., 49.,  1.,  0.],
       [ 9.,  0.,  0., ..., 60.,  1.,  1.],
       ...,
       [17.,  1.,  0., ..., 48.,  0.,  0.],
       [17.,  0.,  0., ..., 51.,  1.,  0.],
       [17.,  0.,  0., ..., 53.,  1.,  0.]])

In [60]:
# vocabulary
vocab = dv.vocabulary_

# show vocab
vocab

{'position=MID': 82,
 'assists': 1,
 'bonus': 2,
 'bps': 3,
 'clean_sheets': 4,
 'creativity': 36,
 'element': 37,
 'goals_conceded': 40,
 'goals_scored': 41,
 'ict_index': 42,
 'influence': 43,
 'minutes': 44,
 'opp_team_name=Spurs': 68,
 'own_goals': 76,
 'penalties_missed': 77,
 'penalties_saved': 78,
 'red_cards': 83,
 'saves': 84,
 'selected': 85,
 'team_a_score': 87,
 'team_h_score': 88,
 'threat': 89,
 'transfers_balance': 90,
 'transfers_in': 91,
 'transfers_out': 92,
 'value': 93,
 'was_home': 94,
 'yellow_cards': 95,
 'GW': 0,
 'club_name=EVE': 14,
 'form': 38,
 'game_weather': 39,
 'start_label': 86,
 'opp_team_name=Crystal Palace': 53,
 'club_name=LEI': 19,
 'position=DEF': 79,
 'opp_team_name=Man Utd': 62,
 'club_name=CHE': 12,
 'position=GK': 81,
 'opp_team_name=Chelsea': 52,
 'club_name=MUN': 23,
 'club_name=BOU': 8,
 'opp_team_name=Bournemouth': 47,
 'club_name=TOT': 31,
 'position=FWD': 80,
 'opp_team_name=West Brom': 73,
 'club_name=LIV': 20,
 'opp_team_name=Southampt

In [61]:
# Check feature names.
dv.feature_names_

['GW',
 'assists',
 'bonus',
 'bps',
 'clean_sheets',
 'club_name=ARS',
 'club_name=AVL',
 'club_name=BHA',
 'club_name=BOU',
 'club_name=BRE',
 'club_name=BUR',
 'club_name=CAR',
 'club_name=CHE',
 'club_name=CRY',
 'club_name=EVE',
 'club_name=FUL',
 'club_name=HUD',
 'club_name=HUL',
 'club_name=LEE',
 'club_name=LEI',
 'club_name=LIV',
 'club_name=MCI',
 'club_name=MID',
 'club_name=MUN',
 'club_name=NEW',
 'club_name=NOR',
 'club_name=SHU',
 'club_name=SOU',
 'club_name=STK',
 'club_name=SUN',
 'club_name=SWA',
 'club_name=TOT',
 'club_name=WAT',
 'club_name=WBA',
 'club_name=WHU',
 'club_name=WOL',
 'creativity',
 'element',
 'form',
 'game_weather',
 'goals_conceded',
 'goals_scored',
 'ict_index',
 'influence',
 'minutes',
 'opp_team_name=Arsenal',
 'opp_team_name=Aston Villa',
 'opp_team_name=Bournemouth',
 'opp_team_name=Brentford',
 'opp_team_name=Brighton',
 'opp_team_name=Burnley',
 'opp_team_name=Cardiff',
 'opp_team_name=Chelsea',
 'opp_team_name=Crystal Palace',
 'opp_t

In [62]:
# Convert the array returned by DictVectorizer to a dataframe.
X_train_transformed = pd.DataFrame(X_train_encoded, columns=dv.feature_names_)

X_train_transformed.head()

Unnamed: 0,GW,assists,bonus,bps,clean_sheets,club_name=ARS,club_name=AVL,club_name=BHA,club_name=BOU,club_name=BRE,...,start_label,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards
0,1.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,60.0,1.0,0.0
1,9.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,3.0,0.0,-2335.0,277.0,2612.0,49.0,1.0,0.0
2,9.0,0.0,0.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,5523.0,7048.0,1525.0,60.0,1.0,1.0
3,9.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,-20024.0,8617.0,28641.0,55.0,0.0,0.0
4,9.0,0.0,0.0,12.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,2.0,164.0,337.0,173.0,47.0,1.0,1.0


In [63]:
# Check the shape of the dataframe.
X_train_transformed.shape

(36896, 96)

#### For X_test

In [64]:
X_test

year,position,assists,bonus,bps,clean_sheets,creativity,element,goals_conceded,goals_scored,ict_index,...,transfers_in,transfers_out,value,was_home,yellow_cards,GW,club_name,form,game_weather,start_label
2021,GK,0,0,18,1,0.0,559,0,0,1.0,...,114182,17707,50,True,1,17,ARS,3.552632,1,1
2021,DEF,0,0,12,0,0.0,190,7,0,1.0,...,2627,4527,45,False,0,17,LEE,2.026316,1,1
2021,FWD,1,0,10,1,32.0,6,0,0,13.6,...,55274,2983,83,True,0,17,ARS,2.368421,1,1
2021,MID,0,3,52,1,58.9,251,0,2,22.5,...,7102,4326,118,True,0,17,MCI,5.157895,1,1
2021,MID,0,1,23,0,20.2,233,1,1,10.5,...,38580,5123,131,True,0,17,LIV,6.973684,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022,MID,1,0,15,0,12.4,222,0,0,3.4,...,10125,12825,50,True,0,26,LIV,2.894737,1,1
2022,MID,1,1,31,0,44.8,222,1,0,10.0,...,10125,12825,50,True,0,26,LIV,2.894737,1,1
2022,DEF,0,0,12,0,0.8,23,1,0,2.7,...,191379,17128,53,True,0,26,ARS,3.842105,1,1
2022,GK,0,0,19,0,0.0,80,2,0,3.2,...,11550,4861,43,False,0,26,BRE,2.500000,1,1


In [65]:
# Descriptive info of categorical features.
X_test[['position', 'opp_team_name', 'club_name', 'was_home']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12298 entries, 2021 to 2022
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   position       12298 non-null  object
 1   opp_team_name  12298 non-null  object
 2   club_name      12298 non-null  object
 3   was_home       12298 non-null  bool  
dtypes: bool(1), object(3)
memory usage: 396.3+ KB


In [66]:
# Convert dataframe to a dictionary.
X_test_dict = X_test.to_dict(orient='records')

In [67]:
# Print sample observation.
X_test_dict[0]

{'position': 'GK',
 'assists': 0,
 'bonus': 0,
 'bps': 18,
 'clean_sheets': 1,
 'creativity': 0.0,
 'element': 559,
 'goals_conceded': 0,
 'goals_scored': 0,
 'ict_index': 1.0,
 'influence': 10.4,
 'minutes': 90,
 'opp_team_name': 'West Ham',
 'own_goals': 0,
 'penalties_missed': 0,
 'penalties_saved': 0,
 'red_cards': 0,
 'saves': 1,
 'selected': 1459440,
 'team_a_score': 0,
 'team_h_score': 2,
 'threat': 0.0,
 'transfers_balance': 96475,
 'transfers_in': 114182,
 'transfers_out': 17707,
 'value': 50,
 'was_home': True,
 'yellow_cards': 1,
 'GW': 17,
 'club_name': 'ARS',
 'form': 3.5526315789473686,
 'game_weather': 1,
 'start_label': 1}

In [68]:
# Reuse the vectorizer fitted on X_train (transform only, no refitting).

X_test_encoded = dv.transform(X_test_dict)

In [69]:
X_test_encoded

array([[17.,  0.,  0., ..., 50.,  1.,  1.],
       [17.,  0.,  0., ..., 45.,  0.,  0.],
       [17.,  1.,  0., ..., 83.,  1.,  0.],
       ...,
       [26.,  0.,  0., ..., 53.,  1.,  0.],
       [26.,  0.,  0., ..., 43.,  0.,  0.],
       [38.,  0.,  0., ..., 59.,  0.,  0.]])

In [70]:
# vocabulary
vocab = dv.vocabulary_

# show vocab
vocab

{'position=MID': 82,
 'assists': 1,
 'bonus': 2,
 'bps': 3,
 'clean_sheets': 4,
 'creativity': 36,
 'element': 37,
 'goals_conceded': 40,
 'goals_scored': 41,
 'ict_index': 42,
 'influence': 43,
 'minutes': 44,
 'opp_team_name=Spurs': 68,
 'own_goals': 76,
 'penalties_missed': 77,
 'penalties_saved': 78,
 'red_cards': 83,
 'saves': 84,
 'selected': 85,
 'team_a_score': 87,
 'team_h_score': 88,
 'threat': 89,
 'transfers_balance': 90,
 'transfers_in': 91,
 'transfers_out': 92,
 'value': 93,
 'was_home': 94,
 'yellow_cards': 95,
 'GW': 0,
 'club_name=EVE': 14,
 'form': 38,
 'game_weather': 39,
 'start_label': 86,
 'opp_team_name=Crystal Palace': 53,
 'club_name=LEI': 19,
 'position=DEF': 79,
 'opp_team_name=Man Utd': 62,
 'club_name=CHE': 12,
 'position=GK': 81,
 'opp_team_name=Chelsea': 52,
 'club_name=MUN': 23,
 'club_name=BOU': 8,
 'opp_team_name=Bournemouth': 47,
 'club_name=TOT': 31,
 'position=FWD': 80,
 'opp_team_name=West Brom': 73,
 'club_name=LIV': 20,
 'opp_team_name=Southampt

In [71]:
# Check feature names.
dv.feature_names_

['GW',
 'assists',
 'bonus',
 'bps',
 'clean_sheets',
 'club_name=ARS',
 'club_name=AVL',
 'club_name=BHA',
 'club_name=BOU',
 'club_name=BRE',
 'club_name=BUR',
 'club_name=CAR',
 'club_name=CHE',
 'club_name=CRY',
 'club_name=EVE',
 'club_name=FUL',
 'club_name=HUD',
 'club_name=HUL',
 'club_name=LEE',
 'club_name=LEI',
 'club_name=LIV',
 'club_name=MCI',
 'club_name=MID',
 'club_name=MUN',
 'club_name=NEW',
 'club_name=NOR',
 'club_name=SHU',
 'club_name=SOU',
 'club_name=STK',
 'club_name=SUN',
 'club_name=SWA',
 'club_name=TOT',
 'club_name=WAT',
 'club_name=WBA',
 'club_name=WHU',
 'club_name=WOL',
 'creativity',
 'element',
 'form',
 'game_weather',
 'goals_conceded',
 'goals_scored',
 'ict_index',
 'influence',
 'minutes',
 'opp_team_name=Arsenal',
 'opp_team_name=Aston Villa',
 'opp_team_name=Bournemouth',
 'opp_team_name=Brentford',
 'opp_team_name=Brighton',
 'opp_team_name=Burnley',
 'opp_team_name=Cardiff',
 'opp_team_name=Chelsea',
 'opp_team_name=Crystal Palace',
 'opp_t

In [72]:
# Convert the array returned by DictVectorizer to a dataframe.
X_test_transformed = pd.DataFrame(X_test_encoded, columns=dv.feature_names_)

X_test_transformed.head()

Unnamed: 0,GW,assists,bonus,bps,clean_sheets,club_name=ARS,club_name=AVL,club_name=BHA,club_name=BOU,club_name=BRE,...,start_label,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,value,was_home,yellow_cards
0,17.0,0.0,0.0,18.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,2.0,0.0,96475.0,114182.0,17707.0,50.0,1.0,1.0
1,17.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,7.0,0.0,-1900.0,2627.0,4527.0,45.0,0.0,0.0
2,17.0,1.0,0.0,10.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,2.0,59.0,52291.0,55274.0,2983.0,83.0,1.0,0.0
3,17.0,0.0,3.0,52.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,7.0,81.0,2776.0,7102.0,4326.0,118.0,1.0,0.0
4,17.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,3.0,54.0,33457.0,38580.0,5123.0,131.0,1.0,0.0


In [73]:
# Check the shape of the dataframe.
X_test_transformed.shape

(12298, 96)

### FEATURE SELECTION

In [74]:
# Compare the correlation of other features with the target variable.
df_corr = df_allseasons_final.corr()['total_points'].abs().sort_values(ascending=False).drop('total_points')
df_corr

bps                  0.859812
bonus                0.779137
influence            0.734977
goals_scored         0.686719
ict_index            0.621022
clean_sheets         0.486071
assists              0.422291
threat               0.404461
form                 0.321511
minutes              0.302459
creativity           0.256758
goals_conceded       0.255173
value                0.217509
selected             0.192458
transfers_in         0.147065
transfers_out        0.109964
yellow_cards         0.104482
penalties_saved      0.095853
red_cards            0.090103
saves                0.080535
transfers_balance    0.077859
own_goals            0.062311
was_home             0.059392
element              0.037572
team_a_score         0.024424
penalties_missed     0.007473
GW                   0.004295
game_weather         0.002931
team_h_score         0.001931
start_label          0.000347
Name: total_points, dtype: float64

In [75]:
# Get all the features whose absolute correlation with the target is above 0.3.
df_corr_features = df_corr[df_corr > 0.3].index.to_list()
df_corr_features

['bps',
 'bonus',
 'influence',
 'goals_scored',
 'ict_index',
 'clean_sheets',
 'assists',
 'threat',
 'form',
 'minutes']
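
The correlation filter above can be sketched on a hypothetical toy frame (the column values below are invented for illustration). The point of the example is that taking `.abs()` keeps strongly negative correlations as well as strongly positive ones, while weakly related columns fall below the cut-off.

```python
import pandas as pd

# Hypothetical toy data: 'cards' is perfectly negatively correlated with the target.
toy = pd.DataFrame({
    "total_points": [1, 2, 3, 4, 5, 6],
    "bps": [2, 4, 7, 8, 11, 12],
    "noise": [1, 3, 3, 1, 1, 3],
    "cards": [6, 5, 4, 3, 2, 1],
})

# Absolute correlation of every other column with the target.
corr_with_target = (
    toy.corr()["total_points"].abs().drop("total_points").sort_values(ascending=False)
)

# Keep only the columns above the 0.3 cut-off used in the notebook.
kept = corr_with_target[corr_with_target > 0.3].index.to_list()
print(corr_with_target)
print(kept)
```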

#### Using the Variance Inflation Factor (VIF) method to check and remove Multicollinearity

The VIF of a feature is the ratio of the variance of its coefficient in the full model to the variance of that coefficient in a model containing only that feature. Simply put, it gauges how much the variance of a feature's estimated coefficient is inflated by its correlation with the other features in the model.

A VIF of 1 indicates that the feature has no correlation with any of the other features. It is given by the equation below:

$$ VIF_i = \frac{1}{1-R_i^2} $$

Where $R_i^2$ represents the unadjusted coefficient of determination for regressing the $i^{th}$ independent variable on the remaining ones. The reciprocal of VIF is known as tolerance. Either VIF or tolerance can be used to detect multicollinearity, depending on personal preference.

If $R_i^2$ is equal to 0 (which implies VIF = 1), the $i^{th}$ independent variable cannot be predicted from the remaining independent variables. It is therefore uncorrelated with them, which means multicollinearity does not exist for that variable. In this case, the variance of the $i^{th}$ regression coefficient is not inflated.

Statsmodels provides a function called `variance_inflation_factor` (in `statsmodels.stats.outliers_influence`) for the computation.
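
To connect the formula to numbers, the following is a minimal NumPy sketch that computes $VIF_i = 1/(1-R_i^2)$ directly by regressing each column on the others. It is not the statsmodels implementation; the helper name `vif_from_r2` and the synthetic columns are assumptions made for illustration only.

```python
import numpy as np

def vif_from_r2(X, i):
    """Hypothetical helper: VIF of column i via an explicit least-squares fit."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Regress column i on the remaining columns plus an intercept.
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # Unadjusted R^2 = 1 - SS_res / SS_tot (residuals have zero mean here).
    r2 = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic data: 'c' is nearly a copy of 'a', 'b' is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X_toy = np.column_stack([a, b, c])

vifs = [vif_from_r2(X_toy, i) for i in range(X_toy.shape[1])]
print(vifs)
```

The two nearly collinear columns receive large VIF values, while the independent column stays close to 1, which matches the interpretation given above.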

In [76]:
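# Utility function to return the VIF value for each feature provided.
def compute_vif(features, df):
    """
    Returns a DataFrame containing features and their corresponding variance inflation factor.
    features: list of features whose multicollinearity is to be checked
    df: DataFrame of the data under review
    """
    # Copy so that adding the intercept column does not modify the caller's data
    # (and does not trigger a SettingWithCopyWarning).
    X = df[features].copy()
    # variance_inflation_factor does not add a constant itself, so include one here.
    X['intercept'] = 1
    # Create dataframe to store VIF values.
    vif = pd.DataFrame()
    vif['Feature'] = X.columns
    vif['Vif Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    # The intercept is only a helper column, so leave it out of the result.
    vif = vif[vif['Feature']!='intercept']
    return vif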

In [77]:
def select_features(df, threshold):
    """
    Returns two objects:
    1. a DataFrame containing features and their corresponding variance inflation factor, and
    2. a pandas Index object containing the features that remain after dropping those whose
       VIF is at or above the supplied threshold.
    df: the dataset whose multicollinearity is to be checked
    threshold: value to compare each VIF with; a feature at or above it is dropped.
    """
    data = df.copy()
    flag = True
    while flag:
        features_to_consider = data.columns
        # Call the "compute_vif" utility function to get the VIF dataframe, sorted from highest to lowest.
        sorted_vif_df = (compute_vif(features_to_consider, data) 
                         .sort_values('Vif Factor', ascending=False).reset_index().drop('index', axis=1))

        # Get the highest VIF value to compare against the threshold.
        highest_vif = sorted_vif_df.at[0, 'Vif Factor']
        
        # Compare the highest VIF with the threshold (the call in the next cell passes 10).
        if highest_vif >= threshold:
            # Select the feature corresponding to the highest VIF (row 0 after sorting).
            feature = sorted_vif_df.at[0, 'Feature'] 
            # Drop the feature and recompute the VIFs on the next pass.
            data.drop(feature, axis=1, inplace=True) 
            
        else:
            flag = False
    return sorted_vif_df, data.columns

In [78]:
# Get the VIF and the selected features based on the threshold set.
vif_df, selected_features = select_features(features[df_corr_features], 10)

In [79]:
# Print the VIF of the correlated features.
vif_df

Unnamed: 0,Feature,Vif Factor
0,ict_index,8.620392
1,influence,6.891249
2,bps,5.714201
3,threat,5.047271
4,goals_scored,3.55474
5,bonus,2.144341
6,minutes,1.740977
7,assists,1.450045
8,clean_sheets,1.397063
9,form,1.23884


In [80]:
# Display selected features based on the VIF.
selected_features

Index(['bps', 'bonus', 'influence', 'goals_scored', 'ict_index',
       'clean_sheets', 'assists', 'threat', 'form', 'minutes'],
      dtype='object')

#### Observation

The correlation filter kept only 10 features, and the VIF check (threshold of 10) dropped none of them, so most of the features that might contribute to the model were not selected. This selection will therefore be set aside, and all of the features will be used in training so that the model can capture all possible trends in the data.

#### FEATURE SCALING
NORMALIZATION AND STANDARDIZATION.

In [81]:
# Normalizing and Standardizing the train data.
min_max_scaler = MinMaxScaler()
std_scaler = StandardScaler()

# Fit the scalers on the train data and transform it.
X_train_norm = min_max_scaler.fit_transform(X_train_transformed)
X_train_std = std_scaler.fit_transform(X_train_transformed)

# Transform test data.
X_test_norm = min_max_scaler.transform(X_test_transformed)
X_test_std = std_scaler.transform(X_test_transformed)
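
As a rough illustration of the two scalings, the sketch below uses a hypothetical one-column dataset (the values and the names `train`, `test`, `mm` and `sd` are invented). Both scalers learn their statistics from the training data only, so a test value outside the training range can land outside [0, 1] after normalization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0]])  # larger than anything seen in training

# Fit on the training data only.
mm = MinMaxScaler().fit(train)
sd = StandardScaler().fit(train)

train_norm = mm.transform(train)  # rescaled to the range [0, 1]
test_norm = mm.transform(test)    # uses the training min and max
train_std = sd.transform(train)   # zero mean, unit variance
test_std = sd.transform(test)     # uses the training mean and std

print(train_norm.ravel(), test_norm.ravel())
print(train_std.ravel(), test_std.ravel())
```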

In [82]:
# Print sample of normalized data.
X_train_norm[0]

array([0.        , 0.        , 0.        , 0.18181818, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 1.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00175541, 0.19803371, 0.07260726, 0.66666667,
       0.        , 0.        , 0.02513966, 0.05012225, 0.15730337,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 1.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [83]:
# Print sample of standardized data.
X_train_std[0]

array([-1.40530271, -0.29293036, -0.34504467, -0.7149476 , -0.56533125,
       -0.24204803, -0.15649433, -0.22051911, -0.14512034, -0.06843673,
       -0.24097962, -0.03252912, -0.23813342, -0.25120008,  4.33765517,
       -0.12615456, -0.08325825, -0.05367698, -0.10587479, -0.24770616,
       -0.24665677, -0.23921177, -0.06127222, -0.24160854, -0.22870798,
       -0.12002786, -0.14453512, -0.24696577, -0.08159172, -0.03646674,
       -0.10228265, -0.24715103, -0.18710686, -0.15151448, -0.23787914,
       -0.19719274, -0.7428365 , -0.84046033, -1.43963967,  0.45334348,
       -0.92373275, -0.30244624, -0.85128769, -0.54911248, -2.03394709,
       -0.22975588, -0.1568575 , -0.2056437 , -0.06722698, -0.21339698,
       -0.22593874, -0.10574383, -0.2308651 , -0.23008257, -0.22785373,
       -0.13116155, -0.14276625, -0.08226227, -0.10639709, -0.22857672,
       -0.2292979 , -0.22897032, -0.22752448, -0.08192766, -0.21109586,
       -0.13855816, -0.14147608, -0.22877359,  4.40534671, -0.12

### MODELING

Modeling and evaluation will be carried out using the standardized and normalized X_train and X_test sets.

In [84]:
# Utility function
def evaluate_model(model, x, y):
    """
    Utility function to print the model performance (RMSE and R-Squared scores).
    model: fitted model
    x: features of the dataset to evaluate on
    y: target values of the dataset to evaluate on
    """
    predicted = model.predict(x)  # get predictions
    # Take the square root of the MSE to get the RMSE.
    rmse_score = np.sqrt(mean_squared_error(y_true=y, y_pred=predicted))
    R2_score = r2_score(y, predicted)
    
    print('RMSE:', rmse_score)
    print('R-Squared:', R2_score)
    print()
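
For reference, the two scores printed by `evaluate_model` can be worked out by hand. The sketch below does so on hypothetical predictions (the arrays are invented for illustration) using only NumPy.

```python
import numpy as np

# Hypothetical true and predicted fantasy points.
y_true = np.array([2.0, 6.0, 1.0, 9.0])
y_pred = np.array([3.0, 5.0, 1.0, 8.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean(errors ** 2))

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('RMSE:', rmse)
print('R-Squared:', r2)
```

RMSE is expressed in the same unit as the target (points), whereas R-squared is unitless and compares the model against always predicting the mean.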

In [85]:
# Create a dictionary of regressors to experiment with.
models_dict = {
    'Linear Reg': LinearRegression(),
    'DT Regressor': DecisionTreeRegressor(random_state=2),
    'RF Regressor': RandomForestRegressor(random_state=2),
    'Lasso': LassoCV(random_state=2),
    'Ridge Regressor': RidgeCV(),
    'BayesianRidge': linear_model.BayesianRidge(),
    'Gradient Boost': GradientBoostingRegressor(random_state=2),
    'SGDRegressor': SGDRegressor(random_state=0),
}

#### Modeling and Evaluation with Normalized X_train and X_test

In [86]:
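# Loop through all the regressors, fit each on the normalized train data and evaluate it.
# Note: no separate validation split is defined, so both blocks below score the same
# test set and print identical numbers.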
for key, model_norm in models_dict.items():
    model_norm.fit(X_train_norm, y_train)
    print(f'Performance of {key} on Validation and Test:')
    print('=='*24)
    print ( 'Validation set:')
    print("**"*8)
    evaluate_model(model_norm, X_test_norm, y_test)
    print ( 'Test set:')
    print("**"*8)
    evaluate_model(model_norm, X_test_norm, y_test)

Performance of Linear Reg on Validation and Test:
Validation set:
****************
RMSE: 0.775164140873671
R-Squared: 0.9346684474968867

Test set:
****************
RMSE: 0.775164140873671
R-Squared: 0.9346684474968867

Performance of DT Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.45338932495205997
R-Squared: 0.9776499647979159

Test set:
****************
RMSE: 0.45338932495205997
R-Squared: 0.9776499647979159

Performance of RF Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.3133618121356328
R-Squared: 0.9893235278716844

Test set:
****************
RMSE: 0.3133618121356328
R-Squared: 0.9893235278716844

Performance of Lasso on Validation and Test:
Validation set:
****************
RMSE: 0.7780432076594441
R-Squared: 0.9341822454234988

Test set:
****************
RMSE: 0.7780432076594441
R-Squared: 0.9341822454234988

Performance of Ridge Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.7758903498793237


#### Modeling and Evaluation with Standardized X_train and X_test

In [87]:
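# Loop through all the regressors, fit each on the standardized train data and evaluate it.
# Note: no separate validation split is defined, so both blocks below score the same
# test set and print identical numbers.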
for key, model_std in models_dict.items():
    model_std.fit(X_train_std, y_train)
    print(f'Performance of {key} on Validation and Test:')
    print('=='*24)
    print ( 'Validation set:')
    print("**"*8)
    evaluate_model(model_std, X_test_std, y_test)
    print ( 'Test set:')
    print("**"*8)
    evaluate_model(model_std, X_test_std, y_test)

Performance of Linear Reg on Validation and Test:
Validation set:
****************
RMSE: 0.7751544349706883
R-Squared: 0.934670083531649

Test set:
****************
RMSE: 0.7751544349706883
R-Squared: 0.934670083531649

Performance of DT Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.4501495026406205
R-Squared: 0.977968240615667

Test set:
****************
RMSE: 0.4501495026406205
R-Squared: 0.977968240615667

Performance of RF Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.3130967744600136
R-Squared: 0.9893415802992473

Test set:
****************
RMSE: 0.3130967744600136
R-Squared: 0.9893415802992473

Performance of Lasso on Validation and Test:
Validation set:
****************
RMSE: 0.7765334024405729
R-Squared: 0.9344374383791438

Test set:
****************
RMSE: 0.7765334024405729
R-Squared: 0.9344374383791438

Performance of Ridge Regressor on Validation and Test:
Validation set:
****************
RMSE: 0.775173927595431
R-Squ