<a href="https://colab.research.google.com/github/hashimsharkh/CS4442-FPLForecasting/blob/Hashim_DataExploratory/CS4442_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **0. Introduction**



Each year, when the Premier League kicks off, millions of players worldwide are pulled into the Fantasy Football marathon.

Whether it's casual players competing against their friends or tryhards who want to win it all, who will come out triumphant in the end?

> FPL is a game that casts you in the role of a Fantasy manager of Premier League players. 

Quoted from:

https://www.premierleague.com/news/2173986#:~:text=FPL%20is%20a%20game%20that,their%20clubs%20in%20PL%20matches.

## **1. Problem Statement**

The goal is to build a model that forecasts player performance well enough to finish in the top 1% of FPL managers worldwide.

(As the current season is ending, we will instead test the model against a previous season and hopefully reach a score that would have placed in the top 1%.)
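
Testing against a previous season amounts to a season-based holdout. The sketch below is one possible way to do it, not the notebook's actual evaluation code: `split_by_season` is a hypothetical helper, and it assumes season codes are strings such as `'2021'` that sort chronologically.

```python
import pandas as pd

def split_by_season(df: pd.DataFrame, test_season: str):
    """Hypothetical helper: train on earlier seasons, test on one held-out season."""
    # String season codes like '1920' < '2021' compare in chronological order.
    train_part = df[df["season"] < test_season].copy()
    test_part = df[df["season"] == test_season].copy()
    return train_part, test_part

# Toy data standing in for the real training frame.
toy = pd.DataFrame({
    "season": ["1920", "1920", "2021", "2021"],
    "player": ["A", "B", "A", "B"],
    "total_points": [2, 6, 1, 9],
})
train_part, test_part = split_by_season(toy, test_season="2021")
```

Splitting on the season column rather than at random keeps later gameweeks out of the training data, so the model is never fitted on matches played after the ones it is asked to predict.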

## **2. Imports and Helper Methods**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style


In [None]:
def latest_gw(season: str = '2122') -> int:
    '''
    Returns the latest gameweek recorded for a season.
    season defaults to the current season ('2122').
    Reads the global train_df, which is loaded in section 3.
    '''
    season_df = train_df[train_df['season'] == season]
    return int(season_df['gw'].max())

def sum_goals(season: str, player: str) -> int:
    '''
    Returns the number of goals a player scored in a season.
    '''
    player_stats = train_df[(train_df['season'] == season) & (train_df['player'] == player)]
    goals_scored = int(player_stats['goals_scored'].sum())
    print(player, 'scored ', goals_scored, 'until gw: ', latest_gw(season))
    return goals_scored

def points_achieved(season: str, player: str, gw: int) -> int:
    '''
    Returns the number of points a player achieved in a given gameweek.
    '''
    player_stats = train_df[(train_df['gw'] == gw) & (train_df['season'] == season) & (train_df['player'] == player)]
    # Sum the matching rows so an int is returned rather than a Series,
    # including when the player has more than one fixture in the gameweek.
    return int(player_stats['total_points'].sum())

## **3. Data Set**


Credit to [vaastav](https://github.com/vaastav/Fantasy-Premier-League) for scraping the Fantasy Premier League data.

In [None]:
# Change username to your own
# Generate a personal access token and use it in place of the password; don't grant unnecessary access
#username='hashimsharkh'
#password='password'

In [3]:
# !git clone https://hashimsharkh:{password}@github.com/hashimsharkh/CS4442-FPLForecasting.git
# !git clone https://{username}:{password}@github.com/hashimsharkh/CS4442-FPLForecasting.git
!git clone https://github.com/hashimsharkh/CS4442-FPLForecasting/


Cloning into 'CS4442-FPLForecasting'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 13 (delta 1), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (13/13), done.


In [4]:
%cd CS4442-FPLForecasting/

/content/CS4442-FPLForecasting


In [5]:
!git checkout Hashim_DataExploratory

Branch 'Hashim_DataExploratory' set up to track remote branch 'Hashim_DataExploratory' from 'origin'.
Switched to a new branch 'Hashim_DataExploratory'


In [9]:
!git branch -r

  origin/HEAD -> origin/main
  origin/Hashim_DataExploratory
  origin/main


In [6]:
%ls

data/  README.md


In [None]:
pd.options.display.max_columns = None

In [None]:
# Train df contains all historical data
train_df = pd.read_csv('data/train_fpl.csv',
                       index_col=0,
                       dtype={'season':str,
                              'squad':str,
                              'comp':str})
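
The `dtype` argument matters because the helper methods compare `season` against strings such as `'2122'`. As a small illustration on made-up CSV text (the column names here are just a minimal stand-in for the real file), without the explicit `str` dtype pandas would infer an integer column and the string comparison would match nothing:

```python
import io
import pandas as pd

# Made-up two-row CSV with an index column, a player and a season code.
csv_text = "idx,player,season\n0,A,1617\n1,B,2122\n"

# Without dtype, season is inferred as int64.
as_int = pd.read_csv(io.StringIO(csv_text), index_col=0)
# With dtype, season stays a string, as in the cell above.
as_str = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"season": str})

# Count rows whose season equals the string '2122' in each frame.
n_int = int((as_int["season"] == "2122").sum())
n_str = int((as_str["season"] == "2122").sum())
print(n_int, n_str)
```

Keeping the code as a string also avoids treating a season label as a quantity that can be added or averaged.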

In [None]:
train_df.head()

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,assists,bonus,bps,clean_sheets,creativity,goals_conceded,goals_scored,ict_index,influence,own_goals,penalties_missed,penalties_saved,red_cards,saves,selected,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season,date,squad,comp,shots_total,shots_on_target,touches,pressures,tackles,interceptions,blocks,xg,npxg,xa,sca,gca,passes_completed,passes,passes_pct,carries,dribbles_completed,dribbles,crowds
0,Aaron Cresswell,1,2,0,West Ham United,Chelsea,,,False,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,14023,1,2,0.0,0,0,0,0,2016-08-15T19:00:00Z,1617,,0.895471,2.243698,2016-08-15,,,,,,,,,,,,,,,,,,,,,1
1,Aaron Lennon,1,3,15,Everton,Tottenham Hotspur,,,True,1,0,0,6,0,0.3,0,0,0.9,8.2,0,0,0,0,0,13918,1,1,0.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,1.057509,1.43369,2016-08-13,,,,,,,,,,,,,,,,,,,,,1
2,Aaron Ramsey,1,3,60,Arsenal,Liverpool,,,True,2,0,0,5,0,4.9,3,0,3.0,2.2,0,0,0,0,0,163170,4,3,23.0,0,0,0,0,2016-08-14T15:00:00Z,1617,,1.944129,1.46586,2016-08-14,,,,,,,,,,,,,,,,,,,,,1
3,Abdoulaye Doucouré,1,3,0,Watford,Southampton,,,False,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,1051,1,1,0.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,0.7042,0.796805,2016-08-13,,,,,,,,,,,,,,,,,,,,,1
4,Abdul Rahman Baba,1,2,0,Chelsea,West Ham United,,,True,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,1243,1,2,0.0,0,0,0,0,2016-08-15T19:00:00Z,1617,,2.243698,0.895471,2016-08-15,,,,,,,,,,,,,,,,,,,,,1


In [None]:
# Determine which columns have nan values
train_df.isna().sum()

player                                            0
gw                                                0
position                                          0
minutes                                           0
team                                              0
opponent_team                                     0
relative_market_value_team                    72817
relative_market_value_opponent_team           72840
was_home                                          0
total_points                                      0
assists                                           0
bonus                                             0
bps                                               0
clean_sheets                                      0
creativity                                        0
goals_conceded                                    0
goals_scored                                      0
ict_index                                         0
influence                                         0
own_goals   
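
Raw NaN counts are easier to judge as a share of rows. For example, 72817 missing values out of the 132305 rows reported by `train_df.info()` is about 55% of `relative_market_value_team`. A minimal sketch of that calculation on toy data (the values and the choice of `xg` as the sparse column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column and one mostly missing column.
toy = pd.DataFrame({
    "total_points": [0, 1, 2, 0],
    "xg": [np.nan, np.nan, 0.3, np.nan],
})

# isna() gives booleans, so the column mean is the fraction of missing rows.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `train_df.isna().mean()`.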

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 132305 entries, 0 to 132304
Data columns (total 59 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   player                                      132305 non-null  object 
 1   gw                                          132305 non-null  int64  
 2   position                                    132305 non-null  int64  
 3   minutes                                     132305 non-null  int64  
 4   team                                        132305 non-null  object 
 5   opponent_team                               132305 non-null  object 
 6   relative_market_value_team                  59488 non-null   float64
 7   relative_market_value_opponent_team         59465 non-null   float64
 8   was_home                                    132305 non-null  bool   
 9   total_points                                132305 non-null  int64  
 

In [None]:
pd.unique(train_df.team)

array(['West Ham United', 'Everton', 'Arsenal', 'Watford', 'Chelsea',
       'Hull City', 'Middlesbrough', 'Bournemouth', 'Liverpool',
       'Sunderland', 'Leicester City', 'Manchester City', 'Southampton',
       'Tottenham Hotspur', 'Manchester United', 'Burnley',
       'Crystal Palace', 'Swansea City', 'West Bromwich Albion',
       'Stoke City', 'Huddersfield Town', 'Newcastle United',
       'Brighton and Hove Albion', 'Fulham', 'Wolverhampton Wanderers',
       'Cardiff City', 'Aston Villa', 'Norwich', 'Sheffield United',
       'Leeds', 'Brentford'], dtype=object)

In [None]:
sum_goals("2122", "Mohamed Salah")
points_achieved("2122", "Mohamed Salah", 25)

Mohamed Salah scored  19 until gw:  28


24


## **4. Baseline prediction**

In [None]:
# Add lagged total_points features (team-level and player-level) to the training set
train_df, team_lag_vars = team_lag_features(train_df, ['total_points'], ['all'])
train_df, player_lag_vars = player_lag_features(train_df, ['total_points'], ['all', 1])
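
`team_lag_features` and `player_lag_features` are not defined in this section, so their exact behaviour is not shown here. As a rough illustration of what a player-level lag feature could look like, the sketch below assumes that a lag of `1` means "points in the previous appearance" and that `'all'` means "mean over all earlier appearances". The function name, output column names and those interpretations are assumptions, not the notebook's implementation.

```python
import pandas as pd

def add_player_lag_sketch(df: pd.DataFrame, col: str = "total_points") -> pd.DataFrame:
    """Hypothetical stand-in, NOT the notebook's player_lag_features."""
    # Order each player's rows chronologically before shifting.
    out = df.sort_values(["player", "season", "gw"]).copy()
    grouped = out.groupby("player")[col]
    # Lag 1: value from the player's previous row.
    out[f"{col}_last_1"] = grouped.shift(1)
    # 'all': mean of every earlier row; shift first so the current row is excluded.
    out[f"{col}_last_all"] = grouped.transform(lambda s: s.shift(1).expanding().mean())
    return out

# Toy data: player A has three gameweeks, player B has two.
toy = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "season": ["2122"] * 5,
    "gw": [1, 2, 3, 1, 2],
    "total_points": [2, 6, 10, 1, 3],
})
lagged = add_player_lag_sketch(toy)
print(lagged)
```

Shifting before aggregating is the important design choice: each row only sees points from earlier gameweeks, so the feature does not leak the value it is meant to help predict.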