In [None]:
import pandas as pd
import numpy as np
import get_data
import feature_engineering as fe

The first step was to load the three dataframes using the functions defined in **get_data.py**:

1. **players**: Data for each player (team, position, etc.) with some aggregate statistics (total points, total goals, etc.)
2. **gws**: Data for each player for each gameweek. It includes detailed information such as the number of goals, assists, points, etc. for a single gameweek.
3. **fixtures**: All the fixtures for the season

I then got the current gameweek number from the *gws* dataframe: it is the gameweek after the latest one recorded there.

In [None]:
players=get_data.get_player_data()
gws=get_data.get_gameweek_data()
fixtures=get_data.get_fixtures()

# Getting the current gameweek number (the latest recorded round plus one)
current_gw=gws['round'].max()+1

*players* has a lot of columns that are not available for each gameweek, so we cannot use them in making predictions. I therefore filtered the dataset to only include basic information: **id_x, first_name, second_name, team, and position**.

Similarly, for *fixtures* I kept all the previous fixtures plus the current gameweek's fixtures, and only three columns: **event, team_a (away team), team_h (home team)**.

In [None]:
# Filtering player data to only required columns
players=players[['id_x','first_name','second_name','team','position']]
# test=test[(test['id_x']==531)|(test['id_x']==390)|(test['id_x']==569)]

# Filtering the fixtures up to and including the current gameweek
fixtures=fixtures[['event','team_a','team_h']]
fixtures=fixtures[fixtures['event']<= current_gw]

I got all the home and away fixtures for each player by merging *players* with *fixtures* twice, once on the home team and once on the away team. I then concatenated the two results into a single dataframe containing every past fixture plus the fixtures for the current gameweek.

In [None]:
# Home fixtures for each player
home=pd.merge(players,fixtures,left_on='team',right_on='team_h')
# Away fixtures for each player
away=pd.merge(players,fixtures,left_on='team',right_on='team_a')
# Concatenating home and away fixtures to a single dataframe
df=pd.concat([home,away]).sort_values('event').reset_index(drop=True)
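
As a small illustration of the merge-and-concatenate step above, here is a sketch on made-up data. The `players_toy` and `fixtures_toy` frames and their values are invented for the example; only the column names follow the real dataframes.

```python
import pandas as pd

# Toy data (hypothetical): two players on teams 1 and 2,
# and two fixtures between those teams.
players_toy = pd.DataFrame({"id_x": [10, 20], "team": [1, 2]})
fixtures_toy = pd.DataFrame({"event": [1, 2], "team_a": [2, 1], "team_h": [1, 2]})

# Each merge keeps only the fixtures where the player's team
# is the home side (first merge) or the away side (second merge).
home_toy = pd.merge(players_toy, fixtures_toy, left_on="team", right_on="team_h")
away_toy = pd.merge(players_toy, fixtures_toy, left_on="team", right_on="team_a")

# Stacking the two results gives one row per player per fixture.
schedule_toy = (
    pd.concat([home_toy, away_toy])
    .sort_values(["event", "id_x"])
    .reset_index(drop=True)
)
print(schedule_toy)
```

Each player ends up with one row per gameweek, whether the match was played at home or away.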

I added a binary feature called **is_home** to denote whether the player was playing at his home stadium, and another feature called **opposition_team** that holds the team number of the opponent. I then dropped the **team_a** and **team_h** columns, as their information is already contained in the two new features.

In [None]:
# Adding a binary feature for home fixtures
df['is_home']=np.where(df['team']==df['team_h'],1,0)
# Adding a feature for the opposition team's number
df['opposition_team']=np.where(df['is_home'],df['team_a'],df['team_h'])
df=df.drop(['team_a','team_h'],axis=1)
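
To make the two new features concrete, here is a sketch on a made-up schedule for a single player. The `toy` frame and its values are invented for the example; the logic mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy schedule (hypothetical): player 10 plays for team 1,
# at home in event 1 and away in event 2, both times against team 2.
toy = pd.DataFrame({
    "id_x": [10, 10],
    "team": [1, 1],
    "event": [1, 2],
    "team_a": [2, 1],
    "team_h": [1, 2],
})

# 1 when the player's team is the home side, 0 otherwise.
toy["is_home"] = np.where(toy["team"] == toy["team_h"], 1, 0)
# At home the opponent is the away side; away, it is the home side.
toy["opposition_team"] = np.where(toy["is_home"] == 1, toy["team_a"], toy["team_h"])
toy = toy.drop(["team_a", "team_h"], axis=1)
print(toy)
```

In both rows the opponent is team 2, while **is_home** flips from 1 to 0.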

I defined a list of features from *gws* that would be useful in predicting points for the current gameweek. Since these values are not available at the time of prediction, I created features that hold their averages over all previous gameweeks, or over the previous n gameweeks.

For example, for each player I included information such as the average number of goals per game, the average number of assists per game, etc. I also wanted to take the current form of the player into account, so I included features such as the average goals in the last n games.

I added these features with the help of a function called ***get_average_stats*** located in **feature_engineering.py**

In [None]:
# List of stats from gws dataset that would be useful
cols=['minutes','goals_scored', 'assists', 'clean_sheets', 'goals_conceded',
      'yellow_cards','red_cards', 'saves', 'bonus', 'bps', 'influence', 'creativity',
      'threat', 'ict_index', 'value']
target_col=['total_points']
# Merging the current dataframe with gws to include the above information
df=pd.merge(df,gws[['element','round']+cols+target_col],
            left_on=['id_x','event'],right_on=['element','round'],
            how='left').drop(['element','round'],axis=1)
# Removing past-gameweek rows with no points. These NAs come from the left merge, e.g. new transfers who have no scores/points for gameweeks before they joined.
df=df[~((df['event']!=current_gw)&(df['total_points'].isna()))]
# Get the average stats for each player over all past gameweeks, and then over the last 2 gameweeks
# group_keys=False keeps id_x as an ordinary column instead of adding it to the index
df=df.groupby('id_x',group_keys=False).apply(lambda x: fe.get_average_stats(x,cols+target_col))
df=df.groupby('id_x',group_keys=False).apply(lambda x: fe.get_average_stats(x,cols+target_col,2))
# Drop the raw per-gameweek stats: they are not known before a gameweek is played, so they cannot be used for new predictions
df=df.drop(cols,axis=1)
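
The body of `get_average_stats` is not shown in this notebook, so the following is only a sketch of how such averages could be computed, not the actual implementation. The function name `average_stats_sketch`, the output column names, and the toy data are all assumptions made for the example. The key idea is to shift each stat by one gameweek before averaging, so that a row never sees its own result.

```python
import pandas as pd

def average_stats_sketch(player_df, stat_cols, n=None):
    """Hypothetical stand-in for fe.get_average_stats (real code not shown).

    For each stat, add its mean over earlier gameweeks only:
    all of them when n is None, otherwise the last n.
    """
    out = player_df.sort_values("event").copy()
    for col in stat_cols:
        # shift(1) excludes the current gameweek from its own average.
        past = out[col].shift(1)
        if n is None:
            out[f"avg_{col}"] = past.expanding().mean()
        else:
            out[f"avg_{col}_last_{n}"] = past.rolling(n, min_periods=1).mean()
    return out

# Toy data (hypothetical): one player, three played gameweeks,
# and event 4 as the gameweek to predict.
toy = pd.DataFrame({
    "id_x": [10, 10, 10, 10],
    "event": [1, 2, 3, 4],
    "goals_scored": [1.0, 0.0, 2.0, None],
})
toy = average_stats_sketch(toy, ["goals_scored"])
toy = average_stats_sketch(toy, ["goals_scored"], n=2)
print(toy)
```

For event 4 both averages are built from events 1 to 3 only, so the row can be used for prediction even though its own `goals_scored` is missing. The first gameweek has no history, so its averages are NaN.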

In [None]:
print("Final dataset: \n")
df.head()

In [None]:
print("List of columns:\n")
for col in df.columns:
    print(col)