## Feature Engineering

How can we predict clean sheets and goals? (Follow up from feature_analysis)

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import random
import pingouin as pg
import glob
import re
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")

In [2]:
# Current gameweek 
gameweek = 15

## Collect available player data

In [3]:
# Initialize an empty list to store all individual, player gameweek data 
all_player_sep = []

# Loop through each gameweek
for i in range(1, gameweek + 1):  # Adjusting the range to start from 1 to gameweek
    # Read the CSV for the current gameweek
    x = pd.read_csv(rf'C:\Users\thoma\Code\Projects\Fantasy-Premier-League\Data\Players\Seperate_GW\GW_{i}.csv')
    
    # Append the current gameweek data to the list
    all_player_sep.append(x)

# Concatenate all dataframes in the list into a single dataframe
player_data = pd.concat(all_player_sep, axis=0, ignore_index=True)

# Drop unnamed column
player_data = player_data.drop(columns = ['Unnamed: 0'])

# Sort dataset correctly IMPORTANT
player_data = player_data.sort_values(by= ['Player ID','Gameweek'])

## Fixture Difficulty Rating (updated version)

In [4]:
# Read the difficulty data
difficulty = pd.read_csv(r'C:\Users\thoma\Code\Projects\Fantasy-Premier-League\Data\Fixtures\Difficulty_ratings\FD_combined\Current_FD.csv', index_col=0)

# Create a mapping dictionary
mapping = difficulty.set_index(['Opponent', 'Position'])['FD_combined'].to_dict()

# Apply the mapping to a new column in player_data
player_data['FD_combined'] = player_data.apply(
    lambda row: mapping.get((row['Opponent'], row['Position']), None), axis=1
)

## Team data

In [5]:
# Specify the path to the files
attack = glob.glob(r'C:\Users\thoma\Code\Projects\Fantasy-Premier-League\Data\Team\Seperate_GW\Attacking\*.csv')
defense = glob.glob(r'C:\Users\thoma\Code\Projects\Fantasy-Premier-League\Data\Team\Seperate_GW\Defensive\*.csv')

# Define a function to extract the week number from the filename
def extract_week_number(filename):
    match = re.search(r'GW_(\d+)', filename)
    return int(match.group(1)) if match else None

# Read each attacking file and add the 'Week' column
att_weekly_data = pd.concat(
    [pd.read_csv(file).assign(Week=extract_week_number(file)) for file in attack],
    ignore_index=True
)

# Read each defensive file and add the 'Week' column
def_weekly_data = pd.concat(
    [pd.read_csv(file).assign(Week=extract_week_number(file)) for file in defense],
    ignore_index=True
)
# Remove 'VS' team from defensive data
def_weekly_data['Team'] = def_weekly_data['Team'].str[3:]

# Choose columns data 
columns_new = ['Team','Week', 'Playing TimeMP', 'Possession','PerformanceGls','PerformanceAst','ExpectedxG','ExpectedxAG',
               'Per 90 MinutesGls','Per 90 MinutesAst','Per 90 MinutesxG','Per 90 MinutesxAG']

# Attacking data
attacking_data = pd.DataFrame(att_weekly_data[columns_new]).sort_values(by = 'Week')

# Defensive data
defensive_data = pd.DataFrame(def_weekly_data[columns_new]).sort_values(by = 'Week')

# Collect fixture list
fixtures = pd.read_csv(r'C:\Users\thoma\Code\Projects\Fantasy-Premier-League\Data\Fixtures\Schedule\Fixtures_alt_names.csv')

# Create function to collect fixture data
def fixture_data(team, fixtures, gameweek):
    
    # Create empty list of fixtures
    fix_data = []
    # Iterate over each row of the fixtures DataFrame
    for index, row in fixtures.iterrows():
        # Check if the row's team matches the input team
        if row['Team'] == team:
            # Loop through the columns corresponding to gameweeks
            for col in fixtures.columns[1:gameweek + 1]:
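                # Check the cell holds fixture information; str() guards against empty (NaN) cells.
                # The original test `'(H)' or '(A)' in ...` was always True.
                if '(H)' in str(row[col]) or '(A)' in str(row[col]):
                    fix_data.append([col, row[col]])

    # Return the collected fixture data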
    return fix_data

# Get games
games = []

# List of unique teams 
teams = attacking_data['Team'].unique()

# For all teams
for team in teams:
    # Get fixture information
    fix_data = fixture_data(team, fixtures, gameweek)  # Fetch data for the team
    for info in fix_data:
        # You can extract relevant information from 'game', like opponent, week, etc.
        games.append([info[0], team, info[1]])

# Creating DataFrame for all teams fixture list
fix = pd.DataFrame(games, columns=['Week', 'Team', 'Opponent'])

# Remove 'GW' from the 'Week' string and convert it to an integer
fix['Week'] = fix['Week'].str[2:].astype(int)

# Define columns
cols = ['Team', 'Week', 'Possession', 'PerformanceGls',
       'PerformanceAst', 'ExpectedxG', 'ExpectedxAG', 'Per 90 MinutesGls',
       'Per 90 MinutesAst', 'Per 90 MinutesxG', 'Per 90 MinutesxAG']

# Get attacking and defensive data
attacking = attacking_data[cols]
defensive = defensive_data[cols]

# Merge attacking and defensive data for each team for each gameweek
team_attack = fix.merge(attacking, on=['Week', 'Team'])
team_defense = fix.merge(defensive, on=['Week', 'Team'])

# Rename team_names to align with player_data
# Define a dictionary of old team names as keys and new names as values
name_changes = {
    "Nott'ham Forest": 'Nottingham Forest',
    'Manchester Utd': 'Man Utd',
    'Manchester City': 'Man City',
    'Newcastle Utd': 'Newcastle',
    'Leicester City': 'Leicester',
    'Ipswich Town': 'Ipswich',
    'Tottenham': 'Spurs',
    
}
# Replace the team names using the dictionary
team_attack['Team'] = team_attack['Team'].replace(name_changes)
team_defense['Team'] = team_defense['Team'].replace(name_changes)

# Rename team columns
team_defense.rename(columns=lambda col: f"{col} against", inplace=True)
team_defense.rename(columns={'Week against': 'Week', 'Team against': 'Team', 'Opponent against': 'Opponent'}, inplace=True)

# Merge the playerdata with attacking, and then defensive team information
merged_df = pd.merge(player_data, team_attack, on=['Team', 'Opponent'], how='left')
player_d = pd.merge(merged_df, team_defense, on=['Team', 'Opponent'], how='left')

# Drop unneeded columns
player_data = player_d.drop(columns = ['Week_x', 'Week_y', 'KO_time'])

# Collect columns that are averages of team performance (per_90)
Per_90 = player_data[['Player ID', 'Gameweek','Per 90 MinutesxG', 'Per 90 MinutesGls', 'Per 90 MinutesxG against','Per 90 MinutesGls against']]

# Filter on latest GW possible to get most accurate average value
Per_90 = Per_90[Per_90['Gameweek'] == 8]

# Merge the data on 'Player ID'
complete = player_data.merge(Per_90, on='Player ID', how='left', suffixes=('_postGW8', '_preGW8'))

# Fill missing per-gameweek team values with the team's per-90 averages taken from gameweek 8
complete['PerformanceGls'] = complete['PerformanceGls'].fillna(complete['Per 90 MinutesGls_preGW8'])
complete['ExpectedxG'] = complete['ExpectedxG'].fillna(complete['Per 90 MinutesxG_preGW8'])
complete['PerformanceGls against'] = complete['PerformanceGls against'].fillna(complete['Per 90 MinutesGls against_preGW8'])
complete['ExpectedxG against'] = complete['ExpectedxG against'].fillna(complete['Per 90 MinutesxG against_preGW8'])

# Rename column
complete = complete.rename(columns={'Gameweek_postGW8': 'Gameweek',
                                    'PerformanceGls': 'Team_gls',
                                    'ExpectedxG': 'TeamxG',
                                    'PerformanceGls against': 'Team_gls_against',
                                    'ExpectedxG against': 'TeamxG_against',
                                    })

columns_to_keep = ['Player ID', 'Name', 'Last_Name', 'Team', 'Position', 'Cost_Today',
       'GW Points', 'Minutes', 'Goals', 'Assists', 'Clean Sheets',
       'Goals Conceded', 'Penalties Saved', 'Penalties Missed', 'YC', 'RC',
       'Saves', 'Total Bonus Points', 'Total BPS', 'Influence', 'Creativity',
       'Threat', 'ICT Index', 'xG', 'xA', 'xGi', 'xGc', 'Transfers In GW',
       'Transfers Out GW', 'Gameweek', 'Opponent', 'FD_combined',
       'Team_gls', 'TeamxG', 'Team_gls_against', 'TeamxG_against']

# New player data with columns
player_data = complete[columns_to_keep]

# Sort dataset correctly IMPORTANT
player_data = player_data.sort_values(by= ['Player ID','Gameweek'])

# Feature_engineering

# Fixture Difficulty Difference

In [6]:
# Opponent difficulty ( = same as FD_combined)
# Plain assignment copies the column; Series.rename(inplace=True) returns None
player_data['Opponent_Difficulty'] = player_data['FD_combined']

# Initialize a list to store the results
player_difficulty = []

# Iterate through each player and consecutive gameweek
for _, row in player_data.iterrows():
    team = row['Team']  # Get the player's team
    player_position = row['Position']  # Get the player's position
    opponent_info = row['Opponent']  # Get the player's opponent info
    
    # Filter difficulty list for the player's team
    difficulty_filtered = difficulty[difficulty['Team'] == team]
    
    # Player played at home
    if "(H)" in opponent_info:
        # Filter by (A), as this is from opponent perspective
        player = difficulty_filtered[difficulty_filtered['Opponent'].str.contains(r"\(A\)")]
    # Player played away
    elif "(A)" in opponent_info:
        # Filter by (H), as this is from opponent perspective
        player = difficulty_filtered[difficulty_filtered['Opponent'].str.contains(r"\(H\)")]
    else:
        continue

    # Determine difficulty based on player position
    if player_position in ['MID', 'FWD']:
        difficulty_player = player[player['Position'] == 'DEF']
    elif player_position in ['GK', 'DEF']:
        difficulty_player = player[player['Position'] == 'FWD']
    else:
        continue  # Skip if position is not recognized

    # Collect the value that remains after filtering away irrelevant information
    score = difficulty_player['FD_combined'].sum()

    # Append the result for this player
    player_difficulty.append({
        'Player ID': row['Player ID'],
        'Opponent': opponent_info,
        'Player_Difficulty': score
    })

# Convert to DataFrame, excluding the Difficulty DataFrame
player_difficulty_summary = pd.DataFrame(player_difficulty)

# Merge back with original dataframe 
player_data = player_data.merge(player_difficulty_summary, on = ['Player ID', 'Opponent'])

# Create difficulty difference
player_data['FD_Difference'] = player_data['Player_Difficulty'] - player_data['Opponent_Difficulty']

## Feature Engineering: Rolling averages

In [7]:
number_of_games = 4  # Define the window size

# Apply rolling mean for "Form" (GW Points) (excluding current gameweek)
player_data["Form"] = (
    player_data.groupby("Player ID")["GW Points"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)

# Apply rolling mean for "Form_xG" (excluding current gameweek)
player_data["Form_player_xG"] = (
    player_data.groupby("Player ID")["xG"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)

# Apply rolling mean for "Form_xGc" (excluding current gameweek)
player_data["Form_player_xGc"] = (
    player_data.groupby("Player ID")["xGc"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)

# Apply rolling mean for "Team xG" (excluding current gameweek)
player_data["Form_Team_xG"] = (
    player_data.groupby("Player ID")["TeamxG"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)

# Apply rolling mean for "Team xG against" (excluding current gameweek)
player_data["Form_TeamxG_against"] = (
    player_data.groupby("Player ID")["TeamxG_against"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)

# Apply rolling mean for 'Team performance goals' (excluding current gameweek)
player_data["Form_Team_gls"] = (
    player_data.groupby("Player ID")["Team_gls"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)

# Apply rolling mean for 'Team performance goals against' (excluding current gameweek)
player_data["Form_Team_gls_against"] = (
    player_data.groupby("Player ID")["Team_gls_against"]
    .transform(lambda x: x.shift(1).rolling(window=number_of_games).mean().round(3))
)
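
The `shift(1)` before `rolling(...)` is what keeps the current gameweek out of its own form value. A minimal sketch on a hypothetical one-player toy frame (the values and the window of 2 are made up for brevity; the notebook uses a window of 4):

```python
import pandas as pd

# Toy data: one hypothetical player with five gameweeks of points
toy = pd.DataFrame({
    "Player ID": [1, 1, 1, 1, 1],
    "Gameweek": [1, 2, 3, 4, 5],
    "GW Points": [2, 6, 1, 9, 3],
})

window = 2  # shorter than the notebook's window of 4, to keep the example small

# shift(1) moves each score down one row, so the rolling mean at gameweek t
# only sees gameweeks t-2 and t-1, never gameweek t itself
toy["Form"] = (
    toy.groupby("Player ID")["GW Points"]
    .transform(lambda s: s.shift(1).rolling(window=window).mean())
)

print(toy)
```

The first `window` rows of each player are `NaN` because there are not yet enough previous games to average, which is also why early gameweeks drop out of later correlations.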

In [8]:
# Keep only appearances of more than 60 minutes, so every remaining row earned the 2-point appearance bonus
player_data = player_data[player_data['Minutes'] > 60].copy()

# Moderators

A moderator effect occurs when the relationship between two variables changes depending on the value of a third variable.

For example, is the relationship between Clean Sheets and Fixture Difficulty different for teams that have recently been giving away many more scoring chances than others (xG against)?

To assess a moderation effect, we simply need to create an interaction variable (which is one variable multiplied by the other).
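
As an illustration of the idea, the sketch below builds an interaction term on synthetic data and recovers its coefficient with ordinary least squares. The variables `x` and `m` are hypothetical stand-ins (not columns from `player_data`), and the true coefficients are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic predictors (hypothetical stand-ins, e.g. fixture difficulty and xG against)
x = rng.normal(size=n)
m = rng.normal(size=n)

# Outcome whose slope on x depends on m: the true interaction coefficient is 0.5
y = 1.0 + 0.8 * x + 0.3 * m + 0.5 * x * m + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, x, m, and the interaction variable x * m
X = np.column_stack([np.ones(n), x, m, x * m])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

b0, b_x, b_m, b_xm = coef
print(f"interaction coefficient: {b_xm:.3f}")
```

A coefficient on `x * m` that is clearly different from zero suggests that `m` moderates the effect of `x`.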

## Multicollinearity

This occurs when two or more predictors are highly correlated with each other. It is problematic because the model cannot distinguish the contribution of each predictor, so their coefficients become unstable and effectively interchangeable.

It makes interpreting the model much more difficult, due to increased 'noise'. 

Noise refers to irrelevant information that does not contribute to the predictive relationship.

In the context of multicollinearity, noise can arise when redundant predictors inflate the variability in coefficient estimates, making it difficult to discern the true relationship.

How to check for multicollinearity?

- Use R^2 from regressing one predictor on the others (if > 80% you're in trouble)
- Use Variance Inflation Factor (VIF) = 1 / (1 - R^2) (between 1 and 5 indicates moderate; > 5 indicates severe)
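
A minimal sketch of the VIF check, using only NumPy: each column is regressed on the remaining columns, and VIF = 1 / (1 - R^2) is computed from that fit. The helper name `vif` and the synthetic columns `a`, `b`, `c` are assumptions for illustration only.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on all other columns (with an intercept)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

Here `a` and `b` should show very large VIFs, while `c` stays close to 1.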

How to control for multicollinearity?

1) Remove one of the variables
2) Create a composite variable (using principal component analysis)
3) Centre variables: subtract the column mean from each value and store the result in a new column (standardizing additionally divides by the standard deviation)
    e.g. [player_data['GW_Points_c'] = player_data['GW Points'] - player_data['GW Points'].mean()]
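
Centring mainly helps with the correlation between a predictor and an interaction term built from it; it does not remove correlation between two genuinely related predictors. A small sketch on synthetic data (the uniform 2 to 5 range is chosen only to mimic strictly positive, rescaled features):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent, strictly positive synthetic predictors
x = rng.uniform(2, 5, size=1000)
m = rng.uniform(2, 5, size=1000)

# Correlation between x and the raw interaction x * m
raw_corr = np.corrcoef(x, x * m)[0, 1]

# Centre both predictors, then rebuild the interaction from the centred versions
x_c = x - x.mean()
m_c = m - m.mean()
centred_corr = np.corrcoef(x_c, x_c * m_c)[0, 1]

print(f"raw: {raw_corr:.2f}, centred: {centred_corr:.2f}")
```

With uncentred inputs the interaction is strongly correlated with `x`; after centring that correlation is close to zero.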

In [9]:
# Min/Max scaler 
scaler = MinMaxScaler(feature_range=(2, 5))

# Normalise FD_Difference
player_data['FD_Difference_norm'] = scaler.fit_transform(player_data['FD_Difference'].values.reshape(-1, 1))

# Normalise PlayerxG form
player_data['Form_player_xG_norm'] = scaler.fit_transform(player_data['Form_player_xG'].values.reshape(-1,1))

# Normalise TeamxG_against
player_data['Form_TeamxG_against_norm'] = scaler.fit_transform(player_data['Form_TeamxG_against'].values.reshape(-1,1))

# Normalise Form
player_data['Form_norm'] = scaler.fit_transform(player_data['Form'].values.reshape(-1,1))
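
For reference, min-max scaling to a range `[low, high]` maps the column minimum to `low` and the maximum to `high`. The sketch below is a hand-rolled NumPy version of that formula, not the scikit-learn implementation; `min_max_scale` and the sample values are hypothetical. A plausible reason for a strictly positive range such as 2 to 5 is that later products and ratios never multiply or divide by zero or flip sign, though that motivation is an assumption here.

```python
import numpy as np

def min_max_scale(values, low=2.0, high=5.0):
    """Linearly rescale values so that min -> low and max -> high; NaNs stay NaN."""
    values = np.asarray(values, dtype=float)
    v_min = np.nanmin(values)
    v_max = np.nanmax(values)
    return low + (values - v_min) * (high - low) / (v_max - v_min)

# Hypothetical difficulty differences, including a missing value
fd_diff = np.array([-1.5, 0.0, 0.5, np.nan, 2.5])
scaled = min_max_scale(fd_diff)
print(scaled)
```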

In [10]:
# Non-normalized variables

# Creating interaction variable (for defensive players - based on best variables)
player_data['FD_xGA_interaction'] = player_data['FD_Difference'] / player_data['Form_TeamxG_against']

# Creating interaction variable (for attacking players - based on best variables)
player_data['FD_Diff_xG_interaction'] = player_data['FD_Difference'] * player_data['Form_player_xG']

# Creating add on interaction variable (for attacking players - based on best variables)
player_data['Form_FD_xG_interaction'] = player_data['FD_Diff_xG_interaction'] * player_data['Form']

In [11]:
# Normalised variables

# Creating interaction variable (for defensive players - based on best variables)
player_data['FD_xGA_interaction_norm'] = player_data['FD_Difference_norm'] / player_data['Form_TeamxG_against_norm']

# Creating interaction variable (for attacking players - based on best variables)
player_data['FD_Diff_xG_interaction_norm'] = player_data['FD_Difference_norm'] * player_data['Form_player_xG_norm']

# Creating add on interaction variable (for attacking players - based on best variables)
player_data['Form_FD_xG_interaction_norm'] = player_data['FD_Diff_xG_interaction_norm'] * player_data['Form_norm']

In [12]:
# Form/Fixture Difficulty
player_data['Form_FD'] = player_data['Form'] / player_data['FD_combined']

# xG/Fixture Difficulty
player_data['Form_player_xG_FD'] = player_data['Form_player_xG'] / player_data['FD_combined']

# xGc/Fixture Difficulty
player_data['Form_player_xGc_FD'] = player_data['Form_player_xGc'] / player_data['FD_combined']

In [16]:
# Columns for correlations
columns = player_data.columns.drop(['Player ID', 'Name', 'Last_Name', 'Team', 'Position', 'Cost_Today', 'Opponent'])

# Defensive and Forward players
attackers = player_data[player_data['Position'].isin(['MID', 'FWD'])].copy()
defenders = player_data[player_data['Position'].isin(['GK', 'DEF'])].copy()

# Sort the correlation matrix for attackers by correlation with Goals
corr = attackers[columns].corr().sort_values(by='Goals', ascending=False)
corr.head(20)

Feature,GW Points,Minutes,Goals,Assists,Clean Sheets,Goals Conceded,Penalties Saved,Penalties Missed,YC,RC,...,Form_norm,FD_xGA_interaction,FD_Diff_xG_interaction,Form_FD_xG_interaction,FD_xGA_interaction_norm,FD_Diff_xG_interaction_norm,Form_FD_xG_interaction_norm,Form_FD,Form_player_xG_FD,Form_player_xGc_FD
Goals,0.865022,0.080701,1.0,0.076969,-0.015393,0.019113,,-0.004398,-0.054476,8.905301e-18,...,0.166821,0.176265,0.254421,0.256269,0.123814,0.274546,0.26829,0.211952,0.239867,0.081858
GW Points,1.0,0.094438,0.865022,0.499702,0.100158,-0.087076,,-0.035522,-0.166589,-0.06279974,...,0.157891,0.216419,0.258398,0.256778,0.167778,0.285352,0.271477,0.203576,0.217481,0.063252
Influence,0.893921,0.222226,0.845135,0.387448,0.0042,-0.006606,,-0.000451,-0.056837,-0.02306502,...,0.102363,0.210645,0.24739,0.233467,0.152337,0.228635,0.208988,0.154098,0.13119,0.095504
Total BPS,0.89962,0.189605,0.804504,0.439992,0.01334,-0.018156,,-0.033296,-0.134231,-0.04730498,...,0.085886,0.244686,0.255141,0.233593,0.193997,0.239081,0.207707,0.149177,0.122846,0.08218
Total Bonus Points,0.855956,0.122883,0.760279,0.295445,-0.008559,-0.060064,,-0.015019,-0.054499,-0.0241098,...,0.132865,0.165307,0.220539,0.231578,0.116294,0.239408,0.232702,0.170706,0.197682,0.077622
ICT Index,0.74264,0.229141,0.670868,0.361038,0.029109,-0.058284,,0.021578,-0.071971,-0.02082271,...,0.199128,0.279057,0.28979,0.269133,0.243084,0.353789,0.332208,0.255017,0.251011,0.117829
xG,0.53352,0.068224,0.625039,0.092369,-0.00188,-0.006484,,0.140369,-0.078268,-0.002361777,...,0.224597,0.238352,0.307117,0.292143,0.20575,0.394208,0.373386,0.276945,0.372398,0.116101
Threat,0.537334,0.107115,0.585909,0.118586,0.021516,-0.055015,,0.043721,-0.08133,0.006455685,...,0.26952,0.238042,0.298613,0.285812,0.213335,0.426251,0.41332,0.300434,0.401436,0.108103
xGi,0.5695,0.116145,0.571787,0.240611,-0.008479,-0.01314,,0.123505,-0.068643,-0.01533219,...,0.234673,0.292071,0.321255,0.301637,0.254514,0.408515,0.384732,0.299546,0.336258,0.127782
FD_Diff_xG_interaction_norm,0.285352,0.067064,0.274546,0.123797,-0.04161,-0.068166,,-0.020941,-0.103877,-0.02238673,...,0.502376,0.720469,0.730216,0.671239,0.646408,1.0,0.922085,0.634346,0.790014,0.244625


## Mediation (partial correlation/confounders)

Partial correlation assesses the unique relationship between two variables when controlling for another variable (also known as a mediator or confounder).

The confounding variable may suppress the interaction or real relationship between the other variables. The null hypothesis is that after controlling for the third variable (or confounder), there is no relationship between a and b.

For example, is there a relationship between coffee consumption and productivity? One potential confounder that definitely affects both of these variables is sleep quality. Less sleep means increased coffee consumption and lower productivity. Better sleep means reduced coffee consumption and higher productivity. If everyone slept the same amount, would there be a relationship (in terms of covariance/correlation) between coffee and productivity?

Equation: the relationship between a and b, while controlling for variable c.
> $$r_{ab \cdot c} = \frac{r_{ab} - r_{ac}\, r_{bc}}{\sqrt{1 - r_{ac}^{2}}\,\sqrt{1 - r_{bc}^{2}}}$$
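
As a worked example of the partial correlation equation, the sketch below plugs in made-up pairwise correlations for the coffee (a), productivity (b) and sleep (c) example. The numbers are hypothetical and chosen so that the raw coffee/productivity correlation is fully explained by sleep.

```python
import numpy as np

def partial_corr_from_r(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b, controlling for c."""
    return (r_ab - r_ac * r_bc) / (np.sqrt(1 - r_ac**2) * np.sqrt(1 - r_bc**2))

# Hypothetical correlations: coffee vs productivity, coffee vs sleep, productivity vs sleep
value = partial_corr_from_r(r_ab=-0.30, r_ac=-0.60, r_bc=0.50)
print(value)
```

Because `r_ac * r_bc` equals `r_ab` here, the numerator is zero: once sleep is held constant, no coffee/productivity relationship remains.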

Partial correlation can be computed from the residuals of two regressions.

Regress Y onto X: Y is your target variable, X is your predictor. You determine the part of Y that can be explained by X. A better model has lower residuals. Either way, the residuals (Y') represent the portion of Y that is unrelated to X.

Regress Z onto X: Z is your target variable, X is your predictor. You determine the part of Z that can be explained by X. A better model has lower residuals. Either way, the residuals (Z') represent the portion of Z that is not explained by X.

Correlation between residuals of (Y' and Z') reflects the relationship between these two without the influence of X.
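
The residual procedure can be sketched with NumPy on synthetic data, where Y and Z are related only through X. The helper `residuals` and the coefficients are assumptions for illustration; the result is compared with the closed-form partial correlation to show that the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Synthetic data: y and z both depend on x but not on each other
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)
z = -0.5 * x + rng.normal(size=n)

def residuals(target, predictor):
    """Residuals from a simple linear regression of target on predictor."""
    design = np.column_stack([np.ones(len(predictor)), predictor])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta

# Correlate the parts of y and z that x cannot explain
r_resid = np.corrcoef(residuals(y, x), residuals(z, x))[0, 1]

# Same quantity from the pairwise correlations
r_yz = np.corrcoef(y, z)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
r_zx = np.corrcoef(z, x)[0, 1]
r_formula = (r_yz - r_yx * r_zx) / np.sqrt((1 - r_yx**2) * (1 - r_zx**2))

print(f"raw: {r_yz:.3f}, partial (residuals): {r_resid:.3f}, partial (formula): {r_formula:.3f}")
```

The raw correlation between `y` and `z` is negative, while the partial correlation is near zero, which is the pattern expected when `x` is the only link between them.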

In [14]:
partial_corr_results = pg.partial_corr(attackers, x = 'GW Points', y = 'ICT Index', covar = ['Threat', 'Creativity'])
# Most of the ICT index is actually made up of 'Influence'. Controlling for Threat and Creativity has little effect on
# the direct relationship between the two. Threat and Creativity are not significant confounders
# between GW Points and ICT Index for attackers.

In [15]:
partial_corr_results = pg.partial_corr(attackers, x = 'Goals', y = 'xG', covar = ['Threat'])

# Threat represents a fairly strong confounder of the relationship between goals and xG. The correlation
# drops from 0.6 to 0.3.

## Key Conclusions

Our feature engineering process was built from our feature analysis using linear regression. We uncovered that defensive players were getting most points from clean sheets and attacking players
were getting most points for scoring goals. To keep the model simple, we split by position and created two distinct features that share a good amount of variance with goals scored and clean sheets. 

Attacking players: Predicting goals 

- Fixture_Difficulty_Difference * Form_player_xG (both normalised between 2-5) (correlation with Goals of r ≈ 0.27 in the table above, i.e. about 7.5% shared variance, since shared variance is r^2)

Defending players: Predicting clean sheets

- Fixture_Difficulty_Difference / Form_Team_xG_against (shares 35% variance)