## Correlation, partial correlation and multicollinearity

## Multicollinearity

This occurs when two or more predictors share more than 80% of their variance with each other.
It can be indicated by an $r^2$ value above 0.8 between two predictors, meaning one can be predicted from the other to a substantial degree.
This is problematic because the model parameters (b) become interchangeable, and therefore unreliable: the mathematical techniques cannot discriminate between the contributions of each predictor.
Another test is the Variance Inflation Factor, $\mathrm{VIF} = 1 / (1 - r^2)$, where $r^2$ comes from regressing one predictor on the others.
A VIF greater than 5 indicates moderate multicollinearity, and a VIF over 10 indicates severe multicollinearity; for example, $r^2 = 0.8$ gives a VIF of 5.
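
As a rough illustration of the VIF formula, the sketch below computes a VIF for every column of a small synthetic dataset. It relies on the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The helper name `vif` and the columns `x1`, `x2`, `x3` are made up for this example and are not part of the player data.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix between the predictor columns
    corr = np.corrcoef(X, rowvar=False)
    # The diagonal of the inverse correlation matrix equals 1 / (1 - r^2),
    # where r^2 comes from regressing each column on all the others
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors (assumed data): x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

Here `x1` and `x2` should land well above the severe threshold of 10, while `x3` stays close to 1. Columns with no variation would need to be dropped first, because their correlations are undefined.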

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from imblearn.over_sampling import SMOTE
from scipy.stats import boxcox, zscore

In [2]:
# Global parameters

# Current gameweek 
gameweek = 11

# Number of gameweeks to calculate rolling averages off 
rolling_number = 3

## Collect available player data

In [3]:
# Initialize an empty list to store all individual, player gameweek data 
all_player_sep = []

# Loop through each gameweek
for i in range(1, gameweek + 1):  # Adjusting the range to start from 1 to gameweek
    # Read the CSV for the current gameweek
    x = pd.read_csv(rf'C:\Users\thoma\Code\Projects\Fantasy-Premier-League\Data\Players\Seperate_GW\GW_{i}.csv')
    
    # Append the current gameweek data to the list
    all_player_sep.append(x)

# Concatenate all dataframes in the list into a single dataframe
player_data = pd.concat(all_player_sep, axis=0, ignore_index=True)

# Drop unnamed column
player_data = player_data.drop(columns = ['Unnamed: 0'])

In [4]:
# Remove players who played fewer than 61 minutes in a game (i.e. they do not receive their 2-point minimum for playing this amount)
player_data = player_data[player_data['Minutes'] > 60].copy()

In [5]:
# Filter by Goalkeepers, Defenders, Midfielders, and Forwards
final_data_mids = player_data[player_data['Position'] == 'MID'].copy()
final_data_defs = player_data[player_data['Position'] == 'DEF'].copy()
final_data_gks = player_data[player_data['Position'] == 'GK'].copy()
final_data_fwds = player_data[player_data['Position'] == 'FWD'].copy()

In [6]:
## Assess sample size of each category
print('GK:', final_data_gks.shape)
print('DEF:', final_data_defs.shape)
print('MID:', final_data_mids.shape)
print('FWD:', final_data_fwds.shape)

GK: (219, 33)
DEF: (859, 33)
MID: (952, 33)
FWD: (218, 33)


In [None]:
# Define correlation columns
correlations = ['GW Points','Influence', 'Minutes', 'Goals', 'Assists', 'Clean Sheets',
       'Goals Conceded', 'Penalties Saved', 'Penalties Missed', 'YC', 'RC',
       'Saves', 'Total Bonus Points', 'Total BPS', 'Creativity',
       'Threat', 'ICT Index', 'xG', 'xA', 'xGi', 'xGc', 'Transfers In GW',
       'Transfers Out GW', 'Gameweek','Difficulty']

# Sort by correlations
final_data_defs[correlations].corr().sort_values(by = 'GW Points', ascending = False).head(40)

Unnamed: 0,GW Points,Influence,Minutes,Goals,Assists,Clean Sheets,Goals Conceded,Penalties Saved,Penalties Missed,YC,...,Threat,ICT Index,xG,xA,xGi,xGc,Transfers In GW,Transfers Out GW,Gameweek,Difficulty
GW Points,1.0,0.420141,0.02064,0.462684,0.266254,0.755845,-0.609091,,,-0.206265,...,0.21702,0.365922,0.137406,0.142875,0.196793,-0.309598,0.02217,0.05295,0.007484,-0.119303
Total BPS,0.893727,0.398072,0.072532,0.262583,0.216237,0.741986,-0.703661,,,-0.212979,...,0.142662,0.360338,0.037567,0.200016,0.166673,-0.383752,0.017004,0.084932,0.02437,-0.151602
Clean Sheets,0.755845,-0.022733,-0.044252,-0.028743,-0.027724,1.0,-0.647349,,,-0.05149,...,-0.033375,-0.035701,-0.021674,0.014898,-0.00479,-0.28979,-0.027069,0.029328,-0.007676,-0.081007
Total Bonus Points,0.655749,0.303869,0.041201,0.232771,0.163836,0.422636,-0.306072,,,-0.081766,...,0.095891,0.27987,0.0256,0.179169,0.143641,-0.181406,0.01915,0.040892,0.000757,-0.043948
Goals,0.462684,0.649023,0.051358,1.0,0.000508,-0.028743,0.073759,,,-0.030757,...,0.4782,0.547832,0.317313,0.024383,0.240183,0.044173,0.032899,0.048824,0.031972,-0.002215
Influence,0.420141,1.0,0.271489,0.649023,0.295749,-0.022733,0.073985,,,-0.11328,...,0.391974,0.767385,0.224536,0.159587,0.269769,0.093942,0.062577,0.044764,0.061994,-0.050364
ICT Index,0.365922,0.767385,0.179363,0.547832,0.318232,-0.035701,0.063038,,,-0.08476,...,0.670927,1.0,0.413644,0.448885,0.60559,0.000107,0.145881,0.117106,0.06985,-0.088125
Assists,0.266254,0.295749,0.017199,0.000508,1.0,-0.027724,0.068396,,,0.005547,...,0.041435,0.318232,0.056094,0.310502,0.257177,0.01914,0.069466,-0.03418,-0.005669,-0.083098
Threat,0.21702,0.391974,0.124319,0.4782,0.041435,-0.033375,0.051837,,,-0.024287,...,1.0,0.670927,0.659716,0.111463,0.541971,-0.019089,0.132579,0.069357,0.010857,-0.050261
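
To connect a correlation matrix like this one to the 80% shared-variance rule, one option is to square the pairwise correlations and list every pair above the threshold. The sketch below does this on synthetic data; the helper `high_r2_pairs` and the columns `a`, `b`, `c` are hypothetical and stand in for the real predictors.

```python
import numpy as np
import pandas as pd

def high_r2_pairs(df, threshold=0.8):
    """List column pairs whose squared correlation exceeds the threshold."""
    r2 = df.corr() ** 2
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (a column with itself) is excluded
    upper = np.triu(np.ones(r2.shape, dtype=bool), k=1)
    pairs = r2.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data (assumed): b is almost a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=300),
    "c": rng.normal(size=300),
})

flagged = high_r2_pairs(demo)
print(flagged)
```

Applied to `final_data_defs[correlations]`, the same helper would work on squared values of the matrix above. For instance, the Influence and ICT Index correlation of 0.767 squares to roughly 0.59, which is below the 0.8 threshold.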


From our linear model, we have established where each position's points are coming from:
GK points are coming from:
DEF points are coming from:
MID points are coming from:
FWD points are coming from:

Variables we are interested in doing further analysis on: 

ICT index
Total BPS
Difficulty
Influence
Creativity
Threat
xGi
xG

because these correlate strongly with the 'points' variables. They act as newer metrics that can give us greater ownership over which players



In [None]:
# Key variables we are