## Predicting goalscoring based on previous performance

Output stats

1 - Goals

Input stats

2 - Previous year goals

3 - Previous year xG

4 - Previous year team goals

5 - Previous year team xG

6 - Player age

7 - Player position

8 - Did they change club?

9 - Did they change league (and if so, to which one)? Can we rank the leagues in some way?
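Inputs 8 and 9 are not computed anywhere yet. A minimal sketch of one way to derive them, assuming a `team` column exists (hypothetical name; `id`, `year` and `league` match the columns used later in this notebook), on made-up toy data:

```python
import pandas as pd

# Toy data for illustration only; 'team' is an assumed column name.
toy = pd.DataFrame({
    'id':     [1, 1, 1, 2, 2],
    'year':   [2014, 2015, 2016, 2014, 2015],
    'team':   ['A', 'A', 'B', 'C', 'C'],
    'league': ['X', 'X', 'Y', 'X', 'X'],
})

# Order each player's rows by season so shift(1) looks at the earlier row.
toy = toy.sort_values(['id', 'year']).reset_index(drop=True)
prev_team = toy.groupby('id')['team'].shift(1)
prev_league = toy.groupby('id')['league'].shift(1)

# A player's first recorded season has no earlier row, so it is flagged False.
toy['changed_club'] = prev_team.notna() & (toy['team'] != prev_team)
toy['changed_league'] = prev_league.notna() & (toy['league'] != prev_league)
print(toy)
```

Note that `shift(1)` compares against the player's previous recorded row, which is not necessarily the immediately preceding year if a season is missing.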


In [1]:
import numpy as np 
import pandas as pd 
import plotly
from plotly import graph_objects as GO
import os 


## Loading the data

In [None]:
#get all the files in the folder and extract the league and year from them
frames = []
for file_suffix in os.listdir('Project_data/player_stats'):
    #file names look like '<league>_<year>', e.g. 'Bundesliga_2014'
    year = file_suffix[-4:]
    league = file_suffix[:-5]
    df_temp = pd.read_csv(f'Project_data/player_stats/{file_suffix}')
    df_temp['league'] = league
    df_temp['year'] = year
    frames.append(df_temp)
#combine all seasons into one data frame with a fresh 0..n-1 index
df = pd.concat(frames, ignore_index=True)
df = df.drop(columns='Unnamed: 0')
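The slicing above relies on every file name ending in an underscore followed by a four-digit year. A small sketch of the same split done on the last underscore instead; `split_league_year` is a hypothetical helper, not part of the notebook:

```python
def split_league_year(file_name):
    """Split a '<league>_<year>' file name at its last underscore."""
    league, year = file_name.rsplit('_', 1)
    return league, year

print(split_league_year('Bundesliga_2014'))
print(split_league_year('Ligue 1_2019'))
```

Splitting on the last underscore keeps league names that contain spaces or underscores intact.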

## Data cleaning and combining

For each player, we need to append their "last year" stats.

In [30]:
#create a dictionary that maps each year to its previous year
last_year_dictionary = {
    '2015':'2014',
    '2016':'2015',
    '2017':'2016',
    '2018':'2017',
    '2019': '2018',
    '2020': '2019'
}
#Create blank columns for the previous year's stats
df[['goals_LY','npg_LY','xG_LY','npxG_LY']] = np.nan


In [236]:
#iterate over the rows of the df, and add the previous year's metrics.
ly_columns = {'goals_LY': 'goals', 'npg_LY': 'npg', 'xG_LY': 'xG', 'npxG_LY': 'npxG'}
for i in range(len(df)):
    year = df.loc[i, 'year']
    if year != '2014':
        player_id = df.loc[i, 'id']
        #rows for the same player in the previous season
        prev = df.loc[(df['id'] == player_id) & (df['year'] == last_year_dictionary.get(year))]
        #skip players without exactly one row in the previous season
        if len(prev) != 1:
            continue
        #write by column name rather than by position
        for ly_col, col in ly_columns.items():
            df.loc[i, ly_col] = float(prev[col].iloc[0])

In [228]:
#Make sure we have all the new columns as floats
df["goals_LY"] = pd.to_numeric(df["goals_LY"], downcast="float")
df["npg_LY"] = pd.to_numeric(df["npg_LY"], downcast="float")
df["xG_LY"] = pd.to_numeric(df["xG_LY"], downcast="float")
df["npxG_LY"] = pd.to_numeric(df["npxG_LY"], downcast="float")
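The row-by-row lookup can be slow on a large data frame. As an alternative, here is a hedged sketch of the same "previous year" join done with a self-merge. It runs on made-up toy data with only two stat columns, and `toy`, `prev` and `toy_ly` are illustrative names rather than part of the notebook:

```python
import pandas as pd

# Toy data standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    'id':    [1, 1, 2, 2, 3],
    'year':  ['2014', '2015', '2014', '2015', '2015'],
    'goals': [10, 12, 3, 5, 7],
    'xG':    [9.5, 11.0, 4.0, 4.5, 6.0],
})

stats = ['goals', 'xG']
# Copy the stat columns, relabel each season as the following one,
# and suffix the stat names with _LY.
prev = toy[['id', 'year'] + stats].copy()
prev['year'] = (prev['year'].astype(int) + 1).astype(str)
prev = prev.rename(columns={c: f'{c}_LY' for c in stats})
# Drop players with more than one row in a season, as the lookup is ambiguous.
prev = prev.drop_duplicates(subset=['id', 'year'], keep=False)

# Left join keeps every row of toy; players with no previous season get NaN.
toy_ly = toy.merge(prev, on=['id', 'year'], how='left')
print(toy_ly)
```

Because the `_LY` columns come out of the merge as floats with `NaN` for missing seasons, no separate numeric conversion is needed in this version.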

We have a couple of small issues.

1) The French league finished prematurely in 2019 (28 games).

2) The German league has only 34 games per season, compared to 38 for the other leagues.

To fix this, we will normalise all seasons to a 38-game season. This also helps when players transfer from the German league.
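As a quick check of the arithmetic, a stat recorded over a shorter season is scaled by 38 divided by the number of games played. The sketch below uses a hypothetical helper, `scale_to_full_season`, and made-up goal counts:

```python
FULL_SEASON = 38  # target season length in games

def scale_to_full_season(value, games_played):
    """Scale a season total to its 38-game equivalent."""
    return value * FULL_SEASON / games_played

# 17 goals in a 34-game season and 14 goals in a 28-game season
# both correspond to the same 38-game scoring rate.
print(scale_to_full_season(17, 34))  # -> 19.0
print(scale_to_full_season(14, 28))  # -> 19.0
```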

In [305]:
#Create a function which returns a normalising constant for each league season
def multiplier_code(league_year):
    #every Bundesliga season has 34 games
    if league_year.startswith('Bundesliga'):
        return 38/34
    #the 2019 Ligue 1 season was cut short at 28 games
    elif league_year == 'Ligue 1_2019':
        return 38/28
    else:
        return 1
#Create a league_year column
df['league_year'] = df['league'] + '_' + df['year']
#call the function on league_year and produce a column with norm constant
df['normalise_constant'] = df['league_year'].apply(multiplier_code)


In [307]:
normalisable_columns = ['goals','xG','assists','xA','shots','key_passes','yellow_cards', 'red_cards','npg', 'npxG', 'xGChain', 'xGBuildup','goals_LY', 'npg_LY', 'xG_LY', 'npxG_LY']
#now multiply the values in the relevant columns by the normalising constant for that league.
for col in normalisable_columns:
    df[f'{col}_norm'] = df[col] * df['normalise_constant']