Goals

Be able to access all data in S3 for a given season:
- events table
- player and team match stats
- lineups and missing players table
- odds table


Our end goal is to have player and team tables for the season, which will make building features straightforward.
That means a player table covering every performance in the league, with VAEP, xG, rest days (which must account for European fixtures), and travel distance (which requires manually collecting stadium coordinates).
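
A minimal sketch of how the rest-days and travel-distance features mentioned above could be computed. The helper names `haversine_km` and `rest_days`, the stadium coordinates, and the fixture dates are all illustrative assumptions, not part of the S3 data or `utils`.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def rest_days(match_dates):
    """Map each match date to the days since the previous match (None for the first).

    The dates should cover every competition (league, cups, Europe);
    otherwise rest is overstated.
    """
    ordered = sorted(match_dates)
    gaps = {ordered[0]: None} if ordered else {}
    for prev, cur in zip(ordered, ordered[1:]):
        gaps[cur] = (cur - prev).days
    return gaps


# Assumed, approximate coordinates for two hypothetical stadiums.
stadiums = {
    "Home": (51.555, -0.108),
    "Away": (53.463, -2.291),
}
trip_km = haversine_km(*stadiums["Home"], *stadiums["Away"])

# Hypothetical fixture list mixing league and European dates for one team.
fixtures = [date(2022, 10, 1), date(2022, 10, 6), date(2022, 10, 9)]
rest = rest_days(fixtures)
```

Straight-line distance is only a proxy for travel burden, but it needs nothing beyond one coordinate pair per stadium.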

then wrangle event data to get 

In [1]:
import boto3
from dotenv import load_dotenv
import os
import warnings
from io import StringIO
import pandas as pd
import socceraction.spadl as spadl
from tqdm import tqdm
import numpy as np
import xgboost

import sys
sys.path.append('..')
import utils
tqdm.pandas()

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
warnings.filterwarnings('ignore')

load_dotenv()
aws_access_key = os.getenv('AWS_ACCESS_KEY')
aws_secret_access = os.getenv('AWS_SECRET_ACCESS')
aws_region = os.getenv('AWS_REGION')

s3 = boto3.client('s3',
                aws_access_key_id=aws_access_key,
                aws_secret_access_key=aws_secret_access,
                region_name=aws_region)

bucket = 'footballbets'
league = "ENG-Premier League"
season = 2223

In [17]:
# Build the S3 keys from the league/season set above instead of hardcoding them
spadl_e = s3.get_object(Bucket=bucket, Key=f'{league}/{season}/events_spadl.csv')
spadldf = pd.read_csv(StringIO(spadl_e['Body'].read().decode('utf-8')))
spadldf = spadl.add_names(spadldf)

scheduler = s3.get_object(Bucket=bucket, Key=f'{league}/{season}/schedule.csv')
schedule = pd.read_csv(StringIO(scheduler['Body'].read().decode('utf-8')))

# Attach the fixture name and home team to every event, matching on ws_game_id
spadldf = spadldf.merge(
    schedule[['game', 'home_team_id', 'ws_game_id']].rename(columns={'game': 'fixture'}),
    how='left', left_on='game_id', right_on='ws_game_id',
)



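# Previous/next event context; shift only the column needed, not the whole frame.
# Note: these shifts are season-wide, so they cross game boundaries.
spadldf['prevEvent'] = spadldf['type_name'].shift(1, fill_value=0)
spadldf['nextEvent'] = spadldf['type_name'].shift(-1, fill_value=0)
spadldf['nextTeamId'] = spadldf['team_id'].shift(-1, fill_value=0)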
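
Because the shifts above run over the whole season, the first event of one game gets its "previous" event from the last event of another game. If that is not intended, one option is to shift within each game. The following is a sketch on a toy frame (the `events` data is made up; only the column names mirror `spadldf`), and it leaves boundary rows as NaN rather than 0:

```python
import pandas as pd

# Toy stand-in for spadldf: two games, rows already in event order.
events = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2],
    "type_name": ["pass", "dribble", "shot", "pass", "cross"],
    "team_id": [10, 10, 10, 20, 30],
})

# Shifting per game keeps context from leaking across fixtures.
by_game = events.groupby("game_id", sort=False)
events["prevEvent"] = by_game["type_name"].shift(1)
events["nextEvent"] = by_game["type_name"].shift(-1)
events["nextTeamId"] = by_game["team_id"].shift(-1)
```

Whether this matters depends on how `utils.get_season_possessions` uses these columns, which is not shown here.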

## Possession Sequences

In [18]:
spadldf = utils.get_season_possessions(spadldf)

100%|██████████| 380/380 [01:05<00:00,  5.78it/s]


## xG

In [19]:
from xg import xG
xgm = xG(spadldf)
spadldf['xG'] = xgm.get_xg()

Calculating play types: 100%|██████████| 112764/112764 [01:29<00:00, 1263.95it/s]


In [20]:
from sklearn.metrics import r2_score
# Select shots once and reuse the mask for both predictions and labels
is_shot = spadldf['type_name'].isin(['shot', 'shot_freekick', 'shot_penalty'])
xG_ser = spadldf.loc[is_shot, 'xG']
ground_truth = spadldf.loc[is_shot, 'result_id']  # SPADL result_id: 1 = success (goal)

r2_score(ground_truth, xG_ser)

0.15537494505886906
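
R² is an unusual summary for probabilities scored against a 0/1 outcome. Brier score and log loss are more common choices for judging an xG model. The following is a dependency-free sketch of both; the function names and the four example shots are illustrative assumptions, not output from the model above.

```python
from math import log


def brier_score(outcomes, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, probs)) / len(outcomes)


def log_loss(outcomes, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(outcomes)


# Hypothetical shots: 1 = goal, 0 = no goal, with made-up xG values.
goals = [1, 0, 0, 1]
xg = [0.8, 0.1, 0.3, 0.4]

brier = brier_score(goals, xg)
ll = log_loss(goals, xg)
```

Lower is better for both metrics. On the real data the labels would need to be binary, for example `(ground_truth == 1).astype(int)`.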

In [21]:
spadldf[['player', 'xG']].groupby('player').sum().sort_values('xG', ascending=False).head(10)

player,xG
Erling Haaland,23.214673
Mohamed Salah,19.236746
Harry Kane,18.598679
Ivan Toney,17.722215
Callum Wilson,14.250796
Aleksandar Mitrovic,13.661042
Ollie Watkins,13.387657
Gabriel Jesus,12.715813
Marcus Rashford,12.38376
Dominic Solanke,11.578854


## VAEP

In [26]:
spadldf = pd.concat([spadldf, utils.get_vaep(spadldf)], axis=1)

Calculating Features for VAEP


Training Model for VAEP


In [27]:
spadldf[['player', 'vaep_value']].groupby('player').sum().sort_values('vaep_value', ascending=False).head(10)

player,vaep_value
Kieran Trippier,17.933631
Martin Ødegaard,16.865478
Harry Kane,15.117221
Kevin De Bruyne,14.192955
Trent Alexander-Arnold,13.158614
Erling Haaland,12.971816
Mathias Jensen,12.927472
Gabriel Martinelli,11.260491
Marcus Rashford,11.15064
James Ward-Prowse,10.736066
