### Expected Goals including player positions
A step-by-step walkthrough of building an expected goals model that uses additional information about the location of opposition players. This tutorial follows design choices similar to those of Javier Fernandez’s expected goals model in "A framework for the fine-grained evaluation of the instantaneous expected value of soccer possessions".

This tutorial trains a shallow neural network with the following features:

- ball location (x)
- binary variable signifying whether the ball was closer to the goal than the opponent’s goalkeeper
- angle between the ball and the goal
- distance between the ball and the goal
- distance between the ball and the goalkeeper along the y-axis
- distance between the ball and the goalkeeper
- number of opponent players inside the triangle formed by the ball location and the opponent’s goal posts
- number of opponent players less than 3 meters away from the ball location
- binary variable signifying whether the shot was a header
- expected goals based on distance to goal and angle between the ball and the goal
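
Several of the features above are purely geometric, so they can be illustrated on toy coordinates before touching the real data. The block below is a minimal sketch, not the code used later in this tutorial. It assumes a 105 x 68 meter pitch with the attacked goal centered at (105, 34) and a standard 7.32 meter goal width; the function names (`distance_to_goal`, `shot_angle`, `count_in_triangle`, `count_close`) are hypothetical helpers introduced only for illustration.

```python
import numpy as np

# Assumed pitch geometry in meters (the goal position is an assumption)
PITCH_LENGTH = 105.0
PITCH_WIDTH = 68.0
GOAL_WIDTH = 7.32
GOAL_X = PITCH_LENGTH
GOAL_Y = PITCH_WIDTH / 2


def distance_to_goal(x, y):
    """Euclidean distance from (x, y) to the center of the goal."""
    return float(np.hypot(GOAL_X - x, GOAL_Y - y))


def shot_angle(x, y):
    """Angle (radians) subtended by the two goal posts at (x, y)."""
    dx = GOAL_X - x
    dy = y - GOAL_Y
    # arctan2 keeps the result in [0, pi] without a manual sign correction
    return float(np.arctan2(GOAL_WIDTH * dx,
                            dx**2 + dy**2 - (GOAL_WIDTH / 2) ** 2))


def _cross(o, a, b):
    """z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_shot_triangle(ball, player):
    """True if player lies inside the triangle ball / post / post."""
    post_low = (GOAL_X, GOAL_Y - GOAL_WIDTH / 2)
    post_high = (GOAL_X, GOAL_Y + GOAL_WIDTH / 2)
    signs = [_cross(ball, post_low, player),
             _cross(post_low, post_high, player),
             _cross(post_high, ball, player)]
    # Inside (or on an edge) when the three signs do not disagree
    return not (min(signs) < 0 and max(signs) > 0)


def count_in_triangle(ball, opponents):
    """Number of opponents inside the shot triangle."""
    return int(sum(in_shot_triangle(ball, p) for p in opponents))


def count_close(ball, opponents, radius=3.0):
    """Number of opponents closer than `radius` meters to the ball."""
    return int(sum(np.hypot(p[0] - ball[0], p[1] - ball[1]) < radius
                   for p in opponents))


# Toy example: a shot from the penalty spot with three made-up opponents
ball = (94.0, 34.0)
opponents = [(100.0, 34.0), (100.0, 50.0), (95.0, 35.0)]
print(distance_to_goal(*ball), shot_angle(*ball))
print(count_in_triangle(ball, opponents), count_close(ball, opponents))
```

The point-in-triangle test uses the signs of three cross products, which avoids computing any angles; other approaches, such as barycentric coordinates, would work equally well.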

In [2]:
#importing necessary libraries
from mplsoccer import Sbopen
import pandas as pd
import numpy as np
import warnings
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import os
import random as rn
import tensorflow as tf
#warnings not visible on the course webpage
pd.options.mode.chained_assignment = None
warnings.filterwarnings('ignore')



In [3]:
#setting random seeds so that the results are reproducible on the webpage

os.environ['PYTHONHASHSEED'] = '0'
os.environ['CUDA_VISIBLE_DEVICES'] = ''
np.random.seed(1)
rn.seed(1)
tf.random.set_seed(1)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

### Opening data
This tutorial uses StatsBomb Indian Super League 2021/2022 data since it is the only openly available dataset that **contains both event and tracking data** for an entire season. Open each game and store the data for the whole season in the dataframes shot_df and track_df. Then convert yards to meters. Finally, keep only open-play shots and remove shots for which the goalkeeper was not tracked.
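
The conversion and the two filters described above can be tried on a tiny hand-made example first. The following is a sketch under stated assumptions: the toy rows are invented, the column names mirror the ones used in the cells below (`id`, `sub_type_name`, `teammate`, `position_name`), and `to_meters` is a hypothetical helper that rescales 120 x 80 coordinates to a 105 x 68 meter pitch.

```python
import pandas as pd

# Invented toy shots: "a" and "c" are open play, "b" is a penalty
shots = pd.DataFrame({
    "id": ["a", "b", "c"],
    "sub_type_name": ["Open Play", "Penalty", "Open Play"],
    "x": [108.0, 108.0, 96.0],
    "y": [40.0, 40.0, 30.0],
})

# Invented toy tracking rows: the opposing goalkeeper is missing for shot "c"
tracks = pd.DataFrame({
    "id": ["a", "a", "b", "c"],
    "teammate": [False, True, False, False],
    "position_name": ["Goalkeeper", "Center Forward",
                      "Goalkeeper", "Center Back"],
})


def to_meters(df, length=105.0, width=68.0):
    """Return a copy of df with x, y rescaled from 120x80 units to meters."""
    out = df.copy()
    out["x"] = out["x"] * length / 120
    out["y"] = out["y"] * width / 80
    return out


shots_m = to_meters(shots)

# Keep open-play shots only
open_play = shots_m.loc[shots_m["sub_type_name"] == "Open Play"]

# Ids of shots for which an opposing goalkeeper appears in the tracking rows
gk_mask = (~tracks["teammate"]) & (tracks["position_name"] == "Goalkeeper")
gk_ids = tracks.loc[gk_mask, "id"].unique()

# Keep only shots with a tracked opposing goalkeeper
kept = open_play.loc[open_play["id"].isin(gk_ids)]
print(kept)
```

Only shot "a" survives both filters here: "b" is not from open play, and "c" has no opposing goalkeeper among its tracked players.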

In [5]:
parser = Sbopen()
#get list of games during Indian Super League season
df_match = parser.match(competition_id=1238, season_id=108)
matches = df_match.match_id.unique()
matches

array([3827767, 3827335, 3827336, 3827338, 3827337, 3817856, 3817891,
       3817870, 3817899, 3817866, 3817889, 3817887, 3817872, 3817900,
       3817855, 3817863, 3817877, 3817901, 3817880, 3817902, 3817862,
       3817873, 3817869, 3817881, 3817895, 3817884, 3817896, 3817852,
       3817883, 3817886, 3817864, 3817882, 3817894, 3817871, 3817861,
       3817888, 3817875, 3817868, 3817867, 3813313, 3817890, 3813302,
       3817857, 3817898, 3817893, 3817854, 3817892, 3817859, 3817858,
       3817897, 3817885, 3817878, 3817876, 3813305, 3813266, 3813271,
       3813311, 3813303, 3813304, 3813306, 3813283, 3813275, 3813278,
       3813295, 3813296, 3813282, 3813269, 3817879, 3817874, 3817865,
       3817860, 3817853, 3817851, 3817850, 3817849, 3817848, 3813318,
       3813287, 3813270, 3813307, 3813279, 3813281, 3813274, 3813285,
       3813317, 3813316, 3813315, 3813314, 3813312, 3813310, 3813309,
       3813308, 3813301, 3813300, 3813299, 3813298, 3813297, 3813294,
       3813293, 3813

In [6]:


shot_df = pd.DataFrame()
track_df = pd.DataFrame()
#store data in one dataframe
for match in matches:
    #parse the match once and reuse the result
    match_data = parser.event(match)
    #open events
    df_event = match_data[0]
    #open 360 data
    df_track = match_data[2]
    #get shots (copy so the rescaling below does not write to a slice of df_event)
    shots = df_event.loc[df_event["type_name"] == "Shot"].copy()
    #rescale coordinates from StatsBomb units (120x80) to meters (105x68)
    shots["x"] = shots["x"]*105/120
    shots["y"] = shots["y"]*68/80
    df_track["x"] = df_track["x"]*105/120
    df_track["y"] = df_track["y"]*68/80
    #append events and tracking data to the season dataframes
    shot_df = pd.concat([shot_df, shots], ignore_index = True)
    track_df = pd.concat([track_df, df_track], ignore_index = True)

#reset indices
shot_df.reset_index(drop=True, inplace=True)
track_df.reset_index(drop=True, inplace=True)
#filter out non open-play shots
shot_df = shot_df.loc[shot_df["sub_type_name"] == "Open Play"]
#filter out shots where the opposing goalkeeper was not tracked
gk_mask = (track_df["teammate"] == False) & (track_df["position_name"] == "Goalkeeper")
gks_tracked = track_df.loc[gk_mask, "id"].unique()
shot_df = shot_df.loc[shot_df["id"].isin(gks_tracked)]

In [8]:
shot_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2875 entries, 0 to 3094
Data columns (total 87 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              2875 non-null   object 
 1   index                           2875 non-null   int64  
 2   period                          2875 non-null   int64  
 3   timestamp                       2875 non-null   object 
 4   minute                          2875 non-null   int64  
 5   second                          2875 non-null   int64  
 6   possession                      2875 non-null   int64  
 7   duration                        2875 non-null   float64
 8   match_id                        2875 non-null   int64  
 9   type_id                         2875 non-null   int64  
 10  type_name                       2875 non-null   object 
 11  possession_team_id              2875 non-null   int64  
 12  possession_team_name            2875 no