## STAT 430 Final Project

- Author: Lucas Nelson
- Date: TBD

In this notebook, we design a playing style vector for each player from their on-ball actions, given enough data. The analysis focuses specifically on the 2003-2004 season of Arsenal FC, an English club that competes in the Premier League. Data is provided by StatsBomb OpenAccess.

---

### 00. Imported Libraries

In [7]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.patheffects as path_effects
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from scipy.ndimage import gaussian_filter

from statsbombpy import sb
from statsbombpy.api_client import NoAuthWarning
from mplsoccer import Pitch, VerticalPitch, FontManager
from mplsoccer.statsbomb import read_event, EVENT_SLUG

from warnings import filterwarnings
filterwarnings('ignore', category=NoAuthWarning)

from sklearn.decomposition import NMF
from sklearn.manifold import TSNE

In [8]:
# sb.competitions()

---

### 01. Filtering Play-by-Play Data per Match

In [25]:
invincibles_df = sb.matches(competition_id=2, season_id=44)
invincibles_df = invincibles_df.sort_values('match_date').reset_index(drop=True)

In [26]:
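# join a coordinate list such as [x, y] into the string 'x,y' so it can be split into columns;
# missing locations arrive as NaN (a float), which is not iterable, so pass them through as NaN
list_to_string = lambda x: ','.join(str(i) for i in x) if isinstance(x, (list, tuple)) else np.nan

# keep only Arsenal's offensive on-ball events and flatten their coordinates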
def preprocessing_events_df(
    events_df,
    o_cols=['player', 'location', 'position', 'type', 'pass_end_location',
        'shot_outcome', 'dribble_outcome', 'pass_cross', 'shot_statsbomb_xg'],
    o_attrs=['Pass', 'Shot', 'Dribble', 'Cross']
    ):
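    '''
    Return a dataframe of Arsenal's offensive on-ball events, restricted to
    the columns in `o_cols` and the event types in `o_attrs`

    > events_df: play-by-play dataframe of team formations,
                 match start/finish, and on-ball actions
    > o_cols: event columns to keep
    > o_attrs: event types to keep (crosses are relabeled below
               from passes flagged by `pass_cross`)
    '''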

    # events from specific match with valid on-ball player data
    nonempty_df = events_df[(events_df['player_id'].notna()) & (events_df['team'] == 'Arsenal')][o_cols]

    # select specific offensive actions (types)
    nonempty_df = nonempty_df[nonempty_df['type'].isin(o_attrs)]

    # split x,y coordinates
    nonempty_df = pd.merge(
        nonempty_df,
        nonempty_df['location'].apply(list_to_string).str.split(',', expand=True),
        left_index=True, right_index=True, how='outer'
        )
    nonempty_df.rename(columns={0:'location_x', 1:'location_y'}, inplace=True)

    nonempty_df = pd.merge(
        nonempty_df,
        nonempty_df[nonempty_df['type'] == 'Pass']['pass_end_location'].apply(list_to_string).str.split(',', expand=True),
        left_index=True, right_index=True, how='outer'
        )
    nonempty_df.rename(columns={0:'pass_end_x', 1:'pass_end_y'}, inplace=True)

    # update type column to include crosses
    nonempty_df['type'] = np.where(nonempty_df['pass_cross'] == 1, 'Cross', nonempty_df['type'])

    # return dataframe with desired events
    return nonempty_df.drop(columns=['location', 'pass_end_location'])
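
The coordinate-splitting step above can be tried in isolation. The sketch below uses a small made-up events table (the rows are illustrative, not StatsBomb data) and a hypothetical helper `split_xy` that is not part of this notebook. It expands a list-valued column into two float columns and leaves NaN where the location is missing.

```python
import numpy as np
import pandas as pd

def split_xy(series, x_name, y_name):
    """Expand a Series of [x, y] lists into two float columns (hypothetical helper)."""
    # take the first two coordinates when the entry is a list, otherwise NaN
    xs = series.apply(lambda loc: loc[0] if isinstance(loc, (list, tuple)) else np.nan)
    ys = series.apply(lambda loc: loc[1] if isinstance(loc, (list, tuple)) else np.nan)
    return pd.DataFrame({x_name: xs.astype('float64'), y_name: ys.astype('float64')})

# toy events: the second row has no recorded location
toy_events = pd.DataFrame({
    'type': ['Pass', 'Shot', 'Dribble'],
    'location': [[61.0, 40.5], np.nan, [102.3, 33.0]],
})

toy_xy = toy_events.join(split_xy(toy_events['location'], 'location_x', 'location_y'))
print(toy_xy[['type', 'location_x', 'location_y']])
```

Indexing the list directly avoids the round trip through strings, so the new columns are already numeric.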

We'll store the results in a dataframe that contains all on-ball, offensive-oriented actions performed by Arsenal players in the 33 (of 38) matches provided in this database.

In [27]:
master_df = pd.concat([
    preprocessing_events_df(sb.events(match_id=idx))
    for idx in invincibles_df['match_id']
]).reset_index(drop=True)

In [None]:
# convert the split coordinate strings to floats (missing values become NaN)
coord_cols = ['location_x', 'location_y', 'pass_end_x', 'pass_end_y']
master_df[coord_cols] = master_df[coord_cols].apply(pd.to_numeric, errors='coerce')

To create our player vectors, we first separate the actions by `player`, to distinguish who did what, and by `type`, to learn how often (and, later on, where on the pitch) each player performs an action. The groups are saved in a nested dictionary, keyed by player and then by action type, to simplify looping later on.

In [None]:
grouped_df = master_df.groupby(['player', 'type'])

In [19]:
player_dict = {player : dict() for player in master_df['player'].unique()}

for player_type, type_df in grouped_df:
    player_dict[player_type[0]][player_type[1]] = type_df
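
To see how often each player performs each action, the grouped data can also be summarized as a count table. This is a minimal sketch on made-up rows (the player labels and counts are illustrative only), assuming the same `player` and `type` columns as `master_df`:

```python
import pandas as pd

# toy stand-in for master_df with only the two grouping columns
toy_df = pd.DataFrame({
    'player': ['A', 'A', 'A', 'B', 'B'],
    'type':   ['Pass', 'Pass', 'Shot', 'Pass', 'Dribble'],
})

# rows are players, columns are action types, values are action counts
action_counts = toy_df.groupby(['player', 'type']).size().unstack(fill_value=0)
print(action_counts)

# the same nested-dictionary layout as player_dict, built from the groups
toy_dict = {player: {} for player in toy_df['player'].unique()}
for (player, action), group in toy_df.groupby(['player', 'type']):
    toy_dict[player][action] = group
```

Players who never perform an action type simply have no key for it in the nested dictionary, while the count table shows a zero.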

In [22]:
player_dict

{'Sylvain Wiltord': {'Cross':                 player              position   type shot_outcome  \
  10     Sylvain Wiltord  Right Center Forward  Cross          NaN   
  75     Sylvain Wiltord  Right Center Forward  Cross          NaN   
  1472   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  1563   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  2149   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  2863   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  2907   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  3809   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  4133   Sylvain Wiltord  Right Center Forward  Cross          NaN   
  4864   Sylvain Wiltord        Right Midfield  Cross          NaN   
  15386  Sylvain Wiltord        Right Midfield  Cross          NaN   
  15410  Sylvain Wiltord        Right Midfield  Cross          NaN   
  15587  Sylvain Wiltord        Right Midfield  Cross         

---

### 03. Plotting Data

As an example, we gather Thierry Henry's shot data and visualize it as a heat map below. The same plot is then repeated for his passes, dribbles, and crosses.

In [None]:
th_shot = player_dict['Thierry Henry']['Shot']
df = th_shot[['location_x', 'location_y']].astype('float64')

In [None]:
# Tom Decroos, author of `matplotsoccer <https://github.com/TomDecroos/matplotsoccer>`_,
# asked whether it was possible to plot a Gaussian smoothed heatmap,
# which are available in matplotsoccer. Here is an example demonstrating this.

# setup pitch
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
              pitch_color='#22312b', line_color='#efefef')
# draw
fig, ax = pitch.draw(figsize=(6.6, 4.125))
fig.set_facecolor('#22312b')
bin_statistic = pitch.bin_statistic(df.location_x, df.location_y, statistic='count', bins=(24,25))
bin_statistic['statistic'] = gaussian_filter(bin_statistic['statistic'], 1)
pcm = pitch.heatmap(bin_statistic, ax=ax, cmap='mako', edgecolors='#22312b')
# Add the colorbar and format off-white
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')
plt.show()

In [None]:
th_pass = player_dict['Thierry Henry']['Pass']
df = th_pass[['location_x', 'location_y']].astype('float64')

In [None]:
# setup pitch
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
              pitch_color='#22312b', line_color='#efefef')
# draw
fig, ax = pitch.draw(figsize=(6.6, 4.125))
fig.set_facecolor('#22312b')
bin_statistic = pitch.bin_statistic(df.location_x, df.location_y, statistic='count', bins=(20,24))
bin_statistic['statistic'] = gaussian_filter(bin_statistic['statistic'], 1)
pcm = pitch.heatmap(bin_statistic, ax=ax, cmap='mako', edgecolors='#22312b')
# Add the colorbar and format off-white
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')
plt.show()

In [None]:
th_dribble = player_dict['Thierry Henry']['Dribble']
df = th_dribble[['location_x', 'location_y']].astype('float64')

In [None]:
# setup pitch
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
              pitch_color='#22312b', line_color='#efefef')
# draw
fig, ax = pitch.draw(figsize=(6.6, 4.125))
fig.set_facecolor('#22312b')
bin_statistic = pitch.bin_statistic(df.location_x, df.location_y, statistic='count', bins=(24,25))
bin_statistic['statistic'] = gaussian_filter(bin_statistic['statistic'], 1)
pcm = pitch.heatmap(bin_statistic, ax=ax, cmap='mako', edgecolors='#22312b')
# Add the colorbar and format off-white
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')
plt.show()

In [None]:
th_cross = player_dict['Thierry Henry']['Cross']
df = th_cross[['location_x', 'location_y']].astype('float64')

In [None]:
# setup pitch
pitch = Pitch(pitch_type='statsbomb', line_zorder=2,
              pitch_color='#22312b', line_color='#efefef')
# draw
fig, ax = pitch.draw(figsize=(6.6, 4.125))
fig.set_facecolor('#22312b')
bin_statistic = pitch.bin_statistic(df.location_x, df.location_y, statistic='count', bins=(24,25))
bin_statistic['statistic'] = gaussian_filter(bin_statistic['statistic'], 1)
pcm = pitch.heatmap(bin_statistic, ax=ax, cmap='mako', edgecolors='#22312b')
# Add the colorbar and format off-white
cbar = fig.colorbar(pcm, ax=ax, shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')
plt.show()

---

### 02. Aggregate Data for Arsenal

With specific player data out of the way, we can move onto aggregate team data to learn more about the team as a whole.

In [None]:
def gather_team_data(match_id, events_df, team='Arsenal'):
    # restrict to the team's own events so the opponent's shots and passes are not counted
    team_df = events_df[events_df['team'] == team]
    return pd.DataFrame(
        {match_id : {
            'xG' : team_df['shot_statsbomb_xg'].astype('float64').sum(),
            'shots' : team_df[team_df['type'] == 'Shot'].shape[0],
            'passes' : team_df[team_df['type'] == 'Pass'].shape[0],
            'dribbles' : team_df[team_df['type'] == 'Dribble'].shape[0],
            'goals' : team_df[team_df['shot_outcome'] == 'Goal'].shape[0]
        }}
    ).T


arsenal_summary_statistics = pd.concat([
    gather_team_data(idx, sb.events(match_id=idx))
    for idx in invincibles_df['match_id']
])

In [None]:
arsenal_summary_statistics
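
The same per-match summary can be computed in one pass with a grouped aggregation. The sketch below is only an illustration on made-up events. It assumes a `match_id` column has been attached to each event (which `master_df` does not carry as built above), and the helper name `summarize_matches` is hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_matches(events, team='Arsenal'):
    """Per-match totals for one team (hypothetical helper, toy column names)."""
    team_events = events[events['team'] == team]
    grouped = team_events.groupby('match_id')
    return pd.DataFrame({
        'xG': grouped['shot_statsbomb_xg'].sum(),  # NaN values (non-shots) are skipped
        'shots': grouped['type'].apply(lambda t: (t == 'Shot').sum()),
        'passes': grouped['type'].apply(lambda t: (t == 'Pass').sum()),
        'goals': grouped['shot_outcome'].apply(lambda o: (o == 'Goal').sum()),
    })

# toy events for two matches; one shot belongs to the opponent
toy_events = pd.DataFrame({
    'match_id': [1, 1, 1, 1, 2, 2],
    'team': ['Arsenal', 'Arsenal', 'Opponent', 'Arsenal', 'Arsenal', 'Arsenal'],
    'type': ['Pass', 'Shot', 'Shot', 'Shot', 'Pass', 'Pass'],
    'shot_statsbomb_xg': [np.nan, 0.25, 0.5, 0.125, np.nan, np.nan],
    'shot_outcome': [np.nan, 'Goal', 'Goal', 'Saved', np.nan, np.nan],
})

toy_summary = summarize_matches(toy_events)
print(toy_summary)
```

Filtering by team before grouping keeps the opponent's shot out of the totals.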

Now let's plot Arsenal's shots on the pitch with their corresponding `xG` values.

In [None]:
arsenal_shot_df = master_df[master_df['type'] == 'Shot']

In [None]:
fm = FontManager()
fm_rubik = FontManager(('https://github.com/google/fonts/blob/main/ofl/rubikmonoone/'
                        'RubikMonoOne-Regular.ttf?raw=true'))

vertical_pitch = VerticalPitch(half=True, pad_top=0.05, pad_right=0.05, pad_bottom=0.05,
                               pad_left=0.05, line_zorder=2)

fig, axs = vertical_pitch.jointgrid(figheight=10, left=None, bottom=None,  # center aligned
                                    grid_width=0.95, marginal=0.1,
                                    # setting up the heights/space so it takes up 95% of the figure
                                    grid_height=0.80,
                                    title_height=0.1, endnote_height=0.03,
                                    title_space=0.01, endnote_space=0.01,
                                    axis=False,  # turn off title/ endnote/ marginal axes
                                    # here we filter out the left and top marginal axes
                                    ax_top=False, ax_bottom=True,
                                    ax_left=False, ax_right=True)
# typical shot map where the scatter points vary by the expected goals value
# using alpha for transparency as there are a lot of shots stacked around the six-yard box
shot_x = arsenal_shot_df['location_x'].astype('float64')
shot_y = arsenal_shot_df['location_y'].astype('float64')
sc_team2 = vertical_pitch.scatter(shot_x, shot_y, s=arsenal_shot_df['shot_statsbomb_xg'] * 700,
                                  alpha=0.5, ec='black', color='#db0007', ax=axs['pitch'])
# kdeplots on the marginals
# remember to flip the coordinates y=x, x=y for the marginals when using vertical orientation
# (`fill=True` replaces `shade=True`, which newer seaborn releases deprecate)
team2_hist_x = sns.kdeplot(y=shot_x, ax=axs['right'], color='#db0007', fill=True)
team2_hist_y = sns.kdeplot(x=shot_y, ax=axs['bottom'], color='#db0007', fill=True)
# txt1 = axs['pitch'].text(x=40, y=80, s='Arsenal', fontproperties=fm_rubik.prop, color=pitch.line_color,
#                          ha='center', va='center', fontsize=60)

# titles and endnote
axs['title'].text(0.5, 0.7, "Arsenal Shooting Distribution (2003/2004)", color='#db0007',
                  fontproperties=fm_rubik.prop, fontsize=18, ha='center', va='center')
axs['title'].text(0.5, 0.3, "[scaled by xG]", color='#db0007',
                  fontproperties=fm_rubik.prop, fontsize=12, ha='center', va='center')

plt.show()

---

### 04. Heatmaps to Vectors

In [None]:
# keys are action types, values will be dataframes containing each player's compressed heatmap
nmf_dict = {'Pass':[], 'Shot':[], 'Dribble':[], 'Cross':[]}

In [None]:
player_dict

In [None]:
def player_action_heatmap(player, action, nrows=24, ncols=25):
    if player not in player_dict:
        print(f'Invalid player entry: {player} not found')
        return None
    if action not in player_dict[player]:
        print(f'Invalid action type for {player}: {action} not found')
        return None

    # create empty count matrix of specified dimensions
    heatmap_matrix = np.zeros(shape=(nrows, ncols))

    # place grid points evenly across the StatsBomb pitch (120 x 80); could be
    # altered depending on weights of different grid patterns on the field
    row_divs = np.round(np.linspace(0, 80, nrows), 2)
    col_divs = np.round(np.linspace(0, 120, ncols), 2)

    # drop actions without a recorded location, then iterate over the rest
    # (`event` avoids shadowing the `action` argument)
    locations = player_dict[player][action][['location_x', 'location_y']].astype('float64').dropna()
    for _, event in locations.iterrows():
        # assign to closest row grid point and column grid point;
        # argmin always returns a valid index in [0, nrows) or [0, ncols)
        grid_row = np.abs(row_divs - event['location_y']).argmin()
        grid_col = np.abs(col_divs - event['location_x']).argmin()
        # update corresponding grid cell (of player-action combo) by one frequency
        heatmap_matrix[grid_row, grid_col] += 1

    return heatmap_matrix

In [None]:
heatmap_matrix = player_action_heatmap('Thierry Henry', 'Dribble')
if isinstance(heatmap_matrix, np.ndarray):
    sns.heatmap(heatmap_matrix)
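
The nearest-grid-point loop can also be expressed as a 2D histogram. The sketch below is an alternative, not the function used in this notebook: it bins locations into equal-width cells with `np.histogram2d`, so counts near cell boundaries may land in different cells than with the loop. The 120 x 80 pitch size follows the StatsBomb coordinates used above; the helper name `histogram_heatmap` and the sample points are made up.

```python
import numpy as np

def histogram_heatmap(xs, ys, nrows=24, ncols=25, length=120.0, width=80.0):
    """Count actions per grid cell; rows follow y (width), columns follow x (length)."""
    counts, _, _ = np.histogram2d(
        ys, xs,                              # first argument maps to rows
        bins=[nrows, ncols],
        range=[[0.0, width], [0.0, length]],
    )
    return counts

# three illustrative action locations (x along the length, y along the width)
xs = np.array([0.0, 60.0, 120.0])
ys = np.array([0.0, 40.0, 80.0])

toy_heatmap = histogram_heatmap(xs, ys)
print(toy_heatmap.shape, toy_heatmap.sum())
```

Because the histogram is vectorized, it avoids the per-row Python loop, which may matter for players with thousands of passes.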

In [None]:
def compressed_heatmap_matrix(player_action_dict, nrows=24, ncols=25):
    actions = ['Pass', 'Dribble', 'Shot', 'Cross']
    # one dict of {player: flattened heatmap} per action type
    flat_maps = {action: {} for action in actions}

    for player in player_action_dict:
        for action in player_action_dict[player]:
            # flatten the (nrows, ncols) grid row by row into a vector of length nrows * ncols
            flat_maps[action][player] = player_action_heatmap(player, action, nrows, ncols).ravel()

    # one dataframe per action type: rows are grid cells, columns are players
    return {action: pd.DataFrame(flat_maps[action], index=range(nrows * ncols)) for action in actions}

In [None]:
(_, nmf_pass), (_, nmf_dribble), (_, nmf_shot), (_, nmf_cross) = compressed_heatmap_matrix(player_action_dict=player_dict).items()
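
Flattening is what turns each 24 x 25 heatmap into one 600-row column, and it is reversible, which is why columns can later be reshaped back into pitch grids. A small sketch with a made-up count matrix (all names here are illustrative):

```python
import numpy as np
import pandas as pd

nrows, ncols = 24, 25

# toy heatmap with a single action counted in row 3, column 10
toy_grid = np.zeros((nrows, ncols))
toy_grid[3, 10] = 1

# row-major flattening: cell (r, c) moves to position r * ncols + c
flat = toy_grid.ravel()
flat_index = int(flat.argmax())

# stack flattened heatmaps as columns: rows are grid cells, columns are players
toy_matrix = pd.DataFrame({'Player A': flat, 'Player B': flat * 2})

# reshaping a column recovers the original grid
restored = toy_matrix['Player A'].to_numpy().reshape(nrows, ncols)
print(flat_index, toy_matrix.shape)
```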

In [None]:
# apply Gaussian blur
filtered_X = gaussian_filter(heatmap_matrix, sigma=0.7)
sns.heatmap(filtered_X)

---

### 05. Non-negative Matrix Factorization

Let's get into the thick of it. Now that each player's pass heatmap is a column of a single matrix (grid cells by players), let's see what we can learn by factorizing that matrix into non-negative components.

In [None]:
nmf = NMF(n_components=18, random_state=100)
W = pd.DataFrame(nmf.fit_transform(nmf_pass))
H = pd.DataFrame(nmf.components_, columns=nmf_pass.columns)
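
For intuition about what `W` and `H` hold, here is a bare-bones NMF written with NumPy. It uses the classic multiplicative update rules, which is an assumption made for illustration and not necessarily the solver scikit-learn runs. The function name, matrix sizes, and random data are all made up. It factorizes a non-negative matrix `X` (grid cells by players) into `W` (grid cells by components) and `H` (components by players) so that `W @ H` approximates `X`.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-9):
    """Approximate X ~ W @ H with non-negative W, H (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, n_components))
    H = rng.random((n_components, n_cols))
    errors = []
    for _ in range(n_iter):
        # multiplicative updates keep every entry non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors

# toy data: 30 "grid cells" by 6 "players", built from 2 hidden patterns
rng = np.random.default_rng(1)
X_toy = rng.random((30, 2)) @ rng.random((2, 6))

W_toy, H_toy, errors = nmf_multiplicative(X_toy, n_components=2)
print(W_toy.shape, H_toy.shape, errors[0], errors[-1])
```

Each column of `W` is a spatial pattern over grid cells, and each column of `H` gives one player's weights on those patterns, which is the reading used for the heatmaps plotted from `W` further down.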

In [None]:
W

In [None]:
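# normalize each row of W so that a grid cell's weights across components sum to 1
# (rows that are all zero produce NaN)
cls_mem = W.div(W.sum(axis=1), axis=0)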

In [None]:
H

In [None]:
cls_mem

In [None]:
for i in range(W.shape[1]):
    sns.heatmap(np.array(W.iloc[:, i]).reshape((24, 25)))
    plt.title(f'Heatmap for Grid Cluster {i}')
    plt.show()