**3. Calculating metrics for passes**

This notebook performs the following tasks:

1. Cluster the defender lineups into subcategories based on the footedness pattern of the lineup, read from the right back (RB) position leftwards

    For example, the **right-right-right-left (rrrl)** category denotes a lineup of four defenders where:

    **right back (RB) is right footed**

    **right center back (RCB) is right footed**

    **left center back (LCB) is right footed**

    **left back (LB) is left footed**

2. Compute several passing-based attributes for each defender in each match, using match lineup data (from **match+def_lineup+footedness_ver2_top5.pkl**) and events data (from **events_com.pkl**)

The notebook produces the following pickle files:

1. Cluster-wise files with passing attributes for each defender in each match (**cluster_four_defs.pkl** and **cluster_three_five_defs.pkl**)
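
As a minimal illustration of the category labels described above, the sketch below builds a pattern string from per-position footedness. The `footedness_pattern` helper and the `lineup` mapping are hypothetical and not part of this notebook, which reads a precomputed `footedness` column from the lineup pickle file.

```python
# Hypothetical helper: the notebook itself loads a precomputed "footedness" column.
FOUR_DEF_ORDER = ["RB", "RCB", "LCB", "LB"]  # read from the right back leftwards

def footedness_pattern(feet_by_position, order=FOUR_DEF_ORDER):
    """Concatenate the first letter of each defender's preferred foot."""
    return "".join(feet_by_position[pos][0].lower() for pos in order)

# Made-up lineup matching the rrrl example in the text.
lineup = {"RB": "right", "RCB": "right", "LCB": "right", "LB": "left"}
pattern = footedness_pattern(lineup)
print(pattern)  # rrrl
```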




In [1]:
import pandas as pd
import numpy as np
from unidecode import unidecode
from tqdm import tqdm
import re
from difflib import SequenceMatcher
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns",1000)

**Loading the pickle file with Top 5 Leagues 2017-18 events data, which also holds each player's role, i.e. goalkeeper (GKP), defender (DEF), midfielder (MID) or forward (FWD)**

In [2]:
df_events_roles = pd.read_pickle("../data_top5/events/events_com.pkl")

**Loading the pickle file with defence lineup information for each team participating in a particular match.**

In [3]:
df_defence_footed = pd.read_pickle("../data_top5/matches/match+def_lineup+footedness_ver2_top5.pkl")

**Observing the unique footedness categories in the dataframe**

In [4]:
footedness_patterns = df_defence_footed["footedness"].unique()

**Renaming certain positional columns for better understanding**

In [5]:
df_defence_footed.rename(columns={'R-CB':'R_CB',"L-CB":'L_CB'},inplace=True)

**Filtering pass events made by defenders and computing the total number of passes and accurate passes by defenders across the top 5 leagues**

In [6]:
# Keep only pass events made by players whose role is defender (single combined mask)
df_events_pass = df_events_roles.loc[df_events_roles['eventName'].str.contains('Pass') & (df_events_roles['role']=='DEF')]

In [7]:
league_pass_info = dict()
league_pass_info['totalpasses'] = len(df_events_pass)

In [8]:
league_pass_info['totalaccuratepasses']=len(df_events_pass[df_events_pass['tags'].apply(lambda x: "Accurate" in x)])

In [9]:
league_pass_info

{'totalpasses': 660055, 'totalaccuratepasses': 552506}
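
The filtering and counting steps above can be sketched on a toy table. The rows below are made up, and the sketch assumes that `tags` holds a list of string labels, which is what the `"Accurate" in x` check in the notebook suggests.

```python
import pandas as pd

# Toy events table; column names mirror the notebook, the rows are invented.
events = pd.DataFrame({
    "eventName": ["Pass", "Pass", "Duel", "Pass", "Pass"],
    "role": ["DEF", "DEF", "DEF", "MID", "DEF"],
    # Assumption: tags are stored as a list of string labels.
    "tags": [["Accurate"], ["Not accurate"], [], ["Accurate"], ["Accurate", "Key pass"]],
})

# Combine both conditions in one boolean mask.
is_pass = events["eventName"].str.contains("Pass")
is_def = events["role"] == "DEF"
def_passes = events.loc[is_pass & is_def]

# Count accurate passes and derive the accuracy share.
accurate = def_passes["tags"].apply(lambda t: "Accurate" in t)
total = len(def_passes)
n_accurate = int(accurate.sum())
accuracy = n_accurate / total
print(total, n_accurate, round(accuracy, 3))
```

Applied to the totals printed above, the same ratio is 552506 / 660055, roughly 83.7% accurate passes for defenders.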

**Creating separate dataframes for lineups with four defenders and lineups with three or five defenders**

In [10]:
df_four_defs = df_defence_footed[df_defence_footed['backline']==4]
df_three_five_defs = df_defence_footed[df_defence_footed['backline'].isin([3,5])]

In [11]:
df_defs_atb = [df_four_defs,df_three_five_defs]

**Creating a metrics collection function that takes x (match id) and y (player name) and returns the following metrics:**

**numpasses** - number of passes made by the player in the queried match

**numaccpasses** - number of accurate passes made by the player in the queried match

**numhighpasses** - number of high (aerial) passes made by the player in the queried match

**numhighaccpasses** - number of high (aerial) accurate passes made by the player in the queried match

**accpasslocs** - starting and ending coordinates of all the accurate passes made by the player in the queried match

**inaccpasslocs** - starting and ending coordinates of all the inaccurate passes made by the player in the queried match

**acchighpasslocs** - starting and ending coordinates of all the accurate high passes made by the player in the queried match

**inacchighpasslocs** - starting and ending coordinates of all the inaccurate high passes made by the player in the queried match

In [12]:
def getmetrics(x,y):
    # Split the concatenated name at capital letters, e.g. "EricDier" -> ["Eric", "Dier"]
    split_y = re.findall('[A-Z][^A-Z]*',y)
    if not split_y:
        raise ValueError(f"Cannot split player name: {y!r}")
    # Select the queried match and require up to the last three name parts to appear in playerName
    mask = df_events_pass['matchId']==int(x)
    for part in split_y[-3:]:
        mask &= df_events_pass['playerName'].str.contains(part, na=False)
    pass_df = df_events_pass.loc[mask]
    # Boolean masks shared by the metrics below
    acc = pass_df['tags'].apply(lambda a: "Accurate" in a).astype(bool)
    inacc = pass_df['tags'].apply(lambda a: "Not accurate" in a).astype(bool)
    high = pass_df['subEventName']=='High pass'
    numpasses = len(pass_df)
    numaccpasses = len(pass_df.loc[acc])
    numhighpasses = len(pass_df.loc[high])
    numhighaccpasses = len(pass_df.loc[high & acc])
    accpasslocs = pass_df.loc[acc]['positions'].tolist()
    inaccpasslocs = pass_df.loc[inacc]['positions'].tolist()
    acchighpasslocs = pass_df.loc[high & acc]['positions'].tolist()
    inacchighpasslocs = pass_df.loc[high & inacc]['positions'].tolist()
    return [numpasses,numaccpasses,numhighpasses,numhighaccpasses,accpasslocs,inaccpasslocs,acchighpasslocs,inacchighpasslocs]
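
The name matching in `getmetrics` relies on splitting the concatenated lineup name at capital letters. The short sketch below shows what that regular expression returns for a few names that appear in this notebook. `split_name` is a hypothetical wrapper added only for illustration.

```python
import re

def split_name(name):
    """Split a concatenated name at each capital letter (same regex as getmetrics)."""
    return re.findall('[A-Z][^A-Z]*', name)

three_parts = split_name("JorgeAndujarMoreno")
two_parts = split_name("EricDier")
with_apostrophe = split_name("AlfredJohnMomarN'Diaye")

print(three_parts)      # ['Jorge', 'Andujar', 'Moreno']
print(two_parts)        # ['Eric', 'Dier']
print(with_apostrophe)  # ['Alfred', 'John', 'Momar', "N'", 'Diaye']
```

Names with a capital letter inside a surname, such as `N'Diaye`, are split into extra fragments, so the last three parts used for matching may be fragments rather than whole names.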


In [13]:
getmetrics(2500081,"Bruno")

[29,
 23,
 5,
 3,
 [[[30.16, 5.44], [26.0, 15.64]],
  [[33.28, 12.92], [29.12, 29.92]],
  [[32.24, 4.76], [37.44, 14.96]],
  [[75.92, 4.76], [83.2, 4.76]],
  [[99.84, 12.92], [91.52, 48.96]],
  [[69.68, 11.56], [78.0, 10.88]],
  [[32.24, 4.76], [36.4, 17.0]],
  [[78.0, 6.12], [71.76, 8.16]],
  [[32.24, 23.12], [26.0, 29.92]],
  [[47.84, 27.88], [28.08, 34.68]],
  [[4.16, 6.12], [16.64, 4.08]],
  [[71.76, 6.12], [74.88, 9.52]],
  [[39.52, 16.32], [23.92, 32.64]],
  [[43.68, 12.92], [30.16, 34.0]],
  [[35.36, 10.88], [29.12, 27.88]],
  [[46.8, 28.56], [93.6, 51.68]],
  [[9.36, 18.36], [10.4, 23.12]],
  [[63.44, 7.48], [65.52, 2.72]],
  [[64.48, 2.04], [58.24, 28.56]],
  [[17.68, 1.36], [43.68, 1.36]],
  [[31.2, 12.92], [34.32, 3.4]],
  [[58.24, 7.48], [30.16, 24.48]],
  [[36.4, 7.48], [60.32, 18.36]]],
 [[[32.24, 3.4], [53.04, 33.32]],
  [[20.8, 20.4], [40.56, 8.84]],
  [[17.68, 5.44], [59.28, 6.8]],
  [[87.36, 4.76], [76.96, 46.92]],
  [[100.88, 17.68], [0.0, 68.0]],
  [[36.4, 1.36], [6

In [14]:
new_cols = ['RB_all',
            'R_CB_all',
            'L_CB_all',
            'LB_all',
            'RCB_all',
            'CB_all',
            'LCB_all',
            'RWB_all',
            'LWB_all']

**Collecting metrics for each defender location for various clusters**

In [15]:
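# R_CB - right center back in a four-defender formation
# RCB  - right center back in a three- or five-defender formation
# L_CB - left center back in a four-defender formation
# LCB  - left center back in a three- or five-defender formation
# Note: the branch below is selected from the first row of each dataframe only.
# df_three_five_defs mixes backline values 3 and 5, so a single branch is applied to all of its rows.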
df_defs_atb_updated = list()
for df in tqdm(df_defs_atb):
    df = df.reindex(columns = df.columns.tolist() + new_cols)
    if df.iloc[0]['backline'] == 4.0:     
        df['RB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.RB), axis=1)
        df['R_CB_all'] = df.apply(lambda x: getmetrics(x.wyId,x['R_CB']), axis=1)
        df['L_CB_all'] = df.apply(lambda x: getmetrics(x.wyId,x['L_CB']), axis=1)
        df['LB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.LB), axis=1)
        df_defs_atb_updated.append(df)
    
    elif df.iloc[0]['backline'] == 3.0:
        df['RCB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.RCB), axis=1)
        df['CB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.CB), axis=1)
        df['LCB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.LCB), axis=1)
        df_defs_atb_updated.append(df)
        
    elif df.iloc[0]['backline'] == 5.0:
        df['RWB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.RWB), axis=1)
        df['RCB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.RCB), axis=1)
        df['CB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.CB), axis=1)
        df['LCB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.LCB), axis=1)
        df['LWB_all'] = df.apply(lambda x: getmetrics(x.wyId,x.LWB), axis=1)
        df_defs_atb_updated.append(df)

100%|██████████| 2/2 [4:11:29<00:00, 7544.94s/it]   
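
The loop above chooses its branch from the first row of each dataframe (`df.iloc[0]['backline']`). Because `df_three_five_defs` contains both three- and five-defender lineups, one way to handle each row according to its own backline could look like the sketch below. `POSITIONS_BY_BACKLINE`, `row_metrics`, the stub `fake_metrics` and the toy lineups are all hypothetical; in the notebook, `getmetrics` would take the place of the stub.

```python
import pandas as pd

# Hypothetical mapping from backline size to the position columns to process.
POSITIONS_BY_BACKLINE = {
    4: ["RB", "R_CB", "L_CB", "LB"],
    3: ["RCB", "CB", "LCB"],
    5: ["RWB", "RCB", "CB", "LCB", "LWB"],
}

def fake_metrics(match_id, player):
    """Stand-in for getmetrics so the sketch runs without the events data."""
    return [match_id, player]

def row_metrics(row, metric_fn=fake_metrics):
    """Compute metrics only for the positions that belong to this row's backline."""
    positions = POSITIONS_BY_BACKLINE[int(row["backline"])]
    return {f"{pos}_all": metric_fn(row["wyId"], row[pos]) for pos in positions}

# Toy lineups: one three-defender row and one five-defender row (made-up names).
lineups = pd.DataFrame({
    "wyId": [1, 2],
    "backline": [3.0, 5.0],
    "RWB": [None, "PlayerA"],
    "RCB": ["PlayerB", "PlayerC"],
    "CB": ["PlayerD", "PlayerE"],
    "LCB": ["PlayerF", "PlayerG"],
    "LWB": [None, "PlayerH"],
})

results = [row_metrics(row) for _, row in lineups.iterrows()]
print(sorted(results[0]))
print(sorted(results[1]))
```

With this layout the wing-back columns are only queried for rows whose backline is 5, and missing wing-back names in three-defender rows are never passed to the metrics function.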


In [16]:
df_defs_atb_metrics = list()
for df in tqdm(df_defs_atb_updated):
    if df.iloc[0]['backline'] == 4.0:
        df[['RB_pass','RB_accpass','RB_highpass','RB_acchighpass','RB_accpassloc','RB_inaccpassloc','RB_acchighpassloc','RB_inacchighpassloc']] = pd.DataFrame(df['RB_all'].to_list(), index=df.index)
        df[['R_CB_pass','R_CB_accpass','R_CB_highpass','R_CB_acchighpass','R_CB_accpassloc','R_CB_inaccpassloc','R_CB_acchighpassloc','R_CB_inacchighpassloc']] = pd.DataFrame(df['R_CB_all'].to_list(), index=df.index)
        df[['L_CB_pass','L_CB_accpass','L_CB_highpass','L_CB_acchighpass','L_CB_accpassloc','L_CB_inaccpassloc','L_CB_acchighpassloc','L_CB_inacchighpassloc']] = pd.DataFrame(df['L_CB_all'].to_list(), index=df.index)
        df[['LB_pass','LB_accpass','LB_highpass','LB_acchighpass','LB_accpassloc','LB_inaccpassloc','LB_acchighpassloc','LB_inacchighpassloc']] = pd.DataFrame(df['LB_all'].to_list(), index=df.index)
        df.drop(['RB_all','R_CB_all','L_CB_all','LB_all','RCB_all','LCB_all','CB_all','RWB_all','LWB_all'], axis=1, inplace = True)
        df_defs_atb_metrics.append(df)
       
    elif df.iloc[0]['backline'] == 3.0:
        df[['RCB_pass','RCB_accpass','RCB_highpass','RCB_acchighpass','RCB_accpassloc','RCB_inaccpassloc','RCB_acchighpassloc','RCB_inacchighpassloc']] = pd.DataFrame(df['RCB_all'].to_list(), index=df.index)
        df[['CB_pass','CB_accpass','CB_highpass','CB_acchighpass','CB_accpassloc','CB_inaccpassloc','CB_acchighpassloc','CB_inacchighpassloc']] = pd.DataFrame(df['CB_all'].to_list(), index=df.index)
        df[['LCB_pass','LCB_accpass','LCB_highpass','LCB_acchighpass','LCB_accpassloc','LCB_inaccpassloc','LCB_acchighpassloc','LCB_inacchighpassloc']] = pd.DataFrame(df['LCB_all'].to_list(), index=df.index)
        df.drop(['RB_all','R_CB_all','L_CB_all','LB_all','RCB_all','LCB_all','CB_all','RWB_all','LWB_all'], axis=1, inplace = True)
        df_defs_atb_metrics.append(df)
       
    elif df.iloc[0]['backline'] == 5.0:
        df[['RCB_pass','RCB_accpass','RCB_highpass','RCB_acchighpass','RCB_accpassloc','RCB_inaccpassloc','RCB_acchighpassloc','RCB_inacchighpassloc']] = pd.DataFrame(df['RCB_all'].to_list(), index=df.index)
        df[['CB_pass','CB_accpass','CB_highpass','CB_acchighpass','CB_accpassloc','CB_inaccpassloc','CB_acchighpassloc','CB_inacchighpassloc']] = pd.DataFrame(df['CB_all'].to_list(), index=df.index)
        df[['LCB_pass','LCB_accpass','LCB_highpass','LCB_acchighpass','LCB_accpassloc','LCB_inaccpassloc','LCB_acchighpassloc','LCB_inacchighpassloc']] = pd.DataFrame(df['LCB_all'].to_list(), index=df.index)
        df[['RWB_pass','RWB_accpass','RWB_highpass','RWB_acchighpass','RWB_accpassloc','RWB_inaccpassloc','RWB_acchighpassloc','RWB_inacchighpassloc']] = pd.DataFrame(df['RWB_all'].to_list(), index=df.index)
        df[['LWB_pass','LWB_accpass','LWB_highpass','LWB_acchighpass','LWB_accpassloc','LWB_inaccpassloc','LWB_acchighpassloc','LWB_inacchighpassloc']] = pd.DataFrame(df['LWB_all'].to_list(), index=df.index)
        df.drop(['RB_all','R_CB_all','L_CB_all','LB_all','RCB_all','LCB_all','CB_all','RWB_all','LWB_all'], axis=1, inplace = True)
        df_defs_atb_metrics.append(df)

100%|██████████| 2/2 [00:00<00:00, 11.13it/s]
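
The column expansion above repeats the same eight suffixes for every position. A more compact variant, shown here only as a sketch, loops over the positions instead. `METRIC_SUFFIXES`, `expand_metrics` and the toy row are hypothetical, although the suffixes themselves are the ones used in the cell above.

```python
import pandas as pd

# The eight metric suffixes, in the order returned by getmetrics.
METRIC_SUFFIXES = ["pass", "accpass", "highpass", "acchighpass",
                   "accpassloc", "inaccpassloc", "acchighpassloc", "inacchighpassloc"]

def expand_metrics(df, positions):
    """Split each '<pos>_all' list column into eight '<pos>_<metric>' columns."""
    out = df.copy()
    for pos in positions:
        cols = [f"{pos}_{suffix}" for suffix in METRIC_SUFFIXES]
        out[cols] = pd.DataFrame(out[f"{pos}_all"].to_list(), index=out.index)
        out = out.drop(columns=f"{pos}_all")
    return out

# Toy row with made-up values in the shape getmetrics returns.
toy = pd.DataFrame({"RB": ["PlayerA"],
                    "RB_all": [[29, 23, 5, 3, [], [], [], []]]})
expanded = expand_metrics(toy, ["RB"])
print(expanded.columns.tolist())
```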


In [17]:
atb = ['four_defs','three_five_defs']
for i,df in enumerate(df_defs_atb_metrics):
    df.to_pickle(f'../data_top5/clusters/clusters_v3/cluster_{atb[i]}.pkl')

**Steps to validate if all players have been assigned metrics**

**Fetch players that have not registered a single pass in any particular match**

In [71]:
players_no_pass = list()
for df in df_defs_atb_metrics:
    if df.iloc[0]['backline']==4:
        for col in ['RB_pass','R_CB_pass','L_CB_pass','LB_pass']:
            players_no_pass.append(df[df[col].eq(0)][col.rsplit('_',1)[0]].values.tolist())
    elif df.iloc[0]['backline']==3:
        for col in ['RCB_pass','CB_pass','LCB_pass']:
            players_no_pass.append(df[df[col].eq(0)][col.rsplit('_',1)[0]].values.tolist())
    else:
        for col in ['RWB_pass','RCB_pass','CB_pass','LCB_pass','LWB_pass']:
            players_no_pass.append(df[df[col].eq(0)][col.rsplit('_',1)[0]].values.tolist())
players_no_pass_set = list(set([i for sublist in players_no_pass for i in sublist]))

In [72]:
players_no_pass_set

['IsmaelTiemokoDiomande',
 'StefanIlsanker',
 'RubenPenaJimenez',
 'RaniKhedira',
 'DavidTimorCopovi',
 'EnricoBearzotti',
 'AssaneDiousseElHadji',
 'HavardNordtveit',
 'ManuelRolandoIturraUrrutia',
 'JesusNavasGonzalez',
 'JorgeAndujarMoreno',
 'GerritHoltmann',
 'KwadwoAsamoah',
 'YvesBissouma',
 'FabioBorini',
 "DigboG'nampaHabibMaiga",
 'DanielWass',
 'AinsleyMaitlandNiles',
 'StevenZuber',
 'EricDier',
 'BounaSarr',
 'MitchellWeiser',
 'JulianBaumgartlinger',
 'JulianSchuster',
 'VictorSanchezMata',
 'ChrisPhilipps',
 'SamMcQueen',
 'SergiGomezSola',
 'MatthiasLehmann',
 'JohannesGeis',
 'ChrisBrunt',
 'BenjaminStambouli',
 'MatthiasZimmermann',
 'NicoSchulz',
 'EmreCan',
 'JonathanSchmid',
 'RomuloSouzaOrestesCaldeira',
 "AlfredJohnMomarN'Diaye",
 'FranciscoJavierGuerreroMartin',
 'KonradLaimer',
 'AlfonsoPedrazaSag',
 'JavierMartinezAginaga',
 'BelDurelAvounou',
 'OleksandrZinchenko',
 'ThomasTeyePartey',
 'IsaacHayden',
 'FranckTabanou',
 'IgorZubeldiaElorza',
 'MarcelRisse',
 

**Further filter players that are defenders (Note: Players who have played in a defensive position but have not been marked as defenders are not assigned metrics)**

In [20]:
players = pd.read_pickle('../data/players/players.pkl')

In [22]:
for player in players_no_pass_set:
    player_name_split = re.findall('[A-Z][^A-Z]*',player)
    try:
        role = players[(players['playerName'].str.contains(player_name_split[-1]))&
                       (players['playerName'].str.contains(player_name_split[-2]))&
                       (players['playerName'].str.contains(player_name_split[-3]))]['role'].values.tolist()[0]['code2']
    except:
        try:
            role = players[(players['playerName'].str.contains(player_name_split[-1]))&
                           (players['playerName'].str.contains(player_name_split[-2]))]['role'].values.tolist()[0]['code2']
        except:
            role = players[(players['playerName'].str.contains(player_name_split[-1]))]['role'].values.tolist()[0]['code2']
    if role=='DF':
        print(player+':'+role)

JorgeAndujarMoreno:DF
SergiGomezSola:DF
JeremyGelin:DF
JordanTorunarigha:DF


**Finding match ids for which these defenders do not have a single pass**

In [73]:
no_pass_defs = ['JorgeAndujarMoreno', 'SergiGomezSola','JeremyGelin', 'JordanTorunarigha']
df_indexes = dict()
for i in range(len(df_defs_atb_metrics)):
    check_indexes = dict()
    for defender in no_pass_defs:
        if df_defs_atb_metrics[i].iloc[0]['backline']==4:
            index_list = list()
            for col in ['RB_pass','R_CB_pass','L_CB_pass','LB_pass']:
                index_list.append(df_defs_atb_metrics[i][(df_defs_atb_metrics[i][col].eq(0))&(df_defs_atb_metrics[i][col.rsplit('_',1)[0]]==defender)].index.tolist())
                check_indexes[defender]=index_list
        elif df_defs_atb_metrics[i].iloc[0]['backline']==3:
            index_list = list()
            for col in ['RCB_pass','CB_pass','LCB_pass']:
                index_list.append(df_defs_atb_metrics[i][(df_defs_atb_metrics[i][col].eq(0))&(df_defs_atb_metrics[i][col.rsplit('_',1)[0]]==defender)].index.tolist())
                check_indexes[defender]=index_list
        else:
            index_list = list()
            for col in ['RWB_pass','RCB_pass','CB_pass','LCB_pass','LWB_pass']:
                index_list.append(df_defs_atb_metrics[i][(df_defs_atb_metrics[i][col].eq(0))&(df_defs_atb_metrics[i][col.rsplit('_',1)[0]]==defender)].index.tolist())
                check_indexes[defender]=index_list
        df_indexes[i]=check_indexes

In [74]:
df_indexes

{0: {'JorgeAndujarMoreno': [[1948], [], [], []],
  'SergiGomezSola': [[], [], [1855], []],
  'JeremyGelin': [[], [], [808], []],
  'JordanTorunarigha': [[], [1406], [], []]},
 1: {'JorgeAndujarMoreno': [[], [], []],
  'SergiGomezSola': [[], [], []],
  'JeremyGelin': [[], [], []],
  'JordanTorunarigha': [[], [], []]}}
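
The output above lists dataframe row indexes rather than match ids. Assuming that `wyId` holds the match id, as the calls to `getmetrics(x.wyId, ...)` suggest, the indexes can be mapped to match ids as in the sketch below. The `zero_pass_matches` helper, the ids and the pass counts are hypothetical.

```python
import pandas as pd

# Toy stand-in for one metrics dataframe; match ids and pass counts are made up.
toy = pd.DataFrame(
    {"wyId": [9001, 9002, 9003],
     "RB": ["PlayerA", "JorgeAndujarMoreno", "JorgeAndujarMoreno"],
     "RB_pass": [31, 0, 27]},
    index=[10, 1948, 2001],
)

def zero_pass_matches(df, position, defender):
    """Return {row index: match id} where `defender` at `position` has zero passes."""
    hits = df[df[f"{position}_pass"].eq(0) & (df[position] == defender)]
    return hits["wyId"].to_dict()

lookup = zero_pass_matches(toy, "RB", "JorgeAndujarMoreno")
print(lookup)
```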

**Wyscout has not recorded any data for Jeremy Gelin and Sergi Gomez in these matches, even though they played the full 90 minutes. Coke (JorgeAndujarMoreno) was substituted and Jordan Torunarigha was shown a red card in the respective matches, so neither has any passing event associated with them.**

In [31]:
# for df in df_defs_atb_metrics:
#     df.reset_index(inplace=True)

In [66]:
# df_defs_atb_metrics[0][df_defs_atb_metrics[0].eq('SergiGomezSola').any(axis=1)]

In [70]:
# events_spain = pd.read_json('../data_top5/events/events_Spain.json')

In [67]:
# events_spain[(events_spain['matchId']==2565681)&(events_spain['playerId']==3338)]

In [68]:
# players[players['playerName']=='SergiGomezSola']