**2. Pre-processing the Wyscout data**

**A) This notebook performs the following tasks:**

1. The Wyscout data jsons (Competitions, Matches, Events and Players) are processed and stored in dataframes.

2. The defence formation data (as obtained from 1_PLparser.ipynb) is merged with the Wyscout data, in such a way that each row of the new dataframe denotes a team for a particular match.

3. The resulting dataframe is further processed to normalise team and player names.

4. Footedness and event-tag information is added to the dataframe.

5. Event coordinates are rescaled so that x lies in [0, 104] and y lies in [0, 68].

**B) This notebook produces the following pickle files:**

**events_v2.pkl** - processed Wyscout events data with scaled coordinates

**match+def_lineup+footedness_ver2.pkl** - defence formation data with each row denoting a team for a particular match

**Note:**

1. All experiments have been run on Premier League data only

2. We understand that the role of center backs changes with the number of players in the defence lineup. For simplicity, our analysis uses two categories: lineups with four defenders and lineups with any other number of defenders. The formation nomenclature is as follows:


    a. For a defence lineup with 4 players:
            i. LB - Left Back
            ii. L_CB - Left Center Back
            iii. R_CB - Right Center Back
            iv. RB - Right Back
            
            
    b. For a defence lineup with 3 or 5 players:
            i. LWB - Left Wing Back (only for 5 defenders)
            ii. LCB - Left Center Back (for both 3 and 5 defenders)
            iii. CB - Center Back (for both 3 and 5 defenders)
            iv. RCB - Right Center Back (for both 3 and 5 defenders)
            v. RWB - Right Wing Back (only for 5 defenders)
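
The nomenclature above can be expressed as a small lookup keyed on the size of the backline. This is only a sketch: the helper name `label_backline` is hypothetical, and it assumes defenders are listed from the right side of the pitch to the left, which is the order the lineup-parsing functions later in this notebook rely on.

```python
def label_backline(defenders):
    """Map an ordered list of defenders to position labels.

    Assumption: defenders are ordered from the right of the pitch to the left.
    """
    labels_by_size = {
        4: ['RB', 'R_CB', 'L_CB', 'LB'],
        3: ['RCB', 'CB', 'LCB'],
        5: ['RWB', 'RCB', 'CB', 'LCB', 'LWB'],
    }
    if len(defenders) not in labels_by_size:
        raise ValueError(f'unsupported backline size: {len(defenders)}')
    return dict(zip(labels_by_size[len(defenders)], defenders))


# A three-player backline taken from the lineup table shown later
example = label_backline(['RobHolding', 'NachoMonreal', 'SeadKolasinac'])
print(example)
```

Raising on an unexpected size is a design choice of this sketch; the notebook's own functions instead treat every backline that is not three or four players as a back five.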

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import warnings
import json
import operator
from matplotlib.ticker import FuncFormatter
from matplotlib.patches import Ellipse
import base64
from collections import defaultdict
import sys,os
import math
import random
import matplotlib.pylab as pyl
import itertools
import pickle
import swifter
from unidecode import unidecode
from itertools import chain
from multiprocessing import  Pool

warnings.filterwarnings('ignore')

In [3]:
# pd.set_option('max_colwidth', 999)

In [4]:
pd.set_option('display.max_columns', 1000)
pd.set_option("display.max_rows", 3000)
warnings.filterwarnings('ignore')

# Competitions 

- area: it denotes the geographic area associated with the league as a sub-document, using the ISO 3166-1 specification (https://www.iso.org/iso-3166-country-codes.html);
- format: the format of the competition. All competitions for clubs have value "Domestic league". The competitions for national teams have value "International cup";
- name: the official name of the competition (e.g., Italian first division, Spanish first division, World Cup, etc.);
- type: the typology of the competition. It is "club" for the competitions for clubs and "international" for the competitions for national teams (World Cup 2018, European Cup 2016);
- wyId: the unique identifier of the competition, assigned by Wyscout.

In [5]:
comps = pd.read_json('../data_top5/competitions/competitions.json')
comps = pd.DataFrame(comps)
comps

Unnamed: 0,name,wyId,format,area,type
0,Italian first division,524,Domestic league,"{'name': 'Italy', 'id': '380', 'alpha3code': '...",club
1,English first division,364,Domestic league,"{'name': 'England', 'id': '0', 'alpha3code': '...",club
2,Spanish first division,795,Domestic league,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
3,French first division,412,Domestic league,"{'name': 'France', 'id': '250', 'alpha3code': ...",club
4,German first division,426,Domestic league,"{'name': 'Germany', 'id': '276', 'alpha3code':...",club
5,European Championship,102,International cup,"{'name': '', 'id': 0, 'alpha3code': 'XEU', 'al...",international
6,World Cup,28,International cup,"{'name': '', 'id': 0, 'alpha3code': 'XWO', 'al...",international


# Matches 

- competitionId: the identifier of the competition to which the match belongs. It is an integer and refers to the field "wyId" of the competition document;
- date and dateutc: the former specifies date and time when the match starts in explicit format (e.g., May 20, 2018 at 8:45:00 PM GMT+2), the latter contains the same information but in the compact format YYYY-MM-DD hh:mm:ss;
- duration: the duration of the match. It can be "Regular" (matches of regular duration of 90 minutes + stoppage time), "ExtraTime" (matches with supplementary times, as it may happen for matches in continental or international competitions), or "Penalities" (matches which end at penalty kicks, as it may happen for continental or international competitions);
- gameweek: the week of the league, starting from the beginning of the league;
- label: contains the name of the two clubs and the result of the match (e.g., "Lazio - Internazionale, 2 - 3");
- roundId: indicates the match-day of the competition to which the match belongs. During a competition for soccer clubs, each of the participating clubs plays against each of the other clubs twice, once at home and once away. The matches are organized in match-days: all the matches in match-day i are played before the matches in match-day i + 1, even though some matches can be brought forward or postponed to accommodate players and clubs participating in Continental or Intercontinental competitions. During a competition for national teams, the "roundId" indicates the stage of the competition (eliminatory round, round of 16, quarter finals, semifinals, final);
- seasonId: indicates the season of the match;
- status: it can be "Played" (the match has officially finished), "Cancelled" (the match has been canceled for some reason), "Postponed" (the match has been postponed and no new date and time is available yet) or "Suspended" (the match has been suspended and no new date and time is available yet);
- venue: the stadium where the match was held (e.g., "Stadio Olimpico");
- winner: the identifier of the team which won the game, or 0 if the match ended with a draw;
- wyId: the identifier of the match, assigned by Wyscout;
- teamsData: it contains several subfields describing each team playing the match, such as lineup, bench composition, list of substitutions, coach and scores:
  - hasFormation: it has value 0 if no formation (lineups and benches) is present, and 1 otherwise;
  - score: the number of goals scored by the team during the match (not counting penalties);
  - scoreET: the number of goals scored by the team during the match, including the extra time (not counting penalties);
  - scoreHT: the number of goals scored by the team during the first half of the match;
  - scoreP: the total number of goals scored by the team after the penalties;
  - side: the team side in the match (it can be "home" or "away");
  - teamId: the identifier of the team;
  - coachId: the identifier of the team's coach;
  - bench: the list of the team's players who started the match on the bench and some basic statistics about their performance during the match (goals, own goals, cards);
  - lineup: the list of the team's players in the starting lineup and some basic statistics about their performance during the match (goals, own goals, cards);
  - substitutions: the list of the team's substitutions during the match, describing the players involved and the minute of the substitution.
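
Since the goal of this notebook is a dataframe with one row per team per match, it may help to see how a single match document could be flattened. The sketch below is illustrative only: `flatten_teams_data` and `toy_match` are hypothetical, and it assumes `teamsData` is a dictionary keyed by the team identifier as a string, which is what the previews in this notebook suggest.

```python
def flatten_teams_data(match):
    """Return one record per team for a single match document."""
    rows = []
    for team_id, info in match['teamsData'].items():
        rows.append({
            'matchId': match['wyId'],
            'teamId': int(team_id),
            'side': info['side'],
            'score': info['score'],
            # 'winner' holds the winning team's id, or 0 for a draw
            'won': match['winner'] == int(team_id),
        })
    return rows


# Hypothetical match document containing only the fields used above
toy_match = {
    'wyId': 1,
    'winner': 10,
    'teamsData': {
        '10': {'side': 'home', 'score': 1},
        '20': {'side': 'away', 'score': 0},
    },
}
team_rows = flatten_teams_data(toy_match)
print(team_rows)
```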

In [6]:
country = ['England', 'France', 'Germany', 'Italy', 'Spain']
matches1 = pd.DataFrame()

for i in country:
    path = '../data_top5/matches/matches_'+i+'.json'
    matches = pd.read_json(path)
    matches = pd.DataFrame(matches)
    matches1 = pd.concat([matches, matches1]) 

matches1.reset_index(drop=True, inplace=True)
print(matches1.shape)

(1826, 14)


Sample 'TeamsData' row entry:

# Events

- eventId: the identifier of the event's type. Each eventId is associated with an event name (see next point);
- eventName: the name of the event's type. There are seven types of events: pass, foul, shot, duel, free kick, offside and touch;
- subEventId: the identifier of the subevent's type. Each subEventId is associated with a subevent name (see next point);
- subEventName: the name of the subevent's type. Each event type is associated with a different set of subevent types;
- tags: a list of event tags, each one describes additional information about the event (e.g., accurate). Each event type is associated with a different set of tags;
- eventSec: the time when the event occurs (in seconds since the beginning of the current half of the match);
- id: a unique identifier of the event;
- matchId: the identifier of the match the event refers to. The identifier refers to the field "wyId" in the match dataset;
- matchPeriod: the period of the match. It can be "1H" (first half of the match), "2H" (second half of the match), "E1" (first extra time), "E2" (second extra time) or "P" (penalties time);
- playerId: the identifier of the player who generated the event. The identifier refers to the field "wyId" in a player dataset;
- positions: the origin and destination positions associated with the event. Each position is a pair of coordinates (x, y). The x and y coordinates are always in the range [0, 100] and indicate the percentage of the field from the perspective of the attacking team. In particular, the value of the x coordinate indicates the event's nearness (in percentage) to the opponent's goal, while the value of the y coordinates indicates the event's nearness (in percentage) to the right side of the field;
- teamId: the identifier of the player's team. The identifier refers to the field "wyId" in the team dataset.
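
Because positions are stored as percentages, converting them to the 104 x 68 scale used later in this notebook is a linear rescaling. The following is a minimal sketch under that assumption; the function names are hypothetical, and it does not flip or mirror either axis.

```python
PITCH_LENGTH = 104  # target range for x
PITCH_WIDTH = 68    # target range for y


def scale_position(pos):
    """Rescale one {'x': ..., 'y': ...} percentage pair to pitch units."""
    return {
        'x': pos['x'] * PITCH_LENGTH / 100,
        'y': pos['y'] * PITCH_WIDTH / 100,
    }


def scale_positions(positions):
    """Rescale the origin and destination positions of one event."""
    return [scale_position(p) for p in positions]


# Positions in the same shape as the 'positions' field of an event
scaled = scale_positions([{'y': 35, 'x': 0}, {'y': 100, 'x': 100}])
print(scaled)
```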

In [7]:
events1 = pd.DataFrame()

for i in country:
    path = '../data_top5/events/events_'+i+'.json'
    events = pd.read_json(path)
    events = pd.DataFrame(events)
    events1 = pd.concat([events, events1]) 

events1.reset_index(drop=True, inplace=True)
print(events1.shape)

(3071395, 12)


In [8]:
print(len(events1.matchId.unique()))
print(matches1.shape)

1826
(1826, 14)


In [9]:
events1.matchId.value_counts().mean()

1682.0345016429353

On average, about 1,682 events are recorded per match, which corresponds to roughly one event every 3.2 seconds over 90 minutes of play.
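
The event rate can be checked with a short calculation. This sketch assumes exactly 90 minutes of play per match and ignores stoppage time, so the result is only an approximation.

```python
match_seconds = 90 * 60                 # assumed playing time per match
events_per_match = 1682.0345016429353   # mean computed in the cell above

seconds_per_event = match_seconds / events_per_match
print(round(seconds_per_event, 1))
```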

# Player 

- birthArea: geographic information about the player's birth area;
- birthDate: the birth date of the player, in the format "YYYY-MM-DD";
- currentNationalTeamId: the identifier of the national team for which the player currently plays;
- currentTeamId: the identifier of the team for which the player currently plays. The identifier refers to the field "wyId" in a team document;
- firstName: the first name of the player;
- lastName: the last name of the player;
- foot: the preferred foot of the player;
- height: the height of the player (in centimeters);
- middleName: the middle name (if any) of the player;
- passportArea: the geographic area associated with the player's current passport;
- role: the main role of the player. It is a subdocument containing the role's name and two abbreviations of it;
- shortName: the short name of the player;
- weight: the weight of the player (in kilograms);
- wyId: the identifier of the player, assigned by Wyscout.

In [10]:
players = pd.read_json('../data_top5/players/players.json')
players = pd.DataFrame(players)
print(players.shape)
display(players)

(3603, 14)


Unnamed: 0,passportArea,weight,firstName,middleName,lastName,currentTeamId,birthDate,height,role,birthArea,wyId,foot,shortName,currentNationalTeamId
0,"{'name': 'Turkey', 'id': '792', 'alpha3code': ...",78,Harun,,Tekin,4502,1989-06-17,187,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'Turkey', 'id': '792', 'alpha3code': ...",32777,right,H. Tekin,4687
1,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",73,Malang,,Sarr,3775,1999-01-23,182,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'France', 'id': '250', 'alpha3code': ...",393228,left,M. Sarr,4423
2,"{'name': 'France', 'id': '250', 'alpha3code': ...",72,Over,,Mandanda,3772,1998-10-26,176,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'France', 'id': '250', 'alpha3code': ...",393230,,O. Mandanda,
3,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",82,Alfred John Momar,,N'Diaye,683,1990-03-06,187,"{'code2': 'MD', 'code3': 'MID', 'name': 'Midfi...","{'name': 'France', 'id': '250', 'alpha3code': ...",32793,right,A. N'Diaye,19314
4,"{'name': 'France', 'id': '250', 'alpha3code': ...",84,Ibrahima,,Konat\u00e9,2975,1999-05-25,192,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'France', 'id': '250', 'alpha3code': ...",393247,right,I. Konat\u00e9,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3598,"{'name': 'Tunisia', 'id': 788, 'alpha3code': '...",72,Ali,,Ma\u00e2loul,16041,1990-01-01,175,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'Tunisia', 'id': 788, 'alpha3code': '...",120839,left,A. Ma\u00e2loul,
3599,"{'name': 'Peru', 'id': 604, 'alpha3code': 'PER...",76,Carlos Alberto,,C\u00e1ceda Oyaguez,15591,1991-09-27,183,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'Peru', 'id': 604, 'alpha3code': 'PER...",114736,right,C. C\u00e1ceda,
3600,"{'name': 'Peru', 'id': 604, 'alpha3code': 'PER...",78,Miguel Gianpierre,,Araujo Blanco,12072,1994-10-24,179,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'Peru', 'id': 604, 'alpha3code': 'PER...",114908,right,M. Araujo,
3601,"{'name': 'Morocco', 'id': 504, 'alpha3code': '...",70,Ahmed Reda,,Tagnaouti,16183,1996-04-05,182,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'Morocco', 'id': 504, 'alpha3code': '...",285583,right,A. Tagnaouti,


# Events + Player DF 

In [11]:
events_com = pd.merge(right=players, left=events1, right_on='wyId', left_on='playerId', how='left')
print(events_com.shape)

(3071395, 26)


In [12]:
# playerIds in the events dataframe that have no entry in the players dataframe
noplayer = events_com[events_com['firstName'].isna().values]
noplayer

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id,passportArea,weight,firstName,middleName,lastName,currentTeamId,birthDate,height,role,birthArea,wyId,foot,shortName,currentNationalTeamId
24,5,Ball out of the field,[],0,"[{'y': 35, 'x': 0}, {'y': 100, 'x': 100}]",2565548,Interruption,682,1H,83.100786,50,180864467,,,,,,,,,,,,,,
29,5,Ball out of the field,[],0,"[{'y': 19, 'x': 7}, {'y': 100, 'x': 100}]",2565548,Interruption,682,1H,119.589776,50,180864441,,,,,,,,,,,,,,
32,5,Ball out of the field,[],0,"[{'y': 0, 'x': 61}, {'y': 0, 'x': 0}]",2565548,Interruption,695,1H,139.970551,50,180865320,,,,,,,,,,,,,,
57,5,Ball out of the field,[],0,"[{'y': 0, 'x': 46}, {'y': 100, 'x': 100}]",2565548,Interruption,682,1H,243.077555,50,180864466,,,,,,,,,,,,,,
71,5,Ball out of the field,[],0,"[{'y': 41, 'x': 0}, {'y': 100, 'x': 100}]",2565548,Interruption,682,1H,301.181860,50,180864469,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3071325,5,Ball out of the field,[],0,"[{'y': 100, 'x': 51}, {'y': 100, 'x': 100}]",2500098,Interruption,1623,2H,2589.146520,50,251596374,,,,,,,,,,,,,,
3071346,3,Free Kick,[{'id': 1801}],0,"[{'y': 87, 'x': 40}, {'y': 80, 'x': 56}]",2500098,Free Kick,1623,2H,2665.580593,31,251596391,,,,,,,,,,,,,,
3071362,1,Ground defending duel,"[{'id': 502}, {'id': 701}, {'id': 1802}]",0,"[{'y': 34, 'x': 7}, {'y': 28, 'x': 5}]",2500098,Duel,1623,2H,2712.941293,12,251596987,,,,,,,,,,,,,,
3071385,3,Free Kick,[{'id': 1801}],0,"[{'y': 81, 'x': 70}, {'y': 70, 'x': 59}]",2500098,Free Kick,1633,2H,2780.300522,31,251596224,,,,,,,,,,,,,,


In [13]:
noplayer.subEventName.value_counts()

Ball out of the field     129154
Ground defending duel      29468
Ground attacking duel      23047
Ground loose ball duel     18365
Air duel                   12320
Free Kick                   9495
Throw in                    2272
Whistle                      913
Foul                         335
Touch                        255
Goal kick                    193
Corner                       131
                              28
Simple pass                   18
Free kick cross               11
Launch                         7
Hand foul                      7
Clearance                      5
Cross                          3
Shot                           3
Save attempt                   2
Reflexes                       2
Head pass                      1
Out of game foul               1
Smart pass                     1
High pass                      1
Name: subEventName, dtype: int64

Most of these entries are interruptions (such as the ball going out of play) that are not attributed to any player; the rest appear to be data-collection anomalies.
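
An alternative way to isolate unmatched events is the `indicator` option of `DataFrame.merge`, which records for every row whether a match was found. The sketch below uses small hypothetical frames (`events_toy`, `players_toy`) rather than the real data, and treats `playerId` 0 as the placeholder for "no player", as seen in the rows above.

```python
import pandas as pd

# Hypothetical miniature versions of the events and players dataframes
events_toy = pd.DataFrame({
    'playerId': [0, 7, 0, 9],
    'subEventName': ['Ball out of the field', 'Simple pass', 'Whistle', 'Shot'],
})
players_toy = pd.DataFrame({
    'wyId': [7, 9],
    'shortName': ['A. Example', 'B. Example'],
})

merged = events_toy.merge(
    players_toy, left_on='playerId', right_on='wyId',
    how='left', indicator=True,
)
# 'left_only' marks events whose playerId has no row in players_toy
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['subEventName'].value_counts())
```

Filtering on `_merge` avoids relying on a particular player column, such as `firstName`, being null.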

In [14]:
events_com = events_com.dropna(subset=['firstName'])
events_com = events_com.reset_index(drop=True)

In [15]:
events_com.shape

(2845357, 26)

In [16]:
events_com.to_pickle('../data_top5/events/events_com.pkl')

# Extracting Player Position from Match Lineup 

## Extracting Defender Position 

In [17]:
# sys.setrecursionlimit(10000)
defence = pd.read_pickle('../data_top5/matches/top_5_lineups.pkl')

In [18]:
defence.shape

(1826, 9)

In [19]:
defence

Unnamed: 0,tournament,gameweek,date,home_team,score,away_team,match_link,home_team_defs,away_team_defs
0,Premier League,1,2017-08-11,Arsenal,4–3,Leicester City,http://www.fbref.com/en/matches/e3c3ddf0/Arsen...,"[RobHolding, NachoMonreal, SeadKolasinac]","[DannySimpson, WesMorgan, HarryMaguire, Christ..."
1,Premier League,1,2017-08-12,Watford,3–3,Liverpool,http://www.fbref.com/en/matches/60f6cc1d/Watfo...,"[DarylJanmaat, YounesKaboul, MiguelBritos, Jos...","[TrentAlexanderArnold, JoelMatip, DejanLovren,..."
2,Premier League,1,2017-08-12,West Brom,1–0,Bournemouth,http://www.fbref.com/en/matches/684f704a/West-...,"[AllanNyom, CraigDawson, AhmedHegazi, ChrisBrunt]","[SimonFrancis, SteveCook, NathanAke, CharlieDa..."
3,Premier League,1,2017-08-12,Everton,1–0,Stoke City,http://www.fbref.com/en/matches/7c834541/Evert...,"[MichaelKeane, AshleyWilliams, PhilJagielka]","[KurtZouma, RyanShawcross, GeoffCameron]"
4,Premier League,1,2017-08-12,Southampton,0–0,Swansea City,http://www.fbref.com/en/matches/e782371e/South...,"[CedricSoares, JackStephens, MayaYoshida, Ryan...","[KyleNaughton, FedericoFernandez, AlfieMawson,..."
5,Premier League,1,2017-08-12,Chelsea,2–3,Burnley,http://www.fbref.com/en/matches/71b00bca/Chels...,"[AntonioRudiger, DavidLuiz, GaryCahill]","[MatthewLowton, JamesTarkowski, BenMee, Stephe..."
6,Premier League,1,2017-08-12,Crystal Palace,0–3,Huddersfield,http://www.fbref.com/en/matches/2d369d17/Cryst...,"[TimothyFosuMensah, ScottDann, JairoRiedewald]","[TommySmith, MathiasJorgensen, ChristopherSchi..."
7,Premier League,1,2017-08-12,Brighton,0–2,Manchester City,http://www.fbref.com/en/matches/072bfc99/Brigh...,"[Bruno, LewisDunk, ShaneDuffy, MarkusSuttner]","[VincentKompany, JohnStones, NicolasOtamendi]"
8,Premier League,1,2017-08-13,Newcastle Utd,0–2,Tottenham,http://www.fbref.com/en/matches/d8a995d7/Newca...,"[JavierManquillo, FlorianLejeune, CiaranClark,...","[KyleWalkerPeters, TobyAlderweireld, JanVerton..."
9,Premier League,1,2017-08-13,Manchester Utd,4–0,West Ham,http://www.fbref.com/en/matches/f5d1f6f4/Manch...,"[AntonioValencia, EricBailly, PhilJones, Daley...","[PabloZabaleta, WinstonReid, AngeloOgbonna, Ar..."


In [20]:
defence['home_team'].unique()

array(['Arsenal', 'Watford', 'West Brom', 'Everton', 'Southampton',
       'Chelsea', 'Crystal Palace', 'Brighton', 'Newcastle Utd',
       'Manchester Utd', 'Swansea City', 'Leicester City', 'Bournemouth',
       'Burnley', 'Liverpool', 'Stoke City', 'Huddersfield', 'Tottenham',
       'Manchester City', 'West Ham', 'Bayern Munich', 'Hoffenheim',
       'Mainz 05', 'Wolfsburg', 'Hamburger SV', 'Hertha BSC',
       'Schalke 04', 'Freiburg', "M'Gladbach", 'Köln', 'Stuttgart',
       'Eint Frankfurt', 'Werder Bremen', 'Leverkusen', 'Augsburg',
       'Dortmund', 'RB Leipzig', 'Hannover 96', 'Monaco', 'Paris S-G',
       'Saint-Étienne', 'Montpellier', 'Metz', 'Lyon', 'Troyes', 'Lille',
       'Angers', 'Marseille', 'Nice', 'Rennes', 'Nantes', 'Amiens',
       'Toulouse', 'Caen', 'Bordeaux', 'Strasbourg', 'Dijon', 'Guingamp',
       'Leganés', 'Valencia', 'Celta Vigo', 'Girona', 'Sevilla',
       'Athletic Club', 'Barcelona', 'La Coruña', 'Levante', 'Málaga',
       'Real Sociedad', 'Beti

In [21]:
#defence.drop(columns = ['Unnamed: 0'], inplace=True)

In [22]:
# def clean_tokenize(s):
#     s = s.replace(",", ' ')
#     s = s.replace("[", '')
#     s = s.replace("]", '')
#     s = s.replace("'", '')
#     s = s.split()
#     return s

In [23]:
# defence['home_team_defs'] = defence['home_team_defs'].apply(lambda x: clean_tokenize(x))
# defence['away_team_defs'] = defence['away_team_defs'].apply(lambda x: clean_tokenize(x))

In [24]:
def add_home_posn(df):
    """Fill the home_* position columns in place from 'home_team_defs'."""
    layouts = {
        4: ['home_RB', 'home_R-CB', 'home_L-CB', 'home_LB'],
        3: ['home_RCB', 'home_CB', 'home_LCB'],
        5: ['home_RWB', 'home_RCB', 'home_CB', 'home_LCB', 'home_LWB'],
    }
    name_cols = sorted({col for cols in layouts.values() for col in cols})
    # The position columns start out as all-NaN floats; cast them so they can hold names
    df[name_cols] = df[name_cols].astype(object)
    for i in df.index:
        defs = df.at[i, 'home_team_defs']
        # As in the original logic, anything other than 3 or 4 defenders is treated as a back five
        cols = layouts.get(len(defs), layouts[5])
        # df.at avoids chained assignment, which may silently write to a copy
        for col, name in zip(cols, defs):
            df.at[i, col] = name
        df.at[i, 'home_backline'] = len(cols)

In [25]:
def add_away_posn(df):
    """Fill the away_* position columns in place from 'away_team_defs'."""
    layouts = {
        4: ['away_RB', 'away_R-CB', 'away_L-CB', 'away_LB'],
        3: ['away_RCB', 'away_CB', 'away_LCB'],
        5: ['away_RWB', 'away_RCB', 'away_CB', 'away_LCB', 'away_LWB'],
    }
    name_cols = sorted({col for cols in layouts.values() for col in cols})
    # The position columns start out as all-NaN floats; cast them so they can hold names
    df[name_cols] = df[name_cols].astype(object)
    for i in df.index:
        defs = df.at[i, 'away_team_defs']
        # As in the original logic, anything other than 3 or 4 defenders is treated as a back five
        cols = layouts.get(len(defs), layouts[5])
        # df.at avoids chained assignment, which may silently write to a copy
        for col, name in zip(cols, defs):
            df.at[i, col] = name
        df.at[i, 'away_backline'] = len(cols)

In [26]:
defence = defence.reindex(columns=defence.columns.tolist() + [
    'home_RB',
    'home_R-CB',
    'home_L-CB',
    'home_LB',
    'home_RCB',
    'home_CB',
    'home_LCB',
    'home_RWB',
    'home_LWB',
    'away_RB',
    'away_R-CB',
    'away_L-CB',
    'away_LB',
    'away_RCB',
    'away_CB',
    'away_LCB',
    'away_RWB',
    'away_LWB',
    'home_backline',
    'away_backline'
])

In [27]:
add_home_posn(defence)
add_away_posn(defence)

## Merge Match_IDs 

In [28]:
def label_splitter(df):
    df['label'] = df['label'].apply(lambda x: x.replace(",", ' '))
    df['label'] = df['label'].apply(lambda x: x.split('  '))
    df['score'] = df['label'].apply(lambda x: x[1])
    df['label'] = df['label'].apply(lambda x: x[0].split(' - '))
    df['home_team'] = df['label'].apply(lambda x: x[0])
    df['away_team'] = df['label'].apply(lambda x: x[1])
    return df
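
The same split can also be done with a single regular expression, which makes the expected label format explicit and fails loudly on anything unexpected. This is a sketch, not the notebook's method: `LABEL_RE` and `parse_label` are hypothetical names, and the pattern assumes labels always look like "Home - Away, H - A".

```python
import re

# Non-greedy home group so the first ' - ' separates the two team names
LABEL_RE = re.compile(
    r'^(?P<home>.+?) - (?P<away>.+), (?P<home_goals>\d+) - (?P<away_goals>\d+)$'
)


def parse_label(label):
    """Split a match label into home team, away team and score."""
    m = LABEL_RE.match(label)
    if m is None:
        raise ValueError(f'unexpected label format: {label!r}')
    return {
        'home_team': m['home'],
        'away_team': m['away'],
        'score': f"{m['home_goals']} - {m['away_goals']}",
    }


parsed = parse_label('Lazio - Internazionale, 2 - 3')
print(parsed)
```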

In [29]:
label_splitter(matches1)

Unnamed: 0,status,roundId,gameweek,teamsData,seasonId,dateutc,winner,venue,wyId,label,date,referees,duration,competitionId,score,home_team,away_team
0,Played,4406122,38,"{'676': {'scoreET': 0, 'coachId': 92894, 'side...",181144,2018-05-20 18:45:00,676,Camp Nou,2565922,"[Barcelona, Real Sociedad]","May 20, 2018 at 8:45:00 PM GMT+2","[{'refereeId': 398931, 'role': 'referee'}, {'r...",Regular,795,1 - 0,Barcelona,Real Sociedad
1,Played,4406122,38,"{'679': {'scoreET': 0, 'coachId': 3427, 'side'...",181144,2018-05-20 16:30:00,0,Estadio Wanda Metropolitano,2565925,"[Atl\u00e9tico Madrid, Eibar]","May 20, 2018 at 6:30:00 PM GMT+2","[{'refereeId': 395056, 'role': 'referee'}, {'r...",Regular,795,2 - 2,Atl\u00e9tico Madrid,Eibar
2,Played,4406122,38,"{'691': {'scoreET': 0, 'coachId': 444778, 'sid...",181144,2018-05-20 14:15:00,691,San Mam\u00e9s Barria,2565919,"[Athletic Club, Espanyol]","May 20, 2018 at 4:15:00 PM GMT+2","[{'refereeId': 384957, 'role': 'referee'}, {'r...",Regular,795,0 - 1,Athletic Club,Espanyol
3,Played,4406122,38,"{'674': {'scoreET': 0, 'coachId': 210074, 'sid...",181144,2018-05-20 10:00:00,674,Estadio de Mestalla,2565924,"[Valencia, Deportivo La Coru\u00f1a]","May 20, 2018 at 12:00:00 PM GMT+2","[{'refereeId': 398913, 'role': 'referee'}, {'r...",Regular,795,2 - 1,Valencia,Deportivo La Coru\u00f1a
4,Played,4406122,38,"{'675': {'scoreET': 0, 'coachId': 275283, 'sid...",181144,2018-05-19 18:45:00,0,Estadio de la Cer\u00e1mica,2565927,"[Villarreal, Real Madrid]","May 19, 2018 at 8:45:00 PM GMT+2","[{'refereeId': 395085, 'role': 'referee'}, {'r...",Regular,795,2 - 2,Villarreal,Real Madrid
5,Played,4406122,38,"{'696': {'scoreET': 0, 'coachId': 230918, 'sid...",181144,2018-05-19 16:30:00,680,Estadio Ram\u00f3n S\u00e1nchez Pizju\u00e1n,2565920,"[Sevilla, Deportivo Alav\u00e9s]","May 19, 2018 at 6:30:00 PM GMT+2","[{'refereeId': 379388, 'role': 'referee'}, {'r...",Regular,795,1 - 0,Sevilla,Deportivo Alav\u00e9s
6,Played,4406122,38,"{'698': {'scoreET': 0, 'coachId': 4107, 'side'...",181144,2018-05-19 16:30:00,698,Estadio La Rosaleda,2565921,"[M\u00e1laga, Getafe]","May 19, 2018 at 6:30:00 PM GMT+2","[{'refereeId': 398919, 'role': 'referee'}, {'r...",Regular,795,0 - 1,M\u00e1laga,Getafe
7,Played,4406122,38,"{'714': {'scoreET': 0, 'coachId': 4258, 'side'...",181144,2018-05-19 16:30:00,756,Estadio de Gran Canaria,2565923,"[Las Palmas, Girona]","May 19, 2018 at 6:30:00 PM GMT+2","[{'refereeId': 381854, 'role': 'referee'}, {'r...",Regular,795,1 - 2,Las Palmas,Girona
8,Played,4406122,38,"{'684': {'scoreET': 0, 'coachId': 0, 'side': '...",181144,2018-05-19 14:15:00,712,Estadio Municipal de Butarque,2565926,"[Legan\u00e9s, Real Betis]","May 19, 2018 at 4:15:00 PM GMT+2","[{'refereeId': 381927, 'role': 'referee'}, {'r...",Regular,795,3 - 2,Legan\u00e9s,Real Betis
9,Played,4406122,38,"{'692': {'scoreET': 0, 'coachId': 3880, 'side'...",181144,2018-05-19 11:00:00,692,Estadio de Bala\u00eddos,2565918,"[Celta de Vigo, Levante]","May 19, 2018 at 1:00:00 PM GMT+2","[{'refereeId': 395078, 'role': 'referee'}, {'r...",Regular,795,4 - 2,Celta de Vigo,Levante


In [30]:
defence.tournament.unique()

array(['Premier League', 'Bundesliga', 'Ligue 1 Conforama',
       'LaLiga Santander', 'Serie A TIM'], dtype=object)

In [31]:
matches1['home_team'] = matches1['home_team'].swifter.set_npartitions(
    8).apply(lambda x: x.encode().decode('unicode_escape').replace('\xad',''))
matches1['away_team'] = matches1['away_team'].swifter.set_npartitions(
    8).apply(lambda x: x.encode().decode('unicode_escape').replace('\xad',''))


In [32]:
sorted(matches1[matches1['competitionId']==524].home_team.unique())

['Atalanta',
 'Benevento',
 'Bologna',
 'Cagliari',
 'Chievo',
 'Crotone',
 'Fiorentina',
 'Genoa',
 'Hellas Verona',
 'Internazionale',
 'Juventus',
 'Lazio',
 'Milan',
 'Napoli',
 'Roma',
 'SPAL',
 'Sampdoria',
 'Sassuolo',
 'Torino',
 'Udinese']

In [33]:
sorted(defence[defence['tournament']=='Bundesliga'].home_team.unique())

['Augsburg',
 'Bayern Munich',
 'Dortmund',
 'Eint Frankfurt',
 'Freiburg',
 'Hamburger SV',
 'Hannover 96',
 'Hertha BSC',
 'Hoffenheim',
 'Köln',
 'Leverkusen',
 "M'Gladbach",
 'Mainz 05',
 'RB Leipzig',
 'Schalke 04',
 'Stuttgart',
 'Werder Bremen',
 'Wolfsburg']

In [34]:
# Map Wyscout team names to the names used in the lineup data
team_name_map = {
    'Brighton & Hove Albion': 'Brighton',
    'AFC Bournemouth': 'Bournemouth',
    'Huddersfield Town': 'Huddersfield',
    'Manchester United': 'Manchester Utd',
    'Newcastle United': 'Newcastle Utd',
    'Tottenham Hotspur': 'Tottenham',
    'West Bromwich Albion': 'West Brom',
    'West Ham United': 'West Ham',
    'Bayer Leverkusen': 'Leverkusen',
    'Bayern München': 'Bayern Munich',
    'Borussia Dortmund': 'Dortmund',
    "Borussia M'gladbach": "M'Gladbach",
    'Eintracht Frankfurt': 'Eint Frankfurt',
    'Amiens SC': 'Amiens',
    'Angers SCO': 'Angers',
    'Olympique Lyonnais': 'Lyon',
    'PSG': 'Paris S-G',
    'Olympique Marseille': 'Marseille',
    'Deportivo Alavés': 'Alavés',
    'Real Betis': 'Betis',
    'Celta de Vigo': 'Celta Vigo',
    'Deportivo La Coruña': 'La Coruña',
    'Internazionale': 'Inter'
}

# The same mapping applies to both the home and the away column
matches1 = matches1.replace({
    'home_team': team_name_map,
    'away_team': team_name_map
})
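
A quick way to find names that still need a mapping entry is to compare the renamed Wyscout names against the names in the lineup data. The sketch below is a hypothetical helper working on plain lists; the example names are illustrative, and the partial mapping is deliberately incomplete to show what a leftover looks like.

```python
def unmatched_names(source_names, target_names, mapping):
    """Names from source that are absent from target after applying mapping."""
    renamed = {mapping.get(name, name) for name in source_names}
    return sorted(renamed - set(target_names))


# Illustrative inputs; the mapping intentionally omits 'Internazionale'
wyscout_names = ['Manchester United', 'Arsenal', 'Internazionale']
lineup_names = ['Manchester Utd', 'Arsenal', 'Inter']
partial_map = {'Manchester United': 'Manchester Utd'}

leftover = unmatched_names(wyscout_names, lineup_names, partial_map)
print(leftover)
```

An empty result would indicate that every team name on the Wyscout side has a counterpart in the lineup data.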

In [35]:
matches1.drop(columns=['label'], inplace=True)
matches1['match'] = matches1['home_team'] + '-' + matches1['away_team']

In [36]:
defence['match'] = defence['home_team'] + '-' + defence['away_team']

In [37]:
match_def = pd.merge(right=matches1, left=defence, right_on='match', left_on='match')

In [38]:
match_def.shape

(1826, 46)
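
The row count matching the number of matches suggests that every `match` key was paired exactly once. One way to make that check explicit is the `validate` and `indicator` options of `DataFrame.merge`. The frames below (`left_toy`, `right_toy`) are hypothetical stand-ins, so this is a sketch of the check rather than a rerun of the merge above.

```python
import pandas as pd

# Hypothetical stand-ins for the lineup and Wyscout match dataframes
left_toy = pd.DataFrame({'match': ['A-B', 'B-A'], 'home_backline': [4, 3]})
right_toy = pd.DataFrame({'match': ['A-B', 'B-A'], 'wyId': [1, 2]})

# validate raises pandas.errors.MergeError if a key is duplicated on either side;
# an outer join with indicator exposes keys present on only one side
checked = left_toy.merge(
    right_toy, on='match', how='outer',
    validate='one_to_one', indicator=True,
)
all_paired = bool((checked['_merge'] == 'both').all())
print(all_paired)
```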

In [39]:
match_def.columns

Index(['tournament', 'gameweek_x', 'date_x', 'home_team_x', 'score_x',
       'away_team_x', 'match_link', 'home_team_defs', 'away_team_defs',
       'home_RB', 'home_R-CB', 'home_L-CB', 'home_LB', 'home_RCB', 'home_CB',
       'home_LCB', 'home_RWB', 'home_LWB', 'away_RB', 'away_R-CB', 'away_L-CB',
       'away_LB', 'away_RCB', 'away_CB', 'away_LCB', 'away_RWB', 'away_LWB',
       'home_backline', 'away_backline', 'match', 'status', 'roundId',
       'gameweek_y', 'teamsData', 'seasonId', 'dateutc', 'winner', 'venue',
       'wyId', 'date_y', 'referees', 'duration', 'competitionId', 'score_y',
       'home_team_y', 'away_team_y'],
      dtype='object')

In [40]:
match_def.drop(columns=[
    'status', 'roundId', 'seasonId', 'date_x', 'winner',
    'duration', 'competitionId', 'home_team_y', 'away_team_y', 'match_link'
], inplace=True)

In [41]:
len(match_def)

1826

In [42]:
match_def.to_pickle('../data_top5/matches/matches_defence_top5.pkl')

# Event Data Pre-Processing

## Normalizing Names 

In [43]:
events_com = pd.read_pickle('../data_top5/events/events_com.pkl')

In [44]:
events_com['firstName'] = events_com['firstName'].swifter.set_npartitions(
    8).apply(lambda x: x.encode().decode('unicode_escape').replace('\xad',''))
events_com['lastName'] = events_com['lastName'].swifter.set_npartitions(
    8).apply(lambda x: x.encode().decode('unicode_escape').replace('\xad',''))

In [45]:
players['firstName'] = players['firstName'].swifter.set_npartitions(
    8).apply(lambda x: x.encode().decode('unicode_escape').replace('\xad',''))
players['lastName'] = players['lastName'].swifter.set_npartitions(
    8).apply(lambda x: x.encode().decode('unicode_escape').replace('\xad',''))
players['playerName'] = players['firstName']+players['lastName']
players['playerName'] = players['playerName'].apply(lambda x: unidecode(x))

In [46]:
players['playerName'] = players['playerName'].apply(lambda x: x.replace('-', ''))

In [47]:
players['playerName'] = players['playerName'].apply(lambda x: x.replace(' ', ''))
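
The name clean-up above can be sketched end to end on a single player. This is only an illustration: `normalise_name` is a hypothetical helper, and the standard-library `unicodedata` module stands in for `unidecode`, so characters that do not decompose into a base letter plus an accent (for example `ø`) are not transliterated the way `unidecode` would.

```python
import unicodedata

def normalise_name(first_name, last_name):
    """Hypothetical helper mirroring the notebook's steps on one player."""
    parts = []
    for part in (first_name, last_name):
        # Turn literal escape sequences such as \u0161 into real characters
        # and drop soft hyphens, as in the swifter apply above.
        parts.append(part.encode().decode('unicode_escape').replace('\xad', ''))
    name = ''.join(parts)
    # Stand-in for unidecode: decompose accented letters, drop combining marks.
    decomposed = unicodedata.normalize('NFKD', name)
    ascii_name = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # Remove hyphens and spaces so names can be compared across sources.
    return ascii_name.replace('-', '').replace(' ', '')

# The doubled backslash makes the input contain the literal text \u0161,
# which is how the escaped names are assumed to appear in the raw data.
print(normalise_name('Sead', 'Kola\\u0161inac'))
print(normalise_name('Pierre-Emerick', 'Aubameyang'))
```

The same key format (no spaces, hyphens or accents) has to be produced for the lineup names, otherwise the later merge on `playerName` finds no match.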

In [48]:
# Manual footedness overrides for five players.
foot_overrides = {
    'DiegoGonzalezPolanco': 'left',
    'MujaidSadickAliu': 'right',
    'DavidAlbaFernandez': 'right',
    'DavidAlcibiade': 'right',
    'DylanLempereur': 'left',
}
for name, foot in foot_overrides.items():
    players.loc[players['playerName'] == name, 'foot'] = foot

In [49]:
players.to_pickle('../data_top5/players/players.pkl')

## Modifying match_def 

In [50]:
match_def.columns

Index(['tournament', 'gameweek_x', 'home_team_x', 'score_x', 'away_team_x',
       'home_team_defs', 'away_team_defs', 'home_RB', 'home_R-CB', 'home_L-CB',
       'home_LB', 'home_RCB', 'home_CB', 'home_LCB', 'home_RWB', 'home_LWB',
       'away_RB', 'away_R-CB', 'away_L-CB', 'away_LB', 'away_RCB', 'away_CB',
       'away_LCB', 'away_RWB', 'away_LWB', 'home_backline', 'away_backline',
       'match', 'gameweek_y', 'teamsData', 'dateutc', 'venue', 'wyId',
       'date_y', 'referees', 'score_y'],
      dtype='object')

In [52]:
match_def_away = match_def[['wyId', 'away_team_x', 'away_team_defs', 'away_RB', 'away_R-CB', 'away_L-CB', 'away_LB', 'away_RCB',
       'away_CB', 'away_LCB', 'away_RWB', 'away_LWB', 'away_backline', 'match', 'gameweek_x', 'teamsData',
       'dateutc', 'venue', 'referees', 'score_x']]

In [53]:
match_def_away.columns = ['wyId', 'team', 'team_defense', 'RB', 'R-CB', 'L-CB', 'LB', 'RCB',
       'CB', 'LCB', 'RWB', 'LWB', 'backline', 'match', 'gameweek', 'teamsData',
       'dateutc', 'venue', 'referees', 'score']

In [54]:
match_def = match_def[['wyId', 'home_team_x', 'home_team_defs', 'home_RB', 'home_R-CB', 'home_L-CB', 'home_LB', 'home_RCB',
       'home_CB', 'home_LCB', 'home_RWB', 'home_LWB', 'home_backline', 'match', 'gameweek_x', 'teamsData',
       'dateutc', 'venue', 'referees', 'score_x']]

In [55]:
match_def.columns = ['wyId', 'team', 'team_defense', 'RB', 'R-CB', 'L-CB', 'LB', 'RCB',
       'CB', 'LCB', 'RWB', 'LWB', 'backline', 'match', 'gameweek', 'teamsData',
       'dateutc', 'venue', 'referees', 'score']

In [56]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
match_def = pd.concat([match_def, match_def_away], ignore_index=True)

In [57]:
match_def.sort_values(by=['gameweek'], inplace=True, ascending=False)

In [58]:
match_def.reset_index(drop=True, inplace=True)

In [59]:
match_def.to_pickle('../data_top5/matches/match+def_lineup_top5.pkl')
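
The reshaping above turns one row per match into one row per team. A minimal sketch of the same idea on a toy frame with only a few of the columns; `side_view` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# One row per match, with separate home_* and away_* columns.
wide = pd.DataFrame({
    'wyId': [2499719],
    'home_team_x': ['Arsenal'],
    'away_team_x': ['Leicester City'],
    'home_backline': [3.0],
    'away_backline': [4.0],
})

def side_view(df, side):
    """Select one side's columns and rename them to side-neutral names."""
    cols = {f'{side}_team_x': 'team', f'{side}_backline': 'backline'}
    return df[['wyId', *cols]].rename(columns=cols)

# Stack the home and away views so each row denotes a team in a match.
long = pd.concat([side_view(wide, 'home'), side_view(wide, 'away')],
                 ignore_index=True)
print(long)
```

The stacked frame has exactly twice as many rows as the per-match frame, which is a quick sanity check after the real reshaping.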

## Adding Footedness 

In [60]:
events_com['playerName'] = events_com['firstName']+events_com['lastName']
events_com['playerName'] = events_com['playerName'].apply(lambda x: unidecode(x))
events_com['playerName'] = events_com['playerName'].apply(lambda x: x.replace(" ", ''))
events_com['playerName'] = events_com['playerName'].apply(lambda x: x.replace('-', ''))
events_com.drop(columns=[
    'currentTeamId', 'passportArea', 'weight', 'firstName', 'lastName',
    'middleName', 'birthDate', 'height', 'role', 'birthArea', 'shortName',
    'currentNationalTeamId'
], inplace=True)
# events_com.rename(columns={'id': 'eventId', 'wyId': 'playerId'})

In [61]:
events_com.to_pickle('../data_top5/events/events_com.pkl')

In [62]:
events_com = pd.read_pickle('../data_top5/events/events_com.pkl')
match_def = pd.read_pickle('../data_top5/matches/match+def_lineup_top5.pkl')

In [63]:
imp = pd.DataFrame(match_def[['team', 'team_defense', 'wyId']].explode('team_defense'))

In [64]:
imp.loc[((imp['team'] == 'Real Madrid') & (imp['team_defense'] == 'Marcelo')),
        'team_defense'] = 'MarceloVieiradaSilvaJunior'
imp.loc[((imp['team'] == 'Lyon') & (imp['team_defense'] == 'Marcelo')),
        'team_defense'] = 'MarceloAntonioGuedesFilho'
imp.loc[((imp['team'] == 'Atlético Madrid') &
         (imp['team_defense'] == 'Juanfran')),
        'team_defense'] = 'JuanFranciscoTorresBelen'
imp.loc[((imp['team'] == 'La Coruña') &
         (imp['team_defense'] == 'Juanfran')),
        'team_defense'] = 'JuanFranciscoMorenoFuertes'
imp.loc[((imp['team'] == 'Schalke 04') &
         (imp['team_defense'] == 'Naldo')),
        'team_defense'] = 'RonaldoAparecidoRodrigues'
imp.loc[((imp['team'] == 'Espanyol') &
         (imp['team_defense'] == 'Naldo')),
        'team_defense'] = 'EdinaldoGomesPereira'

In [65]:
imp = pd.DataFrame(imp.groupby(['wyId', 'team'])['team_defense'].apply(list)).reset_index()

In [66]:
match_def = match_def.sort_values(by=['wyId', 'team']).reset_index(drop=True)
match_def['team_defense'] = imp['team_defense']
match_def.to_pickle('../data_top5/matches/match+def_lineup_top5.pkl')
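
The explode, rename and regroup round trip used above can be sketched on toy data. The match ids and the `PlayerA`/`PlayerB` names are hypothetical; only the two Marcelo mappings come from the cell above.

```python
import pandas as pd

lineups = pd.DataFrame({
    'wyId': [10, 11],  # hypothetical match ids
    'team': ['Real Madrid', 'Lyon'],
    'team_defense': [['PlayerA', 'Marcelo'], ['PlayerB', 'Marcelo']],
})

# One row per (match, team, defender); the original index is repeated.
long = lineups.explode('team_defense')

# The same nickname maps to a different full name depending on the team.
is_marcelo = long['team_defense'] == 'Marcelo'
long.loc[is_marcelo & (long['team'] == 'Real Madrid'),
         'team_defense'] = 'MarceloVieiradaSilvaJunior'
long.loc[is_marcelo & (long['team'] == 'Lyon'),
         'team_defense'] = 'MarceloAntonioGuedesFilho'

# Collapse back to one list per (match, team); groupby sorts by the keys
# and keeps the defenders in their original order within each group.
regrouped = (long.groupby(['wyId', 'team'])['team_defense']
                 .apply(list).reset_index())
print(regrouped)
```

Because `groupby` returns rows sorted by `wyId` and `team`, assigning the result back by position only lines up if the target frame is sorted the same way and has exactly one row per pair, which is why `match_def` is sorted by those columns first.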

In [67]:
match_def = pd.read_pickle('../data_top5/matches/match+def_lineup_top5.pkl')

In [68]:
a = pd.DataFrame(match_def[['team_defense', 'backline']].explode('team_defense'))

In [69]:
players = pd.read_pickle('../data_top5/players/players.pkl')

In [70]:
p = players[['playerName', 'foot']]

In [71]:
b = pd.merge(right=p, left=a, right_on='playerName', left_on='team_defense', how='left')

Storing the lineup names that found no match in the players table (missing `foot`) in a NumPy array

In [72]:
np.save('../data_top5/players/mismatch_names.npy', b[b['foot'].isna()]['team_defense'].unique())

**Importing PlayerMap to match names across DFs**

In [73]:
player_map = pd.read_csv('../data_top5/players/PlayerMap.csv')

In [74]:
player_map.head()

Unnamed: 0,Lineup Name,Wyscout Name
0,FelipedalBelo,FelipeDiasdaSilvadalBelo
1,CristianZapata,CristianEduardoZapataValencia
2,RicardoRodriguez,RicardoIvanRodriguezAraya
3,AliAdnanKadhim,AliAdnanKadhimAlTameemi
4,SamirSantos,SamirCaetanodeSouzaSantos


In [75]:
# Drop the ambiguous nicknames handled manually above; filtering with isin keeps
# the cell safe to re-run once the rows are already gone from PlayerMap.csv.
ambiguous = ['Naldo', 'Marcelo', 'Juanfran']
player_map = player_map[~player_map['Lineup Name'].isin(ambiguous)].reset_index(drop=True)

In [76]:
# index=False avoids writing the row index as an extra unnamed column.
player_map.to_csv('../data_top5/players/PlayerMap.csv', index=False)

In [77]:
player_map = dict(zip(player_map['Lineup Name'], player_map['Wyscout Name']))

In [78]:
a = match_def[['team_defense', 'backline']].explode('team_defense')
a = a.replace({'team_defense': player_map})
p = players[['playerName', 'foot']]
b = pd.merge(right=p, left=a, right_on='playerName', left_on='team_defense', how='left')

In [79]:
b['foot'].isna().sum()

0

In [80]:
b[b['foot'].isna()]['team_defense'].unique()

array([], dtype=object)

In [81]:
b.index = a.index

In [82]:
f = [list(b['foot'][i]) for i in a.index.unique()]
f = pd.Series(f)
match_def['footedness'] = f.values
match_def['footedness'] = match_def['footedness'].apply(lambda x: '-'.join(x))
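
The footedness string is built by mapping each lineup name to its Wyscout name, looking up the preferred foot, and joining the results in lineup order. A plain-Python sketch of that logic for one row, using the Arsenal back three shown in this notebook; the small dictionaries stand in for `PlayerMap.csv` and the players table, and `footedness` is a hypothetical helper:

```python
# Lineup name -> Wyscout name (stand-in for the PlayerMap dictionary).
player_map = {'NachoMonreal': 'IgnacioMonrealEraso'}

# Wyscout name -> preferred foot (stand-in for players[['playerName', 'foot']]).
foot_by_player = {
    'RobHolding': 'right',
    'IgnacioMonrealEraso': 'left',
    'SeadKolasinac': 'left',
}

def footedness(lineup):
    """Return e.g. 'right-left-left' for a list of lineup names."""
    feet = []
    for name in lineup:
        # Names missing from the map are assumed to match Wyscout already.
        wyscout_name = player_map.get(name, name)
        feet.append(foot_by_player[wyscout_name])
    return '-'.join(feet)

print(footedness(['RobHolding', 'NachoMonreal', 'SeadKolasinac']))
```

A `KeyError` here plays the role of the `isna()` check above: it signals a defender whose name still has no counterpart in the players table.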

In [83]:
match_def.reset_index(drop=True, inplace=True)
# match_def.rename(columns={'wyId': 'matchId'})

In [84]:
match_def

Unnamed: 0,wyId,team,team_defense,RB,R-CB,L-CB,LB,RCB,CB,LCB,RWB,LWB,backline,match,gameweek,teamsData,dateutc,venue,referees,score,footedness
0,2499719,Arsenal,"[RobHolding, NachoMonreal, SeadKolasinac]",,,,,RobHolding,NachoMonreal,SeadKolasinac,,,3.0,Arsenal-Leicester City,1,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",2017-08-11 18:45:00,Emirates Stadium,"[{'refereeId': 385909, 'role': 'referee'}, {'r...",4–3,right-left-left
1,2499719,Leicester City,"[DannySimpson, WesMorgan, HarryMaguire, Christ...",DannySimpson,WesMorgan,HarryMaguire,ChristianFuchs,,,,,,4.0,Arsenal-Leicester City,1,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",2017-08-11 18:45:00,Emirates Stadium,"[{'refereeId': 385909, 'role': 'referee'}, {'r...",4–3,right-right-right-left
2,2499720,Brighton,"[Bruno, LewisDunk, ShaneDuffy, MarkusSuttner]",Bruno,LewisDunk,ShaneDuffy,MarkusSuttner,,,,,,4.0,Brighton-Manchester City,1,"{'1651': {'scoreET': 0, 'coachId': 8093, 'side...",2017-08-12 16:30:00,The American Express Community Stadium,"[{'refereeId': 384965, 'role': 'referee'}, {'r...",0–2,right-right-right-left
3,2499720,Manchester City,"[VincentKompany, JohnStones, NicolasOtamendi]",,,,,VincentKompany,JohnStones,NicolasOtamendi,,,3.0,Brighton-Manchester City,1,"{'1651': {'scoreET': 0, 'coachId': 8093, 'side...",2017-08-12 16:30:00,The American Express Community Stadium,"[{'refereeId': 384965, 'role': 'referee'}, {'r...",0–2,right-right-right
4,2499721,Burnley,"[MatthewLowton, JamesTarkowski, BenMee, Stephe...",MatthewLowton,JamesTarkowski,BenMee,StephenWard,,,,,,4.0,Chelsea-Burnley,1,"{'1646': {'scoreET': 0, 'coachId': 8880, 'side...",2017-08-12 14:00:00,Stamford Bridge,"[{'refereeId': 378951, 'role': 'referee'}, {'r...",2–3,right-right-left-left
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3647,2576336,Sassuolo,"[MauricioLemos, FrancescoAcerbi, FedericoPeluso]",,,,,MauricioLemos,FrancescoAcerbi,FedericoPeluso,,,3.0,Sassuolo-Roma,38,"{'3158': {'scoreET': 0, 'coachId': 210119, 'si...",2018-05-20 18:45:00,MAPEI Stadium - Citt\u00e0 del Tricolore,"[{'refereeId': 377255, 'role': 'referee'}, {'r...",0–1,right-left-left
3648,2576337,SPAL,"[LorencoSimic, FrancescoVicari, FelipedalBelo]",,,,,LorencoSimic,FrancescoVicari,FelipedalBelo,,,3.0,SPAL-Sampdoria,38,"{'3164': {'scoreET': 0, 'coachId': 210121, 'si...",2018-05-20 16:00:00,,"[{'refereeId': 377256, 'role': 'referee'}, {'r...",3–1,right-right-left
3649,2576337,Sampdoria,"[BartoszBereszynski, JoachimAndersen, VascoReg...",BartoszBereszynski,JoachimAndersen,VascoRegini,NicolaMurru,,,,,,4.0,SPAL-Sampdoria,38,"{'3164': {'scoreET': 0, 'coachId': 210121, 'si...",2018-05-20 16:00:00,,"[{'refereeId': 377256, 'role': 'referee'}, {'r...",3–1,right-right-left-left
3650,2576338,Genoa,"[DavideBiraschi, JawadElYamiq, ArmandoIzzo]",,,,,DavideBiraschi,JawadElYamiq,ArmandoIzzo,,,3.0,Genoa-Torino,38,"{'3185': {'scoreET': 0, 'coachId': 21155, 'sid...",2018-05-20 13:00:00,,"[{'refereeId': 393614, 'role': 'referee'}, {'r...",1–2,right-right-right


In [85]:
match_def.to_pickle('../data_top5/matches/match+def_lineup+footedness_ver2_top5.pkl')

## Mapping Event Tags

In [86]:
tag_key = pd.read_csv('../data_top5/events/tags2name.csv')

In [87]:
k = list(zip(tag_key['Tag'], tag_key['Description']))

In [88]:
def unpack(x):
    # Each tag is a dict; keep only its values.
    return [list(d.values()) for d in x]

In [89]:
# Tag id -> description lookup; reversing k keeps the first entry for a repeated id.
tag_lookup = dict(reversed(k))

def functag2name(x):
    # Translate a list of tag ids into their descriptions.
    return [tag_lookup[int(val)] for val in x]

In [90]:
events_com['tags'] = events_com['tags'].swifter.set_npartitions(8).apply(lambda x: unpack(x))
events_com['tags'] = events_com['tags'].swifter.set_npartitions(8).apply(lambda x: list(chain(*x)))

In [91]:
events_com['tags'] = events_com['tags'].swifter.set_npartitions(8).apply(lambda x: functag2name(x))

In [92]:
#event.to_pickle('../data/events/events+players.pkl')
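
The tag mapping can be sketched for a single event. The descriptions below are illustrative placeholders for a few tag ids; the real table is whatever `tags2name.csv` contains, and the `{'id': ...}` shape of each tag is an assumption based on how `unpack` reads the values.

```python
from itertools import chain

# Illustrative subset of the id -> description table (placeholder wording).
tag_names = {101: 'Goal', 402: 'Right foot', 1801: 'Accurate'}

# Tags of one event as assumed to appear in the raw data.
raw_tags = [{'id': 101}, {'id': 402}, {'id': 1801}]

# Step 1: keep only the values of each dict and flatten them into one list.
flat_ids = list(chain.from_iterable(d.values() for d in raw_tags))

# Step 2: translate every id into its description.
names = [tag_names[int(tag_id)] for tag_id in flat_ids]
print(names)
```

A dictionary lookup is used here instead of scanning a list of pairs, which matters when the mapping is applied to millions of events.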

## Converting Position Coordinates

In [93]:
#events = pd.read_pickle('../data/events/events+players.pkl')

In [94]:
def clean_coordinates(x):
    x = [list(d.values()) for d in x]
    x = [l[::-1] for l in x]
    return x

In [95]:
events_com['positions'] = events_com['positions'].swifter.set_npartitions(8).apply(lambda x: clean_coordinates(x))

In [96]:
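# Scale each [x, y] pair from the 0-100 range to metres and flip the y axis:
# multiplying by [-1.04, 0.68] and subtracting from [0, 68] gives [1.04*x, 68 - 0.68*y].
events_com['positions'] = events_com['positions'].swifter.set_npartitions(8).apply(lambda x: [np.multiply([-1.04, 0.68], i) for i in x])
events_com['positions'] = events_com['positions'].swifter.set_npartitions(8).apply(lambda x: [np.subtract([0, 68], i) for i in x])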

In [97]:
def roundcoords(x):
    final_roundedcoords = list()
    for l in x:
        roundedcoords = [np.round(val,decimals=2) for val in l]
        final_roundedcoords.append(roundedcoords)
    return final_roundedcoords

In [98]:
events_com['positions'] = events_com['positions'].apply(lambda x: roundcoords(x))
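
Taken together, the three steps above (reorder, scale and flip, round) map a raw position onto a 104 x 68 pitch with the origin at the bottom left. A plain-Python sketch of the combined transform for one point; `to_pitch_metres` is a hypothetical helper, and the input is assumed to be a dict with `x` and `y` in the 0-100 range:

```python
def to_pitch_metres(pos):
    """Convert one raw position dict to [x, y] in metres.

    x is stretched from 0-100 to 0-104; y is stretched to 0-68 and
    flipped, so y = 0 in the raw data becomes y = 68 on the pitch.
    """
    x = round(pos['x'] * 1.04, 2)
    y = round(68 - pos['y'] * 0.68, 2)
    return [x, y]

# Corners and centre of the raw coordinate system.
print(to_pitch_metres({'y': 0, 'x': 0}))
print(to_pitch_metres({'y': 50, 'x': 50}))
print(to_pitch_metres({'y': 100, 'x': 100}))
```

Checking a few known points like these is a cheap way to confirm that the axis flip goes in the intended direction.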

In [99]:
df_players = players

In [100]:
# Adding Roles
# Pull the three-letter role code out of each player's 'role' dict.
df_players_roles = pd.DataFrame({
    'role': [r['code3'] for r in df_players['role']],
    'playerId': df_players['wyId'].values,
})

# Inner merge: events whose playerId has no entry in the players table are dropped.
df_events_roles = events_com.merge(df_players_roles, on='playerId')

df_events_roles.to_pickle('../data_top5/events/events_com.pkl')

# Changing Names in match_def permanently 

In [101]:
match_def = pd.read_pickle('../data_top5/matches/match+def_lineup+footedness_ver2_top5.pkl')
player_map = pd.read_csv('../data_top5/players/PlayerMap.csv')
player_map = dict(zip(player_map['Lineup Name'], player_map['Wyscout Name']))

In [102]:
imp = pd.DataFrame(match_def[['team', 'team_defense', 'wyId']].explode('team_defense'))
imp = imp.replace({'team_defense': player_map})
imp = pd.DataFrame(imp.groupby(['wyId', 'team'])['team_defense'].apply(list)).reset_index()

In [103]:
match_def = match_def.sort_values(by=['wyId', 'team']).reset_index(drop=True)
match_def['team_defense'] = imp['team_defense']

In [104]:
def add_posn(df):
    # Write each defender into its position column with .at, which avoids
    # chained assignment (df[col][i] = ...) writing to a temporary copy.
    for i in range(0, len(df)):
        defenders = df.at[i, 'team_defense']
        if len(defenders) == 4:
            df.at[i, 'RB'] = defenders[0]
            df.at[i, 'R-CB'] = defenders[1]
            df.at[i, 'L-CB'] = defenders[2]
            df.at[i, 'LB'] = defenders[3]
            df.at[i, 'backline'] = 4
        elif len(defenders) == 3:
            df.at[i, 'RCB'] = defenders[0]
            df.at[i, 'CB'] = defenders[1]
            df.at[i, 'LCB'] = defenders[2]
            df.at[i, 'backline'] = 3
        else:
            # Any other length is treated as a back five.
            df.at[i, 'RWB'] = defenders[0]
            df.at[i, 'RCB'] = defenders[1]
            df.at[i, 'CB'] = defenders[2]
            df.at[i, 'LCB'] = defenders[3]
            df.at[i, 'LWB'] = defenders[4]
            df.at[i, 'backline'] = 5

In [105]:
add_posn(match_def)
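
The position assignment depends only on the length of the defender list, read from right to left across the pitch. A compact sketch of that rule as a lookup table; `POSITION_LABELS` and `label_backline` are hypothetical names, and unlike `add_posn` this version raises a `KeyError` for any backline that is not 3, 4 or 5 players long instead of treating it as a back five.

```python
# Position labels per backline size, ordered right to left as in the lineups.
POSITION_LABELS = {
    4: ['RB', 'R-CB', 'L-CB', 'LB'],
    3: ['RCB', 'CB', 'LCB'],
    5: ['RWB', 'RCB', 'CB', 'LCB', 'LWB'],
}

def label_backline(defenders):
    """Pair each defender with the position label for their slot."""
    labels = POSITION_LABELS[len(defenders)]
    return dict(zip(labels, defenders))

back_four = label_backline(
    ['DannySimpson', 'WesMorgan', 'HarryMaguire', 'ChristianFuchs'])
back_three = label_backline(
    ['RobHolding', 'IgnacioMonrealEraso', 'SeadKolasinac'])
print(back_four)
print(back_three)
```

Keeping the labels in one table makes it easy to see that four-player lineups use the hyphenated `R-CB`/`L-CB` columns while three- and five-player lineups use `RCB`/`CB`/`LCB`.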

In [106]:
match_def.head()

Unnamed: 0,wyId,team,team_defense,RB,R-CB,L-CB,LB,RCB,CB,LCB,RWB,LWB,backline,match,gameweek,teamsData,dateutc,venue,referees,score,footedness
0,2499719,Arsenal,"[RobHolding, IgnacioMonrealEraso, SeadKolasinac]",,,,,RobHolding,IgnacioMonrealEraso,SeadKolasinac,,,3.0,Arsenal-Leicester City,1,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",2017-08-11 18:45:00,Emirates Stadium,"[{'refereeId': 385909, 'role': 'referee'}, {'r...",4–3,right-left-left
1,2499719,Leicester City,"[DannySimpson, WesMorgan, HarryMaguire, Christ...",DannySimpson,WesMorgan,HarryMaguire,ChristianFuchs,,,,,,4.0,Arsenal-Leicester City,1,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",2017-08-11 18:45:00,Emirates Stadium,"[{'refereeId': 385909, 'role': 'referee'}, {'r...",4–3,right-right-right-left
2,2499720,Brighton,"[BrunoSaltorGrau, LewisDunk, ShaneDuffy, Marku...",BrunoSaltorGrau,LewisDunk,ShaneDuffy,MarkusSuttner,,,,,,4.0,Brighton-Manchester City,1,"{'1651': {'scoreET': 0, 'coachId': 8093, 'side...",2017-08-12 16:30:00,The American Express Community Stadium,"[{'refereeId': 384965, 'role': 'referee'}, {'r...",0–2,right-right-right-left
3,2499720,Manchester City,"[VincentKompany, JohnStones, NicolasHernanOtam...",,,,,VincentKompany,JohnStones,NicolasHernanOtamendi,,,3.0,Brighton-Manchester City,1,"{'1651': {'scoreET': 0, 'coachId': 8093, 'side...",2017-08-12 16:30:00,The American Express Community Stadium,"[{'refereeId': 384965, 'role': 'referee'}, {'r...",0–2,right-right-right
4,2499721,Burnley,"[MatthewLowton, JamesTarkowski, BenMee, Stephe...",MatthewLowton,JamesTarkowski,BenMee,StephenWard,,,,,,4.0,Chelsea-Burnley,1,"{'1646': {'scoreET': 0, 'coachId': 8880, 'side...",2017-08-12 14:00:00,Stamford Bridge,"[{'refereeId': 378951, 'role': 'referee'}, {'r...",2–3,right-right-left-left


In [107]:
match_def.to_pickle('../data_top5/matches/match+def_lineup+footedness_ver2_top5.pkl')

# Result

1. Use **events_plot-a-match.pkl** to plot the events of a single match together<br><br>
2. Use **events_com.pkl** to plot events from different matches cumulatively over a season; this file is used in all the analysis<br><br>
3. Use **match+def_lineup+footedness_ver2.pkl** for the defensive lineups of teams in each match from the PL 17-18 season<br><br>
**Note:** All event coordinates have their origin at the bottom-left corner of the pitch when viewed horizontally, with x in [0, 104] and y in [0, 68]