In this notebook we preprocess the Wyscout data, joining the three data sources provided (events, matches and players) and computing a few basic derived variables.

The end goal of the whole process is to obtain an Expected Goals (xG) model.

xG is the probability that, on a typical day of football, a particular shot from a given location results in a goal. It is usually estimated from many shots within a single league and season, or by pooling data from different leagues (the strategy used in this project).
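As a minimal illustration of that definition, the sketch below estimates xG empirically as the fraction of shots taken from a given distance band that ended in a goal. The shot records and the distance bands are invented for illustration and do not come from the Wyscout data.

```python
# Toy data (hypothetical): (distance to goal in metres, 1 if the shot was a goal else 0)
shots = [(6.0, 1), (8.5, 0), (9.0, 1), (7.2, 0),
         (22.0, 0), (25.5, 0), (18.0, 0), (30.0, 1)]

def empirical_xg(shots, min_dist, max_dist):
    """Fraction of shots taken from [min_dist, max_dist) that were goals."""
    outcomes = [goal for dist, goal in shots if min_dist <= dist < max_dist]
    if not outcomes:
        return None  # no shots observed in this band
    return sum(outcomes) / len(outcomes)

close_range_xg = empirical_xg(shots, 0, 10)   # 2 goals out of 4 shots
long_range_xg = empirical_xg(shots, 10, 40)   # 1 goal out of 4 shots
print(close_range_xg, long_range_xg)
```

A fitted model generalises this idea by replacing the hard bands with a smooth function of features such as distance and angle.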

In [6]:
import pandas as pd
import numpy as np

import aux_functions_data as aux


### Loading the Wyscout data

source (paper): https://www.nature.com/articles/s41597-019-0247-7 [1]

source (data): https://figshare.com/collections/Soccer_match_event_dataset/4415000/5 [2]


### Event data (events.json):

This dataset describes every event that occurs during each match. Each event refers to an action on the ball and contains the following features (quoted from the original documentation [2]):

- eventId: the identifier of the event's type. Each eventId is associated with an event name (see next point);
- eventName: the name of the event's type. There are seven types of events: pass, foul, shot, duel, free kick, offside and touch;
- subEventId: the identifier of the subevent's type. Each subEventId is associated with a subevent name (see next point);
- subEventName: the name of the subevent's type. Each event type is associated with a different set of subevent types;
- tags: a list of event tags, each one describes additional information about the event (e.g., accurate). Each event type is associated with a different set of tags;
- eventSec: the time when the event occurs (in seconds since the beginning of the current half of the match);
- id: a unique identifier of the event;
- matchId: the identifier of the match the event refers to. The identifier refers to the field "wyId" in the match dataset;
- matchPeriod: the period of the match. It can be "1H" (first half of the match), "2H" (second half of the match), "E1" (first extra time), "E2" (second extra time) or "P" (penalties time);
- playerId: the identifier of the player who generated the event. The identifier refers to the field "wyId" in a player dataset;
- positions: the origin and destination positions associated with the event. Each position is a pair of coordinates (x, y). The x and y coordinates are always in the range [0, 100] and indicate the percentage of the field from the perspective of the attacking team. In particular, the value of the x coordinate indicates the event's nearness (in percentage) to the opponent's goal, while the value of the y coordinates indicates the event's nearness (in percentage) to the right side of the field;
- teamId: the identifier of the player's team. The identifier refers to the field "wyId" in the team dataset.
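To make the `tags` and `positions` fields concrete, here is a small sketch that unpacks one event record. The record mirrors the first row of the events table shown later in this notebook; the helper names `tag_ids` and `origin_and_destination` are hypothetical and are not part of `aux_functions_data`.

```python
# One event shaped like a row of events.json (values taken from the first
# row of the events table printed in this notebook).
event = {
    "eventName": "Pass",
    "subEventName": "Simple pass",
    "tags": [{"id": 1801}],
    "positions": [{"y": 49, "x": 49}, {"y": 78, "x": 31}],
}

def tag_ids(event):
    """Collect the numeric tag ids of an event into a set."""
    return {tag["id"] for tag in event["tags"]}

def origin_and_destination(event):
    """Return (origin, destination) as (x, y) tuples in percent of the pitch."""
    points = [(p["x"], p["y"]) for p in event["positions"]]
    origin = points[0]
    destination = points[1] if len(points) > 1 else None
    return origin, destination

origin, destination = origin_and_destination(event)
print(tag_ids(event), origin, destination)
```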

### Match data (matches.json)


- competitionId: the identifier of the competition to which the match belongs. It is an integer and refers to the field "wyId" of the competition document;
- date and dateutc: the former specifies date and time when the match starts in explicit format (e.g., May 20, 2018 at 8:45:00 PM GMT+2), the latter contains the same information but in the compact format YYYY-MM-DD hh:mm:ss;
- duration: the duration of the match. It can be "Regular" (matches of regular duration of 90 minutes + stoppage time), "ExtraTime" (matches with supplementary times, as it may happen for matches in continental or international competitions), or "Penalities" (matches which end at penalty kicks, as it may happen for continental or international competitions);
- gameweek: the week of the league, starting from the beginning of the league;
- label: contains the name of the two clubs and the result of the match (e.g., "Lazio - Internazionale, 2 - 3");
- roundId: indicates the match-day of the competition to which the match belongs. During a competition for soccer clubs, each of the participating clubs plays against each of the other clubs twice, once at home and once away. The matches are organized in match-days: all the matches in match-day i are played before the matches in match-day i + 1, even though some matches can be anticipated or postponed to facilitate players and clubs participating in Continental or Intercontinental competitions. During a competition for national teams, the "roundId" indicates the stage of the competition (eliminatory round, round of 16, quarter finals, semifinals, final);
- seasonId: indicates the season of the match;
- status: it can be "Played" (the match has officially finished), "Cancelled" (the match has been canceled for some reason), "Postponed" (the match has been postponed and no new date and time is available yet) or "Suspended" (the match has been suspended and no new date and time is available yet);
- venue: the stadium where the match was held (e.g., "Stadio Olimpico");
- winner: the identifier of the team which won the game, or 0 if the match ended with a draw;
- wyId: the identifier of the match, assigned by Wyscout;
- teamsData: it contains several subfields describing each team playing the match, such as lineup, bench composition, list of substitutions, coach and scores:
  - hasFormation: it has value 0 if no formation (lineups and benches) is present, and 1 otherwise;
  - score: the number of goals scored by the team during the match (not counting penalties);
  - scoreET: the number of goals scored by the team during the match, including the extra time (not counting penalties);
  - scoreHT: the number of goals scored by the team during the first half of the match;
  - scoreP: the total number of goals scored by the team after the penalties;
  - side: the team side in the match (it can be "home" or "away");
  - teamId: the identifier of the team;
  - coachId: the identifier of the team's coach;
  - bench: the list of the team's players that started the match on the bench and some basic statistics about their performance during the match (goals, own goals, cards);
  - lineup: the list of the team's players in the starting lineup and some basic statistics about their performance during the match (goals, own goals, cards);
  - substitutions: the list of the team's substitutions during the match, describing the players involved and the minute of the substitution.
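Because `teamsData` is a nested dictionary keyed by team id, a short sketch may help show how its subfields can be read. The dictionary below follows the structure documented above, but the scores and the reduced set of subfields are made up for illustration, and the helpers `scores_by_side` and `winner_id` are hypothetical.

```python
# Illustrative teamsData value (invented scores, only a few subfields kept).
teams_data = {
    "1646": {"teamId": 1646, "side": "home", "score": 1, "scoreHT": 0},
    "1659": {"teamId": 1659, "side": "away", "score": 2, "scoreHT": 1},
}

def scores_by_side(teams_data):
    """Map 'home'/'away' to (teamId, goals scored)."""
    return {team["side"]: (team["teamId"], team["score"])
            for team in teams_data.values()}

def winner_id(teams_data):
    """Team id with more goals in 'score', or 0 for a draw.

    Assumption: extra time and penalties are ignored in this sketch.
    """
    (id_a, goals_a), (id_b, goals_b) = scores_by_side(teams_data).values()
    if goals_a == goals_b:
        return 0
    return id_a if goals_a > goals_b else id_b

print(scores_by_side(teams_data), winner_id(teams_data))
```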

### Player data (players.json)

- birthArea: geographic information about the player's birth area;
- birthDate: the birth date of the player, in the format "YYYY-MM-DD";
- currentNationalTeamId: the identifier of the national team where the player currently plays;
- currentTeamId: the identifier of the team where the player plays for. The identifier refers to the field "wyId" in a team document;
- firstName: the first name of the player;
- lastName: the last name of the player;
- foot: the preferred foot of the player;
- height: the height of the player (in centimeters);
- middleName: the middle name (if any) of the player;
- passportArea: the geographic area associated with the player's current passport;
- role: the main role of the player. It is a subdocument containing the role's name and two abbreviations of it;
- shortName: the short name of the player;
- weight: the weight of the player (in kilograms);
- wyId: the identifier of the player, assigned by Wyscout.

### Reading the data

In [2]:
events = aux.read_wyscout_event_data()
events.head()

  0%|          | 0/7 [00:00<?, ?it/s]

Working on England file


 14%|█▍        | 1/7 [00:07<00:47,  7.92s/it]

Working on European_Championship file


 29%|██▊       | 2/7 [00:12<00:28,  5.72s/it]

Working on France file


 43%|████▎     | 3/7 [00:51<01:23, 20.94s/it]

Working on Germany file


 57%|█████▋    | 4/7 [01:09<01:00, 20.05s/it]

Working on Spain file


 71%|███████▏  | 5/7 [02:47<01:35, 47.97s/it]

Working on Italy file


 86%|████████▌ | 6/7 [03:40<00:49, 49.79s/it]

Working on World_Cup file


100%|██████████| 7/7 [03:49<00:00, 32.85s/it]


Concatenating dataframes


Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id,league
0,8,Simple pass,[{'id': 1801}],25413,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",2499719,Pass,1609,1H,2.758649,85,177959171,England
1,8,High pass,[{'id': 1801}],370224,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",2499719,Pass,1609,1H,4.94685,83,177959172,England
2,8,Head pass,[{'id': 1801}],3319,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",2499719,Pass,1609,1H,6.542188,82,177959173,England
3,8,Head pass,[{'id': 1801}],120339,"[{'y': 71, 'x': 35}, {'y': 95, 'x': 41}]",2499719,Pass,1609,1H,8.143395,82,177959174,England
4,8,Simple pass,[{'id': 1801}],167145,"[{'y': 95, 'x': 41}, {'y': 88, 'x': 72}]",2499719,Pass,1609,1H,10.302366,85,177959175,England


In [8]:
matches = aux.read_wyscout_match_data(path = './data/wyscout/matches/')
matches.head()

  0%|          | 0/7 [00:00<?, ?it/s]

Working on England file


 14%|█▍        | 1/7 [00:01<00:07,  1.32s/it]

Working on European_Championship file


 29%|██▊       | 2/7 [00:01<00:03,  1.42it/s]

Working on France file


 43%|████▎     | 3/7 [00:02<00:03,  1.12it/s]

Working on Germany file


 57%|█████▋    | 4/7 [00:03<00:02,  1.29it/s]

Working on Spain file


 71%|███████▏  | 5/7 [00:05<00:02,  1.33s/it]

Working on Italy file


100%|██████████| 7/7 [00:07<00:00,  1.06s/it]


Working on World_Cup file


  0%|          | 0/7 [00:00<?, ?it/s]

Concatenating dataframes


100%|██████████| 7/7 [00:00<00:00,  7.25it/s]






Unnamed: 0,status,roundId,gameweek,teamsData,seasonId,dateutc,winner,venue,wyId,label,date,referees,duration,competitionId,league,groupName
0,Played,4405654,38,"{'1646': {'scoreET': 0, 'coachId': 8880, 'side...",181150,2018-05-13 14:00:00,1659,Turf Moor,2500089,"Burnley - AFC Bournemouth, 1 - 2","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 385705, 'role': 'referee'}, {'r...",Regular,364,England,
1,Played,4405654,38,"{'1628': {'scoreET': 0, 'coachId': 8357, 'side...",181150,2018-05-13 14:00:00,1628,Selhurst Park,2500090,"Crystal Palace - West Bromwich Albion, 2 - 0","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 381851, 'role': 'referee'}, {'r...",Regular,364,England,
2,Played,4405654,38,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",181150,2018-05-13 14:00:00,1609,The John Smith's Stadium,2500091,"Huddersfield Town - Arsenal, 0 - 1","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 384965, 'role': 'referee'}, {'r...",Regular,364,England,
3,Played,4405654,38,"{'1651': {'scoreET': 0, 'coachId': 8093, 'side...",181150,2018-05-13 14:00:00,1612,Anfield,2500092,"Liverpool - Brighton & Hove Albion, 4 - 0","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 385704, 'role': 'referee'}, {'r...",Regular,364,England,
4,Played,4405654,38,"{'1644': {'scoreET': 0, 'coachId': 93112, 'sid...",181150,2018-05-13 14:00:00,1611,Old Trafford,2500093,"Manchester United - Watford, 1 - 0","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 381853, 'role': 'referee'}, {'r...",Regular,364,England,


In [9]:
players = aux.read_wyscout_player_data()
players.head()

Unnamed: 0,passportArea,weight,firstName,middleName,lastName,currentTeamId,birthDate,height,role,birthArea,wyId,foot,shortName,currentNationalTeamId
0,"{'name': 'Turkey', 'id': '792', 'alpha3code': ...",78,Harun,,Tekin,4502,1989-06-17,187,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'Turkey', 'id': '792', 'alpha3code': ...",32777,right,H. Tekin,4687.0
1,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",73,Malang,,Sarr,3775,1999-01-23,182,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'France', 'id': '250', 'alpha3code': ...",393228,left,M. Sarr,4423.0
2,"{'name': 'France', 'id': '250', 'alpha3code': ...",72,Over,,Mandanda,3772,1998-10-26,176,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'France', 'id': '250', 'alpha3code': ...",393230,,O. Mandanda,
3,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",82,Alfred John Momar,,N'Diaye,683,1990-03-06,187,"{'code2': 'MD', 'code3': 'MID', 'name': 'Midfi...","{'name': 'France', 'id': '250', 'alpha3code': ...",32793,right,A. N'Diaye,19314.0
4,"{'name': 'France', 'id': '250', 'alpha3code': ...",84,Ibrahima,,Konat\u00e9,2975,1999-05-25,192,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'France', 'id': '250', 'alpha3code': ...",393247,right,I. Konat\u00e9,


In [10]:
events_copy = events.copy()

In [11]:
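# Attach player attributes to each event; the default inner join drops events
# whose playerId has no match in players.
events = pd.merge(events, players[['wyId', 'foot', 'firstName', 'lastName']],
                  left_on='playerId', right_on='wyId')
# Attach match metadata; both merges bring in a 'wyId' column, hence the _x/_y suffixes.
events = pd.merge(events, matches[['wyId', 'label', 'venue', 'date']],
                  left_on='matchId', right_on='wyId')
events = events.drop(columns=['wyId_x', 'wyId_y'])
# Chronological order within each match ('1H' < '2H' < 'E1' < 'E2' < 'P' as text).
events = events.sort_values(['matchId', 'matchPeriod', 'eventSec'])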
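
The merges above rely on the pandas default `how='inner'`, so any event whose `playerId` has no matching `wyId` in `players` is dropped without warning. The sketch below uses toy frames (invented ids, including a placeholder `playerId` of 0) to show one way of counting such rows with `how='left'` and `indicator=True` before deciding whether the loss is acceptable.

```python
import pandas as pd

# Toy frames with invented ids; playerId 0 has no counterpart in players_toy.
events_toy = pd.DataFrame({"id": [1, 2, 3], "playerId": [10, 0, 11]})
players_toy = pd.DataFrame({"wyId": [10, 11], "foot": ["right", "left"]})

# Default inner join: the unmatched event disappears.
inner = pd.merge(events_toy, players_toy, left_on="playerId", right_on="wyId")

# Left join with an indicator column keeps every event and labels its origin.
checked = pd.merge(events_toy, players_toy, left_on="playerId",
                   right_on="wyId", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(inner), len(unmatched))
```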

### Creating some important shot-related variables

source: Soccermatics - https://soccermatics.readthedocs.io/en/latest/

In [12]:
# Event immediately preceding each row; a plain shift also crosses match and period boundaries.
events['previous_event'] = events['subEventName'].shift(1)
# Convert percentages to metres (the pitch is assumed to be 105 x 68 m).
events['x'] = events.positions.apply(lambda c: (100 - c[0]['x']) * 105/100)  # distance to the opponent's goal line
events['y'] = events.positions.apply(lambda c: c[0]['y'] * 68/100)
events['c'] = events.positions.apply(lambda c: abs(c[0]['y'] - 50) * 68/100)  # lateral distance from the centre of the pitch
events["distance"] = np.sqrt(events["x"]**2 + events["c"]**2)
# Angle (radians) subtended by the 7.32 m goal mouth; pi is added when arctan is not positive.
raw_angle = np.arctan(7.32 * events["x"] / (events["x"]**2 + events["c"]**2 - (7.32/2)**2))
events["angle"] = np.where(raw_angle > 0, raw_angle, raw_angle + np.pi)
events["goal"] = events.tags.apply(lambda x: 1 if {'id':101} in x else 0).astype(object)  # tag 101 marks a goal
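
As a worked check of the geometry above, the sketch below applies the same formulas to a single shot using only the standard library. The function name `shot_geometry` is hypothetical; the input `(x=91, y=29)` is the origin of the first shot shown in the `all_shots.head()` output of this notebook, whose `angle` column reads 0.242346.

```python
import math

GOAL_WIDTH = 7.32                     # metres, same constant as in the cell above
PITCH_LENGTH, PITCH_WIDTH = 105, 68   # metres, as assumed by the conversion above

def shot_geometry(x_pct, y_pct):
    """Return (distance in metres, goal-mouth angle in radians) for a shot origin."""
    x = (100 - x_pct) * PITCH_LENGTH / 100   # distance to the goal line
    c = abs(y_pct - 50) * PITCH_WIDTH / 100  # lateral offset from the pitch centre
    distance = math.hypot(x, c)
    # Note: the denominator is zero on the circle of radius GOAL_WIDTH / 2
    # around the goal centre; this sketch does not guard against that case.
    angle = math.atan(GOAL_WIDTH * x / (x**2 + c**2 - (GOAL_WIDTH / 2) ** 2))
    if angle <= 0:
        angle += math.pi  # same correction as the np.where in the cell above
    return distance, angle

distance, angle = shot_geometry(91, 29)
print(round(distance, 2), round(angle, 4))
```

For a shot straight in front of goal (`c = 0`), the expression reduces to `2 * atan((GOAL_WIDTH / 2) / x)`, which is a convenient sanity check.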


In [13]:
all_shots = events[events['subEventName'].isin(['Shot', 'Free kick shot'])].copy()

all_shots['free_kick'] = 1*(all_shots['subEventName'] == 'Free kick shot')
# Tag 1901 is used as the counter-attack flag; it is read from the shots only.
all_shots["counter_attack"] = all_shots.tags.apply(lambda x: 1 if {'id':1901} in x else 0).astype(object)

In [14]:
# Flags describing the event right before the shot (each is 0 or 1).
prev = all_shots['previous_event']

all_shots['rebound'] = prev.isin(['Penalty', 'Free kick shot', 'Shot', 'Save attempt']).astype(int)

all_shots['prev_cross'] = prev.isin(['Corner', 'Free kick cross', 'Cross']).astype(int)

all_shots['prev_touch'] = (prev == 'Touch').astype(int)

all_shots['prev_pass'] = prev.isin(['Simple pass', 'Head pass', 'Goal kick']).astype(int)

all_shots['prev_smart_pass'] = (prev == 'Smart pass').astype(int)

all_shots['prev_duel'] = prev.isin(['Air duel', 'Ground defending duel',
                                    'Ground attacking duel', 'Ground loose ball duel']).astype(int)
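
The flags above depend on `previous_event`, which is built with a plain `shift(1)` over the sorted events. For the first event of a match or period, that picks up the last event of the preceding group. A possible alternative, sketched here on an invented toy frame, is to shift within each `(matchId, matchPeriod)` group so those rows get a missing value instead.

```python
import pandas as pd

# Toy events (invented), already sorted by match, period and time.
toy = pd.DataFrame({
    "matchId": [1, 1, 2, 2],
    "matchPeriod": ["1H", "1H", "1H", "1H"],
    "subEventName": ["Cross", "Shot", "Simple pass", "Shot"],
})

# Plain shift: the first event of match 2 inherits the last event of match 1.
toy["previous_plain"] = toy["subEventName"].shift(1)

# Grouped shift: the first event of every match/period has no predecessor.
toy["previous_grouped"] = (
    toy.groupby(["matchId", "matchPeriod"])["subEventName"].shift(1)
)

print(toy)
```

Whether the difference matters in practice depends on how often a shot is the very first event of a period, which is presumably rare.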

In [15]:
all_shots.head()

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,...,angle,goal,free_kick,counter_attack,rebound,prev_cross,prev_touch,prev_pass,prev_smart_pass,prev_duel
257339,10,Shot,"[{'id': 402}, {'id': 1401}, {'id': 1203}, {'id...",25437,"[{'y': 29, 'x': 91}, {'y': 0, 'x': 0}]",1694390,Shot,4418,1H,31.226217,...,0.242346,0,0,0,0,0,0,1,0,0
257923,10,Shot,"[{'id': 402}, {'id': 201}, {'id': 1216}, {'id'...",83824,"[{'y': 29, 'x': 71}, {'y': 100, 'x': 100}]",1694390,Shot,11944,1H,143.119551,...,0.196835,0,0,0,0,0,0,1,0,0
258172,10,Shot,"[{'id': 402}, {'id': 201}, {'id': 1201}, {'id'...",33235,"[{'y': 57, 'x': 96}, {'y': 100, 'x': 100}]",1694390,Shot,11944,1H,219.576026,...,0.851948,0,0,0,0,0,0,0,0,1
257684,10,Shot,"[{'id': 403}, {'id': 201}, {'id': 1215}, {'id'...",6165,"[{'y': 61, 'x': 96}, {'y': 100, 'x': 100}]",1694390,Shot,11944,1H,247.532561,...,0.472204,0,0,0,0,0,0,0,0,1
257303,10,Shot,"[{'id': 401}, {'id': 2101}, {'id': 1802}]",3682,"[{'y': 33, 'x': 75}, {'y': 0, 'x': 0}]",1694390,Shot,4418,1H,557.319065,...,0.233111,0,0,0,0,0,0,1,0,0


In [16]:
all_shots.columns

Index(['eventId', 'subEventName', 'tags', 'playerId', 'positions', 'matchId',
       'eventName', 'teamId', 'matchPeriod', 'eventSec', 'subEventId', 'id',
       'league', 'foot', 'firstName', 'lastName', 'label', 'venue', 'date',
       'previous_event', 'x', 'y', 'c', 'distance', 'angle', 'goal',
       'free_kick', 'counter_attack', 'rebound', 'prev_cross', 'prev_touch',
       'prev_pass', 'prev_smart_pass', 'prev_duel'],
      dtype='object')

In [17]:
all_shots.to_parquet('shots_dataframe.parquet')