In [1]:
import pandas as pd
import pickle

In [2]:
# Set a value for the column width layout.
pd.options.display.max_colwidth = 20

# Dataset

Dataset link: https://figshare.com/collections/Soccer_match_event_dataset/4415000/2

Dataset documentation: https://apidocs.wyscout.com/, https://support.wyscout.com/matches-wyid-events#10-available-tag-ids

In [3]:
# Select a competition as example.
COMPETITION = "World_Cup" # Italy, England, France, Germany, Spain, World_Cup, European_Championship

In [4]:
# Define the path where the official released dataset is stored.
path_to_dataset = "../dataset/"

## Matches dataset

This dataset describes all the matches made available.

In [5]:
# Load the matches dataset.
matches = pd.read_json(path_to_dataset + f"matches/matches_{COMPETITION}.json", encoding = "unicode_escape")
matches.head()

| | status | roundId | gameweek | teamsData | seasonId | dateutc | winner | venue | wyId | label | date | groupName | referees | duration | competitionId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Played | 4165368 | 0 | {'9598': {'score... | 10078 | 2018-07-15 15:00:00 | 4418 | Olimpiyskiy stad... | 2058017 | France - Croatia... | 2018-07-15 17:00:... | | [{'refereeId': 3... | Regular | 28 |
| 1 | Played | 4165367 | 0 | {'2413': {'score... | 10078 | 2018-07-14 14:00:00 | 5629 | Stadion Krestovskyi | 2058016 | Belgium - Englan... | 2018-07-14 16:00:... | | [{'refereeId': 3... | Regular | 28 |
| 2 | Played | 4165366 | 0 | {'2413': {'score... | 10078 | 2018-07-11 18:00:00 | 9598 | Olimpiyskiy stad... | 2058015 | Croatia - Englan... | 2018-07-11 20:00:... | | [{'refereeId': 3... | ExtraTime | 28 |
| 3 | Played | 4165366 | 0 | {'5629': {'score... | 10078 | 2018-07-10 18:00:00 | 4418 | Stadion Krestovskyi | 2058014 | France - Belgium... | 2018-07-10 20:00:... | | [{'refereeId': 3... | Regular | 28 |
| 4 | Played | 4165365 | 0 | {'14358': {'scor... | 10078 | 2018-07-07 18:00:00 | 9598 | Olimpiyskiy Stad... | 2058012 | Russia - Croatia... | 2018-07-07 20:00:... | | [{'refereeId': 3... | Penalties | 28 |


In [6]:
len(matches)

64

### Key description

- **status**: it can be "Played" (the match has officially finished), "Cancelled" (the match has been canceled for some reason), "Postponed" (the match has been postponed and no new date and time is available yet) or "Suspended" (the match has been suspended and no new date and time is available yet);
- **roundId**: indicates the match-day of the competition to which the match belongs. In a competition for soccer clubs, each participating club plays every other club twice, once at home and once away. The matches are organized in match-days: all the matches in match-day i are played before the matches in match-day i + 1, even though some matches can be brought forward or postponed to accommodate players and clubs participating in continental or intercontinental competitions. In a competition for national teams, "roundId" indicates the stage of the competition (eliminatory round, round of 16, quarter-finals, semifinals, final);
- **gameweek**: the week of the league, starting from the beginning of the league;
- **teamsData**: it contains several subfields describing information about each team that is playing that match: such as lineup, bench composition, list of substitutions, coach and scores;
- **seasonId**: indicates the season of the match;
- **date and dateutc**: the former specifies date and time when the match starts in explicit format (e.g., May 20, 2018 at 8:45:00 PM GMT+2), the latter contains the same information but in the compact format YYYY-MM-DD hh:mm:ss;
- **winner**: the identifier of the team which won the game, or 0 if the match ended with a draw;
- **venue**: the stadium where the match was held (e.g., "Stadio Olimpico");
- **wyId**: the identifier of the match, assigned by Wyscout;
- **label**: contains the name of the two clubs and the result of the match (e.g., "Lazio - Internazionale, 2 - 3");
- **referees**: the referees information of the match;
- **duration**: the duration of the match. It can be "Regular" (matches of regular duration of 90 minutes + stoppage time), "ExtraTime" (matches that go to extra time, as may happen in continental or international competitions), or "Penalties" (matches decided by penalty kicks, as may happen in continental or international competitions);
- **competitionId**: the identifier of the competition to which the match belongs. It is an integer and refers to the field "wyId" of the competition document.
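As an illustration of the `label` field described above, the following minimal sketch splits a label into team names and goals. It assumes the label follows the documented pattern (e.g., "Lazio - Internazionale, 2 - 3"); the helper `parse_label` and the key names `team_a`, `team_b`, `goals_a` and `goals_b` are hypothetical, and labels carrying extra suffixes may need additional handling.

```python
import re

# Pattern for the documented label format: "<team A> - <team B>, <goals A> - <goals B>".
# The pattern is not anchored at the end, so any trailing text is ignored.
LABEL_PATTERN = re.compile(
    r"^(?P<team_a>.+?) - (?P<team_b>.+?), (?P<goals_a>\d+) - (?P<goals_b>\d+)"
)


def parse_label(label):
    """Split a match label into team names and goals (hypothetical helper)."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unexpected label format: {label!r}")
    return {
        "team_a": match["team_a"],
        "team_b": match["team_b"],
        "goals_a": int(match["goals_a"]),
        "goals_b": int(match["goals_b"]),
    }


# Example taken from the key description above.
parsed = parse_label("Lazio - Internazionale, 2 - 3")
```

On the loaded dataframe, the same helper could be applied with `matches["label"].map(parse_label)`, provided every label matches the pattern.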

In [7]:
# Check that wyId uniquely identifies each row of the dataframe.
matches.set_index(["wyId"]).index.is_unique

True

In [8]:
# Check whether the dataframe contains any NaN values.
matches.isna().sum()

status           0
roundId          0
gameweek         0
teamsData        0
seasonId         0
dateutc          0
winner           0
venue            0
wyId             0
label            0
date             0
groupName        0
referees         0
duration         0
competitionId    0
dtype: int64
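Note that `isna()` only counts missing values, so empty strings are not flagged. In the preview of the matches dataframe, `groupName` appears blank while its NaN count is 0, which suggests (an assumption, not verified here) that it holds empty strings. A minimal sketch of a complementary check follows; the toy dataframe and the helper `count_blank_strings` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one empty string, one whitespace-only string, one missing value.
toy = pd.DataFrame({
    "wyId": [1, 2, 3],
    "groupName": ["", "Group A", " "],
    "venue": ["Stadium X", "Stadium Y", None],
})


def count_blank_strings(df):
    """Count, per column, the cells that are empty or whitespace-only strings."""
    def is_blank(value):
        return isinstance(value, str) and value.strip() == ""

    return df.apply(lambda col: col.map(is_blank).sum())


blank_counts = count_blank_strings(toy)  # blank strings per column
nan_counts = toy.isna().sum()            # missing values per column
```

The `isinstance` guard keeps the check safe for columns that hold dictionaries or lists, such as `teamsData` and `referees`.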

## Events dataset

This dataset describes all the events that occur during each match. Each event refers to a ball touch.

In [9]:
# Load the events dataset.
events = pd.read_json(path_to_dataset + f"events/events_{COMPETITION}.json", encoding = "unicode_escape")
events.head()

| | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | subEventId | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Simple pass | [{'id': 1801}] | 122671 | [{'y': 50, 'x': ... | 2057954 | Pass | 16521 | 1H | 1.656214 | 85 | 258612104 |
| 1 | 8 | High pass | [{'id': 1801}] | 139393 | [{'y': 53, 'x': ... | 2057954 | Pass | 16521 | 1H | 4.487814 | 83 | 258612106 |
| 2 | 1 | Air duel | [{'id': 703}, {'... | 103668 | [{'y': 81, 'x': ... | 2057954 | Duel | 14358 | 1H | 5.937411 | 10 | 258612077 |
| 3 | 1 | Air duel | [{'id': 701}, {'... | 122940 | [{'y': 19, 'x': ... | 2057954 | Duel | 16521 | 1H | 6.406961 | 10 | 258612112 |
| 4 | 8 | Simple pass | [{'id': 1801}] | 122847 | [{'y': 17, 'x': ... | 2057954 | Pass | 16521 | 1H | 8.562167 | 85 | 258612110 |


In [10]:
len(events)

101759

### Key description

- **eventId**: the identifier of the event's type. Each eventId is associated with an event name;
- **eventName**: the name of the event's type. There are seven types of events: pass, foul, shot, duel, free kick, offside and touch;
- **subEventId**: the identifier of the subevent's type. Each subEventId is associated with a subevent name;
- **subEventName**: the name of the subevent's type. Each event type is associated with a different set of subevent types;

<img src="../dataset/images/events_events.png" width="300">
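The relationship between event and subevent identifiers and their names can be inspected directly from the data. The sketch below builds a small lookup table from toy rows copied from the events preview shown earlier; the variable names are illustrative, and on the full dataset `toy_events` would be replaced by `events`.

```python
import pandas as pd

# Toy rows copied from the events preview shown earlier in this notebook.
toy_events = pd.DataFrame({
    "eventId": [8, 8, 1, 1, 8],
    "eventName": ["Pass", "Pass", "Duel", "Duel", "Pass"],
    "subEventId": [85, 83, 10, 10, 85],
    "subEventName": ["Simple pass", "High pass", "Air duel", "Air duel", "Simple pass"],
})

# One row per (event type, subevent type) pair.
event_types = (
    toy_events[["eventId", "eventName", "subEventId", "subEventName"]]
    .drop_duplicates()
    .sort_values(["eventId", "subEventId"])
    .reset_index(drop=True)
)

# Sanity check: each eventId should be associated with exactly one eventName.
names_per_event_id = toy_events.groupby("eventId")["eventName"].nunique()
```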

- **tags**: a list of event tags, each describing additional information about the event (e.g., accurate). Each event type is associated with a different set of tags. Example: `[{'id': 503}, {'id': 703}, {'id': 1801}]`, where, for instance, 1801 means 'accurate' and 1802 means 'not accurate';

<img src="../dataset/images/events_tags.png" width="200">
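Since each tag is stored as a small dictionary, it is often convenient to flatten the list into plain integers before filtering. The following is a minimal sketch on toy data: the event ids 1 to 3 and the helper `tag_ids` are hypothetical, while the tag codes 1801 ('accurate') and 1802 ('not accurate') come from the description above.

```python
import pandas as pd

# Hypothetical toy events; only the structure of the "tags" column mirrors the dataset.
toy_events = pd.DataFrame({
    "id": [1, 2, 3],
    "tags": [
        [{"id": 1801}],
        [{"id": 503}, {"id": 703}, {"id": 1801}],
        [{"id": 1802}],
    ],
})


def tag_ids(tags):
    """Flatten a list of {'id': ...} dictionaries into a list of integers."""
    return [tag["id"] for tag in tags]


toy_events["tag_ids"] = toy_events["tags"].map(tag_ids)

# Flag the events tagged as accurate (tag 1801).
toy_events["accurate"] = toy_events["tag_ids"].map(lambda ids: 1801 in ids)
```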

- **playerId**: the identifier of the player who generated the event. The identifier refers to the field "wyId" in a player dataset;
- **positions**: the origin and destination positions associated with the event. Each position is a pair of coordinates (x, y). The x and y coordinates are always in the range [0, 100] and indicate a percentage of the field from the perspective of the attacking team. In particular, the value of the x coordinate indicates the event's nearness (in percentage) to the opponent's goal, while the value of the y coordinate indicates the event's nearness (in percentage) to the right side of the field. Coordinates are relative to the team performing the event: the goal it defends is always at x = 0% and the goal it attacks is always at x = 100%. All values are percentages expressed as (x, y);

<img src="../dataset/images/pitch_coordinates.png" width="400">
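Because the coordinates are percentages, distances have to be rescaled before they can be read in metres. The sketch below assumes a 105 m x 68 m pitch, which is not stated in the dataset documentation, and assumes that the event carries both an origin and a destination. The helpers `to_metres` and `event_length_m` and the toy coordinates are hypothetical.

```python
import math

# Assumed pitch dimensions in metres (not provided by the dataset).
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


def to_metres(position):
    """Convert a {'x': ..., 'y': ...} percentage position into metres."""
    return (
        position["x"] / 100 * PITCH_LENGTH_M,
        position["y"] / 100 * PITCH_WIDTH_M,
    )


def event_length_m(positions):
    """Euclidean distance in metres between origin and destination.

    Returns None when fewer than two positions are available.
    """
    if len(positions) < 2:
        return None
    x0, y0 = to_metres(positions[0])
    x1, y1 = to_metres(positions[1])
    return math.hypot(x1 - x0, y1 - y0)


# Hypothetical event moving from the centre spot 20% of the pitch length forward.
toy_positions = [{"y": 50, "x": 50}, {"y": 50, "x": 70}]
start = to_metres(toy_positions[0])
length = event_length_m(toy_positions)
```

Scaling x and y separately matters because the pitch is not square: one percentage point along x covers more ground than one percentage point along y.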

- **matchId**: the identifier of the match the event refers to. The identifier refers to the field "wyId" in the match dataset;

- **teamId**: the identifier of the player's team. The identifier refers to the field "wyId" in the team dataset;
- **matchPeriod**: the period of the match. It can be "1H" (first half of the match), "2H" (second half of the match), "E1" (first extra time), "E2" (second extra time) or "P" (penalties time);
- **eventSec**: the time when the event occurs (in seconds since the beginning of the current half of the match);
- **id**: a unique identifier of the event.
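Because `eventSec` restarts at the beginning of each period, it has to be combined with `matchPeriod` to obtain a match clock. The sketch below uses nominal period starts (0, 45, 90 and 105 minutes); this is an assumption that ignores stoppage time, and the penalty period "P" is deliberately left out. The helper `nominal_clock` is hypothetical.

```python
# Assumed nominal start of each period, in minutes (stoppage time is ignored).
PERIOD_OFFSET_MIN = {"1H": 0, "2H": 45, "E1": 90, "E2": 105}


def nominal_clock(match_period, event_sec):
    """Return a "MM:SS" match clock for an event (hypothetical helper)."""
    if match_period not in PERIOD_OFFSET_MIN:
        raise ValueError(f"No nominal offset defined for period {match_period!r}")
    # Whole seconds elapsed since the nominal kick-off.
    total_seconds = PERIOD_OFFSET_MIN[match_period] * 60 + int(event_sec)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


clock_first = nominal_clock("1H", 1.656214)  # first event of the preview above
clock_second = nominal_clock("2H", 125.0)    # hypothetical second-half event
```

For chronological sorting within a match, ordering by the period first and by `eventSec` second is enough; the clock above is mainly useful for display.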

In [11]:
# Check that id uniquely identifies each row of the dataframe.
events.set_index(["id"]).index.is_unique

True

In [12]:
# Check whether the dataframe contains any NaN values.
events.isna().sum()

eventId         0
subEventName    0
tags            0
playerId        0
positions       0
matchId         0
eventName       0
teamId          0
matchPeriod     0
eventSec        0
subEventId      0
id              0
dtype: int64