## Wyscout

Working with the Wyscout data is a bit different from using the StatsBomb API. The data is stored in JSON files that we need to read in ourselves.

In [3]:
import pathlib
import os
import pandas as pd
import json

path = os.path.join(str(pathlib.Path().resolve()), 'soccermatics', 'data', 'wyscout', 'competitions.json') # put # in front if used locally

# open the data
with open(path) as f:
    data = json.load(f)

# save this data to a dataframe
df_competitions = pd.DataFrame(data)
df_competitions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    7 non-null      object
 1   wyId    7 non-null      int64 
 2   format  7 non-null      object
 3   area    7 non-null      object
 4   type    7 non-null      object
dtypes: int64(1), object(4)
memory usage: 412.0+ bytes


In [4]:
df_competitions['name']

0    Italian first division
1    English first division
2    Spanish first division
3     French first division
4     German first division
5     European Championship
6                 World Cup
Name: name, dtype: object
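
The loading pattern above (open the file, parse it with json.load, wrap the result in a DataFrame) is the same for every Wyscout file, so it could be wrapped in a small helper. The sketch below is only an illustration: the function name load_wyscout_json is made up for this example, the records in the demo file are invented, and the UTF-8 encoding is an assumption rather than something the dataset documents.

```python
import json
import pathlib
import tempfile

import pandas as pd


def load_wyscout_json(path):
    """Read one JSON file holding a list of records into a DataFrame.

    Hypothetical helper for illustration; UTF-8 encoding is assumed.
    """
    with open(path, encoding="utf-8") as f:
        return pd.DataFrame(json.load(f))


# Demo on a throwaway file with invented records, so the sketch runs
# without the real dataset being present.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = pathlib.Path(tmp) / "competitions_demo.json"
    demo_path.write_text(
        json.dumps(
            [
                {"name": "Demo league", "wyId": 1},
                {"name": "Demo cup", "wyId": 2},
            ]
        ),
        encoding="utf-8",
    )
    df_demo = load_wyscout_json(demo_path)

print(df_demo["name"].tolist())
```

With the real data, the same helper would be called with the path built in the cell above in place of demo_path.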

## Matches

The match data for each competition is stored in its own JSON file, so we can easily load the matches for whichever competition we like.

In [9]:
path = os.path.join(str(pathlib.Path().resolve()), 'soccermatics', 'data', 'wyscout', 'matches', 'matches_England.json') # put # in front if used locally

with open(path) as f:
    data = json.load(f)

df_matches = pd.DataFrame(data)
df_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   status         380 non-null    object
 1   roundId        380 non-null    int64 
 2   gameweek       380 non-null    int64 
 3   teamsData      380 non-null    object
 4   seasonId       380 non-null    int64 
 5   dateutc        380 non-null    object
 6   winner         380 non-null    int64 
 7   venue          380 non-null    object
 8   wyId           380 non-null    int64 
 9   label          380 non-null    object
 10  date           380 non-null    object
 11  referees       380 non-null    object
 12  duration       380 non-null    object
 13  competitionId  380 non-null    int64 
dtypes: int64(6), object(8)
memory usage: 41.7+ KB
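
The info() output shows that dateutc is stored with object dtype, so it likely needs converting before matches can be sorted or filtered by time. The sketch below shows one way to do that and to pick out a single gameweek. It runs on a few invented rows that only mimic some of the column names listed above; the ids, the label text and the date string format are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Invented rows that mimic a few of the columns listed above.
df_matches_demo = pd.DataFrame(
    {
        "wyId": [1001, 1002, 1003],
        "gameweek": [1, 1, 2],
        "label": [
            "Team A - Team B, 2 - 1",
            "Team C - Team D, 0 - 0",
            "Team B - Team C, 1 - 3",
        ],
        # Assumed text format for the kick-off time.
        "dateutc": [
            "2017-08-11 18:45:00",
            "2017-08-12 14:00:00",
            "2017-08-19 14:00:00",
        ],
    }
)

# Convert the text column to proper datetimes.
df_matches_demo["dateutc"] = pd.to_datetime(df_matches_demo["dateutc"])

# Keep only the matches from the first gameweek.
first_week = df_matches_demo.loc[
    df_matches_demo["gameweek"] == 1, ["wyId", "label", "dateutc"]
]
print(first_week)
```

The same two steps should carry over to df_matches, provided the real dateutc strings are in a format that pd.to_datetime can parse.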


## Player data

This dataframe contains information about all players in the Wyscout public dataset. wyId is the player's id in the Wyscout database, and currentTeamId is the id of the team the player currently plays for. shortName is an important column for visualisations and rankings, since it holds a shortened form of the player’s name.

In [10]:
path = os.path.join(str(pathlib.Path().resolve()), 'soccermatics', 'data', 'wyscout', 'players.json') 

with open(path) as f:
    data = json.load(f)

df_players = pd.DataFrame(data)
df_players.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3603 entries, 0 to 3602
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   passportArea           3603 non-null   object
 1   weight                 3603 non-null   int64 
 2   firstName              3603 non-null   object
 3   middleName             3603 non-null   object
 4   lastName               3603 non-null   object
 5   currentTeamId          3512 non-null   object
 6   birthDate              3603 non-null   object
 7   height                 3603 non-null   int64 
 8   role                   3603 non-null   object
 9   birthArea              3603 non-null   object
 10  wyId                   3603 non-null   int64 
 11  foot                   3603 non-null   object
 12  shortName              3603 non-null   object
 13  currentNationalTeamId  3603 non-null   object
dtypes: int64(3), object(11)
memory usage: 394.2+ KB
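
Two things in this output are worth acting on: shortName can be looked up by wyId when labelling plots or rankings, and currentTeamId has only 3512 non-null values out of 3603, so 91 players have no current team recorded. The sketch below illustrates both on a few invented rows; the ids and names are placeholders, not real Wyscout values.

```python
import pandas as pd

# Invented rows that mimic three of the columns listed above.
df_players_demo = pd.DataFrame(
    {
        "wyId": [11, 12, 13],
        "shortName": ["A. Example", "B. Sample", "C. Placeholder"],
        "currentTeamId": [501, None, 502],
    }
)

# Series indexed by player id, handy for turning ids into display names.
short_names = df_players_demo.set_index("wyId")["shortName"]
print(short_names.loc[12])

# Players with no current team recorded.
no_team = df_players_demo[df_players_demo["currentTeamId"].isna()]
print(no_team["shortName"].tolist())
```

On the real data, the same calls would be made on df_players instead of df_players_demo.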
