## Using StatsBomb
This section gives an initial overview of the StatsBomb match data that is freely available to hobbyist soccer analysts like myself.

In [1]:
# Import Sbopen from the mplsoccer library to open the match data
# mplsoccer leverages a statsbomb library dependency for this feature
from mplsoccer import Sbopen

# Instantiate a data parser
parser = Sbopen()

# competitions, seasons and match data available
df_comps = parser.competition()

In [2]:
df_comps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   competition_id             43 non-null     int64 
 1   season_id                  43 non-null     int64 
 2   country_name               43 non-null     object
 3   competition_name           43 non-null     object
 4   competition_gender         43 non-null     object
 5   competition_youth          43 non-null     bool  
 6   competition_international  43 non-null     bool  
 7   season_name                43 non-null     object
 8   match_updated              43 non-null     object
 9   match_updated_360          42 non-null     object
 10  match_available_360        4 non-null      object
 11  match_available            43 non-null     object
dtypes: bool(2), int64(2), object(8)
memory usage: 3.6+ KB


In [3]:
# Using chaining methods from Matt Harrison
(df_comps
 .competition_name
 .unique()
)

array(['Champions League', "FA Women's Super League", 'FIFA World Cup',
       'Indian Super league', 'La Liga', 'NWSL', 'Premier League',
       'UEFA Euro', "UEFA Women's Euro", "Women's World Cup"],
      dtype=object)

###  Competitions available
There are quite a few competitions available.  The EPL and La Liga are the most immediately interesting, especially for comparing styles of play between the two leagues.  I would be curious to know which of the two plays more fluidly, with more possession, fewer turnovers, and more successfully executed sequences.  Some ideas for aggregated comparison:
- Average passes per game in the same season
- Average shots per game
- Average fouls, restarts, etc.  
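
As a rough illustration of the first idea, the sketch below computes average passes per game for each competition from a flat list of event records. The records and the `avg_passes_per_game` helper are hypothetical toy data made up for illustration, not StatsBomb output; with real data the same grouping would run over the event frames.

```python
from collections import defaultdict

# Hypothetical toy events: (competition, match_id, event_type). Not real StatsBomb data.
toy_events = [
    ("League A", 1, "Pass"), ("League A", 1, "Pass"), ("League A", 1, "Shot"),
    ("League A", 2, "Pass"),
    ("League B", 3, "Pass"), ("League B", 3, "Carry"),
]

def avg_passes_per_game(events):
    """Return {competition: mean number of Pass events per match}."""
    passes = defaultdict(int)    # competition -> pass count
    matches = defaultdict(set)   # competition -> distinct match ids
    for competition, match_id, event_type in events:
        matches[competition].add(match_id)
        if event_type == "Pass":
            passes[competition] += 1
    return {c: passes[c] / len(ids) for c, ids in matches.items()}

averages = avg_passes_per_game(toy_events)
```

Dividing by the number of distinct matches, rather than by rows, keeps the figure comparable between competitions with different amounts of data.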

In [4]:

(df_comps
 .query('competition_name.isin(["Premier League","La Liga"])')
)

Unnamed: 0,competition_id,season_id,country_name,competition_name,competition_gender,competition_youth,competition_international,season_name,match_updated,match_updated_360,match_available_360,match_available
21,11,90,Spain,La Liga,male,False,False,2020/2021,2022-12-05T14:39:07.366723,2023-04-14T11:06:45.699840,2023-04-14T11:06:45.699840,2022-12-05T14:39:07.366723
22,11,42,Spain,La Liga,male,False,False,2019/2020,2023-02-25T12:34:02.427349,2021-06-13T16:17:31.694,,2023-02-25T12:34:02.427349
23,11,4,Spain,La Liga,male,False,False,2018/2019,2023-04-14T08:32:25.407577,2021-07-09T14:53:22.103024,,2023-04-14T08:32:25.407577
24,11,1,Spain,La Liga,male,False,False,2017/2018,2023-04-07T05:26:33.303597,2021-06-13T16:17:31.694,,2023-04-07T05:26:33.303597
25,11,2,Spain,La Liga,male,False,False,2016/2017,2022-11-30T18:35:52.394297,2021-06-13T16:17:31.694,,2022-11-30T18:35:52.394297
26,11,27,Spain,La Liga,male,False,False,2015/2016,2023-02-21T15:55:07.071365,2021-06-13T16:17:31.694,,2023-02-21T15:55:07.071365
27,11,26,Spain,La Liga,male,False,False,2014/2015,2022-08-14T18:49:03.341489,2021-06-13T16:17:31.694,,2022-08-14T18:49:03.341489
28,11,25,Spain,La Liga,male,False,False,2013/2014,2022-07-23T12:18:49.547396,2021-06-13T16:17:31.694,,2022-07-23T12:18:49.547396
29,11,24,Spain,La Liga,male,False,False,2012/2013,2022-09-25T20:52:24.444609,2021-06-13T16:17:31.694,,2022-09-25T20:52:24.444609
30,11,23,Spain,La Liga,male,False,False,2011/2012,2022-12-01T14:10:17.791769,2021-06-13T16:17:31.694,,2022-12-01T14:10:17.791769


> **Note**: Scratch my EPL vs. La Liga comparison.  There is only one season of EPL games, 2003/2004.  La Liga looks healthy enough.  

### Match Data
To select match data, you must specify a competition and a season.  This is done with the .match() method, passing the ids identified previously.  This still does not provide event data, but we must crawl, walk, and then run.

In [5]:
# df_matches = parser.match(competition_id=72, season_id=30)
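
The two ids can be looked up from the competitions table rather than typed by hand. A minimal sketch, using three La Liga rows copied from the competitions output above as plain dictionaries; the `find_ids` helper is hypothetical and not part of either library.

```python
# Rows copied from the competitions output above (subset of columns).
competitions = [
    {"competition_id": 11, "season_id": 90,
     "competition_name": "La Liga", "season_name": "2020/2021"},
    {"competition_id": 11, "season_id": 42,
     "competition_name": "La Liga", "season_name": "2019/2020"},
    {"competition_id": 11, "season_id": 4,
     "competition_name": "La Liga", "season_name": "2018/2019"},
]

def find_ids(rows, competition_name, season_name):
    """Return (competition_id, season_id) for the first matching row, or None."""
    for row in rows:
        if (row["competition_name"] == competition_name
                and row["season_name"] == season_name):
            return row["competition_id"], row["season_id"]
    return None

ids = find_ids(competitions, "La Liga", "2020/2021")
```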

### Library Issue, shift to statsbombpy
The mplsoccer library call to the StatsBomb API is not working.  There seems to be an error in the date formatting process that breaks code execution.  I am pivoting to the statsbombpy library to see whether the data can be pulled directly.  First I need to install the library, then import it into this notebook.
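
The exact cause was not tracked down here. One possibility, which is only an assumption, is the mix of timestamp precisions in the competitions table: minutes only, milliseconds, and microseconds all appear in the competitions output in this notebook. The sketch below shows that Python's `datetime.fromisoformat` accepts all three shapes, while a single fixed `strptime` format does not.

```python
from datetime import datetime

# Timestamp strings copied from the competitions output in this notebook.
samples = [
    "2020-07-29T05:00",            # minutes only
    "2021-06-13T16:17:31.694",     # milliseconds
    "2023-04-14T11:06:45.699840",  # microseconds
]

# fromisoformat handles all three precisions.
parsed = [datetime.fromisoformat(s) for s in samples]

# A single fixed format requires seconds and a fractional part.
fixed_format = "%Y-%m-%dT%H:%M:%S.%f"
failures = []
for s in samples:
    try:
        datetime.strptime(s, fixed_format)
    except ValueError:
        failures.append(s)  # this string does not fit the fixed format
```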

In [6]:
from statsbombpy import sb

In [7]:
# suppress due to free data warning
import warnings
warnings.filterwarnings('ignore')
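
A blanket `filterwarnings('ignore')` also hides warnings unrelated to the free-data notice. A narrower option is to silence warnings only around the call that triggers them. The sketch below uses a hypothetical `noisy_fetch` function as a stand-in for a data call; the warning category raised by the real library is not assumed here.

```python
import warnings

def noisy_fetch():
    """Hypothetical stand-in for a data call that emits a warning."""
    warnings.warn("credentials were not supplied", UserWarning)
    return "data"

# Suppress warnings only inside this block; filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_fetch()

# Outside the suppressed block the same call warns again; record it to show that.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_fetch()
```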

#### Competitions data

In [8]:
sb.competitions(creds={'user': None, 'passwd': None}).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   competition_id             43 non-null     int64 
 1   season_id                  43 non-null     int64 
 2   country_name               43 non-null     object
 3   competition_name           43 non-null     object
 4   competition_gender         43 non-null     object
 5   competition_youth          43 non-null     bool  
 6   competition_international  43 non-null     bool  
 7   season_name                43 non-null     object
 8   match_updated              43 non-null     object
 9   match_updated_360          42 non-null     object
 10  match_available_360        4 non-null      object
 11  match_available            43 non-null     object
dtypes: bool(2), int64(2), object(8)
memory usage: 3.6+ KB


In [28]:
sb.competitions(creds={'user': None, 'passwd': None})

Unnamed: 0,competition_id,season_id,country_name,competition_name,competition_gender,competition_youth,competition_international,season_name,match_updated,match_updated_360,match_available_360,match_available
0,16,4,Europe,Champions League,male,False,False,2018/2019,2023-03-07T12:20:48.118250,2021-06-13T16:17:31.694,,2023-03-07T12:20:48.118250
1,16,1,Europe,Champions League,male,False,False,2017/2018,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2021-01-23T21:55:30.425330
2,16,2,Europe,Champions League,male,False,False,2016/2017,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2020-07-29T05:00
3,16,27,Europe,Champions League,male,False,False,2015/2016,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2020-07-29T05:00
4,16,26,Europe,Champions League,male,False,False,2014/2015,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2020-07-29T05:00
5,16,25,Europe,Champions League,male,False,False,2013/2014,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2020-07-29T05:00
6,16,24,Europe,Champions League,male,False,False,2012/2013,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2021-07-10T13:41:45.751
7,16,23,Europe,Champions League,male,False,False,2011/2012,2021-08-27T11:26:39.802832,2021-06-13T16:17:31.694,,2020-07-29T05:00
8,16,22,Europe,Champions League,male,False,False,2010/2011,2022-01-26T21:07:11.033473,2021-06-13T16:17:31.694,,2022-01-26T21:07:11.033473
9,16,21,Europe,Champions League,male,False,False,2009/2010,2022-11-15T17:26:10.871011,2021-06-13T16:17:31.694,,2022-11-15T17:26:10.871011


#### Match data within competitions
Notice that this extract has only 35 games.  Only Barcelona's matches are in the dataset.

In [9]:
sb.matches(competition_id=11, season_id=90)

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,...,last_updated_360,match_week,competition_stage,stadium,referee,home_managers,away_managers,data_version,shot_fidelity_version,xy_fidelity_version
0,3773457,2021-05-16,18:30:00.000,Spain - La Liga,2020/2021,Barcelona,Celta Vigo,1,2,available,...,2022-08-04T12:00,37,Regular Season,Spotify Camp Nou,,Ronald Koeman,Eduardo Germán Coudet,1.1.0,2,2
1,3773631,2021-02-07,21:00:00.000,Spain - La Liga,2020/2021,Real Betis,Barcelona,2,3,available,...,2022-08-04T12:00,22,Regular Season,Estadio Benito Villamarín,,Manuel Luis Pellegrini Ripamonti,Ronald Koeman,1.1.0,2,2
2,3773665,2021-03-06,21:00:00.000,Spain - La Liga,2020/2021,Osasuna,Barcelona,0,2,available,...,2022-08-04T12:00,26,Regular Season,Estadio El Sadar,Guillermo Cuadra Fernández,Jagoba Arrasate Elustondo,Ronald Koeman,1.1.0,2,2
3,3773497,2021-04-10,21:00:00.000,Spain - La Liga,2020/2021,Real Madrid,Barcelona,2,1,available,...,2022-08-04T12:00,30,Regular Season,Estadio Alfredo Di Stéfano,Jesús Gil Manzano,Zinédine Zidane,Ronald Koeman,1.1.0,2,2
4,3773660,2020-12-13,21:00:00.000,Spain - La Liga,2020/2021,Barcelona,Levante,1,0,available,...,2022-08-04T12:00,13,Regular Season,Spotify Camp Nou,Ricardo De Burgos Bengoetxea,Ronald Koeman,Francisco José López Fernández,1.1.0,2,2
5,3773593,2020-09-27,21:00:00.000,Spain - La Liga,2020/2021,Barcelona,Villarreal,4,0,available,...,2022-08-04T12:00,3,Regular Season,Spotify Camp Nou,Guillermo Cuadra Fernández,Ronald Koeman,Unai Emery Etxegoien,1.1.0,2,2
6,3773466,2020-10-01,21:30:00.000,Spain - La Liga,2020/2021,Celta Vigo,Barcelona,0,3,available,...,2022-08-04T12:00,4,Regular Season,Abanca-Balaídos,Carlos del Cerro Grande,Óscar García Junyent,Ronald Koeman,1.1.0,2,2
7,3773585,2020-10-24,16:00:00.000,Spain - La Liga,2020/2021,Barcelona,Real Madrid,1,3,available,...,2022-08-04T12:00,7,Regular Season,Spotify Camp Nou,Juan Martínez Munuera,Ronald Koeman,Zinédine Zidane,1.1.0,2,2
8,3773552,2021-01-03,21:00:00.000,Spain - La Liga,2020/2021,Huesca,Barcelona,0,1,available,...,2022-08-04T12:00,17,Regular Season,Estadio El Alcoraz,Guillermo Cuadra Fernández,Miguel Ángel Sánchez Muñoz,Ronald Koeman,1.1.0,2,2
9,3773672,2020-10-04,21:00:00.000,Spain - La Liga,2020/2021,Barcelona,Sevilla,1,1,available,...,2022-08-04T12:00,5,Regular Season,Spotify Camp Nou,Jesús Gil Manzano,Ronald Koeman,Julen Lopetegui Argote,1.1.0,2,2
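
The Barcelona-only observation can be checked directly from the fixture list. The sketch below uses the ten home and away pairs displayed above as plain tuples and confirms that Barcelona appears in each one. It is illustrative and covers only the displayed rows, not all 35.

```python
# (home_team, away_team) pairs copied from the ten rows displayed above.
fixtures = [
    ("Barcelona", "Celta Vigo"),
    ("Real Betis", "Barcelona"),
    ("Osasuna", "Barcelona"),
    ("Real Madrid", "Barcelona"),
    ("Barcelona", "Levante"),
    ("Barcelona", "Villarreal"),
    ("Celta Vigo", "Barcelona"),
    ("Barcelona", "Real Madrid"),
    ("Huesca", "Barcelona"),
    ("Barcelona", "Sevilla"),
]

# True only if every fixture has Barcelona as the home or away side.
only_barcelona = all("Barcelona" in pair for pair in fixtures)

# Split of home and away appearances in the displayed rows.
home_games = sum(home == "Barcelona" for home, _ in fixtures)
away_games = len(fixtures) - home_games
```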


#### Event data within Matches
This is where things get more interesting.  There is an excessive amount of information at this level.  The data can be extracted into a default dict storing multiple independent dataframes.  This is helpful if you want the data parsed at a lower level to minimize data overload.

In [10]:
sb.events(match_id=3773457).type.value_counts()

type
Pass                 1201
Ball Receipt*        1153
Carry                 916
Pressure              411
Ball Recovery          79
Duel                   67
Dribble                43
Block                  38
Dispossessed           37
Clearance              34
Foul Committed         34
Foul Won               33
Dribbled Past          31
Goal Keeper            30
Interception           29
Shot                   25
Miscontrol             20
Substitution            8
Half Start              4
Half End                4
Injury Stoppage         2
Referee Ball-Drop       2
Starting XI             2
Shield                  1
Tactical Shift          1
Name: count, dtype: int64

### Distinct event type dataframes
Here, the data is split by setting the split parameter to True.  This returns a dictionary keyed by event type, which allows inspection of specific types of occurrences.  

In [11]:
# Multiple event types can be split out versus provided in a single frame

sb.events(match_id=3773457, split=True, flatten_attrs=False).keys() # Dictionary format, access dfs with keys

dict_keys(['starting_xis', 'half_starts', 'passes', 'ball_receipts', 'carrys', 'pressures', 'interceptions', 'dispossesseds', 'duels', 'miscontrols', 'ball_recoverys', 'dribbled_pasts', 'dribbles', 'clearances', 'shots', 'goal_keepers', 'blocks', 'foul_committeds', 'foul_wons', 'shields', 'half_ends', 'substitutions', 'injury_stoppages', 'referee_ball_drops', 'tactical_shifts'])
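
Conceptually, splitting is a group-by on the event type. The sketch below reproduces that idea on a few hypothetical toy records; it does not mirror how statsbombpy names its keys or builds its frames.

```python
from collections import defaultdict

# Hypothetical toy events; real events carry many more fields.
toy_events = [
    {"type": "Pass", "player": "A"},
    {"type": "Shot", "player": "B"},
    {"type": "Pass", "player": "C"},
]

def split_by_type(events):
    """Group event records into {event type: [records]}."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["type"]].append(event)
    return dict(grouped)

by_type = split_by_type(toy_events)
```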

In [12]:
# Passing data extracted from the overall view: 1201 passing records in the fixture

sb.events(match_id=3773457, split=True, flatten_attrs=False)['passes'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1201 entries, 0 to 1200
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1201 non-null   object 
 1   index               1201 non-null   int64  
 2   period              1201 non-null   int64  
 3   timestamp           1201 non-null   object 
 4   minute              1201 non-null   int64  
 5   second              1201 non-null   int64  
 6   type                1201 non-null   object 
 7   possession          1201 non-null   int64  
 8   possession_team     1201 non-null   object 
 9   play_pattern        1201 non-null   object 
 10  team                1201 non-null   object 
 11  player              1201 non-null   object 
 12  position            1201 non-null   object 
 13  location            1201 non-null   object 
 14  duration            1201 non-null   float64
 15  related_events      1192 non-null   object 
 16  pass  

### Competition Events
All events from a single competition can be queried using the .competition_events() method.  This allows an entire season to be pulled at once, which is very useful when you are concerned with larger macro-level trends in the game.  This will be a lot of rows.  The call below returns 1M entries.

>**Note**: You can also split the call to obtain grouped events as shown in the individual matches.

In [16]:
sb.competition_events(
    country="Spain",
    division= "La Liga",
    season="2019/2020",
    gender="male"
).head()



Unnamed: 0,50_50,bad_behaviour_card,ball_receipt_outcome,ball_recovery_offensive,ball_recovery_recovery_failure,block_deflection,block_offensive,block_save_block,carry_end_location,clearance_aerial_won,...,shot_technique,shot_type,substitution_outcome,substitution_replacement,tactics,team,team_id,timestamp,type,under_pressure
0,,,,,,,,,,,...,,,,,"{'formation': 433, 'lineup': [{'player': {'id'...",Barcelona,217,00:00:00.000,Starting XI,
1,,,,,,,,,,,...,,,,,"{'formation': 4231, 'lineup': [{'player': {'id...",Eibar,322,00:00:00.000,Starting XI,
2,,,,,,,,,,,...,,,,,"{'formation': 433, 'lineup': [{'player': {'id'...",Barcelona,217,00:00:00.000,Starting XI,
3,,,,,,,,,,,...,,,,,"{'formation': 352, 'lineup': [{'player': {'id'...",Leganés,205,00:00:00.000,Starting XI,
4,,,,,,,,,,,...,,,,,"{'formation': 352, 'lineup': [{'player': {'id'...",Celta Vigo,209,00:00:00.000,Starting XI,


In [29]:
laliga_20and21_events = sb.competition_events(
    country="Spain",
    division= "La Liga",
    season="2020/2021",
    gender="male",
    split=True
)



In [34]:
# apparently the data is not complete for La Liga
(laliga_20and21_events['starting_xis']
#  .groupby(by=['team'])
 .team
 .value_counts()
#  .unstack()
)

team
Barcelona           35
Granada              2
Athletic Club        2
Cádiz                2
Real Valladolid      2
Real Sociedad        2
Valencia             2
Atlético Madrid      2
Getafe               2
Deportivo Alavés     2
Celta Vigo           2
Sevilla              2
Huesca               2
Villarreal           2
Levante              2
Real Madrid          2
Osasuna              2
Real Betis           2
Elche                1
Name: count, dtype: int64
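
The counts make the gap easy to quantify. A full La Liga season is 20 teams playing 38 matches each, so every team should appear 38 times in the starting XI records. The sketch below takes a few of the counts above and reports coverage per team; `EXPECTED_MATCHES`, `coverage`, and `incomplete` are names introduced here for illustration.

```python
# Starting XI counts copied from the output above (subset of teams).
starting_xi_counts = {"Barcelona": 35, "Real Madrid": 2, "Sevilla": 2, "Elche": 1}

EXPECTED_MATCHES = 38  # 20-team double round-robin

# Fraction of each team's season present in the open data.
coverage = {team: n / EXPECTED_MATCHES for team, n in starting_xi_counts.items()}

# Teams whose season is incomplete in this dataset.
incomplete = [team for team, n in starting_xi_counts.items() if n < EXPECTED_MATCHES]
```

Even Barcelona falls short of a full season here, so per-team aggregates from this competition describe Barcelona and its opponents on the day, not the league as a whole.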

In [40]:
WC_2022 = sb.competition_events(
    country="International",
    division= "FIFA World Cup",
    season="2022",
    gender="male",
    split=True
)



### Check for completeness
Competition data sometimes contains only partial datasets.  In the case of the 2022 World Cup, the StatsBomb dataset seemingly has a record for every game.

In [55]:
(WC_2022['starting_xis']
 .assign(formation = lambda df: [d.get('formation') for d in df.tactics])
 .formation
 .value_counts()
#  tactics.iloc[0]['formation']
#  .value_counts()
)

formation
433      36
4231     32
442      17
352      11
343       8
3421      7
3412      6
4141      6
4411      4
41212     1
Name: count, dtype: int64
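
The formation counts give a quick arithmetic check on completeness. The 2022 World Cup had 64 matches, and each match should contribute two Starting XI records, one per team. The sketch below sums the counts from the output above.

```python
# Formation counts copied from the output above.
formation_counts = {
    433: 36, 4231: 32, 442: 17, 352: 11, 343: 8,
    3421: 7, 3412: 6, 4141: 6, 4411: 4, 41212: 1,
}

TOTAL_MATCHES = 64  # matches played at the 2022 World Cup

# One Starting XI record per team per match is expected.
starting_xi_records = sum(formation_counts.values())
is_complete = starting_xi_records == 2 * TOTAL_MATCHES
```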

In [64]:
(WC_2022['shots']
 .groupby(['team'])
 .shot_outcome
 .value_counts()
 .unstack()
 .assign(tot = lambda df: df.sum(axis=1))
 .sort_values(by='tot', ascending=False)
#  tactics.iloc[0]['formation']
#  .value_counts()
)

team,Blocked,Goal,Off T,Post,Saved,Saved Off Target,Saved to Post,Wayward,tot
Argentina,18.0,23.0,30.0,1.0,33.0,,,5.0,110.0
France,32.0,18.0,27.0,1.0,20.0,,,8.0,106.0
Brazil,21.0,10.0,22.0,3.0,36.0,,,7.0,99.0
Croatia,19.0,15.0,31.0,1.0,18.0,,,3.0,87.0
Germany,14.0,6.0,21.0,5.0,18.0,,,4.0,68.0
Portugal,19.0,12.0,15.0,2.0,14.0,,,4.0,66.0
Morocco,16.0,9.0,24.0,,11.0,1.0,1.0,4.0,66.0
England,13.0,13.0,17.0,3.0,15.0,,,2.0,63.0
Spain,14.0,9.0,15.0,2.0,9.0,,1.0,1.0,51.0
Senegal,14.0,5.0,21.0,,7.0,,,3.0,50.0


### 360 data
If 360 data is available, you can access it with the include_360_metrics=True flag.  This is typically a paywalled dataset, so we may not get much access for the course.

In [66]:
try:
    WC_2022_360 = sb.competition_events(
        country="International",
        division="FIFA World Cup",
        season="2022",
        gender="male",
        split=True,
        include_360_metrics=True,
    )
except Exception:
    # Any failure here is treated as the 360 data being unavailable
    print('not available in open data')

not available in open data
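
The competitions table already hints at where 360 data exists: `match_available_360` is populated for only 4 of the 43 rows. A minimal sketch of checking that column before making the request, using three La Liga rows copied from the earlier competitions output with missing values written as `None`; the `has_360` helper is hypothetical.

```python
# Rows copied from the competitions output (subset of columns); None marks a missing value.
competitions = [
    {"season_name": "2020/2021", "match_available_360": "2023-04-14T11:06:45.699840"},
    {"season_name": "2019/2020", "match_available_360": None},
    {"season_name": "2018/2019", "match_available_360": None},
]

def has_360(row):
    """Return True when the row reports a 360 availability timestamp."""
    return row["match_available_360"] is not None

seasons_with_360 = [row["season_name"] for row in competitions if has_360(row)]
```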


