# Data Extraction from StatsBomb

Getting public information from StatsBomb using its public API.

The main idea is to request data from the API and reshape it as needed, so we can build a functional Power BI dashboard with the main charts for analyzing the selected game.

## Libraries needed
First, we import all the libraries we are going to use. Make sure that all the packages and their dependencies are installed.

The libraries used in this project are:
- **statsbombpy:** this is the main library. It is the Python library provided by StatsBomb to import match information from their database. You can read more about it here: https://github.com/statsbomb/statsbombpy
- **pandas:** a popular library for dataframe manipulation.
- **mplsoccer:** this library includes functions and methods to generate soccer-related visuals for presenting football analysis. For instance, we can draw a pitch with the dimensions we want and add visuals for pass directions, heatmaps, percentage of ball possession per area of the pitch, etc. You can read more about it here: https://mplsoccer.readthedocs.io/en/latest/
- **highlight_text:** this package helps us create annotations in matplotlib easily. You can read more about it here: https://pypi.org/project/highlight-text/
- **matplotlib.colors & matplotlib.pyplot:** help us create statistical charts with Python.
- **openpyxl:** helps us manipulate Excel files. This is useful when we want to export the information to an Excel file, for example. It also lets us manipulate rows and columns inside the Excel file, much as we could with a VBA macro.
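
Before running the import cell, it can help to confirm that every package is actually installed. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this example. Note that it checks import names, which can differ from the install names (for instance, `highlight_text` is installed from the `highlight-text` distribution linked above).

```python
from importlib.util import find_spec

# Import names of the packages used in this notebook
REQUIRED = ["statsbombpy", "pandas", "mplsoccer", "highlight_text", "matplotlib", "openpyxl"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```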

In [2]:
from statsbombpy import sb
import pandas as pd
from mplsoccer import VerticalPitch, Pitch
from highlight_text import ax_text, fig_text
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
import openpyxl

### Getting competition matches

The first step is to get the information we need. Because of the way StatsBomb data is organized, we need to provide the competition ID and the season ID (basically, the competition's year).

We get the games from the FIFA World Cup 2022 as follows:
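
If the IDs are not known in advance, they can be looked up by name. statsbombpy offers `sb.competitions()`, which lists the available competitions and seasons. The sketch below assumes that frame has `competition_name` and `season_name` columns next to `competition_id` and `season_id`. To stay runnable offline it uses a small stand-in frame instead of calling the API; the helper `find_ids` and the "Example League" row are made up for illustration.

```python
import pandas as pd


def find_ids(competitions, competition_name, season_name):
    """Return (competition_id, season_id) for one competition and season."""
    match = competitions[
        (competitions["competition_name"] == competition_name)
        & (competitions["season_name"] == season_name)
    ]
    if match.empty:
        raise ValueError(f"No entry for {competition_name} {season_name}")
    row = match.iloc[0]
    return int(row["competition_id"]), int(row["season_id"])


# Stand-in for the frame returned by sb.competitions().
# The first row uses the IDs from this notebook; the second row is hypothetical.
sample = pd.DataFrame({
    "competition_id": [43, 999],
    "season_id": [106, 1],
    "competition_name": ["FIFA World Cup", "Example League"],
    "season_name": ["2022", "2022"],
})

print(find_ids(sample, "FIFA World Cup", "2022"))  # (43, 106)
```

With the real data, `sample` would be replaced by the result of `sb.competitions()`.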

In [6]:
competion_id = 43  # FIFA World Cup
season_id = 106    # Year 2022

matches_data = sb.matches(competition_id=competion_id, season_id=season_id)
matches_data.columns

Index(['match_id', 'match_date', 'kick_off', 'competition', 'season',
       'home_team', 'away_team', 'home_score', 'away_score', 'match_status',
       'match_status_360', 'last_updated', 'last_updated_360', 'match_week',
       'competition_stage', 'stadium', 'referee', 'home_managers',
       'away_managers', 'data_version', 'shot_fidelity_version',
       'xy_fidelity_version'],
      dtype='object')

Notice that the **matches_data** dataframe now contains the match information for the competition and season we need. The output of the **matches_data.columns** command is the list of available columns.

Let's analyse one game, the grand final of the tournament: Argentina vs France!

What we are going to do is get Argentina's games (we could use France's games too) and save the result in a dataframe named **argentina_matches**. Then we sort the result by match week. This way we can see Argentina's progression during the tournament.

In [4]:
# Checking all Argentina games
argentina_matches = matches_data[(matches_data["home_team"] == "Argentina") | (matches_data["away_team"] == "Argentina")]
argentina_matches.sort_values(by="match_week")

| | match_id | match_date | kick_off | competition | season | home_team | away_team | home_score | away_score | match_status | ... | last_updated_360 | match_week | competition_stage | stadium | referee | home_managers | away_managers | data_version | shot_fidelity_version | xy_fidelity_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 3857300 | 2022-11-22 | 12:00:00.000 | International - FIFA World Cup | 2022 | Argentina | Saudi Arabia | 1 | 2 | available | ... | 2023-06-19T15:59:46.628887 | 1 | Group Stage | Lusail Stadium | Slavko Vinčić | Lionel Sebastián Scaloni | Hervé Renard | 1.1.0 | 2 | 2 |
| 13 | 3857289 | 2022-11-26 | 21:00:00.000 | International - FIFA World Cup | 2022 | Argentina | Mexico | 2 | 0 | available | ... | 2023-06-20T11:57:08.547882 | 2 | Group Stage | Lusail Stadium | Daniele Orsato | Lionel Sebastián Scaloni | Gerardo Daniel Martino | 1.1.0 | 2 | 2 |
| 11 | 3857264 | 2022-11-30 | 21:00:00.000 | International - FIFA World Cup | 2022 | Poland | Argentina | 0 | 2 | available | ... | 2023-07-25T09:10:13.832053 | 3 | Group Stage | Stadium 974 | Danny Desmond Makkelie | Czesław Michniewicz | Lionel Sebastián Scaloni | 1.1.0 | 2 | 2 |
| 1 | 3869151 | 2022-12-03 | 21:00:00.000 | International - FIFA World Cup | 2022 | Argentina | Australia | 2 | 1 | available | ... | 2023-07-30T07:48:51.865595 | 4 | Round of 16 | Ahmad bin Ali Stadium | Szymon Marciniak | Lionel Sebastián Scaloni | Graham James Arnold | 1.1.0 | 2 | 2 |
| 6 | 3869321 | 2022-12-09 | 21:00:00.000 | International - FIFA World Cup | 2022 | Netherlands | Argentina | 2 | 2 | available | ... | 2023-06-21T17:51:12.511460 | 5 | Quarter-finals | Lusail Stadium | Antonio Miguel Mateu Lahoz | Louis van Gaal | Lionel Sebastián Scaloni | 1.1.0 | 2 | 2 |
| 19 | 3869519 | 2022-12-13 | 21:00:00.000 | International - FIFA World Cup | 2022 | Argentina | Croatia | 3 | 0 | available | ... | 2023-04-26T22:32:37.808359 | 6 | Semi-finals | Lusail Stadium | Daniele Orsato | Lionel Sebastián Scaloni | Zlatko Dalić | 1.1.0 | 2 | 2 |
| 9 | 3869685 | 2022-12-18 | 17:00:00.000 | International - FIFA World Cup | 2022 | Argentina | France | 3 | 3 | available | ... | 2023-08-17T15:55:15.164685 | 7 | Final | Lusail Stadium | Szymon Marciniak | Lionel Sebastián Scaloni | Didier Deschamps | 1.1.0 | 2 | 2 |
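
Instead of reading the ID off the output above, the final's `match_id` could also be selected from the dataframe by its `competition_stage`. This is only a sketch: the frame below copies a few columns of three rows from that output so the example runs without the API, and the helper `match_id_for_stage` is a made-up name.

```python
import pandas as pd

# A few columns of three rows copied from the output above
argentina_matches = pd.DataFrame({
    "match_id": [3857300, 3869519, 3869685],
    "match_week": [1, 6, 7],
    "competition_stage": ["Group Stage", "Semi-finals", "Final"],
    "home_team": ["Argentina", "Argentina", "Argentina"],
    "away_team": ["Saudi Arabia", "Croatia", "France"],
})


def match_id_for_stage(matches, stage):
    """Return the match_id of the single match played at the given stage."""
    ids = matches.loc[matches["competition_stage"] == stage, "match_id"]
    if len(ids) != 1:
        raise ValueError(f"Expected exactly one '{stage}' match, found {len(ids)}")
    return int(ids.iloc[0])


final_match_id = match_id_for_stage(argentina_matches, "Final")
print(final_match_id)  # 3869685
```

Raising an error when the stage matches zero or several rows avoids silently picking the wrong game, for example when asking for "Group Stage".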


The match ID of the final is 3869685. Let's get all the information about that game. With the **sb.events()** method, we can extract all the events of the game. Cool, right?!

In [20]:
final_events = sb.events(match_id=3869685)
# Keep only the events that have a pass_type value
filter_pass_events = final_events[final_events['pass_type'].notna()]
# Keep only the columns needed for the pass analysis
pass_events = filter_pass_events[['id', 'timestamp', 'minute', 'second', 'team_id', 'team', 'location', 'pass_type', 'pass_outcome', 'player_id', 'player', 'pass_end_location']]
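
One thing worth checking: as far as I understand the StatsBomb event format, `pass_type` is filled only for special passes (corners, free kicks, throw-ins and similar), so filtering on it may leave out regular open-play passes. The sketch below compares that filter with filtering on the event `type` column, assuming that column holds the value "Pass" for every pass. It also splits the `[x, y]` lists into separate numeric columns, which tend to be easier to use in Power BI. The rows are hypothetical stand-ins shaped like the events frame, and `split_xy` is a made-up helper.

```python
import pandas as pd

# Hypothetical rows, only shaped like the output of sb.events()
events = pd.DataFrame({
    "type": ["Pass", "Pass", "Shot"],
    "pass_type": [None, "Corner", None],
    "location": [[60.0, 40.0], [120.0, 0.1], [108.0, 36.0]],
    "pass_end_location": [[75.5, 30.0], [110.0, 38.0], None],
})

# Every pass event, whether or not it has a special pass_type
all_passes = events[events["type"] == "Pass"]
# Only the passes that carry a pass_type value (the filter used above)
typed_passes = events[events["pass_type"].notna()]
print(len(all_passes), len(typed_passes))  # 2 1


def split_xy(frame, column, x_name, y_name):
    """Expand a column of [x, y] lists into two numeric columns."""
    coords = pd.DataFrame(frame[column].tolist(), index=frame.index,
                          columns=[x_name, y_name])
    return frame.join(coords)


passes_xy = split_xy(all_passes, "location", "x", "y")
passes_xy = split_xy(passes_xy, "pass_end_location", "end_x", "end_y")
print(passes_xy[["x", "y", "end_x", "end_y"]])
```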