# FBREF.COM - Scraping Info, Examples & Exploration

### Most important: Scraping match data 
- Match data of a given **type** for a given **team**, **season** and **league** is found on the following page:
    - ``https://fbref.com/en/squads/{squad_index}/{yyyy-yyyy}/matchlogs/c{league_index}/{data_type}``

### URL Composition

- URLs always start with the following stem:
``https://fbref.com/en/``

#### Squads (teams)
- Squads have an alphanumerical index with length 8. Squad specific pages have ``/squads/{squad_index}`` added to the stem.
- Example: Eintracht Frankfurt
    - https://fbref.com/en/squads/f0ac8ee6

#### Competitions (leagues) 
- Each league has an integer index.
- To get to a league overview page, we add ``comps/{league_index}`` to the stem.
- Example: Premier League overview
    - https://fbref.com/en/comps/9
- To get to a squad- and league-specific match data page, we add ``/c{league_index}`` after the squad index.
- Example: Eintracht Frankfurt, Bundesliga
    - https://fbref.com/en/squads/f0ac8ee6/c20

#### Seasons
- Season-specific pages generally just add ``/{yyyy-yyyy}``, i.e. two consecutive years separated by a hyphen, to the URL (following the squad id for squad-specific pages). Omitting this part usually defaults the page to the most recent/currently running season.
- Example: Premier League season 2017/2018 overview
    - https://fbref.com/en/comps/9/2017-2018

#### Data types
- There are (at most) 9 types of squad-level match data, each displayed in a separate table and accessible via a separate URL.

|**data type (displayed table header)** | **url string** | 
|-|-|
| Scores & Fixtures | ``schedule`` |
| Shooting | ``shooting`` |
| Goalkeeping | ``keeper`` |
| Passing | ``passing`` |
| Pass Types | ``passing_types`` |
| Goal and Shot Creation | ``gca`` |
| Defensive Actions | ``defense`` |
| Possession | ``possession`` |
| Miscellaneous Stats | ``misc`` |

- Example: Arsenal passing stats, season 2017/2018, Premier League
    - https://fbref.com/en/squads/18bb7c10/2017-2018/matchlogs/c9/passing
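
The URL pattern described in this section can be wrapped in a small helper. The following is a minimal sketch: the function name `build_matchlog_url` and the constant `URL_STRINGS` are illustrative and not part of fbref or of the scraping code in this notebook.

```python
# URL strings of the nine squad-level data types listed in the table above
URL_STRINGS = ("schedule", "shooting", "keeper", "passing", "passing_types",
               "gca", "defense", "possession", "misc")


def build_matchlog_url(squad_index: str, season_start_year: int,
                       league_index: int, data_type: str) -> str:
    """Compose the match log URL for one squad, season, league and data type."""
    if data_type not in URL_STRINGS:
        raise ValueError(f"Unknown data type: {data_type!r}")
    season = f"{season_start_year}-{season_start_year + 1}"  # e.g. 2017 -> "2017-2018"
    return (f"https://fbref.com/en/squads/{squad_index}/{season}"
            f"/matchlogs/c{league_index}/{data_type}")


# Reproduces the Arsenal passing example above
print(build_matchlog_url("18bb7c10", 2017, 9, "passing"))
```

Rejecting unknown data type strings early is a design choice of this sketch; it avoids sending a request for a URL that cannot exist.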

### Setup

In [53]:
import requests
import pandas as pd
import bs4

In [54]:
# This table contains scraping-relevant info for every data type available on fbref.
# Storing it here makes it possible to write the scraping procedure as a loop over all data types later instead of dealing with them individually.

MATCH_DATA_TYPES = pd.DataFrame({
    'filter_text': ['Scores & Fixtures', 'Shooting', 'Goalkeeping', 'Passing', 'Pass Types', 'Goal and Shot Creation', 'Defensive Actions', 'Possession', 'Miscellaneous Stats'],
    'url_string':['schedule', 'shooting', 'keeper', 'passing', 'passing_types', 'gca', 'defense', 'possession', 'misc'],
    'n_expected_cols':[18, 25, 36, 32, 25, 24, 26, 33, 26], # expected cols in raw table (as read in by pd.read_html())
    'missing_header_replacement': [None, None, None, 'Attacking', 'General', None, 'General', 'General', None], # some tables have missing first-lvl headers for a couple second-level headers
    '10th_col_header_fix': [False, False, False, False, True, False, False, True, False] # some tables have the 'For {squadname}' header in the 10th column which we want to avoid (-> replace)
    }) 
MATCH_DATA_TYPES

# Saving the expected number of columns per data type table might be useful for cases where not all data types are available, or some columns are missing (e.g. xG metrics in older seasons)
# It looks like fbref simply displays an empty table (with the expected dimensions) when accessing a URL for which no data is available, which is advantageous for us.

Unnamed: 0,filter_text,url_string,n_expected_cols,missing_header_replacement,10th_col_header_fix
0,Scores & Fixtures,schedule,18,,False
1,Shooting,shooting,25,,False
2,Goalkeeping,keeper,36,,False
3,Passing,passing,32,Attacking,False
4,Pass Types,passing_types,25,General,True
5,Goal and Shot Creation,gca,24,,False
6,Defensive Actions,defense,26,General,False
7,Possession,possession,33,General,True
8,Miscellaneous Stats,misc,26,,False


### Step-by-step Scraping Example

The following code should be able to handle any combination of (valid) parameters specified in the cell below. In particular, it is designed to handle any of the 9 different squad-level data types we can obtain from fbref. This will allow us to package the code into a loop to scrape all data types for a given squad/season/league combination. 

In [55]:
# define parameters (using fbref indices for squads/leagues)
squad_index = "18bb7c10" # 18bb7c10 -> Arsenal
season_start_year = 2021
league_index = 9 # 9 -> Premier League
data_type = MATCH_DATA_TYPES.iloc[0] # 0 -> Scores & Fixtures, 1 -> Shooting, etc.

# build url 
url = f"https://fbref.com/en/squads/{squad_index}/{season_start_year}-{season_start_year+1}/matchlogs/c{league_index}/{data_type['url_string']}"
url

'https://fbref.com/en/squads/18bb7c10/2021-2022/matchlogs/c9/schedule'

In [56]:
# get data
response = requests.get(url)
response.status_code # 200 -> OK

200

In [57]:
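# we can get our table without even using bs4 via pandas' read_html function
# (wrapping the HTML in StringIO avoids the deprecation warning that newer pandas versions raise for literal HTML strings)
from io import StringIO
table_df = pd.read_html(StringIO(response.text), match=data_type['filter_text'])[0]
table_df.head(2)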

Unnamed: 0,Date,Time,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2021-08-13,20:00,Matchweek 1,Fri,Away,L,0,2,Brentford,1.3,1.2,64,16479,Granit Xhaka,4-2-3-1,Michael Oliver,Match Report,
1,2021-08-22,16:30,Matchweek 2,Sun,Home,L,0,2,Chelsea,0.3,3.1,35,58729,Granit Xhaka,4-2-3-1,Paul Tierney,Match Report,


In [58]:
# show the last rows: every table except Scores & Fixtures ends with a summary row
table_df.tail(2) 

Unnamed: 0,Date,Time,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
36,2022-05-16,20:00,Matchweek 37,Mon,Away,L,0,2,Newcastle Utd,0.6,1.6,49,52274,Martin Ødegaard,4-2-3-1,Darren England,Match Report,
37,2022-05-22,16:00,Matchweek 38,Sun,Home,W,5,1,Everton,4.2,1.1,73,60201,Martin Ødegaard,4-2-3-1,Andre Marriner,Match Report,


In [59]:
# Except for the Scores & Fixtures table, we always drop the first 9 columns and the last one (which contain repeated info) as well as the last row (which is a summary row)
# first verify that the table has the expected number of columns
if table_df.shape[1] == data_type['n_expected_cols']: # check if column count as expected
    if data_type['filter_text'] != MATCH_DATA_TYPES.iloc[0]['filter_text']: # check if not Scores & Fixtures
        # drop first 9 columns, last column, and last row
        table_df = table_df.iloc[:-1, 9:-1]
else: 
    raise ValueError(f"Unexpected number of columns ({table_df.shape[1]}, exp: {data_type['n_expected_cols']}) in scraped table for data type: {data_type['filter_text']}")
table_df.tail(2)

Unnamed: 0,Date,Time,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
36,2022-05-16,20:00,Matchweek 37,Mon,Away,L,0,2,Newcastle Utd,0.6,1.6,49,52274,Martin Ødegaard,4-2-3-1,Darren England,Match Report,
37,2022-05-22,16:00,Matchweek 38,Sun,Home,W,5,1,Everton,4.2,1.1,73,60201,Martin Ødegaard,4-2-3-1,Andre Marriner,Match Report,


In [60]:
# Some data types have a single index, some a multiindex (i.e. two levels of column names)
# If it's a multiindex we first replace any pandas-auto-generated missing first-level column headers (as specified in MATCH_DATA_TYPES)
# Then we drop the first level and add it as a prefix to the second level column names.
# This prevents duplicate column names.

# check if multiindex (note: there are at most 2 levels)
if table_df.columns.nlevels == 2:
    # first: replace missing first-level headers and perform 10th column header fix if necessary
    new_colnames = [(l1, l2) if not l1.startswith('Unnamed:') else (data_type['missing_header_replacement'], l2) for l1, l2 in table_df.columns]
    if data_type['10th_col_header_fix']: # we already dropped the first 9 columns so the 10th column is now the first
        new_colnames[0] = (data_type['missing_header_replacement'], new_colnames[0][1]) # rename first-level header
    table_df.columns = pd.MultiIndex.from_tuples(new_colnames)
    # second: flatten the multiindex by using the first level (lowercase, whitespace removed) as a prefix for the second level
    # assigning a flat list of names replaces the multiindex with a single-level index
    table_df.columns = [f"{l1.lower().replace(' ', '')}_{l2}" for l1, l2 in table_df.columns]
table_df.columns

Index(['Date', 'Time', 'Round', 'Day', 'Venue', 'Result', 'GF', 'GA',
       'Opponent', 'xG', 'xGA', 'Poss', 'Attendance', 'Captain', 'Formation',
       'Referee', 'Match Report', 'Notes'],
      dtype='object')
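
Since the Scores & Fixtures table has a single-level header, the cell above leaves its columns unchanged. To illustrate what the flattening does for a two-level header, here is a minimal sketch in plain Python that works on `(level1, level2)` tuples such as those obtained from `list(table_df.columns)`. The function name `flatten_headers` and the sample header tuples are assumptions made for illustration only; they are not taken from an actual fbref table.

```python
from typing import List, Optional, Tuple


def flatten_headers(columns: List[Tuple[str, str]],
                    replacement: Optional[str]) -> List[str]:
    """Turn (level1, level2) header tuples into flat 'level1_level2' names."""
    flat = []
    for level1, level2 in columns:
        # pandas names missing first-level headers 'Unnamed: ...'
        if level1.startswith("Unnamed:"):
            if replacement is None:
                raise ValueError(f"No replacement given for missing header above {level2!r}")
            level1 = replacement
        # first level becomes a lowercase prefix without spaces
        flat.append(f"{level1.lower().replace(' ', '')}_{level2}")
    return flat


# Illustrative header tuples (hypothetical, not scraped)
sample = [("Unnamed: 9_level_0", "Att"), ("Pass Types", "Live")]
print(flatten_headers(sample, "General"))
```

Raising an error when a missing header has no replacement is a choice of this sketch; the notebook cell instead relies on `MATCH_DATA_TYPES` providing a replacement wherever one is needed.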

In the actual scraping function we want to add a few extra columns with the following information for every match:
- season -> already known as parameter
- fbref league id -> already known as parameter
- fbref squad id  -> already known as parameter
- fbref opponent squad id -> can be extracted from links in table (or matched later using opponent name)
- fbref match id -> must be extracted from links in table

In [61]:
# note: when scraping all data types, this id extraction only needs to be performed once (-> during the first iteration)

# find table with bs4
soup = bs4.BeautifulSoup(response.text, 'html.parser')
soup_table = soup.find('table', {'id': 'matchlogs_for'}) # table id is always 'matchlogs_for'

match_ids, opponent_ids = [], []
# iterate through table rows 
for row in soup_table.find_all('tr'):
    # skip non-data rows
    if (row.find('th', {'class': 'poptip'}) is None and # no header row
        row.find('th', {'class': 'over_header'}) is None and # no over header row
        row.find('th', {'class': 'left iz'}) is None): # no bottom summary row
    
        # find opponent column
        td_opp = row.find('td', {'data-stat': 'opponent'})
        # extract link (has form: /en/squads/{squad_id}/...)
        opponent_link = td_opp.find('a')['href']
        # extract opponent squad id from link (4th element after splitting on '/')
        opponent_squad_id = opponent_link.split('/')[3]

        # find match report column
        td_match = row.find('td', {'data-stat': 'match_report'})
        # extract link (has form: /en/matches/{match_id}/..)
        match_report_link = td_match.find('a')['href']
        # extract match id from link
        match_id = match_report_link.split('/')[3]

        match_ids.append(match_id)
        opponent_ids.append(opponent_squad_id)

# append columns with data we have as parameters
table_df['fbref_season'] = f"{season_start_year}-{season_start_year+1}" # same value each row
table_df['fbref_league_id'] = league_index # same value each row
table_df['fbref_squad_id'] = squad_index # same value each row

# append extracted ids (lengths must match the number of rows; if not, pandas raises a ValueError)
table_df['fbref_opponent_id'] = opponent_ids
table_df['fbref_match_id'] = match_ids

table_df.head(2)

Unnamed: 0,Date,Time,Round,Day,Venue,Result,GF,GA,Opponent,xG,...,Captain,Formation,Referee,Match Report,Notes,fbref_season,fbref_league_id,fbref_squad_id,fbref_opponent_id,fbref_match_id
0,2021-08-13,20:00,Matchweek 1,Fri,Away,L,0,2,Brentford,1.3,...,Granit Xhaka,4-2-3-1,Michael Oliver,Match Report,,2021-2022,9,18bb7c10,cd051869,3adf2aa7
1,2021-08-22,16:30,Matchweek 2,Sun,Home,L,0,2,Chelsea,0.3,...,Granit Xhaka,4-2-3-1,Paul Tierney,Match Report,,2021-2022,9,18bb7c10,cff3d9bb,93954213
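
The link parsing used in the cell above can be isolated into a small helper, which makes the assumption about the link format explicit. This is a minimal sketch: the name `id_from_href` is illustrative, and the trailing path segment `placeholder` stands in for the part of the link that is elided in the comments above.

```python
def id_from_href(href: str) -> str:
    """Extract the id from links of the form /en/squads/{id}/... or /en/matches/{id}/..."""
    parts = href.split("/")  # leading '/' yields an empty first element
    if len(parts) < 4 or parts[1] != "en" or parts[2] not in ("squads", "matches"):
        raise ValueError(f"Unexpected link format: {href!r}")
    return parts[3]


# ids taken from the first row of the output above; the last path segment is a placeholder
print(id_from_href("/en/squads/cd051869/placeholder"))
print(id_from_href("/en/matches/3adf2aa7/placeholder"))
```

Checking the link prefix before indexing means that a change in fbref's link structure would surface as an explicit error rather than as silently wrong ids.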


In [62]:
# Finally, we prefix the data type (url string) to all column names. 
# This is done to avoid duplicates in the final concatenated dataframe (e.g. Goalkeeping and Miscellaneous Stats both have a first-level header called 'Performance').
table_df.columns = [f"{data_type['url_string']}_{col}" for col in table_df.columns]
table_df.columns

Index(['schedule_Date', 'schedule_Time', 'schedule_Round', 'schedule_Day',
       'schedule_Venue', 'schedule_Result', 'schedule_GF', 'schedule_GA',
       'schedule_Opponent', 'schedule_xG', 'schedule_xGA', 'schedule_Poss',
       'schedule_Attendance', 'schedule_Captain', 'schedule_Formation',
       'schedule_Referee', 'schedule_Match Report', 'schedule_Notes',
       'schedule_fbref_season', 'schedule_fbref_league_id',
       'schedule_fbref_squad_id', 'schedule_fbref_opponent_id',
       'schedule_fbref_match_id'],
      dtype='object')

In [63]:
print(f'Shape of final {data_type["filter_text"]} table: {table_df.shape}')

Shape of final Scores & Fixtures table: (38, 23)


In [64]:
# To build our scraping function we can iterate the steps above for every match data type and concatenate the results into one dataframe.

# (see other file(s))
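
The actual scraping function lives elsewhere, so the following is only a rough sketch of the looping and concatenation logic, written in plain Python to keep it free of dependencies. It assumes a hypothetical callable `scrape_one` that performs the steps above for one data type and returns one dict per match; both `scrape_one` and `combine_data_types` are names invented for this illustration. With DataFrames, the side-by-side merge would correspond to `pd.concat(..., axis=1)`.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def combine_data_types(scrape_one: Callable[[str], List[Row]],
                       url_strings: List[str]) -> List[Row]:
    """Scrape every data type and merge the per-match rows side by side."""
    combined: List[Row] = []
    for i, url_string in enumerate(url_strings):
        rows = scrape_one(url_string)  # hypothetical: one dict per match
        if i == 0:
            combined = [{} for _ in rows]
        elif len(rows) != len(combined):
            raise ValueError(f"Row count mismatch for data type {url_string!r}")
        for target, row in zip(combined, rows):
            for col, value in row.items():
                # prefix with the data type to avoid duplicate column names
                target[f"{url_string}_{col}"] = value
    return combined
```

The row count check reflects the assumption that every data type table lists the same matches in the same order, which should be verified against real data before relying on it.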
