# Introduction

This notebook demonstrates how to pull data from FBref: https://fbref.com/en/.

There are two ways to pull data from FBref:
* Method 1: pandas .read_html()
* Method 2: Beautiful Soup, a package for parsing html scraped from web pages

Using both methods, I will pull the current season's data, including:
* league table
* season averages (xG, xA, G, A, presses per final third, etc.)
* player stats
* per match data

# Imports

In [6]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

pd.set_option('display.max_columns', None)

# Functions

FBref BeautifulSoup functions

In [2]:
def html_table_to_list_of_rows(html_table):
    """Convert an html table into a list of rows, where each row is a list of cell text"""
    # fetch table rows
    table_rows = html_table.find_all('tr') # find table rows

    # grab rows of data from website into list format
    table_rows_list = []

    for table_row in table_rows:
        td = table_row.find_all('th') + table_row.find_all('td') # th header, td cell of a table
        row = [data.text for data in td]
        table_rows_list.append(row) 
    
    return table_rows_list

def fbref_list_of_rows_to_df(table_rows_list, column_level):
    """Convert rows of data from fbref to a df; column_level is the number of header rows (1 or 2)"""
    if column_level == 1:
        table_columns_names = table_rows_list[0]
        table_rows = table_rows_list[1:]
    elif column_level == 2:
        # the first row holds the group headers, so take names from the second row
        table_columns_names = table_rows_list[1]
        table_rows = table_rows_list[2:]
    else:
        raise ValueError('Can only take 1 or 2 for column level argument')
        
    # create dataframe for above data
    table_df = pd.DataFrame(data=table_rows, columns=table_columns_names)
    
    return table_df

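def get_prem_league_table(season):
    """Scrape the premier league table for the season starting in `season` (e.g. 2022 for 2022-2023)"""
    # grab data from website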
    page = requests.get(
        f"https://fbref.com/en/comps/9/{season}-{season+1}/{season}-{season+1}-Premier-League-Stats"
    )
    soup = BeautifulSoup(page.content, 'html.parser')

    # find table
    prem_league_table = soup.find(id=f"results{season}-{season+1}91_overall") 

    # get table data into list of lists
    prem_league_table_rows_list = html_table_to_list_of_rows(prem_league_table)
    
    # create df
    prem_league_table_df = fbref_list_of_rows_to_df(prem_league_table_rows_list, 1)

    
    return prem_league_table_df

def get_team_stats(season):
    """Scrape the standard squad stats table for the season starting in `season`"""
    # grab data from website
    page = requests.get(
        f"https://fbref.com/en/comps/9/{season}-{season+1}/{season}-{season+1}-Premier-League-Stats"
    )
    soup = BeautifulSoup(page.content, 'html.parser')

    # find table
    prem_league_stats_table = soup.find(id="stats_squads_standard_for") 

    # get table data into list of lists
    prem_league_stats_table_rows_list = html_table_to_list_of_rows(prem_league_stats_table)
    
    # create df
    prem_league_stats_df = fbref_list_of_rows_to_df(prem_league_stats_table_rows_list, 2)
    
    
    return prem_league_stats_df

def get_big5_player_stats(season):
    """Scrape the standard player stats table for the big 5 European leagues"""
    # grab data from website
    page = requests.get(
        f"https://fbref.com/en/comps/Big5/{season}-{season+1}/stats/players/{season}-{season+1}-Big-5-European-Leagues-Stats"
    )
    soup = BeautifulSoup(page.content, 'html.parser')

    # find table
    big5_player_table = soup.find(id="stats_standard") 

    # get table data into list of lists
    big5_player_table_rows_list = html_table_to_list_of_rows(big5_player_table)
    
    # create df
    big5_player_df = fbref_list_of_rows_to_df(big5_player_table_rows_list, 2)
    
    return big5_player_df

Utility functions

In [3]:
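def flatten_cols(df):
    """Return a copy of df with two-level column names joined as 'top_bottom'"""
    df = df.copy()  # do not mutate the caller's df, so the cell can be re-run safely
    df.columns = ['_'.join(x) for x in df.columns.to_flat_index()]
    return df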

# Config

We will extract data for the current Premier League season. The season is identified by its starting year, so 2022 means 2022-2023.

In [4]:
season = 2022
fbref_prem_id = 9
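
The URLs hard-coded throughout this notebook follow one pattern, so the two config values could drive a small URL builder. A minimal sketch, assuming that pattern holds for other competitions too; `build_league_stats_url` and its `league_slug` parameter are hypothetical names that do not appear elsewhere in the notebook:

```python
def build_league_stats_url(comp_id: int, season: int, league_slug: str) -> str:
    """Build a season stats URL; `season` is the starting year (2022 -> 2022-2023)."""
    season_label = f"{season}-{season + 1}"
    return (
        f"https://fbref.com/en/comps/{comp_id}/{season_label}/"
        f"{season_label}-{league_slug}-Stats"
    )


season = 2022
fbref_prem_id = 9

# "Premier-League" is the slug used in the URLs in this notebook
prem_url = build_league_stats_url(fbref_prem_id, season, "Premier-League")
print(prem_url)
```

Building the URL in one place means a change of season or competition only touches the config cell.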

# Method 1: pandas .read_html()

### Load league data 

Here we request the page and parse its html tables into a list of dataframes, which holds all the standard team data we need from FBref.

In [5]:
league_data = pd.read_html(f'https://fbref.com/en/comps/9/{season}-{season+1}/{season}-{season+1}-Premier-League-Stats')

### Fetching league table

In [6]:
regular_season_table = league_data[0]
regular_season_table_home_away = league_data[1]

In [7]:
# table for prem league table
regular_season_table.head(3)

Unnamed: 0,Rk,Squad,MP,W,D,L,GF,GA,GD,Pts,Pts/MP,xG,xGA,xGD,xGD/90,Last 5,Attendance,Top Team Scorer,Goalkeeper,Notes
0,1,Arsenal,11,9,1,1,25,11,14,28,2.55,19.4,10.4,9.0,0.82,W W W W D,60109,Gabriel Jesus - 5,Aaron Ramsdale,
1,2,Manchester City,11,8,2,1,36,11,25,26,2.36,23.0,8.2,14.8,1.35,W W W L W,53340,Erling Haaland - 17,Ederson,
2,3,Tottenham,12,7,2,3,23,14,9,23,1.92,19.2,13.9,5.3,0.44,L W W L L,61610,Harry Kane - 10,Hugo Lloris,


In [8]:
# home and away data
regular_season_table_home_away.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Home,Home,Home,Home,Home,Home,Home,Home,...,Away,Away,Away,Away,Away,Away,Away,Away,Away,Away
Unnamed: 0_level_1,Rk,Squad,MP,W,D,L,GF,GA,GD,Pts,...,L,GF,GA,GD,Pts,Pts/MP,xG,xGA,xGD,xGD/90
0,1,Arsenal,5,5,0,0,14,7,7,15,...,1,11,4,7,13,2.17,6.4,6.0,0.4,0.07
1,2,Manchester City,6,6,0,0,27,6,21,18,...,1,9,5,4,8,1.6,8.4,5.0,3.4,0.67
2,3,Tottenham,6,5,0,1,16,6,10,15,...,2,7,8,-1,8,1.33,6.9,8.3,-1.4,-0.23


### Fetching standard team stats (e.g. xG, xA, goals and assist data)

In [9]:
squad_standard_stats = league_data[2]
squad_standard_stats_opponents = league_data[3]

In [10]:
squad_standard_stats.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Playing Time,Playing Time,Playing Time,Playing Time,Performance,Performance,...,Per 90 Minutes,Expected,Expected,Expected,Expected,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes
Unnamed: 0_level_1,Squad,# Pl,Age,Poss,MP,Starts,Min,90s,Gls,Ast,...,G+A-PK,xG,npxG,xAG,npxG+xAG,xG,xAG,xG+xAG,npxG,npxG+xAG
0,Arsenal,21,24.6,56.6,11,121,990,11.0,24,18,...,3.73,19.4,19.0,13.0,32.0,1.77,1.18,2.95,1.73,2.91
1,Aston Villa,21,27.4,49.5,12,132,1080,12.0,11,6,...,1.33,14.5,13.8,10.0,23.8,1.21,0.83,2.05,1.15,1.99
2,Bournemouth,25,26.8,39.3,12,132,1080,12.0,10,8,...,1.5,7.0,7.0,5.4,12.3,0.58,0.45,1.03,0.58,1.03


In [11]:
squad_standard_stats_opponents.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Playing Time,Playing Time,Playing Time,Playing Time,Performance,Performance,...,Per 90 Minutes,Expected,Expected,Expected,Expected,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes
Unnamed: 0_level_1,Squad,# Pl,Age,Poss,MP,Starts,Min,90s,Gls,Ast,...,G+A-PK,xG,npxG,xAG,npxG+xAG,xG,xAG,xG+xAG,npxG,npxG+xAG
0,vs Arsenal,21,27.0,43.4,11,121,990,11.0,10,7,...,1.45,10.4,8.9,7.9,16.8,0.95,0.72,1.67,0.81,1.52
1,vs Aston Villa,21,26.7,50.5,12,132,1080,12.0,14,7,...,1.67,14.9,13.3,9.2,22.5,1.25,0.76,2.01,1.11,1.88
2,vs Bournemouth,25,27.0,60.8,12,132,1080,12.0,23,15,...,2.83,19.3,16.2,12.4,28.5,1.61,1.03,2.64,1.35,2.38


### Fetching team defensive actions stats (e.g. tackles, blocks and pressing data)

In [12]:
squad_defensive_actions_stats = league_data[16]
squad_defensive_actions_stats_opponents = league_data[17]

In [13]:
# squad defensive actions data
squad_defensive_actions_stats.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Tackles,Tackles,Tackles,Tackles,Tackles,Vs Dribbles,Vs Dribbles,Vs Dribbles,Vs Dribbles,Blocks,Blocks,Blocks,Unnamed: 15_level_0,Unnamed: 16_level_0,Unnamed: 17_level_0,Unnamed: 18_level_0
Unnamed: 0_level_1,Squad,# Pl,90s,Tkl,TklW,Def 3rd,Mid 3rd,Att 3rd,Tkl,Att,Tkl%,Past,Blocks,Sh,Pass,Int,Tkl+Int,Clr,Err
0,Arsenal,21,11.0,166,107,80,52,34,68,126,54.0,58,104,26,78,82,248,195,5
1,Aston Villa,21,12.0,229,123,109,95,25,103,187,55.1,84,149,41,108,107,336,174,3
2,Bournemouth,25,12.0,203,129,102,83,18,79,170,46.5,91,144,54,90,89,292,296,0


In [14]:
# squad defensive actions data opponents
squad_defensive_actions_stats_opponents.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Tackles,Tackles,Tackles,Tackles,Tackles,Vs Dribbles,Vs Dribbles,Vs Dribbles,Vs Dribbles,Blocks,Blocks,Blocks,Unnamed: 15_level_0,Unnamed: 16_level_0,Unnamed: 17_level_0,Unnamed: 18_level_0
Unnamed: 0_level_1,Squad,# Pl,90s,Tkl,TklW,Def 3rd,Mid 3rd,Att 3rd,Tkl,Att,Tkl%,Past,Blocks,Sh,Pass,Int,Tkl+Int,Clr,Err
0,vs Arsenal,21,11.0,211,109,123,72,16,89,198,44.9,109,127,50,77,96,307,212,7
1,vs Aston Villa,21,12.0,183,112,104,57,22,70,134,52.2,64,126,45,81,95,278,243,3
2,vs Bournemouth,25,12.0,209,131,81,86,42,92,170,54.1,78,118,21,97,84,293,160,3
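
Picking tables by position (`league_data[16]`, `league_data[17]`) breaks if the page gains or loses a table. One possible alternative is to search the list for a table whose top-level column labels contain a distinctive group name. This is only a sketch on hand-built frames that mimic the layout above; `find_squad_table` is a hypothetical helper, and it assumes the chosen label (here "Tackles") appears in only one pair of tables:

```python
import pandas as pd


def find_squad_table(tables, group_label, opponents=False):
    """Return the first two-level table containing `group_label` as a column group.

    `opponents=True` selects the table whose Squad values all start with 'vs '.
    """
    for table in tables:
        cols = table.columns
        if not isinstance(cols, pd.MultiIndex):
            continue  # flat tables such as the league table
        if group_label not in cols.get_level_values(0):
            continue
        if "Squad" not in cols.get_level_values(1):
            continue
        squads = table.xs("Squad", axis=1, level=1).iloc[:, 0].astype(str)
        is_opponent_table = bool(squads.str.startswith("vs ").all())
        if is_opponent_table == opponents:
            return table
    raise LookupError(f"no table found with column group {group_label!r}")


def make_squad_table(rows):
    """Build a tiny stand-in for a parsed squad table (illustration only)."""
    columns = pd.MultiIndex.from_tuples(
        [("Unnamed: 0_level_0", "Squad"), ("Tackles", "Tkl")]
    )
    return pd.DataFrame(rows, columns=columns)


# stand-in for the list returned by pd.read_html()
tables = [
    pd.DataFrame({"Rk": [1], "Squad": ["Arsenal"]}),
    make_squad_table([["Arsenal", 166], ["Aston Villa", 229]]),
    make_squad_table([["vs Arsenal", 211], ["vs Aston Villa", 183]]),
]

defensive_for = find_squad_table(tables, "Tackles")
defensive_against = find_squad_table(tables, "Tackles", opponents=True)
```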


### Fetching player stats

We can grab player stats for all of the top five European leagues in one go.

NB: we were unable to get the player table from the Premier League page, https://fbref.com/en/comps/9/stats/Premier-League-Stats, so we use the Big 5 page instead.
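
One commonly reported reason for a table going missing from a parsed page is that the site ships it inside an html comment, which parsers skip. Whether that is what happens on the page above has not been verified in this notebook, so treat the following as a sketch under that assumption. The html is a hand-written stand-in, and the `all_stats_standard` wrapper id is hypothetical:

```python
from bs4 import BeautifulSoup, Comment

# hand-written stand-in: the table only exists inside an html comment
html = """
<div id="all_stats_standard">
  <!--
  <table id="stats_standard">
    <tr><th>Player</th><th>MP</th></tr>
    <tr><th>Brenden Aaronson</th><td>11</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# a normal lookup fails because the table is not part of the parsed tree
visible_table = soup.find(id="stats_standard")

# collect every comment node, then parse each comment body as html
hidden_table = None
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    inner_soup = BeautifulSoup(comment, "html.parser")
    hidden_table = inner_soup.find(id="stats_standard")
    if hidden_table is not None:
        break

hidden_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in hidden_table.find_all("tr")
]
print(hidden_rows)
```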

In [15]:
top_5_stats_players = pd.read_html('https://fbref.com/en/comps/Big5/stats/players/Big-5-European-Leagues-Stats')
top_5_stats_players_df = top_5_stats_players[0]

In [16]:
top_5_stats_players_df.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Playing Time,Playing Time,...,Expected,Expected,Expected,Expected,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Unnamed: 33_level_0
Unnamed: 0_level_1,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,xG,npxG,xAG,npxG+xAG,xG,xAG,xG+xAG,npxG,npxG+xAG,Matches
0,1,Brenden Aaronson,us USA,MF,Leeds United,eng Premier League,22-005,2000,11,11,...,1.5,1.5,2.3,3.8,0.15,0.22,0.37,0.15,0.37,Matches
1,2,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,35-029,1987,12,12,...,0.6,0.6,0.1,0.7,0.05,0.01,0.06,0.05,0.06,Matches
2,3,Himad Abdelli,fr FRA,MF,Angers,fr Ligue 1,22-344,1999,5,1,...,0.2,0.2,0.3,0.5,0.12,0.19,0.31,0.12,0.31,Matches


Filter to players in the Premier League

In [17]:
prem_players_df = (
    flatten_cols(top_5_stats_players_df)
    .rename(columns={'Unnamed: 5_level_0_Comp':'Competition'})
    .query("Competition == 'eng Premier League'")
)

In [18]:
prem_players_df.head(3)

Unnamed: 0,Unnamed: 0_level_0_Rk,Unnamed: 1_level_0_Player,Unnamed: 2_level_0_Nation,Unnamed: 3_level_0_Pos,Unnamed: 4_level_0_Squad,Competition,Unnamed: 6_level_0_Age,Unnamed: 7_level_0_Born,Playing Time_MP,Playing Time_Starts,...,Expected_xG,Expected_npxG,Expected_xAG,Expected_npxG+xAG,Per 90 Minutes_xG,Per 90 Minutes_xAG,Per 90 Minutes_xG+xAG,Per 90 Minutes_npxG,Per 90 Minutes_npxG+xAG,Unnamed: 33_level_0_Matches
0,1,Brenden Aaronson,us USA,MF,Leeds United,eng Premier League,22-005,2000,11,11,...,1.5,1.5,2.3,3.8,0.15,0.22,0.37,0.15,0.37,Matches
11,12,Che Adams,sct SCO,FW,Southampton,eng Premier League,26-106,1996,11,9,...,2.5,2.5,1.2,3.7,0.27,0.14,0.41,0.27,0.41,Matches
12,13,Tyler Adams,us USA,MF,Leeds United,eng Premier League,23-255,1999,10,10,...,0.0,0.0,0.8,0.9,0.0,0.08,0.09,0.0,0.09,Matches
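
The flattened names above still carry the `Unnamed: N_level_0_` prefix that pandas generates for blank header cells, and the rename only fixes `Comp`. A minimal sketch of stripping that prefix from every column at once; `clean_flat_columns` is a hypothetical helper, and the example frame is a hand-built stand-in for the flattened player table:

```python
import re

import pandas as pd

# matches the prefix pandas generates for blank top-level header cells
UNNAMED_PREFIX = re.compile(r"^Unnamed: \d+_level_\d+_")


def clean_flat_columns(df):
    """Return a copy of df with 'Unnamed: N_level_M_' prefixes removed."""
    cleaned = df.copy()
    cleaned.columns = [UNNAMED_PREFIX.sub("", str(col)) for col in cleaned.columns]
    return cleaned


example = pd.DataFrame(
    [[1, "Brenden Aaronson", "eng Premier League", 11]],
    columns=[
        "Unnamed: 0_level_0_Rk",
        "Unnamed: 1_level_0_Player",
        "Unnamed: 5_level_0_Comp",
        "Playing Time_MP",
    ],
)

cleaned = clean_flat_columns(example)
print(list(cleaned.columns))
```

Columns that belong to a real group, such as `Playing Time_MP`, keep their prefix, which avoids clashes between names like `Expected_xG` and `Per 90 Minutes_xG`.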


### Fetching player stats (per team)

In [19]:
arsenal_stats = pd.read_html(f'https://fbref.com/en/squads/18bb7c10/{season}-{season+1}/Arsenal-Stats')

arsenal_player_stats = arsenal_stats[0]

In [20]:
arsenal_player_stats.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Playing Time,Playing Time,Playing Time,Performance,Performance,...,Expected,Expected,Expected,Expected,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Unnamed: 29_level_0
Unnamed: 0_level_1,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,xG,npxG,xAG,npxG+xAG,xG,xAG,xG+xAG,npxG,npxG+xAG,Matches
0,Gabriel Dos Santos,br BRA,DF,24-312,11,11,990.0,11.0,1.0,0.0,...,1.3,1.3,0.2,1.5,0.12,0.01,0.14,0.12,0.14,Matches
1,Aaron Ramsdale,eng ENG,GK,24-166,11,11,990.0,11.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Matches
2,William Saliba,fr FRA,DF,21-217,11,11,990.0,11.0,2.0,1.0,...,0.3,0.3,0.8,1.2,0.03,0.07,0.11,0.03,0.11,Matches


### Fetching team fixtures and results

In [21]:
arsenal_fixtures_results = arsenal_stats[1]

In [22]:
arsenal_fixtures_results.head(3)

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2022-08-05,20:00,Premier League,Matchweek 1,Fri,Away,W,2.0,0.0,Crystal Palace,1.0,1.2,44.0,25286.0,Martin Ødegaard,4-3-3,Anthony Taylor,Match Report,
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4.0,2.0,Leicester City,2.7,0.5,50.0,60033.0,Martin Ødegaard,4-2-3-1,Darren England,Match Report,
2,2022-08-20,17:30,Premier League,Matchweek 3,Sat,Away,W,3.0,0.0,Bournemouth,1.3,0.3,57.0,10423.0,Martin Ødegaard,4-2-3-1,Craig Pawson,Match Report,


# Method 2: Beautiful soup

Helpful links:
* Stack Overflow question on extracting table headers and data from Basketball Reference: https://stackoverflow.com/questions/57049359/how-to-get-table-headers-and-table-data-from-a-table-row-together-in-list
* Example notebook that extracts FBref data with BeautifulSoup: https://github.com/socrstats/Medium_Football/blob/main/English%20Premier%20League%20Dendrogram.ipynb

--------------------------

The general approach is to fetch the page of interest with requests, then use BeautifulSoup to parse the html and pull out the table we want in the format we want.
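
The parsing half of that approach can be tried without any network access. The sketch below runs BeautifulSoup over a hand-written html string in place of `page.content`; the `demo_table` id is made up for illustration. Unlike `html_table_to_list_of_rows`, it collects `th` and `td` cells in document order in a single `find_all` call:

```python
from bs4 import BeautifulSoup

# hand-written stand-in for the html that requests.get(...).content would return
html = """
<table id="demo_table">
  <tr><th>Squad</th><th>MP</th><th>Pts</th></tr>
  <tr><th>Arsenal</th><td>11</td><td>28</td></tr>
  <tr><th>Manchester City</th><td>11</td><td>26</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.find returns None when the id is missing, so check before using it
table = soup.find(id="demo_table")
if table is None:
    raise LookupError("table not found; the id may have changed")

# one list of cell text per table row, header cells included
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
header, body = rows[0], rows[1:]
print(header)
print(body)
```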

### Fetching league table

In [23]:
prem_league_table_df = get_prem_league_table(season)
prem_league_table_df.head(3)

Unnamed: 0,Rk,Squad,MP,W,D,L,GF,GA,GD,Pts,Pts/MP,xG,xGA,xGD,xGD/90,Last 5,Attendance,Top Team Scorer,Goalkeeper,Notes
0,1,Arsenal,11,9,1,1,25,11,14,28,2.55,19.4,10.4,9.0,0.82,W W W W D,60109,Gabriel Jesus - 5,Aaron Ramsdale,
1,2,Manchester City,11,8,2,1,36,11,25,26,2.36,23.0,8.2,14.8,1.35,W W W L W,53340,Erling Haaland - 17,Ederson,
2,3,Tottenham,12,7,2,3,23,14,9,23,1.92,19.2,13.9,5.3,0.44,L W W L L,61610,Harry Kane - 10,Hugo Lloris,


### Fetch team stats

In [7]:
prem_team_stats = get_team_stats(season)
prem_team_stats.head(3)

Unnamed: 0,Squad,# Pl,Age,Poss,MP,Starts,Min,90s,Gls,Ast,G-PK,PK,PKatt,CrdY,CrdR,Gls.1,Ast.1,G+A,G-PK.1,G+A-PK,xG,npxG,xAG,npxG+xAG,xG.1,xAG.1,xG+xAG,npxG.1,npxG+xAG.1
0,Arsenal,23,24.7,57.7,12,132,1080,12.0,29,22,28,1,1,19,0,2.42,1.83,4.25,2.33,4.17,21.8,21.4,15.0,36.3,1.82,1.25,3.07,1.78,3.03
1,Aston Villa,22,27.4,48.8,13,143,1170,13.0,11,6,10,1,1,27,1,0.85,0.46,1.31,0.77,1.23,14.9,14.2,10.4,24.6,1.15,0.8,1.95,1.09,1.89
2,Bournemouth,25,26.8,38.6,13,143,1170,13.0,12,10,12,0,0,20,0,0.92,0.77,1.69,0.92,1.69,7.5,7.5,5.9,13.4,0.58,0.45,1.03,0.58,1.03


### Fetch player stats

NB: it appears that not all rows are being captured

In [25]:
big5_player_df = get_big5_player_stats(season)
big5_player_df.head(3)

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,xG,npxG,xAG,npxG+xAG,xG.1,xAG.1,xG+xAG,npxG.1,npxG+xAG.1,Matches
0,1,Brenden Aaronson,us USA,MF,Leeds United,eng Premier League,22-005,2000,11,11,...,1.5,1.5,2.3,3.8,0.15,0.22,0.37,0.15,0.37,Matches
1,2,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,35-029,1987,12,12,...,0.6,0.6,0.1,0.7,0.05,0.01,0.06,0.05,0.06,Matches
2,3,Himad Abdelli,fr FRA,MF,Angers,fr Ligue 1,22-344,1999,5,1,...,0.2,0.2,0.3,0.5,0.12,0.19,0.31,0.12,0.31,Matches


# Conclusion

In this notebook we have shown two ways to extract data from FBref. pandas .read_html() does most of the heavy lifting and gets the data we want, although the result needs cleaning up, particularly the column names. BeautifulSoup is a workable alternative, but it requires a separate function for each table we are interested in.

NB: we were unable to get player data from the Premier League stats page, but worked around this by grabbing player data from the Big 5 competition page

### Next steps:
* Create cleaning functions to get the data into the desired format
* Add validation checks to confirm we are getting the amount and quality of data we expect
* Use this data in other projects
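
As a starting point for the first two steps, the sketch below cleans and checks a scraped player table. It rests on two assumptions that should be confirmed against real output: that long tables repeat their header row part-way through the body, and that scraped values arrive as strings. `drop_repeated_headers` and `to_numeric_columns` are hypothetical helpers, and `raw` is a hand-built stand-in:

```python
import pandas as pd


def drop_repeated_headers(df, key_col="Rk"):
    """Drop rows that just repeat the header, i.e. where key_col equals its own name."""
    is_header_row = df[key_col].astype(str) == key_col
    return df.loc[~is_header_row].reset_index(drop=True)


def to_numeric_columns(df, columns):
    """Return a copy with the given columns converted to numbers (bad values become NaN)."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# stand-in for a scraped table: string values plus one repeated header row
raw = pd.DataFrame(
    [
        ["1", "Brenden Aaronson", "11"],
        ["Rk", "Player", "MP"],
        ["2", "Yunis Abdelhamid", "12"],
    ],
    columns=["Rk", "Player", "MP"],
)

clean = to_numeric_columns(drop_repeated_headers(raw), ["Rk", "MP"])

# simple validation checks on quantity and quality
assert clean["Rk"].is_unique, "duplicate rank values"
assert clean["MP"].notna().all(), "non-numeric matches played"
print(clean)
```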