<center><h1><font size=6>Scraping Historic EPL Match Data</font></h1></center>

This notebook scrapes results and match statistics for historic Premier League football matches from [this EPL data page](https://fbref.com/en/comps/9/Premier-League-Stats). Data is collected from the 2017-18 season onwards because I want to use the Expected Goals (xG) statistic, which was only introduced for the EPL in that season.

### Load libraries and set up notebook configuration

In [54]:
team_url

'https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats'

In [7]:
# import packages
import pandas as pd 
import numpy as np
import os
from pathlib import Path  #for Windows/Linux compatibility
import requests
from bs4 import BeautifulSoup


# set pandas configurations
pd.set_option("display.precision", 2) # display floats to 2 decimal places
pd.set_option("display.max_columns", None) # display all columns so we can view the whole dataset


# set directories
os.chdir('..') # change current working directory to the parent directory to help access files/directories at a higher level
DATAPATH = Path(r'data') # set data path

In [47]:
# define URL link for the latest season
season_url = "https://fbref.com/en/comps/9/Premier-League-Stats" # define URL for the season's page

In [48]:
# collect the website links for each squad in the premier league for the given season
season_response = requests.get(season_url) # send GET request to URL and store response
season_request_text = BeautifulSoup(season_response.text, "html.parser") # parse the HTML response (parser named explicitly to avoid the bs4 warning)
season_standings_table = season_request_text.select('table.stats_table')[0] # collect the league table (first stats table on the page)
season_all_links = [l.get("href") for l in season_standings_table.find_all('a')] # find all links in the table and extract their href attributes
season_team_links = [l for l in season_all_links if '/squads/' in l] # keep only the links that contain '/squads/' in their URLs
season_team_urls = [f"https://fbref.com{l}" for l in season_team_links] # complete the URLs by prepending the site's base URL

# point season_url at the previous season, ready for the next iteration of the loop
previous_season_href = season_request_text.select("a.prev")[0].get("href") # collect the href link for the previous season
season_url = f"https://fbref.com{previous_season_href}" # complete the previous season's URL; used once data on all teams in this season are collected
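The two steps in this cell (collect the squad links, then step back one season) can be wrapped in a loop. The following is only a sketch of how that might look: `parse_season_page`, `collect_team_urls`, `n_seasons` and the injected `fetch` callable are hypothetical names that do not appear elsewhere in this notebook, and the selectors are simply the ones used in the cell above.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://fbref.com"


def parse_season_page(html):
    """Return (team_urls, previous_season_url) for one season page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    standings = soup.select("table.stats_table")[0]  # league table, as in the cell above
    hrefs = [a.get("href") for a in standings.find_all("a")]
    team_urls = [f"{BASE_URL}{h}" for h in hrefs if h and "/squads/" in h]
    prev_links = soup.select("a.prev")  # link to the previous season, if present
    previous_url = f"{BASE_URL}{prev_links[0].get('href')}" if prev_links else None
    return team_urls, previous_url


def collect_team_urls(start_url, n_seasons, fetch):
    """Walk back up to n_seasons seasons, mapping each season URL to its team URLs.

    fetch is any callable that takes a URL and returns the page HTML as text
    (hypothetical parameter, so the loop can be tried without network access).
    """
    urls_by_season = {}
    season_url = start_url
    for _ in range(n_seasons):
        if season_url is None:  # no earlier season was linked
            break
        team_urls, previous_url = parse_season_page(fetch(season_url))
        urls_by_season[season_url] = team_urls
        season_url = previous_url  # set up the next iteration
    return urls_by_season
```

In the notebook this could be called as `collect_team_urls(season_url, 2, lambda url: requests.get(url).text)`. Passing `fetch` in, rather than calling `requests.get` inside the loop, also leaves room to add a pause between requests later.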

In [53]:
# collect data from team url
team_url = season_team_urls[0] # turn this into loop later on
team_df_list = pd.read_html(team_url) # download and parse the page once; returns a list of data frames
standard_stats = team_df_list[0] # first table on the page
scores_and_fixtures = team_df_list[1] # second table on the page
# and so on - make this a loop
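One way the "make this a loop" note might be realised is to pair each table returned by `pd.read_html` with a name. This is a sketch under assumptions: `name_tables` and `TABLE_NAMES` are hypothetical, only the first two names come from the cell above, and the order of tables on a team page is assumed rather than checked.

```python
import pandas as pd

# Assumed order of the first tables on a team page; only these two names
# appear in the notebook, anything further is unknown here.
TABLE_NAMES = ["standard_stats", "scores_and_fixtures"]


def name_tables(tables, names=TABLE_NAMES):
    """Map each DataFrame in tables to a name; unnamed extras become table_<i>."""
    named = {}
    for i, table in enumerate(tables):
        key = names[i] if i < len(names) else f"table_{i}"
        named[key] = table
    return named
```

For example, `team_tables = name_tables(team_df_list)` would give `team_tables["scores_and_fixtures"]` without calling `pd.read_html` again.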

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2022-07-30,17:00,Community Shield,FA Community Shield,Sat,Neutral,L,1,3,Liverpool,,,57,,Rúben Dias,4-3-3,Craig Pawson,Match Report,
1,2022-08-07,16:30,Premier League,Matchweek 1,Sun,Away,W,2,0,West Ham,2.2,0.5,75,62443.0,İlkay Gündoğan,4-3-3,Michael Oliver,Match Report,
2,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,0,Bournemouth,1.7,0.1,67,53453.0,İlkay Gündoğan,4-2-3-1,David Coote,Match Report,
3,2022-08-21,16:30,Premier League,Matchweek 3,Sun,Away,D,3,3,Newcastle Utd,2.1,1.8,69,52258.0,İlkay Gündoğan,4-3-3,Jarred Gillett,Match Report,
4,2022-08-27,15:00,Premier League,Matchweek 4,Sat,Home,W,4,2,Crystal Palace,2.2,0.1,74,53112.0,Kevin De Bruyne,4-2-3-1,Darren England,Match Report,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,2023-05-21,16:00,Premier League,Matchweek 37,Sun,Home,W,1,0,Chelsea,1.2,1.2,64,53490.0,Kyle Walker,3-4-3◆,Michael Oliver,Match Report,
57,2023-05-24,20:00,Premier League,Matchweek 32,Wed,Away,D,1,1,Brighton,1.8,2.2,60,31388.0,İlkay Gündoğan,4-3-3,Simon Hooper,Match Report,
58,2023-05-28,16:30,Premier League,Matchweek 38,Sun,Away,L,0,1,Brentford,1.6,1.3,65,17120.0,Kyle Walker,3-2-4-1,John Brooks,Match Report,
59,2023-06-03,15:00,FA Cup,Final,Sat,Neutral,W,2,1,Manchester Utd,,,60,83179.0,İlkay Gündoğan,3-2-4-1,Paul Tierney,Match Report,


Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report
17:00,Community Shield,FA Community Shield,Sat,Neutral,L,1,3,Liverpool,,,57,,Rúben Dias,4-3-3,Craig Pawson,Match Report
16:30,Premier League,Matchweek 1,Sun,Away,W,2,0,West Ham,2.2,0.5,75,62443,İlkay Gündoğan,4-3-3,Michael Oliver,Match Report
15:00,Premier League,Matchweek 2,Sat,Home,W,4,0,Bournemouth,1.7,0.1,67,53453,İlkay Gündoğan,4-2-3-1,David Coote,Match Report
16:30,Premier League,Matchweek 3,Sun,Away,D,3,3,Newcastle Utd,2.1,1.8,69,52258,İlkay Gündoğan,4-3-3,Jarred Gillett,Match Report
15:00,Premier League,Matchweek 4,Sat,Home,W,4,2,Crystal Palace,2.2,0.1,74,53112,Kevin De Bruyne,4-2-3-1,Darren England,Match Report
19:30,Premier League,Matchweek 5,Wed,Home,W,6,0,Nott'ham Forest,3.3,0.7,74,53409,İlkay Gündoğan,4-2-3-1,Paul Tierney,Match Report
17:30,Premier League,Matchweek 6,Sat,Away,D,1,1,Aston Villa,2.1,0.3,71,41830,İlkay Gündoğan,4-3-3,Simon Hooper,Match Report
21:00,Champions Lg,Group stage,Tue,Away,W,4,0,es Sevilla,3.6,0.3,61,38764,Kevin De Bruyne,4-3-3,,


In [None]:
        
from io import StringIO # wrap HTML text so pd.read_html treats it as a file-like object

team_data = requests.get(team_url) # send GET request to the team's URL

## GET BASIC MATCH DATA FROM TEAM URL

matches = pd.read_html(StringIO(team_data.text), match="Scores & Fixtures")[0] # read the matches data from the scores and fixtures table
matches.columns = matches.columns.str.lower().str.replace(' ', '_') # make column names lower case and replace spaces with underscores

## GET SHOOTING DATA FROM TEAM URL

soup = BeautifulSoup(team_data.text, "html.parser") # parse the HTML returned for team_url
links = [l.get("href") for l in soup.find_all('a')] # collect every link on the page
shooting_links = [l for l in links if l and 'all_comps/shooting/' in l] # keep the links that point to the shooting pages
all_shooting_data = pd.read_html(f"https://fbref.com{shooting_links[0]}") # read all the tables from the shooting URL into a list of data frames

shooting_data_for = all_shooting_data[0] # collect the 'for' data (i.e., for the team in question)
shooting_data_for.columns = shooting_data_for.columns.droplevel() # drop the top header level above the standard columns
shooting_data_for.columns = shooting_data_for.columns.str.lower().str.replace(' ', '_') # make column names lower case and replace spaces with underscores

shooting_data_against = all_shooting_data[-1] # collect the 'against' data (last table on the page)
shooting_data_against.columns = shooting_data_against.columns.droplevel() # drop the top header level above the standard columns
shooting_data_against.columns = shooting_data_against.columns.str.lower().str.replace(' ', '_') # make column names lower case and replace spaces with underscores

# suffix the shooting columns with '_a' so we know this is 'against' data
shooting_data_against.rename(columns={'gls': 'gls_a',
                                      'sh': 'sh_a',
                                      'sot': 'sot_a',
                                      'sot%': 'sot%_a',
                                      'g/sh': 'g/sh_a',
                                      'g/sot': 'g/sot_a',
                                      'dist': 'dist_a',
                                      'fk': 'fk_a',
                                      'pk': 'pk_a',
                                      'pkatt': 'pkatt_a',
                                      'xg': 'xg_a',
                                      'npxg': 'npxg_a',
                                      'g-xg': 'g-xg_a',
                                      'np:g-xg': 'np:g-xg_a'}, inplace=True)
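The same column clean-up is applied to the 'for' and 'against' tables, so it could be factored into one helper. This is a sketch, not the notebook's own method: `clean_columns`, `suffix` and `keep` are hypothetical names, and it swaps the hand-written rename dictionary for a generated one that suffixes every column not listed in `keep`.

```python
import pandas as pd


def clean_columns(df, suffix="", keep=()):
    """Return a copy of df with tidy column names.

    Drops the top header level if the columns are a MultiIndex, lower-cases
    the names, replaces spaces with underscores, and appends suffix to every
    column whose cleaned name is not in keep.
    """
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        out.columns = out.columns.droplevel(0)  # same step as droplevel() above
    out.columns = out.columns.str.lower().str.replace(" ", "_", regex=False)
    if suffix:
        rename_map = {c: f"{c}{suffix}" for c in out.columns if c not in keep}
        out = out.rename(columns=rename_map)
    return out
```

For instance, `clean_columns(all_shooting_data[-1], suffix="_a", keep=("date",))` would suffix the shooting statistics while leaving `date` untouched. Which columns belong in `keep` depends on what the tables are later joined on, which this sketch does not assume.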