<center><h1><font size=6> Scraping Historic EPL Match Data </h1></center>

This notebook scrapes data on the results and match statistics of historic Premier League football matches from [this EPL data page](https://fbref.com/en/comps/9/Premier-League-Stats). The data is collected from the 2017-18 season onwards as I want to make use of the Expected Goals statistic, which was only introduced for the EPL in the 2017-18 season.

### Load libraries and setup notebook configuration

In [29]:
# import packages
import pandas as pd 
import numpy as np
import os
from pathlib import Path
import requests
from bs4 import BeautifulSoup
import time


# set pandas configurations
pd.set_option("display.precision", 2) # display to 1 decimpal place
pd.set_option("display.max.columns", None) # display all columns so we can view the whole dataset


# set directories
os.chdir('..') # change current working directory to the parent directory to help access files/directories at a higher level
DATAPATH = Path(r'data') # set data path


# import from source directory
from src import constants

### Defining inputs to the scraping process

In [7]:
# create an empty object to store the data
all_matches = []


# create a today's date object to keep a record of when data was downloaded
todays_date = time.strftime("%Y-%m-%d")


# define URL link for the latest season
season_url = "https://fbref.com/en/comps/9/Premier-League-Stats" # define URL for the season's page


# creates a list of years (seasons) to repeat the data scraping loop over
years = list(range(constants.LATEST_SEASON, 2016, -1)) # we only include data post-2017 because crucual predictors like expected goals are not available before then
years

[2022, 2021, 2020, 2019, 2018, 2017]

### Scraping

The below code runs a loop to go through each season (as specified above) and:
1. Collect the names and URL links of all the teams that played in the EPL in that season.
2. From each team's link, scrape their basic match data including result, goals, expected goals, possession, and other bits of info.

In [3]:
for year in years:
    
    # wrap code in a while loop that keeps trying until the scraping is successful
    while True:
        try:
    
            # collect the website links for each squad in the premier league for the latest season as defined above
            season_response = requests.get(season_url) # send GET request to URL and store response
            season_request_text = BeautifulSoup(season_response.text) # get the text content by parsing the HTML response
            season_standings_table = season_request_text.select('table.stats_table')[0] # collect the league table from the text content
            season_all_links =  [l.get("href") for l in season_standings_table.find_all('a')] # find all links from the table and extract href objects
            season_team_links = [l for l in season_all_links if '/squads/' in l] # collect all links that contain '/squads' in their URLs
            season_team_urls = [f"https://fbref.com{l}" for l in season_team_links] # complete the URLs by adding website opening text on the front


            # collect the URL for the previous season and set it to the new season URL to set up next stage of the loop
            previous_season_href = season_request_text.select("a.prev")[0].get("href") # collect the href link for the previous season
            season_url = f"https://fbref.com{previous_season_href}" # complete the previous season url to set up next stage of the loop once data on all teams are collected


            # for each URL in the collected team URLs, scrape match data
            for team_url in season_team_urls:

                team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ") # obtain the team name from the URL
                team_df_list = pd.read_html(team_url) # collect list of tables from team web page
                matches = pd.read_html(team_url)[1] # collect the 2nd table which contains the match data
                matches["season"] = year # create a new column to add the season
                matches["team"] = team_name # create a new column to add the team
                matches["date_downloaded"] = todays_date # create a new column to keep track of when data was downloaded

                all_matches.append(matches) # append team data to dataframe

                print(f"Data for {team_name} in year {year} successfully collected.")
                
                time.sleep(5) # rest before moving on
            
            break # exit the while loop if the scraping is successful
        
        except Exception as e:
            print("Error occurred:", e)
            print("Retrying after 5 seconds...")
            time.sleep(5)
        

matches_df = pd.concat(all_matches) # concatinate into a dataframe
matches_df.to_csv(f"{DATAPATH}/raw/matches_raw.csv") # store this raw data as csv in the local data file

Data for Manchester City in year 2022 successfully colected.
Data for Arsenal in year 2022 successfully colected.


KeyboardInterrupt: 

### Repeat another scraping process that scrapes more historic data (but not xG)

In [8]:
# create an empty object to store the data
all_matches_long = []


# define URL link for the latest season
season_url = "https://fbref.com/en/comps/9/Premier-League-Stats" # define URL for the season's page


# creates a list of years (seasons) to repeat the data scraping loop over
years = list(range(constants.LATEST_SEASON, 1991, -1)) # we only include data post-2017 because crucual predictors like expected goals are not available before then

In [10]:
for year in years:
    
    # wrap code in a while loop that keeps trying until the scraping is successful
    while True:
        try:
    
            # collect the website links for each squad in the premier league for the latest season as defined above
            season_response = requests.get(season_url) # send GET request to URL and store response
            season_request_text = BeautifulSoup(season_response.text) # get the text content by parsing the HTML response
            season_standings_table = season_request_text.select('table.stats_table')[0] # collect the league table from the text content
            season_all_links =  [l.get("href") for l in season_standings_table.find_all('a')] # find all links from the table and extract href objects
            season_team_links = [l for l in season_all_links if '/squads/' in l] # collect all links that contain '/squads' in their URLs
            season_team_urls = [f"https://fbref.com{l}" for l in season_team_links] # complete the URLs by adding website opening text on the front


            # collect the URL for the previous season and set it to the new season URL to set up next stage of the loop
            previous_season_href = season_request_text.select("a.prev")[0].get("href") # collect the href link for the previous season
            season_url = f"https://fbref.com{previous_season_href}" # complete the previous season url to set up next stage of the loop once data on all teams are collected


            # for each URL in the collected team URLs, scrape match data
            for team_url in season_team_urls:

                team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ") # obtain the team name from the URL
                team_df_list = pd.read_html(team_url) # collect list of tables from team web page
                matches = pd.read_html(team_url)[1] # collect the 2nd table which contains the match data
                matches["season"] = year # create a new column to add the season
                matches["team"] = team_name # create a new column to add the team
                matches["date_downloaded"] = todays_date # create a new column to keep track of when data was downloaded

                all_matches_long.append(matches) # append team data to dataframe

                print(f"Data for {team_name} in year {year} successfully collected.")
                
                time.sleep(5) # rest before moving on
            
            break # exit the while loop if the scraping is successful
        
        except Exception as e:
            print("Error occurred:", e)
            print("Retrying after 5 seconds...")
            time.sleep(5)
        

matches_df_long = pd.concat(all_matches_long) # concatinate into a dataframe
matches_df_long.to_csv(f"{DATAPATH}/raw/matches_long_raw.csv", index=False) # store this raw data as csv in the local data file

Data for Manchester City in year 2022 successfully collected.
Data for Liverpool in year 2022 successfully collected.
Data for Chelsea in year 2022 successfully collected.
Data for Tottenham Hotspur in year 2022 successfully collected.
Data for Arsenal in year 2022 successfully collected.
Data for Manchester United in year 2022 successfully collected.
Data for West Ham United in year 2022 successfully collected.
Data for Leicester City in year 2022 successfully collected.
Data for Brighton and Hove Albion in year 2022 successfully collected.
Data for Wolverhampton Wanderers in year 2022 successfully collected.
Data for Newcastle United in year 2022 successfully collected.
Data for Crystal Palace in year 2022 successfully collected.
Data for Brentford in year 2022 successfully collected.
Data for Aston Villa in year 2022 successfully collected.
Data for Southampton in year 2022 successfully collected.
Data for Everton in year 2022 successfully collected.
Data for Leeds United in year 20

OSError: Cannot save file into a non-existent directory: 'data\raw'

In [33]:
#matches_df_long.to_csv("Documents/L_and_D/1.DS_Learning/EPL-Prediction/data/raw/matches_long_raw.csv")