**Web-scraping football game data from the English Premier League** 

The first section of code extracts data on Premier League matches across multiple seasons in three broad steps. For each season, the code does the following:

1. Go to [this EPL data page](https://fbref.com/en/comps/9/Premier-League-Stats), find the Premier League table for that season and extract the URL for every club in the league.

2. For each team in the league in the given season, open the team's page and:

    *2a) Scrape **basic match data** from the 'Scores & Fixtures' table. This data includes the date, matchweek, opponent, venue, score, possession and expected goals for and against.*
    
    *2b) Scrape **shooting data** from the 'Shooting' table. This provides us with useful information on shots and shots on target, both for and against each team.*
    
3. Once this has been repeated for every team in the Premier League in every season, all of the above information is concatenated into a single dataframe.
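
Step 1 can be tried offline before any request is sent. The sketch below applies the same link filtering and team-name extraction to a small, made-up HTML fragment. The `sample_html` string and the club paths in it are illustrative assumptions, not content copied from the site, and `html.parser` is chosen here only so the example needs no extra parser library.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a standings page (not real site content)
sample_html = """
<table class="stats_table">
  <tr><td><a href="/en/squads/aaa111/Example-City-Stats">Example City</a></td></tr>
  <tr><td><a href="/en/players/bbb222/Some-Player">Some Player</a></td></tr>
  <tr><td><a href="/en/squads/ccc333/Sample-United-Stats">Sample United</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
standings_table = soup.select("table.stats_table")[0]

# keep only the hrefs that point at squad pages
links = [a.get("href") for a in standings_table.find_all("a")]
links = [l for l in links if l and "/squads/" in l]
team_urls = [f"https://fbref.com{l}" for l in links]

# derive a readable team name from the last URL segment
team_names = [u.split("/")[-1].replace("-Stats", "").replace("-", " ") for u in team_urls]
print(team_names)
```

The player link is filtered out because its path does not contain `/squads/`, which is the same test the scraping loop relies on.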

The second section of this code follows a similar but simpler process: it visits the same website and extracts the final Premier League standings table for each season. This gives our model a useful feature, because a team's league position in the previous season indicates its relative strength.

In [1]:
# import packages
import requests
import pandas as pd 
import numpy as np
from bs4 import BeautifulSoup 
import csv
from datetime import datetime
import time

# configure pandas to display 2 decimal places
pd.set_option("display.precision", 2)

# configure pandas to display all columns so we can view the whole dataset
pd.set_option("display.max_columns", None)

In [2]:
# creates a list of years (seasons) to repeat the data scraping loop over
years = list(range(2022, 2016, -1)) # we only include seasons from 2017 onwards because crucial predictors like expected goals are not available before then
years

[2022, 2021, 2020, 2019, 2018, 2017]

In [3]:
# create an empty list to store one dataframe per team and season
all_matches = []

In [4]:
# define the URL with the premier league table from the latest season (2022-23)
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [5]:
for year in years:
    
    standings_data = requests.get(standings_url) # send GET request to the standing URL
    
    ## COLLECT URLS FOR EACH INDIVIDUAL TEAM
                        
    soup = BeautifulSoup(standings_data.text) # use BeautifulSoup to parse the text object received from the URL
    
    standings_table = soup.select('table.stats_table')[0] # select the stats table object from the parsed object

    links = [l.get("href") for l in standings_table.find_all('a')] # find all objects starting with '<a' and extract href objects
    
    links = [l for l in links if '/squads/' in l] # collect all links that contain '/squads' in their URLs
    
    team_urls = [f"https://fbref.com{l}" for l in links] # complete the URLs by adding website opening text on the front
    
    previous_season = soup.select("a.prev")[0].get("href") # collect the href link for the previous season
    
    standings_url = f"https://fbref.com{previous_season}" # complete the previous season url to set up next stage of the loop once data on all teams are collected
    
    for team_url in team_urls: # for each URL in the team URLs
        
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ") # obtain the team name from the URL
        
        team_data = requests.get(team_url) # send GET request to the team's URL
        
        ## GET BASIC MATCH DATA FROM TEAM URL
        
        matches = pd.read_html(team_data.text, match="Scores & Fixtures")[0] # read the matches data from the scores and fixtures table
        
        matches.columns = map(str.lower, matches.columns) # make columns lower case
        
        matches.columns = matches.columns.str.replace(' ', '_') # replace spaces in column names
        
        ## GET SHOOTING DATA FROM TEAM URL
        
        soup = BeautifulSoup(team_data.text) # parse the team_url GET request
        
        links = [l.get("href") for l in soup.find_all('a')] # collect the links
        
        shooting_links = [l for l in links if l and 'all_comps/shooting/' in l] # collect the links that contain the shooting variables
        
        all_shooting_data = pd.read_html(f"https://fbref.com{shooting_links[0]}") # read all the tables from the shooting URL into a list of pandas dataframes
        
        shooting_data_for = all_shooting_data[0] # Collect the 'for' data (i.e., for the team in question)
        
        shooting_data_for.columns = shooting_data_for.columns.droplevel() # now drop the top level above the standard columns
              
        shooting_data_for.columns = map(str.lower, shooting_data_for.columns) # make all the column names lower case       
        
        shooting_data_for.columns = shooting_data_for.columns.str.replace(' ', '_') # replace any spaces with an underscore
         
        shooting_data_against = all_shooting_data[-1] # Collect the 'against' data
               
        shooting_data_against.columns = shooting_data_against.columns.droplevel() # now drop the top level above the standard columns
               
        shooting_data_against.columns = map(str.lower, shooting_data_against.columns) # make all the column names lower case       
        
        shooting_data_against.columns = shooting_data_against.columns.str.replace(' ', '_') # replace any spaces with an underscore
                
        shooting_data_against.rename(columns={'gls':'gls_a',
                              'sh':'sh_a',
                             'sot':'sot_a',
                             'sot%':'sot%_a',
                             'g/sh':'g/sh_a',
                             'g/sot':'g/sot_a',
                             'dist':'dist_a',
                             'fk':'fk_a',
                             'pk':'pk_a',
                             'pkatt':'pkatt_a',
                             'xg':'xg_a',
                             'npxg':'npxg_a',
                             'g-xg':'g-xg_a',
                             'np:g-xg':'np:g-xg_a',}, inplace=True) # replace all column names so we know this is 'against' data           
        
        try: # attempt the merges; skip this team if they fail
            
            # Merge all dataframes together selecting the relevant columns
            matches_sdf = matches.merge(shooting_data_for[["date", "sh", "sot"]], on="date")
            team_data = matches_sdf.merge(shooting_data_against[["date", "sh_a", "sot_a"]], on="date")
            
        except ValueError: # if a merge raises a ValueError, skip this team
            
            continue # move on to the next team in the loop
        
        team_data["season"] = year # create a new column to add the season
        
        team_data["team"] = team_name # create a new column to add the team
        
        all_matches.append(team_data) # append this team's dataframe to the list
        
        print(year, team_name)
        
        time.sleep(1) # rest for a second before moving on
        

match_df = pd.concat(all_matches) # concatenate the list into a single dataframe

match_df.rename(columns={'ga':'g_a',
                         'gf':'g',
                         'xga': 'xg_a'}, inplace=True) # rename columns for consistency

match_df.to_csv("../data/raw_EPL_data.csv") # store this raw data as csv

IndexError: list index out of range
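
The column clean-up and merge inside the scraping loop above can be checked on a small example without touching the network. This is a minimal sketch, assuming a two-level header on the shooting table, as the `droplevel()` calls in the loop imply. The helper name `clean_columns` and all values in the toy frames are hypothetical.

```python
import pandas as pd

def clean_columns(df):
    """Return a copy with a flat, lower-case, underscore-separated header."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        out.columns = out.columns.droplevel(0)  # drop the top header level
    out.columns = [str(c).lower().replace(" ", "_") for c in out.columns]
    return out

# toy 'Scores & Fixtures' table with a single-level header (made-up values)
matches = clean_columns(pd.DataFrame({
    "Date": ["2022-08-06", "2022-08-13"],
    "Opponent": ["Example City", "Sample United"],
    "GF": [2, 0],
}))

# toy shooting table with a two-level header (made-up values)
shooting = pd.DataFrame(
    [["2022-08-06", 14, 5], ["2022-08-13", 9, 2]],
    columns=pd.MultiIndex.from_tuples(
        [("For Team", "Date"), ("Standard", "Sh"), ("Standard", "SoT")]
    ),
)
shooting = clean_columns(shooting)

# inner merge on date, as in the loop above
merged = matches.merge(shooting[["date", "sh", "sot"]], on="date")
print(merged)
```

Because the merge is an inner join, a match date missing from either table is silently dropped, so comparing `len(merged)` with `len(matches)` is a cheap sanity check.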

**Collect data on PL table position and points per game from previous seasons as an indication of team strength**

One important aspect of prediction will be to tell how strong each team and opponent are. A proxy for this is their points per game across the previous season. We therefore need to collect data from the final standings of each PL table (including one for the season before our first season of full data).

This section of code collects this data from the same website.
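
To make the intended use concrete, the sketch below attaches each team's points per game from the previous season to its matches. It assumes that `season` labels the year in which a season starts and that team names match exactly between the two tables (the scraped names may need harmonising first). The toy values and the `prev_pts_per_game` column name are hypothetical.

```python
import pandas as pd

# toy final standings (made-up values), one row per team per season
pl_tables = pd.DataFrame({
    "season": [2021, 2021, 2022, 2022],
    "squad": ["Example City", "Sample United", "Example City", "Sample United"],
    "pts/mp": [2.10, 1.20, 1.90, 1.50],
})

# toy match rows (made-up values)
matches = pd.DataFrame({
    "season": [2022, 2022],
    "team": ["Example City", "Sample United"],
    "opponent": ["Sample United", "Example City"],
})

# shift standings forward one season so they line up with the next season's matches
prev = pl_tables.assign(season=pl_tables["season"] + 1).rename(
    columns={"squad": "team", "pts/mp": "prev_pts_per_game"}
)

# left merge keeps matches for teams with no row in the previous table (NaN strength)
matches = matches.merge(prev, on=["season", "team"], how="left")
print(matches)
```

A left merge is used so that newly promoted teams, which do not appear in the previous season's table, keep their match rows with a missing value rather than being dropped.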

In [None]:
# create an empty list to store the Premier League table for each season
pl_tables = []

In [None]:
# define the URL with the premier league table 
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [None]:
# define list of years to collect PL table data (one more than the seasons previously selected, because we use each team's position in the previous season)
years = list(range(2022, 2015, -1))
years

In [None]:
## COLLECT PL TABLE POSITION DATA FOR THE SEASON
for year in years:  
    
    pl_table_data = requests.get(standings_url) # send GET request to the standings URL
    
    pl_table = pd.read_html(pl_table_data.text)[0] # collect premier league table from each season 
    
    pl_table.columns = map(str.lower, pl_table.columns) # make all the column names lower case
    
    pl_table["season"] = year # add a season name to the PL table
    
    pl_table = pl_table[["season", "rk", "squad", "pts", "pts/mp"]] # collect only relevant columns
    
    soup = BeautifulSoup(pl_table_data.text) # use BeautifulSoup to parse the text object received from the URL
    
    previous_season = soup.select("a.prev")[0].get("href") # collect the href link for the previous season
    
    standings_url = f"https://fbref.com{previous_season}" # complete the previous season url
    
    pl_tables.append(pl_table) # append this season's table to the list
        
    print(year) # print season year to help track progress
    
    time.sleep(1) # rest for a second before moving on

pl_tables = pd.concat(pl_tables) # concatenate the list into a single dataframe

pl_tables.to_csv("../data/raw_EPL_tables_data.csv") # store as csv