# Web Scraping Football Match Data

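In this notebook, we'll scrape Premier League match data from FBref.com. We'll collect:
- Team standings, which give us the link to each team's page
- Each team's Scores & Fixtures table (one row per match)
- Each team's match-by-match shooting statistics

The combined data is saved to `matches.csv` and used to predict match winners in the prediction notebook.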

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

## Define the Target Seasons and URL

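We'll scrape four seasons of Premier League data, starting from the most recent and working backwards. The loop begins at the current-season standings page and follows FBref's "previous season" link after each pass, so the first value in `years` should match the season that page currently shows.

In [None]:
# Season labels, newest first: 2024, 2023, 2022, 2021
years = list(range(2024, 2020, -1))
# One DataFrame per team per season is appended here
all_matches = []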
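
Each pass of the scraping loop is tagged with one integer from `years`. As a small sketch, assuming each integer names the year in which a season ends (an assumption; the notebook only stores the bare integer), the tags can be expanded into readable season strings. The helper name `season_label` is hypothetical.

```python
# Sketch: expand the integer season tags into "YYYY-YYYY" strings.
# Assumption: each integer is the year in which the season ends.
years = list(range(2024, 2020, -1))  # [2024, 2023, 2022, 2021]


def season_label(end_year: int) -> str:
    """Return a label such as '2023-2024' for a season ending in end_year."""
    return f"{end_year - 1}-{end_year}"


season_labels = {year: season_label(year) for year in years}
print(season_labels)
```

If the first scraped page turns out to be a different season, only the starting value of the range needs to change.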

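## Scraping Loop

The loop below downloads the standings page for each season, extracts the links to the team pages, and then follows the "previous season" link so the next pass scrapes the season before. For each team, it merges the Scores & Fixtures table with the shooting table and keeps only Premier League matches.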
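
The link filtering and team-name extraction can be tried without any network access. The sketch below uses made-up placeholder hrefs (not real FBref paths) and a hypothetical helper, `team_name_from_url`, to show what those two steps produce.

```python
# Sketch: filter hrefs down to squad pages and derive team names.
# The hrefs below are made-up placeholders, not real FBref paths.
sample_hrefs = [
    "/en/squads/aaaa1111/Manchester-City-Stats",
    "/en/comps/9/Premier-League-Stats",
    None,  # anchors without an href yield None from .get("href")
    "/en/squads/bbbb2222/Arsenal-Stats",
]

BASE_URL = "https://fbref.com"

# Keep only squad links; the `href and` guard skips missing hrefs.
squad_links = [href for href in sample_hrefs if href and "/squads/" in href]
team_urls = [f"{BASE_URL}{href}" for href in squad_links]


def team_name_from_url(team_url: str) -> str:
    """Turn '.../Manchester-City-Stats' into 'Manchester City'."""
    slug = team_url.split("/")[-1]
    return slug.replace("-Stats", "").replace("-", " ")


team_names = [team_name_from_url(url) for url in team_urls]
print(team_names)  # ['Manchester City', 'Arsenal']
```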

In [None]:
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"
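
The scraping loop makes several requests per team, so spacing them out matters. Here is a minimal sketch of one way to do that, assuming the server signals rate limiting with HTTP 429 (the standard "Too Many Requests" status; whether FBref responds this way is an assumption). `polite_get` and its parameters are hypothetical names, not part of `requests`.

```python
import time
from typing import Callable


def polite_get(
    url: str,
    fetch: Callable[[str], object],
    delay: float = 3.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call fetch(url), pausing before each attempt and retrying on HTTP 429.

    `fetch` is expected to return an object with a `status_code` attribute,
    as `requests.get` does. The `sleep` parameter exists so the pause can be
    replaced in tests.
    """
    response = None
    for attempt in range(1, max_retries + 1):
        sleep(delay * attempt)  # wait longer on each successive attempt
        response = fetch(url)
        if response.status_code != 429:  # 429 = Too Many Requests
            break
    return response
```

In the notebook this could be called as `data = polite_get(standings_url, requests.get)`. The response from the last attempt is returned even if it is still a 429, so the caller should check `status_code` before parsing.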

In [None]:
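from io import StringIO  # pandas 2.1+ deprecates passing raw HTML strings to read_html

for year in years:
    # Download the standings page for this season
    time.sleep(3)  # pause before every request to avoid hammering the site
    data = requests.get(standings_url)
    soup = BeautifulSoup(data.text, 'html.parser')
    standings_table = soup.select('table.stats_table')[0]

    # Collect the links to each team's page
    links = [l.get("href") for l in standings_table.find_all('a')]
    links = [l for l in links if l and '/squads/' in l]
    team_urls = [f"https://fbref.com{l}" for l in links]

    # Point the next pass at the previous season's standings page
    previous_season = soup.select("a.prev")[0].get("href")
    standings_url = f"https://fbref.com{previous_season}"

    for team_url in team_urls:
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")

        # Scores & Fixtures table for this team
        time.sleep(3)
        data = requests.get(team_url)
        matches = pd.read_html(StringIO(data.text), match="Scores & Fixtures")[0]

        # Find the link to the team's shooting page
        team_soup = BeautifulSoup(data.text, 'html.parser')
        links = [l.get("href") for l in team_soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        if not links:
            continue  # no shooting page linked for this team

        time.sleep(3)
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(StringIO(data.text), match="Shooting")[0]
        shooting.columns = shooting.columns.droplevel()  # drop the top header level

        # Skip teams whose shooting table lacks the expected columns
        try:
            team_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
        except (KeyError, ValueError):
            continue

        # Keep league matches only; .copy() avoids a SettingWithCopyWarning below
        team_data = team_data[team_data["Comp"] == "Premier League"].copy()

        team_data["Season"] = year
        team_data["Team"] = team_name
        all_matches.append(team_data)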
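
The merge step in the loop joins the two tables on `Date`. As a rough illustration with invented values (the frames and the team name `Example FC` are stand-ins, not scraped data), the sketch below shows that `merge` defaults to an inner join, so a match with no shooting row is dropped, and that filtering by `Comp` then keeps only league matches.

```python
import pandas as pd

# Toy stand-ins for the two scraped tables (values are invented).
matches = pd.DataFrame({
    "Date": ["2023-08-12", "2023-08-15", "2023-08-19"],
    "Comp": ["Premier League", "EFL Cup", "Premier League"],
    "Result": ["W", "L", "D"],
})
shooting = pd.DataFrame({
    "Date": ["2023-08-12", "2023-08-19"],  # no row for 2023-08-15
    "Sh": [14, 9],
    "SoT": [6, 3],
})

# Inner join (the default): only dates present in both tables survive.
team_data = matches.merge(shooting[["Date", "Sh", "SoT"]], on="Date")

# .copy() makes the filtered frame independent before adding columns.
team_data = team_data[team_data["Comp"] == "Premier League"].copy()
team_data["Season"] = 2024
team_data["Team"] = "Example FC"
print(team_data)
```

Passing `how="left"` to `merge` would instead keep every fixture and leave the shooting columns empty where no match was found.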

## Combine All Match Data

In [None]:
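# Stack the per-team tables into one DataFrame with a fresh 0..n-1 index
match_df = pd.concat(all_matches, ignore_index=True)
# Lowercase the column names for easier access later
match_df.columns = [c.lower() for c in match_df.columns]
match_df.head()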

## Save to CSV

In [None]:
match_df.to_csv("matches.csv", index=False)

## Data Overview

In [None]:
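# Each row is one team's view of a match, so a match between two scraped
# teams appears twice (once per team)
print(f"Total team-match rows: {len(match_df)}")
print(f"Columns: {match_df.columns.tolist()}")
print(f"\nSeasons included: {match_df['season'].unique()}")
print(f"Teams included: {match_df['team'].nunique()}")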