# 00 Data Scraper

This notebook scrapes structured player statistics for Valencia CF from [FBref](https://fbref.com) across three seasons (2022–2025). It includes:

* **Seasonal data scraping:** Extracts 7 core stat tables (e.g. passing, defense, possession) for each season using `pandas.read_html` from public FBref squad pages
* **Automated filename mapping:** Dynamically names and saves each table as a CSV in `data/raw/` using season and table type
* **Rate limit protection:** Implements a request counter and 15-minute cooldown after 10 requests to avoid getting blocked by FBref
* **Reproducible storage:** Skips already-downloaded files to prevent unnecessary re-fetches and ensure consistent local copies

> Output of this notebook is a version-controlled local dump of raw FBref tables for further inspection, cleaning, and analysis. The one-off scraper cells are commented out after use to avoid accidentally sending repeat requests to FBref.

In [1]:
import pandas as pd
from pathlib import Path
import time
import random

In [2]:
RAW_DIR = Path("..", "data", "raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

## Scrape Valencia data from FBref
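- Tables are saved as CSV files in `data/raw` (`../data/raw` relative to this notebook) so they do not have to be re-fetched and FBref's request limit is not hit
- The one-off scraping cells below are commented out so they do not run by accident; uncomment them only if a re-scrape is needed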

In [3]:
# # Current season 2024-2025
# df_player_stats_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]


In [4]:
# # Season 2023-2024
# df_player_stats_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]


In [5]:
# # Season 2022-2023
# df_player_stats_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]

In [6]:
# ##### Save all dataframes to CSV files for future use #####

# # 1 Folder →  data/raw   (create if it doesn't exist)

# RAW_DIR = Path("..", "data", "raw")
# RAW_DIR.mkdir(parents=True, exist_ok=True)

# # 2 Find every variable in the notebook whose name starts with df_
# frames = {
#     name: obj
#     for name, obj in globals().items()
#     if name.startswith("df_") and isinstance(obj, pd.DataFrame)
# }

# # 3  Save each DataFrame to CSV
# for name, df in frames.items():
#     filepath = RAW_DIR / f"{name}.csv"
#     df.to_csv(filepath, index=False)
#     print(f"{filepath}")

In [7]:
BASE_URLS = {
    "2425": "https://fbref.com/en/squads/dcc91a7b/Valencia-Stats",
    "2324": "https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats",
    "2223": "https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats",
}

TABLE_IDS = [
    "stats_standard_12",
    "stats_shooting_12",
    "stats_passing_12",
    "stats_passing_types_12",
    "stats_gca_12",
    "stats_defense_12",
    "stats_possession_12",
]

In [8]:
MAX_REQUESTS = 10
COOLDOWN_SECONDS = 15 * 60  # 15 minutes

request_counter = 0

In [9]:
def strip_suffix(table_id: str, suffix: str = "_12") -> str:
    """Return table_id without the trailing suffix (default "_12"), if it has one."""
    return table_id[:-len(suffix)] if table_id.endswith(suffix) else table_id
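
The filename rule applied when saving (the standard table becomes `stats`, every other table drops its `stats_` prefix) can be isolated into a small function and checked without touching the network. This is only a sketch: `csv_name` is a hypothetical helper name that does not appear in the notebook, and `strip_suffix` is repeated here so the snippet runs on its own.

```python
def strip_suffix(table_id: str, suffix: str = "_12") -> str:
    """Remove a trailing suffix from a table id, if present."""
    return table_id[:-len(suffix)] if table_id.endswith(suffix) else table_id


def csv_name(table_id: str, season: str) -> str:
    """Hypothetical helper: map an FBref table id and season code to a CSV filename."""
    base = strip_suffix(table_id)
    # The standard table is stored as "stats"; the others drop the "stats_" prefix.
    kind = "stats" if base == "stats_standard" else base.replace("stats_", "")
    return f"df_player_{kind}_{season}.csv"


print(csv_name("stats_standard_12", "2425"))       # df_player_stats_2425.csv
print(csv_name("stats_passing_types_12", "2223"))  # df_player_passing_types_2223.csv
```

Keeping the mapping in one place means the skip-if-exists check and any later loading code can agree on filenames.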

We added a request counter and cooldown timer to the scraper to avoid triggering FBref’s rate limits and getting blocked after multiple table fetches.
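
The counter-and-cooldown idea can also be written as a small reusable object. The sketch below is an illustration, not the notebook's implementation: `RequestBudget` and its `acquire` method are hypothetical names, and the `sleep` argument is injectable only so that the logic can be exercised without actually waiting 15 minutes.

```python
import time
from typing import Callable


class RequestBudget:
    """Hypothetical helper: pause for `cooldown` seconds after every `max_requests` calls."""

    def __init__(self, max_requests: int = 10, cooldown: float = 15 * 60,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_requests = max_requests
        self.cooldown = cooldown
        self._sleep = sleep  # injectable so the logic can be tested without waiting
        self.count = 0

    def acquire(self) -> None:
        """Call once before each request; blocks for the cooldown once the cap is reached."""
        if self.count >= self.max_requests:
            self._sleep(self.cooldown)
            self.count = 0
        self.count += 1


# Demonstration with a fake sleep that only records the requested pauses.
pauses = []
budget = RequestBudget(max_requests=10, cooldown=900, sleep=pauses.append)
for _ in range(21):
    budget.acquire()
print(pauses)  # [900, 900]
```

Calling `acquire()` before every fetch counts attempts rather than successes, so failed requests also use up the budget.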

In [10]:
for season, url in BASE_URLS.items():
    for table_id in TABLE_IDS:
        table_base = strip_suffix(table_id)
        if table_base == "stats_standard":
            fname = f"df_player_stats_{season}.csv"
        else:
            fname = f"df_player_{table_base.replace('stats_', '')}_{season}.csv"
        fpath = RAW_DIR / fname

        if fpath.exists():
            print(f"Skipping existing file: {fname}")
            continue

        if request_counter >= MAX_REQUESTS:
            print(f"Request cap hit. Cooling down for {COOLDOWN_SECONDS // 60} minutes...")
            time.sleep(COOLDOWN_SECONDS)
            request_counter = 0

        # Count every attempt, not just successes: a failed fetch still hits FBref.
        request_counter += 1
        try:
            print(f"Fetching: {season} | {table_id}")
            df = pd.read_html(url, attrs={"id": table_id})[0]
            df.to_csv(fpath, index=False)
            print(f"Saved {fpath.name}")
        except Exception as e:
            print(f"Failed to fetch {table_id} for {season}: {e}")

        # Random pause between requests to stay well under the rate limit.
        time.sleep(random.uniform(5, 10))

Skipping existing file: df_player_stats_2425.csv
Skipping existing file: df_player_shooting_2425.csv
Skipping existing file: df_player_passing_2425.csv
Skipping existing file: df_player_passing_types_2425.csv
Skipping existing file: df_player_gca_2425.csv
Skipping existing file: df_player_defense_2425.csv
Skipping existing file: df_player_possession_2425.csv
Skipping existing file: df_player_stats_2324.csv
Skipping existing file: df_player_shooting_2324.csv
Skipping existing file: df_player_passing_2324.csv
Skipping existing file: df_player_passing_types_2324.csv
Skipping existing file: df_player_gca_2324.csv
Skipping existing file: df_player_defense_2324.csv
Skipping existing file: df_player_possession_2324.csv
Skipping existing file: df_player_stats_2223.csv
Skipping existing file: df_player_shooting_2223.csv
Skipping existing file: df_player_passing_2223.csv
Skipping existing file: df_player_passing_types_2223.csv
Skipping existing file: df_player_gca_2223.csv
Skipping existing file
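
Since the loop only prints progress messages, a quick existence check is one way to confirm that all 21 expected CSVs (3 seasons × 7 tables) are on disk. The following is a minimal sketch: `missing_files` is a hypothetical helper, the season codes and table names are taken from the filenames in the log above, and the demonstration uses a temporary directory in place of the real `data/raw` folder.

```python
from pathlib import Path
import tempfile

SEASONS = ["2425", "2324", "2223"]
KINDS = ["stats", "shooting", "passing", "passing_types", "gca", "defense", "possession"]


def missing_files(raw_dir: Path) -> list:
    """Hypothetical helper: list the expected CSV names that are not present in raw_dir."""
    expected = [f"df_player_{kind}_{season}.csv" for season in SEASONS for kind in KINDS]
    return [name for name in expected if not (raw_dir / name).exists()]


# Demonstration against a temporary directory rather than the real data/raw folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "df_player_stats_2425.csv").touch()
    still_missing = missing_files(tmp_dir)
print(len(still_missing))  # 20
```

In the notebook itself, calling `missing_files(RAW_DIR)` should return an empty list once every table has been saved.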

# Scrape Historical Market Values