# 00 Data Scraper

This notebook scrapes structured player statistics for Valencia CF from [FBref](https://fbref.com) across three seasons (2022–2025). It includes:

* **Seasonal data scraping:** Extracts 7 core stat tables (e.g. passing, defense, possession) for each season using `pandas.read_html` from public FBref squad pages
* **Automated filename mapping:** Dynamically names and saves each table as a CSV in `data/raw/` using season and table type
* **Rate limit protection:** Implements a request counter and 15-minute cooldown after 10 requests to avoid getting blocked by FBref
* **Reproducible storage:** Skips already-downloaded files to prevent unnecessary re-fetches and ensure consistent local copies

> Output of this notebook is a version-controlled local dump of raw FBref tables for further inspection, cleaning, and analysis. Scraper code is commented out after use to avoid accidentally re-sending requests to FBref.
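
The filename mapping described in the list above can be sketched as a single function. This is a minimal illustration that mirrors the naming used in the cells further down; the function name `csv_name` is hypothetical and is not part of the project's `src.scrapers` module.

```python
def csv_name(table_id: str, season: str, suffix: str = "_12") -> str:
    """Map an FBref table id and a season code to a CSV file name."""
    # Drop the numeric competition suffix, e.g. "stats_passing_12" -> "stats_passing"
    base = table_id[: -len(suffix)] if table_id.endswith(suffix) else table_id
    # Drop the leading "stats_" so only the table type remains
    kind = base[len("stats_"):] if base.startswith("stats_") else base
    # The standard table is stored under the name "stats"
    if kind == "standard":
        kind = "stats"
    return f"df_player_{kind}_{season}.csv"


print(csv_name("stats_standard_12", "2425"))
print(csv_name("stats_passing_types_12", "2223"))
```

Keeping the mapping in one place means the skip-if-exists check and the save step cannot drift apart.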

In [15]:
import warnings
warnings.filterwarnings('ignore')

# Standard library imports
import json
import random
import re
import ssl
import time
import pathlib
from pathlib import Path
from urllib.request import Request, urlopen
import urllib.parse
import io

# Third-party imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Import our scraper module using relative path
import sys
sys.path.append("..")
from src.scrapers.fbref_scraper import scrape_fbref_squad, FBrefScraper
from src.scrapers.transfermarkt_scraper import scrape_transfermarkt_team

In [16]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

---

## FBref Data Scraper
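- Tables are saved as CSV files in `../data/raw` (relative to this notebook) so each one is fetched only once and we stay under FBref's request limit
- The scraping cells below are commented out; uncomment them only when a re-fetch is needed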
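
The "fetch once, then reuse the local copy" idea can be sketched with the standard library alone. The helper name `fetch_once` and the `fetch` callable are assumptions made for illustration; in this notebook the callable would wrap whatever downloads and serialises a table.

```python
from pathlib import Path
from typing import Callable


def fetch_once(path: Path, fetch: Callable[[], str]) -> bool:
    """Write fetch() to path unless the file already exists.

    Returns True if a fetch happened, False if the cached file was reused.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(fetch(), encoding="utf-8")
    return True
```

Because the function reports whether it actually fetched, the caller can count only real requests against the rate limit.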

In [46]:
# # Current season 2024-2025
# df_player_stats_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]

In [47]:
# # Season 2023-2024
# df_player_stats_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]

In [48]:
# # Season 2022-2023
# df_player_stats_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]

In [49]:
# ##### Save all dataframes to CSV files for future use #####

# # 1 Folder →  data/raw   (create if it doesn't exist)

# RAW_DIR = Path("..", "data", "raw")
# RAW_DIR.mkdir(parents=True, exist_ok=True)

# # 2 Find every variable in the notebook whose name starts with df_
# frames = {
#     name: obj
#     for name, obj in globals().items()
#     if name.startswith("df_") and isinstance(obj, pd.DataFrame)
# }

# # 3  Save each DataFrame to CSV
# for name, df in frames.items():
#     filepath = RAW_DIR / f"{name}.csv"
#     df.to_csv(filepath, index=False)
#     print(f"{filepath}")

In [50]:
# BASE_URLS = {
#     "2425": "https://fbref.com/en/squads/dcc91a7b/Valencia-Stats",
#     "2324": "https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats",
#     "2223": "https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats",
# }

# TABLE_IDS = [
#     "stats_standard_12",
#     "stats_shooting_12",
#     "stats_passing_12",
#     "stats_passing_types_12",
#     "stats_gca_12",
#     "stats_defense_12",
#     "stats_possession_12",
# ]

In [51]:
# # only 10 requests per 15 minutes
# MAX_REQUESTS = 10
# COOLDOWN_SECONDS = 15 * 60  # 15 minutes

# request_counter = 0

In [52]:
# def strip_suffix(table_id: str, suffix="_12") -> str:
#     return table_id[:-len(suffix)] if table_id.endswith(suffix) else table_id

We added a request counter and cooldown timer to the scraper to avoid triggering FBref’s rate limits and getting blocked after multiple table fetches.
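
As a rough sketch of the counter-and-cooldown idea, the logic can be wrapped in a small class. The name `RequestBudget` is hypothetical, and unlike the loop in this notebook it counts every attempt rather than only successful fetches, which is the more conservative choice. The `sleep` parameter is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable


class RequestBudget:
    """Allow max_requests calls, then pause for cooldown_seconds and reset."""

    def __init__(
        self,
        max_requests: int = 10,
        cooldown_seconds: float = 15 * 60,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.cooldown_seconds = cooldown_seconds
        self._sleep = sleep
        self.count = 0

    def acquire(self) -> None:
        """Call before each request; blocks for the cooldown once the cap is hit."""
        if self.count >= self.max_requests:
            self._sleep(self.cooldown_seconds)
            self.count = 0
        self.count += 1
```

A caller would create one `RequestBudget()` and call `acquire()` immediately before each page request.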

In [53]:
# for season, url in BASE_URLS.items():
#     for table_id in TABLE_IDS:
#         table_base = strip_suffix(table_id)
#         if table_base == "stats_standard":
#             fname = f"df_player_stats_{season}.csv"
#         else:
#             fname = f"df_player_{table_base.replace('stats_', '')}_{season}.csv"
#         fpath = RAW_DIR / fname

#         if fpath.exists():
#             print(f"Skipping existing file: {fname}")
#             continue

#         if request_counter >= MAX_REQUESTS:
#             print(f"Request cap hit. Cooling down for {COOLDOWN_SECONDS // 60} minutes...")
#             time.sleep(COOLDOWN_SECONDS)
#             request_counter = 0

#         try:
#             print(f"Fetching: {season} | {table_id}")
#             df = pd.read_html(url, attrs={"id": table_id})[0]
#             df.to_csv(fpath, index=False)
#             print(f"Saved {fpath.name}")
#             request_counter += 1
#         except Exception as e:
#             print(f"Failed to fetch {table_id} for {season}: {e}")

#         time.sleep(random.uniform(5, 10))

In [3]:
team_name = "Valencia CF"

In [7]:
# Valencia CF URLs for different seasons
valencia_urls = {
    "2425": "https://fbref.com/en/squads/dcc91a7b/Valencia-Stats",
    "2324": "https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats", 
    "2223": "https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats",
}

In [None]:
RAW_DIR_FBREF = Path("..", "data", "raw", team_name, "fbref")
RAW_DIR_FBREF.mkdir(parents=True, exist_ok=True)

In [None]:
# Create scraper instance with custom settings
scraper = FBrefScraper(
    output_dir=RAW_DIR_FBREF,
    max_requests=10,
    cooldown_seconds=15 * 60,
    delay_range=(5, 10),
    current_season="2425"
)

# Scrape all seasons
for season, url in valencia_urls.items():
    print(f"\n{'='*60}")
    print(f"Scraping Valencia CF season {season}")
    print(f"{'='*60}")
    
    result = scraper.scrape_squad_stats(url, force_overwrite=False)
    
    if result:
        print(f"Successfully scraped {len(result)} tables for season {season}")
    else:
        print(f"No data scraped for season {season}")

# Scrape Historical Market Values

- Scraping market data directly from Transfermarkt proved difficult
- Used Apify instead, a paid scraping service with a good free tier
- Tested below by running the scraper in the browser and loading the saved JSON file
- Still needs adjustment

In [56]:
# # ── 1) read the file (local) ──────────────────────────────────────────────
# json_path = pathlib.Path(
#     "..", "data", "raw", "dataset_transfermarkt_2025-06-15_15-34-31-954.json"
# )                     # <— adjust if you stored it elsewhere

# with json_path.open(encoding="utf-8") as f:
#     data = json.load(f)          # top level is a list with a single club dict

# # ── 2) flatten the “players” list into a table ────────────────────────────
# club_record = data[0]            # only one element
# players_raw = club_record["players"]

# df_players = pd.json_normalize(players_raw)  # one row per player
# df_players


---

## Transfermarkt Data Scraper

- Below is the Apify scraper code that extracts the market values of Valencia players for the 2022, 2023 and 2024 seasons

In [None]:
"""
import os, json, requests, pandas as pd
from pathlib import Path
from dotenv import load_dotenv          # pip install python-dotenv

# ── environment ───────────────────────────────────────────────────────────────
load_dotenv()
APIFY_TOKEN = os.getenv("APIFY_TOKEN")          
if not APIFY_TOKEN:
    raise RuntimeError("Set APIFY_TOKEN first – never hard-code it in notebooks!")

ACTOR  = "curious_coder~transfermarkt"
ENDPT  = (f"https://api.apify.com/v2/acts/{ACTOR}"
          "/run-sync-get-dataset-items?token=" + APIFY_TOKEN +
          "&clean=true&format=json")

BASE_URL = ("https://www.transfermarkt.co.uk/valencia-cf/kader/verein/1049/"
            "plus/0/galerie/0?saison_id={year}")

def fetch_squad(year: int) -> pd.DataFrame:
    #Run the Transfermarkt actor for one Valencia squad year → DataFrame.
    payload = {
        "startUrls": [ { "url": BASE_URL.format(year=year) } ],  # <- corrected
        "proxyConfiguration": { "useApifyProxy": True },         # free pool only
        "maxCrawlingDepth": 0
    }

    r = requests.post(ENDPT, json=payload, timeout=180)
    if r.status_code >= 400:
        raise RuntimeError(f"{year}: HTTP {r.status_code}\n{r.text}")

    rows = r.json()
    if rows and "error.type" in rows[0]:
        msg = rows[0].get("error.message", "no message")
        raise RuntimeError(f"{year}: actor error – {msg}")

    # actor returns one club record → extract player list
    club_record    = rows[0]
    players_raw    = club_record["players"]
    df_players     = pd.json_normalize(players_raw)
    df_players["Season"] = year
    return df_players


# ── fetch three seasons & inspect ────────────────────────────────────────────
seasons   = [2022, 2023, 2024]
valencia_player_value  = pd.concat([fetch_squad(y) for y in seasons], ignore_index=True)

valencia_player_value.to_csv(RAW_DIR / "valencia_market_value_22_25.csv", index=False)

# valencia_player_value = pd.read_csv(RAW_DIR / "valencia_market_value_22_25.csv")

# valencia_player_value.head()

# javi_guerra_rows = valencia_player_value[valencia_player_value['Player'].astype(str).str.contains('Javi Guerra')]
# javi_guerra_rows
"""


**Expected Format**

| # | Player | Age | Current club | Market value | Nat. | Season | Contract |
|---|--------|-----|--------------|--------------|------|--------|----------|
| 36.0 | ['Javi Guerra', 'Central Midfield'] | 20 | Valencia CF | €2.00m | Spain | 2022 | NaN |
| 8.0 | ['Javi Guerra', 'Central Midfield'] | 21 | Valencia CF | €20.00m | Spain | 2023 | NaN |
| 8.0 | ['Javi Guerra', 'Central Midfield'] | 22 | NaN | €25.00m | Spain | 2024 | Jun 30, 2027 |


- We can see the market value of Javi Guerra.
- Interesting features are: Position, Market Value, Contract length
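
To make those features usable, the raw cells need parsing. The sketch below assumes the formats shown in the table above: a market value string such as `€2.00m` and a player cell holding a `[name, position]` pair, either as a list or as its string form after a CSV round trip. The `k` and `bn` units and both function names are assumptions, not something confirmed by the scraped data.

```python
import ast
import re
from typing import Optional, Tuple

_UNITS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}


def parse_market_value(text) -> Optional[float]:
    """Convert a string such as '€2.00m' to euros; return None if it does not match."""
    if not isinstance(text, str):
        return None  # e.g. NaN for a missing value
    match = re.fullmatch(r"€\s*(\d+(?:\.\d+)?)\s*(k|m|bn)?", text.strip())
    if match is None:
        return None
    number, unit = match.groups()
    return float(number) * _UNITS.get(unit, 1)


def split_player(cell) -> Tuple[Optional[str], Optional[str]]:
    """Split a [name, position] cell (list or its string form) into (name, position)."""
    if isinstance(cell, str):
        try:
            cell = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return cell, None  # plain name without a position
    if isinstance(cell, (list, tuple)) and len(cell) >= 2:
        return cell[0], cell[1]
    return None, None
```

Both helpers return `None` rather than raising on unexpected input, so they can be applied column-wise and the unparsed rows inspected afterwards.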

---

# Transfermarkt Scraper

In [None]:
team_name = "Valencia CF"

In [6]:
RAW_DIR_TRANSFERMARKET = Path("..", "data", "raw", team_name, "transfermarkt")
RAW_DIR_TRANSFERMARKET.mkdir(parents=True, exist_ok=True)

In [None]:
min_season = 2020
max_season = 2024

valencia_team_data = scrape_transfermarkt_team(
    team_name=team_name,
    min_season=min_season,
    max_season=max_season,
    output_dir=RAW_DIR_TRANSFERMARKET,
    output_filename="valencia_market_values_22_25.csv",
    drop_metadata_columns=True
)

valencia_team_data.head()

NOTE: This is our raw data, which we will save before cleaning it in the next pipeline step and merging it with the other data sources

---

# Scrape multiple teams (Transfermarkt)

In [30]:
TEAMS = [
    "Real Madrid CF",
    "FC Barcelona",
    "Atlético Madrid",
    "Sevilla FC", 
    "Athletic Club",
    "Villarreal CF",
    "Real Sociedad",
    "Real Betis",
    "Valencia CF",
]

min_season = 2020
max_season = 2024

for current_team_name in TEAMS:
    team_slug = current_team_name.lower().replace(" ", "_").replace("cf", "").replace("fc", "").strip("_")
    raw_dir_transfermarkt = Path("..", "data", "raw", current_team_name, "transfermarkt")
    raw_dir_transfermarkt.mkdir(parents=True, exist_ok=True)
    
    output_filename = f"{team_slug}_market_values_{min_season}_{max_season}.csv"
    
    team_data = scrape_transfermarkt_team(
        team_name=current_team_name,
        min_season=min_season,
        max_season=max_season,
        output_dir=raw_dir_transfermarkt,
        output_filename=output_filename,
        drop_metadata_columns=True
    )
    
    print(f"Scraped data for {current_team_name}: {len(team_data)} records")
    print(f"Saved to: {raw_dir_transfermarkt / output_filename}")
    print("-" * 50)

Starting to scrape Real Madrid CF players from season 2020 to 2024
------------------------------------------------------------
Scraping Real Madrid CF season 2020...
Successfully scraped 37 players for Real Madrid CF season 2020
Waiting 1.7 seconds before next request...
Scraping Real Madrid CF season 2021...
Successfully scraped 36 players for Real Madrid CF season 2021
Waiting 2.2 seconds before next request...
Scraping Real Madrid CF season 2022...
Successfully scraped 36 players for Real Madrid CF season 2022
Waiting 2.3 seconds before next request...
Scraping Real Madrid CF season 2023...
Successfully scraped 38 players for Real Madrid CF season 2023
Waiting 2.0 seconds before next request...
Scraping Real Madrid CF season 2024...
Successfully scraped 29 players for Real Madrid CF season 2024

Scraped 176 player records
Dropped metadata columns: ['Shirt Number', 'Photo URL', 'Profile URL']

Data saved to ../data/raw/Real Madrid CF/transfermarkt/real_madrid_market_values_2020_2024

KeyboardInterrupt: 

---

# Scrape multiple teams (FBref)

In [None]:
# --------------------------------------------------------------------
# 0) imports and constants
# --------------------------------------------------------------------
from pathlib import Path
import random, time, requests, pandas as pd, unicodedata, io

TABLE_IDS = [
    "stats_standard_12", "stats_shooting_12", "stats_passing_12",
    "stats_passing_types_12", "stats_gca_12", "stats_defense_12",
    "stats_possession_12",
]
SEASONS = ["2223", "2324", "2425"]

TEAMS = {
    # "Real Madrid CF":   "53a2f082", # We're getting 404s for this one, so we'll skip it for now
    "FC Barcelona":     "206d90db",
    "Sevilla FC":       "ad2be733",
    # "Atlético Madrid":  "db3b9613",
    "Athletic Club":    "2b390eca",
    "Villarreal CF":    "2a8183b3",
    "Real Sociedad":    "e31d1cd9",
    "Real Betis":       "fc536746",
    "Valencia CF":      "dcc91a7b",
}

def slugify(name: str) -> str:
    """Convert 'Atlético Madrid' → 'Atletico-Madrid-Stats'."""
    base = name.replace(" CF", "").replace(" FC", "")  # optional trims
    base = unicodedata.normalize("NFD", base)          # strip accents
    base = "".join(c for c in base if unicodedata.category(c) != "Mn")
    return base.replace(" ", "-") + "-Stats"

def polite_get(url: str) -> requests.Response:
    hdr = {"User-Agent": "Mozilla/5.0 (polite-bot/0.1)"}
    time.sleep(random.uniform(5, 10))
    return requests.get(url, headers=hdr, timeout=40)

# --------------------------------------------------------------------
# 1) main loop
# --------------------------------------------------------------------
for team, squad_id in TEAMS.items():
    raw_dir = Path("..", "data", "raw", team, "fbref")
    raw_dir.mkdir(parents=True, exist_ok=True)
    slug = slugify(team)

    for season in SEASONS:

        # build correct base URL
        if season == "2425":
            base_url = f"https://fbref.com/en/squads/{squad_id}/{slug}"
        else:
            base_url = (
                f"https://fbref.com/en/squads/{squad_id}/20{season[:2]}-20{season[2:]}/{slug}"
            )

        # quick 404 check – skip if the page isn’t live yet
        resp = polite_get(base_url)
        if resp.status_code == 404:
            print(f"{team} {season}: page not found – skipped")
            continue

        for t_id in TABLE_IDS:
            name_part = t_id.replace("stats_", "").replace("_12", "")
            csv_name  = (
                f"df_player_stats_{season}.csv"
                if t_id == "stats_standard_12"
                else f"df_player_{name_part}_{season}.csv"
            )
            out_path = raw_dir / csv_name
            if out_path.exists():
                continue

            try:
                # pandas ≥ 2.1 deprecates passing literal HTML strings, so wrap the text in StringIO;
                # read_html raises ValueError when no table matches the id
                df = pd.read_html(io.StringIO(resp.text), attrs={"id": t_id})[0]
            except ValueError:
                print(f"{team} {season} {t_id}: table missing – skipped")
                continue

            df.to_csv(out_path, index=False)
            print(f"{team} {season} {t_id} -> {out_path.name}")

---

Since we're getting 404s for Real Madrid CF, we'll use Selenium to scrape the data.

In [24]:
# --------------------------------------------------------------------
# Dynamic Selenium Scraper that Detects Correct Table IDs
# --------------------------------------------------------------------

def setup_selenium_driver() -> webdriver.Chrome:
    """Initialize headless Chrome driver for web scraping."""
    chrome_options: Options = Options()
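    # Options.headless was deprecated in Selenium 4.8 and later removed; pass the flag instead
    # ("--headless=new" assumes a recent Chrome version)
    chrome_options.add_argument("--headless=new")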
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    return webdriver.Chrome(options=chrome_options)

def wait_for_table_to_load(selenium_driver: webdriver.Chrome, table_id: str, timeout_seconds: int = 10) -> bool:
    """Wait for a specific table to be present in the DOM."""
    try:
        WebDriverWait(selenium_driver, timeout_seconds).until(
            EC.presence_of_element_located((By.ID, table_id))
        )
        return True
    except Exception as error:
        print(f"Table {table_id} not found: {error}")
        return False

def detect_table_ids_on_page(selenium_driver: webdriver.Chrome) -> list:
    """Dynamically detect which table IDs are available on the current page."""
    # Look for any elements with 'stats' in their ID
    all_elements_with_stats: list = selenium_driver.find_elements(By.CSS_SELECTOR, "[id*='stats']")
    
    # Extract the actual table IDs (not the wrapper divs)
    table_ids = []
    for element in all_elements_with_stats:
        element_id: str = element.get_attribute("id")
        # Look for main table IDs (not the wrapper divs, links, etc.)
        if (element_id.startswith("stats_") and 
            not element_id.endswith("_link") and 
            not element_id.endswith("_sh") and 
            not element_id.endswith("_per_match_toggle") and
            not element_id.startswith("div_") and
            not element_id.startswith("tfooter_") and
            not element_id.startswith("sticky_style_") and
            not element_id.startswith("all_")):
            table_ids.append(element_id)
    
    return sorted(list(set(table_ids)))  # Remove duplicates and sort

def debug_page_content(selenium_driver: webdriver.Chrome, table_id: str) -> None:
    """Debug function to check what tables are actually present on the page."""
    print(f"\nDebugging page content for table ID: {table_id}")
    
    # Check current URL
    current_url: str = selenium_driver.current_url
    print(f"Current URL: {current_url}")
    
    # Check page title
    page_title: str = selenium_driver.title
    print(f"Page title: {page_title}")
    
    # Detect available table IDs
    available_table_ids = detect_table_ids_on_page(selenium_driver)
    print(f"Available table IDs on page: {available_table_ids}")
    
    # Check if the specific table ID exists
    specific_table: list = selenium_driver.find_elements(By.ID, table_id)
    if specific_table:
        print(f"Table {table_id} found in DOM")
    else:
        print(f"Table {table_id} NOT found in DOM")

def scrape_team_tables_with_selenium_dynamic(team_name: str, squad_id: str, season: str) -> None:
    """Scrape all statistical tables for a specific team and season using Selenium with dynamic table ID detection."""
    selenium_driver: webdriver.Chrome = setup_selenium_driver()
    team_slug: str = slugify(team_name)
    
    # Build the season URL; "2425" expands to "2024-2025", so every season uses the same pattern
    base_url: str = (
        f"https://fbref.com/en/squads/{squad_id}/20{season[:2]}-20{season[2:]}/{team_slug}"
    )
    
    print(f"Navigating to: {base_url}")
    
    # Navigate to page and wait for initial load
    selenium_driver.get(base_url)
    time.sleep(5)  # Increased initial page load wait
    
    # Create output directory
    raw_directory: Path = Path("..", "data", "raw", team_name, "fbref")
    raw_directory.mkdir(parents=True, exist_ok=True)
    
    # Dynamically detect available table IDs on the page
    available_table_ids = detect_table_ids_on_page(selenium_driver)
    print(f"Detected {len(available_table_ids)} available tables: {available_table_ids}")
    
    # Define the table types we want to scrape
    desired_table_types = [
        "stats_standard", "stats_shooting", "stats_passing",
        "stats_passing_types", "stats_gca", "stats_defense",
        "stats_possession"
    ]
    
    # Find the matching table ID for each desired type. Require "<type>_<digits>" so that
    # "stats_passing" cannot also match "stats_passing_types_12" (uses `re` from the first cell)
    table_ids_to_scrape = []
    for desired_type in desired_table_types:
        matching_ids = [
            tid for tid in available_table_ids
            if re.fullmatch(rf"{re.escape(desired_type)}_\d+", tid)
        ]
        if matching_ids:
            table_ids_to_scrape.append(matching_ids[0])  # Take the first match
            print(f"Found {desired_type}: {matching_ids[0]}")
        else:
            print(f"Missing {desired_type}")
    
    # Scrape each detected table
    for table_id in table_ids_to_scrape:
        # Extract the base name for filename generation
        table_name_part: str = table_id.replace("stats_", "")
        # Remove the suffix (could be _12, _719, etc.)
        if "_" in table_name_part:
            table_name_part = table_name_part.rsplit("_", 1)[0]
            
        csv_filename: str = (
            f"df_player_stats_{season}.csv"
            if "standard" in table_id
            else f"df_player_{table_name_part}_{season}.csv"
        )
            
        output_path: Path = raw_directory / csv_filename
        
        if output_path.exists():
            print(f"{team_name} {season} {table_id}: file already exists – skipped")
            continue
        
        print(f"\nProcessing table: {table_id}")
        
        # Wait for specific table to load
        table_loaded: bool = wait_for_table_to_load(selenium_driver, table_id)
        if not table_loaded:
            print(f"{team_name} {season} {table_id}: table not found in DOM – skipped")
            continue
        
        try:
            # Get fresh page source after table loads
            current_page_html: str = selenium_driver.page_source
            table_dataframe: pd.DataFrame = pd.read_html(
                io.StringIO(current_page_html), 
                attrs={"id": table_id}
            )[0]
            table_dataframe.to_csv(output_path, index=False)
            print(f"{team_name} {season} {table_id} -> {output_path.name}")
        except ValueError as error:
            print(f"{team_name} {season} {table_id}: table extraction failed – {error}")
            continue
    
    selenium_driver.quit()

In [27]:
# Test the fixed scraper for Real Madrid 2024-25 season
real_madrid_squad_id: str = "53a2f082"
scrape_team_tables_with_selenium_dynamic("Real Madrid CF", real_madrid_squad_id, "2425")

Navigating to: https://fbref.com/en/squads/53a2f082/2024-2025/Real-Madrid-Stats
Detected 11 available tables: ['stats_defense_12', 'stats_gca_12', 'stats_keeper_12', 'stats_keeper_adv_12', 'stats_misc_12', 'stats_passing_12', 'stats_passing_types_12', 'stats_playing_time_12', 'stats_possession_12', 'stats_shooting_12', 'stats_standard_12']
Found stats_standard: stats_standard_12
Found stats_shooting: stats_shooting_12
Found stats_passing: stats_passing_12
Found stats_passing_types: stats_passing_types_12
Found stats_gca: stats_gca_12
Found stats_defense: stats_defense_12
Found stats_possession: stats_possession_12
Real Madrid CF 2425 stats_standard_12: file already exists – skipped
Real Madrid CF 2425 stats_shooting_12: file already exists – skipped
Real Madrid CF 2425 stats_passing_12: file already exists – skipped
Real Madrid CF 2425 stats_passing_types_12: file already exists – skipped
Real Madrid CF 2425 stats_gca_12: file already exists – skipped
Real Madrid CF 2425 stats_defense_

In [28]:
# Test the fixed scraper for Atlético Madrid 2024-25 season
atletico_madrid_squad_id: str = "db3b9613"
scrape_team_tables_with_selenium_dynamic("Atlético Madrid", atletico_madrid_squad_id, "2425")

Navigating to: https://fbref.com/en/squads/db3b9613/2024-2025/Atletico-Madrid-Stats
Detected 11 available tables: ['stats_defense_12', 'stats_gca_12', 'stats_keeper_12', 'stats_keeper_adv_12', 'stats_misc_12', 'stats_passing_12', 'stats_passing_types_12', 'stats_playing_time_12', 'stats_possession_12', 'stats_shooting_12', 'stats_standard_12']
Found stats_standard: stats_standard_12
Found stats_shooting: stats_shooting_12
Found stats_passing: stats_passing_12
Found stats_passing_types: stats_passing_types_12
Found stats_gca: stats_gca_12
Found stats_defense: stats_defense_12
Found stats_possession: stats_possession_12
Atlético Madrid 2425 stats_standard_12: file already exists – skipped
Atlético Madrid 2425 stats_shooting_12: file already exists – skipped
Atlético Madrid 2425 stats_passing_12: file already exists – skipped
Atlético Madrid 2425 stats_passing_types_12: file already exists – skipped
Atlético Madrid 2425 stats_gca_12: file already exists – skipped
Atlético Madrid 2425 stat