# 00 Data Scraper

This notebook scrapes structured player statistics for Valencia CF from [FBref](https://fbref.com) across three seasons (2022–2025). It includes:

* **Seasonal data scraping:** Extracts 7 core stat tables (e.g. passing, defense, possession) for each season using `pandas.read_html` from public FBref squad pages
* **Automated filename mapping:** Dynamically names and saves each table as a CSV in `data/raw/` using season and table type
* **Rate limit protection:** Implements a request counter and 15-minute cooldown after 10 requests to avoid getting blocked by FBref
* **Reproducible storage:** Skips already-downloaded files to prevent unnecessary re-fetches and ensure consistent local copies

> The output of this notebook is a version-controlled local dump of raw FBref tables for further inspection, cleaning, and analysis. The one-off scraper cells are commented out after use to avoid sending unnecessary requests to FBref.
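
The filename mapping described above can be isolated as a small pure function. This is a minimal sketch of the naming rule applied later in the notebook; the helper name `csv_name` is hypothetical and does not appear in the original code.

```python
def csv_name(table_id: str, season: str, suffix: str = "_12") -> str:
    """Map an FBref table id and a season code to a CSV file name."""
    # Drop the trailing suffix, e.g. "stats_passing_12" -> "stats_passing"
    base = table_id[: -len(suffix)] if table_id.endswith(suffix) else table_id
    # The standard table is stored as "stats"; other tables drop the "stats_" prefix
    kind = "stats" if base == "stats_standard" else base.replace("stats_", "", 1)
    return f"df_player_{kind}_{season}.csv"


print(csv_name("stats_standard_12", "2425"))       # df_player_stats_2425.csv
print(csv_name("stats_passing_types_12", "2324"))  # df_player_passing_types_2324.csv
```

Keeping the rule in one function makes it easy to check which of the expected files already exist before any request is sent.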

In [53]:
import pandas as pd
from pathlib import Path
import time
import random

In [54]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [55]:
RAW_DIR = Path("..", "data", "raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

## FBref Data Scraper
- Tables are saved as CSV files in `../data/raw` (`RAW_DIR`) to avoid hitting FBref's HTTP request limit
- The scraping cells below are commented out so they do not run by accident (uncomment them if a re-fetch is needed)

In [56]:
# # Current season 2024-2025
# df_player_stats_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2425 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]


In [57]:
# # Season 2023-2024
# df_player_stats_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2324 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]


In [58]:
# # Season 2022-2023
# df_player_stats_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_standard_12"})[0]
# df_player_shooting_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_shooting_12"})[0]
# df_player_passing_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_passing_12"})[0]
# df_player_passing_types_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_passing_types_12"})[0]
# df_player_gca_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_gca_12"})[0]
# df_player_defense_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_defense_12"})[0]
# df_player_possession_2223 = pd.read_html('https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats', attrs={"id": "stats_possession_12"})[0]

In [59]:
# ##### Save all dataframes to CSV files for future use #####

# # 1 Folder →  data/raw   (create if it doesn't exist)

# RAW_DIR = Path("..", "data", "raw")
# RAW_DIR.mkdir(parents=True, exist_ok=True)

# # 2 Find every variable in the notebook whose name starts with df_
# frames = {
#     name: obj
#     for name, obj in globals().items()
#     if name.startswith("df_") and isinstance(obj, pd.DataFrame)
# }

# # 3  Save each DataFrame to CSV
# for name, df in frames.items():
#     filepath = RAW_DIR / f"{name}.csv"
#     df.to_csv(filepath, index=False)
#     print(f"{filepath}")

In [60]:
BASE_URLS = {
    "2425": "https://fbref.com/en/squads/dcc91a7b/Valencia-Stats",
    "2324": "https://fbref.com/en/squads/dcc91a7b/2023-2024/Valencia-Stats",
    "2223": "https://fbref.com/en/squads/dcc91a7b/2022-2023/Valencia-Stats",
}

TABLE_IDS = [
    "stats_standard_12",
    "stats_shooting_12",
    "stats_passing_12",
    "stats_passing_types_12",
    "stats_gca_12",
    "stats_defense_12",
    "stats_possession_12",
]

In [61]:
MAX_REQUESTS = 10
COOLDOWN_SECONDS = 15 * 60  # 15 minutes

request_counter = 0

In [62]:
def strip_suffix(table_id: str, suffix="_12") -> str:
    return table_id[:-len(suffix)] if table_id.endswith(suffix) else table_id

We added a request counter and cooldown timer to the scraper to avoid triggering FBref’s rate limits and getting blocked after multiple table fetches.
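
The counter-and-cooldown idea can also be expressed as a small reusable object. The following is only a sketch: the class name `RequestThrottle` and its methods are hypothetical, and the sleep function is injectable so the logic can be exercised without actually waiting 15 minutes.

```python
import time
from typing import Callable


class RequestThrottle:
    """Pause for `cooldown` seconds once `max_requests` fetches have been recorded."""

    def __init__(self, max_requests: int = 10, cooldown: float = 15 * 60,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_requests = max_requests
        self.cooldown = cooldown
        self.sleep = sleep  # injectable, so a dry run does not have to wait
        self.count = 0

    def before_request(self) -> None:
        """Call before each fetch; sleeps and resets once the cap is reached."""
        if self.count >= self.max_requests:
            self.sleep(self.cooldown)
            self.count = 0

    def record(self) -> None:
        """Call once per fetch to advance the counter."""
        self.count += 1


# Dry run: a fake sleep that only records the requested pauses
pauses = []
throttle = RequestThrottle(max_requests=10, cooldown=900, sleep=pauses.append)
for _ in range(25):
    throttle.before_request()
    throttle.record()
print(pauses)  # [900, 900]
```

With 25 simulated fetches and a cap of 10, the cooldown triggers before the 11th and the 21st fetch, which matches the behaviour of the loop in the next cell.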

In [63]:
for season, url in BASE_URLS.items():
    for table_id in TABLE_IDS:
        table_base = strip_suffix(table_id)
        if table_base == "stats_standard":
            fname = f"df_player_stats_{season}.csv"
        else:
            fname = f"df_player_{table_base.replace('stats_', '')}_{season}.csv"
        fpath = RAW_DIR / fname

        if fpath.exists():
            print(f"Skipping existing file: {fname}")
            continue

        if request_counter >= MAX_REQUESTS:
            print(f"Request cap hit. Cooling down for {COOLDOWN_SECONDS // 60} minutes...")
            time.sleep(COOLDOWN_SECONDS)
            request_counter = 0

        try:
            print(f"Fetching: {season} | {table_id}")
            request_counter += 1  # count every attempt: failed requests also reach FBref
            df = pd.read_html(url, attrs={"id": table_id})[0]
            df.to_csv(fpath, index=False)
            print(f"Saved {fpath.name}")
        except Exception as e:
            print(f"Failed to fetch {table_id} for {season}: {e}")

        time.sleep(random.uniform(5, 10))

Skipping existing file: df_player_stats_2425.csv
Skipping existing file: df_player_shooting_2425.csv
Skipping existing file: df_player_passing_2425.csv
Skipping existing file: df_player_passing_types_2425.csv
Skipping existing file: df_player_gca_2425.csv
Skipping existing file: df_player_defense_2425.csv
Skipping existing file: df_player_possession_2425.csv
Skipping existing file: df_player_stats_2324.csv
Skipping existing file: df_player_shooting_2324.csv
Skipping existing file: df_player_passing_2324.csv
Skipping existing file: df_player_passing_types_2324.csv
Skipping existing file: df_player_gca_2324.csv
Skipping existing file: df_player_defense_2324.csv
Skipping existing file: df_player_possession_2324.csv
Skipping existing file: df_player_stats_2223.csv
Skipping existing file: df_player_shooting_2223.csv
Skipping existing file: df_player_passing_2223.csv
Skipping existing file: df_player_passing_types_2223.csv
Skipping existing file: df_player_gca_2223.csv
Skipping existing file

# Scrape Market Value Historical

- Encountered difficulties scraping market data from Transfermarkt
- Used a service called Apify to scrape it instead (it is paid but has a good free tier)
- Tested below by running the scraper in the browser and saving the resulting file
- Still needs adjustment

In [64]:
import json, pathlib, pandas as pd

# ── 1) read the file (local) ──────────────────────────────────────────────
json_path = pathlib.Path(
    "..", "data", "raw", "dataset_transfermarkt_2025-06-15_15-34-31-954.json"
)                     # <— adjust if you stored it elsewhere

with json_path.open(encoding="utf-8") as f:
    data = json.load(f)          # top level is a list with a single club dict

# ── 2) flatten the “players” list into a table ────────────────────────────
club_record = data[0]            # only one element
players_raw = club_record["players"]

df_players = pd.json_normalize(players_raw)  # one row per player
df_players


Unnamed: 0,#,Player,Date of birth/Age,Market value,Nat.
0,25.0,"[Giorgi Mamardashvili, Goalkeeper]","Sep 29, 2000 (24)",€30.00m,Georgia
1,13.0,"[Stole Dimitrievski, Goalkeeper]","Dec 25, 1993 (31)",€2.50m,"[North Macedonia, Spain]"
2,1.0,"[Jaume Doménech, Goalkeeper]","Nov 5, 1990 (34)",€400k,Spain
3,3.0,"[Cristhian Mosquera, Centre-Back]","Jun 27, 2004 (20)",€30.00m,"[Spain, Colombia]"
4,15.0,"[César Tárrega, Centre-Back]","Feb 26, 2002 (23)",€10.00m,Spain
5,24.0,"[Yarek Gasiorowski, Centre-Back]","Jan 12, 2005 (20)",€7.50m,"[Spain, Poland]"
6,4.0,"[Mouctar Diakhaby, Centre-Back]","Dec 19, 1996 (28)",€2.00m,"[Guinea, France]"
7,14.0,"[José Gayà, Left-Back]","May 25, 1995 (30)",€9.00m,Spain
8,21.0,"[Jesús Vázquez, Left-Back]","Jan 2, 2003 (22)",€3.00m,Spain
9,19.0,"[Max Aarons, Right-Back]","Jan 4, 2000 (25)",€6.00m,"[England, Jamaica]"


---

## Transfermarkt Data Scraper

- Below is the Apify scraper code that extracts Valencia player market values for the 2022, 2023 and 2024 seasons

In [None]:
"""
import os, json, requests, pandas as pd
from pathlib import Path
from dotenv import load_dotenv          # pip install python-dotenv

# ── environment ───────────────────────────────────────────────────────────────
load_dotenv()
APIFY_TOKEN = os.getenv("APIFY_TOKEN")          
if not APIFY_TOKEN:
    raise RuntimeError("Set APIFY_TOKEN first – never hard-code it in notebooks!")

ACTOR  = "curious_coder~transfermarkt"
ENDPT  = (f"https://api.apify.com/v2/acts/{ACTOR}"
          "/run-sync-get-dataset-items?token=" + APIFY_TOKEN +
          "&clean=true&format=json")

BASE_URL = ("https://www.transfermarkt.co.uk/valencia-cf/kader/verein/1049/"
            "plus/0/galerie/0?saison_id={year}")

def fetch_squad(year: int) -> pd.DataFrame:
    #Run the Transfermarkt actor for one Valencia squad year → DataFrame.
    payload = {
        "startUrls": [ { "url": BASE_URL.format(year=year) } ],  # <- corrected
        "proxyConfiguration": { "useApifyProxy": True },         # free pool only
        "maxCrawlingDepth": 0
    }

    r = requests.post(ENDPT, json=payload, timeout=180)
    if r.status_code >= 400:
        raise RuntimeError(f"{year}: HTTP {r.status_code}\n{r.text}")

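    rows = r.json()
    if not rows:
        # guard against an empty result, which would otherwise fail on rows[0] below
        raise RuntimeError(f"{year}: actor returned no records")
    if "error.type" in rows[0]:
        msg = rows[0].get("error.message", "no message")
        raise RuntimeError(f"{year}: actor error – {msg}")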

    # actor returns one club record → extract player list
    club_record    = rows[0]
    players_raw    = club_record["players"]
    df_players     = pd.json_normalize(players_raw)
    df_players["Season"] = year
    return df_players


# ── fetch three seasons & inspect ────────────────────────────────────────────
seasons   = [2022, 2023, 2024]
valencia_player_value  = pd.concat([fetch_squad(y) for y in seasons], ignore_index=True)

valencia_player_value.to_csv(RAW_DIR / "valencia_market_value_22_25.csv", index=False)
"""


In [71]:
valencia_player_value = pd.read_csv(RAW_DIR / "valencia_market_value_22_25.csv")

In [72]:
valencia_player_value.head()

Unnamed: 0,#,Player,Age,Current club,Market value,Nat.,Season,Contract
0,25.0,"['Giorgi Mamardashvili', 'Goalkeeper']",22,Valencia CF,€25.00m,Georgia,2022,
1,23.0,"['Jaume Doménech', 'Goalkeeper']",32,Valencia CF,€1.00m,Spain,2022,
2,1.0,"['Iago Herrerín', 'Goalkeeper']",35,Sestao River,€500k,Spain,2022,
3,42.0,"['Emilio Bernad', 'Goalkeeper']",23,Racing Ferrol,€300k,Spain,2022,
4,13.0,"['Cristian Rivero', 'Goalkeeper']",25,Albacete Balompié,€200k,Spain,2022,


In [73]:
javi_guerra_rows = valencia_player_value[valencia_player_value['Player'].astype(str).str.contains('Javi Guerra')]
javi_guerra_rows

Unnamed: 0,#,Player,Age,Current club,Market value,Nat.,Season,Contract
23,36.0,"['Javi Guerra', 'Central Midfield']",20,Valencia CF,€2.00m,Spain,2022,
57,8.0,"['Javi Guerra', 'Central Midfield']",21,Valencia CF,€20.00m,Spain,2023,
91,8.0,"['Javi Guerra', 'Central Midfield']",22,,€25.00m,Spain,2024,"Jun 30, 2027"


- The rows show how Javi Guerra's market value changed: €2.00m (2022), €20.00m (2023), €25.00m (2024).
- Interesting features are: Position, Market Value, Contract length
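
To use these features numerically, the `Market value` strings and the combined `Player` cell need to be parsed first. The sketch below assumes the formats visible in the output above (`€25.00m`, `€500k`, and a stringified `['Name', 'Position']` list); the helper names `parse_market_value` and `split_player_cell` are hypothetical.

```python
import ast
import re
from typing import Optional

# Matches e.g. "€25.00m", "€500k" or "€0"; the unit is optional
_VALUE_RE = re.compile(r"€\s*(\d+(?:\.\d+)?)\s*(m|k)?", re.IGNORECASE)
_MULTIPLIER = {"m": 1_000_000, "k": 1_000, "": 1}


def parse_market_value(text) -> Optional[int]:
    """Convert a market value string to whole euros, or None if it cannot be parsed."""
    match = _VALUE_RE.search(str(text))
    if not match:
        return None  # e.g. "-" or a missing value
    number = float(match.group(1))
    unit = (match.group(2) or "").lower()
    return round(number * _MULTIPLIER[unit])


def split_player_cell(cell) -> tuple:
    """Split "['Name', 'Position']" (a string after the CSV round trip) into its parts."""
    name, position = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return name, position


print(parse_market_value("€25.00m"))  # 25000000
print(parse_market_value("€500k"))    # 500000
print(split_player_cell("['Javi Guerra', 'Central Midfield']"))
# ('Javi Guerra', 'Central Midfield')
```

On the DataFrame, the same helper could be applied column-wise, for example `valencia_player_value["Market value"].map(parse_market_value)`.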

---

In [1]:
# Multi-Team Multi-Season Scraper
# Extended functionality to scrape any team data across multiple seasons


In [18]:
import ssl
import time
import re
from urllib.request import Request, urlopen
import random
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib.parse

In [22]:
def get_transfermarkt_club_url(club_name, country='DE'):
    # Format search query
    base_search_url = 'https://www.transfermarkt.de/schnellsuche/ergebnis/schnellsuche'
    query = {'query': club_name}
    search_url = f"{base_search_url}?{urllib.parse.urlencode(query)}"

    headers = {
        'User-Agent': 'Mozilla/5.0'
    }

    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find club link in search results
    club_links = soup.select('a[href*="/startseite/verein/"]')
    for link in club_links:
        href = link.get('href', '')
        if '/startseite/verein/' in href:
            return urllib.parse.urljoin("https://www.transfermarkt.com", href)

    return None  # If no match found

In [23]:
club_url = get_transfermarkt_club_url("Valencia CF")
print(club_url)

https://www.transfermarkt.com/fc-valencia/startseite/verein/1049


In [10]:
def get_team_squad_url(team_name: str, season: int) -> str:
    """Generate the English Transfermarkt squad URL for a team and season."""
    # Only Valencia CF (club ID 1049) is mapped so far
    if team_name.lower() == "valencia cf":
        return f"https://www.transfermarkt.com/fc-valencia/kader/verein/1049/saison_id/{season}/plus/1"
    # Other teams need their own slug and club ID (e.g. taken from the URL returned by
    # get_transfermarkt_club_url); reusing ID 1049 would silently return Valencia's squad
    raise NotImplementedError(f"No Transfermarkt club ID mapped for {team_name!r}")

def random_delay(min_seconds: float = 1.0, max_seconds: float = 3.0):
    """Add random delay to avoid being blocked."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
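
For teams other than Valencia CF, the slug and club ID could be derived from the URL that `get_transfermarkt_club_url` returns. This is a sketch under that assumption; `parse_club_url` and `build_squad_url` are hypothetical helpers, and the squad URL pattern is simply the one used for Valencia above.

```python
import re
from typing import Optional, Tuple

# Captures the slug and numeric club ID from a club start page URL
_CLUB_URL_RE = re.compile(r"transfermarkt\.[a-z.]+/([^/]+)/startseite/verein/(\d+)")


def parse_club_url(club_url: str) -> Optional[Tuple[str, int]]:
    """Return (slug, club_id) from a Transfermarkt club URL, or None if it does not match."""
    match = _CLUB_URL_RE.search(club_url)
    if not match:
        return None
    return match.group(1), int(match.group(2))


def build_squad_url(slug: str, club_id: int, season: int) -> str:
    """Fill the squad URL pattern with a club's slug, ID and season."""
    return (f"https://www.transfermarkt.com/{slug}/kader/verein/{club_id}"
            f"/saison_id/{season}/plus/1")


slug, club_id = parse_club_url(
    "https://www.transfermarkt.com/fc-valencia/startseite/verein/1049"
)
print(build_squad_url(slug, club_id, 2023))
# https://www.transfermarkt.com/fc-valencia/kader/verein/1049/saison_id/2023/plus/1
```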

In [11]:
def extract_age_from_cell(age_text: str) -> "int | None":
    """Extract age from the date of birth/age cell, or None if no age is found."""
    if pd.isna(age_text) or age_text == '':
        return None
    
    # Look for an age in parentheses, e.g. "Sep 29, 2000 (24)"
    age_match = re.search(r'\((\d+)\)', str(age_text))
    if age_match:
        return int(age_match.group(1))
    
    return None

def extract_nationality_from_cell(nat_cell) -> str:
    """Extract nationality from the nationality cell using flag alt attribute."""
    if pd.isna(nat_cell) or nat_cell == '':
        return 'Unknown'
    
    # Look for flaggenrahmen img tags
    if hasattr(nat_cell, 'find'):
        flag_imgs = nat_cell.find_all('img', {'class': 'flaggenrahmen'})
        if flag_imgs:
            # Get the first nationality (primary)
            alt_text = flag_imgs[0].get('alt', '')
            if alt_text:
                return alt_text
    
    return 'Unknown'

In [12]:
def scrape_team_season(team_name: str, season: int) -> pd.DataFrame:
    """Scrape team player data for a specific season from English Transfermarkt."""
    url = get_team_squad_url(team_name, season)
    
    try:
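        # NOTE: this disables TLS certificate verification for every later urlopen call in the session
        ssl._create_default_https_context = ssl._create_unverified_context
        # `headers` is the module-level dict defined in a later cell; run that cell first
        req = Request(url, headers=headers)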
        html = urlopen(req)
        
        # Parse with BeautifulSoup to get the exact table structure
        soup = BeautifulSoup(html, 'html.parser')
        
        # Find the squad table using the exact CSS selector
        squad_table = soup.select_one('div.responsive-table table.items')
        if not squad_table:
            print(f"No squad table found for {team_name} season {season}")
            return pd.DataFrame()
        
        # Find all player rows using the exact CSS selector
        player_rows = squad_table.select('tbody > tr')
        
        # Process the data
        processed_data = []
        
        for row in player_rows:
            try:
                # Extract shirt number
                number_cell = row.select_one('td.rn_nummer')
                shirt_number = number_cell.text.strip() if number_cell else None
                
                # Extract player name and profile link
                name_link = row.select_one('td.hauptlink a')
                player_name = name_link.text.strip() if name_link else 'Unknown'
                profile_url = name_link.get('href') if name_link else None
                
                # Extract player image
                player_img = row.select_one('td.hauptlink img')
                player_photo = None
                if player_img:
                    player_photo = player_img.get('data-src') or player_img.get('src')
                
                # Extract position from inline table
                posrela_cell = row.select_one('td.posrela')
                position = 'Unknown Position'
                if posrela_cell:
                    inline_table = posrela_cell.select_one('table.inline-table')
                    if inline_table:
                        position_rows = inline_table.select('tr')
                        if len(position_rows) > 1:
                            position_cell = position_rows[1].select_one('td')
                            if position_cell:
                                position = position_cell.text.strip()
                
                # Extract age from zentriert cells (find the one with age pattern)
                zentriert_cells = row.select('td.zentriert')
                age = None
                for cell in zentriert_cells:
                    if re.search(r'\(\d+\)', cell.text):
                        age = extract_age_from_cell(cell.text)
                        break
                
                # Extract nationality from flag images
                nationality = 'Unknown'
                flag_imgs = row.select('td img.flaggenrahmen')
                if flag_imgs:
                    nationality = flag_imgs[0].get('alt', 'Unknown')
                
                # Extract market value
                market_value_cell = row.select_one('td.rechts')
                market_value = '€0'
                if market_value_cell:
                    market_value_link = market_value_cell.select_one('a')
                    if market_value_link:
                        market_value = market_value_link.text.strip()
                
                # Extract contract (if available)
                contract = None
                # Look for contract info in the last zentriert cell or specific contract column
                contract_cells = row.select('td.zentriert')
                if len(contract_cells) > 2:  # Assuming contract might be in later zentriert cells
                    for cell in contract_cells[-2:]:  # Check last two zentriert cells
                        cell_text = cell.text.strip()
                        if cell_text and not re.search(r'\(\d+\)', cell_text) and not cell_text.isdigit():
                            contract = cell_text
                            break
                
                # Create player record
                player_record = {
                    'Player': [player_name, position],
                    'Age': age,
                    'Current club': team_name,
                    'Market value': market_value,
                    'Nat.': nationality,
                    'Season': season,
                    'Contract': contract,
                    'Shirt Number': shirt_number,
                    'Profile URL': profile_url,
                    'Photo URL': player_photo
                }
                
                processed_data.append(player_record)
                
            except Exception as e:
                print(f"Error processing player row: {str(e)}")
                continue
        
        result_df = pd.DataFrame(processed_data)
        print(f"Successfully scraped {len(result_df)} players for {team_name} season {season}")
        return result_df
        
    except Exception as e:
        print(f"Error scraping {team_name} season {season}: {str(e)}")
        return pd.DataFrame()

In [13]:
def scrape_team_multiple_seasons(team_name: str, min_season: int, max_season: int) -> pd.DataFrame:
    """Scrape team data for multiple seasons with random delays."""
    all_data = []
    
    for season in range(min_season, max_season + 1):
        print(f"Scraping {team_name} season {season}...")
        season_data = scrape_team_season(team_name, season)
        
        if not season_data.empty:
            all_data.append(season_data)
        
        # Add random delay to avoid being blocked
        if season < max_season:  # Don't delay after the last season
            # Draw the delay once so the printed value matches the actual wait
            delay = random.uniform(1.0, 3.0)
            print(f"Waiting {delay:.1f} seconds before next request...")
            time.sleep(delay)
    
    if all_data:
        combined_data = pd.concat(all_data, ignore_index=True)
        combined_data.index = range(len(combined_data.index))
        return combined_data
    else:
        return pd.DataFrame()

In [14]:
# Function to scrape any team with custom parameters
def scrape_any_team(team_name: str, min_season: int, max_season: int, output_filename: str = None):
    """
    Scrape any team's data across multiple seasons from English Transfermarkt.
    """
    print(f"Starting to scrape {team_name} players from season {min_season} to {max_season}")
    print("-" * 60)
    
    team_data = scrape_team_multiple_seasons(team_name, min_season, max_season)
    
    if not team_data.empty:
        print(f"\nScraped {len(team_data)} player records")
        print("\nFirst few records:")
        display(team_data.head())
        
        # Generate output filename if not provided
        if output_filename is None:
            output_filename = f"{team_name.lower().replace(' ', '_')}_players_{min_season}_{max_season}.xlsx"
        
        # Save to Excel (to_excel no longer accepts an `encoding` argument in pandas 2.x)
        team_data.to_excel(output_filename, index=False)
        print(f"\nData saved to {output_filename}")
        print(f"Total records: {len(team_data)}")
        
        return team_data
    else:
        print("No data was scraped. Please check the team name and season range.")
        return pd.DataFrame()

In [16]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'}

In [17]:
# Example usage: scrape Valencia CF from 2020 to 2024
team_name = "Valencia CF"
min_season = 2020
max_season = 2024

print(f"Starting to scrape {team_name} players from season {min_season} to {max_season}")

team_data = scrape_team_multiple_seasons(team_name, min_season, max_season)

if not team_data.empty:
    print(f"\nScraped {len(team_data)} player records")
    print("\nFirst few records:")
    display(team_data.head())
else:
    print("No data was scraped. Please check the team name and season range.")

Starting to scrape Valencia CF players from season 2020 to 2024
Scraping Valencia CF season 2020...
Successfully scraped 35 players for Valencia CF season 2020
Waiting 2.0 seconds before next request...
Scraping Valencia CF season 2021...
Successfully scraped 44 players for Valencia CF season 2021
Waiting 1.9 seconds before next request...
Scraping Valencia CF season 2022...
Successfully scraped 40 players for Valencia CF season 2022
Waiting 2.8 seconds before next request...
Scraping Valencia CF season 2023...
Successfully scraped 36 players for Valencia CF season 2023
Waiting 1.8 seconds before next request...
Scraping Valencia CF season 2024...
Successfully scraped 26 players for Valencia CF season 2024

Scraped 181 player records

First few records:


Unnamed: 0,Player,Age,Current club,Market value,Nat.,Season,Contract,Shirt Number,Profile URL,Photo URL
0,"[Jasper Cillessen, Goalkeeper]",32,Valencia CF,€5.00m,Netherlands,2020,"Jul 1, 2019",,/jasper-cillessen/profil/spieler/146227,
1,"[Jaume Doménech, Goalkeeper]",30,Valencia CF,€4.00m,Spain,2020,"Jul 1, 2015",,/jaume-domenech/profil/spieler/227805,
2,"[Cristian Rivero, Goalkeeper]",23,Valencia CF,€300k,Spain,2020,"Aug 1, 2020",,/cristian-rivero/profil/spieler/398131,
3,"[Unai Etxebarria, Goalkeeper]",24,Valencia CF,€150k,Spain,2020,,,/unai-etxebarria/profil/spieler/288376,
4,"[Gabriel Paulista, Centre-Back]",30,Valencia CF,€15.00m,Brazil,2020,"Aug 18, 2017",,/gabriel-paulista/profil/spieler/149498,


NOTE: The Shirt Number, Photo URL and Profile URL columns are not needed for our analysis, so we drop them.

In [None]:
team_data.drop(columns=['Shirt Number', 'Photo URL', 'Profile URL'], inplace=True)

In [None]:
team_data.head()

Unnamed: 0,Player,Age,Current club,Market value,Nat.,Season,Contract
0,"[Jasper Cillessen, Goalkeeper]",32,Valencia CF,€5.00m,Netherlands,2020,"Jul 1, 2019"
1,"[Jaume Doménech, Goalkeeper]",30,Valencia CF,€4.00m,Spain,2020,"Jul 1, 2015"
2,"[Cristian Rivero, Goalkeeper]",23,Valencia CF,€300k,Spain,2020,"Aug 1, 2020"
3,"[Unai Etxebarria, Goalkeeper]",24,Valencia CF,€150k,Spain,2020,
4,"[Gabriel Paulista, Centre-Back]",30,Valencia CF,€15.00m,Brazil,2020,"Aug 18, 2017"


In [None]:
# Save to Excel (removed encoding parameter)
output_filename = f"data/{team_name.lower().replace(' ', '_')}_players_{min_season}_{max_season}.xlsx"
team_data.to_excel(output_filename, index=False)
print(f"\nData saved to {output_filename}")
print(f"Total records: {len(team_data)}")


Data saved to data/valencia_cf_players_2020_2024.xlsx
Total records: 181


NOTE: This is our raw data. We save it here before cleaning it and merging it with the other data sources in the next pipeline step.