# Extracting Player Information for Swiss Football Players

## General Information
**Objective**  
This Jupyter Notebook automates the extraction of player information from Transfermarkt pages. The collected data includes personal details such as age, market value, position, and height, and enriches the previously scraped lineup data for deeper analysis.

**Scope**  
This notebook scrapes player data and saves the raw CSV files in the `data/raw` subfolder. The collected data is cleaned, processed, and merged in later steps for advanced analysis.

**Methodology**  
The scraping process involves the following steps:
- Defining URLs for data extraction.
- Navigating Transfermarkt pages using web scraping techniques.
- Collecting and structuring the scraped data.
- Saving the structured data in CSV format for downstream analysis.
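
The first and last steps of this list can be sketched without a browser. The snippet below is a minimal illustration, not the notebook's actual implementation: the helper names `build_squad_urls` and `raw_csv_path` are hypothetical, while the URL pattern and the dated file name follow the squad-scraping cell of this notebook.

```python
from datetime import date
from itertools import product
from pathlib import Path

# URL pattern as used in the squad-scraping cell of this notebook.
SQUAD_URL = (
    "https://www.transfermarkt.ch/schweiz-u15/kader/verein/"
    "{team_id}/plus/1/galerie/0?saison_id={season}"
)


def build_squad_urls(team_ids, seasons):
    """Return one (team_id, season, url) tuple per team/season combination."""
    return [
        (team_id, season, SQUAD_URL.format(team_id=team_id, season=season))
        for team_id, season in product(team_ids, seasons)
    ]


def raw_csv_path(stem, day=None, folder="../data/raw"):
    """Build a dated output path inside the raw-data folder."""
    day = day or date.today()
    return Path(folder) / f"{stem}_{day:%Y-%m-%d}.csv"


# Two team IDs and two seasons give four pages to visit.
jobs = build_squad_urls(["3384", "9534"], [2024, 2023])
print(len(jobs))
print(jobs[0][2])
print(raw_csv_path("players_NT", date(2024, 1, 31)).as_posix())
```

Building the full list of pages up front makes it easy to log progress or to resume a run that was interrupted partway through.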

**Usage**  
The data extracted will contribute to a centralized dataset to facilitate insights into player statistics, transfers, and development trends. This structured dataset will be used in further stages of the project.

---

## Setting Up the Chrome WebDriver
To scrape data from Transfermarkt pages, a compatible Chrome WebDriver must be installed.  
- **Installation**: Download the WebDriver from [Google Chrome Labs](https://googlechromelabs.github.io/chrome-for-testing/).  
- **Check Version**: Find your Chrome version in `Settings > About Google Chrome`.
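
Before starting a long scraping run, it can help to confirm that a `chromedriver` executable is actually discoverable. The sketch below uses only the standard library; the helper name `find_chromedriver` is hypothetical, and the example folder is the one configured in the import cell of this notebook. Recent Selenium releases (4.6 and later) also ship Selenium Manager, which can usually locate or download a matching driver on its own, so a manual download may not be needed in every setup.

```python
import os
import shutil


def find_chromedriver(extra_dirs=()):
    """Return the path of a chromedriver executable, or None if none is found.

    Folders in extra_dirs are searched before the regular PATH entries.
    """
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which("chromedriver", path=search_path)


location = find_chromedriver(
    extra_dirs=["C:/Users/moreno/Downloads/chromedriver-win64"]  # adjust for your system
)
print(location or "chromedriver not found; add its folder to PATH")
```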


In [1]:
# Install the required WebDriver manager if not already installed
#!pip install webdriver-manager

In [None]:
# Core Python Libraries
import os
import re
import time
from datetime import datetime

# Data Manipulation Libraries
import pandas as pd

# Web Scraping Libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException

# OCR and Image Processing Libraries (for Market Value extraction)
from PIL import Image, ImageEnhance, ImageFilter
import pytesseract

# Automation Libraries
from selenium.webdriver.common.action_chains import ActionChains

# Set the path to Tesseract OCR executable (adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Replace with your path

# Set the path to your ChromeDriver. Adjust the path for your system.
chrome_driver_directory = "C:/Users/moreno/Downloads/chromedriver-win64"  # Replace with your path
os.environ["PATH"] += os.pathsep + chrome_driver_directory


## Scraping Players Information


**Objective**  
Extract player information for various Swiss football teams from Transfermarkt, including personal data, position, and performance metrics.

**Scope**  
Scrape and save raw CSV files for players across multiple junior levels and seasons to create a comprehensive dataset.


In [None]:
# Define junior levels and years for scraping
junior_levels = ['28436', '28155', '23140', '22998', '19429', '9534', '3384']  # Junior team identifiers
years = [2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009]  # Seasons

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Create an empty DataFrame to store player data
players = pd.DataFrame()

# Iterate over each junior level and season
for junior_level in junior_levels:
    for year in years:
        # Construct the URL for the current junior level and year
        # (the 'schweiz-u15' slug stays fixed; the code relies on the numeric ID to select the team)
        url = f'https://www.transfermarkt.ch/schweiz-u15/kader/verein/{junior_level}/plus/1/galerie/0?saison_id={year}'
        driver.get(url)  # Navigate to the webpage
        time.sleep(2)    # Allow the page to load

        # Handle cookie consent iframe if present
        try:
            wait = WebDriverWait(driver, 2.5)
            iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953386")))
            driver.switch_to.frame(iframe)
            accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
            accept_button.click()
            driver.switch_to.default_content()
        except Exception as e:
            print(f"No iframe found or error handling iframe for {year}: {e}")

        # Extract data from the table
        try:
            table = driver.find_element(By.XPATH, '//*[@id="yw1"]/table')  # Find the main data table
            rows = table.find_elements(By.TAG_NAME, "tr")  # Retrieve all rows
            table_data = [[td.text for td in row.find_elements(By.TAG_NAME, "td")] for row in rows if len(row.find_elements(By.TAG_NAME, "td")) > 1]
            
            # If no data is found, raise an error
            if not table_data:
                raise ValueError("No data found for this year.")
            
            # Convert table data to DataFrame and clean it
            df = pd.DataFrame(table_data)
            df.drop(df.columns[[0, 1, 2, 6, 10]], axis=1, inplace=True)  # Drop unnecessary columns
            df.columns = ['Name', 'Position', 'Birthdate', 'Height', 'Foot', 'GamesPlayed', 'Debut', 'Value']  # Assign column names
            df.dropna(how='all', inplace=True)  # Remove empty rows
            df['Year'] = year
            df['Category'] = junior_level
            players = pd.concat([players, df], ignore_index=True)  # Append to main DataFrame
        except Exception as e:
            print(f"Failed to process data for {year} at {junior_level}: {e}")

# Save the collected data to a CSV file
current_date = datetime.now().strftime('%Y-%m-%d')
filename = f'../data/raw/players_NT_{current_date}.csv'
players.to_csv(filename, index=False)

# Close the WebDriver after scraping
driver.quit()

# Print a success message and a preview of the data
print("Web scraping successfully completed")
print(players.head())


## Extracting Player IDs from Transfermarkt  

**Objective**  
This section retrieves unique player IDs from Transfermarkt. These IDs are essential for accessing individual player pages for further data extraction.

**Scope**  
Search each player by name and age to find the corresponding Transfermarkt profile and extract the unique Player ID.

**Methodology** 

- Load `players_unique.csv`, which was cleaned in an earlier step (see the data preparation script).
- Search each player's name on Transfermarkt.
- Match names and ages to find the correct profile.
- Extract and save the Player IDs.
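
The matching logic in these steps can be illustrated on plain dictionaries, independent of Selenium. This is a minimal sketch under stated assumptions: the function name `pick_profile`, the dictionary keys, and the sample search results are all made up for illustration. It mirrors the two passes used in the cell below: a strict pass on name, age, and nationality, followed by an age-only fallback.

```python
def pick_profile(results, name, age, nationality="Schweiz"):
    """Return the best-matching search result, or None if nothing fits.

    Pass 1 requires the exact name, the same age, and the expected nationality.
    Pass 2 falls back to the first result with the same age.
    """
    for result in results:
        if (
            result["name"] == name
            and result["age"] == age
            and nationality in result["nationalities"]
        ):
            return result
    for result in results:
        if result["age"] == age:
            return result
    return None


# Made-up search results, for illustration only.
results = [
    {"name": "Max Muster", "age": 31, "nationalities": ["Deutschland"], "href": "/spieler/1"},
    {"name": "Max Muster", "age": 19, "nationalities": ["Schweiz", "Italien"], "href": "/spieler/2"},
    {"name": "M. Muster", "age": 24, "nationalities": ["Schweiz"], "href": "/spieler/3"},
]

print(pick_profile(results, "Max Muster", 19)["href"])         # strict match
print(pick_profile(results, "Maximilian Muster", 24)["href"])  # age-only fallback
print(pick_profile(results, "Max Muster", 40))                 # no match at all
```

The age-only fallback is permissive: it accepts the first result with the right age even if the name differs, so IDs found this way are worth spot-checking.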

In [None]:
# Load the CSV file containing unique player data
df = pd.read_csv('../data/cleaned/players_unique.csv', encoding='utf-8')

# Set up the Chrome WebDriver
driver = webdriver.Chrome()

# Function to handle cookie consent iframe
def handle_iframe():
    try:
        # Wait for the iframe to load and switch to it
        iframe = WebDriverWait(driver, 1.5).until(
            EC.presence_of_element_located((By.ID, "sp_message_iframe_953386"))
        )
        driver.switch_to.frame(iframe)
        
        # Wait for the "accept" button in the iframe and click it
        accept_button = WebDriverWait(driver, 1.3).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]"))
        )
        accept_button.click()
        
        # Switch back to the default content after handling the iframe
        driver.switch_to.default_content()
        print("Iframe handled successfully.")
    except Exception as e:
        # If iframe is not found or an error occurs, print the exception
        print(f"No iframe found or error handling iframe: {e}")
        driver.switch_to.default_content()

# Navigate to the Transfermarkt website and handle the initial cookie consent iframe
driver.get("https://www.transfermarkt.ch/")
handle_iframe()

# Iterate over each player in the DataFrame
for index, row in df.iterrows():
    # Reset the state by navigating back to the main Transfermarkt page
    driver.get("https://www.transfermarkt.ch/")
    handle_iframe()

    # Locate the search bar, clear it, and enter the player's name
    search_bar = WebDriverWait(driver, 1.3).until(
        EC.visibility_of_element_located((By.XPATH, "//*[@id='schnellsuche']/input"))
    )
    search_bar.clear()  # Clear any existing text in the search bar
    search_bar.send_keys(row['Name'])  # Enter the player's name
    search_bar.submit()  # Submit the search query
    
    # Wait for the search results to load and process them
    try:
        # Locate all rows in the search results table
        results = WebDriverWait(driver, 1.3).until(
            EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='yw0']/table/tbody/tr"))
        )

        player_found = False  # Flag to track if the player is found
        
        # First attempt: Match the name, age, and nationality
        for result in results:
            try:
                # Extract the name from the search result
                name_in_result = result.find_element(By.XPATH, "./td[1]/table/tbody/tr[1]/td[2]/a").text
                if name_in_result == row['Name']:
                    # Extract and validate the player's age
                    age_text = result.find_element(By.XPATH, "./td[4]").text
                    age = int(age_text) if age_text.isdigit() else None

                    if age == row['Age']:
                        # Check for nationality (e.g., "Schweiz")
                        nationality_elements = result.find_elements(By.XPATH, "./td[5]/img")
                        nationalities = [img.get_attribute("title") for img in nationality_elements]
                        
                        if "Schweiz" in nationalities:
                            # Click the player's name and extract the Player ID from the URL
                            result.find_element(By.XPATH, "./td[1]/table/tbody/tr[1]/td[2]/a").click()
                            player_id = driver.current_url.split('/')[-1]
                            df.at[index, 'Player ID'] = player_id
                            player_found = True  # Update the flag
                            break
            except Exception as e:
                print(f"Error accessing details for {row['Name']} in one of the results: {e}")

        # Second attempt: Match on age only if the first attempt (name, age, nationality) found nothing
        if not player_found:
            print(f"No exact name match for {row['Name']}, attempting age-based match.")
            for result in results:
                try:
                    # Extract and validate the player's age
                    age_text = result.find_element(By.XPATH, "./td[4]").text
                    age = int(age_text) if age_text.isdigit() else None

                    if age == row['Age']:
                        # Click the player's name and extract the Player ID from the URL
                        result.find_element(By.XPATH, "./td[1]/table/tbody/tr[1]/td[2]/a").click()
                        player_id = driver.current_url.split('/')[-1]
                        df.at[index, 'Player ID'] = player_id
                        player_found = True  # Update the flag
                        break
                except Exception as e:
                    print(f"Error accessing details for {row['Name']} in one of the results: {e}")

        # If the player is still not found, log it and mark as "Not Found"
        if not player_found:
            print(f"Player {row['Name']} with age {row['Age']} not found.")
            df.at[index, 'Player ID'] = 'Not Found'

    except Exception as e:
        # Handle errors during the search process and mark the Player ID as "Not Found"
        print(f"Error processing {row['Name']}: {e}")
        df.at[index, 'Player ID'] = 'Not Found'

# Save the updated DataFrame with the extracted Player IDs to a CSV file
df.to_csv('../data/cleaned/players_incl_ID.csv', index=False)

# Close the WebDriver after processing
driver.quit()

print("Player IDs added successfully.")


## Scraping Player Stats Per National League and Season  

**Objective**  
Extract detailed player statistics, including games played, goals scored, assists, and other metrics, from Transfermarkt pages.

**Scope**  
Iterate through player profiles and extract data for each season and competition.

**Methodology** 

- Use Player IDs to access individual pages.
- Extract statistics for each season and competition type (e.g., national leagues, cups).
- Save the data for further analysis.
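
Because the ID-matching step writes `'Not Found'` for players it could not resolve, it may be worth filtering the IDs before building page URLs. The sketch below is one possible way to do this, not part of the original workflow: the helper names `valid_player_ids` and `stats_url` are hypothetical, and the URL pattern is copied from the statistics cell below.

```python
# URL pattern as used in the statistics cell of this notebook.
STATS_URL = (
    "https://www.transfermarkt.ch/yanick-brecher/"
    "detaillierteleistungsdaten/spieler/{player_id}/plus/1"
)


def valid_player_ids(raw_ids):
    """Keep purely numeric IDs, de-duplicated in first-seen order.

    Placeholders such as 'Not Found' are dropped.
    """
    seen = []
    for raw in raw_ids:
        text = str(raw).strip()
        if text.isdigit() and text not in seen:
            seen.append(text)
    return seen


def stats_url(player_id):
    """Build the detailed-statistics URL for one player ID."""
    return STATS_URL.format(player_id=player_id)


# Placeholder IDs, for illustration only.
ids = valid_player_ids(["123", "Not Found", "123", " 456 ", 789])
print(ids)
print(stats_url(ids[0]))
```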

In [None]:
# Load the two CSV files
players_unique = pd.read_csv('../data/cleaned/players_cleaned.csv')
players_incl_ID = pd.read_csv('../data/cleaned/players_incl_ID.csv')

# Sort players_unique by date or another relevant column if available
# Here, we assume sorting by index to get the last occurrence if there's no date column
players_unique_sorted = players_unique.sort_index()

# Drop duplicates in players_unique, keeping only the last occurrence of each 'Name'
players_unique_last = players_unique_sorted.drop_duplicates(subset='Name', keep='last')

# Merge the cleaned dataframe with players_incl_ID on 'Name'
merged_df = players_incl_ID.merge(players_unique_last[['Name', 'Position']], on='Name', how='left')

# Overwrite players_incl_ID.csv with the merged result (now including 'Position')
merged_df.to_csv('../data/cleaned/players_incl_ID.csv', index=False)

# Display the merged dataframe to verify
print(merged_df.head())



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from datetime import datetime
import pandas as pd
import time

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Load the CSV file containing Player IDs and their respective Positions
df_players = pd.read_csv('../data/cleaned/players_incl_ID.csv')
player_ids = df_players['Player ID'].unique()  # Extract unique Player IDs

# Prepare an empty DataFrame to collect all player data
all_players_data = pd.DataFrame()

# Function to normalize season labels to the 'YY-YY' format
def correct_season_post_processing(season):
    try:
        # Handle seasons that were auto-converted to dates (e.g., '01-Feb' for season 01/02):
        # the day holds the start year and the month holds the end year
        if '-' in season and any(char.isalpha() for char in season):
            parsed_date = datetime.strptime(season, "%d-%b")
            return f"{parsed_date.day:02d}-{parsed_date.month:02d}"  # '01-Feb' -> '01-02'
        # Handle cases like '00/01' and correct to '00-01'
        if '/' in season and len(season.split('/')) == 2:
            return season.replace('/', '-')
    except (TypeError, ValueError):
        pass
    return season  # Return the original if no changes are required

# Define XPaths and types for tables containing the relevant statistics
table_info = [
    ("//*[@id='yw1']/table/tbody", "Nationale Ligen"),  # National leagues table
    ("//*[@id='yw2']/table/tbody", "Nationale Pokalwettbewerbe"),  # National cups table
    ("//*[@id='yw3']/table/tbody", "Internationale Pokalwettbewerbe")  # International competitions table
]

# Iterate over each player in the DataFrame
for index, row in df_players.iterrows():
    player_id = row['Player ID']  # Extract the Player ID
    position = row['Position']  # Get the player's position (e.g., Goalkeeper, Outfield player)
    
    # Construct the URL to the player's detailed statistics page
    # (the name slug stays fixed; the code relies on the Player ID to select the page)
    player_url = f"https://www.transfermarkt.ch/yanick-brecher/detaillierteleistungsdaten/spieler/{player_id}/plus/1"
    driver.get(player_url)  # Open the player's page
    time.sleep(3)  # Allow the page to load fully
    
    # Handle cookie consent if required
    try:
        wait = WebDriverWait(driver, 2.5)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953386")))
        driver.switch_to.frame(iframe)
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'message-component message-button no-children')]")))
        accept_button.click()
        driver.switch_to.default_content()
    except Exception as e:
        print(f"Cookie consent not handled: {e}")

    # Iterate over each type of table (national league, cups, international)
    for xpath, table_type in table_info:
        try:
            wait = WebDriverWait(driver, 2.5)
            table = wait.until(EC.presence_of_element_located((By.XPATH, xpath)))  # Locate the table
            rows = table.find_elements(By.TAG_NAME, "tr")  # Get all rows in the table
            
            player_data = []  # Temporary list to store data for this player and table
            for table_row in rows:  # Separate name so the outer loop's 'row' is not shadowed
                cols = table_row.find_elements(By.TAG_NAME, "td")  # Get all columns in the row
                if cols and len(cols) > 10:  # Ensure sufficient columns are present
                    # Extract and process relevant data from the row
                    season = cols[0].text.strip()
                    season = correct_season_post_processing(season)  # Correct the season format
                    competition = cols[2].find_element(By.TAG_NAME, "a").text.strip()  # Competition name
                    club_img = cols[3].find_element(By.TAG_NAME, "img")  # Club information
                    club = club_img.get_attribute("alt")  # Extract club name
                    
                    # Extract statistics based on the player's position
                    if position == "Torwart":  # Goalkeeper-specific stats
                        games_played = cols[4].text.strip()
                        insgesamt = cols[5].text.strip()
                        own_goals = cols[6].text.strip()
                        substituted_on = cols[7].text.strip()
                        substituted_off = cols[8].text.strip()
                        yellow_cards = cols[9].text.strip()
                        yellow_red = cols[10].text.strip()
                        red_cards = cols[11].text.strip()
                        goals_conceded = cols[12].text.strip()
                        clean_sheets = cols[13].text.strip()
                        played_minutes = cols[14].text.strip().replace("'", "")
                        
                        # Append the extracted data to the list
                        player_data.append([
                            player_id, season, competition, club, games_played, insgesamt, own_goals, substituted_on, 
                            substituted_off, yellow_cards, yellow_red, red_cards, goals_conceded, clean_sheets, played_minutes, table_type
                        ])
                    else:  # Outfield player-specific stats
                        games_played = cols[4].text.strip()
                        goals = cols[5].text.strip()
                        assists = cols[6].text.strip()
                        own_goals = cols[7].text.strip()
                        substituted_on = cols[8].text.strip()
                        substituted_off = cols[9].text.strip()
                        yellow_cards = cols[10].text.strip()
                        yellow_red = cols[11].text.strip()
                        red_cards = cols[12].text.strip()
                        penalty_goals = cols[13].text.strip()
                        minutes_per_goal = cols[14].text.strip()
                        played_minutes = cols[15].text.strip().replace("'", "")
                        
                        # Append the extracted data to the list
                        player_data.append([
                            player_id, season, competition, club, games_played, goals, assists, own_goals, substituted_on, 
                            substituted_off, yellow_cards, yellow_red, red_cards, penalty_goals, minutes_per_goal, played_minutes, table_type
                        ])
            
            # Append the player's data to the main DataFrame
            if player_data:
                if position == "Torwart":  # Define goalkeeper-specific columns
                    columns = [
                        'Player ID', 'Season', 'Competition', 'Club', 'Games Played', 'insgesamt', 'Own Goals', 
                        'Substituted On', 'Substituted Off', 'Yellow Cards', 'Yellow Red', 'Red Cards', 'Goals Conceded', 
                        'Clean Sheets', 'Played Minutes', 'Type'
                    ]
                else:  # Define outfield player-specific columns
                    columns = [
                        'Player ID', 'Season', 'Competition', 'Club', 'Games Played', 'Goals', 'Assists', 'Own Goals', 
                        'Substituted On', 'Substituted Off', 'Yellow Cards', 'Yellow Red', 'Red Cards', 'Penalty Goals', 
                        'Minutes per Goal', 'Played Minutes', 'Type'
                    ]
                df_temp = pd.DataFrame(player_data, columns=columns)
                
                # Ensure consistent structure by filling missing columns with 0
                all_columns = [
                    'Player ID', 'Season', 'Competition', 'Club', 'Games Played', 'insgesamt', 'Goals', 'Assists', 'Own Goals', 
                    'Substituted On', 'Substituted Off', 'Yellow Cards', 'Yellow Red', 'Red Cards', 'Penalty Goals', 
                    'Minutes per Goal', 'Goals Conceded', 'Clean Sheets', 'Played Minutes', 'Type'
                ]
                df_temp = df_temp.reindex(columns=all_columns, fill_value=0)
                
                all_players_data = pd.concat([all_players_data, df_temp], ignore_index=True)
                
        except Exception as e:
            print(f"Failed to extract data for player ID {player_id} from table {table_type}: {e}")

# Close the WebDriver after the scraping is complete
driver.quit()

# Apply season correction to the collected data
all_players_data['Season'] = all_players_data['Season'].apply(correct_season_post_processing)

# Save the final DataFrame to a CSV file
all_players_data.to_csv('../data/raw/players_club_stats.csv', index=False)

# Print a success message and preview the data
print("Web scraping successfully completed")
print(all_players_data.head())



## Scraping National Team Call-Ups and Goals  

**Objective**  
Collect data on national team call-ups and goals scored for each player.

**Scope**  
Use Player IDs to extract data on matches, call-ups, and statistics for different national team levels.

**Methodology**  

- Access national team statistics pages using Player IDs.
- Extract data on call-ups, goals scored, and matches played.
- Save the data for further use.
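
The row-extraction step can be sketched on plain lists of cell texts, which makes the column positions easier to check than inside the Selenium loop. This is a simplified illustration: the function name `parse_national_team_rows` and the sample values are hypothetical, while the column indices, the six-column threshold, and the "not found" placeholder follow the cell below.

```python
COLUMNS = ["Player ID", "National Team", "Debut Date", "Games Played", "Goals Scored"]


def parse_national_team_rows(player_id, rows):
    """Turn table rows into records matching COLUMNS.

    rows is a list of rows, each a list of cell texts, or None when no
    table could be located on the page.
    """
    if rows is None:
        # Same placeholder the scraping cell writes when no table is found.
        return [[player_id] + ["not found"] * 4]
    return [
        [player_id, cells[2], cells[3], cells[4], cells[5]]
        for cells in rows
        if len(cells) >= 6  # skip header and spacer rows
    ]


# Made-up cell texts, for illustration only.
sample_rows = [
    ["", ""],
    ["", "", "Schweiz U21", "01.01.2020", "5", "2"],
]
print(parse_national_team_rows("123", sample_rows))
print(parse_national_team_rows("456", None))
```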

In [None]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Load the CSV file containing Player IDs
df_players = pd.read_csv('../data/cleaned/players_incl_ID.csv')  # Read the player data
player_ids = df_players['Player ID'].unique()  # Extract unique Player IDs

# Prepare an empty DataFrame to collect all player data
all_players_data = pd.DataFrame()

# Iterate over each Player ID to scrape national team data
for player_id in player_ids:
    # Construct the URL for the player's national team page
    # (the name slug stays fixed; the code relies on the Player ID to select the page)
    player_url = f"https://www.transfermarkt.ch/johan-vonlanthen/nationalmannschaft/spieler/{player_id}"
    
    # Navigate to the player's national team page
    driver.get(player_url)
    time.sleep(3)  # Allow the page to load fully
    
    # Handle cookie consent if needed
    try:
        wait = WebDriverWait(driver, 1)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953386")))  # Locate the cookie iframe
        driver.switch_to.frame(iframe)  # Switch to the iframe
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'message-component message-button no-children')]")))  # Locate and click the accept button
        accept_button.click()
        driver.switch_to.default_content()  # Switch back to the main content
    except Exception as e:
        print(f"Cookie consent not handled: {e}")  # Print an error message if cookie consent handling fails

    # Initialize a list to store the player's national team data
    player_data = []
    try:
        # Attempt to locate the primary table containing national team data
        wait = WebDriverWait(driver, 2)
        table = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='tm-main']/div[2]/div[1]/div[1]/table")))
    except Exception:
        try:
            # If the first XPath fails, try an alternative XPath for the table
            table = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='tm-main']/div[1]/div[1]/div[1]/table")))
        except Exception:
            # If both attempts fail, log the issue and add a "not found" entry
            print(f"No data found for player ID {player_id}. Adding 'not found' entry.")
            player_data.append([player_id, "not found", "not found", "not found", "not found"])

    # Only parse the table if it was successfully located
    if not player_data:  # Ensure no "not found" entries before processing the table
        try:
            rows = table.find_elements(By.TAG_NAME, "tr")  # Locate all rows in the table
            for row in rows:
                cols = row.find_elements(By.TAG_NAME, "td")  # Get all columns in the row
                if cols and len(cols) >= 6:  # Ensure the row has enough columns
                    # Extract relevant data from the columns
                    national_team = cols[2].text  # Name of the national team
                    debut_date = cols[3].text  # Date of the player's debut
                    games_played = cols[4].text  # Total games played
                    goals_scored = cols[5].text  # Total goals scored
                    
                    # Append the extracted data to the player_data list
                    player_data.append([player_id, national_team, debut_date, games_played, goals_scored])
        except Exception as e:
            # Log any errors encountered during table parsing
            print(f"Failed to parse data for player ID {player_id}: {e}")
            player_data.append([player_id, "not found", "not found", "not found", "not found"])

    # Add the extracted data for the player to the main DataFrame
    if player_data:
        df_temp = pd.DataFrame(player_data, columns=['Player ID', 'National Team', 'Debut Date', 'Games Played', 'Goals Scored'])
        all_players_data = pd.concat([all_players_data, df_temp], ignore_index=True)

# Save the collected data to a CSV file
all_players_data.to_csv('../data/raw/players_NT_stats.csv', index=False)

# Close the WebDriver after scraping is complete
driver.quit()

# Print a success message and show the first few rows of the collected data
print("Web scraping successfully completed")
print(all_players_data.head())


## Extracting National Team IDs  
**Objective**  
Retrieve unique IDs for national teams associated with players, enabling detailed analysis of national team statistics.

**Scope**  
Identify and extract National Team IDs from Transfermarkt using team names and search functionalities.

**Methodology**

- Use team names from the player data to search Transfermarkt.
- Locate the national team's page and extract its unique ID.
- Handle multiple attempts to find IDs for missing teams.
- Save the IDs alongside the player data for further analysis.
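
The cells in this notebook read an ID as the last path segment of the current URL (`driver.current_url.split('/')[-1]`), which only works when the ID happens to be the final segment. A slightly more defensive alternative is sketched below. It is an illustration, not the notebook's method: the helper name `extract_id` is hypothetical, the URL shapes are the ones that appear elsewhere in this notebook, and the player ID `12345` is a placeholder.

```python
import re
from urllib.parse import urlparse


def extract_id(url, kind="verein"):
    """Return the numeric ID that follows /<kind>/ in the URL path, or None."""
    path = urlparse(url).path  # drops the query string and fragment
    match = re.search(rf"/{re.escape(kind)}/(\d+)(?:/|$)", path)
    return match.group(1) if match else None


team_url = "https://www.transfermarkt.ch/schweiz-u15/kader/verein/28436/plus/1/galerie/0?saison_id=2024"
player_url = "https://www.transfermarkt.ch/yanick-brecher/detaillierteleistungsdaten/spieler/12345/plus/1"

print(extract_id(team_url))                    # team ID after /verein/
print(extract_id(player_url, kind="spieler"))  # player ID after /spieler/
print(player_url.split("/")[-1])               # naive approach picks up the trailing '1'
```

Returning `None` instead of an arbitrary segment makes failed lookups easy to tell apart from valid IDs.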

In [None]:
# Load the CSV file containing player national team data
df = pd.read_csv('../data/raw/players_NT_stats.csv')

# Extract unique national team names
unique_teams = df['National Team'].unique()

# Set up the Chrome WebDriver
driver = webdriver.Chrome()

# Function to handle cookie consent iframe
def handle_iframe():
    try:
        # Wait for the iframe to load and switch to it
        iframe = WebDriverWait(driver, 1.5).until(
            EC.presence_of_element_located((By.ID, "sp_message_iframe_953386"))
        )
        driver.switch_to.frame(iframe)
        # Wait for the "accept" button to become clickable and click it
        accept_button = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]"))
        )
        accept_button.click()
        # Switch back to the main page content
        driver.switch_to.default_content()
        print("Iframe handled successfully.")
    except Exception as e:
        # Log if iframe handling fails
        print(f"No iframe found or error handling iframe: {e}")
        driver.switch_to.default_content()

# Navigate to Transfermarkt and handle the initial cookie consent
driver.get("https://www.transfermarkt.ch/")
handle_iframe()

# Dictionary to store mappings of national team names to their IDs
team_id_dict = {}

# First attempt to find National Team IDs
for team in unique_teams:
    # Navigate back to the Transfermarkt homepage
    driver.get("https://www.transfermarkt.ch/")
    handle_iframe()

    # Locate and interact with the search bar
    search_bar = WebDriverWait(driver, 2).until(
        EC.visibility_of_element_located((By.XPATH, "//*[@id='schnellsuche']/input"))
    )
    search_bar.clear()  # Clear any existing text in the search bar
    search_bar.send_keys(team)  # Enter the team name
    search_bar.submit()  # Submit the search query
    
    try:
        # Wait for search results to load
        rows = WebDriverWait(driver, 2).until(
            EC.presence_of_all_elements_located((By.XPATH, "//*[@id='yw0']/table/tbody/tr"))
        )
        
        found = False  # Flag to track if the team was found
        for i, row in enumerate(rows, start=1):
            # Construct XPath to find team names in the search results
            team_name_xpath = f"//*[@id='yw0']/table/tbody/tr[{i}]/td[2]/table/tbody/tr[1]/td/a"
            team_name_element = driver.find_element(By.XPATH, team_name_xpath)
            # Match team name (case-insensitive) and navigate to its page
            if team_name_element.text.strip().lower() == team.strip().lower():
                team_name_element.click()
                # Extract the National Team ID from the URL
                team_id = driver.current_url.split('/')[-1]
                team_id_dict[team] = team_id
                found = True
                break
        
        if not found:
            # If the team is not found in the first search results, mark as "Not Found"
            team_id_dict[team] = 'Not Found'
            print(f"{team} not found in first search results.")

    except Exception as e:
        # Handle errors during search processing
        print(f"Error processing {team}: {e}")
        team_id_dict[team] = 'Not Found'

# Map the found IDs back to the DataFrame
df['National Team ID'] = df['National Team'].map(team_id_dict)
# Save an intermediate copy after the first attempt
# (note: this copy goes to data/cleaned, while the later attempts write to data/raw)
df.to_csv('../data/cleaned/players_NT_stats_incl_ID.csv', index=False)

# Second attempt for teams not found in the first round
not_found_teams = df[df['National Team ID'] == 'Not Found']['National Team'].unique()

for team in not_found_teams:
    driver.get("https://www.transfermarkt.ch/")
    handle_iframe()

    search_bar = WebDriverWait(driver, 2).until(
        EC.visibility_of_element_located((By.XPATH, "//*[@id='schnellsuche']/input"))
    )
    search_bar.clear()
    search_bar.send_keys(team)
    search_bar.submit()
    
    try:
        rows = WebDriverWait(driver, 2).until(
            EC.presence_of_all_elements_located((By.XPATH, "//*[@id='yw2']/table/tbody/tr"))
        )
        
        found = False
        for i, row in enumerate(rows, start=1):
            team_name_xpath = f"//*[@id='yw2']/table/tbody/tr[{i}]/td[2]/table/tbody/tr[1]/td/a"
            team_name_element = driver.find_element(By.XPATH, team_name_xpath)
            if team_name_element.text.strip().lower() == team.strip().lower():
                team_name_element.click()
                team_id = driver.current_url.split('/')[-1]
                df.loc[df['National Team'] == team, 'National Team ID'] = team_id
                found = True
                break
        
        if not found:
            print(f"{team} not found in second search results.")

    except Exception as e:
        print(f"Error processing {team} in second attempt: {e}")

df.to_csv('../data/raw/players_NT_stats_incl_ID.csv', index=False)

# Third attempt for teams still not found
not_found_teams = df[df['National Team ID'] == 'Not Found']['National Team'].unique()

for team in not_found_teams:
    try:
        driver.get("https://www.transfermarkt.ch/")
        handle_iframe()

        # Run the search inside the try block so a timeout on the search bar
        # skips this team instead of aborting the whole loop
        search_bar = WebDriverWait(driver, 2).until(
            EC.visibility_of_element_located((By.XPATH, "//*[@id='schnellsuche']/input"))
        )
        search_bar.clear()
        search_bar.send_keys(team)
        search_bar.submit()

        rows = WebDriverWait(driver, 2).until(
            EC.presence_of_all_elements_located((By.XPATH, "//*[@id='yw1']/table/tbody/tr"))
        )
        
        found = False
        for i, row in enumerate(rows, start=1):
            team_name_xpath = f"//*[@id='yw1']/table/tbody/tr[{i}]/td[2]/table/tbody/tr[1]/td/a"
            team_name_element = driver.find_element(By.XPATH, team_name_xpath)
            if team_name_element.text.strip().lower() == team.strip().lower():
                team_name_element.click()
                team_id = driver.current_url.split('/')[-1]
                df.loc[df['National Team'] == team, 'National Team ID'] = team_id
                found = True
                break
        
        if not found:
            print(f"{team} not found in third search results.")

    except Exception as e:
        print(f"Error processing {team} in third attempt: {e}")

df.to_csv('../data/raw/players_NT_stats_incl_ID.csv', index=False)

# Close the WebDriver after completing the process
driver.quit()

print("National Team IDs updated successfully.")
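
The lookup above takes the last URL segment as the team ID. A slightly more defensive variant is sketched below. It assumes that team pages contain a `/verein/<digits>` segment, which matches the `verein_id` parameter used later in this notebook but is not confirmed here; `extract_team_id` is a hypothetical helper name.

```python
import re

def extract_team_id(url):
    """Return the numeric team ID from a team page URL, or None if none is found.

    Assumption: the ID follows a '/verein/' segment. If that segment is missing,
    fall back to the last path segment, but only when it is purely numeric.
    """
    match = re.search(r"/verein/(\d+)", url)
    if match:
        return match.group(1)
    last_segment = url.rstrip("/").split("/")[-1]
    return last_segment if last_segment.isdigit() else None

# Illustrative URL built for this example, not taken from a scraping run
print(extract_team_id("https://www.transfermarkt.ch/schweiz/startseite/verein/3384"))
```

Returning `None` instead of an arbitrary string makes failed lookups easy to separate from real IDs before the mapping step.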


## Scraping National Team Playing Minutes  

**Objective**  
Collect detailed playing minutes statistics for players in their national teams, categorized by competition and season.

**Scope**  
Extract statistics such as games played, goals scored, and minutes on the field for each player in their respective national teams.

**Methodology**  

- Use Player IDs and National Team IDs to access detailed pages.
- Extract season-wise performance data for each competition.
- Save the collected data for further analysis.
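
As a minimal sketch of the first two steps, the helpers below build a statistics URL from the two IDs and map the cell texts of one table row onto named fields. The helper names and the two field lists are hypothetical; they only mirror the column order this notebook assumes, and the live table layout may differ.

```python
# Hypothetical helpers; the field order mirrors the column indices assumed in this notebook.
BASE_URL = "https://www.transfermarkt.ch"

GOALKEEPER_FIELDS = [
    "Games Played", "Goals", "Own Goals", "Substituted On", "Substituted Off",
    "Yellow Cards", "Yellow Red", "Red Cards", "Goals Conceded", "Clean Sheets",
    "Played Minutes",
]
OUTFIELD_FIELDS = [
    "Games Played", "Goals", "Assists", "Own Goals", "Substituted On",
    "Substituted Off", "Yellow Cards", "Yellow Red", "Red Cards", "Penalty Goals",
    "Minutes per Goal", "Played Minutes",
]

def build_nt_stats_url(player_id, national_team_id, slug="yanick-brecher"):
    """Build the national team statistics URL; the slug is the fixed placeholder used in the notebook."""
    return (f"{BASE_URL}/{slug}/nationalmannschaft/spieler/{player_id}"
            f"/plus/1/verein_id/{national_team_id}")

def map_stats_row(cells, position):
    """Map the cell texts of one row to a dict, or return None if the row is too short."""
    fields = GOALKEEPER_FIELDS if position == "Torwart" else OUTFIELD_FIELDS
    if len(cells) < 2 + len(fields):
        return None
    record = {"Season": cells[0].strip(), "Competition": cells[1].strip()}
    for name, value in zip(fields, cells[2:]):
        record[name] = value.strip()
    # Minutes are displayed with a trailing apostrophe, e.g. 270'
    record["Played Minutes"] = record["Played Minutes"].replace("'", "")
    return record
```

Keeping the field lists next to the mapping function means the required row length follows from the list itself, so the length check and the indices cannot drift apart.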

In [None]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Load the files containing Player ID, Position, and National Team ID
df_positions = pd.read_csv('../data/cleaned/players_incl_ID.csv')  # Contains Player ID and Position
df_national_teams = pd.read_csv('../data/raw/players_NT_stats_incl_ID.csv')  # Contains Player ID and National Team ID

# Merge player and national team data to associate each Player ID with multiple National Team IDs and Position
df_players = pd.merge(df_national_teams, df_positions[['Player ID', 'Position']], on='Player ID', how='left')

# Prepare an empty DataFrame to collect all player data
all_players_data = pd.DataFrame()

# Function to correct season formatting to a standard format
def correct_season_post_processing(season):
    try:
        # Handle seasons that a spreadsheet converted to dates, e.g. '01-Feb' for season 01/02
        # (assumption: day = start year, month number = end year)
        if '-' in season and any(char.isalpha() for char in season):
            parsed_date = datetime.strptime(season, "%d-%b")
            return f"{parsed_date.day:02d}-{parsed_date.month:02d}"
        # Handle cases like '00/01' and convert to '00-01'
        if '/' in season and len(season.split('/')) == 2:
            return season.replace('/', '-')
    except (ValueError, TypeError):
        # Unparseable or non-string values are returned unchanged
        pass
    return season  # Return the original if no formatting changes are required

# Iterate over each Player ID and National Team ID in the merged DataFrame
for index, row in df_players.iterrows():
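    player_id = row['Player ID']  # Extract Player ID
    national_team_id = row['National Team ID']  # Extract National Team ID for each player
    position = row['Position']  # Extract the player's position ("Torwart" marks goalkeepers)

    # Skip players whose national team could not be resolved in the lookup step
    if pd.isna(national_team_id) or national_team_id == 'Not Found':
        continue

    # Construct the URL for the player's national team statistics page
    # (the name slug is a fixed placeholder; the page is selected by the two IDs)
    player_url = f"https://www.transfermarkt.ch/yanick-brecher/nationalmannschaft/spieler/{player_id}/plus/1/verein_id/{national_team_id}"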
    driver.get(player_url)  # Navigate to the player's statistics page
    time.sleep(3)  # Allow the page to fully load
    
    # Handle cookie consent if needed
    try:
        wait = WebDriverWait(driver, 2.5)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953386")))
        driver.switch_to.frame(iframe)
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'message-component message-button no-children')]")))
        accept_button.click()
    except Exception as e:
        print(f"Cookie consent not handled or already accepted: {e}")
    finally:
        # Always return to the main document, even if the click inside the iframe failed
        driver.switch_to.default_content()

    # Try to locate the main table containing national team statistics
    try:
        wait = WebDriverWait(driver, 2.5)
        table = wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='yw0']/table")))  # Locate the table
        rows = table.find_elements(By.TAG_NAME, "tr")  # Extract all rows from the table
        
        player_data = []  # Initialize a list to store the player's data
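        # Rows need 13 cells for goalkeepers and 14 for outfield players (highest index used below + 1)
        required_cols = 13 if position == "Torwart" else 14
        for table_row in rows:  # Separate name so the outer DataFrame row is not shadowed
            cols = table_row.find_elements(By.TAG_NAME, "td")  # Extract columns from the row
            if len(cols) >= required_cols:  # Ensure there are enough columns to extract data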
                season = cols[0].text.strip()  # Extract the season
                season = correct_season_post_processing(season)  # Format the season
                competition = cols[1].text.strip()  # Extract the competition name
                
                # Extract data based on player position
                if position == "Torwart":  # Goalkeeper-specific stats
                    games_played = cols[2].text.strip()
                    goals = cols[3].text.strip()                
                    own_goals = cols[4].text.strip()             
                    substituted_on = cols[5].text.strip()              
                    substituted_off = cols[6].text.strip()             
                    yellow_cards = cols[7].text.strip()                
                    yellow_red = cols[8].text.strip()                  
                    red_cards = cols[9].text.strip()                   
                    goals_conceded = cols[10].text.strip()               
                    clean_sheets = cols[11].text.strip()    
                    played_minutes = cols[12].text.strip().replace("'", "")  
                    
                    player_data.append([
                        player_id, national_team_id, position, season, competition, games_played, goals, own_goals, substituted_on, substituted_off, 
                        yellow_cards, yellow_red, red_cards, goals_conceded, clean_sheets, played_minutes
                    ])
                else:  # Outfield player-specific stats
                    games_played = cols[2].text.strip()
                    goals = cols[3].text.strip()
                    assists = cols[4].text.strip()
                    own_goals = cols[5].text.strip()
                    substituted_on = cols[6].text.strip()
                    substituted_off = cols[7].text.strip()
                    yellow_cards = cols[8].text.strip()
                    yellow_red = cols[9].text.strip()
                    red_cards = cols[10].text.strip()
                    penalty_goals = cols[11].text.strip()
                    minutes_per_goal = cols[12].text.strip()
                    played_minutes = cols[13].text.strip().replace("'", "")
                    
                    player_data.append([
                        player_id, national_team_id, position, season, competition, games_played, goals, assists, own_goals, substituted_on, substituted_off, 
                        yellow_cards, yellow_red, red_cards, penalty_goals, minutes_per_goal, played_minutes
                    ])
        
        # Append the extracted data for the player to the main DataFrame
        if player_data:
            if position == "Torwart":  # Define goalkeeper-specific columns
                columns = [
                    'Player ID', 'National Team ID', 'Position', 'Season', 'Competition', 'Games Played', 'Goals', 'Own Goals', 'Substituted On', 'Substituted Off', 
                    'Yellow Cards', 'Yellow Red', 'Red Cards', 'Goals Conceded', 'Clean Sheets', 'Played Minutes'
                ]
            else:  # Define outfield player-specific columns
                columns = [
                    'Player ID', 'National Team ID', 'Position', 'Season', 'Competition', 'Games Played', 'Goals', 'Assists', 'Own Goals', 'Substituted On', 'Substituted Off', 
                    'Yellow Cards', 'Yellow Red', 'Red Cards', 'Penalty Goals', 'Minutes per Goal', 'Played Minutes'
                ]
            df_temp = pd.DataFrame(player_data, columns=columns)  # Create a temporary DataFrame
            
            # Ensure all columns exist, fill missing values with "-"
            all_columns = [
                'Player ID', 'National Team ID', 'Position', 'Season', 'Competition', 'Games Played', 'Goals', 'Assists', 'Own Goals', 'Substituted On', 
                'Substituted Off', 'Yellow Cards', 'Yellow Red', 'Red Cards', 'Penalty Goals', 'Minutes per Goal', 'Goals Conceded', 'Clean Sheets', 
                 'Played Minutes'
            ]
            df_temp = df_temp.reindex(columns=all_columns, fill_value="-")  # Ensure consistent structure
            
            all_players_data = pd.concat([all_players_data, df_temp], ignore_index=True)  # Append to main DataFrame
            
    except Exception as e:
        print(f"Failed to extract data for player ID {player_id} and National Team ID {national_team_id}: {e}")

# Close the WebDriver after the scraping is complete
driver.quit()

# Correct season formatting in the final DataFrame (skipped if nothing was scraped)
if not all_players_data.empty:
    all_players_data['Season'] = all_players_data['Season'].apply(correct_season_post_processing)

# Save the final DataFrame to a CSV file
all_players_data.to_csv('../data/raw/players_detailed_stats_NT.csv', index=False)

# Print a success message and preview the data
print("Web scraping successfully completed")
print(all_players_data.head())


## Extracting Market Values for Each Player ID  

**Objective**  
Retrieve historical market value data for players from Transfermarkt by reading the tooltips of the market value graph.

**Scope**  
Extract market values, dates, and associated club information for each player over time.

**Methodology**

- Use Player IDs to access market value trend pages.
- Trigger tooltips on the graph for historical data.
- Use OCR to extract tooltip text and parse market value details.
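
The parsing step can be sketched independently of Selenium and Tesseract. The example below assumes the tooltip text contains a date plus the labels `Marktwert:`, `Verein:` and `Alter:`, as the regular expressions in this notebook suggest; `parse_tooltip`, the sample text and the conversion to a euro amount are illustrative assumptions, not the notebook's own output format.

```python
import re

def parse_tooltip(text):
    """Parse OCR'd tooltip text; fields that cannot be found stay None."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse line breaks and repeated spaces

    date = re.search(r"\d{2}\.\d{2}\.\d{4}", text)
    value = re.search(r"Marktwert:\s*(\d+(?:[.,]\d+)?)\s*(Mio|Tsd)", text, re.IGNORECASE)
    # Lazy match that stops before the next field label or at the end of the text
    club = re.search(r"Verein:\s*(.+?)(?=\s+(?:Alter|Marktwert):|$)", text, re.IGNORECASE)
    age = re.search(r"Alter:\s*(\d+)", text, re.IGNORECASE)

    euros = None
    if value:
        amount = float(value.group(1).replace(",", "."))  # German decimal comma
        factor = 1_000_000 if value.group(2).lower() == "mio" else 1_000
        euros = round(amount * factor)

    return {
        "date": date.group(0) if date else None,
        "market_value_eur": euros,
        "club": club.group(1).strip() if club else None,
        "age": int(age.group(1)) if age else None,
    }

# Made-up sample text for illustration, not real scraped output
sample = "14.06.2023\nMarktwert: 2,50 Mio. €\nVerein: FC Zürich\nAlter: 25"
result = parse_tooltip(sample)
print(result)
```

Because whitespace is normalized first, the club pattern cannot rely on line breaks and instead stops at the next known label.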

In [None]:
# Path to Tesseract OCR executable (modify based on your installation)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Initialize Chrome WebDriver with options to maximize the window
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options)

# Set a tall window (1920x1600) so tooltips below the default viewport appear in the screenshots
driver.set_window_size(1920, 1600)

# Load the CSV file containing Player IDs
df_players = pd.read_csv('../data/cleaned/players_incl_ID.csv')  # Load player data
player_ids = df_players['Player ID'].unique()  # Extract unique Player IDs

# List to store market value data for efficiency
all_market_values = []

# Function to parse text extracted from OCR output
def parse_tooltip_from_text(text):
    # Define replacements for common OCR errors
    replacements = {
        "Verzin": "Verein",
        "Aiter": "Alter",
        "Variin": "Verein",
        "Tsc": "Tsd.",
        "Vio": "Mio.",
        "Kio": "Mio.",
        "lM": "Mio.",
        "Om%": "2.5 Mio",
        "Werbung": "",
        "Anzeige": "",
        "Warhiing": "",
        "anAan": "",
        "Aktueller Marktwert": ""
    }
    for wrong, correct in replacements.items():
        text = text.replace(wrong, correct)
    # Remove the OCR noise token "anne" only as a standalone word, so club names
    # such as "Lausanne" stay intact
    text = re.sub(r'\banne\b', '', text)

    # Clean the text and extract relevant fields using regex
    text = re.sub(r'[^\w\s.,:€()-]', '', text)  # Remove unwanted characters
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace

    # Extract relevant data fields
    date_match = re.search(r"(\d{2}\.\d{2}\.\d{4})", text)  # Find date
    market_value_match = re.search(r"Marktwert:\s*([0-9,.]+\s*(Mio\.|Tsd\.))", text, re.IGNORECASE)  # Find market value
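    # Line breaks were collapsed above, so stop the club name at the next field label or at the end of the text
    club_match = re.search(r"Verein:\s*(.+?)(?=\s+(?:Alter|Marktwert)\s*:|$)", text, re.IGNORECASE)  # Find club name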
    age_match = re.search(r"Alter:\s*(\d+)", text, re.IGNORECASE)  # Find age

    # Extract matched values or set them to None if not found
    date = date_match.group(1) if date_match else None
    market_value = market_value_match.group(1).replace(',', '.') if market_value_match else None
    club = club_match.group(1).strip() if club_match else None
    age = age_match.group(1) if age_match else None

    # Clean the club name to remove unnecessary text
    if club:
        club = re.split(r"(\sAnzeige|\sWerbung)", club)[0].strip()

    return date, market_value, club, age

# Iterate over each Player ID to scrape market value data
for player_id in player_ids:
    player_url = f'https://www.transfermarkt.ch/hakan-yakin/marktwertverlauf/spieler/{player_id}'
    print(f"\nProcessing Player ID: {player_id}")
    driver.get(player_url)

    # Handle cookie consent iframe if present
    try:
        WebDriverWait(driver, 5).until(
            EC.frame_to_be_available_and_switch_to_it((By.ID, "sp_message_iframe_953386"))
        )
        WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]"))
        ).click()
    except Exception:
        print("No cookie iframe found or error in handling iframe.")
    finally:
        # Always return to the main document, even if the click inside the iframe failed
        driver.switch_to.default_content()

    # Retrieve the current market value (last entry shown on the page)
    try:
        last_market_value = driver.find_element(By.XPATH, '//*[@id="tm-main"]/div[2]/div[1]/div/tm-market-value-development-graph-extended/div/h3').text
        current_date = datetime.now().strftime('%Y-%m-%d')
        all_market_values.append({
            'Player_ID': player_id,
            'OCR_Text': "N/A",  # No OCR data for this entry
            'Date': current_date,
            'Age': "N/A",  # Age can be derived later from other sources
            'Market_Value': last_market_value,
            'Club': "See Player Profile"
        })
        print(f"Current market value data for {player_id} saved.")
    except Exception as e:
        print(f"Error retrieving last market value for Player {player_id}: {e}")

    # Locate the graph elements representing tooltip points
    path_elements = driver.find_elements(By.CSS_SELECTOR, "#tm-main > div.row > div.large-8.columns > div > tm-market-value-development-graph-extended > div > div > svg > g.svelte-14kpmtb > path")

    if not path_elements:
        print(f"No path elements found for Player {player_id}.")
        continue

    print(f"{len(path_elements)} path elements found for Player {player_id}.")

    # Process each tooltip element to extract market value data
    for index, element in enumerate(path_elements):
        driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)  # Center the element in view
        driver.execute_script("arguments[0].dispatchEvent(new MouseEvent('mouseover', {bubbles: true}));", element)  # Trigger tooltip
        time.sleep(1)  # Allow the tooltip to appear

        # Capture screenshot of the tooltip
        screenshot_path = f"tooltip_{player_id}_{index}.png"
        driver.save_screenshot(screenshot_path)

        # Process the screenshot using OCR
        image = Image.open(screenshot_path)
        crop_coordinates = (0, 0, 150, 115)  # Crop the tooltip region
        cropped_image = image.crop(crop_coordinates)
        cropped_image = cropped_image.resize((cropped_image.width * 4, cropped_image.height * 4), Image.LANCZOS)
        cropped_image = ImageEnhance.Contrast(cropped_image.convert('L')).enhance(2)
        cropped_image = cropped_image.filter(ImageFilter.SHARPEN)

        # Extract text using OCR (Tesseract's code for German is "deu"; its language data must be installed)
        tooltip_text = pytesseract.image_to_string(cropped_image, lang='deu+eng').strip()

        # Parse the OCR text to extract fields
        date, market_value, club, age = parse_tooltip_from_text(tooltip_text)

        # Delete the screenshot after processing to save space
        os.remove(screenshot_path)

        # Append valid data to the results list
        if date or market_value or club or age:
            all_market_values.append({
                'Player_ID': player_id,
                'OCR_Text': tooltip_text,
                'Date': date,
                'Age': age,
                'Market_Value': market_value,
                'Club': club
            })
            print(f"Data added for Tooltip {index + 1} of Player {player_id}.")
        else:
            print(f"No valid data from Tooltip {index + 1} for Player {player_id}.")

# Save all collected data to a CSV file
current_date = datetime.now().strftime('%Y-%m-%d')
filename = f'../data/raw/market_values_{current_date}.csv'
df_final = pd.DataFrame(all_market_values)
df_final.to_csv(filename, index=False)
print(f"\nData saved in {filename}")

# Close the WebDriver
driver.quit()
print("Web scraping completed successfully.")
