# Coach Level Scraping

As a first draft, the following code extracts the data for the Super League game **Servette FC - Lugano (23.12.2023, Result = 2:2)**

In [2]:
import time
import os
import re
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException

# Specify the path to the directory containing the ChromeDriver executable
chrome_driver_directory = "C:/Users/moren/Downloads/chromedriver-win64"  # Insert your own path here

# Add the ChromeDriver directory to the PATH environment variable
os.environ["PATH"] += os.pathsep + chrome_driver_directory



## Line-Ups per Game

TO DO:
- MySQL connection or CSV file

**Page Link:** https://www.transfermarkt.com/servette-fc_fc-lugano/aufstellung/spielbericht/4089797 

**Description:** Shows the line-ups of the home and away teams, their substitutes, and summary statistics such as average age and market value.



We aim to extract the following attributes for each Game:

Table **lineups_df**
- Position (GK, CB, RM, CF, etc.)
- Player Name
- Player Age
- Market Value (in Euros)
- Club
- H/A (Home Team / Away Team)
- Status (Starting 11 or Substitute)

Table **lineups_stats_df**
- Club
- H/A (Home Team / Away Team)
- Manager
- Foreigners Starting (Number of foreigners in the starting line-up)
- Foreigners Subs (Number of foreigners among the substitutes)
- Avg Age Starting
- Avg Age Subs
- Purchase Value Starting (Aggregated value of players in Starting LineUp that have been purchased by Club [in EUR])
- Purchase Value Subs (Aggregated value of players as Subs that have been purchased by Club [in EUR])
- Total Market Value Starting
- Total Market Value Subs
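
The attribute list asks for market values in euros, while the scraped cells hold strings such as `€2.50m` or `€400k`. The following is a minimal sketch of a conversion helper, not part of the original notebook. The name `parse_market_value` is made up for illustration, and the `bn` suffix is an assumption, since only `k` and `m` appear in the output shown in this notebook.

```python
import re

# Multipliers for the suffixes seen in the scraped output ("€2.50m", "€400k").
# "bn" is an assumption; it does not occur in the data shown in this notebook.
_SUFFIX = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}


def parse_market_value(text):
    """Convert a string such as '€2.50m' to an integer number of euros.

    Returns None for missing or unrecognised values (None, 'N/A', '-').
    """
    if not text:
        return None
    match = re.fullmatch(r"€\s*(\d+(?:\.\d+)?)\s*(k|m|bn)?", text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    # round() guards against floating-point noise such as 30550000.000000004
    return round(float(number) * _SUFFIX.get(suffix, 1))
```

Applied to a column, this could look like `lineups_df['Market Value'].map(parse_market_value)`, which keeps missing values as `None` instead of raising.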

In [4]:
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Define the start and end match IDs
start_match_id = 3840895 #4089693 #4089693 # First Game ID of the season
end_match_id = 3841127 #3841212 #4244808 #4089823  # Adjust this according to your requirement


# Initialize an empty DataFrame to store all lineup data
lineups_df = pd.DataFrame()

# Initialize an empty list to store all lineup stats dataframes
all_lineup_stats_dfs = []


# Loop through the range of match IDs
for match_id in range(start_match_id, end_match_id + 1):
    # Construct the URL for the current match ID
    match_url = f"https://www.transfermarkt.com/servette-fc_fc-lugano/aufstellung/spielbericht/{match_id}"

    # Navigate to the match URL
    driver.get(match_url)

    # Wait for page to load
    time.sleep(2)

    try:
        # Wait for the iframe to be present and switch to it
        wait = WebDriverWait(driver, 1)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
        driver.switch_to.frame(iframe)

        # Now wait for the 'Accept & continue' button to be clickable inside the iframe
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
        accept_button.click()

        # Switch back to the main document
        driver.switch_to.default_content()

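    except TimeoutException:
        # The consent iframe or its button did not appear in time. Return to the
        # main document in case the driver had already switched into the iframe.
        driver.switch_to.default_content()
        print("Iframe not found. Continuing after a couple of seconds...")
        time.sleep(1)  # Adjust the time delay as needed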

    
    
    ## SCRAPING ##
    
    # Extract the gameday information from the top of the page
    #gameday = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[2]/p[1]/a[1]').text
    gameday = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[2]/p[1]').text

    # Extract the home and away club names from the 'title' attribute
    home_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[1]/a[2]').get_attribute("title")
    away_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[3]/a[2]').get_attribute("title")


    # Function to extract data from a table given its rows
    def extract_table_data(table_rows, club_name):
        positions = []
        players = []
        ages = []
        market_values = []
        n_players = len(range(0, len(table_rows), 3))  # One entry per iteration of the loop below
        club_names = [club_name] * n_players  # Same club for every player row
        gamedays = [gameday] * n_players  # Same gameday for all players in the match

        for i in range(0, len(table_rows), 3):  # Increment by 3 for each player's data set
            cells = table_rows[i].find_elements(By.TAG_NAME, "td")
            player_info = cells[1].text
            name_age_parts = player_info.split(' (')
            player_name = name_age_parts[0].strip()
            age_part = name_age_parts[1] if len(name_age_parts) > 1 else ''
            age_match = re.search(r'(\d+) years old', age_part)
            age = age_match.group(1) if age_match else None
            position_market_value = cells[4].text
            if ', ' in position_market_value:
                position, market_value = position_market_value.split(', ', 1)  # Split on the first ", " only
            else:
                position = position_market_value
                market_value = None
        
            players.append(player_name)
            ages.append(age)
            positions.append(position)
            market_values.append(market_value)
    
        return pd.DataFrame({
            'Position': positions,
            'Player': players,
            'Age': ages,
            'Market Value': market_values,
            'Club': club_names,
            'Gameday': gamedays,
        })

    # XPath for each line-up table
    tables_xpaths = {
        'starting_lineup_home': '//*[@id="main"]/main/div[4]/div[1]/div/div[1]/table', 
        'substitutes_home': '//*[@id="main"]/main/div[5]/div[1]/div/div[1]/table',
        'starting_lineup_away': '//*[@id="main"]/main/div[4]/div[2]/div/div[1]/table',
        'substitutes_away': '//*[@id="main"]/main/div[5]/div[2]/div/div[1]/table'
    }

    all_tables_df = []

    # Loop through the table paths and extract data
    for key, xpath in tables_xpaths.items():
        try:
            table = driver.find_element(By.XPATH, xpath)
            rows = table.find_elements(By.TAG_NAME, "tr")
            team_type = 'Home' if 'home' in key else 'Away'
            club_name = home_club_name if 'home' in key else away_club_name
            df = extract_table_data(rows, club_name)  # Parse the player rows of this table
            df['H/A'] = team_type
            df['Status'] = 'Starting' if 'starting' in key else 'Substitute'
            all_tables_df.append(df)
        except NoSuchElementException:
            print(f"Table not found for {key} in match ID: {match_id}, skipping.")
            continue  # Skip this iteration if table is not found

    # Combine all dataframes from the current page into lineups_df
    if all_tables_df:  # Check if there's any data to concatenate
        temp_df = pd.concat(all_tables_df, ignore_index=True)
        temp_df['Match ID'] = match_id  # Add the match_id to every row in temp_df
    
        # Append this match's rows to the season-wide lineups_df
        lineups_df = pd.concat([lineups_df, temp_df], ignore_index=True)


    # Extract home and away managers' names using the updated XPaths
    home_manager_element = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[1]/div/div/table/tbody/tr/td[1]/table/tbody/tr[1]/td[2]')
    home_manager_name = home_manager_element.text
    away_manager_element = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[2]/div/div/table/tbody/tr/td[1]/table/tbody/tr[1]/td[2]')
    away_manager_name = away_manager_element.text

    # Extract additional information for both home and away teams with exception handling
    try:
        foreigners_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_starting_home = "N/A"

    try:
        foreigners_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_subs_home = "N/A"

    try:
        avg_age_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_starting_home = "N/A"

    try:
        avg_age_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_subs_home = "N/A"

    try:
        purchase_value_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_starting_home = "N/A"

    try:
        purchase_value_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_subs_home = "N/A"

    try:
        total_market_value_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_starting_home = "N/A"

    try:
        total_market_value_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_subs_home = "N/A"

    try:
        foreigners_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_starting_away = "N/A"

    try:
        foreigners_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_subs_away = "N/A"

    try:
        avg_age_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_starting_away = "N/A"

    try:
        avg_age_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_subs_away = "N/A"

    try:
        purchase_value_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_starting_away = "N/A"

    try:
        purchase_value_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_subs_away = "N/A"

    try:
        total_market_value_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_starting_away = "N/A"

    try:
        total_market_value_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_subs_away = "N/A"



    # Function to clean the extracted data by removing the leading label
    def clean_data(text, keep_eur_sign=False):
        if keep_eur_sign:
            # Strip the known label and keep the value including its euro sign
            for label in ('Purchase value: ', 'Total MV: '):
                if label in text:
                    return text.replace(label, '')
            return text  # No known label (e.g. "N/A"): return the text unchanged
        # Otherwise extract the first number or percentage, e.g. "25.6" or "45%"
        match = re.search(r'\d+(\.\d+)?%?', text)
        return match.group(0) if match else text
    
    # Create a DataFrame for the club and manager information along with the newly extracted data
    lineups_stats_df = pd.DataFrame({
        'Club': [home_club_name, away_club_name],
        'H/A': ['Home', 'Away'],
        'Manager': [home_manager_name, away_manager_name],
        'Foreigners Starting': [clean_data(foreigners_starting_home), clean_data(foreigners_starting_away)],
        'Foreigners Subs': [clean_data(foreigners_subs_home), clean_data(foreigners_subs_away)],
        'Avg Age Starting': [clean_data(avg_age_starting_home), clean_data(avg_age_starting_away)],
        'Avg Age Subs': [clean_data(avg_age_subs_home), clean_data(avg_age_subs_away)],
        'Purchase Value Starting': [clean_data(purchase_value_starting_home, True), clean_data(purchase_value_starting_away, True)],
        'Purchase Value Subs': [clean_data(purchase_value_subs_home, True), clean_data(purchase_value_subs_away, True)],
        'Total Market Value Starting': [clean_data(total_market_value_starting_home, True), clean_data(total_market_value_starting_away, True)],
        'Total Market Value Subs': [clean_data(total_market_value_subs_home, True), clean_data(total_market_value_subs_away, True)],
        'Match ID': [match_id, match_id]
    })

    # Append the lineup stats dataframe for the current match to the list
    all_lineup_stats_dfs.append(lineups_stats_df)

    # Print the number of dataframes collected after each match
    print(f"Collected {len(all_lineup_stats_dfs)} dataframes after match ID: {match_id}")


# Before the concatenation, print out the number of dataframes to be concatenated
print(f"Concatenating {len(all_lineup_stats_dfs)} dataframes.")

# Concatenate all the lineup stats dataframes in the list
final_lineup_stats_df = pd.concat(all_lineup_stats_dfs, ignore_index=True)

# Close the driver after scraping is done
driver.quit()

# Print a success message after scraping all matches
print("Webscraping successfully completed for all matches.")

# Finally, save the dataframe to a CSV file for persistence
lineups_df.to_csv('lineups_2022_2023_1.csv', index=False)




Collected 1 dataframes after match ID: 3840895
Iframe not found. Continuing after a couple of seconds...
Collected 2 dataframes after match ID: 3840896
Iframe not found. Continuing after a couple of seconds...
Collected 3 dataframes after match ID: 3840897
Iframe not found. Continuing after a couple of seconds...
Collected 4 dataframes after match ID: 3840898
Iframe not found. Continuing after a couple of seconds...
Collected 5 dataframes after match ID: 3840899
Iframe not found. Continuing after a couple of seconds...
Collected 6 dataframes after match ID: 3840900
Iframe not found. Continuing after a couple of seconds...
Collected 7 dataframes after match ID: 3840901
Iframe not found. Continuing after a couple of seconds...
Collected 8 dataframes after match ID: 3840902
Iframe not found. Continuing after a couple of seconds...
Collected 9 dataframes after match ID: 3840903
Iframe not found. Continuing after a couple of seconds...
Collected 10 dataframes after match ID: 3840904
Iframe 
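
The row parsing inside `extract_table_data` can be tried out without a browser. The sketch below is not part of the notebook. It assumes the two cell texts look like `David von Ballmoos (27 years old)` and `Goalkeeper, €2.50m`, which is the form implied by the regular expression and the `', '` split in the cell above. The function names `parse_player_cell` and `parse_position_cell` are hypothetical.

```python
import re


def parse_player_cell(text):
    """Split 'Name (27 years old)' into (name, age); age is None if absent."""
    match = re.search(r"\((\d+) years old\)", text)
    name = text.split(" (")[0].strip()
    age = int(match.group(1)) if match else None
    return name, age


def parse_position_cell(text):
    """Split 'Goalkeeper, €2.50m' into (position, market_value).

    Only the first ', ' is used as a separator, so a cell without a market
    value yields (text, None) instead of raising.
    """
    position, sep, market_value = text.partition(", ")
    return position.strip(), (market_value.strip() if sep else None)
```

Unlike the notebook cell, this sketch returns the age as an `int` rather than a string. Keeping the parsing in pure functions makes it possible to test edge cases, such as a player without a listed market value, before starting a long scraping run.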

In [12]:
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Define the start and end match IDs
start_match_id = 3841148 #4089693 #4089693 # First Game ID of the season
end_match_id = 3841212 #3841212 #4244808 #4089823  # Adjust this according to your requirement


# Initialize an empty DataFrame to store all lineup data
lineups_df_2 = pd.DataFrame()

# Initialize an empty list to store all lineup stats dataframes
all_lineup_stats_dfs_2 = []


# Loop through the range of match IDs
for match_id in range(start_match_id, end_match_id + 1):
    # Construct the URL for the current match ID
    match_url = f"https://www.transfermarkt.com/servette-fc_fc-lugano/aufstellung/spielbericht/{match_id}"

    # Navigate to the match URL
    driver.get(match_url)

    # Wait for page to load
    time.sleep(2)

    try:
        # Wait for the iframe to be present and switch to it
        wait = WebDriverWait(driver, 1)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
        driver.switch_to.frame(iframe)

        # Now wait for the 'Accept & continue' button to be clickable inside the iframe
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
        accept_button.click()

        # Switch back to the main document
        driver.switch_to.default_content()

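    except TimeoutException:
        # The consent iframe or its button did not appear in time. Return to the
        # main document in case the driver had already switched into the iframe.
        driver.switch_to.default_content()
        print("Iframe not found. Continuing after a couple of seconds...")
        time.sleep(1)  # Adjust the time delay as needed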

    
    
    ## SCRAPING ##
    
    # Extract the gameday information from the top of the page
    #gameday = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[2]/p[1]/a[1]').text
    gameday = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[2]/p[1]').text

    # Extract the home and away club names from the 'title' attribute
    home_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[1]/a[2]').get_attribute("title")
    away_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[3]/a[2]').get_attribute("title")


    # Function to extract data from a table given its rows
    def extract_table_data(table_rows, club_name):
        positions = []
        players = []
        ages = []
        market_values = []
        n_players = len(range(0, len(table_rows), 3))  # One entry per iteration of the loop below
        club_names = [club_name] * n_players  # Same club for every player row
        gamedays = [gameday] * n_players  # Same gameday for all players in the match

        for i in range(0, len(table_rows), 3):  # Increment by 3 for each player's data set
            cells = table_rows[i].find_elements(By.TAG_NAME, "td")
            player_info = cells[1].text
            name_age_parts = player_info.split(' (')
            player_name = name_age_parts[0].strip()
            age_part = name_age_parts[1] if len(name_age_parts) > 1 else ''
            age_match = re.search(r'(\d+) years old', age_part)
            age = age_match.group(1) if age_match else None
            position_market_value = cells[4].text
            if ', ' in position_market_value:
                position, market_value = position_market_value.split(', ', 1)  # Split on the first ", " only
            else:
                position = position_market_value
                market_value = None
        
            players.append(player_name)
            ages.append(age)
            positions.append(position)
            market_values.append(market_value)
    
        return pd.DataFrame({
            'Position': positions,
            'Player': players,
            'Age': ages,
            'Market Value': market_values,
            'Club': club_names,
            'Gameday': gamedays,
        })

    # XPath for each line-up table
    tables_xpaths = {
        'starting_lineup_home': '//*[@id="main"]/main/div[4]/div[1]/div/div[1]/table', 
        'substitutes_home': '//*[@id="main"]/main/div[5]/div[1]/div/div[1]/table',
        'starting_lineup_away': '//*[@id="main"]/main/div[4]/div[2]/div/div[1]/table',
        'substitutes_away': '//*[@id="main"]/main/div[5]/div[2]/div/div[1]/table'
    }

    all_tables_df_2 = []

    # Loop through the table paths and extract data
    for key, xpath in tables_xpaths.items():
        try:
            table = driver.find_element(By.XPATH, xpath)
            rows = table.find_elements(By.TAG_NAME, "tr")
            team_type = 'Home' if 'home' in key else 'Away'
            club_name = home_club_name if 'home' in key else away_club_name
            df_2 = extract_table_data(rows, club_name)  # Parse the player rows of this table
            df_2['H/A'] = team_type
            df_2['Status'] = 'Starting' if 'starting' in key else 'Substitute'
            all_tables_df_2.append(df_2)
        except NoSuchElementException:
            print(f"Table not found for {key} in match ID: {match_id}, skipping.")
            continue  # Skip this iteration if table is not found

    # Combine all dataframes from the current page into lineups_df_2
    if all_tables_df_2:  # Check if there's any data to concatenate
        temp_df_2 = pd.concat(all_tables_df_2, ignore_index=True)
        temp_df_2['Match ID'] = match_id  # Add the match_id to every row in temp_df_2

        # Append this match's rows to the season-wide lineups_df_2
        lineups_df_2 = pd.concat([lineups_df_2, temp_df_2], ignore_index=True)


    # Extract home and away managers' names using the updated XPaths
    home_manager_element = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[1]/div/div/table/tbody/tr/td[1]/table/tbody/tr[1]/td[2]')
    home_manager_name = home_manager_element.text
    away_manager_element = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[2]/div/div/table/tbody/tr/td[1]/table/tbody/tr[1]/td[2]')
    away_manager_name = away_manager_element.text

    # Extract additional information for both home and away teams with exception handling
    try:
        foreigners_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_starting_home = "N/A"

    try:
        foreigners_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_subs_home = "N/A"

    try:
        avg_age_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_starting_home = "N/A"

    try:
        avg_age_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_subs_home = "N/A"

    try:
        purchase_value_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_starting_home = "N/A"

    try:
        purchase_value_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_subs_home = "N/A"

    try:
        total_market_value_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[1]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_starting_home = "N/A"

    try:
        total_market_value_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_subs_home = "N/A"

    try:
        foreigners_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_starting_away = "N/A"

    try:
        foreigners_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[1]').text
    except NoSuchElementException:
        foreigners_subs_away = "N/A"

    try:
        avg_age_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_starting_away = "N/A"

    try:
        avg_age_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[2]').text
    except NoSuchElementException:
        avg_age_subs_away = "N/A"

    try:
        purchase_value_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_starting_away = "N/A"

    try:
        purchase_value_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[3]').text
    except NoSuchElementException:
        purchase_value_subs_away = "N/A"

    try:
        total_market_value_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[4]/div[2]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_starting_away = "N/A"

    try:
        total_market_value_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[4]').text
    except NoSuchElementException:
        total_market_value_subs_away = "N/A"



    # Function to clean the extracted data by removing the leading label
    def clean_data(text, keep_eur_sign=False):
        if keep_eur_sign:
            # Strip the known label and keep the value including its euro sign
            for label in ('Purchase value: ', 'Total MV: '):
                if label in text:
                    return text.replace(label, '')
            return text  # No known label (e.g. "N/A"): return the text unchanged
        # Otherwise extract the first number or percentage, e.g. "25.6" or "45%"
        match = re.search(r'\d+(\.\d+)?%?', text)
        return match.group(0) if match else text
    
    # Create a DataFrame for the club and manager information along with the newly extracted data
    lineups_stats_df_2 = pd.DataFrame({
        'Club': [home_club_name, away_club_name],
        'H/A': ['Home', 'Away'],
        'Manager': [home_manager_name, away_manager_name],
        'Foreigners Starting': [clean_data(foreigners_starting_home), clean_data(foreigners_starting_away)],
        'Foreigners Subs': [clean_data(foreigners_subs_home), clean_data(foreigners_subs_away)],
        'Avg Age Starting': [clean_data(avg_age_starting_home), clean_data(avg_age_starting_away)],
        'Avg Age Subs': [clean_data(avg_age_subs_home), clean_data(avg_age_subs_away)],
        'Purchase Value Starting': [clean_data(purchase_value_starting_home, True), clean_data(purchase_value_starting_away, True)],
        'Purchase Value Subs': [clean_data(purchase_value_subs_home, True), clean_data(purchase_value_subs_away, True)],
        'Total Market Value Starting': [clean_data(total_market_value_starting_home, True), clean_data(total_market_value_starting_away, True)],
        'Total Market Value Subs': [clean_data(total_market_value_subs_home, True), clean_data(total_market_value_subs_away, True)],
        'Match ID': [match_id, match_id]
    })

    # Append the lineup stats dataframe for the current match to the list
    all_lineup_stats_dfs_2.append(lineups_stats_df_2)

    # Print the number of dataframes collected after each match
    print(f"Collected {len(all_lineup_stats_dfs_2)} dataframes after match ID: {match_id}")


# Before the concatenation, print out the number of dataframes to be concatenated
print(f"Concatenating {len(all_lineup_stats_dfs_2)} dataframes.")

# Concatenate all the lineup stats dataframes in the list
final_lineup_stats_df_2 = pd.concat(all_lineup_stats_dfs_2, ignore_index=True)

# Close the driver after scraping is done
driver.quit()

# Print a success message after scraping all matches
print("Webscraping successfully completed for all matches.")

# Finally, save the dataframe to a CSV file for persistence
lineups_df_2.to_csv('lineups_2022_2023_2.csv', index=False)




Collected 1 dataframes after match ID: 3841148
Iframe not found. Continuing after a couple of seconds...
Collected 2 dataframes after match ID: 3841149
Iframe not found. Continuing after a couple of seconds...
Collected 3 dataframes after match ID: 3841150
Iframe not found. Continuing after a couple of seconds...
Collected 4 dataframes after match ID: 3841151
Iframe not found. Continuing after a couple of seconds...
Collected 5 dataframes after match ID: 3841152
Iframe not found. Continuing after a couple of seconds...
Collected 6 dataframes after match ID: 3841153
Iframe not found. Continuing after a couple of seconds...
Table not found for substitutes_away in match ID: 3841154, skipping.
Collected 7 dataframes after match ID: 3841154
Iframe not found. Continuing after a couple of seconds...
Table not found for substitutes_home in match ID: 3841155, skipping.
Table not found for substitutes_away in match ID: 3841155, skipping.
Collected 8 dataframes after match ID: 3841155
Iframe not 
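
Both scraping cells repeat sixteen near-identical `try`/`except` blocks whose XPaths differ in only three indices. As a rough sketch, the paths can be generated from the pattern visible in the code above. The names `SECTION_DIV`, `SIDE_DIV`, `STAT_TD`, `stat_xpath` and `all_stat_xpaths` are made up for illustration; the index values are copied from the notebook's XPaths.

```python
# div[4] holds the starting line-ups, div[5] the substitutes.
SECTION_DIV = {"starting": 4, "subs": 5}
# Within each block, div[1] is the home team and div[2] the away team.
SIDE_DIV = {"home": 1, "away": 2}
# Cell index inside the summary table below each line-up.
STAT_TD = {
    "foreigners": 1,
    "avg_age": 2,
    "purchase_value": 3,
    "total_market_value": 4,
}


def stat_xpath(stat, section, side):
    """Build the XPath of one summary cell, e.g. ('avg_age', 'subs', 'away')."""
    return (
        f'//*[@id="main"]/main/div[{SECTION_DIV[section]}]'
        f'/div[{SIDE_DIV[side]}]/div/div[2]/table/tbody/tr/td[{STAT_TD[stat]}]'
    )


def all_stat_xpaths():
    """Return all 16 XPaths keyed like 'foreigners_starting_home'."""
    return {
        f"{stat}_{section}_{side}": stat_xpath(stat, section, side)
        for stat in STAT_TD
        for section in SECTION_DIV
        for side in SIDE_DIV
    }
```

Each path could then be looked up with `driver.find_element(By.XPATH, xpath).text` inside a single `try`/`except NoSuchElementException` in a loop, falling back to `"N/A"` as the cells above do.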

In [5]:
lineups_df

Unnamed: 0,Position,Player,Age,Market Value,Club,Gameday,H/A,Status,Match ID
0,Goalkeeper,David von Ballmoos,27,€2.50m,BSC Young Boys,"1. Matchday | Sat, 7/16/22 | 6:00 PM",Home,Starting,3840895
1,Centre-Back,Cédric Zesiger,24,€3.20m,BSC Young Boys,"1. Matchday | Sat, 7/16/22 | 6:00 PM",Home,Starting,3840895
2,Centre-Back,Fabian Lustenberger,34,€400k,BSC Young Boys,"1. Matchday | Sat, 7/16/22 | 6:00 PM",Home,Starting,3840895
3,Left-Back,Ulisses Garcia,26,€2.00m,BSC Young Boys,"1. Matchday | Sat, 7/16/22 | 6:00 PM",Home,Starting,3840895
4,Right-Back,Lewin Blum,20,€750k,BSC Young Boys,"1. Matchday | Sat, 7/16/22 | 6:00 PM",Home,Starting,3840895
...,...,...,...,...,...,...,...,...,...
7923,Central Midfield,Luca Zuffi,33,€300k,FC Sion,"31. Matchday | Sun, 4/30/23 | 2:15 PM",Away,Substitute,3841127
7924,Central Midfield,Anto Grgic,26,€1.50m,FC Sion,"31. Matchday | Sun, 4/30/23 | 2:15 PM",Away,Substitute,3841127
7925,Central Midfield,Denis Will Poha,25,€1.00m,FC Sion,"31. Matchday | Sun, 4/30/23 | 2:15 PM",Away,Substitute,3841127
7926,Left Winger,Kevin Halabaku,21,€200k,FC Sion,"31. Matchday | Sun, 4/30/23 | 2:15 PM",Away,Substitute,3841127
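
The `Gameday` column stores the matchday number, date and kick-off time in one string. A possible way to split it is sketched below. The helper is not part of the notebook and its name `parse_gameday` is hypothetical. It assumes every value follows the `N. Matchday | Day, M/D/YY | H:MM AM/PM` pattern seen in the output above and that the default (English) locale is active for the `AM`/`PM` marker.

```python
from datetime import datetime


def parse_gameday(text):
    """Split '1. Matchday | Sat, 7/16/22 | 6:00 PM' into (matchday, kickoff)."""
    matchday_part, date_part, time_part = [p.strip() for p in text.split("|")]
    matchday = int(matchday_part.split(".")[0])
    # Drop the weekday abbreviation ("Sat, ") and parse the rest
    date_only = date_part.split(", ", 1)[1]
    kickoff = datetime.strptime(f"{date_only} {time_part}", "%m/%d/%y %I:%M %p")
    return matchday, kickoff
```

Strings that do not match the assumed pattern raise a `ValueError` or `IndexError`, so a real pipeline would probably wrap the call and keep the raw string for those rows.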


In [13]:
lineups_df_2

Unnamed: 0,Position,Player,Age,Market Value,Club,Gameday,H/A,Status,Match ID
0,Goalkeeper,Marwin Hitz,35,€500k,FC Basel 1893,"32. Matchday | Sun, 5/7/23 | 4:30 PM",Home,Starting,3841148
1,Centre-Back,Kasim Adams,27,€1.30m,FC Basel 1893,"32. Matchday | Sun, 5/7/23 | 4:30 PM",Home,Starting,3841148
2,Centre-Back,Michael Lang,32,€500k,FC Basel 1893,"32. Matchday | Sun, 5/7/23 | 4:30 PM",Home,Starting,3841148
3,Centre-Back,Sergio López,24,€1.00m,FC Basel 1893,"32. Matchday | Sun, 5/7/23 | 4:30 PM",Home,Starting,3841148
4,Defensive Midfield,Wouter Burger,22,€4.00m,FC Basel 1893,"32. Matchday | Sun, 5/7/23 | 4:30 PM",Home,Starting,3841148
...,...,...,...,...,...,...,...,...,...
1904,Right-Back,Allan Arigoni,24,€800k,FC Lugano,"36. Matchday | Mon, 5/29/23 | 4:30 PM",Away,Substitute,3841212
1905,Defensive Midfield,Ousmane Doumbia,31,€1.00m,FC Lugano,"36. Matchday | Mon, 5/29/23 | 4:30 PM",Away,Substitute,3841212
1906,Left Winger,Ignacio Aliseda,23,€1.50m,FC Lugano,"36. Matchday | Mon, 5/29/23 | 4:30 PM",Away,Substitute,3841212
1907,Right Winger,Renato Steffen,31,€1.00m,FC Lugano,"36. Matchday | Mon, 5/29/23 | 4:30 PM",Away,Substitute,3841212
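
The `Gameday` column packs three pieces of information into one string (matchday number, date and kickoff time). A minimal sketch of how it could be split for later analysis, assuming every row follows the `N. Matchday | Day, M/D/YY | H:MM AM/PM` pattern shown above. The helper name `parse_gameday` is hypothetical and not part of the scraping code.

```python
from datetime import datetime


def parse_gameday(gameday: str):
    """Split a Gameday string into (matchday number, kickoff datetime)."""
    matchday_part, date_part, time_part = [p.strip() for p in gameday.split("|")]
    # "1. Matchday" -> 1
    matchday = int(matchday_part.split(".")[0])
    # "Sat, 7/16/22" -> "7/16/22"; the weekday is redundant, so it is dropped
    date_str = date_part.split(",")[1].strip()
    # Assumes US-style month/day/year and a 12-hour clock, as displayed above
    kickoff = datetime.strptime(f"{date_str} {time_part}", "%m/%d/%y %I:%M %p")
    return matchday, kickoff


matchday, kickoff = parse_gameday("1. Matchday | Sat, 7/16/22 | 6:00 PM")
print(matchday, kickoff)
```

Rows that deviate from this pattern (for example a match without a kickoff time) would raise an error and need separate handling.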


In [6]:
all_lineup_stats_dfs

[             Club   H/A        Manager Foreigners Starting Foreigners Subs  \
 0  BSC Young Boys  Home  Raphael Wicky                   3               4   
 1       FC Zürich  Away    Franco Foda                   5               6   
 
   Avg Age Starting Avg Age Subs Purchase Value Starting Purchase Value Subs  \
 0             25.6         25.5                  €6.45m              €8.62m   
 1             25.3         28.2                  €2.15m               €200k   
 
   Total Market Value Starting Total Market Value Subs  Match ID  
 0                     €30.55m                 €19.50m   3840895  
 1                     €26.00m                  €7.60m   3840895  ,
             Club   H/A       Manager Foreigners Starting Foreigners Subs  \
 0  FC Winterthur  Home  Bruno Berner                   1               1   
 1  FC Basel 1893  Away     Alex Frei                   5               4   
 
   Avg Age Starting Avg Age Subs Purchase Value Starting Purchase Value Subs  \
 0  

In [14]:
all_lineup_stats_dfs_2

[            Club   H/A       Manager Foreigners Starting Foreigners Subs  \
 0  FC Basel 1893  Home   Heiko Vogel                   4               5   
 1      FC Zürich  Away  Bo Henriksen                   5               3   
 
   Avg Age Starting Avg Age Subs Purchase Value Starting Purchase Value Subs  \
 0             26.3         21.0                  €4.20m              €1.55m   
 1             26.4         24.9                  €1.70m                   0   
 
   Total Market Value Starting Total Market Value Subs  Match ID  
 0                     €24.60m                  €8.80m   3841148  
 1                     €10.20m                 €14.60m   3841148  ,
                       Club   H/A          Manager Foreigners Starting  \
 0  Grasshopper Club Zurich  Home  Giorgio Contini                   8   
 1              Servette FC  Away     Alain Geiger                   7   
 
   Foreigners Subs Avg Age Starting Avg Age Subs Purchase Value Starting  \
 0               2     

In [7]:
final_lineup_stats_df

Unnamed: 0,Club,H/A,Manager,Foreigners Starting,Foreigners Subs,Avg Age Starting,Avg Age Subs,Purchase Value Starting,Purchase Value Subs,Total Market Value Starting,Total Market Value Subs,Match ID
0,BSC Young Boys,Home,Raphael Wicky,3,4,25.6,25.5,€6.45m,€8.62m,€30.55m,€19.50m,3840895
1,FC Zürich,Away,Franco Foda,5,6,25.3,28.2,€2.15m,€200k,€26.00m,€7.60m,3840895
2,FC Winterthur,Home,Bruno Berner,1,1,27.1,22.4,0,0,€2.98m,€2.80m,3840896
3,FC Basel 1893,Away,Alex Frei,5,4,24.7,25.4,€7.70m,€450k,€21.60m,€4.85m,3840896
4,FC Lugano,Home,Mattia Croci-Torti,6,2,26.5,23.0,€2.08m,0,€8.60m,€1.85m,3840897
...,...,...,...,...,...,...,...,...,...,...,...,...
461,FC Basel 1893,Away,Heiko Vogel,5,5,25.4,20.7,€4.60m,€3.75m,€21.90m,€15.80m,3841125
462,BSC Young Boys,Home,Raphael Wicky,3,2,24.8,25.5,€7.55m,€6.52m,€35.60m,€16.60m,3841126
463,FC Luzern,Away,Mario Frick,4,2,25.1,20.5,€380k,0,€11.93m,€3.18m,3841126
464,FC Zürich,Home,Bo Henriksen,5,3,27.5,25.1,€500k,0,€13.40m,€6.10m,3841127


In [15]:
final_lineup_stats_df_2

Unnamed: 0,Club,H/A,Manager,Foreigners Starting,Foreigners Subs,Avg Age Starting,Avg Age Subs,Purchase Value Starting,Purchase Value Subs,Total Market Value Starting,Total Market Value Subs,Match ID
0,FC Basel 1893,Home,Heiko Vogel,4,5,26.3,21.0,€4.20m,€1.55m,€24.60m,€8.80m,3841148
1,FC Zürich,Away,Bo Henriksen,5,3,26.4,24.9,€1.70m,0,€10.20m,€14.60m,3841148
2,Grasshopper Club Zurich,Home,Giorgio Contini,8,2,25.8,24.6,0,0,€11.75m,€2.60m,3841149
3,Servette FC,Away,Alain Geiger,7,5,27.7,24.4,€70k,0,€15.40m,€6.80m,3841149
4,FC Luzern,Home,Mario Frick,5,4,25.7,24.0,€380k,€800k,€15.93m,€5.20m,3841150
...,...,...,...,...,...,...,...,...,...,...,...,...
125,FC Sion,Away,Paolo Tramezzani,5,3,27.6,24.6,€890k,0,€5.90m,€1.46m,3841210
126,BSC Young Boys,Home,Raphael Wicky,2,3,24.8,26.8,€7.55m,€6.20m,€35.70m,€17.70m,3841211
127,FC Winterthur,Away,Bruno Berner,3,0,27.2,24.3,€25k,0,€4.40m,€2.71m,3841211
128,FC Zürich,Home,Bo Henriksen,4,4,27.7,25.8,€1.65m,0,€15.20m,€10.20m,3841212


The scraping yields the following two tables, **lineups_df** and **lineups_stats_df**:

In [8]:
lineups_stats_df

Unnamed: 0,Club,H/A,Manager,Foreigners Starting,Foreigners Subs,Avg Age Starting,Avg Age Subs,Purchase Value Starting,Purchase Value Subs,Total Market Value Starting,Total Market Value Subs,Match ID
0,FC Zürich,Home,Bo Henriksen,5,3,27.5,25.1,€500k,0,€13.40m,€6.10m,3841127
1,FC Sion,Away,David Bettoni,5,4,27.2,26.8,0,€1.74m,€6.31m,€4.65m,3841127


In [16]:
lineups_stats_df_2

Unnamed: 0,Club,H/A,Manager,Foreigners Starting,Foreigners Subs,Avg Age Starting,Avg Age Subs,Purchase Value Starting,Purchase Value Subs,Total Market Value Starting,Total Market Value Subs,Match ID
0,FC Zürich,Home,Bo Henriksen,4,4,27.7,25.8,€1.65m,0,€15.20m,€10.20m,3841212
1,FC Lugano,Away,Mattia Croci-Torti,6,3,25.5,27.8,€6.38m,€4.35m,€7.90m,€6.70m,3841212
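
The value columns in these tables are stored as strings such as `€6.45m`, `€200k` or `0`, so they cannot be summed or averaged directly. A minimal sketch of a converter to plain euro amounts, assuming only the `m` and `k` suffixes seen in the output above occur. The function name `parse_euro` is hypothetical.

```python
def parse_euro(value) -> float:
    """Convert strings like '€2.50m', '€400k' or '0' to a number of euros."""
    text = str(value).strip().replace("€", "")
    multipliers = {"m": 1_000_000, "k": 1_000}
    suffix = text[-1:].lower()
    if suffix in multipliers:
        # Round to cents to hide floating-point noise from the multiplication
        return round(float(text[:-1]) * multipliers[suffix], 2)
    # Plain numbers such as "0" carry no suffix; an empty string counts as 0
    return float(text) if text else 0.0


for raw in ["€2.50m", "€400k", "0"]:
    print(raw, "->", parse_euro(raw))
```

It could then be applied column-wise, for example with `final_lineup_stats_df["Total Market Value Starting"].map(parse_euro)`.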


## Matchsheet (optional)

**Page Link:** https://www.transfermarkt.com/servette-fc_fc-lugano/index/spielbericht/4089797

**Description:** Shows match events such as goals, substitutions and cards.


We aim to extract the following attributes for each Game:

Table **events_df**
- Club
- H/A (Home Team / Away Team)
- Timestamp (of event in the game)*
- Event (Goal, Substitution, Card)
- Player Event (Name of the player relevant to the event)
- Remark Event (additional information)
- Player Assist (name of player if event = goal)
- Player Out (player substituted out if event = substitution)

\* Events in stoppage time (45'+ and 90'+) are recorded with the timestamp 45' or 90', not with the exact minute (such as 94' or 45'+2').
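
The minute of an event is not available as text on the page. The scraping code reads it from the pixel offsets in each event's `style` attribute instead. A minimal standalone sketch of that mapping, assuming the 36-pixel grid with ten minutes per row that the code in this notebook uses. The function name `sprite_offset_to_minute` is hypothetical.

```python
CELL_PX = 36  # sprite cell size, taken from the scraping code in this notebook


def sprite_offset_to_minute(x_px: int, y_px: int) -> int:
    """Map a pair of pixel offsets to a 1-based match minute."""
    column = abs(x_px) // CELL_PX  # position within the row -> units
    row = abs(y_px) // CELL_PX     # row index -> tens
    return row * 10 + column + 1


print(sprite_offset_to_minute(0, 0))        # 1
print(sprite_offset_to_minute(-108, -72))   # 24
print(sprite_offset_to_minute(-324, -288))  # 90
```

Under this grid, nine rows of ten cells cover minutes 1 to 90, which is consistent with stoppage-time events not getting an exact minute.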


In [5]:
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Define the start and end match IDs
start_match_id = 3840939  # Test range; other start IDs used so far: 3840895, 4089693
end_match_id = 3840940  # Inclusive; other end IDs used so far: 3841127, 3841212, 4244808, 4089823

# Initialize an empty list to store all events dataframes
all_events_dfs = []

# Loop through the range of match IDs
for match_id in range(start_match_id, end_match_id + 1):
    # Construct the URL for the current match ID
    match_url = f"https://www.transfermarkt.com/servette-fc_fc-lugano/index/spielbericht/{match_id}"

    # Navigate to the match URL
    driver.get(match_url)

    # Wait for page to load
    time.sleep(2)

    # Handling the iframe and accept button if exists
    try:
        wait = WebDriverWait(driver, 10)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
        driver.switch_to.frame(iframe)
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
        accept_button.click()
        driver.switch_to.default_content()
    except Exception:  # unlike a bare except, this still lets Ctrl+C interrupt the scrape
        print("Iframe not found. Continuing after a couple of seconds...")

    ## SCRAPING ## 

    # Extracting club names
    home_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div/div/div[1]/div[1]/div[2]/nobr/a').get_attribute("title")
    away_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div/div/div[2]/div[1]/div[2]/nobr/a').get_attribute("title")

    # Convert sprite pixel offsets to a match minute (36 px per cell, ten minutes per row)
    def convert_px_to_minute(x_px, y_px):
        # No pixel offsets could be read from the style attribute
        if x_px is None or y_px is None:
            return None

        # Remove any non-numeric characters and convert to integer
        x_px = int(re.sub(r'[^\d-]', '', str(x_px)))
        y_px = int(re.sub(r'[^\d-]', '', str(y_px)))

        # Convert negative values to positive
        x_px = abs(x_px)
        y_px = abs(y_px)

        unit_minutes = (x_px // 36) + 1
        ten_minutes = (y_px // 36) * 10
        timestamp = f"{unit_minutes + ten_minutes}'"
        return timestamp


    def extract_px_from_style(style_str):
        # Use regular expression to find all pixel values in the style string
        px_values = re.findall(r'-?\d+px', style_str)  # Include optional minus sign
    
        # Check if there are at least two pixel values
        if len(px_values) >= 2:
            x_px, y_px = [int(px.strip('px')) for px in px_values[:2]]  # Take the first two values
            return x_px, y_px
        else:
            # Handle the case when there are not enough values
            return None, None  # You can return None or some default values


    # Function to extract events with Remark Event adjustment
    def extract_events(event_type_xpath, event_type, home_club_name, away_club_name):
        events_list = driver.find_element(By.XPATH, event_type_xpath)
        events_items = events_list.find_elements(By.TAG_NAME, "li")
        events_data = []

        for item in events_items:
            team = "Home" if "heim" in item.get_attribute("class") else "Away"
            club = home_club_name if team == "Home" else away_club_name

            # Extract the style attribute for timestamp
            style_str = item.find_element(By.XPATH, ".//div/div[1]/span").get_attribute("style")
            x_px, y_px = extract_px_from_style(style_str)
            timestamp = convert_px_to_minute(x_px, y_px)

            player_event = "N/A"  # Default value if player name is not found
            player_out = None  # Initialize player_out to None
            remark_event = ""  # Initialize remark_event to empty string
            player_assist = None  # Ensure this variable is also initialized

            try:
                player_event_element = None
                full_text = item.find_element(By.XPATH, ".//div/div[4]").text.strip()
                if event_type == "Substitution":
                    parts = full_text.split('\n')
                    if len(parts) > 1:
                        player_out_part = parts[-1]
                        player_out_parts = player_out_part.split(', ')
                        if len(player_out_parts) > 1:
                            player_out = player_out_parts[0]
                            remark_event = player_out_parts[1]
                        else:
                            player_out = player_out_parts[0]
                    player_event_element = item.find_element(By.XPATH, ".//div/div[4]/span[1]/a")
                    player_event = player_event_element.get_attribute("title")


                else:
                    player_event_element = item.find_element(By.XPATH, ".//div/div[4]/a")
                    player_event = player_event_element.get_attribute("title")
                    # Adjust this block to handle goals and cards specifically
                    full_text = item.find_element(By.XPATH, ".//div/div[4]").text
                    if event_type == "Goal":
                        parts = full_text.split(',')
                        if len(parts) > 2:  # If there are at least 3 parts, indicating a remark is present
                            remark_event = parts[1].strip()  # The part before the second ',' is the remark for goals
                            # Handling Assist information for goals
                            if "Assist:" in full_text:
                                assist_part = full_text.split('Assist:')[1].split(',')[0].strip()
                                player_assist = assist_part  # Assume player_assist is already defined elsewhere as None
                        else:
                            remark_event = parts[0].strip() if len(parts) > 1 else ""
                    else:
                        # For Cards, just an example, adjust as needed
                        remark_event = full_text.split(',')[-1].strip() if ',' in full_text else full_text
            except NoSuchElementException:
                pass



            card_type = event_type  # Default card type is the event type itself
            if event_type == "Card":
                card_span_class = item.find_element(By.XPATH, ".//div/div[2]/span").get_attribute("class")
                if "gelbrot" in card_span_class:
                    card_type = "Yellow-Red Card"
                elif "gelb" in card_span_class and "rot" not in card_span_class:
                    card_type = "Yellow Card"
                elif "rot" in card_span_class:
                    card_type = "Direct Red Card"

            events_data.append({
                "Timestamp": timestamp,
                "Club": club,
                "H/A": team,
                "Event": card_type,
                "Player Event": player_event,
                "Remark Event": remark_event,
                "Player Assist": player_assist,
                "Player Out": player_out,
                "Match ID": match_id,
            })     
        return events_data


    all_events_data = []
    event_types = {"Goal": '//*[@id="sb-tore"]/ul', "Substitution": '//*[@id="sb-wechsel"]/ul', "Card": '//*[@id="sb-karten"]/ul'}


    # Iterate through each event type and extract data
    for event_type, xpath in event_types.items():
        events_data = extract_events(xpath, event_type, home_club_name, away_club_name)
        all_events_data.extend(events_data)


    # Create DataFrame and reorder columns so that 'Timestamp' comes third
    if all_events_data:  # Ensure there's data before creating the DataFrame
        events_df = pd.DataFrame(all_events_data)
        columns_order = ['Club', 'H/A', 'Timestamp', 'Event', 'Player Event', 'Remark Event', 'Player Assist', 'Player Out', 'Match ID']
        events_df = events_df[columns_order]
        all_events_dfs.append(events_df)
    
    print(f"Scraping completed for match ID: {match_id}")

# Check if all_events_dfs is not empty before attempting to concatenate
if all_events_dfs:  # This checks if the list is not empty
    # Concatenate all events dataframes
    final_events_df = pd.concat(all_events_dfs, ignore_index=True)

    # Finally, save the dataframe to a CSV file for persistence
    final_events_df.to_csv('match_events.csv', index=False)
else:
    print("No data was scraped.")

# Close the driver after scraping is done
driver.quit()

# Print a completion message
print("Webscraping completed for all matches.")

# Save a second copy under a test filename, but only if anything was scraped
if all_events_dfs:
    final_events_df.to_csv('match_events_test.csv', index=False)

SyntaxError: invalid syntax (1224564621.py, line 149)

In [11]:
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Define the start and end match IDs
start_match_id = 3840895  # First Game ID of the season
end_match_id = 3841127  # Adjust this according to your requirement

# Initialize an empty list to store all events dataframes
all_events_dfs = []

# Loop through the range of match IDs
for match_id in range(start_match_id, end_match_id + 1):
    # Construct the URL for the current match ID
    match_url = f"https://www.transfermarkt.com/servette-fc_fc-lugano/index/spielbericht/{match_id}"

    # Navigate to the match URL
    driver.get(match_url)

    # Wait for page to load
    time.sleep(2)

    # Handling the iframe and accept button if exists
    try:
        wait = WebDriverWait(driver, 10)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
        driver.switch_to.frame(iframe)
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
        accept_button.click()
        driver.switch_to.default_content()
    except Exception:  # unlike a bare except, this still lets Ctrl+C interrupt the scrape
        print("Iframe not found. Continuing after a couple of seconds...")

    ## SCRAPING ## 

    # Extracting club names
    home_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div/div/div[1]/div[1]/div[2]/nobr/a').get_attribute("title")
    away_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div/div/div[2]/div[1]/div[2]/nobr/a').get_attribute("title")

    # Convert sprite pixel offsets to a match minute (36 px per cell, ten minutes per row)
    def convert_px_to_minute(x_px, y_px):
        # No pixel offsets could be read from the style attribute
        if x_px is None or y_px is None:
            return None

        # Remove any non-numeric characters and convert to integer
        x_px = int(re.sub(r'[^\d-]', '', str(x_px)))
        y_px = int(re.sub(r'[^\d-]', '', str(y_px)))

        # Convert negative values to positive
        x_px = abs(x_px)
        y_px = abs(y_px)

        unit_minutes = (x_px // 36) + 1
        ten_minutes = (y_px // 36) * 10
        timestamp = f"{unit_minutes + ten_minutes}'"
        return timestamp


    def extract_px_from_style(style_str):
        # Use regular expression to find all pixel values in the style string
        px_values = re.findall(r'-?\d+px', style_str)  # Include optional minus sign
    
        # Check if there are at least two pixel values
        if len(px_values) >= 2:
            x_px, y_px = [int(px.strip('px')) for px in px_values[:2]]  # Take the first two values
            return x_px, y_px
        else:
            # Handle the case when there are not enough values
            return None, None  # You can return None or some default values


    # Function to extract events with Remark Event adjustment
    def extract_events(event_type_xpath, event_type, home_club_name, away_club_name):
        try:
            events_list = driver.find_element(By.XPATH, event_type_xpath)
            events_items = events_list.find_elements(By.TAG_NAME, "li")
            events_data = []

            for item in events_items:
                team = "Home" if "heim" in item.get_attribute("class") else "Away"
                club = home_club_name if team == "Home" else away_club_name

                # Extract the style attribute for timestamp
                style_str = item.find_element(By.XPATH, ".//div/div[1]/span").get_attribute("style")
                x_px, y_px = extract_px_from_style(style_str)
                timestamp = convert_px_to_minute(x_px, y_px)

                player_event = "N/A"  # Default value if player name is not found
                player_out = None  # Initialize player_out to None
                remark_event = ""  # Initialize remark_event to empty string
                player_assist = None  # Ensure this variable is also initialized

                try:
                    player_event_element = None
                    full_text = item.find_element(By.XPATH, ".//div/div[4]").text.strip()
                    if event_type == "Substitution":
                        parts = full_text.split('\n')
                        if len(parts) > 1:
                            player_out_part = parts[-1]
                            player_out_parts = player_out_part.split(', ')
                            if len(player_out_parts) > 1:
                                player_out = player_out_parts[0]
                                remark_event = player_out_parts[1]
                            else:
                                player_out = player_out_parts[0]
                        player_event_element = item.find_element(By.XPATH, ".//div/div[4]/span[1]/a")
                        player_event = player_event_element.get_attribute("title")


                    else:
                        player_event_element = item.find_element(By.XPATH, ".//div/div[4]/a")
                        player_event = player_event_element.get_attribute("title")
                        # Adjust this block to handle goals and cards specifically
                        full_text = item.find_element(By.XPATH, ".//div/div[4]").text
                        if event_type == "Goal":
                            parts = full_text.split(',')
                            if len(parts) > 2:  # If there are at least 3 parts, indicating a remark is present
                                remark_event = parts[1].strip()  # The part before the second ',' is the remark for goals
                                # Handling Assist information for goals
                                if "Assist:" in full_text:
                                    assist_part = full_text.split('Assist:')[1].split(',')[0].strip()
                                    player_assist = assist_part  # Assume player_assist is already defined elsewhere as None
                            else:
                                remark_event = parts[0].strip() if len(parts) > 1 else ""
                        else:
                            # For Cards, just an example, adjust as needed
                            remark_event = full_text.split(',')[-1].strip() if ',' in full_text else full_text
                except NoSuchElementException:
                    pass



                card_type = event_type  # Default card type is the event type itself
                if event_type == "Card":
                    card_span_class = item.find_element(By.XPATH, ".//div/div[2]/span").get_attribute("class")
                    if "gelbrot" in card_span_class:
                        card_type = "Yellow-Red Card"
                    elif "gelb" in card_span_class and "rot" not in card_span_class:
                        card_type = "Yellow Card"
                    elif "rot" in card_span_class:
                        card_type = "Direct Red Card"

                events_data.append({
                    "Timestamp": timestamp,
                    "Club": club,
                    "H/A": team,
                    "Event": card_type,
                    "Player Event": player_event,
                    "Remark Event": remark_event,
                    "Player Assist": player_assist,
                    "Player Out": player_out,
                    "Match ID": match_id,
                }) 
            return events_data
        except NoSuchElementException:
            print(f"No {event_type} events found on the page.")
            return []

    all_events_data = []
    event_types = {"Goal": '//*[@id="sb-tore"]/ul', "Substitution": '//*[@id="sb-wechsel"]/ul', "Card": '//*[@id="sb-karten"]/ul'}

    # Iterate through each event type and extract data
    for event_type, xpath in event_types.items():
        events_data = extract_events(xpath, event_type, home_club_name, away_club_name)
        all_events_data.extend(events_data)

    # Create DataFrame and reorder columns so that 'Timestamp' comes third
    if all_events_data:  # Ensure there's data before creating the DataFrame
        events_df = pd.DataFrame(all_events_data)
        columns_order = ['Club', 'H/A', 'Timestamp', 'Event', 'Player Event', 'Remark Event', 'Player Assist', 'Player Out', 'Match ID']
        events_df = events_df[columns_order]
        all_events_dfs.append(events_df)
    
    print(f"Scraping completed for match ID: {match_id}")

# Check if all_events_dfs is not empty before attempting to concatenate
if all_events_dfs:  # This checks if the list is not empty
    # Concatenate all events dataframes
    final_events_df = pd.concat(all_events_dfs, ignore_index=True)

    # Finally, save the dataframe to a CSV file for persistence
    final_events_df.to_csv('match_events_2022_2023_1.csv', index=False)
else:
    print("No data was scraped.")

# Close the driver after scraping is done
driver.quit()

# Print a completion message (the CSV has already been written above if any data was scraped)
print("Webscraping completed for all matches.")


The chromedriver version (121.0.6167.85) detected in PATH at C:\Users\moren\Downloads\chromedriver-win64\chromedriver.exe might not be compatible with the detected chrome version (122.0.6261.112); currently, chromedriver 122.0.6261.111 is recommended for chrome 122.*, so it is advised to delete the driver in PATH and retry


Scraping completed for match ID: 3840895
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840896
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840897
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840898
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840899
Iframe not found. Continuing after a couple of seconds...
No Goal events found on the page.
Scraping completed for match ID: 3840900
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840901
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840902
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840903
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3840904
Iframe not found. Continuing after
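
The card type is derived purely from the CSS class of the card icon, so that branch of `extract_events` can be checked without a browser. A minimal sketch that mirrors the conditions used in the loop. The function name `classify_card` and the sample class strings are hypothetical and not taken from the page.

```python
def classify_card(css_class: str) -> str:
    """Return the card type for a card icon's CSS class string."""
    if "gelbrot" in css_class:
        return "Yellow-Red Card"
    if "gelb" in css_class and "rot" not in css_class:
        return "Yellow Card"
    if "rot" in css_class:
        return "Direct Red Card"
    # Same fallback as in the scraping loop: keep the generic event type
    return "Card"


# Hypothetical class strings, for illustration only
for sample in ["icon gelb", "icon gelbrot", "icon rot", "icon"]:
    print(sample, "->", classify_card(sample))
```

The `gelbrot` check has to come first, because that class name also contains both `gelb` and `rot`.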

In [13]:
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Define the start and end match IDs
start_match_id = 3841148  # First match ID of the second batch (matchday 32 onwards)
end_match_id = 3841212  # Adjust this according to your requirement

# Initialize an empty list to store all events dataframes
all_events_dfs_2 = []

# Loop through the range of match IDs
for match_id in range(start_match_id, end_match_id + 1):
    # Construct the URL for the current match ID
    match_url = f"https://www.transfermarkt.com/servette-fc_fc-lugano/index/spielbericht/{match_id}"

    # Navigate to the match URL
    driver.get(match_url)

    # Wait for page to load
    time.sleep(2)

    # Handling the iframe and accept button if exists
    try:
        wait = WebDriverWait(driver, 10)
        iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
        driver.switch_to.frame(iframe)
        accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
        accept_button.click()
        driver.switch_to.default_content()
    except Exception:  # unlike a bare except, this still lets Ctrl+C interrupt the scrape
        print("Iframe not found. Continuing after a couple of seconds...")

    ## SCRAPING ## 

    # Extracting club names
    home_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div/div/div[1]/div[1]/div[2]/nobr/a').get_attribute("title")
    away_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div/div/div[2]/div[1]/div[2]/nobr/a').get_attribute("title")

    # Convert sprite pixel offsets to a match minute (36 px per cell, ten minutes per row)
    def convert_px_to_minute(x_px, y_px):
        # No pixel offsets could be read from the style attribute
        if x_px is None or y_px is None:
            return None

        # Remove any non-numeric characters and convert to integer
        x_px = int(re.sub(r'[^\d-]', '', str(x_px)))
        y_px = int(re.sub(r'[^\d-]', '', str(y_px)))

        # Convert negative values to positive
        x_px = abs(x_px)
        y_px = abs(y_px)

        unit_minutes = (x_px // 36) + 1
        ten_minutes = (y_px // 36) * 10
        timestamp = f"{unit_minutes + ten_minutes}'"
        return timestamp


    def extract_px_from_style(style_str):
        # Use regular expression to find all pixel values in the style string
        px_values = re.findall(r'-?\d+px', style_str)  # Include optional minus sign
    
        # Check if there are at least two pixel values
        if len(px_values) >= 2:
            x_px, y_px = [int(px.strip('px')) for px in px_values[:2]]  # Take the first two values
            return x_px, y_px
        else:
            # Handle the case when there are not enough values
            return None, None  # You can return None or some default values


    # Function to extract events with Remark Event adjustment
    def extract_events(event_type_xpath, event_type, home_club_name, away_club_name):
        try:
            events_list = driver.find_element(By.XPATH, event_type_xpath)
            events_items = events_list.find_elements(By.TAG_NAME, "li")
            events_data = []

            for item in events_items:
                team = "Home" if "heim" in item.get_attribute("class") else "Away"
                club = home_club_name if team == "Home" else away_club_name

                # Extract the style attribute for timestamp
                style_str = item.find_element(By.XPATH, ".//div/div[1]/span").get_attribute("style")
                x_px, y_px = extract_px_from_style(style_str)
                timestamp = convert_px_to_minute(x_px, y_px)

                player_event = "N/A"  # Default value if player name is not found
                player_out = None  # Initialize player_out to None
                remark_event = ""  # Initialize remark_event to empty string
                player_assist = None  # Ensure this variable is also initialized

                try:
                    player_event_element = None
                    full_text = item.find_element(By.XPATH, ".//div/div[4]").text.strip()
                    if event_type == "Substitution":
                        parts = full_text.split('\n')
                        if len(parts) > 1:
                            player_out_part = parts[-1]
                            player_out_parts = player_out_part.split(', ')
                            if len(player_out_parts) > 1:
                                player_out = player_out_parts[0]
                                remark_event = player_out_parts[1]
                            else:
                                player_out = player_out_parts[0]
                        player_event_element = item.find_element(By.XPATH, ".//div/div[4]/span[1]/a")
                        player_event = player_event_element.get_attribute("title")


                    else:
                        player_event_element = item.find_element(By.XPATH, ".//div/div[4]/a")
                        player_event = player_event_element.get_attribute("title")
                        # Adjust this block to handle goals and cards specifically
                        full_text = item.find_element(By.XPATH, ".//div/div[4]").text
                        if event_type == "Goal":
                            parts = full_text.split(',')
                            if len(parts) > 2:  # If there are at least 3 parts, indicating a remark is present
                                remark_event = parts[1].strip()  # The part before the second ',' is the remark for goals
                                # Handling Assist information for goals
                                if "Assist:" in full_text:
                                    assist_part = full_text.split('Assist:')[1].split(',')[0].strip()
                                    player_assist = assist_part  # Assume player_assist is already defined elsewhere as None
                            else:
                                remark_event = parts[0].strip() if len(parts) > 1 else ""
                        else:
                            # For Cards, just an example, adjust as needed
                            remark_event = full_text.split(',')[-1].strip() if ',' in full_text else full_text
                except NoSuchElementException:
                    pass



                card_type = event_type  # Default card type is the event type itself
                if event_type == "Card":
                    card_span_class = item.find_element(By.XPATH, ".//div/div[2]/span").get_attribute("class")
                    if "gelbrot" in card_span_class:
                        card_type = "Yellow-Red Card"
                    elif "gelb" in card_span_class and "rot" not in card_span_class:
                        card_type = "Yellow Card"
                    elif "rot" in card_span_class:
                        card_type = "Direct Red Card"

                events_data.append({
                    "Timestamp": timestamp,
                    "Club": club,
                    "H/A": team,
                    "Event": card_type,
                    "Player Event": player_event,
                    "Remark Event": remark_event,
                    "Player Assist": player_assist,
                    "Player Out": player_out,
                    "Match ID": match_id,
                }) 
            return events_data
        except NoSuchElementException:
            print(f"No {event_type} events found on the page.")
            return []

    all_events_data = []
    event_types = {"Goal": '//*[@id="sb-tore"]/ul', "Substitution": '//*[@id="sb-wechsel"]/ul', "Card": '//*[@id="sb-karten"]/ul'}

    # Iterate through each event type and extract data
    for event_type, xpath in event_types.items():
        events_data = extract_events(xpath, event_type, home_club_name, away_club_name)
        all_events_data.extend(events_data)

    # Create DataFrame and reorder columns so that 'Timestamp' comes third, after 'Club' and 'H/A'
    if all_events_data:  # Ensure there's data before creating the DataFrame
        events_df_2 = pd.DataFrame(all_events_data)
        columns_order = ['Club', 'H/A', 'Timestamp', 'Event', 'Player Event', 'Remark Event', 'Player Assist', 'Player Out', 'Match ID']
        events_df_2 = events_df_2[columns_order]
        all_events_dfs_2.append(events_df_2)
    
    print(f"Scraping completed for match ID: {match_id}")

# Check if all_events_dfs_2 is not empty before attempting to concatenate
if all_events_dfs_2:  # This checks if the list is not empty
    # Concatenate all events dataframes
    final_events_df_2 = pd.concat(all_events_dfs_2, ignore_index=True)

    # Finally, save the dataframe to a CSV file for persistence
    final_events_df_2.to_csv('match_events_2022_2023_2.csv', index=False)
else:
    print("No data was scraped.")

# Close the driver after scraping is done
driver.quit()

# Print a success message
print("Webscraping successfully completed for all matches.")



The chromedriver version (121.0.6167.85) detected in PATH at C:\Users\moren\Downloads\chromedriver-win64\chromedriver.exe might not be compatible with the detected chrome version (122.0.6261.112); currently, chromedriver 122.0.6261.111 is recommended for chrome 122.*, so it is advised to delete the driver in PATH and retry


Scraping completed for match ID: 3841148
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3841149
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3841150
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3841151
Iframe not found. Continuing after a couple of seconds...
Scraping completed for match ID: 3841152
Iframe not found. Continuing after a couple of seconds...
No Card events found on the page.
Scraping completed for match ID: 3841153
Iframe not found. Continuing after a couple of seconds...
No Card events found on the page.
Scraping completed for match ID: 3841154
Iframe not found. Continuing after a couple of seconds...
No Goal events found on the page.
No Substitution events found on the page.
No Card events found on the page.
Scraping completed for match ID: 3841155
Iframe not found. Continuing after a couple of seconds...
No Card events found on the page.
...
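
The goal branch of `extract_events` splits the event text on commas to get the remark and on `Assist:` to get the assisting player. A minimal standalone sketch of that parsing, without Selenium, might look like the following. The function name `parse_goal_text` and the sample strings are assumptions made for illustration; the text on the real page may be laid out differently.

```python
def parse_goal_text(full_text):
    """Return (remark, assist) from the text of a goal entry.

    Assumes the text looks like
    "<scorer>, <remark>, <season tally> Assist: <player>, ...",
    which is a guess at the page layout, not a confirmed format.
    """
    parts = [part.strip() for part in full_text.split(",")]

    # With at least three comma-separated parts, the second one is the remark
    remark = parts[1] if len(parts) > 2 else ""

    # The assisting player is the text between "Assist:" and the next comma
    assist = None
    if "Assist:" in full_text:
        assist = full_text.split("Assist:", 1)[1].split(",")[0].strip()

    return remark, assist


# Hypothetical sample strings, modelled on the scraped rows
with_assist = (
    "Christian Fassnacht, Header, 1. Goal of the Season\n"
    "Assist: Ulisses Garcia, Cross, 1. Assist of the Season"
)
without_assist = "Roko Simic, Penalty, 1. Goal of the Season"

print(parse_goal_text(with_assist))
print(parse_goal_text(without_assist))
```

Keeping the parsing in a plain function like this makes it possible to test edge cases (no assist, no remark) without opening a browser session.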

In [12]:
final_events_df

| | Club | H/A | Timestamp | Event | Player Event | Remark Event | Player Assist | Player Out | Match ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BSC Young Boys | Home | 62' | Goal | Christian Fassnacht | Header | Ulisses Garcia | | 3840895 |
| 1 | BSC Young Boys | Home | 77' | Goal | Cedric Itten | Right-footed shot | Cheikh Niasse | | 3840895 |
| 2 | BSC Young Boys | Home | 81' | Goal | Fabian Rieder | Header | Wilfried Kanga | | 3840895 |
| 3 | BSC Young Boys | Home | 85' | Goal | Wilfried Kanga | Right-footed shot | Christian Fassnacht | | 3840895 |
| 4 | BSC Young Boys | Home | 63' | Substitution | Cedric Itten | Tactical | | Meschack Elia | 3840895 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2945 | FC Sion | Away | 78' | Yellow Card | Giovanni Sio | Dissent | | | 3841127 |
| 2946 | FC Zürich | Home | 85' | Yellow Card | Nikola Boranijasevic | Foul | | | 3841127 |
| 2947 | FC Zürich | Home | 90' | Yellow Card | Fabian Rohner | Unsporting behaviour | | | 3841127 |
| 2948 | FC Zürich | Home | 90' | Yellow Card | Junior Ligue | Foul | | | 3841127 |
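
The `Event` column separates yellow, yellow-red and direct red cards based on the CSS class of the card icon. As a rough sketch, that mapping can be isolated into a small helper so it can be checked without a browser. The helper name `classify_card` and the sample class strings are hypothetical; only the substrings `gelbrot`, `gelb` and `rot` come from the scraping code.

```python
def classify_card(css_class, default="Card"):
    """Map the CSS class of a card icon to a readable card type."""
    # "gelbrot" must be checked first, because it also contains "gelb" and "rot"
    if "gelbrot" in css_class:
        return "Yellow-Red Card"
    if "gelb" in css_class:
        return "Yellow Card"
    if "rot" in css_class:
        return "Direct Red Card"
    # Unknown class: fall back to the generic event type
    return default


# Hypothetical class strings, for illustration only
for css in ["sb-sprite sb-gelb", "sb-sprite sb-gelbrot", "sb-sprite sb-rot", "sb-sprite"]:
    print(css, "->", classify_card(css))
```

Checking `gelbrot` before the other two makes the extra `"rot" not in card_span_class` condition unnecessary.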


In [14]:
final_events_df_2

| | Club | H/A | Timestamp | Event | Player Event | Remark Event | Player Assist | Player Out | Match ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | FC Zürich | Away | 90' | Goal | Roko Simic | Penalty | | | 3841148 |
| 1 | FC Zürich | Away | 90' | Goal | Ifeanyi Mathew | Left-footed shot | Antonio Marchesano | | 3841148 |
| 2 | FC Basel 1893 | Home | 45' | Substitution | Jean-Kévin Augustin | Injury | | Sergio López | 3841148 |
| 3 | FC Zürich | Away | 54' | Substitution | Bledian Krasniqi | Tactical | | Junior Ligue | 3841148 |
| 4 | FC Basel 1893 | Home | 71' | Substitution | Bradley Fink | Tactical | | Andi Zeqiri | 3841148 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 597 | FC Zürich | Home | 29' | Yellow Card | Ifeanyi Mathew | Foul | | | 3841212 |
| 598 | FC Zürich | Home | 50' | Yellow Card | Jonathan Okita | Foul | | | 3841212 |
| 599 | FC Lugano | Away | 73' | Yellow Card | Renato Steffen | Foul | | | 3841212 |
| 600 | FC Zürich | Home | 77' | Yellow Card | Nikola Katic | Foul | | | 3841212 |
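
`Timestamp` is stored as text such as `62'`, so sorting it directly compares strings rather than minutes. One possible way to get a numeric minute is sketched below. The function name `minute_from_timestamp` is an assumption, and the sketch only reads the leading digits, so it ignores anything the page might append after the minute.

```python
import re


def minute_from_timestamp(timestamp):
    """Return the leading minute of a timestamp like "62'" as an int, or None."""
    match = re.match(r"\s*(\d+)", timestamp)
    return int(match.group(1)) if match else None


# "9'" is a made-up value to show why plain string sorting is not enough
timestamps = ["85'", "62'", "9'", "77'"]
ordered = sorted(timestamps, key=minute_from_timestamp)
print(ordered)
```

The same function could be applied to the DataFrame column, for example with `final_events_df_2["Timestamp"].map(minute_from_timestamp)`, to add a numeric minute column.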


And now grab a beer, you've earned it big time! ;)

Let's try another game to see whether the code is generic and works for all of these pages:

## Match Statistics (optional)

**Page Link:** https://www.transfermarkt.com/servette-fc_fc-lugano/statistik/spielbericht/4089797

**Description:**