# Coach Level Scraping

As a first draft, the following code extracts the data from the Super League game **Servette FC - Lugano (23.12.2023, Result = 2:2)**.

In [36]:
import time
import os
import re
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Specify the path to the ChromeDriver executable
chrome_driver_path = "C:/Users/arnol/Downloads/chromedriver-win64/chromedriver.exe"  # Insert your own path here (user moreno: 'moren')

# Add the directory that contains the ChromeDriver executable to the PATH environment variable
os.environ["PATH"] += os.pathsep + os.path.dirname(chrome_driver_path)

## Line-Ups per Game

TO DO:
- MySQL connection or CSV file

**Page Link:** https://www.transfermarkt.com/servette-fc_fc-lugano/aufstellung/spielbericht/4089797 

**Description:** Shows the line-ups of the home and away teams, their substitutes, and summary statistics such as average age and market value.



We aim to extract the following attributes for each Game:

Table **lineups_df**
- Position (GK, CB, RM, CF, etc.)
- Player Name
- Player Age
- Market Value (in Euros)
- Club
- H/A (Home Team / Away Team)
- Status (Starting 11 or Substitute)
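
The scraped market values arrive as text such as `€500k` or `€1.10m`, while the attribute list asks for a value in euros. A minimal sketch of that conversion follows. The function name `parse_market_value` and the `bn` suffix are assumptions for illustration; only the `k` and `m` suffixes appear in the output of this notebook.

```python
import re

# Multipliers for the suffixes seen in the scraped values ("k", "m").
# "bn" is an assumption and does not occur in this notebook's output.
_SUFFIX_FACTORS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}


def parse_market_value(text):
    """Convert a string such as '€1.10m' or '€500k' to an integer number of euros.

    Returns None for missing or unrecognised values.
    """
    if not text:
        return None
    match = re.fullmatch(r"€\s*(\d+(?:\.\d+)?)\s*(k|m|bn)?", text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    factor = _SUFFIX_FACTORS.get(suffix, 1)  # No suffix means plain euros
    return round(float(number) * factor)
```

On a DataFrame with a `Market Value` column, this could be applied with `df['Market Value'].map(parse_market_value)`; for example, `parse_market_value("€500k")` gives `500000`.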

Table **lineups_stats_df**
- Club
- H/A (Home Team / Away Team)
- Manager
- Foreigners Starting (Amount of Foreigners in Starting LineUp)
- Foreigners Subs (Amount of Foreigners as Subs)
- Avg Age Starting
- Avg Age Subs
- Purchase Value Starting (Aggregated value of players in Starting LineUp that have been purchased by Club [in EUR])
- Purchase Value Subs (Aggregated value of players as Subs that have been purchased by Club [in EUR])
- Total Market Value Starting
- Total Market Value Subs
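
The four summary values per line-up block sit in a regular grid on the page, so their XPaths can be generated rather than written out one by one. The sketch below assumes the layout implied by the XPaths used in this notebook (`div[5]` for starters, `div[6]` for substitutes, inner `div[1]` for home and `div[2]` for away); the helper names `stat_xpath` and `all_stat_xpaths` are hypothetical.

```python
# Layout assumed from the XPaths used in this notebook:
#   div[5] = starting line-ups, div[6] = substitutes
#   inner div[1] = home team, inner div[2] = away team
#   td[1..4] = foreigners, average age, purchase value, total market value
BLOCKS = {"starting": 5, "subs": 6}
SIDES = {"home": 1, "away": 2}
STATS = {"foreigners": 1, "avg_age": 2, "purchase_value": 3, "total_market_value": 4}


def stat_xpath(stat, block, side):
    """Build the XPath of one summary cell below a line-up table."""
    return (
        f'//*[@id="main"]/main/div[{BLOCKS[block]}]/div[{SIDES[side]}]'
        f'/div/div[2]/table/tbody/tr/td[{STATS[stat]}]'
    )


def all_stat_xpaths():
    """Return a dict mapping keys like 'foreigners_starting_home' to XPaths."""
    return {
        f"{stat}_{block}_{side}": stat_xpath(stat, block, side)
        for side in SIDES
        for block in BLOCKS
        for stat in STATS
    }
```

With a live driver, the raw texts could then be collected in one pass, for example `{key: driver.find_element(By.XPATH, xp).text for key, xp in all_stat_xpaths().items()}`. Positional XPaths like these break whenever the page layout changes, so this only reduces repetition and does not make the scraper more robust.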

In [31]:
## PAGE NAVIGATION ##
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Navigate to the Transfermarkt line-up page
driver.get('https://www.transfermarkt.com/servette-fc_fc-lugano/aufstellung/spielbericht/4089797') 

# Wait for page to load
time.sleep(2) 

# Wait for the iframe to be present and switch to it
wait = WebDriverWait(driver, 10)
iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
driver.switch_to.frame(iframe)

# Now wait for the 'Accept & continue' button to be clickable inside the iframe
accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
accept_button.click()

# Switch back to the main document
driver.switch_to.default_content()

## SCRAPING ## 

# Extract the home and away club names from the 'title' attribute
home_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[1]/a[2]').get_attribute("title")
away_club_name = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[1]/div/div/div[2]/div[3]/a[2]').get_attribute("title")

# Function to extract data from a table given its rows
def extract_table_data(table_rows, club_name):
    positions = []
    players = []
    ages = []
    market_values = []
    club_names = [club_name] * len(range(0, len(table_rows), 3))  # One club name per processed player row
    
    for i in range(0, len(table_rows), 3):  # Increment by 3 for each player's data set
        cells = table_rows[i].find_elements(By.TAG_NAME, "td")
        player_info = cells[1].text
        name_age_parts = player_info.split(' (')
        player_name = name_age_parts[0].strip()
        age_part = name_age_parts[1] if len(name_age_parts) > 1 else ''
        age_match = re.search(r'(\d+) years old', age_part)
        age = age_match.group(1) if age_match else None

        position_market_value = cells[4].text
        if ', ' in position_market_value:
            position, market_value = position_market_value.rsplit(', ', 1)  # Split on the last comma only
        else:
            position = position_market_value
            market_value = None
        
        players.append(player_name)
        ages.append(age)
        positions.append(position)
        market_values.append(market_value)
    
    return pd.DataFrame({
        'Position': positions,
        'Player': players,
        'Age': ages,
        'Market Value': market_values,
        'Club': club_names 
    })

# XPath of each line-up table
tables_xpaths = {
    'starting_lineup_home': '//*[@id="main"]/main/div[5]/div[1]/div/div[1]/table', 
    'substitutes_home': '//*[@id="main"]/main/div[6]/div[1]/div/div[1]/table',
    'starting_lineup_away': '//*[@id="main"]/main/div[5]/div[2]/div/div[1]/table',
    'substitutes_away': '//*[@id="main"]/main/div[6]/div[2]/div/div[1]/table'
}

all_tables_df = []

# Loop through the table paths and extract data
for key, value in tables_xpaths.items():
    table = driver.find_element(By.XPATH, value)
    rows = table.find_elements(By.TAG_NAME, "tr")
    team_type = 'Home' if 'home' in key else 'Away'
    club_name = home_club_name if 'home' in key else away_club_name
    df = extract_table_data(rows, club_name)
    df['H/A'] = team_type
    df['Status'] = 'Starting' if 'starting' in key else 'Substitute'
    all_tables_df.append(df)

# Combine all dataframes
lineups_df = pd.concat(all_tables_df, ignore_index=True)

# Convert 'Age' to int, handling missing or malformed data
lineups_df['Age'] = pd.to_numeric(lineups_df['Age'], errors='coerce').astype('Int64')


# home_club_name and away_club_name were already extracted above and are reused below

# Extract the home and away managers' names using the updated XPaths
home_manager_element = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[7]/div[1]/div/div/table/tbody/tr/td[1]/table/tbody/tr[1]/td[2]')
home_manager_name = home_manager_element.text
away_manager_element = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[7]/div[2]/div/div/table/tbody/tr/td[1]/table/tbody/tr[1]/td[2]')
away_manager_name = away_manager_element.text

# Extract additional information for both home and away teams
foreigners_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[1]').text
foreigners_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[1]/div/div[2]/table/tbody/tr/td[1]').text
avg_age_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[2]').text
avg_age_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[1]/div/div[2]/table/tbody/tr/td[2]').text
purchase_value_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[3]').text
purchase_value_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[1]/div/div[2]/table/tbody/tr/td[3]').text
total_market_value_starting_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[1]/div/div[2]/table/tbody/tr/td[4]').text
total_market_value_subs_home = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[1]/div/div[2]/table/tbody/tr/td[4]').text

foreigners_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[1]').text
foreigners_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[2]/div/div[2]/table/tbody/tr/td[1]').text
avg_age_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[2]').text
avg_age_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[2]/div/div[2]/table/tbody/tr/td[2]').text
purchase_value_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[3]').text
purchase_value_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[2]/div/div[2]/table/tbody/tr/td[3]').text
total_market_value_starting_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[5]/div[2]/div/div[2]/table/tbody/tr/td[4]').text
total_market_value_subs_away = driver.find_element(By.XPATH, '//*[@id="main"]/main/div[6]/div[2]/div/div[2]/table/tbody/tr/td[4]').text

# Function to clean the extracted data by removing the preceding label text
def clean_data(text, keep_eur_sign=False):
    if keep_eur_sign:
        # Strip the known label prefix and keep the value with its currency sign
        for prefix in ('Purchase value: ', 'Total MV: '):
            if prefix in text:
                return text.replace(prefix, '')
        return text  # Fall back to the raw text instead of implicitly returning None
    # Otherwise return the first number (optionally with decimals or a % sign)
    match = re.search(r'\d+(\.\d+)?%?', text)
    return match.group(0) if match else text

# Create a DataFrame for the club and manager information along with the newly extracted data
lineups_stats_df = pd.DataFrame({
    'Club': [home_club_name, away_club_name],
    'H/A': ['Home', 'Away'],
    'Manager': [home_manager_name, away_manager_name],
    'Foreigners Starting': [clean_data(foreigners_starting_home), clean_data(foreigners_starting_away)],
    'Foreigners Subs': [clean_data(foreigners_subs_home), clean_data(foreigners_subs_away)],
    'Avg Age Starting': [clean_data(avg_age_starting_home), clean_data(avg_age_starting_away)],
    'Avg Age Subs': [clean_data(avg_age_subs_home), clean_data(avg_age_subs_away)],
    'Purchase Value Starting': [clean_data(purchase_value_starting_home, True), clean_data(purchase_value_starting_away, True)],
    'Purchase Value Subs': [clean_data(purchase_value_subs_home, True), clean_data(purchase_value_subs_away, True)],
    'Total Market Value Starting': [clean_data(total_market_value_starting_home, True), clean_data(total_market_value_starting_away, True)],
    'Total Market Value Subs': [clean_data(total_market_value_subs_home, True), clean_data(total_market_value_subs_away, True)]
})

## WRAP UP ##

# Close the driver after scraping is done
driver.quit()

# Print a success message
print("Webscraping successfully completed")

# Display the combined DataFrame
#lineups_df.head()
#lineups_stats_df.head()


Webscraping successfully completed


The scraping yields the following two tables, **lineups_df** and **lineups_stats_df**:

In [33]:
lineups_df

| | Position | Player | Age | Market Value | Club | H/A | Status |
|---|---|---|---|---|---|---|---|
| 0 | Goalkeeper | Jérémy Frick | 30 | €500k | Servette FC | Home | Starting |
| 1 | Centre-Back | Yoan Severin | 26 | €1.10m | Servette FC | Home | Starting |
| 2 | Centre-Back | Steve Rouiller | 33 | €200k | Servette FC | Home | Starting |
| 3 | Left-Back | Bradley Mazikou | 27 | €1.00m | Servette FC | Home | Starting |
| 4 | Right-Back | Keigo Tsunemoto | 25 | €900k | Servette FC | Home | Starting |
| 5 | Central Midfield | Gaël Ondoua | 28 | €700k | Servette FC | Home | Starting |
| 6 | Central Midfield | Timothé Cognat | 25 | €2.80m | Servette FC | Home | Starting |
| 7 | Right Midfield | Dereck Kutesa | 26 | €1.70m | Servette FC | Home | Starting |
| 8 | Left Midfield | Miroslav Stevanovic | 33 | €1.50m | Servette FC | Home | Starting |
| 9 | Centre-Forward | Alexis Antunes | 23 | €1.30m | Servette FC | Home | Starting |


In [35]:
lineups_stats_df

| | Club | H/A | Manager | Foreigners Starting | Foreigners Subs | Avg Age Starting | Avg Age Subs | Purchase Value Starting | Purchase Value Subs | Total Market Value Starting | Total Market Value Subs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Servette FC | Home | René Weiler | 7 | 4 | 28.1 | 22.9 | €420k | €500k | €14.20m | €6.30m |
| 1 | FC Lugano | Away | Mattia Croci-Torti | 6 | 3 | 25.6 | 22.6 | €7.77m | €475k | €12.20m | €5.01m |
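
The open to-do item mentions storing the results in MySQL or a CSV file. A minimal sketch of the CSV option is shown here, assuming one file per table and game. The helper `save_tables`, its parameters, and the file naming scheme are hypothetical; the game ID `4089797` is taken from the page URL, and the demo frame holds two rows copied from the output above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_tables(tables, out_dir, game_id):
    """Write each DataFrame in `tables` to <out_dir>/<name>_<game_id>.csv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, df in tables.items():
        target = out_path / f"{name}_{game_id}.csv"
        # utf-8-sig keeps accented names (e.g. "Jérémy") readable when opened in Excel
        df.to_csv(target, index=False, encoding="utf-8-sig")
        written.append(target)
    return written


# Stand-in frame with two rows from the line-up output; a temporary
# directory is used so the demo does not touch the project folder.
demo = pd.DataFrame({"Player": ["Jérémy Frick", "Yoan Severin"], "Age": [30, 26]})
paths = save_tables({"lineups": demo}, tempfile.mkdtemp(), 4089797)
print(paths[0].name)
```

In the notebook itself the call would presumably look like `save_tables({"lineups": lineups_df, "lineups_stats": lineups_stats_df}, "some_folder", 4089797)`, with the folder name chosen by the user.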


In [38]:
## PAGE NAVIGATION ##
# Initialize the Chrome driver
driver = webdriver.Chrome()

# Navigate to the Transfermarkt match sheet page
driver.get('https://www.transfermarkt.com/servette-fc_fc-lugano/index/spielbericht/4089797') 

# Wait for page to load
time.sleep(2) 

# Wait for the iframe to be present and switch to it
wait = WebDriverWait(driver, 10)
iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
driver.switch_to.frame(iframe)

# Now wait for the 'Accept & continue' button to be clickable inside the iframe
accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
accept_button.click()

# Switch back to the main document
driver.switch_to.default_content()

## SCRAPING ## 

print("Ready for a little cops and robbers")


## WRAP UP ##

# Close the driver after scraping is done
driver.quit()

# Print a success message
print("Webscraping successfully completed")

# Display the combined DataFrame
#lineups_df.head()
#lineups_stats_df.head()


Ready for a little cops and robbers
Webscraping successfully completed


## Match Sheet (optional)

Page Link: https://www.transfermarkt.com/servette-fc_fc-lugano/index/spielbericht/4089797

Description: Shows match events such as goals, substitutions, and cards.


In [None]:
We aim to extract the following attributes for each Game:

Table **tbd_df**
- 



## Match Statistics (optional)

Page Link: https://www.transfermarkt.com/servette-fc_fc-lugano/statistik/spielbericht/4089797

Description: