# Transfermarkt Data Extraction

## General Information

**Objective**
The purpose of this Jupyter Notebook is to automate the process of extracting data from various pages on Transfermarkt. This data forms the foundation of our project with Servette FC, where the overarching goal is to analyze the club and its processes.

**Scope**
We will scrape relevant data that relates to the team's composition, players' performance, coaching strategies, and other KPIs identified in our data catalog. This information will include, but is not limited to, player statistics, team values, and match outcomes.

**Methodology**
The extraction process will involve:
- Defining the URLs of the pages containing the necessary data.
- Using web scraping techniques to retrieve the data from these pages.
- Cleaning and structuring the data into a format suitable for analysis.
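
As a minimal sketch of the first step, the squad page address used later in this notebook can be assembled from its parts. The helper `squad_url` and its parameters are hypothetical names introduced for illustration; the URL pattern is read off the Servette FC address, and whether it holds for other clubs or seasons is an assumption.

```python
# Hypothetical helper: builds the squad page URL from its parts.
# The pattern is copied from the Servette FC address used in this notebook;
# it is assumed, not verified, to work for other clubs and seasons.
BASE_URL = "https://www.transfermarkt.com"


def squad_url(club_slug: str, club_id: int, season: int, detailed: bool = True) -> str:
    """Return the squad ("kader") page URL for one club and season."""
    url = f"{BASE_URL}/{club_slug}/kader/verein/{club_id}/saison_id/{season}"
    # "/plus/1" is the suffix that appears in the detailed-view address
    return f"{url}/plus/1" if detailed else url


print(squad_url("servette-fc", 61, 2023))
```

Keeping the club ID and season as parameters would make it easier to loop over several seasons later without editing the address by hand.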

**Usage**
The collected data will be merged into a central dataset, which will be used for detailed analysis and to derive actionable insights. These insights aim to support Servette FC in strategic decision-making to enhance player development and increase the club's chances of winning titles.

**Note**
This notebook is part of a larger project, and the data extracted herein will be handled in accordance with the project's data governance and privacy policies.

## Transfermarkt Pages

Before we can delve into the data analysis for Servette FC, we need a systematic approach to data extraction. Therefore, we will navigate through each relevant page of the Transfermarkt website in the following subchapters to collect comprehensive player data.


### Detailed Player Information

Page: https://www.transfermarkt.com/servette-fc/kader/verein/61/saison_id/2023/plus/1

We aim to extract the following attributes for each player:
- Jersey number (#)
- Name (Player)
- Date of Birth (with Age)
- Nationality
- Height
- Preferred foot (Foot)
- Date joined the team (Joined)
- Contract expiration (Contract)
- Current market value (Market value)
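
Several of these attributes arrive as formatted text, for example `Apr 5, 1991 (32)`, `1,97m`, and `€500k` in the scraped output of this notebook. The sketch below shows one possible way to turn such strings into typed values. The function names are hypothetical, the `m` (million) suffix for market values is an assumption, and month parsing assumes an English locale.

```python
import re
from datetime import date, datetime


def parse_birth(text: str) -> tuple[date, int]:
    """Split 'Apr 5, 1991 (32)' into a birth date and the age in brackets."""
    match = re.fullmatch(r"(.+?)\s*\((\d+)\)", text.strip())
    if match is None:
        raise ValueError(f"unexpected date of birth format: {text!r}")
    # %b relies on English month abbreviations (locale assumption)
    born = datetime.strptime(match.group(1), "%b %d, %Y").date()
    return born, int(match.group(2))


def parse_height(text: str) -> float:
    """Convert '1,97m' (decimal comma, trailing unit) to metres."""
    return float(text.strip().rstrip("m").replace(",", "."))


def parse_market_value(text: str) -> int:
    """Convert '€500k' to euros; the 'm' suffix for millions is assumed."""
    match = re.fullmatch(r"€([\d.]+)([km])?", text.strip())
    if match is None:
        raise ValueError(f"unexpected market value format: {text!r}")
    factor = {"k": 1_000, "m": 1_000_000, None: 1}[match.group(2)]
    return round(float(match.group(1)) * factor)


print(parse_birth("Apr 5, 1991 (32)"))
print(parse_height("1,97m"))
print(parse_market_value("€500k"))
```

Raising `ValueError` on unexpected input is a deliberate choice here: it surfaces layout changes on the page instead of silently producing wrong numbers.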


In [8]:
!pip install --upgrade selenium webdriver_manager







In [13]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the Chrome driver using WebDriver Manager
# Note that Service object is used with webdriver.Chrome
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Navigate to the detailed player information page on Transfermarkt
driver.get('https://www.transfermarkt.com/servette-fc/kader/verein/61/saison_id/2023/plus/1')
time.sleep(2)  # Wait for the detailed page to load

# Wait for the iframe to be present and switch to it if necessary
# Code for handling the iframe is omitted, assuming it is not needed for the detailed page

# Find the table containing the detailed player information
table = driver.find_element(By.XPATH, '//*[@id="yw1"]/table')

# Initialize a list to store all rows of the table
table_data = []

# Locate all rows of the table
rows = table.find_elements(By.TAG_NAME, "tr")

# Loop through each row in the table
for row in rows:
    # Get the text of each cell in the row
    row_data = [td.text for td in row.find_elements(By.TAG_NAME, "td")]
    table_data.append(row_data)

# Convert the list of rows to a pandas DataFrame
df = pd.DataFrame(table_data)

# Each player row yields 13 cells because the cells of the nested name/position
# table are counted too; shorter rows (header and nested rows) are padded with None.
# Supplying only 10 names therefore raises "Length mismatch", so keep the complete
# rows and select the wanted cells by position before naming them.
df = df.dropna().reset_index(drop=True)
df = df[[0, 3, 4, 5, 6, 7, 8, 9, 11, 12]]
df.columns = ['Number', 'Player', 'Position', 'Date of Birth', 'Nationality', 'Height', 'Foot', 'Joined', 'Contract', 'Market value']
# Note: the Nationality cells come back as empty text (the page shows flags as images)

# Print the DataFrame to verify the contents
print(df)

# Close the driver after scraping is done
driver.quit()

# Print a success message
print("Webscraping successfully completed")

In [15]:
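# Variant of the row loop that skips empty cells.
# Run it while the browser session is still open (before driver.quit()),
# and start from an empty list so rows are not appended twice.
table_data = []
for row in rows:
    # Get the stripped text of each non-empty cell in the row
    row_data = [td.text.strip() for td in row.find_elements(By.TAG_NAME, "td") if td.text.strip() != ""]

    # Only keep rows that contain at least one non-empty cell
    if row_data:
        table_data.append(row_data)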



### YouTube video - some tinkering of my own

In [16]:
from selenium import webdriver
from selenium.webdriver.common.by import By  # for locating elements

In [37]:
# Start the browser
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Navigate to the detailed squad page on Transfermarkt
driver.get('https://www.transfermarkt.com/servette-fc/kader/verein/61/saison_id/2023/plus/1') 
time.sleep(2)  # Wait for page to load

# Wait for the iframe to be present and switch to it
wait = WebDriverWait(driver, 10)
# Replace the ID with the one found in DevTools (search for the iframe tag); usually only the number changes
iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
driver.switch_to.frame(iframe)

# Now wait for the 'Akzeptieren' button to be clickable inside the iframe
# Alternative XPath: //*[@id="notice"]/div[3]/div[1]/div/button
accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
accept_button.click()

# Switch back to the main document
driver.switch_to.default_content()

In [44]:
# Start the browser
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Navigate to the detailed squad page on Transfermarkt
driver.get('https://www.transfermarkt.com/servette-fc/kader/verein/61/saison_id/2023/plus/1') 
time.sleep(2)  # Wait for page to load

# Wait for the iframe to be present and switch to it
wait = WebDriverWait(driver, 10)
# Replace the ID with the one found in DevTools (search for the iframe tag); usually only the number changes
iframe = wait.until(EC.presence_of_element_located((By.ID, "sp_message_iframe_953358")))
driver.switch_to.frame(iframe)

# Now wait for the 'Akzeptieren' button to be clickable inside the iframe
accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'accept')]")))
accept_button.click()

# Switch back to the main document
driver.switch_to.default_content()


# Find the table by its XPath or CSS Selector
table = driver.find_element(By.XPATH, '//*[@id="yw1"]/table')

# Initialize a list to store all rows of the table
table_data = []

# Locate all rows of the table
rows = table.find_elements(By.TAG_NAME, "tr")

# Loop through each row in the table
for row in rows:
    # Get the text of each cell in the row
    row_data = [td.text for td in row.find_elements(By.TAG_NAME, "td")]
    table_data.append(row_data)

# Convert the list of rows to a pandas DataFrame
df = pd.DataFrame(table_data)

# Optionally drop unneeded columns by position (0-indexed; adjust indices as needed)
#df.drop(df.columns[[1,5,8,9]], axis=1, inplace=True)

# Rename the remaining nine columns
#df.columns = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

# Drop rows where all the elements are 'None'
df.dropna(how='all', inplace=True)

# Close the driver after scraping is done
driver.quit()

# Print a success message
print("Webscraping successfully completed")
df

Webscraping successfully completed


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Joël Mall\nGoalkeeper | | Joël Mall | Goalkeeper | Apr 5, 1991 (32) | | 1,97m | right | Jul 1, 2023 | | Jun 30, 2025 | €500k |
| 2 | | Joël Mall | | | | | | | | | | | |
| 3 | Goalkeeper | | | | | | | | | | | | |
| 4 | 32 | Jérémy Frick \nGoalkeeper | | Jérémy Frick | Goalkeeper | Mar 8, 1993 (30) | | 1,92m | left | Jul 1, 2016 | | Jun 30, 2027 | €500k |
| 5 | | Jérémy Frick | | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 86 | | Enzo Crivelli | | | | | | | | | | | |
| 87 | Centre-Forward | | | | | | | | | | | | |
| 88 | - | Alexandre Patrício\nCentre-Forward | | Alexandre Patrício | Centre-Forward | Feb 17, 2004 (19) | | 1,79m | right | Aug 26, 2022 | | Jun 30, 2025 | €100k |
| 89 | | Alexandre Patrício | | | | | | | | | | | |
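
The scraped output mixes complete player rows (13 cells) with short rows that come from the nested name/position table. A minimal sketch of one way to keep only the player rows and label the wanted cells is shown here, using rows copied from the output. The `COLUMNS` mapping and the `tidy` helper are hypothetical names, and the assumption that every player row has exactly 13 cells is taken from this one page only.

```python
# Sample rows copied from the scraped output: one complete player row
# followed by the two short rows produced by the nested name/position table.
raw_rows = [
    ["1", "Joël Mall\nGoalkeeper", "", "Joël Mall", "Goalkeeper",
     "Apr 5, 1991 (32)", "", "1,97m", "right", "Jul 1, 2023", "",
     "Jun 30, 2025", "€500k"],
    ["", "Joël Mall"],
    ["Goalkeeper"],
]

# Position of each wanted cell within a 13-cell player row (hypothetical mapping
# derived from the output; empty cells such as position 6 are left out)
COLUMNS = {
    0: "Number", 3: "Player", 4: "Position", 5: "Date of Birth",
    7: "Height", 8: "Foot", 9: "Joined", 11: "Contract", 12: "Market value",
}


def tidy(rows, expected_cells=13):
    """Keep only complete player rows and label the wanted cells."""
    return [
        {name: row[index] for index, name in COLUMNS.items()}
        for row in rows
        if len(row) == expected_cells
    ]


players = tidy(raw_rows)
print(players[0]["Player"], players[0]["Market value"])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(players)` if a DataFrame is preferred.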


In [30]:
from bs4 import BeautifulSoup

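# Parse the rendered page; run this while the browser session is still open
page_source = BeautifulSoup(driver.page_source, 'html.parser')

# find() expects a tag name, not a CSS selector, so it returned None here.
# select() accepts CSS selectors and returns a list; the child combinators
# keep only the top-level rows and skip the rows of nested tables.
rows = page_source.select('table.items > tbody > tr')
rows[2]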