## Project: Scraping Premier League Incoming Transfer Data between 2000 and 2024 from transfermarkt.co.uk

Libraries: requests (to fetch web pages), BeautifulSoup (to navigate and extract data from the fetched pages), pandas (to put the extracted data into a DataFrame and save it locally)

This is the first part of a three-project series in which I will ultimately analyse the performances of all incoming players to find which ones had the best debut seasons. The comparison will be based on their end-of-season WhoScored ratings. Let's get started.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
# Headers for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
}

We send headers that mimic a web browser when making requests. This helps avoid being blocked by the website for bot-like behaviour; after all, a bot is just a piece of code, like this one. This particular User-Agent string identifies itself as a Chrome browser running on Linux.
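
To see what the header actually changes, here is a minimal sketch that compares the User-Agent `requests` would announce by default with the one we supply. It uses a `requests.Session`, which is not what the notebook itself does (the notebook passes `headers=` on each call); the session is only a convenient way to inspect the headers without contacting any website.

```python
import requests

# Same header dictionary as in the cell above
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
}

session = requests.Session()

# What requests would send if we supplied nothing ourselves
default_user_agent = session.headers['User-Agent']

# Override it; every request made through this session now carries our header
session.headers.update(headers)

print("Default:   ", default_user_agent)
print("Overridden:", session.headers['User-Agent'])
```

The default value names the `requests` library itself, which is an easy signal for a site to filter on. The overridden value is the browser-like string.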

In [5]:
# Initialize empty list to store player data
all_player_data = []

To store all the player and transfer data we extract, we create an empty list like the one above.

In [6]:
# Function to extract player information for incoming transfers only
def extract_player_data(team_name, team_div, season_year):
    transfer_table = team_div.find_all('table')

    # Use the first table, which holds the incoming ('In') transfers
    if transfer_table and len(transfer_table) >= 2:
        # Column labels of the 'In' table (kept for reference; named so they do not shadow the request headers)
        table_headers = [th.text.strip() for th in transfer_table[0].find('thead').find_all('th')]
        rows = transfer_table[0].find('tbody').find_all('tr')

        for row in rows:
            # Replace '-' with None (null) in each cell value
            data = [td.text.strip() if td.text.strip() != "-" else None for td in row.find_all('td')]
            player_data = [team_name, 'In', season_year] + data
            all_player_data.append(player_data)

We had to inspect the page's elements to understand its structure before writing this code, so don't worry if it looks a bit confusing. Basically, this function extracts player data for incoming transfers (*'In'*) from a given team's transfer table.
It finds the relevant table, reads its headers and rows, then processes each row to pull out the player details and transfer information.
Each player's data is stored in **all_player_data** as a list containing the team name, transfer direction, season year, and player details.
Initially, I extracted transfers in both directions, but later limited it to incoming transfers only.
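
To make the row handling easier to follow, here is a small sketch that runs the same logic on a hand-written HTML snippet instead of the live site. The markup, team name, and player names are made up and far simpler than the real page, and `parse_incoming` is a hypothetical variant of the function above that returns its rows instead of appending to a global list. It uses Python's built-in `html.parser` so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one 'In' table followed by one 'Out' table
SAMPLE_HTML = """
<div class="box">
  <table>
    <thead><tr><th>In</th><th>Age</th><th>Fee</th></tr></thead>
    <tbody>
      <tr><td>Player A</td><td>22</td><td>18.00m</td></tr>
      <tr><td>Player B</td><td>25</td><td>-</td></tr>
    </tbody>
  </table>
  <table>
    <thead><tr><th>Out</th><th>Age</th><th>Fee</th></tr></thead>
    <tbody>
      <tr><td>Player C</td><td>30</td><td>-</td></tr>
    </tbody>
  </table>
</div>
"""

def parse_incoming(team_name, team_div, season_year):
    """Return one list per incoming transfer found in team_div."""
    tables = team_div.find_all('table')
    if len(tables) < 2:
        return []
    parsed = []
    # Only the first table is read, so 'Out' rows never appear in the result
    for row in tables[0].find('tbody').find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        # A lone '-' means "no value" on the page, so store it as None
        cells = [None if cell == '-' else cell for cell in cells]
        parsed.append([team_name, 'In', season_year] + cells)
    return parsed

sample_div = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('div', class_='box')
sample_rows = parse_incoming('Example FC', sample_div, 2000)
for row in sample_rows:
    print(row)
```

Returning the rows rather than mutating `all_player_data` is only a convenience for testing on small snippets; the notebook's approach of appending to the shared list works just as well for a single run.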

In [7]:
# Iterate through each season from 2000 to 2023
for season_year in range(2000, 2024):
    # URL for both summer and winter transfers
    base_url = f"https://www.transfermarkt.co.uk/premier-league/transfers/wettbewerb/GB1/plus/?saison_id={season_year}&s_w=&leihe=1&intern=0&intern=1"

    # Request the page and set encoding
    pageTree = requests.get(base_url, headers=headers)
    pageTree.encoding = 'utf-8'  # Ensure UTF-8 encoding for the request
    soup = BeautifulSoup(pageTree.content, 'lxml', from_encoding='utf-8')  # Use 'lxml' parser with UTF-8 encoding

    # Extract teams from the page and filter out empty strings
    teams = soup.select('h2.content-box-headline a[href*="/transfers/verein/"]')
    teams_list = [(team.text.strip(), team) for team in teams if team.text.strip()]

    # Iterate through each team and extract player information for incoming transfers only
    for team_name, team_anchor in teams_list:
        
        # Using the previously selected 'team_anchor' directly instead of searching for it again and only process if the URL contains '/transfers/verein/'
        if '/transfers/verein/' in team_anchor['href']:
            team_div = team_anchor.find_parent('div', class_='box')

            # If not found, try finding a different div class (like 'responsive-table')
            if not team_div:
                print(f"Warning: No div with class 'box' found for {team_name}. Trying 'responsive-table'.")
                team_div = team_anchor.find_parent('div', class_='responsive-table')

            # Check if the team_div is found
            if team_div:
                # Extract incoming transfers only (transfer_type=0)
                extract_player_data(team_name, team_div, season_year)
            else:
                print(f"Warning: No div found for {team_name} after searching multiple classes.")

Here lies the heavy lifter. Since the transfer information we need is spread across different pages on the transfermarkt site, we need a way to open each page automatically. Luckily, the only thing that changes in each URL is the **season year**, so we can iterate through the pages by changing just that part of the URL. We tried a couple of methods and ran into different errors, which is why we use *soup.select* to ensure that each \<a> anchor has an href attribute containing part of the transfers URL.
The next step is to add the team names and anchors to **teams_list**, then look up each anchor's parent div and extract the transfer information from it. If a div cannot be found for a team, a warning message is printed.
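
The two ideas in this cell can be tried offline. The sketch below first builds the list of season URLs from the same template as the loop, then runs the same CSS selector and `find_parent` lookup on a made-up snippet. The markup is an assumption for illustration: it supposes that a team headline holds one link with no text (for example a crest image) and one with the team name, which is the situation the empty-string filter guards against.

```python
from bs4 import BeautifulSoup

# Only the season id changes between pages (template copied from the loop above)
URL_TEMPLATE = (
    "https://www.transfermarkt.co.uk/premier-league/transfers/wettbewerb/GB1/plus/"
    "?saison_id={season}&s_w=&leihe=1&intern=0&intern=1"
)
season_urls = [URL_TEMPLATE.format(season=year) for year in range(2000, 2024)]

# Hypothetical, simplified page: one team box and one unrelated box
SAMPLE_PAGE = """
<div class="box">
  <h2 class="content-box-headline">
    <a href="/example-fc/transfers/verein/1/saison_id/2000"><img alt="crest"/></a>
    <a href="/example-fc/transfers/verein/1/saison_id/2000">Example FC</a>
  </h2>
  <table></table>
  <table></table>
</div>
<div class="box">
  <h2 class="content-box-headline"><a href="/news/">Latest news</a></h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_PAGE, 'html.parser')

# The href*= condition keeps only links that point at a team's transfer page
anchors = soup.select('h2.content-box-headline a[href*="/transfers/verein/"]')

# Drop links that carry no visible text
named = [(a.get_text(strip=True), a) for a in anchors if a.get_text(strip=True)]

# Walk up from the anchor to the box that contains the team's tables
team_name, team_anchor = named[0]
team_div = team_anchor.find_parent('div', class_='box')

print(len(season_urls), "season pages")
print(team_name, "has", len(team_div.find_all('table')), "tables in its box")
```

Because the selector already filters on the href, the later `'/transfers/verein/' in team_anchor['href']` check in the notebook acts as a second safety net and does not select anything new.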

In [8]:
# Create a DataFrame from the collected player data
columns = ['Team', 'Transfer Direction', 'Year', 'Player', 'Age', 'Nationality', 'Position', 'Short Position',
           'Market Value', 'Left Team', 'Left Team Flag', 'Fee']
df = pd.DataFrame(all_player_data, columns=columns)

# Display the DataFrame
print(df)

# Export the DataFrame to a CSV file with UTF-8 encoding, treating None as empty strings in the CSV
df.to_csv('20Years_PL_transfers.csv', index=False, encoding='utf-8')

                          Team Transfer Direction  Year  \
0                 Leeds United                 In  2000   
1                 Leeds United                 In  2000   
2                 Leeds United                 In  2000   
3                 Leeds United                 In  2000   
4                 Leeds United                 In  2000   
...                        ...                ...   ...   
10219  Wolverhampton Wanderers                 In  2023   
10220  Wolverhampton Wanderers                 In  2023   
10221  Wolverhampton Wanderers                 In  2023   
10222  Wolverhampton Wanderers                 In  2023   
10223  Wolverhampton Wanderers                 In  2023   

                          Player Age Nationality            Position  \
0      Rio FerdinandR. Ferdinand  22                     Centre-Back   
1      Olivier DacourtO. Dacourt  25              Defensive Midfield   
2           Mark VidukaM. Viduka  24                  Centre-Forward   
3  
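
The cell above names twelve columns, so every list in **all_player_data** needs exactly twelve values: the three added by the function plus nine table cells. A row of any other width may raise an error or leave the frame misaligned. The sketch below shows one possible guard, using a shortened, made-up column list and made-up rows; it also writes the CSV to an in-memory buffer to show how a `None` value ends up in the file.

```python
import io
import pandas as pd

# Shortened, hypothetical column list and rows, for illustration only
columns = ['Team', 'Transfer Direction', 'Year', 'Player', 'Fee']
rows = [
    ['Example FC', 'In', 2000, 'Player A', '18.00m'],
    ['Example FC', 'In', 2000, 'Player B', None],
    ['Example FC', 'In', 2000, 'Player C'],  # one cell short
]

# Keep only rows whose width matches the column list, and count the rest
well_formed = [row for row in rows if len(row) == len(columns)]
skipped = len(rows) - len(well_formed)

demo_df = pd.DataFrame(well_formed, columns=columns)

# Write to memory instead of disk so the result can be inspected directly
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(f"Skipped {skipped} malformed row(s)")
print(csv_text)
```

In the CSV text, the missing fee for Player B appears as an empty field after the last comma, which is how `to_csv` writes missing values by default.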

Assigning the appropriate column names, then creating, viewing, and saving the DataFrame is the final step in this process. I hope you had some fun doing this. See you soon!