
# Scraping TransferMarkt.us

In this notebook, I will be taking a look at www.transfermarkt.us, a website that primarily focuses on the (you guessed it) market value of football (soccer) players. It also has a bunch of other statistics: height, age, team history, win history, fixtures, etc.

I'm going to start by writing a script to scrape the data, and then I'll carry on with some analysis in the next notebook. A good way to start is to scrape a single team, so let's start with Manchester United from the English Premier League.

We need `requests` to download the page, `BeautifulSoup` to parse the HTML, and `pandas` because we aren't cavemen.

In [1]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.transfermarkt.us/manchester-united/kader/verein/985/saison_id/2020/plus/1'

response = requests.get(url, headers = {'User-Agent': 'Custom5'})
print(response.status_code)

200


A status code of 200 is what we're looking for: it means the server accepted the request and returned the page.
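
To make the meaning of other status codes a bit more concrete, here is a minimal sketch using only the standard library. The helper name `describe_status` is made up for illustration; it treats any 2xx code as a success and looks up the standard reason phrase.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (ok, phrase): whether the code is a 2xx success, plus its reason phrase."""
    ok = 200 <= code < 300
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # not a status code the standard library knows about
        phrase = "Unknown"
    return ok, phrase


for code in (200, 403, 404):
    print(code, describe_status(code))
```

With `requests` itself, `response.ok` and `response.raise_for_status()` do a similar check on the response object.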

In [2]:
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
# list of player names
player_names = []

name_tags = soup.find_all("a", {"class": "spielprofil_tooltip"})

for tag in name_tags:
    player_names.append(tag.text)
# each name appears twice in the matches, so keep every other entry
player_names = player_names[::2]

In [255]:
player_names

['David de Gea',
 'Dean Henderson',
 'Sergio Romero',
 'Lee Grant',
 'Harry Maguire',
 'Victor Lindelöf',
 'Eric Bailly',
 'Axel Tuanzebe',
 'Phil Jones',
 'Marcos Rojo',
 'Alex Telles',
 'Luke Shaw',
 'Brandon Williams',
 'Aaron Wan-Bissaka',
 'Timothy Fosu-Mensah',
 'Nemanja Matic',
 'Paul Pogba',
 'Donny van de Beek',
 'Scott McTominay',
 'Fred',
 'Bruno Fernandes',
 'Juan Mata',
 'Marcus Rashford',
 'Daniel James',
 'Mason Greenwood',
 'Jesse Lingard',
 'Facundo Pellistri',
 'Anthony Martial',
 'Edinson Cavani',
 'Odion Ighalo']

In [5]:
# list of market_value
player_value = []

value_tags = soup.find_all("td", {"class": "rechts"})

for tag in value_tags:
    edit = tag.text
    edit = edit.replace(u'\xa0', u'')
    edit = edit.replace(u'$', u'')
    player_value.append(edit)
player_value[:10]

['27.50m',
 '22.00m',
 '2.20m',
 '385Th.',
 '44.00m',
 '26.40m',
 '19.25m',
 '8.80m',
 '6.60m',
 '6.60m']
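
These values are still strings, which is awkward for analysis. Below is a rough sketch of how they could be converted to numbers. It assumes that the `m` suffix means millions and `Th.` means thousands, and that the amounts are in dollars (since the `$` sign was stripped above); `parse_market_value` is a hypothetical helper, not part of the scraper.

```python
def parse_market_value(text):
    """Convert a string such as '27.50m' or '385Th.' into a whole number.

    Returns None when the string matches neither pattern.
    """
    text = text.strip()
    if text.endswith("Th."):
        number, scale = text[:-3], 1_000
    elif text.endswith("m"):
        number, scale = text[:-1], 1_000_000
    else:
        return None
    try:
        # round() avoids floating point noise such as 2199999.9999
        return round(float(number) * scale)
    except ValueError:
        return None


for raw in ['27.50m', '385Th.', '']:
    print(repr(raw), parse_market_value(raw))
```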

In [7]:
# list of player attributes
player_attributes = []

attribute_tags = soup.find_all("td", {"class": "zentriert"})

for tag in attribute_tags:
    player_attributes.append(tag.text)
player_attributes[:20]

['1',
 'Nov 7, 1990 (30)',
 '',
 '1,92 m',
 'right',
 'Jul 1, 2011',
 '',
 'Jun 30, 2023',
 '26',
 'Mar 12, 1997 (23)',
 '',
 '1,88 m',
 'right',
 'Aug 1, 2020',
 '',
 'Jun 30, 2025',
 '22',
 'Feb 22, 1987 (33)',
 '',
 '1,92 m']

In [8]:
# team name
team_name = []

team_tag = soup.find_all("span")

for tag in team_tag:
    team_name.append(tag.text)
team = team_name[9]
team

'Manchester United'

In [10]:
# list of player positions
player_positions = []

position_tags = soup.find_all("td", {"class": "posrela"})

for tag in position_tags:
    tag = tag.text
    if 'Goalkeeper' in tag:
        tag = 'Goalkeeper'
    elif 'Centre-Back' in tag:
        tag = 'Centre-Back'
    elif 'Left-Back' in tag:
        tag = 'Left-Back'
    elif 'Right-Back' in tag:
        tag = 'Right-Back'
    elif 'Defensive Midfield' in tag:
        tag = 'Defensive Midfield'
    elif 'Central Midfield' in tag:
        tag = 'Central Midfield'
    # each branch needs "in tag"; a bare string is always truthy
    elif 'Attacking Midfield' in tag:
        tag = 'Attacking Midfield'
    elif 'Right Winger' in tag:
        tag = 'Right Winger'
    elif 'Left Winger' in tag:
        tag = 'Left Winger'
    elif 'Centre-Forward' in tag:
        tag = 'Centre-Forward'
    else:
        tag = 'Missing Position'
    player_positions.append(tag)
    
player_positions[:10]

['Goalkeeper',
 'Goalkeeper',
 'Goalkeeper',
 'Goalkeeper',
 'Centre-Back',
 'Centre-Back',
 'Centre-Back',
 'Centre-Back',
 'Centre-Back',
 'Centre-Back']
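
The chain of `if`/`elif` checks can also be written as a loop over a list of known labels, which makes it harder to forget the `in tag` part on any one branch. This is just one possible sketch; `KNOWN_POSITIONS` and `extract_position` are names invented here, and the sample cell texts are made up to resemble what a `posrela` cell might contain.

```python
KNOWN_POSITIONS = [
    'Goalkeeper', 'Centre-Back', 'Left-Back', 'Right-Back',
    'Defensive Midfield', 'Central Midfield', 'Attacking Midfield',
    'Right Winger', 'Left Winger', 'Centre-Forward',
]


def extract_position(cell_text):
    """Return the first known position label found in the cell text."""
    for position in KNOWN_POSITIONS:
        if position in cell_text:
            return position
    return 'Missing Position'


# hypothetical cell texts, for illustration only
samples = ['\nBruno Fernandes\nAttacking Midfield\n', 'Marcus Rashford Left Winger', 'no label here']
print([extract_position(s) for s in samples])
```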

In [11]:
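# each player contributes 8 "zentriert" cells, so step through the flat list by 8
player_numbers = player_attributes[::8]   # shirt number
player_birth = player_attributes[1::8]    # date of birth (age)
player_height = player_attributes[3::8]   # height
player_foot = player_attributes[4::8]     # preferred foot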
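
The slicing works because the flat list repeats in blocks of eight cells per player. An equivalent way to see this is to split the list into rows of eight first. The sketch below uses the first 16 cells from the output above; `chunk` is a small helper written for this example.

```python
def chunk(values, size):
    """Split a flat list into consecutive rows of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# first 16 cells from the output above: two players, 8 cells each
cells = ['1', 'Nov 7, 1990 (30)', '', '1,92 m', 'right', 'Jul 1, 2011', '', 'Jun 30, 2023',
         '26', 'Mar 12, 1997 (23)', '', '1,88 m', 'right', 'Aug 1, 2020', '', 'Jun 30, 2025']

for row in chunk(cells, 8):
    # positions 0, 1, 3 and 4 match the offsets used in the slices
    number, birth, height, foot = row[0], row[1], row[3], row[4]
    print(number, birth, height, foot)
```

If the page layout ever changes the number of cells per player, both this approach and the stride slicing would silently misalign, so it is worth checking that the list length is a multiple of 8.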

In [12]:
player_data = pd.DataFrame(
    {
     'name': player_names,
     'number': player_numbers,
     'position': player_positions,
     'team': team,
     'height': player_height,
     'foot': player_foot,
     'value': player_value
    })

In [282]:
player_data

Unnamed: 0,name,number,position,team,height,foot,value
0,David de Gea,1,Goalkeeper,Manchester United,"1,92 m",right,27.50m
1,Dean Henderson,26,Goalkeeper,Manchester United,"1,88 m",right,22.00m
2,Sergio Romero,22,Goalkeeper,Manchester United,"1,92 m",right,2.20m
3,Lee Grant,13,Goalkeeper,Manchester United,"1,93 m",right,385Th.
4,Harry Maguire,5,Centre-Back,Manchester United,"1,94 m",right,44.00m
5,Victor Lindelöf,2,Centre-Back,Manchester United,"1,87 m",right,26.40m
6,Eric Bailly,3,Centre-Back,Manchester United,"1,87 m",right,19.25m
7,Axel Tuanzebe,38,Centre-Back,Manchester United,"1,86 m",right,8.80m
8,Phil Jones,4,Centre-Back,Manchester United,"1,85 m",right,6.60m
9,Marcos Rojo,16,Centre-Back,Manchester United,"1,87 m",left,6.60m
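
The height column is still text with a decimal comma (`1,92 m`). A small sketch of how it could be turned into a number of metres is shown below; `parse_height` is a hypothetical helper, and it assumes every non-empty height follows the `1,92 m` pattern.

```python
def parse_height(text):
    """Convert a string such as '1,92 m' to 1.92; return None if it cannot be parsed."""
    cleaned = text.replace('m', '').replace(',', '.').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


for raw in ['1,92 m', '1,88 m', '']:
    print(repr(raw), parse_height(raw))
```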


Now, let's turn this into a function:

In [13]:
def get_team_info(team_url):
    '''
    Takes the URL of a transfermarkt.us squad page as a string and
    returns a DataFrame with one row per player: name, shirt number,
    position, team, height, foot (right- or left-footed), and market value.
    
    Note: this function only works on squad pages shown in the 'detailed'
    view, rather than 'compact' or 'gallery'.
    '''
    response = requests.get(team_url, headers = {'User-Agent': 'Custom5'})
    soup = BeautifulSoup(response.content, 'html.parser')
    
    player_value = []
    player_names = []
    player_attributes = []
    player_positions = []
    team_name = []
    
    # Player Names
    name_tags = soup.find_all("a", {"class": "spielprofil_tooltip"})
    for tag in name_tags:
        player_names.append(tag.text)
    player_names = player_names[::2]
    
    # Market Value
    value_tags = soup.find_all("td", {"class": "rechts"})
    for tag in value_tags:
        edit = tag.text
        edit = edit.replace(u'\xa0', u'')
        edit = edit.replace(u'$', u'')
        player_value.append(edit)
    
    # Team
    team_tag = soup.find_all("span")
    for tag in team_tag:
        team_name.append(tag.text)
    team = team_name[9]  

    # Player positions
    position_tags = soup.find_all("td", {"class": "posrela"})
    for tag in position_tags:
        tag = tag.text
        if 'Goalkeeper' in tag:
            tag = 'Goalkeeper'
        elif 'Centre-Back' in tag:
            tag = 'Centre-Back'
        elif 'Left-Back' in tag:
            tag = 'Left-Back'
        elif 'Right-Back' in tag:
            tag = 'Right-Back'
        elif 'Defensive Midfield' in tag:
            tag = 'Defensive Midfield'
        elif 'Central Midfield' in tag:
            tag = 'Central Midfield'
        # each branch needs "in tag"; a bare string is always truthy
        elif 'Attacking Midfield' in tag:
            tag = 'Attacking Midfield'
        elif 'Right Winger' in tag:
            tag = 'Right Winger'
        elif 'Left Winger' in tag:
            tag = 'Left Winger'
        elif 'Centre-Forward' in tag:
            tag = 'Centre-Forward'
        else:
            tag = 'Missing Position'
        player_positions.append(tag)

    # Attributes
    attribute_tags = soup.find_all("td", {"class": "zentriert"})
    for tag in attribute_tags:
        player_attributes.append(tag.text)
        
    player_numbers = player_attributes[::8]
    player_birth = player_attributes[1::8]
    player_height = player_attributes[3::8]
    player_foot = player_attributes[4::8]
    
    # Creating the dataset
    player_data = pd.DataFrame(
    {
     'name': player_names,
     'number': player_numbers,
     'position': player_positions,
     'team': team,
     'height': player_height,
     'foot': player_foot,
     'value': player_value
    })
    
    return player_data

In [15]:
test_url = 'https://www.transfermarkt.us/manchester-united/kader/verein/985/saison_id/2020/plus/1'
get_team_info(test_url).head()

Unnamed: 0,name,number,position,team,height,foot,value
0,David de Gea,1,Goalkeeper,Manchester United,"1,92 m",right,27.50m
1,Dean Henderson,26,Goalkeeper,Manchester United,"1,88 m",right,22.00m
2,Sergio Romero,22,Goalkeeper,Manchester United,"1,92 m",right,2.20m
3,Lee Grant,13,Goalkeeper,Manchester United,"1,93 m",right,385Th.
4,Harry Maguire,5,Centre-Back,Manchester United,"1,94 m",right,44.00m


It seems to be working well. Let's gather the squads for every club in the current Premier League season. Unfortunately, this requires hard-coding the links.

In [16]:
premier_league_dict = {
    'Liverpool': "https://www.transfermarkt.us/liverpool-fc/kader/verein/31/saison_id/2020/plus/1",
    'Manchester City': "https://www.transfermarkt.us/manchester-city/kader/verein/281/saison_id/2020/plus/1",
    'Chelsea': "https://www.transfermarkt.us/chelsea-fc/kader/verein/631/saison_id/2020/plus/1",
    'Manchester United': "https://www.transfermarkt.us/manchester-united/kader/verein/985/saison_id/2020/plus/1",
    'Tottenham': "https://www.transfermarkt.us/tottenham-hotspur/kader/verein/148/saison_id/2020/plus/1",
    'Arsenal': "https://www.transfermarkt.us/arsenal-fc/kader/verein/11/saison_id/2020/plus/1",
    'Everton': "https://www.transfermarkt.us/everton-fc/kader/verein/29/saison_id/2020/plus/1",
    'Leicester City': "https://www.transfermarkt.us/leicester-city/kader/verein/1003/saison_id/2020/plus/1",
    'Wolves': "https://www.transfermarkt.us/wolverhampton-wanderers/kader/verein/543/saison_id/2020/plus/1",
    'Aston Villa': "https://www.transfermarkt.us/aston-villa/kader/verein/405/saison_id/2020/plus/1",
    'West Ham': "https://www.transfermarkt.us/west-ham-united/kader/verein/379/saison_id/2020/plus/1",
    'Newcastle United': "https://www.transfermarkt.us/newcastle-united/kader/verein/762/saison_id/2020/plus/1",
    'Brighton': "https://www.transfermarkt.us/brighton-amp-hove-albion/kader/verein/1237/saison_id/2020/plus/1",
    'Southampton': "https://www.transfermarkt.us/southampton-fc/kader/verein/180/saison_id/2020/plus/1",
    'Fulham': "https://www.transfermarkt.us/fulham-fc/kader/verein/931/saison_id/2020/plus/1",
    'Crystal Palace': "https://www.transfermarkt.us/crystal-palace/kader/verein/873/saison_id/2020/plus/1",
    'Leeds United': "https://www.transfermarkt.us/leeds-united/kader/verein/399/saison_id/2020/plus/1",
    'Sheffield United': "https://www.transfermarkt.us/sheffield-united/kader/verein/350/saison_id/2020/plus/1",
    'Burnley FC': "https://www.transfermarkt.us/burnley-fc/kader/verein/1132/saison_id/2020/plus/1",
    'West Brom': "https://www.transfermarkt.us/west-bromwich-albion/kader/verein/984/saison_id/2020/plus/1"
}
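
All of the links above share the same shape: a club slug, a numeric club ID, and a season. As a possible way to cut down on the copy-pasting, here is a sketch that builds the URL from those parts. The function name `squad_url` is made up, and whether other seasons follow the same pattern is an assumption I have not checked.

```python
BASE = 'https://www.transfermarkt.us'


def squad_url(slug, club_id, season=2020):
    """Build a detailed squad page URL from a club slug, club ID and season."""
    return f'{BASE}/{slug}/kader/verein/{club_id}/saison_id/{season}/plus/1'


# slugs and IDs copied from the dictionary above
clubs = {
    'Liverpool': ('liverpool-fc', 31),
    'Manchester United': ('manchester-united', 985),
}

urls = {name: squad_url(slug, club_id) for name, (slug, club_id) in clubs.items()}
print(urls['Manchester United'])
```

The slugs and IDs still have to be collected by hand, but each club becomes a short tuple rather than a full link.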

In [17]:
list_of_prem_datasets = []

list_url = list(premier_league_dict.values())

for url in list_url:
    add_to_dataset = get_team_info(url)
    list_of_prem_datasets.append(add_to_dataset)


In [18]:
prem_dataset = pd.concat(list_of_prem_datasets)

In [19]:
prem_dataset.head()

Unnamed: 0,name,number,position,team,height,foot,value
0,Alisson,1,Goalkeeper,Liverpool FC,"1,91 m",right,88.00m
1,Adrián,13,Goalkeeper,Liverpool FC,"1,90 m",right,2.20m
2,Caoimhin Kelleher,62,Goalkeeper,Liverpool FC,"1,88 m",right,2.20m
3,Virgil van Dijk,4,Centre-Back,Liverpool FC,"1,93 m",right,88.00m
4,Joe Gomez,12,Centre-Back,Liverpool FC,"1,88 m",right,44.00m


And there we have it! We can now export this to a CSV, or whatever we want.
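
For the CSV export, `DataFrame.to_csv` does the job. The sketch below writes to an in-memory buffer so it can run without touching the disk, and uses two rows copied from the output above as a stand-in for `prem_dataset`. Passing `index=False` leaves out the row index, which otherwise comes back as an `Unnamed: 0` column when the file is read again.

```python
import io

import pandas as pd

# two rows copied from the output above, standing in for prem_dataset
sample = pd.DataFrame({
    'name': ['Alisson', 'Adrián'],
    'team': ['Liverpool FC', 'Liverpool FC'],
    'value': ['88.00m', '2.20m'],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False skips the row index column
csv_text = buffer.getvalue()
print(csv_text)
```

For the real dataset this would be something like `prem_dataset.to_csv('premier_league_2020.csv', index=False)`, where the file name is only an example.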