<div style="text-align: center; background-color: #0A6EBD; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
    FIT-HCMUS, VNU-HCM 
    <br>
    Introduction To Data Science 
    <br>
    Final project 📌
</div>

<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 01 - Data collecting 📌
</div>

## 1. Data source

We collected data for this project from two sources:
- [FBref](https://fbref.com/en/): A website that publishes football statistics season by season. We chose it because it covers a wide range of data, from general to detailed and from team level to player level. From this site we scrape the standard stats for the latest 10 seasons of every player in the 2023/2024 Premier League, because we think these stats provide enough information for the analysis and prediction in our project. We scrape the site by sending requests to its server and parsing the returned HTML.
- [Transfermarkt](https://www.transfermarkt.com/): Another website that publishes football data season by season. We chose it because it has player injury data, which is a necessary part of our project. From this site we scrape the injury data for the latest 10 seasons of every player in the 2023/2024 Premier League. We scrape it the same way, by sending requests to its server and parsing the returned HTML.
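
Both sources are scraped by sending one request per page, so a single failed request can interrupt a long crawl. The sketch below is not part of the notebook's pipeline: `fetch_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical names used only to illustrate retrying a request with a growing pause. The fetch function is injected so the example runs without network access.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[str], str], url: str, retries: int = 3,
                     delay: float = 3.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(url); on failure wait delay * attempt seconds and try again."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Hypothetical stand-in for a real request: fails twice, then returns HTML.
calls = []

def flaky_fetch(url: str) -> str:
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html><body>ok</body></html>"


pauses = []  # record the pauses instead of actually sleeping
html = fetch_with_retry(flaky_fetch, "https://example.com/page", sleep=pauses.append)
print(html)
print(pauses)
```

In the notebook, the injected function could be something like `lambda u: requests.get(u, headers=headers).text`, with the default `time.sleep` left in place.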

## 2. Set up environment

In [1]:
import requests
import pandas as pd 
from bs4 import BeautifulSoup
import re
import time


## 3. Crawl data

### FBREF

Generate the list of clubs in the league on FBref

In [3]:
def generate_club_list_fbref(url,league_path):
    url_league = url+league_path
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_league,headers=headers)
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("",res.text),'html.parser')
    #Find the table which contains general information about clubs in the league 
    tables = soup.find_all("table",{"id":"results2023-202491_overall"})
    club_list = []
    #Iterate through each row to get the club's link in the website
    for t in tables:
        rows = t.select("tbody tr:not(.thead)")
        for row in rows :
            line = row.find("td",{"data-stat":"team"})
            for a in line.find_all('a', href=True):
                club_list.append(a['href'])
    time.sleep(3)
    return club_list
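
The `re.compile("<!--|-->")` step in the function above removes HTML comment delimiters before parsing, presumably because some tables on the page are shipped inside comments and would otherwise be invisible to the parser. A minimal sketch of that step on a made-up HTML snippet (the snippet and the `strip_comment_delimiters` name are illustrative assumptions, not FBref's real markup):

```python
import re

# Same pattern as in the notebook: match either comment delimiter.
COMMENT_DELIMS = re.compile("<!--|-->")


def strip_comment_delimiters(html: str) -> str:
    """Drop '<!--' and '-->' so commented-out markup becomes regular markup."""
    return COMMENT_DELIMS.sub("", html)


# Made-up page fragment with a table hidden inside a comment.
sample_html = '<div><!-- <table id="demo"><tr><td>Arsenal</td></tr></table> --></div>'
cleaned = strip_comment_delimiters(sample_html)
print(cleaned)
```

Note that this also exposes any genuine comments on the page as text, which is harmless here because the code only looks up tables by `id`.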

Generate the list of players for each team in the league on FBref

In [4]:
def generate_player_list_fbref(url,club_path):
    url_club = url+club_path
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_club,headers=headers)
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("",res.text),'html.parser')
    #Find the table which contains general information about players in the club
    tables = soup.find_all("table",{"id":"stats_standard_9"})
    player_list = []
    #Iterate through each row to get the player's link in the website
    for t in tables:
        rows = t.select("tbody tr:not(.thead)")
        for row in rows :
            line = row.find("th",{"data-stat":"player"})
            for a in line.find_all('a', href=True):
                player_list.append(a['href'])
    time.sleep(3)
    return player_list

Crawl data for each player

In [5]:
def crawl_player_fbref(url,player_path):
    url_player = url+player_path
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_player,headers=headers)
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("",res.text),'html.parser')
    #Find the table which contains standard stats about players in their career
    tables = soup.find_all("table",{"id":"stats_standard_dom_lg"})
    if len(tables) == 0 : 
        return None
    table =tables[0]
    #Get the name of columns in DataFrame
    header = ['Name'] + [element.getText() for element in table.find("thead").findAll("tr")[1].findAll("th")]
    rows = table.find("tbody").find_all("tr")
    #Get the latest 10 seasons in their career 
    if len(rows) <= 10:
        selected_rows = rows
    else:
        selected_rows = rows[-10:]
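    data = []
    #Get the player's name once; it is the same for every season row
    player_name = [soup.find("h1").find("span").get_text()]
    #Iterate over each row to build the records for the DataFrame
    for row in selected_rows:
        season_id = [row.find('th').getText()]
        #Use `cell` rather than `data` so the loop variable does not shadow the result list
        line = [cell.getText() for cell in row.findAll("td")]
        data.append(player_name + season_id + line)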
    df = pd.DataFrame(data=data,columns=header)
    time.sleep(3)
    return df
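
The "latest 10 seasons" selection in the function above can also be expressed as a single slice, because a negative slice on a list shorter than the window simply returns the whole list. A small sketch, assuming a hypothetical helper name `latest_rows` and made-up season labels:

```python
def latest_rows(rows, n=10):
    """Return the last n items of rows, or all of them if there are fewer than n."""
    # Guard n <= 0 explicitly: rows[-0:] would return the whole list.
    return rows[-n:] if n > 0 else []


# Made-up season labels, oldest first.
seasons = [f"{year}-{year + 1}" for year in range(2010, 2024)]
recent = latest_rows(seasons)
short = latest_rows(["2022-2023", "2023-2024"])
print(recent)
print(short)
```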

The process of scraping data from Fbref

In [6]:
def crawl_fbref(url,league_path):
    club_list = generate_club_list_fbref(url,league_path)
    player_list =[]
    for club in club_list:
        club_player_list = generate_player_list_fbref(url,club)
        player_list = player_list+club_player_list
    main_df = crawl_player_fbref(url,player_list[0])
    for player in player_list[1:]:
        sub_df = crawl_player_fbref(url,player)
        if sub_df is None : continue
        try:
            main_df = pd.concat([main_df, sub_df], ignore_index=True, axis=0)   
        except pd.errors.InvalidIndexError:
            print("Skipping player because player doesn't have data in first team")
    return main_df

In [7]:
url = "https://fbref.com"
league_path = "/en/comps/9/Premier-League-Stats"
df_fbref = crawl_fbref(url,league_path)

Skipping player because player doesn't have data in first team

### Transfermarkt

Check whether a season falls within the scraping window (the current season is 23/24)

In [2]:
def check_season(season):
    #Season labels look like "23/24"; keep those whose starting year is 13 or later
    return int(season[:2]) >= 13
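
`check_season` compares only the two-digit starting year, so a label such as "99/00" would also pass (99 is not below 13). If older seasons can appear in the injury table, one option is to expand the label to a four-digit year first. The sketch below is an assumption-laden variant, not the notebook's method: `season_start_year`, `in_scraping_window`, the `pivot` rule, and the 2013 cutoff (chosen to mirror the threshold of 13 above) are all illustrative.

```python
def season_start_year(season: str, pivot: int = 50) -> int:
    """Expand the first two digits of a label such as "23/24" to a four-digit year.

    Assumption: two-digit years below `pivot` mean 20xx, the rest mean 19xx.
    """
    yy = int(season.strip()[:2])
    return 2000 + yy if yy < pivot else 1900 + yy


def in_scraping_window(season: str, first_start_year: int = 2013) -> bool:
    """True if the season starts in first_start_year or later."""
    return season_start_year(season) >= first_start_year


for label in ["23/24", "13/14", "12/13", "99/00"]:
    print(label, in_scraping_window(label))
```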

Generate list of clubs in Transfermarkt

In [3]:
def generate_club_list_trans(url,league_path):
    url_league = url+league_path
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_league,headers=headers)
    soup = BeautifulSoup(res.text,'html.parser')
    #Find the table which contains general information about clubs in the league
    tables = soup.find_all("table",{"class":"items"})

    club_list = []
    table = tables[0]
    rows = table.select("tbody tr:not(.thead)")
    #Iterate through each row to get the club's link in the website
    for row in rows :
        line = row.find("td",{"class":"hauptlink no-border-links"})
        for a in line.find_all('a', href=True):
            if a['href'] != '#':
                club_list.append(a['href'])
    time.sleep(3)
    return club_list

Generate the list of players for each club in Transfermarkt

In [4]:
def generate_player_list_trans(url,club_path):
    url_club = url+club_path
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_club,headers=headers)
    soup = BeautifulSoup(res.text,'html.parser')
    #Find the table which contains general information about players in the club
    tables = soup.find_all("table",{"class":"items"})
    table =tables[0]
    player_list = []
    rows = table.select("tbody tr")
    #Iterate through each row to get the player's link in the website
    for row in rows:
        line = row.find("td",{"class":"hauptlink"})
        if line is not None : 
            a_tag = line.find('a', href=True)
            if a_tag and a_tag['href'] not in player_list:
                player_list.append(a_tag['href'])
    time.sleep(3)
    return player_list

Scrape one page of injury data for a player

In [5]:
def crawl_injury_data_player_page(url,player_path):
    url_player = url+player_path
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_player,headers=headers)
    soup=BeautifulSoup(res.text, "html.parser")
    name_tag = soup.find("div",{"class":"data-header__headline-container"})
    #Find the name of the player
    name = name_tag.find("h1",{"class":"data-header__headline-wrapper"}).getText()
    name = re.sub(r"[^\w\s]", "", name)
    name = re.sub(r"\s+", " ", name)
    name = re.sub(r"\d+", "", name)
    name = [name.strip()]
    #Find the table which contains injury data 
    tables = soup.find_all("table",{"class":"items"})
    if (len(tables) == 0) : 
        print("There is no injury data for this player ")
        return None
    table = tables[0]
    #Get the name of columns in DataFrame
    header = ['Name'] + [element.getText() for element in table.find("thead").findAll("tr")[0].findAll("th")]
    full_data =[]
    rows = table.select("tbody tr")
    #Iterate over each row to build the records for the DataFrame
    for row in rows:
        line = [cell.getText() for cell in row.findAll("td")]
        #Skip rows without cells and seasons outside the scraping window
        if not line or not check_season(line[0]) : continue
        line = name + line
        full_data.append(line)
    df = pd.DataFrame(data=full_data,columns=header)
    time.sleep(3)
    return df

Scrape injury data for a player across all pages

In [6]:
def crawl_injury_data_player(url,player_path):
    #Change page from Profile to Injury History  
    player_path = player_path.replace("profil", "verletzungen")
    url_player = url+player_path
    headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
    #Send the request to the website
    res = requests.get(url_player,headers=headers)
    soup=BeautifulSoup(res.text, "html.parser")
    #Find the number of pages in the table
    page = soup.find("div",{"class":"pager"})
    #Scrape injury data for all pages
    if page is None : 
        return crawl_injury_data_player_page(url,player_path)
    page_link = page.find_all("li",{"class":"tm-pagination__list-item"})
    links =[]
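    for link in page_link :
        line = link.find('a', href=True)
        #Skip pagination items without a link and links already collected
        if line is None or line['href'] in links:
            continue
        links.append(line['href'])
    #Fall back to the single-page scrape if the pager exposes no links
    if not links:
        return crawl_injury_data_player_page(url,player_path)
    main_df = crawl_injury_data_player_page(url,links[0])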
    for link in links[1:]:
        sub_df = crawl_injury_data_player_page(url,link)
        if sub_df is None : 
            continue
        main_df= pd.concat([main_df,sub_df],ignore_index=True,axis = 0)
    return main_df
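
Two small steps in the function above are easy to check in isolation: switching a profile path to the injury-history path, and collecting pagination links without duplicates while keeping their order. A sketch with hypothetical helper names and a made-up player path (the real paths come from the club pages):

```python
def to_injury_path(profile_path: str) -> str:
    """Swap the 'profil' path segment for 'verletzungen' (injury history).

    Matching the segment with its slashes avoids touching a player slug
    that happens to contain the letters 'profil'.
    """
    return profile_path.replace("/profil/", "/verletzungen/", 1)


def unique_in_order(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Made-up path and pagination links, for illustration only.
profile = "/example-player/profil/spieler/12345"
injury = to_injury_path(profile)
pager_links = [
    "/example-player/verletzungen/spieler/12345/page/1",
    "/example-player/verletzungen/spieler/12345/page/2",
    "/example-player/verletzungen/spieler/12345/page/1",
]
pages = unique_in_order(pager_links)
print(injury)
print(pages)
```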

The process of scraping injury data from Transfermarkt

In [7]:
def crawl_transfermarkt(url,league_path):
    club_list = generate_club_list_trans(url,league_path)
    player_list =[]
    for club in club_list:
        club_player_list = generate_player_list_trans(url,club)
        player_list = player_list+club_player_list
    main_df = crawl_injury_data_player(url,player_list[0])
    for player in player_list[1:]:
        sub_df = crawl_injury_data_player(url,player)
        if sub_df is None : continue
        main_df = pd.concat([main_df, sub_df], ignore_index=True, axis=0)   
    return main_df

In [8]:
url_trans= "https://www.transfermarkt.com"
league_path_trans = "/premier-league/startseite/wettbewerb/GB1"
df_trans = crawl_transfermarkt(url_trans,league_path_trans)

There is no injury data for this player 

## 4. Clean and save data

In [16]:
#Save the fbref data in csv
df_fbref.to_csv('../../data/raw/raw_data_fbref.csv',index=False,encoding="utf-8")

In [9]:
#Save the Transfermarkt data in csv
df_trans.to_csv('../../data/raw/raw_data_trans.csv',index=False,encoding="utf-8")