## Web scraping EPL data from the FBref website
FBref provides comprehensive statistics for top-flight football leagues. This notebook scrapes the past three seasons of English Premier League data, including the squad details and the shooting stats.

Goals for this project:
1. Learn how to use the BeautifulSoup library to extract table contents from a website.
2. Concatenate tables that share the same columns into a final dataset that can be used for analysis and machine learning.
3. Understand HTML content in more detail.
4. Practice using the pandas, requests, and time libraries in Python.
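
As a small illustration of the first goal, the sketch below parses a hand-written HTML snippet and pulls the squad links out of a table with the same `select` and `find_all` calls the scraper relies on. The markup, team names, and hrefs are made up for the example and are not real FBref content.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a standings page; not real FBref markup.
sample_html = """
<table class="stats_table">
  <tr><td><a href="/en/squads/aaa111/Team-One-Stats">Team One</a></td></tr>
  <tr><td><a href="/en/squads/bbb222/Team-Two-Stats">Team Two</a></td></tr>
  <tr><td><a href="/en/players/ccc333/Some-Player">Some Player</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.select("table.stats_table")[0]           # first table with that CSS class
hrefs = [a.get("href") for a in table.find_all("a")]  # every link inside the table
squad_links = [h for h in hrefs if h and "/squads/" in h]

# Turn the last URL segment into a readable team name.
team_names = [h.split("/")[-1].replace("-Stats", "").replace("-", " ") for h in squad_links]

print(squad_links)
print(team_names)
```

The player link is filtered out because its href does not contain `/squads/`.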

In [10]:
# importing necessary libraries
import requests
import pandas as pd

In [11]:
# importing BeautifulSoup, which parses the HTML we download
from bs4 import BeautifulSoup

In [27]:
# we are scraping English Premier League data for the 2020 to 2022 seasons
# and collecting one DataFrame per team per season in a list
years = list(range(2022, 2019, -1))
all_matches = []
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [28]:
import time
# loop over the seasons; within each season, loop over every team's page
for year in years:
    data = requests.get(standings_url) # requests downloads the page's HTML
    soup = BeautifulSoup(data.text, "html.parser") # parse the HTML into a searchable tree
    standings_table = soup.select('table.stats_table')[0] # first table with the stats_table class

    links = [l.get("href") for l in standings_table.find_all('a')] # every link in the standings table
    links = [l for l in links if l and '/squads/' in l] # keep only the links to squad pages
    team_urls = [f"https://fbref.com{l}" for l in links] # turn the relative links into full team URLs
    
    previous_season = soup.select("a.prev")[0].get("href") # link to the previous season, used on the next iteration
    standings_url = f"https://fbref.com{previous_season}" 
    
    for team_url in team_urls:
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ") # derive a readable team name from the URL
        data = requests.get(team_url) # download the team's page
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0] # the Scores & Fixtures table
        soup = BeautifulSoup(data.text, "html.parser")
        links = [l.get("href") for l in soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l] # links to the team's shooting stats page
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(data.text, match="Shooting")[0]
        shooting.columns = shooting.columns.droplevel() # drop the top level of the two-level header
        try:
            team_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
        except ValueError:
            continue
        team_data = team_data[team_data["Comp"] == "Premier League"].copy() # keep only Premier League matches
        
        team_data["Season"] = year # add a season column to the extracted data
        team_data["Team"] = team_name # and the team name
        all_matches.append(team_data) # collect this team's matches for the season
        time.sleep(15) # pause 15 seconds between teams; the site blocked scraping even with a 10-second pause
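
The `droplevel` and `merge` steps in the loop are easier to follow on a small in-memory example. The sketch below uses made-up numbers and hypothetical header group names (`"For Team"`, `"Standard"`); it only assumes that the shooting table arrives with a two-level column header, which is why the loop calls `droplevel()` before merging.

```python
import pandas as pd

# Made-up match rows for illustration only.
matches = pd.DataFrame({
    "Date": ["2022-08-05", "2022-08-13", "2022-08-20"],
    "Comp": ["Premier League", "Premier League", "EFL Cup"],
    "Result": ["W", "W", "L"],
})

# Shooting table with a two-level header (group names here are hypothetical).
shooting = pd.DataFrame(
    [["2022-08-05", 10, 2], ["2022-08-13", 19, 7], ["2022-08-20", 8, 1]],
    columns=pd.MultiIndex.from_tuples(
        [("For Team", "Date"), ("Standard", "Sh"), ("Standard", "SoT")]
    ),
)
shooting.columns = shooting.columns.droplevel()  # keep only the inner column names

# Inner join on Date, then keep league matches and tag the season.
team_data = matches.merge(shooting[["Date", "Sh", "SoT"]], on="Date")
team_data = team_data[team_data["Comp"] == "Premier League"].copy()
team_data["Season"] = 2022

print(team_data)
```

Taking a `.copy()` after filtering means the new `Season` column is added to an independent frame rather than to a slice of the merged one.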

In [22]:
len(all_matches) # number of team-season tables scraped

41

In [23]:
match_df = pd.concat(all_matches) # concatenate them all to get the final DataFrame
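
By default `pd.concat` keeps each input frame's own row index, so the combined index restarts for every team. If a single running index is preferred, `ignore_index=True` is one option. A minimal sketch with two toy frames (the values are invented):

```python
import pandas as pd

a = pd.DataFrame({"team": ["A", "A"], "gf": [2, 4]})
b = pd.DataFrame({"team": ["B", "B"], "gf": [0, 1]})

kept = pd.concat([a, b])                      # index repeats: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # index is renumbered: 0, 1, 2, 3

print(list(kept.index))
print(list(fresh.index))
```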

In [24]:
match_df.columns = [c.lower() for c in match_df.columns] # convert column names to lowercase

In [26]:
match_df.shape

(1551, 27)

In [14]:
# now let's see how the data looks
match_df

| | date | time | comp | round | day | venue | result | gf | ga | opponent | ... | match report | notes | sh | sot | dist | fk | pk | pkatt | season | team |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-08-05 | 20:00 | Premier League | Matchweek 1 | Fri | Away | W | 2 | 0 | Crystal Palace | ... | Match Report | | 10.0 | 2.0 | 14.6 | 1.0 | 0.0 | 0.0 | 2022 | Arsenal |
| 1 | 2022-08-13 | 15:00 | Premier League | Matchweek 2 | Sat | Home | W | 4 | 2 | Leicester City | ... | Match Report | | 19.0 | 7.0 | 13.0 | 0.0 | 0.0 | 0.0 | 2022 | Arsenal |
| 2 | 2022-08-20 | 17:30 | Premier League | Matchweek 3 | Sat | Away | W | 3 | 0 | Bournemouth | ... | Match Report | | 14.0 | 6.0 | 14.8 | 0.0 | 0.0 | 0.0 | 2022 | Arsenal |
| 3 | 2022-08-27 | 17:30 | Premier League | Matchweek 4 | Sat | Home | W | 2 | 1 | Fulham | ... | Match Report | | 22.0 | 8.0 | 15.5 | 1.0 | 0.0 | 0.0 | 2022 | Arsenal |
| 4 | 2022-08-31 | 19:30 | Premier League | Matchweek 5 | Wed | Home | W | 2 | 1 | Aston Villa | ... | Match Report | | 22.0 | 8.0 | 16.3 | 1.0 | 0.0 | 0.0 | 2022 | Arsenal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38 | 2022-04-30 | 15:00 | Premier League | Matchweek 35 | Sat | Away | L | 0 | 2 | Aston Villa | ... | Match Report | | 9.0 | 3.0 | 21.6 | 0.0 | 0.0 | 0.0 | 2021 | Norwich City |
| 39 | 2022-05-08 | 14:00 | Premier League | Matchweek 36 | Sun | Home | L | 0 | 4 | West Ham | ... | Match Report | | 8.0 | 2.0 | 22.2 | 1.0 | 0.0 | 0.0 | 2021 | Norwich City |
| 40 | 2022-05-11 | 19:45 | Premier League | Matchweek 21 | Wed | Away | L | 0 | 3 | Leicester City | ... | Match Report | | 9.0 | 5.0 | 17.0 | 0.0 | 0.0 | 0.0 | 2021 | Norwich City |
| 41 | 2022-05-15 | 14:00 | Premier League | Matchweek 37 | Sun | Away | D | 1 | 1 | Wolves | ... | Match Report | | 11.0 | 2.0 | 14.4 | 0.0 | 0.0 | 0.0 | 2021 | Norwich City |


In [15]:
# let's save this data in CSV format
match_df.to_csv("EPL_Matches.csv")
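
One detail worth knowing about `to_csv`: by default it also writes the row index as an unlabeled first column, which `pd.read_csv` then loads back as a column named `Unnamed: 0`. The sketch below shows the difference on a toy frame, using in-memory buffers instead of real files; whether to keep the index is a choice that depends on how the CSV will be used.

```python
import io
import pandas as pd

df = pd.DataFrame({"team": ["A", "B"], "gf": [2, 0]})

# Default: the index is written as an unlabeled first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```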

References:
- Coursework: Data Mining
- Web scraping documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
- YouTube video: https://www.youtube.com/watch?v=XVv6mJpFOb0