First we need to find the web table that we want to scrape. FBref publishes much of its data as HTML tables, so we will use that. The table below covers goal and shot creation for players in the Big 5 European leagues.

In [1]:
import pandas as pd

url_df = 'https://fbref.com/en/comps/Big5/gca/players/Big-5-European-Leagues-Stats'

df = pd.read_html(url_df)
df


[     Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0  \
                      Rk             Player             Nation   
 0                     1         Max Aarons            eng ENG   
 1                     2   Brenden Aaronson             us USA   
 2                     3    Paxten Aaronson             us USA   
 3                     4   Yunis Abdelhamid             ma MAR   
 4                     5  Salis Abdul Samed             gh GHA   
 ...                 ...                ...                ...   
 2841               2733     Lovro Zvonarek             hr CRO   
 2842               2734    Martin Ødegaard             no NOR   
 2843               2735        Milan Đurić             ba BIH   
 2844               2736        Milan Đurić             ba BIH   
 2845               2737   Mateusz Łęgowski             pl POL   
 
      Unnamed: 3_level_0 Unnamed: 4_level_0  Unnamed: 5_level_0  \
                     Pos              Squad                Comp   
 0    

So we've got the data, it's just a bit ugly: `pd.read_html` returns a list of DataFrames, one for each table it finds on the page. If we extract the first element of this list, it looks a lot better.

In [2]:
df = pd.read_html(url_df)[0]
df.head()

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,SCA,...,SCA Types,GCA,GCA,GCA Types,GCA Types,GCA Types,GCA Types,GCA Types,GCA Types,Unnamed: 25_level_0
Unnamed: 0_level_1,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,90s,SCA,...,Def,GCA,GCA90,PassLive,PassDead,TO,Sh,Fld,Def,Matches
0,1,Max Aarons,eng ENG,DF,Bournemouth,eng Premier League,24-082,2000,12.1,20,...,0,1,0.08,1,0,0,0,0,0,Matches
1,2,Brenden Aaronson,us USA,"MF,FW",Union Berlin,de Bundesliga,23-156,2000,7.7,20,...,0,3,0.39,3,0,0,0,0,0,Matches
2,3,Paxten Aaronson,us USA,MF,Eint Frankfurt,de Bundesliga,20-213,2003,1.1,1,...,0,1,0.89,1,0,0,0,0,0,Matches
3,4,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,36-180,1987,22.9,16,...,0,0,0.0,0,0,0,0,0,0,Matches
4,5,Salis Abdul Samed,gh GHA,MF,Lens,fr Ligue 1,24-000,2000,15.5,23,...,0,3,0.19,3,0,0,0,0,0,Matches
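
To make the list behaviour of `pd.read_html` concrete, here is a minimal sketch that runs offline on an inline HTML string instead of the FBref page. The two tables, their `id` values and their contents are made up for illustration, and the snippet assumes an HTML parser such as `lxml` is installed, as the original cell also requires.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; FBref's real markup will differ.
html = """
<table id="scorers">
  <thead><tr><th>Player</th><th>Goals</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>3</td></tr>
    <tr><td>Player B</td><td>1</td></tr>
  </tbody>
</table>
<table id="keepers">
  <thead><tr><th>Player</th><th>Saves</th></tr></thead>
  <tbody>
    <tr><td>Player C</td><td>7</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
print(len(tables))

# Pick a table by position...
scorers = tables[0]

# ...or narrow the search to a table with a given HTML attribute.
keepers = pd.read_html(StringIO(html), attrs={"id": "keepers"})[0]
print(keepers)
```

Indexing with `[0]` works here because the table we want happens to come first. If a page holds several tables, selecting by attribute may be more robust than relying on position.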


Better, but the resulting DataFrame has a two-level column header (a pandas `MultiIndex`). We want to flatten this header, create new column names and slightly modify the DataFrame to make it easier to read.

First we will get rid of the multi-indexing:

In [3]:
# flatten the two header levels into one string per column,
# e.g. ('SCA Types', 'Def') -> 'SCA Types Def'
df.columns = [' '.join(col).strip() for col in df.columns]

df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Unnamed: 0_level_0 Rk,Unnamed: 1_level_0 Player,Unnamed: 2_level_0 Nation,Unnamed: 3_level_0 Pos,Unnamed: 4_level_0 Squad,Unnamed: 5_level_0 Comp,Unnamed: 6_level_0 Age,Unnamed: 7_level_0 Born,Unnamed: 8_level_0 90s,SCA SCA,...,SCA Types Def,GCA GCA,GCA GCA90,GCA Types PassLive,GCA Types PassDead,GCA Types TO,GCA Types Sh,GCA Types Fld,GCA Types Def,Unnamed: 25_level_0 Matches
0,1,Max Aarons,eng ENG,DF,Bournemouth,eng Premier League,24-082,2000,12.1,20,...,0,1,0.08,1,0,0,0,0,0,Matches
1,2,Brenden Aaronson,us USA,"MF,FW",Union Berlin,de Bundesliga,23-156,2000,7.7,20,...,0,3,0.39,3,0,0,0,0,0,Matches
2,3,Paxten Aaronson,us USA,MF,Eint Frankfurt,de Bundesliga,20-213,2003,1.1,1,...,0,1,0.89,1,0,0,0,0,0,Matches
3,4,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,36-180,1987,22.9,16,...,0,0,0.0,0,0,0,0,0,0,Matches
4,5,Salis Abdul Samed,gh GHA,MF,Lens,fr Ligue 1,24-000,2000,15.5,23,...,0,3,0.19,3,0,0,0,0,0,Matches
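
It may not be obvious why both header levels are joined rather than simply dropping the top one. The sketch below uses a toy frame (made-up values, with column names taken from the output above) to show that the lower level alone contains duplicates, because `PassLive` and `Def` appear under both `SCA Types` and `GCA Types`.

```python
import pandas as pd

# Toy two-level header mirroring part of the FBref table; the row is made up.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 1_level_0", "Player"),
    ("SCA Types", "PassLive"),
    ("SCA Types", "Def"),
    ("GCA Types", "PassLive"),
    ("GCA Types", "Def"),
])
toy = pd.DataFrame([["Player A", 15, 0, 1, 0]], columns=columns)

# Keeping only the lower level leaves duplicate column names.
bottom_only = toy.columns.get_level_values(1)
print(bottom_only.is_unique)                           # False
print(bottom_only[bottom_only.duplicated()].tolist())  # ['PassLive', 'Def']

# Joining both levels keeps every name distinct.
joined = [" ".join(col).strip() for col in toy.columns]
print(len(joined) == len(set(joined)))                 # True
```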


Now we want to get rid of the 'Unnamed' bits in the headers.

In [4]:
# build a list of cleaned column names
new_columns = []
for col in df.columns:
    if 'level_0' in col:
        # 'Unnamed: 1_level_0 Player' -> 'Player' (keep the last word)
        new_col = col.split()[-1]
    else:
        new_col = col
    new_columns.append(new_col)

# rename the columns and replace missing values with 0
df.columns = new_columns
df = df.fillna(0)

df.head()

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,90s,SCA SCA,...,SCA Types Def,GCA GCA,GCA GCA90,GCA Types PassLive,GCA Types PassDead,GCA Types TO,GCA Types Sh,GCA Types Fld,GCA Types Def,Matches
0,1,Max Aarons,eng ENG,DF,Bournemouth,eng Premier League,24-082,2000,12.1,20,...,0,1,0.08,1,0,0,0,0,0,Matches
1,2,Brenden Aaronson,us USA,"MF,FW",Union Berlin,de Bundesliga,23-156,2000,7.7,20,...,0,3,0.39,3,0,0,0,0,0,Matches
2,3,Paxten Aaronson,us USA,MF,Eint Frankfurt,de Bundesliga,20-213,2003,1.1,1,...,0,1,0.89,1,0,0,0,0,0,Matches
3,4,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,36-180,1987,22.9,16,...,0,0,0.0,0,0,0,0,0,0,Matches
4,5,Salis Abdul Samed,gh GHA,MF,Lens,fr Ligue 1,24-000,2000,15.5,23,...,0,3,0.19,3,0,0,0,0,0,Matches
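
The two renaming steps above could also be folded into a single pass over the original two-level header. This is only a sketch: `flatten_header` is a hypothetical helper name, the toy row is made up, and it would have to be applied to the DataFrame as it comes out of `pd.read_html`, before the columns are joined into strings.

```python
import pandas as pd


def flatten_header(col):
    """Turn a (top, bottom) header pair into one column name.

    pandas fills a missing top-level label with 'Unnamed: ...',
    so in that case only the bottom label is kept.
    """
    top, bottom = col
    if top.startswith("Unnamed:"):
        return bottom
    return f"{top} {bottom}"


# Toy two-level header mirroring part of the FBref table; the row is made up.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("SCA", "SCA90"),
    ("GCA Types", "PassLive"),
])
toy = pd.DataFrame([[1, "Player A", 1.67, 1]], columns=columns)

# Iterating over a MultiIndex yields one tuple per column.
toy.columns = [flatten_header(col) for col in toy.columns]
print(list(toy.columns))
# ['Rk', 'Player', 'SCA SCA90', 'GCA Types PassLive']
```

Checking the top label directly avoids relying on the substring `level_0` appearing in the joined name, and it leaves multi-word lower labels intact.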


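Finally, we can see that `Age` and `Comp` are slightly hard to read: `Age` is given as years and days (e.g. `24-082`) and `Comp` carries a lowercase country prefix (e.g. `eng Premier League`). Additionally, `Pos` can list two positions if the player plays in both. We will split this into two separate columns instead.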

In [5]:
# keep only the years from the 'years-days' format, e.g. '24-082' -> '24'
df['Age'] = df['Age'].str[:2]

# 'MF,FW' -> primary position 'MF', secondary position 'FW' (empty if there is none)
df['Position_2'] = df['Pos'].str[3:]
df['Position'] = df['Pos'].str[:2]

# drop the lowercase prefix: 'eng ENG' -> 'ENG', 'eng Premier League' -> 'Premier League'
df['Nation'] = df['Nation'].str.split(' ').str.get(1)
# splitting only once keeps one-word leagues such as 'Bundesliga' intact
df['League'] = df['Comp'].str.split(' ', n=1).str.get(1)
df = df.drop(columns=['Comp', 'Rk', 'Pos', 'Matches'])

# spell out the position codes
positions = {'MF': 'Midfielder', 'DF': 'Defender', 'FW': 'Forward', 'GK': 'Goalkeeper'}
df['Position'] = df['Position'].replace(positions)
df['Position_2'] = df['Position_2'].replace(positions)

df.head()

Unnamed: 0,Player,Nation,Squad,Age,Born,90s,SCA SCA,SCA SCA90,SCA Types PassLive,SCA Types PassDead,...,GCA GCA90,GCA Types PassLive,GCA Types PassDead,GCA Types TO,GCA Types Sh,GCA Types Fld,GCA Types Def,Position_2,Position,League
0,Max Aarons,ENG,Bournemouth,24,2000,12.1,20,1.67,15,2,...,0.08,1,0,0,0,0,0,,Defender,Premier League
1,Brenden Aaronson,USA,Union Berlin,23,2000,7.7,20,2.59,17,1,...,0.39,3,0,0,0,0,0,Forward,Midfielder,Bundesliga
2,Paxten Aaronson,USA,Eint Frankfurt,20,2003,1.1,1,0.89,1,0,...,0.89,1,0,0,0,0,0,,Midfielder,Bundesliga
3,Yunis Abdelhamid,MAR,Reims,36,1987,22.9,16,0.7,12,1,...,0.0,0,0,0,0,0,0,,Defender,Ligue 1
4,Salis Abdul Samed,GHA,Lens,24,2000,15.5,23,1.48,23,0,...,0.19,3,0,0,0,0,0,,Midfielder,Ligue 1
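
One thing the steps above do not address: in the first output the row index reaches 2845 while `Rk` only reaches 2737, which suggests the table contains rows that are not players. A likely cause, and an assumption worth checking against the live page, is that the header row is repeated inside the table body. The sketch below shows one way to drop such rows and restore numeric types, using a toy frame with made-up values.

```python
import pandas as pd

# Toy frame with a repeated header row in the middle (values are made up).
toy = pd.DataFrame({
    "Player": ["Player A", "Player", "Player B"],
    "Born": ["2000", "Born", "2003"],
    "90s": ["12.1", "90s", "1.1"],
})

# Keep only real player rows, then renumber the index.
clean = toy[toy["Player"] != "Player"].reset_index(drop=True)

# The stray header strings make these columns text,
# so convert the numeric ones back explicitly.
for col in ["Born", "90s"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean)
print(clean.dtypes)
```

On the real DataFrame the same filter, `df[df['Player'] != 'Player']`, would need to run on whichever columns should be numeric. The column list here is only an example.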
