### LVF Players Stats

This notebook scrapes volleyball player stats from the website of the Italian women's volleyball league (Lega Volley Femminile).

Link: https://www.legavolleyfemminile.it/
        
The scraping process is based on the Selenium and BeautifulSoup packages.

In [13]:
# import libraries
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import lxml

Since the work is done on macOS, we use the Safari browser. Its driver
can be enabled from the terminal with the "safaridriver --enable" command.

In [14]:
# create a session
driver = webdriver.Safari()

Now we need to find out how the stats are served by the site, starting from its Statistics area.
There, the stats are generated after selecting several form fields.
That is not convenient to automate, so we inspect the page elements and find the link that we need:
    
https://ww2.legavolleyfemminile.it/Statistiche_i.asp?TipoStat=2.4&Serie=1&AnnoInizio=2021&AnnoFine=2021&Ruolo=5
    
In this link we are interested in the last query parameters: "AnnoInizio" and "AnnoFine" ("StartYear" and "EndYear", which we set to the same value)
and "Ruolo" ("Position"). We want AnnoInizio >= 2018 and Ruolo in {1, 4, 5}, that is, middle blockers (1), outside hitters (4) and opposites (5).
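
As a minimal sketch, the query string of this link can also be assembled with the standard library's `urllib.parse.urlencode`. The helper name `build_stats_url` is hypothetical, and the fixed `TipoStat` and `Serie` values are simply copied from the link above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://ww2.legavolleyfemminile.it/Statistiche_i.asp'

def build_stats_url(year, position):
    """Return the stats URL for one season and one position code (hypothetical helper)."""
    params = {
        'TipoStat': '2.4',   # copied from the link above
        'Serie': 1,          # copied from the link above
        'AnnoInizio': year,  # start year of the season
        'AnnoFine': year,    # end year, set to the same value as the start year
        'Ruolo': position,   # position code: 1, 4 or 5
    }
    return BASE_URL + '?' + urlencode(params)

print(build_stats_url(2021, 5))
```

Building the URL from a dictionary keeps each parameter on its own line, which makes it easier to see what changes between requests.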

First, let's build a function to scrape stats for a given year and position.

In [15]:
def scrape(year, position):
    # build the url used to load the page
    url_fix = 'https://ww2.legavolleyfemminile.it/Statistiche_i.asp?TipoStat=2.4&Serie=1'
    url_var = '&AnnoInizio=' + str(year) + '&AnnoFine=' + str(year) + '&Ruolo=' + str(position)
    url = url_fix + url_var
    
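    # load the page (by default get() blocks until the page has loaded);
    # the implicit wait only sets how long later element lookups may poll (5 seconds)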
    driver.get(url)
    driver.implicitly_wait(5)
    
    # parse the page HTML with Beautiful Soup (lxml parser) and collect all tables
    soup = BeautifulSoup(driver.page_source, 'lxml')
    tables = soup.find_all('table')
    
    # read tables using Pandas read_html() which returns a list of dataframes
    df_list = pd.read_html(str(tables))
    
    # the stats table we are interested in is the second element in the list
    stats = df_list[1]
    
    # the column names are a bit messy (multi-level headers),
    # and we also want to drop some irrelevant columns
    cols_to_drop = [4, 5, 9, 10, 15, 16, 21, 22, 23, 25]
    stats = stats.drop(stats.columns[cols_to_drop], axis = 1)
    # rename columns
    colnames = ['Name', 'Games', 'Sets', 'Points', 'Serves', 'Aces', 'ServeErrs', 'Receptions', 'ReceptErrs', 'ReceptNeg',
                'ReceptPerf', 'Spikes', 'SpikeErrs', 'SpikesBlocked', 'SpikesPerf', 'Blocks']
    stats.columns = colnames
    
    # fill missing values with 0s
    stats = stats.fillna(0)
    
    # add year (starting year of the season) and player position to the final dataframe
    stats['Year'] = year
    if position == 1:
        stats['Position'] = "Middleblocker"
    elif position == 4:
        stats['Position'] = "OutsideHitter"
    else:
        stats['Position'] = "Opposite"
    
    # transform float columns into integer ones
    col_to_int = ['Games', 'Sets', 'Points', 'Serves', 'Aces', 'ServeErrs', 'Receptions', 'ReceptErrs', 'ReceptNeg',
                'ReceptPerf', 'Spikes', 'SpikeErrs', 'SpikesBlocked', 'SpikesPerf', 'Blocks']
    stats[col_to_int] = stats[col_to_int].astype(np.int64)
    
    return stats    
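
Note that the `else` branch in the function labels any position code other than 1 or 4 as "Opposite". A possible alternative, sketched here with the hypothetical names `POSITION_NAMES` and `position_name`, is a dictionary lookup that rejects unknown codes. The three code-to-label pairs are the ones used in the function.

```python
# Position codes and labels as used in scrape(); the names below are hypothetical.
POSITION_NAMES = {1: 'Middleblocker', 4: 'OutsideHitter', 5: 'Opposite'}

def position_name(position):
    """Translate a numeric position code into its label, failing on unknown codes."""
    try:
        return POSITION_NAMES[position]
    except KeyError:
        raise ValueError(f'unknown position code: {position}') from None

print(position_name(5))
```

Inside the function, this would replace the `if`/`elif`/`else` chain with `stats['Position'] = position_name(position)`.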

In [16]:
# test function
test_df = scrape(2021, 5)
test_df

Unnamed: 0,Name,Games,Sets,Points,Serves,Aces,ServeErrs,Receptions,ReceptErrs,ReceptNeg,ReceptPerf,Spikes,SpikeErrs,SpikesBlocked,SpikesPerf,Blocks,Year,Position
0,Egonu Paola,33,124,763,471,49,86,6,2,0,2,1298,144,30,665,48,2021,Opposite
1,Karakurt Ebrar,32,112,562,436,38,63,6,1,1,1,1162,111,72,480,44,2021,Opposite
2,Nwakalor Sylvia,28,111,553,310,6,40,12,5,2,3,1200,122,82,499,48,2021,Opposite
3,Mingardi Camilla,28,107,549,365,22,52,8,1,5,1,1223,78,76,502,25,2021,Opposite
4,Grobelna Kaja,28,105,483,397,27,55,3,2,1,0,1023,58,88,421,35,2021,Opposite
5,Gicquel Lucille,29,111,442,365,31,59,14,3,6,2,900,84,70,358,53,2021,Opposite
6,Klimets Hanna,26,94,396,299,13,24,6,3,0,2,912,55,78,344,39,2021,Opposite
7,Antropova Ekaterina,26,80,354,256,47,62,25,3,6,4,611,40,38,269,38,2021,Opposite
8,Piani Vittoria,25,76,320,278,30,55,2,0,1,1,741,67,56,280,10,2021,Opposite
9,Lippmann Louisa,30,86,275,227,8,24,3,0,0,1,630,41,34,249,18,2021,Opposite


Now that the function is built, we can extract all the data we want and collect it in a single dataframe.

In [17]:
# collect one dataframe per (year, position) combination
frames = []
for year in range(2018, 2022):
    for position in [1, 4, 5]:
        frames.append(scrape(year, position))

# concatenate everything into the final dataframe
all_stats = pd.concat(frames, ignore_index=True)

# make sure the numeric columns are stored as integers
col_to_int = ['Games', 'Sets', 'Points', 'Serves', 'Aces', 'ServeErrs', 'Receptions', 'ReceptErrs', 'ReceptNeg',
              'ReceptPerf', 'Spikes', 'SpikeErrs', 'SpikesBlocked', 'SpikesPerf', 'Blocks', 'Year']
all_stats[col_to_int] = all_stats[col_to_int].astype(np.int64)

In [18]:
# end the session and close the browser
driver.quit()

In [19]:
# check the final df
all_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           584 non-null    object
 1   Games          584 non-null    int64 
 2   Sets           584 non-null    int64 
 3   Points         584 non-null    int64 
 4   Serves         584 non-null    int64 
 5   Aces           584 non-null    int64 
 6   ServeErrs      584 non-null    int64 
 7   Receptions     584 non-null    int64 
 8   ReceptErrs     584 non-null    int64 
 9   ReceptNeg      584 non-null    int64 
 10  ReceptPerf     584 non-null    int64 
 11  Spikes         584 non-null    int64 
 12  SpikeErrs      584 non-null    int64 
 13  SpikesBlocked  584 non-null    int64 
 14  SpikesPerf     584 non-null    int64 
 15  Blocks         584 non-null    int64 
 16  Year           584 non-null    int64 
 17  Position       584 non-null    object
dtypes: int64(16), object(2)
memory

In [20]:
all_stats[all_stats['Name']=='Ungureanu Adelina']

Unnamed: 0,Name,Games,Sets,Points,Serves,Aces,ServeErrs,Receptions,ReceptErrs,ReceptNeg,ReceptPerf,Spikes,SpikeErrs,SpikesBlocked,SpikesPerf,Blocks,Year,Position
216,Ungureanu Adelina,19,35,86,105,7,8,187,19,38,74,227,20,18,74,5,2019,OutsideHitter
349,Ungureanu Adelina,21,85,290,283,19,49,443,44,95,161,651,56,39,240,31,2020,OutsideHitter
521,Ungureanu Adelina,28,77,97,193,18,36,229,29,62,59,221,15,16,69,10,2021,OutsideHitter


In [21]:
all_stats[all_stats['Name']=='Egonu Paola']

Unnamed: 0,Name,Games,Sets,Points,Serves,Aces,ServeErrs,Receptions,ReceptErrs,ReceptNeg,ReceptPerf,Spikes,SpikeErrs,SpikesBlocked,SpikesPerf,Blocks,Year,Position
111,Egonu Paola,34,122,792,443,52,78,7,4,0,3,1452,172,66,677,63,2018,Opposite
273,Egonu Paola,19,48,276,205,31,40,2,0,0,1,412,41,8,224,21,2019,Opposite
406,Egonu Paola,30,87,506,350,36,77,5,1,1,2,725,96,29,426,44,2020,Opposite
557,Egonu Paola,33,124,763,471,49,86,6,2,0,2,1298,144,30,665,48,2021,Opposite


Some other fields can be calculated, such as reception and spike efficiencies. Finally, let's save the data locally.
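
Before exporting, the efficiency fields mentioned above could be derived along these lines. The notebook does not define them, so the formulas are assumptions following one common convention: spike efficiency = (SpikesPerf - SpikeErrs - SpikesBlocked) / Spikes, and reception efficiency = (ReceptPerf - ReceptErrs) / Receptions. The helper name `add_efficiencies` and the new column names are hypothetical; the sample row reuses the Egonu Paola 2021 figures printed earlier.

```python
import numpy as np
import pandas as pd

def add_efficiencies(df):
    """Return a copy of df with assumed spike and reception efficiency columns."""
    out = df.copy()
    # replace zero attempts with NaN so the division gives NaN instead of inf
    spikes = out['Spikes'].replace(0, np.nan)
    receptions = out['Receptions'].replace(0, np.nan)
    # assumed formulas, not taken from the league website
    out['SpikeEff'] = (out['SpikesPerf'] - out['SpikeErrs'] - out['SpikesBlocked']) / spikes
    out['ReceptEff'] = (out['ReceptPerf'] - out['ReceptErrs']) / receptions
    return out

# one row copied from the scraped output, limited to the columns needed here
sample = pd.DataFrame({
    'Name': ['Egonu Paola'],
    'Receptions': [6], 'ReceptErrs': [2], 'ReceptPerf': [2],
    'Spikes': [1298], 'SpikeErrs': [144], 'SpikesBlocked': [30], 'SpikesPerf': [665],
})
print(add_efficiencies(sample)[['Name', 'SpikeEff', 'ReceptEff']])
```

On the full data this would be called as `all_stats = add_efficiencies(all_stats)`. Players with no spikes or no receptions get NaN rather than a misleading zero.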

In [22]:
# export to csv
#all_stats.to_csv("all_stats.csv")