### LVF Players Stats

This notebook scrapes volleyball player stats from the Italian women's volleyball league website.

Link: https://www.legavolleyfemminile.it/
        
The scraping process is based on the Selenium and BeautifulSoup packages.

In [4]:
# import libraries
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import lxml

Since the work is done on macOS, we use the Safari browser. Its driver
can be enabled from the terminal using the "safaridriver --enable" command.

In [3]:
# create a session
driver = webdriver.Safari()

Now we need to find out how the stats are served by the site, starting from its Statistics area.
There, the stats are generated by selecting several fields in a form.
That is not convenient for scraping, so we inspect the page elements and find the link that we need:
    
https://ww2.legavolleyfemminile.it/Statistiche_i.asp?TipoStat=2.4&Serie=1&AnnoInizio=2021&AnnoFine=2021&Ruolo=5
    
In this link we are interested in the last parameters: "AnnoInizio" and "AnnoFine" (start year and end year)
and "Ruolo" (position). We will be interested in AnnoInizio >= 2018 and Ruolo in {1, 4, 5}.
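
The query-string layout described above can be sketched with the standard library alone. This is only an illustration, assuming the parameter names seen in the example link; `build_stats_url` and `read_stats_params` are hypothetical helper names that do not appear elsewhere in this notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address taken from the example link above
BASE_URL = "https://ww2.legavolleyfemminile.it/Statistiche_i.asp"


def build_stats_url(start_year, end_year, position):
    """Assemble a stats URL from its query parameters (hypothetical helper)."""
    params = {
        "TipoStat": "2.4",         # fixed value copied from the example link
        "Serie": 1,                # fixed value copied from the example link
        "AnnoInizio": start_year,  # start year
        "AnnoFine": end_year,      # end year
        "Ruolo": position,         # position code
    }
    return BASE_URL + "?" + urlencode(params)


def read_stats_params(url):
    """Return the query parameters of a stats URL as a plain dict of strings."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


example = build_stats_url(2021, 2021, 5)
print(example)
print(read_stats_params(example))
```

Building the query from a dictionary keeps each parameter visible by name, which may be easier to adjust than string concatenation if more filters are needed later.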

First, let's build a function to scrape stats for a given year and position.

In [15]:
def scrape(year, position):
    # build the url used to load the page
    url_fix = 'https://ww2.legavolleyfemminile.it/Statistiche_i.asp?TipoStat=2.4&Serie=1'
    url_var = '&AnnoInizio=' + str(year) + '&AnnoFine=' + str(year+1) + '&Ruolo=' + str(position)
    url = url_fix + url_var
    
    # load the page; the implicit wait sets a 5-second timeout for element lookups
    driver.get(url)
    driver.implicitly_wait(5)
    
    # parse HTML and XML code and get tables using Beautiful Soup and lxml
    soup = BeautifulSoup(driver.page_source, 'lxml')
    tables = soup.find_all('table')
    
    # read tables using Pandas read_html() which returns a list of dataframes
    df_list = pd.read_html(str(tables))
    
    # the stats table we are interested in is the second element in the list
    df = df_list[1]
    
    # the column names are a bit messy (multiple names per column),
    # and we also want to drop some irrelevant columns
    stats = df.iloc[:, 0:]
    cols_to_drop = [4, 5, 9, 10, 15, 16, 21, 22, 23, 25]
    stats = stats.drop(stats.columns[cols_to_drop], axis=1)
    # rename columns
    colnames = ['Name', 'Games', 'Sets', 'Points', 'Serves', 'Aces', 'ServeErrs', 'Receptions', 'ReceptErrs', 'ReceptNeg',
                'ReceptPerf', 'Spikes', 'SpikeErrs', 'SpikesBlocked', 'SpikesPerf', 'Blocks']
    stats.columns = colnames
    
    # fill missing values with 0s
    stats = stats.fillna(0)
    
    # add year (starting year of the season) and player position to the final dataframe
    stats['Year'] = year
    if position == 1:
        stats['Position'] = "Middleblocker"
    elif position == 4:
        stats['Position'] = "OutsideHitter"
    else:
        # the label for any other code (e.g. 5) is not given here, so keep the numeric code
        stats['Position'] = position
    
    # convert the float columns to integers
    col_to_int = ['Games', 'Sets', 'Points', 'Serves', 'Aces', 'ServeErrs', 'Receptions', 'ReceptErrs', 'ReceptNeg',
                'ReceptPerf', 'Spikes', 'SpikeErrs', 'SpikesBlocked', 'SpikesPerf', 'Blocks']
    stats[col_to_int] = stats[col_to_int].astype(np.int64)
    
    return stats    
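
Because the function drops columns by position and then renames the remainder, it can help to check which raw column index ends up under which name. The sketch below does that with plain Python. It assumes the raw table has 26 columns, a figure inferred from the 10 dropped indices plus the 16 new names and not stated anywhere in the notebook; `kept_column_map` is a hypothetical helper name.

```python
COLS_TO_DROP = [4, 5, 9, 10, 15, 16, 21, 22, 23, 25]
COLNAMES = ['Name', 'Games', 'Sets', 'Points', 'Serves', 'Aces', 'ServeErrs',
            'Receptions', 'ReceptErrs', 'ReceptNeg', 'ReceptPerf', 'Spikes',
            'SpikeErrs', 'SpikesBlocked', 'SpikesPerf', 'Blocks']


def kept_column_map(n_columns, cols_to_drop, colnames):
    """Map each surviving raw column index to its new name (hypothetical helper)."""
    dropped = set(cols_to_drop)
    kept = [i for i in range(n_columns) if i not in dropped]
    if len(kept) != len(colnames):
        raise ValueError(
            f"{len(kept)} columns remain after dropping, "
            f"but {len(colnames)} names were given"
        )
    return dict(zip(kept, colnames))


# Assumption: 26 raw columns (16 kept + 10 dropped)
column_map = kept_column_map(26, COLS_TO_DROP, COLNAMES)
for index, name in column_map.items():
    print(index, name)
```

If the site changes its table layout, a length mismatch here would surface before the rename silently mislabels a column.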

In [16]:
testdf = scrape(2021, 4)

In [17]:
testdf

Unnamed: 0,Name,Games,Sets,Points,Serves,Aces,ServeErrs,Receptions,ReceptErrs,ReceptNeg,ReceptPerf,Spikes,SpikeErrs,SpikesBlocked,SpikesPerf,Blocks,Year,Position
0,Lanier Khalia,26,101,465,337,25,36,419,48,105,122,1080,70,83,417,23,2021,4
1,Gray Alexa,27,103,453,367,14,35,998,94,320,273,928,66,56,421,18,2021,4
2,Stysiak Magdalena,35,99,395,287,23,65,81,10,16,22,826,76,60,343,29,2021,4
3,Van Hecke Lise,35,97,380,269,13,48,2,0,1,1,827,70,39,336,31,2021,4
4,Stigrot Lena,26,98,374,348,11,15,857,81,165,251,865,60,68,348,15,2021,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,Ulligini Gaia 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2021,4
72,Marcon Francesca,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2021,4
73,Damato Alessia 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2021,4
74,Battistino Beatrice,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2021,4
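
The output above covers a single season and position. To cover the whole range of interest (AnnoInizio >= 2018, Ruolo in {1, 4, 5}), the combinations can be enumerated first. This is a minimal sketch; `season_position_pairs` is a hypothetical helper, and the last season (2021) is an assumption based on the example above.

```python
from itertools import product


def season_position_pairs(first_year, last_year, positions=(1, 4, 5)):
    """List every (year, position) combination to scrape (hypothetical helper)."""
    years = range(first_year, last_year + 1)
    return list(product(years, positions))


# Assumption: 2021 is the most recent season available
pairs = season_position_pairs(2018, 2021)
for year, position in pairs:
    print(year, position)
```

Each pair could then be passed to `scrape(year, position)`, and the resulting dataframes combined with `pd.concat(..., ignore_index=True)`.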
