# **United Soccer League Scraper** 

In this project, we examine the performance of my local Reno soccer team, [Reno 1868 FC](https://www.reno1868fc.com/). The first step is to obtain the data needed to produce interpretable results. Luckily, the [USL website](https://www.uslsoccer.com/usl-statistics) keeps a thorough record of league, team, and player stats, which we can scrape for our own analysis.

![](https://www.visitrenotahoe.com/wp-content/uploads/2017/06/Reno1868Blog-1.jpg)

## **Import Libraries**

Let's start by importing all the libraries we will use for this project.

In [1]:
import seaborn as sns
import functools
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import sqlite3
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import datetime
import plotly.offline
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.dashboard_objs as dashboard
import IPython.display
from IPython.display import Image
#plotly.tools.set_credentials_file(username='justingill', api_key='mS9eejXYe8Z0Jl7CcDo8')
#plotly.tools.set_config_file(sharing='public')
#py.sign_in('justingill', 'mS9eejXYe8Z0Jl7CcDo8')
plotly.offline.init_notebook_mode(connected=True)

%matplotlib inline

## **Define Functions**

We start by defining the functions needed to scrape the data and build the dataframe we want to work with.
* ***get_latest_opponent*** - Scrapes Reno 1868's next opponent and returns a dataframe containing only the players from those two teams.

* ***check_first_last*** - Checks a player's row and makes sure it has exactly one first-name field and one last-name field.

* ***make_team_df*** - Builds the team dataframe from the soup created by BeautifulSoup and returns it.

* ***save_to_SQL*** - Cleans up the dataframe by creating a new column 'Player', dropping unnecessary columns, replacing placeholder values, correcting column data types, and setting Player as the index, then saves it to a SQL database.

* ***scrape_USL*** - Acts as a 'main' function: it loops over all teams and assembles the full dataset.

get_latest_opponent visits ESPN's [site](http://www.espn.com/soccer/team/fixtures/_/id/18453/reno-1868-fc) for Reno 1868, scrapes the fixtures table for the next opponent, and returns a dataframe with only the players from Reno and that opponent.

In [2]:
def get_latest_opponent(usl):
    driver = webdriver.Chrome(executable_path="/Users/Justin/Desktop/chromedriver")
    driver.get('http://www.espn.com/soccer/team/fixtures/_/id/18453/reno-1868-fc')
    soup2 = BeautifulSoup(driver.page_source,'html.parser')
    driver.quit()
    opponent = soup2.find(class_='Table2__table-scroller Table2__table').find('tbody').find('tr').find_all('td')[1].get_text(' ').replace(' ','-')
    renovs = usl[(usl['Team'] == 'Reno-1868-FC') | (usl['Team'] == str(opponent))]
    return renovs
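
The final filtering step can be tried without a browser. The following is a minimal sketch, assuming a toy `usl` frame and a hard-coded opponent name (both made up for illustration); it uses `isin`, which should be equivalent to the two `==` comparisons in the function.

```python
import pandas as pd

# Toy stand-in for the league dataframe (values are invented for illustration).
usl = pd.DataFrame(
    {"Team": ["Reno-1868-FC", "Example-FC", "Other-FC", "Example-FC"],
     "Goals": [5, 2, 1, 4]},
    index=["Player A", "Player B", "Player C", "Player D"],
)

# Hypothetical opponent; in the notebook this comes from the scraped fixtures table.
opponent = "Example FC".replace(" ", "-")

# Keep only the rows belonging to Reno or the opponent.
renovs = usl[usl["Team"].isin(["Reno-1868-FC", opponent])]
print(renovs["Team"].value_counts())
```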

check_first_last keeps every row the same width as its column list: if a player has an extra name token, it is joined into the last name, and if a player has only one name, a '-' placeholder is inserted as the last name.

In [3]:
def check_first_last(player_list,length):
    if player_list is None:
        return player_list
    if len(player_list) == (length+1):
        player_list[1] = ' '.join(player_list[1:3])
        del player_list[2]
        return player_list
    elif len(player_list) == (length-1):
        player_list.insert(1,'-')
        return player_list
    else:
        return player_list
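
To see the three cases in action, here is a small sketch. The function is repeated (in slightly condensed form) so the example runs on its own, and the player names and the four-column layout are hypothetical.

```python
def check_first_last(player_list, length):
    # Same logic as the cell above, condensed so this sketch is self-contained.
    if player_list is None:
        return player_list
    if len(player_list) == length + 1:
        # One token too many: treat tokens 1 and 2 as a two-word last name.
        player_list[1] = ' '.join(player_list[1:3])
        del player_list[2]
    elif len(player_list) == length - 1:
        # One token too few: insert a placeholder last name.
        player_list.insert(1, '-')
    return player_list

columns = ['First', 'Last', 'Games Played', 'Starts']  # hypothetical layout

two_word_last = check_first_last(['Juan', 'De', 'Leon', '10', '8'], len(columns))
single_name = check_first_last(['Pele', '10', '8'], len(columns))
regular = check_first_last(['Ann', 'Lee', '10', '8'], len(columns))

print(two_word_last)
print(single_name)
print(regular)
```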

make_team_df takes a soup object for a team page, which looks like [this](https://www.uslsoccer.com/reno-1868-fc-player-stats). We scrape each table in the 'Full Player Stats' section into its own dataframe, merge them on the player's first and last name, and return the merged dataframe.

In [4]:
def make_team_df(soup):
    seperations = len(soup.find(class_='Opta-Table-Scroll Opta-Table-Scroll-One-Liner Opta-js-discipline').find_all(role='row'))-1
    length_rows = len(soup.find_all(role='row'))

    general_columns = ['First','Last','Games Played','Starts','Subbed off','Minutes Played']
    distribution_columns = ['First','Last','Passes','Passing Acc','Long Passes','Long Pass Acc',
                            'Pass per 90','Forward Passes','Backward Passes','Left Pass',
                            'Right Pass','Passing Acc Opponents Half',
                            'Passing Acc Own Half','Assists','Key Passes','Crosses','Crossing Acc']
    attack_columns = ['First','Last','Shots','Shots on Target','Goals','Right Foot Goals',
                      'Left Foot Goals','Heading Goals','Other','Goals In Box','Goals Out Box',
                      'Free Kick Goals','Conversion Rate','Mins Per Goal']
    defense_columns = ['First','Last','Clears','Blocks','Interceptions','Tackles',
                       'Tackles Won','Duels','Duels Won','Air Duels','Air Duels Won']
    discipline_columns = ['First','Last','Yellow Cards','Red Cards','Fouls Won','Fouls Conceded']

    goalkeeping_columns = ['First','Last','Goals Conceded','Shot At','Saves','Save Rate',
                           'Clean Sheets','Catches','Punches','Drops','Penalties Saved','Clearances']
    
    discipline_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                len(discipline_columns)) for player in soup.find_all(role='row')[seperations*4+5:seperations*5+5]],
                                 columns=discipline_columns)
    defense_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                            len(defense_columns)) for player in soup.find_all(role='row')[seperations*3+4:seperations*4+4]],
                              columns=defense_columns)
    attack_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                            len(attack_columns)) for player in soup.find_all(role='row')[seperations*2+3:seperations*3+3]],
                             columns=attack_columns)
    distribution_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                len(distribution_columns)) for player in soup.find_all(role='row')[seperations+2:seperations*2+2]],
                                   columns=distribution_columns)
    general_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                            len(general_columns)) for player in soup.find_all(role='row')[1:seperations+1]],
                             columns=general_columns)
    goalkeeping_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                len(goalkeeping_columns)) for player in soup.find_all(role='row')[seperations*5+6:length_rows]],
                                 columns=goalkeeping_columns)

    df = [general_df,distribution_df,attack_df,defense_df,discipline_df,goalkeeping_df]
    df_merge = functools.reduce(lambda left,right: pd.merge(left,right,on=['First','Last'],
                                                how='outer'), df).fillna(0)
    return df_merge
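
The merge at the end of make_team_df can be illustrated on its own. This is a minimal sketch with three tiny, made-up tables; it shows how `functools.reduce` folds a list of dataframes into one outer merge on the name columns, with `fillna(0)` filling stats a player does not have.

```python
import functools
import pandas as pd

# Invented stand-ins for three of the per-category tables.
general = pd.DataFrame({'First': ['A', 'B'], 'Last': ['One', 'Two'], 'Games Played': [10, 8]})
attack = pd.DataFrame({'First': ['A'], 'Last': ['One'], 'Goals': [3]})
goalkeeping = pd.DataFrame({'First': ['B'], 'Last': ['Two'], 'Saves': [21]})

# Merge the tables pairwise, left to right, keeping every player (outer join).
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on=['First', 'Last'], how='outer'),
    [general, attack, goalkeeping],
).fillna(0)

print(merged)
```

Player B has no row in the attack table, so the outer join leaves a missing value for Goals, which `fillna(0)` turns into 0.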

save_to_SQL cleans the passed dataframe and saves it as a new table named after the current date, then returns the cleaned dataframe.

In [5]:
def save_to_SQL(usl):
    usl.replace('-',0,inplace=True)
    usl = usl.applymap(lambda x: str(x).replace(',',''))
    usl['Player'] = usl['First']+ ' ' + usl['Last']
    usl.set_index('Player',drop=True,inplace=True)
    usl.drop(['First','Last'],axis=1,inplace=True)

    float_types = [e for e in list(usl.columns) if e not in ['Player','Team']]
    usl[float_types] = usl[float_types].applymap(lambda x: round(float(x),3))
    usl['Subbed on'] = usl['Games Played'] - usl['Starts']
    con = sqlite3.connect('USL.sqlite')
    usl.to_sql((str(datetime.date.today())),con,if_exists='replace')
    con.close()  # release the database file once the table is written
    return usl
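
The value-cleaning part of save_to_SQL can be sketched on a toy frame. The column values below are invented, and the sketch uses column-wise string methods rather than `applymap`, which is an alternative to, not a copy of, the function above.

```python
import pandas as pd

# Raw scraped values arrive as strings: '-' is a placeholder and ',' a thousands separator.
raw = pd.DataFrame({
    "Minutes Played": ["1,234", "-"],
    "Passing Acc": ["81.256", "74.5"],
})

cleaned = (
    raw.apply(lambda col: col.str.replace(",", "", regex=False)  # drop thousands separators
                             .replace("-", "0"))                  # placeholder -> zero
       .astype(float)
       .round(3)
)

print(cleaned.dtypes)
print(cleaned)
```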

scrape_USL acts as a 'main' function for the program. It starts a Chrome webdriver using Selenium and opens the [league standings](https://www.uslsoccer.com/usl-standings). We scrape this page for all the current teams and store their names in a list. We then use this list to visit each team's stats page and collect individual data for every player, retrying any page that fails to load.

In [6]:
def scrape_USL():
    driver = webdriver.Chrome(executable_path="/Users/Justin/Desktop/chromedriver")
    driver.get('https://www.uslsoccer.com/usl-standings')
    time.sleep(1)
    presoup = BeautifulSoup(driver.page_source,'html.parser')
    teams = [team.get_text().replace(' ','-') if team.get_text() != 'Pittsburgh Riverhounds SC' else 'Pittsburgh-Riverhounds' for team in presoup.find_all(class_='Opta-TeamLink Opta-Ext')]
    url = 'https://www.uslsoccer.com/{}-player-stats'

    usl = pd.DataFrame()
    
    for team in teams:
        release = False
        while not release:
            driver.get(url.format(team))
            timeout = 10
            try:
                element_present = EC.visibility_of_element_located((By.CLASS_NAME,
                                                                    'Opta-TabbedContent'))
                WebDriverWait(driver, timeout).until(element_present)
                #time.sleep(2)
            except TimeoutException:
                print("Timed out waiting for {}".format(team))

            try:
                soup = BeautifulSoup(driver.page_source,'html5lib')
                team_df = make_team_df(soup)
                team_df['Team'] = team
                usl = pd.concat([usl,team_df],axis=0)
                release = True
            except Exception:
                print("{} page did not load correctly, retrying...".format(team.replace('-',' ')))
                release = False

    usl = save_to_SQL(usl)
    driver.quit()
    return usl
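
The retry loop in scrape_USL keeps trying until a page parses, so a page that never loads would keep it running indefinitely. One possible safeguard is a bounded retry. The sketch below is not part of the notebook: `retry` and `flaky_load` are hypothetical names, and `flaky_load` merely simulates a page that fails twice before loading.

```python
def retry(action, max_attempts=3):
    """Call `action` until it succeeds or `max_attempts` attempts have failed."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            print("Attempt {} failed: {}".format(attempt, error))
    raise RuntimeError("gave up after {} attempts".format(max_attempts)) from last_error

calls = {"count": 0}

def flaky_load():
    # Simulated page load: fails on the first two calls, succeeds on the third.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("page did not load")
    return "team dataframe"

result = retry(flaky_load, max_attempts=5)
print(result)
```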

Let's call our function!

In [None]:
usl = scrape_USL()

Charleston Battery page did not load correctly, retrying...
New York Red Bulls II page did not load correctly, retrying...
Indy Eleven page did not load correctly, retrying...
Portland Timbers 2 page did not load correctly, retrying...
Saint Louis FC page did not load correctly, retrying...


Let's check the league dataframe.

In [None]:
usl.head()

In [None]:
copy_usl = usl.copy()  # independent copy; plain assignment would only alias usl

Great! We can also check that our SQL database is working correctly.

In [None]:
con = sqlite3.connect('USL.sqlite')
usl_read = pd.read_sql_query('Select * from "{}"'.format(str(datetime.date.today())),
                             con,
                             coerce_float=True)
con.close()

In [None]:
usl_read.head(5)

As we can see, our dataframe reads in correctly.

## **Reno 1868 EDA**

We are interested in comparing Reno 1868 to their next opponent to help us understand statistically how they stack up against one another.

Let's view some quick statistics about our dataset first.

In [None]:
usl.info()

In [None]:
usl.describe()

We must now get our subset using our function get_latest_opponent. 

In [None]:
renovs = get_latest_opponent(usl)

Let's make sure this worked properly!

In [None]:
renovs['Team'].value_counts()

Great! We now have our two teams.

We can now look at some of the more important statistics in soccer (displayed below) and see the differences between the two teams.

In [None]:
compare = pd.concat([renovs[['Goals','Assists','Crosses','Key Passes','Interceptions',
                             'Clearances','Team','Shots on Target','Shots','Tackles']].groupby('Team').sum().transpose(),
           renovs[['Conversion Rate','Team','Passing Acc']].groupby('Team').mean().transpose()],axis=0)
compare.applymap(lambda x: round(x,2))
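
The comparison table combines two kinds of aggregates: counting stats are summed per team, while rate stats are averaged. A minimal sketch on invented numbers, using `DataFrame.round` instead of `applymap` for the final rounding:

```python
import pandas as pd

# Toy two-team subset (values are made up for illustration).
renovs = pd.DataFrame({
    "Team": ["Reno-1868-FC", "Reno-1868-FC", "Example-FC", "Example-FC"],
    "Goals": [3.0, 1.0, 2.0, 0.0],
    "Passing Acc": [80.0, 70.0, 75.0, 65.0],
})

# Sum counting stats and average rate stats, one column per team after transposing.
totals = renovs[["Team", "Goals"]].groupby("Team").sum().transpose()
averages = renovs[["Team", "Passing Acc"]].groupby("Team").mean().transpose()

compare = pd.concat([totals, averages], axis=0).round(2)
print(compare)
```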

## **Visualizations using Plotly**

We now want to visualize our data to help us get a better understanding of the individual team differences. We can use the plotly library to help us plot these.  

In [None]:
opponent = (compare.columns).drop('Reno-1868-FC')

Let's first plot out our previous table and look at it.

In [None]:
data  = [
        go.Bar(
    y = compare.index,
    x = compare['Reno-1868-FC'],
    orientation='h',
    marker=dict(color='#0c13a8'),
    name='Reno 1868 FC'),
    
         go.Bar(
    y = compare.index,
    x = compare[opponent[0]],
    orientation='h',
    marker=dict(color='#FF0033'),
    name=opponent[0].replace('-',' '))
]

layout = go.Layout(
        title = 'Reno 1868 vs. '+ opponent[0].replace('-',' '),
        margin = go.layout.Margin(l=110)
)

fig  = go.Figure(data=data,layout=layout)
# url_1 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

Next, let's look at the individual scorers and assisters on Reno 1868.

In [None]:
data  = [
        go.Bar(
    y = renovs[(renovs['Team'] == 'Reno-1868-FC') & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True).index,
    x = renovs[(renovs['Team'] == 'Reno-1868-FC') & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#0c13a8'),
    name='Goals'),
    
        go.Bar(
    y = renovs[(renovs['Team'] == 'Reno-1868-FC') & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True).index,
    x = renovs[(renovs['Team'] == 'Reno-1868-FC') & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#fe6604'),
    name='Assists')
]

layout = go.Layout(
        title = 'Reno 1868 Scorers & Assisters',
        xaxis = dict(title='Total Goals & Assists'),
        barmode='stack',
        margin = go.layout.Margin(l=140),
        autosize=True

)

fig  = go.Figure(data=data, layout=layout)
# url_2 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)
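
The plotting cell repeats the same filter-and-sort expression for each axis. One way to tidy this is a small helper; `positive_sorted` is a hypothetical name, and the data below is invented.

```python
import pandas as pd

def positive_sorted(df, team, stat):
    """Return `stat` for players on `team` with a value above zero, sorted ascending."""
    mask = (df["Team"] == team) & (df[stat] > 0)
    return df.loc[mask, stat].sort_values(ascending=True)

# Toy data indexed by player name, as in the notebook's dataframe.
renovs = pd.DataFrame(
    {"Team": ["Reno-1868-FC", "Reno-1868-FC", "Reno-1868-FC", "Example-FC"],
     "Goals": [4.0, 0.0, 2.0, 5.0]},
    index=["Player A", "Player B", "Player C", "Player D"],
)

goals = positive_sorted(renovs, "Reno-1868-FC", "Goals")
print(goals)
```

The resulting series could then supply both `y=goals.index` and `x=goals` for a bar trace.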

Let's do the same for the opponent's team.

In [None]:
data  = [
        go.Bar(
    y = renovs[(renovs['Team'] == opponent[0]) & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True).index,
    x = renovs[(renovs['Team'] == opponent[0]) & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#FF0033'),
    name='Goals'),
    
        go.Bar(
    y = renovs[(renovs['Team'] == opponent[0]) & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True).index,
    x = renovs[(renovs['Team'] == opponent[0]) & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#9A03FE'),
    name='Assists')
]

layout = go.Layout(
        title = opponent[0].replace('-',' ') + ' Scorers & Assisters',
        xaxis = dict(title='Total Goals & Assists'),
        barmode='stack',
        margin = go.layout.Margin(l=140),
        autosize=True
)

fig  = go.Figure(data=data, layout=layout)
# url_3 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

## **Plotly Dashboard**

We could also turn these plots into a dashboard that updates every time the notebook runs. Unfortunately, I was not able to get this working, mainly because of incompatible library versions. Although it does not currently run, this code would produce a dashboard on my personal Plotly account. I am leaving it here to show where I would like to take this project next.

In [None]:
# my_dboard = dashboard.Dashboard()

In [None]:
# my_dboard.get_preview()

In [None]:
'''
import re

def fileId_from_url(url):
    raw_fileId = re.findall("~[A-Za-z]+/[0-9]+", url)[0][1:]
    return str(raw_fileId).replace('/', ':')

def sharekey_from_url(url):
    if 'share_key=' not in url:
        return "This url is not 'secret'. It does not have a secret key."
    return url[url.find('share_key=') + len('share_key='):]

fileId_1 = fileId_from_url(url_1)
fileId_2 = fileId_from_url(url_2)
fileId_3 = fileId_from_url(url_3)
print(fileId_1)
print(fileId_2)
print(fileId_3)

box_a = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_1,
    'title': 'Reno 1868 vs. ' + opponent[0].replace('-',' ')
}
box_b = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_2,
    'title':  'Reno 1868 Top Scorers & Assisters'
}
box_c = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_3,
    'title':  opponent[0].replace('-',' ') + ' Top Scorers & Assisters'
}
'''

In [None]:
# my_dboard['settings']['title'] = 'Reno 1868'

In [None]:
# my_dboard['settings']['logoUrl'] = 'https://media.graytvinc.com/images/810*954/1868-SOCCER-KIT.jpg'

In [None]:
# my_dboard.insert(box_a)

In [None]:
# my_dboard.insert(box_b,'above',1)

In [None]:
# my_dboard.insert(box_c,'right',1)

In [None]:
# py.dashboard_ops.upload(my_dboard, 'Reno 1868 Dashboard',sharing='public',auto_open=True)