# **United Soccer League Web Scraper** 

### For the best experience, view this notebook on [nbviewer](https://nbviewer.jupyter.org/github/justingill/Data-Portfolio/blob/master/USL_Scraper.ipynb) instead, because GitHub doesn't render Plotly charts.

In this project, we examine the performance of my local Reno soccer team, [Reno 1868 FC](https://www.reno1868fc.com/). We first need to obtain the data required to produce interpretable results. Luckily, the [USL website](https://www.uslsoccer.com/usl-statistics) keeps a detailed record of league, team, and player stats, which we can scrape for our own analysis.

![](https://www.visitrenotahoe.com/wp-content/uploads/2017/06/Reno1868Blog-1.jpg)

## **Import Libraries**

Let's start by importing all the libraries we will use for this project.

In [1]:
import seaborn as sns
import functools
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import sqlite3
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import datetime
import plotly.offline
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.dashboard_objs as dashboard

plotly.offline.init_notebook_mode(connected=True)

%matplotlib inline

## **Define Functions**

We start by defining the functions needed to scrape the data and build the dataframe we wish to work with.

* ***get_latest_opponent*** - This function visits Reno 1868's [schedule page](https://www.reno1868fc.com/2018-schedule) and scrapes the listed matches for Reno 1868's next opponent, then returns a dataframe containing only the players from Reno 1868 and that opponent.

* ***check_first_last*** - This will check the player list and make sure that the player has both a first and last name.

* ***make_team_df*** - This function will create a dataframe for the team using the html scraped by BeautifulSoup and return a dataframe to be merged into a dataframe of all the teams.

* ***save_to_SQL*** - This function cleans up the dataframe by creating a new 'Player' column, dropping unnecessary columns, replacing placeholder values, correcting column data types, and setting Player as the index, then saves it to a SQL database.

* ***scrape_USL*** - This function acts as a 'main' function and covers scraping the data, cleaning it, and writing it to the SQL database. It scrapes USL's [Standings](https://www.uslsoccer.com/usl-standings) to obtain all teams currently playing in the USL. The soup object for each team's webpage is then passed into make_team_df, and the results are merged together. Lastly, this dataframe is cleaned by save_to_SQL and saved to a SQL database.


In [2]:
def get_latest_opponent(usl):
    release = False
    while(release == False):
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome(executable_path="/Users/Justin/Desktop/chromedriver",options=options)
        driver.get('https://www.reno1868fc.com/2018-schedule')
        soup2 = BeautifulSoup(driver.page_source,'html.parser')
        driver.quit()
        try:
            opponent = soup2.find(class_='tableWrapper'
                             ).find('tbody'
                                   ).find_all('tr'
                                             )[-4].get_text(' '
                                                           ).split('\n'
                                                                  )[3].strip().replace(' ','-')
            release = True
        except (AttributeError, IndexError):
            print('Latest Opponent Page Error.. Retrying')
            
    
    renovs = usl[(usl['Team'] == 'Reno-1868-FC') 
                 | (usl['Team'] == str(opponent))]
    return renovs

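check_first_last keeps every scraped row the same length as its column list. If a player's name splits into three words, the second and third are joined into a single last-name entry; if a player has only one name, a '-' placeholder is inserted as the last name.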

In [3]:
def check_first_last(player_list,length):
    if player_list is None:
        return player_list
    if len(player_list) == (length+1):
        player_list[1] = ' '.join(player_list[1:3])
        del player_list[2]
        return player_list
    elif len(player_list) == (length-1):
        player_list.insert(1,'-')
        return player_list
    else:
        return player_list
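
To see what this helper does to a scraped row, here is a minimal sketch. The function is repeated so the snippet runs on its own, and the stat values and the single-name player `Mononym` are made up for illustration; only the name handling matters.

```python
def check_first_last(player_list, length):
    """Pad or merge name tokens so the row has exactly `length` entries."""
    if player_list is None:
        return player_list
    if len(player_list) == length + 1:
        # One token too many: join tokens 1 and 2 into a single last name.
        player_list[1] = ' '.join(player_list[1:3])
        del player_list[2]
    elif len(player_list) == length - 1:
        # One token too few: the player has a single name, so add a placeholder.
        player_list.insert(1, '-')
    return player_list


# Column layout of the discipline table; the numbers below are hypothetical.
columns = ['First', 'Last', 'Yellow Cards', 'Red Cards', 'Fouls Won', 'Fouls Conceded']

two_names = check_first_last('Corben Bone 4 0 20 31'.split(' '), len(columns))
three_names = check_first_last('Giovanni Ramos Godoy 2 0 10 12'.split(' '), len(columns))
one_name = check_first_last('Mononym 1 0 5 7'.split(' '), len(columns))

for row in (two_names, three_names, one_name):
    print(row)
```

Rows that differ from the expected length by more than one token (a four-word name, for example) pass through unchanged, so they would still not match the column list.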

make_team_df takes a soup object for a team page, such as [this one](https://www.uslsoccer.com/reno-1868-fc-player-stats). We scrape the tables in the 'Full Player Stats' section, build one dataframe per stat category, merge them on the player's first and last name, and return the merged dataframe.
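
The merge step can be tried on toy data before running it against a real page. This is a minimal sketch that assumes pandas is installed; the three small tables and their player names are hypothetical stand-ins for the scraped stat categories.

```python
import functools

import pandas as pd

# Hypothetical per-category tables; the real ones come from the scraped page.
general_df = pd.DataFrame(
    {'First': ['Ann', 'Bob'], 'Last': ['Lee', 'Ray'], 'Games Played': [3, 5]})
attack_df = pd.DataFrame(
    {'First': ['Ann', 'Cy'], 'Last': ['Lee', 'Orr'], 'Goals': [2, 1]})
goalkeeping_df = pd.DataFrame(
    {'First': ['Bob'], 'Last': ['Ray'], 'Saves': [7]})

frames = [general_df, attack_df, goalkeeping_df]

# Fold the list into one table: an outer merge keeps every player that
# appears in any category, and fillna(0) fills the stats they lack.
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on=['First', 'Last'], how='outer'),
    frames,
).fillna(0)

print(merged)
```

An outer merge is what lets goalkeepers and outfield players share one table: a player missing from a category simply gets zeros for that category's columns.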

In [4]:
def make_team_df(soup):
    seperations = len(soup.find(class_='Opta-Table-Scroll Opta-Table-Scroll-One-Liner Opta-js-discipline'
                               ).find_all(role='row'))-1
    
    length_rows = len(soup.find_all(role='row'))

    general_columns = ['First','Last','Games Played','Starts','Subbed off','Minutes Played']
    
    distribution_columns = ['First','Last','Passes','Passing Acc','Long Passes','Long Pass Acc',
                            'Pass per 90','Forward Passes','Backward Passes','Left Pass',
                            'Right Pass','Passing Acc Opponents Half',
                            'Passing Acc Own Half','Assists','Key Passes','Crosses','Crossing Acc']
    
    attack_columns = ['First','Last','Shots','Shots on Target','Goals','Right Foot Goals',
                      'Left Foot Goals','Heading Goals','Other','Goals In Box','Goals Out Box',
                      'Free Kick Goals','Conversion Rate','Mins Per Goal']
    
    defense_columns = ['First','Last','Clears','Blocks','Interceptions','Tackles',
                       'Tackles Won','Duels','Duels Won','Air Duels','Air Duels Won']
    discipline_columns = ['First','Last','Yellow Cards','Red Cards','Fouls Won','Fouls Conceded']

    goalkeeping_columns = ['First','Last','Goals Conceded','Shot At','Saves','Save Rate',
                           'Clean Sheets','Catches','Punches','Drops','Penalties Saved',
                           'Clearances']
    
    discipline_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                   len(discipline_columns)) 
                                  for player in soup.find_all(role='row')[seperations*4+5:seperations*5+5]],
                                 columns=discipline_columns)
    
    defense_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                len(defense_columns)) 
                               for player in soup.find_all(role='row')[seperations*3+4:seperations*4+4]],
                              columns=defense_columns)
    
    attack_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                               len(attack_columns)) 
                              for player in soup.find_all(role='row')[seperations*2+3:seperations*3+3]],
                             columns=attack_columns)
    
    distribution_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                     len(distribution_columns)) 
                                    for player in soup.find_all(role='row')[seperations+2:seperations*2+2]],
                                   columns=distribution_columns)
    
    general_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                len(general_columns)) 
                               for player in soup.find_all(role='row')[1:seperations+1]],
                             columns=general_columns)
    
    goalkeeping_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                    len(goalkeeping_columns))
                                   for player in soup.find_all(role='row')[seperations*5+6:length_rows]],
                                  columns=goalkeeping_columns)

    df = [general_df,distribution_df,attack_df,defense_df,discipline_df,goalkeeping_df]
    df_merge = functools.reduce(lambda left,right: pd.merge(left,right,on=['First','Last'],
                                                how='outer'), df).fillna(0)
    return df_merge
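
The row offsets in make_team_df are easier to follow on a fake row list. Reading the slices, the code appears to assume that the first five sections each contribute one header row followed by the same number of player rows, with a shorter goalkeeping section at the end. The sketch below rebuilds those offsets under that assumption; `section_slices` and the fake rows are illustrative names, not part of the notebook.

```python
SECTIONS = ['general', 'distribution', 'attack', 'defense', 'discipline']


def section_slices(n_players, n_sections=len(SECTIONS)):
    """Return one slice of player rows per equally sized section."""
    block = n_players + 1  # one header row followed by n_players rows
    return [slice(k * block + 1, (k + 1) * block) for k in range(n_sections)]


# Build a fake flat row list: 3 player rows per section, then 1 goalkeeper.
n_players = 3
rows = []
for name in SECTIONS:
    rows.append(name + ' header')
    rows.extend('{} player {}'.format(name, i) for i in range(n_players))
rows.append('goalkeeping header')
rows.append('goalkeeping player 0')

slices = section_slices(n_players)
for name, section in zip(SECTIONS, slices):
    print(name, rows[section])

# Goalkeeping is the last, shorter section: skip its header and take the rest.
goalkeeper_rows = rows[len(SECTIONS) * (n_players + 1) + 1:]
print('goalkeeping', goalkeeper_rows)
```

With `n_players = 3`, the second slice is `rows[5:8]`, which matches `seperations+2` and `seperations*2+2` in the function above.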

save_to_SQL cleans the passed dataframe and saves it as a new table named after the current date, then returns the cleaned dataframe.

In [5]:
def save_to_SQL(usl):
    usl.replace('-',0,inplace=True)
    usl = usl.applymap(lambda x: str(x).replace(',',''))
    usl['Player'] = usl['First']+ ' ' + usl['Last']
    usl.set_index('Player',drop=True,inplace=True)
    usl.drop(['First','Last'],axis=1,inplace=True)

    float_types = [e for e in list(usl.columns) if e not in ['Player','Team']]
    usl[float_types] = usl[float_types].applymap(lambda x: round(float(x),3))
    usl['Subbed on'] = usl['Games Played'] - usl['Starts']
    con = sqlite3.connect('USL.sqlite')
    usl.to_sql((str(datetime.date.today())),con,if_exists='replace')
    con.close()  # release the database file once the table is written
    return usl

scrape_USL acts almost as a 'main' function for the program. It starts a Chrome webdriver using Selenium and loads the [league standings](https://www.uslsoccer.com/usl-standings). We scrape this page for all the current teams and store their names in a list. We can then use this list to visit each team's stats webpage and collect individual data for each player.
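
The step from team name to stats URL can be sketched without a browser. This mirrors the string handling inside scrape_USL; the helper names `team_slug` and `team_stats_url` and the `SLUG_OVERRIDES` dictionary are introduced here for illustration and are not part of the notebook.

```python
URL_TEMPLATE = 'https://www.uslsoccer.com/{}-player-stats'

# Team names whose page slug does not follow the "spaces to hyphens" rule.
SLUG_OVERRIDES = {'Pittsburgh Riverhounds SC': 'Pittsburgh-Riverhounds'}


def team_slug(name):
    """Turn a team name from the standings table into its URL slug."""
    return SLUG_OVERRIDES.get(name, name.replace(' ', '-'))


def team_stats_url(name):
    """Build the player-stats URL for a team name."""
    return URL_TEMPLATE.format(team_slug(name))


for team in ['Reno 1868 FC', 'Pittsburgh Riverhounds SC']:
    print(team_stats_url(team))
```

Keeping the exceptions in a dictionary means a new irregular team name needs one extra entry rather than another conditional in the list comprehension.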

In [6]:
def scrape_USL():
    start = time.time()
    standing_release = False
    while(standing_release == False):
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome(executable_path="/Users/Justin/Desktop/chromedriver",options=options)
        driver.get('https://www.uslsoccer.com/usl-standings')
        time.sleep(1)
        presoup = BeautifulSoup(driver.page_source,'html.parser')
        try:
            teams = [team.get_text().replace(' ','-') 
                     if team.get_text() != 'Pittsburgh Riverhounds SC' 
                     else 'Pittsburgh-Riverhounds' 
                     for team in presoup.find_all(class_='Opta-TeamLink Opta-Ext')]
            standing_release = True
        except:
            print('Error Loading Standings. Retrying...')

    url = 'https://www.uslsoccer.com/{}-player-stats'

    usl = pd.DataFrame()
    
    for team in teams:
        release = False
        while(release == False):
            driver.get(url.format(team))
            timeout = 10
            try:
                element_present = EC.visibility_of_element_located((By.CLASS_NAME,
                                                                    'Opta-TabbedContent'))
                WebDriverWait(driver, timeout).until(element_present)
            except TimeoutException:
                print("Timed out waiting for {}".format(team))

            try:
                soup = BeautifulSoup(driver.page_source,'html5lib')
                team_df = make_team_df(soup)
                team_df['Team'] = team
                usl = pd.concat([usl,team_df],axis=0)
                release = True
            except Exception:
                release=False

    usl = save_to_SQL(usl)
    driver.quit()
    stop = time.time()
    return usl,stop-start
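
The `while(release == False)` loops above retry forever if a page never loads correctly. One possible alternative, shown here only as a sketch, is a bounded retry helper. The names `retry` and `flaky_scrape` are hypothetical, and the flaky function stands in for a real page load.

```python
import time


def retry(action, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call `action` until it succeeds or `attempts` tries have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            print('Attempt {} failed: {}'.format(attempt, error))
            time.sleep(delay)
    raise RuntimeError('Gave up after {} attempts'.format(attempts)) from last_error


calls = {'count': 0}


def flaky_scrape():
    """Stand-in for a page load that fails twice before succeeding."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ValueError('page not ready')
    return 'team table'


result = retry(flaky_scrape, attempts=5)
print(result, 'after', calls['count'], 'calls')
```

With a cap on attempts, a team page that is permanently broken surfaces as an error instead of hanging the notebook.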

Let's call our function!

In [7]:
usl,runtime = scrape_USL()
print("Program ran in : {} seconds.".format(runtime))

Timed out waiting for Louisville-City-FC



The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.



Program ran in : 394.8856313228607 seconds.


Let's check the league dataframe.

In [8]:
usl.head()

| Player | Games Played | Starts | Subbed off | Minutes Played | Passes | Passing Acc | Long Passes | Long Pass Acc | Pass per 90 | Forward Passes | ... | Saves | Save Rate | Clean Sheets | Catches | Punches | Drops | Penalties Saved | Clearances | Team | Subbed on |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Blake Smith | 32.0 | 30.0 | 4.0 | 2698.0 | 1388.0 | 79.7 | 193.0 | 48.7 | 46.3 | 601.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 2.0 |
| Corben Bone | 34.0 | 32.0 | 10.0 | 2810.0 | 1491.0 | 83.6 | 96.0 | 65.6 | 47.8 | 366.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 2.0 |
| Daniel Haber | 17.0 | 5.0 | 5.0 | 490.0 | 158.0 | 72.2 | 12.0 | 16.7 | 29.0 | 46.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 12.0 |
| Danni Konig | 28.0 | 17.0 | 14.0 | 1558.0 | 331.0 | 63.7 | 5.0 | 20.0 | 19.1 | 60.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 11.0 |
| Dekel Keinan | 22.0 | 22.0 | 1.0 | 1926.0 | 863.0 | 80.6 | 154.0 | 44.2 | 40.3 | 380.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 0.0 |


Great! We can also check that our SQL database is working correctly.

In [9]:
con = sqlite3.connect('USL.sqlite')
usl_read = pd.read_sql_query('Select * from "{}"'.format(str(datetime.date.today())),
                             con,
                             coerce_float=True)
con.close()

In [10]:
usl_read.head(5)

|  | Player | Games Played | Starts | Subbed off | Minutes Played | Passes | Passing Acc | Long Passes | Long Pass Acc | Pass per 90 | ... | Saves | Save Rate | Clean Sheets | Catches | Punches | Drops | Penalties Saved | Clearances | Team | Subbed on |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Blake Smith | 32.0 | 30.0 | 4.0 | 2698.0 | 1388.0 | 79.7 | 193.0 | 48.7 | 46.3 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 2.0 |
| 1 | Corben Bone | 34.0 | 32.0 | 10.0 | 2810.0 | 1491.0 | 83.6 | 96.0 | 65.6 | 47.8 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 2.0 |
| 2 | Daniel Haber | 17.0 | 5.0 | 5.0 | 490.0 | 158.0 | 72.2 | 12.0 | 16.7 | 29.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 12.0 |
| 3 | Danni Konig | 28.0 | 17.0 | 14.0 | 1558.0 | 331.0 | 63.7 | 5.0 | 20.0 | 19.1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 11.0 |
| 4 | Dekel Keinan | 22.0 | 22.0 | 1.0 | 1926.0 | 863.0 | 80.6 | 154.0 | 44.2 | 40.3 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | FC-Cincinnati | 0.0 |


As we can see, our dataframe reads in correctly.
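
The double quotes around the table name in the query above matter: a name such as `2018-10-20` is not a valid bare SQL identifier. Here is a minimal sketch with an in-memory database; the fixed date, the two-column table, and the `Example Player` row are placeholders rather than the notebook's data.

```python
import datetime
import sqlite3

con = sqlite3.connect(':memory:')

# A fixed, hypothetical date keeps the example repeatable.
table = str(datetime.date(2018, 10, 20))

con.execute('CREATE TABLE "{}" (Player TEXT, Goals REAL)'.format(table))
con.execute('INSERT INTO "{}" VALUES (?, ?)'.format(table), ('Example Player', 3.0))

# Quoted: the date string is treated as one identifier.
rows = con.execute('SELECT * FROM "{}"'.format(table)).fetchall()
print(rows)

# Unquoted: SQLite cannot parse the name, so the query fails.
try:
    con.execute('SELECT * FROM {}'.format(table))
    unquoted_ok = True
except sqlite3.OperationalError as error:
    unquoted_ok = False
    print('Unquoted table name failed:', error)

con.close()
```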

## **Reno 1868 EDA**

We are interested in comparing Reno 1868 to their next opponent to help us understand statistically how they stack up against one another.

Let's view some quick statistics about our dataset first.

In [11]:
usl.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1016 entries, Blake Smith to Wyatt Omsberg
Data columns (total 56 columns):
Games Played                  1016 non-null float64
Starts                        1016 non-null float64
Subbed off                    1016 non-null float64
Minutes Played                1016 non-null float64
Passes                        1016 non-null float64
Passing Acc                   1016 non-null float64
Long Passes                   1016 non-null float64
Long Pass Acc                 1016 non-null float64
Pass per 90                   1016 non-null float64
Forward Passes                1016 non-null float64
Backward Passes               1016 non-null float64
Left Pass                     1016 non-null float64
Right Pass                    1016 non-null float64
Passing Acc Opponents Half    1016 non-null float64
Passing Acc Own Half          1016 non-null float64
Assists                       1016 non-null float64
Key Passes                    1016 non-null flo

In [12]:
usl.describe()

|  | Games Played | Starts | Subbed off | Minutes Played | Passes | Passing Acc | Long Passes | Long Pass Acc | Pass per 90 | Forward Passes | ... | Shot At | Saves | Save Rate | Clean Sheets | Catches | Punches | Drops | Penalties Saved | Clearances | Subbed on |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | ... | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 | 1016.0 |
| mean | 15.959646 | 12.621063 | 3.313976 | 1135.198819 | 473.723425 | 74.680217 | 81.123031 | 47.027067 | 37.142815 | 180.135827 | ... | 4.656496 | 3.11122 | 6.06811 | 0.277559 | 0.491142 | 0.383858 | 0.062992 | 0.034449 | 1.027559 | 3.338583 |
| std | 10.484581 | 9.749226 | 3.709506 | 865.012272 | 425.636216 | 11.374076 | 106.879157 | 20.114808 | 15.879434 | 179.672075 | ... | 18.998655 | 12.820951 | 19.521965 | 1.378129 | 2.353892 | 1.820057 | 0.376618 | 0.238621 | 4.620924 | 3.856225 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 6.0 | 4.0 | 0.0 | 360.0 | 129.0 | 70.075 | 11.0 | 36.775 | 27.75 | 42.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 15.0 | 11.0 | 2.0 | 980.5 | 362.0 | 76.9 | 41.0 | 48.0 | 35.35 | 121.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 75% | 25.0 | 20.0 | 5.0 | 1801.0 | 711.75 | 81.5 | 109.25 | 58.6 | 44.425 | 264.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
| max | 36.0 | 36.0 | 18.0 | 3240.0 | 2359.0 | 100.0 | 846.0 | 100.0 | 270.0 | 911.0 | ... | 189.0 | 125.0 | 100.0 | 15.0 | 37.0 | 20.0 | 4.0 | 4.0 | 52.0 | 26.0 |


We must now get our subset using our function get_latest_opponent. 

In [13]:
renovs = get_latest_opponent(usl)

Let's make sure this worked properly!

In [14]:
renovs['Team'].value_counts()

Reno-1868-FC        33
Orange-County-SC    25
Name: Team, dtype: int64

Great! We now have our two teams.

We can now look at some of the more important statistics in soccer (displayed below) and see the differences between the two teams.

In [15]:
compare = pd.concat(
    [
    renovs[['Goals',
             'Assists',
             'Crosses',
             'Key Passes',
             'Interceptions',
             'Clearances',
             'Team',
             'Shots on Target',
             'Shots',
             'Tackles']].groupby('Team').sum().transpose(),
                     
    renovs[['Conversion Rate',
            'Team',
            'Passing Acc']].groupby('Team').mean().transpose()
    ],
    
    axis=0
)
compare.applymap(lambda x: round(x,2))

| Team | Orange-County-SC | Reno-1868-FC |
| --- | --- | --- |
| Goals | 74.0 | 56.0 |
| Assists | 56.0 | 45.0 |
| Crosses | 365.0 | 378.0 |
| Key Passes | 396.0 | 376.0 |
| Interceptions | 487.0 | 549.0 |
| Clearances | 34.0 | 23.0 |
| Shots on Target | 197.0 | 190.0 |
| Shots | 390.0 | 395.0 |
| Tackles | 563.0 | 552.0 |
| Conversion Rate | 14.17 | 8.08 |
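
Note that the 'Conversion Rate' row is the mean of the per-player rates, so a player with two shots counts as much as the team's main striker. A team-level rate can instead be computed from the totals in the table. The sketch below assumes conversion rate means goals divided by shots, as a percentage; that definition is an assumption, not something stated on the USL site.

```python
# Team totals copied from the comparison table above.
totals = {
    'Reno-1868-FC': {'Goals': 56.0, 'Shots': 395.0},
    'Orange-County-SC': {'Goals': 74.0, 'Shots': 390.0},
}


def conversion_rate(goals, shots):
    """Goals per shot as a percentage (assumed definition)."""
    if shots == 0:
        return 0.0
    return 100.0 * goals / shots


team_rates = {
    team: round(conversion_rate(stats['Goals'], stats['Shots']), 2)
    for team, stats in totals.items()
}

for team, rate in team_rates.items():
    print('{}: {}%'.format(team, rate))
```

Under this definition Reno converts about 14.18% of its shots and Orange County about 18.97%, a smaller gap than the unweighted player averages suggest.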


## **Visualizations using Plotly**

We now want to visualize our data to help us get a better understanding of the individual team differences. We can use the plotly library to help us plot these.  

In [16]:
opponent = (compare.columns).drop('Reno-1868-FC')

Let's first plot out our previous table and look at it.

In [17]:
data  = [
        go.Bar(
    y = compare.index,
    x = compare['Reno-1868-FC'],
    orientation='h',
    marker=dict(color='#0c13a8'),
    name='Reno 1868 FC'),
    
         go.Bar(
    y = compare.index,
    x = compare[opponent[0]],
    orientation='h',
    marker=dict(color='#FF0033'),
    name=opponent[0].replace('-',' '))
]

layout = go.Layout(
        title = 'Reno 1868 vs. '+ opponent[0].replace('-',' '),
        margin = go.layout.Margin(l=110)
)

fig  = go.Figure(data=data,layout=layout)
# url_1 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

Next, let's look at the individual scorers and assisters on Reno 1868.

In [18]:
data  = [
        go.Bar(
    y = renovs[(renovs['Team'] == 'Reno-1868-FC') 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == 'Reno-1868-FC') 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#0c13a8'),
    name='Goals'),
    
        go.Bar(
    y = renovs[(renovs['Team'] == 'Reno-1868-FC') 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == 'Reno-1868-FC') 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#fe6604'),
    name='Assists')
]

layout = go.Layout(
        title = 'Reno 1868 Scorers & Assisters',
        xaxis = dict(title='Total Goals & Assists'),
        barmode='stack',
        margin = go.layout.Margin(l=140),
        autosize=True

)

fig  = go.Figure(data=data, layout=layout)
# url_2 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

Let's do the same for the opponent's team.

In [19]:
data  = [
        go.Bar(
    y = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True),
            
    orientation='h',
    marker=dict(color='#FF0033'),
    name='Goals'),
    
        go.Bar(
    y = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#9A03FE'),
    name='Assists')
]

layout = go.Layout(
        title = opponent[0].replace('-',' ') + ' Scorers & Assisters',
        xaxis = dict(title='Total Goals & Assists'),
        barmode='stack',
        margin = go.layout.Margin(l=140),
        autosize=True
)

fig  = go.Figure(data=data, layout=layout)
# url_3 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

## **Plotly Dashboard**

We can also turn these charts into a dashboard that updates every time we run the notebook. Unfortunately, I was not able to get this working, mainly because of incompatible library versions. Although it does not currently work, this code would publish a dashboard to my personal Plotly account. I will leave it here to show my thought process on other things I could do with this project!

In [20]:
# my_dboard = dashboard.Dashboard()

In [21]:
# my_dboard.get_preview()

In [22]:
'''
import re

def fileId_from_url(url):
    raw_fileId = re.findall("~[A-z]+/[0-9]+", url)[0][1:]
    return str(raw_fileId).replace('/', ':')

def sharekey_from_url(url):
    if 'share_key=' not in url:
        return "This url is not 'secret'. It does not have a secret key."
    return url[url.find('share_key=') + len('share_key='):]

fileId_1 = fileId_from_url(url_1)
fileId_2 = fileId_from_url(url_2)
fileId_3 = fileId_from_url(url_3)
print(fileId_1)
print(fileId_2)
print(fileId_3)

box_a = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_1,
    'title': 'Reno 1868 vs. ' + opponent[0].replace('-',' ')
}
box_b = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_2,
    'title':  'Reno 1868 Top Scorers & Assisters'
}
box_c = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_3,
    'title':  opponent[0].replace('-',' ') + ' Top Scorers & Assisters'
}
'''

In [23]:
# my_dboard['settings']['title'] = 'Reno 1868'

In [24]:
# my_dboard['settings']['logoUrl'] = 'https://media.graytvinc.com/images/810*954/1868-SOCCER-KIT.jpg'

In [25]:
# my_dboard.insert(box_a)

In [26]:
# my_dboard.insert(box_b,'above',1)

In [27]:
# my_dboard.insert(box_c,'right',1)

In [28]:
# py.dashboard_ops.upload(my_dboard, 'Reno 1868 Dashboard',sharing='public',auto_open=True)

## **Clustering For Positions for Fun!**

We may also be interested in statistics by position. Since we are not given that information, we may try to get it ourselves using clustering algorithms.

We start by using a K-means clustering algorithm.

In [29]:
from sklearn.cluster import KMeans
from sklearn import tree
from sklearn.preprocessing import StandardScaler

In [30]:
usl_read['Minutes Played'].mean()

1135.1988188976377

We want representative data, so we will require at least 600 minutes played, which is roughly 6.7 full 90-minute games of playing time.

In [31]:
test = usl_read[usl_read['Minutes Played'] >= 600]
test2 = renovs[renovs['Minutes Played'] >= 600]

In [32]:
len(renovs[renovs['Minutes Played'] >= 600])

41

In [33]:
renovs.columns

Index(['Games Played', 'Starts', 'Subbed off', 'Minutes Played', 'Passes',
       'Passing Acc', 'Long Passes', 'Long Pass Acc', 'Pass per 90',
       'Forward Passes', 'Backward Passes', 'Left Pass', 'Right Pass',
       'Passing Acc Opponents Half', 'Passing Acc Own Half', 'Assists',
       'Key Passes', 'Crosses', 'Crossing Acc', 'Shots', 'Shots on Target',
       'Goals', 'Right Foot Goals', 'Left Foot Goals', 'Heading Goals',
       'Other', 'Goals In Box', 'Goals Out Box', 'Free Kick Goals',
       'Conversion Rate', 'Mins Per Goal', 'Clears', 'Blocks', 'Interceptions',
       'Tackles', 'Tackles Won', 'Duels', 'Duels Won', 'Air Duels',
       'Air Duels Won', 'Yellow Cards', 'Red Cards', 'Fouls Won',
       'Fouls Conceded', 'Goals Conceded', 'Shot At', 'Saves', 'Save Rate',
       'Clean Sheets', 'Catches', 'Punches', 'Drops', 'Penalties Saved',
       'Clearances', 'Team', 'Subbed on'],
      dtype='object')

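These are the feature sets we will use: X (players from the whole league) to fit the model, and X2 (players from Reno 1868 and their opponent) to predict on.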

In [34]:
X = test[['Saves','Goals Conceded','Shots','Clears']].copy()

In [35]:
X2 = test2[['Saves','Goals Conceded','Shots','Clears']].copy()

In [36]:
sc = StandardScaler()
#X2 = pd.DataFrame(sc.fit_transform(X2),columns=X2.columns,index=X2.index)
#X = pd.DataFrame(sc.fit_transform(X),columns=X.columns,index=X.index)

In [37]:
X.head()

|  | Saves | Goals Conceded | Shots | Clears |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 12.0 | 67.0 |
| 1 | 0.0 | 0.0 | 42.0 | 8.0 |
| 3 | 0.0 | 0.0 | 29.0 | 8.0 |
| 4 | 0.0 | 0.0 | 9.0 | 139.0 |
| 5 | 0.0 | 0.0 | 32.0 | 3.0 |
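
The scaling lines in the cell above are commented out, so K-means sees raw counts, and features with larger ranges contribute more to the Euclidean distance. As a rough illustration of what standardization would do, here is a pure-Python sketch applied to the five rows of X displayed above. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses; the helper name `standardize` is introduced here for illustration.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# First five rows of X as displayed above.
saves = [0.0, 0.0, 0.0, 0.0, 0.0]
shots = [12.0, 42.0, 29.0, 9.0, 32.0]
clears = [67.0, 8.0, 8.0, 139.0, 3.0]

scaled = {
    'Saves': standardize(saves),
    'Shots': standardize(shots),
    'Clears': standardize(clears),
}

for name, column in scaled.items():
    print(name, [round(value, 2) for value in column])
```

After scaling, Shots and Clears are on the same footing, so neither dominates the clustering simply because its raw numbers are bigger.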


We fit our model here.

In [38]:
kmeans = KMeans(n_clusters=4, random_state=42).fit(X)

And now we can look at our predictions!

In [39]:
predict = kmeans.predict(X2)

In [40]:
predict

array([0, 3, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0, 1, 0, 2,
       1, 1, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 1, 3, 1, 1, 1, 0, 0])
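
Each number in this array is the index of the cluster center closest to that player, measured by Euclidean distance. The sketch below shows that assignment rule in plain Python. The centroids are hypothetical values chosen for illustration, NOT the centers fitted by the notebook's model, while the player rows are taken from this notebook's result tables.

```python
import math

FEATURES = ['Saves', 'Goals Conceded', 'Shots', 'Clears']

# Hypothetical cluster centers, not the ones fitted in the notebook.
centroids = [
    (0.0, 0.0, 15.0, 110.0),
    (0.0, 0.0, 25.0, 12.0),
    (0.0, 0.0, 6.0, 48.0),
    (60.0, 25.0, 0.0, 15.0),
]


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, centers):
    """Return the index of the center closest to `point`."""
    distances = [euclidean(point, center) for center in centers]
    return distances.index(min(distances))


# Feature rows in FEATURES order, copied from the notebook's output tables.
players = {
    'Zach Carroll': (0.0, 0.0, 9.0, 154.0),
    'Michael Seaton': (0.0, 0.0, 62.0, 5.0),
    'Kevin Alston': (0.0, 0.0, 2.0, 37.0),
    'Andre Rawls': (84.0, 30.0, 0.0, 25.0),
}

labels = {name: nearest_centroid(stats, centroids) for name, stats in players.items()}
for name, label in labels.items():
    print(name, '->', label)
```

The cluster numbers themselves carry no meaning; they only say which players were grouped together.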

In [41]:
concat = pd.concat([X2,pd.DataFrame(predict,index=X2.index)],axis=1)

Unfortunately, the clustering is not very good, so we cannot classify positions especially well with this approach. The experience was the more important part, though, and as you can see it does work for very obvious positions, because some statistics (such as saves) are unique to goalkeepers.

Let's view our results on classifying each player's position.

In [42]:
concat[concat[0]==0]

| Player | Saves | Goals Conceded | Shots | Clears | 0 |
| --- | --- | --- | --- | --- | --- |
| Alex Crognale | 0.0 | 0.0 | 16.0 | 103.0 | 0 |
| Walker Hume | 0.0 | 0.0 | 16.0 | 97.0 | 0 |
| Brent Richards | 0.0 | 0.0 | 28.0 | 121.0 | 0 |
| Thomas Janjigian | 0.0 | 0.0 | 2.0 | 96.0 | 0 |
| Zach Carroll | 0.0 | 0.0 | 9.0 | 154.0 | 0 |


In [43]:
concat[concat[0]==2]

| Player | Saves | Goals Conceded | Shots | Clears | 0 |
| --- | --- | --- | --- | --- | --- |
| Jos Hooiveld | 0.0 | 0.0 | 15.0 | 54.0 | 2 |
| Joseph Amico | 0.0 | 0.0 | 4.0 | 43.0 | 2 |
| Kevin Alston | 0.0 | 0.0 | 2.0 | 37.0 | 2 |
| Owusu-Ansah Kontor | 0.0 | 0.0 | 1.0 | 26.0 | 2 |
| Thomas Juel-Nielsen | 0.0 | 0.0 | 8.0 | 47.0 | 2 |
| Brenton Griffiths | 0.0 | 0.0 | 4.0 | 71.0 | 2 |
| Duke Lacroix | 0.0 | 0.0 | 17.0 | 35.0 | 2 |
| James Kiffe | 0.0 | 0.0 | 3.0 | 39.0 | 2 |
| Jordan Murrell | 0.0 | 0.0 | 1.0 | 84.0 | 2 |


In [44]:
concat[concat[0]==3]

| Player | Saves | Goals Conceded | Shots | Clears | 0 |
| --- | --- | --- | --- | --- | --- |
| Andre Rawls | 84.0 | 30.0 | 0.0 | 25.0 | 3 |
| James Marcinkowski | 70.0 | 30.0 | 0.0 | 16.0 | 3 |
| Matt Bersano | 34.0 | 8.0 | 0.0 | 7.0 | 3 |


In [45]:
concat[concat[0]==1]

| Player | Saves | Goals Conceded | Shots | Clears | 0 |
| --- | --- | --- | --- | --- | --- |
| Aodhan Quinn | 0.0 | 0.0 | 51.0 | 24.0 | 1 |
| Christian Duke | 0.0 | 0.0 | 17.0 | 26.0 | 1 |
| Darwin Jones | 0.0 | 0.0 | 18.0 | 8.0 | 1 |
| Giovanni Ramos Godoy | 0.0 | 0.0 | 27.0 | 8.0 | 1 |
| Koji Hashimoto | 0.0 | 0.0 | 14.0 | 7.0 | 1 |
| Mark Segbers | 0.0 | 0.0 | 8.0 | 2.0 | 1 |
| Mats Bjurman | 0.0 | 0.0 | 20.0 | 18.0 | 1 |
| Michael Seaton | 0.0 | 0.0 | 62.0 | 5.0 | 1 |
| Noah Powder | 0.0 | 0.0 | 13.0 | 16.0 | 1 |
| Richard Chaplow | 0.0 | 0.0 | 10.0 | 10.0 | 1 |


Here is a count of the players assigned to each cluster.

In [46]:
concat[0].value_counts()

1    24
2     9
0     5
3     3
Name: 0, dtype: int64

We are done now! We have learned so much about Reno 1868 and their opposition.