# **United Soccer League Web Scraper** 

### For the best experience, view this notebook [here](https://nbviewer.jupyter.org/github/justingill/Resume-Portfolio/blob/master/USL%20Scraper/USL_Scraper.ipynb) instead, because GitHub doesn't render Plotly charts.

In this project, we examine the performance of my local Reno soccer team, [Reno 1868 FC](https://www.reno1868fc.com/). The first step is obtaining the data needed to produce interpretable results. Luckily, the [USL website](https://www.uslsoccer.com/usl-statistics) keeps a detailed record of league, team, and player stats, which we can scrape for our own analysis.

![](https://www.visitrenotahoe.com/wp-content/uploads/2017/06/Reno1868Blog-1.jpg)

## **Import Libraries**

Let's start by importing all the libraries we will use for this project.

In [1]:
import seaborn as sns
import functools
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import sqlite3
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import datetime
import plotly.offline
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.dashboard_objs as dashboard

plotly.offline.init_notebook_mode(connected=True)

%matplotlib inline

## **Define Functions**

We start by defining the functions that scrape the data and build the dataframe we want to work with.

* ***get_latest_opponent_df*** - This function will take in a string of any opponent/team and return a new dataframe with the opponent and Reno 1868.

* ***check_first_last*** - This will check the player list and make sure that the player has both a first and last name.

* ***make_team_df*** - This function will create a dataframe for the team using the html scraped by BeautifulSoup and return a dataframe to be merged into a dataframe of all the teams.

* ***save_to_SQL*** - This function cleans up the dataframe by creating a new column 'Player', dropping unnecessary columns, replacing placeholder values, correcting column data types, and setting Player as the index. It then saves the result to a SQL database.
                  
* ***scrapeUSL*** - This function acts as a 'main' function and covers scraping the data, cleaning it, and writing it to the SQL database. It scrapes USL's [Standings](https://www.uslsoccer.com/usl-standings) to obtain all teams currently playing in the USL. The soup object of each team webpage is then passed into make_team_df, and the resulting dataframes are concatenated. Lastly, this dataframe is cleaned by save_to_SQL and saved to a SQL database.


In [2]:
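def get_latest_opponent_df(opponent):
    # Relies on the league-wide `usl` dataframe created later in the notebook.
    if opponent == 'Reno 1868 FC':
        # Previously this branch fell through to an undefined `renovs`.
        raise ValueError('Cannot return Reno 1868 vs. Reno 1868')
    renovs = usl[(usl['Team'] == 'Reno 1868 FC') 
                 | (usl['Team'] == str(opponent))]
    return renovs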
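
The function above reads the league-wide `usl` dataframe from the notebook's global scope. As a minimal sketch, the same filter can take the dataframe as an argument, which makes it easier to try on small inputs. The name `get_matchup_df`, the `home` parameter, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def get_matchup_df(league_df, opponent, home='Reno 1868 FC'):
    """Return only the rows for the home team and the given opponent."""
    if opponent == home:
        raise ValueError('Opponent must differ from {}'.format(home))
    return league_df[league_df['Team'].isin([home, str(opponent)])]

# Toy league table (made-up numbers) indexed by player name.
toy_league = pd.DataFrame(
    {'Team': ['Reno 1868 FC', 'Tacoma Defiance', 'New York Red Bulls 2'],
     'Goals': [2.0, 1.0, 3.0]},
    index=['Player A', 'Player B', 'Player C'])

matchup = get_matchup_df(toy_league, 'Tacoma Defiance')
print(sorted(matchup['Team'].unique()))
```

Passing the dataframe in explicitly also avoids a `NameError` if the function is called before the scrape has run.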

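check_first_last keeps every scraped row the same length. If a name splits into one token too many, the second and third tokens are joined into a single last name; if a player goes by a single name, a '-' placeholder is inserted as the last name.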

In [3]:
def check_first_last(player_list,length):
    if player_list is None:
        return player_list
    if len(player_list) == (length+1):
        player_list[1] = ' '.join(player_list[1:3])
        del player_list[2]
        return player_list
    elif len(player_list) == (length-1):
        player_list.insert(1,'-')
        return player_list
    else:
        return player_list
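
To make the two cases concrete, here is a small self-contained sketch that repeats the function and runs it on two hand-written rows. The stat values are made up; `length` is the number of columns the row is expected to have.

```python
def check_first_last(player_list, length):
    """Normalize a split table row to `length` fields (First, Last, stats...)."""
    if player_list is None:
        return player_list
    if len(player_list) == length + 1:
        # One token too many: fold the second and third tokens into one last name.
        player_list[1] = ' '.join(player_list[1:3])
        del player_list[2]
    elif len(player_list) == length - 1:
        # One token short: the player has a single name, so pad the last name.
        player_list.insert(1, '-')
    return player_list

three_names = check_first_last(['Luis', 'Felipe', 'Fernandes', '3', '1'], 4)
one_name = check_first_last(['Amarildo', '3', '1'], 4)
print(three_names)  # ['Luis', 'Felipe Fernandes', '3', '1']
print(one_name)     # ['Amarildo', '-', '3', '1']
```

Rows that are off by more than one token pass through unchanged, so a four-part name would still produce a row of the wrong length.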

make_team_df takes in a soup variable which corresponds to a team page, which looks like [this](https://www.uslsoccer.com/reno-1868-fc-player-stats). We scrape the page for the data contained in the 'Full Player Stats' section, merge and return the manipulated dataframe. 

In [4]:
def make_team_df(soup):
    seperations = len(soup.find(class_='Opta-Table-Scroll Opta-Table-Scroll-One-Liner Opta-js-discipline'
                               ).find_all(role='row'))-1

    length_rows = len(soup.find_all(role='row'))

    general_columns = ['First','Last','Games Played','Starts','Subbed off','Minutes Played']
    
    distribution_columns = ['First','Last','Passes','Passing Acc','Long Passes','Long Pass Acc',
                            'Pass per 90','Forward Passes','Backward Passes','Left Pass',
                            'Right Pass','Passing Acc Opponents Half',
                            'Passing Acc Own Half','Assists','Key Passes','Crosses','Crossing Acc']
    
    attack_columns = ['First','Last','Shots','Shots on Target','Goals','Right Foot Goals',
                      'Left Foot Goals','Heading Goals','Other','Goals In Box','Goals Out Box',
                      'Free Kick Goals','Conversion Rate','Mins Per Goal']
    
    defense_columns = ['First','Last','Clears','Blocks','Interceptions','Tackles',
                       'Tackles Won','Duels','Duels Won','Air Duels','Air Duels Won']
    discipline_columns = ['First','Last','Yellow Cards','Red Cards','Fouls Won','Fouls Conceded']

    goalkeeping_columns = ['First','Last','Goals Conceded','Shot At','Saves','Save Rate',
                           'Clean Sheets','Catches','Punches','Drops','Penalties Saved',
                           'Clearances']
    
    discipline_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                   len(discipline_columns)) 
                                  for player in soup.find_all(role='row')[seperations*4+5:seperations*5+5]],
                                 columns=discipline_columns)
    
    defense_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                len(defense_columns)) 
                               for player in soup.find_all(role='row')[seperations*3+4:seperations*4+4]],
                              columns=defense_columns)
    
    attack_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                               len(attack_columns)) 
                              for player in soup.find_all(role='row')[seperations*2+3:seperations*3+3]],
                             columns=attack_columns)
    
    distribution_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                     len(distribution_columns)) 
                                    for player in soup.find_all(role='row')[seperations+2:seperations*2+2]],
                                   columns=distribution_columns)
    
    general_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                                len(general_columns)) 
                               for player in soup.find_all(role='row')[1:seperations+1]],
                             columns=general_columns)
    
    goalkeeping_df = pd.DataFrame([check_first_last(player.get_text(' ').split(' '),
                                len(goalkeeping_columns))for player in soup.find_all(role='row')[seperations*5+6:length_rows]],
                                 columns=goalkeeping_columns)

    df = [general_df,distribution_df,attack_df,defense_df,discipline_df,goalkeeping_df]
    df_merge = functools.reduce(lambda left,right: pd.merge(left,right,on=['First','Last'],
                                                how='outer'), df).fillna(0)
    return df_merge
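
The slice offsets in make_team_df appear to assume that the page lists six tables in a fixed order, each with one header row, and that the first five tables all contain the same number of player rows (`seperations`). The sketch below reproduces those bounds in plain Python so they are easier to check; the helper name `table_slices` and the example row counts are hypothetical.

```python
TABLES = ['general', 'distribution', 'attack', 'defense', 'discipline', 'goalkeeping']

def table_slices(players, total_rows):
    """Map each stats table to the (start, stop) row slice used by make_team_df."""
    bounds = {}
    for k, name in enumerate(TABLES):
        # Skip k complete tables (header + players each) plus this table's header.
        start = k * (players + 1) + 1
        bounds[name] = (start, start + players)
    # The goalkeeping table is shorter, so it runs to the end of the row list.
    bounds['goalkeeping'] = (bounds['goalkeeping'][0], total_rows)
    return bounds

# Example: 3 player rows per table and 2 goalkeeper rows,
# i.e. 6 header rows + 5 * 3 player rows + 2 goalkeeper rows in total.
slices = table_slices(players=3, total_rows=6 + 5 * 3 + 2)
for name, (start, stop) in slices.items():
    print(name, start, stop)
```

If the site ever adds a table or changes the order, every offset shifts, so selecting each table by its own class name may be a more robust alternative.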

save_to_SQL cleans the passed dataframe and saves it as a new table named after the current date, then returns the cleaned dataframe.

In [5]:
def save_to_SQL(usl):
    usl = usl.applymap(lambda x: str(x).replace(',',''))
    usl['Player'] = usl['First']+ ' ' + usl['Last']
    usl.set_index('Player',drop=True,inplace=True)
    usl.drop(['First','Last'],axis=1,inplace=True)

    usl.replace('-',0,inplace=True)
    float_types = [e for e in list(usl.columns) if e not in ['Player','Team']]
    usl[float_types] = usl[float_types].applymap(lambda x: round(float(x),3))
    usl['Subbed on'] = usl['Games Played'] - usl['Starts']
    
    usl['Team'] = usl['Team'].apply(lambda x: x.replace('-',' ').title())
    usl['Team'] = usl['Team'].apply(lambda x: x.replace('Ii','2'))
    usl['Team'] = usl['Team'].apply(lambda x: x.replace('Sc','SC'))
    usl['Team'] = usl['Team'].apply(lambda x: x.replace('Fc','FC'))

    con = sqlite3.connect('USL.sqlite')
    usl.to_sql((str(datetime.date.today())),con,if_exists='replace')
    con.close()  # release the database file once the table is written
    return usl
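
The team-name cleanup in save_to_SQL can be sketched on its own with plain strings. The helper name `pretty_team_name` is hypothetical; the replacements are the same ones applied to the 'Team' column above.

```python
def pretty_team_name(slug):
    """Turn a URL slug such as 'reno-1868-fc' into a display name."""
    name = slug.replace('-', ' ').title()
    # str.title() lowercases everything after the first letter of each word,
    # so suffixes like FC, SC and II need to be restored afterwards.
    for wrong, right in (('Ii', '2'), ('Sc', 'SC'), ('Fc', 'FC')):
        name = name.replace(wrong, right)
    return name

print(pretty_team_name('reno-1868-fc'))           # Reno 1868 FC
print(pretty_team_name('new-york-red-bulls-ii'))  # New York Red Bulls 2
```

Because these are plain substring replacements, a team whose name merely contains 'Sc' or 'Fc' inside a longer word would also be rewritten, so it is worth checking the output whenever a new team joins the league.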

scrapeUSL acts almost as a 'main' function for the program. It starts a Chrome webdriver using Selenium and opens the [league standings](https://www.uslsoccer.com/usl-standings). We scrape this page for all the current teams and store them in a list. We then use this list to visit each team's stats page and collect individual data for each player.

In [6]:
def scrapeUSL():
    start = time.time()
    standing_release = False
    while(standing_release == False):
        options = webdriver.ChromeOptions()
        driver = webdriver.Chrome(executable_path="./chromedriver",options=options)
        driver.get('https://www.uslchampionship.com/league-standings')
        time.sleep(1)
        presoup = BeautifulSoup(driver.page_source,'html.parser')
        try:
            teams = [team.get_text().replace(' ','-').lower() 
                     if team.get_text() != 'Pittsburgh Riverhounds SC' 
                     else 'Pittsburgh-Riverhounds'.lower() 
                     for team in presoup.find_all(class_='Opta-TeamLink Opta-Ext')]
            
            #Teams added after 1st season being scraped are not consistently named on website.
            for i, team in enumerate(teams):
                if(team == 'birmingham-legion'):
                    teams[i] = 'birmingham-legion-fc'
                elif(team == 'loudoun-united'):
                    teams[i] = 'loudoun-united-fc'
                elif(team == 'memphis-901'):
                    teams[i] = 'memphis-901-fc'
                elif(team == 'austin-bold'):
                    teams[i] = 'austin-bold-fc'
                elif(team == 'el-paso-locomotive'):
                    teams[i] = 'el-paso-locomotive-fc'
                    
            teams.remove('orange-county-sc')
            standing_release = True
        except Exception:
            print('Error Loading Standings. Retrying...')
            driver.quit()  # close the failed browser before the loop opens a new one

    url = 'https://www.uslchampionship.com/{}-player-stats'

    usl = pd.DataFrame()
    
    for team in teams:
        release = False
        while(release == False):
            driver.get(url.format(team))
            timeout = 10
            try:
                element_present = EC.visibility_of_element_located((By.CLASS_NAME,
                                                                    'Opta-TabbedContent'))
                WebDriverWait(driver, timeout).until(element_present)
            except TimeoutException:
                print("Timed out waiting for {}".format(team))

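            try:
                # Name the parser explicitly, matching the standings scrape above.
                soup = BeautifulSoup(driver.page_source,'html.parser')
                team_df = make_team_df(soup)
                team_df['Team'] = team
                usl = pd.concat([usl,team_df],axis=0)
                release = True
            except Exception:
                # Retry this team; a bare except would also swallow Ctrl+C.
                release=False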

    usl = save_to_SQL(usl)
    driver.quit()
    stop = time.time()
    return usl,stop-start
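
The chain of special cases in scrapeUSL maps display names from the standings page to the slugs used in the stats URLs. As a rough sketch, the same mapping can be expressed as a lookup table; the names `SLUG_OVERRIDES` and `team_slug` are hypothetical, while the override values are taken from the function above.

```python
# Teams whose stats-page slug differs from the plain
# lowercase-and-hyphenate rule (values taken from scrapeUSL).
SLUG_OVERRIDES = {
    'pittsburgh-riverhounds-sc': 'pittsburgh-riverhounds',
    'birmingham-legion': 'birmingham-legion-fc',
    'loudoun-united': 'loudoun-united-fc',
    'memphis-901': 'memphis-901-fc',
    'austin-bold': 'austin-bold-fc',
    'el-paso-locomotive': 'el-paso-locomotive-fc',
}

def team_slug(display_name):
    """Convert a display name from the standings page into a URL slug."""
    slug = display_name.replace(' ', '-').lower()
    return SLUG_OVERRIDES.get(slug, slug)

URL = 'https://www.uslchampionship.com/{}-player-stats'
print(URL.format(team_slug('Reno 1868 FC')))
```

Keeping the exceptions in one dictionary means a newly added team needs only one new entry rather than another `elif` branch.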

Let's call our function!

In [7]:
usl,runtime = scrapeUSL()
print("Program ran in : {} seconds.".format(runtime))


The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.



Program ran in : 290.5914692878723 seconds.


Let's check the league dataframe.

In [8]:
usl.head(5)

Player,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,Forward Passes,...,Saves,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Team,Subbed on
Allen Yanes,3.0,3.0,0.0,270.0,108.0,72.2,15.0,40.0,36.0,53.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0
Amarildo -,3.0,1.0,1.0,92.0,15.0,66.7,1.0,0.0,14.7,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,2.0
Andreas Ivan,2.0,2.0,1.0,164.0,53.0,73.6,3.0,100.0,29.1,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0
Ben Mines,4.0,1.0,1.0,93.0,38.0,60.5,0.0,0.0,36.8,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,3.0
Brian White,2.0,2.0,2.0,143.0,44.0,59.1,0.0,0.0,27.7,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0


Great! We can also check that our SQL database is working correctly.

In [9]:
con = sqlite3.connect('USL.sqlite')
usl_read = pd.read_sql_query('Select * from "{}"'.format(str(datetime.date.today())),
                             con,
                             coerce_float=True)
con.close()
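
Each scrape is stored in a table named after the current date, and that name contains hyphens, which is why the query wraps it in double quotes. The following sketch shows the same idea using only the standard library and an in-memory database; the column names and the inserted row are made up.

```python
import datetime
import sqlite3

table_name = str(datetime.date.today())   # ISO format, YYYY-MM-DD, with hyphens

con = sqlite3.connect(':memory:')         # in-memory database, nothing written to disk
# Double quotes make SQLite treat the hyphenated date as an identifier.
con.execute('CREATE TABLE "{}" (Player TEXT, Goals REAL)'.format(table_name))
con.execute('INSERT INTO "{}" VALUES (?, ?)'.format(table_name), ('Player A', 2.0))
rows = con.execute('SELECT * FROM "{}"'.format(table_name)).fetchall()
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
con.close()

print(tables)
print(rows)
```

Since save_to_SQL passes `if_exists='replace'`, running the scraper twice on the same day overwrites that day's table rather than appending to it.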

In [10]:
usl_read.head(5)

Unnamed: 0,Player,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,...,Saves,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Team,Subbed on
0,Allen Yanes,3.0,3.0,0.0,270.0,108.0,72.2,15.0,40.0,36.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0
1,Amarildo -,3.0,1.0,1.0,92.0,15.0,66.7,1.0,0.0,14.7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,2.0
2,Andreas Ivan,2.0,2.0,1.0,164.0,53.0,73.6,3.0,100.0,29.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0
3,Ben Mines,4.0,1.0,1.0,93.0,38.0,60.5,0.0,0.0,36.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,3.0
4,Brian White,2.0,2.0,2.0,143.0,44.0,59.1,0.0,0.0,27.7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0


As we can see, our dataframe reads in correctly.

## **Reno 1868 EDA**

We are interested in comparing Reno 1868 to their next opponent to help us understand statistically how they stack up against one another.

Let's view some quick statistics about our dataset first.

In [11]:
usl.info()

<class 'pandas.core.frame.DataFrame'>
Index: 804 entries, Allen Yanes to Will Bruin
Data columns (total 56 columns):
Games Played                  804 non-null float64
Starts                        804 non-null float64
Subbed off                    804 non-null float64
Minutes Played                804 non-null float64
Passes                        804 non-null float64
Passing Acc                   804 non-null float64
Long Passes                   804 non-null float64
Long Pass Acc                 804 non-null float64
Pass per 90                   804 non-null float64
Forward Passes                804 non-null float64
Backward Passes               804 non-null float64
Left Pass                     804 non-null float64
Right Pass                    804 non-null float64
Passing Acc Opponents Half    804 non-null float64
Passing Acc Own Half          804 non-null float64
Assists                       804 non-null float64
Key Passes                    804 non-null float64
Crosses         

In [12]:
usl.describe()

Unnamed: 0,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,Forward Passes,...,Shot At,Saves,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Subbed on
count,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0,...,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0,804.0
mean,5.736318,4.573383,1.15796,411.109453,173.095771,74.283582,29.63806,44.539677,37.97699,66.446517,...,1.371891,0.884328,4.790672,0.058458,0.218905,0.103234,0.024876,0.008706,0.273632,1.162935
std,3.017198,3.156892,1.457124,273.984016,137.000085,12.685766,37.041353,24.395238,19.884182,60.407548,...,5.648241,3.771533,16.978515,0.363757,1.082795,0.553975,0.178211,0.116716,1.264419,1.509478
min,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,0.0,164.0,56.0,69.575,5.0,33.25,26.8,18.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,6.0,4.0,1.0,388.0,138.0,76.5,16.0,45.0,35.95,46.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,8.0,7.0,2.0,640.0,260.0,81.4,42.25,60.0,46.0,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
max,11.0,11.0,9.0,990.0,575.0,100.0,234.0,100.0,396.0,271.0,...,45.0,31.0,100.0,4.0,13.0,6.0,2.0,2.0,11.0,7.0


We must now get our subset using our function get_latest_opponent_df. 

In [13]:
renovs = get_latest_opponent_df('Tacoma Defiance')

Let's make sure this worked properly!

In [14]:
renovs['Team'].value_counts()

Tacoma Defiance    37
Reno 1868 FC       22
Name: Team, dtype: int64

Great! We now have our two teams.

We can now look at some of the more important statistics in soccer (displayed below) and see how the two teams differ.

In [15]:
compare = pd.concat(
    [
    renovs[['Goals',
             'Assists',
             'Crosses',
             'Key Passes',
             'Interceptions',
             'Clearances',
             'Team',
             'Shots on Target',
             'Shots',
             'Tackles']].groupby('Team').sum().transpose(),
                     
    renovs[['Conversion Rate',
            'Team',
            'Passing Acc']].groupby('Team').mean().transpose()
    ],
    
    axis=0
)
compare.applymap(lambda x: round(x,2))

Team,Reno 1868 FC,Tacoma Defiance
Goals,19.0,6.0
Assists,16.0,4.0
Crosses,95.0,77.0
Key Passes,112.0,93.0
Interceptions,128.0,130.0
Clearances,7.0,7.0
Shots on Target,50.0,37.0
Shots,122.0,87.0
Tackles,129.0,172.0
Conversion Rate,8.12,3.03
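
One caveat when reading this table: the `Conversion Rate` row is the mean of the per-player rates, so a player with very few shots counts as much as a regular shooter. A rough team-level figure can be derived from the summed rows instead. The sketch below assumes conversion rate means goals divided by shots, which may differ from the definition used by the stats provider.

```python
# Team totals copied from the comparison table above.
totals = {
    'Reno 1868 FC': {'Goals': 19.0, 'Shots': 122.0, 'Shots on Target': 50.0},
    'Tacoma Defiance': {'Goals': 6.0, 'Shots': 87.0, 'Shots on Target': 37.0},
}

def pct(part, whole):
    """Percentage rounded to two decimals; 0.0 when the denominator is zero."""
    return round(100.0 * part / whole, 2) if whole else 0.0

team_rates = {
    team: {'goals_per_shot_pct': pct(t['Goals'], t['Shots']),
           'on_target_pct': pct(t['Shots on Target'], t['Shots'])}
    for team, t in totals.items()
}
for team, rates in team_rates.items():
    print(team, rates)
```

Under this assumed definition, Reno converts roughly 15.6% of its shots and Tacoma roughly 6.9%, a wider gap than the per-player averages suggest.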


## **Visualizations using Plotly**

We now want to visualize our data to help us get a better understanding of the individual team differences. We can use the plotly library to help us plot these.  

In [17]:
opponent = (compare.columns).drop('Reno 1868 FC')

Let's first plot out our previous table and look at it.

In [19]:
data  = [
        go.Bar(
    y = compare.index,
    x = compare['Reno 1868 FC'],
    orientation='h',
    marker=dict(color='#0c13a8'),
    name='Reno 1868 FC'),
    
         go.Bar(
    y = compare.index,
    x = compare[opponent[0]],
    orientation='h',
    marker=dict(color='#FF0033'),
    name=opponent[0])
]

layout = go.Layout(
        title = 'Reno 1868 vs. '+ opponent[0],
        margin = go.layout.Margin(l=110)
)

fig  = go.Figure(data=data,layout=layout)
# url_1 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

Next, let's look at the individual scorers and assisters on Reno 1868.

In [20]:
data  = [
        go.Bar(
    y = renovs[(renovs['Team'] == 'Reno 1868 FC') 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == 'Reno 1868 FC') 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#0c13a8'),
    name='Goals'),
    
        go.Bar(
    y = renovs[(renovs['Team'] == 'Reno 1868 FC') 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == 'Reno 1868 FC') 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#fe6604'),
    name='Assists')
]

layout = go.Layout(
        title = 'Reno 1868 Scorers & Assisters',
        xaxis = dict(title='Total Goals & Assists'),
        barmode='stack',
        margin = go.layout.Margin(l=140),
        autosize=True

)

fig  = go.Figure(data=data, layout=layout)
# url_2 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

Let's do the same for the opponent's team.

In [21]:
data  = [
        go.Bar(
    y = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Goals'] > 0)]['Goals'].sort_values(ascending=True),
            
    orientation='h',
    marker=dict(color='#FF0033'),
    name='Goals'),
    
        go.Bar(
    y = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True).index,
            
    x = renovs[(renovs['Team'] == opponent[0]) 
               & (renovs['Assists'] > 0)]['Assists'].sort_values(ascending=True),
    orientation='h',
    marker=dict(color='#9A03FE'),
    name='Assists')
]

layout = go.Layout(
        title = opponent[0] + ' Scorers & Assisters',
        xaxis = dict(title='Total Goals & Assists'),
        barmode='stack',
        margin = go.layout.Margin(l=140),
        autosize=True
)

fig  = go.Figure(data=data, layout=layout)
# url_3 = py.plot(fig,auto_open=False)
# py.iplot(fig)
plotly.offline.iplot(fig)

## **Plotly Dashboard**

This section builds a Plotly dashboard for upload to the Plotly website. It currently does not work because changes in Plotly broke the dashboard, so the cells below are commented out. We have therefore chosen to use more stable software, such as Tableau, to present our data in a dashboard.

In [None]:
# my_dboard = dashboard.Dashboard()

In [None]:
# my_dboard.get_preview()

In [None]:
'''
import re

def fileId_from_url(url):
    raw_fileId = re.findall("~[A-z]+/[0-9]+", url)[0][1:]
    return str(raw_fileId).replace('/', ':')

def sharekey_from_url(url):
    if 'share_key=' not in url:
        return "This url is not 'secret'. It does not have a secret key."
    return url[url.find('share_key=') + len('share_key='):]

fileId_1 = fileId_from_url(url_1)
fileId_2 = fileId_from_url(url_2)
fileId_3 = fileId_from_url(url_3)
print(fileId_1)
print(fileId_2)
print(fileId_3)

box_a = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_1,
    'title': 'Reno 1868 vs. ' + opponent[0]
}
box_b = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_2,
    'title':  'Reno 1868 Top Scorers & Assisters'
}
box_c = {
    'type': 'box',
    'boxType': 'plot',
    'fileId': fileId_3,
    'title':  opponent[0] + ' Top Scorers & Assisters'
}
'''

In [None]:
# my_dboard['settings']['title'] = 'Reno 1868'

In [None]:
# my_dboard['settings']['logoUrl'] = 'https://media.graytvinc.com/images/810*954/1868-SOCCER-KIT.jpg'

In [None]:
# my_dboard.insert(box_a)

In [None]:
# my_dboard.insert(box_b,'above',1)

In [None]:
# my_dboard.insert(box_c,'right',1)

In [None]:
# py.dashboard_ops.upload(my_dboard, 'Reno 1868 Dashboard',sharing='public',auto_open=True)

## **Positions using Decision Tree**

In [22]:
reno_positions = {'Position':{'Antoine Hoppenot':'Midfielder','Brent Richards':'Defender',
                         'Brenton Griffiths':'Defender','Brian Brown':'Forward',
                         'Christian Thierjung':'Midfielder','Christopher Wehan':'Midfielder',
                         'Daniel Musovski':'Forward','Darwin Espinal':'Midfielder',
                         'Duke Lacroix':'Midfielder','Eric Calvillo':'Midfielder',
                         'Gilbert Fuentes':'Midfielder','Jackson Yueill':'Midfielder',
                         'James Kiffe':'Defender','James Marcinkowski':'Goalkeeper',
                         'Jerry van Ewijk':'Midfielder','Jimmy Ockford':'Defender',
                         'Guy Abend':'Midfielder',
                         'Jochen Graf':'Forward','Joel Qwiberg':'Defender',
                         'Jordan Murrell':'Defender','Kevin Partida':'Midfielder',
                         'Kyle Ihn':'Defender','Lindo Mfeka':'Midfielder',
                         'Luis Felipe Fernandes':'Midfielder','Mark González':'Forward',
                         'Matt Bersano':'Goalkeeper','Mohamed Thiaw':'Forward',
                         'Paul Marie':'Defender','Seth Casiple':'Midfielder',
                         'Thomas Janjigian':'Defender','Will Seymore':'Midfielder',
                         'Zach Carroll':'Defender'}}
usl['Position'] = pd.DataFrame(reno_positions)

In [23]:
data = usl[usl['Games Played']>3]

In [24]:
usl.head()

Player,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,Forward Passes,...,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Team,Subbed on,Position
Allen Yanes,3.0,3.0,0.0,270.0,108.0,72.2,15.0,40.0,36.0,53.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0,
Amarildo -,3.0,1.0,1.0,92.0,15.0,66.7,1.0,0.0,14.7,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,2.0,
Andreas Ivan,2.0,2.0,1.0,164.0,53.0,73.6,3.0,100.0,29.1,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0,
Ben Mines,4.0,1.0,1.0,93.0,38.0,60.5,0.0,0.0,36.8,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,3.0,
Brian White,2.0,2.0,2.0,143.0,44.0,59.1,0.0,0.0,27.7,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,0.0,


In [25]:
reno = data[data['Team'] == 'Reno 1868 FC'].copy()
rest = data[data['Team'] != 'Reno 1868 FC'].copy()

In [26]:
len(reno)

17

In [27]:
len(rest)

550

In [28]:
reno

Player,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,Forward Passes,...,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Team,Subbed on,Position
Aidan Apodaca,4.0,1.0,1.0,157.0,35.0,57.1,3.0,66.7,20.1,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,3.0,
Brent Richards,10.0,10.0,0.0,900.0,432.0,72.9,72.0,41.7,43.2,190.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,0.0,Defender
Brian Brown,9.0,9.0,1.0,809.0,228.0,83.8,4.0,100.0,25.4,32.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,0.0,Forward
Corey Hertzog,10.0,7.0,5.0,659.0,185.0,77.8,10.0,80.0,25.3,35.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,3.0,
Duke Lacroix,9.0,9.0,0.0,810.0,390.0,80.5,34.0,32.4,43.3,150.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,0.0,Midfielder
Emrah Klimenta,5.0,5.0,0.0,450.0,202.0,81.2,38.0,55.3,40.4,81.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,0.0,
James Marcinkowski,5.0,5.0,0.0,450.0,116.0,71.6,57.0,42.1,23.2,59.0,...,59.1,0.0,4.0,0.0,1.0,0.0,1.0,Reno 1868 FC,0.0,Goalkeeper
Lindo Mfeka,8.0,6.0,4.0,515.0,294.0,84.0,31.0,64.5,51.4,93.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,2.0,Midfielder
Matt Bersano,5.0,5.0,0.0,450.0,153.0,73.9,76.0,47.4,30.6,97.0,...,73.9,1.0,2.0,5.0,0.0,0.0,6.0,Reno 1868 FC,0.0,Goalkeeper
Raul Mendiola,7.0,5.0,1.0,499.0,189.0,76.7,12.0,33.3,34.1,65.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Reno 1868 FC,2.0,


In [29]:
rest.head()

Player,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,Forward Passes,...,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Team,Subbed on,Position
Ben Mines,4.0,1.0,1.0,93.0,38.0,60.5,0.0,0.0,36.8,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,3.0,
Christopher Lema,9.0,8.0,0.0,734.0,444.0,66.7,60.0,35.0,54.4,235.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,1.0,
Derrick Etienne,4.0,3.0,2.0,303.0,129.0,79.1,9.0,77.8,38.3,37.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,1.0,
Edgardo Rito,6.0,5.0,0.0,456.0,204.0,73.5,22.0,27.3,40.3,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,New York Red Bulls 2,1.0,
Evan Louro,7.0,7.0,0.0,630.0,200.0,55.0,137.0,34.3,28.6,142.0,...,63.0,0.0,4.0,0.0,0.0,0.0,4.0,New York Red Bulls 2,0.0,


In [30]:
train = data[data['Position'].notna()]
test = data[data['Position'].isna()].drop(['Team','Position'],axis=1)

In [31]:
X = train.drop(['Position','Team'],axis=1)
y = train['Position']

In [32]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

clf = DecisionTreeClassifier()
clf_fit = clf.fit(X,y)

The export cell below currently works only on Linux; on Windows, StringIO must be used to convert the dot output to PNG. On Linux, you may uncomment the cell to produce a graphical representation of our current decision tree.

In [None]:
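'''
from sklearn.tree import export_graphviz

# Write to tree.dot so the file name matches the dot command below.
export_graphviz(clf, out_file='tree.dot', feature_names = X.columns,
                class_names = ['Defender','Forward','Goalkeeper','Midfielder'],
                rounded = True, proportion = False, precision = 2, filled = True)
'''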

To turn the dot file into a PNG, run the following on the command line: 'dot -Tpng tree.dot -o tree.png -Gdpi=600'

In [None]:
'''
from IPython.display import Image
Image(filename = 'tree.png')
'''

In [35]:
clf_fit

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [36]:
predict = clf.predict(test)

In [37]:
predict[:10]

array(['Midfielder', 'Defender', 'Midfielder', 'Midfielder', 'Goalkeeper',
       'Midfielder', 'Defender', 'Defender', 'Midfielder', 'Midfielder'],
      dtype=object)

In [38]:
test[:10]

Player,Games Played,Starts,Subbed off,Minutes Played,Passes,Passing Acc,Long Passes,Long Pass Acc,Pass per 90,Forward Passes,...,Shot At,Saves,Save Rate,Clean Sheets,Catches,Punches,Drops,Penalties Saved,Clearances,Subbed on
Ben Mines,4.0,1.0,1.0,93.0,38.0,60.5,0.0,0.0,36.8,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
Christopher Lema,9.0,8.0,0.0,734.0,444.0,66.7,60.0,35.0,54.4,235.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Derrick Etienne,4.0,3.0,2.0,303.0,129.0,79.1,9.0,77.8,38.3,37.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Edgardo Rito,6.0,5.0,0.0,456.0,204.0,73.5,22.0,27.3,40.3,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Evan Louro,7.0,7.0,0.0,630.0,200.0,55.0,137.0,34.3,28.6,142.0,...,27.0,17.0,63.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0
Jared Stroud,9.0,7.0,2.0,676.0,287.0,68.6,23.0,43.5,38.2,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
Jean-Christophe Koffi,8.0,7.0,3.0,589.0,259.0,78.8,19.0,47.4,39.6,96.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Jordan Scarlett,8.0,8.0,0.0,720.0,320.0,69.7,81.0,32.1,40.0,173.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Kyle Zajec,6.0,4.0,3.0,337.0,163.0,74.2,20.0,40.0,43.5,82.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
Marcus Epps,8.0,6.0,3.0,529.0,198.0,71.7,10.0,60.0,33.7,77.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


If we want a more accurate tree, we can start by pruning the full tree down to a more reasonable depth. Currently, we cannot check our results because collecting the positions of every player in the USL individually would take too long. Our model may therefore not be the best tuned or most efficient, but it produces a tree that makes intuitive sense.
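
One simple way to limit tree depth is the `max_depth` parameter of `DecisionTreeClassifier`. The sketch below compares a few depths with cross-validation on synthetic data generated by `make_classification`; the data, the variable names, and the candidate depths are assumptions for illustration, and nothing here was run on the USL dataframe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the player stats: 200 rows, 10 numeric features, 4 classes.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)

# Compare mean 5-fold accuracy for several depth limits (None = unlimited).
for depth in (2, 3, 5, None):
    clf_demo = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    print(depth, round(scores.mean(), 3))

# Fit one shallow tree and confirm that the depth limit was respected.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
print(shallow.get_depth())
```

With only the labelled Reno players available as training data, cross-validated scores on the real dataframe would be very noisy, so a comparison like this becomes meaningful only once positions are collected for more players.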