# Predicting this year's NFL draft

-------------------------

This notebook presents a prediction of this year's NFL draft. It is a live test to this recently completed analysis. Let's get started!

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import pickle

import joblib
import gzip
import warnings

from sklearn.metrics import f1_score, accuracy_score

In [2]:
pd.set_option('max_columns', 1000)
pd.set_option('max_rows', 1000)

In order to make predictions of the 2018 draft class, we will need to obtain the NFL Combine results for this year's players. This info is available in a series of HTML tables on the NinersNation website. As we have done before, we will scrape this info, parse the HTML table, and construct a $pandas$ dataframe. To that end, the following two functions help us accomplish this by parsing the HTML table and cleaning the dataframe, respectively. 

In [3]:
def parse_html_table(table): 
    ''' Function takes in a HTML table from a BeautifulSoup object
        Output is a pandas dataframe with appropriate rows and cols
    '''

    nrow = 0
    ncol = 0
    col_names = []
    
    #Find number of rows and columns and column names in the html table
    for row in table.find_all('tr'):
        #First find the number of rows and columns
        the_tds = row.find_all('td')
        if len(the_tds) > 0:
            nrow += 1
            if ncol == 0:
                # Set the number of columns for the table
                ncol = len(the_tds)       
        #Try to find column names in the table
        the_ths = row.find_all('th') 
        if len(the_ths) > 0 and len(col_names) == 0:
            for the_th in the_ths:
                col_names.append(the_th.get_text())
                
    #Define output dataframe
    cols = col_names if len(col_names) > 0 else range(0,ncol)
    df = pd.DataFrame(columns = cols, index= range(0,nrow))

    #Construct output dataframe, element-by-element
    i_row = 0
    for row in table.find_all('tr'):
        i_col = 0
        columns = row.find_all('td')
        for column in columns:
            df.iloc[i_row,i_col] = column.get_text()
            i_col += 1
        if len(columns) > 0:
            i_row += 1
        
    return df

In [4]:
def clean_up(df):
    ''' Function to perform cleaning steps on the dataframe, which is quite messy.
        
    '''
    
    #Make a copy of the input
    copy = df.copy()
    
    #Extract the positiongroup
    copy['PositionGroup'] = copy['#'].str[:2]
    
    #Combine into a single name
    copy['Name'] = copy.Name.str.split(',').str.get(1) + ' ' + copy.Name.str.split(',').str.get(0)
    
    #Convert into height in inches
    copy['height'] = pd.to_numeric(copy.Height)//1000*12+(pd.to_numeric(copy.Height)%1000)/10
    
    #Clean up strings
    copy.Arm = copy.Arm.str.replace('"','').str.strip()
    copy.Hand = copy.Hand.str.replace('"','').str.strip()
    
    #Convert fractional values to decimal values
    copy.Arm = copy.Arm.replace('', '-9 1/1').\
                        apply(lambda x: float(x) if len(x)<=2 else float(x[-3])/float(x[-1]) + float(x[:2]))
    copy.Hand = copy.Hand.apply(lambda x: float(x) if len(x)<=2 else float(x[-3])/float(x[-1])+ float(x[:2]))
    
    #Clean up strings
    copy.Vertical = pd.to_numeric(copy.Vertical.str.replace('"','').str.replace('\'',''))
    copy.Broad = pd.to_numeric(copy.Broad.str.split('\'').str.get(0))*12 + \
                 pd.to_numeric(copy.Broad.str.split('\'').str.get(1).str.replace('"',''))
    
    #Sanity check
    copy = copy[copy.Arm > 0]

    #Ensure column is numeric
    copy.Weight = pd.to_numeric(copy.Weight)
    copy.Bench = pd.to_numeric(copy.Bench)

    #Drop unnecessary columns
    copy.drop(['#', 'Height'], axis=1, inplace=True)
    
    #Return cleaned dataframe
    return copy.copy()

Now we can give the url of the site and read it in with the help of $BeautifulSoup$. 

In [5]:
url = 'http://www.ninersnation.com/2018/3/6/17081404/nfl-combine-2018-full-results-table-40-yard-dash-bench-workout-drills-recap'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
res = requests.get(url, headers=headers, verify=True)
soup = BeautifulSoup(res.text,'lxml')

Because there are tables for each positiongroup, we cycle through the tables and clean each dataframe individually. There are some unique cleaning steps for each, so we take care of it here as well.

In [6]:
df_list = []
col_dict = {'OL':['40 Yard            (1st Att.)', '10 Yard', '40 Yard                  (2nd Att.)'],
            'RB':['40 (1st Att.)', '40 (2nd Att.)', '60 Shuttle'],
            'QB':['40 Yard                (1st Att.)', '40 Yard           (2nd Att.)'],
            'WO':['40 Yard           (1st Att.)', '40  Yard               (2nd Att.)', '60 Shuttle'],
            'TE':['40 (1st Att.)', '40 (2nd Att.)', '60 Shuttle'],
            'DL':['40 (1st Att.)', '40 (2nd Att.)', '10 Yard', '60 Shuttle'],
            'LB':['40 (1st Att.)', '40 (2nd Att.)','60 Shuttle'],
            'DB':['40 Yard                (1st Att.)','40 Yard                          (2nd Att.)', '60 Shuttle']
           }

#Loop over the tables
for tab in soup.find_all('table'):
    
    #Parse the HTML
    thedf = parse_html_table(tab)
    
    #Mistake in the input table: fix it here
    thedf.loc[thedf.Hand=='9 78','Hand'] = '9 7/8'
    
    #Clean the rest of the dataframe
    thedf = clean_up(thedf)
    
    #Drop unnecessary columns
    pos = thedf.PositionGroup.unique()[0]
    thedf.drop(col_dict[pos], axis=1, inplace=True)
    
    #Rename the columns
    thedf.columns = ['name', 'college', 'weight', 'arms', 'hands', 'fortyyd', 'bench', 'vertical', 'broad', 
                     'threecone', 'shuttle', 'positiongroup', 'height']
    
    #Ensure column is numeric
    thedf.fortyyd = pd.to_numeric(thedf.fortyyd)
    thedf.threecone = pd.to_numeric(thedf.threecone)
    thedf.shuttle = pd.to_numeric(thedf.shuttle)
    
    df_list.append(thedf)

Now we can construct the full dataset for 2018 and concat the list of separate dataframes. We again do some cleaning to get it to look like our existing data. 

In [7]:
#Join into one dataframe
df_2018 = pd.concat(df_list).reset_index(drop=True)

#Create columns that need to be there
df_2018['year'] = 2018
df_2018['nflgrade'] = np.nan
df_2018['wonderlic'] = np.nan

#Re-name positiongroup factors
df_2018.positiongroup = df_2018.positiongroup.apply(lambda x: x if x!='WO' else 'RE')
df_2018.positiongroup = df_2018.positiongroup.apply(lambda x: x if x!='TE' else 'RE')
df_2018.positiongroup = df_2018.positiongroup.apply(lambda x: x if x!='RB' else 'BA')

#Remove white spaces
df_2018.name = df_2018.name.str.strip()

df_2018.isnull().sum()*100./len(df_2018)

name               0.000000
college            0.000000
weight             0.000000
arms               0.000000
hands              0.000000
fortyyd           19.753086
bench             25.617284
vertical          23.765432
broad             23.456790
threecone         43.827160
shuttle           40.432099
positiongroup      0.000000
height             0.000000
year               0.000000
nflgrade         100.000000
wonderlic        100.000000
dtype: float64

Note that like before there is null data throughout this dataset. We will use the same solution as before: median imputation grouped by positiongroup. In order to do this, we will need to read in the original data. 

In [8]:
df_orig = pd.read_pickle('data/cleaned_df.pkl')

An important note to make here is to ensure the columns are also ordered correctly (which was initally overlooked). 

In [9]:
print('Original data :\n\t{}\n\n'.format(df_orig.columns))
print('2018 data:\n\t{}'.format(df_2018.columns))

df_2018 = df_2018[[u'name', u'year', u'college', u'height', u'weight',
                   u'fortyyd', u'vertical', u'bench', u'threecone', u'shuttle', u'broad',
                   u'wonderlic', u'nflgrade', u'arms', u'hands', u'positiongroup']].copy()

Original data :
	Index([u'name', u'year', u'college', u'position', u'height', u'weight',
       u'fortyyd', u'vertical', u'bench', u'threecone', u'shuttle', u'broad',
       u'wonderlic', u'nflgrade', u'arms', u'hands', u'team', u'round',
       u'pick', u'overall', u'positiongroup'],
      dtype='object')


2018 data:
	Index([u'name', u'college', u'weight', u'arms', u'hands', u'fortyyd', u'bench',
       u'vertical', u'broad', u'threecone', u'shuttle', u'positiongroup',
       u'height', u'year', u'nflgrade', u'wonderlic'],
      dtype='object')


Now, we can make the imputation of the null values in the 2018 dataframe by the original dataframe. 

In [10]:
cols = ['fortyyd','vertical','bench','threecone','shuttle','broad','wonderlic','nflgrade','arms','hands']

#Impute the relevant columns by positiongroup
for col in cols:
    df_2018[col] = df_2018[col].fillna(df_orig.groupby('positiongroup')[col].transform('median'))

Next, we focus on getting the actual draft results for 2018! We want to get this data and add it to the dataframe. We can use our existing machinery to do this for us. 

In [11]:
#Ignore the warning...
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    
    #Import the methods to scrape DraftHistory
    from scraping_dh import convert_html_table, scrape_per_year

In [12]:
#Get the draft results for this year
res_2018 = scrape_per_year(2018)

In [13]:
#Re-format name
res_2018['name'] = res_2018['Name']
res_2018.drop('Name', axis=1, inplace=True)

#Make sure its an int not float
res_2018['Round'] = res_2018['Round'].astype('int')

#Drop unnecessary columns
res_2018.drop(['Pick', 'Player', 'Team', 'Position', 'College'], axis=1, inplace=True)

In [14]:
#Merge the data 
df_2018 = df_2018.merge(res_2018, on='name', how='left')

#If the player went undrafted, mark them as such
df_2018.fillna(-1, inplace=True)

Now we have our full dataset for 2018 that is needed for both prediction and evaluation of said prediction. Now we can create the needed arrays to input into our already fitted model. 

In [15]:
#Make a copy of the dataframe
df = df_2018.copy()

#Drop unnecessary columns
df.drop(['name', 'college'], axis=1, inplace=True)

#Convert positiongroup to appropriate indicator variables
df = pd.get_dummies(df, drop_first=True)
df['positiongroup_ST'] = 0

#Get the arrays of data
target = 'Round'
x = df.drop(target, axis=1).values
y = df[target].values

We will of course need our fitted model. So we can load it in as our (only) classifier. 

In [16]:
with gzip.GzipFile('final_model.pkl.gz', 'r') as file:
    clf = joblib.load(file)

We can assign the predictions of our model back to the dataframe. 

In [17]:
df_2018['Prediction'] = clf.predict(x)

And, how did we do?!

In [18]:
f1_score(df_2018['Round'], df_2018['Prediction'], average='weighted')

0.27929629085870283

That's not too bad! Recall our ensemble classifier had a F1 score of 0.291127. Our (test) score of 0.279 is not too far away from this score and is actually nearly identical to the score from the Extremely Randomized Trees model. We can also take a look at the cross tabulation of how the predictions stack up with the actual drafted rounds. 

In [19]:
pd.crosstab(df_2018['Prediction'], df_2018['Round'])

Round,-1.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0
Prediction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
-1,95,9,14,19,19,14,19,11
1,8,12,5,4,4,3,4,4
2,1,2,0,0,1,1,1,0
3,8,5,1,3,2,3,1,0
4,6,2,8,3,4,5,5,2
5,1,1,1,0,3,0,0,0
6,0,0,1,0,0,2,0,0
7,2,1,1,2,0,0,0,1


I think we can be pretty happy with how our model performed. It was relatively on par with how it performed on the test set. Lastly, we can look at the full list of predictions with player name and college. This way it is easier to see the individual predictions for players. For example, we can see the San Francisco 49ers first round selection, and ninth overall, of Mike McGlinchey was indeed predicted to be a first round draftee. Awesome!  

In [20]:
#Get just the player name, college and draft round info
predictions = df_2018[['name', 'college', 'Round', 'Prediction']].copy()

#Re-code the draft rounds
predictions['Round'] = predictions['Round'].astype('int').astype('str')
predictions.loc[predictions['Round'] == '-1', 'Round'] = 'Undrafted'
predictions['Prediction'] = predictions['Prediction'].astype('int').astype('str')
predictions.loc[predictions['Prediction'] == '-1', 'Prediction'] = 'Undrafted'

#Column name consistent
predictions.columns = ['name', 'college', 'round', 'prediction']

#Display full list of sorted players
predictions.sort_values('name').reset_index(drop=True)

Unnamed: 0,name,college,round,prediction
0,Ade Aruna,Tulane,6,1
1,Akrum Wadley,Iowa,Undrafted,Undrafted
2,Alex Cappa,Humboldt State,3,Undrafted
3,Allen Lazard,Iowa State,Undrafted,Undrafted
4,Andre Chachere,San Jose State,Undrafted,3
5,Andre Smith,North Carolina,7,4
6,Andrew Brown,Virginia,5,Undrafted
7,Anthony Averett,Alabama,4,3
8,Anthony Miller,Memphis,2,Undrafted
9,Anthony Winbush,Ball State,Undrafted,7


-------------------

Et fin.