#How to Draft the Next NFL Pro Bowler

Every year, the National Football League holds a draft for players first entering the league, most of whom enter directly from playing college football. Teams are given a draft position depending on how well they fared the previous year, with the worst teams getting higher picks, and they select the rights to sign these players to contracts. Before that happens, every team does intensive scouting on potential draft picks to determine how highly it values certain players.

One aspect of this evaluation is the NFL Scouting Combine. This occurs every year in February; players are invited to go through a variety of physical drills that are intended to measure their raw athletic ability, including speed, strength, and agility. Our goal is to evaluate whether the Scouting Combine is predictive of future success in the NFL, which drills are most predictive of success, and whether NFL teams are emphasizing the right drills when using the Scouting Combine to evaluate players.

How do we define whether a player was "successful"? There are many different ways we could do this, but for this project we have chosen to define success as whether or not a player made the NFL Pro Bowl at any point in his career. The Pro Bowl is the NFL's All-Star game; players are voted in by coaches, other players, and fans based on their performance during the season.

In [1]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# special matplotlib argument for improved plots
from matplotlib import rcParams

##Combine Data

Our main data file contains results from the NFL Scouting Combine from 1999 through 2015. Our data file can be found at this link: http://nflsavant.com/about.php

Here is a description of the drills we plan to consider, along with their variable names in the data set.

- `fortyyd`: time in the forty-yard dash, which is an all-out sprint for 40 yards
- `bench`: number of times a player can bench press 225 pounds
- `vertical`: vertical jump, where a player stands flat-footed then jumps up as high as he can. Measured in inches.
- `broad`: broad jump, where a player stands flat-footed and then jumps forward as far as he can. Measured in inches.
- `threecone`: time in the three cone drill; players run around three cones set in an L shape in a way that is intended to measure their ability to change direction
- `twentyss`: time in the shuttle run; the player runs 5 yards, changes direction to run 10 yards, and changes direction again to run 5 yards

We are also given data on the player's position, which will become important later for sorting, as well as height, weight, college attended, and pick with which they were eventually drafted, if they were selected (the draft is only 7 rounds, so some players go without being selected).

In [2]:
nflcomb = pd.read_csv("combine.csv")
nflcomb.head()

Unnamed: 0,year,name,firstname,lastname,position,heightfeet,heightinches,heightinchestotal,weight,arms,...,vertical,broad,bench,round,college,pick,pickround,picktotal,wonderlic,nflgrade
0,2015,Ameer Abdullah,Ameer,Abdullah,RB,5,9,69,205,0,...,42.5,130,24,0,Nebraska,,0,0,0,5.9
1,2015,Nelson Agholor,Nelson,Agholor,WR,6,0,72,198,0,...,0.0,0,12,0,USC,,0,0,0,5.6
2,2015,Jay Ajayi,Jay,Ajayi,RB,6,0,72,221,0,...,39.0,121,19,0,Boise St.,,0,0,0,6.0
3,2015,Kwon Alexander,Kwon,Alexander,OLB,6,1,73,227,0,...,36.0,121,24,0,LSU,,0,0,0,5.4
4,2015,Mario Alford,Mario,Alford,WR,5,8,68,180,0,...,34.0,121,13,0,West Virginia,,0,0,0,5.3


Since we do not yet know who will make the Pro Bowl for the 2015 season, we will exclude players who participated in the 2015 Scouting Combine.

In [9]:
nflcomb = nflcomb[nflcomb['year'] < 2015]
nflcomb.head()

Unnamed: 0,year,name,firstname,lastname,position,heightfeet,heightinches,heightinchestotal,weight,arms,...,vertical,broad,bench,round,college,pick,pickround,picktotal,wonderlic,nflgrade
322,2014,Jared Abbrederis,Jared,Abbrederis,WR,6,1,73,195,0,...,30.5,117,4,6,Wisconsin,0,16,176,0,5.2
323,2014,Davante Adams,Davante,Adams,WR,6,1,73,212,0,...,39.5,123,14,2,Fresno St.,0,21,53,0,6.0
324,2014,Mo Alexander,Mo,Alexander,SS,6,1,73,220,0,...,38.0,123,0,4,Utah St.,0,14,110,0,4.9
325,2014,Ricardo Allen,Ricardo,Allen,CB,5,9,69,187,0,...,35.5,117,13,5,Purdue,0,19,147,0,5.1
326,2014,Jace Amaro,Jace,Amaro,TE,6,5,77,265,0,...,33.0,118,28,2,Texas Tech,0,17,49,0,5.4


In [12]:
nflcomb.columns

Index([u'year', u'name', u'firstname', u'lastname', u'position', u'heightfeet',
       u'heightinches', u'heightinchestotal', u'weight', u'arms', u'hands',
       u'fortyyd', u'twentyyd', u'tenyd', u'twentyss', u'threecone',
       u'vertical', u'broad', u'bench', u'round', u'college', u'pick',
       u'pickround', u'picktotal', u'wonderlic', u'nflgrade'],
      dtype='object')

Several of these variables are unnecessary for our analysis, so we can drop them from our data set.

In [27]:
dropvars = ['heightfeet', 'heightinches', 'arms', 'hands', 'wonderlic', 'nflgrade', 'pick', 'twentyyd', 'tenyd']
# reassign rather than drop in place: nflcomb is a filtered slice of the original frame
nflcomb = nflcomb.drop(dropvars, axis=1)
nflcomb.head()

Unnamed: 0,year,name,firstname,lastname,position,heightinchestotal,weight,fortyyd,twentyyd,tenyd,twentyss,threecone,vertical,broad,bench,round,college,pick,pickround,picktotal
322,2014,Jared Abbrederis,Jared,Abbrederis,WR,73,195,4.5,0,0,4.08,6.8,30.5,117,4,6,Wisconsin,0,16,176
323,2014,Davante Adams,Davante,Adams,WR,73,212,4.56,0,0,4.3,6.82,39.5,123,14,2,Fresno St.,0,21,53
324,2014,Mo Alexander,Mo,Alexander,SS,73,220,4.54,0,0,4.51,7.05,38.0,123,0,4,Utah St.,0,14,110
325,2014,Ricardo Allen,Ricardo,Allen,CB,69,187,4.61,0,0,4.15,0.0,35.5,117,13,5,Purdue,0,19,147
326,2014,Jace Amaro,Jace,Amaro,TE,77,265,4.74,0,0,4.3,7.42,33.0,118,28,2,Texas Tech,0,17,49


Missing values in this data set are coded as 0, so we need to replace those with NaNs. We will address missing values later.

In [71]:
nflcomb = nflcomb.replace(0, np.nan)
nflcomb.head()

Unnamed: 0,year,name,firstname,lastname,position,heightinchestotal,weight,fortyyd,twentyyd,tenyd,twentyss,threecone,vertical,broad,bench,round,college,pick,pickround,picktotal
322,2014,Jared Abbrederis,Jared,Abbrederis,WR,73,195,4.5,,,4.08,6.8,30.5,117,4.0,6,Wisconsin,0,16,176
323,2014,Davante Adams,Davante,Adams,WR,73,212,4.56,,,4.3,6.82,39.5,123,14.0,2,Fresno St.,0,21,53
324,2014,Mo Alexander,Mo,Alexander,SS,73,220,4.54,,,4.51,7.05,38.0,123,,4,Utah St.,0,14,110
325,2014,Ricardo Allen,Ricardo,Allen,CB,69,187,4.61,,,4.15,,35.5,117,13.0,5,Purdue,0,19,147
326,2014,Jace Amaro,Jace,Amaro,TE,77,265,4.74,,,4.3,7.42,33.0,118,28.0,2,Texas Tech,0,17,49
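A blanket replacement treats every zero in the frame as missing, including zeros in columns that are not drill results. One possible refinement, sketched here on a small made-up frame, is to restrict the replacement to the measurement columns. The `toy` frame and the `drill_cols` list are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Combine data (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "fortyyd": [4.5, 0.0, 4.7],
    "bench": [0, 14, 20],
    "round": [0, 2, 6],
})

# Hypothetical list of columns in which 0 means "did not take part in the drill"
drill_cols = ["fortyyd", "bench"]

cleaned = toy.copy()
# Replace zeros with NaN only in the drill columns; other columns are untouched
cleaned[drill_cols] = cleaned[drill_cols].replace(0, np.nan)

print(cleaned)
```

Whether a zero in a column such as `round` should also count as missing depends on how the source codes undrafted players, which would need to be checked against the data.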


###Position Analysis

Now that we've cleaned the data a little, we need to group it by position. Different players can have vastly different roles on the field, which means that a physical trait that is vitally important for one position is essentially meaningless for others. Here is a description of each of the positions, sorted by offense and defense. For a diagram of standard offensive/defensive alignments, see here: https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Positions_American_Football.svg/2000px-Positions_American_Football.svg.png

**Offense**

- *Quarterback*: His primary job is to throw the ball to wide receivers and tight ends (see below). Arm strength and agility are the most important physical attributes for a quarterback, but success at this position depends more on mental attributes than physical ones. Because of this uniqueness, as well as the small sample size at the position, we will leave quarterbacks out of our analysis.

- *Running Back*: In addition to forward passes, teams can also advance the ball forward simply by carrying the ball. The running back does most of this, in addition to sometimes catching passes. Some running backs rely on speed and agility to stay away from potential tacklers, others rely on strength to power through them, and some are good at both.

- *Full Back*: Lines up in front of the running back and mainly acts as an extra blocker on run plays, in addition to occasionally running the ball himself or catching passes.

- *Wide Receiver*: Runs down the field to try to catch forward passes. Generally smaller and faster than most players, though the best receivers are also tall, in order to outjump defenders.

- *Tight End*: Essentially a hybrid between an offensive tackle and a wide receiver. They usually line up next to the offensive line and play an important role in blocking on run plays, but they can also catch passes on passing plays. Faster than OTs but slower than WRs in general, and stronger than WRs but weaker than OTs in general.

- *Offensive Tackle*: Offenses have five offensive linemen, whose job is to prevent defenders from tackling the player carrying the ball. Most of these players are around 300 pounds. Offensive tackles play the outermost position on the line. Strength is vital, because most of the players they're blocking are just as big. Agility also plays a part, especially on passing plays, because most of the players trying to tackle the quarterback, called pass rushers, are very fast in addition to strong, and tackles tend to have to take on the defense's fastest pass rushers.

- *Offensive Guard*: These offensive linemen play just inside the offensive tackles. Strength is likely to be a little more important, and agility less so, compared to offensive tackles; guards are relied on more for run blocking and blocking pass rushers who rely on strength rather than speed.

- *Center*: The middle player on the offensive line. Essentially the same as a guard, except the center also starts the play by snapping the ball to the quarterback. They are most often responsible for calling out blocking schemes depending on how the defense lines up.

**Defense**

- *Defensive End and Defensive Tackle*: Together they make up the defensive line, with ends on the outside and tackles on the inside. They play a big role in both defending run plays and rushing the quarterback on passing plays. Usually very large and strong, but also fast and agile.

- *Outside Linebacker and Inside Linebacker*: Linebackers play immediately behind the defensive linemen. On running plays, they are the players expected to take down the ball-carrier. On passing plays, they sometimes rush the quarterback and sometimes drop back to defend against passes. Outside linebackers usually play a bigger role in rushing the passer than inside linebackers; inside linebackers play between the outside linebackers and usually play a bigger role in defending the run. Their athleticism is similar to that of tight ends and running backs, and they are often tasked with covering these players when they go out to catch passes.

- *Cornerback*: These players play along the line of scrimmage on the outside and are responsible for covering wide receivers as they go out to catch passes. They are similar to wide receivers in athleticism, except they are usually smaller and faster than the players they're covering.

- *Free Safety and Strong Safety*: Safeties play behind the linebackers. They're the last line of defense in case a running back gets past the linebackers or a receiver gets past the cornerbacks. Strong safeties usually play a bigger role in the running game than free safeties. They're usually a bit slower and bigger than cornerbacks, and often cornerbacks will convert to safety later in their career if their speed starts to fade.

**Special Teams**

- *Kicker*: Kicks field goals, which are worth half as much as touchdowns, and performs kickoffs, which is when the team gives the ball back to the other team following a score.

- *Punter*: Kicks the ball away to the other team on fourth down. Teams have four downs to get a first down. If after third down they haven't gained enough yards for a first, they will often punt the ball to the other team rather than risk handing it over in good field position if they fail to convert.

Let's look at the different position values we have in our data set.

In [46]:
nflcomb['position'].unique()

array(['WR', 'SS', 'CB', 'TE', 'RB', 'C', 'OLB', 'OT', 'OG', 'ILB', 'QB',
       'K', 'DT', 'FS', 'NT', 'P', 'DE', 'FB', 'LS', 'OC'], dtype=object)

Most of these abbreviations are straightforward: RB for Running back, etc. LS stands for long snapper, which is similar to the center but specializes in snapping the ball on punts and field goals, which requires snapping the ball a longer distance. OC and C both refer to Center. NT stands for nose tackle, which is a special type of defensive tackle who specializes almost exclusively in stopping the run.

As mentioned before, we need to break up our analysis by position. The question is how we can do that without combining positions which are too unlike each other, and while also maintaining a large enough sample size that we may still find a signal.

We also need to take into account right censoring: some players who are early in their careers haven't made the Pro Bowl yet but will make it in 2016 or later. For that reason we need to exclude more recent years. We expect that most players who don't make the Pro Bowl in their first 5 years in the league are very unlikely to make it later, so we will exclude all players who attended the Combine after 2009.
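The five-year expectation is stated here as an assumption rather than something measured. As a rough sketch of how it could be checked once Pro Bowl data are matched to players, the share of eventual Pro Bowlers whose first selection came within five years of their Combine can be computed directly. The `toy` frame and the column names `combine_year` and `first_pro_bowl` are hypothetical, and the values are made up.

```python
import pandas as pd

# Hypothetical toy data: Combine year and year of first Pro Bowl (None = never selected)
toy = pd.DataFrame({
    "combine_year": [2001, 2002, 2003, 2004, 2005],
    "first_pro_bowl": [2003, 2010, None, 2006, 2008],
})

# Keep only players who eventually made a Pro Bowl
made_it = toy.dropna(subset=["first_pro_bowl"])

# Years elapsed between the Combine and the first selection
years_to_first = made_it["first_pro_bowl"] - made_it["combine_year"]

# Fraction of eventual Pro Bowlers first selected within five years
share_within_5 = (years_to_first <= 5).mean()
print(share_within_5)
```

A high share in the real data would support the cutoff; a low one would suggest excluding more recent Combine classes.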

Let's create a new data frame including only players in the 2009 or earlier Combine, and let's see the sample size at each position.

In [66]:
drop09 = nflcomb[nflcomb.year <= 2009]
combpos = drop09.groupby('position')
combpos.count()

position,year,name,firstname,lastname,heightinchestotal,weight,fortyyd,twentyyd,tenyd,twentyss,threecone,vertical,broad,bench,round,college,pick,pickround,picktotal
CB,293,293,293,293,293,293,293,293,293,293,293,293,293,293,293,213,213,293,293
DE,262,262,262,262,262,262,262,262,262,262,262,262,262,262,262,184,184,262,262
DT,232,232,232,232,232,232,232,232,232,232,232,232,232,232,232,160,160,232,232
FB,87,87,87,87,87,87,87,87,87,87,87,87,87,87,87,54,54,87,87
FS,147,147,147,147,147,147,147,147,147,147,147,147,147,147,147,97,97,147,147
ILB,146,146,146,146,146,146,146,146,146,146,146,146,146,146,146,92,92,146,146
OC,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,56,56,100,100
OG,220,220,220,220,220,220,220,220,220,220,220,220,220,220,220,123,123,220,220
OLB,216,216,216,216,216,216,216,216,216,216,216,216,216,216,216,154,154,216,216
OT,248,248,248,248,248,248,248,248,248,248,248,248,248,248,248,178,178,248,248
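When only the number of players per position matters, the distinction between `size()` and `count()` on a grouped frame is worth keeping in mind: `size()` counts rows, while `count()` counts non-missing values in each column. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with some missing bench press results
toy = pd.DataFrame({
    "position": ["CB", "CB", "DE", "DE", "DE"],
    "bench": [12, np.nan, 20, 25, np.nan],
})

grouped = toy.groupby("position")

# Number of rows (players) per position, regardless of missing values
rows_per_position = grouped.size()

# Number of non-missing bench press results per position
non_missing_bench = grouped["bench"].count()

print(rows_per_position)
print(non_missing_bench)
```

Comparing the two gives a quick view of how many players at each position skipped a given drill.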


From these numbers a few groupings make obvious sense, because they produce a large sample size and combine positions requiring similar skills:
    
1. Lump CB, FS and SS together into the "defensive back" category
2. Lump OT, OG and OC together into the "offensive line" category
3. Lump OLB and ILB together into the "linebacker" category

The rest of the positions could be lumped together in any number of ways. To aid in these decisions I first glanced at the averages at the positions in both physical attributes and performance in certain drills.

In [67]:
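# numeric_only=True skips text columns such as name and college, which recent pandas versions refuse to average
combpos.mean(numeric_only=True)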

position,year,heightinchestotal,weight,fortyyd,twentyyd,tenyd,twentyss,threecone,vertical,broad,bench,round,pickround,picktotal
CB,2004.122867,71.259386,192.03413,4.490751,0.008771,0.005222,3.092867,0.045802,33.701365,111.211604,11.675768,2.573379,12.389078,73.535836
DE,2003.877863,75.835878,268.125954,4.845076,0.052366,0.042748,3.519389,0.223092,28.891221,98.843511,20.774809,2.667939,12.961832,78.832061
DT,2003.793103,75.172414,304.189655,5.081379,0.0,0.0,3.441121,0.0,24.793103,84.728448,23.146552,2.685345,11.931034,78.564655
FB,2003.517241,72.83908,244.505747,4.751954,0.0,0.0,3.518506,0.0,30.5,103.747126,20.356322,3.137931,13.126437,97.643678
FS,2004.040816,72.918367,204.231293,4.566531,0.0,0.0,3.172789,0.0,32.897959,105.55102,14.312925,2.782313,12.571429,83.367347
ILB,2003.712329,73.664384,242.082192,4.763904,0.0,0.0,3.30411,0.0,28.59589,96.369863,19.191781,2.390411,11.945205,70.280822
OC,2003.92,75.41,301.72,5.1835,0.0,0.0,4.0223,0.0,25.345,89.41,23.85,2.29,11.39,69.21
OG,2003.259091,76.131818,314.045455,5.325182,0.0,0.0,3.773773,0.0,23.820455,82.109091,21.890909,2.518182,10.586364,76.204545
OLB,2004.134259,74.055556,238.060185,4.681528,0.0,0.0,3.45287,0.0,30.078704,100.953704,19.111111,2.74537,12.87037,80.509259
OT,2004.004032,77.806452,317.770161,5.287702,0.012016,0.007218,3.821371,0.031129,24.475806,87.181452,20.709677,2.818548,13.092742,83.612903


Here are the choices I made:

- Add tight ends and fullbacks to the linebacker category. These positions are very different in role, but the athletic skills they require are very similar, and their averages in weight, forty-yard dash time, and bench press repetitions are close.
- Combine running backs and wide receivers into one category. I came in expecting these positions to be too different to justify combining, and while there are some obvious differences (such as performance in the bench press) that make such a combination not ideal, wide receivers are closer to running backs than running backs are to fullbacks or tight ends, and the sample size in the running back group isn't large enough to leave them on their own.
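The groupings described so far can be expressed as a lookup from position code to group label. This is only a sketch: the group labels (`DB`, `OL`, `LB`, `SKILL`) and the column name `posgroup` are made up for illustration, and positions whose grouping has not been decided above (for example DE, DT, NT, K, P, LS) or that are excluded (QB) are left unmapped.

```python
import pandas as pd

# Hypothetical group labels; the position codes come from the data set
position_groups = {
    "CB": "DB", "FS": "DB", "SS": "DB",             # defensive backs
    "OT": "OL", "OG": "OL", "OC": "OL", "C": "OL",  # offensive line (OC and C are both center)
    "OLB": "LB", "ILB": "LB", "TE": "LB", "FB": "LB",  # linebackers plus TE and FB
    "RB": "SKILL", "WR": "SKILL",                   # running backs and wide receivers
}

# Toy frame standing in for the Combine data
toy = pd.DataFrame({"position": ["CB", "C", "TE", "WR", "QB"]})

# Positions missing from the dictionary become NaN and can be dropped or handled later
toy["posgroup"] = toy["position"].map(position_groups)

print(toy)
```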

##Pro Bowl Data

The Combine data set didn't include any Pro Bowl data, so we need to scrape that for ourselves. We will do that by using the Pro Bowl pages on Wikipedia, scraping the roster from every year and matching that up with the Combine data. Since our Combine data start with the February 1999 Scouting Combine, we will use Pro Bowl rosters starting in 2000. The Pro Bowl is held every year in late January or early February, around the time of the Super Bowl (since 2010, the week before it), while the Combine is held in mid-to-late February after the season has concluded.

We will use the Python BeautifulSoup library to do this. First we need to specify the URLs from which we will be scraping.

In [57]:
from bs4 import BeautifulSoup
years = range(2000, 2016, 1)
urls = {year:"http://en.wikipedia.org/wiki/" + str(year) + "_Pro_Bowl" for year in years}
urls

{2000: 'http://en.wikipedia.org/wiki/2000_Pro_Bowl',
 2001: 'http://en.wikipedia.org/wiki/2001_Pro_Bowl',
 2002: 'http://en.wikipedia.org/wiki/2002_Pro_Bowl',
 2003: 'http://en.wikipedia.org/wiki/2003_Pro_Bowl',
 2004: 'http://en.wikipedia.org/wiki/2004_Pro_Bowl',
 2005: 'http://en.wikipedia.org/wiki/2005_Pro_Bowl',
 2006: 'http://en.wikipedia.org/wiki/2006_Pro_Bowl',
 2007: 'http://en.wikipedia.org/wiki/2007_Pro_Bowl',
 2008: 'http://en.wikipedia.org/wiki/2008_Pro_Bowl',
 2009: 'http://en.wikipedia.org/wiki/2009_Pro_Bowl',
 2010: 'http://en.wikipedia.org/wiki/2010_Pro_Bowl',
 2011: 'http://en.wikipedia.org/wiki/2011_Pro_Bowl',
 2012: 'http://en.wikipedia.org/wiki/2012_Pro_Bowl',
 2013: 'http://en.wikipedia.org/wiki/2013_Pro_Bowl',
 2014: 'http://en.wikipedia.org/wiki/2014_Pro_Bowl',
 2015: 'http://en.wikipedia.org/wiki/2015_Pro_Bowl'}
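Once a page has been downloaded, the roster has to be pulled out of its HTML. The sketch below shows one way this might look with BeautifulSoup, run on a small inline HTML fragment so that it works without network access. The table layout, the `wikitable` class, the placeholder player names, and the helper `extract_players` are all assumptions for illustration; the real Wikipedia pages may be structured differently and vary from year to year.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; real roster tables may be laid out differently
sample_html = """
<table class="wikitable">
  <tr><th>Position</th><th>Starter(s)</th></tr>
  <tr><td>Quarterback</td><td><a href="/wiki/A">Player A</a></td></tr>
  <tr><td>Running back</td><td><a href="/wiki/B">Player B</a>, <a href="/wiki/C">Player C</a></td></tr>
</table>
"""

def extract_players(html):
    """Return (position, player name) pairs from every assumed roster table."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) < 2:
                continue  # header rows contain <th> cells only
            position = cells[0].get_text(strip=True)
            # Each linked name in the second cell is treated as one player
            for link in cells[1].find_all("a"):
                players.append((position, link.get_text(strip=True)))
    return players

players = extract_players(sample_html)
print(players)
```

In practice the HTML for each year would come from the corresponding entry in `urls`, and the extracted names would then be matched against the `name` column of the Combine data.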