# NFL Combine Classification Modeling

## Exploratory Data Analysis

## Project Goals

- Determine the influence the NFL Combine has on a prospect getting drafted or not.
- Determine the influence the NFL Combine has in terms of how early or how late a prospect gets drafted.
- Discover which NFL Combine drills have the most impact on a prospect's draft status.

## Summary of Data

The dataset analyzed for this study contains 10,228 observations of NFL Combine data from 1987 to 2018.

### Library Import

In [1]:
#Import libraries
%run ../python_files/libraries

## Data Importing, Data Merging, and Data Examination

In [14]:
# import NFL Combine and NFL Draft data
nfl_combine_df = pd.read_csv('../data/nfl_combine_data.csv')
nfl_draft_df = pd.read_csv('../data/nfl_draft_data.csv')

# quick overview of the NFL Combine dataset
nfl_combine_df

Unnamed: 0,combine_year,first_name,last_name,college,position,height_inches,weight_lbs,hand_size_inches,arm_length_inches,40_yard_dash,bench_press_reps,vertical_leap_inches,broad_jump_inches,3_cone_drill,20_yard_shuttle,60_yard_shuttle
0,2017,jamal,adams,louisiana_state,db,71.63,214,9.25,33.38,4.56,18.0,31.5,120.0,6.96,4.13,
1,2017,montravius,adams,auburn,dl,75.63,304,9.25,32.75,4.87,22.0,29.0,108.0,7.62,,
2,2017,rodney,adams,south_florida,wr,73.25,189,9.00,32.00,4.44,8.0,29.5,125.0,6.98,4.28,11.39
3,2017,quincy,adeboyejo,mississippi,wr,74.75,197,9.38,31.75,4.42,8.0,34.5,123.0,6.73,4.14,
4,2017,brian,allen,utah,db,74.88,215,10.00,34.00,4.48,15.0,34.5,117.0,6.64,4.34,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9945,1987,rod,woodson,purdue,lb,72.00,202,10.50,31.00,4.33,10.0,36.0,125.0,,3.98,10.92
9946,1987,john,wooldridge,ohio_state,rb,68.40,193,,,,,,,,,
9947,1987,david,wyman,stanford,lb,74.00,235,9.50,31.25,4.79,23.0,29.0,118.0,,4.30,11.78
9948,1987,theo,young,arkansas,fb_te,74.00,231,9.00,34.00,4.89,9.0,30.0,107.0,,4.20,11.71


In [15]:
# quick overview of the NFL Draft dataset
nfl_draft_df

Unnamed: 0,first_name,last_name,combine_year,round,pick,team
0,jameis,winston,2015,1,1,tam
1,marcus,mariota,2015,1,2,ten
2,dante,fowler,2015,1,3,jax
3,amari,cooper,2015,1,4,oak
4,brandon,scherff,2015,1,5,was
...,...,...,...,...,...,...
8174,xavier,woods,2017,6,191,dal
8175,zach,banner,2017,4,137,ind
8176,zach,cunningham,2017,2,57,hou
8177,zane,gonzalez,2017,7,224,cle


In [16]:
# merge the NFL Combine and NFL Draft datasets with a left join so that every NFL Combine row is retained
nfl_df = pd.merge(nfl_combine_df, nfl_draft_df, how = 'left', on = ['last_name', 'first_name', 'combine_year'])
nfl_df

Unnamed: 0,combine_year,first_name,last_name,college,position,height_inches,weight_lbs,hand_size_inches,arm_length_inches,40_yard_dash,bench_press_reps,vertical_leap_inches,broad_jump_inches,3_cone_drill,20_yard_shuttle,60_yard_shuttle,round,pick,team
0,2017,jamal,adams,louisiana_state,db,71.63,214,9.25,33.38,4.56,18.0,31.5,120.0,6.96,4.13,,1.0,6.0,nyj
1,2017,montravius,adams,auburn,dl,75.63,304,9.25,32.75,4.87,22.0,29.0,108.0,7.62,,,3.0,93.0,gnb
2,2017,rodney,adams,south_florida,wr,73.25,189,9.00,32.00,4.44,8.0,29.5,125.0,6.98,4.28,11.39,5.0,170.0,min
3,2017,quincy,adeboyejo,mississippi,wr,74.75,197,9.38,31.75,4.42,8.0,34.5,123.0,6.73,4.14,,,,
4,2017,brian,allen,utah,db,74.88,215,10.00,34.00,4.48,15.0,34.5,117.0,6.64,4.34,,5.0,173.0,pit
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9967,1987,rod,woodson,purdue,lb,72.00,202,10.50,31.00,4.33,10.0,36.0,125.0,,3.98,10.92,1.0,10.0,pit
9968,1987,john,wooldridge,ohio_state,rb,68.40,193,,,,,,,,,,,,
9969,1987,david,wyman,stanford,lb,74.00,235,9.50,31.25,4.79,23.0,29.0,118.0,,4.30,11.78,2.0,45.0,sea
9970,1987,theo,young,arkansas,fb_te,74.00,231,9.00,34.00,4.89,9.0,30.0,107.0,,4.20,11.71,12.0,317.0,pit
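The merged output ends at index 9971 while the combine table ends at index 9948, which suggests the left join added rows. A left join keeps every left-hand row, but it also repeats a left-hand row once for each matching right-hand row when the join key is not unique in the right table. The following is a minimal sketch of how that can be checked, using made-up toy frames in place of `nfl_combine_df` and `nfl_draft_df`; the variable names `combine`, `draft`, `extra_rows`, and `dup_keys` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for nfl_combine_df and nfl_draft_df (all values are made up).
combine = pd.DataFrame({
    "first_name": ["chris", "chris", "sam"],
    "last_name": ["jones", "smith", "lee"],
    "combine_year": [2016, 2016, 2016],
    "40_yard_dash": [4.50, 4.61, 4.72],
})
draft = pd.DataFrame({
    "first_name": ["chris", "chris"],
    "last_name": ["jones", "jones"],
    "combine_year": [2016, 2016],
    "round": [2, 6],
})

keys = ["last_name", "first_name", "combine_year"]

# indicator=True adds a `_merge` column: "both" or "left_only" for a left join.
merged = pd.merge(combine, draft, how="left", on=keys, indicator=True)

# Rows the join added beyond the left table (non-zero means duplicated keys).
extra_rows = len(merged) - len(combine)

# Draft rows that share the same key and therefore duplicate a combine row.
dup_keys = draft[draft.duplicated(subset=keys, keep=False)]

print(merged["_merge"].value_counts())
print("extra rows:", extra_rows)
print(dup_keys)
```

If duplicates like these turn out to be two different players with the same name and year, an additional key such as position or college (where available in both tables) would be needed to separate them.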


In [17]:
# quick review of the variables in the NFL merged dataset
nfl_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9972 entries, 0 to 9971
Data columns (total 19 columns):
combine_year            9972 non-null int64
first_name              9972 non-null object
last_name               9972 non-null object
college                 9972 non-null object
position                9972 non-null object
height_inches           9972 non-null float64
weight_lbs              9972 non-null int64
hand_size_inches        8422 non-null float64
arm_length_inches       8104 non-null float64
40_yard_dash            9092 non-null float64
bench_press_reps        6792 non-null float64
vertical_leap_inches    8069 non-null float64
broad_jump_inches       7921 non-null float64
3_cone_drill            4521 non-null float64
20_yard_shuttle         7191 non-null float64
60_yard_shuttle         3177 non-null float64
round                   6144 non-null float64
pick                    6144 non-null float64
team                    6144 non-null object
dtypes: float64(12), int64(2), object(5)
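The `info()` output shows `round`, `pick`, and `team` populated for 6144 of the 9972 merged rows, so a missing `round` can serve as a marker for an undrafted attendee. One possible way to turn that into a binary target for the first project goal is sketched below on a made-up frame; the column name `drafted` is an assumption and does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nfl_df (values are made up).
merged = pd.DataFrame({
    "last_name": ["a", "b", "c", "d"],
    "round": [1.0, np.nan, 5.0, np.nan],
    "pick": [6.0, np.nan, 170.0, np.nan],
})

# 1 if the player has a draft round after the left join, 0 otherwise.
merged["drafted"] = merged["round"].notna().astype(int)

# Share of attendees who were drafted (the class balance of the target).
draft_rate = merged["drafted"].mean()

print(merged)
print("draft rate:", draft_rate)
```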

In [18]:
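# NFL Combine attendees whose last name appears in the NFL Draft dataset
# the filter matches on last_name only, so its count (8526) overstates drafted attendees;
# the merged dataset above has 6144 rows with a non-null round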
nfl_combine_df[nfl_combine_df.last_name.isin(nfl_draft_df.last_name)]

Unnamed: 0,combine_year,first_name,last_name,college,position,height_inches,weight_lbs,hand_size_inches,arm_length_inches,40_yard_dash,bench_press_reps,vertical_leap_inches,broad_jump_inches,3_cone_drill,20_yard_shuttle,60_yard_shuttle
0,2017,jamal,adams,louisiana_state,db,71.63,214,9.25,33.38,4.56,18.0,31.5,120.0,6.96,4.13,
1,2017,montravius,adams,auburn,dl,75.63,304,9.25,32.75,4.87,22.0,29.0,108.0,7.62,,
2,2017,rodney,adams,south_florida,wr,73.25,189,9.00,32.00,4.44,8.0,29.5,125.0,6.98,4.28,11.39
4,2017,brian,allen,utah,db,74.88,215,10.00,34.00,4.48,15.0,34.5,117.0,6.64,4.34,
5,2017,jonathan,allen,alabama,dl,74.63,286,9.38,33.63,5.00,21.0,30.0,108.0,7.49,4.44,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,1987,tony,woods,pittsburgh,dl,74.80,249,10.00,34.00,4.85,18.0,29.0,115.0,,4.55,11.82
9945,1987,rod,woodson,purdue,lb,72.00,202,10.50,31.00,4.33,10.0,36.0,125.0,,3.98,10.92
9947,1987,david,wyman,stanford,lb,74.00,235,9.50,31.25,4.79,23.0,29.0,118.0,,4.30,11.78
9948,1987,theo,young,arkansas,fb_te,74.00,231,9.00,34.00,4.89,9.0,30.0,107.0,,4.20,11.71


In [19]:
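# NFL Combine attendees whose last name never appears in the NFL Draft dataset
# the filter matches on last_name only, so its count (1424) understates undrafted attendees;
# 9972 - 6144 = 3828 rows in the merged dataset above have no round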
nfl_combine_df[~nfl_combine_df.last_name.isin(nfl_draft_df.last_name)]

Unnamed: 0,combine_year,first_name,last_name,college,position,height_inches,weight_lbs,hand_size_inches,arm_length_inches,40_yard_dash,bench_press_reps,vertical_leap_inches,broad_jump_inches,3_cone_drill,20_yard_shuttle,60_yard_shuttle
3,2017,quincy,adeboyejo,mississippi,wr,74.75,197,9.38,31.75,4.42,8.0,34.5,123.0,6.73,4.14,
10,2017,antony,auclair,laval,fb_te,78.00,254,,,,,,,,4.45,12.08
11,2017,erik,austell,charleston_southern,ol,75.13,301,9.00,32.00,5.23,24.0,27.5,107.0,8.13,4.90,
20,2017,collin,bevins,northwest_missouri_state,dl,78.00,285,,,,,,,,4.39,
24,2017,garett,bolles,utah,ol,77.25,297,9.38,34.00,4.95,,28.0,115.0,7.29,4.55,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9907,1987,jason,stargel,cincinnati,wr,74.30,186,,,4.64,5.0,30.5,114.0,,4.36,11.94
9920,1987,jeff,tiefenthaler,south_dakota_state,wr,73.00,181,,,4.67,11.0,24.0,105.0,,,
9921,1987,van,tiffin,alabama,k_k_p,68.60,155,9.00,28.25,4.90,,22.5,100.0,,4.65,
9932,1987,tom,welter,nebraska,ol,76.40,274,9.75,31.75,5.23,13.0,27.0,87.0,,5.04,


In [20]:
# drafted players whose last name never appears in the NFL Combine dataset
# the filter matches on last_name only, so its count (703) likely understates drafted players who skipped the combine
nfl_draft_df[~nfl_draft_df.last_name.isin(nfl_combine_df.last_name)]

Unnamed: 0,first_name,last_name,combine_year,round,pick,team
209,christian,ringo,2015,6,210,gnb
221,austin,reiter,2015,7,222,was
245,geoff,swaim,2015,7,246,dal
250,taurean,nixon,2015,7,251,den
251,josh,furman,2015,7,252,den
...,...,...,...,...,...,...
7859,jimmy,landes,2016,6,210,det
7928,roberto,aguayo,2016,2,59,tam
7998,chris,godwin,2017,3,84,tam
7999,chris,wormley,2017,3,74,bal
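The three filters above compare `last_name` alone, so any attendee who shares a surname with a drafted player counts as a match. A minimal sketch of an alternative is shown below: it compares the same three keys used in the merge and reads the three groups from the `_merge` indicator of an outer join. The toy frames and the names `status`, `attended_and_drafted`, `attended_not_drafted`, and `drafted_not_attended` are assumptions for illustration, and all values are made up.

```python
import pandas as pd

# Toy stand-ins for nfl_combine_df and nfl_draft_df (all values are made up).
combine = pd.DataFrame({
    "first_name": ["alex", "blake", "casey"],
    "last_name": ["allen", "allen", "allen"],
    "combine_year": [2017, 2017, 2018],
})
draft = pd.DataFrame({
    "first_name": ["blake", "drew"],
    "last_name": ["allen", "stone"],
    "combine_year": [2017, 2017],
    "round": [1, 3],
})

keys = ["last_name", "first_name", "combine_year"]

# Last-name-only filter: every "allen" matches, although only one was drafted.
by_last_name = combine[combine.last_name.isin(draft.last_name)]

# Outer join on the full key; `_merge` records which table each row came from.
status = combine.merge(
    draft[keys].drop_duplicates(), how="outer", on=keys, indicator=True
)

attended_and_drafted = status[status["_merge"] == "both"]
attended_not_drafted = status[status["_merge"] == "left_only"]
drafted_not_attended = status[status["_merge"] == "right_only"]

print(len(by_last_name), len(attended_and_drafted),
      len(attended_not_drafted), len(drafted_not_attended))
```

Name spelling differences between the two sources would still produce false mismatches under this approach, so the counts remain approximate.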


In [22]:
# quick review of the characteristics of the feature variables in the dataset
nfl_df.describe()

Unnamed: 0,combine_year,height_inches,weight_lbs,hand_size_inches,arm_length_inches,40_yard_dash,bench_press_reps,vertical_leap_inches,broad_jump_inches,3_cone_drill,20_yard_shuttle,60_yard_shuttle,round,pick
count,9972.0,9972.0,9972.0,8422.0,8104.0,9092.0,6792.0,8069.0,7921.0,4521.0,7191.0,3177.0,6144.0,6144.0
mean,2002.190533,73.735322,240.251103,9.528544,32.221266,4.830313,19.832155,32.008489,112.323065,7.345895,4.403406,11.674328,4.308594,121.358887
std,9.301371,2.644132,45.019225,0.629227,1.49793,0.309724,6.538387,4.203464,9.307325,0.446682,0.268807,0.425468,2.45211,76.540067
min,1987.0,64.9,142.0,7.13,25.63,4.21,1.0,17.5,7.0,6.34,3.73,10.43,1.0,1.0
25%,1994.0,71.88,203.0,9.13,31.25,4.59,15.0,29.0,106.0,7.01,4.2,11.38,2.0,57.0
50%,2002.0,74.0,232.0,9.5,32.25,4.76,20.0,32.0,113.0,7.26,4.36,11.65,4.0,114.0
75%,2011.0,75.75,275.0,10.0,33.25,5.04,24.0,35.0,119.0,7.62,4.57,11.93,6.0,179.0
max,2017.0,82.4,387.0,11.88,38.5,6.12,51.0,46.0,147.0,9.61,5.68,13.91,12.0,336.0


In [23]:
# check the number of missing values in the NFL merged dataset
nfl_df.isna().sum()

combine_year               0
first_name                 0
last_name                  0
college                    0
position                   0
height_inches              0
weight_lbs                 0
hand_size_inches        1550
arm_length_inches       1868
40_yard_dash             880
bench_press_reps        3180
vertical_leap_inches    1903
broad_jump_inches       2051
3_cone_drill            5451
20_yard_shuttle         2781
60_yard_shuttle         6795
round                   3828
pick                    3828
team                    3828
dtype: int64
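Raw counts are easier to compare as shares of the 9972 rows; for example, 6795 missing `60_yard_shuttle` values is roughly 68% of the dataset. A small sketch of a combined count-and-percentage summary follows, using a made-up frame; the names `df` and `summary` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nfl_df (values are made up).
df = pd.DataFrame({
    "40_yard_dash": [4.56, 4.87, np.nan, 4.42],
    "3_cone_drill": [6.96, np.nan, np.nan, np.nan],
    "round": [1.0, 3.0, np.nan, np.nan],
})

# isna().mean() is the fraction of missing values per column.
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(summary)
```

Sorting by `pct_missing` puts the sparsest drills first, which helps when deciding which columns to impute and which to drop.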

## Data Cleaning, Data Transformations, and Data Exploration



In [None]:
# Create dummy values for the categorical variables

# auto_df['mstatus'] = auto_df['mstatus'].map({'Yes': 1, 'No': 0})
# auto_df['sex'] = auto_df['sex'].map({'M': 1, 'F': 0})
# auto_df['education'] = auto_df['education'].map({'<High School': 0, 'High School': 0, 'Bachelors': 1, 'Masters': 1, 'PhD': 1})
# auto_df['job'] = auto_df['job'].map({'Student': 1, 'Blue Collar': 0, 'Clerical': 0, 'Doctor': 0, 'Home Maker': 0, 'Lawyer': 0, 'Manager': 0, 'Professional': 0})

In [None]:
# Log Transformations for non-normalized variables. Then, drop the original variable from the dataset.

# def log_col(df, col):
#     '''Convert column to log values and
#     drop the original column
#     '''
#     df[f'{col}_log'] = np.log(df[col])
#     df.drop(col, axis=1, inplace=True)

# log_col(auto_df, 'tif')

In [None]:
# quick review of the characteristics of all variables in the dataset, 
# including the new dummy variables and log-transformed variables
# df.describe()

In [None]:
# Correlations between all variables in auto_df dataset
# df.corr(method = 'pearson')

In [None]:
#Correlation Heatmap of all variables in auto_df dataset

# mask = np.zeros_like(auto_df.corr())
# triangle_indices = np.triu_indices_from(mask)
# mask[triangle_indices] = True

# plt.figure(figsize=(35,30))
# ax = sns.heatmap(auto_df.corr(method='pearson'), cmap="coolwarm", mask=mask, annot=True, annot_kws={"size": 18}, square=True, linewidths=4)
# sns.set_style('white')
# plt.xticks(fontsize=14, rotation=45)
# plt.yticks(fontsize=14, rotation=0)
# bottom, top = ax.get_ylim()
# ax.set_ylim(bottom + 0.5, top - 0.5)
# plt.show()

## Initial Train and Test Dataset Creation



In [None]:
# Split auto_df into train and test datasets for our logistic and linear regression models

#train and test datasets for logistic regression model
# crash = auto_df['crash']
# features_log = auto_df.drop(['crash', 'crash_cost'], axis = 1)
# x_train_log, x_test_log, y_train_log, y_test_log = train_test_split(features_log, crash, test_size = 0.2, random_state = 10)

## Feature Selection

For modeling purposes, we used recursive feature elimination with cross-validation (RFECV) for both our logistic regression model and our linear regression model. RFECV repeatedly fits the model, removes the weakest feature, and uses cross-validated scores (accuracy for the logistic regression model, R² for the linear regression model) to choose how many features to keep. Features that are not selected get dropped from the dataset prior to modeling.
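The selection step described above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data comes from `make_classification` rather than the NFL dataset, the feature names are placeholders, and 5-fold cross-validation and `max_iter=1000` are arbitrary choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data standing in for the combine features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=10
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Drop one feature per iteration and score each subset size by CV accuracy.
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(5),
    scoring="accuracy",
)
rfecv.fit(X, y)

# support_ is a boolean mask over the input features.
selected = [name for name, keep in zip(feature_names, rfecv.support_) if keep]
X_selected = X[:, rfecv.support_]

print("features kept:", rfecv.n_features_)
print(selected)
```

Because accuracy can be misleading when one class dominates, a different `scoring` value may be worth considering if drafted and undrafted players are unevenly represented.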

### Recursive Feature Elimination for Logistic Regression Model

In [None]:
# logreg_model = LogisticRegression()
# rfecv_log = RFECV(estimator=logreg_model, step=1, cv=StratifiedKFold(10), scoring='accuracy')
# rfecv_log.fit(x_train_log, y_train_log)

In [None]:
# keep the names of the features that RFECV selected
# new_features_log = [name for name, keep in zip(features_log, rfecv_log.support_) if keep]

# print(new_features_log)

### Recursive Feature Elimination for Linear Regression Model

In [None]:
# linreg_model = LinearRegression()
# rfecv_lin = RFECV(estimator=linreg_model, step=1, min_features_to_select = 1, scoring='r2')
# rfecv_lin.fit(x_train_lin, y_train_lin)

In [None]:
# keep the names of the features that RFECV selected
# new_features_lin = [name for name, keep in zip(features_lin, rfecv_lin.support_) if keep]

# print(new_features_lin)

## Final Train and Test Datasets after Feature Selection



In [None]:
#final train and test datasets for logistic regression model
# x_train_log = x_train_log[new_features_log]
# x_test_log = x_test_log[new_features_log]

# print(x_train_log.shape)
# print(x_test_log.shape)