# NFL Combine Classification Modeling

## Exploratory Data Analysis

## Project Goals

- Determine how much the NFL Combine influences whether a prospect gets drafted.
- Determine how much the NFL Combine influences how early or late a prospect gets drafted.
- Identify which NFL Combine drills have the most impact on a prospect's draft status.

## Summary of Data

The dataset analyzed for this study contains 10,228 observations of NFL Combine data from 1987 to 2018.

### Library Import

In [1]:
#Import libraries
%run ../python_files/libraries

## Data Import and Data Examination

In [2]:
# import data
nfl_combine_df = pd.read_csv('../data/nfl_combine_data.csv')

# quick overview of the dataset
nfl_combine_df

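| | combine_year | player_name | college | position | height_inches | weight_lbs | hand_size_inches | arm_length_inches | wonderlic_score | 40_yard_dash | bench_press_reps | vertitcal_leap_inches | broad_jump_inches | 3_cone_drill | 20_yard_shuttle | 60_yard_shuttle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | Josh Adams | Notre Dame | RB | 74.0 | 213 | 9.25 | 33.75 | NaN | NaN | 18.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2018 | Ola Adeniyi | Toledo | DE | 74.0 | 248 | 9.63 | 31.75 | NaN | 4.83 | 26.0 | 31.5 | NaN | 7.21 | 4.28 | 12.79 |
| 2 | 2018 | Jordan Akins | Central Florida | TE | 75.0 | 249 | 9.50 | 32.50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2018 | Jaire Alexander | Louisville | CB | 71.0 | 192 | NaN | NaN | NaN | 4.38 | 14.0 | 35.0 | 127.0 | 6.71 | 3.98 | NaN |
| 4 | 2018 | Austin Allen | Arkansas | QB | 72.0 | 210 | 9.63 | 30.63 | NaN | 4.81 | NaN | 29.5 | 112.0 | 7.18 | 4.48 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10223 | 1987 | Rod Woodson | Purdue | CB | 72.0 | 202 | 10.50 | 31.00 | NaN | 4.33 | 10.0 | 36.0 | 125.0 | NaN | 3.98 | 10.92 |
| 10224 | 1987 | John Wooldridge | Ohio State | RB | 68.4 | 193 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10225 | 1987 | David Wyman | Stanford | ILB | 74.0 | 235 | 9.50 | 31.25 | NaN | 4.79 | 23.0 | 29.0 | 118.0 | NaN | 4.30 | 11.78 |
| 10226 | 1987 | Theo Young | Arkansas | TE | 74.0 | 231 | 9.00 | 34.00 | NaN | 4.89 | 9.0 | 30.0 | 107.0 | NaN | 4.20 | 11.71 |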


In [3]:
# quick review of the variables in the dataset
nfl_combine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10228 entries, 0 to 10227
Data columns (total 16 columns):
combine_year             10228 non-null int64
player_name              10228 non-null object
college                  10228 non-null object
position                 10228 non-null object
height_inches            10228 non-null float64
weight_lbs               10228 non-null int64
hand_size_inches         8621 non-null float64
arm_length_inches        8303 non-null float64
wonderlic_score          431 non-null float64
40_yard_dash             9292 non-null float64
bench_press_reps         6977 non-null float64
vertitcal_leap_inches    8258 non-null float64
broad_jump_inches        8107 non-null float64
3_cone_drill             4667 non-null float64
20_yard_shuttle          7333 non-null float64
60_yard_shuttle          3211 non-null float64
dtypes: float64(11), int64(2), object(3)
memory usage: 1.2+ MB
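
The column name `vertitcal_leap_inches` in the output above appears to be a misspelling of "vertical". The following is a minimal sketch of one way to normalize it after loading. The corrected name `vertical_leap_inches` is a hypothetical choice rather than something the notebook does, and `toy_df` is a small stand-in for `nfl_combine_df`.

```python
import pandas as pd

# Stand-in for nfl_combine_df, using two rows from the overview shown earlier
toy_df = pd.DataFrame({
    "player_name": ["Ola Adeniyi", "Jaire Alexander"],
    "vertitcal_leap_inches": [31.5, 35.0],
})

# Hypothetical corrected name; rename returns a new frame and leaves toy_df unchanged
renamed_df = toy_df.rename(columns={"vertitcal_leap_inches": "vertical_leap_inches"})
print(list(renamed_df.columns))
```

If the column were renamed this way, every later cell that refers to the original spelling would need the same change.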


In [4]:
# quick review of the characteristics of the variables in the dataset
nfl_combine_df.describe()

| | combine_year | height_inches | weight_lbs | hand_size_inches | arm_length_inches | wonderlic_score | 40_yard_dash | bench_press_reps | vertitcal_leap_inches | broad_jump_inches | 3_cone_drill | 20_yard_shuttle | 60_yard_shuttle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10228.0 | 10228.0 | 10228.0 | 8621.0 | 8303.0 | 431.0 | 9292.0 | 6977.0 | 8258.0 | 8107.0 | 4667.0 | 7333.0 | 3211.0 |
| mean | 2002.4386 | 73.741555 | 240.183125 | 9.531669 | 32.223198 | 24.044084 | 4.828435 | 19.822703 | 32.022463 | 112.403232 | 7.340456 | 4.403492 | 11.674363 |
| std | 21.930191 | 2.638082 | 44.937472 | 0.627339 | 1.493612 | 7.614576 | 0.309596 | 6.522123 | 4.202037 | 9.312732 | 0.445726 | 0.268127 | 0.425705 |
| min | 5.0 | 64.9 | 142.0 | 7.13 | 25.63 | 4.0 | 4.21 | 1.0 | 17.5 | 7.0 | 6.34 | 3.73 | 10.43 |
| 25% | 1994.0 | 71.9 | 203.0 | 9.13 | 31.25 | 19.0 | 4.59 | 15.0 | 29.0 | 106.0 | 7.0 | 4.2 | 11.38 |
| 50% | 2002.0 | 74.0 | 232.0 | 9.5 | 32.25 | 24.0 | 4.755 | 20.0 | 32.0 | 113.0 | 7.25 | 4.36 | 11.65 |
| 75% | 2012.0 | 75.75 | 275.0 | 10.0 | 33.25 | 29.0 | 5.04 | 24.0 | 35.0 | 119.0 | 7.62 | 4.57 | 11.93 |
| max | 2018.0 | 82.4 | 387.0 | 11.88 | 38.5 | 48.0 | 6.12 | 51.0 | 46.0 | 147.0 | 9.61 | 5.68 | 13.91 |
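
The summary reports a minimum `combine_year` of 5.0, far outside the 1987 to 2018 range, and a minimum `broad_jump_inches` of 7.0, which is well below the 25th percentile of 106.0. Both look like possible data-entry errors. The sketch below shows one way such rows might be flagged for review. It runs on a toy frame with made-up values, and the broad-jump floor of 60 inches is an assumed threshold, not one taken from the notebook.

```python
import pandas as pd

# Toy stand-in for nfl_combine_df; values are illustrative, not real records
toy_df = pd.DataFrame({
    "combine_year": [2018, 1987, 5, 2002],
    "broad_jump_inches": [127.0, 125.0, 113.0, 7.0],
})

# Year bounds come from the dataset description; the jump floor is an assumption
valid_year = toy_df["combine_year"].between(1987, 2018)
valid_jump = toy_df["broad_jump_inches"].isna() | (toy_df["broad_jump_inches"] >= 60)

# Rows failing either check are candidates for manual inspection
suspect_rows = toy_df[~(valid_year & valid_jump)]
print(suspect_rows)
```

Missing drill results are treated as valid here so that the check isolates implausible recorded values rather than absent ones.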


In [5]:
# check the number of NaN values in the dataset
nfl_combine_df.isna().sum()

combine_year                0
player_name                 0
college                     0
position                    0
height_inches               0
weight_lbs                  0
hand_size_inches         1607
arm_length_inches        1925
wonderlic_score          9797
40_yard_dash              936
bench_press_reps         3251
vertitcal_leap_inches    1970
broad_jump_inches        2121
3_cone_drill             5561
20_yard_shuttle          2895
60_yard_shuttle          7017
dtype: int64
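
Raw counts are easier to compare as a share of all rows. For instance, `wonderlic_score` is missing in 9,797 of 10,228 rows, or roughly 96%. The following sketch computes the missing share per column on a toy frame and applies a cutoff. The 50% cutoff and the name `max_missing` are hypothetical and serve only to illustrate the pattern.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nfl_combine_df with a fully missing column
toy_df = pd.DataFrame({
    "weight_lbs": [213, 248, 249, 192],
    "40_yard_dash": [np.nan, 4.83, np.nan, 4.38],
    "wonderlic_score": [np.nan, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values in each column
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: keep columns with at most 50% missing values
max_missing = 0.5
kept_columns = missing_share[missing_share <= max_missing].index.tolist()
print(kept_columns)
```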

## Data Cleaning, Data Transformations, and Data Exploration



In [None]:
# Create dummy values for the categorical variables

# auto_df['mstatus'] = auto_df['mstatus'].map({'Yes': 1, 'No': 0})
# auto_df['sex'] = auto_df['sex'].map({'M': 1, 'F': 0})
# auto_df['education'] = auto_df['education'].map({'<High School': 0, 'High School': 0, 'Bachelors': 1, 'Masters': 1, 'PhD': 1})
# auto_df['job'] = auto_df['job'].map({'Student': 1, 'Blue Collar': 0, 'Clerical': 0, 'Doctor': 0, 'Home Maker': 0, 'Lawyer': 0, 'Manager': 0, 'Professional': 0})
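
The commented-out mappings above refer to `auto_df` columns rather than the combine data. For the combine data, `position` is one of the categorical columns, and one possible way to encode it is one-hot encoding with `pd.get_dummies`. This is a sketch on a toy frame, not the notebook's method, and the `pos` prefix is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for nfl_combine_df, using names and positions from the overview
toy_df = pd.DataFrame({
    "player_name": ["Josh Adams", "Ola Adeniyi", "Jaire Alexander"],
    "position": ["RB", "DE", "CB"],
})

# One indicator column per position; dtype=int yields 0/1 instead of booleans
encoded_df = pd.get_dummies(toy_df, columns=["position"], prefix="pos", dtype=int)
print(encoded_df.columns.tolist())
```

For a linear model, passing `drop_first=True` would remove one indicator column and avoid perfectly collinear dummies.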

In [None]:
# Log Transformations for non-normalized variables. Then, drop the original variable from the dataset.

# def log_col(df, col):
#     '''Convert column to log values and
#     drop the original column
#     '''
#     df[f'{col}_log'] = np.log(df[col])
#     df.drop(col, axis=1, inplace=True)

# log_col(auto_df, 'tif')

In [None]:
# quick review of the characteristics of all variables in the dataset, 
# including the new dummy variables and log-transformed variables
# df.describe()

In [None]:
# Correlations between all variables in auto_df dataset
# df.corr(method = 'pearson')

In [None]:
#Correlation Heatmap of all variables in auto_df dataset

# mask = np.zeros_like(auto_df.corr())
# triangle_indices = np.triu_indices_from(mask)
# mask[triangle_indices] = True

# plt.figure(figsize=(35,30))
# ax = sns.heatmap(auto_df.corr(method='pearson'), cmap="coolwarm", mask=mask, annot=True, annot_kws={"size": 18}, square=True, linewidths=4)
# sns.set_style('white')
# plt.xticks(fontsize=14, rotation=45)
# plt.yticks(fontsize=14, rotation=0)
# bottom, top = ax.get_ylim()
# ax.set_ylim(bottom + 0.5, top - 0.5)
# plt.show()

## Initial Train and Test Dataset Creation



In [None]:
#Split auto_insurance_df into train and test datasets for our logistic and linear regression models

#train and test datasets for logistic regression model
# crash = auto_df['crash']
# features_log = auto_df.drop(['crash', 'crash_cost'], axis = 1)
# x_train_log, x_test_log, y_train_log, y_test_log = train_test_split(features_log, crash, test_size = 0.2, random_state = 10)
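
The commented-out split above targets the `crash` column of `auto_df`. A comparable split for the combine data might look like the sketch below, which assumes a hypothetical binary `drafted` label that is not present in the CSV shown earlier. All values in the toy frame are made up, and `stratify` is added to keep the class balance equal across the two sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values; `drafted` is a hypothetical 0/1 target column
toy_df = pd.DataFrame({
    "weight_lbs": [210, 245, 250, 190, 215, 200, 195, 235, 230, 240],
    "height_inches": [74.0, 74.5, 75.0, 71.0, 72.0, 72.5, 69.0, 74.0, 73.5, 73.0],
    "drafted": [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

features = toy_df.drop(columns=["drafted"])
target = toy_df["drafted"]

# 80/20 split; stratify preserves the drafted/undrafted ratio in both sets
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=10, stratify=target
)
print(x_train.shape, x_test.shape)
```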

## Feature Selection

For modeling purposes, we used recursive feature elimination with cross-validation for both our logistic regression model and our simple linear regression model. The procedure repeatedly drops the weakest variable and scores each candidate feature set by cross-validation, using accuracy as the metric for the logistic regression model and R² for the linear regression model. Variables that do not improve the score are dropped from the dataset prior to modeling.
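
The sketch below runs this kind of cross-validated elimination on synthetic data from `make_classification`, since the cleaned combine features are not yet defined at this point in the notebook. The sample size, fold count, and `max_iter` value are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 200 rows, 8 features, 3 of them informative
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=10
)

# Remove one feature per step and score each subset by cross-validated accuracy
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(5),
    scoring="accuracy",
)
selector.fit(X, y)

print(selector.n_features_)  # number of features kept
print(selector.support_)     # boolean mask over the original columns

# Reduce the feature matrix to the selected columns
X_selected = selector.transform(X)
```

With a DataFrame as input, the `support_` mask can be used to index the column names and recover which features were kept.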

### Recursive Feature Elimination for Logistic Regression Model

In [None]:
# logreg_model = LogisticRegression()
# rfecv_log = RFECV(estimator=logreg_model, step=1, cv=StratifiedKFold(10), scoring='accuracy')
# rfecv_log.fit(x_train_log, y_train_log)

In [None]:
# keep the names of the features that RFECV marked as selected
# new_features_log = [name for name, keep in zip(features_log, rfecv_log.support_) if keep]

# print(new_features_log)

### Recursive Feature Elimination for Linear Regression Model

In [None]:
# linreg_model = LinearRegression()
# rfecv_lin = RFECV(estimator=linreg_model, step=1, min_features_to_select = 1, scoring='r2')
# rfecv_lin.fit(x_train_lin, y_train_lin)

In [None]:
# keep the names of the features that RFECV marked as selected
# new_features_lin = [name for name, keep in zip(features_lin, rfecv_lin.support_) if keep]

# print(new_features_lin)

## Final Train and Test Datasets after Feature Selection



In [None]:
#final train and test datasets for logistic regression model
# x_train_log = x_train_log[new_features_log]
# x_test_log = x_test_log[new_features_log]

# print(x_train_log.shape)
# print(x_test_log.shape)