# NFL Combine Classification Modeling

## Technical Notebook

## Project Goals

- Determine the influence the NFL Combine has on a prospect's draft status (getting drafted or not).
- Reveal how much the NFL Combine factors in on a prospect's draft value (how early or how late a prospect gets drafted, if at all).
- Discover which NFL Combine drills have the most impact on a prospect's draft position.

## Summary of Data

The dataset analyzed for this study contains 9,972 observations of NFL Combine and NFL Draft data from 1987 to 2017.

### Library Import

In [1]:
#Import libraries
%run ../python_files/libraries

### Data Import

In [2]:
#Import cleaned data from our exploratory data analysis
%run ../python_files/nfl_combine_eda

In [3]:
nfl_combine_df

Unnamed: 0,combine_year,player_name,college,position,height_inches,weight_lbs,hand_size_inches,arm_length_inches,wonderlic_score,40_yard_dash,bench_press_reps,vertitcal_leap_inches,broad_jump_inches,3_cone_drill,20_yard_shuttle,60_yard_shuttle
0,2018,Josh Adams,Notre Dame,RB,74.0,213,9.25,33.75,,,18.0,,,,,
1,2018,Ola Adeniyi,Toledo,DE,74.0,248,9.63,31.75,,4.83,26.0,31.5,,7.21,4.28,12.79
2,2018,Jordan Akins,Central Florida,TE,75.0,249,9.50,32.50,,,,,,,,
3,2018,Jaire Alexander,Louisville,CB,71.0,192,,,,4.38,14.0,35.0,127.0,6.71,3.98,
4,2018,Austin Allen,Arkansas,QB,72.0,210,9.63,30.63,,4.81,,29.5,112.0,7.18,4.48,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10223,1987,Rod Woodson,Purdue,CB,72.0,202,10.50,31.00,,4.33,10.0,36.0,125.0,,3.98,10.92
10224,1987,John Wooldridge,Ohio State,RB,68.4,193,,,,,,,,,,
10225,1987,David Wyman,Stanford,ILB,74.0,235,9.50,31.25,,4.79,23.0,29.0,118.0,,4.30,11.78
10226,1987,Theo Young,Arkansas,TE,74.0,231,9.00,34.00,,4.89,9.0,30.0,107.0,,4.20,11.71


## Modeling

### Logistic Regression Model



##### Model Implementation



In [None]:
# logit_model = sm.Logit(y_train_log, x_train_log)
# logit_result = logit_model.fit()
# print(logit_result.summary())

##### Model Fitting

Below, we fit a scikit-learn LogisticRegression model on the training set. The fitted model is then scored against the held-out test set to evaluate its performance.

In [None]:
# logreg_model = LogisticRegression()
# logreg_model.fit(x_train_log, y_train_log)

##### Predicting Test Set Results and Calculating Accuracy

Below, we evaluate the logistic regression model with several metrics: accuracy, a confusion matrix, a classification report, and a ROC curve. Each is computed on the test set using the model fit on the training set.

In [None]:
# y_pred_log = logreg_model.predict(x_test_log)
# print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg_model.score(x_test_log, y_test_log)))

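The accuracy score shows that the model classifies approximately 78% of the test-set observations correctly, which is promising. Note that accuracy is the share of correct predictions, not the share of variability explained by the feature variables.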
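The split that produces `x_train_log`, `x_test_log`, `y_train_log`, and `y_test_log` is not shown in this notebook. The following is a minimal sketch of one way those names could be created and scored, assuming a 75/25 stratified split. The data is synthetic, and the two features and the `drafted` label are hypothetical stand-ins rather than the notebook's actual columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two drill-like features and a binary label.
rng = np.random.default_rng(0)
n = 400
forty = rng.normal(4.8, 0.3, n)   # stand-in for a 40-yard-dash time
bench = rng.normal(20, 6, n)      # stand-in for bench-press reps
X = np.column_stack([forty, bench])

# Made-up rule: faster times and more reps raise the probability of being drafted.
logits = -3.0 * (forty - 4.8) + 0.1 * (bench - 20)
drafted = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Assumed 75/25 split, stratified so both classes appear in each subset.
x_train_log, x_test_log, y_train_log, y_test_log = train_test_split(
    X, drafted, test_size=0.25, random_state=42, stratify=drafted
)

logreg_model = LogisticRegression()
logreg_model.fit(x_train_log, y_train_log)

# Accuracy is the fraction of test observations predicted correctly.
y_pred_log = logreg_model.predict(x_test_log)
accuracy = (y_pred_log == y_test_log).mean()
print(f"Accuracy on test set: {accuracy:.2f}")
```

The manually computed `accuracy` is the same quantity that `logreg_model.score(x_test_log, y_test_log)` returns for a classifier.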

##### Confusion Matrix

The confusion matrix below shows 946 correct predictions (830 + 116) and 263 incorrect predictions (200 + 63). That is a ratio of approximately 3.6 correct predictions for every incorrect one, which is a good sign.

In [None]:
# Store the result under a new name so the imported confusion_matrix function is not overwritten
# cm = confusion_matrix(y_test_log, y_pred_log)
# print(cm)
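The arithmetic behind the confusion matrix discussion can be checked directly. This short sketch uses only the four counts quoted in the text. It assumes nothing about which count belongs to which class, since accuracy depends only on the split between diagonal (correct) and off-diagonal (incorrect) cells.

```python
# Counts quoted in the text: correct predictions sit on the diagonal of the
# 2x2 confusion matrix, incorrect predictions sit off the diagonal.
correct_counts = [830, 116]
incorrect_counts = [200, 63]

n_correct = sum(correct_counts)       # 946
n_incorrect = sum(incorrect_counts)   # 263
n_total = n_correct + n_incorrect     # 1209

accuracy = n_correct / n_total        # share of all predictions that are correct
ratio = n_correct / n_incorrect       # correct predictions per incorrect prediction

print(f"accuracy = {accuracy:.4f}")
print(f"correct : incorrect = {ratio:.2f} : 1")
```

The resulting accuracy of roughly 0.78 agrees with the accuracy score reported for the test set.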

##### Interpretation of Results

The classification report and ROC curve below break the test-set performance down further. The report gives precision, recall, and F1-score for each class, and the ROC curve plots the true positive rate against the false positive rate across classification thresholds.

In [None]:
# print(classification_report(y_test_log, y_pred_log))

In [None]:
# ROC Curve

# Compute the AUC from predicted probabilities so it matches the curve plotted below
# logit_roc_auc = roc_auc_score(y_test_log, logreg_model.predict_proba(x_test_log)[:,1])
# fpr, tpr, thresholds = roc_curve(y_test_log, logreg_model.predict_proba(x_test_log)[:,1])
# plt.figure()
# plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
# plt.plot([0, 1], [0, 1],'r--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver operating characteristic')
# plt.legend(loc="lower right")
# plt.show()
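The area under a ROC curve is defined over scores such as `predict_proba(...)[:, 1]`. Feeding it hard class labels from `predict` generally gives a different number than the area under the plotted curve. The sketch below illustrates this with a from-scratch pairwise AUC on made-up data. The helper `pairwise_auc` is hypothetical and written for illustration; it is not scikit-learn's implementation, although it computes the same rank-based quantity.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.10, 0.40, 0.35, 0.80, 0.45, 0.60]

# Hard labels obtained by thresholding the probabilities at 0.5.
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)

print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

Thresholding throws away the ranking information inside each predicted class, which is why the two values differ here.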

### Linear Regression Model



##### Model Implementation



In [None]:
# linear_model = sm.OLS(y_train_lin, x_train_lin)
# linear_result = linear_model.fit()
# print(linear_result.summary())

##### Model Fitting

Below, we fit a scikit-learn LinearRegression model on the training set. The fitted model is then scored against the held-out test set to evaluate its performance.

In [None]:
# linreg_model = LinearRegression()
# linreg_model.fit(x_train_lin, y_train_lin)

##### Model Results

Below, we evaluate the linear regression model with several metrics: the r-squared value, the root mean squared error (RMSE), the mean absolute error (MAE), and a plot of the model residuals. The numeric metrics are computed on the test set using the model fit on the training set.

In [None]:
#Calculate r-squared value

# y_pred_lin = linreg_model.predict(x_test_lin)
# print('Linear Regression R squared: %.4f' % linreg_model.score(x_test_lin, y_test_lin))

The r-squared value shows that only approximately 1% of the variability in the target variable can be explained by our feature variables, which is very low.
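For reference, r-squared is the coefficient of determination, 1 - SS_res / SS_tot, which is also what the `score` method of a scikit-learn regressor reports. The sketch below computes it by hand on made-up numbers. The helper `r_squared` and the data are hypothetical and only illustrate what values near 1 and near 0 mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Made-up actual values and two sets of predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
close_pred = [12.0, 18.0, 31.0, 39.0]   # tracks the actual values closely
mean_pred = [25.0, 25.0, 25.0, 25.0]    # always predicts the mean of y_true

r2_close = r_squared(y_true, close_pred)
r2_mean = r_squared(y_true, mean_pred)

print(f"R squared, close predictions: {r2_close:.2f}")
print(f"R squared, mean-only predictions: {r2_mean:.2f}")
```

An r-squared near 0 therefore means the model does little better than predicting the mean of the target for every observation.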

In [None]:
#Calculate root mean squared error (RMSE) value

# mse_lin = mean_squared_error(y_test_lin, y_pred_lin)  # scikit-learn convention: (y_true, y_pred)
# rmse_lin = np.sqrt(mse_lin)
# print('Linear Regression RMSE: %.4f' % rmse_lin)

The root mean squared error (RMSE) of approximately 9,867 measures the typical size of a prediction error on the test set, in the units of the target variable. It is not a bound on every prediction: individual errors can be larger.

In [None]:
#Calculate mean absolute error (MAE) value

# mae_lin = mean_absolute_error(y_test_lin, y_pred_lin)  # scikit-learn convention: (y_true, y_pred)
# print('Linear Regression MAE: %.4f' % mae_lin)

The mean absolute error (MAE), the average absolute difference between predicted and actual values on the test set, was also concerning.
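To make the two error metrics concrete, the sketch below computes MAE and RMSE by hand on made-up numbers. The values are hypothetical and unrelated to the notebook's data. It shows that RMSE is never smaller than MAE and that a single prediction can miss by more than the RMSE.

```python
import math

# Made-up actual and predicted values.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 205.0, 210.0]

errors = [p - t for t, p in zip(y_true, y_pred)]

# MAE: average size of the errors, ignoring their sign.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error, so large misses weigh more.
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

# Largest single miss, for comparison with the two summary metrics.
max_abs_error = max(abs(e) for e in errors)

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"Largest single error = {max_abs_error:.2f}")
```

A large gap between RMSE and MAE is a hint that a few predictions are far off, which a residuals plot can confirm.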

In [None]:
#Plot of Residuals

# visualizer = ResidualsPlot(linreg_model)
# visualizer.fit(x_train_lin, y_train_lin)  # Fit the training data to the visualizer
# visualizer.score(x_test_lin, y_test_lin)  # Evaluate the model on the test data
# visualizer.show()                         # Finalize and render the plot of residuals

The residuals plot above shows a poor fit between predicted and actual values, which is further evidence that our model is not accurate.

## Results and Conclusions

