# NFL Combine Classification Modeling

## Technical Notebook

## Project Goals

- Determine how much the NFL Combine influences a lineman prospect's draft status (getting drafted or not). "Lineman" covers both offensive and defensive linemen throughout.
- Reveal how much the NFL Combine factors into a lineman prospect's draft value (how early or how late a prospect gets drafted, if at all).
- Discover which NFL Combine drills have the most impact on a lineman prospect's draft position.

## Summary of Data

The dataset analyzed in this study contains 9,544 observations of NFL Combine and NFL Draft data from 1987 to 2017. The NFL Combine data primarily records how players performed in combine drills over that period. The NFL Draft data contains the draft pick information for players from the same span, including the round in which they were selected and the team that picked them.

### Library Import

In [1]:
#Import libraries
%run ../python_files/libraries
%matplotlib inline
# from libraries import *   #for use within .py file



### Data Import

In [2]:
#Import cleaned data from our exploratory data analysis
%run ../python_files/nfl_combine_eda

## Modeling

##### Pre-Modeling Techniques

- Scaling: We use StandardScaler to standardize our 'x' training and test datasets so that features measured in different units sit on a comparable scale and the model does not penalize some coefficients more than others purely because of their units.

- Resampling: We use SMOTE, which creates synthetic samples for the minority class. Oversampling the minority class this way gives the model a more balanced training set.
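
The two preprocessing steps above can be sketched as follows. This is a minimal illustration with NumPy on made-up numbers, not the notebook's own preprocessing: `standardize` mirrors what a standard scaler does (statistics learned from the training rows only), and `smote_like_samples` shows only the interpolation idea behind SMOTE. The function names, feature values, and labels are all hypothetical.

```python
import numpy as np

def standardize(train, test):
    """Scale both sets with the mean and std learned from the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbors (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = minority[rng.integers(len(minority))]
        dists = np.linalg.norm(minority - base, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the closest entry (the row itself)
        neighbor = minority[rng.choice(neighbors)]
        synthetic[i] = base + rng.random() * (neighbor - base)
    return synthetic

# Made-up feature values and labels, for illustration only.
x_train = np.array([[5.10, 25.0], [4.95, 30.0], [5.30, 22.0],
                    [5.00, 28.0], [5.20, 24.0], [4.90, 33.0]])
y_train = np.array([1, 1, 0, 1, 0, 1])
x_test = np.array([[5.05, 27.0]])

x_train_s, x_test_s = standardize(x_train, x_test)

# Oversample class 0 (the minority here) until both classes are the same size.
minority = x_train_s[y_train == 0]
n_needed = int((y_train == 1).sum() - (y_train == 0).sum())
new_rows = smote_like_samples(minority, n_needed, k=1)

x_balanced = np.vstack([x_train_s, new_rows])
y_balanced = np.concatenate([y_train, np.zeros(n_needed, dtype=int)])
print(x_balanced.shape, np.bincount(y_balanced))
```

Only the training rows are resampled; the test rows are scaled with the training statistics and otherwise left alone, so the evaluation still reflects the original class balance.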

##### Model Implementation and Model Performance

We used a scikit-learn Pipeline to implement 8 different model types:

- Logistic Regression
- KNN
- SVC
- NuSVC
- Decision Tree
- Random Forest
- Ada Boost
- Gradient Boosting

After running the pipeline, we review each model's performance using accuracy as our primary metric, together with a confusion matrix that shows correct versus incorrect predictions. In the notebook cell that follows, both are computed on the training data (x_train_ds, y_train_ds).
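
As a sketch of the review described above, the snippet below fits scaler-plus-classifier pipelines and reports accuracy and a confusion matrix on a held-out split as well as on the training rows. It uses synthetic data from `make_classification` and only two of the eight classifiers; the variable names, class weights, and 70/30 split are assumptions for illustration, not the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and binary labels (assumption).
X, y = make_classification(n_samples=500, n_features=6, weights=[0.35, 0.65],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

results = {}
for name, clf in [("logreg", LogisticRegression()),
                  ("knn", KNeighborsClassifier(n_neighbors=3))]:
    # The scaler is fit inside the pipeline, so it only ever sees training rows.
    pipe = Pipeline([("ss", StandardScaler()), ("classifier", clf)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    results[name] = {
        "train_acc": pipe.score(X_train, y_train),
        "test_acc": accuracy_score(y_test, y_pred),
        "conf_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name,
          "train accuracy: %.3f" % results[name]["train_acc"],
          "test accuracy: %.3f" % results[name]["test_acc"])
    print(results[name]["conf_matrix"], "\n")
```

Comparing `train_acc` with `test_acc` gives a rough sense of how much of a model's score comes from memorizing the rows it was fit on, which matters most for flexible models such as a 3-neighbor KNN or a deep tree.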

In [7]:
#Model Selection and Comparison

classifiers = [
    LogisticRegression(),
    KNeighborsClassifier(n_neighbors=3),
    SVC(),
    NuSVC(probability=True),
    DecisionTreeClassifier(max_depth=13),
    RandomForestClassifier(max_depth=13),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]
for classifier in classifiers:
    # Standardize the features, then fit the classifier
    pipe = Pipeline([
                     ('ss', StandardScaler()),
                     ('classifier', classifier)])
    pipe.fit(x_train_ds, y_train_ds)
    print(classifier, '\n')
    # Confusion matrix and accuracy below are computed on the training data
    conf_matrix = pd.DataFrame(confusion_matrix(y_train_ds, pipe.predict(x_train_ds)),
                           index = ['actual 0', 'actual 1'],
                           columns = ['predicted 0', 'predicted 1'])
    print(conf_matrix, '\n')
    print("model score: %.3f" % pipe.score(x_train_ds, y_train_ds), '\n')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False) 

          predicted 0  predicted 1
actual 0          251          562
actual 1          151         1514 

model score: 0.712 

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform') 

          predicted 0  predicted 1
actual 0          478          335
actual 1          150         1515 

model score: 0.804 

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol

### Logistic Regression Model



##### Model Implementation



In [None]:
# logit_model = sm.Logit(y_train_log, x_train_log)
# logit_result = logit_model.fit()
# print(logit_result.summary())

##### Model Fitting

We fit a logistic regression model on the training data set and then evaluate it against the test data set in the sections below. This helps us assess how well the model performs on data it has not seen.

In [None]:
# logreg_model = LogisticRegression()
# logreg_model.fit(x_train_log, y_train_log)

##### Predicting Test Set Results and Calculating Accuracy

Below, we evaluate the logistic regression model with several metrics: accuracy, a confusion matrix, a classification report, and a ROC curve. Each one compares the predictions of the model fit on the training data set with the actual labels in the test data set.
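
A minimal sketch of the four evaluation steps named above, run on synthetic data because the notebook's `x_test_log` and `y_test_log` are not reproduced here. All names below are placeholders. The AUC is computed from predicted probabilities, the same scores passed to `roc_curve`, rather than from hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, not the combine data).
X, y = make_classification(n_samples=600, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)        # rows: actual, columns: predicted
report = classification_report(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc_score = roc_auc_score(y_test, y_prob)
auc_from_curve = auc(fpr, tpr)               # area under the same curve

print("accuracy: %.3f" % accuracy)
print(cm)
print(report)
print("ROC AUC: %.3f" % auc_score)
```

Passing `y_pred` to `roc_auc_score` instead of `y_prob` would score a curve with a single threshold, which generally differs from the area under the full ROC curve.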

In [None]:
# y_pred_log = logreg_model.predict(x_test_log)
# print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg_model.score(x_test_log, y_test_log)))

With an accuracy of 78%, the model predicts the correct class for 78% of the observations in the test set, which is promising.

##### Confusion Matrix

The confusion matrix below shows 946 correct predictions (830 + 116 = 946) and 263 incorrect predictions (200 + 63 = 263). A ratio of approximately 3.6 correct predictions to every 1 incorrect prediction is a good sign.
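
The arithmetic above can be checked directly. The sketch below uses only the four counts quoted in the text; because the text does not say which count sits in which cell of the matrix, it derives only quantities that do not depend on that layout.

```python
# Counts quoted in the text above.
correct = 830 + 116
incorrect = 200 + 63
total = correct + incorrect

accuracy = correct / total     # share of all predictions that were correct
ratio = correct / incorrect    # correct predictions per incorrect prediction

print(f"correct={correct}, incorrect={incorrect}, total={total}")
print(f"accuracy={accuracy:.3f}")             # accuracy=0.782
print(f"correct per incorrect={ratio:.1f}")   # correct per incorrect=3.6
```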

In [None]:
# conf_matrix_log = confusion_matrix(y_test_log, y_pred_log)   #new name avoids shadowing the confusion_matrix function
# print(conf_matrix_log)

In [None]:
# ROC Curve
# Requires logreg_model from the Model Fitting cell above (currently commented out)

# Use predicted probabilities for both the AUC and the curve so the two agree
y_prob_log = logreg_model.predict_proba(x_test_log)[:,1]
logit_roc_auc = roc_auc_score(y_test_log, y_prob_log)
fpr, tpr, thresholds = roc_curve(y_test_log, y_prob_log)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')   #diagonal reference line for a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

## Results and Conclusions

