## Combine Analysis: Defensive Ends

Which Combine tests have the most potential influence on a player's chances of being drafted and on their draft position?

Our training dataset is combine data from 2010-2020, and our testing dataset covers 2021-2024.

In [1]:
import pandas as pd

# Path relative to notebook location (DE_similarity_scores_project/) - data is in project root
de_data = pd.read_csv('../data/processed/de_training_data.csv')
print(de_data.columns)
# Convert Height from feet-inches to inches
de_data['Height'] = de_data['Height'].str.split('-').str[0].astype(int) * 12 + de_data['Height'].str.split('-').str[1].astype(int)

# Examine every column in the dataset and its correlation with the Drafted column 
de_data_just_numeric = de_data.select_dtypes(include=['number'])
de_data_just_numeric['Drafted'] = de_data['Drafted']
print(de_data_just_numeric.corr()['Drafted'].sort_values(ascending=False))


Index(['Year', 'Player', 'Pos', 'School', 'Height', 'Weight', '40yd',
       'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle', 'Drafted',
       'Round', 'Pick', 'Sacks_cumulative', 'TFL_cumulative',
       'QB_Hurry_cumulative', 'Sacks_final_season', 'TFL_final_season',
       'QB_Hurry_final_season'],
      dtype='object')
Drafted                  1.000000
Broad Jump               0.319828
Vertical                 0.249153
QB_Hurry_final_season    0.213194
TFL_final_season         0.190802
Sacks_final_season       0.175748
Bench                    0.153866
QB_Hurry_cumulative      0.143173
Height                   0.133169
Sacks_cumulative         0.114300
TFL_cumulative           0.106476
Weight                   0.047001
Year                    -0.089961
Shuttle                 -0.223542
3Cone                   -0.223938
40yd                    -0.306137
Round                         NaN
Pick                          NaN
Name: Drafted, dtype: float64
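
Round and Pick show NaN in the output above. One likely explanation, assuming undrafted players have no Round or Pick recorded (an assumption about the dataset, not something confirmed here): pandas computes each pairwise correlation only over rows where both columns are present, and on those rows Drafted is constant, so the coefficient is undefined. A toy sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Pick is missing for the two undrafted players.
toy = pd.DataFrame({
    "Drafted": [1, 1, 1, 0, 0],
    "Pick": [14.0, 37.0, 101.0, np.nan, np.nan],
    "40yd": [4.62, 4.68, 4.59, 4.80, 4.85],
})

corr_with_drafted = toy.corr()["Drafted"]
print(corr_with_drafted)

# Pick is only observed where Drafted == 1, so Drafted has zero variance
# on the overlapping rows and the coefficient comes out as NaN.
pick_corr_is_nan = bool(np.isnan(corr_with_drafted["Pick"]))
```

If that is what is happening, the NaN values are expected and do not indicate a problem with the Round or Pick columns themselves.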


For context, a positive correlation means that higher values are associated with being drafted, and a negative correlation means that lower values are (for timed drills such as the 40-yard dash, lower is faster). With that said, the combine measurements most strongly correlated with **being drafted** are:

1. Broad Jump: 0.320
2. 40yd: -0.306
3. Vertical: 0.249
4. 3Cone: -0.224
5. Shuttle: -0.224

and the college production stats most strongly correlated with **being drafted** are:

1. QB_Hurry_final_season    0.213194
2. TFL_final_season         0.190802
3. Sacks_final_season       0.175748
4. QB_Hurry_cumulative      0.143173
5. Sacks_cumulative         0.114300
6. TFL_cumulative           0.106476

Anything far below an absolute correlation of 0.20 is likely too weak to be useful in a model.
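
As a rough illustration of that cutoff, the sketch below filters the correlations printed in the output above by absolute value. Treating 0.20 as a hard threshold is a simplification of "too far below", and `Year` is left out because it is not a player measurement; both choices are assumptions made for this example.

```python
# Correlations with Drafted, copied from the output above.
drafted_corr = {
    "Broad Jump": 0.319828,
    "Vertical": 0.249153,
    "QB_Hurry_final_season": 0.213194,
    "TFL_final_season": 0.190802,
    "Sacks_final_season": 0.175748,
    "Bench": 0.153866,
    "QB_Hurry_cumulative": 0.143173,
    "Height": 0.133169,
    "Sacks_cumulative": 0.114300,
    "TFL_cumulative": 0.106476,
    "Weight": 0.047001,
    "Shuttle": -0.223542,
    "3Cone": -0.223938,
    "40yd": -0.306137,
}

THRESHOLD = 0.20  # hard cutoff assumed for illustration

# Keep features whose absolute correlation meets the cutoff, strongest first.
strong = sorted(
    (name for name, r in drafted_corr.items() if abs(r) >= THRESHOLD),
    key=lambda name: abs(drafted_corr[name]),
    reverse=True,
)
print(strong)
```

Sorting by absolute value puts the timed drills (negative correlations) on the same footing as the jumps (positive correlations).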

In [2]:
# Examine every numeric column in the dataset and its correlation with the Pick column
# A lower pick number is better (earlier selection)
de_data_just_numeric = de_data.select_dtypes(include=['number'])
de_data_just_numeric['Pick'] = de_data['Pick']
print(de_data_just_numeric.corr()['Pick'].sort_values(ascending=False))


Pick                     1.000000
Round                    0.988563
40yd                     0.275288
Shuttle                  0.184148
3Cone                    0.162359
Year                    -0.047675
Bench                   -0.057544
Weight                  -0.124414
Vertical                -0.127590
Height                  -0.160059
Broad Jump              -0.217950
TFL_cumulative          -0.224798
QB_Hurry_final_season   -0.263154
QB_Hurry_cumulative     -0.294456
Sacks_cumulative        -0.324386
TFL_final_season        -0.413254
Sacks_final_season      -0.418762
Name: Pick, dtype: float64


With that said, the combine measurements most strongly correlated with **Draft Position** are:

1. 40yd: 0.275
2. Broad Jump: -0.218
3. Shuttle: 0.184
4. 3Cone: 0.162

The signs can be hard to read: a lower pick number means an earlier selection, so a positive correlation means that higher values go with later picks. A slower 40-yard dash time and a shorter broad jump are correlated with a later draft pick, so we are looking for faster 40-yard dash times and longer broad jumps.

The final-season production stats most strongly correlated with **Draft Position** are:

1. Sacks_final_season      -0.418762
2. TFL_final_season        -0.413254
3. QB_Hurry_final_season   -0.263154

## Looking to Model

When we build machine learning models, there are three tasks we would like to accomplish. The first two can use our current datasets of combine and college data. The third will require our defensive ends' stats from their first four NFL seasons.

1. KNN Player Comps
2. Projected Draft Position/Round (KNN)
3. Projected NFL Ability/Value (TBD)

## KNN Player Comps

We would like to use the KNN "machine learning model" to determine which players are most similar to one another. This lets us offer NFL player comparisons for upcoming draft prospects.
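
Because KNN compares players by distance, the units of each feature matter. The sketch below uses a handful of hypothetical players (made-up Weight and 40yd values, not rows from the dataset) to show how the closest match can change once each column is standardized to mean 0 and standard deviation 1, which is the same population-standard-deviation scaling that `StandardScaler` applies.

```python
import numpy as np

# Hypothetical players: columns are Weight (lb) and 40yd (s). Row 0 is the query.
pool = np.array([
    [260.0, 4.60],  # query player
    [262.0, 5.00],  # similar weight, much slower
    [268.0, 4.62],  # a bit heavier, nearly identical speed
    [240.0, 4.50],
    [285.0, 4.90],
    [250.0, 4.70],
])

def nearest_to_first(rows: np.ndarray) -> int:
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(rows[1:] - rows[0], axis=1)
    return int(np.argmin(dists)) + 1

# Standardize each column to mean 0 and standard deviation 1.
scaled = (pool - pool.mean(axis=0)) / pool.std(axis=0)

raw_match = nearest_to_first(pool)       # weight in pounds dominates the distance
scaled_match = nearest_to_first(scaled)  # both features contribute comparably
print(raw_match, scaled_match)
```

On the raw values the much slower player at a similar weight is the closest match; after scaling, the player with nearly identical speed is.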

In [3]:
# KNN Player Similarity - Find the K most similar players given an example

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from IPython.display import display

# Combine training + testing data for KNN (full 2010-2023 player pool)
de_knn = pd.concat([
    pd.read_csv('../data/processed/de_training_data.csv'),
    pd.read_csv('../data/processed/de_testing_data.csv')
], ignore_index=True)
de_knn['Height'] = de_knn['Height'].str.split('-').str[0].astype(int) * 12 + de_knn['Height'].str.split('-').str[1].astype(int)

# Features for similarity (combine + college stats)
FEATURE_COLS = ['Height', 'Weight', '40yd', 'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle',
                'Sacks_final_season', 'TFL_final_season', 'QB_Hurry_final_season']

# Prepare features: select and impute
X = de_knn[FEATURE_COLS].copy()
X = X.fillna(X.median())  # Impute NaNs with column median

# Scale features (important for KNN distance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit KNN - we use k+1 because the nearest "neighbor" will be the player themselves
K = 5  # Number of similar players to return
nn = NearestNeighbors(n_neighbors=K + 1, metric='euclidean')
nn.fit(X_scaled)

def find_similar_players(player_or_profile, k=5):
    """
    Given a player name, row index, or dict of feature values, return the K most similar players.
    
    For dict input, use keys from FEATURE_COLS. Height can be "6-7" or inches. Missing keys filled with median.
    Example: {'Height': '6-7', 'Weight': 260, '40yd': 4.74, 'Vertical': 36, 'Broad Jump': 117,
              'Sacks_final_season': 14, 'TFL_final_season': 16.5, 'QB_Hurry_final_season': 12}
    """
    if isinstance(player_or_profile, dict):
        # Build feature vector from dict
        row = []
        for col in FEATURE_COLS:
            val = player_or_profile.get(col, np.nan)
            if val is None:
                val = np.nan
            if col == 'Height' and isinstance(val, str) and '-' in str(val):
                parts = str(val).split('-')
                val = int(parts[0]) * 12 + int(parts[1])
            row.append(val)
        x = np.array(row, dtype=float).reshape(1, -1)
        x = np.where(np.isnan(x), X.median().values, x)
        x_scaled = scaler.transform(x)
        distances, indices = nn.kneighbors(x_scaled, n_neighbors=k)
        similar = de_knn.iloc[indices[0]].copy()
        similar['Similarity_Distance'] = distances[0]
    elif isinstance(player_or_profile, str):
        mask = de_knn['Player'] == player_or_profile
        if not mask.any():
            return f"Player '{player_or_profile}' not found in dataset."
        idx = de_knn[mask].index[0]
        distances, indices = nn.kneighbors(X_scaled[idx:idx+1], n_neighbors=k + 1)
        similar = de_knn.iloc[indices[0][1:]].copy()
        similar['Similarity_Distance'] = distances[0][1:]
    else:
        idx = player_or_profile
        distances, indices = nn.kneighbors(X_scaled[idx:idx+1], n_neighbors=k + 1)
        similar = de_knn.iloc[indices[0][1:]].copy()
        similar['Similarity_Distance'] = distances[0][1:]
    
    return similar[['Player', 'Year', 'School', 'Height', 'Weight', 'Round', 'Pick', '40yd', 'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle',
                'Sacks_final_season', 'TFL_final_season', 'QB_Hurry_final_season', 'Similarity_Distance']]

# Example profiles: use a dict of features to find comps (or pass a player name like "Aidan Hutchinson")
# Keys: Height, Weight, 40yd, Vertical, Bench, Broad Jump, 3Cone, Shuttle, Sacks_final_season, TFL_final_season, QB_Hurry_final_season
example_profiles = [
    # ("Aidan Hutchinson (2022)", {
    #     'Height': '6-7', 'Weight': 260, '40yd': 4.74, 'Vertical': 36, 'Bench': None, 'Broad Jump': 117,
    #     '3Cone': 6.73, 'Shuttle': 4.15,
    #     'Sacks_final_season': 14, 'TFL_final_season': 16.5, 'QB_Hurry_final_season': 12
    # }),
    # ("Arvell Reese — Ohio State (2026)", {
    #     'Height': '6-4', 'Weight': 243, '40yd': 4.52, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 6.5, 'TFL_final_season': 10, 'QB_Hurry_final_season': 5
    # }),
    # ("Reuben Bain — Miami (2026)", {
    #     'Height': '6-3', 'Weight': 270, '40yd': 4.72, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 9.5, 'TFL_final_season': 15.5, 'QB_Hurry_final_season': 5
    # }),
    # ("David Bailey — Texas Tech (2026)", {
    #     'Height': '6-3', 'Weight': 250, '40yd': 4.52, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 13.5, 'TFL_final_season': 19.5, 'QB_Hurry_final_season': 13
    # }),
    # ("Keldric Faulk — Auburn (2026)", {
    #     'Height': '6-6', 'Weight': 285, '40yd': 4.75, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 2, 'TFL_final_season': 5, 'QB_Hurry_final_season': 6
    # }),
    ("TJ Parker — Clemson (2026)", {
        'Height': '6-3', 'Weight': 260, '40yd': 4.65, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
        '3Cone': None, 'Shuttle': None,
        'Sacks_final_season': 5, 'TFL_final_season': 9.5, 'QB_Hurry_final_season': 7
    }),
]

# Run similarity for one profile (index 0–5 applies when all six profiles above are uncommented)
example_name, example_profile = example_profiles[0]
print(f"Top {K} players most similar to {example_name}:\n")
result = find_similar_players(example_profile, k=K)
display(result)

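Top 5 players most similar to TJ Parker — Clemson (2026):

|     | Player | Year | School | Height | Weight | Round | Pick | 40yd | Vertical | Bench | Broad Jump | 3Cone | Shuttle | Sacks_final_season | TFL_final_season | QB_Hurry_final_season | Similarity_Distance |
|-----|--------|------|--------|--------|--------|-------|------|------|----------|-------|------------|-------|---------|--------------------|------------------|-----------------------|---------------------|
| 324 | Ikenna Enechukwu | 2023 | Rice | 76 | 264.0 | NaN | NaN | 4.7 | 31.5 | NaN | 120.0 | NaN | NaN | 3.5 | 8.5 | 6.0 | 1.405808 |
| 44 | Jabaal Sheard | 2011 | Pittsburgh | 75 | 264.0 | 2.0 | 37.0 | 4.68 | 31.0 | NaN | 115.0 | NaN | NaN | NaN | NaN | NaN | 1.436639 |
| 65 | Donte Paige-Moss | 2012 | North Carolina | 75 | 268.0 | NaN | NaN | 4.67 | NaN | 26.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.488344 |
| 175 | Charles Tapper | 2016 | Oklahoma | 75 | 271.0 | 4.0 | 101.0 | 4.59 | 34.0 | 23.0 | 119.0 | NaN | NaN | NaN | NaN | NaN | 1.50852 |
| 41 | Robert Quinn | 2011 | North Carolina | 76 | 265.0 | 1.0 | 14.0 | 4.62 | 34.0 | 22.0 | 116.0 | 7.13 | 4.4 | NaN | NaN | NaN | 1.554925 |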


## KNN Round Projection

We can use the same KNN approach, restricted to our most important variables, to group players by the draft day (Undrafted, Day 1, Day 2, or Day 3) on which they will be selected. Here we fit the model on the training data and evaluate it on the testing data to see how well it does.

We use Height, Weight, 40yd, Broad Jump, 3Cone, Shuttle, TFL_final_season, Sacks_final_season, and QB_Hurry_final_season as the model features.

In [4]:
# KNN Round Projection - Train on 2010-2020, test on 2021-2023

from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load and prepare data
train_df = pd.read_csv('../data/processed/de_training_data.csv')
test_df = pd.read_csv('../data/processed/de_testing_data.csv')

# Convert Height to inches
for df in [train_df, test_df]:
    df['Height'] = df['Height'].str.split('-').str[0].astype(int) * 12 + df['Height'].str.split('-').str[1].astype(int)

# Features for round prediction (combine + college stats)
ROUND_FEATURES = ['Height', 'Weight', '40yd', 'Broad Jump', '3Cone', 'Shuttle',
                  'TFL_final_season', 'Sacks_final_season', 'QB_Hurry_final_season']

# Map Round to draft day categories: Undrafted, Day1 (R1), Day2 (R2-3), Day3 (R4-7)
def round_to_day(r):
    if pd.isna(r) or r == 0:
        return 0  # Undrafted
    elif r == 1:
        return 1  # Day 1 - Round 1
    elif r in (2, 3):
        return 2  # Day 2 - Rounds 2-3
    else:
        return 3  # Day 3 - Rounds 4-7

DAY_LABELS = ['Undrafted', 'Day 1 (R1)', 'Day 2 (R2-3)', 'Day 3 (R4-7)']

# Use all rows - KNN imputation fills missing values
X_train = train_df[ROUND_FEATURES].copy()
X_test = test_df[ROUND_FEATURES].copy()
y_train = train_df['Round'].apply(round_to_day)
y_test = test_df['Round'].apply(round_to_day)

# KNN imputation: fill NaN with values from K nearest neighbors
imputer = KNNImputer(n_neighbors=5)
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
print(f"Training: {len(X_train)} players | Testing: {len(X_test)} players (KNN imputation for missing values)\n")

# Scale features
scaler_round = StandardScaler()
X_train_scaled = scaler_round.fit_transform(X_train)
X_test_scaled = scaler_round.transform(X_test)

# Fit KNN classifier
k = 4  # Number of neighbors
knn_round = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
knn_round.fit(X_train_scaled, y_train)

# Predict on test set
y_pred = knn_round.predict(X_test_scaled)

# Evaluate
print(f"KNN Round Projection (k={k}) - 4 classes: Undrafted, Day1, Day2, Day3")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2%}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=DAY_LABELS, zero_division=0))
print("Confusion Matrix (rows=actual, cols=predicted):")
print(pd.DataFrame(confusion_matrix(y_test, y_pred), index=DAY_LABELS, columns=DAY_LABELS))

Training: 268 players | Testing: 93 players (KNN imputation for missing values)

KNN Round Projection (k=4) - 4 classes: Undrafted, Day1, Day2, Day3
Test Accuracy: 41.94%

Classification Report:
              precision    recall  f1-score   support

   Undrafted       0.50      0.44      0.47        32
  Day 1 (R1)       0.40      0.36      0.38        11
Day 2 (R2-3)       0.30      0.26      0.28        23
Day 3 (R4-7)       0.43      0.56      0.48        27

    accuracy                           0.42        93
   macro avg       0.41      0.40      0.40        93
weighted avg       0.42      0.42      0.42        93

Confusion Matrix (rows=actual, cols=predicted):
              Undrafted  Day 1 (R1)  Day 2 (R2-3)  Day 3 (R4-7)
Undrafted            14           1             9             8
Day 1 (R1)            2           4             2             3
Day 2 (R2-3)          5           3             6             9
Day 3 (R4-7)          7           2             3            15
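
To help read these results, the sketch below recomputes accuracy, recall, and precision directly from the confusion matrix printed above. It also compares accuracy against a rough reference point: always predicting the largest class in the test set. That baseline is an addition for context, not part of the original evaluation.

```python
# Confusion matrix copied from the output above (rows = actual, cols = predicted).
labels = ["Undrafted", "Day 1 (R1)", "Day 2 (R2-3)", "Day 3 (R4-7)"]
cm = [
    [14, 1, 9, 8],
    [2, 4, 2, 3],
    [5, 3, 6, 9],
    [7, 2, 3, 15],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall: share of each actual class that was predicted correctly (row-wise).
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(cm))}
# Precision: share of each predicted class that was actually correct (column-wise).
precision = {labels[j]: cm[j][j] / sum(row[j] for row in cm) for j in range(len(cm))}

# Rough baseline: always predict the most common actual class in the test set.
baseline = max(sum(row) for row in cm) / total

print(f"accuracy={accuracy:.2%}  majority-class baseline={baseline:.2%}")
```

The diagonal holds 39 of 93 correct predictions, which reproduces the reported 41.94% accuracy, against roughly 34% for always guessing Undrafted.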


In [5]:
import numpy as np

def predict_draft_position(features: dict) -> str:
    """
    Predict draft day for a player given a dict of features.
    
    Keys: Height, Weight, 40yd, Broad Jump, 3Cone, Shuttle,
          TFL_final_season, Sacks_final_season, QB_Hurry_final_season
    Height can be "6-7" or inches. Missing keys filled via KNN imputation.
    
    Returns: 'Undrafted', 'Day 1 (R1)', 'Day 2 (R2-3)', or 'Day 3 (R4-7)'
    """
    row = []
    for col in ROUND_FEATURES:
        val = features.get(col, np.nan)
        if val is None:
            val = np.nan
        if col == 'Height' and isinstance(val, str) and '-' in str(val):
            parts = str(val).split('-')
            val = int(parts[0]) * 12 + int(parts[1])
        row.append(val)
    
    x = np.array(row, dtype=float).reshape(1, -1)
    x = imputer.transform(x)  # Fill NaNs using training neighbors
    x_scaled = scaler_round.transform(x)
    pred = knn_round.predict(x_scaled)[0]
    return DAY_LABELS[pred]

# Same example profiles as Find Similar Players (index 0–5 applies when all six profiles are uncommented)
example_profiles = [
    # ("Aidan Hutchinson (2022)", {
    #     'Height': '6-7', 'Weight': 260, '40yd': 4.74, 'Vertical': 36, 'Bench': None, 'Broad Jump': 117,
    #     '3Cone': 6.73, 'Shuttle': 4.15,
    #     'Sacks_final_season': 14, 'TFL_final_season': 16.5, 'QB_Hurry_final_season': 12
    # }),
    # ("Arvell Reese — Ohio State (2026)", {
    #     'Height': '6-4', 'Weight': 243, '40yd': 4.52, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 6.5, 'TFL_final_season': 10, 'QB_Hurry_final_season': 5
    # }),
    # ("Reuben Bain — Miami (2026)", {
    #     'Height': '6-3', 'Weight': 270, '40yd': 4.72, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 9.5, 'TFL_final_season': 15.5, 'QB_Hurry_final_season': 5
    # }),
    ("David Bailey — Texas Tech (2026)", {
        'Height': '6-3', 'Weight': 250, '40yd': 4.52, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
        '3Cone': None, 'Shuttle': None,
        'Sacks_final_season': 13.5, 'TFL_final_season': 19.5, 'QB_Hurry_final_season': 13
    }),
    # ("Keldric Faulk — Auburn (2026)", {
    #     'Height': '6-6', 'Weight': 285, '40yd': 4.75, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 2, 'TFL_final_season': 5, 'QB_Hurry_final_season': 6
    # }),
    # ("TJ Parker — Clemson (2026)", {
    #     'Height': '6-3', 'Weight': 260, '40yd': 4.65, 'Vertical': None, 'Bench': None, 'Broad Jump': None,
    #     '3Cone': None, 'Shuttle': None,
    #     'Sacks_final_season': 5, 'TFL_final_season': 9.5, 'QB_Hurry_final_season': 7
    # }),
]
example_name, example = example_profiles[0]
print(f"Predicted draft position for {example_name}: {predict_draft_position(example)}")

Predicted draft position for David Bailey — Texas Tech (2026): Day 1 (R1)
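
A single predicted label hides how split the neighbors were. With k=4 and four classes, a prediction can rest on as few as two agreeing neighbors. The sketch below illustrates the idea of a neighbor vote on hypothetical, already-standardized points; it is not scikit-learn's implementation, and the training points and query are made up. For the fitted model, `KNeighborsClassifier.predict_proba` reports comparable vote shares.

```python
import math
from collections import Counter

DAY_LABELS = ["Undrafted", "Day 1 (R1)", "Day 2 (R2-3)", "Day 3 (R4-7)"]

# Hypothetical, already-standardized training points: (features, class index).
train = [
    ((1.2, 1.5), 1),
    ((1.0, 1.1), 1),
    ((0.9, 1.6), 2),
    ((0.2, 0.1), 3),
    ((-0.8, -1.0), 0),
    ((-1.1, -0.7), 0),
]

def neighbor_votes(query, k=4):
    """Count class labels among the k training points closest to the query."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    return Counter(label for _, label in ranked[:k])

votes = neighbor_votes((1.1, 1.4), k=4)
for cls, count in votes.most_common():
    print(f"{DAY_LABELS[cls]}: {count}/4 neighbors")
```

Here the winning class gets only two of four votes, which is a reminder to treat a lone "Day 1" label as a lean rather than a confident call.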


