## Combine Analysis: Defensive Ends

Which Combine tests have the greatest potential influence on a player's chances of being drafted and on their draft position?

Our training dataset is combine data from 2010–2019 (268 players) and our testing dataset is 2021–2023 (93 players). Correlations below use the training data; all models are trained on the training set and evaluated on the test set.

In [1]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.utils.validation')

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.utils.class_weight import compute_sample_weight
import xgboost as xgb

# Path relative to notebook location (DE_similarity_scores_project/) - data is in project root
de_data = pd.read_csv('../data/processed/de_training_data.csv')
print(de_data.columns)
# Convert Height from feet-inches to inches
de_data['Height'] = de_data['Height'].str.split('-').str[0].astype(int) * 12 + de_data['Height'].str.split('-').str[1].astype(int)

# Examine every column in the dataset and its correlation with the Drafted column 
de_data_just_numeric = de_data.select_dtypes(include=['number'])
de_data_just_numeric['Drafted'] = de_data['Drafted']
print(de_data_just_numeric.corr()['Drafted'].sort_values(ascending=False))


Index(['Year', 'Player', 'Pos', 'School', 'Height', 'Weight', '40yd',
       'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle', 'Drafted',
       'Round', 'Pick', 'Sacks_cumulative', 'TFL_cumulative',
       'QB_Hurry_cumulative', 'Sacks_final_season', 'TFL_final_season',
       'QB_Hurry_final_season'],
      dtype='object')
Drafted                  1.000000
Broad Jump               0.319828
Vertical                 0.249153
QB_Hurry_final_season    0.213194
TFL_final_season         0.190802
Sacks_final_season       0.175748
Bench                    0.153866
QB_Hurry_cumulative      0.143173
Height                   0.133169
Sacks_cumulative         0.114300
TFL_cumulative           0.106476
Weight                   0.047001
Year                    -0.089961
Shuttle                 -0.223542
3Cone                   -0.223938
40yd                    -0.306137
Round                         NaN
Pick                          NaN
Name: Drafted, dtype: float64


For context, a positive correlation means that a higher value is associated with being drafted, and a negative correlation means that a lower value is (for timed drills, lower means faster). From the training data, the **combine** values most strongly correlated with **being drafted**, ranked by absolute value, are:

1. Broad Jump: 0.320
2. 40yd: -0.306 (faster = more likely drafted)
3. Vertical: 0.249
4. 3Cone: -0.224
5. Shuttle: -0.224
6. Bench: 0.154

and the most impactful **defensive stats** on **being drafted** are:

1. QB_Hurry_final_season: 0.213
2. TFL_final_season: 0.191
3. Sacks_final_season: 0.176
4. QB_Hurry_cumulative: 0.143
5. Sacks_cumulative: 0.114
6. TFL_cumulative: 0.106

Anything with an absolute correlation well below 0.20 is likely too weak to be useful in a model.
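As a small illustration of the ranking used above, the sketch below sorts the combine correlations by magnitude while keeping their signs, then applies the 0.20 rule of thumb. The values are copied from the output above; the variable names are only for this example.

```python
import pandas as pd

# Correlations with Drafted, copied from the output above (combine tests only).
corr_with_drafted = pd.Series({
    "Broad Jump": 0.319828,
    "Vertical": 0.249153,
    "Bench": 0.153866,
    "Shuttle": -0.223542,
    "3Cone": -0.223938,
    "40yd": -0.306137,
})

# Rank by magnitude, keeping the sign so the direction stays visible.
ranked = corr_with_drafted.reindex(
    corr_with_drafted.abs().sort_values(ascending=False).index
)
print(ranked.round(3))

# Features at or above the 0.20 rule-of-thumb threshold.
strong = ranked[ranked.abs() >= 0.20].index.tolist()
print(strong)
```

Under this threshold, Bench is the only combine test that drops out.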

In [2]:
# Examine every numeric column in the dataset and its correlation with the Pick column
# Lower Pick (draft position) is better
de_data_just_numeric = de_data.select_dtypes(include=['number'])
de_data_just_numeric['Pick'] = de_data['Pick']
print(de_data_just_numeric.corr()['Pick'].sort_values(ascending=False))


Pick                     1.000000
Round                    0.988563
40yd                     0.275288
Shuttle                  0.184148
3Cone                    0.162359
Year                    -0.047675
Bench                   -0.057544
Weight                  -0.124414
Vertical                -0.127590
Height                  -0.160059
Broad Jump              -0.217950
TFL_cumulative          -0.224798
QB_Hurry_final_season   -0.263154
QB_Hurry_cumulative     -0.294456
Sacks_cumulative        -0.324386
TFL_final_season        -0.413254
Sacks_final_season      -0.418762
Name: Pick, dtype: float64


From the training data, the **combine** values most strongly correlated with **draft position** (Pick; lower = earlier/better), ranked by absolute value, are:

1. 40yd: 0.275 (slower = later pick)
2. Broad Jump: -0.218 (longer = earlier pick)
3. Shuttle: 0.184 (slower = later pick)
4. 3Cone: 0.162 (slower = later pick)
5. Vertical: -0.128 (higher = earlier pick)
6. Bench: -0.058 (more reps = earlier pick, but negligible)

So a higher 40-yard time and a shorter broad jump are associated with a later draft pick; we want faster 40s and longer broad jumps for earlier picks.

The most impactful **defensive stats** on **draft position** are:

1. Sacks_final_season: -0.419
2. TFL_final_season: -0.413
3. Sacks_cumulative: -0.324
4. QB_Hurry_cumulative: -0.294
5. QB_Hurry_final_season: -0.263
6. TFL_cumulative: -0.225

## Looking to Model

When we build machine learning models, there are three tasks we would like to accomplish. The first two can use our current datasets of combine and college data. The third will require each defensive end's stats from their first four NFL seasons.

1. Projected Drafted or Undrafted
2. Projected Draft Position/Round
3. Projected NFL Ability/Value 

## Projected Drafted or Undrafted

We build models to predict whether a Defensive End will be **drafted or undrafted** using combine and college stats. Training set: 2010–2019 (268 players). Test set: 2021–2023 (93 players). Models tested: logistic regression, Random Forest, and XGBoost. For models that use college stats (QB Hurry, TFL, Sacks, p4_conference), we restrict to 2017+ so that data is available (91 train, 93 test).
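The cells below repeat the same preprocessing pattern for each feature set: impute missing values with KNN, scale, then fit a classifier, with every step fit on the training rows only. As a rough sketch of that pattern (not the notebook's actual code), the same steps can be chained in a scikit-learn pipeline. The data here is synthetic and the variable names are placeholders.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for combine features (not real player data).
X_train = rng.normal(size=(120, 4))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
X_test = rng.normal(size=(40, 4))

# Knock out roughly 20% of values to mimic skipped combine drills.
X_train[rng.random(X_train.shape) < 0.2] = np.nan
X_test[rng.random(X_test.shape) < 0.2] = np.nan

# Each step is fit on the training rows only, then applied to the test rows.
model = make_pipeline(
    KNNImputer(n_neighbors=10),
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
model.fit(X_train, y_train)
test_prob = model.predict_proba(X_test)[:, 1]
print(test_prob[:5].round(3))
```

Wrapping the steps in one pipeline is a design choice rather than a requirement; it mainly guards against accidentally fitting the imputer or scaler on test data.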

In [3]:
# Load training and testing data (paths relative to Edges/)
train_raw = pd.read_csv('../data/processed/de_training_data.csv')
test_raw = pd.read_csv('../data/processed/de_testing_data.csv')

# Convert Height from feet-inches to inches
def height_to_inches(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).split('-')
    return int(parts[0]) * 12 + int(parts[1])

for df in [train_raw, test_raw]:
    df['Height'] = df['Height'].apply(height_to_inches)

# Column names for modeling (match CSV)
FEATURE_COLS = [
    'Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season',
    'Sacks_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight'
]

# --- Speed Score: weight * 200 / 40yd^4 ---
def add_speed_score(df):
    df = df.copy()
    df['speed_score'] = np.where(
        df['40yd'].notna() & (df['40yd'] > 0),
        df['Weight'] * 200 / (df['40yd'] ** 4),
        np.nan
    )
    return df

train_raw = add_speed_score(train_raw)
test_raw = add_speed_score(test_raw)

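# --- Explosive Score (position-specific z-scores from training data) ---
# Note: a missing z-score is filled with 0 (the position mean), so explosive_score is never NaN
# and the contains_explosive_score flag created later is always 1.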
def add_explosive_score(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['vertical_z'] = np.nan
    train_df['broad_z'] = np.nan
    test_df['vertical_z'] = np.nan
    test_df['broad_z'] = np.nan
    for pos in train_df['Pos'].dropna().unique():
        tr = train_df[train_df['Pos'] == pos]
        mean_v = tr['Vertical'].mean()
        std_v = tr['Vertical'].std()
        mean_b = tr['Broad Jump'].mean()
        std_b = tr['Broad Jump'].std()
        if std_v == 0 or np.isnan(std_v):
            std_v = 1.0
        if std_b == 0 or np.isnan(std_b):
            std_b = 1.0
        mask_train = train_df['Pos'] == pos
        mask_test = test_df['Pos'] == pos
        train_df.loc[mask_train, 'vertical_z'] = (train_df.loc[mask_train, 'Vertical'] - mean_v) / std_v
        train_df.loc[mask_train, 'broad_z'] = (train_df.loc[mask_train, 'Broad Jump'] - mean_b) / std_b
        test_df.loc[mask_test, 'vertical_z'] = (test_df.loc[mask_test, 'Vertical'] - mean_v) / std_v
        test_df.loc[mask_test, 'broad_z'] = (test_df.loc[mask_test, 'Broad Jump'] - mean_b) / std_b
    train_df['explosive_score'] = train_df['vertical_z'].fillna(0) + train_df['broad_z'].fillna(0)
    test_df['explosive_score'] = test_df['vertical_z'].fillna(0) + test_df['broad_z'].fillna(0)
    return train_df.drop(columns=['vertical_z', 'broad_z'], errors='ignore'), test_df.drop(columns=['vertical_z', 'broad_z'], errors='ignore')

train_raw, test_raw = add_explosive_score(train_raw, test_raw)

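# --- Agility Score (position-specific z-scores from training; flip sign so better = higher) ---
# Note: a missing z-score is filled with 0 (the position mean), so agility_score is never NaN
# and the contains_agility_score flag created later is always 1.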
def add_agility_score(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['three_cone_z'] = np.nan
    train_df['shuttle_z'] = np.nan
    test_df['three_cone_z'] = np.nan
    test_df['shuttle_z'] = np.nan
    for pos in train_df['Pos'].dropna().unique():
        tr = train_df[train_df['Pos'] == pos]
        mean_3 = tr['3Cone'].mean()
        std_3 = tr['3Cone'].std()
        mean_sh = tr['Shuttle'].mean()
        std_sh = tr['Shuttle'].std()
        if std_3 == 0 or np.isnan(std_3):
            std_3 = 1.0
        if std_sh == 0 or np.isnan(std_sh):
            std_sh = 1.0
        mask_train = train_df['Pos'] == pos
        mask_test = test_df['Pos'] == pos
        train_df.loc[mask_train, 'three_cone_z'] = (train_df.loc[mask_train, '3Cone'] - mean_3) / std_3
        train_df.loc[mask_train, 'shuttle_z'] = (train_df.loc[mask_train, 'Shuttle'] - mean_sh) / std_sh
        test_df.loc[mask_test, 'three_cone_z'] = (test_df.loc[mask_test, '3Cone'] - mean_3) / std_3
        test_df.loc[mask_test, 'shuttle_z'] = (test_df.loc[mask_test, 'Shuttle'] - mean_sh) / std_sh
    train_df['agility_score'] = (-train_df['three_cone_z'].fillna(0)) + (-train_df['shuttle_z'].fillna(0))
    test_df['agility_score'] = (-test_df['three_cone_z'].fillna(0)) + (-test_df['shuttle_z'].fillna(0))
    return train_df.drop(columns=['three_cone_z', 'shuttle_z'], errors='ignore'), test_df.drop(columns=['three_cone_z', 'shuttle_z'], errors='ignore')

train_raw, test_raw = add_agility_score(train_raw, test_raw)

# --- P4/P5 conference: binary 1 if School is in power conference. Pac-12 counts only for draft year 2023 and before. ---
P4_WITH_PAC12 = {'SEC', 'Big Ten', 'Big 12', 'ACC', 'Pac-12'}
P4_NO_PAC12 = {'SEC', 'Big Ten', 'Big 12', 'ACC'}
_stats = pd.read_csv('../data/processed/defensive_stats_2016_to_2025.csv')
P4_SCHOOLS = set(_stats[_stats['Conference'].isin(P4_WITH_PAC12)]['Team'].unique())
P4_SCHOOLS_NO_PAC12 = set(_stats[_stats['Conference'].isin(P4_NO_PAC12)]['Team'].unique())
school_alias = {
    'Ole Miss': 'Mississippi', 'Miami (FL)': 'Miami', 'Southern California': 'USC',
    'Central Florida': 'UCF', 'Brigham Young': 'BYU', 'Ohio St.': 'Ohio State',
    'Florida St.': 'Florida State', 'Kansas St.': 'Kansas State', 'Iowa St.': 'Iowa State',
    'Oklahoma St.': 'Oklahoma State', 'Penn St.': 'Penn State', 'San Diego St.': 'San Diego State',
    'San Jose St.': 'San José State', 'Boston Col.': 'Boston College',
}

def add_p4_conference(df):
    df = df.copy()
    def norm(s):
        return school_alias.get(s, s) if pd.notna(s) and s else None
    def is_p4(row):
        sn = norm(row['School'])
        if not sn: return 0
        year = row.get('Year', 0)
        schools = P4_SCHOOLS if year <= 2023 else P4_SCHOOLS_NO_PAC12
        return 1 if sn in schools else 0
    df['p4_conference'] = df.apply(is_p4, axis=1)
    df['contains_p4_conference'] = df['School'].notna().astype(int)
    return df

train_raw = add_p4_conference(train_raw)
test_raw = add_p4_conference(test_raw)

# --- Binary contains_* for each metric (1 if present, 0 if missing) ---
METRIC_COLS = [
    'Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season',
    'Sacks_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score', 'agility_score'
]
def add_contains_flags(df):
    df = df.copy()
    name_map = {
        'Broad Jump': 'broad_jump', 'Vertical': 'vertical',
        'QB_Hurry_final_season': 'qb_hurry_final_season', 'TFL_final_season': 'tfl_final_season',
        'Sacks_final_season': 'sacks_final_season', 'Shuttle': 'shuttle', '3Cone': 'three_cone',
        '40yd': '40yd', 'Height': 'height', 'Weight': 'weight',
        'speed_score': 'speed_score', 'explosive_score': 'explosive_score', 'agility_score': 'agility_score'
    }
    for col in METRIC_COLS:
        if col not in df.columns:
            continue
        flag_name = f"contains_{name_map.get(col, col.lower().replace(' ', '_'))}"
        df[flag_name] = (df[col].notna()).astype(int)
    return df

train_raw = add_contains_flags(train_raw)
test_raw = add_contains_flags(test_raw)

# Final training and test datasets for modeling
train_df = train_raw.copy()
test_df = test_raw.copy()

print('Training set:', train_df.shape[0], 'players')
print('Test set:', test_df.shape[0], 'players')
print('\nModeling features:', FEATURE_COLS)
print('Derived metrics: speed_score, explosive_score, agility_score')
print('Contains flags: contains_* for each metric')
train_df.head()

Training set: 268 players
Test set: 93 players

Modeling features: ['Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight']
Derived metrics: speed_score, explosive_score, agility_score
Contains flags: contains_* for each metric


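| | Year | Player | Pos | School | Height | Weight | 40yd | Vertical | Bench | Broad Jump | ... | contains_tfl_final_season | contains_sacks_final_season | contains_shuttle | contains_three_cone | contains_40yd | contains_height | contains_weight | contains_speed_score | contains_explosive_score | contains_agility_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | Rahim Alem | DE | LSU | 75 | 251.0 | 4.75 | 30.5 | NaN | 106.0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 2010 | Tyson Alualu | DE | California | 74 | 295.0 | 4.87 | 35.5 | 21.0 | 116.0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2010 | Kevin Basped | DE | Nevada | 76 | 258.0 | 4.75 | 29.0 | 26.0 | 104.0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 2010 | Alex Carrington | DE | Arkansas State | 77 | 285.0 | 4.92 | NaN | 26.0 | NaN | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 2010 | Jermaine Cunningham | DE | Florida | 75 | 266.0 | 4.89 | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |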
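To make the speed score formula from the cell above concrete (Weight * 200 / 40yd^4), here is a small worked example using the weights and 40-yard times of the first two players in the table. It only re-applies the formula already defined in the notebook; the DataFrame is rebuilt by hand for illustration.

```python
import pandas as pd

# First two rows of the table above (Weight in lb, 40yd in seconds).
players = pd.DataFrame({
    "Player": ["Rahim Alem", "Tyson Alualu"],
    "Weight": [251.0, 295.0],
    "40yd": [4.75, 4.87],
})

# Speed score as defined in the cell above: Weight * 200 / 40yd^4.
players["speed_score"] = players["Weight"] * 200 / players["40yd"] ** 4
print(players.round(2))
```

Because weight sits in the numerator, the heavier player ends up with the higher speed score here even though his 40-yard time is slower.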


In [4]:
# Combine-only logistic regression: predict Drafted (1) vs Undrafted (0)
# No college stats — only combine metrics + derived scores

# Combine-only features (no Sacks/TFL/QB Hurry) + binary "contains_*" flags
# Exclude Shuttle, 3Cone, agility_score — often missing at combine
COMBINE_ONLY_FEATURES = [
    'Broad Jump', 'Vertical', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score', 'p4_conference'
]
COMBINE_ONLY_CONTAINS = [
    'contains_broad_jump', 'contains_vertical',
    'contains_40yd', 'contains_height', 'contains_weight',
    'contains_speed_score', 'contains_explosive_score', 'contains_p4_conference'
]
COMBINE_ONLY_ALL = COMBINE_ONLY_FEATURES + COMBINE_ONLY_CONTAINS

# Prepare X, y
X_tr_raw = train_df[COMBINE_ONLY_ALL].copy()
X_te_raw = test_df[COMBINE_ONLY_ALL].copy()
y_train = (train_df['Drafted'].astype(bool)).astype(int)
y_test = (test_df['Drafted'].astype(bool)).astype(int)

# KNN imputation (fit on train, transform train and test)
knn_imputer_combine = KNNImputer(n_neighbors=10)
X_tr = knn_imputer_combine.fit_transform(X_tr_raw)
X_te = knn_imputer_combine.transform(X_te_raw)
# For prediction: leave missing as NaN then transform; use NaN "medians" so _player_row keeps NaNs
train_medians = pd.Series(np.nan, index=COMBINE_ONLY_ALL)

# Scale (fit on train, transform both)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Fit binary logistic regression
logit_draft = LogisticRegression(max_iter=1000, random_state=42)
logit_draft.fit(X_tr_scaled, y_train)

# Predict on test
y_pred = logit_draft.predict(X_te_scaled)
y_prob = logit_draft.predict_proba(X_te_scaled)[:, 1]

# Metrics
print('Combine-only logistic model: Drafted vs Undrafted')
print('=' * 50)
print('Test accuracy:', (y_pred == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred))
print('\nClassification report:')
print(classification_report(y_test, y_pred, target_names=['Undrafted', 'Drafted']))
if y_test.nunique() == 2:
    print('Test ROC-AUC:', roc_auc_score(y_test, y_prob).round(4))

Combine-only logistic model: Drafted vs Undrafted
Test accuracy: 0.6559

Confusion matrix (rows=actual, cols=predicted):
[[ 4 28]
 [ 4 57]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.50      0.12      0.20        32
     Drafted       0.67      0.93      0.78        61

    accuracy                           0.66        93
   macro avg       0.59      0.53      0.49        93
weighted avg       0.61      0.66      0.58        93

Test ROC-AUC: 0.6962
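The headline numbers in this report can be recomputed directly from the confusion matrix, which also makes it easy to compare accuracy against a trivial baseline. The sketch below uses the matrix printed above; `baseline_accuracy` is the accuracy of always predicting Drafted.

```python
import numpy as np

# Confusion matrix from the output above (rows = actual, cols = predicted;
# class order is [Undrafted, Drafted]).
cm = np.array([[4, 28],
               [4, 57]])

total = cm.sum()
accuracy = np.trace(cm) / total
undrafted_recall = cm[0, 0] / cm[0].sum()
undrafted_precision = cm[0, 0] / cm[:, 0].sum()

# Accuracy of always predicting the majority class (Drafted).
baseline_accuracy = cm[1].sum() / total

print(f"accuracy={accuracy:.4f}, baseline={baseline_accuracy:.4f}")
print(f"undrafted recall={undrafted_recall:.3f}, precision={undrafted_precision:.3f}")
```

Both accuracy and the baseline come out to 61/93, so accuracy alone does not separate this model from always predicting Drafted; ROC-AUC and the per-class recall are the more informative figures here.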


In [5]:
# Logistic regression with college stats: training data from 2017 onward
# Same target (Drafted vs Undrafted), with combine + college stats

# Restrict to 2017+ so college stats are available
train_2017 = train_df[train_df['Year'] >= 2017].copy()
test_2017 = test_df[test_df['Year'] >= 2017].copy()

# Full feature set (combine + college stats) + binary "contains_*" flags
# Exclude Shuttle, 3Cone, agility_score — often missing at combine
FEATURES_WITH_COLLEGE = [
    'Broad Jump', 'Vertical', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score',
    'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'p4_conference'
]
CONTAINS_WITH_COLLEGE = [
    'contains_broad_jump', 'contains_vertical',
    'contains_40yd', 'contains_height', 'contains_weight',
    'contains_speed_score', 'contains_explosive_score',
    'contains_qb_hurry_final_season', 'contains_tfl_final_season', 'contains_sacks_final_season',
    'contains_p4_conference'
]
FEATURES_WITH_COLLEGE_ALL = FEATURES_WITH_COLLEGE + CONTAINS_WITH_COLLEGE

X_tr17_raw = train_2017[FEATURES_WITH_COLLEGE_ALL].copy()
X_te17_raw = test_2017[FEATURES_WITH_COLLEGE_ALL].copy()
y_train17 = (train_2017['Drafted'].astype(bool)).astype(int)
y_test17 = (test_2017['Drafted'].astype(bool)).astype(int)

# KNN imputation
knn_imputer17 = KNNImputer(n_neighbors=10)
X_tr17 = knn_imputer17.fit_transform(X_tr17_raw)
X_te17 = knn_imputer17.transform(X_te17_raw)
train_medians17 = pd.Series(np.nan, index=FEATURES_WITH_COLLEGE_ALL)

# Scale (fit on train, transform both)
scaler17 = StandardScaler()
X_tr17_scaled = scaler17.fit_transform(X_tr17)
X_te17_scaled = scaler17.transform(X_te17)

# Fit logistic regression (with college stats)
logit_draft_college = LogisticRegression(max_iter=1000, random_state=42)
logit_draft_college.fit(X_tr17_scaled, y_train17)

y_pred17 = logit_draft_college.predict(X_te17_scaled)
y_prob17 = logit_draft_college.predict_proba(X_te17_scaled)[:, 1]

print('Logistic model with college stats (train 2017+, test 2017+)')
print('=' * 55)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred17, target_names=['Undrafted', 'Drafted']))
if y_test17.nunique() == 2 and len(y_test17) > 0:
    print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob17).round(4))

Logistic model with college stats (train 2017+, test 2017+)
Training samples: 91 | Test samples: 93
Test accuracy: 0.6559

Confusion matrix (rows=actual, cols=predicted):
[[10 22]
 [10 51]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.50      0.31      0.38        32
     Drafted       0.70      0.84      0.76        61

    accuracy                           0.66        93
   macro avg       0.60      0.57      0.57        93
weighted avg       0.63      0.66      0.63        93

Test ROC-AUC: 0.7275


In [6]:
# College + combine with agility_score: these models are used for players who have 3Cone/Shuttle (agility) times.
FEATURES_WITH_COLLEGE_AGILITY = FEATURES_WITH_COLLEGE + ['agility_score']
CONTAINS_WITH_COLLEGE_AGILITY = CONTAINS_WITH_COLLEGE + ['contains_agility_score']
FEATURES_WITH_COLLEGE_AGILITY_ALL = FEATURES_WITH_COLLEGE_AGILITY + CONTAINS_WITH_COLLEGE_AGILITY

X_tr_ag_raw = train_2017[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()
X_te_ag_raw = test_2017[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()
knn_imputer_ag = KNNImputer(n_neighbors=10)
X_tr_ag = knn_imputer_ag.fit_transform(X_tr_ag_raw)
X_te_ag = knn_imputer_ag.transform(X_te_ag_raw)
train_medians_ag = pd.Series(np.nan, index=FEATURES_WITH_COLLEGE_AGILITY_ALL)

scaler_ag = StandardScaler()
X_tr_ag_scaled = scaler_ag.fit_transform(X_tr_ag)
X_te_ag_scaled = scaler_ag.transform(X_te_ag)

# Draft/undrafted models (college+combine w/ agility)
logit_draft_college_agility = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logit_draft_college_agility.fit(X_tr_ag_scaled, y_train17)
y_pred_ag = logit_draft_college_agility.predict(X_te_ag_scaled)
y_prob_ag = logit_draft_college_agility.predict_proba(X_te_ag_scaled)[:, 1]

rf_college_agility = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42, class_weight='balanced')
rf_college_agility.fit(X_tr_ag, y_train17)
y_pred_rf_ag = rf_college_agility.predict(X_te_ag)
y_prob_rf_ag = rf_college_agility.predict_proba(X_te_ag)[:, 1]

xgb_college_agility = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college_agility.fit(X_tr_ag, y_train17)
y_pred_xgb_ag = xgb_college_agility.predict(X_te_ag)
y_prob_xgb_ag = xgb_college_agility.predict_proba(X_te_ag)[:, 1]

print('College+combine w/ agility: Drafted vs Undrafted (train 2017+)')
print('Logistic ROC-AUC:', roc_auc_score(y_test17, y_prob_ag).round(4))
print('RF ROC-AUC:', roc_auc_score(y_test17, y_prob_rf_ag).round(4))
print('XGB ROC-AUC:', roc_auc_score(y_test17, y_prob_xgb_ag).round(4))


College+combine w/ agility: Drafted vs Undrafted (train 2017+)
Logistic ROC-AUC: 0.7249
RF ROC-AUC: 0.7449
XGB ROC-AUC: 0.7259
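The logistic regression and Random Forest in the cell above pass `class_weight='balanced'`, while the XGBoost call does not. As a sketch of what that option does, scikit-learn weights each class by n_samples / (n_classes * class_count). The label counts below are hypothetical, not the real training counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 30 undrafted (0) and 60 drafted (1).
y = np.array([0] * 30 + [1] * 60)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# Same result from the formula n_samples / (n_classes * class_count).
manual = len(y) / (2 * np.bincount(y))
print(dict(zip(["Undrafted", "Drafted"], weights)))
```

With twice as many drafted players, each undrafted example counts twice as much during fitting, which tends to raise Undrafted recall at some cost to Drafted recall.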


In [7]:
# Combined prediction: average both models' probabilities into one drafted/undrafted prediction
# Combine-only model predicts on full test_df; college model on test_2017 (2017+). Test set is 2021+ so both apply to all rows.

combined_prob = (y_prob + y_prob17) / 2
combined_pred = (combined_prob >= 0.5).astype(int)

# Use same test labels (y_test from full test_df; same rows as test_2017)
print('Combined model: average of combine-only + college-stats probabilities')
print('=' * 60)
print('Test accuracy:', (combined_pred == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred))
print('\nClassification report:')
print(classification_report(y_test, combined_pred, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob).round(4))

Combined model: average of combine-only + college-stats probabilities
Test accuracy: 0.6452

Confusion matrix (rows=actual, cols=predicted):
[[ 8 24]
 [ 9 52]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.47      0.25      0.33        32
     Drafted       0.68      0.85      0.76        61

    accuracy                           0.65        93
   macro avg       0.58      0.55      0.54        93
weighted avg       0.61      0.65      0.61        93

Test ROC-AUC: 0.7203
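The combined prediction above is a simple average of two models' probabilities followed by a 0.5 threshold. A minimal sketch with made-up probabilities shows the mechanics, including the requirement that both arrays describe the same players in the same order.

```python
import numpy as np

# Hypothetical probabilities of "Drafted" from two models for the same five players.
prob_combine_only = np.array([0.80, 0.55, 0.40, 0.30, 0.65])
prob_with_college = np.array([0.70, 0.35, 0.70, 0.20, 0.45])

# Both arrays must be row-aligned (same players, same order) before averaging.
assert prob_combine_only.shape == prob_with_college.shape

avg_prob = (prob_combine_only + prob_with_college) / 2
avg_pred = (avg_prob >= 0.5).astype(int)
print(avg_prob)
print(avg_pred)
```

In this toy case the second and third players are ones where the two models disagree, and the average decides which side of 0.5 they land on.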


In [8]:
# College-only models: QB Hurry, TFL, Sacks, p4_conference, plus Height and Weight (no combine drill results)
# For players without combine data (e.g. 2026 prospects pre-combine)

COLLEGE_ONLY_FEATURES = ['QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'p4_conference', 'Height', 'Weight']
COLLEGE_ONLY_CONTAINS = ['contains_qb_hurry_final_season', 'contains_tfl_final_season', 'contains_sacks_final_season', 'contains_p4_conference', 'contains_height', 'contains_weight']
COLLEGE_ONLY_ALL = COLLEGE_ONLY_FEATURES + COLLEGE_ONLY_CONTAINS

X_tr_co_raw = train_2017[COLLEGE_ONLY_ALL].copy()
X_te_co_raw = test_2017[COLLEGE_ONLY_ALL].copy()
knn_imputer_co = KNNImputer(n_neighbors=10)
X_tr_co = knn_imputer_co.fit_transform(X_tr_co_raw)
X_te_co = knn_imputer_co.transform(X_te_co_raw)
train_medians_co = pd.Series(np.nan, index=COLLEGE_ONLY_ALL)

scaler_co = StandardScaler()
X_tr_co_scaled = scaler_co.fit_transform(X_tr_co)
X_te_co_scaled = scaler_co.transform(X_te_co)

# Drafted/undrafted
logit_draft_college_only = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logit_draft_college_only.fit(X_tr_co_scaled, y_train17)
y_pred_college_only = logit_draft_college_only.predict(X_te_co_scaled)
y_prob_college_only = logit_draft_college_only.predict_proba(X_te_co_scaled)[:, 1]

rf_college_only = RandomForestClassifier(n_estimators=200, max_depth=1, random_state=42, class_weight='balanced')
rf_college_only.fit(X_tr_co, y_train17)
y_pred_rf_co = rf_college_only.predict(X_te_co)
y_prob_rf_co = rf_college_only.predict_proba(X_te_co)[:, 1]

xgb_college_only = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college_only.fit(X_tr_co, y_train17)
y_pred_xgb_co = xgb_college_only.predict(X_te_co)
y_prob_xgb_co = xgb_college_only.predict_proba(X_te_co)[:, 1]

print('College-only models (QB Hurry, TFL, Sacks, p4_conference) — drafted/undrafted')
print('=' * 70)
for name, pred, prob in [('Logistic', y_pred_college_only, y_prob_college_only), ('RF', y_pred_rf_co, y_prob_rf_co), ('XGB', y_pred_xgb_co, y_prob_xgb_co)]:
    print(f'{name}: accuracy={(pred == y_test17).mean():.4f}, ROC-AUC={roc_auc_score(y_test17, prob):.4f}')

College-only models (QB Hurry, TFL, Sacks, p4_conference) — drafted/undrafted
Logistic: accuracy=0.6774, ROC-AUC=0.6778
RF: accuracy=0.6452, ROC-AUC=0.6811
XGB: accuracy=0.6344, ROC-AUC=0.6012



In [9]:
# Random Forest: combine-only features — predict Drafted vs Undrafted

rf_combine = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf_combine.fit(X_tr, y_train)  # X_tr already has COMBINE_ONLY_ALL, imputed

y_pred_rf = rf_combine.predict(X_te)
y_prob_rf = rf_combine.predict_proba(X_te)[:, 1]

print('Random Forest (combine-only): Drafted vs Undrafted')
print('=' * 55)
print('Test accuracy:', (y_pred_rf == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred_rf))
print('\nClassification report:')
print(classification_report(y_test, y_pred_rf, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, y_prob_rf).round(4))

Random Forest (combine-only): Drafted vs Undrafted
Test accuracy: 0.6774

Confusion matrix (rows=actual, cols=predicted):
[[ 6 26]
 [ 4 57]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.60      0.19      0.29        32
     Drafted       0.69      0.93      0.79        61

    accuracy                           0.68        93
   macro avg       0.64      0.56      0.54        93
weighted avg       0.66      0.68      0.62        93

Test ROC-AUC: 0.6742


In [10]:
# Random Forest: combine + college stats (2017+)
rf_college = RandomForestClassifier(n_estimators=200, max_depth=2, random_state=42)
rf_college.fit(X_tr17, y_train17)  # X_tr17 already has FEATURES_WITH_COLLEGE_ALL, imputed

y_pred_rf17 = rf_college.predict(X_te17)
y_prob_rf17 = rf_college.predict_proba(X_te17)[:, 1]

print('Random Forest (combine + college, train 2017+): Drafted vs Undrafted')
print('=' * 60)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred_rf17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred_rf17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred_rf17, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob_rf17).round(4))

Random Forest (combine + college, train 2017+): Drafted vs Undrafted
Training samples: 91 | Test samples: 93
Test accuracy: 0.6774

Confusion matrix (rows=actual, cols=predicted):
[[ 6 26]
 [ 4 57]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.60      0.19      0.29        32
     Drafted       0.69      0.93      0.79        61

    accuracy                           0.68        93
   macro avg       0.64      0.56      0.54        93
weighted avg       0.66      0.68      0.62        93

Test ROC-AUC: 0.7485


In [11]:
# Combined RF prediction: average both RF models' probabilities
combined_prob_rf = (y_prob_rf + y_prob_rf17) / 2
combined_pred_rf = (combined_prob_rf >= 0.5).astype(int)

print('Combined Random Forest: average of combine-only + college-stats probabilities')
print('=' * 65)
print('Test accuracy:', (combined_pred_rf == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred_rf))
print('\nClassification report:')
print(classification_report(y_test, combined_pred_rf, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob_rf).round(4))

Combined Random Forest: average of combine-only + college-stats probabilities
Test accuracy: 0.6667

Confusion matrix (rows=actual, cols=predicted):
[[ 6 26]
 [ 5 56]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.55      0.19      0.28        32
     Drafted       0.68      0.92      0.78        61

    accuracy                           0.67        93
   macro avg       0.61      0.55      0.53        93
weighted avg       0.64      0.67      0.61        93

Test ROC-AUC: 0.7095


In [12]:
# XGBoost: combine-only features — predict Drafted vs Undrafted

xgb_combine = xgb.XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_combine.fit(X_tr, y_train)

y_pred_xgb = xgb_combine.predict(X_te)
y_prob_xgb = xgb_combine.predict_proba(X_te)[:, 1]

print('XGBoost (combine-only): Drafted vs Undrafted')
print('=' * 55)
print('Test accuracy:', (y_pred_xgb == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred_xgb))
print('\nClassification report:')
print(classification_report(y_test, y_pred_xgb, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, y_prob_xgb).round(4))

XGBoost (combine-only): Drafted vs Undrafted
Test accuracy: 0.6774

Confusion matrix (rows=actual, cols=predicted):
[[ 9 23]
 [ 7 54]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.56      0.28      0.38        32
     Drafted       0.70      0.89      0.78        61

    accuracy                           0.68        93
   macro avg       0.63      0.58      0.58        93
weighted avg       0.65      0.68      0.64        93

Test ROC-AUC: 0.7003



In [13]:
# XGBoost: combine + college stats (2017+)
xgb_college = xgb.XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college.fit(X_tr17, y_train17)

y_pred_xgb17 = xgb_college.predict(X_te17)
y_prob_xgb17 = xgb_college.predict_proba(X_te17)[:, 1]

print('XGBoost (combine + college, train 2017+): Drafted vs Undrafted')
print('=' * 60)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred_xgb17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred_xgb17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred_xgb17, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob_xgb17).round(4))

XGBoost (combine + college, train 2017+): Drafted vs Undrafted
Training samples: 91 | Test samples: 93
Test accuracy: 0.6774

Confusion matrix (rows=actual, cols=predicted):
[[ 9 23]
 [ 7 54]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.56      0.28      0.38        32
     Drafted       0.70      0.89      0.78        61

    accuracy                           0.68        93
   macro avg       0.63      0.58      0.58        93
weighted avg       0.65      0.68      0.64        93

Test ROC-AUC: 0.729




In [14]:
# Combined XGBoost prediction: average both XGBoost models' probabilities
combined_prob_xgb = (y_prob_xgb + y_prob_xgb17) / 2
combined_pred_xgb = (combined_prob_xgb >= 0.5).astype(int)

print('Combined XGBoost: average of combine-only + college-stats probabilities')
print('=' * 65)
print('Test accuracy:', (combined_pred_xgb == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred_xgb))
print('\nClassification report:')
print(classification_report(y_test, combined_pred_xgb, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob_xgb).round(4))

Combined XGBoost: average of combine-only + college-stats probabilities
Test accuracy: 0.6667

Confusion matrix (rows=actual, cols=predicted):
[[ 7 25]
 [ 6 55]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.54      0.22      0.31        32
     Drafted       0.69      0.90      0.78        61

    accuracy                           0.67        93
   macro avg       0.61      0.56      0.55        93
weighted avg       0.64      0.67      0.62        93

Test ROC-AUC: 0.7285
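
The averaging step can be illustrated on its own. The sketch below uses made-up probabilities (`p_combine_only` and `p_with_college` are hypothetical stand-ins for `y_prob_xgb` and `y_prob_xgb17`, and the helper names are illustrative) to show how the mean of two models' P(drafted), followed by a 0.5 threshold, resolves cases where the two models disagree.

```python
# Hypothetical P(drafted) from two models for the same four players
# (illustrative values, not taken from the notebook's models).
p_combine_only = [0.90, 0.40, 0.30, 0.55]
p_with_college = [0.70, 0.70, 0.20, 0.35]


def average_probabilities(p_a, p_b):
    """Unweighted soft vote: mean of the two probabilities per player."""
    if len(p_a) != len(p_b):
        raise ValueError("both models must score the same players")
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]


def to_labels(probs, threshold=0.5):
    """1 = drafted when the probability reaches the threshold, else 0."""
    return [int(p >= threshold) for p in probs]


combined = average_probabilities(p_combine_only, p_with_college)
print("combine-only :", to_labels(p_combine_only))
print("with college :", to_labels(p_with_college))
print("averaged     :", to_labels(combined))
```

An unweighted mean treats both models as equally reliable, which is the choice made in the cell above.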


In [15]:
# Compare all drafted/undrafted models on the same test set (y_test, n=93)

models = [
    ('Logistic (combine-only)', y_pred, y_prob),
    ('Logistic (combine+college)', y_pred17, y_prob17),
    ('Logistic combined', combined_pred, combined_prob),
    ('RF (combine-only)', y_pred_rf, y_prob_rf),
    ('RF (combine+college)', y_pred_rf17, y_prob_rf17),
    ('RF combined', combined_pred_rf, combined_prob_rf),
    ('XGBoost (combine-only)', y_pred_xgb, y_prob_xgb),
    ('XGBoost (combine+college)', y_pred_xgb17, y_prob_xgb17),
    ('XGBoost combined', combined_pred_xgb, combined_prob_xgb),
]

results = []
for name, pred, prob in models:
    acc = (pred == y_test).mean()
    auc = roc_auc_score(y_test, prob)
    f1_macro = f1_score(y_test, pred, average='macro')
    # Per-class recall: Undrafted (0), Drafted (1). CM rows=actual, cols=pred.
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    recall_undrafted = tn / (tn + fp) if (tn + fp) > 0 else 0  # actual undrafted we got right
    recall_drafted = tp / (tp + fn) if (tp + fn) > 0 else 0  # actual drafted we got right
    results.append({
        'Model': name,
        'Accuracy': acc,
        'ROC-AUC': auc,
        'Macro F1': f1_macro,
        'Recall (Undrafted)': recall_undrafted,
        'Recall (Drafted)': recall_drafted,
    })

results_df = pd.DataFrame(results)
results_df = results_df.sort_values('ROC-AUC', ascending=False).reset_index(drop=True)
print('All models ranked by ROC-AUC (same test set, n=93):')
print('=' * 75)
print(results_df.to_string(index=False))
print()

best_auc = results_df.loc[0, 'Model']
best_auc_val = results_df.loc[0, 'ROC-AUC']
best_f1 = results_df.loc[results_df['Macro F1'].idxmax(), 'Model']
best_f1_val = results_df['Macro F1'].max()
print('Summary:')
print('  Best by ROC-AUC:', best_auc, f'({best_auc_val:.4f})')
print('  Best by Macro F1 (balanced Undrafted/Drafted):', best_f1, f'({best_f1_val:.4f})')
print()
print('Conclusion: ROC-AUC is the preferred metric for imbalanced drafted/undrafted;')
print('Macro F1 rewards balance. If all models are close, the best single model or')
print('combined ensemble is listed above.')

All models ranked by ROC-AUC (same test set, n=93):
                     Model  Accuracy  ROC-AUC  Macro F1  Recall (Undrafted)  Recall (Drafted)
      RF (combine+college)  0.677419 0.748463  0.538690            0.187500          0.934426
 XGBoost (combine+college)  0.677419 0.728996  0.578804            0.281250          0.885246
          XGBoost combined  0.666667 0.728484  0.545626            0.218750          0.901639
Logistic (combine+college)  0.655914 0.727459  0.572905            0.312500          0.836066
         Logistic combined  0.645161 0.720287  0.542827            0.250000          0.852459
               RF combined  0.666667 0.709529  0.531143            0.187500          0.918033
    XGBoost (combine-only)  0.677419 0.700307  0.578804            0.281250          0.885246
   Logistic (combine-only)  0.655914 0.696209  0.490411            0.125000          0.934426
         RF (combine-only)  0.677419 0.674180  0.538690            0.187500          0.934426

Summary
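
Per-class recall and precision are easy to mix up when unpacking a 2x2 confusion matrix by hand. The sketch below is a minimal illustration, assuming the XGBoost (combine-only) matrix printed earlier (`[[9, 23], [7, 54]]`, rows=actual, cols=predicted). It reproduces the figures in that model's classification report: `tn / (tn + fp)` is recall for the Undrafted class, while `tn / (tn + fn)` is its precision.

```python
# Confusion matrix for XGBoost (combine-only), copied from the output above.
# Rows = actual (Undrafted, Drafted); columns = predicted (Undrafted, Drafted).
cm = [[9, 23],
      [7, 54]]
(tn, fp), (fn, tp) = cm

# Recall divides by the ACTUAL class total (a row sum).
recall_undrafted = tn / (tn + fp)
recall_drafted = tp / (tp + fn)

# Precision divides by the PREDICTED class total (a column sum).
precision_undrafted = tn / (tn + fn)
precision_drafted = tp / (tp + fp)

print(f"Undrafted: precision={precision_undrafted:.2f}  recall={recall_undrafted:.2f}")
print(f"Drafted:   precision={precision_drafted:.2f}  recall={recall_drafted:.2f}")
```

The low Undrafted recall is the main weakness shared by these models: most players who actually went undrafted are predicted as drafted.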

## Projected Round/Day Drafted

We predict **draft round (1–7)** for defensive ends who are already projected to be drafted, using the same train/test split (2010–2019 train, 2021–2023 test). Only drafted players are used: 197 train and 61 test overall; models that use college stats train on 2017+ drafted players only (65 train, 61 test). The models are logistic regression (labeled "Ordinal" in the code, although scikit-learn's `LogisticRegression` fits an unordered multiclass model that does not use the ordering of the rounds), Random Forest, and XGBoost.

In [16]:
# Draft ROUND modeling: Round 1–7. Train only on DRAFTED players.
# Target: round_ord 0..6 (Round 1–7) for 7-class models.
train_draft = train_df[train_df['Drafted'] == True].copy()
test_draft = test_df[test_df['Drafted'] == True].copy()
train_draft['draft_round'] = train_draft['Round'].astype(int).clip(1, 7)
test_draft['draft_round'] = test_draft['Round'].astype(int).clip(1, 7)
train_draft['round_ord'] = (train_draft['draft_round'] - 1).astype(int)  # 0=R1..6=R7
test_draft['round_ord'] = (test_draft['draft_round'] - 1).astype(int)

# Combine-only X, y (drafted only) — KNN impute
X_draft_tr = pd.DataFrame(knn_imputer_combine.transform(train_draft[COMBINE_ONLY_ALL].copy()), columns=COMBINE_ONLY_ALL, index=train_draft.index)
X_draft_te = pd.DataFrame(knn_imputer_combine.transform(test_draft[COMBINE_ONLY_ALL].copy()), columns=COMBINE_ONLY_ALL, index=test_draft.index)
y_draft_tr = train_draft['round_ord'].values
y_draft_te = test_draft['round_ord'].values

# Combine+college 2017+ (drafted only) — KNN impute
train_draft_17 = train_draft[train_draft['Year'] >= 2017]
test_draft_17 = test_draft[test_draft['Year'] >= 2017]
X_draft_tr17 = pd.DataFrame(knn_imputer17.transform(train_draft_17[FEATURES_WITH_COLLEGE_ALL].copy()), columns=FEATURES_WITH_COLLEGE_ALL, index=train_draft_17.index)
X_draft_te17 = pd.DataFrame(knn_imputer17.transform(test_draft_17[FEATURES_WITH_COLLEGE_ALL].copy()), columns=FEATURES_WITH_COLLEGE_ALL, index=test_draft_17.index)
y_draft_tr17 = train_draft_17['round_ord'].values
y_draft_te17 = test_draft_17['round_ord'].values

# Scale for ordinal logistic (same scalers as before, but transform draft subsets)
X_draft_tr_scaled = scaler.transform(X_draft_tr)
X_draft_te_scaled = scaler.transform(X_draft_te)
X_draft_tr17_scaled = scaler17.transform(X_draft_tr17)
X_draft_te17_scaled = scaler17.transform(X_draft_te17)

print('Draft ROUND modeling (drafted only), 7 classes R1–R7')
print('Train drafted:', len(train_draft), '| Test drafted:', len(test_draft))
print('Train 2017+ drafted:', len(train_draft_17), '| Test 2017+ drafted:', len(test_draft_17))
for r in range(1, 8):
    print(f'  R{r}: {(train_draft["draft_round"]==r).sum()} train, {(test_draft["draft_round"]==r).sum()} test')

Draft ROUND modeling (drafted only), 7 classes R1–R7
Train drafted: 197 | Test drafted: 61
Train 2017+ drafted: 65 | Test 2017+ drafted: 61
  R1: 37 train, 11 test
  R2: 28 train, 13 test
  R3: 35 train, 10 test
  R4: 27 train, 9 test
  R5: 20 train, 8 test
  R6: 21 train, 4 test
  R7: 29 train, 6 test
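
The round counts above are uneven, which is why the following cells pass `class_weight='balanced'` to scikit-learn and `compute_sample_weight('balanced', ...)` to XGBoost. As a rough illustration, the sketch below applies scikit-learn's documented "balanced" formula, `n_samples / (n_classes * count)`, by hand to the training counts printed above. The variable names are illustrative only.

```python
# Training-set counts per round, copied from the output above.
train_counts = {1: 37, 2: 28, 3: 35, 4: 27, 5: 20, 6: 21, 7: 29}

n_samples = sum(train_counts.values())  # 197 drafted players
n_classes = len(train_counts)           # 7 rounds

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weight = {
    rnd: n_samples / (n_classes * count) for rnd, count in train_counts.items()
}

for rnd, weight in balanced_weight.items():
    print(f"R{rnd}: count={train_counts[rnd]:2d}  weight={weight:.3f}")
```

Rarer rounds receive larger weights, so each round contributes the same total weight to the loss.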


In [17]:
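# "Ordinal" logistic: 7 classes (R1–R7), 0=R1..6=R7. Note that LogisticRegression treats the rounds
# as unordered classes; it is not a true ordinal model. Combined = average probabilities.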
ord_combine = LogisticRegression(max_iter=2000, random_state=42, class_weight='balanced')
ord_college = LogisticRegression(max_iter=2000, random_state=43, class_weight='balanced')

ord_combine.fit(X_draft_tr_scaled, y_draft_tr)
prob_ord_combine = ord_combine.predict_proba(X_draft_te_scaled)
pred_ord_combine = ord_combine.predict(X_draft_te_scaled).astype(int).clip(0, 6)

ord_college.fit(X_draft_tr17_scaled, y_draft_tr17)
prob_ord_college = ord_college.predict_proba(X_draft_te17_scaled)
pred_ord_college = ord_college.predict(X_draft_te17_scaled).astype(int).clip(0, 6)

prob_ord_combined = (prob_ord_combine + prob_ord_college) / 2
pred_ord_combined = np.argmax(prob_ord_combined, axis=1)

for name, pred in [('Ordinal logit (combine-only)', pred_ord_combine), ('Ordinal logit (combine+college)', pred_ord_college), ('Ordinal logit combined', pred_ord_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (rows=actual, cols=R1..R7):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()

Ordinal logit (combine-only)
  Accuracy: 0.0984
  Confusion matrix (rows=actual, cols=R1..R7):
 [[4 1 2 4 0 0 0]
 [4 1 6 2 0 0 0]
 [2 1 0 1 4 0 2]
 [3 1 1 0 3 0 1]
 [2 0 0 0 0 2 4]
 [1 0 0 0 0 0 3]
 [0 1 0 1 1 2 1]]
  Macro F1: 0.075

Ordinal logit (combine+college)
  Accuracy: 0.1967
  Confusion matrix (rows=actual, cols=R1..R7):
 [[3 3 1 1 2 1 0]
 [1 5 5 0 0 0 2]
 [0 1 1 3 3 0 2]
 [1 1 1 1 2 0 3]
 [0 0 1 1 0 2 4]
 [0 1 1 0 1 0 1]
 [0 0 2 0 2 0 2]]
  Macro F1: 0.1737

Ordinal logit combined
  Accuracy: 0.1639
  Confusion matrix (rows=actual, cols=R1..R7):
 [[4 1 1 3 2 0 0]
 [1 4 5 1 0 0 2]
 [2 0 1 2 2 0 3]
 [2 1 1 1 2 0 2]
 [0 1 0 0 0 3 4]
 [1 0 1 0 1 0 1]
 [0 1 2 0 2 1 0]]
  Macro F1: 0.1403
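
Exact-round accuracy treats a miss by one round the same as a miss by six. The sketch below is a minimal illustration, assuming the combine-only confusion matrix printed above. The within-one-round accuracy, mean absolute error in rounds, and majority-class baseline are additions for illustration; the notebook itself reports only accuracy and Macro F1.

```python
# Confusion matrix for "Ordinal logit (combine-only)", copied from the output above.
# Rows = actual round (R1..R7), columns = predicted round (R1..R7).
cm = [
    [4, 1, 2, 4, 0, 0, 0],
    [4, 1, 6, 2, 0, 0, 0],
    [2, 1, 0, 1, 4, 0, 2],
    [3, 1, 1, 0, 3, 0, 1],
    [2, 0, 0, 0, 0, 2, 4],
    [1, 0, 0, 0, 0, 0, 3],
    [0, 1, 0, 1, 1, 2, 1],
]
n = len(cm)
cells = [(i, j) for i in range(n) for j in range(n)]

total = sum(cm[i][j] for i, j in cells)
exact = sum(cm[i][i] for i in range(n))
within_one = sum(cm[i][j] for i, j in cells if abs(i - j) <= 1)
abs_error = sum(abs(i - j) * cm[i][j] for i, j in cells)
majority = max(sum(row) for row in cm)  # size of the most common actual round

print(f"Exact-round accuracy    : {exact / total:.4f}")
print(f"Within-one-round share  : {within_one / total:.4f}")
print(f"Mean absolute error     : {abs_error / total:.2f} rounds")
print(f"Majority-class baseline : {majority / total:.4f}")
```

Comparing exact accuracy against the majority-class baseline gives a quick sense of whether a 7-class model is learning anything beyond the round distribution.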



In [18]:
# Random Forest: draft ROUND (7 classes R1–R7)
rf_day_combine = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42, class_weight='balanced')
rf_day_combine.fit(X_draft_tr, y_draft_tr)
pred_rf_day_combine = rf_day_combine.predict(X_draft_te).astype(int).clip(0, 6)
prob_rf_day_combine = rf_day_combine.predict_proba(X_draft_te)

rf_day_college = RandomForestClassifier(n_estimators=200, max_depth=2, random_state=42, class_weight='balanced')
rf_day_college.fit(X_draft_tr17, y_draft_tr17)
pred_rf_day_college = rf_day_college.predict(X_draft_te17).astype(int).clip(0, 6)
prob_rf_day_college = rf_day_college.predict_proba(X_draft_te17)

prob_rf_day_combined = (prob_rf_day_combine + prob_rf_day_college) / 2
pred_rf_day_combined = np.argmax(prob_rf_day_combined, axis=1)

for name, pred in [('RF (combine-only)', pred_rf_day_combine), ('RF (combine+college)', pred_rf_day_college), ('RF combined', pred_rf_day_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (R1..R7):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()

RF (combine-only)
  Accuracy: 0.1967
  Confusion matrix (R1..R7):
 [[3 1 3 3 1 0 0]
 [5 2 4 2 0 0 0]
 [2 1 1 1 5 0 0]
 [1 0 2 3 2 1 0]
 [0 2 1 2 0 0 3]
 [1 0 1 0 0 2 0]
 [1 3 1 0 0 0 1]]
  Macro F1: 0.2272

RF (combine+college)
  Accuracy: 0.2131
  Confusion matrix (R1..R7):
 [[4 4 0 2 1 0 0]
 [6 4 3 0 0 0 0]
 [3 0 2 0 3 0 2]
 [1 1 2 1 3 0 1]
 [0 1 0 0 1 1 5]
 [0 1 1 0 1 0 1]
 [0 2 0 0 2 1 1]]
  Macro F1: 0.1781

RF combined
  Accuracy: 0.1967
  Confusion matrix (R1..R7):
 [[5 1 2 2 1 0 0]
 [5 3 3 2 0 0 0]
 [2 0 0 3 3 0 2]
 [1 0 2 2 2 1 1]
 [0 2 1 2 0 0 3]
 [1 0 0 0 0 1 2]
 [0 2 2 0 0 1 1]]
  Macro F1: 0.1864



In [19]:
# College-only draft-ROUND models (7 classes, for players without combine data)
X_draft_tr_co = pd.DataFrame(knn_imputer_co.transform(train_draft_17[COLLEGE_ONLY_ALL].copy()), columns=COLLEGE_ONLY_ALL, index=train_draft_17.index)
X_draft_te_co = pd.DataFrame(knn_imputer_co.transform(test_draft_17[COLLEGE_ONLY_ALL].copy()), columns=COLLEGE_ONLY_ALL, index=test_draft_17.index)
X_draft_tr_co_scaled = scaler_co.transform(X_draft_tr_co)
X_draft_te_co_scaled = scaler_co.transform(X_draft_te_co)

ord_college_only = LogisticRegression(max_iter=2000, random_state=44, class_weight='balanced')
ord_college_only.fit(X_draft_tr_co_scaled, y_draft_tr17)
pred_ord_college_only = ord_college_only.predict(X_draft_te_co_scaled).astype(int).clip(0, 6)

rf_day_college_only = RandomForestClassifier(n_estimators=200, max_depth=2, random_state=42, class_weight='balanced')
rf_day_college_only.fit(X_draft_tr_co, y_draft_tr17)
pred_rf_day_college_only = rf_day_college_only.predict(X_draft_te_co).astype(int).clip(0, 6)

sample_weight_tr_co = compute_sample_weight('balanced', y_draft_tr17)
xgb_day_college_only = xgb.XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college_only.fit(X_draft_tr_co, y_draft_tr17, sample_weight=sample_weight_tr_co)
pred_xgb_day_college_only = xgb_day_college_only.predict(X_draft_te_co).astype(int).clip(0, 6)

print('College-only draft-ROUND models (7 classes):')
for name, pred in [('Ordinal (college-only)', pred_ord_college_only), ('RF (college-only)', pred_rf_day_college_only), ('XGB (college-only)', pred_xgb_day_college_only)]:
    print(f'  {name}: acc={(pred == y_draft_te17).mean():.4f}, Macro F1={f1_score(y_draft_te17, pred, average="macro", zero_division=0):.4f}')



College-only draft-ROUND models (7 classes):
  Ordinal (college-only): acc=0.2131, Macro F1=0.1935
  RF (college-only): acc=0.1967, Macro F1=0.1894
  XGB (college-only): acc=0.1475, Macro F1=0.1557


In [20]:
# XGBoost: draft ROUND (7 classes R1–R7)
sample_weight_tr = compute_sample_weight('balanced', y_draft_tr)
sample_weight_tr17 = compute_sample_weight('balanced', y_draft_tr17)
xgb_day_combine = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_combine.fit(X_draft_tr, y_draft_tr, sample_weight=sample_weight_tr)
pred_xgb_day_combine = xgb_day_combine.predict(X_draft_te).astype(int).clip(0, 6)
prob_xgb_day_combine = xgb_day_combine.predict_proba(X_draft_te)

xgb_day_college = xgb.XGBClassifier(n_estimators=200, max_depth=8, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college.fit(X_draft_tr17, y_draft_tr17, sample_weight=sample_weight_tr17)
pred_xgb_day_college = xgb_day_college.predict(X_draft_te17).astype(int).clip(0, 6)
prob_xgb_day_college = xgb_day_college.predict_proba(X_draft_te17)

prob_xgb_day_combined = (prob_xgb_day_combine + prob_xgb_day_college) / 2
pred_xgb_day_combined = np.argmax(prob_xgb_day_combined, axis=1)

for name, pred in [('XGBoost (combine-only)', pred_xgb_day_combine), ('XGBoost (combine+college)', pred_xgb_day_college), ('XGBoost combined', pred_xgb_day_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (R1..R7):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()



XGBoost (combine-only)
  Accuracy: 0.1475
  Confusion matrix (R1..R7):
 [[5 0 2 3 1 0 0]
 [5 1 4 1 0 0 2]
 [3 1 0 1 2 1 2]
 [3 1 1 1 1 1 1]
 [1 1 0 3 0 2 1]
 [1 1 1 0 0 1 0]
 [1 2 0 0 0 2 1]]
  Macro F1: 0.1257

XGBoost (combine+college)
  Accuracy: 0.1803
  Confusion matrix (R1..R7):
 [[1 2 2 3 1 0 2]
 [3 3 4 1 0 1 1]
 [2 0 2 2 0 0 4]
 [0 0 5 2 0 0 2]
 [0 0 0 1 2 2 3]
 [0 0 1 1 1 0 1]
 [0 2 1 2 0 0 1]]
  Macro F1: 0.1716

XGBoost combined
  Accuracy: 0.1967
  Confusion matrix (R1..R7):
 [[4 0 2 3 1 0 1]
 [4 2 4 2 0 0 1]
 [3 0 1 1 2 0 3]
 [1 1 2 3 0 1 1]
 [0 1 0 2 1 2 2]
 [1 0 0 1 1 0 1]
 [0 2 1 1 0 1 1]]
  Macro F1: 0.1708



In [21]:
# Round models: college+combine w/ agility (7 classes R1–R7, for drafted players with agility data)
X_draft_tr_ag = pd.DataFrame(knn_imputer_ag.transform(train_draft_17[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()), columns=FEATURES_WITH_COLLEGE_AGILITY_ALL, index=train_draft_17.index)
X_draft_te_ag = pd.DataFrame(knn_imputer_ag.transform(test_draft_17[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()), columns=FEATURES_WITH_COLLEGE_AGILITY_ALL, index=test_draft_17.index)
X_draft_tr_ag_scaled = scaler_ag.transform(X_draft_tr_ag)
X_draft_te_ag_scaled = scaler_ag.transform(X_draft_te_ag)

ord_college_agility = LogisticRegression(max_iter=2000, random_state=45, class_weight='balanced')
ord_college_agility.fit(X_draft_tr_ag_scaled, y_draft_tr17)
pred_ord_college_agility = ord_college_agility.predict(X_draft_te_ag_scaled).astype(int).clip(0, 6)

rf_day_college_agility = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42, class_weight='balanced')
rf_day_college_agility.fit(X_draft_tr_ag, y_draft_tr17)
pred_rf_day_college_agility = rf_day_college_agility.predict(X_draft_te_ag).astype(int).clip(0, 6)

sample_weight_tr_ag = compute_sample_weight('balanced', y_draft_tr17)
xgb_day_college_agility = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college_agility.fit(X_draft_tr_ag, y_draft_tr17, sample_weight=sample_weight_tr_ag)
pred_xgb_day_college_agility = xgb_day_college_agility.predict(X_draft_te_ag).astype(int).clip(0, 6)

print('College+combine w/ agility: draft ROUND models (R1–R7)')
for name, pred in [('Ordinal (college + combine w/ agility)', pred_ord_college_agility), ('RF (college + combine w/ agility)', pred_rf_day_college_agility), ('XGB (college + combine w/ agility)', pred_xgb_day_college_agility)]:
    print(f'  {name}: Macro F1={f1_score(y_draft_te17, pred, average="macro", zero_division=0):.4f}')



College+combine w/ agility: draft ROUND models (R1–R7)
  Ordinal (college + combine w/ agility): Macro F1=0.1532
  RF (college + combine w/ agility): Macro F1=0.1478
  XGB (college + combine w/ agility): Macro F1=0.1822


In [22]:
# Pipeline: best drafted model (by ROC) and best round model (by Macro F1) **per category**.
# Each prediction column uses ONLY models from that category: combine → combine only, college → college only, college+combine → college+combine only.

def _cat_from_draft_name(name):
    if 'agility' in name.lower(): return 'college+combine_agility'
    if 'college + combine' in name or 'college+combine' in name: return 'college+combine'
    if name.endswith('(college)'): return 'college'
    return 'combine'

# Round-model names follow the same naming scheme, so reuse the same mapping.
_cat_from_round_name = _cat_from_draft_name

# Map results_df display names to prediction-code names and category
_draft_display_to_pred_cat = {
    'Logistic (combine-only)': ('Logistic (combine)', 'combine'),
    'RF (combine-only)': ('RF (combine)', 'combine'),
    'XGBoost (combine-only)': ('XGB (combine)', 'combine'),
    'Logistic (combine+college)': ('Logistic (college + combine)', 'college+combine'),
    'RF (combine+college)': ('RF (college + combine)', 'college+combine'),
    'XGBoost (combine+college)': ('XGB (college + combine)', 'college+combine'),
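    # NOTE: the averaged ("combined") ensembles share a pred_name with the single college+combine model,
    # so if an ensemble has the top ROC-AUC in this category, the single model is what runs at prediction time.
    'Logistic combined': ('Logistic (college + combine)', 'college+combine'),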
    'RF combined': ('RF (college + combine)', 'college+combine'),
    'XGBoost combined': ('XGB (college + combine)', 'college+combine'),
}

draft_rows = []
for _, row in results_df.iterrows():
    pred_name, cat = _draft_display_to_pred_cat.get(row['Model'], (row['Model'], 'college+combine'))
    draft_rows.append({'pred_name': pred_name, 'category': cat, 'roc': row['ROC-AUC']})
# College-only draft models (not in results_df)
draft_rows.append({'pred_name': 'Logistic (college)', 'category': 'college', 'roc': roc_auc_score(y_test17, y_prob_college_only)})
draft_rows.append({'pred_name': 'RF (college)', 'category': 'college', 'roc': roc_auc_score(y_test17, y_prob_rf_co)})
draft_rows.append({'pred_name': 'XGB (college)', 'category': 'college', 'roc': roc_auc_score(y_test17, y_prob_xgb_co)})
# College+combine w/ agility draft models (for players with agility scores)
draft_rows.append({'pred_name': 'Logistic (college + combine w/ agility)', 'category': 'college+combine_agility', 'roc': roc_auc_score(y_test17, y_prob_ag)})
draft_rows.append({'pred_name': 'RF (college + combine w/ agility)', 'category': 'college+combine_agility', 'roc': roc_auc_score(y_test17, y_prob_rf_ag)})
draft_rows.append({'pred_name': 'XGB (college + combine w/ agility)', 'category': 'college+combine_agility', 'roc': roc_auc_score(y_test17, y_prob_xgb_ag)})

draft_df = pd.DataFrame(draft_rows)
best_drafted_by_cat = draft_df.loc[draft_df.groupby('category')['roc'].idxmax()].set_index('category')['pred_name'].to_dict()

# Round models: (pred_name, category, Macro F1). Use same-category test set (y_draft_te for combine, y_draft_te17 for college / college+combine).
round_rows = [
    ('Ordinal (combine)', 'combine', f1_score(y_draft_te, pred_ord_combine, average='macro', zero_division=0)),
    ('RF (combine)', 'combine', f1_score(y_draft_te, pred_rf_day_combine, average='macro', zero_division=0)),
    ('XGB (combine)', 'combine', f1_score(y_draft_te, pred_xgb_day_combine, average='macro', zero_division=0)),
    ('Ordinal (college)', 'college', f1_score(y_draft_te17, pred_ord_college_only, average='macro', zero_division=0)),
    ('RF (college)', 'college', f1_score(y_draft_te17, pred_rf_day_college_only, average='macro', zero_division=0)),
    ('XGB (college)', 'college', f1_score(y_draft_te17, pred_xgb_day_college_only, average='macro', zero_division=0)),
    ('Ordinal (college + combine)', 'college+combine', f1_score(y_draft_te17, pred_ord_college, average='macro', zero_division=0)),
    ('RF (college + combine)', 'college+combine', f1_score(y_draft_te17, pred_rf_day_college, average='macro', zero_division=0)),
    ('XGB (college + combine)', 'college+combine', f1_score(y_draft_te17, pred_xgb_day_college, average='macro', zero_division=0)),
    ('Ordinal (college + combine w/ agility)', 'college+combine_agility', f1_score(y_draft_te17, pred_ord_college_agility, average='macro', zero_division=0)),
    ('RF (college + combine w/ agility)', 'college+combine_agility', f1_score(y_draft_te17, pred_rf_day_college_agility, average='macro', zero_division=0)),
    ('XGB (college + combine w/ agility)', 'college+combine_agility', f1_score(y_draft_te17, pred_xgb_day_college_agility, average='macro', zero_division=0)),
]
round_df = pd.DataFrame(round_rows, columns=['pred_name', 'category', 'macro_f1'])
# Best round model **per category** (only round models in that category)
best_round_by_cat = round_df.loc[round_df.groupby('category')['macro_f1'].idxmax()].set_index('category')['pred_name'].to_dict()

# Optional: pipe_df for compatibility (same-category pairs only)
pipe_list = []
for cat in best_drafted_by_cat:
    pipe_list.append({'Drafted/Undrafted': best_drafted_by_cat[cat], 'Draft Day': best_round_by_cat[cat], 'category': cat})
pipe_df = pd.DataFrame(pipe_list)

print('Best drafted model per category (by ROC-AUC):', best_drafted_by_cat)
print('Best round model per category (by Macro F1, same category only):', best_round_by_cat)

Best drafted model per category (by ROC-AUC): {'college': 'RF (college)', 'college+combine': 'RF (college + combine)', 'college+combine_agility': 'RF (college + combine w/ agility)', 'combine': 'XGB (combine)'}
Best round model per category (by Macro F1, same category only): {'college': 'Ordinal (college)', 'college+combine': 'RF (college + combine)', 'college+combine_agility': 'XGB (college + combine w/ agility)', 'combine': 'RF (combine)'}


## Predict draft for a single player

For each player, the pipeline selects the **best drafted/undrafted model** (by ROC-AUC) and the **best round model** (by Macro F1 on drafted players) **independently** within the player's data category. There are three prediction columns: combine-only, college-only, and college+combine.
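
The per-category selection can be sketched without pandas. The example below is a minimal illustration, assuming the test ROC-AUC values from the ranking table earlier in the notebook for the six single models in the `combine` and `college+combine` categories. `best_per_category` is a hypothetical helper, not a function defined in the notebook.

```python
# (model name, category, test ROC-AUC), copied from the ranking table above.
candidates = [
    ("Logistic (combine)", "combine", 0.696209),
    ("RF (combine)", "combine", 0.674180),
    ("XGB (combine)", "combine", 0.700307),
    ("Logistic (college + combine)", "college+combine", 0.727459),
    ("RF (college + combine)", "college+combine", 0.748463),
    ("XGB (college + combine)", "college+combine", 0.728996),
]


def best_per_category(rows):
    """Return {category: name of the highest-scoring model in that category}."""
    best = {}
    for name, category, score in rows:
        # Strict '>' keeps the first model seen when scores tie.
        if category not in best or score > best[category][1]:
            best[category] = (name, score)
    return {category: name for category, (name, _) in best.items()}


best_drafted = best_per_category(candidates)
print(best_drafted)
```

The same helper works for round models by passing Macro F1 as the score, since only the ranking within each category matters.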

In [23]:
def _player_has_college_stats(player_dict):
    """True only if player has all three college stats (non-null). If any are missing, use combine-only models."""
    keys = ['QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season']
    for k in keys:
        v = player_dict.get(k, player_dict.get(k.replace('_', ' ')))
        if v is None or (isinstance(v, float) and np.isnan(v)):
            return False
    return True

def _get_val(player_dict, *keys, default=np.nan):
    for k in keys:
        if k in player_dict and player_dict[k] is not None:
            v = player_dict[k]
            if isinstance(v, float) and np.isnan(v):
                continue
            return v
    return default

def _height_inches(h):
    if h is None or (isinstance(h, float) and np.isnan(h)):
        return np.nan
    if isinstance(h, (int, float)):
        return float(h)
    if isinstance(h, str) and '-' in h:
        parts = h.strip().split('-')
        return int(parts[0]) * 12 + int(parts[1])
    return np.nan

# Map contains_* column name -> feature name for checking if player has that stat
CONTAINS_TO_FEATURE = {
    'contains_broad_jump': 'Broad Jump', 'contains_vertical': 'Vertical',
    'contains_40yd': '40yd', 'contains_height': 'Height', 'contains_weight': 'Weight',
    'contains_speed_score': 'speed_score', 'contains_explosive_score': 'explosive_score',
    'contains_agility_score': 'agility_score',
    'contains_qb_hurry_final_season': 'QB_Hurry_final_season', 'contains_tfl_final_season': 'TFL_final_season', 'contains_sacks_final_season': 'Sacks_final_season',
    'contains_p4_conference': 'School',
}

def _player_row(player_dict, feature_list, contains_list, medians, add_speed=True):
    """Build one row for the player with feature_list + contains_*; fill missing with medians."""
    row = {}
    for col in feature_list:
        v = _get_val(player_dict, col, col.replace(' ', '_').lower(), col.replace(' ', ''))
        if col == 'Height':
            v = _height_inches(v)
        if col == 'p4_conference':
            school = _get_val(player_dict, 'School', 'school', 'School')
            school_norm = school_alias.get(school, school) if pd.notna(school) and school else None
            year = _get_val(player_dict, 'Year', 'year', 'Year')
            year = int(year) if pd.notna(year) else 2025
            schools = P4_SCHOOLS if year <= 2023 else P4_SCHOOLS_NO_PAC12
            v = 1 if (school_norm and school_norm in schools) else 0
        if add_speed and col == 'speed_score' and pd.isna(v):
            w, forty = _get_val(player_dict, 'Weight'), _get_val(player_dict, '40yd')
            if pd.notna(w) and pd.notna(forty) and float(forty) > 0:
                # Speed score: weight * 200 / (40-yard time ** 4)
                v = float(w) * 200 / (float(forty) ** 4)
        row[col] = v if pd.notna(v) else medians.get(col, np.nan)
    for col in contains_list:
        feat = CONTAINS_TO_FEATURE.get(col, col.replace('contains_', '').replace('_', ' '))
        v = _get_val(player_dict, feat, feat.replace(' ', '_').lower() if isinstance(feat, str) else feat)
        row[col] = 1 if (v is not None and v is not np.nan and not (isinstance(v, float) and np.isnan(v))) else 0
    return pd.Series(row)

def _player_has_agility_stats(player_dict):
    """True if player has agility_score (non-null). Matches 2024/2025/2026 where agility is computed from 3Cone or Shuttle."""
    ag = _get_val(player_dict, 'agility_score', 'agility_score')
    return ag is not None and not (isinstance(ag, float) and np.isnan(ag))

def _player_has_combine_stats(player_dict):
    """True if player has at least 3 of 5 key combine metrics (40yd, Vertical, Broad Jump, Height, Weight)."""
    keys = ['40yd', 'Vertical', 'Broad Jump', 'Height', 'Weight']
    n_has = 0
    for k in keys:
        v = _get_val(player_dict, k, k.replace(' ', '_').lower(), k.replace(' ', ''))
        if k == 'Height':
            v = _height_inches(v) if v is not None else np.nan
        if v is not None and not (isinstance(v, float) and np.isnan(v)):
            n_has += 1
    return n_has >= 3

def get_best_pipeline_for_player(player_dict, pipe_df=None):
    """Select best drafted model and best round model *independently* per category (best ROC for draft, best Macro F1 for round)."""
    has_college = _player_has_college_stats(player_dict)
    has_combine = _player_has_combine_stats(player_dict)
    if not has_college:
        cat = 'combine'
    elif not has_combine:
        cat = 'college'
    elif _player_has_agility_stats(player_dict):
        cat = 'college+combine_agility'
    else:
        cat = 'college+combine'
    best_drafted = best_drafted_by_cat.get(cat, best_drafted_by_cat.get('college+combine'))
    best_round = best_round_by_cat.get(cat, best_round_by_cat.get('college+combine'))
    return best_drafted, best_round

def _run_drafted_model(drafted_name, row_combine, row_full, row_college_only=None, row_full_agility=None):
    """Return P(drafted) in [0,1] for the given model name."""
    if row_college_only is None:
        rc = row_full.reindex(COLLEGE_ONLY_ALL)
        row_college_only = pd.Series(knn_imputer_co.transform(rc.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    if row_full_agility is None:
        ra = row_full.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
        row_full_agility = pd.Series(knn_imputer_ag.transform(ra.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    if drafted_name == 'Logistic (combine)':
        return logit_draft.predict_proba(scaler.transform(row_combine.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college + combine)':
        return logit_draft_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college + combine w/ agility)':
        return logit_draft_college_agility.predict_proba(scaler_ag.transform(row_full_agility.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college)':
        return logit_draft_college_only.predict_proba(scaler_co.transform(row_college_only.to_frame().T))[0, 1]
    if drafted_name == 'RF (combine)':
        return rf_combine.predict_proba(row_combine.to_frame().T)[0, 1]
    if drafted_name == 'RF (college + combine)':
        return rf_college.predict_proba(row_full.to_frame().T)[0, 1]
    if drafted_name == 'RF (college + combine w/ agility)':
        return rf_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 1]
    if drafted_name == 'RF (college)':
        return rf_college_only.predict_proba(row_college_only.to_frame().T)[0, 1]
    if drafted_name == 'XGB (combine)':
        return xgb_combine.predict_proba(row_combine.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college + combine)':
        return xgb_college.predict_proba(row_full.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college + combine w/ agility)':
        return xgb_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college)':
        return xgb_college_only.predict_proba(row_college_only.to_frame().T)[0, 1]
    return 0.0

def _run_day_model(day_name, row_combine, row_full, row_college_only=None, row_full_agility=None):
    """Return draft round 1-7 from the round model (model outputs 0..6, we return +1)."""
    if row_college_only is None:
        rc = row_full.reindex(COLLEGE_ONLY_ALL)
        row_college_only = pd.Series(knn_imputer_co.transform(rc.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    if row_full_agility is None:
        ra = row_full.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
        row_full_agility = pd.Series(knn_imputer_ag.transform(ra.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    pred = 0
    if day_name == 'Ordinal (combine)':
        pred = int(np.clip(ord_combine.predict(scaler.transform(row_combine.to_frame().T))[0], 0, 6))
    elif day_name == 'Ordinal (college + combine)':
        pred = int(np.clip(ord_college.predict(scaler17.transform(row_full.to_frame().T))[0], 0, 6))
    elif day_name == 'Ordinal (college + combine w/ agility)':
        pred = int(np.clip(ord_college_agility.predict(scaler_ag.transform(row_full_agility.to_frame().T))[0], 0, 6))
    elif day_name == 'Ordinal (college)':
        pred = int(np.clip(ord_college_only.predict(scaler_co.transform(row_college_only.to_frame().T))[0], 0, 6))
    elif day_name == 'RF (combine)':
        pred = int(np.clip(rf_day_combine.predict(row_combine.to_frame().T)[0], 0, 6))
    elif day_name == 'RF (college + combine)':
        pred = int(np.clip(rf_day_college.predict(row_full.to_frame().T)[0], 0, 6))
    elif day_name == 'RF (college + combine w/ agility)':
        pred = int(np.clip(rf_day_college_agility.predict(row_full_agility.to_frame().T)[0], 0, 6))
    elif day_name == 'RF (college)':
        pred = int(np.clip(rf_day_college_only.predict(row_college_only.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (combine)':
        pred = int(np.clip(xgb_day_combine.predict(row_combine.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (college + combine)':
        pred = int(np.clip(xgb_day_college.predict(row_full.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (college + combine w/ agility)':
        pred = int(np.clip(xgb_day_college_agility.predict(row_full_agility.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (college)':
        pred = int(np.clip(xgb_day_college_only.predict(row_college_only.to_frame().T)[0], 0, 6))
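    else:
        # Without this branch an unrecognized name keeps pred = 0 and is silently reported as Round 1.
        raise ValueError(f'Unknown round model: {day_name!r}')
    return pred + 1  # 1-7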

def _get_round1_prob(day_name, row_combine, row_full, row_college_only=None, row_full_agility=None):
    """Return P(Round 1) = probability of class 0 from the round model."""
    if row_college_only is None:
        rc = row_full.reindex(COLLEGE_ONLY_ALL)
        row_college_only = pd.Series(knn_imputer_co.transform(rc.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    if row_full_agility is None:
        ra = row_full.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
        row_full_agility = pd.Series(knn_imputer_ag.transform(ra.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    if day_name == 'Ordinal (combine)':
        return ord_combine.predict_proba(scaler.transform(row_combine.to_frame().T))[0, 0]
    if day_name == 'Ordinal (college + combine)':
        return ord_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0, 0]
    if day_name == 'Ordinal (college + combine w/ agility)':
        return ord_college_agility.predict_proba(scaler_ag.transform(row_full_agility.to_frame().T))[0, 0]
    if day_name == 'Ordinal (college)':
        return ord_college_only.predict_proba(scaler_co.transform(row_college_only.to_frame().T))[0, 0]
    if day_name == 'RF (combine)':
        return rf_day_combine.predict_proba(row_combine.to_frame().T)[0, 0]
    if day_name == 'RF (college + combine)':
        return rf_day_college.predict_proba(row_full.to_frame().T)[0, 0]
    if day_name == 'RF (college + combine w/ agility)':
        return rf_day_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 0]
    if day_name == 'RF (college)':
        return rf_day_college_only.predict_proba(row_college_only.to_frame().T)[0, 0]
    if day_name == 'XGB (combine)':
        return xgb_day_combine.predict_proba(row_combine.to_frame().T)[0, 0]
    if day_name == 'XGB (college + combine)':
        return xgb_day_college.predict_proba(row_full.to_frame().T)[0, 0]
    if day_name == 'XGB (college + combine w/ agility)':
        return xgb_day_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 0]
    if day_name == 'XGB (college)':
        return xgb_day_college_only.predict_proba(row_college_only.to_frame().T)[0, 0]
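    # An unrecognized name previously fell through to P(Round 1) = 0.0 without any warning.
    raise ValueError(f'Unknown round model: {day_name!r}')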

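# Round 1 override used by predict_draft: when P(Round 1) >= this value, predict Round 1
# even if the round model's own prediction is a later round.
R1_PROB_THRESHOLD = 0.28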

def predict_draft(player_dict, pipe_df=None, category=None):
    """
    Predict drafted/undrafted and (if drafted) draft round (1-7) for one player.
    category: if provided ('combine', 'college', 'college+combine', 'college+combine_agility'), use that category's best models; else infer from player data.
    Returns: dict with drafted (bool), draft_round (1-7 or None), drafted_model, day_model, prob_drafted.
    """
    if category is not None:
        drafted_name = best_drafted_by_cat.get(category, best_drafted_by_cat.get('college+combine'))
        day_name = best_round_by_cat.get(category, best_round_by_cat.get('college+combine'))
    else:
        if pipe_df is None:
            pipe_df = globals().get('pipe_df')
        if pipe_df is None:
            raise ValueError('Run the pipeline comparison cell first to create pipe_df, or pass pipe_df.')
        drafted_name, day_name = get_best_pipeline_for_player(player_dict, pipe_df)
    row_combine = _player_row(player_dict, COMBINE_ONLY_FEATURES, COMBINE_ONLY_CONTAINS, train_medians)
    row_combine = row_combine.reindex(COMBINE_ONLY_ALL)
    row_combine = pd.Series(knn_imputer_combine.transform(row_combine.values.reshape(1, -1))[0], index=COMBINE_ONLY_ALL)
    row_full = _player_row(player_dict, FEATURES_WITH_COLLEGE, CONTAINS_WITH_COLLEGE, train_medians17)
    row_full = row_full.reindex(FEATURES_WITH_COLLEGE_ALL)
    row_full = pd.Series(knn_imputer17.transform(row_full.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_ALL)
    row_full_agility = _player_row(player_dict, FEATURES_WITH_COLLEGE_AGILITY, CONTAINS_WITH_COLLEGE_AGILITY, train_medians_ag)
    row_full_agility = row_full_agility.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
    row_full_agility = pd.Series(knn_imputer_ag.transform(row_full_agility.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    row_college_only = _player_row(player_dict, COLLEGE_ONLY_FEATURES, COLLEGE_ONLY_CONTAINS, train_medians_co, add_speed=False)
    row_college_only = row_college_only.reindex(COLLEGE_ONLY_ALL)
    row_college_only = pd.Series(knn_imputer_co.transform(row_college_only.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    prob_drafted = _run_drafted_model(drafted_name, row_combine, row_full, row_college_only, row_full_agility)
    drafted = bool(prob_drafted >= 0.5)  # plain Python bool, as the docstring promises
    draft_round = None
    if drafted:
        round_pred = _run_day_model(day_name, row_combine, row_full, row_college_only, row_full_agility)
        prob_r1 = _get_round1_prob(day_name, row_combine, row_full, row_college_only, row_full_agility)
        if prob_r1 >= R1_PROB_THRESHOLD:
            draft_round = 1
        else:
            draft_round = int(np.clip(round_pred, 1, 7))
    return {
        'drafted': drafted,
        'draft_round': draft_round,
        'drafted_model': drafted_name,
        'day_model': day_name,
        'prob_drafted': float(prob_drafted),
    }
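
The decision rule inside `predict_draft` can be read separately from the model plumbing. The sketch below restates it as a pure function. `decide_round` and its argument names are illustrative and are not defined in the notebook, and the probabilities in the usage lines are made-up inputs rather than model output.

```python
def decide_round(prob_drafted, prob_r1, round_pred,
                 drafted_threshold=0.5, r1_threshold=0.28):
    """Return None for an undrafted call, otherwise a round in 1..7.

    Mirrors predict_draft: stage 1 thresholds P(drafted); stage 2 promotes the
    player to Round 1 when P(Round 1) clears r1_threshold, and otherwise keeps
    the round model's own prediction (clipped to 1..7).
    """
    if prob_drafted < drafted_threshold:
        return None
    if prob_r1 >= r1_threshold:
        return 1
    return max(1, min(7, int(round_pred)))


# Made-up inputs, for illustration only.
print(decide_round(0.40, 0.90, 3))  # undrafted call: the round stage is skipped
print(decide_round(0.70, 0.30, 4))  # P(Round 1) clears 0.28, so the result is Round 1
print(decide_round(0.70, 0.10, 4))  # keeps the round model's prediction
```

With seven round classes, a threshold of 0.28 means the override can fire even when another round has a higher predicted probability.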

In [24]:
# 2024 drafted edges: compute speed_score, explosive_score, agility_score; run combine/college/college+combine predictions
# Requires prior cells (train_df, predict_draft, pipe_df, etc.) to be run.

edges_2024 = pd.read_csv('edges_drafted_2024.csv')
# Height may be a feet-inches string (e.g. '6-5') or already numeric inches
if edges_2024['Height'].dtype == object or (edges_2024['Height'].notna() & (edges_2024['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return pd.to_numeric(s, errors='coerce')  # numeric strings such as '77' are already inches
    edges_2024['Height'] = edges_2024['Height'].apply(_ht_inches)
else:
    edges_2024['Height'] = pd.to_numeric(edges_2024['Height'], errors='coerce')

# 1) Speed score
edges_2024['speed_score'] = np.where(
    edges_2024['40yd'].notna() & (edges_2024['40yd'] > 0),
    edges_2024['Weight'] * 200 / (edges_2024['40yd'] ** 4),
    np.nan
)

# 2) Explosive score: sum of the Vertical and Broad Jump z-scores (mean/std from training DEs).
# A single missing test contributes 0, i.e. the training mean. The score is NaN only when BOTH tests
# are missing, so a player with no jump data is not scored as "average" (0) and mis-predicted.
tr_de = train_df[train_df['Pos'] == 'DE']
mean_v = tr_de['Vertical'].mean()
std_v = tr_de['Vertical'].std()
mean_b = tr_de['Broad Jump'].mean()
std_b = tr_de['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v):
    std_v = 1.0
if std_b == 0 or np.isnan(std_b):
    std_b = 1.0
v_z = (edges_2024['Vertical'] - mean_v) / std_v
b_z = (edges_2024['Broad Jump'] - mean_b) / std_b
has_explosive = edges_2024['Vertical'].notna() | edges_2024['Broad Jump'].notna()
edges_2024['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

# 3) Agility score: sum of the negated 3Cone and Shuttle z-scores (lower time = better; mean/std from training DEs).
# A single missing drill contributes 0, i.e. the training mean. The score is NaN only when BOTH drills
# are missing, so 0 is never mistaken for real data.
mean_3 = tr_de['3Cone'].mean()
std_3 = tr_de['3Cone'].std()
mean_sh = tr_de['Shuttle'].mean()
std_sh = tr_de['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3):
    std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh):
    std_sh = 1.0
z_3 = (edges_2024['3Cone'] - mean_3) / std_3
z_sh = (edges_2024['Shuttle'] - mean_sh) / std_sh
has_agility = edges_2024['3Cone'].notna() | edges_2024['Shuttle'].notna()
edges_2024['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

# 4) Run model on each 2024 edge: combine-only, college-only, college+combine (best models independently)
def row_to_player_dict(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row['Shuttle'], '3Cone': row['3Cone'],
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'speed_score': row['speed_score'], 'explosive_score': row['explosive_score'], 'agility_score': row['agility_score'],
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2024),
    }

pred_combine, model_combine = [], []
pred_college_only, model_college_only = [], []
pred_college_combine, model_college_combine = [], []
for _, row in edges_2024.iterrows():
    row_dict = row_to_player_dict(row)
    has_combine = _player_has_combine_stats(row_dict)
    has_college = _player_has_college_stats(row_dict)
    has_agility = _player_has_agility_stats(row_dict)
    if has_combine:
        out = predict_draft(row_dict, category='combine')
        pred_combine.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_combine.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_combine.append('—')
        model_combine.append('—')
    if has_college:
        out = predict_draft(row_dict, category='college')
        pred_college_only.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_only.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_only.append('—')
        model_college_only.append('—')
    if has_combine and has_college:
        cat = 'college+combine_agility' if has_agility else 'college+combine'
        out = predict_draft(row_dict, category=cat)
        pred_college_combine.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_combine.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_combine.append('—')
        model_college_combine.append('—')
edges_2024['prediction_combine'] = pred_combine
edges_2024['model_combine'] = model_combine
edges_2024['prediction_college_only'] = pred_college_only
edges_2024['model_college_only'] = model_college_only
edges_2024['prediction_college_combine'] = pred_college_combine
edges_2024['model_college_combine'] = model_college_combine

edges_2024

Unnamed: 0,Round,Pick,Player,Pos,School,Year,Height,Weight,40yd,Vertical,...,QB_Hurry_final_season,speed_score,explosive_score,agility_score,prediction_combine,model_combine,prediction_college_only,model_college_only,prediction_college_combine,model_college_combine
0,1,15,Laiatu Latu,DE,UCLA,2024,77.0,259.0,4.64,32.0,...,10.0,111.752652,-0.181657,,Round 1,XGB (combine) + RF (combine),Round 1,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
1,1,17,Dallas Turner,DE,Alabama,2024,74.0,247.0,4.47,40.5,...,13.0,123.736223,4.424927,,Round 1,XGB (combine) + RF (combine),Round 4,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
2,1,19,Jared Verse,DE,Florida State,2024,76.0,254.0,4.58,35.0,...,7.0,115.45209,2.586193,,Round 3,XGB (combine) + RF (combine),Round 2,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
3,1,21,Demeioun Robinson,DE,Penn State,2024,75.0,254.0,4.48,34.5,...,5.0,126.110619,2.579481,,Round 1,XGB (combine) + RF (combine),Undrafted,RF (college),Round 2,RF (college + combine) + RF (college + combine)
4,2,56,Marshawn Kneeland,DE,Western Michigan,2024,75.0,267.0,4.75,35.5,...,8.0,104.897906,1.469784,,Round 3,XGB (combine) + RF (combine),Round 7,RF (college) + Ordinal (college),Round 2,RF (college + combine) + RF (college + combine)
5,2,57,Chris Braswell,DE,Alabama,2024,75.0,251.0,4.6,33.5,...,5.0,112.117238,0.15937,,Round 3,XGB (combine) + RF (combine),Round 6,RF (college) + Ordinal (college),Round 2,RF (college + combine) + RF (college + combine)
6,3,74,Bralen Trice,DE,Washington,2024,75.5,245.0,4.72,,...,15.0,98.725214,,,Round 5,XGB (combine) + RF (combine),Undrafted,RF (college),Round 7,RF (college + combine) + RF (college + combine)
7,3,76,Jonah Elliss,DE,Utah,2024,74.0,248.0,,,...,3.0,,,,—,—,Round 2,RF (college) + Ordinal (college),—,—
8,3,94,Jalyx Hunt,DE,Houston Christian,2024,76.0,252.0,4.64,37.5,...,0.0,108.73231,3.582427,,Round 6,XGB (combine) + RF (combine),Undrafted,RF (college),Round 6,RF (college + combine) + RF (college + combine)
9,5,138,Xavier Thomas,DE,Clemson,2024,74.0,244.0,4.62,32.5,...,10.0,107.115401,0.627284,,Round 2,XGB (combine) + RF (combine),Undrafted,RF (college),Round 7,RF (college + combine) + RF (college + combine)
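
As a check on the `speed_score` column, the formula from step 1 can be applied by hand to the first two rows of the 2024 output above. This is a minimal sketch: `speed_score` here is a standalone scalar function written for illustration, not the vectorized `np.where` expression the cell uses.

```python
import math


def speed_score(weight_lb, forty_s):
    """weight * 200 / forty**4; NaN when the 40 time is missing or non-positive."""
    if forty_s is None or math.isnan(forty_s) or forty_s <= 0:
        return math.nan
    return weight_lb * 200 / forty_s ** 4


# Weight and 40yd values copied from the 2024 output (Laiatu Latu, Dallas Turner).
latu = speed_score(259.0, 4.64)
turner = speed_score(247.0, 4.47)
print(round(latu, 3), round(turner, 3))  # compare with the speed_score column

# A player with no recorded 40 time (e.g. Jonah Elliss) gets NaN, not 0.
print(speed_score(248.0, math.nan))
```

Because the 40 time enters at the fourth power, a few hundredths of a second move the score by several points, while the same relative change in weight has a much smaller effect.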


In [25]:
# 2025 drafted edges: same as 2024 — speed_score, explosive_score, agility_score + combine/college/college+combine predictions
# Requires prior cells (train_df, predict_draft, pipe_df, etc.) to be run.

edges_2025 = pd.read_csv('edges_drafted_2025.csv')
# Height may be a feet-inches string (e.g. '6-5') or already numeric inches
if edges_2025['Height'].dtype == object or (edges_2025['Height'].notna() & (edges_2025['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return pd.to_numeric(s, errors='coerce')  # numeric strings such as '77' are already inches
    edges_2025['Height'] = edges_2025['Height'].apply(_ht_inches)
else:
    edges_2025['Height'] = pd.to_numeric(edges_2025['Height'], errors='coerce')

# 1) Speed score
edges_2025['speed_score'] = np.where(
    edges_2025['40yd'].notna() & (edges_2025['40yd'] > 0),
    edges_2025['Weight'] * 200 / (edges_2025['40yd'] ** 4),
    np.nan
)

# 2) Explosive score (z-scores from training DEs: Vertical + Broad Jump)
tr_de = train_df[train_df['Pos'] == 'DE']
mean_v = tr_de['Vertical'].mean()
std_v = tr_de['Vertical'].std()
mean_b = tr_de['Broad Jump'].mean()
std_b = tr_de['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v):
    std_v = 1.0
if std_b == 0 or np.isnan(std_b):
    std_b = 1.0
v_z = (edges_2025['Vertical'] - mean_v) / std_v
b_z = (edges_2025['Broad Jump'] - mean_b) / std_b
has_explosive = edges_2025['Vertical'].notna() | edges_2025['Broad Jump'].notna()
edges_2025['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

# 3) Agility score (z-scores from training DEs)
mean_3 = tr_de['3Cone'].mean()
std_3 = tr_de['3Cone'].std()
mean_sh = tr_de['Shuttle'].mean()
std_sh = tr_de['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3):
    std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh):
    std_sh = 1.0
z_3 = (edges_2025['3Cone'] - mean_3) / std_3
z_sh = (edges_2025['Shuttle'] - mean_sh) / std_sh
has_agility = edges_2025['3Cone'].notna() | edges_2025['Shuttle'].notna()
edges_2025['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

# 4) Run model on each 2025 edge: combine-only, college-only, college+combine (best models independently)
def row_to_player_dict_2025(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row['Shuttle'], '3Cone': row['3Cone'],
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'speed_score': row['speed_score'], 'explosive_score': row['explosive_score'], 'agility_score': row['agility_score'],
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2025),
    }

pred_combine_2025, model_combine_2025 = [], []
pred_college_only_2025, model_college_only_2025 = [], []
pred_college_combine_2025, model_college_combine_2025 = [], []
for _, row in edges_2025.iterrows():
    row_dict = row_to_player_dict_2025(row)
    has_combine = _player_has_combine_stats(row_dict)
    has_college = _player_has_college_stats(row_dict)
    has_agility = _player_has_agility_stats(row_dict)
    if has_combine:
        out = predict_draft(row_dict, category='combine')
        pred_combine_2025.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_combine_2025.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_combine_2025.append('—')
        model_combine_2025.append('—')
    if has_college:
        out = predict_draft(row_dict, category='college')
        pred_college_only_2025.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_only_2025.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_only_2025.append('—')
        model_college_only_2025.append('—')
    if has_combine and has_college:
        cat = 'college+combine_agility' if has_agility else 'college+combine'
        out = predict_draft(row_dict, category=cat)
        pred_college_combine_2025.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_combine_2025.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_combine_2025.append('—')
        model_college_combine_2025.append('—')
edges_2025['prediction_combine'] = pred_combine_2025
edges_2025['model_combine'] = model_combine_2025
edges_2025['prediction_college_only'] = pred_college_only_2025
edges_2025['model_college_only'] = model_college_only_2025
edges_2025['prediction_college_combine'] = pred_college_combine_2025
edges_2025['model_college_combine'] = model_college_combine_2025

edges_2025

Unnamed: 0,Round,Pick,Player,Pos,School,Year,Height,Weight,40yd,Vertical,...,QB_Hurry_final_season,speed_score,explosive_score,agility_score,prediction_combine,model_combine,prediction_college_only,model_college_only,prediction_college_combine,model_college_combine
0,1,3,Abdul Carter,DE,Penn State,2025,75,252,4.5,,...,9.0,122.908093,1.845127,,Round 1,XGB (combine) + RF (combine),Round 1,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
1,1,11,Mykel Williams,DE,Georgia,2025,77,260,4.73,35.5,...,3.0,103.88642,2.432459,,Round 3,XGB (combine) + RF (combine),Round 2,RF (college) + Ordinal (college),Round 2,RF (college + combine) + RF (college + combine)
2,1,15,Jalon Walker,DE,Georgia,2025,73,243,4.5,,...,7.0,118.518519,-0.882452,,Round 6,XGB (combine) + RF (combine),Undrafted,RF (college),Round 5,RF (college + combine) + RF (college + combine)
3,1,17,Shemar Stewart,DE,Texas A&M,2025,77,267,4.59,40.0,...,7.0,120.306894,4.899553,,Round 1,XGB (combine) + RF (combine),Round 7,RF (college) + Ordinal (college),Round 7,RF (college + combine) + RF (college + combine)
4,1,26,James Pearce Jr.,DE,Tennessee,2025,77,245,4.47,31.0,...,10.0,122.734311,0.607148,,Round 1,XGB (combine) + RF (combine),Round 1,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
5,2,44,Donovan Ezeiruaku,DE,Boston College,2025,74,248,4.62,35.5,...,14.0,108.871392,1.469784,2.929873,Undrafted,XGB (combine),Round 1,RF (college) + Ordinal (college),Round 2,RF (college + combine w/ agility) + XGB (colle...
6,2,45,J.T. Tuimoloau,DE,Ohio State,2025,76,265,4.62,35.5,...,5.0,116.33435,1.469784,,Round 1,XGB (combine) + RF (combine),Round 1,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
7,2,51,Nic Scourton,DE,Texas A&M,2025,75,257,4.59,,...,,115.801018,,,Round 4,XGB (combine) + RF (combine),—,—,—,—
8,2,52,Oluwafemi Oladejo,DE,UCLA,2025,75,259,4.7,36.5,...,6.0,106.15448,1.964545,,Round 2,XGB (combine) + RF (combine),Round 5,RF (college) + Ordinal (college),Round 2,RF (college + combine) + RF (college + combine)
9,2,59,Mike Green,DE,Marshall,2025,75,251,4.57,,...,,115.090352,,2.981869,Round 3,XGB (combine) + RF (combine),—,—,—,—
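
The explosive and agility scores are built the same way in every per-year cell: sum the z-scores, let a missing test contribute 0, and return NaN only when every test is missing. One possible way to express that once is sketched below. `composite_z`, the `stats` dictionary, and the numbers in it are hypothetical stand-ins and are not the notebook's training statistics.

```python
import numpy as np
import pandas as pd


def composite_z(df, cols, stats, sign=1.0):
    """Sum of signed z-scores over `cols`; NaN only when every column is missing."""
    total = pd.Series(0.0, index=df.index)
    any_present = pd.Series(False, index=df.index)
    for col in cols:
        mean, std = stats[col]
        if std == 0 or np.isnan(std):
            std = 1.0  # same guard the notebook applies to degenerate std values
        z = sign * (df[col] - mean) / std
        total = total + z.fillna(0.0)  # a missing test counts as the training mean
        any_present = any_present | df[col].notna()
    return total.where(any_present, np.nan)


# Hypothetical (mean, std) pairs, chosen only to keep the arithmetic easy to follow.
stats = {'Vertical': (33.0, 3.0), 'Broad Jump': (118.0, 6.0),
         '3Cone': (7.2, 0.2), 'Shuttle': (4.4, 0.15)}
demo = pd.DataFrame({
    'Vertical':   [36.0, np.nan, np.nan],
    'Broad Jump': [124.0, 124.0, np.nan],
    '3Cone':      [7.0, np.nan, np.nan],
    'Shuttle':    [np.nan, np.nan, np.nan],
})
demo['explosive_score'] = composite_z(demo, ['Vertical', 'Broad Jump'], stats)
demo['agility_score'] = composite_z(demo, ['3Cone', 'Shuttle'], stats, sign=-1.0)  # lower time = better
print(demo[['explosive_score', 'agility_score']])
```

One consequence of this rule is that a player who skipped one of the two tests is scored as exactly average on it, so composites built from a single test tend to sit closer to 0 than composites built from both.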


In [26]:
# 2026 drafted edges: same as 2024/2025 — speed_score, explosive_score, agility_score + combine/college/college+combine predictions
# Requires prior cells (train_df, predict_draft, pipe_df, etc.) to be run.

edges_2026 = pd.read_csv('edges_drafted_2026.csv')
# Height may be a feet-inches string (e.g. '6-5') or already numeric inches
if edges_2026['Height'].dtype == object or (edges_2026['Height'].notna() & (edges_2026['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return pd.to_numeric(s, errors='coerce')  # numeric strings such as '77' are already inches
    edges_2026['Height'] = edges_2026['Height'].apply(_ht_inches)
else:
    edges_2026['Height'] = pd.to_numeric(edges_2026['Height'], errors='coerce')

# 1) Speed score
edges_2026['speed_score'] = np.where(
    edges_2026['40yd'].notna() & (edges_2026['40yd'] > 0),
    edges_2026['Weight'] * 200 / (edges_2026['40yd'] ** 4),
    np.nan
)

# 2) Explosive score (z-scores from training DEs: Vertical + Broad Jump)
tr_de = train_df[train_df['Pos'] == 'DE']
mean_v = tr_de['Vertical'].mean()
std_v = tr_de['Vertical'].std()
mean_b = tr_de['Broad Jump'].mean()
std_b = tr_de['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v):
    std_v = 1.0
if std_b == 0 or np.isnan(std_b):
    std_b = 1.0
v_z = (edges_2026['Vertical'] - mean_v) / std_v
b_z = (edges_2026['Broad Jump'] - mean_b) / std_b
has_explosive = edges_2026['Vertical'].notna() | edges_2026['Broad Jump'].notna()
edges_2026['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

# 3) Agility score (z-scores from training DEs)
mean_3 = tr_de['3Cone'].mean()
std_3 = tr_de['3Cone'].std()
mean_sh = tr_de['Shuttle'].mean()
std_sh = tr_de['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3):
    std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh):
    std_sh = 1.0
z_3 = (edges_2026['3Cone'] - mean_3) / std_3
z_sh = (edges_2026['Shuttle'] - mean_sh) / std_sh
has_agility = edges_2026['3Cone'].notna() | edges_2026['Shuttle'].notna()
edges_2026['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

# 4) Run model on each 2026 edge: combine-only, college-only, college+combine (best models independently)
def row_to_player_dict_2026(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row['Shuttle'], '3Cone': row['3Cone'],
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'speed_score': row['speed_score'], 'explosive_score': row['explosive_score'], 'agility_score': row['agility_score'],
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2026),
    }

pred_combine_2026, model_combine_2026 = [], []
pred_college_only_2026, model_college_only_2026 = [], []
pred_college_combine_2026, model_college_combine_2026 = [], []
for _, row in edges_2026.iterrows():
    row_dict = row_to_player_dict_2026(row)
    has_combine = _player_has_combine_stats(row_dict)
    has_college = _player_has_college_stats(row_dict)
    has_agility = _player_has_agility_stats(row_dict)
    if has_combine:
        out = predict_draft(row_dict, category='combine')
        pred_combine_2026.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_combine_2026.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_combine_2026.append('—')
        model_combine_2026.append('—')
    if has_college:
        out = predict_draft(row_dict, category='college')
        pred_college_only_2026.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_only_2026.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_only_2026.append('—')
        model_college_only_2026.append('—')
    if has_combine and has_college:
        cat = 'college+combine_agility' if has_agility else 'college+combine'
        out = predict_draft(row_dict, category=cat)
        pred_college_combine_2026.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_combine_2026.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_combine_2026.append('—')
        model_college_combine_2026.append('—')
edges_2026['prediction_combine'] = pred_combine_2026
edges_2026['model_combine'] = model_combine_2026
edges_2026['prediction_college_only'] = pred_college_only_2026
edges_2026['model_college_only'] = model_college_only_2026
edges_2026['prediction_college_combine'] = pred_college_combine_2026
edges_2026['model_college_combine'] = model_college_combine_2026

edges_2026

Unnamed: 0,Round,Pick,Player,Pos,School,Year,Height,Weight,40yd,Vertical,...,QB_Hurry_final_season,speed_score,explosive_score,agility_score,prediction_combine,model_combine,prediction_college_only,model_college_only,prediction_college_combine,model_college_combine
0,1,1,Rueben Bain Jr.,DE,Miami,2026,75.0,276.0,4.72,,...,5.0,111.216976,,,Round 4,XGB (combine) + RF (combine),Round 2,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
1,1,9,T.J. Parker,DE,Clemson,2026,75.0,260.0,,,...,7.0,,,,—,—,Round 7,RF (college) + Ordinal (college),—,—
2,1,10,Keldric Faulk,DE,Auburn,2026,78.0,285.0,,,...,6.0,,,,—,—,Round 7,RF (college) + Ordinal (college),—,—
3,1,12,David Bailey,DE,Texas Tech,2026,75.0,250.0,4.52,,...,10.0,119.788814,,,Round 3,XGB (combine) + RF (combine),Round 1,RF (college) + Ordinal (college),Round 1,RF (college + combine) + RF (college + combine)
4,1,17,Cashius Howell,DE,Texas A&M,2026,74.0,248.0,,,...,5.0,,,,—,—,Round 2,RF (college) + Ordinal (college),—,—
5,1,26,Romello Height,DE,Texas Tech,2026,75.0,240.0,,,...,13.0,,,,—,—,Round 6,RF (college) + Ordinal (college),—,—
6,1,30,R Mason Thomas,DE,Oklahoma,2026,74.0,249.0,,,...,3.0,,,,—,—,Undrafted,RF (college),—,—
7,2,37,Joshua Josephs,DE,Tennessee,2026,75.0,243.0,,,...,6.0,,,,—,—,Undrafted,RF (college),—,—
8,2,38,Quincy Rhodes,DE,Arkansas,2026,78.0,276.0,,,...,6.0,,,,—,—,Round 1,RF (college) + Ordinal (college),—,—
9,2,49,Matayo Uiagalelei,DE,Oregon,2026,77.0,270.0,,,...,7.0,,,,—,—,Round 2,RF (college) + Ordinal (college),—,—
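
Since each of these tables holds the actual `Round` next to three prediction columns, the label strings can be turned back into numbers and scored. The sketch below shows one way to do that. `parse_round` and `summarize` are hypothetical helpers, the sample lists are made-up values rather than rows from the tables, and reporting an "Undrafted" call as a separate count (instead of assigning it a round number) is an assumption of this sketch.

```python
import math

PLACEHOLDER = '\u2014'  # the dash the notebook writes when a category has no data


def parse_round(label):
    """Return the predicted round as an int, 'undrafted', or None when no prediction was made."""
    if not isinstance(label, str):
        return None
    if label.startswith('Round '):
        return int(label.split()[1])
    if label == 'Undrafted':
        return 'undrafted'
    return None  # placeholder or any other unrecognized label


def summarize(actual_rounds, labels):
    """Score prediction labels against actual rounds for players who were all drafted."""
    errors, undrafted, skipped = [], 0, 0
    for actual, label in zip(actual_rounds, labels):
        pred = parse_round(label)
        if pred is None:
            skipped += 1
        elif pred == 'undrafted':
            undrafted += 1
        else:
            errors.append(abs(pred - actual))
    return {
        'mean_abs_round_error': sum(errors) / len(errors) if errors else math.nan,
        'exact': sum(e == 0 for e in errors),
        'predicted_undrafted': undrafted,
        'no_prediction': skipped,
    }


# Made-up example: four drafted players and one prediction column.
actual = [1, 1, 2, 3]
labels = ['Round 1', 'Round 3', 'Undrafted', PLACEHOLDER]
result = summarize(actual, labels)
print(result)
```

In the notebook this could be called as `summarize(edges_2026['Round'], edges_2026['prediction_college_only'])`, and likewise for the other prediction columns and years.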
