## Combine Analysis: Defensive Tackles

Which Combine tests have the most potential influence on a player's chances of being drafted and on their draft position?

Our training dataset is Combine data from 2010-2020, and our testing dataset covers 2021-2024.

In [20]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Path relative to notebook location (DE_similarity_scores_project/) - data is in project root
dt_data = pd.read_csv('../data/processed/dt_training_data.csv')
print(dt_data.columns)
# Convert Height from feet-inches to inches
dt_data['Height'] = dt_data['Height'].str.split('-').str[0].astype(int) * 12 + dt_data['Height'].str.split('-').str[1].astype(int)

# Examine every column in the dataset and its correlation with the Drafted column 
dt_data_just_numeric = dt_data.select_dtypes(include=['number'])
dt_data_just_numeric['Drafted'] = dt_data['Drafted']
print(dt_data_just_numeric.corr()['Drafted'].sort_values(ascending=False))


Index(['Year', 'Player', 'Pos', 'School', 'Height', 'Weight', '40yd',
       'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle', 'Drafted',
       'Round', 'Pick', 'Sacks_cumulative', 'TFL_cumulative',
       'QB_Hurry_cumulative', 'Sacks_final_season', 'TFL_final_season',
       'QB_Hurry_final_season'],
      dtype='object')
Drafted                  1.000000
Sacks_final_season       0.537543
TFL_final_season         0.471537
Sacks_cumulative         0.372802
TFL_cumulative           0.224549
QB_Hurry_cumulative      0.190996
QB_Hurry_final_season    0.183513
Bench                    0.167922
Broad Jump               0.158607
Vertical                 0.141317
Weight                   0.127717
Height                   0.063890
Year                     0.046518
Shuttle                 -0.103366
3Cone                   -0.109659
40yd                    -0.236880
Round                         NaN
Pick                          NaN
Name: Drafted, dtype: float64


For context, a positive correlation means that a higher number goes with being drafted, and a negative correlation means that a lower number does. With that said, based on the output above, our most impactful Combine values on **being drafted** are

1. 40yd: -.237
2. Bench: .168
3. Broad Jump: .159
4. Vertical: .141
5. Weight: .128

and our most impactful defensive stats on **being drafted** are

1. Sacks_final_season       0.537543
2. TFL_final_season         0.471537
3. Sacks_cumulative         0.372802
4. TFL_cumulative           0.224549
5. QB_Hurry_cumulative      0.190996
6. QB_Hurry_final_season    0.183513

Anything too far below abs(.20) is likely too weak to consider using for any models.
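
To make the sign convention concrete, here is a minimal sketch on made-up numbers (the five players and their values are hypothetical, not rows from the dataset). Correlating a 0/1 `Drafted` column with a timed drill gives a negative value when drafted players are faster, and a positive value for a jump when drafted players jump farther.

```python
import pandas as pd

# Hypothetical toy data: not taken from dt_training_data.csv
toy = pd.DataFrame({
    "40yd":       [4.85, 4.95, 5.05, 5.20, 5.35],
    "Broad Jump": [115, 112, 108, 104, 100],
    "Drafted":    [1, 1, 1, 0, 0],
})

# Correlation of each measurement with the 0/1 Drafted column
corr_with_drafted = toy.corr()["Drafted"].drop("Drafted")
print(corr_with_drafted)
```

In this toy data the 40-yard dash comes out negative (lower time, more likely drafted) and the broad jump comes out positive, which is the same reading applied to the real output above.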

In [21]:
# Examine every numeric column in the dataset and its correlation with the Pick column
# Lower Draft Position is better
dt_data_just_numeric = dt_data.select_dtypes(include=['number'])
dt_data_just_numeric['Pick'] = dt_data['Pick']
print(dt_data_just_numeric.corr()['Pick'].sort_values(ascending=False))


Pick                     1.000000
Round                    0.988804
3Cone                    0.108831
Year                     0.101440
40yd                     0.099456
Shuttle                  0.003743
TFL_final_season        -0.014431
Vertical                -0.020725
Height                  -0.034635
Weight                  -0.039855
TFL_cumulative          -0.047021
Broad Jump              -0.057780
Bench                   -0.106148
Sacks_final_season      -0.212243
Sacks_cumulative        -0.302680
QB_Hurry_final_season   -0.346735
QB_Hurry_cumulative     -0.421125
Name: Pick, dtype: float64


With that said, based on the output above, our most impactful Combine values on **Draft Position** are

1. 3Cone: .109
2. Bench: -.106
3. 40yd: .099
4. Broad Jump: -.058

These are a little hard to read, but a positive correlation with `Pick` means that a higher value goes with a later pick. So slower 3-cone and 40-yard dash times, fewer bench reps, and a shorter broad jump correlate with a later draft pick; we are looking for faster times, more bench reps, and longer broad jumps. All of these sit well below abs(.20), so the Combine tests say little about where a defensive tackle is picked.

And it looks like our most impactful defensive values on **Draft Position** are

1. QB_Hurry_cumulative     -0.421125
2. QB_Hurry_final_season   -0.346735
3. Sacks_cumulative        -0.302680
4. Sacks_final_season      -0.212243
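
One caveat when reading the `Pick` correlations: pandas' `DataFrame.corr` drops missing values pair by pair. If `Pick` is empty for undrafted players (an assumption about how this dataset is stored, though the `NaN` correlations between `Drafted` and `Round`/`Pick` in the first output are consistent with it), these numbers describe drafted players only. A minimal sketch on hypothetical rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the two undrafted players have no Pick
toy = pd.DataFrame({
    "40yd":    [4.80, 4.95, 5.10, 5.25, 5.40, 5.50],
    "Pick":    [10.0, 60.0, 150.0, 220.0, np.nan, np.nan],
    "Drafted": [1, 1, 1, 1, 0, 0],
})

# corr() uses only the rows where both columns of a pair are present
full = toy.corr().loc["40yd", "Pick"]
drafted_only = toy[toy["Drafted"] == 1].corr().loc["40yd", "Pick"]

# Drafted is constant (all 1) wherever Pick is present, so this entry is undefined
drafted_vs_pick = toy.corr().loc["Drafted", "Pick"]

print(full, drafted_only, drafted_vs_pick)
```

The first two values are identical because the undrafted rows never enter the calculation, and the third is `NaN` for the same reason.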

## Looking to Model

When we look to create machine learning models, there are three tasks we would like to accomplish. The first two can use our current datasets of Combine data and college data. The third will require the first four seasons of our defensive tackles' NFL stats.

1. Projected Drafted or Undrafted
2. Projected Draft Position/Round
3. Projected NFL Ability/Value 

## Projected Drafted or Undrafted

We are creating three types of models here to predict whether a player will be drafted or go undrafted based on their Combine and college stats. The combine-only models train on the full training set, while the models that use college stats train only on 2017 onward, where college stats are available. The test set is the 2021 onward draft classes. The three model types we will be testing are logistic regression, random forest, and XGBoost.

In [22]:
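# College-only models: QB Hurry, TFL, Sacks (final season and cumulative), p4_conference, Height, Weight (no combine drills)
# For players without combine data (e.g. 2026 prospects pre-combine)
# DT includes cumulative stats like edges
# NOTE: this cell uses train_2017, test_2017, y_train17 and y_test17, which are defined in the
# "Logistic regression with college stats" cell further down. Run that cell first.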

COLLEGE_ONLY_FEATURES = ['QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative', 'p4_conference', 'Height', 'Weight']
COLLEGE_ONLY_CONTAINS = ['contains_qb_hurry_final_season', 'contains_tfl_final_season', 'contains_sacks_final_season', 'contains_sacks_cumulative', 'contains_tfl_cumulative', 'contains_qb_hurry_cumulative', 'contains_p4_conference', 'contains_height', 'contains_weight']
COLLEGE_ONLY_ALL = COLLEGE_ONLY_FEATURES + COLLEGE_ONLY_CONTAINS

X_tr_co = train_2017[COLLEGE_ONLY_ALL].copy()
X_te_co = test_2017[COLLEGE_ONLY_ALL].copy()
train_medians_co = X_tr_co.median()
X_tr_co = X_tr_co.fillna(train_medians_co)
X_te_co = X_te_co.fillna(train_medians_co)

scaler_co = StandardScaler()
X_tr_co_scaled = scaler_co.fit_transform(X_tr_co)
X_te_co_scaled = scaler_co.transform(X_te_co)

logit_draft_college_only = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logit_draft_college_only.fit(X_tr_co_scaled, y_train17)
y_pred_college_only = logit_draft_college_only.predict(X_te_co_scaled)
y_prob_college_only = logit_draft_college_only.predict_proba(X_te_co_scaled)[:, 1]

rf_college_only = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42, class_weight='balanced')
rf_college_only.fit(X_tr_co, y_train17)
y_pred_rf_co = rf_college_only.predict(X_te_co)
y_prob_rf_co = rf_college_only.predict_proba(X_te_co)[:, 1]

xgb_college_only = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college_only.fit(X_tr_co, y_train17)
y_pred_xgb_co = xgb_college_only.predict(X_te_co)
y_prob_xgb_co = xgb_college_only.predict_proba(X_te_co)[:, 1]

print('College-only models (QB Hurry, TFL, Sacks, cumulative, p4_conference) — drafted/undrafted')
print('=' * 75)
for name, pred, prob in [('Logistic', y_pred_college_only, y_prob_college_only), ('RF', y_pred_rf_co, y_prob_rf_co), ('XGB', y_pred_xgb_co, y_prob_xgb_co)]:
    print(f'{name}: accuracy={(pred == y_test17).mean():.4f}, ROC-AUC={roc_auc_score(y_test17, prob):.4f}')

NameError: name 'train_2017' is not defined

In [None]:
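# College + combine with agility: train on players with 3Cone+Shuttle (agility_score present)
# Use agility models when player has agility_score; otherwise use non-agility
# NOTE: this cell uses FEATURES_WITH_COLLEGE, CONTAINS_WITH_COLLEGE, train_2017 and test_2017,
# which are defined in the "Logistic regression with college stats" cell further down. Run that cell first.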
FEATURES_WITH_COLLEGE_AGILITY = FEATURES_WITH_COLLEGE + ['agility_score']
CONTAINS_WITH_COLLEGE_AGILITY = CONTAINS_WITH_COLLEGE + ['contains_agility_score']
FEATURES_WITH_COLLEGE_AGILITY_ALL = FEATURES_WITH_COLLEGE_AGILITY + CONTAINS_WITH_COLLEGE_AGILITY

has_agility_train = train_2017['3Cone'].notna() & train_2017['Shuttle'].notna()
train_agility = train_2017[has_agility_train]
X_tr_ag = train_agility[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()
train_medians_ag = X_tr_ag.median()
X_tr_ag = X_tr_ag.fillna(train_medians_ag)
scaler_ag = StandardScaler()
X_tr_ag_scaled = scaler_ag.fit_transform(X_tr_ag)
y_train_ag = (train_agility['Drafted'].astype(bool)).astype(int)

logit_draft_college_agility = LogisticRegression(max_iter=1000, random_state=42)
logit_draft_college_agility.fit(X_tr_ag_scaled, y_train_ag)
rf_college_agility = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=42)
rf_college_agility.fit(X_tr_ag, y_train_ag)
xgb_college_agility = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college_agility.fit(X_tr_ag, y_train_ag)

# Predict on full test (impute missing agility with median)
X_te_ag_fill = test_2017[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy().fillna(train_medians_ag)
X_te_ag_fill_scaled = scaler_ag.transform(X_te_ag_fill)
y_pred_agility = logit_draft_college_agility.predict(X_te_ag_fill_scaled)
y_prob_agility = logit_draft_college_agility.predict_proba(X_te_ag_fill_scaled)[:, 1]
y_pred_rf_agility = rf_college_agility.predict(X_te_ag_fill)
y_prob_rf_agility = rf_college_agility.predict_proba(X_te_ag_fill)[:, 1]
y_pred_xgb_agility = xgb_college_agility.predict(X_te_ag_fill)
y_prob_xgb_agility = xgb_college_agility.predict_proba(X_te_ag_fill)[:, 1]

print('College+combine w/ agility: trained on', len(train_agility), 'players with 3Cone+Shuttle')

In [None]:
# Load training and testing data (paths relative to DT/)
train_raw = pd.read_csv('../data/processed/dt_training_data.csv')
test_raw = pd.read_csv('../data/processed/dt_testing_data.csv')

# Convert Height from feet-inches to inches
def height_to_inches(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).split('-')
    return int(parts[0]) * 12 + int(parts[1])

for df in [train_raw, test_raw]:
    df['Height'] = df['Height'].apply(height_to_inches)

# Column names for modeling (match CSV)
FEATURE_COLS = [
    'Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season',
    'Sacks_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight'
]

# --- Speed Score: weight * 200 / 40yd^4 ---
def add_speed_score(df):
    df = df.copy()
    df['speed_score'] = np.where(
        df['40yd'].notna() & (df['40yd'] > 0),
        df['Weight'] * 200 / (df['40yd'] ** 4),
        np.nan
    )
    return df

train_raw = add_speed_score(train_raw)
test_raw = add_speed_score(test_raw)

# --- Explosive Score (position-specific z-scores from training data) ---
def add_explosive_score(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['vertical_z'] = np.nan
    train_df['broad_z'] = np.nan
    test_df['vertical_z'] = np.nan
    test_df['broad_z'] = np.nan
    for pos in train_df['Pos'].dropna().unique():
        tr = train_df[train_df['Pos'] == pos]
        mean_v = tr['Vertical'].mean()
        std_v = tr['Vertical'].std()
        mean_b = tr['Broad Jump'].mean()
        std_b = tr['Broad Jump'].std()
        if std_v == 0 or np.isnan(std_v):
            std_v = 1.0
        if std_b == 0 or np.isnan(std_b):
            std_b = 1.0
        mask_train = train_df['Pos'] == pos
        mask_test = test_df['Pos'] == pos
        train_df.loc[mask_train, 'vertical_z'] = (train_df.loc[mask_train, 'Vertical'] - mean_v) / std_v
        train_df.loc[mask_train, 'broad_z'] = (train_df.loc[mask_train, 'Broad Jump'] - mean_b) / std_b
        test_df.loc[mask_test, 'vertical_z'] = (test_df.loc[mask_test, 'Vertical'] - mean_v) / std_v
        test_df.loc[mask_test, 'broad_z'] = (test_df.loc[mask_test, 'Broad Jump'] - mean_b) / std_b
    train_df['explosive_score'] = train_df['vertical_z'].fillna(0) + train_df['broad_z'].fillna(0)
    test_df['explosive_score'] = test_df['vertical_z'].fillna(0) + test_df['broad_z'].fillna(0)
    return train_df.drop(columns=['vertical_z', 'broad_z'], errors='ignore'), test_df.drop(columns=['vertical_z', 'broad_z'], errors='ignore')

train_raw, test_raw = add_explosive_score(train_raw, test_raw)

# --- Agility Score (position-specific z-scores from training; flip sign so better = higher) ---
def add_agility_score(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['three_cone_z'] = np.nan
    train_df['shuttle_z'] = np.nan
    test_df['three_cone_z'] = np.nan
    test_df['shuttle_z'] = np.nan
    for pos in train_df['Pos'].dropna().unique():
        tr = train_df[train_df['Pos'] == pos]
        mean_3 = tr['3Cone'].mean()
        std_3 = tr['3Cone'].std()
        mean_sh = tr['Shuttle'].mean()
        std_sh = tr['Shuttle'].std()
        if std_3 == 0 or np.isnan(std_3):
            std_3 = 1.0
        if std_sh == 0 or np.isnan(std_sh):
            std_sh = 1.0
        mask_train = train_df['Pos'] == pos
        mask_test = test_df['Pos'] == pos
        train_df.loc[mask_train, 'three_cone_z'] = (train_df.loc[mask_train, '3Cone'] - mean_3) / std_3
        train_df.loc[mask_train, 'shuttle_z'] = (train_df.loc[mask_train, 'Shuttle'] - mean_sh) / std_sh
        test_df.loc[mask_test, 'three_cone_z'] = (test_df.loc[mask_test, '3Cone'] - mean_3) / std_3
        test_df.loc[mask_test, 'shuttle_z'] = (test_df.loc[mask_test, 'Shuttle'] - mean_sh) / std_sh
    train_df['agility_score'] = (-train_df['three_cone_z'].fillna(0)) + (-train_df['shuttle_z'].fillna(0))
    test_df['agility_score'] = (-test_df['three_cone_z'].fillna(0)) + (-test_df['shuttle_z'].fillna(0))
    return train_df.drop(columns=['three_cone_z', 'shuttle_z'], errors='ignore'), test_df.drop(columns=['three_cone_z', 'shuttle_z'], errors='ignore')

train_raw, test_raw = add_agility_score(train_raw, test_raw)

# --- P4 conference: binary 1 if School in power conference. Pac-12 counts only for draft year 2023 and before. ---
P4_WITH_PAC12 = {'SEC', 'Big Ten', 'Big 12', 'ACC', 'Pac-12'}
P4_NO_PAC12 = {'SEC', 'Big Ten', 'Big 12', 'ACC'}
_stats = pd.read_csv('../data/processed/defensive_stats_2016_to_2025.csv')
P4_SCHOOLS = set(_stats[_stats['Conference'].isin(P4_WITH_PAC12)]['Team'].unique())
P4_SCHOOLS_NO_PAC12 = set(_stats[_stats['Conference'].isin(P4_NO_PAC12)]['Team'].unique())
school_alias = {
    'Ole Miss': 'Mississippi', 'Miami (FL)': 'Miami', 'Southern California': 'USC',
    'Central Florida': 'UCF', 'Brigham Young': 'BYU', 'Ohio St.': 'Ohio State',
    'Florida St.': 'Florida State', 'Kansas St.': 'Kansas State', 'Iowa St.': 'Iowa State',
    'Oklahoma St.': 'Oklahoma State', 'Penn St.': 'Penn State', 'San Diego St.': 'San Diego State',
    'San Jose St.': 'San José State', 'Boston Col.': 'Boston College',
}

def add_p4_conference(df):
    df = df.copy()
    def norm(s):
        return school_alias.get(s, s) if pd.notna(s) and s else None
    def is_p4(row):
        sn = norm(row['School'])
        if not sn: return 0
        year = row.get('Year', 0)
        schools = P4_SCHOOLS if year <= 2023 else P4_SCHOOLS_NO_PAC12
        return 1 if sn in schools else 0
    df['p4_conference'] = df.apply(is_p4, axis=1)
    df['contains_p4_conference'] = df['School'].notna().astype(int)
    return df

train_raw = add_p4_conference(train_raw)
test_raw = add_p4_conference(test_raw)

# --- Binary contains_* for each metric (1 if present, 0 if missing) ---
METRIC_COLS = [
    'Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season',
    'Sacks_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score', 'agility_score',
    'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative'
]
def add_contains_flags(df):
    df = df.copy()
    name_map = {
        'Broad Jump': 'broad_jump', 'Vertical': 'vertical',
        'QB_Hurry_final_season': 'qb_hurry_final_season', 'TFL_final_season': 'tfl_final_season',
        'Sacks_final_season': 'sacks_final_season', 'Sacks_cumulative': 'sacks_cumulative', 'TFL_cumulative': 'tfl_cumulative', 'QB_Hurry_cumulative': 'qb_hurry_cumulative',
        'Shuttle': 'shuttle', '3Cone': 'three_cone',
        '40yd': '40yd', 'Height': 'height', 'Weight': 'weight',
        'speed_score': 'speed_score', 'explosive_score': 'explosive_score', 'agility_score': 'agility_score'
    }
    for col in METRIC_COLS:
        if col not in df.columns:
            continue
        flag_name = f"contains_{name_map.get(col, col.lower().replace(' ', '_'))}"
        df[flag_name] = (df[col].notna()).astype(int)
    return df

train_raw = add_contains_flags(train_raw)
test_raw = add_contains_flags(test_raw)

# Final training and test datasets for modeling
train_df = train_raw.copy()
test_df = test_raw.copy()

print('Training set:', train_df.shape[0], 'players')
print('Test set:', test_df.shape[0], 'players')
print('\nModeling features:', FEATURE_COLS)
print('Derived metrics: speed_score, explosive_score, agility_score')
print('Contains flags: contains_* for each metric')
train_df.head()

Training set: 231 players
Test set: 54 players

Modeling features: ['Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight']
Derived metrics: speed_score, explosive_score, agility_score
Contains flags: contains_* for each metric


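| | Year | Player | Pos | School | Height | Weight | 40yd | Vertical | Bench | Broad Jump | ... | contains_three_cone | contains_40yd | contains_height | contains_weight | contains_speed_score | contains_explosive_score | contains_agility_score | contains_sacks_cumulative | contains_tfl_cumulative | contains_qb_hurry_cumulative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | Charles Alexander | DT | LSU | 76 | 300.0 | 5.4 | NaN | NaN | NaN | ... | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 2010 | Geno Atkins | DT | Georgia | 73 | 293.0 | 4.75 | 33.0 | 34.0 | 117.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 2010 | Terrence Cody | DT | Alabama | 76 | 354.0 | 5.71 | 20.5 | NaN | 90.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | 2010 | Brandon Deaderick | DT | Alabama | 76 | 314.0 | 5.08 | NaN | NaN | NaN | ... | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 4 | 2010 | Lamarr Houston | DT | Texas | 75 | 305.0 | 4.84 | 33.5 | 30.0 | 114.0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |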
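
One thing the preview suggests: Charles Alexander has no Vertical or Broad Jump, yet his `contains_explosive_score` is 1. That follows from the `fillna(0)` inside `add_explosive_score` and `add_agility_score`: a missing z-score becomes 0, so the composite is never `NaN` and its `contains_*` flag is always 1. If the flag is meant to say "at least one component was measured", one possible alternative is to sum with `min_count=1`. This is only a sketch, not the notebook's method, and the helper name `composite_score` is hypothetical.

```python
import numpy as np
import pandas as pd

def composite_score(z_a: pd.Series, z_b: pd.Series) -> pd.Series:
    """Sum two z-score columns; the result is NaN only when both are missing."""
    return pd.concat([z_a, z_b], axis=1).sum(axis=1, min_count=1)

# Hypothetical z-scores for three players
vertical_z = pd.Series([0.8, np.nan, np.nan])
broad_z = pd.Series([1.1, -0.4, np.nan])

current = vertical_z.fillna(0) + broad_z.fillna(0)   # behaviour of the cell above
alternative = composite_score(vertical_z, broad_z)

current_flags = current.notna().astype(int).tolist()
alternative_flags = alternative.notna().astype(int).tolist()
print(current_flags)      # flag under the current approach
print(alternative_flags)  # flag under the alternative
```

With the current approach the third player, who has neither jump, still gets a score of 0 and a flag of 1. With `min_count=1` that player's score stays missing, so the flag drops to 0 and median imputation would handle the gap like any other missing metric.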


In [None]:
# Combine-only logistic regression: predict Drafted (1) vs Undrafted (0)
# No college stats — only combine metrics + derived scores

# Combine-only features (no Sacks/TFL/QB Hurry) + binary "contains_*" flags
# Exclude Shuttle, 3Cone, agility_score — often missing at combine
COMBINE_ONLY_FEATURES = [
    'Broad Jump', 'Vertical', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score', 'p4_conference'
]
COMBINE_ONLY_CONTAINS = [
    'contains_broad_jump', 'contains_vertical',
    'contains_40yd', 'contains_height', 'contains_weight',
    'contains_speed_score', 'contains_explosive_score', 'contains_p4_conference'
]
COMBINE_ONLY_ALL = COMBINE_ONLY_FEATURES + COMBINE_ONLY_CONTAINS

# Prepare X, y
X_tr = train_df[COMBINE_ONLY_ALL].copy()
X_te = test_df[COMBINE_ONLY_ALL].copy()
y_train = (train_df['Drafted'].astype(bool)).astype(int)
y_test = (test_df['Drafted'].astype(bool)).astype(int)

# Impute missing with training medians
train_medians = X_tr.median()
X_tr = X_tr.fillna(train_medians)
X_te = X_te.fillna(train_medians)

# Scale (fit on train, transform both)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Fit binary logistic regression
logit_draft = LogisticRegression(max_iter=1000, random_state=42)
logit_draft.fit(X_tr_scaled, y_train)

# Predict on test
y_pred = logit_draft.predict(X_te_scaled)
y_prob = logit_draft.predict_proba(X_te_scaled)[:, 1]

# Metrics
print('Combine-only logistic model: Drafted vs Undrafted')
print('=' * 50)
print('Test accuracy:', (y_pred == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred))
print('\nClassification report:')
print(classification_report(y_test, y_pred, target_names=['Undrafted', 'Drafted']))
if y_test.nunique() == 2:
    print('Test ROC-AUC:', roc_auc_score(y_test, y_prob).round(4))

Combine-only logistic model: Drafted vs Undrafted
Test accuracy: 0.7407

Confusion matrix (rows=actual, cols=predicted):
[[ 8 13]
 [ 1 32]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.89      0.38      0.53        21
     Drafted       0.71      0.97      0.82        33

    accuracy                           0.74        54
   macro avg       0.80      0.68      0.68        54
weighted avg       0.78      0.74      0.71        54

Test ROC-AUC: 0.8355
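
To connect the confusion matrix to the classification report, the per-class numbers can be recomputed by hand. This sketch hard-codes the matrix printed above (rows = actual, columns = predicted, class order Undrafted then Drafted).

```python
# Confusion matrix from the combine-only logistic model output above
cm = [[8, 13],   # actual Undrafted: 8 predicted Undrafted, 13 predicted Drafted
      [1, 32]]   # actual Drafted:   1 predicted Undrafted, 32 predicted Drafted

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall_undrafted = tn / (tn + fp)
precision_undrafted = tn / (tn + fn)
recall_drafted = tp / (tp + fn)
precision_drafted = tp / (tp + fp)

print(f"accuracy={accuracy:.4f}")
print(f"Undrafted: precision={precision_undrafted:.2f}, recall={recall_undrafted:.2f}")
print(f"Drafted:   precision={precision_drafted:.2f}, recall={recall_drafted:.2f}")
```

The model rarely misses a drafted player, but it labels 13 of the 21 undrafted players as drafted, which is why Undrafted recall is only 0.38.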


In [None]:
# Logistic regression with college stats: training data from 2017 onward
# Same target (Drafted vs Undrafted), with combine + college stats

# Restrict to 2017+ so college stats are available
train_2017 = train_df[train_df['Year'] >= 2017].copy()
test_2017 = test_df[test_df['Year'] >= 2017].copy()

# Full feature set (combine + college stats) + binary "contains_*" flags
# Exclude Shuttle, 3Cone, agility_score — often missing; keep cumulative stats (DT-specific)
FEATURES_WITH_COLLEGE = [
    'Broad Jump', 'Vertical', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score',
    'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season',
    'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative', 'p4_conference'
]
CONTAINS_WITH_COLLEGE = [
    'contains_broad_jump', 'contains_vertical',
    'contains_40yd', 'contains_height', 'contains_weight',
    'contains_speed_score', 'contains_explosive_score',
    'contains_qb_hurry_final_season', 'contains_tfl_final_season', 'contains_sacks_final_season',
    'contains_sacks_cumulative', 'contains_tfl_cumulative', 'contains_qb_hurry_cumulative',
    'contains_p4_conference'
]
FEATURES_WITH_COLLEGE_ALL = FEATURES_WITH_COLLEGE + CONTAINS_WITH_COLLEGE

X_tr17 = train_2017[FEATURES_WITH_COLLEGE_ALL].copy()
X_te17 = test_2017[FEATURES_WITH_COLLEGE_ALL].copy()
y_train17 = (train_2017['Drafted'].astype(bool)).astype(int)
y_test17 = (test_2017['Drafted'].astype(bool)).astype(int)

# Impute missing with training medians
train_medians17 = X_tr17.median()
X_tr17 = X_tr17.fillna(train_medians17)
X_te17 = X_te17.fillna(train_medians17)

# Scale (fit on train, transform both)
scaler17 = StandardScaler()
X_tr17_scaled = scaler17.fit_transform(X_tr17)
X_te17_scaled = scaler17.transform(X_te17)

# Fit logistic regression (with college stats)
logit_draft_college = LogisticRegression(max_iter=1000, random_state=42)
logit_draft_college.fit(X_tr17_scaled, y_train17)

y_pred17 = logit_draft_college.predict(X_te17_scaled)
y_prob17 = logit_draft_college.predict_proba(X_te17_scaled)[:, 1]

print('Logistic model with college stats (train 2017+, test 2017+)')
print('=' * 55)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred17, target_names=['Undrafted', 'Drafted']))
if y_test17.nunique() == 2 and len(y_test17) > 0:
    print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob17).round(4))

Logistic model with college stats (train 2017+, test 2017+)
Training samples: 46 | Test samples: 54
Test accuracy: 0.7407

Confusion matrix (rows=actual, cols=predicted):
[[13  8]
 [ 6 27]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.68      0.62      0.65        21
     Drafted       0.77      0.82      0.79        33

    accuracy                           0.74        54
   macro avg       0.73      0.72      0.72        54
weighted avg       0.74      0.74      0.74        54

Test ROC-AUC: 0.8384
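
Both logistic models land on the same test accuracy (0.7407), yet their confusion matrices and ROC-AUC differ. Accuracy depends on the 0.5 cutoff, while ROC-AUC scores how well the predicted probabilities rank drafted players above undrafted ones. A minimal pure-Python sketch of that interpretation on hypothetical probabilities follows; the helper `pairwise_auc` is illustrative and is not how scikit-learn computes the metric internally.

```python
def pairwise_auc(y_true, y_prob):
    """Share of (drafted, undrafted) pairs where the drafted player
    receives the higher probability; ties count as half a win."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted drafted probabilities
y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.2]

auc = pairwise_auc(y_true, y_prob)
accuracy = sum((p >= 0.5) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)
print(f"AUC={auc:.4f}, accuracy={accuracy:.2f}")
```

Here five of the six drafted/undrafted pairs are ordered correctly even though only three of the five players are classified correctly at the 0.5 cutoff, so the two metrics can move independently.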


In [None]:
# Combined prediction: average both models' probabilities into one drafted/undrafted prediction
# Combine-only model predicts on full test_df; college model on test_2017 (2017+). Test set is 2021+ so both apply to all rows.

combined_prob = (y_prob + y_prob17) / 2
combined_pred = (combined_prob >= 0.5).astype(int)

# Use same test labels (y_test from full test_df; same rows as test_2017)
print('Combined model: average of combine-only + college-stats probabilities')
print('=' * 60)
print('Test accuracy:', (combined_pred == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred))
print('\nClassification report:')
print(classification_report(y_test, combined_pred, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob).round(4))

Combined model: average of combine-only + college-stats probabilities
Test accuracy: 0.8519

Confusion matrix (rows=actual, cols=predicted):
[[13  8]
 [ 0 33]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       1.00      0.62      0.76        21
     Drafted       0.80      1.00      0.89        33

    accuracy                           0.85        54
   macro avg       0.90      0.81      0.83        54
weighted avg       0.88      0.85      0.84        54

Test ROC-AUC: 0.8644
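
The combined prediction is a simple soft-voting average: each player's two drafted probabilities are averaged and the 0.5 threshold is applied to the mean. A small sketch with hypothetical probabilities shows how a confident model can overrule a hesitant one in either direction.

```python
import numpy as np

# Hypothetical drafted probabilities for four players from two models
prob_combine_only = np.array([0.90, 0.55, 0.30, 0.65])
prob_with_college = np.array([0.40, 0.20, 0.45, 0.80])

# Average the probabilities, then apply the usual 0.5 cutoff
combined_prob = (prob_combine_only + prob_with_college) / 2
combined_pred = (combined_prob >= 0.5).astype(int)

print(combined_prob)
print(combined_pred)
```

The first player is called drafted because one model is very confident, and the second is called undrafted for the same reason in reverse. Averaging like this only makes sense because both probability arrays cover the same test rows in the same order, as the comment in the cell above notes.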


In [None]:
# Random Forest: combine-only features — predict Drafted vs Undrafted

rf_combine = RandomForestClassifier(n_estimators=200, max_depth=9, random_state=42)
rf_combine.fit(X_tr, y_train)  # X_tr already has COMBINE_ONLY_ALL, imputed

y_pred_rf = rf_combine.predict(X_te)
y_prob_rf = rf_combine.predict_proba(X_te)[:, 1]

print('Random Forest (combine-only): Drafted vs Undrafted')
print('=' * 55)
print('Test accuracy:', (y_pred_rf == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred_rf))
print('\nClassification report:')
print(classification_report(y_test, y_pred_rf, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, y_prob_rf).round(4))

Random Forest (combine-only): Drafted vs Undrafted
Test accuracy: 0.7407

Confusion matrix (rows=actual, cols=predicted):
[[ 9 12]
 [ 2 31]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.82      0.43      0.56        21
     Drafted       0.72      0.94      0.82        33

    accuracy                           0.74        54
   macro avg       0.77      0.68      0.69        54
weighted avg       0.76      0.74      0.72        54

Test ROC-AUC: 0.8211


In [None]:
# Random Forest: combine + college stats (2017+)
rf_college = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=42)
rf_college.fit(X_tr17, y_train17)  # X_tr17 already has FEATURES_WITH_COLLEGE_ALL, imputed

y_pred_rf17 = rf_college.predict(X_te17)
y_prob_rf17 = rf_college.predict_proba(X_te17)[:, 1]

print('Random Forest (combine + college, train 2017+): Drafted vs Undrafted')
print('=' * 60)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred_rf17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred_rf17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred_rf17, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob_rf17).round(4))

Random Forest (combine + college, train 2017+): Drafted vs Undrafted
Training samples: 46 | Test samples: 54
Test accuracy: 0.7037

Confusion matrix (rows=actual, cols=predicted):
[[ 9 12]
 [ 4 29]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.69      0.43      0.53        21
     Drafted       0.71      0.88      0.78        33

    accuracy                           0.70        54
   macro avg       0.70      0.65      0.66        54
weighted avg       0.70      0.70      0.68        54

Test ROC-AUC: 0.746


In [None]:
# Combined RF prediction: average both RF models' probabilities
combined_prob_rf = (y_prob_rf + y_prob_rf17) / 2
combined_pred_rf = (combined_prob_rf >= 0.5).astype(int)

print('Combined Random Forest: average of combine-only + college-stats probabilities')
print('=' * 65)
print('Test accuracy:', (combined_pred_rf == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred_rf))
print('\nClassification report:')
print(classification_report(y_test, combined_pred_rf, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob_rf).round(4))

Combined Random Forest: average of combine-only + college-stats probabilities
Test accuracy: 0.7037

Confusion matrix (rows=actual, cols=predicted):
[[ 6 15]
 [ 1 32]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.86      0.29      0.43        21
     Drafted       0.68      0.97      0.80        33

    accuracy                           0.70        54
   macro avg       0.77      0.63      0.61        54
weighted avg       0.75      0.70      0.66        54

Test ROC-AUC: 0.8211


In [None]:
# XGBoost: combine-only features — predict Drafted vs Undrafted

xgb_combine = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_combine.fit(X_tr, y_train)

y_pred_xgb = xgb_combine.predict(X_te)
y_prob_xgb = xgb_combine.predict_proba(X_te)[:, 1]

print('XGBoost (combine-only): Drafted vs Undrafted')
print('=' * 55)
print('Test accuracy:', (y_pred_xgb == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred_xgb))
print('\nClassification report:')
print(classification_report(y_test, y_pred_xgb, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, y_prob_xgb).round(4))

XGBoost (combine-only): Drafted vs Undrafted
Test accuracy: 0.7778

Confusion matrix (rows=actual, cols=predicted):
[[10 11]
 [ 1 32]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.91      0.48      0.62        21
     Drafted       0.74      0.97      0.84        33

    accuracy                           0.78        54
   macro avg       0.83      0.72      0.73        54
weighted avg       0.81      0.78      0.76        54

Test ROC-AUC: 0.7937




In [None]:
# XGBoost: combine + college stats (2017+)
xgb_college = xgb.XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college.fit(X_tr17, y_train17)

y_pred_xgb17 = xgb_college.predict(X_te17)
y_prob_xgb17 = xgb_college.predict_proba(X_te17)[:, 1]

print('XGBoost (combine + college, train 2017+): Drafted vs Undrafted')
print('=' * 60)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred_xgb17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred_xgb17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred_xgb17, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob_xgb17).round(4))

XGBoost (combine + college, train 2017+): Drafted vs Undrafted
Training samples: 46 | Test samples: 54
Test accuracy: 0.6852

Confusion matrix (rows=actual, cols=predicted):
[[10 11]
 [ 6 27]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.62      0.48      0.54        21
     Drafted       0.71      0.82      0.76        33

    accuracy                           0.69        54
   macro avg       0.67      0.65      0.65        54
weighted avg       0.68      0.69      0.67        54

Test ROC-AUC: 0.8009




In [None]:
# Combined XGBoost prediction: average both XGBoost models' probabilities
combined_prob_xgb = (y_prob_xgb + y_prob_xgb17) / 2
combined_pred_xgb = (combined_prob_xgb >= 0.5).astype(int)

print('Combined XGBoost: average of combine-only + college-stats probabilities')
print('=' * 65)
print('Test accuracy:', (combined_pred_xgb == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred_xgb))
print('\nClassification report:')
print(classification_report(y_test, combined_pred_xgb, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob_xgb).round(4))

Combined XGBoost: average of combine-only + college-stats probabilities
Test accuracy: 0.7778

Confusion matrix (rows=actual, cols=predicted):
[[11 10]
 [ 2 31]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.85      0.52      0.65        21
     Drafted       0.76      0.94      0.84        33

    accuracy                           0.78        54
   macro avg       0.80      0.73      0.74        54
weighted avg       0.79      0.78      0.76        54

Test ROC-AUC: 0.8023
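
The combined model above is a simple form of soft voting: average the two models' probabilities for each player, then apply the 0.5 cutoff. A minimal sketch with made-up probabilities (not taken from the notebook) shows how averaging can flip a decision in either direction. The helper name `average_and_threshold` is hypothetical.

```python
def average_and_threshold(prob_a, prob_b, threshold=0.5):
    """Average two per-player probability lists and threshold the result.

    Both lists must describe the same players in the same order.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("probability lists must have the same length")
    avg = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    labels = [int(p >= threshold) for p in avg]
    return avg, labels


# Made-up P(drafted) values for four players.
p_combine_only = [0.90, 0.40, 0.55, 0.10]
p_with_college = [0.70, 0.70, 0.35, 0.20]

avg, labels = average_and_threshold(p_combine_only, p_with_college)
for a, b, m, lab in zip(p_combine_only, p_with_college, avg, labels):
    print(f"{a:.2f} and {b:.2f} -> mean {m:.2f} -> {'Drafted' if lab else 'Undrafted'}")
```

The second player is below 0.5 under the first model but ends up Drafted after averaging, and the third player goes the other way. Averaging only makes sense when both probability arrays cover the same test rows, which is the case for `y_prob_xgb` and `y_prob_xgb17` here.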


In [None]:
# Compare all drafted/undrafted models on the same test set (y_test, n=54)

models = [
    ('Logistic (combine-only)', y_pred, y_prob),
    ('Logistic (combine+college)', y_pred17, y_prob17),
    ('Logistic combined', combined_pred, combined_prob),
    ('RF (combine-only)', y_pred_rf, y_prob_rf),
    ('RF (combine+college)', y_pred_rf17, y_prob_rf17),
    ('RF combined', combined_pred_rf, combined_prob_rf),
    ('XGBoost (combine-only)', y_pred_xgb, y_prob_xgb),
    ('XGBoost (combine+college)', y_pred_xgb17, y_prob_xgb17),
    ('XGBoost combined', combined_pred_xgb, combined_prob_xgb),
]

results = []
for name, pred, prob in models:
    acc = (pred == y_test).mean()
    auc = roc_auc_score(y_test, prob)
    f1_macro = f1_score(y_test, pred, average='macro')
    # Per-class recall from the 2x2 confusion matrix (rows=actual, cols=predicted).
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    recall_undrafted = tn / (tn + fp) if (tn + fp) > 0 else 0  # share of actual undrafted predicted undrafted
    recall_drafted = tp / (tp + fn) if (tp + fn) > 0 else 0
    results.append({
        'Model': name,
        'Accuracy': acc,
        'ROC-AUC': auc,
        'Macro F1': f1_macro,
        'Recall (Undrafted)': recall_undrafted,
        'Recall (Drafted)': recall_drafted,
    })

results_df = pd.DataFrame(results)
results_df = results_df.sort_values('ROC-AUC', ascending=False).reset_index(drop=True)
print(f'All models ranked by ROC-AUC (same test set, n={len(y_test)}):')
print('=' * 75)
print(results_df.to_string(index=False))
print()

best_auc = results_df.loc[0, 'Model']
best_auc_val = results_df.loc[0, 'ROC-AUC']
best_f1 = results_df.loc[results_df['Macro F1'].idxmax(), 'Model']
best_f1_val = results_df['Macro F1'].max()
print('Summary:')
print('  Best by ROC-AUC:', best_auc, f'({best_auc_val:.4f})')
print('  Best by Macro F1 (balanced Undrafted/Drafted):', best_f1, f'({best_f1_val:.4f})')
print()
print('Note: ROC-AUC measures ranking quality independent of the 0.5 threshold;')
print('Macro F1 weights the Undrafted and Drafted classes equally at that threshold.')
print('The best model under each metric is listed above.')

All models ranked by ROC-AUC (same test set, n=54):
                     Model  Accuracy  ROC-AUC  Macro F1  Recall (Undrafted)  Recall (Drafted)
         Logistic combined  0.851852 0.864358  0.828299            0.619048          1.000000
Logistic (combine+college)  0.740741 0.838384  0.722059            0.619048          0.818182
   Logistic (combine-only)  0.740741 0.835498  0.676923            0.380952          0.969697
         RF (combine-only)  0.740741 0.821068  0.689145            0.428571          0.939394
               RF combined  0.703704 0.821068  0.614286            0.285714          0.969697
          XGBoost combined  0.777778 0.802309  0.742448            0.523810          0.939394
 XGBoost (combine+college)  0.685185 0.800866  0.650552            0.476190          0.818182
    XGBoost (combine-only)  0.777778 0.793651  0.733553            0.476190          0.969697
      RF (combine+college)  0.703704 0.746032  0.656598            0.428571          0.878788

Summary
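
For intuition about what the ROC-AUC column measures, here is a minimal sketch that computes it by pair counting: the share of (drafted, undrafted) pairs in which the drafted player gets the higher score, with ties counted as half. The function name `roc_auc_pairwise` and the sample scores are made up for illustration; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example: 2 undrafted (0) and 3 drafted (1) players.
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.5, 0.7, 0.9]

auc = roc_auc_pairwise(y_true, scores)
print(f"ROC-AUC = {auc:.4f}")  # 5 of the 6 pairs are ordered correctly
```

Because this only depends on the ordering of the scores, it does not change with the 0.5 cutoff, whereas accuracy, recall, and Macro F1 do.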

## Projected Round/Day Drafted

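This section fits three model families to predict which day of the draft a drafted player is selected on: Day 1 (round 1), Day 2 (rounds 2-3), or Day 3 (rounds 4-7), using combine and college stats. Only drafted players are used. The combine-only models train on every drafted player in the training set, the combine + college models train on drafted players from 2017 onward, and all models are scored on the drafted players in the test set. The three families are logistic regression (labelled "ordinal logit" in the output), random forest, and XGBoost.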

In [None]:
# Draft Day modeling: Day 1 (R1), Day 2 (R2-3), Day 3 (R4-7). Train only on DRAFTED players.
def round_to_draft_day(r):
    if pd.isna(r): return np.nan
    r = int(r)
    if r == 1: return 1   # Day 1
    if r in (2, 3): return 2   # Day 2
    if r in (4, 5, 6, 7): return 3   # Day 3
    return np.nan

# Drafted-only data
train_draft = train_df[train_df['Drafted'] == True].copy()
test_draft = test_df[test_df['Drafted'] == True].copy()
train_draft['draft_day'] = train_draft['Round'].apply(round_to_draft_day)
test_draft['draft_day'] = test_draft['Round'].apply(round_to_draft_day)
train_draft = train_draft.dropna(subset=['draft_day'])
test_draft = test_draft.dropna(subset=['draft_day'])

# Ordinal target: 0=Day1, 1=Day2, 2=Day3 (for mord/sklearn)
train_draft['draft_day_ord'] = (train_draft['draft_day'] - 1).astype(int)
test_draft['draft_day_ord'] = (test_draft['draft_day'] - 1).astype(int)

# Combine-only X, y (drafted only)
X_draft_tr = train_draft[COMBINE_ONLY_ALL].copy()
X_draft_te = test_draft[COMBINE_ONLY_ALL].copy()
X_draft_tr = X_draft_tr.fillna(train_medians)
X_draft_te = X_draft_te.fillna(train_medians)
y_draft_tr = train_draft['draft_day_ord'].values
y_draft_te = test_draft['draft_day_ord'].values

# Combine+college 2017+ (drafted only)
train_draft_17 = train_draft[train_draft['Year'] >= 2017]
test_draft_17 = test_draft[test_draft['Year'] >= 2017]
X_draft_tr17 = train_draft_17[FEATURES_WITH_COLLEGE_ALL].copy()
X_draft_te17 = test_draft_17[FEATURES_WITH_COLLEGE_ALL].copy()
X_draft_tr17 = X_draft_tr17.fillna(train_medians17)
X_draft_te17 = X_draft_te17.fillna(train_medians17)
y_draft_tr17 = train_draft_17['draft_day_ord'].values
y_draft_te17 = test_draft_17['draft_day_ord'].values

# Scale for ordinal logistic (same scalers as before, but transform draft subsets)
X_draft_tr_scaled = scaler.transform(X_draft_tr)
X_draft_te_scaled = scaler.transform(X_draft_te)
X_draft_tr17_scaled = scaler17.transform(X_draft_tr17)
X_draft_te17_scaled = scaler17.transform(X_draft_te17)

print('Draft Day modeling (drafted players only)')
print('Train drafted:', len(train_draft), '| Test drafted:', len(test_draft))
print('Train 2017+ drafted:', len(train_draft_17), '| Test 2017+ drafted:', len(test_draft_17))
print('Day 1 (R1):', (train_draft['draft_day']==1).sum(), 'train,', (test_draft['draft_day']==1).sum(), 'test')
print('Day 2 (R2-3):', (train_draft['draft_day']==2).sum(), 'train,', (test_draft['draft_day']==2).sum(), 'test')
print('Day 3 (R4-7):', (train_draft['draft_day']==3).sum(), 'train,', (test_draft['draft_day']==3).sum(), 'test')

Draft Day modeling (drafted players only)
Train drafted: 170 | Test drafted: 33
Train 2017+ drafted: 36 | Test 2017+ drafted: 33
Day 1 (R1): 29 train, 6 test
Day 2 (R2-3): 65 train, 8 test
Day 3 (R4-7): 76 train, 19 test


In [None]:
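# "Ordinal logit" models (same setup as drafted/undrafted: two models, combined = average probabilities)
# 3 classes: 0=Day1, 1=Day2, 2=Day3. These are multinomial LogisticRegression fits, chosen for predict_proba;
# they treat the three days as unordered classes rather than enforcing Day1 < Day2 < Day3.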
ord_combine = LogisticRegression(max_iter=1000, random_state=42)
ord_college = LogisticRegression(max_iter=1000, random_state=43)

ord_combine.fit(X_draft_tr_scaled, y_draft_tr)
prob_ord_combine = ord_combine.predict_proba(X_draft_te_scaled)
pred_ord_combine = ord_combine.predict(X_draft_te_scaled).astype(int).clip(0, 2)

ord_college.fit(X_draft_tr17_scaled, y_draft_tr17)
prob_ord_college = ord_college.predict_proba(X_draft_te17_scaled)
pred_ord_college = ord_college.predict(X_draft_te17_scaled).astype(int).clip(0, 2)

# Combined: average probabilities (same as drafted/undrafted), then argmax
prob_ord_combined = (prob_ord_combine + prob_ord_college) / 2
pred_ord_combined = np.argmax(prob_ord_combined, axis=1)

day_names = ['Day 1 (R1)', 'Day 2 (R2-3)', 'Day 3 (R4-7)']
for name, pred in [('Ordinal logit (combine-only)', pred_ord_combine), ('Ordinal logit (combine+college)', pred_ord_college), ('Ordinal logit combined', pred_ord_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (rows=actual, cols=Day1,Day2,Day3):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()

Ordinal logit (combine-only)
  Accuracy: 0.5455
  Confusion matrix (rows=actual, cols=Day1,Day2,Day3):
 [[ 2  0  4]
 [ 2  3  3]
 [ 0  6 13]]
  Macro F1: 0.4732

Ordinal logit (combine+college)
  Accuracy: 0.5455
  Confusion matrix (rows=actual, cols=Day1,Day2,Day3):
 [[ 3  1  2]
 [ 1  7  0]
 [ 0 11  8]]
  Macro F1: 0.5567

Ordinal logit combined
  Accuracy: 0.5758
  Confusion matrix (rows=actual, cols=Day1,Day2,Day3):
 [[ 3  1  2]
 [ 1  7  0]
 [ 0 10  9]]
  Macro F1: 0.5795
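
Macro F1 can be recomputed directly from a confusion matrix, which helps when reading the outputs above. The sketch below does this for the "Ordinal logit combined" matrix, using the identity F1 = 2·TP / (actual count + predicted count) for each class. The helper name is hypothetical.

```python
def macro_f1_from_confusion(cm):
    """Return (macro F1, per-class F1 list) for a square confusion matrix.

    cm[r][c] = number of players with actual class r predicted as class c.
    """
    k = len(cm)
    f1s = []
    for c in range(k):
        tp = cm[c][c]
        actual = sum(cm[c])                          # row total
        predicted = sum(cm[r][c] for r in range(k))  # column total
        denom = actual + predicted
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / k, f1s


# "Ordinal logit combined" matrix from the output above (Day1, Day2, Day3).
cm = [[3, 1, 2],
      [1, 7, 0],
      [0, 10, 9]]

macro, per_class = macro_f1_from_confusion(cm)
print("per-class F1:", [round(f, 4) for f in per_class])
print("macro F1:", round(macro, 4))
```

Each class counts equally in the average, so the six Day 1 players weigh as much as the nineteen Day 3 players. That is why a model that never predicts Day 1 scores poorly on Macro F1 even with reasonable accuracy.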



In [None]:
# Random Forest: draft day — combine-only, combine+college, combined
rf_day_combine = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42)
rf_day_combine.fit(X_draft_tr, y_draft_tr)
pred_rf_day_combine = rf_day_combine.predict(X_draft_te)
prob_rf_day_combine = rf_day_combine.predict_proba(X_draft_te)

rf_day_college = RandomForestClassifier(n_estimators=200, max_depth=2, random_state=42)
rf_day_college.fit(X_draft_tr17, y_draft_tr17)
pred_rf_day_college = rf_day_college.predict(X_draft_te17)
prob_rf_day_college = rf_day_college.predict_proba(X_draft_te17)

# Combined: average class probabilities, then argmax
prob_rf_day_combined = (prob_rf_day_combine + prob_rf_day_college) / 2
pred_rf_day_combined = np.argmax(prob_rf_day_combined, axis=1)

for name, pred in [('RF (combine-only)', pred_rf_day_combine), ('RF (combine+college)', pred_rf_day_college), ('RF combined', pred_rf_day_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix:\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()

RF (combine-only)
  Accuracy: 0.6061
  Confusion matrix:
 [[ 1  0  5]
 [ 0  3  5]
 [ 0  3 16]]
  Macro F1: 0.4751

RF (combine+college)
  Accuracy: 0.5758
  Confusion matrix:
 [[ 0  0  6]
 [ 0  4  4]
 [ 0  4 15]]
  Macro F1: 0.3939

RF combined
  Accuracy: 0.5455
  Confusion matrix:
 [[ 0  0  6]
 [ 0  3  5]
 [ 0  4 15]]
  Macro F1: 0.3556



In [None]:
# XGBoost: draft day — combine-only, combine+college, combined
xgb_day_combine = xgb.XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_combine.fit(X_draft_tr, y_draft_tr)
pred_xgb_day_combine = xgb_day_combine.predict(X_draft_te)
prob_xgb_day_combine = xgb_day_combine.predict_proba(X_draft_te)

xgb_day_college = xgb.XGBClassifier(n_estimators=200, max_depth=7, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college.fit(X_draft_tr17, y_draft_tr17)
pred_xgb_day_college = xgb_day_college.predict(X_draft_te17)
prob_xgb_day_college = xgb_day_college.predict_proba(X_draft_te17)

# Combined: average class probabilities, then argmax
prob_xgb_day_combined = (prob_xgb_day_combine + prob_xgb_day_college) / 2
pred_xgb_day_combined = np.argmax(prob_xgb_day_combined, axis=1)

for name, pred in [('XGBoost (combine-only)', pred_xgb_day_combine), ('XGBoost (combine+college)', pred_xgb_day_college), ('XGBoost combined', pred_xgb_day_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix:\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()



XGBoost (combine-only)
  Accuracy: 0.5152
  Confusion matrix:
 [[ 1  2  3]
 [ 2  4  2]
 [ 1  6 12]]
  Macro F1: 0.4222

XGBoost (combine+college)
  Accuracy: 0.6667
  Confusion matrix:
 [[ 4  0  2]
 [ 2  5  1]
 [ 2  4 13]]
  Macro F1: 0.6342

XGBoost combined
  Accuracy: 0.5758
  Confusion matrix:
 [[ 2  1  3]
 [ 2  4  2]
 [ 2  4 13]]
  Macro F1: 0.5022



In [None]:
# Compare all draft-day models (same test set: drafted players only, y_draft_te)
day_models = [
    ('Ordinal logit (combine-only)', pred_ord_combine),
    ('Ordinal logit (combine+college)', pred_ord_college),
    ('Ordinal logit combined', pred_ord_combined),
    ('RF (combine-only)', pred_rf_day_combine),
    ('RF (combine+college)', pred_rf_day_college),
    ('RF combined', pred_rf_day_combined),
    ('XGBoost (combine-only)', pred_xgb_day_combine),
    ('XGBoost (combine+college)', pred_xgb_day_college),
    ('XGBoost combined', pred_xgb_day_combined),
]

day_results = []
for name, pred in day_models:
    acc = (pred == y_draft_te).mean()
    f1 = f1_score(y_draft_te, pred, average='macro', zero_division=0)
    cm = confusion_matrix(y_draft_te, pred)
    # Per-class recall: Day1=0, Day2=1, Day3=2
    recalls = []
    for k in range(3):
        if cm.shape[0] > k and cm[k].sum() > 0:
            recalls.append(cm[k, k] / cm[k].sum())
        else:
            recalls.append(np.nan)
    day_results.append({
        'Model': name,
        'Accuracy': acc,
        'Macro F1': f1,
        'Recall Day1': recalls[0] if not np.isnan(recalls[0]) else None,
        'Recall Day2': recalls[1] if not np.isnan(recalls[1]) else None,
        'Recall Day3': recalls[2] if not np.isnan(recalls[2]) else None,
    })

day_df = pd.DataFrame(day_results)
day_df = day_df.sort_values('Macro F1', ascending=False).reset_index(drop=True)
print('Draft-day models ranked by Macro F1 (test set: drafted only, n=' + str(len(y_draft_te)) + ')')
print('=' * 80)
print(day_df.to_string(index=False))
print()
best_acc = day_df.loc[day_df['Accuracy'].idxmax(), 'Model']
best_acc_val = day_df['Accuracy'].max()
best_f1 = day_df.loc[0, 'Model']
best_f1_val = day_df.loc[0, 'Macro F1']
print('Summary:')
print('  Best by Accuracy:', best_acc, '(' + str(round(best_acc_val, 4)) + ')')
print('  Best by Macro F1:', best_f1, '(' + str(round(best_f1_val, 4)) + ')')

Draft-day models ranked by Macro F1 (test set: drafted only, n=33)
                          Model  Accuracy  Macro F1  Recall Day1  Recall Day2  Recall Day3
      XGBoost (combine+college)  0.666667  0.634174     0.666667        0.625     0.684211
         Ordinal logit combined  0.575758  0.579487     0.500000        0.875     0.473684
Ordinal logit (combine+college)  0.545455  0.556748     0.500000        0.875     0.421053
               XGBoost combined  0.575758  0.502208     0.333333        0.500     0.684211
              RF (combine-only)  0.606061  0.475132     0.166667        0.375     0.842105
   Ordinal logit (combine-only)  0.545455  0.473203     0.333333        0.375     0.684211
         XGBoost (combine-only)  0.515152  0.422222     0.166667        0.500     0.631579
           RF (combine+college)  0.575758  0.393939     0.000000        0.500     0.789474
                    RF combined  0.545455  0.355556     0.000000        0.375     0.789474

Summary:
  Best by Acc

## Pipeline: Drafted/Undrafted + Draft Day

Test every combination of (drafted/undrafted model) × (draft-day model): 9 × 9 = 81 pipelines. In each pipeline, a player predicted undrafted is labelled "Undrafted"; a player predicted drafted is passed to the draft-day model, which assigns Day 1, 2, or 3. Each pipeline is evaluated on the resulting four classes: Undrafted, Day 1, Day 2, Day 3.
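
The two-stage rule can be sketched in a few lines. The predictions below are made up for illustration, and `combine_stages` is a hypothetical helper, not a function from the notebook.

```python
LABELS = ["Undrafted", "Day 1", "Day 2", "Day 3"]


def combine_stages(drafted_pred, day_pred):
    """Merge stage-1 (0/1 drafted) and stage-2 (0/1/2 day) predictions.

    Returns 4-class labels: 0 = Undrafted, 1 = Day 1, 2 = Day 2, 3 = Day 3.
    The day prediction is ignored for players predicted undrafted.
    """
    if len(drafted_pred) != len(day_pred):
        raise ValueError("both stages must cover the same players")
    return [0 if d == 0 else 1 + day for d, day in zip(drafted_pred, day_pred)]


# Made-up predictions for four players.
drafted_pred = [1, 0, 1, 1]   # stage 1: drafted (1) or undrafted (0)
day_pred = [0, 2, 1, 2]       # stage 2: 0 = Day 1, 1 = Day 2, 2 = Day 3

four_class = combine_stages(drafted_pred, day_pred)
print([LABELS[c] for c in four_class])
```

One consequence of this design is that a stage-1 mistake cannot be repaired by stage 2: a drafted player who is predicted undrafted is counted as wrong regardless of what the draft-day model would have said.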

In [None]:
# Actual 4-class labels for full test set: 0=Undrafted, 1=Day1, 2=Day2, 3=Day3
def round_to_draft_day(r):
    if pd.isna(r): return np.nan
    r = int(r)
    if r == 1: return 1
    if r in (2, 3): return 2
    if r in (4, 5, 6, 7): return 3
    return np.nan

y_actual_4 = []
for _, row in test_df.iterrows():
    if not row['Drafted']:
        y_actual_4.append(0)
    else:
        d = round_to_draft_day(row['Round'])
        y_actual_4.append(int(d))  # 1, 2, or 3
y_actual_4 = np.array(y_actual_4)

# Full-test draft-day predictions (length = len(test_df)) for each draft-day model
day_ord_c = ord_combine.predict(X_te_scaled).astype(int).clip(0, 2)
day_ord_17 = ord_college.predict(X_te17_scaled).astype(int).clip(0, 2)
day_ord_comb = np.argmax((ord_combine.predict_proba(X_te_scaled) + ord_college.predict_proba(X_te17_scaled)) / 2, axis=1)
day_rf_c = rf_day_combine.predict(X_te)
day_rf_17 = rf_day_college.predict(X_te17)
day_rf_comb = np.argmax((rf_day_combine.predict_proba(X_te) + rf_day_college.predict_proba(X_te17)) / 2, axis=1)
day_xgb_c = xgb_day_combine.predict(X_te)
day_xgb_17 = xgb_day_college.predict(X_te17)
day_xgb_comb = np.argmax((xgb_day_combine.predict_proba(X_te) + xgb_day_college.predict_proba(X_te17)) / 2, axis=1)

drafted_models = [
    ('Logistic (combine)', y_pred),
    ('Logistic (college)', y_pred17),
    ('Logistic combined', combined_pred),
    ('RF (combine)', y_pred_rf),
    ('RF (college)', y_pred_rf17),
    ('RF combined', combined_pred_rf),
    ('XGB (combine)', y_pred_xgb),
    ('XGB (college)', y_pred_xgb17),
    ('XGB combined', combined_pred_xgb),
]
day_models_full = [
    ('Ordinal (combine)', day_ord_c),
    ('Ordinal (college)', day_ord_17),
    ('Ordinal combined', day_ord_comb),
    ('RF (combine)', day_rf_c),
    ('RF (college)', day_rf_17),
    ('RF combined', day_rf_comb),
    ('XGB (combine)', day_xgb_c),
    ('XGB (college)', day_xgb_17),
    ('XGB combined', day_xgb_comb),
]

# For each combination: pred_4 = 0 if drafted_pred==0 else (1 + day_pred)
results = []
for dname, d_pred in drafted_models:
    for dayname, day_pred in day_models_full:
        pred_4 = np.where(d_pred == 0, 0, 1 + day_pred.astype(int).clip(0, 2))
        acc = (pred_4 == y_actual_4).mean()
        f1 = f1_score(y_actual_4, pred_4, average='macro', zero_division=0)
        results.append({'Drafted/Undrafted': dname, 'Draft Day': dayname, 'Accuracy': acc, 'Macro F1': f1})

pipe_df = pd.DataFrame(results)
pipe_df = pipe_df.sort_values('Macro F1', ascending=False).reset_index(drop=True)
print('All 81 combinations (Drafted/Undrafted × Draft Day), ranked by Macro F1')
print('=' * 90)
print(pipe_df.to_string(index=False))
print()
print('Top 10 combinations:')
print(pipe_df.head(10).to_string(index=False))
print()
print('Best pair:', pipe_df.loc[0, 'Drafted/Undrafted'], '+', pipe_df.loc[0, 'Draft Day'],
      '| Macro F1 =', round(pipe_df.loc[0, 'Macro F1'], 4), '| Accuracy =', round(pipe_df.loc[0, 'Accuracy'], 4))

All 81 combinations (Drafted/Undrafted × Draft Day), ranked by Macro F1
 Drafted/Undrafted         Draft Day  Accuracy  Macro F1
 Logistic combined     XGB (college)  0.648148  0.621935
 Logistic combined  Ordinal combined  0.592593  0.581112
      XGB combined     XGB (college)  0.574074  0.573109
 Logistic combined Ordinal (college)  0.574074  0.565877
     XGB (combine)     XGB (college)  0.574074  0.554958
Logistic (college)  Ordinal combined  0.555556  0.545076
      RF (combine)     XGB (college)  0.537037  0.540754
Logistic (college)     XGB (college)  0.555556  0.538480
     XGB (college)     XGB (college)  0.518519  0.537551
      XGB combined  Ordinal combined  0.537037  0.537414
 Logistic combined      XGB combined  0.592593  0.530373
Logistic (college) Ordinal (college)  0.537037  0.528239
      RF (college)     XGB (college)  0.518519  0.526526
Logistic (combine)     XGB (college)  0.537037  0.525618
     XGB (combine)  Ordinal combined  0.518519  0.521741
      XGB combin

## Predict draft for a single player

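The function below picks the best pipeline (drafted/undrafted model + draft-day model) by Macro F1 given the stats available for the player, then returns that pipeline's prediction. If the player is missing any of the three final-season college stats, only combine-only model pairs are considered; otherwise the top-ranked pair overall is used.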
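
The selection rule can be illustrated on a tiny ranking table. In the sketch below the first two rows reuse Macro F1 values from the pipeline ranking above; the two combine-only rows are made-up placeholders, since those rows are not shown in the printed output. `pick_pipeline` is a hypothetical stand-in for the notebook's selection function.

```python
COMBINE_ONLY_DRAFTED = {"Logistic (combine)", "RF (combine)", "XGB (combine)"}
COMBINE_ONLY_DAY = {"Ordinal (combine)", "RF (combine)", "XGB (combine)"}


def pick_pipeline(pairs, has_college_stats):
    """pairs: list of (drafted_model, day_model, macro_f1) in any order."""
    candidates = pairs
    if not has_college_stats:
        restricted = [p for p in pairs
                      if p[0] in COMBINE_ONLY_DRAFTED and p[1] in COMBINE_ONLY_DAY]
        if restricted:  # fall back to the full table if nothing qualifies
            candidates = restricted
    best = max(candidates, key=lambda p: p[2])
    return best[0], best[1]


pairs = [
    ("Logistic combined", "XGB (college)", 0.621935),  # from the ranking above
    ("XGB (combine)", "XGB (college)", 0.554958),      # from the ranking above
    ("RF (combine)", "Ordinal (combine)", 0.45),       # placeholder value
    ("XGB (combine)", "RF (combine)", 0.41),           # placeholder value
]

print(pick_pipeline(pairs, has_college_stats=True))
print(pick_pipeline(pairs, has_college_stats=False))
```

A pair only counts as combine-only when both stages are combine-only, so ("XGB (combine)", "XGB (college)") is excluded for a player without college stats.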

In [None]:
def _player_has_college_stats(player_dict):
    """True if player has all three college stats (non-null)."""
    keys = ['QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season']
    for k in keys:
        v = player_dict.get(k, player_dict.get(k.replace('_', ' ')))
        if v is None or (isinstance(v, float) and np.isnan(v)):
            return False
    return True

def _get_val(player_dict, *keys, default=np.nan):
    for k in keys:
        if k in player_dict and player_dict[k] is not None:
            v = player_dict[k]
            if isinstance(v, float) and np.isnan(v):
                continue
            return v
    return default

def _height_inches(h):
    if h is None or (isinstance(h, float) and np.isnan(h)):
        return np.nan
    if isinstance(h, (int, float)):
        return float(h)
    if isinstance(h, str) and '-' in h:
        parts = h.strip().split('-')
        return int(parts[0]) * 12 + int(parts[1])
    return np.nan

# Map contains_* column name -> feature name for checking if player has that stat
CONTAINS_TO_FEATURE = {
    'contains_broad_jump': 'Broad Jump', 'contains_vertical': 'Vertical', 'contains_shuttle': 'Shuttle',
    'contains_three_cone': '3Cone', 'contains_40yd': '40yd', 'contains_height': 'Height', 'contains_weight': 'Weight',
    'contains_speed_score': 'speed_score', 'contains_explosive_score': 'explosive_score', 'contains_agility_score': 'agility_score',
    'contains_qb_hurry_final_season': 'QB_Hurry_final_season', 'contains_tfl_final_season': 'TFL_final_season', 'contains_sacks_final_season': 'Sacks_final_season',
    'contains_sacks_cumulative': 'Sacks_cumulative', 'contains_tfl_cumulative': 'TFL_cumulative', 'contains_qb_hurry_cumulative': 'QB_Hurry_cumulative',
    'contains_p4_conference': 'p4_conference',
}

def _player_row(player_dict, feature_list, contains_list, medians, add_speed=True):
    """Build one row for the player with feature_list + contains_*; fill missing with medians."""
    row = {}
    for col in feature_list:
        v = _get_val(player_dict, col, col.replace(' ', '_').lower(), col.replace(' ', ''))
        if col == 'Height':
            v = _height_inches(v)
        # Derive speed_score = weight * 200 / forty^4 when it was not supplied.
        if add_speed and col == 'speed_score' and pd.isna(v):
            w, forty = _get_val(player_dict, 'Weight'), _get_val(player_dict, '40yd')
            if not pd.isna(w) and not pd.isna(forty) and float(forty) > 0:
                v = float(w) * 200 / (float(forty) ** 4)
        # pd.isna covers None and any NaN value, not only the np.nan singleton.
        row[col] = medians.get(col, np.nan) if pd.isna(v) else v
    for col in contains_list:
        feat = CONTAINS_TO_FEATURE.get(col, col.replace('contains_', '').replace('_', ' '))
        v = _get_val(player_dict, feat, feat.replace(' ', '_').lower())
        # Indicator: 1 if the player supplied this stat, 0 if it is missing.
        row[col] = 0 if pd.isna(v) else 1
    return pd.Series(row)

def get_best_pipeline_for_player(player_dict, pipe_df):
    """Select best (drafted/undrafted, draft day) model pair given what player data is available."""
    has_college = _player_has_college_stats(player_dict)
    if not has_college:
        # Restrict to combine-only models (exclude 'college' and 'combined')
        drafted_ok = pipe_df['Drafted/Undrafted'].isin(['Logistic (combine)', 'RF (combine)', 'XGB (combine)'])
        day_ok = pipe_df['Draft Day'].isin(['Ordinal (combine)', 'RF (combine)', 'XGB (combine)'])
        sub = pipe_df[drafted_ok & day_ok].sort_values('Macro F1', ascending=False)
        if len(sub) == 0:
            sub = pipe_df
    else:
        sub = pipe_df
    best = sub.iloc[0]
    return best['Drafted/Undrafted'], best['Draft Day']

def _run_drafted_model(drafted_name, row_combine, row_full):
    """Return P(drafted) in [0,1] for the given model name."""
    if drafted_name == 'Logistic (combine)':
        return logit_draft.predict_proba(scaler.transform(row_combine.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college)':
        return logit_draft_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0, 1]
    if drafted_name == 'Logistic combined':
        p1 = logit_draft.predict_proba(scaler.transform(row_combine.to_frame().T))[0, 1]
        p2 = logit_draft_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0, 1]
        return (p1 + p2) / 2
    if drafted_name == 'RF (combine)':
        return rf_combine.predict_proba(row_combine.to_frame().T)[0, 1]
    if drafted_name == 'RF (college)':
        return rf_college.predict_proba(row_full.to_frame().T)[0, 1]
    if drafted_name == 'RF combined':
        p1 = rf_combine.predict_proba(row_combine.to_frame().T)[0, 1]
        p2 = rf_college.predict_proba(row_full.to_frame().T)[0, 1]
        return (p1 + p2) / 2
    if drafted_name == 'XGB (combine)':
        return xgb_combine.predict_proba(row_combine.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college)':
        return xgb_college.predict_proba(row_full.to_frame().T)[0, 1]
    if drafted_name == 'XGB combined':
        p1 = xgb_combine.predict_proba(row_combine.to_frame().T)[0, 1]
        p2 = xgb_college.predict_proba(row_full.to_frame().T)[0, 1]
        return (p1 + p2) / 2
    raise ValueError(f'Unknown drafted/undrafted model: {drafted_name!r}')

def _run_day_model(day_name, row_combine, row_full):
    """Return class 0/1/2 (Day1/Day2/Day3) for the given model name."""
    if day_name == 'Ordinal (combine)':
        return int(ord_combine.predict(scaler.transform(row_combine.to_frame().T))[0])
    if day_name == 'Ordinal (college)':
        return int(ord_college.predict(scaler17.transform(row_full.to_frame().T))[0])
    if day_name == 'Ordinal combined':
        p1 = ord_combine.predict_proba(scaler.transform(row_combine.to_frame().T))[0]
        p2 = ord_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0]
        return int(np.argmax((p1 + p2) / 2))
    if day_name == 'RF (combine)':
        return int(rf_day_combine.predict(row_combine.to_frame().T)[0])
    if day_name == 'RF (college)':
        return int(rf_day_college.predict(row_full.to_frame().T)[0])
    if day_name == 'RF combined':
        p1 = rf_day_combine.predict_proba(row_combine.to_frame().T)[0]
        p2 = rf_day_college.predict_proba(row_full.to_frame().T)[0]
        return int(np.argmax((p1 + p2) / 2))
    if day_name == 'XGB (combine)':
        return int(xgb_day_combine.predict(row_combine.to_frame().T)[0])
    if day_name == 'XGB (college)':
        return int(xgb_day_college.predict(row_full.to_frame().T)[0])
    if day_name == 'XGB combined':
        p1 = xgb_day_combine.predict_proba(row_combine.to_frame().T)[0]
        p2 = xgb_day_college.predict_proba(row_full.to_frame().T)[0]
        return int(np.argmax((p1 + p2) / 2))
    raise ValueError(f'Unknown draft-day model: {day_name!r}')

def predict_draft(player_dict, pipe_df=None):
    """
    Predict drafted/undrafted and (if drafted) draft day for one player.
    player_dict: keys like Height, Weight, 40yd, Vertical, Broad Jump, Shuttle, 3Cone,
                 QB_Hurry_final_season, TFL_final_season, Sacks_final_season.
                 Height can be inches (int) or "6-4". Missing stats can be omitted or None.
    If college stats are missing, the best combine-only pipeline is used.
    Returns: dict with drafted (bool), draft_day (1|2|3 or None), drafted_model, day_model, prob_drafted.
    """
    if pipe_df is None:
        pipe_df = globals().get('pipe_df')
    if pipe_df is None:
        raise ValueError('Run the pipeline comparison cell first to create pipe_df, or pass pipe_df.')
    drafted_name, day_name = get_best_pipeline_for_player(player_dict, pipe_df)
    row_combine = _player_row(player_dict, COMBINE_ONLY_FEATURES, COMBINE_ONLY_CONTAINS, train_medians)
    row_combine = row_combine.reindex(COMBINE_ONLY_ALL).fillna(train_medians)
    row_full = _player_row(player_dict, FEATURES_WITH_COLLEGE, CONTAINS_WITH_COLLEGE, train_medians17)
    row_full = row_full.reindex(FEATURES_WITH_COLLEGE_ALL).fillna(train_medians17)
    prob_drafted = _run_drafted_model(drafted_name, row_combine, row_full)
    drafted = prob_drafted >= 0.5
    draft_day = None
    if drafted:
        day_class = _run_day_model(day_name, row_combine, row_full)
        draft_day = int(np.clip(day_class, 0, 2)) + 1  # 0->Day1, 1->Day2, 2->Day3
    return {
        'drafted': drafted,
        'draft_day': draft_day,
        'drafted_model': drafted_name,
        'day_model': day_name,
        'prob_drafted': float(prob_drafted),
    }

# Example (run after all model cells):
player = {'Height': '6-4', 'Weight': 265, '40yd': 4.65, 'Vertical': 35, 'Broad Jump': 118,
          'Shuttle': 4.4, '3Cone': 7.2, 'QB_Hurry_final_season': 12, 'TFL_final_season': 10, 'Sacks_final_season': 6}
predict_draft(player)

{'drafted': np.True_,
 'draft_day': 2,
 'drafted_model': 'Logistic combined',
 'day_model': 'XGB (college)',
 'prob_drafted': 0.9374035489568803}
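
Two of the input conversions used when building a player row are easy to check by hand: the "feet-inches" height string and the speed score fallback (weight × 200 / forty⁴). The sketch below reproduces both for the example player; the function names are illustrative, not the notebook's private helpers.

```python
def height_to_inches(height):
    """Convert '6-4' to 76; pass numeric heights (already in inches) through."""
    if isinstance(height, (int, float)):
        return float(height)
    feet, inches = height.strip().split("-")
    return int(feet) * 12 + int(inches)


def speed_score(weight_lb, forty_s):
    """Speed score as used in the row builder: weight * 200 / forty^4."""
    if forty_s <= 0:
        raise ValueError("40-yard time must be positive")
    return weight_lb * 200 / forty_s ** 4


# Values from the example player above.
print("Height (in):", height_to_inches("6-4"))
print("Speed score:", round(speed_score(265, 4.65), 2))
```

Because the 40-yard time is raised to the fourth power, small timing differences move the score much more than comparable changes in weight.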

In [None]:
# Example predictions: Peter Woods, Caleb Banks, Kayden McDonald, Christen Miller
# Stats from listed measurements and college production; combine values are estimates where not reported.
# Optional: Sacks_cumulative, TFL_cumulative, QB_Hurry_cumulative — if provided, the college/combined
# models use them; if omitted or None, training medians are used.

examples = {
    'Peter Woods': {
        'Height': '6-3', 'Weight': 310,
        '40yd': 4.75, 'Vertical': None, 'Broad Jump': None,
        'Shuttle': None, '3Cone': None,
        'QB_Hurry_final_season': 9, 'TFL_final_season': 3.5, 'Sacks_final_season': 3.5,  # Clemson
        'Sacks_cumulative': 5, 'TFL_cumulative': 14.5, 'QB_Hurry_cumulative': 42,  # optional cumulative stats
    },
    'Caleb Banks': {
        'Height': '6-6', 'Weight': 330,
        '40yd': 5.2, 'Vertical': None, 'Broad Jump': None,
        'Shuttle': None, '3Cone': None,
        'QB_Hurry_final_season': 6, 'TFL_final_season': 7, 'Sacks_final_season': 4.5,  # 2024 season (injured in 2025)
        'Sacks_cumulative': 5.5, 'TFL_cumulative': 9.5, 'QB_Hurry_cumulative': 42,
    },
    'Kayden McDonald': {
        'Height': '6-3', 'Weight': 325,
        '40yd': 5.15, 'Vertical': None, 'Broad Jump': None,
        'Shuttle': None, '3Cone': None,
        'QB_Hurry_final_season': 8, 'TFL_final_season': 9, 'Sacks_final_season': 3,    # Ohio State 2025
        'Sacks_cumulative': 3, 'TFL_cumulative': 11, 'QB_Hurry_cumulative': 25,
    },
    'Christen Miller': {
        'Height': '6-4', 'Weight': 305,
        '40yd': 4.90, 'Vertical': None, 'Broad Jump': None,
        'Shuttle': None, '3Cone': None,
        'QB_Hurry_final_season': 14, 'TFL_final_season': 4, 'Sacks_final_season': 1.5,    # Georgia
        'Sacks_cumulative': 4, 'TFL_cumulative': 11.5, 'QB_Hurry_cumulative': None,
    },
}

def _all_pipeline_preds(player_dict, pipe_subset):
    """Run all pipelines in pipe_subset for this player; return arrays of prob_drafted and draft_day (1/2/3 or nan)."""
    row_combine = _player_row(player_dict, COMBINE_ONLY_FEATURES, COMBINE_ONLY_CONTAINS, train_medians)
    row_combine = row_combine.reindex(COMBINE_ONLY_ALL).fillna(train_medians)
    row_full = _player_row(player_dict, FEATURES_WITH_COLLEGE, CONTAINS_WITH_COLLEGE, train_medians17)
    row_full = row_full.reindex(FEATURES_WITH_COLLEGE_ALL).fillna(train_medians17)
    probs, days = [], []
    for _, r in pipe_subset.iterrows():
        p = _run_drafted_model(r['Drafted/Undrafted'], row_combine, row_full)
        probs.append(p)
        if p >= 0.5:
            day_class = _run_day_model(r['Draft Day'], row_combine, row_full)
            days.append(int(np.clip(day_class, 0, 2)) + 1)
        else:
            days.append(np.nan)
    return np.array(probs), np.array(days)

print('Draft predictions (best pipeline) + variance across all pipelines\n' + '=' * 70)
for name, player_dict in examples.items():
    out = predict_draft(player_dict)
    day_str = f"Day {out['draft_day']}" if out['drafted'] else 'Undrafted'
    print(f"{name}: {day_str}  (P(drafted)={out['prob_drafted']:.3f})")
    print(f"  Best models: {out['drafted_model']} + {out['day_model']}")

    probs, days = _all_pipeline_preds(player_dict, pipe_df)
    p_mean, p_var, p_std = probs.mean(), probs.var(), probs.std()
    print(f"  P(drafted) across {len(probs)} pipelines: mean = {p_mean:.3f}, std = {p_std:.3f}, variance = {p_var:.4f}")
    days_drafted = days[~np.isnan(days)]
    if len(days_drafted) > 0:
        d_mean, d_var, d_std = days_drafted.mean(), days_drafted.var(), days_drafted.std()
        print(f"  Draft day (when predicted drafted): mean = {d_mean:.2f}, std = {d_std:.2f}, variance = {d_var:.4f}")
        print(f"  Draft day distribution: Day1={np.sum(days==1)}, Day2={np.sum(days==2)}, Day3={np.sum(days==3)}")
    else:
        print("  Draft day: all pipelines predicted Undrafted")
    print()

Draft predictions (best pipeline) + variance across all pipelines
======================================================================
Peter Woods: Day 1  (P(drafted)=0.990)
  Best models: Logistic combined + XGB (college)
  P(drafted) across 81 pipelines: mean = 0.921, std = 0.107, variance = 0.0113
  Draft day (when predicted drafted): mean = 2.33, std = 0.67, variance = 0.4444
  Draft day distribution: Day1=9, Day2=36, Day3=36

Caleb Banks: Day 3  (P(drafted)=0.946)
  Best models: Logistic combined + XGB (college)
  P(drafted) across 81 pipelines: mean = 0.914, std = 0.072, variance = 0.0052
  Draft day (when predicted drafted): mean = 2.67, std = 0.47, variance = 0.2222
  Draft day distribution: Day1=0, Day2=27, Day3=54

Kayden McDonald: Day 3  (P(drafted)=0.937)
  Best models: Logistic combined + XGB (college)
  P(drafted) across 81 pipelines: mean = 0.838, std = 0.143, variance = 0.0205
  Draft day (when predicted drafted): mean = 3.00, std = 0.00, variance = 0.0000
  Draft day distribution: Day1=0, Day2=0, Day3=81

Christen Miller: Day 2  (P(draf
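
The per-player summary lines can be checked by hand. The sketch below rebuilds the Peter Woods draft-day distribution from the counts printed above (Day1=9, Day2=36, Day3=36) and recomputes the statistics the way the loop does, dropping NaN entries first. `summarize_days` is a hypothetical helper name introduced only for this illustration. Note that NumPy's `std` and `var` default to the population form (`ddof=0`), which is what the printed values reflect.

```python
import numpy as np

def summarize_days(days):
    """Mean, population std/variance and per-day counts, ignoring NaN (undrafted) entries."""
    drafted = days[~np.isnan(days)]
    if drafted.size == 0:
        return None  # every pipeline predicted Undrafted
    counts = {d: int(np.sum(drafted == d)) for d in (1, 2, 3)}
    return {
        "mean": drafted.mean(),
        "std": drafted.std(),   # ddof=0 by default
        "var": drafted.var(),   # ddof=0 by default
        "counts": counts,
    }

# Rebuild the distribution reported for Peter Woods: 9 x Day 1, 36 x Day 2, 36 x Day 3.
days = np.array([1.0] * 9 + [2.0] * 36 + [3.0] * 36)
summary = summarize_days(days)
print(f"mean = {summary['mean']:.2f}, std = {summary['std']:.2f}, variance = {summary['var']:.4f}")
# mean = 2.33, std = 0.67, variance = 0.4444

# NaN entries (pipelines that predicted Undrafted) do not change the drafted-only statistics.
with_nan = np.append(days, [np.nan, np.nan])
summary_nan = summarize_days(with_nan)
```

The recomputed values match the printed line for Peter Woods, which confirms that the draft-day mean is taken only over pipelines that predicted the player as drafted.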

In [None]:
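# 2024 drafted DTs: compute speed_score, explosive_score, agility_score; run the model and add predicted_draft_day and models_used
# Requires prior cells (imports, train_df, predict_draft, pipe_df, etc.) to be run first;
# otherwise the cell stops with a NameError like the one shown in its output.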

dt_2024 = pd.read_csv('dt_drafted_2024.csv')
if dt_2024['Height'].dtype == object or (dt_2024['Height'].notna() & (dt_2024['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return np.nan
    dt_2024['Height'] = dt_2024['Height'].apply(_ht_inches)
else:
    dt_2024['Height'] = pd.to_numeric(dt_2024['Height'], errors='coerce')

dt_2024['speed_score'] = np.where(
    dt_2024['40yd'].notna() & (dt_2024['40yd'] > 0),
    dt_2024['Weight'] * 200 / (dt_2024['40yd'] ** 4),
    np.nan
)

tr_dt = train_df[train_df['Pos'] == 'DT']
mean_v = tr_dt['Vertical'].mean()
std_v = tr_dt['Vertical'].std()
mean_b = tr_dt['Broad Jump'].mean()
std_b = tr_dt['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v): std_v = 1.0
if std_b == 0 or np.isnan(std_b): std_b = 1.0
v_z = (dt_2024['Vertical'] - mean_v) / std_v
b_z = (dt_2024['Broad Jump'] - mean_b) / std_b
has_explosive = dt_2024['Vertical'].notna() | dt_2024['Broad Jump'].notna()
dt_2024['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

mean_3 = tr_dt['3Cone'].mean()
std_3 = tr_dt['3Cone'].std()
mean_sh = tr_dt['Shuttle'].mean()
std_sh = tr_dt['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3): std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh): std_sh = 1.0
z_3 = (dt_2024['3Cone'] - mean_3) / std_3
z_sh = (dt_2024['Shuttle'] - mean_sh) / std_sh
has_agility = dt_2024['3Cone'].notna() | dt_2024['Shuttle'].notna()
dt_2024['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

def row_to_player_dict(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row.get('Shuttle', np.nan), '3Cone': row.get('3Cone', np.nan),
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'Sacks_cumulative': row.get('Sacks_cumulative', np.nan),
        'TFL_cumulative': row.get('TFL_cumulative', np.nan),
        'QB_Hurry_cumulative': row.get('QB_Hurry_cumulative', np.nan),
        'speed_score': row['speed_score'], 'explosive_score': row['explosive_score'], 'agility_score': row['agility_score'],
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2024),
    }

predicted_draft_day = []
models_used = []
for _, row in dt_2024.iterrows():
    out = predict_draft(row_to_player_dict(row))
    if out['drafted']:
        predicted_draft_day.append(f"Day {out['draft_day']}")
        models_used.append(f"{out['drafted_model']} + {out['day_model']}")
    else:
        predicted_draft_day.append('Undrafted')
        models_used.append(out['drafted_model'])

dt_2024['predicted_draft_day'] = predicted_draft_day
dt_2024['models_used'] = models_used

print('2024 drafted DTs: speed_score, explosive_score, agility_score + predicted_draft_day + models_used')
print(dt_2024[['Round', 'Pick', 'Player', 'School', 'speed_score', 'explosive_score', 'agility_score', 'predicted_draft_day', 'models_used']].to_string())
dt_2024

NameError: name 'np' is not defined
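
The `speed_score` column above is `Weight * 200 / 40yd ** 4`, left as NaN when the 40-yard time is missing or non-positive. The following is a minimal, self-contained sketch of that formula. The `speed_score` function name is an assumption made for illustration. The first two rows reuse the weight and 40-yard values from the `examples` dict (Kayden McDonald and Christen Miller); the third row is a made-up player with no 40 time.

```python
import numpy as np
import pandas as pd

def speed_score(weight, forty):
    """Weight * 200 / forty**4, with NaN where the 40-yard time is missing or <= 0."""
    weight = pd.to_numeric(pd.Series(weight), errors="coerce")
    forty = pd.to_numeric(pd.Series(forty), errors="coerce")
    valid = forty.notna() & (forty > 0)
    return pd.Series(np.where(valid, weight * 200 / forty ** 4, np.nan))

weights = [325, 305, 330]
forties = [5.15, 4.90, None]   # last entry: hypothetical player without a 40 time
scores = speed_score(weights, forties)
print(scores.round(2).tolist())  # roughly 92.4, 105.8, then NaN
```

Because the 40 time is raised to the fourth power, a quarter-second difference outweighs a 20 lb weight difference here: the lighter, faster player ends up with the higher score.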

In [None]:
# 2025 drafted DTs: same analysis as 2024 (speed_score, explosive_score, agility_score, predicted_draft_day, models_used)
# Requires prior cells (imports, train_df, predict_draft, etc.) to be run first.

dt_2025 = pd.read_csv('dt_drafted_2025.csv')
if dt_2025['Height'].dtype == object or (dt_2025['Height'].notna() & (dt_2025['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return np.nan
    dt_2025['Height'] = dt_2025['Height'].apply(_ht_inches)
else:
    dt_2025['Height'] = pd.to_numeric(dt_2025['Height'], errors='coerce')

dt_2025['speed_score'] = np.where(dt_2025['40yd'].notna() & (dt_2025['40yd'] > 0),
    dt_2025['Weight'] * 200 / (dt_2025['40yd'] ** 4), np.nan)

tr_dt = train_df[train_df['Pos'] == 'DT']
mean_v, std_v = tr_dt['Vertical'].mean(), tr_dt['Vertical'].std()
mean_b, std_b = tr_dt['Broad Jump'].mean(), tr_dt['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v): std_v = 1.0
if std_b == 0 or np.isnan(std_b): std_b = 1.0
v_z = (dt_2025['Vertical'] - mean_v) / std_v
b_z = (dt_2025['Broad Jump'] - mean_b) / std_b
has_explosive = dt_2025['Vertical'].notna() | dt_2025['Broad Jump'].notna()
dt_2025['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

mean_3, std_3 = tr_dt['3Cone'].mean(), tr_dt['3Cone'].std()
mean_sh, std_sh = tr_dt['Shuttle'].mean(), tr_dt['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3): std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh): std_sh = 1.0
z_3 = (dt_2025['3Cone'] - mean_3) / std_3
z_sh = (dt_2025['Shuttle'] - mean_sh) / std_sh
has_agility = dt_2025['3Cone'].notna() | dt_2025['Shuttle'].notna()
dt_2025['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

def row_to_player_dict_2025(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row.get('Shuttle', np.nan), '3Cone': row.get('3Cone', np.nan),
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'Sacks_cumulative': row.get('Sacks_cumulative', np.nan),
        'TFL_cumulative': row.get('TFL_cumulative', np.nan),
        'QB_Hurry_cumulative': row.get('QB_Hurry_cumulative', np.nan),
        'speed_score': row['speed_score'], 'explosive_score': row['explosive_score'], 'agility_score': row['agility_score'],
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2025),
    }

predicted_draft_day_2025 = []
models_used_2025 = []
for _, row in dt_2025.iterrows():
    out = predict_draft(row_to_player_dict_2025(row))
    if out['drafted']:
        predicted_draft_day_2025.append(f"Day {out['draft_day']}")
        models_used_2025.append(f"{out['drafted_model']} + {out['day_model']}")
    else:
        predicted_draft_day_2025.append('Undrafted')
        models_used_2025.append(out['drafted_model'])

dt_2025['predicted_draft_day'] = predicted_draft_day_2025
dt_2025['models_used'] = models_used_2025

print('2025 drafted DTs: speed_score, explosive_score, agility_score + predicted_draft_day + models_used')
print(dt_2025[['Round', 'Pick', 'Player', 'School', 'speed_score', 'explosive_score', 'agility_score', 'predicted_draft_day', 'models_used']].to_string())
dt_2025
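
The explosive and agility scores in these cells are sums of z-scores computed against the training DTs: a missing component counts as 0 (the training average), the score is NaN only when both components are missing, and the agility z-scores are negated because lower times are better. A compact sketch of that logic on toy data follows. The `composite_z` helper and the numbers in `train` and `new` are made up for illustration and are not taken from the notebook's data. pandas' `.std()` uses the sample form (`ddof=1`), as in the cells above.

```python
import numpy as np
import pandas as pd

def composite_z(df, ref, cols, lower_is_better=False):
    """Sum of z-scores of `cols` in df, standardized with the mean/std of `ref`."""
    total = pd.Series(0.0, index=df.index)
    for col in cols:
        mean, std = ref[col].mean(), ref[col].std()  # pandas std: ddof=1
        if std == 0 or np.isnan(std):
            std = 1.0  # same guard as the notebook cells
        z = (df[col] - mean) / std
        if lower_is_better:
            z = -z
        total = total + z.fillna(0)  # a missing drill contributes 0
    has_any = df[cols].notna().any(axis=1)
    return total.where(has_any)  # NaN when every drill is missing

# Toy reference (training) data and toy new players.
train = pd.DataFrame({
    "Vertical": [28.0, 30.0, 32.0], "Broad Jump": [100.0, 105.0, 110.0],
    "3Cone": [7.5, 7.7, 7.9], "Shuttle": [4.6, 4.7, 4.8],
})
new = pd.DataFrame({
    "Vertical": [32.0, None, None], "Broad Jump": [110.0, 100.0, None],
    "3Cone": [7.5, None, None], "Shuttle": [4.6, 4.8, None],
})

explosive = composite_z(new, train, ["Vertical", "Broad Jump"])
agility = composite_z(new, train, ["3Cone", "Shuttle"], lower_is_better=True)
print(pd.DataFrame({"explosive_score": explosive, "agility_score": agility}))
```

One consequence of filling a missing drill with 0 is that a player who skipped a drill is treated as exactly average in it, so the composite for partial testers is pulled toward zero.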

In [None]:
# 2026 drafted DTs: same analysis as 2024/2025 (speed_score, explosive_score, agility_score, predicted_draft_day, models_used)
# Requires prior cells (imports, train_df, predict_draft, etc.) to be run first.

dt_2026 = pd.read_csv('dt_drafted_2026.csv')
if dt_2026['Height'].dtype == object or (dt_2026['Height'].notna() & (dt_2026['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return np.nan
    dt_2026['Height'] = dt_2026['Height'].apply(_ht_inches)
else:
    dt_2026['Height'] = pd.to_numeric(dt_2026['Height'], errors='coerce')

dt_2026['speed_score'] = np.where(dt_2026['40yd'].notna() & (dt_2026['40yd'] > 0),
    dt_2026['Weight'] * 200 / (dt_2026['40yd'] ** 4), np.nan)

tr_dt = train_df[train_df['Pos'] == 'DT']
mean_v, std_v = tr_dt['Vertical'].mean(), tr_dt['Vertical'].std()
mean_b, std_b = tr_dt['Broad Jump'].mean(), tr_dt['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v): std_v = 1.0
if std_b == 0 or np.isnan(std_b): std_b = 1.0
v_z = (dt_2026['Vertical'] - mean_v) / std_v
b_z = (dt_2026['Broad Jump'] - mean_b) / std_b
has_explosive = dt_2026['Vertical'].notna() | dt_2026['Broad Jump'].notna()
dt_2026['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

mean_3, std_3 = tr_dt['3Cone'].mean(), tr_dt['3Cone'].std()
mean_sh, std_sh = tr_dt['Shuttle'].mean(), tr_dt['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3): std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh): std_sh = 1.0
z_3 = (dt_2026['3Cone'] - mean_3) / std_3
z_sh = (dt_2026['Shuttle'] - mean_sh) / std_sh
has_agility = dt_2026['3Cone'].notna() | dt_2026['Shuttle'].notna()
dt_2026['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

def row_to_player_dict_2026(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row.get('Shuttle', np.nan), '3Cone': row.get('3Cone', np.nan),
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'Sacks_cumulative': row.get('Sacks_cumulative', np.nan),
        'TFL_cumulative': row.get('TFL_cumulative', np.nan),
        'QB_Hurry_cumulative': row.get('QB_Hurry_cumulative', np.nan),
        'speed_score': row['speed_score'], 'explosive_score': row['explosive_score'], 'agility_score': row['agility_score'],
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2026),
    }

predicted_draft_day_2026 = []
models_used_2026 = []
for _, row in dt_2026.iterrows():
    out = predict_draft(row_to_player_dict_2026(row))
    if out['drafted']:
        predicted_draft_day_2026.append(f"Day {out['draft_day']}")
        models_used_2026.append(f"{out['drafted_model']} + {out['day_model']}")
    else:
        predicted_draft_day_2026.append('Undrafted')
        models_used_2026.append(out['drafted_model'])

dt_2026['predicted_draft_day'] = predicted_draft_day_2026
dt_2026['models_used'] = models_used_2026

print('2026 drafted DTs: speed_score, explosive_score, agility_score + predicted_draft_day + models_used')
print(dt_2026[['Round', 'Pick', 'Player', 'School', 'speed_score', 'explosive_score', 'agility_score', 'predicted_draft_day', 'models_used']].to_string())
dt_2026
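
The 2024, 2025, and 2026 cells repeat the same height parsing and prediction loop. One way to remove the duplication is a pair of small helpers that take the prediction function as an argument. The sketch below is only an illustration under stated assumptions: `height_to_inches`, `add_predictions`, `PLAYER_KEYS`, and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in that only mimics the return keys used above (`drafted`, `draft_day`, `drafted_model`, `day_model`). It is not the notebook's `predict_draft`. The score columns are assumed to have been computed already; they come through as NaN if absent.

```python
import numpy as np
import pandas as pd

PLAYER_KEYS = [
    "Height", "Weight", "40yd", "Vertical", "Broad Jump", "Shuttle", "3Cone",
    "QB_Hurry_final_season", "TFL_final_season", "Sacks_final_season",
    "Sacks_cumulative", "TFL_cumulative", "QB_Hurry_cumulative",
    "speed_score", "explosive_score", "agility_score", "School",
]

def height_to_inches(h):
    """Convert '6-3' style heights to inches; pass numbers through; NaN otherwise."""
    if pd.isna(h) or h == "":
        return np.nan
    if isinstance(h, (int, float, np.integer, np.floating)):
        return float(h)
    parts = str(h).split("-")
    if len(parts) == 2 and all(p.strip().isdigit() for p in parts):
        return float(int(parts[0]) * 12 + int(parts[1]))
    return np.nan

def add_predictions(df, predict_fn, default_year):
    """Return a copy of df with predicted_draft_day and models_used columns."""
    out = df.copy()
    out["Height"] = out["Height"].apply(height_to_inches)
    days, models = [], []
    for _, row in out.iterrows():
        player = {k: row.get(k, np.nan) for k in PLAYER_KEYS}
        player["Year"] = row.get("Year", default_year)
        res = predict_fn(player)
        if res["drafted"]:
            days.append(f"Day {res['draft_day']}")
            models.append(f"{res['drafted_model']} + {res['day_model']}")
        else:
            days.append("Undrafted")
            models.append(res["drafted_model"])
    out["predicted_draft_day"] = days
    out["models_used"] = models
    return out

def fake_predict(player):
    # Hypothetical stand-in for predict_draft, for demonstration only.
    drafted = bool(player["Weight"] >= 300)
    return {"drafted": drafted, "draft_day": 2 if drafted else None,
            "drafted_model": "stub", "day_model": "stub"}

demo = pd.DataFrame({"Player": ["A", "B", "C"],
                     "Height": ["6-3", 76, None],
                     "Weight": [325, 290, 310]})
result = add_predictions(demo, fake_predict, default_year=2026)
print(result[["Player", "Height", "predicted_draft_day", "models_used"]])
```

With helpers like these, each year's cell would reduce to reading the CSV, computing the three scores, and calling `add_predictions` with the notebook's own prediction function and the matching default year.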