## Combine Analysis Linebackers 

Which Combine tests have the most potential influence on a player's chance of being drafted and on their draft position?

Our training dataset is combine data from 2010–2020 (428 players) and our testing dataset is 2021–2023 (102 players). Correlations below use the training data; all models are trained on the training set and evaluated on the test set.

In [89]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.utils.validation')

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.utils.class_weight import compute_sample_weight
import xgboost as xgb

# Path relative to notebook location (LB/) - data is in project root
lb_data = pd.read_csv('../data/processed/lb_training_data.csv')
print(lb_data.columns)
# Convert Height from feet-inches to inches
lb_data['Height'] = lb_data['Height'].str.split('-').str[0].astype(int) * 12 + lb_data['Height'].str.split('-').str[1].astype(int)

# Examine every column in the dataset and its correlation with the Drafted column 
lb_data_just_numeric = lb_data.select_dtypes(include=['number'])
lb_data_just_numeric['Drafted'] = lb_data['Drafted']
print(lb_data_just_numeric.corr()['Drafted'].sort_values(ascending=False))


Index(['Year', 'Player', 'Pos', 'School', 'Height', 'Weight', '40yd',
       'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle', 'Drafted',
       'Round', 'Pick', 'Sacks_cumulative', 'Sacks_final_season',
       'TFL_cumulative', 'TFL_final_season', 'QB_Hurry_cumulative',
       'QB_Hurry_final_season', 'PD_cumulative', 'PD_final_season',
       'SOLO_cumulative', 'SOLO_final_season', 'TOT_cumulative',
       'TOT_final_season'],
      dtype='object')
Drafted                  1.000000
PD_final_season          0.318812
Broad Jump               0.229168
Vertical                 0.226552
TFL_final_season         0.197827
Bench                    0.190172
SOLO_final_season        0.184699
Sacks_final_season       0.170538
PD_cumulative            0.160228
TOT_final_season         0.119914
Sacks_cumulative         0.109303
Weight                   0.101887
QB_Hurry_final_season    0.059888
SOLO_cumulative          0.048046
Height                   0.045014
TFL_cumulative           0.007

For context, a positive correlation means that a higher value goes with a greater chance of being drafted, while a negative correlation means that a lower value does (as with timed drills, where lower is faster). From the LB training data, the **combine** values most strongly correlated with **being drafted** are:

1. 40yd: -0.434 (faster = more likely drafted)
2. Broad Jump: 0.229
3. Vertical: 0.226
4. Bench: 0.190
5. Shuttle: -0.187 (faster = more likely drafted)
6. 3Cone: -0.167 (faster = more likely drafted)

and the most impactful **college stats** (Sacks, TFL, QB Hurry, PD, SOLO, TOT) on **being drafted** are:

1. PD_final_season: 0.319
2. TFL_final_season: 0.198
3. SOLO_final_season: 0.185
4. Sacks_final_season: 0.171
5. PD_cumulative: 0.160
6. TOT_final_season: 0.120

Any feature whose correlation falls well below 0.20 in absolute value is likely too weak to be worth using in a model.
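The 0.20 rule of thumb above can be applied programmatically. The following is a minimal sketch on synthetic data, not the notebook's actual `lb_data`; the helper name `features_above_threshold` and all values in `demo` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def features_above_threshold(df, target, threshold=0.20):
    """Return correlations with `target` whose absolute value meets the threshold."""
    # Keep numeric and boolean columns so a True/False target is not dropped
    numeric = df.select_dtypes(include=['number', 'bool']).astype(float)
    corr = numeric.corr()[target].drop(target)
    strong = corr[corr.abs() >= threshold]
    # Rank by strength regardless of sign
    return strong.reindex(strong.abs().sort_values(ascending=False).index)

# Synthetic stand-in for the training data (made-up values)
rng = np.random.default_rng(0)
n = 500
drafted = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    'Drafted': drafted,
    '40yd': 4.8 - 0.15 * drafted + rng.normal(0, 0.1, n),   # faster when drafted
    'Broad Jump': 115 + 4 * drafted + rng.normal(0, 6, n),  # longer when drafted
    'Height': 73 + rng.normal(0, 1.5, n),                   # unrelated to the target
})

selected = features_above_threshold(demo, 'Drafted')
print(selected)
```

Because the filter uses the absolute value, timed drills with negative correlations (such as the 40-yard dash) are kept alongside positively correlated features.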

In [90]:
# Examine every numeric column's correlation with the Pick column (draft position)
# A lower Pick is better (earlier selection)
lb_data_just_numeric = lb_data.select_dtypes(include=['number'])
lb_data_just_numeric['Pick'] = lb_data['Pick']
print(lb_data_just_numeric.corr()['Pick'].sort_values(ascending=False))


Pick                     1.000000
Round                    0.987279
40yd                     0.313035
Shuttle                  0.197240
3Cone                    0.170228
SOLO_cumulative          0.153081
TOT_cumulative           0.099085
Year                     0.011584
TFL_cumulative          -0.045935
PD_cumulative           -0.046975
Bench                   -0.053790
Sacks_cumulative        -0.055897
SOLO_final_season       -0.110385
TOT_final_season        -0.132528
Height                  -0.134473
Weight                  -0.136509
QB_Hurry_cumulative     -0.159975
PD_final_season         -0.212124
QB_Hurry_final_season   -0.218966
Sacks_final_season      -0.240208
Vertical                -0.256206
TFL_final_season        -0.301390
Broad Jump              -0.328650
Name: Pick, dtype: float64


From the LB training data, the **combine** values most strongly correlated with **draft position** (Pick; lower = earlier/better), ranked by absolute correlation, are:

1. Broad Jump: -0.329 (longer = earlier pick)
2. 40yd: 0.313 (slower = later pick)
3. Vertical: -0.256 (higher = earlier pick)
4. Shuttle: 0.197 (slower = later pick)
5. 3Cone: 0.170 (slower = later pick)
6. Bench: -0.054 (more reps = earlier pick, but very weak)

The most impactful **college stats** (Sacks, TFL, QB Hurry, PD, SOLO, TOT) on **draft position** are:

1. TFL_final_season: -0.301 (more TFL = earlier pick)
2. Sacks_final_season: -0.240 (more sacks = earlier pick)
3. QB_Hurry_final_season: -0.219 (more hurries = earlier pick)
4. PD_final_season: -0.212 (more PD = earlier pick)
5. TOT_final_season: -0.133 (more tackles = earlier pick)
6. SOLO_final_season: -0.110 (more solos = earlier pick)

So for LBs, better combine results (faster 40, longer broad jump, higher vertical) and stronger final-season college production (TFL, sacks, QB hurries, PD, and to a lesser extent tackles) are associated with earlier draft picks.
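One caveat when reading the Pick correlations: pandas computes each pairwise correlation only over rows where both values are present. Assuming undrafted players have a missing `Pick` (an assumption about the CSV, not confirmed here), these correlations describe drafted players only. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up rows: the last two players are undrafted, so Pick is missing
demo = pd.DataFrame({
    '40yd': [4.50, 4.60, 4.70, 4.80, 4.90, 5.00],
    'Pick': [10, 40, 90, 150, np.nan, np.nan],
})

# DataFrame.corr ignores rows where either value in the pair is missing
r_all_rows = demo.corr()['Pick']['40yd']
r_drafted_only = demo.dropna(subset=['Pick']).corr()['Pick']['40yd']
n_used = demo[['40yd', 'Pick']].dropna().shape[0]

print('rows used:', n_used)
print('corr on all rows:', round(r_all_rows, 4))
print('corr on drafted rows only:', round(r_drafted_only, 4))
```

Both calls give the same value because the undrafted rows never enter the calculation.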

## Looking to Model

When we build machine learning models, there are three tasks we would like to accomplish. The first two can use our current datasets of combine data and college data. The third will require each linebacker's stats from their first four NFL seasons.

1. Projected Drafted or Undrafted
2. Projected Draft Position/Round
3. Projected NFL Ability/Value (To be done)

## Projected Drafted or Undrafted

We build models to predict whether a player will be **drafted or undrafted** using combine and college stats. Training set: 2010–2020 (428 players). Test set: 2021–2023 (102 players). Models tested: logistic regression, Random Forest, and XGBoost. For models that use college stats (Sacks, TFL, QB Hurry, PD, SOLO, TOT), we restrict to 2017+ so that data is available.
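The split described above is by draft year rather than random, so the test set always comes after the training set in time. The notebook loads pre-split CSV files; the sketch below only illustrates the same idea on a toy frame, and `split_by_draft_year` is a hypothetical helper.

```python
import pandas as pd

def split_by_draft_year(df, last_train_year=2020):
    """Temporal split: train on years up to `last_train_year`, test on later years."""
    train = df[df['Year'] <= last_train_year].copy()
    test = df[df['Year'] > last_train_year].copy()
    return train, test

# Toy data (made-up rows)
demo = pd.DataFrame({
    'Year': [2010, 2016, 2017, 2020, 2021, 2023],
    'Drafted': [1, 0, 1, 1, 0, 1],
})

train_demo, test_demo = split_by_draft_year(demo)

# Models that use college stats keep only 2017+ training rows
train_college = train_demo[train_demo['Year'] >= 2017]

print(len(train_demo), len(test_demo), len(train_college))
```

Restricting to 2017+ shrinks the training set considerably, which is worth keeping in mind when comparing the combine-only and college-stat models.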

In [91]:
# Load training and testing data (paths relative to LB/)
train_raw = pd.read_csv('../data/processed/lb_training_data.csv')
test_raw = pd.read_csv('../data/processed/lb_testing_data.csv')

# Convert Height from feet-inches to inches
def height_to_inches(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).split('-')
    return int(parts[0]) * 12 + int(parts[1])

for df in [train_raw, test_raw]:
    df['Height'] = df['Height'].apply(height_to_inches)

# Column names for modeling (match CSV). LB college stats: Sacks, TFL, QB Hurry, PD, SOLO, TOT (final season).
FEATURE_COLS = [
    'Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season',
    'Sacks_final_season', 'PD_final_season', 'SOLO_final_season', 'TOT_final_season',
    'Shuttle', '3Cone', '40yd', 'Height', 'Weight'
]

# --- Speed Score: weight * 200 / 40yd^4 ---
def add_speed_score(df):
    df = df.copy()
    df['speed_score'] = np.where(
        df['40yd'].notna() & (df['40yd'] > 0),
        df['Weight'] * 200 / (df['40yd'] ** 4),
        np.nan
    )
    return df

train_raw = add_speed_score(train_raw)
test_raw = add_speed_score(test_raw)

# --- Explosive Score (position-specific z-scores from training data) ---
def add_explosive_score(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['vertical_z'] = np.nan
    train_df['broad_z'] = np.nan
    test_df['vertical_z'] = np.nan
    test_df['broad_z'] = np.nan
    for pos in train_df['Pos'].dropna().unique():
        tr = train_df[train_df['Pos'] == pos]
        mean_v = tr['Vertical'].mean()
        std_v = tr['Vertical'].std()
        mean_b = tr['Broad Jump'].mean()
        std_b = tr['Broad Jump'].std()
        if std_v == 0 or np.isnan(std_v):
            std_v = 1.0
        if std_b == 0 or np.isnan(std_b):
            std_b = 1.0
        mask_train = train_df['Pos'] == pos
        mask_test = test_df['Pos'] == pos
        train_df.loc[mask_train, 'vertical_z'] = (train_df.loc[mask_train, 'Vertical'] - mean_v) / std_v
        train_df.loc[mask_train, 'broad_z'] = (train_df.loc[mask_train, 'Broad Jump'] - mean_b) / std_b
        test_df.loc[mask_test, 'vertical_z'] = (test_df.loc[mask_test, 'Vertical'] - mean_v) / std_v
        test_df.loc[mask_test, 'broad_z'] = (test_df.loc[mask_test, 'Broad Jump'] - mean_b) / std_b
    train_df['explosive_score'] = train_df['vertical_z'].fillna(0) + train_df['broad_z'].fillna(0)
    test_df['explosive_score'] = test_df['vertical_z'].fillna(0) + test_df['broad_z'].fillna(0)
    return train_df.drop(columns=['vertical_z', 'broad_z'], errors='ignore'), test_df.drop(columns=['vertical_z', 'broad_z'], errors='ignore')

train_raw, test_raw = add_explosive_score(train_raw, test_raw)

# --- Agility Score (position-specific z-scores from training; flip sign so better = higher) ---
def add_agility_score(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    train_df['three_cone_z'] = np.nan
    train_df['shuttle_z'] = np.nan
    test_df['three_cone_z'] = np.nan
    test_df['shuttle_z'] = np.nan
    for pos in train_df['Pos'].dropna().unique():
        tr = train_df[train_df['Pos'] == pos]
        mean_3 = tr['3Cone'].mean()
        std_3 = tr['3Cone'].std()
        mean_sh = tr['Shuttle'].mean()
        std_sh = tr['Shuttle'].std()
        if std_3 == 0 or np.isnan(std_3):
            std_3 = 1.0
        if std_sh == 0 or np.isnan(std_sh):
            std_sh = 1.0
        mask_train = train_df['Pos'] == pos
        mask_test = test_df['Pos'] == pos
        train_df.loc[mask_train, 'three_cone_z'] = (train_df.loc[mask_train, '3Cone'] - mean_3) / std_3
        train_df.loc[mask_train, 'shuttle_z'] = (train_df.loc[mask_train, 'Shuttle'] - mean_sh) / std_sh
        test_df.loc[mask_test, 'three_cone_z'] = (test_df.loc[mask_test, '3Cone'] - mean_3) / std_3
        test_df.loc[mask_test, 'shuttle_z'] = (test_df.loc[mask_test, 'Shuttle'] - mean_sh) / std_sh
    train_df['agility_score'] = (-train_df['three_cone_z'].fillna(0)) + (-train_df['shuttle_z'].fillna(0))
    test_df['agility_score'] = (-test_df['three_cone_z'].fillna(0)) + (-test_df['shuttle_z'].fillna(0))
    return train_df.drop(columns=['three_cone_z', 'shuttle_z'], errors='ignore'), test_df.drop(columns=['three_cone_z', 'shuttle_z'], errors='ignore')

train_raw, test_raw = add_agility_score(train_raw, test_raw)

# --- P4/P5 conference: binary 1 if School is in power conference. Pac-12 counts only for draft year 2023 and before. ---
P4_WITH_PAC12 = {'SEC', 'Big Ten', 'Big 12', 'ACC', 'Pac-12'}
P4_NO_PAC12 = {'SEC', 'Big Ten', 'Big 12', 'ACC'}
_stats = pd.read_csv('../data/processed/defensive_stats_2016_to_2025.csv')
P4_SCHOOLS = set(_stats[_stats['Conference'].isin(P4_WITH_PAC12)]['Team'].unique())
P4_SCHOOLS_NO_PAC12 = set(_stats[_stats['Conference'].isin(P4_NO_PAC12)]['Team'].unique())
school_alias = {
    'Ole Miss': 'Mississippi', 'Miami (FL)': 'Miami', 'Southern California': 'USC',
    'Central Florida': 'UCF', 'Brigham Young': 'BYU', 'Ohio St.': 'Ohio State',
    'Florida St.': 'Florida State', 'Kansas St.': 'Kansas State', 'Iowa St.': 'Iowa State',
    'Oklahoma St.': 'Oklahoma State', 'Penn St.': 'Penn State', 'San Diego St.': 'San Diego State',
    'San Jose St.': 'San José State', 'Boston Col.': 'Boston College',
}

def add_p4_conference(df):
    df = df.copy()
    def norm(s):
        return school_alias.get(s, s) if pd.notna(s) and s else None
    def is_p4(row):
        sn = norm(row['School'])
        if not sn: return 0
        year = row.get('Year', 0)
        schools = P4_SCHOOLS if year <= 2023 else P4_SCHOOLS_NO_PAC12
        return 1 if sn in schools else 0
    df['p4_conference'] = df.apply(is_p4, axis=1)
    df['contains_p4_conference'] = df['School'].notna().astype(int)
    return df

train_raw = add_p4_conference(train_raw)
test_raw = add_p4_conference(test_raw)

# --- Binary contains_* for each metric (1 if present, 0 if missing) ---
METRIC_COLS = [
    'Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season',
    'Sacks_final_season', 'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative',
    'PD_final_season', 'PD_cumulative', 'SOLO_final_season', 'SOLO_cumulative', 'TOT_final_season', 'TOT_cumulative',
    'Shuttle', '3Cone', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score', 'agility_score'
]
def add_contains_flags(df):
    df = df.copy()
    name_map = {
        'Broad Jump': 'broad_jump', 'Vertical': 'vertical',
        'QB_Hurry_final_season': 'qb_hurry_final_season', 'TFL_final_season': 'tfl_final_season',
        'Sacks_final_season': 'sacks_final_season', 'Sacks_cumulative': 'sacks_cumulative', 'TFL_cumulative': 'tfl_cumulative', 'QB_Hurry_cumulative': 'qb_hurry_cumulative',
        'PD_final_season': 'pd_final_season', 'PD_cumulative': 'pd_cumulative', 'SOLO_final_season': 'solo_final_season', 'SOLO_cumulative': 'solo_cumulative', 'TOT_final_season': 'tot_final_season', 'TOT_cumulative': 'tot_cumulative',
        'Shuttle': 'shuttle', '3Cone': 'three_cone', '40yd': '40yd', 'Height': 'height', 'Weight': 'weight',
        'speed_score': 'speed_score', 'explosive_score': 'explosive_score', 'agility_score': 'agility_score'
    }
    for col in METRIC_COLS:
        if col not in df.columns:
            continue
        flag_name = f"contains_{name_map.get(col, col.lower().replace(' ', '_'))}"
        df[flag_name] = (df[col].notna()).astype(int)
    return df

train_raw = add_contains_flags(train_raw)
test_raw = add_contains_flags(test_raw)

# Final training and test datasets for modeling
train_df = train_raw.copy()
test_df = test_raw.copy()

print('Training set:', train_df.shape[0], 'players')
print('Test set:', test_df.shape[0], 'players')
print('\nModeling features:', FEATURE_COLS)
print('Derived metrics: speed_score, explosive_score, agility_score')
print('Contains flags: contains_* for each metric')
train_df.head()

Training set: 428 players
Test set: 102 players

Modeling features: ['Broad Jump', 'Vertical', 'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'PD_final_season', 'SOLO_final_season', 'TOT_final_season', 'Shuttle', '3Cone', '40yd', 'Height', 'Weight']
Derived metrics: speed_score, explosive_score, agility_score
Contains flags: contains_* for each metric


Unnamed: 0,Year,Player,Pos,School,Height,Weight,40yd,Vertical,Bench,Broad Jump,...,contains_tot_final_season,contains_tot_cumulative,contains_shuttle,contains_three_cone,contains_40yd,contains_height,contains_weight,contains_speed_score,contains_explosive_score,contains_agility_score
0,2010,Pat Angerer,ILB,Iowa,72,235.0,4.71,35.0,26.0,110.0,...,0,0,1,1,1,1,1,1,1,1
1,2010,Jason Beauchamp,ILB,UNLV,75,244.0,4.89,39.5,21.0,120.0,...,0,0,0,1,1,1,1,1,1,1
2,2010,Kyle Bosworth,OLB,UCLA,73,236.0,4.69,32.5,25.0,117.0,...,0,0,1,1,1,1,1,1,1,1
3,2010,Navorro Bowman,OLB,Penn State,72,242.0,4.7,29.5,26.0,115.0,...,0,0,1,1,1,1,1,1,1,1
4,2010,Donald Butler,ILB,Washington,73,245.0,4.61,,35.0,,...,0,0,0,0,1,1,1,1,1,1
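
As a worked example of the speed score defined in the cell above (`Weight * 200 / 40yd^4`), the sketch below applies the formula to the first row of the table (235 lb, 4.71 s). The scalar helper `speed_score` is an illustrative stand-in for the vectorized `add_speed_score`, not part of the notebook.

```python
def speed_score(weight_lb, forty_sec):
    """Weight * 200 / (40-yard time)^4; None when the time is missing or non-positive."""
    if forty_sec is None or forty_sec <= 0:
        return None
    return weight_lb * 200 / forty_sec ** 4

# First row of the table above: 235 lb, 4.71 s
angerer = speed_score(235.0, 4.71)

# Same weight, one tenth of a second faster
faster = speed_score(235.0, 4.61)

print(round(angerer, 1), round(faster, 1))
```

Because the time is raised to the fourth power, a small improvement in the 40 moves the score much more than a few pounds of weight would.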


In [92]:
# Combine-only logistic regression: predict Drafted (1) vs Undrafted (0)
# No college stats — only combine metrics + derived scores

# Combine-only features (no college stats: Sacks, TFL, QB Hurry, PD, SOLO, TOT) + binary "contains_*" flags
# Exclude Shuttle, 3Cone, agility_score — often missing at combine
COMBINE_ONLY_FEATURES = [
    'Broad Jump', 'Vertical', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score', 'p4_conference'
]
COMBINE_ONLY_CONTAINS = [
    'contains_broad_jump', 'contains_vertical',
    'contains_40yd', 'contains_height', 'contains_weight',
    'contains_speed_score', 'contains_explosive_score', 'contains_p4_conference'
]
COMBINE_ONLY_ALL = COMBINE_ONLY_FEATURES + COMBINE_ONLY_CONTAINS

# Prepare X, y
X_tr_raw = train_df[COMBINE_ONLY_ALL].copy()
X_te_raw = test_df[COMBINE_ONLY_ALL].copy()
y_train = (train_df['Drafted'].astype(bool)).astype(int)
y_test = (test_df['Drafted'].astype(bool)).astype(int)

# KNN imputation (fit on train, transform train and test)
knn_imputer_combine = KNNImputer(n_neighbors=10)
X_tr = knn_imputer_combine.fit_transform(X_tr_raw)
X_te = knn_imputer_combine.transform(X_te_raw)
# For prediction: leave missing as NaN then transform; use NaN "medians" so _player_row keeps NaNs
train_medians = pd.Series(np.nan, index=COMBINE_ONLY_ALL)

# Scale (fit on train, transform both)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Fit binary logistic regression
logit_draft = LogisticRegression(max_iter=1000, random_state=42)
logit_draft.fit(X_tr_scaled, y_train)

# Predict on test
y_pred = logit_draft.predict(X_te_scaled)
y_prob = logit_draft.predict_proba(X_te_scaled)[:, 1]

# Metrics
print('Combine-only logistic model: Drafted vs Undrafted')
print('=' * 50)
print('Test accuracy:', (y_pred == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred))
print('\nClassification report:')
print(classification_report(y_test, y_pred, target_names=['Undrafted', 'Drafted']))
if y_test.nunique() == 2:
    print('Test ROC-AUC:', roc_auc_score(y_test, y_prob).round(4))

Combine-only logistic model: Drafted vs Undrafted
Test accuracy: 0.6569

Confusion matrix (rows=actual, cols=predicted):
[[17 33]
 [ 2 50]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.89      0.34      0.49        50
     Drafted       0.60      0.96      0.74        52

    accuracy                           0.66       102
   macro avg       0.75      0.65      0.62       102
weighted avg       0.75      0.66      0.62       102

Test ROC-AUC: 0.7925
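
The report above can be reproduced from the confusion matrix alone, which also makes the model's main weakness explicit: it predicts "Drafted" for most players. A short sketch using the printed matrix:

```python
import numpy as np

# Confusion matrix printed above (rows = actual, cols = predicted)
cm = np.array([[17, 33],
               [2, 50]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
undrafted_recall = tn / (tn + fp)   # share of undrafted players caught
drafted_recall = tp / (tp + fn)     # share of drafted players caught
drafted_precision = tp / (tp + fp)  # how often a "Drafted" call is right
predicted_drafted_share = (tp + fp) / cm.sum()

print(f'accuracy={accuracy:.4f}')
print(f'undrafted recall={undrafted_recall:.2f}, drafted recall={drafted_recall:.2f}')
print(f'drafted precision={drafted_precision:.2f}')
print(f'share predicted drafted={predicted_drafted_share:.2f}')
```

The model labels 83 of 102 test players as drafted even though only 52 were, which is why undrafted recall is low while drafted recall is high.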


In [93]:
# Logistic regression with college stats: training data from 2017 onward
# Same target (Drafted vs Undrafted), with combine + LB college stats (Sacks, TFL, QB Hurry, PD, SOLO, TOT final season)

# Restrict to 2017+ so college stats (Sacks, TFL, QB Hurry, PD, SOLO, TOT) are available
train_2017 = train_df[train_df['Year'] >= 2017].copy()
test_2017 = test_df[test_df['Year'] >= 2017].copy()

# Full feature set (combine + college stats: Sacks, TFL, QB Hurry, PD, SOLO, TOT final season) + binary "contains_*" flags
# Exclude Shuttle, 3Cone, agility_score — often missing at combine
FEATURES_WITH_COLLEGE = [
    'Broad Jump', 'Vertical', '40yd', 'Height', 'Weight',
    'speed_score', 'explosive_score',
    'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'PD_final_season', 'SOLO_final_season', 'TOT_final_season', 'p4_conference'
]
CONTAINS_WITH_COLLEGE = [
    'contains_broad_jump', 'contains_vertical',
    'contains_40yd', 'contains_height', 'contains_weight',
    'contains_speed_score', 'contains_explosive_score',
    'contains_qb_hurry_final_season', 'contains_tfl_final_season', 'contains_sacks_final_season',
    'contains_pd_final_season', 'contains_solo_final_season', 'contains_tot_final_season',
    'contains_p4_conference'
]
FEATURES_WITH_COLLEGE_ALL = FEATURES_WITH_COLLEGE + CONTAINS_WITH_COLLEGE

X_tr17_raw = train_2017[FEATURES_WITH_COLLEGE_ALL].copy()
X_te17_raw = test_2017[FEATURES_WITH_COLLEGE_ALL].copy()
y_train17 = (train_2017['Drafted'].astype(bool)).astype(int)
y_test17 = (test_2017['Drafted'].astype(bool)).astype(int)

# KNN imputation
knn_imputer17 = KNNImputer(n_neighbors=10)
X_tr17 = knn_imputer17.fit_transform(X_tr17_raw)
X_te17 = knn_imputer17.transform(X_te17_raw)
train_medians17 = pd.Series(np.nan, index=FEATURES_WITH_COLLEGE_ALL)

# Scale (fit on train, transform both)
scaler17 = StandardScaler()
X_tr17_scaled = scaler17.fit_transform(X_tr17)
X_te17_scaled = scaler17.transform(X_te17)

# Fit logistic regression (with college stats)
logit_draft_college = LogisticRegression(max_iter=1000, random_state=42)
logit_draft_college.fit(X_tr17_scaled, y_train17)

y_pred17 = logit_draft_college.predict(X_te17_scaled)
y_prob17 = logit_draft_college.predict_proba(X_te17_scaled)[:, 1]

print('Logistic model with college stats (train 2017+, test 2017+)')
print('=' * 55)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred17, target_names=['Undrafted', 'Drafted']))
if y_test17.nunique() == 2 and len(y_test17) > 0:
    print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob17).round(4))

Logistic model with college stats (train 2017+, test 2017+)
Training samples: 147 | Test samples: 102
Test accuracy: 0.7157

Confusion matrix (rows=actual, cols=predicted):
[[23 27]
 [ 2 50]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.92      0.46      0.61        50
     Drafted       0.65      0.96      0.78        52

    accuracy                           0.72       102
   macro avg       0.78      0.71      0.69       102
weighted avg       0.78      0.72      0.70       102

Test ROC-AUC: 0.8181


In [94]:
# College + combine **with agility_score**: for players who have 3Cone/Shuttle (agility). College stats: Sacks, TFL, QB Hurry, PD, SOLO, TOT.
FEATURES_WITH_COLLEGE_AGILITY = FEATURES_WITH_COLLEGE + ['agility_score']
CONTAINS_WITH_COLLEGE_AGILITY = CONTAINS_WITH_COLLEGE + ['contains_agility_score']
FEATURES_WITH_COLLEGE_AGILITY_ALL = FEATURES_WITH_COLLEGE_AGILITY + CONTAINS_WITH_COLLEGE_AGILITY

X_tr_ag_raw = train_2017[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()
X_te_ag_raw = test_2017[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()
knn_imputer_ag = KNNImputer(n_neighbors=10)
X_tr_ag = knn_imputer_ag.fit_transform(X_tr_ag_raw)
X_te_ag = knn_imputer_ag.transform(X_te_ag_raw)
train_medians_ag = pd.Series(np.nan, index=FEATURES_WITH_COLLEGE_AGILITY_ALL)

scaler_ag = StandardScaler()
X_tr_ag_scaled = scaler_ag.fit_transform(X_tr_ag)
X_te_ag_scaled = scaler_ag.transform(X_te_ag)

# Draft/undrafted models (college+combine w/ agility)
logit_draft_college_agility = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logit_draft_college_agility.fit(X_tr_ag_scaled, y_train17)
y_pred_ag = logit_draft_college_agility.predict(X_te_ag_scaled)
y_prob_ag = logit_draft_college_agility.predict_proba(X_te_ag_scaled)[:, 1]

rf_college_agility = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42, class_weight='balanced')
rf_college_agility.fit(X_tr_ag, y_train17)
y_pred_rf_ag = rf_college_agility.predict(X_te_ag)
y_prob_rf_ag = rf_college_agility.predict_proba(X_te_ag)[:, 1]

xgb_college_agility = xgb.XGBClassifier(n_estimators=200, max_depth=9, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college_agility.fit(X_tr_ag, y_train17)
y_pred_xgb_ag = xgb_college_agility.predict(X_te_ag)
y_prob_xgb_ag = xgb_college_agility.predict_proba(X_te_ag)[:, 1]

print('College+combine w/ agility: Drafted vs Undrafted (train 2017+)')
print('Logistic ROC-AUC:', roc_auc_score(y_test17, y_prob_ag).round(4))
print('RF ROC-AUC:', roc_auc_score(y_test17, y_prob_rf_ag).round(4))
print('XGB ROC-AUC:', roc_auc_score(y_test17, y_prob_xgb_ag).round(4))


College+combine w/ agility: Drafted vs Undrafted (train 2017+)
Logistic ROC-AUC: 0.8315
RF ROC-AUC: 0.8173
XGB ROC-AUC: 0.7542
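
With only 102 test players, a gap of a point or two in ROC-AUC between models may be within sampling noise. One way to gauge this is a percentile bootstrap over the test set. The sketch below runs on synthetic scores, not the notebook's predictions, and `bootstrap_auc_ci` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ROC-AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when only one class is drawn
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Synthetic test set with the same class sizes as above (50 undrafted, 52 drafted)
rng = np.random.default_rng(1)
y_demo = np.array([0] * 50 + [1] * 52)
score_demo = np.clip(0.35 + 0.3 * y_demo + rng.normal(0, 0.2, size=102), 0, 1)

point = roc_auc_score(y_demo, score_demo)
lower, upper = bootstrap_auc_ci(y_demo, score_demo)
print(f'AUC={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})')
```

If the intervals for two models overlap heavily, the ranking between them should be treated as tentative.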


In [95]:
# Combined prediction: average both models' probabilities into one drafted/undrafted prediction
# Combine-only model predicts on full test_df; college model on test_2017 (2017+). Test set is 2021+ so both apply to all rows.

combined_prob = (y_prob + y_prob17) / 2
combined_pred = (combined_prob >= 0.5).astype(int)

# Use same test labels (y_test from full test_df; same rows as test_2017)
print('Combined model: average of combine-only + college-stats probabilities')
print('=' * 60)
print('Test accuracy:', (combined_pred == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred))
print('\nClassification report:')
print(classification_report(y_test, combined_pred, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob).round(4))

Combined model: average of combine-only + college-stats probabilities
Test accuracy: 0.6765

Confusion matrix (rows=actual, cols=predicted):
[[19 31]
 [ 2 50]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.90      0.38      0.54        50
     Drafted       0.62      0.96      0.75        52

    accuracy                           0.68       102
   macro avg       0.76      0.67      0.64       102
weighted avg       0.76      0.68      0.65       102

Test ROC-AUC: 0.8162
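
All of the classifiers above use a fixed 0.5 cutoff and lean heavily toward predicting "Drafted". One possible remedy is to pick the cutoff on held-out validation data (not the test set) so that both classes are recovered more evenly. The sketch below uses made-up validation probabilities, and `best_balanced_threshold` is a hypothetical helper.

```python
import numpy as np

def best_balanced_threshold(y_true, y_prob, grid=None):
    """Return the cutoff maximizing balanced accuracy (mean of per-class recall)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_score = 0.5, -1.0
    for t in grid:
        pred = (y_prob >= t).astype(int)
        recall_pos = (pred[y_true == 1] == 1).mean()
        recall_neg = (pred[y_true == 0] == 0).mean()
        score = (recall_pos + recall_neg) / 2
        if score > best_score:  # ties keep the lower cutoff
            best_t, best_score = float(t), float(score)
    return best_t, best_score

# Made-up validation set where probabilities skew high, as in the models above
val_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
val_p = np.array([0.42, 0.56, 0.61, 0.72, 0.67, 0.81, 0.86, 0.91])

threshold, score = best_balanced_threshold(val_y, val_p)
default_pred = (val_p >= 0.5).astype(int)
default_score = ((default_pred[val_y == 1] == 1).mean()
                 + (default_pred[val_y == 0] == 0).mean()) / 2

print(f'default 0.5 cutoff: balanced accuracy={default_score:.3f}')
print(f'chosen cutoff {threshold:.2f}: balanced accuracy={score:.3f}')
```

ROC-AUC is unaffected by the cutoff, so this changes only the confusion matrix and accuracy figures, not the AUC comparisons.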


In [96]:
# College-only models: Sacks, TFL, QB Hurry, PD, SOLO, TOT (final season + cumulative) + p4_conference, Height, Weight (no combine drill results)
# For players without combine data (e.g. 2026 prospects pre-combine)

COLLEGE_ONLY_FEATURES = ['QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'PD_final_season', 'SOLO_final_season', 'TOT_final_season', 'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative', 'PD_cumulative', 'SOLO_cumulative', 'TOT_cumulative', 'p4_conference', 'Height', 'Weight']
COLLEGE_ONLY_CONTAINS = ['contains_qb_hurry_final_season', 'contains_tfl_final_season', 'contains_sacks_final_season', 'contains_pd_final_season', 'contains_solo_final_season', 'contains_tot_final_season', 'contains_sacks_cumulative', 'contains_tfl_cumulative', 'contains_qb_hurry_cumulative', 'contains_pd_cumulative', 'contains_solo_cumulative', 'contains_tot_cumulative', 'contains_p4_conference', 'contains_height', 'contains_weight']
COLLEGE_ONLY_ALL = COLLEGE_ONLY_FEATURES + COLLEGE_ONLY_CONTAINS

X_tr_co_raw = train_2017[COLLEGE_ONLY_ALL].copy()
X_te_co_raw = test_2017[COLLEGE_ONLY_ALL].copy()
knn_imputer_co = KNNImputer(n_neighbors=10)
X_tr_co = knn_imputer_co.fit_transform(X_tr_co_raw)
X_te_co = knn_imputer_co.transform(X_te_co_raw)
train_medians_co = pd.Series(np.nan, index=COLLEGE_ONLY_ALL)

scaler_co = StandardScaler()
X_tr_co_scaled = scaler_co.fit_transform(X_tr_co)
X_te_co_scaled = scaler_co.transform(X_te_co)

# Drafted/undrafted
logit_draft_college_only = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
logit_draft_college_only.fit(X_tr_co_scaled, y_train17)
y_pred_college_only = logit_draft_college_only.predict(X_te_co_scaled)
y_prob_college_only = logit_draft_college_only.predict_proba(X_te_co_scaled)[:, 1]

rf_college_only = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42, class_weight='balanced')
rf_college_only.fit(X_tr_co, y_train17)
y_pred_rf_co = rf_college_only.predict(X_te_co)
y_prob_rf_co = rf_college_only.predict_proba(X_te_co)[:, 1]

xgb_college_only = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college_only.fit(X_tr_co, y_train17)
y_pred_xgb_co = xgb_college_only.predict(X_te_co)
y_prob_xgb_co = xgb_college_only.predict_proba(X_te_co)[:, 1]

print('College-only models (Sacks, TFL, QB Hurry, PD, SOLO, TOT, p4_conference) — drafted/undrafted')
print('=' * 70)
for name, pred, prob in [('Logistic', y_pred_college_only, y_prob_college_only), ('RF', y_pred_rf_co, y_prob_rf_co), ('XGB', y_pred_xgb_co, y_prob_xgb_co)]:
    print(f'{name}: accuracy={(pred == y_test17).mean():.4f}, ROC-AUC={roc_auc_score(y_test17, prob):.4f}')


College-only models (Sacks, TFL, QB Hurry, PD, SOLO, TOT, p4_conference) — drafted/undrafted
Logistic: accuracy=0.6078, ROC-AUC=0.6888
RF: accuracy=0.6078, ROC-AUC=0.6412
XGB: accuracy=0.5196, ROC-AUC=0.4954


In [97]:
# Random Forest: combine-only features — predict Drafted vs Undrafted

rf_combine = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf_combine.fit(X_tr, y_train)  # X_tr already has COMBINE_ONLY_ALL, imputed

y_pred_rf = rf_combine.predict(X_te)
y_prob_rf = rf_combine.predict_proba(X_te)[:, 1]

print('Random Forest (combine-only): Drafted vs Undrafted')
print('=' * 55)
print('Test accuracy:', (y_pred_rf == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred_rf))
print('\nClassification report:')
print(classification_report(y_test, y_pred_rf, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, y_prob_rf).round(4))

Random Forest (combine-only): Drafted vs Undrafted
Test accuracy: 0.6667

Confusion matrix (rows=actual, cols=predicted):
[[17 33]
 [ 1 51]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.94      0.34      0.50        50
     Drafted       0.61      0.98      0.75        52

    accuracy                           0.67       102
   macro avg       0.78      0.66      0.62       102
weighted avg       0.77      0.67      0.63       102

Test ROC-AUC: 0.749


In [98]:
# Random Forest: combine + college stats (2017+): Sacks, TFL, QB Hurry, PD, SOLO, TOT
rf_college = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf_college.fit(X_tr17, y_train17)  # X_tr17 already has FEATURES_WITH_COLLEGE_ALL, imputed

y_pred_rf17 = rf_college.predict(X_te17)
y_prob_rf17 = rf_college.predict_proba(X_te17)[:, 1]

print('Random Forest (combine + college, train 2017+): Drafted vs Undrafted')
print('=' * 60)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred_rf17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred_rf17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred_rf17, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob_rf17).round(4))

Random Forest (combine + college, train 2017+): Drafted vs Undrafted
Training samples: 147 | Test samples: 102
Test accuracy: 0.6863

Confusion matrix (rows=actual, cols=predicted):
[[21 29]
 [ 3 49]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.88      0.42      0.57        50
     Drafted       0.63      0.94      0.75        52

    accuracy                           0.69       102
   macro avg       0.75      0.68      0.66       102
weighted avg       0.75      0.69      0.66       102

Test ROC-AUC: 0.7381


In [99]:
# Combined RF prediction: average both RF models' probabilities
combined_prob_rf = (y_prob_rf + y_prob_rf17) / 2
combined_pred_rf = (combined_prob_rf >= 0.5).astype(int)

print('Combined Random Forest: average of combine-only + college-stats probabilities')
print('=' * 65)
print('Test accuracy:', (combined_pred_rf == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred_rf))
print('\nClassification report:')
print(classification_report(y_test, combined_pred_rf, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob_rf).round(4))

Combined Random Forest: average of combine-only + college-stats probabilities
Test accuracy: 0.6667

Confusion matrix (rows=actual, cols=predicted):
[[18 32]
 [ 2 50]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.90      0.36      0.51        50
     Drafted       0.61      0.96      0.75        52

    accuracy                           0.67       102
   macro avg       0.75      0.66      0.63       102
weighted avg       0.75      0.67      0.63       102

Test ROC-AUC: 0.7573
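
The combined prediction above is a simple soft-voting ensemble: average the two models' probabilities, then threshold at 0.5. A minimal sketch with made-up probabilities (the four values per model are hypothetical, not taken from the dataset) shows how averaging can flip an individual model's call.

```python
def average_probabilities(prob_a, prob_b):
    """Element-wise mean of two equally long probability lists."""
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same players")
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


def threshold(probs, cutoff=0.5):
    """Convert probabilities to 0/1 predictions (1 = Drafted)."""
    return [int(p >= cutoff) for p in probs]


# Hypothetical P(drafted) for four players from two models.
prob_combine_only = [0.75, 0.25, 0.625, 0.125]
prob_with_college = [0.5, 0.875, 0.25, 0.25]

combined = average_probabilities(prob_combine_only, prob_with_college)
combined_pred = threshold(combined)

print(combined)
print(combined_pred)
```

For the third player, the combine-only model alone would predict Drafted (0.625), but the average (0.4375) falls below the cutoff. Equal weighting is an assumption of this scheme; a weighted average would be a natural variation.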


In [100]:
# XGBoost: combine-only features — predict Drafted vs Undrafted

xgb_combine = xgb.XGBClassifier(n_estimators=200, max_depth=12, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_combine.fit(X_tr, y_train)

y_pred_xgb = xgb_combine.predict(X_te)
y_prob_xgb = xgb_combine.predict_proba(X_te)[:, 1]

print('XGBoost (combine-only): Drafted vs Undrafted')
print('=' * 55)
print('Test accuracy:', (y_pred_xgb == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, y_pred_xgb))
print('\nClassification report:')
print(classification_report(y_test, y_pred_xgb, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, y_prob_xgb).round(4))

XGBoost (combine-only): Drafted vs Undrafted
Test accuracy: 0.6471

Confusion matrix (rows=actual, cols=predicted):
[[17 33]
 [ 3 49]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.85      0.34      0.49        50
     Drafted       0.60      0.94      0.73        52

    accuracy                           0.65       102
   macro avg       0.72      0.64      0.61       102
weighted avg       0.72      0.65      0.61       102

Test ROC-AUC: 0.6913



In [101]:
# XGBoost: combine + college stats (2017+): Sacks, TFL, QB Hurry, PD, SOLO, TOT
xgb_college = xgb.XGBClassifier(n_estimators=200, max_depth=8, learning_rate=0.1, random_state=42, eval_metric='logloss')
xgb_college.fit(X_tr17, y_train17)

y_pred_xgb17 = xgb_college.predict(X_te17)
y_prob_xgb17 = xgb_college.predict_proba(X_te17)[:, 1]

print('XGBoost (combine + college, train 2017+): Drafted vs Undrafted')
print('=' * 60)
print('Training samples:', len(train_2017), '| Test samples:', len(test_2017))
print('Test accuracy:', (y_pred_xgb17 == y_test17).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test17, y_pred_xgb17))
print('\nClassification report:')
print(classification_report(y_test17, y_pred_xgb17, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test17, y_prob_xgb17).round(4))

XGBoost (combine + college, train 2017+): Drafted vs Undrafted
Training samples: 147 | Test samples: 102
Test accuracy: 0.6765

Confusion matrix (rows=actual, cols=predicted):
[[23 27]
 [ 6 46]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.79      0.46      0.58        50
     Drafted       0.63      0.88      0.74        52

    accuracy                           0.68       102
   macro avg       0.71      0.67      0.66       102
weighted avg       0.71      0.68      0.66       102

Test ROC-AUC: 0.7212



In [102]:
# Combined XGBoost prediction: average both XGBoost models' probabilities
combined_prob_xgb = (y_prob_xgb + y_prob_xgb17) / 2
combined_pred_xgb = (combined_prob_xgb >= 0.5).astype(int)

print('Combined XGBoost: average of combine-only + college-stats probabilities')
print('=' * 65)
print('Test accuracy:', (combined_pred_xgb == y_test).mean().round(4))
print('\nConfusion matrix (rows=actual, cols=predicted):')
print(confusion_matrix(y_test, combined_pred_xgb))
print('\nClassification report:')
print(classification_report(y_test, combined_pred_xgb, target_names=['Undrafted', 'Drafted']))
print('Test ROC-AUC:', roc_auc_score(y_test, combined_prob_xgb).round(4))

Combined XGBoost: average of combine-only + college-stats probabilities
Test accuracy: 0.6765

Confusion matrix (rows=actual, cols=predicted):
[[19 31]
 [ 2 50]]

Classification report:
              precision    recall  f1-score   support

   Undrafted       0.90      0.38      0.54        50
     Drafted       0.62      0.96      0.75        52

    accuracy                           0.68       102
   macro avg       0.76      0.67      0.64       102
weighted avg       0.76      0.68      0.65       102

Test ROC-AUC: 0.7135
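
Accuracy and ROC-AUC can move in different directions across these models because ROC-AUC ignores the 0.5 threshold: it is the probability that a randomly chosen drafted player gets a higher score than a randomly chosen undrafted one. The sketch below computes it by brute-force pair counting on hypothetical scores (the six players are invented for illustration).

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (drafted, undrafted) pairs ranked correctly.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one player of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = Drafted) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.55, 0.7, 0.6, 0.2]

auc = pairwise_auc(y_true, scores)
preds = [int(s >= 0.5) for s in scores]
accuracy = sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)

print(f"AUC={auc:.4f}, accuracy at 0.5={accuracy:.4f}")
```

In this toy case most scores sit above 0.5, so the thresholded predictions over-call Drafted (as the models above do), while the ranking itself is still fairly good.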


In [103]:
# Compare all drafted/undrafted models: college-only, combine-only, combine+college, combine+college w/ agility
# (Logistic, RF, XGB for each). Same test set (y_test = y_test17, n=102).

models = [
    # College-only (train 2017+, test 2017+)
    ('Logistic (college-only)', y_pred_college_only, y_prob_college_only),
    ('RF (college-only)', y_pred_rf_co, y_prob_rf_co),
    ('XGBoost (college-only)', y_pred_xgb_co, y_prob_xgb_co),
    # Combine-only
    ('Logistic (combine-only)', y_pred, y_prob),
    ('RF (combine-only)', y_pred_rf, y_prob_rf),
    ('XGBoost (combine-only)', y_pred_xgb, y_prob_xgb),
    # Combine+college
    ('Logistic (combine+college)', y_pred17, y_prob17),
    ('RF (combine+college)', y_pred_rf17, y_prob_rf17),
    ('XGBoost (combine+college)', y_pred_xgb17, y_prob_xgb17),
    # Combine+college w/ agility
    ('Logistic (combine+college w/ agility)', y_pred_ag, y_prob_ag),
    ('RF (combine+college w/ agility)', y_pred_rf_ag, y_prob_rf_ag),
    ('XGBoost (combine+college w/ agility)', y_pred_xgb_ag, y_prob_xgb_ag),
]

results = []
for name, pred, prob in models:
    acc = (pred == y_test).mean()
    auc = roc_auc_score(y_test, prob)
    f1_macro = f1_score(y_test, pred, average='macro')
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    # tn / (tn + fn) is the precision of the Undrafted class (negative predictive value);
    # Undrafted recall would be tn / (tn + fp).
    precision_undrafted = tn / (tn + fn) if (tn + fn) > 0 else 0
    recall_drafted = tp / (tp + fn) if (tp + fn) > 0 else 0
    results.append({
        'Model': name,
        'Accuracy': acc,
        'ROC-AUC': auc,
        'Macro F1': f1_macro,
        'Precision (Undrafted)': precision_undrafted,
        'Recall (Drafted)': recall_drafted,
    })

results_df = pd.DataFrame(results)
results_df = results_df.sort_values('ROC-AUC', ascending=False).reset_index(drop=True)
print('All models ranked by ROC-AUC (same test set, n=' + str(len(y_test)) + '):')
print('Categories: college-only, combine-only, combine+college, combine+college w/ agility (Logistic, RF, XGB each)')
print('=' * 75)
print(results_df.to_string(index=False))
print()

best_auc = results_df.loc[0, 'Model']
best_auc_val = results_df.loc[0, 'ROC-AUC']
best_f1 = results_df.loc[results_df['Macro F1'].idxmax(), 'Model']
best_f1_val = results_df['Macro F1'].max()
print('Summary:')
print('  Best by ROC-AUC:', best_auc, f'({best_auc_val:.4f})')
print('  Best by Macro F1 (balanced Undrafted/Drafted):', best_f1, f'({best_f1_val:.4f})')
print()
print('Conclusion: ROC-AUC is the preferred metric for comparing drafted/undrafted models because it does not depend on the 0.5 threshold;')
print('Macro F1 rewards balanced performance on both classes.')

All models ranked by ROC-AUC (same test set, n=102):
Categories: college-only, combine-only, combine+college, combine+college w/ agility (Logistic, RF, XGB each)
                                Model  Accuracy  ROC-AUC  Macro F1  Precision (Undrafted)  Recall (Drafted)
Logistic (combine+college w/ agility)  0.764706 0.831538  0.762422               0.809524          0.846154
           Logistic (combine+college)  0.715686 0.818077  0.694264               0.920000          0.961538
      RF (combine+college w/ agility)  0.715686 0.817308  0.707563               0.800000          0.865385
              Logistic (combine-only)  0.656863 0.792500  0.616747               0.894737          0.961538
 XGBoost (combine+college w/ agility)  0.705882 0.754231  0.694122               0.812500          0.884615
                    RF (combine-only)  0.666667 0.749038  0.625000               0.944444          0.980769
                 RF (combine+college)  0.686275 0.738077  0.660707               0.875000          0.94230

## Projected Round/Day Drafted

We predict **draft round (1–7)** for drafted players, using the same train/test split (2010–2020 train, 2021–2023 test). Undrafted players are excluded; models that use college stats (Sacks, TFL, QB Hurry, PD, SOLO, TOT) are trained only on players drafted in 2017 or later. Models include logistic regression (labeled "ordinal" in the code and output, although scikit-learn's `LogisticRegression` treats the seven rounds as unordered classes), Random Forest, and XGBoost.

In [104]:
# Draft ROUND modeling: Round 1–7. Train only on DRAFTED players.
# Target: round_ord 0..6 (Round 1–7) for 7-class models.
train_draft = train_df[train_df['Drafted'] == True].copy()
test_draft = test_df[test_df['Drafted'] == True].copy()
train_draft['draft_round'] = train_draft['Round'].astype(int).clip(1, 7)
test_draft['draft_round'] = test_draft['Round'].astype(int).clip(1, 7)
train_draft['round_ord'] = (train_draft['draft_round'] - 1).astype(int)  # 0=R1..6=R7
test_draft['round_ord'] = (test_draft['draft_round'] - 1).astype(int)

# Combine-only X, y (drafted only) — KNN impute
X_draft_tr = pd.DataFrame(knn_imputer_combine.transform(train_draft[COMBINE_ONLY_ALL].copy()), columns=COMBINE_ONLY_ALL, index=train_draft.index)
X_draft_te = pd.DataFrame(knn_imputer_combine.transform(test_draft[COMBINE_ONLY_ALL].copy()), columns=COMBINE_ONLY_ALL, index=test_draft.index)
y_draft_tr = train_draft['round_ord'].values
y_draft_te = test_draft['round_ord'].values

# Combine+college 2017+ (drafted only): Sacks, TFL, QB Hurry, PD, SOLO, TOT — KNN impute
train_draft_17 = train_draft[train_draft['Year'] >= 2017]
test_draft_17 = test_draft[test_draft['Year'] >= 2017]
X_draft_tr17 = pd.DataFrame(knn_imputer17.transform(train_draft_17[FEATURES_WITH_COLLEGE_ALL].copy()), columns=FEATURES_WITH_COLLEGE_ALL, index=train_draft_17.index)
X_draft_te17 = pd.DataFrame(knn_imputer17.transform(test_draft_17[FEATURES_WITH_COLLEGE_ALL].copy()), columns=FEATURES_WITH_COLLEGE_ALL, index=test_draft_17.index)
y_draft_tr17 = train_draft_17['round_ord'].values
y_draft_te17 = test_draft_17['round_ord'].values

# College+combine w/ agility 2017+ (drafted only): Sacks, TFL, QB Hurry, PD, SOLO, TOT + agility — KNN impute
X_draft_tr_ag = pd.DataFrame(knn_imputer_ag.transform(train_draft_17[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()), columns=FEATURES_WITH_COLLEGE_AGILITY_ALL, index=train_draft_17.index)
X_draft_te_ag = pd.DataFrame(knn_imputer_ag.transform(test_draft_17[FEATURES_WITH_COLLEGE_AGILITY_ALL].copy()), columns=FEATURES_WITH_COLLEGE_AGILITY_ALL, index=test_draft_17.index)
X_draft_tr_ag_scaled = scaler_ag.transform(X_draft_tr_ag)
X_draft_te_ag_scaled = scaler_ag.transform(X_draft_te_ag)

# Scale for ordinal logistic
X_draft_tr_scaled = scaler.transform(X_draft_tr)
X_draft_te_scaled = scaler.transform(X_draft_te)
X_draft_tr17_scaled = scaler17.transform(X_draft_tr17)
X_draft_te17_scaled = scaler17.transform(X_draft_te17)

print('Draft ROUND modeling (drafted only), 7 classes R1–R7')
print('Train drafted:', len(train_draft), '| Test drafted:', len(test_draft))
print('Train 2017+ drafted:', len(train_draft_17), '| Test 2017+ drafted:', len(test_draft_17))
for r in range(1, 8):
    print(f'  R{r}: {(train_draft["draft_round"]==r).sum()} train, {(test_draft["draft_round"]==r).sum()} test')

Draft ROUND modeling (drafted only), 7 classes R1–R7
Train drafted: 303 | Test drafted: 52
Train 2017+ drafted: 99 | Test 2017+ drafted: 52
  R1: 40 train, 7 test
  R2: 41 train, 5 test
  R3: 45 train, 17 test
  R4: 53 train, 4 test
  R5: 51 train, 11 test
  R6: 36 train, 4 test
  R7: 37 train, 4 test
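
With seven classes and only 52 test players, it helps to know what trivial strategies would score before reading the model results. The sketch below derives three reference accuracies from the per-round counts printed above; the baselines are an added illustration, not something the notebook computes.

```python
# Per-round counts copied from the printed output above (R1..R7).
train_counts = [40, 41, 45, 53, 51, 36, 37]
test_counts = [7, 5, 17, 4, 11, 4, 4]
n_test = sum(test_counts)

# Baseline 1: always predict the most common round in the training data.
majority_idx = max(range(len(train_counts)), key=train_counts.__getitem__)
majority_acc = test_counts[majority_idx] / n_test

# Baseline 2: guess a round uniformly at random (expected accuracy = 1/7).
uniform_acc = 1 / len(test_counts)

# Reference: the best any constant prediction could do on this test set
# (requires knowing the test distribution, so it is not a usable model).
best_constant_acc = max(test_counts) / n_test

print(f"Most common training round: R{majority_idx + 1}")
print(f"Always predict it:          {majority_acc:.4f}")
print(f"Uniform random guess:       {uniform_acc:.4f}")
print(f"Best constant (oracle):     {best_constant_acc:.4f}")
```

The round distribution shifts between train and test (R4 is most common in training, R3 in testing), which is one reason round accuracy on this small test set should be read cautiously.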


In [105]:
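# Logistic regression on 7 classes (R1–R7), 0=R1..6=R7. Named "ordinal" here, but sklearn's LogisticRegression fits a multinomial model that ignores round order. Combined = average probabilities.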
ord_combine = LogisticRegression(max_iter=2000, random_state=42, class_weight='balanced')
ord_college = LogisticRegression(max_iter=2000, random_state=43, class_weight='balanced')
ord_college_agility = LogisticRegression(max_iter=2000, random_state=44, class_weight='balanced')

ord_combine.fit(X_draft_tr_scaled, y_draft_tr)
prob_ord_combine = ord_combine.predict_proba(X_draft_te_scaled)
pred_ord_combine = ord_combine.predict(X_draft_te_scaled).astype(int).clip(0, 6)

ord_college.fit(X_draft_tr17_scaled, y_draft_tr17)
prob_ord_college = ord_college.predict_proba(X_draft_te17_scaled)
pred_ord_college = ord_college.predict(X_draft_te17_scaled).astype(int).clip(0, 6)

ord_college_agility.fit(X_draft_tr_ag_scaled, y_draft_tr17)
prob_ord_college_agility = ord_college_agility.predict_proba(X_draft_te_ag_scaled)
pred_ord_college_agility = ord_college_agility.predict(X_draft_te_ag_scaled).astype(int).clip(0, 6)

prob_ord_combined = (prob_ord_combine + prob_ord_college) / 2
pred_ord_combined = np.argmax(prob_ord_combined, axis=1)

round_names = [f'R{r}' for r in range(1, 8)]
for name, pred in [('Ordinal logit (combine-only)', pred_ord_combine), ('Ordinal logit (combine+college)', pred_ord_college), ('Ordinal logit combined', pred_ord_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (rows=actual, cols=R1..R7):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()

Ordinal logit (combine-only)
  Accuracy: 0.2115
  Confusion matrix (rows=actual, cols=R1..R7):
 [[5 0 2 0 0 0 0]
 [2 0 2 0 1 0 0]
 [7 1 4 1 3 1 0]
 [1 1 0 0 0 1 1]
 [4 1 2 0 1 3 0]
 [0 0 0 1 3 0 0]
 [0 0 2 0 0 1 1]]
  Macro F1: 0.157

Ordinal logit (combine+college)
  Accuracy: 0.1923
  Confusion matrix (rows=actual, cols=R1..R7):
 [[1 3 2 0 0 0 1]
 [0 2 1 1 0 0 1]
 [5 2 2 1 3 0 4]
 [1 1 0 2 0 0 0]
 [4 0 2 0 1 2 2]
 [2 0 0 0 1 0 1]
 [0 0 0 2 0 0 2]]
  Macro F1: 0.1951

Ordinal logit combined
  Accuracy: 0.1923
  Confusion matrix (rows=actual, cols=R1..R7):
 [[3 2 1 0 0 0 1]
 [1 1 1 0 1 0 1]
 [6 2 1 0 6 0 2]
 [1 1 0 2 0 0 0]
 [5 0 2 0 1 2 1]
 [2 0 0 0 2 0 0]
 [0 0 0 0 0 2 2]]
  Macro F1: 0.234



In [106]:
# Random Forest: draft ROUND (7 classes R1–R7)
rf_day_combine = RandomForestClassifier(n_estimators=200, max_depth=2, random_state=42, class_weight='balanced')
rf_day_combine.fit(X_draft_tr, y_draft_tr)
pred_rf_day_combine = rf_day_combine.predict(X_draft_te).astype(int).clip(0, 6)
prob_rf_day_combine = rf_day_combine.predict_proba(X_draft_te)

rf_day_college = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42, class_weight='balanced')
rf_day_college.fit(X_draft_tr17, y_draft_tr17)
pred_rf_day_college = rf_day_college.predict(X_draft_te17).astype(int).clip(0, 6)
prob_rf_day_college = rf_day_college.predict_proba(X_draft_te17)

rf_day_college_agility = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42, class_weight='balanced')
rf_day_college_agility.fit(X_draft_tr_ag, y_draft_tr17)
pred_rf_day_college_agility = rf_day_college_agility.predict(X_draft_te_ag).astype(int).clip(0, 6)
prob_rf_day_college_agility = rf_day_college_agility.predict_proba(X_draft_te_ag)

prob_rf_day_combined = (prob_rf_day_combine + prob_rf_day_college) / 2
pred_rf_day_combined = np.argmax(prob_rf_day_combined, axis=1)

for name, pred in [('RF (combine-only)', pred_rf_day_combine), ('RF (combine+college)', pred_rf_day_college), ('RF combined', pred_rf_day_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (R1..R7):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()

RF (combine-only)
  Accuracy: 0.2885
  Confusion matrix (R1..R7):
 [[5 0 1 0 1 0 0]
 [2 0 1 0 2 0 0]
 [8 0 5 4 0 0 0]
 [1 0 0 2 1 0 0]
 [3 0 2 2 2 2 0]
 [3 0 1 0 0 0 0]
 [0 0 1 1 0 1 1]]
  Macro F1: 0.235

RF (combine+college)
  Accuracy: 0.1923
  Confusion matrix (R1..R7):
 [[0 2 3 0 1 0 1]
 [0 1 2 0 0 0 2]
 [7 1 7 0 1 0 1]
 [0 0 2 0 0 0 2]
 [2 0 5 0 2 1 1]
 [2 0 1 0 1 0 0]
 [0 0 2 0 0 2 0]]
  Macro F1: 0.1187

RF combined
  Accuracy: 0.25
  Confusion matrix (R1..R7):
 [[3 2 2 0 0 0 0]
 [1 1 2 0 0 0 1]
 [8 0 7 0 1 0 1]
 [1 0 1 0 1 0 1]
 [3 0 4 0 2 2 0]
 [3 0 1 0 0 0 0]
 [0 0 2 0 0 2 0]]
  Macro F1: 0.1623
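
To make the Macro F1 figures concrete, the sketch below recomputes accuracy, Macro F1, and Round 1 recall for RF (combine-only) directly from its confusion matrix printed above. It assumes the same convention as `f1_score(..., average='macro', zero_division=0)`: a round that is never predicted correctly contributes an F1 of 0.

```python
# RF (combine-only) confusion matrix copied from the output above
# (rows = actual round, cols = predicted round, R1..R7).
cm = [
    [5, 0, 1, 0, 1, 0, 0],
    [2, 0, 1, 0, 2, 0, 0],
    [8, 0, 5, 4, 0, 0, 0],
    [1, 0, 0, 2, 1, 0, 0],
    [3, 0, 2, 2, 2, 2, 0],
    [3, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1],
]
n = len(cm)
row_sums = [sum(row) for row in cm]                              # actual per round
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted per round

# Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum).
f1_scores = []
for k in range(n):
    tp = cm[k][k]
    denom = row_sums[k] + col_sums[k]
    f1_scores.append(2 * tp / denom if denom > 0 else 0.0)

macro_f1 = sum(f1_scores) / n
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)
recall_r1 = cm[0][0] / row_sums[0]

print([round(f, 3) for f in f1_scores])
print(f"accuracy={accuracy:.4f}, macro F1={macro_f1:.4f}, recall R1={recall_r1:.4f}")
```

Two rounds (R2 and R6) score an F1 of 0, which pulls the macro average well below the accuracy.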



In [107]:
# College-only draft-round models (Sacks, TFL, QB Hurry, PD, SOLO, TOT; for players without combine data), KNN-imputed
X_draft_tr_co = pd.DataFrame(knn_imputer_co.transform(train_draft_17[COLLEGE_ONLY_ALL].copy()), columns=COLLEGE_ONLY_ALL, index=train_draft_17.index)
X_draft_te_co = pd.DataFrame(knn_imputer_co.transform(test_draft_17[COLLEGE_ONLY_ALL].copy()), columns=COLLEGE_ONLY_ALL, index=test_draft_17.index)
X_draft_tr_co_scaled = scaler_co.transform(X_draft_tr_co)
X_draft_te_co_scaled = scaler_co.transform(X_draft_te_co)

ord_college_only = LogisticRegression(max_iter=1000, random_state=44, class_weight='balanced')
ord_college_only.fit(X_draft_tr_co_scaled, y_draft_tr17)
pred_ord_college_only = ord_college_only.predict(X_draft_te_co_scaled).astype(int).clip(0, 6)

rf_day_college_only = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42, class_weight='balanced')
rf_day_college_only.fit(X_draft_tr_co, y_draft_tr17)
pred_rf_day_college_only = rf_day_college_only.predict(X_draft_te_co).astype(int).clip(0, 6)

sample_weight_tr_co = compute_sample_weight('balanced', y_draft_tr17)
xgb_day_college_only = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college_only.fit(X_draft_tr_co, y_draft_tr17, sample_weight=sample_weight_tr_co)
pred_xgb_day_college_only = xgb_day_college_only.predict(X_draft_te_co).astype(int).clip(0, 6)

print('College-only draft-ROUND models (7 classes):')
for name, pred in [('Ordinal (college-only)', pred_ord_college_only), ('RF (college-only)', pred_rf_day_college_only), ('XGB (college-only)', pred_xgb_day_college_only)]:
    print(f'  {name}: acc={(pred == y_draft_te17).mean():.4f}, Macro F1={f1_score(y_draft_te17, pred, average="macro", zero_division=0):.4f}')


College-only draft-ROUND models (7 classes):
  Ordinal (college-only): acc=0.0962, Macro F1=0.1006
  RF (college-only): acc=0.2500, Macro F1=0.2141
  XGB (college-only): acc=0.1731, Macro F1=0.1476


In [108]:
# XGBoost: draft round (7 classes): combine-only, combine+college, agility, combined
# Balanced sample weights so less frequent rounds aren't under-predicted
from sklearn.utils.class_weight import compute_sample_weight
sample_weight_tr = compute_sample_weight('balanced', y_draft_tr)
sample_weight_tr17 = compute_sample_weight('balanced', y_draft_tr17)
xgb_day_combine = xgb.XGBClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_combine.fit(X_draft_tr, y_draft_tr, sample_weight=sample_weight_tr)
pred_xgb_day_combine = xgb_day_combine.predict(X_draft_te).astype(int).clip(0, 6)
prob_xgb_day_combine = xgb_day_combine.predict_proba(X_draft_te)

xgb_day_college = xgb.XGBClassifier(n_estimators=200, max_depth=1, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college.fit(X_draft_tr17, y_draft_tr17, sample_weight=sample_weight_tr17)
pred_xgb_day_college = xgb_day_college.predict(X_draft_te17).astype(int).clip(0, 6)
prob_xgb_day_college = xgb_day_college.predict_proba(X_draft_te17)

sample_weight_tr_ag = compute_sample_weight('balanced', y_draft_tr17)
xgb_day_college_agility = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42, eval_metric='mlogloss')
xgb_day_college_agility.fit(X_draft_tr_ag, y_draft_tr17, sample_weight=sample_weight_tr_ag)
pred_xgb_day_college_agility = xgb_day_college_agility.predict(X_draft_te_ag).astype(int).clip(0, 6)
prob_xgb_day_college_agility = xgb_day_college_agility.predict_proba(X_draft_te_ag)

prob_xgb_day_combined = (prob_xgb_day_combine + prob_xgb_day_college) / 2
pred_xgb_day_combined = np.argmax(prob_xgb_day_combined, axis=1)

for name, pred in [('XGBoost (combine-only)', pred_xgb_day_combine), ('XGBoost (agility)', pred_xgb_day_college_agility), ('XGBoost (combine+college)', pred_xgb_day_college), ('XGBoost combined', pred_xgb_day_combined)]:
    y_use = y_draft_te
    print(name)
    print('  Accuracy:', round((pred == y_use).mean(), 4))
    print('  Confusion matrix (R1..R7):\n', confusion_matrix(y_use, pred))
    print('  Macro F1:', round(f1_score(y_use, pred, average='macro', zero_division=0), 4))
    print()


XGBoost (combine-only)
  Accuracy: 0.25
  Confusion matrix (R1..R7):
 [[4 1 2 0 0 0 0]
 [0 0 4 1 0 0 0]
 [6 2 7 1 1 0 0]
 [1 0 0 1 1 1 0]
 [1 1 7 0 1 1 0]
 [0 0 2 0 1 0 1]
 [1 0 1 0 1 1 0]]
  Macro F1: 0.1658

XGBoost (agility)
  Accuracy: 0.1731
  Confusion matrix (R1..R7):
 [[0 2 2 0 2 0 1]
 [0 1 2 0 1 0 1]
 [5 1 5 1 2 0 3]
 [0 1 1 0 0 0 2]
 [3 0 3 1 1 2 1]
 [2 0 1 0 0 0 1]
 [0 1 1 0 0 0 2]]
  Macro F1: 0.1255

XGBoost (combine+college)
  Accuracy: 0.2308
  Confusion matrix (R1..R7):
 [[0 2 2 0 1 1 1]
 [1 1 0 0 2 0 1]
 [4 0 7 2 1 0 3]
 [0 0 1 1 0 1 1]
 [3 0 1 0 2 4 1]
 [1 0 2 0 0 0 1]
 [0 1 1 0 0 1 1]]
  Macro F1: 0.1927

XGBoost combined
  Accuracy: 0.1731
  Confusion matrix (R1..R7):
 [[3 2 2 0 0 0 0]
 [1 0 2 0 1 0 1]
 [6 1 6 2 1 0 1]
 [1 0 0 0 1 1 1]
 [2 1 5 0 0 2 1]
 [1 0 3 0 0 0 0]
 [0 1 1 0 0 2 0]]
  Macro F1: 0.0884
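
The XGBoost round models are fitted with `compute_sample_weight('balanced', ...)`. That mode weights each sample by `n_samples / (n_classes * count_of_its_class)`. The sketch below applies the formula to the full drafted training counts printed earlier (303 players); the 2017+ subset has a different distribution that is not printed, so its weights are not shown.

```python
# Drafted training players per round (R1..R7), from the earlier printed counts.
train_counts = [40, 41, 45, 53, 51, 36, 37]
n_samples = sum(train_counts)
n_classes = len(train_counts)

# 'balanced' weight for every sample in a given round.
weights = [n_samples / (n_classes * count) for count in train_counts]

for round_number, (count, weight) in enumerate(zip(train_counts, weights), start=1):
    print(f"R{round_number}: {count} players, weight per player = {weight:.4f}")

# Each round ends up with the same total weight, and the overall total is unchanged.
total_weight = sum(count * weight for count, weight in zip(train_counts, weights))
print(f"Total weight: {total_weight:.1f}")
```

Because the seven rounds are already fairly even in training (36 to 53 players each), the weights stay close to 1, so balancing changes the fit only modestly here.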




In [109]:
# Compare all draft-ROUND models: college-only, combine-only, combine+college, combine+college w/ agility
# (Ordinal/Logistic, RF, XGB for each). Same test set = drafted players only (y_draft_te).

round_models = [
    # College-only
    ('Ordinal (college-only)', pred_ord_college_only),
    ('RF (college-only)', pred_rf_day_college_only),
    ('XGB (college-only)', pred_xgb_day_college_only),
    # Combine-only
    ('Ordinal (combine-only)', pred_ord_combine),
    ('RF (combine-only)', pred_rf_day_combine),
    ('XGB (combine-only)', pred_xgb_day_combine),
    # Combine+college
    ('Ordinal (combine+college)', pred_ord_college),
    ('RF (combine+college)', pred_rf_day_college),
    ('XGB (combine+college)', pred_xgb_day_college),
    # Combine+college w/ agility
    ('Ordinal (combine+college w/ agility)', pred_ord_college_agility),
    ('RF (combine+college w/ agility)', pred_rf_day_college_agility),
    ('XGB (combine+college w/ agility)', pred_xgb_day_college_agility),
]

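# Every test player is from 2021+ (so also 2017+): test_draft_17 equals test_draft and y_draft_te17 equals y_draft_te.
# All predictions, including college/combine+college/agility models, can therefore be scored against y_draft_te (drafted LBs only).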
y_round = y_draft_te

round_results = []
for name, pred in round_models:
    acc = (pred == y_round).mean()
    f1 = f1_score(y_round, pred, average='macro', zero_division=0)
    cm = confusion_matrix(y_round, pred)
    recall_r1 = (cm[0, 0] / cm[0].sum()) if cm.shape[0] > 0 and cm[0].sum() > 0 else np.nan
    round_results.append({
        'Model': name,
        'Accuracy': acc,
        'Macro F1': f1,
        'Recall R1': recall_r1,
    })

round_df = pd.DataFrame(round_results)
round_df = round_df.sort_values('Macro F1', ascending=False).reset_index(drop=True)
print('All round models ranked by Macro F1 (same test set, drafted only, n=' + str(len(y_round)) + '):')
print('Categories: college-only, combine-only, combine+college, combine+college w/ agility (Ordinal, RF, XGB each)')
print('=' * 75)
print(round_df.to_string(index=False))
print()

best_acc = round_df.loc[round_df['Accuracy'].idxmax(), 'Model']
best_acc_val = round_df['Accuracy'].max()
best_f1 = round_df.loc[0, 'Model']
best_f1_val = round_df.loc[0, 'Macro F1']
print('Summary:')
print('  Best by Accuracy:', best_acc, f'({best_acc_val:.4f})')
print('  Best by Macro F1 (balanced across R1–R7):', best_f1, f'({best_f1_val:.4f})')
print()
print('Conclusion: Macro F1 is the preferred metric for multi-class round prediction;')
print('Accuracy and Recall R1 are also useful.')

All round models ranked by Macro F1 (same test set, drafted only, n=52):
Categories: college-only, combine-only, combine+college, combine+college w/ agility (Ordinal, RF, XGB each)
                               Model  Accuracy  Macro F1  Recall R1
                   RF (combine-only)  0.288462  0.234994   0.714286
                   RF (college-only)  0.250000  0.214062   0.142857
Ordinal (combine+college w/ agility)  0.192308  0.212072   0.285714
           Ordinal (combine+college)  0.192308  0.195147   0.142857
               XGB (combine+college)  0.230769  0.192670   0.000000
                  XGB (combine-only)  0.250000  0.165816   0.571429
              Ordinal (combine-only)  0.211538  0.157011   0.714286
                  XGB (college-only)  0.173077  0.147593   0.000000
    XGB (combine+college w/ agility)  0.173077  0.125519   0.000000
                RF (combine+college)  0.192308  0.118742   0.000000
              Ordinal (college-only)  0.096154  0.100645   0.142857
   

## Predict draft for a single player

For **linebackers**, the college stats are final-season Sacks, TFL, QB Hurry, PD, SOLO, and TOT. The code below assigns the player to one of four categories based on the data available, then selects a drafted/undrafted model (compared by ROC-AUC) and a round model (compared by Macro F1 on drafted players) **independently** for that category:
- **Combine only** → best combine model for draft + best combine model for round
- **College only** → best college model for draft + best college model for round (uses Sacks, TFL, QB Hurry, PD, SOLO, TOT)
- **College + combine** → best college+combine model for draft + best college+combine model for round
- **College + combine w/ agility** → best agility model for draft + best agility model for round
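
The category routing can be sketched in isolation. The snippet below is a simplified, standalone stand-in for the notebook's helpers (the names `player_category` and `_present` are hypothetical); it assumes the same rules used in the code that follows: all six college stats must be present, and at least three of five combine measurements.

```python
import math

COLLEGE_KEYS = [
    'QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season',
    'PD_final_season', 'SOLO_final_season', 'TOT_final_season',
]
COMBINE_KEYS = ['40yd', 'Vertical', 'Broad Jump', 'Height', 'Weight']


def _present(value):
    """True if the value is neither None nor a float NaN."""
    if value is None:
        return False
    return not (isinstance(value, float) and math.isnan(value))


def player_category(player):
    """Return the model category for a player dict (hypothetical helper)."""
    has_college = all(_present(player.get(k)) for k in COLLEGE_KEYS)
    has_combine = sum(_present(player.get(k)) for k in COMBINE_KEYS) >= 3
    has_agility = _present(player.get('agility_score'))
    if not has_college:
        return 'combine'
    if not has_combine:
        return 'college'
    if has_agility:
        return 'college+combine_agility'
    return 'college+combine'


# Hypothetical players for illustration only.
college_stats = {k: 1.0 for k in COLLEGE_KEYS}
combine_stats = {'40yd': 4.6, 'Vertical': 35.0, 'Weight': 238}

print(player_category(combine_stats))
print(player_category(college_stats))
print(player_category({**college_stats, **combine_stats}))
print(player_category({**college_stats, **combine_stats, 'agility_score': 0.4}))
```

Note that a player missing any one college stat is routed to the combine-only models, even if the combine data is also sparse.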

In [110]:
def _player_has_college_stats(player_dict):
    """True only if player has all six college stats (non-null): Sacks, TFL, QB Hurry, PD, SOLO, TOT final season. If any are missing, use combine-only models."""
    keys = ['QB_Hurry_final_season', 'TFL_final_season', 'Sacks_final_season', 'PD_final_season', 'SOLO_final_season', 'TOT_final_season']
    for k in keys:
        v = player_dict.get(k, player_dict.get(k.replace('_', ' ')))
        if v is None or (isinstance(v, float) and np.isnan(v)):
            return False
    return True

def _get_val(player_dict, *keys, default=np.nan):
    for k in keys:
        if k in player_dict and player_dict[k] is not None:
            v = player_dict[k]
            if isinstance(v, float) and np.isnan(v):
                continue
            return v
    return default

def _height_inches(h):
    if h is None or (isinstance(h, float) and np.isnan(h)):
        return np.nan
    if isinstance(h, (int, float)):
        return float(h)
    if isinstance(h, str) and '-' in h:
        parts = h.strip().split('-')
        return int(parts[0]) * 12 + int(parts[1])
    return np.nan

# Map contains_* column name -> feature name for checking if player has that stat
CONTAINS_TO_FEATURE = {
    'contains_broad_jump': 'Broad Jump', 'contains_vertical': 'Vertical',
    'contains_40yd': '40yd', 'contains_height': 'Height', 'contains_weight': 'Weight',
    'contains_speed_score': 'speed_score', 'contains_explosive_score': 'explosive_score',
    'contains_agility_score': 'agility_score',
    'contains_qb_hurry_final_season': 'QB_Hurry_final_season', 'contains_tfl_final_season': 'TFL_final_season', 'contains_sacks_final_season': 'Sacks_final_season',
    'contains_pd_final_season': 'PD_final_season', 'contains_solo_final_season': 'SOLO_final_season', 'contains_tot_final_season': 'TOT_final_season',
    'contains_sacks_cumulative': 'Sacks_cumulative', 'contains_tfl_cumulative': 'TFL_cumulative', 'contains_qb_hurry_cumulative': 'QB_Hurry_cumulative',
    'contains_pd_cumulative': 'PD_cumulative', 'contains_solo_cumulative': 'SOLO_cumulative', 'contains_tot_cumulative': 'TOT_cumulative',
    'contains_p4_conference': 'School',
}

def _player_row(player_dict, feature_list, contains_list, medians, add_speed=True):
    """Build one row for the player with feature_list + contains_*; fill missing with medians."""
    row = {}
    for col in feature_list:
        v = _get_val(player_dict, col, col.replace(' ', '_').lower(), col.replace(' ', ''))
        if col == 'Height':
            v = _height_inches(v)
        if col == 'p4_conference':
            school = _get_val(player_dict, 'School', 'school')
            school_norm = school_alias.get(school, school) if pd.notna(school) and school else None
            year = _get_val(player_dict, 'Year', 'year')
            year = int(year) if pd.notna(year) else 2025
            schools = P4_SCHOOLS if year <= 2023 else P4_SCHOOLS_NO_PAC12
            v = 1 if (school_norm and school_norm in schools) else 0
        # Derive speed score (weight * 200 / forty**4) when it was not supplied
        if add_speed and col == 'speed_score' and pd.isna(v):
            w, forty = _get_val(player_dict, 'Weight'), _get_val(player_dict, '40yd')
            if pd.notna(w) and pd.notna(forty) and float(forty) > 0:
                v = float(w) * 200 / (float(forty) ** 4)
        # Fall back to the training median when the value is still missing
        row[col] = v if pd.notna(v) else medians.get(col, np.nan)
    for col in contains_list:
        feat = CONTAINS_TO_FEATURE.get(col, col.replace('contains_', '').replace('_', ' '))
        v = _get_val(player_dict, feat, feat.replace(' ', '_').lower() if isinstance(feat, str) else feat)
        row[col] = 1 if pd.notna(v) else 0
    return pd.Series(row)

def _player_has_agility_stats(player_dict):
    """True if player has agility_score (non-null). Matches 2024/2025/2026 where agility is computed from 3Cone or Shuttle."""
    ag = _get_val(player_dict, 'agility_score', 'agility_score')
    return ag is not None and not (isinstance(ag, float) and np.isnan(ag))

def _player_has_combine_stats(player_dict):
    """True if player has at least 3 of 5 key combine metrics (40yd, Vertical, Broad Jump, Height, Weight)."""
    keys = ['40yd', 'Vertical', 'Broad Jump', 'Height', 'Weight']
    n_has = 0
    for k in keys:
        v = _get_val(player_dict, k, k.replace(' ', '_').lower(), k.replace(' ', ''))
        if k == 'Height':
            v = _height_inches(v) if v is not None else np.nan
        if v is not None and not (isinstance(v, float) and np.isnan(v)):
            n_has += 1
    return n_has >= 3

def get_best_pipeline_for_player(player_dict, pipe_df=None):
    """Select a drafted/undrafted model and a round model *independently* per category.

    The choices are fixed per category (logistic for draft, "ordinal" logistic for round);
    `pipe_df` is accepted but not used:
    - combine: Logistic (combine) for draft, Ordinal (combine) for round
    - college: Logistic (college) for draft, Ordinal (college) for round
    - college+combine: Logistic (college + combine) for draft, Ordinal (college + combine) for round
    - college+combine_agility: Logistic (college + combine w/ agility) for draft, Ordinal (college + combine w/ agility) for round

    Note: in the round comparison above, RF has a higher Macro F1 than the ordinal model
    for the combine-only (0.235 vs 0.157) and college-only (0.214 vs 0.101) categories.
    """
    has_college = _player_has_college_stats(player_dict)
    has_combine = _player_has_combine_stats(player_dict)
    if not has_college:
        cat = 'combine'
    elif not has_combine:
        cat = 'college'
    elif has_combine and has_college and _player_has_agility_stats(player_dict):
        cat = 'college+combine_agility'
    else:
        cat = 'college+combine'

    if cat == 'combine':
        best_drafted = 'Logistic (combine)'
        best_round = 'Ordinal (combine)'
    elif cat == 'college':
        best_drafted = 'Logistic (college)'
        best_round = 'Ordinal (college)'
    elif cat == 'college+combine_agility':
        best_drafted = 'Logistic (college + combine w/ agility)'
        best_round = 'Ordinal (college + combine w/ agility)'
    else:
        best_drafted = 'Logistic (college + combine)'
        best_round = 'Ordinal (college + combine)'

    return best_drafted, best_round

def _run_drafted_model(drafted_name, row_combine, row_full, row_college_only=None, row_full_agility=None):
    """Return P(drafted) in [0,1] for the given model name."""
    if row_college_only is None:
        rc = row_full.reindex(COLLEGE_ONLY_ALL)  # reindex alone tolerates missing labels (filled with NaN)
        row_college_only = pd.Series(knn_imputer_co.transform(rc.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    if row_full_agility is None:
        ra = row_full.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
        row_full_agility = pd.Series(knn_imputer_ag.transform(ra.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    if drafted_name == 'Logistic (combine)':
        return logit_draft.predict_proba(scaler.transform(row_combine.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college + combine)':
        return logit_draft_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college + combine w/ agility)':
        return logit_draft_college_agility.predict_proba(scaler_ag.transform(row_full_agility.to_frame().T))[0, 1]
    if drafted_name == 'Logistic (college)':
        return logit_draft_college_only.predict_proba(scaler_co.transform(row_college_only.to_frame().T))[0, 1]
    if drafted_name == 'RF (combine)':
        return rf_combine.predict_proba(row_combine.to_frame().T)[0, 1]
    if drafted_name == 'RF (college + combine)':
        return rf_college.predict_proba(row_full.to_frame().T)[0, 1]
    if drafted_name == 'RF (college + combine w/ agility)':
        return rf_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 1]
    if drafted_name == 'RF (college)':
        return rf_college_only.predict_proba(row_college_only.to_frame().T)[0, 1]
    if drafted_name == 'XGB (combine)':
        return xgb_combine.predict_proba(row_combine.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college + combine)':
        return xgb_college.predict_proba(row_full.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college + combine w/ agility)':
        return xgb_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 1]
    if drafted_name == 'XGB (college)':
        return xgb_college_only.predict_proba(row_college_only.to_frame().T)[0, 1]
    return 0.0  # unknown model name: falls back to P(drafted) = 0, i.e. undrafted

def _run_day_model(day_name, row_combine, row_full, row_college_only=None, row_full_agility=None):
    """Return draft round 1-7 from the round model (model outputs 0..6, we return +1)."""
    if row_college_only is None:
        rc = row_full.reindex(COLLEGE_ONLY_ALL)  # reindex alone tolerates missing labels (filled with NaN)
        row_college_only = pd.Series(knn_imputer_co.transform(rc.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    if row_full_agility is None:
        ra = row_full.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
        row_full_agility = pd.Series(knn_imputer_ag.transform(ra.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    pred = 0  # default if day_name matches no branch; returned as Round 1 after the +1 below
    if day_name == 'Ordinal (combine)':
        pred = int(np.clip(ord_combine.predict(scaler.transform(row_combine.to_frame().T))[0], 0, 6))
    elif day_name == 'Ordinal (college + combine)':
        pred = int(np.clip(ord_college.predict(scaler17.transform(row_full.to_frame().T))[0], 0, 6))
    elif day_name == 'Ordinal (college + combine w/ agility)':
        pred = int(np.clip(ord_college_agility.predict(scaler_ag.transform(row_full_agility.to_frame().T))[0], 0, 6))
    elif day_name == 'Ordinal (college)':
        pred = int(np.clip(ord_college_only.predict(scaler_co.transform(row_college_only.to_frame().T))[0], 0, 6))
    elif day_name == 'RF (combine)':
        pred = int(np.clip(rf_day_combine.predict(row_combine.to_frame().T)[0], 0, 6))
    elif day_name == 'RF (college + combine)':
        pred = int(np.clip(rf_day_college.predict(row_full.to_frame().T)[0], 0, 6))
    elif day_name == 'RF (college + combine w/ agility)':
        pred = int(np.clip(rf_day_college_agility.predict(row_full_agility.to_frame().T)[0], 0, 6))
    elif day_name == 'RF (college)':
        pred = int(np.clip(rf_day_college_only.predict(row_college_only.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (combine)':
        pred = int(np.clip(xgb_day_combine.predict(row_combine.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (college + combine)':
        pred = int(np.clip(xgb_day_college.predict(row_full.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (college + combine w/ agility)':
        pred = int(np.clip(xgb_day_college_agility.predict(row_full_agility.to_frame().T)[0], 0, 6))
    elif day_name == 'XGB (college)':
        pred = int(np.clip(xgb_day_college_only.predict(row_college_only.to_frame().T)[0], 0, 6))
    return pred + 1  # 1-7

def _get_round1_prob(day_name, row_combine, row_full, row_college_only=None, row_full_agility=None):
    """Return P(Round 1) = probability of class 0 from the round model (for thresholding)."""
    if row_college_only is None:
        rc = row_full.reindex(COLLEGE_ONLY_ALL)
        row_college_only = pd.Series(knn_imputer_co.transform(rc.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    if row_full_agility is None:
        ra = row_full.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
        row_full_agility = pd.Series(knn_imputer_ag.transform(ra.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    # predict_proba returns (n_samples, n_classes); class 0 is Round 1 (classes 0..6 map to Rounds 1..7)
    if day_name == 'Ordinal (combine)':
        return ord_combine.predict_proba(scaler.transform(row_combine.to_frame().T))[0, 0]
    if day_name == 'Ordinal (college + combine)':
        return ord_college.predict_proba(scaler17.transform(row_full.to_frame().T))[0, 0]
    if day_name == 'Ordinal (college + combine w/ agility)':
        return ord_college_agility.predict_proba(scaler_ag.transform(row_full_agility.to_frame().T))[0, 0]
    if day_name == 'Ordinal (college)':
        return ord_college_only.predict_proba(scaler_co.transform(row_college_only.to_frame().T))[0, 0]
    if day_name == 'RF (combine)':
        return rf_day_combine.predict_proba(row_combine.to_frame().T)[0, 0]
    if day_name == 'RF (college + combine)':
        return rf_day_college.predict_proba(row_full.to_frame().T)[0, 0]
    if day_name == 'RF (college + combine w/ agility)':
        return rf_day_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 0]
    if day_name == 'RF (college)':
        return rf_day_college_only.predict_proba(row_college_only.to_frame().T)[0, 0]
    if day_name == 'XGB (combine)':
        return xgb_day_combine.predict_proba(row_combine.to_frame().T)[0, 0]
    if day_name == 'XGB (college + combine)':
        return xgb_day_college.predict_proba(row_full.to_frame().T)[0, 0]
    if day_name == 'XGB (college + combine w/ agility)':
        return xgb_day_college_agility.predict_proba(row_full_agility.to_frame().T)[0, 0]
    if day_name == 'XGB (college)':
        return xgb_day_college_only.predict_proba(row_college_only.to_frame().T)[0, 0]
    return 0.0

R1_PROB_THRESHOLD = 0.28  # If P(R1) >= this, predict Round 1 (improves R1 recall)

def predict_draft(player_dict, pipe_df=None, category=None):
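    """
    Predict drafted/undrafted and, if drafted, the draft round (1-7) for one player.

    category: one of 'combine', 'college', 'college+combine', 'college+combine_agility'.
        If given, use that category's fixed models; any other non-None value falls back to
        'college+combine'. If None, infer the category from the data present in player_dict.
    Returns: dict with drafted (bool), draft_round (1-7, or None if undrafted), drafted_model,
        day_model, prob_drafted. A drafted player is reported as Round 1 whenever
        P(Round 1) >= R1_PROB_THRESHOLD, regardless of the round model's own prediction.
    """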
    if category is not None:
        # Directly choose models by category (same choices as in get_best_pipeline_for_player)
        if category == 'combine':
            drafted_name = 'Logistic (combine)'
            day_name = 'Ordinal (combine)'
        elif category == 'college':
            drafted_name = 'Logistic (college)'
            day_name = 'Ordinal (college)'
        elif category == 'college+combine_agility':
            drafted_name = 'Logistic (college + combine w/ agility)'
            day_name = 'Ordinal (college + combine w/ agility)'
        else:  # 'college+combine' or anything else falls back here
            drafted_name = 'Logistic (college + combine)'
            day_name = 'Ordinal (college + combine)'
    else:
        # The per-category choices are fixed, so get_best_pipeline_for_player never reads pipe_df;
        # it is passed through only to keep the existing signature.
        drafted_name, day_name = get_best_pipeline_for_player(player_dict, pipe_df)
    row_combine = _player_row(player_dict, COMBINE_ONLY_FEATURES, COMBINE_ONLY_CONTAINS, train_medians)
    row_combine = row_combine.reindex(COMBINE_ONLY_ALL)
    row_combine = pd.Series(knn_imputer_combine.transform(row_combine.values.reshape(1, -1))[0], index=COMBINE_ONLY_ALL)
    row_full = _player_row(player_dict, FEATURES_WITH_COLLEGE, CONTAINS_WITH_COLLEGE, train_medians17)
    row_full = row_full.reindex(FEATURES_WITH_COLLEGE_ALL)
    row_full = pd.Series(knn_imputer17.transform(row_full.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_ALL)
    row_full_agility = _player_row(player_dict, FEATURES_WITH_COLLEGE_AGILITY, CONTAINS_WITH_COLLEGE_AGILITY, train_medians_ag)
    row_full_agility = row_full_agility.reindex(FEATURES_WITH_COLLEGE_AGILITY_ALL)
    row_full_agility = pd.Series(knn_imputer_ag.transform(row_full_agility.values.reshape(1, -1))[0], index=FEATURES_WITH_COLLEGE_AGILITY_ALL)
    row_college_only = _player_row(player_dict, COLLEGE_ONLY_FEATURES, COLLEGE_ONLY_CONTAINS, train_medians_co, add_speed=False)
    row_college_only = row_college_only.reindex(COLLEGE_ONLY_ALL)
    row_college_only = pd.Series(knn_imputer_co.transform(row_college_only.values.reshape(1, -1))[0], index=COLLEGE_ONLY_ALL)
    prob_drafted = _run_drafted_model(drafted_name, row_combine, row_full, row_college_only, row_full_agility)
    drafted = bool(prob_drafted >= 0.5)  # plain bool, as documented in the docstring
    draft_round = None
    if drafted:
        round_pred = _run_day_model(day_name, row_combine, row_full, row_college_only, row_full_agility)  # 1-7
        prob_r1 = _get_round1_prob(day_name, row_combine, row_full, row_college_only, row_full_agility)
        if prob_r1 >= R1_PROB_THRESHOLD:
            draft_round = 1
        else:
            draft_round = int(np.clip(round_pred, 1, 7))
    return {
        'drafted': drafted,
        'draft_round': draft_round,
        'drafted_model': drafted_name,
        'day_model': day_name,
        'prob_drafted': float(prob_drafted),
    }
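
The two-stage decision inside `predict_draft` can be restated without any fitted models. The sketch below takes the three numbers the models would supply and applies the same thresholds: 0.5 for drafted/undrafted, then `R1_PROB_THRESHOLD` for the Round 1 override. The function name `decide_round` and its arguments are hypothetical and not part of the notebook.

```python
R1_PROB_THRESHOLD = 0.28  # same value as in the cell above

def decide_round(prob_drafted, round_pred, prob_r1, r1_threshold=R1_PROB_THRESHOLD):
    """Hypothetical helper: apply predict_draft's thresholds to precomputed model outputs.

    prob_drafted: P(drafted) from the drafted/undrafted model
    round_pred:   round (1-7) from the round model
    prob_r1:      P(Round 1) from the round model
    Returns None for undrafted, else a round in 1..7.
    """
    if prob_drafted < 0.5:
        return None  # the drafted gate is checked before any round logic
    if prob_r1 >= r1_threshold:
        return 1  # Round 1 override, applied even if round_pred is a later round
    return max(1, min(7, int(round_pred)))

print(decide_round(0.40, 3, 0.50))  # undrafted: fails the 0.5 gate
print(decide_round(0.80, 3, 0.30))  # Round 1: P(R1) clears the threshold
print(decide_round(0.80, 3, 0.10))  # falls back to the round model's prediction
```

Because the override depends only on P(Round 1), a player the round model places in Round 3 can still be reported as Round 1. That is the recall-for-precision trade-off the comment on `R1_PROB_THRESHOLD` describes.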



In [111]:
# 2024 drafted LBs: compute speed_score, explosive_score, agility_score; run combine/college/college+combine predictions
# Requires prior cells (train_df, predict_draft, pipe_df, etc.) to be run.

lb_2024 = pd.read_csv('lb_drafted_2024.csv')
# Height may already be in inches
if lb_2024['Height'].dtype == object or (lb_2024['Height'].notna() & (lb_2024['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return pd.to_numeric(s, errors='coerce')  # e.g. '76' stored as text; NaN if unparseable
    lb_2024['Height'] = lb_2024['Height'].apply(_ht_inches)
else:
    lb_2024['Height'] = pd.to_numeric(lb_2024['Height'], errors='coerce')

# 1) Speed score
lb_2024['speed_score'] = np.where(
    lb_2024['40yd'].notna() & (lb_2024['40yd'] > 0),
    lb_2024['Weight'] * 200 / (lb_2024['40yd'] ** 4),
    np.nan
)

# 2) Explosive score (z-scores from training LBs: Vertical + Broad Jump)
# Use NaN when BOTH Vertical and Broad Jump are missing — otherwise the model thinks we have "average" (0) and mis-predicts.
tr_lb = train_df[train_df['Pos'].isin(['ILB', 'LB', 'OLB'])]
mean_v = tr_lb['Vertical'].mean()
std_v = tr_lb['Vertical'].std()
mean_b = tr_lb['Broad Jump'].mean()
std_b = tr_lb['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v):
    std_v = 1.0
if std_b == 0 or np.isnan(std_b):
    std_b = 1.0
v_z = (lb_2024['Vertical'] - mean_v) / std_v
b_z = (lb_2024['Broad Jump'] - mean_b) / std_b
has_explosive = lb_2024['Vertical'].notna() | lb_2024['Broad Jump'].notna()
lb_2024['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

# 3) Agility score (z-scores from training LBs: lower 3Cone/Shuttle = better, so negate z)
# Use NaN when BOTH 3Cone and Shuttle are missing — otherwise 0 is treated as "has data" and biases predictions.
mean_3 = tr_lb['3Cone'].mean()
std_3 = tr_lb['3Cone'].std()
mean_sh = tr_lb['Shuttle'].mean()
std_sh = tr_lb['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3):
    std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh):
    std_sh = 1.0
z_3 = (lb_2024['3Cone'] - mean_3) / std_3
z_sh = (lb_2024['Shuttle'] - mean_sh) / std_sh
has_agility = lb_2024['3Cone'].notna() | lb_2024['Shuttle'].notna()
lb_2024['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

# 4) Run model on each 2024 LB: combine-only, college-only, college+combine (best models independently)
def row_to_player_dict(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row['Shuttle'], '3Cone': row['3Cone'],
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'PD_final_season': row.get('PD_final_season', np.nan),
        'SOLO_final_season': row.get('SOLO_final_season', np.nan),
        'TOT_final_season': row.get('TOT_final_season', np.nan),
        'Sacks_cumulative': row.get('Sacks_cumulative', np.nan),
        'TFL_cumulative': row.get('TFL_cumulative', np.nan),
        'QB_Hurry_cumulative': row.get('QB_Hurry_cumulative', np.nan),
        'PD_cumulative': row.get('PD_cumulative', np.nan),
        'SOLO_cumulative': row.get('SOLO_cumulative', np.nan),
        'TOT_cumulative': row.get('TOT_cumulative', np.nan),
        'speed_score': row.get('speed_score', np.nan), 'explosive_score': row.get('explosive_score', np.nan), 'agility_score': row.get('agility_score', np.nan),
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2024),
    }

pred_combine, model_combine = [], []
pred_college_only, model_college_only = [], []
pred_college_combine, model_college_combine = [], []
for _, row in lb_2024.iterrows():
    row_dict = row_to_player_dict(row)
    has_combine = _player_has_combine_stats(row_dict)
    has_college = _player_has_college_stats(row_dict)
    has_agility = _player_has_agility_stats(row_dict)
    if has_combine:
        out = predict_draft(row_dict, category='combine')
        pred_combine.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_combine.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_combine.append('—')
        model_combine.append('—')
    if has_college:
        out = predict_draft(row_dict, category='college')
        pred_college_only.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_only.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_only.append('—')
        model_college_only.append('—')
    if has_combine and has_college:
        cat = 'college+combine_agility' if has_agility else 'college+combine'
        out = predict_draft(row_dict, category=cat)
        pred_college_combine.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_combine.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_combine.append('—')
        model_college_combine.append('—')
lb_2024['prediction_combine'] = pred_combine
lb_2024['model_combine'] = model_combine
lb_2024['prediction_college_only'] = pred_college_only
lb_2024['model_college_only'] = model_college_only
lb_2024['prediction_college_combine'] = pred_college_combine
lb_2024['model_college_combine'] = model_college_combine

lb_2024

Unnamed: 0,Round,Pick,Player,Pos,School,Year,Height,Weight,40yd,Vertical,...,TOT_final_season,speed_score,explosive_score,agility_score,prediction_combine,model_combine,prediction_college_only,model_college_only,prediction_college_combine,model_college_combine
0,1,17,Dallas Turner,LB,Alabama,2024,76.0,244.0,4.46,40.5,...,53.0,123.33331,3.253088,,Round 1,Logistic (combine) + Ordinal (combine),Round 7,Logistic (college) + Ordinal (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
1,2,45,Edgerrin Cooper,LB,Texas A&M,2024,75.0,230.0,4.51,34.5,...,83.0,111.186399,0.838347,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
2,2,52,Junior Colson,LB,Michigan,2024,74.0,238.0,4.58,34.0,...,89.0,108.179518,0.339918,,Round 3,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 3,Logistic (college + combine) + Ordinal (colleg...
3,3,72,Trevin Wallace,LB,Kentucky,2024,73.0,237.0,4.51,37.5,...,80.0,114.570333,2.470291,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 5,Logistic (college + combine) + Ordinal (colleg...
4,3,98,Payton Wilson,LB,NC State,2024,75.0,233.0,4.43,34.5,...,138.0,120.996,0.328859,,Round 1,Logistic (combine) + Ordinal (combine),Round 1,Logistic (college) + Ordinal (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
5,3,87,Marist Liufau,LB,Notre Dame,2024,74.0,239.0,4.64,30.0,...,42.0,103.123103,-0.930251,,Round 5,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 2,Logistic (college + combine) + Ordinal (colleg...
6,4,114,Jaylan Ford,LB,Texas,2024,73.0,236.0,4.73,33.5,...,96.0,94.296904,-0.158512,,Round 6,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 4,Logistic (college + combine) + Ordinal (colleg...
7,4,118,Tyrice Knight,LB,UTEP,2024,73.0,233.0,4.63,35.5,...,113.0,101.405603,0.98606,,Round 5,Logistic (combine) + Ordinal (combine),Round 4,Logistic (college) + Ordinal (college),Round 4,Logistic (college + combine) + Ordinal (colleg...
8,5,149,Edefuan Ulofoshio,LB,Washington,2024,72.0,236.0,4.56,39.5,...,90.0,109.164801,2.935546,,Round 4,Logistic (combine) + Ordinal (combine),Round 5,Logistic (college) + Ordinal (college),Round 5,Logistic (college + combine) + Ordinal (colleg...
9,5,160,Steele Chambers,LB,Ohio State,2024,74.0,226.0,4.77,33.5,...,83.0,87.310187,0.011317,,Undrafted,Logistic (combine),Undrafted,Logistic (college),Undrafted,Logistic (college + combine)
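
One way to compare a prediction column with the actual `Round` column is a simple round-error summary. The sketch below is illustrative only: `parse_round` and `round_error_summary` are hypothetical helpers, and the sample lists are the `Round` and `prediction_combine` values copied from the ten rows displayed above. Rows without a predicted round ('Undrafted' or the placeholder dash) are excluded from the error and counted separately.

```python
import re

def parse_round(label):
    """Return the integer round from a 'Round N' label, else None."""
    m = re.fullmatch(r"Round (\d)", str(label))
    return int(m.group(1)) if m else None

def round_error_summary(actual_rounds, predicted_labels):
    """Hypothetical helper: summarise how far predicted rounds are from actual rounds."""
    errors = []
    no_round = 0
    for actual, label in zip(actual_rounds, predicted_labels):
        pred = parse_round(label)
        if pred is None:
            no_round += 1  # 'Undrafted' or placeholder: no round to score
        else:
            errors.append(abs(pred - actual))
    n = len(errors)
    return {
        "n_scored": n,
        "n_without_round": no_round,
        "exact": sum(e == 0 for e in errors),
        "within_one": sum(e <= 1 for e in errors),
        "mean_abs_error": sum(errors) / n if n else float("nan"),
    }

# Values copied from the 2024 table above (Round, prediction_combine)
actual_2024 = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
pred_combine_2024 = ["Round 1", "Round 1", "Round 3", "Round 1", "Round 1",
                     "Round 5", "Round 6", "Round 5", "Round 4", "Undrafted"]
summary = round_error_summary(actual_2024, pred_combine_2024)
print(summary)
```

The same call could be repeated for `prediction_college_only` and `prediction_college_combine`. With only ten displayed rows, the summary is a sanity check rather than an evaluation of the models.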


In [112]:
# 2025 drafted LBs: same as 2024 — speed_score, explosive_score, agility_score + combine/college/college+combine predictions
# Requires prior cells (train_df, predict_draft, pipe_df, etc.) to be run.

lb_2025 = pd.read_csv('lb_drafted_2025.csv')
# Height may already be in inches
if lb_2025['Height'].dtype == object or (lb_2025['Height'].notna() & (lb_2025['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return pd.to_numeric(s, errors='coerce')  # e.g. '76' stored as text; NaN if unparseable
    lb_2025['Height'] = lb_2025['Height'].apply(_ht_inches)
else:
    lb_2025['Height'] = pd.to_numeric(lb_2025['Height'], errors='coerce')

# 1) Speed score
lb_2025['speed_score'] = np.where(
    lb_2025['40yd'].notna() & (lb_2025['40yd'] > 0),
    lb_2025['Weight'] * 200 / (lb_2025['40yd'] ** 4),
    np.nan
)

# 2) Explosive score (z-scores from training LBs: Vertical + Broad Jump)
tr_lb = train_df[train_df['Pos'].isin(['ILB', 'LB', 'OLB'])]
mean_v = tr_lb['Vertical'].mean()
std_v = tr_lb['Vertical'].std()
mean_b = tr_lb['Broad Jump'].mean()
std_b = tr_lb['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v):
    std_v = 1.0
if std_b == 0 or np.isnan(std_b):
    std_b = 1.0
v_z = (lb_2025['Vertical'] - mean_v) / std_v
b_z = (lb_2025['Broad Jump'] - mean_b) / std_b
has_explosive = lb_2025['Vertical'].notna() | lb_2025['Broad Jump'].notna()
lb_2025['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

# 3) Agility score (z-scores from training LBs)
mean_3 = tr_lb['3Cone'].mean()
std_3 = tr_lb['3Cone'].std()
mean_sh = tr_lb['Shuttle'].mean()
std_sh = tr_lb['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3):
    std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh):
    std_sh = 1.0
z_3 = (lb_2025['3Cone'] - mean_3) / std_3
z_sh = (lb_2025['Shuttle'] - mean_sh) / std_sh
has_agility = lb_2025['3Cone'].notna() | lb_2025['Shuttle'].notna()
lb_2025['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

# 4) Run model on each 2025 LB: combine-only, college-only, college+combine (best models independently)
def row_to_player_dict_2025(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row['Shuttle'], '3Cone': row['3Cone'],
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'PD_final_season': row.get('PD_final_season', np.nan),
        'SOLO_final_season': row.get('SOLO_final_season', np.nan),
        'TOT_final_season': row.get('TOT_final_season', np.nan),
        'Sacks_cumulative': row.get('Sacks_cumulative', np.nan),
        'TFL_cumulative': row.get('TFL_cumulative', np.nan),
        'QB_Hurry_cumulative': row.get('QB_Hurry_cumulative', np.nan),
        'PD_cumulative': row.get('PD_cumulative', np.nan),
        'SOLO_cumulative': row.get('SOLO_cumulative', np.nan),
        'TOT_cumulative': row.get('TOT_cumulative', np.nan),
        'speed_score': row.get('speed_score', np.nan), 'explosive_score': row.get('explosive_score', np.nan), 'agility_score': row.get('agility_score', np.nan),
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2025),
    }

pred_combine_2025, model_combine_2025 = [], []
pred_college_only_2025, model_college_only_2025 = [], []
pred_college_combine_2025, model_college_combine_2025 = [], []
for _, row in lb_2025.iterrows():
    row_dict = row_to_player_dict_2025(row)
    has_combine = _player_has_combine_stats(row_dict)
    has_college = _player_has_college_stats(row_dict)
    has_agility = _player_has_agility_stats(row_dict)
    if has_combine:
        out = predict_draft(row_dict, category='combine')
        pred_combine_2025.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_combine_2025.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_combine_2025.append('—')
        model_combine_2025.append('—')
    if has_college:
        out = predict_draft(row_dict, category='college')
        pred_college_only_2025.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_only_2025.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_only_2025.append('—')
        model_college_only_2025.append('—')
    if has_combine and has_college:
        cat = 'college+combine_agility' if has_agility else 'college+combine'
        out = predict_draft(row_dict, category=cat)
        pred_college_combine_2025.append(f"Round {out['draft_round']}" if out['drafted'] else 'Undrafted')
        model_college_combine_2025.append(f"{out['drafted_model']} + {out['day_model']}" if out['drafted'] else out['drafted_model'])
    else:
        pred_college_combine_2025.append('—')
        model_college_combine_2025.append('—')
lb_2025['prediction_combine'] = pred_combine_2025
lb_2025['model_combine'] = model_combine_2025
lb_2025['prediction_college_only'] = pred_college_only_2025
lb_2025['model_college_only'] = model_college_only_2025
lb_2025['prediction_college_combine'] = pred_college_combine_2025
lb_2025['model_college_combine'] = model_college_combine_2025


lb_2025

Unnamed: 0,Round,Pick,Player,Pos,School,Year,Height,Weight,40yd,Vertical,...,TOT_final_season,speed_score,explosive_score,agility_score,prediction_combine,model_combine,prediction_college_only,model_college_only,prediction_college_combine,model_college_combine
0,1,15,Jalon Walker,LB,Georgia,2025,73.0,243.0,,,...,61.0,,,,—,—,Round 7,Logistic (college) + Ordinal (college),—,—
1,1,31,Jihaad Campbell,LB,Alabama,2025,74.7,235.0,4.52,,...,119.0,112.601485,1.523556,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
2,2,33,Carson Schwesinger,LB,UCLA,2025,74.4,242.0,,39.5,...,136.0,,3.275205,0.718916,Round 5,Logistic (combine) + Ordinal (combine),Round 6,Logistic (college) + Ordinal (college),Round 5,Logistic (college + combine w/ agility) + Ordi...
3,2,49,Demetrius Knight Jr,LB,South Carolina,2025,73.5,235.0,4.58,31.5,...,82.0,106.81591,-0.793596,0.384904,Round 5,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 6,Logistic (college + combine w/ agility) + Ordi...
4,3,75,Nick Martin,LB,Oklahoma State,2025,71.5,221.0,4.53,38.0,...,,104.961363,2.119574,,Round 4,Logistic (combine) + Ordinal (combine),—,—,—,—
5,4,107,Jack Kiser,LB,Notre Dame,2025,73.5,231.0,4.62,34.5,...,90.0,101.408433,-0.0108,2.247261,Round 5,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 4,Logistic (college + combine w/ agility) + Ordi...
6,4,112,Danny Stutsman,LB,Oklahoma,2025,75.2,233.0,4.52,34.0,...,110.0,111.643175,0.170088,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 3,Logistic (college + combine) + Ordinal (colleg...
7,4,115,Cody Simon,LB,Ohio State,2025,74.1,229.0,4.59,33.5,...,112.0,103.184565,0.181147,-0.305187,Round 6,Logistic (combine) + Ordinal (combine),Round 1,Logistic (college) + Ordinal (college),Round 5,Logistic (college + combine w/ agility) + Ordi...
8,4,119,Barrett Carter,LB,Clemson,2025,72.1,231.0,4.64,34.5,...,74.0,99.671284,-0.180629,-0.68851,Round 6,Logistic (combine) + Ordinal (combine),Round 4,Logistic (college) + Ordinal (college),Round 4,Logistic (college + combine w/ agility) + Ordi...
9,4,129,Teddye Buchanan,LB,California,2025,73.7,233.0,4.6,40.0,...,108.0,104.076958,3.094317,,Round 4,Logistic (combine) + Ordinal (combine),Round 1,Logistic (college) + Ordinal (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
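
The per-year cells repeat the same three derived-feature steps. As a sketch of how they could share one helper, the function below computes `speed_score`, `explosive_score`, and `agility_score` with the same formulas and the same NaN rule (a composite is NaN only when both of its inputs are missing). The names `add_composite_scores` and `_zscore` and the tiny frames in the usage lines are hypothetical; in the notebook the reference frame would be `tr_lb`.

```python
import numpy as np
import pandas as pd

def _zscore(values, reference):
    """z-score `values` against the mean/std of `reference`; std falls back to 1.0."""
    std = reference.std()
    if std == 0 or np.isnan(std):
        std = 1.0
    return (values - reference.mean()) / std

def add_composite_scores(df, train_lb):
    """Hypothetical helper: add speed/explosive/agility scores using training-set statistics."""
    out = df.copy()
    valid_40 = out['40yd'].notna() & (out['40yd'] > 0)
    out['speed_score'] = np.where(valid_40, out['Weight'] * 200 / out['40yd'] ** 4, np.nan)

    v_z = _zscore(out['Vertical'], train_lb['Vertical'])
    b_z = _zscore(out['Broad Jump'], train_lb['Broad Jump'])
    has_explosive = out['Vertical'].notna() | out['Broad Jump'].notna()
    out['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

    # Lower 3Cone/Shuttle times are better, so the z-scores are negated
    z_3 = _zscore(out['3Cone'], train_lb['3Cone'])
    z_sh = _zscore(out['Shuttle'], train_lb['Shuttle'])
    has_agility = out['3Cone'].notna() | out['Shuttle'].notna()
    out['agility_score'] = np.where(has_agility, -z_3.fillna(0) - z_sh.fillna(0), np.nan)
    return out

# Made-up numbers, only to show the call shape and the NaN rule
reference = pd.DataFrame({'Vertical': [30.0, 34.0, 38.0], 'Broad Jump': [110.0, 118.0, 126.0],
                          '3Cone': [6.9, 7.1, 7.3], 'Shuttle': [4.2, 4.3, 4.4]})
players = pd.DataFrame({'Weight': [240.0, 230.0], '40yd': [4.5, np.nan],
                        'Vertical': [38.0, np.nan], 'Broad Jump': [np.nan, np.nan],
                        '3Cone': [np.nan, 7.1], 'Shuttle': [np.nan, 4.3]})
scored = add_composite_scores(players, reference)
print(scored[['speed_score', 'explosive_score', 'agility_score']])
```

With such a helper, each year's cell could shrink to a call along the lines of `lb_2025 = add_composite_scores(lb_2025, tr_lb)`, which would also keep the three years from drifting apart if a formula changes.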


In [117]:
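# 2026 LBs (Round/Pick are empty in this file, so these are predictions without a draft result to compare against): same steps as 2024/2025, i.e. speed_score, explosive_score, agility_score + combine/college/college+combine predictions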
# Requires prior cells (train_df, predict_draft, pipe_df, etc.) to be run.

lb_2026 = pd.read_csv('lb_drafted_2026.csv')
# Height may already be in inches
if lb_2026['Height'].dtype == object or (lb_2026['Height'].notna() & (lb_2026['Height'].astype(str).str.contains('-', na=False))).any():
    def _ht_inches(h):
        if pd.isna(h) or h == '':
            return np.nan
        if isinstance(h, (int, float)) and not np.isnan(h):
            return float(h)
        s = str(h)
        if '-' in s:
            parts = s.split('-')
            return int(parts[0]) * 12 + int(parts[1])
        return pd.to_numeric(s, errors='coerce')  # e.g. '76' stored as text; NaN if unparseable
    lb_2026['Height'] = lb_2026['Height'].apply(_ht_inches)
else:
    lb_2026['Height'] = pd.to_numeric(lb_2026['Height'], errors='coerce')

# 1) Speed score
lb_2026['speed_score'] = np.where(
    lb_2026['40yd'].notna() & (lb_2026['40yd'] > 0),
    lb_2026['Weight'] * 200 / (lb_2026['40yd'] ** 4),
    np.nan
)

# 2) Explosive score (z-scores from training LBs: Vertical + Broad Jump)
tr_lb = train_df[train_df['Pos'].isin(['ILB', 'LB', 'OLB'])]
mean_v = tr_lb['Vertical'].mean()
std_v = tr_lb['Vertical'].std()
mean_b = tr_lb['Broad Jump'].mean()
std_b = tr_lb['Broad Jump'].std()
if std_v == 0 or np.isnan(std_v):
    std_v = 1.0
if std_b == 0 or np.isnan(std_b):
    std_b = 1.0
v_z = (lb_2026['Vertical'] - mean_v) / std_v
b_z = (lb_2026['Broad Jump'] - mean_b) / std_b
has_explosive = lb_2026['Vertical'].notna() | lb_2026['Broad Jump'].notna()
lb_2026['explosive_score'] = np.where(has_explosive, v_z.fillna(0) + b_z.fillna(0), np.nan)

# 3) Agility score (z-scores from training LBs)
mean_3 = tr_lb['3Cone'].mean()
std_3 = tr_lb['3Cone'].std()
mean_sh = tr_lb['Shuttle'].mean()
std_sh = tr_lb['Shuttle'].std()
if std_3 == 0 or np.isnan(std_3):
    std_3 = 1.0
if std_sh == 0 or np.isnan(std_sh):
    std_sh = 1.0
z_3 = (lb_2026['3Cone'] - mean_3) / std_3
z_sh = (lb_2026['Shuttle'] - mean_sh) / std_sh
has_agility = lb_2026['3Cone'].notna() | lb_2026['Shuttle'].notna()
lb_2026['agility_score'] = np.where(has_agility, (-z_3.fillna(0)) + (-z_sh.fillna(0)), np.nan)

# 4) Run model on each 2026 LB: combine-only, college-only, college+combine (best models independently)
def row_to_player_dict_2026(row):
    return {
        'Height': row['Height'], 'Weight': row['Weight'], '40yd': row['40yd'],
        'Vertical': row['Vertical'], 'Broad Jump': row['Broad Jump'],
        'Shuttle': row['Shuttle'], '3Cone': row['3Cone'],
        'QB_Hurry_final_season': row.get('QB_Hurry_final_season', np.nan),
        'TFL_final_season': row.get('TFL_final_season', np.nan),
        'Sacks_final_season': row.get('Sacks_final_season', np.nan),
        'PD_final_season': row.get('PD_final_season', np.nan),
        'SOLO_final_season': row.get('SOLO_final_season', np.nan),
        'TOT_final_season': row.get('TOT_final_season', np.nan),
        'Sacks_cumulative': row.get('Sacks_cumulative', np.nan),
        'TFL_cumulative': row.get('TFL_cumulative', np.nan),
        'QB_Hurry_cumulative': row.get('QB_Hurry_cumulative', np.nan),
        'PD_cumulative': row.get('PD_cumulative', np.nan),
        'SOLO_cumulative': row.get('SOLO_cumulative', np.nan),
        'TOT_cumulative': row.get('TOT_cumulative', np.nan),
        'speed_score': row.get('speed_score', np.nan), 'explosive_score': row.get('explosive_score', np.nan), 'agility_score': row.get('agility_score', np.nan),
        'School': row.get('School', np.nan), 'Year': row.get('Year', 2026),
    }

def _format_prediction_2026(out):
    """Return (prediction label, model label) for one predict_draft result."""
    if out['drafted']:
        return f"Round {out['draft_round']}", f"{out['drafted_model']} + {out['day_model']}"
    return 'Undrafted', out['drafted_model']

pred_combine_2026, model_combine_2026 = [], []
pred_college_only_2026, model_college_only_2026 = [], []
pred_college_combine_2026, model_college_combine_2026 = [], []
for _, row in lb_2026.iterrows():
    row_dict = row_to_player_dict_2026(row)
    has_combine = _player_has_combine_stats(row_dict)
    has_college = _player_has_college_stats(row_dict)
    has_agility = _player_has_agility_stats(row_dict)
    # The combined model uses the agility variant only when agility drills are present.
    combined_cat = 'college+combine_agility' if has_agility else 'college+combine'
    # One entry per output column pair; the category is None when the player
    # lacks the inputs that model needs, so no prediction is made.
    targets = [
        (pred_combine_2026, model_combine_2026, 'combine' if has_combine else None),
        (pred_college_only_2026, model_college_only_2026, 'college' if has_college else None),
        (pred_college_combine_2026, model_college_combine_2026,
         combined_cat if has_combine and has_college else None),
    ]
    for preds, models, cat in targets:
        if cat is None:
            pred, model = '—', '—'
        else:
            pred, model = _format_prediction_2026(predict_draft(row_dict, category=cat))
        preds.append(pred)
        models.append(model)
lb_2026['prediction_combine'] = pred_combine_2026
lb_2026['model_combine'] = model_combine_2026
lb_2026['prediction_college_only'] = pred_college_only_2026
lb_2026['model_college_only'] = model_college_only_2026
lb_2026['prediction_college_combine'] = pred_college_combine_2026
lb_2026['model_college_combine'] = model_college_combine_2026

lb_2026

Unnamed: 0,Round,Pick,Player,Pos,School,Year,Height,Weight,40yd,Vertical,...,TOT_final_season,speed_score,explosive_score,agility_score,prediction_combine,model_combine,prediction_college_only,model_college_only,prediction_college_combine,model_college_combine
0,,,Arvell Reese,LB,Ohio State,2026,76.0,243.0,4.52,,...,69.0,116.434727,,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
1,,,Sonny Styles,LB,Ohio State,2026,76.0,243.0,4.48,,...,83.0,120.649135,,,Round 1,Logistic (combine) + Ordinal (combine),Round 3,Logistic (college) + Ordinal (college),Round 1,Logistic (college + combine) + Ordinal (colleg...
2,,,CJ Allen,LB,Georgia,2026,73.0,235.0,4.55,,...,88.0,109.661018,,,Round 1,Logistic (combine) + Ordinal (combine),Round 5,Logistic (college) + Ordinal (college),Round 2,Logistic (college + combine) + Ordinal (colleg...
3,,,Anthony Hill Jr,LB,Texas,2026,75.0,238.0,4.5,,...,70.0,116.079866,,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 2,Logistic (college + combine) + Ordinal (colleg...
4,,,Deontae Lawson,LB,Alabama,2026,74.0,228.0,4.6,,...,80.0,101.843547,,,Round 4,Logistic (combine) + Ordinal (combine),Round 4,Logistic (college) + Ordinal (college),Round 7,Logistic (college + combine) + Ordinal (colleg...
5,,,Josiah Trotter,LB,Missouri,2026,74.0,237.0,4.61,,...,84.0,104.948114,,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 2,Logistic (college + combine) + Ordinal (colleg...
6,,,Jake Golday,LB,Cincinnati,2026,76.0,240.0,4.55,,...,95.0,111.994231,,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 2,Logistic (college + combine) + Ordinal (colleg...
7,,,Taurean York,LB,Texas A&M,2026,70.0,227.0,4.58,,...,72.0,103.179624,,,Round 4,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 7,Logistic (college + combine) + Ordinal (colleg...
8,,,Jacob Rodriguez,LB,Texas Tech,2026,73.0,233.0,4.7,,...,122.0,95.498046,,,Round 4,Logistic (combine) + Ordinal (combine),Round 4,Logistic (college) + Ordinal (college),Round 4,Logistic (college + combine) + Ordinal (colleg...
9,,,Harold Perkins Jr,LB,LSU,2026,73.0,222.0,4.45,,...,55.0,113.225156,,,Round 1,Logistic (combine) + Ordinal (combine),Undrafted,Logistic (college),Round 7,Logistic (college + combine) + Ordinal (colleg...
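
The three prediction columns often disagree for the same player, so it may help to quantify that disagreement. The sketch below is not part of the original analysis: it copies the first three rows of the output above and converts each label to a number. `label_to_round`, `UNDRAFTED_ROUND`, and `round_spread` are hypothetical names, and scoring "Undrafted" as round 8 is an assumption made only so the spread is defined.

```python
import numpy as np
import pandas as pd

# First three rows of the output above, reduced to the prediction columns.
preds = pd.DataFrame({
    'Player': ['Arvell Reese', 'Sonny Styles', 'CJ Allen'],
    'prediction_combine': ['Round 1', 'Round 1', 'Round 1'],
    'prediction_college_only': ['Undrafted', 'Round 3', 'Round 5'],
    'prediction_college_combine': ['Round 1', 'Round 1', 'Round 2'],
})

UNDRAFTED_ROUND = 8  # assumption: treat 'Undrafted' as one step past round 7


def label_to_round(label):
    """Map 'Round N' -> N, 'Undrafted' -> UNDRAFTED_ROUND, anything else -> NaN."""
    if isinstance(label, str) and label.startswith('Round '):
        return float(label.split()[1])
    if label == 'Undrafted':
        return float(UNDRAFTED_ROUND)
    return np.nan  # placeholder labels (no prediction made) are ignored


pred_cols = [c for c in preds.columns if c.startswith('prediction_')]
rounds = preds[pred_cols].apply(lambda col: col.map(label_to_round))

# Spread between the most and least optimistic model for each player.
preds['round_spread'] = rounds.max(axis=1) - rounds.min(axis=1)
print(preds[['Player', 'round_spread']])
```

A large spread flags players whose projection depends heavily on which inputs the model sees, for example a strong combine profile paired with modest college production.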
