## Combine Analysis Defensive Tackles 

Which Combine tests have the most potential influence on a player's ability to get drafted and on their draft position?

Our training dataset is combine data from 2010-2020, and our testing dataset covers 2021-2023.

In [2]:
import pandas as pd

# Path relative to notebook location (DE_similarity_scores_project/) - data is in project root
dt_data = pd.read_csv('../data/processed/dt_training_data.csv')
print(dt_data.columns)
# Convert Height from feet-inches to inches
dt_data['Height'] = dt_data['Height'].str.split('-').str[0].astype(int) * 12 + dt_data['Height'].str.split('-').str[1].astype(int)

# Examine every numeric column in the dataset and its correlation with the Drafted column
# (.copy() avoids modifying a view of dt_data)
dt_data_just_numeric = dt_data.select_dtypes(include=['number']).copy()
dt_data_just_numeric['Drafted'] = dt_data['Drafted']
print(dt_data_just_numeric.corr()['Drafted'].sort_values(ascending=False))


Index(['Year', 'Player', 'Pos', 'School', 'Height', 'Weight', '40yd',
       'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle', 'Drafted',
       'Round', 'Pick', 'Sacks_cumulative', 'TFL_cumulative',
       'QB_Hurry_cumulative', 'Sacks_final_season', 'TFL_final_season',
       'QB_Hurry_final_season'],
      dtype='object')
Drafted                  1.000000
Sacks_final_season       0.537543
TFL_final_season         0.471537
Sacks_cumulative         0.372802
TFL_cumulative           0.224549
QB_Hurry_cumulative      0.190996
QB_Hurry_final_season    0.183513
Bench                    0.167922
Broad Jump               0.158607
Vertical                 0.141317
Weight                   0.127717
Height                   0.063890
Year                     0.046518
Shuttle                 -0.103366
3Cone                   -0.109659
40yd                    -0.236880
Round                         NaN
Pick                          NaN
Name: Drafted, dtype: float64


For context: a positive correlation means a higher value goes with being drafted, and a negative correlation means a lower value does (for timed drills such as the 40yd, lower is better). With that said, it looks like our most impactful combine value on **being drafted** is

1. 40yd                    -0.236880

and our most impactful defensive stats on **being drafted** are 
1. Sacks_final_season       0.537543
2. TFL_final_season         0.471537
3. Sacks_cumulative         0.372802
4. TFL_cumulative           0.224549
5. QB_Hurry_cumulative      0.190996
6. QB_Hurry_final_season    0.183513

Anything with an absolute correlation well below 0.20 is likely too weak to be useful in a model.
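
As a rough illustration of that cutoff, the sketch below re-enters the correlations printed above by hand and keeps only those whose absolute value is at least 0.20. The cutoff is this notebook's rule of thumb, not a statistical test, and `corr_with_drafted` and `THRESHOLD` are names introduced here for illustration.

```python
import pandas as pd

# Correlations with Drafted, copied from the output above
# (Drafted itself and the NaN rows for Round/Pick are omitted).
corr_with_drafted = pd.Series({
    "Sacks_final_season": 0.537543,
    "TFL_final_season": 0.471537,
    "Sacks_cumulative": 0.372802,
    "TFL_cumulative": 0.224549,
    "QB_Hurry_cumulative": 0.190996,
    "QB_Hurry_final_season": 0.183513,
    "Bench": 0.167922,
    "Broad Jump": 0.158607,
    "Vertical": 0.141317,
    "Weight": 0.127717,
    "Height": 0.063890,
    "Year": 0.046518,
    "Shuttle": -0.103366,
    "3Cone": -0.109659,
    "40yd": -0.236880,
})

THRESHOLD = 0.20  # rule-of-thumb cutoff (assumption, taken from the text above)

# Keep features whose absolute correlation clears the cutoff.
strong = corr_with_drafted[corr_with_drafted.abs() >= THRESHOLD]
print(strong.sort_values(key=lambda s: s.abs(), ascending=False))
```

Under a strict 0.20 cutoff, both QB hurry columns and every combine test other than the 40yd drop out.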

We can see that the Combine is not a strong indicator of whether a player will **be drafted**, but the defensive stats a player accumulated in college do seem to correlate slightly more strongly with being drafted for defensive tackles than for their defensive line counterparts (edges).
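
One detail in the correlation output above: `Round` and `Pick` show `NaN` against `Drafted`. A likely explanation, assuming undrafted players have no `Round` or `Pick` recorded, is that pandas computes each pairwise correlation only over rows where both columns are present. On those rows `Drafted` is always 1, so it has no variance and the correlation is undefined. The toy data below is made up purely to show the effect.

```python
import numpy as np
import pandas as pd

# Made-up toy data: undrafted players (Drafted == 0) have no Pick recorded.
toy = pd.DataFrame({
    "Drafted": [1, 1, 1, 0, 0],
    "Pick": [12.0, 80.0, 150.0, np.nan, np.nan],
    "40yd": [4.95, 5.05, 5.10, 5.20, 5.25],
})

# DataFrame.corr() drops missing values pairwise. For (Drafted, Pick) only the
# three drafted rows remain, where Drafted is constant, so the result is NaN.
toy_corr = toy.corr()["Drafted"]
print(toy_corr)
```

So the `NaN` rows reflect how the data is recorded rather than a lack of relationship, which is why draft position is examined separately against `Pick`.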

In [3]:
# Examine every numeric column in the dataset and its correlation with the Pick column
# Lower Draft Position is better
dt_data_just_numeric = dt_data.select_dtypes(include=['number']).copy()
dt_data_just_numeric['Pick'] = dt_data['Pick']
print(dt_data_just_numeric.corr()['Pick'].sort_values(ascending=False))


Pick                     1.000000
Round                    0.988804
3Cone                    0.108831
Year                     0.101440
40yd                     0.099456
Shuttle                  0.003743
TFL_final_season        -0.014431
Vertical                -0.020725
Height                  -0.034635
Weight                  -0.039855
TFL_cumulative          -0.047021
Broad Jump              -0.057780
Bench                   -0.106148
Sacks_final_season      -0.212243
Sacks_cumulative        -0.302680
QB_Hurry_final_season   -0.346735
QB_Hurry_cumulative     -0.421125
Name: Pick, dtype: float64


With that said, it looks like our most impactful combine values on **Draft Position** are

None of them. It doesn't look like these tests played much of a part in determining **draft position** at all.

And it looks like our most impactful defensive values on **Draft Position** are 

1. QB_Hurry_cumulative     -0.421125
2. QB_Hurry_final_season   -0.346735
3. Sacks_cumulative        -0.302680
4. Sacks_final_season      -0.212243

This seems to indicate that while run stopping is an important skill and can get defensive tackles drafted, the most highly valued defensive tackles have pass-rush upside and can affect the QB.

## Looking to Model

When we look to create machine learning models, there are 3 tasks we would like to accomplish. The first two can use our current datasets of combine data and college data. The final one will require the first four seasons of our defensive tackles' stats in the NFL.

1. KNN Player Comps
2. Projected Draft Position/Round (Decision Tree)
3. Projected NFL Ability/Value (TBD)

## KNN Player Comps

We would like to use KNN (k-nearest neighbors) to determine which players are most similar to one another. This could help us give NFL player comparisons for upcoming draft picks.
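
As a minimal sketch of the idea, the toy example below (made-up weight and 40yd values, not rows from our dataset) standardizes two features and asks for the nearest neighbors of the first player. Because the query player is also in the fitted pool, the closest match is the player themselves at distance 0, which is why one extra neighbor is requested and the first hit is dropped. All `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Made-up measurements for four players: [Weight (lb), 40yd (s)]
toy_X = np.array([
    [300.0, 4.90],
    [305.0, 4.95],
    [330.0, 5.20],
    [335.0, 5.30],
])

# Standardize so pounds and seconds contribute on a comparable scale.
toy_scaled = StandardScaler().fit_transform(toy_X)

# Ask for 2 neighbors: the player themselves plus 1 true comparison.
toy_nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(toy_scaled)
toy_dist, toy_idx = toy_nn.kneighbors(toy_scaled[0:1])

print("Neighbor indices:", toy_idx[0])    # first entry is the query player
print("Distances:", toy_dist[0].round(3))
print("Closest comp (self dropped):", toy_idx[0][1])
```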

In [4]:
# KNN Player Similarity - Find the K most similar defensive tackles given an example

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from IPython.display import display

# Combine training + testing data for KNN (full 2010-2023 DT player pool)
dt_knn = pd.concat([
    pd.read_csv('../data/processed/dt_training_data.csv'),
    pd.read_csv('../data/processed/dt_testing_data.csv')
], ignore_index=True)
# Convert Height to inches (handle NaN / missing)
def to_inches(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).strip().split('-')
    if len(parts) != 2:
        return np.nan
    try:
        return int(parts[0]) * 12 + int(parts[1])
    except (ValueError, TypeError):
        return np.nan
dt_knn['Height'] = dt_knn['Height'].apply(to_inches)
dt_knn['Height'] = dt_knn['Height'].fillna(dt_knn['Height'].median())

# Features for similarity (reduced combine + college cumulative & final season)
FEATURE_COLS = ['Height', 'Weight', '40yd', 'Broad Jump',
                'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative',
                'Sacks_final_season', 'TFL_final_season', 'QB_Hurry_final_season']

# Prepare features: select and impute
X = dt_knn[FEATURE_COLS].copy()
X = X.fillna(X.median())  # Impute NaNs with column median

# Scale features (important for KNN distance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit KNN - we use k+1 because the nearest "neighbor" will be the player themselves
K = 5  # Number of similar players to return
nn = NearestNeighbors(n_neighbors=K + 1, metric='euclidean')
nn.fit(X_scaled)

def find_similar_players(player_or_profile, k=5):
    """
    Given a player name, row index, or dict of feature values, return the K most similar defensive tackles.
    
    For dict input, use keys from FEATURE_COLS. Height can be "6-6" or inches. Missing keys filled with median.
    """
    if isinstance(player_or_profile, dict):
        # Build feature vector from dict
        row = []
        for col in FEATURE_COLS:
            val = player_or_profile.get(col, np.nan)
            if val is None:
                val = np.nan
            if col == 'Height' and isinstance(val, str) and '-' in str(val):
                parts = str(val).split('-')
                val = int(parts[0]) * 12 + int(parts[1])
            row.append(val)
        n_missing = np.isnan(np.array(row, dtype=float)).sum()
        if n_missing > len(FEATURE_COLS) // 2:
            return f"Insufficient data: {n_missing}/{len(FEATURE_COLS)} features missing. Provide at least {len(FEATURE_COLS)//2 + 1} of: {', '.join(FEATURE_COLS)}"
        x = np.array(row, dtype=float).reshape(1, -1)
        x = np.where(np.isnan(x), X.median().values, x)
        x_scaled = scaler.transform(x)
        distances, indices = nn.kneighbors(x_scaled, n_neighbors=k)
        similar = dt_knn.iloc[indices[0]].copy()
        similar['Similarity_Distance'] = distances[0]
    elif isinstance(player_or_profile, str):
        mask = dt_knn['Player'] == player_or_profile
        if not mask.any():
            return f"Player '{player_or_profile}' not found in dataset."
        idx = dt_knn[mask].index[0]
        distances, indices = nn.kneighbors(X_scaled[idx:idx+1], n_neighbors=k + 1)
        similar = dt_knn.iloc[indices[0][1:]].copy()
        similar['Similarity_Distance'] = distances[0][1:]
    else:
        idx = player_or_profile
        distances, indices = nn.kneighbors(X_scaled[idx:idx+1], n_neighbors=k + 1)
        similar = dt_knn.iloc[indices[0][1:]].copy()
        similar['Similarity_Distance'] = distances[0][1:]
    
    return similar[['Player', 'Year', 'School', 'Round', 'Pick', 'Height', 'Weight', '40yd', 'Broad Jump',
                'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative',
                'Sacks_final_season', 'TFL_final_season', 'QB_Hurry_final_season', 'Similarity_Distance']]

# Example: use a dict of features to find comps (or pass a player name like "Jordan Davis")
# example_profile = {  # Peter Woods - DT, Clemson (2026 draft; 6-3 315)
#     'Height': '6-3', 'Weight': 315, '40yd': 4.82, 'Broad Jump': None,
#     'Sacks_cumulative': 5, 'TFL_cumulative': 14.5, 'QB_Hurry_cumulative': None,
#     'Sacks_final_season': 2, 'TFL_final_season': 3.5, 'QB_Hurry_final_season': 3
# }
example_profile = {  # Caleb Banks - DT, Florida (2026 draft; Louisville transfer)
    'Height': '6-6', 'Weight': 330, '40yd': 5.2, 'Broad Jump': None,
    'Sacks_cumulative': 6.5, 'TFL_cumulative': 10, 'QB_Hurry_cumulative': None,
    'Sacks_final_season': 4.5, 'TFL_final_season': 7, 'QB_Hurry_final_season': None
}
print(f"Top {K} defensive tackles most similar to this profile:\n")
result = find_similar_players(example_profile, k=K)
display(result)

Top 5 defensive tackles most similar to this profile:





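| | Player | Year | School | Round | Pick | Height | Weight | 40yd | Broad Jump | Sacks_cumulative | TFL_cumulative | QB_Hurry_cumulative | Sacks_final_season | TFL_final_season | QB_Hurry_final_season | Similarity_Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 147 | Jordan Phillips | 2015 | Oklahoma | 2.0 | 52.0 | 77.0 | 329.0 | 5.17 | 105.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.862329 |
| 51 | Michael Brockers | 2012 | LSU | 1.0 | 14.0 | 77.0 | 322.0 | 5.31 | 105.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2.018121 |
| 154 | Leterrius Walton | 2015 | Central Michigan | 6.0 | 199.0 | 77.0 | 319.0 | 5.25 | 103.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2.030132 |
| 143 | Ellis McCarthy | 2015 | UCLA | NaN | NaN | 77.0 | 338.0 | 5.21 | 109.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2.056719 |
| 82 | Quinton Dial | 2013 | Alabama | 5.0 | 157.0 | 77.0 | 318.0 | 5.29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.062761 |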


## Tree Round Projection

We can use a decision tree to group players by the round in which they will be selected, based on our most important variables. We fit the model on the training data and evaluate it on the testing data to see how well it does.

We use reduced combine (Height, Weight, 40yd) plus college cumulative and final-season stats: Sacks_cumulative, TFL_cumulative, QB_Hurry_cumulative, TFL_final_season, Sacks_final_season, QB_Hurry_final_season 

In [5]:
# Tree Round Projection - Train on 2010-2020, test on 2021-2023

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load and prepare data
train_df = pd.read_csv('../data/processed/dt_training_data.csv')
test_df = pd.read_csv('../data/processed/dt_testing_data.csv')

# Convert Height to inches (handle NaN / missing)
def to_inches_tree(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).strip().split('-')
    if len(parts) != 2:
        return np.nan
    try:
        return int(parts[0]) * 12 + int(parts[1])
    except (ValueError, TypeError):
        return np.nan
for df in [train_df, test_df]:
    df['Height'] = df['Height'].apply(to_inches_tree)

# Features for round prediction (reduced combine + college cumulative & final season)
ROUND_FEATURES = ['Height', 'Weight', '40yd',
                  'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative',
                  'TFL_final_season', 'Sacks_final_season', 'QB_Hurry_final_season']

# Map Round to draft day categories: Undrafted, Day1 (R1), Day2 (R2-3), Day3 (R4-7)
def round_to_day(r):
    if pd.isna(r) or r == 0:
        return 0  # Undrafted
    elif r == 1:
        return 1  # Day 1 - Round 1
    elif r in (2, 3):
        return 2  # Day 2 - Rounds 2-3
    else:
        return 3  # Day 3 - Rounds 4-7

DAY_LABELS = ['Undrafted', 'Day 1 (R1)', 'Day 2 (R2-3)', 'Day 3 (R4-7)']

# Use all rows - KNN imputation fills missing values
X_train = train_df[ROUND_FEATURES].copy()
X_test = test_df[ROUND_FEATURES].copy()
y_train = train_df['Round'].apply(round_to_day)
y_test = test_df['Round'].apply(round_to_day)

# KNN imputation
imputer = KNNImputer(n_neighbors=5)
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
print(f"Training: {len(X_train)} players | Testing: {len(X_test)} players (KNN imputation)\n")

# Scale features
scaler_round = StandardScaler()
X_train_scaled = scaler_round.fit_transform(X_train)
X_test_scaled = scaler_round.transform(X_test)

# Fit Decision Tree classifier
tree_round = DecisionTreeClassifier(random_state=42, max_depth=5)
tree_round.fit(X_train_scaled, y_train)

# Predict on test set
y_pred = tree_round.predict(X_test_scaled)

# Evaluate
print("Tree Round Projection - 4 classes: Undrafted, Day1, Day2, Day3")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2%}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=DAY_LABELS, zero_division=0))
print("Confusion Matrix (rows=actual, cols=predicted):")
print(pd.DataFrame(confusion_matrix(y_test, y_pred), index=DAY_LABELS, columns=DAY_LABELS))

Training: 231 players | Testing: 54 players (KNN imputation)

Tree Round Projection - 4 classes: Undrafted, Day1, Day2, Day3
Test Accuracy: 44.44%

Classification Report:
              precision    recall  f1-score   support

   Undrafted       0.61      0.52      0.56        21
  Day 1 (R1)       0.29      0.67      0.40         6
Day 2 (R2-3)       0.25      0.25      0.25         8
Day 3 (R4-7)       0.50      0.37      0.42        19

    accuracy                           0.44        54
   macro avg       0.41      0.45      0.41        54
weighted avg       0.48      0.44      0.45        54

Confusion Matrix (rows=actual, cols=predicted):
              Undrafted  Day 1 (R1)  Day 2 (R2-3)  Day 3 (R4-7)
Undrafted            11           3             2             5
Day 1 (R1)            0           4             2             0
Day 2 (R2-3)          1           3             2             2
Day 3 (R4-7)          6           4             2             7


In [6]:
import numpy as np

def predict_draft_position(features: dict) -> str:
    """
    Predict draft day for a DT given a dict of features (using the decision tree model).
    
    Keys: Height, Weight, 40yd, Sacks_cumulative, TFL_cumulative, QB_Hurry_cumulative,
          TFL_final_season, Sacks_final_season, QB_Hurry_final_season
    Height can be "6-3" or inches. Missing keys filled via KNN imputation.
    
    Returns: 'Undrafted', 'Day 1 (R1)', 'Day 2 (R2-3)', or 'Day 3 (R4-7)'
    """
    row = []
    for col in ROUND_FEATURES:
        val = features.get(col, np.nan)
        if val is None:
            val = np.nan
        if col == 'Height' and isinstance(val, str) and '-' in str(val):
            parts = str(val).split('-')
            val = int(parts[0]) * 12 + int(parts[1])
        row.append(val)
    
    n_missing = np.isnan(np.array(row, dtype=float)).sum()
    if n_missing > len(ROUND_FEATURES) // 2:
        return f"Insufficient data: {n_missing}/{len(ROUND_FEATURES)} features missing. Provide at least {len(ROUND_FEATURES)//2 + 1} of: {', '.join(ROUND_FEATURES)}"
    
    x = np.array(row, dtype=float).reshape(1, -1)
    x = imputer.transform(x)
    x_scaled = scaler_round.transform(x)
    pred = tree_round.predict(x_scaled)[0]
    return DAY_LABELS[pred]

example = {  # Peter Woods - DT, Clemson
    'Height': '6-3', 'Weight': 315, '40yd': 4.82,
    'Sacks_cumulative': 5, 'TFL_cumulative': 14.5, 'QB_Hurry_cumulative': None,
    'Sacks_final_season': 2, 'TFL_final_season': 3.5, 'QB_Hurry_final_season': 3
}
# example = {  # Caleb Banks - DT, Florida
#     'Height': '6-6', 'Weight': 330, '40yd': 5.2,
#     'Sacks_cumulative': 6.5, 'TFL_cumulative': 10, 'QB_Hurry_cumulative': 10,
#     'Sacks_final_season': 4.5, 'TFL_final_season': 7, 'QB_Hurry_final_season': 6
# }
print(f"Predicted draft position: {predict_draft_position(example)}")

Predicted draft position: Day 1 (R1)




## Average combine & college stats by draft outcome

Combined 2010–2023 data, same features as the round model: Height, Weight, 40yd, and college cumulative/final-season stats, broken out by **Round 1 (Day 1)**, **Day 2 (R2–3)**, **Day 3 (R4–7)**, and **Undrafted**.

**What the data shows:**

- **Among drafted (Day 1–3):** Height (~75 in) and Weight (~309 lb) are nearly identical across rounds. College production (sacks, TFL, QB hurries) is also very similar—Day 1 guys don’t stand out on stats. The clear separator is **40yd dash**: R1 averages **4.99**, Day 2 **5.08**, Day 3 **5.09**. So within “drafted” DTs, round is driven more by speed than by production or size.

- **Drafted vs Undrafted:** Undrafted DTs are slightly lighter (305 lb vs ~309), slower (40yd **5.19** vs 4.99–5.09), and have **much lower** college production—e.g. ~4.4 cumulative sacks and ~4.0 final-season TFL vs 6–8 for drafted. So getting drafted at all tracks with both better 40yd and better production; *where* you go (R1 vs Day 2 vs Day 3) tracks mainly with 40yd.

In [7]:
# Combined data and draft-day buckets (same as Body type section)
avg_df = pd.concat([
    pd.read_csv('../data/processed/dt_training_data.csv'),
    pd.read_csv('../data/processed/dt_testing_data.csv')
], ignore_index=True)

def to_inches_avg(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).strip().split('-')
    if len(parts) != 2:
        return np.nan
    try:
        return int(parts[0]) * 12 + int(parts[1])
    except (ValueError, TypeError):
        return np.nan

avg_df['Height'] = avg_df['Height'].apply(to_inches_avg)

def round_to_day_avg(r):
    if pd.isna(r) or r == 0:
        return 'Undrafted'
    elif r == 1:
        return 'Day 1 (R1)'
    elif r in (2, 3):
        return 'Day 2 (R2-3)'
    else:
        return 'Day 3 (R4-7)'

avg_df['Draft_Day'] = avg_df['Round'].apply(round_to_day_avg)

# Same feature set as the round model
FEATURES_AVG = ['Height', 'Weight', '40yd',
                'Sacks_cumulative', 'TFL_cumulative', 'QB_Hurry_cumulative',
                'TFL_final_season', 'Sacks_final_season', 'QB_Hurry_final_season']

# Order for display
day_order = ['Day 1 (R1)', 'Day 2 (R2-3)', 'Day 3 (R4-7)', 'Undrafted']
means = avg_df.groupby('Draft_Day')[FEATURES_AVG].mean().reindex(day_order)
counts = avg_df.groupby('Draft_Day').size().reindex(day_order)

print("Sample size per group (n):")
print(counts.to_string())
print()
print("Average values by draft outcome:")
display(means.round(2))
print()
print("40yd by draft outcome — main separator between rounds (seconds):")
print(means['40yd'].round(3).to_string())

Sample size per group (n):
Draft_Day
Day 1 (R1)      35
Day 2 (R2-3)    73
Day 3 (R4-7)    95
Undrafted       82

Average values by draft outcome:


| Draft_Day | Height | Weight | 40yd | Sacks_cumulative | TFL_cumulative | QB_Hurry_cumulative | TFL_final_season | Sacks_final_season | QB_Hurry_final_season |
|---|---|---|---|---|---|---|---|---|---|
| Day 1 (R1) | 75.17 | 309.43 | 4.99 | 7.56 | 14.89 | 6.67 | 7.61 | 4.17 | 3.89 |
| Day 2 (R2-3) | 75.16 | 309.07 | 5.08 | 6.62 | 13.06 | 7.94 | 7.76 | 4.26 | 4.59 |
| Day 3 (R4-7) | 74.98 | 309.03 | 5.09 | 6.3 | 14.47 | 4.93 | 7.85 | 3.85 | 2.57 |
| Undrafted | 74.86 | 304.8 | 5.19 | 4.44 | 10.96 | 4.42 | 3.96 | 1.5 | 1.85 |



40yd by draft outcome — main separator between rounds (seconds):
Draft_Day
Day 1 (R1)      4.995
Day 2 (R2-3)    5.076
Day 3 (R4-7)    5.091
Undrafted       5.189


### Does athleticism matter more than production—or is there a flaw in the averages?

Two ways to read the same numbers:

1. **Combine / athleticism matters more:** Teams might value 40yd (and perceived upside) over college production when drafting DTs, which would explain why Day 1–3 guys look similar on stats but different on 40yd.

2. **Confounding by size:** Heavier DTs tend to run slower 40s, so a raw 40yd gap between rounds might partly reflect **body type** rather than pure "athleticism matters more than production." The mean weights in the table above are nearly identical across the drafted groups (about 309 lb, with Day 1 marginally the heaviest), so the averages alone do not show lighter players being drafted earlier, but similar means can hide different weight distributions. If we don't control for weight, we risk mixing "bigger vs smaller" with "earlier vs later draft."

To check whether 40yd matters *within* size, below we split DTs into **weight bands** and look at mean 40yd by draft day in each band. If Day 1 is still faster than Day 2–3 *within* the same weight band, then 40yd is adding signal beyond size. If the gap disappears within bands, the earlier pattern was partly confounding.

In [8]:
# Weight bands (lbs) so we compare similar body types
weight_bands = [(280, 300), (300, 320), (320, 340), (340, 400)]
day_order_40 = ['Day 1 (R1)', 'Day 2 (R2-3)', 'Day 3 (R4-7)', 'Undrafted']

# Within each weight band: mean 40yd by draft day (Day 1, Day 2, Day 3, Undrafted)
results = []
for lo, hi in weight_bands:
    mask = (avg_df['Weight'] >= lo) & (avg_df['Weight'] < hi)
    band = avg_df.loc[mask]
    if len(band) < 5:
        continue
    means_40 = band.groupby('Draft_Day')['40yd'].mean().reindex(day_order_40)
    means_40 = means_40.round(3)
    means_40.name = f'{lo}-{hi} lb (n={len(band)})'
    results.append(means_40)

if results:
    within_band = pd.DataFrame(results)
    print("Mean 40yd by draft day, within weight band:")
    display(within_band)
    print("If Day 1 (R1) is consistently faster within each row, 40yd adds signal beyond size.")
else:
    print("Not enough players in weight bands for this breakdown.")

Mean 40yd by draft day, within weight band:


| Draft_Day | Day 1 (R1) | Day 2 (R2-3) | Day 3 (R4-7) | Undrafted |
|---|---|---|---|---|
| 280-300 lb (n=83) | 4.899 | 4.934 | 4.966 | 5.107 |
| 300-320 lb (n=141) | 4.993 | 5.073 | 5.085 | 5.241 |
| 320-340 lb (n=43) | 5.312 | 5.211 | 5.276 | 5.2 |
| 340-400 lb (n=13) | 4.923 | 5.37 | 5.293 | 5.467 |


If Day 1 (R1) is consistently faster within each row, 40yd adds signal beyond size.
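
To apply that check to the printed table, the sketch below re-enters the within-band means by hand and tests whether Day 1 is faster than both Day 2 and Day 3 in each band. It is a reading aid under the assumption that the means were copied correctly; `band_means` and `day1_faster` are names introduced here, and the check ignores how many players sit behind each cell.

```python
import pandas as pd

# Mean 40yd by draft day within each weight band, copied from the output above.
band_means = pd.DataFrame(
    {
        "Day 1 (R1)": [4.899, 4.993, 5.312, 4.923],
        "Day 2 (R2-3)": [4.934, 5.073, 5.211, 5.370],
        "Day 3 (R4-7)": [4.966, 5.085, 5.276, 5.293],
    },
    index=["280-300 lb", "300-320 lb", "320-340 lb", "340-400 lb"],
)

# Fastest later-round mean in each band, and Day 1's gap to it (negative = Day 1 faster).
best_later = band_means[["Day 2 (R2-3)", "Day 3 (R4-7)"]].min(axis=1)
gap = (band_means["Day 1 (R1)"] - best_later).round(3)
day1_faster = gap < 0

print(pd.DataFrame({"gap_seconds": gap, "day1_faster": day1_faster}))
print(f"Day 1 fastest among drafted in {int(day1_faster.sum())} of {len(day1_faster)} bands")
```

On the printed means, Day 1 is the fastest drafted group in three of the four bands, with the 320-340 lb band as the exception. The two heaviest bands are small (n=43 and n=13 in total), so their per-cell means may rest on very few players.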


## Body type and draft

In our data, **tall** DTs (≥6-6, any weight) tend to get drafted *higher* than average, although the group is small (n=10), while **heavy** DTs (≥330 lbs, any height) land only marginally above the baseline. The **combo** (6-6 and 330+ lbs) has only 2 players in training and 3 in the combined set, which is too few to draw conclusions; the "bad outcome" narrative for that combo is likely a training-data quirk, not a real signal. Below we show draft outcomes for: (1) Height ≥ 6-6 (any weight), (2) Weight ≥ 330 (any height), (3) Both 6-6 and 330+, and (4) All DTs for baseline.

In [9]:
# Combined training + testing data (2010-2023) for body-type analysis
dt_combined = pd.concat([
    pd.read_csv('../data/processed/dt_training_data.csv'),
    pd.read_csv('../data/processed/dt_testing_data.csv')
], ignore_index=True)

def to_inches_ht(h):
    if pd.isna(h) or not isinstance(h, str) or '-' not in str(h):
        return np.nan
    parts = str(h).strip().split('-')
    if len(parts) != 2:
        return np.nan
    try:
        return int(parts[0]) * 12 + int(parts[1])
    except (ValueError, TypeError):
        return np.nan

dt_combined['Height'] = dt_combined['Height'].apply(to_inches_ht)

def draft_summary(df, label):
    n = len(df)
    drafted = df['Drafted'] == 1
    n_drafted = drafted.sum()
    r1 = ((df['Round'] == 1) & drafted).sum()
    r2_3 = ((df['Round'].isin([2, 3])) & drafted).sum()
    r4_7 = ((df['Round'] >= 4) & (df['Round'] <= 7) & drafted).sum()
    undrafted = n - n_drafted
    print(f"--- {label} (n={n}) ---")
    print(f"Drafted: {n_drafted} ({100*n_drafted/n:.1f}%)  |  Undrafted: {undrafted} ({100*undrafted/n:.1f}%)")
    if n_drafted > 0:
        print(f"Round 1: {r1}  |  Rounds 2-3: {r2_3}  |  Rounds 4-7: {r4_7}")
        print(f"Mean Round (drafted): {df.loc[drafted, 'Round'].mean():.2f}  |  Mean Pick (drafted): {df.loc[drafted, 'Pick'].mean():.1f}")
    print()

# (1) Height >= 6-6 (78 in), any weight; (2) Weight >= 330, any height; (3) Both (small n)
tall = dt_combined['Height'] >= 78
heavy = dt_combined['Weight'] >= 330
combo = tall & heavy

print("Baseline (all DTs):")
draft_summary(dt_combined, "All DTs")
print("Height ≥ 6-6 (any weight):")
draft_summary(dt_combined.loc[tall], "Height ≥ 6-6")
print("Weight ≥ 330 lbs (any height):")
draft_summary(dt_combined.loc[heavy], "Weight ≥ 330")
print("Both 6-6 and 330+ (small n—interpret with caution):")
draft_summary(dt_combined.loc[combo], "6-6 and 330+ lb")

print("6-6 and 330+ lb players in dataset:")
display(dt_combined.loc[combo, ['Player', 'Year', 'School', 'Height', 'Weight', 'Round', 'Pick']].sort_values(['Year', 'Pick']))

Baseline (all DTs):
--- All DTs (n=285) ---
Drafted: 203 (71.2%)  |  Undrafted: 82 (28.8%)
Round 1: 35  |  Rounds 2-3: 73  |  Rounds 4-7: 95
Mean Round (drafted): 3.64  |  Mean Pick (drafted): 109.1

Height ≥ 6-6 (any weight):
--- Height ≥ 6-6 (n=10) ---
Drafted: 8 (80.0%)  |  Undrafted: 2 (20.0%)
Round 1: 2  |  Rounds 2-3: 4  |  Rounds 4-7: 2
Mean Round (drafted): 2.50  |  Mean Pick (drafted): 66.9

Weight ≥ 330 lbs (any height):
--- Weight ≥ 330 (n=29) ---
Drafted: 23 (79.3%)  |  Undrafted: 6 (20.7%)
Round 1: 5  |  Rounds 2-3: 7  |  Rounds 4-7: 11
Mean Round (drafted): 3.57  |  Mean Pick (drafted): 106.8

Both 6-6 and 330+ (small n—interpret with caution):
--- 6-6 and 330+ lb (n=3) ---
Drafted: 2 (66.7%)  |  Undrafted: 1 (33.3%)
Round 1: 1  |  Rounds 2-3: 0  |  Rounds 4-7: 1
Mean Round (drafted): 3.50  |  Mean Pick (drafted): 114.0

6-6 and 330+ lb players in dataset:


| | Player | Year | School | Height | Weight | Round | Pick |
|---|---|---|---|---|---|---|---|
| 79 | T.J. Barnes | 2013 | Georgia Tech | 78.0 | 369.0 | NaN | NaN |
| 119 | Daniel McCullers | 2014 | Tennessee | 79.0 | 352.0 | 6.0 | 215.0 |
| 247 | Jordan Davis | 2022 | Georgia | 78.0 | 341.0 | 1.0 | 13.0 |
