# 08. Advanced Feature Engineering

## Objective
Enhance the dataset with sport-specific and gender-specific features that describe how each athletics program earns and spends its money, beyond the grand totals.

## New Features to Derive
1.  **Sport-Specific Economics:**
    *   Football Revenue/Expense Share
    *   Basketball Revenue/Expense Share
2.  **Gender Breakdown:**
    *   Men's vs. Women's Revenue/Expense Share
    *   Male vs. Female Participation Share
3.  **Operational Metrics:**
    *   Total Participation (Scale of program)
    *   Revenue per Athlete

## Input
*   `../initial files/Output_10yrs_reported_schools_17220.csv`

## Output
*   `trajectory_ml_ready_advanced.csv`

In [1]:
import pandas as pd
import numpy as np

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 1. Load and Inspect Data

In [2]:
# Load the dataset
try:
    df = pd.read_csv('../initial files/Output_10yrs_reported_schools_17220.csv')
except FileNotFoundError:
    df = pd.read_csv('Output_10yrs_reported_schools_17220.csv')

# Rename standard columns
df = df.rename(columns={
    'Survey Year': 'Year',
    'Institution Name': 'Institution_Name',
    'State CD': 'State',
    'Classification Name': 'Classification_Name'
})

# Sort
df = df.sort_values(['UNITID', 'Year']).reset_index(drop=True)

print(f"Dataset Shape: {df.shape}")

Dataset Shape: (17220, 580)


In [3]:
# Identify relevant columns for Revenue/Expenses by Gender and Sport
rev_cols = [c for c in df.columns if 'Revenue' in c]
exp_cols = [c for c in df.columns if 'Expense' in c]

print("Key Revenue Columns:")
print([c for c in rev_cols if 'Total' in c or 'Football' in c or 'Basketball' in c][:10])

Key Revenue Columns:
['Archery Total Revenue', 'Badminton Total Revenue', 'Baseball Total Revenue', "Basketball Men's Team Revenue", "Basketball Women's Team Revenue", 'Basketball Coed Team Revenue', 'Basketball Total Revenue', 'Beach Volleyball Total Revenue', 'Bowling Total Revenue', 'All Track Combined Total Revenue']
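The listing above shows that this file spells at least one column as 'Basketball Total Revenue' (singular), so any later step that expects a different spelling would skip that column without raising an error. A minimal sketch of a name resolver, assuming the only variation is the singular/plural suffix; `resolve_column` and the candidate spellings are hypothetical helpers for illustration, not part of the notebook:

```python
def resolve_column(columns, *candidates):
    """Return the first candidate present in `columns`, or None if none match."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None

# Toy column list; the first name is taken from the inspection output above.
columns = ["Basketball Total Revenue", "Grand Total Revenue", "Grand Total Expenses"]

# Try the plural spelling first, then fall back to the singular one.
basketball_rev = resolve_column(
    columns, "Basketball Total Revenues", "Basketball Total Revenue"
)
# A name with no match returns None, which the caller can report explicitly.
missing = resolve_column(columns, "Some Column That Does Not Exist")

print(basketball_rev)
print(missing)
```

Returning `None` rather than silently skipping lets the caller print or raise on unresolved names before any feature is computed.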


## 2. Advanced Feature Engineering

In [4]:
# Helper: element-wise division that returns NaN (not inf) where the denominator is 0
def safe_div(a, b):
    return a / b.replace(0, np.nan)

# --- A. Gender Breakdown ---
# Caution: the names below assume a plural "Revenues" spelling, but the inspection
# output above lists 'Basketball Total Revenue' (singular). A name that does not
# match df.columns exactly is skipped without any error, so verify each one.

# Fill NaNs with 0 for calculations (a missing total is then treated as zero)
df['Grand Total Revenue'] = df['Grand Total Revenue'].fillna(0)
df['Grand Total Expenses'] = df['Grand Total Expenses'].fillna(0)

cols_to_check = [
    "Total Men's Team Revenues", "Total Women's Team Revenues",
    "Total Men's Team Expenses", "Total Women's Team Expenses",
    "Football Total Revenues", "Football Total Expenses",
    "Basketball Total Revenues", "Basketball Total Expenses"
]

# Fill NaNs with 0 in the candidate columns that actually exist
existing_cols = [c for c in cols_to_check if c in df.columns]
for c in existing_cols:
    df[c] = df[c].fillna(0)

# 1. Gender Revenue Share
if "Total Men's Team Revenues" in df.columns:
    df['Mens_Revenue_Share'] = safe_div(df["Total Men's Team Revenues"], df['Grand Total Revenue']).fillna(0)
if "Total Women's Team Revenues" in df.columns:
    df['Womens_Revenue_Share'] = safe_div(df["Total Women's Team Revenues"], df['Grand Total Revenue']).fillna(0)

# 2. Gender Expense Share
if "Total Men's Team Expenses" in df.columns:
    df['Mens_Expense_Share'] = safe_div(df["Total Men's Team Expenses"], df['Grand Total Expenses']).fillna(0)
if "Total Women's Team Expenses" in df.columns:
    df['Womens_Expense_Share'] = safe_div(df["Total Women's Team Expenses"], df['Grand Total Expenses']).fillna(0)

# --- B. Key Sports (Football & Basketball) ---

# 3. Football Dependency
# Both source columns are checked so a missing expense column cannot raise a KeyError.
# Note: the else branch writes all-zero features, which looks the same as
# "no school has football" if the column names simply do not match.
if "Football Total Revenues" in df.columns and "Football Total Expenses" in df.columns:
    df['Football_Revenue_Share'] = safe_div(df["Football Total Revenues"], df['Grand Total Revenue']).fillna(0)
    df['Football_Expense_Share'] = safe_div(df["Football Total Expenses"], df['Grand Total Expenses']).fillna(0)
    df['Has_Football'] = (df["Football Total Revenues"] > 0).astype(int)
else:
    df['Football_Revenue_Share'] = 0
    df['Football_Expense_Share'] = 0
    df['Has_Football'] = 0

# 4. Basketball Dependency (no fallback: the features are omitted if a column is missing)
if "Basketball Total Revenues" in df.columns and "Basketball Total Expenses" in df.columns:
    df['Basketball_Revenue_Share'] = safe_div(df["Basketball Total Revenues"], df['Grand Total Revenue']).fillna(0)
    df['Basketball_Expense_Share'] = safe_div(df["Basketball Total Expenses"], df['Grand Total Expenses']).fillna(0)

# --- C. Participation Metrics ---
df['Total_Athletes'] = df["Unduplicated Count Men's Participation"].fillna(0) + df["Unduplicated Count Women's Participation"].fillna(0)
df['Male_Athlete_Share'] = safe_div(df["Unduplicated Count Men's Participation"], df['Total_Athletes']).fillna(0)

# Revenue per Athlete (Efficiency metric)
df['Revenue_Per_Athlete'] = safe_div(df['Grand Total Revenue'], df['Total_Athletes']).fillna(0)
df['Expense_Per_Athlete'] = safe_div(df['Grand Total Expenses'], df['Total_Athletes']).fillna(0)

print("✅ Advanced features created")

✅ Advanced features created
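Every share above is computed with `safe_div(...)` and then passed through `.fillna(0)`. The sketch below, on made-up numbers, shows what that combination does: a zero denominator first becomes NaN, and the fill then makes it indistinguishable from a genuine zero share. The column names `sport_revenue` and `total_revenue` are placeholders, not EADA fields.

```python
import numpy as np
import pandas as pd

def safe_div(a, b):
    # Same helper as in the cell above: a zero denominator becomes NaN, not inf.
    return a / b.replace(0, np.nan)

# Three toy school-years: a normal case, a real zero share, and a zero total.
toy = pd.DataFrame({
    "sport_revenue": [25.0, 0.0, 10.0],
    "total_revenue": [100.0, 50.0, 0.0],
})

raw_share = safe_div(toy["sport_revenue"], toy["total_revenue"])
filled_share = raw_share.fillna(0)

print(raw_share.tolist())     # [0.25, 0.0, nan]
print(filled_share.tolist())  # [0.25, 0.0, 0.0]
```

If the distinction matters to the model, one option is to keep the unfilled share, or to add an indicator column marking the rows where the total was zero.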


## 3. Standard Feature Engineering (Trends & Targets)
Re-applying the logic from the previous notebook to ensure consistency.

In [5]:
grouped = df.groupby('UNITID')

# 1. Growth Rates
# A base-year value of 0 (including zero-filled NaNs) produces inf or NaN here.
df['Revenue_Growth_1yr'] = grouped['Grand Total Revenue'].pct_change()
df['Expense_Growth_1yr'] = grouped['Grand Total Expenses'].pct_change()
# 2-year CAGR: sqrt(value_t / value_{t-2}) - 1, computed within each institution
df['Revenue_CAGR_2yr'] = (df['Grand Total Revenue'] / grouped['Grand Total Revenue'].shift(2))**(1/2) - 1
df['Expense_CAGR_2yr'] = (df['Grand Total Expenses'] / grouped['Grand Total Expenses'].shift(2))**(1/2) - 1

# 2. Efficiency Ratio and Rolling Averages
# Zero expenses are replaced by 1 to avoid division by zero, so for those rows
# the ratio equals raw revenue.
df['Efficiency_Ratio'] = df['Grand Total Revenue'] / df['Grand Total Expenses'].replace(0, 1)
df['Revenue_Mean_2yr'] = grouped['Grand Total Revenue'].transform(lambda x: x.rolling(window=2).mean())
df['Expense_Mean_2yr'] = grouped['Grand Total Expenses'].transform(lambda x: x.rolling(window=2).mean())
df['Efficiency_Mean_2yr'] = grouped['Efficiency_Ratio'].transform(lambda x: x.rolling(window=2).mean())

# 3. Volatility
df['Revenue_Volatility_2yr'] = grouped['Grand Total Revenue'].transform(lambda x: x.rolling(window=2).std())
df['Expense_Volatility_2yr'] = grouped['Grand Total Expenses'].transform(lambda x: x.rolling(window=2).std())

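# 4. Division
# Check the longest name first: 'NCAA Division I' is a substring of both
# 'NCAA Division II' and 'NCAA Division III', so testing it first would
# label every NCAA school as D1.
def extract_division(class_name):
    if pd.isna(class_name):
        return 'Unknown'
    if 'NCAA Division III' in class_name:
        return 'D3'
    if 'NCAA Division II' in class_name:
        return 'D2'
    if 'NCAA Division I' in class_name:
        return 'D1'
    return 'Other'
df['Division'] = df['Classification_Name'].apply(extract_division)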

# 5. Target Generation (1-Year Lookahead)
# Growth from year t to year t+1 within each institution; NaN for the last year.
future_window = 1
df['Future_Revenue_Growth'] = grouped['Grand Total Revenue'].shift(-future_window) / df['Grand Total Revenue'] - 1
df['Future_Expense_Growth'] = grouped['Grand Total Expenses'].shift(-future_window) / df['Grand Total Expenses'] - 1

def define_trajectory(row):
    rev_growth = row['Future_Revenue_Growth']
    exp_growth = row['Future_Expense_Growth']
    if pd.isna(rev_growth) or pd.isna(exp_growth):
        return np.nan
    # Improving: revenue grows more than 3% and faster than expenses
    if (rev_growth > 0.03) and (exp_growth < rev_growth):
        return 'Improving'
    # Declining: revenue shrinks, or expenses outgrow revenue by more than 3 points
    elif (rev_growth < 0.00) or (exp_growth > rev_growth + 0.03):
        return 'Declining'
    else:
        return 'Stable'

df['Target_Trajectory'] = df.apply(define_trajectory, axis=1)
trajectory_map = {'Declining': 0, 'Stable': 1, 'Improving': 2}
df['Target_Label'] = df['Target_Trajectory'].map(trajectory_map)

print("✅ Standard features and targets recreated")

✅ Standard features and targets recreated
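To make the target rule above concrete, here is a small worked example that applies the same thresholds to plain floats. `label_trajectory` is a hypothetical stand-alone restatement of `define_trajectory` for illustration, and the growth pairs are invented. The last case shows what happens when a zero base year yields an infinite growth rate.

```python
import math

def label_trajectory(rev_growth, exp_growth):
    """Same thresholds as define_trajectory above, applied to plain floats."""
    if math.isnan(rev_growth) or math.isnan(exp_growth):
        return None
    if rev_growth > 0.03 and exp_growth < rev_growth:
        return "Improving"
    if rev_growth < 0.0 or exp_growth > rev_growth + 0.03:
        return "Declining"
    return "Stable"

# (revenue growth, expense growth) pairs chosen to hit each branch.
cases = {
    "revenue +5%, expenses +2%": (0.05, 0.02),
    "revenue +1%, expenses +2%": (0.01, 0.02),
    "revenue +1%, expenses +6%": (0.01, 0.06),
    "revenue -2%, expenses -5%": (-0.02, -0.05),
    "revenue grows from 0 (inf)": (math.inf, 0.02),
}

labels = {name: label_trajectory(r, e) for name, (r, e) in cases.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The fourth case is labeled Declining even though expenses fall faster than revenue, because any negative revenue growth triggers that branch. The fifth case is labeled Improving purely because infinity exceeds every threshold.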


## 4. Save Enhanced Dataset

In [6]:
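# Keep rows with two prior years of history and a known next-year label.
# Note: dropna removes NaN only; inf values from a zero base year pass through.
df_ml = df.dropna(subset=['Revenue_CAGR_2yr', 'Target_Label']).copy()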

# Select all relevant columns
feature_cols = [
    'UNITID', 'Institution_Name', 'Year', 'State', 'Division',
    'Grand Total Revenue', 'Grand Total Expenses', 'Total_Athletes',
    'Revenue_Growth_1yr', 'Expense_Growth_1yr', 'Revenue_CAGR_2yr', 'Expense_CAGR_2yr',
    'Revenue_Mean_2yr', 'Expense_Mean_2yr', 'Efficiency_Mean_2yr',
    'Revenue_Volatility_2yr', 'Expense_Volatility_2yr',
    # New Advanced Features
    'Mens_Revenue_Share', 'Womens_Revenue_Share',
    'Mens_Expense_Share', 'Womens_Expense_Share',
    'Football_Revenue_Share', 'Football_Expense_Share', 'Has_Football',
    'Basketball_Revenue_Share', 'Basketball_Expense_Share',
    'Male_Athlete_Share', 'Revenue_Per_Athlete', 'Expense_Per_Athlete',
    # Targets
    'Target_Trajectory', 'Target_Label'
]

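# Keep only the requested columns that exist in df_ml. feature_cols lists 31 names
# but the saved file has 27 columns (see the shape below), so four requested
# features were never created and are dropped here without a warning.
final_cols = [c for c in feature_cols if c in df_ml.columns]
df_final = df_ml[final_cols]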

print(f"Final Dataset Shape: {df_final.shape}")

output_path = 'trajectory_ml_ready_advanced.csv'
df_final.to_csv(output_path, index=False)
print(f"✅ Saved to {output_path}")

Final Dataset Shape: (12054, 27)
✅ Saved to trajectory_ml_ready_advanced.csv
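The filter in this section relies on `dropna`, which removes NaN but keeps infinite values. Because the totals are zero-filled earlier in the notebook, a school-year whose base value is 0 can yield an infinite growth rate that survives the filter. A minimal sketch on an invented two-school panel; the `revenue` column and the cleaning step are assumptions for illustration, not something the notebook does:

```python
import numpy as np
import pandas as pd

# Toy panel: school 2 reports zero revenue in its first year.
toy = pd.DataFrame({
    "UNITID": [1, 1, 1, 2, 2, 2],
    "Year": [2019, 2020, 2021, 2019, 2020, 2021],
    "revenue": [100.0, 110.0, 121.0, 0.0, 50.0, 55.0],
})

# 1-year growth within each school: value_t / value_{t-1} - 1
previous = toy.groupby("UNITID")["revenue"].shift(1)
toy["growth_1yr"] = toy["revenue"] / previous - 1

n_inf = int(np.isinf(toy["growth_1yr"]).sum())

# dropna alone keeps the infinite row.
after_dropna = toy.dropna(subset=["growth_1yr"])

# Converting inf to NaN first lets dropna remove it as well.
cleaned = toy.assign(
    growth_1yr=toy["growth_1yr"].replace([np.inf, -np.inf], np.nan)
).dropna(subset=["growth_1yr"])

print(n_inf, len(after_dropna), len(cleaned))
```

Whether such rows should be dropped, capped, or kept with a flag is a modeling choice; the sketch only shows how to detect them before the file is written.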
