# LSTM Data Preparation

## Objective
Prepare the NCAA financial data for time-series deep learning with an LSTM. We build sequences of yearly financial features for schools that reported in all 10 survey years (2014-2023).

## Strategy
1.  **Load Raw Data**: Use the full 10-year dataset.
2.  **Filter Schools**: Keep only schools with 10 consecutive years of data.
3.  **Feature Engineering**: Calculate key financial metrics per year.
4.  **Sequence Creation**: Create sliding windows (e.g., use years 1-5 to predict year 6).
5.  **Save**: Export as numpy arrays for LSTM training.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("Libraries Loaded.")

Libraries Loaded.


## 1. Load Raw Data

In [2]:
# Load the raw 10-year dataset
df = pd.read_csv('../initial files/Output_10yrs_reported_schools_17220.csv')

print(f"Raw Data Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()[:20]}...")  # Show first 20 columns
print(f"\nYear Range: {df['Survey Year'].min()} - {df['Survey Year'].max()}")

Raw Data Shape: (17220, 580)

Columns: ['Survey Year', 'UNITID', 'OPE ID', 'Institution Name', 'State CD', 'Classification Name', 'Classification Other', 'Sanction Code', 'Sanction Name', 'Male Undergraduates', 'Female Undergraduates', 'Total Undergraduates', "Archery Men's Team Participation", "Archery Women's Team Participation", 'Archery Coed Team Men Participation', 'Archery Coed Team Women Participation', 'Archery Total Participation', "Badminton Women's Team Participation", 'Badminton Total Participation', "Baseball Men's Team Participation"]...

Year Range: 2014 - 2023


## 2. Select Key Financial Columns

In [3]:
# Find the revenue and expense columns
revenue_cols = [c for c in df.columns if 'Grand Total Revenue' in c]
expense_cols = [c for c in df.columns if 'Grand Total Expenses' in c]

print(f"Revenue columns: {revenue_cols}")
print(f"Expense columns: {expense_cols}")

# Select core columns
core_cols = ['Survey Year', 'UNITID', 'Institution Name', 'State CD', 'Classification Name']
core_cols.extend(revenue_cols)
core_cols.extend(expense_cols)

# Add undergraduate enrollment counts (male, female, total); these are not athlete counts
enrollment_cols = [c for c in df.columns if 'Undergraduates' in c]
core_cols.extend(enrollment_cols)

df_clean = df[core_cols].copy()
df_clean = df_clean.rename(columns={'Survey Year': 'Year'})

print(f"\nCleaned Data Shape: {df_clean.shape}")

Revenue columns: ['Grand Total Revenue']
Expense columns: ['Grand Total Expenses']

Cleaned Data Shape: (17220, 10)


## 3. Filter Schools with 10 Years of Data

In [4]:
# Count years per school
school_years = df_clean.groupby('UNITID')['Year'].nunique()
schools_with_10_years = school_years[school_years == 10].index

print(f"Schools with exactly 10 years of data: {len(schools_with_10_years)}")

# Filter to only these schools
df_lstm = df_clean[df_clean['UNITID'].isin(schools_with_10_years)].copy()
df_lstm = df_lstm.sort_values(['UNITID', 'Year'])

print(f"LSTM Dataset Shape: {df_lstm.shape}")
print(f"Total Schools: {df_lstm['UNITID'].nunique()}")

Schools with exactly 10 years of data: 1722
LSTM Dataset Shape: (17220, 10)
Total Schools: 1722

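The filter above counts distinct years per school. Because this file spans exactly ten survey years (2014-2023), ten distinct years also means ten consecutive years. On a wider file that would no longer hold, so the following is a minimal sketch of an explicit gap check. The toy data and the `N_YEARS` value are assumptions for illustration only.

```python
import pandas as pd

# Toy data: school 1 reports 2014-2017 without gaps; school 2 skips 2016.
toy = pd.DataFrame({
    "UNITID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Year":   [2014, 2015, 2016, 2017, 2014, 2015, 2017, 2018],
})

N_YEARS = 4  # hypothetical length for the toy data; the notebook uses 10

stats = toy.groupby("UNITID")["Year"].agg(["nunique", "min", "max"])

# Consecutive means: N distinct years AND those years span exactly N calendar years.
is_consecutive = (stats["nunique"] == N_YEARS) & (
    stats["max"] - stats["min"] + 1 == N_YEARS
)
consecutive_ids = stats.index[is_consecutive].tolist()
print(consecutive_ids)
```

Both toy schools have four distinct years, but only school 1 passes the span check.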

## 4. Feature Engineering

In [5]:
# Get the actual column names for revenue and expenses
rev_col = [c for c in df_lstm.columns if 'Grand Total Revenue' in c][0]
exp_col = [c for c in df_lstm.columns if 'Grand Total Expenses' in c][0]

# Calculate basic features
df_lstm['Revenue'] = pd.to_numeric(df_lstm[rev_col], errors='coerce')
df_lstm['Expenses'] = pd.to_numeric(df_lstm[exp_col], errors='coerce')
df_lstm['Efficiency_Ratio'] = df_lstm['Revenue'] / df_lstm['Expenses']
df_lstm['Net_Income'] = df_lstm['Revenue'] - df_lstm['Expenses']

# Calculate year-over-year growth
df_lstm['Revenue_Growth'] = df_lstm.groupby('UNITID')['Revenue'].pct_change()
df_lstm['Expense_Growth'] = df_lstm.groupby('UNITID')['Expenses'].pct_change()

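# Fill NaN values with 0 (first-year growth per school, unparseable revenue/expenses)
df_lstm = df_lstm.fillna(0)

# Replace inf values with 0 (division by zero in the ratio or growth columns)
df_lstm = df_lstm.replace([np.inf, -np.inf], 0)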

print("Features Engineered.")
print(f"\nFeature columns: {['Revenue', 'Expenses', 'Efficiency_Ratio', 'Net_Income', 'Revenue_Growth', 'Expense_Growth']}")

Features Engineered.

Feature columns: ['Revenue', 'Expenses', 'Efficiency_Ratio', 'Net_Income', 'Revenue_Growth', 'Expense_Growth']

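To see where the NaN and inf values in the growth columns come from, here is a small sketch on made-up numbers. The first year of every school has no previous year, so `pct_change` returns NaN, and a zero in the previous year produces inf. The toy frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: school 2 reports zero revenue in its first year.
toy = pd.DataFrame({
    "UNITID":  [1, 1, 1, 2, 2, 2],
    "Revenue": [100.0, 110.0, 121.0, 0.0, 50.0, 75.0],
})

# Growth is computed within each school, so the first row per school is NaN.
growth = toy.groupby("UNITID")["Revenue"].pct_change()

# Same end result as fillna(0) followed by replace(inf, 0) in the cell above.
cleaned = growth.replace([np.inf, -np.inf], np.nan).fillna(0)

print(pd.DataFrame({"raw": growth, "cleaned": cleaned}))
```

After cleaning, a first-year row and a division-by-zero row both read as 0% growth, which makes them indistinguishable from a school whose revenue was truly flat.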

## 5. Create Target Variable (Future Trajectory)

In [6]:
# Calculate future growth (1 year ahead)
df_lstm['Future_Rev_Growth'] = df_lstm.groupby('UNITID')['Revenue_Growth'].shift(-1)
df_lstm['Future_Exp_Growth'] = df_lstm.groupby('UNITID')['Expense_Growth'].shift(-1)

# Define trajectory based on future growth
def classify_trajectory(row):
    if pd.isna(row['Future_Rev_Growth']):
        return np.nan
    
    # Improving: Revenue growing >3% AND expenses growing slower than revenue
    if row['Future_Rev_Growth'] > 0.03 and row['Future_Exp_Growth'] < row['Future_Rev_Growth']:
        return 2  # Improving
    # Declining: Revenue shrinking OR expense growth exceeding revenue growth by more than 3 percentage points
    elif row['Future_Rev_Growth'] < 0.0 or row['Future_Exp_Growth'] > (row['Future_Rev_Growth'] + 0.03):
        return 0  # Declining
    else:
        return 1  # Stable

df_lstm['Target'] = df_lstm.apply(classify_trajectory, axis=1)

# Remove rows without target (last year for each school)
df_lstm = df_lstm.dropna(subset=['Target'])

print(f"\nTarget Distribution:")
print(df_lstm['Target'].value_counts())
print(f"\n0=Declining, 1=Stable, 2=Improving")


Target Distribution:
Target
1.0    7513
0.0    5809
2.0    2176
Name: count, dtype: int64

0=Declining, 1=Stable, 2=Improving

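The row-wise `apply` above can be slow on large frames. As a sketch of an alternative, the same three rules can be written with `np.select`, which evaluates the conditions in order and takes the first match. The toy values are assumptions chosen to hit each branch.

```python
import numpy as np
import pandas as pd

# One toy row per branch: improving, declining (revenue falls), stable,
# declining (expenses outpace revenue), and a missing future value.
toy = pd.DataFrame({
    "Future_Rev_Growth": [0.05, -0.02, 0.01, 0.01, np.nan],
    "Future_Exp_Growth": [0.02,  0.00, 0.02, 0.06, 0.01],
})
rev = toy["Future_Rev_Growth"]
exp = toy["Future_Exp_Growth"]

conditions = [
    rev.isna(),                        # no future year available
    (rev > 0.03) & (exp < rev),        # improving
    (rev < 0.0) | (exp > rev + 0.03),  # declining
]
choices = [np.nan, 2.0, 0.0]

# Anything that matches no condition is labeled stable (1).
toy["Target"] = np.select(conditions, choices, default=1.0)
print(toy)
```

The order of `conditions` mirrors the `if`/`elif` chain, so a row that satisfies both the improving and declining tests is still labeled improving.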

## 6. Create Sequences for LSTM

In [7]:
# Define sequence parameters
SEQUENCE_LENGTH = 5  # Use 5 years of history

# Select features for the model
feature_cols = ['Revenue', 'Expenses', 'Efficiency_Ratio', 'Net_Income', 'Revenue_Growth', 'Expense_Growth']

# Normalize features (LSTM is sensitive to scale).
# Note: the scaler is fit on all rows, so any later train/validation split shares its statistics.
scaler = StandardScaler()
df_lstm[feature_cols] = scaler.fit_transform(df_lstm[feature_cols])

# Create sequences
X_sequences = []
y_sequences = []
school_ids = []

for school_id in df_lstm['UNITID'].unique():
    school_data = df_lstm[df_lstm['UNITID'] == school_id]
    
    # Need at least SEQUENCE_LENGTH years
    if len(school_data) < SEQUENCE_LENGTH:
        continue
    
    # Extract features and targets
    features = school_data[feature_cols].values
    targets = school_data['Target'].values
    
    # Create sliding windows
    for i in range(len(school_data) - SEQUENCE_LENGTH):
        X_sequences.append(features[i:i+SEQUENCE_LENGTH])
        y_sequences.append(targets[i+SEQUENCE_LENGTH])  # Target stored on the row right after the window
        school_ids.append(school_id)

# Convert to numpy arrays
X = np.array(X_sequences)
y = np.array(y_sequences)

print(f"\nSequence Shape: {X.shape}")
print(f"  - Samples: {X.shape[0]}")
print(f"  - Time Steps: {X.shape[1]}")
print(f"  - Features: {X.shape[2]}")
print(f"\nTarget Shape: {y.shape}")
print(f"\nTarget Distribution in Sequences:")
unique, counts = np.unique(y, return_counts=True)
for label, count in zip(unique, counts):
    print(f"  Class {int(label)}: {count} ({count/len(y)*100:.1f}%)")


Sequence Shape: (6888, 5, 6)
  - Samples: 6888
  - Time Steps: 5
  - Features: 6

Target Shape: (6888,)

Target Distribution in Sequences:
  Class 0: 2594 (37.7%)
  Class 1: 3322 (48.2%)
  Class 2: 972 (14.1%)

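To make the window-to-label alignment concrete, the sketch below replays the loop's indexing on calendar years instead of feature rows. It assumes each school has nine rows left (2014-2022) after the final year was dropped in step 5, which is consistent with 1722 schools x 4 windows = 6888 sequences.

```python
import numpy as np

SEQUENCE_LENGTH = 5
# Nine rows per school remain once the last survey year has no target.
years = np.arange(2014, 2023)  # 2014 .. 2022

windows = []
for i in range(len(years) - SEQUENCE_LENGTH):
    input_years = years[i:i + SEQUENCE_LENGTH]
    label_row_year = years[i + SEQUENCE_LENGTH]
    windows.append((int(input_years[0]), int(input_years[-1]), int(label_row_year)))
    print(f"inputs {input_years[0]}-{input_years[-1]} -> Target from row {label_row_year}")

print(f"windows per school: {len(windows)}")
```

Note that the `Target` on a given row is built from the following year's growth, so the label taken from row 2019 describes the 2019-to-2020 change. If the intent is to predict the year immediately after the window, the label index would need to be `i + SEQUENCE_LENGTH - 1`; that is a design decision for the author and would change the sample counts shown above.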

## 7. Save Data for LSTM Training

In [8]:
# Save the sequences
np.savez('lstm_data.npz', 
         X=X, 
         y=y, 
         feature_names=feature_cols,
         school_ids=school_ids)

print("\nData saved to 'lstm_data.npz'")
print(f"Ready for LSTM training with {X.shape[0]} sequences!")


Data saved to 'lstm_data.npz'
Ready for LSTM training with 6888 sequences!

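For the training notebook, the archive can be read back with `np.load`. The sketch below round-trips stand-in arrays with the same layout through a temporary file; the array values and the `lstm_data_demo.npz` name are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays with the same layout as the notebook's output (values are made up).
X = np.zeros((8, 5, 6), dtype=np.float64)
y = np.array([0, 1, 2, 1, 0, 1, 1, 2], dtype=np.float64)
feature_cols = ['Revenue', 'Expenses', 'Efficiency_Ratio',
                'Net_Income', 'Revenue_Growth', 'Expense_Growth']
school_ids = [100, 100, 100, 100, 200, 200, 200, 200]

path = os.path.join(tempfile.mkdtemp(), "lstm_data_demo.npz")
np.savez(path, X=X, y=y, feature_names=feature_cols, school_ids=school_ids)

# Lists are stored as plain arrays, so no pickling is needed to read them back.
with np.load(path) as data:
    X_loaded = data["X"]
    y_loaded = data["y"].astype(np.int64)  # labels were saved as floats (0.0, 1.0, 2.0)
    feature_names = data["feature_names"].tolist()
    groups = data["school_ids"]

print(X_loaded.shape, y_loaded.dtype, feature_names, groups[:2])
```

Keeping `school_ids` alongside `X` and `y` allows a later split by school, so that windows from the same institution do not end up on both sides of a train/validation split.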