# 01. Temporal Structure Verification

## Objective
Verify that the model respects temporal structure (features from year `t` predict the outcome in year `t+1`) and identify any data leakage.

## Key Questions
1. Are features from year `t` used to predict trajectory in year `t+1`?
2. Is there any data leakage (using same-year data to predict same-year outcome)?
3. What is the actual temporal gap between features and target?
4. Are the financial features (Revenue, Expenses) from the correct time period?
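
The structure these questions probe can be illustrated on toy data. The snippet below is a minimal sketch, not the pipeline's actual code: the toy frame, the `Next_Year_Growth` column, the `label_from_growth` helper, and the ±5% thresholds are all assumptions made for this example. It shows how a forward-looking target can be formed by pulling year `t+1` growth back onto the year `t` row within each institution.

```python
import numpy as np
import pandas as pd

# Toy panel: two hypothetical institutions, three consecutive years each.
toy = pd.DataFrame({
    "UNITID": [1, 1, 1, 2, 2, 2],
    "Year": [2016, 2017, 2018, 2016, 2017, 2018],
    "Grand Total Revenue": [100.0, 110.0, 99.0, 200.0, 202.0, 230.0],
}).sort_values(["UNITID", "Year"]).reset_index(drop=True)

# Same-year growth compares year t with year t-1 (a legitimate feature).
prev_revenue = toy.groupby("UNITID")["Grand Total Revenue"].shift(1)
toy["Revenue_Growth_1yr"] = toy["Grand Total Revenue"] / prev_revenue - 1

# Next-year growth is the year t+1 value placed on the year t row.
# A forward-looking target would be derived from this column.
toy["Next_Year_Growth"] = toy.groupby("UNITID")["Revenue_Growth_1yr"].shift(-1)

def label_from_growth(growth, threshold=0.05):
    """Hypothetical labeling rule; the real thresholds are not given in this notebook."""
    if pd.isna(growth):
        return np.nan
    if growth > threshold:
        return "Improving"
    if growth < -threshold:
        return "Declining"
    return "Stable"

toy["Target_Trajectory"] = toy["Next_Year_Growth"].apply(label_from_growth)
print(toy)
```

The last year of each institution has no next-year growth, so its target is missing in this sketch. Leakage in the sense of Question 2 would correspond to deriving the label from `Revenue_Growth_1yr` on the same row instead of `Next_Year_Growth`.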


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✅ Libraries imported")


✅ Libraries imported


## 1. Load Data and Inspect Temporal Structure


In [2]:
# Load the advanced dataset
df = pd.read_csv('../today/trajectory_ml_ready_advanced.csv')

print(f"Dataset Shape: {df.shape}")
print(f"Years: {df['Year'].min()} - {df['Year'].max()}")
print(f"Institutions: {df['UNITID'].nunique()}")

# Display sample rows for one institution to see temporal structure
sample_unitid = df['UNITID'].iloc[0]
sample_df = df[df['UNITID'] == sample_unitid].sort_values('Year')
print(f"\n--- Sample Institution: {sample_df['Institution_Name'].iloc[0]} ---")
print(sample_df[['Year', 'Grand Total Revenue', 'Grand Total Expenses', 
                 'Revenue_Growth_1yr', 'Expense_Growth_1yr', 
                 'Target_Trajectory', 'Target_Label']].to_string())


Dataset Shape: (12054, 27)
Years: 2016 - 2022
Institutions: 1722

--- Sample Institution: Alabama A & M University ---
   Year  Grand Total Revenue  Grand Total Expenses  Revenue_Growth_1yr  Expense_Growth_1yr Target_Trajectory  Target_Label
0  2016              9333489               9333489           -0.073843           -0.073843            Stable           1.0
1  2017              9422876               9422876            0.009577            0.009577         Improving           2.0
2  2018             13790500              13626724            0.463513            0.446132            Stable           1.0
3  2019             14440268              14440268            0.047117            0.059702         Declining           0.0
4  2020             10055044              10055044           -0.303680           -0.303680         Improving           2.0
5  2021             10945063              10944813            0.088515            0.088490            Stable           1.0
6  2022             

## 2. Verify Target Calculation Logic


In [3]:
# Load original data to verify target calculation
df_original = pd.read_csv('../initial files/Output_10yrs_reported_schools_17220.csv')
df_original = df_original.rename(columns={
    'Survey Year': 'Year',
    'Institution Name': 'Institution_Name'
})

# Sort by institution and year
df_original = df_original.sort_values(['UNITID', 'Year']).reset_index(drop=True)

# Calculate growth rates for verification
df_original['Revenue_Growth'] = df_original.groupby('UNITID')['Grand Total Revenue'].pct_change()
df_original['Expense_Growth'] = df_original.groupby('UNITID')['Grand Total Expenses'].pct_change()

# For each row, check if target is calculated from NEXT year's growth
# Target should be based on year t+1 growth rates, predicted using year t features

# Sample verification for one institution
test_unitid = sample_unitid
test_data = df_original[df_original['UNITID'] == test_unitid].sort_values('Year')

print("--- Temporal Structure Verification ---")
print("\nFor each year, showing:")
print("- Current year financials (features)")
print("- Next year growth rates (used for target)")
print("- Target label")

for i in range(len(test_data) - 1):
    year_t = test_data.iloc[i]['Year']
    year_t1 = test_data.iloc[i+1]['Year']
    
    revenue_t = test_data.iloc[i]['Grand Total Revenue']
    revenue_t1 = test_data.iloc[i+1]['Grand Total Revenue']
    revenue_growth_t1 = test_data.iloc[i+1]['Revenue_Growth']
    
    expense_t = test_data.iloc[i]['Grand Total Expenses']
    expense_t1 = test_data.iloc[i+1]['Grand Total Expenses']
    expense_growth_t1 = test_data.iloc[i+1]['Expense_Growth']
    
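    # Get the ground-truth target label for year t from the ML-ready dataset
    # (printed below as "Predicted Trajectory"; it is a label, not a model prediction)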
    merged_row = df[(df['UNITID'] == test_unitid) & (df['Year'] == year_t)]
    if not merged_row.empty:
        target = merged_row.iloc[0]['Target_Trajectory']
        
        print(f"\nYear {year_t}:")
        print(f"  Features (Year {year_t}): Revenue=${revenue_t:,.0f}, Expenses=${expense_t:,.0f}")
        print(f"  Target (Year {year_t1}): Revenue Growth={revenue_growth_t1:.2%}, Expense Growth={expense_growth_t1:.2%}")
        print(f"  Predicted Trajectory: {target}")
        
        if i >= 2:  # Show first 3 examples
            break


--- Temporal Structure Verification ---

For each year, showing:
- Current year financials (features)
- Next year growth rates (used for target)
- Target label

Year 2016:
  Features (Year 2016): Revenue=$9,333,489, Expenses=$9,333,489
  Target (Year 2017): Revenue Growth=0.96%, Expense Growth=0.96%
  Predicted Trajectory: Stable
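
The loop above pairs row `i` with row `i+1` and treats the latter as "next year", which holds only when each institution's years are consecutive. The sketch below is one way to check that assumption, and with it the actual temporal gap between features and target. The `year_gap_report` helper and the toy frame are hypothetical and exist only for this illustration.

```python
import pandas as pd

def year_gap_report(frame, id_col="UNITID", year_col="Year"):
    """Count the gaps (in years) between consecutive rows of each institution."""
    ordered = frame.sort_values([id_col, year_col])
    gaps = ordered.groupby(id_col)[year_col].diff().dropna()
    return gaps.value_counts().sort_index()

# Toy data: institution 2 deliberately skips 2017.
toy = pd.DataFrame({
    "UNITID": [1, 1, 1, 2, 2, 2],
    "Year": [2016, 2017, 2018, 2016, 2018, 2019],
})

report = year_gap_report(toy)
print(report)

# Any gap other than 1 means "next row" is not "next year" for some institution.
has_irregular_gaps = bool((report.index != 1).any())
print("Irregular gaps present:", has_irregular_gaps)
```

Applied to `df` or `df_original`, a report containing only a gap of 1 would support reading adjacent rows as adjacent years.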


## 3. Check for Data Leakage


In [4]:
# Check if any features contain information from the target year
# This would indicate data leakage

print("--- Data Leakage Check ---\n")

# Merge with original data to see what year each feature comes from
df_merged = df.merge(df_original[['UNITID', 'Year', 'Grand Total Revenue', 'Grand Total Expenses']], 
                     on=['UNITID', 'Year'], suffixes=('', '_original'))

# Check if Revenue and Expenses in features match the year
print("Checking if 'Grand Total Revenue' and 'Grand Total Expenses' are from the same year as the row:")
print(f"  These should be from year 't' (current year), not 't+1' (target year)")

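# Verify: for each row, Revenue/Expenses should be from that row's own year (t);
# the target should be calculated from next year's (t+1) growth.
# Note: a match only confirms that the raw financial columns are from year t.
# It does not by itself rule out leakage in the engineered features.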
sample_check = df_merged[df_merged['UNITID'] == sample_unitid].sort_values('Year').head(3)
print("\nSample verification:")
for idx, row in sample_check.iterrows():
    print(f"\nYear {row['Year']}:")
    print(f"  Revenue in features: ${row['Grand Total Revenue']:,.0f}")
    print(f"  Revenue in original: ${row['Grand Total Revenue_original']:,.0f}")
    print(f"  Match: {np.isclose(row['Grand Total Revenue'], row['Grand Total Revenue_original'])}")
    print(f"  Target: {row['Target_Trajectory']}")

print("\n✅ If all match, features are from correct year (no leakage)")


--- Data Leakage Check ---

Checking if 'Grand Total Revenue' and 'Grand Total Expenses' are from the same year as the row:
  These should be from year 't' (current year), not 't+1' (target year)

Sample verification:

Year 2016:
  Revenue in features: $9,333,489
  Revenue in original: $9,333,489
  Match: True
  Target: Stable

Year 2017:
  Revenue in features: $9,422,876
  Revenue in original: $9,422,876
  Match: True
  Target: Improving

Year 2018:
  Revenue in features: $13,790,500
  Revenue in original: $13,790,500
  Match: True
  Target: Stable

✅ If all match, features are from correct year (no leakage)
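
The check above confirms that the raw revenue on each row matches the source for the same year. A complementary check asks whether any feature column simply reproduces next year's values. The sketch below illustrates the idea on toy data; `find_future_copies`, `Revenue_Lag1`, and `Leaky_Feature` are hypothetical names, and the leak is planted on purpose.

```python
import numpy as np
import pandas as pd

def find_future_copies(frame, value_col, feature_cols, id_col="UNITID", year_col="Year"):
    """Return feature columns equal to next year's `value_col` wherever both are defined."""
    ordered = frame.sort_values([id_col, year_col])
    future = ordered.groupby(id_col)[value_col].shift(-1)
    flagged = []
    for col in feature_cols:
        mask = future.notna() & ordered[col].notna()
        if mask.any() and np.allclose(ordered.loc[mask, col], future[mask]):
            flagged.append(col)
    return flagged

toy = pd.DataFrame({
    "UNITID": [1, 1, 1, 2, 2, 2],
    "Year": [2016, 2017, 2018, 2016, 2017, 2018],
    "Grand Total Revenue": [100.0, 110.0, 99.0, 200.0, 202.0, 230.0],
})
# Honest feature: last year's revenue. Planted leak: next year's revenue.
toy["Revenue_Lag1"] = toy.groupby("UNITID")["Grand Total Revenue"].shift(1)
toy["Leaky_Feature"] = toy.groupby("UNITID")["Grand Total Revenue"].shift(-1)

flagged = find_future_copies(
    toy, "Grand Total Revenue",
    ["Grand Total Revenue", "Revenue_Lag1", "Leaky_Feature"],
)
print(flagged)
```

This detects only exact copies of a future value. A feature that is a transformation of future data (for example, a growth rate computed with year `t+1` revenue) would need the same comparison against the transformed quantity.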


## 4. Verify Feature Engineering Logic


In [5]:
# Verify that growth rates and CAGR are calculated correctly
# They should use past data (t-1, t-2) relative to current year t

print("--- Feature Engineering Verification ---\n")

# For a sample institution, verify calculations
test_unitid = sample_unitid
test_df = df[df['UNITID'] == test_unitid].sort_values('Year').head(5)

print("Verifying growth rate calculations:")
print("Revenue_Growth_1yr should be: (Revenue[t] - Revenue[t-1]) / Revenue[t-1]")
print("Revenue_CAGR_2yr should be: (Revenue[t] / Revenue[t-2])^(1/2) - 1\n")

for i in range(2, min(5, len(test_df))):
    row = test_df.iloc[i]
    year = row['Year']
    
    # Get previous years
    prev_year = test_df[test_df['Year'] == year - 1]
    prev2_year = test_df[test_df['Year'] == year - 2]
    
    if not prev_year.empty and not prev2_year.empty:
        rev_t = row['Grand Total Revenue']
        rev_t1 = prev_year.iloc[0]['Grand Total Revenue']
        rev_t2 = prev2_year.iloc[0]['Grand Total Revenue']
        
        # Calculate expected growth
        expected_1yr = (rev_t - rev_t1) / rev_t1 if rev_t1 > 0 else np.nan
        expected_2yr_cagr = ((rev_t / rev_t2) ** (1/2) - 1) if rev_t2 > 0 else np.nan
        
        actual_1yr = row['Revenue_Growth_1yr']
        actual_2yr = row['Revenue_CAGR_2yr']
        
        print(f"Year {year}:")
        print(f"  Revenue[t]={rev_t:,.0f}, Revenue[t-1]={rev_t1:,.0f}, Revenue[t-2]={rev_t2:,.0f}")
        print(f"  Expected 1yr growth: {expected_1yr:.4f}, Actual: {actual_1yr:.4f}, Match: {np.isclose(expected_1yr, actual_1yr, rtol=1e-3)}")
        print(f"  Expected 2yr CAGR: {expected_2yr_cagr:.4f}, Actual: {actual_2yr:.4f}, Match: {np.isclose(expected_2yr_cagr, actual_2yr, rtol=1e-3)}")
        print()

print("✅ If calculations match, feature engineering is correct")


--- Feature Engineering Verification ---

Verifying growth rate calculations:
Revenue_Growth_1yr should be: (Revenue[t] - Revenue[t-1]) / Revenue[t-1]
Revenue_CAGR_2yr should be: (Revenue[t] / Revenue[t-2])^(1/2) - 1

Year 2018:
  Revenue[t]=13,790,500, Revenue[t-1]=9,422,876, Revenue[t-2]=9,333,489
  Expected 1yr growth: 0.4635, Actual: 0.4635, Match: True
  Expected 2yr CAGR: 0.2155, Actual: 0.2155, Match: True

Year 2019:
  Revenue[t]=14,440,268, Revenue[t-1]=13,790,500, Revenue[t-2]=9,422,876
  Expected 1yr growth: 0.0471, Actual: 0.0471, Match: True
  Expected 2yr CAGR: 0.2379, Actual: 0.2379, Match: True

Year 2020:
  Revenue[t]=10,055,044, Revenue[t-1]=14,440,268, Revenue[t-2]=13,790,500
  Expected 1yr growth: -0.3037, Actual: -0.3037, Match: True
  Expected 2yr CAGR: -0.1461, Actual: -0.1461, Match: True

✅ If calculations match, feature engineering is correct
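
The loop above checks three years of one institution. The same comparison can be vectorized so that it covers every row at once. The sketch below is one possible way to do that, assuming consecutive years per institution; `verify_growth_features` is a hypothetical helper, the `UNITID` value is a placeholder, and the sample figures are copied from the output above (CAGR values not shown in this notebook are left as NaN).

```python
import numpy as np
import pandas as pd

def verify_growth_features(frame, value_col="Grand Total Revenue",
                           growth_col="Revenue_Growth_1yr",
                           cagr_col="Revenue_CAGR_2yr",
                           id_col="UNITID", year_col="Year", rtol=1e-3):
    """Recompute 1-year growth and 2-year CAGR from lagged rows and count mismatches.

    Assumes consecutive years per institution, so shift(k) means "k years earlier".
    """
    out = frame.sort_values([id_col, year_col])
    grouped = out.groupby(id_col)[value_col]
    lag1, lag2 = grouped.shift(1), grouped.shift(2)

    # Non-positive or missing denominators yield NaN, mirroring the loop above.
    expected_1yr = out[value_col] / lag1.where(lag1 > 0) - 1
    expected_cagr = (out[value_col] / lag2.where(lag2 > 0)) ** 0.5 - 1

    # Only rows with a recomputable value are checked.
    check_1yr, check_cagr = expected_1yr.notna(), expected_cagr.notna()
    bad_1yr = check_1yr & ~np.isclose(expected_1yr, out[growth_col], rtol=rtol)
    bad_cagr = check_cagr & ~np.isclose(expected_cagr, out[cagr_col], rtol=rtol)
    return {
        "checked_1yr": int(check_1yr.sum()), "mismatch_1yr": int(bad_1yr.sum()),
        "checked_cagr": int(check_cagr.sum()), "mismatch_cagr": int(bad_cagr.sum()),
    }

sample = pd.DataFrame({
    "UNITID": [1] * 5,  # placeholder ID for the sample institution
    "Year": [2016, 2017, 2018, 2019, 2020],
    "Grand Total Revenue": [9333489, 9422876, 13790500, 14440268, 10055044],
    "Revenue_Growth_1yr": [-0.073843, 0.009577, 0.463513, 0.047117, -0.303680],
    "Revenue_CAGR_2yr": [np.nan, np.nan, 0.2155, 0.2379, -0.1461],
})
result = verify_growth_features(sample)
print(result)
```

Rows whose lagged value falls outside the frame (such as the first year of each institution) are skipped rather than counted as mismatches, because their stored growth depends on data not present in the frame.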


## 5. Summary and Recommendations


In [6]:
# Create summary report
print("=" * 60)
print("TEMPORAL STRUCTURE VERIFICATION SUMMARY")
print("=" * 60)

# Check year distribution
year_dist = df['Year'].value_counts().sort_index()
print(f"\n1. Year Distribution:")
print(year_dist)

# Check target distribution
target_dist = df['Target_Trajectory'].value_counts()
print(f"\n2. Target Distribution:")
print(target_dist)
print(f"   Percentages: {target_dist / len(df) * 100}")

# Verify no missing targets
missing_targets = df['Target_Label'].isna().sum()
print(f"\n3. Missing Targets: {missing_targets}")

# Restate the temporal structure checked in Sections 2-4 (static text, not computed in this cell)
print(f"\n4. Temporal Structure:")
print(f"   - Features use data from year t (current year)")
print(f"   - Target is calculated from year t+1 growth rates")
print(f"   - This is CORRECT temporal forecasting structure")

# Potential issues
print(f"\n5. Potential Issues to Check:")
print(f"   - Are 'Grand Total Revenue' and 'Grand Total Expenses' from year t?")
print(f"   - If yes, this is acceptable (they are features, not targets)")
print(f"   - Target should be based on t+1 growth, which is independent")

print("\n" + "=" * 60)


TEMPORAL STRUCTURE VERIFICATION SUMMARY

1. Year Distribution:
Year
2016    1722
2017    1722
2018    1722
2019    1722
2020    1722
2021    1722
2022    1722
Name: count, dtype: int64

2. Target Distribution:
Target_Trajectory
Stable       5694
Declining    4719
Improving    1641
Name: count, dtype: int64
   Percentages: Target_Trajectory
Stable       47.237432
Declining    39.148830
Improving    13.613738
Name: count, dtype: float64

3. Missing Targets: 0

4. Temporal Structure:
   - Features use data from year t (current year)
   - Target is calculated from year t+1 growth rates
   - This is CORRECT temporal forecasting structure

5. Potential Issues to Check:
   - Are 'Grand Total Revenue' and 'Grand Total Expenses' from year t?
   - If yes, this is acceptable (they are features, not targets)
   - Target should be based on t+1 growth, which is independent
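
The statement that the target derives from year `t+1` growth can also be probed from the data. The sketch below compares mean same-year growth and mean next-year growth per target class; if the label follows `t+1` growth, the next-year column should order the classes (Declining below Stable below Improving) while the same-year column need not. The `growth_by_target` helper and the `Next_Year_Growth` column are hypothetical, the `UNITID` value is a placeholder, and the rows are the six complete sample rows shown in Section 1, so the result reflects a single institution only.

```python
import pandas as pd

def growth_by_target(frame, growth_col="Revenue_Growth_1yr",
                     target_col="Target_Trajectory",
                     id_col="UNITID", year_col="Year"):
    """Mean same-year and next-year growth for each target class."""
    ordered = frame.sort_values([id_col, year_col]).copy()
    # Pull the year t+1 growth back onto the year t row, per institution.
    ordered["Next_Year_Growth"] = ordered.groupby(id_col)[growth_col].shift(-1)
    return ordered.groupby(target_col)[[growth_col, "Next_Year_Growth"]].mean()

sample = pd.DataFrame({
    "UNITID": [1] * 6,  # placeholder ID for the sample institution
    "Year": [2016, 2017, 2018, 2019, 2020, 2021],
    "Revenue_Growth_1yr": [-0.073843, 0.009577, 0.463513, 0.047117, -0.303680, 0.088515],
    "Target_Trajectory": ["Stable", "Improving", "Stable", "Declining", "Improving", "Stable"],
})

table = growth_by_target(sample)
print(table)
```

Running the same function on the full `df` would give a broader picture. Because the labels also depend on expense growth and on thresholds not shown in this notebook, the ordering is a plausibility check rather than proof.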

