# Zone-Level Occupancy Data Transformation

This notebook creates **aggregated zone-level occupancy data** by combining all lots within each zone (e.g., "Green 2").

**Why Zone-Level?**
- The raw AMP data uses 62 specific lot names (e.g., "Green 2 KMac Lot")
- `lot_mapping` lists 186 lots organized into 28 aggregated zones
- Only 46 lots have AMP occupancy data
- Aggregating to zones pools more sessions into each series, which should make predictions more robust

**Approach:**
1. Load raw AMP session data (lot-level)
2. Map lot names to aggregated zones using lot_mapping
3. For each zone and hour, count the sessions active across all lots in that zone
4. Add temporal, calendar, and weather features
5. Create train/val/test splits

In [20]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns

print("="*80)
print("ZONE-LEVEL OCCUPANCY DATA TRANSFORMATION")
print("="*80)
print()

ZONE-LEVEL OCCUPANCY DATA TRANSFORMATION



## Step 1: Load and Map AMP Data to Zones

In [21]:
# Load preprocessed AMP data
print("Loading AMP session data...")
amp = pd.read_csv('../../data/processed/amp_preprocessed_clean.csv', parse_dates=['Start_Date', 'End_Date'])
print(f"Total parking sessions: {len(amp):,}")
print(f"Date range: {amp['Start_Date'].min()} to {amp['Start_Date'].max()}")
print(f"Unique AMP zones: {amp['Zone'].nunique()}")

# Load lot mapping
lot_mapping = pd.read_csv('../../data/lot_mapping_enhanced_with_coords.csv')
print(f"\nTotal lots in mapping: {len(lot_mapping)}")
print(f"Aggregated zones: {lot_mapping['Zone_Name'].nunique()}")

Loading AMP session data...
Total parking sessions: 1,683,018
Date range: 2020-08-10 07:51:00 to 2025-10-31 14:57:00
Unique AMP zones: 62

Total lots in mapping: 186
Aggregated zones: 28


In [22]:
# Create mapping from AMP zone names -> aggregated Zone_Name
amp_to_zone = {}

for _, row in lot_mapping.iterrows():
    zone_name = row['Zone_Name']
    alt_desc = row.get('alternative_location_description')
    
    if pd.notna(alt_desc):
        # Map each alternative name to this zone
        for name in str(alt_desc).split('|'):
            name = name.strip()
            if name:
                amp_to_zone[name] = zone_name

print(f"Mapped {len(amp_to_zone)} AMP zone names to aggregated zones")
print("\nSample mappings:")
for amp_zone, zone in list(amp_to_zone.items())[:10]:
    print(f"  '{amp_zone}' -> '{zone}'")

Mapped 57 AMP zone names to aggregated zones

Sample mappings:
  'Green 2 KMac Lot' -> 'Green 2'
  'Green 2 McAllister' -> 'Green 2'
  'McCoy Hall Lot: Football General Parking' -> 'Green 1'
  'Bustad - AMP Marked spots' -> 'Green 1'
  'Green 1 Bustad Lot' -> 'Green 1'
  'Green 1 PACCAR South' -> 'Green 1'
  'McCluskey Services Visitor Spaces' -> 'Buisness Parking'
  'Wilson Road on Street Meters' -> 'Green 1'
  'Perham Rear Entrance AMP spot' -> 'Green 5'
  'Regents: AMP Marked Spot' -> 'Green 5'
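
The loop above silently keeps the last zone seen if the same AMP name appears under two zones. A minimal sketch of a loop-free variant that raises on such conflicts is shown here; the helper name `build_amp_to_zone` and the toy rows are assumptions for illustration, while the column names and the `|` separator are the ones used in the cell above.

```python
import pandas as pd

def build_amp_to_zone(lot_mapping: pd.DataFrame) -> pd.Series:
    """Return a Series indexed by AMP name with the aggregated zone as value."""
    col = "alternative_location_description"
    pairs = (
        lot_mapping[["Zone_Name", col]]
        .dropna(subset=[col])
        # One list of alternative names per lot, then one row per name
        .assign(amp_name=lambda d: d[col].astype(str).str.split("|"))
        .explode("amp_name")
    )
    pairs["amp_name"] = pairs["amp_name"].str.strip()
    pairs = pairs[pairs["amp_name"] != ""].drop_duplicates(["amp_name", "Zone_Name"])

    # Any name still duplicated here points at more than one zone
    conflicts = pairs[pairs["amp_name"].duplicated(keep=False)]
    if not conflicts.empty:
        names = sorted(conflicts["amp_name"].unique())
        raise ValueError(f"AMP names mapped to multiple zones: {names}")
    return pairs.set_index("amp_name")["Zone_Name"]

# Toy rows (not the real lot_mapping file)
demo_mapping = pd.DataFrame({
    "Zone_Name": ["Green 2", "Green 1", "Red 5"],
    "alternative_location_description": [
        "Green 2 KMac Lot | Green 2 McAllister",
        "Green 1 Bustad Lot",
        None,
    ],
})
demo_result = build_amp_to_zone(demo_mapping)
print(demo_result)
```

The returned Series can be passed straight to `Series.map`, the same way the `amp_to_zone` dict is used in the next cell.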


In [23]:
# Map AMP sessions to aggregated zones
amp['Aggregated_Zone'] = amp['Zone'].map(amp_to_zone)

# Check coverage
unmapped = amp[amp['Aggregated_Zone'].isna()]
print(f"Sessions mapped to zones: {(~amp['Aggregated_Zone'].isna()).sum():,}")
print(f"Unmapped sessions: {len(unmapped):,}")

if len(unmapped) > 0:
    print("\nUnmapped AMP zones:")
    print(unmapped['Zone'].value_counts().head(10))

# Drop unmapped sessions
amp = amp[amp['Aggregated_Zone'].notna()].copy()
print(f"\nProceeding with {len(amp):,} sessions across {amp['Aggregated_Zone'].nunique()} zones")

Sessions mapped to zones: 1,659,706
Unmapped sessions: 23,312

Unmapped AMP zones:
Zone
Green 3: Washington St                                            11706
Green 3: Football General Parking                                  6267
Don't Activate: Lighty East Meters                                 4099
PACCAR: (Don't Activate)                                            568
Spring Street Lot                                                   482
B St Lots: Football General Parking                                 159
Testing Zone with an extra long label to read (Don't activate)       22
Disability Parking CUE (Don't Activate)                               9
Name: count, dtype: int64

Proceeding with 1,659,706 sessions across 19 zones


## Step 2: Create Hourly Time Grid

In [24]:
print("Creating hourly time grid...")

start_date = amp['Start_Date'].min().date()
end_date = amp['Start_Date'].max().date()

date_range = pd.date_range(start=start_date, end=end_date, freq='D')
hours = range(24)
aggregated_zones = amp['Aggregated_Zone'].unique()

# Create zone-hour grid
zone_hour_grid = pd.MultiIndex.from_product(
    [aggregated_zones, date_range, hours],
    names=['Zone', 'date', 'hour']
).to_frame(index=False)

zone_hour_grid['interval_start'] = (
    pd.to_datetime(zone_hour_grid['date']) +
    pd.to_timedelta(zone_hour_grid['hour'], unit='h')
)
zone_hour_grid['interval_end'] = zone_hour_grid['interval_start'] + pd.Timedelta(hours=1)

print(f"Zone-hour grid created: {len(zone_hour_grid):,} combinations")
print(f"  {len(aggregated_zones)} zones")
print(f"  {len(date_range)} days")
print("  24 hours per day")

Creating hourly time grid...
Zone-hour grid created: 870,504 combinations
  19 zones
  1909 days
  24 hours per day
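
The grid size is just zones × days × 24, which gives a quick consistency check on the output above (19 × 1909 × 24 = 870,504). As a sketch, the same grid can also be built from a single hourly `date_range`; the helper name `build_zone_hour_grid` and the toy zone labels are assumptions, not part of the notebook.

```python
import pandas as pd

def build_zone_hour_grid(zones, first_day, last_day):
    """One row per (zone, hour) from first_day 00:00 through last_day 23:00."""
    hours = pd.date_range(
        start=pd.Timestamp(first_day),
        end=pd.Timestamp(last_day) + pd.Timedelta(hours=23),
        freq="h",
    )
    grid = pd.MultiIndex.from_product(
        [zones, hours], names=["Zone", "interval_start"]
    ).to_frame(index=False)
    grid["interval_end"] = grid["interval_start"] + pd.Timedelta(hours=1)
    grid["date"] = grid["interval_start"].dt.normalize()
    grid["hour"] = grid["interval_start"].dt.hour
    return grid

# Toy example: 2 zones over 2 days
demo_grid = build_zone_hour_grid(["A", "B"], "2020-08-10", "2020-08-11")
print(len(demo_grid))   # 2 zones * 2 days * 24 hours = 96
print(19 * 1909 * 24)   # 870504, matching the grid size reported above
```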


## Step 3: Calculate Zone-Level Occupancy

In [25]:
print("Calculating zone-level occupancy...")
print("This may take several minutes...")

occupancy_list = []

for zone_count, zone in enumerate(aggregated_zones, start=1):
    if zone_count % 5 == 0:
        print(f"Processing zone {zone_count}/{len(aggregated_zones)}: {zone}")
    
    # Filter sessions for this zone
    zone_sessions = amp[amp['Aggregated_Zone'] == zone].copy()
    zone_grid = zone_hour_grid[zone_hour_grid['Zone'] == zone].copy()
    
    # For each hour, count overlapping sessions
    for idx, row in zone_grid.iterrows():
        interval_start = row['interval_start']
        interval_end = row['interval_end']
        
        # A session counts toward this hour if it overlaps any part of it,
        # so a multi-hour session is counted in every hour it touches
        active_sessions = zone_sessions[
            (zone_sessions['Start_Date'] < interval_end) &
            (zone_sessions['End_Date'] > interval_start)
        ]
        
        occupancy_list.append({
            'Zone': zone,
            'date': row['date'],
            'hour': row['hour'],
            'datetime': interval_start,
            'occupancy_count': len(active_sessions)
        })

occupancy_df = pd.DataFrame(occupancy_list)
print(f"\nZone-level occupancy data created: {len(occupancy_df):,} zone-hour records")
print(f"Unique zones: {occupancy_df['Zone'].nunique()}")

Calculating zone-level occupancy...
This may take several minutes...
Processing zone 5/19: Green 5
Processing zone 10/19: Orange 2
Processing zone 15/19: Authorized Vehicles Only

Zone-level occupancy data created: 870,504 zone-hour records
Unique zones: 19
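
The row-by-row loop above filters the session table once per zone-hour, which is why it takes minutes. A possible faster formulation, sketched here under the assumption that every session has `End_Date >= Start_Date` and no missing timestamps, uses the identity: sessions overlapping an hour = (sessions started before the hour ends) minus (sessions that ended at or before the hour starts). Both terms are binary searches on sorted arrays. The function name `count_active_sessions` and the toy sessions are illustrative only.

```python
import numpy as np
import pandas as pd

def count_active_sessions(session_starts, session_ends, interval_starts,
                          width=np.timedelta64(1, "h")):
    """Count sessions with start < interval_end and end > interval_start.

    Assumes end >= start for every session (otherwise the subtraction
    below no longer equals the overlap count).
    """
    starts = np.sort(pd.to_datetime(session_starts).to_numpy().astype("datetime64[ns]"))
    ends = np.sort(pd.to_datetime(session_ends).to_numpy().astype("datetime64[ns]"))
    lo = pd.to_datetime(interval_starts).to_numpy().astype("datetime64[ns]")
    hi = lo + width

    started = np.searchsorted(starts, hi, side="left")    # starts strictly before interval end
    finished = np.searchsorted(ends, lo, side="right")    # ends at or before interval start
    return started - finished

# Toy sessions for a single zone
demo_starts = ["2024-01-01 08:30", "2024-01-01 09:00", "2024-01-01 07:00"]
demo_ends = ["2024-01-01 09:30", "2024-01-01 10:00", "2024-01-01 08:00"]
demo_hours = pd.date_range("2024-01-01 07:00", periods=4, freq="h")

demo_counts = count_active_sessions(demo_starts, demo_ends, demo_hours)
print(demo_counts)  # [1 1 2 0]
```

Applied per zone (for example inside a `groupby('Aggregated_Zone')`), this should reproduce `occupancy_count` without the inner loop, though that equivalence has not been checked against the real data here.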


## Step 4: Add Temporal Features

In [26]:
print("Adding temporal features...")

occupancy_df['year'] = occupancy_df['datetime'].dt.year
occupancy_df['month'] = occupancy_df['datetime'].dt.month
occupancy_df['day'] = occupancy_df['datetime'].dt.day
occupancy_df['day_of_week'] = occupancy_df['datetime'].dt.dayofweek
occupancy_df['is_weekend'] = (occupancy_df['day_of_week'] >= 5).astype(int)

def categorize_time_of_day(hour):
    if 0 <= hour < 6:
        return 'Late Night'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 22:
        return 'Evening'
    else:
        return 'Night'

occupancy_df['time_of_day'] = occupancy_df['hour'].apply(categorize_time_of_day)
print("  Temporal features added")

Adding temporal features...
  Temporal features added
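
The hour-to-label rule can also be written as a single binning call instead of a per-row `apply`. This is a sketch using `pd.cut` with the same boundaries as `categorize_time_of_day`; the function name `time_of_day_labels` is an assumption. One difference to keep in mind: hours outside 0-23 become missing here, whereas the original function labels them 'Night'.

```python
import pandas as pd

def time_of_day_labels(hours: pd.Series) -> pd.Series:
    """Bin hours into left-closed intervals [0,6), [6,12), [12,18), [18,22), [22,24)."""
    return pd.cut(
        hours,
        bins=[0, 6, 12, 18, 22, 24],
        right=False,
        labels=["Late Night", "Morning", "Afternoon", "Evening", "Night"],
    ).astype(str)

demo_hours_of_day = pd.Series([0, 5, 6, 11, 12, 17, 18, 21, 22, 23])
demo_labels = time_of_day_labels(demo_hours_of_day)
print(demo_labels.tolist())
```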


## Step 5: Merge Calendar Events

In [27]:
print("Merging calendar events...")

games = pd.read_csv('../../data/football_games.csv')
calendar = pd.read_csv('../../data/academic_calendar.csv')

games['Date'] = pd.to_datetime(games['Date'])
calendar['Start_Date'] = pd.to_datetime(calendar['Start_Date']).dt.normalize()
calendar['End_Date'] = pd.to_datetime(calendar['End_Date']).dt.normalize()

occupancy_df['date'] = pd.to_datetime(occupancy_df['date'])
game_dates = games['Date'].dt.normalize()
occupancy_df['is_game_day'] = occupancy_df['date'].isin(game_dates).astype(int)

# Calendar events
for event_type in ['Dead_Week', 'Finals_Week', 'Spring_Break', 'Thanksgiving_Break', 'Winter_Break']:
    event_periods = calendar[calendar['Event_Type'] == event_type]
    occupancy_df[f'is_{event_type.lower()}'] = 0
    
    for _, period in event_periods.iterrows():
        # Both period endpoints are inclusive
        mask = (
            (occupancy_df['date'] >= period['Start_Date'])
            & (occupancy_df['date'] <= period['End_Date'])
        )
        occupancy_df.loc[mask, f'is_{event_type.lower()}'] = 1

occupancy_df['is_any_break'] = (
    occupancy_df['is_spring_break'] |
    occupancy_df['is_thanksgiving_break'] |
    occupancy_df['is_winter_break']
).astype(int)

print("  Calendar events merged")

Merging calendar events...
  Calendar events merged
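
To make the inclusive-endpoint behaviour of the calendar flags concrete, here is a small sketch of the same idea as a reusable helper. The name `flag_periods` and the dates in the toy example are made up for illustration; only the `Start_Date` / `End_Date` column names come from the cell above.

```python
import pandas as pd

def flag_periods(dates: pd.Series, periods: pd.DataFrame) -> pd.Series:
    """Return 1 where a date falls inside any [Start_Date, End_Date] period, else 0."""
    flag = pd.Series(False, index=dates.index)
    for start, end in zip(periods["Start_Date"], periods["End_Date"]):
        flag |= dates.between(start, end)  # inclusive on both ends by default
    return flag.astype(int)

# Toy data: one hypothetical five-day period
demo_dates = pd.Series(pd.to_datetime(
    ["2024-11-24", "2024-11-25", "2024-11-29", "2024-11-30"]
))
demo_periods = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2024-11-25"]),
    "End_Date": pd.to_datetime(["2024-11-29"]),
})
demo_flags = flag_periods(demo_dates, demo_periods)
print(demo_flags.tolist())  # [0, 1, 1, 0]
```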


## Step 6: Merge Weather Data

In [28]:
print("Merging weather data...")

weather = pd.read_csv('../../data/weather_pullman_2020_2025.csv')
weather['date'] = pd.to_datetime(weather['date']).dt.normalize()

occupancy_df = occupancy_df.merge(weather, on='date', how='left')
print("  Weather data merged")

Merging weather data...
  Weather data merged
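
A left merge like this one silently duplicates occupancy rows if the weather file ever contains two rows for the same date, and silently leaves gaps if a date is missing. The sketch below shows one way to guard against both using `merge`'s `validate` and `indicator` options. The helper name `merge_daily_weather` and the `temp_max` column are hypothetical, since the weather file's columns are not shown in this notebook.

```python
import pandas as pd

def merge_daily_weather(occupancy: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Left-merge daily weather onto hourly rows, failing on duplicate weather dates."""
    merged = occupancy.merge(
        weather,
        on="date",
        how="left",
        validate="many_to_one",  # raises MergeError if a date repeats in `weather`
        indicator=True,          # adds a `_merge` column for the coverage check
    )
    missing = int((merged["_merge"] == "left_only").sum())
    if missing:
        print(f"  {missing:,} rows have no matching weather date")
    return merged.drop(columns="_merge")

# Toy data with a hypothetical weather column
demo_occ = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "occupancy_count": [3, 4, 5],
})
demo_weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01"]),
    "temp_max": [1.5],
})
demo_merged = merge_daily_weather(demo_occ, demo_weather)
print(demo_merged)
```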


## Step 7: Add Zone Capacity and Availability Metrics

In [29]:
print("Adding zone capacity...")

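# Calculate total capacity per zone from lot_mapping.
# Note: sum() returns 0 (not NaN) for a zone whose lots all lack a capacity value,
# so such zones keep Max_Capacity = 0 and skip the fallback below;
# sum(min_count=1) would leave them as NaN instead.
zone_capacity = lot_mapping.groupby('Zone_Name')['capacity'].sum().to_dict()

occupancy_df['Max_Capacity'] = occupancy_df['Zone'].map(zone_capacity)

# For zones without capacity, estimate it as the 95th percentile of observed occupancy (minimum 10)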
zones_without_cap = occupancy_df[occupancy_df['Max_Capacity'].isna()]['Zone'].unique()
if len(zones_without_cap) > 0:
    print(f"  Estimating capacity for {len(zones_without_cap)} zones without data...")
    for zone in zones_without_cap:
        zone_data = occupancy_df[occupancy_df['Zone'] == zone]['occupancy_count']
        estimated_cap = max(zone_data.quantile(0.95), 10)
        occupancy_df.loc[occupancy_df['Zone'] == zone, 'Max_Capacity'] = estimated_cap

# Calculate availability metrics
occupancy_df['occupancy_ratio'] = occupancy_df['occupancy_count'] / occupancy_df['Max_Capacity'].replace(0, 1)
occupancy_df['available_spaces'] = (occupancy_df['Max_Capacity'] - occupancy_df['occupancy_count']).clip(lower=0)
occupancy_df['is_near_full'] = (occupancy_df['occupancy_ratio'] >= 0.85).astype(int)
occupancy_df['is_very_full'] = (occupancy_df['occupancy_ratio'] >= 0.95).astype(int)

print("  Capacity and availability metrics added")

Adding zone capacity...
  Capacity and availability metrics added
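
One subtlety in the capacity step: `groupby(...).sum()` yields 0 rather than NaN when every lot in a zone has a missing capacity, so the percentile fallback only fires for zones absent from `lot_mapping` altogether. The sketch below shows a variant that uses `sum(min_count=1)` so that all-missing zones also receive the fallback. The function name `zone_capacity_with_fallback` and the toy frames are assumptions; the 95th percentile and the floor of 10 mirror the cell above.

```python
import numpy as np
import pandas as pd

def zone_capacity_with_fallback(lots, occupancy, floor=10, q=0.95):
    """Per-row capacity: summed lot capacity, else a percentile-based estimate."""
    # min_count=1 keeps zones whose capacities are all missing as NaN (not 0)
    known = lots.groupby("Zone_Name")["capacity"].sum(min_count=1)
    capacity = occupancy["Zone"].map(known)

    # Fallback: q-th percentile of observed occupancy, never below `floor`
    fallback = occupancy.groupby("Zone")["occupancy_count"].transform(
        lambda s: max(s.quantile(q), floor)
    )
    return capacity.fillna(fallback)

# Toy data: zone A has capacities, zone B has none recorded, zone C is unmapped
demo_lots = pd.DataFrame({
    "Zone_Name": ["A", "A", "B"],
    "capacity": [20, 30, np.nan],
})
demo_occupancy = pd.DataFrame({
    "Zone": ["A", "A", "B", "B", "B", "C"],
    "occupancy_count": [5, 6, 40, 40, 40, 2],
})
demo_capacity = zone_capacity_with_fallback(demo_lots, demo_occupancy)
print(demo_capacity.tolist())  # [50.0, 50.0, 40.0, 40.0, 40.0, 10.0]
```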


## Step 8: Create Train/Validation/Test Splits

In [30]:
print("Creating train/validation/test splits...")

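# Cutoffs are midnight timestamps, so only the 00:00 record of each cutoff day
# lands on the earlier side; hours 01:00-23:00 of that day go to the next split.
train_end = pd.Timestamp('2024-08-31')
val_end = pd.Timestamp('2025-05-31')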

occupancy_train = occupancy_df[occupancy_df['datetime'] <= train_end].copy()
occupancy_val = occupancy_df[(occupancy_df['datetime'] > train_end) & (occupancy_df['datetime'] <= val_end)].copy()
occupancy_test = occupancy_df[occupancy_df['datetime'] > val_end].copy()

print(f"Training: {len(occupancy_train):,} records ({occupancy_train['date'].min()} to {occupancy_train['date'].max()})")
print(f"Validation: {len(occupancy_val):,} records ({occupancy_val['date'].min()} to {occupancy_val['date'].max()})")
print(f"Test: {len(occupancy_test):,} records ({occupancy_test['date'].min()} to {occupancy_test['date'].max()})")

Creating train/validation/test splits...
Training: 675,811 records (2020-08-10 00:00:00 to 2024-08-31 00:00:00)
Validation: 124,488 records (2024-08-31 00:00:00 to 2025-05-31 00:00:00)
Test: 70,205 records (2025-05-31 00:00:00 to 2025-10-31 00:00:00)
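
In the output above, 2024-08-31 is both the last training date and the first validation date, and 2025-05-31 straddles validation and test in the same way, because the cutoffs are compared against hourly timestamps. If whole days should belong to exactly one split, one option is to compare on the normalized day instead. This is a sketch with a hypothetical helper `split_by_day`, not the split the recorded counts were produced with.

```python
import pandas as pd

def split_by_day(df, train_last_day, val_last_day, col="datetime"):
    """Chronological split where each calendar day falls entirely in one split."""
    day = df[col].dt.normalize()
    train_last_day = pd.Timestamp(train_last_day)
    val_last_day = pd.Timestamp(val_last_day)

    train = df[day <= train_last_day].copy()
    val = df[(day > train_last_day) & (day <= val_last_day)].copy()
    test = df[day > val_last_day].copy()
    return train, val, test

# Toy data: three full days of hourly rows
demo_hourly = pd.DataFrame({
    "datetime": pd.date_range("2024-08-30", periods=72, freq="h")
})
demo_train, demo_val, demo_test = split_by_day(demo_hourly, "2024-08-30", "2024-08-31")
print(len(demo_train), len(demo_val), len(demo_test))  # 24 24 24
```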


## Step 9: Save Zone-Level Data

In [31]:
print("Saving zone-level occupancy data...")

occupancy_df.to_csv('../../data/processed/occupancy_zone_level_full.csv', index=False)
occupancy_train.to_csv('../../data/processed/occupancy_zone_level_train.csv', index=False)
occupancy_val.to_csv('../../data/processed/occupancy_zone_level_val.csv', index=False)
occupancy_test.to_csv('../../data/processed/occupancy_zone_level_test.csv', index=False)

print(f"Full zone-level data saved: {len(occupancy_df):,} records")
print(f"Train set saved: {len(occupancy_train):,} records")
print(f"Validation set saved: {len(occupancy_val):,} records")
print(f"Test set saved: {len(occupancy_test):,} records")

Saving zone-level occupancy data...
Full zone-level data saved: 870,504 records
Train set saved: 675,811 records
Validation set saved: 124,488 records
Test set saved: 70,205 records


## Summary Statistics

In [32]:
print("="*80)
print("ZONE-LEVEL DATA SUMMARY")
print("="*80)
print(f"Aggregated zones: {occupancy_df['Zone'].nunique()}")
print(f"Average occupancy: {occupancy_df['occupancy_count'].mean():.1f} cars")
print(f"Date range: {occupancy_df['date'].min()} to {occupancy_df['date'].max()}")
print("\nTop 10 busiest zones:")
print(occupancy_df.groupby('Zone')['occupancy_count'].mean().sort_values(ascending=False).head(10))

print("\n" + "="*80)
print("DONE!")
print("="*80)
print("\nNext step: Train zone-level model in notebook 04")

ZONE-LEVEL DATA SUMMARY
Aggregated zones: 19
Average occupancy: 5.9 cars
Date range: 2020-08-10 00:00:00 to 2025-10-31 00:00:00

Top 10 busiest zones:
Zone
Paid        71.731295
Green 5     12.515824
Green 1      9.190959
Green 3      6.563777
Green 2      2.870133
Yellow 3     1.624956
Yellow 1     1.576283
Red 5        1.174611
Yellow 2     1.120547
Green 4      0.868518
Name: occupancy_count, dtype: float64

DONE!

Next step: Train zone-level model in notebook 04
