# Lot-Level Occupancy Data Transformation (LPR-Based)

This notebook creates **lot-level occupancy estimates** using License Plate Reader (LPR) scan data.

**Why LPR-Based?**
- AMP session data only covers 46 lots
- **LPR data covers 185 lots**, including 183 of the 186 lots in the lot mapping (only lots 117, 149, and 185 are missing)
- LPR scans serve as a proxy for parking activity/occupancy
- Enables lot-specific predictions for almost every parking lot on campus

**Approach:**
1. Load LPR data (FY23, FY24, FY25)
2. Count hourly LPR scans per lot
3. Create lag features (previous hours' scan counts)
4. Add temporal, calendar, and weather features
5. Create train/val/test splits

**Note:** LPR scan counts are not occupancy counts, but they correlate with parking activity. A high scan count can reflect high turnover, heavier enforcement attention, or both.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns

print("="*80)
print("LOT-LEVEL OCCUPANCY DATA TRANSFORMATION (LPR-BASED)")
print("="*80)
print()

LOT-LEVEL OCCUPANCY DATA TRANSFORMATION (LPR-BASED)



## Step 1: Load LPR Data

In [2]:
print("Loading LPR data from all fiscal years...")

# Load LPR data (one sheet per fiscal year, all from the same workbook)
raw_path = '../../data/raw/Data_For_Class_Project.xlsx'
lpr23 = pd.read_excel(raw_path, sheet_name='LPR_Reads_FY23')
lpr24 = pd.read_excel(raw_path, sheet_name='LPR_Reads_FY24')
lpr25 = pd.read_excel(raw_path, sheet_name='LPR_Reads_FY25')

# Combine all years
lpr_all = pd.concat([lpr23, lpr24, lpr25], ignore_index=True)

print(f"FY23 LPR scans: {len(lpr23):,}")
print(f"FY24 LPR scans: {len(lpr24):,}")
print(f"FY25 LPR scans: {len(lpr25):,}")
print(f"Total LPR scans: {len(lpr_all):,}")

Loading LPR data from all fiscal years...
FY23 LPR scans: 520,850
FY24 LPR scans: 612,428
FY25 LPR scans: 646,909
Total LPR scans: 1,780,187


In [3]:
# Parse lot numbers from LOT column
lpr_all['lot_number'] = lpr_all['LOT'].str.extract(r'LOT\s+(\d+)')[0].astype(float)

# Parse datetime
lpr_all['datetime'] = pd.to_datetime(lpr_all['Date_Time'])

# Remove rows without lot number
lpr_all = lpr_all[lpr_all['lot_number'].notna()].copy()
lpr_all['lot_number'] = lpr_all['lot_number'].astype(int)

print(f"\nAfter parsing:")
print(f"  LPR scans with lot numbers: {len(lpr_all):,}")
print(f"  Unique lots: {lpr_all['lot_number'].nunique()}")
print(f"  Date range: {lpr_all['datetime'].min()} to {lpr_all['datetime'].max()}")


After parsing:
  LPR scans with lot numbers: 1,780,187
  Unique lots: 185
  Date range: 2022-07-01 05:24:26 to 2025-06-30 21:58:49
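
To illustrate the lot-number parsing above, here is a minimal sketch on made-up `LOT` strings (the sample values are assumptions, not taken from the workbook). It shows that rows which do not match the `LOT <number>` pattern come back as missing values, which is why they have to be dropped before the integer cast.

```python
import pandas as pd

# Hypothetical LOT strings; the real column values may be formatted differently.
sample = pd.Series(["LOT 150", "LOT  9 - North", "Garage A", None], name="LOT")

# expand=False returns a Series holding the single capture group.
lot_number = sample.str.extract(r"LOT\s+(\d+)", expand=False)

# Convert to numbers; unmatched rows stay missing (nullable Int64 keeps them as <NA>).
lot_number = pd.to_numeric(lot_number).astype("Int64")

# Only the matched rows survive the drop and can be cast to plain int.
parsed = lot_number.dropna().astype(int).tolist()
print(parsed)
```

The notebook reaches the same result with a float cast followed by a `notna()` filter; the nullable integer type is just one alternative way to carry the missing values.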


## Step 2: Load Lot Mapping

In [4]:
# Load lot mapping to get zone info and capacity
lot_mapping = pd.read_csv('../../data/lot_mapping_enhanced_with_coords.csv')

print(f"Total lots in mapping: {len(lot_mapping)}")

# Create lot_number -> zone mapping
lot_to_zone = lot_mapping.set_index('Lot_number')['Zone_Name'].to_dict()
lot_to_capacity = lot_mapping.set_index('Lot_number')['capacity'].to_dict()

# Check LPR coverage
lpr_lots = set(lpr_all['lot_number'].unique())
mapping_lots = set(lot_mapping['Lot_number'].unique())
missing_from_lpr = mapping_lots - lpr_lots

print(f"\nLots with LPR data: {len(lpr_lots)}")
print(f"Lots missing from LPR: {missing_from_lpr}")

Total lots in mapping: 186

Lots with LPR data: 185
Lots missing from LPR: {185, 117, 149}
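
The coverage check above only looks in one direction (mapping lots absent from LPR). Since the mapping has 186 lots, 3 of them are missing from LPR, and LPR has 185 distinct lots, some LPR lots are presumably absent from the mapping. A small sketch with made-up lot numbers shows how both directions could be checked with plain set operations; `missing_from_mapping` is a hypothetical name not used in the notebook.

```python
# Toy lot numbers, chosen only to illustrate the set logic.
mapping_lots = {1, 2, 3, 4}
lpr_lots = {2, 3, 4, 5}

# Lots listed in the mapping that never appear in LPR scans.
missing_from_lpr = mapping_lots - lpr_lots

# Lots scanned by LPR that the mapping does not describe
# (these would have no zone or capacity after a lookup).
missing_from_mapping = lpr_lots - mapping_lots

# Lots present in both sources.
covered = mapping_lots & lpr_lots

print(sorted(missing_from_lpr), sorted(missing_from_mapping), len(covered))
```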


## Step 3: Create Hourly LPR Scan Counts

In [5]:
print("Creating hourly LPR scan counts per lot...")

# Extract date and hour
lpr_all['date'] = lpr_all['datetime'].dt.date
lpr_all['hour'] = lpr_all['datetime'].dt.hour

# Count scans per lot-date-hour
lpr_hourly = lpr_all.groupby(['lot_number', 'date', 'hour']).size().reset_index(name='lpr_scans')

print(f"Hourly LPR data: {len(lpr_hourly):,} lot-hour records")
print(f"Average scans per hour: {lpr_hourly['lpr_scans'].mean():.2f}")
print(f"Max scans in one hour: {lpr_hourly['lpr_scans'].max()}")

Creating hourly LPR scan counts per lot...
Hourly LPR data: 76,550 lot-hour records
Average scans per hour: 23.26
Max scans in one hour: 666
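
As a small illustration of the counting step, the sketch below applies the same `groupby(...).size()` pattern to four made-up scans. Note that the result only contains lot-hours where at least one scan occurred, which is why the average above is computed over active hours only and why a complete grid is built in the next step.

```python
import pandas as pd

# Four hypothetical scans: three in lot 1, one in lot 2.
scans = pd.DataFrame({
    "lot_number": [1, 1, 1, 2],
    "datetime": pd.to_datetime([
        "2024-01-08 09:05", "2024-01-08 09:40",
        "2024-01-08 10:15", "2024-01-08 09:30",
    ]),
})

scans["date"] = scans["datetime"].dt.date
scans["hour"] = scans["datetime"].dt.hour

# One row per (lot, date, hour) that has at least one scan.
hourly = (
    scans.groupby(["lot_number", "date", "hour"])
    .size()
    .reset_index(name="lpr_scans")
)
print(hourly)
```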


## Step 4: Create Complete Time Grid

In [6]:
print("Creating complete lot-hour grid...")

# Date range
start_date = lpr_all['datetime'].min().date()
end_date = lpr_all['datetime'].max().date()
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Create grid for all lots with LPR data
lots_with_lpr = sorted(lpr_lots)
hours = range(24)

lot_hour_grid = pd.MultiIndex.from_product(
    [lots_with_lpr, date_range, hours],
    names=['lot_number', 'date', 'hour']
).to_frame(index=False)

lot_hour_grid['date'] = lot_hour_grid['date'].dt.date

print(f"Lot-hour grid created: {len(lot_hour_grid):,} combinations")
print(f"  {len(lots_with_lpr)} lots")
print(f"  {len(date_range)} days")
print(f"  24 hours per day")

Creating complete lot-hour grid...
Lot-hour grid created: 4,866,240 combinations
  185 lots
  1096 days
  24 hours per day


In [7]:
# Merge LPR scan counts
occupancy_df = lot_hour_grid.merge(
    lpr_hourly,
    on=['lot_number', 'date', 'hour'],
    how='left'
)

# Fill missing scans with 0
occupancy_df['lpr_scans'] = occupancy_df['lpr_scans'].fillna(0).astype(int)

print(f"\nMerged LPR scans into grid")
print(f"Total records: {len(occupancy_df):,}")
print(f"Hours with scans: {(occupancy_df['lpr_scans'] > 0).sum():,}")
print(f"Hours without scans: {(occupancy_df['lpr_scans'] == 0).sum():,}")


Merged LPR scans into grid
Total records: 4,866,240
Hours with scans: 76,550
Hours without scans: 4,789,690
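
The grid size above follows from 185 lots × 1,096 days × 24 hours = 4,866,240 rows. The sketch below reproduces the grid-then-left-merge pattern on a toy scale (2 lots, 2 days), using made-up counts. The `validate` argument is an optional safeguard that is not in the notebook; it raises if either side has duplicate keys.

```python
import pandas as pd

lots = [1, 2]
days = pd.date_range("2024-01-08", "2024-01-09", freq="D")

# Every lot x day x hour combination, whether or not a scan occurred.
grid = pd.MultiIndex.from_product(
    [lots, days, range(24)],
    names=["lot_number", "date", "hour"],
).to_frame(index=False)

# Hypothetical sparse counts: only two active lot-hours.
hourly = pd.DataFrame({
    "lot_number": [1, 2],
    "date": [days[0], days[0]],
    "hour": [9, 9],
    "lpr_scans": [2, 1],
})

full = grid.merge(
    hourly,
    on=["lot_number", "date", "hour"],
    how="left",
    validate="one_to_one",  # optional check: keys are unique on both sides
)

# Lot-hours with no matching row had no scans.
full["lpr_scans"] = full["lpr_scans"].fillna(0).astype(int)
print(len(full), int((full["lpr_scans"] > 0).sum()))
```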


## Step 5: Add Lot Metadata

In [8]:
print("Adding lot metadata (zone, capacity)...")

occupancy_df['Zone'] = occupancy_df['lot_number'].map(lot_to_zone)
occupancy_df['capacity'] = occupancy_df['lot_number'].map(lot_to_capacity)

# Fill missing capacity with estimated values
lots_without_cap = occupancy_df[occupancy_df['capacity'].isna()]['lot_number'].unique()
if len(lots_without_cap) > 0:
    print(f"  Estimating capacity for {len(lots_without_cap)} lots...")
    for lot_num in lots_without_cap:
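        # Use the 95th percentile of scans over active (non-zero) hours as a rough capacity proxy.
        # The grid is mostly zero-scan hours, so a percentile over all hours would collapse to 0.
        lot_scans = occupancy_df[occupancy_df['lot_number'] == lot_num]['lpr_scans']
        active_scans = lot_scans[lot_scans > 0]
        estimated_cap = max(active_scans.quantile(0.95) * 2, 10)  # Heuristic: assume scans are ~50% of capacity, with a floor of 10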
        occupancy_df.loc[occupancy_df['lot_number'] == lot_num, 'capacity'] = estimated_cap

print("  Lot metadata added")

Adding lot metadata (zone, capacity)...
  Estimating capacity for 26 lots...
  Lot metadata added
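
Because the grid consists mostly of zero-scan hours, a percentile taken over every grid hour is dominated by zeros, while one taken over active hours is not. The toy comparison below uses a made-up series (97 idle hours, 3 active hours) to show how different the two can be. The numbers are illustrative only.

```python
import pandas as pd

# Hypothetical lot: 97 hours with no scans, 3 hours with scans.
lot_scans = pd.Series([0] * 97 + [10, 20, 30])

# Percentile over every grid hour: zeros dominate.
q_all = lot_scans.quantile(0.95)

# Percentile over hours in which the lot was actually scanned.
q_active = lot_scans[lot_scans > 0].quantile(0.95)

print(q_all, q_active)
```

Either way, the result is a heuristic proxy for capacity rather than a measured value.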


## Step 6: Add Temporal Features

In [9]:
print("Adding temporal features...")

occupancy_df['date'] = pd.to_datetime(occupancy_df['date'])
occupancy_df['datetime'] = occupancy_df['date'] + pd.to_timedelta(occupancy_df['hour'], unit='h')

occupancy_df['year'] = occupancy_df['datetime'].dt.year
occupancy_df['month'] = occupancy_df['datetime'].dt.month
occupancy_df['day'] = occupancy_df['datetime'].dt.day
occupancy_df['day_of_week'] = occupancy_df['datetime'].dt.dayofweek
occupancy_df['is_weekend'] = (occupancy_df['day_of_week'] >= 5).astype(int)

def categorize_time_of_day(hour):
    if 0 <= hour < 6:
        return 'Late Night'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 22:
        return 'Evening'
    else:
        return 'Night'

occupancy_df['time_of_day'] = occupancy_df['hour'].apply(categorize_time_of_day)
print("  Temporal features added")

Adding temporal features...
  Temporal features added
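
On a grid of several million rows, a row-wise `apply` can be slow. As a sketch of one vectorized alternative (not what the notebook runs), `pd.cut` with left-closed bins reproduces the same five buckets as `categorize_time_of_day`:

```python
import pandas as pd

hours = pd.Series(range(24), name="hour")

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 22), [22, 24)
bins = [0, 6, 12, 18, 22, 24]
labels = ["Late Night", "Morning", "Afternoon", "Evening", "Night"]

time_of_day = pd.cut(hours, bins=bins, labels=labels, right=False).astype(str)
print(time_of_day.iloc[[0, 6, 12, 18, 22]].tolist())
```

The result of `pd.cut` is categorical; the cast to `str` here is only to match the plain strings the original function returns.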


## Step 7: Create Lag Features

In [10]:
print("Creating lag features (previous hours' scans)...")

# Sort by lot and datetime
occupancy_df = occupancy_df.sort_values(['lot_number', 'datetime']).reset_index(drop=True)

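# Create lag features for each lot.
# The grid has one row per lot per hour, so shifting by N rows within a lot is shifting by N hours.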
lag_hours = [1, 2, 3, 24, 168]  # 1h, 2h, 3h, 1 day, 1 week ago

for lag in lag_hours:
    occupancy_df[f'lpr_scans_lag_{lag}h'] = occupancy_df.groupby('lot_number')['lpr_scans'].shift(lag)
    print(f"  Created lag feature: {lag}h ago")

# Fill missing lag values (the first `lag` hours of each lot's series) with 0
for lag in lag_hours:
    occupancy_df[f'lpr_scans_lag_{lag}h'] = occupancy_df[f'lpr_scans_lag_{lag}h'].fillna(0).astype(int)

print("  Lag features created")

Creating lag features (previous hours' scans)...
  Created lag feature: 1h ago
  Created lag feature: 2h ago
  Created lag feature: 3h ago
  Created lag feature: 24h ago
  Created lag feature: 168h ago
  Lag features created
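
A toy example of the per-lot shift may help here. It uses made-up counts for two lots over three consecutive hours and shows that the lag never carries a value across from one lot to the next: the first hour of each lot has no predecessor and is filled with 0.

```python
import pandas as pd

# Two hypothetical lots, three consecutive hours each.
toy = pd.DataFrame({
    "lot_number": [1, 1, 1, 2, 2, 2],
    "datetime": pd.to_datetime([
        "2024-01-08 08:00", "2024-01-08 09:00", "2024-01-08 10:00",
        "2024-01-08 08:00", "2024-01-08 09:00", "2024-01-08 10:00",
    ]),
    "lpr_scans": [5, 0, 7, 1, 2, 3],
})

toy = toy.sort_values(["lot_number", "datetime"]).reset_index(drop=True)

# Shift within each lot so lot 2 never sees lot 1's last value.
toy["lpr_scans_lag_1h"] = (
    toy.groupby("lot_number")["lpr_scans"].shift(1).fillna(0).astype(int)
)
print(toy["lpr_scans_lag_1h"].tolist())
```

This only corresponds to "one hour ago" because every hour is present in the data; with gaps in the time axis, a row shift would no longer be a time shift.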


## Step 8: Merge Calendar Events

In [11]:
print("Merging calendar events...")

games = pd.read_csv('../../data/football_games.csv')
calendar = pd.read_csv('../../data/academic_calendar.csv')

games['Date'] = pd.to_datetime(games['Date'])
calendar['Start_Date'] = pd.to_datetime(calendar['Start_Date']).dt.normalize()
calendar['End_Date'] = pd.to_datetime(calendar['End_Date']).dt.normalize()

game_dates = games['Date'].dt.normalize()
occupancy_df['is_game_day'] = occupancy_df['date'].isin(game_dates).astype(int)

# Calendar events
for event_type in ['Dead_Week', 'Finals_Week', 'Spring_Break', 'Thanksgiving_Break', 'Winter_Break']:
    event_periods = calendar[calendar['Event_Type'] == event_type]
    occupancy_df[f'is_{event_type.lower()}'] = 0
    
    for _, period in event_periods.iterrows():
        mask = (occupancy_df['date'] >= period['Start_Date']) & \
               (occupancy_df['date'] <= period['End_Date'])
        occupancy_df.loc[mask, f'is_{event_type.lower()}'] = 1

occupancy_df['is_any_break'] = (
    occupancy_df['is_spring_break'] |
    occupancy_df['is_thanksgiving_break'] |
    occupancy_df['is_winter_break']
).astype(int)

print("  Calendar events merged")

Merging calendar events...
  Calendar events merged
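
The event flags above mark every date that falls inside a start/end interval, inclusive on both ends. The sketch below shows the same idea with `Series.between` on a made-up break period; the dates are placeholders and are not taken from `academic_calendar.csv`.

```python
import pandas as pd

dates = pd.DataFrame({"date": pd.date_range("2025-03-08", "2025-03-18", freq="D")})

# Hypothetical event period (placeholder dates).
periods = pd.DataFrame({
    "Event_Type": ["Spring_Break"],
    "Start_Date": pd.to_datetime(["2025-03-10"]),
    "End_Date": pd.to_datetime(["2025-03-14"]),
})

# Accumulate a boolean mask over all periods of this event type.
flag = pd.Series(False, index=dates.index)
for row in periods.itertuples(index=False):
    # between() is inclusive on both ends by default.
    flag |= dates["date"].between(row.Start_Date, row.End_Date)

dates["is_spring_break"] = flag.astype(int)
print(int(dates["is_spring_break"].sum()))
```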


## Step 9: Merge Weather Data

In [12]:
print("Merging weather data...")

weather = pd.read_csv('../../data/weather_pullman_2020_2025.csv')
weather['date'] = pd.to_datetime(weather['date']).dt.normalize()

occupancy_df = occupancy_df.merge(weather, on='date', how='left')
print("  Weather data merged")

Merging weather data...
  Weather data merged
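
A daily-to-hourly left merge like this one can silently duplicate rows if the weather file has repeated dates, or leave gaps if some dates are absent. The sketch below shows two optional checks on toy data; `temp_max` is a hypothetical column name, since the weather file's columns are not shown here.

```python
import pandas as pd

# Toy hourly rows spanning two days.
hourly = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-08", "2024-01-08", "2024-01-09"]),
    "hour": [9, 10, 9],
})

# Toy daily weather covering only the first day.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-08"]),
    "temp_max": [2.5],  # hypothetical column name
})

merged = hourly.merge(
    weather,
    on="date",
    how="left",
    validate="many_to_one",  # raises if weather has duplicate dates
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Days that found no weather record.
missing_days = merged.loc[merged["_merge"] == "left_only", "date"].nunique()
print(len(merged), missing_days)
```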


## Step 10: Create Train/Validation/Test Splits

In [13]:
print("Creating train/validation/test splits...")

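# Cutoffs are midnight timestamps, so only the 00:00 hour of each cutoff date
# falls in the earlier split; the remaining hours of that day go to the next split.
train_end = pd.Timestamp('2024-08-31')
val_end = pd.Timestamp('2025-05-31')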

occupancy_train = occupancy_df[occupancy_df['datetime'] <= train_end].copy()
occupancy_val = occupancy_df[(occupancy_df['datetime'] > train_end) & (occupancy_df['datetime'] <= val_end)].copy()
occupancy_test = occupancy_df[occupancy_df['datetime'] > val_end].copy()

print(f"Training: {len(occupancy_train):,} records ({occupancy_train['date'].min()} to {occupancy_train['date'].max()})")
print(f"Validation: {len(occupancy_val):,} records ({occupancy_val['date'].min()} to {occupancy_val['date'].max()})")
print(f"Test: {len(occupancy_test):,} records ({occupancy_test['date'].min()} to {occupancy_test['date'].max()})")

Creating train/validation/test splits...
Training: 3,516,665 records (2022-07-01 00:00:00 to 2024-08-31 00:00:00)
Validation: 1,212,120 records (2024-08-31 00:00:00 to 2025-05-31 00:00:00)
Test: 137,455 records (2025-05-31 00:00:00 to 2025-06-30 00:00:00)
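
The output above shows 2024-08-31 in both the training and validation ranges, and 2025-05-31 in both validation and test. That happens because the cutoffs are compared against hourly timestamps at midnight. If whole-day boundaries are preferred, one option (a sketch, not the notebook's method) is to compare against the start of the following day:

```python
import pandas as pd

# Toy hourly timestamps around the training cutoff.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-08-30 23:00", "2024-08-31 00:00", "2024-08-31 01:00",
        "2024-08-31 23:00", "2024-09-01 00:00",
    ]),
})

train_end = pd.Timestamp("2024-08-31")

# Timestamp comparison: only the 00:00 hour of the cutoff day is included.
by_timestamp = toy["datetime"] <= train_end

# Whole-day comparison: every hour of the cutoff day is included.
by_day = toy["datetime"] < train_end + pd.Timedelta(days=1)

print(int(by_timestamp.sum()), int(by_day.sum()))
```

Switching to the whole-day form would change the record counts reported above, so the two variants should not be mixed across splits.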


## Step 11: Save Lot-Level Data

In [14]:
print("Saving lot-level occupancy data (LPR-based)...")

occupancy_df.to_csv('../../data/processed/occupancy_lot_level_lpr_full.csv', index=False)
occupancy_train.to_csv('../../data/processed/occupancy_lot_level_lpr_train.csv', index=False)
occupancy_val.to_csv('../../data/processed/occupancy_lot_level_lpr_val.csv', index=False)
occupancy_test.to_csv('../../data/processed/occupancy_lot_level_lpr_test.csv', index=False)

print(f"Full lot-level data saved: {len(occupancy_df):,} records")
print(f"Train set saved: {len(occupancy_train):,} records")
print(f"Validation set saved: {len(occupancy_val):,} records")
print(f"Test set saved: {len(occupancy_test):,} records")

Saving lot-level occupancy data (LPR-based)...
Full lot-level data saved: 4,866,240 records
Train set saved: 3,516,665 records
Validation set saved: 1,212,120 records
Test set saved: 137,455 records


## Summary Statistics

In [15]:
print("="*80)
print("LOT-LEVEL DATA SUMMARY (LPR-BASED)")
print("="*80)
print(f"Lots with data: {occupancy_df['lot_number'].nunique()}")
print(f"Average LPR scans per hour: {occupancy_df['lpr_scans'].mean():.2f}")
print(f"Date range: {occupancy_df['date'].min()} to {occupancy_df['date'].max()}")
print("\nTop 10 lots by LPR scan activity:")
print(occupancy_df.groupby('lot_number')['lpr_scans'].mean().sort_values(ascending=False).head(10))

print("\n" + "="*80)
print("DONE!")
print("="*80)
print("\nNext steps:")
print("1. Train lot-level model using LPR scans as target")
print("2. Use lag features for prediction (previous hours' activity)")
print("3. Compare with zone-level model performance")

LOT-LEVEL DATA SUMMARY (LPR-BASED)
Lots with data: 185
Average LPR scans per hour: 0.37
Date range: 2022-07-01 00:00:00 to 2025-06-30 00:00:00

Top 10 lots by LPR scan activity:
lot_number
150    6.818849
71     4.589682
9      4.294936
26     3.942404
124    3.629866
146    3.361694
104    2.461603
1      2.123936
120    1.877775
47     1.646442
Name: lpr_scans, dtype: float64

DONE!

Next steps:
1. Train lot-level model using LPR scans as target
2. Use lag features for prediction (previous hours' activity)
3. Compare with zone-level model performance
