# Extend Enforcement Data to October 2025
This notebook extends the enforcement dataset through October 2025. LPR data ends on 2025-06-30, so the missing LPR scan counts are estimated from historical patterns.
## Data Situation
- **LPR data**: 2022-07-01 to 2025-06-30 ✅
- **Ticket data**: 2018-07-02 to 2025-10-30 ✅
- **AMP data**: 2020-08-10 to 2025-11-02 ✅
## Strategy
For July-October 2025:
1. Use **actual ticket and AMP data** (ground truth enforcement)
2. **Estimate LPR** from historical 2022-2024 patterns
3. Calculate enforcement metrics as usual
4. Add an `lpr_estimated` flag so the model can distinguish estimated LPR counts from actual ones
## Why This Works
- LPR shows **WHEN/WHERE** enforcement patrols occur (timing patterns are assumed to be stable from year to year)
- Tickets show **HOW AGGRESSIVE** enforcement is (may increase over time)
- Historical LPR timing + Actual 2025 tickets = Best estimate of current enforcement risk
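
The estimation idea can be sketched on toy data. The frames `patterns`, `fallback`, and `grid` and their values are illustrative assumptions, not the notebook's real tables, and the month key is omitted for brevity. The merge-based lookup is one possible vectorized way to apply a primary pattern with a coarser fallback; the notebook itself uses a row-by-row loop.

```python
import pandas as pd

# Toy historical averages keyed by zone, day of week and hour (hypothetical values).
patterns = pd.DataFrame({
    "Zone_Name": ["Paid", "Paid"],
    "day_of_week": [0, 1],
    "hour": [9, 9],
    "avg_scans": [4.0, 6.0],
})

# Coarser fallback keyed by zone and hour only.
fallback = pd.DataFrame({
    "Zone_Name": ["Paid"],
    "hour": [9],
    "avg_scans_fallback": [5.0],
})

# Hours in the gap period that need an LPR estimate.
grid = pd.DataFrame({
    "Zone_Name": ["Paid", "Paid", "Paid"],
    "day_of_week": [0, 2, 2],
    "hour": [9, 9, 3],
})

est = grid.merge(patterns, on=["Zone_Name", "day_of_week", "hour"], how="left")
est = est.merge(fallback, on=["Zone_Name", "hour"], how="left")

# Primary pattern first, then the fallback, then zero when nothing matches.
est["lpr_scans"] = est["avg_scans"].fillna(est["avg_scans_fallback"]).fillna(0.0)
est["lpr_estimated"] = True
print(est[["Zone_Name", "day_of_week", "hour", "lpr_scans", "lpr_estimated"]])
```

The first row finds a primary pattern, the second falls back to the zone-hour average, and the third has no history at all and is set to zero.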

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
print("Libraries loaded")

Libraries loaded


## 1. Load Data

In [2]:
# Load all data sources
amp = pd.read_csv('../../data/processed/amp_preprocessed.csv', parse_dates=['Start_Date', 'End_Date'])
lpr = pd.read_csv('../../data/processed/lpr_preprocessed.csv', parse_dates=['Date_Time'])
tickets = pd.read_csv('../../data/processed/tickets_enriched.csv', parse_dates=['Issue_DateTime'])
lot_mapping = pd.read_csv('../../data/lot_mapping_enhanced.csv')
amp_aliases = pd.read_csv('../../data/amp_zone_aliases.csv')
print("Data loaded:")
print(f"  AMP sessions: {len(amp):,}")
print(f"  LPR reads: {len(lpr):,}")
print(f"  Tickets: {len(tickets):,}")
print(f"\nDate ranges:")
print(f"  AMP: {amp['Start_Date'].min().date()} to {amp['End_Date'].max().date()}")
print(f"  LPR: {lpr['Date_Time'].min().date()} to {lpr['Date_Time'].max().date()}")
print(f"  Tickets: {tickets['Issue_DateTime'].min().date()} to {tickets['Issue_DateTime'].max().date()}")
lpr_end_date = lpr['Date_Time'].max().date()
tickets_end_date = tickets['Issue_DateTime'].max().date()
print(f"\n⚠️  LPR gap: {lpr_end_date} to {tickets_end_date}")
gap_days = (tickets_end_date - lpr_end_date).days
print(f"   Need to estimate {gap_days} days of LPR data")

Data loaded:
  AMP sessions: 1,702,867
  LPR reads: 1,780,391
  Tickets: 192,709

Date ranges:
  AMP: 2020-08-10 to 2025-11-02
  LPR: 2022-07-01 to 2025-06-30
  Tickets: 2018-07-02 to 2025-10-30

⚠️  LPR gap: 2025-06-30 to 2025-10-30
   Need to estimate 122 days of LPR data


## 2. Calculate Historical LPR Patterns
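Compute the average number of LPR scans per (zone, day of week, hour, month) combination from 2022-2024 data. The average is taken over the dates that have at least one scan in that combination.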

In [3]:
# LPR data is already preprocessed - each row is one scan
# Add date column for counting unique dates
lpr['date'] = lpr['Date_Time'].dt.date
print(f"Historical LPR data: {lpr['Date_Time'].min().date()} to {lpr['Date_Time'].max().date()}")
print(f"Total LPR records: {len(lpr):,}")
# Filter to historical data only (2022-2024)
lpr_historical = lpr[lpr['Date_Time'] < '2025-01-01'].copy()
print(f"\nHistorical LPR (2022-2024): {len(lpr_historical):,} records")
# Calculate patterns: count scans per zone-day_of_week-hour-month
print("\nCalculating LPR patterns (zone-day_of_week-hour-month)...")
lpr_patterns = lpr_historical.groupby(['Zone_Name', 'day_of_week', 'hour', 'month']).size().to_frame('total_scans')
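# Count the distinct dates with at least one scan in each group.
# Dates with zero scans are not counted, so avg_scans is an average over active dates only.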
date_counts = lpr_historical.groupby(['Zone_Name', 'day_of_week', 'hour', 'month'])['date'].nunique().to_frame('num_dates')
lpr_patterns = lpr_patterns.join(date_counts)
lpr_patterns['avg_scans'] = (lpr_patterns['total_scans'] / lpr_patterns['num_dates']).round(1)
print(f"  Primary patterns: {len(lpr_patterns):,} combinations")
# Fallback patterns (no day_of_week, for sparse data)
print("\nCalculating fallback patterns (zone-hour-month)...")
lpr_patterns_fallback = lpr_historical.groupby(['Zone_Name', 'hour', 'month']).size().to_frame('total_scans')
date_counts_fallback = lpr_historical.groupby(['Zone_Name', 'hour', 'month'])['date'].nunique().to_frame('num_dates')
lpr_patterns_fallback = lpr_patterns_fallback.join(date_counts_fallback)
lpr_patterns_fallback['avg_scans'] = (lpr_patterns_fallback['total_scans'] / lpr_patterns_fallback['num_dates']).round(1)
print(f"  Fallback patterns: {len(lpr_patterns_fallback):,} combinations")
# Show sample patterns
print("\nSample patterns (Paid zone):")
paid_patterns = lpr_patterns[lpr_patterns.index.get_level_values('Zone_Name') == 'Paid'].head(10)
if len(paid_patterns) > 0:
    print(paid_patterns[['num_dates', 'avg_scans']])
else:
    print("  No Paid zone patterns found - trying other zones...")
    # Show any patterns
    print(lpr_patterns.head(10)[['num_dates', 'avg_scans']])

Historical LPR data: 2022-07-01 to 2025-06-30
Total LPR records: 1,780,391

Historical LPR (2022-2024): 1,476,561 records

Calculating LPR patterns (zone-day_of_week-hour-month)...
  Primary patterns: 11,270 combinations

Calculating fallback patterns (zone-hour-month)...
  Fallback patterns: 3,485 combinations

Sample patterns (Paid zone):
                                  num_dates  avg_scans
Zone_Name day_of_week hour month                      
Paid      0           0    1              1        1.0
                           3              2        1.0
                           4              3        1.3
                           5              1        2.0
                           8              2        1.0
                           9              2        2.0
                           10             1        1.0
                           11             2        1.0
                           12             3        1.3
                      1    1              1        1.0
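
Section 5 converts these averages to whole scan counts with `int()`, which truncates toward zero, so averages of 1.3 and 1.9 would both become 1. A minimal sketch of the difference between truncating and rounding, using made-up averages rather than values from the data:

```python
# Hypothetical avg_scans values, chosen only to show the two behaviours.
avg_scans = [1.0, 1.3, 2.0, 2.7]

truncated = [int(v) for v in avg_scans]        # drops the fractional part
rounded = [int(round(v)) for v in avg_scans]   # nearest whole number

print("truncated:", truncated)
print("rounded:  ", rounded)
```

Whether truncation or rounding is preferable depends on how the downstream model uses `lpr_scans`; the notebook itself truncates.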

## 3. Create Extended Time Grid (Through October 2025)

In [4]:
# Extended time range
start_date = pd.Timestamp('2022-07-01').date()
end_date = tickets_end_date
print(f"Extended time range: {start_date} to {end_date}")
# Get ticket zones
ticket_zones = tickets['Zone_Name'].dropna().unique()
print(f"Processing {len(ticket_zones)} zones: {sorted(ticket_zones)}")
# Create time grid
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
hours = range(24)
time_grid = pd.MultiIndex.from_product(
    [date_range, hours],
    names=['date', 'hour']
).to_frame(index=False)
time_grid['datetime'] = time_grid['date'] + pd.to_timedelta(time_grid['hour'], unit='h')
time_grid['day_of_week'] = time_grid['datetime'].dt.dayofweek
time_grid['month'] = time_grid['datetime'].dt.month
time_grid['has_lpr_data'] = time_grid['date'].dt.date <= lpr_end_date
print(f"\nTime grid created: {len(time_grid):,} hourly intervals")
print(f"  With actual LPR data: {time_grid['has_lpr_data'].sum():,}")
print(f"  Using estimated LPR: {(~time_grid['has_lpr_data']).sum():,}")
print(f"\nBreakdown by year:")
time_grid['year'] = time_grid['datetime'].dt.year
for year in sorted(time_grid['year'].unique()):
    year_data = time_grid[time_grid['year'] == year]
    actual_lpr = year_data['has_lpr_data'].sum()
    estimated_lpr = (~year_data['has_lpr_data']).sum()
    print(f"  {year}: {len(year_data):,} intervals ({actual_lpr:,} actual LPR, {estimated_lpr:,} estimated)")

Extended time range: 2022-07-01 to 2025-10-30
Processing 25 zones: ['Apartments', 'Authorized Vehicles Only', 'Blue 1', 'Buisness Parking', 'Crimson 2', 'Disability', 'Gray 1', 'Green 1', 'Green 2', 'Green 3', 'Green 4', 'Green 5', 'Grey 2', 'Orange 4', 'Paid', 'Red 1', 'Red 4', 'Red 5', 'Visitor', 'Yellow 1', 'Yellow 2', 'Yellow 3', 'Yellow 4', 'Yellow 5', 'Yellow5']

Time grid created: 29,232 hourly intervals
  With actual LPR data: 26,304
  Using estimated LPR: 2,928

Breakdown by year:
  2022: 4,416 intervals (4,416 actual LPR, 0 estimated)
  2023: 8,760 intervals (8,760 actual LPR, 0 estimated)
  2024: 8,784 intervals (8,784 actual LPR, 0 estimated)
  2025: 7,272 intervals (4,344 actual LPR, 2,928 estimated)
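
The interval counts above can be cross-checked with plain date arithmetic. This is a small sanity check rather than part of the pipeline; the variable names are illustrative, and `n_zones = 25` is taken from the zone list printed above.

```python
import pandas as pd

# Full grid and the gap period that needs estimated LPR (both inclusive).
days = pd.date_range("2022-07-01", "2025-10-30", freq="D")
gap_days = pd.date_range("2025-07-01", "2025-10-30", freq="D")

total_hours = len(days) * 24
estimated_hours = len(gap_days) * 24

n_zones = 25  # from the zone list printed by the cell above
expected_records = total_hours * n_zones

print(total_hours, estimated_hours, expected_records)
```

The expected record count equals the 730,800 rows reported after the processing loop in section 5.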


## 4. Map AMP Zones

In [5]:
# Map AMP zones to standard names
amp = amp.merge(
    amp_aliases[['AMP_Zone_Name', 'Standard_Zone_Name']],
    left_on='Zone',
    right_on='AMP_Zone_Name',
    how='left'
)
print(f"AMP zones mapped: {amp['Standard_Zone_Name'].notna().sum():,} / {len(amp):,}")

AMP zones mapped: 1,698,169 / 1,702,867
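
The output implies that 4,698 AMP sessions (1,702,867 minus 1,698,169) have no alias. One way to see which raw zone names are unmatched is the `indicator` option of `DataFrame.merge`. The sketch below uses toy frames (`amp_toy`, `aliases_toy`) with invented zone names, since the real alias file's contents are not shown here.

```python
import pandas as pd

# Toy stand-ins for the AMP sessions and the alias table (invented names).
amp_toy = pd.DataFrame({
    "Zone": ["Green 1 Lot", "Paid Garage", "Unknown Lot", "Unknown Lot"],
})
aliases_toy = pd.DataFrame({
    "AMP_Zone_Name": ["Green 1 Lot", "Paid Garage"],
    "Standard_Zone_Name": ["Green 1", "Paid"],
})

merged = amp_toy.merge(
    aliases_toy,
    left_on="Zone",
    right_on="AMP_Zone_Name",
    how="left",
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Raw zone names that found no alias, with how many sessions each affects.
unmapped = merged.loc[merged["_merge"] == "left_only", "Zone"].value_counts()
print(unmapped)
```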


## 5. Process Enforcement Data with LPR Estimation
This is the main processing loop. For each zone and hour:
- Count actual AMP sessions and tickets
- Use actual LPR if available, otherwise estimate from historical patterns
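
For point events such as tickets and LPR scans, the per-hour counts can also be computed without a row-by-row loop. The following is a sketch on toy data of one possible vectorized alternative, not the notebook's method; `tickets_toy` is an invented frame. It does not cover the AMP count, which depends on sessions overlapping an hour rather than starting in it.

```python
import pandas as pd

tickets_toy = pd.DataFrame({
    "Zone_Name": ["Paid", "Paid", "Green 1"],
    "Issue_DateTime": pd.to_datetime([
        "2025-07-01 09:05", "2025-07-01 09:40", "2025-07-01 10:15",
    ]),
})

# Truncate each timestamp to the start of its hour, then count per zone-hour.
tickets_toy["datetime"] = tickets_toy["Issue_DateTime"].dt.floor("h")
hourly = (
    tickets_toy.groupby(["Zone_Name", "datetime"])
    .size()
    .rename("tickets_issued")
    .reset_index()
)
print(hourly)
```

The resulting frame could then be left-joined onto a zone-by-hour grid, filling missing counts with zero.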

In [6]:
print("Processing enforcement data...")
print("This will take 5-10 minutes.\n")
enforcement_list = []
for i, zone in enumerate(ticket_zones, 1):
    if i % 5 == 0:
        print(f"  {i}/{len(ticket_zones)}: {zone}")
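    # Filter data for this zone.
    # Note: AMP and LPR use case-insensitive substring matching on the raw zone
    # columns (not the Standard_Zone_Name mapped in step 4), so a zone whose name
    # is contained in another zone's name will also match that zone's rows.
    zone_amp = amp[amp['Zone'].str.contains(zone, case=False, na=False, regex=False)].copy()
    zone_lpr = lpr[lpr['Zone_Name'].str.contains(zone, case=False, na=False, regex=False)].copy()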
    zone_tickets = tickets[tickets['Zone_Name'] == zone].copy()
    for _, row in time_grid.iterrows():
        hour_start = row['datetime']
        hour_end = hour_start + pd.Timedelta(hours=1)
        day_of_week = row['day_of_week']
        hour = row['hour']
        month = row['month']
        has_lpr = row['has_lpr_data']
        # Count AMP sessions active during this hour
        amp_count = len(zone_amp[
            (zone_amp['Start_Date'] < hour_end) &
            (zone_amp['End_Date'] > hour_start)
        ])
        # Count tickets issued during this hour
        ticket_count = len(zone_tickets[
            (zone_tickets['Issue_DateTime'] >= hour_start) &
            (zone_tickets['Issue_DateTime'] < hour_end)
        ])
        # LPR count: actual or estimated
        if has_lpr:
            # Use actual LPR data
            lpr_count = len(zone_lpr[
                (zone_lpr['Date_Time'] >= hour_start) &
                (zone_lpr['Date_Time'] < hour_end)
            ])
            lpr_estimated = False
        else:
            # Estimate from historical pattern
            pattern_key = (zone, day_of_week, hour, month)
            if pattern_key in lpr_patterns.index:
                lpr_count = int(lpr_patterns.loc[pattern_key, 'avg_scans'])
            else:
                # Try fallback pattern (no day_of_week)
                fallback_key = (zone, hour, month)
                if fallback_key in lpr_patterns_fallback.index:
                    lpr_count = int(lpr_patterns_fallback.loc[fallback_key, 'avg_scans'])
                else:
                    # Use zone-hour average across all months
                    zone_hour_data = lpr_historical[
                        (lpr_historical['Zone_Name'] == zone) &
                        (lpr_historical['hour'] == hour)
                    ]
                    if len(zone_hour_data) > 0:
                        lpr_count = int(zone_hour_data.groupby('date').size().mean())
                    else:
                        lpr_count = 0
            lpr_estimated = True
        # Calculate enforcement metrics
        unpaid_estimate = max(0, lpr_count - amp_count)
        enforcement_rate = ticket_count / unpaid_estimate if unpaid_estimate > 0 else 0
        enforcement_list.append({
            'Zone': zone,
            'date': row['date'],
            'hour': hour,
            'datetime': hour_start,
            'lpr_scans': lpr_count,
            'amp_sessions': amp_count,
            'tickets_issued': ticket_count,
            'unpaid_estimate': unpaid_estimate,
            'enforcement_rate': enforcement_rate,
            'lpr_estimated': lpr_estimated
        })
enforcement_df = pd.DataFrame(enforcement_list)
print(f"\n Enforcement data created: {len(enforcement_df):,} records")
print(f"  Date range: {enforcement_df['date'].min()} to {enforcement_df['date'].max()}")

Processing enforcement data...
This will take 5-10 minutes.

  5/25: Yellow 4
  10/25: Green 1
  15/25: Yellow5
  20/25: Apartments
  25/25: Visitor

 Enforcement data created: 730,800 records
  Date range: 2022-07-01 00:00:00 to 2025-10-30 00:00:00
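
The two derived metrics in the loop are `unpaid_estimate = max(0, lpr_scans - amp_sessions)` and `enforcement_rate = tickets_issued / unpaid_estimate`, with the rate set to 0 when the estimate is 0. A minimal column-wise sketch of the same arithmetic, using invented numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "lpr_scans": [10, 3, 0],
    "amp_sessions": [4, 5, 0],
    "tickets_issued": [3, 1, 0],
})

# Vehicles seen by LPR minus paid sessions, floored at zero.
df["unpaid_estimate"] = (df["lpr_scans"] - df["amp_sessions"]).clip(lower=0)

# Tickets per estimated unpaid vehicle; 0 where there is nothing to divide by.
df["enforcement_rate"] = np.where(
    df["unpaid_estimate"] > 0,
    df["tickets_issued"] / df["unpaid_estimate"].replace(0, np.nan),
    0.0,
)
print(df)
```

Note that the second row has one ticket but a rate of 0, because paid sessions exceed LPR scans in that hour. The same situation can occur in the real data.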


## 6. Verify Data Quality

In [7]:
print("="*80)
print("DATA QUALITY CHECK")
print("="*80)
print(f"\nLPR data breakdown:")
print(f"  Actual LPR: {(~enforcement_df['lpr_estimated']).sum():,} records ({(~enforcement_df['lpr_estimated']).sum()/len(enforcement_df)*100:.1f}%)")
print(f"  Estimated LPR: {enforcement_df['lpr_estimated'].sum():,} records ({enforcement_df['lpr_estimated'].sum()/len(enforcement_df)*100:.1f}%)")
print(f"\nTicket counts (all actual data):")
print(f"  With actual LPR: {enforcement_df[~enforcement_df['lpr_estimated']]['tickets_issued'].sum():,}")
print(f"  With estimated LPR: {enforcement_df[enforcement_df['lpr_estimated']]['tickets_issued'].sum():,}")
print(f"  Total tickets: {enforcement_df['tickets_issued'].sum():,}")
print(f"\n2025 breakdown:")
data_2025 = enforcement_df[enforcement_df['datetime'].dt.year == 2025]
print(f"  Total 2025 records: {len(data_2025):,}")
jan_june = data_2025[~data_2025['lpr_estimated']]
july_oct = data_2025[data_2025['lpr_estimated']]
print(f"  Jan-June 2025 (actual LPR): {len(jan_june):,} records, {jan_june['tickets_issued'].sum():,} tickets")
print(f"  July-Oct 2025 (estimated LPR): {len(july_oct):,} records, {july_oct['tickets_issued'].sum():,} tickets")
# Check specific zones
print(f"\nCUE Garage / Paid zone check:")
cue_data = enforcement_df[enforcement_df['Zone'] == 'Paid']
if len(cue_data) > 0:
    cue_2025 = cue_data[cue_data['datetime'].dt.year == 2025]
    cue_estimated = cue_2025[cue_2025['lpr_estimated']]
    print(f"  Total 2025 records: {len(cue_2025):,}")
    print(f"  Estimated LPR records: {len(cue_estimated):,}")
    print(f"  Tickets with estimated LPR: {cue_estimated['tickets_issued'].sum():,}")
else:
    print(f"  No 'Paid' zone found - check zone names")

DATA QUALITY CHECK

LPR data breakdown:
  Actual LPR: 657,600 records (90.0%)
  Estimated LPR: 73,200 records (10.0%)

Ticket counts (all actual data):
  With actual LPR: 46,822
  With estimated LPR: 6,671
  Total tickets: 53,493

2025 breakdown:
  Total 2025 records: 181,800
  Jan-June 2025 (actual LPR): 108,600 records, 8,257 tickets
  July-Oct 2025 (estimated LPR): 73,200 records, 6,671 tickets

CUE Garage / Paid zone check:
  Total 2025 records: 7,272
  Estimated LPR records: 2,928
  Tickets with estimated LPR: 3,520


## 7. Add Temporal Features

In [8]:
# Add temporal features
enforcement_df['year'] = enforcement_df['datetime'].dt.year
enforcement_df['month'] = enforcement_df['datetime'].dt.month
enforcement_df['day_of_week'] = enforcement_df['datetime'].dt.dayofweek
enforcement_df['is_weekend'] = (enforcement_df['day_of_week'] >= 5).astype(int)
def categorize_time_of_day(hour):
    if 0 <= hour < 6:
        return 'Late Night'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 22:
        return 'Evening'
    else:
        return 'Night'
enforcement_df['time_of_day'] = enforcement_df['hour'].apply(categorize_time_of_day)
time_of_day_map = {'Afternoon': 0, 'Evening': 1, 'Late Night': 2, 'Morning': 3, 'Night': 4}
enforcement_df['time_of_day_code'] = enforcement_df['time_of_day'].map(time_of_day_map)
print("Temporal features added")
print(f"\nYear distribution:")
for year in sorted(enforcement_df['year'].unique()):
    year_data = enforcement_df[enforcement_df['year'] == year]
    year_tickets = year_data['tickets_issued'].sum()  # avoid overwriting the `tickets` DataFrame
    estimated = year_data['lpr_estimated'].sum()
    print(f"  {year}: {len(year_data):,} records, {year_tickets:,} tickets ({estimated:,} with estimated LPR)")

Temporal features added

Year distribution:
  2022: 110,400 records, 7,675 tickets (0 with estimated LPR)
  2023: 219,000 records, 15,733 tickets (0 with estimated LPR)
  2024: 219,600 records, 15,157 tickets (0 with estimated LPR)
  2025: 181,800 records, 14,928 tickets (73,200 with estimated LPR)


## 8. Merge Calendar Events

In [9]:
# Load calendar data
games = pd.read_csv('../../data/football_games.csv', parse_dates=['Date'])
calendar = pd.read_csv('../../data/academic_calendar.csv', parse_dates=['Start_Date', 'End_Date'])
# Convert date for merging
enforcement_df['date'] = pd.to_datetime(enforcement_df['date'])
# Mark game days
game_dates = games['Date'].dt.date.unique()
enforcement_df['is_game_day'] = enforcement_df['date'].dt.date.isin(game_dates).astype(int)
# Mark calendar events
for event_type in ['Dead_Week', 'Finals_Week', 'Spring_Break', 'Thanksgiving_Break', 'Winter_Break']:
    event_periods = calendar[calendar['Event_Type'] == event_type]
    enforcement_df[f'is_{event_type.lower()}'] = 0
    for _, period in event_periods.iterrows():
        mask = (enforcement_df['date'] >= period['Start_Date']) & \
               (enforcement_df['date'] <= period['End_Date'])
        enforcement_df.loc[mask, f'is_{event_type.lower()}'] = 1
# Combined break indicator
enforcement_df['is_any_break'] = (
    enforcement_df['is_spring_break'] | 
    enforcement_df['is_thanksgiving_break'] | 
    enforcement_df['is_winter_break']
).astype(int)
print("Calendar events merged")
print(f"  Game day hours: {enforcement_df['is_game_day'].sum():,}")
print(f"  Finals week hours: {enforcement_df['is_finals_week'].sum():,}")
print(f"  Break hours: {enforcement_df['is_any_break'].sum():,}")

Calendar events merged
  Game day hours: 13,800
  Finals week hours: 18,000
  Break hours: 70,800


## 9. Merge Weather Data

In [10]:
# Load weather data
weather = pd.read_csv('../../data/weather_pullman_hourly_2020_2025.csv', parse_dates=['datetime'])
weather['date'] = pd.to_datetime(weather['date']).dt.date
weather['hour'] = weather['hour'].astype(int)
# Merge on date + hour
enforcement_df['date_for_merge'] = enforcement_df['date'].dt.date
enforcement_df = enforcement_df.merge(
    weather, 
    left_on=['date_for_merge', 'hour'], 
    right_on=['date', 'hour'], 
    how='left'
)
# Clean up merge columns
enforcement_df = enforcement_df.drop(columns=['date_for_merge', 'date_y', 'datetime_y'], errors='ignore')
enforcement_df = enforcement_df.rename(columns={'date_x': 'date', 'datetime_x': 'datetime'})
print("Weather data merged")
weather_cols = [col for col in weather.columns if col not in ['date', 'hour', 'datetime', 'year']]
print(f"  Weather features added: {weather_cols}")

Weather data merged
  Weather features added: ['temperature_f', 'precipitation_inches', 'snowfall_inches', 'snow_depth_inches', 'wind_mph', 'weather_code', 'weather_category', 'is_rainy', 'is_snowy', 'is_cold', 'is_hot', 'is_windy', 'is_severe']
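
A left merge like this silently duplicates enforcement rows if the weather table has more than one row for the same date and hour. One possible guard is the `validate` argument of `DataFrame.merge`, sketched here on toy frames (`enf_toy` and `weather_toy` are invented for illustration):

```python
import pandas as pd

enf_toy = pd.DataFrame({
    "Zone": ["Paid", "Green 1"],
    "date": pd.to_datetime(["2025-07-01", "2025-07-01"]).date,
    "hour": [9, 9],
})
weather_toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-07-01"]).date,
    "hour": [9],
    "temperature_f": [71.0],
})

# "m:1" asserts that each (date, hour) key appears at most once in the right table.
merged = enf_toy.merge(weather_toy, on=["date", "hour"], how="left", validate="m:1")

# With a duplicated weather row, the same call raises instead of adding rows.
dup_weather = pd.concat([weather_toy, weather_toy], ignore_index=True)
try:
    enf_toy.merge(dup_weather, on=["date", "hour"], how="left", validate="m:1")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```

In this notebook the record count stays at 730,800 after the merge, which suggests the weather keys are unique.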


## 10. Save Extended Enforcement Data

In [13]:
# Save to file
output_path = '../../data/processed/enforcement_full_extended.csv'
enforcement_df.to_csv(output_path, index=False)
print("="*80)
print("EXTENDED ENFORCEMENT DATA SAVED")
print("="*80)
print(f"\nFile: {output_path}")
print(f"Records: {len(enforcement_df):,}")
print(f"Date range: {enforcement_df['date'].min()} to {enforcement_df['date'].max()}")
print(f"Columns: {len(enforcement_df.columns)}")
print(f"\nData summary:")
print(f"  Total tickets: {enforcement_df['tickets_issued'].sum():,}")
print(f"  Actual LPR: {(~enforcement_df['lpr_estimated']).sum():,} records")
print(f"  Estimated LPR: {enforcement_df['lpr_estimated'].sum():,} records")
print(f"\nOctober 2025 data:")
oct_2025 = enforcement_df[
    (enforcement_df['datetime'] >= '2025-10-01') & 
    (enforcement_df['datetime'] < '2025-11-01')
]
print(f"  Records: {len(oct_2025):,}")
print(f"  Tickets: {oct_2025['tickets_issued'].sum():,}")
print(f"  All using estimated LPR: {oct_2025['lpr_estimated'].all()}")

EXTENDED ENFORCEMENT DATA SAVED

File: ../../data/processed/enforcement_full_extended.csv
Records: 730,800
Date range: 2022-07-01 00:00:00 to 2025-10-30 00:00:00
Columns: 37

Data summary:
  Total tickets: 53,493
  Actual LPR: 657,600 records
  Estimated LPR: 73,200 records

October 2025 data:
  Records: 18,000
  Tickets: 1,742
  All using estimated LPR: True
