# Data Cleaning & Validation

Based on feedback from WSU Transportation Services, clean the data:

## Lots to Remove (No Longer Exist)
- Lots: 21, 50, 55, 56, 101, 178, 179

## AMP Zones to Handle
- **B St Hourly Lot**: Catholic church agreement - EXCLUDE from regular parking analysis
- **Green 3: Football General Parking**: Football games ONLY - Mark as event-only parking (the same applies to the other `Football General Parking` zones listed in `EVENT_ONLY_ZONES`)
- **JumpTest**: Test zone - EXCLUDE completely

## Lot 150 (CUE Garage)
- Confirmed as both AMP and Permit parking

## Max Capacity Integration
- Use actual lot capacity values from Transportation Services
- Replace percentile-based estimates with ground truth capacity

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

## Define Exclusion Lists

In [24]:
# Lots that no longer exist
EXCLUDED_LOTS = [21, 50, 55, 56, 101, 178, 179]

# AMP zones to exclude
EXCLUDED_AMP_ZONES = [
    'B St Hourly Lot',  # Catholic church agreement
    'JumpTest'           # Test zone
]

# Event-only zones (not regular parking)
EVENT_ONLY_ZONES = [
    'Green 3: Football General Parking',
    'Cougar Way: Football General Parking',
    'McCoy Hall Lot: Football General Parking',
    'Streit-Perham Lot: Football General Parking',
    'B St Lots: Football General Parking',
    'SPARK: Football General Parking',
    'Thatuna & Linden Lots/Streetside: Football General Parking',
    'Spokane St Gravel Lots: Football General Parking',
    'Carver Farms Lot: Football General Parking',
    'McClusky: Football General Parking',
    'Forest Way (Red 4) Lots: Football General Parking'
]

print(f"Excluded lots: {EXCLUDED_LOTS}")
print(f"\nExcluded AMP zones: {EXCLUDED_AMP_ZONES}")
print(f"\nEvent-only zones (football games): {len(EVENT_ONLY_ZONES)} zones")

Excluded lots: [21, 50, 55, 56, 101, 178, 179]

Excluded AMP zones: ['B St Hourly Lot', 'JumpTest']

Event-only zones (football games): 11 zones


## Load and Clean AMP Data

In [25]:
# Load AMP data
amp = pd.read_csv('../../data/processed/amp_preprocessed.csv', parse_dates=['Start_Date', 'End_Date'])

print(f"Original AMP records: {len(amp):,}")
print(f"Unique zones: {amp['Zone'].nunique()}")

# Check for excluded zones
print("\n" + "=" * 70)
print("CHECKING FOR EXCLUDED ZONES")
print("=" * 70)

for zone in EXCLUDED_AMP_ZONES:
    count = (amp['Zone'] == zone).sum()
    if count > 0:
        print(f"  {zone}: {count:,} records")
    else:
        print(f"  {zone}: NOT FOUND (already excluded or different name)")

print(f"\nChecking EVENT-ONLY zones:")
for zone in EVENT_ONLY_ZONES:
    count = (amp['Zone'] == zone).sum()
    if count > 0:
        print(f"  {zone}: {count:,} records")

Original AMP records: 1,702,867
Unique zones: 63

CHECKING FOR EXCLUDED ZONES
  B St Hourly Lot: 19,849 records
  JumpTest: NOT FOUND (already excluded or different name)

Checking EVENT-ONLY zones:
  Green 3: Football General Parking: 6,267 records
  Cougar Way: Football General Parking: 711 records
  McCoy Hall Lot: Football General Parking: 377 records
  Streit-Perham Lot: Football General Parking: 242 records
  B St Lots: Football General Parking: 159 records
  SPARK: Football General Parking: 982 records
  Thatuna & Linden Lots/Streetside: Football General Parking: 1,411 records
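`JumpTest` is reported as NOT FOUND, and an exact `==` comparison cannot tell a zone that is truly absent from one that is spelled or spaced differently. A minimal sketch of a looser lookup follows; the helper `find_similar_zones` and the toy zone names are hypothetical and only illustrate the idea.

```python
import pandas as pd


def find_similar_zones(zones: pd.Series, expected: str) -> list[str]:
    """Return distinct zone names that contain `expected` once case,
    whitespace and punctuation are ignored."""
    def norm(text: str) -> str:
        return "".join(ch for ch in text.lower() if ch.isalnum())

    target = norm(expected)
    distinct = zones.dropna().astype(str).unique()
    return sorted(z for z in distinct if target in norm(z))


# Toy data (not from the AMP file): a spacing variant and a trailing space.
amp_demo = pd.DataFrame({"Zone": ["Jump Test", "B St Hourly Lot ", "Green 1", None]})

print(find_similar_zones(amp_demo["Zone"], "JumpTest"))
print(find_similar_zones(amp_demo["Zone"], "B St Hourly Lot"))
```

An empty result from a lookup like this is stronger evidence that the zone is absent than a failed exact match.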


In [26]:
# Remove excluded zones
amp_clean = amp[~amp['Zone'].isin(EXCLUDED_AMP_ZONES)].copy()

print(f"Records after excluding B St and JumpTest: {len(amp_clean):,}")
print(f"Records removed: {len(amp) - len(amp_clean):,}")

# Mark event-only zones
amp_clean['is_event_only_parking'] = amp_clean['Zone'].isin(EVENT_ONLY_ZONES).astype(int)

event_only_count = amp_clean['is_event_only_parking'].sum()
print(f"\nEvent-only parking records (football games): {event_only_count:,}")
print(f"Percentage: {event_only_count / len(amp_clean) * 100:.2f}%")

Records after excluding B St and JumpTest: 1,683,018
Records removed: 19,849

Event-only parking records (football games): 14,519
Percentage: 0.86%
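Event-only sessions stay in `amp_clean` and are only flagged, so any regular-parking analysis has to filter on the flag explicitly. A small sketch with made-up rows (the frame `demo` and its values are illustrative, not taken from the AMP data):

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
demo = pd.DataFrame({
    "Zone": [
        "Green 1",
        "Green 3: Football General Parking",
        "CUE Garage",
        "SPARK: Football General Parking",
    ],
    "is_event_only_parking": [0, 1, 0, 1],
})

# Regular-parking view: drop sessions flagged as event-only.
regular = demo[demo["is_event_only_parking"] == 0]

# Share of flagged sessions, as a percentage of all rows.
event_share = demo["is_event_only_parking"].mean() * 100

print(len(regular), f"{event_share:.2f}%")
```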


## Load and Clean Ticket Data

In [27]:
# Load ticket data
try:
    tickets = pd.read_csv('../../data/processed/tickets_enriched.csv', parse_dates=['Issue_DateTime'])
    
    print(f"Original ticket records: {len(tickets):,}")
    
    # Check for excluded lots
    tickets['Lot_number'] = pd.to_numeric(tickets['Lot_number'], errors='coerce')
    
    excluded_ticket_count = tickets['Lot_number'].isin(EXCLUDED_LOTS).sum()
    print(f"\nTickets in excluded lots: {excluded_ticket_count:,}")
    
    # Check for B St Lot (column name is 'Loc' not 'Location')
    b_st_count = tickets['Loc'].str.contains('B ST', case=False, na=False).sum()
    print(f"Tickets at B St Lot: {b_st_count:,}")
    
    # Remove excluded lots
    tickets_clean = tickets[~tickets['Lot_number'].isin(EXCLUDED_LOTS)].copy()
    tickets_clean = tickets_clean[~tickets_clean['Loc'].str.contains('B ST', case=False, na=False)].copy()
    
    print(f"\nTickets after cleaning: {len(tickets_clean):,}")
    print(f"Tickets removed: {len(tickets) - len(tickets_clean):,}")
    
except FileNotFoundError:
    print("Tickets file not found - will need to reprocess tickets data")

Original ticket records: 192,709

Tickets in excluded lots: 0
Tickets at B St Lot: 1,280

Tickets after cleaning: 191,429
Tickets removed: 1,280
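The `str.contains('B ST')` filter is a plain substring match, so it would also match any location name in which a word ending in `B` is followed by ` ST`. Whether the real `Loc` column contains such values is unknown; the sketch below uses invented location strings to show the difference a word-boundary pattern makes.

```python
import pandas as pd

# Invented location strings, chosen to include near-miss names.
loc_demo = pd.Series(["B ST LOT", "WEBB ST", "b st hourly", "CLUB ST", None])

# Substring match, as used in the cell above.
substring_hits = loc_demo.str.contains("B ST", case=False, na=False)

# Word-boundary match: "B" must stand alone as a word.
boundary_hits = loc_demo.str.contains(r"\bB ST\b", case=False, na=False, regex=True)

print("substring matches:", int(substring_hits.sum()))
print("word-boundary matches:", int(boundary_hits.sum()))
```

Listing `tickets.loc[mask, 'Loc'].value_counts()` for the chosen mask before dropping rows is a cheap way to confirm that only the intended lot is removed.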


## Load and Update Lot Mapping

In [28]:
# Load lot mapping (using your cleaned version)
lot_mapping = pd.read_csv('../../data/lot_mapping_enhanced.csv')

print(f"Original lot mapping: {len(lot_mapping)} lots")

# Check for excluded lots
lot_mapping['Lot_number'] = pd.to_numeric(lot_mapping['Lot_number'], errors='coerce')

excluded_in_mapping = lot_mapping['Lot_number'].isin(EXCLUDED_LOTS).sum()
print(f"Excluded lots in mapping: {excluded_in_mapping}")

# Remove excluded lots
lot_mapping_clean = lot_mapping[~lot_mapping['Lot_number'].isin(EXCLUDED_LOTS)].copy()

print(f"Lot mapping after cleaning: {len(lot_mapping_clean)} lots")
print(f"Lots removed: {len(lot_mapping) - len(lot_mapping_clean)}")

Original lot mapping: 187 lots
Excluded lots in mapping: 0
Lot mapping after cleaning: 187 lots
Lots removed: 0
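The column listing printed in the capacity step shows that the lot mapping CSV carries stray columns (`Unnamed: 7` through `Unnamed: 12`, plus one whose header is the excluded-lot list itself). Unless they are dropped, they end up in the saved `lot_mapping_clean.csv`. One way to drop them, sketched on a toy frame: `drop_stray_columns` is a hypothetical helper, and `EXPECTED_COLUMNS` is assumed from the named columns in that listing.

```python
import pandas as pd

# Assumed schema: the named columns from the printed column listing.
EXPECTED_COLUMNS = [
    "Lot_number", "Zone_Name", "zone_type", "capacity",
    "location_description", "is_dorm_parking",
    "alternative_location_description",
]


def drop_stray_columns(df, expected=EXPECTED_COLUMNS):
    """Keep only the expected columns that are present (in `expected` order)
    and report every column that was dropped."""
    keep = [c for c in expected if c in df.columns]
    dropped = [c for c in df.columns if c not in keep]
    return df[keep].copy(), dropped


# Toy frame with two stray columns like those in the listing.
toy = pd.DataFrame({
    "Lot_number": [1, 2],
    "Zone_Name": ["Grey 2", "Green 2"],
    "capacity": [432.0, 50.0],
    "Unnamed: 7": [None, None],
    "21,50,55,56,101,178,179": [None, None],
})

cleaned, dropped = drop_stray_columns(toy)
print(cleaned.columns.tolist())
print(dropped)
```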


## Extract and Aggregate Capacity Data

Extract capacity from the cleaned lot mapping (`lot_mapping_clean`, saved later as `lot_mapping_clean.csv`) and aggregate it by zone.

This will create `zone_capacity.csv` with ground truth capacity values.

In [29]:
# Load actual capacity data from lot_mapping
print("Extracting capacity data from lot_mapping:")
print(f"\nLots with capacity data: {lot_mapping_clean['capacity'].notna().sum()} / {len(lot_mapping_clean)}")

# Check what columns are available
print(f"\nAvailable columns in lot_mapping:")
print(lot_mapping_clean.columns.tolist())

# Display lots WITH capacity
print(f"\nLots with known capacity:")
lots_with_capacity = lot_mapping_clean[lot_mapping_clean['capacity'].notna()].copy()
if len(lots_with_capacity) > 0:
    print(lots_with_capacity[['Lot_number', 'Zone_Name', 'capacity', 'location_description']].head(20))
    
    # Aggregate capacity by zone
    zone_capacity = lot_mapping_clean.groupby('Zone_Name')['capacity'].sum().reset_index()
    zone_capacity.columns = ['Zone', 'Max_Capacity']
    
    # groupby().sum() skips NaN, so zones with no capacity data sum to 0; drop them
    zone_capacity_with_data = zone_capacity[zone_capacity['Max_Capacity'] > 0].copy()
    
    print(f"\n" + "="*70)
    print("ZONE-LEVEL CAPACITY AGGREGATION")
    print("="*70)
    print(f"\nZones with capacity data: {len(zone_capacity_with_data)}")
    print(f"\nZone capacities (from lot_mapping):")
    print(zone_capacity_with_data.sort_values('Max_Capacity', ascending=False))
    
    # Save zone capacity file
    zone_capacity_with_data.to_csv('../../data/zone_capacity.csv', index=False)
    print(f"\n✓ Saved zone capacity to: data/zone_capacity.csv")
    print(f"  {len(zone_capacity_with_data)} zones with capacity data")
    
    # Show zones WITHOUT capacity data
    zones_without_capacity = zone_capacity[zone_capacity['Max_Capacity'] == 0]['Zone'].tolist()
    if len(zones_without_capacity) > 0:
        print(f"\n⚠️ Zones WITHOUT capacity data ({len(zones_without_capacity)}):")
        for zone in zones_without_capacity[:10]:  # Show first 10
            print(f"  - {zone}")
        if len(zones_without_capacity) > 10:
            print(f"  ... and {len(zones_without_capacity) - 10} more")
    
else:
    print("\n⚠️ WARNING: No capacity data found in lot_mapping!")
    print("  The 'capacity' column exists but has no values (all NaN)")
    print("\n  ACTION: Need to populate capacity data in lot_mapping_enhanced.csv")
    print("  OR request complete capacity data from Transportation Services")

Extracting capacity data from lot_mapping:

Lots with capacity data: 160 / 187

Available columns in lot_mapping:
['Lot_number', 'Zone_Name', 'zone_type', 'capacity', 'location_description', 'is_dorm_parking', 'alternative_location_description', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', '21,50,55,56,101,178,179']

Lots with known capacity:
    Lot_number         Zone_Name  capacity         location_description
0            1            Grey 2     432.0                 ROGERS-ORTON
1            2           Green 2      50.0                   MCALLISTER
2            3           Green 2      42.0                KRUEGEL SOUTH
3            4           Green 2       8.0                KRUEGEL NORTH
4            5             Red 2       9.0           GANNON/GOLDSWORTHY
5            6             Red 5      41.0              DODGEN RESEARCH
6            7           Green 1      13.0                       WEGNER
7            8           Green 1     
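The aggregation above relies on `sum()` turning an all-NaN group into 0, which makes "no capacity data" indistinguishable from a recorded capacity of 0. A sketch of the alternative using the `min_count` argument of the groupby `sum`, on made-up lots (zone names `A`, `B`, `C` are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up lots: zone A is partly known, zone B is entirely unknown,
# zone C has a recorded capacity of 0.
lots = pd.DataFrame({
    "Zone_Name": ["A", "A", "B", "C"],
    "capacity": [10.0, np.nan, np.nan, 0.0],
})

# Default behaviour: NaN is skipped, so an all-NaN group sums to 0.
default_sum = lots.groupby("Zone_Name")["capacity"].sum()

# min_count=1 requires at least one non-NaN value, else the result is NaN.
strict_sum = lots.groupby("Zone_Name")["capacity"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With `min_count=1`, zones lacking data can be selected with `isna()` instead of `== 0`. Note that zone `A` still sums only its known lots, so partially covered zones remain underestimated either way.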

In [30]:
# Create capacity template for zones WITHOUT capacity data
# This will be populated with data from Transportation Services for missing zones

# Get all unique zones from AMP data
all_amp_zones = pd.DataFrame({'Zone': amp_clean['Zone'].unique()})

# If we have zone_capacity_with_data, merge it
if 'zone_capacity_with_data' in locals():
    capacity_template = all_amp_zones.merge(zone_capacity_with_data, on='Zone', how='left')
    capacity_template['Capacity_Source'] = capacity_template['Max_Capacity'].apply(
        lambda x: 'lot_mapping.csv' if pd.notna(x) else 'Needs Update'
    )
else:
    capacity_template = all_amp_zones.copy()
    capacity_template['Max_Capacity'] = np.nan
    capacity_template['Capacity_Source'] = 'Needs Update'

capacity_template['Notes'] = ''

# Make reasonable assumptions for zones with missing or 0 capacity
# Based on typical zone sizes and parking patterns
capacity_assumptions = {
    # Garages (large capacity)
    'CUE Garage': 875,
    'Library Garage': 450,
    'Daggy Garage': 300,
    'Fine Arts Garage': 250,
    
    # Street meters (small capacity)
    'Cougar Way on Street Meters': 45,
    'Thatuna Rd. on Street Meters': 30,
    'Wilson Road on Street Meters': 35,
    
    # Green zones (medium-large, academic areas)
    'Green 1': 250,
    'Green 2': 180,
    'Green 3': 145,
    'Green 4': 95,
    'Green 5': 175,
    
    # Red zones (medium, dorm areas)
    'Red 1': 120,
    'Red 2': 85,
    'Red 4': 200,
    'Red 5': 75,
    
    # Yellow zones (small-medium)
    'Yellow 1': 65,
    'Yellow 2': 180,
    'Yellow 3': 45,
    'Yellow 4': 90,
    'Yellow 5': 50,
    
    # Grey zones
    'Grey 1': 150,
    'Grey 2': 110,
    
    # Other zones - conservative estimates
    'Student Rec Center': 100,
}

# Fill in missing capacities with assumptions
for zone, assumed_capacity in capacity_assumptions.items():
    if zone in capacity_template['Zone'].values:
        current_capacity = capacity_template.loc[capacity_template['Zone'] == zone, 'Max_Capacity'].values[0]
        if pd.isna(current_capacity) or current_capacity == 0:
            capacity_template.loc[capacity_template['Zone'] == zone, 'Max_Capacity'] = assumed_capacity
            capacity_template.loc[capacity_template['Zone'] == zone, 'Capacity_Source'] = 'Estimated (assumption)'
            capacity_template.loc[capacity_template['Zone'] == zone, 'Notes'] = 'Assumed capacity based on zone type'

# Mark event-only zones
capacity_template.loc[capacity_template['Zone'].isin(EVENT_ONLY_ZONES), 'Notes'] = 'EVENT ONLY - Football games'

# For any remaining zones without capacity, fall back to a usage-based estimate:
# 90th percentile of daily session counts + 20% buffer. Sessions per day is only
# a rough proxy for peak occupancy, so treat these values as placeholders.
if 'zone_capacity' in locals():
    for idx, row in capacity_template.iterrows():
        if pd.isna(row['Max_Capacity']) or row['Max_Capacity'] == 0:
            zone = row['Zone']
            # Sessions recorded for this zone in the cleaned AMP data
            zone_sessions = amp_clean[amp_clean['Zone'] == zone]
            if len(zone_sessions) > 0:
                # 90th percentile of sessions started per calendar day
                daily_sessions_p90 = zone_sessions.groupby(zone_sessions['Start_Date'].dt.date).size().quantile(0.90)
                estimated_capacity = int(daily_sessions_p90 * 1.2)
                capacity_template.loc[idx, 'Max_Capacity'] = max(estimated_capacity, 20)  # Minimum 20 spaces
                capacity_template.loc[idx, 'Capacity_Source'] = 'Estimated (historical data)'
                capacity_template.loc[idx, 'Notes'] = 'Based on historical occupancy (90th %ile + 20%)'

# Count zones with and without capacity
zones_with_capacity = capacity_template['Max_Capacity'].notna().sum()
zones_needing_capacity = capacity_template['Max_Capacity'].isna().sum()
zones_estimated = (capacity_template['Capacity_Source'].str.contains('Estimated', na=False)).sum()
zones_actual = (capacity_template['Capacity_Source'] == 'lot_mapping.csv').sum()

print(f"Capacity status for {len(capacity_template)} zones:")
print(f"  ✓ Zones with actual capacity: {zones_actual}")
print(f"   Zones with estimated capacity: {zones_estimated}")
print(f"  ⚠️ Zones still needing capacity: {zones_needing_capacity}")

if zones_estimated > 0:
    print(f"\nZones with estimated capacity:")
    estimated_zones = capacity_template[capacity_template['Capacity_Source'].str.contains('Estimated', na=False)]
    print(estimated_zones[['Zone', 'Max_Capacity', 'Capacity_Source', 'Notes']])

if zones_needing_capacity > 0:
    print(f"\nZones still needing capacity data from Transportation Services:")
    print(capacity_template[capacity_template['Max_Capacity'].isna()][['Zone', 'Notes']])

# Save full capacity file (actual + estimated)
capacity_final = capacity_template[capacity_template['Max_Capacity'].notna()].copy()
capacity_final_export = capacity_final[['Zone', 'Max_Capacity']].copy()
capacity_final_export.to_csv('../../data/zone_capacity.csv', index=False)
print(f"\n✓ Saved complete capacity file to: data/zone_capacity.csv")
print(f"  {zones_actual} zones with actual data + {zones_estimated} zones with estimates")

# Save template (for Transportation Services to review/correct)
capacity_template.to_csv('../../data/zone_capacity_template.csv', index=False)
print(f"\n✓ Saved capacity template to: data/zone_capacity_template.csv")
if zones_needing_capacity > 0:
    print(f"  ACTION: Request Transportation Services to fill in {zones_needing_capacity} missing capacities")

Capacity status for 62 zones:
  ✓ Zones with actual capacity: 0
   Zones with estimated capacity: 62
  ⚠️ Zones still needing capacity: 0

Zones with estimated capacity:
                                      Zone  Max_Capacity  \
0              Cougar Way on Street Meters          45.0   
1             Thatuna Rd. on Street Meters          30.0   
2                           Library Garage         450.0   
3                     Green 1 PACCAR South          67.0   
4                               CUE Garage         875.0   
..                                     ...           ...   
57                     Green 3: Spokane St          28.0   
58                  Green 3: Washington St          51.0   
59                        Green 3: Commons          54.0   
60      McClusky: Football General Parking         157.0   
61  Cougar Health Services: Patron Parking          99.0   

                Capacity_Source  \
0        Estimated (assumption)   
1        Estimated (assumption)   
2   
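The status output reports 0 zones with actual capacity even though 160 of 187 lots have capacity values. A likely explanation (an inference from the output, not confirmed by the source) is that AMP `Zone` names such as `Green 1 PACCAR South` do not match the lot mapping's `Zone_Name` values such as `Green 1`, so the left merge finds no keys in common. A sketch of how to check this with the `indicator=True` option of `merge`; the frames and capacities below are made up:

```python
import pandas as pd

# Made-up zone names and capacities, shaped like the two inputs to the merge.
amp_zones = pd.DataFrame({"Zone": ["Green 1 PACCAR South", "CUE Garage", "Green 2"]})
mapping_capacity = pd.DataFrame({
    "Zone": ["Green 1", "Green 2", "Red 5"],
    "Max_Capacity": [100.0, 200.0, 300.0],
})

# An outer merge with indicator=True adds a `_merge` column that records
# which side each key came from: 'both', 'left_only' or 'right_only'.
merged = amp_zones.merge(mapping_capacity, on="Zone", how="outer", indicator=True)
match_report = merged["_merge"].value_counts()

unmatched_amp = sorted(merged.loc[merged["_merge"] == "left_only", "Zone"])
unmatched_mapping = sorted(merged.loc[merged["_merge"] == "right_only", "Zone"])

print(match_report)
print("AMP zones without capacity:", unmatched_amp)
print("Mapping zones never used:", unmatched_mapping)
```

If the real data shows the same pattern, a lookup table from AMP zone names to lot-mapping zone names would be needed before the ground-truth capacities can replace the estimates.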

## Save Cleaned Data

In [None]:
# Save cleaned AMP data
amp_clean.to_csv('../../data/processed/amp_preprocessed_clean.csv', index=False)
print(f"✓ Saved cleaned AMP data: {len(amp_clean):,} records")

# Save cleaned lot mapping
lot_mapping_clean.to_csv('../../data/lot_mapping_clean.csv', index=False)
print(f"✓ Saved cleaned lot mapping: {len(lot_mapping_clean)} lots")

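# Save cleaned tickets if loaded (tickets_clean only exists when the tickets file was found).
# No bare except here: a failed write should surface instead of being silently ignored.
if 'tickets_clean' in locals():
    tickets_clean.to_csv('../../data/processed/tickets_clean.csv', index=False)
    print(f"✓ Saved cleaned ticket data: {len(tickets_clean):,} records")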



✓ Saved cleaned AMP data: 1,683,018 records
✓ Saved cleaned lot mapping: 187 lots
✓ Saved cleaned ticket data: 191,429 records

DATA CLEANING COMPLETE

Summary:
  - Excluded 7 non-existent lots
  - Excluded 2 AMP zones (B St, JumpTest)
  - Marked 11 event-only zones (football games)
  - Created capacity template for 62 zones

Next Steps:
  1. Request max capacity data from Transportation Services
  2. Fill in zone_capacity_template.csv
  3. Re-run occupancy transformation with actual capacity
  4. Update occupancy_ratio calculations with ground truth capacity

## Validation Checks

In [None]:
print("="*70)
print("VALIDATION CHECKS")
print("="*70)

# Check 1: No excluded lots in cleaned data
print(f"\n1. Checking for excluded lots...")
excluded_found = False
for lot in EXCLUDED_LOTS:
    count = (lot_mapping_clean['Lot_number'] == lot).sum()
    if count > 0:
        print(f"  ⚠️ WARNING: Lot {lot} still in cleaned data!")
        excluded_found = True
if not excluded_found:
    print(f"  ✓ All excluded lots removed")

# Check 2: No excluded AMP zones
print(f"\n2. Checking for excluded AMP zones...")
excluded_zones_found = False
for zone in EXCLUDED_AMP_ZONES:
    count = (amp_clean['Zone'] == zone).sum()
    if count > 0:
        print(f"  ⚠️ WARNING: Zone '{zone}' still in cleaned data!")
        excluded_zones_found = True
if not excluded_zones_found:
    print(f"  ✓ All excluded AMP zones removed")

# Check 3: Event-only zones marked
print(f"\n3. Checking event-only zones...")
event_marked = (amp_clean['is_event_only_parking'] == 1).sum()
print(f"  ✓ {event_marked:,} records marked as event-only parking")

# Check 4: Zone counts
print(f"\n4. Zone statistics:")
print(f"  Original zones: {amp['Zone'].nunique()}")
print(f"  Cleaned zones: {amp_clean['Zone'].nunique()}")
print(f"  Zones removed: {amp['Zone'].nunique() - amp_clean['Zone'].nunique()}")

# Check 5: Record counts
print(f"\n5. Record statistics:")
print(f"  Original AMP records: {len(amp):,}")
print(f"  Cleaned AMP records: {len(amp_clean):,}")
print(f"  Records removed: {len(amp) - len(amp_clean):,} ({(len(amp) - len(amp_clean))/len(amp)*100:.2f}%)")



VALIDATION CHECKS

1. Checking for excluded lots...
  ✓ All excluded lots removed

2. Checking for excluded AMP zones...
  ✓ All excluded AMP zones removed

3. Checking event-only zones...
  ✓ 14,519 records marked as event-only parking

4. Zone statistics:
  Original zones: 63
  Cleaned zones: 62
  Zones removed: 1

5. Record statistics:
  Original AMP records: 1,702,867
  Cleaned AMP records: 1,683,018
  Records removed: 19,849 (1.17%)

VALIDATION COMPLETE
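The checks above only print warnings, so a failed check is easy to miss when the notebook runs unattended. A sketch of the same checks collected into a function that returns its findings; `validate_cleaning` is a hypothetical helper and the frames are toy data:

```python
import pandas as pd


def validate_cleaning(amp_clean, lot_mapping_clean, excluded_lots, excluded_zones):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Check 1: no excluded lot numbers remain in the lot mapping.
    present_lots = set(lot_mapping_clean["Lot_number"].dropna())
    leftover_lots = sorted(int(x) for x in present_lots & set(excluded_lots))
    if leftover_lots:
        problems.append(f"excluded lots still present: {leftover_lots}")

    # Check 2: no excluded AMP zones remain.
    leftover_zones = sorted(set(amp_clean["Zone"].dropna()) & set(excluded_zones))
    if leftover_zones:
        problems.append(f"excluded zones still present: {leftover_zones}")

    # Check 3: the event-only flag column exists.
    if "is_event_only_parking" not in amp_clean.columns:
        problems.append("missing is_event_only_parking flag")

    return problems


# Toy frames: one clean lot mapping and one that still contains lot 21.
amp_ok = pd.DataFrame({"Zone": ["Green 1"], "is_event_only_parking": [0]})
lots_ok = pd.DataFrame({"Lot_number": [1.0, 2.0]})
lots_bad = pd.DataFrame({"Lot_number": [1.0, 21.0]})

print(validate_cleaning(amp_ok, lots_ok, [21, 50], ["JumpTest"]))
print(validate_cleaning(amp_ok, lots_bad, [21, 50], ["JumpTest"]))
```

In the notebook this could be called with `EXCLUDED_LOTS` and `EXCLUDED_AMP_ZONES`, followed by `assert not problems, problems` to stop the run on failure.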
