# Data Validation - Occupancy Training Data

This notebook validates occupancy training data to ensure data quality and identify potential issues that could affect model predictions.

**Checks performed:**
- Date range coverage
- Zone availability (all 28 zones have data)
- Occupancy statistics per zone
- Specific time period analysis (e.g., the 4pm peak hour)
- Missing data identification

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("="*80)
print("Occupancy Training Data Validation")
print("="*80)
print()

## Load Occupancy Data

In [None]:
# Load occupancy data
df = pd.read_csv('../../data/processed/occupancy_full_extended.csv', parse_dates=['datetime'])

print("Overall data:")
print(f"  Start date: {df['datetime'].min()}")
print(f"  End date: {df['datetime'].max()}")
print(f"  Total records: {len(df):,}")
print(f"  Unique zones: {df['Zone'].nunique()}")
print()
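
Before running the checks below, it can help to confirm that the loaded frame has the columns and dtypes the later cells rely on. The following is a minimal sketch, not part of the original notebook: `REQUIRED_COLUMNS` and `check_schema` are hypothetical names, and the small `sample` frame stands in for the real CSV.

```python
import pandas as pd

# Columns used by the later cells of this notebook.
REQUIRED_COLUMNS = ["datetime", "Zone", "Occupancy"]

def check_schema(frame):
    """Return a list of schema problems (hypothetical helper; empty list if none)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        # Without the columns, dtype checks are meaningless, so stop here.
        problems.append(f"missing columns: {missing}")
        return problems
    if not pd.api.types.is_datetime64_any_dtype(frame["datetime"]):
        problems.append("'datetime' is not a datetime dtype")
    if not pd.api.types.is_numeric_dtype(frame["Occupancy"]):
        problems.append("'Occupancy' is not numeric")
    return problems

# Synthetic stand-in for the real data (values are made up).
sample = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01 16:00", "2024-01-01 17:00"]),
    "Zone": ["Green 2", "Green 2"],
    "Occupancy": [12, 15],
})
schema_problems = check_schema(sample)
print(schema_problems or "schema OK")
```

In the notebook itself, `check_schema(df)` could be called right after `pd.read_csv` so that a renamed or mis-parsed column fails early instead of in a later cell.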

## Zone-Specific Analysis

In [None]:
# Analyze each zone
zone_summary = df.groupby('Zone').agg({
    'Occupancy': ['count', 'mean', 'min', 'max', 'std'],
    'datetime': ['min', 'max']
}).round(2)

zone_summary.columns = ['_'.join(col).strip() for col in zone_summary.columns.values]
zone_summary = zone_summary.reset_index()

print("Zone Summary Statistics:")
print(zone_summary.to_string(index=False))
print()

## Check Green 2 Zone (Example)

In [None]:
# Check Green 2 specifically
green2 = df[df['Zone'] == 'Green 2']
print("Green 2 zone:")
if len(green2) > 0:
    print(f"  Records: {len(green2):,}")
    print(f"  Date range: {green2['datetime'].min()} to {green2['datetime'].max()}")
    print(f"  Avg occupancy: {green2['Occupancy'].mean():.1f} cars")
    print(f"  Max occupancy: {green2['Occupancy'].max()} cars")
    print(f"  Min occupancy: {green2['Occupancy'].min()} cars")
    print()

    # Check the 4pm hour (timestamps from 16:00 to 16:59)
    green2_4pm = green2[green2['datetime'].dt.hour == 16]
    if len(green2_4pm) > 0:
        print("  During the 4pm hour (16:00-16:59):")
        print(f"    Records: {len(green2_4pm)}")
        print(f"    Avg occupancy: {green2_4pm['Occupancy'].mean():.1f} cars")
        print(f"    Max: {green2_4pm['Occupancy'].max()} cars")
else:
    print("  NO DATA FOUND!")
print()
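
The Green 2 check above can be generalized to any zone and hour. The sketch below assumes the same `Zone`, `Occupancy`, and `datetime` columns; `zone_hour_summary` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def zone_hour_summary(frame, zone, hour):
    """Summarize occupancy for one zone during one hour of day (hypothetical helper).

    Returns None when the zone has no records in that hour.
    """
    mask = (frame["Zone"] == zone) & (frame["datetime"].dt.hour == hour)
    subset = frame[mask]
    if subset.empty:
        return None
    return {
        "records": len(subset),
        "mean": subset["Occupancy"].mean(),
        "max": subset["Occupancy"].max(),
    }

# Synthetic stand-in for the real data.
sample = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-01 16:05", "2024-01-01 16:40", "2024-01-01 17:00"]
    ),
    "Zone": ["Green 2", "Green 2", "Green 2"],
    "Occupancy": [20, 30, 10],
})
summary = zone_hour_summary(sample, "Green 2", 16)
print(summary)
```

Returning `None` for an empty selection keeps the "no data" case explicit, which mirrors the `NO DATA FOUND!` branch above.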

## Zone Data Availability Check

In [None]:
print("="*80)
print("Checking all zones for data availability:")
print("="*80)

zones_with_data = df.groupby('Zone').size().sort_values(ascending=False)
EXPECTED_ZONES = 28  # number of zones that should appear in the training data
print(f"\nZones with occupancy data ({len(zones_with_data)} total, top 10 by record count shown):")
for zone, count in zones_with_data.head(10).items():
    print(f"  {zone}: {count:,} records")

if len(zones_with_data) < EXPECTED_ZONES:
    print(f"\nWARNING: Only {len(zones_with_data)} of {EXPECTED_ZONES} expected zones have data!")
    print("The remaining zones have no training data.")
else:
    print(f"\nAll {len(zones_with_data)} zones have data.")
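
A count comparison shows that zones are missing but not which ones. If a list of expected zone names is available, a set difference names them. This is a sketch under that assumption: `missing_zones` is a hypothetical helper, and the `expected` list below is illustrative (only "Green 2" appears in this notebook; the real list would hold all 28 names).

```python
import pandas as pd

def missing_zones(frame, expected_zones):
    """Return the expected zone names that have no rows in `frame`, sorted."""
    present = set(frame["Zone"].dropna().unique())
    return sorted(set(expected_zones) - present)

# Hypothetical expected list, shortened for illustration.
expected = ["Green 1", "Green 2", "Green 3"]

# Synthetic stand-in for the real data.
sample = pd.DataFrame({
    "Zone": ["Green 2", "Green 2", "Green 1"],
    "Occupancy": [10, 12, 7],
})
absent = missing_zones(sample, expected)
print(f"Zones without data: {absent}")
```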

## Visualize Zone Data Distribution

In [None]:
# Visualize record counts per zone
plt.figure(figsize=(14, 6))
zones_with_data.plot(kind='barh', color='steelblue')
plt.xlabel('Number of Records')
plt.ylabel('Zone')
plt.title('Data Availability by Zone')
plt.tight_layout()
plt.show()

## Occupancy Distribution by Zone

In [None]:
# Box plot of occupancy by zone
plt.figure(figsize=(14, 6))
top_zones = zones_with_data.head(10).index
df_top = df[df['Zone'].isin(top_zones)]
sns.boxplot(data=df_top, x='Zone', y='Occupancy')
plt.xticks(rotation=45, ha='right')
plt.title('Occupancy Distribution - Top 10 Zones by Data Volume')
plt.tight_layout()
plt.show()

## Missing Data Analysis

In [None]:
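# Check for missing values (NaN cells only; absent rows are not detected here)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing_Count'] > 0]
if missing_df.empty:
    print("No missing values in any column.")
else:
    print("Missing values by column:")
    print(missing_df)
print()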

## Time Period Coverage

In [None]:
# Check coverage by hour of day
df['hour'] = df['datetime'].dt.hour
hourly_coverage = df.groupby('hour').size()

plt.figure(figsize=(12, 5))
hourly_coverage.plot(kind='bar', color='teal')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Records')
plt.title('Data Coverage by Hour of Day')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
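
Hourly record counts do not reveal stretches where a zone stopped reporting. One way to look for such gaps is to compare consecutive timestamps within each zone. The sketch below assumes an hourly sampling interval, which this notebook does not state; `find_gaps` and its `expected_step` parameter are hypothetical and should be adjusted to the real interval.

```python
import pandas as pd

def find_gaps(frame, expected_step=pd.Timedelta(hours=1)):
    """Return records that follow a gap longer than `expected_step` in their zone."""
    ordered = frame.sort_values(["Zone", "datetime"])
    # Time since the previous record of the same zone (NaT for each zone's first row).
    delta = ordered.groupby("Zone")["datetime"].diff()
    gaps = ordered.loc[delta > expected_step, ["Zone", "datetime"]].copy()
    gaps["gap"] = delta[gaps.index]
    return gaps.rename(columns={"datetime": "gap_ends_at"})

# Synthetic stand-in: Green 2 has no records between 15:00 and 18:00.
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-01-01 14:00", "2024-01-01 15:00", "2024-01-01 18:00",
        "2024-01-01 14:00", "2024-01-01 15:00",
    ]),
    "Zone": ["Green 2", "Green 2", "Green 2", "Green 1", "Green 1"],
    "Occupancy": [10, 12, 9, 4, 5],
})
gaps = find_gaps(sample)
print(gaps.to_string(index=False))
```

Each returned row marks the first record after a gap, so `gap_ends_at` minus `gap` gives the last timestamp seen before the zone went quiet.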

## Summary

This notebook validates the occupancy training data and flags potential issues. Review the outputs above to confirm that:
- All expected zones have data
- Date ranges are complete
- Occupancy values are reasonable
- No excessive missing data
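
The "reasonable values" check in the list above can be made partly automatic. The sketch below flags negative counts and, optionally, counts above a per-zone capacity. `flag_implausible` is a hypothetical helper, and the capacity figure is invented for illustration; real capacities would have to come from another source.

```python
import pandas as pd

def flag_implausible(frame, capacity=None):
    """Return rows with negative occupancy or occupancy above the zone's capacity.

    `capacity` is an optional dict mapping zone name to maximum car count.
    Zones absent from the dict are only checked for negative values.
    """
    bad = frame["Occupancy"] < 0
    if capacity is not None:
        limit = frame["Zone"].map(capacity)  # NaN where the zone is unmapped
        bad |= (frame["Occupancy"] > limit)  # comparisons with NaN are False
    return frame[bad]

# Synthetic stand-in for the real data.
sample = pd.DataFrame({
    "Zone": ["Green 2", "Green 2", "Green 2"],
    "Occupancy": [10, -3, 250],
})
# Hypothetical capacity, for illustration only.
flagged = flag_implausible(sample, capacity={"Green 2": 200})
print(flagged)
```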