# ParkSense: Full Dataset Preparation

**Goal**: Process the **entire** 2019 dataset for maximum model accuracy.
**Challenge**: The file is large (~3GB). We load only the columns we need, and fall back to optimized dtypes or chunked reading if memory runs out.

**Steps**:
1.  Load Static Data.
2.  Load **ALL** Historical Data (2019).
3.  Clean, Merge, and Feature Engineer.
4.  Export to `processed_parking_data_full.csv`.

In [None]:
import pandas as pd
import numpy as np
import os

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## 1. Load Static Bays

In [None]:
BAYS_PATH = '../data/on-street-parking-bays.csv'
bays_df = pd.read_csv(BAYS_PATH)

# Clean KerbsideID immediately
bays_df = bays_df.dropna(subset=['KerbsideID'])
bays_df['KerbsideID'] = bays_df['KerbsideID'].astype(str).str.replace(r'\.0$', '', regex=True)

print(f"Bays loaded: {bays_df.shape}")
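
Why strip the trailing `.0`? When an integer ID column contains missing values, pandas reads it as float, so a plain `astype(str)` yields `'1234.0'`, which would not match an ID stored as `'1234'` in the other table. A minimal sketch on made-up IDs (the values and the `toy_bays` name are illustrative, not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical toy CSV: the missing ID in row 2 forces the column to float64
csv_text = "KerbsideID,Latitude\n1234,-37.81\n,-37.82\n5678,-37.83\n"
toy_bays = pd.read_csv(io.StringIO(csv_text))

# Same two steps as the cell above: drop missing IDs, then cast to string
raw_ids = toy_bays.dropna(subset=['KerbsideID'])['KerbsideID'].astype(str)
clean_ids = raw_ids.str.replace(r'\.0$', '', regex=True)

print(toy_bays['KerbsideID'].dtype)  # float64
print(raw_ids.tolist())              # ['1234.0', '5678.0']
print(clean_ids.tolist())            # ['1234', '5678']
```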

## 2. Load Full Historical Data (2019)
We read the full CSV in a single call, restricted to the columns we need so that less data is held in RAM. If the load still runs out of memory, the next step is to specify narrower data types or read the file in chunks.

In [None]:
SENSORS_PATH = '../data/On-street_Car_Parking_Sensor_Data_-_2019.csv'

# Columns we definitely need (filtering early saves RAM)
use_cols = ['ArrivalTime', 'DepartureTime', 'BayId', 'DurationMinutes']

print("Loading FULL 2019 dataset... (this may take a minute)")
try:
    sensors_df = pd.read_csv(
        SENSORS_PATH,
        usecols=use_cols,
        parse_dates=['ArrivalTime', 'DepartureTime']
    )
    print("Load successful!")
except Exception as e:
    print(f"Error loading full dataset: {e}")
    raise  # Re-raise: the cells below cannot run without sensors_df

# Rename immediately
sensors_df.rename(columns={
    'ArrivalTime': 'Arrival_Time', 
    'DepartureTime': 'Departure_Time',
    'BayId': 'KerbsideID',
    'DurationMinutes': 'duration_min'
}, inplace=True)

print(f"Full Sensors Shape: {sensors_df.shape}")
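
If the single-shot load runs out of memory, one possible fallback is to read the file in chunks and trim each chunk before concatenating. The sketch below is an assumption about how that could look, not part of the original pipeline: `load_in_chunks` is a hypothetical helper, the chunk size is arbitrary, and the demo runs on a tiny temporary CSV rather than the real 2019 file.

```python
import os
import tempfile
import pandas as pd

use_cols = ['ArrivalTime', 'DepartureTime', 'BayId', 'DurationMinutes']

def load_in_chunks(path, usecols, chunksize=1_000_000):
    """Hypothetical helper: read a CSV chunk by chunk, dropping incomplete rows early."""
    parts = []
    reader = pd.read_csv(
        path,
        usecols=usecols,
        parse_dates=['ArrivalTime', 'DepartureTime'],
        chunksize=chunksize,
    )
    for chunk in reader:
        # Trimming per chunk keeps peak memory lower than loading everything first
        parts.append(chunk.dropna(subset=['ArrivalTime', 'DepartureTime', 'BayId']))
    return pd.concat(parts, ignore_index=True)

# Demo on a made-up four-row file (one row has no arrival time)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    pd.DataFrame({
        'ArrivalTime': ['2019-01-01 08:00:00', '2019-01-01 09:00:00', None, '2019-01-02 10:00:00'],
        'DepartureTime': ['2019-01-01 08:10:00', '2019-01-01 09:45:00',
                          '2019-01-01 11:00:00', '2019-01-02 10:05:00'],
        'BayId': [1, 2, 3, 4],
        'DurationMinutes': [10, 45, 30, 5],
        'Unused': ['a', 'b', 'c', 'd'],
    }).to_csv(demo_path, index=False)
    demo_df = load_in_chunks(demo_path, use_cols, chunksize=2)

print(demo_df.shape)
```

On the real file, the call would be `load_in_chunks(SENSORS_PATH, use_cols)`; the right chunk size depends on available RAM.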

## 3. Cleaning & Merging

In [None]:
# Drop NaNs
sensors_df = sensors_df.dropna(subset=['KerbsideID', 'Arrival_Time', 'Departure_Time'])

# Ensure String ID
sensors_df['KerbsideID'] = sensors_df['KerbsideID'].astype(str).str.replace(r'\.0$', '', regex=True)

# Merge with Bays to get Location
print("Merging datasets...")
merged_df = sensors_df.merge(
    bays_df[['KerbsideID', 'Latitude', 'Longitude']], 
    on='KerbsideID', 
    how='inner'
)

print(f"Merged Shape: {merged_df.shape}")

# Free up memory by deleting the original DataFrames
import gc

del sensors_df
del bays_df
gc.collect()
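
An inner merge silently drops every sensor row whose `KerbsideID` has no match in the bays table, so it can be worth checking how many rows that affects before trusting the merged shape. The following is a minimal sketch on made-up data (the toy frames and IDs are assumptions); it uses the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical toy data: sensor ID '999' has no matching bay
toy_sensors = pd.DataFrame({
    'KerbsideID': ['101', '102', '999'],
    'duration_min': [5, 40, 12],
})
toy_bays = pd.DataFrame({
    'KerbsideID': ['101', '102', '103'],
    'Latitude': [-37.81, -37.82, -37.83],
    'Longitude': [144.96, 144.97, 144.98],
})

# A left merge with indicator=True labels each row 'both' or 'left_only';
# validate='many_to_one' raises if a bay ID appears more than once in toy_bays
check = toy_sensors.merge(
    toy_bays,
    on='KerbsideID',
    how='left',
    indicator=True,
    validate='many_to_one',
)

matched = int((check['_merge'] == 'both').sum())
unmatched = int((check['_merge'] == 'left_only').sum())
print(f"Matched: {matched}, would be dropped by an inner merge: {unmatched}")
```

Because the cell above deletes `sensors_df` and `bays_df`, a check like this would need to run before that cleanup.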

## 4. Feature Engineering

In [6]:
print("Generating features...")
merged_df['hour'] = merged_df['Arrival_Time'].dt.hour
merged_df['day_of_week'] = merged_df['Arrival_Time'].dt.dayofweek
# Vectorized comparison (Saturday=5, Sunday=6); avoids a slow row-wise apply
merged_df['is_weekend'] = (merged_df['day_of_week'] >= 5).astype(int)

# Clean Duration: Convert to numeric, force errors to NaN
merged_df['duration_min'] = pd.to_numeric(merged_df['duration_min'], errors='coerce')
merged_df = merged_df.dropna(subset=['duration_min'])

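# Target creation: 1 if the stay lasted under 15 minutes, i.e. the bay
# was free again within the lookahead window after arrival
LOOKAHEAD_15 = 15
merged_df['is_free_15m'] = (merged_df['duration_min'] < LOOKAHEAD_15).astype(int)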

print("Features created.")

Generating features...
Features created.
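
As a quick sanity check of the feature logic, here is the same set of transformations applied to three made-up rows (the `toy` frame and its values are illustrative only). Note that pandas numbers weekdays from Monday=0, so Saturday is 5.

```python
import pandas as pd

# Hypothetical rows: a Saturday arrival, a Monday arrival, and one bad duration
toy = pd.DataFrame({
    'Arrival_Time': pd.to_datetime([
        '2019-01-05 08:30:00',  # Saturday
        '2019-01-07 17:45:00',  # Monday
        '2019-01-07 18:00:00',  # Monday, unparseable duration
    ]),
    'duration_min': ['10', '90', 'n/a'],
})

toy['hour'] = toy['Arrival_Time'].dt.hour
toy['day_of_week'] = toy['Arrival_Time'].dt.dayofweek
toy['is_weekend'] = (toy['day_of_week'] >= 5).astype(int)

# 'n/a' becomes NaN and the row is dropped, as in the cell above
toy['duration_min'] = pd.to_numeric(toy['duration_min'], errors='coerce')
toy = toy.dropna(subset=['duration_min'])

toy['is_free_15m'] = (toy['duration_min'] < 15).astype(int)
print(toy[['hour', 'day_of_week', 'is_weekend', 'is_free_15m']])
```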


## 5. Export Full Dataset
The merged dataset may contain millions of rows, so we save it to disk for reuse during training. Warning: the output CSV may itself be large (>1GB).

In [7]:
OUTPUT_PATH = '../data/processed_parking_data_full.csv'
print(f"Saving to {OUTPUT_PATH}...")
merged_df.to_csv(OUTPUT_PATH, index=False)
print("Done! Full dataset ready for training.")

Saving to ../data/processed_parking_data_full.csv...
Done! Full dataset ready for training.
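
One thing to keep in mind when loading the exported file for training: CSV does not store dtypes, so a plain `read_csv` would turn the string `KerbsideID` back into integers and leave the timestamps as text. A minimal sketch of a typed reload, using an in-memory stand-in for the exported file (the rows below are made up):

```python
import io
import pandas as pd

# Hypothetical stand-in for a few rows of the processed dataset
exported = pd.DataFrame({
    'Arrival_Time': pd.to_datetime(['2019-01-05 08:30:00', '2019-01-07 17:45:00']),
    'KerbsideID': ['101', '102'],
    'duration_min': [10.0, 90.0],
})
buffer = io.StringIO()
exported.to_csv(buffer, index=False)

# Naive reload: IDs are inferred as integers
buffer.seek(0)
naive = pd.read_csv(buffer)

# Typed reload: IDs stay strings and timestamps are parsed
buffer.seek(0)
typed = pd.read_csv(
    buffer,
    dtype={'KerbsideID': str},
    parse_dates=['Arrival_Time'],
)

print(naive.dtypes)
print(typed.dtypes)
```

With the real file, `buffer` would be replaced by `OUTPUT_PATH`, and `Departure_Time` would be added to `parse_dates`.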
