# EV Charging Data Cleaning & Preparation (v2 - Robust Merge)

**Goal:** Prepare a clean dataset for Neural Network modeling.
**Key Improvement:** Session data is merged with weather data via a left join on date, so every session row is kept even when no weather record matches.

## 1. Load Data
Reading the raw EV session data and the weather data.

In [None]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

# File Paths
SESSION_PATH = '../../data/trondheim/Dataset 1_EV charging reports.csv'
WEATHER_PATH = '../../data/trondheim/Norway_Trondheim_ExactLoc_Weather.csv'
OUT_FILE = 'data/ev_sessions_clean.csv'

# 1. Load Sessions
# Note: European format (semicolon sep, comma decimal)
df_sessions = pd.read_csv(SESSION_PATH, sep=';', decimal=',')
print(f"Sessions Loaded: {df_sessions.shape}")

# 2. Load Weather
# Note: Standard csv
df_weather = pd.read_csv(WEATHER_PATH, low_memory=False)
print(f"Weather Loaded: {df_weather.shape}")

## 2. Preprocessing & Merging
We need to create a common `date` column to join on.
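
As a minimal sketch of that idea on made-up rows: each table derives a `date` key from its own timestamp column, and the weather side is collapsed to one row per date so the join cannot duplicate sessions. The timestamp strings, the explicit `format`, and the several-readings-per-day weather layout are assumptions for illustration, not facts about the Trondheim files.

```python
import pandas as pd

# Hypothetical session rows; the day-first layout is an assumption.
sessions = pd.DataFrame({"Start_plugin": ["21.12.2018 10:20", "22.12.2018 07:05"]})
sessions["date"] = pd.to_datetime(
    sessions["Start_plugin"], format="%d.%m.%Y %H:%M", errors="coerce"
).dt.date

# Hypothetical weather rows with two readings on the first day.
weather = pd.DataFrame({
    "datetime": ["2018-12-21 00:00", "2018-12-21 12:00", "2018-12-22 00:00"],
    "temp": [-2.0, 1.0, -4.0],
})
weather["date"] = pd.to_datetime(weather["datetime"], errors="coerce").dt.date

# Collapse to one row per date. The mean is one option; first() is another.
weather_daily = weather.groupby("date", as_index=False)["temp"].mean()

# Left join: one output row per session, weather attached by date.
merged = sessions.merge(weather_daily, on="date", how="left")
print(merged)
```

Both keys are plain `datetime.date` objects here, which is what `.dt.date` produces, so the join compares like with like.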

In [None]:
# --- PROCESS SESSIONS ---
# Parse start time to get the Date
df_sessions['Start_plugin_dt'] = pd.to_datetime(df_sessions['Start_plugin'], dayfirst=True, errors='coerce')
df_sessions['date'] = df_sessions['Start_plugin_dt'].dt.date

# Drop invalid dates immediately
df_sessions = df_sessions.dropna(subset=['Start_plugin_dt', 'date']).copy()

# --- PROCESS WEATHER ---
# Parse weather date
df_weather['weather_dt'] = pd.to_datetime(df_weather['datetime'], errors='coerce')
df_weather['date'] = df_weather['weather_dt'].dt.date

# Select relevant weather columns
weather_cols = ['date', 'temp', 'precip', 'clouds', 'solar_rad', 'wind_spd']
# intersection with available columns to be safe
available_weather_cols = [c for c in weather_cols if c in df_weather.columns]
df_weather_clean = df_weather[available_weather_cols].copy()

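# Ensure exactly one weather row per date so the join cannot duplicate sessions.
# first() keeps the first non-null value per column for each date; if the file
# holds several readings per day, an aggregate such as mean() may fit better.
df_weather_clean = df_weather_clean.groupby('date').first().reset_index()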

# --- MERGE ---
# Left Join: Keep all sessions, attach weather where possible
df_merged = pd.merge(df_sessions, df_weather_clean, on='date', how='left')

print(f"Merged Shape: {df_merged.shape}")
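# 'temp' is only present if it survived the column intersection above
if 'temp' in df_merged.columns:
    print(f"Missing Weather Rows: {df_merged['temp'].isna().sum()}")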

# Sanity Check
assert len(df_sessions) == len(df_merged), "Error: Row count changed during merge!"
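
The row-count assertion above catches duplication after the fact. A possible complement, sketched here on toy frames, is to let `pd.merge` enforce the expectation itself with `validate='many_to_one'` and to flag unmatched sessions with `indicator=True`. The frames `left`, `right`, and `right_dup` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for sessions (left) and daily weather (right).
left = pd.DataFrame({
    "date": ["2018-12-21", "2018-12-21", "2018-12-23"],
    "El_kWh": [5.0, 7.5, 3.2],
})
right = pd.DataFrame({"date": ["2018-12-21", "2018-12-22"], "temp": [-2.0, -4.0]})

# validate='many_to_one' requires unique keys on the right side;
# indicator=True adds a '_merge' column describing where each row matched.
checked = pd.merge(
    left, right, on="date", how="left", validate="many_to_one", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"Sessions without weather: {len(unmatched)}")

# With a duplicated weather date, the same call raises instead of
# silently inflating the row count.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, right_dup, on="date", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

The `_merge` column also separates sessions with no weather record for their date from sessions whose weather row exists but has a missing `temp` value.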

## 3. Cleaning & Feature Engineering
Parsing end timestamps, recomputing durations, filtering implausible sessions, and creating the `is_short_session` target.

In [None]:
# 1. Parse End Time
df_merged['End_plugout_dt'] = pd.to_datetime(df_merged['End_plugout'], dayfirst=True, errors='coerce')

# 2. Recompute Duration (Validation)
df_merged['Duration_check'] = (df_merged['End_plugout_dt'] - df_merged['Start_plugin_dt']).dt.total_seconds() / 3600.0

# The recomputed value replaces the reported duration, so every later step
# relies only on the parsed start and end timestamps
df_merged['Duration_hours'] = df_merged['Duration_check']

# 3. Filters (physically plausible sessions only)
# - Remove negative, zero, and near-zero durations (<= 0.05h is likely error/testing)
# - Remove El_kWh <= 0
# - Remove durations of 240h (10 days) or more as extreme outliers
# Rows with a missing duration (unparseable End_plugout) fail every comparison and are dropped too
mask_valid = (
    (df_merged['Duration_hours'] > 0.05) &
    (df_merged['El_kWh'] > 0) &
    (df_merged['Duration_hours'] < 240)
)
df_clean = df_merged[mask_valid].copy()

print(f"Rows dropped: {len(df_merged) - len(df_clean)}")
print(f"Final Shape: {df_clean.shape}")

# 4. Feature Engineering
# Cyclical time features
df_clean['hour'] = df_clean['Start_plugin_dt'].dt.hour
df_clean['hour_sin'] = np.sin(2 * np.pi * df_clean['hour'] / 24)
df_clean['hour_cos'] = np.cos(2 * np.pi * df_clean['hour'] / 24)

# Day of week
df_clean['weekday'] = df_clean['Start_plugin_dt'].dt.dayofweek

# 5. Target Creation
# Binary Classification Target: is_short_session
# 1 = Short (< 24h), 0 = Long (>= 24h)
df_clean['is_short_session'] = (df_clean['Duration_hours'] < 24).astype(int)

print("Class Distribution:")
print(df_clean['is_short_session'].value_counts(normalize=True))
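
To illustrate why the hour is encoded as a sine/cosine pair instead of being used as a raw integer, the sketch below compares distances between hours in the encoded space. The helper `encode_hour` is a hypothetical name introduced only for this example. It applies the same formula as the cell above.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# As raw integers, 23:00 and 00:00 look far apart.
raw_gap = abs(23 - 0)

# On the circle, 23:00 -> 00:00 is the same step as 12:00 -> 13:00.
wrapped = np.linalg.norm(encode_hour(23) - encode_hour(0))
adjacent = np.linalg.norm(encode_hour(13) - encode_hour(12))

# Hours half a day apart sit on opposite sides of the circle.
opposite = np.linalg.norm(encode_hour(12) - encode_hour(0))

print(raw_gap, wrapped, adjacent, opposite)
```

The `weekday` column is kept as a raw integer in this notebook. The same encoding with a period of 7 would be one option if the Sunday-to-Monday wrap-around turns out to matter for the model.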

## 4. Save
Saving to `data/ev_sessions_clean.csv`.

In [None]:
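# Create the output directory derived from OUT_FILE instead of hardcoding it
os.makedirs(os.path.dirname(OUT_FILE) or '.', exist_ok=True)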
df_clean.to_csv(OUT_FILE, index=False)
print(f"Saved to {OUT_FILE}")
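
One thing to keep in mind when reloading the file for modeling: CSV stores no dtypes, so datetime columns come back as strings unless they are parsed again. The sketch below shows a round trip through an in-memory buffer on a toy frame. The frame contents are made up, and only the column names mirror the ones created above.

```python
import io
import pandas as pd

# Toy frame mirroring a few of the cleaned columns.
frame = pd.DataFrame({
    "Start_plugin_dt": pd.to_datetime(["2018-12-21 10:20:00", "2018-12-22 07:05:00"]),
    "Duration_hours": [0.3, 26.0],
    "is_short_session": [1, 0],
})

# Write to an in-memory buffer instead of disk to keep the sketch side-effect free.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on read.
restored = pd.read_csv(buffer, parse_dates=["Start_plugin_dt"])
print(restored.dtypes)
```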