# 01_preprocessing_and_features

**Purpose:** Load the raw CSV, run data quality checks, clean the data, and create features (lags, time features). Save the cleaned feature file for modeling.

Notes:
- The raw CSV is expected at `data/extended_data_v2.csv` (update `RAW_PATH` if it lives elsewhere).
- This notebook produces `processed/features_ready.csv`.


In [None]:
import os
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Paths
RAW_PATH = Path("data/extended_data_v2.csv")   # change if needed
PROCESSED_DIR = Path("processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
OUT_FEATURES = PROCESSED_DIR / "features_ready.csv"


## Load data
Try the Kaggle input path first (for convenience); otherwise fall back to the local `data/` folder.


In [None]:
# Flexible loading (works on Kaggle and locally)
KAGGLE_PATH = Path("/kaggle/input/finland-afrr-energy-market-and-weather-data/extended_data_v2.csv")
# prefer the Kaggle dataset when it is mounted, else use the local file
raw_path = KAGGLE_PATH if KAGGLE_PATH.exists() else RAW_PATH

print("Loading from:", raw_path)
df = pd.read_csv(raw_path, parse_dates=['datetime'])
print("Loaded shape:", df.shape)
df.head()


## Quick data quality checks
Check basic info, missing values, value ranges and types.


In [None]:
# Basic data checks
df.info()  # prints its report directly and returns None, so no display() wrapper
display(df.describe(include='all').T)
na_counts = df.isna().sum().sort_values(ascending=False)
display(na_counts.head(40))
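
Because the cleaning step reindexes to an hourly grid, and `DataFrame.reindex` raises on an index with duplicated labels, it can help to look at timestamp duplicates and step sizes at this stage as well. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the raw data (made-up values, for illustration only).
toy = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 01:00",
             "2024-01-01 01:00", "2024-01-01 03:00"]
        ),
        "Up": [1.0, 2.0, 2.5, 4.0],
    }
)

# Rows whose timestamp already appeared earlier in the frame.
n_dupes = int(toy["datetime"].duplicated().sum())

# Step sizes between consecutive distinct timestamps; anything other
# than one hour points at a gap in the series.
steps = toy["datetime"].drop_duplicates().sort_values().diff().dropna()
step_counts = steps.value_counts()

print("Duplicate timestamps:", n_dupes)
print(step_counts)
```

On the real data the same two checks would run on `df['datetime']` right after loading.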


## Cleaning plan
- Ensure datetime is parsed and timezone-localized to UTC
- Reindex to a continuous hourly index and forward/backward-fill small gaps
- Convert object columns to numeric where appropriate
- Replace infinities with NaN
- Save a copy of cleaned raw (before feature engineering)


In [None]:
df = df.copy()
# datetime -> index
if 'datetime' in df.columns:
    df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
    # localize to UTC if naive
    if df['datetime'].dt.tz is None:
        df['datetime'] = df['datetime'].dt.tz_localize("UTC")
    df = df.dropna(subset=['datetime']).set_index('datetime').sort_index()
else:
    if not isinstance(df.index, pd.DatetimeIndex):
        raise RuntimeError("No datetime column and index is not DatetimeIndex.")
    if df.index.tz is None:
        df.index = df.index.tz_localize("UTC")

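# reindex hourly and fill gaps
# reindex raises on duplicated labels, so keep the first row per timestamp
df = df[~df.index.duplicated(keep='first')]
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='h', tz='UTC')
missing = full_idx.difference(df.index)
print("Missing hourly timestamps count:", len(missing))
if len(missing) > 0:
    df = df.reindex(full_idx)
    # no limit is set, so gaps of any length (and NaNs already in the data) are filled
    df = df.ffill().bfill()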

# coerce object columns to numeric
for c in df.columns:
    if df[c].dtype == object:
        df[c] = pd.to_numeric(df[c], errors='coerce')

# replace infs
df.replace([np.inf, -np.inf], np.nan, inplace=True)

print("Post-clean shape:", df.shape)
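
The cleaning plan speaks of filling small gaps, while the fill in the cell above is unbounded. If only short gaps should be filled, the `limit` argument of `ffill` can cap how many consecutive rows are filled. This is a minimal sketch on a toy series; `MAX_GAP` is a hypothetical threshold chosen for illustration, not a value from the notebook:

```python
import numpy as np
import pandas as pd

# Toy hourly series with a 1-hour gap and a 4-hour gap (made-up values).
idx = pd.date_range("2024-01-01", periods=8, freq="h", tz="UTC")
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0], index=idx)

MAX_GAP = 2  # hypothetical: fill at most 2 consecutive missing hours

# Forward-fill at most MAX_GAP rows after each valid observation.
filled = s.ffill(limit=MAX_GAP)

print(filled)
print("NaNs left after bounded fill:", int(filled.isna().sum()))
```

Note that chaining `.bfill(limit=MAX_GAP)` afterwards would fill up to another `MAX_GAP` rows from the other side of each gap, so a gap of up to `2 * MAX_GAP` hours would then disappear entirely.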


## Feature engineering
Create:
- target column `Up_next_hour` (`Up` shifted by -1, i.e. the value one hour ahead)
- lag features for selected base columns (lags of 1, 2, 3, 6, 12 and 24 hours)
- cyclic time features (`hour_sin`/`hour_cos`), weekday, weekend flag, month
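
To make the direction of the shifts concrete, here is a small sketch on a made-up five-row frame (the `toy` frame and its values are illustrative assumptions). A negative shift pulls future values back to the current row, which is what a next-hour target needs; a positive shift pushes past values forward, which is what a lag feature needs:

```python
import pandas as pd

# Toy hourly frame with made-up values.
idx = pd.date_range("2024-01-01", periods=5, freq="h", tz="UTC")
toy = pd.DataFrame({"Up": [10.0, 11.0, 12.0, 13.0, 14.0]}, index=idx)

toy["Up_next_hour"] = toy["Up"].shift(-1)  # value one hour ahead (target)
toy["Up_lag_1"] = toy["Up"].shift(1)       # value one hour back (feature)

print(toy)
```

The last row has no target and the first row has no lag, which is why rows with missing target or lag values are dropped later.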


In [None]:
TARGET = "Up"
if TARGET not in df.columns:
    raise ValueError(f"Target '{TARGET}' not in dataset.")

# target
df['Up_next_hour'] = df[TARGET].shift(-1)

# lags
LAGS = [1,2,3,6,12,24]
lag_base = [c for c in [TARGET, "electricity_consumption", "electricity_consumption_forecast",
                        "sp", "Up_Cap", "Down_Cap", "air_temperature", "wind_speed"] if c in df.columns]
for lag in LAGS:
    for c in lag_base:
        df[f"{c}_lag_{lag}"] = df[c].shift(lag)

# time features
df['hour'] = df.index.hour
df['hour_sin'] = np.sin(2*np.pi*df['hour']/24)
df['hour_cos'] = np.cos(2*np.pi*df['hour']/24)
df['day_of_week'] = df.index.weekday
df['is_weekend'] = df['day_of_week'].isin([5,6]).astype(int)
df['month'] = df.index.month
if 'is_public_holiday' in df.columns:
    df['is_public_holiday'] = pd.to_numeric(df['is_public_holiday'], errors='coerce').fillna(0).astype(int)

# drop rows where the target or the shortest-lag features are missing;
# longer lags (up to 24 h) still contain NaNs in the earliest remaining rows
before = df.shape[0]
df = df.dropna(subset=['Up_next_hour'] + [f"{c}_lag_{min(LAGS)}" for c in lag_base], how='any')
after = df.shape[0]
print(f"Dropped {before-after} rows due to lag/target NaNs. Remaining rows: {after}")
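
The sine/cosine pair is used because the raw `hour` value treats 23:00 and 00:00 as far apart, whereas on the unit circle they are neighbours. A quick sketch to check this, using a hypothetical helper `hour_to_cyclic` that applies the same formula as the cell above:

```python
import numpy as np

def hour_to_cyclic(hour):
    """Hypothetical helper: map hour-of-day to (sin, cos) points on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

pts = hour_to_cyclic([0, 12, 23])

# Euclidean distance between encoded hours.
dist_23_0 = np.linalg.norm(pts[2] - pts[0])  # adjacent hours across midnight
dist_12_0 = np.linalg.norm(pts[1] - pts[0])  # opposite sides of the day

print(f"distance 23h vs 0h: {dist_23_0:.3f}")
print(f"distance 12h vs 0h: {dist_12_0:.3f}")
```

In the encoded space 23:00 ends up much closer to 00:00 than 12:00 does, which the integer `hour` column alone cannot express.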


In [None]:
# Save processed features
df.to_csv(OUT_FEATURES)
print("Saved processed features to:", OUT_FEATURES)
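
When the saved file is read back for modeling, the datetime index is stored as the first CSV column and has to be restored explicitly. The sketch below shows one way to do that round trip, using an in-memory buffer and a made-up toy frame instead of the real `processed/features_ready.csv`:

```python
import io
import pandas as pd

# Toy frame with a UTC hourly index (made-up values).
idx = pd.date_range("2024-01-01", periods=3, freq="h", tz="UTC")
toy = pd.DataFrame({"Up": [1.0, 2.0, 3.0]}, index=idx)

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)

# Read back: first column is the index; parse it as UTC timestamps.
loaded = pd.read_csv(buf, index_col=0)
loaded.index = pd.to_datetime(loaded.index, utc=True)

print(loaded.index)
```

With the real file, `buf` would be replaced by the path stored in `OUT_FEATURES`.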
