# Enoda Technical Challenge

## 1. Exploratory Data Analysis

This dataset (https://zenodo.org/records/4549296) contains power measurements and meteorological forecasts for a set of 24 substation power meters in the Swiss distribution grid.

The power measurements are provided as a pickle dataset, which includes:

- for each phase: mean active and reactive power, voltage magnitude, and maximum total harmonic distortion (THD)
- voltage frequency
- the average power over the three phases

The last of these, the average power over the three phases, was used as the target variable in the paper accompanying the dataset.

The meteorological forecasts are provided as a Hierarchical Data Format 5 file, which includes:

- temperature
- global horizontal and normal irradiance (GHI and GNI, respectively)
- relative humidity (RH)
- pressure
- wind speed and direction

In [None]:
from pathlib import Path
import pandas as pd

DATA_FLD = Path("data")

nwp_data = pd.read_hdf(DATA_FLD / "nwp_data.h5","df")
power_data = pd.read_pickle(DATA_FLD / "power_data.p")

In [None]:
nwp_data.head(3)

In [None]:
# reshape data
nwp_cols_to_keep = ["ghi_backwards", "windspeed", "temperature"]
nwp_df = pd.concat([
    pd.DataFrame(nwp_data[col].tolist(), index=nwp_data.index).add_prefix(f"{col}_")
    for col in nwp_cols_to_keep
], axis=1)

In [None]:
nwp_df.head(3)
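The reshape above implies that each cell of `nwp_data` holds a sequence of forecast steps. A minimal sketch of the same expansion on made-up data (the `toy_*` names, the values, and the 3-step length are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for nwp_data: each cell holds a list of forecast steps.
toy_nwp = pd.DataFrame(
    {
        "temperature": [[1.0, 1.5, 2.0], [2.0, 2.5, 3.0]],
        "windspeed": [[4.0, 4.1, 4.2], [5.0, 5.1, 5.2]],
    },
    index=pd.date_range("2018-01-01", periods=2, freq="10min"),
)

# One column per forecast step, prefixed with the variable name.
toy_wide = pd.concat(
    [
        pd.DataFrame(toy_nwp[col].tolist(), index=toy_nwp.index).add_prefix(f"{col}_")
        for col in toy_nwp.columns
    ],
    axis=1,
)
print(toy_wide.columns.tolist())
```

Each list position becomes its own column (`temperature_0`, `temperature_1`, ...), which is why `nwp_df["temperature_0"]` can be plotted directly later on.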

In [None]:
power_data.keys()

In [None]:
power_data_cols = ['freqA', 'P_mean', 'Q_mean', 'V_mean']

In [None]:
power_df = pd.concat([power_data[col].add_suffix(f"_{col}") for col in power_data_cols], axis=1)

In [None]:
power_df.head(3)

In [None]:
substations = ['0307a3cec15787560b7d0ba094f74d1decb2fa72',
       '0f415416ff153479d65f54df3fa9974af46e3a89',
       '1caab5f0e80231e1c6fdefc00edec4fdb6a02c5f',
       '27fbb11689277a30f5db9c71b42e1d3826bd34ff',
       '2ce3e7e1a5365dc54c7f4fc6284f0052397702b9',
       '2d837275047e5fdce39fda42b541dbf6c858a4d1',
       '350c6b9720ebb1e1a04e8f88ad0fa114c2af77b3',
       '39f06481738604cb5916dce15639e380514b99ca',
       '432650b919537d23cb4054fdb85a07eecaa4524c',
       '49228b90116c6075fabcd8a1cf0e48e016373614',
       '4db83178615678a918dfa6a38ae6e23de7a2d39a',
       '5e9c55269b890ad82c8ebbd146ea2a563fe768ce',
       '75d0930659fc8dcdaffed6c60d5871a969a76a87',
       '7bf877fd51c1c6db07c1fb0255eac4540030f28f',
       '7ebc4dd008e424c2510c6581a195524563b00ee9',
       '89819f031b89125c8c4b364317478f078925fe38',
       'a0ab25616dde3d31062ade71f866faa3b1e8e18f',
       'a4656735af4aa0ba2e4758f8d4f6e411cfc55097',
       'a52f9650e9aa3d60e43792eb2574e0e76bb00aaf',
       'b3e1bf5d8d0337b42f972ca11beafea062bd99be',
       'c41c064e0aa78571b028c8673ebe7abd59d0e6d8',
       'c55a669913fe883d9ec913821688656ea8e4c884',
       'da3ac5e45e56e0e2263f39f38c033366f5d1e0c4',
       'fe2245a4afe0afc24d215dd4abd2ffb34610dd27']

aggregations = ['all', 'S1', 'S2', 'S11','S12', 'S21', 'S22']

In [None]:
power_df["all_P_mean"].tail(6*48).plot(figsize=(20, 5))

In [None]:
nwp_df["temperature_0"].tail(6*48).plot(figsize=(20, 5))

## 2. Feature Engineering
- Lags of power mean (previous hour, previous day)
- Moving average of power mean
- Hour of day, day of week, month
- Holiday or not
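
The features listed above can be sketched on a toy 10-minute series. This is only an illustration under assumed names (`toy_power`, `toy_features`, `steps_per_hour`); the point is that the rolling mean is computed on a series shifted by one step, so the current sample never leaks into its own feature.

```python
import pandas as pd

# Toy 10-minute series; values are simply 0, 1, 2, ... for easy checking.
idx = pd.date_range("2018-01-01", periods=12, freq="10min")
toy_power = pd.Series(range(12), index=idx, dtype=float, name="P_mean")

steps_per_hour = 6  # assumes 10-minute sampling

toy_features = pd.DataFrame(
    {
        # value observed exactly one hour earlier
        "lag1hr": toy_power.shift(steps_per_hour),
        # mean over the previous hour, excluding the current sample
        "rolling_mean_1h": toy_power.shift(1).rolling("1h").mean(),
        # calendar indicators taken straight from the DatetimeIndex
        "hour_of_day": idx.hour,
        "day_of_week": idx.dayofweek,
    },
    index=idx,
)
print(toy_features.tail(3))
```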

In [None]:
import datetime as dt

holidays = [dt.datetime.strptime(i, "%d.%m.%Y").date() for i in pd.read_csv(DATA_FLD / "holidays.txt", delimiter="\t").squeeze().values]

In [None]:
pd.plotting.autocorrelation_plot(power_df["0307a3cec15787560b7d0ba094f74d1decb2fa72_P_mean"].tail(6*24*30)) # last month

In [None]:
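from typing import Sequence

def create_lag_features_df(df: pd.DataFrame, lags: Sequence[int] = (6, 12, 24, 24 * 7)) -> pd.DataFrame:
    """Return copies of df shifted by each lag (in hours), with prefixed column names."""
    # lags are in hours; the data is sampled every 10 minutes, hence 6 rows per hour
    lagged_frames = [
        df.shift(lag * 6).add_prefix(f'lag{lag}hr_')
        for lag in lags
    ]
    return pd.concat(lagged_frames, axis=1)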

In [None]:
power_df.filter(like="P_mean").filter(like="0307a3cec15787560b7d0ba094f74d1decb2fa72")

#### Select Target Column

In [None]:
Y_COL = "0307a3cec15787560b7d0ba094f74d1decb2fa72_P_mean"

In [None]:
#dataframes
power_rolling_mean = power_df.filter(like="P_mean").filter(like="0307a3cec15787560b7d0ba094f74d1decb2fa72").shift(1).rolling(window="1h").mean().add_suffix("_rolling_mean")
lagged_df = create_lag_features_df(power_df.filter(like="P_mean").filter(like="0307a3cec15787560b7d0ba094f74d1decb2fa72"))

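# calendar features: use the index attributes directly. Wrapping them in pd.Series
# would give them a RangeIndex, which does not align with power_df.index below
# and would turn every value into NaN.
hour_of_day = power_df.index.hour
day_of_week = power_df.index.dayofweek
month = power_df.index.month
holiday_set = set(holidays)  # set membership is faster than scanning a list
is_holiday = [d in holiday_set for d in power_df.index.date]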

extra_features_df = pd.DataFrame(
    {
        "hour_of_day":hour_of_day,
        "day_of_week":day_of_week,
        "month":month,
        "is_holiday":is_holiday,
    },
    index = power_df.index
)
 
features_df = pd.concat([nwp_df, extra_features_df, lagged_df, power_rolling_mean], axis=1) # join weather data, lagged power data and timeseries indicators

In [None]:
features_df.head(3)

In [None]:
selected_features = features_df.columns

## 3. Modelling
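Use LightGBM as a multi-output regressor: one model per forecast step, wrapped in scikit-learn's `MultiOutputRegressor`. The targets are the next 144 ten-minute steps of the selected power column.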

In [None]:
forecast_horizon = 6*24  # 24 hours ahead at 10-min intervals

# Create target columns
leads = range(1, forecast_horizon + 1)

lead_frames = [
        power_df[Y_COL].to_frame().shift(-lead).add_prefix(f'+{lead}_')
        for lead in leads
    ]

targets_df = pd.concat(lead_frames, axis=1).dropna(how="all", axis=1).dropna(how="any", axis=0)

In [None]:
final_features_df = features_df[selected_features].loc[targets_df.index]
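
The target construction above relies on negative shifts. As a small sketch on made-up data (a hypothetical 2-step horizon instead of 144, and illustrative `toy_*` names), each column `+k` holds the value observed `k` steps after the row's timestamp, and the trailing rows without a complete set of targets are dropped:

```python
import pandas as pd

idx = pd.date_range("2018-01-01", periods=6, freq="10min")
toy_y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=idx, name="y")

toy_horizon = 2  # two steps ahead instead of 144, for illustration only

# shift(-lead) pulls the future value back onto the current row
toy_targets = pd.concat(
    [toy_y.shift(-lead).rename(f"+{lead}_y") for lead in range(1, toy_horizon + 1)],
    axis=1,
).dropna(how="any", axis=0)
print(toy_targets)
```

The last `toy_horizon` rows disappear because their future values are unknown, which is why the features are re-indexed with `targets_df.index` above.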

## 4. Modelling

## Train, test, validation split

In [None]:
from sklearn.model_selection import train_test_split

# Chronological split: shuffle=False keeps the time order, so random_state is not needed.
# Note the naming: X_test/y_test are used for hyperparameter tuning, while
# X_val/y_val (the most recent 20%) are held out for the final evaluation.
X_temp, X_val, y_temp, y_val = train_test_split(final_features_df, targets_df, test_size=0.2, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X_temp, y_temp, test_size=0.2, shuffle=False)

## Hyperparameter Tuning (Optuna)

In [None]:
from sklearn.multioutput import MultiOutputRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer, mean_squared_error
import numpy as np

# Define scoring
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

scorer = make_scorer(rmse, greater_is_better=False)  # RMSE is a loss: lower is better

In [None]:
# import optuna

# def objective(trial):
#     param = {
#         "n_estimators": trial.suggest_int("n_estimators", 50, 100),
#         "max_depth": trial.suggest_int("max_depth", -1, 5),
#         "num_leaves": trial.suggest_int("num_leaves", 20, 50),
#         "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
#         "random_state": 42
#     }

#     model = MultiOutputRegressor(LGBMRegressor(**param))
#     model.fit(X_train, y_train)

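#     y_pred = model.predict(X_test)
#     # do not reuse the name `rmse` here: it would shadow the function above
#     return rmse(y_test, y_pred)  # Optuna minimises the objective by default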

In [None]:
# study = optuna.create_study()
# study.optimize(objective, n_trials=50)

In [None]:
# from plotly.io import show

# fig = optuna.visualization.plot_optimization_history(study)
# show(fig)

In [None]:
# max_depth <= 0 means no depth limit in LightGBM
best_params = {'n_estimators': 93, 'max_depth': 0, 'num_leaves': 42, 'learning_rate': 0.06768919270718152}

## 5. Evaluation
Check final metrics by time horizon and feature importance

In [None]:
model = MultiOutputRegressor(LGBMRegressor(**best_params, random_state=42, verbose=-1))

In [None]:
final_model = model.fit(X_train, y_train)

In [None]:
y_pred = final_model.predict(X_val)

In [None]:
rmse(y_val, y_pred)

In [None]:
rmse_s = [np.sqrt(i) for i in mean_squared_error(y_val, y_pred, multioutput="raw_values")]

In [None]:
from sklearn.metrics import mean_absolute_percentage_error as mape

f"{mape(y_val, y_pred):.1%}"

In [None]:
mape_s = mape(y_val, y_pred, multioutput="raw_values")

In [None]:
pd.DataFrame(
    {
        "RMSE": rmse_s,
        "MAPE": [i*100 for i in mape_s],
    },
    index=range(1, forecast_horizon + 1)
).plot(subplots=True, figsize=(15, 5), title="Metrics over the horizon (+10min to +24hr)")
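
As a sanity check on what "metrics by horizon" means, the per-column computation can be sketched with plain NumPy on made-up matrices (rows are forecast origins, columns are horizons; all names and values here are illustrative assumptions):

```python
import numpy as np

# Toy truth/prediction matrices: 3 forecast origins, 2 horizons.
y_true_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_pred_toy = np.array([[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]])

err = y_pred_toy - y_true_toy
rmse_per_horizon = np.sqrt((err ** 2).mean(axis=0))       # one value per column
mape_per_horizon = np.abs(err / y_true_toy).mean(axis=0)  # undefined where y_true is 0

print(rmse_per_horizon, mape_per_horizon)
```

Because MAPE divides by the true value, it can become very large whenever the measured power is close to zero, so it is worth reading alongside RMSE rather than on its own.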

In [None]:
#Feature importance

In [None]:
from lightgbm import plot_importance

ax = plot_importance(final_model.estimators_[5], max_num_features=10)
ax.set_title("+1hr Model Feature Importance")

In [None]:
ax = plot_importance(final_model.estimators_[-1], max_num_features=10)  # assign ax so the title lands on this plot
ax.set_title("+24hr Model Feature Importance")

In [None]:
#Plots (show short vs long forecasting)

In [None]:
y_pred_df = pd.DataFrame(y_pred, index=y_val.index)
y_pred_df.columns = y_val.columns

In [None]:
y_val[f"+144_{Y_COL}"].head(6*24*7).plot(label="true", figsize=(10, 2), title=("+24hr time horizon"))
y_pred_df[f"+144_{Y_COL}"].head(6*24*7).plot(label="predicted")

In [None]:
y_val[f"+6_{Y_COL}"].head(6*24*5).plot(label="true", figsize=(10, 2), title=("+1hr time horizon"))
y_pred_df[f"+6_{Y_COL}"].head(6*24*5).plot(label="predicted")

In [None]:
# Plot as single series

In [None]:
y_pred_df2 = pd.DataFrame(y_pred, index=y_val.index)
y_pred_df2.columns = y_val.columns

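ls = []

# Re-index each horizon's predictions by the timestamp they refer to
for idx, col in enumerate(y_pred_df2.columns):
    shift = idx + 1  # column idx holds the forecast for `shift` steps ahead
    ser = y_pred_df2[col].copy()  # copy so the source frame is left untouched
    ser.index = ser.index + shift * pd.Timedelta("10min")
    ls.append(ser)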

In [None]:
plot_df = pd.concat(ls, axis=1).head(6*24*10)
ax = plot_df.plot(alpha=0.4, legend=False, figsize=(15,5), color="lightgrey")
power_df.loc[plot_df.index, Y_COL].plot(ax=ax, color="black")
ax.set_title("24hr Forecasting (all horizons) vs real data");
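
The re-indexing used for this plot moves each prediction from its forecast origin to the timestamp it refers to. A minimal sketch of the same idea using `Series.shift(freq=...)` on made-up data (the `toy_pred` frame and its `+1`/`+2` columns are assumptions for illustration):

```python
import pandas as pd

idx = pd.date_range("2018-01-01", periods=3, freq="10min")
# Rows are forecast origins; column "+k" is the prediction k steps ahead.
toy_pred = pd.DataFrame({"+1": [1.0, 2.0, 3.0], "+2": [10.0, 20.0, 30.0]}, index=idx)

step = pd.Timedelta(minutes=10)
# shift(freq=...) moves the index, not the values
aligned = pd.concat(
    [toy_pred[col].shift(freq=k * step) for k, col in enumerate(toy_pred.columns, start=1)],
    axis=1,
)
print(aligned)
```

After alignment, each row collects every forecast made for that timestamp, so the columns can be drawn on top of the measured series.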

## 7. Physical Reasoning
Demonstrate how the forecast respects or violates physics
- V = IZ (we don't know Z?)
- S = VI (apparent power)
- P = VI cos(theta) (active power): frequency alone does not give the phase angle, but the measured P and Q do, via tan(theta) = Q/P. Check that the power factor P/S never exceeds 1
- Conservation of energy

Show with plots. Select a single time horizon, say hour ahead.
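
One possible plausibility check from the list above, sketched on made-up numbers: apparent power follows from active and reactive power as S = sqrt(P^2 + Q^2), the power factor is P/S, and a forecast of P can be compared against S after the fact. The column names and the `toy_forecast_p` series are assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up active (P) and reactive (Q) power samples, in the same units.
toy = pd.DataFrame({"P": [3.0, 4.0, 0.0], "Q": [4.0, 0.0, 2.0]})

toy["S"] = np.hypot(toy["P"], toy["Q"])  # apparent power magnitude
toy["pf"] = toy["P"] / toy["S"]          # power factor = cos(theta)
toy["theta_deg"] = np.degrees(np.arctan2(toy["Q"], toy["P"]))

# Retrospective check: a forecast of P larger than the measured S is implausible
toy_forecast_p = pd.Series([2.5, 4.5, 1.0])
toy["violates"] = toy_forecast_p.abs() > toy["S"]
print(toy)
```

By construction the measured power factor cannot exceed 1, so a value above 1 in real data would point to a data-quality issue rather than a physical event.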

In [None]:
pd.concat(ls, axis=1).filter(like="+6_0307a3cec15787560b7d0ba094f74d1decb2fa72").plot()

In [None]:
pd.concat(ls, axis=1)["+6_0307a3cec15787560b7d0ba094f74d1decb2fa72_P_mean"].plot()

In [None]:
predicted_power_1hr_ser = pd.concat(ls, axis=1)["+6_0307a3cec15787560b7d0ba094f74d1decb2fa72_P_mean"]
power_data_df = power_df.loc[predicted_power_1hr_ser.index]

### Check predicted power doesn't exceed the mean of the measured phase powers

In [None]:
phase_df = pd.concat(
    [
        power_data["PA"]["0307a3cec15787560b7d0ba094f74d1decb2fa72"],
        power_data["PB"]["0307a3cec15787560b7d0ba094f74d1decb2fa72"],
        power_data["PC"]["0307a3cec15787560b7d0ba094f74d1decb2fa72"]
    ],
    axis=1,
    keys=["PA", "PB", "PC"],  # name the columns by phase instead of repeating the substation id
).assign(expected_p_mean_from_phases=lambda d: d.mean(axis=1)).loc[predicted_power_1hr_ser.index]

In [None]:
phase_df

In [None]:
pd.concat(
    [phase_df, 
     predicted_power_1hr_ser
    ],
    axis=1
).assign(
    flag= lambda d: d["+6_0307a3cec15787560b7d0ba094f74d1decb2fa72_P_mean"] > d["expected_p_mean_from_phases"]
).astype(
    {"flag": float}
).head(
    6*24*5
).plot(
    subplots=True, 
    figsize=(15, 5), 
    title="Flag: predicted active power exceeds the mean of the measured phase powers"
)

## 8. Anomaly Detection (extra)
In the historic data identify:
- voltage sags or spikes: sudden drops/spikes in voltage (try basic rolling std thresholding) / show frequency and THD for context
- overload conditions: 
- phase imbalance or anomalous switching: compare power/voltage across phases and look for outliers (compare max of phases to median of phases)
- spikes in power: show frequency and temperature for context
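
The rolling-standard-deviation idea for voltage sags and spikes can be sketched as follows. The voltage series is synthetic, and the 2-hour window, `min_periods=6`, and threshold `k` are hypothetical choices that would need tuning on the real data.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=60, freq="10min")
# Synthetic voltage: 230 V with a small ripple and one injected sag.
toy_v = pd.Series(230.0 + 0.1 * np.sin(np.arange(60)), index=idx)
toy_v.iloc[40] = 200.0

# Compare each sample with statistics of the preceding 2 hours (excluding itself)
baseline = toy_v.shift(1).rolling("2h", min_periods=6)
z = (toy_v - baseline.mean()) / baseline.std()

k = 5.0  # hypothetical threshold, to be tuned on real data
toy_anomalies = z.abs() > k
print(toy_v[toy_anomalies])
```

Shifting by one step before rolling keeps the anomalous sample out of its own baseline, so a single sag does not inflate the standard deviation it is judged against.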