**Data Science and AI for Energy Systems** 

Karlsruhe Institute of Technology

Institute of Automation and Applied Informatics

Summer Term 2024

---

# Exercise XIII: "Other ML" applications for energy

**Imports**

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.inspection import DecisionBoundaryDisplay

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, RocCurveDisplay, roc_curve, PrecisionRecallDisplay, precision_recall_curve, precision_score, recall_score, f1_score

import matplotlib.pyplot as plt


In [None]:
rng = np.random.RandomState(0)

## Problem XIII.2 (programming) - Applying Isolation Forests to Time Series Data

**In this programming task, we apply isolation forests for unsupervised anomaly detection.**

**(a) We start by applying an isolation forest to randomly generated data points and visualizing the result, following the [scikit-learn example](https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html).**

**i. Generate the random data points as described in the notebook.**

In [None]:
n_samples, n_outliers = 120, 40


covariance = np.array([[0.5, -0.1], [0.7, 0.4]])
cluster_1 = 0.4 * rng.randn(n_samples, 2) @ covariance + np.array([2, 2])  # general
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])  # spherical


# create uniformly distributed "outliers"
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate([np.ones((2 * n_samples), dtype=int), -np.ones((n_outliers), dtype=int)])



Split the data into a training and test set. You can use ```train_test_split``` from scikit-learn. By setting ```stratify=y``` you can ensure that the train and test set contain the same proportion of outliers.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
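
The effect of ```stratify=y``` can be checked on a toy example. The following is a minimal sketch, assuming placeholder features and the same class balance as above (240 inliers, 40 outliers); the names `X_demo` and `y_demo` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same class balance as above: 240 inliers (+1), 40 outliers (-1)
y_demo = np.concatenate([np.ones(240, dtype=int), -np.ones(40, dtype=int)])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features (assumption)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Share of outliers in the full data and in each split
share_all = np.mean(y_demo == -1)
share_train = np.mean(y_tr == -1)
share_test = np.mean(y_te == -1)
print(f"all: {share_all:.3f}, train: {share_train:.3f}, test: {share_test:.3f}")
```

With stratification, the three shares should agree up to rounding.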

Let's visualize the generated data points.

In [None]:
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")
plt.show()

**ii. Train the isolation forest.**

In [None]:
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

**iii. Visualize the decision boundary.**

The ```score_samples``` method of the scikit-learn isolation forest returns the negative of the anomaly score defined in the paper. We therefore multiply the scores by -1 to obtain the anomaly scores defined in Exercise XIII.1.
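
For reference, the anomaly score from the original isolation forest paper can be written as follows. This assumes that Exercise XIII.1 uses the paper's notation, where $h(x)$ is the path length of a point $x$, $n$ is the subsample size and $H(i)$ is the $i$-th harmonic number.

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}
$$

A point whose expected path length equals the average path length $c(n)$ gets a score of exactly $0.5$; shorter paths give scores closer to $1$.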

In [None]:

scatter = plt.scatter(X[:, 0], X[:, 1], c=clf.score_samples(X) * -1, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.title("Anomaly Scores")
plt.colorbar(scatter)
plt.show()

The decision boundary that is used when calling ```predict()``` is determined by the ```contamination``` parameter. Since we have not set it here, it defaults to ```auto```, which means that the threshold is set at an anomaly score of ```0.5```.

In [None]:


disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)
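scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
# take the legend handles from this class-colored scatter plot
handles, labels = scatter.legend_elements()
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")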
plt.show()
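
The relation between the anomaly score threshold of 0.5 and the output of `predict()` can be verified numerically. This is a minimal sketch on freshly generated toy data; `X_demo` and `forest` are illustrative names, and the data is an assumption, not the data set from above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng_demo = np.random.RandomState(0)
# Toy data (assumption): one tight cluster plus uniformly scattered points
X_demo = np.concatenate([
    0.3 * rng_demo.randn(200, 2),
    rng_demo.uniform(low=-4, high=4, size=(20, 2)),
])

# contamination is left at its default "auto"
forest = IsolationForest(max_samples=100, random_state=0).fit(X_demo)

anomaly_score = -forest.score_samples(X_demo)  # score as defined in the paper
labels = forest.predict(X_demo)                # +1 = inlier, -1 = outlier

flagged_by_score = anomaly_score > 0.5
flagged_by_predict = labels == -1
print("offset_:", forest.offset_)
print("agreement:", np.mean(flagged_by_score == flagged_by_predict))
```

Both ways of flagging points should coincide for every sample.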

We can also plot the ```decision_function``` of the isolation forest. According to the docs, ```decision_function = score_samples - offset_```. The offset is ```-0.5``` if the contamination is set to ```auto```; if the contamination is set to a specific value, the offset is chosen such that we obtain the expected number of outliers in the training data. The decision function then separates outliers from inliers at ```0```.

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)
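scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
# take the legend handles from this class-colored scatter plot
handles, labels = scatter.legend_elements()
disp.ax_.set_title("Path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
# color bar for the plotted decision function surface
plt.colorbar(disp.surface_)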
plt.show()
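
The identity between `decision_function`, `score_samples` and `offset_`, and the effect of an explicit `contamination` value, can be checked with a short sketch. The toy data and the names `forest_auto` and `forest_5pct` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng_demo = np.random.RandomState(1)
X_demo = rng_demo.randn(400, 2)  # toy data (assumption)

forest_auto = IsolationForest(random_state=0).fit(X_demo)
forest_5pct = IsolationForest(contamination=0.05, random_state=0).fit(X_demo)

# decision_function = score_samples - offset_ holds for both settings
for forest in (forest_auto, forest_5pct):
    lhs = forest.decision_function(X_demo)
    rhs = forest.score_samples(X_demo) - forest.offset_
    print(forest.contamination, np.allclose(lhs, rhs), forest.offset_)

# With an explicit contamination, about that share of the training data is flagged
share_flagged = np.mean(forest_5pct.predict(X_demo) == -1)
print("share flagged:", share_flagged)
```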

---

**(b) Now we look into how to apply this algorithm for time series as encountered in energy applications. Specifically, we introduce synthetic anomalies into load data.**

**i. Retrieve the Load data from BWsyncandshare.**

In [None]:
# read the data
load_data = pd.read_csv('data.csv', index_col=0)

In [None]:
load_data.plot(figsize=(20, 5))

**ii. Introduce synthetic anomalies as described in the notebook.**

Here we introduce anomalies by adding or subtracting from the original data at randomly selected points.
Start with large anomalies (e.g. offsets of around 8000) and then reduce the size of the anomalies to see how the isolation forest reacts.

In [None]:
# copy the data to retain the original data
data = load_data.copy()

# randomly select 5% of the data as anomalies (using the seeded rng for reproducibility)
anomaly_indices = rng.choice(len(data), size=int(0.05 * len(data)), replace=False)

# generate random offsets with a random sign for the anomalies
anomaly_offset = rng.choice([-1, 1], size=len(anomaly_indices)) * rng.normal(2500, 800, size=len(anomaly_indices))

# introduce anomalies in the data
data.iloc[anomaly_indices] = (data.iloc[anomaly_indices].values.flatten() + anomaly_offset).reshape(-1, 1)

As the isolation forest is unsupervised, we do not need a ground truth during training. However, we can use the ground truth to evaluate the performance of the isolation forest.

In [None]:
# save the indices of the anomalies as ground_truth
ground_truth = pd.Series(np.zeros(len(data)), index=data.index)
ground_truth.iloc[anomaly_indices] = 1

We split the data into a training and a test set.

In [None]:
# train test split
X_train = data.loc[:'2020']
X_test = data.loc['2020':]
y_train = ground_truth.loc[:'2020']
y_test = ground_truth.loc['2020':]
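
One detail of label-based slicing is worth keeping in mind: if the index were parsed as a pandas `DatetimeIndex` (for example via `parse_dates=True`, which is not done above), partial-string slices such as `:'2020'` and `'2020':` would both include all of 2020 and therefore overlap. The following sketch shows a split with an explicit boundary under that assumption; the toy series and the boundary date are illustrative.

```python
import numpy as np
import pandas as pd

# Toy daily series around the turn of the year (assumption: DatetimeIndex)
index = pd.Timestamp("2019-12-30") + pd.to_timedelta(np.arange(6), unit="D")
series = pd.Series(np.arange(6.0), index=index)

# Partial-string slicing is inclusive, so both slices contain the 2020 rows
overlap = series.loc[:"2020"].index.intersection(series.loc["2020":].index)
print("overlapping rows:", len(overlap))

# Explicit, non-overlapping chronological split
split = pd.Timestamp("2020-01-01")
train = series.loc[series.index < split]
test = series.loc[series.index >= split]
print(len(train), len(test))
```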

**iii. Train an Isolation Forest with the training set and predict on the test set.**

In [None]:
# train the isolation forest
clf = IsolationForest(max_samples=100, random_state=0, contamination=0.05)
clf.fit(X_train)

The ```predict``` method returns ```1``` for normal data and ```-1``` for anomalies. We want to compare the predictions with the ground truth. Therefore we need to convert the predictions to the same format as the ground truth which is ```1``` for anomalies and ```0``` for normal data.

In [None]:
prediction = (clf.predict(X_test) - 1) / - 2 
prediction = pd.Series(prediction, index=X_test.index)
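
The label conversion can be illustrated on a handful of values. This is a small sketch with a made-up `predict` output; it compares the arithmetic mapping used above with an equivalent `np.where` formulation.

```python
import numpy as np

# Made-up output of IsolationForest.predict: +1 = normal, -1 = anomaly
raw = np.array([1, -1, 1, 1, -1])

# Arithmetic mapping as used above: 1 -> 0, -1 -> 1 (result is float)
arithmetic = (raw - 1) / -2

# Equivalent, more explicit mapping (result is integer)
as_anomaly_flag = np.where(raw == -1, 1, 0)

print(arithmetic)
print(as_anomaly_flag)
```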

**iv. Evaluate the predictions.**

Plot a Confusion matrix.

In [None]:
cm = confusion_matrix(y_test, prediction)

cm_display = ConfusionMatrixDisplay(cm).plot()

The F1 score is the harmonic mean of precision and recall, which makes it a useful single metric for evaluating the isolation forest. If we want to emphasize either precision or recall, we can use the F-beta score, a weighted harmonic mean of the two. The beta parameter determines the weight of recall relative to precision: if beta is larger than 1, recall is emphasized; if beta is smaller than 1, precision is emphasized.

In [None]:
# calculate f1 score
f1_score(y_test, prediction)
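
The effect of beta can be seen on a small hand-made example using `fbeta_score` from scikit-learn. The labels below are invented for illustration: they give a precision of 0.5 and a recall of 1/3, so weighting recall more heavily lowers the score.

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score

# Invented labels: 1 true positive, 1 false positive, 2 false negatives
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])

f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)        # emphasizes recall
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # emphasizes precision

print(f"F0.5: {f_half:.3f}, F1: {f1:.3f}, F2: {f2:.3f}")
```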

We can also calculate precision and recall separately.

In [None]:
# calculate precision
print('Precision:', precision_score(y_test, prediction))

# calculate recall
print('Recall:', recall_score(y_test, prediction))

By plotting the ROC curve and the Precision-Recall curve we can see how the isolation forest performs for different thresholds.

In [None]:
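# the decision function is high for normal points, so we negate it:
# a higher y_score then means "more anomalous", matching the positive class 1
y_score = -clf.decision_function(X_test)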

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr)

prec, recall, _ = precision_recall_curve(y_test, y_score)
pr_display = PrecisionRecallDisplay(precision=prec, recall=recall)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))

roc_display.plot(ax=ax1)
pr_display.plot(ax=ax2)
plt.show()
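
Both curves can also be summarized in a single number each. The sketch below uses `roc_auc_score` and `average_precision_score` on synthetic one-dimensional data with large injected offsets; the data and all names are assumptions for illustration, not the load data from above. It also shows why the sign of the score matters.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score

rng_demo = np.random.RandomState(0)

# Synthetic signal (assumption): standard normal values with 5% shifted by +-8
values = rng_demo.normal(0.0, 1.0, size=1000)
truth = np.zeros(1000, dtype=int)
anomaly_pos = rng_demo.choice(1000, size=50, replace=False)
values[anomaly_pos] += rng_demo.choice([-1, 1], size=50) * 8.0
truth[anomaly_pos] = 1

X_demo = values.reshape(-1, 1)
forest = IsolationForest(max_samples=100, contamination=0.05, random_state=0).fit(X_demo)

# Negate the decision function so that higher means "more anomalous"
score = -forest.decision_function(X_demo)
auc = roc_auc_score(truth, score)
ap = average_precision_score(truth, score)

# Without the sign flip the ranking is reversed
auc_wrong_sign = roc_auc_score(truth, forest.decision_function(X_demo))
print(f"ROC AUC: {auc:.3f}, average precision: {ap:.3f}, wrong sign: {auc_wrong_sign:.3f}")
```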

**v. Improve the performance of the isolation forest for time series by giving context via engineered features.**

For anomalies in time series, the context is often important. For example, a change in load that is normal during the day could be an anomaly if it happens during the night. Deviations that are not extreme on a global scale can also be very large relative to their local surroundings. We can therefore engineer features that give the isolation forest more context and improve its performance. Here we start by adding the increments of the load data as a feature.

In [None]:
# include the difference of the load as a feature
data['load_DE_diff'] = data['load_DE'].diff(1)
# data['load_DE_diff2'] = data['load_DE'].diff(2)
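
Why increments help can be seen on a synthetic daily load curve. In this sketch, a spike of 3000 is added near the daily minimum, so the value itself stays well inside the global range while its increment is larger than any increment of the clean signal. The series, the spike position and the additional `hour` feature are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly load with a daily cycle (assumption)
hours = np.arange(96)
index = pd.Timestamp("2020-01-01") + pd.to_timedelta(hours, unit="h")
clean = pd.Series(50000 + 5000 * np.sin(2 * np.pi * hours / 24), index=index)

# Inject a local spike close to the daily minimum
spike_pos = 41
load = clean.copy()
load.iloc[spike_pos] += 3000

features = pd.DataFrame({"load": load})
features["load_diff"] = features["load"].diff(1)  # increment, as above
features["hour"] = features.index.hour            # hypothetical context feature

print(features.iloc[spike_pos])
print("max load:", features["load"].max())
print("max clean increment:", clean.diff().abs().max())
```

The spiked value is not a global extreme, but its increment clearly exceeds the largest increment of the clean signal, which is the kind of contrast an isolation forest can exploit.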

In [None]:
# train test split
X_train = data.loc[:'2020']
X_test = data.loc['2020':]
y_train = ground_truth.loc[:'2020']
y_test = ground_truth.loc['2020':]

# drop the first row as it contains NaN
X_train = X_train.dropna()
y_train = y_train.loc[X_train.index]

# train the isolation forest
clf = IsolationForest(max_samples=100, random_state=0, contamination=0.05)
clf.fit(X_train)

prediction = (clf.predict(X_test) - 1) / - 2 
prediction = pd.Series(prediction, index=X_test.index)

In [None]:
cm = confusion_matrix(y_test, prediction)

cm_display = ConfusionMatrixDisplay(cm).plot()

In [None]:
# calculate f1 score
f1_score(y_test, prediction)

In [None]:
# calculate precision
print('Precision:', precision_score(y_test, prediction))

# calculate recall
print('Recall:', recall_score(y_test, prediction))

---