# Industrial Control System (ICS) Cyber Attack Detection

### Contents
* Introduction
* Dataset
* Model Training and Evaluation
* Conclusion
* References

# Introduction

An industrial control system (ICS) is an electronic control system, together with its associated instrumentation, used to control and monitor automated processes. ICSs are used in most industrial sectors, including critical infrastructure in energy, manufacturing, transportation, and water treatment. These systems are frequent targets of cyber attacks. When monitoring an ICS, it can be hard to tell an attack apart from a natural fault caused by the regular behaviour or maintenance of the system's components. We address this by training a multi-class machine learning classifier that distinguishes between attacks, regular operation, and natural faults. Such a classifier can help operators detect and respond to threats while keeping the system running properly. Understanding the differences between these three states can also help organizations design more effective security measures and prevent future incidents.


# Dataset


The dataset describes the behaviour of an electric transmission system and includes synchrophasor measurements and data logs from relays. It can be used to study the differences between normal operation, disturbances, control actions, and cyber attack behaviours.

You can see a small sample of the data below.

More information on the dataset can be found at

* http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf
* https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets

Importing the necessary libraries:

In [1]:
import glob
import requests
import os.path
import pickle
import cudf
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, f1_score, confusion_matrix

Downloading the dataset:

In [2]:
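if not os.path.isfile("triple.7z"):
    URL = "http://www.ece.uah.edu/~thm0009/icsdatasets/triple.7z"
    response = requests.get(URL)
    response.raise_for_status()  # stop here rather than save an error page as the archive
    # Use a context manager so the file handle is flushed and closed
    with open("triple.7z", "wb") as archive:
        archive.write(response.content)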

Checking the hash:

In [3]:
!md5sum triple.7z

2b4ae3dc094bb472f6b9f312c4afc3f0  triple.7z


When this file (triple.7z) was downloaded on 25 Jan 2023, its MD5 checksum was "2b4ae3dc094bb472f6b9f312c4afc3f0".
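
The same check can also be done from Python instead of the shell. The sketch below is a minimal illustration, not part of the original notebook: `md5_of_file` and `matches_expected` are assumed helper names. It hashes the file in chunks and compares the digest with the value recorded above. Note that an MD5 match mainly guards against a corrupted or changed download; it is not a strong defence against deliberate tampering.

```python
import hashlib

# Digest recorded in this notebook for the 25 Jan 2023 download.
EXPECTED_MD5 = "2b4ae3dc094bb472f6b9f312c4afc3f0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` hashes to the expected digest."""
    return md5_of_file(path) == expected
```

Calling `matches_expected("triple.7z")` before extraction would flag an archive that differs from the one used here.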

In [4]:
if not os.path.isfile("data1.csv"):
    !p7zip -k -d triple.7z


7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,80 CPUs Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (406F1),ASM,AES-NI)

Scanning the drive for archives:
1 file, 19701721 bytes (19 MiB)

Extracting archive: triple.7z
--
Path = triple.7z
Type = 7z
Physical Size = 19701721
Headers Size = 400
Method = LZMA:26
Solid = +
Blocks = 1

Everything is Ok

Files: 15
Size:       81668321
Compressed: 19701721


Combining files into one:

In [5]:
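import time  # used by time.sleep below; missing from the imports cell

if not os.path.isfile("3class.csv"):
    # Give the extraction step a moment to finish before listing the files
    if not os.path.isfile("data1.csv"):
        time.sleep(3)
    all_files = glob.glob("*.csv")

    # Read every extracted CSV and stack them into one dataframe
    dflist = [pd.read_csv(path) for path in all_files]
    df = pd.concat(dflist)
    df.reset_index(drop=True, inplace=True)
    df.to_csv("3class.csv", index=False)
else:
    df = pd.read_csv("3class.csv")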

The number of rows and a sample from the dataset:

In [6]:
len(df)

78377

In [7]:
df.head(5)

| | R1-PA1:VH | R1-PM1:V | R1-PA2:VH | R1-PM2:V | R1-PA3:VH | R1-PM3:V | R1-PA4:IH | R1-PM4:I | R1-PA5:IH | R1-PM5:I | ... | control_panel_log4 | relay1_log | relay2_log | relay3_log | relay4_log | snort_log1 | snort_log2 | snort_log3 | snort_log4 | marker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129.047284 | 133139.0637 | 9.069922 | 133113.9904 | -110.90744 | 133214.2835 | 129.522839 | 477.18466 | 4.262806 | 509.59513 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NoEvents |
| 1 | 128.949881 | 133063.8439 | 8.978249 | 133013.6974 | -111.016302 | 133088.9172 | 129.368141 | 477.9171 | 4.125296 | 510.32757 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NoEvents |
| 2 | 128.222225 | 132336.7191 | 8.262051 | 132286.5725 | -111.749688 | 132386.8656 | 128.382653 | 482.12863 | 3.265859 | 512.708 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NoEvents |
| 3 | 123.850557 | 129202.5603 | 3.867465 | 129152.4138 | -116.132816 | 129277.7801 | 121.965526 | 505.74982 | -1.249048 | 525.5257 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NoEvents |
| 4 | 128.594648 | 132236.426 | 8.623015 | 132186.2794 | -111.354348 | 132261.4993 | 132.36471 | 276.86232 | 8.82928 | 289.13069 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NoEvents |


The label column name in the dataset is "marker".

In [8]:
df.marker.unique()

array(['NoEvents', 'Attack', 'Natural'], dtype=object)

In [9]:
df["marker"].value_counts()

Attack      55663
Natural     18309
NoEvents     4405
Name: marker, dtype: int64

The dataset is imbalanced: Attack samples make up about 71% of the rows and NoEvents fewer than 6%. We therefore choose performance metrics that account for class imbalance.
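
To see why plain accuracy would be misleading here, consider a hypothetical baseline that always predicts the majority class. The sketch below is an illustration added for this point, not something the original notebook runs. It uses only the class counts printed above and compares the baseline's accuracy with its macro F1 score.

```python
# Class counts copied from the value_counts() output above.
counts = {"Attack": 55663, "Natural": 18309, "NoEvents": 4405}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Hypothetical baseline: always predict the majority class.
majority = max(counts, key=counts.get)
baseline_accuracy = counts[majority] / total


def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# For this baseline, every non-majority sample is a false positive for the
# majority class and a false negative for its own class.
per_class_f1 = {
    label: f1(tp=n, fp=total - n, fn=0) if label == majority else f1(0, 0, n)
    for label, n in counts.items()
}
baseline_macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f"majority class: {majority}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline macro F1: {baseline_macro_f1:.3f}")
```

The baseline reaches roughly 71% accuracy while learning nothing about the two smaller classes, and its macro F1 is far lower. This is the reason for reporting per-class and macro F1 alongside the aggregate scores.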

List the column names in the dataset:

In [10]:
# Feature and label columns of the dataset
list(df)

['R1-PA1:VH',
 'R1-PM1:V',
 'R1-PA2:VH',
 'R1-PM2:V',
 'R1-PA3:VH',
 'R1-PM3:V',
 'R1-PA4:IH',
 'R1-PM4:I',
 'R1-PA5:IH',
 'R1-PM5:I',
 'R1-PA6:IH',
 'R1-PM6:I',
 'R1-PA7:VH',
 'R1-PM7:V',
 'R1-PA8:VH',
 'R1-PM8:V',
 'R1-PA9:VH',
 'R1-PM9:V',
 'R1-PA10:IH',
 'R1-PM10:I',
 'R1-PA11:IH',
 'R1-PM11:I',
 'R1-PA12:IH',
 'R1-PM12:I',
 'R1:F',
 'R1:DF',
 'R1-PA:Z',
 'R1-PA:ZH',
 'R1:S',
 'R2-PA1:VH',
 'R2-PM1:V',
 'R2-PA2:VH',
 'R2-PM2:V',
 'R2-PA3:VH',
 'R2-PM3:V',
 'R2-PA4:IH',
 'R2-PM4:I',
 'R2-PA5:IH',
 'R2-PM5:I',
 'R2-PA6:IH',
 'R2-PM6:I',
 'R2-PA7:VH',
 'R2-PM7:V',
 'R2-PA8:VH',
 'R2-PM8:V',
 'R2-PA9:VH',
 'R2-PM9:V',
 'R2-PA10:IH',
 'R2-PM10:I',
 'R2-PA11:IH',
 'R2-PM11:I',
 'R2-PA12:IH',
 'R2-PM12:I',
 'R2:F',
 'R2:DF',
 'R2-PA:Z',
 'R2-PA:ZH',
 'R2:S',
 'R3-PA1:VH',
 'R3-PM1:V',
 'R3-PA2:VH',
 'R3-PM2:V',
 'R3-PA3:VH',
 'R3-PM3:V',
 'R3-PA4:IH',
 'R3-PM4:I',
 'R3-PA5:IH',
 'R3-PM5:I',
 'R3-PA6:IH',
 'R3-PM6:I',
 'R3-PA7:VH',
 'R3-PM7:V',
 'R3-PA8:VH',
 'R3-PM8:V',
 'R3-PA9:VH',
 'R3

Replace infinite values in the dataset with NaN so that they are treated as missing values:

In [11]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)

Replace the labels with numbers:

In [12]:
df["marker"] = df["marker"].replace("NoEvents", 0)
df["marker"] = df["marker"].replace("Attack", 1)
df["marker"] = df["marker"].replace("Natural", 2)

In [13]:
df["marker"].unique()

array([0, 1, 2])
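
The three `replace` calls can also be expressed with a single lookup table, which keeps the encoding in one place and makes it easy to convert predictions back to names. This is a minimal sketch on a toy series: `LABEL_TO_ID` and `ID_TO_LABEL` are assumed names, and the sample values are made up rather than taken from the dataset.

```python
import pandas as pd

# Same encoding as in the cell above, kept in one dictionary.
LABEL_TO_ID = {"NoEvents": 0, "Attack": 1, "Natural": 2}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}

# Toy stand-in for the "marker" column (not real rows from the dataset).
marker = pd.Series(["NoEvents", "Attack", "Natural", "Attack"], name="marker")

# map() yields NaN for any value missing from the dictionary, so an
# unexpected label can be detected instead of passing through silently.
encoded = marker.map(LABEL_TO_ID)
if encoded.isna().any():
    raise ValueError("unexpected label in marker column")
encoded = encoded.astype("int64")

# The inverse mapping turns numeric predictions back into readable names.
decoded = encoded.map(ID_TO_LABEL)
```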

Replace the NaN values with the median of each column:

In [14]:
df = df.fillna(df.median())
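
The two cleaning steps (infinities to NaN, then NaN to the column median) are easiest to follow on a small example. The frame below is a toy with hypothetical column names and made-up values. It is only meant to show what the two calls do.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names; the values are made up.
toy = pd.DataFrame({
    "feature_a": [1.0, np.inf, 3.0, 5.0],
    "feature_b": [10.0, 20.0, -np.inf, np.nan],
})

# Step 1: infinities become NaN so they are treated as missing.
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# Step 2: each NaN is filled with the median of its own column
# (median() skips NaN values by default).
medians = cleaned.median()
imputed = cleaned.fillna(medians)
```

Here the medians are computed over all rows, as in the notebook. A stricter setup would compute them on the training split only and reuse them for the test split, so that no information from the test data enters preprocessing.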

Split the data into input features and labels, then convert both to cuDF objects:

In [15]:
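X = df.iloc[:, :-1]  # every column except the last is a feature
y = df.iloc[:, -1]   # the last column, "marker", is the label
# Convert to cuDF so the data lives on the GPU
X = cudf.from_pandas(X)
y = cudf.from_pandas(y)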

Create train and test sets:

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
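
Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the class shares of the test set close to those of the full dataset. The notebook itself does not stratify. The sketch below is an assumption-laden illustration on synthetic NumPy labels whose proportions only roughly mimic the real ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the class shares seen above (71% / 23% / 6%).
y = np.array([1] * 710 + [2] * 230 + [0] * 60)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# test_size=0.25 is scikit-learn's default, written out here for clarity.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# With stratify=y, the class shares in the test split track the full set.
test_share = {int(c): float(np.mean(y_test == c)) for c in np.unique(y)}
print(test_share)
```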

# Model Training and Evaluation

### Create an XGBoost classifier to distinguish cyber attacks and natural faults

XGBoost (Extreme Gradient Boosting) is a highly optimized and scalable implementation of the gradient boosting algorithm. It is suitable for most supervised learning problems on tabular data. One part of the data is used to train the model to recognize patterns and behaviours, and the remaining data is then used to evaluate the model's predictions.

In [17]:
xgb_clf = XGBClassifier(n_estimators=1000)

In [18]:
xgb_clf = xgb_clf.fit(X_train, y_train)

We test the model on the part of the dataset that was not seen during training.

In [19]:
y_pred = xgb_clf.predict(X_test)

The dataset is imbalanced, so we check the per-class, micro, weighted, and macro F1 scores together with a confusion matrix.

In [20]:
print("The F1 score for No Events label is",f1_score(y_test.to_numpy(), y_pred ,labels=[0], average="weighted"))
print("The F1 score for Attack label is",f1_score(y_test.to_numpy(), y_pred ,labels=[1], average="weighted"))
print("The F1 score for Natural label is",f1_score(y_test.to_numpy(), y_pred ,labels=[2], average="weighted"))
print("\nThe micro F1 score is:", f1_score(y_test.to_numpy(), y_pred ,average="micro"))
print("The weighted F1 score is:", f1_score(y_test.to_numpy(), y_pred ,average="weighted"))
print("The macro F1 score is:", f1_score(y_test.to_numpy(), y_pred ,average="macro"))
print("\nConfusion Matrix")
print(confusion_matrix(y_test.to_numpy(), y_pred))

The F1 score for No Events label is 0.9668746999519926
The F1 score for Attack label is 0.9435565305198648
The F1 score for Natural label is 0.8130662851692895

The micro F1 score is: 0.9168665475886706
The weighted F1 score is: 0.9142174482444869
The macro F1 score is: 0.9078325052137156

Confusion Matrix
[[ 1007    52     6]
 [    9 13549   376]
 [    2  1184  3410]]
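
All of the F1 scores above can be recomputed from the confusion matrix alone, which is a useful sanity check and shows how the averages relate to each other. The sketch below uses plain Python and the matrix printed above. It assumes the scikit-learn convention that rows are true labels and columns are predicted labels, in the order NoEvents, Attack, Natural.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, in the order NoEvents, Attack, Natural.
cm = [
    [1007, 52, 6],
    [9, 13549, 376],
    [2, 1184, 3410],
]
labels = ["NoEvents", "Attack", "Natural"]
n = len(cm)
total = sum(sum(row) for row in cm)

per_class_f1 = []
support = []
for k in range(n):
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                       # true k, predicted as another class
    fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, true class differs
    per_class_f1.append(2 * tp / (2 * tp + fp + fn))
    support.append(sum(cm[k]))

# Macro: unweighted mean. Weighted: mean weighted by true-class counts.
macro_f1 = sum(per_class_f1) / n
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / total
# For single-label multi-class problems, micro F1 equals accuracy.
micro_f1 = sum(cm[k][k] for k in range(n)) / total

for name, score in zip(labels, per_class_f1):
    print(f"{name}: {score:.4f}")
print(f"micro={micro_f1:.4f} weighted={weighted_f1:.4f} macro={macro_f1:.4f}")
```

The matrix also shows where the errors concentrate: 1184 of the 4596 Natural test samples were predicted as Attack, which accounts for the lower F1 score of the Natural class.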


The trained model can be saved as follows:

In [21]:
# Save the model to a file
with open("../models/"+"ot-xgboost-20230207.pkl", "wb") as file:
    pickle.dump(xgb_clf, file)
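
To use the saved model later, it has to be loaded back with `pickle.load`. The helpers below are a minimal sketch with assumed names (`save_model`, `load_model`). They work for any picklable object, and `save_model` additionally creates the target directory if it does not exist yet. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Pickle `model` to `path`, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

With the file written above, `restored = load_model("../models/ot-xgboost-20230207.pkl")` should return a classifier whose `predict` method can be called on new data, provided a compatible version of XGBoost is installed.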

# Conclusion

The model achieved micro and weighted F1 scores above 0.91 on the test set (macro F1 of about 0.908), indicating that it can distinguish between regular operation, natural faults, and cyber attacks to a significant extent. Performance differed by class: the F1 score was about 0.97 for NoEvents and 0.94 for Attack, but only about 0.81 for Natural, mainly because natural faults were misclassified as attacks. These results are a useful step towards better security measures and a better understanding of the behaviour of electric transmission systems. They suggest that the amount of manual analysis after an incident could be reduced and that, with further improvements, near real-time detection may become possible.

# References

* http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf
* https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets
