# Predicting cyberattacks - HAI 20.07 dataset

This Jupyter notebook explores several approaches to implementing machine learning algorithms that detect attacks in the HAI 20.07 dataset.

## Downloading the dataset

In this section, we download the compressed CSV files and decompress them so that they can be loaded as pandas DataFrames.

In [1]:
!mkdir -p ./data
!wget -O - "https://raw.githubusercontent.com/icsdataset/hai/master/hai-20.07/test1.csv.gz" | gunzip > ./data/test1.csv
!wget -O - "https://raw.githubusercontent.com/icsdataset/hai/master/hai-20.07/test2.csv.gz" | gunzip > ./data/test2.csv
!wget -O - "https://raw.githubusercontent.com/icsdataset/hai/master/hai-20.07/train1.csv.gz" | gunzip > ./data/train1.csv
!wget -O - "https://raw.githubusercontent.com/icsdataset/hai/master/hai-20.07/train2.csv.gz" | gunzip > ./data/train2.csv

--2024-07-16 15:08:54--  https://raw.githubusercontent.com/icsdataset/hai/master/hai-20.07/test1.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32813703 (31M) [application/octet-stream]
Saving to: 'STDOUT'


2024-07-16 15:09:02 (6.64 MB/s) - written to stdout [32813703/32813703]

--2024-07-16 15:09:02--  https://raw.githubusercontent.com/icsdataset/hai/master/hai-20.07/test2.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16907509 (16M) [application/octet-stream]
Saving to: 'STDOUT'


2024-07-16 15:09:06 (6.63 MB/s)
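The decompression step can also be done from Python instead of piping through `gunzip`. The following is a minimal sketch using only the standard library; the helper name `gunzip_file` is hypothetical, and it assumes the `.csv.gz` archive has already been saved to disk (the cell above streams it directly instead).

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst):
    """Decompress the gzip archive at `src` into a plain file at `dst`.

    Hypothetical helper, not part of the original notebook.
    """
    dst = Path(dst)
    # Create the target directory if needed (same role as `mkdir -p`).
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Stream the data in chunks so large archives are never fully held in memory.
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst
```

For example, `gunzip_file("./data/test1.csv.gz", "./data/test1.csv")` would produce the same file as the shell pipeline, provided the archive was downloaded to that (assumed) path first.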

## Displaying the datasets

Here, we load the downloaded datasets and inspect their metadata.

In [119]:
import pandas as pd
import datetime

In [62]:
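train = []
test = []

# The files are semicolon-separated; the "time" column becomes the index.
# Only the first 60 columns of the training files are loaded.
train.append(pd.read_csv("./data/train1.csv", sep=";", index_col="time", usecols=list(range(60))))
train.append(pd.read_csv("./data/train2.csv", sep=";", index_col="time", usecols=list(range(60))))

# The testing files are loaded in full, including the "attack" label column used below.
test.append(pd.read_csv("./data/test1.csv", sep=";", index_col="time"))
test.append(pd.read_csv("./data/test2.csv", sep=";", index_col="time"))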
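To make the effect of the `sep`, `index_col` and `usecols` arguments concrete, here is a small sketch on an in-memory sample. The column names `sensor_a` and `sensor_b`, the timestamps and the values are made up for illustration; only `time` and `attack` come from the notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real files have many more columns.
sample_csv = io.StringIO(
    "time;sensor_a;sensor_b;attack\n"
    "2020-01-01 00:00:00;0.1;1.5;0\n"
    "2020-01-01 00:00:01;0.2;1.4;1\n"
    "2020-01-01 00:00:02;0.3;1.6;1\n"
)

# Load every column, with "time" as the index (as done for the testing files).
full = pd.read_csv(sample_csv, sep=";", index_col="time")

# Rewind the buffer and keep only the first three columns by position
# (the same idea as usecols=list(range(60)) for the training files).
sample_csv.seek(0)
first_columns = pd.read_csv(sample_csv, sep=";", index_col="time", usecols=list(range(3)))

print(list(full.columns))           # ['sensor_a', 'sensor_b', 'attack']
print(list(first_columns.columns))  # ['sensor_a', 'sensor_b']
```

Because `time` is consumed as the index, it does not appear in `columns`; selecting columns by position with `usecols` simply drops whatever comes after the selected range.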

In [100]:
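# Each row is one data point; later cells treat one data point as one second,
# so the row counts double as durations in seconds.
total_training_duration = sum(df.shape[0] for df in train)
total_testing_duration = sum(df.shape[0] for df in test)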

In [101]:
print("Number of training data points:", total_training_duration)
print("Number of testing data points:", total_testing_duration)

Number of training data points: 550800
Number of testing data points: 444600


In [110]:
attacks_series = pd.concat([test_df["attack"] for test_df in test])

total_attacks_duration = attacks_series.sum()

# Extracting each attack duration (the resulting Series is built at the end of this cell)

# Joining all attacks as a string
str_attacks = "".join([str(data_point) for data_point in attacks_series])

# Splitting on the 0s and removing empty values
# (Keeping only the sequences of 1s)
attack_sequences = list(filter(lambda sequence: sequence != "", str_attacks.split("0")))

# Returning the length of every attack sequence (in seconds)
attacks_durations = pd.Series([len(attack_sequence) for attack_sequence in attack_sequences])

In [117]:
print("Statistics about the attacks and their durations (in seconds)")
attacks_durations.describe()

Statistics about the attacks and their durations (in seconds)


count      38.000000
mean      461.236842
std       422.544508
min       151.000000
25%       316.750000
50%       365.000000
75%       513.000000
max      2888.000000
dtype: float64
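The string-based extraction above works, but the same run lengths can also be obtained directly in pandas. The sketch below is an alternative, not the notebook's method; the function name `attack_run_lengths` is hypothetical, and it assumes the labels are 0/1 values as in the `attack` column.

```python
import pandas as pd


def attack_run_lengths(labels):
    """Return the length of every consecutive run of 1s in a 0/1 Series.

    Hypothetical helper, shown as an alternative to the string-splitting approach.
    """
    # Drop the original index so that concatenated files cannot produce duplicate labels.
    labels = labels.reset_index(drop=True)
    # A new run starts wherever the value differs from the previous one.
    run_id = (labels != labels.shift()).cumsum()
    # Keep only the points under attack, then count the points in each run.
    in_attack = labels == 1
    return labels[in_attack].groupby(run_id[in_attack]).size().reset_index(drop=True)


toy_labels = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 1])
print(attack_run_lengths(toy_labels).tolist())  # [2, 3, 1]
```

Applied to `attacks_series`, this should give the same durations as `attacks_durations` while avoiding the construction of one very long string.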

In [128]:
print("System was under attack for", str(datetime.timedelta(seconds=int(total_attacks_duration))), "out of", str(datetime.timedelta(seconds=total_testing_duration)))
print(f"This represents {round(total_attacks_duration/total_testing_duration*100, 2)}% of the testing time.")

System was under attack for 4:52:07 out of 5 days, 3:30:00
This represents 3.94% of the testing time.
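As a sanity check, the figures printed above can be recomputed by hand. The sketch below only reuses numbers already shown in this notebook's outputs (38 attacks with a mean duration of 461.236842 seconds, and 444600 testing data points); the variable names are chosen for illustration.

```python
import datetime

# Values copied from the outputs above.
attack_count = 38
mean_attack_duration = 461.236842
testing_seconds = 444600

# count * mean recovers the total attack time in seconds.
attack_seconds = round(attack_count * mean_attack_duration)

print(attack_seconds)                                      # 17527
print(datetime.timedelta(seconds=attack_seconds))          # 4:52:07
print(datetime.timedelta(seconds=testing_seconds))         # 5 days, 3:30:00
print(round(attack_seconds / testing_seconds * 100, 2))    # 3.94
```

The three outputs agree with each other, which suggests the descriptive statistics and the total attack duration were computed from the same attack sequences.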
