<a href="https://colab.research.google.com/github/parmigggiana/ml-ids/blob/main/IDS_CICIDS2017.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web attack detection using CICIDS2017 dataset

This is an edited version of the original notebook: https://github.com/fisher85/ml-cybersecurity/blob/master/python-web-attack-detection/web-attack-detection.ipynb

I have adapted the notebook to use the [corrected CIC-IDS-2017 dataset](https://intrusion-detection.distrinet-research.be/CNS2022/index.html).
Instead of selecting a single day, I used the whole dataset (~2 million entries).
I wrote and debugged the notebook on only one of those days, to make it easier to test that everything was working as intended. Once it was done, I re-ran it overnight on all the available data and saved the results.
I simplified the math in the undersampling section to make it more direct. I also chose to undersample by probability only, instead of also enforcing a hard limit. Given the size of the dataset, the difference is negligible.
I have also redone the feature selection and analysis.
The original selected features and trained on the whole dataset, which causes obvious overfitting. Here, after the data preparation, I set aside a portion of the dataset that is not used again until the final test, once everything else is fixed.  
After that, I added testing on the corrected CSE-CIC-IDS-2018 dataset and on my own CTF dataset.

In [43]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline

## Data preprocessing

Source: https://github.com/bozbil/Anomaly-Detection-in-Networks-Using-Machine-Learning/blob/master/01_preprocessing.ipynb [Kostas2018].

### Download and clean data

I will use the corrected CIC-IDS-2017 instead of the original.

In [44]:
!wget https://intrusion-detection.distrinet-research.be/CNS2022/Datasets/CICIDS2017_improved.zip -O dataset.zip
!unzip -u -d Corrected_CICIDS2017/ dataset.zip 

--2023-06-18 12:53:10--  https://intrusion-detection.distrinet-research.be/CNS2022/Datasets/CICIDS2017_improved.zip
Resolving intrusion-detection.distrinet-research.be (intrusion-detection.distrinet-research.be)... 134.58.40.205
Connecting to intrusion-detection.distrinet-research.be (intrusion-detection.distrinet-research.be)|134.58.40.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 343549013 (328M) [application/zip]
Saving to: ‘dataset.zip’


2023-06-18 12:53:55 (7.30 MB/s) - ‘dataset.zip’ saved [343549013/343549013]

Archive:  dataset.zip


Using encoding='latin' avoids the UnicodeDecodeError we get otherwise

In [45]:
from pathlib import Path
li = []
for filename in Path('./Corrected_CICIDS2017/').glob('*.csv'):
  li.append(pd.read_csv(filename, index_col=0, encoding='latin'))
df = pd.concat(li, axis=0, ignore_index=True)
df.sample(100000)  # preview only: displays a random 100k-row sample, df itself is unchanged

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,...,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,ICMP Code,ICMP Type,Total TCP Flow Time,Label,Attempted Category
306881,192.168.10.14-192.168.10.3-49185-53-17,192.168.10.14,49185,192.168.10.3,53,17,2017-07-05 11:46:13.554381,34546,2,2,...,0,0.000000e+00,0.000000e+00,0,0,-1,-1,0,BENIGN,-1
269713,192.168.10.3-192.168.10.1-62491-53-17,192.168.10.3,62491,192.168.10.1,53,17,2017-07-05 19:50:40.169960,24188,1,1,...,0,0.000000e+00,0.000000e+00,0,0,-1,-1,0,BENIGN,-1
1406914,192.168.10.9-192.168.10.3-49220-53-17,192.168.10.9,49220,192.168.10.3,53,17,2017-07-03 17:13:06.327104,169,2,2,...,0,0.000000e+00,0.000000e+00,0,0,-1,-1,0,BENIGN,-1
688424,192.168.10.5-50.115.208.113-51793-443-6,192.168.10.5,51793,50.115.208.113,443,6,2017-07-07 15:16:55.250846,116493505,21,21,...,86552,1.000133e+07,1.436556e+04,10010315,9958888,-1,-1,116493505,BENIGN,-1
855923,192.168.10.15-52.84.145.150-49794-443-6,192.168.10.15,49794,52.84.145.150,443,6,2017-07-07 12:15:34.360144,115763624,23,23,...,21936,9.616680e+06,1.332814e+06,10005487,5384430,-1,-1,115763624,BENIGN,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1103196,192.168.10.3-192.168.10.1-62253-53-17,192.168.10.3,62253,192.168.10.1,53,17,2017-07-03 14:03:08.328643,449,1,1,...,0,0.000000e+00,0.000000e+00,0,0,-1,-1,0,BENIGN,-1
1225278,192.168.10.16-217.196.36.3-38154-80-6,192.168.10.16,38154,217.196.36.3,80,6,2017-07-03 12:57:37.679400,15890446,13,11,...,838844,1.000802e+07,0.000000e+00,10008019,10008019,-1,-1,15890446,BENIGN,-1
809711,172.16.0.1-192.168.10.50-18188-80-6,172.16.0.1,18188,192.168.10.50,80,6,2017-07-07 18:59:25.024563,11443994,9,5,...,1411127,1.003287e+07,0.000000e+00,10032867,10032867,-1,-1,11443994,DDoS,-1
1776643,192.168.10.3-192.168.10.1-62318-53-17,192.168.10.3,62318,192.168.10.1,53,17,2017-07-06 15:17:41.835916,120696,1,1,...,0,0.000000e+00,0.000000e+00,0,0,-1,-1,0,BENIGN,-1


In [46]:
df

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,...,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,ICMP Code,ICMP Type,Total TCP Flow Time,Label,Attempted Category
0,192.168.10.9-52.197.242.182-13793-443-6,192.168.10.9,13793,52.197.242.182,443,6,2017-07-05 11:43:34.783702,187889,1,1,...,0,0.0,0.0,0,0,-1,-1,187889,BENIGN,-1
1,192.168.10.9-52.197.242.182-13793-443-6,192.168.10.9,13793,52.197.242.182,443,6,2017-07-05 11:43:35.484765,187758,1,1,...,0,0.0,0.0,0,0,-1,-1,187758,BENIGN,-1
2,192.168.10.9-54.65.28.113-13794-443-6,192.168.10.9,13794,54.65.28.113,443,6,2017-07-05 11:43:36.375217,189882,1,1,...,0,0.0,0.0,0,0,-1,-1,189882,BENIGN,-1
3,192.168.10.9-54.65.28.113-13794-443-6,192.168.10.9,13794,54.65.28.113,443,6,2017-07-05 11:43:37.075970,190117,1,1,...,0,0.0,0.0,0,0,-1,-1,190117,BENIGN,-1
4,192.168.10.9-52.197.242.182-13796-443-6,192.168.10.9,13796,52.197.242.182,443,6,2017-07-05 11:43:37.968708,188603,1,1,...,0,0.0,0.0,0,0,-1,-1,188603,BENIGN,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2099971,192.168.10.8-192.168.10.19-61232-21571-6,192.168.10.8,61232,192.168.10.19,21571,6,2017-07-06 18:21:53.829425,61,2,2,...,0,0.0,0.0,0,0,-1,-1,61,Infiltration - Portscan,-1
2099972,192.168.10.15-192.168.10.3-55355-53-17,192.168.10.15,55355,192.168.10.3,53,17,2017-07-06 17:05:17.960348,183,2,2,...,0,0.0,0.0,0,0,-1,-1,0,BENIGN,-1
2099973,192.168.10.14-192.168.10.3-63524-53-17,192.168.10.14,63524,192.168.10.3,53,17,2017-07-06 18:04:58.308001,1040,2,2,...,0,0.0,0.0,0,0,-1,-1,0,BENIGN,-1
2099974,192.168.10.15-192.168.10.50-55298-22-6,192.168.10.15,55298,192.168.10.50,22,6,2017-07-06 15:37:11.000885,1530528,42,46,...,0,0.0,0.0,0,0,-1,-1,1530528,BENIGN,-1


In [47]:
df.shape

(2099976, 90)

As for the differences between the features: aside from a couple of names that changed slightly, the corrected dataset adds 5 features: 'Fwd RST Flags', 'Bwd RST Flags', 'ICMP Code', 'ICMP Type', 'Total TCP Flow Time'. It also removes the duplicated feature 'Fwd Header Length.1'.
Beyond that, there is a new column, 'Attempted Category'. It should not be treated as a feature by the machine learning model. As suggested by the paper's authors, we treat all samples flagged there (any value other than -1) as benign.

In [48]:
# Relabel every "attempted" attack (category != -1) as BENIGN in one vectorized step,
# then drop the helper column so it cannot be used as a feature.
df.loc[df['Attempted Category'] != -1, 'Label'] = 'BENIGN'
df = df.drop(columns='Attempted Category')

When assessing the distribution of labels, it turns out that most of the 2099976 records are benign: 1594545 to be exact (about 76%).

In [49]:
df['Label'].unique()

array(['BENIGN', 'DoS Slowloris', 'DoS Slowhttptest', 'DoS Hulk',
       'DoS GoldenEye', 'Heartbleed', 'Botnet', 'Portscan', 'DDoS',
       'FTP-Patator', 'SSH-Patator', 'Web Attack - Brute Force',
       'Infiltration', 'Infiltration - Portscan', 'Web Attack - XSS',
       'Web Attack - SQL Injection'], dtype=object)

In [50]:
df['Label'].value_counts()

Label
BENIGN                        1594545
Portscan                       159066
DoS Hulk                       158468
DDoS                            95144
Infiltration - Portscan         71767
DoS GoldenEye                    7567
FTP-Patator                      3972
DoS Slowloris                    3859
SSH-Patator                      2961
DoS Slowhttptest                 1740
Botnet                            736
Web Attack - Brute Force           73
Infiltration                       36
Web Attack - XSS                   18
Web Attack - SQL Injection         13
Heartbleed                         11
Name: count, dtype: int64

Delete blank records. This shouldn't make a difference since the new dataset already has no blank records.

In [51]:
df = df.drop(df[pd.isnull(df['Flow ID'])].index)
df.shape

(2099976, 89)

The "Flow Bytes/s" and "Flow Packets/s" columns can contain non-numerical values (the string 'Infinity'). Replace them with -1 and convert both columns to numeric.

In [52]:
df.replace('Infinity', -1, inplace=True)
df[["Flow Bytes/s", "Flow Packets/s"]] = df[["Flow Bytes/s", "Flow Packets/s"]].apply(pd.to_numeric)

Replace the NaN values and infinity values with -1.

In [53]:
df.replace([np.inf, -np.inf, np.nan], -1, inplace=True)

Convert string columns to numbers using LabelEncoder rather than OneHotEncoder.

In [54]:
string_features = list(df.select_dtypes(include=['object']).columns)
string_features.remove('Label')
string_features

['Flow ID', 'Src IP', 'Dst IP', 'Timestamp']

In [55]:
le = preprocessing.LabelEncoder()
df[string_features] = df[string_features].apply(lambda col: le.fit_transform(col))
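
As a small illustration of what this step does, the sketch below applies `LabelEncoder` to a toy column (the IP strings are made up for the example). Each distinct string is mapped to an integer code in sorted order, so the codes carry no numeric meaning. That is acceptable here only because these four columns are dropped again before training.

```python
import pandas as pd
from sklearn import preprocessing

# Toy column with made-up addresses (not taken from the dataset).
toy = pd.DataFrame({'Src IP': ['10.0.0.2', '10.0.0.1', '10.0.0.2']})

le_toy = preprocessing.LabelEncoder()
toy['Src IP'] = le_toy.fit_transform(toy['Src IP'])

codes = toy['Src IP'].tolist()
classes = list(le_toy.classes_)
print(codes, classes)
```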

In [56]:
df.to_csv("web_attacks_unbalanced.csv", index=False)

### Undersampling against unbalance

The dataset is unbalanced: total records = 2099976, "BENIGN" records = 1594545, and far fewer attack records: 11 + 13 + 18 + 36 + 73 + 736 + 1740 + 2961 + 3859 + 3972 + 7567 + 71767 + 95144 + 158468 + 159066 = 505431.

In [57]:
df = pd.read_csv('web_attacks_unbalanced.csv')

In [58]:
benign_total = len(df[df['Label'] == "BENIGN"])
benign_total

1594545

In [59]:
attack_total = len(df[df['Label'] != "BENIGN"])
attack_total

505431

In [60]:
df['Label'].value_counts()

Label
BENIGN                        1594545
Portscan                       159066
DoS Hulk                       158468
DDoS                            95144
Infiltration - Portscan         71767
DoS GoldenEye                    7567
FTP-Patator                      3972
DoS Slowloris                    3859
SSH-Patator                      2961
DoS Slowhttptest                 1740
Botnet                            736
Web Attack - Brute Force           73
Infiltration                       36
Web Attack - XSS                   18
Web Attack - SQL Injection         13
Heartbleed                         11
Name: count, dtype: int64

We use **undersampling** to correct class imbalances: we remove some of the "BENIGN" records.

Form a balanced dataset web_attacks_balanced.csv in proportion: 70% benign data, 30% attacks (about 1684770 total: 505431 attacks, about 1179339 benign).

Algorithm to form the balanced df_balanced dataset:

* All the records with attacks are copied to the new dataset.
* Each "BENIGN" record is copied to the new dataset with probability benign_inc_probability. Unlike the original notebook, there is no hard limit on the number of "BENIGN" records, so the final count only approximates the target.

Calculate the probability of copying a "BENIGN" record.

In [61]:
total_samples = len(df[df['Label'] != 'BENIGN']) // 0.3
benign_included_max = round(total_samples * 0.7)
benign_inc_probability = benign_included_max / benign_total
print(benign_included_max, benign_inc_probability)

1179339 0.7396084776534999
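
The same arithmetic can be written as a small reusable function. This is only a sketch: the name `benign_keep_probability` and its `benign_share` parameter are hypothetical and not part of the original notebook. Since every attack record is kept, the attacks must make up `1 - benign_share` of the result, which fixes the benign target.

```python
def benign_keep_probability(attack_total, benign_total, benign_share=0.7):
    """Return (target benign count, per-record keep probability)."""
    # All attacks are kept, so benign_target / attack_total = share / (1 - share).
    benign_target = round(attack_total * benign_share / (1 - benign_share))
    # Never ask for more benign records than actually exist.
    benign_target = min(benign_target, benign_total)
    return benign_target, benign_target / benign_total

# Counts taken from the value_counts() output above.
target, prob = benign_keep_probability(attack_total=505431, benign_total=1594545)
print(target, prob)
```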


Copy records from df to df_balanced, save dataset **web_attacks_balanced.csv**.

In [62]:
import random
indexes = []
benign_included_count = 0
for index, row in df.iterrows():
    if row['Label'] == "BENIGN":
        # Keep each benign record with probability benign_inc_probability.
        # The hard cap from the original notebook is intentionally disabled:
        # if benign_included_count > benign_included_max: continue
        if random.random() > benign_inc_probability: continue
        benign_included_count += 1

    # Attack records always reach this line; benign ones only if kept above.
    indexes.append(index)

df_balanced = df.loc[indexes]
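
Iterating over about 2 million rows with `iterrows()` is slow. A vectorized variant of the same probabilistic rule could look like the sketch below. The helper `undersample_benign`, its `seed` argument and the toy frame are assumptions made for illustration, and NumPy's generator is swapped in for `random.random()`, so the exact rows kept will differ from the loop above.

```python
import numpy as np
import pandas as pd

def undersample_benign(frame, keep_probability, seed=0):
    """Keep every attack row and each BENIGN row with the given probability."""
    rng = np.random.default_rng(seed)
    is_benign = (frame['Label'] == 'BENIGN').to_numpy()
    # One uniform draw per row; attack rows are kept regardless of their draw.
    keep = ~is_benign | (rng.random(len(frame)) <= keep_probability)
    return frame[keep]

# Toy data: 1000 benign rows and 100 attack rows.
toy = pd.DataFrame({'Label': ['BENIGN'] * 1000 + ['DDoS'] * 100})
balanced = undersample_benign(toy, keep_probability=0.5)
print(balanced['Label'].value_counts())
```

With a fixed `seed` the sample is reproducible, which the loop above is not unless `random.seed()` is called first.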

In [63]:
df_balanced['Label'].value_counts()

Label
BENIGN                        1178789
Portscan                       159066
DoS Hulk                       158468
DDoS                            95144
Infiltration - Portscan         71767
DoS GoldenEye                    7567
FTP-Patator                      3972
DoS Slowloris                    3859
SSH-Patator                      2961
DoS Slowhttptest                 1740
Botnet                            736
Web Attack - Brute Force           73
Infiltration                       36
Web Attack - XSS                   18
Web Attack - SQL Injection         13
Heartbleed                         11
Name: count, dtype: int64

In [64]:
len(df_balanced[df_balanced['Label'] == 'BENIGN'])/len(df_balanced)

0.6999020318010711

In [65]:
df_balanced.to_csv("web_attacks_balanced.csv", index=False)

### Preparing data for training

In [66]:
df = pd.read_csv('web_attacks_balanced.csv')

The Label column is encoded as follows: "BENIGN" = 0, attack = 1.

In [67]:
df['Label'] = df['Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)

7 features (Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp) are excluded from the dataset. The hypothesis is that the "shape" of the data being transmitted is more important than these attributes. In addition, ports and addresses can be substituted by an attacker, so it is better that the ML algorithm does not take these features into account in training [Kostas2018].

In [68]:
excluded = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp']
df = df.drop(columns=excluded)

In [69]:
y = df['Label'].values
X = df.drop(columns=['Label'])
print(X.shape, y.shape)

(1684220, 81) (1684220,)


## Feature importance

In [70]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=241)

X_select, X_val, y_select, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=241)
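
The two nested calls produce a 64% / 16% / 20% split (selection / validation / hold-out test). The sketch below checks this on toy arrays. It also passes `stratify`, which the cell above does not use; that is an assumption added here to keep class proportions equal across the three parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70 of class 0 and 30 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

# First split: hold out 20% as the final test set.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=241, stratify=y_toy)
# Second split: carve a validation set out of the remaining 80%.
X_sel, X_v, y_sel, y_v = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=241, stratify=y_rest)

sizes = (len(X_sel), len(X_v), len(X_hold))
print(sizes)  # → (64, 16, 20)
```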

### Evaluation of importance using RandomForestClassifier.feature_importances_ (move from one tree to a random forest, classification quality increases)

In [71]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=150, random_state=42, oob_score=True)
rf.fit(X_select, y_select)
# For a classifier, score() returns the mean accuracy on the given data and labels (not R^2)
print('Training accuracy: {:.2f} \nValidation accuracy: {:.2f} \nOut-of-bag score: {:.2f}'
      .format(rf.score(X_select, y_select), rf.score(X_val, y_val), rf.oob_score_))

We select all the features with an importance of at least 1%.

In [None]:
features = X.columns
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
webattack_features = []

for index, i in enumerate(indices):
    if importances[i] >= 0.01:
      webattack_features.append(features[i])
    print(f'{index+1}. \t #{i} \t {importances[i]:.3f} \t {features[i]}')
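
The 1% threshold is meaningful because scikit-learn normalizes the impurity-based importances of a forest so that they sum to 1. A minimal check on synthetic data (the toy dataset and the names `rf_toy` and `kept` are assumptions for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem, unrelated to CICIDS2017.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0)
rf_toy = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

toy_importances = rf_toy.feature_importances_
total = toy_importances.sum()
# Indices of the features that pass the same 1% threshold.
kept = np.flatnonzero(toy_importances >= 0.01)
print(f'sum = {total:.3f}, kept {len(kept)} of {len(toy_importances)} features')
```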

Visualize what we're left with.

In [None]:
indices = np.argsort(importances)[-len(webattack_features):]
plt.rcParams['figure.figsize'] = (10, 6)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='#cccccc', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.grid()
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = rf.predict(X_val)
confusion_matrix(y_val, y_pred)

## Analysis of selected features

In [None]:

frame_attacks = df[df['Label'] == 1]
frame_benigns = df[df['Label'] == 0]
plt.figure(figsize=(20,25))
for i, feat in enumerate(webattack_features):
    plt.subplot(len(webattack_features)//3+1, 3, i+1)
    x1 = sorted(frame_benigns[feat])
    x2 = sorted(frame_attacks[feat])
    fr = [x1, x2]
    plt.hist(fr, bins=50, stacked=True, density=True)
    plt.title(feat)
    plt.legend(['benign', 'attacks'])
plt.show()

In [None]:
import seaborn as sns
corr_matrix = df[webattack_features].corr()
plt.rcParams['figure.figsize'] = (16, 6)
g = sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys')
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30);
plt.show()

Remove correlated features.

In [None]:
to_be_removed = {'Packet Length Std', 'Bwd Packet Length Std', 'Bwd Segment Size Avg', 'Bwd Packet Length Max', 'Bwd Packet Length Mean', 'Packet Length Max', 'Packet length Max', 'Packet Length Mean', 'Average Packet Size', 'Subflow Bwd Bytes', 'Fwd RST Flags'}
webattack_features = [item for item in webattack_features if item not in to_be_removed]
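
The list above was compiled by reading the heatmap. As a cross-check, strongly correlated pairs can also be listed programmatically. The sketch below is one possible way to do it, run on toy data; the helper name `highly_correlated` and the 0.9 threshold are assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated(frame, threshold=0.9):
    """List (feature_a, feature_b, |corr|) for pairs above the threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    stacked = upper.stack()
    # NaN entries (the masked triangle) compare as False and are filtered out.
    return [(a, b, float(v)) for (a, b), v in stacked[stacked > threshold].items()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a, 'b': 2 * a + 1, 'c': rng.normal(size=200)})
pairs = highly_correlated(toy)
print(pairs)
```

Which feature of each pair to drop remains a manual decision.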

In [None]:
corr_matrix = df[webattack_features].corr()
plt.rcParams['figure.figsize'] = (12, 5)
sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys');

## Hyperparameter selection

We instantiate the base RandomForestClassifier whose parameters will be tuned by the grid search.

In [None]:
rfc = RandomForestClassifier(criterion='gini', random_state=6)

### Grid search

In [None]:
parameters = {'n_estimators': [20, 25, 30, 35, 50, 80],
              'min_samples_leaf': [4, 3, 2],
              'max_features': ['sqrt', 'log2', 5, 8, 14, None],
              'max_depth': [3, 5, 8, 12, None]}

In [None]:
from sklearn.model_selection import GridSearchCV

gcv = GridSearchCV(rfc, parameters, scoring='f1', refit='f1', cv=3, return_train_score=True, verbose=10)
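# Tune on the selected features only, matching the columns the saved model is later applied to
gcv.fit(X_train[webattack_features], y_train)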

Let's take a look at the results of the parameter selection.

In [None]:
gcv.best_params_

In [None]:
gcv.best_score_
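
Beyond the single best combination, `cv_results_` holds the scores of every combination tried, which helps to see how close the runners-up are. A minimal sketch on synthetic data with a deliberately tiny grid (the toy data and grid values are assumptions and much smaller than the grid above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
toy_grid = {'n_estimators': [5, 10], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=6), toy_grid,
                      scoring='f1', cv=3)
search.fit(X_toy, y_toy)

# One row per parameter combination, best mean F1 first.
results = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')
summary = results[['params', 'mean_test_score', 'std_test_score']]
print(summary)
```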

## Final model

In [None]:
# Reuse the hold-out split made before feature selection, so the test set stays unseen,
# and keep only the selected features (the saved model is later applied to df[webattack_features]).
X_train, X_test = X_train[webattack_features], X_test[webattack_features]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
rfc = RandomForestClassifier(max_depth=20, max_features=3, min_samples_leaf=2, n_estimators=25, random_state=42, oob_score=True)
# rfc = RandomForestClassifier(n_estimators=250, random_state=1)
rfc.fit(X_train, y_train)

In [None]:
# Use the training columns so the names line up with rfc.feature_importances_
features = X_train.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]

for index, i in enumerate(indices[:10]):
    print('{}.\t#{}\t{:.3f}\t{}'.format(index + 1, i, importances[i], features[i]))

In [None]:
y_pred = rfc.predict(X_test)
confusion_matrix(y_test, y_pred)

In [None]:
import sklearn.metrics as metrics
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
print('Accuracy =', accuracy)
print('Precision =', precision)
print('Recall =', recall)
print('F1 =', f1)

## Model saving

In [None]:
import pickle
with open('webattack_detection_rf_model.pkl', 'wb') as f:
    pickle.dump(rfc, f)

## Model testing

Open the previously saved model.

In [None]:
with open('webattack_detection_rf_model.pkl', 'rb') as f:
    rfc = pickle.load(f)
rfc

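Reopen the dataset. Note that this is the full balanced dataset, including the rows the model was trained on, so the metrics below are optimistic compared with the hold-out results.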

In [None]:
df = pd.read_csv('web_attacks_balanced.csv')
df['Label'] = df['Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)
y_test = df['Label'].values
X_test = df[webattack_features]
print(X_test.shape, y_test.shape)

In [None]:
X_test.head()

In [None]:
import time
seconds = time.time()
y_pred = rfc.predict(X_test)
print("Total operation time:", time.time() - seconds, "seconds")

print("Benign records detected (0), attacks detected (1):")
unique, counts = np.unique(y_pred, return_counts=True)
dict(zip(unique, counts))

Confusion matrix:

      0  1 - predicted value (Wikipedia uses different convention for axes)
    0 TN FP
    1 FN TP

In [None]:
confusion_matrix(y_test, y_pred)
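
With scikit-learn's layout (rows are true labels, columns are predicted labels), the four counts of a binary problem can be unpacked directly with `ravel()`. A small sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = benign, 1 = attack.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

# Row-major order of the 2x2 matrix gives TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(tn, fp, fn, tp)  # → 1 1 1 2
```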

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
print('Accuracy =', accuracy)
print('Precision =', precision)
print('Recall =', recall)
print('F1 =', f1)

Manual calculation of the metrics from an example confusion matrix:

    array([[5075,   12],
           [   1, 2179]], dtype=int64)

      0  1 - predicted value (Wikipedia uses different convention for axes)
    0 TN FP
    1 FN TP

    Precision
    Precision = TP / (TP + FP) = 2179 / (2179 + 12) = 0.9945230488361478

    Recall
    Recall = TP / (TP + FN) = 2179 / (2179 + 1) = 0.9995412844036697

    F-measure with parameter beta = 1: harmonic mean of precision and recall, factor = 2
    F1 = 2 * (precision * recall) / (precision + recall) = 0.9970258522077328

    Accuracy (share of correct answers)
    Accuracy = (TP + TN) / (TP + TN + FP + FN) = (2179 + 5075) / 7267 = 0.998211091234347
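
The arithmetic above can be reproduced with a few lines of plain Python. The function name `metrics_from_counts` is hypothetical; the four counts are the ones from the example matrix.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

p_manual, r_manual, f1_manual, acc_manual = metrics_from_counts(
    tn=5075, fp=12, fn=1, tp=2179)
print(p_manual, r_manual, f1_manual, acc_manual)
```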

In [None]:
predict = pd.DataFrame({'Predict': rfc.predict(X_test)})
label = pd.DataFrame({'Label': y_test})
result = X_test.join(label).join(predict)

In [None]:
result[result['Predict'] == 1]

In [None]:
result[410:430]