<a href="https://colab.research.google.com/github/parmigggiana/ml-ids/blob/main/IDS_CICIDS2017.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web attack detection using CICIDS2017 dataset

This is an edited version of the original https://github.com/fisher85/ml-cybersecurity/blob/master/python-web-attack-detection/web-attack-detection.ipynb

I have adapted the script to use the [corrected CIC-IDS-2017 dataset](https://intrusion-detection.distrinet-research.be/CNS2022/index.html).
Instead of selecting a single day, I used the whole dataset (about 2 million entries).
I wrote and debugged the script on only one of those days, which made it easier to check that everything worked as intended. Once it was done, I re-ran it overnight on all the available data and saved the results.
I changed the math in the undersampling section to make it simpler and more direct. I also chose to undersample purely by probability instead of enforcing a hard limit. Given the size of the dataset, this makes little practical difference.
I have also redone the feature selection and analysis.
The original notebook selected features and trained on the whole dataset, which causes obvious overfitting. After the data preparation I set aside a portion of the dataset that was not used again until the final test, once everything else was settled.  
After that, I added testing on the corrected CSE-CIC-IDS-2018 dataset and on my own CTF dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline

## Data preprocessing

Source: https://github.com/bozbil/Anomaly-Detection-in-Networks-Using-Machine-Learning/blob/master/01_preprocessing.ipynb [Kostas2018].

### Download and clean data

I will use the corrected CIC-IDS-2017 instead of the original.

In [None]:
#!wget https://intrusion-detection.distrinet-research.be/CNS2022/Datasets/CICIDS2017_improved.zip -O dataset.zip
#!unzip -u -d Corrected_CICIDS2017/ dataset.zip 

Using encoding='latin' avoids the UnicodeDecodeError we get otherwise
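
To illustrate why the encoding matters, here is a minimal sketch on a made-up byte string (it is not taken from the dataset). The byte `0x96` is not valid on its own in UTF-8, while Latin-1 maps every possible byte to a character, so decoding cannot fail.

```python
# Hypothetical sample bytes, chosen only to trigger the error.
raw = b"Flow \x96 ID"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 never raises because every byte value maps to a code point.
latin_text = raw.decode("latin-1")
print(utf8_ok, len(latin_text))
```

The trade-off is that Latin-1 may decode some characters incorrectly without any warning, which only matters if the affected text columns are used later.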

In [None]:
from pathlib import Path
li = []
for filename in Path('./Corrected_CICIDS2017/').glob('*.csv'):
  li.append(pd.read_csv(filename, index_col=0, encoding='latin'))
df = pd.concat(li, axis=0, ignore_index=True)

In [None]:
df

In [None]:
df.shape

As for the differences between the features, aside from a couple of names that changed slightly, the corrected dataset adds 5 features: 'Fwd RST Flags', 'Bwd RST Flags', 'ICMP Code', 'ICMP Type', 'Total TCP Flow Time'. It also removes the duplicated feature 'Fwd Header Length.1'.
Other than that, there is a new column, 'Attempted Category'. This should not be treated as a feature by the machine learning model. As suggested by the paper authors, we treat all samples flagged as attempted (any value other than -1) as benign.

In [None]:
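# Samples flagged as attempted attacks (any category other than -1) are relabeled as benign.
# A boolean mask does this in one vectorized step instead of a slow row-by-row apply.
df.loc[df['Attempted Category'] != -1, 'Label'] = 'BENIGN'
df = df.drop(columns='Attempted Category')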

The label distribution is heavily skewed: of the 2099976 records, 1594545 (about 76%) are benign.

In [None]:
df['Label'].unique()

In [None]:
df['Label'].value_counts()

Delete blank records. This shouldn't make a difference since the new dataset already has no blank records.

In [None]:
df = df.drop(df[pd.isnull(df['Flow ID'])].index)
df.shape

The "Flow Bytes/s" and "Flow Packets/s" columns contain non-numerical values (the string 'Infinity'). Replace them with -1 and convert both columns to numeric.

In [None]:
df.replace('Infinity', -1, inplace=True)
df[["Flow Bytes/s", "Flow Packets/s"]] = df[["Flow Bytes/s", "Flow Packets/s"]].apply(pd.to_numeric)

Replace the NaN values and infinity values with -1.

In [None]:
df.replace([np.inf, -np.inf, np.nan], -1, inplace=True)

Convert the string columns to integer codes using LabelEncoder rather than OneHotEncoder.
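
The difference between the two encoders can be seen on a toy column. This is a minimal sketch with made-up values, and the motivation given in the comments (keeping the column count fixed for high-cardinality columns) is my assumption rather than something the notebook states.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with made-up values (not taken from the dataset).
col = pd.Series(["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.3"])

# Label encoding: one integer column, whatever the number of distinct values.
codes = LabelEncoder().fit_transform(col)

# One-hot encoding: one column per distinct value, which grows with cardinality.
one_hot = pd.get_dummies(col)

print(list(codes))
print(one_hot.shape)
```

Note that label encoding imposes an arbitrary ordering on the values. Tree-based models tolerate that better than linear models do.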

In [None]:
string_features = list(df.select_dtypes(include=['object']).columns)
string_features.remove('Label')
string_features

In [None]:
le = preprocessing.LabelEncoder()
df[string_features] = df[string_features].apply(lambda col: le.fit_transform(col))

In [None]:
df.to_csv("web_attacks_unbalanced.csv", index=False)

### Undersampling to correct class imbalance

The dataset is unbalanced: of 2099976 total records, 1594545 are "BENIGN", while attack records are far fewer: 11 + 13 + 18 + 36 + 73 + 736 + 1740 + 2961 + 3859 + 3972 + 7567 + 71767 + 95144 + 158468 + 159066 = 505431.

In [None]:
df = pd.read_csv('web_attacks_unbalanced.csv')

In [None]:
benign_total = len(df[df['Label'] == "BENIGN"])
benign_total

In [None]:
attack_total = len(df[df['Label'] != "BENIGN"])
attack_total

In [None]:
df['Label'].value_counts()

We use **undersampling** to correct class imbalances: we remove most of the "BENIGN" records.

Form a balanced dataset, web_attacks_balanced.csv, with target proportions of 70% benign data and 30% attacks (1684770 records in total: 505431 attacks, 1179339 benign).
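
The target sizes follow directly from the class counts reported earlier. The sketch below redoes the arithmetic with integers to avoid float rounding. The variable names are illustrative and are not the ones used in the notebook cells.

```python
# Counts taken from the label distribution reported earlier in this notebook.
attack_total = 505431
benign_total = 1594545

# All attacks are kept and should make up 30% of the balanced set.
balanced_total = attack_total * 10 // 3
benign_target = balanced_total - attack_total

# Probability of keeping any single benign record.
benign_keep_probability = benign_target / benign_total
print(balanced_total, benign_target, round(benign_keep_probability, 4))
```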

Algorithm to form a balanced df_balanced dataset:

* All the records with the attacks are copied to the new dataset.
* Each "BENIGN" record is copied to the new dataset with probability benign_inc_probability. Unlike the original notebook, no hard limit on the number of "BENIGN" records is enforced, so the final count varies slightly between runs.
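
The probabilistic part of this algorithm can also be written as a single boolean mask instead of a row-by-row loop. The following is a minimal sketch on a toy frame. The frame, the seed and the value of `keep_probability` are placeholders, and no hard cap on benign records is applied.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only to make the sketch reproducible

# Toy frame standing in for the real data: 9000 benign and 1000 attack records.
toy = pd.DataFrame({"Label": ["BENIGN"] * 9000 + ["Attack"] * 1000})
keep_probability = 0.25  # placeholder for benign_inc_probability

is_benign = (toy["Label"] == "BENIGN").to_numpy()
# Keep every attack; keep each benign record independently with keep_probability.
keep = ~is_benign | (rng.random(len(toy)) <= keep_probability)
toy_balanced = toy[keep]
print(toy_balanced["Label"].value_counts())
```

On millions of rows a mask like this is typically far faster than `iterrows`, at the cost of drawing one random number per row up front.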

Calculate the probability of copying a "BENIGN" record.

In [None]:
total_samples = len(df[df['Label'] != 'BENIGN']) // 0.3
benign_included_max = round(total_samples * 0.7)
benign_inc_probability = benign_included_max / benign_total
print(benign_included_max, benign_inc_probability)

Copy records from df to df_balanced, save dataset **web_attacks_balanced.csv**.

In [None]:
import random
indexes = []
benign_included_count = 0
for index, row in df.iterrows():
    if (row['Label'] == "BENIGN"):
      # Have we achieved 70%?
      #if benign_included_count > benign_included_max: continue
      # Copying with benign_inc_probability
      if random.random() > benign_inc_probability: continue
      benign_included_count += 1

    indexes.append(index)

df_balanced = df.loc[indexes]

In [None]:
df_balanced['Label'].value_counts()

In [None]:
len(df_balanced[df_balanced['Label'] == 'BENIGN'])/len(df_balanced)

In [None]:
df_balanced.to_csv("web_attacks_balanced.csv", index=False)

### Preparing data for training

In [None]:
df = pd.read_csv('web_attacks_balanced.csv')

The Label column is encoded as follows: "BENIGN" = 0, attack = 1.

In [None]:
df['Label'] = df['Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)

Seven features (Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp) are excluded from the dataset. The hypothesis is that the "shape" of the data being transmitted is more important than these attributes. In addition, ports and addresses can be spoofed by an attacker, so it is better that the ML algorithm does not take these features into account in training [Kostas2018].

In [None]:
excluded = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp']
df = df.drop(columns=excluded)

In [None]:
df.to_csv("definitive_dataset.csv", index=False)

## Feature importance

In [None]:
df = df.sample(500000) # randomly take 500k samples for hyperparameters selection
y = df['Label'].values
X = df.drop(columns=['Label'])
print(X.shape, y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=241)

### Evaluation of importance using RandomForestClassifier.feature_importances_

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=150, random_state=42, oob_score=True)
rf.fit(X_train, y_train)
# Score = mean accuracy on the given test data and labels
print('Training accuracy: {:.2f} \nValidation accuracy: {:.2f} \nOut-of-bag score: {:.2f}'
      .format(rf.score(X_train, y_train), rf.score(X_test, y_test), rf.oob_score_))

We select all features with an importance of at least 1%.

In [None]:
features = X.columns
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
webattack_features = []

for index, i in enumerate(indices):
    if importances[i] >= 0.01:
      webattack_features.append(features[i])
    print(f'{index+1}. \t #{i} \t {importances[i]:.3f} \t {features[i]}')

Visualize what we're left with

In [None]:
indices = np.argsort(importances)[-len(webattack_features):]
plt.rcParams['figure.figsize'] = (11, 6)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='#cccccc', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.grid()
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = rf.predict(X_test)
confusion_matrix(y_test, y_pred)

## Analysis of selected features

In [None]:

frame_attacks = df[df['Label'] == 1]
frame_benigns = df[df['Label'] == 0]
plt.figure(figsize=(30,45))
for i, feat in enumerate(webattack_features):
    plt.subplot(len(webattack_features)//3+1, 3, i+1)
    # Compare the two class distributions for this feature
    x1 = frame_benigns[feat]
    x2 = frame_attacks[feat]
    plt.hist([x1, x2], bins=150, density=True, stacked=True)
    plt.title(feat)
    plt.legend(['benign', 'attacks'])
plt.show()

In [None]:
import seaborn as sns
corr_matrix = df[webattack_features].corr()
plt.rcParams['figure.figsize'] = (18, 8)
g = sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys')
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30);
plt.show()

Remove correlated features.
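
In this notebook the features to drop were picked by reading the heatmap. As a complement, candidates can be listed programmatically from the upper triangle of the correlation matrix. The sketch below uses toy data and an assumed cutoff of 0.9, which is not a value used anywhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
# Toy features: "b" is nearly a scaled copy of "a", "c" is independent noise.
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})

threshold = 0.9  # assumed cutoff for this sketch
corr = toy.corr().abs()
# Keep only the upper triangle so each pair of features is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
candidates = [col for col in upper.columns if (upper[col] > threshold).any()]
print(candidates)
```

Which member of a correlated pair to drop remains a judgment call. This rule simply flags the one that appears later in the column order.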

In [None]:
to_be_removed = {'Packet Length Std', 'Bwd Packet Length Std', 'Bwd Segment Size Avg', 'Bwd Packet Length Max', 'Bwd Packet Length Mean', 'Packet Length Max', 'Packet length Max', 'Packet Length Mean', 'Average Packet Size', 'Subflow Bwd Bytes', 'Fwd RST Flags', 'Subflow Fwd Bytes', 'Fwd Segment Size Avg'}
webattack_features = [item for item in webattack_features if item not in to_be_removed]

In [None]:
corr_matrix = df[webattack_features].corr()
plt.rcParams['figure.figsize'] = (12, 5)
sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys');

In [None]:
print(webattack_features)

## Hyperparameter selection

### Grid search

In [None]:
parameters = {'n_estimators': [20, 30, 50, 70],
              'min_samples_leaf': [4, 3, 2],
              'max_features': ['sqrt', 'log2', 5, 12, None],
              'max_depth': [3, 5, 8, None]}

In [None]:
from sklearn.model_selection import GridSearchCV
X_train = X_train[webattack_features]
X_test = X_test[webattack_features]

rfc = RandomForestClassifier(criterion='gini', random_state=6, oob_score=True, n_jobs=-1)
gcv = GridSearchCV(rfc, parameters, scoring='f1', refit='f1', cv=3, return_train_score=True, verbose=10, n_jobs=2)
gcv.fit(X_train, y_train)

Let's take a look at the results of the parameter selection.
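
Beyond the best parameters and score, the full table of results is available in `cv_results_`. The sketch below shows one way to inspect it. It uses synthetic data and a deliberately tiny grid so that it runs quickly, and the names `toy_search` and `toy_grid` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid, used only to keep the sketch fast.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
toy_grid = {"n_estimators": [5, 10], "max_depth": [3, None]}

toy_search = GridSearchCV(RandomForestClassifier(random_state=0), toy_grid,
                          scoring="f1", cv=3)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, sorted from best to worst mean F1.
results = pd.DataFrame(toy_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```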

In [None]:
gcv.best_params_

In [None]:
gcv.best_score_

## Final model

In [None]:
df = pd.read_csv('definitive_dataset.csv')
df

In [None]:
y = df['Label']
X = df[webattack_features] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
#rfc = RandomForestClassifier(criterion='gini', max_depth=20, max_features=3, min_samples_leaf=2, n_estimators=25, random_state=42, oob_score=True)
rfc = gcv.best_estimator_
rfc.fit(X_train, y_train)

In [None]:
features = rfc.feature_names_in_
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]

for index, i in enumerate(indices):
    print(f'{index+1}. \t #{i} \t {importances[i]:.3f} \t {features[i]}')

In [None]:
import sklearn.metrics as metrics

y_pred = rfc.predict(X_test)
# Class-1 probabilities give a full ROC curve; hard 0/1 predictions yield a single operating point
y_score = rfc.predict_proba(X_test)[:, 1]

Accuracy = metrics.accuracy_score(y_test, y_pred)
Precision = metrics.precision_score(y_test, y_pred)
Recall = metrics.recall_score(y_test, y_pred)
F1 = metrics.f1_score(y_test, y_pred)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
auroc = metrics.roc_auc_score(y_test, y_score)

# Confusion matrix layout (rows = true class, columns = predicted class;
# Wikipedia uses a different convention for the axes):
#
#       0  1   <- predicted value
#     0 TN FP
#     1 FN TP
print(metrics.confusion_matrix(y_test, y_pred))
print(f'{Accuracy = }')
print(f'{Precision = }')
print(f'{Recall = }')
print(f"Area Under ROC Curve = {auroc}")
print(f'{F1 = }')
plt.figure(figsize=(5,4))
plt.plot(fpr, tpr)
plt.show()

## Model saving

In [None]:
import pickle
with open('rf_model.pkl', 'wb') as f:
    pickle.dump(rfc, f)

## Inter-dataset testing

We tested against CIC-IDS-2017 mostly as a control. Now we evaluate performance against a similar dataset, the corrected CSE-CIC-IDS-2018.

Open the previously saved model.

In [None]:
import pickle

import matplotlib.pyplot as plt  # needed by eval_metrics when this section runs in a fresh kernel
import numpy as np
import pandas as pd
import sklearn.metrics as metrics

from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
with open('rf_model.pkl', 'rb') as f:
    rfc: RandomForestClassifier = pickle.load(f)
rfc

In [None]:
#!wget https://intrusion-detection.distrinet-research.be/CNS2022/Datasets/CSECICIDS2018_improved.zip -O 2018dataset.zip
#!unzip -u -d Corrected_CSECICIDS2018/ 2018dataset.zip 

The dataset is too large to load all at once, so I define two functions that can be called on each file in turn.  
Since the two datasets share the same features, the data pipeline is the same as before minus the undersampling, and it directly selects the relevant features.
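
One way to stay within memory limits, sketched here as an alternative to sampling a fixed number of rows, is the `chunksize` parameter of `pd.read_csv`, which yields the file piece by piece. The CSV below is an in-memory stand-in with placeholder columns, and only a label count is accumulated. Feeding each chunk through the pipeline and merging the metrics across chunks is left out.

```python
import io
from collections import Counter

import pandas as pd

# In-memory stand-in for one large CSV file; the column names are placeholders.
csv_text = "id,Label\n" + "".join(
    f"{i},{'BENIGN' if i % 4 else 'Attack'}\n" for i in range(10)
)

label_counts = Counter()
# With chunksize set, read_csv returns an iterator of DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=4):
    label_counts.update(chunk["Label"].value_counts().to_dict())

print(dict(label_counts))
```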


In [None]:
def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # work on a copy so the caller's frame is left untouched

    # Relabel attempted attacks (category other than -1) as benign, then drop the helper column
    df.loc[df['Attempted Category'] != -1, 'Label'] = 'BENIGN'
    df.drop(columns='Attempted Category', inplace=True)

    # Drop blank records
    df.drop(df[pd.isnull(df['Flow ID'])].index, inplace=True)

    # Replace textual infinities, then convert the rate columns to numeric
    df.replace('Infinity', -1, inplace=True)
    df[["Flow Bytes/s", "Flow Packets/s"]] = df[["Flow Bytes/s", "Flow Packets/s"]].apply(pd.to_numeric)

    # Replace numeric infinities and NaN with -1
    df.replace([np.inf, -np.inf, np.nan], -1, inplace=True)

    # Binary target: 0 = benign, 1 = attack
    df['Label'] = df['Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)

    # Encode the remaining string columns as integers
    le = LabelEncoder()
    string_features = list(df.select_dtypes(include=['object']).columns)
    df[string_features] = df[string_features].apply(lambda col: le.fit_transform(col))

    # Keep only the features the loaded model was trained on, plus the label
    feats = list(rfc.feature_names_in_)
    feats.append('Label')
    return df[feats]

In [None]:
def eval_metrics(df: pd.DataFrame) -> None:

    y_test = df['Label']
    X_test = df.drop(columns='Label')
    #print('Predicting')
    y_pred = rfc.predict(X_test)
    # Class-1 probabilities give a full ROC curve; hard 0/1 predictions yield a single operating point
    y_score = rfc.predict_proba(X_test)[:, 1]

    #print('Computing scores')
    Accuracy = metrics.accuracy_score(y_test, y_pred)
    Precision = metrics.precision_score(y_test, y_pred)
    Recall = metrics.recall_score(y_test, y_pred)
    F1 = metrics.f1_score(y_test, y_pred)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
    auroc = metrics.roc_auc_score(y_test, y_score)

    # Confusion matrix layout (rows = true class, columns = predicted class;
    # Wikipedia uses a different convention for the axes):
    #
    #       0  1   <- predicted value
    #     0 TN FP
    #     1 FN TP
    print(metrics.confusion_matrix(y_test, y_pred))
    print(f'{Accuracy = }')
    print(f'{Precision = }')
    print(f'{Recall = }')
    print(f"Area Under ROC Curve = {auroc}")
    print(f'{F1 = }')
    plt.figure(figsize=(5,4))
    plt.plot(fpr, tpr)
    plt.show()

In [None]:
for filename in Path('./Corrected_CSECICIDS2018/').glob('*.csv'):
  print(f"{filename}")
  df = pd.read_csv(filename, index_col=0, encoding='latin')
  # Evaluate on at most one million rows per file: processing the whole file saturates the RAM and freezes the computer
  df = df.sample(min(len(df), 1_000_000))
  #print(df.shape)
  df = prepare_data(df)
  eval_metrics(df)

# input("Press enter to continue")

## OOD Testing

Evaluating performance on a different dataset is harder, especially when it does not share a similar set of features.
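
A first practical step is to check which of the model's features the new dataset actually provides. The sketch below is only an illustration: the feature names and the toy frame are hypothetical, and filling absent columns with -1 (the placeholder this notebook uses for missing values) is an assumption that can be expected to hurt predictions when important features are missing.

```python
import pandas as pd

# Hypothetical feature names; the real list would come from rfc.feature_names_in_.
expected_features = ["Flow Duration", "Total Fwd Packet", "Bwd Init Win Bytes"]

# Toy frame from another source: one expected column is missing, one is extra.
other = pd.DataFrame({
    "Flow Duration": [10, 20],
    "Total Fwd Packet": [3, 4],
    "Unrelated Column": [0, 1],
})

missing = [name for name in expected_features if name not in other.columns]
# reindex drops extra columns, orders the rest, and fills absent ones with -1.
aligned = other.reindex(columns=expected_features, fill_value=-1)
print(missing)
print(aligned)
```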