<a href="https://colab.research.google.com/github/parmigggiana/ml-ids/blob/main/IDS_CTF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Attack detection using CTF dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
# The experimental flag must be imported before HalvingGridSearchCV itself.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.cluster import KMeans, MeanShift, FeatureAgglomeration
from pandas.plotting import scatter_matrix

%matplotlib inline

## Utility Functions

In [None]:

def print_scores(y_test, y_pred):
    Accuracy = metrics.accuracy_score(y_test, y_pred)
    Precision = metrics.precision_score(y_test, y_pred)
    Recall = metrics.recall_score(y_test, y_pred)
    F1 = metrics.f1_score(y_test, y_pred)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
    auroc = metrics.roc_auc_score(y_test, y_pred)
    # Confusion matrix layout printed below (rows: true label, columns:
    # predicted label; Wikipedia uses a different convention for the axes):
    #
    #            pred 0  pred 1
    #   true 0     TN      FP
    #   true 1     FN      TP
    print(metrics.confusion_matrix(y_test, y_pred))
    print(f'{Accuracy = }')
    print(f'{Precision = }')
    print(f'{Recall = }')
    print(f"Area Under ROC Curve = {auroc}")
    print(f'{F1 = }')
    plt.figure(figsize=(5,4))
    plt.plot(fpr, tpr)
    plt.show()

## Data preparation

In [None]:
!wget https://github.com/parmigggiana/ml-ids/raw/main/CTF%20Data/ctf_flows_1.csv -O dataset_ctf.csv

In [None]:
df = pd.read_csv('CTF Data/Thu15.csv')
df.shape

Make sure there are no null rows.

In [None]:
df = df.drop(df[pd.isnull(df['Flow ID'])].index)
df.shape

Drop the Label column, since it carries no useful information here.

In [None]:
df.drop(columns='Label', inplace=True)
df.shape

Drop all flows pertaining to SSH (port 22) and Caronte (port 3333).

In [None]:
df.drop(df[df['Src Port'] == 22].index, inplace=True)
df.drop(df[df['Dst Port'] == 22].index, inplace=True)
df.drop(df[df['Src Port'] == 3333].index, inplace=True)
df.drop(df[df['Dst Port'] == 3333].index, inplace=True)
df.shape

Drop all flows made by our team (addresses in 10.80.39.x).

In [None]:
df.drop(df[df['Src IP'].str.fullmatch(r"10\.80\.39\.\d{1,3}")].index, inplace=True)
df.drop(df[df['Dst IP'].str.fullmatch(r"10\.80\.39\.\d{1,3}")].index, inplace=True)
df.shape

In [None]:
df['Src IP'].unique()

I noticed that 1784 flows belong to other addresses. This probably means the game server has a bug that leaks some packets. Upon manual inspection of the pcap, I found they are mostly FIN/ACK and RST packets.
I chose to keep these flows, since they are still actual traffic and the IP features will be removed anyway.

In [None]:
df[
    ((df["Src IP"] != "10.254.0.1") & (df["Src IP"] != "10.60.39.1"))
    | ((df["Dst IP"] != "10.254.0.1") & (df["Dst IP"] != "10.60.39.1"))
].shape

The "Flow Bytes/s" and "Flow Packets/s" columns contain non-numerical values (the string `Infinity`). Replace them with -1 and convert both columns to numeric.

In [None]:
df.replace('Infinity', -1, inplace=True)
df[["Flow Bytes/s", "Flow Packets/s"]] = df[["Flow Bytes/s", "Flow Packets/s"]].apply(pd.to_numeric)
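
As a small illustration of the cleaning step above, the following sketch applies the same two operations to a toy column. The values are hypothetical and only mimic a rate column that was exported as text. It is meant to show why the string replacement has to happen before the numeric conversion, not to reproduce the real dataset.

```python
import pandas as pd

# Toy column (hypothetical values) stored as text, as in the raw CSV export.
toy = pd.DataFrame(
    {"Flow Bytes/s": pd.Series(["1200.5", "Infinity", "87"], dtype=object)}
)

# Step 1: swap the textual marker for the sentinel value -1.
toy = toy.replace("Infinity", -1)

# Step 2: convert the remaining numeric strings to floats.
toy["Flow Bytes/s"] = pd.to_numeric(toy["Flow Bytes/s"])

print(toy["Flow Bytes/s"].tolist())
print(toy["Flow Bytes/s"].dtype)
```

Using -1 as a sentinel keeps the column numeric while remaining distinguishable from any real rate, which cannot be negative.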

Replace the NaN values and infinity values with -1.

In [None]:
df.replace([np.inf, -np.inf, np.nan], -1, inplace=True)

7 features (Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp) are excluded from the dataset. The hypothesis is that the "shape" of the data being transmitted is more important than these attributes. In addition, ports and addresses can be substituted by an attacker, so it is better that the ML algorithm does not take these features into account in training [Kostas2018].

In [None]:
excluded = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp']
df.drop(columns=excluded, inplace=True)

## Dimensionality Reduction

We don't have labels to select features based on importance. A simple approach is to reuse the features that were selected on the CIC-IDS-2017 dataset.

### Feature Selection

In [None]:
features = ['RST Flag Count', 'Total Length of Fwd Packet', 'Fwd Packet Length Max', 'Packet Length Variance', 'Fwd Packets/s', 'Fwd Packet Length Mean', 'Flow IAT Max', 'Flow Duration', 'Flow Packets/s', 'Total TCP Flow Time', 'PSH Flag Count', 'Packet Length Min', 'Bwd IAT Total', 'FWD Init Win Bytes', 'Flow Bytes/s', 'ACK Flag Count', 'Fwd Header Length', 'SYN Flag Count', 'Total Bwd packets']
X = df[features]

In [None]:


# Inspect only the selected features; plotting every column of df is unreadable.
corr_matrix = X.corr()
plt.rcParams['figure.figsize'] = (14, 6)
g = sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys')
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30)
plt.show()
scatter_matrix(X, alpha=0.1, figsize=(20, 20), diagonal="kde")

An alternative to selecting the features to use would be employing a dimensionality reduction algorithm.
Before that, we need to normalize our dataset.
I chose RobustScaler over StandardScaler because it centers on the median and scales by the interquartile range, which makes it more resilient to outliers and imbalanced data.
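
To make the difference concrete, here is a minimal sketch on a hypothetical single feature with one extreme value. The numbers are invented for illustration; only `RobustScaler` and `StandardScaler` come from the notebook's own toolbox.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: nine ordinary values and one extreme outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

robust = RobustScaler().fit(values)
standard = StandardScaler().fit(values)

# Width of the range occupied by the nine ordinary values after scaling.
robust_scaled = robust.transform(values)[:9]
standard_scaled = standard.transform(values)[:9]
robust_span = robust_scaled.max() - robust_scaled.min()
standard_span = standard_scaled.max() - standard_scaled.min()

print("RobustScaler center (median):", robust.center_[0])
print("StandardScaler center (mean):", standard.mean_[0])
print("Span of ordinary values, robust:", robust_span)
print("Span of ordinary values, standard:", standard_span)
```

With StandardScaler the single outlier inflates both the mean and the standard deviation, so the ordinary values are squeezed into a very narrow band. RobustScaler leaves them well spread, which is the property this notebook relies on.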

### PCA

In [None]:
scaler = RobustScaler()
scaler.fit(df)
X = scaler.transform(df)

dimred = PCA(20, random_state=42)
dimred.fit(X)
X = dimred.transform(X)
# PCA was fitted on a NumPy array, so the input names live on the scaler.
features = scaler.feature_names_in_
X.shape

In [None]:
corr_matrix = pd.DataFrame(X).corr()
g = sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys')
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30)
plt.show()
scatter_matrix(pd.DataFrame(X), alpha=0.2, figsize=(20, 20), diagonal="kde")

### Feature Agglomeration

In [None]:
scaler = RobustScaler()
scaler.fit(df)
X = scaler.transform(df)

dimred = FeatureAgglomeration(n_clusters=15)
dimred.fit(X)
X = dimred.transform(X)
# FeatureAgglomeration was fitted on a NumPy array, so the input names live on the scaler.
features = scaler.feature_names_in_
X.shape

In [None]:
corr_matrix = pd.DataFrame(X).corr()
g = sns.heatmap(corr_matrix, annot=True, fmt='.1g', cmap='Greys')
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30)
plt.show()
scatter_matrix(pd.DataFrame(X), alpha=0.2, figsize=(20, 20), diagonal="kde")

## Training

### KMeans

In [None]:
clf: KMeans = KMeans(2, random_state=42, verbose=10)

param_grid = {
    'algorithm': ['lloyd', 'elkan'], 
    'max_iter': [100, 300, 650, 1000, 2000], 
    'n_init': [1, 2, 5, 10, 20], 
}
search = HalvingGridSearchCV(clf, param_grid=param_grid, n_jobs=-1, verbose=10).fit(X)
clf: KMeans = search.best_estimator_


In [None]:
print(search.best_estimator_)
print(search.best_params_)

### MeanShift

In [None]:
clf = MeanShift(max_iter=300, n_jobs=-1)
clf.fit(X)

# MeanShift outputs 107 classes (clusters). To project them onto binary labels,
# I iteratively tried treating each class as benign or as attack, keeping the
# choice that maximizes the F1 score on the test dataset (Corrected CIC-IDS-2017).
test_df = pd.read_csv(filepath_or_buffer='definitive_dataset.csv')
y_test = test_df['Label']
X_test = test_df.drop(columns='Label')
X_test = scaler.transform(X_test)
X_test = dimred.transform(X_test)

clusters = clf.predict(X_test)

benign_classes = []
for c in np.unique(clusters):
    # Classes already accepted as benign map to 0, every other class to 1 (attack).
    # Compare treating class c as benign against treating it as attack.
    y_pred_ben = (~np.isin(clusters, benign_classes + [c])).astype(int)
    y_pred_att = (~np.isin(clusters, benign_classes)).astype(int)

    F1_B = metrics.f1_score(y_test, y_pred_ben)
    F1_A = metrics.f1_score(y_test, y_pred_att)

    print(f"F1 with class {c} as BENIGN: {F1_B}")
    print(f"F1 with class {c} as ATTACK: {F1_A}")
    print('\n')
    if F1_B > F1_A:
        benign_classes.append(c)

print(benign_classes)
# Final binary prediction: 0 for benign classes, 1 for everything else.
y_pred = (~np.isin(clusters, benign_classes)).astype(int)
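
The cluster-to-label projection described in the cell above can be tried on toy data. The sketch below is a hedged, self-contained version of that idea: the cluster ids and ground-truth labels are hypothetical, and `greedy_benign_clusters` is an illustrative helper name, not part of the notebook.

```python
import numpy as np
from sklearn import metrics

# Hypothetical cluster assignments and ground-truth labels (1 = attack).
clusters = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1])


def greedy_benign_clusters(clusters, y_true):
    """Greedily pick the clusters that raise F1 when labeled benign (0)."""
    benign = []
    for c in np.unique(clusters):
        # Everything outside the benign set counts as attack (1).
        as_attack = (~np.isin(clusters, benign)).astype(int)
        as_benign = (~np.isin(clusters, benign + [c])).astype(int)
        if metrics.f1_score(y_true, as_benign) > metrics.f1_score(y_true, as_attack):
            benign.append(c)
    return benign


benign = greedy_benign_clusters(clusters, y_true)
y_bin = (~np.isin(clusters, benign)).astype(int)

print("benign clusters:", [int(c) for c in benign])
print("binary prediction:", y_bin.tolist())
```

Because the choice is greedy and tuned on the labeled test set, the resulting F1 score is an optimistic estimate. A held-out split for the mapping step would give a fairer number.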

### Isolation Forest

In [None]:
clf: IsolationForest = IsolationForest(n_estimators=200, max_features=10, bootstrap=True, random_state=42, verbose=1)

clf.fit(X)
y_pred = clf.predict(X_test)  # IsolationForest returns -1 for outliers and 1 for inliers
y_pred = [0 if x == -1 else 1 for x in y_pred]  # map outliers to 0 and inliers to 1
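
For reference, the output convention of `IsolationForest` can be checked on synthetic data. The sketch below assumes a hypothetical two-dimensional cloud with one distant point. It only illustrates how `predict` and `decision_function` behave and says nothing about which label is appropriate for the CTF traffic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical data: a tight cloud of 200 points and one far-away point.
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outlier = np.array([[12.0, 12.0]])

forest = IsolationForest(n_estimators=100, random_state=42).fit(inliers)

sample = np.vstack([inliers[:3], outlier])
raw = forest.predict(sample)              # -1 for outliers, 1 for inliers
scores = forest.decision_function(sample)  # negative values indicate outliers

print("raw predictions:", raw.tolist())
print("decision scores:", scores.round(3).tolist())
```

Which of the two outputs should count as an attack depends on what dominates the training traffic, so the mapping to 0/1 is worth checking against the label convention of the test set.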


## Testing

In [None]:
#!wget https://intrusion-detection.distrinet-research.be/CNS2022/Datasets/CICIDS2017_improved.zip -O dataset.zip
#!unzip -u -d Corrected_CICIDS2017/ dataset.zip 
test_df = pd.read_csv(filepath_or_buffer='definitive_dataset.csv')

y_test = test_df['Label']
X_test = test_df.drop(columns='Label')
X_test = scaler.transform(X_test)
X_test = dimred.transform(X_test)

In [None]:
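# Raw predict() output is only directly comparable with the 0/1 labels for KMeans.
# For MeanShift and Isolation Forest, skip the next line and score the binarized
# y_pred computed in their own cells.
y_pred = clf.predict(X_test)
print_scores(y_test, y_pred)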