<a href="https://colab.research.google.com/github/parmigggiana/ml-ids/blob/main/IDS_CTF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web attack detection using CTF dataset

## Data preprocessing

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
!wget https://github.com/parmigggiana/ml-ids/raw/main/CTF%20Data/ctf_flows_1.csv -O dataset_ctf.csv

In [2]:
df = pd.read_csv('dataset_ctf.csv')

In [3]:
df.shape

(79819, 89)

Make sure there are no null rows, i.e. rows with a missing "Flow ID".

In [4]:
df = df.drop(df[pd.isnull(df['Flow ID'])].index)
df.shape

(79819, 89)
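As a rough cross-check, the same step can be sketched with `isnull().sum()` and `dropna`. The toy frame below is hypothetical and only stands in for the real dataset; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real flows.
toy = pd.DataFrame({
    "Flow ID": ["a", None, "c"],
    "Flow Duration": [1.0, 2.0, np.nan],
})

# Count missing values per column before deciding what to drop.
null_counts = toy.isnull().sum()
print(null_counts)

# Drop only the rows whose "Flow ID" is missing, as the cell above does.
cleaned = toy.dropna(subset=["Flow ID"])
print(cleaned.shape)
```

Counting nulls per column first shows whether missing values are limited to "Flow ID" or also appear in other columns.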

Drop the Label column, since it's useless here.

In [5]:
df.drop(columns='Label', inplace=True)
df.shape

(79819, 88)

Drop all flows pertaining to SSH (port 22) and Caronte (port 3333).

In [6]:
df.drop(df[df['Src Port'] == 22].index, inplace=True)
df.drop(df[df['Dst Port'] == 22].index, inplace=True)
df.drop(df[df['Src Port'] == 3333].index, inplace=True)
df.drop(df[df['Dst Port'] == 3333].index, inplace=True)
df.shape

(79472, 88)
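The four drops above can also be expressed as a single boolean mask. This is a minimal sketch on a hypothetical toy frame; the name `excluded_ports` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy frame; only the two port columns matter here.
toy = pd.DataFrame({
    "Src Port": [22, 443, 51000, 3333, 80],
    "Dst Port": [51000, 22, 80, 80, 3333],
})

excluded_ports = [22, 3333]  # SSH and Caronte, as in the cell above

# A row is dropped if either endpoint uses an excluded port.
mask = toy["Src Port"].isin(excluded_ports) | toy["Dst Port"].isin(excluded_ports)
filtered = toy[~mask]
print(filtered)
```

Building one mask avoids repeated in-place drops and makes it easy to add further ports later.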

Drop all flows made by our team (addresses matching 10.80.39.x).

In [7]:
df.drop(df[df['Src IP'].str.fullmatch(r"10\.80\.39\.\d{1,3}")].index, inplace=True)
df.drop(df[df['Dst IP'].str.fullmatch(r"10\.80\.39\.\d{1,3}")].index, inplace=True)
df.shape

(79471, 88)
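An alternative to the regular expression is the standard-library `ipaddress` module. The sketch below assumes the team range corresponds to the subnet 10.80.39.0/24, which is an interpretation of the regex rather than something stated in the notebook; `team_net` and `in_team_net` are hypothetical names.

```python
import ipaddress
import pandas as pd

# Assumption: the regex 10\.80\.39\.\d{1,3} is meant to cover this /24 subnet.
team_net = ipaddress.ip_network("10.80.39.0/24")

def in_team_net(ip: str) -> bool:
    """Return True if the address belongs to the assumed team subnet."""
    return ipaddress.ip_address(ip) in team_net

# Hypothetical toy frame with the two address columns.
toy = pd.DataFrame({
    "Src IP": ["10.80.39.5", "10.80.35.2", "10.60.39.1"],
    "Dst IP": ["10.80.22.2", "10.80.39.200", "10.80.36.6"],
})

# Drop a flow if either endpoint is inside the team subnet.
mask = toy["Src IP"].map(in_team_net) | toy["Dst IP"].map(in_team_net)
kept = toy[~mask]
print(kept)
```

Unlike the regex, `ipaddress.ip_address` raises `ValueError` on malformed addresses such as 10.80.39.999, so this variant is stricter about input quality.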

In [8]:
df['Src IP'].unique()

array(['10.254.0.1', '10.80.35.2', '172.23.0.2', '10.80.24.6',
       '10.80.36.6', '10.80.24.12', '10.80.22.2', '10.80.36.8',
       '10.80.22.6', '10.80.35.3', '10.80.20.2', '10.80.6.4', '10.80.5.7',
       '10.80.22.4', '10.80.36.9', '10.80.32.2', '10.80.35.7',
       '10.80.22.7', '10.60.39.1', '10.80.21.4', '10.80.22.5',
       '10.80.36.7', '10.80.22.3', '10.80.30.2', '10.80.26.6'],
      dtype=object)

I noticed that some flows belong to other addresses. This probably means an error in the gameserver is leaking some packets. Manual inspection of the pcap shows that they are mostly FIN/ACK and RST packets.
I chose to keep these flows, since they are still real traffic and the IP features will be removed anyway.

The "Flow Bytes/s" and "Flow Packets/s" columns contain the non-numerical value "Infinity". Replace it with -1 and convert both columns to numeric types.

In [9]:
df.replace('Infinity', -1, inplace=True)
df[["Flow Bytes/s", "Flow Packets/s"]] = df[["Flow Bytes/s", "Flow Packets/s"]].apply(pd.to_numeric)

Replace any remaining NaN and infinite values with -1.

In [10]:
df.replace([np.inf, -np.inf, np.nan], -1, inplace=True)
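The two cleaning cells above can be folded into one per-column step. This is a sketch under the assumption that any unparsable string should also become -1, which is slightly broader than what the notebook does; the toy column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mixing numeric strings with placeholder values.
toy = pd.DataFrame({"Flow Bytes/s": ["1200.5", "Infinity", "NaN", "0"]})

# Coerce to numbers; anything that cannot be parsed becomes NaN.
col = pd.to_numeric(toy["Flow Bytes/s"], errors="coerce")

# Map infinities to NaN, then fill every NaN with the sentinel -1.
col = col.replace([np.inf, -np.inf], np.nan).fillna(-1)
print(col.tolist())
```

The sentinel -1 is safe here only because rates cannot legitimately be negative, so the model can still tell placeholder values apart from real measurements.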

Seven features (Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp) are excluded from the dataset. The hypothesis is that the "shape" of the transmitted data matters more than these attributes. In addition, an attacker can change ports and addresses, so it is better for the ML algorithm not to take these features into account during training [Kostas2018].

In [11]:
excluded = ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol', 'Timestamp']
df.drop(columns=excluded, inplace=True)
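After dropping the identifier columns, it can be worth confirming that only numeric features remain. The sketch below uses a hypothetical toy frame with a subset of the column names; `non_numeric` is a name introduced for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two identifier columns and two numeric features.
toy = pd.DataFrame({
    "Flow ID": ["10.0.0.1-10.0.0.2-1-2-6", "10.0.0.3-10.0.0.4-3-4-6"],
    "Src IP": ["10.0.0.1", "10.0.0.3"],
    "Flow Duration": [10, 20],
    "Flow Bytes/s": [1.5, -1.0],
})

excluded = ["Flow ID", "Src IP"]  # subset of the real list, for illustration
features = toy.drop(columns=excluded)

# List any columns that are still not numeric after the drop.
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

An empty list suggests the remaining columns can be passed to a scikit-learn estimator without further encoding.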