# Traffic classification - DDoS or benign.

Research on intrusion detection system (IDS) logs, aimed at
detecting abnormal traffic using machine learning.

The DDoS flows were extracted from several public IDS datasets and concatenated with benign (normal) traffic flows.

Base datasets:
1. CSE-CIC-IDS2018-AWS: https://www.unb.ca/cic/datasets/ids-2018.html
2. CICIDS2017: https://www.unb.ca/cic/datasets/ids-2017.html
3. CIC DoS dataset(2016): https://www.unb.ca/cic/datasets/dos-dataset.html

## Download dataset. Import libraries.

In [1]:
! pip install kaggle==1.5.3 -q
! pip install urllib3 --upgrade -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kaggle 1.5.3 requires urllib3<1.25,>=1.21.1, but you have urllib3 2.3.0 which is incompatible.


In [2]:
! kaggle datasets files devendra416/ddos-datasets

name                                         size  creationDate         
-------------------------------------------  ----  -------------------  
ddos_balanced/final_dataset.csv               6GB  2019-04-30 13:57:07  
ddos_imbalanced/unbalaced_20_80_dataset.csv   4GB  2019-04-30 13:58:15  


In [3]:
! kaggle datasets download -d "devendra416/ddos-datasets" -f "ddos_balanced/final_dataset.csv" -p "./content/"

Downloading final_dataset.csv.zip to ./content





In [4]:
import zipfile
import os

CSV_PATH = './content/final_dataset.csv'

if not os.path.exists(CSV_PATH):
    print('Extracting dataset...')
    with zipfile.ZipFile('./content/final_dataset.csv.zip', 'r') as zip_ref:
        zip_ref.extractall('./content/')
    print('Success.')
else:
    print('Dataset already extracted.')

Extracting dataset...
Success.


In [5]:
import numpy as np
import pandas as pd

import warnings


warnings.filterwarnings('ignore')
%matplotlib inline

## Read dataset.

In [6]:
def get_dtypes():
    """Return optimal data types for each column."""
    return {
        'Src IP': 'category',
        'Src Port': 'uint16',
        'Dst IP': 'category',
        'Dst Port': 'uint16',
        'Protocol': 'category',
        'Flow Duration': 'uint32',
        'Tot Fwd Pkts': 'uint32',
        'Tot Bwd Pkts': 'uint32',
        'TotLen Fwd Pkts': 'float32',
        'TotLen Bwd Pkts': 'float32',
        'Fwd Pkt Len Max': 'float32',
        'Fwd Pkt Len Min': 'float32',
        'Fwd Pkt Len Mean': 'float32',
        'Fwd Pkt Len Std': 'float32',
        'Bwd Pkt Len Max': 'float32',
        'Bwd Pkt Len Min': 'float32',
        'Bwd Pkt Len Mean': 'float32',
        'Bwd Pkt Len Std': 'float32',
        'Flow Byts/s': 'float32',
        'Flow Pkts/s': 'float32',
        'Flow IAT Mean': 'float32',
        'Flow IAT Std': 'float32',
        'Flow IAT Max': 'float32',
        'Flow IAT Min': 'float32',
        'Fwd IAT Tot': 'float32',
        'Fwd IAT Mean': 'float32',
        'Fwd IAT Std': 'float32',
        'Fwd IAT Max': 'float32',
        'Fwd IAT Min': 'float32',
        'Bwd IAT Tot': 'float32',
        'Bwd IAT Mean': 'float32',
        'Bwd IAT Std': 'float32',
        'Bwd IAT Max': 'float32',
        'Bwd IAT Min': 'float32',
        'Fwd PSH Flags': 'category',
        'Bwd PSH Flags': 'category',
        'Fwd URG Flags': 'category',
        'Bwd URG Flags': 'category',
        'Fwd Header Len': 'uint32',
        'Bwd Header Len': 'uint32',
        'Fwd Pkts/s': 'float32',
        'Bwd Pkts/s': 'float32',
        'Pkt Len Min': 'float32',
        'Pkt Len Max': 'float32',
        'Pkt Len Mean': 'float32',
        'Pkt Len Std': 'float32',
        'Pkt Len Var': 'float32',
        'FIN Flag Cnt': 'category',
        'SYN Flag Cnt': 'category',
        'RST Flag Cnt': 'category',
        'PSH Flag Cnt': 'category',
        'ACK Flag Cnt': 'category',
        'URG Flag Cnt': 'category',
        'CWE Flag Count': 'category',
        'ECE Flag Cnt': 'category',
        'Down/Up Ratio': 'float32',
        'Pkt Size Avg': 'float32',
        'Fwd Seg Size Avg': 'float32',
        'Bwd Seg Size Avg': 'float32',
        'Fwd Byts/b Avg': 'uint32',
        'Fwd Pkts/b Avg': 'uint32',
        'Fwd Blk Rate Avg': 'uint32',
        'Bwd Byts/b Avg': 'uint32',
        'Bwd Pkts/b Avg': 'uint32',
        'Bwd Blk Rate Avg': 'uint32',
        'Subflow Fwd Pkts': 'uint32',
        'Subflow Fwd Byts': 'uint32',
        'Subflow Bwd Pkts': 'uint32',
        'Subflow Bwd Byts': 'uint32',
        'Init Fwd Win Byts': 'uint32',
        'Init Bwd Win Byts': 'uint32',
        'Fwd Act Data Pkts': 'uint32',
        'Fwd Seg Size Min': 'uint32',
        'Active Mean': 'float32',
        'Active Std': 'float32',
        'Active Max': 'float32',
        'Active Min': 'float32',
        'Idle Mean': 'float32',
        'Idle Std': 'float32',
        'Idle Max': 'float32',
        'Idle Min': 'float32',
        'Label': 'category'
    }
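
To illustrate why an explicit dtype mapping is worth the effort, the sketch below reads a small synthetic CSV twice: once with pandas' default type inference and once with a mapping in the style of `get_dtypes()`. The data, the column subset, and the name `demo_dtypes` are made up for this illustration; the savings on the real dataset will differ.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for a few columns of the real CSV (illustrative values only).
rng = np.random.RandomState(0)
n_rows = 10_000
sample = pd.DataFrame({
    'Src Port': rng.randint(0, 65536, n_rows),
    'Protocol': rng.choice([0, 6, 17], n_rows),
    'Flow Duration': rng.randint(0, 1_000_000, n_rows),
    'Pkt Len Mean': rng.rand(n_rows) * 1500,
})
csv_text = sample.to_csv(index=False)

# Same style of mapping as get_dtypes(), restricted to the columns above.
demo_dtypes = {
    'Src Port': 'uint16',
    'Protocol': 'category',
    'Flow Duration': 'uint32',
    'Pkt Len Mean': 'float32',
}

# Read the same text with inferred dtypes and with the explicit mapping.
df_default = pd.read_csv(io.StringIO(csv_text))
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=demo_dtypes)

mem_default = df_default.memory_usage(index=False).sum()
mem_typed = df_typed.memory_usage(index=False).sum()
print(f'default: {mem_default} bytes, typed: {mem_typed} bytes')
```

Narrow integer types only work if every value fits their range, so a mapping like this presumably has to be checked against the actual data before use.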

In [7]:
def get_memory_usage(dataframe: pd.DataFrame):
    """Return DataFrame memory usage in MB"""
    return round(dataframe.memory_usage().sum() / 1024**2, 2)

In [8]:
def get_df_shape_and_memory_str(dataframe: pd.DataFrame):
    """Return DataFrame shape and memory usage formatted string."""
    mem = get_memory_usage(dataframe)
    return f'\t Shape: {dataframe.shape}\n\tMemory: {mem} MB'
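
One caveat about the helpers above: `DataFrame.memory_usage()` without `deep=True` counts only the pointers of object columns, not the Python strings they reference, so the reported figures may underestimate the real footprint when such columns are present. The sketch below compares both measurements on a toy frame. `get_deep_memory_usage` is a hypothetical variant and is not part of the notebook.

```python
import pandas as pd


def get_deep_memory_usage(dataframe: pd.DataFrame) -> float:
    """Return DataFrame memory usage in MB, including object payloads.

    Hypothetical variant of get_memory_usage(); not part of the notebook.
    """
    return round(dataframe.memory_usage(deep=True).sum() / 1024**2, 2)


# Toy frame with an object (string) column; the values are illustrative only.
toy = pd.DataFrame({
    'Flow ID': pd.Series(
        ['172.31.69.25-18.219.193.20-80-59464-6'] * 50_000, dtype=object
    ),
    'Dst Port': [80] * 50_000,
})

shallow_mb = round(toy.memory_usage().sum() / 1024**2, 2)
deep_mb = get_deep_memory_usage(toy)
print(f'shallow: {shallow_mb} MB, deep: {deep_mb} MB')
```

The deep measurement inspects every object, so it is slower on large frames; that may be a reason to keep the shallow version for quick progress messages.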

In [9]:
PARQUET_PATH = os.path.splitext(CSV_PATH)[0] + '.parquet'
FORCE_REWRITE = False  # Set to True to rewrite the parquet file.

# Convert the CSV to parquet if the parquet file does not exist yet.
if FORCE_REWRITE or not os.path.exists(PARQUET_PATH):
    CHUNK_SIZE = 100_000  # Read {CHUNK_SIZE} rows at a time.
    FRAC = 0.25  # Use 25% of dataset (about 1.5GB of RAM).
    
    chunks = []
    dtypes = get_dtypes()
    print('Reading CSV...')
    for i, chunk in enumerate(pd.read_csv(
            CSV_PATH,
            dtype=dtypes,
            engine='c',
            low_memory=True,
            chunksize=CHUNK_SIZE
    )):
        chunk_ = chunk.sample(frac=FRAC, random_state=0)
        chunks.append(chunk_)
        print(f'Read: {int(i*CHUNK_SIZE / 1e3)}k\tSave: {int(i*CHUNK_SIZE*FRAC / 1e3)}k', end='\r')
    
    print('\nRead.\nConcatenating...')       
    df = pd.concat(chunks)
    del dtypes
    del chunks
    
    print(f'Concatenated.\n{get_df_shape_and_memory_str(df)}\nConverting into parquet...')
    df.to_parquet(PARQUET_PATH)
    print(f'Converted.\nDF saved into `{PARQUET_PATH}`.')

print(f'\nReading from parquet...')
df = pd.read_parquet(PARQUET_PATH)
print(f'Read.\n{get_df_shape_and_memory_str(df)}')

Reading CSV...
Read: 12700k	Save: 3175k
Read.
Concatenating...
Concatenated.
	 Shape: (3198657, 85)
	Memory: 1238.49 MB
Converting into parquet...
Converted.
DF saved into `./content/final_dataset.parquet`.

Reading from parquet...
Read.
	 Shape: (3198657, 85)
	Memory: 1238.49 MB
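
A detail of the per-chunk sampling is worth spelling out. Passing the same integer `random_state` to `sample()` for every chunk selects the same relative row positions in every chunk of equal length. Whether this matters depends on how the rows are ordered in the CSV, which is not examined here. The sketch below demonstrates the effect on a small in-memory CSV and shows one possible alternative, a shared `np.random.RandomState` whose state advances between chunks. The names `sample_chunks`, `CHUNK`, and `FRAC_DEMO` are made up for this illustration.

```python
import io

import numpy as np
import pandas as pd

# Small in-memory CSV standing in for the real file (illustrative only).
csv_text = pd.DataFrame({'value': np.arange(1_000)}).to_csv(index=False)

CHUNK = 100
FRAC_DEMO = 0.25


def sample_chunks(random_state_for_chunk):
    """Sample FRAC_DEMO of each chunk; the callable supplies each chunk's random_state."""
    parts = []
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=CHUNK):
        parts.append(chunk.sample(frac=FRAC_DEMO, random_state=random_state_for_chunk()))
    return parts


# Fixed integer seed: every chunk is sampled at the same relative positions.
fixed_parts = sample_chunks(lambda: 0)
fixed_offsets = [sorted(p.index % CHUNK) for p in fixed_parts]

# One shared RandomState: its state advances, so positions vary between chunks.
shared_rng = np.random.RandomState(0)
shared_parts = sample_chunks(lambda: shared_rng)
shared_offsets = [sorted(p.index % CHUNK) for p in shared_parts]

sampled = pd.concat(shared_parts)
print('rows sampled:', len(sampled))
print('fixed seed, same offsets in all chunks:',
      all(o == fixed_offsets[0] for o in fixed_offsets))
print('shared RNG, same offsets in all chunks:',
      all(o == shared_offsets[0] for o in shared_offsets))
```

Both variants stay reproducible across runs, since the shared generator is itself created from a fixed seed.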


In [10]:
df.sample(5, random_state=0)

| | Unnamed: 0 | Flow ID | Src IP | Src Port | Dst IP | Dst Port | Protocol | Timestamp | Flow Duration | Tot Fwd Pkts | ... | Fwd Seg Size Min | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12388875 | 6426326 | 172.31.65.68-62.146.70.120-445-62494-6 | 62.146.70.120 | 62494 | 172.31.65.68 | 445 | 6 | 20/02/2018 11:34:15 | 104135 | 3 | ... | 20 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Benign |
| 3129205 | 278540 | 172.31.69.25-18.219.193.20-80-59464-6 | 18.219.193.20 | 59464 | 172.31.69.25 | 80 | 6 | 16/02/2018 11:16:36 PM | 25067 | 1 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ddos |
| 2502067 | 227593 | 172.31.69.25-18.219.5.43-80-54451-6 | 18.219.5.43 | 54451 | 172.31.69.25 | 80 | 6 | 20/02/2018 10:21:27 | 9990720 | 2 | ... | 20 | 0.0 | 0.0 | 0.0 | 0.0 | 9990720.0 | 0.0 | 9990720.0 | 9990720.0 | ddos |
| 6307514 | 3456849 | 172.31.69.25-18.219.193.20-80-34166-6 | 172.31.69.25 | 80 | 18.219.193.20 | 34166 | 6 | 16/02/2018 11:27:06 PM | 4367247 | 4 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ddos |
| 598118 | 491559 | 172.31.69.28-18.219.9.1-80-55063-6 | 18.219.9.1 | 55063 | 172.31.69.28 | 80 | 6 | 21/02/2018 11:54:49 PM | 593 | 1 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ddos |


## Preprocessing

In [11]:
df.describe(include='all').transpose()

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 3198657.0 | | | | 2663100.37692 | 2170006.252904 | 0.0 | 898504.0 | 2041665.0 | 3906515.0 | 7902473.0 |
| Flow ID | 3198657 | 1719404 | 8.0.6.4-8.6.0.1-0-0-0 | 18908 | | | | | | | |
| Src IP | 3198657 | 25152 | 172.31.69.25 | 441858 | | | | | | | |
| Src Port | 3198657.0 | | | | 37062.017538 | 25223.515196 | 0.0 | 443.0 | 50590.0 | 56224.0 | 65535.0 |
| Dst IP | 3198657 | 24501 | 172.31.69.25 | 621315 | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Idle Mean | 3198657.0 | | | | 3119173.25 | 12190516.0 | 0.0 | 0.0 | 0.0 | 0.0 | 119988720.0 |
| Idle Std | 3198657.0 | | | | 109250.890625 | 1415605.5 | 0.0 | 0.0 | 0.0 | 0.0 | 73978656.0 |
| Idle Max | 3198657.0 | | | | 3214718.75 | 12442321.0 | 0.0 | 0.0 | 0.0 | 0.0 | 119988720.0 |
| Idle Min | 3198657.0 | | | | 3018433.25 | 12062840.0 | 0.0 | 0.0 | 0.0 | 0.0 | 119988720.0 |


In [12]:
# Some rate columns contain inf values, which cannot be valid measurements; replace them with NaN.
df.replace({'Flow Byts/s': np.inf, 'Flow Pkts/s': np.inf}, np.nan, inplace=True)

In [13]:
# Look for NaN values.
missing = df.columns[df.isna().any()]
print('Missing values:')
for col in missing:
    n_missing = df[col].isna().sum()
    print(f'\t{col}: {n_missing} | {n_missing / df.shape[0]:.2%} of {df.shape[0]} rows')

Missing values:
	Flow Byts/s: 11891 | 0.37% of 3198657 rows
	Flow Pkts/s: 11891 | 0.37% of 3198657 rows
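
The infinite values are presumably the result of dividing by a zero flow duration when the rate features were computed; this is an assumption about how the dataset was generated, not something the notebook verifies. The sketch below reproduces the situation on toy data, converts the infinities to NaN, and reports the share of affected rows. The `Tot Bytes` column and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy flows (illustrative values only); 'Tot Bytes' is a hypothetical column.
toy = pd.DataFrame({
    'Flow Duration': [100, 0, 250, 0],
    'Tot Bytes': [1_000, 40, 5_000, 60],
})

# A zero duration turns the rate into inf.
toy['Flow Byts/s'] = toy['Tot Bytes'] / toy['Flow Duration']

# Treat infinite rates as missing values.
rate_cols = ['Flow Byts/s']
toy[rate_cols] = toy[rate_cols].replace([np.inf, -np.inf], np.nan)

# Share of missing values per column, as a fraction of all rows.
missing_share = toy[rate_cols].isna().mean()
print(missing_share)
```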


### Clean data

Columns that can be dropped:
- `Unnamed: 0` & `Flow ID` - just identifiers;
- `Bwd URG Flags` & `Fwd URG Flags` - only one unique value;
- Binary columns in which a single value accounts for more than 90% of the rows:
`[Bwd PSH Flags,
  FIN Flag Cnt,
  URG Flag Cnt,
  Fwd PSH Flags,
  CWE Flag Count]`.

The remaining binary flag columns (`RST Flag Cnt`, `SYN Flag Cnt`, `ECE Flag Cnt`, `PSH Flag Cnt`, `ACK Flag Cnt`) stay at or below that threshold and are kept.

The rows with NaN values (less than 0.5% of all rows) are dropped as well.

In [14]:
# Look for dominance % of one value in binary columns.
cols_with_2_unique = [col for col in df.columns if df[col].nunique() == 2 and col != 'Label']
col_val_dominance = [
    [col, df[col].value_counts(normalize=True).idxmax(), df[col].value_counts(normalize=True).max()]
    for col in cols_with_2_unique
]

for col, value, dominance in sorted(col_val_dominance, reverse=True, key=lambda x: x[2]):
    print(f'{col}: {value} \t| {dominance:.2%} dominance')

Bwd PSH Flags: 0 	| 99.76% dominance
FIN Flag Cnt: 0 	| 98.45% dominance
URG Flag Cnt: 0 	| 98.09% dominance
Fwd PSH Flags: 0 	| 97.14% dominance
CWE Flag Count: 0 	| 91.54% dominance
RST Flag Cnt: 0 	| 88.91% dominance
SYN Flag Cnt: 0 	| 86.41% dominance
ECE Flag Cnt: 0 	| 80.55% dominance
PSH Flag Cnt: 0 	| 79.89% dominance
ACK Flag Cnt: 1 	| 51.08% dominance
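
The dominance criterion can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's exact code: the helper `dominant_share`, the column names `flag_a` and `flag_b`, and the name `THRESHOLD` are made up, and `value_counts` is evaluated once per column instead of twice.

```python
import pandas as pd


def dominant_share(series: pd.Series):
    """Return (most frequent value, its share of the non-null rows)."""
    shares = series.value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())


# Toy binary flags (illustrative values only).
toy = pd.DataFrame({
    'flag_a': [0] * 95 + [1] * 5,    # 95% zeros
    'flag_b': [0] * 60 + [1] * 40,   # 60% zeros
})

THRESHOLD = 0.9  # same cut-off idea as DOM_THRESHOLD in the clean-up cell

# Columns whose most frequent value exceeds the threshold are drop candidates.
to_drop = [col for col in toy.columns if dominant_share(toy[col])[1] > THRESHOLD]
print(to_drop)
```

A column that is almost constant carries little information for a classifier, which is the motivation for such a cut-off; the threshold value itself remains a judgment call.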


In [15]:
# Perform the clean-up.

print(f'Before:\n{get_df_shape_and_memory_str(df)}\nCleaning...')

# Delete useless columns.
cols_to_drop = ['Unnamed: 0', 'Flow ID', 'Bwd URG Flags', 'Fwd URG Flags']
df = df.drop(columns=cols_to_drop)
print(f'Dropped: {cols_to_drop}')

# Delete columns in which one value's share exceeds DOM_THRESHOLD.
DOM_THRESHOLD = 0.9
cols_to_drop = [col for col, _, dom in col_val_dominance if dom > DOM_THRESHOLD]
df = df.drop(columns=cols_to_drop)
print(f'Dropped: {cols_to_drop}')

# Delete rows with missing values.
missing_count = df.isna().any(axis=1).sum()
df = df.dropna()
print(f'Dropped: {missing_count} rows with missing values')

del DOM_THRESHOLD, col_val_dominance
del cols_to_drop
del missing, missing_count

print(f'Cleaned-up.\nAfter:\n{get_df_shape_and_memory_str(df)}')

Before:
	 Shape: (3198657, 85)
	Memory: 1238.49 MB
Cleaning...
Dropped: ['Unnamed: 0', 'Flow ID', 'Bwd URG Flags', 'Fwd URG Flags']
Dropped: ['Fwd PSH Flags', 'Bwd PSH Flags', 'FIN Flag Cnt', 'URG Flag Cnt', 'CWE Flag Count']
Dropped: 11891 rows with missing values
Cleaned-up.
After:
	 Shape: (3186766, 76)
	Memory: 1057.62 MB
