# IDS-IPS with Deep Learning

Our goal is to build a proof-of-concept model that mimics an IDS-IPS by predicting whether a stream of network data is malicious or benign. Common IDS-IPS products use signature-based detection, which flags previously identified malicious network activity. Anomaly-based IDS-IPS products exist as well, but they often rely on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) or Gaussian Mixture Models to identify outliers. In this notebook, I will attempt to build a DNN that identifies anomalies in network activity and classifies them. In addition, I will build an autoencoder network, PCA, DBSCAN, K-Means Clustering and a Gaussian Mixture Model for benchmarking purposes.

The dataset we're using is CICIDS2017 (its CSV files carry an `ISCX` suffix in their names).

> CICIDS2017 dataset contains benign and the most up-to-date common attacks, which resembles the true real-world data (PCAPs). It also includes the results of the network traffic analysis using CICFlowMeter with labeled flows based on the time stamp, source and destination IPs, source and destination ports, protocols and attack (CSV files).

# EDA

In [1]:
# Essentials
import pandas as pd
import numpy as np

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

import os
pd.set_option("display.precision", 2)

In [2]:
# Here, we take a look at how many CSV files we have
print(os.listdir('./data'))

['Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv', 'Monday-WorkingHours.pcap_ISCX.csv', 'Friday-WorkingHours-Morning.pcap_ISCX.csv', 'Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv', 'Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv', 'Tuesday-WorkingHours.pcap_ISCX.csv', 'Wednesday-workingHours.pcap_ISCX.csv', 'Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv']


First, let's load the dataset and have a look at the raw data. Since we have multiple CSV files, I'll read each one into its own dataframe and then concatenate them into one giant dataframe. Note that the cell below concatenates only the first two files (`df_list[:2]`).

In [3]:
df_list = []
for filename in os.listdir('./data'):
    df = pd.read_csv(os.path.join('./data',filename), index_col=None)
    df_list.append(df)
df = pd.concat(df_list[:2], axis=0, ignore_index=True)

df.shape

(818520, 79)
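
The loading cell above keeps only `df_list[:2]`. If the whole dataset is wanted instead, something like the following could work. This is a minimal sketch: `load_all_csvs` is a hypothetical helper name, and it assumes every CSV in the folder shares the same columns and that the combined frame fits in memory.

```python
import glob
import os

import pandas as pd


def load_all_csvs(data_dir):
    """Read every CSV in data_dir and stack them into one dataframe."""
    # Sort the paths so the row order is reproducible across runs
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    frames = [pd.read_csv(path, index_col=None) for path in paths]
    return pd.concat(frames, axis=0, ignore_index=True)
```

Calling `load_all_csvs('./data')` would then replace the loop and the `pd.concat` call.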

At a quick glance, the two files concatenated above give 818520 entries (the full eight-file dataset has 2830743), with 78 features and 1 label for the class of the network data. Let's take a closer look at the data. Here, we can see that many of the features are pre-calculated statistics (mean, max, min, std).

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 818520 entries, 0 to 818519
Data columns (total 79 columns):
 Destination Port               818520 non-null int64
 Flow Duration                  818520 non-null int64
 Total Fwd Packets              818520 non-null int64
 Total Backward Packets         818520 non-null int64
Total Length of Fwd Packets     818520 non-null int64
 Total Length of Bwd Packets    818520 non-null int64
 Fwd Packet Length Max          818520 non-null int64
 Fwd Packet Length Min          818520 non-null int64
 Fwd Packet Length Mean         818520 non-null float64
 Fwd Packet Length Std          818520 non-null float64
Bwd Packet Length Max           818520 non-null int64
 Bwd Packet Length Min          818520 non-null int64
 Bwd Packet Length Mean         818520 non-null float64
 Bwd Packet Length Std          818520 non-null float64
Flow Bytes/s                    818438 non-null object
 Flow Packets/s                 818520 non-null object
 Flow IAT Mean 

Before we go any further, let's rename the columns so that we can access our column data easier.

In [9]:
# Strip surrounding whitespace, lower-case, and use underscores instead of spaces
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
df.columns

Index(['destination_port', 'flow_duration', 'total_fwd_packets',
       'total_backward_packets', 'total_length_of_fwd_packets',
       'total_length_of_bwd_packets', 'fwd_packet_length_max',
       'fwd_packet_length_min', 'fwd_packet_length_mean',
       'fwd_packet_length_std', 'bwd_packet_length_max',
       'bwd_packet_length_min', 'bwd_packet_length_mean',
       'bwd_packet_length_std', 'flow_bytes/s', 'flow_packets/s',
       'flow_iat_mean', 'flow_iat_std', 'flow_iat_max', 'flow_iat_min',
       'fwd_iat_total', 'fwd_iat_mean', 'fwd_iat_std', 'fwd_iat_max',
       'fwd_iat_min', 'bwd_iat_total', 'bwd_iat_mean', 'bwd_iat_std',
       'bwd_iat_max', 'bwd_iat_min', 'fwd_psh_flags', 'bwd_psh_flags',
       'fwd_urg_flags', 'bwd_urg_flags', 'fwd_header_length',
       'bwd_header_length', 'fwd_packets/s', 'bwd_packets/s',
       'min_packet_length', 'max_packet_length', 'packet_length_mean',
       'packet_length_std', 'packet_length_variance', 'fin_flag_count',
       'syn_flag_co

Here, we can see that `Flow Bytes/s`, `Flow Packets/s` and `Label` are all objects. Let's convert the two rate columns to floats first.

In [10]:
df['flow_bytes/s'] = df['flow_bytes/s'].astype('float64')
df['flow_packets/s'] = df['flow_packets/s'].astype('float64')
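
A plain `astype('float64')` raises on any string it cannot parse and keeps infinite values as they are. As a hedged alternative, the sketch below coerces bad entries to `NaN` and also turns infinities into `NaN`. It assumes the object dtype comes from string entries such as `'Infinity'`, which the output above does not confirm. `to_finite_float` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def to_finite_float(series):
    """Convert to float64; unparseable or infinite values become NaN."""
    values = pd.to_numeric(series, errors="coerce").astype("float64")
    return values.replace([np.inf, -np.inf], np.nan)


# Toy column standing in for 'flow_bytes/s' (values are made up)
toy = pd.Series(["12.5", "Infinity", None, "abc", 3], dtype="object")
clean = to_finite_float(toy)
```

On the real data this would be applied as `df['flow_bytes/s'] = to_finite_float(df['flow_bytes/s'])`, and likewise for `flow_packets/s`.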

As the label counts below show, the data is highly imbalanced. This is a serious problem for IDS/IPS models, since imbalanced datasets give rise to a high number of false positives and false negatives. My approach is to optimize the model without overfitting, while tolerating more false positives than false negatives.

Better safe than sorry...

In [11]:
df.label.value_counts()

BENIGN          818484
Infiltration        36
Name: label, dtype: int64
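
One common way to lean towards false positives rather than false negatives is to weight the rare class more heavily during training. The sketch below computes inverse-frequency class weights from the counts above. It is only an illustration of the idea, not necessarily what the models later in this notebook use, and `inverse_frequency_weights` is a hypothetical helper.

```python
import pandas as pd


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    return (len(labels) / (len(counts) * counts)).to_dict()


# Label column rebuilt from the value_counts() output above
labels = pd.Series(["BENIGN"] * 818484 + ["Infiltration"] * 36)
weights = inverse_frequency_weights(labels)
```

With these counts the `Infiltration` weight is 818520 / (2 * 36), several orders of magnitude larger than the `BENIGN` weight, so each missed attack costs far more than a false alarm.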

Finally, let's visualize some of the features to see if there are any interesting distributions or trends.

In [None]:
numeric_dtypes = ['int64', 'float64']
summary_stats = ('mean', 'min', 'max', 'std', 'variance')

# We just want the totals instead of the pre-calculated features
numeric = [col for col in df.columns
           if df[col].dtype in numeric_dtypes
           and not any(stat in col for stat in summary_stats)]

# Visualise the first nine selected features against the label
plt.figure(figsize=(12, 12))
plt.subplots_adjust(right=2, top=2)

for i, feature in enumerate(numeric[:9], 1):
    plt.subplot(len(numeric), 3, i)
    sns.scatterplot(x=feature, y='label', hue='label', palette='Blues', data=df)
    plt.xlabel(feature, size=15, labelpad=12.5)
    plt.ylabel('Label', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})

plt.show()

Last but not least, a quick look at the correlation between features.

In [None]:
# Correlate the numeric columns only; the string label column is excluded
corr = df.select_dtypes(include='number').corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corr, vmax=0.9, cmap="Blues", square=True)
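
With this many features the heatmap is hard to read by eye, so it can help to list the strongly correlated pairs explicitly. The following is a minimal sketch: `correlated_pairs` is a hypothetical helper and the 0.9 threshold is an arbitrary choice that simply mirrors the `vmax` used in the heatmap.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in strong.items()]
```

`correlated_pairs(df)` would return the candidate pairs, which could then be reviewed before dropping any redundant features.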

## Feature Engineering

In [None]:
# Percentage of missing values per column
def percent_missing(data):
    return {col: round(data[col].isnull().mean() * 100, 2) for col in data.columns}

missing = percent_missing(df.drop(['label'], axis=1))
df_miss = sorted(missing.items(), key=lambda x: x[1], reverse=True)
print('Percent of missing data')
df_miss[0:10]
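
The cell above only measures how much data is missing; it does not fill anything in yet. One simple option is median imputation, sketched below. The choice of the median is an assumption on my part, `fill_with_median` is a hypothetical helper, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def fill_with_median(frame, columns):
    """Return a copy where NaN in the given columns is replaced by the column median."""
    filled = frame.copy()
    for col in columns:
        filled[col] = filled[col].fillna(filled[col].median())
    return filled


# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({"flow_bytes/s": [1.0, np.nan, 3.0, 100.0]})
toy_filled = fill_with_median(toy, ["flow_bytes/s"])
```

The median is less sensitive than the mean to the extreme rates that network flows can produce, which is why it is used in this sketch.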