In [1]:
%load_ext nb_black
%matplotlib inline

# Required Libraries

In [None]:
# dataset preprocessing
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

# visualization
import matplotlib.pyplot as plt

# NSL-KDD Dataset Introduction

The NSL-KDD dataset was compiled to mitigate the problems of the KDD'99 dataset. The KDD'99 dataset attracted the attention of many researchers because of the lack of public datasets for signature-based intrusion detection systems (IDSs). Tavallaee et al. [1] found that the most significant issue with the 1999 dataset is its large number of redundant instances, which bias classifiers towards the frequent instances at the expense of the infrequent ones. However, it is often these infrequent instances that indicate new, devastating network intrusions.

One improvement in the NSL-KDD dataset is that all redundant instances were removed. Moreover, the number of records selected from each difficulty group is inversely proportional to the percentage of records in that group in the original KDD dataset. Difficulty refers to how many of 21 classifiers were able to classify an instance from the 1999 dataset correctly. In addition, some of the instances in the original dataset contained unsystematic noise, so these instances were discarded when constructing the improved dataset. Overall, the original KDD dataset was disproportionately distributed, which made classifiers trained on it ineffective as discriminative tools for detecting network intrusions that can be identified by signature.

The NSL-KDD dataset can be downloaded from: https://www.unb.ca/cic/datasets/nsl.html

Please note that only a 20% subset of the dataset is used because of the size of the full dataset (more than 120,000 instances). The 20% subset is also provided at the above link under the name 'KDDTrain+_20Percent.TXT'.

# Dataset Overview

In [None]:
df = pd.read_csv("ids.csv")
print(f"Number of Instances: {df.shape[0]}")
print(f"Number of Features (including class): {df.shape[1]}")
df.head()

In [None]:
num_of_missing_values = df.isnull().sum().sum()
print(f"There are {num_of_missing_values} missing values in the dataset.")

In [None]:
data_types = df.dtypes.value_counts().to_frame().reset_index()
data_types.columns = ["Type", "Count"]
data_types

In [None]:
categorical_features = df.select_dtypes(include=["object"])
categorical_features.head()

## Feature Statistics

In [None]:
df.describe()

Most of the continuous features, such as duration, src_bytes, and dst_bytes, have very large standard deviations and ranges. These continuous features will therefore be scaled to a feature range of [0, 1] in the preprocessing phase.
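
As a rough illustration of what scaling to [0, 1] does, the sketch below applies min-max scaling, x' = (x - min) / (max - min), to a single made-up column. The values are invented stand-ins for a wide-ranging feature such as src_bytes and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column with a wide range, standing in for a feature such as src_bytes
# (the values are made up for illustration).
values = np.array([[0.0], [250.0], [1000.0], [50000.0]])

# Manual min-max scaling: (x - min) / (max - min)
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation with scikit-learn's MinMaxScaler
scaled = MinMaxScaler().fit_transform(values)

print(scaled.ravel())
```

After scaling, the smallest value maps to 0 and the largest to 1, so features with very different ranges become comparable.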

## Class Distribution

In [None]:
# count the classes once and reuse the result for the labels and the wedges
class_counts = df["class"].value_counts()
labels = class_counts.index

fig, ax = plt.subplots()
_, _, autopcts = ax.pie(
    class_counts,
    textprops={"fontsize": 14},
    labels=labels,
    colors=("green", "red"),
    autopct="%.2f%%",
    startangle=60,
    explode=(0, 0.05),
)

plt.setp(autopcts, **{"color": "white", "weight": "bold", "fontsize": 13})
ax.set_title("Class Distribution", fontdict={"fontsize": 16})

plt.show()

Overall, the class distribution is slightly skewed towards the 'Normal' class, so stratified cross-validation (CV) will be used to ensure that each fold preserves the percentage of samples of each class. The 'Normal' class represents network traffic that is not indicative of an intrusion, while the 'Anomaly' class represents network traffic that is associated with malicious behavior.
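
The effect of stratification can be sketched on made-up labels. The 60/40 split, the number of folds, and the random seed below are assumptions chosen for illustration; they are not the settings or the class ratio of this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with a mild imbalance (60% normal, 40% anomaly)
y_toy = np.array(["normal"] * 60 + ["anomaly"] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_ratios = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Share of "normal" instances in the held-out fold
    fold_ratios.append(np.mean(y_toy[test_idx] == "normal"))

print(fold_ratios)
```

Each held-out fold keeps the same share of 'normal' instances as the full label vector, which is the property relied on here.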

# Pre-processing

In [None]:
cols_to_drop = [col for col in list(df) if df[col].nunique() <= 1]
df = df.drop(columns=cols_to_drop)

All features with only 1 unique value are dropped since they carry no information for a classification task. The only 2 such features are 'num_outbound_cmds' and 'is_host_login'.

In [None]:
X = df.iloc[:, :-1]
y = df["class"]

The data is separated into two components. X contains the features of the dataset while y contains the class labels for each instance.

In [None]:
binary_features = df.columns[df.isin([0, 1]).all()].tolist()
binary_features

The binary features already take values in {0, 1}, so they do not need to be scaled to a feature range of [0, 1]. Their names are extracted from the dataframe so that they can be excluded from scaling.

In [None]:
numeric_features_nb = df.select_dtypes("number").columns.drop(binary_features).tolist()

All the numeric features that are non-binary, i.e., continuous, will be scaled to a feature range of [0, 1] using MinMaxScaler().

In [None]:
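# OrdinalEncoder does not accept handle_unknown="ignore" (that option belongs
# to OneHotEncoder); unseen categories are mapped to -1 instead
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)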
mms = MinMaxScaler()

ct = make_column_transformer(
    (enc, ["protocol_type", "service", "flag"]),
    (mms, numeric_features_nb),
    remainder="passthrough",
)

The 3 categorical features, 'protocol_type', 'service', and 'flag', are encoded using an ordinal encoder, which maps each category of a feature to an integer in the range 0 to n_categories - 1. These 3 features seem to be nominal, so I initially tried one-hot encoding. However, after switching to the ordinal encoder, I compared the accuracy of the models and found that all classifiers achieved higher accuracy with ordinal encoding. Furthermore, the one-hot encoder split the 'service' feature into 66 distinct columns, since that feature can take on 66 values. The higher accuracy with ordinal encoding suggests that, although these features appear nominal based on the concepts they represent, the classifiers may be able to exploit some ordering among their categories; it may also simply reflect the much lower dimensionality of the encoded data. The ordinal encoder is configured to handle category values that appear in a test fold during CV evaluation but were not seen in the corresponding training fold. This is why sklearn's ordinal encoder was used rather than an encoding done in pandas, which cannot handle unknown categories seen during CV.
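
The difference between the two encoders can be sketched on a single made-up categorical column. The category values below, including the unseen value "gre", and the choice of -1 as the code for unknown categories are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up training and test values for a single categorical feature
train = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})
test = pd.DataFrame({"protocol_type": ["udp", "gre"]})  # "gre" is unseen

# Ordinal encoding: one integer column; unseen categories map to -1 here
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train)
ordinal_codes = ordinal.transform(test)

# One-hot encoding: one column per category; unseen categories become all zeros
one_hot = OneHotEncoder(handle_unknown="ignore")
one_hot.fit(train)
one_hot_codes = one_hot.transform(test).toarray()

print(ordinal_codes.shape, one_hot_codes.shape)
```

The ordinal encoder keeps one column per feature regardless of the number of categories, whereas the one-hot encoder produces one column per category seen during fitting.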

The class labels did not need to be encoded since the majority of sklearn's classifiers encode the target labels automatically. All other features are already numeric, so they are left unencoded and passed through unchanged via the 'remainder="passthrough"' setting.
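
A small sketch of a classifier accepting string labels directly is given below. The decision tree and the toy data are assumptions for illustration; this section does not state which classifiers the notebook uses.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up, already-scaled feature and string class labels
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array(["normal", "normal", "anomaly", "anomaly"])

# No manual label encoding: the classifier stores the label set in classes_
clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
predictions = clf.predict(np.array([[0.05], [0.95]]))

print(clf.classes_)
print(predictions)
```

The fitted classifier returns predictions as the original string labels, so no decoding step is needed afterwards.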

All of these feature transformations (column transformations) will be used in a pipeline. A pipeline makes it easier to compose estimators: at every fit or predict call within the CV procedure, it automatically applies the column transformations. This prevents information leakage, since the transformations are fitted on the training folds only and never on the dataset as a whole.
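
A minimal sketch of how a column transformer and a classifier might be combined in a pipeline and evaluated with stratified CV follows. The toy data, the decision tree, the encoder settings, and the names `toy`, `toy_y`, `toy_ct`, and `pipe` are all assumptions for illustration, not the notebook's actual models or settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up data with one categorical, one continuous, and one binary feature
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame(
    {
        "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
        "src_bytes": rng.integers(0, 10_000, size=n),
        "logged_in": rng.integers(0, 2, size=n),
    }
)
toy_y = np.where(toy["src_bytes"] > 5_000, "anomaly", "normal")

toy_ct = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        ["protocol_type"],
    ),
    (MinMaxScaler(), ["src_bytes"]),
    remainder="passthrough",
)

# The transformer is re-fitted on the training part of every fold,
# so statistics such as the min and max never come from the held-out fold
pipe = make_pipeline(toy_ct, DecisionTreeClassifier(random_state=0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, toy, toy_y, cv=cv, scoring="accuracy")

print(scores.mean())
```

Passing the whole pipeline to `cross_val_score`, rather than pre-transforming the data, is what keeps the held-out fold out of the fitting step.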

In [None]:
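def extract_feature_names(ct):
    # NOTE: this fits the column transformer in place on the full dataset;
    # the fit is only used to read back the column order
    ct.fit(X, y)
    features = []

    # disregard remainder="passthrough", which is the last fitted entry;
    # each entry is a (name, transformer, columns) tuple
    for _, _, columns in ct.transformers_[:-1]:
        features += list(columns)

    return features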


all_features = extract_feature_names(ct)

# have to define passedthrough_features since they were not explicitly stated
# when creating the column transformer
passedthrough_features = [
    "land",
    "urgent",
    "logged_in",
    "root_shell",
    "num_shells",
    "is_guest_login",
]
all_features = all_features + passedthrough_features

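The feature names are extracted from the column transformer because the transformer re-orders the features: its output lists the encoded columns first, then the scaled columns, then the passed-through columns. The resulting list will be used to recover the names of the selected features when feature selection is applied.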
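
As an alternative to hard-coding the passed-through names, they could be derived from the columns that were not explicitly listed. The sketch below does this on a made-up dataframe; the column values and the names `toy`, `encoded_cols`, and `scaled_cols` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Made-up dataframe whose column order differs from the transformer's output order
toy = pd.DataFrame(
    {
        "duration": [0, 5, 10],
        "protocol_type": ["tcp", "udp", "tcp"],
        "land": [0, 1, 0],
        "src_bytes": [100, 2000, 30],
        "logged_in": [1, 0, 1],
    }
)

encoded_cols = ["protocol_type"]
scaled_cols = ["duration", "src_bytes"]

toy_ct = make_column_transformer(
    (OrdinalEncoder(), encoded_cols),
    (MinMaxScaler(), scaled_cols),
    remainder="passthrough",
)
transformed = toy_ct.fit_transform(toy)

# Remainder columns keep their original relative order and come last
explicit = encoded_cols + scaled_cols
passthrough = [col for col in toy.columns if col not in explicit]
feature_names = explicit + passthrough

result = pd.DataFrame(transformed, columns=feature_names)
print(result)
```

Depending on the installed scikit-learn version, the fitted transformer's `get_feature_names_out()` method may offer a similar result without manual bookkeeping.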