# AutoEncoders for Supervised Anomaly Detection

## Anomaly Detection: Detecting the Unusual

Anomaly detection is a crucial task in various domains, where the objective is to identify rare and abnormal instances that deviate significantly from the expected or normal behavior. It plays a vital role in fraud detection, network security, system monitoring, and predictive maintenance.

## Autoencoders for Unsupervised Anomaly Detection

In the unsupervised setting, autoencoders are trained to reconstruct the input data accurately. The idea is that the model learns to encode the normal instances in a compressed representation and then decodes them back to their original form. Anomalies, being different from the normal patterns, are expected to have higher reconstruction errors. By setting a suitable threshold on the reconstruction error, anomalies can be identified.
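As a rough illustration of the thresholding step, the sketch below computes a per-row mean squared reconstruction error and flags rows whose error exceeds a threshold. The helper names, the 99th-percentile rule, and the toy arrays are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np

def reconstruction_error(X, X_reconstructed):
    """Per-row mean squared error between inputs and their reconstructions."""
    X = np.asarray(X, dtype=float)
    X_reconstructed = np.asarray(X_reconstructed, dtype=float)
    return np.mean((X - X_reconstructed) ** 2, axis=1)

def flag_anomalies(errors, threshold):
    """Return 1 for rows whose error exceeds the threshold, else 0."""
    return (np.asarray(errors) > threshold).astype(int)

# Toy stand-ins for normal inputs and their autoencoder reconstructions.
X_normal = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
X_normal_hat = np.array([[0.1, 0.0], [0.9, 1.0], [0.5, 0.6]])

# Assumed rule: threshold at the 99th percentile of errors on normal data.
threshold = np.percentile(reconstruction_error(X_normal, X_normal_hat), 99)

# New points: the second one is reconstructed poorly.
X_new = np.array([[0.2, 0.2], [5.0, -3.0]])
X_new_hat = np.array([[0.2, 0.25], [1.0, 1.0]])
flags = flag_anomalies(reconstruction_error(X_new, X_new_hat), threshold)
print(flags)
```

In practice the threshold is a tuning choice: a lower percentile catches more anomalies at the cost of more false alarms.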

## Incorporating Supervision with Autoencoder Features

**In this exercise, we will explore a different approach by incorporating supervision into the anomaly detection process. We will still employ autoencoders, but instead of relying solely on the reconstruction error, we will take the features learned by the encoder part of the autoencoder and use them to train a supervised classifier that detects anomalies.**

In this exercise you will:

*   Learn how to train AutoEncoders on Relational Data.
*   Learn how compressed representations from an AutoEncoder can help achieve performance comparable to the original dimensions on a downstream task.
*   Learn how to use an AutoEncoder for Supervised Anomaly Detection tasks.

*This notebook is designed to guide you through this assignment.*

<i><font color='blue'>Some parts of the notebook are left as exercises for you, and the corresponding headers are marked in blue.</font></i>

# Exploring the Dataset

* Create a directory called **kdd-data** in the root (**My Drive**) of your Google Drive.
* Upload the provided **kdd-data.zip** to this **kdd-data** directory.

## Importing the required libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import layers, Model, models
from sklearn.metrics import classification_report, confusion_matrix
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from tensorflow.keras.optimizers import Adam
import seaborn as sns

## Mount the Google Drive

After mounting the drive, **kdd-data.zip** should appear at the following path in the left (**Files**) pane in Colab: **/content/gdrive/MyDrive/kdd-data/kdd-data.zip**

If it doesn't, recheck the steps mentioned above.

In [None]:
drive.mount('/content/gdrive', force_remount=True)

## Unzip the Data Zip Archive

In [None]:
!unzip /content/gdrive/MyDrive/kdd-data/kdd-data.zip -d /content/gdrive/MyDrive/kdd-data/

The uncompressed data should now be available under **/content/gdrive/MyDrive/kdd-data/**

## Description of the KDD Cup 1999 Data

The KDD Cup 1999 dataset is a widely used benchmark dataset in the field of anomaly detection and network security. It was created as part of the Third International Knowledge Discovery and Data Mining Tools Competition, held in 1999.

The dataset contains a large set of network traffic data captured from various sources, simulating a network environment. It includes a wide range of features that represent different aspects of network connections, such as duration, protocol type, service, flag, source and destination bytes, and more.

The primary goal of the KDD Cup 1999 dataset is to classify network connections as either "normal" or "anomalous."

The dataset includes a categorical label column called "label," which categorizes network connections into various attack types, such as "normal," "dos" (denial-of-service), "probe" (scanning and probing), "r2l" (unauthorized access from a remote machine), and "u2r" (unauthorized access to local root privileges).

## Loading the dataset

The KDD Dataset contains around 5 million instances of network traffic data. Each instance represents a network connection with various features and a corresponding label.

Due to the resource limitations in Colab, we will utilize a smaller subset of the data by using a 10% split (kddcup.data_10_percent) for training our models. This subset will allow us to work within the resource constraints while still providing enough data for training and evaluation.

The description of the dataset, along with the features is available [here](https://kdd.ics.uci.edu/databases/kddcup99/task.html).

The dataset file doesn't contain a header row, so we assign the column names described [here](https://kdd.ics.uci.edu/databases/kddcup99/kddcup.names).

In [None]:
# Load the KDD Cup 1999 dataset

data = pd.read_csv('/content/gdrive/MyDrive/kdd-data/kddcup.data_10_percent/kddcup.data_10_percent', header=None)

In [None]:
# Define the column names
column_names = ["duration", "protocol_type", "service", "flag", "src_bytes",
                "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
                "num_failed_logins", "logged_in", "num_compromised",
                "root_shell", "su_attempted", "num_root", "num_file_creations",
                "num_shells", "num_access_files", "num_outbound_cmds",
                "is_host_login", "is_guest_login", "count", "srv_count",
                "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
                "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
                "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
                "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
                "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
                "dst_host_srv_serror_rate", "dst_host_rerror_rate",
                "dst_host_srv_rerror_rate", "label"]

# Assign the column names to the dataset
data.columns = column_names

## Check out the first few rows of the dataset

In [None]:
pd.set_option('display.max_columns', None)
data.head(10)

## Label Frequency

<font color='blue'>Create a Bar Plot to check the frequency for each label in the dataset</font>

In [None]:
# def plot_value_count(data):

# plot_value_count(data)
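One possible way to write such a helper is sketched below on a toy frame. The `column` parameter with its `"label"` default and the choice to return the counts are assumptions added for illustration; only the name `plot_value_count` and its `data` argument come from the cell above.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_value_count(data, column="label"):
    """Draw a bar chart of how often each value occurs in `column`."""
    counts = data[column].value_counts()
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    ax.set_title(f"Frequency of each value in '{column}'")
    ax.tick_params(axis="x", rotation=90)  # long label names stay readable
    fig.tight_layout()
    return counts

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({"label": ["normal.", "smurf.", "smurf.", "neptune."]})
toy_counts = plot_value_count(toy)
```

If the class counts differ by several orders of magnitude, a logarithmic y-axis (`ax.set_yscale("log")`) may make the rare classes visible.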

### All classes except "normal." in this dataset are considered attacks. They can be considered anomalies in our case.

Since our goal is to classify the network traffic as "normal" or "anomalous", we just need to merge all the attack classes into a single "anomalous" class and frame this as a binary classification problem.

<font color='blue'>Convert all "normal." labels to the value "0" and all other labels to the value "1". Update the same "label" column in the "data" dataframe with the new values.</font>

In [None]:
# data["label"] =
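A minimal sketch of this binarization on a toy Series is shown below. The helper name `binarize_labels` is hypothetical, and the sketch assumes the raw normal label is spelled `"normal."` with the trailing period, as in the heading above.

```python
import pandas as pd

def binarize_labels(labels, normal_value="normal."):
    """Map the normal class to 0 and every other class to 1."""
    return (labels != normal_value).astype(int)

# Toy labels standing in for data["label"].
toy_labels = pd.Series(["normal.", "smurf.", "normal.", "neptune."])
binary = binarize_labels(toy_labels)
print(binary.tolist())
```

The comparison yields a boolean Series, so casting it to `int` gives the 0/1 encoding directly without an explicit loop or mapping dictionary.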

Let's look at the label frequency now

In [None]:
# the "label" column in the data df should contain the two labels "0" and "1" now.
plot_value_count(data)

Since we have the labels available for anomalous and non-anomalous data points, this dataset is a good candidate for supervised anomaly detection.

Note that anomaly detection datasets usually have far fewer anomalous data points than non-anomalous ones. That is not the case in this dataset, even without binarizing the labels. Because the dataset contains a **significant** number of anomalous data points, it is also a good candidate for Supervised Anomaly Detection.

## Splitting the dataset before Preprocessing

In [None]:
# Separate the labels from the features
X = data.drop("label", axis=1)
y = data["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Processing the Categorical Features

Let's look at the dataset description and figure out all the categorical features available in the dataset.





In [None]:
categorical_features = ["protocol_type", "service", "flag"]

<font color='blue'>Now convert the categorical features to a one-hot representation. The code should update the X_train and X_test dataframes so that they contain the one-hot encoded features, with the original categorical features dropped.</font>

In [None]:
# Initialize the OneHotEncoder

# Fit and transform the categorical features for training data

# Transform the categorical features for test data

# Create DataFrames with the one-hot encoded features

# Drop the original categorical features

# Concatenate the one-hot encoded features with the original datasets

Let's look at the new shape of our encoded dataset

In [None]:
print(X_train.shape)
print(X_test.shape)

## Normalizing the Continuous Features

The continuous (non-categorical) features in the KDD dataset have different scales. This can be a problem for our models, so let's normalize these features.

In [None]:
continuous_features = [x for x in column_names if x not in categorical_features and x !='label']
print('Total number of non-categorical features: ', len(continuous_features))

<font color='blue'>Write code to normalize the non-categorical features in the dataset. The code should update the X_train and X_test dataframes.</font>

In [None]:
# Initialize the MinMaxScaler

# Fit and transform the continuous features in X_train

# Transform the continuous features in X_test

# Display the normalized datasets
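A minimal sketch of the scaling step on toy frames follows. The helper name `scale_continuous` is hypothetical; the point it illustrates is that the scaler is fitted on the training split and then reused on the test split.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_continuous(X_train, X_test, continuous_features):
    """Scale the given columns to [0, 1] using statistics from X_train only."""
    scaler = MinMaxScaler()
    X_train = X_train.copy()
    X_test = X_test.copy()
    X_train[continuous_features] = scaler.fit_transform(X_train[continuous_features])
    X_test[continuous_features] = scaler.transform(X_test[continuous_features])
    return X_train, X_test

toy_train = pd.DataFrame({"src_bytes": [0.0, 50.0, 100.0]})
toy_test = pd.DataFrame({"src_bytes": [25.0, 200.0]})
scaled_train, scaled_test = scale_continuous(toy_train, toy_test, ["src_bytes"])
print(scaled_train)
print(scaled_test)
```

Because the minimum and maximum come from the training split, test values can fall outside [0, 1] when they exceed the training range, as the second toy test row does.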

# Modelling

## AutoEncoder

Now that the train and test data are ready, let's design an AutoEncoder.

### Encoder

<font color='blue'>Design an Encoder. Use any number of Dense layers, with any number of neurons in each layer. The output dimension should be significantly smaller than the number of dimensions in the training dataset. Feel free to try out various architectures here and use the one yielding the best results. The encoder should be an object of the "Model" class.</font>

In [None]:
# Define/Initialize the input dimension

# Define the latent space dimension

# Define the layers of the Encoder

# Initialize the encoder Model using the above created architecture

# Print encoder summary

<font color='blue'>Design the Decoder. The architecture of the Decoder should mirror that of the Encoder. The decoder should be an object of the "Model" class.</font>

In [None]:
# Define the layers of the Decoder

# Initialize the Decoder Model using the above created architecture

# Print decoder summary

<font color='blue'>Combine encoder and decoder to produce the final model.</font>

In [None]:
# Combine encoder and decoder models to create the Autoencoder model

<font color='blue'>Compile and Train the AutoEncoder on X_train. Feel free to try out various hyperparameters, and/or try various hyperparameter search techniques.</font>

In [None]:
# Compile the autoencoder

# Train the autoencoder
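One possible shape for the encoder, decoder, and training steps above is sketched here on random toy data. The layer sizes (64 hidden units, 4 latent dimensions), the sigmoid output (which assumes inputs scaled to [0, 1]), the MSE loss, and the helper name `build_autoencoder` are all assumptions chosen for illustration; other architectures and hyperparameters may work better on the KDD data.

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

def build_autoencoder(input_dim, latent_dim):
    """Return (encoder, decoder, autoencoder) as three Keras Model objects."""
    # Encoder: input -> hidden -> latent
    enc_in = layers.Input(shape=(input_dim,))
    h = layers.Dense(64, activation="relu")(enc_in)
    latent = layers.Dense(latent_dim, activation="relu")(h)
    encoder = Model(enc_in, latent, name="encoder")

    # Decoder mirrors the encoder: latent -> hidden -> input-sized output
    dec_in = layers.Input(shape=(latent_dim,))
    h = layers.Dense(64, activation="relu")(dec_in)
    # Sigmoid output assumes the inputs were scaled to [0, 1].
    dec_out = layers.Dense(input_dim, activation="sigmoid")(h)
    decoder = Model(dec_in, dec_out, name="decoder")

    # Autoencoder: chain the two models end to end.
    ae_in = layers.Input(shape=(input_dim,))
    autoencoder = Model(ae_in, decoder(encoder(ae_in)), name="autoencoder")
    return encoder, decoder, autoencoder

# Random toy data in [0, 1) standing in for the preprocessed X_train.
rng = np.random.default_rng(0)
X_toy = rng.random((256, 20)).astype("float32")

encoder, decoder, autoencoder = build_autoencoder(input_dim=20, latent_dim=4)
autoencoder.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")

# The input is also the target, since the model learns to reconstruct it.
history = autoencoder.fit(X_toy, X_toy, epochs=2, batch_size=32,
                          validation_split=0.2, verbose=0)

# The encoder alone maps inputs to the latent space.
latent_toy = encoder.predict(X_toy, verbose=0)
print(latent_toy.shape)
```

Here `history.history` holds the per-epoch `loss` and `val_loss` values, and the encoder shares its weights with the trained autoencoder, so it can be used on its own once training finishes.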

<font color='blue'>Create a plot to visualize the training and validation loss</font>

In [None]:
# Access the loss history

# Plot the training and validation loss curves

## Classifier

The Encoder part of the AutoEncoder can now give us a latent space: a compressed representation that aims to retain as much information as possible from the original set of 117 features. The task is to build a classifier that takes this latent representation as its input features, and to see whether it can predict network attacks accurately.

<font color='blue'> Build a classifier here. You are free to use any model, e.g. non-neural models such as Logistic Regression, Decision Trees, or Random Forest, or a Sequential Feed Forward Neural Network. </font>

In [None]:
# Use the following function definition for the classifier. Feel free to add more hyperparameters to the function parameters, if needed.
# def get_classifier(input_dimension, lr):

  # return classifier

### Classification on the non-autoencoded dataset

<font color='blue'>Train the classifier on the original data first - i.e. on X_train. You do not need to worry about the AutoEncoder's output in this exercise.</font>

In [None]:
# Get the classifier object using the above function definition

# Train the classifier model on the raw features. Use X_test as validation data

<font color='blue'>Generate predictions on X_test, generate the classification report, and plot the confusion matrix.</font>

In [None]:
# Predict on the test set

# Evaluate the classifier by generating the classification report.

In [None]:
# Generate the confusion matrix

# Plot the confusion matrix
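The evaluation step can be sketched as follows on a handful of made-up predictions. The probabilities, the 0.5 cut-off, and the class names passed to `target_names` are assumptions for illustration; with a non-neural classifier the predicted classes would typically come straight from `predict`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up ground truth and predicted probabilities of the "anomalous" class.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])

# Assumed decision rule: probability of at least 0.5 means anomalous.
y_pred = (y_prob >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
report = classification_report(y_true, y_pred, target_names=["normal", "anomalous"])
print(cm)
print(report)
```

The resulting matrix can be passed to a heatmap function such as `sns.heatmap(cm, annot=True, fmt="d")` to plot it.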

### Classification on the Autoencoded dataset

<font color='blue'>Write code to generate latent representations for X_train and X_test from the Encoder component of the AutoEncoder</font>

In [None]:
# Obtain the encoded features on X_train

# Obtain the encoded features on X_test

<font color='blue'>Now train the classifier on these latent representations. You should try out various hyperparameters which your classifier can take in order to achieve the best performance.</font>

In [None]:
# Get the classifier object using the above function definition

# Train the classifier model on the encoded features obtained from the AutoEncoder. Use encoded features obtained for X_test as validation data.

<font color='blue'>Generate predictions on the encoded X_test, generate the classification report, and plot the confusion matrix.</font>

In [None]:
# Predict on the test set

# Evaluate the classifier

In [None]:
# Calculate the confusion matrix

# Plot the confusion matrix

# Summary

<font color='blue'>Summarize your observations when training the classifier on the Raw Features vs training it on the encoded features.

Compare and Summarize the results of the two approaches.</font>