# Task 4: Network Anomaly Detection using a Deep Autoencoder

## Project Overview

**Objective:**
The primary goal of this project is to develop and evaluate a deep autoencoder for detecting anomalies in network traffic. The model learns what normal network connections look like and flags connections that deviate from that pattern, which is how malicious activity such as Denial-of-Service (DoS) attacks, port scanning, and unauthorized access attempts is expected to show up.

**Dataset:**
The project uses the **KDD Cup 1999 dataset**, which is derived from the 1998 DARPA intrusion detection evaluation data prepared by MIT Lincoln Labs. We work with the `kddcup.data_10_percent.gz` subset, which contains 494,021 network connection records. Each record is described by 41 features and carries one of 23 labels: `normal.` or one of 22 specific attack types.
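
Because the autoencoder only needs to know whether a record is normal or not, the multi-class labels can be collapsed into a binary flag. The following is a minimal sketch on a hypothetical handful of labels written in the dataset's format (note the trailing dot); the names `outcome` and `is_attack` are illustrative and not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample of labels in the KDD format, not real records.
outcome = pd.Series(
    ['normal.', 'smurf.', 'neptune.', 'normal.', 'smurf.'],
    name='outcome',
)

# 0 = normal traffic, 1 = any kind of attack.
is_attack = (outcome != 'normal.').astype(int)

print(is_attack.tolist())
# Attack labels that were folded into the positive class.
print(outcome[is_attack == 1].value_counts())
```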

**Methodology:**
The core approach is to build an autoencoder, a type of neural network trained to reconstruct its input data. The key steps of the methodology are:
1.  **Data Loading and Preprocessing:** Load the dataset, assign correct column names, and perform necessary preprocessing, including scaling numerical features and encoding categorical ones.
2.  **Model Architecture:** Design a deep autoencoder with multiple dense layers for both the encoder and the decoder.
3.  **Training Strategy:** Crucially, the autoencoder will be trained **exclusively on data corresponding to 'normal' network traffic**. The underlying hypothesis is that the model will learn to reconstruct normal data with a low error, but will struggle to reconstruct anomalous data (attacks), resulting in a high reconstruction error.
4.  **Evaluation:** The reconstruction error will serve as an anomaly score. By setting an appropriate threshold on this error, we can classify connections as either normal or anomalous. The model's performance will be evaluated on a test set containing both normal and anomalous data using metrics such as the confusion matrix, Receiver Operating Characteristic (ROC) curve, and the Area Under the Curve (AUC).
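
The four steps above can be illustrated end to end on toy data. This is only a sketch under loud assumptions: the data is synthetic (not KDD records), a linear projection fitted with PCA stands in for the deep autoencoder so that the example runs with NumPy alone, and the 99th-percentile threshold and the bottleneck size of 2 are arbitrary illustrative choices rather than values used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 'normal' rows lie close to a 2-D subspace of a
# 6-D feature space, 'attack' rows are spread over all 6 dimensions.
basis = rng.normal(size=(2, 6))
normal = rng.normal(size=(1000, 2)) @ basis + 0.05 * rng.normal(size=(1000, 6))
attack = 3.0 * rng.normal(size=(200, 6))
train_normal, test_normal = normal[:800], normal[800:]

# Step 1: scale features using statistics of the normal training data only.
mean = train_normal.mean(axis=0)
std = train_normal.std(axis=0) + 1e-12

def scale(x):
    return (x - mean) / std

# Steps 2-3: fit the stand-in model on normal traffic only. The top two
# principal directions play the role of encoder and decoder weights.
_, _, vt = np.linalg.svd(scale(train_normal), full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    z = scale(x)
    encoded = z @ components.T        # compress to the bottleneck
    decoded = encoded @ components    # reconstruct the scaled input
    return np.mean((z - decoded) ** 2, axis=1)

# Step 4: use the reconstruction error as the anomaly score and threshold it.
threshold = np.percentile(reconstruction_error(train_normal), 99)

scores_normal = reconstruction_error(test_normal)
scores_attack = reconstruction_error(attack)

tp = int(np.sum(scores_attack > threshold))   # attacks flagged
fn = len(scores_attack) - tp                  # attacks missed
fp = int(np.sum(scores_normal > threshold))   # normal rows flagged
tn = len(scores_normal) - fp                  # normal rows passed

recall = tp / (tp + fn)
fpr = fp / (fp + tn)
# AUC as the fraction of (attack, normal) pairs ranked correctly.
auc = float(np.mean(scores_attack[:, None] > scores_normal[None, :]))

print(f"confusion matrix [[tn, fp], [fn, tp]] = [[{tn}, {fp}], [{fn}, {tp}]]")
print(f"recall={recall:.3f}  fpr={fpr:.3f}  auc={auc:.3f}")
```

The threshold is derived from normal data alone, so no attack labels are needed until evaluation. With a deep autoencoder, only `reconstruction_error` would change; the scoring and thresholding logic stays the same.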

**Tools and Libraries:**
*   **Python 3.x**
*   **Pandas & NumPy** for data manipulation.
*   **Scikit-learn** for data preprocessing (scaling, splitting).
*   **TensorFlow/Keras** for building and training the deep autoencoder model.
*   **Matplotlib & Seaborn** for data visualization.

### Import required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1. Data Loading and Analysis

In [None]:
# Set pandas display options to show all columns
pd.set_option('display.max_columns', None)

# List to store column names
column_names = []

# Load column names from the kddcup.names file
with open('kddcup.names', 'r') as f:
    for line in f:
        # Skip lines that do not contain column descriptions
        if ':' in line:
            # Split the line by ':' and take the first part as the column name,
            # stripping any surrounding whitespace
            name = line.split(':')[0].strip()
            column_names.append(name)

# Add the target column name at the end, which is not described among the features
column_names.append('outcome')

print(f"Number of loaded column names: {len(column_names)}")
print("List of column names:")
print(column_names)

# Path to the data file
file_path = 'kddcup.data_10_percent.gz'

# Load data into a pandas DataFrame
df = pd.read_csv(file_path, header=None, names=column_names, compression='gzip')

print("Data loaded. DataFrame dimensions:")
print(df.shape)

# Check unique values and their counts in the 'outcome' column
outcome_counts = df['outcome'].value_counts()

print(f"Number of unique values in the 'outcome' column: {len(outcome_counts)}")
print("\nCounts of each transmission type:")
print(outcome_counts)

Number of loaded column names: 42
List of column names:
['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'outcome']
Data loaded. DataFrame dimensions:
(494021, 42)
Number of unique values in the 'outcome' column: 23

Counts of each transmission type:
outcome
smurf.              280790
neptune.            107201
normal.