# 1. Task description summary
Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.

A connection is a sequence of TCP packets that starts and ends at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labeled either as normal or as an attack with exactly one specific attack type. Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
- probing: surveillance and other probing, e.g., port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.

The complete task description can be found at https://kdd.ics.uci.edu/databases/kddcup99/task.html.

## NSL-KDD dataset description

NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set.

The NSL-KDD data set has the following advantages over the original KDD data set:

- It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records.
- There are no duplicate records in the proposed test sets; therefore, the performance of the learners is not biased by methods that have better detection rates on frequent records.
- The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary over a wider range, which allows a more accurate evaluation of different learning techniques.
- The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without needing to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.
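
The first two points can be sanity-checked by counting duplicate rows with pandas. The sketch below runs on a small hypothetical frame (`toy_df` and its values are made up for illustration); on the real data, the same calls would presumably be applied to the loaded train and test frames.

```python
import pandas as pd

# Hypothetical toy records; the third row repeats the first.
toy_df = pd.DataFrame(
    {
        "protocol_type": ["tcp", "udp", "tcp"],
        "service": ["http", "domain_u", "http"],
        "src_bytes": [181, 44, 181],
        "attack": ["normal", "normal", "normal"],
    }
)

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy_df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# drop_duplicates() keeps the first occurrence of each row.
deduplicated_df = toy_df.drop_duplicates()
print(f"Rows after deduplication: {len(deduplicated_df)}")
```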

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import shap



In [2]:
# Sources for the NSL-KDD data preprocessing:
# https://www.kaggle.com/code/ajeffreyrufus/network-intrusion-detection-using-ml-99-accuracy
# https://kdd.ics.uci.edu/databases/kddcup99/task.html

# train20_nsl_kdd_dataset_path = os.path.join("../data/raw/NSL-KDD", "KDDTrain+_20Percent.txt")
# train_nsl_kdd_dataset_path = os.path.join("../data/raw/NSL-KDD", "KDDTrain+.txt")
# test_nsl_kdd_dataset_path = os.path.join("../data/raw/NSL-KDD", "KDDTest+.txt")

In [3]:
columns = np.array(
    [
    "duration",  # Length of the connection in seconds
    "protocol_type",  # Protocol used (tcp, udp, icmp)
    "service",  # Destination service (e.g. http, ftp, telnet)
    "flag",  # TCP connection status (e.g. SF: connection completed without error)
    "src_bytes",  # Bytes sent from source to destination
    "dst_bytes",  # Bytes sent from destination to source
    "land",  # 1 if source and destination are the same (land attack), 0 otherwise
    "wrong_fragment",  # Number of wrong fragments
    "urgent",  # Number of urgent packets
    "hot",  # Number of "hot" indicators (e.g. access to sensitive system directories)
    "num_failed_logins",  # Number of failed login attempts
    "logged_in",  # 1 if the login succeeded, 0 otherwise
    "num_compromised",  # Number of conditions that compromise system security
    "root_shell",  # 1 if a root shell was obtained, 0 otherwise
    "su_attempted",  # 1 if `su` was attempted to gain privileges, 0 otherwise
    "num_root",  # Number of root accesses
    "num_file_creations",  # Number of file creation operations
    "num_shells",  # Number of shells opened
    "num_access_files",  # Number of accesses to critical system files
    "num_outbound_cmds",  # Number of outbound commands in an FTP session (always 0 in the dataset)
    "is_host_login",  # 1 if the login belongs to the "hot" list, 0 otherwise
    "is_guest_login",  # 1 if the login is a guest login, 0 otherwise
    "count",  # Number of connections to the same host in the past 2 seconds
    "srv_count",  # Number of connections to the same service in the past 2 seconds
    "serror_rate",  # Rate of connections with SYN errors
    "srv_serror_rate",  # Rate of connections with SYN errors on the same service
    "rerror_rate",  # Rate of rejected (REJ) connections
    "srv_rerror_rate",  # Rate of rejected (REJ) connections on the same service
    "same_srv_rate",  # Rate of connections to the same service
    "diff_srv_rate",  # Rate of connections to different services
    "srv_diff_host_rate",  # Rate of connections to different hosts on the same service
    "dst_host_count",  # Number of connections to the same destination host
    "dst_host_srv_count",  # Number of connections to the same service on the destination host
    "dst_host_same_srv_rate",  # Rate of connections to the same service on the destination host
    "dst_host_diff_srv_rate",  # Rate of connections to different services on the destination host
    "dst_host_same_src_port_rate",  # Rate of connections to the destination host from the same source port
    "dst_host_srv_diff_host_rate",  # Rate of connections to different hosts on the same service at the destination
    "dst_host_serror_rate",  # Rate of connections with SYN errors at the destination host
    "dst_host_srv_serror_rate",  # Rate of connections with SYN errors on the same service at the destination host
    "dst_host_rerror_rate",  # Rate of rejected (REJ) connections at the destination host
    "dst_host_srv_rerror_rate",  # Rate of rejected (REJ) connections on the same service at the destination host
    "attack",  # Attack type, or "normal" for legitimate traffic
    "level",  # Difficulty level of the record (usually not used as a feature)
]
)

nominal_inx = [1, 2, 3]
binary_inx = [6, 11, 13, 14, 20, 21]
# sorted() makes the column order explicit instead of relying on set iteration order.
numeric_inx = sorted(set(range(41)) - set(nominal_inx) - set(binary_inx))

nominal_cols = columns[nominal_inx].tolist()
binary_cols = columns[binary_inx].tolist()
numeric_cols = columns[numeric_inx].tolist()

print(f"Nominal columns: {nominal_cols}")
print(f"Binary columns: {binary_cols}")
print(f"Numeric columns: {numeric_cols}")

Nominal columns: ['protocol_type', 'service', 'flag']
Binary columns: ['land', 'logged_in', 'root_shell', 'su_attempted', 'is_host_login', 'is_guest_login']
Numeric columns: ['duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'num_compromised', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']
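
One way these three groups might feed a preprocessing step is sketched below: nominal columns are one-hot encoded, numeric columns are standardized, and binary columns pass through unchanged. This is an assumption about how the groups are meant to be used, not the notebook's actual pipeline. The frame `toy_df`, its values, and the shortened `toy_*_cols` lists are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame with columns from each of the three groups.
toy_df = pd.DataFrame(
    {
        "duration": [0, 2, 10],
        "protocol_type": ["tcp", "udp", "tcp"],
        "src_bytes": [100, 300, 200],
        "logged_in": [1, 0, 1],
    }
)
toy_nominal_cols = ["protocol_type"]
toy_binary_cols = ["logged_in"]
toy_numeric_cols = ["duration", "src_bytes"]

# One-hot encode the nominal columns.
encoded = pd.get_dummies(toy_df[toy_nominal_cols], dtype=int)

# Standardize numeric columns to zero mean and unit variance, using the
# population standard deviation (ddof=0), which is what StandardScaler uses.
numeric = toy_df[toy_numeric_cols].astype(float)
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Binary columns are already 0/1, so they are kept as they are.
features = pd.concat([scaled, toy_df[toy_binary_cols], encoded], axis=1)
print(features.columns.tolist())
```

Note that a constant column (the comments above describe `num_outbound_cmds` as always 0) has zero standard deviation, so this manual formula would yield `NaN` for it. `StandardScaler` avoids that by leaving zero-variance columns unscaled.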


In [4]:
# Mapping from attack names to attack categories
attack_mapping = {
    'normal': 'normal',
    
    'back': 'DoS',
    'land': 'DoS',
    'neptune': 'DoS',
    'pod': 'DoS',
    'smurf': 'DoS',
    'teardrop': 'DoS',
    'mailbomb': 'DoS',
    'apache2': 'DoS',
    'processtable': 'DoS',
    'udpstorm': 'DoS',
    
    'ipsweep': 'Probe',
    'nmap': 'Probe',
    'portsweep': 'Probe',
    'satan': 'Probe',
    'mscan': 'Probe',
    'saint': 'Probe',

    'ftp_write': 'R2L',
    'guess_passwd': 'R2L',
    'imap': 'R2L',
    'multihop': 'R2L',
    'phf': 'R2L',
    'spy': 'R2L',
    'warezclient': 'R2L',
    'warezmaster': 'R2L',
    'sendmail': 'R2L',
    'named': 'R2L',
    'snmpgetattack': 'R2L',
    'snmpguess': 'R2L',
    'xlock': 'R2L',
    'xsnoop': 'R2L',
    'worm': 'R2L',
    
    'buffer_overflow': 'U2R',
    'loadmodule': 'U2R',
    'perl': 'U2R',
    'rootkit': 'U2R',
    'httptunnel': 'U2R',
    'ps': 'U2R',    
    'sqlattack': 'U2R',
    'xterm': 'U2R'
}
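
When a dictionary like this is applied with pandas `Series.map`, any label missing from the dictionary becomes `NaN`, and `value_counts()` drops `NaN` by default, so an unmapped attack would silently disappear from the category counts. A minimal check is sketched below; `toy_mapping` and the label `made_up_attack` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical subset of the mapping, for illustration only.
toy_mapping = {"normal": "normal", "neptune": "DoS", "nmap": "Probe"}

# "made_up_attack" is an invented label that the mapping does not cover.
labels = pd.Series(["normal", "neptune", "made_up_attack", "nmap"])

categories = labels.map(toy_mapping)

# Labels without a category come back as NaN, so list them explicitly.
unmapped = sorted(labels[categories.isna()].unique())
print(f"Unmapped labels: {unmapped}")
```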

In [5]:
train_df = pd.read_csv("../data/raw/NSL-KDD/KDDTrain+.txt", names=columns, sep=',')
print(train_df["attack"].unique())

# test_data = pd.read_csv("../data/raw/NSL-KDD/KDDTest+.txt", header=None)

train_df["attack_label2"] = train_df["attack"].apply(
    lambda x: "normal" if x == "normal" else "attack"
)
train_df["attack_label5"] = train_df["attack"].map(attack_mapping)

train_labels2 = train_df["attack_label2"].value_counts()
train_labels5 = train_df["attack_label5"].value_counts()

print(train_labels2)
print(train_labels5)

['normal' 'neptune' 'warezclient' 'ipsweep' 'portsweep' 'teardrop' 'nmap'
 'satan' 'smurf' 'pod' 'back' 'guess_passwd' 'ftp_write' 'multihop'
 'rootkit' 'buffer_overflow' 'imap' 'warezmaster' 'phf' 'land'
 'loadmodule' 'spy' 'perl']
attack_label2
normal    67343
attack    58630
Name: count, dtype: int64
attack_label5
normal    67343
DoS       45927
Probe     11656
R2L         995
U2R          52
Name: count, dtype: int64
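
The category counts above are heavily skewed towards `normal` and `DoS`. As a rough illustration, the share of each category can be derived directly from the printed numbers; the dictionary below copies them by hand.

```python
# Category counts copied from the training-set output above.
train_counts = {
    "normal": 67343,
    "DoS": 45927,
    "Probe": 11656,
    "R2L": 995,
    "U2R": 52,
}

total = sum(train_counts.values())

# Percentage of the training set covered by each category.
shares = {name: 100 * count / total for name, count in train_counts.items()}

for name, share in shares.items():
    print(f"{name:<7}{share:6.2f}%")
```

With shares this small for `R2L` and `U2R`, overall accuracy alone is likely to say little about how well those categories are detected.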


In [6]:
test_df = pd.read_csv("../data/raw/NSL-KDD/KDDTest+.txt", names=columns, sep=",")
print(test_df["attack"].unique())

test_df["attack_label2"] = test_df["attack"].apply(
    lambda x: "normal" if x == "normal" else "attack"
)
test_df["attack_label5"] = test_df["attack"].map(attack_mapping)

test_labels2 = test_df["attack_label2"].value_counts()
test_labels5 = test_df["attack_label5"].value_counts()

print(test_labels2)
print(test_labels5)

['neptune' 'normal' 'saint' 'mscan' 'guess_passwd' 'smurf' 'apache2'
 'satan' 'buffer_overflow' 'back' 'warezmaster' 'snmpgetattack'
 'processtable' 'pod' 'httptunnel' 'nmap' 'ps' 'snmpguess' 'ipsweep'
 'mailbomb' 'portsweep' 'multihop' 'named' 'sendmail' 'loadmodule' 'xterm'
 'worm' 'teardrop' 'rootkit' 'xlock' 'perl' 'land' 'xsnoop' 'sqlattack'
 'ftp_write' 'imap' 'udpstorm' 'phf']
attack_label2
attack    12833
normal     9711
Name: count, dtype: int64
attack_label5
normal    9711
DoS       7458
R2L       2754
Probe     2421
U2R        200
Name: count, dtype: int64
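
Comparing the two `unique()` outputs shows which attack types the model never sees during training. A small sketch, with the label sets copied by hand from the printed arrays for `KDDTrain+.txt` and `KDDTest+.txt`:

```python
# Labels printed for the training set (excluding "normal").
train_attacks = {
    "neptune", "warezclient", "ipsweep", "portsweep", "teardrop", "nmap",
    "satan", "smurf", "pod", "back", "guess_passwd", "ftp_write", "multihop",
    "rootkit", "buffer_overflow", "imap", "warezmaster", "phf", "land",
    "loadmodule", "spy", "perl",
}

# Labels printed for the test set (excluding "normal").
test_attacks = {
    "neptune", "saint", "mscan", "guess_passwd", "smurf", "apache2", "satan",
    "buffer_overflow", "back", "warezmaster", "snmpgetattack", "processtable",
    "pod", "httptunnel", "nmap", "ps", "snmpguess", "ipsweep", "mailbomb",
    "portsweep", "multihop", "named", "sendmail", "loadmodule", "xterm",
    "worm", "teardrop", "rootkit", "xlock", "perl", "land", "xsnoop",
    "sqlattack", "ftp_write", "imap", "udpstorm", "phf",
}

# Set differences give the attack types unique to each split.
test_only = sorted(test_attacks - train_attacks)
train_only = sorted(train_attacks - test_attacks)

print(f"{len(test_only)} attack types appear only in the test set: {test_only}")
print(f"{len(train_only)} attack types appear only in the training set: {train_only}")
```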
