## Exercise 3: 

This notebook demonstrates how to process data to build predictive machine learning models using a network dataset. This is the kind of dataset that a threat hunter, Security Operations Center (SOC) analyst, or detection engineer encounters in their day-to-day role. We'll use a custom-built tool and **pandas** for data manipulation, and **numpy** to make the data ML-ready.

**What's the story?**

You are a threat hunter who is proactively looking to secure your organization. You hypothesize that you will find some sneaky malicious activity and start looking at network data. After the exploratory data analysis (EDA) process and further investigation by the Incident Response (IR) team, you realize that the network traffic contains both benign and malicious activity. You are eager to catch such activity in the future, and you set out on this mission! This is real-world network data: fairly voluminous and not so kind to you.


### Key Questions:
- What does my security spidey-sense say about the factors that distinguish malicious traffic from benign traffic?
- At what point do we want to identify malicious activity?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import ipaddress
 
from utils.pcap import pcap_to_dataframe, extract_streams

# Load data
There are two ways to load the data:

- Reading a `pcap` directly and converting it to a Pandas `DataFrame`,
- Loading a `DataFrame` that was previously saved to a `.pkl` file. For more information on pickle files, see the [article from RealPython about the pickle module](https://realpython.com/python-pickle-module/).
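The two loading paths can be folded into a single "load from cache, otherwise build and cache" helper. The following is a minimal sketch under stated assumptions: `load_or_build` and `build_toy_frame` are hypothetical names that do not exist in this notebook, and the toy frame stands in for the expensive pcap parsing step.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(pkl_path, build_fn):
    """Return the cached DataFrame if pkl_path exists, otherwise build and cache it."""
    pkl_path = Path(pkl_path)
    if pkl_path.exists():
        # Only unpickle files you created yourself.
        return pd.read_pickle(pkl_path)
    df = build_fn()
    df.to_pickle(pkl_path)
    return df


def build_toy_frame():
    # Hypothetical stand-in for parsing a pcap into a DataFrame.
    return pd.DataFrame({"Source Port": [1234, 23], "Packet Length": [60, 74]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "toy.pkl"
    first = load_or_build(cache_path, build_toy_frame)   # builds and writes the cache
    second = load_or_build(cache_path, build_toy_frame)  # reads the cache back
    cache_hit_matches = first.equals(second)
```

Checking for the file removes the need to flip a flag by hand, at the cost of having to delete the cache whenever the source pcap changes.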

In [None]:
malicious_pcap = "../data/mirai.pcap"
benign_pcap = "../data/benign.pcapng"
malicious_pkl_path = "../data/mirai.pkl"
benign_pkl_path = "../data/benign.pkl"
malicious_stream_path = "../data/mirai_stream.pkl"
benign_stream_path = "../data/benign_stream.pkl"

In [None]:
# The first time you run this, create your own pkl files from the pcaps.
# Only load pickle files you created yourself: deserializing an untrusted pickle can lead to remote code execution (RCE).
# For subsequent runs, set this flag to True.
READ_FROM_PKL = False

In [None]:
if READ_FROM_PKL:
    malicious_df = pd.read_pickle(malicious_pkl_path)
    benign_df = pd.read_pickle(benign_pkl_path)
    
    # Use the same paths the else-branch writes to
    malicious_stream_df = pd.read_pickle(malicious_stream_path)
    benign_stream_df = pd.read_pickle(benign_stream_path)
else:
    malicious_df = pcap_to_dataframe(malicious_pcap)
    benign_df = pcap_to_dataframe(benign_pcap)
    malicious_stream_df = extract_streams(malicious_df)
    benign_stream_df = extract_streams(benign_df)
    # Save to pkl for accelerated processing in subsequent runs
    malicious_df.to_pickle(malicious_pkl_path)
    benign_df.to_pickle(benign_pkl_path)
    malicious_stream_df.to_pickle(malicious_stream_path)
    benign_stream_df.to_pickle(benign_stream_path)

In [None]:
malicious_df.sample(n=10)

In [None]:
malicious_df.shape

In [None]:
# Copy the dataframes to a features dataframe while omitting the packets with incomplete information such as NaN src/dst ips/ports
# .copy() makes the result independent of the original, so adding columns later does not raise SettingWithCopyWarning
malicious_features = malicious_df.dropna(subset=["Source IP", "Destination IP", "Source Port", "Destination Port"]).copy()
benign_features = benign_df.dropna(subset=["Source IP", "Destination IP", "Source Port", "Destination Port"]).copy()

In [None]:
malicious_features.sample(n=10)

In [None]:
malicious_features.shape

#### How much has the dataset reduced after handling incomplete/real-world data?
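One way to answer this is to compare row counts before and after `dropna`. The sketch below uses a made-up four-packet frame (not the notebook's data) with the same column names, so the numbers are illustrative only.

```python
import pandas as pd

# Toy packets: only the first row has all four fields populated.
packets = pd.DataFrame(
    {
        "Source IP": ["10.0.0.1", None, "10.0.0.2", "10.0.0.1"],
        "Destination IP": ["10.0.0.9", "10.0.0.9", None, "10.0.0.9"],
        "Source Port": [1234, 5353, 4321, None],
        "Destination Port": [23, 5353, 80, 23],
    }
)
required = ["Source IP", "Destination IP", "Source Port", "Destination Port"]
features = packets.dropna(subset=required)

rows_before = len(packets)
rows_after = len(features)
dropped_fraction = (rows_before - rows_after) / rows_before
print(f"{rows_before} -> {rows_after} rows ({dropped_fraction:.0%} dropped)")
```

The same three lines at the end apply to `malicious_df` and `malicious_features` in this notebook.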

# Numerical features
Post-process raw numbers into better numbers that describe context or add information.

## Cumulative

Summarize your numerical features and give them a new meaning and utility.
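As a small illustration on made-up data, a grouped cumulative sum gives each packet the running byte total for its source IP up to and including that packet. The column names mirror the ones used in this notebook; the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Source IP": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"],
        "Packet Length": [60, 100, 40, 20],
    }
)
# Running total of bytes per source IP, in row order.
toy["src_ip_total_bytes"] = toy.groupby("Source IP")["Packet Length"].cumsum()
running_totals = toy["src_ip_total_bytes"].tolist()
print(running_totals)  # [60, 100, 100, 120]
```

Because the value is a running total rather than a final sum, it depends on packet order, so it reflects how much an IP has sent at the moment each packet is observed.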

In [None]:
malicious_features["src_ip_total_bytes"] = malicious_features.groupby("Source IP")[
    "Packet Length"
].cumsum()

In [None]:
benign_features["src_ip_total_bytes"] = benign_features.groupby("Source IP")[
    "Packet Length"
].cumsum()

In [None]:
malicious_features["dst_ip_total_bytes"] = malicious_features.groupby("Destination IP")[
    "Packet Length"
].cumsum()

In [None]:
benign_features["dst_ip_total_bytes"] = benign_features.groupby("Destination IP")[
    "Packet Length"
].cumsum()

In [None]:
malicious_features.sample(n=10)

## Numerical conversions

Convert features that represent numbers, such as IP addresses, into numeric values a model can use.
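For example, an IPv4 address is a 32-bit integer written in dotted form, and the standard-library `ipaddress` module converts in both directions. This is a small standalone sketch; the address is arbitrary.

```python
import ipaddress

# Dotted-quad string to integer, and back again.
as_int = int(ipaddress.ip_address("192.168.0.1"))
round_trip = str(ipaddress.ip_address(as_int))
print(as_int, round_trip)  # 3232235521 192.168.0.1
```

Keep in mind that numeric closeness between two addresses does not necessarily mean the hosts are related, so a model may read more into the ordering than is really there.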

In [None]:
def ip_to_numeric(ip):
    try:
        ip_obj = ipaddress.ip_interface(ip)
        ip = int(ip_obj.network.network_address)
    except ValueError:
        ip = 0

    return ip

In [None]:
malicious_features["Numeric Source IP"] = malicious_features["Source IP"].apply(
    ip_to_numeric
)

malicious_features["Numeric Destination IP"] = malicious_features["Destination IP"].apply(
    ip_to_numeric
)

In [None]:
benign_features["Numeric Source IP"] = benign_features["Source IP"].apply(
    ip_to_numeric
)

benign_features["Numeric Destination IP"] = benign_features["Destination IP"].apply(
    ip_to_numeric
)

In [None]:
# Remove the original string IP columns now that numeric versions exist
malicious_features.pop("Source IP")
malicious_features.pop("Destination IP")

benign_features.pop("Source IP")
benign_features.pop("Destination IP")

# Categorical features
What about the text data? We can convert those to numbers too.

## Frequency encoding
Counts how often each category occurs in the data. The result is still a vector indexed by category, but instead of 0s and 1s it holds real numbers that indicate how often each category is encountered (here, as a fraction of all rows).
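A minimal sketch on made-up ports shows the idea: `value_counts(normalize=True)` yields each port's share of the rows, and `map` writes that share back onto every row.

```python
import pandas as pd

ports = pd.Series([23, 23, 23, 80], name="Destination Port")

# Share of rows per port: {23: 0.75, 80: 0.25}
freq = ports.value_counts(normalize=True).to_dict()

# Replace each port with its relative frequency.
encoded = ports.map(freq)
print(encoded.tolist())  # [0.75, 0.75, 0.75, 0.25]
```

Note that two different ports with the same frequency receive the same encoded value, so this encoding captures how common a port is, not which port it is.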

In [None]:
malicious_frequency_encoding = (
   malicious_features["Destination Port"].value_counts(normalize=True).to_dict()
)

In [None]:
malicious_frequency_encoding

In [None]:
benign_frequency_encoding = (
   benign_features["Destination Port"].value_counts(normalize=True).to_dict()
)

In [None]:
malicious_features["dst_port_freq_encoded"] = malicious_features["Destination Port"].map(
    malicious_frequency_encoding
)

In [None]:
benign_features["dst_port_freq_encoded"] = benign_features["Destination Port"].map(
    benign_frequency_encoding
)

## Derived Features
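Derived features are computed from existing columns. The cells in this section derive an interarrival time, the gap between consecutive packets, from the packet timestamps. A minimal sketch with made-up epoch-second timestamps:

```python
import pandas as pd

timestamps = pd.Series([10.0, 10.5, 12.5], name="Timestamp")

# Difference between each timestamp and the previous one; the first row has no predecessor.
interarrival = timestamps.diff()
print(interarrival.tolist())  # [nan, 0.5, 2.0]
```

The first value is always `NaN`, which may need to be filled or dropped before training, depending on the model.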

In [None]:
# Define a function to convert Scapy timestamps (epoch seconds) to pandas datetime
def scapy_timestamp_to_datetime(ts):
    return pd.to_datetime(
        ts.to_eng_string(), unit="s"
    )  # Convert to a format pandas understands


# Convert the Scapy timestamps to pandas datetime, then to epoch seconds as floats.
# An explicit "int64" avoids platform-dependent integer widths when casting datetimes.
malicious_features["Timestamp"] = malicious_features["Timestamp"].apply(scapy_timestamp_to_datetime).astype("int64") / 10**9
benign_features["Timestamp"] = benign_features["Timestamp"].apply(scapy_timestamp_to_datetime).astype("int64") / 10**9

In [None]:
malicious_features["Interarrival"] = malicious_features["Timestamp"].diff()
benign_features["Interarrival"] = benign_features["Timestamp"].diff()

In [None]:
malicious_features.sample(n=5)

In [None]:
malicious_features.dtypes

In [None]:
malicious_features.to_pickle("../data/malicious_features_numeric.pkl")
benign_features.to_pickle("../data/benign_features_numeric.pkl")

## One hot encoding
Binary encoding that creates a vector of 0s and 1s, one position per category. If a row belongs to the category, mark that position as 1; otherwise mark it as 0.
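A small sketch on made-up ports makes this concrete. The `names` mapping here is a hypothetical two-entry stand-in, not the notebook's full protocol table.

```python
import pandas as pd

ports = pd.Series([23, 443, 23], name="Destination Port")
names = {23: "Telnet", 443: "HTTPS"}  # hypothetical port-to-name mapping

# One 0/1 column per named port: 1 where the row's port matches, 0 otherwise.
one_hot = pd.DataFrame(
    {name: (ports == port).astype(int) for port, name in names.items()}
)
print(one_hot["Telnet"].tolist())  # [1, 0, 1]
print(one_hot["HTTPS"].tolist())   # [0, 1, 0]
```

Restricting the columns to a fixed mapping keeps the feature set identical across datasets, which matters when the benign and malicious frames are later combined.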

In [None]:
network_protocols = {
    1: "ICMP",
    6: "TCP",
    17: "UDP",
    23: "Telnet",
    41: "IPv6_encapsulation",
    47: "GRE",
    50: "ESP",
    51: "AH",
    53: "DNS",
    58: "ICMPv6",
    89: "OSPF",
    132: "SCTP",
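    135: "MSRPC",  # port 135 is the Microsoft RPC endpoint mapper; a second "SCTP" name would overwrite the column for 132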
    136: "UDPLite",
    137: "NETBIOS-NS",
    138: "NETBIOS-DGM",
    139: "NETBIOS-SSN",
    143: "IMAP",
    161: "SNMP",
    162: "SNMP_trap",
    443: "HTTPS",
    514: "Syslog",
    636: "LDAPS",
    989: "FTPS",
    993: "IMAPS",
    995: "POP3S",
    1080: "SOCKS_proxy",
    # Add more protocols as needed
}

In [None]:
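def one_hot_port(port_column, df):
    # Build one 0/1 column per known protocol, aligned to df's index
    new_df = pd.DataFrame(index=df.index)
    for protocol_port, protocol_name in network_protocols.items():
        # Vectorized comparison: 1 where the port matches, 0 otherwise
        new_df[protocol_name] = (df[port_column] == protocol_port).astype(int)
    return new_df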

In [None]:
malicious_protocol_one_hot = one_hot_port("Destination Port", malicious_features)
malicious_features = pd.concat([malicious_features, malicious_protocol_one_hot], axis=1)

In [None]:
benign_protocol_one_hot = one_hot_port("Destination Port", benign_features)
benign_features = pd.concat([benign_features, benign_protocol_one_hot], axis=1)

In [None]:
malicious_features.to_pickle("../data/malicious_features.pkl")
benign_features.to_pickle("../data/benign_features.pkl")

In [None]:
malicious_features.describe()

In [None]:
benign_features.describe()