# COMP500827 Applications of Machine Learning in Information Security <br> Course Lab: DDoS Attack Traffic Detection <br> Deadline: January 6, 2025, 24:00

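Name: , Student ID: 

**Submission requirements**:
- Be sure to fill in your name and student ID above
- Submissions must include: the Python code written as a Jupyter notebook (.ipynb file) and a PDF export of the notebook with all outputs (note: do not include the data)
- Package all files into a single zip archive that runs as soon as it is extracted
- Upload the archive to the course system (https://class.xjtu.edu.cn/course/74005/homework#/ )

**Late policy**: Work submitted before the deadline incurs no late penalty. Each student has three late days for the entire course, to be used as needed. Each late day used deducts 15% from the grade of that assignment/lab. You may allocate the three days freely. For example, you can spend all three on a single assignment/lab (which can then earn at most 55%), or spread them out, one per assignment/lab (each then capped at 85%). Once all three late days are used, any further late submission receives no credit.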

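# DDoS Attack Traffic Detection with Supervised Learning

This lab explores how supervised machine learning can be used to detect distributed denial-of-service (DDoS) attacks.

## Objectives

1. Learn how to process traffic data from IoT devices
2. Build a simple feedforward neural network
3. Compare the performance of artificial neural networks with other machine learning algorithms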

In [None]:
# With a recent PyTorch release you may need to downgrade numpy: https://stackoverflow.com/questions/78636947/a-module-that-was-compiled-using-numpy-1-x-cannot-be-run-in-numpy-2-0-0-as-it-ma
# pip install numpy==1.26.4
# pip install torch

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
import math

# Scikit-learn: SVM, KNN, classification metrics, etc.
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn import svm

# PyTorch: for building the neural network
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Matplotlib: for plotting
import matplotlib.pyplot as plt
import itertools
%matplotlib inline

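### Load the normal traffic data
This dataset contains 10 minutes of normal traffic from three IoT devices (a security camera, a blood pressure monitor, and a smart switch). The data collection process is described in detail in the [paper](https://ieeexplore.ieee.org/document/8424629). In the code below, the three devices are identified by source IP as a WeMo smart switch, a Yi home camera, and an Android phone.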

In [None]:
normal = pd.read_csv('./data/normal.csv')
# Binary classification: Label = 1 (attack traffic); Label = 0 (normal traffic)
normal['Label'] = 0

In [None]:
# Split the traffic by each device's IP address

# WeMo smart switch
normal_switch = normal[normal['Source'] == '172.24.1.81']

# Yi home camera
normal_camera = normal[normal['Source'] == '172.24.1.107']

# Android phone
normal_phone = normal[normal['Source'] == '172.24.1.63']

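### Load the attack traffic data
This dataset consists of Mirai botnet attack traffic collected in a simulated lab environment. Specifically, the experiment used a Kali Linux virtual machine as the DoS source and a Raspberry Pi 2 running an Apache web server as the victim, simulating the three most common denial-of-service attacks carried out by IoT devices infected with the Mirai botnet: TCP SYN Flood, UDP Flood, and HTTP GET Flood:

1. HTTP GET Flood - 2 minutes; simulated with Goldeneye on the Linux virtual machine, targeting Google.com
2. TCP SYN Flood - 5 minutes; simulated with hping3 on the Raspberry Pi, targeting the Linux virtual machine on the LAN
3. UDP Flood - 2.5 minutes; simulated with hping3 on the Raspberry Pi, targeting the Linux virtual machine on the LAN

After preprocessing, the attack traffic is overlaid on the normal traffic of the three hosts. All three IoT devices are assumed to be infected by the botnet, and each carries out three attacks in random order during the experiment, each lasting about 100 seconds. At any given moment there is therefore a 50% chance that an attack is in progress: (3 × 100 s) / (10 min × 60 s/min) = 0.5.

During an attack, a device can send attack traffic and normal traffic at the same time. The attack schedules of the devices are independent of one another. The experiment assumes that all attacks target the IP address 8.8.8.8 with fixed ports (80 for the HTTP attack, 443 for the TCP/UDP attacks).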
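
The 50% figure can be checked against a concrete schedule. The following is a minimal sketch, assuming the Device 1 attack windows listed later in this notebook (20-110 s, 300-410 s, 475-575 s) and a 600-second experiment; `attack_fraction` is a hypothetical helper for illustration and is not part of the lab code.

```python
def attack_fraction(windows, total_seconds):
    """Return the fraction of the experiment during which an attack is active.

    Hypothetical helper; assumes the (start, end) windows do not overlap.
    """
    attack_seconds = sum(end - start for start, end in windows)
    return attack_seconds / total_seconds

# Device 1 schedule from this notebook: HTTP GET, TCP SYN, and UDP floods
device1_windows = [(20, 110), (300, 410), (475, 575)]
fraction = attack_fraction(device1_windows, total_seconds=600)
print(fraction)  # 0.5
```

The three windows last 90, 110, and 100 seconds, so they average the roughly 100 seconds per attack stated above.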

In [None]:
# Load the attack traffic
attack_http = pd.read_csv('./data/http_get_attack.csv')
attack_tcp = pd.read_csv('./data/tcp_flood.csv')
attack_udp = pd.read_csv('./data/udp_flood.csv')

# Add labels: 0 = normal traffic, 1 = attack traffic
attack_http['Label'] = 1
attack_tcp['Label'] = 1
attack_udp['Label'] = 1

# Clean the attack data by isolating the attacker's source IP address.
# Only DoS attacks originating inside the network are considered.
# .copy() makes each filtered frame independent, so the assignments below
# do not trigger pandas' SettingWithCopyWarning.
attack_http = attack_http[ (attack_http['Source'] == '172.24.1.67') & (attack_http['Destination'] == '172.217.11.36')].copy()
attack_tcp = attack_tcp[ (attack_tcp['Source'] == '172.24.1.108') & (attack_tcp['Destination'] == '172.24.1.67') ].copy()
attack_udp = attack_udp[ (attack_udp['Source'] == '172.24.1.108') & (attack_udp['Destination'] == '172.24.1.67') ].copy()

# Assume every attack targets the IP address 8.8.8.8
attack_http['Destination'] = '8.8.8.8'
attack_tcp['Destination'] = '8.8.8.8'
attack_udp['Destination'] = '8.8.8.8'

# Set the destination port of each attack
attack_http['Dst_port'] = 80
attack_tcp['Dst_port'] = 443
attack_udp['Dst_port'] = 443

In [None]:
attack_tcp.head(10)

#### Device 1 (WeMo smart switch) attack schedule  
Attack 1: HTTP GET Flood, 90 seconds, 20-110 sec  
Attack 2: TCP SYN Flood, 110 seconds, 300-410 sec  
Attack 3: UDP Flood, 100 seconds, 475-575 sec  
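
Each overlay in the cells below follows the same three steps: take the first N seconds of an attack capture, shift its timestamps to the scheduled start, and rewrite the source IP to the infected device. The following is a minimal sketch on made-up data; `overlay_attack` is a hypothetical helper that does not appear in the lab code.

```python
import pandas as pd

def overlay_attack(capture, duration, start, device_ip):
    """Take the first `duration` seconds of an attack capture, shift it to begin
    at `start` seconds, and attribute it to `device_ip` (hypothetical helper)."""
    # .copy() gives an independent frame, so the assignments below do not
    # trigger pandas' SettingWithCopyWarning or modify the original capture
    window = capture[capture['Time'] <= duration].copy()
    window['Time'] = window['Time'] + start
    window['Source'] = device_ip
    return window

# Toy capture standing in for an attack trace (values are made up)
toy = pd.DataFrame({'Time': [0.0, 45.0, 90.0, 120.0],
                    'Source': ['172.24.1.67'] * 4,
                    'Length': [60, 60, 60, 60]})
shifted = overlay_attack(toy, duration=90, start=20, device_ip='172.24.1.81')
print(shifted['Time'].tolist())  # [20.0, 65.0, 110.0]
```

Copying the slice before assigning to it keeps the shared attack capture unchanged, which matters here because the same capture is reused for all three devices.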

In [None]:
# Attack 1 (.copy() avoids SettingWithCopyWarning on the assignments below)
attack_http_switch = attack_http[attack_http['Time'] <= 90].copy()
attack_http_switch.loc[:,'Time'] = attack_http_switch.loc[:,'Time'] + 20
attack_http_switch.loc[:,'Source'] = '172.24.1.81'

# Attack 2
attack_tcp_switch = attack_tcp[attack_tcp['Time'] <= 110].copy()
attack_tcp_switch.loc[:,'Time'] = attack_tcp_switch.loc[:,'Time'] + 300
attack_tcp_switch.loc[:,'Source'] = '172.24.1.81'

# Attack 3
attack_udp_switch = attack_udp[attack_udp['Time'] <= 100].copy()
attack_udp_switch.loc[:,'Time'] = attack_udp_switch.loc[:,'Time'] + 475
attack_udp_switch.loc[:,'Source'] = '172.24.1.81'

#### Device 2 (Yi home camera) attack schedule  
Attack 1: TCP SYN Flood, 80 seconds, 25-105 sec  
Attack 2: HTTP GET Flood, 100 seconds, 310-410 sec  
Attack 3: UDP Flood, 120 seconds, 450-570 sec  

In [None]:
# Attack 1 (.copy() avoids SettingWithCopyWarning on the assignments below)
attack_tcp_camera = attack_tcp[attack_tcp['Time'] <= 80].copy()
attack_tcp_camera.loc[:,'Time'] = attack_tcp_camera['Time'] + 25
attack_tcp_camera.loc[:,'Source'] = '172.24.1.107'

# Attack 2
attack_http_camera = attack_http[attack_http['Time'] <= 100].copy()
attack_http_camera.loc[:,'Time'] = attack_http_camera['Time'] + 310
attack_http_camera.loc[:,'Source'] = '172.24.1.107'

# Attack 3
attack_udp_camera = attack_udp[attack_udp['Time'] <= 120].copy()
attack_udp_camera.loc[:,'Time'] = attack_udp_camera['Time'] + 450
attack_udp_camera.loc[:,'Source'] = '172.24.1.107'

#### Device 3 (Android phone) attack schedule  
Attack 1: UDP Flood, 105 seconds, 5-110 sec  
Attack 2: TCP SYN Flood, 80 seconds, 240-320 sec  
Attack 3: HTTP GET Flood, 115 seconds, 420-535 sec  

In [None]:
# Attack 1 (.copy() avoids SettingWithCopyWarning on the assignments below)
attack_udp_phone = attack_udp[attack_udp['Time'] <= 105].copy()
attack_udp_phone.loc[:,'Time'] = attack_udp_phone['Time'] + 5
attack_udp_phone.loc[:,'Source'] = '172.24.1.63'

# Attack 2
attack_tcp_phone = attack_tcp[attack_tcp['Time'] <= 80].copy()
attack_tcp_phone.loc[:,'Time'] = attack_tcp_phone['Time'] + 240
attack_tcp_phone.loc[:,'Source'] = '172.24.1.63'

# Attack 3
attack_http_phone = attack_http[attack_http['Time'] <= 115].copy()
attack_http_phone.loc[:,'Time'] = attack_http_phone['Time'] + 420
attack_http_phone.loc[:,'Source'] = '172.24.1.63'

### Extract each device's traffic features and labels (1 = attack, 0 = normal)
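
The cells below first derive per-device features over 10-second time bins: the summed packet `Length` divided by the bin width, the number of distinct destinations, and the change in that count from the previous bin. The following is a minimal sketch of the same idea on made-up packets. It uses pandas named aggregation rather than the notebook's `groupby(...).apply(...)`, and the output column names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up packets: capture time in seconds, destination IP, packet length
packets = pd.DataFrame({
    'Time': [1.0, 4.0, 9.5, 12.0, 18.0],
    'Destination': ['8.8.8.8', '8.8.8.8', '1.1.1.1', '8.8.8.8', '8.8.8.8'],
    'Length': [100, 200, 300, 400, 600],
})

# Quantize each timestamp to a 10-second bin: [0, 10) -> 0, [10, 20) -> 1, ...
packets['TimeBin'] = np.floor(packets['Time'] / 10.0)

per_bin = packets.groupby('TimeBin').agg(
    total_length=('Length', 'sum'),
    num_dest=('Destination', 'nunique'),
)
per_bin['bandwidth'] = per_bin['total_length'] / 10  # average per second
# Change in destination count relative to the previous bin (0 for the first bin)
per_bin['delta_num_dest'] = per_bin['num_dest'].diff().fillna(0)

print(per_bin[['bandwidth', 'num_dest', 'delta_num_dest']])
```

Here the first bin holds three packets to two destinations and the second holds two packets to one destination, so the destination count drops by one between bins.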

In [None]:
# Merge attack and normal traffic
switch_data = pd.concat([normal_switch, attack_http_switch, attack_tcp_switch, attack_udp_switch])
camera_data = pd.concat([normal_camera, attack_http_camera, attack_tcp_camera, attack_udp_camera])
phone_data = pd.concat([normal_phone, attack_http_phone, attack_tcp_phone, attack_udp_phone])

In [None]:
# Generate device-specific temporal features
def generate_device_temporal_features_and_labels(data):
    # Quantize each timestamp into a 10-second time bin
    data['TimeBin'] = np.floor(data['Time'] / 10.0)
    # Extract new features per time bin
    group = data.groupby(['TimeBin'])
    group_features = group.apply(group_feature_extractor)
    # Change in the number of destinations relative to the previous bin (0 for the first bin)
    group_features['device_timebin_delta_num_dest'] = group_features['device_timebin_num_dest'].diff(periods=1)
    group_features['device_timebin_delta_num_dest'] = group_features['device_timebin_delta_num_dest'].fillna(0)

    # Attach the per-bin features to every packet in that bin
    data = data.merge(group_features, left_on='TimeBin', right_index=True)
    return data
    
def group_feature_extractor(g):
    # Sum of packet lengths in the bin divided by the 10-second bin width
    ten_sec_traffic = (g['Length']).sum() / 10
    # Number of distinct destination addresses in the bin
    ten_sec_num_host = len(set(g['Destination']))
    return pd.Series([ten_sec_traffic, ten_sec_num_host], index = ['device_timebin_bandwidth', 'device_timebin_num_dest'])

In [None]:
switch_data = generate_device_temporal_features_and_labels(switch_data)
camera_data = generate_device_temporal_features_and_labels(camera_data)
phone_data = generate_device_temporal_features_and_labels(phone_data)

In [None]:
switch_data

In [None]:
def generate_features_and_labels(data):
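    data = data.sort_values(by='Time') # sort all traffic by time (the result must be assigned back)
    data = data.dropna() # drop rows missing a source or destination port
    data = data.reset_index(drop=True)
    
    # Generate features
    features = data.copy(deep=True)

    # Packet timing: time elapsed over the last 3 packets (dT) and its lag-3 differences (dT2, dT3),
    # used as proxies for sending rate, acceleration, and jerk
    features['dT'] = features['Time'] - features['Time'].shift(3)
    features['dT2'] = features['dT'] - features['dT'].shift(3)
    features['dT3'] = features['dT2'] - features['dT2'].shift(3)
    features = features.fillna(0)

    # One-hot encoding of common protocols (HTTP, TCP, UDP, and OTHER)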
    features['is_HTTP'] = 0
    features.loc[ ( (features['Protocol'] == 'HTTP') | (features['Protocol'] == 'HTTP/XML') ), ['is_HTTP']] = 1

    features['is_TCP'] = 0
    features.loc[features['Protocol'] == 'TCP', ['is_TCP']] = 1

    features['is_UDP'] = 0
    features.loc[features['Protocol'] == 'UDP', ['is_UDP']] = 1

    features['is_OTHER'] = 0
    features.loc[(
                    (features['Protocol'] != 'HTTP') &
                    (features['Protocol'] != 'HTTP/XML') &
                    (features['Protocol'] != 'TCP') &
                    (features['Protocol'] != 'UDP') 
                ), ['is_OTHER']] = 1

    # Generate labels
    labels = features['Label']

    # Drop features that are meaningless or cannot be quantified
    del features['No.']
    del features['Time']
    del features['Source']
    del features['Destination']
    del features['Protocol']
    del features['Info']
    del features['Src_port']
    del features['Dst_port']
    del features['Delta_time']
    del features['Label']
    del features['TimeBin']
    
    return (features, labels)

In [None]:
# Generate the feature data and the label for each data instance
switch_features, switch_labels = generate_features_and_labels(switch_data)
camera_features, camera_labels = generate_features_and_labels(camera_data)
phone_features, phone_labels = generate_features_and_labels(phone_data)

In [None]:
all_features = pd.concat([switch_features, camera_features, phone_features])
all_labels = pd.concat([switch_labels, camera_labels, phone_labels])
all_features

In [None]:
# Plot the confusion matrix and compute classification metrics
def plot_confusion_matrix(
        cm, classes, y_pred, y,
        title='Confusion matrix',
        cmap=plt.cm.Blues,
        plot_flag=True
    ):
    
    # Compute the AUC
    test_fpr, test_tpr, te_thresholds = metrics.roc_curve(y, y_pred)
    auc = metrics.auc(test_fpr, test_tpr)
    
    if plot_flag:
        # Plot the confusion matrix
        fmt = 'd' 
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt), fontsize=20,
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title, fontsize=20)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, fontsize=15)
        plt.yticks(tick_marks, classes, fontsize=15)
        plt.tight_layout()
        plt.ylabel('True label', fontsize=20)
        plt.xlabel('Predicted label', fontsize=20)

        # Plot the ROC curve
        fig, ax = plt.subplots(figsize=(6,5))
        plt.grid()
        plt.plot(test_fpr, test_tpr, label=f"AUC={auc:.4f}")
        plt.plot([0,1],[0,1],'g--')
        plt.legend(fontsize=20)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        plt.xlabel("False Positive Rate", fontsize=20)
        plt.ylabel("True Positive Rate", fontsize=20)
        plt.title("AUC (ROC curve)", fontsize=20)
        plt.grid(color='black', linestyle='-', linewidth=0.5)
        plt.show()

        print(f"Accuracy: {metrics.accuracy_score(y, y_pred):.4f}, Precision: {metrics.precision_score(y, y_pred):.4f}, Recall: {metrics.recall_score(y, y_pred):.4f}, F1 score: {metrics.f1_score(y, y_pred):.4f}")
        
    return auc

# Q1: Adjust the network architecture and build your own neural network (45%)
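
Besides the layer structure, the training cell below lets you choose between `nn.MSELoss` and `nn.BCELoss` (the latter is commented out). The following pure-Python sketch, with made-up predictions, shows how the two losses behave on the same sigmoid-style outputs. It does not use PyTorch, and the helper names `mse_loss` and `bce_loss` are hypothetical.

```python
import math

def mse_loss(preds, targets):
    """Mean squared error between predicted probabilities and 0/1 targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def bce_loss(preds, targets):
    """Binary cross-entropy; assumes every prediction lies strictly in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(preds, targets)) / len(preds)

targets = [1, 0, 1]
confident_right = [0.9, 0.1, 0.9]   # made-up outputs close to the targets
confident_wrong = [0.1, 0.9, 0.1]   # made-up outputs far from the targets

for name, preds in [('right', confident_right), ('wrong', confident_wrong)]:
    print(name, round(mse_loss(preds, targets), 4), round(bce_loss(preds, targets), 4))
```

On the confidently wrong predictions, the cross-entropy is about 2.30 while the squared error is 0.81, so cross-entropy penalizes confident mistakes more heavily. This is one reason it is a common choice for binary classifiers with a sigmoid output.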

In [None]:
def train_test(data, labels, classes = ['Normal', 'Attack'], epochs = 1):
    x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.15)   

    # Convert the numpy data to PyTorch tensors
    x_train_tensor = torch.tensor(x_train, dtype=torch.float32)
    x_test_tensor = torch.tensor(x_test, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
    y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

    # Prepare the training data with a DataLoader
    train_dataset = TensorDataset(x_train_tensor, y_train_tensor)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    # Define the network architecture
    class Model(nn.Module):
        def __init__(self, input_dim):
            # Change the network architecture here: https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html
            super(Model, self).__init__()
            self.fc1 = nn.Linear(input_dim, 3)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(3, 1)
            self.sigmoid = nn.Sigmoid()
            ...

        def forward(self, x):
            # Change each layer's activation function accordingly
            x = self.relu(self.fc1(x))
            x = self.sigmoid(self.fc2(x))
            ...
            return x

    # Set the input-layer dimension from the training data
    input_dim = x_train.shape[1]
    model = Model(input_dim)

    # Define the loss function and the optimizer
    criterion = nn.MSELoss() # Mean Square Error Loss
    # lr is passed explicitly because older PyTorch versions require it for SGD
    optimizer = optim.SGD(model.parameters(), lr=0.001) # Stochastic Gradient Descent Optimizer
#     criterion = nn.BCELoss()  # Binary Cross Entropy Loss
#     optimizer = optim.RMSprop(model.parameters()) # Root Mean Square Propagation Optimizer

    # Train for the given number of epochs
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            x_batch, y_batch = batch

            # Forward pass with the current model parameters
            outputs = model(x_batch)

            # Compute the loss on the current outputs
            loss = criterion(outputs, y_batch)

            # Backpropagate to compute gradients, then update the parameters with the chosen optimizer
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Report the loss of the last batch in this epoch
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

    # Collect predictions on the test set
    model.eval()
    with torch.no_grad():
        y_predict = model(x_test_tensor)
        y_predict = torch.round(y_predict)  # round the sigmoid outputs to 0 or 1
    # Convert to a 1-D integer numpy array for the scikit-learn metrics
    y_predict = y_predict.squeeze(1).numpy().astype(int)

    # Evaluate classification performance
    cm = metrics.confusion_matrix(y_test, y_predict)
    plot_confusion_matrix(cm, classes, y_predict, y_test)

In [None]:
data, labels = all_features.values, np.asarray(all_labels)
train_test(data, labels, epochs = 5) # change to whatever number of epochs you find appropriate

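# Q2: Select suitable features and find an appropriate way to normalize the data (35%)
https://en.wikipedia.org/wiki/Standard_score

Z-score standardization is defined as $z = \frac{x-\mu}{\sigma}$, where
- $\mu$: the mean of $x$
- $\sigma$: the standard deviation of $x$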
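
As a worked illustration of the formula, the following minimal sketch standardizes a made-up numeric column with numpy and, separately, relabels a made-up binary indicator from {0, 1} to {-1, 1}. The arrays are illustrative only and are not taken from the lab data.

```python
import numpy as np

# Made-up numeric feature column
x = np.array([60.0, 60.0, 1500.0, 590.0, 74.0])

# z = (x - mu) / sigma, using the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # approximately 0 and 1

# Relabel a binary 0/1 indicator as -1/+1
flags = np.array([0, 1, 1, 0])
signed = 2 * flags - 1
print(signed)  # [-1  1  1 -1]
```

Note that numpy's `std()` defaults to the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the two give slightly different z-scores unless `ddof` is set explicitly.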

In [None]:
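# Apply zero-mean normalization (z-score standardization) to the numerical features
numerical = all_features.iloc[:, [0]] # select the features you consider most effective
numerical = # fill in the correct code

# Remap the categorical feature values from {0, 1} to {-1, 1}
categorical = all_features.iloc[:, [7]] # select the features you consider most effective
categorical = # fill in the correct code

# Recombine the data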
all_features_normalized = pd.concat([numerical, categorical], axis=1)

In [None]:
all_features_normalized

### Train and test the neural network

In [None]:
data, labels = all_features_normalized.values, np.asarray(all_labels)
train_test(data, labels, epochs = 5) # change to whatever number of epochs you find appropriate

# Q3: Try other machine learning models (e.g. SVM, KNN) to solve the problem (20%)

In [None]:
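# If svm.SVC() is too slow, try LinearSVC()
def train_test_general(data, labels, classes = ['Normal', 'Attack']):
    x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.15)   

    # Initialize the other machine learning models
    classifier_svm = # fill in the correct code
    classifier_knn = # fill in the correct code
    
    # Train the models on the training set
    # fill in the correct code
    ...
    
    # Predict on the test set
    y_predict_svm = # fill in the correct code
    y_predict_knn = # fill in the correct code

    # Evaluate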
    cm = metrics.confusion_matrix(y_test, y_predict_svm)
    plot_confusion_matrix(cm, classes, y_predict_svm, y_test)
    
    cm = metrics.confusion_matrix(y_test, y_predict_knn)
    plot_confusion_matrix(cm, classes, y_predict_knn, y_test)

In [None]:
data, labels = all_features_normalized.values, np.asarray(all_labels)
train_test_general(data, labels)

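1. Compare the performance of the different models. Which one performs better, and why? Drawing on the lectures, briefly share your thoughts on data features, hyperparameter tuning, and model selection.

Answer:

2. The detection method in this lab requires direct access to historical traffic-feature data from every IoT device. Briefly discuss the privacy risks that collecting and measuring such data may pose.

Answer: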