<a href="https://colab.research.google.com/github/jhen-fang/P_Project-24/blob/main/P_Project_24_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Project Title: Text Classification and QA on a Cyber Threat Dataset**

- **Name: 蔡甄芳**
- **Department and year: Information Management, 4th year, Class B**
- **Student ID: 109306056**
- **GitHub code: https://github.com/jhen-fang/P_Project-24/blob/main/P_Project_24_final.ipynb**
- **Colab link: https://colab.research.google.com/drive/13K60edQIt0uHyqfikX1O7napKgZV93lS?usp=sharing**

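## **2. Dataset Overview**

##### (1) **Dataset name: Cyber Threat Dataset: Network, Text & Relation**

##### (2) **Dataset source: https://www.kaggle.com/datasets/ramoliyafenil/text-based-cyber-threat-detection**

##### (3) **Dataset summary: The dataset contains network traffic data, textual content, and entity relationships, and can be used to detect, diagnose, and mitigate cyber threats.**

##### **(4) Dataset fields:**
    - id: identifier of each instance in the dataset.
    - text: textual content transmitted over the network, such as emails, messages, or network traffic payloads; it includes descriptions of potential cyber threats.
    - Entries: a JSON list with the following keys
        - sender_id
        - label : the identified cyber threat or attack pattern
        - start_offset
        - end_offset
        - receive_ids
    - relations: tuples describing entity relationships, each holding a pair of entity IDs (source and target)
    - diagnosis: a description and diagnosis of the identified cyber threat, offering insight into it.
    - solutions: a description of solutions or mitigation strategies for the cyber threat.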
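
The `head()` preview in this notebook shows `relations` stored as a string that looks like a Python list of dicts with `from_id`, `id`, and `to_id` keys. The block below is a minimal sketch of turning such a string into (source, target) pairs with `ast.literal_eval`. The sample string and the helper name `parse_relations` are hypothetical, and real rows may carry extra keys that are truncated in the preview.

```python
import ast

# Hypothetical relations string using only the keys visible in the preview.
raw_relations = "[{'from_id': 1, 'id': 10, 'to_id': 2}, {'from_id': 2, 'id': 11, 'to_id': 3}]"


def parse_relations(raw):
    """Turn a stringified list of relation dicts into (source, target) ID pairs."""
    relations = ast.literal_eval(raw)  # parses Python literals only, no code execution
    return [(rel["from_id"], rel["to_id"]) for rel in relations]


pairs = parse_relations(raw_relations)
print(pairs)
```

`ast.literal_eval` is used here instead of `json.loads` because the preview shows single-quoted keys, which are not valid JSON.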

---


*F. Ramoliya, R. Kakkar, R. Gupta, S. Tanwar and S. Agrawal, "SEAM: Deep Learning-based Secure Message Exchange Framework For Autonomous EVs," 2023 IEEE Globecom Workshops (GC Wkshps), Kuala Lumpur, Malaysia, 2023, pp. 80-85, doi: 10.1109/GCWkshps58843.2023.10465168.*

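## **3. Project Goals**

#### **(1) Cyber threat detection: classify the cyber threat (Entries: label) from the text (textual content and network traffic data)**

    - Pipeline:
        - text-classification
        - zero-shot-classification
    - Goal: classify the attack pattern (label) from the threat description (text)

#### **(2) Security incident diagnosis and solution QA: based on the diagnosis and solutions fields, provide remedies and responses for the threat encountered**

    - Pipeline: question-answering

    - Goals:
        - Through QA, take a threat description (text) as input and identify the threat or attack pattern (label)
        - Through QA, take the threat or attack pattern (label) as input and identify the diagnosis (diagnosis)
        - Through QA, take the diagnosis (diagnosis) as input and return the solutions (solutions)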


## **4. Implementation: Downstream Tasks**

In [1]:
!pip install transformers pandas numpy matplotlib seaborn kaggle datasets evaluate transformers[torch]



In [2]:
import pandas as pd
import numpy as np
import torch
import evaluate
import os
from google.colab import userdata
from datasets import Dataset, DatasetDict, load_metric
from torch import tensor
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    pipeline,
    AutoModelForQuestionAnswering,
    DefaultDataCollator,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    GPT2Tokenizer,
    GPT2LMHeadModel
    )

In [3]:
# Download the dataset
api_key = userdata.get('kaggle_key')
username = userdata.get('kaggle_username')

os.environ['KAGGLE_USERNAME'] = username
os.environ['KAGGLE_KEY'] = api_key

!kaggle datasets download -d ramoliyafenil/text-based-cyber-threat-detection

# Unzip the dataset
!unzip text-based-cyber-threat-detection.zip

Dataset URL: https://www.kaggle.com/datasets/ramoliyafenil/text-based-cyber-threat-detection
License(s): Apache 2.0
Downloading text-based-cyber-threat-detection.zip to /content
100% 3.91M/3.91M [00:01<00:00, 4.43MB/s]
100% 3.91M/3.91M [00:01<00:00, 3.23MB/s]
Archive:  text-based-cyber-threat-detection.zip
  inflating: Cyber-Threat-Intelligence-Custom-Data_new_processed.csv  
  inflating: all.jsonl               
  inflating: cyber-threat-intelligence-splited_test.csv  
  inflating: cyber-threat-intelligence-splited_train.csv  
  inflating: cyber-threat-intelligence-splited_validate.csv  
  inflating: cyber-threat-intelligence_all.csv  
  inflating: test.jsonl              
  inflating: train.jsonl             
  inflating: validation.jsonl        


In [4]:
custom_data_new_processed = pd.read_csv('/content/Cyber-Threat-Intelligence-Custom-Data_new_processed.csv')
custom_data_new_processed.head()

Unnamed: 0,id,text,relations,diagnosis,solutions,id_1,label_1,start_offset_1,end_offset_1,id_2,label_2,start_offset_2,end_offset_2,id_3,label_3,start_offset_3,end_offset_3
0,249,A cybersquatting domain save-russia[.]today is...,"[{'from_id': 44658, 'id': 9, 'to_id': 44659, '...",The diagnosis is a cyber attack that involves ...,1. Implementing DNS filtering to block access ...,44656,attack-pattern,2,16,44657,url,24,43,44658.0,attack-pattern,57.0,68.0
1,14309,"Like the Android Maikspy, it first sends a not...","[{'from_id': 48531, 'id': 445, 'to_id': 48532,...",The diagnosis is that the entity identified as...,1. Implementing a robust anti-malware software...,48530,SOFTWARE,9,17,48531,malware,17,24,48532.0,Infrastucture,63.0,73.0
2,13996,While analyzing the technical details of this ...,"[{'from_id': 48781, 'id': 461, 'to_id': 48782,...",Diagnosis: APT37/Reaper/Group 123 is responsib...,1. Implementing advanced threat detection tech...,48781,threat-actor,188,194,48782,threat-actor,210,217,48783.0,threat-actor,220.0,229.0
3,13600,(Note that Flash has been declared end-of-life...,"[{'from_id': 51688, 'id': 1133, 'to_id': 51689...",The diagnosis is a malware infection. The enti...,1. Implementing a robust antivirus software th...,51687,TIME,62,79,51688,malware,207,215,51689.0,malware,247.0,258.0
4,14364,Figure 21. Connection of Maikspy variants to 1...,"[{'from_id': 51780, 'id': 1161, 'to_id': 44372...",The diagnosis is that Maikspy malware variants...,1. Implementing a robust firewall system that ...,51779,URL,163,191,51777,URL,70,93,51781.0,malware,120.0,127.0


In [5]:
# 1. Select label_1 and text as the columns for text classification
selected_data_for_cls = custom_data_new_processed[['label_1', 'text']].rename(columns={'label_1': 'label'})
# 2. Select label_1, text, diagnosis, and solutions as the columns for QA and text generation
selected_data_for_qa = custom_data_new_processed[['label_1','text', 'diagnosis','solutions']]


#### **4-1. Cyber threat detection: classify the cyber threat (Entries: label) from the text (textual content and network traffic data)**

In [6]:
dataset_for_cls = Dataset.from_pandas(selected_data_for_cls)
train_test_split_cls = dataset_for_cls.train_test_split(test_size=0.2)
train_val_split_cls = train_test_split_cls['train'].train_test_split(test_size=0.1)  # hold out 10% of the training data as the validation set

# Split the classification (cls) data into train : validation : test = 0.72 : 0.08 : 0.2
dataset_dict_for_cls = DatasetDict({
    'train': train_val_split_cls['train'],
    'validation': train_val_split_cls['test'],
    'test': train_test_split_cls['test']
})

# Inspect the first example of the train dataset
dataset_dict_for_cls['train'][0]

{'label': 'malware',
 'text': 'We correlated the AnubisSpy variants to Sphinx’s desktop/PC-targeting malware through the following:  Shared C&C server, 86[.]105[.]18[.]107 Shared technique of decrypting JSON files, and similarity between the file structures of AnubisSpy and Sphinx’s malware Similar targets (highly concentrated in Middle Eastern countries)     Figure 2: Comparison of file structure in Sphinx’s desktop/PC-targeting malware (left) and AnubisSpy (right)'}
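
The 0.72 : 0.08 : 0.2 ratio follows from applying a 20% test split and then a 10% validation split to what remains. The block below is a small arithmetic sketch of that two-stage split. It assumes the held-out share is rounded up to a whole row, which reproduces the 342 / 38 / 96 counts reported in the `Map` progress output of this notebook; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_rows, test_pct=20, val_pct=10):
    """Return (train, validation, test) sizes for a two-stage split.

    Assumption: each held-out share is rounded up to a whole row.
    """
    n_test = (n_rows * test_pct + 99) // 100       # ceil(n_rows * test_pct / 100)
    n_remaining = n_rows - n_test
    n_val = (n_remaining * val_pct + 99) // 100    # ceil(n_remaining * val_pct / 100)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


total_rows = 342 + 38 + 96  # 476 rows, inferred from the Map progress counts
sizes = split_sizes(total_rows)
print(sizes)
print([round(size / total_rows, 3) for size in sizes])
```

With 476 rows the effective shares are close to, but not exactly, 0.72 / 0.08 / 0.2 because of the rounding.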

In [7]:
# 1. Load the tokenizer
tokenizer_cls_origin = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

In [8]:
def tokenize_function(examples):
    return tokenizer_cls_origin(examples['text'], padding="max_length", truncation=True)

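# Tokenize the dataset dataset_dict_for_cls as a preprocessing step
tokenized_datasets = dataset_dict_for_cls.map(tokenize_function, batched=True)
# data_collator pads each batch dynamically (every example is already padded to max_length above, so it has little left to do here)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer_cls_origin, return_tensors="pt")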


Map:   0%|          | 0/342 [00:00<?, ? examples/s]

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

Map:   0%|          | 0/96 [00:00<?, ? examples/s]

In [9]:
# Collect the labels that appear in the train and validation sets
unique_labels_train = set(dataset_dict_for_cls['train']['label'])
unique_labels_validation = set(dataset_dict_for_cls['validation']['label'])
all_unique_labels = unique_labels_train.union(unique_labels_validation)

# Build the label <-> id mappings (str -> int and int -> str)
label2id = {label: idx for idx, label in enumerate(all_unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print("Label to ID mapping:", label2id)

Label to ID mapping: {'location': 0, 'SOFTWARE': 1, 'Infrastucture': 2, 'malware': 3, 'hash': 4, 'tools': 5, 'FILEPATH': 6, 'vulnerability': 7, 'URL': 8, 'url': 9, 'IPV4': 10, 'attack-pattern': 11, 'TIME': 12, 'campaign': 13, 'identity': 14, 'threat-actor': 15}
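
Iterating over a Python `set` of strings can yield a different order in different sessions, so the ids in the mapping above may change between runs. The block below is a minimal sketch of a reproducible variant that sorts the labels first. The helper name `build_label_maps` is hypothetical and the sample labels are a small subset taken from the printed mapping. Note that `URL` and `url` stay separate classes, as they do in the notebook.

```python
def build_label_maps(labels):
    """Build label2id / id2label from an iterable of string labels, in sorted order."""
    unique_labels = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique_labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Small sample of labels taken from the mapping printed above.
label2id_sorted, id2label_sorted = build_label_maps(["malware", "URL", "url", "malware", "hash"])
print(label2id_sorted)
```

Uppercase letters sort before lowercase ones, so `URL` receives the lowest id in this sample.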


In [10]:
def label_to_id(example):
    example['label'] = label2id[example['label']]
    return example

# map string labels to integer ids for the train and validation data
tokenized_datasets['train'] = tokenized_datasets['train'].map(label_to_id, batched=False)
tokenized_datasets['validation'] = tokenized_datasets['validation'].map(label_to_id, batched=False)

# remove columns
train_dataset = tokenized_datasets['train'].remove_columns([col for col in tokenized_datasets['train'].column_names if col not in ["input_ids", "attention_mask", "label"]])
eval_dataset = tokenized_datasets['validation'].remove_columns([col for col in tokenized_datasets['validation'].column_names if col not in ["input_ids", "attention_mask", "label"]])

# set_format
train_dataset.set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
eval_dataset.set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])

Map:   0%|          | 0/342 [00:00<?, ? examples/s]

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

In [11]:
# Load the model
model_cls_origin = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=len(all_unique_labels), id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True)


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([16]) in the model instantiated
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([16, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# evaluate: accuracy metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]
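
`compute_metrics` takes the highest-scoring class per row of logits and compares it with the reference label. The block below is a dependency-free sketch of the same calculation on toy values; the logits, labels, and helper names are hypothetical and only illustrate the arithmetic.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(pred == ref) for pred, ref in zip(predictions, labels))
    return correct / len(labels)


# Toy logits for 4 examples and 3 classes (hypothetical values).
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 1, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```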

In [13]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results_classification",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=12,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [14]:
# Initialize the trainer
trainer = Trainer(
    model=model_cls_origin.to('cuda'),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer_cls_origin,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
results = trainer.evaluate()
print(results)
trainer.save_model("./finetuned_with_CyberThreat_classification_model")
tokenizer_cls_origin.save_pretrained("./finetuned_with_CyberThreat_classification_model")

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.47665,0.236842
2,No log,2.33111,0.263158
3,No log,2.216546,0.289474
4,No log,2.115495,0.368421
5,No log,2.050759,0.368421
6,No log,1.992832,0.368421
7,No log,1.954135,0.342105
8,No log,1.942628,0.368421
9,No log,1.903593,0.394737
10,No log,1.894984,0.394737


{'eval_loss': 1.8854610919952393, 'eval_accuracy': 0.39473684210526316, 'eval_runtime': 0.6735, 'eval_samples_per_second': 56.418, 'eval_steps_per_second': 4.454, 'epoch': 12.0}


('./finetuned_with_CyberThreat_classification_model/tokenizer_config.json',
 './finetuned_with_CyberThreat_classification_model/special_tokens_map.json',
 './finetuned_with_CyberThreat_classification_model/vocab.txt',
 './finetuned_with_CyberThreat_classification_model/added_tokens.json')

#### **4-1-1. Text classification**

In [15]:
# 1. Load the model and tokenizer that were just fine-tuned
finetuned_model_checkpoint = './finetuned_with_CyberThreat_classification_model'
finetuned_tokenizer_checkpoint = './finetuned_with_CyberThreat_classification_model'

finetuned_model = AutoModelForSequenceClassification.from_pretrained(finetuned_model_checkpoint)
finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_tokenizer_checkpoint)

# 2. text-classification pipeline for the fine-tuned model
finetuned_classifier = pipeline('text-classification', model=finetuned_model, tokenizer=finetuned_tokenizer)

# 3. Load the original model and tokenizer.
# Note: this SST-2 checkpoint only predicts POSITIVE/NEGATIVE, so its output can never
# match the threat labels and its accuracy below is 0 by construction.
origin_model_checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
origin_model = AutoModelForSequenceClassification.from_pretrained(origin_model_checkpoint)
origin_tokenizer = AutoTokenizer.from_pretrained(origin_model_checkpoint)

# 4. text-classification pipeline for the original model
origin_classifier = pipeline('text-classification', model=origin_model, tokenizer=origin_tokenizer)

# 5. Run predictions
test_texts = dataset_dict_for_cls['test']['text']

finetuned_predictions = finetuned_classifier(test_texts)
finetuned_predicted_labels = [pred['label'] for pred in finetuned_predictions]

origin_predictions = origin_classifier(test_texts)
origin_predicted_labels = [pred['label'] for pred in origin_predictions]

# 6. Compute accuracy
true_labels = dataset_dict_for_cls['test']['label']
finetuned_accuracy = sum(pred_label == true_label for pred_label, true_label in zip(finetuned_predicted_labels, true_labels)) / len(true_labels)
origin_accuracy = sum(pred_label == true_label for pred_label, true_label in zip(origin_predicted_labels, true_labels)) / len(true_labels)

print("Fine-tuned model accuracy:", finetuned_accuracy)
print("Origin DistilBERT model accuracy:", origin_accuracy)

Fine-tuned model accuracy: 0.4375
Origin DistilBERT model accuracy: 0.0
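
The 0.0 for the original checkpoint is expected, since a sentiment model emits `POSITIVE` or `NEGATIVE` and never a threat label. A more informative reference point for the 0.4375 figure could be a majority-class baseline. The block below is a minimal sketch of that baseline on toy labels; the helper name and the label lists are hypothetical, and no baseline number for the real split is claimed here.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)


# Toy label lists (hypothetical); in the notebook these would come from
# dataset_dict_for_cls['train']['label'] and dataset_dict_for_cls['test']['label'].
toy_train = ["malware", "malware", "threat-actor", "malware", "tools"]
toy_test = ["malware", "tools", "malware", "identity"]
baseline_label, baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_acc)
```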


#### **4-1-2. Zero-shot classification**

In [18]:
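# 1. Load the zero-shot-classification pipeline. Zero-shot usually performs poorly with general language models,
#    so a widely downloaded checkpoint from the Hugging Face Hub is used for comparison
zero_shot_classifier = pipeline("zero-shot-classification", model="cross-encoder/nli-roberta-base")

# 2. Load the model fine-tuned above.
# Note: the zero-shot pipeline expects an NLI model with an 'entailment' label; this classifier has none,
# which is what triggers the "Failed to determine 'entailment' label id" warning in the output.
model_zero_shot = AutoModelForSequenceClassification.from_pretrained('./finetuned_with_CyberThreat_classification_model')
tokenizer_zero_shot = AutoTokenizer.from_pretrained('./finetuned_with_CyberThreat_classification_model')
zero_shot_finetuned_classifier = pipeline('zero-shot-classification', model=model_zero_shot, tokenizer=tokenizer_zero_shot)

# 3. Take test samples and build the candidate labels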
sample_size = 50
test_texts = dataset_dict_for_cls['test']['text'][:sample_size]
true_labels = dataset_dict_for_cls['test']['label'][:sample_size]
candidate_labels = list(set(true_labels))

# 4. Zero-shot classification
predictions1 = [pred['labels'][0] for pred in zero_shot_classifier(test_texts, candidate_labels=candidate_labels, multi_label=False)]
finetuned_predictions = [pred['labels'][0] for pred in zero_shot_finetuned_classifier(test_texts, candidate_labels=candidate_labels, multi_label=False)]

# 5. Compute accuracy
def calculate_accuracy(predictions, true_labels):
    return sum(pred == true for pred, true in zip(predictions, true_labels)) / len(true_labels)

accuracy = calculate_accuracy(predictions1, true_labels)
finetuned_accuracy = calculate_accuracy(finetuned_predictions, true_labels)

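# 6. Print the results (the "BART" tag below is a misnomer: the checkpoint loaded in step 1 is cross-encoder/nli-roberta-base)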
print(f"Zero-shot Model (BART): Accuracy = {accuracy:.4f}")
print(f"Finetuned Model: Accuracy = {finetuned_accuracy:.4f}")
print("\nSample Predictions:")
for i in range(min(5, len(test_texts))):
    print(f"Text: {test_texts[i]}")
    print(f"Predicted by Zero-shot Model: {predictions1[i]}, True Label: {true_labels[i]}")
    print(f"Predicted by Finetuned Model: {finetuned_predictions[i]}, True Label: {true_labels[i]}")
    print()



Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


Zero-shot Model (BART): Accuracy = 0.3400
Finetuned Model: Accuracy = 0.0200

Sample Predictions:
Text: , we were able to observe another GitHub account with the name l4ckyguy, sharing the profile picture, location and URL in the description, with a link to the previously observed account (x4kme), and a name, Ivan Topor, which we believe may be another alias for this threat actor.
Predicted by Zero-shot Model: url, True Label: SOFTWARE
Predicted by Finetuned Model: identity, True Label: SOFTWARE

Text: BIOPASS RAT Loader  Backdoor.Win64.BIOPASS.A  bf4f50979b7b29f2b6d192630b8d7b76adb9cb65157a1c70924a47bf519c4edd  test.exe
Predicted by Zero-shot Model: hash, True Label: malware
Predicted by Finetuned Model: tools, True Label: malware

Text: Examining the Capesand samples The simplified diagram taken from the previous blog shows the combination of ConfuserEx and Cassandra via the second layer of obfuscation protection, which involves the DLL CyaX_Sharp Assembly (both CyaX_Sharp and CyaX a
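
The zero-shot pipeline works by turning each candidate label into an NLI hypothesis and picking the label whose hypothesis the model finds most entailed by the text. The block below sketches that selection step without loading a model. The template string is the pipeline's default as far as I know, while the scores and helper names are hypothetical stand-ins for real model output.

```python
def build_hypotheses(candidate_labels, template="This example is {}."):
    """Create one NLI hypothesis per candidate label."""
    return {label: template.format(label) for label in candidate_labels}


def pick_label(entailment_scores):
    """Return the label whose hypothesis has the highest entailment score."""
    return max(entailment_scores, key=entailment_scores.get)


hypotheses = build_hypotheses(["malware", "hash", "url"])
# Hypothetical entailment scores an NLI model might assign for one input text.
toy_scores = {"malware": 0.21, "hash": 0.64, "url": 0.15}
predicted = pick_label(toy_scores)
print(hypotheses)
print(predicted)
```

This also suggests why a classifier without an entailment class scores poorly in this pipeline: there is no meaningful entailment score to rank the labels by.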

#### **4-2. Security incident diagnosis and solution QA: based on the diagnosis and solutions fields, provide remedies and responses for the threat encountered**

In [None]:
# Define the model and tokenizer
data_collator = DefaultDataCollator()
model_checkpoint_qa = "deepset/roberta-base-squad2"  # a checkpoint suited to extractive QA
tokenizer_qa = AutoTokenizer.from_pretrained(model_checkpoint_qa)
model_qa = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_qa)




tokenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

In [None]:
dataset_for_qa = Dataset.from_pandas(selected_data_for_qa)
train_test_split_qa = dataset_for_qa.train_test_split(test_size=0.2)
train_val_split_qa = train_test_split_qa['train'].train_test_split(test_size=0.1)  # hold out 10% of the training data as the validation set

# dataset_dict_for_qa is likewise split into train, validation, and test sets
dataset_dict_for_qa = DatasetDict({
    'train': train_val_split_qa['train'],
    'validation': train_val_split_qa['test'],
    'test': train_test_split_qa['test']
})
# Inspect the first example of the train dataset
dataset_dict_for_qa['train'][0]

{'label_1': 'vulnerability',
 'text': 'Affected Software and Versions Background on the Spring Framework Root Cause Analysis for CVE-2022-22965',
 'diagnosis': 'Diagnosis: Vulnerability in software (Spring Framework)  Entity: Affected software version  Relationship: The affected software version has a vulnerability (CVE-2022-22965)',
 'solutions': '1. Regularly update the software to the latest version to ensure that any known vulnerabilities are patched.  2. Implement a vulnerability scanning tool to identify any potential vulnerabilities in the software and take necessary actions to mitigate the risks.  3. Conduct regular security audits to identify any security gaps in the software and implement appropriate measures to mitigate the risks.  4. Use intrusion detection and prevention systems to detect and prevent any attempts to exploit the vulnerabilities in the software.  5. Implement a robust access control mechanism to restrict access'}

In [13]:
# Define preprocess_function
def preprocess_function(examples):

    # 1. The original dataset is not a QA dataset, so questions have to be added
    questions = [
        "What is the cybersecurity threat label associated with this text?",
        "What is the cybersecurity diagnosis for the given text?",
        "What cybersecurity solutions are proposed for the given text?"
    ]

    # 2. Collect all possible labels so the QA model knows which labels it can answer with
    all_labels = list(set(examples['label_1']))
    labels_list_str = ", ".join(all_labels)

    new_examples = {
        "questions": [],
        "contexts": [],
        "answers": []
    }

    for i in range(len(examples['text'])):
        # 3. Build the contexts, adding extra clues to each context
        label_context = examples['text'][i] + f" Possible labels are: {labels_list_str}."
        diagnosis_context = label_context + f" The threat label identified is {examples['label_1'][i]}."
        solutions_context = diagnosis_context + f" The diagnosis is {examples['diagnosis'][i]}."

        new_examples['questions'].extend(questions)
        new_examples['contexts'].extend([label_context, diagnosis_context, solutions_context])
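        # NOTE: answer_start is hard-coded to 0, so each labelled span covers the first
        # len(answer) characters of the context rather than the answer's real position.
        new_examples['answers'].extend([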
            {'answer_start': [0], 'text': [examples['label_1'][i]]},
            {'answer_start': [0], 'text': [examples['diagnosis'][i]]},
            {'answer_start': [0], 'text': [examples['solutions'][i]]}
        ])

    inputs = tokenizer_qa(
        new_examples['questions'],
        new_examples['contexts'],
        max_length=512,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length"
    )

    offset_mapping = inputs.pop("offset_mapping")
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = new_examples['answers'][i]
        if answer['text'][0]:  # compute positions only when the answer is non-empty
            start_char = answer['answer_start'][0]
            end_char = start_char + len(answer['text'][0])
            sequence_ids = inputs.sequence_ids(i)

            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1

            if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)
        else:
            start_positions.append(0)
            end_positions.append(0)

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions

    # for testing: print a few examples to confirm the added prefixes and suffixes
    for i in range(3):
        print(f"Example {i}")
        print(f"Question: {new_examples['questions'][i]}")
        print(f"Context: {new_examples['contexts'][i][:300]}...")  # print only part of the context
        print(f"Answer Start: {start_positions[i]}, End: {end_positions[i]}")
        print(f"Answer Text: {new_examples['answers'][i]['text'][0]}\n")

    return inputs
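
Extractive QA training needs the character position of the answer inside the context. The block below is a minimal sketch of locating that position with `str.find` and mapping it to token indices, as an alternative to a fixed start of 0. The helper names are hypothetical, and whitespace tokens stand in for the real tokenizer's `offset_mapping`. When the answer does not occur in the context, which appears to be the case for the diagnosis and solutions questions as the contexts are built here, `locate_answer` returns `None` and the example would be treated as unanswerable.

```python
def locate_answer(context, answer):
    """Return (start_char, end_char) of answer in context, or None when it is absent."""
    start_char = context.find(answer)
    if start_char == -1:
        return None
    return start_char, start_char + len(answer)


def whitespace_offsets(text):
    """Stand-in for tokenizer offsets: (start, end) character span of each whitespace token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (first_token, last_token) indices."""
    token_start = next(i for i, (_, end) in enumerate(offsets) if end > start_char)
    token_end = max(i for i, (start, _) in enumerate(offsets) if start < end_char)
    return token_start, token_end


context = "Root cause analysis. Possible labels are: malware, vulnerability."
answer = "vulnerability"
span = locate_answer(context, answer)
token_span = char_span_to_token_span(whitespace_offsets(context), *span)
print(span, token_span)
```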

In [14]:
train_dataset = dataset_dict_for_qa['train'].map(preprocess_function, batched=True, remove_columns=dataset_dict_for_qa['train'].column_names)
eval_dataset = dataset_dict_for_qa['validation'].map(preprocess_function, batched=True, remove_columns=dataset_dict_for_qa['validation'].column_names)

Map:   0%|          | 0/342 [00:00<?, ? examples/s]

Example 0
Question: What is the cybersecurity threat label associated with this text?
Context: Affected Software and Versions Background on the Spring Framework Root Cause Analysis for CVE-2022-22965 Possible labels are: url, REGISTRYKEY, attack-pattern, tools, FILEPATH, vulnerability, malware, TIME, campaign, Infrastucture, IPV4, location, threat-actor, identity, SOFTWARE, hash, URL....
Answer Start: 14, End: 17
Answer Text: vulnerability

Example 1
Question: What is the cybersecurity diagnosis for the given text?
Context: Affected Software and Versions Background on the Spring Framework Root Cause Analysis for CVE-2022-22965 Possible labels are: url, REGISTRYKEY, attack-pattern, tools, FILEPATH, vulnerability, malware, TIME, campaign, Infrastucture, IPV4, location, threat-actor, identity, SOFTWARE, hash, URL. The thr...
Answer Start: 13, End: 54
Answer Text: Diagnosis: Vulnerability in software (Spring Framework)  Entity: Affected software version  Relationship: The affected software

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

Example 0
Question: What is the cybersecurity threat label associated with this text?
Context: However, for the first time, TAG has observed COLDRIVER campaigns targeting the military of multiple Eastern European countries, as well as a NATO Centre of Excellence. Possible labels are: url, FILEPATH, attack-pattern, tools, vulnerability, malware, campaign, threat-actor, location, identity, SOFT...
Answer Start: 14, End: 15
Answer Text: campaign

Example 1
Question: What is the cybersecurity diagnosis for the given text?
Context: However, for the first time, TAG has observed COLDRIVER campaigns targeting the military of multiple Eastern European countries, as well as a NATO Centre of Excellence. Possible labels are: url, FILEPATH, attack-pattern, tools, vulnerability, malware, campaign, threat-actor, location, identity, SOFT...
Answer Start: 13, End: 72
Answer Text: Diagnosis: The cybersecurity issue is a COLDRIVER campaign targeting the military of multiple Eastern European countries and

#### **4-2-1. Question-answering**

In [15]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results_qa',
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    weight_decay=0.01,
    no_cuda=False
)
model_qa = model_qa.to('cuda')
trainer_qa = Trainer(
    model=model_qa,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer_qa.train()

results = trainer_qa.evaluate()
print(results)
trainer_qa.save_model("./finetuned_with_CyberThreat_qa_model")
tokenizer_qa.save_pretrained("./finetuned_with_CyberThreat_qa_model")



Epoch,Training Loss,Validation Loss
1,No log,1.345927
2,No log,1.215255
3,No log,1.205874
4,No log,1.328115
5,No log,1.371892
6,No log,1.379601
7,No log,1.468729
8,1.265400,1.619096
9,1.265400,1.621985
10,1.265400,1.672276


{'eval_loss': 1.6722763776779175, 'eval_runtime': 3.5208, 'eval_samples_per_second': 32.379, 'eval_steps_per_second': 0.568, 'epoch': 10.0}


('./finetuned_with_CyberThreat_qa_model/tokenizer_config.json',
 './finetuned_with_CyberThreat_qa_model/special_tokens_map.json',
 './finetuned_with_CyberThreat_qa_model/vocab.json',
 './finetuned_with_CyberThreat_qa_model/merges.txt',
 './finetuned_with_CyberThreat_qa_model/added_tokens.json',
 './finetuned_with_CyberThreat_qa_model/tokenizer.json')
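
In the log above, validation loss bottoms out early and then climbs while training loss keeps falling, a pattern that usually points to overfitting. The block below simply finds the best epoch from the logged values; the helper name `best_epoch` is hypothetical. Unlike the classification run, these QA training arguments do not set `load_best_model_at_end=True`, so the saved checkpoint is presumably the one from the final epoch.

```python
# Validation losses copied from the QA training log above (epochs 1 to 10).
val_losses = [1.345927, 1.215255, 1.205874, 1.328115, 1.371892,
              1.379601, 1.468729, 1.619096, 1.621985, 1.672276]


def best_epoch(losses):
    """Return the 1-based epoch with the lowest loss, together with that loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index + 1, losses[best_index]


epoch, loss = best_epoch(val_losses)
print(f"Lowest validation loss {loss:.6f} at epoch {epoch}")
```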

#### **4-2-2. Text-generation**

In [18]:
# Define the model and tokenizer: GPT-2 is used because it performs relatively well at text generation
tokenizer_text_gen = GPT2Tokenizer.from_pretrained('gpt2')
model_text_gen = GPT2LMHeadModel.from_pretrained('gpt2')

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [19]:
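# Training arguments
training_args = TrainingArguments(
    output_dir='./results_text_gen',
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    weight_decay=0.01,
    no_cuda=False
)
# NOTE: this line assigns model_qa (the RoBERTa QA model), not the GPT-2 model loaded above,
# so the run below keeps fine-tuning the QA model on the QA-formatted dataset.
# Training GPT-2 would instead need a causal-LM dataset that provides `labels`.
model_text_gen = model_qa.to('cuda')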
trainer_text_gen = Trainer(
    model=model_text_gen,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer_text_gen.train()

results = trainer_text_gen.evaluate()
print(results)
trainer_text_gen.save_model("./finetuned_with_CyberThreat_text_gen_model")
tokenizer_text_gen.save_pretrained("./finetuned_with_CyberThreat_text_gen_model")



Epoch,Training Loss,Validation Loss
1,No log,1.419433
2,No log,1.765685
3,No log,2.067802
4,No log,2.195814
5,No log,2.383987
6,No log,2.422929
7,No log,2.454837
8,0.507800,2.526959
9,0.507800,2.547204
10,0.507800,2.557529


{'eval_loss': 2.5575287342071533, 'eval_runtime': 3.5076, 'eval_samples_per_second': 32.501, 'eval_steps_per_second': 0.57, 'epoch': 10.0}


('./finetuned_with_CyberThreat_text_gen_model/tokenizer_config.json',
 './finetuned_with_CyberThreat_text_gen_model/special_tokens_map.json',
 './finetuned_with_CyberThreat_text_gen_model/vocab.json',
 './finetuned_with_CyberThreat_text_gen_model/merges.txt',
 './finetuned_with_CyberThreat_text_gen_model/added_tokens.json')
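
Fine-tuning GPT-2 for this task would need each row rendered as a single training string, which the QA-formatted dataset used above does not provide. The block below is a rough sketch of one way to build such strings. The prompt layout, the helper name `build_lm_text`, and the toy row are all assumptions and not the notebook's method; only the `<|endoftext|>` end-of-text token is GPT-2's own.

```python
def build_lm_text(row, eos_token="<|endoftext|>"):
    """Render one dataset row as prompt + target text for causal-LM fine-tuning."""
    prompt = (
        f"Threat description: {row['text'].strip()}\n"
        f"Threat label: {row['label_1']}\n"
        f"Diagnosis: {row['diagnosis'].strip()}\n"
        "Solutions:"
    )
    return f"{prompt} {row['solutions'].strip()}{eos_token}"


# Toy row with the same column names as selected_data_for_qa (values are hypothetical).
toy_row = {
    "label_1": "malware",
    "text": " A backdoor was observed on the host. ",
    "diagnosis": "The host is infected with a backdoor.",
    "solutions": "1. Isolate the host. 2. Reimage the system.",
}
lm_text = build_lm_text(toy_row)
print(lm_text)
```

The resulting strings would still have to be tokenized, with `labels` derived from `input_ids`, before a `Trainer` could compute a language-modeling loss on them.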

#### **4-2-3. Building a QA system with question-answering + text-generation**

In [23]:
print(dataset_dict_for_qa['test'][0])


{'label_1': 'malware', 'text': ' Three of the backdoors, NFlog, PoisonIvy, and NewCT have previously been publicly associated with DragonOK.', 'diagnosis': 'DragonOK is the threat actor responsible for the use of the backdoors NFlog, PoisonIvy, and NewCT, which have been associated with malware.', 'solutions': '1. Implementing network segmentation to prevent lateral movement of malware  2. Conducting regular vulnerability assessments and patching vulnerable systems  3. Deploying advanced endpoint protection solutions that can detect and prevent the use of backdoors  4. Implementing multi-factor authentication to prevent unauthorized access to sensitive systems  5. Conducting regular security awareness training for employees to prevent social engineering attacks  6. Implementing intrusion detection and prevention systems to detect and block malicious traffic  7. Conducting regular penetration testing to'}


In [29]:
model_checkpoint_qa_origin = "deepset/roberta-base-squad2"
tokenizer_qa_origin = AutoTokenizer.from_pretrained(model_checkpoint_qa_origin)
model_qa_origin = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_qa_origin)


tokenizer_qa = AutoTokenizer.from_pretrained("./finetuned_with_CyberThreat_qa_model")
model_qa = AutoModelForQuestionAnswering.from_pretrained("./finetuned_with_CyberThreat_qa_model")

model_text_gen = GPT2LMHeadModel.from_pretrained('./finetuned_with_CyberThreat_text_gen_model')
tokenizer_text_gen = GPT2Tokenizer.from_pretrained('./finetuned_with_CyberThreat_text_gen_model')

qa_original_pipeline = pipeline("question-answering", model=model_qa_origin, tokenizer=tokenizer_qa_origin)
qa_pipeline = pipeline("question-answering", model=model_qa, tokenizer=tokenizer_qa)
text_gen_pipeline = pipeline("text-generation", model=model_text_gen, tokenizer=tokenizer_text_gen)


all_labels = list(set(dataset_dict_for_qa['test']['label_1']))
labels_list_str = ", ".join(all_labels)

# Define the three questions (label, diagnosis, solutions)
questions = [
    "What is the cybersecurity threat label associated with this text?",
    "What is the cybersecurity diagnosis for the given text?",
    "What cybersecurity solutions are proposed for the given text?"
]
first_ten_examples = dataset_dict_for_qa['test'].select(range(10))


for example in first_ten_examples:
    # Build cascading contexts: each stage appends the previous stage's ground-truth answer
    label_context = example['text']
    diagnosis_context = label_context + f" The threat label identified is {example['label_1']}."
    solutions_context = diagnosis_context + f" The diagnosis is {example['diagnosis']}."
    contexts = [label_context, diagnosis_context, solutions_context]

    for i, (question, context) in enumerate(zip(questions, contexts)):
        answer_ori = None  # only the label question is also sent to the original QA model
        if i < 1:
            # Question 1 (label): extractive QA, fine-tuned model vs. original model
            result = qa_pipeline(question=question, context=context)
            answer = result['answer']
            result_ori = qa_original_pipeline(question=question, context=context)
            answer_ori = result_ori['answer']
        else:
            # Questions 2-3 (diagnosis, solutions): GPT-2 continues the context;
            # the question is not passed in, and generated_text includes the prompt
            result = text_gen_pipeline(context, max_length=400)
            answer = result[0]['generated_text']

        print(f"Question: {question}")
        print(f"Context: {context}")
        print(f"Answer from fine-tuned model: {answer}\n")
        if answer_ori:
            print(f"Answer from original model: {answer_ori}\n")
            correct_answer = example['label_1']
            print(f"Correct Answer: {correct_answer}\n")

You are using a model of type roberta to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at ./finetuned_with_CyberThreat_text_gen_model and are newly initialized: ['h.0.attn.c_attn.bias', 'h.0.attn.c_attn.weight', 'h.0.attn.c_proj.bias', 'h.0.attn.c_proj.weight', 'h.0.ln_1.bias', 'h.0.ln_1.weight', 'h.0.ln_2.bias', 'h.0.ln_2.weight', 'h.0.mlp.c_fc.bias', 'h.0.mlp.c_fc.weight', 'h.0.mlp.c_proj.bias', 'h.0.mlp.c_proj.weight', 'h.1.attn.c_attn.bias', 'h.1.attn.c_attn.weight', 'h.1.attn.c_proj.bias', 'h.1.attn.c_proj.weight', 'h.1.ln_1.bias', 'h.1.ln_1.weight', 'h.1.ln_2.bias', 'h.1.ln_2.weight', 'h.1.mlp.c_fc.bias', 'h.1.mlp.c_fc.weight', 'h.1.mlp.c_proj.bias', 'h.1.mlp.c_proj.weight', 'h.10.attn.c_attn.bias', 'h.10.attn.c_attn.weight', 'h.10.attn.c_proj.bias', 'h.10.attn.c_proj.weight', 'h.10.ln_1.bias', 'h.10.ln_1.weight', 'h.10.ln_2.bias', 'h.10.
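
The warning above reports that the checkpoint directory holds a config of type `roberta` while `GPT2LMHeadModel` expects `gpt2`, and that the GPT-2 weights were newly initialized. Randomly initialized weights would be consistent with degenerate generations. One possible cause is that a RoBERTa model's files were written into this directory. This is an assumption, and the notebook does not confirm it. The sketch below shows a minimal way to check a checkpoint's `model_type` before loading it. `checkpoint_model_type` is a hypothetical helper, and the demo writes a stand-in `config.json` to a temporary directory instead of reading the real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def checkpoint_model_type(checkpoint_dir):
    """Read the `model_type` field from a checkpoint's config.json (None if absent)."""
    config_path = Path(checkpoint_dir) / "config.json"
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("model_type")


# Demo on a stand-in directory (hypothetical contents, not the real checkpoint).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"model_type": "roberta"}), encoding="utf-8"
    )
    found = checkpoint_model_type(tmp)

expected = "gpt2"
if found != expected:
    print(f"config.json says '{found}', expected '{expected}': do not load as GPT-2")
```

Running the same check against `./finetuned_with_CyberThreat_text_gen_model` before calling `GPT2LMHeadModel.from_pretrained` would surface the mismatch early, before any text is generated.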

Question: What is the cybersecurity threat label associated with this text?
Context:  Three of the backdoors, NFlog, PoisonIvy, and NewCT have previously been publicly associated with DragonOK.
Answer from fine-tuned model: Three

Answer from original model: DragonOK

Correct Answer: malware

Question: What is the cybersecurity diagnosis for the given text?
Context:  Three of the backdoors, NFlog, PoisonIvy, and NewCT have previously been publicly associated with DragonOK. The threat label identified is malware.
Answer from fine-tuned model:  Three of the backdoors, NFlog, PoisonIvy, and NewCT have previously been publicly associated with DragonOK. The threat label identified is malware.578578578578578 decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease decrease 