<a href="https://colab.research.google.com/github/jhen-fang/P_Project-24/blob/main/P_Project_24_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Project Title: Text Classification and Question Answering on a Cyber Threat Dataset**

- **Name: 蔡甄芳**
- **Department and year: Management Information Systems, Year 4, Class B**
- **Student ID: 109306056**
- **GitHub code:**
- **Colab link: https://colab.research.google.com/drive/13K60edQIt0uHyqfikX1O7napKgZV93lS?usp=sharing**

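## **2. Dataset Overview**

##### (1) **Dataset name: Cyber Threat Dataset: Network, Text & Relation**

##### (2) **Dataset source: https://www.kaggle.com/datasets/ramoliyafenil/text-based-cyber-threat-detection**

##### (3) **Dataset summary: The dataset contains network traffic data, textual content, and entity relationships, among others. It can be used to detect, diagnose, and mitigate cyber threats.**

##### **(4) Dataset fields:**
    - id: the identifier of each instance in the dataset.
    - text: textual content transmitted over the network, such as emails, messages, or network traffic payloads, including descriptions of potential cyber threats.
    - Entries: a JSON list whose items contain the following fields
        - sender_id
        - label: the identified cyber threat or attack pattern
        - start_offset
        - end_offset
        - receive_ids
    - relations: tuples that represent entity relationships, each containing a pair of entity IDs (source and target).
    - diagnosis: a description and diagnosis of the identified cyber threat, offering insight into it.
    - solutions: a description of solutions or mitigation strategies for the cyber threat.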
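
As a rough illustration of how the offset fields relate to `text`, the sketch below slices entity mentions out of a made-up record and pairs them up through a relation. The record, its values, and the key name `entities` are assumptions for illustration (`from_id` and `to_id` appear in the `relations` column printed later in the notebook); it is not an actual row from the dataset.

```python
# Hypothetical record that mimics the field layout described above.
record = {
    "id": 1,
    "text": "The Emotet malware contacted example[.]com over HTTP.",
    "entities": [
        {"id": 10, "label": "malware", "start_offset": 4, "end_offset": 10},
        {"id": 11, "label": "url", "start_offset": 29, "end_offset": 42},
    ],
    "relations": [{"id": 100, "from_id": 10, "to_id": 11}],
}

def mention(rec, entity):
    """Return the substring of rec['text'] covered by the entity offsets."""
    return rec["text"][entity["start_offset"]:entity["end_offset"]]

# Index entities by id so each relation can be resolved to its two mentions.
by_id = {e["id"]: e for e in record["entities"]}
pairs = [
    (mention(record, by_id[r["from_id"]]), mention(record, by_id[r["to_id"]]))
    for r in record["relations"]
]
print(pairs)
```

The sketch treats `end_offset` as exclusive, which is consistent with the first row shown later, where offsets 2 to 16 cover the word "cybersquatting".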

---


*F. Ramoliya, R. Kakkar, R. Gupta, S. Tanwar and S. Agrawal, "SEAM: Deep Learning-based Secure Message Exchange Framework For Autonomous EVs," 2023 IEEE Globecom Workshops (GC Wkshps), Kuala Lumpur, Malaysia, 2023, pp. 80-85, doi: 10.1109/GCWkshps58843.2023.10465168.*

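## **3. Project Goals**

#### **(1) Cyber threat detection: classify the cyber threat (Entries: label) from the text (textual content and network traffic data)**

    - Pipeline:
        - text classification
        - zero-shot-classification
    - Goal: use the threat description (text) to predict the attack pattern (label)

#### **(2) Security incident diagnosis and solution QA: given a cyber threat, provide a remedy and response based on the diagnosis and solutions fields**

    - Pipeline: question-answering

    - Goals:
        - Through question answering, take a threat description (text) and identify the cyber threat or attack pattern (label)
        - Through question answering, take an attack pattern (label) and identify the diagnostic insight (diagnosis)
        - Through question answering, take a diagnostic insight (diagnosis) and return solutions (solutions)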


## **4. Implementation: Downstream Tasks**

In [1]:
!pip install transformers pandas numpy matplotlib seaborn kaggle datasets evaluate transformers[torch]




In [2]:
import pandas as pd
import numpy as np
import torch
import evaluate
import os
from google.colab import userdata
from datasets import Dataset, DatasetDict, load_metric
from torch import tensor
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    pipeline,
    AutoModelForQuestionAnswering,
    DefaultDataCollator,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    GPT2Tokenizer,      # used by the text-generation cells in section 4-2
    GPT2LMHeadModel,
    )

In [3]:
# Download the dataset
api_key = userdata.get('kaggle_key')
username = userdata.get('kaggle_username')

os.environ['KAGGLE_USERNAME'] = username
os.environ['KAGGLE_KEY'] = api_key

!kaggle datasets download -d ramoliyafenil/text-based-cyber-threat-detection

# Unzip the dataset
!unzip text-based-cyber-threat-detection.zip

Dataset URL: https://www.kaggle.com/datasets/ramoliyafenil/text-based-cyber-threat-detection
License(s): Apache 2.0
text-based-cyber-threat-detection.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  text-based-cyber-threat-detection.zip
replace Cyber-Threat-Intelligence-Custom-Data_new_processed.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Cyber-Threat-Intelligence-Custom-Data_new_processed.csv  
replace all.jsonl? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: all.jsonl               
replace cyber-threat-intelligence-splited_test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: cyber-threat-intelligence-splited_test.csv  
replace cyber-threat-intelligence-splited_train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: cyber-threat-intelligence-splited_train.csv  
replace cyber-threat-intelligence-splited_validate.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: cyber-threat-intelligence-split

In [4]:
custom_data_new_processed = pd.read_csv('/content/Cyber-Threat-Intelligence-Custom-Data_new_processed.csv')
custom_data_new_processed.head()

Unnamed: 0,id,text,relations,diagnosis,solutions,id_1,label_1,start_offset_1,end_offset_1,id_2,label_2,start_offset_2,end_offset_2,id_3,label_3,start_offset_3,end_offset_3
0,249,A cybersquatting domain save-russia[.]today is...,"[{'from_id': 44658, 'id': 9, 'to_id': 44659, '...",The diagnosis is a cyber attack that involves ...,1. Implementing DNS filtering to block access ...,44656,attack-pattern,2,16,44657,url,24,43,44658.0,attack-pattern,57.0,68.0
1,14309,"Like the Android Maikspy, it first sends a not...","[{'from_id': 48531, 'id': 445, 'to_id': 48532,...",The diagnosis is that the entity identified as...,1. Implementing a robust anti-malware software...,48530,SOFTWARE,9,17,48531,malware,17,24,48532.0,Infrastucture,63.0,73.0
2,13996,While analyzing the technical details of this ...,"[{'from_id': 48781, 'id': 461, 'to_id': 48782,...",Diagnosis: APT37/Reaper/Group 123 is responsib...,1. Implementing advanced threat detection tech...,48781,threat-actor,188,194,48782,threat-actor,210,217,48783.0,threat-actor,220.0,229.0
3,13600,(Note that Flash has been declared end-of-life...,"[{'from_id': 51688, 'id': 1133, 'to_id': 51689...",The diagnosis is a malware infection. The enti...,1. Implementing a robust antivirus software th...,51687,TIME,62,79,51688,malware,207,215,51689.0,malware,247.0,258.0
4,14364,Figure 21. Connection of Maikspy variants to 1...,"[{'from_id': 51780, 'id': 1161, 'to_id': 44372...",The diagnosis is that Maikspy malware variants...,1. Implementing a robust firewall system that ...,51779,URL,163,191,51777,URL,70,93,51781.0,malware,120.0,127.0


In [5]:
# 1. Select label_1 and text as the columns for text classification
selected_data_for_cls = custom_data_new_processed[['label_1', 'text']].rename(columns={'label_1': 'label'})
# 2. Select label_1, text, diagnosis, and solutions as the columns for QA and text generation
selected_data_for_qa = custom_data_new_processed[['label_1','text', 'diagnosis','solutions']]


#### **4-1. Cyber threat detection: classify the cyber threat (Entries: label) from the text (textual content and network traffic data)**

In [6]:
dataset_for_cls = Dataset.from_pandas(selected_data_for_cls)
train_test_split_cls = dataset_for_cls.train_test_split(test_size=0.2)
train_val_split_cls = train_test_split_cls['train'].train_test_split(test_size=0.1)  # e.g. hold out 10% of the training data as the validation set

# Split the classification (cls) dataset into train : validation : test = 0.72 : 0.08 : 0.2
dataset_dict_for_cls = DatasetDict({
    'train': train_val_split_cls['train'],
    'validation': train_val_split_cls['test'],
    'test': train_test_split_cls['test']
})

# Inspect the first example of the train dataset
dataset_dict_for_cls['train'][0]

{'label': 'hash',
 'text': 'Trojan.W97M.CONFUCIUS.B  654c7021a4482da21e149ded58643b279ffbce66badf1a0a7fc3551acd607312  Trojan.W97M.CONFUCIUS.C  712172b5b1895bbfcced961a83baa448e26e93e301be407e6b9dc8cb6526277f  Trojan.Win32.DLOADR.TIOIBELQ'}
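
The 0.72 : 0.08 : 0.2 ratio follows from applying the two splits in sequence, and it matches the 342 / 38 / 96 example counts reported by the `Map` progress bars later in the notebook. The sketch below redoes that arithmetic with integers. It assumes the held-out share is rounded up, and the total of 476 rows is inferred from those three counts rather than printed anywhere.

```python
def split_sizes(n_rows, held_out_percent):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    held_out = -(-n_rows * held_out_percent // 100)  # integer ceiling division
    return n_rows - held_out, held_out

total_rows = 476                                      # inferred: 342 + 38 + 96
train_plus_val, test = split_sizes(total_rows, 20)    # first split: test_size=0.2
train, validation = split_sizes(train_plus_val, 10)   # second split: test_size=0.1
print(train, validation, test)
```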

In [7]:
# 1. Load the tokenizer
tokenizer_cls_origin = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [8]:
def tokenize_function(examples):
    return tokenizer_cls_origin(examples['text'], padding="max_length", truncation=True)

# Tokenize dataset_dict_for_cls as a preprocessing step
tokenized_datasets = dataset_dict_for_cls.map(tokenize_function, batched=True)
# data_collator pads each batch to a common length (a no-op here, since every example is already padded to max_length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer_cls_origin, return_tensors="pt")


Map:   0%|          | 0/342 [00:00<?, ? examples/s]

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

Map:   0%|          | 0/96 [00:00<?, ? examples/s]

In [9]:
# Collect the label values found in the train and validation sets
unique_labels_train = set(dataset_dict_for_cls['train']['label'])
unique_labels_validation = set(dataset_dict_for_cls['validation']['label'])
all_unique_labels = unique_labels_train.union(unique_labels_validation)

# Build the label <-> id mappings (str -> int and int -> str)
label2id = {label: idx for idx, label in enumerate(all_unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print("Label to ID mapping:", label2id)

Label to ID mapping: {'Infrastucture': 0, 'url': 1, 'malware': 2, 'SOFTWARE': 3, 'campaign': 4, 'attack-pattern': 5, 'vulnerability': 6, 'URL': 7, 'hash': 8, 'TIME': 9, 'REGISTRYKEY': 10, 'tools': 11, 'threat-actor': 12, 'location': 13, 'FILEPATH': 14, 'identity': 15}
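
The mapping printed above treats `url` and `URL` as separate classes, and because it is built from a `set` its ids can change between runs. A minimal sketch of one possible alternative follows, assuming that labels differing only in letter case should be merged and that test-set labels should be covered as well. The helper name `build_label_maps` and the sample labels are illustrative, not part of the notebook.

```python
def build_label_maps(*label_lists):
    """Build reproducible label<->id maps from any number of label lists."""
    # Lower-casing merges variants such as 'url' and 'URL' (an assumption).
    normalized = {label.lower() for labels in label_lists for label in labels}
    # Sorting fixes the id order regardless of set iteration order.
    label2id = {label: idx for idx, label in enumerate(sorted(normalized))}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label

train_labels = ["url", "malware", "URL", "hash"]
test_labels = ["TIME", "malware"]
label2id, id2label = build_label_maps(train_labels, test_labels)
print(label2id)
```

With this variant the dataset labels would also need to be lower-cased before lookup, for example `label2id[example['label'].lower()]`.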


In [10]:
def label_to_id(example):
    example['label'] = label2id[example['label']]
    return example

# Map string labels to integer ids for the train and validation sets
tokenized_datasets['train'] = tokenized_datasets['train'].map(label_to_id, batched=False)
tokenized_datasets['validation'] = tokenized_datasets['validation'].map(label_to_id, batched=False)

# Keep only the columns the model needs
train_dataset = tokenized_datasets['train'].remove_columns([col for col in tokenized_datasets['train'].column_names if col not in ["input_ids", "attention_mask", "label"]])
eval_dataset = tokenized_datasets['validation'].remove_columns([col for col in tokenized_datasets['validation'].column_names if col not in ["input_ids", "attention_mask", "label"]])

# Return the selected columns as PyTorch tensors
train_dataset.set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
eval_dataset.set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])

Map:   0%|          | 0/342 [00:00<?, ? examples/s]

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

In [14]:
# Load the model
model_cls_origin = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=len(all_unique_labels), id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([16]) in the model instantiated
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([16, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# Evaluation: accuracy metric
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


  metric = load_metric("accuracy")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
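
To make the metric concrete, here is a dependency-free sketch of what `compute_metrics` computes: take the arg-max of each row of logits and compare it with the reference label. The logits and labels below are made-up values, and `accuracy_from_logits` is a hypothetical helper name.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(pred == ref for pred, ref in zip(predictions, labels))
    return correct / len(labels)

logits = [
    [0.1, 2.3, -1.0],   # arg-max is class 1
    [1.5, 0.2, 0.3],    # arg-max is class 0
    [0.0, 0.1, 0.9],    # arg-max is class 2
    [0.4, 0.6, 0.5],    # arg-max is class 1
]
labels = [1, 0, 1, 1]
score = accuracy_from_logits(logits, labels)
print(score)
```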


In [16]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results_classification",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [17]:
# Initialize the trainer
trainer = Trainer(
    model=model_cls_origin.to('cuda'),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer_cls_origin,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
results = trainer.evaluate()
print(results)
trainer.save_model("./finetuned_with_CyberThreat_classification_model")
tokenizer_cls_origin.save_pretrained("./finetuned_with_CyberThreat_classification_model")

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.583827,0.210526
2,No log,2.450309,0.236842
3,No log,2.377282,0.289474
4,No log,2.311785,0.289474
5,No log,2.231215,0.289474
6,No log,2.182317,0.315789
7,No log,2.136753,0.342105
8,No log,2.098038,0.394737
9,No log,2.092253,0.394737
10,No log,2.041461,0.394737


{'eval_loss': 1.9167146682739258, 'eval_accuracy': 0.39473684210526316, 'eval_runtime': 0.7906, 'eval_samples_per_second': 48.063, 'eval_steps_per_second': 3.794, 'epoch': 20.0}


('./finetuned_with_CyberThreat_classification_model/tokenizer_config.json',
 './finetuned_with_CyberThreat_classification_model/special_tokens_map.json',
 './finetuned_with_CyberThreat_classification_model/vocab.txt',
 './finetuned_with_CyberThreat_classification_model/added_tokens.json')
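
The `No log` entries in the Training Loss column are consistent with the run being shorter than the logging interval. The arithmetic below is a sketch that assumes a single GPU, no gradient accumulation, and the `Trainer` default of `logging_steps=500`. The 342 training examples, the batch size of 16, and the 20 epochs come from the cells above.

```python
import math

train_examples = 342      # size of the train split reported by Map above
batch_size = 16           # per_device_train_batch_size
epochs = 20               # num_train_epochs
logging_steps = 500       # assumed Trainer default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps, total_steps < logging_steps)
```

Under these assumptions, passing a smaller value such as `logging_steps=10` to `TrainingArguments` would make the training loss appear in the table.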

#### **4-1-1. Text classification**

In [18]:
# 1. Load the model and tokenizer that were just fine-tuned
finetuned_model_checkpoint = './finetuned_with_CyberThreat_classification_model'
finetuned_tokenizer_checkpoint = './finetuned_with_CyberThreat_classification_model'

finetuned_model = AutoModelForSequenceClassification.from_pretrained(finetuned_model_checkpoint)
finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_tokenizer_checkpoint)

# 2. text-classification pipeline for fine tuned model
finetuned_classifier = pipeline('text-classification', model=finetuned_model, tokenizer=finetuned_tokenizer)

# 3. Load the original model and tokenizer
origin_model_checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
origin_model = AutoModelForSequenceClassification.from_pretrained(origin_model_checkpoint)
origin_tokenizer = AutoTokenizer.from_pretrained(origin_model_checkpoint)

# 4. text-classification pipeline for origin model
origin_classifier = pipeline('text-classification', model=origin_model, tokenizer=origin_tokenizer)

# 5. Run predictions
test_texts = dataset_dict_for_cls['test']['text']

finetuned_predictions = finetuned_classifier(test_texts)
finetuned_predicted_labels = [pred['label'] for pred in finetuned_predictions]

origin_predictions = origin_classifier(test_texts)
origin_predicted_labels = [pred['label'] for pred in origin_predictions]

# 6. Compute accuracy
true_labels = dataset_dict_for_cls['test']['label']
finetuned_accuracy = sum(pred_label == true_label for pred_label, true_label in zip(finetuned_predicted_labels, true_labels)) / len(true_labels)
origin_accuracy = sum(pred_label == true_label for pred_label, true_label in zip(origin_predicted_labels, true_labels)) / len(true_labels)

print("Fine-tuned model accuracy:", finetuned_accuracy)
print("Origin DistilBERT model accuracy:", origin_accuracy)

Fine-tuned model accuracy: 0.5208333333333334
Origin DistilBERT model accuracy: 0.0
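
The 0.0 accuracy of the original checkpoint is expected rather than informative: `distilbert-base-uncased-finetuned-sst-2-english` is a sentiment model whose pipeline labels are `POSITIVE` and `NEGATIVE`, so its predictions can never equal a threat label. The sketch below reproduces that effect with made-up predictions; the sample values are assumptions, not notebook output.

```python
def accuracy(predicted, reference):
    """Share of positions where the predicted string equals the reference string."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

true_labels = ["malware", "hash", "threat-actor", "malware"]
# A sentiment model can only emit its own two label names.
sentiment_predictions = ["NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
# A model fine-tuned on the threat labels shares the reference label space.
finetuned_predictions = ["malware", "hash", "malware", "malware"]

sentiment_accuracy = accuracy(sentiment_predictions, true_labels)
finetuned_accuracy = accuracy(finetuned_predictions, true_labels)
print(sentiment_accuracy, finetuned_accuracy)
```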


#### **4-1-2. Zero-shot classification**

In [19]:
print(f"Total number of test texts: {len(test_texts)}")


Total number of test texts: 96


In [20]:
# # 1. Load the zero-shot classification pipeline: zero-shot performance of general-purpose language models is usually weak, so two of the most-downloaded models on Hugging Face are compared
# zero_shot_classifier1 = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

model_zero_shot = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
tokenizer_zero_shot = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
# zero_shot_finetuned_classifier = pipeline('zero-shot-classification', model=model_zero_shot, tokenizer=tokenizer_zero_shot)


In [21]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results_zero_shot",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# Initialize the trainer
trainer_zero_shot = Trainer(
    model=model_zero_shot.to('cuda'),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer_zero_shot,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer_zero_shot.train()
results = trainer_zero_shot.evaluate()
print(results)
trainer_zero_shot.save_model("./finetuned_with_CyberThreat_zero_shot_model")
# Save the BART tokenizer (not the DistilBERT one) next to the BART model
tokenizer_zero_shot.save_pretrained("./finetuned_with_CyberThreat_zero_shot_model")

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 
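
A back-of-the-envelope estimate suggests why full fine-tuning of `facebook/bart-large-mnli` can exhaust a Colab GPU. The sketch assumes roughly 400 million parameters, fp32 values, and the AdamW optimizer (weights, gradients, and two moment buffers, 4 bytes each). It ignores activations, which add more on top. All figures are approximations, not measurements from this run.

```python
params = 400_000_000          # assumed approximate parameter count
bytes_per_value = 4           # fp32
copies = 4                    # weights + gradients + AdamW first and second moments

total_bytes = params * bytes_per_value * copies
total_gib = total_bytes / 2**30
print(f"{total_gib:.1f} GiB before activations")
```

Common mitigations, none of which were tried in this notebook, include a smaller batch size, mixed precision (`fp16=True` in `TrainingArguments`), or a smaller NLI checkpoint such as the `typeform/distilbert-base-uncased-mnli` model mentioned in the next cell.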

In [None]:
# 1. Load the zero-shot classification pipeline: zero-shot performance of general-purpose language models is usually weak, so two of the most-downloaded models on Hugging Face were considered for comparison
zero_shot_classifier1 = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# zero_shot_classifier2 = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli")

# 2. Load the model that was just fine-tuned
model_zero_shot = AutoModelForSequenceClassification.from_pretrained('./finetuned_with_CyberThreat_zero_shot_model')
tokenizer_zero_shot = AutoTokenizer.from_pretrained('./finetuned_with_CyberThreat_zero_shot_model')
zero_shot_finetuned_classifier = pipeline('zero-shot-classification', model=model_zero_shot, tokenizer=tokenizer_zero_shot)

# 3. Take a sample of the test data and build the candidate labels
sample_size = 10
test_texts = dataset_dict_for_cls['test']['text'][:sample_size]
true_labels = dataset_dict_for_cls['test']['label'][:sample_size]
candidate_labels = list(set(true_labels))

# 4. Zero-shot classification
predictions1 = [pred['labels'][0] for pred in zero_shot_classifier1(test_texts, candidate_labels=candidate_labels, multi_label=False)]
# predictions2 = [pred['labels'][0] for pred in zero_shot_classifier2(test_texts, candidate_labels=candidate_labels, multi_label=False)]
finetuned_predictions = [pred['labels'][0] for pred in zero_shot_finetuned_classifier(test_texts, candidate_labels=candidate_labels, multi_label=False)]

# 5. Compute accuracy
def calculate_accuracy(predictions, true_labels):
    return sum(pred == true for pred, true in zip(predictions, true_labels)) / len(true_labels)

accuracy1 = calculate_accuracy(predictions1, true_labels)
# accuracy2 = calculate_accuracy(predictions2, true_labels)
finetuned_accuracy = calculate_accuracy(finetuned_predictions, true_labels)

# 6. Print the results
print(f"Zero-shot Model (BART): Accuracy = {accuracy1:.4f}")
# print(f"Zero-shot Model 2 (DistilBERT): Accuracy = {accuracy2:.4f}")
print(f"Finetuned Model: Accuracy = {finetuned_accuracy:.4f}")
print("\nSample Predictions:")
for i in range(min(5, len(test_texts))):
    print(f"Text: {test_texts[i]}")
    print(f"Predicted by Zero-shot Model 1: {predictions1[i]}, True Label: {true_labels[i]}")
    # Enable together with zero_shot_classifier2 and predictions2 above
    # print(f"Predicted by Zero-shot Model 2: {predictions2[i]}, True Label: {true_labels[i]}")
    print(f"Predicted by Finetuned Model: {finetuned_predictions[i]}, True Label: {true_labels[i]}")
    print()
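
For reference, a `zero-shot-classification` pipeline returns, for each input, a dict with `sequence`, `labels`, and `scores`, where the labels are sorted by descending score. That is why the cell above takes `pred['labels'][0]`. The sketch below mimics that structure with hand-written values (the texts and scores are made up) so the accuracy helper can be exercised without downloading a model.

```python
# Hand-written stand-ins for pipeline outputs; the scores are illustrative only.
mock_outputs = [
    {"sequence": "text a", "labels": ["malware", "hash"], "scores": [0.9, 0.1]},
    {"sequence": "text b", "labels": ["hash", "malware"], "scores": [0.6, 0.4]},
    {"sequence": "text c", "labels": ["malware", "hash"], "scores": [0.7, 0.3]},
]
sample_true_labels = ["malware", "hash", "hash"]

def calculate_accuracy(predictions, true_labels):
    """Share of predictions that equal the corresponding true label."""
    return sum(pred == true for pred, true in zip(predictions, true_labels)) / len(true_labels)

# The top-ranked label is the first entry of each 'labels' list.
top_labels = [out["labels"][0] for out in mock_outputs]
sample_accuracy = calculate_accuracy(top_labels, sample_true_labels)
print(top_labels, round(sample_accuracy, 4))
```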



#### **4-2. Security incident diagnosis and solution QA: given a cyber threat, provide a remedy and response based on the diagnosis and solutions fields**

In [None]:
dataset_for_qa = Dataset.from_pandas(selected_data_for_qa)
train_test_split_qa = dataset_for_qa.train_test_split(test_size=0.2)
train_val_split_qa = train_test_split_qa['train'].train_test_split(test_size=0.1)  # e.g. hold out 10% of the training data as the validation set

# dataset_dict_for_qa is likewise split into train, validation, and test sets
dataset_dict_for_qa = DatasetDict({
    'train': train_val_split_qa['train'],
    'validation': train_val_split_qa['test'],
    'test': train_test_split_qa['test']
})
# Inspect the first example of the train dataset
dataset_dict_for_qa['train'][0]

In [None]:
# Define preprocess_function
# The tokenizer is loaded here because this cell runs before the QA model cell below
tokenizer_qa = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

def preprocess_function(examples):

    # 1. The original dataset is not a QA dataset, so questions have to be added
    questions = [
        "What is the cybersecurity threat label associated with this text?",
        "What is the cybersecurity diagnosis for the given text?",
        "What cybersecurity solutions are proposed for the given text?"
    ]

    # 2. Collect every label in the batch so the context lists the possible answers
    all_labels = list(set(examples['label_1']))
    labels_list_str = ", ".join(all_labels)

    new_examples = {
        "questions": [],
        "contexts": [],
        "answers": []
    }

    for i in range(len(examples['text'])):
        # 3. Build the contexts, adding extra clues to each one
        label_context = examples['text'][i] + f" Possible labels are: {labels_list_str}."
        diagnosis_context = label_context + f" The threat label identified is {examples['label_1'][i]}."
        solutions_context = diagnosis_context + f" The diagnosis is {examples['diagnosis'][i]}."
        contexts = [label_context, diagnosis_context, solutions_context]
        answer_texts = [examples['label_1'][i], examples['diagnosis'][i], examples['solutions'][i]]

        new_examples['questions'].extend(questions)
        new_examples['contexts'].extend(contexts)
        # 4. Locate each answer inside its context; str.find returns -1 when it is absent
        new_examples['answers'].extend(
            {'answer_start': [context.find(answer)], 'text': [answer]}
            for context, answer in zip(contexts, answer_texts)
        )

    inputs = tokenizer_qa(
        new_examples['questions'],
        new_examples['contexts'],
        max_length=512,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length"
    )

    offset_mapping = inputs.pop("offset_mapping")
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = new_examples['answers'][i]
        start_char = answer['answer_start'][0]
        end_char = start_char + len(answer['text'][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the first and last token of the context (sequence id 1)
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # No answer span in the (possibly truncated) context: point at the first token
        if start_char < 0 or offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions

    # For testing: print a few examples to confirm the added prefixes and suffixes are present
    for i in range(3):
        print(f"Example {i}")
        print(f"Question: {new_examples['questions'][i]}")
        print(f"Context: {new_examples['contexts'][i][:300]}...")  # print only part of the context
        print(f"Answer Start: {start_positions[i]}, End: {end_positions[i]}")
        print(f"Answer Text: {new_examples['answers'][i]['text'][0]}\n")

    return inputs
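
The span-alignment loop in `preprocess_function` can be hard to follow, so here is a stripped-down sketch of the same idea: given per-token character offsets, find the first and last token that cover an answer's character span. The whitespace "tokenizer" and the helper names are stand-ins for illustration; a real tokenizer produces subword offsets.

```python
import re

def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (first_token, last_token) indices."""
    first = next(i for i, (s, e) in enumerate(offsets) if e > start_char)
    last = max(i for i, (s, e) in enumerate(offsets) if s < end_char)
    return first, last

context = "The threat label identified is malware."
answer = "malware"
start_char = context.find(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
first, last = char_span_to_token_span(offsets, start_char, end_char)
covered = context[offsets[first][0]:offsets[last][1]]
print(first, last, covered)
```

The recovered span includes the trailing period because the toy tokenizer does not split punctuation; subword tokenizers usually give a tighter match.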

In [None]:
train_dataset = dataset_dict_for_qa['train'].map(preprocess_function, batched=True, remove_columns=dataset_dict_for_qa['train'].column_names)
eval_dataset = dataset_dict_for_qa['validation'].map(preprocess_function, batched=True, remove_columns=dataset_dict_for_qa['validation'].column_names)

#### **4-2-1. Question-answering**

In [None]:
# Define the model and tokenizer
data_collator = DefaultDataCollator()
model_checkpoint_qa = "deepset/roberta-base-squad2" # a checkpoint suited to QA
tokenizer_qa = AutoTokenizer.from_pretrained(model_checkpoint_qa)
model_qa = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_qa)


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results_qa',
    eval_strategy="epoch",  # same argument name as the classification cells above
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    weight_decay=0.01,
    no_cuda=False
)
model_qa = model_qa.to('cuda')
trainer_qa = Trainer(
    model=model_qa,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer_qa.train()

results = trainer_qa.evaluate()
print(results)
trainer_qa.save_model("./finetuned_with_CyberThreat_qa_model")
tokenizer_qa.save_pretrained("./finetuned_with_CyberThreat_qa_model")

#### **4-2-2. Text-generation**

In [None]:
# Define the model and tokenizer: GPT-2 was chosen because it tends to perform well at text generation
tokenizer_text_gen = GPT2Tokenizer.from_pretrained('gpt2')
model_text_gen = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
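# Training arguments
training_args = TrainingArguments(
    output_dir='./results_text_gen',
    eval_strategy="epoch",  # same argument name as the classification cells above
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    weight_decay=0.01,
    no_cuda=False
)
# Move the GPT-2 model (not model_qa) to the GPU
model_text_gen = model_text_gen.to('cuda')
# Note: train_dataset and eval_dataset are still the QA-formatted sets built above;
# a causal language model expects input_ids with matching labels instead
trainer_text_gen = Trainer(
    model=model_text_gen,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)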
trainer_text_gen.train()

results = trainer_text_gen.evaluate()
print(results)
trainer_text_gen.save_model("./finetuned_with_CyberThreat_text_gen_model")
tokenizer_text_gen.save_pretrained("./finetuned_with_CyberThreat_text_gen_model")
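
The cell above reuses the QA-formatted datasets, whereas a causal language model such as GPT-2 is normally fine-tuned on plain text sequences. One possible way to turn the `label_1`, `diagnosis`, and `solutions` columns into such sequences is sketched below. The prompt template and the helper name `build_lm_text` are assumptions rather than the notebook's method, and the sample row is made up.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token, appended to mark where a sample stops

def build_lm_text(row):
    """Format one row as a single prompt-plus-target training string."""
    prompt = (
        f"Threat label: {row['label_1']}\n"
        f"Diagnosis: {row['diagnosis']}\n"
        "Solutions:"
    )
    return f"{prompt} {row['solutions']}{EOS}"

# Hypothetical row with the same column names as selected_data_for_qa.
row = {
    "label_1": "malware",
    "diagnosis": "The host is infected with a trojan.",
    "solutions": "1. Isolate the host. 2. Run an anti-malware scan.",
}
text = build_lm_text(row)
print(text)
```

Strings like this would then be tokenized and batched with `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, which copies `input_ids` into `labels` for next-token prediction. GPT-2 has no padding token by default, so one is usually assigned (for example the end-of-text token) before batching.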

#### **4-2-3. Building a QA system with question-answering + text-generation**

In [None]:
model_checkpoint_qa_origin = "deepset/roberta-base-squad2"
tokenizer_qa_origin = AutoTokenizer.from_pretrained(model_checkpoint_qa_origin)
model_qa_origin = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_qa_origin)


tokenizer_qa = AutoTokenizer.from_pretrained("./finetuned_with_CyberThreat_qa_model")
model_qa = AutoModelForQuestionAnswering.from_pretrained("./finetuned_with_CyberThreat_qa_model")

model_text_gen = GPT2LMHeadModel.from_pretrained('./finetuned_with_CyberThreat_text_gen_model')
tokenizer_text_gen = GPT2Tokenizer.from_pretrained('./finetuned_with_CyberThreat_text_gen_model')

qa_original_pipeline = pipeline("question-answering", model=model_qa_origin, tokenizer=tokenizer_qa_origin)
qa_pipeline = pipeline("question-answering", model=model_qa, tokenizer=tokenizer_qa)
text_gen_pipeline = pipeline("text-generation", model=model_text_gen, tokenizer=tokenizer_text_gen)


all_labels = list(set(dataset_dict_for_qa['test']['label_1']))
labels_list_str = ", ".join(all_labels)

# Define the questions
questions = [
    "What is the cybersecurity threat label associated with this text?",
    "What is the cybersecurity diagnosis for the given text?",
    "What cybersecurity solutions are proposed for the given text?"
]

for example in dataset_dict_for_qa['test']:
    label_context = example['text'] + f" Possible labels are: {labels_list_str}."
    diagnosis_context = label_context + f" The threat label identified is {example['label_1']}."
    solutions_context = diagnosis_context + f" The diagnosis is {example['diagnosis']}."
    contexts = [label_context, diagnosis_context, solutions_context]

    for i, (question, context) in enumerate(zip(questions, contexts)):
        if i < 1:
            result = qa_pipeline(question=question, context=context)
            answer = result['answer']
            result_ori = qa_original_pipeline(question=question, context=context)
            answer_ori = result_ori['answer']
        else:
            result = text_gen_pipeline(context, max_length=400)
            answer = result[0]['generated_text']

        print(f"Question: {question}")
        print(f"Context: {context}")
        print(f"Answer from fine-tuned model: {answer}\n")
        if answer_ori and i < 1:
            print(f"Answer from original model: {answer_ori}\n")
            correct_answer = example['label_1']
            print(f"Correct Answer: {correct_answer}\n")
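
The loop above only prints answers. To compare the fine-tuned and original QA models numerically, one option is an exact-match score after light normalization, sketched below. The normalization rules and the sample answers are assumptions; SQuAD-style evaluation uses a similar but more elaborate normalization.

```python
import string

def normalize(answer):
    """Lower-case, strip punctuation at the ends, and collapse whitespace."""
    return " ".join(answer.lower().strip(string.punctuation + " ").split())

def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Made-up answers standing in for the outputs of the two QA pipelines.
references = ["malware", "threat-actor", "URL"]
finetuned_answers = ["Malware.", "threat-actor", "hash"]
original_answers = ["a trojan", "threat-actor", "hash"]

finetuned_em = exact_match(finetuned_answers, references)
original_em = exact_match(original_answers, references)
print(round(finetuned_em, 4), round(original_em, 4))
```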