
# Language Model and MITRE ATT&CK


## Instructions

* Use "Fine-tuning a masked language model" as the template to create your own language model.
  * https://huggingface.co/learn/nlp-course/en/chapter7/3
* Select a built-in language model and fine-tune it on an additional corpus.
* We want the fine-tuned model to learn cybersecurity knowledge, so we use professional cybersecurity documents from the MITRE website.
  * https://attack.mitre.org/resources/attack-data-and-tools/
* On the MITRE data and tools page, find the two Excel files that contain the definitions of attack tactics and attack techniques.
  * enterprise-attack-v15.1-tactics.xlsx
  * enterprise-attack-v15.1-techniques.xlsx
* Parse the xlsx files and extract 'name' and 'description' as your additional corpus.
* Fine-tune your model.
* You do not have to push your model to Hugging Face; keep it in your Colab session and use/test it there.

In [1]:
!wget https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-tactics.xlsx
!wget https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-techniques.xlsx

--2024-06-10 12:37:14--  https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-tactics.xlsx
Resolving attack.mitre.org (attack.mitre.org)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to attack.mitre.org (attack.mitre.org)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10109 (9.9K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘enterprise-attack-v15.1-tactics.xlsx’


2024-06-10 12:37:15 (71.3 MB/s) - ‘enterprise-attack-v15.1-tactics.xlsx’ saved [10109/10109]

--2024-06-10 12:37:15--  https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-techniques.xlsx
Resolving attack.mitre.org (attack.mitre.org)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to attack.mitre.org (attack.mitre.org)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2615585 (2.5M) [application/vnd.openxmlformats-off

## Corpus

In [2]:
import pandas as pd

In [3]:
tactics_df = pd.read_excel('enterprise-attack-v15.1-tactics.xlsx')
techniques = pd.read_excel('enterprise-attack-v15.1-techniques.xlsx')

In [4]:
tactics_df

Unnamed: 0,ID,STIX ID,name,description,url,created,last modified,domain,version
0,TA0009,x-mitre-tactic--d108ce10-2419-4cf9-a774-46161d...,Collection,The adversary is trying to gather data of inte...,https://attack.mitre.org/tactics/TA0009,17 October 2018,19 July 2019,enterprise-attack,1.0
1,TA0011,x-mitre-tactic--f72804c5-f15a-449e-a5da-2eecd1...,Command and Control,The adversary is trying to communicate with co...,https://attack.mitre.org/tactics/TA0011,17 October 2018,19 July 2019,enterprise-attack,1.0
2,TA0006,x-mitre-tactic--2558fd61-8c75-4730-94c4-11926d...,Credential Access,The adversary is trying to steal account names...,https://attack.mitre.org/tactics/TA0006,17 October 2018,19 July 2019,enterprise-attack,1.0
3,TA0005,x-mitre-tactic--78b23412-0651-46d7-a540-170a1c...,Defense Evasion,The adversary is trying to avoid being detecte...,https://attack.mitre.org/tactics/TA0005,17 October 2018,19 July 2019,enterprise-attack,1.0
4,TA0007,x-mitre-tactic--c17c5845-175e-4421-9713-829d05...,Discovery,The adversary is trying to figure out your env...,https://attack.mitre.org/tactics/TA0007,17 October 2018,19 July 2019,enterprise-attack,1.0
5,TA0002,x-mitre-tactic--4ca45d45-df4d-4613-8980-bac22d...,Execution,The adversary is trying to run malicious code....,https://attack.mitre.org/tactics/TA0002,17 October 2018,19 July 2019,enterprise-attack,1.0
6,TA0010,x-mitre-tactic--9a4e74ab-5008-408c-84bf-a10dfb...,Exfiltration,The adversary is trying to steal data.\n\nExfi...,https://attack.mitre.org/tactics/TA0010,17 October 2018,19 July 2019,enterprise-attack,1.0
7,TA0040,x-mitre-tactic--5569339b-94c2-49ee-afb3-222293...,Impact,"The adversary is trying to manipulate, interru...",https://attack.mitre.org/tactics/TA0040,14 March 2019,25 July 2019,enterprise-attack,1.0
8,TA0001,x-mitre-tactic--ffd5bcee-6e16-4dd2-8eca-7b3bee...,Initial Access,The adversary is trying to get into your netwo...,https://attack.mitre.org/tactics/TA0001,17 October 2018,19 July 2019,enterprise-attack,1.0
9,TA0008,x-mitre-tactic--7141578b-e50b-4dcc-bfa4-08a8dd...,Lateral Movement,The adversary is trying to move through your e...,https://attack.mitre.org/tactics/TA0008,17 October 2018,19 July 2019,enterprise-attack,1.0


In [5]:
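# Join each row's name and description into one training text.
# fillna('') guards against rows with a missing description (a precaution; the sheets may have none).
tactics_corpus = tactics_df[['name', 'description']].fillna('').apply(lambda x: ' '.join(x), axis=1).tolist()
techniques_corpus = techniques[['name', 'description']].fillna('').apply(lambda x: ' '.join(x), axis=1).tolist()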

corpus = tactics_corpus + techniques_corpus
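
The corpus built above is a plain list of strings. A minimal sketch, assuming one wants to hold out part of it so that perplexity can later be measured on text the model did not train on. The helper name `split_corpus`, the 20% fraction, and the toy corpus are illustrative assumptions, not part of the original notebook.

```python
import random

def split_corpus(texts, eval_fraction=0.1, seed=42):
    """Shuffle a copy of `texts` and split it into (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    shuffled = list(texts)                   # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Toy stand-in for the real `corpus` list (hypothetical data).
toy_corpus = [f"Technique {i} description text" for i in range(20)]
train_texts, eval_texts = split_corpus(toy_corpus, eval_fraction=0.2)
print(len(train_texts), len(eval_texts))
```

With the real data, `split_corpus(corpus)` would give two lists that can each be tokenized and wrapped in a `Dataset` separately.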

## Now on your own

Write your code here. There should be a lot of it.

In [6]:
!pip install transformers[torch] accelerate -U
!pip install datasets
!pip install pandas
!pip install torch

Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.4/309.4 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->transformers[torch])
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->transformers[torch])
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86

In [7]:
!pip uninstall transformers accelerate -y
!pip install transformers[torch] accelerate -U

Found existing installation: transformers 4.41.2
Uninstalling transformers-4.41.2:
  Successfully uninstalled transformers-4.41.2
Found existing installation: accelerate 0.31.0
Uninstalling accelerate-0.31.0:
  Successfully uninstalled accelerate-0.31.0
Collecting transformers[torch]
  Downloading transformers-4.41.2-py3-none-any.whl (9.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.1/9.1 MB[0m [31m70.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate
  Using cached accelerate-0.31.0-py3-none-any.whl (309 kB)
Installing collected packages: transformers, accelerate
Successfully installed accelerate-0.31.0 transformers-4.41.2


In [8]:
import transformers
import accelerate
print(transformers.__version__)
print(accelerate.__version__)

4.41.2
0.31.0


In [9]:
import transformers
import datasets
import pandas as pd
import torch
import accelerate

In [10]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
print("Model loaded successfully")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model loaded successfully


**Fine-tuning the BERT model**

In [24]:
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import Dataset

# Simplified text processing
encodings = tokenizer(corpus[:10], truncation=True, padding=True, max_length=512)  # only the first 10 entries, as a quick test

# Convert to a dataset
dataset = Dataset.from_dict({'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask']})

# Data collator: randomly masks 15% of the tokens in each batch
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Train the model
trainer.train()

Step,Training Loss


TrainOutput(global_step=5, training_loss=1.4538101196289062, metrics={'train_runtime': 1.1333, 'train_samples_per_second': 8.823, 'train_steps_per_second': 4.412, 'total_flos': 539775495000.0, 'train_loss': 1.4538101196289062, 'epoch': 1.0})
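
To make `mlm_probability=0.15` concrete, here is a deliberately simplified sketch of random masking on word-level tokens. It is not the library's implementation: the real `DataCollatorForLanguageModeling` works on token IDs, marks unmasked positions with the label `-100`, and replaces only about 80% of the selected tokens with `[MASK]` (the rest become a random token or stay unchanged). The function name `mask_tokens` and the example sentence are assumptions for illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mlm_probability=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] is the original token where position i was masked, else None.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            masked.append(MASK_TOKEN)   # the model must predict this position
            labels.append(token)
        else:
            masked.append(token)        # position is ignored by the loss
            labels.append(None)
    return masked, labels

tokens = "the adversary is trying to steal account names and passwords".split()
masked, labels = mask_tokens(tokens, mlm_probability=0.15, seed=1)
print(masked)
print(labels)
```

Because the mask is drawn at random, each pass over the data sees different masked positions, which is also why evaluation loss can vary slightly between runs.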

## Perplexity

Show the perplexity of the newly trained model.

In [12]:
import math

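def calculate_perplexity(trainer, dataset):
    """Return exp(mean evaluation loss) of the trainer's model on `dataset`."""
    eval_results = trainer.evaluate(eval_dataset=dataset)
    return math.exp(eval_results["eval_loss"])

# Note: `dataset` here is the training data, so this is not a held-out estimate.
perplexity = calculate_perplexity(trainer, dataset)
print(f'Perplexity: {perplexity}')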

Perplexity: 11.815901183463946
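
Perplexity is the exponential of the mean cross-entropy loss, so the value above corresponds to an evaluation loss of about 2.47. The sketch below works through that relationship on made-up numbers; the helper names and the probabilities are hypothetical and only illustrate the formula.

```python
import math

def mean_loss_from_probs(probs):
    """Mean negative log-likelihood of the probabilities given to the correct tokens."""
    return -sum(math.log(p) for p in probs) / len(probs)

def perplexity_from_loss(mean_loss):
    """Perplexity is exp(mean cross-entropy loss)."""
    return math.exp(mean_loss)

# Hypothetical probabilities a model assigned to the correct masked tokens.
probs = [0.5, 0.25, 0.125]
loss = mean_loss_from_probs(probs)
ppl = perplexity_from_loss(loss)
print(f"loss={loss:.4f} perplexity={ppl:.4f}")

# Going the other way: recover the loss behind the perplexity reported above.
reported_loss = math.log(11.815901183463946)
print(f"reported eval loss is roughly {reported_loss:.2f}")
```

A perplexity of 4 in the toy case means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.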


## Downstream Task Test

* You should now have two models: the original one downloaded from Hugging Face and your fine-tuned one.

* Let's try a downstream task to see whether the classification rate changes after the fine-tuned model has learned some additional cybersecurity knowledge.

* In the 'Fine-tuning a masked language model' example, the 'Using our fine-tuned model' section tests the new model with a "fill-mask" pipeline.

* "Transformers, what can they do?" (https://huggingface.co/learn/nlp-course/en/chapter1/3) lists several pipelines. Let's try 'Zero-shot classification'.

* Prepare several sentences (> 100) from the website (not from the downloaded xlsx files) as your test examples.

* Feed these sentences into the original model and your fine-tuned model, and ask each which 'tactics' and 'techniques' the sentence belongs to.

* Show whether the classification rate for 'tactics' and 'techniques' increases when the fine-tuned model is used.

* Show some examples where the predicted 'tactics' or 'techniques' label actually changes when the new model is used.
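
The classification rate asked for above can be computed as plain accuracy once each test sentence has a gold label and a predicted label from each model. A minimal sketch, assuming top-1 predictions have already been collected; the function name and all labels below are hypothetical placeholders, not results from this notebook.

```python
def classification_rate(predicted, gold):
    """Fraction of sentences whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical gold tactics and top-1 predictions for four test sentences.
gold = ["Discovery", "Execution", "Impact", "Collection"]
original_pred = ["Discovery", "Impact", "Impact", "Execution"]
fine_tuned_pred = ["Discovery", "Execution", "Impact", "Execution"]

original_rate = classification_rate(original_pred, gold)
fine_tuned_rate = classification_rate(fine_tuned_pred, gold)
print(f"original: {original_rate:.2f}, fine-tuned: {fine_tuned_rate:.2f}")
```

Running the same function once with tactic labels and once with technique labels gives the two rates the task asks for.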

### Zero-shot classification task

**Zero-shot classification with facebook/bart-large-mnli**

In [25]:
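from datasets import Dataset
from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

# Candidate labels: every tactic and technique name
# (defined here so the cell also runs top to bottom; it was originally set in a later cell)
candidate_labels = tactics_df['name'].tolist() + techniques['name'].tolist()

# Load the pretrained BERT model with a newly initialized classification head.
# Note: this loads the original checkpoint, not the masked-LM weights fine-tuned above.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(candidate_labels))

# Build the dataset
labels = [0] * len(corpus)  # placeholder: we have no real labels, so this is only an example
encodings = tokenizer(corpus, truncation=True, padding=True, max_length=512)
dataset = Dataset.from_dict({'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask'], 'labels': labels})

# Training arguments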
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


TrainOutput(global_step=326, training_loss=0.34475443553339485, metrics={'train_runtime': 81.2223, 'train_samples_per_second': 8.015, 'train_steps_per_second': 4.014, 'total_flos': 172283395156992.0, 'train_loss': 0.34475443553339485, 'epoch': 1.0})

In [13]:
from transformers import pipeline

# Use a model that supports zero-shot classification
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example sentences (replace with your own)
sentences = ["Example sentence 1", "Example sentence 2"]

# Candidate labels
candidate_labels = tactics_df['name'].tolist() + techniques['name'].tolist()

# Results from the original model
print("Original Model Results:")
# Score every sentence against the same set of candidate labels
for sentence in sentences:
    result = zero_shot_classifier(sequences=[sentence], candidate_labels=candidate_labels)
    print(f"Sentence: {sentence}")
    print(f"Classification: {result}")

config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Sentence: Example sentence 1
Classification: [{'sequence': 'Example sentence 1', 'labels': ['Modify Cloud Compute Infrastructure: Modify Cloud Compute Configurations', 'Obfuscated Files or Information: Encrypted/Encoded File', 'Boot or Logon Autostart Execution: Registry Run Keys / Startup Folder', 'Non-Standard Port', 'Obfuscated Files or Information: Stripped Payloads', 'Scheduled Task/Job: Scheduled Task', 'Exfiltration Over Alternative Protocol: Exfiltration Over Unencrypted Non-C2 Protocol', 'Masquerading: Masquerade Task or Service', 'Hijack Execution Flow: Path Interception by Search Order Hijacking', 'Defacement: External Defacement', 'Exfiltration Over Alternative Protocol: Exfiltration Over Asymmetric Encrypted Non-C2 Protocol', 'Boot or Logon Autostart Execution: Login Items', 'Masquerading: Space after Filename', 'File and Directory Permissions Modification: Linux and Mac File and Directory Permissions Modification', 'Boot or Logon Initialization Scripts: Logon Script (Wind
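
The raw output above lists every candidate label, which is hard to read with several hundred labels. The zero-shot pipeline returns `labels` sorted by descending score alongside a parallel `scores` list, so the top few predictions can be pulled out as sketched below. The helper name `top_k` and the example result are hypothetical; note that passing `sequences=[sentence]` as in the cell above yields a list, so the dict for one sentence is `result[0]`.

```python
def top_k(result, k=3):
    """Return the k best (label, score) pairs from one zero-shot result dict."""
    return list(zip(result["labels"], result["scores"]))[:k]

# Hypothetical result in the shape the pipeline returns for a single sentence.
example_result = {
    "sequence": "The malware enumerates running processes.",
    "labels": ["Discovery", "Execution", "Impact", "Collection"],
    "scores": [0.61, 0.22, 0.10, 0.07],
}

best = top_k(example_result, k=2)
for label, score in best:
    print(f"{label}: {score:.2f}")
```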

Fine-tuned model

Zero-shot classification with the fine-tuned model

In [27]:
from transformers import pipeline

# Use the fine-tuned BERT model
fine_tuned_zero_shot_classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Example sentences (replace with your own)
sentences = ["Example sentence 1", "Example sentence 2"]
candidate_labels = tactics_df['name'].tolist() + techniques['name'].tolist()

# Results from the fine-tuned model
print("Fine-Tuned Model Results:")
for sentence in sentences:
    fine_tuned_result = fine_tuned_zero_shot_classifier(sequences=sentence, candidate_labels=candidate_labels)
    print(f"Sentence: {sentence}")
    print(f"Fine-tuned Classification: {fine_tuned_result}")

Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


Fine-Tuned Model Results:
Sentence: Example sentence 1
Fine-tuned Classification: {'sequence': 'Example sentence 1', 'labels': ['Remote Services: Remote Desktop Protocol', 'Establish Accounts: Email Accounts', 'Supply Chain Compromise', 'Masquerading: Match Legitimate Name or Location', 'Indicator Removal: File Deletion', 'Lateral Movement', 'Screen Capture', 'Automated Exfiltration', 'Account Manipulation: Additional Cloud Roles', 'Unsecured Credentials: Container API', 'Weaken Encryption: Disable Crypto Hardware', 'Event Triggered Execution: Component Object Model Hijacking', 'Command and Scripting Interpreter: AutoHotKey & AutoIT', 'Gather Victim Identity Information', 'Valid Accounts: Domain Accounts', 'Brute Force: Password Spraying', 'Obtain Capabilities: Vulnerabilities', 'Masquerading: Masquerade Task or Service', 'OS Credential Dumping: NTDS', 'Hijack Execution Flow: Dynamic Linker Hijacking', 'Boot or Logon Autostart Execution', 'Supply Chain Compromise: Compromise Software D
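
The warning in this output matters: the zero-shot pipeline scores each candidate label through an NLI model's "entailment" class, and it locates that class by name in the model's `label2id` mapping. The sketch below mirrors that lookup as I understand it (a label whose name starts with "entail", else -1). It is an approximation for illustration, not the library's code, and both mappings are hypothetical examples.

```python
def find_entailment_id(label2id):
    """Return the index of the label that names entailment, or -1 if there is none."""
    for label, index in label2id.items():
        if label.lower().startswith("entail"):
            return index
    return -1

# An NLI-style mapping (assumed shape for an MNLI-trained model).
nli_mapping = {"contradiction": 0, "neutral": 1, "entailment": 2}
# Generic names, as a classifier gets when no label names are configured.
generic_mapping = {"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2}

print(find_entailment_id(nli_mapping))
print(find_entailment_id(generic_mapping))
```

When no entailment label is found, the pipeline falls back to -1, so the scores from the BERT classifier above should not be read as meaningful zero-shot predictions.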

In [28]:
from transformers import pipeline

# Use a model that supports zero-shot classification
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example sentences
sentences = ["Example sentence 1", "Example sentence 2"]
candidate_labels = tactics_df['name'].tolist() + techniques['name'].tolist()

# Results from the original model
print("Original Model Results:")
for sentence in sentences:
    original_result = zero_shot_classifier(sequences=sentence, candidate_labels=candidate_labels)
    print(f"Sentence: {sentence}")
    print(f"Original Classification: {original_result}")

# Results from the fine-tuned model (uses the fine-tuned pipeline, not the original one)
print("Fine-Tuned Model Results:")
for sentence in sentences:
    fine_tuned_result = fine_tuned_zero_shot_classifier(sequences=sentence, candidate_labels=candidate_labels)
    print(f"Sentence: {sentence}")
    print(f"Fine-tuned Classification: {fine_tuned_result}")

Original Model Results:
Sentence: Example sentence 1
Original Classification: {'sequence': 'Example sentence 1', 'labels': ['Modify Cloud Compute Infrastructure: Modify Cloud Compute Configurations', 'Obfuscated Files or Information: Encrypted/Encoded File', 'Boot or Logon Autostart Execution: Registry Run Keys / Startup Folder', 'Non-Standard Port', 'Obfuscated Files or Information: Stripped Payloads', 'Scheduled Task/Job: Scheduled Task', 'Exfiltration Over Alternative Protocol: Exfiltration Over Unencrypted Non-C2 Protocol', 'Masquerading: Masquerade Task or Service', 'Hijack Execution Flow: Path Interception by Search Order Hijacking', 'Defacement: External Defacement', 'Exfiltration Over Alternative Protocol: Exfiltration Over Asymmetric Encrypted Non-C2 Protocol', 'Boot or Logon Autostart Execution: Login Items', 'Masquerading: Space after Filename', 'File and Directory Permissions Modification: Linux and Mac File and Directory Permissions Modification', 'Boot or Logon Initializa

### Compare with the original model

In [None]:
print("Comparing Results:")
for sentence in sentences:
    original_result = zero_shot_classifier(sequences=[sentence], candidate_labels=candidate_labels)
    fine_tuned_result = fine_tuned_zero_shot_classifier(sequences=[sentence], candidate_labels=candidate_labels)

    print(f"Sentence: {sentence}")
    print(f"Original Classification: {original_result}")
    print(f"Fine-tuned Classification: {fine_tuned_result}")
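
Printing both full results side by side makes it hard to spot where the two models disagree. A minimal sketch, assuming the top-1 label from each model has been collected per sentence, that keeps only the sentences whose label changed. The function name and the sample data are hypothetical.

```python
def changed_predictions(sentences, original_top, fine_tuned_top):
    """Return (sentence, original_label, fine_tuned_label) where the two labels differ."""
    if not (len(sentences) == len(original_top) == len(fine_tuned_top)):
        raise ValueError("all three lists must have the same length")
    return [
        (sentence, before, after)
        for sentence, before, after in zip(sentences, original_top, fine_tuned_top)
        if before != after
    ]

# Hypothetical top-1 labels from the two models.
demo_sentences = ["Sentence A", "Sentence B", "Sentence C"]
demo_original = ["Impact", "Discovery", "Execution"]
demo_fine_tuned = ["Impact", "Collection", "Persistence"]

changes = changed_predictions(demo_sentences, demo_original, demo_fine_tuned)
for sentence, before, after in changes:
    print(f"{sentence}: {before} -> {after}")
```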