# Language Model and MITRE ATT&CK


## Instructions

* Use "Fine-tuning a masked language model" as the template to create your own language model.
  * https://huggingface.co/learn/nlp-course/en/chapter7/3
* Select a built-in language model and fine-tune it on an additional corpus.
* We want the fine-tuned model to learn cybersecurity knowledge, so we use professional, cybersecurity-related documents from the MITRE ATT&CK website.
  * https://attack.mitre.org/resources/attack-data-and-tools/
* On the MITRE data and tools page, find the two Excel files that contain the definitions of attack tactics and attack techniques.
  * enterprise-attack-v15.1-tactics.xlsx
  * enterprise-attack-v15.1-techniques.xlsx
* Parse the xlsx files and extract the 'name' and 'description' columns as your additional corpus.
* Fine-tune your model on this corpus.
* You do not have to push your model to the Hugging Face Hub; keep it in your Colab session and use and test it there directly.
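The extraction step in the list above could look roughly like the sketch below. It is only one possible approach: the helper name `build_corpus`, the `"name: description"` text format, and the whitespace normalization are assumptions, not requirements of the assignment. The helper takes plain `(name, description)` pairs so it can be tried without the xlsx files.

```python
from typing import Iterable, List, Tuple


def build_corpus(pairs: Iterable[Tuple[object, object]]) -> List[str]:
    """Turn (name, description) pairs into one training text per row."""
    corpus = []
    for name, description in pairs:
        # pandas represents empty cells as float NaN, so keep only real strings.
        if not isinstance(name, str) or not isinstance(description, str):
            continue
        # Collapse paragraph breaks ("\n\n") and other whitespace runs into single spaces.
        text = " ".join(description.split())
        corpus.append(f"{name}: {text}")
    return corpus


# Stand-in rows: the second one is a hypothetical row with an empty description cell.
demo_pairs = [
    ("Exfiltration", "The adversary is trying to steal data.\n\nExfiltration continues here."),
    ("Hypothetical empty row", float("nan")),
]
demo_corpus = build_corpus(demo_pairs)
print(demo_corpus)
```

Once a DataFrame such as `pd.read_excel('enterprise-attack-v15.1-tactics.xlsx')` is loaded, the same helper can be called as `build_corpus(zip(df["name"], df["description"]))`, and the tactics and techniques corpora can simply be concatenated.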

In [None]:
!wget https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-tactics.xlsx
!wget https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-techniques.xlsx

--2024-05-15 15:52:37--  https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-tactics.xlsx
Resolving attack.mitre.org (attack.mitre.org)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to attack.mitre.org (attack.mitre.org)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10109 (9.9K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘enterprise-attack-v15.1-tactics.xlsx’


2024-05-15 15:52:37 (34.7 MB/s) - ‘enterprise-attack-v15.1-tactics.xlsx’ saved [10109/10109]

--2024-05-15 15:52:37--  https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1-techniques.xlsx
Resolving attack.mitre.org (attack.mitre.org)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to attack.mitre.org (attack.mitre.org)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2615585 (2.5M) [application/vnd.openxmlformats-off

## Corpus

In [None]:
import pandas as pd

In [None]:
tactics_df = pd.read_excel('enterprise-attack-v15.1-tactics.xlsx')
#techniques = pd.read_excel('enterprise-attack-v15.1-techniques.xlsx')

In [None]:
tactics_df

| | ID | STIX ID | name | description | url | created | last modified | domain | version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TA0009 | x-mitre-tactic--d108ce10-2419-4cf9-a774-46161d... | Collection | The adversary is trying to gather data of inte... | https://attack.mitre.org/tactics/TA0009 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 1 | TA0011 | x-mitre-tactic--f72804c5-f15a-449e-a5da-2eecd1... | Command and Control | The adversary is trying to communicate with co... | https://attack.mitre.org/tactics/TA0011 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 2 | TA0006 | x-mitre-tactic--2558fd61-8c75-4730-94c4-11926d... | Credential Access | The adversary is trying to steal account names... | https://attack.mitre.org/tactics/TA0006 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 3 | TA0005 | x-mitre-tactic--78b23412-0651-46d7-a540-170a1c... | Defense Evasion | The adversary is trying to avoid being detecte... | https://attack.mitre.org/tactics/TA0005 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 4 | TA0007 | x-mitre-tactic--c17c5845-175e-4421-9713-829d05... | Discovery | The adversary is trying to figure out your env... | https://attack.mitre.org/tactics/TA0007 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 5 | TA0002 | x-mitre-tactic--4ca45d45-df4d-4613-8980-bac22d... | Execution | The adversary is trying to run malicious code.... | https://attack.mitre.org/tactics/TA0002 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 6 | TA0010 | x-mitre-tactic--9a4e74ab-5008-408c-84bf-a10dfb... | Exfiltration | The adversary is trying to steal data.\n\nExfi... | https://attack.mitre.org/tactics/TA0010 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 7 | TA0040 | x-mitre-tactic--5569339b-94c2-49ee-afb3-222293... | Impact | The adversary is trying to manipulate, interru... | https://attack.mitre.org/tactics/TA0040 | 14 March 2019 | 25 July 2019 | enterprise-attack | 1.0 |
| 8 | TA0001 | x-mitre-tactic--ffd5bcee-6e16-4dd2-8eca-7b3bee... | Initial Access | The adversary is trying to get into your netwo... | https://attack.mitre.org/tactics/TA0001 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |
| 9 | TA0008 | x-mitre-tactic--7141578b-e50b-4dcc-bfa4-08a8dd... | Lateral Movement | The adversary is trying to move through your e... | https://attack.mitre.org/tactics/TA0008 | 17 October 2018 | 19 July 2019 | enterprise-attack | 1.0 |


In [None]:
(tactics_df.iloc[0]['name'], tactics_df.iloc[0]['description'])

('Collection',
 "The adversary is trying to gather data of interest to their goal.\n\nCollection consists of techniques adversaries may use to gather information and the sources information is collected from that are relevant to following through on the adversary's objectives. Frequently, the next goal after collecting data is to steal (exfiltrate) the data. Common target sources include various drive types, browsers, audio, video, and email. Common collection methods include capturing screenshots and keyboard input.")

## Now on your own

Write your code here. Expect this part to require a substantial amount of code.
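One step the template chapter performs before training is concatenating the tokenized texts and cutting them into fixed-size chunks. The sketch below is modeled on that chapter's `group_texts` idea, written here in plain Python so it can be run without a tokenizer; the `chunk_size` default and the toy token IDs are assumptions for illustration only.

```python
from typing import Dict, List


def group_texts(
    examples: Dict[str, List[List[int]]], chunk_size: int = 128
) -> Dict[str, List[List[int]]]:
    """Concatenate tokenized texts and split them into equal-sized chunks."""
    # Flatten every field (input_ids, attention_mask, ...) into one long list.
    concatenated = {
        key: [token for seq in seqs for token in seq]
        for key, seqs in examples.items()
    }
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing partial chunk so every chunk has exactly chunk_size tokens.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        key: [tokens[i:i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, tokens in concatenated.items()
    }
    # For masked language modeling the labels start as a copy of the inputs;
    # masking is applied later, typically by the data collator.
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Toy batch with made-up token IDs, only to show the chunking behavior.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1]],
}
toy_chunks = group_texts(toy_batch, chunk_size=3)
print(toy_chunks["input_ids"])
```

With a Hugging Face `datasets` object, a function of this shape is usually applied through `dataset.map(..., batched=True)`. Because the MITRE corpus is small, a smaller chunk size may waste fewer tokens in the dropped remainder.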

## Perplexity

Report the perplexity of the newly fine-tuned model.
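Perplexity is commonly computed as the exponential of the mean cross-entropy loss on held-out data, which is also how the template chapter reports it. A minimal sketch, assuming the loss is measured in nats; the helper name `perplexity_from_loss` and the loss values are hypothetical:

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


# Hypothetical evaluation losses, chosen only to show the calculation.
base_loss = math.log(20.0)
tuned_loss = math.log(10.0)
print(f"original:   {perplexity_from_loss(base_loss):.2f}")
print(f"fine-tuned: {perplexity_from_loss(tuned_loss):.2f}")
```

When training with the `transformers` `Trainer`, the loss would typically come from `trainer.evaluate()["eval_loss"]`. Evaluating both the original and the fine-tuned model on the same held-out split makes the two numbers comparable.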

## Downstream Task Test

* You should now have two models: the original one downloaded from Hugging Face and your fine-tuned one.

* Let's try a downstream task to see whether the classification rate changes once the fine-tuned model has learned some additional cybersecurity knowledge.

* In the 'Fine-tuning a masked language model' example, the 'Using our fine-tuned model' section tests the new model with a "fill-mask" pipeline.

* "Transformers, what can they do?" (https://huggingface.co/learn/nlp-course/en/chapter1/3) introduces several pipelines. Let's try 'Zero-shot classification'.

* Prepare more than 100 sentences from the website (not from the downloaded xlsx files) as your test examples.

* Feed these sentences into both the original model and your fine-tuned model, and ask each model which 'tactics' and 'techniques' every sentence belongs to.

* Show whether the classification rate for 'tactics' and 'techniques' increases (or not) when the fine-tuned model is used.

* Show some examples where the predicted 'tactics' or 'techniques' label actually changes when the new model is used.
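The comparison described in the list above can be sketched as follows, assuming each model has already produced one predicted label per test sentence. The helper names, the placeholder sentences, and the predictions are all hypothetical; only the label names come from the tactics table.

```python
from typing import List, Sequence, Tuple


def classification_rate(predictions: Sequence[str], gold_labels: Sequence[str]) -> float:
    """Fraction of sentences whose predicted label matches the gold label."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return correct / len(gold_labels)


def changed_predictions(
    sentences: Sequence[str],
    base_predictions: Sequence[str],
    tuned_predictions: Sequence[str],
) -> List[Tuple[str, str, str]]:
    """Return (sentence, base label, tuned label) wherever the two models disagree."""
    return [
        (sentence, base, tuned)
        for sentence, base, tuned in zip(sentences, base_predictions, tuned_predictions)
        if base != tuned
    ]


# Placeholder data: real runs would use 100+ sentences and actual model outputs.
sentences = ["sentence 1", "sentence 2", "sentence 3"]
gold = ["Collection", "Exfiltration", "Discovery"]
base_preds = ["Collection", "Discovery", "Discovery"]
tuned_preds = ["Collection", "Exfiltration", "Execution"]

print("original:  ", classification_rate(base_preds, gold))
print("fine-tuned:", classification_rate(tuned_preds, gold))
for row in changed_predictions(sentences, base_preds, tuned_preds):
    print(row)
```

The predictions themselves could come from a `transformers` `pipeline("zero-shot-classification", model=...)` call with the tactic or technique names as `candidate_labels`, taking the first entry of the returned `labels` list as the top prediction. That pipeline is designed around NLI-style sequence classification models, so how well a masked language model checkpoint works with it is something to verify rather than assume.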