<a href="https://colab.research.google.com/github/merishnaSuwal/bert-finetuning/blob/main/cyber_security_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition in the Cyber Security Domain

This Google Colab notebook performs named entity recognition (NER) in the cyber security domain by fine-tuning a BERT-derivative model (`xlm-roberta-large`) on the MITRE dataset. The dataset covers vulnerabilities, firmware, and cyber security.

## Installation

In [None]:
%%capture
!python3 -m pip install -U huggingface_hub
!python3 -m pip install -U accelerate
!python3 -m pip install -U transformers
!python3 -m pip install -U datasets evaluate
!python3 -m pip install -U seqeval

In [None]:
# @title
# Wrap long output lines in the notebook
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Data Preprocessing

## Load MITRE dataset

Load the MITRE dataset with our custom data loading script, `load_ner.py`. The `load_dataset` function reads the train, validation, and test splits.


In [None]:
!wget https://raw.githubusercontent.com/merishnaSuwal/bert-finetuning/main/CyberSecurity_NER/data/train.txt
!wget https://raw.githubusercontent.com/merishnaSuwal/bert-finetuning/main/CyberSecurity_NER/data/valid.txt
!wget https://raw.githubusercontent.com/merishnaSuwal/bert-finetuning/main/CyberSecurity_NER/data/test.txt
!wget https://raw.githubusercontent.com/merishnaSuwal/bert-finetuning/main/CyberSecurity_NER/data/load_ner.py

In [None]:
from datasets import load_dataset

data_files = {
    "train": "train.txt",
    "validation": "valid.txt",
    "test": "test.txt",
}

raw_datasets = load_dataset("load_ner.py", data_files=data_files)

Inspect the basic information of the loaded dataset.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2811
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 813
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 748
    })
})


Check a sample of tokens from the train dataset.



In [None]:
print(raw_datasets["train"][0]["tokens"])

['Super', 'Mario', 'Run', 'Malware', '#', '2', '–', 'DroidJack', 'RAT', 'Gamers', 'love', 'Mario', 'and', 'Pokemon', ',', 'but', 'so', 'do', 'malware', 'authors', '.']


Check the NER tag IDs of the same sample.

In [None]:
print(raw_datasets["train"][0]["ner_tags"])

[1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]


In [None]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-Malware', 'I-Malware', 'B-System', 'I-System', 'B-Organization', 'I-Organization', 'B-Indicator', 'I-Indicator', 'B-Vulnerability', 'I-Vulnerability'], id=None), length=-1, id=None)

### Check the labels in the dataset

In [None]:
label_names = ner_feature.feature.names
label_names  # there are 11 labels in total

['O',
 'B-Malware',
 'I-Malware',
 'B-System',
 'I-System',
 'B-Organization',
 'I-Organization',
 'B-Indicator',
 'I-Indicator',
 'B-Vulnerability',
 'I-Vulnerability']

### Display the token and labels

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
# Print each token next to its label name
for word, label in zip(words, labels):
    print(word, '\t', label_names[label])

Super 	 B-Malware
Mario 	 I-Malware
Run 	 I-Malware
Malware 	 I-Malware
# 	 O
2 	 O
– 	 O
DroidJack 	 B-Malware
RAT 	 I-Malware
Gamers 	 O
love 	 O
Mario 	 B-System
and 	 O
Pokemon 	 B-System
, 	 O
but 	 O
so 	 O
do 	 O
malware 	 O
authors 	 O
. 	 O


## Tokenization

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Define the inputs to the model
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())

['<s>', '▁Super', '▁Mario', '▁Run', '▁Mal', 'ware', '▁#', '▁2', '▁–', '▁Dro', 'id', 'Jack', '▁', 'RAT', '▁Gam', 'ers', '▁love', '▁Mario', '▁and', '▁Pokemon', '▁', ',', '▁but', '▁so', '▁do', '▁malware', '▁author', 's', '▁', '.', '</s>']


In [None]:
print(inputs.word_ids())

[None, 0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11, 12, 13, 14, 14, 15, 16, 17, 18, 19, 19, 20, 20, None]
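
The word IDs map each subword token back to the index of the word it came from; `None` marks the special tokens `<s>` and `</s>`. As a minimal sketch, the tokens printed above can be regrouped by word ID. The two lists are copied from the outputs above, and `group_by_word` is a hypothetical helper written for illustration, not part of `transformers`.

```python
from collections import defaultdict

# Copied from the tokenizer outputs above
tokens = ['<s>', '▁Super', '▁Mario', '▁Run', '▁Mal', 'ware', '▁#', '▁2', '▁–',
          '▁Dro', 'id', 'Jack', '▁', 'RAT', '▁Gam', 'ers', '▁love', '▁Mario',
          '▁and', '▁Pokemon', '▁', ',', '▁but', '▁so', '▁do', '▁malware',
          '▁author', 's', '▁', '.', '</s>']
word_ids = [None, 0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11, 12, 13,
            14, 14, 15, 16, 17, 18, 19, 19, 20, 20, None]


def group_by_word(tokens, word_ids):
    """Hypothetical helper: collect the subword tokens of each original word."""
    groups = defaultdict(list)
    for token, word_id in zip(tokens, word_ids):
        if word_id is not None:  # skip special tokens
            groups[word_id].append(token)
    return dict(groups)


groups = group_by_word(tokens, word_ids)
print(groups[3])  # ['▁Mal', 'ware']
print(groups[7])  # ['▁Dro', 'id', 'Jack']
```

Words such as `Malware` and `DroidJack` are split into several subwords, which is why the word-level labels have to be expanded before training.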


## Label Alignment

In [None]:
# Expand the word-level labels so that there is one label per subword token
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

### Get labels and word ids

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]
[-100, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
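
The `label % 2 == 1` test in `align_labels_with_tokens` depends on the label order of this dataset: every `B-` tag has an odd ID and the matching `I-` tag is the next even ID. A small sketch that checks this assumption against the label list shown earlier (`continuation_label` is a hypothetical name used only here):

```python
label_names = ['O', 'B-Malware', 'I-Malware', 'B-System', 'I-System',
               'B-Organization', 'I-Organization', 'B-Indicator', 'I-Indicator',
               'B-Vulnerability', 'I-Vulnerability']


def continuation_label(label_id):
    """Label for a non-initial subword: B-XXX becomes I-XXX, others are unchanged."""
    return label_id + 1 if label_id % 2 == 1 else label_id


for label_id, name in enumerate(label_names):
    if name.startswith("B-"):
        # B- tags must sit on odd IDs, directly before their I- counterpart
        assert label_id % 2 == 1
        assert label_names[continuation_label(label_id)] == "I-" + name[2:]
    else:
        assert continuation_label(label_id) == label_id

print(label_names[continuation_label(1)])  # I-Malware
```

If the label list were ordered differently, the parity trick would silently produce wrong labels, so a check like this is worth keeping when reusing the function on another dataset.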


In [None]:
# Helper function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

### Tokenize data

In [None]:
# Tokenize all the tokens and labels from the datasets
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Fine Tuning

## Data Collation

Prepare the data collator, which pads the inputs and labels of each batch to a common length for training.

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    1,    2,    2,    2,    2,    0,    0,    0,    1,    2,    2,
            2,    2,    0,    0,    0,    3,    0,    3,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    3,
            1,    2,    0,    0,    0,    0,    0,    0,    0,    3,    4,    4,
            0,    0,    3,    0,    0, -100, -100]])

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 3, 0, 0, -100]
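
The two samples have 31 and 30 labels, so the collator pads the shorter one with `-100`, the index that the loss function ignores. The effect on the labels can be sketched in plain Python; `pad_labels` is a hypothetical helper for illustration, not the collator's actual implementation.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


# Label lists copied from the output above
first = [-100, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 3, 0, 3,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
second = [-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0,
          0, 3, 4, 4, 0, 0, 3, 0, 0, -100]

padded = pad_labels([first, second])
print(len(padded[0]), len(padded[1]))  # 31 31
print(padded[1][-2:])                  # [-100, -100]
```

The last two entries of the second row match the `-100, -100` at the end of the batch tensor printed by the collator.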


## Setup Evaluation

Define the metric computation function.

In [None]:
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
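
`compute_metrics` delegates the scoring to seqeval, which reports precision, recall, and F1 over whole entity spans rather than individual tokens. The sketch below is a simplified illustration of that idea, not seqeval's implementation: `extract_entities` and `entity_scores` are hypothetical helpers, the tag sequences are made up, and stray `I-` tags without a preceding `B-` are simply ignored.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes the last span
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            entities.add((current_type, start, i))
            start, current_type = None, None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
    return entities


def entity_scores(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one tag sequence."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the System entity is predicted one token too late
true_tags = ["B-Malware", "I-Malware", "O", "B-System", "O"]
pred_tags = ["B-Malware", "I-Malware", "O", "O", "B-System"]

print(entity_scores(true_tags, pred_tags))  # (0.5, 0.5, 0.5)
```

In this example three of the five tokens are tagged correctly, yet only one of the two entities is matched exactly, which is why entity-level scores are usually lower than token accuracy.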

## Initialize the pretrained model from the checkpoint

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.config.num_labels

11

## Training

Start by generating a Hugging Face access token [here](https://huggingface.co/docs/hub/security-tokens).

In [None]:
from google.colab import userdata
from huggingface_hub import login, notebook_login

# Log in to the Hugging Face Hub
# notebook_login()
login(token=userdata.get('hugging_face'))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from transformers import TrainingArguments, Trainer

# Define training arguments
args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.136893,0.637375,0.730867,0.680927,0.966624
2,0.185600,0.125192,0.681044,0.765306,0.720721,0.971267
3,0.058300,0.142473,0.69711,0.769133,0.731352,0.970673


TrainOutput(global_step=1056, training_loss=0.11827445210832538, metrics={'train_runtime': 788.1473, 'train_samples_per_second': 10.7, 'train_steps_per_second': 1.34, 'total_flos': 1373294813753520.0, 'train_loss': 0.11827445210832538, 'epoch': 3.0})
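
The `global_step=1056` in the output can be reproduced from the dataset size. The sketch below assumes the `TrainingArguments` defaults that the cell above leaves untouched: a per-device batch size of 8, a single device, no gradient accumulation, and logging every 500 steps.

```python
import math

num_train_examples = 2811  # rows in the train split
batch_size = 8             # assumed default per-device batch size, single device
num_epochs = 3
logging_steps = 500        # assumed default logging interval

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 352 1056

# The first training loss is logged at step 500, which falls in epoch 2
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(first_logged_epoch)  # 2
```

Under these assumptions the first epoch ends at step 352, before any training loss has been logged, which would explain the `No log` entry in the first row of the results table.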

### Evaluation metrics

In [None]:
trainer.evaluate() # Compute metrics on validation data

{'eval_loss': 0.14247283339500427,
 'eval_precision': 0.6971098265895954,
 'eval_recall': 0.7691326530612245,
 'eval_f1': 0.7313523347483324,
 'eval_accuracy': 0.970673462975247,
 'eval_runtime': 14.0097,
 'eval_samples_per_second': 58.031,
 'eval_steps_per_second': 7.281,
 'epoch': 3.0}

## Save the model

In [None]:
# Assumes Google Drive is already mounted at /content/drive
saved_model_path = '/content/drive/MyDrive/cyber_ner/'
trainer.save_model(saved_model_path)

## Evaluation on Test set

In [None]:
predictions = trainer.predict(tokenized_datasets["test"])

In [None]:
from tabulate import tabulate

metrics = ['precision', 'recall', 'f1', 'accuracy']
prediction_results = []

for key, val in predictions.metrics.items():
    if any(item in key for item in metrics):
        # Format as a percentage with two decimals (avoids floating-point noise from round(...) * 100)
        prediction_results.append([key, f"{val:.2%}"])

print(tabulate(prediction_results, headers=['Metric', 'Score']))

Metric          Score
--------------  -------
test_precision  61.35%
test_recall     69.87%
test_f1         65.33%
test_accuracy   96.54%


The fine-tuned model achieves moderate results on the test set.

- Precision is the fraction of predicted entities that are correct. A precision of **61.35%** means that about 61 of every 100 predicted entities match a labeled entity.
- Recall is the fraction of labeled entities that the model finds. A recall of **69.87%** means that the model identifies about 70% of the entities actually present in the data.
- The F1 score of **65.33%** is the harmonic mean of precision and recall. It is more informative than accuracy when the class distribution is uneven.
- The accuracy of **96.54%** is computed per token and is dominated by the majority `O` class, so it can be misleading on an imbalanced dataset like this one. Precision, recall, and F1 give a more comprehensive evaluation.
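
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean:

```python
precision = 0.6135  # test_precision from the table above
recall = 0.6987     # test_recall from the table above

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2%}")  # 65.33%
```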

### Key observations:

- Recall (69.87%) is higher than precision (61.35%), so the model finds most of the entities but also predicts a fair number of spurious or wrongly bounded ones.
- Token-level accuracy is high, but it mostly reflects the many `O` tokens; the F1 score is the better indicator of entity-level performance.
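
To see why token-level accuracy can look strong even when entities are missed, consider a trivial baseline that tags every token as `O`. The sketch below scores it on the first training sentence, using the tag IDs shown earlier in the notebook; the baseline itself is hypothetical and only serves as an illustration.

```python
# NER tag IDs of the first training sentence (0 is the 'O' label)
gold = [1, 2, 2, 2, 0, 0, 0, 1, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]
all_o = [0] * len(gold)  # hypothetical baseline: predict 'O' everywhere

accuracy = sum(g == p for g, p in zip(gold, all_o)) / len(gold)
gold_entities = sum(1 for g in gold if g % 2 == 1)        # B- tags have odd IDs
predicted_entities = sum(1 for p in all_o if p % 2 == 1)

print(f"{accuracy:.1%}")                  # 61.9%
print(gold_entities, predicted_entities)  # 4 0
```

The baseline gets most tokens right without finding any of the four entities, so its recall is zero. On the full dataset, where `O` tokens are even more frequent, the same effect inflates accuracy further.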

## Inference

We now run the fine-tuned model on a few sentences to inspect its token classification output.

In [None]:
from transformers import pipeline

token_classifier = pipeline(
    "token-classification", model=saved_model_path, aggregation_strategy="simple"
)


In [None]:
results = token_classifier("vulnerabilities reported BLU Products, founded in 2009, makes lower-end Android-powered smartphones that sell for as little as $50 on Amazon company.")

prediction_results = []
for each_entity in results:
    prediction_results.append([each_entity['word'], each_entity['entity_group']])

print(tabulate(prediction_results, headers=['Word', 'Predictions']))


Word          Predictions
------------  -------------
BLU Products  Organization
Android-      System
Amazon        Organization


In [None]:
test_sentence = "The	finding, in part, shows the risk that can come in opting for less expensive smartphones , whose manufacturers may not diligently fix security vulnerabilities."
results = token_classifier(test_sentence)

prediction_results = []
for each_entity in results:
    prediction_results.append([each_entity['word'], each_entity['entity_group']])

print(tabulate(prediction_results, headers=['Word', 'Predictions']))


Word             Predictions
---------------  -------------
vulnerabilities  Vulnerability


In summary, the model performs reasonably well, with an entity-level F1 score of 65.33% on the test set. Analyzing specific errors and experimenting with different hyperparameters and optimization techniques are natural next steps for improving performance.

## THANK YOU!