# INTRODUCTION - DeviceBERT: Using Enriched Named Entity Recognition to Identify Medical Device Terminology in Device Recalls Data.

This notebook provides the code needed to prepare the data, train and implement DeviceBERT, a language model based on BioBERT, which is trained and finetuned to perform entity recognition of Medical Device terms.

To prepare and train DeviceBERT, we create an enriched BERT tokenizer, augmented with medical device vocabulary curated from various Open FDA datasets. The BioBERT model is then trained on a corpus of Medical Device Recalls data, annotated in BIO format using Doccano.

The Datasets and Model created from this project have been made available on Huggingface hub for use in further training and inferencing tasks:

* **DeviceBERT Model Card:** https://huggingface.co/mfarrington/device_recalls_ner_model
* **NER Recalls Training Dataset:** https://huggingface.co/mfarrington/biobert-ner-fda-recalls-dataset
* **Enriched DeviceBERT Tokenizer:** https://huggingface.co/mfarrington/DeviceBERT-tokenizer

**Background**: This project was created as a student final project for Stanford CS224N: Natural Language Processing with Deep Learning. In this project, we demonstrate that the challenges inherent in sub-domain-specific NER tasks can be alleviated through an enriched approach to data preparation, tokenization and regularization.

**Paper:** *DeviceBERT: Addressing Domain-Specific Learning Challenges Using Vocabulary Enrichment to Identify Medical Device Terminology in FDA Recall Action Summaries*

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
from google.colab import userdata
userdata.get('HF_TOKEN')

In [None]:
#Install additional dependencies
!pip install datasets
!pip install seqeval
!pip install evaluate
!pip install accelerate -U
!pip install huggingface_hub
!pip install transformers
!pip install nltk

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
import pandas as pd
import numpy as np
import random
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
import json
import re
import torch
from tqdm import tqdm
import tabulate as tb
from datasets import load_dataset, Dataset  # metrics are loaded through `evaluate` below
import evaluate
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of PyTorch's
from transformers import (AutoModelForTokenClassification,
                          AutoTokenizer,
                          TrainingArguments,
                          DataCollatorForTokenClassification,
                          Trainer,
                          BertTokenizer,
                          BertForTokenClassification,
                          BertModel)

In [None]:
#copy the device recalls and annotation file to the working directory
!cp drive/MyDrive/Stanford/annotated-2000.jsonl .
!cp drive/MyDrive/Stanford/device-recall-0001-of-0001.json .
!cp drive/MyDrive/Stanford/foiclass.txt .

# PART 1 - Preparing, Transforming and Tokenizing the Device Recalls Dataset

This part of the notebook performs the data preparation, cleaning, tokenization and alignment for the NER dataset used to train the BioBERT model. The dataset is created in Doccano from 2000 Device Recall actions annotated with NER tags for device recognition (B-DEVICE, I-DEVICE, O-DEVICE).

Note: Data annotation is a separate step that is performed in advance. To run this section of the notebook, you will need the original JSONL annotation file ('annotated-2000.jsonl'). Doccano is the open-source annotation tool used to perform the NER annotations. To view the original annotations, or perform additional annotations on the data, you can install and run Doccano using the instructions here: https://doccano.github.io/doccano/.

In [None]:
# Load the JSON Recalls data
with open('device-recall-0001-of-0001.json') as f:
  data = json.load(f)

# Load the foiclass text data into a dataframe
foi_df = pd.read_csv('foiclass.txt', sep='|', header=0, encoding='latin1')
foi_df = foi_df[['DEVICENAME']].rename(columns={'DEVICENAME': 'device-name-foiclass'})

In [None]:
# Extract the device name, recall action and product description fields
rows = []
for result in data['results']:
    device_name = result.get('openfda', {}).get('device_name', None)
    recall_action = result.get('action', None)
    product_description = result.get('product_description', None)
    if device_name is not None:
        device_name = re.sub(r'[^a-zA-Z0-9\s]', '', device_name)  # Remove special characters
    if recall_action is not None:
        recall_action = re.sub(r'[^a-zA-Z0-9\s]', '', recall_action)  # Remove special characters
    if product_description is not None:
        product_description = re.sub(r'[^a-zA-Z0-9\s]', '', product_description)  # Remove special characters

    row = {
        'id': result.get('cfres_id'),
        'device_name': device_name,
        'product_description': product_description,
        'recall_action': recall_action
    }
    rows.append(row)

# Create a dataframe and collapse newlines and repeated whitespace into single spaces
df = pd.DataFrame(rows)
df = df.applymap(lambda x: ' '.join(x.split()) if isinstance(x, str) else x)
print(df.size)

#Break out each object into a dataframe and remove NAs
device_df = df[['id', 'device_name']].dropna(subset=['device_name'])
product_df = df[['id', 'product_description']].dropna(subset=['product_description'])
recall_df = df[['id', 'recall_action']].dropna(subset=['recall_action'])
foi_df = foi_df.dropna(subset=['device-name-foiclass'])

In [None]:
#Tokenize each column into words
device_df['tokens'] = device_df['device_name'].apply(word_tokenize)
product_df['tokens'] = product_df['product_description'].apply(word_tokenize)
recall_df['tokens'] = recall_df['recall_action'].apply(word_tokenize)
foi_df['tokens'] = foi_df['device-name-foiclass'].apply(word_tokenize)

# Print the resulting DataFrame
print (device_df.size)
print (product_df.size)
print (recall_df.size)
print (foi_df.size)

In [None]:
# Merge the device, product, recall and foiclass tokens into a single table
token_df = pd.concat([device_df, product_df, recall_df, foi_df], ignore_index=True)
token_df = token_df[['tokens']]

# Flatten the per-row token lists into one list of tokens
flattened_token_vocab = [token for tokens in token_df['tokens'] for token in tokens]

print('BEFORE processing', len(flattened_token_vocab))
flattened_token_vocab = list(set(flattened_token_vocab))  # remove duplicate tokens
random.shuffle(flattened_token_vocab)
# Remove values which are purely numeric from the vocab
vocab_list = [element for element in flattened_token_vocab if not element.isdigit()]
print('AFTER Processing', len(vocab_list))

# Extract and Pre-Process the Annotated Device Recalls Data
In this step we load the training data, which has been previously annotated in Doccano with the BIO (Beginning, Inside, Outside) tags for medical devices (B-DEVICE, I-DEVICE, O-DEVICE). We perform validations to ensure the data contains no duplicate rows and follows the expected schema. We also drop the unneeded 'Comments' column, which is used by Doccano.

In [None]:
filename = 'annotated-2000.jsonl'

# Define the dataframe schema
schema = {
    'id': str,
    'text': str,
    'Comments': str,
    'label': object  # will hold a list of lists with the label info
}

# Load the JSONL data into a DataFrame
df = pd.read_json(filename, lines=True, orient='records', dtype=schema)

# Convert each entry of the 'label' column to a list of (start, end, tag) tuples
df['label'] = df['label'].apply(lambda x: [tuple(l) for l in x])

In [None]:
df = df.drop('Comments', axis=1)
df = df.drop_duplicates(subset=['text'])
print(tb.tabulate(df.head(5), headers='keys', tablefmt='psql'))
print(df["id"].value_counts())

# Transform the tokenized data to a Dataset and apply NER Labels
In this step the dataframe is transformed into a dataset dictionary format, which is tokenized and re-indexed to identify the correct label spans for applying the NER tags from the annotations.

In [None]:
def transform_dataframe_to_dataset(df):
    dataset_dict = {
        "id": [],
        "ner_tags": [],
        "tokens": []
    }

    for index, row in df.iterrows():
        # Add id
        dataset_dict["id"].append(row["id"])

        # Tokenize text
        tokens = row["text"].split()

        # Initialize NER tags
        ner_tags = ["O"] * len(tokens)

        # Process labels
        for label in row["label"]:
            start, end, tag = label
            # Find token indices corresponding to label spans
            start_token_idx = None
            end_token_idx = None
            for i, token in enumerate(tokens):
                if start >= len(" ".join(tokens[:i])):
                    start_token_idx = i
                if end <= len(" ".join(tokens[:i + 1])):
                    end_token_idx = i + 1
                    break
            # Update NER tags with NER tag labels
            if start_token_idx is not None and end_token_idx is not None:
                ner_tags[start_token_idx] = f"{tag}"
                for i in range(start_token_idx + 1, end_token_idx):
                    ner_tags[i] = f"{tag}"

        dataset_dict["tokens"].append(tokens)
        dataset_dict["ner_tags"].append(ner_tags)

    return Dataset.from_dict(dataset_dict)

In [None]:
dataset = transform_dataframe_to_dataset(df) #create the dataset
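
To make the span re-indexing concrete, here is a minimal sketch of mapping character offsets onto whitespace tokens. It is not the function used above: it tracks each token's character offsets explicitly with `re.finditer`, which keeps the mapping correct even when the text contains repeated spaces. The helper name `spans_to_tags`, the sample sentence and its offsets are hypothetical, and the sketch assumes each annotation is a `(start, end, tag)` triple with an exclusive `end`.

```python
import re

def spans_to_tags(text, spans, outside="O"):
    """Assign each whitespace token the tag of the annotated span that overlaps it."""
    # Record every token together with its character offsets in the original text
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = [outside] * len(tokens)
    for start, end, tag in spans:
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span when the two character ranges overlap
            if tok_start < end and tok_end > start:
                tags[i] = tag
    return [tok for tok, _, _ in tokens], tags

# Hypothetical example text and annotations (character offsets, end exclusive)
text = "Recall of  infusion pump units"
spans = [(11, 19, "B-DEVICE"), (20, 24, "I-DEVICE")]
tokens, tags = spans_to_tags(text, spans)
print(list(zip(tokens, tags)))
```

Tokens not covered by any span keep the `outside` tag, which mirrors how the function above initializes every tag to `"O"` before applying the annotations.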

# Perform the label to ID mapping
In this step the NER labels are mapped to a corresponding numerical ID, which is applied to the dataset using the existing label locations in the ner_tags element.

In [None]:
# create a map of the expected ids to their labels
id2label = {
    0: 'O-DEVICE',
    1: 'B-DEVICE',
    2: 'I-DEVICE',
    -100: 'O',}

# create a map of the expected labels to their ids
label2id = {
    'O-DEVICE': 0,
    'B-DEVICE': 1,
    'I-DEVICE': 2,
    'O': -100,}

In [None]:
def transform_labels(example):
    # Map each string tag to its integer id; unknown tags fall back to -100
    new_labels = [label2id.get(tag, -100) for tag in example["ner_tags"]]
    # Keep a human-readable copy of the labels alongside the ids
    labels = [id2label.get(label_id, 'O') for label_id in new_labels]
    return {"ner_tags": new_labels, "labels": labels}

# map string labels to IDs
dataset = dataset.map(transform_labels)

# Create a Train/Test Split of the Dataset

In [None]:
dataset = dataset.train_test_split(test_size=0.2, shuffle=True)  # split the dataset into train and test sets

# Optional - Push Dataset to Hub
In this step we use the Hugging Face CLI to log in and push the updated dataset to the Hugging Face Hub, making it easier to use in downstream tasks.

In [None]:
!huggingface-cli login #enter token when prompted

In [None]:
# Optional: push the dataset to the hub (comment out this line to skip)
dataset.push_to_hub("mfarrington/biobert-ner-fda-recalls-dataset")

# Load the FDA Recalls NER Dataset from Huggingface Hub
In this step we load the latest version of the recalls NER dataset created in the earlier process from the hub.

In [None]:
#load the dataset
hub_dataset = load_dataset("mfarrington/biobert-ner-fda-recalls-dataset")

# Tokenize data using BERT Tokenizer and align the NER Labels
In this step, we'll use a pretrained DistilBERT tokenizer to tokenize the recalls data and examine how it handles medical device terminology.

In [None]:
# Load DistilBERT baseline tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased", use_fast=True)

### In this step we test the tokenizer before expanding its vocabulary to observe how it resorts to sub-word tokenization of medical device terms
As can be seen from the example, the tokenizer performs sub-optimally on words not in its vocabulary; the medical device terms are tokenized into a series of subwords.


In [None]:
#Tokenize device terms
print(tokenizer.tokenize('Colonoscope'))
print(tokenizer.tokenize('Thoracolumbosacral'))
print(tokenizer.tokenize('Rhinoanemometer'))
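
The subword behavior can be illustrated with a toy greedy longest-match splitter in the style of WordPiece. This is a simplified sketch, not the tokenizer's actual implementation, and `toy_vocab` is a hypothetical four-entry vocabulary rather than the real one. In the real tokenizer, tokens registered through `add_tokens` are matched before the WordPiece step, but the visible effect on a matching word is similar: it comes back as one token.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real tokenizer vocabulary
toy_vocab = {"Col", "##ono", "##scope", "##s"}
before = wordpiece("Colonoscope", toy_vocab)
toy_vocab.add("Colonoscope")  # enrich the vocabulary with the full device term
after = wordpiece("Colonoscope", toy_vocab)
print(before, after)
```

Once the full term is in the vocabulary, the longest match is the whole word, so no subword pieces are produced.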

# Enhance BERT Tokenizer Vocabulary to include Medical Device terminology
In this section, we use the corpus of medical device terms extracted and prepared earlier to enrich the tokenizer's vocabulary. We experiment with different fractions of the vocabulary (100%, 50%, 25%) to identify how many added tokens give the best results.

In [None]:
#Create vocabulary set splits of 100%, 50%, 25%
midpoint = len(vocab_list) // 2
vocab_list_50 = vocab_list[:midpoint]
random.shuffle(vocab_list_50)
midpoint = len(vocab_list_50) // 2
vocab_list_25 = vocab_list_50[:midpoint]

print(len(vocab_list))
print(len(vocab_list_50))
print(len(vocab_list_25))


print("[BEFORE] tokenizer vocab size:", len(tokenizer))
added_tokens = tokenizer.add_tokens(vocab_list)
print("[ AFTER ] tokenizer vocab size:", len(tokenizer))
print('added_tokens:',added_tokens)

In [None]:
#Verify the device terms are tokenized without subwords
print(tokenizer.tokenize('Colonoscope'))
print(tokenizer.tokenize('Thoracolumbosacral'))
print(tokenizer.tokenize('Rhinoanemometer'))

In [None]:
#Optional - push the vocab enriched tokenizer to hub
tokenizer.save_pretrained('devicebert-tokenizer')
tokenizer.push_to_hub('devicebert-tokenizer')

# Realign the NER Labels and Tokens
Because the BERT tokenizer applies subword tokenization and adds special tokens to the data, we need to realign the NER tags and BIO labels to ensure they accurately span the tokenized data. We also truncate inputs to the maximum length of the tokenizer model, in this case 512.

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["ner_tags"] = labels
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [None]:
tokenized_dataset = hub_dataset.map(tokenize_and_align_labels, batched=True)
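
The alignment rule can be checked by hand on a small input. The sketch below applies the same "label only the first subword" logic to a hand-written `word_ids` list, so it runs without a tokenizer. The helper name `align_labels` and the example values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special tokens such as [CLS] and [SEP]
            aligned.append(ignore_index)
        elif word_id != previous:      # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                          # later subwords of the same word
            aligned.append(ignore_index)
        previous = word_id
    return aligned

# Hypothetical word_ids for the pieces: [CLS] Col ##ono ##scope recalled [SEP]
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [1, 0]   # 1 = B-DEVICE, 0 = O-DEVICE in the id mapping defined earlier
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, -100, 0, -100]
```

Positions set to -100 are skipped by the default PyTorch cross-entropy loss, so only one prediction per original word contributes to training.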

In [None]:
def map_labels(example):
    labels = example['labels']
    mapped_labels = [id2label.get(label, 'O') for label in labels]
    example['labels'] = mapped_labels
    return example

final_dataset = tokenized_dataset.map(map_labels)

In [None]:
print(final_dataset['train'][1])

# PART 2 - Training, Finetuning, Regularizing and Evaluating the Model on the Annotated NER Dataset

This section of the notebook performs the model training and finetuning. To establish a baseline metric, we start with the BERT and BioBERT Base Cased transformers:

* BERT: `google-bert/bert-base-cased` (the checkpoint loaded later in this notebook)
* BioBERT: https://huggingface.co/dmis-lab/biobert-base-cased-v1.2

Because of the relatively small size of the dataset (~2000 recall records), the baseline model overfits the training data. We therefore apply several regularization techniques in this section to reduce overfitting. After training and evaluating the model to achieve acceptable results, the model is saved to the Hub and we can move on to Part 3 - Inferencing using the pretrained saved model.

The Datasets and Model created from this project have been made available on Huggingface hub for use in further training and inferencing tasks.

**Huggingface Model Card:** https://huggingface.co/mfarrington/device_recalls_ner_model \
**NER Recalls Training Dataset:** https://huggingface.co/mfarrington/biobert-ner-fda-recalls-dataset

## Define a function to compute the model scores

In [None]:
label_list = list(id2label.values()) #Extract the device labels

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=2)

    # Keep only positions with a real label; -100 marks special tokens,
    # non-initial subwords and words tagged 'O'
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
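
For intuition about what entity-level scores measure, the following sketch computes precision, recall and F1 over exact span matches for a single BIO-tagged sequence. It is a simplified illustration under stated assumptions (plain `B-`/`I-`/`O` tags, a single entity type, stray `I-` tags ignored) and does not reproduce seqeval's implementation. The helper names `extract_spans` and `span_prf` and the example tags are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end) entity spans in a BIO tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def span_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one sequence (exact span match)."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_tags = ["B-DEVICE", "I-DEVICE", "O", "B-DEVICE"]
pred_tags = ["B-DEVICE", "I-DEVICE", "O", "O"]
precision, recall, f1 = span_prf(true_tags, pred_tags)
print(precision, recall, f1)
```

Here the prediction finds one of the two annotated devices and proposes nothing wrong, so precision is perfect while recall is halved.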


## Train and Finetune BioBERT on the Enhanced Vocabulary, with Regularization techniques.

In [None]:
#Initialize the evaluator
seqeval = evaluate.load("seqeval")

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
#Initialize BioBERT
deviceBERT_train_model = BertForTokenClassification.from_pretrained("dmis-lab/biobert-base-cased-v1.2",
    num_labels=4,
    id2label=id2label,
    label2id=label2id,
    )

#Resize the model based on the length of the BERT tokenizer after adding the new device vocabulary
deviceBERT_train_model.resize_token_embeddings(len(tokenizer))

# Apply dropout

In [None]:
import torch.nn as nn
dropout_prob = 0.1

#Regularization - applying dropout to the BERT embeddings layer
deviceBERT_train_model.bert.embeddings.dropout = nn.Dropout(dropout_prob)

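#Regularization - applying dropout to the BERT encoding layer
# Note: BertEncoder's forward pass does not call a top-level `dropout` module, so this
# assignment only registers the module; dropout inside the encoder layers is governed by
# `hidden_dropout_prob` and `attention_probs_dropout_prob` in the model config.
deviceBERT_train_model.bert.encoder.dropout = nn.Dropout(dropout_prob)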

In [None]:
training_args = TrainingArguments(
    output_dir="devicebert-base-cased-v1.0",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    warmup_steps=500,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
    bf16=True,
)

In [None]:
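# Define the optimizer (only used if passed to the Trainer via `optimizers=`;
# otherwise the Trainer builds its own AdamW from `training_args`)
optimizer = AdamW(deviceBERT_train_model.parameters(), lr=training_args.learning_rate)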

### Perform Cross-Validation

In [None]:
from sklearn.model_selection import KFold

# Use KFold for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Store results for each fold
all_results = []


for train_index, val_index in kf.split(tokenized_dataset['train']):
    train_subsampler = Dataset.from_dict(tokenized_dataset['train'][train_index])
    val_subsampler = Dataset.from_dict(tokenized_dataset['train'][val_index])

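    # Note: the same model instance is passed to every fold, so each fold continues
    # training from the weights left by the previous one.
    trainer = Trainer(
        model=deviceBERT_train_model,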
        args=training_args,
        train_dataset=train_subsampler,
        eval_dataset=val_subsampler,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        #optimizers=(optimizer, None)
    )

    trainer.train()

    results = trainer.evaluate()
    all_results.append(results)

# Aggregate results
avg_precision = np.mean([result['eval_precision'] for result in all_results])
avg_recall = np.mean([result['eval_recall'] for result in all_results])
avg_f1 = np.mean([result['eval_f1'] for result in all_results])
avg_accuracy = np.mean([result['eval_accuracy'] for result in all_results])

print(f" Average Precision: {avg_precision}")
print(f" Average Recall: {avg_recall}")
print(f" Average F1: {avg_f1}")
print(f" Average Accuracy: {avg_accuracy}")
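
As a point of comparison, the sketch below shows the fold bookkeeping in plain Python, with a fresh model created inside each fold so that no fold starts from weights trained on another fold's data. This is one common way to organize cross-validation, not the procedure used in the cell above. `kfold_indices` is a hand-rolled stand-in for illustration (it does not reproduce scikit-learn's exact fold assignment), and `make_model` is a hypothetical placeholder for reloading the pretrained checkpoint.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs with shuffled, disjoint validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx

def make_model():
    """Hypothetical factory: stands in for reloading the pretrained checkpoint."""
    return {"trained_on": []}

for fold, (train_idx, val_idx) in enumerate(kfold_indices(10, n_splits=5)):
    model = make_model()              # fresh weights for every fold
    model["trained_on"] = train_idx   # placeholder for the actual training call
    assert not set(train_idx) & set(val_idx)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```

With per-fold re-initialization, the averaged metrics estimate how a model trained from the pretrained checkpoint generalizes, rather than how a progressively trained model scores on data it may already have seen.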

In [None]:
#Capture the AdamW parameters
def print_adamw_params(trainer):
    print("AdamW Optimizer Parameters:")
    print(f"  Learning Rate (α): {trainer.args.learning_rate}")
    print(f"  Beta 1 (β1): {trainer.args.adam_beta1}")
    print(f"  Beta 2 (β2): {trainer.args.adam_beta2}")
    print(f"  Epsilon (ε): {trainer.args.adam_epsilon}")
    print(f"  Weight Decay (w): {trainer.args.weight_decay}")

print_adamw_params(trainer)

In [None]:
#Capture the batch size and input length
batch_size = trainer.args.per_device_train_batch_size
input_length = trainer.tokenizer.model_max_length

print(f"Batch Size: {batch_size}")
print(f"Input Length: {input_length}")

# Define the Trainer class for remaining Experiments

In [None]:
trainer = Trainer(
    model=deviceBERT_train_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
trainer.push_to_hub()

## Initialize the 3 models
In this step we define 3 models: our BERT baseline model, our bioBERT baseline model, and our deviceBERT model.

In [None]:
from transformers import AutoModelForTokenClassification

bert_model = BertForTokenClassification.from_pretrained("google-bert/bert-base-cased",
    num_labels=4,
    id2label=id2label,
    label2id=label2id)

biobert_model = BertForTokenClassification.from_pretrained("dmis-lab/biobert-base-cased-v1.2",
    num_labels=4,
    id2label=id2label,
    label2id=label2id)

devicebert_model = BertForTokenClassification.from_pretrained("mfarrington/devicebert-base-cased-v1.0",
    num_labels=4,
    id2label=id2label,
    label2id=label2id)

model_dict = {'bert model': bert_model, 'biobert model': biobert_model, 'devicebert model': devicebert_model}

# Evaluate all 3 Model Performance (BERT base, BioBERT base, DeviceBERT) and record outputs.

In [None]:
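# Note: `tokenizer` and `tokenized_dataset` use the enriched vocabulary, so a model whose
# embedding matrix is smaller than len(tokenizer) must first call
# model.resize_token_embeddings(len(tokenizer)).
for model_name, model in model_dict.items():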

  training_args = TrainingArguments(
      output_dir="device_recalls_ner_baseline",
      learning_rate=2e-5,
      per_device_train_batch_size=16,
      per_device_eval_batch_size=16,
      num_train_epochs=10,
      weight_decay=0.02,
      eval_strategy="epoch",
      save_strategy="epoch",
      load_best_model_at_end=True,
      push_to_hub=True,
      bf16=True,
  )

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized_dataset["train"],
      eval_dataset=tokenized_dataset["test"],
      tokenizer=tokenizer,
      data_collator=data_collator,
      compute_metrics=compute_metrics,
  )

  # Define the optimizer (not passed to the Trainer, which builds its own AdamW by default)
  optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)

  print(f'Training and evaluating {model_name} .....')
  trainer.train()

# Create a pipeline and perform inferencing utilizing DeviceBERT on a new Recall Action

In [None]:
from transformers import pipeline

# Load the NER pipeline
ner_pipeline = pipeline("ner", model="mfarrington/devicebert-base-cased-v1.0", tokenizer=tokenizer)

# Input text
input_text = """Philips Healthcare sent an "URGENT-MEDICAL DEVICE RECALL" letter dated October 15, 2012 to all affected customers.  The letter identified the product, problem, and actons to be taken by the
customers.  Customers were instructed to inspect all casters of the unit to ensure that they are all secured. If a caster is loose, customers were told to lock the caster in place, limit movement of the
cart and contact their local Phillips Invivo Representative. Nevertheless, a Philips Invivo representative will contact the customer regarding their affected device.  All affected devices will have new
casters installed in order to correct the problem.  Contact your local Philips Invivo Representative at 1-800-722-9377 for further information and support."""

# Perform NER prediction
ner_results = ner_pipeline(input_text)

# Print only the tokens predicted with a score of at least 0.99
threshold = 0.99
for result in ner_results:
    if result['score'] >= threshold:
        print(f"Token: {result['word']} - Label: {result['entity']}")

## Plot the combined score chart for all the experiments

In [None]:
import matplotlib.pyplot as plt

# Define the models and their corresponding precision, recall, and F1 scores
models = ['BERT', 'BioBERT', 'DeviceBERT (+Reg only)', 'DeviceBERT (+Vocab 100%)', 'DeviceBERT (+Vocab 50%)', 'DeviceBERT (+Vocab 25%)', 'DeviceBERT (+Reg+Vocab)']
precision = [72.96, 73.42, 85.14, 75.59, 81.56, 80.14, 82.37]
recall = [73.99, 73.29, 82.07, 73.46, 80.11, 77.87, 78.52]
f1_score = [73.47, 73.35, 83.56, 74.51, 80.83, 78.91, 80.37]

# Sort the models based on F1 score
sorted_indices = sorted(range(len(f1_score)), key=lambda i: f1_score[i], reverse=True)
models_sorted = [models[i] for i in sorted_indices]
precision_sorted = [precision[i] for i in sorted_indices]
recall_sorted = [recall[i] for i in sorted_indices]
f1_score_sorted = [f1_score[i] for i in sorted_indices]

# Identify the best performing model (highest F1 score)
best_model_index = f1_score_sorted.index(max(f1_score_sorted))
best_model = models_sorted[best_model_index]

plt.figure(figsize=(8, 8))

# Plotting precision, recall, and F1 score
plt.plot(range(len(models_sorted)), precision_sorted, marker='o', linestyle='-', label='Precision')
plt.plot(range(len(models_sorted)), recall_sorted, marker='o', linestyle='-', label='Recall')
plt.plot(range(len(models_sorted)), f1_score_sorted, marker='o', linestyle='-', label='F1 Score')

# Annotate the best performing model with red color
plt.annotate(best_model, xy=(best_model_index, f1_score_sorted[best_model_index]),
             xytext=(best_model_index + 0.2, f1_score_sorted[best_model_index] - 1),
             arrowprops=dict(facecolor='red', arrowstyle='->'))

# Set x-ticks as model names
plt.xticks(range(len(models_sorted)), models_sorted, rotation=45, ha='right')

# Set labels and title
plt.ylabel('Scores (%)')
plt.xlabel('Model')
plt.title('Performance of NER Models')

# Set legend
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()
