<a href="https://colab.research.google.com/github/ajtamayoh/NeRUBioS/blob/main/Source%20code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Source code for the paper:

Augmenting a Spanish Clinical Dataset for Transformer-Based Linking of Negations and their Out-of-Scope References

Authors:

Antonio Tamayo (ajtamayo2019@ipn.cic.mx, ajtamayoh@gmail.com)

Diego A. Burgos (burgosda@wfu.edu)

Alexander Gelbukh (gelbukh@gelbukh.com)

For bugs or questions related to the code, do not hesitate to contact us (Antonio Tamayo: ajtamayoh@gmail.com)

If you use this code, please cite our work:

Coming soon...

# Requirements

To run this code you need to download the dataset (three partitions: NeRUBioS_train.json, NeRUBioS_dev.json and NeRUBioS_test.json) at: [NeRUBioS dataset](https://github.com/ajtamayoh/NeRUBioS/tree/main/NeRUBios%20dataset)

Then create a folder called "Datasets" in the root of your Google Drive and upload the three downloaded files to it.

Once the dataset is in place, [open this notebook in Colab](https://colab.research.google.com/github/ajtamayoh/NeRUBioS/blob/main/Source%20code.ipynb) and save a copy in your Google Drive.

## About the infrastructure

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

## Connecting to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
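
Once Drive is mounted, it can help to confirm that the three partition files are where the rest of the notebook expects them. The following is a minimal sketch, not part of the original code: the helper name `missing_dataset_files` is hypothetical, while the file names and the folder path are the ones given in the Requirements section and used in the loading cell.

```python
from pathlib import Path

# File names come from the Requirements section of this notebook.
EXPECTED_FILES = ("NeRUBioS_train.json", "NeRUBioS_dev.json", "NeRUBioS_test.json")


def missing_dataset_files(dataset_dir="/content/drive/MyDrive/Datasets"):
    """Return the expected partition files that are not present in dataset_dir."""
    folder = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


missing = missing_dataset_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All three NeRUBioS partitions were found.")
```

If any file is reported as missing, check the folder name and its location in Drive before running the loading cells.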

## Identifying and linking out-of-scope negation references as a token classification problem

### Exploring & Preprocessing Data

In [None]:
import pandas as pd
import numpy as np
import spacy

### Install the Transformers and Datasets libraries to run this notebook.

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs
!pip install seqeval evaluate

## Hugging Face Authentication

If you want to save your own model and make it available online, we strongly recommend signing up at: https://huggingface.co/

You will need to set up git: adapt your email and name in the following cell.

In [None]:
!git config --global user.email "your@email"
!git config --global user.name "your_username"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Loading the Dataset

In [None]:
from datasets import load_dataset
import json

# NeRUBioS dataset
nerubios_dataset_train = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/NeRUBioS_train.json", field="data")
nerubios_dataset_dev = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/NeRUBioS_dev.json", field="data")
nerubios_dataset_test = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/NeRUBioS_test.json", field="data")
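
The `field="data"` argument tells `load_dataset` to read the records stored under a top-level `"data"` key. The sketch below illustrates that layout with the standard library only. It assumes each record carries the `tokens` and `ner_tags` columns used later in this notebook; the two-token record and the file name `toy_partition.json` are made up for illustration, and real NeRUBioS records may contain additional fields.

```python
import json
import os
import tempfile

# Hypothetical toy record: tag ids follow the NeRUBioS tagset (3 = B-NEG, 5 = B-NSCO).
sample = {"data": [{"tokens": ["No", "fiebre"], "ner_tags": [3, 5]}]}

path = os.path.join(tempfile.mkdtemp(), "toy_partition.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Reading the file back by hand mirrors what field="data" selects.
with open(path, encoding="utf-8") as f:
    records = json.load(f)["data"]

print(len(records), "record(s); columns:", sorted(records[0]))
```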

In [None]:
from datasets import DatasetDict
#For training, development, and testing partitions
raw_datasets = DatasetDict({
    'train': nerubios_dataset_train['train'],
    'validation': nerubios_dataset_dev['train'],
    'test': nerubios_dataset_test['train']
    })

In [None]:
raw_datasets

In [None]:
# NeRUBioS tagset

# 0 -> Outside ('O')
# 1 -> 'B-NegREF'
# 2 -> 'I-NegREF'
# 3 -> 'B-NEG'
# 4 -> 'I-NEG'
# 5 -> 'B-NSCO'
# 6 -> 'I-NSCO'
# 7 -> 'B-UNC'
# 8 -> 'I-UNC'
# 9 -> 'B-USCO'
# 10 -> 'I-USCO'

label_names = ["O", "B-NegREF", "I-NegREF", "B-NEG", "I-NEG", "B-NSCO", "I-NSCO", "B-UNC", "I-UNC", "B-USCO", "I-USCO"]
label_names
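
To see how the BIO tags relate to entity spans, the sketch below groups consecutive `B-`/`I-` tags of the same type into `(type, start, end)` tuples. It is an illustration, not part of the original pipeline: the helper name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a span is a lenient design choice made here.

```python
label_names = ["O", "B-NegREF", "I-NegREF", "B-NEG", "I-NEG", "B-NSCO",
               "I-NSCO", "B-UNC", "I-UNC", "B-USCO", "I-USCO"]


def bio_to_spans(tag_ids):
    """Group B-/I- tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag_id in enumerate(tag_ids):
        prefix, _, entity_type = label_names[tag_id].partition("-")
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            # Close the span that was open, if any.
            if current_type is not None:
                spans.append((current_type, start, i))
            # B- opens a span; a stray I- is also treated as an opening (lenient).
            if prefix in ("B", "I"):
                current_type, start = entity_type, i
            else:
                current_type, start = None, None
    if current_type is not None:
        spans.append((current_type, start, len(tag_ids)))
    return spans


# O, B-NEG, B-NSCO, I-NSCO, I-NSCO, O
print(bio_to_spans([0, 3, 5, 6, 6, 0]))
```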

In [None]:
def show_sample_aligned(words, labels):
  
  line1 = ""
  line2 = ""
  for word, label in zip(words, labels):
      full_label = label_names[label]
      max_length = max(len(word), len(full_label))
      line1 += word + " " * (max_length - len(word) + 1)
      line2 += full_label + " " * (max_length - len(full_label) + 1)

  print(line1)
  print(line2)

words = raw_datasets["train"][0]["tokens"]
labels = [int(n) for n in raw_datasets["train"][0]["ner_tags"]]
#labels = raw_datasets["train"][0]["pos_tags"]
#labels = raw_datasets["train"][0]["chunk_tags"]
show_sample_aligned(words, labels)

## Loading the pre-trained model (clinical Spanish RoBERTa by default; mBERT and Spanish BERT as alternatives)

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es" #Best model
#model_checkpoint = "bert-base-multilingual-cased"
#model_checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

In [None]:
tokenizer.is_fast

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

In [None]:
inputs.word_ids()

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
#print(labels)
#print(word_ids)
print(align_labels_with_tokens(labels, word_ids))
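
The alignment is easier to follow on a hand-written input. The sketch below uses a condensed copy of `align_labels_with_tokens` so that it runs on its own, and a made-up `word_ids` list in which the second word is assumed to split into three sub-tokens; actual splits depend on the tokenizer.

```python
def align_labels_with_tokens(labels, word_ids):
    # Condensed copy of the function above so this example is self-contained.
    new_labels, current_word = [], None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)              # special token, ignored by the loss
        elif word_id != current_word:
            new_labels.append(labels[word_id])   # first sub-token keeps the word label
        else:
            label = labels[word_id]
            # Odd ids are B-XXX tags; continuation sub-tokens become I-XXX.
            new_labels.append(label + 1 if label % 2 == 1 else label)
        current_word = word_id
    return new_labels


word_labels = [3, 5]                             # B-NEG, B-NSCO
toy_word_ids = [None, 0, 1, 1, 1, None]          # hypothetical sub-token split
aligned = align_labels_with_tokens(word_labels, toy_word_ids)
print(aligned)
```

The first sub-token of the second word keeps `B-NSCO` (5), and the two continuation sub-tokens receive `I-NSCO` (6).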

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])
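
Comparing the two outputs above shows that the collator pads the shorter label sequence with `-100`, the index ignored by the loss. A minimal pure-Python sketch of that padding step follows, assuming right-side padding; the helper name `pad_label_batch` and the toy label lists are hypothetical, and the real collator also pads the input ids and attention masks and returns tensors.

```python
def pad_label_batch(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


toy_batch = [[-100, 3, 5, -100], [-100, 0, -100]]
padded = pad_label_batch(toy_batch)
print(padded)
```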

In [None]:
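try:  # `load_metric` was removed from recent `datasets` releases
    import evaluate
    metric = evaluate.load("seqeval")
except ImportError:  # older environments without the `evaluate` package
    from datasets import load_metric
    metric = load_metric("seqeval")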

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

In [None]:
predictions = labels.copy()
predictions[1] = "O"
metric.compute(predictions=[predictions], references=[labels])

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    
    return {
        
        #Per label
        #NegREF
        "NegREF_precision": all_metrics["NegREF"]['precision'],
        "NegREF_recall": all_metrics["NegREF"]['recall'],
        "NegREF_F1": all_metrics["NegREF"]['f1'],
        #Negation
        "NEG_precision": all_metrics["NEG"]['precision'],
        "NEG_recall": all_metrics["NEG"]['recall'],
        "NEG_F1": all_metrics["NEG"]['f1'],
        "NSCO_precision": all_metrics["NSCO"]['precision'],
        "NSCO_recall": all_metrics["NSCO"]['recall'],
        "NSCO_F1": all_metrics["NSCO"]['f1'],
        #Uncertainty
        "UNC_precision": all_metrics["UNC"]['precision'],
        "UNC_recall": all_metrics["UNC"]['recall'],
        "UNC_F1": all_metrics["UNC"]['f1'],
        "USCO_precision": all_metrics["USCO"]['precision'],
        "USCO_recall": all_metrics["USCO"]['recall'],
        "USCO_F1": all_metrics["USCO"]['f1'],
      

        #Overall
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
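
`compute_metrics` indexes the result by entity type, so it raises a `KeyError` if a type is absent from the metric output, which may happen when a type appears in neither the references nor the predictions of the evaluated split. The sketch below shows one defensive way to flatten the result. It is an assumption-laden illustration: the helper name `flatten_seqeval` is hypothetical, the choice of reporting `0.0` for missing types is ours, and the `mock_metrics` values are invented, not experimental results.

```python
def flatten_seqeval(all_metrics, entity_types=("NegREF", "NEG", "NSCO", "UNC", "USCO")):
    """Flatten per-type scores, reporting 0.0 for types missing from the result."""
    flat = {}
    for entity_type in entity_types:
        scores = all_metrics.get(entity_type, {})
        for source_key, suffix in (("precision", "precision"), ("recall", "recall"), ("f1", "F1")):
            flat[f"{entity_type}_{suffix}"] = scores.get(source_key, 0.0)
    for key in ("precision", "recall", "f1", "accuracy"):
        flat[key] = all_metrics.get(f"overall_{key}", 0.0)
    return flat


# Invented numbers, shaped like the dictionary used in compute_metrics.
mock_metrics = {
    "NEG": {"precision": 1.0, "recall": 0.5, "f1": 0.67, "number": 2},
    "overall_precision": 1.0,
    "overall_recall": 0.5,
    "overall_f1": 0.67,
    "overall_accuracy": 0.9,
}
flat = flatten_seqeval(mock_metrics)
print(flat["NEG_recall"], flat["UNC_F1"], flat["accuracy"])
```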

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}  # integer ids, as the model config expects
label2id = {v: k for k, v in id2label.items()}

In [None]:
id2label

In [None]:
label2id

# Training process

## Changing the head of prediction for OSR + linking + Unc. under the BIO scheme

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(    
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    num_labels=len(label_names),   # 11 labels, BIO scheme: OSRs + NSD + USD
)

In [1]:
model.config.num_labels

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
  
    "NeRUBioS_RoBERTa_Training_Testing",
    
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5, 
    num_train_epochs=12, 
    weight_decay=0.1,   
    push_to_hub=True,
)
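
To get a feel for what these arguments imply, the sketch below estimates the number of optimizer steps and the learning rate at a given step. It assumes the `Trainer` defaults that the cell above does not override (per-device batch size of 8, linear decay with no warmup), a single device, and no gradient accumulation. The dataset size of 1000 is a made-up number, not the size of NeRUBioS, and both helper names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size=8, num_epochs=12):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_lr(step, total_steps, base_lr=2e-5):
    """Learning rate under linear decay to zero with no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


steps = total_training_steps(num_examples=1000)  # 1000 is a hypothetical dataset size
print(steps, linear_lr(0, steps), linear_lr(steps // 2, steps))
```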

## Fine-tuning the selected model for negation and uncertainty scope detection with OSR linking

In [1]:
from transformers import Trainer
import evaluate

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    #eval_dataset=tokenized_datasets["validation"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

## Saving the fine-tuned model to the Hugging Face Hub (requires prior authentication)

In [1]:
trainer.push_to_hub(commit_message="Fine-tuning completed")

## Loading the model for inference

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint. If all the previous cells ran successfully, the model
# should be available in your Hugging Face account under the name NeRUBioS_RoBERTa_Training_Testing.

model_checkpoint = "your_huggingface_username/NeRUBioS_RoBERTa_Training_Testing"

token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

In [None]:
pred = token_classifier("El ecocardiograma doppler color no muestra patologia que justifique los síntomas y la paciente evoluciona completamente asintomática y estable.")

for token in pred:
    print(token["word"], token["entity_group"])
pred
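
The pipeline returns a flat list of dictionaries, one per grouped entity. A small post-processing sketch for collecting the predicted text by entity group is given below. The helper name `group_by_entity` and the `min_score` threshold are hypothetical, and `mock_pred` is a hand-written stand-in shaped like pipeline output, not a real prediction of the fine-tuned model.

```python
from collections import defaultdict


def group_by_entity(predictions, min_score=0.0):
    """Map each entity group to the list of text spans predicted for it."""
    grouped = defaultdict(list)
    for item in predictions:
        if item["score"] >= min_score:
            grouped[item["entity_group"]].append(item["word"].strip())
    return dict(grouped)


# Hand-written stand-in for pipeline output; NOT real model predictions.
mock_pred = [
    {"entity_group": "NEG", "score": 0.99, "word": " no", "start": 32, "end": 34},
    {"entity_group": "NSCO", "score": 0.95, "word": " muestra patologia", "start": 35, "end": 52},
]
print(group_by_entity(mock_pred))
```

With real output, `group_by_entity(pred)` can be called directly on the list returned by `token_classifier`.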