# Named Entity Recognition (NER) in Biomedical Texts

This notebook demonstrates how to train a Named Entity Recognition (NER) model on the BioCreative V CDR dataset. The model is fine-tuned to recognize chemical and disease entities in biomedical texts.

## Setup and Installation
We'll begin by installing the necessary libraries.

In [None]:
!pip install transformers datasets seqeval
!pip install accelerate -U

## Loading the Dataset
We will use the `datasets` library to load the BioCreative V CDR dataset.

In [2]:
from datasets import load_dataset

dataset = load_dataset('tner/bc5cdr')


# Select a smaller subset for training, validation, and testing
dataset['train'] = dataset['train'].select(range(300))  # First 300 samples for training
dataset['validation'] = dataset['validation'].select(range(60))  # First 60 samples for validation
dataset['test'] = dataset['test'].select(range(60))  # First 60 samples for testing


Generating train split:   0%|          | 0/5228 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5330 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5865 [00:00<?, ? examples/s]

## Exploring the Dataset
Let's take a look at the structure of the dataset.
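
As a minimal sketch of what one record looks like, the snippet below decodes the integer `tags` of a record into label names. The record here is a hand-written toy example, not taken from the dataset, and the `ID2LABEL` mapping is an assumption based on the `tner/bc5cdr` dataset card; check it against the card before relying on it.

```python
# Assumed label mapping for tner/bc5cdr (verify against the dataset card).
ID2LABEL = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}

# Hypothetical record with the same two fields as a dataset row.
toy_example = {
    "tokens": ["Aspirin", "can", "cause", "gastric", "ulcers", "."],
    "tags": [1, 0, 0, 2, 3, 0],
}


def decode_tags(example, id2label=ID2LABEL):
    """Pair each token with the name of its tag."""
    return [(tok, id2label[tag]) for tok, tag in zip(example["tokens"], example["tags"])]


# Print one token per line, followed by its label name.
for token, label in decode_tags(toy_example):
    print(f"{token:<10}{label}")
```

The same helper should work on a real row, for example `decode_tags(dataset['train'][0])`, since each row exposes `tokens` and `tags` in the same way.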

## Data Preprocessing
We need to preprocess the data to be suitable for training. This includes tokenizing the texts and aligning the labels with the tokenized inputs.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')



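def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; truncate to the model's maximum length and pad up to it
    tokenized_inputs = tokenizer(examples['tokens'], padding='max_length', truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples['tags']):
        # word_ids maps each sub-token back to the index of its source word
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens and padding: -100 is ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first sub-token of a word carries the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens of the same word are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs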

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/60 [00:00<?, ? examples/s]

Map:   0%|          | 0/60 [00:00<?, ? examples/s]
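
To make the label-alignment rule concrete, here is a dependency-free sketch that applies the same logic to hand-written `word_ids`. The tokenization shown in the comment is hypothetical and only serves to illustrate how sub-tokens map back to words; `align_labels` is an illustrative helper, not part of any library.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special/padding token, or a continuation of the previous word
            aligned.append(ignore_index)
        else:
            # First sub-token of a new word
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


# Hypothetical sub-tokens for the words ["Aspirin", "causes", "ulcers"]:
# [CLS] As ##pi ##rin causes ulcers [SEP] [PAD]
word_ids = [None, 0, 0, 0, 1, 2, None, None]
word_labels = [1, 0, 2]  # one integer tag per word

print(align_labels(word_ids, word_labels))
# [-100, 1, -100, -100, 0, 2, -100, -100]
```

Only three of the eight positions keep a label, so the loss is computed once per word rather than once per sub-token.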

In [4]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 300
    })
    validation: Dataset({
        features: ['tokens', 'tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 60
    })
    test: Dataset({
        features: ['tokens', 'tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 60
    })
})

## Model Building
We will use a pre-trained BERT model and fine-tune it for the NER task.

In [5]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from seqeval.metrics import classification_report

model = AutoModelForTokenClassification.from_pretrained('bert-base-cased', num_labels=5)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Model Training
We will define the training arguments and train the model.

In [6]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    logging_steps=50,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
)


trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.246353        |
| 2     | 0.489100      | 0.213242        |
| 3     | 0.157600      | 0.193077        |


TrainOutput(global_step=114, training_loss=0.2978483752200478, metrics={'train_runtime': 6194.7068, 'train_samples_per_second': 0.145, 'train_steps_per_second': 0.018, 'total_flos': 235173459456000.0, 'train_loss': 0.2978483752200478, 'epoch': 3.0})
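
The numbers in this output can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments above); under those assumptions it reproduces `global_step=114`, explains why epoch 1 shows "No log", and recovers the reported throughput.

```python
import math

train_samples = 300      # size of the training subset
batch_size = 8           # per_device_train_batch_size
epochs = 3               # num_train_epochs
logging_steps = 50       # training loss is logged every 50 optimizer steps
train_runtime = 6194.7068  # seconds, from TrainOutput

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# The first training-loss log happens at step 50, which falls in this epoch.
first_log_epoch = math.ceil(logging_steps / steps_per_epoch)

# Throughput counts every sample seen across all epochs.
samples_per_second = train_samples * epochs / train_runtime

print(steps_per_epoch, total_steps, first_log_epoch, round(samples_per_second, 3))
# 38 114 2 0.145
```

Because the first log is written during epoch 2, no training loss is available when the epoch 1 evaluation runs.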

## Model Evaluation
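After training, we evaluate the model. Called without arguments, `trainer.evaluate()` uses the `eval_dataset` passed to the `Trainer`, which here is the validation split, so the loss below matches the epoch 3 validation loss. To evaluate on the test set instead, pass it explicitly: `trainer.evaluate(tokenized_datasets['test'])`.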

In [7]:
results = trainer.evaluate()
results

{'eval_loss': 0.19307731091976166,
 'eval_runtime': 109.1842,
 'eval_samples_per_second': 0.55,
 'eval_steps_per_second': 0.073,
 'epoch': 3.0}
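
The notebook imports `classification_report` from `seqeval` but never calls it. As a hedged sketch of the missing step, the helper below turns per-token scores and gold label ids into lists of tag strings, skipping positions labeled `-100`. The helper name `to_tag_sequences`, the toy scores, and the `ID2LABEL` mapping (assumed from the `tner/bc5cdr` dataset card) are all illustrative assumptions.

```python
# Assumed label mapping for tner/bc5cdr (verify against the dataset card).
ID2LABEL = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}


def to_tag_sequences(logits, label_ids, id2label=ID2LABEL, ignore_index=-100):
    """Convert scores and gold ids to tag strings, dropping ignored positions."""
    y_true, y_pred = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        true_tags, pred_tags = [], []
        for scores, gold in zip(sent_logits, sent_labels):
            if gold == ignore_index:
                continue  # special token, padding, or non-initial sub-token
            scores = list(scores)
            pred = scores.index(max(scores))  # argmax over the label dimension
            true_tags.append(id2label[gold])
            pred_tags.append(id2label[pred])
        y_true.append(true_tags)
        y_pred.append(pred_tags)
    return y_true, y_pred


# Toy input: one sentence, four positions, five scores per position.
toy_logits = [[
    [0.1, 0.2, 0.1, 0.1, 0.1],  # ignored position
    [0.1, 2.0, 0.1, 0.1, 0.1],  # highest score: B-Chemical
    [3.0, 0.1, 0.1, 0.1, 0.1],  # highest score: O
    [0.1, 0.1, 0.2, 0.1, 1.5],  # highest score: I-Chemical
]]
toy_labels = [[-100, 1, 0, 2]]

y_true, y_pred = to_tag_sequences(toy_logits, toy_labels)
print(y_true)
print(y_pred)
```

The two returned lists are in the list-of-tag-lists format that `classification_report(y_true, y_pred)` accepts. With a trained model, the inputs could come from `trainer.predict(tokenized_datasets['test'])`, using its `predictions` and `label_ids` fields.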

The result above reports only the loss and throughput figures, because no `compute_metrics` function was passed to the `Trainer`.

## Conclusion
We have fine-tuned a pre-trained BERT model for NER on a small subset (300 training examples) of the BioCreative V CDR dataset. The model can now be used to recognize chemical and disease entities in biomedical texts.