## Name: Kartik VISWANATHAN (MSD 2024)

# Building models for named entity recognition

The project consists in building two named entity recognition (NER) systems. The systems will make use of the IOB tagging scheme to detect entities of type PER, ORG, LOC and MISC. The tagging scheme thus includes the following tags, assuming one tag per token:

- B-PER and I-PER: token corresponds to the start, resp. the inside, of a person entity
- B-LOC and I-LOC: token corresponds to the start, resp. the inside, of a location entity
- B-ORG and I-ORG: token corresponds to the start, resp. the inside, of an organization entity
- B-MISC and I-MISC: token corresponds to the start, resp. the inside, of any other named entity
- O: token corresponds to no entity
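To make the tagging scheme concrete, here is a minimal sketch of how a sequence of IOB tags can be decoded into entity spans. The function name `iob_to_spans` and the lenient handling of an `I-X` tag with no preceding `B-X` are illustrative assumptions, not part of the assignment.

```python
def iob_to_spans(tags):
    """Decode IOB tags into (entity_type, start, end) spans, with end exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same entity type
        continues = tag.startswith('I-') and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        # Assumption: an I- tag with no open span starts a new entity (lenient decoding)
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans


tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
print(iob_to_spans(tags))
# → [('ORG', 0, 1), ('MISC', 2, 3), ('MISC', 6, 7)]
```

Each span records the entity type and the word positions it covers, which is the unit an entity-level evaluation compares.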

## Dataset

You are provided with training, validation and test data derived from the CoNLL-03 dataset. The dataset has been lightly cleaned and reformatted for ease of use. You can directly load the three folds from the JSON file provided:

```python
import json

with open('conll03-iob-pos.json', 'r') as f:
    data = json.load(f)
```
For each fold, the dataset consists of a list of dictionaries, one per sample, with the two fields 'tokens' and 'tags', e.g.

{'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'tags': ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']}

## TODO

Building on the notebooks we've seen during the lectures and on the tips below, your task is to build two tagging models:
1. an RNN-based model: an embedding layer, an LSTM layer, a feed-forward layer
2. a fine-tuned BERT tagger: a BERT (pre-trained) layer, a feed-forward layer

The final feed-forward layer produces a probability distribution over the set of possible tags for each input token.

For both, we will use BERT's tokenizer, which is a sub-word tokenizer. The advantage of this tokenizer is that the vocabulary is finite (no out-of-vocabulary tokens): you can get the vocabulary size from tokenizer.vocab_size and you don't have to bother with defining your vocabulary and mapping unknown tokens to some special token. The disadvantage of sub-word tokenization is that we will have to relabel the input sequences, which are labeled on a word basis rather than on a sub-word basis. To make things easier, we provide a function that aligns and encodes the labels. Note that special tokens will arbitrarily get the tag -100, which is the default value telling Torch's loss functions that no gradient should be propagated from those positions (in other words, those tokens are ignored in training).

Another advantage of using the same tokenizer is that you will have to prepare your dataset and the corresponding loaders only once for the two models.

Here are the steps you'll have to go through:

1. Define a Dataset class that will hold for each sample the list of encoded tokens and the corresponding list of encoded tags. You will then encode the three folds as a Dataset and define the corresponding DataLoader instances.

2. Define your LSTM model class and train it. You can get inspired by the RNN language model notebook.

3. Define your BERT model class and train it. You can adapt the LLM finetuning notebook, changing the classification head to operate on each token (as for the LSTM) rather than on the embedding of the [CLS] token.

4. Evaluate both and compare. Token tag accuracy is one measure (used for instance to measure the convergence of training) but it's not the ultimate one as the final task is not to tag tokens but to detect entities. You should thus also report the entity recognition rate in the final evaluation.
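As an illustration of what an entity-level measure could look like, the sketch below scores predicted entities against reference entities with precision, recall and F1 under exact matching. The span format `(sentence_index, entity_type, start, end)` and the function name `entity_prf` are assumptions made for this example; the assignment does not prescribe a specific metric.

```python
def entity_prf(gold_spans, pred_spans):
    """Exact-match entity precision, recall and F1 over two collections of spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # an entity counts only if type and boundaries both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


# Hypothetical spans: (sentence_index, entity_type, start, end)
gold = [(0, 'ORG', 0, 1), (0, 'MISC', 2, 3), (0, 'MISC', 6, 7)]
pred = [(0, 'ORG', 0, 1), (0, 'MISC', 2, 3), (0, 'PER', 6, 7)]

p, r, f = entity_prf(gold, pred)
print(f'precision={p:.3f} recall={r:.3f} f1={f:.3f}')
# → precision=0.667 recall=0.667 f1=0.667
```

Unlike token accuracy, this measure gives no credit for the many correctly tagged `O` tokens, and a prediction with the right boundaries but the wrong type counts as an error.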

One last thing to think about: computation of the accuracy for validation and testing must be adapted in two ways compared to what we've seen in the previous notebooks. First, each prediction is a sequence of tags and not a single tag. Second, tags corresponding to the special tokens (indicated as -100 in the reference) must not be accounted for when computing the accuracy.
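The two adaptations can be sketched in plain Python as follows. The function name `masked_accuracy` and the toy tag ids are made up for illustration; a tensor-based version would apply the same logic with a boolean mask.

```python
def masked_accuracy(pred_ids, gold_ids, ignore_id=-100):
    """Token accuracy over sequences of tag ids, skipping positions labeled ignore_id."""
    correct = total = 0
    for pred_seq, gold_seq in zip(pred_ids, gold_ids):
        for p, g in zip(pred_seq, gold_seq):
            if g == ignore_id:
                continue  # special token ([CLS], [SEP], [PAD]): not scored
            total += 1
            correct += int(p == g)
    return correct / total if total else 0.0


# Two toy sequences; -100 marks special tokens in the reference
gold = [[-100, 3, 0, 7, -100, -100], [-100, 5, 6, 0, 0, -100]]
pred = [[0, 3, 0, 0, 0, 0], [0, 5, 6, 6, 0, 0]]

print(round(masked_accuracy(pred, gold), 4))
# → 0.7143  (5 correct out of 7 scored tokens)
```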

**Good luck on your mission!**

## REPORT

The report will be a commented notebook. This is not a Python programming project but an NLP project. I'm thus expecting you to comment on your model definition choices, to analyze the results and errors, and to provide hints at how things could be improved. If you did trial and error cells, please clean up a bit to facilitate reading, leaving only the final version in the report notebook.



In [1]:
import json

from transformers import AutoModel, AutoTokenizer

from sklearn.metrics import accuracy_score

import torch
from torch.utils.data import Dataset, DataLoader




In [2]:
#
# tag to id mapping and vice versa
#
# for tokens that do not have a tag, we will use -100 as the corresponding tag ID
#

tag2id = {
    'O': 0,
    'B-LOC': 1, 'I-LOC': 2,
    'B-ORG': 3, 'I-ORG': 4,
    'B-PER': 5, 'I-PER': 6,
    'B-MISC': 7, 'I-MISC': 8
}

id2tag = list(tag2id.keys())

print(id2tag)

['O', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-MISC']


In [3]:
print(torch.__version__)

2.5.1+cu124


In [4]:
#
# load data from json file
#

with open('data.json', 'r') as f:
    data = json.load(f)

for fold in ('train', 'valid', 'test'):
    print(fold, len(data[fold]))

train 14041
valid 3250
test 3453


In [5]:
print_json = lambda x: print(json.dumps(x, indent=2))

print_json(data['train'][0])

{
  "tokens": [
    "EU",
    "rejects",
    "German",
    "call",
    "to",
    "boycott",
    "British",
    "lamb",
    "."
  ],
  "tags": [
    "B-ORG",
    "O",
    "B-MISC",
    "O",
    "O",
    "O",
    "B-MISC",
    "O",
    "O"
  ]
}


In [6]:
#
# load BERT's tokenizer
#

checkpoint = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


In [7]:
print("GPU Device:", torch.cuda.get_device_name(0))


GPU Device: NVIDIA GeForce GTX 1080 Ti


In [8]:
#
# Here's an example showing how to tokenize texts and create the corresponding aligned and encoded labels
#
# Note that the tokenizer makes it possible to retrieve the index of the corresponding word for each (sub-word) token
# through the inputs.word_ids(batch_index=i) function (to retrieve input word indices for each token in
# inputs['input_ids'][i]). Special tokens ([CLS], [SEP], [PAD]) are mapped to None. We will make use of this
# mapping to create token-level labels adapted to sub-word tokenization. See next cell.
#

train_texts = [x['tokens'] for x in data['train']]
train_labels = [x['tags'] for x in data['train']]

inputs = tokenizer(train_texts, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")

print(train_texts[0])
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
print(inputs.word_ids(batch_index=0))

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ...] (output truncated: remaining [PAD] tokens omitted)

In [9]:
def align_and_encode_labels(_token_ids, _word_ids, _labels):
    '''
    Align word-level labels to sub-word tokens for an entry
    '''

    global tag2id

    ignore_id = -100

    buf = [ignore_id] # ignore tag for token [CLS]

    prev_token_word = -1
    which_type = 0

    # print(len(_token_ids), tokenizer.convert_ids_to_tokens(_token_ids))
    # print(_word_ids)
    # print(_labels)

    for i in range(1, len(_token_ids)):
        word_id = _word_ids[i]

        if word_id is None:
            # token does not belong to any input word ([CLS], [SEP] or [PAD]) -- ignore
            buf.append(ignore_id)

        else:
            tag_id = tag2id[_labels[word_id]]

            if word_id == prev_token_word:
                # sub-word token of the previous word: need to do something
                #   word has an O tag: just use an O tag
                #   word has an I-X tag: just use the I-X tag
                #   word has a B-X tag: replace by corresponding I-X tag

                buf.append(tag_id + 1 if tag_id in (1, 3, 5, 7) else tag_id)

            else:
                # token starting a new word --> keep tag unchanged
                prev_token_word = word_id
                buf.append(tag_id)

    return buf

#
# The following illustrates how we can get aligned and encoded labels for sample i in the training set.
#

i = 10

print(train_texts[i], train_labels[i])

new_labels = align_and_encode_labels(inputs['input_ids'][i], inputs.word_ids(batch_index=i), train_labels[i])

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][i])

for j in range(len(tokens)):
    if tokens[j] != '[PAD]':
        print(tokens[j], ' -- ', id2tag[new_labels[j]] if new_labels[j] >= 0 else 'NONE')

['Spanish', 'Farm', 'Minister', 'Loyola', 'de', 'Palacio', 'had', 'earlier', 'accused', 'Fischler', 'at', 'an', 'EU', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'unjustified', 'alarm', 'through', '"', 'dangerous', 'generalisation', '.', '"'] ['B-MISC', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
[CLS]  --  NONE
Spanish  --  B-MISC
Farm  --  O
Minister  --  O
Loyola  --  B-PER
de  --  I-PER
Pa  --  I-PER
##la  --  I-PER
##cio  --  I-PER
had  --  O
earlier  --  O
accused  --  O
Fi  --  B-PER
##sch  --  I-PER
##ler  --  I-PER
at  --  O
an  --  O
EU  --  B-ORG
farm  --  O
ministers  --  O
'  --  O
meeting  --  O
of  --  O
causing  --  O
un  --  O
##ju  --  O
##st  --  O
##ified  --  O
alarm  --  O
through  --  O
"  --  O
dangerous  --  O
general  --  O
##isation  --  O
.  --  O
"  --  O
[SEP]  --  NONE
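For entity-level evaluation, sub-word predictions eventually have to be mapped back to one tag per word. A minimal sketch, assuming the common convention of keeping the tag of the first sub-word token of each word: the function name `first_subtoken_tags` is hypothetical, and the `word_ids` list below is written by hand to mirror the tokenization of the first training sample shown earlier (in practice it would come from `inputs.word_ids(batch_index=i)`).

```python
def first_subtoken_tags(token_tags, word_ids):
    """Collapse sub-word level tags to word level, keeping the first sub-token's tag."""
    word_tags = []
    prev_word = None
    for tag, word_id in zip(token_tags, word_ids):
        if word_id is None:
            continue  # special token ([CLS], [SEP], [PAD])
        if word_id != prev_word:
            word_tags.append(tag)  # first sub-token of a new word
            prev_word = word_id
    return word_tags


# [CLS] EU rejects German call to boycott British la ##mb . [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
token_tags = ['O', 'B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O']

print(first_subtoken_tags(token_tags, word_ids))
# → ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
```

The result has one tag per original word, so it can be compared directly with the word-level reference tags.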


## STEP 1: Data Preparation

In [10]:
import torch

from torch.utils.data import Dataset, DataLoader

### Create MyDataset class

In [11]:
class MyDataset(Dataset):
    def __init__(self, encodings, labels):
        
        assert(len(encodings) == len(labels))
        
        self.nsamples = len(labels)
        
        # print(f'Initializing dataset with {self.nsamples} entries')
        
        self.encodings = encodings # list[list[int]]: contains the padded list of token ids for each sample
        self.labels = labels # list[list[int]]: contains the aligned tag ids for each sample (-100 for special tokens)
        self.nlabels = 9 # number of labels

    def __getitem__(self, idx):
        '''
        Returns a dictionary containing the padded token ids and the aligned tag ids for a sample
        '''
        
        # print(f'accessing dataset item at index {idx}')
        # print(torch.tensor(self.encodings[idx]), torch.tensor(self.labels[idx]))

        return {'ids': torch.tensor(self.encodings[idx]), 'label': torch.tensor(self.labels[idx])}

    def __len__(self):
        return len(self.labels)
    
  

In [12]:
#
# Function to convert the dataset as a list[dict] into a proper torch.Dataset object
# 
def to_dataset(texts, labels) -> MyDataset:
    '''
    Convert word-level texts and tags into a proper PyTorch dataset, using the global tokenizer.
    All samples are padded to the length of the longest sequence in the fold.
    '''

    inputs = tokenizer(texts, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")
    encoded_tokens = [inputs['input_ids'][i].tolist() for i in range(len(texts))]
    encoded_labels = [align_and_encode_labels(inputs['input_ids'][i].tolist(), inputs.word_ids(batch_index=i), labels[i]) for i in range(len(texts))]

    print(len(encoded_tokens), len(encoded_labels))

    print(encoded_tokens[0], encoded_labels[0])
    
    return MyDataset(encoded_tokens, encoded_labels)


ds = dict()

train_texts = [x['tokens'] for x in data['train']]
train_labels = [x['tags'] for x in data['train']]

test_texts = [x['tokens'] for x in data['test']]
test_labels = [x['tags'] for x in data['test']]

val_texts = [x['tokens'] for x in data['valid']]
val_labels = [x['tags'] for x in data['valid']]

ds['train'] = to_dataset(train_texts, train_labels)
ds['valid'] = to_dataset(val_texts, val_labels)
ds['test'] = to_dataset(test_texts, test_labels)

print('Training set:  nsamples =', ds['train'].nsamples, ' nlabels =', len(tag2id))

for i in range(3):
    print(data['train'][i])
    print('   >>', ds['train'][i])


14041 14041
[101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102, 0, 0, 0, ...] [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100, -100, -100, ...] (output truncated: trailing padding omitted)

### Use Dataloader to convert to tensors

In [13]:
#
# Create batched dataset with data loaders
#

batch_size = 16

loader = dict()
loader['train'] = DataLoader(ds['train'], batch_size=batch_size, shuffle=True) # set to False for debugging purposes
loader['valid'] = DataLoader(ds['valid'], batch_size=batch_size)
loader['test'] = DataLoader(ds['test'], batch_size=batch_size)

print('Number of samples:', len(ds['train']))
print(f'Number of training batches:', len(loader['train']))

Number of samples: 14041
Number of training batches: 878
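The datasets above pad every sample to the length of the longest sequence in its fold, so most positions in a batch are padding. One possible alternative is to pad per batch instead. The sketch below shows only the padding logic in plain Python; `pad_batch` is a hypothetical helper, and using it as a `collate_fn` would assume that the dataset stores unpadded lists and that the returned lists are converted to tensors.

```python
def pad_batch(samples, pad_id=0, ignore_id=-100):
    """Pad token ids with pad_id and labels with ignore_id to the longest sample in the batch."""
    maxlen = max(len(s['ids']) for s in samples)
    ids = [s['ids'] + [pad_id] * (maxlen - len(s['ids'])) for s in samples]
    labels = [s['label'] + [ignore_id] * (maxlen - len(s['label'])) for s in samples]
    return {'ids': ids, 'label': labels}


# Two toy unpadded samples (token ids are illustrative)
samples = [
    {'ids': [101, 7270, 102], 'label': [-100, 3, -100]},
    {'ids': [101, 1, 2, 3, 102], 'label': [-100, 0, 0, 0, -100]},
]

batch = pad_batch(samples)
print(batch['ids'][0])    # → [101, 7270, 102, 0, 0]
print(batch['label'][0])  # → [-100, 3, -100, -100, -100]
```

The keys `ids` and `label` match the ones returned by `MyDataset.__getitem__`, and the pad id 0 matches the `[PAD]` token id reported by the tokenizer.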


## STEP 2: Create the Recurrent NN model

RNNs: Recurrent Neural Networks process sequences of data by maintaining a hidden state that captures information from previous time steps. They are particularly suited for tasks where the order of data matters, such as time series prediction or language modeling.

In [14]:
class NERNN(torch.nn.Module):
    def __init__(self, vocsize, nclasses=9, embed_dim=200, hidden_dim=200, dropout=0.3):
        super(NERNN, self).__init__()
        
        self.embedding = torch.nn.Embedding(vocsize, embed_dim, padding_idx=0)
        self.dropout = torch.nn.Dropout(dropout)
        self.lstm = torch.nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.linear = torch.nn.Linear(hidden_dim, nclasses)
        self.nclasses = nclasses
        
   
    def forward(self, ids):
        x = self.embedding(ids)  # batch_size * maxlen * embed_dim
        if self.dropout is not None:
            x = self.dropout(x)
        
        output, _ = self.lstm(x)  # batch_size * maxlen * hidden_dim
        logits = self.linear(output)  # batch_size * maxlen * num_classes
        return logits


In [15]:
vocab_size = tokenizer.vocab_size

num_classes = len(tag2id)

print('Vocabulary size:', vocab_size)
print('Number of classes:', ds['train'].nlabels)

model = NERNN(vocab_size, nclasses = 9)

Vocabulary size: 28996
Number of classes: 9


In [16]:
print(model)

NERNN(
  (embedding): Embedding(28996, 200, padding_idx=0)
  (dropout): Dropout(p=0.3, inplace=False)
  (lstm): LSTM(200, 100, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=200, out_features=9, bias=True)
)
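The layer sizes printed above are enough to work out how many trainable parameters this model has. The following is a back-of-the-envelope sketch, assuming the standard PyTorch LSTM parameterization (four gates, each with input weights, recurrent weights and two bias vectors); `lstm_params` is a helper introduced only for this calculation.

```python
def lstm_params(input_dim, hidden_dim, bidirectional=True):
    """Parameter count of a single-layer LSTM: 4 gates x (W_ih + W_hh + b_ih + b_hh)."""
    per_direction = 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return per_direction * (2 if bidirectional else 1)


vocsize, embed_dim, hidden_dim, nclasses = 28996, 200, 200, 9

n_embed = vocsize * embed_dim                     # embedding table
n_lstm = lstm_params(embed_dim, hidden_dim // 2)  # 100 hidden units per direction
n_linear = hidden_dim * nclasses + nclasses       # weights + bias of the output layer

total = n_embed + n_lstm + n_linear
print(n_embed, n_lstm, n_linear, total)
# → 5799200 241600 1809 6042609
print(f'{100 * n_embed / total:.1f}% of the parameters are in the embedding table')
# → 96.0% of the parameters are in the embedding table
```

Almost all of the capacity sits in the embedding table, which is trained from scratch here, whereas the BERT-based model starts from pre-trained embeddings.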


### Training and evaluation steps

In [17]:
def train_step(_model, _loader, _loss, _optim, device="cpu", report=0):
    _model.train(True)
    total_loss = 0.
    running_loss = 0.

    for i, batch in enumerate(_loader):
        _optim.zero_grad()

        labels = batch['label'].to(device)
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'label'}
        outputs = _model(**inputs)

        # Reshape outputs and labels
        outputs = outputs.view(-1, outputs.shape[-1])  # (batch_size * seq_length, num_classes)
        labels = labels.view(-1)  # (batch_size * seq_length)

        # Ignore padding tokens (label -100)
        mask = (labels != -100)
        outputs = outputs[mask]
        labels = labels[mask]

        loss = _loss(outputs, labels)
        total_loss += loss.item()
        running_loss += loss.item()

        loss.backward()
        _optim.step()

        if report != 0 and i % report == report - 1:
            print('  batch {} avg. loss per batch={:.4f}'.format(i + 1, running_loss / report))
            running_loss = 0.

    _model.train(False)
    return total_loss


In [18]:
def eval_step(_model, _loader, device='cpu', loss_fn=None):
    '''
    Evaluate the model's performance on data within loader.
    
    :return: 
    total_loss accumulated throughout the batches
    accuracy
    '''
    _model.eval()  # disable training mode

    all_predictions = []
    all_labels = []
    total_loss = 0.0

    with torch.no_grad():
        for batch in _loader:
            inputs = {k: v.to(device) for k, v in batch.items() if k != 'label'}
            labels = batch['label'].to(device)

            outputs = _model(**inputs)
            
            if loss_fn is not None:
                loss = loss_fn(outputs.view(-1, outputs.shape[-1]), labels.view(-1))
                total_loss += loss.item()

            predictions = torch.argmax(outputs, dim=-1)
            
            # Remove padding tokens
            mask = labels != -100
            predictions = predictions[mask]
            labels = labels[mask]
            
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    accuracy = accuracy_score(all_labels, all_predictions)
    return total_loss, accuracy


### Choose the device to run the training and adjust training parameters

In [19]:
#
# Last but not least, we have to set the training parameters and instantiate all the objects 
# needed for training.
#

lr = 1e-4
nepochs = 10
report_freq = 100

# check what device we can work on
if torch.backends.mps.is_available(): # MPS GPU acceleration on macOS -- requires Metal support
    device = "mps"
    torch.mps.empty_cache()
elif torch.cuda.is_available(): # CUDA GPU acceleration available
    device = torch.device('cuda')
else:
    device = "cpu"
print(f'Running on {device} device')

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
celoss = torch.nn.CrossEntropyLoss()

Running on cuda device


### Run the training

In [20]:
#
# At last, here we go with the main training loop, iterating over epochs
#

model.to(device)

for epoch in range(nepochs):
    print(f'epoch: {epoch}')
    
    total_loss = train_step(model, loader['train'], celoss, optimizer, device=device, report=report_freq)
    _, trn_acc = eval_step(model, loader['train'], device=device, loss_fn=None)
    
    val_loss, val_acc = eval_step(model, loader['valid'], device=device, loss_fn=celoss)

    print('  **train** avg_loss={:.4f}    acuracy={:.2f}%'.format(total_loss / len(loader['train']), 100 * trn_acc))
    print('  **valid** avg_loss={:.4f}    acuracy={:.2f}%'.format(val_loss / len(loader['valid']), 100 * val_acc))


epoch: 0
  batch 100 avg. loss per batch=1.9848
  batch 200 avg. loss per batch=1.1641
  batch 300 avg. loss per batch=0.9487
  batch 400 avg. loss per batch=0.9240
  batch 500 avg. loss per batch=0.9189
  batch 600 avg. loss per batch=0.8918
  batch 700 avg. loss per batch=0.8576
  batch 800 avg. loss per batch=0.8407
  **train** avg_loss=1.0473    acuracy=76.46%
  **valid** avg_loss=0.8830    acuracy=76.43%
epoch: 1
  batch 100 avg. loss per batch=0.8190
  batch 200 avg. loss per batch=0.7708
  batch 300 avg. loss per batch=0.7902
  batch 400 avg. loss per batch=0.7670
  batch 500 avg. loss per batch=0.7595
  batch 600 avg. loss per batch=0.7334
  batch 700 avg. loss per batch=0.7293
  batch 800 avg. loss per batch=0.7062
  **train** avg_loss=0.7531    acuracy=81.13%
  **valid** avg_loss=0.6880    acuracy=80.86%
epoch: 2
  batch 100 avg. loss per batch=0.6499
  batch 200 avg. loss per batch=0.6386
  batch 300 avg. loss per batch=0.6445
  batch 400 avg. loss per batch=0.6158
  batch 5

### Test Accuracy with Recurrent NN for NER

In [21]:
_, tst_acc = eval_step(model, loader['test'], device=device, loss_fn=None)

print('  **test** acuracy={:.2f}%'.format(100 * tst_acc))

  **test** acuracy=89.03%
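Token accuracy is easier to interpret next to a trivial baseline. Since most tokens in a NER corpus typically carry the `O` tag, always predicting the most frequent tag can already score high. The sketch below computes that baseline from encoded label sequences; `majority_baseline_accuracy` is a hypothetical helper and the example uses only the first training sample, not the whole test set.

```python
from collections import Counter


def majority_baseline_accuracy(label_seqs, ignore_id=-100):
    """Accuracy of always predicting the most frequent tag id, ignoring ignore_id positions."""
    counts = Counter(l for seq in label_seqs for l in seq if l != ignore_id)
    if not counts:
        return 0.0, None
    tag, n = counts.most_common(1)[0]
    return n / sum(counts.values()), tag


# Encoded labels of the first training sample (without padding)
labels = [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]]

acc, tag = majority_baseline_accuracy(labels)
print(acc, tag)
# → 0.7 0   (tag id 0 is 'O')
```

Running the same function over the labels of the test fold would give the reference point against which the accuracies reported in this notebook can be compared.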


### Short conclusion about RNN
* RNNs typically require training on labeled sequences and can be sensitive to the order of input. They often need careful tuning of hyperparameters and may require more epochs to converge. 
* While RNNs can perform well on sequence labeling tasks, their performance may degrade with longer sequences or complex dependencies. They often require more engineering efforts for feature extraction and may not generalize as well.
* Less resource-intensive than the BERT-based model: training completed relatively fast.

## STEP 3: Create model for BERT based NER

BERT: Bidirectional Encoder Representations from Transformers (BERT) is based on the transformer architecture, which uses self-attention mechanisms to process input data. BERT captures context from both directions (left and right) simultaneously, allowing it to understand the meaning of words in relation to their surrounding words.
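The self-attention mechanism mentioned above can be illustrated with a minimal pure-Python sketch of scaled dot-product attention. This is a simplification: it uses a single head and feeds the inputs directly as queries, keys and values, whereas the actual model applies learned linear projections (the `q_lin`, `k_lin` and `v_lin` layers) and several heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted average of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # attention weights over all positions, left and right
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Three toy 2-d token vectors attending to each other
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(x, x, x):
    print([round(v, 3) for v in row])
```

Every output vector mixes information from all positions in the sequence, which is what makes the resulting token representations context-dependent in both directions.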

### Get a pre-trained model from Hugging-face
While using BERT may involve more complex initial setup due to its architecture and tokenization requirements, pre-trained models are readily available through libraries like Hugging Face's Transformers, making it easier to implement.

In [22]:
#
# Load tokenizer and model
#

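# NOTE: the datasets and loaders built above were encoded with the 'distilbert-base-cased'
# tokenizer (vocab_size=28996). The encoder checkpoint should match the tokenizer that
# produced the token ids; an uncased checkpoint uses a different vocabulary.
checkpoint = 'distilbert-base-uncased' # distilroberta-base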

tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load tokenizer
bert = AutoModel.from_pretrained(checkpoint) # load model
bert.eval()


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

### Our BERT model

In [26]:
import copy

#
# Define the model to do the following:
#    1. encode input with the BERT encoder
#    2. get the embeddings of all tokens (dimension is 768, i.e., encoder.config.dim)
#    3. run classification on each token embedding through two feed-forward layers
#

class BERTClassifier(torch.nn.Module):

    def __init__(self, _encoder, nclasses = 2, dropout = None):
        super(BERTClassifier, self).__init__()

        self.nclasses = nclasses
        self.encoder = copy.deepcopy(_encoder) # to avoid modifying the encoder directly
        self.dropout = torch.nn.Dropout(dropout) if dropout is not None else None
        self.linear1 = torch.nn.Linear(self.encoder.config.dim, 100)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(100, nclasses)
        # self.softmax = torch.nn.Softmax()

    def forward(self, ids):
        x = self.encoder(ids)        # run batch through the BERT encoder
        x = x.last_hidden_state      # embeddings of all tokens: batch_size * maxlen * 768
        if self.dropout is not None:
            x = self.dropout(x)
        x = self.linear1(x)               # project to a 100-d hidden layer
        x = self.activation(x)
        if self.dropout is not None:
            x = self.dropout(x)
        x = self.linear2(x)               # project to logits, one per class
        # x = self.softmax(x)
        
        return x

model1 = BERTClassifier(bert, nclasses = 9, dropout = 0.2)


In [27]:
#
# Set training parameters
#

lr = 1e-5
nepochs = 10
report_freq = 100

if torch.backends.mps.is_available(): # MPS GPU acceleration on macOS -- requires Metal support
    device = "mps"
    torch.mps.empty_cache()
elif torch.cuda.is_available(): # CUDA GPU acceleration available
    device = torch.device('cuda')
else:
    device = "cpu"
print(f'Running on {device} device')

optimizer = torch.optim.AdamW(model1.parameters(), lr=lr)
celoss = torch.nn.CrossEntropyLoss()

Running on cuda device


### Train the model

In [28]:
#
# Run the training loop
#
model1.to(device)

for epoch in range(nepochs):
    print(f'epoch: {epoch}')
    
    total_loss = train_step(model1, loader['train'], celoss, optimizer, device=device, report=report_freq)
    _, trn_acc = eval_step(model1, loader['train'], device=device, loss_fn=None)
    
    val_loss, val_acc = eval_step(model1, loader['valid'], device=device, loss_fn=celoss)

    print('  **train** avg_loss={:.4f}    acuracy={:.2f}%'.format(total_loss / len(loader['train']), 100 * trn_acc))
    print('  **valid** avg_loss={:.4f}    acuracy={:.2f}%'.format(val_loss / len(loader['valid']), 100 * val_acc))


epoch: 0
  batch 100 avg. loss per batch=1.1476
  batch 200 avg. loss per batch=0.9372
  batch 300 avg. loss per batch=0.8671
  batch 400 avg. loss per batch=0.7961
  batch 500 avg. loss per batch=0.7691
  batch 600 avg. loss per batch=0.7517
  batch 700 avg. loss per batch=0.7125
  batch 800 avg. loss per batch=0.6989
  **train** avg_loss=0.8206    acuracy=82.23%
  **valid** avg_loss=0.6487    acuracy=81.34%
epoch: 1
  batch 100 avg. loss per batch=0.6075
  batch 200 avg. loss per batch=0.5739
  batch 300 avg. loss per batch=0.5694
  batch 400 avg. loss per batch=0.5353
  batch 500 avg. loss per batch=0.5271
  batch 600 avg. loss per batch=0.5153
  batch 700 avg. loss per batch=0.4927
  batch 800 avg. loss per batch=0.4997
  **train** avg_loss=0.5319    acuracy=88.04%
  **valid** avg_loss=0.4471    acuracy=86.67%
epoch: 2
  batch 100 avg. loss per batch=0.4242
  batch 200 avg. loss per batch=0.4324
  batch 300 avg. loss per batch=0.4027
  batch 400 avg. loss per batch=0.3951
  batch 5

In [30]:
_, tst_acc = eval_step(model1, loader['test'], device=device, loss_fn=None)

print('  **test** acuracy={:.2f}%'.format(100 * tst_acc))

  **test** acuracy=91.10%


### Short conclusion about BERT based NER
* BERT excels at understanding context due to its self-attention mechanism, which allows it to weigh the importance of all words in a sentence when encoding a particular word. This leads to richer representations that are contextually aware.
* Pre-trained on large corpora using unsupervised learning techniques (masked language modeling and next sentence prediction), BERT can be fine-tuned on specific tasks like NER with relatively fewer labeled examples, often leading to better performance.
* Generally achieves state-of-the-art results in various NLP tasks, including NER. Its ability to leverage pre-trained knowledge allows it to perform exceptionally well even with limited task-specific data.
* Requires substantial computational resources due to its large number of parameters and the complexity of the transformer architecture. Fine-tuning BERT models typically necessitates GPUs or TPUs for efficient training.

## Conclusion
* Overall, the BERT-based NER model outperforms the RNN-based NER model for the same number of training epochs (10): 91.10% vs. 89.03% token-level accuracy on the test set.
* Training the BERT-based model is considerably slower and more computationally demanding than training the RNN-based model.