<a href="https://colab.research.google.com/github/matthewleechen/woodcroft_patents/blob/main/ner/notebooks/fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is a slightly modified clone of Niels Rogge's (extremely helpful!) notebook, "Fine-tuning BERT for named entity recognition", linked [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT.ipynb). 

It is **not** recommended to run this notebook on the Colab free plan. This notebook's training loop was originally run using Colab Pro on 1 Nvidia Tesla V100 (16GB) GPU. You can also run this locally on a virtual machine or server, but carefully check for dependencies.

**Data Preprocessing**

In [1]:
%%capture
!pip install transformers seqeval[gpu]
!pip install conllu

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertConfig, BertForTokenClassification, set_seed
import conllu
import csv

In [3]:
# Check if GPU is available
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


In [4]:
# Set seed
set_seed(42)

Upload .conll dataset exported from Label Studio

In [5]:
# Visualize annotations as .conll dataset
data = open("/path/to/conll/dataset", mode = "r", encoding = "utf-8")
annotations = data.read()
print(annotations[:1000])

-DOCSTART- -X- O
JOLY. -X- _ O
24th -X- _ B-DATE
November -X- _ I-DATE
1852. -X- _ I-DATE
852. -X- _ B-NUM
ALPHONSE -X- _ B-PER
JOLY -X- _ I-PER
, -X- _ O
of -X- _ O
Paris -X- _ B-LOC
, -X- _ I-LOC
in -X- _ I-LOC
the -X- _ I-LOC
Republic -X- _ I-LOC
of -X- _ I-LOC
France -X- _ I-LOC
, -X- _ O
Civil -X- _ B-OCC
Engineer -X- _ I-OCC
, -X- _ O
for -X- _ O
an -X- _ O
invention -X- _ O
for -X- _ O
" -X- _ O
Certain -X- _ B-MISC
improvements -X- _ I-MISC
in -X- _ I-MISC
steam -X- _ I-MISC
engines. -X- _ I-MISC
" -X- _ O
Provisional -X- _ B-INFO
protection -X- _ I-INFO
only -X- _ I-INFO
. -X- _ I-INFO

LENOX. -X- _ O
ROBERTS -X- _ O
, -X- _ O
18th -X- _ B-DATE
October -X- _ I-DATE
1852. -X- _ I-DATE
426. -X- _ B-NUM
GEORGE -X- _ B-PER
WILLIAM -X- _ I-PER
LENOX -X- _ I-PER
, -X- _ O
of -X- _ O
Billiter -X- _ B-LOC
Square -X- _ I-LOC
, -X- _ I-LOC
in -X- _ I-LOC
the -X- _ I-LOC
City -X- _ I-LOC
of -X- _ I-LOC
London -X- _ I-LOC
, -X- _ O
Chain -X- _ B-OCC
Cable -X- _ I-OCC
Manufacturer -X- _ I-

In [6]:
# define input and output file names
input_file = "ner_patents.conll" ## replace "ner_patents" with name of conll file
output_file = "ner_patents.csv" ## same here

# initialize the csv writer and write the header row
csv_writer = csv.writer(open(output_file, "w", newline="", encoding="utf-8"))
csv_writer.writerow(["sentence_no", "word", "tag"])

# read the input file line by line
with open(input_file, "r", encoding="utf-8") as f:
    sentence_num = 1
    for line in f:
        line = line.strip()
        if not line:
            # empty line denotes end of sentence
            sentence_num += 1
        else:
            # split line into columns
            columns = line.split()
            word = columns[0]
            tag = columns[-1]
            sentence = "{}".format(sentence_num)
            # write row to csv file
            csv_writer.writerow([sentence, word or "NaN", tag or "NaN"])

In [7]:
# Visualize data
pd.read_csv('ner_patents.csv').drop(0).to_csv('ner_patents.csv', index=False)
data = pd.read_csv("ner_patents.csv", encoding='unicode_escape')
data.head()

Unnamed: 0,sentence_no,word,tag
0,1,JOLY.,O
1,1,24th,B-DATE
2,1,November,I-DATE
3,1,1852.,I-DATE
4,1,852.,B-NUM


In [8]:
# Group by sentence
# let's create a new column called "patent" which groups the words by sentence 
data['sentence'] = data[['sentence_no','word','tag']].groupby(['sentence_no'])['word'].transform(lambda x: ' '.join(x))
# let's also create a new column called "word_labels" which groups the tags by sentence 
data['word_labels'] = data[['sentence_no','word','tag']].groupby(['sentence_no'])['tag'].transform(lambda x: ','.join(x))
# Show data
data.head()

Unnamed: 0,sentence_no,word,tag,sentence,word_labels
0,1,JOLY.,O,"JOLY. 24th November 1852. 852. ALPHONSE JOLY ,...","O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."
1,1,24th,B-DATE,"JOLY. 24th November 1852. 852. ALPHONSE JOLY ,...","O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."
2,1,November,I-DATE,"JOLY. 24th November 1852. 852. ALPHONSE JOLY ,...","O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."
3,1,1852.,I-DATE,"JOLY. 24th November 1852. 852. ALPHONSE JOLY ,...","O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."
4,1,852.,B-NUM,"JOLY. 24th November 1852. 852. ALPHONSE JOLY ,...","O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."


In [9]:
# Make dictionary mapping tags to indices
label2id = {k: v for v, k in enumerate(data.tag.unique())}
id2label = {v: k for v, k in enumerate(data.tag.unique())}
label2id

{'O': 0,
 'B-DATE': 1,
 'I-DATE': 2,
 'B-NUM': 3,
 'B-PER': 4,
 'I-PER': 5,
 'B-LOC': 6,
 'I-LOC': 7,
 'B-OCC': 8,
 'I-OCC': 9,
 'B-MISC': 10,
 'I-MISC': 11,
 'B-INFO': 12,
 'I-INFO': 13,
 'B-COMM': 14,
 'I-COMM': 15,
 'I-NUM': 16}

In [10]:
data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
data.head()

Unnamed: 0,sentence,word_labels
0,"JOLY. 24th November 1852. 852. ALPHONSE JOLY ,...","O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."
1,"LENOX. ROBERTS , 18th October 1852. 426. GEORG...","O,O,O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,I..."
2,EILER. 1st October 1852. 75. LAURENTIUS MATHIA...,"O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,I-PER..."
3,"943. HENRY HITCHINS , of King William Street ,...","B-NUM,B-PER,I-PER,O,O,B-LOC,I-LOC,I-LOC,I-LOC,..."
4,GREAVES. 7th October 1852. 283. THOMAS GREAVES...,"O,B-DATE,I-DATE,I-DATE,B-NUM,B-PER,I-PER,O,O,B..."


**Dataset and Dataloader**

In [11]:
# Define variables
MAX_LEN = 128
TRAIN_BATCH_SIZE = 120
VALID_BATCH_SIZE = 60
EPOCHS = 150
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [12]:
# Word-level tokenization
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    """
    Word piece tokenization makes it difficult to match word labels
    back up with individual word pieces. This function tokenizes each
    word one at a time so that it is easier to preserve the correct
    label for each subword. It is, of course, a bit slower in processing
    time, but it will help our model achieve higher accuracy.
    """

    tokenized_sentence = []
    labels = []

    sentence = sentence.strip()

    for word, label in zip(sentence.split(), text_labels.split(",")):

        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels

In [13]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        # step 1: tokenize (and adapt corresponding labels)
        sentence = self.data.sentence[index]  
        word_labels = self.data.word_labels[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)
        
        # step 2: add special tokens (and corresponding labels)
        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"] # add special tokens
        labels.insert(0, "O") # add outside label for [CLS] token
        labels.insert(-1, "O") # add outside label for [SEP] token

        # step 3: truncating/padding
        maxlen = self.max_len

        if (len(tokenized_sentence) > maxlen):
          # truncate
          tokenized_sentence = tokenized_sentence[:maxlen]
          labels = labels[:maxlen]
        else:
          # pad
          tokenized_sentence = tokenized_sentence + ['[PAD]'for _ in range(maxlen - len(tokenized_sentence))]
          labels = labels + ["O" for _ in range(maxlen - len(labels))]

        # step 4: obtain the attention mask
        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]
        
        # step 5: convert tokens to input ids
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)

        label_ids = [label2id[label] for label in labels]
        # the following line is deprecated
        #label_ids = [label if label != 0 else -100 for label in label_ids]
        
        return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(attn_mask, dtype=torch.long),
              #'token_type_ids': torch.tensor(token_ids, dtype=torch.long),
              'targets': torch.tensor(label_ids, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

In [14]:
train_size = 0.8
train_dataset = data.sample(frac=train_size,random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (596, 2)
TRAIN Dataset: (477, 2)
TEST Dataset: (119, 2)


In [15]:
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["ids"][:30]), training_set[0]["targets"][:30]):
  print('{0:10}  {1}'.format(token, id2label[label.item()]))

[CLS]       O
14          B-NUM
,           B-NUM
09          B-NUM
##1         B-NUM
.           B-NUM
a           O
grant       O
unto        O
alfred      B-PER
taylor      I-PER
,           O
of          O
warwick     B-LOC
lane        I-LOC
,           I-LOC
in          I-LOC
the         I-LOC
city        I-LOC
of          I-LOC
london      I-LOC
,           O
and         O
henry       B-PER
george      I-PER
fra         I-PER
##si        I-PER
,           O
of          O
herbert     B-LOC


In [16]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)


**Define the model**

In [17]:
model = BertForTokenClassification.from_pretrained('bert-base-uncased', 
                                                   num_labels=len(id2label),
                                                   id2label=id2label,
                                                   label2id=label2id)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

**Training**

In [18]:
ids = training_set[0]["ids"].unsqueeze(0)
mask = training_set[0]["mask"].unsqueeze(0)
targets = training_set[0]["targets"].unsqueeze(0)
ids = ids.to(device)
mask = mask.to(device)
targets = targets.to(device)
outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
initial_loss = outputs[0]
initial_loss

tensor(2.8974, device='cuda:0', grad_fn=<NllLossBackward0>)

In [19]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 128, 17])

In [20]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [21]:
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()

    for idx, batch in enumerate(training_loader):
        
        ids = batch['ids'].to(device, dtype = torch.long)
        mask = batch['mask'].to(device, dtype = torch.long)
        targets = batch['targets'].to(device, dtype = torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
        loss, tr_logits = outputs.loss, outputs.logits
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)
        
        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")

        # compute training accuracy
        flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
        active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
        targets = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_preds.extend(predictions)
        tr_labels.extend(targets)
        
        tmp_tr_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )
        
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

In [22]:
# Training loop for model
for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 100 training steps: 2.8902788162231445
Training loss epoch: 2.743460953235626
Training accuracy epoch: 0.12151810856919382
Training epoch: 2
Training loss per 100 training steps: 2.500870704650879
Training loss epoch: 2.364936053752899
Training accuracy epoch: 0.3631162603076588
Training epoch: 3
Training loss per 100 training steps: 2.1168150901794434
Training loss epoch: 1.9946039915084839
Training accuracy epoch: 0.4152293277721444
Training epoch: 4
Training loss per 100 training steps: 1.776734471321106
Training loss epoch: 1.6534762382507324
Training accuracy epoch: 0.4114438690587558
Training epoch: 5
Training loss per 100 training steps: 1.5467045307159424
Training loss epoch: 1.4434444606304169
Training accuracy epoch: 0.41142802269320367
Training epoch: 6
Training loss per 100 training steps: 1.3679903745651245
Training loss epoch: 1.337791085243225
Training accuracy epoch: 0.43659612047109086
Training epoch: 7
Training loss per 100 training

**Evaluate model**

In [27]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)
            
            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs.loss, outputs.logits
            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
            active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
            targets = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(targets)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy
    
    #print(eval_labels)
    #print(eval_preds)

    labels = [id2label[id.item()] for id in eval_labels]
    predictions = [id2label[id.item()] for id in eval_preds]

    #print(labels)
    #print(predictions)
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [28]:
labels, predictions = valid(model, testing_loader)

Validation loss per 100 evaluation steps: 0.09809552878141403
Validation Loss: 0.08194364607334137
Validation Accuracy: 0.9805396786045064


In [29]:
from seqeval.metrics import classification_report

print(classification_report([labels], [predictions]))

              precision    recall  f1-score   support

        COMM       0.78      0.74      0.76        19
        DATE       1.00      1.00      1.00       132
        INFO       0.94      0.97      0.95       152
         LOC       0.87      0.89      0.88       218
        MISC       0.82      0.92      0.87       123
         NUM       1.00      1.00      1.00       340
         OCC       0.87      0.94      0.90       162
         PER       0.99      0.99      0.99       203

   micro avg       0.93      0.96      0.95      1349
   macro avg       0.91      0.93      0.92      1349
weighted avg       0.94      0.96      0.95      1349



**Save model weights and tokenizer locally**

In [30]:
model_path = "/path/to/directory"

model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/tokenizer_config.json',
 '/content/drive/MyDrive/special_tokens_map.json',
 '/content/drive/MyDrive/vocab.txt',
 '/content/drive/MyDrive/added_tokens.json')

Once you stop training, you **must** save the contents of the outputs folder to your local directory (Colab deletes local files once the runtime is deleted).