
# NER model based on BERT

Dataset used: CoNLL-2003 (English)
*   Downloaded from https://deepai.org/dataset/conll-2003-english
*   Detailed explanation https://github.com/huggingface/datasets/tree/master/datasets/conll2003
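
For orientation, each non-blank line of the CoNLL-2003 files holds four space-separated columns (token, POS tag, chunk tag, NER tag), and blank lines separate sentences. The sketch below groups such lines into sentences in plain Python. The sample lines and the helper name `read_conll_sentences` are illustrative assumptions, not part of this notebook's code.

```python
def read_conll_sentences(lines):
    """Group CoNLL-style lines into (tokens, ner_tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        fields = line.split()
        if not fields:
            # a blank line closes the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        if fields[0] == "-DOCSTART-":
            # document marker, not a real token
            continue
        tokens.append(fields[0])   # first column: the token
        tags.append(fields[-1])    # last column: the NER tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Illustrative sample in the four-column layout (assumed, not read from disk).
sample_lines = [
    "-DOCSTART- -X- -X- O\n",
    "\n",
    "EU NNP B-NP B-ORG\n",
    "rejects VBZ B-VP O\n",
    "\n",
    "Peter NNP B-NP B-PER\n",
    "Blackburn NNP I-NP I-PER\n",
]

sentences = read_conll_sentences(sample_lines)
for tokens, tags in sentences:
    print(" ".join(tokens), "->", " ".join(tags))
```

The preprocessing cells below reach the same sentence/label strings with pandas instead of a manual loop.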

Code reference:
https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb#scrollTo=VuUdX_fImswO

Model used: BertForTokenClassification
https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForTokenClassification

Model Base:
https://github.com/chambliss/Multilingual_NER/blob/master/python/utils/main_utils.py#L118

Source code showing in detail how the model is built: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py#L512

Loss function: CrossEntropyLoss
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
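
As a rough illustration of what this loss computes for token classification: `CrossEntropyLoss` averages the negative log-probability of the true class over all positions, skipping positions whose target equals its `ignore_index` (default `-100`). The pure-Python sketch below mirrors that behaviour on toy values. The function name `cross_entropy` and the toy logits are assumptions made for the example, not PyTorch code.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose target is not ignored."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        max_logit = max(row)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


num_labels = 9  # size of the label set used in this notebook
uniform_logits = [[0.0] * num_labels for _ in range(4)]
targets = [0, 3, IGNORE_INDEX, 8]  # the third position is masked out

loss = cross_entropy(uniform_logits, targets)
print(round(loss, 4), round(math.log(num_labels), 4))  # prints 2.1972 2.1972
```

A nine-way classifier that spreads probability evenly therefore starts near ln 9 ≈ 2.197, which is close to the first training loss logged later in this notebook (about 2.20).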


## Install and import packages

In [1]:
#!pip install torch

In [2]:
#!pip install transformers

In [3]:
#!pip install transformers seqeval[gpu]

In [4]:
import pandas as pd
import numpy as np
import torch
from torch import cuda
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification
from sklearn.metrics import accuracy_score
from seqeval.metrics import classification_report

## Data preprocessing

In [5]:
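def get_raw_df(txt_list):
  # blank lines separate sentences, so counting them numbers each sentence
  i = 0
  all_list = [sentence.strip().split() for sentence in txt_list]
  for row in all_list:
    if not row:
      i += 1
    else:
      row.append(f'sentence {i}')
  # drop blank rows and everything before the first blank line (the leading -DOCSTART- header)
  all_list = [x for x in all_list if x and x[4] != "sentence 0"]
  df = pd.DataFrame(all_list, columns=["id", "pos_tags","chunk_tags","ner_tags", "sentence number"])
  return df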

def get_preprocessed_df(txt_list, tag_name):
  """tag_name is the tag column to use as labels: "pos_tags", "chunk_tags" or "ner_tags"."""
  raw_df = get_raw_df(txt_list)
  # join the tokens of each sentence into one space-separated string
  raw_df["sentence"] = raw_df.groupby("sentence number")["id"].transform(lambda x: " ".join(x))
  # join the chosen tags of each sentence the same way (works for any of the three tag columns)
  raw_df["labels"] = raw_df.groupby("sentence number")[tag_name].transform(lambda x: " ".join(x))
  return raw_df


In [6]:
with open("train.txt") as f:
  train = f.readlines()
df_train = get_preprocessed_df(train, "ner_tags")

with open("valid.txt") as f:
  valid = f.readlines()
df_valid = get_preprocessed_df(valid, "ner_tags")

with open("test.txt") as f:
  test = f.readlines()
df_test = get_preprocessed_df(test, "ner_tags")

In [7]:
print(f"Length of train dataset: {len(df_train)}, Length of validation dataset: {len(df_valid)}, Length of test dataset: {len(df_test)} ")

Length of train dataset: 204566, Length of validation dataset: 51577, Length of test dataset: 46665 


In [8]:
df_train["ner_tags"].value_counts()

O         170523
B-LOC       7140
B-PER       6600
B-ORG       6321
I-PER       4528
I-ORG       3704
B-MISC      3438
I-LOC       1157
I-MISC      1155
Name: ner_tags, dtype: int64

In [9]:
labels_to_ids = {"O": 0, "B-LOC": 1, "B-PER": 2, "B-ORG": 3, "I-PER": 4, "I-ORG": 5, "B-MISC": 6, "I-LOC": 7, "I-MISC": 8}
ids_to_labels = dict((v,k) for k,v in labels_to_ids.items())

In [10]:
def get_sentence_labels(raw_df):
  return raw_df[["sentence", "labels"]].drop_duplicates().reset_index(drop=True)

In [11]:
df_train = get_sentence_labels(df_train)
df_valid = get_sentence_labels(df_valid)
df_test = get_sentence_labels(df_test)

In [12]:
df_train.head()

|   | sentence | labels |
|---|----------|--------|
| 0 | EU rejects German call to boycott British lamb . | B-ORG O B-MISC O O O B-MISC O O |
| 1 | Peter Blackburn | B-PER I-PER |
| 2 | BRUSSELS 1996-08-22 | B-LOC O |
| 3 | The European Commission said on Thursday it di... | O B-ORG I-ORG O O O O O O B-MISC O O O O O B-M... |
| 4 | Germany 's representative to the European Unio... | B-LOC O O O O B-ORG I-ORG O O O B-PER I-PER O ... |


In [13]:
print(f"Length of train dataset: {len(df_train)}, Length of validation dataset: {len(df_valid)}, Length of test dataset: {len(df_test)} ")

Length of train dataset: 12694, Length of validation dataset: 3072, Length of test dataset: 3188 


## Change dataframe to PyTorch tensors 

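Define the hyperparameters and the tokenizer for the model.


1.   The learning rate is currently 1e-5; it may be tuned later.
2.   The tokenizer comes from the pretrained bert-base-uncased model. It splits words into WordPiece subwords, so the word-level labels do not map one-to-one onto tokens. This may cost some accuracy and could be improved later.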
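
To make the WordPiece concern in item 2 concrete: a single word can become several pieces, while the dataset has one label per word. The approach used in this notebook labels only the first piece of each word and masks the rest with `-100`. The sketch below shows that alignment with a hand-written splitter. `TOY_SPLITS` and `toy_wordpiece` are hypothetical stand-ins for the real tokenizer, and the label map is a shortened copy of `labels_to_ids`.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

# Shortened label map for the example (same ids as the notebook's labels_to_ids).
labels_to_ids = {"O": 0, "B-PER": 2, "I-PER": 4}

# Hypothetical subword splits; a real WordPiece tokenizer produces its own pieces.
TOY_SPLITS = {"blackburn": ["black", "##burn"]}


def toy_wordpiece(word):
    """Lowercase the word and split it using the hand-written table above."""
    return TOY_SPLITS.get(word.lower(), [word.lower()])


def align_first_piece(words, word_labels):
    """Return (pieces, label_ids) with a real label on the first piece of each word only."""
    pieces, label_ids = ["[CLS]"], [IGNORE_INDEX]
    for word, label in zip(words, word_labels):
        sub = toy_wordpiece(word)
        pieces.extend(sub)
        label_ids.append(labels_to_ids[label])              # first piece keeps the word label
        label_ids.extend([IGNORE_INDEX] * (len(sub) - 1))   # remaining pieces are masked
    pieces.append("[SEP]")
    label_ids.append(IGNORE_INDEX)
    return pieces, label_ids


pieces, label_ids = align_first_piece(["Peter", "Blackburn"], ["B-PER", "I-PER"])
print(pieces)     # ['[CLS]', 'peter', 'black', '##burn', '[SEP]']
print(label_ids)  # [-100, 2, 4, -100, -100]
```

Masking the trailing pieces keeps exactly one prediction per original word, so predictions line up with the word-level tags at evaluation time.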




In [14]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased') # BERT uses WordPiece tokenization; may be improved later

In [15]:
class dataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

  def __getitem__(self, index):
        # step 1: get the sentence and word labels 
        sentence = self.data.sentence[index].strip().split()  
        word_labels = self.data.labels[index].split() 

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(sentence,
                             is_split_into_words=True, # the sentence is already split into words (this argument replaces the older is_pretokenized)
                             return_offsets_mapping=True, 
                             padding='max_length', 
                             truncation=True, 
                             max_length=self.max_len)
        
        # step 3: create token labels only for first word pieces of each tokenized word
        labels = [labels_to_ids[label] for label in word_labels] 
        # code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
        # create an empty array of -100 of length max_length
        encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100
        
        # set only labels whose first offset position is 0 and the second is not 0
        i = 0
        for idx, mapping in enumerate(encoding["offset_mapping"]):
          if mapping[0] == 0 and mapping[1] != 0:
            # overwrite label
            encoded_labels[idx] = labels[i]
            i += 1

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)
        
        return item

  def __len__(self):
        return self.len

In [16]:
training_set = dataset(df_train, tokenizer, MAX_LEN)
validation_set = dataset(df_valid, tokenizer, MAX_LEN)
testing_set = dataset(df_test, tokenizer, MAX_LEN)

In [17]:
training_set[0]

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]),
 'input_ids': tensor([  101,  7327, 19164,  2446,  2655,  2000, 17757,  2329, 12559,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     

## Build the model

In [18]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

In [19]:
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(labels_to_ids))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

In [20]:
device = 'cuda' if cuda.is_available() else 'cpu' # use the GPU when available to reduce processing time
model.to(device)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

In [21]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [27]:
inputs = training_set[0]
input_ids = inputs["input_ids"].unsqueeze(0)
attention_mask = inputs["attention_mask"].unsqueeze(0)
labels = inputs["labels"].unsqueeze(0)

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
labels = labels.to(device)

outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
initial_loss = outputs[0]
initial_loss

tensor(0.1626, device='cuda:0', grad_fn=<NllLossBackward0>)

In [28]:
outputs[1]

tensor([[[ 7.2604, -0.7276, -0.9241,  ...,  0.2589, -1.6862, -0.5932],
         [ 0.0611,  4.1971, -1.1044,  ...,  1.2290, -1.7444, -2.3538],
         [ 8.3707, -1.3601, -1.4624,  ..., -0.7933, -1.7581, -0.5170],
         ...,
         [ 1.5688,  2.1598, -0.6317,  ...,  1.5575, -1.5684, -1.4480],
         [ 1.4428,  2.2026, -0.4850,  ...,  1.5535, -1.5579, -1.4704],
         [ 1.5983,  1.9302, -0.5046,  ...,  1.6377, -1.5401, -1.3511]]],
       device='cuda:0', grad_fn=<AddBackward0>)

## Train the model

In [22]:
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(device, dtype = torch.long)
        mask = batch['attention_mask'].to(device, dtype = torch.long)
        labels = batch['labels'].to(device, dtype = torch.long)

        #loss, tr_logits = model(input_ids=ids, attention_mask=mask, labels=labels)
        output = model(input_ids=ids, attention_mask=mask, labels=labels)
        tr_loss += output[0].item() # accumulate a plain float so the autograd graph is not kept alive

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        
        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = output[1].view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(-100).type_as(labels))
        
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # backward pass
        optimizer.zero_grad()
        output[0].backward()

        # gradient clipping (must run after backward() so that it clips the fresh gradients)
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )

        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

In [23]:
for epoch in range(EPOCHS):
  print(f"Training epoch: {epoch + 1}")
  train(epoch)

Training epoch: 1
Training loss per 100 training steps: 2.2004306316375732
Training loss per 100 training steps: 0.8841642737388611
Training loss per 100 training steps: 0.7637917995452881
Training loss per 100 training steps: 0.6042304635047913
Training loss per 100 training steps: 0.5060256123542786
Training loss per 100 training steps: 0.45007655024528503
Training loss per 100 training steps: 0.39518532156944275
Training loss per 100 training steps: 0.3751581609249115
Training loss per 100 training steps: 0.34898972511291504
Training loss per 100 training steps: 0.32761529088020325
Training loss per 100 training steps: 0.3039153218269348
Training loss per 100 training steps: 0.2881028950214386
Training loss per 100 training steps: 0.27282842993736267
Training loss per 100 training steps: 0.2575330436229706
Training loss per 100 training steps: 0.2451707422733307
Training loss per 100 training steps: 0.23471219837665558
Training loss per 100 training steps: 0.22432264685630798
Traini

## Test the model

In [24]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['input_ids'].to(device, dtype = torch.long)
            mask = batch['attention_mask'].to(device, dtype = torch.long)
            labels = batch['labels'].to(device, dtype = torch.long)
            
            #loss, eval_logits = model(input_ids=ids, attention_mask=mask, labels=labels)
            output = model(input_ids=ids, attention_mask=mask, labels=labels)

            eval_loss += output[0].item()

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
            active_logits = output[1].view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            
            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        
            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(labels)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    # map ids back to tag strings (avoid shadowing the built-in id)
    labels = [ids_to_labels[label_id.item()] for label_id in eval_labels]
    predictions = [ids_to_labels[pred_id.item()] for pred_id in eval_preds]
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [25]:
labels, predictions = valid(model, testing_loader)

Validation loss per 100 evaluation steps: 0.5878165364265442
Validation loss per 100 evaluation steps: 0.08609670852218745
Validation loss per 100 evaluation steps: 0.0866909336718031
Validation loss per 100 evaluation steps: 0.0951138291464642
Validation loss per 100 evaluation steps: 0.08436028769909244
Validation loss per 100 evaluation steps: 0.08507894132178957
Validation loss per 100 evaluation steps: 0.09463777601672027
Validation loss per 100 evaluation steps: 0.09053018444600368
Validation loss per 100 evaluation steps: 0.08993140449785231
Validation loss per 100 evaluation steps: 0.09399347265655915
Validation loss per 100 evaluation steps: 0.0920146119994945
Validation loss per 100 evaluation steps: 0.10009727149910601
Validation loss per 100 evaluation steps: 0.09936937424129373
Validation loss per 100 evaluation steps: 0.10162544719478277
Validation loss per 100 evaluation steps: 0.09899976121648604
Validation loss per 100 evaluation steps: 0.09559049438813925
Validation L

## Produce the classification report

In [26]:
print(classification_report([labels], [predictions]))

              precision    recall  f1-score   support

         LOC       0.86      0.93      0.89      1528
        MISC       0.75      0.76      0.76       697
         ORG       0.85      0.84      0.84      1611
         PER       0.97      0.96      0.97      1614

   micro avg       0.87      0.89      0.88      5450
   macro avg       0.86      0.87      0.86      5450
weighted avg       0.88      0.89      0.88      5450
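
The report above comes from seqeval, which scores whole entities (both type and span must match) rather than individual tokens, so its numbers are usually lower than the token accuracy printed during evaluation. The sketch below illustrates the idea on a toy example. The helpers `extract_entities` and `entity_prf` are hypothetical simplifications: stray `I-` tags without a preceding `B-` are ignored here, which may differ from seqeval's handling.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a list of IOB2 tags."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing entity
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i - 1))  # close the open entity
                start, ent_type = None, None
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]            # open a new entity
    return entities


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 using exact span-and-type matches."""
    true_ents = extract_entities(true_tags)
    pred_ents = extract_entities(pred_tags)
    correct = len(true_ents & pred_ents)
    precision = correct / len(pred_ents) if pred_ents else 0.0
    recall = correct / len(true_ents) if true_ents else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


true_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "O", "O", "B-LOC"]  # the person entity is cut short

precision, recall, f1 = entity_prf(true_tags, pred_tags)
token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(precision, recall, f1, token_accuracy)  # prints 0.5 0.5 0.5 0.75
```

In this toy case three of four tokens are tagged correctly, yet only one of the two entities counts as a match because the truncated person span is not an exact hit.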



## Save the model for further use