In [1]:
# third-party imports
import numpy as np
import pandas as pd
# local imports
from BERT_geoparser.tokenizer import Tokenizer
from BERT_geoparser.data import Data
from BERT_geoparser.model import BertModel
from BERT_geoparser.analysis import Results


# Fine-tuning a BERT language model on NER data
In this notebook we use the `BERT_geoparser` package to build and fine-tune a BERT model to perform Named Entity Recognition (NER) tasks. This is the first step in a multi-step process to build and train a BERT model to identify target and incidental locations within text. 

We use an NER dataset labelled in the B-I-O format, with 8 categories of word: location (`geo`), time (`tim`), organization (`org`), person (`per`), geo-political entity (`gpe`), art/culture (`art`), event (`eve`) and nature (`nat`). Each tag indicates whether a word is the *beginning* of a phrase in that category (`B`) or *inside* such a phrase (`I`). Words which do not belong to any category are given the *outside* tag (`O`). Special tokens indicating the start (`CLS`) and end (`SEP`) of a sentence are also added. For example, the phrase:

<p style="text-align: center;"><span style="color:red">Jane</span> visited <span style="color:green">Madison Square Garden</span> while in <span style="color:yellow">New York</span>.</p>

Would receive the tags:

<p style="text-align: center;"> [CLS] <span style="color:red"> [B-PER] </span> [O] <span style="color:green">[B-ORG] [I-ORG] [I-ORG]</span> [O] [O] <span style="color:yellow">[B-GEO] [I-GEO]</span> [SEP] </p>

The fine-tuned BERT model can then estimate the most likely sequence of tags for a given sentence, and can report its confidence in each predicted tag.
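
To make the tagging scheme concrete, the following minimal sketch groups B-I-O tags back into entity phrases. The token and tag lists are written by hand to mirror the example sentence (they are not model output), and `extract_entities` is a hypothetical helper, not part of `BERT_geoparser`.

```python
# Hand-written tokens and tags mirroring the example sentence (assumption: not model output).
tokens = ["Jane", "visited", "Madison", "Square", "Garden", "while", "in", "New", "York"]
tags = ["B-per", "O", "B-org", "I-org", "I-org", "O", "O", "B-geo", "I-geo"]


def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (phrase, category) pairs."""
    entities = []
    current_words, current_cat = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_cat))
            current_words, current_cat = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_cat:
            # An I- tag of the same category continues the open entity.
            current_words.append(token)
        else:
            # 'O' (or a stray I- tag) closes any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_cat))
            current_words, current_cat = [], None
    # Close an entity that runs to the end of the sentence.
    if current_words:
        entities.append((" ".join(current_words), current_cat))
    return entities


entities = extract_entities(tokens, tags)
print(entities)
# [('Jane', 'per'), ('Madison Square Garden', 'org'), ('New York', 'geo')]
```

The `B` tag is what lets two adjacent entities of the same category be separated; without it, consecutive `I` tags would be ambiguous.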


In [2]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [3]:
# Load the dataset using the BERT_geoparser Data.py module
data_csv = r'data/step_1/train_ner_dataset.csv'
tokenizer = Tokenizer(size='large', cased=True)
data = Data(data_path=data_csv, 
            tokenizer=tokenizer,
            max_len=80)


In [4]:
# Initialize a new BertModel object
model = BertModel(saved_model=None, data=data, convolutional=True, lr=10e-6)
model.model.summary()

Some layers from the model checkpoint at bert-large-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-large-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 80)]         0           []                               
                                                                                                  
 input_3 (InputLayer)           [(None, 80)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 80)]         0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  333579264   ['input_1[0][0]',                
                                thPoolingAndCrossAt               'input_3[0][0]',            

  super().__init__(name, **kwargs)


In [None]:
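from sklearn.utils import class_weight
# Read the raw tags into a separate DataFrame so the `data` object used by the model is not overwritten
tag_df = pd.read_csv(data_csv)
# 'balanced' weights each class by n_samples / (n_classes * class_count)
class_weights_list = class_weight.compute_class_weight('balanced',
                                                 classes=np.array(['B-inc', 'B-tar', 'I-inc', 'I-tar', 'O']),
                                                 y=tag_df.Tag.values)

# Map each class index to its weight
class_weights = {i:w for i,w in enumerate(class_weights_list)}
# Give index 5 (not among the classes listed above) a small fixed weight
class_weights.update({5:0.01})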

In [5]:
model.train(save_as='20231011_bert_model_large_cased.hdf5', n_epochs=5, batch_size=4, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
  45/8633 [..............................] - ETA: 29:09 - loss: 0.0419 - masked_ce_loss: 0.0419 - weighted_masked_ce_loss: 0.0419

KeyboardInterrupt: 

## Testing the model on new data
We will consider the recall and precision for each category. For each category $C$, the categorical recall, $r_C$, and precision, $p_C$, are defined in terms of the number of true positives, $TP_C$, false positives, $FP_C$, and false negatives, $FN_C$, in that category, such that:

$$ r_C = \frac{TP_C}{FN_C + TP_C}, $$
$$ p_C = \frac{TP_C}{FP_C + TP_C}. $$

We will also consider the micro-averaged recall, $\mu_r$, and precision, $\mu_p$, and the macro-averaged recall, $\nu_r$, and precision, $\nu_p$. For a model with categories $C \in \{ 1,2,...,N \}$, these are given by: 

$$ \nu_r = \frac{\sum_{C=1}^{N}r_C}{N},$$
$$ \nu_p = \frac{\sum_{C=1}^{N}p_C}{N},$$

$$ \mu_r = \frac{\sum_{C=1}^{N}TP_C}{\sum_{C=1}^{N}(TP_C + FN_C)},$$
$$ \mu_p = \frac{\sum_{C=1}^{N}TP_C}{\sum_{C=1}^{N}(TP_C + FP_C)}.$$

Considering both micro- and macro-averaged statistics lets us better understand how class imbalances interact with our model results. The macro-averaged statistics treat all classes equally, regardless of the number of occurrences. The micro-averaged statistics give equal weight to each sample in the dataset, which can be helpful when there is a class imbalance. In this dataset the 'O' class is significantly larger than any other class, so the micro average is likely to be more important.

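Because each token receives exactly one predicted tag, every misclassified token counts as one false positive (for the predicted category) and one false negative (for the true category), so $\sum_C FP_C = \sum_C FN_C$. As a result, $\mu_r$ and $\mu_p$ always take the same value, as does a micro-averaged F1 score. Hence, we will also consider a macro-averaged F1 statistic given by: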

$$ F_{\nu} = 2 \times \frac{\nu_p \cdot \nu_r}{\nu_p + \nu_r}. $$
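
The definitions above can be checked on a small example. The sketch below computes the per-category, macro-averaged and micro-averaged statistics directly with NumPy. The labels are made up for illustration, and the code does not use the `Results` class, whose implementation may differ (for example in how it handles a category with no predictions, which is assigned a precision of 0 here as an assumption).

```python
import numpy as np

# Hypothetical toy labels, one tag per token (not taken from the notebook's dataset).
y_true = np.array(["O", "O", "O", "O", "geo", "geo", "per", "org", "O", "geo"])
y_pred = np.array(["O", "O", "O", "geo", "geo", "O", "per", "per", "O", "geo"])
classes = ["O", "geo", "per", "org"]

# Per-category counts of true positives, false positives and false negatives.
tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in classes])
fp = np.array([np.sum((y_true != c) & (y_pred == c)) for c in classes])
fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in classes])

# Per-category recall and precision; a zero denominator yields 0 (assumption).
recall = np.divide(tp, tp + fn, out=np.zeros(len(classes)), where=(tp + fn) > 0)
precision = np.divide(tp, tp + fp, out=np.zeros(len(classes)), where=(tp + fp) > 0)

# Macro averages: unweighted mean over categories.
macro_recall = recall.mean()
macro_precision = precision.mean()
macro_f1 = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)

# Micro averages: pool the counts over all categories before dividing.
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_precision = tp.sum() / (tp.sum() + fp.sum())

print("macro recall / precision / F1:", macro_recall, macro_precision, macro_f1)
print("micro recall / precision:", micro_recall, micro_precision)
```

On these toy labels the micro-averaged recall and precision are both 0.7, while the macro averages are lower because the rare `org` category is never predicted.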

In [7]:
#model = BertModel(saved_model='20230808_bert_model_large.hdf5', data=data)
y_pred, y_true = model.test('data/step_1/test_ner_dataset.csv')



In [8]:
res = Results(y_true, y_pred)
for cat in ['O', 'geo', 'per', 'gpe', 'org']:
    print(f'"{cat}" accuracy : {np.round(res.categorical_accuracy(cat),3)}')
    print(f'"{cat}" precision : {np.round(res.categorical_precision(cat),3)}')
    print(f'"{cat}" recall : {np.round(res.categorical_recall(cat),3)}')
    print('=======================')
print(f'macro average recall : {np.round(res.macro_average_recall(), 6)}')
print(f'macro average precision : {np.round(res.macro_average_precision(),6)}')
print(f'micro average recall : {np.round(res.micro_average_recall(),3)}')
print(f'micro average precision : {np.round(res.micro_average_precision(),3)}')
print(f'macro average F1 : {np.round(res.macro_average_F1(), 3)}')

"O" accuracy : 0.992
"O" precision : 0.985
"O" recall : 0.992
"geo" accuracy : 0.899
"geo" precision : 0.859
"geo" recall : 0.915
"per" accuracy : 0.866
"per" precision : 0.902
"per" recall : 0.899
"gpe" accuracy : 0.955
"gpe" precision : 0.967
"gpe" recall : 0.957
"org" accuracy : 0.675
"org" precision : 0.779
"org" recall : 0.689
macro average recall : 0.566862
macro average precision : 0.640008
micro average recall : 0.954
micro average precision : 0.954
macro average F1 : 0.601


In [7]:
res = Results(y_true, y_pred)
for cat in ['O', 'geo', 'per', 'gpe', 'org']:
    print(f'"{cat}" accuracy : {np.round(res.categorical_accuracy(cat),3)}')
    print(f'"{cat}" precision : {np.round(res.categorical_precision(cat),3)}')
    print(f'"{cat}" recall : {np.round(res.categorical_recall(cat),3)}')
    print('=======================')
print(f'macro average recall : {np.round(res.macro_average_recall(), 6)}')
print(f'macro average precision : {np.round(res.macro_average_precision(),6)}')
print(f'micro average recall : {np.round(res.micro_average_recall(),3)}')
print(f'micro average precision : {np.round(res.micro_average_precision(),3)}')
print(f'macro average F1 : {np.round(res.macro_average_F1(), 3)}')

"O" accuracy : 0.991
"O" precision : 0.986
"O" recall : 0.991
"geo" accuracy : 0.856
"geo" precision : 0.891
"geo" recall : 0.871
"per" accuracy : 0.895
"per" precision : 0.867
"per" recall : 0.929
"gpe" accuracy : 0.957
"gpe" precision : 0.957
"gpe" recall : 0.957
"org" accuracy : 0.675
"org" precision : 0.754
"org" recall : 0.691
macro average recall : 0.61038
macro average precision : 0.669466
micro average recall : 0.952
micro average precision : 0.952
macro average F1 : 0.639


The precision and recall on the 'geo' category are pretty poor. We'll try reducing the number of categories the model predicts to see whether that helps.

In [None]:
## Remove all tags that aren't in the new tagging system
# Tags to keep; every other tag is mapped to 'O'
new_tags = ['B-geo', 'I-geo', 'B-gpe', 'I-gpe', 'B-org', 'I-org', 'B-per', 'I-per']
# test data
test_data = pd.read_csv('data/step_1/test_ner_dataset.csv')
test_data['Tag'] = [x if x in new_tags else 'O' for x in test_data.Tag]
# train data
train_data = pd.read_csv('data/step_1/train_ner_dataset.csv')
train_data['Tag'] = [x if x in new_tags else 'O' for x in train_data.Tag]
# save as new datasets
test_data.to_csv('data/step_1/test_ner_dataset_reduced_cats.csv', index=False)
train_data.to_csv('data/step_1/train_ner_dataset_reduced_cats.csv', index=False)

In [9]:
# Load the dataset using the BERT_geoparser Data.py module
data_csv = r'data/step_1/train_ner_dataset_reduced_cats.csv'
tokenizer = Tokenizer(size='base', cased=False)
data = Data(data_path=data_csv, 
            tokenizer=tokenizer,
            max_len=125)

In [10]:
# Initialize a new BertModel object
model = BertModel(saved_model=None, data=data)
model.model.summary()

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 125)]        0           []                               
                                                                                                  
 input_3 (InputLayer)           [(None, 125)]        0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 125)]        0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  109482240   ['input_1[0][0]',                
                                thPoolingAndCrossAt               'input_3[0][0]',            

  super().__init__(name, **kwargs)


In [11]:
model.train(save_as='20230808_bert_model_large_reduced_cats.hdf5', n_epochs=2, batch_size=16, validation_split=0.1)

Epoch 1/2
Epoch 2/2


In [12]:
#model = BertModel(saved_model='20230808_bert_model_large.hdf5', data=data)
y_pred, y_true = model.test('data/step_1/test_ner_dataset_reduced_cats.csv')



In [13]:
res = Results(y_true, y_pred)
for cat in ['O', 'geo', 'per', 'gpe', 'org']:
    print(f'"{cat}" accuracy : {np.round(res.categorical_accuracy(cat),3)}')
    print(f'"{cat}" precision : {np.round(res.categorical_precision(cat),3)}')
    print(f'"{cat}" recall : {np.round(res.categorical_recall(cat),3)}')
    print('=======================')
print(f'macro average recall : {np.round(res.macro_average_recall(), 3)}')
print(f'macro average precision : {np.round(res.macro_average_precision(),3)}')
print(f'micro average recall : {np.round(res.micro_average_recall(),3)}')
print(f'micro average precision : {np.round(res.micro_average_precision(),3)}')

"O" accuracy : 0.991
"O" precision : 0.992
"O" recall : 0.991
"geo" accuracy : 0.876
"geo" precision : 0.859
"geo" recall : 0.886
"per" accuracy : 0.887
"per" precision : 0.872
"per" recall : 0.918
"gpe" accuracy : 0.951
"gpe" precision : 0.964
"gpe" recall : 0.952
"org" accuracy : 0.703
"org" precision : 0.776
"org" recall : 0.719
macro average recall : 0.843
macro average precision : 0.841
micro average recall : 0.968
micro average precision : 0.968
