# Named Entity Recognition Outline -- TF
This notebook is an outline for anyone applying named entity recognition (NER) to a dataset with TensorFlow and HuggingFace Transformers.

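It is important to understand that the model takes the three basic BERT inputs, `input_ids`, `attention_mask`, and `token_type_ids`, and is trained against a target tensor `label_ids`. In this notebook every sequence is padded or truncated to length 512, so all four tensors have length 512. The values of `label_ids` are the entity classifications of the corresponding `input_ids`; the model itself emits one row of logits per token, and the argmax of each row is the predicted label id. For example: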

if `input_ids[5] = word_to_id("Edinburgh")`,

then `label_ids[5] = category_to_id("CITY")`

Thus, `label_ids` (before encoding) should look something like `["O", "CITY", "O", "NAME", "O", ...]`, where "O" stands for "outside of entity". Of course, `label_ids` must be a numerical tensor, encoded like this: `[0, 1, 0, 2, 0, ...]`, where `{0:"O", 1:"CITY", 2:"NAME"}`. Because these labels are integer ids rather than one-hot vectors, we must use the sparse variants during model compilation: `SparseCategoricalCrossentropy` for the loss and `SparseCategoricalAccuracy` for the metric.
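To make the loss choice concrete, here is a minimal pure-Python sketch of what sparse categorical cross-entropy computes for a single token: the negative log of the softmax probability assigned to the true class id. The logits are made-up numbers, and the helper names (`softmax`, `sparse_categorical_crossentropy`, `categorical_crossentropy`) are illustrative stand-ins, not TensorFlow's API.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label_id, logits):
    """Loss for one token when the target is an integer class id."""
    return -math.log(softmax(logits)[label_id])

def categorical_crossentropy(one_hot, logits):
    """Loss for one token when the target is a one-hot vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Hypothetical logits for one token over {0:"O", 1:"CITY", 2:"NAME"}
logits = [0.5, 2.0, -1.0]
sparse_loss = sparse_categorical_crossentropy(1, logits)  # target given as an id
dense_loss = categorical_crossentropy([0, 1, 0], logits)  # same target, one-hot
print(sparse_loss, dense_loss)  # the two values agree
```

The two functions return the same number; the sparse form simply avoids building a one-hot vector for each of the 512 positions.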

One of the most important things to consider in the NER classification process is data preprocessing. To train a BERT NER model, you must provide your own y-values, the `label_ids`. The default BERT tokenizer splits unknown words into smaller pieces and prepends "##" to each continuation piece to mark a broken-up word. When this happens, we must detect it and let the entity label span multiple indexes. For example, if "Edinburgh" were split into `[..., "Edin", "##burgh", ...]`, we would need two "CITY" labels instead of one, like this: `[..., "CITY", "CITY", ...]`. Furthermore, the default tokenizer splits off punctuation and gives each punctuation mark its own token id. We must also detect punctuation and assign the "O" ("outside of entity") label id to these tokens. I added this functionality to `entity_label_adjustment` within this notebook.
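The alignment idea can be sketched on plain token strings, without loading a tokenizer. This is an illustration only: `align_labels` is a hypothetical helper, and the token split below is the hypothetical "Edinburgh" split from the paragraph above, not real tokenizer output.

```python
import string

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def align_labels(tokens, word_labels, outside="O"):
    """Expand one label per word into one label per token."""
    aligned = []
    remaining = iter(word_labels)
    for tok in tokens:
        if tok in SPECIAL_TOKENS or all(ch in string.punctuation for ch in tok):
            aligned.append(outside)          # special token or punctuation
        elif tok.startswith("##") and aligned:
            aligned.append(aligned[-1])      # continuation piece inherits the previous label
        else:
            aligned.append(next(remaining))  # first piece of a new word consumes a label
    return aligned

# Assumed tokenization of "I love Edinburgh!" for illustration
tokens = ["[CLS]", "I", "love", "Edin", "##burgh", "!", "[SEP]", "[PAD]"]
word_labels = ["O", "O", "CITY"]  # one label per word
aligned = align_labels(tokens, word_labels)
print(aligned)  # ['O', 'O', 'O', 'CITY', 'CITY', 'O', 'O', 'O']
```

The output has one label per token, so it can be encoded to ids and used directly as a training target.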

## Functions

In [None]:
#@title Progress Counter:
#@markdown class/function to print percentage progress after each iteration in loop

# Progress Counter
import sys

class progressCounter():

  def __init__(self, num_iterations):
    self.progress = 0
    self.N = num_iterations
    self.calls = 0

  def check_pt(self):
    self.calls+=1
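    # calls runs from 1 to N, so dividing by N ends at exactly 100% (and avoids dividing by zero when N == 1)
    curr_progress = int(self.calls / max(self.N, 1) * 100)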
    if curr_progress - self.progress > 0:
      self.progress = curr_progress
      sys.stdout.write("\rProgress: {0}%".format(self.progress))
      sys.stdout.flush()

In [None]:
#@title BERT tokenizer entity label adjustment function
#@markdown function to detect and expand entity labels to take into account punctuation and words split-up by tokenizer

#@markdown when function detects punctuation or BERT special tokens, function applies label 'O', or "outside of entity"

#@markdown when function detects split-up word (Ex. `[..., Yu, ##k, ##ko, ...]`), function splits label three ways (Ex. `[..., NAME, NAME, NAME, ...]`)

import string
# takes a list of input_ids and a list of word-level entity labels, and expands the labels to one per token
def entity_label_adjustment(seq_input_ids, seq_labels, tokenizer, label_dict):
  special_tokens = ['[CLS]', '[SEP]', '[PAD]']
  seq_labels = list(seq_labels) # copy so the caller's list is not modified
  seq_tokens = tokenizer.convert_ids_to_tokens(seq_input_ids)
  for i in range(len(seq_tokens)):
    if seq_tokens[i] in special_tokens or seq_tokens[i] in string.punctuation: # definitely outside of entity
      seq_labels.insert(i, label_dict["O"])
      continue
    # continuation pieces start with "##"; accept any character after it (digits, accented letters), not only a-z
    if len(seq_tokens[i]) > 2 and seq_tokens[i].startswith("##"):
      seq_labels.insert(i, seq_labels[i-1]) # repeat the label of the previous piece
  return seq_labels

## Import NER Classifier & Tokenizer

In [None]:
!pip install transformers
from transformers import BertTokenizer, TFBertForTokenClassification
import tensorflow as tf
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForTokenClassification.from_pretrained("bert-base-cased", output_attentions = False, output_hidden_states = False, num_labels=3) # <-- num_labels must equal the number of distinct labels in the dataset

Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForTokenClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data

In [None]:
#@title Dataset Placeholder
#@markdown Example dataset inside
categories = ['soc.religion.christian']
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)
data = data.data

In [None]:
# Another example
# Use a non-English name to force tokenizer splitting
data = ["Hello, my name is Yukko."] # Yukko becomes [Yu, ##k, ##ko] when tokenized
labels = [ "O", "O", "N", "O", "NAME" ]
label_dict = {"O" : 0, "NAME" : 1, "N" : 2 }
# translate data set to ids
label_ids = [[ 0, 0, 2, 0, 1 ]]
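Rather than typing the ids by hand, they can be derived from the string labels. A minimal sketch, using the same `labels` and `label_dict` as above; `encode_labels` and `id_to_label` are hypothetical names introduced for illustration.

```python
labels = ["O", "O", "N", "O", "NAME"]
label_dict = {"O": 0, "NAME": 1, "N": 2}

def encode_labels(seq_labels, label_dict):
    """Map each string label to its integer id; raises KeyError on an unknown label."""
    return [label_dict[label] for label in seq_labels]

# Inverse mapping, handy later for turning ids back into names
id_to_label = {idx: name for name, idx in label_dict.items()}

label_ids = [encode_labels(labels, label_dict)]  # one inner list per sentence
print(label_ids)  # [[0, 0, 2, 0, 1]]
```

Deriving the ids this way keeps `labels` and `label_ids` from drifting apart when the dataset grows.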

In [None]:
dict_data = tokenizer(data, return_tensors="tf", truncation=True, padding='max_length', max_length=512)

In [None]:
# apply entity label adjustment to every sequence in the batch
adjusted_labels = []
for i in range(len(dict_data["input_ids"])):
  seq_label_ids = entity_label_adjustment(dict_data["input_ids"][i], label_ids[i], tokenizer, label_dict)
  if len(seq_label_ids) != len(dict_data["input_ids"][i]): # compare sequence i, not always sequence 0
    print('Error in entity label adjustment!!!')
    print('Unknown character encountered: recommended update adjustment function')
    break
  adjusted_labels.append(tf.convert_to_tensor(seq_label_ids, dtype=tf.int8))

dict_data["label_ids"] = tf.stack(adjusted_labels) # shape (num_sequences, 512) for training

## Configure, Compile, and Train

In [None]:
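# a small learning rate is typical when fine-tuning BERT; Adam's default of 1e-3 is usually too large
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)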
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

In [None]:
# epochs=3 matches the training log shown below
model_history = model.fit(
  [dict_data["input_ids"], dict_data["attention_mask"], dict_data["token_type_ids"]],
  [dict_data["label_ids"]],
  verbose=1, batch_size=1, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3
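Once trained, the model returns one row of logits per token, and the predicted label id is the index of the largest logit in each row. A minimal pure-Python sketch of that decoding step, assuming made-up logits and a hypothetical helper `decode_predictions`; padded positions are skipped using the attention mask.

```python
def decode_predictions(logits, attention_mask, id_to_label):
    """Pick the highest-scoring class per token, skipping padded positions."""
    predicted = []
    for token_logits, mask in zip(logits, attention_mask):
        if mask == 0:  # padding: nothing meaningful to decode here
            continue
        best_id = max(range(len(token_logits)), key=token_logits.__getitem__)
        predicted.append(id_to_label[best_id])
    return predicted

id_to_label = {0: "O", 1: "NAME", 2: "N"}
# Made-up logits for four token positions, three classes each
logits = [
    [3.1, 0.2, -1.0],
    [0.1, 2.5, 0.3],
    [0.0, 0.4, 1.9],
    [0.5, 0.5, 0.5],
]
attention_mask = [1, 1, 1, 0]  # last position is padding
decoded = decode_predictions(logits, attention_mask, id_to_label)
print(decoded)  # ['O', 'NAME', 'N']
```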


## Other

In [None]:
#@title Print Useful Variables
input_ids = dict_data.input_ids[0][0:15] # first 15 token ids of the first sequence
words = tokenizer.convert_ids_to_tokens(input_ids)
output_labels = dict_data.label_ids[0][0:15]

print(words)
print(input_ids)
print(output_labels)

['[CLS]', 'Hello', ',', 'my', 'name', 'is', 'Yu', '##k', '##ko', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tf.Tensor(
[  101  8667   117  1139  1271  1110 10684  1377  2718   119   102     0
     0     0     0], shape=(15,), dtype=int32)
tf.Tensor([0 0 0 0 2 0 1 1 1 0 0 0 0 0 0], shape=(15,), dtype=int8)
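The printed tensors are easier to read when each token is paired with its label name. A small sketch that copies the values from the output above into plain lists; `id_to_label` and `pairs` are names introduced for illustration.

```python
label_dict = {"O": 0, "NAME": 1, "N": 2}
id_to_label = {idx: name for name, idx in label_dict.items()}

# Values copied from the printed output above
words = ['[CLS]', 'Hello', ',', 'my', 'name', 'is', 'Yu', '##k', '##ko', '.',
         '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
output_labels = [0, 0, 0, 0, 2, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Pair each non-padding token with its label name
pairs = [(w, id_to_label[i]) for w, i in zip(words, output_labels) if w != '[PAD]']
for word, label in pairs:
    print(f"{word:8} {label}")
```

The three pieces `Yu`, `##k`, and `##ko` all carry the `NAME` label, which is the expansion performed by `entity_label_adjustment`.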


In [None]:
#@title Test Cases
sentence = "Hello, my name is Yukko."
label_ids = [ 0, 0, 2, 0, 1 ]
label_dict = {"O" : 0, "NAME" : 1, "N" : 2 }
dict_data = tokenizer(sentence, return_tensors="tf", truncation=True, padding='max_length', max_length=512)
ids = dict_data.input_ids[0][0:12]
words = tokenizer.convert_ids_to_tokens(ids)
print(words)
seq_labels = entity_label_adjustment(ids, label_ids, tokenizer, label_dict) # label_dict supplies the id for "O"
print(seq_labels)
print(len(seq_labels)) # should equal len(ids)
print(len(ids))