# NER (Named Entity Recognition)

## 1. NER example

NER extracts designated objects (named entities) from a text.

e.g. Paul is at the park.


| type    | object |
|---------|:------:|
| place   | park   |
| subject | Paul   |


## 2. labeling system

- IOB (Inside, Outside, Beginning)
- IOBES (Inside, Outside, Beginning, End, Single)
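To make the two schemes concrete, here is a minimal sketch that turns entity spans into per-token tags. The helper name `spans_to_tags` and the span format `(label, start, end)` with an inclusive end are assumptions made for this illustration, and the IOB branch implements the IOB2 variant, where every entity starts with `B-`.

```python
def spans_to_tags(n_tokens, spans, scheme="IOB2"):
    """Convert entity spans (label, start, end inclusive) into per-token tags."""
    tags = ["O"] * n_tokens
    for label, start, end in spans:
        if scheme == "IOBES" and start == end:
            # a single-token entity gets its own prefix in IOBES
            tags[start] = f"S-{label}"
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
        if scheme == "IOBES":
            # the last token of a multi-token entity is marked explicitly
            tags[end] = f"E-{label}"
    return tags


tokens = ["Paul", "is", "at", "the", "park"]
spans = [("subject", 0, 0), ("place", 4, 4)]

print(spans_to_tags(len(tokens), spans, "IOB2"))
# ['B-subject', 'O', 'O', 'O', 'B-place']
print(spans_to_tags(len(tokens), spans, "IOBES"))
# ['S-subject', 'O', 'O', 'O', 'S-place']
```

IOBES carries more boundary information per tag, at the cost of a larger label set.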


## 3. evaluation

- metrics: precision, recall, F1

e.g.


|truth:| The | old | lady | whirled | round | and | snatched | her  | skirts | out | of   | danger |
|------|:---:|----:|-----:|--------:|------:|----:|---------:|-----:|-------:|----:|-----:|-----:|
|gold: | <font color="blue">b-A</font> | <font color="blue">i-A</font> | <font color="blue">i-A</font>  |  <font color="blue">e-A</font>    | o   | o   | b-DSE    |i-DSE | e-DSE  | o   | <font color="yellow">b-T </font>   | <font color="yellow">e-T</font>    |
|pred: | o   | <font color="pink">b-A</font> |<font color="pink">e-A</font>  |   o      | b-DSE | o   | <font color="red">b-DSE</font>   |<font color="red">i-DSE</font> | <font color="red">e-DSE</font>  | o   | b-T  | i-T    |


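The truth has 3 entities, marked in blue, white (uncolored) and yellow. The prediction identified 2 entities, marked in pink and red, of which only the red one is correct. The remaining predicted tags (the lone `b-DSE` and the `b-T i-T` pair) do not form complete IOBES entities, so they are not counted as predictions.

So the corresponding confusion matrix is as follows, where the columns are the ground truth and the rows are the predictions: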

|   | P  | N  | 
|---|---|---|
| P | 1  | 1 |
| N | 2  | - |

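$$Precision = \frac{TP}{TP + FP} = \frac{1}{1+1} = \frac{1}{2}$$
$$Recall = \frac{TP}{TP + FN} = \frac{1}{1+2} = \frac{1}{3}$$
$$F1 = 2 \cdot \frac{P \cdot R}{P + R} = 2 \cdot \frac{1/2 \cdot 1/3}{1/2 + 1/3} = \frac{2}{5}$$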

Here precision is the proportion of correct identifications among all predictions, and recall is the proportion of correct identifications among the true entities. F1 is the harmonic mean of precision and recall.
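The numbers for the example table above can be reproduced with a small entity-level scorer. This is only a sketch under stated assumptions: the helper names `extract_entities` and `prf` are made up here, entities must match exactly in type and boundaries, and incomplete IOBES sequences are simply discarded. It is not the implementation used by `seqeval`.

```python
def extract_entities(tags):
    """Return the set of complete IOBES entities as (type, start, end) tuples."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        prefix = prefix.lower()
        if prefix == "s":
            entities.add((label, i, i))
            start, etype = None, None
        elif prefix == "b":
            start, etype = i, label
        elif prefix == "i" and start is not None and label == etype:
            continue  # still inside the current entity
        elif prefix == "e" and start is not None and label == etype:
            entities.add((label, start, i))
            start, etype = None, None
        else:
            # "o" or an inconsistent tag: drop any unfinished entity
            start, etype = None, None
    return entities


def prf(gold_tags, pred_tags):
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = "b-A i-A i-A e-A o o b-DSE i-DSE e-DSE o b-T e-T".split()
pred = "o b-A e-A o b-DSE o b-DSE i-DSE e-DSE o b-T i-T".split()

precision, recall, f1 = prf(gold, pred)
print(precision, recall, f1)  # approximately 0.5, 0.333, 0.4
```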

Inspired by: https://huggingface.co/learn/nlp-course/chapter7/2

In [1]:
# if not installed, uncomment and run this

# !python -m pip install seqeval --break-system-packages

In [2]:
import evaluate
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from transformers import pipeline
from datasets import load_dataset



## download dataset

In [3]:
# https://huggingface.co/datasets/ashleyliu31/tech_product_names_ner

data = load_dataset("ashleyliu31/tech_product_names_ner", cache_dir='./ner')

In [4]:
# we split the data ourselves since the original dataset does not provide a test set
# only a Dataset can be split, not the DatasetDict

data = data["train"].train_test_split(test_size=0.2)

In [5]:
# show the features of the dataset

data["train"].features

{'Unnamed: 0': Value(dtype='int64', id=None),
 'sentence': Value(dtype='string', id=None),
 'word_labels': Value(dtype='string', id=None)}

In [6]:
# if we look closer at the data, we see that both the sentence and the labels are stored as plain strings
# we have to split them and do some clean-up (remove punctuation)

ind = 0
print(data["train"][ind])
print(len(data["train"][ind]["sentence"]))
print(len(data["train"][ind]["word_labels"]))

{'Unnamed: 0': 485, 'sentence': "The Acer TravelMate TMP-249-0's subpar screen quality diminishes its overall user experience.", 'word_labels': 'O, B-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O'}
93
49


In [7]:
# one thing is certain: the labels consist of 3 symbols that we will need later,
# so we define them here

label_list = {'O':0, 'B-pn':1, 'I-pn':2}
label_list_inv = {0:'O', 1:'B-pn', 2:'I-pn'}

## process data

In [8]:
# download the tokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

In [9]:
# After some experimentation, the data can be handled as follows.
# For both the sentence and the labels:
# - remove punctuation (except - and ., which occur in model names)
# - split the string on whitespace

ind = 169 # pick a sample

# remove "," and any "." followed by a space (we replace ". " rather than ".",
# so periods inside model names survive; a period at the very end of the string is kept too)
# then split the sentence into a list of words
sen = data["train"][ind]["sentence"].replace(",", " ").replace(". ", " ").split()

# tokenize the sentence
# the option "is_split_into_words" makes the tokenizer accept a list of words instead of a string
tok = tokenizer(sen, is_split_into_words=True)

# The tokenized sentence and the labels have different lengths.
# This is because the tokenizer splits the sentence into subword tokens,
# whereas the labels are given per word.

print(data["train"][ind])
print(sen)
print(tok)
print(data["train"][ind]["word_labels"].split(", "))
print(len(tok["input_ids"]))
print(len(data["train"][ind]["word_labels"].split(", ")))

{'Unnamed: 0': 44, 'sentence': 'The HP Pavilion 15-af006ax is praised for its sleek design and robust performance.', 'word_labels': 'O, B-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O'}
['The', 'HP', 'Pavilion', '15-af006ax', 'is', 'praised', 'for', 'its', 'sleek', 'design', 'and', 'robust', 'performance.']
{'input_ids': [101, 1996, 6522, 10531, 2321, 1011, 21358, 8889, 2575, 8528, 2003, 5868, 2005, 2049, 21185, 2640, 1998, 15873, 2836, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['O', 'B-pn', 'I-pn', 'I-pn', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
21
14
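As a quick sanity check, the word and label counts of the two samples printed so far can be compared without the tokenizer. The helper `split_example` is a hypothetical name that only repeats the string handling used above. The counts suggest that the label string also tags pieces such as the final period and the possessive `'s` separately, which is an inference from these two samples and not something the dataset card is quoted on here.

```python
def split_example(sentence, word_labels):
    """Split a raw example the same way as the cell above."""
    words = sentence.replace(",", " ").replace(". ", " ").split()
    labels = word_labels.split(", ")
    return words, labels


samples = [
    ("The HP Pavilion 15-af006ax is praised for its sleek design and robust performance.",
     "O, B-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O"),
    ("The Acer TravelMate TMP-249-0's subpar screen quality diminishes its overall user experience.",
     "O, B-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O"),
]

counts = []
for sentence, word_labels in samples:
    words, labels = split_example(sentence, word_labels)
    counts.append((len(words), len(labels)))

print(counts)  # [(13, 14), (12, 14)]
```

Because there are at least as many labels as words, indexing the labels by word position does not fail, but a label may be shifted by one position for words that follow a possessive or other separately tagged punctuation.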


In [10]:
# The tokenizer also records which word each token came from.
# None marks the special tokens at the beginning and the end of the sentence.

# we have to take this information into account when we rebuild the labels
tok.tokens() # show the tokens
tok.word_ids() # show the index of the word each token belongs to

[None, 0, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, None]
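The word ids above are all that is needed to stretch word-level labels to token length. The sketch below is written in plain Python so it runs without the tokenizer; `align_labels` and its optional `b_to_i` argument are hypothetical names. With `b_to_i=None` every subword token repeats the label of its word. Passing a mapping such as `{1: 2}` instead relabels continuation subwords of a `B-pn` word as `I-pn`, which is a common variant and an assumption here, not what this notebook does.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def align_labels(word_ids, word_labels, b_to_i=None):
    """Expand word-level label ids to token level using the tokenizer's word ids."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE_INDEX)
        else:
            label = word_labels[wid]
            if b_to_i is not None and wid == previous:
                # continuation subword: optionally turn B- into I-
                label = b_to_i.get(label, label)
            aligned.append(label)
        previous = wid
    return aligned


word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, None]
word_labels = [0, 1, 2, 2] + [0] * 10  # O, B-pn, I-pn, I-pn, then ten O labels

aligned = align_labels(word_ids, word_labels)
print(aligned)
print(len(aligned) == len(word_ids))  # True
```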

In [11]:
# putting the steps above together, we define the preprocessing function

def process(examples):

    # convert the sentence and label strings to lists
    sens = [ex.replace(",", " ").replace(". ", " ").split() for ex in examples["sentence"]]
    lab_raw = [ex.split(", ") for ex in examples["word_labels"]]

    tokenized = tokenizer(sens, is_split_into_words=True, max_length=128, truncation=True)

    labels = []
    for i, lab in enumerate(lab_raw):

        lab = [label_list[x] for x in lab]  # convert string labels to integer ids

        # give every token the label of its word; special tokens (word id None) get -100,
        # which the loss function ignores
        label = [-100 if l is None else lab[int(l)] for l in tokenized.word_ids(batch_index=i)]

        labels.append(label)

        assert len(label) == len(tokenized["input_ids"][i])

    tokenized["labels"] = labels

    return tokenized


In [12]:
# preprocess the data

tokenized_data = data.map(process, batched=True)


In [13]:
# show preprocessed data
print(tokenized_data["train"][0])

{'Unnamed: 0': 485, 'sentence': "The Acer TravelMate TMP-249-0's subpar screen quality diminishes its overall user experience.", 'word_labels': 'O, B-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O', 'input_ids': [101, 1996, 9078, 2099, 3604, 8585, 1056, 8737, 1011, 23628, 1011, 1014, 1005, 1055, 4942, 19362, 3898, 3737, 11737, 5498, 4095, 2229, 2049, 3452, 5310, 3325, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]}


## load model

In [14]:
# load model
# By default, the output dim is 2.
# However, we need a 3-class classifier.
# So to change the output dim, we use the option "num_labels" to specify the wanted dim.

model = AutoModelForTokenClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=len(label_list))

Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# we can check the output dim is indeed 3

model.config.num_labels

3

## evaluation


In [16]:
# define evaluation module

seqeval = evaluate.load("seqeval")
seqeval

EvaluationModule(name: "seqeval", module_type: "metric", features: {'predictions': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence')}, usage: """
Produces labelling scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
    references: List of List of reference labels (Ground truth (correct) target values)
    suffix: True if the IOB prefix is after type, False otherwise. default: False
    scheme: Specify target tagging scheme. Should be one of ["IOB1", "IOB2", "IOE1", "IOE2", "IOBES", "BILOU"].
        default: None
    mode: Whether to count correct entity labels with incorrect I/B tags as true positives or not.
        If you want to only count exact matches, pass mode="strict". default: None.
    sample_weight: Array-like of sha

In [17]:
import numpy as np

def eval_metric(preds):

    # preds holds the model logits and the reference label ids
    pred, label = preds
    pred = np.argmax(pred, axis=-1)  # most likely class id per token

    pred_word_labels = []
    label_word_labels = []

    for p, l in zip(pred, label):
        # sanity check: print any pair whose lengths differ
        if len(p) != len(l):
            print(p)
            print(l)

        # drop positions labelled -100 (special tokens and padding), then map ids back to tag strings
        pred_word_labels.append([label_list_inv[i] for i, j in zip(p, l) if j != -100])

        label_word_labels.append([label_list_inv[i] for i in l if i != -100])

    result = seqeval.compute(predictions=pred_word_labels, references=label_word_labels, mode="strict", scheme="IOB2")

    return {'f1': result["overall_f1"]}  # compute_metrics has to return a dict
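The filtering step in `eval_metric` can be checked on a toy sequence without NumPy or `seqeval`. In this sketch the logits are invented numbers, and `argmax` and `decode` are hypothetical helpers. It only shows how positions labelled `-100` are removed from both the predictions and the references before the tags are compared.

```python
IGNORE = -100
id2tag = {0: "O", 1: "B-pn", 2: "I-pn"}


def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def decode(logits, label_ids):
    """Turn per-token logits and label ids into two tag lists of equal length."""
    pred_ids = [argmax(row) for row in logits]
    pred_tags = [id2tag[p] for p, l in zip(pred_ids, label_ids) if l != IGNORE]
    ref_tags = [id2tag[l] for l in label_ids if l != IGNORE]
    return pred_tags, ref_tags


# made-up logits for 5 tokens and 3 classes (O, B-pn, I-pn)
logits = [
    [2.0, 0.1, 0.1],  # special token, ignored
    [3.0, 0.2, 0.1],  # predicted O
    [0.1, 2.5, 0.3],  # predicted B-pn
    [1.5, 0.4, 0.9],  # predicted O, although the reference is I-pn
    [0.1, 0.2, 0.3],  # special token, ignored
]
label_ids = [-100, 0, 1, 2, -100]

pred_tags, ref_tags = decode(logits, label_ids)
print(pred_tags)  # ['O', 'B-pn', 'O']
print(ref_tags)   # ['O', 'B-pn', 'I-pn']
```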

## train

In [18]:

args = TrainingArguments(
        output_dir="./checkpoints",
        per_device_train_batch_size=64,
        per_device_eval_batch_size=128,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        metric_for_best_model="f1",
        load_best_model_at_end=True,
        logging_steps=10,
        use_cpu=True,
)


In [19]:
trainer = Trainer(
    model = model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    compute_metrics=eval_metric,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer)
)

In [20]:
trainer.evaluate(eval_dataset=tokenized_data["test"])



{'eval_loss': 1.0745091438293457,
 'eval_f1': 0.036885245901639344,
 'eval_runtime': 25.3014,
 'eval_samples_per_second': 46.084,
 'eval_steps_per_second': 0.395}

In [21]:
trainer.train()

| Epoch | Training Loss | Validation Loss | F1       |
|-------|---------------|-----------------|----------|
| 1     | 0.0591        | 0.037632        | 0.926089 |
| 2     | 0.0264        | 0.022258        | 0.954174 |
| 3     | 0.0106        | 0.014311        | 0.981845 |


TrainOutput(global_step=219, training_loss=0.057121608559399434, metrics={'train_runtime': 846.9158, 'train_samples_per_second': 16.514, 'train_steps_per_second': 0.259, 'total_flos': 243321535569708.0, 'train_loss': 0.057121608559399434, 'epoch': 3.0})
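The `global_step=219` reported above is consistent with the dataset and batch sizes shown earlier. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, a final partial batch that is kept, and 3 epochs (as reported by `'epoch': 3.0`); `optimizer_steps` is a hypothetical helper name.

```python
import math


def optimizer_steps(n_examples, batch_size, epochs):
    """Number of optimizer steps when every batch triggers one update."""
    return math.ceil(n_examples / batch_size) * epochs


train_steps = optimizer_steps(4662, 64, 3)   # 73 batches per epoch, 3 epochs
eval_batches = math.ceil(1166 / 128)         # batches per evaluation pass

print(train_steps)   # 219
print(eval_batches)  # 10
```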

In [22]:
trainer.evaluate(eval_dataset=tokenized_data["test"])

{'eval_loss': 0.014311173930764198,
 'eval_f1': 0.9818445896877269,
 'eval_runtime': 20.1221,
 'eval_samples_per_second': 57.946,
 'eval_steps_per_second': 0.497,
 'epoch': 3.0}

## evaluation

In [23]:
# we use a pipeline to evaluate the model on a new sentence
pipe_ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

In [24]:
# we have to customize the label-id conversion to get meaningful results
# otherwise, the results will only show the default labels

model.config.label2id = label_list
model.config.id2label = label_list_inv
print(model.config) # show the configs

BertConfig {
  "_name_or_path": "google-bert/bert-base-uncased",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-pn",
    "2": "I-pn"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-pn": 1,
    "I-pn": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [30]:
pipe_ner("The first HP LaserJet and the first Apple LaserWriter used the same print engine, the Canon CX engine")

[{'entity_group': 'pn',
  'score': 0.97650355,
  'word': 'laser',
  'start': 13,
  'end': 18},
 {'entity_group': 'pn',
  'score': 0.87703407,
  'word': '##jet',
  'start': 18,
  'end': 21},
 {'entity_group': 'pn',
  'score': 0.9707254,
  'word': 'laser',
  'start': 42,
  'end': 47},
 {'entity_group': 'pn',
  'score': 0.64945084,
  'word': '##writer',
  'start': 47,
  'end': 53}]
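In the output above, `laser` / `##jet` and `laser` / `##writer` are returned as separate groups even though their character offsets touch. One possible post-processing step, sketched below, merges groups of the same type whose spans are directly adjacent and reads the surface form back from the input text. The function `merge_adjacent` and the choice of keeping the lowest score are assumptions for illustration, not part of the `transformers` pipeline API.

```python
def merge_adjacent(entities, text):
    """Merge same-type entity groups whose character spans touch."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev and prev["entity_group"] == ent["entity_group"] and prev["end"] == ent["start"]:
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])  # conservative choice
            prev["word"] = text[prev["start"]:prev["end"]]
        else:
            # copy the dict and take the surface form from the original text
            merged.append({**ent, "word": text[ent["start"]:ent["end"]]})
    return merged


text = "The first HP LaserJet and the first Apple LaserWriter used the same print engine, the Canon CX engine"
entities = [
    {"entity_group": "pn", "score": 0.97650355, "word": "laser", "start": 13, "end": 18},
    {"entity_group": "pn", "score": 0.87703407, "word": "##jet", "start": 18, "end": 21},
    {"entity_group": "pn", "score": 0.9707254, "word": "laser", "start": 42, "end": 47},
    {"entity_group": "pn", "score": 0.64945084, "word": "##writer", "start": 47, "end": 53},
]

merged = merge_adjacent(entities, text)
for ent in merged:
    print(ent["word"], ent["start"], ent["end"], ent["score"])
# LaserJet 13 21 0.87703407
# LaserWriter 42 53 0.64945084
```

This only repairs the split words; the brand names "HP" and "Apple" are still missing from the predictions.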