# Named Entity Recognition (NER)

This notebook covers:

    I. Presentation of the NER problem
    II. Model
    III. Realization

## I. Presentation

### 1. Definition

NER extracts named entities (e.g. names, places, brands, products) from a text.
The task has 2 parts:
 * locate the entity in the text
 * classify the entity

Example: Paul is at the park.

| type    |  entity |
|---------|:-------:|
| place   |  park   |
| subject |  Paul   |


### 2. labeling methods

- IOB
- IOE
- IOBES
- BILOU

where

* B marks the beginning of an entity, I the inside (middle) of an entity, and O a token outside any entity.
* E marks the end of an entity, and S a single-word entity. In BILOU, L (last) and U (unit) play the roles of E and S.
* Sometimes, I is replaced by M.
* The name after the tag is the class.

For example:

* B-Person: the beginning of a person's name
* I-Person: the middle of a person's name
* B-Place: the beginning of a place's name
* I-Place: the middle of a place's name
* O: outside of all entities
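
To make the tagging scheme concrete, here is a minimal sketch of how IOB2 tags can be decoded back into entity spans. The helper name `iob2_to_spans` is hypothetical, and the sketch assumes that an I- tag without a preceding B- tag of the same class is simply ignored.

```python
def iob2_to_spans(tags):
    """Convert IOB2 tags into (class, start, end) spans; end is exclusive."""
    spans = []
    start, cls = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # the current entity continues only on an I- tag of the same class
        inside = prefix == "I" and name == cls
        if start is not None and not inside:
            spans.append((cls, start, i))
            start, cls = None, None
        if prefix == "B":
            start, cls = i, name
    # close an entity that runs until the end of the sentence
    if start is not None:
        spans.append((cls, start, len(tags)))
    return spans


words = ["The", "HP", "Pavilion", "x360", "13-s101tu", "is", "highly", "versatile"]
tags = ["O", "B-pn", "I-pn", "I-pn", "I-pn", "O", "O", "O"]

for cls, start, end in iob2_to_spans(tags):
    print(cls, " ".join(words[start:end]))  # pn HP Pavilion x360 13-s101tu
```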

### 3. evaluation

- metrics: precision, recall, F1

e.g.


|text: | The | old | lady | whirled | round | and | snatched | her  | skirts | out | of   | danger |
|------|:---:|----:|-----:|--------:|------:|----:|---------:|-----:|-------:|----:|-----:|-----:|
|gold: | <font color="blue">b-A</font> | <font color="blue">i-A</font> | <font color="blue">i-A</font>  |  <font color="blue">e-A</font>    | o   | o   | b-DSE    |i-DSE | e-DSE  | o   | <font color="yellow">b-T </font>   | <font color="yellow">e-T</font>    |
|pred: | o   | <font color="pink">b-A</font> |<font color="pink">e-A</font>  |   o      | b-DSE | o   | <font color="red">b-DSE</font>   |<font color="red">i-DSE</font> | <font color="red">e-DSE</font>  | o   | b-T  | i-T    |


The gold annotation has 3 entities, marked in blue, white and yellow. The prediction identified 2 entities, marked in pink and red, of which only the red one is correct.

So the corresponding confusion matrix is as follows, where the columns are the ground truth and the rows are the predictions: 

|   | P  | N  | 
|---|---|---|
| P | 1  | 1 |
| N | 2  | - |

$$Precision = \frac{1}{1+1}$$
$$Recall = \frac{1}{1+2}$$
$$F1 = 2 * \frac{P * R}{P + R} = 2 * \frac{1/2 * 1/3}{1/2 + 1/3}$$

Where 
* precision measures the rate of correct identifications over all predictions. 
* recall measures the rate of correct identifications over the true entities. 
* F1 is the harmonic mean of precision and recall.

Inspired by: https://huggingface.co/learn/nlp-course/chapter7/2
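
As a quick check of the arithmetic, the three scores can be computed from the counts in the confusion matrix above. This is only a sketch using exact fractions; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# counts read off the confusion matrix above
tp, fp, fn = 1, 1, 2

precision = Fraction(tp, tp + fp)  # correct predictions / all predictions
recall = Fraction(tp, tp + fn)     # correct predictions / all true entities
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 1/2 1/3 2/5
```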

## II. Model

Since this is a token-level classification problem, the model used is AutoModelForTokenClassification with the BERT base model.

This model uses BERT base to encode the input text and outputs, for each token, one score per class we defined (num_labels). So the output dimension is [batch, seq_len, num_labels].

In the class's init function:

```python
    self.num_labels = config.num_labels
    self.bert = BertModel(config, add_pooling_layer=False)
    ...
    self.dropout = nn.Dropout(classifier_dropout)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)
```
Here num_labels is the number of labels. In our case, it is 3 (O, B-pn, I-pn).

In the forward function:

```python
    outputs = self.bert(
            input_ids,
            ...
        )
    sequence_output = outputs[0]
    sequence_output = self.dropout(sequence_output)
    logits = self.classifier(sequence_output)
```

The input is encoded by the BERT model. The output of BERT is then passed through dropout and a linear layer that projects the hidden states onto the classes.

## III. Realization

In [1]:
# if seqeval is not installed, uncomment and run this

# !python -m pip install seqeval --break-system-packages

In [1]:
# select the GPU to use
# Since I have 2 GPUs and only want to use one, I need to run this.
# This must run first, before any library initializes CUDA.
# Skip this if you don't need it.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
## define repos for data and model

# data
# https://huggingface.co/datasets/ashleyliu31/tech_product_names_ner

ckp_data = "ashleyliu31/tech_product_names_ner"

# model

ckp = "google-bert/bert-base-uncased"

### 1. import

In [3]:
## import

import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification, pipeline



### 2. download dataset

Dataset formats vary. In the easiest case, the text is already cleaned and split into words, and the labels are provided per word.
In our case, however, both the text and the labels are plain strings, with punctuation. So we need to do some basic cleaning before use.
In other cases, the data may be more complex and require more processing.
So one should always inspect the data and make sure it fits the use case before using it.

In [4]:
# load

data = load_dataset(ckp_data, cache_dir='../tmp/ner', split="train")
data

Dataset({
    features: ['Unnamed: 0', 'sentence', 'word_labels'],
    num_rows: 5828
})

In [5]:
# show the features of the dataset

data.features

{'Unnamed: 0': Value(dtype='int64', id=None),
 'sentence': Value(dtype='string', id=None),
 'word_labels': Value(dtype='string', id=None)}

In [23]:
# looking closer at the data, both the sentence and the labels are stored as plain strings
# we have to split them and clean them up so that the words line up with the labels

# After some inspection, the rules for processing are:
#   - punctuation marks are counted in the labeling
#   - words linked by "-" count as one word
#   - words linked by "'" should be separated into 2 words, removing the "'"


ind = 10
print(data[ind])
print(len(data[ind]["sentence"].split(" ")))
print(len(data[ind]["word_labels"].split(", ")))

{'Unnamed: 0': 10, 'sentence': 'The HP Pavilion x360 13-s101tu is highly versatile with its 360-degree hinge and touchscreen display.', 'word_labels': 'O, B-pn, I-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O, O'}
15
16
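
In the sample above, a plain split on spaces gives 15 words for 16 labels. The extra label appears to belong to the sentence-final period, which stays attached to the last word. The sketch below detaches it before splitting. `split_words` is a hypothetical helper, not part of the notebook, and the final-period rule is an assumption inferred from this sample, so it should be checked against the whole dataset before use.

```python
def split_words(sentence):
    """Split a sentence into words, treating punctuation marks as separate words."""
    s = sentence.strip()
    # assumption: a sentence-final period carries its own label
    if s.endswith("."):
        s = s[:-1] + " ."
    # same replacements as the exploratory tokenization cell later in this notebook
    s = s.replace(",", " , ").replace(". ", " . ").replace("?", " ? ").replace("'", " ")
    return s.split()


sentence = ("The HP Pavilion x360 13-s101tu is highly versatile "
            "with its 360-degree hinge and touchscreen display.")
word_labels = "O, B-pn, I-pn, I-pn, I-pn, O, O, O, O, O, O, O, O, O, O, O"

words = split_words(sentence)
labels = word_labels.split(", ")
print(len(words), len(labels))  # 16 16
```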


In [9]:
# one thing is certain: the labels consist of 3 symbols that we will need later
# so we define them here

label2id = {'O':0, 'B-pn':1, 'I-pn':2}
id2label = {0:'O', 1:'B-pn', 2:'I-pn'}

### 3. Split data

In [5]:
# If the dataset is already split, skip this

# we split the data since the original dataset doesn't provide a test set
# Only a Dataset can be split, not a DatasetDict

split_data = data.train_test_split(test_size=0.2)
split_data

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'sentence', 'word_labels'],
        num_rows: 4662
    })
    test: Dataset({
        features: ['Unnamed: 0', 'sentence', 'word_labels'],
        num_rows: 1166
    })
})

### 4. tokenization

In [6]:
# download the tokenizer

tokenizer = AutoTokenizer.from_pretrained(ckp)

In [30]:
# After some tweaks, I found a way to treat the data.
# For both sentence and labels:
# - separate punctuation from words (keeping - and . inside model names)
# - split the string by space

ind = 169 # pick a sample

# put spaces around , ? and mid-sentence . so they become separate words
# (we replace ". " instead of "." to keep the . inside model names)
# replace ' with a space, then split the sentence into a list
sen = split_data["train"][ind]["sentence"].replace(",", " , ").replace(". ", " . ").replace("?", " ? ").replace("'", " ").split()

labels = split_data["train"][ind]["word_labels"].split(", ")

# we tokenize the sentence
# the option "is_split_into_words" lets the tokenizer take a list of words instead of a string
tok = tokenizer(sen, is_split_into_words=True)

# The tokenized sentence has a different length from the labels.
# This is because the tokenizer splits words into subword tokens (and adds special tokens),
# while the labels are given per word.

print("data: ", split_data["train"][ind])
print("text: ", sen)
print("tokens: ", tok)
print("labels: ", labels)

print("text len: ", len(sen))
print("token len: ", len(tok["input_ids"]))
print("labels len: ", len(split_data["train"][ind]["word_labels"].split(", ")))

assert(len(sen) == len(labels))

data:  {'Unnamed: 0': 1816, 'sentence': "I'm keen on the Asus EeeBook E402WA-WH21, any discounts available?", 'word_labels': 'O, O, O, O, O, B-pn, I-pn, I-pn, O, O, O, O, O'}
text:  ['I', 'm', 'keen', 'on', 'the', 'Asus', 'EeeBook', 'E402WA-WH21', ',', 'any', 'discounts', 'available', '?']
tokens:  {'input_ids': [101, 1045, 1049, 10326, 2006, 1996, 2004, 2271, 25212, 15878, 14659, 1041, 12740, 2475, 4213, 1011, 1059, 2232, 17465, 1010, 2151, 19575, 2015, 2800, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
labels:  ['O', 'O', 'O', 'O', 'O', 'B-pn', 'I-pn', 'I-pn', 'O', 'O', 'O', 'O', 'O']
text len:  13
token len:  26
labels len:  13


In [31]:
# The tokenizer provides information about the tokens, such as the index of the word each token came from.
# None marks the special tokens at the beginning and the end of the sentence
# we should take this info into account when we reconstruct the labels

# For more details, see hf_transformers_basics_3_tokenizer.ipynb

print(tok.tokens()) # show the tokens
print(tok.word_ids()) # show the ids of the words

['[CLS]', 'i', 'm', 'keen', 'on', 'the', 'as', '##us', 'ee', '##eb', '##ook', 'e', '##40', '##2', '##wa', '-', 'w', '##h', '##21', ',', 'any', 'discount', '##s', 'available', '?', '[SEP]']
[None, 0, 1, 2, 3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 9, 10, 10, 11, 12, None]


In [7]:
# by putting all above mentioned processing together, we define the function 
# to preprocess the data

def process(examples) :

    # split text
    texts = [ex.replace(", ", " , ").replace(". ", " . ").replace("?", " ? ").replace("'s", " ").replace("â€™s", " ").split() for ex in examples["sentence"]]
    
    # split labels
    labels_raw = [ex.split(", ") for ex in examples["word_labels"]]

    # tokenization of text
    tokenized = tokenizer(texts, is_split_into_words=True, max_length=128, truncation=True)
    
    labels = []

    for i, lab in enumerate(labels_raw):

        if len(lab) < len(texts[i]):
            print(lab, texts[i], examples["sentence"][i])
        
        # convert string label to num label
        lab = [label2id[x] for x in lab] 

        # match label length to token length
        label = [-100 if l is None else lab[int(l)] for l in tokenized.word_ids(batch_index=i)] 

        labels.append(label)

        
        assert len(label) == len(tokenized["input_ids"][i])

    tokenized["labels"] = labels

    return tokenized
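

The function above copies a word's label to every one of its subword tokens, so a word tagged B-pn that splits into three tokens yields three B-pn labels. One common alternative, sketched here as an assumption rather than as the notebook's method, keeps B-pn on the first subword only and gives the continuation subwords I-pn. `align_labels` and its `b_to_i` argument are hypothetical names; the word ids are copied from the tokenizer output shown earlier.

```python
def align_labels(word_ids, word_label_ids, b_to_i):
    """Expand word-level label ids to token level.

    word_ids: word index per token (None for special tokens)
    word_label_ids: one label id per word
    b_to_i: maps each B- label id to its I- label id
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                  # ignored by the loss
        elif wid != previous:
            aligned.append(word_label_ids[wid])   # first subword keeps the word label
        else:
            lab = word_label_ids[wid]
            aligned.append(b_to_i.get(lab, lab))  # continuation: B- becomes I-
        previous = wid
    return aligned


word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
            8, 9, 10, 10, 11, 12, None]
word_label_ids = [0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0]  # O=0, B-pn=1, I-pn=2

aligned = align_labels(word_ids, word_label_ids, b_to_i={1: 2})
print(aligned)
```

With this variant, each entity contains exactly one B-pn token, which is closer to what a strict IOB2 evaluation expects.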


In [10]:
# preprocess the data

tokenized_data = split_data.map(process, batched=True)
tokenized_data

Map:   0%|          | 0/4662 [00:00<?, ? examples/s]

Map:   0%|          | 0/1166 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'sentence', 'word_labels', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 4662
    })
    test: Dataset({
        features: ['Unnamed: 0', 'sentence', 'word_labels', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1166
    })
})

In [73]:
# show preprocessed data
print(tokenized_data["train"][0])

{'Unnamed: 0': 4050, 'sentence': 'I found the Ziox Astra Titan 4G to be durable with decent performance.', 'word_labels': 'O, O, O, B-pn, I-pn, I-pn, I-pn, O, O, O, O, O, O, O', 'input_ids': [101, 1045, 2179, 1996, 1062, 3695, 2595, 2004, 6494, 16537, 1018, 2290, 2000, 2022, 25634, 2007, 11519, 2836, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, -100]}


### 5. load model

In [11]:
# By default, the model output dim is 2.
# However, we need a 3-class classifier.
# So to change the output dim, we use the option "num_labels" to specify the required dim.

model = AutoModelForTokenClassification.from_pretrained(ckp, num_labels=len(label2id))

Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [76]:
# we can check the output dim is indeed 3

model.config.num_labels

3

### 6. Define metrics


In [12]:
# define evaluation module

seqeval = evaluate.load("seqeval")
seqeval

EvaluationModule(name: "seqeval", module_type: "metric", features: {'predictions': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence')}, usage: """
Produces labelling scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
    references: List of List of reference labels (Ground truth (correct) target values)
    suffix: True if the IOB prefix is after type, False otherwise. default: False
    scheme: Specify target tagging scheme. Should be one of ["IOB1", "IOB2", "IOE1", "IOE2", "IOBES", "BILOU"].
        default: None
    mode: Whether to count correct entity labels with incorrect I/B tags as true positives or not.
        If you want to only count exact matches, pass mode="strict". default: None.
    sample_weight: Array-like of sha

In [18]:
import numpy as np

def eval_metric(preds):

    pred, label = preds
    pred = np.argmax(pred, axis=-1)

    pred_word_labels = []
    label_word_labels = []

    for p, l in zip(pred, label) :
        if len(p) != len(l):
            print(p)
            print(l)

        pred_word_labels.append([id2label[i] for i, j in zip(p, l) if j != -100])

        label_word_labels.append([id2label[i] for i in l if i != -100 ])

    result = seqeval.compute(predictions=pred_word_labels, references=label_word_labels, mode="strict", scheme="IOB2")
    
    return {'f1': result["overall_f1"]} # compute_metrics must return a dict

### 7. train args

In [19]:

args = TrainingArguments(
        output_dir="../tmp/checkpoints",
        per_device_train_batch_size=64,
        per_device_eval_batch_size=128,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        metric_for_best_model="f1",
        load_best_model_at_end=True,
        logging_steps=10,
        use_cpu=True,
)




### 8. trainer

In [20]:
trainer = Trainer(
    model = model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    compute_metrics=eval_metric,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer)
)

### 9. train + eval

In [21]:
trainer.evaluate(eval_dataset=tokenized_data["test"])

{'eval_loss': 1.2252492904663086,
 'eval_f1': 0.046061722708429294,
 'eval_runtime': 22.7805,
 'eval_samples_per_second': 51.184,
 'eval_steps_per_second': 0.439}

In [22]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1
1,0.0314,0.032999,0.945546
2,0.0134,0.021135,0.970661
3,0.0042,0.016284,0.979146


TrainOutput(global_step=219, training_loss=0.042451643678423474, metrics={'train_runtime': 828.9591, 'train_samples_per_second': 16.872, 'train_steps_per_second': 0.264, 'total_flos': 245454797909268.0, 'train_loss': 0.042451643678423474, 'epoch': 3.0})
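
The `global_step=219` reported above can be recovered from the dataset and batch sizes. This is a small sanity check, assuming a single device, no gradient accumulation, and the Trainer's default of 3 epochs (consistent with `'epoch': 3.0` in the output).

```python
import math

train_rows = 4662   # size of the training split
batch_size = 64     # per_device_train_batch_size
epochs = 3          # assumption: Trainer default, matches 'epoch': 3.0

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 73 219
```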

In [22]:
trainer.evaluate(eval_dataset=tokenized_data["test"])

{'eval_loss': 0.014311173930764198,
 'eval_f1': 0.9818445896877269,
 'eval_runtime': 20.1221,
 'eval_samples_per_second': 57.946,
 'eval_steps_per_second': 0.497,
 'epoch': 3.0}

### 10. inference

In [23]:
# pipeline
##########

# use pipeline to run inference with the model
pipe_ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

In [24]:
# we have to customize the label-id conversion to output meaningful results
# otherwise, the results will only show the default labels (LABEL_0, LABEL_1, ...)

model.config.label2id = label2id
model.config.id2label = id2label
print(model.config) # show the configs

BertConfig {
  "_name_or_path": "google-bert/bert-base-uncased",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-pn",
    "2": "I-pn"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-pn": 1,
    "I-pn": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.41.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [25]:
pipe_ner("The first HP LaserJet and the first Apple LaserWriter used the same print engine, the Canon CX engine")

[{'entity_group': 'pn',
  'score': 0.9971613,
  'word': 'hp laserjet',
  'start': 10,
  'end': 21},
 {'entity_group': 'pn',
  'score': 0.9889905,
  'word': 'apple laserwriter',
  'start': 36,
  'end': 53},
 {'entity_group': 'pn',
  'score': 0.6539879,
  'word': 'cx',
  'start': 92,
  'end': 94}]
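
The `score` values above are probabilities. As a sketch of where they come from, the snippet below applies a softmax to the logits of the three tokens of "hp laserjet" (rows 3 to 5 of the logits printed by the manual run in this notebook) and averages the winning probabilities. That the pipeline averages token scores within a group is an assumption here; it is consistent with the reported score of 0.9971613.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# logits for the tokens 'hp', 'laser', '##jet', copied from the manual run
token_logits = [
    [-1.5644, 5.1154, -3.4098],
    [-2.5015, -1.8554, 4.3861],
    [-2.1167, -2.1242, 4.0641],
]

# probability of the predicted class for each token
token_scores = [max(softmax(row)) for row in token_logits]

# assumption: the group score is the mean of the token scores
group_score = sum(token_scores) / len(token_scores)
print(round(group_score, 4))  # 0.9972
```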

In [65]:
# manual
########

toks = tokenizer("The first HP LaserJet and the first Apple LaserWriter used the same print engine, the Canon CX engine", return_tensors="pt")
print("toks: ", toks.input_ids)

logits = model(**toks).logits
print("logits: ", logits)

id_cls = logits.argmax(axis=-1)
print("predicted class ids: ", id_cls[0])

preds_cls = [id2label.get(p.item()) for p in id_cls[0]]
print("predicted class labels: ", preds_cls)


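# group consecutive non-O tokens into entities
word_ids = []  # one list of token ids per entity
word = []      # token ids of the current entity
old = -1       # class id of the previous token
for ind, i in enumerate(id_cls[0].numpy()):
    # an O token following a non-O token (or the first token) opens a new, empty group
    if old != 0 and i == 0:
        word = []
        word_ids.append(word)
    # a non-O token (B-pn or I-pn) is added to the current group
    if i != 0:
        word.append(toks.input_ids[0][ind].item())
    old = i
print("entity ids: ", word_ids)

# decode each non-empty group back to text
for i, w in enumerate(word_ids):
    if len(w) > 0:
        print("entity ", i, ": ", tokenizer.decode(w))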

toks:  tensor([[  101,  1996,  2034,  6522,  9138, 15759,  1998,  1996,  2034,  6207,
          9138, 15994,  2109,  1996,  2168,  6140,  3194,  1010,  1996,  9330,
          1039,  2595,  3194,   102]])
logits:  tensor([[[ 3.8971, -1.8646, -1.2216],
         [ 5.5416, -2.4371, -2.2886],
         [ 4.0385, -1.0274, -2.5045],
         [-1.5644,  5.1154, -3.4098],
         [-2.5015, -1.8554,  4.3861],
         [-2.1167, -2.1242,  4.0641],
         [ 2.3298, -1.1273, -0.3853],
         [ 4.5360, -1.5070, -2.2281],
         [ 3.1980, -0.2681, -2.2158],
         [-0.2964,  3.8050, -2.9993],
         [-2.2740, -1.6701,  4.0373],
         [-1.6740, -1.6256,  3.5832],
         [ 5.0597, -2.8325, -1.4988],
         [ 5.7788, -3.1890, -2.2796],
         [ 4.7521, -2.6377, -1.8266],
         [ 4.3349, -2.4598, -1.4123],
         [ 5.6720, -2.7675, -1.8111],
         [ 4.5814, -2.5581, -1.3707],
         [ 5.7639, -2.7239, -2.0398],
         [ 2.6216,  0.9343, -3.0131],
         [ 0.4343, -1.0765,