The goal is to select a good pretrained Named Entity Recognition (NER) model from the Hugging Face model hub, evaluate it on the validation split, and use it to make predictions on the test split. The dataset used is CoNLL-2003, one of the most commonly used benchmarks for named entity recognition.

In [None]:
!pip install datasets -q
!pip install tokenizers -q
!pip install transformers -q
!pip install seqeval -q


In [None]:
#Loading dataset
from datasets import load_dataset

dataset = load_dataset("conll2003")

Found cached dataset conll2003 (/home/jupyter/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98)



In [None]:
#Keys
dataset.keys()

dict_keys(['train', 'validation', 'test'])

In [None]:
#Label names in CONLL Dataset
label_names = dataset["train"].features["ner_tags"].feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

These nine labels are the NER tags used in the CoNLL-2003 dataset. A `B-` prefix marks the first token of an entity, an `I-` prefix marks a following token inside the same entity, and `O` marks tokens outside any entity. Here is what each label represents:

'O': Represents tokens that are not part of any named entity.
'B-PER': Represents the beginning of a person entity.
'I-PER': Represents tokens inside a person entity.
'B-ORG': Represents the beginning of an organization entity.
'I-ORG': Represents tokens inside an organization entity.
'B-LOC': Represents the beginning of a location entity.
'I-LOC': Represents tokens inside a location entity.
'B-MISC': Represents the beginning of a miscellaneous entity.
'I-MISC': Represents tokens inside a miscellaneous entity.
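
As a small illustration of how these labels are used, the integer `ner_tags` stored in the dataset can be turned back into label names by indexing into the label list. The sketch below hard-codes the label list shown above. The helper name `decode_tags` is hypothetical and not part of the notebook.

```python
# Label list as reported by the dataset's `ner_tags` feature.
LABEL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def decode_tags(tag_ids, label_names=LABEL_NAMES):
    """Map integer tag ids to their string labels (hypothetical helper)."""
    return [label_names[i] for i in tag_ids]

# An example sequence of word-level tag ids.
decoded = decode_tags([3, 0, 7, 0, 0, 0, 7, 0, 0])
print(decoded)
```

Each position in the result corresponds to one word of the sentence, so an id of 3 reads as the start of an organization and 7 as the start of a miscellaneous entity.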

In [None]:
dataset.column_names

{'train': ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
 'validation': ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
 'test': ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags']}

# Loading the model

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
#Load the tokenizer and the model from the same fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-uncased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("elastic/distilbert-base-uncased-finetuned-conll03-english")


The first call loads the tokenizer that belongs to the "elastic/distilbert-base-uncased-finetuned-conll03-english" checkpoint. The tokenizer encodes text into the token IDs that the model expects.

The second call loads the pre-trained token-classification model from the same checkpoint. It is a DistilBERT model that has already been fine-tuned on the CoNLL-2003 dataset for named entity recognition (NER), so it assigns one of the entity labels to every token.

In [None]:
#Tokenizes the pre-split words of each example and pads or truncates every sequence to the tokenizer's maximum length
def tokenize_function(examples):
    return tokenizer(examples["tokens"], padding="max_length", truncation=True, is_split_into_words=True)

In [None]:
#applies the tokenize_function to all examples in the dataset using the map method,
#which creates a new dataset where each example has been tokenized by the tokenizer.
tokenized_datasets_ = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98/cache-0c44b82bd4ff6497.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98/cache-244f4a92dbaa7259.arrow



In [None]:
tokenized_datasets_['train'][0]['input_ids'][:20]

[101,
 7327,
 19164,
 2446,
 2655,
 2000,
 17757,
 2329,
 12559,
 1012,
 102,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [None]:
#accesses the original word-level NER tags of the first training example (one tag per word, so fewer than 20 values)
tokenized_datasets_['train'][0]['ner_tags'][:20]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [None]:
#Get the values for input_ids, attention_mask, adjusted labels
def tokenize_adjust_labels(all_samples_per_split):
  tokenized_samples = tokenizer.batch_encode_plus(all_samples_per_split["tokens"], is_split_into_words=True, truncation=True)

  total_adjusted_labels = []

  for k in range(0, len(tokenized_samples["input_ids"])):
    prev_wid = -1
    word_ids_list = tokenized_samples.word_ids(batch_index=k)
    existing_label_ids = all_samples_per_split["ner_tags"][k]
    i = -1
    adjusted_label_ids = []

    for word_idx in word_ids_list:
      # Special tokens have a word id that is None. We set the label to -100 so they are automatically
      # ignored in the loss function.
      if word_idx is None:
        adjusted_label_ids.append(-100)
      elif word_idx != prev_wid:
        # First sub-token of a new word: keep entity labels, but mask 'O' (id 0) with -100 as well
        i = i + 1
        if existing_label_ids[i] != 0:
          adjusted_label_ids.append(existing_label_ids[i])
        else:
          adjusted_label_ids.append(-100)
        prev_wid = word_idx
      else:
        # Remaining sub-tokens of the same word inherit the word's label
        adjusted_label_ids.append(existing_label_ids[i])

    total_adjusted_labels.append(adjusted_label_ids)

  #add adjusted labels to the tokenized samples
  tokenized_samples["labels"] = total_adjusted_labels
  return tokenized_samples
#applies the tokenize_adjust_labels function to each example in the dataset and
#creates a new tokenized_dataset by replacing the tokens and ner_tags columns of the original dataset
#with the tokenized inputs (input_ids, attention_mask, labels).
tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True, remove_columns=['tokens', 'ner_tags'])

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98/cache-31ab99eb718d5ddb.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98/cache-304cd54f85a1ea01.arrow



In [None]:
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'pos_tags', 'chunk_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'pos_tags', 'chunk_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'pos_tags', 'chunk_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})
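
To make the label adjustment easier to follow, here is a minimal standalone sketch of the same rule applied to one toy sentence. It assumes the sub-token to word mapping is already available as a plain list (as returned by the tokenizer's `word_ids`), and it indexes labels by word id instead of a running counter, which is equivalent as long as no words are skipped. The helper name `align_labels` and the toy values are illustrative only.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to sub-token level (hypothetical helper).

    Mirrors the rule used in this notebook:
    - special tokens (word id None) get ignore_index
    - the first sub-token of a word keeps its label, except 'O' (0), which is masked
    - later sub-tokens of the same word inherit the word's label
    """
    aligned = []
    prev_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)
        elif word_id != prev_word:
            label = word_labels[word_id]
            aligned.append(label if label != 0 else ignore_index)
            prev_word = word_id
        else:
            aligned.append(word_labels[word_id])
    return aligned

# Toy sentence with 3 words; the second word is split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]   # [CLS], w0, w1a, w1b, w2, [SEP]
word_labels = [3, 0, 7]               # B-ORG, O, B-MISC
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Positions labelled -100 are skipped both by the loss function and by the metric code, which filters on `l != -100`.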


In [None]:
#Padding to make it equal length
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)

In [None]:
data_collator

DataCollatorForTokenClassification(tokenizer=DistilBertTokenizerFast(name_or_path='elastic/distilbert-base-uncased-finetuned-conll03-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, label_pad_token_id=-100, return_tensors='pt')
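
The collator output above shows `padding=True` and `label_pad_token_id=-100`. In other words, each batch is padded to the length of its longest sequence, and the padded label positions are marked so they are ignored. The following is a simplified pure-Python sketch of that idea, not the library's implementation. The helper `pad_batch` and the toy ids and labels are assumptions made for illustration.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Two toy examples of different lengths.
input_ids = [[101, 7327, 102], [101, 2446, 2655, 2000, 102]]
labels = [[-100, 3, -100], [-100, 7, -100, -100, -100]]

padded_ids = pad_batch(input_ids, pad_value=0)       # 0 is the [PAD] id of this tokenizer
padded_labels = pad_batch(labels, pad_value=-100)    # same value as label_pad_token_id
print(padded_ids)
print(padded_labels)
```

Padding per batch rather than to a fixed maximum length keeps the tensors small when most sentences are short.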

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForTokenClassification

In [None]:
model = AutoModelForTokenClassification.from_pretrained("elastic/distilbert-base-uncased-finetuned-conll03-english",num_labels=len(label_names))



In [None]:
import numpy as np
from datasets import load_metric
metric = load_metric("seqeval")

#Compute metrics function
def compute_metrics(p):
    predictions, labels = p

    #select predicted index with maximum logit for each token
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }



In [None]:
#Takes an example from the "test" split of the dataset, converts its NER tag ids to label names,
#and uses metric.compute() to score the labels against themselves (so every score is 1.0)

example = dataset["test"][8]
labels = [label_names[i] for i in example["ner_tags"]]
metric.compute(predictions=[labels], references=[labels])

{'LOC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
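
The per-type scores in this output are computed on whole entities rather than on individual tokens. The sketch below is a simplified illustration of that idea and is not seqeval's actual implementation: it extracts `(type, start, end)` spans from a BIO tag sequence and counts exact span matches. The helper names `extract_entities` and `entity_prf` are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues:
            if etype is not None:
                entities.add((etype, start, i - 1))
                start, etype = None, None
            if tag != "O":
                start, etype = i, tag[2:]
    return entities

def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 based on exact span matches."""
    true_e, pred_e = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(true_e & pred_e)
    precision = tp / len(pred_e) if pred_e else 0.0
    recall = tp / len(true_e) if true_e else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]   # the person span is cut short
print(entity_prf(gold, pred))
```

In this toy case three of the four tokens are tagged correctly, yet only one of the two entities matches exactly, which is why entity-level F1 is usually lower than token accuracy.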

In [None]:
from transformers import TrainingArguments, Trainer

batch_size = 16
logging_steps = len(tokenized_dataset['train']) // batch_size
epochs = 1
#Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps)

In [None]:
#Defining trainer
trainer = Trainer(
    model=model,
    args=training_args,
    #train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,

    compute_metrics=compute_metrics
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [None]:
# Initialize wandb in dry run mode- see https://discuss.huggingface.co/t/logging-experiment-tracking-with-w-b/498/29?page=4
import wandb  # assumes the wandb package is installed in the environment
wandb.init(mode="dryrun")

# Evaluating and predicting on the validation and test dataset

In [None]:
#Evaluating model on validation set
trainer.evaluate()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.17466357350349426,
 'eval_precision': 0.9454525089605734,
 'eval_recall': 0.9442890703658128,
 'eval_f1': 0.9448704315217998,
 'eval_accuracy': 0.9621062017039825,
 'eval_runtime': 116.9343,
 'eval_samples_per_second': 27.793,
 'eval_steps_per_second': 1.745}

The model performs well on the validation dataset without any further fine-tuning: the accuracy is about 0.962 and the F1 score about 0.945. The validation accuracy closely matches the value given in the Hugging Face model card at https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english
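
The throughput figures in the evaluation output can be cross-checked with a little arithmetic. This sketch assumes evaluation ran on a single device, so the effective batch size equals `per_device_eval_batch_size`. The numbers are copied from the outputs in this notebook.

```python
import math

num_eval_examples = 3250   # validation rows reported by print(tokenized_dataset)
batch_size = 16            # per_device_eval_batch_size
eval_runtime = 116.9343    # seconds, from the evaluation output

# Number of batches needed to cover the validation split (last batch is partial).
eval_steps = math.ceil(num_eval_examples / batch_size)
samples_per_second = num_eval_examples / eval_runtime
steps_per_second = eval_steps / eval_runtime

print(eval_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The rounded values agree with the reported `eval_samples_per_second` and `eval_steps_per_second`.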

In [None]:
#Making predictions on the test set
predictions, labels, _ = trainer.predict(tokenized_dataset["test"])
predictions = np.argmax(predictions, axis=2)
# Remove ignored index (special tokens)
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
results = metric.compute(predictions=true_predictions, references=true_labels)
results

{'LOC': {'precision': 0.9134481154025128,
  'recall': 0.9241996233521658,
  'f1': 0.9187924175052657,
  'number': 2124},
 'MISC': {'precision': 0.8294314381270903,
  'recall': 0.7469879518072289,
  'f1': 0.786053882725832,
  'number': 996},
 'ORG': {'precision': 0.8807057920981972,
  'recall': 0.8871715610510046,
  'f1': 0.8839268527430221,
  'number': 2588},
 'PER': {'precision': 0.9710417450169236,
  'recall': 0.9499632082413539,
  'f1': 0.9603868328063976,
  'number': 2718},
 'overall_precision': 0.9125360923965351,
 'overall_recall': 0.9001898884405412,
 'overall_f1': 0.9063209463496236,
 'overall_accuracy': 0.9420947910820876}

Based on these metrics, the entity type with the highest performance is "Person" (PER).

The precision for person entities is 0.9710, meaning that 97.10% of the entities the model predicted as persons were correct. The recall of 0.9500 means that 95.00% of the actual person entities were found by the model. The F1-score of 0.9604 is the harmonic mean of precision and recall and summarizes both in a single number.

By contrast, MISC is the weakest entity type, with an F1-score of 0.7861. The overall accuracy (0.9421) and overall F1 score (0.9063) on the test set are also high, so "elastic/distilbert-base-uncased-finetuned-conll03-english" is a good model for named entity recognition on this dataset.
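
The F1 scores in the results follow directly from the reported precision and recall, since F1 is their harmonic mean. Here is a minimal check using the PER and MISC values from the test output. The function name `f1_score` is just a local helper for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the test-set results.
per_f1 = f1_score(0.9710417450169236, 0.9499632082413539)
misc_f1 = f1_score(0.8294314381270903, 0.7469879518072289)
print(round(per_f1, 4), round(misc_f1, 4))
```

Because the harmonic mean is pulled towards the smaller of the two values, the low recall on MISC drags its F1 down more than its precision alone would suggest.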

# Conclusion

The scope of this part was to choose a good pretrained NER model, evaluate it on the validation dataset, and use it to make predictions on the test dataset, without fine-tuning. Several models were evaluated, and "elastic/distilbert-base-uncased-finetuned-conll03-english" from the Hugging Face hub was found to perform best on the CoNLL-2003 dataset. The next step is to evaluate the model on the master dataset and predict its NER tags.

# References
https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english

https://huggingface.co/dslim/bert-base-NER/discussions

https://medium.com/codex/nlp-deep-learning-training-on-downstream-tasks-using-pytorch-lightning-ner-on-conll-data-part-fe1512ae4183

https://github.com/keep-steady/NER_pytorch

https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03

https://arxiv.org/pdf/1911.02116.pdf

https://huggingface.co/models?sort=downloads&search=CONLL

https://stackoverflow.com/questions/63673511/how-to-use-the-outputs-of-bert-model

https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion

https://towardsdatascience.com/simple-transformers-named-entity-recognition-with-transformer-models-c04b9242a2a0

https://huggingface.co/spaces/autoevaluate/model-evaluator?dataset=conll2003

https://stackoverflow.com/questions/73258047/pytorch-based-bert-ner-for-transfer-learning-retraining

