Intro

- Introduce NER and Knowledge Graphs
- Introduce LLMs and BERT
- Talk about fine-tuning on different domains

By extracting these entities from documents, we can build knowledge graphs that connect otherwise disparate documents. Such graphs can be used for document topic inference, for relating useful documents to each other, and for question-answering tasks.
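As a rough illustration of the idea above, the sketch below links documents that mention the same entity. The function name, document ids, and entity lists are made-up placeholders, not output from the model trained in this notebook.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_graph(doc_entities):
    """Link documents that mention at least one common entity.

    doc_entities: dict mapping a document id to an iterable of entity strings.
    Returns a dict mapping (doc_a, doc_b) pairs to the sorted list of shared entities.
    """
    # Invert the mapping: entity -> set of documents that mention it
    entity_to_docs = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            entity_to_docs[entity].add(doc_id)

    # Every pair of documents sharing an entity gets an edge labelled with that entity
    edges = defaultdict(set)
    for entity, docs in entity_to_docs.items():
        for doc_a, doc_b in combinations(sorted(docs), 2):
            edges[(doc_a, doc_b)].add(entity)
    return {pair: sorted(shared) for pair, shared in edges.items()}


# Hypothetical example data
example_docs = {
    "speech_a": ["World Food Programme", "General Assembly"],
    "speech_b": ["General Assembly", "Security Council"],
    "speech_c": ["UNICEF"],
}
example_graph = build_entity_graph(example_docs)
```

Here only `speech_a` and `speech_b` end up connected, through their shared mention of the General Assembly. A real graph would likely also need entity normalization (for example, mapping abbreviations to full names), which this sketch leaves out.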

Some specific entities are useful to identify in documents, including named committees and assemblies and important topics like the Sustainable Development Goals (SDGs).

We use Leslie Huang’s generously open-sourced UN NER dataset as training and test data.

Introduce Layer

In [1]:
!pip install layer --upgrade -qqq
!pip install -U ipython

!pip install transformers
!pip install datasets
!pip install seqeval

In [1]:
import layer
from layer.decorators import model,pip_requirements,fabric
layer.login()
layer.init("ner-finetuning")

Your Layer project is here: https://app.layer.ai/kaankarakeben/ner-finetuning

The corpus consists of a sample of transcribed speeches given at the UN General Assembly from 1993-2016, which were scraped from the UN website, parsed (e.g. from PDF), and cleaned.

More than 50,000 tokens in the test data were manually tagged for Named Entity Recognition (O - Not a Named Entity; I-PER - Person; I-ORG - Organization; I-LOC - Location; I-MISC - Other Named Entity).
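To make the tag scheme concrete, here is a minimal sketch that groups consecutive tokens carrying the same non-O tag into entity spans. The function name and the example sentence are hypothetical. Because the scheme has only `I-` tags and no `B-` tags, two adjacent entities of the same type cannot be told apart and are merged into one span.

```python
def tags_to_spans(tokens, tags):
    """Group consecutive tokens that share the same non-O tag into entity spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # "I-ORG" -> "ORG"; "O" means the token is not part of an entity
        entity_type = None if tag == "O" else tag.split("-", 1)[-1]
        # A change of type closes the span that was being built
        if entity_type != current_type and current_tokens:
            spans.append((" ".join(current_tokens), current_type))
            current_tokens = []
        if entity_type is not None:
            current_tokens.append(token)
        current_type = entity_type
    # Flush a span that runs to the end of the sentence
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical tagged sentence
example_tokens = ["The", "World", "Food", "Programme", "met", "in", "Geneva"]
example_tags = ["O", "I-ORG", "I-ORG", "I-ORG", "O", "O", "I-LOC"]
example_spans = tags_to_spans(example_tokens, example_tags)
```

For this sentence the result is one ORG span ("World Food Programme") and one LOC span ("Geneva").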


In [6]:
!git clone https://github.com/leslie-huang/UN-named-entity-recognition

Cloning into 'UN-named-entity-recognition'...
remote: Enumerating objects: 21580, done.
remote: Total 21580 (delta 0), reused 0 (delta 0), pack-reused 21580
Receiving objects: 100% (21580/21580), 14.70 MiB | 6.47 MiB/s, done.
Resolving deltas: 100% (21095/21095), done.


Define Tokenizer
- tokenize our inputs and match the labels in the UN NER dataset to those labels used in upstream BERT training.

In [2]:
from transformers import BertTokenizer
import os
import itertools
import pandas as pd
import random
from math import ceil
from datasets import Dataset

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
import torch

batch_size = 16

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DistilBertTokenizer'. 
The class this function is called from is 'BertTokenizer'.


In [11]:
from layer.decorators import dataset, resources

@dataset("un_ner_training")
@resources(path="./UN-named-entity-recognition")
def create_dataset():
    directories = ['./UN-named-entity-recognition/tagged-training/', './UN-named-entity-recognition/tagged-test/']
    data_files = []
    for directory in directories:
        for filename in os.listdir(directory):
            file_path = os.path.join(directory, filename)

            with open(file_path, 'r', encoding="utf8") as f:
                lines = f.readlines()

            # Blank lines separate sentences; each remaining line is "token<TAB>tag"
            split_list = [list(y) for x, y in itertools.groupby(lines, lambda z: z == '\n') if not x]
            tokens = [[x.split('\t')[0] for x in y] for y in split_list]
            # rstrip('\n') rather than [:-1], so a final line without a newline keeps its full tag
            entities = [[x.split('\t')[1].rstrip('\n') for x in y] for y in split_list]
            data_files.append(pd.DataFrame({'tokens': tokens, 'ner_tags': entities}))

    dataset = pd.concat(data_files).reset_index().drop('index', axis=1)
    layer.log({"# Training Examples": len(dataset)})

    return dataset

layer.run([create_dataset])

Run(project_name='ner-finetuning', files_hash='9dd6148b2cb647c4647909f3ca19bfb5d94bf32cc3fc91ed7714c543e370f707', account=Account(id=UUID('d5252c76-e2a3-4b4c-a93f-d86189d26586'), name='kaankarakeben'))

In [3]:
# grabbing the label set
dataset = layer.get_dataset("kaankarakeben/ner-finetuning/datasets/un_ner_training").to_pandas()

from collections import Counter

total_labels = []
for labels in dataset["ner_tags"]:
    total_labels.extend(labels)

label_counter = Counter(total_labels)

labels_to_ids = {k: v for v, k in enumerate(label_counter.keys())}
id_to_label = {i: l for l, i in labels_to_ids.items()}


In [None]:
# entities_to_remove = ["I-PRG", "I-I-MISC", "I-OR", "VMISC", "I-", "0"]
# data = data[~data.Tag.isin(entities_to_remove)]
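The commented-out cell above hints that the raw files contain some malformed tags. One possible way to handle them, sketched here under the assumption that the only valid tags are the five listed in the dataset description, is to drop any sentence containing a tag outside that set. The helper name and example tag lists are hypothetical.

```python
# Assumed valid tag set, taken from the dataset description earlier in this notebook
VALID_TAGS = {"O", "I-PER", "I-ORG", "I-LOC", "I-MISC"}


def has_only_valid_tags(tags, valid_tags=VALID_TAGS):
    """Return True when every tag in the sentence belongs to the valid tag set."""
    return all(tag in valid_tags for tag in tags)


# Hypothetical sentences: the second and third contain malformed tags
example_tag_lists = [
    ["O", "I-ORG", "I-ORG"],
    ["O", "I-OR", "O"],
    ["I-PER", "0"],
]
clean_tag_lists = [tags for tags in example_tag_lists if has_only_valid_tags(tags)]
```

On the DataFrame used in this notebook, the same check could be applied with `dataset[dataset["ner_tags"].apply(has_only_valid_tags)]` before the label mapping is built. Dropping whole sentences is a blunt choice; correcting obvious typos instead would keep more training data.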

In [35]:
@pip_requirements(packages=["transformers"])
@fabric("f-medium")
@model(name="distilbert-base-cased-tokenizer")
def download_tokenizer():
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
    return tokenizer

layer.run([download_tokenizer])

Run(project_name='ner-finetuning', files_hash='492acdf6275e70f15fee939511a8ceceb01aa4d51674e910a59ba8d82a27ed05', account=Account(id=UUID('d5252c76-e2a3-4b4c-a93f-d86189d26586'), name='kaankarakeben'))

In [4]:

tokenizer = layer.get_model("kaankarakeben/ner-finetuning/models/distilbert-base-cased-tokenizer").get_train()

In [6]:
def tokenize_and_align_labels(examples):
    # https://huggingface.co/docs/transformers/tasks/token_classification
    label_all_tokens = True
    tokenized_inputs = tokenizer(list(examples["tokens"]), truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
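            elif label[word_idx] == '0':
                # Some rows use the digit '0' instead of the letter 'O';
                # map them to id 0, which is assumed here to be the id of 'O'
                label_ids.append(0)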
            elif word_idx != previous_word_idx:
                label_ids.append(labels_to_ids[label[word_idx]])
            else:
                label_ids.append(labels_to_ids[label[word_idx]] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
        
    tokenized_inputs["labels"] = labels
    return tokenized_inputs


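# Hold out 20% of the sentences for testing.
# random.sample needs a sequence (sets are rejected on Python 3.11+),
# and pandas .loc needs a list rather than a set.
dataset_ix = list(dataset.index)
test_ix = random.sample(dataset_ix, ceil(len(dataset) * 0.2))
train_ix = sorted(set(dataset_ix) - set(test_ix))

train_dataset = Dataset.from_pandas(dataset.loc[train_ix])
test_dataset = Dataset.from_pandas(dataset.loc[test_ix])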
tokenized_train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_and_align_labels, batched=True)

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
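The alignment logic in `tokenize_and_align_labels` is easier to follow on a small worked example. The sketch below reimplements just the inner loop in plain Python, using a hand-written `word_ids` list in place of real tokenizer output. The function name, word ids, and label ids are hypothetical.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Expand word-level label ids to wordpiece level.

    word_ids: for each wordpiece, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens get the ignore index so the loss skips them
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First wordpiece of a word always carries the word's label
            aligned.append(word_labels[word_idx])
        else:
            # Later wordpieces either repeat the label or are ignored
            aligned.append(word_labels[word_idx] if label_all_tokens else ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical sentence of three words; the second word is split into two wordpieces
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [0, 2, 0]

all_pieces = align_labels(example_word_ids, example_word_labels)
first_only = align_labels(example_word_ids, example_word_labels, label_all_tokens=False)
```

With `label_all_tokens=True` (the setting used in this notebook) both wordpieces of the second word receive label 2. With `label_all_tokens=False` only the first wordpiece is labelled and the second gets -100, which is the default `ignore_index` of PyTorch's cross-entropy loss.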

In [7]:
@pip_requirements(packages=["transformers"])
@fabric("f-gpu-small")
@model("ner_distilbert_cased_un_dataset_finetune")
def train():
    model = AutoModelForTokenClassification.from_pretrained("distilbert-base-cased", num_labels=len(label_counter))

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy = "epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=1,
        weight_decay=0.01,
    )

    data_collator = DataCollatorForTokenClassification(tokenizer)

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_test_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer
    )

    trainer.train()

    return trainer.model

layer.run([train])

KeyboardInterrupt: 
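No metric function is passed to the `Trainer` above, although `seqeval` is installed at the top of the notebook. As a sketch of how entity-level scoring could be prepared, the helper below converts predicted and gold label ids into lists of tag strings and skips positions labelled -100. The helper name and the example ids are hypothetical.

```python
def decode_for_scoring(pred_ids, label_ids, id_to_label, ignore_index=-100):
    """Convert id sequences to tag strings, skipping ignored positions.

    pred_ids, label_ids: lists of equal-length lists of ints, one inner list per sentence.
    Returns (true_tags, pred_tags) as lists of lists of tag strings.
    """
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        # Keep only positions with a real gold label (not special tokens or padding)
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        pred_tags.append([id_to_label[p] for p, _ in kept])
        true_tags.append([id_to_label[l] for _, l in kept])
    return true_tags, pred_tags


# Hypothetical label mapping and a single four-wordpiece sentence
example_id_to_label = {0: "O", 1: "I-ORG"}
example_pred_ids = [[0, 1, 1, 0]]
example_label_ids = [[-100, 1, 0, -100]]

example_true, example_pred = decode_for_scoring(
    example_pred_ids, example_label_ids, example_id_to_label
)
```

The two returned lists have the shape that `seqeval.metrics.classification_report(y_true, y_pred)` expects, so they could be passed to it directly. In practice the predicted ids would come from an argmax over the model logits.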

In [47]:
model = layer.get_model("kaankarakeben/ner-finetuning/models/distilbert-base-cased:1.2").get_train()

Evaluation of the model on an example outside the test set

In [48]:
paragraph = '''Expressing deep concern about the impact of the food security crisis on the
assistance provided by United Nations humanitarian agencies, in particular the World
Food Programme, the United Nations Children’s Fund, the Office for the
Coordination of Humanitarian Affairs of the Secretariat and the Office of the United
Nations High Commissioner for Refugees'''

# source: https://daccess-ods.un.org/tmp/9897623.65818024.html

# Tokenize straight to PyTorch tensors with a batch dimension of 1
tokens = tokenizer(paragraph, return_tensors="pt")
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(input_ids=tokens["input_ids"], attention_mask=tokens["attention_mask"])
predictions = torch.argmax(outputs.logits.squeeze(0), dim=1)
predictions = [id_to_label[int(i)] for i in predictions]

# One row per wordpiece, including [CLS] and [SEP]; continuation pieces start with "##"
words = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0].tolist())
pd.DataFrame({'ner': predictions, 'words': words})
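The table above has one row per wordpiece, which is awkward to read when a word is split into several pieces. The sketch below merges wordpieces back into whole words, keeping the tag predicted for each word's first piece. It assumes the pieces come from `tokenizer.convert_ids_to_tokens`, which marks continuation pieces with a leading `##`. The function name and example values are hypothetical.

```python
def merge_wordpieces(pieces, tags, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge '##' continuation pieces into whole words, keeping the first piece's tag."""
    words, word_tags = [], []
    for piece, tag in zip(pieces, tags):
        if piece in special_tokens:
            continue  # special tokens are not part of the text
        if piece.startswith("##") and words:
            # Continuation piece: glue it onto the previous word, keep that word's tag
            words[-1] += piece[2:]
        else:
            words.append(piece)
            word_tags.append(tag)
    return list(zip(words, word_tags))


# Hypothetical wordpieces and per-piece predictions
example_pieces = ["[CLS]", "World", "Food", "Program", "##me", "[SEP]"]
example_piece_tags = ["O", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
example_words = merge_wordpieces(example_pieces, example_piece_tags)
```

Taking the first piece's tag is only one option; a majority vote over a word's pieces is another reasonable choice.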

References:

https://medium.com/@andrewmarmon/fine-tuned-named-entity-recognition-with-hugging-face-bert-d51d4cb3d7b5

https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb#scrollTo=zPDla1mmZiax

https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/

https://jalammar.github.io/illustrated-bert/

https://huggingface.co/docs/transformers/tasks/token_classification