For details, see the original notebook on `LoRA_for_token_classification` [[link](https://github.com/matthiasdroth/Compute-Optimal_LoRA-Adapters_for_Language_Models/blob/main/LoRA_for_token_classification.ipynb)].

**Done**:
- change model to `FacebookAI/roberta-large`
- change dataset to `DFKI-SLT/few-nerd`
- collect all splits in one huge dataset
- tokenize and align labels
- split dataset into train, valid, test, and dev splits

**ToDo**:
- define data_collators
- configure LoRA adapter
- define LoRA model
- run training via trainer
- change training to `accelerate`
- count FLOPs via `einops` in training loop
- add logic to find maximum batch size
- add basic sweep and log to wandb
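
For the "configure LoRA adapter" item, it can help to estimate ahead of time how many trainable parameters an adapter would add. The sketch below is only a back-of-the-envelope calculation under stated assumptions: it uses the published `roberta-large` dimensions (24 layers, hidden size 1024), assumes that only the query and value projections of each attention layer are adapted, and picks rank 8 purely for illustration. None of these choices is fixed by the notebook yet, and `lora_param_count` is a hypothetical helper name.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


NUM_LAYERS = 24          # roberta-large encoder layers
HIDDEN_SIZE = 1024       # roberta-large hidden size
TARGETS_PER_LAYER = 2    # assumption: query and value projections only
RANK = 8                 # assumption: illustrative rank, not a tuned value

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = NUM_LAYERS * TARGETS_PER_LAYER * per_matrix

print(f"LoRA parameters per adapted matrix: {per_matrix:,}")
print(f"LoRA parameters in total:           {total:,}")
```

The token-classification head is trained in addition to the adapter and is not included in this count.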

In [1]:
from datasets import load_dataset, concatenate_datasets, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer
)
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
import evaluate
import torch
import numpy as np

checkpoint = "FacebookAI/roberta-large"
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, add_prefix_space=True)

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/roberta-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
fewnerd = load_dataset("DFKI-SLT/few-nerd", "supervised")
fewnerd_all = concatenate_datasets([fewnerd["train"], fewnerd["validation"], fewnerd["test"]]).rename_column("tokens", "words")
fewnerd_all

Dataset({
    features: ['id', 'words', 'ner_tags', 'fine_ner_tags'],
    num_rows: 188239
})

In [3]:
labels = fewnerd_all.features["ner_tags"].feature.names
id2label = {k: v for k, v in enumerate(labels)}
label2id = {v: k for k, v in id2label.items()}
id2label, label2id, labels

({0: 'O',
  1: 'art',
  2: 'building',
  3: 'event',
  4: 'location',
  5: 'organization',
  6: 'other',
  7: 'person',
  8: 'product'},
 {'O': 0,
  'art': 1,
  'building': 2,
  'event': 3,
  'location': 4,
  'organization': 5,
  'other': 6,
  'person': 7,
  'product': 8},
 ['O',
  'art',
  'building',
  'event',
  'location',
  'organization',
  'other',
  'person',
  'product'])
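
One detail worth checking: the model in cell 1 is loaded before the label set is known, so its classification head has the library default of two outputs instead of the nine coarse Few-NERD classes. A minimal sketch of how the label maps could be passed when loading the model follows. `build_label_maps` and `load_token_classifier` are hypothetical helpers, and the actual load is left commented out because it downloads the checkpoint.

```python
def build_label_maps(label_names):
    """Return (id2label, label2id) for a list of label names."""
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_token_classifier(checkpoint, label_names):
    """Load a token-classification model whose head matches label_names.

    Hypothetical helper. transformers is imported lazily so that
    build_label_maps can be used on its own.
    """
    from transformers import AutoModelForTokenClassification

    id2label, label2id = build_label_maps(label_names)
    return AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id,
    )


# Coarse Few-NERD labels, copied from the output above
coarse_labels = ["O", "art", "building", "event", "location",
                 "organization", "other", "person", "product"]
id2label, label2id = build_label_maps(coarse_labels)
print(len(id2label), label2id["person"])

# Uncomment to load the model (downloads the checkpoint):
# model = load_token_classifier("FacebookAI/roberta-large", coarse_labels)
```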

In [4]:
def tokenize_and_align_ner_tags(instance_data, verbose=False):
    """Tokenize the pre-split words and align the word-level NER tags with the tokens.

    Set verbose=True to print one log line per token.
    """
    # tokenize "words" field and extract items
    inputs = tokenizer(instance_data["words"], is_split_into_words=True)
    tokens = inputs.tokens()
    word_ids = inputs.word_ids()
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    # declare variables to use in the loop below
    labels = instance_data["ner_tags"]
    simple_word_ids = word_ids[1:-1]  # drop the special tokens <s> and </s>
    previous_word_id = None
    previous_label = None
    match_labels = []
    # loop start
    for i, word_id in enumerate(simple_word_ids):
        if word_id == previous_word_id:
            if verbose:
                print("\t word_id repeats")
            # sub-word token of the same word: reuse the previous label
            label = previous_label
        else:
            # first token of a new word: look up the word's label via word_id
            label = labels[word_id]
        match_labels.append(label)
        # update previous_word_id and previous_label
        previous_word_id = word_id
        previous_label = label
        if verbose:
            # logs
            print(f"i={i} \t word_id={word_id} \t label={label} \t token={tokens[i+1]}")
    # loop end
    match_labels = [-100] + match_labels + [-100]
    return_items = {
        "tokens": tokens,
        "word_ids": word_ids,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "matched_ner_tags": match_labels
    }
    return return_items

fewnerd_all_mapped = fewnerd_all.map(tokenize_and_align_ner_tags)
fewnerd_all_mapped

Dataset({
    features: ['id', 'words', 'ner_tags', 'fine_ner_tags', 'tokens', 'word_ids', 'input_ids', 'attention_mask', 'matched_ner_tags'],
    num_rows: 188239
})
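
The alignment rule used above can be illustrated without a tokenizer by working directly on a list of word IDs in the form returned by `word_ids()` (with `None` for special tokens). This is a minimal sketch, not the notebook's function: `align_labels` is a hypothetical helper, the word IDs are written by hand, and the `label_all_tokens` switch is an assumption added for illustration. The notebook labels every sub-word token; a common alternative labels only the first sub-word token of each word and masks the rest with -100.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Expand word-level labels to token level using the tokenizer's word IDs."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # special token such as <s> or </s>
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # first token of a new word
            aligned.append(word_labels[word_id])
        else:
            # continuation token of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_index)
        previous_word_id = word_id
    return aligned


# Hand-written word IDs for the tokens: <s> Ġin ĠDear born Ġ, </s>
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 4, 0]  # O, location, O

all_tokens = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)
print(first_only)
```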

In [5]:
idx = 22
tokens_idx = fewnerd_all_mapped[idx]["tokens"]
ner_tags_idx = fewnerd_all_mapped[idx]["matched_ner_tags"]
for token_i, ner_tag_i in zip(tokens_idx, ner_tags_idx):
    if ner_tag_i != -100:  # skip the special tokens
        print(f"{token_i} ({id2label[ner_tag_i]})")

ĠKnown (O)
Ġlocally (O)
Ġas (O)
Ġ`` (O)
ĠFair (product)
bottom (product)
ĠBob (product)
s (product)
Ġ`` (O)
Ġit (O)
Ġis (O)
Ġnow (O)
Ġpreserved (O)
Ġat (O)
Ġthe (O)
ĠHenry (building)
ĠFord (building)
ĠMuseum (building)
Ġin (O)
ĠDear (location)
born (location)
Ġ, (O)
ĠMichigan (location)
Ġ. (O)
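
The -100 entries skipped in the loop above mark positions the loss should ignore (PyTorch's cross-entropy loss ignores -100 by default). The same value is commonly used to pad label sequences when examples of different lengths are batched, which is what the "define data_collators" item will need. Below is a minimal pure-Python sketch of that padding step, assuming right padding and RoBERTa's pad token ID of 1. `pad_batch` is a hypothetical helper and the input IDs are made up; it is not the `DataCollatorForTokenClassification` implementation.

```python
def pad_batch(features, pad_token_id=1, label_pad_id=-100):
    """Right-pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # padded label positions get -100 so the loss ignores them
        batch["labels"].append(f["matched_ner_tags"] + [label_pad_id] * n_pad)
    return batch


# Two toy examples of different lengths (IDs are illustrative only)
features = [
    {"input_ids": [0, 11, 12, 2], "attention_mask": [1, 1, 1, 1],
     "matched_ner_tags": [-100, 0, 4, -100]},
    {"input_ids": [0, 13, 2], "attention_mask": [1, 1, 1],
     "matched_ner_tags": [-100, 7, -100]},
]
batch = pad_batch(features)
print(batch["labels"])
```

As far as I can tell, `DataCollatorForTokenClassification` looks for a column named `label` or `labels`, so `matched_ner_tags` would probably need to be renamed before that collator is used.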


In [6]:
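# hold out 15% of all rows as the test split (no seed is set, so the split differs between runs)
trainvalid_test_split = fewnerd_all_mapped.train_test_split(test_size=0.15)
trainvalid_split = trainvalid_test_split["train"]
test_split = trainvalid_test_split["test"]
# hold out 15% of the remaining rows as the validation split
train_valid_split = trainvalid_split.train_test_split(test_size=0.15)
valid_split = train_valid_split["test"]
train_split = train_valid_split["train"]
# tiny 4-row dev split; it is drawn from the full dataset, so it overlaps the other splits
dev_split = fewnerd_all_mapped.train_test_split(test_size=4)["test"]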
fewnerd_dataset = DatasetDict({
    "train": train_split,
    "valid": valid_split,
    "test": test_split,
    "dev": dev_split
}).remove_columns(["id", "word_ids", "tokens", "ner_tags", "fine_ner_tags"])
fewnerd_dataset

DatasetDict({
    train: Dataset({
        features: ['words', 'input_ids', 'attention_mask', 'matched_ner_tags'],
        num_rows: 136002
    })
    valid: Dataset({
        features: ['words', 'input_ids', 'attention_mask', 'matched_ner_tags'],
        num_rows: 24001
    })
    test: Dataset({
        features: ['words', 'input_ids', 'attention_mask', 'matched_ner_tags'],
        num_rows: 28236
    })
    dev: Dataset({
        features: ['words', 'input_ids', 'attention_mask', 'matched_ner_tags'],
        num_rows: 4
    })
})
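
The split sizes printed above follow from applying a 15% hold-out twice. The arithmetic can be checked with a short sketch, assuming the hold-out size is rounded up to the next integer. That rounding rule is an assumption here, but it reproduces the counts shown. `holdout_sizes` is a hypothetical helper.

```python
import math


def holdout_sizes(n_rows, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test


total_rows = 188239
trainvalid_rows, test_rows = holdout_sizes(total_rows, 0.15)
train_rows, valid_rows = holdout_sizes(trainvalid_rows, 0.15)

print(train_rows, valid_rows, test_rows)
print(f"train share: {train_rows / total_rows:.4f}")
print(f"valid share: {valid_rows / total_rows:.4f}")
```

The effective shares are therefore roughly 72.25% train, 12.75% valid, and 15% test.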

In [7]:
label_names = fewnerd_all.features["ner_tags"].feature.names
label_names

['O',
 'art',
 'building',
 'event',
 'location',
 'organization',
 'other',
 'person',
 'product']

Load the [**Few-NERD**](https://arxiv.org/pdf/2105.07464v6.pdf) dataset and read the corresponding [**publication**](https://aclanthology.org/2021.acl-long.248/)!
