- *It works!*
- DONE: use `max_batchsize` from `utils`
- DONE: make **dedicated notebook** to (i) compare `conll2003` to `fewnerd` and (ii) to bring `fewnerd` into the same format.
- DONE: use fewnerd
- DONE: get total flops count with **einops**
- DONE: make custom splits
- DONE: reorder the notebook cells:
  - DONE: model (LoRA) and tokenizer (save peftconfig if necessary)
  - DONE: dataset etc
  - DONE: training, metrics and saving (tokenizer and model)
  - DONE: inference via loading saved items (probably tokenizer, model and peftconfig – OR follow this [guide](https://huggingface.co/docs/peft/v0.9.0/en/package_reference/lora#peft.LoraModel))
- DONE: In **Using the fine-tuned model**, [merge and unload](https://huggingface.co/docs/peft/v0.6.2/en/package_reference/tuners#peft.LoraModel.merge_and_unload) or [reinstantiate](https://huggingface.co/docs/peft/v0.6.2/en/task_guides/token-classification-lora#inference) the LoRA model!
- DONE: adjust batch size – and if necessary epochs – for 3000 training steps `num_steps = train_instances*epochs/batch_size` $\geq$ 3000 $\Rightarrow$ `batch_size` $\leq$ `train_instances*epochs/3000 = train_instances/1000` $\Rightarrow$ `batch_size` $\leq$ `train_instances / 1000` for `epochs = 3`<br>DONE: But do it like this:
  - DONE: get max batch size for model (= max_batchsize_by_model)
  - DONE: specify trainig split
  - DONE: get max batch size for training split length (=max_batchsize_by_trainsplit)
  - DONE: impose max batch size of 32
  - DONE: the batch size is the minimum of these three numbers
- DONE: build `results.json` (consider pandas series) via dict. It holds: splits, specified loraconfig details, flops, metrics (per epoch)
- DONE: use the uuid library to save `results.json` under `results_{uuid}.json`
- DONE: declare variable `split` and use it to select splits as well as for logging it in `results_{uuid}.json`.
- DONE: use split `dev` and determine the learning rate for LoRA models. It seems that with `accelerate`, the maximum accepted learning rate is `5e-4` since training fails for higher learning rates.
- DONE: try a different checkpoint for the model (larger!) and tokenizer. Outcomes:
  - mistral etc. cannot perform token classification (loading these model with AutoModelForTokenClassification will error out)
  - stick with `"FacebookAI/roberta-large"`
- DONE: [Set random seeds](https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy) for reproducibility.
- DONE: Drop `uuid` to remove ambiguity about which checkpoint to use. And tidy up the according commented out sections!
- build dedicated inference notebook for final evaluation on test set, inspired by accelerator validation loop that (i) loads the model, tokenizer and dataset, (ii) runs inference on the according test set.
- add comments
- Check once more the wandb [ablation study](https://wandb.ai/ayush-thakur/dl-question-bank/reports/What-s-the-Optimal-Batch-Size-to-Train-a-Neural-Network---VmlldzoyMDkyNDU).
- make new notebook for wandb sweeps (outsource lots of code snippets into dedicated `ner_utils.py` file.

# NER with `fewnerd` and LoRA

In [1]:
import os
import re
import json
import time
import torch
import random
import datetime
import numpy as np

from peft import LoraConfig, TaskType, PeftModel, PeftConfig, get_peft_model, prepare_model_for_kbit_training
from utils import get_max_instance, get_max_batchsize
from dotenv import load_dotenv
from datasets import load_dataset, concatenate_datasets, DatasetDict
from evaluate import load
from tqdm.auto import tqdm
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import get_scheduler, pipeline, AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification
from huggingface_hub import login
from torch.utils.data import DataLoader
from deepspeed.profiling.flops_profiler import FlopsProfiler

load_dotenv()
login(token=os.getenv("HUGGINGFACE_API_KEY"))
logs_dict = {}

[2024-03-10 03:05:33,290] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/matthias/.cache/huggingface/token
Login successful


In [2]:
def set_seed(seed=42):
    # https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # when running the cudnn backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # set a fixed value for the hash seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"Random seeds set as {seed}.")
    pass
    
set_seed()

Random seeds set as 42.


## Model

In [3]:
raw_datasets = load_dataset("DFKI-SLT/few-nerd", "supervised")
ner_feature = raw_datasets["train"].features["ner_tags"]
label_names = ner_feature.feature.names
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
id2label, label2id

({'0': 'O',
  '1': 'art',
  '2': 'building',
  '3': 'event',
  '4': 'location',
  '5': 'organization',
  '6': 'other',
  '7': 'person',
  '8': 'product'},
 {'O': '0',
  'art': '1',
  'building': '2',
  'event': '3',
  'location': '4',
  'organization': '5',
  'other': '6',
  'person': '7',
  'product': '8'})

In [4]:
base_model_id = "FacebookAI/roberta-large"
logs_dict["base_model_id"] = base_model_id
base_model = AutoModelForTokenClassification.from_pretrained(
    base_model_id,
    id2label=id2label,
    label2id=label2id,
    device_map="auto",
    load_in_8bit=True
)
# LoRA model
# datasets:      3 values [1%, 10%, 100%]
# lora_rank:    10 values [1, ..., 512]
# lora_dropout:  5 values [0, 0.1, 0.2, 0.3, 0.4]
# lora_bias:     3 values ["all", "none", "lora_only"]
# => 3 x 10 x 5 x 3 = 450 sweeps per notebook
# => start with 2 dataset values (10%, 100%) and 3 rank values (2, 8, 32) => 6 sweeps
LoRA_params_dict = {
    "r": 64,
    "target_modules": ["query", "key", "value", "query_proj", "key_proj", "value_proj"],
    "bias": "lora_only",
    "use_rslora": True,
    "task_type": TaskType.TOKEN_CLS,
    "lora_dropout": 0.2
}
logs_dict["LoRA_params_dict"] = LoRA_params_dict
# r =   1 => (  156681, 354476050, 0.00044)
# r =   2 => (  304137, 354623506, 0.00086)
# r =   4 => (  599049, 354918418, 0.00169)
# r =   8 => ( 1188873, 355508242, 0.00334)
# r =  16 => ( 2368521, 356687890, 0.00664)
# r =  32 => ( 4727817, 359047186, 0.01317)
# r =  64 => ( 9446409, 363765778, 0.02597)
# r = 128 => (18883593, 373202962, 0.0506)
# r = 256 => (37757961, 392077330, 0.0963)
# r = 512 => (75506697, 429826066, 0.17567)
config = LoraConfig(
    # GUIDE   => https://huggingface.co/docs/peft/main/en/conceptual_guides/lora#common-lora-parameters-in-peft
    # https://huggingface.co/docs/peft/main/en/conceptual_guides/lora#common-lora-parameters-in-peft:~:text=use_rslora%3A%20When%20set%20to%20True%2C%20uses%20Rank%2DStabilized%20LoRA%20which%20sets%20the%20adapter%20scaling%20factor
    # https://arxiv.org/abs/2312.03732, 
    r = LoRA_params_dict["r"],
    target_modules=LoRA_params_dict["target_modules"],
    bias=LoRA_params_dict["bias"],
    use_rslora=LoRA_params_dict["use_rslora"],
    task_type=LoRA_params_dict["task_type"],
    lora_dropout=LoRA_params_dict["lora_dropout"]
)
logs_dict["LoraConfig"] = str(config)
print(f"LoRA config:\n{config}\n")
adapter_model = prepare_model_for_kbit_training(base_model)
adapter_model = get_peft_model(adapter_model, config)
print(f"base_model type:\n{type(base_model)}\n\nadapter_model type:\n{type(adapter_model)}")
trainable_params, all_params = adapter_model.get_nb_trainable_parameters()
trainable_fraction = round(trainable_params/all_params, 5)
logs_dict["LoRA_model_trainable_params"] = trainable_params
logs_dict["LoRA_model_all_params"] = all_params
logs_dict["LoRA_model_trainable_fraction"] = trainable_fraction
print(f"\ntrainable parameters:\n{trainable_params}")
print(f"\nnall parameters:\n{all_params}")
print(f"\ntrainable fraction:\n{trainable_fraction}")

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/roberta-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


LoRA config:
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.TOKEN_CLS: 'TOKEN_CLS'>, inference_mode=False, r=64, target_modules={'query', 'value_proj', 'value', 'key', 'key_proj', 'query_proj'}, lora_alpha=8, lora_dropout=0.2, fan_in_fan_out=False, bias='lora_only', use_rslora=True, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})

base_model type:
<class 'transformers.models.roberta.modeling_roberta.RobertaForTokenClassification'>

adapter_model type:
<class 'peft.peft_model.PeftModelForTokenClassification'>

trainable parameters:
9446409

nall parameters:
363765778

trainable fraction:
0.02597


In [5]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_prefix_space=True)
logs_dict["tokenizer"] = base_model_id
print(f"tokenizer is fast: {tokenizer.is_fast}")
tokenizer

tokenizer is fast: True


RobertaTokenizerFast(name_or_path='FacebookAI/roberta-large', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

## Dataset

In [6]:
raw_datasets = load_dataset("DFKI-SLT/few-nerd", "supervised")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'fine_ner_tags'],
        num_rows: 131767
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'fine_ner_tags'],
        num_rows: 18824
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'fine_ner_tags'],
        num_rows: 37648
    })
})

In [7]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

fewnerd_all_processed = concatenate_datasets([raw_datasets["train"], raw_datasets["validation"], raw_datasets["test"]]).map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names
)
fewnerd_all_processed

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 188239
})

In [8]:
# filter dataset by length if necessary
trainvalid_test_splits = fewnerd_all_processed.train_test_split(test_size=0.15)
test_split_100 = trainvalid_test_splits["test"]
test_split_10 = test_split_100.train_test_split(test_size = 0.1)["test"]
test_split_1 = test_split_100.train_test_split(test_size = 0.01)["test"]
trainvalid_split = trainvalid_test_splits["train"]
train_valid_split = trainvalid_split.train_test_split(test_size=0.15)
valid_split_100 = train_valid_split["test"]
valid_split_10 = valid_split_100.train_test_split(test_size = 0.1)["test"]
valid_split_1 = valid_split_100.train_test_split(test_size = 0.01)["test"]
train_split_100 = train_valid_split["train"]
train_split_10 = train_split_100.train_test_split(test_size = 0.1)["test"]
train_split_1 = train_split_100.train_test_split(test_size = 0.01)["test"]
dev_train_split = train_split_100.train_test_split(test_size = 120)["test"]
dev_valid_split = valid_split_100.train_test_split(test_size = 32)["test"]
dev_test_split = test_split_100.train_test_split(test_size = 8)["test"]
fewnerd_dsd = DatasetDict({
    "train_100": train_split_100,
    "train_10": train_split_10,
    "train_1": train_split_1,
    "valid_100": valid_split_100,
    "valid_10": valid_split_10,
    "valid_1": valid_split_1,
    "test_100": test_split_100,
    "test_10": test_split_10,
    "test_1": test_split_1,
    "train_dev": dev_train_split,
    "valid_dev": dev_valid_split,
    "test_dev": dev_test_split
})
fewnerd_dsd

DatasetDict({
    train_100: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 136002
    })
    train_10: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 13601
    })
    train_1: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1361
    })
    valid_100: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 24001
    })
    valid_10: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2401
    })
    valid_1: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 241
    })
    test_100: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 28236
    })
    test_10: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2824
    })
    test_1: Dataset({
        features: ['input_ids', 'attentio

In [9]:
split = "dev" # "100", "10", "1", "dev"
assert split in ("100", "10", "1", "dev"), f"Split '{split}' is not a valid choice."
train_split = f"train_{split}"
valid_split = f"valid_{split}"
test_split = f"test_{split}"
train_split, valid_split, test_split

('train_dev', 'valid_dev', 'test_dev')

## Training

In [10]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
data_collator

DataCollatorForTokenClassification(tokenizer=RobertaTokenizerFast(name_or_path='FacebookAI/roberta-large', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multipl

In [11]:
batch = data_collator([fewnerd_dsd[train_split][i] for i in [2, 3]])
batch["labels"] # As we can see, the second set of labels has been padded to the length of the first one using -100s.

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    0,    0,    0,    0,    0,    0,    5,    6,    0,    0,    4,
            4,    4,    0,    4,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100],
        [-100,    0,    5,    6,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    4,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0, -100]])

In [12]:
# largest batch size that still leads to 1000 or more steps per training epoch
trainsplit_len = len(fewnerd_dsd[train_split])
max_batchsize_by_trainsplit = 1
for i in range(20):
    bs = 2**i
    if trainsplit_len/bs >= 1000:
        max_batchsize_by_trainsplit = bs
# longest input instance
max_instance, len_max_instance = get_max_instance(fewnerd_dsd[train_split])
# max_batchsize (for current LoRA model)
max_batchsize_by_model = get_max_batchsize(adapter_model, max_instance, data_collator)
max_batchsize_by_speedup = 32 # diminishing speedup beyond batch_size=32 AND fewer loss minimization steps
print(f"max_batchsize_by_trainsplit:\t{max_batchsize_by_trainsplit}")
print(f"max_batchsize_by_model:\t\t{max_batchsize_by_model}")
print(f"max_batchsize_by_speedup:\t{max_batchsize_by_speedup}")
batch_size = min(max_batchsize_by_trainsplit, max_batchsize_by_model, max_batchsize_by_speedup)
logs_dict["batch_size"] = batch_size
len_max_instance, batch_size



Batch size	1	works!
Batch size	2	works!
Batch size	4	works!
Batch size	8	works!
Batch size	16	works!
Batch size	32	works!
Batch size	64	works!
Batch size	128	works!
Batch size	256	works!
max_batchsize_by_trainsplit:	1
max_batchsize_by_model:		256
max_batchsize_by_speedup:	32


(94, 1)

In [13]:
train_dataloader = DataLoader(
    fewnerd_dsd[train_split],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size
)
valid_dataloader = DataLoader(fewnerd_dsd[valid_split], collate_fn=data_collator, batch_size=batch_size)
train_dataloader, valid_dataloader

(<torch.utils.data.dataloader.DataLoader at 0x7310e00ffc10>,
 <torch.utils.data.dataloader.DataLoader at 0x7310e00fe590>)

In [14]:
optimizer = AdamW(adapter_model.parameters(), lr=5e-4) # 5e-4 works
accelerator = Accelerator()
acc_model, optimizer, train_dataloader, valid_dataloader = accelerator.prepare(
    adapter_model,
    optimizer,
    train_dataloader,
    valid_dataloader
)
model_folder = re.sub("/", "-", base_model_id)
output_dir = f"ner_logs/{model_folder}"
output_dir

'ner_logs/FacebookAI-roberta-large'

In [15]:
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch
logs_dict["num_training_steps"] = num_training_steps
num_warmup_steps = min(500, round(0.15 * num_update_steps_per_epoch)) # 500 or 15% of one epoch, whichever is less
logs_dict["num_warmup_steps"] = num_warmup_steps
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)
print(f"training_steps (all epochs):\t{num_training_steps}\nnum_warmup_steps (first epoch):\t{num_warmup_steps}")

training_steps (all epochs):	360
num_warmup_steps (first epoch):	18


In [16]:
metric = load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()
    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

prof = FlopsProfiler(acc_model) # deepspeed profiler
flops_list = []
training_start = True
validation_start = True
progress_bar = tqdm(range(num_training_steps))
start_time = time.time() # start time
for epoch in range(num_train_epochs):
    # Training
    acc_model.train()
    prof.start_profile() # start profiling
    for batch in train_dataloader:
        outputs = acc_model(**batch)
        if training_start:
            print("\ntraining")
            print([f"{key} shape: {list(batch[key].shape)}" for key in list(batch.keys())])
            print(f"logits shape: {list(outputs['logits'].shape)}, loss: {float(outputs['loss'])}\n")
            training_start = False
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    prof.stop_profile() # stop profiling
    total_flops = prof.get_total_flops()
    flops_list.append(total_flops)
    # Validation
    acc_model.eval()
    for batch in valid_dataloader:
        with torch.no_grad():
            outputs = acc_model(**batch)
        if validation_start:
            print("validation")
            print([f"{key} shape: {list(batch[key].shape)}" for key in list(batch.keys())])
            print(f"logits shape: {list(outputs['logits'].shape)}, loss: {float(outputs['loss'])}\n")
            validation_start = False
        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]
        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)
        true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)
    results = metric.compute()
    results_dict = {key: results[f"overall_{key}"] for key in ["precision", "recall", "f1", "accuracy"]}
    print(f"epoch {epoch}:", results_dict)
    logs_dict[f"epoch_{epoch}_results"] = results_dict
    # save acc_model
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(acc_model)
    unwrapped_model.save_pretrained(output_dir)
    # save tokenizer
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
#
stop_time = time.time()
training_loop_time = str(datetime.timedelta(seconds = round(stop_time-start_time)))
print(f"training loop time: {training_loop_time}")
logs_dict["training_loop_time"] = training_loop_time
prof.end_profile() # end profiling
logs_dict["flops_list"] = flops_list
print(flops_list)
flops_array = np.array(flops_list)
np.sum(flops_array), np.mean(flops_array)

  0%|          | 0/360 [00:00<?, ?it/s]

[2024-03-10 03:06:01,464] [INFO] [profiler.py:80:start_profile] Flops profiler started

training
['input_ids shape: [1, 30]', 'attention_mask shape: [1, 30]', 'labels shape: [1, 30]']
logits shape: [1, 30, 9], loss: 0.887360692024231

validation
['input_ids shape: [1, 50]', 'attention_mask shape: [1, 50]', 'labels shape: [1, 50]']
logits shape: [1, 50, 9], loss: 0.15421250462532043



  _warn_prf(average, modifier, msg_start, len(result))


epoch 0: {'precision': 0.175, 'recall': 0.2413793103448276, 'f1': 0.2028985507246377, 'accuracy': 0.7461629279811098}
[2024-03-10 03:06:56,324] [INFO] [profiler.py:80:start_profile] Flops profiler started
epoch 1: {'precision': 0.425, 'recall': 0.38636363636363635, 'f1': 0.40476190476190477, 'accuracy': 0.8240850059031877}
[2024-03-10 03:07:51,066] [INFO] [profiler.py:80:start_profile] Flops profiler started
epoch 2: {'precision': 0.425, 'recall': 0.37777777777777777, 'f1': 0.3999999999999999, 'accuracy': 0.820543093270366}
training loop time: 0:02:45
[2024-03-10 03:08:46,708] [INFO] [profiler.py:226:end_profile] Flops profiler finished
[199381538048, 198746969344, 199153235200]


(597281742592, 199093914197.33334)

## Save logs

In [17]:
# dev, epoch 0:
# epoch 0: {'precision': 0.175, 'recall': 0.2413793103448276, 'f1': 0.2028985507246377, 'accuracy': 0.7461629279811098}
LoRA_params_dict

{'r': 64,
 'target_modules': ['query',
  'key',
  'value',
  'query_proj',
  'key_proj',
  'value_proj'],
 'bias': 'lora_only',
 'use_rslora': True,
 'task_type': <TaskType.TOKEN_CLS: 'TOKEN_CLS'>,
 'lora_dropout': 0.2}

In [18]:
filename = "logsdict"
filename += f"__split={split}"
filename += f"__r={LoRA_params_dict['r']}"
filename += f"__bias={LoRA_params_dict['bias']}"
filename += f"__loradroput=0point{str(LoRA_params_dict['lora_dropout'])[2:]}"
filename

'logsdict__split=dev__r=64__bias=lora_only__loradroput=0point2'

In [19]:
with open(f"{output_dir}/{filename}.json", "w") as outfile: 
	json.dump(logs_dict, outfile, indent=2)
logs_dict

{'base_model_id': 'FacebookAI/roberta-large',
 'LoRA_params_dict': {'r': 64,
  'target_modules': ['query',
   'key',
   'value',
   'query_proj',
   'key_proj',
   'value_proj'],
  'bias': 'lora_only',
  'use_rslora': True,
  'task_type': <TaskType.TOKEN_CLS: 'TOKEN_CLS'>,
  'lora_dropout': 0.2},
 'LoraConfig': "LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.TOKEN_CLS: 'TOKEN_CLS'>, inference_mode=False, r=64, target_modules={'query', 'value_proj', 'value', 'key', 'key_proj', 'query_proj'}, lora_alpha=8, lora_dropout=0.2, fan_in_fan_out=False, bias='lora_only', use_rslora=True, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})",
 'LoRA_model_trainable_params': 9446409,
 'LoRA_model_all_params': 363765778,
 'LoRA_model_trainable_fraction': 0.02597,
 'tokenizer

## Inference

In [20]:
# load inference model
config = PeftConfig.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
inference_model = PeftModel.from_pretrained(base_model, output_dir).merge_and_unload()
print(f"type(inference_model):\n{type(inference_model)}\n")
# get inputs from text (source: https://en.wikipedia.org/wiki/Konstanz#History)
text = "Konstanz was the birthplace of Count Ferdinand von Zeppelin, constructor of the famous Zeppelin airships."
inputs = tokenizer(text, return_tensors="pt")
inputs



type(inference_model):
<class 'transformers.models.roberta.modeling_roberta.RobertaForTokenClassification'>



{'input_ids': tensor([[    0, 11272, 22398,   329,    21,     5, 32357,     9, 12440, 28855,
          5689, 10915, 38042,     6, 47073,     9,     5,  3395, 10915, 38042,
           935, 31404,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [21]:
# inference
with torch.no_grad():
    logits = inference_model(**inputs).logits
tokens = inputs.tokens()
predictions = torch.argmax(logits, dim=2)
print(f"token (prediction)\n")
for token, prediction in zip(tokens, predictions[0].numpy()):
    print(f"{token} ({id2label[str(prediction)]})")

token (prediction)

<s> (O)
ĠKon (location)
stan (location)
z (location)
Ġwas (O)
Ġthe (O)
Ġbirthplace (O)
Ġof (O)
ĠCount (person)
ĠFerdinand (person)
Ġvon (person)
ĠZe (product)
ppelin (product)
, (O)
Ġconstructor (O)
Ġof (O)
Ġthe (O)
Ġfamous (O)
ĠZe (product)
ppelin (product)
Ġair (product)
ships (product)
. (O)
</s> (O)


$\checkmark$