# Longformer Model
- This notebook explores the Longformer Model: https://huggingface.co/docs/transformers/en/model_doc/longformer
    - specifically: https://huggingface.co/allenai/longformer-base-4096
    - for reference:
        - https://huggingface.co/hewonty/longformer-ner-finetuned-pii
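- Rationale:
    - Longformer is based on RoBERTa, which is a more recent model than BERT
    - it supports token classification (NER) and accepts inputs of up to 4,096 tokens, which suits the long documents in our dataset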
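
To make the long-input rationale concrete, the sketch below counts attention score entries for full self-attention versus a sliding local window. It is a rough upper bound under stated assumptions: the window size of 512 is assumed from the `config.attention_window` value this checkpoint reports during training, and global attention tokens and sequence edges are ignored.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Full self-attention: every token attends to every token."""
    return seq_len * seq_len


def windowed_attention_pairs(seq_len: int, window: int) -> int:
    """Sliding-window attention: each token attends to at most `window` neighbours.

    Upper bound only: edge effects and global-attention tokens are ignored.
    """
    return seq_len * min(window, seq_len)


seq_len = 4096  # maximum input length of longformer-base-4096
window = 512    # assumed local attention window

full = full_attention_pairs(seq_len)
local = windowed_attention_pairs(seq_len, window)
print(f"full: {full:,}  windowed: {local:,}  ratio: {full // local}x")
```

With these numbers the windowed variant needs one eighth of the score entries, and the gap grows linearly with sequence length.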

## Imports

In [1]:
!pip install -q transformers
!pip install -q datasets
!pip install -q evaluate
!pip install -q seqeval
!pip install -q -U bitsandbytes
!pip install -q -U peft
# !pip install -q tensorflow==2.15.0
# !pip install -q tf_keras==2.15.0

In [6]:
# generic
import numpy as np
# from pprint import pprint

# ml
from datasets import load_dataset
import evaluate
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    LongformerTokenizer,
    TFLongformerForTokenClassification,
    TFLongformerModel,
    LongformerTokenizerFast,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
)
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
import tensorflow as tf
from tensorflow import keras
# import torch

from google.colab import drive

drive.mount('/content/drive')
path = '/content/drive/MyDrive/Colab Notebooks/DATASCI 266/266 project'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
def print_version(library_name):
    try:
        lib = __import__(library_name)
        version = getattr(lib, '__version__', 'Version number not found')
        print(f"{library_name} version: {version}")
    except ImportError:
        print(f"{library_name} not installed.")
    except Exception as e:
        print(f"An error occurred: {e}")

print_version('transformers')
print_version('tensorflow')
print_version('keras')

transformers version: 4.44.2
tensorflow version: 2.17.0
keras version: 3.4.1


In [None]:
model_checkpoint = 'allenai/longformer-base-4096'
# model = AutoModel.from_pretrained(model_checkpoint)
# tokenizer = LongformerTokenizer.from_pretrained(model_checkpoint)

## Example Pipeline (pre-trained NER)

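### pre-trained Longformer with a token classification head (TFLongformerForTokenClassification)
- adds a linear layer on top of the hidden-states output
- the base checkpoint is not fine-tuned for NER: the linear layer is newly initialized, so the predictions below are not meaningful until the model is trained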

In [None]:
model_checkpoint = 'allenai/longformer-base-4096'
model = TFLongformerForTokenClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

tf_model.h5:   0%|          | 0.00/765M [00:00<?, ?B/s]

Some layers from the model checkpoint at allenai/longformer-base-4096 were not used when initializing TFLongformerForTokenClassification: ['lm_head']
- This IS expected if you are initializing TFLongformerForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFLongformerForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFLongformerForTokenClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



In [None]:
text = 'Jason Dong lives in Oakland, California!'
inputs = tokenizer(text, add_special_tokens=False, return_tensors='tf')
inputs

{'input_ids': <tf.Tensor: shape=(1, 8), dtype=int32, numpy=
array([[24434, 21570,  1074,    11,  5147,     6,   886,   328]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 8), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [None]:
logits = model(**inputs).logits
predicted_token_class_ids = tf.math.argmax(logits, axis=-1)

predicted_tokens_classes = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]

In [None]:
predicted_tokens_classes

['LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0']
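
The decoding step above (argmax over the class dimension, then a lookup in `id2label`) can be sketched without TensorFlow. The logits below are hypothetical values chosen for illustration, and the two generic label names assume the default two-label configuration; because the classification head is newly initialized, the real labels carry no meaning yet either.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for 3 tokens and 2 classes (not taken from the model run above)
logits = [[0.31, -0.12], [0.05, 0.02], [0.44, -0.40]]
id2label = {0: "LABEL_0", 1: "LABEL_1"}

predicted_ids = [argmax(row) for row in logits]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_labels)
```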

In [None]:
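# sanity check only: reuse the model's own predictions as labels to confirm that a loss can be computed
labels = predicted_token_class_ids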
loss = tf.math.reduce_mean(model(**inputs, labels=labels).loss)
loss

<tf.Tensor: shape=(), dtype=float32, numpy=0.57999486>

### longformer base

In [None]:
model_checkpoint = 'allenai/longformer-base-4096'
model = TFLongformerModel.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at allenai/longformer-base-4096 were not used when initializing TFLongformerModel: ['lm_head']
- This IS expected if you are initializing TFLongformerModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFLongformerModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFLongformerModel were initialized from the model checkpoint at allenai/longformer-base-4096.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFLongformerModel for predictions without further training.


In [None]:
text = 'Jason Dong lives in Oakland, California!'
inputs = tokenizer(text, max_length=12, truncation=True, padding='max_length', return_tensors='tf')
inputs

{'input_ids': <tf.Tensor: shape=(1, 12), dtype=int32, numpy=
array([[    0, 24434, 21570,  1074,    11,  5147,     6,   886,   328,
            2,     1,     1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 12), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>}
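
The padded output above can be reproduced by hand. This is a simplified sketch, not the tokenizer's implementation: it assumes a pad token id of 1 (the value visible in the output) and truncates naively, whereas the real tokenizer keeps the closing special token when it truncates.

```python
def pad_to_max_length(token_ids, max_length, pad_token_id=1):
    """Right-pad (or naively truncate) token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_token_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# ids from the output above, including <s> (0) and </s> (2)
token_ids = [0, 24434, 21570, 1074, 11, 5147, 6, 886, 328, 2]
input_ids, attention_mask = pad_to_max_length(token_ids, max_length=12)
print(input_ids)
print(attention_mask)
```

The two trailing padding positions get attention mask 0, so the model ignores them.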

In [None]:
longformer_output = model(inputs)
longformer_output

TFLongformerBaseModelOutputWithPooling(last_hidden_state=<tf.Tensor: shape=(1, 12, 768), dtype=float32, numpy=
array([[[-0.07698123,  0.0474745 , -0.04398734, ..., -0.1192084 ,
         -0.07126109, -0.03166773],
        [-0.05263936, -0.12233637,  0.00337647, ..., -0.5615269 ,
         -0.10511623,  0.04805949],
        [ 0.10180724,  0.05145535, -0.02661566, ..., -0.23609668,
         -0.17652114, -0.00441759],
        ...,
        [-0.0788525 ,  0.04526231, -0.05874642, ..., -0.13737126,
         -0.07322288, -0.03646876],
        [-0.02362809,  0.07412362, -0.01453518, ..., -0.09896468,
         -0.04090574, -0.07447816],
        [-0.02362809,  0.07412362, -0.01453518, ..., -0.09896468,
         -0.04090574, -0.07447816]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[ 1.67768255e-01, -2.75648117e-01,  7.48496503e-02,
         3.64037976e-02,  2.95565248e-01, -1.94567651e-01,
        -4.82384682e-01, -4.26223427e-01, -1.05960257e-01,
    

In [None]:
# last_hidden_state: one 768-dimensional vector per token; use for token classification
print(longformer_output[0].shape)
longformer_output[0]

(1, 12, 768)


<tf.Tensor: shape=(1, 12, 768), dtype=float32, numpy=
array([[[-0.07698123,  0.0474745 , -0.04398734, ..., -0.1192084 ,
         -0.07126109, -0.03166773],
        [-0.05263936, -0.12233637,  0.00337647, ..., -0.5615269 ,
         -0.10511623,  0.04805949],
        [ 0.10180724,  0.05145535, -0.02661566, ..., -0.23609668,
         -0.17652114, -0.00441759],
        ...,
        [-0.0788525 ,  0.04526231, -0.05874642, ..., -0.13737126,
         -0.07322288, -0.03646876],
        [-0.02362809,  0.07412362, -0.01453518, ..., -0.09896468,
         -0.04090574, -0.07447816],
        [-0.02362809,  0.07412362, -0.01453518, ..., -0.09896468,
         -0.04090574, -0.07447816]]], dtype=float32)>

In [None]:
# pooler_output: one 768-dimensional vector per sequence; use for sequence-level classification
print(longformer_output[1].shape)
longformer_output[1]

(1, 768)


<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[ 1.67768255e-01, -2.75648117e-01,  7.48496503e-02,
         3.64037976e-02,  2.95565248e-01, -1.94567651e-01,
        -4.82384682e-01, -4.26223427e-01, -1.05960257e-01,
        -2.73451060e-01, -3.69706601e-01, -1.42308652e-01,
         1.75642237e-01, -3.21777016e-01,  1.01584628e-01,
        -1.54207155e-01, -3.88846457e-01, -8.81474614e-02,
        -8.41863826e-02, -3.39631177e-02, -2.77868733e-02,
        -2.51630452e-02, -2.30498284e-01, -7.31728098e-04,
        -4.26168799e-01,  4.03053535e-04, -1.67844251e-01,
        -2.53685474e-01,  2.47604832e-01,  1.96822643e-01,
        -3.25214453e-02, -9.08859354e-03, -3.35210800e-01,
         5.26603684e-02, -9.75469649e-02,  2.24627107e-01,
        -1.71481177e-01,  1.38313517e-01, -3.97278726e-01,
         2.16225505e-01,  1.20706037e-02,  2.23266140e-01,
         2.67186701e-01, -5.02591836e-04,  1.39914975e-01,
         1.72542989e-01, -1.99854106e-01, -3.13327432e-01,
      


# Named Entity Classification Test
- https://www.youtube.com/watch?v=dzyDHMycx_c&ab_channel=Rohan-Paul-AI

## Pre-processing

In [None]:
conll = load_dataset('eriktks/conll2003', trust_remote_code=True)

README.md:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

conll2003.py:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [None]:
model_checkpoint = 'allenai/longformer-base-4096'
tokenizer = LongformerTokenizerFast.from_pretrained(model_checkpoint, add_prefix_space=True)

In [None]:
example_text = conll['train'][2]
print('example:', example_text)

tokenized_input = tokenizer(example_text['tokens'], is_split_into_words=True)
print('tokenized input:', tokenized_input)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
print('tokens:', tokens)
word_ids = tokenized_input.word_ids()
print('word_ids:', word_ids)

# need to modify ner_tags to match length of tokenized input
print('length:', len(example_text['ner_tags']), len(tokenized_input['input_ids']))

example: {'id': '2', 'tokens': ['BRUSSELS', '1996-08-22'], 'pos_tags': [22, 11], 'chunk_tags': [11, 12], 'ner_tags': [5, 0]}
tokenized input: {'input_ids': [0, 6823, 16551, 16416, 8008, 12, 3669, 12, 2036, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens: ['<s>', 'ĠBR', 'USS', 'ELS', 'Ġ1996', '-', '08', '-', '22', '</s>']
word_ids: [None, 0, 0, 0, 1, 1, 1, 1, 1, None]
length: 2 10


In [None]:
# resolve tokenizer length difference: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ
def tokenize_and_align_labels(examples, label_all_tokens=True, task='ner'):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [None]:
tokenize_and_align_labels(conll['train'][:3])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[0, 1281, 24020, 1859, 486, 7, 13978, 1089, 17988, 479, 2], [0, 2155, 20809, 2], [0, 6823, 16551, 16416, 8008, 12, 3669, 12, 2036, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100], [-100, 1, 2, -100], [-100, 5, 5, 5, 0, 0, 0, 0, 0, -100]]}
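
The third row of `labels` above can be checked by hand from the `word_ids` printed earlier for the `BRUSSELS 1996-08-22` example. The helper below is a standalone copy of the alignment logic for one sequence (its name, `align_labels`, is introduced here for illustration) and also shows what `label_all_tokens=False` would produce.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Expand word-level labels to sub-word tokens using the tokenizer's word ids."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:              # special token (<s>, </s>)
            aligned.append(ignore_index)
        elif word_idx != previous:        # first sub-word token of a word
            aligned.append(word_labels[word_idx])
        else:                             # continuation of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else ignore_index)
        previous = word_idx
    return aligned


word_ids = [None, 0, 0, 0, 1, 1, 1, 1, 1, None]  # from the example above
ner_tags = [5, 0]                                # B-LOC, O

all_tokens = align_labels(word_ids, ner_tags, label_all_tokens=True)
first_only = align_labels(word_ids, ner_tags, label_all_tokens=False)
print(all_tokens)
print(first_only)
```

Labelling only the first sub-word token keeps the number of scored positions equal to the number of words, at the cost of giving the model fewer supervised positions.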

In [None]:
conll['train'][2]

{'id': '2',
 'tokens': ['BRUSSELS', '1996-08-22'],
 'pos_tags': [22, 11],
 'chunk_tags': [11, 12],
 'ner_tags': [5, 0]}

In [None]:
tokenized_datasets = conll.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

In [None]:
label_list = conll['train'].features['ner_tags'].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
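
A common next step is to turn this list into `id2label` / `label2id` mappings so that predictions are reported as tag names instead of generic `LABEL_n` strings. The sketch below only builds the dictionaries; passing them to `from_pretrained` is shown as a comment and is a suggestion, not something this notebook does.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[5], label2id['I-MISC'])

# Possible use (assumption, not run here):
# model = TFLongformerForTokenClassification.from_pretrained(
#     model_checkpoint, num_labels=len(label_list), id2label=id2label, label2id=label2id)
```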

## Fine-tuning (OK)

In [None]:
# NF4 and double quantization are 4-bit options; BitsAndBytesConfig has no bnb_8bit_* arguments
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type='nf4',
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype='float16'
# )

In [None]:
# reference: https://github.com/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb
# model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list), quantization_config=quantization_config)
# model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list)) # change to TFAutoModelForTokenClassification
model = TFLongformerModel.from_pretrained(model_checkpoint)



In [None]:
model.summary()

Model: "tf_longformer_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 longformer (TFLongformerMa  multiple                  148659456 
 inLayer)                                                        
                                                                 
Total params: 148659456 (567.09 MB)
Trainable params: 148659456 (567.09 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
tokenized_datasets['train']['input_ids']

[[0, 1281, 24020, 1859, 486, 7, 13978, 1089, 17988, 479, 2],
 [0, 2155, 20809, 2],
 [0, 6823, 16551, 16416, 8008, 12, 3669, 12, 2036, 2],
 [0, 20, 796, 1463, 26, 15, 296, 24, 19286, 19, 1859, 2949, 7, 2360, 7, 23795, 1089, 17988, 454, 4211, 3094, 549, 7758, 12094, 2199, 64, 28, 20579, 7, 14336, 479, 2],
 [0, 1600, 128, 29, 4915, 7, 5, 796, 1332, 128, 29, 24443, 1540, 26978, 525, 5577, 4621, 26, 15, 307, 2360, 197, 907, 14336, 38542, 31, 749, 97, 87, 1444, 454, 5, 6441, 2949, 21, 18618, 479, 2],
 [0, 22, 166, 109, 295, 75, 323, 143, 215, 6492, 142, 52, 109, 295, 75, 192, 143, 5619, 13, 24, 2156, 22, 5, 1463, 128, 29, 834, 1565, 26607, 687, 3538, 1935, 11920, 174, 10, 340, 7515, 479, 2],
 [0, 91, 26, 617, 6441, 892, 21, 1552, 8, 114, 24, 21, 303, 14, 8

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors='np')

# train_set = model.prepare_tf_dataset(
#     tokenized_datasets['train'],
#     shuffle=True,
#     batch_size=16,
#     collate_fn=data_collator
# )

train_set = tokenized_datasets['train'].to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols=['labels'],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
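
The collator passed to `to_tf_dataset` pads each batch to the length of its longest sequence and pads the labels with -100 so that padded positions are ignored by the loss. The function below is a minimal pure-Python sketch of that behaviour, not the library's implementation; the name `collate_token_classification` and the pad token id of 1 are assumptions made for illustration.

```python
def collate_token_classification(features, pad_token_id=1, label_pad_id=-100):
    """Pad input ids, attention masks and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two tokenized CoNLL examples from the pre-processing section
features = [
    {"input_ids": [0, 2155, 20809, 2],
     "attention_mask": [1, 1, 1, 1],
     "labels": [-100, 1, 2, -100]},
    {"input_ids": [0, 6823, 16551, 16416, 8008, 12, 3669, 12, 2036, 2],
     "attention_mask": [1] * 10,
     "labels": [-100, 5, 5, 5, 0, 0, 0, 0, 0, -100]},
]

batch = collate_token_classification(features)
print(batch["input_ids"][0])
print(batch["labels"][0])
```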


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

In [None]:
def create_longformer_classification_model(lf_model, max_sequence_length=4096, learning_rate=0.005, num_labels=9):
    """Wrap a Longformer encoder with a small dense head for per-token NER classification."""
    # Freeze the encoder so that only the new head is trained.
    # To discuss: unfreeze the last few layers, or load a smaller Longformer checkpoint from Hugging Face.
    lf_model.trainable = False

    input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int64, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int64, name='attention_mask')

    lf_inputs = {'input_ids': input_ids,
                 'attention_mask': attention_mask}

    lf_out = lf_model(lf_inputs)

    # one 768-dimensional vector per token
    embedding = lf_out['last_hidden_state']

    hidden1 = tf.keras.layers.Dense(200, activation='relu', name='hidden_layer_1')(embedding)
    dropout1 = tf.keras.layers.Dropout(0.3)(hidden1)

    # softmax over the NER classes for every token (9 classes for CoNLL-2003)
    ner_class = tf.keras.layers.Dense(num_labels, activation='softmax', name='ner_classification')(dropout1)

    ner_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=[ner_class])

    # labels equal to -100 (special tokens and padding) are excluded from the loss;
    # note that the plain 'accuracy' metric does not skip those positions
    ner_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(ignore_class=-100),
                      metrics=['accuracy'])

    return ner_model
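
The effect of `ignore_class=-100` in the loss can be illustrated with a small pure-Python sketch: positions labelled -100 contribute nothing, and the loss is averaged over the remaining positions. The probabilities are made-up values, and the exact way Keras normalizes over ignored positions is not reproduced here.

```python
import math


def masked_sparse_cross_entropy(probs, labels, ignore_class=-100):
    """Mean of -log p[label] over positions whose label is not `ignore_class`."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_class]
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical softmax outputs for 3 tokens and 3 classes
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, -100]  # the last position is a special or padding token

loss = masked_sparse_cross_entropy(probs, labels)
print(round(loss, 4))
```

Changing the predicted distribution at the ignored position leaves the loss unchanged, which is the property the -100 labels rely on.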

In [None]:
test_model = create_longformer_classification_model(model)
test_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 attention_mask (InputLayer  [(None, 4096)]               0         []                            
 )                                                                                                
                                                                                                  
 input_ids (InputLayer)      [(None, 4096)]               0         []                            
                                                                                                  
 tf_longformer_model_1 (TFL  TFLongformerBaseModelOutpu   1486594   ['attention_mask[0][0]',      
 ongformerModel)             tWithPooling(last_hidden_s   56         'input_ids[0][0]']           
                             tate=(None, 4096, 768),                                        

In [None]:
test_model.fit(
    train_set,
    epochs=1
)

  1/878 [..............................] - ETA: 43:35:45 - loss: 2.5386 - accuracy: 0.0125
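
The progress line above can be sanity-checked with a little arithmetic. The sketch assumes the ETA is a linear extrapolation over the remaining steps; the first step usually includes one-off setup cost, so the steady-state time per step may well be lower.

```python
import math

num_examples = 14041   # CoNLL-2003 training split
batch_size = 16

steps_per_epoch = math.ceil(num_examples / batch_size)

# ETA reported after the first step: 43:35:45
eta_seconds = 43 * 3600 + 35 * 60 + 45
remaining_steps = steps_per_epoch - 1
seconds_per_step = eta_seconds / remaining_steps

print(steps_per_epoch)
print(f"{seconds_per_step:.0f} s/step")
```

Roughly three minutes per 16-example batch on short sentences is what motivates trying the Hugging Face `Trainer` in the next section.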

### With the Hugging Face Trainer API (this trains the entire model)

In [None]:
task='ner'
batch_size=16
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f'{model_name}-finetuned-{task}',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to='none'
)

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [None]:
metric_seqeval = evaluate.load('seqeval')
labels = [label_list[i] for i in example_text[f"{task}_tags"]]
metric_seqeval.compute(predictions=[labels], references=[labels])

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

{'LOC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}
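
The result above reports `number: 1` for `LOC` because seqeval scores whole entities, not individual tokens. The sketch below is a simplified re-implementation of that idea for a single sequence, written for illustration; it is not seqeval's code, and the helper names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        begins = tag.startswith("B-")
        continues = tag.startswith("I-") and etype == tag[2:]
        if etype is not None and not continues:   # current span ends here
            entities.add((etype, start, i))
            start, etype = None, None
        if begins or (tag.startswith("I-") and etype is None):
            start, etype = i, tag[2:]             # a new span starts here
    return entities


def entity_f1(references, predictions):
    """Entity-level precision, recall and F1 for one tag sequence."""
    true_spans = extract_entities(references)
    pred_spans = extract_entities(predictions)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = ["B-LOC", "O"]  # BRUSSELS 1996-08-22
print(extract_entities(labels))
print(entity_f1(labels, labels))

# A partially recognised entity counts as a miss at the entity level
reference = ["B-PER", "I-PER", "O", "B-LOC"]
prediction = ["B-PER", "O", "O", "B-LOC"]
print(entity_f1(reference, prediction))
```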


In [None]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric_seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [None]:
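# Trainer works with PyTorch models only, so `model` must be a PyTorch token classification model
# (e.g. AutoModelForTokenClassification with num_labels=len(label_list)), not the TFLongformerModel loaded above.
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)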

ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details

In [None]:
trainer.train()

Input ids are automatically padded to be a multiple of `config.attention_window`: 512
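
The message above means that every batch is padded up to the next multiple of the attention window before it reaches the encoder. A minimal sketch of that rounding, assuming the window of 512 reported in the message:

```python
def padded_length(seq_len, attention_window=512):
    """Smallest multiple of `attention_window` that is >= seq_len."""
    remainder = seq_len % attention_window
    return seq_len if remainder == 0 else seq_len + attention_window - remainder


for n in (10, 512, 513, 4096):
    print(n, "->", padded_length(n))
```

For CoNLL-2003, where a sentence is often a few dozen tokens, this suggests that most of the computation is spent on padding; the model is a better fit for long documents such as those in the next section.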


# Fine-Tuning with TAB

## Data Processing

In [None]:
model_checkpoint = 'allenai/longformer-base-4096'
tokenizer = LongformerTokenizerFast.from_pretrained(model_checkpoint, add_prefix_space=True)




In [None]:
ds = load_dataset('json', data_files=f'{path}/data/tab/train_tab_model_testing.json')
ds

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'mask_tags', 'text_spans', 'tokens', 'text'],
        num_rows: 25
    })
})

In [None]:
example_text = ds['train'][2]
print('example:', example_text.keys())

tokenized_input = tokenizer(example_text['tokens'], is_split_into_words=True)
print('tokenized input:', tokenized_input.keys())
print('tokenized input:', tokenized_input)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
print('tokens:', tokens)
word_ids = tokenized_input.word_ids()
print('word_ids:', word_ids)

# need to modify ner_tags to match length of tokenized input
print('length:', len(example_text['ner_tags']), len(tokenized_input['input_ids']), len(tokens))

example: dict_keys(['id', 'ner_tags', 'mask_tags', 'text_spans', 'tokens', 'text'])
tokenized input: dict_keys(['input_ids', 'attention_mask'])
tokenized input: {'input_ids': [0, 36352, 1691, 12435, 1437, 50140, 20, 403, 19575, 11, 80, 2975, 36, 13736, 479, 32968, 5607, 73, 5208, 8, 3330, 37663, 73, 5208, 4839, 136, 5, 3497, 9, 2769, 15434, 19, 5, 796, 1463, 9, 3861, 3941, 223, 320, 6776, 564, 9, 5, 9127, 13, 5, 5922, 9, 3861, 3941, 8, 37898, 28000, 6806, 36, 44, 48, 5, 9127, 44, 46, 4839, 30, 80, 4423, 12437, 2156, 427, 8495, 15997, 4701, 10031, 8, 2135, 208, 31477, 952, 6382, 1168, 677, 36, 44, 48, 5, 10858, 44, 46, 4839, 2156, 15, 155, 587, 6708, 479, 1437, 50140, 20, 10858, 2156, 54, 56, 57, 4159, 1030, 2887, 2156, 58, 4625, 30, 3801, 381, 4, 952, 6382, 10031, 90, 677, 2156, 10, 2470, 21886, 3009, 11, 12275, 479, 20, 4423, 1621, 36, 44, 48, 5, 1621, 44, 46, 4839, 222, 45, 9653, 41, 2936, 7, 3594, 106, 137, 5, 837, 479, 1437, 50140, 374, 112, 902, 3788, 5, 1234, 7162, 1276, 7, 1962,

In [None]:
# # resolve tokenizer length difference: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ
# def tokenize_and_align_labels(examples, label_all_tokens=True, task='both'):
#     """
#     Function to match labels array with extended tokenized input.
#     Input:
#         examples: tokenized examples
#         label_all_tokens = boolean
#         task = 'ner', 'mask', 'both'
#     Returns:
#         Dictionary {input_ids, attention_mask, ner_labels / mask_labels}
#     """
#     tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)


#     if task == 'both':
#         ner_labels = []
#         mask_labels = []
#         task == 'ner'
#         for i, label in enumerate(examples[f"{task}_tags"]):
#             word_ids = tokenized_inputs.word_ids(batch_index=i)
#             previous_word_idx = None
#             ner_label_ids = []
#             mask_label_ids = []
#             for word_idx in word_ids:
#                 # Special tokens have a word id that is None. We set the label to -100 so they are automatically
#                 # ignored in the loss function.
#                 if word_idx is None:
#                     ner_label_ids.append(-100)
#                     mask_label_ids.append(-100)
#                 # We set the label for the first token of each word.
#                 elif word_idx != previous_word_idx:
#                     ner_label_ids.append(label[word_idx])
#                     mask_label_ids.append(examples['mask_tags'][i][word_idx])
#                 # For the other tokens in a word, we set the label to either the current label or -100, depending on
#                 # the label_all_tokens flag.
#                 else:
#                     ner_label_ids.append(label[word_idx] if label_all_tokens else -100)
#                     mask_label_ids.append(examples['mask_tags'][i][word_idx] if label_all_tokens else -100)
#                 previous_word_idx = word_idx
#     else:
#         labels = []
#         for i, label in enumerate(examples[f"{task}_tags"]):
#             word_ids = tokenized_inputs.word_ids(batch_index=i)
#             previous_word_idx = None
#             label_ids = []
#             for word_idx in word_ids:
#                 # Special tokens have a word id that is None. We set the label to -100 so they are automatically
#                 # ignored in the loss function.
#                 if word_idx is None:
#                     label_ids.append(-100)
#                 # We set the label for the first token of each word.
#                 elif word_idx != previous_word_idx:
#                     label_ids.append(label[word_idx])
#                 # For the other tokens in a word, we set the label to either the current label or -100, depending on
#                 # the label_all_tokens flag.
#                 else:
#                     label_ids.append(label[word_idx] if label_all_tokens else -100)
#                 previous_word_idx = word_idx

#     if both
#         labels.append(label_ids)

#     tokenized_inputs["ner_labels"] = labels
#     return tokenized_inputs

In [None]:
# resolve tokenizer length difference: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ
def tokenize_and_align_labels(examples, label_all_tokens=True, task='ner'):
    """
    Tokenize a batch of pre-split examples and align the word-level tags with the Longformer sub-word tokens.
    The same logic should work for other BERT-style fast tokenizers.
    Input:
        examples: batch of examples from the dataset (as passed by map(..., batched=True))
        label_all_tokens: if True, every sub-word token of a word gets the word's label; otherwise only the first
        task: 'ner', 'mask', or 'both'
    Output:
        tokenized inputs with one aligned '{task}_labels' column per requested task
    """
    # max_length is set explicitly: the tokenizer reports no predefined maximum length,
    # in which case truncation=True alone does not truncate
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, max_length=4096, is_split_into_words=True)

    tasks = ['ner', 'mask'] if task == 'both' else [task]
    for t in tasks:
        labels = []
        for i, label in enumerate(examples[f'{t}_tags']):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                # Special tokens have a word id that is None. We set the label to -100 so they are automatically
                # ignored in the loss function.
                if word_idx is None:
                    label_ids.append(-100)
                # We set the label for the first token of each word.
                elif word_idx != previous_word_idx:
                    label_ids.append(label[word_idx])
                # For the other tokens in a word, we set the label to either the current label or -100, depending on
                # the label_all_tokens flag.
                else:
                    label_ids.append(label[word_idx] if label_all_tokens else -100)
                previous_word_idx = word_idx

            labels.append(label_ids)

        # store the aligned labels once per task, after all examples in the batch are processed
        tokenized_inputs[f'{t}_labels'] = labels
    return tokenized_inputs

In [None]:
tokenized_datasets = ds.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

In [None]:
model = TFLongformerModel.from_pretrained(model_checkpoint)
model.summary()



Model: "tf_longformer_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 longformer (TFLongformerMa  multiple                  148659456 
 inLayer)                                                        
                                                                 
Total params: 148659456 (567.09 MB)
Trainable params: 148659456 (567.09 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
tokenized_datasets['train']['input_ids']

[[0,
  36352,
  1691,
  12435,
  1437,
  50140,
  20,
  403,
  19575,
  11,
  41,
  2502,
  36,
  117,
  479,
  1132,
  34357,
  73,
  3933,
  4839,
  136,
  5,
  3497,
  9,
  6508,
  15434,
  19,
  5,
  837,
  223,
  6776,
  2631,
  9,
  5,
  9127,
  13,
  5,
  5922,
  9,
  3861,
  3941,
  8,
  37898,
  28000,
  6806,
  36,
  44,
  48,
  5,
  9127,
  44,
  46,
  4839,
  30,
  427,
  211,
  4,
  312,
  649,
  27,
  642,
  5107,
  677,
  15,
  564,
  550,
  4999,
  479,
  1437,
  50140,
  20,
  11145,
  1621,
  36,
  44,
  48,
  5,
  1621,
  44,
  46,
  4839,
  58,
  4625,
  30,
  49,
  18497,
  2156,
  427,
  344,
  4,
  20963,
  5173,
  649,
  5782,
  29,
  31735,
  9,
  5,
  2803,
  9,
  3125,
  4702,
  479,
  1437,
  50140,
  374,
  195,
  644,
  3010,
  5,
  270,
  9,
  5,
  11035,
  7162,
  1276,
  7,
  492,
  3120,
  9,
  5,
  2502,
  7,
  5,
  1621,
  479,
  2096,
  5,
  7668,
  9,
  6776,
  1132,
  39207,
  155,
  9,
  5,
  9127,
  2156,
  24,
  21,
  1276,
  7,
  10154,
  5,
 

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors='np')

train_set = tokenized_datasets_mask['train'].to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols=['labels'],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


In [None]:
print(len(tokenized_datasets['train']['ner_tags'][0]),
      len(tokenized_datasets['train']['tokens'][0]),
      len(tokenized_datasets['train']['mask_tags'][0]),
      len(tokenized_datasets['train']['labels'][0]),
      len(tokenized_datasets['train']['input_ids'][0]))

tokenized_datasets

643 643 643 759 759


DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'mask_tags', 'text_spans', 'tokens', 'text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 25
    })
})

In [None]:
def create_longformer_classification_model(lf_model, max_sequence_length=4096, learning_rate=0.005):
    # TODO (discuss): could make the last few layers trainable; another option is to load a smaller Longformer model through Hugging Face
    lf_model.trainable = False

    input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int64, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int64, name='attention_mask')

    lf_inputs = {'input_ids': input_ids,
                'attention_mask': attention_mask}

    lf_out = lf_model(lf_inputs)

    embedding = lf_out['last_hidden_state']

    hidden1 = tf.keras.layers.Dense(200, activation='relu', name='hidden_layer_1')(embedding)
    dropout1 = tf.keras.layers.Dropout(0.3)(hidden1)

    ner_class = tf.keras.layers.Dense(9, activation='softmax', name='ner_classification')(dropout1)

    ner_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=[ner_class])

    ner_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      loss=tf.keras.losses.sparse_categorical_crossentropy,
                      metrics=['accuracy'])  # pass metrics as a list

    return ner_model

In [None]:
test_model = create_longformer_classification_model(model)
test_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 attention_mask (InputLayer  [(None, 4096)]               0         []                            
 )                                                                                                
                                                                                                  
 input_ids (InputLayer)      [(None, 4096)]               0         []                            
                                                                                                  
 tf_longformer_model (TFLon  TFLongformerBaseModelOutpu   1486594   ['attention_mask[0][0]',      
 gformerModel)               tWithPooling(last_hidden_s   56         'input_ids[0][0]']           
                             tate=(None, 4096, 768),                                          

In [None]:
test_model.fit(
    train_set,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tf_keras.src.callbacks.History at 0x7c5086755060>
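
- Note on the `-100` labels: the alignment step writes `-100` for positions that should not count toward the loss. PyTorch's cross-entropy skips that value by default; whether the Keras loss used in the head above does so is not verified in this notebook, so treat explicit masking as an open item. Below is a minimal pure-Python sketch of what "ignoring" those positions means. `masked_sparse_cross_entropy` is a hypothetical helper and the probabilities are made-up values; this is not the Keras implementation.

```python
import math

IGNORE_INDEX = -100  # sentinel written by the label alignment step

def masked_sparse_cross_entropy(probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0

# One row of class probabilities per token (3 tokens, 3 classes).
probs = [
    [0.25, 0.25, 0.5],   # special token, label -100, skipped
    [0.5, 0.25, 0.25],   # true class 0
    [0.25, 0.5, 0.25],   # true class 1
]
labels = [-100, 0, 1]

loss = masked_sparse_cross_entropy(probs, labels)
print(round(loss, 4))
```

- Only the two labelled tokens contribute, so the loss is the average of two `-log(0.5)` terms.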

## PEFT (LoRA test - incomplete)
- https://github.com/huggingface/peft


In [3]:
import torch
from transformers import Conv1D
from transformers import LongformerForTokenClassification
import peft

In [4]:
print('peft version:', peft.__version__)

peft version: 0.13.2


In [8]:
# Hyperparameters to tune: r, lora_alpha. Open question: why inference_mode=False? Per testing in another notebook, also include the last layer among the modified modules.
peft_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules=['attention', 'intermediate'])

model_checkpoint = 'allenai/longformer-base-4096'
# load model and apply peft
# model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=7)
model = LongformerForTokenClassification.from_pretrained(model_checkpoint, num_labels=7)
# print(model)
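# model = get_peft_model(model, peft_config)
# NOTE: print_trainable_parameters() is a method of the PeftModel returned by get_peft_model();
# with the line above commented out, the call below raises the AttributeError recorded in the output.
model.print_trainable_parameters()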


Some weights of LongformerForTokenClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AttributeError: 'LongformerForTokenClassification' object has no attribute 'print_trainable_parameters'

In [29]:
def get_specific_layer_names(model):
    # Create a list to store the layer names
    layer_names = []

    # Recursively visit all modules and submodules
    for name, module in model.named_modules():
        # Check if the module is an instance of the specified layers
        if isinstance(module, (torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, Conv1D)):
            # model name parsing

            layer_names.append('.'.join(name.split('.')[4:]).split('.')[0])

    return layer_names

list(set(get_specific_layer_names(model)))

['', 'intermediate', 'attention', 'output']
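
- Why the list above contains an empty string: the helper drops the first four components of each dotted module name, so any name with four or fewer components collapses to `''`. The sketch below replays that parsing on a few module names written out by hand. The names follow the usual BERT/RoBERTa-style layout and are an assumption for illustration, not read from the loaded model; `block_name` and `leaf_name` are hypothetical helpers.

```python
def block_name(qualified_name):
    """Same parsing as the notebook helper: skip four path components, keep the next one."""
    return ".".join(qualified_name.split(".")[4:]).split(".")[0]

def leaf_name(qualified_name):
    """Last path component of a dotted module name."""
    return qualified_name.split(".")[-1]

# Illustrative module names (assumed, not taken from model.named_modules()).
example_names = [
    "longformer.embeddings.word_embeddings",
    "longformer.encoder.layer.0.attention.self.query",
    "longformer.encoder.layer.0.attention.output.dense",
    "longformer.encoder.layer.0.intermediate.dense",
    "longformer.encoder.layer.0.output.dense",
    "classifier",
]

blocks = sorted({block_name(n) for n in example_names})
leaves = sorted({leaf_name(n) for n in example_names})
print(blocks)  # ['', 'attention', 'intermediate', 'output']
print(leaves)  # ['classifier', 'dense', 'query', 'word_embeddings']
```

- The block-level names refer to container modules rather than individual linear layers. Leaf names such as `query` or `dense` may be the more useful candidates for `target_modules`; that is an assumption to check against the PEFT documentation.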

In [10]:
total_params = sum(p.numel() for p in model.parameters())
print(total_params)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable_params)

148074247
148074247


In [9]:
peft_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules='all-linear' #, modules_to_save=
    )
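
- For a sense of scale: in the standard LoRA formulation, each adapted linear layer gets two low-rank matrices, A of shape (r, d_in) and B of shape (d_out, r), and the update is scaled by `lora_alpha / r`. The sketch below does that arithmetic for the `r=8`, `lora_alpha=32` values in the config above. The hidden size 768 is taken from the model summary earlier in this notebook; `lora_added_params` and `lora_scaling` are hypothetical helpers, and the count ignores anything listed under `modules_to_save`.

```python
def lora_added_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update."""
    return lora_alpha / r

HIDDEN = 768  # hidden size reported in the model summary

per_layer = lora_added_params(HIDDEN, HIDDEN, r=8)
scale = lora_scaling(lora_alpha=32, r=8)
print(per_layer)  # 12288
print(scale)      # 4.0
```

- Compared with the 589,824 weights of a full 768 x 768 matrix, one rank-8 adapter adds about 2% as many.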

In [10]:
def create_longformer_classification_model(lf_model, max_sequence_length=4096, learning_rate=0.005):
    # TODO (discuss): could make the last few layers trainable; another option is to load a smaller Longformer model through Hugging Face
    lf_model.trainable = False

    input_ids = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int64, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.int64, name='attention_mask')

    lf_inputs = {'input_ids': input_ids,
                'attention_mask': attention_mask}

    lf_out = lf_model(lf_inputs)

    embedding = lf_out['last_hidden_state']

    hidden1 = tf.keras.layers.Dense(200, activation='relu', name='hidden_layer_1')(embedding)
    dropout1 = tf.keras.layers.Dropout(0.3)(hidden1)

    ner_class = tf.keras.layers.Dense(9, activation='softmax', name='ner_classification')(dropout1)

    ner_model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=[ner_class])

    ner_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      loss=tf.keras.losses.sparse_categorical_crossentropy,
                      metrics=['accuracy'])  # pass metrics as a list

    return ner_model

In [11]:
test_model = create_longformer_classification_model(model)
test_model.summary()

TypeError: unhashable type: 'slice'

In [12]:
from transformers import LongformerModel
model_py = LongformerModel.from_pretrained(model_checkpoint)

In [13]:
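# PyTorch modules have no Keras-style `trainable` switch; freeze the weights through requires_grad instead
model_py.requires_grad_(False)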

In [14]:
model_py.add_adapter(peft_config)

In [15]:
# helper function per: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models
import torch
from transformers import Conv1D
def get_specific_layer_names(model):
    # Create a list to store the layer names
    layer_names = []

    # Recursively visit all modules and submodules
    for name, module in model.named_modules():
        # Check if the module is an instance of the specified layers
        if isinstance(module, (torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, Conv1D)):
            # model name parsing

            layer_names.append('.'.join(name.split('.')[4:]).split('.')[0])

    return layer_names

list(set(get_specific_layer_names(model)))

['', 'attention', 'output', 'intermediate']

# Longformer from Checkpoint

## Load Data

In [None]:
dataset = load_dataset("conll2003")

In [16]:
model = LongformerForTokenClassification.from_pretrained(model_checkpoint, num_labels=7)

Some weights of LongformerForTokenClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
training_args = TrainingArguments(
    output_dir=path,
    fp16=True
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

In [None]:
trainer.train()