# 4️⃣ AdapterFusion for Sequence Classification

In this example we will fine-tune the [*BERT-base-uncased*](https://aclanthology.org/N19-1423/) model for sequence classification. For this purpose, we will use a PEFT method called **AdapterFusion**, which learns a mixture of multiple pre-trained adapters. We will use **transformers** to download the tokenizer, **datasets** to download the data, **adapters** to create and train the adapter model, and **evaluate** to load the evaluation metrics.

You can also open this example in Google Colab:

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ivanvykopal/peft-kinit-2025/blob/master/examples/adapters/04_Adapter_Fusion.ipynb)

## Installation & Imports

In [None]:
# 4.36.0 for compatibility with adapters
# you will probably need to restart the session after installing these modules
%pip install -q --user transformers[torch]==4.36.0
%pip install -q --user datasets
%pip install -q --user adapters
%pip install -q --user evaluate

In [None]:
import torch
import evaluate
import logging

from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertConfig,
    TrainingArguments,
    default_data_collator
)

from adapters import BertAdapterModel, AdapterTrainer
from adapters.composition import Fuse

## Variable initialization

We will be fine-tuning the pre-trained [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) model, which has **110M** parameters. We will set the maximum **input length to 128** tokens and train for **3 epochs** with a **batch size of 32**.

In [None]:
device = "cuda"
model_name_or_path = "bert-base-uncased"
tokenizer_name_or_path = "bert-base-uncased"

max_length = 128
lr = 1e-3
num_epochs = 3
batch_size = 32 # in case of "unable to allocate" errors, decrease the batch size to some lower number (e.g. 8 or 16)

logging.disable(logging.WARNING)

## Load data

The dataset that we will be using is called [Commitment Bank](https://huggingface.co/datasets/super_glue/viewer/cb) (CB) from the SuperGLUE benchmark. Each example is a premise-hypothesis pair, where the premise is a short passage and the hypothesis is a clause. The target is 0 if the premise entails the hypothesis, 1 if it contradicts it, and 2 if the hypothesis is neutral.

The dataset contains **250 training samples and 56 validation samples**. We will also split the validation part of the dataset in half to create a test part for evaluation after training.

In [None]:
dataset = load_dataset("super_glue", "cb")

# test set is not labeled so we need to do custom splits
validtest = dataset["validation"].train_test_split(test_size=0.5)

dataset["validation"] = validtest["train"]
dataset["test"] = validtest["test"]

dataset["train"][0]
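
Because the split above is random, the validation and test halves can differ between runs. The following plain-Python sketch shows the same half/half idea with a fixed seed; the helper name `split_in_half` and the seed value are assumptions made for illustration, and this is not how the datasets library implements its split.

```python
import random


def split_in_half(indices, seed):
    """Shuffle a copy of the indices and cut it into two equal halves."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)  # seeded, so the result is repeatable
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# 56 is the size of the CB validation split mentioned above
valid_idx, test_idx = split_in_half(range(56), seed=42)
print(len(valid_idx), len(test_idx))  # 28 28
```

In the notebook itself, passing a `seed` argument to `train_test_split` should give the same kind of reproducibility.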

Now we will tokenize the dataset. This time we don't need to tokenize the labels, because we will train a classification head that returns numbers and not strings.

In [None]:
tokenizer = BertTokenizer.from_pretrained(tokenizer_name_or_path)

def preprocess_function(examples):
    # encode premise and hypothesis together as one sentence pair,
    # padded or truncated to exactly max_length tokens
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )


# a single map call is enough: it tokenizes and drops the raw text columns
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=["premise", "hypothesis", "idx"],
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

processed_datasets = processed_datasets.rename_column("label", "labels")

train_dataset = processed_datasets["train"].shuffle()
eval_dataset = processed_datasets["validation"]
test_dataset = processed_datasets["test"]
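
To illustrate what the tokenizer call above produces for a sentence pair, here is a toy sketch in plain Python. It uses whitespace tokens instead of WordPiece ids, and the helper name `pack_pair` and the example sentences are hypothetical; it only loosely mirrors how a BERT pair is laid out (`[CLS] premise [SEP] hypothesis [SEP]`), trimmed from the longer segment and padded to `max_length`.

```python
def pack_pair(premise, hypothesis, max_length):
    """Toy pair encoding: whitespace tokens stand in for WordPiece ids."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP]")

    first = premise.lower().split()
    second = hypothesis.lower().split()

    # Reserve three positions for the special tokens, then trim the
    # longer segment one token at a time until the pair fits.
    budget = max_length - 3
    while len(first) + len(second) > budget:
        longer = first if len(first) > len(second) else second
        longer.pop()

    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # segment 0 covers [CLS] + premise + [SEP], segment 1 the rest
    token_type_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    attention_mask = [1] * len(tokens)

    # pad every field up to max_length; padding is masked out
    padding = max_length - len(tokens)
    tokens += ["[PAD]"] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = pack_pair(
    "It was a long day", "The day was long", max_length=16
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why padding to a fixed length of 128 does not change the prediction.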


## Create Adapters model

Next, we will create the adapter model and the adapter fusion. First, we load *BertAdapterModel* using the Adapters module, and after that, we load **3 pre-trained adapters** into the model. These adapters are pre-trained on **MNLI** (cross-genre NLI), **QQP** (question paraphrase) and **QNLI** (QA NLI). As we don't need their prediction heads, we pass **with_head=False** to the loading method.

After that we add an AdapterFusion layer that combines all 3 adapters, activate it, and set it trainable. The *train_adapter_fusion()* method does two things: it freezes all weights of the model (including the adapters!) except for the fusion layer and the classification head, and it activates the given adapter setup so that it is used in the forward pass.

Here is what the AdapterFusion layer looks like in the model:
<p align="center">
<img src="https://raw.githubusercontent.com/Wicwik/peft_tutorial/heads/main/img/af.png" alt="adapter_fusion_arch" width="auto" height="350">
<img src="https://raw.githubusercontent.com/Wicwik/peft_tutorial/heads/main/img/af_arch.png" alt="adapter_fusion" width="auto" height="350">
</p>
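
To make the fusion step more concrete, the following is a minimal plain-Python sketch of the attention that AdapterFusion [2] applies over the adapter outputs: the layer's hidden state acts as the query, each adapter output provides a key and a value, and a softmax over the query-key scores decides how much each adapter contributes. The helper names (`matvec`, `softmax`, `fuse`), the identity projection matrices and the toy 2-dimensional vectors are assumptions made for illustration. This is not the code used by the Adapters library, and it ignores details such as batching over tokens.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(hidden, adapter_outputs, w_query, w_key, w_value):
    """Mix adapter outputs with attention weights computed from `hidden`."""
    query = matvec(w_query, hidden)
    keys = [matvec(w_key, z) for z in adapter_outputs]
    values = [matvec(w_value, z) for z in adapter_outputs]

    # one score per adapter: dot product between the query and its key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)

    # weighted sum of the value vectors
    fused = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


# hypothetical toy inputs: one hidden state and three adapter outputs
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [1.0, 0.0]
adapter_outputs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

fused, weights = fuse(hidden, adapter_outputs, identity, identity, identity)
print(weights)
print(fused)
```

In this toy case the first adapter output is most aligned with the hidden state, so it receives the largest weight. During fusion training only the query, key and value projections (and the head) would be updated, while the adapters themselves stay frozen.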

In [None]:
id2label = {idx: label for idx, label in enumerate(processed_datasets["train"].features["labels"].names)}

config = BertConfig.from_pretrained(model_name_or_path, id2label=id2label)

# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, config=config)
model = BertAdapterModel.from_pretrained(model_name_or_path, config=config)

# comment everything from this line to the model and replace BertAdapterModel with BertForSequenceClassification to do FFT (using around 6GB of memory)
model.load_adapter("nli/multinli@ukp", load_as="multinli", with_head=False)
model.load_adapter("sts/qqp@ukp", with_head=False)
model.load_adapter("nli/qnli@ukp", with_head=False)

# define the fusion setup once and reuse it below
adapter_setup = Fuse("multinli", "qqp", "qnli")

model.add_adapter_fusion(adapter_setup)
model.set_active_adapters(adapter_setup)

model.add_classification_head("cb", num_labels=len(id2label))

# freeze everything except the fusion layer and the classification head
model.train_adapter_fusion(adapter_setup)

print(model.adapter_summary())

model

The Adapters module has no function for getting the number of trainable parameters like the one in the Hugging Face PEFT module, but we can reuse [their implementation](https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py#L492) for this model.

In [None]:
trainable_params = 0
all_param = 0
for n, param in model.named_parameters():
    num_params = param.numel()
    if num_params == 0 and hasattr(param, "ds_numel"):
        num_params = param.ds_numel

    if param.__class__.__name__ == "Params4bit":
        num_params = num_params * 2

    all_param += num_params
    if param.requires_grad:
        # print(n)
        trainable_params += num_params


print(f"trainable params: {trainable_params:,d} || all params: {all_param:,d} || trainable%: {100 * trainable_params / all_param}")

## Training and Evaluation

We will be using Hugging Face [TrainingArguments](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments) and [AdapterTrainer](https://docs.adapterhub.ml/training.html#adaptertrainer) from Adapter Hub (this trainer is based on the Transformers *Trainer*). The BERT model does not generate tokens, so we don't need any postprocessing and can use the metric from *evaluate.load()* directly. The trainer takes a *compute_metrics* function that is used to compute metrics during evaluation.

For the SuperGLUE CB dataset, *evaluate* computes F1 and accuracy.
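
As a rough illustration of these two numbers, here is a plain-Python sketch that computes accuracy and F1 from class ids. It assumes that F1 is macro-averaged over the three CB classes; the function name `accuracy_and_macro_f1` and the example predictions are hypothetical, and the *evaluate* implementation may treat edge cases (such as classes that never occur) differently.

```python
def accuracy_and_macro_f1(predictions, references, num_labels):
    """Accuracy plus the unweighted mean of per-class F1 scores."""
    pairs = list(zip(predictions, references))
    accuracy = sum(p == r for p, r in pairs) / len(pairs)

    f1_scores = []
    for label in range(num_labels):
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denominator = 2 * tp + fp + fn
        # a class that is never predicted nor present scores 0 in this sketch
        f1_scores.append(2 * tp / denominator if denominator else 0.0)

    return {"accuracy": accuracy, "f1": sum(f1_scores) / num_labels}


# hypothetical predictions and gold labels (0, 1, 2 are the CB class ids)
scores = accuracy_and_macro_f1([0, 1, 2, 0], [0, 1, 1, 0], num_labels=3)
print(scores)
```

Macro averaging gives every class the same weight, which matters on a small and imbalanced dataset such as CB: a model can reach decent accuracy while still scoring poorly on a rare class.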

In [None]:
metric = evaluate.load("super_glue", "cb")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    preds = preds.argmax(axis=1)

    return metric.compute(predictions=preds, references=labels)

training_args = TrainingArguments(
    "adapters",
    per_device_train_batch_size=batch_size,
    learning_rate=lr,
    num_train_epochs=num_epochs,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
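
With these arguments the length of the run can be worked out by hand. The sketch below does the arithmetic for the 250 training samples, batch size 32 and 3 epochs used in this example, assuming a single device, no gradient accumulation and that the last, smaller batch is kept.

```python
import math

train_samples = 250          # size of the CB training split
per_device_batch_size = 32   # same value as batch_size above
epochs = 3                   # same value as num_epochs above

# the last batch of each epoch is smaller than the others
steps_per_epoch = math.ceil(train_samples / per_device_batch_size)
total_steps = steps_per_epoch * epochs
last_batch = train_samples - (steps_per_epoch - 1) * per_device_batch_size

print(steps_per_epoch, total_steps, last_batch)  # 8 24 26
```

Since evaluation, logging and saving are all set to `"epoch"`, each of them would happen 3 times during this run, once after every 8 optimization steps.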

Now we will run the training and evaluation. We can also have a look at the memory usage.

Since the *AdapterTrainer* class inherits from the Transformers *Trainer* class, we may see the trainer uploading results to *wandb*.

In [None]:
# for FFT you also need to replace the AdapterTrainer with standard Trainer
# from transformers import Trainer
# trainer = Trainer(
#     model=model,
#     tokenizer=tokenizer,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     data_collator=default_data_collator,
#     compute_metrics=compute_metrics,
# )

trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix="test")

## Save and load

Now we can save the model. The *save_adapter_fusion* and *save_all_adapters* methods save the AdapterFusion layer and all adapters, respectively.

In [None]:
import locale
# force UTF-8 as the preferred encoding (works around locale errors in some notebook environments)
locale.getpreferredencoding = lambda: "UTF-8"

adapter_model_id = f"{model_name_or_path}_adapterfusion_seqcls"

model.save_pretrained(adapter_model_id)
model.save_adapter_fusion(adapter_model_id, "multinli,qqp,qnli")
model.save_all_adapters(adapter_model_id)

Now we can load the model and give it a custom example. It is important to **set active adapters** and to **specify the head** that we want to use.

In [None]:
model = BertAdapterModel.from_pretrained(adapter_model_id)
model.set_active_adapters(Fuse("multinli", "qqp", "qnli"))

print(model.active_adapters)

inputs = tokenizer("A pity. For myself, a great pity. But no one can say Bishop Malduin has not received latitude.", 
                   "Bishop Malduin has not received latitude", 
                   return_tensors="pt"
                   )
print(inputs)
with torch.no_grad():
    logits = model(**inputs, head="cb")[0]
    class_id = torch.argmax(logits).item()
    pred_class = id2label[class_id]
    print(pred_class, class_id)
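
The argmax above only returns the winning class. If a confidence value is also of interest, the logits can be turned into probabilities with a softmax, as in this plain-Python sketch. The logit values are hypothetical, and the `label_names` mapping simply restates the CB labels described earlier (0 entailment, 1 contradiction, 2 neutral).

```python
import math

label_names = {0: "entailment", 1: "contradiction", 2: "neutral"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 0.5, -1.0]  # hypothetical model output for one example
probs = softmax(example_logits)
best_id = max(range(len(probs)), key=probs.__getitem__)

for idx, prob in enumerate(probs):
    print(f"{label_names[idx]}: {prob:.3f}")
print("prediction:", label_names[best_id])  # prediction: entailment
```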

## References

This tutorial was prepared by [Robert Belanec](https://kinit.sk/member/robert-belanec/). In particular, the implementation is based on the following example: [**adapter_fusion.ipynb**](https://github.com/Wicwik/peft_tutorial/blob/main/examples/adapter_fusion.ipynb).

**Citations:**

[1] Devlin et al. (2019). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423/) <br/>
[2] Pfeiffer et al. (2021). [AdapterFusion: Non-Destructive Task Composition for Transfer Learning](https://aclanthology.org/2021.eacl-main.39)