# 3️⃣ Combining Pre-Trained Adapters using AdapterFusion

In [the previous notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/master/notebooks/02_Adapter_Inference.ipynb), we loaded a single pre-trained adapter from _AdapterHub_. Now we will explore how to take advantage of multiple pre-trained adapters to combine their knowledge on a new task. Combining multiple adapters together into one 'block' is called an 'adapter composition'. In this notebook, we will explain one such block known as **AdapterFusion** ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00247.pdf)).

For this guide, we select **CommitmentBank** ([De Marneffe et al., 2019](https://github.com/mcdm/CommitmentBank)), a three-class textual entailment dataset, as our target task. We will fuse [adapters from AdapterHub](https://adapterhub.ml/explore/) which were pre-trained on different tasks. During training, their representations are kept fixed while a newly introduced fusion layer is trained. As our base model, we will use BERT (`bert-base-uncased`).

## Installation

Again, we first install `adapters`, along with HuggingFace's `datasets` and `accelerate` libraries:

In [None]:
!pip install -Uq adapters
!pip install -q datasets
!pip install -q accelerate


## Dataset Preprocessing

Before setting up training, we first prepare the training data. CommitmentBank is part of the SuperGLUE benchmark and can be loaded via HuggingFace `datasets` using one line of code:

In [None]:
from datasets import load_dataset

dataset = load_dataset("super_glue", "cb")
dataset.num_rows

{'train': 250, 'validation': 56, 'test': 250}

Every dataset sample has a premise, a hypothesis and a three-class label:

In [None]:
dataset['train'].features

{'premise': Value(dtype='string', id=None),
 'hypothesis': Value(dtype='string', id=None),
 'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(names=['entailment', 'contradiction', 'neutral'], id=None)}

Now, we need to encode all dataset samples to valid inputs for our `bert-base-uncased` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(
      batch["premise"],
      batch["hypothesis"],
      max_length=180,
      truncation=True,
      padding="max_length"
  )

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)
# The transformers model expects the target class column to be named "labels"
dataset = dataset.rename_column("label", "labels")
# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
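
To make the effect of `max_length`, `truncation` and `padding` more concrete, here is a minimal sketch of how a premise/hypothesis pair is packed into one fixed-length input. It is a toy illustration only: a whitespace split stands in for BERT's WordPiece tokenizer, the helper `pack_pair` is a hypothetical name, and the truncation loop only approximates the tokenizer's default behaviour of shortening the longer sequence first.

```python
# Toy illustration only: whitespace splitting stands in for WordPiece tokenization.
def pack_pair(premise, hypothesis, max_length=12):
    """Pack a pair as [CLS] premise [SEP] hypothesis [SEP], then pad to max_length.

    Assumes max_length >= 3 so that the special tokens always fit.
    """
    first = premise.lower().split()
    second = hypothesis.lower().split()
    # Three positions are reserved for [CLS] and the two [SEP] tokens.
    # Drop tokens one at a time from the longer sequence until the pair fits.
    while len(first) + len(second) > max_length - 3:
        if len(first) > len(second):
            first.pop()
        else:
            second.pop()
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    attention_mask = [1] * len(tokens)
    # Segment ids: 0 for the premise part, 1 for the hypothesis part.
    token_type_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    # Pad to max_length; padded positions are masked out with 0.
    n_pad = max_length - len(tokens)
    tokens += ["[PAD]"] * n_pad
    attention_mask += [0] * n_pad
    token_type_ids += [0] * n_pad
    return {
        "tokens": tokens,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

example = pack_pair("She was happy", "She was sad", max_length=12)
print(example["tokens"])
print(example["attention_mask"])
```

Every encoded sample ends up with the same length, which is what allows the samples to be stacked into tensors later on.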



Now we're ready to set up AdapterFusion...

## Fusion Training

We use a pre-trained BERT model from HuggingFace and instantiate our model using `BertAdapterModel`.

In [None]:
from transformers import BertConfig
from adapters import BertAdapterModel

id2label = {id: label for (id, label) in enumerate(dataset["train"].features["labels"].names)}

config = BertConfig.from_pretrained(
    "bert-base-uncased",
    id2label=id2label,
)
model = BertAdapterModel.from_pretrained(
    "bert-base-uncased",
    config=config,
)

Some weights of BertAdapterModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['heads.default.3.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we have everything set up to load our _AdapterFusion_ setup. First, we load three adapters pre-trained on different tasks from the Hub: MultiNLI, QQP and QNLI. As we don't need their prediction heads, we pass `with_head=False` to the loading method. Next, we add a new fusion layer that combines all the adapters we've just loaded. Finally, we add a new classification head for our target task on top.

We can define a fusion layer by adding a `Fuse` block from the `composition` module. The `Fuse` block combines multiple pre-trained adapters for a new downstream task. Just like `add_adapter` from the previous notebooks, the method `add_adapter_fusion` introduces an untrained fusion layer with randomly initialized weights. The weights of the `Fuse` block are then updated when the model is trained on the dataset.

To learn more about `AdapterFusion` you can check out: https://docs.adapterhub.ml/adapter_composition.html#fuse
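
To give a rough intuition for what a fusion layer computes, here is a minimal pure-Python sketch of the attention mechanism described in the AdapterFusion paper, for a single token position. The layer's hidden state acts as the query, the outputs of the individual adapters act as keys and values, and a softmax over the query-key scores decides how much each adapter contributes. The function names, the identity projection matrices and the adapter outputs below are made up for illustration, and the actual implementation in `adapters` differs in its details.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(hidden, adapter_outputs, W_q, W_k, W_v):
    """Attention over adapter outputs for one token position (simplified)."""
    query = matvec(W_q, hidden)
    keys = [matvec(W_k, z) for z in adapter_outputs]
    values = [matvec(W_v, z) for z in adapter_outputs]
    # One score per adapter: dot product between the query and that adapter's key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # The fused output is the attention-weighted sum of the adapter values.
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights

# Hypothetical 2-dimensional example with three adapters.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [1.0, 0.0]
adapter_outputs = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
fused, weights = fuse(hidden, adapter_outputs, identity, identity, identity)
print(weights)
print(fused)
```

In this toy example the first adapter receives the largest weight because its output is the one most aligned with the query. In a real fusion layer, the projection matrices are the parameters that are learned during fusion training.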

In [None]:
from adapters.composition import Fuse

# Load the pre-trained adapters we want to fuse
model.load_adapter("nli/multinli@ukp", load_as="multinli", with_head=False)
model.load_adapter("sts/qqp@ukp", with_head=False)
model.load_adapter("nli/qnli@ukp", with_head=False)
# Add a fusion layer for all loaded adapters
adapter_setup = Fuse("multinli", "qqp", "qnli")
model.add_adapter_fusion(adapter_setup)

# Add a classification head for our target task
model.add_classification_head("cb", num_labels=len(id2label))





The last preparation step is to set up our adapter setup for training. Similar to `train_adapter()`, `train_adapter_fusion()` does two things: it freezes all weights of the model (including adapters!) except for the fusion layer and classification head. It also activates the given adapter setup to be used in every forward pass.

In [None]:
# Unfreeze and activate fusion setup
model.train_adapter_fusion(adapter_setup)
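
To check the effect of the freezing, we can count how many parameters still require gradients. The helper below is a small sketch (the name `count_parameters` is our own, not part of `adapters`); it only relies on the standard PyTorch `parameters()`, `numel()` and `requires_grad` interface.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style model."""
    trainable, total = 0, 0
    for param in model.parameters():
        total += param.numel()
        # Frozen weights have requires_grad set to False.
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total
```

Calling `count_parameters(model)` after `train_adapter_fusion()` should report a trainable count that is much smaller than the total, since only the fusion layer and the classification head remain unfrozen.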

For training, we make use of the `AdapterTrainer` class built into `adapters`. We configure the training process using a `TrainingArguments` object and define a method that will calculate the evaluation accuracy in the end. We pass both, together with the training and validation split of our dataset, to the trainer instance.

In [None]:
import numpy as np
from transformers import TrainingArguments, EvalPrediction
from adapters import AdapterTrainer

training_args = TrainingArguments(
    learning_rate=5e-5,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)

Start the training 🚀 (this will take a while)

In [None]:
trainer.train()


{'train_runtime': 332.7348, 'train_samples_per_second': 3.757, 'train_steps_per_second': 0.12, 'train_loss': 0.7115382671356201, 'epoch': 5.0}


TrainOutput(global_step=40, training_loss=0.7115382671356201, metrics={'train_runtime': 332.7348, 'train_samples_per_second': 3.757, 'train_steps_per_second': 0.12, 'total_flos': 149577058350000.0, 'train_loss': 0.7115382671356201, 'epoch': 5.0})
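
The `global_step=40` in the output follows directly from the dataset size and the training arguments. As a quick sanity check, here is the arithmetic, assuming training on a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept (the helper name is our own):

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    """Number of optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

# 250 training samples, batch size 32, 5 epochs
print(total_optimizer_steps(250, 32, 5))  # 8 steps per epoch * 5 epochs = 40
```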

After training has completed, let's check how well our setup performs on the validation set of our target dataset:

In [None]:
trainer.evaluate()


{'eval_loss': 0.6307274699211121,
 'eval_acc': 0.7678571428571429,
 'eval_runtime': 10.3644,
 'eval_samples_per_second': 5.403,
 'eval_steps_per_second': 0.193,
 'epoch': 5.0}
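
Since the validation split is small, it helps to translate the accuracy back into a number of examples. The following is simple arithmetic on the values reported above:

```python
num_validation = 56             # size of the CB validation split
eval_acc = 0.7678571428571429   # 'eval_acc' reported by trainer.evaluate()

# Recover how many validation examples were classified correctly.
num_correct = round(eval_acc * num_validation)
print(num_correct, num_validation - num_correct)  # 43 13
```

With only 56 validation examples, each additional correct prediction changes the accuracy by almost 1.8 percentage points, so small differences between runs should be interpreted with care.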

We can also use our setup to make some predictions (the example is from the test set of CB):

In [None]:
import torch

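def predict(premise, hypothesis):
  """Predicts the CB label for a premise/hypothesis pair."""
  encoded = tokenizer(premise, hypothesis, return_tensors="pt")
  # Move the inputs to the device the model is on (the GPU after training, if available)
  encoded = encoded.to(model.device)
  # No gradients are needed for inference
  with torch.no_grad():
    logits = model(**encoded)[0]
  pred_class = torch.argmax(logits).item()
  return id2label[pred_class]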

predict("""
``It doesn't happen very often.'' Karen went home
happy at the end of the day. She didn't think that
the work was difficult.
""",
"the work was difficult"
)

'contradiction'
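
`predict()` only returns the most likely label. If we also want to see how confident the model is, we can turn the logits into probabilities with a softmax. The sketch below does this in plain Python on made-up logits; inside the notebook, something like `torch.softmax(logits, dim=-1)` would serve the same purpose. The function name `label_probabilities` and the logit values are illustrative assumptions.

```python
import math

# Same label order as the `label` feature of the CB dataset
id2label = {0: "entailment", 1: "contradiction", 2: "neutral"}

def label_probabilities(logits):
    """Convert a list of three logits into a {label: probability} dictionary."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}

probs = label_probabilities([-1.0, 2.0, 0.5])  # made-up logits, not model output
print(max(probs, key=probs.get))  # contradiction
```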

Finally, we can extract and save our fusion layer as well as all the adapters we used for training. Both can later be reloaded into the pre-trained model again.

In [None]:
model.save_adapter_fusion("./saved", "multinli,qqp,qnli")
model.save_all_adapters("./saved")

import os

# List the saved files (a portable alternative to `!ls -l saved`, which is not available in the Windows shell)
print(sorted(os.listdir("saved")))


That's it. Do check out [the paper on AdapterFusion](https://arxiv.org/pdf/2005.00247.pdf) for a more theoretical view on what we've just seen.

➡️ `adapters` also enables other composition methods beyond AdapterFusion. For example, check out [the next notebook in this series](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/master/notebooks/04_Cross_Lingual_Transfer.ipynb) on cross-lingual transfer.