# SQuAD and MNLI on IPUs using DeBERTa - Inference

This notebook implements two natural language understanding (NLU) tasks, sequence classification and question answering, using a small, efficient model: [Microsoft DeBERTa-base](https://arxiv.org/abs/2006.03654). It demonstrates how such models can achieve good performance on standard benchmarks while remaining relatively lightweight and easy to use.

The two NLU tasks covered in this notebook are:
- Multi-Genre Natural Language Inference (MNLI) - a sentence-pair classification task
- Stanford Question Answering Dataset (SQuAD) - a question answering task

Hardware requirements: The notebook shows each DeBERTa-base model running on two IPUs. If correctly configured, both models could be served simultaneously on an IPU POD4.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/pNJdMj)  [![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

##### Optimum Graphcore
The notebook also demonstrates [Optimum Graphcore](https://github.com/huggingface/optimum-graphcore), the interface between the Hugging Face Transformers library and [Graphcore IPUs](https://www.graphcore.ai/products/ipu). It shows a more explicit way of using Hugging Face models on the IPU, which is particularly useful when the task in question is not supported by the Hugging Face pipelines API.

The easiest way to run a Hugging Face inference model is to instantiate a pipeline as follows:

```
oracle = pipeline(model="Palak/microsoft_deberta-base_squad")
oracle(question="Where do I live?", context="My name is Wolfgang and I live in Berlin")
```

However, in some cases, such as MNLI, there is no off-the-shelf pipeline ready to use. In this case, you can:
- Instantiate the model with the correct execution mode
- Use the optimum-specific call `to_pipelined` to return the model with changes and annotations for running on the IPU
- Set the model to run in `eval` mode and use the `parallelize` method on the new model to parallelize it across IPUs
- Prepare it for inference using `poptorch.inferenceModel()`

```
model = DebertaForQuestionAnswering.from_pretrained("Palak/microsoft_deberta-base_squad")

ipu_config = IPUConfig(ipus_per_replica=2, matmul_proportion=0.2, executable_cache_dir="./exe_cache")
pipelined_model = to_pipelined(model, ipu_config).eval().parallelize()
pipelined_model = poptorch.inferenceModel(pipelined_model, options=ipu_config.to_options(for_inference=True))
```

This method is used in this notebook because the Hugging Face pipelines do not natively support the MNLI inference task.

## Setup
Install the Optimum Graphcore library.

In [1]:
%pip install "optimum-graphcore>=0.6, <0.7"

Collecting optimum-graphcore<0.7,>=0.6
  Downloading optimum_graphcore-0.6.0-py3-none-any.whl (181 kB)
Collecting tokenizers
  Downloading tokenizers-0.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting scipy
  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

We read some configuration from the environment to support environments like Paperspace Gradient.

In [2]:
import os

executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache")
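
The environment lookup above falls back to a local directory when the variable is not set. As a minimal sketch of that behaviour, the hypothetical helper `resolve_cache_dir` below (not part of Optimum Graphcore) takes the environment as a plain mapping so the fallback can be checked without changing the real process environment:

```python
import os
from typing import Mapping


def resolve_cache_dir(
    environ: Mapping[str, str] = os.environ,
    default: str = "./exe_cache",
) -> str:
    """Return the executable cache directory, preferring the environment.

    `environ` is any mapping of variable names to values; passing a dict
    makes the fallback easy to exercise in isolation.
    """
    return environ.get("POPLAR_EXECUTABLE_CACHE_DIR", default)


# Variable set (the path here is a made-up example): the environment wins.
print(resolve_cache_dir({"POPLAR_EXECUTABLE_CACHE_DIR": "/tmp/my_cache"}))
# Variable missing: the local default is used.
print(resolve_cache_dir({}))
```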

Imports

In [3]:
import os
import torch
from datasets import load_dataset, Dataset

import poptorch
from optimum.graphcore import IPUConfig
from optimum.graphcore.modeling_utils import to_pipelined

from transformers import BartForConditionalGeneration, BartTokenizerFast, BartForSequenceClassification
from transformers import DebertaForSequenceClassification, DebertaTokenizerFast
from transformers import DebertaForQuestionAnswering, AutoTokenizer

## Multi-Genre Natural Language Inference (MNLI)

MNLI is a sentence-pair classification task: given a premise, the goal is to predict whether a hypothesis is true (entailment), false (contradiction) or undetermined (neutral). The task was proposed as a benchmark for evaluating natural language understanding models.

In this notebook, we use the Microsoft DeBERTa-base model to classify pairs of sentences on the MNLI task. We first load the model and the tokenizer, then prepare an example input. Finally, we execute the model on IPUs using PopTorch and obtain the predicted probability of entailment. The cells below also show the same steps with the `facebook/bart-large-mnli` checkpoint.


First, load the model and tokenizer from the Hugging Face Model Hub.

In [4]:
# tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base-mnli")
# model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-base-mnli")
# model.half()

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DebertaForSequenceClassification(
  (deberta): DebertaModel(
    (embeddings): DebertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=0)
      (LayerNorm): DebertaLayerNorm()
      (dropout): StableDropout()
    )
    (encoder): DebertaEncoder(
      (layer): ModuleList(
        (0): DebertaLayer(
          (attention): DebertaAttention(
            (self): DisentangledSelfAttention(
              (in_proj): Linear(in_features=768, out_features=2304, bias=False)
              (pos_dropout): StableDropout()
              (pos_proj): Linear(in_features=768, out_features=768, bias=False)
              (pos_q_proj): Linear(in_features=768, out_features=768, bias=True)
              (dropout): StableDropout()
            )
            (output): DebertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): DebertaLayerNorm()
              (dropout): StableDropout()
            )
          )
          (intermed

In [7]:
# With BART instead
from transformers import BartForSequenceClassification

model_checkpoint = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = BartForSequenceClassification.from_pretrained(model_checkpoint)

In [9]:
model.half()

BartForSequenceClassification(
  (model): BartModel(
    (shared): Embedding(50265, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((10

Create some example inputs and encode them using the tokenizer.

In [8]:
premise = "A man inspects the uniform of a figure in some East Asian country."
hypothesis = "The man is in an East Asian country."

# `truncation` replaces the deprecated `truncation_strategy` keyword argument
inputs = tokenizer.encode(
    premise, hypothesis, return_tensors="pt", truncation="only_first"
)
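
The `only_first` setting used above asks the tokenizer to shorten only the first sequence of a pair (here, the premise) when the combined input is too long. The following is a simplified sketch of that idea, not the tokenizer's actual implementation: the helper `truncate_only_first` and the token IDs are hypothetical, special tokens are ignored, and tokens are assumed to be dropped from the right.

```python
from typing import List, Sequence, Tuple


def truncate_only_first(
    first_ids: Sequence[int],
    second_ids: Sequence[int],
    max_length: int,
) -> Tuple[List[int], List[int]]:
    """Trim tokens from the end of the first sequence until the pair fits."""
    overflow = len(first_ids) + len(second_ids) - max_length
    if overflow <= 0:
        # The pair already fits, so nothing is removed.
        return list(first_ids), list(second_ids)
    if overflow >= len(first_ids):
        # The second sequence is never touched, so the pair cannot fit.
        raise ValueError("first sequence is too short to absorb the truncation")
    return list(first_ids[: len(first_ids) - overflow]), list(second_ids)


premise_ids = [11, 12, 13, 14, 15]  # made-up token IDs
hypothesis_ids = [21, 22]
print(truncate_only_first(premise_ids, hypothesis_ids, max_length=5))
```

The hypothesis is left intact in every case, which is the point of choosing this strategy for premise and hypothesis pairs.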

Configure the instantiated model to run on IPUs

In [12]:
# Naively parallelised BART-large across 4 IPUs
ipu_config = IPUConfig(layers_per_ipu=[0,12,6,6], ipus_per_replica=4, matmul_proportion=0.6, executable_cache_dir=executable_cache_dir)
pipelined_model = to_pipelined(model, ipu_config).eval().parallelize()
pipelined_model = poptorch.inferenceModel(pipelined_model, options=ipu_config.to_options(for_inference=True))
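
To make the `layers_per_ipu=[0,12,6,6]` setting above more concrete, here is a minimal sketch that reads such a list as a split of consecutive layer indices across IPUs. The helper `assign_layers` is hypothetical and not part of Optimum Graphcore. It assumes layers are numbered consecutively and handed out in order, and it says nothing about where the library places embeddings or the classification head.

```python
from typing import Dict, List


def assign_layers(layers_per_ipu: List[int]) -> Dict[int, List[int]]:
    """Map each IPU index to the consecutive layer indices it would hold."""
    assignment: Dict[int, List[int]] = {}
    next_layer = 0
    for ipu, count in enumerate(layers_per_ipu):
        assignment[ipu] = list(range(next_layer, next_layer + count))
        next_layer += count
    return assignment


placement = assign_layers([0, 12, 6, 6])
for ipu, layers in placement.items():
    span = f"layers {layers[0]}-{layers[-1]}" if layers else "no layers"
    print(f"IPU {ipu}: {span}")
```

Under these assumptions the first IPU holds no transformer layers and the remaining 24 layers are spread over the other three.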


Run the MNLI model and print the probability of entailment. We calculate this by discarding the neutral logit (index 1) and applying a softmax over the remaining contradiction (index 0) and entailment (index 2) logits.

In [13]:
logits = pipelined_model(inputs)[0]
entail_contradiction_logits = logits[:, [0, 2]]
prob_label_is_true = entail_contradiction_logits.softmax(dim=1)[:, 1]
print(prob_label_is_true)

Graph compilation: 100%|██████████| 100/100 [02:18<00:00]


tensor([0.9476])
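
The entailment probability computed above can also be worked through by hand. The sketch below repeats the same calculation in plain Python for a single example, assuming the label order implied by the indexing in the cell above (0 = contradiction, 1 = neutral, 2 = entailment). The function name and the logit values are hypothetical and are not model outputs.

```python
import math

# Assumed label order, matching the indexing used in the cell above.
CONTRADICTION, NEUTRAL, ENTAILMENT = 0, 1, 2


def entailment_probability(logits):
    """Two-way softmax over the contradiction and entailment logits."""
    c = logits[CONTRADICTION]
    e = logits[ENTAILMENT]
    shift = max(c, e)  # subtract the maximum for numerical stability
    exp_c = math.exp(c - shift)
    exp_e = math.exp(e - shift)
    return exp_e / (exp_c + exp_e)


example_logits = [-1.0, 3.0, 2.0]  # made-up values for illustration
print(entailment_probability(example_logits))
```

Because the neutral logit is dropped before the softmax, its value has no effect on the result; only the difference between the entailment and contradiction logits matters.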
