# Fine-Tuning on the MRPC Dataset with `AdaptNLP`

In this notebook we follow along with the HuggingFace course and fine-tune BERT on the MRPC dataset.

## Installing What We Need

First we need to install the libraries `AdaptNLP` needs to run. This includes `transformers` and `datasets`, as well as `fastai`, and we'll also use a custom version of `nbdev` (more on this later):

In [1]:
# !pip install git+https://github.com/novetta/adaptnlp@dev >> /dev/null
# !pip install git+https://github.com/muellerzr/nbdev >> /dev/null

  Running command git clone -q https://github.com/novetta/adaptnlp /tmp/pip-req-build-ylvhli2e
  Running command git checkout -b dev --track origin/dev
  Switched to a new branch 'dev'
  Branch 'dev' set up to track remote branch 'dev' from 'origin'.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
  Running command git clone -q https://github.com/muellerzr/nbdev /tmp/pip-req-build-je7gswrn


## Setting Up The Data

Following the HuggingFace tutorial, let's download the dataset:

In [2]:
from datasets import load_dataset

In [3]:
raw_datasets = load_dataset("glue", "mrpc")

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


Next we'll need to build some of `AdaptNLP`'s `TaskDatasets`.

`TaskDatasets` is the mid-level data API `AdaptNLP` provides. Since we're using a raw `HuggingFace` dataset, setting one up is extremely simple.

In [4]:
from adaptnlp import TaskDatasets

First let's note our model name (`bert-base-uncased`):

In [5]:
model_name = 'bert-base-uncased'

The tutorial uses a custom tokenization function, so we'll write our own. It takes in an `item`, a `tokenizer`, and some `tokenize_kwargs`.

> Note: You don't need to worry about `tokenizer` and `tokenize_kwargs`; these are class attributes the dataset API will have access to.

In [6]:
def tok_func(item, tokenizer, tokenize_kwargs):
    return tokenizer(item['sentence1'], item['sentence2'], **tokenize_kwargs)
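
To see how the three arguments fit together, here is a minimal sketch with a stand-in tokenizer. `toy_tokenizer` is hypothetical and only splits on whitespace; the real tokenizer is built by `TaskDatasets` from `tokenizer_name`, and how the library calls `tok_func` internally is an assumption here.

```python
# A stand-in for a HuggingFace tokenizer (hypothetical, for illustration only):
# it lowercases, splits on whitespace, and records which keyword arguments it got.
def toy_tokenizer(first, second, **kwargs):
    tokens = first.lower().split() + ['[SEP]'] + second.lower().split()
    return {'tokens': tokens, 'received_kwargs': sorted(kwargs)}

def tok_func(item, tokenizer, tokenize_kwargs):
    # Tokenize the sentence pair together, forwarding any extra keyword arguments
    return tokenizer(item['sentence1'], item['sentence2'], **tokenize_kwargs)

item = {'sentence1': 'The cat sat .', 'sentence2': 'A cat was sitting .'}
encoded = tok_func(item, toy_tokenizer, {'max_length': 64, 'padding': True})
print(encoded['tokens'])
print(encoded['received_kwargs'])
```

The point is only that `tokenize_kwargs` is unpacked into the tokenizer call, so any keyword the tokenizer understands can be supplied through the dictionary.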

And now we can build our datasets! Let's look at what `TaskDatasets` needs:

In [7]:
from nbdev.showdoc import *
show_doc(TaskDatasets)

<h2 id="TaskDatasets" class="doc_header"><code>class</code> <code>TaskDatasets</code><a href="https://github.com/novetta/adaptnlp/tree/master/adaptnlp/training/core.py#L137" class="source_link" style="float:right">[source]</a></h2>

> <code>TaskDatasets</code>(**`train_dset`**, **`valid_dset`**, **`tokenizer_name`**:`str`=*`None`*, **`tokenize`**:`bool`=*`True`*, **`tokenize_func`**:`callable`=*`None`*, **`tokenize_kwargs`**:`dict`=*`{}`*, **`auto_kwargs`**:`dict`=*`{}`*, **`remove_cols`**:`Union`\[`str`, `List`\[`str`\]\]=*`None`*)

A set of datasets for a particular task, with a simple API.

Note: This is the base API, `items` should be a set of regular text and model-ready labels,
      including label or one-hot encoding being applied.

**Function Arguments**:
* `train_dset`: A train `Dataset` object
* `valid_dset`: A validation `Dataset` object
* `tokenizer_name` (`str`): The string name of a `HuggingFace` tokenizer or model. If `None`, will not tokenize the dataset.
* `tokenize` (`bool`): Whether to tokenize the dataset immediately
* `tokenize_func` (`callable`): A function to tokenize an item with
* `tokenize_kwargs` (`dict`): Some kwargs for when we call the tokenizer
* `auto_kwargs` (`dict`): Some kwargs when calling `AutoTokenizer.from_pretrained`
* `remove_cols` (`Union[str,List[str]]`): What columns to remove


We have our datasets and our tokenizer name, but we still need to specify some `tokenize_kwargs` and set `remove_cols` to match the tutorial, so that our model only gets the inputs it expects. Let's define those:

In [8]:
remove_cols=['sentence1', 'sentence2', 'idx']
tokenize_kwargs = {'max_length':64, 'padding':True}
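
As a rough illustration of why these columns are removed, the sketch below uses a plain dictionary as a stand-in for one tokenized dataset row (the token ids are made up). Dropping the raw text and the index leaves only the fields a model's forward pass would accept:

```python
remove_cols = ['sentence1', 'sentence2', 'idx']

# One tokenized row, represented as a plain dict (token ids are illustrative)
row = {
    'sentence1': 'The cat sat .',
    'sentence2': 'A cat was sitting .',
    'idx': 0,
    'label': 1,
    'input_ids': [101, 1996, 4937, 102],
    'attention_mask': [1, 1, 1, 1],
}

# Keep every field that is not listed in remove_cols
model_ready = {key: value for key, value in row.items() if key not in remove_cols}
print(sorted(model_ready))
```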

Now that everything is in place, let's build some `TaskDatasets`:

In [9]:
dsets = TaskDatasets(
    raw_datasets['train'], raw_datasets['validation'],
    tokenizer_name = model_name,
    tokenize_kwargs = tokenize_kwargs,
    tokenize_func = tok_func,
    remove_cols = remove_cols
)


All that's left is to build some `DataLoaders`! We'll still use the `DataCollatorWithPadding` that `transformers` provides:

In [10]:
from transformers import DataCollatorWithPadding

Similar to the fastai API, we can call `.dataloaders`, specifying our batch size and the collate function:

In [11]:
dls = dsets.dataloaders(
    batch_size=8, 
    collate_fn=DataCollatorWithPadding(tokenizer=dsets.tokenizer)
)
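
For intuition, the idea behind padding inside the collate function can be sketched in plain Python. This is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper and `pad_id=0` is an assumed padding id. Each batch is padded only to its own longest sequence:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```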

We can also look at a batch with `show_batch`:

In [12]:
dls.show_batch(n=4)

| | Input Text | Label |
|---|---|---|
| 0 | although mr sorbello was taken to hospital for a check - up, he was later released. mr sorbello was taken to hospital for a check - up and later released, while mr pennisi thinks he may have cracked ribs. | 0 |
| 1 | the dow jones industrial average was up 0. 3 per cent at 9, 886. 75, while the nasdaq composite index was 0. 4 per cent higher at 1, 986. 97. on wall street, the dow jones industrial average rose 0. 5 per cent at 9, 905. 8 and the nasdaq composite added 0. 7 per cent at 1, 995. 1. | 1 |
| 2 | aspen technology's shares dropped 74 cents, or 23 percent, to close at $ 2. 48 on the nasdaq. in afternoon trading, aspen's shares were off 89 cents or more than 27 percent at $ 2. 33 per share. | 1 |
| 3 | but they are split over whether the fed will acknowledge risks are tilted toward weakness, or say they are balanced. wall street is debating whether the central bank will say risks are tilted toward weakness or balanced with inflation. | 1 |


## Fine-Tuning

Finally we can fine-tune. We'll use the `SequenceClassificationTuner`:

In [13]:
from adaptnlp import SequenceClassificationTuner, Strategy

In [14]:
show_doc(SequenceClassificationTuner)

<h2 id="SequenceClassificationTuner" class="doc_header"><code>class</code> <code>SequenceClassificationTuner</code><a href="https://github.com/novetta/adaptnlp/tree/master/adaptnlp/training/sequence_classification.py#L92" class="source_link" style="float:right">[source]</a></h2>

> <code>SequenceClassificationTuner</code>(**`dls`**:`DataLoaders`, **`model_name`**:`str`, **`tokenizer`**=*`None`*, **`loss_func`**=*`CrossEntropyLoss()`*, **`metrics`**=*`[<function accuracy at 0x7fc9a765fe60>, <fastai.metrics.AccumMetric object at 0x7fc9a75b9790>]`*, **`opt_func`**=*`Adam`*, **`additional_cbs`**=*`None`*, **`expose_fastai_api`**=*`False`*, **`num_classes`**:`int`=*`None`*, **\*\*`kwargs`**) :: `AdaptiveTuner`

An `AdaptiveTuner` with good defaults for Sequence Classification tasks

**Valid kwargs and defaults:**
  - `lr`:float = 0.001
  - `splitter`:function = `trainable_params`
  - `cbs`:list = None
  - `path`:Path = None
  - `model_dir`:Path = 'models'
  - `wd`:float = None
  - `wd_bn_bias`:bool = False
  - `train_bn`:bool = True
  - `moms`: tuple(float) = (0.95, 0.85, 0.95)



As we can see, we pass in the `dls` we made earlier, our `model_name`, the tokenizer we just made from our dataset, and the number of classes to use. The rest have good defaults for us to work with!

In [34]:
tuner = SequenceClassificationTuner(dls, model_name, dsets.tokenizer, num_classes=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

And all that's left is to `tune`. There are only four or five functions you can call on our `tuner` currently; this is by design, to keep the API simple. If you don't want to be boxed in, however, passing `expose_fastai_api=True` to our earlier call exposes the entirety of `Learner`, so you can call `fit_one_cycle`, `lr_find`, and everything else.

> Note: Not everything will *work* out of the box, however.

In [16]:
show_doc(SequenceClassificationTuner.tune)

<h4 id="AdaptiveTuner.tune" class="doc_header"><code>AdaptiveTuner.tune</code><a href="https://github.com/novetta/adaptnlp/tree/master/adaptnlp/training/core.py#L321" class="source_link" style="float:right">[source]</a></h4>

> <code>AdaptiveTuner.tune</code>(**`epochs`**:`int`, **`lr`**:`float`=*`None`*, **`strategy`**:`Strategy`=*`'fit_one_cycle'`*, **`callbacks`**:`list`=*`[]`*, **\*\*`kwargs`**)

Fine tune `self.model` for `epochs` with an `lr` and `strategy`



This looks very similar to fastai's `fit` API, because we use it internally. Rather than calling `tuner.fit_one_cycle`, we pass in either a member of the `Strategy` namespace or the string name of the fit method we want to call.
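
One way such a dispatch can work is sketched below. The `Strategy` class, `ToyLearner`, and the `FlatCos` member are stand-ins written for illustration, not AdaptNLP's source. The sketch assumes each strategy is a plain string naming a fit method, which is consistent with the `'fit_one_cycle'` default shown in the signature above.

```python
class Strategy:
    # Assumed mapping: each attribute holds the name of a fit method
    OneCycle = 'fit_one_cycle'
    FlatCos = 'fit_flat_cos'   # hypothetical member, for illustration

class ToyLearner:
    def fit_one_cycle(self, epochs, lr):
        return f'one-cycle for {epochs} epochs at lr={lr}'

    def fit_flat_cos(self, epochs, lr):
        return f'flat-cos for {epochs} epochs at lr={lr}'

def tune(learner, epochs, lr, strategy=Strategy.OneCycle):
    # Look the method up by name, so a namespace member and a raw string both work
    fit_method = getattr(learner, strategy)
    return fit_method(epochs, lr)

print(tune(ToyLearner(), 3, 5e-5, strategy=Strategy.OneCycle))
print(tune(ToyLearner(), 3, 5e-5, strategy='fit_one_cycle'))
```

Both calls reach the same method, which is why either form can be passed.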

We'll train with the One-Cycle Policy, at the same learning rate as the lesson (5e-5) for 3 epochs:

In [35]:
tuner.tune(3, 5e-5, strategy=Strategy.OneCycle)

| epoch | train_loss | valid_loss | accuracy | f1_score | time |
|---|---|---|---|---|---|
| 0 | 0.545525 | 0.464447 | 0.786765 | 0.847636 | 00:31 |
| 1 | 0.351754 | 0.323562 | 0.860294 | 0.898757 | 00:30 |
| 2 | 0.183519 | 0.346475 | 0.860294 | 0.901554 | 00:31 |


Not too bad! We can save our model and tokenizer away now:

In [44]:
tuner.save('fine_tuned_model')

'fine_tuned_model'

## Getting Predictions

There are two ways to get predictions. The first is the `.predict` method on our `tuner`, which is great if you just finished training and want to see how your model performs on some new data!

The second is AdaptNLP's inference API, which we will show afterwards.

### In Tuner

First let's build a sentence to test with. We'll take the first training example and join its two sentences into one string:

In [20]:
raw_datasets['train'][0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

In [36]:
sentence = 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'

And then predict with it:

In [None]:
show_doc(SequenceClassificationTuner.predict)

<h4 id="SequenceClassificationTuner.predict" class="doc_header"><code>SequenceClassificationTuner.predict</code><a href="https://github.com/novetta/adaptnlp/tree/master/adaptnlp/training/sequence_classification.py#L182" class="source_link" style="float:right">[source]</a></h4>

> <code>SequenceClassificationTuner.predict</code>(**`text`**:`Union`\[`List`\[`str`\], `str`\], **`bs`**:`int`=*`64`*, **`detail_level`**:`DetailLevel`=*`'low'`*)

Predict some `text` for sequence classification with the currently loaded model

**Function Arguments**:
* `text` (`Union[List[str], str]`): Some text or list of texts to do inference with
* `bs` (`int`): A batch size to use for multiple texts
* `detail_level` (`DetailLevel`): A detail level to return on the predictions


In [39]:
tuner.predict(sentence)

{'predictions': ['LABEL_1'],
 'probs': tensor([[0.4284, 0.5716]]),
 'sentences': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .']}

You'll notice it gets our probabilities and predictions, but the predicted label is the generic `LABEL_1` rather than a readable class name. This is because we never actually passed in a vocabulary for it to use. Currently we can override `dls.categorize.classes` to set a vocab:

In [40]:
tuner.dls.categorize = type('', (), {'classes':['not_equivalent', 'equivalent']})

> Note: this just creates an empty class with a single attribute, `classes`
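
To unpack that one-liner: the three-argument form of `type` builds a new class from a name, a tuple of base classes, and a dict of attributes. The sketch below uses only the standard library. `types.SimpleNamespace` is shown as an alternative that behaves the same for attribute lookup, though whether AdaptNLP needs anything beyond `.classes` is not confirmed here.

```python
from types import SimpleNamespace

# type(name, bases, attrs) returns a class; `classes` becomes a class attribute
Categorize = type('', (), {'classes': ['not_equivalent', 'equivalent']})
print(Categorize.classes)            # readable without creating an instance
print(isinstance(Categorize, type))  # it is a class, not an instance

# An instance-based alternative with the same attribute access
categorize = SimpleNamespace(classes=['not_equivalent', 'equivalent'])
print(categorize.classes[1])
```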

In [24]:
tuner.predict(sentence)

{'predictions': ['not_equivalent'],
 'probs': tensor([[0.5082, 0.4918]]),
 'sentences': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .']}
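
The link between `probs` and `predictions` can be sketched as follows, assuming the predicted label is simply the class at the index of the highest probability. `label_from_probs` is a hypothetical helper, not part of AdaptNLP, and the probabilities are the ones printed in the outputs in this notebook:

```python
def label_from_probs(probs, classes):
    """Return the class name at the index of the largest probability."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return classes[best_index]

classes = ['not_equivalent', 'equivalent']
print(label_from_probs([0.5082, 0.4918], classes))
print(label_from_probs([0.4284, 0.5716], classes))
```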

### With the AdaptNLP Inference API

Next we'll look at using the `EasySequenceClassifier` class, which AdaptNLP offers:

In [41]:
from adaptnlp import EasySequenceClassifier

We simply construct the class, and call `.tag_text`:

In [46]:
classifier = EasySequenceClassifier()

In [47]:
classifier.tag_text(
    sentence,
    model_name_or_path = 'fine_tuned_model',
    class_names = ['not_equivalent', 'equivalent']
)

2021-07-25 17:52:06,923 loading file fine_tuned_model


{'predictions': ['equivalent'],
 'probs': tensor([[0.4284, 0.5716]]),
 'sentences': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .']}

There are also different levels of detail we can return (the same is true of our earlier `predict` call).

These live in the `DetailLevel` namespace class, with a few examples below:

In [48]:
from adaptnlp import DetailLevel

In [49]:
DetailLevel.Low

'low'

In [50]:
classifier.tag_text(
    sentence,
    model_name_or_path = 'fine_tuned_model',
    class_names = ['not_equivalent', 'equivalent'],
    detail_level=DetailLevel.Low
)

{'predictions': ['equivalent'],
 'probs': tensor([[0.4284, 0.5716]]),
 'sentences': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .']}

In [51]:
classifier.tag_text(
    sentence,
    model_name_or_path = 'fine_tuned_model',
    class_names = ['not_equivalent', 'equivalent'],
    detail_level=DetailLevel.Medium
)

{'classes': ['LABEL_0', 'LABEL_1'],
 'pairings': OrderedDict([('Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
               tensor([0.4284, 0.5716]))]),
 'predictions': ['equivalent'],
 'probs': tensor([[0.4284, 0.5716]]),
 'sentences': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .']}

In [52]:
classifier.tag_text(
    sentence,
    model_name_or_path = 'fine_tuned_model',
    class_names = ['not_equivalent', 'equivalent'],
    detail_level=DetailLevel.High
)

{'classes': ['LABEL_0', 'LABEL_1'],
 'pairings': OrderedDict([('Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
               tensor([0.4284, 0.5716]))]),
 'predictions': ['equivalent'],
 'probs': tensor([[0.4284, 0.5716]]),
 'sentences': [Sentence: "Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence ."   [− Tokens: 39  − Sentence-Labels: {'sc': [LABEL_0 (0.4284), LABEL_1 (0.5716)]}]]}

In some cases the pairings will wind up being the same across detail levels, because there is only so much information we can return.