*Copyright (c) Microsoft Corporation. All rights reserved.*  

*Licensed under the MIT License.*

# Natural Language Inference on MultiNLI Dataset using Transformers

## Before You Start

The running times shown in this notebook were measured with bert-large-cased on a Standard_NC24rs_v3 Azure Deep Learning Virtual Machine with 4 NVIDIA Tesla V100 GPUs. 
> **Tip:** If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data and a smaller number of epochs. 

The table below provides reference running times for different machine configurations.  

|QUICK_RUN|Machine Configurations|Running time|
|:---------|:----------------------|:------------|
|True|4 **CPU**s, 14GB memory| ~ 15 minutes|
|True|1 NVIDIA Tesla K80 GPU, 12GB GPU memory| ~ 5 minutes|
|False|1 NVIDIA Tesla K80 GPU, 12GB GPU memory| ~ 10.5 hours|
|False|4 NVIDIA Tesla V100 GPUs, 64GB GPU memory| ~ 2.5 hours|

If you run into a CUDA out-of-memory error, try reducing `BATCH_SIZE` and `MAX_SEQ_LENGTH`, but note that model performance may be compromised. 

In [1]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = False

## Summary
In this notebook, we demonstrate fine-tuning pretrained transformer models to perform Natural Language Inference (NLI). We use the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset, and the task is to classify sentence pairs into three classes: contradiction, entailment, and neutral.   
To classify a sentence pair, we concatenate the tokens of both sentences and separate the sentences with the special [SEP] token. A [CLS] token is prepended to the token list and used as the aggregate sequence representation for the classification task. The NLI task thus becomes a sequence classification task. For example, the figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. 
<img src="https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG">

We compare the training time and performance of three models: bert-base-cased, bert-large-cased, and xlnet-large-cased. The model used can be set in the **Configurations** section. 
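The sentence-pair layout described above can be sketched in a few lines of plain Python. This is only an illustration of the BERT-style layout: `build_pair_input` is a hypothetical helper, the sentences are made up, and whitespace splitting stands in for the real subword tokenizer.

```python
def build_pair_input(tokens_a, tokens_b):
    """Return (tokens, segment_ids) for a BERT-style sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Made-up premise and hypothesis, split on whitespace for illustration only.
premise = "The tennis shoes are expensive".split()
hypothesis = "The shoes cost money".split()

tokens, segment_ids = build_pair_input(premise, hypothesis)
print(tokens)
print(segment_ids)
```

The classifier reads its prediction from the position of the `[CLS]` token, so both sentences share a single input sequence.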

In [2]:
import sys, os
nlp_path = os.path.abspath('../../')
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)
    
from tempfile import TemporaryDirectory

import numpy as np
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

import torch

from utils_nlp.models.transformers.sequence_classification import Processor, SequenceClassifier
from utils_nlp.dataset.multinli import load_pandas_df
from utils_nlp.common.timer import Timer

I1110 19:13:59.935610 140117887072000 file_utils.py:39] PyTorch version 1.2.0 available.
I1110 19:13:59.978967 140117887072000 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


To see all the models supported by `SequenceClassifier`, call the `list_supported_models` method.  
**Note**: Although `SequenceClassifier` supports distilbert for single-sequence classification, distilbert doesn't support sentence pair classification and cannot be used in this notebook.

In [3]:
SequenceClassifier.list_supported_models()

['bert-base-uncased',
 'bert-large-uncased',
 'bert-base-cased',
 'bert-large-cased',
 'bert-base-multilingual-uncased',
 'bert-base-multilingual-cased',
 'bert-base-chinese',
 'bert-base-german-cased',
 'bert-large-uncased-whole-word-masking',
 'bert-large-cased-whole-word-masking',
 'bert-large-uncased-whole-word-masking-finetuned-squad',
 'bert-large-cased-whole-word-masking-finetuned-squad',
 'bert-base-cased-finetuned-mrpc',
 'roberta-base',
 'roberta-large',
 'roberta-large-mnli',
 'xlnet-base-cased',
 'xlnet-large-cased',
 'distilbert-base-uncased',
 'distilbert-base-uncased-distilled-squad']

## Configurations

In [4]:
MODEL_NAME = "bert-large-cased"
TO_LOWER = False
BATCH_SIZE = 16

# MODEL_NAME = "xlnet-large-cased"
# TO_LOWER = False
# BATCH_SIZE = 16

TRAIN_DATA_USED_FRACTION = 1
DEV_DATA_USED_FRACTION = 1
NUM_EPOCHS = 2
WARMUP_STEPS = 2500

if QUICK_RUN:
    TRAIN_DATA_USED_FRACTION = 0.001
    DEV_DATA_USED_FRACTION = 0.01
    NUM_EPOCHS = 1
    WARMUP_STEPS = 10

if not torch.cuda.is_available():
    BATCH_SIZE = BATCH_SIZE // 2  # integer division keeps the batch size an int

RANDOM_SEED = 42

# model configurations
MAX_SEQ_LENGTH = 128

# optimizer configurations
LEARNING_RATE = 5e-5

# data configurations
TEXT_COL_1 = "sentence1"
TEXT_COL_2 = "sentence2"
LABEL_COL = "gold_label"
LABEL_COL_NUM = "gold_label_num"

# CACHE_DIR = TemporaryDirectory().name  # alternative: cache in a throwaway location
CACHE_DIR = "./temp"
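To get a feel for what these settings imply, the arithmetic below estimates the number of training steps and the share of them spent in warmup for a full run. It is a rough sketch: it uses the dataset sizes printed in the Load Data section, and it assumes one optimizer step per batch with the last partial batch kept.

```python
import math

# Values copied from this notebook's configuration and printed dataset sizes.
train_size = 392702
dev_matched_size = 9815
dev_mismatched_size = 9832
batch_size = 16
num_epochs = 2
warmup_steps = 2500

# Assumption: one optimizer step per batch, final partial batch included.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

# Number of evaluation batches for each development set.
dev_matched_batches = math.ceil(dev_matched_size / batch_size)
dev_mismatched_batches = math.ceil(dev_mismatched_size / batch_size)

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))
print(dev_matched_batches, dev_mismatched_batches)
```

Under these assumptions, warmup covers roughly 5% of the 49,088 training steps. The 614 and 615 evaluation batches agree with the progress bars in the prediction output.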

## Load Data
The MultiNLI dataset comes with three subsets: train, dev_matched, and dev_mismatched. The dev_matched examples come from the same genres as the training set, while the dev_mismatched examples come from genres not seen in the training set.   
The `load_pandas_df` function downloads and extracts the zip files if they don't already exist in `local_cache_path`, and returns the data subset specified by `file_split`.

In [5]:
train_df = load_pandas_df(local_cache_path=CACHE_DIR, file_split="train")
dev_df_matched = load_pandas_df(local_cache_path=CACHE_DIR, file_split="dev_matched")
dev_df_mismatched = load_pandas_df(local_cache_path=CACHE_DIR, file_split="dev_mismatched")

In [6]:
# Drop examples labeled '-', which have no consensus gold label.
dev_df_matched = dev_df_matched.loc[dev_df_matched['gold_label'] != '-']
dev_df_mismatched = dev_df_mismatched.loc[dev_df_mismatched['gold_label'] != '-']

In [7]:
print("Training dataset size: {}".format(train_df.shape[0]))
print("Development (matched) dataset size: {}".format(dev_df_matched.shape[0]))
print("Development (mismatched) dataset size: {}".format(dev_df_mismatched.shape[0]))
print()
print(train_df[['gold_label', 'sentence1', 'sentence2']].head())

Training dataset size: 392702
Development (matched) dataset size: 9815
Development (mismatched) dataset size: 9832

   gold_label                                          sentence1  \
0     neutral  Conceptually cream skimming has two basic dime...   
1  entailment  you know during the season and i guess at at y...   
2  entailment  One of our number will carry out your instruct...   
3  entailment  How do you know? All this is their information...   
4     neutral  yeah i tell you what though if you go price so...   

                                           sentence2  
0  Product and geography are what make cream skim...  
1  You lose the things to the following level if ...  
2  A member of my team will execute your orders w...  
3                  This information belongs to them.  
4           The tennis shoes have a range of prices.  


In [9]:
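# Pass RANDOM_SEED so the sampled subsets are reproducible across runs.
train_df = train_df.sample(frac=TRAIN_DATA_USED_FRACTION, random_state=RANDOM_SEED).reset_index(drop=True)
dev_df_matched = dev_df_matched.sample(frac=DEV_DATA_USED_FRACTION, random_state=RANDOM_SEED).reset_index(drop=True)
dev_df_mismatched = dev_df_mismatched.sample(frac=DEV_DATA_USED_FRACTION, random_state=RANDOM_SEED).reset_index(drop=True)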

In [10]:
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_df[LABEL_COL])
train_df[LABEL_COL_NUM] = train_labels 
num_labels = len(np.unique(train_labels))
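As a rough illustration of what the label encoding step produces, the sketch below reproduces the mapping in plain Python: scikit-learn's `LabelEncoder` sorts the distinct classes and replaces each label with its index in that order. The label list here is made up for the example.

```python
# Made-up labels standing in for the gold_label column.
labels = ["neutral", "entailment", "entailment", "contradiction", "neutral"]

# Sorted distinct classes define the index assigned to each label.
classes = sorted(set(labels))
label_to_index = {label: index for index, label in enumerate(classes)}

encoded = [label_to_index[label] for label in labels]
decoded = [classes[index] for index in encoded]  # mirrors inverse_transform

print(classes)  # ['contradiction', 'entailment', 'neutral']
print(encoded)  # [2, 1, 1, 0, 2]
```

Because the mapping is stored on the encoder, the same object can later turn predicted indices back into label strings for evaluation.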

## Tokenize and Preprocess
Before training, we tokenize the sentence texts and convert them to lists of tokens. The following cell instantiates a `Processor` with the tokenizer that matches `MODEL_NAME`, and tokenizes the text of the training and development sets.

In [11]:
processor = Processor(model_name=MODEL_NAME, cache_dir=CACHE_DIR, to_lower=TO_LOWER)
train_dataloader = processor.create_dataloader_from_df(
    df=train_df,
    text_col=TEXT_COL_1,
    label_col=LABEL_COL_NUM,
    shuffle=True,
    text2_col=TEXT_COL_2,
    max_len=MAX_SEQ_LENGTH,
    batch_size=BATCH_SIZE,
)
dev_dataloader_matched = processor.create_dataloader_from_df(
    df=dev_df_matched,
    text_col=TEXT_COL_1,
    shuffle=False,
    text2_col=TEXT_COL_2,
    max_len=MAX_SEQ_LENGTH,
    batch_size=BATCH_SIZE,
)
dev_dataloader_mismatched = processor.create_dataloader_from_df(
    df=dev_df_mismatched,
    text_col=TEXT_COL_1,
    shuffle=False,
    text2_col=TEXT_COL_2,
    max_len=MAX_SEQ_LENGTH,
    batch_size=BATCH_SIZE,
)

I1110 19:14:11.376676 140117887072000 tokenization_utils.py:373] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at ./temp/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
100%|██████████| 392702/392702 [03:48<00:00, 1715.17it/s]
100%|██████████| 9815/9815 [00:05<00:00, 1797.48it/s]
100%|██████████| 9832/9832 [00:05<00:00, 1709.69it/s]


In addition to tokenization, the cell above performs the following preprocessing steps:

* Convert the tokens into token indices corresponding to the tokenizer's vocabulary
* Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence
* Pad or truncate the token lists to the specified max length
* Return mask lists that indicate which positions are padding
* Return token type id lists that indicate which sentence each token belongs to

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
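The preprocessing steps listed above can be sketched as follows. This is a toy version, not the code that `Processor` runs: `encode_pair`, the tiny vocabulary, and the truncation rule (trim the longer sentence first) are assumptions made for illustration, and `max_len` is assumed to be at least 3.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_len, pad_id=0, unk_id=1):
    """Toy encoder returning (input_ids, attention_mask, token_type_ids)."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Truncate: reserve 3 slots for [CLS] and two [SEP] tokens, then trim
    # the longer sentence (the second one on ties) until the pair fits.
    while len(tokens_a) + len(tokens_b) > max_len - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    # Add special tokens and record which sentence each token belongs to.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Convert tokens to vocabulary indices; unknown tokens map to unk_id.
    input_ids = [vocab.get(token, unk_id) for token in tokens]

    # Pad to max_len; the mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding
    token_type_ids += [0] * padding
    return input_ids, attention_mask, token_type_ids


# Hypothetical miniature vocabulary for the example.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "shoes": 4, "cost": 5, "money": 6}
sentence_a = ["shoes", "cost", "money"]
sentence_b = ["shoes", "cost"]

padded = encode_pair(sentence_a, sentence_b, vocab, max_len=10)
truncated = encode_pair(sentence_a, sentence_b, vocab, max_len=6)
print(padded)
print(truncated)
```

All three lists always have length `max_len`, which is what lets examples of different lengths be stacked into fixed-size batches.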

## Train and Predict

### Create Classifier

In [None]:
classifier = SequenceClassifier(
    model_name=MODEL_NAME, num_labels=num_labels, cache_dir=CACHE_DIR
)

### Train Classifier

In [None]:
with Timer() as t:
    classifier.fit(
        train_dataloader,
        num_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        warmup_steps=WARMUP_STEPS,
    )

print("Training time : {:.3f} hrs".format(t.interval / 3600))

### Predict on Development Data

In [14]:
with Timer() as t:
    predictions_matched = classifier.predict(dev_dataloader_matched)
print("Prediction time : {:.3f} hrs".format(t.interval / 3600))

Evaluating: 100%|██████████| 614/614 [04:53<00:00,  2.12it/s]

Prediction time : 0.082 hrs





In [15]:
with Timer() as t:
    predictions_mismatched = classifier.predict(dev_dataloader_mismatched)
print("Prediction time : {:.3f} hrs".format(t.interval / 3600))

Evaluating: 100%|██████████| 615/615 [04:53<00:00,  2.12it/s]

Prediction time : 0.082 hrs





## Evaluate

In [16]:
predictions_matched = label_encoder.inverse_transform(predictions_matched)
print(classification_report(dev_df_matched[LABEL_COL], predictions_matched, digits=3))

               precision    recall  f1-score   support

contradiction      0.872     0.894     0.883      3213
   entailment      0.913     0.862     0.887      3479
      neutral      0.813     0.842     0.828      3123

    micro avg      0.866     0.866     0.866      9815
    macro avg      0.866     0.866     0.866      9815
 weighted avg      0.868     0.866     0.867      9815



In [17]:
predictions_mismatched = label_encoder.inverse_transform(predictions_mismatched)
print(classification_report(dev_df_mismatched[LABEL_COL], predictions_mismatched, digits=3))

               precision    recall  f1-score   support

contradiction      0.891     0.888     0.889      3240
   entailment      0.899     0.862     0.880      3463
      neutral      0.810     0.850     0.830      3129

    micro avg      0.867     0.867     0.867      9832
    macro avg      0.867     0.867     0.866      9832
 weighted avg      0.868     0.867     0.867      9832
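For readers who want to see how the per-class numbers in these reports are derived, here is a minimal plain-Python sketch of precision, recall, and F1, plus the macro average. `per_class_scores` is a hypothetical helper, and the label lists are made up rather than taken from the model's predictions.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up gold labels and predictions for illustration.
y_true = ["entailment", "neutral", "contradiction",
          "neutral", "entailment", "contradiction"]
y_pred = ["entailment", "entailment", "contradiction",
          "neutral", "entailment", "neutral"]

scores = per_class_scores(y_true, y_pred)
# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (precision, recall, f1) in scores.items():
    print("{:>13}  {:.3f}  {:.3f}  {:.3f}".format(label, precision, recall, f1))
print("macro F1: {:.3f}".format(macro_f1))
```

The weighted average in the reports differs from the macro average only in that each class is weighted by its support, the number of gold examples with that label.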



## Compare Model Performance

|Model name|Training time|Scoring time|Matched F1|Mismatched F1|
|:--------:|:-----------:|:----------:|:--------:|:-----------:|
|xlnet-large-cased|5.15 hrs|0.11 hrs|0.887|0.890|
|bert-large-cased|4.01 hrs|0.08 hrs|0.867|0.867|