*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Text Classification of MultiNLI Sentences using PyTorch Transformers

In [1]:
# Import packages
import os
import sys
import json 
import pandas as pd
import numpy as np
import torch

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

from utils_nlp.dataset.multinli import load_pandas_df
from utils_nlp.models.transformers.sequence_classification import Processor, SequenceClassifier 
from utils_nlp.common.timer import Timer

In [2]:
# Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs
QUICK_RUN = True

## Introduction 
## [TODO] - Modify for the final model
This notebook fine-tunes and evaluates a pretrained transformer model (`roberta-base`, as set by `TARGET_MODEL` below) on a subset of the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset.

We use a [sequence classifier](../../utils_nlp/models/transformers/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of pretrained transformer models such as [BERT](https://github.com/google-research/bert), [XLNet](https://arxiv.org/pdf/1906.08237.pdf) and RoBERTa.

In [14]:
TRAIN_DATA_FRACTION = 1
TEST_DATA_FRACTION = 1
NUM_EPOCHS = 3

if QUICK_RUN:
    TRAIN_DATA_FRACTION = 0.01
    TEST_DATA_FRACTION = 0.01
    NUM_EPOCHS = 1

BATCH_SIZE = 32 if torch.cuda.is_available() else 8
DATA_FOLDER = "./temp"
MODEL_CACHE_DIR = "./temp"
TO_LOWER = True
MAX_LEN = 150
BATCH_SIZE_PRED = 512
TRAIN_SIZE = 0.6
LABEL_COL = "genre"
TEXT_COL = "sentence1" 
TARGET_MODEL = "roberta-base"

### [TODO] - Remove Workflow overview

```
# Pick a supported model and count the distinct labels
model_name = SequenceClassifier.list_supported_models()[0]
num_labels = len(label_encoder.classes_)

# Tokenize the training text and build a dataset
processor = Processor(model_name=model_name, cache_dir=MODEL_CACHE_DIR)
ds = processor.preprocess(train_text, train_data_labels, max_len=MAX_LEN)

# Create the classifier and fine-tune it
classifier = SequenceClassifier(
    model_name=model_name, num_labels=num_labels, cache_dir=MODEL_CACHE_DIR
)
classifier.fit(ds, device="cuda", num_epochs=1, batch_size=32, num_gpus=None)
```

## Read Dataset

Let's start by loading a subset of the data.  

The following function downloads and extracts the files if they don't already exist in the data folder.

The MultiNLI dataset is mainly used for natural language inference (NLI) tasks, where the inputs are sentence pairs and the labels are entailment indicators. The sentence pairs are also classified into *genres* that allow for more coverage and better evaluation of NLI models.

In [15]:
df = load_pandas_df(DATA_FOLDER, "train")

## Quick Analysis of Data  

Let's observe our dataset to see what we are working with.  
For our classification task, we use the first sentence only as the text input, and the corresponding genre as the label. We select the examples corresponding to one of the entailment labels (*neutral* in this case) to avoid duplicate rows, as the sentences are not unique, whereas the sentence pairs are.

In [16]:
df = df[df["gold_label"] == "neutral"] # Get unique sentences
df[[LABEL_COL, TEXT_COL]].head()

| | genre | sentence1 |
|---|---|---|
| 0 | government | Conceptually cream skimming has two basic dime... |
| 4 | telephone | yeah i tell you what though if you go price so... |
| 6 | travel | But a few Christian mosaics survive above the ... |
| 12 | slate | It's not that the questions they asked weren't... |
| 13 | travel | Thebes held onto power until the 12th Dynasty,... |
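
To see why filtering on a single entailment label removes duplicate sentences, here is a small plain-Python sketch. The rows are made up for illustration (they are not taken from MultiNLI), `has_duplicates` is a hypothetical helper, and the sketch assumes each first sentence appears at most once per label.

```python
# Toy rows (made up): each first sentence is paired with several hypotheses,
# one per entailment label, so sentence1 repeats across rows.
rows = [
    {"sentence1": "The museum opens at nine.", "gold_label": "entailment", "genre": "travel"},
    {"sentence1": "The museum opens at nine.", "gold_label": "neutral", "genre": "travel"},
    {"sentence1": "The museum opens at nine.", "gold_label": "contradiction", "genre": "travel"},
    {"sentence1": "yeah i guess so", "gold_label": "entailment", "genre": "telephone"},
    {"sentence1": "yeah i guess so", "gold_label": "neutral", "genre": "telephone"},
]


def has_duplicates(items):
    """Return True if any value occurs more than once."""
    return len(items) != len(set(items))


all_sentences = [r["sentence1"] for r in rows]
neutral_sentences = [r["sentence1"] for r in rows if r["gold_label"] == "neutral"]

print(has_duplicates(all_sentences))      # True: sentences repeat across labels
print(has_duplicates(neutral_sentences))  # False: one row per sentence remains
```

On the real DataFrame, a similar check could be done with `df[TEXT_COL].duplicated().any()` before and after the filter.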


The examples in the dataset are grouped into 5 genres, with the counts shown below:

In [17]:
df[LABEL_COL].value_counts()

telephone     27783
government    25784
travel        25783
fiction       25782
slate         25768
Name: genre, dtype: int64
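
The counts above suggest the genres are roughly balanced. The following sketch, which only uses the numbers printed above, computes each genre's share and the accuracy a trivial classifier would get by always predicting the most frequent genre. The variable names are illustrative.

```python
# Genre counts copied from the value_counts() output above.
genre_counts = {
    "telephone": 27783,
    "government": 25784,
    "travel": 25783,
    "fiction": 25782,
    "slate": 25768,
}

total = sum(genre_counts.values())
shares = {genre: count / total for genre, count in genre_counts.items()}
majority_genre = max(shares, key=shares.get)

for genre, share in shares.items():
    print(f"{genre:<12}{share:.1%}")

# Always predicting the most frequent genre gives this accuracy.
print(f"Majority-class baseline ({majority_genre}): {shares[majority_genre]:.1%}")
```

A fine-tuned classifier should score well above this baseline to be considered useful.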

### Train/Test Data Split 
Using scikit-learn's `train_test_split` (from `sklearn.model_selection`), we split the MNLI dataset into training and testing sets. Depending on the `QUICK_RUN` flag, we then sample a fraction of each set for our model.

In [18]:
df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state = 0)
df_train = df_train.sample(frac=TRAIN_DATA_FRACTION).reset_index(drop=True)
df_test = df_test.sample(frac=TEST_DATA_FRACTION).reset_index(drop=True)
train_text = df_train[TEXT_COL]



### Encode the labels into numeric values
`LabelEncoder` encodes the categorical dataset labels as integers between `0` and `n_classes - 1`, where `n_classes` is the number of distinct labels.

In [19]:
# encode labels
label_encoder = LabelEncoder()
train_data_labels = label_encoder.fit_transform(df_train[LABEL_COL])
# Reuse the mapping fitted on the training labels instead of re-fitting on the test set
test_data_labels = label_encoder.transform(df_test[LABEL_COL])

# Count unique encoded labels
num_labels = len(np.unique(train_data_labels))
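
As an illustration of what a label encoder learns, here is a plain-Python sketch that mimics the sorted-classes convention. It is not scikit-learn's implementation; `fit_classes` and `encode` are hypothetical helpers and the label lists are made up. It shows why the mapping is best learned once on the training labels and then reused: re-fitting on a test set that lacks some genres shifts the ids.

```python
def fit_classes(labels):
    """Learn the class list: the distinct labels in sorted order."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map each label to its index in `classes` (0 .. n_classes - 1)."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


train_labels = ["travel", "fiction", "slate", "government", "telephone", "travel"]
test_labels = ["travel", "slate", "telephone"]  # "fiction" and "government" are absent

classes = fit_classes(train_labels)
consistent_ids = encode(test_labels, classes)
refitted_ids = encode(test_labels, fit_classes(test_labels))

print(classes)         # ['fiction', 'government', 'slate', 'telephone', 'travel']
print(consistent_ids)  # [4, 2, 3]  (same ids the model saw during training)
print(refitted_ids)    # [2, 0, 1]  (ids shift when the mapping is re-fitted)
```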

In [20]:
print(f"Number of unique labels: {num_labels}")
print(f"Number of training examples: {df_train.shape[0]}")
print(f"Number of testing examples: {df_test.shape[0]}")

Number of unique labels: 5
Number of training examples: 785
Number of testing examples: 524
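
The example counts printed above can be reproduced from the genre counts and the `QUICK_RUN` settings. This sketch assumes that the split gives the training set `floor(TRAIN_SIZE * n)` rows and the test set the remainder, and that `sample(frac=...)` keeps `round(frac * len)` rows. Both rounding rules are assumptions here, not something the notebook states.

```python
import math

TRAIN_SIZE = 0.6
TRAIN_DATA_FRACTION = 0.01  # QUICK_RUN value
TEST_DATA_FRACTION = 0.01   # QUICK_RUN value

# Rows left after keeping only the "neutral" examples (sum of the genre counts).
n_total = 27783 + 25784 + 25783 + 25782 + 25768

# Assumed rule: the training split takes floor(TRAIN_SIZE * n) rows, the test split the rest.
n_train_full = math.floor(TRAIN_SIZE * n_total)
n_test_full = n_total - n_train_full

# Assumed rule: sampling a fraction keeps round(frac * len) rows.
n_train = round(TRAIN_DATA_FRACTION * n_train_full)
n_test = round(TEST_DATA_FRACTION * n_test_full)

print(n_train_full, n_test_full)  # 78540 52360
print(n_train, n_test)            # 785 524
```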



### Preprocess Data for Training

Before training a model, the text must be tokenized and converted to lists of tokens. The remaining steps are:
1. Initialize a `Processor` for the RoBERTa model, which prepares and tokenizes the data
1. Create a dataset using the initialized processor
1. Initialize a sequence classifier
1. Fit the newly created classifier model

In [22]:
supported_models = SequenceClassifier.list_supported_models()
assert TARGET_MODEL in supported_models, f"Unfortunately {TARGET_MODEL} is not currently supported"
processor = Processor(model_name=TARGET_MODEL, cache_dir=MODEL_CACHE_DIR)
train_dataset = processor.preprocess(text=train_text, labels=train_data_labels, max_len=MAX_LEN)
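
To give a feel for what preprocessing to a fixed `max_len` involves, here is a toy sketch. It is not the `Processor` implementation: it splits on whitespace instead of using the model's subword tokenizer, and `encode_text` is a hypothetical helper. The `<s>`, `</s>` and `<pad>` strings follow RoBERTa's special-token naming.

```python
def encode_text(text, max_len, to_lower=True):
    """Toy preprocessing: whitespace tokens, special tokens, truncation, padding."""
    tokens = (text.lower() if to_lower else text).split()
    tokens = tokens[: max_len - 2]             # leave room for <s> and </s>
    tokens = ["<s>"] + tokens + ["</s>"]
    attention_mask = [1] * len(tokens)         # 1 marks a real token
    n_pad = max_len - len(tokens)
    return tokens + ["<pad>"] * n_pad, attention_mask + [0] * n_pad  # 0 marks padding


tokens, mask = encode_text("But a few Christian mosaics survive", max_len=10)
print(tokens)
print(mask)

# A long input is truncated so that every example has exactly max_len positions.
long_tokens, long_mask = encode_text(" ".join(["word"] * 20), max_len=10)
print(len(long_tokens), sum(long_mask))
```

Fixed-length inputs with an attention mask let examples be stacked into batches while the model ignores the padded positions.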

### Create Model

Now, we create a sequence classifier that loads a pretrained RoBERTa model, configured with the number of labels.

In [23]:
classifier = SequenceClassifier(model_name=TARGET_MODEL, num_labels=num_labels, cache_dir=MODEL_CACHE_DIR)

### Train Model

We train the classifier using the training examples from MNLI. This involves fine-tuning the transformer and a linear classification layer on top of it.
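
As a rough picture of the linear classification layer, the sketch below maps a sentence vector to one score per genre, turns the scores into probabilities with a softmax, and computes a cross-entropy loss. All numbers are made up, the 3-dimensional vector stands in for the transformer output, and the use of cross-entropy is an assumption about the training objective rather than something this notebook states.

```python
import math


def linear_head(features, weights, bias):
    """logits[k] = sum_j weights[k][j] * features[j] + bias[k]."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Convert logits to probabilities (max subtracted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])


features = [0.5, -1.0, 2.0]  # made-up sentence vector
weights = [                  # one row per genre (5 genres x 3 features)
    [0.2, 0.0, 0.1],
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
]
bias = [0.0] * 5

logits = linear_head(features, weights, bias)
probs = softmax(logits)
predicted = probs.index(max(probs))
loss = cross_entropy(probs, target=2)

print(predicted)       # index of the highest-scoring genre
print(round(loss, 3))  # loss if the true genre is index 2
```

During fine-tuning, the weights of both this layer and the transformer underneath are updated to reduce the loss over the training examples.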

In [24]:
with Timer() as t:
    classifier.fit(train_dataset)
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/13 [00:00<?, ?it/s]
A sequence with no special tokens has been passed to the RoBERTa model. This model requires special tokens in order to work. Please specify add_special_tokens=True in your encoding.
Iteration:  92%|█████████▏| 12/13 [00:24<00:02,  2.00s/it]
Epoch: 100%|██████████| 1/1 [00:26<00:00, 26.16s/it]

[Training time: 0.007 hrs]
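
For readers unfamiliar with timing context managers, here is a minimal stand-in that exposes an `interval` attribute in seconds. `SimpleTimer` is a hypothetical class written for illustration and is not the `utils_nlp` `Timer` implementation. The last line shows how a duration of about 26 seconds ends up printed as `0.007` hours.

```python
import time


class SimpleTimer:
    """Hypothetical context manager that records elapsed seconds in `.interval`."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.interval = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


with SimpleTimer() as t:
    sum(range(100_000))  # placeholder workload

print("[Elapsed: {:.6f} s]".format(t.interval))

# Converting seconds to hours, as in the training cell: 26.16 s is about 0.007 hrs.
hours_text = "{:.3f}".format(26.16 / 3600)
print(hours_text)  # 0.007
```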



