*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Text Classification of SST-2 Sentences using BERT

# Before You Start

> **Tip**: If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`**. This will run the notebook on a small subset of the data and use a smaller number of epochs.

If you run into a CUDA out-of-memory error or the Jupyter kernel keeps dying, try reducing `BATCH_SIZE` and `MAX_LEN`, but note that model performance may suffer.

In [1]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = True

In [2]:
import sys
sys.path.append("../../")
import os
import json
import pandas as pd
import numpy as np
import scrapbook as sb
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn

from utils_nlp.dataset.multinli import load_pandas_df 
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.common.timer import Timer


In [3]:
from interpret_text.msra.MSRAExplainer import MSRAExplainer

## Introduction
In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on a subset of the [SST-2](https://nlp.stanford.edu/sentiment/index.html/) dataset.

We use a [sequence classifier](https://github.com/microsoft/nlp/blob/master/utils_nlp/models/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert).

### Set parameters
Here we set some parameters that we use for our modeling task.

In [31]:
TRAIN_DATA_FRACTION = 1
TEST_DATA_FRACTION = 1
NUM_EPOCHS = 1

if QUICK_RUN:
    TRAIN_DATA_FRACTION = 0.01
    TEST_DATA_FRACTION = 0.01
    NUM_EPOCHS = 1

if torch.cuda.is_available():
    BATCH_SIZE = 32
else:
    BATCH_SIZE = 8

DATA_FOLDER = "./temp/sst2"
BERT_CACHE_DIR = "./temp/sst2"
LANGUAGE = Language.ENGLISH
TO_LOWER = True
MAX_LEN = 150
BATCH_SIZE_PRED = 512
TRAIN_SIZE = 0.6
LABEL_COL = "labels" 
TEXT_COL = "sentences"

## Read Dataset
The SST-2 dataset is a collection of sentences from Rotten Tomatoes movie reviews, used mainly for sentiment classification: the inputs are sentences and the labels are binary (positive or negative) sentiment indicators.

We start by loading the training and test sets. The `load_data` function below reads the `stsa.binary.train` and `stsa.binary.test` files from `DATA_FOLDER`; it does not download them, so they must already be in that folder. When `QUICK_RUN` is set, we then keep only a small random sample of each set.

In [32]:
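def load_data(fpath):
    """Read a file with one "<label> <sentence>" example per line into a DataFrame."""
    df_dict = {LABEL_COL: [], TEXT_COL: []}
    with open(fpath, 'r') as f:
        label_start = 0     # the label is the first character of the line
        sentence_start = 2  # the sentence starts after the label and one space
        for line in f:
            label = int(line[label_start])
            sentence = line[sentence_start:]
            # use the column-name constants so the keys match df_dict above
            df_dict[LABEL_COL].append(label)
            df_dict[TEXT_COL].append(sentence)
    return pd.DataFrame.from_dict(df_dict)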

df_train = load_data(os.path.join(DATA_FOLDER, 'stsa.binary.train'))
df_test = load_data(os.path.join(DATA_FOLDER, 'stsa.binary.test'))


if QUICK_RUN:
    df_train = df_train.sample(frac=TRAIN_DATA_FRACTION).reset_index(drop=True)
    df_test = df_test.sample(frac=TEST_DATA_FRACTION).reset_index(drop=True)

These are the two classes in the dataset, where "1" corresponds to a positive review and "0" corresponds to a negative review. We don't need to encode the labels as they are already integers.
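The line layout that `load_data` relies on can be illustrated without the real files. The following is a minimal sketch, assuming each line is a single-digit label, one space, and then the sentence; the two sample lines and the helper `parse_line` are made up for illustration and are not part of the dataset or of `utils_nlp`.

```python
import io

# Two made-up lines in the assumed "<label> <sentence>" layout.
sample = io.StringIO(
    "1 a charming and often affecting journey .\n"
    "0 a dull , lifeless film .\n"
)

def parse_line(line):
    """Split one line into (label, sentence); the label is the first character."""
    label = int(line[0])
    # Unlike load_data, this sketch drops the trailing newline.
    sentence = line[2:].rstrip("\n")
    return label, sentence

rows = [parse_line(line) for line in sample]
print(rows)
```

Note that `load_data` keeps the trailing newline of each sentence, which is why `\n` shows up in the printed examples.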

In [33]:
# display stats and examples for label types
print(df_train[[LABEL_COL, TEXT_COL]].head())
print(df_train[LABEL_COL].value_counts())

# create training and testing labels
labels_train = df_train[LABEL_COL]
labels_test = df_test[LABEL_COL]

   labels                                          sentences
0       0  i have not been this disappointed by a movie i...
1       1  the way the roundelay of partners functions , ...
2       0  by no means a slam-dunk and sure to ultimately...
3       0  the movie 's biggest shocks come from seeing f...
4       1     cool gadgets and creatures keep this fresh .\n
0    41
1    28
Name: labels, dtype: int64


In [34]:
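# number of distinct classes; also used later to size the classification layer
num_labels = df_train[LABEL_COL].nunique()
print("Number of unique labels: {}".format(num_labels))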
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of unique labels: 2
Number of training examples: 69
Number of testing examples: 18


## Tokenize and Preprocess

Before we start training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a `BERT tokenizer` given the language, and tokenize the text of the training and testing sets.

In [35]:
tokenizer = Tokenizer(LANGUAGE, to_lower=TO_LOWER, cache_dir=BERT_CACHE_DIR)

tokens_train = tokenizer.tokenize(list(df_train[TEXT_COL]))
tokens_test = tokenizer.tokenize(list(df_test[TEXT_COL]))

print(tokens_train)


[['i', 'have', 'not', 'been', 'this', 'disappointed', 'by', 'a', 'movie', 'in', 'a', 'long', 'time', '.'], ['the', 'way', 'the', 'round', '##ela', '##y', 'of', 'partners', 'functions', ',', 'and', 'the', 'inter', '##play', 'within', 'partnerships', 'and', 'among', 'partnerships', 'and', 'the', 'general', 'air', 'of', 'ga', '##tor', '-', 'bash', '##ing', 'are', 'consistently', 'delightful', '.'], ['by', 'no', 'means', 'a', 'slam', '-', 'dun', '##k', 'and', 'sure', 'to', 'ultimately', 'di', '##sa', '##pp', '##oint', 'the', 'action', 'fans', 'who', 'will', 'be', 'moved', 'to', 'the', 'edge', 'of', 'their', 'seats', 'by', 'the', 'dynamic', 'first', 'act', ',', 'it', 'still', 'comes', 'off', 'as', 'a', 'touching', ',', 'trans', '##cend', '##ent', 'love', 'story', '.'], ['the', 'movie', "'", 's', 'biggest', 'shocks', 'come', 'from', 'seeing', 'former', 'ny', '##mp', '##hett', '##e', 'juliette', 'lewis', 'playing', 'a', 'salt', '-', 'of', '-', 'the', '-', 'earth', 'mommy', 'named', 'minnie', 
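The `##` prefixes in the output mark sub-word pieces: a word that is not in the vocabulary as a whole is split into smaller pieces that are. A minimal sketch of the greedy longest-match-first idea behind this kind of WordPiece splitting is shown below. The function `wordpiece` and the tiny `toy_vocab` are hypothetical and only for illustration; the real tokenizer uses BERT's full vocabulary and also handles casing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen to mirror pieces seen in the output above.
toy_vocab = {"round", "##ela", "##y", "inter", "##play", "movie"}
print(wordpiece("roundelay", toy_vocab))  # ['round', '##ela', '##y']
print(wordpiece("interplay", toy_vocab))  # ['inter', '##play']
print(wordpiece("movie", toy_vocab))      # ['movie']
```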




In addition, we perform the following preprocessing steps in the cell below:
- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary
- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence, respectively
- Pad or truncate the token lists to the specified max length. In this case, `MAX_LEN = 150`
- Return mask lists that indicate the paddings' positions
- Return token type id lists that indicate which sentence the tokens belong to (not needed for one-sequence classification)

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
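The preprocessing steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the `utils_nlp` implementation: the function `preprocess`, the `toy_vocab` and its ids are hypothetical, and the mask follows the usual BERT convention of 1 for real tokens and 0 for padding.

```python
def preprocess(tokens, vocab, max_len):
    """Add special tokens, map to ids, then pad or truncate to max_len."""
    # Reserve two positions for [CLS] and [SEP]; longer inputs are truncated.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    mask = [1] * len(ids)  # 1 marks a real token
    pad = max_len - len(ids)
    # Padding positions get the [PAD] id and a mask value of 0.
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad

# Hypothetical vocabulary; the ids in the real BERT vocabulary differ.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "cool": 4, "gadgets": 5}

ids, mask = preprocess(["cool", "gadgets", "rock"], toy_vocab, max_len=8)
print(ids)   # [2, 4, 5, 1, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every sequence comes out with exactly `max_len` entries, so a smaller `MAX_LEN` shortens every input and reduces memory use, at the cost of truncating longer sentences.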

In [36]:
tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(tokens_train, MAX_LEN)
tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(tokens_test, MAX_LEN)

## Sequence Classifier Model
Next, we use a sequence classifier that loads a pre-trained BERT model, given the language and number of labels.

In [37]:
classifier = BERTSequenceClassifier(language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR)

## Train Model
We train the classifier using the training set. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:

In [39]:
with Timer() as t:
    classifier.fit(token_ids=tokens_train,
                    input_mask=mask_train,
                    labels=labels_train,    
                    num_epochs=NUM_EPOCHS,
                    batch_size=BATCH_SIZE,    
                    verbose=True)    
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))



epoch:1/1; batch:1->1/9; average training loss:0.691921
epoch:1/1; batch:2->2/9; average training loss:0.696667
epoch:1/1; batch:3->3/9; average training loss:0.645818
epoch:1/1; batch:4->4/9; average training loss:0.680287
epoch:1/1; batch:5->5/9; average training loss:0.709238
epoch:1/1; batch:6->6/9; average training loss:0.721394
epoch:1/1; batch:7->7/9; average training loss:0.709737
epoch:1/1; batch:8->8/9; average training loss:0.700559
epoch:1/1; batch:9->9/9; average training loss:0.701680

[Training time: 16.656 hrs]
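One way to read the logged training loss is to compare it with the loss of a classifier that has not learned anything yet. The sketch below assumes the classifier is trained with standard softmax cross-entropy, which the notebook does not state explicitly; the helpers `softmax` and `cross_entropy` are written here for illustration. Under that assumption, a model that assigns probability 0.5 to each of the two classes has a loss of ln 2 ≈ 0.693, which is close to the values logged in this quick run.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

# Equal scores for both classes: the model is undecided.
chance_loss = cross_entropy(softmax([0.0, 0.0]), label=1)
print(round(chance_loss, 4))  # 0.6931

# A confident and correct prediction gives a much lower loss.
confident_loss = cross_entropy(softmax([2.0, 0.0]), label=0)
print(round(confident_loss, 4))  # 0.1269
```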


## Score Model
We score the test set using the trained classifier:

In [40]:
preds = classifier.predict(token_ids=tokens_test, 
                           input_mask=mask_test, 
                           batch_size=BATCH_SIZE_PRED)




## Evaluate Model
Finally, we compute the overall accuracy, precision, recall, and F1 metrics on the test set. We also look at the metrics for each of the two classes in the dataset.

In [46]:
# no label encoder is used in this notebook; without target_names the report
# is keyed by the label strings "0" and "1", so no renaming is needed
report = classification_report(labels_test, preds, output_dict=True)
accuracy = accuracy_score(labels_test, preds)

print("accuracy: {}".format(accuracy))
print(json.dumps(report, indent=4, sort_keys=True))

accuracy: 0.7222222222222222
{
    "0": {
        "f1-score": 0.8387096774193548,
        "precision": 0.7222222222222222,
        "recall": 1.0,
        "support": 13
    },
    "1": {
        "f1-score": 0.0,
        "precision": 0.0,
        "recall": 0.0,
        "support": 5
    },
    "macro avg": {
        "f1-score": 0.4193548387096774,
        "precision": 0.3611111111111111,
        "recall": 0.5,
        "support": 18
    },
    "micro avg": {
        "f1-score": 0.7222222222222222,
        "precision": 0.7222222222222222,
        "recall": 0.7222222222222222,
        "support": 18
    },
    "weighted avg": {
        "f1-score": 0.6057347670250895,
        "precision": 0.5216049382716049,
        "recall": 0.7222222222222222,
        "support": 18
    }
}
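The per-class numbers in the report can be reproduced by hand. A recall of 1.0 for class 0 together with a recall of 0.0 for class 1 is consistent with a model that predicted class 0 for all 18 test sentences, which is plausible after one epoch on 69 examples. The sketch below rebuilds the metrics from that assumed prediction vector; `per_class_metrics` is a helper written for illustration, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 for one class, with 0.0 when a ratio is undefined."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Supports taken from the report: 13 negative and 5 positive test sentences.
y_true = [0] * 13 + [1] * 5
# Assumption: every sentence was predicted as class 0.
y_pred = [0] * 18

p0, r0, f0 = per_class_metrics(y_true, y_pred, cls=0)
p1, r1, f1 = per_class_metrics(y_true, y_pred, cls=1)
macro_f1 = (f0 + f1) / 2

print(p0, r0, f0)  # precision 13/18, recall 1.0, F1 about 0.8387
print(p1, r1, f1)  # all 0.0: class 1 was never predicted
print(macro_f1)    # about 0.4194
```

The accuracy of 0.722 therefore equals the share of negative sentences in this small test sample, so the macro-averaged scores are the more informative numbers here.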


In [47]:
# for testing
sb.glue("accuracy", accuracy)
sb.glue("precision", report["macro avg"]["precision"])
sb.glue("recall", report["macro avg"]["recall"])
sb.glue("f1", report["macro avg"]["f1-score"])

## Explain Model

In [48]:
device = torch.device("cpu" if not torch.cuda.is_available() else "cuda")

classifier.model.to(device)
for param in classifier.model.parameters():
    param.requires_grad = False
classifier.model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediat

In [49]:
interpreter_msra = MSRAExplainer(model=classifier.model, 
                                 train_dataset=list(df_train[TEXT_COL]), 
                                 device=device, 
                                 target_layer=14)

In [50]:
text = df_test[TEXT_COL][1]
label = df_test[LABEL_COL][1]
print(text, label)

mr. wedge and mr. saldanha handle the mix of verbal jokes and slapstick well .
 1


In [None]:
explanation_msra = interpreter_msra.explain_local(text)




## Visualize Explanation

In [None]:
interpreter_msra.visualize(text)