Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Sentiment analysis using BERT

In this notebook, we fine-tune a pre-trained [BERT](https://arxiv.org/abs/1810.04805) model to perform sentiment analysis on the [IMDb Large Movie Review](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) dataset. The notebook covers:
* Data preprocessing and tokenization
* Creating and training a model
* Scoring and evaluating results

## BERT basics

BERT is a language representation model that captures deep and subtle textual relationships in a corpus. In the [original paper](https://arxiv.org/abs/1810.04805), the authors demonstrate that BERT can be easily adapted to build state-of-the-art models for a number of NLP tasks, including text classification, named entity recognition, and question answering.

We fine-tune BERT for sentiment classification by wrapping a pre-trained BERT model with a [sequence classifier](../../utils_nlp/models/bert/sequence_classification.py). Although we use [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of [BERT](https://github.com/google-research/bert) in this notebook, the [AzureML introduction to BERT](https://github.com/microsoft/AzureML-BERT/blob/master/docs/bert-intro.md) provides a useful overview of BERT concepts.

## IMDb dataset

The [IMDb Large Movie Review](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) dataset is a binary sentiment classification dataset. It contains 25,000 training reviews and 25,000 test reviews, each labeled as positive or negative.

## Prerequisites

Follow the [setup instructions](../../SETUP.md) in this repo. Then run the following cell to verify that all required packages are installed.

In [1]:
import sys
sys.path.append("../../")
import os
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from utils_nlp.dataset.bert_sentiment import download_and_load_datasets
from utils_nlp.eval.classification import eval_classification
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.common.timer import Timer
import numpy as np

Set global constants.

In [2]:
BERT_CACHE_DIR = "../../../temp"
LANGUAGE = Language.ENGLISH 
TO_LOWER = True # All text to lowercase
MAX_LEN = 150 # Max number of tokens kept per movie review
BATCH_SIZE = 32
NUM_GPUS = 2 # Change to match your system
NUM_EPOCHS = 2
TRAIN_SIZE = 0.6
LABEL_COL = "polarity" # Positive or negative movie review
TEXT_COL = "sentence" # Text of the movie review

## Download data

Download the dataset and load the training and test sets into pandas DataFrames.

In [3]:
train, test = download_and_load_datasets()

=====> Begin downloading
=====> Done downloading
=====> Finish extracting
**** Dataset path: C:\Users\ducl\Documents\GitHub\nlp-2\scenarios\text_classification\data\aclImdb
===> Directory: C:\Users\ducl\Documents\GitHub\nlp-2\scenarios\text_classification\data\aclImdb\train


===> Complete train df
===> Directory: C:\Users\ducl\Documents\GitHub\nlp-2\scenarios\text_classification\data\aclImdb\test


===> Complete test df


## Inspect data

The IMDb dataset contains whole movie reviews. Each example is one review in paragraph form, made up of one or more sentences and stored in the `sentence` column. The `sentiment` column holds the reviewer's rating on a scale from 1 to 10. The `polarity` column holds the binary class: positive or negative.

In [4]:
train.head(5)

| | sentence | sentiment | polarity |
|---|---|---|---|
| 0 | I loved this movie. It is a definite inspirati... | 10 | 1 |
| 1 | Obviously, I didn't care for Things to Come (a... | 4 | 0 |
| 2 | Protégé runs in a linear fashion; expect no fa... | 3 | 0 |
| 3 | Utter dreck. I got to the 16 minute/27 second ... | 1 | 0 |
| 4 | Well, there you have it, another disillusion o... | 2 | 0 |
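The relationship between the two label columns can be sketched in a few lines. In the original dataset release, reviews rated 4 or lower are labeled negative and reviews rated 7 or higher are labeled positive; neutral ratings are not included. The helper name `rating_to_polarity` is hypothetical and is not part of `utils_nlp`; it only illustrates the convention.

```python
def rating_to_polarity(rating):
    """Map a 1-10 IMDb rating to a binary polarity label.

    Assumes the convention of the original dataset release:
    ratings <= 4 are negative (0), ratings >= 7 are positive (1),
    and neutral ratings (5-6) do not occur.
    """
    rating = int(rating)  # the sentiment column has dtype object, so values may be strings
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    raise ValueError("neutral rating {} is not expected in this dataset".format(rating))


# Ratings of the first five training rows shown above
print([rating_to_polarity(r) for r in ["10", "4", "3", "1", "2"]])
```

The printed labels match the `polarity` values of those five rows.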


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
sentence     25000 non-null object
sentiment    25000 non-null object
polarity     25000 non-null int64
dtypes: int64(1), object(2)
memory usage: 586.0+ KB


The training set is split evenly between the two polarity classes: 1 indicates a positive review and 0 indicates a negative review.

In [6]:
train.polarity.value_counts()

1    12500
0    12500
Name: polarity, dtype: int64

## Sample data

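By default, the cell below trains and tests on the full datasets (`frac=1.0`). To run a faster experiment, reduce the sample fraction, for example to 0.2. The outputs shown in the rest of this notebook come from a run with 5,000 examples in each set, which corresponds to a fraction of 0.2.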

In [7]:
df_train = train.sample(frac=1.0)
df_test = test.sample(frac=1.0)

## Label data

Encode the positive and negative class labels.

In [8]:
label_encoder = LabelEncoder()
labels_train = label_encoder.fit_transform(df_train[LABEL_COL])
labels_test = label_encoder.transform(df_test[LABEL_COL])

num_labels = len(np.unique(labels_train))
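As a rough illustration of what label encoding does, the sketch below maps each distinct class to an integer index in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes. The function names and the string labels are made up for the example; this is not the scikit-learn implementation.

```python
def fit_label_encoder(labels):
    """Return the sorted unique classes and a class -> index mapping."""
    classes = sorted(set(labels))
    return classes, {c: i for i, c in enumerate(classes)}


def transform_labels(labels, class_to_index):
    """Encode labels as indices; raises KeyError for a class not seen during fitting."""
    return [class_to_index[label] for label in labels]


# Made-up string labels, for illustration only
train_labels = ["pos", "neg", "neg", "pos"]
classes, class_to_index = fit_label_encoder(train_labels)
encoded = transform_labels(train_labels, class_to_index)

print(classes)       # sorted class names
print(encoded)       # integer codes
print(len(classes))  # number of labels
```

Because the `polarity` column already holds the integers 0 and 1, encoding it leaves the values unchanged; the step matters more for datasets with string labels.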

In [9]:
print("Number of unique labels: {}".format(num_labels))
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of unique labels: 2
Number of training examples: 5000
Number of testing examples: 5000


## Tokenize and preprocess

Before training, transform the text data into a format that BERT understands. 

First, instantiate a BERT tokenizer with a given language. We'll use this to tokenize the text of the training and testing sets.

In [10]:
tokenizer = Tokenizer(LANGUAGE, to_lower=TO_LOWER, cache_dir=BERT_CACHE_DIR)

tokens_train = tokenizer.tokenize(list(df_train[TEXT_COL]))
tokens_test = tokenizer.tokenize(list(df_test[TEXT_COL]))

100%|█████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:15<00:00, 322.83it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:15<00:00, 315.00it/s]
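BERT tokenizers split words into subword units with the WordPiece scheme: each word is matched greedily against the vocabulary, longest piece first, and continuation pieces carry a `##` prefix. The following is a minimal sketch of that idea with an invented toy vocabulary. It omits the punctuation splitting and text normalization that a real BERT tokenizer performs, and it is not the code behind the `Tokenizer` class used above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab, to_lower=True):
    """Optionally lowercase, split on whitespace, then apply WordPiece per word."""
    if to_lower:
        text = text.lower()
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Toy vocabulary, invented for illustration only
toy_vocab = {"i", "loved", "this", "movie", "un", "##watch", "##able"}
print(tokenize("I loved this unwatchable movie", toy_vocab))
```

Here the out-of-vocabulary word "unwatchable" is split into `un`, `##watch`, and `##able`, so the model still receives known pieces instead of a single unknown token.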


Second, perform the following preprocessing steps:
* Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary
* Add the special tokens [CLS] and [SEP] to mark the beginning and end of each sequence
* Pad or truncate the token lists to the specified max length
* Return mask lists that indicate which positions are padding
* Return token type id lists that indicate which sentence each token belongs to (not needed for single-sequence classification)

*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*
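The steps listed above can be sketched in plain Python. This is an illustrative approximation, not the code of `preprocess_classification_tokens`: the function name `preprocess`, the toy vocabulary, and the choice to truncate to `max_len - 2` tokens before adding the special tokens are assumptions made for the example.

```python
def preprocess(token_lists, vocab, max_len, pad_id=0):
    """Convert token lists to padded id lists, input masks, and token type ids."""
    all_ids, all_masks, all_type_ids = [], [], []
    for tokens in token_lists:
        # Reserve two positions for [CLS] and [SEP], truncating if needed.
        tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
        ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
        mask = [1] * len(ids)        # 1 marks a real token
        padding = max_len - len(ids)
        ids += [pad_id] * padding    # pad up to max_len
        mask += [0] * padding        # 0 marks padding
        all_ids.append(ids)
        all_masks.append(mask)
        all_type_ids.append([0] * max_len)  # single sequence: all zeros
    return all_ids, all_masks, all_type_ids


# Toy vocabulary with made-up ids, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "great": 4, "movie": 5, "not": 6}
ids, masks, type_ids = preprocess(
    [["great", "movie"], ["not", "great", "movie", "at", "all"]],
    toy_vocab,
    max_len=6,
)
print(ids)
print(masks)
```

The first example is shorter than `max_len` and gets padded, with zeros in its mask. The second is longer and gets truncated, so its mask contains only ones.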

In [11]:
tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(
    tokens_train, MAX_LEN
)
tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(
    tokens_test, MAX_LEN
)

## Create model
Create a sequence classifier that loads a pre-trained BERT model given the language and number of labels.

In [12]:
classifier = BERTSequenceClassifier(
    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR
)

## Train the classifier
Fine-tune the classifier on the training data. The classifier attaches a linear classification layer to the pre-trained BERT transformer to perform sequence classification. The arguments for fitting the classifier are:
* token_ids: list of token indices
* input_mask: mask lists that indicate which positions are padding
* labels: list of training labels
* num_gpus: number of GPUs to use. If not specified, all available GPUs are used
* num_epochs: number of training epochs (default 1)
* batch_size: training batch size (default 32)
* verbose: if True, displays training progress and loss values

In [None]:
# Time the training run with the Timer context manager
with Timer() as t:
    classifier.fit(
        token_ids=tokens_train,
        input_mask=mask_train,
        labels=labels_train,
        num_gpus=NUM_GPUS,  # set in the global constants cell
        num_epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,
        verbose=True,
    )
print("[Training time: {:.3f} hrs]".format(t.interval / 3600))



epoch:1/2; batch:1->16/156; loss:0.688796
epoch:1/2; batch:17->32/156; loss:0.456730
epoch:1/2; batch:33->48/156; loss:0.304804
epoch:1/2; batch:49->64/156; loss:0.320006
epoch:1/2; batch:65->80/156; loss:0.207728
epoch:1/2; batch:81->96/156; loss:0.013612
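The first logged loss (about 0.69) is close to ln 2 ≈ 0.693, which is the cross-entropy of a two-class classifier that assigns equal probability to both classes. The sketch below shows that calculation, assuming the classifier is trained with softmax cross-entropy. That is the usual choice for BERT sequence classification, but it is an assumption here, not something this notebook confirms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# Untrained model: equal logits for both classes, so the loss equals ln 2
print(round(cross_entropy([0.0, 0.0], label=1), 6))

# Confident and correct prediction: the loss approaches zero
print(round(cross_entropy([-3.0, 3.0], label=1), 6))
```

As fine-tuning progresses and the model grows more confident on the training batches, the loss falls from this starting point toward zero.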


## Score
Score the test set using the trained classifier.

In [None]:
preds = classifier.predict(
    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE
)

## Evaluate results
Finally, compute the precision, recall, and F1 score for each class, along with the overall accuracy, on the test set.

In [None]:
print(classification_report(labels_test, preds, target_names=["negative", "positive"]))
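To make the reported metrics concrete, here is a small pure-Python sketch of how accuracy, precision, recall, and F1 are computed for the positive class. The function name `binary_metrics` and the labels and predictions are made up for illustration; they are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

`classification_report` prints these per-class values for both classes, together with their averages and the support (number of examples) for each class.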