# Text classification with BERT in PyTorch

2018 was an exciting year for Natural Language Processing. One of the most promising evolutions was the breakthrough of transfer learning. Models like Elmo Embeddings, ULMFit and BERT allow us to pre-train a neural network on a large collection of unlabelled texts. Thanks to an auxiliary task such as language modelling, these models are able to learn a lot about the syntax, semantics and morphology of a language. This knowledge can be put to good use: because they already know so much about language use, these models need much less labelled data to reach state-of-the-art performance on other tasks, such as text classification, sequence labelling or question answering. 

One of the most popular models is [BERT](https://arxiv.org/abs/1810.04805), developed by researchers at Google. BERT stands for Bidirectional Encoder Representations from Transformers. It uses the Transformer architecture to pretrain bidirectional "language models". By adding just one task-specific output layer, it is possible to use such a pre-trained BERT model on a variety of NLP tasks. In this notebook, we're going to investigate its performance on a sentiment analysis task, where the task is to predict whether a review is positive or negative. Unfortunately, we can't share the data, but you can easily plug in your own.

## Data

Let's first get our data. We're assuming the corpus is a simple json file with a list of documents. Each of these documents is a dictionary with a "text" and "label". 

In [1]:
import os

CORPUS_PATH = os.path.join(os.path.expanduser("~"), "review_corpus.json")

We now split up the data into a train, development and test portion. 

In [2]:
import json
from sklearn.model_selection import train_test_split

with open(CORPUS_PATH) as i:
    data = json.load(i)
    
texts = [doc["text"] for doc in data]
labels = [doc["label"] for doc in data]
    
rest_texts, test_texts, rest_labels, test_labels = train_test_split(texts, labels, test_size=0.1, random_state=1)
train_texts, dev_texts, train_labels, dev_labels = train_test_split(rest_texts, rest_labels, test_size=0.1, random_state=1)

print("Train size:", len(train_texts))
print("Dev size:", len(dev_texts))
print("Test size:", len(test_texts))

Train size: 1620
Dev size: 180
Test size: 200


Next, we need to determine the number of labels in our data. We'll map each of these labels to an index. In our sentiment analysis example, there are just two labels: positive and negative.

In [3]:
TARGET_NAME_PATH = os.path.join(os.path.expanduser("~"), "target_names.json")

target_names = list(set(labels))
with open(TARGET_NAME_PATH, "w") as o:
    json.dump(target_names, o)

label2idx = {label: idx for idx, label in enumerate(target_names)}
print(label2idx)

{'pos': 0, 'neg': 1}


## BERT

We're going to use a PyTorch implementation of BERT, which has been developed by the team at [HuggingFace](https://github.com/huggingface). A lot of the code in this notebook is taken from their example scripts, so kudos to them for sharing this. 

In [4]:
!pip install pytorch-pretrained-bert

[31mfastai 1.0.33 has requirement spacy==2.0.16, but you'll have spacy 2.0.18 which is incompatible.[0m
[31mfastai 1.0.33 has requirement thinc==6.12.0, but you'll have thinc 6.12.1 which is incompatible.[0m
[33mYou are using pip version 10.0.1, however version 19.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


You really need a GPU to finetune BERT. Still, to make sure this code runs on any machine we'll let PyTorch determine whether a GPU is available.

In [5]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Initializing a model

Google has made available a range of BERT models for us to experiment with. For English, there is a choice between three models: `bert-large-uncased` is the largest model that will likely give the best results. Its smaller siblings are `bert-base-uncased` and `bert-base-cased`, which are more practical to work with. For Chinese there is `bert-base-chinese`, and for the other languages we have `bert-base-multilingual-uncased` and `bert-base-multilingual-cased`. 

Uncased means that the training text has been lowercased and accents have been stripped. This is usually better, unless you know that case information is important for your task, such as with Named Entity Recognition. 

In our example, we're going to investigate sentiment analysis on Dutch text. We'll therefore use the uncased multilingual model. 

In [6]:
BERT_MODEL = "bert-base-multilingual-uncased"

Each model comes with its own tokenizer. This tokenizer splits texts into [word pieces](https://github.com/google/sentencepiece). These form a vocabulary of subword units that is shared between all the languages is in the multilingual model. In addition, we'll tell the tokenizer it should lowercase the text, as we're going to work with the uncased model. 

In [7]:
from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=True)

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


A full BERT model consists of a common, pretrained core, and an extension on top that depends on the particular NLP task. After all, the output of a sequence classification model, where we have just one prediction for every sequence, looks very different from the output of a sequence labelling or question answering model. As we're looking at sentiment classification, we're going to use the pretrained BERT model with a final layer for sequence classification on top.

In [8]:
from pytorch_pretrained_bert import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(BERT_MODEL, num_labels = len(label2idx))
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertInterm

### Preparing the data

Next we need to prepare our data for BERT. We'll present every document as an InputFeatures object, which contains all the information BERT needs: 

- a list of input ids. Take a look at the logging output to see what this means. Every text has been split up into subword units, which are shared between all the languages in the multilingual model. When a word appears frequently enough in a combined corpus of all languages, it is kept intact. If it is less frequent, it is split up into subword units that do occur frequently enough across all languages. This allows our model to process every text as a sequence of strings from a finite vocabulary of limited size. Note also the first `[CLS]` token. This token is added at the beginning of every document. The vector at the output of this token will be used by the BERT model for its sequence classification tasks: it serves as the input of the final, task-specific part of the neural network.
- the input mask: the input mask tells the model which parts of the input it should look at and which parts it should ignore. In our example, we have made sure that every text has a length of 100 tokens. This means that some texts will be cut off after 100 tokens, while others will have to be padded with extra tokens. In this latter case, these padding tokens will receive a mask value of 0, which means BERT should not take them into account for its classification task. 
- the segment_ids: some NLP task take several sequences as input. This is the case for question answering, natural language inference, etc. In this case, the segment ids tell BERT which sequence every token belongs to. In a text classification task like ours, however, there's only one segment, so all the input tokens receive segment id 0.
- the label id: the id of the label for this document.

In [9]:
import logging
import numpy as np

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

MAX_SEQ_LENGTH=100

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        

def convert_examples_to_features(example_texts, example_labels, label2idx, max_seq_length, tokenizer, verbose=0):
    """Loads a data file into a list of `InputBatch`s."""
    
    features = []
    examples = zip(example_texts, example_labels)
    for (ex_index, (text, label)) in enumerate(examples):
        tokens = tokenizer.tokenize(text)

        if len(tokens) > max_seq_length - 2:
            tokens = tokens[:(max_seq_length - 2)]
            
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        segment_ids = [0] * len(tokens)
            
        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        
        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label2idx[label]
        if verbose and ex_index == 0:
            logger.info("*** Example ***")
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label:" + str(label) + " id: " + str(label_id))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
    return features

train_features = convert_examples_to_features(train_texts, train_labels, label2idx, MAX_SEQ_LENGTH, tokenizer, verbose=0)
dev_features = convert_examples_to_features(dev_texts, dev_labels, label2idx, MAX_SEQ_LENGTH, tokenizer)
test_features = convert_examples_to_features(dev_texts, dev_labels, label2idx, MAX_SEQ_LENGTH, tokenizer)

Finally, we're going to initialize a data loader for our training, development and testing data. This data loader puts all our data in tensors and will allow us to iterate over them during training.

In [10]:
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

def get_data_loader(features, max_seq_length, batch_size): 

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
    data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
    sampler = SequentialSampler(data)
    dataloader = DataLoader(data, sampler=sampler, batch_size=batch_size)
    return dataloader

BATCH_SIZE = 16

train_dataloader = get_data_loader(train_features, MAX_SEQ_LENGTH, BATCH_SIZE)
dev_dataloader = get_data_loader(dev_features, MAX_SEQ_LENGTH, BATCH_SIZE)
test_dataloader = get_data_loader(test_features, MAX_SEQ_LENGTH, BATCH_SIZE)

### Evaluation method

Now it's time to write our evaluation method. This method takes as input a model and a data loader with the data we would like to evaluate on. For each batch, it computes the output of the model and the loss. We use this output to compute the obtained precision, recall and F-score. During training, we will print the simple numbers. When we evaluate on the test set, we will output a full classification report.

In [11]:
def evaluate(model, dataloader):

    eval_loss = 0
    nb_eval_steps = 0
    predicted_labels, correct_labels = [], []

    for step, batch in enumerate(tqdm(dataloader, desc="Evaluation iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        with torch.no_grad():
            tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
            logits = model(input_ids, segment_ids, input_mask)

        outputs = np.argmax(logits, axis=1)
        label_ids = label_ids.to('cpu').numpy()
        
        predicted_labels += list(outputs)
        correct_labels += list(label_ids)
        
        eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    
    correct_labels = np.array(correct_labels)
    predicted_labels = np.array(predicted_labels)
        
    return eval_loss, correct_labels, predicted_labels

### Training

Now it's time to start training. We're going to use the Adam optimizer with a learning rate of 5e-5, and train for a maximum of 100 epochs. Here are some additional things to note: 

- Gradient Accumulation allows us to keep our batches small enough to fit into the memory of our GPU, while getting the advantages of using larger batch sizes. In practice, it means we sum the gradients of several batches, before we perform a step of gradient descent. 
- Our learning rate is going to vary as a function of the training progress. First, during the warm-up stage, we're going to start with a small learning rate, which gradually increases. After the warm-up stage, we let the learning rate increase suddenly, and then decay linearly.

In [12]:
from pytorch_pretrained_bert.optimization import BertAdam

GRADIENT_ACCUMULATION_STEPS = 1
NUM_TRAIN_EPOCHS = 100
LEARNING_RATE = 5e-5
WARMUP_PROPORTION = 0.1

def warmup_linear(x, warmup=0.002):
    if x < warmup:
        return x/warmup
    return 1.0 - x

num_train_steps = int(len(train_texts) / BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS * NUM_TRAIN_EPOCHS)

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
t_total = num_train_steps

optimizer = BertAdam(optimizer_grouped_parameters,
                     LEARNING_RATE,
                     warmup=WARMUP_PROPORTION,
                     t_total=t_total)

We're finally ready to train our model. At each epoch, we're going to train it on our training data and evaluate it on the development data. We keep a history of the loss, and stop training when the loss on the development set doesn't improve for a certain number of steps (we call this number our `patience`). Whenever the development loss of our model improves, we save it. 

In [13]:
import torch
import os
from tqdm import trange
from tqdm import tqdm_notebook as tqdm
from sklearn.metrics import classification_report, precision_recall_fscore_support

OUTPUT_DIR = "/tmp/"
MODEL_FILE_NAME = "pytorch_model.bin"
PATIENCE = 2

global_step = 0
model.train()
loss_history = []
for _ in trange(int(NUM_TRAIN_EPOCHS), desc="Epoch"):
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(tqdm(train_dataloader, desc="Training iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch
        loss = model(input_ids, segment_ids, input_mask, label_ids)

        if GRADIENT_ACCUMULATION_STEPS > 1:
            loss = loss / GRADIENT_ACCUMULATION_STEPS

        loss.backward()

        tr_loss += loss.item()
        nb_tr_examples += input_ids.size(0)
        nb_tr_steps += 1
        if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            lr_this_step = LEARNING_RATE * warmup_linear(global_step/t_total, WARMUP_PROPORTION)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr_this_step
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1

    dev_loss, _, _ = evaluate(model, dev_dataloader)
    
    print("Loss history:", loss_history)
    print("Dev loss:", dev_loss)
    
    if len(loss_history) == 0 or dev_loss < min(loss_history):
        model_to_save = model.module if hasattr(model, 'module') else model
        output_model_file = os.path.join(OUTPUT_DIR, MODEL_FILE_NAME)
        torch.save(model_to_save.state_dict(), output_model_file)
    
    if len(loss_history) > 0 and dev_loss > max(loss_history[-PATIENCE:]): 
        print("No improvement on development set. Finish training.")
        break
        
    
    loss_history.append(dev_loss)

Epoch:   0%|          | 0/100 [00:00<?, ?it/s]

HBox(children=(IntProgress(value=0, description='Training iteration', max=102, style=ProgressStyle(description…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…


Loss history: []
Dev loss: 0.6941553999980291


Epoch:   1%|          | 1/100 [01:16<2:06:07, 76.44s/it]

HBox(children=(IntProgress(value=0, description='Training iteration', max=102, style=ProgressStyle(description…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…


Loss history: [0.6941553999980291]
Dev loss: 0.5211682170629501


Epoch:   2%|▏         | 2/100 [02:32<2:04:53, 76.47s/it]

HBox(children=(IntProgress(value=0, description='Training iteration', max=102, style=ProgressStyle(description…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…


Loss history: [0.6941553999980291, 0.5211682170629501]
Dev loss: 0.3689093589782715


Epoch:   3%|▎         | 3/100 [03:50<2:03:57, 76.67s/it]

HBox(children=(IntProgress(value=0, description='Training iteration', max=102, style=ProgressStyle(description…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…


Loss history: [0.6941553999980291, 0.5211682170629501, 0.3689093589782715]
Dev loss: 0.3561308452238639


Epoch:   4%|▍         | 4/100 [05:07<2:02:57, 76.85s/it]

HBox(children=(IntProgress(value=0, description='Training iteration', max=102, style=ProgressStyle(description…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…

Epoch:   5%|▌         | 5/100 [06:18<1:59:02, 75.19s/it]


Loss history: [0.6941553999980291, 0.5211682170629501, 0.3689093589782715, 0.3561308452238639]
Dev loss: 0.36196996333698434


HBox(children=(IntProgress(value=0, description='Training iteration', max=102, style=ProgressStyle(description…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…


Loss history: [0.6941553999980291, 0.5211682170629501, 0.3689093589782715, 0.3561308452238639, 0.36196996333698434]
Dev loss: 0.4219656717032194
No improvement on development set. Finish training.





### Evaluation

Let's now evaluate the model on some documents it has never seen. We'll load our best model and have it predict the labels for all documents in our data. We'll compute its precision, recall and F-score for the training, development and test set and print a full classification report for the test set.

In [14]:
BERT_MODEL = "bert-base-multilingual-uncased"

with open(TARGET_NAME_PATH) as i:
    target_names = json.load(i)

model_state_dict = torch.load(output_model_file)
model = BertForSequenceClassification.from_pretrained(BERT_MODEL, state_dict=model_state_dict, num_labels = len(target_names))
model.to(device)

model.eval()

_, train_correct, train_predicted = evaluate(model, train_dataloader)
_, dev_correct, dev_predicted = evaluate(model, dev_dataloader)
_, test_correct, test_predicted = evaluate(model, test_dataloader)

print("Training performance:", precision_recall_fscore_support(train_correct, train_predicted, average="micro"))
print("Development performance:", precision_recall_fscore_support(dev_correct, dev_predicted, average="micro"))
print("Test performance:", precision_recall_fscore_support(test_correct, test_predicted, average="micro"))

print(classification_report(test_correct, test_predicted, target_names=target_names))

02/03/2019 17:50:19 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz from cache at /home/ubuntu/.pytorch_pretrained_bert/437da855f7aeb6dcc47ee03b11ac55bfbc069d31354f6867f3b298aad8429925.dd2dce7e7331017693bd2230dbc8015b12a975201a420a856a6efbf7ae9d84c5
02/03/2019 17:50:19 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /home/ubuntu/.pytorch_pretrained_bert/437da855f7aeb6dcc47ee03b11ac55bfbc069d31354f6867f3b298aad8429925.dd2dce7e7331017693bd2230dbc8015b12a975201a420a856a6efbf7ae9d84c5 to temp dir /tmp/tmp0b3siw7_
02/03/2019 17:50:25 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hi

HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=102, style=ProgressStyle(descripti…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…




HBox(children=(IntProgress(value=0, description='Evaluation iteration', max=12, style=ProgressStyle(descriptio…


Training performance: (0.9580246913580247, 0.9580246913580247, 0.9580246913580247, None)
Development performance: (0.8722222222222222, 0.8722222222222222, 0.8722222222222223, None)
Test performance: (0.8722222222222222, 0.8722222222222222, 0.8722222222222223, None)
              precision    recall  f1-score   support

         pos       0.92      0.85      0.88       102
         neg       0.82      0.90      0.86        78

   micro avg       0.87      0.87      0.87       180
   macro avg       0.87      0.88      0.87       180
weighted avg       0.88      0.87      0.87       180

