<a href="https://colab.research.google.com/github/nvanommeren/nlp-benchmark/blob/master/4_BERT_on_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pretrained BERT model

Use a pretrained BertModel from HuggingFace as a fixed feature extractor and fit only the classifier layers on top of it.

https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb

Download distilbert model:
* https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-tf_model.h5
* https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json

In [0]:
%%capture
!pip install transformers

In [0]:
import pandas as pd
from transformers import BertTokenizer

import re

import logging

logging.basicConfig(level=logging.WARNING)

Creating the embeddings for 2,000 records takes about 3 minutes. Assuming linear scaling, converting all 50,000 records would take about 75 minutes. Unfortunately, processing everything at once leads to a dead kernel in the tokenize step, so we need to work in batches to run this on a local machine.
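
One way to keep memory use bounded is to walk over the reviews in fixed-size chunks. The sketch below is a minimal illustration, not code from this notebook: `iter_batches` and the `reviews` list are hypothetical names. Each yielded slice could then be tokenized and embedded on its own.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in for the list of reviews.
reviews = [f"review {i}" for i in range(5)]
batches = list(iter_batches(reviews, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

The last slice is simply shorter when the number of items is not a multiple of the batch size.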

In [3]:
from google.colab import drive
import pandas as pd
import re


drive.mount('/content/gdrive')

file = 'gdrive/My Drive/Colab Notebooks/IMDB Dataset.csv'

df = pd.read_csv(file)

Mounted at /content/gdrive


In [4]:
# df = pd.read_csv('../data/IMDB Dataset.csv')

SAMPLE_SIZE = 50000

def preprocess_imdb_raw_data(x):
    x = re.sub("<br\\s*/?>", " ", x)
    return x 

X = [preprocess_imdb_raw_data(x) for x in df['review'].values][:SAMPLE_SIZE]

y = df['sentiment'].apply(lambda x: int(x == 'positive')).values[:SAMPLE_SIZE]

df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
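
The raw reviews still contain HTML line breaks, which is what `preprocess_imdb_raw_data` removes. A small check of the regular expression on a made-up string (the sample text is an assumption for illustration, not a row from the dataset):

```python
import re


def preprocess_imdb_raw_data(x):
    # Replace every <br>, <br/> or <br /> tag with a single space.
    return re.sub(r"<br\s*/?>", " ", x)


sample = "Great film.<br /><br />Loved it."
cleaned = preprocess_imdb_raw_data(sample)
print(repr(cleaned))  # 'Great film.  Loved it.'
```

Two adjacent tags leave two spaces behind. That is harmless for the BERT tokenizer, which splits on whitespace before applying WordPiece.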


In [5]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


# Using a transformers pipeline
Without any additional training

In [11]:
from transformers import pipeline

nlp_sentence_classif = pipeline('sentiment-analysis', device=0)



In [0]:
from sklearn.model_selection import train_test_split

# Use the same test set as before
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, )

In [0]:
predicted_sentiment = [nlp_sentence_classif(x)[0]['label'].lower() for x in X_test]
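
For each input string the pipeline returns a list with one dict that holds a `label` and a `score`. The sketch below shows the label mapping on hand-written stand-ins, so it runs without downloading a model; `to_binary_label` is a hypothetical helper and the scores are made up.

```python
def to_binary_label(result):
    """Map one pipeline result (a list holding one dict) to 1 for positive, else 0."""
    return int(result[0]["label"].lower() == "positive")


# Stand-ins shaped like the pipeline output; the scores are invented.
fake_results = [
    [{"label": "POSITIVE", "score": 0.99}],
    [{"label": "NEGATIVE", "score": 0.75}],
]
print([to_binary_label(r) for r in fake_results])  # [1, 0]
```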

In [0]:
from sklearn.metrics import classification_report

y_pred = [s == 'positive' for s in predicted_sentiment]

print(f"Test: {classification_report(y_test, y_pred)}")

Test:               precision    recall  f1-score   support

           0       0.87      0.92      0.89      5044
           1       0.91      0.86      0.88      4956

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



# Pre-trained BertModel

In [6]:
import torch
from transformers import AutoTokenizer, BertTokenizer
from transformers import TFBertModel, BertModel

torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7f7c8b5ea208>

Q: Can you use the tokenizer from a different model?

Q: DistilBERT also takes around 3 minutes to create the embeddings. What efficiency gain could we have expected?

In [7]:
# Store the model we want to use
MODEL_NAME = "bert-base-cased" 

# We need to create the model and tokenizer
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME, output_hidden_states=False, 
                                     output_attentions=False)

model.to(device)
model.eval()



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [0]:
def tokenize(text):

    input_ids = []
    attention_masks = []

    for sent in text:

        encoded_dict = tokenizer.encode_plus(
                            sent,                      # Sentence to encode.
                            add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                            max_length = 128,           # Pad & truncate all sentences.
                            pad_to_max_length = True,
                            return_attention_mask = True,   # Construct attn. masks.
                            return_tensors = 'pt',     # Return pytorch tensors.
                    )
        
        # Add the encoded sentence to the list.    
        input_ids.append(encoded_dict['input_ids'])
        
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)

    return input_ids, attention_masks

input_ids, attention_masks = tokenize(X)
labels = torch.tensor(y)
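
To make the padding and attention mask concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token ids. It is an illustration under assumptions, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and unlike the real tokenizer it does not keep the closing `[SEP]` token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, mask), both of length `max_length`.

    The mask is 1 for real tokens and 0 for padding.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Made-up token ids for illustration.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```

With `max_length = 128`, long reviews are cut off, so the embedding only reflects the beginning of each review.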

In [0]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data import TensorDataset

dataset = TensorDataset(input_ids, attention_masks, labels)

# The DataLoader needs to know our batch size. For fine-tuning BERT on a 
# specific task, the authors recommend a batch size of 16 or 32; we reuse 32 
# here even though the model is only used for inference.
batch_size = 32

# Create the DataLoader over the full dataset.
# Batches are drawn in random order; the labels travel with each batch, so
# embeddings and labels stay aligned.
dataloader = DataLoader(
            dataset,  # All tokenized reviews with their labels.
            sampler = RandomSampler(dataset), # Select batches randomly
            batch_size = batch_size # Number of reviews per forward pass.
        )
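
A quick calculation of how many batches this gives, assuming the DataLoader default of keeping the final, smaller batch (`drop_last=False`):

```python
import math

SAMPLE_SIZE = 50000
batch_size = 32

# The final batch holds whatever is left after the full batches.
n_batches = math.ceil(SAMPLE_SIZE / batch_size)
last_batch_size = SAMPLE_SIZE - (n_batches - 1) * batch_size
print(n_batches, last_batch_size)  # 1563 16
```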

In [0]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string h:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round(elapsed))
    
    # Format as h:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
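
A short usage example of the helper, repeated here so the snippet runs on its own:

```python
import datetime


def format_time(elapsed):
    """Round seconds to the nearest integer and format as h:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


print(format_time(94.4))  # 0:01:34
print(format_time(3661))  # 1:01:01
```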


In [93]:
import random
import numpy as np
import time

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We only run inference here, so the only statistic we store is the run time.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()


print("")
print("Running Validation...")

t0 = time.time()

# Put the model in evaluation mode--the dropout layers behave differently
# during evaluation.
model.eval()

embeddings_result = np.empty([SAMPLE_SIZE, 768])
# last_hidden_state has shape torch.Size([32, 128, 768]) per batch
labels_result = np.empty([SAMPLE_SIZE])

# Run every batch through the model once
for step, batch in enumerate(dataloader):
    
    # Unpack this batch from our dataloader. 
    #
    # As we unpack the batch, we'll also copy the model inputs to the GPU 
    # using the `to` method. The labels stay on the CPU.
    #
    # `batch` contains three pytorch tensors:
    #   [0]: input ids 
    #   [1]: attention masks
    #   [2]: labels
    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_labels = batch[2]

    
    # Tell pytorch not to bother with constructing the compute graph during
    # the forward pass, since this is only needed for backprop (training).
    with torch.no_grad():        

        # Forward pass. `BertModel` has no classification head, so it returns
        # hidden states rather than logits.
        # token_type_ids is the same as the "segment ids", which 
        # differentiates sentence 1 and 2 in 2-sentence tasks.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html
        # `pooled_output` is the final hidden state of the [CLS] token passed
        # through a linear layer and tanh; we use it as the sentence embedding.
        (last_hidden_state, pooled_output) = model(
                                       b_input_ids, 
                                       token_type_ids=None, 
                                       attention_mask=b_input_mask)
        

    
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time, formatted as h:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(dataloader), elapsed))
            
    # Copy the pooled embeddings and labels into the CPU result arrays
    batch_idx = step*batch_size
    # torch.mean(last_hidden_state, dim=1).numpy().reshape(1, -1)[0]
    embeddings_result[batch_idx:batch_idx+batch_size, :] = pooled_output.detach().cpu().numpy()
    labels_result[batch_idx:batch_idx+batch_size] = b_labels.numpy()

# Measure how long the whole run took.
total_time = format_time(time.time() - total_t0)

# Record the statistics for this run.
training_stats.append(
    {
        # 'epoch': epoch_i + 1,
        # 'Training Loss': avg_train_loss,
        # 'Valid. Loss': avg_val_loss,
        # 'Valid. Accur.': avg_val_accuracy,
        'Time': total_time
    }
)

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Running Validation...
  Batch    40  of  1,563.    Elapsed: 0:00:05.
  Batch    80  of  1,563.    Elapsed: 0:00:09.
  Batch   120  of  1,563.    Elapsed: 0:00:14.
  Batch   160  of  1,563.    Elapsed: 0:00:19.
  Batch   200  of  1,563.    Elapsed: 0:00:24.
  Batch   240  of  1,563.    Elapsed: 0:00:28.
  Batch   280  of  1,563.    Elapsed: 0:00:33.
  Batch   320  of  1,563.    Elapsed: 0:00:38.
  Batch   360  of  1,563.    Elapsed: 0:00:42.
  Batch   400  of  1,563.    Elapsed: 0:00:47.
  Batch   440  of  1,563.    Elapsed: 0:00:52.
  Batch   480  of  1,563.    Elapsed: 0:00:56.
  Batch   520  of  1,563.    Elapsed: 0:01:01.
  Batch   560  of  1,563.    Elapsed: 0:01:06.
  Batch   600  of  1,563.    Elapsed: 0:01:11.
  Batch   640  of  1,563.    Elapsed: 0:01:15.
  Batch   680  of  1,563.    Elapsed: 0:01:20.
  Batch   720  of  1,563.    Elapsed: 0:01:25.
  Batch   760  of  1,563.    Elapsed: 0:01:29.
  Batch   800  of  1,563.    Elapsed: 0:01:34.
  Batch   840  of  1,563.    Elapsed:
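
The progress log is enough to extrapolate the total run time, assuming the time per batch stays constant. A minimal sketch using one logged line (`estimate_total_seconds` is a hypothetical helper):

```python
import datetime


def estimate_total_seconds(elapsed_seconds, batches_done, batches_total):
    """Linearly extrapolate the total run time from the progress so far."""
    return elapsed_seconds / batches_done * batches_total


# From the log above: batch 800 of 1,563 was reached after 0:01:34 (94 seconds).
total = estimate_total_seconds(94, 800, 1563)
print(str(datetime.timedelta(seconds=round(total))))  # 0:03:04
```

On the GPU the full 50,000 reviews should therefore take roughly three minutes.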

## Generate sentence embeddings per batch

In [0]:
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split

normalized_embeddings = Normalizer().fit_transform(embeddings_result)
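
With its default settings, `Normalizer` rescales each row (each embedding) to unit L2 norm, independently of all other rows. A pure-Python sketch of that operation on one vector (`l2_normalize` is a hypothetical helper, not the scikit-learn code):

```python
import math


def l2_normalize(row):
    """Scale a vector to unit Euclidean length; all-zero rows stay unchanged."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        return list(row)
    return [v / norm for v in row]


print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

Because each row is normalized on its own, applying it before the train/test split does not leak information from the test set.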

In [0]:
X_train, X_test, y_train, y_test = train_test_split(normalized_embeddings, labels_result, test_size=0.2, random_state=1)

## Model for the last classifier layer

In [98]:
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras import losses
from tensorflow.keras.optimizers import Adam

def make_simple_model(embedding_size=768):

    inp = Input(shape=[embedding_size])
    
    x = Dense(128, activation="relu")(inp)
    
    out = Dense(1, activation="sigmoid")(x)

    model = Model(inp, out)
    
    model.summary()
    
    model.compile(Adam(), loss=losses.binary_crossentropy, metrics=['accuracy'])
    
    return model

model_clf = make_simple_model()

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         [(None, 768)]             0         
_________________________________________________________________
dense_12 (Dense)             (None, 128)               98432     
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 129       
Total params: 98,561
Trainable params: 98,561
Non-trainable params: 0
_________________________________________________________________
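
The parameter counts in the summary follow from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. A short check (`dense_params` is a hypothetical helper):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden = dense_params(768, 128)
output = dense_params(128, 1)
print(hidden, output, hidden + output)  # 98432 129 98561
```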


In [99]:
model_clf.fit(X_train, y_train, epochs=500) 

Epoch 1/500
...
Epoch 77/500
Epoch 78

KeyboardInterrupt: ignored
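
The run above was stopped by hand. An alternative is a stopping rule with a patience counter, which is the idea behind Keras' `EarlyStopping` callback. The sketch below only illustrates that rule in plain Python; `stopping_epoch` is a hypothetical helper and the loss values are made up, not taken from this run.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up loss curve for illustration only.
losses_per_epoch = [0.60, 0.55, 0.53, 0.54, 0.53, 0.55, 0.56]
print(stopping_epoch(losses_per_epoch, patience=3))  # 6
```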

In [86]:
embeddings_result.shape

(50000, 768)

## Validation

In [100]:
from sklearn.metrics import classification_report

y_test_probs = model_clf.predict(x=X_test)
y_test_pred = (y_test_probs >= 0.5).astype(int)

print(f"Test: {classification_report(y_test, y_test_pred)}")

Test:               precision    recall  f1-score   support

         0.0       0.70      0.81      0.75      4942
         1.0       0.78      0.66      0.72      5058

    accuracy                           0.74     10000
   macro avg       0.74      0.74      0.73     10000
weighted avg       0.74      0.74      0.73     10000



In [101]:
from sklearn.metrics import classification_report

y_train_probs = model_clf.predict(x=X_train)
y_train_pred = (y_train_probs >= 0.5).astype(int)

print(f"Train: {classification_report(y_train, y_train_pred)}")

Train:               precision    recall  f1-score   support

         0.0       0.71      0.83      0.77     20058
         1.0       0.79      0.66      0.72     19942

    accuracy                           0.75     40000
   macro avg       0.75      0.75      0.74     40000
weighted avg       0.75      0.75      0.74     40000



## Train own embeddings:

* https://huggingface.co/transformers/v2.0.0/examples.html#language-model-fine-tuning

language-model-fine-tuning has been renamed to run-language-modeling:

* https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py

* https://huggingface.co/blog/how-to-train: training from scratch
* https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertformaskedlm