# Pretrained BERT model

Use a pretrained `BertModel` from HuggingFace and fit only the classifier layers.

https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb

Download the DistilBERT model files:
* https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-tf_model.h5
* https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json

In [1]:
import logging
import re

import pandas as pd
from transformers import BertTokenizer

logging.basicConfig(level=logging.WARNING)

Creating the embeddings for 2,000 records takes about 3 minutes. Assuming the runtime scales linearly, embedding the full dataset of 50,000 reviews would take about 75 minutes. Unfortunately, processing everything at once kills the kernel during the tokenize step, so the data has to be processed in batches to run on a local machine.
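
A minimal sketch of the batching idea, using only the standard library. The helper names `iter_batches` and `estimate_minutes` and the batch size are illustrative assumptions, not part of the original notebook. The timing helper only reproduces the linear extrapolation from the measurement above (2,000 records in 3 minutes).

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def estimate_minutes(n_records: int, sample_size: int = 2000,
                     sample_minutes: float = 3.0) -> float:
    """Linear extrapolation: scale the measured sample time by the record count."""
    return n_records / sample_size * sample_minutes


# Stand-in data (hypothetical): ten short "reviews" split into batches of four
reviews = [f"review {i}" for i in range(10)]
batches = list(iter_batches(reviews, batch_size=4))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)

# Estimated runtime for 50,000 records under the linear assumption
print(estimate_minutes(50_000))
```

Each batch could then be tokenized and embedded on its own, with the per-batch results concatenated at the end, so that only one batch has to fit in memory at a time.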

In [7]:
df = pd.read_csv('../data/IMDB Dataset.csv')

SAMPLE_SIZE = 2000

def preprocess_imdb_raw_data(x):
    # Replace HTML line breaks such as <br>, <br/> and <br /> with a space
    x = re.sub(r"<br\s*/?>", " ", x)
    return x

X = [preprocess_imdb_raw_data(x) for x in df['review'].values][:SAMPLE_SIZE]

y = df['sentiment'].apply(lambda x: int(x == 'positive')).values[:SAMPLE_SIZE]

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
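
The preprocessing above can be tried on a small string. This is a sketch only: the sample text is made up, and the extra whitespace-collapsing step and the names `clean_review` and `encode_label` are assumptions added for illustration, since replacing each `<br />` with a space can leave runs of several spaces.

```python
import re

BR_TAG = re.compile(r"<br\s*/?>")   # same pattern as in preprocess_imdb_raw_data
WHITESPACE = re.compile(r"\s+")     # assumption: also collapse repeated whitespace


def clean_review(text: str) -> str:
    """Replace HTML line breaks with spaces, then collapse runs of whitespace."""
    text = BR_TAG.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def encode_label(sentiment: str) -> int:
    """Map 'positive' to 1 and anything else to 0, as done for `y` above."""
    return int(sentiment == "positive")


sample = "Great film.<br /><br />Would watch again."  # made-up example text
cleaned = clean_review(sample)
print(cleaned)
print(encode_label("positive"), encode_label("negative"))
```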


## Using a transformers pipeline
The pipeline's default sentiment model is used as is, without any additional training.

In [8]:
from transformers import pipeline

nlp_sentence_classif = pipeline('sentiment-analysis')

In [98]:
predicted_sentiment = [nlp_sentence_classif(x)[0]['label'].lower() for x in X]

In [99]:
from sklearn.metrics import classification_report

y_pred = [s == 'positive' for s in predicted_sentiment]

print(f"Test: {classification_report(y, y_pred)}")

Test:               precision    recall  f1-score   support

           0       0.87      0.95      0.90       115
           1       0.92      0.80      0.86        85

    accuracy                           0.89       200
   macro avg       0.89      0.87      0.88       200
weighted avg       0.89      0.89      0.88       200
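
The per-class numbers in a classification report follow from simple counts of true positives, false positives and false negatives. The sketch below recomputes them in plain Python on a tiny made-up label set; the function name `per_class_metrics` and the toy labels are assumptions for illustration and are not the notebook's data.

```python
from typing import Dict, Sequence


def per_class_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                      label: int) -> Dict[str, float]:
    """Precision, recall, F1 and support for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Toy labels (hypothetical): 1 = positive, 0 = negative
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 0, 1, 0, 1]

positive = per_class_metrics(y_true_toy, y_pred_toy, label=1)
negative = per_class_metrics(y_true_toy, y_pred_toy, label=0)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(positive, negative, accuracy)
```

The support column is the number of true examples of each class, which is why the two supports add up to the number of evaluated records.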



## Using the last pooled layer

In [9]:
import torch
from transformers import BertTokenizer, TFBertModel

# Disable PyTorch gradient tracking (this has no effect on the TensorFlow model used below)
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x15da2c828>

Q: Can you use the tokenizer from a different model?

Q: DistilBERT also takes around 3 minutes to create the embeddings. What efficiency gain could we have expected?

In [10]:
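# Tokenizer: the standard cased BERT vocabulary
TOKENIZER_NAME = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(TOKENIZER_NAME)

# Model: local copy of the DistilBERT files (presumably the ones linked at the top)
MODEL_NAME = "../models/distilbert-base-cased"

# Caution: TFBertModel expects a BERT checkpoint. Loading DistilBERT weights into it
# may leave layers randomly initialised. TFDistilBertModel is the matching class,
# but it does not return a pooled output.
model_tf = TFBertModel.from_pretrained(MODEL_NAME)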

In [11]:
MAX_SEQ_LENGTH = 100

tokens = tokenizer.batch_encode_plus(X, 
                                     max_length=MAX_SEQ_LENGTH, 
                                     return_tensors='tf')
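
To end up with a rectangular tensor, every review has to have the same number of token ids: longer ones are cut at `MAX_SEQ_LENGTH` and shorter ones are padded. How the real tokenizer does this depends on the installed transformers version, so the sketch below only illustrates the idea in plain Python. The helper name `pad_or_truncate`, the padding id 0 and the example ids are assumptions; a real tokenizer also keeps the closing special token when truncating, which this sketch does not.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: id of the padding token in the vocabulary


def pad_or_truncate(token_ids: List[int], max_length: int,
                    pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = token_ids[:max_length]          # cut sequences that are too long
    mask = [1] * len(ids)                 # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Illustrative token ids only (not produced by a real tokenizer)
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5], max_length=3)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask lets the model ignore the padded positions, so padding changes the tensor shape but should not change what the model attends to.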

In [12]:
outputs, pooled = model_tf(tokens)
pooled.shape

TensorShape([2000, 768])
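
The pooled output has one 768-dimensional vector per review. In the standard BERT architecture, as far as I know, this vector is the hidden state of the first token (`[CLS]`) passed through a dense layer with a tanh activation. The sketch below shows that computation in plain Python with a hidden size of 2 instead of 768; the function name `pool_first_token` and all numbers are made up for illustration.

```python
import math
from typing import List


def pool_first_token(hidden_states: List[List[float]],
                     weight: List[List[float]],
                     bias: List[float]) -> List[float]:
    """tanh(W @ h_first + b), where h_first is the first token's hidden state."""
    first = hidden_states[0]  # only the first token is used, the rest is ignored
    return [
        math.tanh(sum(w * h for w, h in zip(row, first)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy example: 2 tokens, hidden size 2, identity weights, zero bias
hidden_states = [[1.0, 0.0], [5.0, 5.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pooled_toy = pool_first_token(hidden_states, weight, bias)
print(pooled_toy)
```

Because only the first token enters the computation, the quality of the pooled vector depends entirely on how much of the review the model has folded into that token's representation.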

In [None]:
import numpy as np

# Save the pooled embeddings as a NumPy array
np.save('../models/bert_pooled_layer.npy', np.array(pooled))

In [144]:
pooled = np.load('../models/bert_pooled_layer.npy')

In [132]:
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras import losses

def make_simple_model(embedding_size=768):

    inp = Input(shape=[embedding_size])

    out = Dense(1, activation="sigmoid")(inp)

    model = Model(inp, out)
    model.summary()  # summary() prints the table itself and returns None

    model.compile("adam", loss=losses.binary_crossentropy, metrics=['accuracy'])

    return model

model_clf = make_simple_model()

Model: "model_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_11 (InputLayer)        [(None, 768)]             0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 769       
Total params: 769
Trainable params: 769
Non-trainable params: 0
_________________________________________________________________
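
A single `Dense(1, activation="sigmoid")` layer on top of fixed embeddings is a logistic regression: one weight per embedding dimension plus one bias, which gives the 769 parameters in the summary. The sketch below spells this out in plain Python; the function names and the all-zero weights are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability of the positive class: sigmoid(w . x + b)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def count_params(embedding_size: int) -> int:
    """One weight per input dimension plus a single bias term."""
    return embedding_size + 1


n_params = count_params(768)
print(n_params)

# With all-zero weights and bias, every input gets probability 0.5
prob = predict_proba([0.3] * 768, [0.0] * 768, 0.0)
print(prob)
```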


In [141]:
# Fit the classifier on the pooled embeddings loaded above
model_clf.fit(pooled, y, epochs=5)

Train on 2000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x14e710908>

In [135]:
y_train_probs = model_clf.predict(x=pooled)
y_train_pred = (y_train_probs >= 0.5).astype(int)

print(f"Train: {classification_report(y, y_train_pred)}")

Train:               precision    recall  f1-score   support

           0       0.62      0.41      0.49       995
           1       0.56      0.75      0.64      1005

    accuracy                           0.58      2000
   macro avg       0.59      0.58      0.57      2000
weighted avg       0.59      0.58      0.57      2000

