This notebook uses BERT with an extra layer of nodes added at the output to predict whether a movie review is positive or negative. It aims to show the performance gained from transfer learning, at the cost of longer training time and a larger model than the models in the other notebook in this repo.


Note: the majority of this code is taken from https://medium.com/tensorflow-2-bert-movie-review-sentiment-analysis/tensorflow-2-bert-movie-review-sentiment-analysis-b4ccabb87824

I hope to demonstrate some of the pros and cons of using BERT compared to simpler models and vectorisation techniques.
I have also commented the code myself to demonstrate my understanding of what is required to fine-tune BERT for a specific task.


## MAKE SURE YOU ENABLE GPU FOR GOOGLE COLAB!!!

In [1]:
!pip install bert-for-tf2

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/18/d3/820ccaf55f1e24b5dd43583ac0da6d86c2d27bbdfffadbba69bafe73ca93/bert-for-tf2-0.14.7.tar.gz (41kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/75/2c/2256f28ef35946682ce703e69de914773c3f62048f4de6966d4e2dc1930a/py-params-0.10.1.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... 

In [2]:

# Import modules
import pandas as pd
import numpy as np
import bert
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow import keras
import os
import re

print("TensorFlow Version:",tf.__version__)
print("Hub version: ",hub.__version__)
pd.set_option('display.max_colwidth',1000)

TensorFlow Version: 2.4.0
Hub version:  0.11.0


## Load Data
Download the IMDB movie review dataset with `tf.keras.utils.get_file` and load it into pandas DataFrames.

In [3]:

# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.io.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      # Raw string so that \d and \. are regex escapes, not (invalid) string escapes
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)

# Merge positive and negative examples and add a polarity column.
# The rows are not shuffled here; the training set is shuffled later, before the train/validation split.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  # return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)
  return pd.concat([pos_df, neg_df])

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz", 
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
      extract=True)
  
  train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))
  
  return train_df.drop(columns=['sentiment']), test_df.drop(columns=['sentiment'])
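
As a quick check of the filename pattern used in `load_directory_data`, here is a small sketch. The two filenames are made-up examples in the dataset's `<id>_<rating>.txt` style, not files I have looked up.

```python
import re

# IMDB review files are named "<id>_<rating>.txt"; the capture group is the rating.
pattern = re.compile(r"\d+_(\d+)\.txt")

for name in ["0_9.txt", "10234_1.txt"]:  # hypothetical filenames
    print(name, "->", pattern.match(name).group(1))
# 0_9.txt -> 9
# 10234_1.txt -> 1
```

The rating ends up in the `sentiment` column, which `download_and_load_datasets` drops again because only the binary `polarity` label is used.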

In [4]:
train, test = download_and_load_datasets()


Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [5]:
train.head(2)

Unnamed: 0,sentence,polarity
0,"Director Brian Yuzna has had an uneven career in the horror genre, creating masterpieces such as ""Return of the Living Dead 3"" or ""Bride of Re-Animator"", but at the same time he has done awful movies such as ""Faust: Love for the Damned"" or the mediocre ""Progeny"". He is obviously better in the seat of Producer where his work producing Stuart Gordon's films has been superb.<br /><br />""The Dentist"", is one of his lesser works as director, but the low profile it has benefits the film and its lack of pretensions makes it a very enjoyable experience. It tells the story of Dr. Alan Feinstone (played superbly by Corbin Bernsen), a successful dentist who one day discovers that his perfect life is not really as perfect as he thought when he discovers that his beautiful wife (Linda Hoffman)has an affair with the pool boy. This event disturbs his mind and puts him in a killing spree as he takes revenge on the world for being so ""filthy"".<br /><br />The premise is very well handled by Yuzna, a...",1
1,"The scintillating Elizabeth Taylor stars in this lesser-known classic as a young girl from London who falls in love with a tea plantation owner from British Ceylon (current day Sri Lanka). Upon arrival she instantly feels out of place and is forced to adapt to the new culture as well as be in constant awareness of the angry elephant herd. William Dieterle, who also directed The Life Of Emile Zola and Portrait Of Jennie , does a masterful job of bringing a somewhat dark, and almost eerie, undertone to this romance and the setting is one of the most beautiful I've seen with the black and white themed mansion and the gorgeous island scenery.",1


## Preprocessing
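BERT builds its input representation from three embeddings: token, segment and position. The TF Hub layer used here takes three integer arrays (token ids, an input mask and segment ids) and adds the position embeddings internally. The functions below create these three arrays.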
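
To make the three arrays concrete, here is a minimal sketch for one short input padded to length 8. The vocabulary and `TOY_MAX_LEN` are invented for the example; the real ids come from BERT's vocab file.

```python
# Toy illustration of the three BERT input arrays (hypothetical vocabulary).
TOY_MAX_LEN = 8  # assumption: tiny length so the arrays are easy to read
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "great": 3, "movie": 4, "!": 5}

tokens = ["[CLS]", "great", "movie", "!", "[SEP]"]
n_pad = TOY_MAX_LEN - len(tokens)

input_ids = [toy_vocab[t] for t in tokens] + [0] * n_pad  # token ids, 0 = padding
input_mask = [1] * len(tokens) + [0] * n_pad              # 1 = real token, 0 = padding
segment_ids = [0] * TOY_MAX_LEN                           # single sequence, so all 0

print(input_ids)    # [1, 3, 4, 5, 2, 0, 0, 0]
print(input_mask)   # [1, 1, 1, 1, 1, 0, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```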

In [6]:
# Functions for constructing the BERT inputs: input_ids, input_masks and input_segments

## this function creates the input mask: 1 for real tokens and 0 for padding
MAX_SEQ_LEN=500 # max sequence length
def get_masks(tokens):
    """Masks: 1 for real tokens and 0 for paddings"""
    return [1]*len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))
 

In [7]:
# This function creates the segment ids. BERT is pre-trained on pairs of sentences
# (to predict masked words and the next sentence), so each token is tagged 0 if it belongs
# to the first sentence and 1 if it belongs to the second.
# A review is passed as a single sequence here, so its segment ids are all 0.

def get_segments(tokens):
    """Segments: 0 for the first sequence, 1 for the second"""  
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))
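
To see when the segment id actually switches to 1, here is a standalone sketch of the same logic with a small maximum length. `segment_ids_for` and the token lists are illustrative names and data, not part of the notebook.

```python
# Standalone copy of the segment logic, with max_len as a parameter for readability.
def segment_ids_for(tokens, max_len):
    segments, current = [], 0
    for token in tokens:
        segments.append(current)
        if token == "[SEP]":
            current = 1  # everything after the first [SEP] belongs to segment 1
    return segments + [0] * (max_len - len(tokens))

single = ["[CLS]", "good", "film", "[SEP]"]
pair = ["[CLS]", "good", "film", "[SEP]", "bad", "ending", "[SEP]"]

print(segment_ids_for(single, 8))  # [0, 0, 0, 0, 0, 0, 0, 0]
print(segment_ids_for(pair, 8))    # [0, 0, 0, 0, 1, 1, 1, 0]
```

Only the sentence-pair input gets any 1s; the reviews in this notebook look like the `single` case.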

In [8]:
## gets token ids from BERT's vocabulary
def get_ids(tokens, tokenizer):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN - len(token_ids))
    return input_ids

In [9]:
## tokenize the input, cut it to the max length, and then create input, mask and segment embeddings
def create_single_input(sentence, tokenizer, max_len):
    """Create an input from a sentence"""
    stokens = tokenizer.tokenize(sentence)
    stokens = stokens[:max_len]
    stokens = ["[CLS]"] + stokens + ["[SEP]"]
 
    ids = get_ids(stokens, tokenizer)
    masks = get_masks(stokens)
    segments = get_segments(stokens)
    
    return ids, masks, segments

In [10]:
## create features from the whole movie review (truncated to MAX_SEQ_LEN - 2 tokens), NOT JUST THE FIRST 2 SENTENCES!!
def convert_sentences_to_features(sentences, tokenizer):
    """Convert sentences to features: input_ids, input_masks and input_segments"""
    input_ids, input_masks, input_segments = [], [], []
 
    for sentence in tqdm(sentences,position=0, leave=True):
        ids,masks,segments=create_single_input(sentence,tokenizer,MAX_SEQ_LEN-2)
        assert len(ids) == MAX_SEQ_LEN
        assert len(masks) == MAX_SEQ_LEN
        assert len(segments) == MAX_SEQ_LEN
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)

    return [np.asarray(input_ids, dtype=np.int32), 
          np.asarray(input_masks, dtype=np.int32), 
          np.asarray(input_segments, dtype=np.int32)]
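
The `MAX_SEQ_LEN-2` passed to `create_single_input` leaves room for `[CLS]` and `[SEP]`, which is why the length assertions always hold. A small sketch of that arithmetic, using stand-in token lists rather than real tokenizer output (`wrap_and_pad` is a hypothetical helper):

```python
MAX_SEQ_LEN = 500  # same value as in the notebook

def wrap_and_pad(word_pieces, max_seq_len=MAX_SEQ_LEN):
    """Truncate, add [CLS]/[SEP], and report how many pad positions remain."""
    kept = word_pieces[: max_seq_len - 2]  # reserve 2 slots for the special tokens
    tokens = ["[CLS]"] + kept + ["[SEP]"]
    return tokens, max_seq_len - len(tokens)

long_review = ["tok"] * 900   # stand-in for a long tokenized review
short_review = ["tok"] * 120  # stand-in for a short one

for pieces in (long_review, short_review):
    tokens, n_pad = wrap_and_pad(pieces)
    print(len(tokens), n_pad, len(tokens) + n_pad)
# 500 0 500
# 122 378 500
```

Anything beyond the first 498 word pieces of a review is simply dropped.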

In [11]:
## use the BERT tokenizer by loading BERT's vocabulary and lower-casing setting from the hub layer
def create_tonkenizer(bert_layer):
    """Instantiate Tokenizer with vocab"""
    vocab_file=bert_layer.resolved_object.vocab_file.asset_path.numpy()
    do_lower_case=bert_layer.resolved_object.do_lower_case.numpy() 
    tokenizer=bert.bert_tokenization.FullTokenizer(vocab_file,do_lower_case)
    return tokenizer

## Create an instance of the BERT model
- add a dense layer of 768 ReLU units and an output layer of 2 softmax nodes
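
For a sense of scale, the parameters added by these two layers can be worked out by hand. This is a rough sketch that assumes the pooled BERT output is 768-dimensional, as it is for the BERT base model used here.

```python
# Trainable parameters added on top of BERT's pooled output.
HIDDEN = 768  # width of pooled_output (BERT base) and of the new Dense layer
CLASSES = 2   # negative / positive

dense_params = HIDDEN * HIDDEN + HIDDEN     # weights + biases of the hidden layer
output_params = HIDDEN * CLASSES + CLASSES  # weights + biases of the output layer
print(dense_params, output_params, dense_params + output_params)
# 590592 1538 592130
```

That is small next to the 109,482,241 parameters the model summary reports for the BERT layer itself, so almost all of the model size comes from BERT.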

In [12]:
def nlp_model(callable_object):
    # Load the pre-trained BERT base model
    bert_layer = hub.KerasLayer(handle=callable_object, trainable=True)  
   
    # BERT layer three inputs: ids, masks and segments
    input_ids = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_ids")           
    input_masks = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_masks")       
    input_segments = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="segment_ids")
    
    inputs = [input_ids, input_masks, input_segments] # BERT inputs
    pooled_output, sequence_output = bert_layer(inputs) # BERT outputs
    
    # Add a hidden layer
    x = Dense(units=768, activation='relu')(pooled_output)
    x = Dropout(0.1)(x)
 
    # Add output layer
    outputs = Dense(2, activation="softmax")(x)

    # Construct a new model
    model = Model(inputs=inputs, outputs=outputs)
    return model

model = nlp_model("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1")
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 500)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]                  
                                                                 input_masks[0][0]            

Let's check that we are creating features correctly for a review.

In [13]:
review = train['sentence'].head(1)
tokenizer = create_tonkenizer(model.layers[3])
features = convert_sentences_to_features(review, tokenizer)
                    
print(review)
print('token_ids : ',features[0])
print('mask embeddings : ', features[1])
print('segment ids : ', features[2])

100%|██████████| 1/1 [00:00<00:00, 141.04it/s]

0    Director Brian Yuzna has had an uneven career in the horror genre, creating masterpieces such as "Return of the Living Dead 3" or "Bride of Re-Animator", but at the same time he has done awful movies such as "Faust: Love for the Damned" or the mediocre "Progeny". He is obviously better in the seat of Producer where his work producing Stuart Gordon's films has been superb.<br /><br />"The Dentist", is one of his lesser works as director, but the low profile it has benefits the film and its lack of pretensions makes it a very enjoyable experience. It tells the story of Dr. Alan Feinstone (played superbly by Corbin Bernsen), a successful dentist who one day discovers that his perfect life is not really as perfect as he thought when he discovers that his beautiful wife (Linda Hoffman)has an affair with the pool boy. This event disturbs his mind and puts him in a killing spree as he takes revenge on the world for being so "filthy".<br /><br />The premise is very well handled by Yuzna, 




## Model Training
The features look fine, so let's train the model.


In [14]:




train = train.sample(frac=1) # Shuffle the dataset
train_frac = int(0.75*train.shape[0])
train_df = train[:train_frac]
val_df = train[train_frac:]

tokenizer = create_tonkenizer(model.layers[3])
X_train = convert_sentences_to_features(train_df['sentence'], tokenizer)
X_val = convert_sentences_to_features(val_df['sentence'], tokenizer)
X_test = convert_sentences_to_features(test['sentence'], tokenizer)



y_train = to_categorical(train_df['polarity'].values)
y_val = to_categorical(val_df['polarity'].values)
y_test = to_categorical(test['polarity'].values)



print(len(y_train))
print(len(y_val))
print(len(y_test))

100%|██████████| 18750/18750 [01:01<00:00, 304.03it/s]
100%|██████████| 6250/6250 [00:20<00:00, 307.58it/s]
100%|██████████| 25000/25000 [01:20<00:00, 312.25it/s]


18750
6250
25000
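
`to_categorical` turns each 0/1 label into a one-hot row, which is what the 2-node softmax output and `categorical_crossentropy` expect. A plain-Python sketch of the idea (`one_hot` and `argmax` are hypothetical helpers written for this illustration, not the Keras implementation):

```python
def one_hot(labels, num_classes=2):
    """One row per label, with a 1.0 in the column of the true class."""
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]

def argmax(row):
    """Index of the largest value, i.e. the class the row encodes."""
    return max(range(len(row)), key=row.__getitem__)

labels = [1, 0, 1]
encoded = one_hot(labels)
print(encoded)                       # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print([argmax(r) for r in encoded])  # [1, 0, 1]
```

The `argmax` step is the reverse mapping used at evaluation time to get class labels back from one-hot rows and from predicted probabilities.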


In [15]:
# Train the model
BATCH_SIZE = 8
EPOCHS = 2

# Use Adam optimizer to minimize the categorical_crossentropy loss
opt = Adam(learning_rate=2e-5)
model.compile(optimizer=opt, 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

# Fit the data to the model
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose = 1)

# Save the trained model
model.save('nlp_model.h5')

Epoch 1/2
Epoch 2/2
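
Two bits of arithmetic behind the training cell, as a rough sketch: the number of batches per epoch, and what the loss looks like for a single example. The loss function below is a simplified per-example version (Keras also clips probabilities and averages over the batch), and the probabilities are made up.

```python
import math

# Batches per epoch for 18,750 training rows and a batch size of 8.
print(math.ceil(18750 / 8))  # 2344

def categorical_crossentropy(y_true, y_prob):
    """Loss for one example: -sum(y_true * log(y_prob))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

# True class is 1 (positive) in both cases; only the predicted probabilities differ.
confident_right = categorical_crossentropy([0.0, 1.0], [0.1, 0.9])
confident_wrong = categorical_crossentropy([0.0, 1.0], [0.9, 0.1])
print(round(confident_right, 3), round(confident_wrong, 3))  # 0.105 2.303
```

A confident wrong prediction is penalised far more heavily than a confident right one is rewarded.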


## Evaluate Model Performance

In [16]:

# Load the saved fine-tuned model
from tensorflow.keras.models import load_model
new_model = load_model('nlp_model.h5',custom_objects={'KerasLayer':hub.KerasLayer})

In [18]:
# Predict on test dataset
from sklearn.metrics import classification_report
pred_test = np.argmax(new_model.predict(X_test), axis=1)

In [19]:
print(classification_report(np.argmax(y_test,axis=1), pred_test))


              precision    recall  f1-score   support

           0       0.96      0.89      0.92     12500
           1       0.89      0.96      0.93     12500

    accuracy                           0.92     25000
   macro avg       0.93      0.92      0.92     25000
weighted avg       0.93      0.92      0.92     25000
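
For reference, this is how the precision, recall and f1-score columns are defined. The counts below are invented for illustration and are not taken from this model's predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)  # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)     # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=20)  # hypothetical counts
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.818 0.857
```

The `support` column is the number of true examples of each class, 12,500 per class in the test set.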

