This notebook was originally written by TensorFlow and has been modified by R. D. Slater to run properly after recent library changes. The original worked when it was written, but newer releases cause a runtime error in the `.predict()` function, which I believe is due to tensor shapes ((None, 1, 128) vs. (None, 128)) or to the inputs being Python lists. I modified the functions that produce the model inputs to return NumPy arrays, and the model now works as before. Note that you can also pass tensors (via `tf.convert_to_tensor()`).

# BERT Embeddings with TensorFlow 2.0
With the new release of TensorFlow, this Notebook aims to show a simple use of the BERT model.
- See BERT on paper: https://arxiv.org/pdf/1810.04805.pdf
- See BERT on GitHub: https://github.com/google-research/bert
- See BERT on TensorFlow Hub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1
- See 'old' use of BERT for comparison: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

## Update TF
We need TensorFlow 2.2 and TensorFlow Hub 0.7 or later for this Colab.

In [None]:

!pip install bert-for-tf2
!pip install sentencepiece

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/35/5c/6439134ecd17b33fe0396fb0b7d6ce3c5a120c42a4516ba0e9a2d6e43b25/bert-for-tf2-0.14.4.tar.gz (40kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/a4/bf/c1c70d5315a8677310ea10a41cfc41c5970d9b37c31f9c90d4ab98021fd1/py-params-0.9.7.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... done
  Created wheel for bert-for-tf2: filename=bert_for_tf2

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
print("TF version: ", tf.__version__) # 2.2
print("Hub version: ", hub.__version__) # 0.8

TF version:  2.2.0
Hub version:  0.8.0


If TensorFlow Hub 0.7 has not been released yet, install the development version:



In [None]:
# not needed

In [None]:
hub.__version__

'0.8.0'

## Import modules

In [None]:
import tensorflow_hub as hub
import tensorflow as tf
import bert
FullTokenizer = bert.bert_tokenization.FullTokenizer
from tensorflow.keras.models import Model       # Keras is the new high level API for TensorFlow
import math

Build a model with tf.keras and TensorFlow Hub that maps sentences to embeddings.

Inputs:
 - input token ids (tokenizer converts tokens using vocab file)
 - input masks (1 for useful tokens, 0 for padding)
 - segment ids (for tasks with two texts: 0 for the first one, 1 for the second one)

Outputs:
 - pooled_output of shape `[batch_size, 768]` with representations for the entire input sequences 
 - sequence_output of shape `[batch_size, max_seq_length, 768]` with representations for each input token (in context)
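
As a rough illustration of the three inputs, the sketch below builds them by hand for a two-sentence example. The token ids come from a made-up `toy_vocab` (an assumption for illustration only; the real ids come from the BERT vocab file), and `max_len` is kept small so the result is easy to read.

```python
# Toy illustration of BERT's three inputs for a sentence pair.
# toy_vocab is hypothetical; real ids come from the model's vocab file.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "good": 3, "movie": 4, "bad": 5, "plot": 6}
max_len = 10

tokens = ["[CLS]", "good", "movie", "[SEP]", "bad", "plot", "[SEP]"]
padding = max_len - len(tokens)

# Token ids, padded with the [PAD] id.
word_ids = [toy_vocab[t] for t in tokens] + [toy_vocab["[PAD]"]] * padding

# Mask: 1 for real tokens, 0 for padding.
mask = [1] * len(tokens) + [0] * padding

# Segment ids: 0 up to and including the first [SEP], 1 afterwards, 0 for padding.
first_sep = tokens.index("[SEP]")
segments = [0 if i <= first_sep else 1 for i in range(len(tokens))] + [0] * padding

print(word_ids)  # [1, 3, 4, 2, 5, 6, 2, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

All three lists have the same length, which is what lets them be fed to the model side by side.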

In [None]:
max_seq_length = 128  # Your choice here.
# Each input holds one sequence of max_seq_length integer ids per example.
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

In [None]:
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])

Generating segments and masks based on the original BERT

In [None]:
# See BERT paper: https://arxiv.org/pdf/1810.04805.pdf
# And BERT implementation convert_single_example() at https://github.com/google-research/bert/blob/master/run_classifier.py

###############################
# RDS  modifications to these #
# functions to simply return  #
# numpy arrays                #
###############################

def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return np.array([1]*len(tokens) + [0] * (max_seq_length - len(tokens)))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return np.array(segments + [0] * (max_seq_length - len(tokens)))


def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return np.array(input_ids)

Import tokenizer using the original vocab file

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

## Test BERT embedding generator model

In [None]:
s = "This movie is bad"

Tokenizing the sentence

In [None]:
stokens = tokenizer.tokenize(s)
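
`FullTokenizer` first normalizes and splits the text (lowercasing it for uncased models), then breaks each word into WordPiece sub-tokens. The sketch below shows the greedy longest-match-first idea behind that second step on a toy vocabulary. `toy_vocab` and `wordpiece` are illustrative names, not part of the `bert` package, and the real tokenizer handles more cases (punctuation, accents, a maximum word length).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-tokens by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "movie"}  # hypothetical vocabulary
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("movie", toy_vocab))      # ['movie']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```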

Adding separator tokens according to the paper

In [None]:
stokens = ["[CLS]"] + stokens + ["[SEP]"]
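
`get_masks` and `get_segments` raise an `IndexError` when the token list is longer than `max_seq_length`, so longer texts need to be shortened before the special tokens are added. Here is a minimal sketch, assuming that simple head truncation (keeping the first tokens) is acceptable; `add_special_tokens` is a hypothetical helper name.

```python
def add_special_tokens(tokens, max_seq_length):
    """Truncate tokens so that [CLS] + tokens + [SEP] fits in max_seq_length."""
    budget = max_seq_length - 2  # reserve two slots for [CLS] and [SEP]
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]


short = add_special_tokens(["this", "movie", "is", "bad"], max_seq_length=128)
long = add_special_tokens(["word"] * 500, max_seq_length=128)
print(short)      # ['[CLS]', 'this', 'movie', 'is', 'bad', '[SEP]']
print(len(long))  # 128
```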

Get the model inputs from the tokens

In [None]:
input_ids = get_ids(stokens, tokenizer, max_seq_length)
input_masks = get_masks(stokens, max_seq_length)
input_segments = get_segments(stokens, max_seq_length)

In [None]:
print(stokens)
print(input_ids)
print(input_masks)
print(input_segments)

['[CLS]', 'this', 'movie', 'is', 'bad', '[SEP]']
[ 101 2023 3185 2003 2919  102    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0   

Generate Embeddings using the pretrained model

In [None]:
# Expect a shape warning. I believe this is due to eager execution, but I am not sure.
pool_embs, all_embs = model.predict([[input_ids],[input_masks],[input_segments]])
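
The call above wraps each array in a list to add the batch dimension. An alternative sketch, assuming several examples have already been padded to the same length, is to stack them into arrays of shape `(batch_size, max_seq_length)` with NumPy. The ids below are placeholders rather than real tokenizer output.

```python
import numpy as np

# Placeholder padded inputs for two examples (not real tokenizer output).
ids_a = np.array([101, 7, 8, 102, 0, 0, 0, 0], dtype=np.int32)
ids_b = np.array([101, 9, 102, 0, 0, 0, 0, 0], dtype=np.int32)

# Stack along a new leading axis to form the batch dimension.
batch_ids = np.stack([ids_a, ids_b])

# Deriving the mask from the ids is valid only because the padding id is 0.
batch_masks = (batch_ids != 0).astype(np.int32)

# Single-sentence inputs: every segment id is 0.
batch_segments = np.zeros_like(batch_ids)

print(batch_ids.shape)  # (2, 8)
# The three arrays could then be passed as
# model.predict([batch_ids, batch_masks, batch_segments]).
```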

## Pooled embedding vs [CLS] as sentence-level representation

Previously, the embedding of the [CLS] token was used as the sentence-level representation (see the original paper). Here, however, a pooled embedding is also available. This section briefly compares the two embeddings using cosine similarity.

In [None]:
def square_rooted(x):
    return math.sqrt(sum([a*a for a in x]))


def cosine_similarity(x,y):
    numerator = sum(a*b for a,b in zip(x,y))
    denominator = square_rooted(x)*square_rooted(y)
    return numerator/float(denominator)
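
Since the embeddings are NumPy arrays, the same quantity can also be computed with vectorized operations. The sketch below is an equivalent formulation (named `cosine_similarity_np` here to avoid clashing with the function above), checked on small vectors whose similarity is known.

```python
import numpy as np


def cosine_similarity_np(x, y):
    """Cosine similarity of two 1-D vectors using NumPy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


print(cosine_similarity_np([1, 0], [0, 1]))   # orthogonal vectors: 0.0
print(cosine_similarity_np([1, 2], [2, 4]))   # parallel vectors: 1.0 up to rounding
print(cosine_similarity_np([1, 0], [-1, 0]))  # opposite vectors: -1.0
```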

In [None]:
cosine_similarity(pool_embs[0], all_embs[0][0])

0.014019671002449191

In [None]:

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 512)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 512)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 512)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [None]:
pool_embs.shape

(512, 768)

In [None]:
all_embs.shape

(512, 1, 768)

# Assignment

Take the IMDB data set. Convert the reviews back into text, then create the inputs to be fed into BERT.

1.   Load imdb dataset
2.   Convert integers from imdb dictionary to text
3.   Tokenize and convert the text to integers for BERT
4.   Create text Masks for BERT
5.   Create text Segments for BERT
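
One detail is worth knowing for step 2. With the default arguments of `tf.keras.datasets.imdb.load_data`, the integers in each review are shifted by 3 relative to the dictionary returned by `get_word_index()`, because 0, 1 and 2 are reserved for padding, start-of-sequence and unknown words. The sketch below shows the decoding on a tiny hand-made dictionary; `toy_word_index` is an assumption standing in for the real one, and the marker strings are arbitrary labels.

```python
# Hypothetical stand-in for tf.keras.datasets.imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "was": 3, "bad": 4}

INDEX_FROM = 3  # default offset applied by imdb.load_data
reverse_index = {idx + INDEX_FROM: word for word, idx in toy_word_index.items()}
# Reserved ids; the label strings are our own choice.
reverse_index.update({0: "<pad>", 1: "<start>", 2: "<unk>"})


def decode_review(encoded):
    """Turn a list of integers back into a text string."""
    words = [reverse_index.get(i, "<unk>") for i in encoded]
    # Drop the padding and start markers before handing the text to BERT.
    return " ".join(w for w in words if w not in ("<pad>", "<start>"))


print(decode_review([1, 4, 5, 6, 7]))  # the movie was bad
```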


