In this notebook we will use a pretrained BERT model from TF Hub, together with the Keras Functional API (the model has three inputs, so the Sequential API is not enough), to solve the IMDB review classification task.

![](https://drive.google.com/uc?export=view&id=1LfLDPCHlovwwChNGPq8ArExk1uUshuVM)

![](https://drive.google.com/uc?export=view&id=1MdNRG2Yt1OqxiPp3UsHdCW9-2lMkvzG_)

In [0]:
!pip install bert-for-tf2

In [0]:
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

from bert import bert_tokenization

# Download IMDB dataset

BERT uses its own tokenizer, so we have to prepare the input tensors from the raw review text.

In [0]:
train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], 
                                  batch_size=-1, as_supervised=True)

train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

In [0]:
print("Training entries: {}, test entries: {}".format(len(train_examples), len(test_examples)))

In [0]:
train_examples[:10]

In [0]:
train_labels[:10]

# Define BERT layer from TF Hub

[BERT TF Hub documentation](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2)

BERT takes 3 tensors as input:

*   input_word_ids - Tensor with the token ids of every sequence in the batch.
*   input_mask - Tensor with information about text padding. The mask equals 0 for padding tokens and 1 otherwise.
*   segment_ids - Tensor with the segment id (0 or 1) of every token. Segment ids are used during BERT pretraining to tell the first sentence of a pair from the second. In our text classification task they are always 0.
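
To make the three inputs concrete, here is a minimal sketch for a single short sequence padded to six positions. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` entries of the uncased BERT vocabulary; the two word ids in between are made-up placeholders, not real lookups.

```python
# Toy illustration of the three BERT inputs for one padded sequence.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
max_seq_length = 6

# Hypothetical word ids wrapped in the special tokens
token_ids = [CLS_ID, 2001, 2002, SEP_ID]
padding = [PAD_ID] * (max_seq_length - len(token_ids))

input_word_ids = token_ids + padding
# 1 for real tokens, 0 for padding
input_mask = [1 if i != PAD_ID else 0 for i in input_word_ids]
# Single-segment input: every position belongs to segment 0
segment_ids = [0] * max_seq_length

print(input_word_ids)
print(input_mask)
print(segment_ids)
```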


BERT returns 2 tensors as output:

*   pooled_output - Tensor of shape [batch_size, 768] with one representation per input sequence.
*   sequence_output - Tensor of shape [batch_size, max_seq_length, 768] with representations for each input token (in context).
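
The relation between the two outputs can be sketched in NumPy. In the BERT architecture the pooled output is computed from the hidden state of the first token (`[CLS]`) through a dense layer with a tanh activation. The arrays below are random stand-ins, and `pooler_kernel` and `pooler_bias` are hypothetical names, so this only illustrates the shapes and the computation, not real activations.

```python
import numpy as np

batch_size, max_seq_length, hidden_size = 2, 5, 768
rng = np.random.default_rng(0)

# Stand-in for BERT's per-token output (random numbers, not real activations)
sequence_output = rng.normal(size=(batch_size, max_seq_length, hidden_size))

# Hypothetical pooler weights; in the real model these are learned
pooler_kernel = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
pooler_bias = np.zeros(hidden_size)

# Hidden state of the first token ([CLS]) of every sequence
first_token = sequence_output[:, 0, :]
pooled_output = np.tanh(first_token @ pooler_kernel + pooler_bias)

print(pooled_output.shape)
print(sequence_output.shape)
```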

In [0]:
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2",
                            trainable=True)

# Prepare data tensors

BERT uses its own tokenization method, so we cannot use the pre-tokenized tf.keras IMDB dataset and have to start from the raw text. However, the TF Hub module exposes its vocabulary file and lowercasing setting, which, together with the tokenizer from bert-for-tf2, let us build input tensors with the word ids that BERT expects.

In the following code snippet we use the BERT tokenizer to split the raw text into tokens and then convert these tokens into id vectors (which are consumed by the embedding layers).
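
For intuition about what the tokenizer does, here is a simplified sketch of WordPiece-style tokenization: each word is split greedily into the longest pieces found in the vocabulary, and pieces that continue a word get a `##` prefix. The vocabulary `toy_vocab` and the helper names are made up for this illustration; the real tokenizer also normalizes text and splits punctuation, which is skipped here.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Tiny made-up vocabulary (the real BERT vocabulary has about 30,000 entries)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "un": 4, "##aff": 5, "##able": 6, "movie": 7}


def encode(text):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens += wordpiece(word, toy_vocab)
    tokens.append("[SEP]")
    return tokens, [toy_vocab[t] for t in tokens]


tokens, ids = encode("Unaffable movie")
print(tokens)
print(ids)
```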

In [0]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

In [0]:
tokenizer = bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [0]:
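def tokenize_text(txt):
    # Split raw text into WordPiece tokens
    tokenized_txt = tokenizer.tokenize(txt)
    # Add the special tokens that BERT expects at both ends of the sequence
    tokenized_txt = ["[CLS]"] + tokenized_txt + ["[SEP]"]
    # Map tokens to their ids in the BERT vocabulary
    tokenized_txt = tokenizer.convert_tokens_to_ids(tokenized_txt)
    return tokenized_txt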

In [0]:
train_examples = list(map(tokenize_text, train_examples))
test_examples = list(map(tokenize_text, test_examples))

In the following code snippet we pad (or truncate) the training and testing sequences to a maximum sequence length of 100. Then we prepare the arrays with token masks and segment ids.

*   The token mask is an array that contains information about text padding. The mask equals 0 for padding tokens and 1 otherwise.
*   Segment ids are used during BERT pretraining tasks. In our text classification task they are always 0.
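
The padding, mask and segment logic can be sketched in plain Python on two made-up id sequences. `pad_or_truncate` is a hypothetical helper written for this illustration; the notebook itself uses `pad_sequences` from Keras.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut the sequence at max_len tokens or right-pad it with pad_id."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


# Made-up token ids: one short sequence and one that is too long
sequences = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
max_len = 5

padded = [pad_or_truncate(s, max_len) for s in sequences]
# Mask is 1 for real tokens and 0 for padding
masks = [[int(t != 0) for t in row] for row in padded]
# Single-sentence classification: all segment ids are 0
segments = [[0] * max_len for _ in padded]

print(padded)
print(masks)
print(segments)
```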



In [0]:
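# truncating='post' keeps the leading [CLS] token and matches the test set below
x_train = tf.keras.preprocessing.sequence.pad_sequences(train_examples, maxlen=100, padding='post', truncating='post')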

In [0]:
max_seq_length = x_train.shape[1]
x_test = tf.keras.preprocessing.sequence.pad_sequences(test_examples, maxlen=max_seq_length, padding='post', truncating='post')

In [0]:
train_mask = (x_train != 0).astype(int)
test_mask = (x_test != 0).astype(int)

train_segments = np.zeros(x_train.shape, dtype=int)
test_segments = np.zeros(x_test.shape, dtype=int)

In [0]:
print(x_train.shape, train_mask.shape, train_segments.shape)
print(x_test.shape, test_mask.shape, test_segments.shape)

# Define input and output layers

We have to define 3 input layers, one for each input passed to BERT: input_word_ids, input_mask, segment_ids.

BERT returns 2 tensors as output: pooled_output and sequence_output. We take pooled_output, which is the vector embedding of the whole sequence, and pass it to the classification layer.
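
As an illustration of what a classification layer on top of the pooled output computes, the NumPy sketch below applies a single sigmoid unit to a stand-in `pooled_output`. A one-unit sigmoid head is a common choice for binary labels such as IMDB's, but it is an assumption here, not necessarily the layer this notebook intends; `kernel` and `bias` are hypothetical, randomly initialized weights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
batch_size, hidden_size = 4, 768

# Stand-in for BERT's pooled output (random values in [-1, 1])
pooled_output = np.tanh(rng.normal(size=(batch_size, hidden_size)))

# Hypothetical weights of a one-unit dense layer; learned during training
kernel = rng.normal(scale=0.02, size=(hidden_size, 1))
bias = np.zeros(1)

# Probability of the positive class for every review in the batch
probabilities = sigmoid(pooled_output @ kernel + bias)
predicted_labels = (probabilities > 0.5).astype(int)

print(probabilities.shape)
print(predicted_labels.ravel())
```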

In [0]:
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")

bert_inputs = [input_word_ids, input_mask, segment_ids]
pooled_output, sequence_output = bert_layer(bert_inputs)

In [0]:
final_output = ###

# Define and compile model

In [0]:
model = ###

In [0]:
model.summary()

In [0]:
model.compile(###)

# Train model

Remember that BERT takes 3 tensors as input.
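
A small sketch of how the three arrays could be bundled before training. `make_bert_inputs` is a hypothetical helper, not part of Keras or TF Hub; it only casts the arrays to int32 and checks that they share one shape, returning them in the same order as the input layers defined above.

```python
import numpy as np


def make_bert_inputs(word_ids, mask, segments):
    """Return [word_ids, mask, segments] as int32 arrays of identical shape."""
    arrays = [np.asarray(a, dtype=np.int32) for a in (word_ids, mask, segments)]
    shapes = {a.shape for a in arrays}
    if len(shapes) != 1:
        raise ValueError(f"input shapes differ: {shapes}")
    return arrays


# Tiny made-up batch: 2 sequences of length 4
toy_ids = [[101, 7, 102, 0], [101, 8, 9, 102]]
toy_mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
toy_segments = np.zeros((2, 4))

inputs = make_bert_inputs(toy_ids, toy_mask, toy_segments)
print([a.shape for a in inputs])
```

With the notebook's arrays, the list returned by `make_bert_inputs(x_train, train_mask, train_segments)` would be passed as the first argument of `model.fit`, with `train_labels` as the targets.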

In [0]:
model.fit(###)