# Binary text classification using BERT models from TF Hub

This notebook demonstrates fine-tuning BERT models from [TF Hub](https://tfhub.dev) on the [IMDb movie review dataset from TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/imdb_reviews) for sentiment analysis.

The notebook performs the following steps:
1. [Install dependencies and setup parameters](#1.-Install-dependencies-and-setup-parameters)
2. [Prepare the dataset](#2.-Prepare-the-dataset)
3. [Build the model](#3.-Build-the-model)
4. [Fine tuning and evaluation](#4.-Fine-tuning-and-evaluation)
5. [Export the model](#5.-Export-the-model)
6. [Reload the model and make predictions](#6.-Reload-the-model-and-make-predictions)

## 1. Install dependencies and setup parameters

The notebook assumes that you have already followed the README.md instructions to install Intel-optimized TensorFlow or to use the Intel-optimized TensorFlow Jupyter Docker container. The next cell installs the additional packages needed to run the notebook.

In [None]:
!pip install --upgrade -q pip
!pip install -q ipywidgets==7.6.5 \
                tensorflow-hub==0.12.0 \
                tensorflow-datasets==4.5.2 \
                'pandas>=1.1.5'
!pip install --no-deps -q tensorflow-text

In [None]:
import os
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Note that tensorflow_text isn't used directly but the import is required to register ops used by the
# BERT text preprocessor
import tensorflow_text

from bert_utils import get_model_map

This notebook runs one of the supported [BERT models from TF Hub](https://tfhub.dev/google/collections/bert/1). The table below lists the available models with links to their TF Hub URLs.

In [None]:
# Load the TF Hub model map from json and print a list of the supported models
tfhub_model_map, models_df = get_model_map("tfhub_bert_model_map_classifier.json", return_data_frame=True)
models_df.style.hide(axis="index")

Specify the name of the BERT model to use. This string must match one of the models listed in the table above.

In [None]:
model_name = "small_bert/bert_en_uncased_L-2_H-128_A-2"
if model_name not in tfhub_model_map:
    raise ValueError("The specified model name ({}) is not supported".format(model_name))
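
When a name is mistyped, suggesting the closest supported name can save a trip back to the table. The following is a minimal sketch, assuming the model map is a dictionary keyed by model name with `preprocess` and `bert_encoder` entries. The `example_model_map` contents, its placeholder URLs, and the `validate_model_name` helper are hypothetical and are not part of `bert_utils`.

```python
import difflib

# Hypothetical stand-in for the map returned by get_model_map(); the URLs are placeholders
example_model_map = {
    "small_bert/bert_en_uncased_L-2_H-128_A-2": {
        "preprocess": "https://example.com/preprocess",
        "bert_encoder": "https://example.com/encoder",
    },
    "bert_en_uncased_L-12_H-768_A-12": {
        "preprocess": "https://example.com/preprocess",
        "bert_encoder": "https://example.com/encoder",
    },
}


def validate_model_name(name, model_map):
    """Return name if it is a key of model_map, otherwise raise with close matches."""
    if name in model_map:
        return name
    suggestions = difflib.get_close_matches(name, list(model_map), n=3, cutoff=0.6)
    hint = " Did you mean: {}?".format(", ".join(suggestions)) if suggestions else ""
    raise ValueError("The specified model name ({}) is not supported.{}".format(name, hint))


print(validate_model_name("small_bert/bert_en_uncased_L-2_H-128_A-2", example_model_map))
```

The same check could be applied to `tfhub_model_map` in place of the plain membership test if friendlier error messages are wanted.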

In [None]:
# Define a working directory where the dataset will be downloaded
if "WORKING_DIR" in os.environ and os.environ["WORKING_DIR"] != "":
    working_dir = os.environ["WORKING_DIR"]
else:
    working_dir = input("Path to a working directory (to download datasets): ")

# Define an output directory for the saved model to be exported
if "OUTPUT_DIR" in os.environ and os.environ["OUTPUT_DIR"] != "":
    output_dir = os.environ["OUTPUT_DIR"]
else:
    output_dir = input("Path to an output directory (for the saved model): ")

# Create the output directory if it doesn't already exist
os.makedirs(output_dir, exist_ok=True)
    
tfhub_preprocess = tfhub_model_map[model_name]["preprocess"]
tfhub_bert_encoder = tfhub_model_map[model_name]["bert_encoder"]

print("Using TF Hub model:", model_name)
print("BERT encoder URL:", tfhub_bert_encoder)
print("Preprocessor URL:", tfhub_preprocess)
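
For non-interactive runs, where an `input()` prompt would block, the environment-variable lookup can be wrapped in a small helper with a fallback path. This is only a sketch: the `resolve_directory` helper and the `EXAMPLE_UNSET_DIR` variable name are hypothetical, and the temporary directory is used purely for demonstration.

```python
import os
import tempfile


def resolve_directory(env_var, default=None):
    """Return the directory named by env_var, falling back to default, and create it."""
    path = os.environ.get(env_var) or default
    if not path:
        raise ValueError("Set the {} environment variable or pass a default".format(env_var))
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration using a temporary directory as the fallback location
with tempfile.TemporaryDirectory() as tmp:
    example_dir = resolve_directory("EXAMPLE_UNSET_DIR", default=os.path.join(tmp, "output"))
    print(os.path.isdir(example_dir))
```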

## 2. Prepare the dataset

Load the dataset using the TensorFlow Datasets library and get the training, validation, and test splits defined in `tfds_config` below. The [tfds.load()](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) function downloads the dataset if it is not found in the dataset directory. Subsequent runs reuse the dataset that was downloaded the first time.

In [None]:
# Define configs used for TensorFlow datasets
tfds_config = {
    # Name of the tensorflow dataset to use (ex: imdb_reviews)
    # For a full list of options see: https://www.tensorflow.org/datasets/catalog/overview
    "tfds_name": "imdb_reviews",
    # Define the splits used for the train, validation, and test dataset
    # A larger amount of training data can result in better accuracy, but will have a longer training time
    # The validation split uses a range of "train" that does not overlap the training split
    "train_split": "train[:50%]",
    "val_split": "train[50%:70%]",
    "test_split": "test[:20%]"
}

# Training batch size
batch_size = 32

def get_dataset_from_tfds(dataset_dir, configs):
    return tfds.load(configs["tfds_name"],
                     data_dir=dataset_dir,
                     split=[configs["train_split"], configs["val_split"], configs["test_split"]],
                     batch_size=batch_size,
                     as_supervised=True,
                     shuffle_files=True,
                     with_info=True)

# Location where the dataset will be downloaded (under the working directory)
dataset_dir = os.path.join(working_dir, tfds_config["tfds_name"])
os.makedirs(dataset_dir, exist_ok=True)

[train_ds, val_ds, test_ds], info = get_dataset_from_tfds(dataset_dir, tfds_config)
print(info)
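
The split strings use the TFDS slicing syntax, where `train[:50%]` selects the first half of the `train` split. The sketch below approximates how percent slices map to example index ranges, which makes it easy to check whether two slices share examples (a validation slice that overlaps the training slice would leak training data into validation). It assumes the 25,000 examples of the `imdb_reviews` train split, handles only the simple `name[a%:b%]` form, and uses floor division, so TFDS's own rounding may differ. The helper names are hypothetical and not part of TFDS.

```python
import re

IMDB_TRAIN_EXAMPLES = 25000  # size of the imdb_reviews "train" split


def split_to_range(split, num_examples):
    """Convert a percent slice such as 'train[10%:30%]' into (name, start, stop)."""
    match = re.fullmatch(r"(\w+)\[(?:(\d+)%)?:(?:(\d+)%)?\]", split)
    if match is None:
        raise ValueError("Unsupported split string: {}".format(split))
    name, start_pct, stop_pct = match.groups()
    start = num_examples * int(start_pct) // 100 if start_pct else 0
    stop = num_examples * int(stop_pct) // 100 if stop_pct else num_examples
    return name, start, stop


def splits_overlap(split_a, split_b, num_examples):
    """Return True if two slices of the same named split share any examples."""
    name_a, start_a, stop_a = split_to_range(split_a, num_examples)
    name_b, start_b, stop_b = split_to_range(split_b, num_examples)
    return name_a == name_b and max(start_a, start_b) < min(stop_a, stop_b)


print(split_to_range("train[:50%]", IMDB_TRAIN_EXAMPLES))
print(splits_overlap("train[:50%]", "train[:20%]", IMDB_TRAIN_EXAMPLES))
print(splits_overlap("train[:50%]", "train[50%:70%]", IMDB_TRAIN_EXAMPLES))
```

The second call reports an overlap because both slices start at the beginning of `train`, while the third reports none because the second slice starts where the first ends.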

## 3. Build the model

Create the BERT model to fine-tune using an input layer, the preprocessing layer (from TF Hub), the BERT encoder layer (from TF Hub), a dropout layer, and one dense layer that outputs a single logit.

In [None]:
input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name='input_layer')
preprocessing_layer = hub.KerasLayer(tfhub_preprocess, name='preprocessing')
encoder_inputs = preprocessing_layer(input_layer)
encoder_layer = hub.KerasLayer(tfhub_bert_encoder, trainable=True, name='encoder')
outputs = encoder_layer(encoder_inputs)
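# Use the pooled representation of the whole input sequence
net = outputs['pooled_output']
net = tf.keras.layers.Dropout(0.1)(net)
# Single output unit with no activation, so the model returns a logit
net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)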
classifier_model = tf.keras.Model(input_layer, net)

classifier_model.summary()

## 4. Fine tuning and evaluation

Train the model for the specified number of epochs, then evaluate the model using the test dataset.

In [None]:
%%time

# The number of training epochs to run
num_train_epochs = 2

# Learning rate
learning_rate = 3e-5

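# Maximum total input sequence length after WordPiece tokenization (longer sequences will be truncated)
# Note: this value is informational only; it is not passed to the preprocessing layer in this notebook,
# which applies its own default sequence length
max_seq_length = 128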

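# The model outputs logits, so threshold accuracy at 0.0 (equivalent to a probability of 0.5)
classifier_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08),
                         loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                         metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])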

history = classifier_model.fit(train_ds, validation_data=val_ds, epochs=num_train_epochs)
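
Setting `from_logits=True` tells the loss that the `classifier` layer emits raw scores rather than probabilities. The sketch below, in plain Python rather than the Keras implementation itself, computes binary cross-entropy directly from a logit using the numerically stable form documented for `tf.nn.sigmoid_cross_entropy_with_logits`, and compares it with the textbook formula applied to the sigmoid probability. The helper names and sample values are illustrative only.

```python
import math


def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(probability, label):
    """Textbook binary cross-entropy for one probability and a 0/1 label."""
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))


# Made-up (logit, label) pairs; both formulas should agree up to rounding error
for logit, label in [(2.0, 1), (-1.5, 0), (0.3, 0)]:
    print(logit, label, bce_from_logit(logit, label), bce_from_probability(sigmoid(logit), label))
```

Working from logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1, which is why the stable form stays finite for very large positive or negative logits.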

Evaluate the accuracy using the test dataset. If the accuracy does not meet your expectations, try increasing the size of the training dataset split or the number of training epochs.

In [None]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
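
Accuracy for a model that emits logits depends on where the decision threshold sits: a logit of 0 corresponds to a probability of 0.5, while a logit of 0.5 corresponds to a probability of about 0.62. The toy sketch below uses made-up labels and logits and a hypothetical `binary_accuracy` helper (a plain-Python stand-in, not the Keras metric) to show how the same predictions score under the two thresholds.

```python
def binary_accuracy(labels, scores, threshold):
    """Fraction of scores whose thresholded value matches the 0/1 label."""
    predicted = [1 if score > threshold else 0 for score in scores]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)


# Made-up values for illustration; these are not outputs of the fine-tuned model
example_labels = [1, 1, 0, 0]
example_logits = [2.0, 0.3, -1.0, -0.2]

print(binary_accuracy(example_labels, example_logits, threshold=0.5))
print(binary_accuracy(example_labels, example_logits, threshold=0.0))
```

The logit of 0.3 is a correct positive prediction at a threshold of 0.0 but is counted as negative at a threshold of 0.5, so the reported accuracy differs even though the model's outputs are identical.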

## 5. Export the model

Now that training has completed, export the model in the SavedModel format (`saved_model.pb`) to a folder in the output directory named after the model and dataset.

In [None]:
model_dir = "{}_{}".format(model_name, tfds_config["tfds_name"])
model_dir = os.path.join(output_dir, model_dir)
classifier_model.save(model_dir, include_optimizer=False)

saved_model_path = os.path.join(model_dir, "saved_model.pb")
if os.path.exists(saved_model_path):
    print("Saved model location:", saved_model_path)

## 6. Reload the model and make predictions

Reload the model from the folder containing `saved_model.pb` in the output directory.

In [None]:
reloaded_model = tf.saved_model.load(model_dir)

The next cell defines a list of strings to send as input to the reloaded model. If you are using a dataset other than the [IMDb movie reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews), update the snippet below with your own list of input text.

In [None]:
if tfds_config["tfds_name"] == "imdb_reviews":
    input_text = ["Awesome movie",
                  "It was entertaining, but completely predictable.",
                  "Wasn't what I expected, but I still enjoyed it",
                  "I wouldn't recommend this movie to my worst enemy",
                  "I'm not sure how good the movie was, because I fell asleep"]
else:
    # Define your own list of input text
    input_text = []
    
if not input_text:
    raise ValueError("Please define the list of input_text strings.")
    
predict_results = tf.sigmoid(reloaded_model(tf.constant(input_text)))

result_list = [[input_text[i], tf.get_static_value(predict_results[i])[0]] for i in range(len(input_text))]
result_df = pd.DataFrame(result_list, columns=["Input Text", "Score"])
result_df.style.hide(axis="index")
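
The scores in the table are sigmoid outputs between 0 and 1. For `imdb_reviews`, label 0 is negative and label 1 is positive, so a score can be turned into a readable label with a cutoff. A minimal sketch, assuming a cutoff of 0.5; the `score_to_label` helper and the sample scores are hypothetical and are not output from the fine-tuned model.

```python
def score_to_label(score, threshold=0.5):
    """Map a sigmoid score in [0, 1] to a sentiment label."""
    return "positive" if score >= threshold else "negative"


# Made-up scores for illustration only
example_scores = [0.97, 0.41, 0.08]
for score in example_scores:
    print("{:.2f} -> {}".format(score, score_to_label(score)))
```

A label column could be added to `result_df` in the same way by applying the helper to the `Score` column.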

## Citations

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```