# Sentiment Analysis on IMDB dataset

This notebook walks through how perform sentament analysis on the IMDB dataset.
In this setting, one party has the reviews and the other party has the labels.
The party with the labels is helping the party with the reviews train a model
without sharing the labels themselves.

Before starting, install tf-shell and the dataset.

```bash
pip install tf-shell
pip install tensorflow_hub tensorflow_datasets
```

In [1]:
import time
from datetime import datetime
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import keras
import numpy as np
import tf_shell
import tf_shell_ml
import os

2024-06-10 23:49:11.867640: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-10 23:49:11.867982: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-10 23:49:11.869644: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-10 23:49:11.895495: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Set up parameters for the SHELL encryption library.
context = tf_shell.create_context64(
    log_n=12,
    main_moduli=[288230376151760897, 288230376152137729],
    plaintext_modulus=4294991873,
    scaling_factor=3,
    mul_depth_supported=3,
    seed="test_seed",
)

# Create the secret key for encryption and a rotation key (rotation key is
# an auxilary key required for operations like roll or matmul).
secret_key = tf_shell.create_key64(context)
public_rotation_key = tf_shell.create_rotation_key64(context, secret_key)

# The batch size is determined by the ciphertext parameters, specifically the
# schemes polynomial's ring degree because tf-shell uses batch axis packing.
# Furthermore, two micro-batches to run in parallel.
batch_size = context.num_slots

Setup IMDB dataset.

In [3]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, val_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

# Print the first example.
for review, label in train_data.take(1):
    print("Review:", review.numpy().decode('utf-8'))
    print("Label:", label.numpy())

epochs = 3
train_data = train_data.shuffle(buffer_size=2048).batch(batch_size, drop_remainder=True).repeat(count=epochs)
val_data = val_data.shuffle(buffer_size=2048).batch(batch_size, drop_remainder=True)
test_data = test_data.shuffle(buffer_size=2048).batch(batch_size, drop_remainder=True)

vocab_size = 50000  # This dataset has 92061 unique words.
max_length = 200
embedding_dim = 50

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=max_length)
    # TODO use pad_to_max_tokens instead of output_sequence_length?

vectorize_layer.adapt(train_data.map(lambda text, label: text))

print("Most used words:", vectorize_layer.get_vocabulary()[:10])
print("Dictionary size:", len(vectorize_layer.get_vocabulary()))

2024-06-10 23:49:58.773005: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".


[1mDownloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/vscode/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...[0m


Dl Size...: 100%|██████████| 80/80 [00:13<00:00,  5.97 MiB/s]rl]
Dl Completed...: 100%|██████████| 1/1 [00:13<00:00, 13.39s/ url]
2024-06-10 23:51:28.949194: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2024-06-10 23:51:28.949578: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


[1mDataset imdb_reviews downloaded and prepared to /home/vscode/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.[0m
Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
Label: 0


2024-06-10 23:51:30.315651: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Most used words: ['', '[UNK]', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'it']
Dictionary size: 50000


In [4]:
# Create the trainable layers.
embedding_layer = tf_shell_ml.ShellEmbedding(
    vocab_size + 1,  # +1 for OOV token.
    embedding_dim,
)
# TODO dropout layer?
hidden_layer = tf_shell_ml.ShellDense(
    16,
    activation=tf_shell_ml.relu,
    activation_deriv=tf_shell_ml.relu_deriv,
)
# TODO dropout layer?
output_layer = tf_shell_ml.ShellDense(1,
    activation=tf.nn.softmax,
)

## Call the layers once to create the weights.
#y1 = hidden_layer(tf.zeros((batch_size, 784)))
#y2 = output_layer(y1)

loss_fn = tf_shell_ml.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(0.1)
emb_optimizer = tf.keras.optimizers.Adam(0.1)

Next, define the `train_step` function which will be called for each batch on an
encrypted batch of labels, y. The function first does a forward on the plaintext
image x to compute a predicted label, then does backpropagation using the
encrypted label y.

In [5]:
def train_step(x, enc_y):
    # Forward pass always in plaintext
    # y //= max_length  # Normalize for reduce_sum.
    y = embedding_layer(x)
    y = tf_shell.reshape(y, (batch_size, max_length * embedding_dim))
    # Could also do reduce_sum and division to replicate GlobalAveragePooling1D layer.
    # y = tf_shell.reduce_sum(y, axis=1)

    y = hidden_layer(y)
    y_pred = output_layer(y)

    # Backward pass.
    dJ_dy_pred = loss_fn.grad(enc_y, y_pred)
    dJ_dw2, dJ_dx2 = output_layer.backward(dJ_dy_pred, public_rotation_key)
    dJ_dw1, dJ_dx1 = hidden_layer.backward(dJ_dx2, public_rotation_key)

    dJ_dx1_reshaped = tf_shell.reshape(dJ_dx1, (batch_size, max_length, embedding_dim))
    # Could also tile up to this shape to replicate GlobalAveragePooling1D layer.
    # dJ_dx1_reshaped = tf_shell.broadcast_to(
    #     dJ_dx1, (batch_size, max_length, embedding_dim)
    # )

    embedding_layer.backward_accum(dJ_dx1_reshaped, public_rotation_key)

    # dJ_dw0, _ = embedding_layer.backward(dJ_dx1_reshaped, public_rotation_key)

    # dJ_dw0, the embedding layer gradient, would usually have outer shape [1]
    # for the 1 output classes. tf-shell instead back propagates in two
    # mini-batches per batch resulting in two gradients of shape [2].
    # Furthermore, the gradients are in an "expanded" form where the gradient is
    # repeated by the size of the batch. Said another way, if
    # real_grad_top/bottom is the "real" gradient of shape [10] from the
    # top/bottom halves of the batch:
    #
    # dJ_dw = tf.concat([
    #   tf.repeat(
    #       tf.expand_dims(real_grad_top, 0), repeats=[batch_sz // 2], axis=0
    #   ),
    #   tf.repeat(
    #       tf.expand_dims(real_grad_bottom, 0), repeats=[batch_sz // 2], axis=0
    #   )
    # ])
    #
    # This repetition is result of the SHELL library using a packed
    # representation of ciphertexts for efficiency. As such, if the ciphertexts
    # need to be sent over the network, they may be masked and packed together
    # before being transmitted to the party with the key.
    #
    # Only return the weight gradients at [0], not the bias gradients at [1].
    # The bias is not used in this test.
    # return [dJ_dw2[0], dJ_dw1[0], dJ_dw0[0]]
    return [dJ_dw2[0], dJ_dw1[0]]


@tf.function
def train_step_wrapper(x_batch, y_batch):
    # Encrypt the batch of secret labels y.
    enc_y_batch = tf_shell.to_encrypted(y_batch, secret_key, context)

    # Run the training step. The top and bottom halves of the batch are
    # treated as two separate mini-batches run in parallel to maximize
    # efficiency.
    enc_grads = train_step(x_batch, enc_y_batch)

    # Decrypt the weight gradients. In practice, the gradients should be
    # noised before decrypting.
    repeated_grads = [tf_shell.to_tensorflow(g, secret_key) for g in enc_grads]

    # Pull out grads from the top and bottom batches.
    top_grad = [g[0] for g in repeated_grads]
    bottom_grad = [g[batch_size // 2] for g in repeated_grads]

    # Apply the gradients to the model.
    weights = output_layer.weights + hidden_layer.weights
    optimizer.apply_gradients(zip(top_grad, weights))
    optimizer.apply_gradients(zip(bottom_grad, weights))

    # Apply the embedding layer gradient (contains both batches).
    # optimizer.apply_gradients(embedding_layer.decrypt_grad(secret_key), embedding_layer.weights)
    emb_optimizer.apply_gradients(zip([embedding_layer.decrypt_grad(secret_key)], embedding_layer.weights))

Here is the training loop. Each inner iteration runs two batches of size
$2^{12-1}$ simultaneously.

Tensorboard can be used to visualize the training progress. See cell output for
command to start tensorboard.

In [6]:
start_time = time.time()
tf.config.run_functions_eagerly(False)


def check_accuracy(dataset):
    average_loss = 0
    average_accuracy = 0
    for x, y in dataset:
        y = tf.cast(y, tf.float32)
        y = tf.reshape(y, (batch_size, 1))

        y_pred = vectorize_layer(x)
        # y_pred //= max_length  # Normalize for reduce_sum.
        y_pred = embedding_layer(y_pred)
        y_pred = tf_shell.reshape(y_pred, (batch_size, max_length * embedding_dim))
        # y_pred = tf_shell.reduce_sum(y_pred, axis=1)
        y_pred = hidden_layer(y_pred)
        y_pred = output_layer(y_pred)
        
        loss = tf.reduce_mean(loss_fn(y, y_pred))
        accuracy = tf.reduce_mean(tf.cast(tf.equal(y, tf.round(y_pred)), tf.float32))
        average_loss += loss
        average_accuracy += accuracy
    average_loss /= len(dataset)
    average_accuracy /= len(dataset)

    return average_loss, average_accuracy


# Set up tensorboard logging.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
logdir = os.path.abspath("") + "/tflogs/sentiment-%s" % stamp
print(f"To start tensorboard, run: tensorboard --logdir /tmp/tflogs --host 0.0.0.0")
print(f"\ttensorboard profiling requires: pip install tensorboard_plugin_profile")
writer = tf.summary.create_file_writer(logdir)

# Initial accuracy
loss, accuracy = check_accuracy(train_data)
tf.print(f"\ttrain loss: {loss}\taccuracy: {accuracy}")
loss, accuracy = check_accuracy(val_data)
tf.print(f"\tvalidation loss: {loss}\taccuracy: {accuracy}")

# Iterate over the batches of the dataset.
for step, (x_batch, y_batch) in enumerate(train_data.take(batch_size)):
    print(
        f"Step: {step} / {len(train_data)}, Time Stamp: {time.time() - start_time}"
    )

    y_batch = tf.cast(y_batch, tf.float32)
    y_batch = tf.reshape(y_batch, (batch_size, 1))

    if step == 0:
        tf.summary.trace_on(graph=True, profiler=True, profiler_outdir=logdir)

    x_batch = vectorize_layer(x_batch)  # No shape inference, do outside tf.function
    train_step_wrapper(x_batch, y_batch)

    if step == 0:
        with writer.as_default():
            tf.summary.trace_export(name="sentiment", step=step)

    loss, accuracy = check_accuracy(train_data)
    tf.print(f"\ttrain loss: {loss}\taccuracy: {accuracy}")
    loss, accuracy = check_accuracy(val_data)
    tf.print(f"\tvalidation loss: {loss}\taccuracy: {accuracy}")

    with writer.as_default():
        tf.summary.scalar("loss", loss, step=step)
        tf.summary.scalar("accuracy", accuracy, step=step)


print(f"Total training time: {time.time() - start_time} seconds")

To start tensorboard, run: tensorboard --logdir /tmp/tflogs --host 0.0.0.0
	tensorboard profiling requires: pip install tensorboard_plugin_profile
	train loss: nan	accuracy: 0.5001627802848816


2024-06-10 23:51:31.724473: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


	validation loss: nan	accuracy: 0.4974365234375


2024-06-10 23:51:32.024516: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-06-10 23:51:32.059464: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-06-10 23:51:32.059493: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.


Step: 0 / 9, Time Stamp: 1.4508495330810547
