# Sentiment Analysis on the IMDB Dataset

This notebook walks through how to perform sentiment analysis on the IMDB dataset.
In this setting, one party has the reviews and the other party has the labels.
The party with the labels is helping the party with the reviews train a model
without sharing the labels themselves.

Before starting, install tf-shell and the packages used to load the dataset.

```bash
pip install tf-shell
pip install tensorflow_hub tensorflow_datasets
```

In [1]:
import time
from datetime import datetime
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import keras
import numpy as np
import tf_shell
import tf_shell_ml
import os



In [2]:
# Set up parameters for the SHELL encryption library.
context = tf_shell.create_context64(
    log_n=12,
    main_moduli=[288230376151760897, 288230376152137729],
    plaintext_modulus=4294991873,
    scaling_factor=3,
    seed="test_seed",
)

# Create the secret key for encryption and a rotation key (the rotation key is
# an auxiliary key required for operations like roll or matmul).
secret_key = tf_shell.create_key64(context)
public_rotation_key = tf_shell.create_rotation_key64(context, secret_key)

# The batch size is determined by the ciphertext parameters, specifically the
# scheme's polynomial ring degree, because tf-shell uses batch-axis packing.
# Furthermore, each batch is split into two micro-batches that run in parallel.
batch_size = context.num_slots

use_encryption = True

Set up the IMDB dataset.

In [3]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, val_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

# Print the first example.
for review, label in train_data.take(1):
    print("Review:", review.numpy().decode('utf-8'))
    print("Label:", label.numpy())

epochs = 10
train_data = train_data.shuffle(buffer_size=2048).batch(batch_size, drop_remainder=True).repeat(count=epochs)
val_data = val_data.shuffle(buffer_size=2048).batch(batch_size, drop_remainder=True)
test_data = test_data.shuffle(buffer_size=2048).batch(batch_size, drop_remainder=True)

vocab_size = 10000  # This dataset has 92061 unique words.
max_length = 250
embedding_dim = 16

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
)

vectorize_layer.adapt(train_data.map(lambda text, label: text))

print("Most used words:", vectorize_layer.get_vocabulary()[:10])
print("Dictionary size:", len(vectorize_layer.get_vocabulary()))

# Count the top n words in the training set.
top_n = 200
word_counts = np.zeros(top_n, dtype=np.int64)
for review, label in train_data:
    vectorized_reviews = vectorize_layer(review)
    for i in range(len(word_counts)):
        counts = tf.where(vectorized_reviews == i, 1, 0)
        word_counts[i] += tf.reduce_sum(tf.cast(counts, dtype=tf.int64))

for i in range(len(word_counts)):
    print(f"Word {i} ({vectorize_layer.get_vocabulary()[i]}) count: {word_counts[i]}")

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
Label: 0
Most used words: ['', '[UNK]', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'it']
Dictionary size: 10000
Word 0 () c
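
To make the integer encoding concrete, here is a minimal pure-Python sketch of roughly what a text vectorization step does: standardize the text, build a frequency-ordered vocabulary, then map tokens to indices. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and the punctuation handling and tie-breaking are simplifications rather than the exact behavior of `TextVectorization`. Only the reserved indices (0 for padding, 1 for `[UNK]`) are taken from the vocabulary printed above.

```python
import re
from collections import Counter


def standardize(text):
    # Simplified standardization: lowercase and drop punctuation.
    return re.sub(r"[^\w\s]", "", text.lower())


def build_vocab(texts, max_tokens):
    # Count whitespace-separated tokens across the corpus.
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    # Index 0 is reserved for padding and index 1 for unknown tokens,
    # so only max_tokens - 2 real words fit in the vocabulary.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, length):
    # Map each token to its vocabulary index (1 if out of vocabulary),
    # then truncate or zero-pad to a fixed length.
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(token, 1) for token in standardize(text).split()]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))


toy_corpus = ["the the the movie movie plot"]
toy_vocab = build_vocab(toy_corpus, max_tokens=4)
toy_ids = vectorize("The plot, the movie!", toy_vocab, length=6)
print(toy_vocab)
print(toy_ids)
```

In this toy corpus, "plot" falls outside the four-entry vocabulary, so it maps to the `[UNK]` index, and the trailing zeros are padding.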

In [4]:
# Create the trainable layers.
embedding_layer = tf_shell_ml.ShellEmbedding(
    vocab_size + 1,  # +1 for OOV token.
    embedding_dim,
    skip_embeddings_below_index=top_n,
)
# TODO dropout layer?
hidden_layer = tf_shell_ml.GlobalAveragePooling1D()
# TODO dropout layer?
output_layer = tf_shell_ml.ShellDense(
    2,
    activation=tf.nn.softmax,
)

layers = [
    embedding_layer,
    hidden_layer,
    output_layer,
]

loss_fn = tf_shell_ml.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(0.1)

Next, define the `train_step` function, which is called once per batch with an
encrypted batch of labels, `enc_y`. The function first runs a forward pass on
the plaintext reviews `x` to compute the predicted labels, then runs
backpropagation using the encrypted labels.
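
One reason backpropagation can start from encrypted labels is that, for a softmax output paired with categorical cross-entropy, the gradient with respect to the logits $z$ is $\mathrm{softmax}(z) - y$, which is linear in the label $y$. The plain-Python sketch below checks that identity numerically, without any encryption. Whether `tf_shell_ml.CategoricalCrossentropy.grad` computes exactly this expression is an assumption, and the function names here are illustrative only.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, logits):
    # Categorical cross-entropy of softmax(logits) against one-hot y_true.
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(y_true, probs))


def analytic_grad(y_true, logits):
    # Closed form: d(loss)/d(logits) = softmax(logits) - y_true.
    return [p - t for p, t in zip(softmax(logits), y_true)]


def numeric_grad(y_true, logits, eps=1e-6):
    # Central finite differences, one logit at a time.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(y_true, up) - cross_entropy(y_true, down)
        grads.append(diff / (2 * eps))
    return grads


logits = [0.3, -1.2]
y_true = [0.0, 1.0]  # one-hot label for class 1
grad_a = analytic_grad(y_true, logits)
grad_n = numeric_grad(y_true, logits)
max_abs_diff = max(abs(a - n) for a, n in zip(grad_a, grad_n))
print(grad_a, max_abs_diff)
```

Because the label only enters through a subtraction, this first gradient needs nothing more than additions on the ciphertext side, which is the kind of operation homomorphic schemes handle well.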

In [5]:
def train_step(x, enc_y):
    # The forward pass is always computed in plaintext.
    y_pred = x
    for l in layers:
        y_pred = l(y_pred, training=True)

    # Backward pass.
    dx = loss_fn.grad(enc_y, y_pred)
    dJ_dw = []
    dJ_dx = [dx,]
    for l in reversed(layers):
        if isinstance(l, tf_shell_ml.GlobalAveragePooling1D):
            dw, dx = l.backward(dJ_dx[-1])
        else:
            dw, dx = l.backward(dJ_dx[-1], public_rotation_key)
        dJ_dw.extend(dw)
        dJ_dx.append(dx)

    return reversed(dJ_dw)


@tf.function
def train_step_wrapper(x_batch, y_batch):
    if use_encryption:
        # Encrypt the batch of secret labels y.
        enc_y_batch = tf_shell.to_encrypted(y_batch, secret_key, context)
    else:
        enc_y_batch = y_batch

    # Run the training step. The top and bottom halves of the batch are
    # treated as two separate mini-batches run in parallel.
    enc_grads = train_step(x_batch, enc_y_batch)

    filtered_layers = [l for l in layers if len(l.weights) > 0]

    if use_encryption:
        # Decrypt the weight gradients. In practice, the gradients should be
        # noised before decrypting.
        packed_grads = [tf_shell.to_tensorflow(g, secret_key) for g in enc_grads]
        # Unpack the plaintext gradients using the corresponding layer.
        grads = [l.unpack(g) for l, g in zip(filtered_layers, packed_grads)]
    else:
        grads = enc_grads

    weights = []
    for l in filtered_layers:
        weights += l.weights

    # Apply the gradients to the model.
    optimizer.apply_gradients(zip(grads, weights))
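
The backward loop in `train_step` follows a common pattern: walk the layers in reverse, pass each layer the gradient from the layer after it, collect the weight gradients, then reverse the collected list so it lines up with the forward layer order. The sketch below reproduces that pattern with scalar toy layers. `ToyScale` and `ToyShift` are hypothetical stand-ins, not tf-shell classes, and nothing here is encrypted.

```python
class ToyScale:
    """Hypothetical layer computing y = w * x for scalar inputs."""

    def __init__(self, w):
        self.w = w
        self._x = None

    def __call__(self, x):
        self._x = x  # cache the input for the backward pass
        return self.w * x

    def backward(self, dy):
        dw = [dy * self._x]  # gradient with respect to the weight
        dx = dy * self.w     # gradient with respect to the input
        return dw, dx


class ToyShift:
    """Hypothetical weightless layer computing y = x + 1."""

    def __call__(self, x):
        return x + 1.0

    def backward(self, dy):
        return [], dy  # no weights; the gradient passes through unchanged


toy_layers = [ToyScale(2.0), ToyShift(), ToyScale(3.0)]


def toy_train_step(x, dloss_dy):
    # Forward pass.
    y = x
    for layer in toy_layers:
        y = layer(y)

    # Backward pass, last layer first.
    dJ_dw = []
    dx = dloss_dy
    for layer in reversed(toy_layers):
        dw, dx = layer.backward(dx)
        dJ_dw.extend(dw)

    # Reverse so the gradients match the forward order of the layers.
    return y, list(reversed(dJ_dw))


toy_y, toy_grads = toy_train_step(5.0, 1.0)
print(toy_y, toy_grads)
```

Here the network computes $y = 3(2x + 1)$, so at $x = 5$ the gradients with respect to the first and last weights are $3x = 15$ and $2x + 1 = 11$.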

Here is the training loop. Each inner iteration runs two batches of size
$2^{12-1}$ simultaneously.
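
As a rough sanity check on these sizes, the step count can be derived from the batch size. This sketch assumes `context.num_slots` equals $2^{\texttt{log\_n}}$, which is not stated explicitly in this notebook but agrees with the `Step: 0 / 30` line in the cell output. The example counts come from the dataset split comment earlier.

```python
log_n = 12                          # from create_context64 above
num_slots = 2 ** log_n              # assumption: the value of context.num_slots
micro_batch_size = num_slots // 2   # two micro-batches per batch

train_examples = 15_000
val_examples = 10_000
epochs = 10

# drop_remainder=True discards the final partial batch of each pass.
train_batches_per_epoch = train_examples // num_slots
total_train_steps = train_batches_per_epoch * epochs
val_batches = val_examples // num_slots
dropped_per_epoch = train_examples - train_batches_per_epoch * num_slots

print("batch size:", num_slots)
print("micro-batch size:", micro_batch_size)
print("training steps:", total_train_steps)
print("validation batches:", val_batches)
print("training examples dropped per epoch:", dropped_per_epoch)
```

With a batch this large relative to the training set, a noticeable share of each epoch is discarded by `drop_remainder=True`, which is worth keeping in mind when reading the accuracy numbers.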

TensorBoard can be used to visualize training progress. See the cell output for
the command to start TensorBoard.

In [6]:
start_time = time.time()
tf.config.run_functions_eagerly(False)


def check_accuracy(dataset):
    average_loss = 0
    average_accuracy = 0
    for x, y in dataset:
        y = tf.one_hot(tf.cast(y, tf.int32), 2)

        y_pred = vectorize_layer(x)
        # Do not filter when testing.
        for i, l in enumerate(layers):
            y_pred = l(y_pred)

        loss = tf.reduce_mean(loss_fn(y, y_pred))

        accuracy = tf.reduce_mean(
            tf.cast(
                tf.equal(tf.argmax(y, axis=1), tf.argmax(y_pred, axis=1)), tf.float32
            )
        )
        average_loss += loss
        average_accuracy += accuracy
    average_loss /= len(dataset)
    average_accuracy /= len(dataset)

    return average_loss, average_accuracy


# Set up tensorboard logging.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
logdir = os.path.abspath("") + "/tflogs/sentiment-%s" % stamp
print(f"To start tensorboard, run: tensorboard --logdir ./ --host 0.0.0.0")
print(f"\ttensorboard profiling requires: pip install tensorboard_plugin_profile")
writer = tf.summary.create_file_writer(logdir)

# Initial accuracy
loss, accuracy = check_accuracy(val_data)
tf.print(f"\tvalidation loss: {loss}\taccuracy: {accuracy}")

# Iterate over the batches of the dataset.
for step, (x_batch, y_batch) in enumerate(train_data.take(batch_size)):
    print(f"Step: {step} / {len(train_data)}, Time Stamp: {time.time() - start_time}")

    y_batch = tf.one_hot(tf.cast(y_batch, tf.int32), 2)

    if step == 0:
        tf.summary.trace_on(
            graph=True,
            profiler=True,
            # profiler_outdir=logdir,  # Only for tf 2.16+
        )

    x_batch = vectorize_layer(x_batch)  # No shape inference, do outside tf.function
    train_step_wrapper(x_batch, y_batch)

    # tf.print("embedding layer slot counter:")
    # tf.print(embedding_layer._last_slot_count, summarize=-1)
    # tf.print("embedding layer max slot counter:")
    # tf.print(tf.reduce_max(embedding_layer._last_slot_count), summarize=-1)

    if step == 0:
        with writer.as_default():
            tf.summary.trace_export(
                name="sentiment",
                step=step,
                profiler_outdir=logdir,
            )

    loss, accuracy = check_accuracy(train_data)
    tf.print(f"\ttrain loss: {loss}\taccuracy: {accuracy}")
    loss, accuracy = check_accuracy(val_data)
    tf.print(f"\tvalidation loss: {loss}\taccuracy: {accuracy}")

    with writer.as_default():
        tf.summary.scalar("loss", loss, step=step)
        tf.summary.scalar("accuracy", accuracy, step=step)


print(f"Total training time: {time.time() - start_time} seconds")

To start tensorboard, run: tensorboard --logdir ./ --host 0.0.0.0
	tensorboard profiling requires: pip install tensorboard_plugin_profile
	validation loss: 0.34678205847740173	accuracy: 0.50244140625
Step: 0 / 30, Time Stamp: 0.4297327995300293
	train loss: 0.3463047742843628	accuracy: 0.508227527141571
	validation loss: 0.3464067280292511	accuracy: 0.5057373046875
Step: 1 / 30, Time Stamp: 4207.950785398483
	train loss: 0.3458217680454254	accuracy: 0.519848644733429
	validation loss: 0.34607988595962524	accuracy: 0.5130615234375
Step: 2 / 30, Time Stamp: 8758.485680818558
	train loss: 0.34537193179130554	accuracy: 0.5295491814613342
	validation loss: 0.3455347418785095	accuracy: 0.5269775390625
Step: 3 / 30, Time Stamp: 12703.130770683289
	train loss: 0.3447962999343872	accuracy: 0.5438232421875
	validation loss: 0.34524139761924744	accuracy: 0.5352783203125
Step: 4 / 30, Time Stamp: 17270.923393011093
	train loss: 0.3442564010620117	accuracy: 0.5548909306526184
	validation loss: 0.3448839783668518	accuracy: 0.5419921875
Step: 5 / 30, Time Stamp: 21087.202694416046
	train loss: 0.34387531876564026	accuracy: 0.5640299320220947
	validation loss: 0.34451237320899963	accuracy: 0.5521240234375
Step: 6 / 30, Time Stamp: 24922.99744272232
	train loss: 0.34310781955718994	accuracy: 0.5768473148345947
	validation loss: 0.3439071774482727	accuracy: 0.561767578125
Step: 7 / 30, Time Stamp: 29491.1250500679
