[View the runnable example on GitHub](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/tensorflow/tensorflow_training_embedding_sparseadam.ipynb)

# Apply `SparseAdam` Optimizer for Large Embeddings

Embedding layers are often used to encode categorical items in deep learning applications. However, in applications such as recommendation systems, the embedding matrix can become huge because of the large number of items or users, leading to high computational and memory costs.

For large embeddings, the batch size can be orders of magnitude smaller than the number of rows in the embedding matrix, so the gradient of the embedding matrix in each batch is often sparse. Taking advantage of this, BigDL-Nano provides `bigdl.nano.tf.keras.layers.Embedding` and `bigdl.nano.tf.optimizers.SparseAdam` to accelerate training with large embeddings. `bigdl.nano.tf.optimizers.SparseAdam` is a variant of Adam that handles updates from sparse tensors more efficiently. `bigdl.nano.tf.keras.layers.Embedding` avoids applying the regularizer function directly to the embedding matrix, which would otherwise make the sparse gradient dense.
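
To make the sparsity argument concrete, here is a minimal plain-Python sketch (not Nano's or TensorFlow's implementation) of the gradient of an embedding lookup with respect to the table, stored as a dictionary from row index to gradient vector. The helper name `lookup_gradient`, the table size, and the batch contents are all hypothetical values chosen for illustration.

```python
from collections import defaultdict

def lookup_gradient(batch_indices, upstream_grads, dim):
    """Gradient of an embedding lookup w.r.t. the table, stored sparsely.

    Each looked-up row receives the upstream gradient of the position that
    used it; rows looked up several times accumulate. Rows that never appear
    in the batch get no entry at all.
    """
    grad = defaultdict(lambda: [0.0] * dim)
    for idx, g in zip(batch_indices, upstream_grads):
        row = grad[idx]
        for j in range(dim):
            row[j] += g[j]
    return dict(grad)

num_rows, dim = 1_000_000, 4      # hypothetical embedding table shape
batch = [7, 42, 7, 99_999]        # hypothetical item ids in one batch
upstream = [[1.0] * dim for _ in batch]

grad = lookup_gradient(batch, upstream, dim)
touched_fraction = len(grad) / num_rows

print(sorted(grad))        # rows that received a gradient
print(grad[7])             # row 7 was looked up twice, so its gradient accumulates
print(touched_fraction)    # share of the table touched by this batch
```

Only three of the one million rows carry a gradient here, which is the situation a sparse-aware optimizer is meant to exploit.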

To apply Nano's `Embedding` layer and `SparseAdam` optimizer, you need to install BigDL-Nano for TensorFlow:

In [None]:
# install the nightly-built version of bigdl-nano for tensorflow;
# intel-tensorflow will be installed at the same time, with Intel's oneDNN optimizations enabled by default
!pip install --pre --upgrade bigdl-nano[tensorflow]
!source bigdl-nano-init  # set environment variables

> 📝 **Note**
>
> Before starting your TensorFlow Keras application, it is highly recommended to run `source bigdl-nano-init` to set several environment variables based on your current hardware. Empirically, these variables bring a significant performance increase for most TensorFlow Keras training workloads.

> ⚠️ **Warning**
> 
> For Jupyter Notebook users, we recommend running the commands above, especially `source bigdl-nano-init`, before the Jupyter kernel is started; otherwise some of the optimizations may not take effect.

In [None]:
# install dependency for the dataset used in the following example
!pip install tensorflow-datasets

To optimize your model for large embeddings, you need to **import Nano's** `Embedding` **and** `SparseAdam` **first:**

In [None]:
from bigdl.nano.tf.keras.layers import Embedding
from bigdl.nano.tf.optimizers import SparseAdam

# instead of: from tensorflow.keras import Model
from bigdl.nano.tf.keras import Model

> 📝 **Note**
>
> You could import `Model`/`Sequential` from `bigdl.nano.tf.keras` instead of `tf.keras` to gain more optimizations from Nano. Please refer to [API documentation](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/Nano/tensorflow.html#bigdl-nano-tf-keras) for more information.

Let's take the [imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews) dataset as an example, and suppose we would like to train a model to classify movie reviews as positive or negative. Assuming a vocabulary size of $20000$ and a fixed word-vector length of $128$, we get a large embedding matrix of size $20000 \times 128$.
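
As a rough back-of-the-envelope check of that size, the snippet below counts the parameters and estimates the memory footprint. It assumes 4-byte `float32` weights (an assumption, not something stated above) and uses the fact that Adam-style optimizers keep two moment estimates per parameter.

```python
vocab_size = 20000
embedding_dim = 128
bytes_per_weight = 4  # assumption: float32 weights

num_params = vocab_size * embedding_dim
table_mib = num_params * bytes_per_weight / 2**20

# Adam-style optimizers store a first and a second moment per parameter,
# so weights plus optimizer state take about three times the table size.
with_moments_mib = 3 * table_mib

print(num_params)                    # 2560000
print(round(table_mib, 2))           # 9.77
print(round(with_moments_mib, 2))    # 29.3
```

This table is small by recommendation-system standards, but the same arithmetic scales linearly with the number of rows.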

To prepare the data for training, we need to process the samples as sequences of positive integers:

In [None]:
import re
import string
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization

def create_datasets():
    (raw_train_ds, raw_val_ds, raw_test_ds), info = tfds.load(
        "imdb_reviews",
        data_dir="/tmp/data",
        split=['train[:80%]', 'train[80%:]', 'test'],
        as_supervised=True,
        batch_size=32,
        with_info=True
    )

    def custom_standardization(input_data):
        lowercase = tf.strings.lower(input_data)
        stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
        return tf.strings.regex_replace(
            stripped_html, f"[{re.escape(string.punctuation)}]", ""
        )

    vectorize_layer = TextVectorization(
        standardize=custom_standardization,
        max_tokens=20000,
        output_mode="int",
        output_sequence_length=500,
    )
    
    text_ds = raw_train_ds.map(lambda x, y: x)
    vectorize_layer.adapt(text_ds)

    def vectorize_text(text, label):
        text = tf.expand_dims(text, -1)
        return vectorize_layer(text), label

    # vectorize the data
    train_ds = raw_train_ds.map(vectorize_text)
    val_ds = raw_val_ds.map(vectorize_text)
    test_ds = raw_test_ds.map(vectorize_text)

    return train_ds, val_ds, test_ds

In [None]:
train_ds, val_ds, test_ds = create_datasets()

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; _The definition of_ `create_datasets` _can be found in the_ [runnable example](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/tensorflow/tensorflow_training_embedding_sparseadam.ipynb).
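
To see what the standardization step does to a single review, here is a plain-Python counterpart of `custom_standardization` that uses the standard `re` module instead of `tf.strings`. The function name `standardize` and the sample sentence are made up for illustration; the mapping from tokens to integer ids is left to `TextVectorization`.

```python
import re
import string

def standardize(text):
    """Lowercase, replace the HTML line break with a space, strip punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)

sample = "Great movie!<br />Loved it."   # hypothetical review text
cleaned = standardize(sample)
tokens = cleaned.split()

print(cleaned)   # great movie loved it
print(tokens)    # ['great', 'movie', 'loved', 'it']
```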

We could then define the model. Just as with `tf.keras.layers.Embedding`, you could **instantiate Nano's** `Embedding` **layer** as the first layer in the model:

In [None]:
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# 20000 is the vocabulary size,
# 128 is the embedding dimension
x = Embedding(input_dim=20000, output_dim=128)(inputs)

> 📝 **Note**
>
> If you apply a regularizer function to the embedding matrix by setting `embeddings_regularizer`, Nano applies the regularizer to the output tensors of the embedding layer instead (if `activity_regularizer=None`), to avoid making the sparse gradient dense.
>
> Please refer to the [API documentation](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/Nano/tensorflow.html#bigdl.nano.tf.keras.layers.Embedding) for more information on `bigdl.nano.tf.keras.layers.Embedding`.
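
The effect described in the note can be illustrated with a toy L2 penalty in plain Python. This is only a sketch of why regularizing the whole table produces a dense gradient while regularizing the looked-up outputs does not; it is not Nano's implementation, and the helper names, table values, and `lam` are hypothetical.

```python
def l2_grad_on_table(table, lam):
    """Gradient of lam * sum(w**2) over the whole table: every row gets an entry."""
    return {i: [2 * lam * w for w in row] for i, row in enumerate(table)}

def l2_grad_on_outputs(table, batch_indices, lam):
    """Gradient of lam * sum(out**2), where out holds only the looked-up rows."""
    grad = {}
    for idx in batch_indices:
        row = grad.setdefault(idx, [0.0] * len(table[idx]))
        for j, w in enumerate(table[idx]):
            row[j] += 2 * lam * w
    return grad

# Toy 6 x 2 table; the batch only looks up rows 1 and 4.
table = [[float(i + 1), -float(i + 1)] for i in range(6)]
batch = [1, 4]

dense = l2_grad_on_table(table, lam=0.01)
sparse = l2_grad_on_outputs(table, batch, lam=0.01)

print(len(dense))       # 6: one gradient row per table row
print(sorted(sparse))   # [1, 4]: only the rows used in the batch
```

With the penalty on the outputs, the regularization gradient touches the same rows as the lookup gradient, so the combined gradient stays sparse.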

Next, you could define the remaining parts of the model, and **configure the model for training with the** `SparseAdam` **optimizer**:

In [None]:
from tensorflow.keras import layers

def make_backbone():
    inputs = tf.keras.Input(shape=(None, 128))
    x = layers.Dropout(0.5)(inputs)
    x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
    x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

    model = Model(inputs, predictions)
    return model

In [None]:
# define the remaining layers of the model
predictions = make_backbone()(x)
model = Model(inputs, predictions)

# Configure the model with Nano's SparseAdam optimizer
model.compile(loss="binary_crossentropy", optimizer=SparseAdam(), metrics=["accuracy"])

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; _The definition of_ `make_backbone` _can be found in the_ [runnable example](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/tensorflow/tensorflow_training_embedding_sparseadam.ipynb).
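
As a quick shape check for the backbone, the snippet below works out the sequence length after each `Conv1D` layer, using the standard output-length formula for `padding="valid"`: `(length - kernel_size) // stride + 1`. It assumes inputs of length 500, which matches `output_sequence_length=500` in the vectorization step; the helper name is hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride):
    """Output length of a 1D convolution with padding='valid'."""
    return (length - kernel_size) // stride + 1

seq_len = 500  # from output_sequence_length=500
after_first = conv1d_valid_length(seq_len, kernel_size=7, stride=3)
after_second = conv1d_valid_length(after_first, kernel_size=7, stride=3)

print(after_first)    # 165
print(after_second)   # 53
```

`GlobalMaxPooling1D` then reduces those 53 time steps to a single 128-dimensional vector per review before the dense layers.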

> 📝 **Note**
>
> The `SparseAdam` optimizer is a variant of `tf.keras.optimizers.Adam`. It only updates the moment entries that show up in the gradient, and applies only the corresponding portions of the gradient to the trainable variables.
>
> Please refer to the [API documentation](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/Nano/tensorflow.html#bigdl.nano.tf.optimizers.SparseAdam) for more information on `bigdl.nano.tf.optimizers.SparseAdam`.
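
The idea in the note can be sketched in plain Python as an Adam step that visits only the rows present in a sparse gradient. This is an illustrative sketch under assumptions, not the code of `bigdl.nano.tf.optimizers.SparseAdam`: the function `sparse_adam_step`, the dictionary representation of the gradient, and the hyperparameter defaults are all chosen here for illustration.

```python
import math

def sparse_adam_step(params, m, v, sparse_grad, step,
                     lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one Adam-style update to the rows listed in sparse_grad only.

    sparse_grad maps a row index to that row's gradient vector. Rows that are
    absent keep their parameters and both moment estimates unchanged.
    """
    # Step size with the usual Adam bias correction folded in.
    lr_t = lr * math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)
    for row, grad in sparse_grad.items():
        for j, g in enumerate(grad):
            m[row][j] = beta1 * m[row][j] + (1 - beta1) * g
            v[row][j] = beta2 * v[row][j] + (1 - beta2) * g * g
            params[row][j] -= lr_t * m[row][j] / (math.sqrt(v[row][j]) + eps)

# Toy 5 x 2 embedding table; only rows 1 and 3 appear in this batch's gradient.
params = [[0.5, 0.5] for _ in range(5)]
m = [[0.0, 0.0] for _ in range(5)]
v = [[0.0, 0.0] for _ in range(5)]
sparse_adam_step(params, m, v, {1: [0.1, -0.2], 3: [0.3, 0.0]}, step=1)

untouched = [i for i in range(5) if params[i] == [0.5, 0.5]]
print(untouched)   # [0, 2, 4]
print(params[1])   # both entries moved by roughly the learning rate
```

The work per step is proportional to the number of rows in the gradient rather than to the size of the whole table, which is where the savings for large embeddings come from.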

You could then train and evaluate your model as usual:

In [None]:
model.fit(train_ds, validation_data=val_ds, epochs=10)
model.evaluate(test_ds)

> 📚 **Related Readings**
> 
> - [How to install BigDL-Nano](https://bigdl.readthedocs.io/en/latest/doc/Nano/Overview/nano.html#install)