<a href="https://colab.research.google.com/github/jbischof/keras-io/blob/quickstart/guides/keras_nlp/keras_nlp_quick_tour.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install 0.4 preview from source
!pip install -q git+https://github.com/keras-team/keras-nlp.git tensorflow==2.10 --upgrade



In [None]:
import keras_nlp
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

# Use mixed precision for optimal performance
keras.mixed_precision.set_global_policy('mixed_float16')

# KerasNLP: Modular NLP Workflows for Keras


`keras-nlp` is a natural language processing library that supports users through their entire development cycle. Our workflows are built from modular components that have SoTA preset weights and architectures when used out-of-the-box and are easily customizable when more control is needed.

This library is an extension of the core `keras` API; all high level modules are `Layers` or `Models`. If you are familiar with `keras`, congratulations! You already understand most of `keras-nlp`.

This guide demonstrates our modular approach using a sentiment analysis example at six levels of complexity:
* Inference with a pretrained classifier
* Fine tuning a pretrained backbone
* Fine tuning with user-controlled preprocessing
* Fine tuning a custom model
* Pretraining a backbone model
* Build and train your own transformer from scratch

Throughout our guide we use Professor Keras, the official Keras mascot, as a visual reference for the complexity of the material:

![picture](https://drive.google.com/uc?id=1d14Qpmfgjf6zu4z30HBaonH8PYDHgVoU)



# API quickstart

Our highest level API is `keras_nlp.models`. For each `XX` architecture (e.g., `Bert`), we offer the following modules:
* **Tokenizer**: `keras_nlp.models.XXTokenizer`
    * Maps raw text to `tf.RaggedTensor`s of token ids.
    * Inherits from `keras.Layer`.
* **Preprocessor**: `keras_nlp.models.XXPreprocessor`
    * Maps raw text to a dictionary of dense tensors consumed by the model.
    * Has a `XXTokenizer`.
    * Inherits from `keras.Layer`.
* **Backbone**: `keras_nlp.models.XXBackbone`
    * Maps preprocessed tensors to dense representation. *Does not handle raw text*.
    * Inherits from `keras.Model`.
* **Task**: e.g., `keras_nlp.models.XXClassifier`
    * Maps raw text to task-specific output (e.g., classification probabilities).
    * Has a `XXBackbone` and `XXPreprocessor`.
    * Inherits from `keras.Model`.

Here is the modular hierarchy for `BertClassifier` (all relationships are compositional):

![picture](https://drive.google.com/uc?id=1vHBQ1oFbto8ItfhsLcxKhIwOIdJE1X9n)

All modules can be used independently and have a `from_preset()` method in addition to the standard constructor that instantiates the class with **preset** architecture and weights (see examples below).

# Data

We will use a running example of sentiment analysis of IMDB movie reviews. In this task, we use the text to predict whether the review was positive (`label = 1`) or negative (`label = 0`).

We load the data from `tensorflow_datasets`, a collection of machine learning benchmarks that uses the powerful `tf.data.Dataset` format for examples.

In [None]:
BATCH_SIZE = 16
imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=BATCH_SIZE,
)

# Inspect first review
# Format is (review text tensor, label tensor)
imdb_train.unbatch().take(1).get_single_element()

(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)

# Inference with a pretrained classifier

![picture](https://drive.google.com/uc?id=1xeMHVCxYhm3_oC37Gg7k0bG-yhsVr0Dv)

The highest level module in `keras-nlp` is a **task**. A **task** is a `keras.Model` consisting of a (generally pretrained) **backbone** model and task-specific layers. Here's an example using `keras_nlp.models.BertClassifier`.

**Note**: Outputs are the logits per class (`[0, 0]` is 50% chance of positive).



In [None]:
classifier = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")
# Note: batched inputs expected so must wrap string in iterable
classifier.predict(["I love modular workflows in keras-nlp!"])

Downloading data from https://storage.googleapis.com/keras-nlp/models/bert_tiny_en_uncased_sst2/vocab.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/bert_tiny_en_uncased_sst2/model.h5


array([[-1.54 ,  1.544]], dtype=float16)
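
The logits printed above can be turned into class probabilities with a softmax. The snippet below is a minimal standard-library sketch rather than part of the `keras-nlp` API; it assumes index 0 is the negative class and index 1 the positive class, matching the IMDB label convention used in this guide.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the largest logit first for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Equal logits mean the model has no preference between the two classes.
print(softmax([0.0, 0.0]))  # [0.5, 0.5]

# Logits copied from the prediction above (assumed order: negative, positive).
negative, positive = softmax([-1.54, 1.544])
print(f"P(positive) = {positive:.3f}")
```

A large gap between the two logits therefore corresponds to a confident prediction, while logits that are close together correspond to a probability near 50%.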

All **tasks** have a `from_preset` method that constructs a `keras.Model` instance with preset preprocessing, architecture and weights. This means that we can pass raw strings in any format accepted by a `keras.Model` and get output specific to our task.

This particular **preset** is a `bert_tiny_en_uncased` **backbone** fine-tuned on `sst2`, another movie review sentiment analysis dataset (this time from Rotten Tomatoes). We use the `tiny` architecture for demo purposes, but larger models are recommended for SoTA performance. For all the task-specific presets available for `BertClassifier`, see [keras.io](https://resilient-dango-43f7b8.netlify.app/api/keras_nlp/models/).

Let's evaluate our classifier on the IMDB dataset. We first need to compile the `keras.Model`. Since we are not training, we do not need a `loss` argument.

In [None]:
classifier.compile(
    metrics=["sparse_categorical_accuracy"],
    jit_compile=True,
)

classifier.evaluate(imdb_test)



[0.0, 0.7835599780082703]
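
The second number in the result above is the accuracy metric; the first is the loss, which is presumably 0.0 because no loss was compiled. As a rough illustration of what `sparse_categorical_accuracy` measures, the sketch below compares the highest-scoring class of each row of logits with an integer label. The labels and logits are hypothetical values chosen for the example, not model output.

```python
def sparse_categorical_accuracy(labels, logits):
    """Fraction of examples whose highest-scoring class equals the integer label."""
    correct = 0
    for label, row in zip(labels, logits):
        # argmax: index of the largest logit in this row
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical batch: four reviews, integer labels, two logits per review.
labels = [1, 0, 1, 0]
logits = [[-1.5, 1.5], [0.8, -0.3], [0.2, -0.1], [2.0, -2.0]]
print(sparse_categorical_accuracy(labels, logits))  # 0.75
```

Only the third example is misclassified here, so three of four predictions match their labels.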

# Fine tuning a pretrained BERT backbone

![picture](https://drive.google.com/uc?id=1YytOYRSqsrhJ4NLatVOSuVMbLPa9iXrw)

When labeled text specific to our task is available, fine-tuning a custom classifier can improve performance. If we want to predict IMDB review sentiment, using IMDB data should perform better than Rotten Tomatoes data! And for many tasks no relevant pretrained model will be available (e.g., categorizing customer reviews).

The workflow for fine-tuning is almost identical to above, except that we request a **preset** for the **backbone**-only model rather than the entire classifier. When passed a **backbone** **preset**, a **task** `Model` will randomly initialize all task-specific layers in preparation for training. For all the **backbone** presets available for `BertClassifier`, see [keras.io](https://resilient-dango-43f7b8.netlify.app/api/keras_nlp/models/).

To train your classifier, use `Model.compile()` and `Model.fit()` as with any other `keras.Model`. Since preprocessing is included in all **tasks** by default, we again pass the raw data.


In [None]:
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=2,
)
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
classifier.fit(
    imdb_train,
    validation_data=imdb_test,
    epochs=1,
)



<keras.callbacks.History at 0x7f0481100f10>

Here we see a significant lift in validation accuracy (0.78 -> 0.87) with a single epoch of training, even though the IMDB dataset is much smaller than `sst2`.


# Fine tuning with user-controlled preprocessing
![picture](https://drive.google.com/uc?id=1T_40vtl8daihS-kKYTFWejFd19KJAyDK)

For some advanced training scenarios, users might prefer direct control over preprocessing. For large datasets, examples can be preprocessed in advance and saved to disk or preprocessed by a separate worker pool using `tf.data.experimental.service`. In other cases, custom preprocessing is needed to handle the inputs.

Pass `preprocessor=None` to the constructor of a **task** `Model` to skip automatic preprocessing or supply your own `keras.Layer` to perform a custom operation instead.



## Separate preprocessing from the same preset

Each model architecture has a parallel **preprocessor** `Layer` with its own `from_preset` constructor. Using the same **preset** for this `Layer` returns the same **preprocessor** that the **task** would use.

In this workflow we train the model over three epochs using `tf.data.Dataset.cache()`, which computes the preprocessing once and caches the result before fitting begins.

**Note:** this code only works if your data fits in memory. If not, pass a `filename` to `cache()`.

In [None]:
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_tiny_en_uncased"
)

imdb_train_cached = imdb_train.map(
    preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)
imdb_test_cached = imdb_test.map(
    preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)

classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    preprocessor=None,
)
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
classifier.fit(
    imdb_train_cached,
    validation_data=imdb_test_cached,
    epochs=3,
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f1e98c76d30>

After three epochs, our validation accuracy has only increased to 0.88. This is mainly a function of the small size of our dataset; even with the `bert_tiny` architecture we've already learned most generalizable patterns in the first pass.

## Custom preprocessing

In cases where custom preprocessing is required, we offer direct access to the `Tokenizer` class that maps raw strings to tokens. It also has a `from_preset` constructor to get the vocabulary matching pretraining.

**Note:** `Tokenizer` does not pad sequences, so output is `tf.RaggedTensor`.
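
To make the ragged-versus-dense distinction concrete, here is a plain-Python sketch of roughly what a packing step does for a single segment: add start and end tokens, truncate to a fixed length, pad the remainder, and derive a padding mask. This is an illustration under stated assumptions, not the `keras-nlp` implementation. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and padding ids that appear in this guide's outputs, and the short second sequence is a hypothetical example.

```python
def pack(token_ids, sequence_length, start_id=101, end_id=102, pad_id=0):
    """Wrap one tokenized segment in start/end tokens, then truncate or pad."""
    # Reserve two positions for the start and end tokens.
    body = token_ids[: sequence_length - 2]
    packed = [start_id] + body + [end_id]
    # Fill any remaining positions with the padding id.
    packed += [pad_id] * (sequence_length - len(packed))
    # True marks a real token, False marks padding.
    padding_mask = [tok != pad_id for tok in packed]
    return packed, padding_mask


# Ragged input: two sequences of different lengths.
ragged = [
    [1045, 2293, 19160, 2147, 12314, 2015, 999],  # ids from the tokenizer output in this guide
    [2293, 999],                                  # hypothetical shorter sequence
]
for ids in ragged:
    packed, mask = pack(ids, sequence_length=8)
    print(packed)
    print(mask)
```

After packing, every row has the same length, so the batch can be stored as a dense tensor and the mask tells the model which positions to ignore.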



In [None]:
tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_tiny_en_uncased")
tokenizer(["I love modular workflows!", "Libraries over frameworks!"])

<tf.RaggedTensor [[1045, 2293, 19160, 2147, 12314, 2015, 999],
 [1045, 2064, 1005, 1056, 3233, 26666, 3642, 1012]]>

In [None]:
# Write your own packer or use one of our `Layers`
packer = keras_nlp.layers.MultiSegmentPacker(
    start_value=tokenizer.cls_token_id,
    end_value=tokenizer.sep_token_id,
    sequence_length=64,
)

def preprocess(x, y):
    token_ids, segment_ids = packer(tokenizer(x))
    x = {
        "token_ids": token_ids,
        "segment_ids": segment_ids,
        "padding_mask": token_ids != 0,
    }
    return x, y

imdb_train_preprocessed = imdb_train.map(
    preprocess, tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
imdb_test_preprocessed = imdb_test.map(
    preprocess, tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

# Preprocessed example
imdb_train_preprocessed.unbatch().take(1).get_single_element()

({'token_ids': <tf.Tensor: shape=(64,), dtype=int32, numpy=
  array([  101,  2023,  2001,  2019,  7078,  6659,  3185,  1012,  2123,
          1005,  1056,  2022, 26673,  1999,  2011,  5696,  3328,  2368,
          2030,  2745,  3707,  7363,  1012,  2119,  2024,  2307,  5889,
          1010,  2021,  2023,  2442,  3432,  2022,  2037,  5409,  2535,
          1999,  2381,  1012,  2130,  2037,  2307,  3772,  2071,  2025,
          2417, 21564,  2023,  3185,  1005,  1055,  9951,  9994,  1012,
          2023,  3185,  2003,  2019,  2220,  3157,  7368,  2149, 10398,
           102], dtype=int32)>,
  'segment_ids': <tf.Tensor: shape=(64,), dtype=int32, numpy=
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        dtype=int32)>,
  'padding_mask': <tf.Tensor: shape=(64,), dtype=bool, numpy=
  array([ True,  True,  True,  True, 

# Fine tuning with a custom model
![picture](https://drive.google.com/uc?id=1T_40vtl8daihS-kKYTFWejFd19KJAyDK)

For more advanced applications, an appropriate **task** `Model` may not be available. In this case we provide direct access to the **backbone** `Model`, which has its own `from_preset` constructor and can be composed with custom `Layer`s. Detailed examples can be found at https://keras.io/guides/transfer_learning/.

A **backbone** `Model` does not include automatic preprocessing but can be paired with a matching **preprocessor** using the same **preset** as shown in the previous workflow.

In this workflow we experiment with freezing our backbone model and adding two trainable transformer layers to adapt to the new input.

**Note**: We can ignore the warning about gradients for the `pooled_dense` layer because we are using BERT's sequence output.


In [None]:
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_tiny_en_uncased")
backbone = keras_nlp.models.BertBackbone.from_preset("bert_tiny_en_uncased")

imdb_train_preprocessed = imdb_train.map(
    preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)
imdb_test_preprocessed = imdb_test.map(
    preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)

backbone.trainable = False
inputs = backbone.input
sequence = backbone(inputs)["sequence_output"]
for _ in range(2):
    sequence = keras_nlp.layers.TransformerEncoder(
        num_heads=2,
        intermediate_dim=512,
        dropout=0.1,
    )(sequence, padding_mask=inputs["padding_mask"])
# Use [CLS] token output to classify
outputs = keras.layers.Dense(2)(sequence[:, backbone.cls_token_index, :])

model = keras.Model(inputs, outputs)
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
model.summary()
model.fit(
    imdb_train_preprocessed,
    validation_data=imdb_test_preprocessed,
    epochs=3,
)

Downloading data from https://storage.googleapis.com/keras-nlp/models/bert_tiny_en_uncased/v1/model.h5
Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 padding_mask (InputLayer)      [(None, None)]       0           []                               
                                                                                                  
 segment_ids (InputLayer)       [(None, None)]       0           []                               
                                                                                                  
 token_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 bert_backbone_1 (BertBackbone)  {'sequence_output':  4385920    ['padding_mask[0][0]', 

<keras.callbacks.History at 0x7f042cc6f250>

This model achieves reasonable accuracy despite having only 10% of the trainable parameters of our `BertClassifier` model. Each training step takes about 1/3 of the time, even accounting for cached preprocessing.

# Pretraining a backbone model
![picture](https://drive.google.com/uc?id=1pzwLPCtvzmHY3DKzH-MBzmjWFJ3pKVB5)

Do you have access to large unlabeled datasets in your domain? Are they around the same size as those used to train popular backbones such as BERT, RoBERTa, or GPT2 (XX+ GiB)? If so, you might benefit from domain-specific pretraining of your own backbone models.

NLP models are generally pretrained on a language modeling task, predicting masked words given the visible words in an input sentence. For example, given the input `"The fox [MASK] over the [MASK] dog"`, the model might be asked to predict `["jumped", "lazy"]`. The lower layers of this model are then packaged as a **backbone** to be combined with layers relating to a new task.
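
The masking idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's masking layer: the helper `mask_tokens` is hypothetical, it works on whole words rather than token ids, and it always substitutes `[MASK]`, whereas the BERT recipe also occasionally swaps in a random token or leaves the selected token unchanged.

```python
import random


def mask_tokens(tokens, mask_rate, rng, mask_token="[MASK]"):
    """Hide a random subset of tokens; return inputs, positions, and labels."""
    # Mask at least one token, otherwise there is nothing to predict.
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    labels = []
    for pos in positions:
        labels.append(tokens[pos])  # the word the model must predict
        masked[pos] = mask_token    # what the model actually sees
    return masked, positions, labels


tokens = "the fox jumped over the lazy dog".split()
masked, positions, labels = mask_tokens(
    tokens, mask_rate=0.25, rng=random.Random(0)
)
print(masked)             # input with some words replaced by [MASK]
print(positions, labels)  # where the masks are and what they hide
```

The model receives the masked sequence together with the mask positions and is trained to recover the hidden labels, which is the same (features, labels) split used in the preprocessing code for this workflow.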

The `keras-nlp` library offers SoTA **backbones** and **tokenizers** to be trained from scratch without presets.

In this workflow we pretrain a BERT **backbone** using our IMDB review text. We skip the "next sentence prediction" (NSP) loss because it adds significant complexity to the data processing and was dropped by later models like RoBERTa. See our e2e [BERT pretraining example](https://github.com/keras-team/keras-nlp/tree/4f9ebefa82af22b4f4267dfa80fa525f7a03bd5d/examples/bert) for step-by-step details on how to replicate the original paper.

## Preprocessing

In [None]:
# All BERT `en` models have the same vocabulary, so reuse preprocessor from
# "bert_tiny_en_uncased"
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_tiny_en_uncased",
    sequence_length=128,
)
packer = preprocessor.packer
tokenizer = preprocessor.tokenizer

# keras.Layer to replace some input tokens with the "[MASK]" token
masker = keras_nlp.layers.MLMMaskGenerator(
    vocabulary_size=tokenizer.vocabulary_size(),
    mask_selection_rate=0.25,
    mask_selection_length=32,
    mask_token_id=tokenizer.token_to_id("[MASK]"),
    unselectable_token_ids=[
        tokenizer.token_to_id(x) for x in ["[CLS]", "[PAD]", "[SEP]"]
    ],
)

def preprocess(inputs, label):
    inputs = preprocessor(inputs)
    masked_inputs = masker(inputs["token_ids"])
    # Split the masking layer outputs into a (features, labels, and weights)
    # tuple that we can use with keras.Model.fit().
    features = {
        "token_ids": masked_inputs["token_ids"],
        "segment_ids": inputs["segment_ids"],
        "padding_mask": inputs["padding_mask"],
        "mask_positions": masked_inputs["mask_positions"],
    }
    labels = masked_inputs["mask_ids"]
    weights = masked_inputs["mask_weights"]
    return features, labels, weights

pretrain_ds = imdb_train.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
pretrain_val_ds = imdb_test.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

# Tokens with ID 103 are "masked"
pretrain_ds.unbatch().take(1).get_single_element()

({'token_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
  array([  101,  2023,  2001,  2019,  7078,  6659,  3185,   103,  2123,
          1005,  1056,   103, 26673,  1999,   103,  5696,  3328,  2368,
          2030,   103,  3707,  7363,  1012,  2119,   103,  2307,  5889,
          1010,  2021,  2023,  2442,   103,   103,  2037,  5409,  2535,
          1999,  2381,  1012,  2130,  2037,  2307,  3772,   103,  2025,
          2417,   103,  2023,   103,  1005,  1055,  9951,  9994,  1012,
           103,   103,  2003,  2019,  2220,  3157,  7368,  2149,   103,
          3538,  1012,  1996,  2087, 17203,  5019,   103,  2216,  2043,
          1996, 25882,  8431,  2020,   103,  2037,  3572,  2005, 25239,
          1012,   103, 21878,   103,  2696,   103,  2596,  6887, 16585,
          1010,  1998,  2014, 18404,   103,   103,  6771, 11378,  3328,
          2368,   103,   103,  2021,  1037,   103,  6832, 13354,   103,
           103,  3185,   103,  2001, 22808,  1997,  2151,  2613, 28940,
   

## Pretraining model

In [None]:
# BERT backbone
backbone = keras_nlp.models.BertBackbone(
    vocabulary_size=tokenizer.vocabulary_size(),
    num_layers=2,
    num_heads=2,
    hidden_dim=128,
    intermediate_dim=512,
)

# Language modeling head
mlm_head = keras_nlp.layers.MLMHead(
    embedding_weights=backbone.token_embedding.embeddings,
)

inputs = {
    "token_ids": keras.Input(shape=(None,), dtype=tf.int32),
    "segment_ids": keras.Input(shape=(None,), dtype=tf.int32),
    "padding_mask": keras.Input(shape=(None,), dtype=tf.int32),
    "mask_positions": keras.Input(shape=(None,), dtype=tf.int32),
}

# Encoded token sequence
sequence = backbone(inputs)["sequence_output"]

# Predict an output word for each masked input token.
# We use the input token embedding to project from our encoded vectors to
# vocabulary logits, which has been shown to improve training efficiency.
outputs = mlm_head(sequence, mask_positions=inputs["mask_positions"])

# Define and compile our pretraining model.
pretraining_model = keras.Model(inputs, outputs)
pretraining_model.summary()
pretraining_model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(learning_rate=5e-4),
    weighted_metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)

# Pretrain on IMDB dataset
pretraining_model.fit(
    pretrain_ds,
    validation_data=pretrain_val_ds,
    epochs=3,    # Increase to 6 for higher accuracy
)

Model: "model_14"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_52 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 input_51 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 input_50 (InputLayer)          [(None, None)]       0           []                               
                                                                                                  
 input_49 (InputLayer)          [(None, None)]       0           []                               
                                                                                           



Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f0402dc9610>

After pretraining, save your `backbone` submodel to use in a new task!

# Build and train your own transformer from scratch
![picture](https://drive.google.com/uc?id=1pzwLPCtvzmHY3DKzH-MBzmjWFJ3pKVB5)

Want to implement a novel transformer architecture? The `keras-nlp` library offers all the low-level modules used to build SoTA architectures in our `models` API. This includes training your own subword tokenizer using `WordPiece`, `BytePairEncoder`, or `SentencePiece`.

In this workflow we train a custom tokenizer on the IMDB data and design a backbone with custom transformer architecture. For simplicity we then train directly on the classification task. Interested in more details? We wrote an entire guide to pretraining and finetuning a custom transformer: https://keras.io/guides/keras_nlp/transformer_pretraining/
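
For intuition about how a trained WordPiece vocabulary is applied, the sketch below splits a single word with the usual greedy longest-match-first rule, where `##` marks a piece that continues a word. It is a simplified standalone illustration, not the `keras-nlp` tokenizer; the function name and the toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching subwords, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first and shrink until one is in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a trained one has thousands of entries.
vocab = {"work", "##flow", "##s", "modular", "love", "i"}
print(wordpiece_tokenize("workflows", vocab))  # ['work', '##flow', '##s']
print(wordpiece_tokenize("keras", vocab))      # ['[UNK]']
```

A larger vocabulary lets more words survive as a single piece, which is one reason the vocabulary size is a tunable parameter in the training step that follows.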

## Train custom vocabulary from IMDB data

In [None]:
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    imdb_train.map(lambda x, y: x),
    vocabulary_size=10_000,    # Increase to 20_000 for better performance
    lowercase=True,
    strip_accents=True,
    reserved_tokens=["[PAD]", "[START]", "[END]", "[MASK]", "[UNK]"],
)
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    strip_accents=True,
    oov_token="[UNK]",
)


## Preprocess data with custom tokenizer

In [None]:
packer = keras_nlp.layers.StartEndPacker(
    start_value=tokenizer.token_to_id("[START]"),
    end_value=tokenizer.token_to_id("[END]"),
    pad_value=tokenizer.token_to_id("[PAD]"),
    sequence_length=64,
)

def preprocess(x, y):
    token_ids = packer(tokenizer(x))
    x = {
        "token_ids": token_ids,
        "padding_mask": token_ids != tokenizer.token_to_id("[PAD]"),
    }
    return x, y

imdb_preproc_train_ds = imdb_train.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
imdb_preproc_val_ds = imdb_test.map(
    preprocess, num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

imdb_preproc_train_ds.unbatch().take(1).get_single_element()

({'token_ids': <tf.Tensor: shape=(64,), dtype=int32, numpy=
  array([   1,  104,  106,  127,  539,  500,  110,   18,  183,   11,   62,
          121,   54, 3451,  103,  126, 1557, 3771,  134,  585, 5279, 4599,
           18,  300,  118,  179,  254,   16,  111,  104,  309,  437,  121,
          159,  351,  317,  103,  584,   18,  151,  159,  179,  210,  192,
          116, 6815,  104,  110,   11,   61,  772,  903,   18,  104,  110,
          100,  127,  504, 3425, 1749,  280, 2828,  524,    2], dtype=int32)>,
  'padding_mask': <tf.Tensor: shape=(64,), dtype=bool, numpy=
  array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,


## Design a tiny transformer

In [None]:
token_id_input = keras.Input(
    shape=(None,), dtype="int32", name="token_ids",
)
padding_mask = keras.Input(
    shape=(None,), dtype="int32", name="padding_mask",
)
outputs = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=len(vocab),
    sequence_length=packer.sequence_length,
    embedding_dim=64,
)(token_id_input)
outputs = keras_nlp.layers.TransformerEncoder(
    num_heads=2,
    intermediate_dim=128,
    dropout=0.1,
)(outputs, padding_mask=padding_mask)
# Use "[START]" token to classify
outputs = keras.layers.Dense(2)(outputs[:, 0, :])
model = keras.Model(
    inputs={
        "token_ids": token_id_input,
        "padding_mask": padding_mask,
    },
    outputs=outputs,
)

model.summary()

Model: "model_22"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 token_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 token_and_position_embedding_6  (None, None, 64)    637248      ['token_ids[0][0]']              
  (TokenAndPositionEmbedding)                                                                     
                                                                                                  
 padding_mask (InputLayer)      [(None, None)]       0           []                               
                                                                                                  
 transformer_encoder_8 (Transfo  (None, None, 64)    33472       ['token_and_position_embed

## Train the transformer directly on the classification objective

In [None]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.experimental.AdamW(5e-5),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
    jit_compile=True,
)
model.fit(
    imdb_preproc_train_ds,
    validation_data=imdb_preproc_val_ds,
    epochs=3,
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f03e333bd90>

Our classification accuracy is a fairly poor 0.76: the transformer architecture is too complicated to learn from scratch on a small dataset. The large performance gap with our earlier models shows the power of pretraining and transfer learning in modern NLP.