## Setup

Before we start implementing the pipeline, let's install and import all the libraries we need.

In [None]:
!pip install -q --upgrade rouge-score
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras  # Upgrade to Keras 3.

  Preparing metadata (setup.py) ... done
  Building wheel for rouge-score (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
keras-nlp 0.18.1 requires keras-hub==0.18.1, but you have keras-hub 0.21.1 which is incompatible.

In [None]:
# Standard library
import os
import pathlib
import random
import zipfile

# Third-party libraries
import keras
import keras_hub
import pandas as pd
import requests
import tensorflow as tf
import tensorflow.data as tf_data
from keras import ops
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab,
)


In [None]:
print("keras_hub version:", keras_hub.__version__)
print("keras version:", keras.__version__)
print("tensorflow version:", tf.__version__)
import tensorflow_text  # Imported here so that its own version is reported, not TensorFlow's.
print("tensorflow_text version:", tensorflow_text.__version__)
print("requests version:", requests.__version__)

keras_hub version: 0.21.1
keras version: 3.10.0
tensorflow version: 2.18.0
tensorflow_text version: 2.18.0
requests version: 2.32.3


Let's also define our parameters/hyperparameters.

In [None]:
BATCH_SIZE = 64
EPOCHS = 40
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
IND_VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

## Downloading the data

Please download the dataset from this link: https://www.manythings.org/anki/

Select the Indonesian-English archive (`ind-eng.zip`) and save it to your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Step 1: Path to the manually downloaded archive on Google Drive
local_zip_path = "/content/drive/MyDrive/Colab-Notebooks/ind-eng.zip"

# Step 2: Extract to its own folder
extract_dir = "/content/ind-eng_extracted"
if not os.path.exists(extract_dir):
    print("Extracting...")
    with zipfile.ZipFile(local_zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_dir)

# Step 3: Access the ind.txt file
text_file = pathlib.Path(extract_dir) / "ind.txt"
print(f"Final path: {text_file}")
assert text_file.exists(), "File not found!"

Final path: /content/ind-eng_extracted/ind.txt


## Parsing the data

In [None]:
text_pairs = []

with open(text_file, encoding="utf-8") as f:
    lines = f.read().strip().split("\n")

for line in lines:
    parts = line.split("\t")
    if len(parts) >= 2:
        eng = parts[0].strip().lower()
        ind = parts[1].strip().lower()
        text_pairs.append((eng, ind))

Here's what our sentence pairs look like:

In [None]:
for _ in range(5):
    print(random.choice(text_pairs))

('did anything interesting happen while i was gone?', 'apakah sesuatu yang menarik terjadi ketika aku pergi?')
("you aren't my mother.", 'kamu bukan ibuku.')
('i am not a student.', 'aku bukan siswa.')
('she is not young.', 'dia tidak muda.')
('are we done yet?', 'apa kita masih belum selesai?')


Now, let's split the sentence pairs into a training set, a validation set,
and a test set.

In [None]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")


14881 total pairs
10417 training pairs
2232 validation pairs
2232 test pairs
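
The sizes printed above follow directly from the split arithmetic: 15% of the pairs (rounded down) go to validation, the same number go to test, and the remainder is used for training. A minimal check of that arithmetic, where `split_sizes` is a helper introduced here purely for illustration:

```python
def split_sizes(num_pairs, val_fraction=0.15):
    """Return (train, val, test) sizes for the split used in this notebook."""
    num_val = int(val_fraction * num_pairs)  # rounded down
    num_test = num_val                       # the test set has the same size
    num_train = num_pairs - num_val - num_test
    return num_train, num_val, num_test


train_size, val_size, test_size = split_sizes(14881)
print(train_size, val_size, test_size)  # 10417 2232 2232
```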


## Tokenizing the Data

We'll create two tokenizers: one for the source language (English) and one for the target language (Indonesian). To tokenize the text, we'll use `keras_hub.tokenizers.WordPieceTokenizer`.

`WordPieceTokenizer` uses a WordPiece vocabulary and provides functionality for both breaking text into tokens and reconstructing text from token sequences.

Before setting up the tokenizers, we first need to train them using our dataset. WordPiece is a subword-based tokenization algorithm. Training it on a corpus results in a vocabulary composed of subwords.

Subword tokenization offers a balance between:

* **Word-level tokenization**, which usually requires a very large vocabulary to cover all possible words, and
* **Character-level tokenization**, which often loses semantic information since individual characters don't convey meaning on their own.
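
To make the idea of subwords concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. It is not the KerasHub implementation; the function `wordpiece_split` and the toy vocabulary are assumptions made up for this illustration. Pieces that continue a word carry the `##` prefix, as in the vocabularies shown later in this notebook.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"makan", "punya", "mem", "##punya", "##i", "##lah", "##an"}

print(wordpiece_split("makanlah", toy_vocab))   # ['makan', '##lah']
print(wordpiece_split("mempunyai", toy_vocab))  # ['mem', '##punya', '##i']
print(wordpiece_split("xyz", toy_vocab))        # ['[UNK]']
```

A small vocabulary can therefore still cover words it has never seen as a whole, as long as their parts are present.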

Fortunately, KerasHub simplifies the process of training a WordPiece tokenizer with the `keras_hub.tokenizers.compute_word_piece_vocabulary` utility.


In [None]:
def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf_data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab


Every vocabulary has a few special, reserved tokens. We have four such tokens:

- `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence when it is shorter than the maximum sequence length.
- `"[UNK]"` - Unknown token.
- `"[START]"` - Token that marks the start of the input sequence.
- `"[END]"` - Token that marks the end of the input sequence.

In [None]:
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

ind_samples = [text_pair[1] for text_pair in train_pairs]
ind_vocab = train_word_piece(ind_samples, IND_VOCAB_SIZE, reserved_tokens)

Let's see some tokens!

In [None]:
print("English Tokens: ", eng_vocab[100:110])
print("Indo Tokens: ", ind_vocab[100:110])

English Tokens:  ['ll', '##d', 'did', 've', '##y', 'where', 'about', 'they', 'one', 'time']
Indo Tokens:  ['punya', 'sedang', 'sangat', 'makan', 'mereka', 'sini', 'mana', '##a', '##lah', 'banyak']


Now, let's define the tokenizers. We will configure the tokenizers with
the vocabularies trained above.

In [None]:
eng_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)
ind_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=ind_vocab, lowercase=False
)

Let's try and tokenize a sample from our dataset! To verify whether the text has
been tokenized correctly, we can also detokenize the list of tokens back to the
original text.

In [None]:
eng_input_ex = text_pairs[0][0]
eng_tokens_ex = eng_tokenizer.tokenize(eng_input_ex)
print("English sentence: ", eng_input_ex)
print("Tokens: ", eng_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    eng_tokenizer.detokenize(eng_tokens_ex),
)

print()

ind_input_ex = text_pairs[0][1]
ind_tokens_ex = ind_tokenizer.tokenize(ind_input_ex)
print("Indo sentence: ", ind_input_ex)
print("Tokens: ", ind_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    ind_tokenizer.detokenize(ind_tokens_ex),
)

English sentence:  do you happen to have matches?
Tokens:  tf.Tensor([ 67  57 466  59  70  38 313 771  25], shape=(9,), dtype=int32)
Recovered text after detokenizing:  do you happen to have matches ?

Indo sentence:  apa kamu mempunyai korek?
Tokens:  tf.Tensor([ 71  66  42 209 146 349 819  40 378 288  29], shape=(11,), dtype=int32)
Recovered text after detokenizing:  apa kamu mempunyai korek ?


## Preparing the Datasets

Next, we'll prepare our datasets for training. At each training step, the model aims to predict the next word (N+1 and beyond) using the input sentence and the current portion of the target sentence, from word 0 to N.

Therefore, each training sample will consist of a tuple `(inputs, targets)`, where:

* `inputs` is a dictionary containing two keys: `encoder_inputs` and `decoder_inputs`. `encoder_inputs` is the tokenized source sentence in English, while `decoder_inputs` contains the Indonesian target sentence up to the current word (words 0 to N), which the model uses to predict the following word(s).
* `targets` is the Indonesian sentence shifted by one position, providing the expected next word that the model is supposed to learn to predict.

We'll wrap the tokenized Indonesian target sentence in the special tokens `"[START]"` and `"[END]"`. Both the English and the Indonesian sequences are also padded to a fixed length. `keras_hub.layers.StartEndPacker` handles both steps conveniently.

In [None]:
def preprocess_batch(eng, ind):
    batch_size = ops.shape(ind)[0]

    eng = eng_tokenizer(eng)
    ind = ind_tokenizer(ind)

    # Pad `eng` to `MAX_SEQUENCE_LENGTH`.
    eng_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)

    # Add special tokens (`"[START]"` and `"[END]"`) to `ind` and pad it as well.
    ind_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=ind_tokenizer.token_to_id("[START]"),
        end_value=ind_tokenizer.token_to_id("[END]"),
        pad_value=ind_tokenizer.token_to_id("[PAD]"),
    )
    ind = ind_start_end_packer(ind)

    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": ind[:, :-1],
        },
        ind[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, ind_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    ind_texts = list(ind_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, ind_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf_data.AUTOTUNE)
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 40 steps long):

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")


inputs["encoder_inputs"].shape: (64, 40)
inputs["decoder_inputs"].shape: (64, 40)
targets.shape: (64, 40)
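
The decoder inputs and the targets come out at the same length because the target sentence is packed to `MAX_SEQUENCE_LENGTH + 1` tokens and then sliced twice, once without its last token and once without its first. The sketch below mimics that on a single sentence in plain Python. The `pack` helper, the shortened length of 7, and the token ids are assumptions for illustration (the special-token ids assume the reserved tokens occupy the first vocabulary slots).

```python
def pack(token_ids, sequence_length, start_id, end_id, pad_id):
    """Add start/end markers, truncating if needed, then pad to a fixed length."""
    body = token_ids[: sequence_length - 2]  # leave room for the two markers
    packed = [start_id] + body + [end_id]
    return packed + [pad_id] * (sequence_length - len(packed))


PAD_ID, START_ID, END_ID = 0, 2, 3  # assumed ids of the reserved tokens
MAX_LEN = 7                         # toy stand-in for MAX_SEQUENCE_LENGTH

packed = pack([71, 66, 42], MAX_LEN + 1, START_ID, END_ID, PAD_ID)
decoder_inputs = packed[:-1]  # what the decoder sees
targets = packed[1:]          # what it must predict, one step ahead

print(packed)          # [2, 71, 66, 42, 3, 0, 0, 0]
print(decoder_inputs)  # [2, 71, 66, 42, 3, 0, 0]
print(targets)         # [71, 66, 42, 3, 0, 0, 0]
```

At every position, `targets` holds the token that follows the corresponding entry of `decoder_inputs`, which is exactly the next-token objective described earlier.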



## Building the Model

Now we’re moving on to the exciting part — building our model!

First, we need an embedding layer, which assigns a vector representation to each token in the input sequence. This layer can start with random initialization. We also require a positional embedding layer that encodes the position of each word within the sequence. Typically, the token embeddings and positional embeddings are added together. KerasHub provides a convenient layer for this: `keras_hub.layers.TokenAndPositionEmbedding`, which handles both token and position embedding for us.
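
As a rough picture of what such a layer computes, the following pure-Python sketch looks up one vector per token and one per position and adds them element-wise. It is only an illustration under simplifying assumptions: the helper names and the toy sizes are made up here, and the tables are random and not trained.

```python
import random


def make_table(num_rows, dim, rng):
    """Randomly initialized embedding table with one `dim`-sized row per id."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_rows)]


def token_and_position_embedding(token_ids, token_table, position_table):
    """Add each token's vector to the vector of the position it occupies."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]


rng = random.Random(0)
token_table = make_table(num_rows=10, dim=4, rng=rng)    # toy vocabulary of 10 ids
position_table = make_table(num_rows=6, dim=4, rng=rng)  # toy maximum length of 6

embedded = token_and_position_embedding([5, 7, 5], token_table, position_table)
print(len(embedded), len(embedded[0]))  # 3 4
print(embedded[0] != embedded[2])       # same token, different positions
```

The same token id appears at positions 0 and 2 but receives different vectors, which is how the model can tell word order apart.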

Our sequence-to-sequence Transformer model is composed of two main components:

* `keras_hub.layers.TransformerEncoder`
* `keras_hub.layers.TransformerDecoder`

The process works as follows:

1. The input sequence in English is passed to the `TransformerEncoder`, which generates a contextualized representation of the sentence.
2. This encoded output, along with the Indonesian target sequence so far (from token 0 to N), is then passed into the `TransformerDecoder`.
3. The decoder attempts to predict the next token(s) in the Indonesian sentence (token N+1 and beyond).

A crucial part of making this architecture work is **causal masking**. Since the `TransformerDecoder` processes the entire sequence at once, we need to ensure it doesn’t access information from future tokens (i.e., tokens beyond N when predicting token N+1). This is where causal masking comes in — it ensures the decoder only attends to previous or current tokens. Thankfully, causal masking is enabled by default in `keras_hub.layers.TransformerDecoder`.

Another important aspect is **masking the padding tokens** (`"[PAD]"`). This can be done by setting `mask_zero=True` in the `TokenAndPositionEmbedding` layer; the resulting mask is then propagated through the rest of the model. Note that the code below does not pass `mask_zero=True`, so padding positions are not masked there and also count toward the loss and the accuracy.
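
The two kinds of masking can be pictured as boolean matrices in which entry `[query][key]` says whether position `query` may attend to position `key`. The sketch below builds them in plain Python for one short sequence. The helper names and the token ids are assumptions for illustration, with id 0 standing in for `"[PAD]"`; the actual layers compute their masks internally.

```python
def causal_mask(length):
    """Position `query` may attend only to positions at or before itself."""
    return [[key <= query for key in range(length)] for query in range(length)]


def padding_mask(token_ids, pad_id=0):
    """True for real tokens, False for padding."""
    return [token_id != pad_id for token_id in token_ids]


def combined_mask(token_ids, pad_id=0):
    """Causal mask with padded key positions switched off as well."""
    length = len(token_ids)
    causal = causal_mask(length)
    keep = padding_mask(token_ids, pad_id)
    return [[causal[q][k] and keep[k] for k in range(length)] for q in range(length)]


tokens = [2, 71, 66, 3, 0]  # hypothetical ids: start, two words, end, padding
for row in combined_mask(tokens):
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 11100
# 11110
# 11110
```

The lower-triangular shape comes from the causal constraint, and the all-zero last column comes from the padded position.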


In [None]:
# Encoder
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoder_outputs = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)


# Decoder
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=IND_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

x = keras_hub.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(IND_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer",
)


## Training the Model

To monitor the training process, we'll use accuracy as a simple metric to track performance on the validation set. While accuracy offers a quick overview, it's worth noting that machine translation tasks are more commonly evaluated using metrics like **BLEU** or **ROUGE**.

However, these advanced metrics require converting the model’s output probabilities back into actual text — a process known as decoding. Since text generation is computationally intensive, it's generally not advisable to perform this step during training.

In this example, we train the model for 40 epochs (`EPOCHS = 40`). In practice, achieving meaningful translation quality generally requires **at least 10 epochs** so that the model can converge.
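
When reading the accuracy values, keep in mind that token-level accuracy is computed per position of the padded target sequence. If padding positions are not excluded, correctly predicted padding inflates the score. The toy calculation below illustrates the effect; the `token_accuracy` helper and the token ids are assumptions made up for this example and are not taken from the training run.

```python
def token_accuracy(targets, predictions, pad_id=None):
    """Fraction of matching positions, optionally ignoring padded targets."""
    pairs = [
        (target, predicted)
        for target, predicted in zip(targets, predictions)
        if pad_id is None or target != pad_id
    ]
    return sum(target == predicted for target, predicted in pairs) / len(pairs)


# One padded target of length 10: four real tokens, six padding tokens (id 0).
targets = [71, 66, 42, 3, 0, 0, 0, 0, 0, 0]
predictions = [71, 12, 9, 3, 0, 0, 0, 0, 0, 0]

print(token_accuracy(targets, predictions))            # 0.8
print(token_accuracy(targets, predictions, pad_id=0))  # 0.5
```

Only two of the four real tokens are right here, yet the unmasked score is 0.8, which is one reason text-level metrics are preferred for judging translation quality.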


In [None]:
transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

Epoch 1/40
163/163 ━━━━━━━━━━━━━━━━━━━━ 40s 151ms/step - accuracy: 0.7674 - loss: 2.7666 - val_accuracy: 0.8200 - val_loss: 1.0703
Epoch 2/40
163/163 ━━━━━━━━━━━━━━━━━━━━ 11s 66ms/step - accuracy: 0.8223 - loss: 1.0394 - val_accuracy: 0.8243 - val_loss: 0.9653
Epoch 3/40
163/163 ━━━━━━━━━━━━━━━━━━━━ 10s 64ms/step - accuracy: 0.8263 - loss: 0.9503 - val_accuracy: 0.8278 - val_loss: 0.9120
Epoch 4/40
163/163 ━━━━━━━━━━━━━━━━━━━━ 10s 64ms/step - accuracy: 0.8297 - loss: 0.9009 - val_accuracy: 0.8402 - val_loss: 0.8437
Epoch 5/40
163/163 ━━━━━━━━━━━━━━━━━━━━ 11s 64ms/step - accuracy: 0.8421 - loss: 0.8268 - val_accuracy: 0.8487 - val_loss: 0.7917
Epoch 6/40
163/163 ━━━━━━━━━━━━━━━━━━━━ 21s 65ms/step - accuracy: 0.8506 - loss: 0.7672 - val_accuracy: 0.8570 - val_loss: 0.7382
Epoch 7/40

<keras.src.callbacks.history.History at 0x7b6c9075f350>

In [None]:
# Save the trained model to Google Drive
transformer.save("/content/drive/MyDrive/Colab Notebooks/my_test_transformer_model.keras")


## Decoding Test Sentences (Qualitative Analysis)

Finally, let's try translating new English sentences into Indonesian.

To do this, we feed the model a tokenized English input sentence and initialize the decoding process with the `"[START]"` token in the target sequence. The model then predicts the probability distribution of the next token.

We continue generating tokens one at a time, using the tokens generated so far as context, until the model outputs the special `"[END]"` token, which signals the end of the translation.

For the decoding strategy, we’ll use tools from the `keras_hub.samplers` module. In this example, we’ll apply **Greedy Decoding**, which selects the most likely next token (i.e., the one with the highest probability) at each step of the sequence generation.
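
Stripped of the library machinery, greedy decoding is a short loop: ask for the next-token distribution, take the arg-max, append it, and stop at the end token or at the length limit. The sketch below shows that loop in plain Python. The `toy_next_token_probs` function is a made-up stand-in for the trained model, and the token ids are assumptions for illustration only.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_length):
    """Repeatedly append the most probable next token until END or the limit."""
    tokens = [start_id]
    while len(tokens) < max_length:
        probs = next_token_probs(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # arg-max
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens


START_ID, END_ID = 2, 3  # assumed ids of "[START]" and "[END]"

# Hypothetical toy "model" over 6 token ids: each token strongly favors one successor.
TRANSITIONS = {2: 4, 4: 5, 5: 3}


def toy_next_token_probs(tokens):
    favored = TRANSITIONS[tokens[-1]]
    return [0.9 if token_id == favored else 0.02 for token_id in range(6)]


print(greedy_decode(toy_next_token_probs, START_ID, END_ID, max_length=10))
# [2, 4, 5, 3]
```

Greedy decoding never revisits an earlier choice, so it is fast but can miss a better overall sequence; other samplers trade more computation for that.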


In [None]:
# Load the saved model with the same Keras 3 API that was used to save it.
model = keras.models.load_model("/content/drive/MyDrive/Colab Notebooks/my_test_transformer_model.keras")


In [None]:
def decode_sequences(input_sentences):
    batch_size = 1

    # Tokenize the encoder input.
    encoder_input_tokens = ops.convert_to_tensor(eng_tokenizer(input_sentences))

    # Truncate the sequence to at most MAX_SEQUENCE_LENGTH tokens (no padding is added here).
    encoder_input_tokens = encoder_input_tokens[:, :MAX_SEQUENCE_LENGTH]

    # Define a function that outputs the next token's probability given the input sequence.
    def next(prompt, cache, index):
        logits = model([encoder_input_tokens, prompt])[:, index - 1, :]
        # Ignore hidden states for now; only needed for contrastive search.
        hidden_states = None
        return logits, hidden_states, cache

    # Build a prompt of length MAX_SEQUENCE_LENGTH with a start token and padding tokens.
    length = MAX_SEQUENCE_LENGTH
    start = ops.full((batch_size, 1), ind_tokenizer.token_to_id("[START]"))
    pad = ops.full((batch_size, length - 1), ind_tokenizer.token_to_id("[PAD]"))
    prompt = ops.concatenate((start, pad), axis=-1)

    generated_tokens = keras_hub.samplers.GreedySampler()(
        next,
        prompt,
        stop_token_ids=[ind_tokenizer.token_to_id("[END]")],
        index=1,  # Start sampling after start token.
    )
    generated_sentences = ind_tokenizer.detokenize(generated_tokens)

    # Return the first (and only) generated sentence of the batch.
    return generated_sentences[0]

test_eng_texts = [pair[0] for pair in test_pairs]
for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence])
    translated = (
        translated.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )
    print(f"** Example {i} **")
    print(input_sentence)
    print(translated)
    print()

** Example 0 **
i'll talk with you about this later, ok?
aku akan membicarakan tentang hal ini , kamu tidak apa - ok ?

** Example 1 **
peel the apple.
kupas apel itu !

