<a href="https://www.kaggle.com/code/nabeelparuk/south-american-translator-with-keras-nlp-in-prog?scriptVersionId=195818989" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **South American Translator with Keras NLP**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/por-eng/por-eng/por.txt
/kaggle/input/por-eng/por-eng/_about.txt


## Import modules

In [2]:
!pip install -q --upgrade rouge-score
!pip install -q --upgrade keras-nlp
!pip install -q --upgrade keras

In [3]:
import keras_nlp
import pathlib
import random
import os
import pandas as pd
import numpy as np

import tensorflow as tf
import keras
from keras import ops

import tensorflow.data as tf_data

from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab
)

import pickle

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

# English-Spanish

## Setup

### Define parameters and hyperparameters

In [4]:
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
SPA_VOCAB_SIZE = 15000
POR_VOCAB_SIZE = 15000
AYM_VOCAB_SIZE = 15000
BATCH_SIZE = 64
EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

### Import English-Spanish Dataset

In [5]:
text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True
)

text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2638744/2638744 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


### Parse the data

* English = Source sequence
* Spanish = Target sequence

In [6]:
# We will add the text to a list -> But first make everything lowercase
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []

for line in lines:
    eng, spa = line.split("\t")
    eng = eng.lower()
    spa = spa.lower()
    text_pairs.append((eng, spa))

#### View the sentence pairs

In [7]:
for _ in range(5):
    print(random.choice(text_pairs))

("i don't think i have one of those yet.", 'no creo que tenga todavía uno de esos.')
('is that tom?', '¿es ese tom?')
('would you like to hear me sing a song?', '¿quieres oírme cantar?')
('when you watch television or listen to the radio, the music which you hear is often african in origin.', 'cuando ves televisión o escuchas la radio, la música que oyes es frecuentemente de origen africano.')
('tom thought it would be difficult for mary to get a ticket to that concert.', 'tom pensó que sería difícil que maría consiguiera una entrada para ese concierto.')


## Preprocessing

### Split the data

In [8]:
# Shuffle the list
random.shuffle(text_pairs)

# Set training and validation sizes
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = int(len(text_pairs) - 2 * num_val_samples)

# Get train, val and test sets
train_pairs = text_pairs[: num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"Train size: {len(train_pairs)}")
print(f"Validation size: {len(val_pairs)}")
print(f"Test size: {len(test_pairs)}")

Train size: 83276
Validation size: 17844
Test size: 17844
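The split arithmetic above can be checked in isolation. The sketch below uses a hypothetical helper, `split_pairs` (not part of the notebook), and assumes a dataset of 118,964 pairs, which is the total implied by the three sizes printed above.

```python
def split_pairs(pairs, val_fraction=0.15):
    """Slice a list into train/val/test parts the same way the cell above does."""
    num_val = int(val_fraction * len(pairs))
    num_train = len(pairs) - 2 * num_val
    train = pairs[:num_train]
    val = pairs[num_train:num_train + num_val]
    test = pairs[num_train + num_val:]
    return train, val, test


# Dummy stand-in for the shuffled sentence pairs (size is an assumption, see above)
dummy_pairs = list(range(118964))
train, val, test = split_pairs(dummy_pairs)
print(len(train), len(val), len(test))  # 83276 17844 17844
```

Because the training size is computed as "whatever is left after two validation-sized slices", the three parts always cover the whole list with no overlap.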


### Tokenization

We need two tokenizers: one for the English (source) text and one for the Spanish (target) text.

- Before we can use them, we need to learn a vocabulary for each language from the training data

#### Start by generating a vocabulary for each language

In [9]:
# Use WordPiece to subword tokenize -> returns a vocabulary of subwords
def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf_data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab
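To make the idea of subword tokenization concrete, here is a minimal pure-Python sketch of how a WordPiece vocabulary is typically applied to a single word: greedy longest-match-first, with `##` marking pieces that continue a word. It does not show how the vocabulary is learned, and `wordpiece_tokenize` and `toy_vocab` are hypothetical names for illustration, not part of `keras_nlp`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"hid", "##e", "##ing", "some", "##where", "it"}
print(wordpiece_tokenize("hiding", toy_vocab))     # ['hid', '##ing']
print(wordpiece_tokenize("somewhere", toy_vocab))  # ['some', '##where']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```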

- [PAD] - Padding token
- [UNK] - Unknown token
- [START] - Token that marks the start of the input sequence
- [END] - Token that marks the end of the input sequence

In [10]:
# Reserve these tokens
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

# Tokenize English samples
eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

# Tokenize Spanish samples
spa_samples = [text_pair[1] for text_pair in train_pairs]
spa_vocab = train_word_piece(spa_samples, SPA_VOCAB_SIZE, reserved_tokens)

In [11]:
# View some tokens
print("English Tokens: ", eng_vocab[100:110])
print("Spanish Tokens: ", spa_vocab[100:110])

English Tokens:  ['him', 'there', 'they', 'go', 'her', 'has', 'will', 're', 'time', 'how']
Spanish Tokens:  ['para', 'mary', 'las', 'más', 'al', 'yo', 'tu', 'estoy', 'muy', 'eso']


#### Now create the tokenizers with the vocabularies we just made

In [12]:
# English tokenizer
eng_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)

# Spanish tokenizer
spa_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=spa_vocab, lowercase=False
)

In [13]:
# Save tokenizers to file
# English
with open('tokenizer_engspa_eng.pickle', 'wb') as handle:
    pickle.dump(eng_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# Spanish
with open('tokenizer_engspa_spa.pickle', 'wb') as handle:
    pickle.dump(spa_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [14]:
# Let's test the tokenizers on a sample
# English
eng_input_eg = text_pairs[0][0]
eng_tokens_eg = eng_tokenizer.tokenize(eng_input_eg)
print("English Sentence: ", eng_input_eg)
print("Tokens: ", eng_tokens_eg)
print("Recovered text after detokenizing: ",
     eng_tokenizer.detokenize(eng_tokens_eg))

print()
# Spanish
spa_input_eg = text_pairs[0][1]
spa_tokens_eg = spa_tokenizer.tokenize(spa_input_eg)
print("Spanish sentence: ", spa_input_eg)
print("Tokens: ", spa_tokens_eg)
print("Recovered text after detokenizing:",
     spa_tokenizer.detokenize(spa_tokens_eg))

English Sentence:  i'll hide it somewhere.
Tokens:  tf.Tensor([  35    8  110  756   70 1083   12], shape=(7,), dtype=int32)
Recovered text after detokenizing:  tf.Tensor(b"i ' ll hide it somewhere .", shape=(), dtype=string)

Spanish sentence:  lo esconderé en cualquier lugar.
Tokens:  tf.Tensor([  88 2610  217   81  491  335   15], shape=(7,), dtype=int32)
Recovered text after detokenizing: tf.Tensor(b'lo esconder\xc3\xa9 en cualquier lugar .', shape=(), dtype=string)


### Format datasets

**We want the model to predict target words N+1 and beyond using:**
 - The source sentence (English)
 - The words up to N (words already predicted before)
 
**The training dataset will yield a tuple (inputs, targets):**
 -  Inputs: Dictionary with keys `encoder_inputs` and `decoder_inputs`.
     - `encoder_inputs` -> Tokenized source sentence
     - `decoder_inputs` -> Target sentence so far (what has already been predicted, i.e. words up to N)
 - Targets: target sentence offset by one step
     - Provides the next words in the target sentence (what the model will try to predict)

We also need to add the special tokens ([START] and [END]) to the Spanish (target) sentence after tokenizing it, and pad both inputs to a fixed length.
 - This can be done using `keras_nlp.layers.StartEndPacker`
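The one-step offset between decoder inputs and targets is easier to see on plain lists. The sketch below is a simplified stand-in for the packing step, not the `keras_nlp` implementation: `pack` is a hypothetical helper, the sequence length is shortened to 10 for readability, and the ids assume the reserved tokens occupy ids 0 to 3 in the order `[PAD]`, `[UNK]`, `[START]`, `[END]`.

```python
PAD_ID, START_ID, END_ID = 0, 2, 3  # assumed ids of [PAD], [START], [END]


def pack(tokens, sequence_length, start_id=None, end_id=None, pad_id=PAD_ID):
    """Optionally add start/end ids, then truncate or pad to a fixed length."""
    room = sequence_length - (start_id is not None) - (end_id is not None)
    packed = list(tokens[:room])
    if start_id is not None:
        packed = [start_id] + packed
    if end_id is not None:
        packed = packed + [end_id]
    return packed + [pad_id] * (sequence_length - len(packed))


toy_length = 10
spa_tokens = [88, 2610, 217, 81, 491, 335, 15]  # token ids of one Spanish sentence

# Pack to toy_length + 1 so that both shifted views have exactly toy_length items
spa_packed = pack(spa_tokens, toy_length + 1, start_id=START_ID, end_id=END_ID)
decoder_inputs = spa_packed[:-1]  # starts with [START]
targets = spa_packed[1:]          # same sequence shifted left by one step

print(decoder_inputs)  # [2, 88, 2610, 217, 81, 491, 335, 15, 3, 0]
print(targets)         # [88, 2610, 217, 81, 491, 335, 15, 3, 0, 0]
```

At every position `i`, `targets[i]` is the token that follows `decoder_inputs[i]`, which is exactly what the decoder is trained to predict.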

In [15]:
# Define the preprocessing function
def preprocess_batch(eng, spa):
    # Number of sentence pairs in this batch (not used further in this function)
    batch_size = ops.shape(spa)[0]
    
    # Tokenize
    eng = eng_tokenizer(eng)
    spa = spa_tokenizer(spa)
    
    # Pad the English tokenized data to 'MAX_SEQUENCE_LENGTH'
    eng_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)
    
    # Pad the Spanish tokenized data AND add special tokens '[START]' and '[END]'
    spa_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=spa_tokenizer.token_to_id("[START]"),
        end_value=spa_tokenizer.token_to_id("[END]"),
        pad_value=spa_tokenizer.token_to_id("[PAD]"),
    )
    spa = spa_start_end_packer(spa)
    
    # Now return the tuple with the inputs (encoder inputs and decoder inputs in a dictionary) and the targets
    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[ : , :-1]
        },
        spa[:, 1:],
    )

In [16]:
# Define the dataset function
def make_dataset(pairs):
    
    # Get the texts for each language individually
    eng_texts, spa_texts = zip(*pairs)
    
    # Turn returned variables into lists
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    
    # Turn it into a tf_data dataset
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf_data.AUTOTUNE)
    
    # Cache the preprocessed batches first, then shuffle, so batches are reshuffled every epoch
    return dataset.cache().shuffle(2048).prefetch(16)

In [17]:
# Create the dataset
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [18]:
# Look at sequence shapes
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 40)
inputs["decoder_inputs"].shape: (64, 40)
targets.shape: (64, 40)


## Model Construction

**We need:**
- Embeddings -> The following are combined into one layer
    - Embedding layer
        - Creates a vector for every token in our sequence
        - Can be initialised randomly
    - Positional embedding layer
        - Encodes the word order in the sequence
        - With the `mask_zero` argument we can mask the padding tokens ("[PAD]")
- Seq2Seq Transformer
    - Consists of `TransformerEncoder` and `TransformerDecoder` layers chained together

**Workflow of model:**
1. Source sequence (English) passes to `TransformerEncoder` -> produces a new representation of it
2. New representation passed to `TransformerDecoder`
3. Given the target sequence so far (the tokens at positions 0 to N), `TransformerDecoder` uses the encoder's representation to predict the token at position N+1.

**Key detail: Causal Masking**
- `TransformerDecoder` sees the whole target sequence at once, but when predicting token N+1 it must only use information from target tokens 0 to N
- A model that used information from future tokens during training could not be used at inference time, when those tokens are not yet available
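A causal mask can be pictured as a lower-triangular matrix: row `q` marks which positions the token at position `q` may attend to. The sketch below is a plain-Python illustration under that assumption (`causal_mask` and `masked_softmax` are hypothetical helpers), not the masking code used inside `TransformerDecoder`.

```python
import math


def causal_mask(length):
    """mask[q][k] is 1 when query position q may attend to key position k."""
    return [[1 if k <= q else 0 for k in range(length)] for q in range(length)]


def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    exps = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Attention weights for query position 1: positions 2 and 3 are in the future
weights = masked_softmax([0.5, 2.0, 1.0, 3.0], mask[1])
print(weights[2], weights[3])  # 0.0 0.0
```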

### Create modelling checkpoint callback

In [19]:
def create_model_checkpoint(model_name, save_path="/kaggle/working/model_experiments"):
    return tf.keras.callbacks.ModelCheckpoint(filepath=os.path.join(save_path, f"{model_name}.keras"),
                                            verbose=0,
                                            save_best_only=True)

### Encoder

In [20]:
# Use Functional API
# Inputs
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

# Embedding
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(encoder_inputs)

# Outputs
encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)

# Define encoder
encoder = keras.Model(encoder_inputs, encoder_outputs)

### Decoder

In [21]:
# Use Functional API
# Inputs
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs") # What has been predicted
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs") # Output from encoder

# Embedding
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=SPA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

# Decoder layer
x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM,
    num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)

# Add dropout
x = keras.layers.Dropout(0.5)(x)

# Outputs
decoder_outputs = keras.layers.Dense(SPA_VOCAB_SIZE, activation='softmax')(x)

# Define decoder
decoder = keras.Model(
    [
    decoder_inputs,
    encoded_seq_inputs,
    ],
    decoder_outputs
)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])

### Transformer

In [22]:
transformer1 = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer_eng_spa"
)

transformer1.summary()

In [23]:
# Compile transformer
transformer1.compile(optimizer=tf.keras.optimizers.RMSprop(),
                   loss="sparse_categorical_crossentropy",
                   metrics=['accuracy'])

### Train Model

In [24]:
transformer1.fit(train_ds,
                validation_data=val_ds,
                epochs=5,
                callbacks=[create_model_checkpoint(transformer1.name)]
               )

Epoch 1/5
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 173s 96ms/step - accuracy: 0.8238 - loss: 1.4400 - val_accuracy: 0.9833 - val_loss: 0.1434
Epoch 2/5
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 68s 52ms/step - accuracy: 0.9865 - loss: 0.1153 - val_accuracy: 0.9999 - val_loss: 0.0092
Epoch 3/5
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 65s 50ms/step - accuracy: 0.9994 - loss: 0.0114 - val_accuracy: 1.0000 - val_loss: 8.3736e-04
Epoch 4/5
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 65s 50ms/step - accuracy: 0.9995 - loss: 0.0048 - val_accuracy: 1.0000 - val_loss: 2.7757e-04
Epoch 5/5
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 64s 49ms/step - accuracy: 1.0000 - loss: 6.5127e-04 - val_accuracy: 1.0000 - val_loss: 1.8397e-04


<keras.src.callbacks.history.History at 0x7c0de0145420>

## Decoding Test Sentences

This section translates previously unseen sentences drawn from the test set.

In [25]:
def decode_sequences(input_sentences, transformer, lan_tokenizer):
    batch_size = 1
    
    # Tokenize encoder input
    encoder_input_tokens = ops.convert_to_tensor(eng_tokenizer(input_sentences))
    if len(encoder_input_tokens[0]) < MAX_SEQUENCE_LENGTH:
        pads = ops.full((1, MAX_SEQUENCE_LENGTH - len(encoder_input_tokens[0])), 0)
        encoder_input_tokens = ops.concatenate(
            [encoder_input_tokens.to_tensor(), pads], 1
        )
        
    # Return the next-token scores for the sampler, given the tokens generated so far.
    # The model ends in a softmax, so these are probabilities rather than raw logits;
    # greedy (argmax) sampling picks the same token either way.
    def next_token(prompt, cache, index):
        logits = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        # We ignore hidden states for now -> needed only for contrastive search
        hidden_states = None
        return logits, hidden_states, cache
    
    # Build a prompt of length MAX_SEQUENCE_LENGTH: a start token followed by padding tokens
    length = MAX_SEQUENCE_LENGTH
    # Add start token
    start = ops.full((batch_size, 1), lan_tokenizer.token_to_id("[START]"))
    # Add pad token
    pad = ops.full((batch_size, length - 1), lan_tokenizer.token_to_id("[PAD]"))
    
    prompt = ops.concatenate((start, pad), axis=-1)
    
    # GreedySampler -> Outputs token with highest probability
    generated_tokens = keras_nlp.samplers.GreedySampler()(
        next_token,
        prompt,
        stop_token_ids=[lan_tokenizer.token_to_id("[END]")],
        index=1, # Sample only after "[START]" token
    )
    generated_sentences = lan_tokenizer.detokenize(generated_tokens)
    return generated_sentences
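Stripped of the Keras machinery, greedy decoding is a short loop: start from `[START]`, repeatedly append the most probable next token, and stop at `[END]` or at the maximum length. The sketch below illustrates that loop with a hypothetical `toy_next_token_probs` function standing in for the trained transformer; the token ids and the scripted transitions are made up for the example.

```python
PAD, START, END = 0, 2, 3  # assumed ids of the reserved tokens
TOY_VOCAB_SIZE = 8


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for the model: a distribution over the next token."""
    script = {START: 5, 5: 6, 6: END}  # made-up transitions
    likely = script.get(prefix[-1], END)
    probs = [0.05] * TOY_VOCAB_SIZE
    probs[likely] = 0.65  # remaining mass is spread evenly over the other ids
    return probs


def greedy_decode(next_token_probs, max_length):
    """Append the arg-max token until [END] or max_length, then pad."""
    tokens = [START]
    while len(tokens) < max_length:
        probs = next_token_probs(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == END:
            break
    return tokens + [PAD] * (max_length - len(tokens))


print(greedy_decode(toy_next_token_probs, max_length=6))  # [2, 5, 6, 3, 0, 0]
```

Greedy decoding commits to one token per step and never revisits the choice; other samplers trade that simplicity for more varied or higher-scoring outputs.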

In [26]:
# Draw English samples from test set
test_eng_texts = [pair[0] for pair in test_pairs]

for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence], transformer1, spa_tokenizer)
    translated = translated.numpy()[0].decode("utf-8")
    
    translated = (
        translated.replace("[PAD]","")
        .replace("[START]","")
        .replace("[END]", "")
        .strip()
    )
    print(f"Example {i+1}")
    print(input_sentence)
    print(translated)
    print()

Example 1
i'm tired of listening to his boasts.
soldados j policía — piso enseguida lo ;icidio su caído

Example 2
i always confuse which side is port and which starboard.
me ? tío aceptar fotografíala tenés ó m promesas quéla invitótiendoiembre



## Evaluate model

We will use the METEOR metric for a quantitative evaluation of the model.
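The notebook does not yet include its evaluation code, so the following is only a simplified sketch of the core of METEOR: exact unigram matching combined into a recall-weighted harmonic mean, as in the original METEOR formulation. It deliberately omits stemming, synonym matching, and the fragmentation penalty, so its scores will not match a full METEOR implementation. `unigram_fmean` is a hypothetical helper name.

```python
from collections import Counter


def unigram_fmean(reference, hypothesis):
    """Recall-weighted harmonic mean of exact unigram precision and recall."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens or not hyp_tokens:
        return 0.0
    # Count each shared token at most as often as it appears in both sentences
    overlap = Counter(ref_tokens) & Counter(hyp_tokens)
    matches = sum(overlap.values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp_tokens)
    recall = matches / len(ref_tokens)
    return 10 * precision * recall / (recall + 9 * precision)


reference = "lo esconderé en cualquier lugar ."
hypothesis = "lo esconderé en algún lugar ."
score = unigram_fmean(reference, hypothesis)
print(round(score, 4))  # 0.8333
```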

# English-Portuguese

## Setup

#### Parse the data

In [27]:
por_eng_df = pd.read_csv('/kaggle/input/por-eng/por-eng/por.txt', header=None, sep="\t")
por_eng_df = por_eng_df.loc[: , :1]
por_eng_df.head()

| | 0 | 1 |
|---|---|---|
| 0 | Go. | Vai. |
| 1 | Go. | Vá. |
| 2 | Hi. | Oi. |
| 3 | Run! | Corre! |
| 4 | Run! | Corra! |


In [28]:
# Create a list from each column
eng_list = por_eng_df.loc[:, 0].to_list()
por_list = por_eng_df.loc[:, 1].to_list()

text_pairs = []

# Lowercase each sentence and store the (English, Portuguese) pair
for eng, por in zip(eng_list, por_list):
    text_pairs.append((eng.lower(), por.lower()))

##### View the sentence pairs

In [29]:
for _ in range(5):
    print(random.choice(text_pairs))

('he gave the cat milk.', 'ele deu leite ao gato.')
('we have three dogs.', 'nós temos três cachorros.')
("he's always scowling.", 'ele anda sempre de cara feia.')
('you know tom likes baseball.', 'vocês sabem que o tom gosta de beisebol.')
("i'll see you the day after tomorrow.", 'até depois de amanhã.')


## Preprocessing

### Split the data

In [30]:
random.shuffle(text_pairs)

# Set training and validation sizes
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = int(len(text_pairs) - 2 * num_val_samples)

# Get train, val and test sets
train_pairs = text_pairs[: num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

print(f"Train size: {len(train_pairs)}")
print(f"Validation size: {len(val_pairs)}")
print(f"Test size: {len(test_pairs)}")

Train size: 135545
Validation size: 29044
Test size: 29044


### Tokenization

In [31]:
# Reserve these tokens
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

# Tokenize English samples
eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

# Tokenize Portuguese samples
por_samples = [text_pair[1] for text_pair in train_pairs]
por_vocab = train_word_piece(por_samples, POR_VOCAB_SIZE, reserved_tokens)

In [32]:
# View some tokens
print("English tokens: ", eng_vocab[100:110])
print("Portuguese tokens: ", por_vocab[100:110])

English tokens:  ['go', 'how', 'll', 'at', 'here', 'there', 've', 'she', 'going', 'they']
Portuguese tokens:  ['mais', 'estava', '##s', 'no', 'na', 'estou', 'tem', 'foi', 'nós', 'os']


#### Create tokenizers with vocabularies

In [33]:
# English tokenizer
eng_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab,
    lowercase=False
)

# Portuguese tokenizer
por_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=por_vocab,
    lowercase=False
)

In [34]:
# Saving tokenizers
# English
with open('tokenizer_engpor_eng.pickle', 'wb') as handle:
    pickle.dump(eng_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Portuguese
with open('tokenizer_engpor_por.pickle', 'wb') as handle:
    pickle.dump(por_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [35]:
# Test English tokenizer on a sample
eng_input_eg = text_pairs[0][0]
eng_token_eg = eng_tokenizer(eng_input_eg)
print("Original sentence: ", eng_input_eg)
print("Tokenized output: ", eng_token_eg)
print("Recovered text after detokenizing: ", eng_tokenizer.detokenize(eng_token_eg))

print()
# Test Portuguese tokenizer on a sample
por_input_eg = text_pairs[0][1]
por_token_eg = por_tokenizer(por_input_eg)
print("Original sentence: ", por_input_eg)
print("Tokenized output: ", por_token_eg)
print("Recovered text after detokenizing: ", por_tokenizer.detokenize(por_token_eg))

Original sentence:  don't open it.
Tokenized output:  tf.Tensor([ 76   9  45 296  70  12], shape=(6,), dtype=int32)
Recovered text after detokenizing:  tf.Tensor(b"don ' t open it .", shape=(), dtype=string)

Original sentence:  não o abram.
Tokenized output:  tf.Tensor([  81   42 4613   13], shape=(4,), dtype=int32)
Recovered text after detokenizing:  tf.Tensor(b'n\xc3\xa3o o abram .', shape=(), dtype=string)


### Format datasets

In [36]:
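# preprocess_batch reads the module-level `spa_tokenizer` for the target language,
# so point that name at the Portuguese tokenizer before building the datasets.
# Otherwise the Portuguese sentences would be encoded with the Spanish vocabulary.
# (The Spanish tokenizer itself was already saved to disk above.)
spa_tokenizer = por_tokenizer

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)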

In [37]:
# Look at sequence shapes
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 40)
inputs["decoder_inputs"].shape: (64, 40)
targets.shape: (64, 40)


## Model Construction

### Encoder

In [38]:
# Inputs
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

# Embedding
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(encoder_inputs)

# Outputs
encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)

# Define encoder
encoder = keras.Model(encoder_inputs, encoder_outputs)

### Decoder

In [39]:
# Inputs
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs") # What we already predicted
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs") # Output from encoder to go to the next word

# Embedding
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=POR_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM
)(decoder_inputs)

# Decoder layer
x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM,
    num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)

# Add dropout
x = keras.layers.Dropout(0.5)(x)

# Outputs
decoder_outputs = keras.layers.Dense(POR_VOCAB_SIZE, activation='softmax')(x)

# Define decoder
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs
    ],
    decoder_outputs
)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])

### Transformer

In [40]:
transformer2 = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer_eng_por"
)

transformer2.summary()

In [41]:
# Compile transformer
transformer2.compile(optimizer=tf.keras.optimizers.RMSprop(),
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

### Train Model

In [42]:
transformer2.fit(train_ds, validation_data=val_ds,
               epochs=10,
               callbacks=[create_model_checkpoint(transformer2.name)])

Epoch 1/10
2118/2118 ━━━━━━━━━━━━━━━━━━━━ 168s 71ms/step - accuracy: 0.6644 - loss: 1.9933 - val_accuracy: 0.7256 - val_loss: 1.4084
Epoch 2/10
2118/2118 ━━━━━━━━━━━━━━━━━━━━ 108s 51ms/step - accuracy: 0.7551 - loss: 1.2627 - val_accuracy: 0.9937 - val_loss: 0.0430
Epoch 3/10
2118/2118 ━━━━━━━━━━━━━━━━━━━━ 102s 48ms/step - accuracy: 0.9946 - loss: 0.0358 - val_accuracy: 0.9986 - val_loss: 0.0110
Epoch 4/10
2118/2118 ━━━━━━━━━━━━━━━━━━━━ 102s 48ms/step - accuracy: 0.9986 - loss: 0.0108 - val_accuracy: 0.9995 - val_loss: 0.0044
Epoch 5/10
2118/2118 ━━━━━━━━━━━━━━━━━━━━ 101s 48ms/step - accuracy: 0.9992 - loss: 0.0065 - val_accuracy: 0.9997 - val_loss: 0.0024
Epoch 6/10
2118/2118 ━━━━━━━━━━━━━━━━━━━━ 101s 47ms/step - accuracy: 0.9997 - loss: 0.0026 - val_accuracy: 0.9998 - val_loss: 0.0016
Epoch 7/1

<keras.src.callbacks.history.History at 0x7c09cef1f670>

## Decoding Test Sentences

In [43]:
# Draw English samples from test set
test_eng_texts = [pair[0] for pair in test_pairs]

# Output 2 example translations
for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence], transformer2, por_tokenizer)
    translated = translated.numpy()[0].decode('utf-8')
    
    translated = (
        translated.replace("[PAD]", "")
        .replace("[UNK]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )
    print(f"Example{i+1}")
    print(input_sentence)
    print(translated)
    print()

Example1
what's your favorite christmas carol?
( q com motivado forte estardico :

Example2
this analysis is divided in two parts.
adicione conseguido fonte meu campainha õ nomes entanto insistiu tomatesico



# English-Aymara

The English-Aymara dataset (sourced from Hugging Face) is significantly smaller than the other two and consists of individual words and short phrases rather than full sentences. If more, higher-quality data becomes available, this project will be updated.

## Setup

### Import English-Aymara dataset

In [44]:
eng_aym = pd.read_csv("hf://datasets/alvations/aymara-english/aymara-english.tsv", sep="\t")
eng_aym.head()

| | Aymara | English |
|---|---|---|
| 0 | achachi | grandfather |
| 1 | achachila | grandfather |
| 2 | achachilan tatapa | great grandfather |
| 3 | achacu | mouse |
| 4 | achaku | mouse |
