<a href="https://colab.research.google.com/github/mahima-c/deep-learning/blob/main/seq2seq_without_attention_for_machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

seq2seq without attention for machine translation

To understand the seq2seq model in greater detail, we will look at an example of one that learns how to translate from English to French using the French-English bilingual dataset from the Tatoeba Project (1997-2019) [26]. The dataset contains approximately 167,000 sentence pairs. To make our training go faster, we will only consider the first 30,000 sentence pairs for our training.

In [1]:
import nltk
import numpy as np
import re
import shutil
import tensorflow as tf
import os
import unicodedata
import zipfile
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

If you recall the structure of the seq2seq network, the input to the encoder is a sequence of English words. On the decoder side, the input is a sequence of French words, and the output is the same sequence of French words offset by one time step. The following function downloads the zip file, expands it, and creates the datasets described above.



The input is preprocessed to "asciify" the characters, separate specific punctuation marks from their neighboring words, and remove all characters other than letters and these punctuation marks. Finally, the sentences are converted to lowercase. Each English sentence is converted to a single sequence of words. Each French sentence is converted into two sequences, one preceded by the beginning of sentence (BOS) pseudo-word and the other followed by the end of sentence (EOS) pseudo-word.

In [7]:
def preprocess_sentence(sent):
    sent = "".join([c for c in unicodedata.normalize("NFD", sent) 
        if unicodedata.category(c) != "Mn"])
    sent = re.sub(r"([!.?])", r" \1", sent)
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)
    sent = re.sub(r"\s+", " ", sent)
    sent = sent.lower()
    return sent


def download_and_read(url, num_sent_pairs=30000):
    local_file = url.split('/')[-1]
    if not os.path.exists(local_file):
        os.system("wget -O {:s} {:s}".format(local_file, url))
        with zipfile.ZipFile(local_file, "r") as zip_ref:
            zip_ref.extractall(".")
    local_file = os.path.join(".", "fra.txt")
    en_sents, fr_sents_in, fr_sents_out = [], [], []
    with open(local_file, "r", encoding="utf-8") as fin:
        for i, line in enumerate(fin):
            en_sent, fr_sent,_ = line.strip().split('\t')
            en_sent = [w for w in preprocess_sentence(en_sent).split()]
            fr_sent = preprocess_sentence(fr_sent)
            fr_sent_in = [w for w in ("BOS " + fr_sent).split()]
            fr_sent_out = [w for w in (fr_sent + " EOS").split()]
            en_sents.append(en_sent)
            fr_sents_in.append(fr_sent_in)
            fr_sents_out.append(fr_sent_out)
            if i >= num_sent_pairs - 1:
                break
    return en_sents, fr_sents_in, fr_sents_out
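
As a quick check of the preprocessing described above, the following sketch runs one sentence pair through the same steps. The function body is copied from preprocess_sentence() so that the snippet runs on its own; the accented sample sentence is an illustrative assumption, not a line quoted from the dataset file.

```python
import re
import unicodedata


def preprocess_sentence(sent):
    # Copy of the function defined above, repeated so this snippet is standalone.
    # 1. strip accents: decompose characters, then drop combining marks ("Mn")
    sent = "".join(c for c in unicodedata.normalize("NFD", sent)
                   if unicodedata.category(c) != "Mn")
    # 2. put a space in front of sentence punctuation
    sent = re.sub(r"([!.?])", r" \1", sent)
    # 3. replace everything except letters and ! . ? with a space
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)
    # 4. collapse repeated whitespace, then lowercase
    sent = re.sub(r"\s+", " ", sent)
    return sent.lower()


en_sent = preprocess_sentence("I was careless.").split()
fr_sent = preprocess_sentence("J'étais insouciante.")
fr_sent_in = ("BOS " + fr_sent).split()    # decoder input
fr_sent_out = (fr_sent + " EOS").split()   # decoder target, shifted by one

print(en_sent)
print(fr_sent_in)
print(fr_sent_out)
```

Note that BOS and EOS are added after lowercasing, so they stay uppercase and cannot collide with ordinary words.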

In [3]:
def clean_up_logs(data_dir):
    checkpoint_dir = os.path.join(data_dir, "checkpoints")
    if os.path.exists(checkpoint_dir):
        shutil.rmtree(checkpoint_dir, ignore_errors=True)
    # recreate the directory so that it also exists on the very first run
    os.makedirs(checkpoint_dir, exist_ok=True)
    return checkpoint_dir

In [4]:
NUM_SENT_PAIRS = 30000
EMBEDDING_DIM = 256
ENCODER_DIM, DECODER_DIM = 1024, 1024
BATCH_SIZE = 64
NUM_EPOCHS = 30

tf.random.set_seed(42)

data_dir = "./data"
checkpoint_dir = clean_up_logs(data_dir)



In [8]:
# data preparation
download_url = "http://www.manythings.org/anki/fra-eng.zip"
sents_en, sents_fr_in, sents_fr_out = download_and_read(download_url)

Our next step is to tokenize our inputs and create the vocabulary. Since we have sequences in two different languages, we will create two different tokenizers and vocabularies, one for each language. The tf.keras framework provides a very powerful and versatile tokenizer class – here we have set filters to an empty string and lower to False because we have already done what was needed for tokenization in our preprocess_sentence() function. The Tokenizer creates various data structures from which we can compute the vocabulary sizes and lookup tables that allow us to go from word to word index and back.

Next we handle different length sequences of words by padding with zeros at the end, using the pad_sequences() function. Because our strings are fairly short, we do not do any truncation; we just pad to the maximum length of sentence that we have (8 words for English, and 16 words for French):

In [9]:
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(
    filters="", lower=False)
tokenizer_en.fit_on_texts(sents_en)
data_en = tokenizer_en.texts_to_sequences(sents_en)
data_en = tf.keras.preprocessing.sequence.pad_sequences(
    data_en, padding="post")
tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(
    filters="", lower=False)
tokenizer_fr.fit_on_texts(sents_fr_in)
tokenizer_fr.fit_on_texts(sents_fr_out)
data_fr_in = tokenizer_fr.texts_to_sequences(sents_fr_in)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(
    data_fr_in, padding="post")
data_fr_out = tokenizer_fr.texts_to_sequences(sents_fr_out)
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(
    data_fr_out, padding="post")
vocab_size_en = len(tokenizer_en.word_index)
vocab_size_fr = len(tokenizer_fr.word_index)
word2idx_en = tokenizer_en.word_index
idx2word_en = {v:k for k, v in word2idx_en.items()}
word2idx_fr = tokenizer_fr.word_index
idx2word_fr = {v:k for k, v in word2idx_fr.items()}
print("vocab size (en): {:d}, vocab size (fr): {:d}".format(
    vocab_size_en, vocab_size_fr))
maxlen_en = data_en.shape[1]
maxlen_fr = data_fr_out.shape[1]
print("seqlen (en): {:d}, (fr): {:d}".format(maxlen_en, maxlen_fr))

vocab size (en): 4359, vocab size (fr): 7590
seqlen (en): 8, (fr): 16
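
To make the word-index and padding steps concrete, here is a minimal pure-Python sketch of the same idea. build_word_index() and pad_post() are hypothetical helpers written for illustration; they mimic the behavior described above but are not the Keras Tokenizer or pad_sequences() implementations.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer, most frequent word first."""
    counts = Counter(w for sent in sentences for w in sent)
    # index 0 is reserved for padding, so real words start at 1
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}


def pad_post(seqs, pad_value=0):
    """Pad every sequence at the end up to the longest length."""
    maxlen = max(len(s) for s in seqs)
    return [s + [pad_value] * (maxlen - len(s)) for s in seqs]


sents = [["i", "m", "a", "truck", "driver", "."],
         ["we", "need", "proof", "."]]
word2idx = build_word_index(sents)
idx2word = {i: w for w, i in word2idx.items()}
data = pad_post([[word2idx[w] for w in s] for s in sents])

print(word2idx)
print(data)
```

Because index 0 is never assigned to a word, an embedding layer needs len(word2idx) + 1 rows to cover every index that can appear in the padded data.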


In [10]:
batch_size = 64
dataset = tf.data.Dataset.from_tensor_slices(
    (data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(10000)
test_size = NUM_SENT_PAIRS // 4
test_dataset = dataset.take(test_size).batch(
    batch_size, drop_remainder=True)
train_dataset = dataset.skip(test_size).batch(
    batch_size, drop_remainder=True)
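
The sizes that result from this split can be worked out by hand. The sketch below assumes the dataset holds exactly NUM_SENT_PAIRS records, which is the cap applied in download_and_read().

```python
NUM_SENT_PAIRS = 30000
batch_size = 64

test_size = NUM_SENT_PAIRS // 4             # records taken for the test set
train_size = NUM_SENT_PAIRS - test_size     # records left after skip()

# drop_remainder=True discards the last partial batch, hence integer division
test_batches = test_size // batch_size
train_batches = train_size // batch_size

print(test_size, train_size, test_batches, train_batches)
```

So one epoch covers 351 training batches, and each evaluation pass covers 117 test batches.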

Our data is now ready to be used for training the seq2seq network, which we will define next. Our encoder is an Embedding layer followed by a GRU layer. The input to the encoder is a sequence of integers, which is converted to a sequence of embedding vectors of size embedding_dim. This sequence of vectors is sent to the GRU, which converts the input at each of the num_timesteps time steps to a vector of size encoder_dim. Only the output at the last time step is returned, because return_sequences=False.

In our example network, we have chosen our embedding dimension to be 256, followed by encoder and decoder RNN dimensions of 1024 each. Note that we have to add 1 to the vocabulary size for both the English and French vocabularies to account for the PAD token (index 0) that was added during the pad_sequences() step:

In [11]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_timesteps, 
            encoder_dim, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.encoder_dim = encoder_dim
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(
            encoder_dim, return_sequences=False, return_state=True)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, initial_state=state)
        return x, state

    def init_state(self, batch_size):
        return tf.zeros((batch_size, self.encoder_dim))


class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_timesteps,
            decoder_dim, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.decoder_dim = decoder_dim
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(
            decoder_dim, return_sequences=True, return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, state)
        x = self.dense(x)
        return x, state

In [13]:
embedding_dim = 256
encoder_dim, decoder_dim = 1024, 1024
encoder = Encoder(vocab_size_en+1, 
    embedding_dim, maxlen_en, encoder_dim)
decoder = Decoder(vocab_size_fr+1, 
    embedding_dim, maxlen_fr, decoder_dim)
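
As a worked example of what the extra PAD row means for model size, the arithmetic below counts the weights in the two embedding tables and in the decoder's output projection. The vocabulary sizes are the ones printed by the tokenizer cell above; the GRU weights are left out of this sketch.

```python
vocab_size_en, vocab_size_fr = 4359, 7590   # from the tokenizer output above
embedding_dim, decoder_dim = 256, 1024

# +1 row for the PAD index 0 that pad_sequences() introduces
encoder_embedding_params = (vocab_size_en + 1) * embedding_dim
decoder_embedding_params = (vocab_size_fr + 1) * embedding_dim

# Dense projection from the GRU output to vocabulary logits: weights + biases
decoder_dense_params = decoder_dim * (vocab_size_fr + 1) + (vocab_size_fr + 1)

print(encoder_embedding_params, decoder_embedding_params, decoder_dense_params)
```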

In [14]:
for encoder_in, decoder_in, decoder_out in train_dataset:
   encoder_state = encoder.init_state(batch_size)
   encoder_out, encoder_state = encoder(encoder_in, encoder_state)
   decoder_state = encoder_state
   decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
   break
print("encoder input          :", encoder_in.shape)
print("encoder output         :", encoder_out.shape, "state:", encoder_state.shape)
print("decoder output (logits):", decoder_pred.shape, "state:", decoder_state.shape)
print("decoder output (labels):", decoder_out.shape)

encoder input          : (64, 8)
encoder output         : (64, 1024) state: (64, 1024)
decoder output (logits): (64, 16, 7591) state: (64, 1024)
decoder output (labels): (64, 16)


The output above is in line with our expectations. The encoder input is a batch of sequences of integers, each sequence being of size 8, which is the maximum number of tokens in our English sentences, so its dimension is (batch_size, maxlen_en).

The output of the encoder is a single tensor (return_sequences=False) of shape (batch_size, encoder_dim), a batch of context vectors representing the input sentences. The encoder state tensor has the same dimensions. The decoder labels are also a batch of sequences of integers, but the maximum length of a French sentence is 16, so their dimensions are (batch_size, maxlen_fr). The decoder predictions are a batch of unnormalized scores (logits) over the French vocabulary at every time step, hence the dimensions (batch_size, maxlen_fr, vocab_size_fr+1); the loss function turns them into probabilities. The decoder state has the same dimensions as the encoder state, (batch_size, decoder_dim).

Next we define the loss function. Because we padded our sentences, we don't want the padded positions to bias the loss. Our loss function builds a mask from the labels, so positions where the label is the PAD index (0) receive zero weight and only the non-padded positions contribute to the loss. This is done as follows:



In [20]:
def loss_fn(ytrue, ypred):
    scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    mask = tf.math.logical_not(tf.math.equal(ytrue, 0))
    mask = tf.cast(mask, dtype=tf.int64)
    loss = scce(ytrue, ypred, sample_weight=mask)
    return loss
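
To see the effect of the mask without TensorFlow, here is a minimal pure-Python sketch of a masked cross-entropy over logits. masked_cross_entropy() is a hypothetical helper; it averages over the non-padded positions, which is an assumption made for this illustration, and the Keras loss above may normalize the weighted sum differently.

```python
import math


def masked_cross_entropy(labels, logits, pad_id=0):
    """Mean cross-entropy over positions whose label is not pad_id."""
    total, count = 0.0, 0
    for label_row, logit_row in zip(labels, logits):
        for label, scores in zip(label_row, logit_row):
            if label == pad_id:
                continue  # padded position: contributes nothing
            # numerically stable log-sum-exp of the logits
            top = max(scores)
            log_z = top + math.log(sum(math.exp(s - top) for s in scores))
            total += log_z - scores[label]  # -log softmax(scores)[label]
            count += 1
    return total / max(count, 1)


# one sentence, three time steps, vocabulary of size 3 (index 0 is PAD)
labels = [[1, 2, 0]]
logits_a = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
logits_b = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, -3.0, 5.0]]]

loss_a = masked_cross_entropy(labels, logits_a)
loss_b = masked_cross_entropy(labels, logits_b)
print(loss_a, loss_b)
```

The two inputs differ only at the padded third position, so both calls return the same loss: whatever the model predicts there has no influence on training.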


Because the seq2seq model is not easy to package into a simple Keras model, we have to handle the training loop manually as well. Our train_step() function handles the flow of data and computes the loss at each step, applies the gradient of the loss back to the trainable weights, and returns the loss.

Notice that the training code is not quite the same as what was described in our discussion of the seq2seq model earlier. Here it appears that the entire decoder_input is fed in one go into the decoder to produce the output offset by one time step, whereas in the discussion, we said that this happens sequentially, where the token generated in the previous time step is used as the input to the next time step.

This is a common technique for training seq2seq networks called Teacher Forcing, where the input to the decoder is the ground truth output instead of the prediction from the previous time step. It is preferred because it makes training faster, but it can also degrade prediction quality, because at inference time the decoder has to consume its own predictions rather than the ground truth. To offset this, techniques such as Scheduled Sampling can be used, where the input is sampled randomly either from the ground truth or the prediction at the previous time step, based on some threshold (depends on the problem, but usually varies between 0.1 and 0.4):



In [25]:
@tf.function
def train_step(encoder_in, decoder_in, decoder_out, encoder_state):
   with tf.GradientTape() as tape:
       encoder_out, encoder_state = encoder(encoder_in, encoder_state)
       decoder_state = encoder_state
       decoder_pred, decoder_state = decoder(
           decoder_in, decoder_state)
       loss = loss_fn(decoder_out, decoder_pred)
  
   variables = (encoder.trainable_variables + 
       decoder.trainable_variables)
   gradients = tape.gradient(loss, variables)
   optimizer.apply_gradients(zip(gradients, variables))
   return loss
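
The train_step() above uses pure Teacher Forcing. The Scheduled Sampling idea mentioned earlier can be sketched in plain Python as follows. scheduled_inputs() and predict_next are hypothetical names, and sampling_prob is assumed here to be the probability of feeding the model's own prediction rather than the ground truth; this is an illustration of the idea, not part of the notebook's training code.

```python
import random


def scheduled_inputs(target, predict_next, sampling_prob, rng):
    """Build decoder inputs token by token.

    target: ground-truth output tokens, ending in "EOS".
    predict_next: hypothetical callable that maps the inputs so far to the
        model's prediction for the next token.
    sampling_prob: probability of feeding the model's own prediction
        instead of the ground-truth token.
    """
    inputs = ["BOS"]
    for t in range(len(target) - 1):
        predicted = predict_next(inputs)
        use_model = rng.random() < sampling_prob
        inputs.append(predicted if use_model else target[t])
    return inputs


target = ["quel", "homme", "bizarre", "!", "EOS"]
always_wrong = lambda inputs: "est"   # stand-in for a trained decoder

rng = random.Random(42)
teacher_forced = scheduled_inputs(target, always_wrong, 0.0, rng)
free_running = scheduled_inputs(target, always_wrong, 1.0, rng)
mixed = scheduled_inputs(target, always_wrong, 0.25, rng)
print(teacher_forced)
print(free_running)
print(mixed)
```

With sampling_prob set to 0.0 this reduces to Teacher Forcing, and with 1.0 the decoder only ever sees its own predictions. The price of anything in between is that the decoder has to be run one step at a time during training, instead of in a single call as in train_step().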

The predict() function randomly samples a single English sentence from the dataset and uses the model trained so far to predict the French sentence. For reference, the label French sentence is also displayed. The evaluate_bleu_score() function computes the BiLingual Evaluation Understudy (BLEU) score [35] between the label and the prediction for every record in the test set and returns the mean. BLEU is generally used where multiple ground truth labels exist (we have only one), and it compares n-grams of up to 4 words (n=1 to 4) between the reference and candidate sentences. Both predict() and evaluate_bleu_score() are called at the end of every epoch:

In [28]:
def predict(encoder, decoder, batch_size, 
        sents_en, data_en, sents_fr_out, 
        word2idx_fr, idx2word_fr):
    random_id = np.random.choice(len(sents_en))
    print("input    : ",  " ".join(sents_en[random_id]))
    print("label    : ", " ".join(sents_fr_out[random_id]))

    encoder_in = tf.expand_dims(data_en[random_id], axis=0)
    decoder_out = tf.expand_dims(sents_fr_out[random_id], axis=0)

    encoder_state = encoder.init_state(1)
    encoder_out, encoder_state = encoder(encoder_in, encoder_state)
    decoder_state = encoder_state

    decoder_in = tf.expand_dims(
        tf.constant([word2idx_fr["BOS"]]), axis=0)
    pred_sent_fr = []
    while True:
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        decoder_pred = tf.argmax(decoder_pred, axis=-1)
        pred_word = idx2word_fr[decoder_pred.numpy()[0][0]]
        pred_sent_fr.append(pred_word)
        if pred_word == "EOS":
            break
        decoder_in = decoder_pred
    
    print("predicted: ", " ".join(pred_sent_fr))
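
The decoding loop in predict() stops only when the model emits EOS, so a poorly trained model could in principle loop for a very long time. A common safeguard is to cap the output length. The sketch below shows the same greedy feedback loop with such a cap; greedy_decode(), step_fn, and the toy lookup table are hypothetical stand-ins for the real decoder, and the cap of 16 is an assumption chosen to match the longest French sentence in our data.

```python
def greedy_decode(step_fn, state, bos="BOS", eos="EOS", max_len=16):
    """Feed each predicted token back in until EOS or max_len tokens."""
    token, output = bos, []
    for _ in range(max_len):
        token, state = step_fn(token, state)
        output.append(token)
        if token == eos:
            break
    return output


# Toy "decoder": a lookup table from the current token to the next one.
# The state is passed through unchanged; a real decoder would update it.
table = {"BOS": "je", "je": "suis", "suis": "la", "la": ".", ".": "EOS"}
toy_step = lambda token, state: (table[token], state)
looping_step = lambda token, state: ("je", state)   # never emits EOS

finished = greedy_decode(toy_step, state=None)
truncated = greedy_decode(looping_step, state=None, max_len=16)
print(finished)
print(len(truncated))
```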


In [29]:
def evaluate_bleu_score(encoder, decoder, test_dataset, 
        word2idx_fr, idx2word_fr):

    bleu_scores = []
    smooth_fn = SmoothingFunction()
    for encoder_in, decoder_in, decoder_out in test_dataset:
        encoder_state = encoder.init_state(batch_size)
        encoder_out, encoder_state = encoder(encoder_in, encoder_state)
        decoder_state = encoder_state
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)

        # compute argmax
        decoder_out = decoder_out.numpy()
        decoder_pred = tf.argmax(decoder_pred, axis=-1).numpy()

        for i in range(decoder_out.shape[0]):
            ref_sent = [idx2word_fr[j] for j in decoder_out[i].tolist() if j > 0]
            hyp_sent = [idx2word_fr[j] for j in decoder_pred[i].tolist() if j > 0]
            # remove trailing EOS
            ref_sent = ref_sent[0:-1]
            hyp_sent = hyp_sent[0:-1]
            bleu_score = sentence_bleu([ref_sent], hyp_sent, 
                smoothing_function=smooth_fn.method1)
            bleu_scores.append(bleu_score)

    return np.mean(np.array(bleu_scores))
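
To give a feel for what BLEU measures, the sketch below computes its main ingredient, the clipped (modified) n-gram precision, for one reference and one prediction taken from the training log. The full BLEU score combines these precisions through a geometric mean and a brevity penalty, and sentence_bleu() additionally applies smoothing; none of that is reproduced here, and the helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # each hypothesis n-gram is credited at most as often as it occurs
    # in the reference
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return Fraction(clipped, total) if total else Fraction(0)


ref = "elle a l air desorientee .".split()
hyp = "elle a l air perplexe .".split()
precisions = [modified_precision(ref, hyp, n) for n in range(1, 5)]
print(precisions)
```

A single substituted word leaves the unigram precision high (5/6) but pulls the 4-gram precision down to 1/3, which is why BLEU scores on short sentences with one reference tend to be low even when the translation is close.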

The training loop is shown as follows. We will use the Adam optimizer for our model. We also set up a checkpoint so we can save our model after every 10 epochs. We then train the model for 250 epochs, and print out the loss, an example sentence and its translation, and the BLEU score computed over the entire test set:

In [None]:
optimizer = tf.keras.optimizers.Adam()
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                encoder=encoder,
                                decoder=decoder)
num_epochs = 250
eval_scores = []
for e in range(num_epochs):
   encoder_state = encoder.init_state(batch_size)
   for batch, data in enumerate(train_dataset):
       encoder_in, decoder_in, decoder_out = data
       # print(encoder_in.shape, decoder_in.shape, decoder_out.shape)
       loss = train_step(encoder_in, decoder_in, decoder_out, encoder_state)
  
   print("Epoch: {}, Loss: {:.4f}".format(e + 1, loss.numpy()))
   if e % 10 == 0:
       checkpoint.save(file_prefix=checkpoint_prefix)
  
   predict(encoder, decoder, batch_size, sents_en, data_en,
       sents_fr_out, word2idx_fr, idx2word_fr)
   eval_score = evaluate_bleu_score(encoder, decoder, 
       test_dataset, word2idx_fr, idx2word_fr)
   print("Eval Score (BLEU): {:.3e}".format(eval_score))
   # eval_scores.append(eval_score)
checkpoint.save(file_prefix=checkpoint_prefix)

Epoch: 1, Loss: 1.1656
input    :  what a strange man !
label    :  quel homme bizarre ! EOS
predicted:  quel est ton pantalon ! EOS
Eval Score (BLEU): 2.360e-02
Epoch: 2, Loss: 0.8047
input    :  i m a truck driver .
label    :  je suis un chauffeur de camion . EOS
predicted:  je suis un nouvel gars . EOS
Eval Score (BLEU): 3.503e-02
Epoch: 3, Loss: 0.6707
input    :  she looks confused .
label    :  elle a l air desorientee . EOS
predicted:  elle a l air perplexe . EOS
Eval Score (BLEU): 4.671e-02
Epoch: 4, Loss: 0.4774
input    :  get on a horse .
label    :  enfourche un cheval ! EOS
predicted:  prends un peu de ceux ci ! EOS
Eval Score (BLEU): 6.051e-02
Epoch: 5, Loss: 0.4084
input    :  i was careless .
label    :  j etais insouciante . EOS
predicted:  j etais submerge . EOS
Eval Score (BLEU): 7.898e-02
Epoch: 6, Loss: 0.3073
input    :  we need proof .
label    :  il nous faut des preuves . EOS
predicted:  nous avons besoin de plus . EOS
Eval Score (BLEU): 9.364e-02
Epoch: 7, Lo