# 11 Deep learning for text

## 11.5 Beyond text classification: Sequence-to-sequence learning

- A sequence-to-sequence model takes a sequence as input and translates it into a different sequence. Typical tasks:
- Machine translation
  - Convert a paragraph in a source language to its equivalent in a target language.
- Text summarization
  - Convert a long document to a shorter version that retains the most important information.
- Question answering
  - Convert an input question into its answer.
- Chatbots
  - Convert a dialogue prompt into a reply to this prompt, or convert the history of a conversation into the next reply in the conversation.
- Text generation
  - Convert a text prompt into a paragraph that completes the prompt.

- General template behind sequence-to-sequence models
  - During training
    - An encoder model turns the source sequence into an intermediate representation.
    - A decoder is trained to predict the next token i in the target sequence by looking at both the previous tokens (0 to i - 1) and the encoded source sequence.
  - During inference
    - We obtain the encoded source sequence from the encoder.
    - The decoder starts by looking at the encoded source sequence as well as an initial "seed" token (such as the string "[start]"), and uses them to predict the first real token in the sequence.
    - The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string "[end]").
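
The inference loop described above can be sketched in plain Python. This is only an illustration of the control flow: `toy_encode`, `toy_decode_step`, and `TOY_LEXICON` are hypothetical stand-ins for a trained encoder and decoder, not part of the notebook.

```python
# Hypothetical word-by-word dictionary used by the toy decoder.
TOY_LEXICON = {"i": "yo", "am": "estoy", "here": "aquí"}

def toy_encode(source_tokens):
    # Stand-in for the encoder: keep the source tokens as the "representation".
    return tuple(source_tokens)

def toy_decode_step(encoded_source, target_so_far):
    # Stand-in for the decoder: predict the next token from the encoded
    # source and the target tokens generated so far (seed included).
    position = len(target_so_far) - 1  # number of real tokens already emitted
    if position >= len(encoded_source):
        return "[end]"
    return TOY_LEXICON.get(encoded_source[position], "[UNK]")

def greedy_decode(source_tokens, max_length=20):
    encoded = toy_encode(source_tokens)   # the encoder runs once
    target = ["[start]"]                  # initial seed token
    for _ in range(max_length):
        next_token = toy_decode_step(encoded, target)
        target.append(next_token)         # feed the prediction back in
        if next_token == "[end]":         # the stop token ends generation
            break
    return target

print(greedy_decode(["i", "am", "here"]))
# ['[start]', 'yo', 'estoy', 'aquí', '[end]']
```

The `max_length` cap plays the same role as a maximum decoded sentence length: it bounds the loop when the stop token never appears.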

### 11.5.1 A machine translation example

In [1]:
# Download the English-to-Spanish dataset
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip

--2022-06-20 17:06:34--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.135.128, 74.125.142.128, 74.125.195.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.135.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’


2022-06-20 17:06:34 (136 MB/s) - ‘spa-eng.zip’ saved [2638744/2638744]



In [2]:
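# Parse the file: each line holds an English sentence and its Spanish translation, separated by a tab
text_file = "spa-eng/spa.txt"
with open(text_file, encoding="utf-8") as f:  # explicit encoding so accented characters load on any platform
  lines = f.read().split("\n")[:-1]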
text_pairs = [] 
for line in lines:
  english, spanish = line.split("\t")
  spanish = "[start] " + spanish + " [end]"
  text_pairs.append((english, spanish))

In [3]:
# Print a random (English, Spanish) pair
import random
print(random.choice(text_pairs))

('Tom is the one who broke the window yesterday.', '[start] Tom es el que rompió la ventana ayer. [end]')


In [4]:
# Shuffle and split
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]
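
The split arithmetic in the cell above gives roughly 70% training, 15% validation, and 15% test data. A minimal sketch of the same computation, with a hypothetical helper `split_sizes` and a made-up sample count:

```python
def split_sizes(num_samples, val_fraction=0.15):
    # Same arithmetic as the cell above: validation and test get equal
    # shares, and training gets whatever remains.
    num_val = int(val_fraction * num_samples)
    num_train = num_samples - 2 * num_val
    num_test = num_samples - num_train - num_val
    return num_train, num_val, num_test

num_train, num_val, num_test = split_sizes(1000)  # 1000 is an arbitrary example size
print(num_train, num_val, num_test)
```

Because `int()` truncates, any rounding remainder ends up in the training set rather than being dropped.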

In [5]:
# Listing 11.26 Vectorizing the English and Spanish text pairs
import tensorflow as tf 
import string
import re
from tensorflow.keras import layers
 
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")
def custom_standardization(input_string):
  lowercase = tf.strings.lower(input_string)
  return tf.strings.regex_replace(
      lowercase, f"[{re.escape(strip_chars)}]", "")
vocab_size = 15000
sequence_length = 20

source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)

train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)
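
To see what the custom standardization does to a target sentence, here is a rough pure-Python equivalent that uses the standard `re` module in place of `tf.strings`. The helper name `standardize` is hypothetical, and the sketch assumes Python's `str.lower` is an acceptable substitute for `tf.strings.lower` on this sample.

```python
import re
import string

# Same character set as in the listing: all ASCII punctuation plus "¿",
# minus the square brackets so "[start]" and "[end]" survive.
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "").replace("]", "")
pattern = re.compile(f"[{re.escape(strip_chars)}]")

def standardize(text):
    # Lowercase, then delete every character listed in strip_chars.
    return pattern.sub("", text.lower())

print(standardize("[start] ¿Tom es el que rompió la ventana ayer? [end]"))
# [start] tom es el que rompió la ventana ayer [end]
```

Keeping the brackets matters because the default standardization would strip them, turning "[start]" into the ordinary word "start".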

In [6]:
# Listing 11.27 Preparing datasets for the translation task
batch_size = 64
 
def format_dataset(eng, spa):
  eng = source_vectorization(eng)
  spa = target_vectorization(spa)
  return ({
      "english": eng,
      "spanish": spa[:, :-1],
      }, 
      spa[:, 1:]) 
  
def make_dataset(pairs):
  eng_texts, spa_texts = zip(*pairs)
  eng_texts = list(eng_texts)
  spa_texts = list(spa_texts)
  dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
  dataset = dataset.batch(batch_size)
  dataset = dataset.map(format_dataset, num_parallel_calls=1)
  return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [7]:
# Inspect the shapes of one batch
for inputs, targets in train_ds.take(1):
  print(f"inputs['english'].shape: {inputs['english'].shape}")
  print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
  print(f"targets.shape: {targets.shape}")

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)
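
The Spanish input and the target have the same length because `format_dataset` slices the same vectorized sentence twice, offset by one token: `spa[:, :-1]` is what the decoder sees, and `spa[:, 1:]` is what it must predict. A small sketch with plain lists (using words instead of integer indices for readability):

```python
tokens = ["[start]", "estoy", "en", "la", "playa", "[end]"]

decoder_input = tokens[:-1]   # counterpart of spa[:, :-1]
decoder_target = tokens[1:]   # counterpart of spa[:, 1:]

# At each step the decoder sees the tokens up to that position
# and is trained to output the token that follows them.
for step, expected in enumerate(decoder_target):
    print(f"step {step}: sees {decoder_input[:step + 1]} -> predicts {expected!r}")
```

This offset is also why `target_vectorization` uses `sequence_length + 1`: dropping one token from each end of the pair leaves sequences of length `sequence_length`.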


### 11.5.2 Sequence-to-sequence learning with RNNs

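- The simplest, naive way to use RNNs to turn a sequence into another sequence is to keep the output of the RNN at each time step. This forces the target sequence to have the same length as the source sequence, and the prediction for step N can only draw on source tokens 0...N.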
  ```py
  inputs = keras.Input(shape=(sequence_length,), dtype="int64")
  x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
  x = layers.LSTM(32, return_sequences=True)(x)
  outputs = layers.Dense(vocab_size, activation="softmax")(x)
  model = keras.Model(inputs, outputs)
  ```

- Proper sequence-to-sequence setup
  - First, use an RNN (the encoder) to turn the entire source sequence into a single vector (or set of vectors).
  - Then use this vector (or vectors) as the initial state of another RNN (the decoder), which looks at elements 0...N in the target sequence and tries to predict step N+1 in the target sequence.

In [None]:
# Listing 11.28 GRU-based encoder
from tensorflow import keras 
from tensorflow.keras import layers
 
embed_dim = 256
latent_dim = 1024

source = keras.Input(shape=(None,), dtype="int64", name="english")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

In [None]:
# Listing 11.29 GRU-based decoder and the end-to-end model
past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
seq2seq_rnn = keras.Model([source, past_target], target_next_step) 

In [None]:
seq2seq_rnn.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 english (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 spanish (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, None, 256)    3840000     ['english[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, None, 256)    3840000     ['spanish[0][0]']                
                                                                                              

In [None]:
# Listing 11.30 Training our recurrent sequence-to-sequence model
seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

seq2seq_rnn.fit(train_ds, epochs=5, validation_data=val_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f93887f3dd0>

In [None]:
# Listing 11.31 Translating new sentences with our RNN encoder and decoder
import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
  tokenized_input_sentence = source_vectorization([input_sentence])
  decoded_sentence = "[start]" 
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = target_vectorization([decoded_sentence])
    next_token_predictions = seq2seq_rnn.predict([tokenized_input_sentence, tokenized_target_sentence])
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])
    sampled_token = spa_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    if sampled_token == "[end]":
      break
  return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs] 
for _ in range(20):
  input_sentence = random.choice(test_eng_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence(input_sentence))

-
I've corrected the mistake.
[start] he [UNK] el error [end]
-
Something tells me Tom will be OK.
[start] algo que me va a tom se está bien [end]
-
He saw the boy jump over the fence and run away.
[start] Él vio a la [UNK] el [UNK] y se [UNK] a la [UNK] [end]
-
Why do the five yen coin and the fifty yen coin have holes in the center?
[start] por qué los [UNK] la [UNK] y el [UNK] se [UNK] en el cinco de la [UNK] [end]
-
Tom visited Mary on October 20th.
[start] tom le pasó a mary en el [UNK] de [UNK] de [UNK] [end]
-
We are teachers.
[start] estamos [UNK] [end]
-
The explosion may have been caused by a gas leak.
[start] la que se puede haber [UNK] un [UNK] de [UNK] [end]
-
I'm at the beach.
[start] estoy en la playa [end]
-
Tom knows Mary won't tell John.
[start] tom sabe que mary no se lo [UNK] a john [end]
-
Do you like to be alone?
[start] te gusta estar solo [end]
-
Tell me what I should be watching for.
[start] dime qué debería estar en punto [end]
-
What grade are you in?
[start]

- This inference setup, while very simple, is rather inefficient:
  - We reprocess the entire source sentence and the entire generated target sentence every time we sample a new word.
  - In a practical application, you’d factor the encoder and the decoder into two separate models, and the decoder would run only a single step at each token-sampling iteration, reusing its previous internal state.
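
A back-of-the-envelope count shows how much work the reprocessing adds. This is a simplified cost model, not a benchmark: it counts one unit per token visited and ignores the fixed-length padding applied by the vectorization layers. Both function names are hypothetical.

```python
def naive_token_visits(source_len, target_len):
    # Every sampling iteration re-reads the whole source sentence
    # plus all target tokens generated so far.
    return sum(source_len + step for step in range(1, target_len + 1))

def factored_token_visits(source_len, target_len):
    # The encoder runs once; the decoder then advances one step per
    # sampled token, reusing its previous internal state.
    return source_len + target_len

print(naive_token_visits(20, 20))     # 610
print(factored_token_visits(20, 20))  # 40
```

Under this model the naive loop grows quadratically with sentence length, while the factored version grows linearly.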

- The model could be improved in a few ways
  - use a deep stack of recurrent layers for both the encoder and the decoder
  - use an LSTM instead of a GRU

- The RNN approach to sequence-to-sequence learning has a few fundamental limitations
  - The source sequence representation has to be held entirely in the encoder state vector(s), which puts significant limitations on the size and complexity of the sentences you can translate. It’s a bit as if a human were translating a sentence entirely from memory, without looking twice at the source sentence while producing the translation.
  - RNNs have trouble dealing with very long sequences, since they tend to progressively forget about the past: by the time you’ve reached the 100th token in either sequence, little information remains about the start of the sequence. That means RNN-based models can’t hold onto long-term context, which can be essential for translating long documents.


### 11.5.3 Sequence-to-sequence learning with Transformer

- Neural attention enables Transformer models to successfully process sequences that are considerably longer and more complex than those RNNs can handle.
- The Transformer encoder keeps the encoded representation in a sequence format: it’s a sequence of context-aware embedding vectors.
- The Transformer decoder reads tokens 0...N in the target sequence and tries to predict token N+1, using neural attention over the encoded source sequence while doing so.

In [46]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Listing 11.33 The TransformerDecoder
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

# Listing 11.34 TransformerDecoder method that generates a causal mask
    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

# Listing 11.35 The forward pass of the TransformerDecoder
    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
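        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # No mask was propagated: keep padding_mask defined so the
            # second attention call below does not raise a NameError
            padding_mask = mask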
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)


- The causal mask hides the upper triangle of the pairwise attention matrix (the entries where the key position j comes after the query position i), which prevents the model from paying any attention to information from the future.
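
The mask built by `get_causal_attention_mask` follows the rule `i >= j`. A pure-Python sketch of the same rule for a single sequence (no batch dimension, and with a hypothetical function name) makes the lower-triangular shape visible:

```python
def causal_mask(sequence_length):
    # Entry [i][j] is 1 when query position i may attend to key position j,
    # that is, when j is not in the future relative to i.
    return [[1 if i >= j else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row i shows what token i can see: itself and everything before it, but nothing after.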

In [47]:
# Listing 11.24 Implementing positional embedding as a subclassed layer
class PositionalEmbedding(layers.Layer):
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)
    self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
    self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)
    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim
  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    return embedded_tokens + embedded_positions 
  def compute_mask(self, inputs, mask=None):
    return tf.math.not_equal(inputs, 0)
  def get_config(self):
    config = super().get_config()
    config.update({
        "output_dim": self.output_dim,
        "sequence_length": self.sequence_length,
        "input_dim": self.input_dim,})
    return config

In [48]:
# Listing 11.21 Transformer encoder implemented as a subclassed Layer
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential(
        [layers.Dense(dense_dim, activation="relu"),
         layers.Dense(embed_dim),])
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()
  def call(self, inputs, mask=None):
    if mask is not None:
      mask = mask[:, tf.newaxis, :]
    attention_output = self.attention(inputs, inputs, attention_mask=mask)
    proj_input = self.layernorm_1(inputs + attention_output)
    proj_output = self.dense_proj(proj_input)
    return self.layernorm_2(proj_input + proj_output)
  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "dense_dim": self.dense_dim,})
    return config

In [49]:
# Listing 11.36 End-to-end Transformer
embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [50]:
transformer.summary()

Model: "model_11"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 english (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 spanish (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 positional_embedding_14 (Posit  (None, None, 256)   3845120     ['english[0][0]']                
 ionalEmbedding)                                                                                  
                                                                                                  
 positional_embedding_15 (Posit  (None, None, 256)   3845120     ['spanish[0][0]']         

In [52]:
# Listing 11.37 Training the sequence-to-sequence Transformer
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
transformer.fit(train_ds, epochs=5, validation_data=val_ds)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f0b966de990>

In [53]:
# Listing 11.38 Translating new sentences with our Transformer model
import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

-
The house has all the conveniences.
[start] la casa está todo el [UNK] [end]
-
Tom has a pain in the shoulder.
[start] tom tiene un dolor en el [UNK] [end]
-
She ignored him all day.
[start] ella lo lo [UNK] todo el día [end]
-
I'm happy and satisfied.
[start] soy feliz y feliz [end]
-
Speeding causes accidents.
[start] [UNK] los niños [UNK] [end]
-
She advised him to lose weight.
[start] le aconsejó que perder peso [end]
-
We're going out for lunch. Why don't you come along?
[start] nos vamos a ver por qué no puedes ir bien [end]
-
You have to do it.
[start] tienes que hacerlo lo lo hace [end]
-
Tom escaped from prison.
[start] tom se puso de la universidad de lluvia [end]
-
The bathroom is dirty.
[start] el baño está [UNK] [end]
-
You cannot buy happiness.
[start] no puedes comprar la ropa [end]
-
Wear something warm. It's going to be cold this afternoon.
[start] [UNK] algo que va a estar haciendo esta tarde [end]
-
The prize won't be given to her.
[start] el reloj no se lo [UNK] p

---