# About 

This notebook trains a sequence-to-sequence model with attention on the 
DailyDialog dataset. Each turn in a conversation serves as a target, and 
the concatenation of the two preceding turns serves as the source input. 

The model learns to predict the next sentence from the previous sentences. 
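
To make the pairing concrete, here is a minimal plain-Python sketch on a made-up three-turn conversation. The helper name `build_pairs` and the toy data are assumptions for illustration only; the notebook's own preprocessing function appears further down.

```python
def build_pairs(conversation):
    """Pair each turn with the concatenation of the (up to) two turns before it."""
    pairs = []
    for i, target in enumerate(conversation):
        # Fewer than two preceding turns exist at the start of a conversation.
        context = conversation[max(0, i - 2):i]
        pairs.append(("".join(context), target))
    return pairs


# Hypothetical toy conversation, formatted like the dataset's turns.
toy_conversation = ["How are you ? ", " Fine , thanks . ", " Glad to hear it . "]
pairs = build_pairs(toy_conversation)
for source, target in pairs:
    print(repr(source), "->", repr(target))
```

The first turn has an empty source, the second sees only the first turn, and every later turn sees the two turns before it.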

# Setup 

In [1]:
import sys 
import sklearn 
assert sklearn.__version__ >= "0.20" 
# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"
import tensorflow_text as tf_text 
# Common imports
import numpy as np
import os
# Others 
import transformers 


# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


## Loading and Preprocessing the Dataset

We will use the DailyDialog dataset to train the model. 

https://arxiv.org/abs/1710.03957

In [2]:
from datasets import list_datasets, load_dataset 

In [3]:
dataset = load_dataset('daily_dialog' ) 

Using custom data configuration default
Reusing dataset daily_dialog (/Users/muhammadumair/.cache/huggingface/datasets/daily_dialog/default/1.0.0/c03444008e9508b8b76f1f6793742d37d5e5f83364f8d573c2747bff435ea55c)
100%|██████████| 3/3 [00:00<00:00, 257.80it/s]


In [4]:
dataset 

DatasetDict({
    train: Dataset({
        features: ['dialog', 'act', 'emotion'],
        num_rows: 11118
    })
    test: Dataset({
        features: ['dialog', 'act', 'emotion'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['dialog', 'act', 'emotion'],
        num_rows: 1000
    })
})

In [5]:
train_dataset_full = dataset["train"]['dialog'] 
test_dataset_full = dataset["test"]['dialog'] 
val_dataset_full = dataset["validation"]['dialog'] 

Each split of the DailyDialog dataset is a list of conversations, where every item is a list of turns. 

The training split has 11,118 conversations of varying length. 

In [6]:

len(train_dataset_full)  

11118

In [7]:
train_dataset_full[0] 

['Say , Jim , how about going for a few beers after dinner ? ',
 ' You know that is tempting but is really not good for our fitness . ',
 ' What do you mean ? It will help us to relax . ',
 " Do you really think so ? I don't . It will just make us fat and act silly . Remember last time ? ",
 " I guess you are right.But what shall we do ? I don't feel like sitting at home . ",
 ' I suggest a walk over to the gym where we can play singsong and meet some of our friends . ',
 " That's a good idea . I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them . ",
 ' Sounds great to me ! If they are willing , we could ask them to go dancing with us.That is excellent exercise and fun , too . ',
 " Good.Let ' s go now . ",
 ' All right . ']

In [8]:
import itertools 

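def daily_dialogue_preprocess(dataset): 
    """
    Build (source, target) pairs: every turn is a target, and the 
    concatenation of the two preceding turns is its source. 
    Assumes every conversation has at least two turns. 
    """
    X = list()
    for item in dataset: 
        # The first turn has no context; the second sees only the first turn.
        X_conv = ['',item[0]]
        # Turn i+1 gets turns i-1 and i as its context.
        for i in range(1, len(item) -1): 
            X_conv.append(item[i-1] + item[i])  
        X.append(X_conv)
    # Every turn of every conversation is a target.
    y = [item for item in dataset] 
    # Flatten the list 
    X = list(itertools.chain(*X))
    y = list(itertools.chain(*y) ) 
    return X, y 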

In [9]:
X_train_full, y_train_full = daily_dialogue_preprocess(train_dataset_full) 
X_val_full, y_val_full = daily_dialogue_preprocess(val_dataset_full) 
X_test_full, y_test_full = daily_dialogue_preprocess(test_dataset_full) 

In [10]:
assert len(X_train_full) == len(y_train_full) 


In [11]:
len(X_train_full)

87170

Next, we need to create a word-level tokenizer to tokenize all the sentences. 

In [12]:
def preprocess_sentence(text):
    """
    This method standardizes the text in each sentence. 
    """
    # Split accented characters.
    text = tf_text.normalize_utf8(text, 'NFKD')
    text = tf.strings.lower(text)
    # Keep space, a to z, and select punctuation.
    text = tf.strings.regex_replace(text, '[^ a-z.?!,¿]', '')
    # Add spaces around punctuation.
    text = tf.strings.regex_replace(text, '[.?!,¿]', r' \0 ')
    # Strip whitespace.
    text = tf.strings.strip(text)

    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text
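
As a rough illustration of what this standardization does to a sentence, here is a plain-Python approximation built on `re` and `unicodedata`. The function name `standardize` is hypothetical, and the TensorFlow ops may differ in edge cases (for example in how non-ASCII uppercase letters are lowercased), so treat it as a sketch rather than an exact equivalent.

```python
import re
import unicodedata


def standardize(text):
    # Decompose accented characters into base letter + combining mark, then lowercase.
    text = unicodedata.normalize("NFKD", text).lower()
    # Keep space, a to z, and select punctuation; everything else is dropped.
    text = re.sub(r"[^ a-z.?!,¿]", "", text)
    # Add spaces around punctuation (\g<0> is the whole match in Python's re).
    text = re.sub(r"[.?!,¿]", r" \g<0> ", text)
    text = text.strip()
    return " ".join(["[START]", text, "[END]"])


tokens = standardize(" I guess you are right.But I don't know . ").split()
print(tokens)
```

Note that the apostrophe is removed, so "don't" becomes "dont", which matches the decoded example later in the notebook.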

In [13]:
max_vocab_size = 5000

input_text_processor = tf.keras.layers.TextVectorization(
    standardize=preprocess_sentence, 
    max_tokens=max_vocab_size)


In [14]:
input_text_processor.adapt(X_train_full) 

In [15]:
# Here are the first 10 words from the vocabulary:
input_text_processor.get_vocabulary()[:10]

['', '[UNK]', '.', '[START]', '[END]', ',', 'you', 'i', '?', 'the']
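
The listing above starts with the padding token `''` and `'[UNK]'`, followed by the most frequent tokens. A minimal plain-Python sketch of that idea, assuming a frequency-ordered vocabulary with two reserved slots (the helper names `build_vocabulary` and `encode` and the toy corpus are hypothetical, and tie-breaking may differ from the Keras layer):

```python
from collections import Counter


def build_vocabulary(token_lists, max_tokens):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    vocab = ["", "[UNK]"]
    vocab += [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return vocab


def encode(tokens, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    # Tokens that did not make it into the vocabulary map to [UNK] (index 1).
    return [index.get(tok, 1) for tok in tokens]


corpus = [
    ["[START]", "you", "know", ".", "[END]"],
    ["[START]", "you", "do", "?", "[END]"],
]
vocab = build_vocabulary(corpus, max_tokens=6)
print(vocab)
print(encode(["[START]", "zebra", "[END]"], vocab))
```

With `max_tokens=6`, only four real tokens fit, so rarer words fall back to `[UNK]`.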

In [16]:
example_input_batch = [X_train_full[5]] 
example_input_batch

[" Do you really think so ? I don't . It will just make us fat and act silly . Remember last time ?  I guess you are right.But what shall we do ? I don't feel like sitting at home . "]

In [17]:
example_output_batch = [y_train_full[5]] 
example_output_batch

[' I suggest a walk over to the gym where we can play singsong and meet some of our friends . ']

In [18]:
# Tokenization of the example batch. 
example_tokens = input_text_processor(example_input_batch)
example_tokens

<tf.Tensor: shape=(1, 45), dtype=int64, numpy=
array([[   3,   20,    6,   62,   47,   37,    8,    7,   63,    2,   12,
          43,   53,  109,  134, 1233,   13, 2152, 1549,    2,  337,  146,
          69,    8,    7,  297,    6,   21,   70,    2,   33,   22,  387,
          26,   20,    8,    7,   63,  194,   30, 1262,   40,  206,    2,
           4]])>

In [19]:
# We can also convert the tokens back using the vocabulary. 
input_vocab = np.array(input_text_processor.get_vocabulary())
tokens = input_vocab[example_tokens[0].numpy()]
' '.join(tokens)

'[START] do you really think so ? i dont . it will just make us fat and act silly . remember last time ? i guess you are right . but what shall we do ? i dont feel like sitting at home . [END]'

In [20]:
def tokenize_data(X,y): 
    X = input_text_processor(X)
    y = input_text_processor(y) 
    return X,y 

In [21]:
X_train_tok, y_train_tok = tokenize_data(X_train_full,y_train_full) 
X_val_tok, y_val_tok = tokenize_data(X_val_full,y_val_full) 
X_test_tok, y_test_tok = tokenize_data(X_test_full,y_test_full) 

In [22]:
assert X_train_tok.shape[0] == y_train_tok.shape[0] 
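
Because sentences differ in length, tokenizing many of them at once yields a rectangular tensor in which shorter rows are filled with the padding id 0. The following plain-Python sketch illustrates that padding, together with the `tokens != 0` mask used later for attention. The helper names are hypothetical and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence on the right to the length of the longest one.
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def padding_mask(batch, pad_id=0):
    # True marks a real token, False marks padding.
    return [[tok != pad_id for tok in row] for row in batch]


batch = pad_batch([[3, 20, 6, 4], [3, 7, 4]])
mask = padding_mask(batch)
print(batch)
print(mask)
```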

Finally, we create the training dataset. 


In [23]:
def create_dataset(X,y, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((X,y)) 
    dataset = dataset.shuffle(10_000).batch(batch_size)
    return dataset 


In [24]:
train_dataset = create_dataset(X_train_tok, y_train_tok) 

## Creating the Model

The model in this case is a sequence to sequence model with attention. 

In [25]:
#@title Shape checker
class ShapeChecker():
  def __init__(self):
    # Keep a cache of every axis-name seen
    self.shapes = {}

  def __call__(self, tensor, names, broadcast=False):
    if not tf.executing_eagerly():
      return

    if isinstance(names, str):
      names = (names,)

    shape = tf.shape(tensor)
    rank = tf.rank(tensor)

    if rank != len(names):
      raise ValueError(f'Rank mismatch:\n'
                       f'    found {rank}: {shape.numpy()}\n'
                       f'    expected {len(names)}: {names}\n')

    for i, name in enumerate(names):
      if isinstance(name, int):
        old_dim = name
      else:
        old_dim = self.shapes.get(name, None)
      new_dim = shape[i]

      if (broadcast and new_dim == 1):
        continue

      if old_dim is None:
        # If the axis name is new, add its length to the cache.
        self.shapes[name] = new_dim
        continue

      if new_dim != old_dim:
        raise ValueError(f"Shape mismatch for dimension: '{name}'\n"
                         f"    found: {new_dim}\n"
                         f"    expected: {old_dim}\n")

### Creating the Encoder

In [26]:
class Encoder(keras.layers.Layer):

    def __init__(self, input_vocab_size, embedding_dim, enc_units):
        super().__init__() 
        self.enc_units = enc_units 
        self.input_vocab_size = input_vocab_size 
        # Embedding to convert tokens to vectors 
        self.embedding = keras.layers.Embedding(
            self.input_vocab_size,embedding_dim)
        # GRU RNN layers 
        self.gru = keras.layers.GRU(
            self.enc_units, return_sequences=True, return_state=True, 
            recurrent_initializer="glorot_uniform")

    def call(self, tokens, state=None):
        # NOTE: ShapeChecker is simply used to verify the shapes. 
        shape_checker = ShapeChecker()
        shape_checker(tokens, ('batch', 's'))

        # 2. The embedding layer looks up the embedding for each token.
        vectors = self.embedding(tokens)
        shape_checker(vectors, ('batch', 's', 'embed_dim'))

        # 3. The GRU processes the embedding sequence.
        #    output shape: (batch, s, enc_units)
        #    state shape: (batch, enc_units)
        output, state = self.gru(vectors, initial_state=state)
        shape_checker(output, ('batch', 's', 'enc_units'))
        shape_checker(state, ('batch', 'enc_units'))

        # 4. Returns the new sequence and its state.
        return output, state



Trying out the Encoder on the example batch. 

In [27]:
embedding_dim = 256
units = 1024

# Convert the input text to tokens.
example_tokens = input_text_processor(example_input_batch)
# Encode the input sequence.
encoder = Encoder(input_text_processor.vocabulary_size(),
                  embedding_dim, units)

example_enc_output, example_enc_state = encoder(example_tokens)


In [28]:
example_enc_output.shape, example_enc_state.shape

(TensorShape([1, 45, 1024]), TensorShape([1, 1024]))

### Creating the Attention Layer

In [29]:
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super().__init__()
    # For Eqn. (4), the Bahdanau attention
    self.W1 = tf.keras.layers.Dense(units, use_bias=False)
    self.W2 = tf.keras.layers.Dense(units, use_bias=False)

    self.attention = tf.keras.layers.AdditiveAttention()

  def call(self, query, value, mask):
    shape_checker = ShapeChecker()
    shape_checker(query, ('batch', 't', 'query_units'))
    shape_checker(value, ('batch', 's', 'value_units'))
    shape_checker(mask, ('batch', 's'))

    # From Eqn. (4), `W1@ht`.
    w1_query = self.W1(query)
    shape_checker(w1_query, ('batch', 't', 'attn_units'))

    # From Eqn. (4), `W2@hs`.
    w2_key = self.W2(value)
    shape_checker(w2_key, ('batch', 's', 'attn_units'))

    query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
    value_mask = mask

    context_vector, attention_weights = self.attention(
        inputs = [w1_query, value, w2_key],
        mask=[query_mask, value_mask],
        return_attention_scores = True,
    )
    shape_checker(context_vector, ('batch', 't', 'value_units'))
    shape_checker(attention_weights, ('batch', 't', 's'))

    return context_vector, attention_weights

In [30]:
attention_layer = BahdanauAttention(units)

In [31]:
# Later, the decoder will generate this attention query
example_attention_query = tf.random.normal(shape=[len(example_tokens), 2, 10])

# Attend to the encoded tokens

context_vector, attention_weights = attention_layer(
    query=example_attention_query,
    value=example_enc_output,
    mask=(example_tokens != 0))

In [32]:
context_vector.shape, attention_weights.shape

(TensorShape([1, 2, 1024]), TensorShape([1, 2, 45]))
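
To show what happens inside the attention call for a single query, here is a plain-Python sketch of additive attention with a padding mask. It assumes the query and keys are already projected (the role of `W1` and `W2` above) and leaves out any learned scale vector, so it is an illustration of the mechanism, not a reimplementation of the Keras layer. The function name and the numbers are made up.

```python
import math


def additive_attention(query, keys, values, mask):
    """query: [units]; keys: [s][units]; values: [s][value_units]; mask: [s] bools."""
    # Additive score per source position: sum over units of tanh(query + key).
    scores = [sum(math.tanh(q + k) for q, k in zip(query, key)) for key in keys]
    # Padded positions get a very negative score so softmax gives them zero weight.
    scores = [s if keep else -1e9 for s, keep in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


query = [0.5, -0.2]
keys = [[0.1, 0.3], [1.0, 0.8], [0.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask = [True, True, False]  # the last source position is padding
context, weights = additive_attention(query, keys, values, mask)
print(weights)
print(context)
```

The weights sum to one over the unmasked positions, and the context vector has the same width as the values, which mirrors the `(batch, t, value_units)` shape checked in the layer.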

### Creating the Decoder

In [33]:
class Decoder(keras.layers.Layer):

    def __init__(self, output_vocab_size, embedding_dim, dec_units):
        super().__init__()
        self.dec_units = dec_units 
        self.output_vocab_size = output_vocab_size
        self.embedding_dim = embedding_dim

        # For Step 1. The embedding layer converts token IDs to vectors
        self.embedding = tf.keras.layers.Embedding(self.output_vocab_size,
                                                embedding_dim)

        # For Step 2. The RNN keeps track of what's been generated so far.
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')

        # For step 3. The RNN output will be the query for the attention layer.
        self.attention = BahdanauAttention(self.dec_units)

        # For step 4. Eqn. (3): converting `ct` to `at`
        self.Wc = tf.keras.layers.Dense(dec_units, activation=tf.math.tanh,
                                        use_bias=False)

        # For step 5. This fully connected layer produces the logits for each
        # output token.
        self.fc = tf.keras.layers.Dense(self.output_vocab_size)

    def call(self, inputs, state=None):
        shape_checker = ShapeChecker()
        new_tokens, enc_output, mask = inputs

        shape_checker(new_tokens, ('batch', 't'))
        shape_checker(enc_output, ('batch', 's', 'enc_units'))
        shape_checker(mask, ('batch', 's'))

        if state is not None:
            shape_checker(state, ('batch', 'dec_units'))

        # Step 1. Lookup the embeddings
        vectors = self.embedding(new_tokens)
        shape_checker(vectors, ('batch', 't', 'embedding_dim'))

        # Step 2. Process one step with the RNN
        rnn_output, state = self.gru(vectors, initial_state=state)

        shape_checker(rnn_output, ('batch', 't', 'dec_units'))
        shape_checker(state, ('batch', 'dec_units'))

        # Step 3. Use the RNN output as the query for the attention over the
        # encoder output.
        context_vector, attention_weights = self.attention(
            query=rnn_output, value=enc_output, mask=mask)
        shape_checker(context_vector, ('batch', 't', 'dec_units'))
        shape_checker(attention_weights, ('batch', 't', 's'))

        # Step 4. Eqn. (3): Join the context_vector and rnn_output
        #     [ct; ht] shape: (batch, t, value_units + query_units)
        context_and_rnn_output = tf.concat([context_vector, rnn_output], axis=-1)

        # Step 4. Eqn. (3): `at = tanh(Wc@[ct; ht])`
        attention_vector = self.Wc(context_and_rnn_output)
        shape_checker(attention_vector, ('batch', 't', 'dec_units'))

        # Step 5. Generate logit predictions:
        logits = self.fc(attention_vector)
        shape_checker(logits, ('batch', 't', 'output_vocab_size'))

        return logits, state, attention_weights



In [34]:
decoder = Decoder(input_text_processor.vocabulary_size(), embedding_dim, units)

In [35]:
# Convert the target sequence, and collect the "[START]" tokens
example_output_tokens = input_text_processor(example_output_batch)

start_index = input_text_processor.get_vocabulary().index('[START]')
first_token = tf.constant([[start_index]] * example_output_tokens.shape[0])

In [36]:

inputs = (first_token, example_enc_output,(example_tokens != 0))
logits, dec_state , attention_weights = decoder(
    inputs =inputs,
    state = example_enc_state
)

In [37]:
logits.shape, attention_weights.shape, dec_state.shape 

(TensorShape([1, 1, 5000]), TensorShape([1, 1, 45]), TensorShape([1, 1024]))

In [38]:
sampled_token = tf.random.categorical(logits[:, 0, :], num_samples=1)

In [39]:
vocab = np.array(input_text_processor.get_vocabulary())
first_word = vocab[sampled_token.numpy()]
first_word[:5]

array([['mexican']], dtype='<U16')
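
Since the decoder is untrained at this point, the sampled word is essentially random. The sampling step itself draws an index with probability proportional to the softmax of the logits. A minimal plain-Python sketch of that idea, using the standard `random` module instead of TensorFlow (the function name and logits are made up):

```python
import math
import random


def sample_from_logits(logits, rng):
    # Softmax numerators; random.choices normalizes the weights itself.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(42)
logits = [0.1, 2.0, -1.0, 0.5]
draws = [sample_from_logits(logits, rng) for _ in range(1000)]
# Index 1 has the largest logit, so it should be drawn most often.
most_frequent = max(set(draws), key=draws.count)
print(most_frequent)
```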

## Training the Model 

### Defining the loss 

In [40]:
class MaskedLoss(tf.keras.losses.Loss):
  def __init__(self):
    self.name = 'masked_loss'
    self.loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

  def __call__(self, y_true, y_pred):
    shape_checker = ShapeChecker()
    shape_checker(y_true, ('batch', 't'))
    shape_checker(y_pred, ('batch', 't', 'logits'))

    # Calculate the loss for each item in the batch.
    loss = self.loss(y_true, y_pred)
    shape_checker(loss, ('batch', 't'))

    # Mask off the losses on padding.
    mask = tf.cast(y_true != 0, tf.float32)
    shape_checker(mask, ('batch', 't'))
    loss *= mask

    # Return the total.
    return tf.reduce_sum(loss)
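
The effect of the mask can be checked on a tiny example. The sketch below recomputes the same quantity in plain Python: per-token cross-entropy from logits, skipping positions whose target id is 0, summed over the batch. The function name and inputs are hypothetical.

```python
import math


def masked_loss(y_true, logits):
    """y_true: [batch][t] token ids; logits: [batch][t][vocab] scores."""
    total = 0.0
    for tokens, token_logits in zip(y_true, logits):
        for token, scores in zip(tokens, token_logits):
            if token == 0:  # padding contributes nothing
                continue
            # Cross-entropy from logits: logsumexp(scores) - scores[target].
            top = max(scores)
            log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
            total += log_norm - scores[token]
    return total


y_true = [[2, 1, 0]]  # the last position is padding
uniform = [[[0.0, 0.0, 0.0]] * 3]
loss = masked_loss(y_true, uniform)
print(loss)
```

With uniform logits over three classes, each real token costs `log(3)`, so the two unpadded positions give `2 * log(3)`. Because the loss is a sum rather than a mean, its magnitude grows with batch size and sequence length.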

### Seq2Seq model 

In [49]:

train_translator = keras.models.Sequential([ 
    Encoder(input_text_processor.vocabulary_size(),
                  embedding_dim, units), 
    Decoder(input_text_processor.vocabulary_size(), embedding_dim, units)
])


In [50]:
# Configure the loss and optimizer
train_translator.compile(
    optimizer=tf.optimizers.Adam(),
    loss=MaskedLoss(),
)

In [51]:
train_translator.fit(train_dataset, epochs=1,
                     callbacks=[batch_loss])

ValueError: in user code:

    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 859, in train_step
        y_pred = self(x, training=True)
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/sequential.py", line 318, in _build_graph_network_for_inferred_shape
        raise ValueError(SINGLE_LAYER_OUTPUT_ERROR_MSG)

    ValueError: Exception encountered when calling layer "sequential_1" (type Sequential).
    
    All layers in a Sequential model should have a single output tensor. For multi-output layers, use the functional API.
    
    Call arguments received:
      • inputs=tf.Tensor(shape=(None, 307), dtype=int64)
      • training=True
      • mask=None
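
The error arises because `Sequential` passes a single tensor from one layer to the next, whereas the Encoder returns two tensors and the Decoder expects a tuple of inputs. One possible way around this, sketched below under simplifying assumptions, is to subclass `tf.keras.Model` and wire the two parts explicitly. The class `TinySeq2Seq` is a hypothetical stand-in: it uses plain embedding and GRU layers, omits attention and masking for brevity, and is not the notebook's Encoder and Decoder.

```python
import tensorflow as tf


class TinySeq2Seq(tf.keras.Model):
    """Hypothetical stand-in that wires an encoder and a decoder explicitly."""

    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.enc_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.enc_gru = tf.keras.layers.GRU(
            units, return_sequences=True, return_state=True)
        self.dec_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.dec_gru = tf.keras.layers.GRU(
            units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs):
        source_tokens, target_tokens = inputs
        # The encoder yields two tensors; only the final state is used here.
        _, enc_state = self.enc_gru(self.enc_embedding(source_tokens))
        # Teacher forcing: the decoder reads the target tokens and starts
        # from the encoder's final state.
        dec_output, _ = self.dec_gru(
            self.dec_embedding(target_tokens), initial_state=enc_state)
        # Logits over the vocabulary for every target position.
        return self.fc(dec_output)


model = TinySeq2Seq(vocab_size=50, embedding_dim=8, units=16)
source = tf.constant([[3, 7, 9, 4, 0]])
target = tf.constant([[3, 11, 12, 4]])
logits = model((source, target))
print(logits.shape)
```

To train such a model, the dataset would need to supply both the source and the shifted target tokens as inputs, for example as `((source, target[:, :-1]), target[:, 1:])`, instead of the plain `(X, y)` pairs built above.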
