# About 

The goal of this notebook is to train a sequence-to-sequence model with 
attention on the DailyDialog dataset. Each turn in a conversation serves as a 
target, and the concatenation of the two preceding turns serves as the 
source input. 

Given the previous sentences, the model should predict the next one. 

# Setup 

In [1]:
import sys 
import sklearn 
assert sklearn.__version__ >= "0.20" 
# TensorFlow ≥2.0 is required
import tensorflow as tf
# from tensorflow import keras

assert tf.__version__ >= "2.0"
import tensorflow_text as tf_text 
# Common imports
import numpy as np
import os
# Others 
import transformers 


# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)



## Loading and Preprocessing the Dataset

We will use the DailyDialog dataset to train the model. 

https://arxiv.org/abs/1710.03957

In [2]:
from datasets import list_datasets, load_dataset 

In [3]:
dataset = load_dataset('daily_dialog' ) 



In [4]:
train_dataset_full = dataset["train"]['dialog'] 
test_dataset_full = dataset["test"]['dialog'] 
val_dataset_full = dataset["validation"]['dialog'] 

Each item in the DailyDialog dataset is one conversation, stored as a list of turns. 

In [5]:
train_dataset_full[0] 

['Say , Jim , how about going for a few beers after dinner ? ',
 ' You know that is tempting but is really not good for our fitness . ',
 ' What do you mean ? It will help us to relax . ',
 " Do you really think so ? I don't . It will just make us fat and act silly . Remember last time ? ",
 " I guess you are right.But what shall we do ? I don't feel like sitting at home . ",
 ' I suggest a walk over to the gym where we can play singsong and meet some of our friends . ',
 " That's a good idea . I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them . ",
 ' Sounds great to me ! If they are willing , we could ask them to go dancing with us.That is excellent exercise and fun , too . ',
 " Good.Let ' s go now . ",
 ' All right . ']

In [6]:
import itertools 

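def daily_dialogue_preprocess(dataset): 
    """
    Build (source, target) pairs. Each turn is a target, and the concatenation
    of up to two preceding turns is its source.
    Assumes every conversation has at least two turns.
    """
    X = list()
    for item in dataset: 
        # The first turn has no context; the second only has the first turn.
        X_conv = ['',item[0]]
        # Turn i+1 is paired with the concatenation of turns i-1 and i.
        for i in range(1, len(item) -1): 
            X_conv.append(item[i-1] + item[i])  
        X.append(X_conv)
    # Every turn of every conversation is a target.
    y = [item for item in dataset] 
    # Flatten the per-conversation lists so that X[k] aligns with y[k].
    X = list(itertools.chain(*X))
    y = list(itertools.chain(*y) ) 
    return X, y 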

In [7]:
X_train_full, y_train_full = daily_dialogue_preprocess(train_dataset_full) 
X_val_full, y_val_full = daily_dialogue_preprocess(val_dataset_full) 
X_test_full, y_test_full = daily_dialogue_preprocess(test_dataset_full) 

In [8]:
list(zip(X_train_full[:3], y_train_full[:3]))

[('', 'Say , Jim , how about going for a few beers after dinner ? '),
 ('Say , Jim , how about going for a few beers after dinner ? ',
  ' You know that is tempting but is really not good for our fitness . '),
 ('Say , Jim , how about going for a few beers after dinner ?  You know that is tempting but is really not good for our fitness . ',
  ' What do you mean ? It will help us to relax . ')]
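To make the pairing scheme easier to verify by hand, here is a minimal, dependency-free sketch on a toy conversation. The helper name `pair_turns` is hypothetical and not part of the notebook. It is meant to produce the same pairs as the function above, and it additionally tolerates single-turn conversations.

```python
def pair_turns(conversation):
    """Return (source, target) pairs for one conversation (hypothetical helper)."""
    pairs = []
    for i, target in enumerate(conversation):
        # Use at most the two turns that precede the target as context.
        context = conversation[max(0, i - 2):i]
        pairs.append((''.join(context), target))
    return pairs


toy = ['A ', ' B ', ' C ', ' D ']
pairs = pair_turns(toy)
for source, target in pairs:
    print(repr(source), '->', repr(target))
```

The first target gets an empty source, the second gets only the first turn, and every later target gets the two turns before it.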

In [9]:
assert len(X_train_full) == len(y_train_full) 


In [44]:
len(X_train_full)

87170

Next, we need to create a word-level tokenizer to tokenize all the sentences. 

In [11]:
def preprocess_sentence(text):
    """
    This method standardizes the text in each sentence. 
    """
    # Split accented characters.
    text = tf_text.normalize_utf8(text, 'NFKD')
    text = tf.strings.lower(text)
    # Keep space, a to z, and select punctuation.
    text = tf.strings.regex_replace(text, '[^ a-z.?!,¿]', '')
    # Add spaces around punctuation.
    text = tf.strings.regex_replace(text, '[.?!,¿]', r' \0 ')
    # Strip whitespace.
    text = tf.strings.strip(text)

    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text
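To inspect what this standardization does without running TensorFlow, the same steps can be approximated with the standard library. This is only a sketch: `standardize_py` is a hypothetical name, and `unicodedata` and `re` stand in for the `tf_text` and `tf.strings` ops, so edge cases may differ from the TensorFlow version.

```python
import re
import unicodedata


def standardize_py(text):
    """Pure-Python approximation of preprocess_sentence (hypothetical helper)."""
    # Split accented characters into base character + combining mark.
    text = unicodedata.normalize('NFKD', text)
    text = text.lower()
    # Keep space, a to z, and select punctuation (this drops the combining marks).
    text = re.sub(r'[^ a-z.?!,¿]', '', text)
    # Add spaces around punctuation.
    text = re.sub(r'[.?!,¿]', r' \g<0> ', text)
    text = text.strip()
    return ' '.join(['[START]', text, '[END]'])


tokens = standardize_py(" I guess you are right.But what shall we do ? ").split()
print(tokens)
```

Splitting on whitespace afterwards collapses the doubled spaces that the punctuation step introduces, which is why `right.But` ends up as the three tokens `right`, `.`, `but`.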

In [12]:
max_vocab_size = 5000
output_sequence_length= 100

input_text_processor = tf.keras.layers.TextVectorization(
    standardize=preprocess_sentence, 
    max_tokens=max_vocab_size, 
    output_sequence_length=output_sequence_length)



In [13]:
input_text_processor.adapt(X_train_full) 

In [14]:
example_input_batch = [X_train_full[5]] 
example_input_batch

[" Do you really think so ? I don't . It will just make us fat and act silly . Remember last time ?  I guess you are right.But what shall we do ? I don't feel like sitting at home . "]

In [15]:
example_output_batch = [y_train_full[5]] 
example_output_batch

[' I suggest a walk over to the gym where we can play singsong and meet some of our friends . ']

In [16]:
# Tokenization of the example batch. 
example_tokens = input_text_processor(example_input_batch)
example_tokens.shape

TensorShape([1, 100])

In [17]:
# We can also convert the tokens back using the vocabulary. 
input_vocab = np.array(input_text_processor.get_vocabulary())
tokens = input_vocab[example_tokens[0].numpy()]
' '.join(tokens)

'[START] do you really think so ? i dont . it will just make us fat and act silly . remember last time ? i guess you are right . but what shall we do ? i dont feel like sitting at home . [END]                                                       '
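The trailing spaces in the string above are padding: index 0 of the vocabulary is the empty string, and index 1 is the out-of-vocabulary token `[UNK]`. The following sketch imitates that lookup in plain Python. The helpers `build_vocab` and `vectorize` are hypothetical, and the toy corpus is made up for illustration; the real layer may order equally frequent tokens differently.

```python
from collections import Counter


def build_vocab(sentences, max_tokens):
    """Frequency-ordered vocabulary with reserved padding and [UNK] slots."""
    counts = Counter(tok for s in sentences for tok in s.split())
    # Index 0 is reserved for padding, index 1 for out-of-vocabulary tokens.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common


def vectorize(sentence, vocab, length):
    """Map tokens to ids, then truncate or zero-pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in sentence.split()][:length]
    return ids + [0] * (length - len(ids))


corpus = ['[START] how are you ? [END]', '[START] fine . [END]']
vocab = build_vocab(corpus, max_tokens=6)
ids = vectorize('[START] how is it ? [END]', vocab, length=8)
print(vocab)
print(ids)
print(' '.join(vocab[i] for i in ids))
```

Joining the decoded tokens reproduces the effect seen above: padded positions decode to empty strings and show up as trailing spaces.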

In [18]:
# def tokenize_data(X,y): 
#     X = input_text_processor(X)
#     y = input_text_processor(y) 
#     return X,y 

In [19]:
# X_train_tok, y_train_tok = tokenize_data(X_train_full,y_train_full) 
# X_val_tok, y_val_tok = tokenize_data(X_val_full,y_val_full) 
# X_test_tok, y_test_tok = tokenize_data(X_test_full,y_test_full) 

In [20]:
# assert X_train_tok.shape[0] == y_train_tok.shape[0] 

Finally, we create a shuffled, batched `tf.data.Dataset` from the raw text pairs. 


In [21]:
def create_dataset(X,y, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((X,y)) 
    dataset = dataset.shuffle(len(X)).batch(batch_size)
    dataset = dataset.prefetch(1)
    return dataset 


In [22]:
train_dataset = create_dataset(X_train_full, y_train_full) 
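For orientation, the number of batches per epoch follows from the training set size reported earlier (87170 pairs) and the default batch size of 32. The small calculation below assumes that `batch` keeps the final partial batch, which is its default behaviour.

```python
import math

num_examples = 87170   # len(X_train_full) reported earlier in the notebook
batch_size = 32        # default batch_size of create_dataset

# The last batch is kept even when it is smaller than batch_size.
steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```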

# Creating the Model

## Seq2Seq Model with Attention

### Encoder

In [23]:
class Encoder(tf.keras.layers.Layer):
    
    def __init__(self, input_vocab_size, embedding_dim, enc_units):
        super().__init__()
        self.enc_units = enc_units 
        self.input_vocab_size = input_vocab_size
        self.embedding = tf.keras.layers.Embedding(
            self.input_vocab_size, embedding_dim) 
        self.gru = tf.keras.layers.GRU(
            self.enc_units, return_sequences=True, return_state=True, 
            recurrent_initializer="glorot_uniform") 
        
 
    def call(self, tokens, state=None):
        vectors = self.embedding(tokens) # Ret shape: (batch, s, embedding_dim)
        #    output shape: (batch, s, enc_units)
        #    state shape: (batch, enc_units)
        output, state = self.gru(vectors, initial_state=state) 
        return output, state


### Attention Layer 

In [24]:
class BahdanauAttention(tf.keras.layers.Layer):
    
    def __init__(self, units):
        super().__init__() 
        self.W1 = tf.keras.layers.Dense(units, use_bias=False) 
        self.W2 = tf.keras.layers.Dense(units, use_bias=False) 
        self.attention = tf.keras.layers.AdditiveAttention() 

    def call(self, query, value, mask):
        w1_query = self.W1(query)
        w2_key = self.W2(value)
        query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
        value_mask = mask
        context_vector, attention_weights = self.attention(
            inputs = [w1_query, value, w2_key],
            mask=[query_mask, value_mask],
            return_attention_scores = True,
        )
        return context_vector, attention_weights
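The idea behind additive attention can be written out for a single decoder query in plain Python: score every encoder position, mask out padding, apply a softmax, and take the weighted sum of the values. This is a simplified sketch, not the Keras implementation. The function name `additive_attention` is hypothetical, the query and keys are assumed to be already projected (the role of `W1` and `W2` above), and `v` stands in for the layer's learned scale vector.

```python
import math


def additive_attention(query, keys, values, mask, v):
    """Masked additive attention for one query vector (illustrative only)."""
    # Score each position s with v . tanh(query + key_s).
    scores = []
    for key, keep in zip(keys, mask):
        score = sum(vi * math.tanh(q + k) for vi, q, k in zip(v, query, key))
        # Padded positions get a very negative score, so their weight becomes 0.
        scores.append(score if keep else -1e9)

    # Numerically stable softmax over the positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context vector is the attention-weighted average of the values.
    dim = len(values[0])
    context = [sum(w * val[d] for w, val in zip(weights, values))
               for d in range(dim)]
    return context, weights


query = [0.5, -0.2]
keys = [[0.1, 0.3], [0.4, -0.1], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [True, True, False]   # the third position is padding
context, weights = additive_attention(query, keys, values, mask, v=[1.0, 1.0])
print(weights)
print(context)
```

Because the third position is masked, its large value vector does not leak into the context, which is the purpose of passing `mask` to the layer.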

    

### Decoder

In [25]:
class Decoder(tf.keras.layers.Layer):
    
    def __init__(self, output_vocab_size, embedding_dim, dec_units):
        super().__init__()
        self.dec_units = dec_units
        self.output_vocab_size = output_vocab_size
        self.embedding_dim = embedding_dim

        # For Step 1. The embedding layer converts token IDs to vectors
        self.embedding = tf.keras.layers.Embedding(
            self.output_vocab_size, embedding_dim)
        # For Step 2. The RNN keeps track of what's been generated so far
        self.gru = tf.keras.layers.GRU(
            self.dec_units, return_sequences=True, return_state=True, 
            recurrent_initializer="glorot_uniform")
        # For step 3. The RNN output will be the query for the attention layer.
        self.attention = BahdanauAttention(self.dec_units) 
        # For step 4. Eqn. (3): converting `ct` to `at`
        self.Wc = tf.keras.layers.Dense(
            dec_units, activation=tf.math.tanh,use_bias=False) 
        # For step 5. This fully connected layer produces the logits for each
        # output token.
        self.fc = tf.keras.layers.Dense(self.output_vocab_size)

    def call(self, inputs, state=None):
        # state shape: (batch, dec_units)
        # new_tokens_shape: (batch, t)
        # enc_output shape: (batch, s, enc_units) 
        new_tokens, enc_output, mask = inputs  

        vectors = self.embedding(new_tokens) 
        rnn_output, state = self.gru(vectors, initial_state=state)
        context_vector, attention_weights = self.attention(
            query=rnn_output, value=enc_output, mask=mask)
        context_and_rnn_output = tf.concat([context_vector,rnn_output],axis=-1)
        attention_vector = self.Wc(context_and_rnn_output)
        logits = self.fc(attention_vector) 
        return (logits, attention_weights), state 


### Model 

In [26]:
class AttentionSeq2Seq(tf.keras.Model):

    def __init__(self, embedding_dim, units, input_text_processor, 
            output_text_processor):
        super().__init__()
        encoder = Encoder(input_text_processor.vocabulary_size(),
                      embedding_dim, units)
        decoder = Decoder(output_text_processor.vocabulary_size(),
                        embedding_dim, units)
        self.encoder = encoder
        self.decoder = decoder
        self.input_text_processor = input_text_processor
        self.output_text_processor = output_text_processor

    def train_step(self, inputs):
        return self._train_step(inputs)


In [27]:
def _preprocess(self, input_text, target_text):
    input_tokens = self.input_text_processor(input_text) 
    target_tokens = self.output_text_processor(target_text) 
    input_mask = input_tokens != 0 
    target_mask = target_tokens != 0 
    return input_tokens, input_mask, target_tokens, target_mask 

AttentionSeq2Seq._preprocess = _preprocess 

In [28]:
def _train_step(self, inputs):
    input_text, target_text = inputs 
    (input_tokens, input_mask,
        target_tokens, target_mask) = self._preprocess(input_text, target_text)
    max_target_length = tf.shape(target_tokens)[1]
    with tf.GradientTape() as tape:
        # Encode the input
        enc_output, enc_state = self.encoder(input_tokens)

        # Initialize the decoder's state to the encoder's final state.
        # This only works if the encoder and decoder have the same number of
        # units.
        dec_state = enc_state
        loss = tf.constant(0.0)

        for t in tf.range(max_target_length-1):
            # Pass in two tokens from the target sequence:
            # 1. The current input to the decoder.
            # 2. The target for the decoder's next prediction.
            new_tokens = target_tokens[:, t:t+2]
            step_loss, dec_state = self._loop_step(new_tokens, input_mask,
                                                    enc_output, dec_state)
            loss = loss + step_loss

        # Average the loss over all non padding tokens.
        average_loss = loss / tf.reduce_sum(tf.cast(target_mask, tf.float32))

    # Apply an optimization step
    variables = self.trainable_variables 
    gradients = tape.gradient(average_loss, variables)
    self.optimizer.apply_gradients(zip(gradients, variables))

    # Return a dict mapping metric names to current value
    return {'batch_loss': average_loss}


AttentionSeq2Seq._train_step = _train_step 
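The slice `target_tokens[:, t:t+2]` implements teacher forcing: at step `t` the decoder is fed the true token at position `t` and scored against the true token at position `t+1`. The sketch below shows the resulting pairs for a single made-up id sequence; `teacher_forcing_pairs` is a hypothetical helper, and the ids are arbitrary with 0 as padding.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder input, expected next token) for each time step."""
    pairs = []
    for t in range(len(target_ids) - 1):
        decoder_input, expected = target_ids[t:t + 2]
        pairs.append((decoder_input, expected))
    return pairs


# Made-up ids: 2 = [START], 3 = [END], 0 = padding.
target_ids = [2, 7, 9, 3, 0, 0]
pairs = teacher_forcing_pairs(target_ids)

# Only steps whose expected token is not padding contribute to the loss.
scored_steps = sum(1 for _, expected in pairs if expected != 0)
non_padding_tokens = sum(1 for tok in target_ids if tok != 0)
print(pairs)
print(scored_steps, non_padding_tokens)
```

Note that the summed loss covers one step fewer than the number of non-padding tokens, because `[START]` is never a prediction target, while the average in `_train_step` divides by all non-padding tokens.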

In [29]:
def _loop_step(self, new_tokens, input_mask, enc_output, dec_state):
  input_token, target_token = new_tokens[:, 0:1], new_tokens[:, 1:2]
  # Run the decoder one step.
  decoder_input = (input_token, enc_output, input_mask)
  dec_result, dec_state = self.decoder(decoder_input, state=dec_state)
  logits, attention_weights = dec_result
  # `self.loss` returns the total for non-padded tokens
  y = target_token
  y_pred = logits
  step_loss = self.loss(y, y_pred)
  return step_loss, dec_state

AttentionSeq2Seq._loop_step = _loop_step 

### Custom Loss 

In [30]:
class MaskedLoss(tf.keras.losses.Loss):
  def __init__(self):
    self.name = 'masked_loss'
    self.loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

  def __call__(self, y_true, y_pred):
    # Calculate the loss for each item in the batch.
    loss = self.loss(y_true, y_pred)
    # Mask off the losses on padding.
    mask = tf.cast(y_true != 0, tf.float32)
    loss *= mask
    # Return the total.
    return tf.reduce_sum(loss)
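What the masked loss computes can be checked on small numbers in plain Python: a sparse cross-entropy from logits for every position, with padded positions (label 0) contributing nothing, summed over the batch. The function `masked_sparse_ce` is a hypothetical stand-in, not the Keras loss itself.

```python
import math


def masked_sparse_ce(y_true, logits):
    """Sum of cross-entropy losses over all positions whose label is not 0."""
    total = 0.0
    for label, row in zip(y_true, logits):
        if label == 0:
            # Padding: equivalent to multiplying this position's loss by 0.
            continue
        # log-sum-exp of the logits, computed in a numerically stable way.
        top = max(row)
        log_norm = top + math.log(sum(math.exp(x - top) for x in row))
        # Cross-entropy = -log softmax(row)[label].
        total += log_norm - row[label]
    return total


labels = [1, 0, 3]              # the middle position is padding
uniform_logits = [[0.0] * 4] * 3
loss = masked_sparse_ce(labels, uniform_logits)
print(loss, 2 * math.log(4))
```

With uniform logits over four classes, each of the two non-padding positions costs `log(4)`, so the two printed numbers agree.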

# Training the Model 

In [31]:
EMBEDDING_DIM = 128 
UNITS = 100 

In [32]:
seq_2_seq = AttentionSeq2Seq(
    EMBEDDING_DIM, UNITS, input_text_processor,input_text_processor)

In [33]:
seq_2_seq.compile(
    optimizer= tf.optimizers.Adam(), 
    loss=MaskedLoss()
)

In [34]:
# NOTE: Test to make sure that the training loop is working. 
# for n in range(10):
#   print(seq_2_seq.train_step([example_input_batch, example_output_batch]))
# print()

{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.129977>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.123968>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.117443>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.109748>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.100172>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.087882>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.071827>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.050645>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=8.022541>}
{'batch_loss': <tf.Tensor: shape=(), dtype=float32, numpy=7.9851103>}
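The first loss value is close to what an untrained model should produce, which is a useful sanity check. The estimate below rests on three assumptions: the vocabulary reached the `max_vocab_size` cap of 5000, the untrained model predicts a roughly uniform distribution, and the example target tokenizes to 22 tokens (`[START]`, 19 words, one period, `[END]`).

```python
import math

vocab_size = 5000      # assumed: vocabulary reached max_vocab_size
target_tokens = 22     # assumed: [START] + 19 words + '.' + [END]

# Every token except [START] is a prediction target.
predicted = target_tokens - 1

# Uniform predictions cost log(vocab_size) per target; the notebook's
# average divides the summed loss by all non-padding tokens.
expected_loss = predicted * math.log(vocab_size) / target_tokens
print(round(expected_loss, 3))
```

Under these assumptions the estimate lands at about 8.13, in line with the first `batch_loss` printed above, and the slow decrease afterwards suggests the training step is updating the weights.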



In [35]:
class BatchLogs(tf.keras.callbacks.Callback):
  def __init__(self, key):
    self.key = key
    self.logs = []

  def on_train_batch_end(self, n, logs):
    self.logs.append(logs[self.key])

batch_loss = BatchLogs('batch_loss')

In [47]:
for example_input_batch, example_target_batch in train_dataset.take(1):
  print(example_input_batch[:5])
  print()
  print(example_target_batch[:5])
  break

tf.Tensor(
[b'Would you tell me how I should send this parcel to Shanghai , China ? It contains only books . '
 b" Sorry , madam . I afraid you have a wrong number.we don't have Mr Over here .  I want 6420422 3 , is that right ? "
 b" well , that's true .  how about your mother ? "
 b'What channel did you watch last night ? ' b''], shape=(5,), dtype=string)

tf.Tensor(
[b" You might send it as'Printed Matter ' . "
 b' No , you give a wrong number . '
 b' she also believes in healthy diet . And she requires us to have regular meals . '
 b' Channel Two . A TV series was showing on it . The name of the series is Huanzhu Gene '
 b"What's the weather like ? "], shape=(5,), dtype=string)


In [48]:
seq_2_seq.fit(train_dataset, epochs=3,
                     callbacks=[batch_loss])

Epoch 1/3


OperatorNotAllowedInGraphError: in user code:

    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/muhammadumair/anaconda3/envs/dl_lang/lib/python3.8/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/var/folders/96/83knfb594fb5jg1ptk9m5xhc0000gn/T/ipykernel_13891/4095551697.py", line 16, in train_step
        return self._train_step(inputs)
    File "/var/folders/96/83knfb594fb5jg1ptk9m5xhc0000gn/T/ipykernel_13891/936795159.py", line 16, in _train_step
        for t in tf.range(max_target_length-1):

    OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
