<h3><b>Train and evaluate a chatbot based on an encoder-decoder transformer model (i.e. the same architecture as the original transformer). The model is trained on the Cornell Movie-Dialogs Corpus.</b></h3>

<h5><b> 0. Setup</b></h5>

In [3]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

from dataset import DatasetHp, preprocess_sentence, get_cornell_dataset
from transformer_model import (
    ModelHp,
    MultiHeadAttentionLayer,
    PositionalEncoding,
    encoder_decoder_transformer,
)

<h5><b> 1. Load dataset and tokenizer</b></h5>

In [4]:
dataset_hp = DatasetHp(
    max_length=40,
    vocab_size=10_000,
    max_sample=50_000,
)

dataset, tokenizer = get_cornell_dataset(dataset_hp)

Downloading data from http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
loading conversations ... 


 22%|██▏       | 18638/83097 [00:03<00:12, 5029.91it/s]


initializing tokenizer ...
tokenizer saved in `./transformer/tokenizer`
vocab size updated from 10000 to 10255
tokenization ... 


50000it [00:03, 14348.49it/s]
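
The loss defined in the next section reshapes targets to `max_length - 1` positions and masks token id 0. This suggests that each tokenized answer is padded with zeros to `max_length` and then split into a decoder input (all but the last token) and a target (all but the first token). `get_cornell_dataset` is not shown here, so the following is only a framework-free sketch of that assumed layout; `pad_to_length`, `shift_for_teacher_forcing`, and the token ids are hypothetical.

```python
# Hypothetical sketch: how a tokenized answer could be padded and shifted
# for teacher forcing. This is an assumption, not the code of get_cornell_dataset.
MAX_LENGTH = 40  # same value as dataset_hp.max_length
PAD_ID = 0       # the loss function masks positions where the target is 0


def pad_to_length(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Right-pad a list of token ids with pad_id up to max_length."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    return token_ids + [pad_id] * (max_length - len(token_ids))


def shift_for_teacher_forcing(padded_ids):
    """Split one padded sequence into (decoder_input, target)."""
    decoder_input = padded_ids[:-1]  # everything except the last position
    target = padded_ids[1:]          # everything except the first position
    return decoder_input, target


# Made-up ids: 101 = start token, 102 = end token, 7/8/9 = ordinary words.
answer = pad_to_length([101, 7, 8, 9, 102])
decoder_input, target = shift_for_teacher_forcing(answer)
print(len(decoder_input), len(target))
print(decoder_input[:5], target[:5])
```

With this layout the target at position `t` is the token that follows the decoder input at position `t`, so the model always learns to predict the next token.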


<h5><b> 2. Define the optimizer, loss, and metric functions.</b></h5>

In [5]:
optimizer = tf.keras.optimizers.Adam()

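# Per-token loss (reduction="none") so that padding can be masked out below.
cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"
)

def loss_function(y_true, y_pred):
    # Targets have max_length - 1 positions.
    y_true = tf.reshape(y_true, shape=(-1, dataset_hp.max_length - 1))
    loss = cross_entropy(y_true, y_pred)
    # Zero out the loss at padded positions (token id 0).
    mask = tf.cast(tf.not_equal(y_true, 0), dtype=tf.float32)
    loss = tf.multiply(loss, mask)
    # The mean is taken over all positions, padded ones included.
    return tf.reduce_mean(loss)

def accuracy(y_true, y_pred):
    y_true = tf.reshape(y_true, shape=(-1, dataset_hp.max_length - 1))
    # Per-position accuracy; padded positions are not masked here.
    return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)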
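
One detail worth noting: `loss_function` zeroes the loss at padded positions, but `tf.reduce_mean` still divides by the total number of positions, so heavily padded batches report a smaller loss. A possible alternative divides by the number of non-padding tokens instead. The sketch below uses plain Python so the arithmetic is easy to follow; the function names and sample values are illustrative assumptions, not part of the notebook.

```python
def mean_over_all_positions(losses, targets, pad_id=0):
    """Mirror of the notebook's reduction: mask, then average over every position."""
    masked = [l if t != pad_id else 0.0 for l, t in zip(losses, targets)]
    return sum(masked) / len(masked)


def mean_over_real_tokens(losses, targets, pad_id=0):
    """Alternative: average only over positions that hold a real token."""
    masked = [l for l, t in zip(losses, targets) if t != pad_id]
    # Guard against a sequence made only of padding.
    return sum(masked) / max(len(masked), 1)


# One flattened sequence: two real tokens followed by two padding positions.
targets = [5, 9, 0, 0]
per_token_loss = [2.0, 4.0, 3.0, 3.0]

print(mean_over_all_positions(per_token_loss, targets))  # (2 + 4) / 4
print(mean_over_real_tokens(per_token_loss, targets))    # (2 + 4) / 2
```

In TensorFlow the second variant would amount to dividing `tf.reduce_sum(loss * mask)` by `tf.reduce_sum(mask)`. Whether it helps training here is untested; it mainly makes loss values comparable across batches with different amounts of padding.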

<h5><b> 3. Build and train the model.</b></h5>

In [7]:
hparams = ModelHp(
    d_model = 256,
    num_attention_heads = 8,
    dropout_rate = 0.1,
    num_units = 512,
    activation = "relu",
    vocab_size = 10255,  # updated vocab size reported when the dataset was loaded
    num_layers = 2,
)

model = encoder_decoder_transformer(hparams, "transformer")

print(f"Total number of model's parameters: {model.count_params()}")

Total number of model's parameters: 10521871
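
The reported total can be cross-checked by hand. The sketch below assumes the standard layout of the original transformer: separate encoder and decoder embedding tables, parameter-free positional encoding, four biased `d_model x d_model` projections per attention block, a two-layer feed-forward block of width `num_units`, layer normalization with scale and offset, and an untied output projection with bias. `transformer_model` is not shown here, so these are assumptions; under them the count comes out to the same 10,521,871.

```python
# Hand count of trainable parameters under an assumed standard layout.
d_model, num_units, vocab_size, num_layers = 256, 512, 10255, 2


def dense(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


embedding = vocab_size * d_model             # one table each for encoder and decoder
attention = 4 * dense(d_model, d_model)      # query, key, value and output projections
feed_forward = dense(d_model, num_units) + dense(num_units, d_model)
layer_norm = 2 * d_model                     # scale and offset

encoder_layer = attention + feed_forward + 2 * layer_norm
decoder_layer = 2 * attention + feed_forward + 3 * layer_norm  # self- and cross-attention
output_projection = dense(d_model, vocab_size)

total = (
    2 * embedding
    + num_layers * (encoder_layer + decoder_layer)
    + output_projection
)
print(total)
```

Roughly three quarters of the parameters sit in the two embedding tables and the output projection, so the vocabulary size dominates the model size at this depth.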


In [8]:
model.compile(optimizer=optimizer, loss=loss_function, metrics=[accuracy])

In [9]:
history = model.fit(dataset, epochs=50)

Epoch 1/50
...
Epoch 50/50


In [10]:
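def inference(hp, model, tokenizer, sentence):
    # Tokenize the input, wrap it in start/end tokens, and add a batch dimension.
    sentence = preprocess_sentence(sentence)
    sentence = tf.expand_dims(
        hp.start_token + tokenizer.encode(sentence) + hp.end_token, axis=0)

    # The decoder input starts with the start token only.
    output = tf.expand_dims(hp.start_token, 0)

    # Greedy decoding: append the most likely next token at each step.
    for i in range(hp.max_length):
        predictions = model(inputs=[sentence, output], training=False)
        # Keep only the prediction for the last position.
        predictions = predictions[:, -1:, :]
        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

        # Stop as soon as the end token is predicted.
        if tf.equal(predicted_id, hp.end_token[0]):
            break

        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0)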

def generate_response(hp, model, tokenizer, sentence):
    prediction = inference(hp, model, tokenizer, sentence)
    predicted_sentence = tokenizer.decode(
        [i for i in prediction if i < tokenizer.vocab_size]
    )
    return predicted_sentence

def evaluate(hp, model, tokenizer, inputs):
    print("-evaluating ...")
    # Fallback prompt, used only if the first entry of `inputs` is None.
    response = "what are you going to do?"

    for user_sentence in inputs:
        if user_sentence is not None:
            print(f"\nInput: {user_sentence}")
            response = generate_response(hp, model, tokenizer, user_sentence)
            print(f"Output: {response}")

        else:
            # A None entry feeds the previous response back in as the next input.
            print(f"\nInput: {response}")
            response = generate_response(hp, model, tokenizer, response)
            print(f"Output: {response}")
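
`inference` performs greedy decoding: at each step it feeds the tokens generated so far back into the decoder, keeps the most likely next token, and stops at the end token or after `max_length` steps. The same control flow can be sketched without TensorFlow. `toy_next_token_scores` below is a made-up stand-in for the model, and the token ids are arbitrary.

```python
START_ID, END_ID, MAX_LENGTH = 1, 2, 40  # arbitrary ids for the illustration


def toy_next_token_scores(source, generated):
    """Stand-in for the model: score every id in a 6-token vocabulary.

    It simply echoes the source sequence one token at a time and then
    favors END_ID, which is enough to exercise the decoding loop.
    """
    step = len(generated) - 1  # number of tokens produced after START_ID
    wanted = source[step] if step < len(source) else END_ID
    return [1.0 if token_id == wanted else 0.0 for token_id in range(6)]


def greedy_decode(source, max_length=MAX_LENGTH):
    generated = [START_ID]
    for _ in range(max_length):
        scores = toy_next_token_scores(source, generated)
        # argmax over the vocabulary: index of the highest score
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == END_ID:
            break
        generated.append(next_id)
    # Unlike the notebook's inference(), this sketch drops the start token here.
    return generated[1:]


print(greedy_decode([3, 4, 5]))
```

Greedy decoding is deterministic and cheap, but it commits to the single best token at every step, which may be one reason a model trained this way tends to fall back on generic replies.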

In [11]:
from dataset import load_tokenizer

tokenizer = load_tokenizer("./transformer/tokenizer")

In [14]:
chatbot = tf.keras.models.load_model(
    "model.h5",
    custom_objects={
        "PositionalEncoding": PositionalEncoding,
        "MultiHeadAttentionLayer": MultiHeadAttentionLayer,
    },
    compile=False,
)


In [16]:
sentences = [
    "What is your name ?",
    "I want to be a good programmer",
    "Do you love winter or summer ?",
    "Tomarrow, i have an important exam",
    "what is your age",
    None, None, None
]

evaluate(dataset_hp, chatbot, tokenizer, sentences)

-evaluating ...

Input: What is your name ?
Output: i do not know . i am going to do .

Input: I want to be a good programmer
Output: i am sorry .

Input: Do you love winter or summer ?
Output: i do not know . i am not sure . i have not in the mood .

Input: Tomarrow, i have an important exam
Output: i am sorry , but can i do not gamble , hilda !

Input: what is your age
Output: i am sorry , i do not know .

Input: i am sorry , i do not know .
Output: i am not convinced you should be happy .

Input: i am not convinced you should be happy .
Output: i do not know . it is a maze of tunnels . i cannot see it .

Input: i do not know . it is a maze of tunnels . i cannot see it .
Output: i do not want to .
