# HW10: A Simple Chatbot using GPT2

Remember that these homeworks are graded for completion. **You can <span style="color:red">not</span> skip any section of this homework.**

**Training a Chatbot**

In this exercise, we are going to train a simple chatbot based on DistilGPT2. An overview of the GPT2 architecture is available in the Hugging Face documentation [here](https://huggingface.co/transformers/model_doc/gpt2.html). We will use the [CCPE data](https://www.aclweb.org/anthology/W19-5941.pdf) (no need to read the paper for this exercise; we provide data loading utilities). The dataset offers exciting possibilities for training sophisticated chatbots, but we only explore a very simple version.

In [1]:
# clone the github repo
!git clone https://github.com/google-research-datasets/ccpe

Cloning into 'ccpe'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 13 (delta 3), reused 6 (delta 2), pack-reused 0
Unpacking objects: 100% (13/13), 531.27 KiB | 171.00 KiB/s, done.


In [28]:
import json
import numpy as np

import tensorflow as tf

from tqdm import tqdm

from more_itertools import pairwise

In [30]:
def load_data():
    with open("ccpe/data.json") as f:
        for conversation in json.load(f):
            for (a, b) in pairwise(conversation["utterances"]):
                yield (a['text'], b['text'])

data = list(load_data())

print(*data[0:3], sep='\n')

('generally speaking what type of movies do you watch', 'I like thrillers a lot.')
('I like thrillers a lot.', 'thrillers? for example?')
('thrillers? for example?', "Zodiac's one of my favorite movies.")
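
To make the pairing step concrete, here is a minimal sketch that mimics `pairwise` with plain `zip`. The `adjacent_pairs` helper and the toy conversation are hypothetical illustrations, not part of the provided utilities; the only assumption about the data is that each conversation has an `"utterances"` list whose items carry a `"text"` field, as `load_data` relies on.

```python
def adjacent_pairs(items):
    """Return (items[0], items[1]), (items[1], items[2]), ... as a list."""
    return list(zip(items, items[1:]))

# Hypothetical toy conversation, shaped like one entry of the loaded JSON.
toy_conversation = {
    "utterances": [
        {"text": "what type of movies do you watch"},
        {"text": "I like thrillers a lot."},
        {"text": "thrillers? for example?"},
    ]
}

toy_pairs = [
    (a["text"], b["text"])
    for a, b in adjacent_pairs(toy_conversation["utterances"])
]
print(*toy_pairs, sep="\n")
```

A conversation with n utterances yields n - 1 pairs, so each utterance except the first and the last appears once as the previous sentence and once as the current sentence.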


In [31]:
data = np.reshape(data[:-13], (-1, 16, 2))
print(len(data))
print(data[0][:5])


# please note how we arrange the data in pairs of (previous_sentence, current_sentence)
# and each batch contains 16 such sentence pairs

716
[['generally speaking what type of movies do you watch'
  'I like thrillers a lot.']
 ['I like thrillers a lot.' 'thrillers? for example?']
 ['thrillers? for example?' "Zodiac's one of my favorite movies."]
 ["Zodiac's one of my favorite movies."
  "Zodiac the movie about a serial killer from the '60s or '70s, around there."]
 ["Zodiac the movie about a serial killer from the '60s or '70s, around there."
  'Zodiac? oh wow ok, what do you like about that movie']]
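
The slice `data[:-13]` hard-codes the number of leftover pairs. As a sketch, assuming we simply want to drop whatever does not fill a complete batch, the remainder can be computed instead. `make_batches` is a hypothetical helper, and the placeholder list only mimics the size implied by the output above (716 batches of 16 pairs plus the 13 dropped pairs).

```python
def make_batches(pairs, batch_size=16):
    """Split pairs into full batches and drop the incomplete tail."""
    usable = len(pairs) - len(pairs) % batch_size
    return [pairs[start:start + batch_size] for start in range(0, usable, batch_size)]

# Placeholder pairs standing in for the real data: 716 * 16 + 13 items.
placeholder_pairs = [(f"previous {n}", f"current {n}") for n in range(716 * 16 + 13)]

batches = make_batches(placeholder_pairs)
print(len(batches), len(batches[0]), len(batches[-1]))
```

Computing the remainder keeps the batching correct if the dataset size or the batch size changes.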


**The data**
As we can see from the data, we only extract sentence pairs and will train models on these pairs in isolation. This is a very simplified version of training a chatbot. 

**In a more realistic setting**, we would include more conversation history and perhaps retrieve additional information from a fact base to generate factually accurate responses (e.g. think of a chatbot that suggests restaurants in a city and needs a list of all restaurants in that city). We would probably also encode the speaker and the chatbot, and guess the speech act in order to give the desired response (e.g. if a speaker just wants to make small talk, we would guess this and reply accordingly; if we guess they are asking for factual information, we should structure our response very differently). However, in this exercise we stick to our very simplified version.
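
One of these extensions, including more conversation history, can be sketched in a few lines. This is only an illustration under assumptions: the `build_examples` helper, the `history_size` parameter and the `<SPEAKER>` tag format are hypothetical, and the toy utterances assume that each item has `speaker` and `text` fields.

```python
def build_examples(utterances, history_size=3):
    """Pair each utterance with up to `history_size` preceding turns as context."""
    examples = []
    for i in range(1, len(utterances)):
        history = utterances[max(0, i - history_size):i]
        # Tag each turn with its speaker so the model can tell the roles apart.
        context = " ".join(f"<{u['speaker']}> {u['text']}" for u in history)
        examples.append((context, utterances[i]["text"]))
    return examples

# Hypothetical toy utterances; the field names are an assumption.
toy_utterances = [
    {"speaker": "ASSISTANT", "text": "what type of movies do you watch"},
    {"speaker": "USER", "text": "I like thrillers a lot."},
    {"speaker": "ASSISTANT", "text": "thrillers? for example?"},
    {"speaker": "USER", "text": "Zodiac's one of my favorite movies."},
]

for context, response in build_examples(toy_utterances, history_size=2):
    print(repr(context), "->", repr(response))
```

With `history_size=1` and no speaker tags, this reduces to the sentence pairs used in this exercise.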

We have already prepared the data such that it comes in batches of 16 sentence pairs each.

To train a chatbot, we need data, a tokenizer, a model and an optimizer.

In [13]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

model_name = "distilgpt2"

# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name, pad_token_id=0)

##because GPT2 tokenizers do not have padding, cls and sep tokens, we have to add these ourselves
##we won't need the # character, so this will be the pad token
tokenizer.add_special_tokens({'pad_token': '#'})
tokenizer.add_special_tokens({'cls_token': 'bos'})
tokenizer.add_special_tokens({'sep_token': 'bos'})

0

In [43]:
## load model
model = TFGPT2LMHeadModel.from_pretrained(model_name)

# https://huggingface.co/transformers/model_doc/gpt2.html
embedding_layer = model.resize_token_embeddings(len(tokenizer))

# https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
#model.set_num_special_tokens(len(SPECIAL_TOKENS))

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [42]:
## let's have a look at what we generate before fine-tuning our chatbot

input_ids = tokenizer.encode("Why do you like this kind of movies?", return_tensors='tf')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
)

print("sample_output:")
print(sample_output)

print("GENERATED:")
print(tokenizer.decode(sample_output[0]).replace("\n", "/"))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
sample_output:
tf.Tensor(
[[5195  466  345  588  428 1611  286 6918   30  198  198  198  198  198
   198  198  198  198  198  198  198  198  198  198  198  198  198  198
   198  198  198  198  198  198  198  198  198  198  198  198  198  198
   198  198  198  198  198  198  198  198]], shape=(1, 50), dtype=int32)
GENERATED:
Why do you like this kind of movies?/////////////////////////////////////////
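
The `top_p=0.92` argument selects nucleus (top-p) sampling. The following is a minimal, library-free sketch of the idea, not the Hugging Face implementation: the helper names and the toy next-token distribution are made up for illustration.

```python
import random

def top_p_filter(probs, top_p=0.92):
    """Keep the most likely tokens until their total mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}

def sample_top_p(probs, top_p=0.92, rng=random):
    """Draw one token from the filtered, renormalized distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up next-token probabilities, for illustration only.
toy_probs = {"\n": 0.6, "I": 0.2, "Because": 0.15, "zebra": 0.05}
print(top_p_filter(toy_probs))
print(repr(sample_top_p(toy_probs)))
```

In this toy distribution the least likely token falls outside the nucleus and can never be sampled, while the remaining tokens keep their relative proportions.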


In [27]:
## implement an optimizer with learning_rate = 2e-5 for all parameters

learning_rate = 2e-5
opt = tf.keras.optimizers.Adam(learning_rate=learning_rate)
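
For intuition about what the optimizer does with this learning rate, here is a sketch of the textbook Adam update for a single scalar parameter. It is not the Keras implementation, which may differ in details; the `adam_step` helper and the default values for `beta1`, `beta2` and `eps` are assumptions chosen for illustration.

```python
import math

def adam_step(param, grad, state, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one textbook Adam update to a scalar parameter and return the new value."""
    state["t"] += 1
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

toy_state = {"t": 0, "m": 0.0, "v": 0.0}
toy_param = adam_step(1.0, grad=0.5, state=toy_state)
print(1.0 - toy_param)
```

On the first step the update size is close to the learning rate itself regardless of the gradient's magnitude, which is one reason a small value such as `2e-5` is a common choice for fine-tuning a pretrained model.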

In [39]:
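## train the model
num_epochs = 1

## training on all 716 batches takes roughly one hour, so we just train for 100 steps
max_steps = 100

for epoch in range(num_epochs):
    for (i, batch) in enumerate(data[:max_steps]):
        if i == 0:
            print(batch.shape)

        ## prepare model input
        # In the textual entailment example in the notebook, we encode
        # [CLS-token]premise[SEP-token]hypothesis[SEP-token]
        # Here, we encode
        # [CLS-token]previous_sentence[SEP-token]current_sentence[SEP-token]
        texts = [
            tokenizer.cls_token + str(prev) + tokenizer.sep_token + str(cur) + tokenizer.sep_token
            for (prev, cur) in batch
        ]
        encoded = tokenizer(texts, return_tensors='tf', padding=True)
        input_ids = encoded['input_ids']
        attention_mask = encoded['attention_mask']

        # the labels are simply the input_ids; padded positions are set to -100,
        # which the Hugging Face loss is expected to ignore
        labels = tf.where(attention_mask == 1, input_ids, -100)

        ## compute a forward step
        # since GPT-2 reads from left to right, it predicts the label at each timestep
        # without having access to that token's information during training
        with tf.GradientTape() as tape:
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels, training=True)
            loss = tf.reduce_mean(outputs[0])  # the loss comes first when labels are passed

        ## compute a backward step
        grads = tape.gradient(loss, model.trainable_variables)

        ## perform an optimizer step
        opt.apply_gradients(zip(grads, model.trainable_variables))

        ## clearing gradients is not needed here: each GradientTape starts fresh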


(16, 2)
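
The comment that the labels are simply the `input_ids` can be illustrated without a model. The sketch below uses made-up token ids and a hypothetical pad id of 0; it assumes, following the usual convention in Hugging Face language-model losses, that the label value `-100` marks positions to skip. The `next_token_targets` helper is illustrative only, since the model is expected to perform this shift internally.

```python
IGNORE_INDEX = -100  # assumed "skip this position" label value
PAD_ID = 0           # hypothetical pad id, for this toy example only

def next_token_targets(input_ids, pad_id=PAD_ID):
    """Return (input token, target token) pairs for left-to-right training."""
    labels = [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]
    # The prediction made at position t is scored against the token at t + 1.
    return list(zip(input_ids[:-1], labels[1:]))

toy_ids = [11, 42, 7, 99, 0, 0]  # made-up ids followed by two pad positions
for seen, target in next_token_targets(toy_ids):
    print(seen, "->", target)
```

Each position is trained to predict its successor, and positions whose target is padding contribute nothing to the loss.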


In [None]:
## let's have a look at what we generate after fine-tuning our chatbot

input_ids = tokenizer.encode("Why do you like this kind of movies?", return_tensors='tf')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
)

tokenizer.decode(sample_output[0])