# Meet your Artificial Self

> This notebook will be used for the [AMLD 2020](https://appliedmldays.org/) workshop [**"Meet your Artificial Self"**](https://appliedmldays.org/workshops/meet-your-artificial-self-generate-text-that-sounds-like-you), taking place January 25 in Lausanne, Switzerland.

## Task 2
In task 1 we learnt how to fine-tune a language model and we saw how style transfer works. Conversations are a different beast however! 

In this task we will try our first approach at training a conversational model.

## Important resources
* [Workshop Github repo](https://github.com/mar-muel/artificial-self-AMLD-2020/tree/master/2)
* [PyTorch documentation](https://pytorch.org/docs/stable/index.html)
* Huggingface transformers library [ [Github](https://github.com/huggingface/transformers) | [Docs](https://huggingface.co/transformers/) ]

## Approach
In this task we will try a naive approach to getting conversational style by simply feeding the model "raw" conversation data of the form:
```
<speaker1> Hi
<speaker2> Hey - how are you?
<speaker1> Great, thanks!
...
```
Our hope is that the model will simply learn this structure and we will be able to query the model with an input of the form:

```
<speaker2> Am I speaking to a bot?
<speaker1>
```
We then expect the model to extend the text from this prefix.
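
As a minimal sketch of how such a prefix could be assembled (the helper `build_prompt` and its arguments are hypothetical and not part of the workshop code):

```python
def build_prompt(turns, bot_tag="<speaker1>"):
    """Join (tag, text) turns into one string and end with the bot's tag.

    `turns` is a list of (speaker_tag, utterance) pairs. The trailing tag
    is left without text so that the model continues from there.
    """
    lines = [f"{tag} {text}" for tag, text in turns]
    lines.append(bot_tag)
    return "\n".join(lines)


prompt = build_prompt([("<speaker2>", "Am I speaking to a bot?")])
print(prompt)
```

The open `<speaker1>` tag at the end is what nudges the model to write the next turn as speaker1.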

This notebook will run you through all the steps, from collecting the training data to interacting with the final model.


# Setting things up
The following cell installs all the necessary dependencies.


In [0]:
# Install all dependencies for this task
!pip3 install --user -r requirements.txt

# Import training data
In this step we will add the data to Colab to train the model. You are free to choose from two options:
1. Use your own conversational data
2. Use conversational data from well-known people ("Alternative data")



# Alternative dataset 1: world leader interviews

Instead of using your own data, you can use some datasets we prepared for you. The first option is a dataset of interviews of world leaders: Barack Obama and Vladimir Putin. These interviews will be treated as chat conversations, where the interlocutors are the reporters.

To use those, copy the conversation you want from the `datasets` folder to the task2 `data` folder:

In [2]:
import os

# Barack Obama
input_data = './data/barack_obama_interviews2.json'

# Vladimir Putin
#input_data = '../datasets/vladimir_putin_interviews.json'

assert os.path.isfile(input_data), "File not found"

# Alternative dataset 2: movie quotes

Another option is to use quotes from movies. You can use the [Cornell Movie-Dialogs Corpus](https://), which contains **220,579 conversational exchanges** between 10,292 pairs of movie characters.

In [3]:
import os

#input_data = '../datasets/cornell_movie_dialogs_corpus.json.zip'

assert os.path.isfile(input_data), "File not found"

# Prepare the data
For this task we will use the transformers library by Huggingface. The transformers library implements many recent NLP models (such as BERT and GPT-2).

Our data is currently in JSON format, but can easily be read into a pandas DataFrame.

As a first step we want to get our conversation data from this format:

| timestamp | conversationId | conversationWithName | senderName | outgoing | text | language | platform |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| 1575463019 | 693342290 | Alice | Bob | True | Hi Alice! | en | whatsapp |
| 1575463025 | 693342290 | Alice | Alice | False | Hi Bob! How are you these days? | en | whatsapp |
| 1575463030 | 693342290 | Alice | Bob | True | Great! Thanks | en | whatsapp |

into a text file of this format:
```
<speaker1> Hi Alice!
<speaker2> Hi Bob! How are you these days?
<speaker1> Great! Thanks
...
```

Our model will try to learn to generate this structure from the data. The tags `<speaker1>` and `<speaker2>` are not special in themselves and could in theory be replaced by something else (more about this below).
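
The conversion can be sketched in a few lines of plain Python. This is only an illustration: the helper `to_tagged_lines` is hypothetical, the real `generate_input_task2` in `utils` may work differently, and the mapping of outgoing messages to `<speaker1>` is an assumption.

```python
def to_tagged_lines(messages, own_tag="<speaker1>", other_tag="<speaker2>"):
    """Turn message records into tagged lines, ordered by timestamp.

    Each record is a dict with at least `timestamp`, `outgoing` and `text`.
    Outgoing messages (written by you) get `own_tag`, all others `other_tag`.
    """
    lines = []
    for msg in sorted(messages, key=lambda m: m["timestamp"]):
        tag = own_tag if msg["outgoing"] else other_tag
        lines.append(f"{tag} {msg['text']}")
    return lines


# Records deliberately out of order to show the sorting step
messages = [
    {"timestamp": 1575463025, "outgoing": False, "text": "Hi Bob! How are you these days?"},
    {"timestamp": 1575463019, "outgoing": True, "text": "Hi Alice!"},
    {"timestamp": 1575463030, "outgoing": True, "text": "Great! Thanks"},
]
print("\n".join(to_tagged_lines(messages)))
```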

In [4]:
from utils import generate_input_task2
import os

In [5]:
assert os.path.isfile(input_data)
generate_input_task2(input_data)

100%|██████████| 1/1 [00:00<00:00, 166.46it/s]


The script has now generated an input file `cached_input_task2.txt`. You can inspect it with the `head` command:

In [6]:
!head cached_input_task2.txt

<speaker2> Mr President, you're about to fly to Kenya, to your ancestral home. Given the al-Shabaab attacks on the West Gate mall and Garissa University, I'm sure your secret service could've suggested other countries for you to visit. But you wanted to go to Kenya. Well, I think it is important first of all that the president of the United States underscores our commitment to partnering with countries around the world, even though we're not intimidated by terrorist organisations. Second, the counterterrorism co-operation between the United States and Kenya - and Uganda and other countries - in East Africa - is very strong. And part of the subject of the visit is to continue to strengthen those ties to make them more effective. Third, as I wind down my presidency, I've already had a number of visits to Africa. But this gives me an opportunity to focus on a region that I have not been visiting as president, and I'm also going to have the opportunity to talk to the African Union. So I'll

# Train the model
We can now start training the model. 

In [7]:
from collections import defaultdict
from itertools import chain
import random
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
from tqdm import tqdm, trange
from transformers import (
    WEIGHTS_NAME,
    AdamW,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTConfig,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)
from utils import get_input_task2, set_seed, add_special_tokens_
import logging
import torch.nn.functional as F

# set up logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)-5.5s] [%(name)-12.12s]: %(message)s')


### Constants

In [8]:
run_name = 'run1'               # The name of the run (subdirectory in ./runs)
model_type = 'openai-gpt'       # Initialize model from path to checkpoint or with model name ("openai-gpt" or "gpt2")
save_every = 50                 # Save checkpoint every n update steps.
max_input_length = 400          # Number of tokens which will be fed into the model (reduce this number if you have memory constraints)
weight_decay = 0                # Weight decay if we apply some.
train_batch_size = 8            # Batch size for training
gradient_accumulation_steps = 8 # Accumulate gradients on several steps
lr = 5e-5                       # Learning rate
adam_epsilon = 1e-8             # Epsilon for Adam optimizer.
max_norm = 1                    # Clipping gradient norm
n_epochs = 2                    # Number of training epochs
device = 'cpu'                 # Device (cuda or cpu)
warmup_steps = 0                # Linear warmup over warmup_steps.
seed = 42                       # random seed for initialization
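
The two batch-related constants interact: gradients from `gradient_accumulation_steps` consecutive batches are accumulated before a single optimizer update. A minimal sketch of the resulting arithmetic (the helper `accumulation_summary` and the example numbers are hypothetical, not part of the workshop code):

```python
def accumulation_summary(num_batches, batch_size, accumulation_steps, epochs):
    """Return (effective batch size, optimizer updates per epoch, total updates)."""
    effective_batch_size = batch_size * accumulation_steps
    # The optimizer only steps once every `accumulation_steps` batches;
    # leftover batches at the end of an epoch do not trigger an update of their own.
    updates_per_epoch = num_batches // accumulation_steps
    return effective_batch_size, updates_per_epoch, updates_per_epoch * epochs


# Hypothetical example: 100 batches per epoch, batch size 8, 8 accumulation steps
print(accumulation_summary(num_batches=100, batch_size=8, accumulation_steps=8, epochs=2))
```

Note that with fewer batches per epoch than `gradient_accumulation_steps`, the number of updates is zero, so for very small datasets it may make sense to lower that value.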

### Data loading
In PyTorch we can define a class which inherits from the `Dataset` class and contains an initializer method `__init__()` in which we

1. Read the text file which we generated before
2. Tokenize the text (split the text into smaller word/character pieces) and convert the tokens into vocabulary IDs (the positions of the tokens in the vocabulary). GPT-2 uses so-called BPE (byte-pair encoding). If you're interested in how it works you can read more about it in [this blog post](https://leimao.github.io/blog/Byte-Pair-Encoding/).
3. Cut up the array of token IDs into chunks of size `max_input_length` (usually chosen to be the maximum of what the memory/model allows for)
4. Append each generated training example to a list

The `__getitem__()` method simply implements the retrieval of a single training example.
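
Step 3 can be sketched on its own with a toy list of IDs (the helper `split_into_blocks` is hypothetical; it uses a block-wise loop without padding):

```python
def split_into_blocks(token_ids, block_size):
    """Cut a flat list of token IDs into consecutive blocks of `block_size`.

    A trailing remainder shorter than `block_size` is dropped (no padding).
    """
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(10))  # stand-in for real vocabulary IDs
blocks = split_into_blocks(token_ids, block_size=4)
print(blocks)
```

With 10 IDs and a block size of 4 this yields two blocks, and the last two IDs are discarded.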

In [9]:
class TextDataset(Dataset):
    def __init__(self, tokenizer):
        # load the text data generated from before into memory
        text = get_input_task2(input_data)
        logger.info("Tokenizing and building input...")
        # tokenize the whole file
        tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
        # generate training examples by cutting the text into blocks of size max_input_length
        self.examples = []
        block_size = max_input_length
        if block_size < 0:
            # use maximum possible input block size
            block_size = tokenizer.max_len_single_sentence
        for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
            # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
            self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, item):
        return torch.tensor(self.examples[item])

def get_data_loader(tokenizer):
    """ Prepare the dataset for training and evaluation """
    dataset = TextDataset(tokenizer)
    logger.info("Train dataset: {:,} samples".format(len(dataset)))
    logger.info("Build dataloaders")
    data_loader = DataLoader(dataset, batch_size=train_batch_size, shuffle=True)
    return data_loader


In [10]:
# Setting the same seed allows for some reproducibility of the experiments
set_seed(seed)

### Load models and tokenizers
The transformers library comes with a built-in method `from_pretrained(model_type)`. `model_type` can specify either
1. One of the [pretrained model architectures](https://huggingface.co/transformers/pretrained_models.html). In this case the model will be downloaded and cached on disk before being loaded into memory.
2. The path to a folder with an existing model checkpoint.

This allows us to use the same syntax to load either pretrained or fine-tuned models/tokenizers.


In [11]:
# Load tokenizer
logger.info("Prepare tokenizer, pretrained model and optimizer.")
tokenizer_class = GPT2Tokenizer if "gpt2" in model_type else OpenAIGPTTokenizer
tokenizer = tokenizer_class.from_pretrained(model_type)
# Load model
model_class = GPT2LMHeadModel if "gpt2" in model_type else OpenAIGPTLMHeadModel
model = model_class.from_pretrained(model_type)
model.to(device)

2020-07-12 21:51:35,095 [INFO ] [__main__    ]: Prepare tokenizer, pretrained model and optimizer.
2020-07-12 21:51:35,104 [DEBUG] [urllib3.conn]: Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-12 21:51:35,621 [DEBUG] [urllib3.conn]: https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/openai-gpt-vocab.json HTTP/1.1" 200 0
2020-07-12 21:51:35,638 [DEBUG] [urllib3.conn]: Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-12 21:51:36,235 [DEBUG] [urllib3.conn]: https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/openai-gpt-merges.txt HTTP/1.1" 200 0
2020-07-12 21:51:36,242 [INFO ] [transformers]: loading file https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json from cache at /home/benjamin/.cache/torch/transformers/4ab93d0cd78ae80e746c27c9cd34e90b470abdabe0590c9ec742df61625ba310.b9628f6fe5519626534b82ce7ec72b22ce0ae79550325f45c604a25c0ad87fd6
2020-07-12 21:51:36,246 [INFO ] [transformers]: loading file https://s3.ama

OpenAIGPTLMHeadModel(
  (transformer): OpenAIGPTModel(
    (tokens_embed): Embedding(40478, 768)
    (positions_embed): Embedding(512, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (1): Block(
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_aff

### Add special tokens
Let's see how `<speaker1>` and `<speaker2>` tokens will be tokenized by our current tokenizer:

In [12]:
tokenizer.tokenize('<speaker1>')

['<</w>', 'speaker', '1</w>', '></w>']

As you can see, the tokenizer generates a total of 4 tokens: `['<', 'speaker', '1', '>']`. This doesn't make much sense for us, since the tags should not carry any meaning but should simply indicate who is currently speaking. Luckily there is an easy way to add our speaker tokens to the vocabulary of the tokenizer.

In [13]:
ATTR_TO_SPECIAL_TOKEN = {'additional_special_tokens': ('<speaker1>', '<speaker2>')}

# Add special tokens if they are not already added
def add_special_tokens_(model, tokenizer):
    """ Add special tokens to the tokenizer and the model if they have not already been added. """
    orig_num_tokens = len(tokenizer.encoder)
    num_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)  # doesn't add if they are already there
    if num_added_tokens > 0:
        model.resize_token_embeddings(new_num_tokens=orig_num_tokens + num_added_tokens)

add_special_tokens_(model, tokenizer)

2020-07-12 21:51:41,641 [INFO ] [transformers]: Adding <speaker1> to the vocabulary
2020-07-12 21:51:41,643 [INFO ] [transformers]: Adding <speaker2> to the vocabulary
2020-07-12 21:51:41,643 [INFO ] [transformers]: Assigning ('<speaker1>', '<speaker2>') to the additional_special_tokens key of the tokenizer


Now the tokenizer should generate a single token:


In [14]:
tokenizer.tokenize('<speaker1>')

['<speaker1>']
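
To illustrate the idea behind special tokens, here is a toy sketch in which registered tags are split off as whole units before the remaining text is tokenized. This is not how the transformers tokenizer is implemented: the function `toy_tokenize` is hypothetical and uses whitespace splitting in place of BPE.

```python
import re


def toy_tokenize(text, special_tokens=()):
    """Split `text` on whitespace, but keep special tokens as single units."""
    if not special_tokens:
        return text.split()
    # The capture group makes re.split keep the special tokens in its result
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)
        else:
            tokens.extend(piece.split())
    return tokens


tags = ("<speaker1>", "<speaker2>")
print(toy_tokenize("<speaker1>Hi there", special_tokens=tags))
```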

# Final setup before training
We need to set up a few things before we can start training:
* Prepare the data loaders (discussed above)
* An optimizer (we will use AdamW, a variant of Adam with decoupled weight decay)
* A scheduler to change the learning rate throughout training (we will use a [linear schedule with warmup](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#transformers.get_linear_schedule_with_warmup))
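
The shape of such a schedule can be sketched in plain Python. This is written for illustration rather than taken from the library; the function name `linear_schedule_factor` and the example step counts are made up.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given update step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
for step in (0, 5, 10, 55, 100):
    factor = linear_schedule_factor(step, num_warmup_steps=10, num_training_steps=100)
    print(step, base_lr * factor)
```

With `warmup_steps = 0`, as in the constants above, the warmup branch is never taken and the learning rate simply decays linearly from `lr` to 0.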

In [15]:
# Get data loaders
logger.info("Prepare datasets")
data_loader = get_data_loader(tokenizer)
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=lr, eps=adam_epsilon)
t_total = len(data_loader) // gradient_accumulation_steps * n_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)

2020-07-12 21:51:42,259 [INFO ] [__main__    ]: Prepare datasets
2020-07-12 21:51:42,260 [INFO ] [/home/benjam]: Reading cached input file from cached_input_task2.txt...
2020-07-12 21:51:42,262 [INFO ] [__main__    ]: Tokenizing and building input...
2020-07-12 21:51:42,285 [INFO ] [__main__    ]: Train dataset: 2 samples
2020-07-12 21:51:42,286 [INFO ] [__main__    ]: Build dataloaders


In [16]:
print(f"Your training time is approx. {len(data_loader)*n_epochs/((8/train_batch_size)*60):.0f} min")

Your training time is approx. 0 min


### Training

Finally we can start training!

In [17]:
logger.info("***** Running training *****")
global_step = 0
epochs_trained = 0
steps_trained_in_current_epoch = 0
# Check if we are training from a checkpoint or from a pretrained model
if os.path.exists(model_type):
    # set global_step to global_step of last saved checkpoint from model path
    global_step = int(model_type.split("-")[-1].split("/")[0])
    epochs_trained = global_step // (len(data_loader) // gradient_accumulation_steps)
    steps_trained_in_current_epoch = global_step % (len(data_loader) // gradient_accumulation_steps)
    logger.info("Continuing training from checkpoint, will skip to saved global_step")
    logger.info(f"Continuing training from epoch {epochs_trained}")
    logger.info(f"Continuing training from global step {global_step}")
    logger.info(f"Will skip the first {steps_trained_in_current_epoch} steps in the first epoch")

# Training loop
model.zero_grad()
epoch_pbar = trange(epochs_trained, int(n_epochs)) # epoch progress bar
av_loss = 0
for current_epoch in epoch_pbar:
    epoch_pbar.set_description(f"Epoch [{current_epoch+1}/{n_epochs}]") # description of epoch progress bar
    pbar = tqdm(data_loader, position=0) # progress bar
    for step, batch in enumerate(pbar):
        # Skip past any already trained steps if resuming training
        if steps_trained_in_current_epoch > 0:
            steps_trained_in_current_epoch -= 1
            continue
        model.train()
        # the language model targets (labels) are the same as the input!
        inputs, labels = (batch, batch)
        inputs = inputs.to(device)
        labels = labels.to(device)
        loss = model(inputs, labels=labels)[0]  # the first output is the language modeling loss
        loss.backward()
        tr_loss = loss.item()
        # Compute a running average of the loss
        av_loss = (step*av_loss + tr_loss)/(step + 1)
        pbar.set_description(f"Average loss: {av_loss:.4f}")
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()  # Update learning rate schedule
            model.zero_grad()
            global_step += 1
            if global_step % save_every == 0 and global_step > 0:
                checkpoint_prefix = "checkpoint"
                output_dir = os.path.join('runs', run_name, "{}-{}".format(checkpoint_prefix, global_step))
                if not os.path.exists(output_dir):
                    os.makedirs(output_dir)
                logger.info(f"Saving model checkpoint to {output_dir}")
                model.save_pretrained(output_dir)
                tokenizer.save_pretrained(output_dir)
                logger.info(f"Saving optimizer and scheduler states to {output_dir}")
                torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))

# save model
output_dir = os.path.join('runs', run_name)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
logger.info(f"Saving model checkpoint to {output_dir}")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

2020-07-12 21:51:42,321 [INFO ] [__main__    ]: ***** Running training *****
Average loss: 4.2099: 100%|██████████| 1/1 [00:03<00:00,  3.81s/it]
Average loss: 4.1576: 100%|██████████| 1/1 [00:03<00:00,  3.92s/it]
Epoch [2/2]: 100%|██████████| 2/2 [00:07<00:00,  3.87s/it]
2020-07-12 21:51:50,063 [INFO ] [__main__    ]: Saving model checkpoint to runs/run1
2020-07-12 21:51:50,065 [INFO ] [transformers]: Configuration saved in runs/run1/config.json
2020-07-12 21:51:50,618 [INFO ] [transformers]: Model weights saved in runs/run1/pytorch_model.bin


('runs/run1/vocab.json',
 'runs/run1/merges.txt',
 'runs/run1/special_tokens_map.json',
 'runs/run1/added_tokens.json')

# Tweaking parameters

You can change the training parameters to see how they affect the language model: adjust them below, then run the training cells above again.

In [18]:
run_name = 'run1'               # The name of the run (subdirectory in ./runs)
model_type = 'openai-gpt'       # Initialize model from path to checkpoint or with model name ("openai-gpt" or "gpt2")
weight_decay = 0                # Weight decay if we apply some.
train_batch_size = 4            # Batch size for training
gradient_accumulation_steps = 8 # Accumulate gradients on several steps
lr = 5e-5                       # Learning rate
n_epochs = 1                    # Number of training epochs
warmup_steps = 0                # Linear warmup over warmup_steps.

# Interact with the model
The trained model can now be found under `./runs/{run_name}/`.

Let's see what happens when we feed in some text to our model:

In [19]:
input_ids = torch.tensor(tokenizer.encode("Hello world", add_special_tokens=True), device=device).unsqueeze(0)
print('Input IDs:')
print(input_ids)
out = model(input_ids)[0]  # the first output contains the logits
print('Model output:')
print(out)
print('Output shape:')
print(out.shape) # output shape: (batch size x sequence length x vocabulary size)
 

Input IDs:
tensor([[3570, 1276]])
Model output:
tensor([[[-6.2202e+00, -4.9445e+00, -1.6549e+01,  ...,  8.3677e-01,
           1.5425e-02,  2.1317e-01],
         [-6.6601e+00, -4.8935e+00, -1.8000e+01,  ...,  8.2694e-01,
          -1.0044e-01,  2.3388e-01]]], grad_fn=<UnsafeViewBackward>)
Output shape:
torch.Size([1, 2, 40480])


As you can see, the model outputs one vector of scores (logits) per input token, with one score for every entry in the vocabulary. Taking the index of the highest score at the last position gives us the most likely next token:

In [20]:
tokenizer.decode([torch.argmax(out[:, 1, :])])

'.'

## Start chatting
As we have seen in task 1, we have several hyperparameters to choose from in order to control how we sample from the output probability distribution. Always choosing the most likely token will not lead to interesting conversations.

Below you see how the different sampling strategies are implemented. By default we will use top-p sampling.
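
One of these hyperparameters is the softmax temperature. The following plain-Python sketch (the function name and the logits are made up for illustration) shows how dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```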

In [22]:
# Constants
max_history = 2                  # Number of previous utterances to keep in history
no_sample = False                # Set to use greedy decoding instead of sampling
max_length = 80                  # Maximum length of the output utterances
temperature = 1.0                # Sampling softmax temperature
top_k = 0                        # Filter top-k tokens before sampling (<=0: no filtering)
top_p = 0.8                      # Nucleus filtering (top-p) before sampling (<=0.0: no filtering)
no_info = False                   # Only show conversation output

In [23]:
def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
        Args:
            logits: logits distribution shape (batch size x vocabulary size)
            top_k > 0: keep only top k tokens with highest probability (top-k filtering).
            top_p > 0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
    """
    top_k = min(top_k, logits.size(-1))  # Safety check
    if top_k > 0:
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        # scatter sorted tensors to original indexing
        indices_to_remove = sorted_indices_to_remove.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_remove)
        logits[indices_to_remove] = filter_value
    return logits

def sample_sequence(conversation, model, num_samples=1):
    """Generate next tokens from previous conversation"""
    context = torch.tensor(conversation, dtype=torch.long, device=device)
    context = context.unsqueeze(0).repeat(num_samples, 1)
    generated = context
    with torch.no_grad():
        for _ in range(max_length):
            inputs = {'input_ids': generated}
            outputs = model(**inputs)
            # scale by temperature
            next_token_logits = outputs[0][:, -1, :] / (temperature if temperature > 0 else 1.) 
            # filter by top-k/top-p
            filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
            if temperature == 0: # greedy decoding
                next_token = torch.argmax(filtered_logits, dim=-1).unsqueeze(-1)
            else:
                next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
            generated = torch.cat((generated, next_token), dim=1)
    return generated
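
To see what nucleus (top-p) filtering does on a concrete distribution, here is a small plain-Python sketch. The helper `nucleus_filter` and the probabilities are made up; it works on probabilities in a list instead of logits in a tensor, but follows the same rule of keeping the token that crosses the threshold.

```python
def nucleus_filter(probs, top_p):
    """Return the indices kept by top-p filtering of a probability list.

    Tokens are visited from most to least likely. We keep the shortest
    prefix whose cumulative probability exceeds `top_p`, including the
    token that crosses the threshold.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative > top_p:
            break
    return kept


probs = [0.15, 0.5, 0.04, 0.25, 0.06]  # made-up next-token distribution
print(nucleus_filter(probs, top_p=0.8))
```

Sampling then only happens among the kept tokens, which removes the long tail of unlikely tokens while still allowing some variety.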

We are ready to interact with the model. A few things to note:
* We will give it a "trigger" to start the conversation. From there the model takes over. Note that the model is "playing" speaker1 and you are speaker2.
* If the `no_info` flag is set to `False`, the output shows both the input (conversation history) and the full output of the model. At the very end, the answer that was selected from the model output is shown.
* You can press `h` (and then ENTER) in order to see the whole history of the chat
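
The model usually keeps generating past its own turn (including further speaker tags), so only the part before the first speaker tag is used as the answer. A minimal sketch of this selection step, with a hypothetical helper `cut_at_first_tag` and made-up token IDs:

```python
def cut_at_first_tag(token_ids, tag_ids):
    """Return the tokens before the first occurrence of any ID in `tag_ids`.

    If no tag occurs, the full list is returned.
    """
    for i, token_id in enumerate(token_ids):
        if token_id in tag_ids:
            return token_ids[:i]
    return list(token_ids)


# Made-up IDs: 40478 and 40479 stand in for the two speaker tags
generated = [12, 7, 99, 40479, 5, 40478, 3]
print(cut_at_first_tag(generated, tag_ids={40478, 40479}))
```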


Enjoy! :)

In [None]:
history = []
speaker1_tag = '<speaker1>'
speaker2_tag = '<speaker2>'
speaker1_tag_id = tokenizer.convert_tokens_to_ids(speaker1_tag)
speaker2_tag_id = tokenizer.convert_tokens_to_ids(speaker2_tag)
history = f"""
{speaker2_tag} Hi!
{speaker1_tag} Hello
{speaker2_tag} Are you ready?
{speaker1_tag} Yes!
{speaker2_tag} Ok let's start chatting
{speaker1_tag} Sure, what do you want to talk about?"""
print(history)
print('\n[Chat with the model! Send "h" to see the full history]\n')
history = history.split('\n')
while True: 
    message = None
    while not message:
        message = input(f'{speaker2_tag} ')
        if message == 'h':
            print('\n'.join(history))
            message = None
    # add new message to history
    history.append(f'{speaker2_tag} {message}')
    # keep only most recent conversation as input to the model
    recent_history = history[-(2*max_history):]
    # concatenate history into single string and add the speaker1 tag as trigger for the model
    history_str = '{}\n{}'.format('\n'.join(recent_history), speaker1_tag)
    # tokenize text and convert into vocabulary ids (input ids)
    history_enc = tokenizer.encode(history_str, add_special_tokens=True)
    with torch.no_grad():
        out_ids = sample_sequence(history_enc, model)
    out_ids = out_ids[:, len(history_enc):].tolist()[0]
    if not no_info:
        print(20*'-')
        print('Output of model:')
        full_output = tokenizer.decode(out_ids, clean_up_tokenization_spaces=True)
        print(full_output)
        print('\nInput to the model:')
        print(history_str)
        print(20*'-' + '\n')
    # Select part before the first speaker tag as answer (or the full output if no tag was generated)
    end = len(out_ids)
    for i, out_id in enumerate(out_ids):
        if out_id in [speaker1_tag_id, speaker2_tag_id]:
            end = i
            break
    answer = '{} {}'.format(speaker1_tag, tokenizer.decode(out_ids[:end]))
    print(answer)
    # add answer to history
    history.append(answer)



<speaker2> Hi!
<speaker1> Hello
<speaker2> Are you ready?
<speaker1> Yes!
<speaker2> Ok let's start chatting
<speaker1> Sure, what do you want to talk about?

[Chat with the model! Send "h" to see the full history]

<speaker2> test sa mere
--------------------
Output of model:
a process, and then in paris, then maybe a huge saucer, it is all i love, but that you can <speaker2> you've both finished all to <speaker2> k -'
 is a 
 can't see a million in my bra, <speaker2>'let's be real pies, it's not much! 
'i'm nothing.'the blue bottle, and the music with'good ',

Input to the model:
<speaker1> Yes!
<speaker2> Ok let's start chatting
<speaker1> Sure, what do you want to talk about?
<speaker2> test sa mere
<speaker1>
--------------------

<speaker1> a process, and then in paris, then maybe a huge saucer, it is all i love, but that you can
<speaker2> eat my knee
--------------------
Output of model:
that i am in - " 
 " i'm going to change my name with a tree? 
 " the driver, " she joked,

You can now review this whole conversation:

In [None]:
history

### Save model to Google Drive
If you are happy with your model, consider saving it to your Google Drive. All data in this notebook will be lost after a certain time of inactivity. The model is quite big (~500MB), so make sure you have enough space in your Google Drive.

This will save only your final model state (from your `run_name` directory).


In [0]:
import shutil

#@title Save to Google Drive
save_to_drive = False #@param {type:"boolean"}

source_directory = f'./runs/{run_name}'
target_directory = f"/content/drive/My Drive/AMLD/models/task2/{run_name}/"
include_checkpoints = False

if save_to_drive:
  logger.info(f'Copying from {source_directory} to {target_directory}...')
  ignore_pattern = None
  if not include_checkpoints:
    ignore_pattern = shutil.ignore_patterns('checkpoint-*')
  shutil.copytree(source_directory, target_directory, ignore=ignore_pattern)
  logger.info('Successfully copied your model!')  