# Meet your Artificial Self - AMLD 2020 Workshop
## Task 3
In task 2 we used somewhat of a "hack" to get our model to learn conversations by simply feeding raw text. In this task we will make a few minor adjustments to our method which will potentially have a big impact on our model's performance:
* Multi-task learning
* Specifying token types for both speakers
* Improved data pre-processing

## Important resources
* [Workshop Github repo](https://github.com/mar-muel/artificial-self-AMLD-2020/tree/master/3)
* [PyTorch documentation](https://pytorch.org/docs/stable/index.html)
* Huggingface transformers library [ [Github](https://github.com/huggingface/transformers) | [Docs](https://huggingface.co/transformers/) ]
* [Blog post by Thomas Wolf on this approach](https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313)


## Approach
This task is heavily influenced by [this blog post](https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313), which describes how to build dialog models on the PersonaChat dataset. That approach won the [ConvAI2 challenge](http://convai.io/) in 2018.

The main difference in our approach is that we won't be training different personalities. The agent's knowledge base will therefore consist only of the recent conversation history (and not of any personality description).

# Setting things up
The following cells will clone the repository, install all the necessary dependencies, and mount your Google Drive.


In [0]:
!nvidia-smi | grep -q 'failed' && echo "STOP! You are using a runtime without a GPU. Change the runtime type before going further!"

In [0]:
!git clone https://github.com/mar-muel/artificial-self-AMLD-2020.git

In [0]:
# Set working directory
%cd /content/artificial-self-AMLD-2020/3

In [0]:
# Install all dependencies for this task
!pip install -r requirements-colab.txt

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Import training data

This process is the same as in [Task 2](https://colab.research.google.com/drive/1iHcQ8_K0cfRE3v8QX6FMKAzdSSGtf5IX#scrollTo=pRYuNd85O5cl). Follow the "Import training data" section there if you missed it.

## Set data path

In [0]:
import os

# Set the correct path here
data_path = "/content/drive/My Drive/AMLD/chatistics_data/chatistics_export_2020-01-16_13-46-06.json" #@param {type:"string"}
assert os.path.isfile(data_path)

# Prepare the data
Again we are going to start from this chat format: 


| timestamp | conversationId | conversationWithName | senderName | outgoing | text | language | platform |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| 1575463019 | 693342290 | Alice | Bob | True | Hi Alice! | en | whatsapp |
| 1575463025 | 693342290 | Alice | Alice | False | Hi Bob! How are you these days? | en | whatsapp |
| 1575463030 | 693342290 | Alice | Bob | True | Great! Thanks | en | whatsapp |
| 1575574212 | 693342290 | Alice | Alice | False | Hello Bob! Haven't heard from you in a while!| en | whatsapp |

As you can see from this example, the last message was sent more than a day after the previous one (about 31 hours later) and is unrelated to the earlier conversation.
 
This time we will improve the pre-processing and group the data into multiple conversations (by taking into account the timestamp). In order to do so, we will define an arbitrary cut-off of 24h for grouping the messages into conversations. Furthermore, we will only consider conversations which consist of at least 10 interactions. If you are curious you can check out the code for this in the function `get_grouped_conversation_data()` in `utils.py`.

The output of this function is a JSON file with the following structure:
```
{
  "Alice": [
    [
      {"messages": ["Hi Alice!"], "sender": "Bob", "senderType": "person1"},
      {"messages": ["Hi Bob! How are you these days?"], "sender": "Alice", "senderType": "person2"},
      {"messages": ["Great! Thanks"], "sender": "Bob", "senderType": "person1"}
    ], [
      {"messages": ["Hello Bob! Haven't heard from you in a while!"], "sender": "Alice", "senderType": "person2"},
    ...
    ]
  ]
}
```
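
The grouping step can be sketched in a few lines. This is only a minimal illustration and not the code from `utils.py`: the function name `group_by_time_gap` is made up, and we assume that the 24h cut-off refers to the gap between two consecutive messages and that an "interaction" is a single message.

```python
DAY_SECONDS = 24 * 60 * 60

def group_by_time_gap(messages, max_gap=DAY_SECONDS, min_length=10):
    """Split (timestamp, sender, text) tuples into conversations.

    Assumption: a new conversation starts whenever two consecutive
    messages are more than `max_gap` seconds apart.
    """
    conversations, current = [], []
    last_timestamp = None
    for timestamp, sender, text in sorted(messages):
        if last_timestamp is not None and timestamp - last_timestamp > max_gap:
            conversations.append(current)
            current = []
        current.append((timestamp, sender, text))
        last_timestamp = timestamp
    if current:
        conversations.append(current)
    # Drop conversations that are too short to be useful for training
    return [c for c in conversations if len(c) >= min_length]

# The four messages from the table above
example_messages = [
    (1575463019, "Bob", "Hi Alice!"),
    (1575463025, "Alice", "Hi Bob! How are you these days?"),
    (1575463030, "Bob", "Great! Thanks"),
    (1575574212, "Alice", "Hello Bob! Haven't heard from you in a while!"),
]
# min_length=1 so that this tiny example is not filtered out
example_conversations = group_by_time_gap(example_messages, min_length=1)
print([len(c) for c in example_conversations])
```

With the default `min_length=10` both toy conversations would be discarded, which is why the example lowers the threshold.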

# Multi-task learning

In recent years many papers have shown that forcing a model to learn multiple objectives at once can greatly improve performance on downstream tasks such as Question Answering, Sentiment Analysis, etc. [This blog post by Sebastian Ruder](https://ruder.io/multi-task/index.html) gives a good introduction to the topic. 

For our model we will use two tasks to train our transformer model:
* **Language model objective**: As in the previous tasks, we train on next-token prediction (language modelling). But instead of training on all of the input text, we only train on the reply of `speaker1` given a conversation history with `speaker2` (we will call this the *gold reply*). In other words: we project the hidden state onto the word embedding matrix to get logits and apply a cross-entropy loss to the portion of the target corresponding to the gold reply. This gives us the language modelling loss.
* **Multiple choice classification**: In addition, we generate examples in which the gold reply is replaced by a random previous reply of `speaker1` that has nothing to do with the current conversation (we call these replies *distractors*). The model is tasked with recognizing whether the reply is the gold reply (predict `True`) or a distractor (predict `False`) as soon as it reaches the last token of the reply, i.e. the `eos` (end-of-sequence) token. The hidden state of the final transformer layer at that token is passed through a linear layer to get binary classification logits. Calculating the cross-entropy on these logits gives us our classification loss.

The total loss is then calculated as a weighted sum of both losses ($\text{loss} = w_{LM} * \text{loss}_{LM} + w_{MC} * \text{loss}_{MC}$) with tunable hyperparameters $w_{LM}$ and $w_{MC}$. We then backpropagate the total loss as usual to fine-tune the transformer.
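
To make the two losses concrete, here is a small worked example in plain Python. It is a sketch under a few assumptions: labels are already aligned with the predictions (the shift by one position for next-token prediction is left out), masked positions are marked with a hypothetical ignore value of `-100`, and the classification head produces one score per candidate which is normalized with a softmax over the candidates. All numbers are made up.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def masked_lm_loss(token_logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions that are not masked out."""
    losses = [
        cross_entropy(logits, label)
        for logits, label in zip(token_logits, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses)

# Toy setup: vocabulary of 3 tokens, sequence of 4 positions.
# The first two positions are history (masked), the last two are the gold reply.
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
lm_labels = [-100, -100, 2, 0]

# One classification score per candidate; candidate 0 is the gold reply here
mc_logits = [1.5, -0.5]
mc_label = 0

w_lm, w_mc = 1.0, 1.0
loss_lm = masked_lm_loss(token_logits, lm_labels)
loss_mc = cross_entropy(mc_logits, mc_label)
loss = w_lm * loss_lm + w_mc * loss_mc
print(f"LM loss: {loss_lm:.4f}, MC loss: {loss_mc:.4f}, total: {loss:.4f}")
```

Only the two gold reply positions contribute to the language modelling loss; the history positions are ignored entirely.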

## Token types
Before we jump into the implementation of multi-task losses, let's quickly cover token types. So far our model encodes the token embeddings and adds positional encoding (for an explanation of how input is encoded by default, check out [this amazing blog post by Jay Alammar](http://jalammar.github.io/illustrated-gpt2/)). 

However, our model cannot currently differentiate between text that belongs to speaker 1 and text that belongs to speaker 2. Maybe, with some luck, our previous model has learnt to associate the proximity of the `<speaker1>` tag with the text that followed it, but there's a much cleaner way to achieve this: token types!

![Input encoding](https://github.com/mar-muel/artificial-self-AMLD-2020/blob/master/static/task_3_input_encoding.png?raw=true)

As in task 2, we will extend our tokenizer with the `<speaker1>` and `<speaker2>` tags. We will use these two tokens to build a vector of `token_type_ids` which specifies which portions of the input vector belong to which speaker and pass it to the model. The model will build an input representation by adding token embeddings, positional encodings, and token types. This will then serve as a single input to the transformer.
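
As a rough illustration of what such a `token_type_ids` vector looks like, the sketch below labels every token of a turn with the tag of the speaker who wrote it. It works on string tokens for readability (the real input uses vocabulary IDs), and `build_token_types` is a hypothetical helper, not a function from `utils.py`.

```python
def build_token_types(turns):
    """turns: list of (speaker_tag, tokens) pairs in chronological order.

    Returns the flat token sequence and a parallel list that holds, for each
    token, the tag of the speaker it belongs to.
    """
    input_tokens, token_types = [], []
    for speaker_tag, tokens in turns:
        segment = [speaker_tag] + tokens  # every turn starts with its speaker tag
        input_tokens.extend(segment)
        token_types.extend([speaker_tag] * len(segment))
    return input_tokens, token_types

example_turns = [
    ("<speaker2>", ["hi", "bob", "!"]),
    ("<speaker1>", ["great", "!"]),
]
example_tokens, example_types = build_token_types(example_turns)
for token, token_type in zip(example_tokens, example_types):
    print(f"{token:12s} {token_type}")
```

Both lists always have the same length, so the model can look up one token type embedding per input position.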

## Double head models
Multi-task learning is now a common feature of transformer models (either for pretraining, as in BERT, or as a way to tackle certain problems in NLP). The `transformers` library therefore provides us with so-called Double Heads models, which implement an LM head as well as a multiple choice classification head (check the [docs here](https://huggingface.co/transformers/model_doc/gpt2.html#transformers.GPT2DoubleHeadsModel)).

The models have the following syntax:
```python
from transformers import GPT2DoubleHeadsModel

model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
(lm_loss), (mc_loss), *_ = model(
  input_ids,
  token_type_ids=token_type_ids,
  mc_token_ids=mc_token_ids,
  mc_labels=mc_labels,
  lm_labels=lm_labels)
```
The model returns `lm_loss`, which is the language modelling loss, and `mc_loss`, which is the multiple choice loss. The model's input arguments are:
* `input_ids`: Input vocabulary IDs of the full input sequence
* `token_type_ids`: See explanation above.
* `mc_token_ids`: The position of the token at which classification should be triggered (in our case the very last input token, i.e. the `eos` token)
* `mc_labels`: Classification labels indicating which candidate is the gold reply
* `lm_labels`: Labels for language modelling. So far we had `input_ids = lm_labels`; this time we will only train the language model on the gold reply.

![Double Head model](https://github.com/mar-muel/artificial-self-AMLD-2020/blob/master/static/task_3_double_head_model.png?raw=true)

As you can see, a single input consists of the gold reply as well as a distractor (therefore the dimension of `input_ids` is `[2 x input_size]`). Token type, position and token embeddings are added together and passed through the transformer. The last hidden layer (at the positions corresponding to the gold reply) is the basis for the language model (LM) head. The hidden layer representation of the last token (`eos` token) serves as the basis for the multiple choice (MC) classification head.

# Generating training examples

From our current structure of conversation data:
```
{
  "Alice": [
    [
      {"messages": ["Hi Alice!"], "sender": "Bob", "senderType": "person1"},
      {"messages": ["Hi Bob! How are you these days?"], "sender": "Alice", "senderType": "person2"},
      {"messages": ["Great! Thanks"], "sender": "Bob", "senderType": "person1"}
    ], [
      {"messages": ["Hello Bob! Haven't heard from you in a while!"], "sender": "Alice", "senderType": "person2"},
    ...
    ]
  ]
}
```

the `get_input_task3()` method in `utils.py` will generate the following tensors for each training example:
* `input_ids`, dimension: `[2 x input_size]`
* `lm_labels`, dimension: `[2 x input_size]`
* `token_type_ids`, dimension: `[2 x input_size]`
* `mc_token_ids`, dimension: `[2]`
* `mc_labels`, dimension: `[1]`

(note that all dimensions will also have an additional `batch_size` dimension during training)

Feel free to check out the code in `utils.py`. The code works as follows:
1. Read grouped conversation data (see explanation above)
2. Build the pool of distractor messages for `person1` (it consists of all replies given by `person1`)
3. Iterate through all messages and compile conversation histories of size `2*max_history + 1` (by default `max_history=2`)
```
<person1> <person2> <person1> <person2> <candidate>
```
`<candidate>` will be either the gold reply or a distractor
4. Tokenize and convert text to vocabulary IDs
5. Build input tensors for both a random distractor message as well as the gold reply
6. If the full length of either sequence exceeds `max_input_length`, discard the sample; otherwise pad all tensors on the right until they reach `max_input_length`.
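
Steps 5 and 6 can be sketched as follows. This is a simplified illustration with plain lists instead of tensors, not the implementation in `utils.py`: `build_example` is a made-up name, `token_type_ids` are left out, the gold reply is always placed last, and the padding ID `0` and ignore label `-100` are assumptions.

```python
PAD_ID = 0     # hypothetical padding token ID
IGNORE = -100  # hypothetical label value that the LM loss skips

def build_example(history_ids, gold_ids, distractor_ids, eos_id, max_input_length):
    """Build padded inputs for the candidates [distractor, gold].

    Returns None if one of the two sequences is longer than max_input_length.
    """
    candidates = [distractor_ids, gold_ids]
    gold_index = len(candidates) - 1
    input_ids, lm_labels, mc_token_ids = [], [], []
    for index, reply in enumerate(candidates):
        sequence = history_ids + reply + [eos_id]
        if len(sequence) > max_input_length:
            return None  # discard the whole sample
        if index == gold_index:
            # Only the gold reply (including eos) is a language modelling target
            labels = [IGNORE] * len(history_ids) + reply + [eos_id]
        else:
            labels = [IGNORE] * len(sequence)
        padding = max_input_length - len(sequence)
        input_ids.append(sequence + [PAD_ID] * padding)
        lm_labels.append(labels + [IGNORE] * padding)
        mc_token_ids.append(len(sequence) - 1)  # position of the eos token
    return {
        "input_ids": input_ids,        # [2 x input_size]
        "lm_labels": lm_labels,        # [2 x input_size]
        "mc_token_ids": mc_token_ids,  # [2]
        "mc_labels": gold_index,       # [1]
    }

example = build_example(
    history_ids=[5, 6, 7], gold_ids=[8, 9], distractor_ids=[10],
    eos_id=1, max_input_length=8)
for name, value in example.items():
    print(name, value)
```

Note that the distractor row contributes nothing to the language modelling loss; it is only used by the classification head.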

# Train the model
We can now finally start training the model. You will see that the training code is almost identical to task 2!

In [0]:
import os
import logging
from argparse import ArgumentParser
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm, trange
from transformers import (
    AdamW, 
    OpenAIGPTDoubleHeadsModel,
    OpenAIGPTTokenizer,
    GPT2DoubleHeadsModel,
    GPT2Tokenizer,
    get_linear_schedule_with_warmup,
)
from utils import get_input_task3, download_pretrained_model, set_seed

# set up logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)-5.5s] [%(name)-12.12s]: %(message)s')

## Constants

In [0]:
run_name = 'run1'               # The name of the run (subdirectory in ./runs)
model_type = 'openai-gpt'       # Initialize model from path to checkpoint or with model name ("openai-gpt" or "gpt2")
save_every = 100                 # Save checkpoint every n updates steps.
max_input_length = 200          # Number of tokens which will be fed into the model (reduce this number if you have memory constraints)
weight_decay = 0                # Weight decay if we apply some.
train_batch_size = 4            # Batch size for training
gradient_accumulation_steps = 8 # Accumulate gradients on several steps
lr = 6.25e-5                       # Learning rate
adam_epsilon = 1e-8             # Epsilon for Adam optimizer.
max_norm = 1                    # Clipping gradient norm
n_epochs = 3                    # Number of training epochs
device = 'cuda'                 # Device (cuda or cpu)
warmup_steps = 0                # Linear warmup over warmup_steps.
seed = 42                       # Random seed for initialization

# New for task 3!
num_candidates = 2              # Number of candidates for training
max_history = 2                 # Number of previous exchanges to keep in history
lm_coef = 1.0                   # LM loss coefficient
mc_coef = 1.0                   # Multiple-choice loss coefficient
use_huggingface_model = False   # Start fine-tuning from the pre-trained model by Huggingface (see explanation below)

## Data loading

We will use PyTorch's `TensorDataset` and use it to build a Data Loader.

In [0]:
def get_data_loader(tokenizer, use_cache=True):
    """ Prepare the dataset for training and evaluation """
    # get dataset of tensors
    data = get_input_task3(
        data_path, 
        tokenizer, 
        max_input_length=max_input_length,
        num_candidates=num_candidates,
        seed=seed,
        max_history=max_history,
        use_cache=use_cache)
    logger.info("Building training data loader")
    train_dataset = TensorDataset(*data)
    train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
    logger.info("Train dataset input shape: (Batch size, Candidates, Seq length): {}".format(train_dataset.tensors[0].shape))
    return train_loader


In [0]:
# Setting the same seed allows for some reproducibility of the experiments
set_seed(seed)

### Start from already trained conversational model
*Note: This step is more of a fallback in case you have very little training data or you want to start already with a very good model and take it from there.*

**If you want to use this option, set `use_huggingface_model = True` above!**

This model was fine-tuned on the PersonaChat corpus and is a SOTA model (or at least it was back in 2018). You can play with it [here](https://convai.huggingface.co/).

In [0]:
if use_huggingface_model:
  model_type = download_pretrained_model()

### Load model and tokenizer
As discussed, we will use the Double Heads model.

In [0]:
# Load tokenizer
logger.info("Prepare tokenizer, pretrained model and optimizer.")
tokenizer_class = GPT2Tokenizer if "gpt2" in model_type else OpenAIGPTTokenizer # can't use AutoTokenizer because the checkpoint could be a path
tokenizer = tokenizer_class.from_pretrained(model_type)
# Load model
model_class = GPT2DoubleHeadsModel if "gpt2" in model_type else OpenAIGPTDoubleHeadsModel
model = model_class.from_pretrained(model_type)
model.to(device)

### Add special tokens
As in task 2 we will add `<speaker1>` and `<speaker2>` to our list of additional tokens. Additionally, we will add
* `<bos>`: Beginning of sequence token
* `<eos>`: End of sequence token
* `<pad>`: Padding token

In [0]:
ATTR_TO_SPECIAL_TOKEN = {'bos_token': '<bos>', 'eos_token': '<eos>', 'pad_token': '<pad>',
                         'additional_special_tokens': ('<speaker1>', '<speaker2>')}
def add_special_tokens_(model, tokenizer):
    """ Add special tokens to the tokenizer and the model if they have not already been added. """
    orig_num_tokens = len(tokenizer.encoder)
    num_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN) # doesn't add if they are already there
    if num_added_tokens > 0:
        model.resize_token_embeddings(new_num_tokens=orig_num_tokens + num_added_tokens)
add_special_tokens_(model, tokenizer)

### Setup for training
As before we need:
* Our data loader (defined above)
* An optimizer
* A scheduler

In [0]:
# Get data loaders
logger.info("Prepare datasets")
data_loader = get_data_loader(tokenizer, use_cache=False)
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [ 
    {   
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },  
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]   
optimizer = AdamW(optimizer_grouped_parameters, lr=lr, eps=adam_epsilon)
t_total = len(data_loader) // gradient_accumulation_steps * n_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)   
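
To get a feeling for what the scheduler does, the sketch below computes the factor by which the base learning rate is multiplied at a given update step: a linear ramp-up during warmup followed by a linear decay to zero. It is a plain Python re-statement of the idea behind a linear schedule with warmup, not the library code, and the number of batches is a made-up value.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at update step `step`."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical numbers: 800 batches per epoch, 8 accumulation steps, 3 epochs
example_num_batches = 800
example_accumulation_steps = 8
example_epochs = 3
example_t_total = example_num_batches // example_accumulation_steps * example_epochs

base_lr = 6.25e-5
for step in (0, 150, 300):
    factor = linear_schedule_factor(step, 0, example_t_total)
    print(f"update step {step:3d}: lr = {base_lr * factor:.2e}")
```

Because gradients are accumulated over several batches, the scheduler is stepped once per optimizer update, which is why the total number of steps is divided by the number of accumulation steps.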


### Train

In [0]:
logger.info("***** Running training *****")
global_step = 0
epochs_trained = 0
steps_trained_in_current_epoch = 0
# Check if we are training from a checkpoint or from a pretrained model
if os.path.exists(model_type) and not use_huggingface_model:
    # set global_step to global_step of last saved checkpoint from model path
    global_step = int(model_type.split("-")[-1].split("/")[0])
    epochs_trained = global_step // (len(data_loader) // gradient_accumulation_steps)
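    # global_step counts optimizer updates, but the loop below skips individual batches, so convert updates to batches
    steps_trained_in_current_epoch = (global_step % (len(data_loader) // gradient_accumulation_steps)) * gradient_accumulation_steps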
    logger.info("Continuing training from checkpoint, will skip to saved global_step")
    logger.info(f"Continuing training from epoch {epochs_trained}")
    logger.info(f"Continuing training from global step {global_step}")
    logger.info(f"Will skip the first {steps_trained_in_current_epoch} steps in the first epoch")

# Training loop
model.zero_grad()
epoch_pbar = trange(epochs_trained, int(n_epochs)) # epoch progress bar
av_loss = 0
for current_epoch in epoch_pbar:
    epoch_pbar.set_description(f"Epoch [{current_epoch+1}/{n_epochs}]") # description of epoch progress bar
    pbar = tqdm(data_loader, position=0) # progress bar
    for step, batch in enumerate(pbar):
        # Skip past any already trained steps if resuming training
        if steps_trained_in_current_epoch > 0:
            steps_trained_in_current_epoch -= 1
            continue

        # compute loss
        model.train()
        batch = tuple(input_tensor.to(device) for input_tensor in batch)
        input_ids, mc_token_ids, lm_labels, mc_labels, token_type_ids = batch
        (lm_loss), (mc_loss), *_ = model(input_ids, token_type_ids=token_type_ids, mc_token_ids=mc_token_ids, mc_labels=mc_labels, lm_labels=lm_labels)
        loss = (lm_loss * lm_coef + mc_loss * mc_coef) / gradient_accumulation_steps
        loss.backward()
        tr_loss = loss.item()

        # Compute a running average of the loss
        av_loss = (step*av_loss + tr_loss)/(step + 1)
        pbar.set_description(f"Average loss: {av_loss:.4f}")
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()  # Update learning rate schedule
            model.zero_grad()
            global_step += 1
            if global_step % save_every == 0 and global_step > 0:
                checkpoint_prefix = "checkpoint"
                output_dir = os.path.join('runs', run_name, "{}-{}".format(checkpoint_prefix, global_step))
                if not os.path.exists(output_dir):
                    os.makedirs(output_dir)
                logger.info(f"Saving model checkpoint to {output_dir}")
                model.save_pretrained(output_dir)
                tokenizer.save_pretrained(output_dir)
                logger.info(f"Saving optimizer and scheduler states to {output_dir}")
                torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))

# save model
output_dir = os.path.join('runs', run_name)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
logger.info(f"Saving model checkpoint to {output_dir}")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Tweaking parameters

You can change the training parameters to see how they affect the language model: adjust them below, then run the cells above again.

In [0]:
run_name = 'run1'               # The name of the run (subdirectory in ./runs)
model_type = 'openai-gpt'       # Initialize model from path to checkpoint or with model name ("openai-gpt" or "gpt2")
weight_decay = 0                # Weight decay if we apply some.
train_batch_size = 4            # Batch size for training
gradient_accumulation_steps = 8 # Accumulate gradients on several steps
lr = 5e-5                       # Learning rate
n_epochs = 1                    # Number of training epochs
warmup_steps = 0                # Linear warmup over warmup_steps
lm_coef = 1.0                   # LM loss coefficient
mc_coef = 1.0                   # Multiple-choice loss coefficient

# Speak with the model
This code is again largely identical to the previous code (most explanations can be found there). The main difference is that parsing the output of the model is much cleaner now: The model always spits out a single reply for `person1`. The model finishes the reply by predicting the `eos` token at which point we stop sampling next tokens.

Similar to when generating training examples we will use `build_input_from_segments()` to generate the input for the next token prediction task (with the only difference of not adding an `eos` token).

In [0]:
import itertools
import warnings  # used in sample_sequence() below
import torch.nn.functional as F
from utils import build_input_from_segments

### Constants

In [0]:
# Constants
max_history = 4                  # Number of previous utterances to keep in history
no_sample = False                # Set to use greedy decoding instead of sampling
max_length = 80                  # Maximum length of the output utterances
min_length = 1                   # Minimum length of the output utterances
temperature = 1                # Sampling softmax temperature
top_k = 0                        # Filter top-k tokens before sampling (<=0: no filtering)
top_p = .8                      # Nucleus filtering (top-p) before sampling (<=0.0: no filtering)

In [0]:
SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]

def top_filtering(logits, top_k=0., top_p=0.9, threshold=-float('Inf'), filter_value=-float('Inf')):
    """ Filter a distribution of logits using top-k, top-p (nucleus) and/or threshold filtering
        Args:
            logits: logits distribution shape (vocabulary size)
            top_k: <=0: no filtering, >0: keep only top k tokens with highest probability.
            top_p: <=0.0: no filtering, >0.0: keep only a subset S of candidates, where S is the smallest subset
                whose total probability mass is greater than or equal to the threshold top_p.
                In practice, we select the highest probability tokens whose cumulative probability mass exceeds
                the threshold top_p.
            threshold: a minimal threshold to keep logits
    """
    assert logits.dim() == 1  # Only work for batch size 1 for now - could update but it would obfuscate a bit the code
    top_k = min(top_k, logits.size(-1))
    if top_k > 0:
        # Remove all tokens with a probability less than the last token in the top-k tokens
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value
    if top_p > 0.0:
        # Compute cumulative probabilities of sorted tokens
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probabilities = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probabilities > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0 
        # Back to unsorted indices and set them to -infinity
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value
    indices_to_remove = logits < threshold
    logits[indices_to_remove] = filter_value
    return logits

def sample_sequence(history, tokenizer, model, current_output=None):
    special_tokens_ids = tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS)
    if current_output is None:
        current_output = []
    for i in range(max_length):
        instance = build_input_from_segments(history, current_output, tokenizer, with_eos=False)
        input_ids = torch.tensor(instance["input_ids"], device=device).unsqueeze(0)
        token_type_ids = torch.tensor(instance["token_type_ids"], device=device).unsqueeze(0)
        logits = model(input_ids, token_type_ids=token_type_ids)
        if isinstance(logits, tuple):  # for gpt2 and maybe others
            logits = logits[0]
        logits = logits[0, -1, :] / temperature
        logits = top_filtering(logits, top_k=top_k, top_p=top_p)
        probs = F.softmax(logits, dim=-1)
        prev = torch.topk(probs, 1)[1] if no_sample else torch.multinomial(probs, 1)
        if i < min_length and prev.item() in special_tokens_ids:
            while prev.item() in special_tokens_ids:
                if probs.max().item() == 1:
                    warnings.warn("Warning: model generating special token with probability 1.")
                    break  # avoid infinitely looping over special token
                prev = torch.multinomial(probs, num_samples=1)
        if prev.item() in special_tokens_ids:
            break
        current_output.append(prev.item())
    return current_output
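
To see what nucleus (top-p) filtering keeps, here is a tiny worked example in plain Python. It mirrors the logic described in the docstring of `top_filtering()` but operates on probabilities instead of logits; `nucleus_keep` is a hypothetical helper and the distribution is made up.

```python
def nucleus_keep(probs, top_p):
    """Return the indices of the tokens that survive top-p filtering.

    Tokens are visited from most to least likely and kept until the
    cumulative probability exceeds top_p (the token that crosses the
    threshold is kept as well).
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative > top_p:
            break
    return sorted(kept)

example_probs = [0.5, 0.25, 0.125, 0.125]
example_kept = nucleus_keep(example_probs, top_p=0.8)
print(example_kept)
```

With `top_p=0.8` the cumulative mass after two tokens is 0.75, which is still below the threshold, so a third token is kept and only the last one is filtered out. Lowering `top_p` makes the replies more conservative, raising it makes them more diverse.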

In [0]:
history = []
while True:
    raw_text = input(">>> ")
    while not raw_text:
        print('Prompt should not be empty!')
        raw_text = input(">>> ")
    history.append(tokenizer.encode(raw_text))
    with torch.no_grad():
        out_ids = sample_sequence(history, tokenizer, model)
    history.append(out_ids)
    history = history[-(2*max_history+1):]
    out_text = tokenizer.decode(out_ids, skip_special_tokens=True)
    print(out_text)

### Save model to Google Drive
If you are happy with your model, consider saving it to your Google Drive. All data in this notebook's runtime will be lost after a certain time of inactivity. The model is also quite big (~500MB), so make sure you have enough space in your Google Drive.

This will save only your final model state (from your `run_name` directory).



In [0]:
import shutil

#@title Save to Google Drive
save_to_drive = True #@param {type:"boolean"}

source_directory = f'./runs/{run_name}'
target_directory = f"/content/drive/My Drive/AMLD/models/task3/{run_name}/"
include_checkpoints = False

if save_to_drive:
  logger.info(f'Copying from {source_directory} to {target_directory}...')
  ignore_pattern = None
  if not include_checkpoints:
    ignore_pattern = shutil.ignore_patterns('checkpoint-*')
  shutil.copytree(source_directory, target_directory, ignore=ignore_pattern)
  logger.info('Successfully copied your model!')  
  