# Applied Science Take-Home Exercise
This notebook is a boilerplate to help you get started. You are not obliged to submit everything in a notebook; feel free to customize it or to approach the task differently from how it is presented here.

In `semantic_parsing_dialog` there are three functions that should help you:
- `load_datasets()`, which loads local TSV files as Hugging Face datasets
- `postprocess_text()`, which cleans up predictions and labels before evaluation
- `evaluate_predictions()`, which produces various model evaluation metrics

### What we are looking for:
- Can you create a basic baseline model? Why have you chosen this implementation? 
- What steps have you taken? (Please make sure your thinking is well commented.)
- Try to experiment with methods to improve your basic baseline architecture
- How are you evaluating and comparing your training runs?
- Given more time, what would you do?
    - architecture strategies
    - relevant research papers

### Submission
- Check in your solution to a new branch and create a PR (`!git checkout -b 'submission'`)
- Please include your predictions in two files, **data/submission_mce.tsv** and **data/submission_ccf.tsv**. The former should contain the predictions of the model trained with cross-entropy loss, and the latter the predictions of the model trained with a custom cost function written by you. Make sure to compare the results of these two experiments.

# Set Up

The first cells are just preparing the environment and mounting the required volumes in Google Colab.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

In [None]:
import os
import sys
import torch


# The path must match the mount point used in drive.mount() above.
# sys.path.append('/content/drive/MyDrive/<Your_Folder>/semantic_parsing_dialog')

In [None]:
!pip install -r requirements.txt

In [None]:
# Use an experiment tracking tool of your choice; here we use Weights & Biases (wandb).
import wandb


wandb.login()

# Load the Dataset

In [None]:
# Check you are connected to a GPU-enabled instance.

print(torch.cuda.is_available())  # should be True if a GPU is expected

In [None]:
from semantic_parsing_dialog import ROOT 
from semantic_parsing_dialog import device

from semantic_parsing_dialog.utils import load_datasets, load_semantic_vocab
from semantic_parsing_dialog.vocab import get_intents, get_slots


datapath = os.path.join(ROOT, 'data')

dataset = load_datasets(datapath)
semantic_vocab = load_semantic_vocab(datapath)

# Task Data Analysis

Before jumping into the exercise, we'll look at some simple statistics of the task-oriented semantic vocabulary.

#### Slots and intents

A quick analysis to get a feel for how much the different intents vary in complexity.

Here we look at co-occurrences of intents and slots within the same query, regardless of whether a slot belongs to that intent's own semantics or to another intent in the composed query.

In [None]:
df = dataset['train'].to_pandas()

df['intent'] = df['representation'].map(get_intents)
df['slot'] = df['representation'].map(get_slots)

df['n_intent'] = df['intent'].map(len)
df['n_slot'] = df['slot'].map(len)
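
The co-occurrence counts described above could be tallied along the following lines. This is only a minimal sketch on hand-made rows, not the real data: the intent and slot names are invented for illustration, and the two lists stand in for the `intent` and `slot` columns built above.

```python
from collections import Counter
from itertools import product

# Toy stand-ins for df['intent'] and df['slot'] (labels are made up).
intents_per_query = [["GET_EVENT"], ["GET_DIRECTIONS", "GET_EVENT"], ["GET_EVENT"]]
slots_per_query = [["DATE_TIME"], ["DESTINATION", "DATE_TIME"], ["LOCATION", "DATE_TIME"]]


def count_cooccurrences(intents_per_query, slots_per_query):
    """Count how often each (intent, slot) pair appears in the same query."""
    counts = Counter()
    for intents, slots in zip(intents_per_query, slots_per_query):
        # set() so that a pair repeated inside one query is counted only once
        counts.update(product(set(intents), set(slots)))
    return counts


cooccurrence = count_cooccurrences(intents_per_query, slots_per_query)
for (intent, slot), n in cooccurrence.most_common():
    print(f"{intent:15s} {slot:12s} {n}")
```

With the real data, `df['intent']` and `df['slot']` could be passed in directly, assuming `get_intents` and `get_slots` return collections of hashable labels.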

In [None]:
# TODO: Inspect any other aspect of the dataset relevant to the task at hand

# Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
# TODO
model_name_or_path = '' # TODO: write name
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# TODO: 
max_source_length = ...  # TODO: set a value
max_target_length = ...  # TODO: set a value

model.config.max_length = max_target_length

In [None]:
from typing import List


def extend_vocabulary(tokenizer: AutoTokenizer, model: AutoModelForSeq2SeqLM, tokens: List[str]) -> None:
    """ Extends the model's original vocabulary to accommodate new tokens
    
    The new tokens are added at the end of the tokenizer's vocabulary.
    """
    ...
    # TODO

# Training 

#### Model training with cross-entropy loss

In [None]:
# TODO: add any other imports you need
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

from semantic_parsing_dialog import postprocess_text

In [None]:
def preprocess_function(sample):
    # TODO
    #############
    
    
    #############

    # Replace all tokenizer.pad_token_id in the labels by -100
    # when we want to ignore padding in the loss.
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label]
        for label in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
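
The padding-masking step at the end of `preprocess_function` can be seen in isolation in the sketch below. The token ids are hypothetical and `0` merely stands in for `tokenizer.pad_token_id`; the point is only that every padding position in the labels becomes `-100`, the value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq]
        for seq in label_ids
    ]


# Hypothetical token ids, with 0 standing in for tokenizer.pad_token_id.
batch = [[37, 12, 1, 0, 0], [5, 1, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
```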

In [None]:
# TODO 
tokenized_dataset = ...

In [None]:
# TODO 
data_collator = ...

In [None]:
# PUT YOUR CODE HERE
# you may want to alter the hyperparams
training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    predict_with_generate=True,
    fp16=False,
    learning_rate=5e-5,  # a common starting point for fine-tuning; tune as needed
    num_train_epochs=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=10,
    report_to="wandb"
)

trainer = Seq2SeqTrainer(  # required for predict_with_generate=True to take effect
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # compute_metrics=..,
    # callbacks=..
)

In [None]:
# log experiments
# depending on how you implement the model, logging may be done differently. 
with wandb.init(project="semantic_parsing_dialog") as run:
    trainer.train()

#### Model training with a custom loss function

Several intents in the training dataset have long tails, i.e. they co-occur with a large number of distinct slots, many of which are poorly represented in the training set. These cases could be challenging for the model to learn. Incorporate a custom loss function into the training pipeline that takes this data imbalance into account.
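
One common way to account for imbalance, offered here only as an illustration and not as the expected solution, is to weight each class inversely to its frequency. The sketch below computes such weights for a toy label list; the labels `"A"` and `"B"` and the helper name `inverse_frequency_weights` are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each label by n_samples / (n_classes * count), so rare labels weigh more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Toy example: "A" is four times as frequent as "B".
toy_labels = ["A"] * 8 + ["B"] * 2
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

Weights of this kind could, for example, be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`. Whether to weight at the token level or the sequence level, and which labels to count, is a design choice left to you.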

In [None]:
from torch import nn
from transformers import Seq2SeqTrainer


class CustomTrainer(Seq2SeqTrainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        #  TODO: Incorporate a custom loss function in the training pipeline.
        return

In [None]:
trainer = CustomTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # compute_metrics=..,
    # callbacks=..
) 

In [None]:
# log experiments
with wandb.init(project="semantic_parsing_dialog_custom_cost") as run:
    trainer.train()

## Computing metrics

We strongly recommend that you use the provided `evaluate_predictions` and `postprocess_text` functions.

In [None]:
import numpy as np  # needed for np.where in compute_metrics
import wandb

from semantic_parsing_dialog import evaluate_predictions, postprocess_text


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = postprocess_text(decoded_preds)
    decoded_labels = postprocess_text(decoded_labels)
    return evaluate_predictions(decoded_labels, decoded_preds)
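
The metrics returned by `evaluate_predictions` are not listed in this notebook. Purely as an illustration of what a sequence-level metric can look like, the sketch below computes a hypothetical exact-match accuracy over decoded strings; the function name and the example strings are invented and are not part of `semantic_parsing_dialog`.

```python
def exact_match_accuracy(references, predictions):
    """Fraction of predictions identical to their reference, ignoring outer whitespace."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0
    hits = sum(r.strip() == p.strip() for r, p in zip(references, predictions))
    return hits / len(references)


# Invented example strings; the real representation format may differ.
refs = ["[IN:GET_EVENT concerts ]", "[IN:GET_WEATHER rain today ]"]
preds = ["[IN:GET_EVENT concerts ]", "[IN:GET_WEATHER rain ]"]
score = exact_match_accuracy(refs, preds)
print(score)
```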

# Evaluate and compare the models

In [None]:
model_ckpt = '<PATH>'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

model = model.to(device)

In [None]:
def evaluate(dataset, model, tokenizer, batch_size=16):
    """Evaluates the model on the given dataset
    
    Both the predictions and the labels should be processed
    with `postprocess_text` prior to any metric calculation
    """
    # TODO
    ...
    

# Save predictions over test dataset

In [None]:
# make predictions over test set 
output = trainer.predict(tokenized_dataset["test"])

In [None]:
output.metrics

In [None]:
# save predictions
predictions = tokenizer.batch_decode(output.predictions, skip_special_tokens=True)

In [None]:
with open("data/submission_mce.tsv", "w") as f:
    f.write("\n".join(postprocess_text(predictions)))

# Check in your submission to your branch and make a PR.

**Please make sure to include your predictions in the files *data/submission_mce.tsv* and *data/submission_ccf.tsv*.**

# Next steps

In [None]:
# TODO