<a href="https://colab.research.google.com/github/manarea/nlp_colabs/blob/main/Superglue_Metrics_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Task Example

In this notebook, we are going to fine-tune a multi-task model. Multi-task training is useful in many situations, and is a first-class feature in `jiant`.

--- 

In this notebook, we will:

* Train a DistilBERT (`distilbert-base-uncased`) model on RTE, BoolQ, COPA, and MultiRC simultaneously

## Setup

#### Install dependencies

First, we will install libraries we need for this code.

In [None]:
%%capture
!git clone https://github.com/manarea/jiant.git
%cd jiant
!pip install -r requirements-no-torch.txt
!pip install --no-deps -e ./

#### Download data

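Next, we will download RTE, STS-B and CommonsenseQA data. Note that the tasks actually trained on in this notebook (RTE, BoolQ, COPA, and MultiRC) are downloaded separately in the tokenization step below.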

In [None]:
%%capture
%cd /content
# Download RTE, STS-B and CommonsenseQA data
!PYTHONPATH=/content/jiant python jiant/jiant/scripts/download_data/runscript.py \
    download \
    --tasks rte stsb commonsenseqa \
    --output_path=/content/tasks/

## `jiant` Pipeline

In [None]:
import sys
sys.path.insert(0, "/content/jiant")

In [None]:
from IPython.display import clear_output
!pip install transformers seqeval Levenshtein datasets
clear_output()

In [None]:
import jiant.proj.main.tokenize_and_cache as tokenize_and_cache
import jiant.proj.main.export_model as export_model
import jiant.proj.main.scripts.configurator as configurator
import jiant.proj.main.runscript as main_runscript
import jiant.shared.caching as caching
import jiant.utils.python.io as py_io
import jiant.utils.display as display
import os

#### Download model

Next, we will download a `distilbert-base-uncased` model. This also includes the tokenizer.

In [None]:
model_name = "distilbert-base-uncased"
export_model.export_model(
    hf_pretrained_model_name_or_path=model_name,
    output_base_path=f"./models/{model_name}",
)

Downloading config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/256M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

#### Tokenize and cache

With the model and data ready, we can now tokenize and cache the input features for our tasks. This converts the input examples to tokenized features ready to be consumed by the model, and saves them to disk in chunks.
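To make the idea of "saving features to disk in chunks" concrete, here is a minimal sketch in plain Python. It is not `jiant`'s implementation: the helpers `write_chunks` and `load_chunk`, the file naming scheme, and the toy whitespace tokenizer with its made-up vocabulary are all assumptions chosen for illustration.

```python
import os
import pickle
import tempfile


def write_chunks(rows, out_dir, chunk_size):
    """Split rows into fixed-size chunks and pickle each chunk to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(rows), chunk_size):
        path = os.path.join(out_dir, f"chunk_{start // chunk_size:05d}.pkl")
        with open(path, "wb") as f:
            pickle.dump(rows[start:start + chunk_size], f)
        paths.append(path)
    return paths


def load_chunk(out_dir, index):
    """Load one chunk back from disk without reading the others."""
    path = os.path.join(out_dir, f"chunk_{index:05d}.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy "tokenization": split on whitespace and assign ids from a growing vocabulary.
examples = ["a man is walking", "a dog runs", "the sky is blue", "rain falls", "birds sing"]
vocab = {}
rows = []
for text in examples:
    tokens = text.split()
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    rows.append({"tokens": tokens, "input_ids": input_ids})

cache_dir = tempfile.mkdtemp()
chunk_paths = write_chunks(rows, cache_dir, chunk_size=2)
first_row = load_chunk(cache_dir, 0)[0]
print(len(chunk_paths), first_row)
```

Chunking means a single chunk can be loaded and inspected on its own, which is the same access pattern as the `load_chunk(0)[0]` calls used later in this notebook.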

In [None]:
# Download, tokenize, and cache each task
import jiant.scripts.download_data.runscript as downloader

EXP_DIR = "/content/exp"
DATA_DIR = "/content/exp/tasks/configs/"

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(EXP_DIR, exist_ok=True)
# Other SuperGLUE tasks that could be added here: "record", "wic", "wsc"
tasks_list = ["rte", "boolq", "copa", "multirc"]
for task_name in tasks_list:
    # Fetch the raw data and generate the task config
    downloader.download_data([task_name], DATA_DIR)

    print(f".{DATA_DIR}/{task_name}_config.json")
    # Convert the train and val examples to features and write them to ./cache/<task>
    tokenize_and_cache.main(tokenize_and_cache.RunConfiguration(
        task_config_path=f"{DATA_DIR}/{task_name}_config.json",
        hf_pretrained_model_name_or_path=model_name,
        output_dir=f"./cache/{task_name}",
        phases=["train", "val"],
    ))



  0%|          | 0/3 [00:00<?, ?it/s]

Downloaded and generated configs for 'rte' (1/1)
./content/exp/tasks/configs//rte_config.json
RteTask
  [train]: /content/exp/tasks/data/rte/train.jsonl
  [test]: /content/exp/tasks/data/rte/test.jsonl
  [val]: /content/exp/tasks/data/rte/val.jsonl


Tokenizing:   0%|          | 0/2490 [00:00<?, ?it/s]

Tokenizing:   0%|          | 0/277 [00:00<?, ?it/s]



  0%|          | 0/3 [00:00<?, ?it/s]

Downloaded and generated configs for 'boolq' (1/1)
./content/exp/tasks/configs//boolq_config.json
BoolQTask
  [train]: /content/exp/tasks/data/boolq/train.jsonl
  [test]: /content/exp/tasks/data/boolq/test.jsonl
  [val]: /content/exp/tasks/data/boolq/val.jsonl


Tokenizing:   0%|          | 0/9427 [00:00<?, ?it/s]

Tokenizing:   0%|          | 0/3270 [00:00<?, ?it/s]



  0%|          | 0/3 [00:00<?, ?it/s]

Downloaded and generated configs for 'copa' (1/1)
./content/exp/tasks/configs//copa_config.json
CopaTask
  [train]: /content/exp/tasks/data/copa/train.jsonl
  [test]: /content/exp/tasks/data/copa/test.jsonl
  [val]: /content/exp/tasks/data/copa/val.jsonl


Tokenizing:   0%|          | 0/400 [00:00<?, ?it/s]

Tokenizing:   0%|          | 0/100 [00:00<?, ?it/s]

Downloaded and generated configs for 'multirc' (1/1)
./content/exp/tasks/configs//multirc_config.json
MultiRCTask
  [train]: /content/exp/tasks/data/multirc/train.jsonl
  [val]: /content/exp/tasks/data/multirc/val.jsonl
  [test]: /content/exp/tasks/data/multirc/test.jsonl


Tokenizing:   0%|          | 0/27243 [00:00<?, ?it/s]

Tokenizing:   0%|          | 0/4848 [00:00<?, ?it/s]

We can inspect the first example of the first chunk for several tasks.

In [None]:
row = caching.ChunkedFilesDataCache("./cache/rte/train").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)

In [None]:
row = caching.ChunkedFilesDataCache("./cache/boolq/val").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)

In [None]:
row = caching.ChunkedFilesDataCache("./cache/copa/val").load_chunk(0)[0]["data_row"]
print(row.input_ids)
for context_and_choice in row.tokens_list:
    print(context_and_choice)

#### Writing a run config

Here we are going to write what we call a `jiant_task_container_config`. This configuration file defines many of the details of our training pipeline, such as which tasks we train on, which tasks we evaluate on, and the batch size for each task. The new version of `jiant` leans heavily toward explicitly specifying everything, for the purpose of inspectability and leaving minimal surprises for the user, even at the cost of being more verbose.

We use a helper "Configurator" to write out a `jiant_task_container_config`, since most of our setup is pretty standard. 

**Depending on what GPU your Colab session is assigned to, you may need to lower the train batch size.**
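Lowering the train batch size also changes how many optimizer steps the run takes. The sketch below shows one way the step count could be derived. It is an assumption about the arithmetic, not a copy of the configurator's code, and `estimate_max_steps` is a hypothetical helper. The training-set sizes are taken from the tokenization progress bars earlier in this notebook. With the settings used in the next cell (batch size 4, 0.5 epochs, 1 GPU) the result agrees with the `max_steps` value of 4945 in the config printed below.

```python
import math

# Training-set sizes as reported by the tokenization progress bars in this notebook.
train_sizes = {"rte": 2490, "boolq": 9427, "copa": 400, "multirc": 27243}


def estimate_max_steps(sizes, train_batch_size, epochs, num_gpus=1):
    """Assumed rule: each task contributes ceil(n / effective_batch) steps per epoch,
    and the total is truncated to an integer."""
    effective_batch = train_batch_size * num_gpus
    steps_per_epoch = sum(math.ceil(n / effective_batch) for n in sizes.values())
    return int(epochs * steps_per_epoch)


steps_bs4 = estimate_max_steps(train_sizes, train_batch_size=4, epochs=0.5)
steps_bs2 = estimate_max_steps(train_sizes, train_batch_size=2, epochs=0.5)
print(steps_bs4, steps_bs2)
```

Under this assumed rule, halving the batch size roughly doubles the number of steps for the same number of epochs, so expect a longer run if you lower it to fit a smaller GPU.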

In [None]:
tasks_list = ["rte", "boolq", "copa", "multirc" ]
jiant_run_config = configurator.SimpleAPIMultiTaskConfigurator(
    task_config_base_path=f"{DATA_DIR}",
    task_cache_base_path="./cache",
    train_task_name_list=tasks_list,
    val_task_name_list=tasks_list,
    train_batch_size=4,
    eval_batch_size=8,
    epochs=0.5,
    num_gpus=1,
).create_config()
os.makedirs("./run_configs/", exist_ok=True)
py_io.write_json(jiant_run_config, "./run_configs/jiant_run_config.json")
display.show_json(jiant_run_config)

{
  "task_config_path_dict": {
    "rte": "/content/exp/tasks/configs/rte_config.json",
    "boolq": "/content/exp/tasks/configs/boolq_config.json",
    "copa": "/content/exp/tasks/configs/copa_config.json",
    "multirc": "/content/exp/tasks/configs/multirc_config.json"
  },
  "task_cache_config_dict": {
    "rte": {
      "train": "./cache/rte/train",
      "val": "./cache/rte/val",
      "val_labels": "./cache/rte/val_labels"
    },
    "boolq": {
      "train": "./cache/boolq/train",
      "val": "./cache/boolq/val",
      "val_labels": "./cache/boolq/val_labels"
    },
    "copa": {
      "train": "./cache/copa/train",
      "val": "./cache/copa/val",
      "val_labels": "./cache/copa/val_labels"
    },
    "multirc": {
      "train": "./cache/multirc/train",
      "val": "./cache/multirc/val",
      "val_labels": "./cache/multirc/val_labels"
    }
  },
  "sampler_config": {
    "sampler_type": "ProportionalMultiTaskSampler"
  },
  "global_train_config": {
    "max_steps": 4945,
 

To briefly go over the major components of the `jiant_task_container_config`:

* `task_config_path_dict`: The paths to the task config files we wrote above.
* `task_cache_config_dict`: The paths to the task features caches we generated above.
* `sampler_config`: Determines how to sample from different tasks during training.
* `global_train_config`: The number of total steps and warmup steps during training.
* `task_specific_configs_dict`: Task-specific arguments for each task, such as training batch size and gradient accumulation steps.
* `taskmodels_config`: Task-model specific arguments for each task-model, including what tasks use which model.
* `metric_aggregator_config`: Determines how to weight/aggregate the metrics across multiple tasks.
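As a rough illustration of what a proportional multi-task sampler does, the sketch below picks a task at each training step with probability proportional to that task's number of training examples. This assumes "proportional" refers to training-set size. The functions `task_probabilities` and `sample_tasks` are hypothetical helpers written for this example, not part of `jiant`, and the sizes come from the tokenization output above.

```python
import random
from collections import Counter

# Training-set sizes as reported by the tokenization progress bars in this notebook.
train_sizes = {"rte": 2490, "boolq": 9427, "copa": 400, "multirc": 27243}


def task_probabilities(sizes):
    """Probability of picking each task, proportional to its number of examples."""
    total = sum(sizes.values())
    return {task: n / total for task, n in sizes.items()}


def sample_tasks(sizes, num_steps, seed=0):
    """Draw one task name per training step, weighted by training-set size."""
    rng = random.Random(seed)
    tasks = list(sizes)
    weights = [sizes[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=num_steps)


probs = task_probabilities(train_sizes)
schedule = sample_tasks(train_sizes, num_steps=4945)
counts = Counter(schedule)
for task in train_sizes:
    print(f"{task:8s} p={probs[task]:.3f} sampled={counts[task]}")
```

With sizes this uneven, MultiRC dominates the schedule and COPA is seen only rarely, which is worth keeping in mind when reading the per-task scores at the end of the run.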

#### Start training

Finally, we can start our training run. 

Before starting training, the script also prints out the list of parameters in our model. You should notice that there is a unique task head for each task.

In [None]:
run_args = main_runscript.RunConfiguration(
    jiant_task_container_config_path="./run_configs/jiant_run_config.json",
    output_dir="./runs/run1",
    hf_pretrained_model_name_or_path=model_name,
    model_path=f"./models/{model_name}/model/model.p",
    model_config_path=f"./models/{model_name}/model/config.json",
    learning_rate=1e-5,
    eval_every_steps=500,
    do_train=True,
    do_val=True,
    force_overwrite=True,
)

main_runscript.run_loop(run_args)

  jiant_task_container_config_path: ./run_configs/jiant_run_config.json
  output_dir: ./runs/run1
  hf_pretrained_model_name_or_path: distilbert-base-uncased
  model_path: ./models/distilbert-base-uncased/model/model.p
  model_config_path: ./models/distilbert-base-uncased/model/config.json
  model_load_mode: from_transformers
  do_train: True
  do_val: True
  do_save: False
  do_save_last: False
  do_save_best: False
  write_val_preds: False
  write_test_preds: False
  eval_every_steps: 500
  save_every_steps: 0
  save_checkpoint_every_steps: 0
  no_improvements_for_n_evals: 0
  keep_checkpoint_when_done: False
  force_overwrite: True
  seed: -1
  learning_rate: 1e-05
  adam_epsilon: 1e-08
  max_grad_norm: 1.0
  optimizer_type: adam
  no_cuda: False
  fp16: False
  fp16_opt_level: O1
  local_rank: -1
  server_ip: 
  server_port: 
device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
Using seed: 615614760
{
  "jiant_task_container_config_path": "./run_configs/jiant

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  "The following weights were not loaded: {}".format(remainder_weights_dict.keys())


No optimizer decay for:
  encoder.embeddings.LayerNorm.weight
  encoder.embeddings.LayerNorm.bias
  encoder.transformer.layer.0.attention.q_lin.bias
  encoder.transformer.layer.0.attention.k_lin.bias
  encoder.transformer.layer.0.attention.v_lin.bias
  encoder.transformer.layer.0.attention.out_lin.bias
  encoder.transformer.layer.0.sa_layer_norm.bias
  encoder.transformer.layer.0.ffn.lin1.bias
  encoder.transformer.layer.0.ffn.lin2.bias
  encoder.transformer.layer.0.output_layer_norm.bias
  encoder.transformer.layer.1.attention.q_lin.bias
  encoder.transformer.layer.1.attention.k_lin.bias
  encoder.transformer.layer.1.attention.v_lin.bias
  encoder.transformer.layer.1.attention.out_lin.bias
  encoder.transformer.layer.1.sa_layer_norm.bias
  encoder.transformer.layer.1.ffn.lin1.bias
  encoder.transformer.layer.1.ffn.lin2.bias
  encoder.transformer.layer.1.output_layer_norm.bias
  encoder.transformer.layer.2.attention.q_lin.bias
  encoder.transformer.layer.2.attention.k_lin.bias
  encode



Training:   0%|          | 0/4945 [00:00<?, ?it/s]

Eval (rte, Val):   0%|          | 0/35 [00:00<?, ?it/s]

Eval (boolq, Val):   0%|          | 0/63 [00:00<?, ?it/s]

Eval (copa, Val):   0%|          | 0/13 [00:00<?, ?it/s]

Eval (multirc, Val):   0%|          | 0/63 [00:00<?, ?it/s]

Loading Best


Eval (rte, Val):   0%|          | 0/35 [00:00<?, ?it/s]

Eval (boolq, Val):   0%|          | 0/409 [00:00<?, ?it/s]

Eval (copa, Val):   0%|          | 0/13 [00:00<?, ?it/s]

Eval (multirc, Val):   0%|          | 0/606 [00:00<?, ?it/s]

{
  "aggregated": 0.5188473589293371,
  "rte": {
    "loss": 0.6849175589425224,
    "metrics": {
      "major": 0.5812274368231047,
      "minor": {
        "acc": 0.5812274368231047
      }
    }
  },
  "boolq": {
    "loss": 0.6367646895148643,
    "metrics": {
      "major": 0.6422018348623854,
      "minor": {
        "acc": 0.6422018348623854
      }
    }
  },
  "copa": {
    "loss": 0.6899378391412588,
    "metrics": {
      "major": 0.53,
      "minor": {
        "acc": 0.53
      }
    }
  },
  "multirc": {
    "loss": 0.6297393980445248,
    "metrics": {
      "major": 0.3219601640318583,
      "minor": {
        "em": 0.10703043022035677,
        "f1": 0.5368898978433598
      }
    }
  }
}


Finally, we should see the validation scores for all four tasks. We are not winning any awards with these scores, but this example should show how easy it is to run multi-task training in `jiant`.
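The numbers in the output above are consistent with a simple reading of the metrics: the MultiRC `major` score equals the mean of its `em` and `f1` scores, and `aggregated` equals the unweighted mean of the four `major` scores. The check below is a sketch based only on the printed values. The aggregator settings themselves are not visible in the truncated config, so treat the equal weighting as an inference rather than a documented guarantee.

```python
# Major metrics copied from the validation output above.
major = {
    "rte": 0.5812274368231047,
    "boolq": 0.6422018348623854,
    "copa": 0.53,
}

# MultiRC reports two minor metrics; its major score matches their mean.
multirc_minor = {"em": 0.10703043022035677, "f1": 0.5368898978433598}
major["multirc"] = sum(multirc_minor.values()) / len(multirc_minor)

# Unweighted mean across the four tasks.
aggregated = sum(major.values()) / len(major)
print(major["multirc"], aggregated)
```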