# XNLI Example

In this notebook, we are going to train a model for evaluating on [XNLI](https://github.com/facebookresearch/XNLI). XNLI a cross-lingual NLI task, spanning 15 different languages, with 10,000 validation and test examples per language. Notably, XNLI does not have its own training set - instead, the usual recipe is to MNLI as a training set, and is then zero-shot evaluated on NLI examples in other languages. Of course, this works best when you start with a model that has already been pretrained on a lot of multi-lingual text, such as mBERT or XLM/XLM-RoBERTa.

Hence, the tricky part about this setup is that although we have separate XNLI and MNLI tasks, we need them to all use the same task head. We will cover how to easily do this with `jiant`.

--- 

In this notebook, we will:

* Train an XLM-RoBERTa base model on MNLI
* Evaluate on XNLI-de (German) and XNLI-zh (Chinese)

## Setup

#### Install dependencies

First, we will install libraries we need for this code.

In [None]:
%%capture
!git clone https://github.com/nyu-mll/jiant.git
%cd jiant
!pip install -r requirements-no-torch.txt
!pip install --no-deps -e ./

#### Download data

Next, we will download MNLI and XNLI data. 

In [None]:
%%capture
%cd /content
# Download MNLI and XNLI data
!PYTHONPATH=/content/jiant python jiant/jiant/scripts/download_data/runscript.py \
    download \
    --tasks mnli xnli \
    --output_path=/content/tasks/

## `jiant` Pipeline

In [1]:
import sys
sys.path.insert(0, "../../../jiant")

In [2]:
!CUDA_DEVICE_ORDER=PCI_BUS_ID
!CUDA_VISIBLE_DEVICES=0

In [3]:
import jiant.proj.main.tokenize_and_cache as tokenize_and_cache
import jiant.proj.main.export_model as export_model
import jiant.proj.main.scripts.configurator as configurator
import jiant.proj.main.runscript as main_runscript
import jiant.shared.caching as caching
import jiant.utils.python.io as py_io
import jiant.utils.display as display
import os

  from pandas.core.computation.check import NUMEXPR_INSTALLED


#### Download model

Next, we will download an `xlm-roberta-base` model. This also includes the tokenizer.

In [5]:
export_model.export_model(
    hf_pretrained_model_name_or_path="xlm-roberta-base",
    output_base_path="./models/xlm-roberta-base",
)

#### Tokenize and cache

With the model and data ready, we can now tokenize and cache the inputs features for our tasks. This converts the input examples to tokenized features ready to be consumed by the model, and saved them to disk in chunks.

Note that we are tokenize `train` and `val` data for MNLI, but only `val` data for XNLI, since there is no corresponding training data.

In [4]:
!ls ../../exp/tasks

configs  data  disrpt_cfg.tgz  ICSI_split_da_09_08_22.tar.gz


In [5]:
# Tokenize and cache MNLI
tokenize_and_cache.main(tokenize_and_cache.RunConfiguration(
    task_config_path=f"../../exp/tasks/configs/mnli_config.json",
    hf_pretrained_model_name_or_path="xlm-roberta-base",
    output_dir=f"../../exp/cache/mnli",
    phases=["train", "val"],
))

# Tokenize and cache XNLI-de, XNLI-zh
for lang in ["de", "zh"]:
    tokenize_and_cache.main(tokenize_and_cache.RunConfiguration(
        task_config_path=f"../../exp/tasks/configs/xnli_{lang}_config.json",
        hf_pretrained_model_name_or_path="xlm-roberta-base",
        output_dir=f"../../exp/cache/xnli_{lang}",
        phases=["val"],
    ))

MnliTask
  [train]: /moredata/muller/Devel/jiant/exp/tasks/data/mnli/train.jsonl
  [val]: /moredata/muller/Devel/jiant/exp/tasks/data/mnli/val.jsonl
  [test]: /moredata/muller/Devel/jiant/exp/tasks/data/mnli/test.jsonl


Tokenizing:   0%|          | 0/392702 [00:00<?, ?it/s]

Tokenizing:   0%|          | 0/9815 [00:00<?, ?it/s]

XnliTask
  [val]: /moredata/muller/Devel/jiant/exp/tasks/data/xnli_de/val.jsonl
  [test]: /moredata/muller/Devel/jiant/exp/tasks/data/xnli_de/test.jsonl


Tokenizing:   0%|          | 0/2490 [00:00<?, ?it/s]

XnliTask
  [val]: /moredata/muller/Devel/jiant/exp/tasks/data/xnli_zh/val.jsonl
  [test]: /moredata/muller/Devel/jiant/exp/tasks/data/xnli_zh/test.jsonl


Tokenizing:   0%|          | 0/2490 [00:00<?, ?it/s]

We can inspect the first examples of the first chunk of each task.

In [6]:
row = caching.ChunkedFilesDataCache("../../exp/cache/mnli/train").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)

[     0 128182     34  25958  24709   7158  58838   1556   6626  62822
 158208     20  12996    136    700  87168      5      2      2  73111
    136    700  87168    621   2367   3249  24709   7158  58838   4488
      5      2      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1]
['<s>', '▁Concept', 'u', 'ally', '▁cream', '▁ski', 'mming', '▁has', '▁two', '▁basic', '▁di

In [7]:
row = caching.ChunkedFilesDataCache("../../exp//cache/xnli_de/val").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)

[     0    165     72   1256  65017      4  22991    654   2394     48
   9491      5      2      2   1004   1427   4240  10810  68901    142
      4 216783     72   1312    745  27325   4223      6  84510      5
      2      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1]
['<s>', '▁und', '▁er', '▁hat', '▁gesagt', ',', '▁Mama', '▁ich', '▁bin', '▁da', 'heim', '.'

In [8]:
row = caching.ChunkedFilesDataCache("../../exp/cache/xnli_zh/val").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)

[     0      6  49617      4  25710      4    631  43774    274     30
      2      2      6   9889   3715 139030 137438   1826      4    852
  40678  90629  25710 140576  28413     30      2      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1      1      1
      1      1      1      1      1      1      1      1]
['<s>', '▁', '他说', ',', '妈妈', ',', '我', '回来', '了', '。', '</s>', '</s>', '▁', '校', '车', '把他

#### Writing a run config

Here we are going to write what we call a `jiant_task_container_config`. This configuration file basically defines a lot of the subtleties of our training pipeline, such as what tasks we will train on, do evaluation on, batch size for each task. The new version of `jiant` leans heavily toward explicitly specifying everything, for the purpose of inspectability and leaving minimal surprises for the user, even as the cost of being more verbose.

We use a helper "Configurator" to write out a `jiant_task_container_config`, since most of our setup is pretty standard. We specify to train only on MNLI, but evaluate on MNLI, XNLI-de and XNLI-zh.

**Depending on what GPU your Colab session is assigned to, you may need to lower the train batch size.**

In [9]:
jiant_run_config = configurator.SimpleAPIMultiTaskConfigurator(
    task_config_base_path="../../exp/tasks/configs",
    task_cache_base_path="../../exp/cache",
    train_task_name_list=["mnli"],
    val_task_name_list=["mnli", "xnli_de", "xnli_zh"],
    train_batch_size=1,
    eval_batch_size=1,
    epochs=0.1,
    num_gpus=1,
).create_config()
display.show_json(jiant_run_config)

{
  "task_config_path_dict": {
    "mnli": "../../exp/tasks/configs/mnli_config.json",
    "xnli_de": "../../exp/tasks/configs/xnli_de_config.json",
    "xnli_zh": "../../exp/tasks/configs/xnli_zh_config.json"
  },
  "task_cache_config_dict": {
    "mnli": {
      "train": "../../exp/cache/mnli/train",
      "val": "../../exp/cache/mnli/val",
      "val_labels": "../../exp/cache/mnli/val_labels"
    },
    "xnli_de": {
      "val": "../../exp/cache/xnli_de/val",
      "val_labels": "../../exp/cache/xnli_de/val_labels"
    },
    "xnli_zh": {
      "val": "../../exp/cache/xnli_zh/val",
      "val_labels": "../../exp/cache/xnli_zh/val_labels"
    }
  },
  "sampler_config": {
    "sampler_type": "ProportionalMultiTaskSampler"
  },
  "global_train_config": {
    "max_steps": 39270,
    "warmup_steps": 3927
  },
  "task_specific_configs_dict": {
    "mnli": {
      "train_batch_size": 1,
      "eval_batch_size": 1,
      "gradient_accumulation_steps": 1,
      "eval_subset_num": 500
    },


To briefly go over the major components of the `jiant_task_container_config`:

* `task_config_path_dict`: The paths to the task config files we wrote above.
* `task_cache_config_dict`: The paths to the task features caches we generated above.
* `sampler_config`: Determines how to sample from different tasks during training.
* `global_train_config`: The number of total steps and warmup steps during training.
* `task_specific_configs_dict`: Task-specific arguments for each task, such as training batch size and gradient accumulation steps.
* `taskmodels_config`: Task-model specific arguments for each task-model, including what tasks use which model.
* `metric_aggregator_config`: Determines how to weight/aggregate the metrics across multiple tasks.

**We need to make one small change to the auto-generated config**: we need to ensure that all three tasks use the same model head. Otherwise, each task will have its own task head, and the XNLI heads will be untrained.

We can make a simple change to the dictionary, setting all of them to point to an `nli_model` head, and then write out the config.

In [10]:
jiant_run_config["taskmodels_config"]["task_to_taskmodel_map"] = {
    "mnli": "nli_model",
    "xnli_de": "nli_model",
    "xnli_zh": "nli_model",
}
os.makedirs("./run_configs/", exist_ok=True)
py_io.write_json(jiant_run_config, "./run_configs/jiant_run_config.json")

#### Start training

Finally, we can start our training run. 

Before starting training, the script also prints out the list of parameters in our model. You should notice that the only task head is the `nli_model` head.

In [11]:
run_args = main_runscript.RunConfiguration(
    jiant_task_container_config_path="./run_configs/jiant_run_config.json",
    output_dir="./runs/run1",
    hf_pretrained_model_name_or_path="xlm-roberta-base",
    model_path="./models/xlm-roberta-base/model/model.p",
    model_config_path="./models/xlm-roberta-base/model/config.json",
    learning_rate=1e-5,
    eval_every_steps=500,
    do_train=True,
    do_val=True,
    force_overwrite=True,
)

main_runscript.run_loop(run_args)

  jiant_task_container_config_path: ./run_configs/jiant_run_config.json
  output_dir: ./runs/run1
  hf_pretrained_model_name_or_path: xlm-roberta-base
  model_path: ./models/xlm-roberta-base/model/model.p
  model_config_path: ./models/xlm-roberta-base/model/config.json
  model_load_mode: from_transformers
  do_train: True
  do_val: True
  do_save: False
  do_save_last: False
  do_save_best: False
  write_val_preds: False
  write_test_preds: False
  eval_every_steps: 500
  save_every_steps: 0
  save_checkpoint_every_steps: 0
  no_improvements_for_n_evals: 0
  keep_checkpoint_when_done: False
  force_overwrite: True
  seed: -1
  learning_rate: 1e-05
  adam_epsilon: 1e-08
  max_grad_norm: 1.0
  optimizer_type: adam
  freeze_layers: 0-8
  no_cuda: False
  fp16: False
  fp16_opt_level: O1
  local_rank: -1
  server_ip: 
  server_port: 
device: cuda n_gpu: 3, distributed training: False, 16-bits training: False
Using seed: 977026747
{
  "jiant_task_container_config_path": "./run_configs/jiant

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


******** freezing  encoder.layer.0.attention.self.query.weight
******** freezing  encoder.layer.0.attention.self.query.bias
******** freezing  encoder.layer.0.attention.self.key.weight
******** freezing  encoder.layer.0.attention.self.key.bias
******** freezing  encoder.layer.0.attention.self.value.weight
******** freezing  encoder.layer.0.attention.self.value.bias
******** freezing  encoder.layer.0.attention.output.dense.weight
******** freezing  encoder.layer.0.attention.output.dense.bias
******** freezing  encoder.layer.0.attention.output.LayerNorm.weight
******** freezing  encoder.layer.0.attention.output.LayerNorm.bias
******** freezing  encoder.layer.0.intermediate.dense.weight
******** freezing  encoder.layer.0.intermediate.dense.bias
******** freezing  encoder.layer.0.output.dense.weight
******** freezing  encoder.layer.0.output.dense.bias
******** freezing  encoder.layer.0.output.LayerNorm.weight
******** freezing  encoder.layer.0.output.LayerNorm.bias
******** freezing  encod

    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.


Training:   0%|          | 0/39270 [00:00<?, ?it/s]



RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 2.20 GiB already allocated; 20.56 MiB free; 2.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Finally, we should see the validation scores for MNLI, XNLI-de, and XNLI-zh. Given that the training data is in English, we expect to see slightly higher scores for MNLI, but the scores for XNLI-de and XNLI-zh are still decent!