# Edge-Probing Fine-tuning Example

In this notebook, we will:

* Fine-tune a RoBERTa base model on an Edge-Probing task (SemEval) and evaluate its performance.
* Simulate the run with a single example, because the Edge-Probing data is not publicly available. This serves as a guide for users who have access to the task data or to similarly formatted data.
* **The encoder is not frozen for training runs in this notebook.**

The code in this notebook runs end to end, but because it trains on a single duplicated example, the results are not representative of the task.

## Setup

#### Install dependencies

First, we install the libraries this notebook needs.

In [1]:
%%capture
!git clone https://github.com/nyu-mll/jiant.git
%cd jiant
!pip install -r requirements-no-torch.txt
!pip install --no-deps -e ./
%cd ..

## `jiant` Pipeline

In [2]:
import sys
sys.path.insert(0, "/content/jiant")

In [3]:
import jiant.proj.main.tokenize_and_cache as tokenize_and_cache
import jiant.proj.main.export_model as export_model
import jiant.proj.main.scripts.configurator as configurator
import jiant.proj.main.runscript as main_runscript
import jiant.shared.caching as caching
import jiant.utils.python.io as py_io
import jiant.utils.display as display
import os

## Creating sample Edge-Probing data.

Because the Edge-Probing data is not publicly available, we will simulate the run with a single example. We will write 1000 copies for the training set and 100 copies for the validation set. We will also write the corresponding task config.

In [4]:
example = {
  "text": "The current view is that the chronic inflammation in the distal part of the stomach caused by Helicobacter pylori infection results in an increased acid production from the non-infected upper corpus region of the stomach.",
  "info": {"id": 7},
  "targets": [
    {
      "label": "Cause-Effect(e2,e1)",
      "span1": [7, 8],
      "span2": [19, 20],
      "info": {"comment": ""}
    }
  ]
}
# Simulate a training set of 1000 examples
train_data = [example] * 1000
# Simulate a validation set of 100 examples
val_data = [example] * 100
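
As a quick sanity check on the record format, the sketch below shows which words the two spans select. It assumes (this is not stated in the notebook) that `span1` and `span2` are end-exclusive indices over whitespace-separated words; `span_words` is a hypothetical helper, not part of `jiant`. Under that reading, the spans pick out "inflammation" and "infection", which fits the `Cause-Effect(e2,e1)` label.

```python
def span_words(text, span):
    """Return the words covered by a span.

    Assumption: the span is an end-exclusive [start, end) pair of indices
    over the whitespace-separated words of `text`.
    """
    start, end = span
    words = text.split()
    if not 0 <= start < end <= len(words):
        raise ValueError(f"span {span} is out of range for {len(words)} words")
    return words[start:end]


# A prefix of the example sentence used in this notebook.
sample_text = (
    "The current view is that the chronic inflammation in the distal part "
    "of the stomach caused by Helicobacter pylori infection results in"
)

print(span_words(sample_text, [7, 8]))    # ['inflammation']
print(span_words(sample_text, [19, 20]))  # ['infection']
```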

In [5]:
os.makedirs("/content/tasks/configs/", exist_ok=True)
os.makedirs("/content/tasks/data/semeval", exist_ok=True)
py_io.write_jsonl(
    data=train_data,
    path="/content/tasks/data/semeval/train.jsonl",
)
py_io.write_jsonl(
    data=val_data,
    path="/content/tasks/data/semeval/val.jsonl",
)
py_io.write_json({
  "task": "semeval",
  "paths": {
    "train": "/content/tasks/data/semeval/train.jsonl",
    "val": "/content/tasks/data/semeval/val.jsonl",
  },
  "name": "semeval"
}, "/content/tasks/configs/semeval_config.json")
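
For readers preparing their own data, the files written above are in JSON Lines layout: one JSON object per line. The following is a minimal standard-library sketch of that layout. The helpers `write_jsonl_sketch` and `read_jsonl_sketch` are hypothetical stand-ins for illustration and are not `jiant`'s own I/O functions.

```python
import json
import os
import tempfile


def write_jsonl_sketch(records, path):
    """Write one JSON object per line (hypothetical helper, not jiant's)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_sketch(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A toy record with the same keys as the example in this notebook.
record = {
    "text": "a b c",
    "targets": [{"label": "X", "span1": [0, 1], "span2": [2, 3]}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.jsonl")
    write_jsonl_sketch([record] * 3, path)
    loaded = read_jsonl_sketch(path)

print(len(loaded), loaded[0] == record)  # 3 True
```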

#### Download model

Next, we will download a `roberta-base` model. This also includes the tokenizer.

In [6]:
export_model.lookup_and_export_model(
    model_type="roberta-base",
    output_base_path="./models/roberta-base",
)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Tokenize and cache

With the model and data ready, we can now tokenize and cache the input features for our task. This converts the input examples into tokenized features ready to be consumed by the model, and saves them to disk in chunks.

In [7]:
# Tokenize and cache each task
task_name = "semeval"

tokenize_and_cache.main(tokenize_and_cache.RunConfiguration(
    task_config_path=f"./tasks/configs/{task_name}_config.json",
    model_type="roberta-base",
    model_tokenizer_path="./models/roberta-base/tokenizer",
    output_dir=f"./cache/{task_name}",
    phases=["train", "val"],
))

SemevalTask
  [train]: /content/tasks/data/semeval/train.jsonl
  [val]: /content/tasks/data/semeval/val.jsonl


We can inspect the first example of the first chunk of the training cache.

In [14]:
row = caching.ChunkedFilesDataCache("./cache/semeval/train").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)
print(row.tokens[row.spans[0][0]: row.spans[0][1]+1])
print(row.tokens[row.spans[1][0]: row.spans[1][1]+1])

[    0   133   595  1217    16    14     5  7642 16000    11     5  7018
   337   233     9     5  9377  1726    30 31141  2413 35995   181  4360
  6249  7910   775    11    41  1130 10395   931    31     5   786    12
 37597   196  2853 42168   976     9     5  9377     4     2     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1]
['<s>', 'The', 'Ġcurrent', 'Ġview', 'Ġis', 'Ġthat', 'Ġthe', 'Ġchronic', 'Ġinflammation', 'Ġin', 'Ġthe', 'Ġdist', 'al', 'Ġpart', 'Ġof', 'Ġthe', 'Ġstomach', 'Ġcaused', 'Ġby', 'ĠHelic', 'ob', 'acter', 'Ġp', 'yl', 'ori', 'Ġi
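
The cell above slices `row.tokens` with `+1` on the end index, which suggests the cached spans are end-inclusive positions in the subword sequence (offset by the leading `<s>` token). The sketch below illustrates that kind of word-to-subword span alignment. It is a toy illustration under stated assumptions: `toy_subwords` is a hypothetical fixed-width splitter standing in for the real RoBERTa tokenizer, the input span is assumed to be end-exclusive over words, and none of this is `jiant`'s actual implementation.

```python
def toy_subwords(word, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-width pieces."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def align_span(words, span, bos_offset=1):
    """Map an end-exclusive word span to an end-inclusive subword span.

    `bos_offset` accounts for special tokens (such as `<s>`) placed before
    the first word.
    """
    start, end = span
    pieces_per_word = [len(toy_subwords(w)) for w in words]
    first = bos_offset + sum(pieces_per_word[:start])
    last = bos_offset + sum(pieces_per_word[:end]) - 1
    return first, last


words = "the chronic inflammation in".split()
tokens = ["<s>"] + [piece for w in words for piece in toy_subwords(w)]

first, last = align_span(words, [2, 3])
print(tokens[first:last + 1])  # ['infl', 'amma', 'tion']
```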

#### Writing a run config

Next, we write what `jiant` calls a `jiant_task_container_config`. This configuration file defines the details of our training pipeline, such as which tasks we train on, which tasks we evaluate on, and the batch size for each task. The new version of `jiant` leans heavily toward specifying everything explicitly, for inspectability and to leave minimal surprises for the user, even at the cost of being more verbose.

We use a helper "Configurator" to write out a `jiant_task_container_config`, since most of our setup is fairly standard.

**Depending on what GPU your Colab session is assigned to, you may need to lower the train batch size.**

In [15]:
jiant_run_config = configurator.SimpleAPIMultiTaskConfigurator(
    task_config_base_path="./tasks/configs",
    task_cache_base_path="./cache",
    train_task_name_list=["semeval"],
    val_task_name_list=["semeval"],
    train_batch_size=8,
    eval_batch_size=16,
    epochs=3,
    num_gpus=1,
).create_config()
os.makedirs("./run_configs/", exist_ok=True)
py_io.write_json(jiant_run_config, "./run_configs/semeval_run_config.json")
display.show_json(jiant_run_config)

{
  "task_config_path_dict": {
    "semeval": "./tasks/configs/semeval_config.json"
  },
  "task_cache_config_dict": {
    "semeval": {
      "train": "./cache/semeval/train",
      "val": "./cache/semeval/val",
      "val_labels": "./cache/semeval/val_labels"
    }
  },
  "sampler_config": {
    "sampler_type": "ProportionalMultiTaskSampler"
  },
  "global_train_config": {
    "max_steps": 375,
    "warmup_steps": 37
  },
  "task_specific_configs_dict": {
    "semeval": {
      "train_batch_size": 8,
      "eval_batch_size": 16,
      "gradient_accumulation_steps": 1,
      "eval_subset_num": 500
    }
  },
  "taskmodels_config": {
    "task_to_taskmodel_map": {
      "semeval": "semeval"
    },
    "taskmodel_config_map": {
      "semeval": null
    }
  },
  "task_run_config": {
    "train_task_list": [
      "semeval"
    ],
    "train_val_task_list": [
      "semeval"
    ],
    "val_task_list": [
      "semeval"
    ],
    "test_task_list": []
  },
  "metric_aggregator_config": {


To briefly go over the major components of the `jiant_task_container_config`:

* `task_config_path_dict`: The paths to the task config files we wrote above.
* `task_cache_config_dict`: The paths to the task features caches we generated above.
* `sampler_config`: Determines how to sample from different tasks during training.
* `global_train_config`: The number of total steps and warmup steps during training.
* `task_specific_configs_dict`: Task-specific arguments for each task, such as training batch size and gradient accumulation steps.
* `taskmodels_config`: Task-model specific arguments for each task-model, including what tasks use which model.
* `metric_aggregator_config`: Determines how to weight/aggregate the metrics across multiple tasks.
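
The values under `global_train_config` can be reproduced with simple arithmetic. The sketch below assumes that the step count is the number of batches per epoch times the number of epochs, and that warmup is 10% of the total, rounded down. These are assumptions that happen to match the printed `max_steps` of 375 and `warmup_steps` of 37, not a statement of how the configurator computes them; `estimate_schedule` is a hypothetical helper.

```python
import math


def estimate_schedule(num_train, train_batch_size, epochs,
                      warmup_ratio=0.1, grad_accum_steps=1):
    """Estimate (max_steps, warmup_steps) for a single-task run.

    Assumption: one optimizer step per effective batch, and warmup is a
    fixed fraction of the total step count, rounded down.
    """
    effective_batch = train_batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_train / effective_batch)
    max_steps = steps_per_epoch * epochs
    warmup_steps = int(max_steps * warmup_ratio)
    return max_steps, warmup_steps


# 1000 training examples, batch size 8, 3 epochs, as configured above.
print(estimate_schedule(1000, 8, 3))  # (375, 37)

# 100 validation examples at eval batch size 16 take 7 batches.
print(math.ceil(100 / 16))  # 7
```

If you lower `train_batch_size` to fit a smaller GPU, the same arithmetic shows how the step count grows, for example 750 steps at batch size 4.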

#### Start training

Finally, we can start our training run. 

Before starting training, the script also prints out the list of parameters in our model.

In [16]:
run_args = main_runscript.RunConfiguration(
    jiant_task_container_config_path="./run_configs/semeval_run_config.json",
    output_dir="./runs/semeval",
    model_type="roberta-base",
    model_path="./models/roberta-base/model/roberta-base.p",
    model_config_path="./models/roberta-base/model/roberta-base.json",
    model_tokenizer_path="./models/roberta-base/tokenizer",
    learning_rate=1e-5,
    eval_every_steps=500,
    do_train=True,
    do_val=True,
    do_save=True,
    force_overwrite=True,
)
main_runscript.run_loop(run_args)

  jiant_task_container_config_path: ./run_configs/semeval_run_config.json
  output_dir: ./runs/semeval
  model_type: roberta-base
  model_path: ./models/roberta-base/model/roberta-base.p
  model_config_path: ./models/roberta-base/model/roberta-base.json
  model_tokenizer_path: ./models/roberta-base/tokenizer
  model_load_mode: from_transformers
  do_train: True
  do_val: True
  do_save: True
  do_save_last: False
  do_save_best: False
  write_val_preds: False
  write_test_preds: False
  eval_every_steps: 500
  save_every_steps: 0
  save_checkpoint_every_steps: 0
  no_improvements_for_n_evals: 0
  keep_checkpoint_when_done: False
  force_overwrite: True
  seed: -1
  learning_rate: 1e-05
  adam_epsilon: 1e-08
  max_grad_norm: 1.0
  optimizer_type: adam
  no_cuda: False
  fp16: False
  fp16_opt_level: O1
  local_rank: -1
  server_ip: 
  server_port: 
device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
Using seed: 195818355
{
  "jiant_task_container_config_path": ".

  task_name, task_config_path, task_name, task.name, task_name,


No optimizer decay for:
  encoder.embeddings.LayerNorm.weight
  encoder.embeddings.LayerNorm.bias
  encoder.encoder.layer.0.attention.self.query.bias
  encoder.encoder.layer.0.attention.self.key.bias
  encoder.encoder.layer.0.attention.self.value.bias
  encoder.encoder.layer.0.attention.output.dense.bias
  encoder.encoder.layer.0.attention.output.LayerNorm.weight
  encoder.encoder.layer.0.attention.output.LayerNorm.bias
  encoder.encoder.layer.0.intermediate.dense.bias
  encoder.encoder.layer.0.output.dense.bias
  encoder.encoder.layer.0.output.LayerNorm.weight
  encoder.encoder.layer.0.output.LayerNorm.bias
  encoder.encoder.layer.1.attention.self.query.bias
  encoder.encoder.layer.1.attention.self.key.bias
  encoder.encoder.layer.1.attention.self.value.bias
  encoder.encoder.layer.1.attention.output.dense.bias
  encoder.encoder.layer.1.attention.output.LayerNorm.weight
  encoder.encoder.layer.1.attention.output.LayerNorm.bias
  encoder.encoder.layer.1.intermediate.dense.bias
  encode
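
The "No optimizer decay" listing above contains only bias and LayerNorm parameters, which matches a common convention of exempting those parameters from weight decay. The sketch below shows that convention applied to parameter names. It assumes the split is made by substring matching on the name, which is one common way to do it and is not confirmed here to be `jiant`'s exact logic; `split_by_weight_decay` is a hypothetical helper, and the `query.weight` name is added for illustration.

```python
# Substrings that mark a parameter as exempt from weight decay (assumed convention).
NO_DECAY_MARKERS = ("bias", "LayerNorm.weight")


def split_by_weight_decay(param_names):
    """Split parameter names into (decay, no_decay) lists by name."""
    decay, no_decay = [], []
    for name in param_names:
        if any(marker in name for marker in NO_DECAY_MARKERS):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay


names = [
    "encoder.encoder.layer.0.attention.self.query.weight",  # illustrative
    "encoder.encoder.layer.0.attention.self.query.bias",
    "encoder.encoder.layer.0.output.LayerNorm.weight",
    "encoder.encoder.layer.0.output.LayerNorm.bias",
]

decay, no_decay = split_by_weight_decay(names)
print(decay)          # ['encoder.encoder.layer.0.attention.self.query.weight']
print(len(no_decay))  # 3
```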

Loading Best


{
  "aggregated": 1.0,
  "semeval": {
    "loss": 0.007899788208305836,
    "metrics": {
      "major": 1.0,
      "minor": {
        "acc": 1.0,
        "f1_micro": 1.0,
        "acc_and_f1_micro": 1.0
      }
    }
  }
}


Because we train and evaluate on the same (duplicated) example, perfect performance is expected. The scores are therefore not meaningful in themselves, but the notebook should still illustrate the workflow for edge-probing tasks.
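
To make the reported metrics concrete, here is a small standard-library sketch of accuracy and micro-averaged F1 for single-label predictions. It assumes, based only on the metric name, that `acc_and_f1_micro` is the mean of the two; the function names are hypothetical and this is not `jiant`'s evaluation code.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def f1_micro(preds, labels):
    """Micro-averaged F1: pool true/false positives and negatives over classes."""
    pairs = list(zip(preds, labels))
    classes = set(preds) | set(labels)
    tp = sum(1 for p, y in pairs if p == y)
    fp = sum(1 for c in classes for p, y in pairs if p == c and y != c)
    fn = sum(1 for c in classes for p, y in pairs if y == c and p != c)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def acc_and_f1_micro(preds, labels):
    """Assumed definition: the mean of accuracy and micro-F1."""
    return (accuracy(preds, labels) + f1_micro(preds, labels)) / 2


# 100 copies of the same example, all predicted correctly.
gold = ["Cause-Effect(e2,e1)"] * 100
predicted = list(gold)
print(accuracy(predicted, gold), f1_micro(predicted, gold),
      acc_and_f1_micro(predicted, gold))  # 1.0 1.0 1.0
```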