# Run MATCH with PeTaL data

Created by Eric Kong on 21 June 2021.
Last modified by Eric Kong on 9 July 2021.

In this notebook we run the MATCH algorithm (GitHub: https://github.com/yuzhimanhua/MATCH, arXiv: https://arxiv.org/abs/2102.07349) on Lens data labelled with PeTaL's taxonomy.

This notebook was originally run in Google Colaboratory with GPU acceleration.

## Setup

In this section we download and install the `MATCH` directory and its requirements.

In [None]:
!pip3 install gdown
!pip install wandb -qqq

In [2]:
import os
import gdown
import wandb

In [None]:
wandb.login()

Check the computing devices available to this notebook using `nvidia-smi`.

In [None]:
!nvidia-smi

Thu Jul  1 15:49:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Download the MATCH repository using gdown (thanks Paht!)

In [None]:
if not os.path.exists('MATCH/'):
    url = "https://drive.google.com/uc?id=100yel9kxjy4VW4VpaAUjjUr0JhxS3fDH" # MATCH_20210709
    # url = "https://drive.google.com/uc?id=1-9oiMwjpiJCRjw9c12wQYAcC6QLhQZlj" # MATCH_20210701

    output = "MATCH.tar.gz"
    gdown.download(url, output, quiet=False)

    !tar -xvf MATCH.tar.gz
else:
    print("You have already downloaded our modified MATCH repository.")

For the rest of the notebook, we will want to run scripts using `MATCH/` as our working directory.

In [None]:
%cd ./MATCH
# !ls

Install the MATCH requirements. You do not need to restart the runtime afterwards.

In [None]:
# Install requirements in requirements.txt
!chmod 755 -R .
!pip3 install -r requirements.txt

## Default preprocessing, training, and testing of MATCH with PeTaL data

In this section we preprocess the PeTaL data, train MATCH on it, and evaluate the trained model on test data.

The input that MATCH expects is in newline-delimited JSON format, where each line is a JSON object with the following fields.

```
{
  "paper": "020-134-448-948-932",
  "mag": [
    "microtubule_polymerization", "microtubule", "tubulin", "guanosine_triphosphate", "growth_rate", "gtp'", "optical_tweezers", "biophysics", "dimer", "biology"
  ],
  "mesh": [
    "D048429", "D000431"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "020-720-960-216-820", "052-873-952-181-099", "000-849-951-902-070"
  ],
  "scholarly_citations": [
    "000-393-690-357-939", "000-539-388-379-773", "002-134-932-426-244"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": [
    "change_size_or_color", "move", "physically_assemble/disassemble", "maintain_ecological_community"
  ]
}
```

This file is provided as `cleaned_lens_output.json` (`https://github.com/nasa-petal/PeTaL-labeller/blob/main/scripts/lens-cleaner/cleaned_lens_output.json`).
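
As a minimal sketch of how such a file can be read, the snippet below parses newline-delimited JSON one line at a time and checks that each record carries the fields shown above. The sample records are abbreviated placeholders rather than real dataset rows, and `parse_records` and `EXPECTED_FIELDS` are hypothetical names that are not part of MATCH or the PeTaL scripts. Treating every listed field as required is also an assumption.

```python
import json

# Two abbreviated records in the newline-delimited JSON layout described above.
# The values are shortened placeholders, not real dataset rows.
SAMPLE_NDJSON = """\
{"paper": "020-134-448-948-932", "mag": ["microtubule", "tubulin"], "mesh": ["D048429"], "venue": "Current biology", "author": ["2305659199"], "reference": ["020-720-960-216-820"], "text": "microtubule assembly dynamics at the nanoscale", "label": ["move", "physically_assemble/disassemble"]}
{"paper": "000-000-000-000-000", "mag": [], "mesh": [], "venue": "", "author": [], "reference": [], "text": "placeholder abstract text", "label": ["move"]}
"""

# Assumption: every record is expected to carry these fields.
EXPECTED_FIELDS = {"paper", "mag", "mesh", "venue", "author", "reference", "text", "label"}


def parse_records(ndjson_text):
    """Parse one JSON object per non-empty line and check its fields."""
    records = []
    for line_number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing fields: {sorted(missing)}")
        records.append(record)
    return records


records = parse_records(SAMPLE_NDJSON)
print(len(records), records[0]["label"])
```

For a file on disk, the same function could be applied to the file's contents read as a string.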

In [None]:
DATASET = "PeTaL"
MODEL = "MATCH"

### Preprocessing

`PeTaL/Split.py` is a custom script which takes `cleaned_lens_output.json` and performs a training-validation-testing split (currently 80%-10%-10%), outputting `train.json`, `dev.json`, and `test.json`.

`transform_data_PeTaL.py` transforms the above `json` files into plain text files, where each line is a sequence of tokens delimited by spaces. In `*_texts.txt` files, the `text` tokens are preceded by metadata tokens such as `author`, `venue`, and `reference`. In `*_labels.txt` files, each line contains the PeTaL taxonomy labels for one paper.
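
To illustrate the idea of prepending metadata tokens to the text, here is a rough sketch that turns one record into a text line and a label line. The `AUTHOR_`, `VENUE_`, and `REFP_` prefixes, the token order, and the function name `record_to_lines` are assumptions made for illustration; consult `transform_data_PeTaL.py` for the format it actually produces.

```python
def record_to_lines(record, use_author=True, use_venue=True, use_reference=True):
    """Return (text_line, label_line) for one paper record.

    The AUTHOR_/VENUE_/REFP_ prefixes are hypothetical; check
    transform_data_PeTaL.py for the token format it actually uses.
    """
    tokens = []
    if use_venue and record.get("venue"):
        # Join multi-word venue names so the venue stays a single token.
        tokens.append("VENUE_" + record["venue"].replace(" ", "_"))
    if use_author:
        tokens.extend("AUTHOR_" + author for author in record.get("author", []))
    if use_reference:
        tokens.extend("REFP_" + ref for ref in record.get("reference", []))
    # The text tokens come last, after all metadata tokens.
    tokens.extend(record["text"].split())
    return " ".join(tokens), " ".join(record["label"])


example = {
    "venue": "Current biology",
    "author": ["2305659199"],
    "reference": ["020-720-960-216-820"],
    "text": "microtubule assembly dynamics",
    "label": ["move", "physically_assemble/disassemble"],
}
text_line, label_line = record_to_lines(example)
print(text_line)
print(label_line)
```

Turning off a keyword argument (for example `use_author=False`) drops that kind of metadata token, which mirrors what the `--no-*` options are for.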

`preprocess.py`, among other things, transforms the `*.txt` data into `numpy`-compliant `*.npy` files, using the embedding files `emb_init.npy` and `PeTaL.joint.emb`. These embeddings come from *metadata-aware embedding pre-training* (performed with PeTaL data on `hpc.grc.nasa.gov`), which embeds the text and its metadata in the same latent space in order to capture the relationships between them.

In [None]:
def get_transform_arg_string(config):
    """Transforms config arguments into a CLI-option string 
    for transform_data_PeTaL.py.

    Args:
        config (dict[str]): JSON dictionary of config arguments.

    Returns:
        str: CLI-option string
    """
    transform_args = []
    if not config['use_mag']:
        transform_args.append("--no-mag")
    if not config['use_mesh']:
        transform_args.append("--no-mesh")
    if not config['use_author']:
        transform_args.append("--no-author")
    if not config['use_venue']:
        transform_args.append("--no-venue")
    if not config['use_references']:
        transform_args.append("--no-reference")
    if not config['use_text']:
        transform_args.append("--no-text")
    return ' '.join(transform_args)

def run_preprocessing(config):
    """Runs train-test split and preprocessing scripts

    Args:
        config (dict[str]): JSON dictionary of config arguments.
    """
    # Train-test split
    %cd PeTaL/
    !python3 Split.py \
        --train {config['train_proportion']} \
        --dev {config['dev_proportion']} \
        --skip {config['skip']}
    %cd ..
    !wc PeTaL/train.json

    # Slightly modified preprocess.sh
    !python3 transform_data_PeTaL.py --dataset {DATASET} \
    {get_transform_arg_string(config)}

    !python preprocess.py \
    --text-path {DATASET}/train_texts.txt \
    --label-path {DATASET}/train_labels.txt \
    --vocab-path {DATASET}/vocab.npy \
    --emb-path {DATASET}/emb_init.npy \
    --w2v-model {DATASET}/{DATASET}.joint.emb

    !python preprocess.py \
    --text-path {DATASET}/test_texts.txt \
    --label-path {DATASET}/test_labels.txt \
    --vocab-path {DATASET}/vocab.npy

### Training and testing

`main.py` with `--mode train` performs training. Every `step` batches (currently `step = 10` in the configuration file `configure/models/MATCH-PeTaL.yaml`), the model logs a line with the epoch number, step count, training loss, validation loss, precision and normalized discounted cumulative gain (nDCG) at top `{1, 3, 5}`, and an early-stopping count (training is currently interrupted when this count reaches `50`). The trained model is saved to `PeTaL/models`.
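
The early-stopping count can be pictured with the generic patience-based sketch below. It is not MATCH's training loop: which validation metric MATCH monitors and how it resets the count are not stated here, so the function `run_early_stopping` and its behaviour are assumptions for illustration only.

```python
def run_early_stopping(validation_scores, patience=50):
    """Return (best_step, best_score, stopped_early) for a sequence of scores.

    A counter increases each time a validation check fails to beat the best
    score seen so far, and resets to zero whenever the score improves.
    """
    best_score = float("-inf")
    best_step = None
    stale_checks = 0
    for step, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_step, stale_checks = score, step, 0
        else:
            stale_checks += 1
            if stale_checks >= patience:
                # Too many checks without improvement: stop training here.
                return best_step, best_score, True
    return best_step, best_score, False


# Hypothetical validation scores; the last value is never reached
# because three checks in a row fail to improve on 0.42.
scores = [0.30, 0.42, 0.41, 0.40, 0.39, 0.45]
print(run_early_stopping(scores, patience=3))
```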

`main.py` with `--mode eval` performs testing. Precision and nDCG statistics are printed, and the results are available in `PeTaL/results`.

`evaluation.py` performs inference. The top `k` (currently `k = 5`) label predictions for each paper are printed line by line in `predictions.txt`.
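
For reference, the two reported metrics can be sketched for a single paper as follows, using the standard definitions for binary relevance: P@k is the fraction of the top `k` predictions that are true labels, and nDCG@k discounts each hit by the logarithm of its rank and normalises by the best achievable score. The function names are hypothetical, and `evaluation.py` may differ in details such as how scores are averaged over papers.

```python
import math


def precision_at_k(ranked_labels, true_labels, k):
    """Fraction of the top-k predicted labels that are true labels."""
    hits = sum(1 for label in ranked_labels[:k] if label in true_labels)
    return hits / k


def ndcg_at_k(ranked_labels, true_labels, k):
    """Normalized discounted cumulative gain at k with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, label in enumerate(ranked_labels[:k], start=1)
        if label in true_labels
    )
    # The ideal ranking places all true labels (up to k of them) first.
    ideal_hits = min(k, len(true_labels))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


true_labels = {"move", "change_size_or_color"}
ranked_labels = [
    "move",
    "active_movement",
    "change_size_or_color",
    "maintain_ecological_community",
    "physically_assemble/disassemble",
]
for k in (1, 3, 5):
    print(k, precision_at_k(ranked_labels, true_labels, k), ndcg_at_k(ranked_labels, true_labels, k))
```

At `k = 1` both functions return 1 if the top prediction is a true label and 0 otherwise, so P@1 and nDCG@1 always coincide.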

In [None]:
from main import main # main.py

def run_train_test(config, group):
    """Runs training, testing, and evaluation.

    Args:
        config (dict[str]): JSON dictionary of config arguments.
        group (str): experiment group name for wandb logging.
    """
    # Slightly modified run_models.sh

    wandb.init(
        project="MATCH",
        group=group,
        config=config
    )

    %cp configure/datasets/{DATASET}.yaml {wandb.run.dir}
    %cp configure/models/{MODEL}-{DATASET}.yaml {wandb.run.dir}

    train_args = ["--data-cnf", f"configure/datasets/{DATASET}.yaml",
        "--model-cnf", f"configure/models/{MODEL}-{DATASET}.yaml",
        "--mode", "train",
        "--reg", "1" if config['hypernymy_regularization'] else "0"]
    main(args=train_args, standalone_mode=False)

    test_args = ["--data-cnf", f"configure/datasets/{DATASET}.yaml",
        "--model-cnf", f"configure/models/{MODEL}-{DATASET}.yaml",
        "--mode", "eval"]
    main(args=test_args, standalone_mode=False)
    
    wandb.finish()

    !python evaluation.py \
    --results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
    --targets {DATASET}/test_labels.npy \
    --train-labels {DATASET}/train_labels.npy

def run_trial(config, group):
    """Runs both preprocessing and training-testing. The whole enchilada.

    Args:
        config (dict[str]): JSON dictionary of config arguments.
        group (str): experiment group name for wandb logging.
    """
    run_preprocessing(config)
    run_train_test(config, group)

In [None]:
config={
    'train_proportion': 0.8,
    'dev_proportion': 0.1,
    'skip': 0,
    'use_mag': True,
    'use_mesh': True,
    'use_author': True,
    'use_venue': True,
    'use_references': True,
    'use_text': True,
    'hypernymy_regularization': True,
    'leaf_labels_only': False,
    'other_notes': "",
}
group = 'integration-test-2021-07-09a'

run_trial(config, group)

## Ablation study: Effect of adding MAG and MeSH labels to text

Relevant to PeTaL Labeller Issues #53 (https://github.com/nasa-petal/PeTaL-labeller/issues/53) and #58 (https://github.com/nasa-petal/PeTaL-labeller/issues/58)

Databases of papers categorise their papers differently. We investigate the effect of adding Microsoft Academic Graph (MAG) fields of study and PubMed's Medical Subject Headings (MeSH) terms, when available for each paper, as additional metadata.

To include or exclude MAG fields of study and MeSH terms, use the `transform_data_PeTaL.py` options `--no-mag` and `--no-mesh`, respectively.

To change the train-dev-test split before processing, use `PeTaL/Split.py` options `--train TRAIN --dev DEV`, where `TRAIN` and `DEV` are between 0 and 1, and so is their sum. An 80-10-10 train-dev-test split (the default) can be explicitly invoked using `python3 PeTaL/Split.py --train 0.8 --dev 0.1`.

To rotate the dataset by `N` examples before processing, use `PeTaL/Split.py` option `--skip N`. This is useful for `k`-fold cross-validation.
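
The sketch below shows one way rotation and splitting can combine to give cross-validation folds: the list is rotated by `skip` positions, then cut into consecutive train, dev, and test segments. This is an assumption about the general technique, not a copy of `PeTaL/Split.py`, and `rotate_and_split` is a hypothetical name.

```python
def rotate_and_split(examples, train=0.8, dev=0.1, skip=0):
    """Rotate the examples by `skip`, then cut train/dev/test segments in order."""
    if not (0 <= train <= 1 and 0 <= dev <= 1 and train + dev <= 1):
        raise ValueError("train and dev must lie in [0, 1] and sum to at most 1")
    n = len(examples)
    if n == 0:
        return [], [], []
    shift = skip % n
    rotated = examples[shift:] + examples[:shift]
    n_train = round(n * train)
    n_dev = round(n * dev)
    return (
        rotated[:n_train],
        rotated[n_train:n_train + n_dev],
        rotated[n_train + n_dev:],
    )


# With 10 examples and a 10% test share, skipping 0, 1, ..., 9 examples
# puts each example in the test segment exactly once.
examples = list(range(10))
test_sets = [rotate_and_split(examples, skip=s)[2] for s in range(10)]
print(test_sets)
```

With 1000 examples, stepping `skip` through `range(0, 1000, 100)` would play the same role for 10 folds.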

### Issue 58: Ablation study with 10-fold cross-validation

In [None]:
STUDY_TITLE = "with_mag_with_mesh"

for skip in range(0, 1000, 100):
    print(f"```\n{STUDY_TITLE} skip={skip}\n")
    config={
        'train_proportion': 0.8,
        'dev_proportion': 0.1,
        'skip': skip,
        'use_mag': True,
        'use_mesh': True,
        'use_author': True,
        'use_venue': True,
        'use_references': True,
        'use_text': True,
        'hypernymy_regularization': True,
        'leaf_labels_only': False,
        'other_notes': "",
    }
    group = 'issue_58_ablation'

    run_trial(config, group)
    print("```\n")

In [None]:
STUDY_TITLE = "with_mag_no_mesh"

for skip in range(0, 1000, 100):
    print(f"```\n{STUDY_TITLE} skip={skip}\n")
    config={
        'train_proportion': 0.8,
        'dev_proportion': 0.1,
        'skip': skip,
        'use_mag': True,
        'use_mesh': False,
        'use_author': True,
        'use_venue': True,
        'use_references': True,
        'use_text': True,
        'hypernymy_regularization': True,
        'leaf_labels_only': False,
        'other_notes': "",
    }
    group = 'issue_58_ablation'

    run_trial(config, group)
    print("```\n")

In [None]:
STUDY_TITLE = "no_mag_with_mesh"

for skip in range(0, 1000, 100):
    print(f"```\n{STUDY_TITLE} skip={skip}\n")
    config={
        'train_proportion': 0.8,
        'dev_proportion': 0.1,
        'skip': skip,
        'use_mag': False,
        'use_mesh': True,
        'use_author': True,
        'use_venue': True,
        'use_references': True,
        'use_text': True,
        'hypernymy_regularization': True,
        'leaf_labels_only': False,
        'other_notes': "",
    }
    group = 'issue_58_ablation'

    run_trial(config, group)
    print("```\n")

In [None]:
STUDY_TITLE = "no_mag_no_mesh"

for skip in range(0, 1000, 100):
    print(f"```\n{STUDY_TITLE} skip={skip}\n")
    config={
        'train_proportion': 0.8,
        'dev_proportion': 0.1,
        'skip': skip,
        'use_mag': False,
        'use_mesh': False,
        'use_author': True,
        'use_venue': True,
        'use_references': True,
        'use_text': True,
        'hypernymy_regularization': True,
        'leaf_labels_only': False,
        'other_notes': "",
    }
    group = 'issue_58_ablation'

    run_trial(config, group)
    print("```\n")

## Ablation study: Turn off hypernymy regularization

We investigate the effect of the label hierarchy (`PeTaL/taxonomy.txt`). The MATCH paper describes *hypernymy regularization*, which leverages taxonomy information to take into account the relationships between labels in training.

This includes *regularization in the parameter space*, where a penalty is added to encourage the parameters of each label (e.g., `active_movement`) to be similar to its parent (e.g., `move`), and *regularization in the output space*, where a penalty is added if a child label occurs without its parent label (roughly speaking).
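
The two penalties can be sketched roughly as below, on plain Python lists. This reflects our reading of the general idea only: the exact formulas, weighting, and handling of labels with several parents in MATCH may differ. The function names, the single-parent `parent_of` mapping, and the numbers are hypothetical.

```python
def parameter_space_penalty(label_params, parent_of):
    """Sum of 0.5 * squared distance between each child's parameters and its parent's."""
    total = 0.0
    for child, parent in parent_of.items():
        diffs = (c - p for c, p in zip(label_params[child], label_params[parent]))
        total += 0.5 * sum(d * d for d in diffs)
    return total


def output_space_penalty(predicted_probs, parent_of):
    """Sum of max(0, p_child - p_parent): nonzero only when a child outscores its parent."""
    return sum(
        max(0.0, predicted_probs[child] - predicted_probs[parent])
        for child, parent in parent_of.items()
    )


# Assumption: each child label has exactly one parent.
parent_of = {"active_movement": "move"}

# Hypothetical two-dimensional parameter vectors for the two labels.
label_params = {"move": [1.0, 0.0], "active_movement": [1.0, 2.0]}
print(parameter_space_penalty(label_params, parent_of))

# The child is predicted with higher probability than its parent, so it is penalised.
print(output_space_penalty({"move": 0.25, "active_movement": 0.75}, parent_of))
# The parent is at least as likely as the child, so there is no penalty.
print(output_space_penalty({"move": 0.75, "active_movement": 0.25}, parent_of))
```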

The authors of `MATCH` were kind enough to include a CLI option, `--reg`, to toggle hypernymy regularization. `--reg 1` turns it on, and `--reg 0` turns it off.

In [None]:
STUDY_TITLE = "no_hyper_reg"

for skip in range(0, 1000, 100):
    print(f"```\n{STUDY_TITLE} skip={skip}\n")
    config={
        'train_proportion': 0.8,
        'dev_proportion': 0.1,
        'skip': skip,
        'use_mag': True,
        'use_mesh': True,
        'use_author': True,
        'use_venue': True,
        'use_references': True,
        'use_text': True,
        'hypernymy_regularization': False,
        'leaf_labels_only': False,
        'other_notes': "",
    }
    group = 'hyper_reg_testing'

    run_trial(config, group)
    print("```\n")

## Study: Effect of Dataset Size on MATCH Performance

Warning: this section runs 70 trials (7 training set sizes with 10 rotations each), and each trial takes about 5 minutes.

| Train set size | P@1=nDCG@1 | P@3 | P@5 | nDCG@3 | nDCG@5 |
| --- | --- | --- | --- | --- | --- |
| 200 | 0.324 | 0.249 | 0.203 | 0.269 | 0.274 |
| 300 | 0.424 | 0.337 | 0.275 | 0.362 | 0.364 |
| 400 | 0.441 | 0.344 | 0.278 | 0.373 | 0.373 |
| 500 | 0.547 | 0.419 | 0.328 | 0.454 | 0.447 |
| 600 | 0.534 | 0.433 | 0.345 | 0.464 | 0.463 |
| 700 | 0.555 | 0.434 | 0.342 | 0.466 | 0.472 |
| 800 | 0.627 | 0.509 | 0.390 | 0.542 | 0.543 |

In [None]:
for train_size in range(200, 900, 100):
    STUDY_TITLE = "train_size"

    for skip in range(0, 1000, 100):
        print(f"```\n{STUDY_TITLE} train_size={train_size} skip={skip}\n")
        config={
            'train_proportion': train_size / 1000.,
            'dev_proportion': 0.1,
            'skip': skip,
            'use_mag': True,
            'use_mesh': True,
            'use_author': True,
            'use_venue': True,
            'use_references': True,
            'use_text': True,
            'hypernymy_regularization': True,
            'leaf_labels_only': False,
            'other_notes': "",
        }
        group = 'train_size_testing'

        run_trial(config, group)
        print("```\n")