# Run MATCH with PeTaL data

Created by Eric Kong on 21 June 2021.
Last modified by Eric Kong on 6 July 2021.

In this notebook we run the MATCH algorithm (GitHub: https://github.com/yuzhimanhua/MATCH, arXiv: https://arxiv.org/abs/2102.07349) on Lens data labelled with PeTaL's taxonomy.

This notebook was originally run in Google Colaboratory with GPU acceleration.

## Setup

In this section we download and install the `MATCH` directory and its requirements.

In [None]:
!pip3 install gdown



In [None]:
import os
import gdown

Check the computing devices available to this notebook using `nvidia-smi`.

In [None]:
!nvidia-smi

Thu Jul  1 15:49:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Download the MATCH repository using gdown (thanks Paht!)

In [None]:
if not os.path.exists('MATCH/'):
    url = "https://drive.google.com/uc?id=1-9oiMwjpiJCRjw9c12wQYAcC6QLhQZlj" # MATCH_20210701
    # url = "https://drive.google.com/uc?id=1-2evoruw4B88P4x7EpoH45tuJ2Dfb2Xs" # old one
    output = "MATCH.tar.gz"
    gdown.download(url, output, quiet=False)

    !tar -xvf MATCH.tar.gz
else:
    print("You have already downloaded our modified MATCH repository.")

For the rest of the notebook, we will want to run scripts using `MATCH/` as our working directory.

In [None]:
%cd ./MATCH
!ls

/content/MATCH
configure      PeTaL		 run_models.sh
deepxml        predictions.txt	 transform_data_PeTaL_only_mags_and_meshes.py
evaluation.py  preprocess.py	 transform_data_PeTaL.py
joint	       preprocess.sh	 transform_data_PeTaL_random_mags_and_meshes.py
LICENSE        README.md	 transform_data.py
main.py        requirements.txt


Install the MATCH requirements. NOTE: You may have to restart the runtime after installing the requirements.  This is annoying but not prohibitively so.

In [None]:
# Install requirements in requirements.txt
!chmod 755 -R .
!pip3 install -r requirements.txt

## Default preprocessing, training, and testing of MATCH with PeTaL data

In this section we preprocess the PeTaL data, train MATCH on it, and evaluate it on test data.

The input that MATCH expects is in newline-delimited JSON format, where each line is a JSON object with the following fields.

```
{
  "paper": "020-134-448-948-932",
  "mag": [
    "microtubule_polymerization", "microtubule", "tubulin", "guanosine_triphosphate", "growth_rate", "gtp'", "optical_tweezers", "biophysics", "dimer", "biology"
  ],
  "mesh": [
    "D048429", "D000431"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "020-720-960-216-820", "052-873-952-181-099", "000-849-951-902-070"
  ],
  "scholarly_citations": [
    "000-393-690-357-939", "000-539-388-379-773", "002-134-932-426-244"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": [
    "change_size_or_color", "move", "physically_assemble/disassemble", "maintain_ecological_community"
  ]
}
```

This file is provided as `cleaned_lens_output.json` (`https://github.com/nasa-petal/PeTaL-labeller/blob/main/scripts/lens-cleaner/cleaned_lens_output.json`).
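As a rough illustration of this format, the sketch below parses newline-delimited JSON one line at a time and checks each record for a few fields. The helper name `read_ndjson` and the choice of `paper`, `text`, and `label` as required fields are assumptions made for this example; they are not taken from the MATCH code.

```python
import json

# Assumption: only these fields are treated as mandatory in this sketch.
REQUIRED_FIELDS = {"paper", "text", "label"}

def read_ndjson(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        yield record

# A shortened version of the example record above, serialised onto one line.
sample_lines = [json.dumps({
    "paper": "020-134-448-948-932",
    "mag": ["microtubule", "tubulin"],
    "mesh": ["D048429", "D000431"],
    "venue": "Current biology",
    "text": "microtubule assembly dynamics at the nanoscale",
    "label": ["move"],
})]

records = list(read_ndjson(sample_lines))
print(records[0]["label"])  # ['move']
```

With a real file, `read_ndjson(open(path, encoding="utf-8"))` would iterate over its lines in the same way.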

In [None]:
DATASET = "PeTaL"
MODEL = "MATCH"

### Preprocessing

`PeTaL/Split.py` is a custom script which takes `cleaned_lens_output.json` and performs a training-validation-testing split (currently 80%-10%-10%), outputting `train.json`, `dev.json`, and `test.json`.
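The idea of a fractional split can be sketched as follows. This is not the code of `PeTaL/Split.py`; the function name `split_records` is hypothetical, and the sketch assumes a purely positional split with no shuffling, where the test set receives whatever the training and validation fractions leave over.

```python
def split_records(records, train=0.8, dev=0.1):
    """Split a list positionally into (train, dev, test) parts."""
    if not (0 <= train <= 1 and 0 <= dev <= 1 and train + dev <= 1):
        raise ValueError("train and dev must lie in [0, 1] and sum to at most 1")
    n_train = round(len(records) * train)
    n_dev = round(len(records) * dev)
    train_part = records[:n_train]
    dev_part = records[n_train:n_train + n_dev]
    test_part = records[n_train + n_dev:]  # the remainder
    return train_part, dev_part, test_part

papers = list(range(1000))  # stand-in for 1000 paper records
train_set, dev_set, test_set = split_records(papers)
print(len(train_set), len(dev_set), len(test_set))  # 800 100 100
```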

`transform_data_PeTaL.py` transforms the above `json` files into plain text files, where each line is a sequence of tokens delimited by spaces. In `*_texts.txt` files, the `text` tokens are preceded by metadata tokens such as `author`, `venue`, and `references`. In `*_labels.txt` files, each line contains the PeTaL taxonomy labels for one paper.
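To make the transformation concrete, here is a minimal sketch of how one record could be flattened into a text line and a label line. The function names and the token prefixes (`VENUE_`, `AUTHOR_`, `REFP_`, `MAG_`, `MESH_`) are assumptions for illustration and may not match what `transform_data_PeTaL.py` actually emits. The `use_mag` and `use_mesh` arguments are likewise hypothetical stand-ins for the script's `--no-mag` and `--no-mesh` options.

```python
def to_text_line(record, use_mag=True, use_mesh=True):
    """Prepend metadata tokens to the text tokens of one record."""
    tokens = []
    if record.get("venue"):
        tokens.append("VENUE_" + record["venue"].replace(" ", "_"))
    tokens += ["AUTHOR_" + author for author in record.get("author", [])]
    tokens += ["REFP_" + ref for ref in record.get("reference", [])]
    if use_mag:
        tokens += ["MAG_" + field for field in record.get("mag", [])]
    if use_mesh:
        tokens += ["MESH_" + term for term in record.get("mesh", [])]
    return " ".join(tokens + record["text"].split())

def to_label_line(record):
    """Join the taxonomy labels of one record with spaces."""
    return " ".join(record["label"])

record = {
    "venue": "Current biology",
    "author": ["2305659199"],
    "reference": ["020-720-960-216-820"],
    "mag": ["tubulin"],
    "mesh": ["D048429"],
    "text": "microtubule assembly dynamics",
    "label": ["move", "change_size_or_color"],
}
print(to_text_line(record))
print(to_label_line(record))
```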

`preprocess.py`, among other things, transforms the `*.txt` data into `numpy`-compliant `*.npy` files, using the embedding files `emb_init.npy` and `PeTaL.joint.emb`. These embeddings come from *metadata-aware embedding pre-training* (performed with PeTaL data on `hpc.grc.nasa.gov`), which embeds the text and its metadata in the same latent space in order to capture the relationships between them.

In [None]:
# Slightly modified preprocess.sh

%cd PeTaL/
!python3 Split.py
%cd ..

!python3 transform_data_PeTaL.py --dataset $DATASET

!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb \

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy \

[I 210629 06:34:12 preprocess:28] Vocab Size: 26834
[I 210629 06:34:12 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210629 06:34:13 preprocess:32] Size of Samples: 900
[I 210629 06:34:14 preprocess:28] Vocab Size: 26834
[I 210629 06:34:14 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210629 06:34:14 preprocess:32] Size of Samples: 100


### Training and testing

`main.py` with `--mode train` performs training. Every `step` batches (currently `step = 10` in the configuration file `configure/models/MATCH-PeTaL.yaml`), the model prints a logger line with the epoch number, step count, training loss, validation loss, precision and normalized discounted cumulative gain (nDCG) at top `{1, 3, 5}`, and an early-stopping count (training is currently interrupted when this count reaches `50`). The trained model is saved in `PeTaL/models`.

`main.py` with `--mode eval` performs testing. Precision and nDCG statistics are printed, and the results are saved in `PeTaL/results`.

`evaluation.py` performs inference. The top `k` (currently `k = 5`) label predictions for each paper are written line by line to `predictions.txt`.
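For readers unfamiliar with the two metrics, the sketch below computes precision at `k` and nDCG at `k` for a single paper, using the usual binary-relevance definitions from the multi-label classification literature. It is an illustration written for this notebook, not the code in `evaluation.py`, and the function names are hypothetical. The reported statistics would be averages of these per-paper values over the test set.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / k

def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the top-k predictions, divided by the best achievable."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, label in enumerate(ranked[:k]) if label in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: three ranked predictions, two of which are true labels.
ranked = ["move", "active_movement", "change_size_or_color"]
relevant = {"move", "change_size_or_color"}

print(precision_at_k(ranked, relevant, 1))  # 1.0
print(precision_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

With these definitions, precision and nDCG coincide at `k = 1`, which is consistent with the matching first values in the `Precision@1,3,5` and `nDCG@1,3,5` lines printed during evaluation.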

In [None]:
# Slightly modified run_models.sh

!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210629 06:34:16 main:32] Model Name: MATCH
[I 210629 06:34:16 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210629 06:34:16 main:47] Number of Labels: 124
[I 210629 06:34:16 main:48] Size of Training Set: 800
[I 210629 06:34:16 main:49] Size of Validation Set: 100
[I 210629 06:34:16 main:66] Number of Edges: 101
[I 210629 06:34:16 main:68] Training
[I 210629 06:34:23 models:110] 2 512 train loss: 0.2769679 valid loss: 0.1508755 P@1: 0.33000 P@3: 0.22333 P@5: 0.19400 N@3: 0.24877 N@5: 0.25615 early stop: 0
[I 210629 06:34:24 models:142] SWA Initializing
[I 210629 06:34:25 models:110] 4 1024 train loss: 0.1443880 valid loss: 0.1437063 P@1: 0.33000 P@3: 0.21667 P@5: 0.20400 N@3: 0.23732 N@5: 0.26094 early stop: 0
[I 210629 06:34:29 models:110] 7 512 train loss: 0.1436986 valid loss: 0.1422755 P@1: 0.330

## Ablation study: Effect of adding MAG and MeSH labels to text

Relevant to PeTaL Labeller Issues #53 (https://github.com/nasa-petal/PeTaL-labeller/issues/53) and #58 (https://github.com/nasa-petal/PeTaL-labeller/issues/58)

Databases of papers categorise their papers differently. We investigate the effect of adding Microsoft Academic Graph (MAG) fields of study and PubMed's Medical Subject Headings (MeSH) terms, when available for each paper, as additional metadata.

To turn the inclusion of MAG fields of study and MeSH terms on or off, use the `transform_data_PeTaL.py` options `--no-mag` and `--no-mesh`, respectively.

To change the train-dev-test split before processing, use `PeTaL/Split.py` options `--train TRAIN --dev DEV`, where `TRAIN` and `DEV` are between 0 and 1, and so is their sum. An 80-10-10 train-dev-test split (the default) can be explicitly invoked using `python3 PeTaL/Split.py --train 0.8 --dev 0.1`.

To rotate the dataset by `N` examples before processing, use `PeTaL/Split.py` option `--skip N`. This is useful for `k`-fold cross-validation.
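The connection between rotation and `k`-fold cross-validation can be checked with a small sketch. It assumes that rotation moves the first `N` records to the end and that the split afterwards is positional, with the last 10% used as the test set. Both are assumptions about `PeTaL/Split.py`, and the helper `rotate` is hypothetical.

```python
def rotate(records, skip):
    """Return a copy of the list with the first `skip` records moved to the end."""
    if not records:
        return []
    skip %= len(records)
    return records[skip:] + records[:skip]

papers = list(range(1000))  # stand-in for 1000 paper records

test_folds = []
for skip in range(0, 1000, 100):      # ten rotations, 100 records apart
    rotated = rotate(papers, skip)
    test_folds.append(rotated[900:])  # last 10% plays the role of the test set

# Every paper lands in exactly one test fold.
covered = sorted(paper for fold in test_folds for paper in fold)
print(covered == papers)  # True
```

Under these assumptions, ten rotations of 100 examples each give ten disjoint test sets that together cover the whole dataset.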

### Issue 53: Smaller ablation study.

This study is strictly superseded by the Issue #58 study below.

#### Results with MAG labels and MeSH labels



In [None]:
!python3 transform_data_PeTaL.py --dataset $DATASET

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb \

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy \

[I 210629 01:07:02 preprocess:28] Vocab Size: 26834
[I 210629 01:07:02 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210629 01:07:03 preprocess:32] Size of Samples: 900
[I 210629 01:07:04 preprocess:28] Vocab Size: 26834
[I 210629 01:07:04 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210629 01:07:04 preprocess:32] Size of Samples: 100


In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210629 01:07:08 main:32] Model Name: MATCH
[I 210629 01:07:08 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210629 01:07:08 main:47] Number of Labels: 124
[I 210629 01:07:08 main:48] Size of Training Set: 800
[I 210629 01:07:08 main:49] Size of Validation Set: 100
[I 210629 01:07:08 main:66] Number of Edges: 101
[I 210629 01:07:08 main:68] Training
[I 210629 01:07:14 models:110] 2 512 train loss: 0.2721960 valid loss: 0.1478169 P@1: 0.30000 P@3: 0.17333 P@5: 0.13400 N@3: 0.20418 N@5: 0.19661 early stop: 0
[I 210629 01:07:15 models:142] SWA Initializing
[I 210629 01:07:16 models:110] 4 1024 train loss: 0.1486002 valid loss: 0.1413299 P@1: 0.30000 P@3: 0.18333 P@5: 0.17200 N@3: 0.20620 N@5: 0.22358 early stop: 0
[I 210629 01:07:18 models:110] 7 512 train loss: 0.1418496 valid loss: 0.1394818 P@1: 0.300

#### Results with MAG labels, without MeSH labels

In [None]:
# !cp -r PeTaL-062315 PeTaL

!python3 transform_data_PeTaL.py --dataset $DATASET --no-mesh

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb \

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy \

[I 210623 23:51:54 preprocess:28] Vocab Size: 26834
[I 210623 23:51:54 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210623 23:51:54 preprocess:32] Size of Samples: 900
[I 210623 23:51:55 preprocess:28] Vocab Size: 26834
[I 210623 23:51:55 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210623 23:51:55 preprocess:32] Size of Samples: 100


In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210623 23:53:26 main:32] Model Name: MATCH
[I 210623 23:53:26 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210623 23:53:26 main:47] Number of Labels: 124
[I 210623 23:53:26 main:48] Size of Training Set: 800
[I 210623 23:53:26 main:49] Size of Validation Set: 100
[I 210623 23:53:26 main:66] Number of Edges: 101
[I 210623 23:53:26 main:68] Training
[I 210623 23:53:32 models:142] SWA Initializing
[I 210623 23:53:47 models:110] 24 1024 train loss: 0.1282653 valid loss: 0.1520345 P@1: 0.27000 P@3: 0.21000 P@5: 0.18600 N@3: 0.22714 N@5: 0.25315 early stop: 0
[I 210623 23:54:05 models:110] 49 1024 train loss: 0.0306060 valid loss: 0.1493586 P@1: 0.40000 P@3: 0.29667 P@5: 0.24800 N@3: 0.32327 N@5: 0.33705 early stop: 0
[I 210623 23:54:23 models:110] 74 1024 train loss: 0.0121740 valid loss: 0.1538170 P@1: 

#### Results without MAG labels, with MeSH labels

In [None]:
!python3 transform_data_PeTaL.py --dataset $DATASET --no-mag

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb \

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy \

[I 210629 06:30:37 preprocess:28] Vocab Size: 26834
[I 210629 06:30:37 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
Converting token to id: 0it [00:00, ?it/s]
Converting labels: 0it [00:00, ?it/s]
[I 210629 06:30:37 preprocess:32] Size of Samples: 900
Traceback (most recent call last):
  File "preprocess.py", line 8, in <module>
  File "/content/drive/.shortcut-targets-by-id/1H34uxYzZnD3lCNKKLXPkhQj27fDosPDC/PeTaL/PeTaL Data/MATCH on PeTaL Data/MATCH/deepxml/data_utils.py", line 7, in <module>
    from gensim.models import KeyedVectors
  File "/usr/local/lib/python3.7/dist-packages/gensim/__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "/usr/local/lib/python3.7/dist-packages/gensim/parsing/__init__.py", line 4, in <module>
    from .preproc

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210629 06:30:40 main:32] Model Name: MATCH
[I 210629 06:30:40 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210629 06:30:40 main:47] Number of Labels: 124
[I 210629 06:30:40 main:48] Size of Training Set: 800
[I 210629 06:30:40 main:49] Size of Validation Set: 100
[I 210629 06:30:40 main:66] Number of Edges: 101
[I 210629 06:30:40 main:68] Training
^C
[I 210629 06:30:43 main:32] Model Name: MATCH
[I 210629 06:30:43 main:79] Loading Test Set
[I 210629 06:30:43 main:83] Size of Test Set: 100
[I 210629 06:30:43 main:85] Predicting
[I 210629 06:30:47 main:91] Finish Predicting
Precision@1,3,5: 0.62 0.47 0.372
nDCG@1,3,5: 0.62 0.5069278726022756 0.5172544503323468


#### Results without MAG labels, without MeSH labels



In [None]:
!python3 transform_data_PeTaL.py --dataset $DATASET --no-mag --no-mesh

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb \

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy \

Traceback (most recent call last):
  File "preprocess.py", line 8, in <module>
    from deepxml.data_utils import build_vocab, convert_to_binary
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 724, in exec_module
  File "<frozen importlib._bootstrap_external>", line 818, in get_code
  File "<frozen importlib._bootstrap_external>", line 917, in get_data
KeyboardInterrupt
Traceback (most recent call last):
  File "preprocess.py", line 8, in <module>
    from deepxml.data_utils import build_vocab, convert_to_binary
  File "/content/drive/Shareddrives/MATCH Attempt/MATCH/deepxml/data_utils.py", line 5, in <module>
    from sklearn.preprocessing import MultiLabelBinarizer, normalize
  File "/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/__init__.py", line 6, in 

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy


Aborted!
[I 210624 21:33:03 main:32] Model Name: MATCH
[I 210624 21:33:03 main:79] Loading Test Set
[I 210624 21:33:03 main:83] Size of Test Set: 100
[I 210624 21:33:03 main:85] Predicting
[I 210624 21:33:06 main:91] Finish Predicting
Precision@1,3,5: 0.66 0.53 0.42
nDCG@1,3,5: 0.66 0.5620467187130749 0.5699775424270964


### Issue 58: Ablation study with 10-fold cross-validation.
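Each evaluation run ends with two summary lines of the form `Precision@1,3,5: ...` and `nDCG@1,3,5: ...`. A minimal sketch of how such lines could be collected and averaged across folds is given below, assuming the output has been captured as text. The helper names are hypothetical. The two sample runs are copied from outputs elsewhere in this notebook purely to exercise the parser; they come from different configurations, so their average is not a meaningful result.

```python
import re
from statistics import mean

# Matches the summary lines printed at the end of an evaluation run.
METRIC_LINE = re.compile(r"^(Precision|nDCG)@1,3,5:\s+(\S+)\s+(\S+)\s+(\S+)$")

def parse_metrics(log_text):
    """Collect (value@1, value@3, value@5) tuples for each metric found in the log."""
    results = {"Precision": [], "nDCG": []}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            values = tuple(float(value) for value in match.groups()[1:])
            results[match.group(1)].append(values)
    return results

def mean_per_k(rows):
    """Average each of the three columns over all collected runs."""
    return tuple(mean(column) for column in zip(*rows))

captured = """
Precision@1,3,5: 0.62 0.47 0.372
nDCG@1,3,5: 0.62 0.5069278726022756 0.5172544503323468
Precision@1,3,5: 0.66 0.53 0.42
nDCG@1,3,5: 0.66 0.5620467187130749 0.5699775424270964
"""

metrics = parse_metrics(captured)
print(mean_per_k(metrics["Precision"]))
print(mean_per_k(metrics["nDCG"]))
```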

In [None]:
for skip in range(0, 1000, 100):
    print(f"\nWITH MAG WITH MESH, skip={skip}\n")
    %cd PeTaL/
    !python3 Split.py --skip={skip}
    %cd ..
    !wc PeTaL/train.json

    !python3 transform_data_PeTaL.py --dataset $DATASET

    !python preprocess.py \
    --text-path {DATASET}/train_texts.txt \
    --label-path {DATASET}/train_labels.txt \
    --vocab-path {DATASET}/vocab.npy \
    --emb-path {DATASET}/emb_init.npy \
    --w2v-model {DATASET}/{DATASET}.joint.emb \

    !python preprocess.py \
    --text-path {DATASET}/test_texts.txt \
    --label-path {DATASET}/test_labels.txt \
    --vocab-path {DATASET}/vocab.npy \

    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

    !python evaluation.py \
    --results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
    --targets {DATASET}/test_labels.npy \
    --train-labels {DATASET}/train_labels.npy


WITH MAG WITH MESH, skip=100

/content/drive/Shareddrives/NASA/MATCH/PeTaL
131
/content/drive/Shareddrives/NASA/MATCH
    800  274532 4635505 PeTaL/train.json
[I 210629 03:07:34 preprocess:28] Vocab Size: 26834
[I 210629 03:07:34 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210629 03:07:34 preprocess:32] Size of Samples: 900
[I 210629 03:07:35 preprocess:28] Vocab Size: 26834
[I 210629 03:07:35 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210629 03:07:35 preprocess:32] Size of Samples: 100
[I 210629 03:07:37 main:32] Model Name: MATCH
[I 210629 03:07:37 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210629 03:07:37 main:47] Number of Labels: 124
[I 210629 03:07:37 main:48] Size of Training Set: 800
[I 210629 03:07:37 main:49] Size of Validation Set:

In [None]:
for skip in range(0, 1000, 100):
    print(f"\nWITH MAG NO MESH, skip={skip}\n")
    %cd PeTaL/
    !python3 Split.py --skip={skip}
    %cd ..
    !wc PeTaL/train.json

    !python3 transform_data_PeTaL.py --dataset $DATASET --no-mesh

    !python preprocess.py \
    --text-path {DATASET}/train_texts.txt \
    --label-path {DATASET}/train_labels.txt \
    --vocab-path {DATASET}/vocab.npy \
    --emb-path {DATASET}/emb_init.npy \
    --w2v-model {DATASET}/{DATASET}.joint.emb \

    !python preprocess.py \
    --text-path {DATASET}/test_texts.txt \
    --label-path {DATASET}/test_labels.txt \
    --vocab-path {DATASET}/vocab.npy \

    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

    !python evaluation.py \
    --results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
    --targets {DATASET}/test_labels.npy \
    --train-labels {DATASET}/train_labels.npy


WITH MAG NO MESH, skip=100

/content/drive/Shareddrives/NASA/MATCH/PeTaL
131
/content/drive/Shareddrives/NASA/MATCH
    800  274532 4635505 PeTaL/train.json
[I 210629 03:40:20 preprocess:28] Vocab Size: 26834
[I 210629 03:40:20 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210629 03:40:20 preprocess:32] Size of Samples: 900
[I 210629 03:40:22 preprocess:28] Vocab Size: 26834
[I 210629 03:40:22 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210629 03:40:22 preprocess:32] Size of Samples: 100
[I 210629 03:40:23 main:32] Model Name: MATCH
[I 210629 03:40:23 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210629 03:40:23 main:47] Number of Labels: 124
[I 210629 03:40:23 main:48] Size of Training Set: 800
[I 210629 03:40:23 main:49] Size of Validation Set: 1

In [None]:
for skip in range(0, 1000, 100):
    print(f"\nNO MAG WITH MESH, skip={skip}\n")
    %cd PeTaL/
    !python3 Split.py --skip={skip}
    %cd ..
    !wc PeTaL/train.json

    !python3 transform_data_PeTaL.py --dataset $DATASET --no-mag

    !python preprocess.py \
    --text-path {DATASET}/train_texts.txt \
    --label-path {DATASET}/train_labels.txt \
    --vocab-path {DATASET}/vocab.npy \
    --emb-path {DATASET}/emb_init.npy \
    --w2v-model {DATASET}/{DATASET}.joint.emb \

    !python preprocess.py \
    --text-path {DATASET}/test_texts.txt \
    --label-path {DATASET}/test_labels.txt \
    --vocab-path {DATASET}/vocab.npy \

    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

    !python evaluation.py \
    --results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
    --targets {DATASET}/test_labels.npy \
    --train-labels {DATASET}/train_labels.npy


NO MAG WITH MESH, skip=0

/content/drive/.shortcut-targets-by-id/1H34uxYzZnD3lCNKKLXPkhQj27fDosPDC/PeTaL/PeTaL Data/MATCH on PeTaL Data/MATCH/PeTaL
131
/content/drive/.shortcut-targets-by-id/1H34uxYzZnD3lCNKKLXPkhQj27fDosPDC/PeTaL/PeTaL Data/MATCH on PeTaL Data/MATCH
    800  275452 4644161 PeTaL/train.json
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/dist-packages/gensim/parsing/preprocessing.py", line 42, in <module>
    from gensim import utils
  File "/usr/local/lib/python3.7/dist-packages/gensim/utils.py", line 45, in <module>
    from smart_open import smart_open
  File "/usr/local/lib/python3.7/d

In [None]:
for skip in range(0, 1000, 100):
    print(f"\nNO MAG NO MESH, skip={skip}\n")
    %cd PeTaL/
    !python3 Split.py --skip={skip}
    %cd ..
    !wc PeTaL/train.json

    !python3 transform_data_PeTaL.py --dataset $DATASET --no-mag --no-mesh

    !python preprocess.py \
    --text-path {DATASET}/train_texts.txt \
    --label-path {DATASET}/train_labels.txt \
    --vocab-path {DATASET}/vocab.npy \
    --emb-path {DATASET}/emb_init.npy \
    --w2v-model {DATASET}/{DATASET}.joint.emb \

    !python preprocess.py \
    --text-path {DATASET}/test_texts.txt \
    --label-path {DATASET}/test_labels.txt \
    --vocab-path {DATASET}/vocab.npy \

    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
    !PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

    !python evaluation.py \
    --results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
    --targets {DATASET}/test_labels.npy \
    --train-labels {DATASET}/train_labels.npy


NO MAG NO MESH, skip=100

/content/drive/Shareddrives/NASA/MATCH/PeTaL
131
/content/drive/Shareddrives/NASA/MATCH
    800  274532 4635505 PeTaL/train.json
[I 210629 05:16:44 preprocess:28] Vocab Size: 26834
[I 210629 05:16:44 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210629 05:16:44 preprocess:32] Size of Samples: 900
[I 210629 05:16:46 preprocess:28] Vocab Size: 26834
[I 210629 05:16:46 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210629 05:16:46 preprocess:32] Size of Samples: 100
[I 210629 05:16:47 main:32] Model Name: MATCH
[I 210629 05:16:47 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210629 05:16:47 main:47] Number of Labels: 124
[I 210629 05:16:47 main:48] Size of Training Set: 800
[I 210629 05:16:47 main:49] Size of Validation Set: 100

## Ablation study: Turn off hypernymy regularization

We investigate the effect of the hierarchy (`PeTaL/taxonomy.txt`). The MATCH paper describes *hypernymy regularization*, which leverages taxonomy information to take the relationships between labels into account during training.

This includes *regularization in the parameter space*, where a penalty is added to encourage the parameters of each label (e.g., `active_movement`) to be similar to its parent (e.g., `move`), and *regularization in the output space*, where a penalty is added if a child label occurs without its parent label (roughly speaking).
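The two penalties can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not MATCH's implementation: the function names are hypothetical, the weighting coefficients that scale each penalty are omitted, and the toy weights and probabilities are made up.

```python
def parameter_penalty(weights, parent_of):
    """Half the squared distance between each child's parameters and its parent's."""
    total = 0.0
    for child, parent in parent_of.items():
        total += 0.5 * sum((c - p) ** 2
                           for c, p in zip(weights[child], weights[parent]))
    return total

def output_penalty(probs, parent_of):
    """Penalise a child label that is predicted more strongly than its parent."""
    return sum(max(0.0, probs[child] - probs[parent])
               for child, parent in parent_of.items())

# One child-parent edge of the taxonomy, as in the example above.
parent_of = {"active_movement": "move"}

# Made-up label parameters and predicted probabilities for one paper.
weights = {"move": [1.0, 0.0], "active_movement": [0.0, 0.0]}
probs = {"move": 0.25, "active_movement": 0.75}

print(parameter_penalty(weights, parent_of))  # 0.5
print(output_penalty(probs, parent_of))       # 0.5
```

The output-space penalty is zero whenever the parent is predicted at least as strongly as the child, which is the "roughly speaking" version of a child label not occurring without its parent.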

The authors of `MATCH` were kind enough to include a CLI option, `--reg`, to toggle hypernymy regularization. `--reg 1` turns it on, and `--reg 0` turns it off.

In [None]:
# note: --reg 0 turns off hypernymy regularization
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode train --reg 0
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode eval

!python evaluation.py --results PeTaL/results/MATCH-PeTaL-labels.npy --targets PeTaL/test_labels.npy --train-labels PeTaL/train_labels.npy

[I 210624 15:11:07 main:32] Model Name: MATCH
[I 210624 15:11:07 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210624 15:11:08 main:47] Number of Labels: 124
[I 210624 15:11:08 main:48] Size of Training Set: 800
[I 210624 15:11:08 main:49] Size of Validation Set: 100
[I 210624 15:11:08 main:68] Training
[I 210624 15:11:14 models:142] SWA Initializing
[I 210624 15:11:35 models:110] 24 1024 train loss: 0.1316223 valid loss: 0.1447000 P@1: 0.21000 P@3: 0.19333 P@5: 0.17200 N@3: 0.19867 N@5: 0.20960 early stop: 0
[I 210624 15:11:59 models:110] 49 1024 train loss: 0.0376488 valid loss: 0.1427039 P@1: 0.37000 P@3: 0.26000 P@5: 0.24200 N@3: 0.28756 N@5: 0.31248 early stop: 0
[I 210624 15:12:22 models:110] 74 1024 train loss: 0.0113640 valid loss: 0.1442004 P@1: 0.49000 P@3: 0.39000 P@5: 0.31600 N@3: 0.42163 N@5: 0.42813

## Study: Effect of Dataset Size on MATCH Performance

| Train set size | P@1=nDCG@1 | P@3 | P@5 | nDCG@3 | nDCG@5 |
| --- | --- | --- | --- | --- | --- |
| 200 | 0.324 | 0.249 | 0.203 | 0.269 | 0.274 |
| 300 | 0.424 | 0.337 | 0.275 | 0.362 | 0.364 |
| 400 | 0.441 | 0.344 | 0.278 | 0.373 | 0.373 |
| 500 | 0.547 | 0.419 | 0.328 | 0.454 | 0.447 |
| 600 | 0.534 | 0.433 | 0.345 | 0.464 | 0.463 |
| 700 | 0.555 | 0.434 | 0.342 | 0.466 | 0.472 |
| 800 | 0.627 | 0.509 | 0.390 | 0.542 | 0.543 |

In [None]:
%cd PeTaL/
# Note: in order to vary training set size, I adjusted the --train parameter (currently 0.8 for 0.8 * 1000 = 800 training examples)
!python3 Split.py --train 0.8 --dev 0.1
%cd ..
# If the splitting went correctly, the first number in the wc output should be the number of training examples.
!wc PeTaL/train.json

/content/drive/Shareddrives/MATCH Attempt/MATCH/PeTaL
131
/content/drive/Shareddrives/MATCH Attempt/MATCH
    800  275452 4644161 PeTaL/train.json


In [None]:
# Slightly modified preprocess.sh

!python3 transform_data_PeTaL.py --dataset {DATASET}

# Preprocess the training texts and labels.
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb

# Preprocess the test texts and labels with the same vocabulary file.
!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy

[I 210625 17:01:33 preprocess:28] Vocab Size: 26834
[I 210625 17:01:33 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210625 17:01:33 preprocess:32] Size of Samples: 900
[I 210625 17:01:34 preprocess:28] Vocab Size: 26834
[I 210625 17:01:34 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210625 17:01:34 preprocess:32] Size of Samples: 100


In [None]:
# Slightly modified run_models.sh

!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210625 17:01:38 main:32] Model Name: MATCH
[I 210625 17:01:38 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210625 17:01:38 main:47] Number of Labels: 124
[I 210625 17:01:38 main:48] Size of Training Set: 800
[I 210625 17:01:38 main:49] Size of Validation Set: 100
[I 210625 17:01:38 main:66] Number of Edges: 101
[I 210625 17:01:38 main:68] Training
[I 210625 17:01:45 models:142] SWA Initializing
[I 210625 17:02:05 models:110] 24 1024 train loss: 0.1285587 valid loss: 0.1464014 P@1: 0.43000 P@3: 0.27667 P@5: 0.23800 N@3: 0.31060 N@5: 0.32622 early stop: 0
[I 210625 17:02:29 models:110] 49 1024 train loss: 0.0270746 valid loss: 0.1439123 P@1: 0.47000 P@3: 0.35667 P@5: 0.29600 N@3: 0.38437 N@5: 0.39816 early stop: 0
[I 210625 17:02:54 models:110] 74 1024 train loss: 0.0104609 valid loss: 0.1469670 P@1: 

Idea: use k-fold cross-validation to average out the variance between splits?
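
A minimal sketch of how k-fold splits could be generated over the 1,000 examples. It is not part of the MATCH repository or `Split.py`: the function name `k_fold_indices` and the choice of 10 folds are assumptions made for illustration.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal slices
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

# With 1,000 examples and 10 folds, each round holds out 100 examples.
splits = list(k_fold_indices(1000, 10))
```

Each example is held out exactly once, so averaging the metrics over the folds uses all of the data for evaluation. MATCH would need to be retrained once per fold.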

# Results of MATCH Quick Start

Running MATCH on the MAG-CS dataset as described in the paper.

In [None]:
!./preprocess.sh

[I 210617 15:44:06 preprocess:28] Vocab Size: 500000
[I 210617 15:44:06 preprocess:30] Getting Dataset: MAG/train_texts.txt Max Length: 500
tcmalloc: large alloc 2539503616 bytes == 0x558fac048000 @  0x7f91aba831e7 0x7f91a934bea1 0x7f91a93b0928 0x7f91a93b4070 0x7f91a93b45e5 0x7f91a944d40d 0x558ed8136d54 0x558ed8136a50 0x558ed81ab105 0x558ed81a54ae 0x558ed81383ea 0x558ed81aa7f0 0x558ed81a57ad 0x558ed81383ea 0x558ed81a63b5 0x558ed81a57ad 0x558ed81383ea 0x558ed81a63b5 0x558ed81a54ae 0x558ed8077e2c 0x558ed81a7bb5 0x558ed81a54ae 0x558ed8138c9f 0x558ed8138ea1 0x558ed81a7bb5 0x558ed813830a 0x558ed81a660e 0x558ed81a54ae 0x558ed8138a81 0x558ed8138ea1 0x558ed81a7bb5
[I 210617 15:46:44 preprocess:32] Size of Samples: 634874
[I 210617 15:47:41 preprocess:28] Vocab Size: 500000
[I 210617 15:47:41 preprocess:30] Getting Dataset: MAG/test_texts.txt Max Length: 500
[I 210617 15:47:56 preprocess:32] Size of Samples: 70533


In [None]:
!./run_models.sh

[I 210617 15:48:17 main:32] Model Name: MATCH
[I 210617 15:48:17 main:35] Loading Training and Validation Set
tcmalloc: large alloc 2539503616 bytes == 0x55777a22e000 @  0x7f4bd84601e7 0x7f4bd5d28ea1 0x7f4bd5d92b75 0x7f4bd5d9370e 0x7f4bd5e2c71e 0x55775d768d54 0x55775d768a50 0x55775d7dd105 0x55775d7d74ae 0x55775d76a3ea 0x55775d7d932a 0x55775d7d74ae 0x55775d76a3ea 0x55775d7d932a 0x55775d7d74ae 0x55775d76a3ea 0x55775d7d83b5 0x55775d7d74ae 0x55775d6a9e2c 0x55775d7d9bb5 0x55775d7d74ae 0x55775d76ac9f 0x55775d76aea1 0x55775d7d9bb5 0x55775d76a30a 0x55775d7d860e 0x55775d7d74ae 0x55775d76aa81 0x55775d76aea1 0x55775d7d9bb5 0x55775d7d74ae
tcmalloc: large alloc 2257362944 bytes == 0x5578127f4000 @  0x7f4bd84601e7 0x7f4bd5d28ea1 0x7f4bd5d8d928 0x7f4bd5d8da43 0x7f4bd5ddd2d4 0x7f4bd5e1cb90 0x55775d768d54 0x55775d768a50 0x55775d7dd105 0x55775d7d77ad 0x55775d76a3ea 0x55775d7d83b5 0x55775d85aec8 0x55775d850d8e 0x55775d840b95 0x55775d777a34 0x55775d7a8cc4 0x55775d769462 0x55775d7dc715 