# Run MATCH with PeTaL data

Created by Eric Kong on 21 June 2021.

In this notebook we run the MATCH algorithm (https://github.com/yuzhimanhua/MATCH) on Lens data labelled with PeTaL's taxonomy.

This notebook was originally run in Google Colaboratory with GPU acceleration.

In [1]:
import os

In [2]:
# If running on Google Colab: Mount drive and cd to workspace.
from google.colab import drive
drive.mount('/content/drive')

# NOTE: Replace DRIVE_PATH with the path you plan to clone MATCH into.
DRIVE_PATH = '/content/drive/Shareddrives/NASA'
%cd $DRIVE_PATH

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/Shareddrives/NASA


In [3]:
!nvidia-smi

Fri Jun 25 18:17:41 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    25W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
if not os.path.exists('MATCH/'):
    !git clone https://github.com/yuzhimanhua/MATCH.git
else:
    print("You have already cloned the MATCH repository.")

You have already cloned the MATCH repository.


In [5]:
%cd ./MATCH
!ls

/content/drive/Shareddrives/NASA/MATCH
configure      main.py	     PeTaL-062414      run_models.sh
deepxml        MeSH	     predictions.txt   transform_data_PeTaL.py
evaluation.py  PeTaL	     preprocess.py     transform_data.py
joint	       PeTaL-062309  preprocess.sh
LICENSE        PeTaL-062312  README.md
MAG	       PeTaL-062315  requirements.txt


In [8]:
# Install requirements in requirements.txt
!chmod 755 -R .
!pip3 install -r requirements.txt

Collecting torch==1.2.0
  Downloading https://files.pythonhosted.org/packages/05/65/5248be50c55ab7429dd5c11f5e2f9f5865606b80e854ca63139ad1a584f2/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (748.9MB)
     |████████████████████████████████| 748.9MB 25kB/s 
Collecting torchvision==0.4.0
  Downloading https://files.pythonhosted.org/packages/51/83/2d77d040e34bd8f70dcb4770f7eb7d0aa71e07738abf6831be863ade00db/torchvision-0.4.0-cp37-cp37m-manylinux1_x86_64.whl (8.8MB)
     |████████████████████████████████| 8.8MB 7.3MB/s 
Collecting torchgpipe==0.0.5
  Downloading https://files.pythonhosted.org/packages/47/ac/8c4f6d058e87403643c49c00bc18c97b553b6d6d60a295a4b9168710a93d/torchgpipe-0.0.5.tar.gz
Collecting click==7.0
  Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
     |████████████████████████████████| 81kB 11.0MB/s 
Collecting ruamel.yaml==

In [6]:
DATASET = "PeTaL"
MODEL = "MATCH"

In [7]:
# Slightly modified preprocess.sh

!python3 transform_data_PeTaL.py --dataset $DATASET

!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy

[I 210625 18:18:07 preprocess:28] Vocab Size: 26834
[I 210625 18:18:07 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210625 18:18:07 preprocess:32] Size of Samples: 900
[I 210625 18:18:10 preprocess:28] Vocab Size: 26834
[I 210625 18:18:10 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210625 18:18:10 preprocess:32] Size of Samples: 100


In [None]:
# Slightly modified run_models.sh

!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210625 17:21:49 main:32] Model Name: MATCH
[I 210625 17:21:49 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210625 17:21:49 main:47] Number of Labels: 124
[I 210625 17:21:49 main:48] Size of Training Set: 800
[I 210625 17:21:49 main:49] Size of Validation Set: 100
[I 210625 17:21:49 main:66] Number of Edges: 101
[I 210625 17:21:49 main:68] Training
[I 210625 17:21:56 models:142] SWA Initializing
[I 210625 17:22:16 models:110] 24 1024 train loss: 0.1276313 valid loss: 0.1360693 P@1: 0.53000 P@3: 0.32000 P@5: 0.26800 N@3: 0.37020 N@5: 0.38405 early stop: 0
[I 210625 17:22:41 models:110] 49 1024 train loss: 0.0285277 valid loss: 0.1314530 P@1: 0.57000 P@3: 0.42333 P@5: 0.34800 N@3: 0.46408 N@5: 0.47563 early stop: 0
[I 210625 17:23:05 models:110] 74 1024 train loss: 0.0110766 valid loss: 0.1356467 P@1: 

## Ablation study: Effect of adding MAG and MeSH labels to text

Relevant to PeTaL Labeller Issue #53 (https://github.com/nasa-petal/PeTaL-labeller/issues/53).

Note: To include or exclude MAG fields of study and MeSH terms, I edited the source file ./transform_data_PeTaL.py by hand. In future, it may be more convenient to add CLI options for this.
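
A minimal sketch of what such CLI options could look like, using `argparse`. The `--no-mag` and `--no-mesh` flags, the `build_text` helper, and the example tokens are all hypothetical; only the `--dataset` option appears in the commands in this notebook, and the real structure of transform_data_PeTaL.py may differ.

```python
import argparse


def build_parser():
    # Hypothetical options; only --dataset is used elsewhere in this notebook.
    parser = argparse.ArgumentParser(description="Transform PeTaL data for MATCH")
    parser.add_argument("--dataset", default="PeTaL", help="Dataset directory")
    parser.add_argument("--no-mag", action="store_true",
                        help="Leave MAG fields of study out of the text")
    parser.add_argument("--no-mesh", action="store_true",
                        help="Leave MeSH terms out of the text")
    return parser


def build_text(tokens, mag_terms, mesh_terms, use_mag=True, use_mesh=True):
    """Join the text tokens with whichever metadata terms are enabled."""
    parts = list(tokens)
    if use_mag:
        parts.extend(mag_terms)
    if use_mesh:
        parts.extend(mesh_terms)
    return " ".join(parts)


# Parse an explicit argument list so the sketch runs inside a notebook cell.
args = build_parser().parse_args(["--dataset", "PeTaL", "--no-mesh"])
text = build_text(["gecko", "adhesion"], ["MAG_biology"], ["MESH_D000268"],
                  use_mag=not args.no_mag, use_mesh=not args.no_mesh)
print(text)  # gecko adhesion MAG_biology
```

With flags like these, each ablation below could be selected from the command line instead of by editing the script between runs.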

### Attempt with 1000 epochs, step = 10

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode train --reg 1

[I 210623 22:45:20 main:32] Model Name: MATCH
[I 210623 22:45:20 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210623 22:45:20 main:47] Number of Labels: 124
[I 210623 22:45:20 main:48] Size of Training Set: 800
[I 210623 22:45:20 main:49] Size of Validation Set: 100
[I 210623 22:45:20 main:66] Number of Edges: 101
[I 210623 22:45:20 main:68] Training
[I 210623 22:45:25 models:110] 2 512 train loss: 0.2828015 valid loss: 0.1504736 P@1: 0.13000 P@3: 0.14333 P@5: 0.15200 N@3: 0.13877 N@5: 0.16712 early stop: 0
[I 210623 22:45:26 models:142] SWA Initializing
[I 210623 22:45:27 models:110] 4 1024 train loss: 0.1472312 valid loss: 0.1420337 P@1: 0.29000 P@3: 0.22000 P@5: 0.19200 N@3: 0.23735 N@5: 0.25692 early stop: 0
[I 210623 22:45:29 models:110] 7 512 train loss: 0.1413717 valid loss: 0.1408076 P@1: 0.290

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode eval

[I 210623 22:59:58 main:32] Model Name: MATCH
[I 210623 22:59:58 main:79] Loading Test Set
[I 210623 22:59:58 main:83] Size of Test Set: 100
[I 210623 22:59:58 main:85] Predicting
[I 210623 23:00:02 main:91] Finish Predicting


In [None]:
!python evaluation.py --results PeTaL/results/MATCH-PeTaL-labels.npy --targets PeTaL/test_labels.npy --train-labels PeTaL/train_labels.npy

Precision@1,3,5: 0.61 0.49666666666666665 0.382
nDCG@1,3,5: 0.61 0.5242603369757589 0.5284999798212123
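
For reference, a sketch of how Precision@k and nDCG@k are conventionally defined for multi-label ranking. This follows the standard definitions and is an assumption about what evaluation.py computes, since its source is not shown in this notebook; the function names and toy arrays are illustrative.

```python
import numpy as np


def precision_at_k(scores, targets, k):
    """Mean fraction of the k highest-scored labels that are true labels."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(targets, top_k, axis=1)
    return hits.sum(axis=1).mean() / k


def ndcg_at_k(scores, targets, k):
    """Mean DCG@k divided by the best DCG@k achievable for each sample."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(targets, top_k, axis=1)
    # Rank r (1-based) is discounted by 1 / log2(r + 1).
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (hits * discounts).sum(axis=1)
    # The ideal ranking places all true labels (up to k of them) first.
    n_true = np.minimum(targets.sum(axis=1), k).astype(int)
    ideal = np.array([discounts[:n].sum() for n in n_true])
    return np.mean(dcg / np.maximum(ideal, 1e-12))


# Toy example: one sample, four labels, true labels at indices 0 and 3.
scores = np.array([[0.9, 0.1, 0.8, 0.3]])
targets = np.array([[1, 0, 0, 1]])
p1 = precision_at_k(scores, targets, 1)
p3 = precision_at_k(scores, targets, 3)
n1 = ndcg_at_k(scores, targets, 1)
n3 = ndcg_at_k(scores, targets, 3)
print(p1, p3, n1, n3)
```

Under these definitions, P@1 and nDCG@1 coincide whenever every sample has at least one true label, which is consistent with the identical first values in the output above.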


### Attempt with 1000 epochs, step = 100

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode train --reg 1

[I 210623 23:14:10 main:32] Model Name: MATCH
[I 210623 23:14:10 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210623 23:14:10 main:47] Number of Labels: 124
[I 210623 23:14:10 main:48] Size of Training Set: 800
[I 210623 23:14:10 main:49] Size of Validation Set: 100
[I 210623 23:14:10 main:66] Number of Edges: 101
[I 210623 23:14:10 main:68] Training
[I 210623 23:14:17 models:142] SWA Initializing
[I 210623 23:14:32 models:110] 24 1024 train loss: 0.1303363 valid loss: 0.1415193 P@1: 0.43000 P@3: 0.26000 P@5: 0.18200 N@3: 0.30282 N@5: 0.27964 early stop: 0
[I 210623 23:14:50 models:110] 49 1024 train loss: 0.0259115 valid loss: 0.1349336 P@1: 0.54000 P@3: 0.37000 P@5: 0.28200 N@3: 0.41327 N@5: 0.40468 early stop: 0
[I 210623 23:15:07 models:110] 74 1024 train loss: 0.0111599 valid loss: 0.1365098 P@1: 

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode eval

[I 210623 23:26:32 main:32] Model Name: MATCH
[I 210623 23:26:32 main:79] Loading Test Set
[I 210623 23:26:32 main:83] Size of Test Set: 100
[I 210623 23:26:32 main:85] Predicting
[I 210623 23:26:35 main:91] Finish Predicting


In [None]:
!python evaluation.py --results PeTaL/results/MATCH-PeTaL-labels.npy --targets PeTaL/test_labels.npy --train-labels PeTaL/train_labels.npy

Precision@1,3,5: 0.67 0.49333333333333335 0.392
nDCG@1,3,5: 0.67 0.5408242591878821 0.5537992919593395


### Results with MAG labels, without MeSH labels

In [None]:
# !cp -r PeTaL-062315 PeTaL

!python3 transform_data_PeTaL.py --dataset $DATASET

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy

[I 210623 23:51:54 preprocess:28] Vocab Size: 26834
[I 210623 23:51:54 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210623 23:51:54 preprocess:32] Size of Samples: 900
[I 210623 23:51:55 preprocess:28] Vocab Size: 26834
[I 210623 23:51:55 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210623 23:51:55 preprocess:32] Size of Samples: 100


In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210623 23:53:26 main:32] Model Name: MATCH
[I 210623 23:53:26 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210623 23:53:26 main:47] Number of Labels: 124
[I 210623 23:53:26 main:48] Size of Training Set: 800
[I 210623 23:53:26 main:49] Size of Validation Set: 100
[I 210623 23:53:26 main:66] Number of Edges: 101
[I 210623 23:53:26 main:68] Training
[I 210623 23:53:32 models:142] SWA Initializing
[I 210623 23:53:47 models:110] 24 1024 train loss: 0.1282653 valid loss: 0.1520345 P@1: 0.27000 P@3: 0.21000 P@5: 0.18600 N@3: 0.22714 N@5: 0.25315 early stop: 0
[I 210623 23:54:05 models:110] 49 1024 train loss: 0.0306060 valid loss: 0.1493586 P@1: 0.40000 P@3: 0.29667 P@5: 0.24800 N@3: 0.32327 N@5: 0.33705 early stop: 0
[I 210623 23:54:23 models:110] 74 1024 train loss: 0.0121740 valid loss: 0.1538170 P@1: 

### Results without MAG labels, with MeSH labels

In [None]:
!python3 transform_data_PeTaL.py --dataset $DATASET

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy

[I 210624 00:25:18 preprocess:28] Vocab Size: 26834
[I 210624 00:25:18 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210624 00:25:18 preprocess:32] Size of Samples: 900
[I 210624 00:25:19 preprocess:28] Vocab Size: 26834
[I 210624 00:25:19 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210624 00:25:19 preprocess:32] Size of Samples: 100


In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210624 00:25:22 main:32] Model Name: MATCH
[I 210624 00:25:22 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210624 00:25:22 main:47] Number of Labels: 124
[I 210624 00:25:22 main:48] Size of Training Set: 800
[I 210624 00:25:22 main:49] Size of Validation Set: 100
[I 210624 00:25:22 main:66] Number of Edges: 101
[I 210624 00:25:22 main:68] Training
[I 210624 00:25:29 models:142] SWA Initializing
[I 210624 00:25:43 models:110] 24 1024 train loss: 0.1273809 valid loss: 0.1429620 P@1: 0.25000 P@3: 0.27667 P@5: 0.24200 N@3: 0.27716 N@5: 0.29998 early stop: 0
[I 210624 00:26:01 models:110] 49 1024 train loss: 0.0264492 valid loss: 0.1411267 P@1: 0.45000 P@3: 0.38000 P@5: 0.31400 N@3: 0.40256 N@5: 0.41598 early stop: 0
[I 210624 00:26:19 models:110] 74 1024 train loss: 0.0111127 valid loss: 0.1483654 P@1: 

### Results without MAG labels, without MeSH labels



In [None]:
!python3 transform_data_PeTaL.py --dataset $DATASET

In [None]:
!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy

Traceback (most recent call last):
  File "preprocess.py", line 8, in <module>
    from deepxml.data_utils import build_vocab, convert_to_binary
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 724, in exec_module
  File "<frozen importlib._bootstrap_external>", line 818, in get_code
  File "<frozen importlib._bootstrap_external>", line 917, in get_data
KeyboardInterrupt
Traceback (most recent call last):
  File "preprocess.py", line 8, in <module>
    from deepxml.data_utils import build_vocab, convert_to_binary
  File "/content/drive/Shareddrives/MATCH Attempt/MATCH/deepxml/data_utils.py", line 5, in <module>
    from sklearn.preprocessing import MultiLabelBinarizer, normalize
  File "/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/__init__.py", line 6, in 

In [None]:
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy


Aborted!
[I 210624 21:33:03 main:32] Model Name: MATCH
[I 210624 21:33:03 main:79] Loading Test Set
[I 210624 21:33:03 main:83] Size of Test Set: 100
[I 210624 21:33:03 main:85] Predicting
[I 210624 21:33:06 main:91] Finish Predicting
Precision@1,3,5: 0.66 0.53 0.42
nDCG@1,3,5: 0.66 0.5620467187130749 0.5699775424270964


## Ablation study: Turn off hypernymy regularization

Investigating the effect of the hierarchy (PeTaL/taxonomy.txt).
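
As background, a rough sketch of the two hypernymy penalties as I understand them from the MATCH paper: one pulls a child label's parameters toward its parent's, the other penalizes a child label being predicted with higher probability than its parent. This is an illustration in NumPy, not the code in the MATCH repository; the function names, the `(parent, child)` index-pair representation of edges, and the toy values are all assumptions.

```python
import numpy as np


def output_space_penalty(probs, edges):
    """Sum of max(0, p_child - p_parent) over edges, averaged over samples."""
    parents = np.array([p for p, _ in edges])
    children = np.array([c for _, c in edges])
    violation = np.maximum(0.0, probs[:, children] - probs[:, parents])
    return violation.sum(axis=1).mean()


def parameter_space_penalty(weights, edges):
    """Half the squared distance between child and parent label weights."""
    parents = np.array([p for p, _ in edges])
    children = np.array([c for _, c in edges])
    diff = weights[children] - weights[parents]
    return 0.5 * np.sum(diff ** 2)


# Toy taxonomy: label 0 is the parent of labels 1 and 2.
edges = [(0, 1), (0, 2)]
probs = np.array([[0.2, 0.7, 0.1]])   # one sample; label 1 outranks its parent
weights = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
out_pen = output_space_penalty(probs, edges)
par_pen = parameter_space_penalty(weights, edges)
print(out_pen, par_pen)
```

In this toy case only the edge from label 0 to label 1 violates the hierarchy, so it alone contributes to the output-space penalty.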

In [None]:
# Note: --reg 0 turns off hypernymy regularization
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode train --reg 0
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/PeTaL.yaml --model-cnf configure/models/MATCH-PeTaL.yaml --mode eval

!python evaluation.py --results PeTaL/results/MATCH-PeTaL-labels.npy --targets PeTaL/test_labels.npy --train-labels PeTaL/train_labels.npy

[I 210624 15:11:07 main:32] Model Name: MATCH
[I 210624 15:11:07 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210624 15:11:08 main:47] Number of Labels: 124
[I 210624 15:11:08 main:48] Size of Training Set: 800
[I 210624 15:11:08 main:49] Size of Validation Set: 100
[I 210624 15:11:08 main:68] Training
[I 210624 15:11:14 models:142] SWA Initializing
[I 210624 15:11:35 models:110] 24 1024 train loss: 0.1316223 valid loss: 0.1447000 P@1: 0.21000 P@3: 0.19333 P@5: 0.17200 N@3: 0.19867 N@5: 0.20960 early stop: 0
[I 210624 15:11:59 models:110] 49 1024 train loss: 0.0376488 valid loss: 0.1427039 P@1: 0.37000 P@3: 0.26000 P@5: 0.24200 N@3: 0.28756 N@5: 0.31248 early stop: 0
[I 210624 15:12:22 models:110] 74 1024 train loss: 0.0113640 valid loss: 0.1442004 P@1: 0.49000 P@3: 0.39000 P@5: 0.31600 N@3: 0.42163 N@5: 0.42813

## Study: Effect of Dataset Size on MATCH Performance

| Train size |   P@1 |
|------------|-------|
|        200 | 0.324 |
|        300 | 0.424 |
|        400 | 0.441 |
|        500 | 0.547 |
|        600 | 0.534 |
|        700 | 0.555 |
|        800 | 0.627 |

In [None]:
# Note: I edited the hardcoded train-test split in PeTaL/Split.py before running this.
# It would probably be better to add a CLI option to specify the train-test split.
%cd PeTaL/
!python3 Split.py
%cd ..
!wc PeTaL/train.json

/content/drive/Shareddrives/MATCH Attempt/MATCH/PeTaL
131
/content/drive/Shareddrives/MATCH Attempt/MATCH
    800  275452 4644161 PeTaL/train.json
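
The training sizes in this study were set by editing the split script by hand. A minimal sketch of a split that takes the sizes as CLI options instead; the `--train`, `--test`, and `--seed` flags, the `split_records` helper, and the placeholder records are hypothetical and do not reflect the actual contents of Split.py.

```python
import argparse
import random


def split_records(records, n_train, n_test, seed=0):
    """Shuffle records reproducibly and return (train, test) lists."""
    if n_train + n_test > len(records):
        raise ValueError("n_train + n_test exceeds the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


# Hypothetical options; defaults here are illustrative only.
parser = argparse.ArgumentParser(description="Split a dataset into train and test sets")
parser.add_argument("--train", type=int, default=800, help="Number of training records")
parser.add_argument("--test", type=int, default=100, help="Number of test records")
parser.add_argument("--seed", type=int, default=0, help="Shuffle seed")
# Parse an explicit argument list so the sketch runs inside a notebook cell.
args = parser.parse_args(["--train", "200", "--test", "100"])

records = [{"id": i} for i in range(1000)]  # placeholder records
train, test = split_records(records, args.train, args.test, args.seed)
print(len(train), len(test))  # 200 100
```

Fixing the seed keeps the test set comparable across runs while only the training size changes.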


In [None]:
# Slightly modified preprocess.sh

!python3 transform_data_PeTaL.py --dataset $DATASET

!python preprocess.py \
--text-path {DATASET}/train_texts.txt \
--label-path {DATASET}/train_labels.txt \
--vocab-path {DATASET}/vocab.npy \
--emb-path {DATASET}/emb_init.npy \
--w2v-model {DATASET}/{DATASET}.joint.emb

!python preprocess.py \
--text-path {DATASET}/test_texts.txt \
--label-path {DATASET}/test_labels.txt \
--vocab-path {DATASET}/vocab.npy

[I 210625 17:01:33 preprocess:28] Vocab Size: 26834
[I 210625 17:01:33 preprocess:30] Getting Dataset: PeTaL/train_texts.txt Max Length: 500
[I 210625 17:01:33 preprocess:32] Size of Samples: 900
[I 210625 17:01:34 preprocess:28] Vocab Size: 26834
[I 210625 17:01:34 preprocess:30] Getting Dataset: PeTaL/test_texts.txt Max Length: 500
[I 210625 17:01:34 preprocess:32] Size of Samples: 100


In [None]:
# Slightly modified run_models.sh

!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode train --reg 1
!PYTHONFAULTHANDLER=1 python main.py --data-cnf configure/datasets/{DATASET}.yaml --model-cnf configure/models/{MODEL}-{DATASET}.yaml --mode eval

!python evaluation.py \
--results {DATASET}/results/{MODEL}-{DATASET}-labels.npy \
--targets {DATASET}/test_labels.npy \
--train-labels {DATASET}/train_labels.npy

[I 210625 17:01:38 main:32] Model Name: MATCH
[I 210625 17:01:38 main:35] Loading Training and Validation Set
  .format(sorted(unknown, key=str)))
  .format(sorted(unknown, key=str)))
[I 210625 17:01:38 main:47] Number of Labels: 124
[I 210625 17:01:38 main:48] Size of Training Set: 800
[I 210625 17:01:38 main:49] Size of Validation Set: 100
[I 210625 17:01:38 main:66] Number of Edges: 101
[I 210625 17:01:38 main:68] Training
[I 210625 17:01:45 models:142] SWA Initializing
[I 210625 17:02:05 models:110] 24 1024 train loss: 0.1285587 valid loss: 0.1464014 P@1: 0.43000 P@3: 0.27667 P@5: 0.23800 N@3: 0.31060 N@5: 0.32622 early stop: 0
[I 210625 17:02:29 models:110] 49 1024 train loss: 0.0270746 valid loss: 0.1439123 P@1: 0.47000 P@3: 0.35667 P@5: 0.29600 N@3: 0.38437 N@5: 0.39816 early stop: 0
[I 210625 17:02:54 models:110] 74 1024 train loss: 0.0104609 valid loss: 0.1469670 P@1: 

Idea: do k-fold cross-validation?
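
A minimal sketch of how the fold indices for k-fold cross-validation could be generated with the standard library alone. The `k_fold_indices` helper is hypothetical, and the sample count of 1000 and k = 10 are illustrative; training and evaluating MATCH on each fold is not shown.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(1000, 10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))  # 900 100
```

Each fold would then be written out as its own train/test files, and the per-fold P@k and nDCG@k averaged, which should give a steadier estimate than a single 100-sample test set.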

# Results of MATCH Quick Start

Running MATCH on the MAG-CS dataset as described in the paper.

In [None]:
!./preprocess.sh

[I 210617 15:44:06 preprocess:28] Vocab Size: 500000
[I 210617 15:44:06 preprocess:30] Getting Dataset: MAG/train_texts.txt Max Length: 500
tcmalloc: large alloc 2539503616 bytes == 0x558fac048000 @  0x7f91aba831e7 0x7f91a934bea1 0x7f91a93b0928 0x7f91a93b4070 0x7f91a93b45e5 0x7f91a944d40d 0x558ed8136d54 0x558ed8136a50 0x558ed81ab105 0x558ed81a54ae 0x558ed81383ea 0x558ed81aa7f0 0x558ed81a57ad 0x558ed81383ea 0x558ed81a63b5 0x558ed81a57ad 0x558ed81383ea 0x558ed81a63b5 0x558ed81a54ae 0x558ed8077e2c 0x558ed81a7bb5 0x558ed81a54ae 0x558ed8138c9f 0x558ed8138ea1 0x558ed81a7bb5 0x558ed813830a 0x558ed81a660e 0x558ed81a54ae 0x558ed8138a81 0x558ed8138ea1 0x558ed81a7bb5
[I 210617 15:46:44 preprocess:32] Size of Samples: 634874
[I 210617 15:47:41 preprocess:28] Vocab Size: 500000
[I 210617 15:47:41 preprocess:30] Getting Dataset: MAG/test_texts.txt Max Length: 500
[I 210617 15:47:56 preprocess:32] Size of Samples: 70533


In [None]:
!./run_models.sh

[I 210617 15:48:17 main:32] Model Name: MATCH
[I 210617 15:48:17 main:35] Loading Training and Validation Set
tcmalloc: large alloc 2539503616 bytes == 0x55777a22e000 @  0x7f4bd84601e7 0x7f4bd5d28ea1 0x7f4bd5d92b75 0x7f4bd5d9370e 0x7f4bd5e2c71e 0x55775d768d54 0x55775d768a50 0x55775d7dd105 0x55775d7d74ae 0x55775d76a3ea 0x55775d7d932a 0x55775d7d74ae 0x55775d76a3ea 0x55775d7d932a 0x55775d7d74ae 0x55775d76a3ea 0x55775d7d83b5 0x55775d7d74ae 0x55775d6a9e2c 0x55775d7d9bb5 0x55775d7d74ae 0x55775d76ac9f 0x55775d76aea1 0x55775d7d9bb5 0x55775d76a30a 0x55775d7d860e 0x55775d7d74ae 0x55775d76aa81 0x55775d76aea1 0x55775d7d9bb5 0x55775d7d74ae
tcmalloc: large alloc 2257362944 bytes == 0x5578127f4000 @  0x7f4bd84601e7 0x7f4bd5d28ea1 0x7f4bd5d8d928 0x7f4bd5d8da43 0x7f4bd5ddd2d4 0x7f4bd5e1cb90 0x55775d768d54 0x55775d768a50 0x55775d7dd105 0x55775d7d77ad 0x55775d76a3ea 0x55775d7d83b5 0x55775d85aec8 0x55775d850d8e 0x55775d840b95 0x55775d777a34 0x55775d7a8cc4 0x55775d769462 0x55775d7dc715 