# GrAFF-MS Experiments Notebook
This notebook calls various scripts that generate numbers for the tables in the paper. 

This requires `preprocess-nist.py` and the various preprocessing notebooks (`data/casmi-16/casmi-16.ipynb`, `data/chembl/chembl.ipynb`, `data/gnps/gnps.ipynb`, `data/cfm-id/cfm-id.ipynb`) have all been run, and trained GrAFF-MS and NEIMS models have been saved to checkpoints.

### Dataset sizes

In [1]:
%%bash

fs=(
    "./data/nist-20/hr_msms_nist_train.tsv"
    "./data/nist-20/hr_msms_nist_val.tsv"
    "./data/nist-20/hr_msms_nist_test.tsv"
    "./data/casmi-16/casmi-16.tsv"
    "./data/gnps/gnps.tsv"
    "./data/chembl/nist-20_chembl_decoys.tsv"
)

for f in ${fs[@]}; do
    num_structures=$(cut -f2 ${f} | sort | uniq | wc -l)
    num_spectra=$(wc -l ${f} | cut -f1 -d' ')
    echo ${f}:' '${num_spectra}' spectra, '${num_structures}' structures'
done

./data/nist-20/hr_msms_nist_train.tsv: 287995 spectra, 18665 structures
./data/nist-20/hr_msms_nist_val.tsv: 36265 spectra, 2346 structures
./data/nist-20/hr_msms_nist_test.tsv: 4424 spectra, 1632 structures
./data/casmi-16/casmi-16.tsv: 166 spectra, 151 structures
./data/gnps/gnps.tsv: 707 spectra, 636 structures
./data/chembl/chembl_decoy_library.tsv: 1262025 spectra, 221502 structures


### Model checkpoints
This set somes variables in a way that persists across Jupyter cells.

In [1]:
%%bash

GRAFF_PATH=$(ls -t ./lightning_logs/graff-ms/*/checkpoints/*.ckpt | head -n1)
NEIMS_PATH=$(ls -t ./lightning_logs/neims/*/checkpoints/*.ckpt | head -n1)

echo $GRAFF_PATH > /tmp/graff_path
echo $NEIMS_PATH > /tmp/neims_path

echo $GRAFF_PATH
echo $NEIMS_PATH

./lightning_logs/graff-ms/version_0/checkpoints/epoch=96-step=27257.ckpt
./lightning_logs/neims/version_0/checkpoints/epoch=35-step=103643.ckpt


# NIST-20 spectrum prediction

### GrAFF-MS

In [3]:
%%bash

python run-graff-ms.py \
    $(cat /tmp/graff_path) \
    ./data/nist-20/hr_msms_nist_test.tsv \
    ./data/graff-ms/nist-20/hr_msms_nist_test_graff-ms.msp

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/graff-ms/version_0/checkpoints/epoch=96-step=27257.ckpt... done
Precomputing graphs... 
done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-797c0506-9259-a78a-d225-0bcdda9110f6]
2023-07-21 19:28:17.966479: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Predicting DataLoader 0: 100%|##########| 9/9 [00:02<00:00,  3.01it/s]
done
Time elapsed: 15.208698987960815 seconds
Exporting to MSP... done


In [4]:
%%bash

python cosine-similarity.py \
    ./data/graff-ms/nist-20/hr_msms_nist_test_graff-ms.msp \
    ./data/nist-20/hr_msms_nist_test.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.71 +- 0.01 (N=4424)


### NEIMS

In [5]:
%%bash

python run-neims.py \
    $(cat /tmp/neims_path) \
    ./data/nist-20/hr_msms_nist_test.tsv \
    ./data/neims/nist-20/hr_msms_nist_test_neims.msp

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/neims/version_0/checkpoints/epoch=35-step=103643.ckpt... done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-797c0506-9259-a78a-d225-0bcdda9110f6]
2023-07-21 19:28:44.668068: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Predicting DataLoader 0: 100%|##########| 9/9 [00:01<00:00,  5.55it/s]
done
Time elapsed: 8.540608644485474 seconds
Exporting to MSP... done


In [6]:
%%bash

python cosine-similarity.py \
    ./data/neims/nist-20/hr_msms_nist_test_neims.msp \
    ./data/nist-20/hr_msms_nist_test.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.60 +- 0.01 (N=4424)


### CFM-ID

In [7]:
%%bash 

python cosine-similarity.py \
    ./data/cfm-id/nist-20/hr_msms_nist_test_cfm-id.msp \
    ./data/nist-20/hr_msms_nist_test.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.53 +- 0.01 (N=4404)


# CASMI-16 spectrum prediction

### GrAFF-MS

In [8]:
%%bash

python run-graff-ms.py \
    $(cat /tmp/graff_path) \
    ./data/casmi-16/casmi-16.tsv \
    ./data/graff-ms/casmi-16/casmi-16_graff-ms.msp

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/graff-ms/version_0/checkpoints/epoch=96-step=27257.ckpt... done
Precomputing graphs... 
done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-797c0506-9259-a78a-d225-0bcdda9110f6]
2023-07-21 19:29:13.559325: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Predicting DataLoader 0: 100%|##########| 1/1 [00:00<00:00,  1.87it/s]
done
Time elapsed: 6.52911114692688 seconds
Exporting to MSP... done


In [9]:
%%bash 

python cosine-similarity.py \
    ./data/graff-ms/casmi-16/casmi-16_graff-ms.msp \
    ./data/casmi-16/casmi-16.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.78 +- 0.05 (N=166)


### NEIMS

In [10]:
%%bash

python run-neims.py \
    $(cat /tmp/neims_path) \
    ./data/casmi-16/casmi-16.tsv \
    ./data/neims/casmi-16/casmi-16_neims.msp

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/neims/version_0/checkpoints/epoch=35-step=103643.ckpt... done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-797c0506-9259-a78a-d225-0bcdda9110f6]
2023-07-21 19:29:30.507382: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Predicting DataLoader 0: 100%|##########| 1/1 [00:00<00:00,  2.04it/s]
done
Time elapsed: 6.830160856246948 seconds
Exporting to MSP... done


In [11]:
%%bash 

python cosine-similarity.py \
    ./data/neims/casmi-16/casmi-16_neims.msp \
    ./data/casmi-16/casmi-16.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.57 +- 0.06 (N=166)


### CFM-ID

In [12]:
%%bash

python cosine-similarity.py \
    ./data/cfm-id/casmi-16/casmi-16_cfm-id.msp \
    ./data/casmi-16/casmi-16.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.71 +- 0.04 (N=166)


# GNPS spectrum prediction

### GrAFF-MS

In [17]:
%%bash 

python run-graff-ms.py \
    $(cat /tmp/graff_path) \
    ./data/gnps/gnps.tsv \
    ./data/graff-ms/gnps/gnps_graff-ms.msp

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/graff-ms/version_0/checkpoints/epoch=96-step=27257.ckpt... done
Precomputing graphs... 
done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-797c0506-9259-a78a-d225-0bcdda9110f6]
2023-07-21 19:33:00.133964: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Predicting DataLoader 0: 100%|##########| 2/2 [00:00<00:00,  2.14it/s]
done
Time elapsed: 10.325887203216553 seconds
Exporting to MSP... done


In [18]:
%%bash 

python cosine-similarity.py \
    ./data/graff-ms/gnps/gnps_graff-ms.msp \
    ./data/gnps/gnps.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.40 +- 0.03 (N=707)


### NEIMS

In [19]:
%%bash 

python run-neims.py \
    $(cat /tmp/neims_path) \
    ./data/gnps/gnps.tsv \
    ./data/neims/gnps/gnps_neims.msp

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/neims/version_0/checkpoints/epoch=35-step=103643.ckpt... done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-797c0506-9259-a78a-d225-0bcdda9110f6]
2023-07-21 19:33:20.809674: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


Predicting DataLoader 0: 100%|##########| 2/2 [00:00<00:00,  3.28it/s]
done
Time elapsed: 6.752257823944092 seconds
Exporting to MSP... done


In [20]:
%%bash 

python cosine-similarity.py \
    ./data/neims/gnps/gnps_neims.msp \
    ./data/gnps/gnps.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.28 +- 0.02 (N=707)


### CFM-ID

In [21]:
%%bash

python cosine-similarity.py \
    ./data/cfm-id/gnps/gnps_cfm-id.msp \
    ./data/gnps/gnps.msp

Loading predicted spectra... done
Loading ground truth spectra... done
Calculating cosine similarity... done
Mean cosine similarity = 0.37 +- 0.02 (N=676)


# NIST-20/ChEMBL library search

### GrAFF-MS

In [None]:
%%bash

python run-graff-ms.py \
    $(cat /tmp/graff_path) \
    ./data/chembl/nist-20_chembl_decoys.tsv \
    ./data/graff-ms/chembl/nist-20_chembl_decoys_graff-ms.msp \
    --min_probability 0.001

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/graff-ms/version_0/checkpoints/epoch=96-step=27257.ckpt... done
Precomputing graphs... 


In [4]:
%%bash 

python sls-recall.py \
    ./data/nist-20/hr_msms_nist_test.msp \
    ./data/graff-ms/chembl/nist-20_chembl_decoys_graff-ms.msp

Loading queries... done
Loading library... done
Querying 4424 against 1262025 spectra
Prefiltering matches... done
Calculating cosine similarity... done
Deduplicating library matches... done
Calculating recall... done
Structure recall @ 1 = 0.37 +- 0.01
Structure recall @ 5 = 0.67 +- 0.01
Structure recall @ 10 = 0.75 +- 0.01
Formula recall @ 1 = 0.52 +- 0.02
Formula recall @ 5 = 0.76 +- 0.01
Formula recall @ 10 = 0.83 +- 0.01


### NEIMS

In [6]:
%%bash

python run-neims.py \
    $(cat /tmp/neims_path) \
    ./data/chembl/nist-20_chembl_decoys.tsv \
    ./data/neims/nist-20/nist-20_chembl_decoys_neims.msp \
    --min_probability 0.001

Global seed set to 0


Reading queries... done
Generating metadata... done
Loading model ./lightning_logs/neims/version_0/checkpoints/epoch=35-step=103643.ckpt... done


Multiprocessing is handled by SLURM.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-b1be1dc7-be01-efdc-bbda-e03573d61cc4]
2023-07-24 22:50:52.635708: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Task exception was never retrieved
future: <Task finished name='Task-32' coro=<ScriptMagics.shebang.<locals>._handle_stream() done, defined at /state/partition1/llgrid/pkg/anaconda/anaconda3-2022b/lib/python3.8/site-packages/IPython/core/magics/script.py:211> exception=ValueError('Separator is not found, and chunk exceed the limit')>
Traceback (most recent call last):
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2022b/lib

In [7]:
%%bash 

python sls-recall.py \
    ./data/nist-20/hr_msms_nist_test.msp \
    ./data/neims/chembl/nist-20_chembl_decoys_neims.msp

Loading queries... done


Traceback (most recent call last):
  File "sls-recall.py", line 45, in <module>
    df_library = read_msp(args.library_path)
  File "/home/gridsan/mmurphy/projects/GrAFF-MS-Release/src/io.py", line 87, in read_msp
    with open(path,'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/neims/chembl/nist-20_chembl_decoys_neims.msp'


Loading library... 

CalledProcessError: Command 'b'\npython sls-recall.py \\\n    ./data/nist-20/hr_msms_nist_test.msp \\\n    ./data/neims/chembl/nist-20_chembl_decoys_neims.msp\n'' returned non-zero exit status 1.

### CFM-ID

In [None]:
%%bash

python sls-recall.py \
    ./data/nist-20/hr_msms_nist_test.msp \
    ./data/neims/chembl/nist-20_chembl_decoys_cfm-id.msp