## MassBank
Preparation of the MassBank test set for comparison with DeepEI. We want to mimic the DeepEI MassBank test set as closely as possible while respecting
the different training sets of SpecTUS and DeepEI. The DeepEI GitHub repository contains a list of test SMILES (msbk_smiles.json)
and a list of SMILES that were set aside as NIST overlap (neims_msbk_smiles.json). We take the complement of neims_msbk_smiles.json
within msbk_smiles.json and remove all SpecTUS training data from it. The remaining data are used as the test set against DeepEI.
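
The selection described above reduces to two set differences. A minimal sketch on toy data; the SMILES strings and variable names here are placeholders chosen for illustration, not values from the actual files:

```python
# Toy illustration only: all SMILES and variable names below are hypothetical.
deepei_test = {"CCO", "CCN", "c1ccccc1", "CC(=O)O"}  # stands in for msbk_smiles.json
nist_overlap = {"CCO"}                               # stands in for the NIST-overlap list
spectus_train = {"CCN"}                              # stands in for SpecTUS training SMILES

# Complement of the NIST-overlap list within the DeepEI test list
non_nist = deepei_test - nist_overlap

# Remove every compound SpecTUS was trained on
test_set = non_nist - spectus_train

print(sorted(test_set))
```

Set difference only matches identical strings, so both sides need to be canonicalized in the same way before the subtraction.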


In [None]:
import sys
sys.path.append("..")
from utils.spectra_process_utils import msp2jsonl, remove_stereochemistry_and_canonicalize
from matchms.importing import load_from_msp
from matchms.exporting import save_as_msp
import pandas as pd
from matchms import Spectrum
from pathlib import Path

#### Load all pretraining SMILES

In [None]:
data_path_synth1 = "../clean_paper/data/synth/neims_custom_gen/train.jsonl"
data_path_synth2 = "../data/datasets/4_8M/rassp_gen_rounded/train.jsonl"

df2 = pd.read_json(data_path_synth2, lines=True)
df1 = pd.read_json(data_path_synth1, lines=True)

In [None]:
synth1_smiles = set(df1.smiles)
synth2_smiles = set(df2.smiles)

#### Load all MassBank data
Canonicalize the SMILES and remove stereochemistry information.

In [None]:
mb_spectra = list(load_from_msp("../clean_paper/data/massbank/GCMS DB-Public-KovatsRI-VS3.msp", metadata_harmonization=False))

In [None]:
canon_mb_spectra = [Spectrum(mz=s.mz,
                             intensities=s.intensities,
                             metadata={"smiles": remove_stereochemistry_and_canonicalize(s.metadata["smiles"])})
                             for s in mb_spectra]

In [None]:
mb_smiles = {s.metadata["smiles"] for s in canon_mb_spectra}

In [None]:
len(mb_smiles), len(canon_mb_spectra)

(8651, 28008)

#### Load DeepEI test SMILES 

- Load all SMILES marked as MASSBANK SMILES in DeepEI.
- Load all SMILES marked as MASSBANK-NEIMS in DeepEI.

In [None]:
deepei_smiles = set(pd.read_json("../clean_paper/data/massbank/msbk_smiles.json")[0])
deepei_neims_smiles = set(pd.read_json("../clean_paper/data/massbank/msbk_smiles_neims.json")[0])

In [None]:
canon_deepei_smiles = {remove_stereochemistry_and_canonicalize(s) for s in deepei_smiles}
canon_deepei_neims_smiles = {remove_stereochemistry_and_canonicalize(s) for s in deepei_neims_smiles}

#### Load all NIST data

In [None]:
nist_train = set(pd.read_csv("../clean_paper/data/nist/train.smi", header=None, names=["smiles"]).smiles)
nist_test = set(pd.read_csv("../clean_paper/data/nist/test.smi", header=None, names=["smiles"]).smiles)
nist_valid = set(pd.read_csv("../clean_paper/data/nist/valid.smi", header=None, names=["smiles"]).smiles)
all_nist_smiles = nist_train.union(nist_test).union(nist_valid)

In [None]:
len(all_nist_smiles)

243304

#### Try to replicate the DeepEI non-NIST dataset
(as closely as possible)

In [None]:
deepei_non_neims_smiles = canon_deepei_smiles - canon_deepei_neims_smiles
print("Num of unique DeepEI non-NEIMS smiles:", len(deepei_non_neims_smiles))

Num of unique DeepEI non-NEIMS smiles: 3939


In [None]:
# print overlaps
print("non-neims deepei and synth1:", len(deepei_non_neims_smiles.intersection(synth1_smiles)))
print("non-neims deepei and synth2:", len(deepei_non_neims_smiles.intersection(synth2_smiles)))
print("non-neims deepei and NIST train:", len(deepei_non_neims_smiles.intersection(nist_train)))

non-neims deepei and synth1: 0
non-neims deepei and synth2: 24
non-neims deepei and NIST train: 2999
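
The same overlap counts can be produced by looping over named reference sets, which keeps the report consistent when more training sets are added. A sketch on toy sets; `report_overlaps` is a hypothetical helper that does not exist in the notebook or in `utils`:

```python
def report_overlaps(candidate, references):
    """Return {name: number of candidate SMILES also present in that reference set}."""
    return {name: len(candidate & ref) for name, ref in references.items()}


# Toy data only (hypothetical SMILES)
candidate = {"CCO", "CCN", "CCC"}
refs = {
    "synth1": {"CCO"},
    "synth2": set(),
    "NIST train": {"CCN", "CCC", "CCCC"},
}

counts = report_overlaps(candidate, refs)
for name, n in counts.items():
    print(f"{name}: {n}")
```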


In [None]:
mb_test_set = deepei_non_neims_smiles - nist_train - synth1_smiles - synth2_smiles

In [None]:
print("non-neims deepei without our training compounds:", len(mb_test_set))

non-neims deepei without our training compounds: 934


In [None]:
clean_non_neims_mb_spectra = [s for s in canon_mb_spectra if s.metadata["smiles"] in mb_test_set]

In [None]:
print("Num of spectra in the non-NEIMS DeepEI set:", len(clean_non_neims_mb_spectra))

Num of spectra in the non-NEIMS DeepEI set: 2632
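
The 2632 spectra belong to 934 unique SMILES, so a compound can contribute several spectra. One way to inspect that distribution is to count the kept spectra per SMILES. A sketch on toy records; the dictionaries stand in for matchms `Spectrum` metadata and all values are hypothetical:

```python
from collections import Counter

# Toy records standing in for spectrum metadata (hypothetical data)
records = [{"smiles": "CCO"}, {"smiles": "CCO"}, {"smiles": "CCN"}, {"smiles": "CCC"}]
test_smiles = {"CCO", "CCN"}

# Keep only records whose SMILES is in the test set, as in the filtering cell
kept = [r for r in records if r["smiles"] in test_smiles]

# Count how many spectra each kept compound contributes
per_compound = Counter(r["smiles"] for r in kept)

print(len(kept), "spectra for", len(per_compound), "compounds")
```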


In [None]:
# save the filtered spectra
save_as_msp(clean_non_neims_mb_spectra, "../clean_paper/data/massbank/deepei_non_nist_test.msp")

#### Convert the filtered MSP to JSONL

In [None]:
msp2jsonl(Path("../clean_paper/data/massbank/deepei_non_nist_test.msp"),
          do_preprocess=False,
          keep_spectra=True
          )

100%|██████████| 2632/2632 [00:00<00:00, 5656.76it/s]
