# Preparing datasets

In this notebook we describe and demonstrate data acquisition, synthetic spectra generation and dataset composition. Training our model requires three datasets: the experimentally measured commercial NIST GC-EI-MS library (we use NIST20) and two much bigger synthetic datasets generated by the open-source models NEIMS and RASSP. The NIST dataset can be purchased from the distributors listed [here](https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:distributors); the synthetic datasets can be either generated with this notebook or downloaded from us (see **Precomputed datasets** below).

The only information about each molecule we need for training our model is:
    
```text
- SMILES:              "CCCCCCCCC(CCCC)O[Si]1(C)CCCCC1"
- list of m/z values:  [26, 27, 28, 31, ..., 200, 201, 255, 256, 257]
- list of intensities: [21.98, 335.7, 49.95, 737.34, ..., 23.98, 5.99, 78.93, 20.98, 5.0]
```

### 1) NIST dataset
This notebook first creates training, test and validation splits for NIST and converts them into `.jsonl` files. Each line of these files is a `json` with SMILES, m/z and intensities as keys representing a single molecule.
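
As a minimal sketch of this record format: the key names `smiles`, `mz` and `intensity` are the ones used later in this notebook, while the helper names and the peak values below are made up for illustration. One molecule can be written to and read from a single JSONL line roughly like this:

```python
import json

def to_jsonl_line(smiles, mz, intensity):
    """Serialize one molecule as a single JSONL line (hypothetical helper)."""
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    record = {"smiles": smiles, "mz": list(mz), "intensity": list(intensity)}
    return json.dumps(record)

def from_jsonl_line(line):
    """Parse one JSONL line back into (smiles, mz, intensity)."""
    record = json.loads(line)
    return record["smiles"], record["mz"], record["intensity"]

# illustrative values only, not a measured spectrum
line = to_jsonl_line("CCO", [27, 29, 31, 45, 46], [22.4, 29.8, 999.0, 51.5, 21.6])
smiles, mz, intensity = from_jsonl_line(line)
```

Keeping one JSON object per line means the files can be streamed or split by line count without parsing the whole dataset.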

### 2) SMILES acquisition from ZINC
Second, we download a subset of the ZINC library (query *2d-standard-annotated-druglike*) that contains SMILES strings but neither m/z values nor intensities. We then randomly sample 30M SMILES strings, which form the base for synthetic generation.

### 3) RASSP synthetic spectra generation
Third, we generate synthetic spectra using the RASSP model. The model's molecular restrictions reduce the number of generated spectra to about 4.7M (4,683,318 in our run). These spectra are split into training, validation and test sets.

### 4) NEIMS synthetic spectra generation
In the fourth step we use the NEIMS model to generate synthetic spectra for the roughly 4.7M molecules permitted by RASSP (NEIMS does not have such strict molecular restrictions). These spectra are divided into **the same** training, validation and test splits as RASSP.

### 5) Dataleaks elimination (TODO: clarify or remove; data leaks are already handled in step 3)
Finally we ensure there are no data leaks between the NIST validation and test sets (which serve as the primary evaluation sets) and all the training data (RASSP train set, NEIMS train set and synthetic train sets).
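
A leak check of this kind can be sketched as follows. This is only an illustration under the assumption that all SMILES strings were canonicalized the same way, so string equality means the same molecule; `remove_leaks` is a hypothetical helper, not a function from this repository.

```python
def remove_leaks(train_records, eval_smiles):
    """Drop training records whose SMILES occurs in the evaluation sets.

    train_records: list of dicts with a "smiles" key
    eval_smiles:   iterable of SMILES from the validation and test sets
    Returns the filtered records and the number of removed records.
    """
    eval_set = set(eval_smiles)
    kept = [rec for rec in train_records if rec["smiles"] not in eval_set]
    return kept, len(train_records) - len(kept)

# toy data for illustration
toy_train = [{"smiles": "CCO"}, {"smiles": "CCN"}, {"smiles": "c1ccccc1"}]
toy_eval = ["CCN", "CO"]
clean_train, n_removed = remove_leaks(toy_train, toy_eval)
```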

### Precomputed datasets
Synthetic spectra generation in full scale is very computationally intensive and requires a lot of resources. Therefore, we provide the prepared synthetic dataset in the form of a zip file that can be downloaded from the following link: ....

If you choose to use the precomputed datasets for training, you can perform step 1 with your own copy of NIST, skip steps 2-4 and proceed directly to the data leak elimination in step 5.
If you choose to generate the synthetic datasets yourself, you can go through all the steps and process the whole dataset instead of just the toy example presented.

---------------------------------------------------------------------------

## 1) NIST dataset

### 1.1) NIST cleaning and splitting
The core part of this section is altered from a [notebook](https://github.com/Jozefov/mass-spectra-prediction-GCN/blob/master/Notebooks/data_preprocessing.ipynb) created by Filip Jozefov.

- in this subsection we inspect the missing identifiers in the NIST dataset
- we drop ~60k NIST spectra that have no proper identifier (SMILES, InChI or InChIKey)
- we canonicalize the SMILES strings and remove stereochemistry information
- we split the remaining data in a 0.8:0.1:0.1 ratio (train:valid:test)

In [3]:
import sys
sys.path.append('..')

from matchms.importing import load_from_msp
from matchms.exporting import save_as_msp
from matchms import Spectrum
import matchms

import pandas as pd
from rdkit import Chem

import numpy as np
import os
import random
from tqdm import tqdm

import matplotlib.pyplot as plt

from utils.spectra_process_utils import remove_stereochemistry_and_canonicalize

tqdm.pandas()

In [6]:
PROJECT_ROOT = '/home/xhajek9/gc-ms_bart/clean_paper' # TODO: change this to the path of your project root
NIST_PATH = '../../data/datasets/NIST/NIST20/20210925_NIST_EI_MS_cleaned.msp' # TODO: change this to the path of your NIST20 dataset
SEED = 42

In [7]:
nist_dataset = list(load_from_msp(NIST_PATH, metadata_harmonization=False))

In [None]:
np.random.seed(SEED)
random.seed(SEED)

#### Inspection of missing identifiers

In [None]:
# count how often each identifier is missing in the dataset
def count_all(dataset):

    all_data = 0

    no_smiles = 0
    no_inchikey = 0
    no_inchi = 0

    no_smile_only = 0         # SMILES missing, InChI present
    no_inchikey_only = 0      # InChI missing, SMILES present
    both_missing_counter = 0  # SMILES and InChI both missing

    all_identifier_missing = 0

    for obj in dataset:
        smiles = obj.get('smiles')
        inchikey = obj.get('inchikey')
        inchi = obj.get('inchi')

        if smiles is None:
            no_smiles += 1
        if inchikey is None:
            no_inchikey += 1
        if inchi is None:
            no_inchi += 1
        if smiles is None and inchi is not None:
            no_smile_only += 1
        if smiles is not None and inchi is None:
            no_inchikey_only += 1
        if smiles is None and inchi is None:
            both_missing_counter += 1
        if smiles is None and inchikey is None and inchi is None:
            all_identifier_missing += 1
        all_data += 1
    return (all_data, no_smiles, no_inchikey, no_inchi, no_smile_only, no_inchikey_only,
            both_missing_counter, all_identifier_missing)

In [None]:
all_data, no_smiles, no_inchikey, no_inchi, no_smile_only, no_inchikey_only,\
            both_missing_counter, all_identifier_missing = count_all(nist_dataset)

unique_smiles = set([obj.get('smiles') for obj in nist_dataset if obj.get('smiles') != None])

In [None]:
print(f"We are currently working with {all_data - no_smiles} smiles, from which {len(unique_smiles)} are unique\n")
print(f"We are currently working with {all_data - no_inchikey} inchikeys\n")
print(f"We are currently working with {all_data - no_inchi} inchi\n")

In [None]:
# STATISTICS
data_missing = {
    'All data': [all_data],
    'No smiles': [no_smiles],
    'No inchikey': [no_inchikey],
    'No inchi': [no_inchi],
    'Smiles Only Missing': [no_smile_only],
    'Inchi Only Missing': [no_inchikey_only],
    'Both Missing': [both_missing_counter],
    'All three missing': [all_identifier_missing],
}
missing_df = pd.DataFrame(data_missing)

missing_df = missing_df.T
missing_df.columns = ["Count"]
missing_df["average"] = missing_df.apply(lambda row: row.Count / all_data, axis = 1)

missing_df

In [None]:
# STATISTICS VISUALIZATION

# x-coordinates of left sides of bars
parameters_missing = [i for i in range(len(missing_df))]

# heights of bars
height = [i for i in missing_df.Count]

# labels for bars
tick_label = ['All data', 'No smiles', 'No inchikey', 'No inchi', 'Smiles Only Missing',
              'Inchi Only Missing', 'Both Missing', 'All three missing']


plt.bar(parameters_missing, height, tick_label = tick_label,
        width = 0.8)

plt.xlabel('')
plt.xticks(rotation=70)
plt.ylabel('Number of Instances')
plt.title('Number of missing data')
plt.show()

#### Identifier reconstruction

Approximately 60k spectra are unusable for us: they include no identifier (InChI, InChIKey or SMILES), so we drop them. SMILES is crucial for us, so for molecules that lack a valid SMILES string but include another identifier we try to reconstruct it. If that fails, we drop the molecule too.

In [3]:
# Filter the dataset and try to restore corrupted or missing SMILES,
# first with RDKit and then with matchms
# we also add SMILES destereochemicalization and canonicalization (next step)

def reconstruct_information(dataset):
    updated_dataset = []
    for spectrum in tqdm(dataset):
        smiles = spectrum.get('smiles')
        inchi = spectrum.get('inchi')
        # drop spectra with no identifier at all
        if smiles is None and spectrum.get('inchikey') is None and inchi is None:
            continue

        # check whether the SMILES is syntactically valid and the molecule chemically reasonable
        smiles_invalid = (smiles is None
                          or Chem.MolFromSmiles(smiles, sanitize=False) is None
                          or Chem.MolFromSmiles(smiles) is None)
        if smiles_invalid and inchi is not None:
            # try to convert from InChI with RDKit
            tmp = Chem.inchi.MolFromInchi(inchi)
            if tmp is not None:
                spectrum.set('smiles', Chem.MolToSmiles(tmp))
                smiles = spectrum.get('smiles')

        # fall back to matchms
        if smiles is None and inchi is not None:
            spectrum = matchms.filtering.derive_smiles_from_inchi(spectrum)
            smiles = spectrum.get('smiles')

        if smiles is None:
            continue

        updated_dataset.append(spectrum)
    return updated_dataset

In [None]:
reconstructed_dataset = reconstruct_information(nist_dataset)

In [None]:
print(f"{len(reconstructed_dataset)} / {len(nist_dataset)} molecules remain in the dataset and all of them now have SMILES strings")


#### Remove stereochemistry and canonicalize smiles

In [None]:
def remove_stereochemistry_and_canonicalize_whole_dataset(dataset):
    updated_dataset = []
    counter_smiles_changed = 0
    for i, spectrum in enumerate(dataset):
        smiles = spectrum.get('smiles')
        if smiles is None:
            raise ValueError("Smiles is None, reconstruction and filtering poorly done.")
        new_smiles = remove_stereochemistry_and_canonicalize(smiles)
        if new_smiles is None:
            continue
        spectrum.set('smiles', new_smiles)
        if new_smiles != smiles:
            counter_smiles_changed += 1
        updated_dataset.append(spectrum)
    print(f"Number of smiles canonicalized or destereochemicalized: {counter_smiles_changed}")
    return updated_dataset

In [None]:
canonicalized_dataset = remove_stereochemistry_and_canonicalize_whole_dataset(reconstructed_dataset)

In [None]:
unique_smiles = set([obj.get('smiles') for obj in canonicalized_dataset if obj.get('smiles') != None])

In [None]:
print(f"{len(canonicalized_dataset)} / {len(nist_dataset)} molecules remain in the dataset and all of them now have canonical SMILES strings")
print(f"\nFrom the remaining there are {len(unique_smiles)} unique SMILES strings.")

#### Splitting

In [None]:
TRAIN_RATIO = 0.8
VALID_RATIO = 0.1
TEST_RATIO = 0.1

TRAIN_INDEX = 0
VALID_INDEX = 1
TEST_INDEX = 2

In [None]:
# map each spectrum to its smiles
def unique_mapping(dataset):

    smiles_dict = dict()
    counter_none = 0

    for spectrum in dataset:
        if "smiles" not in spectrum.metadata or spectrum.get("smiles") == None:
            counter_none += 1
            continue
        if spectrum.get("smiles") not in smiles_dict:
            smiles_dict[spectrum.get("smiles")] = [spectrum]
        else:
            smiles_dict[spectrum.get("smiles")].append(spectrum)

    print(f"Missing smiles identifier in {counter_none} cases")
    return smiles_dict

In [None]:
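# generate shuffled split labels (0 = train, 1 = valid, 2 = test), one per unique SMILES
# note: int() truncates, so the label array can be a few entries shorter than the dataset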
def generate_index(dataset, train_ratio, valid_ratio, test_ratio):
    dataset_length = len(dataset)

    train_idx = np.full(int(dataset_length * train_ratio), 0, dtype=int)
    valid_idx = np.full(int(dataset_length * valid_ratio), 1, dtype=int)
    test_idx = np.full(int(dataset_length * test_ratio), 2, dtype=int)

    concatenate_array = np.concatenate((train_idx, valid_idx, test_idx))

    np.random.shuffle(concatenate_array)

    return concatenate_array

In [None]:
# build the train, valid and test lists
# all spectra that share a SMILES go to the same split, so duplicates never
# end up in different splits
# at the end, the lists are shuffled to avoid a continuous stream of the same molecule
def generate_train_test_dataset(dataset, indices):

    train = []
    valid = []
    test = []

    # dataset maps SMILES -> list of spectra; iterating over it yields the SMILES keys
    for i, smiles in zip(indices, dataset):
        if i == TRAIN_INDEX:
            train.extend(dataset[smiles])
        elif i == VALID_INDEX:
            valid.extend(dataset[smiles])
        elif i == TEST_INDEX:
            test.extend(dataset[smiles])

    random.shuffle(train)
    random.shuffle(valid)
    random.shuffle(test)
    return (train, valid, test)

In [None]:
# saving list in msp format
def save_dataset(dataset, path, name):
    # makes all intermediate-level directories needed to contain the leaf directory
    os.makedirs(path, mode=0o777, exist_ok=True)
    save_as_msp(dataset, f"{path}/{name}.msp")

In [None]:
# Perform splitting
nist_dict = unique_mapping(canonicalized_dataset)
DATASET_LENGTH = len(nist_dict)

indices = generate_index(nist_dict, TRAIN_RATIO, VALID_RATIO, TEST_RATIO)
train, valid, test = generate_train_test_dataset(nist_dict, indices)

#### Save splits to .msp

In [None]:
# Caution: this cell writes the splits to disk; rerun it only if you intend to replace or modify an existing dataset.

save_dir = PROJECT_ROOT + "/data/nist"

save_dataset(train, save_dir, "train")
save_dataset(test, save_dir, "test")
save_dataset(valid, save_dir, "valid")

In [None]:
all_data, no_smiles, no_inchikey, no_inchi, no_smile_only, no_inchikey_only,\
            both_missing_counter, all_identifier_missing = count_all(canonicalized_dataset)

unique_smiles = set([obj.get('smiles') for obj in canonicalized_dataset if obj.get('smiles') != None])

In [None]:
print(f"Number of new unique smiles is {len(set(unique_smiles))}")

In [None]:
missing_df["Update count"] = [all_data, no_smiles, no_inchikey, no_inchi, no_smile_only, no_inchikey_only,\
            both_missing_counter, all_identifier_missing]

missing_df["Update average"] = missing_df.apply(lambda row: row["Update count"] / all_data, axis = 1)

missing_df

In [None]:
# no overlap test
len(train), len(valid), len(test), len(set(train) & set(valid)), len(set(train) & set(test)), len(set(valid) & set(test))
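
Because the split is made per unique SMILES, a check at the SMILES level may be more informative than comparing spectrum objects. The following is a minimal sketch with a hypothetical helper `smiles_overlap` and toy data; in this notebook the inputs would be lists such as `[s.get("smiles") for s in train]`.

```python
from itertools import combinations

def smiles_overlap(splits):
    """Count SMILES shared by every pair of splits.

    splits: dict mapping a split name to an iterable of SMILES strings
    Returns a dict mapping (name_a, name_b) to the number of shared SMILES.
    """
    sets = {name: set(smiles) for name, smiles in splits.items()}
    return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}

# toy example with one deliberate leak ("CCN" is in both train and test)
toy_splits = {
    "train": ["CCO", "CCN", "CCC"],
    "valid": ["c1ccccc1"],
    "test": ["CCN", "CO"],
}
overlap = smiles_overlap(toy_splits)
```

All counts should be zero for a correct split.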

### 1.2) NIST splits to .jsonl
In this part of the notebook we create a `.jsonl` containing only SMILES, m/z values and intensities for each molecule. This file will be used in the training process (the finetuning part).

In [1]:
import sys
from pathlib import Path
sys.path.append("..")

from utils.spectra_process_utils import msp2jsonl

In [None]:
for dataset_type in ["train", "valid", "test"]:
    dataset_path = Path(f"{PROJECT_ROOT}/data/nist")
    msp2jsonl(path_msp=dataset_path / f"{dataset_type}.msp",
              path_jsonl=dataset_path / f"{dataset_type}.jsonl",
              tokenizer = None,
              keep_spectra=True,
              do_preprocess=False)

### 1.3) NIST splits to .smi 
We also save the splits as .smi files for reproducibility.

In [12]:
from utils.spectra_process_utils import jsonl2smi

for dataset_type in ["train", "valid", "test"]:
    jsonl2smi(f"{PROJECT_ROOT}/data/nist/{dataset_type}.jsonl")

## 2) SMILES collection from ZINC

In this section we download a subset of the ZINC20 library (query *2d-standard-annotated-druglike*) that contains SMILES strings but neither m/z values nor intensities. We aim to end up with about 4M to 5M molecules in the synthetic dataset (empirically a good balance between computational cost and coverage). We observed that the RASSP filter (section 3) lets about 1/6 of the molecules through, so we need to sample 30M SMILES strings from ZINC. The steps to replicate our process are:

#### 2.1) Download the ZINC library

With our query *2d-standard-annotated-druglike* you download about 1.8B SMILES strings (101 GB). It is necessary to download all of them and sample afterwards so that the sample covers the whole chemical space. For the download you can use the `download_script.sh` in `PROJECT_ROOT/data/zinc/scripts`, which we obtained [here](https://zinc20.docking.org/tranches/home). The ZINC20 database is continuously updated; although the changes are small, they make the sampling process nondeterministic. If you want exactly the same synthetic dataset as we used, you can download it from us.

#### 2.2) Sample 40M SMILES strings
We further sample the SMILES strings randomly to ~40M, which creates a base for synthetic generation. This step also includes stripping the CSV header and removing the zinc_id column from the tranches.
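
The idea of this step can be sketched as below. This is not the code of `zinc_to_slice_of_smiles.py`; it assumes that a tranche file starts with a header row and holds whitespace-separated `smiles zinc_id` columns, and the ZINC ids in the toy data are made up.

```python
import random

def sample_smiles(lines, sample_ratio, rng):
    """Keep each data row with probability sample_ratio and return only the SMILES column."""
    sampled = []
    for i, line in enumerate(lines):
        if i == 0:  # skip the header row
            continue
        fields = line.split()
        if not fields:  # skip empty lines
            continue
        if rng.random() < sample_ratio:
            sampled.append(fields[0])  # drop the zinc_id column
    return sampled

# toy tranche with made-up ids
tranche = [
    "smiles zinc_id",
    "CCO ZINC000000000001",
    "c1ccccc1 ZINC000000000002",
    "CCN ZINC000000000003",
]
everything = sample_smiles(tranche, 1.0, random.Random(42))
nothing = sample_smiles(tranche, 0.0, random.Random(42))
```

Sampling every row independently keeps memory use constant, at the price of a sample size that is only approximately `sample_ratio` times the input size.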

#### 2.3) SMILES cleaning
Perform the following steps: filtering of corrupted SMILES, destereochemicalization, canonicalization, and filtering of long SMILES (over 100). At the end, concatenate the clean SMILES strings into a single file.

#### 2.4) Deduplicate, remove all NIST20 molecules and sample final 30M SMILES strings
Now we can deduplicate the SMILES strings, remove all NIST20 molecules to avoid direct data leaks, and sample the final 30M dataset.
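
On a small scale, this step can be sketched with the standard library only. `dedup_exclude_sample` is a hypothetical helper written for illustration; it sorts the candidates before sampling because the iteration order of a set of strings can differ between Python sessions, which would otherwise make the sample irreproducible even with a fixed seed.

```python
import random

def dedup_exclude_sample(smiles, excluded, sample_size, seed):
    """Deduplicate SMILES, drop excluded molecules and draw a reproducible sample."""
    unique = set(smiles) - set(excluded)
    if sample_size > len(unique):
        raise ValueError(f"cannot sample {sample_size} from {len(unique)} molecules")
    ordered = sorted(unique)  # fixed order, so the seed fully determines the sample
    return random.Random(seed).sample(ordered, sample_size)

# toy data: "CCO" is duplicated, "CCN" plays the role of a NIST20 molecule
pool = ["CCO", "CCO", "CCN", "CCC", "c1ccccc1", "CO"]
picked = dedup_exclude_sample(pool, {"CCN"}, 3, seed=42)
```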

In [21]:
import numpy as np
from pathlib import Path
import shutil
import os

#paths
TRANCHES_PATH = f"{PROJECT_ROOT}/data/zinc/tranches"
DOWNLOAD_SCRIPT_PATH = f"{PROJECT_ROOT}/data/zinc/scripts/download_script.sh"

TRANCHES_40M_PATH = f"{TRANCHES_PATH}_40M"
TRANCHES_40M_CLEAN_PATH = f"{TRANCHES_40M_PATH}_clean"
ALL_40M_CLEAN_SMILES_PATH = f"{TRANCHES_40M_CLEAN_PATH}/all_smiles.txt"
ALL_30M_CLEAN_SMILES_DIR = f"{PROJECT_ROOT}/data/zinc/30M/"
ALL_30M_CLEAN_SMILES_PATH = f"{ALL_30M_CLEAN_SMILES_DIR}/30M.smi"


# macro variables
SEED = 42
NUM_WORKERS = 32

### 2.1) Download the ZINC library

In [None]:
!mkdir -p {TRANCHES_PATH}
!cd {TRANCHES_PATH} && bash {DOWNLOAD_SCRIPT_PATH}

### 2.2) Sample 40M SMILES strings

In [None]:
# first script: make it 40M and only SMILES
!python ../data/zinc/scripts/zinc_to_slice_of_smiles.py --input-dir {TRANCHES_PATH} --output-dir {TRANCHES_40M_PATH} --sample-ratio 0.0222 --num-workers {NUM_WORKERS} --seed {SEED}

### 2.3) SMILES cleaning

In [None]:
# corrupted smiles filtering, destereochemicalization, SMILES canonicalization, long smiles filtering
!python ../data/zinc/scripts/clean_smiless.py --input-dir {TRANCHES_40M_PATH} --output-dir {TRANCHES_40M_CLEAN_PATH} --num-workers {NUM_WORKERS}

In [None]:
# concat all files in the 40M_clean folder to one file
os.environ['TRANCHES_40M_CLEAN_PATH'] = TRANCHES_40M_CLEAN_PATH
os.environ['ALL_40M_CLEAN_SMILES_PATH'] = ALL_40M_CLEAN_SMILES_PATH

!echo ${TRANCHES_40M_CLEAN_PATH}/* | xargs -I {} sh -c 'cat {} >> ${ALL_40M_CLEAN_SMILES_PATH}'

### 2.4) Deduplicate, remove all NIST20 molecules and sample final 30M SMILES strings

In [None]:
# prepare unique NIST SMILES
NIST_FOLDER = f"{PROJECT_ROOT}/data/nist"

nist_train = pd.read_json(f"{NIST_FOLDER}/train.jsonl", lines=True)
nist_valid = pd.read_json(f"{NIST_FOLDER}/valid.jsonl", lines=True)
nist_test = pd.read_json(f"{NIST_FOLDER}/test.jsonl", lines=True)

nist_unique_smiles = set(nist_train["smiles"]) | set(nist_valid["smiles"]) | set(nist_test["smiles"])

In [None]:
# deduplicate, remove all NIST20 molecules and sample final 30M SMILES strings

SAMPLE_SIZE = 30000000

Path(ALL_30M_CLEAN_SMILES_PATH).parent.mkdir(parents=True, exist_ok=True)

np.random.seed(SEED)
with open(ALL_40M_CLEAN_SMILES_PATH, "r") as inputf, open(ALL_30M_CLEAN_SMILES_PATH, "w") as outputf:

    print("## READING FILE")
    unique_smi = set(inputf.read().splitlines()) # read and deduplicate
    print(f"Length of deduplicated: {len(unique_smi)}")

    unique_smi.difference_update(nist_unique_smiles) # remove NIST20
    print(f"Length of deduplicated without NIST20: {len(unique_smi)}")

    print("## SAMPLING")
    # sort first: the iteration order of a set of strings can change between Python sessions,
    # so sampling from an unsorted set would not be reproducible even with a fixed seed
    unique_smi = np.array(sorted(unique_smi))
    sample = np.random.choice(unique_smi, SAMPLE_SIZE, replace=False)

    print(f"## WRITING (length: {len(sample)})")
    for smi in sample:
        outputf.write(smi + '\n')

### 2.5) Remove all the temporary files

In [None]:
# UNCOMMENT AND RUN ONLY IF YOU ARE SURE YOU HAVE A CORRECT DATASET IN THE 30M FOLDER

# shutil.rmtree(TRANCHES_PATH)
# shutil.rmtree(TRANCHES_40M_PATH)
# shutil.rmtree(TRANCHES_40M_CLEAN_PATH)

### Stats
```text
downloaded ZINC subset:                               1820667950      (3 hrs)
40M sample:                                             40418852      (3 min)
40M sample cleaned:                                     40418750      (12 min)
40M sample cleaned concatenated:                        40418750      (5 sec)
40M sample cleaned deduplicated:                        39577841
40M sample cleaned deduplicated without NIST20:         39575106      
30M final sample:                                       30000000      (2 min)
```
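
The sizes in this section follow from simple arithmetic, sketched below with the numbers quoted in the text and in the table above. The function `required_sample` is a hypothetical helper, and the 1/6 pass rate is the empirical estimate mentioned in the section introduction, not an exact constant.

```python
from fractions import Fraction
import math

ZINC_DOWNLOADED = 1_820_667_950   # downloaded ZINC subset (stats table)
SAMPLE_RATIO = 0.0222             # --sample-ratio used in step 2.2
RASSP_PASS_RATE = Fraction(1, 6)  # approximate share of molecules accepted by RASSP
FINAL_SAMPLE = 30_000_000         # size of the final ZINC sample

# expected size of the first random sample (the run above gave 40,418,852)
expected_first_sample = ZINC_DOWNLOADED * SAMPLE_RATIO

# expected number of synthetic spectra after the RASSP filter
expected_after_rassp = FINAL_SAMPLE * RASSP_PASS_RATE

def required_sample(target_size, pass_rate):
    """Number of input molecules needed to end up with target_size after filtering."""
    return math.ceil(target_size / pass_rate)
```

For example, `required_sample(5_000_000, RASSP_PASS_RATE)` gives the 30M sample size used in step 2.4.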

## 3) RASSP synthetic spectra generation

In this section we take the cleaned 30M SMILES strings and send them through the RASSP model to generate synthetic spectra.


This step requires quite a lot of computing time. Therefore it is desirable to offload it to a computational cluster and to run as many parallel, independent jobs as possible.


### 3.1) 30M.smi data splitting

First, we prepare the data for processing: we split the file into chunks in a temporary folder and zip them.
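
The chunk naming produced by `split -a 3 -d -l 100000` can be mimicked in pure Python, which may help when `split` is not available. This is a sketch on an in-memory list with a hypothetical helper `chunk_lines`; a real 30M-line file would be processed as a stream.

```python
def chunk_lines(lines, lines_per_chunk, prefix="30M.smi_"):
    """Yield (chunk_name, chunk_lines) pairs with 3-digit numeric suffixes."""
    for start in range(0, len(lines), lines_per_chunk):
        index = start // lines_per_chunk
        yield f"{prefix}{index:03d}", lines[start:start + lines_per_chunk]

# toy example: 5 lines in chunks of 2 give 3 chunks, the last one shorter
chunks = list(chunk_lines(["CCO", "CCN", "CCC", "CO", "c1ccccc1"], 2))
chunk_names = [name for name, _ in chunks]

# 30M lines in chunks of 100,000 lines give 300 chunks
expected_chunk_count = 30_000_000 // 100_000
```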

In [22]:
RASSP_OUTPUT_DIR = f"{PROJECT_ROOT}/data/synth/rassp_gen"
RASSP_OUTPUT_DIR_TMP = f"{RASSP_OUTPUT_DIR}/tmp"
ALL_30M_CLEAN_SMILES_DIR_TMP = f"{ALL_30M_CLEAN_SMILES_DIR}/tmp"

In [71]:
! mkdir -p {RASSP_OUTPUT_DIR}
! mkdir -p {RASSP_OUTPUT_DIR_TMP}
! mkdir -p {ALL_30M_CLEAN_SMILES_DIR_TMP}

In [13]:
! split -a 3 -d -l 100000 {ALL_30M_CLEAN_SMILES_PATH} {ALL_30M_CLEAN_SMILES_DIR_TMP}/30M.smi_

In [16]:
! cd {ALL_30M_CLEAN_SMILES_DIR} && zip 30M.zip {ALL_30M_CLEAN_SMILES_DIR_TMP}/30M.smi_*

  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_000 (deflated 72%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_001 (deflated 72%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_002 (deflated 72%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_003 (deflated 71%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_004 (deflated 72%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_005 (deflated 71%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_006 (deflated 71%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_007 (deflated 72%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_008 (deflated 71%)
  adding: home/xhajek9/gc-ms_bart/clean_paper/data/zinc/30M//tmp/30M.smi_009 (deflated 72%)
  ...

### 3.2) RASSP spectra prediction

Now grab the file `{ALL_30M_CLEAN_SMILES_DIR}/30M.zip` (approx. 300 MB), copy it to a suitable folder on the computational cluster, and unzip it there. It yields 300 chunks of 100,000 lines each.

We provide an example implementation that runs smoothly on our [Metacentrum PBS cluster](https://metacentrum.cz). It needs:
- [rassp-pbs.sh](../forward/rassp-pbs.sh): a shell wrapper designed to be submitted as a PBS job, taking a single argument, the SMILES file; unless specified otherwise with options, it downloads our RASSP docker image `cerit.io/ljocha/rassp:CURRENT_STABLE`, converts it for singularity, downloads the published RASSP models, splits the input into 1000-line chunks, and runs predictions on the chunks sequentially.
- [rassp-predict.py](../forward/rassp-predict.py): the actual RASSP spectra prediction; it takes a file with SMILES as input and produces spectra in JSONL format. It calls the original RASSP library and is expected to run in our docker/singularity image. SMILES that RASSP cannot process (too many atoms, unknown atom types, too many subformulae, ...) are filtered out.

Copy both files to the same folder as `30M.zip` and submit a PBS job for each chunk:

```bash    
singularity pull rassp.sif docker://cerit.io/ljocha/rassp:nvidia-2023-6

for s in 30M.smi_*; do
  qsub -N $s -q gpu -l walltime=4:00:00 -l select=1:ncpus=8:ngpus=1:mem=8gb:scratch_local=50gb -- $PWD/rassp-pbs.sh -i $PWD/rassp.sif $s
done
```
Using 8 or more CPUs per job makes sense, and a GPU is effectively required (virtually any current NVIDIA card will work; the published RASSP model usually fits into 8 GB of GPU memory). With current Intel/AMD server CPUs the prediction can reach up to 1000 molecules/minute, so 4 hours is a safe upper bound for a job.

The first line can be omitted together with the `-i` option of `rassp-pbs.sh` (the script then pulls the image itself); however, pulling takes quite a long time, so it is better to pull the image once and reuse it.
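
The 4-hour walltime can be sanity-checked with the numbers given above. The sketch below only restates that arithmetic; the variable names are made up, and the 1000 molecules/minute figure is the peak rate quoted in the text, so real jobs may be slower.

```python
LINES_PER_CHUNK = 100_000   # SMILES per job (one chunk)
NUM_CHUNKS = 300            # number of jobs
PEAK_RATE = 1_000           # molecules per minute, best case from the text
WALLTIME_MINUTES = 4 * 60   # requested walltime per job

# best-case duration of one job in minutes
best_case_minutes = LINES_PER_CHUNK / PEAK_RATE

# slowest rate (molecules per minute) that still fits into the walltime
min_rate_for_walltime = LINES_PER_CHUNK / WALLTIME_MINUTES

# best-case total GPU time for all jobs, in hours
best_case_gpu_hours = NUM_CHUNKS * best_case_minutes / 60
```

A job therefore still finishes within the walltime as long as it sustains a bit more than 400 molecules per minute.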

After the jobs finish, the corresponding `*.jsonl` files are copied to the same folder. Copy them back to `{RASSP_OUTPUT_DIR_TMP}`, where the next step reads them.

The docker container was built with Dockerfile available in [our fork of original RASSP repository](https://github.com/ljocha/rassp-public/tree/ljocha).

### 3.3) Concatenation and splitting

After the predictions are done, we concatenate the files and split them into training, validation and test sets.

In [25]:
def data_split(df, train_val_test_ratio: list, seed):
    """split the df into train, test and valid sets"""

    if not np.isclose(sum(train_val_test_ratio), 1):
        print("The sum of the ratios is not equal to 1.")
        train_val_test_ratio = list(np.array(train_val_test_ratio) / sum(train_val_test_ratio))
        print(f"Ratios were normalized to {train_val_test_ratio}")

    train_set = df.sample(frac=train_val_test_ratio[0], random_state=seed)
    rest = df.drop(train_set.index)

    valid_set = rest.sample(frac=train_val_test_ratio[1] / (train_val_test_ratio[1] + train_val_test_ratio[2]),
                            random_state=seed)
    test_set = rest.drop(valid_set.index)

    print(f"SPLITTING STATS\n" +
          f"train len: {len(train_set)}\ntest len: {len(test_set)}\nvalid len: {len(valid_set)}\n" +
          f"{len(train_set)} + {len(test_set)} + {len(valid_set)} == {len(df)} : the sum matches len of the df\n")

    return train_set, valid_set, test_set


def concat_and_split_rassp(jsonls_dir: Path, final_dir: Path, train_val_test_split: list, seed: int):
    df = pd.DataFrame(columns=["smiles", "mz", "intensity"])
    jsonls = list(jsonls_dir.glob("*.jsonl"))
    jsonls.sort()
    for jsonl in tqdm(jsonls):
        df = pd.concat([df, pd.read_json(jsonl, lines=True, orient="records")], ignore_index=True, sort=False)

    train_set, valid_set, test_set = data_split(df, train_val_test_split, seed)

    train_set["smiles"].to_csv(final_dir / "train.smi", index=False, header=False)
    valid_set["smiles"].to_csv(final_dir / "valid.smi", index=False, header=False)
    test_set["smiles"].to_csv(final_dir / "test.smi", index=False, header=False)

    return train_set, valid_set, test_set

# takes about 14 mins
train_set, valid_set, test_set = concat_and_split_rassp(Path(RASSP_OUTPUT_DIR_TMP),
                                                        Path(RASSP_OUTPUT_DIR),
                                                        [0.9, 0.05, 0.05],
                                                        SEED)

train_set.to_json(RASSP_OUTPUT_DIR + "/train.jsonl", orient="records", lines=True)
valid_set.to_json(RASSP_OUTPUT_DIR + "/valid.jsonl", orient="records", lines=True)
test_set.to_json(RASSP_OUTPUT_DIR + "/test.jsonl", orient="records", lines=True)


100%|██████████| 300/300 [07:35<00:00,  1.52s/it]


SPLITTING STATS
train len: 4214986
test len: 234166
valid len: 234166
4214986 + 234166 + 234166 == 4683318 : the sum matches len of the df



## 4) NEIMS synthetic spectra generation
In this section we take a .smi file with molecules filtered according to RASSP restrictions and generate synthetic spectra using NEIMS model. The generated spectra are split into training, validation and test sets and saved as .jsonl files, where each line is a json dictionary with smiles, mz and intensity as keys. 
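
Reusing the RASSP splits for the NEIMS spectra amounts to looking up each molecule in the RASSP `train.smi`, `valid.smi` and `test.smi` lists. The sketch below illustrates the idea with a hypothetical helper `assign_to_splits`; it assumes that the SMILES strings in the NEIMS records are identical to those in the RASSP split files, which may not hold if a tool rewrites them on the way.

```python
def assign_to_splits(records, split_smiles):
    """Route records to the split that already contains their SMILES.

    records:      iterable of dicts with a "smiles" key
    split_smiles: dict mapping a split name to an iterable of SMILES
    Returns (assigned, unassigned), where assigned maps split names to record lists.
    """
    lookup = {}
    for name, smiles in split_smiles.items():
        for smi in smiles:
            lookup[smi] = name

    assigned = {name: [] for name in split_smiles}
    unassigned = []
    for record in records:
        name = lookup.get(record["smiles"])
        if name is None:
            unassigned.append(record)  # molecule not present in any RASSP split
        else:
            assigned[name].append(record)
    return assigned, unassigned

# toy data for illustration
rassp_splits = {"train": ["CCO", "CCC"], "valid": ["CCN"], "test": ["CO"]}
neims_records = [{"smiles": "CCN"}, {"smiles": "CCO"}, {"smiles": "c1ccccc1"}]
assigned, unassigned = assign_to_splits(neims_records, rassp_splits)
```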

In [1]:
# imports
import sys
sys.path.append("..")

import pandas as pd
from rdkit.Chem import PandasTools
from pathlib import Path
import subprocess as subp
import numpy as np
from tqdm import tqdm
import glob
import json
from multiprocessing import Pool

tqdm.pandas()

In [6]:
# macro variables
SEED = 42
RASSP_OUTPUT_DIR = f"{PROJECT_ROOT}/data/synth/rassp_gen"

BASE='after_rassp'
NEIMS_OUTPUT_DIR = f"{PROJECT_ROOT}/data/synth/neims_gen/"
NEIMS_OUTPUT_DIR_TMP = f"{NEIMS_OUTPUT_DIR}/tmp/"


# NEIMS inference makes a single batch from the whole input file; 50k SMILES use approx. 7GB GPU memory,
# which is generally acceptable; make it smaller if you run out of memory
NEIMS_CHUNK=50000

NEIMS_REPO = PROJECT_ROOT + "/deep-molecular-massspec"
NEIMS_WEIGHTS = NEIMS_REPO + "/NEIMS_weights/massspec_weights"


### 4.1) Gather SMILES from RASSP output and convert to SDF

In [60]:
smi = []
for f_type in ["train", "valid", "test"]:
    smi += pd.read_csv(f"{RASSP_OUTPUT_DIR}/{f_type}.smi", header=None)[0].to_list()

In [61]:
len(smi)

4683318

In [31]:
!mkdir -p {NEIMS_OUTPUT_DIR}
!mkdir -p {NEIMS_OUTPUT_DIR_TMP}

In [64]:
# convert SMILES to SDF (~2 mins 25 sec with 32 workers)
def smi2sdf(smiles_list, list_id, output_dir=NEIMS_OUTPUT_DIR_TMP, base=BASE):
    df_smi = pd.DataFrame(smiles_list, columns=['smiles'])
    PandasTools.AddMoleculeColumnToFrame(df_smi, smilesCol='smiles', molCol='ROMol')
    df_smi["id"] = df_smi.index
    PandasTools.WriteSDF(df_smi, f'{output_dir}/{base}_{list_id:03}.sdf', idName="id", properties=list(df_smi.columns))

def chunk_list(alist, chunk_size):
    return [alist[i:i + chunk_size] for i in range(0, len(alist), chunk_size)]



NUM_WORKERS = 32
chunked_smi = chunk_list(smi, NEIMS_CHUNK)
chunks_with_id = [(chunk, i) for i, chunk in enumerate(chunked_smi)]

with Pool(NUM_WORKERS) as p:
    p.starmap(smi2sdf, chunks_with_id)

In [70]:
! cd {NEIMS_OUTPUT_DIR_TMP} && zip {BASE}.zip {BASE}*.sdf

  adding: after_rassp_000.sdf (deflated 91%)
  adding: after_rassp_001.sdf (deflated 91%)
  adding: after_rassp_002.sdf (deflated 91%)
  adding: after_rassp_003.sdf (deflated 91%)
  adding: after_rassp_004.sdf (deflated 91%)
  adding: after_rassp_005.sdf (deflated 91%)
  adding: after_rassp_006.sdf (deflated 91%)
  adding: after_rassp_007.sdf (deflated 91%)
  adding: after_rassp_008.sdf (deflated 91%)
  adding: after_rassp_009.sdf (deflated 91%)
  adding: after_rassp_010.sdf (deflated 91%)
  adding: after_rassp_011.sdf (deflated 91%)
  adding: after_rassp_012.sdf (deflated 91%)
  adding: after_rassp_013.sdf (deflated 91%)
  adding: after_rassp_014.sdf (deflated 91%)
  adding: after_rassp_015.sdf (deflated 91%)
  adding: after_rassp_016.sdf (deflated 91%)
  adding: after_rassp_017.sdf (deflated 91%)
  adding: after_rassp_018.sdf (deflated 91%)
  adding: after_rassp_019.sdf (deflated 91%)
  adding: after_rassp_020.sdf (deflated 91%)
  adding: after_rassp_021.sdf (deflated 91%)
  ...

### 4.2) NEIMS spectra prediction

Similarly to the RASSP prediction, copy the file `{NEIMS_OUTPUT_DIR_TMP}/{BASE}.zip` to a computational cluster and unzip it there. Also copy the PBS script [neims-pbs.sh](../forward/neims-pbs.sh) to the same folder and, after downloading our container image (created in [our NEIMS repository clone](https://github.com/ljocha/deep-molecular-massspec/tree/ljocha)), submit it to a suitable queue:
```bash
singularity pull neims.sif docker://cerit.io/ljocha/neims # this step requires more RAM () than the generation itself

chmod +x neims-pbs.sh

qsub -q gpu -l walltime=4:00:00 -l select=1:ncpus=1:ngpus=1:mem=8gb:scratch_local=50gb -- $PWD/neims-pbs.sh *.sdf
```

Unlike RASSP, the NEIMS prediction is quite fast, so it makes little sense to split it into multiple jobs; the script loops over all input files.

After it finishes, grab all `{BASE}_*-out.sdf` files and copy them back to `{NEIMS_OUTPUT_DIR_TMP}`.

### 4.3) Spectra processing
Here we convert the SDF files to JSONL format with `smiles`, `mz`, and `intensity` keys.
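
The core of such a conversion is turning a textual peak list into the `mz` and `intensity` lists. The sketch below is not the implementation of `sdf2jsonl`; it assumes that each predicted spectrum is stored as a text block with one `m/z intensity` pair per line, and both the helper `parse_peaks` and the peak values are made up for illustration.

```python
import json

def parse_peaks(peak_block):
    """Parse lines of 'mz intensity' pairs into two parallel lists."""
    mz, intensity = [], []
    for line in peak_block.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:  # skip malformed lines
            continue
        mz.append(float(parts[0]))
        intensity.append(float(parts[1]))
    return mz, intensity

# illustrative peak block, not real NEIMS output
peak_block = "27 15\n29 100\nmalformed line here\n45 42.5\n"
mz, intensity = parse_peaks(peak_block)
jsonl_line = json.dumps({"smiles": "CCO", "mz": mz, "intensity": intensity})
```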

In [8]:
from multiprocessing import Pool
from utils.spectra_process_utils import sdf2jsonl

# takes ~ 2min with 32 workers
sdfs = glob.glob(f"{NEIMS_OUTPUT_DIR_TMP}/*out.sdf")
sdfs.sort()
NUM_WORKERS = 32

with Pool(NUM_WORKERS) as p:
    p.map(sdf2jsonl, sdfs)

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_000-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_001-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_002-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_003-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_005-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_006-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_007-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_004-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_008-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_009-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/ne

100%|██████████| 50000/50000 [00:00<00:00, 1006050.26it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1053702.26it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1049557.34it/s]
100%|██████████| 50000/50000 [00:00<00:00, 975188.21it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1035958.03it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1015589.65it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1085386.90it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1058141.60it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1113788.31it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1084264.57it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1072937.04it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1099044.11it/s]
100%|██████████| 50000/50000 [00:00<00:00, 880593.91it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1039562.99it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1080749.93it/s]
100%|██████████| 50000/50000 [00:00<00:00, 950378.18it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1011440.95it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_016-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_007-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_032-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_033-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_004-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_001-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_034-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_035-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_026-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_036-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_g

100%|██████████| 50000/50000 [00:00<00:00, 1017032.73it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1091357.20it/s]

100%|██████████| 50000/50000 [00:00<00:00, 1002510.64it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1074823.18it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1078998.36it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1032906.80it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1109645.33it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1076517.00it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1029342.73it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1094187.19it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1092050.53it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1077269.06it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1075357.79it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1053268.31it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1082015.70it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1072503.55it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1063103.31it

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_033-out.sdf: DONE


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_036-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_064-out.sdf: RUNNING


100%|██████████| 50000/50000 [00:00<00:00, 1029150.78it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_065-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_035-out.sdf: DONE


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_066-out.sdf: RUNNING


100%|██████████| 50000/50000 [00:00<00:00, 1066043.12it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_032-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_067-out.sdf: RUNNING


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_038-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 1076975.85it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_068-out.sdf: RUNNING


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_042-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_049-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_045-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 1018747.09it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_069-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_070-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_071-out.sdf: RUNNING


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_043-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_050-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 1012862.47it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_072-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_046-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_073-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_074-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_054-out.sdf: DONE


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_075-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_057-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 998767.47it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_037-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_076-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_044-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_077-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_040-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_051-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_078-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_053-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_039-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_079-out.sdf: RUNNING
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen/

100%|██████████| 33318/33318 [00:00<00:00, 976560.59it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_093-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 902191.00it/s]
100%|██████████| 50000/50000 [00:00<00:00, 875151.59it/s]
100%|██████████| 50000/50000 [00:00<00:00, 946035.24it/s]
100%|██████████| 50000/50000 [00:00<00:00, 963609.71it/s]
100%|██████████| 50000/50000 [00:00<00:00, 958895.32it/s]
100%|██████████| 50000/50000 [00:00<00:00, 945374.22it/s]
100%|██████████| 50000/50000 [00:00<00:00, 886632.56it/s]
100%|██████████| 50000/50000 [00:00<00:00, 970795.84it/s]
100%|██████████| 50000/50000 [00:00<00:00, 947702.54it/s]
100%|██████████| 50000/50000 [00:00<00:00, 969709.52it/s]
100%|██████████| 50000/50000 [00:00<00:00, 971542.40it/s]
100%|██████████| 50000/50000 [00:00<00:00, 1023790.04it/s]
100%|██████████| 50000/50000 [00:00<00:00, 907421.97it/s]
100%|██████████| 50000/50000 [00:00<00:00, 919013.48it/s]
100%|██████████| 50000/50000 [00:00<00:00, 887754.78it/s]
100%|██████████| 50000/50000 [00:00<00:00, 964043.82it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_067-out.sdf: DONE


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_068-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 985031.61it/s]
100%|██████████| 50000/50000 [00:00<00:00, 947103.35it/s]
  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_066-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 819523.33it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_073-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 897284.81it/s]
100%|██████████| 50000/50000 [00:00<00:00, 884975.08it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_064-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 941216.18it/s]
100%|██████████| 50000/50000 [00:00<00:00, 897964.85it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_065-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_078-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_071-out.sdf: DONE


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_069-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 1010709.76it/s]
100%|██████████| 50000/50000 [00:00<00:00, 929901.92it/s]
100%|██████████| 50000/50000 [00:00<00:00, 972366.76it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_072-out.sdf: DONE


  0%|          | 0/50000 [00:00<?, ?it/s]

/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_079-out.sdf: DONE


100%|██████████| 50000/50000 [00:00<00:00, 867739.16it/s]
100%|██████████| 50000/50000 [00:00<00:00, 969951.71it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_087-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_075-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_085-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_076-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_082-out.sdf: DONE



100%|██████████| 50000/50000 [00:00<00:00, 1013934.91it/s]


/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_084-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_080-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_070-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_081-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_074-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_090-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_086-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_077-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_083-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_rassp_091-out.sdf: DONE
/home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//tmp/after_r

### 4.4) Data splitting
The data split has to be the same as for RASSP. We therefore go through all NEIMS `.jsonl` files and write each line to the training, validation or test file according to the RASSP split that contains its SMILES.

In [9]:
# load RASSP splits

def read_smi_to_set(path):
    smis = pd.read_csv(path, header=None, names=["smiles"]).smiles
    return set(smis.to_list())

rassp_train_set = read_smi_to_set(f"{RASSP_OUTPUT_DIR}/train.smi")
rassp_valid_set = read_smi_to_set(f"{RASSP_OUTPUT_DIR}/valid.smi")
rassp_test_set = read_smi_to_set(f"{RASSP_OUTPUT_DIR}/test.smi")
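
The sorting step below uses an `if`/`elif` chain, so a SMILES present in more than one RASSP split would silently go to the first match. A quick disjointness check can rule this out. This is a minimal sketch on toy sets; `overlapping_splits` is a hypothetical helper, and in the notebook the three `rassp_*_set` variables would be passed in instead.

```python
from itertools import combinations


def overlapping_splits(splits):
    """Return {(name_a, name_b): shared_count} for every pair of splits that overlap."""
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(splits.items(), 2):
        shared = len(set_a & set_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy SMILES sets standing in for rassp_train_set, rassp_valid_set, rassp_test_set
toy_splits = {
    "train": {"CCO", "CCN", "CCC"},
    "valid": {"CCCl"},
    "test": {"CCBr", "CCO"},  # "CCO" deliberately duplicated to trigger a report
}
print(overlapping_splits(toy_splits))
```

An empty dictionary means the splits are pairwise disjoint.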

In [10]:
neims_train_file = f"{NEIMS_OUTPUT_DIR}/train.jsonl"
neims_valid_file = f"{NEIMS_OUTPUT_DIR}/valid.jsonl"
neims_test_file = f"{NEIMS_OUTPUT_DIR}/test.jsonl"

jsonls = sorted(glob.glob(f"{NEIMS_OUTPUT_DIR_TMP}/*.jsonl"))

In [11]:
# go through all the jsonls and sort them out to train, valid and test files based on RASSP .smi splits
with open(neims_train_file, "w") as trainf, open(neims_valid_file, "w") as validf, open(neims_test_file, "w") as testf:
    for jsonl in tqdm(jsonls):
        with open(jsonl, "r") as f:
            for line in tqdm(f):
                data = json.loads(line)
                smiles = data["smiles"]
                if smiles in rassp_train_set:
                    trainf.write(line)
                elif smiles in rassp_valid_set:
                    validf.write(line)
                elif smiles in rassp_test_set:
                    testf.write(line)
                else:
                    print(f"WARNING: SMILES {smiles} not found in any RASSP split")

50000it [00:01, 34544.59it/s]?, ?it/s]
50000it [00:01, 34020.70it/s]02:40,  1.72s/it]
50000it [00:01, 34202.60it/s]02:24,  1.57s/it]
50000it [00:01, 34880.53it/s]02:18,  1.52s/it]
50000it [00:01, 35217.89it/s]02:14,  1.49s/it]
50000it [00:01, 35313.66it/s]02:10,  1.47s/it]
50000it [00:01, 35276.51it/s]02:07,  1.45s/it]
50000it [00:01, 35153.25it/s]02:05,  1.44s/it]
50000it [00:01, 34988.71it/s]02:03,  1.44s/it]
50000it [00:01, 35097.42it/s]02:01,  1.43s/it]
50000it [00:01, 35063.05it/s]<02:00,  1.43s/it]
50000it [00:01, 35151.51it/s]<01:58,  1.43s/it]
50000it [00:01, 35188.61it/s]<01:57,  1.43s/it]
50000it [00:01, 35242.22it/s]<01:55,  1.43s/it]
50000it [00:02, 24628.24it/s]<01:54,  1.43s/it]
50000it [00:01, 35017.04it/s]<02:07,  1.61s/it]
50000it [00:01, 28969.50it/s]<02:01,  1.56s/it]
50000it [00:01, 34857.55it/s]<02:03,  1.61s/it]
50000it [00:01, 35720.24it/s]<01:58,  1.56s/it]
50000it [00:01, 35597.88it/s]<01:53,  1.51s/it]
50000it [00:01, 35835.08it/s]<01:49,  1.48s/it]
50000it [0

In [17]:
# stats neims
! echo "Train" && wc -l {neims_train_file}
! echo "Valid" && wc -l {neims_valid_file}
! echo "Test" && wc -l {neims_test_file}

Train
4214986 /home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//train.jsonl
Valid
234166 /home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//valid.jsonl
Test
234166 /home/xhajek9/gc-ms_bart/clean_paper/data/synth/neims_gen//test.jsonl
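
As a sanity check, the three line counts can be compared with the number of spectra seen during conversion. The sketch below assumes, based on the conversion log above, 93 full chunks of 50,000 spectra plus one final chunk of 33,318 (files `after_rassp_000` to `after_rassp_093`). Since the log is truncated, that chunk count is an assumption rather than a verified figure.

```python
# Line counts reported by `wc -l` above
train, valid, test = 4_214_986, 234_166, 234_166
total = train + valid + test

# Assumption read off the conversion log: 93 full chunks of 50,000 spectra
# plus one final chunk of 33,318
expected_total = 93 * 50_000 + 33_318

print(total, expected_total)
print(f"train {train / total:.4f}, valid {valid / total:.4f}, test {test / total:.4f}")
```

Under this assumption both totals come to 4,683,318, and the fractions correspond to a 90/5/5 split, so no spectra fell outside the RASSP splits.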


### Remove all the temporary files

In [None]:
# rassp tmp folder
! rm -r {RASSP_OUTPUT_DIR_TMP}

# neims tmp folder
! rm -r {NEIMS_OUTPUT_DIR_TMP}