This notebook collects the code used to pre-process the data. The code depends on matchms==0.24.1, matchmsextras==0.4.0, and ms2deepscore==2.0.0. These packages are not part of the general repo dependencies and must be installed manually. The similarity matrices are saved as plain-text files with a .npy extension and the spectra as JSON, for use in the exploration code in "practical_appendix_natural_products.ipynb". The ms2deepscore step additionally requires a trained model file in a models folder (e.g. ms2deepscore_model.pt from https://zenodo.org/records/10814307).

Important: this notebook is designed for local runtimes only. The data produced are available in the data_natural_products_library folder.
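
Since the three packages have to be installed by hand, it can help to confirm the pinned versions before running anything else. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` and the `PINNED_VERSIONS` dictionary are illustrative and not part of the repo.

```python
from importlib import metadata

# Versions stated at the top of this notebook.
PINNED_VERSIONS = {
    "matchms": "0.24.1",
    "matchmsextras": "0.4.0",
    "ms2deepscore": "2.0.0",
}


def find_version_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that deviates.

    `found` is None when the package is not installed.
    (Hypothetical helper, written for illustration only.)
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


# Report anything that differs from the pinned versions.
for package, (expected, found) in find_version_mismatches(PINNED_VERSIONS).items():
    print(f"{package}: expected {expected}, found {found}")
```

An empty result means the environment matches the versions this notebook was run with.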

In [1]:
import os
import numpy as np
import matchms
import ms2deepscore
import matchmsextras
import natural_product_library_preprocessing_utils as prepro
print(matchms.__version__)
print(matchmsextras.__version__)
print(ms2deepscore.__version__)

0.24.1
0.4.0
2.0.0


In [2]:
ms2deepscore_model_filepath = os.path.join("models", "ms2deepscore_model.pt")                # MODEL FILE
data_directory = "data_natural_products_library"                                             # DIRECTORY
spectra_filepath_mgf = os.path.join(data_directory, "GNPS-NIH-NATURALPRODUCTSLIBRARY.mgf")   # INPUT
spectra_filepath_json = os.path.join(data_directory, "spectra.json")                         # OUTPUT
similarity_matrix_modcos_filepath = os.path.join(data_directory, "similarities_modcos.npy")  # OUTPUT
similarity_matrix_ms2ds_filepath = os.path.join(data_directory, "similarities_ms2ds.npy")    # OUTPUT

In [16]:
# Filtering and basic cleaning of data
matchms_spectra = list(matchms.importing.load_from_mgf(spectra_filepath_mgf))
matchms_spectra = [matchms.filtering.default_filters(spec) for spec in matchms_spectra]
matchms_spectra = [matchms.filtering.normalize_intensities(spec) for spec in matchms_spectra]
# Spectra with fewer than 5 peaks become None; at most the 200 most intense peaks are kept
matchms_spectra = [
    matchms.filtering.reduce_to_number_of_peaks(spec, n_required=5, n_max=200) for spec in matchms_spectra
]
# Drop the spectra that were rejected by the filters
matchms_spectra = [spec for spec in matchms_spectra if spec is not None]
# Copy spectrum_id into feature_id and mark retention_time as unavailable
matchms_spectra = [spec.set("feature_id", spec.get("spectrum_id")) for spec in matchms_spectra]
matchms_spectra = [spec.set("retention_time", "not-available") for spec in matchms_spectra]
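
To make the effect of the intensity normalization and the peak-count filter concrete, the sketch below reproduces the same idea on plain NumPy arrays. It is an illustration under stated assumptions, not the matchms implementation: the function name `normalize_and_reduce` is hypothetical, and it only mirrors the behaviour described above (scale to a maximum intensity of 1, reject spectra with fewer than `n_required` peaks, keep the `n_max` most intense peaks in m/z order).

```python
import numpy as np


def normalize_and_reduce(mz, intensities, n_required=5, n_max=200):
    """Illustrative stand-in for the normalization and peak-count filters.

    Returns None for spectra with too few peaks, otherwise a tuple
    (mz, intensities) with intensities scaled to a maximum of 1 and at
    most n_max peaks retained. Not the matchms code.
    """
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)

    # Reject empty spectra and spectra with too few peaks.
    if mz.size == 0 or mz.size < n_required:
        return None

    # Scale so that the most intense peak has intensity 1.
    max_intensity = intensities.max()
    if max_intensity > 0:
        intensities = intensities / max_intensity

    # Keep only the n_max most intense peaks, preserving the m/z order.
    if mz.size > n_max:
        keep = np.sort(np.argsort(intensities)[-n_max:])
        mz, intensities = mz[keep], intensities[keep]

    return mz, intensities


# Toy spectrum with six peaks, reduced to its three most intense peaks.
toy = normalize_and_reduce(
    [100, 200, 300, 400, 500, 600], [10, 50, 5, 100, 20, 1], n_required=5, n_max=3
)
print(toy)
```

The list comprehension that removes `None` entries in the cell above corresponds to the `return None` branch here.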

In [17]:
matchms_spectra[0].metadata

{'charge': 1,
 'ionmode': 'positive',
 'smiles': 'OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2C1=O)c4ccccc4',
 'scans': '1865',
 'ms_level': '2',
 'instrument_type': 'LC-ESI-qTof',
 'file_name': 'p1-A05_GA5_01_17878.mzXML',
 'peptide_sequence': '*..*',
 'organism_name': 'GNPS-NIH-NATURALPRODUCTSLIBRARY',
 'compound_name': '(2R)-2-[3-[(10aS)-1,3-dioxo-10,10a-dihydro-5H-imidazo[1,5-b]isoquinolin-2-yl]propanoylamino]-2-phenylacetic acid"',
 'principal_investigator': 'Dorrestein',
 'data_collector': 'VVP/LMS',
 'submit_user': 'vphelan',
 'confidence': '1',
 'spectrum_id': 'CCMSLIB00000079350',
 'precursor_mz': 408.156,
 'adduct': '[M+H]+',
 'feature_id': 'CCMSLIB00000079350',
 'retention_time': 'not-available'}

In [18]:
matchms.exporting.save_as_json(matchms_spectra, filename = spectra_filepath_json)

In [5]:
# Compute pairwise similarity matrices with the msfeast functions
# compute_similarities_cosine and compute_similarities_ms2ds
similarity_matrix_modcos = prepro.compute_similarities_cosine(matchms_spectra)
similarity_matrix_ms2ds = prepro.compute_similarities_ms2ds(matchms_spectra, ms2deepscore_model_filepath)

The model version (0.5.0) does not match the version of MS2Deepscore (2.0.0), consider downloading a new model or changing the MS2Deepscore version


1267it [00:06, 182.16it/s]
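
Before saving, a quick sanity check of the two matrices can catch problems early. The sketch below assumes that each helper returns a square, symmetric, pairwise matrix with one row and one column per spectrum; that is an assumption about `prepro`, which is not shown in this notebook. The function name `check_similarity_matrix` is hypothetical.

```python
import numpy as np


def check_similarity_matrix(matrix, n_spectra, atol=1e-6):
    """Return a list of problems found; an empty list means all checks passed.

    Assumes a pairwise similarity matrix of shape (n_spectra, n_spectra).
    """
    matrix = np.asarray(matrix, dtype=float)
    problems = []

    # One row and one column per spectrum.
    if matrix.shape != (n_spectra, n_spectra):
        problems.append(f"shape {matrix.shape}, expected {(n_spectra, n_spectra)}")
        return problems

    # No missing or infinite scores.
    if not np.all(np.isfinite(matrix)):
        problems.append("contains NaN or infinite values")

    # Pairwise scores should not depend on the order of the pair.
    if not np.allclose(matrix, matrix.T, atol=atol, equal_nan=True):
        problems.append("not symmetric")

    return problems


# Toy example: a valid 3 x 3 matrix produces no problems.
print(check_similarity_matrix(np.eye(3), 3))
```

In this notebook the call would be, for instance, `check_similarity_matrix(similarity_matrix_modcos, len(matchms_spectra))`.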


In [6]:
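# np.savetxt writes plain text despite the .npy extension, so read these files back with np.loadtxt
np.savetxt(similarity_matrix_modcos_filepath, similarity_matrix_modcos)
np.savetxt(similarity_matrix_ms2ds_filepath, similarity_matrix_ms2ds)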
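
Because `np.savetxt` writes plain text regardless of the file extension, the saved matrices have to be read with `np.loadtxt` rather than `np.load`. The round trip below demonstrates this on a toy matrix in a temporary directory; the file name `toy_similarities.npy` is only illustrative.

```python
import os
import tempfile

import numpy as np

toy_matrix = np.array([[1.0, 0.25], [0.25, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_similarities.npy")

    # Same call as in the cell above: text output, .npy extension.
    np.savetxt(path, toy_matrix)

    # Reading the file as text recovers the matrix.
    restored = np.loadtxt(path)

    # The binary loader does not understand the text file.
    try:
        np.load(path)
        binary_load_works = True
    except Exception:
        binary_load_works = False

print(np.array_equal(restored, toy_matrix))
print(binary_load_works)
```

The same `np.loadtxt` call applies to `similarities_modcos.npy` and `similarities_ms2ds.npy` in the data folder.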