Steps performed in this notebook:

**1. Load in spectral data:** The GNPS 21-04-09 all-positive dataset is loaded. It contains 187,152 spectra that are annotated with structures. These are filtered so that one spectrum is kept per unique InChIKey (first 14 characters).

**2. Create embeddings:** 
[Spec2Vec](https://github.com/iomega/spec2vec) and [MS2DeepScore](https://github.com/matchms/ms2deepscore) learn relationships between peaks as they create spectral embeddings. Within these embeddings, similar spectra are placed close to each other in the latent space. The same holds for molecular embeddings, for example ones made with [Mol2Vec](https://github.com/samoturk/mol2vec). Pretrained models are available for all three methods and are used here to create the embeddings from the GNPS data. The embeddings are then normalized to put them on the same scale, and the data is saved to a file for use by other notebooks.

In [1]:
!pwd
!python --version

/lustre/BIF/nobackup/unen004/thesis/cca
Python 3.8.12


## 1. Load in spectral data
GNPS 21-04-09 all positive

In [2]:
import os 
import pickle
import pandas as pd
import numpy as np

from collections import OrderedDict
from tqdm.notebook import tqdm 

path = "/mnt/LTR_userdata/hooft001/mass_spectral_embeddings"
dataset = "ALL_GNPS_210409_positive"

# load spectra annotated with molecular structures
spectra_fn = os.path.join(path, "datasets", dataset,
                          "ALL_GNPS_210409_positive_cleaned_peaks_processed_s2v_only_annotated.pickle")
with open(spectra_fn, "rb") as f:
    all_spectrums = pickle.load(f)
    
len(all_spectrums)

187152

In [3]:
# Load in class info to merge into df later 
class_fn = os.path.join(path, "classifications", dataset,
                       "ALL_GNPS_210409_positive_cleaned_peaks_processed_s2v_only_annotated_classes.txt")
classes = pd.read_table(class_fn, sep = "\t") #, on_bad_lines="skip"
classes.head()

Unnamed: 0,spectrum_id,cf_kingdom,cf_superclass,cf_class,cf_subclass,cf_direct_parent,npc_class_results,npc_superclass_results,npc_pathway_results,npc_isglycoside
0,CCMSLIB00000001547,Organic compounds,Organic acids and derivatives,Peptidomimetics,Hybrid peptides,Hybrid peptides,Cyclic peptides; Microcystins,Oligopeptides,Amino acids and Peptides,0
1,CCMSLIB00000001548,Organic compounds,Organic acids and derivatives,Peptidomimetics,Depsipeptides,Cyclic depsipeptides,Cyclic peptides,Oligopeptides,Amino acids and Peptides,0
2,CCMSLIB00000001549,Organic compounds,Organoheterocyclic compounds,Oxepanes,,Oxepanes,Lipopeptides,Oligopeptides,Amino acids and Peptides,0
3,CCMSLIB00000001550,Organic compounds,Organoheterocyclic compounds,Indoles and derivatives,,Indoles and derivatives,,,Shikimates and Phenylpropanoids,0
4,CCMSLIB00000001552,Organic compounds,Organic acids and derivatives,Peptidomimetics,Depsipeptides,Cyclic depsipeptides,Cyclic peptides; Depsipeptides,Oligopeptides,Amino acids and Peptides; Polyketides,0


### 1.1 Filter on unique inchikey
Function obtained from [Joris Louwen](https://github.com/louwenjjr/ms2_mass_differences/blob/ffd31aff66fba14502e3c7ec4f1b5eb947687ef1/scripts/mass_differences/processing.py). Keeping one spectrum per unique InChIKey removes the bias toward structures that are represented by many spectra.
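The first 14 characters of an InChIKey hash the molecular connectivity, so spectra that share them belong to the same structure skeleton. A minimal sketch of that grouping step follows; the keys are made up for illustration and `group_by_inchikey14` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def group_by_inchikey14(inchikeys):
    """Map each InChIKey14 to the list positions that share it."""
    groups = defaultdict(list)
    for index, inchikey in enumerate(inchikeys):
        if inchikey:  # skip spectra without an InChIKey annotation
            groups[inchikey[:14]].append(index)
    return dict(groups)

# Made-up keys: the first two differ only after the first 14 characters
example_keys = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
    None,
]
groups = group_by_inchikey14(example_keys)
print(groups)
```

Groups with more than one index are the ones for which a single best spectrum has to be chosen.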

In [4]:
def count_higher_peaks(spectrum, threshold = 0.1):
    """Count peaks whose intensity relative to the highest peak is >= threshold."""
    return np.sum(spectrum.peaks.intensities/spectrum.peaks.intensities.max() >= threshold)

def get_ids_for_unique_inchikeys(spectrums): #: List[SpectrumType]
    """Return indices for best chosen spectra for each unique inchikey
    Parameters
    ----------
    spectrums:
        Input spectra
    """
    # collect all inchikeys (first 14 characters)
    inchikey_collection = OrderedDict()
    for i, spec in enumerate(spectrums):
        inchikey = spec.get("inchikey")
        if inchikey:
            if inchikey[:14] in inchikey_collection:
                inchikey_collection[inchikey[:14]] += [i]
            else:
                inchikey_collection[inchikey[:14]] = [i]

    intensity_thres = 0.01
    n_peaks_required = 10
    ID_picks = []

    inchikey14_unique = list(inchikey_collection)

    # Loop through all unique inchikeys (first 14 characters)
    for inchikey14 in inchikey14_unique:
        specIDs = np.array(inchikey_collection[inchikey14])
        if specIDs.size == 1:
            ID_picks.append(specIDs[0])
        else:
            # 1 select spec with sufficient peaks (e.g. 10 with intensity 0.01)
            num_peaks = np.array([count_higher_peaks(
                spectrums[specID], intensity_thres) for
                                  specID in specIDs])
            sufficient_peaks = np.where(num_peaks >= n_peaks_required)[0]
            if sufficient_peaks.size == 0:
                sufficient_peaks = np.where(num_peaks == max(num_peaks))[0]
            step1IDs = specIDs[sufficient_peaks]

            # 2 select best spectrum qualities
            # (according to gnps measure). 1 > 2 > 3
            qualities = np.array(
                [int(spectrums[specID].get("library_class", 3))
                 for specID in step1IDs])  # default worst quality
            step2IDs = step1IDs[np.where(qualities == min(qualities))[0]]

            # 3 Select the ones with most peaks > threshold
            num_peaks = np.array([count_higher_peaks(
                spectrums[specID], intensity_thres) for specID in step2IDs])
            pick = np.argmax(num_peaks)
            ID_picks.append(step2IDs[pick])
    ID_picks.sort()  # ensure order

    return ID_picks
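The three selection steps can be traced on toy data. The sketch below uses plain intensity lists instead of matchms spectra, and lowers the required peak count from 10 to 3 so that the small example exercises each step; the candidate values are invented.

```python
import numpy as np

def count_peaks_above(intensities, threshold=0.01):
    """Count peaks whose intensity relative to the highest peak is >= threshold."""
    intensities = np.asarray(intensities, dtype=float)
    return int(np.sum(intensities / intensities.max() >= threshold))

# Hypothetical candidates for one InChIKey14: (intensities, library_class)
candidates = [
    ([100.0, 50.0, 0.5], 1),
    ([80.0, 40.0, 20.0, 10.0], 3),
    ([90.0, 30.0, 9.0, 5.0, 2.0], 1),
]
n_peaks_required = 3  # the notebook uses 10; lowered for the toy data

counts = np.array([count_peaks_above(i) for i, _ in candidates])

# Step 1: keep spectra with enough peaks (fall back to the richest ones)
keep = np.where(counts >= n_peaks_required)[0]
if keep.size == 0:
    keep = np.where(counts == counts.max())[0]

# Step 2: keep the best library quality (1 is better than 3)
qualities = np.array([candidates[k][1] for k in keep])
keep = keep[qualities == qualities.min()]

# Step 3: among the remaining spectra, take the one with the most peaks
best = int(keep[np.argmax(counts[keep])])
print(counts.tolist(), best)
```

Here the first candidate is dropped in step 1, the second in step 2, and the third is picked.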

In [5]:
# filter on unique inchi keys
uniq_ids = get_ids_for_unique_inchikeys(all_spectrums)
uniq_spectrums = [all_spectrums[i] for i in uniq_ids]

# take small subset for dev
#spectrums = uniq_spectrums[:10000]
spectrums = uniq_spectrums

In [6]:
# turn into pandas dataframe
df = pd.DataFrame([spec.metadata for spec in spectrums])
df = df[['spectrum_id', 'compound_name', 'smiles', 'inchikey']]
df['inchikey14'] = [x[:14] for x in df['inchikey']]  # for tanimoto similarity matrix

# merge classes into main df
df = pd.merge(df, classes, on="spectrum_id")
print(len(df))

16360
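`pd.merge` performs an inner join by default, so any spectrum without a row in the class table is dropped from `df` without a warning. A quick way to see which IDs would be lost is a set comparison before merging. The sketch below uses plain lists; `split_by_annotation` and the third spectrum ID are hypothetical.

```python
def split_by_annotation(spectrum_ids, annotated_ids):
    """Split IDs into those that survive an inner join and those that are dropped."""
    annotated = set(annotated_ids)
    kept = [s for s in spectrum_ids if s in annotated]
    dropped = [s for s in spectrum_ids if s not in annotated]
    return kept, dropped

# Toy inputs: the last spectrum ID has no class annotation
spectrum_ids = ["CCMSLIB00000001547", "CCMSLIB00000001548", "CCMSLIB_HYPOTHETICAL"]
class_ids = ["CCMSLIB00000001547", "CCMSLIB00000001548", "CCMSLIB00000001549"]

kept, dropped = split_by_annotation(spectrum_ids, class_ids)
print(len(kept), dropped)
```

In the notebook the same check could be run with `df['spectrum_id']` and `classes['spectrum_id']`. This matters because the embeddings are later assigned to `df` by position, which assumes that `df` still has one row per entry in `spectrums`.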


## 2. Create embeddings


### 2.1 Spec2Vec

In [7]:
from gensim.models import Word2Vec
from spec2vec import SpectrumDocument
from spec2vec.vector_operations import calc_vector

def create_spec2vec_embeddings():
    """Create Spec2Vec embeddings 
    """
    # Load pretrained spec2vec model 
    spec2vec_model_fn = os.path.join(path, "embeddings", "ALL_GNPS_210409_positive", 
                                     "ALL_GNPS_210409_positive_cleaned_spec2vec_embedding_iter_15.model")
    spec2vec_model = Word2Vec.load(spec2vec_model_fn)
    
    # Create "documents"
    spectrum_documents = [SpectrumDocument(s, n_decimals=2) for s in spectrums]
    
    # Derive embeddings from model with documents
    intensity_weighting_power = 0.5
    allowed_missing_percentage = 50 # maximum (weighted) percentage of the spectrum that is allowed to be missing

    spec2vec_embeddings = np.zeros((len(spectrum_documents), spec2vec_model.vector_size), dtype="float")
    for i, doc in enumerate(tqdm(spectrum_documents)):
        spec2vec_embeddings[i, 0:spec2vec_model.vector_size] = calc_vector(spec2vec_model, doc,
                                                                           intensity_weighting_power,
                                                                           allowed_missing_percentage)
    
    return spec2vec_embeddings

# Create the Spec2Vec embeddings (they are added to the dataframe further below)
spec2vec_embeddings = create_spec2vec_embeddings()

  0%|          | 0/16360 [00:00<?, ?it/s]
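Conceptually, a Spec2Vec spectrum embedding is a sum of the word vectors of its peaks, each weighted by the peak intensity raised to `intensity_weighting_power`. The following is a simplified sketch of that idea on toy data, not the library's implementation: it ignores the missing-percentage check, and `weighted_embedding` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def weighted_embedding(words, intensities, word_vectors, power=0.5):
    """Sum the vectors of known peak words, weighted by intensity**power."""
    dim = len(next(iter(word_vectors.values())))
    embedding = np.zeros(dim)
    for word, intensity in zip(words, intensities):
        if word in word_vectors:  # peaks unknown to the model contribute nothing
            embedding += (intensity ** power) * np.asarray(word_vectors[word])
    return embedding

# Toy 2-dimensional "model" with two known peak words
toy_vectors = {"peak@100.00": [1.0, 0.0], "peak@150.05": [0.0, 2.0]}
vec = weighted_embedding(
    ["peak@100.00", "peak@150.05", "peak@999.99"],  # the last word is not in the model
    [1.0, 0.25, 0.5],
    toy_vectors,
)
print(vec)
```

With a power of 0.5, a peak at a quarter of the base intensity still contributes half the weight, so low-intensity peaks are not drowned out.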

### 2.2 MS2DeepScore

In [8]:
from ms2deepscore.models import load_model
from ms2deepscore import MS2DeepScore

data_path = '/lustre/BIF/nobackup/unen004/data'

def create_ms2ds_embeddings():
    """Create and save MS2DeepScore embeddings to file.
    """
    # Load pretrained ms2ds model
    ms2ds_model_fn = os.path.join(data_path, 'ms2ds', 'ms2ds_model_20210419-221701_data210409_10k_500_500_200.hdf5')
    ms2ds_model = load_model(ms2ds_model_fn)

    # Init MS2DeepScore
    ms2ds_score = MS2DeepScore(ms2ds_model)
    ms2ds_score.model.spectrum_binner.allowed_missing_percentage = 50

    # Generate embeddings from spectra
    ms2ds_embeddings = ms2ds_score.calculate_vectors(spectrums)
    
    # Save embeddings to file, because creating them takes ~15 min
    with open(os.path.join(data_path, 'ms2ds', 'ms2ds_embeddings.pickle'), 'wb') as f:
        pickle.dump(ms2ds_embeddings, f) 
        
    return ms2ds_embeddings

def load_ms2ds_embeddings():    
    """Return ms2ds embedding vectors
    """
    with open(os.path.join(data_path, 'ms2ds', 'ms2ds_embeddings.pickle'), 'rb') as f:
        ms2ds_embeddings = pickle.load(f)
        return ms2ds_embeddings

#ms2ds_embeddings = create_ms2ds_embeddings()
ms2ds_embeddings = load_ms2ds_embeddings()

2022-02-20 17:16:03.175868: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-02-20 17:16:03.175926: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
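Switching between the create and load calls by commenting lines in and out can be replaced by a small cache helper that computes the embeddings only when the pickle file does not exist yet. This is a sketch of that pattern; `load_or_create` is a hypothetical helper and the `expensive` function stands in for the embedding step.

```python
import os
import pickle
import tempfile

def load_or_create(cache_fn, create_fn):
    """Return the pickled result if cache_fn exists, else compute and cache it."""
    if os.path.exists(cache_fn):
        with open(cache_fn, "rb") as f:
            return pickle.load(f)
    result = create_fn()
    with open(cache_fn, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive():
    """Stand-in for a slow computation; records how often it runs."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "embeddings.pickle")
    first = load_or_create(cache, expensive)   # computes and writes the file
    second = load_or_create(cache, expensive)  # reads the file

print(len(calls), first == second)
```

A stale cache is the main risk of this pattern: if `spectrums` changes, the file has to be deleted by hand.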


### 2.3 Mol2Vec

In [9]:
from mol2vec.features import mol2alt_sentence, MolSentence #,sentences2vec
from rdkit import Chem

def sentences2vec_new(sentences, model, unseen=None):
    """mol2vec.sentences2vec - but updated to Gensim version 4.
    https://github.com/samoturk/mol2vec/pull/16
    """
    keys = set(model.wv.key_to_index)
    vec = []
    
    if unseen:
        unseen_vec = model.wv.get_vector(unseen)

    for sentence in sentences:
        if unseen:
            vec.append(sum([model.wv.get_vector(y) if y in keys
                            else unseen_vec for y in sentence]))
        else:
            vec.append(sum([model.wv.get_vector(y) for y in sentence
                            if y in keys]))
    return np.array(vec)

def create_mol2vec_embeddings():
    """Create Mol2Vec embeddings vectors
    """
    # Load pretrained mol2vec model 
    mol2vec_model_fn = os.path.join(data_path, 'mol2vec', 'model_300dim.pkl')
    mol2vec_model = Word2Vec.load(mol2vec_model_fn)

    df['mol'] = ""
    df['sentence'] = ""
    for i, row in tqdm(df.iterrows(), total=df.shape[0]):
        mol = Chem.MolFromSmiles(row['smiles'])  # This gives the hydrogen warning
        # MolFromSmiles returns None for SMILES that RDKit cannot parse
        if mol is not None and mol.GetNumAtoms() != 0:
            df.at[i, 'mol'] = mol  # rdkit/molecule
            df.at[i, 'sentence'] = MolSentence(mol2alt_sentence(mol, radius=1))
    mol2vec_embeddings = [x for x in sentences2vec_new(df['sentence'], mol2vec_model, unseen='UNK')]
    return mol2vec_embeddings

mol2vec_embeddings = create_mol2vec_embeddings()

  0%|          | 0/16360 [00:00<?, ?it/s]
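A Mol2Vec embedding is the sum of the vectors of the substructure identifiers in the molecule's sentence, with identifiers unknown to the model replaced by the `UNK` vector. The sketch below shows this on a toy vocabulary; `sentence_to_vector` and the vectors are assumptions for illustration. Unlike a bare `sum([...])`, which returns the integer 0 for an empty sentence, it starts from a zero vector so that every result has the same shape.

```python
import numpy as np

def sentence_to_vector(sentence, vectors, unseen="UNK"):
    """Sum token vectors; tokens missing from `vectors` use the `unseen` vector."""
    unseen_vec = np.asarray(vectors[unseen], dtype=float)
    total = np.zeros_like(unseen_vec)  # keeps the shape fixed for empty sentences
    for token in sentence:
        total += np.asarray(vectors.get(token, unseen_vec), dtype=float)
    return total

# Toy 2-dimensional vocabulary
toy_vectors = {
    "864942730": [1.0, 0.0],
    "1510328189": [0.0, 1.0],
    "UNK": [0.5, 0.5],
}
known_and_unknown = sentence_to_vector(["864942730", "1510328189", "42"], toy_vectors)
empty = sentence_to_vector([], toy_vectors)
print(known_and_unknown, empty)
```

Rows whose SMILES could not be converted keep an empty sentence in the notebook, so this difference decides whether the embedding list can be stacked into a regular 2-D array.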



In [15]:
# Add embedding vectors to df
df['ms2ds'] = [x for x in ms2ds_embeddings]
df['spec2vec'] = [x for x in spec2vec_embeddings]
df['mol2vec'] = [x for x in mol2vec_embeddings]
df.tail()

Unnamed: 0,spectrum_id,compound_name,smiles,inchikey,inchikey14,cf_kingdom,cf_superclass,cf_class,cf_subclass,cf_direct_parent,npc_class_results,npc_superclass_results,npc_pathway_results,npc_isglycoside,mol,sentence,ms2ds,spec2vec,mol2vec
16355,CCMSLIB00006112253,Carnosic acid,O=C(O)C12C=3C(O)=C(O)C(=CC3CCC1C(C)(C)CCC2)C(C)C,QRYRORQUOLYVBU-VBKZILBWSA-N,QRYRORQUOLYVBU,Organic compounds,Lipids and lipid-like molecules,Prenol lipids,Diterpenoids,Diterpenoids,Abietane diterpenoids,Diterpenoids,Terpenoids,0,"<img data-content=""rdkit/molecule"" src=""data:i...","(864942730, 1510328189, 2246699815, 1292826808...","[27.573139190673828, 11.943531036376953, 11.95...","[45.552994983480886, -46.004177415837, -2.5947...","[-0.5933314, 1.833904, -2.99647, 0.3149404, 0...."
16356,CCMSLIB00006112262,Isobavachalcone,O=C(C=CC1=CC=C(O)C=C1)C2=CC=C(O)C(=C2O)CC=C(C)C,DUWPGRAKHMEPCM-IZZDOVSWSA-N,DUWPGRAKHMEPCM,Organic compounds,Phenylpropanoids and polyketides,"Linear 1,3-diarylpropanoids",Chalcones and dihydrochalcones,3-prenylated chalcones,Chalcones,Flavonoids,Shikimates and Phenylpropanoids,0,"<img data-content=""rdkit/molecule"" src=""data:i...","(864942730, 1510328189, 2246699815, 1627070083...","[27.06756019592285, 6.022364616394043, 16.0695...","[60.6237717086733, -12.419441839078276, -9.988...","[0.31012917, 0.8686801, -6.186948, 3.3131878, ..."
16357,CCMSLIB00006112307,Daucosterol;Sitogluside,OCC1OC(OC2CC3=CCC4C(CCC5(C)C(CCC45)C(C)CCC(CC)...,NPJICTMALKLTFW-OFUAXYCQSA-N,NPJICTMALKLTFW,Organic compounds,Lipids and lipid-like molecules,Steroids and steroid derivatives,Stigmastanes and derivatives,Stigmastanes and derivatives,Stigmastane steroids,Steroids,Terpenoids,1,"<img data-content=""rdkit/molecule"" src=""data:i...","(864662311, 1535166686, 2245384272, 3153477100...","[43.09640884399414, 21.87156105041504, 4.65105...","[27.656290851015424, 9.136454038112202, 48.378...","[-3.1778405, -10.28999, -7.77818, 0.881122, 4...."
16358,CCMSLIB00006112312,Isoshaftoside,O=C1C=C(OC2=C1C(O)=C(C(O)=C2C3OC(CO)C(O)C(O)C3...,OVMFOVNOXASTPA-SDSMHRFWSA-N,OVMFOVNOXASTPA,Organic compounds,Phenylpropanoids and polyketides,Flavonoids,Flavonoid glycosides,Flavonoid 8-C-glycosides,Flavones,Flavonoids,Shikimates and Phenylpropanoids,0,"<img data-content=""rdkit/molecule"" src=""data:i...","(864942730, 10565946, 3217380708, 3628883864, ...","[34.61747741699219, 6.432244300842285, 9.56135...","[15.183186117005787, -15.596486092501944, 19.3...","[4.702742, -7.1017447, -10.525252, 3.7643328, ..."
16359,CCMSLIB00006112328,Lutein,OC1C=C(C)C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C...,KBPHJBAIARWVSC-NSIPBSJQSA-N,KBPHJBAIARWVSC,Organic compounds,Lipids and lipid-like molecules,Prenol lipids,Tetraterpenoids,Xanthophylls,"Carotenoids (C40, β-ε)",Carotenoids (C40),Terpenoids,0,"<img data-content=""rdkit/molecule"" src=""data:i...","(864662311, 266675433, 2976033787, 1531406731,...","[42.814483642578125, 17.666553497314453, 0.0, ...","[8.151431105364194, -37.66852868733449, 55.801...","[-10.978773, 0.80958086, -12.219066, -3.163424..."


In [12]:
with open(os.path.join(data_path, 'full_dataframe.pickle'), 'wb') as f:
    pickle.dump(df, f)

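### 2.4 Normalizing
Each embedding vector is scaled to unit length (L2 norm, the default of scikit-learn's `normalize`), which puts all three embeddings on the same scale.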

In [13]:
# Normalize embedding vectors
from sklearn.preprocessing import normalize

# Normalize embeddings on samples
ms2ds_embeddings_normalized = normalize(ms2ds_embeddings)
spec2vec_embeddings_normalized = normalize(spec2vec_embeddings)
mol2vec_embeddings_normalized = normalize(mol2vec_embeddings)

# Overwrite the embedding columns in df with the normalized vectors
df['ms2ds'] = [x for x in ms2ds_embeddings_normalized]
df['spec2vec'] = [x for x in spec2vec_embeddings_normalized]
df['mol2vec'] = [x for x in mol2vec_embeddings_normalized]
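With its default arguments (`norm='l2'`, `axis=1`), `normalize` divides every row by its Euclidean length. The NumPy sketch below reproduces that operation on toy data to show its effect; `l2_normalize_rows` is a hypothetical helper, not what the notebook calls.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero
    return matrix / norms

# Two vectors pointing in the same direction at different scales, plus a zero row
toy = np.array([[3.0, 4.0], [30.0, 40.0], [0.0, 0.0]])
normalized = l2_normalize_rows(toy)
print(normalized)
print(np.linalg.norm(normalized, axis=1))
```

The first two rows become identical after normalization, which illustrates that only the direction of an embedding is kept. For unit vectors the dot product equals the cosine similarity.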

Lastly, save the dataframe to a file.

In [14]:
with open(os.path.join(data_path, 'full_dataframe_normalized.pickle'), 'wb') as f:
    pickle.dump(df, f)