# Summary
In this notebook, mass differences are extracted from the AllPositive dataset and expressed per spectrum. The dataset is constructed in the notebooks at https://github.com/iomega/spec2vec_gnps_data_analysis/tree/master/notebooks, and the pickled gnps_positive_ionmode_cleaned_by_matchms_and_lookups file is created in https://github.com/louwenjjr/improve_library_matching.

Steps in this notebook:
- Reading spectra
- Processing spectra
- Extracting mass differences and fragments as words

In [1]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle

## Reading spectra

In [9]:
all_positive_file = "/mnt/scratch/louwe015/Mass_differences/data/gnps_positive_ionmode_cleaned_by_matchms_and_lookups.pickle"
if os.path.exists(all_positive_file):
    with open(all_positive_file, 'rb') as inf:
        spectrums = pickle.load(inf)  # list of matchms.Spectrum.Spectrum
else:
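    # stop here: without the input file, `spectrums` would be undefined below
    raise FileNotFoundError("File not found: {}".format(all_positive_file))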

In [3]:
print("number of spectra:", len(spectrums))

number of spectra: 112956


## Processing spectra
The processing is similar to that of Xing et al. (2020): https://pubs-acs-org.ezproxy.library.wur.nl/doi/full/10.1021/acs.analchem.0c02521
Xing et al. also apply a square-root transformation to the normalised intensities. That is not done here (yet).
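
For reference, the square-root transformation mentioned above could be sketched as follows. This is a minimal illustration on a plain list of already normalised intensities, not the implementation of Xing et al.; `sqrt_transform` is a hypothetical helper and is not part of matchms.

```python
import math


def sqrt_transform(intensities):
    """Return the square root of each normalised intensity.

    Hypothetical helper for illustration; expects values in [0, 1].
    """
    return [math.sqrt(i) for i in intensities]


# toy intensities, already normalised so that the base peak is 1
normalised = [1.0, 0.25, 0.01]
transformed = sqrt_transform(normalised)
print(transformed)
```

The base peak stays at 1, while weaker peaks are raised relative to it, so low-intensity fragments carry more weight in later steps.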

Steps:
- normalise peaks (maximum intensity to 1)
- remove peaks outside [0, 1000] m/z window
- keep the 30 most intense peaks (as in Xing et al. (2020)); spectra with fewer than 2 peaks are discarded
- remove peaks with intensities < 0.001 of the maximum intensity, unless that would leave fewer than 2 peaks

In [4]:
from matchms.filtering import normalize_intensities
from matchms.filtering import select_by_mz
from matchms.filtering import select_by_relative_intensity
from matchms.filtering import reduce_to_number_of_peaks
# from matchms.filtering import add_losses

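def post_process_md(s, min_peaks=2):
    """Normalise, restrict to 0-1000 m/z and keep at most the 30 most intense peaks.

    Returns None if the spectrum has fewer than min_peaks peaks.
    """
    s = normalize_intensities(s)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    # returns None when fewer than min_peaks peaks are present
    s = reduce_to_number_of_peaks(s, n_required=min_peaks, n_max=30)
    if s is None:
        return None
    # drop low-intensity peaks only if at least min_peaks peaks remain afterwards
    s_remove_low_peaks = select_by_relative_intensity(s, intensity_from=0.001)
    if len(s_remove_low_peaks.peaks) >= min_peaks:
        s = s_remove_low_peaks
    return s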

# apply post processing steps to the data
spectrums_top30 = [post_process_md(s) for s in spectrums]

# omit spectrums that didn't qualify for analysis
spectrums_top30 = [s for s in spectrums_top30 if s is not None]

print("{} remaining spectra.".format(len(spectrums_top30)))

111472 remaining spectra.


In [6]:
top30_file = "/mnt/scratch/louwe015/Mass_differences/data/gnps_positive_ionmode_cleaned_by_matchms_and_lookups_top30_peaks.pickle"
with open(top30_file, 'wb') as outf:
    pickle.dump(spectrums_top30, outf)

In [8]:
top30_file = "/mnt/scratch/louwe015/Mass_differences/data/gnps_positive_ionmode_cleaned_by_matchms_and_lookups_top30_peaks.pickle"
if os.path.exists(top30_file):
    with open(top30_file, 'rb') as inf:
        spectrums_top30 = pickle.load(inf)  # list of matchms.Spectrum.Spectrum
else:
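    # stop here: without the processed file, `spectrums_top30` cannot be reloaded
    raise FileNotFoundError("File not found: {}".format(top30_file))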

In [14]:
test_spec = spectrums_top30[0]
[i for i in test_spec.peaks]

[array([456.107544, 469.872314, 515.922852, 538.003174, 539.217773,
        554.117432, 556.030396, 582.110107, 599.352783, 600.382812,
        696.332397, 724.081909, 830.409302, 839.455811, 847.433594,
        851.380859, 852.370605, 866.295654, 868.372192, 909.424438,
        919.39624 , 932.351929, 936.552002, 939.61792 , 949.370239,
        953.396606, 954.491577, 963.686768, 964.524658, 965.192139]),
 array([0.13785328, 0.1145734 , 0.10302372, 0.28900242, 0.35616456,
        0.09970269, 0.28046377, 0.13351462, 1.        , 0.14946182,
        0.15854461, 0.11577153, 0.13417909, 0.09418422, 0.2569733 ,
        0.32199162, 0.36216307, 0.14520426, 0.10205448, 0.31933114,
        0.11226477, 0.17874806, 0.13489457, 0.10109441, 0.1117298 ,
        0.71323034, 0.16211023, 0.34214536, 0.41616014, 0.16272238])]

In [15]:
cutoff = 36  # like Xing et al.


30
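
The summary lists the extraction of mass differences as the next step. A minimal sketch of that idea is given here, assuming that `cutoff` is the minimum mass difference (in Da) to keep; that reading of the value 36 is an assumption, and `mass_differences` is a hypothetical helper working on a plain list of m/z values rather than on a matchms spectrum.

```python
from itertools import combinations


def mass_differences(mzs, cutoff=36.0):
    """Return the sorted pairwise m/z differences that are >= cutoff.

    Hypothetical helper; `cutoff` is assumed to be a minimum difference in Da.
    """
    diffs = [abs(a - b) for a, b in combinations(mzs, 2)]
    return sorted(d for d in diffs if d >= cutoff)


# toy m/z values, not taken from the dataset
example_mzs = [100.0, 118.0, 150.0, 200.0]
diffs = mass_differences(example_mzs, cutoff=36.0)
print(diffs)
```

For a processed spectrum, the m/z array (for example `test_spec.peaks.mz`) could be passed in the same way. With 30 peaks there are at most 435 pairs per spectrum, so the brute-force pairwise approach stays cheap.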