# Summary
This notebook is a variant of notebook 1 in which we try out some different parameters.

In this notebook, mass differences are extracted from the AllPositive dataset and expressed per spectrum. The dataset, like the pickled ALL_GNPS_210125_positive_cleaned_by_matchms_and_lookups file (version 2), is constructed here: https://github.com/iomega/spec2vec_gnps_data_analysis/tree/master/notebooks . The pickled gnps_positive_ionmode_cleaned_by_matchms_and_lookups file (version 1) is created here: https://github.com/louwenjjr/improve_library_matching .

Steps in this notebook:
- Reading spectra
- Processing spectra
- Extracting mass differences (new)

In [1]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
from math import ceil

data_path = "/mnt/scratch/louwe015/Mass_differences/data/"

## Determine AllPositive version; default is version 2

In [2]:
all_pos_version1 = False
if all_pos_version1:
    all_pos = "gnps_positive_ionmode_cleaned_by_matchms_and_lookups"
else:
    all_pos = "ALL_GNPS_210125_positive_cleaned_by_matchms_and_lookups"

print(all_pos)

ALL_GNPS_210125_positive_cleaned_by_matchms_and_lookups


## Reading spectra

In [3]:
all_positive_file = os.path.join(data_path, all_pos + ".pickle")
if os.path.exists(all_positive_file):
    with open(all_positive_file, 'rb') as inf:
        spectrums = pickle.load(inf)  # list of matchms.Spectrum.Spectrum
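else:
    # Fail early instead of hitting a NameError on `spectrums` later
    raise FileNotFoundError(f"Spectra file not found: {all_positive_file}")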

In [4]:
print("number of spectra:", len(spectrums))

number of spectra: 144691


## Processing spectra
Here we try some alternative processing steps:

- Add only whitelisted mass differences (MDs) during preprocessing
- Multiply MD intensities instead of taking the mean
- Punish intensities, for example by taking the square root?
- Allow a maximum number of MDs per peak
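
The intensity-related options above can be compared on a single peak pair. The following is only a minimal sketch: `combine_intensities` and its `mode` names are hypothetical and do not appear elsewhere in this notebook, and reading the square-root idea as "square root of the product" is an assumption.

```python
import numpy as np


def combine_intensities(int_i: float, int_j: float, mode: str = "mean") -> float:
    """Combine two peak intensities into one mass-difference intensity.

    Hypothetical helper for illustration only.
    """
    if mode == "mean":
        # Arithmetic mean of both peak intensities
        return float(np.mean([int_i, int_j]))
    if mode == "product":
        # Multiplying favours pairs in which both peaks are intense
        return float(int_i * int_j)
    if mode == "sqrt_product":
        # Assumption: square root of the product (the geometric mean)
        return float(np.sqrt(int_i * int_j))
    raise ValueError(f"unknown mode: {mode}")


# One intense peak (1.0) paired with one weak peak (0.25)
for mode in ("mean", "product", "sqrt_product"):
    print(mode, combine_intensities(1.0, 0.25, mode))
```

For normalised intensities between 0 and 1, the product is never larger than the mean, so a pair containing a weak peak is ranked lower than with averaging.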

In [8]:
# load top30 file
top30_file = os.path.join(data_path, all_pos + "_top30_peaks.pickle")
if os.path.exists(top30_file):
    with open(top30_file, 'rb') as inf:
        spectrums_top30 = pickle.load(inf)  # list of matchms.Spectrum.Spectrum
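else:
    # Fail early instead of hitting a NameError on `spectrums_top30` later
    raise FileNotFoundError(f"Top-30 peaks file not found: {top30_file}")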

In [7]:
from matchms.Spikes import Spikes
from matchms.typing import SpectrumType

def get_mass_differences(spectrum_in: SpectrumType, cutoff: int = 36, n_max: int = 100) -> Spikes:
    """Returns Spikes with the top n_max mass differences and their intensities
    
    Parameters:
    spectrum_in:
        Spectrum in matchms.Spectrum format
    cutoff:
        Mass cutoff; only mass differences larger than this are kept (like Xing et al.)
    n_max:
        Maximum number of mass differences to select, ranked on the mean
        intensity of the two peaks involved (like Xing et al.)
    """
    if spectrum_in is None:
        return None

    spectrum = spectrum_in.clone()
    peaks_mz, peaks_intensities = spectrum.peaks
    mass_diff_mz = []
    mass_diff_intensities = []
    for i, (mz_i, int_i) in enumerate(zip(peaks_mz[:-1], peaks_intensities[:-1])):
        for mz_j, int_j in zip(peaks_mz[i+1:], peaks_intensities[i+1:]):
            mz_diff = mz_j-mz_i
            if mz_diff > cutoff:
                mass_diff_mz.append(mz_diff)
                mass_diff_intensities.append(np.mean([int_i, int_j]))
    mass_diff_mz = np.array(mass_diff_mz)
    mass_diff_intensities = np.array(mass_diff_intensities)
    idx = mass_diff_intensities.argsort()[-n_max:]
    idx_sort_by_mz = mass_diff_mz[idx].argsort()
    mass_diff_peaks = Spikes(mz=mass_diff_mz[idx][idx_sort_by_mz],
                             intensities=mass_diff_intensities[idx][idx_sort_by_mz])
    return mass_diff_peaks
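

To make the selection logic of `get_mass_differences` easier to follow, here is a NumPy-only sketch of the same steps on a toy peak list. It does not use matchms; the function name `mass_differences_numpy` and the toy values are made up for illustration, and the peaks are assumed to be sorted by ascending m/z.

```python
import numpy as np


def mass_differences_numpy(mz, intensities, cutoff=36.0, n_max=100):
    """Illustrative re-implementation using plain arrays (hypothetical helper)."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    # All index pairs (i, j) with i < j
    i, j = np.triu_indices(len(mz), k=1)
    diffs = mz[j] - mz[i]
    mean_int = (intensities[i] + intensities[j]) / 2
    # Keep only mass differences above the cutoff
    keep = diffs > cutoff
    diffs, mean_int = diffs[keep], mean_int[keep]
    # Select the n_max most intense differences, then sort them by mass
    top = np.argsort(mean_int)[-n_max:]
    order = np.argsort(diffs[top])
    return diffs[top][order], mean_int[top][order]


# Toy spectrum with four peaks; keep only the two most intense differences
md_mz, md_int = mass_differences_numpy(
    [100.0, 120.0, 150.0, 210.0], [0.2, 1.0, 0.4, 0.6], cutoff=36.0, n_max=2
)
print(md_mz, md_int)
```

In this toy case the pairs 120/210 and 150/210 have the highest mean intensities, so the differences 60 and 90 are returned with intensities of about 0.5 and 0.8.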
