# Using ms2deepscore: How to load data, train a model, and compute similarities.

In [9]:
from pathlib import Path

from matchms.importing import load_from_mgf
from tensorflow import keras
import pandas as pd

from ms2deepscore import SpectrumBinner
from ms2deepscore.data_generators import DataGeneratorAllSpectrums
from ms2deepscore.models import SiameseModel
from ms2deepscore import MS2DeepScore

## Data loading

Here we load a small sample of test spectra together with the corresponding reference scores.

In [2]:
TEST_RESOURCES_PATH = Path.cwd().parent / 'tests' / 'resources'
spectrums_filepath = str(TEST_RESOURCES_PATH / "pesticides_processed.mgf")
score_filepath = str(TEST_RESOURCES_PATH / "pesticides_tanimoto_scores.json")

Load the processed spectra from the .mgf file. For the processing itself, see the [matchms](https://github.com/matchms/matchms) documentation.

In [3]:
spectrums = list(load_from_mgf(spectrums_filepath))

Load the reference scores from a .json file. The result is a Pandas DataFrame of reference similarity scores (the labels) for compounds identified by their InChIKeys. Both the index and the columns should be InChIKeys, and the value at a given row and column is the similarity score for that pair. The DataFrame must be symmetric (`reference_scores_df[i, j] == reference_scores_df[j, i]`), and the column names must be identical to the index.

In [4]:
tanimoto_scores_df = pd.read_json(score_filepath)
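The two requirements on the reference scores (identical index and column labels, symmetric values) can be verified before training. The following is a minimal sketch, not part of ms2deepscore: the helper name `check_reference_scores` and the toy labels and scores are made up for illustration.

```python
import numpy as np
import pandas as pd


def check_reference_scores(scores_df: pd.DataFrame) -> None:
    """Raise ValueError if the table is not a labelled symmetric matrix.

    Hypothetical helper for illustration; not an ms2deepscore function.
    """
    if list(scores_df.index) != list(scores_df.columns):
        raise ValueError("Index and column labels must be identical.")
    values = scores_df.to_numpy(dtype=float)
    if not np.allclose(values, values.T):
        raise ValueError("Scores must be symmetric.")


# Toy table with made-up labels and scores.
labels = ["KEY_A", "KEY_B", "KEY_C"]
toy_scores_df = pd.DataFrame(
    [[1.0, 0.4, 0.1],
     [0.4, 1.0, 0.7],
     [0.1, 0.7, 1.0]],
    index=labels,
    columns=labels,
)
check_reference_scores(toy_scores_df)  # passes silently for a valid table
```

With the data loaded in this notebook, the same check would be called as `check_reference_scores(tanimoto_scores_df)`.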

## Data preprocessing

Bin the spectra using `ms2deepscore.SpectrumBinner`. In this binned form, the spectra can be fed to the model.

In [5]:
spectrum_binner = SpectrumBinner(1000, mz_min=10.0, mz_max=1000.0, peak_scaling=0.5)
binned_spectrums = spectrum_binner.fit_transform(spectrums)

Spectrum binning: 100%|██████████| 76/76 [00:00<00:00, 1366.15it/s]
Create BinnedSpectrum instances: 100%|██████████| 76/76 [00:00<00:00, 69478.44it/s]

Collect spectrum peaks...
Calculated embedding dimension: 543.
Convert spectrums to binned spectrums...
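To give an intuition for what binning does, here is a minimal sketch of fixed-width binning with intensity scaling, using the same parameter values as the cell above. It is not the `SpectrumBinner` implementation: the function name `bin_peaks` is hypothetical, and keeping the highest scaled intensity per bin is an assumption made for this example.

```python
def bin_peaks(mz_values, intensities, number_of_bins=1000,
              mz_min=10.0, mz_max=1000.0, peak_scaling=0.5):
    """Map peaks to {bin_index: scaled_intensity} (illustrative sketch only)."""
    bin_width = (mz_max - mz_min) / number_of_bins
    binned = {}
    for mz, intensity in zip(mz_values, intensities):
        if not mz_min <= mz < mz_max:
            continue  # peaks outside the m/z range are dropped
        index = int((mz - mz_min) / bin_width)
        scaled = intensity ** peak_scaling
        # Assumption: when several peaks share a bin, keep the largest one.
        binned[index] = max(binned.get(index, 0.0), scaled)
    return binned


# Made-up peaks: two fall into the same bin, one lies above mz_max.
toy_binned = bin_peaks([10.0, 100.5, 100.9, 1200.0], [1.0, 0.25, 0.04, 1.0])
print(toy_binned)
```

A sparse representation like this only stores occupied bins, which is consistent with the reported embedding dimension (543) being smaller than the number of bins (1000).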

Create a data generator that will generate batches of training examples.
Each training example consists of a pair of binned spectra and the corresponding reference similarity score.

In [11]:
dimension = len(spectrum_binner.known_bins)
data_generator = DataGeneratorAllSpectrums(binned_spectrums, tanimoto_scores_df,
                                           dim=dimension)
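The structure of a training example (two spectra plus the reference score of their compounds) can be sketched in plain Python. This is a simplified illustration, not how `DataGeneratorAllSpectrums` works internally: pairs are drawn uniformly at random here, and the function name, labels, and scores are all hypothetical.

```python
import random


def sample_training_pairs(spectrum_inchikeys, reference_scores, batch_size, seed=0):
    """Return a list of (index_a, index_b, label) tuples (illustrative sketch).

    spectrum_inchikeys: one compound label per spectrum.
    reference_scores: nested dict, reference_scores[key_a][key_b] -> score.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        index_a = rng.randrange(len(spectrum_inchikeys))
        index_b = rng.randrange(len(spectrum_inchikeys))
        key_a = spectrum_inchikeys[index_a]
        key_b = spectrum_inchikeys[index_b]
        batch.append((index_a, index_b, reference_scores[key_a][key_b]))
    return batch


# Made-up data: three spectra, two of which belong to the same compound.
toy_inchikeys = ["KEY_A", "KEY_A", "KEY_B"]
toy_reference_scores = {
    "KEY_A": {"KEY_A": 1.0, "KEY_B": 0.3},
    "KEY_B": {"KEY_A": 0.3, "KEY_B": 1.0},
}
toy_batch = sample_training_pairs(toy_inchikeys, toy_reference_scores, batch_size=4)
```

Note that the label is looked up by compound, not by spectrum, so two different spectra of the same compound get a reference score of 1.0.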

## Model training

Initialize a `SiameseModel`. It consists of a dense 'base' network that produces an embedding for each of the two inputs. The 'head' model then computes the cosine similarity between the two embeddings.

In [10]:
model = SiameseModel(spectrum_binner, base_dims=(200, 200, 200), embedding_dim=200,
                     dropout_rate=0.2)
model.compile(loss='mse', optimizer=keras.optimizers.Adam(learning_rate=0.001))
model.summary()

Model: "base"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
base_input (InputLayer)      [(None, 543)]             0         
_________________________________________________________________
dense1 (Dense)               (None, 200)               108800    
_________________________________________________________________
normalization1 (BatchNormali (None, 200)               800       
_________________________________________________________________
dropout1 (Dropout)           (None, 200)               0         
_________________________________________________________________
dense2 (Dense)               (None, 200)               40200     
_________________________________________________________________
normalization2 (BatchNormali (None, 200)               800       
_________________________________________________________________
dropout2 (Dropout)           (None, 200)               0      
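The cosine similarity computed by the 'head' is the dot product of the two embeddings divided by the product of their lengths. A minimal sketch in plain Python, with made-up two-dimensional vectors standing in for the 200-dimensional embeddings (the function name is hypothetical and this is not the model's own code):

```python
import math


def cosine_similarity(embedding_a, embedding_b):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot_product = sum(a * b for a, b in zip(embedding_a, embedding_b))
    norm_a = math.sqrt(sum(a * a for a in embedding_a))
    norm_b = math.sqrt(sum(b * b for b in embedding_b))
    return dot_product / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
```

Because the measure depends only on the direction of the embeddings, scaling one embedding by a positive constant leaves the score unchanged.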

Train the model on the data. For the sake of simplicity, we use the same dataset for training and validation.

In [12]:
model.fit(data_generator,
          validation_data=data_generator,
          epochs=2)

Epoch 1/2
Epoch 2/2
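Validating on the training data says little about how the model generalizes. Outside of a demonstration, one option is to hold out a set of compounds so that no InChIKey appears in both sets. The following is a sketch under that assumption; the function name, the 20% fraction, and the labels are hypothetical and not part of ms2deepscore.

```python
import random


def split_inchikeys(inchikeys, validation_fraction=0.2, seed=42):
    """Split unique compound labels into disjoint training and validation sets."""
    unique_keys = sorted(set(inchikeys))
    rng = random.Random(seed)
    rng.shuffle(unique_keys)
    n_validation = max(1, int(len(unique_keys) * validation_fraction))
    validation_keys = set(unique_keys[:n_validation])
    training_keys = set(unique_keys[n_validation:])
    return training_keys, validation_keys


# Made-up labels; duplicates mimic several spectra of the same compound.
toy_keys = [f"KEY_{i}" for i in range(10)] + ["KEY_0", "KEY_1"]
training_keys, validation_keys = split_inchikeys(toy_keys)
```

The spectra would then be filtered by compound into two groups, each with its own data generator. How the InChIKey is read from each spectrum depends on its metadata and is not shown here.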


## Model inference

Calculate the similarity for a pair of spectra.

In [15]:
similarity_measure = MS2DeepScore(model)
score = similarity_measure.pair(spectrums[0], spectrums[1])
print(score)

Spectrum binning: 100%|██████████| 1/1 [00:00<00:00, 1144.11it/s]
Create BinnedSpectrum instances: 100%|██████████| 1/1 [00:00<00:00, 9532.51it/s]
Spectrum binning: 100%|██████████| 1/1 [00:00<00:00, 870.91it/s]
Create BinnedSpectrum instances: 100%|██████████| 1/1 [00:00<00:00, 8830.11it/s]

0.7736728371253915

Calculate the 3x3 similarity matrix for the first three spectra.

In [17]:
scores = similarity_measure.matrix(spectrums[:3], spectrums[:3])
print(scores)

Spectrum binning: 100%|██████████| 3/3 [00:00<00:00, 1661.99it/s]
Create BinnedSpectrum instances: 100%|██████████| 3/3 [00:00<00:00, 14074.85it/s]
Calculating vectors of reference spectrums: 100%|██████████| 3/3 [00:00<00:00, 21.24it/s]
Spectrum binning: 100%|██████████| 3/3 [00:00<00:00, 1515.83it/s]
Create BinnedSpectrum instances: 100%|██████████| 3/3 [00:00<00:00, 11949.58it/s]
Calculating vectors of reference spectrums: 100%|██████████| 3/3 [00:00<00:00, 19.07it/s]

[[1.         0.77367284 0.76113528]
 [0.77367284 1.         0.79715826]
 [0.76113528 0.79715826 1.        ]]
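Since the same three spectra are used as references and as queries, the result should be symmetric with ones on the diagonal, and entry `[0, 1]` should agree with the pair score printed earlier. A small sketch that checks this with NumPy, using the values copied from the output above:

```python
import numpy as np

# Values copied from the printed outputs in this notebook.
scores = np.array([
    [1.0, 0.77367284, 0.76113528],
    [0.77367284, 1.0, 0.79715826],
    [0.76113528, 0.79715826, 1.0],
])
pair_score = 0.7736728371253915  # result of similarity_measure.pair(...)

is_symmetric = np.allclose(scores, scores.T)
has_unit_diagonal = np.allclose(np.diag(scores), 1.0)
matches_pair_score = np.isclose(scores[0, 1], pair_score)
```

All three checks hold for the values shown here.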
