SpeCollate is the first deep learning-based peptide-spectrum similarity network. It allows searching a peptide database by generating embeddings for both mass spectra and database peptides. K-nearest neighbor search is performed on a GPU in the embedding space to find the k (usually k=5) nearest peptides for each spectrum.
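The nearest-neighbor step can be illustrated with a small CPU-only sketch. It is not SpeCollate's GPU implementation: the brute-force loop, the use of Euclidean distance, and the function names `l2_normalize` and `k_nearest_peptides` are assumptions made for illustration, and the toy embeddings are 2-dimensional rather than 256-dimensional.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric; ranks like cosine on unit vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest_peptides(spectrum_emb, peptide_embs, k=5):
    """Return the indices of the k peptide embeddings closest to spectrum_emb."""
    scored = ((squared_distance(spectrum_emb, p), i) for i, p in enumerate(peptide_embs))
    return [i for _, i in heapq.nsmallest(k, scored)]


# Toy data: four "peptide" embeddings and one "spectrum" embedding.
peptides = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0])]
spectrum = l2_normalize([0.9, 0.1])
top2 = k_nearest_peptides(spectrum, peptides, k=2)
```

On unit-length vectors, ranking by Euclidean distance and ranking by cosine similarity give the same order, so either choice returns the same candidates.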
The SpeCollate network consists of two branches: the Spectrum Sub-Network (SSN) and the Peptide Sub-Network (PSN). The SSN processes spectra and generates spectral embeddings, while the PSN processes peptide sequences and generates peptide embeddings. Both types of embeddings are real-valued vectors of dimension 256. The network architecture is shown in Fig 1 below.
Fig 1: SpeCollate network architecture. Spectra are encoded in dense arrays of length 80,000, where each index represents an m/z bin of width 0.1 Da. Hence, spectra with a maximum m/z of 8,000 can be encoded using this technique. Encoded spectra are passed through the SSN, which consists of two fully connected layers of dimensions 80,000 x 1,024 and 1,024 x 256. The output from the second layer is normalized to have unit length. Similarly, peptide sequences are integer encoded, where each amino acid and modification character is assigned a unique integer value. These encoded peptide vectors are passed through the embedding layer, which learns a 256-dimensional embedding for each amino acid. The output from the embedding layer is then passed through the PSN, which consists of two BiLSTMs and two fully connected layers of dimensions 2,048 x 1,024 and 1,024 x 256. The output from the last layer is normalized to unit length.
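The two input encodings described in the caption can be sketched as follows. This is a minimal illustration, not the tool's preprocessing code. Several details are assumptions: intensities that fall into the same bin are summed, index 0 is reserved for padding, and the vocabulary covers only the 20 standard amino acids (the real mapping, including modification characters, may differ).

```python
NUM_BINS = 80_000   # dense array length from the caption
BINS_PER_DA = 10    # 0.1 Da per bin

# Hypothetical vocabulary: 0 is reserved for padding, amino acids start at 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_spectrum(peaks):
    """Bin (m/z, intensity) peaks into a dense list of length NUM_BINS."""
    dense = [0.0] * NUM_BINS
    for mz, intensity in peaks:
        if mz < 0:
            continue  # ignore invalid peaks
        idx = int(mz * BINS_PER_DA)
        if idx < NUM_BINS:  # peaks at m/z >= 8,000 do not fit and are dropped
            dense[idx] += intensity  # assumption: sum intensities per bin
    return dense


def encode_peptide(sequence, pad_to=None):
    """Map each residue to its integer code, optionally zero-padding to pad_to."""
    codes = [VOCAB[aa] for aa in sequence]  # KeyError on unknown characters
    if pad_to is not None:
        if len(codes) > pad_to:
            raise ValueError("sequence is longer than pad_to")
        codes += [0] * (pad_to - len(codes))
    return codes
```

For example, a peak at m/z 100.05 lands in bin 1000, and `encode_peptide("ACD")` yields `[1, 2, 3]` under this hypothetical vocabulary.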
To train SpeCollate, we design a custom loss function called SNAP-Loss, which is inspired by the Triplet Loss function. In SNAP-Loss, the loss is calculated over sextuplets of data points, where each sextuplet consists of an anchor spectrum, a positive peptide, two negative spectra, and two negative peptides.
SNAP-Loss extends Triplet Loss to multi-modal data, in our case numerical spectra and peptide sequences. For this purpose, we consider all possible negatives (qj, pk, ql, pm) for a given positive pair (qi, pi) and average the total loss. The four possible negatives are explained below:
- qj: The negative spectrum for qi.
- pk: The negative peptide for qi.
- ql: The negative spectrum for pi.
- pm: The negative peptide for pi.
To calculate the loss value, we first define a few variables that are precomputed in the distance matrices above, as follows:
Then the SNAP-loss is calculated for a batch of size b as follows:
The training process is visualized in the figure below:
Once the sextuplets are generated, the loss is calculated using the SNAP-Loss function and the network parameters are updated by backpropagation.
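To make the sextuplet structure concrete, the following sketch computes a loss for a single sextuplet. It is a hedged illustration rather than the authors' exact formula. It assumes that each of the four negatives contributes a standard triplet hinge term against the positive pair, that distances are Euclidean, and that the four terms are averaged. The function names are hypothetical, and the default margin of 0.2 is the tuned value from the hyperparameter table.

```python
import math


def distance(a, b):
    """Euclidean distance between two embeddings (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hinge(pos_dist, neg_dist, margin):
    """Triplet-style hinge: zero once the negative is farther by at least margin."""
    return max(0.0, pos_dist - neg_dist + margin)


def sextuplet_loss(q_i, p_i, q_j, p_k, q_l, p_m, margin=0.2):
    """Average of four hinge terms for one (anchor, positive, 4 negatives) sextuplet."""
    d_pos = distance(q_i, p_i)
    terms = [
        hinge(d_pos, distance(q_i, q_j), margin),  # negative spectrum for q_i
        hinge(d_pos, distance(q_i, p_k), margin),  # negative peptide for q_i
        hinge(d_pos, distance(p_i, q_l), margin),  # negative spectrum for p_i
        hinge(d_pos, distance(p_i, p_m), margin),  # negative peptide for p_i
    ]
    return sum(terms) / len(terms)


def batch_loss(sextuplets, margin=0.2):
    """Mean sextuplet loss over a batch (assumption: simple average)."""
    return sum(sextuplet_loss(*s, margin=margin) for s in sextuplets) / len(sextuplets)
```

When every negative is farther from its anchor than the positive pair by at least the margin, all four terms vanish and the sextuplet contributes nothing to the gradient.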
The tuned hyperparameters are given in Table 1 below, together with the values tested for each:
Hyperparameter | Value | Values Tested |
---|---|---|
Learning Rate | 0.0001 | 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01 |
Weight Decay | 0.0001 | 1e-6, 1e-5, 1e-4, 1e-3 |
Margin | 0.2 | 0.1, 0.2, 0.3, 0.4 |
Embedding Dim | 256 | 32, 64, 128, 256, 512, 1028, 2048 |
FC Layers | 2 | 1, 2, 3 |
BiLSTM Layers | 2 | 1, 2, 3, 4 |
SpeCollate is available as a standalone executable that can be downloaded and run on a Linux server with a CUDA-enabled GPU.
Two different executables are included in the downloadable specollate.tar.gz file: 1) specollate_train for retraining a model and 2) specollate_search for performing database search using a trained model. A pre-trained model is provided within the download file.
The sections below explain the setup for retraining the model.
- A computer with Ubuntu 16.04 (or later) or CentOS 8.1 (or later).
- At least 120 GB of system memory and 10 CPU cores.
- A CUDA-enabled GPU with at least 12 GB of memory, and CUDA Toolkit 10.0 (or later).
- The OpenMS tool for creating a custom peptide database (optional).
- Crux for FDR analysis using its percolator option.
1. Download the specollate.tar.gz file and extract the contents using the following command:

       tar -xzf specollate.tar.gz

   The extracted directory contains multiple files, including:
   - `specollate-train`: The executable for training SpeCollate.
   - `specollate-search`: The executable for database search.
   - `config.ini`: Parameter file for training and searching.
   - `models` (dir): Contains the pre-trained model. New models will also be stored here.
   - `percolator` (dir): Percolator input (.pin) files are placed here after the search is complete.
2. Download the preprocessed data for training (here) and extract the contents using:

       tar -xzf specollate-training-data.tar.gz
3. Open the config.ini file from step 1 in your favorite text editor and set the following parameters:
   - `in_tensor_dir` in the [preprocess] section: Absolute path of the decompressed file from step 2.
   - `model_name` in the [ml] section: The name under which you wish to save the trained model file.
   - Other parameters in the [ml] section: You can adjust different hyperparameters in the [ml] section, e.g., learning_rate, dropout, etc.
4. Execute the specollate_train file:

       ./specollate_train
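Setting the training parameters in config.ini can also be scripted instead of done by hand. The sketch below uses Python's standard `configparser` module and assumes that config.ini is a plain INI file that `configparser` can parse. The helper name `set_training_params` is hypothetical; the section and key names are the ones listed in the steps above.

```python
import configparser


def set_training_params(config_path, tensor_dir, model_name, learning_rate=None):
    """Update the [preprocess] and [ml] entries used for retraining."""
    # interpolation=None keeps '%' characters in paths literal.
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path)

    for section in ("preprocess", "ml"):
        if not config.has_section(section):
            config.add_section(section)

    config["preprocess"]["in_tensor_dir"] = tensor_dir
    config["ml"]["model_name"] = model_name
    if learning_rate is not None:
        config["ml"]["learning_rate"] = str(learning_rate)

    # Note: configparser drops comments when it rewrites the file.
    with open(config_path, "w") as fh:
        config.write(fh)
```

A call such as `set_training_params("config.ini", "/abs/path/to/training-data", "my_model")` leaves all other keys untouched. Because comments in the file are lost on rewrite, keeping a backup copy of the original config.ini is advisable.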
1. Same as step 1 in the previous section.
2. Download one of the mgf files, or use your own spectra files in mgf format.
3. Download the human peptide database. Alternatively, you can provide your own peptide database file created using the Digestor tool provided by OpenMS.
4. Set the following parameters in the [search] section of the `config.ini` file:
   - `model_name`: Name of the model to be used. The model should be in the `/models` directory.
   - `mgf_dir`: Absolute path to the directory containing mgf files to be searched.
   - `prep_dir`: Absolute path to the directory where preprocessed mgf files will be saved.
   - `pep_dir`: Absolute path to the directory containing the peptide database.
   - `out_pin_dir`: Absolute path to a directory where percolator pin files will be saved. The directory must exist; otherwise, the process will exit with an error.
   - Set database search parameters, e.g., `precursor_mass_tolerance`.
5. Execute the specollate_search file:

       python run_search.py

   If you want to use the preprocessed spectra from a previous run, use the `-p False` flag:

       python run_search.py -p False
6. Once the search is complete, you can analyze the percolator files using the crux percolator tool:

       cd <out_pin_dir>
       crux percolator target.pin decoy.pin --list-of-files T --overwrite T
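Because a missing output directory makes the search exit with an error, it can help to check the [search] settings before launching a run. The following is a minimal pre-flight sketch, not part of SpeCollate. The helper name `check_search_config` is hypothetical, and it assumes that config.ini is a plain INI file with unquoted paths. Only the directories that hold inputs or that are stated to be required are checked for existence.

```python
import configparser
import os

PATH_KEYS = ("mgf_dir", "prep_dir", "pep_dir", "out_pin_dir")
MUST_EXIST = ("mgf_dir", "pep_dir", "out_pin_dir")


def check_search_config(config_path):
    """Return a list of human-readable problems found in the [search] section."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path)
    if not config.has_section("search"):
        return ["missing [search] section"]

    search = config["search"]
    problems = []
    if not search.get("model_name"):
        problems.append("model_name is not set")

    for key in PATH_KEYS:
        path = search.get(key)
        if not path:
            problems.append(f"{key} is not set")
        elif not os.path.isabs(path):
            problems.append(f"{key} must be an absolute path: {path}")
        elif key in MUST_EXIST and not os.path.isdir(path):
            problems.append(f"{key} does not exist: {path}")
    return problems
```

An empty list means the checked settings look consistent; otherwise each entry names the key that needs attention, for example an `out_pin_dir` that has not been created yet.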
If you use our tool, please cite our work:
[1]. Tariq, Muhammad Usman, and Fahad Saeed. "SpeCollate: Deep cross-modal similarity network for mass spectrometry data based peptide deductions." PloS one 16.10 (2021): e0259349.
For questions, suggestions, or technical problems, contact:
mtari008@fiu.edu