In [12]:
from pathlib import Path
import urllib.request

import pandas as pd
from pandas import ExcelWriter

import spec2vec

# Spec2Vec API – Quickstart
## Introduction

This short guide explains how to add GNPS library matches to your mass spectra using Omigami's Spec2Vec API.

The API uses a Spec2Vec model trained on the entire GNPS spectral library. It embeds each of your spectra into a vector space, calculates the cosine similarity to every GNPS library spectrum in that space, and returns the top library matches for each of your spectra. To learn more about Spec2Vec, read our gentle introduction: [Spec2Vec: The Next Step in Mass Spectral Similarity Metrics](https://www.datarevenue.com/en-blog/spec2vec-mass-spectral-similarity-metric)
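
To make the ranking step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. It is not the Spec2Vec implementation: the three-dimensional vectors and the library IDs (`LIB_A`, `LIB_B`, `LIB_C`) are made up for illustration, and real embeddings come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def top_matches(query, library, n_best=2):
    """Return the n_best (library_id, score) pairs, best score first."""
    scored = [(lib_id, cosine_similarity(query, vec)) for lib_id, vec in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_best]


# Toy embeddings (hypothetical values, not real Spec2Vec output)
library = {
    "LIB_A": [1.0, 0.0, 0.0],
    "LIB_B": [0.9, 0.1, 0.0],
    "LIB_C": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

best = top_matches(query, library, n_best=2)
for lib_id, score in best:
    print(f"{lib_id}: {score:.3f}")
```

The `n_best` argument plays the same role as the `n_best_spectra` parameter used later in this notebook: it caps how many ranked matches are kept per query spectrum.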

This notebook shows you how to:
1. Specify any MGF file
2. Run a library search through the Spec2Vec API
3. Save the results as XLSX

## Installation
Run the following cell to install the required external packages (`pandas`, `requests`, and `matchms`).

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

# 1. (Down)load an MS/MS dataset (MGF format)

We'll use a small MS/MS dataset in the MGF format from [here](https://gnps-external.ucsd.edu/gnpslibrary/GNPS-COLLECTIONS-MISC.mgf).
You can also pick any other dataset from the [GNPS spectral library](https://gnps-external.ucsd.edu/gnpslibrary), or use your own.

*Note that your MGF file must contain the Precursor_MZ field (`PEPMASS`) and the m/z-abundance peak pairs for each spectrum.*
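
If you are unsure whether your own file meets these requirements, a quick pre-flight check can help. The sketch below is a hypothetical helper (`check_mgf_text` is not part of `spec2vec.py` or `matchms`); it assumes the usual MGF layout of `BEGIN IONS` / `END IONS` blocks with `KEY=VALUE` lines followed by whitespace-separated peak lines.

```python
def _is_peak_line(line):
    """True if the line looks like 'm/z abundance' (two leading numbers)."""
    parts = line.split()
    if len(parts) < 2:
        return False
    try:
        float(parts[0])
        float(parts[1])
    except ValueError:
        return False
    return True


def check_mgf_text(mgf_text):
    """Return a list of problems found, one entry per issue (empty list = OK)."""
    problems = []
    in_block = False
    index = 0  # 1-based position of the current spectrum in the file
    has_pepmass = False
    n_peaks = 0
    for raw_line in mgf_text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN IONS":
            in_block, has_pepmass, n_peaks = True, False, 0
            index += 1
        elif line == "END IONS":
            if in_block and not has_pepmass:
                problems.append(f"spectrum {index}: missing PEPMASS")
            if in_block and n_peaks == 0:
                problems.append(f"spectrum {index}: no peak pairs")
            in_block = False
        elif in_block and line:
            if line.upper().startswith("PEPMASS="):
                has_pepmass = True
            elif "=" not in line and _is_peak_line(line):
                n_peaks += 1
    return problems


# Two toy spectra: the second one lacks the PEPMASS field
sample = """BEGIN IONS
PEPMASS=181.07
CHARGE=1+
59.01 120.0
71.01 850.5
END IONS
BEGIN IONS
TITLE=no precursor here
89.02 300.0
END IONS
"""

problems = check_mgf_text(sample)
print(problems)
```

To check a file on disk, read it first, for example `check_mgf_text(Path(path_to_mgf).read_text())`.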

In [5]:
# Load your own MS/MS dataset
home = str(Path.home())
path_to_mgf = f'{home}/path/to/dataset.mgf'

In [13]:
# Or download a small MS/MS dataset from GNPS 
url = 'https://gnps-external.ucsd.edu/gnpslibrary/GNPS-COLLECTIONS-MISC.mgf'

home = str(Path.home())
path_to_mgf = f'{home}/Downloads/GNPS-COLLECTIONS-MISC.mgf' # use your preferred save path here

urllib.request.urlretrieve(url, path_to_mgf)

('/Users/pierre/Downloads/GNPS-COLLECTIONS-MISC.mgf',
 <http.client.HTTPMessage at 0x7f89ecb8b210>)

# 2. Get GNPS library matches with Spec2Vec

`spec2vec.py` is a Python wrapper that:
- Builds a json payload from the MGF file
- Calls the Spec2Vec API
- Formats the prediction results into readable dataframes

(see the `spec2vec.py` file associated with this notebook)
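
The first step in the list above, turning spectra into a JSON payload, might look roughly like the sketch below. The actual schema expected by the API is defined in `spec2vec.py` and is not shown in this notebook, so the layout and the field names `data`, `parameters`, and `peaks_json` are assumptions for illustration only. No request is sent.

```python
import json


def build_payload(spectra, n_best_spectra=10):
    """Serialize parsed spectra into a JSON string (hypothetical schema)."""
    records = []
    for spectrum in spectra:
        records.append({
            "Precursor_MZ": spectrum["precursor_mz"],
            # Peaks as [m/z, abundance] pairs
            "peaks_json": [[mz, abundance] for mz, abundance in spectrum["peaks"]],
        })
    return json.dumps({
        "data": records,
        "parameters": {"n_best_spectra": n_best_spectra},
    })


# One toy spectrum with two peaks (values are made up)
spectra = [{"precursor_mz": 181.07, "peaks": [(59.01, 120.0), (71.01, 850.5)]}]

payload = build_payload(spectra, n_best_spectra=5)
decoded = json.loads(payload)
print(decoded["parameters"])
```

Round-tripping the payload through `json.loads` is a cheap way to confirm that it is valid JSON before it goes over the wire.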
____
- `n_best_spectra` sets the number of matches you'd like per spectrum (it is set to 10 by default).

In the result dataframes, each input spectrum is identified by its number in the dataframe index, which refers to its position in the MGF file.  
For example, `matches of spectrum 1` gives the spectrum_id and Spec2Vec scores of the library matches for the first spectrum in the MGF file.

For each spectrum in the MGF file, the library matches are sorted by their Spec2Vec similarity score (best first).  
The following information is returned for the matched library spectra:
- `score`, the Spec2Vec similarity score between the input spectrum and the library spectrum
- `matches of spectrum #`, the spectrum_ID of the matched library spectra for the spectrum number # in the MGF file

In [14]:
# Run Spec2Vec library search
spec2vec_matches = spec2vec.run(mgf_file = path_to_mgf, n_best_spectra = 10)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

## 2.1 View results
A list of dataframes is returned. To look at a specific dataframe, index the list:
```python
spec2vec_matches[i]  # 'i' refers to the (i+1)th spectrum in the MGF file input
```

In [None]:
i = 0  # index of the spectrum to inspect (0 is the first spectrum in the MGF file)
spec2vec_matches[i]

# 3. Save results

Execute the following cell to save the results in an Excel file. For readability, each dataframe is saved in its own sheet.

In [10]:
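home = str(Path.home())
output_path = f"{home}/Downloads/spec2vec_spectra_matches.xlsx"

# The context manager writes and closes the file on exit
# (`writer.save()` was removed in pandas 2.0)
with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
    for i, spectrum_match_dataframe in enumerate(spec2vec_matches):
        # One sheet per input spectrum, numbered from 1
        spectrum_match_dataframe.to_excel(writer, sheet_name=f'spectrum #{i+1}')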
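
The `spectrum #N` sheet names used above are valid as they are. If you would rather name sheets after spectrum titles or IDs, keep in mind that Excel limits sheet names to 31 characters and rejects the characters `[ ] : * ? / \`. The helper below is a hypothetical sketch (`safe_sheet_name` is not part of pandas or `spec2vec.py`) that cleans a name before it is passed as `sheet_name`.

```python
import re

# Characters Excel does not allow in sheet names
_FORBIDDEN = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LENGTH = 31


def safe_sheet_name(name):
    """Replace forbidden characters and truncate to Excel's length limit."""
    cleaned = _FORBIDDEN.sub("_", name).strip("'")
    return cleaned[:MAX_SHEET_NAME_LENGTH] or "sheet"


print(safe_sheet_name("spectrum #1"))
print(safe_sheet_name("matches: CCMSLIB[1]/a very long descriptive title"))
```

Note that truncation can make two long names collide, so a numeric prefix or suffix is a reasonable addition when names are derived from free-text titles.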

____