# Spec2Vec API – Quickstart

*To use the client, you need an access token provided by DataRevenue.*
## Introduction

This is a short guide on how to add GNPS library matches to your mass spectra using Omigami's Spec2Vec API.

The API uses a Spec2Vec model that was trained on the entire GNPS spectral library. It embeds each of your spectra into a vector space and calculates the cosine similarity between that embedding and the embeddings of all GNPS library spectra. It then returns the top library matches for each of your spectra. To learn more about Spec2Vec, read our gentle introduction: [Spec2Vec: The Next Step in Mass Spectral Similarity Metrics](https://www.datarevenue.com/en-blog/spec2vec-mass-spectral-similarity-metric)
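To make the ranking step concrete, here is a minimal sketch in base R of scoring one query embedding against a small library by cosine similarity and keeping the best matches. The embeddings, the library IDs (`lib_A` etc.) and the helper names `cosine_similarity` and `top_matches` are made up for illustration; the real embeddings are computed server-side by the Spec2Vec model and are not exposed by the client.

```r
# Toy embeddings: rows are library spectra, columns are embedding dimensions.
# All values and IDs are invented for illustration.
library_embeddings <- rbind(
  lib_A = c(1, 0, 0),
  lib_B = c(1, 1, 0),
  lib_C = c(0, 0, 1)
)
query_embedding <- c(2, 2, 0)

# Cosine similarity between two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Score the query against every library row and keep the n_best highest scores.
top_matches <- function(query, library, n_best = 2) {
  scores <- apply(library, 1, cosine_similarity, b = query)
  head(sort(scores, decreasing = TRUE), n_best)
}

best <- top_matches(query_embedding, library_embeddings)
print(best)
```

Because cosine similarity depends only on the angle between vectors, `query_embedding` and `lib_B` score as identical here even though their lengths differ.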

This notebook shows you how to:
1. Specify any MGF file
2. Run a library search through the Spec2Vec API
3. Save the results as CSV

# 1. (Down)load an MS/MS dataset (MGF format)

We'll use a small MS/MS dataset in the MGF format from [here](https://gnps-external.ucsd.edu/gnpslibrary/GNPS-COLLECTIONS-MISC.mgf).
You can also pick any other dataset from the [GNPS spectral library](https://gnps-external.ucsd.edu/gnpslibrary) or, of course, use your own.

*Note that each spectrum in your MGF file needs the precursor m/z field `PEPMASS` and its peak list (m/z and abundance pairs).*
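Before sending a file to the API, it can help to check that every spectrum meets these requirements. The sketch below is one way to do that in base R; `check_mgf` is a hypothetical helper (not part of `romigami`), the example file contents are invented, and the peak-line pattern is a simplification that assumes whitespace-separated numeric pairs.

```r
# Write a tiny example MGF file (made-up values) so the sketch is self-contained.
example_mgf <- tempfile(fileext = ".mgf")
writeLines(c(
  "BEGIN IONS",
  "PEPMASS=381.08",
  "CHARGE=1+",
  "112.05 1500.0",
  "236.10 820.5",
  "END IONS",
  "BEGIN IONS",
  "CHARGE=1+",
  "98.06 300.0",
  "END IONS"
), example_mgf)

# Hypothetical helper: report, per spectrum, whether PEPMASS is present
# and how many peak lines were found.
check_mgf <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  # Assign each line to a spectrum by counting BEGIN IONS markers seen so far.
  spectrum <- cumsum(lines == "BEGIN IONS")
  keep <- spectrum > 0
  blocks <- split(lines[keep], spectrum[keep])
  # Simplified pattern for a peak line: two numbers separated by spaces or tabs.
  peak_pattern <- "^[0-9.]+([eE][+-]?[0-9]+)?[ \t]+[0-9.]+"
  data.frame(
    spectrum = seq_along(blocks),
    has_pepmass = vapply(blocks, function(b) any(grepl("^PEPMASS=", b)), logical(1)),
    n_peaks = vapply(blocks, function(b) sum(grepl(peak_pattern, b)), integer(1))
  )
}

report <- check_mgf(example_mgf)
print(report)
```

To check your own file, call `check_mgf(path_to_mgf)` and look for rows where `has_pepmass` is `FALSE` or `n_peaks` is 0.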

In [1]:
# Load your own MS/MS dataset (and skip the next cell)
path_to_mgf <- '/path/to/local_dataset.mgf'

In [2]:
# OR download a small MS/MS dataset from GNPS into the same directory as this notebook
url <- 'https://gnps-external.ucsd.edu/gnpslibrary/GNPS-COLLECTIONS-MISC.mgf'

path_to_mgf <- 'GNPS-COLLECTIONS-MISC.mgf' # use your preferred saving path here

download.file(url, path_to_mgf, method = "curl")

# 2. Query for the best matches with Spec2Vec

`Romigami` is an R wrapper that creates a virtual environment, installs the Python `Omigami` package into it, and calls that package from R.

`Spec2Vec` is a Python wrapper that:
- Builds a JSON payload from the MGF file
- Calls the Spec2Vec API
- Formats the prediction results into readable dataframes

____
- `n_best` sets the number of matches you'd like per spectrum (it is set to 10 by default).

In the results dataframes, each input spectrum is identified by the number in the dataframe's index, which refers to its position in the MGF file.  
For example, `matches of spectrum 1` gives the spectrum_id and Spec2Vec scores of the library matches for the first spectrum in the MGF file.

For each spectrum in the MGF file, the library matches are sorted by their Spec2Vec similarity score (best first).  
The following information about the matched library spectra is returned:
- `score`, the Spec2Vec similarity score between the input spectrum and the library spectrum
- `matches of spectrum #`, the spectrum_ID of the matched library spectra for the spectrum number # in the MGF file
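If you want to re-rank or trim one of the returned dataframes yourself, the following base R sketch shows one way to do it. The dataframe here is made up, with row names standing in for library spectrum IDs, and it assumes the `score` column arrives as a list column (the `<list>` type shown in this notebook's example output), which is why it is flattened first.

```r
# Made-up matches for one spectrum; row names stand in for library spectrum IDs.
matches <- data.frame(
  Compound_name = c("compound A", "compound B", "compound C"),
  row.names = c("ID_001", "ID_002", "ID_003"),
  stringsAsFactors = FALSE
)
# Mimic a list-typed score column.
matches$score <- list(0.21, 0.45, 0.18)

# Flatten the list column to a numeric vector, then sort best-first.
matches$score <- as.numeric(unlist(matches$score))
ranked <- matches[order(matches$score, decreasing = TRUE), , drop = FALSE]

# Keep only the two best matches.
top2 <- head(ranked, 2)
print(top2)
```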

In [3]:
# install devtools and romigami if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("omigami/romigami")

# load the romigami package
library("romigami")

Skipping install of 'romigami' from a github remote, the SHA1 (e2a38577) has not changed since last install.
  Use `force = TRUE` to force installation



In [4]:
# environment setup
initialize_environment()

virtualenv: Romigami
Using virtual environment '/home/rouven/.virtualenvs/Romigami' ...


In [5]:
# Run Spec2Vec library search with your user token
n_best_matches <- 10
include_metadata <- list("Smiles", "Compound_name")
ion_mode <- "positive"  # either positive or negative
# "https://mlops.datarevenue.com/seldon/seldon/spec2vec-{ion_mode}/api/v0.1/predictions"
spectra_matches <- match_spectra_from_path(token = "YOUR_ACCESS_TOKEN",  # replace with your own token
                                           mgf_path = path_to_mgf,
                                           n_best = n_best_matches,
                                           include_metadata = include_metadata,
                                           ion_mode = ion_mode
                                          )
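Rather than typing the access token into the notebook, one option is to read it from an environment variable. The sketch below uses only base R; the variable name `OMIGAMI_TOKEN` and the helper `get_token` are assumptions chosen for illustration and are not defined by `romigami`.

```r
# Hypothetical helper: read the token from an environment variable and
# fail early with a clear message if it is missing.
get_token <- function(var = "OMIGAMI_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop(sprintf("Environment variable %s is not set.", var))
  }
  token
}

# For demonstration only: set a dummy value in this R session.
# In practice you would set the variable outside the notebook, e.g. in ~/.Renviron.
Sys.setenv(OMIGAMI_TOKEN = "dummy-token-for-illustration")
token <- get_token()

# The value can then be passed on, e.g. match_spectra_from_path(token = token, ...)
```

This keeps the token out of saved notebooks and version control.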

## 2.1 View results
A list of dataframes is returned, one per input spectrum. To look at a specific dataframe, call:
```r
spectra_matches[i]  # 'i' refers to the i-th spectrum in the MGF input file (R indexing starts at 1)
```

In [6]:
spectra_matches[1]

,Compound_name,Smiles,score
,<chr>,<chr>,<list>
CCMSLIB00000005869,Progesterone,CC(=O)[C@H]1CC[C@@H]2[C@@]1(CC[C@H]3[C@H]2CCC4=CC(=O)CC[C@]34C)C,0.2057379
CCMSLIB00000006368,Nizatidine,CN/C(=C\[N+](=O)[O-])/NCCSCC1=CSC(=N1)CN(C)C,0.1863432
CCMSLIB00000206026,Anileridine,[H]N([H])c(c([H])3)c([H])c([H])c(c([H])3)C([H])([H])C([H])([H])N(C([H])([H])1)C([H])([H])C([H])([H])C(C(=O)OC([H])([H])C([H])([H])[H])(c(c([H])2)c([H])c([H])c([H])c([H])2)C([H])([H])1,0.1847168
CCMSLIB00000222124,"[1,3]Benzodioxolo[5,6-c]-1,3-dioxolo[4,5-i]phenanthridin-6-ol, 5b,6,7,12b,13,14-hexahydro-13-methyl-, (5bR,6S,12bS)-",CN1CC2=C(C=CC3=C2OCO3)C4C1C5=CC6=C(C=C5CC4O)OCO6,0.2022142
CCMSLIB00000425025,Hymenamide B,O=C2N5[C@H](C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H]2Cc3ccccc3)CCC(=O)O)C(C)C)Cc4ccccc4)CC(=O)N)CCC5,0.1794558
CCMSLIB00000847644,"(1R)-2-chloro-1,7-dihydroxy-3,9-dimethoxy-1-methylbenzo[c]chromene-4,6-dione",COC1=CC(O)=C2C(=O)OC3=C(C2=C1)[C@@](C)(O)C(Cl)=C(OC)C3=O,0.2486994
CCMSLIB00004688372,"3-methyl-8,10,20,22-tetraoxa-3-azapentacyclo[15.7.0.0?,��.0?,��.0�?,��]tetracosa-1(17),5,7(11),12,18,23-hexaen-14-one",CN1Cc2cc3c(cc2CCC(=O)c2cc4c(cc2C1)OCO4)OCO3,0.1792837
CCMSLIB00004710054,"4'-ethenyl-2'-hydroxy-1,4',4a-trimethyl-5-oxospiro[2,3,4,7,8,8a-hexahydronaphthalene-6,1'-cyclopentane]-1-carboxylic acid",C=CC1(C)CC(O)C2(CCC3C(C)(C(=O)O)CCCC3(C)C2=O)C1,0.179185
CCMSLIB00004751476,Thelephoric acid,O=C(C1=C2C3=C(C=C(O)C(O)=C3)O1)C4=C(OC5=C4C=C(O)C(O)=C5)C2=O,0.4527032
CCMSLIB00005771122,Anileridine,[H]N([H])c(c([H])3)c([H])c([H])c(c([H])3)C([H])([H])C([H])([H])N(C([H])([H])1)C([H])([H])C([H])([H])C(C(=O)OC([H])([H])C([H])([H])[H])(c(c([H])2)c([H])c([H])c([H])c([H])2)C([H])([H])1,0.1847168


# 3. Save results

Execute the following cell to save the results as CSV files. For readability, each dataframe is saved in its own CSV file in the `matches` directory.

In [7]:
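if (!dir.exists("matches")){
    dir.create("matches")
}
for (i in seq_along(spectra_matches)){
    # [[i]] extracts the i-th dataframe itself, so as.character is applied column by column
    matches <- data.frame(lapply(spectra_matches[[i]], as.character),
                          row.names = rownames(spectra_matches[[i]]),
                          stringsAsFactors = FALSE)
    write.csv(matches, sprintf("matches/spectrum_%s.csv", i))
}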
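If you later want all matches in a single table, the per-spectrum files can be read back and stacked. The following base R sketch assumes files named `spectrum_<n>.csv` as written by the cell above; to stay self-contained it writes two small made-up files to a temporary directory first, and the column name `spectrum_id` is chosen here for illustration.

```r
# Self-contained demo: write two small made-up result files to a temp directory.
out_dir <- file.path(tempdir(), "matches_demo")
dir.create(out_dir, showWarnings = FALSE)
write.csv(
  data.frame(Compound_name = c("compound A", "compound B"),
             score = c(0.45, 0.21),
             row.names = c("ID_002", "ID_001")),
  file.path(out_dir, "spectrum_1.csv")
)
write.csv(
  data.frame(Compound_name = "compound C", score = 0.30, row.names = "ID_003"),
  file.path(out_dir, "spectrum_2.csv")
)

# Read every per-spectrum file and stack the rows into one table.
files <- list.files(out_dir, pattern = "^spectrum_[0-9]+\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  # The first column holds the row names that write.csv stored.
  names(df)[1] <- "spectrum_id"
  # Recover the spectrum number from the file name.
  df$spectrum <- as.integer(sub("^spectrum_([0-9]+)\\.csv$", "\\1", basename(f)))
  df
}))
print(combined)
```

For your own results, point `out_dir` at the `matches` directory instead of the temporary one.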

# 4. Create Plots
The following cells show examples of how to visualize your matches.


### Creating a Plot of the molecular structure


In [None]:
# R lists are 1-indexed: [[1]] selects the matches of the first spectrum
plot_molecule_structure_grid(spectra_matches=spectra_matches[[1]], representation="smiles", draw_indices=1, molecule_image_size=list(200, 200), substructure_highlight=1)

### Creating a Plot using the ClassyFire API

In [None]:
# R lists are 1-indexed: [[1]] selects the matches of the first spectrum
plot_classyfire_result(spectra_matches=spectra_matches[[1]])

### Creating a Plot using the NP-Classifier API

In [None]:
# R lists are 1-indexed: [[1]] selects the matches of the first spectrum
plot_NPclassifier_result(spectra_matches=spectra_matches[[1]], color="orange")


___