# Virtually metabolize GNPS annotations and prepare for Network Annotation Propagation or SIRIUS

Made by Louis-Felix Nothias (UC San Diego), louisfelix.nothias@gmail.com. Started in 2018 and improved in May 2021.

This notebook downloads the spectral annotations from a classical or feature-based molecular networking job on GNPS [[http://gnps.ucsd.edu](http://gnps.ucsd.edu)] and generates virtual metabolites with either SyGMa or BioTransformer. The resulting candidates can be used for [Network Annotation Propagation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/) on GNPS or with [SIRIUS](https://boecker-lab.github.io/docs.sirius.github.io/install/).

> Start by running the cell below to load the libraries.

In [1]:
import sys
sys.path.append('gnps_postprocessing/lib')
sys.path.append('src')
from gnps_download_results import *
from consolidate_structures import *
from gnps_results_postprocess import *
from prepare_virtual_metabolization import *
from run_virtual_metabolization import *

## Mandatory - Download annotation from the GNPS job
 
> Replace the job ID in the cell below with the one from the URL of your GNPS molecular networking job. Both classical molecular networking and feature-based molecular networking (FBMN) jobs are supported.

You can try the classical MN job from this paper (https://pubs.acs.org/doi/10.1021/acs.analchem.8b05854) with the ID `'bbee697a63b1400ea585410fafc95723'`.

Another test job, for feature-based molecular networking (FBMN), is `'e78a8c8f429a46fcb24f3b34d69aff25'`.

In [None]:
job_id = 'bbee697a63b1400ea585410fafc95723'

gnps_download_results(job_id, output_folder ='all_annotations')

This is the GNPS job link: https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=bbee697a63b1400ea585410fafc95723
Downloading the following content: https://gnps.ucsd.edu/ProteoSAFe/DownloadResult?task=bbee697a63b1400ea585410fafc95723&view=view_all_annotations_DB


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7200k    0 7200k    0     0   302k      0 --:--:--  0:00:23 --:--:--  406k
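
The two URLs printed above follow a fixed pattern around the job ID. As a minimal sketch (the helper name `gnps_job_urls` is hypothetical and not part of this notebook's library; the URL patterns are copied from the log above), they can be rebuilt for any job ID:

```python
GNPS_BASE = "https://gnps.ucsd.edu/ProteoSAFe"

def gnps_job_urls(job_id):
    """Return the status and annotation-download URLs for a GNPS job ID."""
    job_id = job_id.strip()
    return {
        "status": f"{GNPS_BASE}/status.jsp?task={job_id}",
        "download": f"{GNPS_BASE}/DownloadResult?task={job_id}&view=view_all_annotations_DB",
    }

urls = gnps_job_urls("bbee697a63b1400ea585410fafc95723")
print(urls["status"])
print(urls["download"])
```

Opening the status URL in a browser is a quick way to confirm that the job ID is correct before downloading.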

In [None]:
gnps_download_results.df_annotations.head(2)

## Mandatory - Filter GNPS annotations

Modify the filtering parameters below and run the cell.

In [None]:
ionisation_mode='pos' #or 'neg'
max_ppm_error=10
min_cosine=0.7
shared_peaks=3
max_spec_charge=2

gnps_download_results_df_annotations_filtered = gnps_filter_annotations(gnps_download_results.df_annotations, 'INCHI', 
                            ionisation_mode,  max_ppm_error, min_cosine, 
                            shared_peaks, max_spec_charge, prefix = '')
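
To make the role of each threshold concrete, here is a minimal sketch of how such cutoffs could be applied to one annotation row. It is not the implementation of `gnps_filter_annotations`: the function name `passes_filters`, the column names (`IonMode`, `MZErrorPPM`, `MQScore`, `SharedPeaks`, `Charge`), and the inclusive comparisons are all assumptions made for illustration.

```python
def passes_filters(row, ionisation_mode="pos", max_ppm_error=10,
                   min_cosine=0.7, shared_peaks=3, max_spec_charge=2):
    """Return True if one annotation row (a dict) passes every cutoff.

    The keys used here (IonMode, MZErrorPPM, MQScore, SharedPeaks, Charge)
    are assumed column names, not confirmed by the notebook.
    """
    return (
        str(row["IonMode"]).lower().startswith(ionisation_mode)
        and abs(float(row["MZErrorPPM"])) <= max_ppm_error
        and float(row["MQScore"]) >= min_cosine
        and int(row["SharedPeaks"]) >= shared_peaks
        and int(row["Charge"]) <= max_spec_charge
    )

# Three made-up rows: only the first one satisfies all cutoffs.
rows = [
    {"IonMode": "Positive", "MZErrorPPM": 2.1, "MQScore": 0.85, "SharedPeaks": 6, "Charge": 1},
    {"IonMode": "Positive", "MZErrorPPM": 25.0, "MQScore": 0.91, "SharedPeaks": 8, "Charge": 1},
    {"IonMode": "Negative", "MZErrorPPM": 1.0, "MQScore": 0.95, "SharedPeaks": 9, "Charge": 1},
]
kept = [row for row in rows if passes_filters(row)]
print(len(kept))
```

Raising `min_cosine` or `shared_peaks` makes the selection stricter; raising `max_ppm_error` makes it more permissive.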

## Mandatory - Consolidating structures identifier

> Run the cell below to obtain a complete set of SMILES and InChI identifiers for the annotations.

**IMPORTANT: Only spectral annotations that have a valid InChI or SMILES identifier are considered downstream. If the annotations you are interested in don't have an identifier in the library, go back to the GNPS library entry, update the entry by adding an identifier, and rerun your GNPS job.**
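
As a rough pre-check, the sketch below flags annotations whose identifiers are both missing. It only screens out empty or placeholder values; it does not test whether a string is chemically parseable, which requires a cheminformatics toolkit. The function name and the set of placeholder strings are assumptions, not taken from the notebook's library.

```python
# Assumed placeholder values for a missing identifier (lower-cased).
PLACEHOLDERS = {"", "n/a", "na", "nan", "none"}

def has_structure_identifier(smiles, inchi):
    """Return True if at least one identifier is non-empty and not a placeholder."""
    def usable(value):
        return value is not None and str(value).strip().lower() not in PLACEHOLDERS
    return usable(smiles) or usable(inchi)

print(has_structure_identifier("N/A", "InChI=1S/CH4/h1H4"))
print(has_structure_identifier("N/A", " "))
```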

In [None]:
df_annotations_consolidated  = consolidate_and_convert_structures(gnps_download_results_df_annotations_filtered, prefix='', 
                                                                  smiles='Smiles', inchi='INCHI')

In [None]:
# We keep only annotations with a structure identifier

df_annotations = get_info_gnps_annotations(df_annotations_consolidated, 
                          inchi_column='Consol_InChI', 
                          smiles_column = 'Consol_SMILES', 
                          smiles_planar_column='Consol_SMILES_iso')

### [Advanced optional feature - Recommended to ignore] - Filter annotations based on compound name

If you want to apply this filter, convert the cell type from raw to code. Otherwise, skip the following cells.

#### Optional - Display compound name

#### Optional - Select compound name to keep

Replace the compound names in the list `compound_name_to_keep`
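
The raw cell that defines this list is not shown here. As a minimal sketch with hypothetical compound names, and assuming a case-insensitive substring match (the matching rule used by `df_annotations_filtering` is not documented in this notebook), the list and its effect could look like this:

```python
# Hypothetical example values; replace them with names from your own annotations.
compound_name_to_keep = ["caffeine", "quinine"]

def name_matches(compound_name, names_to_keep):
    """Case-insensitive substring match (an assumed matching rule)."""
    lowered = compound_name.lower()
    return any(name.lower() in lowered for name in names_to_keep)

print(name_matches("Caffeine - 40.0 eV", compound_name_to_keep))
print(name_matches("Atropine", compound_name_to_keep))
```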


### [Advanced optional feature - Recommended to ignore]  - Filter annotations based on tags

If you want to apply this filter, convert the cell type from raw to code.

#### Optional - Display tags-annotations

#### Optional - Select tags to keep

Specify the tags in the list `tags_to_keep`

## Mandatory - Apply filter (if any were set)
If you haven't selected a filter, run this cell anyway.

In [None]:
# Check whether the optional filter lists exist and apply them as needed.

try: compound_name_to_keep
except NameError: compound_name_to_keep = None

try: tags_to_keep
except NameError: tags_to_keep = None

if compound_name_to_keep is None and tags_to_keep is None:
    df_annotations_filtered = df_annotations
    print('No Compound_Name or Tags filter was used')

elif compound_name_to_keep and tags_to_keep is None:
    df_annotations_filtered = df_annotations_filtering(df_annotations, compound_name=compound_name_to_keep)
    print('Compound name filtering applied')

elif compound_name_to_keep is None and tags_to_keep:
    df_annotations_filtered = df_annotations_filtering(df_annotations, tags=tags_to_keep)
    print('Tag(s) filtering applied')

elif compound_name_to_keep and tags_to_keep:
    df_annotations_filtered = df_annotations_filtering(df_annotations, compound_name=compound_name_to_keep, tags=tags_to_keep)
    print('Compound name and tags filtering applied')

else:
    # Reached when a filter list is defined but empty.
    print('Something is wrong')

print('Number of annotations after filtering = '+str(df_annotations_filtered.shape[0]))

## Mandatory - Choose between planar or stereochemical SMILES

### [RECOMMENDED] Use the planar SMILES for virtual metabolization (no stereochemistry specified)

Run the cell below to use planar structures (set `use_planar_structure_boolean` to `False` to keep stereochemistry instead). Planar structures are recommended because they reflect the level of confidence that computational mass spectrometry annotation can achieve, and they limit the number of candidates to compute.

In [None]:
use_planar_structure_boolean =True # or False
prepare_for_virtual_metabolization(df_annotations_filtered,
                                   compound_name = 'Compound_Name',
                                    smiles_column = 'Consol_SMILES', 
                                    smiles_planar_column='Consol_SMILES_iso',
                                    drop_duplicated_structure = True, 
                                    use_planar_structure = use_planar_structure_boolean)
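
To illustrate why planar structures reduce the number of candidates, the sketch below strips stereo markers from SMILES strings by plain text replacement and then drops duplicates. This is a deliberately crude stand-in, not what `prepare_for_virtual_metabolization` does: it does not canonicalize the SMILES, and a cheminformatics toolkit would be more robust. Both helper names are hypothetical.

```python
def crude_planar_smiles(smiles):
    """Strip stereo markers (@, /, \\) from a SMILES string by text replacement."""
    for marker in ("@", "/", "\\"):
        smiles = smiles.replace(marker, "")
    return smiles

def unique_in_order(items):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

# Two enantiomers and two E/Z isomers collapse to two planar structures.
stereo = ["C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O", "F/C=C/F", "F/C=C\\F"]
planar = unique_in_order(crude_planar_smiles(s) for s in stereo)
print(planar)
```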

## Optional - Add candidate structures

Append extra structures to the virtual metabolization batch.

Upload a tab-separated file through the Jupyter notebook interface (drag and drop) and specify its path in the cell below. The first column must contain the compound name and the second column the SMILES (no header needed). The file can list many compounds, one per line.

In [None]:
extra_compounds_table_file = 'input/extra_compounds-UTF8.tsv'

In [None]:
load_extra_compounds(extra_compounds_table_file)
append_to_list_if_not_present(prepare_for_virtual_metabolization.list_compound_name, prepare_for_virtual_metabolization.list_smiles, 
                              load_extra_compounds.extra_compound_names, load_extra_compounds.extra_compound_smiles)
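
The expected file layout can be checked independently of the notebook's loader. The sketch below reads a headerless two-column tab-separated stream and appends only the compounds not already in the batch. Both function names are hypothetical, and comparing by SMILES string is an assumption: `append_to_list_if_not_present` may use a different rule.

```python
import csv
import io

def read_extra_compounds(handle):
    """Read (name, SMILES) pairs from a headerless tab-separated stream."""
    names, smiles = [], []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            raise ValueError(f"line {line_number}: expected 'name<TAB>SMILES'")
        names.append(row[0].strip())
        smiles.append(row[1].strip())
    return names, smiles

def append_if_absent(names, smiles, extra_names, extra_smiles):
    """Append extra compounds whose SMILES is not already in the batch."""
    for name, smi in zip(extra_names, extra_smiles):
        if smi not in smiles:
            names.append(name)
            smiles.append(smi)

# In-memory stand-in for a file such as 'input/extra_compounds-UTF8.tsv'.
example = io.StringIO("caffeine\tCN1C=NC2=C1C(=O)N(C(=O)N2C)C\nethanol\tCCO\n")
extra_names, extra_smiles = read_extra_compounds(example)
batch_names, batch_smiles = ["ethanol"], ["CCO"]
append_if_absent(batch_names, batch_smiles, extra_names, extra_smiles)
print(batch_names)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` as the handle.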

# Mandatory - Choose between SyGMa (A) or BioTransformer (B) for virtual metabolization

#### A - SyGMa generates human phase 1 and/or phase 2 biotransformations.
It generally takes a couple of minutes to compute. More information in the paper (https://doi.org/10.1002/cmdc.200700312).

#### B - BioTransformer generates biotransformations in mammals, their gut microbiota, and the soil/aquatic microbiota.
It takes more time to compute. More information in the paper ([https://doi.org/10.1186/s13321-018-0324-5](https://doi.org/10.1186/s13321-018-0324-5)).

# A - Virtual metabolization with **SyGMa**

SyGMa is a Python library for the Systematic Generation of potential Metabolites. See [SyGMa: combining expert knowledge and empirical scoring in the prediction of metabolites](https://doi.org/10.1002/cmdc.200700312) and [https://github.com/3D-e-Chem/sygma](https://github.com/3D-e-Chem/sygma).

Please cite their work:
Ridder, L., & Wagener, M. (2008) [SyGMa: combining expert knowledge and empirical scoring in the prediction of metabolites](https://doi.org/10.1002/cmdc.200700312). ChemMedChem, 3(5), 821-832.


### IMPORTANT -> Change the parameters below as needed
> Define the ruleset and the number of phase 1/2 reaction cycles to apply in the SyGMa scenario, for example `phase_1_cycle = 2` for two cycles of phase 1. Using a value > 1 will be slow.

> Define the maximum number of SyGMa candidates to output (consider the number of reaction cycles). Suggested value: `top_sygma_candidates = 15`.

> Run SyGMa.

In [11]:
# Define the number of metabolization cycles (1-3). More than 1 cycle can be slow.
phase_1_cycle = 1
phase_2_cycle = 1

# Number of top-scoring metabolites predicted by SyGMa to output (ranked by highest score)
top_sygma_candidates = 10
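
How `run_sygma_batch` turns these two numbers into a SyGMa scenario is not shown in this notebook. As a hedged sketch, the cycle counts can be viewed as an ordered list of (ruleset, cycles) pairs. The helper `build_scenario_spec` is hypothetical, and allowing `0` to skip a phase is an assumption based on the "phase 1 and/or 2" wording above.

```python
def build_scenario_spec(phase_1_cycle, phase_2_cycle):
    """Return [(ruleset_name, cycles), ...], skipping phases set to 0 cycles."""
    spec = []
    for name, cycles in (("phase1", phase_1_cycle), ("phase2", phase_2_cycle)):
        if not 0 <= cycles <= 3:
            raise ValueError(f"{name}: cycles must be between 0 and 3, got {cycles}")
        if cycles > 0:
            spec.append((name, cycles))
    if not spec:
        raise ValueError("at least one phase needs one or more cycles")
    return spec

spec = build_scenario_spec(1, 1)
print(spec)

# With SyGMa installed, the spec maps onto its scenario constructor:
#   import sygma
#   scenario = sygma.Scenario([[sygma.ruleset[name], n] for name, n in spec])
```

Each additional cycle applies the ruleset again to the products of the previous cycle, which is why values above 1 slow the computation.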

### Run the cell below to run SyGMa (fast!)

No need to change the content of the cell below.

In [12]:
run_sygma_batch(prepare_for_virtual_metabolization.list_smiles, prepare_for_virtual_metabolization.list_compound_name, 
                phase_1_cycle, phase_2_cycle, top_sygma_candidates, 'results_vm-NAP_SyGMa.tsv', 'Compound_Name')

=== Starting SyGMa computation ===
Number of compounds = 86
Batch_size = 13
If you are running many compounds or cycles, and maxing out RAM memory available, you can decrease the batch size. Otherwise the value can be increased for faster computation.
Please wait
Batch 1/7 completed with 130 metabolites
Batch 2/7 completed with 130 metabolites
Batch 3/7 completed with 130 metabolites
Batch 4/7 completed with 130 metabolites
Batch 5/7 completed with 130 metabolites
Batch 6/7 completed with 130 metabolites
Batch 7/7 completed with 80 metabolites
Number of SyGMA candidates = 860
Number of unique SyGMA candidates = 837
===== COMPLETED =====
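
The batch counts in this log follow from the inputs: 86 compounds in batches of 13 give 7 batches, the last one holding 8 compounds. The sketch below reproduces the per-batch metabolite counts under the assumption that every compound yields exactly `top_sygma_candidates = 10` candidates, which matches this log but need not hold for other inputs. The helper `batch_sizes` is hypothetical.

```python
import math

def batch_sizes(n_compounds, batch_size):
    """Split n_compounds into consecutive batches of at most batch_size."""
    n_batches = math.ceil(n_compounds / batch_size)
    return [min(batch_size, n_compounds - i * batch_size) for i in range(n_batches)]

sizes = batch_sizes(86, 13)
top_sygma_candidates = 10
metabolites_per_batch = [size * top_sygma_candidates for size in sizes]
print(len(sizes), metabolites_per_batch, sum(metabolites_per_batch))
```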


When completed, download the full SyGMa results in the left side panel->
['results_vm-NAP_SyGMa.tsv'](./results_vm-NAP_SyGMa.tsv).

## Export the SyGMa results for NAP
See the [NAP documentation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/#structure-database) for how to set up a custom structure database and run NAP on GNPS.

In [13]:
export_for_NAP('results_vm-NAP_SyGMa.tsv', 'Compound_Name')

Number of metabolites = 860
Number of unique metabolites considered = 630


View/Download the results for NAP in the left side panel->
['results_vm-NAP_SyGMa_NAP.tsv'](./results_vm-NAP_SyGMa_NAP.tsv).

To download: go to File/Download or right-click on the file in the left panel.

## Export the SyGMa results for SIRIUS

See the documentation to generate the SIRIUS [custom database here](https://boecker-lab.github.io/docs.sirius.github.io/cli-standalone/#custom-database-tool).

In [14]:
export_for_SIRIUS('results_vm-NAP_SyGMa.tsv', 'Compound_Name')

Number of metabolites = 860
Number of unique metabolites considered = 624


Download the results for SIRIUS in the left side panel->
['results_vm-NAP_SyGMa_SIRIUS.tsv'](./results_vm-NAP_SyGMa_SIRIUS.tsv).

# B - Virtual metabolization with **BioTransformer** (It is slow !)

BioTransformer is a software tool that predicts small molecule metabolism in mammals, their gut microbiota, and the soil/aquatic microbiota. BioTransformer also assists scientists in metabolite identification, based on the metabolism prediction. More information in the paper [[https://doi.org/10.1186/s13321-018-0324-5](https://doi.org/10.1186/s13321-018-0324-5)] and the repository [[https://bitbucket.org/wishartlab/biotransformer3.0jar/src/master/](https://bitbucket.org/wishartlab/biotransformer3.0jar/src/master/)].

### Citation

Djoumbou-Feunang, Y., Fiamoncini, J., Gil-de-la-Fuente, A. et al. [BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification.](https://doi.org/10.1186/s13321-018-0324-5) J Cheminform 11, 2 (2019).

### Install BioTransformer [only needs to be run once]
It requires `curl` and `java`.
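
Before running the install cell, the presence of these tools can be checked from Python. This is a minimal sketch using the standard library; `unzip` is included because the install cell below calls it, and the helper name `missing_tools` is hypothetical.

```python
import shutil

def missing_tools(tools=("java", "curl", "unzip")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("java, curl and unzip are available")
```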

In [11]:
!java -version
!curl -C - https://bitbucket.org/wishartlab/biotransformer3.0jar/get/cc4006a06ed3.zip -o BioTransformer3.0.zip
!unzip -n -q -d biotransformer BioTransformer3.0.zip

openjdk version "1.8.0_112"
OpenJDK Runtime Environment (Zulu 8.19.0.1-linux64) (build 1.8.0_112-b16)
OpenJDK 64-Bit Server VM (Zulu 8.19.0.1-linux64) (build 25.112-b16, mixed mode)
** Resuming transfer from byte position 109966663
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  104M    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0


#### Specify the parameters of BioTransformer

#### Mode 'standard'

`type_of_biotransformation` (`-b`, `--btType <BioTransformer Option>`): the type of biotransformer. One of EC-based (`ecbased`), CYP450 (`cyp450`), Phase II (`phaseII`), human gut microbial (`hgut`), human super transformer* (`superbio` or `allHuman`), or environmental microbial** (`envimicro`).

(* ) While the `superbio` option runs a set number of transformation steps in a pre-defined order (e.g. deconjugation first, then oxidation/reduction, etc.), the `allHuman` option predicts all possible metabolites from any applicable reaction (oxidation, reduction, (de-)conjugation) at each step.

(** ) For the environmental microbial biodegradation, all reactions (aerobic and anaerobic) are reported, and not only the aerobic biotransformations (as per default in the EAWAG BBD/PPS system).

`number_of_steps` (`-s`, `--nsteps <Number of steps>`): the number of steps for the prediction. This option can be set by the user for the EC-based, CYP450, Phase II, and environmental microbial biotransformers. The default value is `1`.
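
To show how the standard-mode parameters relate to a BioTransformer command line, here is a hedged sketch that assembles an argument list for a single SMILES. It is not the implementation of `run_biotransformer3`. The jar path and output file name are hypothetical, and the `-ismi` and `-ocsv` flags are taken from my reading of the BioTransformer command-line help, so check them against `java -jar <jar> -h` for your version.

```python
# Types for which the description above says the step count can be set.
STEP_CONFIGURABLE = {"ecbased", "cyp450", "phaseII", "envimicro"}
ALL_TYPES = STEP_CONFIGURABLE | {"hgut", "superbio", "allHuman"}

def standard_mode_args(smiles, type_of_biotransformation, number_of_steps=1,
                       jar="biotransformer/BioTransformer3.0.jar",  # hypothetical path
                       output_csv="biotransformer_output.csv"):     # hypothetical name
    """Build an argument list for one prediction run on a single SMILES."""
    if type_of_biotransformation not in ALL_TYPES:
        raise ValueError(f"unknown biotransformer type: {type_of_biotransformation}")
    args = ["java", "-jar", jar, "-k", "pred", "-b", type_of_biotransformation,
            "-ismi", smiles, "-ocsv", output_csv]
    # The step count is only appended for the four step-configurable types.
    if type_of_biotransformation in STEP_CONFIGURABLE:
        args += ["-s", str(number_of_steps)]
    return args

print(" ".join(standard_mode_args("CCO", "cyp450", number_of_steps=2)))
```

Building a list rather than a single string avoids shell-quoting problems when the list is later passed to `subprocess.run`.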
    
#### Mode parameters free entry
    
`-b`, `--btType <BioTransformer Option>`: the type of biotransformer. One of EC-based (`ecbased`), CYP450 (`cyp450`), Phase II (`phaseII`), human gut microbial (`hgut`), human super transformer* (`superbio` or `allHuman`), or environmental microbial** (`envimicro`).

`-cm`, `--cyp450mode`: the CYP450 prediction mode: `1`) CypReact + BioTransformer rules; `2`) CyProduct only; `3`) combined: CypReact + BioTransformer rules + CyProducts. The default mode is `1`.

`-s`, `--nsteps <Number of steps>`: the number of steps for the prediction. This option can be set by the user for the EC-based, CYP450, Phase II, and environmental microbial biotransformers. The default value is `1`.

`-q`, `--bsequence <Sequence>`: ordered sequence of biotransformation steps, given as semicolon-separated pairs of biotransformer type and number of steps to simulate.
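
The sequence format can be sanity-checked before launching a long run. The sketch below parses a string such as `cyp450:2; phaseII:1` into ordered (type, steps) pairs. The parser `parse_bsequence` is hypothetical and only mirrors the format described above; BioTransformer performs its own parsing.

```python
def parse_bsequence(sequence):
    """Parse 'type:steps; type:steps' into a list of (type, steps) pairs."""
    pairs = []
    for chunk in sequence.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        name, _, steps = chunk.partition(":")
        if not name.strip() or not steps.strip().isdigit():
            raise ValueError(f"malformed pair: {chunk!r}")
        pairs.append((name.strip(), int(steps)))
    return pairs

print(parse_bsequence("cyp450:2; phaseII:1"))
```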


In [None]:
### Choose the BioTransformer mode
#### Default (standard): a single type of biotransformation

mode = 'standard'
type_of_biotransformation = 'hgut'
number_of_steps = 1

#### Alternative: free-entry parameters. Uncomment the line below to override the standard mode.
# mode = '-k pred -q "cyp450:2; phaseII:1"'

run_biotransformer3(mode, prepare_for_virtual_metabolization.list_smiles,prepare_for_virtual_metabolization.list_compound_name,
                   type_of_biotransformation, number_of_steps, 'results_vm-NAP_BioTransformer.tsv')
 
print(' ====> Biotransformer computation is finally completed !!! ')

Download the full BioTransformer results in the left side panel->
['results_vm-NAP_BioTransformer.tsv'](./results_vm-NAP_BioTransformer.tsv).

## Export the BioTransformer results for NAP

See the [NAP documentation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/#structure-database) for how to set up a custom structure database and run NAP on GNPS.

In [None]:
export_for_NAP('results_vm-NAP_BioTransformer.tsv', 'Compound_Name')

Download the BioTransformer results for NAP in the left side panel->
['results_vm-NAP_BioTransformer_NAP.tsv'](./results_vm-NAP_BioTransformer_NAP.tsv).

## Export the BioTransformer results for SIRIUS

See the documentation to generate the SIRIUS [custom database here](https://boecker-lab.github.io/docs.sirius.github.io/cli-standalone/#custom-database-tool).

In [None]:
export_for_SIRIUS('results_vm-NAP_BioTransformer.tsv', 'Compound_Name')

Download the BioTransformer results for SIRIUS in the left side panel->
['results_vm-NAP_BioTransformer_SIRIUS.tsv'](./results_vm-NAP_BioTransformer_SIRIUS.tsv).