# Virtually metabolize GNPS annotations and prepare for Network Annotation Propagation or SIRIUS

Made by Louis-Felix Nothias (UC San Diego), louisfelix.nothias@gmail.com. Started in 2018 and improved in May 2021.

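This notebook loads the spectral annotations produced by [SIRIUS/CSIFingerID/COSMIC](https://boecker-lab.github.io/docs.sirius.github.io/install/), virtually metabolizes the annotated structures, and exports the predicted metabolites for Network Annotation Propagation (NAP) or SIRIUS.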

> Start by running the cell below to import the libraries.

In [1]:
import sys
sys.path.append('gnps_postprocessing/lib')
sys.path.append('src')
from gnps_download_results import *
from consolidate_structures import *
from gnps_results_postprocess import *
from prepare_virtual_metabolization import *
from run_virtual_metabolization import *

## Mandatory - Load the SIRIUS/CSIFingerID annotations
 
> Drag and drop your SIRIUS results file (`compound_identifications.tsv` or `compound_identifications_adducts.tsv`) into the `input` folder (left panel).

> Run the cell

In [2]:
df = load_csifingerid_cosmic_annotations('input/compound_identifications.tsv')
print(df.shape)
df.head(3)

(11623, 6)


| | ConfidenceScore | ZodiacScore | name | smiles | links | id |
|---|---|---|---|---|---|---|
| 0 | 0.999228 | 0.276382 | 3''-O-Galactopyranosyl-ara-C | `C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)OC3C(C(C(C(O3)C...` | COCONUT:(CNP0387389);Natural Products:(UNPD146... | 11643_Celastraceae_pos_SIRIUS_12948 |
| 1 | 0.996772 | 0.978894 | no_name | `CC1C(C(C(C2(C13C(C(C(=O)C2OC(=O)C4=CC=CC=C4)C(...` | PubChem:(132555887);DNP_Lotus_DB_v2 | 4608_Celastraceae_pos_SIRIUS_5022 |
| 2 | 0.986558 | 0.067839 | no_name | `CCCCCCCCCCCCCCCCCNC(=O)C(COC1C(C(C(C(O1)CO)O)O...` | PubChem:(139589835);DNP_Lotus_DB_v2 | 12023_Celastraceae_pos_SIRIUS_13351 |


## Mandatory - Apply filter (if any were set)
If you have not selected a filter, run this cell anyway.

#### Filter based on score

In [3]:
zodiac_score = 0.9
confidence_score = 0.3
df_score_filtered = df_csifingerid_cosmic_annotations_filtering(df, zodiac_score, confidence_score)

Filtering with ZodiacScore >= 0.9
Total entries remaining = 10212
Filtering with Confidence Score >= 0.3
Total entries remaining = 120
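The log above suggests two successive threshold filters, first on `ZodiacScore` and then on `ConfidenceScore`. The internals of `df_csifingerid_cosmic_annotations_filtering` are not shown here, so the following is only a minimal plain-Python sketch of that logic, assuming rows are kept when each score is greater than or equal to its threshold. The helper `filter_by_scores` and the toy rows are hypothetical.

```python
# Toy rows reusing the column names of the table above (values are made up).
rows = [
    {"id": "a", "ZodiacScore": 0.98, "ConfidenceScore": 0.99},
    {"id": "b", "ZodiacScore": 0.95, "ConfidenceScore": 0.10},
    {"id": "c", "ZodiacScore": 0.50, "ConfidenceScore": 0.80},
]


def filter_by_scores(rows, zodiac_score=None, confidence_score=None):
    """Hypothetical helper: keep rows reaching each threshold (None disables it)."""
    kept = list(rows)
    if zodiac_score is not None:
        kept = [r for r in kept if r["ZodiacScore"] >= zodiac_score]
    if confidence_score is not None:
        kept = [r for r in kept if r["ConfidenceScore"] >= confidence_score]
    return kept


filtered = filter_by_scores(rows, zodiac_score=0.9, confidence_score=0.3)
print([r["id"] for r in filtered])  # only row "a" passes both thresholds
```

Because the two filters are applied one after the other, an annotation has to pass both thresholds to remain in the batch.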


#### Filter based on database links

In [4]:
db_links = 'KEGG|HMDB'
df_db_links_filtered = df_csifingerid_cosmic_annotations_filtering(df, links=db_links)

Filtering with database links >= KEGG|HMDB
Total entries remaining = 3657
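The `|` in `'KEGG|HMDB'` reads like a regular-expression alternation: keep an annotation if its `links` field mentions KEGG or HMDB. That interpretation is an assumption, since the filtering function itself is not shown. A minimal sketch with the standard `re` module follows; `filter_by_links` and the toy identifiers are hypothetical.

```python
import re

# Toy rows; the link strings imitate the format of the `links` column
# and the identifiers are made up for illustration.
rows = [
    {"id": "a", "links": "HMDB:(00001);SuperNatural:(SN000001)"},
    {"id": "b", "links": "PubChem:(000001);DNP_Lotus_DB_v2"},
    {"id": "c", "links": "COCONUT:(CNP000001);KEGG:(C00001)"},
]


def filter_by_links(rows, pattern):
    """Hypothetical helper: keep rows whose `links` field matches the regex."""
    regex = re.compile(pattern)
    return [r for r in rows if regex.search(r["links"])]


kept = filter_by_links(rows, "KEGG|HMDB")
print([r["id"] for r in kept])  # rows "a" and "c" mention HMDB or KEGG
```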


## Mandatory - Prepare for virtual metabolization


In [5]:
# The score-filtered table is used here; df_db_links_filtered holds the database-link filter results.
prepare_for_virtual_metabolization(df_score_filtered,
                                   compound_name='name',
                                   smiles_planar_column='smiles',
                                   drop_duplicated_structure=True,
                                   use_planar_structure=True)

Number of spectral library annotations = 120
Number of spectral annotations with planar SMILES/InChI = 120
Number of unique planar SMILES considered = 118


| | ConfidenceScore | ZodiacScore | name | smiles | links | id |
|---|---|---|---|---|---|---|
| 1 | 0.996772 | 0.978894 | no_name | `CC1C(C(C(C2(C13C(C(C(=O)C2OC(=O)C4=CC=CC=C4)C(...` | PubChem:(132555887);DNP_Lotus_DB_v2 | 4608_Celastraceae_pos_SIRIUS_5022 |
| 5 | 0.890122 | 1.000000 | Monomyristin | `CCCCCCCCCCCCCC(=O)OCC(CO)O` | HMDB:(11561);SuperNatural:(SN00383855);ZINC bi... | 2462_Celastraceae_pos_SIRIUS_2670 |
| 6 | 0.887853 | 0.904020 | no_name | `CCC(C)C=C(C)C(C(C)C=C(C)C(C(C)C=C(C)C(C(C)C(=O...` | PubChem:(101866225);DNP_Lotus_DB_v2 | 747_Celastraceae_pos_SIRIUS_786 |
| 7 | 0.886321 | 0.907035 | no_name | `CCCC(=O)OC1C(C2C(C3CC(C(CCC(C2O3)(C)OC(=O)CCC)...` | PubChem:(56931406);DNP_Lotus_DB_v2 | 139_Celastraceae_pos_SIRIUS_142 |
| 8 | 0.881854 | 0.990452 | Asperphenamate | `C1=CC=C(C=C1)CC(COC(=O)C(CC2=CC=CC=C2)NC(=O)C3...` | COCONUT:(CNP0371247);KNApSAcK:(31444);Natural ... | 312_Celastraceae_pos_SIRIUS_321 |
| ... | ... | ... | ... | ... | ... | ... |
| 144 | 0.311563 | 0.994472 | (1R,4aR,4bR,10aR)-1,4a-dimethyl-7-oxo-3,4,4b,5... | `CC12CCCC(C1CCC3=CC(=O)CCC23)(C)C(=O)O` | COCONUT:(CNP0136246 CNP0253954);Natural Produc... | 6248_Celastraceae_pos_SIRIUS_6873 |
| 145 | 0.307969 | 0.950251 | no_name | `COC1=C(C=CC(=C1)CC2COC(C2COC3C(C(C(C(O3)CO)O)O...` | COCONUT:(CNP0226329);Natural Products:(UNPD906... | 3830_Celastraceae_pos_SIRIUS_4171 |
| 148 | 0.304524 | 1.000000 | no_name | `CC1(C(C(OC1=O)C2=C(C=C3C(=C2)C=CC(=O)O3)OC)O)OC` | COCONUT:(CNP0182083);Natural Products:(UNPD120... | 8351_Celastraceae_pos_SIRIUS_9181 |
| 149 | 0.300679 | 1.000000 | no_name | `CCCCC(C#CC(C(C(CCCCCCCC(=O)O)O)O)O)O` | COCONUT:(CNP0111317);KNApSAcK:(39250);Natural ... | 3955_Celastraceae_pos_SIRIUS_4307 |
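With `drop_duplicated_structure = True`, the counts above go from 120 annotations to 118 unique planar SMILES. How `prepare_for_virtual_metabolization` detects duplicates is not shown, so the sketch below only illustrates the general idea: keep the first occurrence of each SMILES string and keep the name and SMILES lists aligned. The helper `unique_structures` is hypothetical, and it compares raw strings; a cheminformatics toolkit such as RDKit would be needed to canonicalize SMILES before comparing them.

```python
def unique_structures(names, smiles_list):
    """Hypothetical helper: drop repeated SMILES, keeping names and SMILES aligned."""
    seen = set()
    unique_names, unique_smiles = [], []
    for name, smi in zip(names, smiles_list):
        if smi in seen:
            continue  # same string already kept, skip the duplicate
        seen.add(smi)
        unique_names.append(name)
        unique_smiles.append(smi)
    return unique_names, unique_smiles


names = ["Monomyristin", "no_name", "Monomyristin (second hit)"]
smiles_list = [
    "CCCCCCCCCCCCCC(=O)OCC(CO)O",
    "CC12CCCC(C1CCC3=CC(=O)CCC23)(C)C(=O)O",
    "CCCCCCCCCCCCCC(=O)OCC(CO)O",  # exact repeat of the first entry
]

unique_names, unique_smiles = unique_structures(names, smiles_list)
print(len(smiles_list), "->", len(unique_smiles))
```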


## Optional - Manually add candidate structures

Convert the cell from markdown to raw if you want to use it.

This step appends extra structures to the virtual metabolization batch.

You can manually append pairs of compound name and SMILES (the order must match in both lists).



In [6]:
extra_compounds_table_file = 'input/extra_compounds-UTF8.tsv'

In [7]:
load_extra_compounds(extra_compounds_table_file)
append_to_list_if_not_present(prepare_for_virtual_metabolization.list_compound_name, prepare_for_virtual_metabolization.list_smiles, 
                              load_extra_compounds.extra_compound_names, load_extra_compounds.extra_compound_smiles)

Initial number of compound name the list = 118
Initial number of smiles in the list = 118
Final number of compound name the list = 152
Final number of smiles in the list = 152
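The counts above show the name and SMILES lists growing together. As a rough illustration of that bookkeeping, the sketch below reads name/SMILES pairs from a small tab-separated text and appends only the structures not already in the batch. The column names `compound_name` and `smiles`, and both helper functions, are hypothetical; the real layout of `extra_compounds-UTF8.tsv` and the behavior of `append_to_list_if_not_present` may differ.

```python
import csv
import io

# Stand-in for a small tab-separated file; the column names are hypothetical.
extra_tsv = (
    "compound_name\tsmiles\n"
    "caffeine\tCN1C=NC2=C1C(=O)N(C(=O)N2C)C\n"
    "ethanol\tCCO\n"
)


def read_extra_compounds(handle):
    """Return (name, SMILES) pairs from a tab-separated handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["compound_name"], row["smiles"]) for row in reader]


def append_missing_pairs(names, smiles_list, extra_pairs):
    """Append each pair whose SMILES is not yet listed, keeping both lists aligned."""
    known = set(smiles_list)
    for name, smi in extra_pairs:
        if smi not in known:
            names.append(name)
            smiles_list.append(smi)
            known.add(smi)


names = ["ethanol"]
smiles_list = ["CCO"]
append_missing_pairs(names, smiles_list, read_extra_compounds(io.StringIO(extra_tsv)))
print(names)  # ethanol is already present, so only caffeine is appended
```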


# Mandatory - Choose between SyGMa (A) or BioTransformer (B) for virtual metabolization, or run both!

#### A - SyGMa specifically generates human phase 1 and/or phase 2 biotransformations.
It generally takes a couple of minutes to compute. More information in the paper (https://doi.org/10.1002/cmdc.200700312).

#### B - BioTransformer generates biotransformations in mammals, their gut microbiota, and the soil/aquatic microbiota.
It takes more time to compute. More information in the paper ([https://doi.org/10.1186/s13321-018-0324-5](https://doi.org/10.1186/s13321-018-0324-5)).

# A - Virtual metabolization with SyGMa

SyGMa is a Python library for the Systematic Generation of potential Metabolites. See [SyGMa: combining expert knowledge and empirical scoring in the prediction of metabolites](https://doi.org/10.1002/cmdc.200700312) and [https://github.com/3D-e-Chem/sygma](https://github.com/3D-e-Chem/sygma).

Please cite their work:
Ridder, L., & Wagener, M. (2008) [SyGMa: combining expert knowledge and empirical scoring in the prediction of metabolites](https://doi.org/10.1002/cmdc.200700312). ChemMedChem, 3(5), 821-832.


### IMPORTANT -> Change the parameters below as needed
> Define the ruleset and the number of phase 1/2 reaction cycles to apply in the SyGMa scenario, for example `phase_1_cycle = 2` for 2 cycles of phase 1. Using a value > 1 will be slow.

> Define the maximum number of SyGMa candidates to output (consider the number of reaction cycles). Suggested value: `top_sygma_candidates = 15`.

> Run SyGMa.

In [8]:
# Define the number of metabolization cycles (1-3). More than 1 cycle can be slow.
phase_1_cycle = 1
phase_2_cycle = 1

# Number of top metabolites predicted by SyGMa to output (ranked by highest score)
top_sygma_candidates = 10

### Run the cell below to run SyGMa (fast!)

There is no need to change the content of the cell below.

In [None]:
run_sygma_batch(prepare_for_virtual_metabolization.list_smiles, prepare_for_virtual_metabolization.list_compound_name, 
                phase_1_cycle, phase_2_cycle, top_sygma_candidates, 'results_vm-NAP_SyGMa.tsv',
                compound_name = 'name')

=== Starting SyGMa computation ===
Number of compounds = 152
Batch_size = 13
If you are running many compounds or cycles, and maxing out RAM memory available, you can decrease the batch size. Otherwise the value can be increased for faster computation.
Please wait
Batch 1/12 completed with 130 metabolites
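The log above reports 152 compounds processed with a batch size of 13 in 12 batches, which is consistent with splitting the input into fixed-size chunks. How `run_sygma_batch` actually picks the batch size is not shown, so the sketch below only illustrates the chunking arithmetic. The helper `make_batches` and the placeholder compound labels are hypothetical.

```python
def make_batches(items, batch_size):
    """Hypothetical helper: split a list into consecutive chunks of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder labels standing in for the 152 SMILES strings of the batch.
compounds = [f"compound_{i}" for i in range(152)]
batches = make_batches(compounds, 13)

print(len(batches))      # 12 batches, matching the "Batch 1/12" line of the log
print(len(batches[-1]))  # 9 compounds in the last, partial batch
```

Smaller batches keep fewer metabolic trees in memory at once, which is the trade-off the log message describes.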


When completed, download the full SyGMa results from the left side panel:
['results_vm-NAP_SyGMa.tsv'](./results_vm-NAP_SyGMa.tsv).

## Export the SyGMa results for NAP
See the [NAP documentation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/#structure-database) for the custom structure database and for how to run NAP on GNPS.

In [9]:
export_for_NAP('results_vm-NAP_SyGMa.tsv', 'name')

Number of metabolites = 1173
Number of unique metabolites considered = 634


View/download the results for NAP from the left side panel:
['results_vm-NAP_SyGMa_NAP.tsv'](./results_vm-NAP_SyGMa_NAP.tsv).

To download, go to File/Download or right-click the file in the left panel.

## Export the SyGMa results for SIRIUS

See the documentation to generate the SIRIUS [custom database here](https://boecker-lab.github.io/docs.sirius.github.io/cli-standalone/#custom-database-tool).

In [10]:
export_for_SIRIUS('results_vm-NAP_SyGMa.tsv', 'name')

Number of metabolites = 1173
Number of unique metabolites considered = 629


Download the results for SIRIUS from the left side panel:
['results_vm-NAP_SyGMa_SIRIUS.tsv'](./results_vm-NAP_SyGMa_SIRIUS.tsv).

# B - Virtual metabolization with BioTransformer (slow!)

BioTransformer is a software tool that predicts small molecule metabolism in mammals, their gut microbiota, and the soil/aquatic microbiota. BioTransformer also assists scientists in metabolite identification, based on the metabolism prediction. More information in the paper ([https://doi.org/10.1186/s13321-018-0324-5](https://doi.org/10.1186/s13321-018-0324-5)) and the repository ([https://bitbucket.org/djoumbou/biotransformerjar/src/master/](https://bitbucket.org/djoumbou/biotransformerjar/src/master/)).

### Citation

Djoumbou-Feunang, Y., Fiamoncini, J., Gil-de-la-Fuente, A. et al. [BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification.](https://doi.org/10.1186/s13321-018-0324-5) J Cheminform 11, 2 (2019).

### Install BioTransformer [only needs to be run once]
It requires `curl` and `java`.

In [11]:
!java -version
!rm -r biotransformer.zip biotransformer/
!curl https://bitbucket.org/djoumbou/biotransformerjar/get/f47aa4e3c0da.zip -o biotransformer.zip
!unzip -q -d biotransformer biotransformer.zip
!cp -r biotransformer/djoumbou-biotransformerjar-f47aa4e3c0da/. .
!rm -r biotransformer.zip biotransformer/

openjdk version "1.8.0_112"
OpenJDK Runtime Environment (Zulu 8.19.0.1-linux64) (build 1.8.0_112-b16)
OpenJDK 64-Bit Server VM (Zulu 8.19.0.1-linux64) (build 25.112-b16, mixed mode)
rm: cannot remove 'biotransformer.zip': No such file or directory
rm: cannot remove 'biotransformer/': No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 68.9M  100 68.9M    0     0  22.6M      0  0:00:03  0:00:03 --:--:-- 22.6M


#### Specify the parameters of BioTransformer

`type_of_biotransformation` (BioTransformer option `-b`, `--btType`): the type of biotransformer. Accepted values are EC-based (`ecbased`), CYP450 (`cyp450`), Phase II (`phaseII`), Human gut microbial (`hgut`), human super transformer* (`superbio` or `allHuman`), and Environmental microbial** (`envimicro`).

(* ) While the `superbio` option runs a set number of transformation steps in a pre-defined order (e.g. deconjugation first, then Oxidation/reduction, etc.), the `allHuman` option predicts all possible metabolites from any applicable reaction (Oxidation, reduction, (de-)conjugation) at each step.

(** ) For the environmental microbial biodegradation, all reactions (aerobic and anaerobic) are reported, and not only the aerobic biotransformations (as per default in the EAWAG BBD/PPS system).
    
`number_of_steps` (BioTransformer option `-s`, `--nsteps`): the number of steps for the prediction. This option can be set by the user for the EC-based, CYP450, Phase II, and Environmental microbial biotransformers. The default value is `1`.
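To make the two parameters concrete, here is a minimal sketch of how they could map onto a BioTransformer command line. Only `-b` and `-s` come from the description above, and the jar name comes from the log further down. The `-k pred`, `-ismi`, and `-ocsv` flags are assumptions recalled from the BioTransformer command-line help and should be checked against your jar's help output before use. `build_biotransformer_args` is a hypothetical helper, not the notebook's `run_biotransformer`, and it only builds the argument list without running anything.

```python
# Values of `type_of_biotransformation` listed in the description above.
ALLOWED_TYPES = {
    "ecbased", "cyp450", "phaseII", "hgut", "superbio", "allHuman", "envimicro",
}


def build_biotransformer_args(smiles, biotransformation, steps=1,
                              jar="biotransformer-1.1.5.jar",
                              output_csv="output.csv"):
    """Hypothetical helper: build (but do not run) a BioTransformer command."""
    if biotransformation not in ALLOWED_TYPES:
        raise ValueError(f"Unknown biotransformation type: {biotransformation}")
    if steps < 1:
        raise ValueError("number_of_steps must be at least 1")
    return [
        "java", "-jar", jar,
        "-k", "pred",             # assumed flag: prediction task
        "-b", biotransformation,  # from the text: type of biotransformer
        "-ismi", smiles,          # assumed flag: input structure as SMILES
        "-ocsv", output_csv,      # assumed flag: CSV output file
        "-s", str(steps),         # from the text: number of steps
    ]


args = build_biotransformer_args("CCCCCCCCCCCCCC(=O)OCC(CO)O", "hgut", steps=1)
print(" ".join(args))
```

Validating the type before building the command catches typos such as `phase2` instead of `phaseII` early, before a slow Java run is started.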

In [12]:
type_of_biotransformation = 'hgut'
number_of_steps = 1

run_biotransformer(prepare_for_virtual_metabolization.list_smiles, prepare_for_virtual_metabolization.list_compound_name,
                   type_of_biotransformation, number_of_steps, 'results_vm-NAP_BioTransformer.tsv')
print(' ====> Biotransformer computation is finally completed !!! ')

     Number of compounds being virtually metabolized with BioTransformer =  118
     Biotransformation: hgut
     Please wait for the computation ...
1    [main] INFO  net.sf.jnati.deploy.artefact.ConfigManager  - Loading global configuration
13   [main] DEBUG net.sf.jnati.deploy.artefact.ConfigManager  - Loading defaults: jar:file:/home/jovyan/biotransformer-1.1.5.jar!/META-INF/jnati/jnati.default-properties
14   [main] INFO  net.sf.jnati.deploy.artefact.ConfigManager  - Loading artefact configuration: jniinchi-1.03_1
15   [main] DEBUG net.sf.jnati.deploy.artefact.ConfigManager  - Loading instance defaults: jar:file:/home/jovyan/biotransformer-1.1.5.jar!/META-INF/jnati/jnati.instance.default-properties
18   [main] INFO  net.sf.jnati.deploy.repository.ClasspathRepository  - Searching classpath for: jniinchi-1.03_1-LINUX-AMD64
19   [main] INFO  net.sf.jnati.deploy.repository.LocalRepository  - Searching local repository for: jniinchi-1.03_1-LINUX-AMD64
20   [main] DEBUG net.sf.jnati.dep

Download the full BioTransformer results from the left side panel:
['results_vm-NAP_BioTransformer.tsv'](./results_vm-NAP_BioTransformer.tsv).

## Export the BioTransformer results for NAP

See the [NAP documentation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/#structure-database) for the custom structure database and for how to run NAP on GNPS.

In [None]:
export_for_NAP('results_vm-NAP_BioTransformer.tsv', 'name')

Download the BioTransformer results for NAP from the left side panel:
['results_vm-NAP_BioTransformer_NAP.tsv'](./results_vm-NAP_BioTransformer_NAP.tsv).

## Export the BioTransformer results for SIRIUS

See the documentation to generate the SIRIUS [custom database here](https://boecker-lab.github.io/docs.sirius.github.io/cli-standalone/#custom-database-tool).

In [None]:
export_for_SIRIUS('results_vm-NAP_BioTransformer.tsv', 'name')

Download the BioTransformer results for SIRIUS from the left side panel:
['results_vm-NAP_BioTransformer_SIRIUS.tsv'](./results_vm-NAP_BioTransformer_SIRIUS.tsv).