# Virtually metabolize GNPS annotations and prepare for Network Annotation Propagation or SIRIUS

Made by Louis-Felix Nothias (UC San Diego), louisfelix.nothias@gmail.com. Started in 2018 and improved in May 2021.

This notebook loads spectral annotations from SIRIUS/CSIFingerID with COSMIC scores and generates virtual metabolites with either SyGMa or BioTransformer. The resulting candidates can be used for [Network Annotation Propagation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/) on GNPS or with [SIRIUS](https://boecker-lab.github.io/docs.sirius.github.io/install/).

> Start by running the cell below to load the libraries.

In [1]:
import sys
sys.path.append('src')
from prepare_virtual_metabolization import *
from run_virtual_metabolization import *

## [Mandatory] - Load the file from SIRIUS/CSIFingerID with COSMIC score
 
This file is named `compound_identifications.tsv` by default. Upload it to this lab environment.

In [2]:
df_csifingerid_cosmic = load_csifingerid_cosmic_annotations('compound_identifications.tsv')
df_csifingerid_cosmic.head(3)

| Unnamed: 0 | ConfidenceScore | ZodiacScore | name | smiles | links | id |
|---|---|---|---|---|---|---|
| 0 | 0.999228 | 0.276382 | 3''-O-Galactopyranosyl-ara-C | C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)OC3C(C(C(C(O3)C... | COCONUT:(CNP0387389);Natural Products:(UNPD146... | 11643_Celastraceae_pos_SIRIUS_12948 |
| 1 | 0.996772 | 0.978894 | no_name | CC1C(C(C(C2(C13C(C(C(=O)C2OC(=O)C4=CC=CC=C4)C(... | PubChem:(132555887);DNP_Lotus_DB_v2 | 4608_Celastraceae_pos_SIRIUS_5022 |
| 2 | 0.986558 | 0.067839 | no_name | CCCCCCCCCCCCCCCCCNC(=O)C(COC1C(C(C(C(O1)CO)O)O... | PubChem:(139589835);DNP_Lotus_DB_v2 | 12023_Celastraceae_pos_SIRIUS_13351 |


## Choose between Option 1 and Option 2
## [Option 1] - Filter based on scores AND db links
Review the parameters below, then run the cells.

In [3]:
zodiac_score = 0.95        # Minimum ZODIAC score (molecular formula identification)
confidence_score = 0.2     # Minimum COSMIC confidence score (structure annotation)
db_links = 'KEGG|NORMAN'   # Database names to match in the `links` column; separate multiple names with |
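To pick sensible values for `db_links`, it helps to see which database names actually occur in the `links` column. The sketch below is not part of the notebook's `src` helpers: `count_link_sources` is a hypothetical function, and it assumes that entries are separated by `;` and that a database name is everything before `:(`, as in the preview table above. The third example string is made up for illustration.

```python
from collections import Counter


def count_link_sources(links_values):
    """Count how many annotations mention each database name (hypothetical helper)."""
    counts = Counter()
    for links in links_values:
        # Entries look like 'PubChem:(132555887)' or a bare name such as 'DNP_Lotus_DB_v2'
        names = {
            entry.split(':(')[0].strip()
            for entry in links.split(';')
            if entry.strip()
        }
        counts.update(names)
    return counts


example_links = [
    'PubChem:(132555887);DNP_Lotus_DB_v2',
    'PubChem:(139589835);DNP_Lotus_DB_v2',
    'KEGG:(C00001);NORMAN:(NS00000001)',  # made-up identifiers, for illustration only
]
link_counts = count_link_sources(example_links)
print(link_counts.most_common())
```

On the real table, the same function could be applied to `df_csifingerid_cosmic['links'].dropna()`.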

In [4]:
# Keep annotations that pass both score thresholds, then keep those with a matching database link
df_score_filtered = df_csifingerid_cosmic_annotations_filtering(df_csifingerid_cosmic, zodiac_score, confidence_score)
df_filtered = df_csifingerid_cosmic_annotations_filtering(df_score_filtered, links=db_links)

Filtering with ZodiacScore >= 0.95
Total entries remaining = 10039
Filtering with Confidence Score >= 0.2
Total entries remaining = 192
Filtering with database links >= KEGG|NORMAN
Total entries remaining = 61
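The internals of `df_csifingerid_cosmic_annotations_filtering` are not shown in this notebook. Judging from the log above, it applies score thresholds and a pattern match on the `links` column. A plain-pandas sketch of that kind of filter follows; `filter_annotations` and the toy table are hypothetical, and treating `db_links` as a regular expression is an assumption.

```python
import pandas as pd


def filter_annotations(df, zodiac_score=None, confidence_score=None, links=None):
    """Apply whichever filters are provided (hypothetical stand-in for the notebook helper)."""
    out = df
    if zodiac_score is not None:
        out = out[out['ZodiacScore'] >= zodiac_score]
    if confidence_score is not None:
        out = out[out['ConfidenceScore'] >= confidence_score]
    if links is not None:
        # 'KEGG|NORMAN' is treated as a regular expression: match either name
        out = out[out['links'].str.contains(links, regex=True, na=False)]
    return out


# Toy data with made-up values, for illustration only
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'ZodiacScore': [1.0, 0.99, 0.50, 0.97],
    'ConfidenceScore': [0.89, 0.10, 0.95, 0.61],
    'links': ['KEGG:(C00001)', 'NORMAN:(NS00000001)', 'KEGG:(C00002)', 'PubChem:(1)'],
})

score_filtered = filter_annotations(toy, zodiac_score=0.95, confidence_score=0.2)
both_filtered = filter_annotations(score_filtered, links='KEGG|NORMAN')
print(list(score_filtered['id']), list(both_filtered['id']))
```

Chaining the two calls, as in Option 1, keeps only rows that satisfy every criterion.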


**Now you can skip [Option 2] cells**

## [Option 2] - Filter based on scores OR db links
If needed, convert the cells below to code cells, set the parameters, and run them.

In [5]:
df_score_filtered = df_csifingerid_cosmic_annotations_filtering(df_csifingerid_cosmic, zodiac_score, confidence_score)

Filtering with ZodiacScore >= 0.95
Total entries remaining = 10039
Filtering with Confidence Score >= 0.2
Total entries remaining = 192


In [6]:
df_db_links_filtered = df_csifingerid_cosmic_annotations_filtering(df_csifingerid_cosmic, links= db_links)

Filtering with database links >= KEGG|NORMAN
Total entries remaining = 3371
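The next mandatory cell reads `df_filtered`, which the two Option 2 cells do not define. One way to build it as the union (scores OR database links) is to stack both tables and drop repeated rows. This is a sketch under assumptions: `union_by_id` is a hypothetical helper, and it assumes that the `id` column uniquely identifies an annotation.

```python
import pandas as pd


def union_by_id(*frames, key='id'):
    """Stack DataFrames and keep the first row seen for each key value (hypothetical helper)."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset=key, keep='first').reset_index(drop=True)


# Toy tables with made-up values, for illustration only
score_hits = pd.DataFrame({'id': ['a', 'b'], 'name': ['x', 'y']})
link_hits = pd.DataFrame({'id': ['b', 'c'], 'name': ['y', 'z']})

df_union = union_by_id(score_hits, link_hits)
print(list(df_union['id']))
```

In this notebook, that would correspond to `df_filtered = union_by_id(df_score_filtered, df_db_links_filtered)`.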


## [Mandatory] Prepare for virtual metabolization


In [7]:
prepare_for_virtual_metabolization(df_filtered,
                                    compound_name = 'name',
                                    smiles_planar_column='smiles',
                                    drop_duplicated_structure = True, 
                                    use_planar_structure= True)

Number of spectral library annotations = 61
Number of spectral annotations with planar SMILES/InChI = 61
Number of unique planar SMILES considered = 55


| Unnamed: 0 | ConfidenceScore | ZodiacScore | name | smiles | links | id |
|---|---|---|---|---|---|---|
| 5 | 0.890122 | 1.0 | Monomyristin | CCCCCCCCCCCCCC(=O)OCC(CO)O | HMDB:(11561);SuperNatural:(SN00383855);ZINC bi... | 2462_Celastraceae_pos_SIRIUS_2670 |
| 9 | 0.879112 | 0.999497 | Valinopine | CC(C)C(C(=O)O)NC(CCC(=O)O)C(=O)O | COCONUT:(CNP0211025);KEGG:(C19976);Natural Pro... | 11333_Celastraceae_pos_SIRIUS_12603 |
| 10 | 0.879094 | 1.0 | 2-Amylthiophene | CCCCCC1=CC=CS1 | NORMAN:(NS00022222);COCONUT:(CNP0094840);HMDB:... | 12147_Celastraceae_pos_SIRIUS_13491 |
| 15 | 0.817273 | 1.0 | Sandin EU | CCCCCCCCCCCCCCCCCC(=O)OCC(CO)O | HMDB:(11131);PubChem class - food;SuperNatural... | 8984_Celastraceae_pos_SIRIUS_9933 |
| 18 | 0.759907 | 1.0 | bmse000686 | CC(CCC(=O)O)C1CCC2C1(CCC3C2CCC4C3(CCC(C4)O)C)C | HMDB:(381 713 717 761);SuperNatural:(SN0000656... | 3874_Celastraceae_pos_SIRIUS_4217 |
| 19 | 0.758254 | 1.0 | NCIStruc1_001111 | CC(CCC(=O)OC)C1CCC2C1(CCC3C2CCC4C3(CCC(C4)O)C)C | NORMAN:(NS00044872);COCONUT:(CNP0199065);Natur... | 1347_Celastraceae_pos_SIRIUS_1455 |
| 29 | 0.670673 | 1.0 | 10-undecynoate | C#CCCCCCCCCC(=O)O | NORMAN:(NS00028365);COCONUT:(CNP0057246 CNP019... | 13252_Celastraceae_pos_SIRIUS_14935 |
| 36 | 0.654696 | 1.0 | epiafzelechin | C1C(C(OC2=CC(=CC(=C21)O)O)C3=CC=C(C=C3)O)O | HMDB:(30822 30823);SuperNatural:(SN00036139 SN... | 1475_Celastraceae_pos_SIRIUS_1593 |
| 40 | 0.615597 | 1.0 | Dimethyl tetradecanedioate | COC(=O)CCCCCCCCCCCCC(=O)OC | NORMAN:(NS00032018);COCONUT:(CNP0122354);Natur... | 4196_Celastraceae_pos_SIRIUS_4572 |
| 42 | 0.614129 | 1.0 | Myricetin 3-glucoside | C1=C(C=C(C(=C1O)O)O)C2=C(C(=O)C3=C(C=C(C=C3O2)... | HMDB:(34358 125203);SuperNatural:(SN00160344 S... | 9119_Celastraceae_pos_SIRIUS_10110 |


## [Mandatory] - Choose between SyGMa (A) or BioTransformer (B) for virtual metabolization

#### A - SyGMa specifically generates human phase 1 and/or phase 2 biotransformations.
It generally takes a couple of minutes to compute. More information is available in the paper ([https://doi.org/10.1002/cmdc.200700312](https://doi.org/10.1002/cmdc.200700312)).

#### B - BioTransformer predicts biotransformations in mammals, their gut microbiota, and the soil/aquatic microbiota.
It takes more time to compute. More information is available in the paper ([https://doi.org/10.1186/s13321-018-0324-5](https://doi.org/10.1186/s13321-018-0324-5)).

# [Option A] - Virtual metabolization with SyGMa

SyGMa is a Python library for the Systematic Generation of potential Metabolites. See [SyGMa: combining expert knowledge and empirical scoring in the prediction of metabolites](https://doi.org/10.1002/cmdc.200700312) and [https://github.com/3D-e-Chem/sygma](https://github.com/3D-e-Chem/sygma).

Please cite their work:
Ridder, L., & Wagener, M. (2008) [SyGMa: combining expert knowledge and empirical scoring in the prediction of metabolites](https://doi.org/10.1002/cmdc.200700312). ChemMedChem, 3(5), 821-832.


### IMPORTANT -> Change the parameters below as needed
> Define the ruleset and the number of phase 1/2 reaction cycles to apply in the SyGMa scenario. For example, use `phase_1_cycle = 2` for 2 cycles of phase 1. Using a value > 1 will be slow.

> Define the maximum number of SyGMa candidates to output (take the number of reaction cycles into account). Suggested value: `top_sygma_candidates = 15`.

> Run SyGMa.

In [8]:
# Define the number of metabolization cycles (1-3). More than 1 cycle can be slow.
phase_1_cycle = 1
phase_2_cycle = 1

# Number of top-scoring metabolites predicted by SyGMa to output (ranked by highest score)
top_sygma_candidates = 10
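`top_sygma_candidates` caps how many predicted metabolites are kept, ranked by score. How `run_sygma_batch` ranks them is not shown in this notebook. A minimal sketch of that kind of selection, using a hypothetical `top_candidates` helper and made-up scores:

```python
def top_candidates(scored_metabolites, top_n):
    """Return the top_n (smiles, score) pairs, highest score first (hypothetical helper)."""
    ranked = sorted(scored_metabolites, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Made-up (SMILES, score) pairs, for illustration only
toy_scores = [('CCO', 0.10), ('CC=O', 0.45), ('CC(=O)O', 0.30)]
best_two = top_candidates(toy_scores, top_n=2)
print(best_two)
```

A larger cap keeps more low-scoring candidates, which grows the custom database and adds less likely structures to it.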

### Run the cell below to run SyGMa (fast!)

No need to change the content of the cell below.

In [9]:
run_sygma_batch(prepare_for_virtual_metabolization.list_smiles, prepare_for_virtual_metabolization.list_compound_name, 
                phase_1_cycle, phase_2_cycle, top_sygma_candidates, 'results_vm-NAP_SyGMa.tsv',
                compound_name = 'name')

=== Starting SyGMa computation ===
Number of compounds = 55
Batch_size = 13
If you are running many compounds or cycles, and maxing out RAM memory available, you can decrease the batch size. Otherwise the value can be increased for faster computation.
Please wait
Batch 1/5 completed with 125 metabolites


RDKit ERROR: [17:05:58] Can't kekulize mol.  Unkekulized atoms: 3 5 9
[17:05:58] Can't kekulize mol.  Unkekulized atoms: 3 5 9

RDKit ERROR: 


Batch 2/5 completed with 130 metabolites
Batch 3/5 completed with 130 metabolites
Batch 4/5 completed with 130 metabolites
Batch 5/5 completed with 30 metabolites
Number of SyGMA candidates = 545
Number of unique SyGMA candidates = 533
===== COMPLETED =====
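The log above reports 55 compounds processed in 5 batches with `Batch_size = 13`. How `run_sygma_batch` chooses the batch size and splits the list is not shown here. Fixed-size chunking, sketched below with a hypothetical `make_batches` helper, reproduces the same batch count.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items (hypothetical helper)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


compounds = [f'compound_{i}' for i in range(55)]  # placeholder names
batches = make_batches(compounds, batch_size=13)
batch_sizes = [len(batch) for batch in batches]
print(len(batches), batch_sizes)
```

With 10 candidates kept per compound, full batches of 13 compounds give 130 metabolites and a final batch of 3 gives 30, which matches most of the counts in the log. A smaller batch size lowers peak memory use at the cost of more iterations.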


When completed, download the full SyGMa results in the left side panel->
['results_vm-NAP_SyGMa.tsv'](./results_vm-NAP_SyGMa.tsv).

## Export the SyGMa results for NAP
See the [NAP documentation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/#structure-database) for the custom structure database format and for how to run NAP on GNPS.

In [10]:
export_for_NAP('results_vm-NAP_SyGMa.tsv', compound_name = 'name')

Number of metabolites = 545
Number of unique metabolites considered = 407


View/Download the results for NAP in the left side panel->
['results_vm-NAP_SyGMa_NAP.tsv'](./results_vm-NAP_SyGMa_NAP.tsv).

To download: go to File/Download or right-click on the file in the left panel.

## Export the SyGMa results for SIRIUS

See the documentation to generate the SIRIUS [custom database here](https://boecker-lab.github.io/docs.sirius.github.io/cli-standalone/#custom-database-tool).

In [11]:
export_for_SIRIUS('results_vm-NAP_SyGMa.tsv', compound_name = 'name')

Number of metabolites = 545
Number of unique metabolites considered = 407


Download the results for SIRIUS in the left side panel->
['results_vm-NAP_SyGMa_SIRIUS.tsv'](./results_vm-NAP_SyGMa_SIRIUS.tsv).

# [Option B] - Virtual metabolization with BioTransformer (it is slow!)

BioTransformer is a software tool that predicts small molecule metabolism in mammals, their gut microbiota, and the soil/aquatic microbiota. BioTransformer also assists scientists in metabolite identification, based on the metabolism prediction. More information is available in the paper ([https://doi.org/10.1186/s13321-018-0324-5](https://doi.org/10.1186/s13321-018-0324-5)) and in the repository ([https://bitbucket.org/djoumbou/biotransformerjar/src/master/](https://bitbucket.org/djoumbou/biotransformerjar/src/master/)).

### Citation

Djoumbou-Feunang, Y., Fiamoncini, J., Gil-de-la-Fuente, A. et al. [BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification.](https://doi.org/10.1186/s13321-018-0324-5) J Cheminform 11, 2 (2019).

### Install BioTransformer [only needs to be run once]
It requires `curl` and `java`.
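Before running the install cell, it can be worth checking that the required command-line tools are on the `PATH`. The sketch below uses only the standard library. `missing_tools` is a hypothetical helper, and `unzip` is included because the install cell also calls it.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that are not found on the PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(['curl', 'java', 'unzip'])
if missing:
    print('Install these before continuing:', ', '.join(missing))
else:
    print('All required tools were found.')
```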

In [12]:
!java -version
!rm -r biotransformer.zip biotransformer/
!curl https://bitbucket.org/djoumbou/biotransformerjar/get/f47aa4e3c0da.zip -o biotransformer.zip
!unzip -q -d biotransformer biotransformer.zip
!cp -r biotransformer/djoumbou-biotransformerjar-f47aa4e3c0da/. .
!rm -r biotransformer.zip biotransformer/

openjdk version "1.8.0_112"
OpenJDK Runtime Environment (Zulu 8.19.0.1-linux64) (build 1.8.0_112-b16)
OpenJDK 64-Bit Server VM (Zulu 8.19.0.1-linux64) (build 25.112-b16, mixed mode)
rm: cannot remove 'biotransformer.zip': No such file or directory
rm: cannot remove 'biotransformer/': No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 68.9M  100 68.9M    0     0  8310k      0  0:00:08  0:00:08 --:--:-- 7598k


#### Specify the parameters of BioTransformer

`type_of_biotransformation` (`-b, --btType <BioTransformer Option>`): the type of biotransformer to use: EC-based (`ecbased`), CYP450 (`cyp450`), Phase II (`phaseII`), human gut microbial (`hgut`), human super transformer* (`superbio` or `allHuman`), or environmental microbial** (`envimicro`).

(* ) While the `superbio` option runs a set number of transformation steps in a pre-defined order (e.g. deconjugation first, then oxidation/reduction, etc.), the `allHuman` option predicts all possible metabolites from any applicable reaction (oxidation, reduction, (de-)conjugation) at each step.

(** ) For the environmental microbial biodegradation, all reactions (aerobic and anaerobic) are reported, and not only the aerobic biotransformations (as per default in the EAWAG BBD/PPS system).

`number_of_steps` (`-s, --nsteps <Number of steps>`): the number of steps for the prediction. This option can be set by the user for the EC-based, CYP450, Phase II, and environmental microbial biotransformers. The default value is `1`.
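Because a full BioTransformer run is slow, a quick check of the two parameters before launching can save a failed run. The sketch below encodes only what the description above states. `check_biotransformer_params` is a hypothetical helper and is not part of BioTransformer or of this notebook's `src` modules.

```python
# Option values as listed in the parameter description
BIOTRANSFORMER_TYPES = {
    'ecbased', 'cyp450', 'phaseII', 'hgut', 'superbio', 'allHuman', 'envimicro',
}
# Types for which the description says the number of steps can be set by the user
STEPS_CONFIGURABLE = {'ecbased', 'cyp450', 'phaseII', 'envimicro'}


def check_biotransformer_params(biotransformation_type, number_of_steps):
    """Validate both parameters; return True if number_of_steps is user-settable for this type."""
    if biotransformation_type not in BIOTRANSFORMER_TYPES:
        raise ValueError(
            f'Unknown type {biotransformation_type!r}; '
            f'expected one of {sorted(BIOTRANSFORMER_TYPES)}'
        )
    if not isinstance(number_of_steps, int) or number_of_steps < 1:
        raise ValueError('number_of_steps must be a positive integer')
    return biotransformation_type in STEPS_CONFIGURABLE


print(check_biotransformer_params('hgut', 1))
print(check_biotransformer_params('cyp450', 2))
```

The option names are case-sensitive as written above (`phaseII`, `allHuman`), so a value such as `'phase2'` is rejected.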

In [None]:
type_of_biotransformation = 'hgut'
number_of_steps = 1

run_biotransformer(prepare_for_virtual_metabolization.list_smiles,prepare_for_virtual_metabolization.list_compound_name,
                   type_of_biotransformation, number_of_steps, 'results_vm-NAP_BioTransformer.tsv', 'name')
print(' ====> Biotransformer computation is finally completed !!! ')

     Number of compounds being virtually metabolized with BioTransformer =  55
     Biotransformation: hgut
     Please wait for the computation ...
0    [main] INFO  net.sf.jnati.deploy.artefact.ConfigManager  - Loading global configuration
6    [main] DEBUG net.sf.jnati.deploy.artefact.ConfigManager  - Loading defaults: jar:file:/home/jovyan/biotransformer-1.1.5.jar!/META-INF/jnati/jnati.default-properties
6    [main] INFO  net.sf.jnati.deploy.artefact.ConfigManager  - Loading artefact configuration: jniinchi-1.03_1
8    [main] DEBUG net.sf.jnati.deploy.artefact.ConfigManager  - Loading instance defaults: jar:file:/home/jovyan/biotransformer-1.1.5.jar!/META-INF/jnati/jnati.instance.default-properties
10   [main] INFO  net.sf.jnati.deploy.repository.ClasspathRepository  - Searching classpath for: jniinchi-1.03_1-LINUX-AMD64
91   [main] INFO  net.sf.jnati.deploy.repository.LocalRepository  - Searching local repository for: jniinchi-1.03_1-LINUX-AMD64
92   [main] DEBUG net.sf.jnati.depl

Download the full BioTransformer results in the left side panel->
['results_vm-NAP_BioTransformer.tsv'](./results_vm-NAP_BioTransformer.tsv).

## Export the BioTransformer results for NAP

See the [NAP documentation](https://ccms-ucsd.github.io/GNPSDocumentation/nap/#structure-database) for the custom structure database format and for how to run NAP on GNPS.

In [None]:
export_for_NAP('results_vm-NAP_BioTransformer.tsv', compound_name='name')

Download the BioTransformer results for NAP in the left side panel->
['results_vm-NAP_BioTransformer_NAP.tsv'](./results_vm-NAP_BioTransformer_NAP.tsv).

## Export the BioTransformer results for SIRIUS

See the documentation to generate the SIRIUS [custom database here](https://boecker-lab.github.io/docs.sirius.github.io/cli-standalone/#custom-database-tool).

In [None]:
export_for_SIRIUS('results_vm-NAP_BioTransformer.tsv', compound_name='name')

Download the BioTransformer results for SIRIUS in the left side panel->
['results_vm-NAP_BioTransformer_SIRIUS.tsv'](./results_vm-NAP_BioTransformer_SIRIUS.tsv).