# Application

To illustrate how **pySmash** is used, this section walks through a typical extraction of significant fragments, showing the workflow of **pySmash** both as a *package* and as *software*. The application scripts are available [here](https://github.com/kotori-y/pySmash/blob/master/tutorial/Application.ipynb).

## Example of toxicophore analysis on a carcinogenicity dataset

Toxicity is a significant issue for the pharmaceutical industry in the early phase of drug development. To illustrate the utility of substructure extraction, 1052 compounds that have been tested in long-term carcinogenicity bioassays on rodents, comprising 694 carcinogenic compounds and 358 non-carcinogenic compounds, were collected from [here](http://old.iss.it/meca/index.php?lang=1&anno=2013&tipo=25) (Benigni, et al., 2013). Three algorithms were used for fragment extraction: circular fragments (minimum radius = 1, maximum radius = 4), path fragments (minimum path = 1, maximum path = 7), and functional groups. For a substructure to be retained, the minimum number of compounds containing it (minNum) was set to 5 and the minimum proportion of compounds with the target label among those containing it (minAcc) was set to 0.7. To control the familywise error rate, the Bonferroni correction was also applied, with a p-value cutoff of 0.05. The results are summarized in [Result 1](#Result-1).
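The acceptance criteria described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not pySmash's internal code: the helpers `bonferroni` and `passes_filters` are hypothetical names, the fragment counts and raw p-values are toy numbers, and the statistical test that produces the raw p-values is not specified in this section. The Bonferroni correction itself simply multiplies each raw p-value by the number of tests performed, capped at 1.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def passes_filters(total, hitted, p_adjusted, min_num=5, min_acc=0.7, p_cutoff=0.05):
    """Hypothetical helper mirroring the minNum / minAcc / p-value criteria."""
    if total < min_num:  # too few compounds contain the fragment
        return False
    accuracy = hitted / total  # share of target-label compounds among them
    return accuracy >= min_acc and p_adjusted < p_cutoff


# Toy fragments: (compounds containing it, of which target label, raw p-value)
fragments = [(100, 90, 1e-6), (4, 4, 1e-4), (50, 30, 1e-3), (40, 36, 0.03)]
adjusted = bonferroni([p for _, _, p in fragments])
kept = [passes_filters(t, h, p) for (t, h, _), p in zip(fragments, adjusted)]
print(kept)
```

In this toy run only the first fragment survives: the second is too rare, the third is not accurate enough, and the fourth loses significance once its p-value is multiplied by the number of tests.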

In [1]:
# rdkit
from rdkit import Chem
# science modules
import pandas as pd
import numpy as np
# Display
import IPython
# pySmash Learner
from smash import CircularLearner, PathLearner, FunctionGroupLearner

In [2]:
# loading data
data = pd.read_csv('./datasets/carc/carc.txt', sep='\t')
pd.set_option('display.max_rows', 8)
data

| | SMILES | Label |
|---|---|---|
| 0 | `CN(c1ccc(cc1)N=Nc1ccccc1)C` | 1 |
| 1 | `CC(=O)Nc1cccc2c1c1ccccc1C2` | 0 |
| 2 | `c1cc2ccc3c4c2c(c1)ccc4ccc3` | 0 |
| 3 | `ClC(Cl)Cl` | 1 |
| ... | ... | ... |
| 1048 | `CN1C2CCC1CC(C2)NC(=O)c1cc(Cl)cc2c1OC(C2)(C)C` | 0 |
| 1049 | `O=C1C=CC(=O)C=C1` | 1 |
| 1050 | `c1ccc2c(c1)sc(n2)SSc1nc2c(s1)cccc2` | 0 |
| 1051 | `O=NN(CCN(C)C)C` | 1 |


In [3]:
mols = [Chem.MolFromSmiles(smi) for smi in data.SMILES]
labels = data.Label.values

print("Positive Number: {}".format(labels.sum()))
print("Negative Number: {}".format((1-labels).sum()))

Positive Number: 694
Negative Number: 358


In [4]:
# Instantiate
cir = CircularLearner(minRadius=1, maxRadius=4, nJobs=4)
path = PathLearner(minPath=1, maxPath=7, nJobs=4)
fg = FunctionGroupLearner(nJobs=4)

In [5]:
# Fitting: the same keyword arguments are shared by all three learners
kwargs = {
    "mols": mols, 
    "labels": labels, 
    "accCutoff": 0.7, 
    "pCutoff": 0.05, 
    "Bonferroni": True
}

sigCirPvalue, sigCirMatrix = cir.fit(**kwargs)
sigPathPvalue, sigPathMatrix = path.fit(**kwargs)
sigFgPvalue, sigFgMatrix = fg.fit(**kwargs)

In [6]:
# Visualizing circular fragments
IPython.display.HTML(sigCirPvalue.to_html(escape=False))

| | Pvalue | Total | Hitted | Accuracy | Coverage | SMARTS | Substructure |
|---|---|---|---|---|---|---|---|
| 2380084179 | 1.584474e-08 | 126 | 113 | 0.896825 | 0.162824 | `[*]-[N]=[O]` | (image) |
| 3153453529 | 0.001609288 | 23 | 23 | 1.0 | 0.033141 | `[*]-[CH2]-[N](-[N]=[O])-[C](-[*])=[*]` | (image) |
| 3356397823 | 0.01157093 | 30 | 28 | 0.933333 | 0.040346 | `[*]:[cH]:[c](:[o]:[*])-[N+](=[O])-[O-]` | (image) |
| 3440991424 | 0.01157093 | 30 | 28 | 0.933333 | 0.040346 | `[*]-[c]1:[cH]:[cH]:[c](-[N+](=[O])-[O-]):[o]:1` | (image) |
| 1470580613 | 0.01287997 | 18 | 18 | 1.0 | 0.025937 | `[*]:[c](:[*])-[c]1:[cH]:[cH]:[c](-[N+](=[O])-[O-]):[o]:1` | (image) |
| 1083852209 | 0.01585835 | 115 | 92 | 0.8 | 0.132565 | `[*]:[c](:[*])-[NH2]` | (image) |
| 1147919419 | 0.0437558 | 21 | 20 | 0.952381 | 0.028818 | `[*]-[CH2]-[N](-[CH2]-[*])-[N]=[O]` | (image) |
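
The Accuracy and Coverage columns in the table above appear to follow directly from the Total and Hitted counts. The sketch below is inferred from the numbers rather than taken from the pySmash source: it assumes Accuracy is Hitted / Total and Coverage is Hitted divided by the 694 carcinogenic compounds in the dataset. The function name `accuracy_and_coverage` is hypothetical.

```python
N_POSITIVE = 694  # carcinogenic compounds in the dataset (from the label counts)


def accuracy_and_coverage(total, hitted, n_positive=N_POSITIVE):
    """Assumed definitions: share of hits among carriers, and among all positives."""
    accuracy = hitted / total
    coverage = hitted / n_positive
    return round(accuracy, 6), round(coverage, 6)


first_row = accuracy_and_coverage(126, 113)  # Total and Hitted of the first fragment
second_row = accuracy_and_coverage(23, 23)   # Total and Hitted of the second fragment
print(first_row, second_row)
```

Under these assumptions the first two rows are reproduced to the printed precision (0.896825 and 0.162824, then 1.0 and 0.033141), which supports reading Coverage as the fraction of all carcinogenic compounds that contain the fragment.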


In [7]:
# Visualizing FG fragments
IPython.display.HTML(sigFgPvalue.to_html(escape=False))

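| | Pvalue | Total | Hitted | Accuracy | Coverage | SMARTS | Substructure |
|---|---|---|---|---|---|---|---|
| CN(C)N=O | 2.7e-05 | 61 | 56 | 0.918033 | 0.080692 | `CN(C)N=O` | (image) |
| cN | 0.009277 | 114 | 91 | 0.798246 | 0.131124 | `cN` | (image) |
| coc | 0.010573 | 51 | 44 | 0.862745 | 0.063401 | `coc` | (image) |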


In [8]:
# Visualizing Path fragments
IPython.display.HTML(sigPathPvalue.to_html(escape=False))

| | Pvalue | Total | Hitted | Accuracy | Coverage | SMARTS | Substructure |
|---|---|---|---|---|---|---|---|
| 480113190 | 2.737936e-09 | 128 | 117 | 0.914062 | 0.168588 | `CCNN` | (image) |
| 1856304663 | 2.737936e-09 | 128 | 117 | 0.914062 | 0.168588 | `CN(C)N` | (image) |
| 4276258003 | 2.910665e-09 | 194 | 169 | 0.871134 | 0.243516 | `NN` | (image) |
| 278227040 | 6.905304e-09 | 182 | 159 | 0.873626 | 0.229107 | `CNN` | (image) |
| 3901489279 | 3.072772e-07 | 93 | 86 | 0.924731 | 0.123919 | `CCN(C)N=O` | (image) |
| 3436234336 | 6.695027e-07 | 106 | 96 | 0.90566 | 0.138329 | `CN(C)N=O` | (image) |
| 2414583053 | 2.167071e-05 | 46 | 45 | 0.978261 | 0.064841 | `CN(N)C=O` | (image) |
| 4282884236 | 0.001212875 | 77 | 68 | 0.883117 | 0.097983 | `CCCNN` | (image) |
| 88390985 | 0.002391244 | 34 | 33 | 0.970588 | 0.04755 | `CCN(N)C=O` | (image) |
| 4236736947 | 0.003927312 | 73 | 64 | 0.876712 | 0.092219 | `NNC=O` | (image) |


In [9]:
nCir = len(sigCirPvalue)
nFG = len(sigFgPvalue)
nPath = len(sigPathPvalue)
nEnsemble = nCir + nFG + nPath

cirFlagged = sigCirMatrix.sum(axis=1)>0
cirFlaggedTox = cirFlagged[labels==1]

FgFlagged = sigFgMatrix.sum(axis=1)>0
FgFlaggedTox = FgFlagged[labels==1]

PathFlagged = sigPathMatrix.sum(axis=1)>0
PathFlaggedTox = PathFlagged[labels==1]

ensembleFlagged = cirFlagged | FgFlagged | PathFlagged
ensembleFlaggedTox = ensembleFlagged[labels==1]

res = pd.DataFrame(
    {
        "Substructure": [nCir, nFG, nPath, nEnsemble],
        "Flagged compounds": [cirFlagged.sum(), FgFlagged.sum(), PathFlagged.sum(), ensembleFlagged.sum()],
        "Flagged toxic compounds": [cirFlaggedTox.sum(), FgFlaggedTox.sum(), PathFlaggedTox.sum(), ensembleFlaggedTox.sum()],
    },
    
    index = ["Circular algorithm", "Functional group", "Path algorithm", "Ensemble"]
    
)

res["Accuracy"] = res["Flagged toxic compounds"]/res["Flagged compounds"]
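
The flagging logic in the cell above can be followed on a toy example. The sketch below is a library-free analogue, not the notebook's code: it assumes, as the `sum(axis=1)` call and the indexing by `labels` suggest, that each matrix has one row per compound and one column per significant fragment. All values are made up for illustration.

```python
# Toy hit matrices: one row per compound, one column per significant fragment
cir_hits = [[1, 0], [0, 0], [0, 1], [0, 0]]
path_hits = [[0], [1], [1], [0]]
toy_labels = [1, 1, 0, 0]  # 1 = toxic, 0 = non-toxic


def flagged(matrix):
    """Pure-Python analogue of `matrix.sum(axis=1) > 0`."""
    return [sum(row) > 0 for row in matrix]


cir_flagged = flagged(cir_hits)
path_flagged = flagged(path_hits)

# A compound is flagged by the ensemble if any algorithm flags it
ensemble = [a or b for a, b in zip(cir_flagged, path_flagged)]

n_flagged = sum(ensemble)
n_flagged_toxic = sum(f for f, y in zip(ensemble, toy_labels) if y == 1)
accuracy = n_flagged_toxic / n_flagged
print(n_flagged, n_flagged_toxic, round(accuracy, 3))
```

Because the ensemble is a logical OR, a compound flagged by several algorithms is counted once, so the ensemble total is generally smaller than the sum of the per-algorithm totals.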

### Result 1

In [10]:
res

| | Substructure | Flagged compounds | Flagged toxic compounds | Accuracy |
|---|---|---|---|---|
| Circular algorithm | 7 | 266 | 228 | 0.857143 |
| Functional group | 3 | 218 | 183 | 0.83945 |
| Path algorithm | 15 | 211 | 186 | 0.881517 |
| Ensemble | 25 | 348 | 295 | 0.847701 |


As shown in [Result 1](#Result-1), each individual substructure algorithm achieves an accuracy higher than 0.83 and detects more than 180 carcinogenic compounds. Integrating the three types of substructures widens the detection coverage: according to Result 1, the ensemble flags 295 carcinogenic compounds using only 25 alerts. These results indicate the advantage of combining different substructure algorithms, which capture distinct substructures and structural features from different viewpoints.<br>
In addition, some structural features appear repeatedly across the different types of substructures. For example, 1-nitrocyclopenta-1,3-diene was detected in 28 carcinogenic compounds, with an accuracy higher than 0.900. The metabolic activation of such substructures involves an initial one-electron reduction of the nitro group to yield a resonance-stabilized nitro anion radical. Under aerobic conditions, the radical can reduce molecular oxygen to form the superoxide anion, which can generate various toxic and DNA-reactive oxygen species (Benigni and Bossa, 2011). Another representative toxicophore, 2-nitrosobutane, appeared in 23 carcinogenic compounds, with an accuracy of 1.00. The mechanism of action of this substructure is not clearly understood. A nonalkylating mechanism has been proposed, with formaldehyde as the suggested intermediate responsible for mutagenicity, after oxidation of N-nitrodimethylamine at the methyl group. Such analysis supports the credibility and interpretability of the generated substructures.