
# Getting Started with pySmash GUI

*pySmash: A Tool for Mining Significant Fragments from a Specific Dataset*

<img align="left" src="./image/pysmash.png" width="100" style="margin-bottom:-3px">

PySmash is available as both standalone software and a Python library; both can be downloaded [here](https://github.com/kotori-y/pySmash/releases/latest/download/pysmash.tar.gz). This document is a brief tutorial for the PySmash software. If you find a mistake or have a suggestion for improvement, please either fix it in the source (the .py file under "pysmash/gui") or write to the mailing list: oriental-cds@163.com and kotori@cbdd.me.

## The Main Programs of PySmash

This software consists of two main programs: Calculation and Prediction. The Calculation program covers (1) data input; (2) substructure calculation; (3) user-defined parameters; and (4) various output types. The Prediction program lets users apply the previously derived alerts to screen an external dataset, and reports the flagged substructures and the predicted label for each compound.

<img align="left" src="./image/interface.png" style="margin-bottom:-3px" width="500">


## Generating Significant Fragments

Click the ***Browse...*** button and choose the input file. **PySmash** supports *.csv*, *.excel* and *.txt* files; the file should contain a SMILES column and a label column.

### Selecting and Loading File

<img align="left" src="image/step/1.png" width="500" style="margin-bottom:-3px">

Here, we take a carcinogenicity bioassay dataset, which can be obtained from [here](http://old.iss.it/meca/index.php?lang=1&anno=2013&tipo=25), as an example to further explain the functions of the **PySmash** software.

    import pandas as pd

    data = pd.read_csv('./datasets/Carc/Carc.txt', sep='\t')  # load the tab-separated dataset
    data

| | SMILES | Label |
|---|---|---|
| 0 | `CN(c1ccc(cc1)N=Nc1ccccc1)C` | 1 |
| 1 | `CC(=O)Nc1cccc2c1c1ccccc1C2` | 0 |
| 2 | `c1cc2ccc3c4c2c(c1)ccc4ccc3` | 0 |
| 3 | `ClC(Cl)Cl` | 1 |
| 4 | `CC/C(=C(\c1ccc(cc1)O)/CC)/c1ccc(cc1)O` | 1 |
| ... | ... | ... |
| 1111 | `OC(=O)C1=NN(C(=O)C1/N=N/c1ccc(cc1)S(=O)(=O)O)c...` | 1 |
| 1112 | `CN1C2CCC1CC(C2)NC(=O)c1cc(Cl)cc2c1OC(C2)(C)C` | 0 |
| 1113 | `O=C1C=CC(=O)C=C1` | 1 |
| 1114 | `c1ccc2c(c1)sc(n2)SSc1nc2c(s1)cccc2` | 0 |


Here, in the Label field, 1 = carcinogen (i.e. positive) and 0 = noncarcinogen (i.e. negative).

### Selecting the corresponding field

Select the columns of the dataset that contain the SMILES and the label. Users also need to confirm the aim label, i.e. the label for which representative substructures are calculated. In this example, "Label=1" is set as the aim label.
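As a rough illustration of what this selection amounts to, the sketch below reads a few tab-separated rows (copied from the example dataset above) with Python's standard `csv` module and counts the molecules that carry the aim label. The helper name `count_aim_label` and the in-memory sample are illustrative assumptions and are not part of PySmash.

```python
import csv
import io

# A few rows from the example dataset in tab-separated form (illustrative sample only).
SAMPLE = (
    "SMILES\tLabel\n"
    "CN(c1ccc(cc1)N=Nc1ccccc1)C\t1\n"
    "CC(=O)Nc1cccc2c1c1ccccc1C2\t0\n"
    "c1cc2ccc3c4c2c(c1)ccc4ccc3\t0\n"
    "ClC(Cl)Cl\t1\n"
)


def count_aim_label(handle, smiles_field="SMILES", label_field="Label", aim_label="1"):
    """Return (n, m): the number of molecules and the number carrying the aim label."""
    reader = csv.DictReader(handle, delimiter="\t")
    n = m = 0
    for row in reader:
        if not row[smiles_field]:
            continue  # skip rows without a structure
        n += 1
        if row[label_field] == aim_label:
            m += 1
    return n, m


n, m = count_aim_label(io.StringIO(SAMPLE))
print(n, m)
```

The two counts correspond to the quantities the significance test works with: the total number of compounds and the number of compounds with the aim label.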

<img align="left" src="image/step/2.png" width="500" style="margin-bottom:-3px">

### Calculating the representative substructures

PySmash provides three algorithms for substructure derivation: circular, path-based and functional group-based fragments.

**(1) Circular fragment**<br>
Circular fragments are derived using the Morgan algorithm. Users can define the minimum and maximum radius of the calculated circular substructures.

**(2) Path fragment**<br>
The path fragment algorithm identifies all subgraphs of the molecule within a given size range. Users can define the minimum and maximum number of bonds in the calculated patterns.

**(3) Functional group fragment**<br>
Functional group fragments are pre-defined patterns from this article. One of the main advantages of this algorithm is its short calculation time and low computational cost.


<img align="left" src="image/step/3_1.png" width="300" style="margin-bottom:-3px">

In this example, we use the circular algorithm and set the minimum and maximum radius to 1 and 7, respectively.
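To give a feel for what the radius range controls, here is a minimal pure-Python sketch that grows an atom environment around each atom of a toy molecular graph. It only illustrates the "all atoms within *r* bonds of a central atom" idea; it is not the Morgan algorithm itself (which also hashes atom properties) and not PySmash's code. The adjacency dictionary and both function names are assumptions made for the example.

```python
from collections import deque

# Toy graph: a chain of four heavy atoms, 0-1-2-3 (atom index -> bonded neighbours).
CHAIN = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def atoms_within_radius(adjacency, center, radius):
    """Breadth-first search: atoms at most `radius` bonds away from `center`."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return frozenset(distance)


def circular_environments(adjacency, min_radius, max_radius):
    """Map (central atom, radius) to its atom environment for every radius in range."""
    return {
        (center, radius): atoms_within_radius(adjacency, center, radius)
        for center in adjacency
        for radius in range(min_radius, max_radius + 1)
    }


envs = circular_environments(CHAIN, 1, 2)
print(sorted(envs[(0, 1)]), sorted(envs[(0, 2)]))
```

A larger maximum radius yields larger, more specific environments, until the environment covers the whole molecule and stops growing.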

### Adjusting the program running parameters

The running parameters control how the generated fragments are filtered:

(1) <code>minNum</code>: the minimum frequency required of a fragment. A fragment is dropped if it appears in fewer molecules than this number;

(2) <code>minAcc</code>: the minimum accuracy a fragment must achieve. In other words, the ratio between the number of "active" molecules (i.e. molecules with the aim label) that contain the fragment and the total number of molecules that contain the same fragment should be higher than <code>minAcc</code>; otherwise the fragment is dropped;

(3) <code>p-value</code>: the threshold used to judge whether a fragment is significant. The statistic is based on the probability mass function of the binomial distribution and is calculated with the following equation: $P_{value}=\sum_{i=ms}^{ns}\frac{ns!}{i!(ns-i)!}\left(\frac{m}{n}\right)^{i}\left(1-\frac{m}{n}\right)^{ns-i}$
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
where $n$ and $m$ are the number of all compounds and the number of compounds with the aim label, respectively; the specific fragment is found in $ns$ compounds ("Total" in the table), of which $ms$ ("Hitted" in the table) carry the aim label.

(4) <code>Bonferroni</code>: whether to use the Bonferroni method to correct the calculated $P_{value}$;

(5) <code>n_jobs</code>: the number of parallel jobs (CPU cores) allocated to the calculation.
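The filters above can be sketched in plain Python. The first function evaluates the binomial sum from item (3) with `math.comb`; the second applies the `minNum`, `minAcc` and p-value checks in turn. The function names, the `alpha` and `n_tests` arguments, the boundary handling (`<` rather than `<=`) and the multiply-and-cap form of the Bonferroni correction are assumptions made for illustration; PySmash's own implementation may differ.

```python
from math import comb


def binomial_p_value(n, m, ns, ms):
    """P(X >= ms) for X ~ Binomial(ns, m / n), i.e. the sum from i = ms to ns."""
    p = m / n  # background probability of the aim label
    return sum(comb(ns, i) * p**i * (1 - p) ** (ns - i) for i in range(ms, ns + 1))


def keep_fragment(n, m, ns, ms, min_num, min_acc, alpha, n_tests=1):
    """Return True if a fragment passes the minNum, minAcc and p-value filters.

    n_tests > 1 applies a Bonferroni correction (assumed form: multiply the
    p-value by the number of tested fragments and cap the result at 1).
    """
    if ns == 0 or ns < min_num:
        return False  # fragment is too rare
    if ms / ns < min_acc:
        return False  # too few of the matching molecules carry the aim label
    adjusted = min(1.0, binomial_p_value(n, m, ns, ms) * n_tests)
    return adjusted < alpha


# Hypothetical counts: 10 compounds, 5 with the aim label; the fragment occurs
# in 3 compounds and all 3 carry the aim label.
print(binomial_p_value(10, 5, 3, 3))
print(keep_fragment(10, 5, 3, 3, min_num=2, min_acc=0.7, alpha=0.2))
print(keep_fragment(10, 5, 3, 3, min_num=2, min_acc=0.7, alpha=0.2, n_tests=2))
```

With these counts the sum has a single term, $0.5^3 = 0.125$, so the fragment passes at a threshold of 0.2 without correction but fails once the p-value is doubled for two tested fragments.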


<img align="left" src="image/step/4.png" width="300">

### Previewing and Saving Results

Click the ***Calculate*** button to start significant fragments generation.

<img src="image/step/run.png" width="400" style="margin-bottom:-3px" align="left">

After clicking the ***Calculate*** button, a pop-up window appears and reports the task currently being processed. Once all tasks are completed, click the ***Next*** button to open the result preview and save interface.

<img src="image/step/process.png" width="350" style="margin-bottom:-3px" align="left">

In the preview and save window, there are three parts:

(1) <code>sigMatrix</code>: a binary matrix in which each column represents a fragment and each row a sample; a value of 1 means that the molecule contains the corresponding fragment;

(2) <code>sigPvalue</code>: a table of the significant fragments and their statistical information, including $p_{value}$;

(3) <code>model</code>: a model object that contains the significant fragments; calling this model predicts the label of a given molecule.
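The layout of <code>sigMatrix</code> can be pictured with a small sketch. It assumes that the fragments found in each molecule are already known as sets of fragment identifiers; the identifiers are taken from the results table in this tutorial, but which molecule contains which fragment is invented for the example, and `build_sig_matrix` is a hypothetical helper rather than a PySmash function.

```python
def build_sig_matrix(molecule_fragments, significant_ids):
    """Return one row per molecule and one column per fragment; 1 marks a match."""
    return [
        [1 if fragment_id in fragments else 0 for fragment_id in significant_ids]
        for fragments in molecule_fragments
    ]


# Columns: significant fragment identifiers.
significant_ids = [2380084179, 3153453529, 1083852209]

# Rows: hypothetical fragment sets for three molecules.
molecule_fragments = [
    {2380084179, 1083852209},
    set(),
    {1083852209},
]

matrix = build_sig_matrix(molecule_fragments, significant_ids)
for row in matrix:
    print(row)
```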

<img src="image/step/preview.png" width="450" style="margin-bottom:-3px" align="left">

The corresponding results can be previewed and saved with the ***Preview*** and ***Downloaded*** buttons, respectively. In the Preview interface, *sigMatrix* and *sigPvalue* show the first 5 rows of the result, and *model* shows its parameters.

<img src="image/step/preview2.png" style="margin-bottom:-3px" align="left">

*sigPvalue* is saved in *.HTML* format.

**Note:** <br>
In the following table, $Accuracy = \frac{ms}{ns}$ and $Coverage = \frac{ms}{m}$.
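As a quick check of these two definitions, the snippet below recomputes both values for the first row of the table below (Total = 127, Hitted = 114). The value m = 758 is not stated in this tutorial; it is an assumption back-calculated from the Coverage column.

```python
def accuracy(ms, ns):
    """Fraction of the molecules containing the fragment that carry the aim label."""
    return ms / ns


def coverage(ms, m):
    """Fraction of all aim-label molecules that contain the fragment."""
    return ms / m


ns, ms = 127, 114  # "Total" and "Hitted" from the first table row
m = 758            # assumed: inferred from the Coverage column, not given in the text

print(round(accuracy(ms, ns), 6))
print(round(coverage(ms, m), 6))
```

Rounded to six decimals, the results agree with the Accuracy (0.897638) and Coverage (0.150396) reported for that row.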

The default highlight colors for the Morgan bits indicate:

 - blue: the central atom in the environment
 - yellow: aromatic atoms
 - gray: aliphatic ring atoms

    import IPython.display

    with open('./image/pvalue.html') as f_obj:
        html = f_obj.read()  # read the saved sigPvalue report

    IPython.display.HTML(html)  # render the report in the notebook

| | Pvalue | Total | Hitted | Accuracy | Coverage | SMARTS |
|---|---|---|---|---|---|---|
| 2380084179 | 6.669066e-09 | 127 | 114 | 0.897638 | 0.150396 | `*N=O` |
| 3153453529 | 0.0001368092 | 23 | 23 | 1.0 | 0.030343 | `*CN(N=O)C(*)=*` |
| 3087153396 | 0.0009464337 | 18 | 18 | 1.0 | 0.023747 | `*=[N+]([*-])c1ccc(-c(*)*)o1` |
| 3907878801 | 0.0009464337 | 18 | 18 | 1.0 | 0.023747 | `*c(*)-c1ccc([*+])o1` |
| 3356397823 | 0.001023642 | 30 | 28 | 0.933333 | 0.036939 | `*cc(o*)[N+](=O)[O-]` |
| 3440991424 | 0.001023642 | 30 | 28 | 0.933333 | 0.036939 | `*c1ccc([N+](=O)[O-])o1` |
| 344943648 | 0.001023642 | 30 | 28 | 0.933333 | 0.036939 | `*c1ccc([N+](=*)[*-])o1` |
| 1083852209 | 0.00114072 | 120 | 97 | 0.808333 | 0.127968 | `*c(*)N` |
| 1147919419 | 0.002294315 | 22 | 21 | 0.954545 | 0.027704 | `*CN(C*)N=O` |
| 2378775366 | 0.003322 | 114 | 91 | 0.798246 | 0.120053 | `*[N+](=*)[O-]` |

The *Substructure* column of the saved report shows a rendered image of each fragment and is not reproduced here.


## Predicting molecules with significant fragments model

Switch to the ***Predict*** tab to apply the previously derived alerts to the screening and prediction of an external dataset.
In prediction mode, users should provide the SMILES of the compounds to be screened and the information on the screening substructures.

<img src="image/predict/1.png" style="margin-bottom:-3px" align="left" width="400">

### Output of prediction

The result of prediction mode is a table that reports, for each compound, the flagged substructures and the predicted label.
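How a row of such a table might be read can be sketched as follows. The decision rule used here (a compound is predicted positive as soon as at least one alert is flagged) is an assumption made for illustration; this tutorial does not state the rule PySmash applies. The function names and the example rows are likewise hypothetical.

```python
def flagged_alerts(hit_row, alert_ids):
    """Return the identifiers of the alerts whose column holds a 1."""
    return [alert_id for alert_id, hit in zip(alert_ids, hit_row) if hit]


def predict_label(hit_row):
    """Assumed rule: label 1 if any alert is flagged, otherwise 0."""
    return int(any(hit_row))


alert_ids = [2380084179, 3153453529, 1083852209]

# Hypothetical hit rows for two compounds (1 = the alert matches the compound).
rows = [[0, 0, 0], [1, 0, 1]]

for row in rows:
    print(flagged_alerts(row, alert_ids), predict_label(row))
```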

    import pandas as pd

    out = pd.read_csv('./datasets/Carc/cirPred.csv')  # load the saved prediction table
    out

Unnamed: 0,2380084179,3153453529,1470580613,3440991424,3356397823,1083852209,1147919419,2378775366,3095540251,2378779377,...,1330196390,535847852,198706261,1495075844,4235614536,3888780669,3989046787,3575264755,1429883190,PredLabel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1111,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1113,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1114,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
