
# Getting Started with pySmash GUI

*pySmash: A Tool for Mining Significant Fragments from a Specific Dataset*

<img align="left" src="./image/pysmash.png" width="100" style="margin-bottom:-3px">

In addition to the Python package, we provide a GUI application that gives convenient access to the basic functions of pySmash.

This document describes how to operate the pySmash software. If you find a mistake or have suggestions for improvement, please either fix it in the source (the .py file under "pysmash/gui") or write to the mailing list: oriental-cds@163.com and kotori@cbdd.me.

## The Main Interface of pySmash

<img align="left" src="./image/interface.png" style="margin-bottom:-3px" width="500">


The pySmash interface consists of four main parts, as shown in the screenshot above:<br><br>
    (1) Selecting and loading the file;<br><br>
    (2) Selecting the corresponding fields;<br><br>
    (3) Adjusting the parameters of the fragments to be generated;<br><br>
    (4) Adjusting the program running parameters.<br><br>

## Generating Significant Fragments

### Selecting and Loading File

<img align="left" src="image/step/1.png" width="500" style="margin-bottom:-3px">

Click the <code>Browse...</code> button and choose the input file. The file must contain at least two columns: one holding the molecules in SMILES format and the other holding the label of each molecule.

Here, we use a carcinogenicity dataset as an example to walk through the steps. The data look like this:

In [1]:
import pandas as pd

data = pd.read_csv('./datasets/Carc/Carc.txt', sep='\t') # load the tab-separated dataset
data

| Unnamed: 0 | SMILES | Label |
|---|---|---|
| 0 | `CN(c1ccc(cc1)N=Nc1ccccc1)C` | 1 |
| 1 | `CC(=O)Nc1cccc2c1c1ccccc1C2` | 0 |
| 2 | `c1cc2ccc3c4c2c(c1)ccc4ccc3` | 0 |
| 3 | `ClC(Cl)Cl` | 1 |
| 4 | `CC/C(=C(\c1ccc(cc1)O)/CC)/c1ccc(cc1)O` | 1 |
| ... | ... | ... |
| 1111 | `OC(=O)C1=NN(C(=O)C1/N=N/c1ccc(cc1)S(=O)(=O)O)c...` | 1 |
| 1112 | `CN1C2CCC1CC(C2)NC(=O)c1cc(Cl)cc2c1OC(C2)(C)C` | 0 |
| 1113 | `O=C1C=CC(=O)C=C1` | 1 |
| 1114 | `c1ccc2c(c1)sc(n2)SSc1nc2c(s1)cccc2` | 0 |


Here, the Label field is 1 for carcinogens (i.e. positive) and 0 for non-carcinogens (i.e. negative).
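As a rough illustration of the expected input layout, the following sketch reads a tab-separated table and checks that the two required columns are present. The helper name `read_input`, the in-memory sample, and the default column names are assumptions made for this example; they are not part of pySmash.

```python
import csv
import io

# Hypothetical miniature of the input file, using three rows from the example dataset.
SAMPLE = (
    "SMILES\tLabel\n"
    "ClC(Cl)Cl\t1\n"
    "O=C1C=CC(=O)C=C1\t1\n"
    "c1ccc2c(c1)sc(n2)SSc1nc2c(s1)cccc2\t0\n"
)


def read_input(handle, smiles_col="SMILES", label_col="Label"):
    """Read a tab-separated table and return (smiles, labels) as lists of strings."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {smiles_col, label_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    smiles, labels = [], []
    for row in reader:
        smiles.append(row[smiles_col])
        labels.append(row[label_col])
    return smiles, labels


smiles, labels = read_input(io.StringIO(SAMPLE))
print(len(smiles), sorted(set(labels)))
```

Any extra columns in the file are simply ignored by this sketch, which matches the "at least two columns" requirement.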

### Selecting the corresponding field

Select the SMILES column, the label column, and the aim label in turn. The "aim label" is the label treated as the reference class in the p-value calculation. Since we intend to find fragments related to "carcinogen", "1" is set as the aim label.
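To make the role of the aim label concrete, the sketch below counts the total number of molecules and the number carrying the aim label, the two dataset-level quantities that the p-value calculation relies on. The function name `count_aim` is hypothetical, and the labels are those of the first five rows of the example dataset.

```python
def count_aim(labels, aim_label):
    """Return (n, m): the total number of molecules and the number with the aim label."""
    n = len(labels)
    m = sum(1 for label in labels if label == aim_label)
    return n, m


labels = [1, 0, 0, 1, 1]  # Label column of the first five example rows
print(count_aim(labels, aim_label=1))  # (5, 3)
print(count_aim(labels, aim_label=0))  # (5, 2)
```

Choosing the other label as the aim label changes which class counts as the reference, so every downstream statistic changes with it.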

<img align="left" src="image/step/2.png" width="500" style="margin-bottom:-3px">

### Adjusting the parameters of fragment to be generated

First, select the fragment type.

pySmash provides three algorithms for smashing a molecule: circular, path-based, and functional group-based.

**(1) Circular-based fragment**<br>
Circular fragments are built by applying the Morgan algorithm. Morgan fingerprints are first calculated for the given radius, and the circular fragments are then retrieved by a dedicated function from the information stored for each bit.

**(2) Path-based fragment**<br>
The path-based fragment algorithm identifies all subgraphs in the molecule within a given range of sizes. As with circular fragments, RDKit fingerprints are first calculated for the given minimum and maximum path length, and the fragments are then obtained from the bit information (bitinfo).

**(3) Functional group-based fragment**<br>
These fragments are generated following this [article](https://jcheminf.springeropen.com/articles/10.1186/s13321-017-0225-z), which proposes an algorithm for identifying functional groups in organic molecules.
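The difference between circular and path-based fragments can be sketched on a toy molecular graph without any cheminformatics library. This is a simplified illustration, not the pySmash or RDKit implementation: the graph for `CN(C)N=O` is written out by hand, bond orders are ignored, only unbranched paths are enumerated, and the function names are hypothetical.

```python
from collections import deque

# Toy graph for CN(C)N=O: atom indices 0..4, bond orders ignored.
ATOMS = ["C", "N", "C", "N", "O"]
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4)]


def adjacency(n_atoms, bonds):
    """Build an atom -> set-of-neighbours map."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def circular_environment(adj, center, radius):
    """Atoms within `radius` bonds of `center`, found by breadth-first search."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if dist[atom] == radius:
            continue  # do not expand beyond the requested radius
        for nb in adj[atom]:
            if nb not in dist:
                dist[nb] = dist[atom] + 1
                queue.append(nb)
    return sorted(dist)


def linear_paths(adj, min_bonds, max_bonds):
    """All simple (unbranched) paths with a bond count in [min_bonds, max_bonds]."""
    found = set()

    def extend(path):
        n_bonds = len(path) - 1
        if n_bonds >= min_bonds:
            # Store one orientation only, so a path and its reverse count once.
            found.add(min(tuple(path), tuple(reversed(path))))
        if n_bonds == max_bonds:
            return
        for nb in adj[path[-1]]:
            if nb not in path:
                extend(path + [nb])

    for start in adj:
        extend([start])
    return sorted(found)


adj = adjacency(len(ATOMS), BONDS)
print(circular_environment(adj, center=1, radius=1))  # [0, 1, 2, 3]
print(circular_environment(adj, center=4, radius=2))  # [1, 3, 4]
print(linear_paths(adj, min_bonds=2, max_bonds=2))
```

A circular fragment grows outward from one central atom, whereas a path-based fragment follows bonds from one end to the other, so the two schemes generally yield different substructures from the same molecule.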

<img align="left" src="image/step/3_1.png" width="300" style="margin-bottom:-3px">

Here, we take the circular fragment as an example and adjust the related parameters, setting the minimum and maximum radius to 1 and 7, respectively.

<img align="left" src="image/step/3_2.png" width="300" style="margin-bottom:-3px">

### Adjusting the program running parameters

The running parameters control how the generated fragments are filtered:

(1) <code>minNum</code>: the minimum frequency required of a fragment. A fragment is dropped if it appears fewer times than this number;

(2) <code>minAcc</code>: the minimum accuracy of a judgment based on the fragment. In other words, the ratio of the number of "active" molecules (i.e. molecules with the aim label) containing the fragment to the total number of molecules containing the same fragment must be higher than <code>minAcc</code>; otherwise the fragment is dropped;

(3) <code>p-value</code>: the threshold used to judge whether a fragment is significant. The statistical test is based on the binomial distribution and sums its probability mass function according to the following equation: $P_{value}=\sum_{i=ms}^{ns}\frac{ns!}{i!(ns-i)!}\left(\frac{m}{n}\right)^{i}\left(1-\frac{m}{n}\right)^{ns-i}$
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
**Note:** Of the $n$ compounds, $m$ carry the aim label; the fragment is found in $ns$ compounds, of which $ms$ carry the aim label.

(4) <code>Bonferroni</code>: whether to apply the Bonferroni correction to the calculated $P_{value}$;

(5) <code>n_jobs</code>: the number of parallel jobs (CPU cores) used for the calculation.
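The filters described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not pySmash's own code: the function names, the default thresholds, the handling of values exactly on a threshold, and the way the number of tests is counted for the Bonferroni correction (here a plain multiplication capped at 1) are all hypothetical.

```python
from math import comb


def binomial_p_value(n, m, ns, ms):
    """Sum of C(ns, i) * p**i * (1 - p)**(ns - i) for i = ms..ns, with p = m / n."""
    p = m / n
    return sum(comb(ns, i) * p**i * (1 - p) ** (ns - i) for i in range(ms, ns + 1))


def keep_fragment(n, m, ns, ms, min_num=5, min_acc=0.7, p_cutoff=0.05, n_tests=1):
    """Apply the minNum, minAcc and p-value filters to one fragment.

    n_tests > 1 applies a Bonferroni correction (assumed: p * n_tests, capped at 1).
    """
    if ns < min_num:  # the fragment appears too rarely
        return False
    if ms / ns <= min_acc:  # accuracy must be higher than minAcc
        return False
    p_adjusted = min(1.0, binomial_p_value(n, m, ns, ms) * n_tests)
    return p_adjusted <= p_cutoff  # boundary handling is an assumption


# Toy numbers: 10 molecules, 5 with the aim label; the fragment occurs in 4, all with the aim label.
print(binomial_p_value(10, 5, 4, 4))  # 0.0625
print(keep_fragment(10, 5, 4, 4, min_num=3, p_cutoff=0.1))             # True
print(keep_fragment(10, 5, 4, 4, min_num=3, p_cutoff=0.1, n_tests=2))  # False
```

In the toy case the uncorrected value 0.0625 passes a cutoff of 0.1, but doubling it for two tests gives 0.125, so the same fragment is rejected once the correction is applied.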


<img align="left" src="image/step/4.png" width="300">

### Running 

Click the "**Calculate**" button to start generating significant fragments with the parameters above.

<img src="image/step/run.png" width="400" style="margin-bottom:-3px" align="left">

After clicking the "**Calculate**" button, a pop-up window will appear, which records the task currently being processed. After all tasks are completed, click the "**Next**" button to enter the data preview and save interface.

<img src="image/step/process.png" width="350" style="margin-bottom:-3px" align="left">

In the preview and save window, there are three parts:

(1) <code>subMatrix</code>: a binary matrix in which each column represents a fragment and each row a sample; a value of 1 means the molecule contains the corresponding fragment;

(2) <code>subPvalue</code>: a table of the significant fragments and related statistical information, including $p_{value}$;

(3) <code>model</code>: a model object that holds the significant fragments; calling this model predicts the label of a given molecule.

<img src="image/step/preview.png" width="450" style="margin-bottom:-3px" align="left">

The corresponding results can be previewed and saved with the "**Preview**" and "**Downloaded**" buttons, respectively. <code>subMatrix</code> and <code>subPvalue</code> display their top 5 rows, and <code>Model</code> displays its parameters.

<img src="image/step/preview2.png" style="margin-bottom:-3px" align="left">

The subPvalue result is saved in *.HTML* format and looks like this:

**Note**: $Accuracy=\frac{ms}{ns}$ $Coverage=\frac{ms}{m}$

In [10]:
import IPython.display

with open('./image/pvalue.html') as f_obj:
    html = f_obj.read()  # raw HTML of the saved subPvalue table

IPython.display.HTML(html)

| Unnamed: 0 | Pvalue | Total | Hitted | Accuracy | Coverage | SMARTS | Substructure |
|---|---|---|---|---|---|---|---|
| 2380084179 | 6.669066e-09 | 127 | 114 | 0.897638 | 0.150396 | `*N=O` | `N*O` |
| 3153453529 | 0.0001368092 | 23 | 23 | 1.0 | 0.030343 | `*CN(N=O)C(*)=*` | `***NNO` |
| 1470580613 | 0.0009464337 | 18 | 18 | 1.0 | 0.023747 | `*c(*)-c1ccc([N+](=O)[O-])o1` | `**N+OO-O` |
| 3440991424 | 0.001023642 | 30 | 28 | 0.933333 | 0.036939 | `*c1ccc([N+](=O)[O-])o1` | `*ON+OO-` |
| 3356397823 | 0.001023642 | 30 | 28 | 0.933333 | 0.036939 | `*cc(o*)[N+](=O)[O-]` | `O**N+OO-` |
| 1083852209 | 0.00114072 | 120 | 97 | 0.808333 | 0.127968 | `*c(*)N` | `**NH2` |
| 1147919419 | 0.002294315 | 22 | 21 | 0.954545 | 0.027704 | `*CN(C*)N=O` | `**ONN` |
| 2378775366 | 0.003322 | 114 | 91 | 0.798246 | 0.120053 | `*[N+](=*)[O-]` | `*N+*O-` |
| 3095540251 | 0.006547344 | 13 | 13 | 1.0 | 0.01715 | `*CN(C)N=O` | `*ONN` |
| 2378779377 | 0.01059133 | 107 | 84 | 0.785047 | 0.110818 | `*[N+]([*-])=O` | `*N+*-O` |
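The Accuracy and Coverage columns can be reproduced from the counts in the table. In the sketch below, `Total` is read as $ns$ and `Hitted` as $ms$, which is consistent with the Accuracy values shown. The value of $m$ is not stated in this document; 758 is inferred here by dividing `Hitted` by `Coverage` and should be treated as an assumption.

```python
def accuracy(ms, ns):
    """Share of molecules containing the fragment that carry the aim label."""
    return ms / ns


def coverage(ms, m):
    """Share of all aim-label molecules that contain the fragment."""
    return ms / m


M_INFERRED = 758  # assumption: inferred from Hitted / Coverage, not stated in the source

rows = [(127, 114), (23, 23), (30, 28)]  # (Total, Hitted) for three rows of the table
for ns, ms in rows:
    print(round(accuracy(ms, ns), 6), round(coverage(ms, M_INFERRED), 6))
```

With these inputs the printed pairs match the first, second, and fourth rows of the table to six decimal places.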


## Predicting molecules with significant fragments model

Switch to the "**Predict**" tab to predict the molecules.

In the prediction mode, two input files are required, one of which is a table with molecules in SMILES format, and the other is the model file saved in the previous step.

<img src="image/predict/1.png" style="margin-bottom:-3px" align="left" width="400">

### Output of prediction

The output of the **pySmash** GUI prediction mode is a table: its last column is the predicted label, and the other columns form the prediction matrix (one column per significant fragment).

In [2]:
out = pd.read_csv('./datasets/Carc/cirPred.csv')
out

Unnamed: 0,2380084179,3153453529,1470580613,3440991424,3356397823,1083852209,1147919419,2378775366,3095540251,2378779377,...,1330196390,535847852,198706261,1495075844,4235614536,3888780669,3989046787,3575264755,1429883190,PredLabel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1111,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1113,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1114,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
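How the predicted label relates to the fragment columns is not spelled out in this document. One plausible reading, sketched below purely as an assumption, is that a molecule receives the aim label when it matches at least one significant fragment. This is consistent with the displayed rows, where all visible fragment columns are 0 and the predicted label is 0. The function name `predict_labels` is hypothetical.

```python
def predict_labels(matrix, aim_label=1, other_label=0):
    """Assumed rule: assign the aim label if any significant fragment is matched."""
    return [aim_label if any(row) else other_label for row in matrix]


# Toy prediction matrix: rows are molecules, columns are significant fragments.
matrix = [
    [0, 0, 0],  # no significant fragment matched
    [0, 1, 0],  # one fragment matched
    [1, 1, 0],  # two fragments matched
]
print(predict_labels(matrix))  # [0, 1, 1]
```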
