# ChemTONIC: A Brief Guideline and Examples

![Logo](https://raw.githubusercontent.com/mldlproject/chemtonic/e218517eb2b5f73553035badad7e32f0f37bd291/chemtonic.svg)

### PART 1: FUNCTIONS IN SUBSTAGE MODULES

ChemTONIC 0.0.1 was released with its first module, **curation**. Module **curation** contains five submodules:
- `validation`
- `cleaning`
- `normalization`
- `refinement`
- `utils`

This brief guideline is designed to help novice users perform data curation as quickly as possible. For details of all provided functions, please read "[ChemTONIC: A Complete Guideline and Examples](https://github.com/mldlproject/chemtonic/blob/main/CompleteGuideline.ipynb)".

### Import libraries

In [None]:
from IPython.display import Image
from IPython.core.display import HTML 

In [6]:
from chemtonic.curation.validation import validateComplete
from chemtonic.curation.cleaning import cleanComplete
from chemtonic.curation.normalization import normalizeComplete
from chemtonic.curation.refinement import refineComplete
import pandas as pd

- `validateComplete()`: performs the complete validation substage, including all of its steps
- `cleanComplete()`: performs the complete cleaning substage, including all of its steps
- `normalizeComplete()`: performs the complete normalization substage, including all of its steps
- `refineComplete()`: performs the complete data curation process, including all substages

![Pipeline](https://raw.githubusercontent.com/mldlproject/chemtonic/0fdf3801a315d23cee264535b5bc13bea1b83572/images/pipelinesmall.svg)

### Load example dataset

In [None]:
smiles = pd.read_csv("./data/example.csv")['SMILES']

### 1. Validation

Module **validation** contains four functions: *rmMixtures()*, *rmInorganics()*, *rmOrganometallics()*, and *validateComplete()*. ***validateComplete()*** performs the complete validation substage, including *rmMixtures()*, *rmInorganics()*, and *rmOrganometallics()*.

```python
validateComplete(compounds, getInvalidStruct=False, removeDuplicates=False, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```


Function arguments:
- ``compounds``: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- ``getInvalidStruct``: Boolean. Default `False`. Set `True` to get unvalidated SMILES (if any)
- ``removeDuplicates``: Boolean. Default `False`. Set `True` to remove duplicated SMILES (if any)
- ``getDuplicatedIdx``: Boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- ``exportCSV``: Boolean. Default `False`. Set `True` to get csv file
- ``outputPath``: Directory. Must be filled when `exportCSV=True` to get csv file
- ``printlogs``: Boolean. Default `True`. Print logs and summary during the process
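Two of the arguments above depend on others: `getDuplicatedIdx` is only meaningful with `removeDuplicates=True`, and `exportCSV=True` needs an `outputPath`. The sketch below shows one way to check these combinations before calling a curation function. `check_options` is a hypothetical helper written for this guideline, not part of ChemTONIC, and how the library itself reacts to an invalid combination is not stated here.

```python
def check_options(removeDuplicates=False, getDuplicatedIdx=False,
                  exportCSV=False, outputPath=None):
    """Hypothetical pre-flight check mirroring the documented argument rules."""
    # getDuplicatedIdx is documented as usable only with removeDuplicates=True
    if getDuplicatedIdx and not removeDuplicates:
        raise ValueError("getDuplicatedIdx=True requires removeDuplicates=True")
    # outputPath is documented as mandatory when exportCSV=True
    if exportCSV and outputPath is None:
        raise ValueError("exportCSV=True requires outputPath to be set")
    return True


# A valid combination passes silently
check_options(removeDuplicates=True, getDuplicatedIdx=True)
```

Calling `check_options(getDuplicatedIdx=True)` raises a `ValueError`, because `removeDuplicates` is left at its default `False`.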

#### 1.1. Obtaining lists of validated SMILES(s) (duplicated data may exist)

In [9]:
validateComplete(smiles, removeDuplicates=False)

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
set 'getInvalidStruct=True' to get the list of all unvalidated structures 



Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
494,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
495,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
496,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
497,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 1.2. Obtaining lists of validated SMILES(s) (duplicated data removed)

In [10]:
validateComplete(smiles, removeDuplicates=True)

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
set 'getInvalidStruct=True' to get the list of all unvalidated structures 

There are 381 unique structures filtered from 499 initial validated structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun validateComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun validateComplete() with setting 'getDuplicates=Tr

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
376,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...
377,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
378,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
379,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 1.3. Obtaining lists of unvalidated SMILES(s) (if any)

In [11]:
validateComplete(smiles, getInvalidStruct=True)

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------


Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,Mixture,rmMixtures(),277


#### 1.4. Obtaining lists of duplicated validated SMILES(s) (if any)

In [18]:
validateComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
set 'getInvalidStruct=True' to get the list of all unvalidated structures 

There are 381 unique structures filtered from 499 initial validated structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun validateComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun validateComplete() with setting 'getDuplicates=Tr

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 278]",validateComplete()
1,3,"[3, 467]",validateComplete()
2,7,"[7, 215]",validateComplete()
3,9,"[0, 9, 69, 278]",validateComplete()
4,11,"[11, 418]",validateComplete()
...,...,...,...
196,487,"[464, 487]",validateComplete()
197,492,"[200, 492]",validateComplete()
198,493,"[445, 493]",validateComplete()
199,495,"[213, 495]",validateComplete()
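The `idx` / `matchedIdx` layout above lists every member of a duplicate group together with the indices of all entries it matches. A minimal sketch of that layout, assuming plain string equality as the matching rule (ChemTONIC presumably compares structures rather than raw strings, and its implementation is not shown in this guideline); `duplicated_index_table` is a hypothetical name:

```python
from collections import defaultdict


def duplicated_index_table(items):
    """Return (idx, matchedIdx) rows for every item that occurs more than once."""
    groups = defaultdict(list)
    for idx, item in enumerate(items):
        groups[item].append(idx)  # collect all positions of each distinct item

    rows = []
    for idx, item in enumerate(items):
        matched = groups[item]
        if len(matched) > 1:  # keep only items that have at least one duplicate
            rows.append((idx, matched))
    return rows


toy = ["CCO", "CCN", "CCO", "c1ccccc1", "CCN", "CCO"]
for idx, matched in duplicated_index_table(toy):
    print(idx, matched)
```

In the toy list, the three copies of `CCO` each get a row with `matchedIdx` `[0, 2, 5]`, the two copies of `CCN` each get `[1, 4]`, and the unique `c1ccccc1` is omitted.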


#### 1.5. Export csv files

To create CSV files, uncomment and run the commands below with `outputPath` defined.

In [None]:
#ValOutput= '../validation/'
# validateComplete(smiles, removeDuplicates=False, exportCSV=True, outputPath=ValOutput)
# validateComplete(smiles, removeDuplicates=True,  exportCSV=True, outputPath=ValOutput)
# validateComplete(smiles, getInvalidStruct=True,  exportCSV=True, outputPath=ValOutput)
# validateComplete(smiles, removeDuplicates=True,  getDuplicatedIdx=True, exportCSV=True, outputPath=ValOutput)

---

### 2. Cleaning

Module **cleaning** contains three functions: *clSalts()*, *clCharges()*, and *cleanComplete()*. ***cleanComplete()*** performs the complete cleaning substage, including *clSalts()* and *clCharges()*.

**Notice**: The cleaning substage requires all SMILES to be verified. If unverified SMILES are present, `cleanComplete()` removes them. `cleanComplete()` does not remove mixtures, inorganics, or organometallics.

```python
cleanComplete(compounds, getUncleanedStruct=False, deSalt=False, neutralize=False, removeDuplicates=False, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```


Function arguments:
- ``compounds``: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- ``getUncleanedStruct``: boolean. Default `False`. Set `True` to get uncleaned SMILES (if any)
- ``deSalt``: boolean. Default `False`. Set `True` to desalt the SMILES of salts (if any). With the default `False`, SMILES of salts are removed.
- ``neutralize``: boolean. Default `False`. Set `True` to neutralize charged SMILES (if any)
- ``removeDuplicates``: boolean. Default `False`. Set `True` to remove duplicated SMILES (if any)
- ``getDuplicatedIdx``: boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- ``exportCSV``: boolean. Default `False`. Set `True` to get csv file
- ``outputPath``: Directory. Must be filled when `exportCSV=True` to get csv file
- ``printlogs``: boolean. Default `True`. Print logs and summary during the process

#### 2.1. Obtaining lists of cleaned SMILES(s) (duplicated data may exist)

In [15]:
cleanComplete(smiles, removeDuplicates=False)

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------
set 'getUncleanedStruct=True' to get the list of all uncleaned structures. Neutralized charged structures will be included (if any) 



Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
492,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
493,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
494,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
495,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 2.2. Obtaining lists of cleaned SMILES(s) (duplicated data removed)

In [16]:
cleanComplete(smiles, removeDuplicates=True)

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------
set 'getUncleanedStruct=True' to get the list of all uncleaned structures. Neutralized charged structures will be included (if any) 

There are 380 unique structures filtered from 497 initial cleaned structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun cleanComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list 

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
375,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...
376,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
377,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
378,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 2.3. Obtaining lists of uncleaned SMILES(s) (if any)

In [17]:
cleanComplete(smiles, getUncleanedStruct=True)

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------


Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,Salt,clSalts(),277
1,CCCC(=O)[O-].[Na+],Salt,clSalts(),344
2,CCCC(=O)[O-].[Na+],Salt,clSalts(),379
3,COc1ccc(C2CC(=O)c3c([O-])cc(O)cc3O2)cc1,Charge,clCharges(),302
4,O=C(O)C1=CC(=CC=[N+]2c3cc(O)c(OC4OC(CO)C(O)C(O...,Charge,clCharges(),329
5,CCCC(=O)[O-].[Na+],Charge,clCharges(),344
6,CCCC(=O)[O-].[Na+],Charge,clCharges(),379
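In the table above, `CCCC(=O)[O-].[Na+]` is tagged both `Salt` and `Charge`. The sketch below inspects the SMILES text to show why both tags are plausible: the string has two dot-separated components, and each carries a formal charge. This is a text-level illustration only. `components` and `formal_charges` are hypothetical helpers, and the rules used by `clSalts()` and `clCharges()` are not described in this guideline.

```python
import re

# Bracket atoms such as [O-], [Na+], [nH]; the charge sits at the end of the bracket,
# optionally followed by an atom-class label like ":1"
_BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
_CHARGE = re.compile(r"(\++|-+)(\d*)(?::\d+)?$")


def components(smiles):
    """Split on '.', the SMILES disconnection symbol (naive: ignores ring closures spanning a dot)."""
    return smiles.split(".")


def formal_charges(smiles):
    """Return the non-zero formal charges written on bracket atoms."""
    charges = []
    for inner in _BRACKET_ATOM.findall(smiles):
        match = _CHARGE.search(inner)
        if match is None:
            continue  # bracket atom without a charge, e.g. [nH]
        sign = 1 if match.group(1)[0] == "+" else -1
        # "+2" gives 2, "++" gives 2, "+" gives 1
        magnitude = int(match.group(2)) if match.group(2) else len(match.group(1))
        charges.append(sign * magnitude)
    return charges


salt = "CCCC(=O)[O-].[Na+]"
print(components(salt))
print(formal_charges(salt))
```

For this salt, the helpers report two components and the charges `-1` and `+1`.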


#### 2.4. Obtaining lists of duplicated cleaned SMILES(s) (if any)

In [19]:
cleanComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------
set 'getUncleanedStruct=True' to get the list of all uncleaned structures. Neutralized charged structures will be included (if any) 

There are 380 unique structures filtered from 497 initial cleaned structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun cleanComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list 

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 278]",cleanComplete()
1,3,"[3, 465]",cleanComplete()
2,7,"[7, 215]",cleanComplete()
3,9,"[0, 9, 69, 278]",cleanComplete()
4,11,"[11, 416]",cleanComplete()
...,...,...,...
194,485,"[462, 485]",cleanComplete()
195,490,"[200, 490]",cleanComplete()
196,491,"[443, 491]",cleanComplete()
197,493,"[213, 493]",cleanComplete()


#### 2.5. Export csv files

To create CSV files, uncomment and run the commands below with `outputPath` defined.

In [1]:
#CleanOutput= '../cleaning/'
# cleanComplete(smiles, removeDuplicates=False,  exportCSV=True, outputPath=CleanOutput)
# cleanComplete(smiles, removeDuplicates=True,   exportCSV=True, outputPath=CleanOutput)
# cleanComplete(smiles, getUncleanedStruct=True, exportCSV=True, outputPath=CleanOutput)
# cleanComplete(smiles, removeDuplicates=True,   getDuplicatedIdx=True, exportCSV=True, outputPath=CleanOutput)

---

### 3. Normalization

Module `normalization` contains three functions: `normTautomers()`, `normStereoisomers()`, and `normalizeComplete()`.
- `normTautomers()`: removes or detautomerizes SMILES of tautomers
- `normStereoisomers()`: destereoisomerizes SMILES of stereoisomers
- `normalizeComplete()`: performs the complete normalization substage, which includes the two steps above

```python
normalizeComplete(compounds, getUnnormalizedStruct=False, deTautomerize=True, deSterioisomerize=True, removeDuplicates=False, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```

Function arguments:
- ``compounds``: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- ``getUnnormalizedStruct``: boolean. Default `False`. Set `True` to get unnormalized SMILES (if any)
- ``deTautomerize``: boolean. Default `True`. Set `False` to remove tautomer SMILES (if any)
- ``deSterioisomerize``: boolean. Default `True`. Set `False` to skip destereoisomerization of SMILES (if any)
- ``removeDuplicates``: boolean. Default `False`. Set `True` to remove duplicated SMILES (if any)
- ``getDuplicatedIdx``: boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- ``exportCSV``: boolean. Default `False`. Set `True` to get csv file
- ``outputPath``: Directory. Must be filled when `exportCSV=True` to get csv file
- ``printlogs``: boolean. Default `True`. Print logs and summary during the process
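To give a feel for what destereoisomerization means at the SMILES level, the sketch below strips the stereo annotations (`@` for tetrahedral centres, `/` and `\` for double-bond geometry) from a SMILES string. It is a naive text operation under stated assumptions, not the method used by `normStereoisomers()`: a cheminformatics toolkit would parse the molecule and re-canonicalize it, and tautomers cannot be handled by string editing at all. `has_stereo` and `strip_stereo` are hypothetical names. The input is the stereoisomer SMILES that appears in this guideline's example output.

```python
# Characters that encode stereochemistry in SMILES
_STEREO_CHARS = "@/\\"
_STEREO_TABLE = str.maketrans("", "", _STEREO_CHARS)


def has_stereo(smiles):
    """True if the SMILES text carries any stereo annotation."""
    return any(ch in _STEREO_CHARS for ch in smiles)


def strip_stereo(smiles):
    """Delete stereo annotations; brackets stay because they hold explicit H counts."""
    return smiles.translate(_STEREO_TABLE)


stereo_smiles = "F[C@@]12C[C@]1(Cl)C[C@H](/C=C/Br)O2"
print(has_stereo(stereo_smiles))
print(strip_stereo(stereo_smiles))
```

The stripped string still describes the same atoms and bonds, but it is not canonical, so it should not be compared directly with SMILES produced by another tool.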

#### 3.1. Obtaining lists of normalized SMILES(s) (duplicated data may exist)

In [20]:
normalizeComplete(smiles, removeDuplicates=False) 

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized
set 'getUnnormalizedStruct=True' to get the list of all unnormalized structures. 



Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 3.2. Obtaining lists of normalized SMILES(s) (duplicated data removed)

In [21]:
normalizeComplete(smiles, removeDuplicates=True) 

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized
set 'getUnnormalizedStruct=True' to get the list of all unnormalized structures. 

There are 381 unique structures filtered from 500 initial normalized structures
To get detailed information, please follow steps below:
(1) Rerun normalizeComplete() with setting 'removeDuplicates=False' to get the list of all normalized structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun normalizeComplete() with setting 'getDuplicates=True', 'exportCSV'=True, and 'outputPath=<Directory>' to export a csv file  containing the list of duplicated structures 

--OR--
Ru

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
376,CC(O)c1c2c(cc3c(=O)c4c(O)cc5c(c4oc13)C=CC(C)(C...
377,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
378,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
379,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 3.3. Obtaining lists of unnormalized SMILES(s) (if any)

In [22]:
normalizeComplete(smiles, getUnnormalizedStruct=True) 

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized


Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CC1CN2CC3CCC(=O)C4=C5C(=C(O)C1CC2C53C)C1=C4CCO...,Tautomer,normTautomers(),5
1,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,Tautomer,normTautomers(),6
2,COC(=O)C1=C(CCO)C2=C3C1=C(O)C1(O)CC4N(CC1C)CC(...,Tautomer,normTautomers(),27
3,CC(C)=CCCC1(C)C(CC=C(C)C)CC2(CC=C(C)C)C(=O)C1(...,Tautomer,normTautomers(),30
4,CC(C)C1=CC2=CC(=O)C3C(C)(C)CCCC3(C)C2=C(O)C1=O,Tautomer,normTautomers(),45
...,...,...,...,...
60,CC(=O)CC1C(=O)C(CO)=CCC(C)(CCC(O)C(C)(C)O)C1C=...,Tautomer,normTautomers(),486
61,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...,Tautomer,normTautomers(),491
62,COC(c1ccccc1)C(O)C1CC=CC(=O)O1,Tautomer,normTautomers(),499
63,F[C@@]12C[C@]1(Cl)C[C@H](/C=C/Br)O2,Stereoisomer,normStereoisomers(),89


#### 3.4. Obtaining lists of duplicated normalized SMILES(s) (if any)

In [23]:
normalizeComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True) 

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized
set 'getUnnormalizedStruct=True' to get the list of all unnormalized structures. 

There are 381 unique structures filtered from 500 initial normalized structures
To get detailed information, please follow steps below:
(1) Rerun normalizeComplete() with setting 'removeDuplicates=False' to get the list of all normalized structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun normalizeComplete() with setting 'getDuplicates=True', 'exportCSV'=True, and 'outputPath=<Directory>' to export a csv file  containing the list of duplicated structures 

--OR--
Ru

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 279]",normalizeComplete()
1,3,"[3, 468]",normalizeComplete()
2,7,"[7, 215]",normalizeComplete()
3,9,"[0, 9, 69, 279]",normalizeComplete()
4,11,"[11, 419]",normalizeComplete()
...,...,...,...
198,488,"[465, 488]",normalizeComplete()
199,493,"[200, 493]",normalizeComplete()
200,494,"[446, 494]",normalizeComplete()
201,496,"[213, 496]",normalizeComplete()


#### 3.5. Export csv files

In [2]:
#NormOutput= '../normalization/'
# normalizeComplete(smiles, removeDuplicates=False,     exportCSV=True, outputPath=NormOutput)
# normalizeComplete(smiles, removeDuplicates=True,      exportCSV=True, outputPath=NormOutput)
# normalizeComplete(smiles, getUnnormalizedStruct=True, exportCSV=True, outputPath=NormOutput)
# normalizeComplete(smiles, removeDuplicates=True,      getDuplicatedIdx=True, exportCSV=True, outputPath=NormOutput)

---

### 4. Refinement

Module **refinement** contains only one function: `refineComplete()`.
- *refineComplete()*: performs the complete data curation process, which includes the three substages above

```python
refineComplete(compounds, getUnrefinedStruct=False, deSalt=False, neutralize=False, deTautomerize=True, deSterioisomerize=True, removeDuplicates=True, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```

Function arguments:
- ``compounds``: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
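- ``getUnrefinedStruct``: boolean. Default `False`. Set `True` to get unrefined SMILES (if any)
- ``deSalt``: boolean. Default `False`. Set `True` to desalt the SMILES of salts (if any). With the default `False`, SMILES of salts are removed.
- ``neutralize``: boolean. Default `False`. Set `True` to neutralize charged SMILES (if any)
- ``deTautomerize``: boolean. Default `True`. Set `False` to remove tautomer SMILES (if any)
- ``deSterioisomerize``: boolean. Default `True`. Set `False` to skip destereoisomerization of SMILES (if any)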
- ``removeDuplicates``: boolean. Default `True`. Set `False` to keep duplicated SMILES (if any)
- ``getDuplicatedIdx``: boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- ``exportCSV``: boolean. Default `False`. Set `True` to get csv file
- ``outputPath``: Directory. Must be filled when `exportCSV=True` to get csv file
- ``printlogs``: boolean. Default `True`. Print logs and summary during the process

#### 4.1. Obtaining lists of refined SMILES(s) (duplicated data may exist)

In [25]:
refineComplete(smiles, removeDuplicates=False)

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
201/499 validated structures have at least one duplicates
CLEANING
-----------------------------------------------------------------------------
497/499 structures were successfully cleaned
2/499 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 497 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
497/497 structures were successfully normalized
-----------------------------------------------------------------------------
2/497 normalized structures have at least one duplicates
REFINEMENT SUMMARY
-------------------

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
492,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
493,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
494,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
495,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 4.2. Obtaining lists of refined SMILES(s) (duplicated data removed)

In [26]:
refineComplete(smiles, removeDuplicates=True)

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
There are 381 unique structures filtered from 499 initial validated structures
CLEANING
-----------------------------------------------------------------------------
380/381 structures were successfully cleaned
1/381 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 380 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
379/380 structures were successfully normalized
1/380 structure(s) were unsuccessfully normalized and need to be rechecked
-----------------------------------------------------------------------------

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
374,CC(O)c1c2c(cc3c(=O)c4c(O)cc5c(c4oc13)C=CC(C)(C...
375,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
376,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
377,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 4.3. Obtaining lists of unrefined SMILES(s) (if any)

In [28]:
refineComplete(smiles, removeDuplicates=False, getUnrefinedStruct=True)

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
201/499 validated structures have at least one duplicates
CLEANING
-----------------------------------------------------------------------------
497/499 structures were successfully cleaned
2/499 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 497 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
497/497 structures were successfully normalized
-----------------------------------------------------------------------------
2/497 normalized structures have at least one duplicates
REFINEMENT SUMMARY
-------------------

Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,Mixture,rmMixtures(),277
1,CCCC(=O)[O-].[Na+],Salt,clSalts(),286
2,COc1ccc(C2CC(=O)c3c([O-])cc(O)cc3O2)cc1,Charge,clCharges(),257
3,O=C(O)C1=CC(=CC=[N+]2c3cc(O)c(OC4OC(CO)C(O)C(O...,Charge,clCharges(),274
4,CCCC(=O)[O-].[Na+],Charge,clCharges(),286
5,CC1CN2CC3CCC(=O)C4=C5C(=C(O)C1CC2C53C)C1=C4CCO...,Tautomer,normTautomers(),5
6,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,Tautomer,normTautomers(),6
7,COC(=O)C1=C(CCO)C2=C3C1=C(O)C1(O)CC4N(CC1C)CC(...,Tautomer,normTautomers(),26
8,CC(C)=CCCC1(C)C(CC=C(C)C)CC2(CC=C(C)C)C(=O)C1(...,Tautomer,normTautomers(),29
9,CC(C)C1=CC2=CC(=O)C3C(C)(C)CCCC3(C)C2=C(O)C1=O,Tautomer,normTautomers(),44


**Notice**: The `idx` values returned by different substages are not comparable with each other. For example, the SMILES at `idx 277` returned by `rmMixtures()` is different from the SMILES at `idx 277` returned by `clSalts()`, because `rmMixtures()` belongs to substage `validation` while `clSalts()` belongs to substage `cleaning`. When using `refineComplete()`, the substages run in the order `validation`, `cleaning`, then `normalization`.
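The toy sketch below illustrates why positions stop lining up once an earlier substage has removed entries, and one way to keep track of the original position by carrying it along with each SMILES. It assumes each substage indexes its own input, which is inferred from the notice above rather than from ChemTONIC's code. `filter_with_origin` and the dot-based filter are hypothetical stand-ins, not library functions.

```python
def filter_with_origin(records, keep):
    """Filter (original_idx, smiles) pairs while preserving the original index."""
    return [(orig, smi) for orig, smi in records if keep(smi)]


raw = ["CCO", "CC.CC", "CCN", "CC(=O)[O-]", "c1ccccc1"]
records = list(enumerate(raw))  # pair each SMILES with its original position

# Stand-in for an earlier substage: drop any entry containing a dot
stage1 = filter_with_origin(records, lambda smi: "." not in smi)

# Position inside the filtered list versus position in the raw input
for pos, (orig, smi) in enumerate(stage1):
    print(pos, orig, smi)
```

After the filter, `CCN` sits at position 1 in the stage output but at position 2 in the raw input, so an `idx` reported by a later substage has to be mapped back before it is used to look up the original data.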

#### 4.4. Obtaining lists of duplicated refined SMILES(s) (if any)

In [29]:
refineComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
There are 381 unique structures filtered from 499 initial validated structures
CLEANING
-----------------------------------------------------------------------------
380/381 structures were successfully cleaned
1/381 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 380 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
379/380 structures were successfully normalized
1/380 structure(s) were unsuccessfully normalized and need to be rechecked
-----------------------------------------------------------------------------

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 278]",validateComplete()
1,3,"[3, 467]",validateComplete()
2,7,"[7, 215]",validateComplete()
3,9,"[0, 9, 69, 278]",validateComplete()
4,11,"[11, 418]",validateComplete()
...,...,...,...
199,495,"[213, 495]",validateComplete()
200,496,"[472, 496]",validateComplete()
201,na,na,cleanComplete()
202,83,"[83, 90]",normalizeComplete()


**Notice**: The `idx` values returned by different substages are not comparable with each other. For example, the SMILES at `idx 0` returned by `validateComplete()` is different from the SMILES at `idx 0` returned by `normalizeComplete()`. If `idx na` appears for a substage, no duplicate was found in that substage. When using `refineComplete()`, the substages run in the order `validation`, `cleaning`, then `normalization`.

#### 4.5. Export csv files

In [30]:
# refineOutput= '../refinement/'
# refineComplete(smiles, removeDuplicates=True,  exportCSV=True, outputPath=refineOutput)
# refineComplete(smiles, removeDuplicates=False, exportCSV=True, outputPath=refineOutput)
# refineComplete(smiles, removeDuplicates=True,  getDuplicatedIdx=True,   exportCSV=True, outputPath=refineOutput)
# refineComplete(smiles, removeDuplicates=False, getUnrefinedStruct=True, exportCSV=True, outputPath=refineOutput)

---

### PART 2: STANDARD DATA CURATION

This section introduces a standard data curation procedure proposed by [Fourches et al. (2010)](https://pubs.acs.org/doi/10.1021/ci100176x), whose major substages are implemented by the functions in the substage modules. The data curation process is summarized in the flowchart below.

![Pipeline](https://raw.githubusercontent.com/mldlproject/chemtonic/0fdf3801a315d23cee264535b5bc13bea1b83572/images/pipelinesmall.svg)

This section is designed for you if:
- You already know this procedure and only want to run the data curation process, OR
- You are a novice user looking for a standard data curation procedure to apply, OR
- You are a computational scientist looking for a ready-made module to integrate into a to-be-deployed model as a checking gate for input SMILES(s).

#### Completing data curation in a single run

In [None]:
refineComplete(smiles)

The function and its default parameters are designed based on [Fourches et al. (2010)](https://pubs.acs.org/doi/10.1021/ci100176x)'s procedure.
```python
refineComplete(compounds, getUnrefinedStruct=False, deSalt=False, neutralize=False, deTautomerize=True, deSterioisomerize=True, removeDuplicates=True, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```
To run the task with other conditions, please make sure you understand the role of each substage's steps.

`refineComplete(compounds)` is equivalent to `normalizeComplete(cleanComplete(validateComplete(compounds, removeDuplicates=True), removeDuplicates=True), removeDuplicates=True)`.

You can run each substage separately to obtain detailed logs and to extract the correct `idx`.

In [None]:
validated_compounds  = validateComplete(smiles, removeDuplicates=True)
cleaned_compounds    = cleanComplete(validated_compounds, removeDuplicates=True)
normalized_compounds = normalizeComplete(cleaned_compounds, removeDuplicates=True)
refined_compounds    = normalized_compounds

`normalized_compounds` is equal to `refined_compounds` when the input SMILES(s) is/are cleaned.

-END-