# ChemTONIC: A Complete Guideline and Examples

![Logo](https://raw.githubusercontent.com/mldlproject/chemtonic/e218517eb2b5f73553035badad7e32f0f37bd291/chemtonic.svg)

### PART 1: FUNCTIONS IN SUBSTAGE MODULES

ChemTONIC 0.0.1 was released with the first module called **curation**. Module **curation** contains four submodules, including
- validation
- cleaning
- normalization
- refinement
- utils

### Import libraries

In [None]:
from IPython.display import Image
from IPython.core.display import HTML

In [1]:
from chemtonic.curation.validation import *
from chemtonic.curation.cleaning import *
from chemtonic.curation.normalization import *
from chemtonic.curation.utils import *
from chemtonic.curation.refinement import *
import pandas as pd

- `validation` module: Perform validation stages including all substages
- `cleaning` module: Perform cleaning stages including all substages
- `normalization` module: Perform normalization stages including all substages
- `refinement` module: Perform complete data curation processincluding all stages

![Pipeline](https://raw.githubusercontent.com/mldlproject/chemtonic/0fdf3801a315d23cee264535b5bc13bea1b83572/images/pipelinesmall.svg)

### Load example dataset

In [2]:
smiles = pd.read_csv("./data/example.csv")['SMILES']

### Necessary function

In [3]:
# This function is used for converting SMILES lists to SMILES pandas Dataframe for better visualization
def to_df(smilesList, col=1):
    if col==1:
        smilesDf = pd.DataFrame(smilesList, columns=['SMILES'])
    elif col==2:
        smilesDf = pd.DataFrame(smilesList).T
        smilesDf.columns=['SMILES', 'index']
    return smilesDf

### 1. Validation

Module `validation` contains four functions: `rmMixtures()`, `rmInorganics()`, `rmOrganometallics()`, and `validateComplete()`.

- `rmMixtures()`: remove SMILES of mixtures
- `rmInorganics()`: remove SMILES of inorganic compounds
- `rmOrganometallics()`: remove SMILES of organometallic compounds
- `validateComplete()`: perform complete validation substage which includes these three steps above

Functions:
```python
rmMixtures(compounds, getMixtures=False, getMixturesIdx=False, printlogs=True)
rmInorganics(compounds, getInorganics=False, getInorganicsIdx=False, printlogs=True)
rmOrganometallics(compounds, getOrganometallics=False, getOrganometallicsIdx=False, printlogs=True)
```
Function arguments:
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getMixtures`: Boolean. Default `False`. Set `True` to get mixture SMILES (if any)
- `getMixturesIdx`: Boolean. Default `False`. Set `True` to get index of mixture SMILES (if any). Can be used when `getMixtures=True` only.
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process
---
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getInorganics`: Boolean. Default `False`. Set `True` to get inorganic SMILES (if any)
- `getInorganicsIdx`: Boolean. Default `False`. Set `True` to get index of inorganic SMILES (if any). Can be used when `getInorganics=True` only.
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process
---
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getOrganometallics`: Boolean. Default `False`. Set `True` to get organometallic SMILES (if any)
- `getOrganometallicsIdx`: Boolean. Default `False`. Set `True` to get index of organometallic SMILES (if any). Can be used when `getOrganometallics=True` only.
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process

### 1.1. Obtaining lists of SMILES(s) with invalid structures removed

#### 1.1.1. SMILES list with 'Mixture SMILES(s)' removed

In [4]:
a1 = rmMixtures(smiles, getMixtures=False, getMixturesIdx=False)
to_df(a1)

Succeeded to verify 500/500 structures
499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
494,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
495,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
496,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
497,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 1.1.2. SMILES list with 'Inorganic SMILES(s)' removed

In [5]:
b1 = rmInorganics(smiles, getInorganics=False, getInorganicsIdx=False, printlogs=True)
to_df(b1)

Succeeded to verify 500/500 structures
500/500 structures are organics


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 1.1.3. SMILES list with 'Organometallic SMILES(s)' removed

In [6]:
c1 = rmOrganometallics(smiles, getOrganometallics=False, getOrganometallicsIdx=False)
to_df(c1)

Succeeded to verify 500/500 structures
500/500 structures are NOT organometallics


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


### 1.2. Obtaining lists of unvalidated SMILES(s)

#### 1.2.1. List of Mixture SMILES(s)

In [7]:
a2 = rmMixtures(smiles, getMixtures=True, getMixturesIdx=True)
to_df(a2, col=2)

Succeeded to verify 500/500 structures
499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed


Unnamed: 0,SMILES,index
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,277


#### 1.2.2. List of Inorganic SMILES(s)

In [8]:
b2 = rmInorganics(smiles, getInorganics=True, getInorganicsIdx=True, printlogs=True)
to_df(b2, col=2) # No inorganic SMILES

Succeeded to verify 500/500 structures
500/500 structures are organics


Unnamed: 0,SMILES,index


#### 1.2.3. List of Organometallic SMILES(s)

In [9]:
c2 = rmOrganometallics(smiles, getOrganometallics=True, getOrganometallicsIdx=True)
to_df(c2, col=2) # No organometallic SMILES

Succeeded to verify 500/500 structures
500/500 structures are NOT organometallics


Unnamed: 0,SMILES,index


### 1.3. Perform complete validation substage

Submodule **validation** contains four functions: *rmMixtures()*, *rmInorganics()*, *rmOrganometallics()*, and *validateComplete()*. ***validateComplete()*** performs complete validation substage, including *rmMixtures()*,  *rmInorganics()*, and *rmOrganometallics()*

```python
validateComplete(compounds, getInvalidStruct=False, removeDuplicates=False, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```


Function arguments:
- ``compounds``: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- ``getInvalidStruct``: Boolean. Default `False`. Set `True` to get unvalidated SMILES (if any)
- ``removeDuplicates``: Boolean. Default `False`. Set `True` to remove duplicated SMILES (if any)
- ``getDuplicatedIdx``: Boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- ``exportCSV``: Boolean. Default `False`. Set `True` to get csv file
- ``outputPath``: Directory. Must be filled when `exportCSV=True` to get csv file
- ``printlogs``: Boolean. Default `True`. Print logs and summary during the process

#### 1.3.1. Obtaining a list of all validated compounds with duplicated data removed

In [10]:
deduplicated_validated_compounds = validateComplete(smiles, removeDuplicates=True)
deduplicated_validated_compounds

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
set 'getInvalidStruct=True' to get the list of all unvalidated structures 

There are 381 unique structures filtered from 499 initial validated structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun validateComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun validateComplete() with setting 'getDuplicates=Tr

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
376,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...
377,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
378,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
379,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 1.3.2. Obtaining a list of validated compounds without removing duplicated data

In [11]:
duplicated_validated_compounds = validateComplete(smiles, removeDuplicates=False)
duplicated_validated_compounds

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
set 'getInvalidStruct=True' to get the list of all unvalidated structures 



Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
494,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
495,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
496,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
497,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 1.3.3. Obtaining a list of duplicated compounds with details

In [12]:
duplicated_validated_compoundsIdx = validateComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)
duplicated_validated_compoundsIdx

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
set 'getInvalidStruct=True' to get the list of all unvalidated structures 

There are 381 unique structures filtered from 499 initial validated structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun validateComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun validateComplete() with setting 'getDuplicates=Tr

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 278]",validateComplete()
1,3,"[3, 467]",validateComplete()
2,7,"[7, 215]",validateComplete()
3,9,"[0, 9, 69, 278]",validateComplete()
4,11,"[11, 418]",validateComplete()
...,...,...,...
196,487,"[464, 487]",validateComplete()
197,492,"[200, 492]",validateComplete()
198,493,"[445, 493]",validateComplete()
199,495,"[213, 495]",validateComplete()


**Notice**:
Output is a dataframe containing detailed information:
- `idx`: index of SMILES(s) found to be duplicated
- `matchedIdx`: location of duplicated SMILES(s)
- `fromFunction*`: which function (substage) identify it/them

*E.g.,: The SMILES at row 0 has been duplicated at rows 9, 69, and 278.*

#### 1.3.4.  Obtaining a list of invalid structures with details

In [13]:
unvalidated_compounds = validateComplete(smiles, getInvalidStruct=True)
unvalidated_compounds

Succeeded to validate 500/500 structures 

499/500 structures are non-mixtures
1/500 structure(s) is/are mixture(s) and was/were removed 

500/500 structures are organics 

500/500 structures are NOT organometallics 

SUMMARY:
499/500 structures were successfully validated
1/500 structure(s) were/was unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------


Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,Mixture,rmMixtures(),277


**Notice**:
Output is a dataframe containing detailed information:
- `SMILES`: invalid SMILES(s)	
- `errorTag`: why it/they are rejected
- `fromFunction`: which function rejects it/them	
- `idx`: where it/they is/are

*E.g.,: The SMILES 'CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...' in row 277 was identified as a mixture structure, processed by function rmMixtures().*

#### 1.3.5. Exporting results as csv files

To export .csv file, user can use `validateComplete`(..., `exportCSV=True`, `outputPath="/../..`)

In [None]:
#ValOutput= '../validation/'
# validateComplete(smiles, removeDuplicates=False, exportCSV=True, outputPath=ValOutput)
# validateComplete(smiles, removeDuplicates=True,  exportCSV=True, outputPath=ValOutput)
# validateComplete(smiles, getInvalidStruct=True,  exportCSV=True, outputPath=ValOutput)
# validateComplete(smiles, removeDuplicates=True,  getDuplicatedIdx=True, exportCSV=True, outputPath=ValOutput)

---

### 2. Cleaning

Module `cleaning` contains three functions: `clSalts()`, `clCharges()`, and `cleanComplete()`.

- `clSalts()`: remove/desalt SMILES of salts
- `clCharges()`: neutralize SMILES of charged compounds
- `cleanComplete()`: perform complete cleaning substage which includes these two steps above

Functions:
```python
clSalts(compounds, getSalts=False, getSaltsIdx=False, deSalt=False, printlogs=True)
clCharges**(compounds, getCharges=False, getChargesIdx=False, deCharges=False, printlogs=True):
```

Function arguments:
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getSalts`: Boolean. Default `False`. Set `True` to get salt SMILES (if any)
- `getSaltsIdx`: Boolean. Default `False`. Set `True` to get index of salt SMILES (if any). Can be used when `getSalts=True` only.
- `deSalt`: Boolean. Default `False`. Set `True` to desalt SMILES (if any)
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process
---
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getCharges`: Boolean. Default `False`. Set `True` to get inorganic SMILES (if any)
- `getChargesIdx`: Boolean. Default `False`. Set `True` to get index of inorganic SMILES (if any). Can be used when `getCharges=True` only.
- `deCharges`: Boolean. Default `False`. Set `True` to neutralize charged SMILES (if any)
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process

### 2.1. Obtaining lists of SMILES(s) with uncleaned structures removed

#### 2.1.1. SMILES list with 'Salt SMILES(s)' removed

In [14]:
d1 = clSalts(smiles, deSalt=False) # Set deSalt=True to desalt the SMILES
to_df(d1)

Succeeded to verify 500/500 structures
497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
492,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
493,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
494,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
495,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 2.1.2. SMILES list with 'Charged SMILES(s)' neutralized

In [15]:
e1 = clCharges(smiles, deCharges=True) # Set deCharges=True to neutralize the SMILES
to_df(e1) 

Succeeded to verify 500/500 structures
496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


### 2.2. Obtaining lists of uncleaned SMILES(s)

#### 2.2.1. List of Salt SMILES(s)

In [16]:
d2 = clSalts(smiles, getSalts=True, getSaltsIdx=True) # Set deSalt=True to desalt the SMILES
to_df(d2, col=2)

Succeeded to verify 500/500 structures
497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed


Unnamed: 0,SMILES,index
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,277
1,CCCC(=O)[O-].[Na+],344
2,CCCC(=O)[O-].[Na+],379


#### 2.2.1. List of Charged SMILES(s)

In [17]:
e2 = clCharges(smiles, getCharges=True, getChargesIdx=True) # Set deSalt=True to desalt the SMILES
to_df(e2, col=2)

Succeeded to verify 500/500 structures
496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized


Unnamed: 0,SMILES,index
0,COc1ccc(C2CC(=O)c3c([O-])cc(O)cc3O2)cc1,302
1,O=C(O)C1=CC(=CC=[N+]2c3cc(O)c(OC4OC(CO)C(O)C(O...,329
2,CCCC(=O)[O-].[Na+],344
3,CCCC(=O)[O-].[Na+],379


### 2.3. Perform complete cleaning substage

Module `cleaning` contains three functions: `clSalts()`, `clCharges()`, and `cleanComplete()`. `cleanComplete()` performs complete cleaning substage, including `clSalts()` and `clCharges()`.

**Notice**: Cleaning substage requires all SMILESs to be verified. In case of unverified SMILES(s) available, cleanComplete() will remove unvalidated SMILES(s). cleanComplete() does not support removing mixtures, inorganics, or organometallics. 

```python
cleanComplete(compounds, getUncleanedStruct=False, deSalt=False, neutralize=False, removeDuplicates=False, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```


Function arguments:
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getUncleanedStruct`: boolean. Default `False`. Set `True` to get uncleaned SMILES (if any)
- `deSalt`: boolean. Default `False`. Set `True` to get the SMILES desalted (if any). Default `False` removes SMILES(s) of salt(s).
- `neutralize`: boolean. Default `False`. Set `True` to get the SMILES neutralized (if any)
- `removeDuplicates`: boolean. Default `False`. Set `True` to remove duplicated SMILES (if any)
- `getDuplicatedIdx`: boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- `exportCSV`: boolean. Default `False`. Set `True` to get csv file
- `outputPath`: Directory. Must be filled when `exportCSV=True` to get csv file
- `printlogs`: Print logs, summary during the process

#### 2.3.1. Obtaining a list of all cleaned compounds with duplicated data removed

In [18]:
deduplicated_cleaned_compounds = cleanComplete(smiles, removeDuplicates=True)
deduplicated_cleaned_compounds

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------
set 'getUncleanedStruct=True' to get the list of all uncleaned structures. Neutralized charged structures will be included (if any) 

There are 380 unique structures filtered from 497 initial cleaned structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun cleanComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list 

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
375,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...
376,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
377,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
378,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 2.3.2. Obtaining a list of cleaned compounds without removing duplicated data

In [19]:
duplicated_cleaned_compounds = cleanComplete(smiles, removeDuplicates=True)
duplicated_cleaned_compounds

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------
set 'getUncleanedStruct=True' to get the list of all uncleaned structures. Neutralized charged structures will be included (if any) 

There are 380 unique structures filtered from 497 initial cleaned structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun cleanComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list 

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
375,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...
376,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
377,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
378,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 2.3.3. Obtaining a list of duplicated compounds with details of indices

In [20]:
duplicated_cleaned_compoundsIdx = cleanComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)
duplicated_cleaned_compoundsIdx

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------
set 'getUncleanedStruct=True' to get the list of all uncleaned structures. Neutralized charged structures will be included (if any) 

There are 380 unique structures filtered from 497 initial cleaned structures
-----------------------------------------------------------------------------
To get detailed information, please follow steps below:
(1) Rerun cleanComplete() with setting 'removeDuplicates=False' to get the list of all validated structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list 

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 278]",cleanComplete()
1,3,"[3, 465]",cleanComplete()
2,7,"[7, 215]",cleanComplete()
3,9,"[0, 9, 69, 278]",cleanComplete()
4,11,"[11, 416]",cleanComplete()
...,...,...,...
194,485,"[462, 485]",cleanComplete()
195,490,"[200, 490]",cleanComplete()
196,491,"[443, 491]",cleanComplete()
197,493,"[213, 493]",cleanComplete()


#### 2.3.4.  Obtaining a list of invalid structures with details

In [21]:
uncleaned_compounds = cleanComplete(smiles, getUncleanedStruct=True)
uncleaned_compounds

Succeeded to validate 500/500 structures 

497/500 structures are NOT salts
3/500 structure(s) is/are salt(s) and was/were removed 

496/500 structures are NOT charges
4/500 structure(s) is/are charge(s) BUT was/were NOT neutralized 

SUMMARY:
500/500 structures were successfully verfied
497/500 structures were successfully cleaned
3/500 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-------------------------------------------------------


Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,Salt,clSalts(),277
1,CCCC(=O)[O-].[Na+],Salt,clSalts(),344
2,CCCC(=O)[O-].[Na+],Salt,clSalts(),379
3,COc1ccc(C2CC(=O)c3c([O-])cc(O)cc3O2)cc1,Charge,clCharges(),302
4,O=C(O)C1=CC(=CC=[N+]2c3cc(O)c(OC4OC(CO)C(O)C(O...,Charge,clCharges(),329
5,CCCC(=O)[O-].[Na+],Charge,clCharges(),344
6,CCCC(=O)[O-].[Na+],Charge,clCharges(),379


#### 2.3.5. Exporting results as csv files

In [None]:
#CleanOutput= '../cleaning/'
# cleanComplete(smiles, removeDuplicates=False,  exportCSV=True, outputPath=CleanOutput)
# cleanComplete(smiles, removeDuplicates=True,   exportCSV=True, outputPath=CleanOutput)
# cleanComplete(smiles, getUncleanedStruct=True, exportCSV=True, outputPath=CleanOutput)
# cleanComplete(smiles, removeDuplicates=True,   getDuplicatedIdx=True, exportCSV=True, outputPath=CleanOutput)

---

### 3. Normalization

Module `normalization` contains three functions: `normTautomers()`, `normStereoisomers()`, and `normalizeComplete()`.
- `normTautomers()`: remove/detautomerize SMILES of tautomer
- `normStereoisomers()`: destereoisomerize SMILES of stereoisomer
- `normalizeComplete()`: perform complete normalization substage which includes these two steps above

Functions:
```python
normTautomers(compounds, getTautomers=False, getTautomersIdx=False, deTautomerize=False, printlogs=True)
normStereoisomers**(compounds, getStereoisomers=False, getStereoisomersIdx=False, deSterioisomerize=False, printlogs=True)
```

Function arguments:
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getTautomers`: Boolean. Default `False`. Set `True` to get tautomer SMILES (if any)
- `getTautomersIdx`: Boolean. Default `False`. Set `True` to get index of tautomer SMILES (if any). Can be used when 
`getTautomersIdx=True` only.
- `deTautomerize`: Boolean. Default `False`. Set `True` to detautomerize SMILES (if any)
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process
---
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getStereoisomers`: Boolean. Default `False`. Set `True` to get stereoisomer SMILES (if any)
- `getStereoisomersIdx`: Boolean. Default `False`. Set `True` to get index of stereoisomer SMILES (if any). Can be used when `getStereoisomers=True` only.
- `deSterioisomerize`: Boolean. Default `False`. Set `True` to desterioisomerize SMILES (if any)
- `printlogs`: Boolean. Default `True`. Print logs and summary during the process

### 3.1. Obtaining lists of SMILES(s) with structures normalized

#### 3.1.1. SMILES list with 'detautomerized SMILES(s)'

In [22]:
f1 = normTautomers(smiles, deTautomerize=True) # Set deTautomerize=False to remnove the tautomer SMILES(s)
to_df(f1)

Succeeded to verify 500/500 structures
437/500 structures are NOT tautomers
63/500 structure(s) í/are tautomer(s) BUT was/were detautomerized
!!!!!Notice: Detautomerizing has been applied!!!!!


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 3.1.2. SMILES list with 'Stereoisomerized SMILES(s)'

In [23]:
g1 = normStereoisomers(smiles, deSterioisomerize=True) # Set deCharges=True to neutralize the SMILES
to_df(g1) 

Succeeded to verify 500/500 structures
498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized
!!!!!Notice: Destereoisomerization has been applied!!!!!


Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


### 3.2. Obtaining lists of unnormalized SMILES(s)

#### 3.2.1. List of Tautomer SMILES(s)

In [24]:
f2 = normTautomers(smiles, getTautomers=True, getTautomersIdx=True) 
to_df(f2, col=2)

Succeeded to verify 500/500 structures
437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were NOT detautomerized


Unnamed: 0,SMILES,index
0,CC1CN2CC3CCC(=O)C4=C5C(=C(O)C1CC2C53C)C1=C4CCO...,5
1,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,6
2,COC(=O)C1=C(CCO)C2=C3C1=C(O)C1(O)CC4N(CC1C)CC(...,27
3,CC(C)=CCCC1(C)C(CC=C(C)C)CC2(CC=C(C)C)C(=O)C1(...,30
4,CC(C)C1=CC2=CC(=O)C3C(C)(C)CCCC3(C)C2=C(O)C1=O,45
...,...,...
58,C=C(CCC(C)C1CCC2C3=C(C(=O)C(O)C21C)C1(C)CCC(O)...,480
59,COC(=O)C1C2=C(C(=O)C(=O)C(C(C)C)=C2O)C2(C)CCCC...,483
60,CC(=O)CC1C(=O)C(CO)=CCC(C)(CCC(O)C(C)(C)O)C1C=...,486
61,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...,491


#### 3.2.2. List of Stereoisomers SMILES(s)

In [25]:
g2 = normStereoisomers(smiles, getStereoisomers=True, getStereoisomersIdx=True) 
to_df(g2, col=2)

Succeeded to verify 500/500 structures
498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were NOT destereoisomerized


Unnamed: 0,SMILES,index
0,F[C@@]12C[C@]1(Cl)C[C@H](/C=C/Br)O2,89
1,F[C@]12C[C@]1(Cl)C[C@@H](/C=C\Br)O2,98


### 3.3. Perform complete normalization substage

Module `normalization` contains three functions: `normTautomers()`, `normStereoisomers()`, and `normalizeComplete()`.
- `normTautomers()`: remove/detautomerize SMILES of tautomer
- `normStereoisomers()`: destereoisomerize SMILES of stereoisomer
- `normalizeComplete()`: perform complete normalization substage which includes these two steps above

```python
normalizeComplete(compounds, getUnnormalizedStruct=False, deTautomerize=True, deSterioisomerize=True, removeDuplicates=False, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True)
```

Function arguments:
- `compounds`: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- `getUnnormalizedStruct`:boolean. Default `False`. Set `True` to get unnormalized SMILES (if any)
- `deTautomerize`: boolean. Default `True`. Set `False` to remove tautomer SMILES (if any)
- `deSterioisomerize`: boolean. Default `True`. Set `False` to not to desterioisomerize SMILES (if any)
- `removeDuplicates`: boolean. Default `False`. Set `True` to remove duplicated SMILES (if any)
- `getDuplicatedIdx`: boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- `exportCSV`: boolean. Default `False`. Set `True` to get csv file
- `outputPath`: Directory. Must be filled when `exportCSV=True` to get csv file
- `printlogs`: Print logs, summary during the process

#### 3.3.1. Obtaining a list of all normalized compounds with duplicated data removed

In [26]:
deduplicated_normalized_compounds = normalizeComplete(smiles, removeDuplicates=True)
deduplicated_normalized_compounds

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized
set 'getUnnormalizedStruct=True' to get the list of all unnormalized structures. 

There are 381 unique structures filtered from 500 initial normalized structures
To get detailed information, please follow steps below:
(1) Rerun normalizeComplete() with setting 'removeDuplicates=False' to get the list of all normalized structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun normalizeComplete() with setting 'getDuplicates=True', 'exportCSV'=True, and 'outputPath=<Directory>' to export a csv file  containing the list of duplicated structures 

--OR--
Ru

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
376,CC(O)c1c2c(cc3c(=O)c4c(O)cc5c(c4oc13)C=CC(C)(C...
377,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
378,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
379,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 3.3.2. Obtaining a list of all normalized compounds without removing duplicated data

In [27]:
duplicated_normalized_compounds = normalizeComplete(smiles, removeDuplicates=False)
duplicated_normalized_compounds

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized
set 'getUnnormalizedStruct=True' to get the list of all unnormalized structures. 



Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
495,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
496,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
497,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
498,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 3.3.3. Obtaining a list of duplicated compounds in a normalized compound list with details

In [28]:
duplicated_normalized_compoundsIdx = normalizeComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)
duplicated_normalized_compoundsIdx

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized
set 'getUnnormalizedStruct=True' to get the list of all unnormalized structures. 

There are 381 unique structures filtered from 500 initial normalized structures
To get detailed information, please follow steps below:
(1) Rerun normalizeComplete() with setting 'removeDuplicates=False' to get the list of all normalized structures
(2) Run ultils.molDeduplicate() with setting 'getDuplicates=True'to get the list of duplicated structures 

--OR--
Rerun normalizeComplete() with setting 'getDuplicates=True', 'exportCSV'=True, and 'outputPath=<Directory>' to export a csv file  containing the list of duplicated structures 

--OR--
Ru

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 279]",normalizeComplete()
1,3,"[3, 468]",normalizeComplete()
2,7,"[7, 215]",normalizeComplete()
3,9,"[0, 9, 69, 279]",normalizeComplete()
4,11,"[11, 419]",normalizeComplete()
...,...,...,...
198,488,"[465, 488]",normalizeComplete()
199,493,"[200, 493]",normalizeComplete()
200,494,"[446, 494]",normalizeComplete()
201,496,"[213, 496]",normalizeComplete()


#### 3.3.4.  Obtaining a list of unnormalized structures with details

In [29]:
unnormalized_compounds = normalizeComplete(smiles, getUnnormalizedStruct=True)
unnormalized_compounds

Succeeded to validate 500/500 structures 

498/500 structures are NOT stereoisomers
2/500 structure(s) is/are stereoisomer(s) BUT was/were destereoisomerized 

437/500 structures are NOT tautomers
63/500 structure(s) is/are tautomer(s) BUT was/were detautomerized 

SUMMARY:
500/500 structure were successfully verfied
500/500 structures were successfully normalized


Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CC1CN2CC3CCC(=O)C4=C5C(=C(O)C1CC2C53C)C1=C4CCO...,Tautomer,normTautomers(),5
1,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,Tautomer,normTautomers(),6
2,COC(=O)C1=C(CCO)C2=C3C1=C(O)C1(O)CC4N(CC1C)CC(...,Tautomer,normTautomers(),27
3,CC(C)=CCCC1(C)C(CC=C(C)C)CC2(CC=C(C)C)C(=O)C1(...,Tautomer,normTautomers(),30
4,CC(C)C1=CC2=CC(=O)C3C(C)(C)CCCC3(C)C2=C(O)C1=O,Tautomer,normTautomers(),45
...,...,...,...,...
60,CC(=O)CC1C(=O)C(CO)=CCC(C)(CCC(O)C(C)(C)O)C1C=...,Tautomer,normTautomers(),486
61,CC(=O)C1c2oc3c4c(cc(O)c3c(=O)c2CC2C1C(=O)OC2(C...,Tautomer,normTautomers(),491
62,COC(c1ccccc1)C(O)C1CC=CC(=O)O1,Tautomer,normTautomers(),499
63,F[C@@]12C[C@]1(Cl)C[C@H](/C=C/Br)O2,Stereoisomer,normStereoisomers(),89


#### 3.3.5. Exporting results as csv files

In [None]:
#NormOutput= '../normalization/'
# normalizeComplete(smiles, removeDuplicates=False, exportCSV=True, outputPath=NormOutput)
# normalizeComplete(smiles, removeDuplicates=True,  exportCSV=True, outputPath=NormOutput)
# normalizeComplete(smiles, getUnnormalizedStruct=True,  exportCSV=True, outputPath=NormOutput)
# normalizeComplete(smiles, removeDuplicates=True,  getDuplicatedIdx=True, exportCSV=True, outputPath=NormOutput)

---

### 4. Refinement

Module **refinment** contains only one function: refineComplete().
- *refineComplete()*: perform complete data curation process which includes these three substages above

```python
refineComplete(compounds, getUnrefinedStruct=False, deSalt=False, neutralize=False, deTautomerize=True, deSterioisomerize=True, removeDuplicates=True, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True):

```

Function arguments:
- ``compounds``: Input compounds. Can be in the form of `str`, `list`, `pandas.core.series.Series`, `pandas.core.frame.DataFrame` 
- ``getUnrefinedStruct``:boolean. Default `False`. Set `True` to get unnormalized SMILES (if any)
-``deSalt``:
-``neutralize``:
- ``deTautomerize``: boolean. Default `True`. Set `False` to remove tautomer SMILES (if any)
- ``deSterioisomerize``: boolean. Default `True`. Set `False` to not to desterioisomerize SMILES (if any)
- ``removeDuplicates``: boolean. Default `True`. Set `False` to keep duplicated SMILES (if any)
- ``getDuplicatedIdx``: boolean. Default `False`. Set `True` to get index of duplicated SMILES (if any). Can be used when `removeDuplicates=True` only.
- ``exportCSV``: boolean. Default `False`. Set `True` to get csv file
- ``outputPath``: Directory. Must be filled when `exportCSV=True` to get csv file
- ``printlogs``: Print logs, summary during the process

#### 4.1. Obtaining a list of all refined compounds with duplicated data removed

In [30]:
deduplicated_refined_compounds = refineComplete(smiles, removeDuplicates=True)
deduplicated_refined_compounds

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
There are 381 unique structures filtered from 499 initial validated structures
CLEANING
-----------------------------------------------------------------------------
380/381 structures were successfully cleaned
1/381 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 380 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
379/380 structures were successfully normalized
1/380 structure(s) were unsuccessfully normalized and need to be rechecked
-----------------------------------------------------------------------------

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
374,CC(O)c1c2c(cc3c(=O)c4c(O)cc5c(c4oc13)C=CC(C)(C...
375,CC(C)=CCCC(C)=CCc1c(O)ccc2cc(-c3cc(O)cc(O)c3)oc12
376,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
377,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 4.2 Obtaining a list of all refined compounds without removing duplicated data

In [31]:
duplicated_refined_compounds = refineComplete(smiles, removeDuplicates=False)
duplicated_refined_compounds

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
201/499 validated structures have at least one duplicates
CLEANING
-----------------------------------------------------------------------------
497/499 structures were successfully cleaned
2/499 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 497 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
497/497 structures were successfully normalized
-----------------------------------------------------------------------------
2/497 normalized structures have at least one duplicates
REFINEMENT SUMMARY
-------------------

Unnamed: 0,SMILES
0,Cc1cc2c(cc1O)C1(C)CCC(O)C(C)(C)C1=CC2=O
1,CC1(C)CCC2(C(=O)O)CCC3(C)C(=CCC4C5(C)CCC(OC6OC...
2,COC1CC(=O)CCC1(O)CC=Cc1ccccc1
3,C=CC(C)=CCC1C(=C)CCC2C(C)(C)C(O)C(OC(C)=O)CC12C
4,COc1ccccc1C=CCC1(O)C=CC(=O)CC1
...,...
492,COc1ccc(C=CCc2ccc(O)c(OC)c2OC)cc1O
493,C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1
494,COc1cc(C2Oc3cc(C4Oc5cc(O)cc(O)c5C(=O)C4O)ccc3O...
495,COc1cc2c(cc1OC)C1C(C)c3ccc(OC)c(OC)c3CN1CC2


#### 4.3. Obtaining a list of duplicated data in a refined compound list with details

In [32]:
duplicated_refined_compoundsIdx = refineComplete(smiles, removeDuplicates=True, getDuplicatedIdx=True)
duplicated_refined_compoundsIdx

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
There are 381 unique structures filtered from 499 initial validated structures
CLEANING
-----------------------------------------------------------------------------
380/381 structures were successfully cleaned
1/381 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 380 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
379/380 structures were successfully normalized
1/380 structure(s) were unsuccessfully normalized and need to be rechecked
-----------------------------------------------------------------------------

Unnamed: 0,idx,matchedIdx,fromFunction
0,0,"[0, 9, 69, 278]",validateComplete()
1,3,"[3, 467]",validateComplete()
2,7,"[7, 215]",validateComplete()
3,9,"[0, 9, 69, 278]",validateComplete()
4,11,"[11, 418]",validateComplete()
...,...,...,...
199,495,"[213, 495]",validateComplete()
200,496,"[472, 496]",validateComplete()
201,na,na,cleanComplete()
202,83,"[83, 90]",normalizeComplete()


#### 4.4. Obtaining a list of unrefined structures with details

In [33]:
unrefined_compounds = refineComplete(smiles, removeDuplicates=False, getUnrefinedStruct=True)
unrefined_compounds

VALIDATION
-----------------------------------------------------------------------------
499/500 structures were successfully validated
1/500 structure(s) was/were unsuccessfully validated and need to be rechecked
-----------------------------------------------------------------------------
201/499 validated structures have at least one duplicates
CLEANING
-----------------------------------------------------------------------------
497/499 structures were successfully cleaned
2/499 structure(s) was/were unsuccessfully cleaned and need to be rechecked
-----------------------------------------------------------------------------
No duplicate was found (in 497 cleaned structures)
NORMALIZATION
-----------------------------------------------------------------------------
497/497 structures were successfully normalized
-----------------------------------------------------------------------------
2/497 normalized structures have at least one duplicates
REFINEMENT SUMMARY
-------------------

Unnamed: 0,SMILES,errorTag,fromFunction,idx
0,CCC1(O)CC2CN(CCc3c([nH]c4ccccc34)C(C(=O)OC)(c3...,Mixture,rmMixtures(),277
1,CCCC(=O)[O-].[Na+],Salt,clSalts(),286
2,COc1ccc(C2CC(=O)c3c([O-])cc(O)cc3O2)cc1,Charge,clCharges(),257
3,O=C(O)C1=CC(=CC=[N+]2c3cc(O)c(OC4OC(CO)C(O)C(O...,Charge,clCharges(),274
4,CCCC(=O)[O-].[Na+],Charge,clCharges(),286
5,CC1CN2CC3CCC(=O)C4=C5C(=C(O)C1CC2C53C)C1=C4CCO...,Tautomer,normTautomers(),5
6,O=C1NCc2c1c1c3ccccc3[nH]c1c1[nH]c3ccccc3c21,Tautomer,normTautomers(),6
7,COC(=O)C1=C(CCO)C2=C3C1=C(O)C1(O)CC4N(CC1C)CC(...,Tautomer,normTautomers(),26
8,CC(C)=CCCC1(C)C(CC=C(C)C)CC2(CC=C(C)C)C(=O)C1(...,Tautomer,normTautomers(),29
9,CC(C)C1=CC2=CC(=O)C3C(C)(C)CCCC3(C)C2=C(O)C1=O,Tautomer,normTautomers(),44


#### 4.5. Exporting results as csv files

In [1]:
# refineOutput= '../refinement/'
# refineComplete(smiles, removeDuplicates=True,  exportCSV=True, outputPath=refineOutput)
# refineComplete(smiles, removeDuplicates=False, exportCSV=True, outputPath=refineOutput)
# refineComplete(smiles, removeDuplicates=True,  getDuplicatedIdx=True,   exportCSV=True, outputPath=refineOutput)
# refineComplete(smiles, removeDuplicates=False, getUnrefinedStruct=True, exportCSV=True, outputPath=refineOutput)

---

### PART 2: STANDARD DATA CURATION

This session introduces a standard data curation procedure proposed by [Fourches et al. (2010)](https://pubs.acs.org/doi/10.1021/ci100176x) whose major substages are represented by functions in substage modules. The data curation process is characterized by the flowchart below.

![Pipeline](https://raw.githubusercontent.com/mldlproject/chemtonic/0fdf3801a315d23cee264535b5bc13bea1b83572/images/pipelinesmall.svg)

This session is designed for you IF
- You've known about this procedure and want to perform data curation process only OR
- You're novice users who are looking for a standard data curation procedure and want to perform to data curation process OR
- You're computational scientists who are looking for a built module to integrate into your to-be-deployed model as a checking gate for input SMILES(s).

#### Completing data curation with one-run task

In [None]:
refineComplete(smiles)

The function and its default parameters are designed based on [Fourches et al. (2010)](https://pubs.acs.org/doi/10.1021/ci100176x)'s procedure.
```python
refineComplete(compounds, getUnrefinedStruct=False, deSalt=False, neutralize=False, deTautomerize=True, deSterioisomerize=True, removeDuplicates=True, getDuplicatedIdx=False, exportCSV=False, outputPath=None, printlogs=True):

```
To run the task with other conditions, please make sure you understand the role of each substage's steps.

`refineComplete(compounds)` is equal to `normalizeComplete(cleaningComplete(validateComplete(compounds, removeDuplicates=True), removeDuplicates=True), removeDuplicates=True)`.

You can perform each substage to obtain detailed logs as well as extract correct `idx`.

In [None]:
validated_compounds  = validateComplete(smiles, removeDuplicates=True)
cleaned_compounds    = cleanComplete(validated_compounds, removeDuplicates=True)
normalized_compounds = normalizeComplete(cleaned_compounds, removeDuplicates=True)
refined_compounds    = normalized_compounds

`normalized_compounds` is equal to `refined_compounds` when the input SMILES(s) is/are cleaned.

-END-