# !!! This step is the most important one in QSAR modeling. Please make sure you have double-checked the structure information for every compound you are using.

# 1. Compound information -- get as much as you can

#### There are many different ways to represent a compound. Taking aspirin as an example:
1. Systematic name: 2-acetoxybenzoic acid
2. Synonyms: acetylsalicylic acid
3. Trade names: Aspirin
4. CAS number (one compound can have multiple CAS numbers): 50-78-2
5. InChI key: BSYNRYMUTXBXSQ-UHFFFAOYSA-N
6. PubChem CID: 2244
7. SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
8. sdf: (not shown here)
9. xyz: (not shown here)
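
Identifiers are often copied by hand, so a quick sanity check before any lookup can catch typos. A CAS number ends in a check digit: multiply the other digits by 1, 2, 3, ... counting from the right, sum the products, and take the result modulo 10. The sketch below illustrates that check; the function name `is_valid_cas` is hypothetical and not part of the original workflow.

```python
import re

def is_valid_cas(cas):
    """Return True if `cas` has the CAS format and a correct check digit."""
    match = re.fullmatch(r'(\d{2,7})-(\d{2})-(\d)', cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check

print(is_valid_cas('50-78-2'))  # aspirin: (8*1 + 7*2 + 0*3 + 5*4) % 10 == 2
print(is_valid_cas('50-78-3'))  # same number with a wrong check digit
```

A passing check only means the number is well formed; it does not confirm that the number belongs to the intended compound.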

#### When you develop a QSAR model, you want the information for all compounds in one clean, consistent format. A practical approach is to pick a single representation and generate it for every compound, for example by collecting SMILES for all compounds from whatever source information is available:

#### A. From CAS number to identifier (InChIKey or SMILES) through CACTUS ( https://cactus.nci.nih.gov/chemical/structure )

In [1]:
from urllib import request, error, parse
from time import sleep
import random

def get_identifier_from_cas(cas, identifier):
    """Resolve a CAS number to `identifier` (e.g. 'smiles' or 'stdinchikey') via CACTUS.

    Returns a string for a single hit, a list of strings for several hits,
    and None if the request fails or more than 6 lines come back.
    """
    url = 'https://cactus.nci.nih.gov/chemical/structure/' + parse.quote(cas) + '/' + str(identifier)
    sleep(random.uniform(0, 0.5))  # short random pause between requests
    try:
        with request.urlopen(url) as response:
            result = [line.strip().decode("utf-8") for line in response]
    except (error.URLError, TimeoutError):  # HTTPError is a subclass of URLError
        return None
    if len(result) > 6:  # treat an unexpectedly long response as a failed lookup
        return None
    if len(result) == 1:
        return result[0]
    return result


cactus_smiles = get_identifier_from_cas('50-78-2', 'smiles')
cactus_stdinchikey = get_identifier_from_cas('50-78-2', 'stdinchikey')
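
For a whole dataset, it helps to keep track of which CAS numbers could not be resolved so they can be checked by hand. The sketch below is one possible wrapper, not part of the original workflow: `resolve_all` and `fake_table` are hypothetical names, and the resolver is passed in as an argument so the example runs without network access.

```python
def resolve_all(cas_numbers, resolver):
    """Apply `resolver` to each CAS number; split results into hits and misses."""
    resolved, unresolved = {}, []
    for cas in cas_numbers:
        result = resolver(cas)
        # Keep only unambiguous answers: a single non-empty string
        if isinstance(result, str) and result:
            resolved[cas] = result
        else:
            unresolved.append(cas)
    return resolved, unresolved

# Stand-in resolver for demonstration; in practice pass something like
# lambda cas: get_identifier_from_cas(cas, 'smiles')
fake_table = {'50-78-2': 'CC(=O)OC1=CC=CC=C1C(=O)O'}
resolved, unresolved = resolve_all(['50-78-2', '00-00-0'], fake_table.get)
print(resolved)
print(unresolved)
```

Anything that ends up in `unresolved` (failed requests, or several candidate structures for one CAS number) is a candidate for manual review or for a second source such as PubChem.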

#### B. From name/InChIKey/PubChem CID to SMILES/InChIKey through PubChemPy ( https://pubchempy.readthedocs.io/en/latest/ )

In [3]:
import pubchempy as pcp
c = pcp.get_compounds('Aspirin', 'name')
c[0].inchikey
c[0].isomeric_smiles  # a notebook cell displays only its last expression

'CC(=O)OC1=CC=CC=C1C(=O)O'

In [14]:
c = pcp.get_compounds("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", 'inchikey')
c[0].isomeric_smiles

'CC(=O)OC1=CC=CC=C1C(=O)O'

In [15]:
c = pcp.Compound.from_cid(2244)
c.isomeric_smiles
c.inchikey

'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'

# 2. Clean structure information

#### We use a chemical structure standardizer tool such as CASE Ultra, together with Python scripts, to remove:
1. duplicates;
2. inorganics;
3. mixtures.
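
In SMILES, a dot separates disconnected fragments, so salts and mixtures can be flagged with a simple text check before any toolkit is involved. The sketch below is a first-pass heuristic only; the function names are hypothetical, and a check on parsed molecules is more robust. It does not detect inorganics.

```python
def count_components(smiles):
    """Count the dot-separated fragments in a SMILES string."""
    return len([part for part in smiles.split('.') if part])

def is_multi_component(smiles):
    """Return True if the SMILES string describes more than one fragment."""
    return count_components(smiles) > 1

print(is_multi_component('CC(=O)OC1=CC=CC=C1C(=O)O'))           # aspirin
print(is_multi_component('CC(=O)OC1=CC=CC=C1C(=O)[O-].[Na+]'))  # its sodium salt
```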

#### Using Python scripts to identify duplicates:

In [18]:
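from rdkit import Chem
# Two SMILES strings for the same molecule, written in a different atom order
m = Chem.MolFromSmiles('CC(=O)OC1=CC=CC=C1C(=O)O')
n = Chem.MolFromSmiles('OC(=O)C1=CC=CC=C1OC(=O)C')
# MolFromSmiles returns None when a SMILES string cannot be parsed
if (m is not None) and (n is not None):
    # If each molecule contains the other as a substructure, they match.
    # Note: HasSubstructMatch ignores stereochemistry by default.
    if m.HasSubstructMatch(n) and n.HasSubstructMatch(m):
        print("m and n are duplicates")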

m and n are duplicates
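
Comparing every pair of molecules becomes slow for a large dataset. An alternative is to group compounds by a canonical identifier such as the InChIKey collected in section 1. The sketch below assumes each record already carries an InChIKey; the record layout and the name `find_duplicates` are hypothetical. The first 14-character block of an InChIKey is a hash of the connectivity, so grouping on that block alone also puts stereoisomers in the same group.

```python
from collections import defaultdict

def find_duplicates(records, connectivity_only=False):
    """Group record ids by InChIKey and return the groups with more than one id."""
    groups = defaultdict(list)
    for record_id, inchikey in records:
        # Optionally compare only the first (connectivity) block of the key
        key = inchikey.split('-')[0] if connectivity_only else inchikey
        groups[key].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

records = [
    (1, 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'),  # aspirin
    (2, 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'),  # aspirin again, e.g. from another source
    (3, 'XLYOFNOQVPJJNP-UHFFFAOYSA-N'),  # water
]
print(find_duplicates(records))
```

Duplicate groups still need a decision on how to merge them, especially when the duplicated compounds carry different endpoint values.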


# 3. After cleaning the dataset...

#### Make a .csv file (easy to handle), containing at least three columns (id, smiles, endpoint)

1. A csv file is preferred: it can be opened directly in Excel or Notepad, or read with Python.
2. The file should contain an id for every compound; numerical ids are preferred (easy to handle in pandas).
3. Structure information (such as SMILES) must also be included, because it is needed to generate molecular descriptors.
4. The endpoint is necessary for training-set compounds; it is not required for test-set compounds.
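
The file layout described above can be sketched with Python's standard `csv` module. This is a minimal illustration under stated assumptions: the column names follow the list above, `read_dataset` is a hypothetical helper, the data is written to an in-memory buffer instead of a real file, and the endpoint values are made-up placeholders.

```python
import csv
import io

REQUIRED_COLUMNS = ('id', 'smiles', 'endpoint')

def read_dataset(handle):
    """Read rows from a csv handle after checking the required columns exist."""
    reader = csv.DictReader(handle)
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError('missing columns: ' + ', '.join(missing))
    return list(reader)

# In-memory file for demonstration; the endpoint values are placeholders
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
writer.writeheader()
writer.writerow({'id': 1, 'smiles': 'CC(=O)OC1=CC=CC=C1C(=O)O', 'endpoint': 1})
writer.writerow({'id': 2, 'smiles': 'O', 'endpoint': ''})  # test-set compound, no endpoint
buffer.seek(0)
rows = read_dataset(buffer)
print(rows[0]['smiles'])
```

Every value comes back as a string, so the endpoint column needs converting to numbers before modeling; an empty string marks a compound without a measured endpoint.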