# 2) Data Collection: Retrieve SMILES Codes

So far, the boiling point data set contains little information about the molecules' structures: only a chemical formula and a name, neither of which is easily parsed. There is, however, a CAS number. Although the CAS number contains no structural information, it is a unique numeric identifier for each molecule, so we can use it to look up additional structural information that is more readily parsed.

For this study, we'll use [SMILES codes](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system). These are plain-text strings that uniquely describe a molecule's structure. They are similar in purpose to [InChI codes](https://en.wikipedia.org/wiki/International_Chemical_Identifier), but each format has its own quirks, so the two behave slightly differently in practice. Both can readily be searched for substructures, similarities, etc., using the [RDKit module](https://www.rdkit.org/).

Here, we'll import both SMILES and InChI codes for all the molecules in the data set. That way, we will have the flexibility to use either depending on which works more reliably in our applications moving forward.

*Note:* For copyright reasons, the database is not included in this repository.

## Import the required modules.

In [1]:
import pandas as pd
import requests
from time import sleep # Used to delay automatic search queries & not overload the NIH servers.
from numpy import nan # Used to fill in unresolved SMILES codes
from sqlalchemy import create_engine

## Import the data.

In [2]:
# Establish a database connection
db_engine = create_engine("sqlite:///physical_properties.db")

In [3]:
# Query the index, formula, name, and CAS number for each entry.
# To cut down on the number of useless structure lookups we execute later,
# limit this query to database entries that have high-to-moderate reliability for
# boiling point information (codes 1 and 2), as well as a CAS number.
query_terms = """
                SELECT *
                FROM yaws_2015
                WHERE [CAS No] IS NOT NULL AND ([Boil Pt (code)] = 1 OR [Boil Pt (code)] = 2)
                """
bp_data = pd.read_sql_query(query_terms, db_engine, index_col = "id")
bp_data

id,Formula,Name,CAS No,Melt Pt (K),Melt Pt (code),Boil Pt (K),Boil Pt (code)
0,C9H18O,7-methyl-7-octen-1-ol,---,266.15,2.0,498.00,2.0
1,C9H18O,cis-2-methyl-3-octen-2-ol,18521-07-8,266.15,2.0,498.00,2.0
2,C9H18O,3-methyl-1-octen-3-ol,24089-00-7,266.15,2.0,498.00,2.0
3,C9H18O,3-methyl-4-octen-3-ol,90676-55-4,266.15,2.0,498.00,2.0
4,C9H18O,4-methyl-1-octen-4-ol,62108-06-9,266.15,2.0,498.00,2.0
...,...,...,...,...,...,...,...
12814,C96H194,n-hexanonacontane,7763-13-5,388.15,2.0,985.61,2.0
12815,C97H196,n-heptanonacontane,7670-25-9,388.35,2.0,987.93,2.0
12816,C98H198,n-octanonacontane,7670-26-0,388.55,2.0,990.24,2.0
12817,C99H200,n-nonanonacontane,7670-27-1,388.75,2.0,992.53,2.0
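
Note that the first row above has `---` in the `CAS No` column, so the `IS NOT NULL` filter does not remove every entry that lacks a usable CAS number. One way to screen identifiers before any lookup is the CAS check digit: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is not part of the original notebook, and `is_valid_cas` is a hypothetical helper name.

```python
import re

# Hypothetical helper, not part of the original notebook.
# CAS format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(cas_no):
    """Return True if cas_no has the CAS format and a correct check digit."""
    match = CAS_PATTERN.match(str(cas_no).strip())
    if match is None:
        return False  # Covers placeholders such as "---"
    body = match.group(1) + match.group(2)  # All digits except the check digit
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    weighted_sum = sum(weight * int(digit)
                       for weight, digit in enumerate(reversed(body), start=1))
    return weighted_sum % 10 == check_digit

print(is_valid_cas("18521-07-8"))  # True  (CAS number from the table above)
print(is_valid_cas("---"))         # False (missing-data placeholder)
```

A check like this only catches malformed or mistyped numbers; a well-formed CAS number can still fail to resolve, so the name-based fallback remains useful.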


## Helper function
Let's create a simple function to take a single CAS number and look up the corresponding SMILES and InChI codes. The NIH hosts a [**beautifully simple**](https://cactus.nci.nih.gov/chemical/structure) site that does exactly this (and more!), returning only the result of interest as plain text. This will make it very simple to get the codes we want.

The simple, elegant way to do this is by calling `apply` directly on the dataframe with an appropriate function. However, I tried this and encountered problems: connections with the NIH site periodically freeze without error, and none of the intermediate progress is saved. To remedy this, we'll use the less elegant solution of appending results to lists, then updating the dataframe manually after successfully retrieving the structural codes.

In [4]:
# Create a list of tuples containing the index number, CAS number, and name.
mol_ids = list(zip(
                list(bp_data.index),
                list(bp_data["CAS No"]),
                list(bp_data["Name"])))

In [5]:
# In this generic query, the first blank is the structural identifier (CAS or name), and the second blank is the desired output ("smiles" or "inchi")
query_url = "https://cactus.nci.nih.gov/chemical/structure/{}/{}"

index_nos = []
smiles_codes = []
inchi_codes = []

def get_struct_codes(mol_id_tup):
    
    # Define search terms
    index_no = mol_id_tup[0]
    cas_num = mol_id_tup[1]
    mol_name = mol_id_tup[2]
    
    # The database contains "---" as an indicator of missing data.
    # If no CAS number is present, search by name instead.
    if cas_num != "---":
        smiles_search = requests.get(query_url.format(cas_num, "smiles"))
        
        # If a result is found, the CAS number is good. Get the InChI code, too.
        # If not, look up both using the name instead of CAS number.
        if smiles_search.status_code == 200:
            inchi_search = requests.get(query_url.format(cas_num, "inchi"))
        
        else:
            smiles_search = requests.get(query_url.format(mol_name, "smiles"))
            inchi_search = requests.get(query_url.format(mol_name, "inchi"))
    
    else:
        smiles_search = requests.get(query_url.format(mol_name, "smiles"))
        inchi_search = requests.get(query_url.format(mol_name, "inchi"))
  
    # Keep each code that was found; record NaN for any that was not.

    if smiles_search.status_code == 200:
        smiles = smiles_search.text
    else:
        smiles = nan
        
    if inchi_search.status_code == 200:
        inchi = inchi_search.text    
    else:
        inchi = nan
    
    # Print an update.
    print("===> Progress: {}/{}".format(index_no, len(bp_data)), end = "\r")   
    
    # Append the results to the module-level lists (nothing is returned).
    index_nos.append(index_no)
    smiles_codes.append(smiles)
    inchi_codes.append(inchi)
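
Two details of this lookup are worth a hedged sketch. Chemical names can contain characters, such as spaces or `#`, that are unsafe in a URL path, and `requests.get` called without a `timeout` argument can wait indefinitely, which may contribute to the freezes described above. The sketch below percent-encodes the identifier with `urllib.parse.quote` and passes a timeout. The names `build_query_url` and `fetch_code`, the 30-second timeout, and the injectable `http_get` parameter are assumptions for illustration, not part of the original notebook, and whether the server resolves every encoded name is untested here.

```python
from urllib.parse import quote

QUERY_URL = "https://cactus.nci.nih.gov/chemical/structure/{}/{}"

def build_query_url(identifier, output):
    """Percent-encode the identifier so it is safe inside the URL path."""
    return QUERY_URL.format(quote(str(identifier), safe=""), output)

def fetch_code(identifier, output, http_get=None, timeout=30):
    """Return the response text, or None if the server reports no match.

    http_get is a hypothetical injection point (defaults to requests.get)
    so the function can be exercised without network access. A timeout
    raises an exception instead of hanging, which stops the calling loop
    so it can be resumed later.
    """
    if http_get is None:
        import requests  # Imported lazily; only needed for real lookups
        http_get = requests.get
    response = http_get(build_query_url(identifier, output), timeout=timeout)
    return response.text.strip() if response.status_code == 200 else None

print(build_query_url("18521-07-8", "smiles"))
# https://cactus.nci.nih.gov/chemical/structure/18521-07-8/smiles
```

Letting the timeout exception propagate, rather than recording a missing value, keeps a temporary network problem from being stored as an unresolved molecule.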

## Search for Structural Codes
As mentioned above, I first tried performing this lookup the elegant way, by passing an appropriate function to `df.apply()`. Repeated freezing of the network connection, however, rendered this impractical.

In this alternative, the `for` loop can be interrupted and resumed as many times as necessary without a loss of data; it will resume where it was halted each time.

In [19]:
for i in range(len(index_nos), len(mol_ids), 1):
    get_struct_codes(mol_ids[i])

===> Progress: 12818/12819
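
The resume behavior comes from starting the range at `len(index_nos)`: every completed lookup appends exactly one entry, so the list length is always the position of the next unprocessed molecule. Here is a self-contained sketch of the same pattern, with a simulated failure standing in for a frozen connection (`flaky_lookup`, `run`, and the sample data are hypothetical):

```python
items = ["a", "b", "c", "d"]
results = []
failures_left = 1  # Simulate a single interruption

def flaky_lookup(item):
    """Stand-in for a network lookup that fails once on item 'c'."""
    global failures_left
    if item == "c" and failures_left > 0:
        failures_left -= 1
        raise ConnectionError("simulated frozen connection")
    results.append(item.upper())

def run():
    # Start at len(results): everything before that position is already done.
    for i in range(len(results), len(items)):
        flaky_lookup(items[i])

try:
    run()        # First attempt stops at "c"
except ConnectionError:
    pass
print(results)   # ['A', 'B']

run()            # Second attempt resumes at "c"
print(results)   # ['A', 'B', 'C', 'D']
```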

## Confirm that the data has been appropriately matched with CAS numbers.
I manually checked the first and last entries in the dataframe to make sure the structural codes match the listed CAS numbers.

In [20]:
len(index_nos)

12819
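
Because the new columns are assigned from plain lists, they are matched to rows by position, not by `id`. Comparing the collected `index_nos` with the dataframe index is a cheap way to confirm that nothing was skipped or duplicated during an interrupted run. A minimal sketch follows; `find_misaligned` is a hypothetical helper, shown with toy lists in place of `list(bp_data.index)` and `index_nos`.

```python
from itertools import zip_longest

def find_misaligned(expected_ids, collected_ids):
    """Return (position, expected, collected) for each mismatched entry.

    Positions present in only one list are reported with None
    for the missing side.
    """
    return [(pos, exp, got)
            for pos, (exp, got) in enumerate(zip_longest(expected_ids, collected_ids))
            if exp != got]

print(find_misaligned([0, 1, 2, 3], [0, 1, 2, 3]))  # []
print(find_misaligned([0, 1, 2, 3], [0, 1, 1, 2]))  # [(2, 2, 1), (3, 3, 2)]
```

An empty result means every collected index matches the dataframe row at the same position.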

In [22]:
bp_data["SMILES"] = smiles_codes
bp_data["InChI"] = inchi_codes

In [23]:
bp_data

id,Formula,Name,CAS No,Melt Pt (K),Melt Pt (code),Boil Pt (K),Boil Pt (code),SMILES,InChI
0,C9H18O,7-methyl-7-octen-1-ol,---,266.15,2.0,498.00,2.0,CC(=C)CCCCCCO,"InChI=1/C9H18O/c1-9(2)7-5-3-4-6-8-10/h10H,1,3-..."
1,C9H18O,cis-2-methyl-3-octen-2-ol,18521-07-8,266.15,2.0,498.00,2.0,CCCC\C=C/C(C)(C)O,"InChI=1/C9H18O/c1-4-5-6-7-8-9(2,3)10/h7-8,10H,..."
2,C9H18O,3-methyl-1-octen-3-ol,24089-00-7,266.15,2.0,498.00,2.0,CCCCCC(C)(O)C=C,"InChI=1/C9H18O/c1-4-6-7-8-9(3,10)5-2/h5,10H,2,..."
3,C9H18O,3-methyl-4-octen-3-ol,90676-55-4,266.15,2.0,498.00,2.0,CCCC=CC(C)(O)CC,"InChI=1/C9H18O/c1-4-6-7-8-9(3,10)5-2/h7-8,10H,..."
4,C9H18O,4-methyl-1-octen-4-ol,62108-06-9,266.15,2.0,498.00,2.0,CCCCC(C)(O)CC=C,"InChI=1/C9H18O/c1-4-6-8-9(3,10)7-5-2/h5,10H,2,..."
...,...,...,...,...,...,...,...,...,...
12814,C96H194,n-hexanonacontane,7763-13-5,388.15,2.0,985.61,2.0,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,InChI=1/C96H194/c1-3-5-7-9-11-13-15-17-19-21-2...
12815,C97H196,n-heptanonacontane,7670-25-9,388.35,2.0,987.93,2.0,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,InChI=1/C97H196/c1-3-5-7-9-11-13-15-17-19-21-2...
12816,C98H198,n-octanonacontane,7670-26-0,388.55,2.0,990.24,2.0,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,InChI=1/C98H198/c1-3-5-7-9-11-13-15-17-19-21-2...
12817,C99H200,n-nonanonacontane,7670-27-1,388.75,2.0,992.53,2.0,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,InChI=1/C99H200/c1-3-5-7-9-11-13-15-17-19-21-2...


Yes, these structural codes match the listed chemical names and CAS numbers. All that's left is to write this updated information to the database.

In [34]:
bp_data.to_sql("yaws_w_structures", con = db_engine, if_exists = "append")
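
Note that `if_exists = "append"` adds rows to an existing table, so running this cell twice would store every molecule twice. A quick row count after writing can catch this. The sketch below uses the standard-library `sqlite3` module against an in-memory stand-in for `yaws_w_structures`; the toy rows and the `summarize` helper are hypothetical, and it assumes unresolved codes were stored as NULL.

```python
import sqlite3

def summarize(conn, table="yaws_w_structures"):
    """Return (total rows, distinct ids, rows with no SMILES code)."""
    query = (f'SELECT COUNT(*), COUNT(DISTINCT id), '
             f'SUM(CASE WHEN "SMILES" IS NULL THEN 1 ELSE 0 END) '
             f'FROM "{table}"')
    return conn.execute(query).fetchone()

# In-memory stand-in for physical_properties.db, filled with toy data.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE yaws_w_structures (id INTEGER, "SMILES" TEXT)')
rows = [(0, "CC(=C)CCCCCCO"), (1, None), (2, "CCCCCC(C)(O)C=C")]
conn.executemany("INSERT INTO yaws_w_structures VALUES (?, ?)", rows)
print(summarize(conn))  # (3, 3, 1)

# Appending the same rows again doubles the total but not the distinct ids.
conn.executemany("INSERT INTO yaws_w_structures VALUES (?, ?)", rows)
print(summarize(conn))  # (6, 3, 2)
```

If the total row count exceeds the number of distinct ids, the table has been appended to more than once.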