In [3]:
import pubchempy as pcp
import pandas as pd
import numpy as np
import requests
import json

### The purpose of this notebook is to use PubChem's online data retrieval services, specifically PUG REST, to expand material candidate datasets by searching for other compounds with similar structures. There are surely more in-depth and sophisticated methods for this, but this one should be viewed as a straightforward and quick approach to basic material screening and discovery.

The pubchempy API can be viewed here: https://pubchempy.readthedocs.io/en/latest/api.html

The PUG REST documentation can be viewed here: https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest

According to the PUG REST documentation, there are several ways to search for chemicals by structure. The most appropriate one here seems to be a similarity search, which looks for compounds whose 2D structure is similar to that of the query. You can search by CID, SMILES string, or InChI. PUG REST offers a "standard" and a "fast" variant: the standard search is asynchronous and takes two successive requests (one to start the search, one to retrieve the results), while the fast search returns the results in a single request. After playing around with both, I didn't see any difference in the results I was getting, so I'm going to stick with the more convenient "fast" method.
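For comparison, the two-request "standard" search can be sketched roughly as follows. This is a minimal sketch, not the approach used in the rest of the notebook: it assumes the first reply carries a `Waiting` entry with a `ListKey`, and that the results are then polled from the `listkey` endpoint. The helper names `fetch_json` and `similarity_cids_async`, the polling interval, and the poll limit are all made up for illustration.

```python
import time

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def fetch_json(url):
    """Default fetcher: GET the URL and decode the JSON body."""
    import requests  # imported lazily so the helper can be tried offline with a stub
    return requests.get(url, timeout=30).json()

def similarity_cids_async(cid, threshold=90, fetch=fetch_json, wait=2.0, max_polls=30):
    """Run the standard (asynchronous) 2D similarity search starting from a CID."""
    # Request 1: start the search; the server answers with a list key
    reply = fetch(f"{BASE}/compound/similarity/cid/{cid}/JSON?Threshold={threshold}")
    for _ in range(max_polls):
        if "IdentifierList" in reply:
            return reply["IdentifierList"]["CID"]
        if "Waiting" not in reply:
            raise RuntimeError(f"Unexpected reply: {reply}")
        list_key = reply["Waiting"]["ListKey"]
        time.sleep(wait)
        # Request 2 (repeated until done): ask for the CIDs stored under the list key
        reply = fetch(f"{BASE}/compound/listkey/{list_key}/cids/JSON")
    raise TimeoutError("Similarity search did not finish in time")
```

Passing the fetcher in as an argument is only a convenience so the polling logic can be checked without touching the network.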

We can initiate a request for the data below. We will be using our trusty friend choline chloride as an initial example, searching by its SMILES string. Note that you also have the option to attach constraints to the end of the request, such as a maximum number of results returned and, more usefully, a minimum threshold for the Tanimoto score, which is basically an index of chemical similarity expressed here as a percentage. Choosing a threshold value is a little arbitrary. With too low a threshold, the resulting compounds may not be as similar as you would like. Too high a threshold yielded far fewer results, and they were almost "too" similar: most were just isotopically labelled (often radioactive) forms of choline chloride. The default threshold value is 90; I'm going to use 80 to get a bit more variety in the returned structures.
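To make the Tanimoto score concrete, here is a toy calculation. It is only an illustration under stated assumptions: the fingerprints are invented sets of "on" bit positions, whereas PubChem computes the score on its own substructure fingerprints, and the function name `tanimoto` is mine.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    shared = len(a & b)   # bits set in both fingerprints
    union = len(a | b)    # bits set in at least one fingerprint
    return shared / union if union else 0.0

# Invented fingerprints: 4 shared bits, 6 distinct bits in total
fp_query = {1, 4, 7, 9, 12}
fp_hit = {1, 4, 7, 9, 15}

score = tanimoto(fp_query, fp_hit)
print(round(100 * score))  # prints 67
```

On this scale the toy hit would be rejected by `Threshold=80`, since 67 is below 80.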


In [31]:
request_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/C[N+](C)(C)CCO.[Cl-]/cids/JSON?Threshold=80"

request = requests.get(request_url)
request_json = request.json()

The result is returned as a very short dictionary, and we can parse it as such to get the list of similar compounds.

In [32]:
request_json['IdentifierList']['CID']

[6209,
 7618,
 7902,
 7497,
 305,
 7767,
 37511,
 14074,
 8769,
 31255,
 74724,
 80057,
 75171,
 14989482,
 449688,
 198804,
 162303,
 118525,
 101814,
 87300,
 85865,
 84429,
 76721,
 62288,
 60232,
 42848,
 16942,
 6552,
 18772452,
 15419000,
 14859573,
 449689,
 449673,
 404592,
 404591,
 287929,
 200674,
 45414,
 17916,
 16941,
 24799587,
 6950248,
 6395359,
 6328237,
 5249138,
 2723762,
 357051,
 199656,
 172277,
 152431,
 83283,
 80058,
 70682,
 25781,
 23437,
 71310335,
 23352174,
 23066982,
 16217619,
 16213540,
 16213539,
 15029190,
 15029175,
 14943907,
 14859571,
 14859569,
 13778189,
 12249371,
 12239068,
 3063123,
 3017041,
 413857,
 342520,
 293482,
 249874,
 239819,
 237041,
 200221,
 170746,
 162304,
 120823,
 117618,
 116739,
 103983,
 87940,
 83239,
 57052,
 46081,
 23438,
 88729402,
 87876898,
 87573822,
 87389474,
 86759356,
 86752887,
 86106984,
 71360639,
 71309220,
 71309123,
 71308987,
 57483587,
 57266120,
 55252930,
 46209411,
 45791167,
 23680072,
 23675624,


In [33]:
len(request_json['IdentifierList']['CID'])

1072
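The direct key lookup used above raises a `KeyError` if the request failed, because an error reply has no `IdentifierList`. A slightly more defensive version might look like the sketch below. It assumes that PUG REST reports errors in JSON under a `Fault` key with `Code` and `Message` fields; the helper name `extract_cids` is hypothetical.

```python
def extract_cids(reply):
    """Return the CID list from a decoded PUG REST reply, or raise with the server's message."""
    if "Fault" in reply:
        fault = reply["Fault"]
        raise ValueError(f"{fault.get('Code', 'unknown')}: {fault.get('Message', '')}")
    # A successful search with no hits is treated as an empty list
    return reply.get("IdentifierList", {}).get("CID", [])

# Example with hand-written replies instead of live requests
ok_reply = {"IdentifierList": {"CID": [6209, 7618, 7902]}}
print(extract_cids(ok_reply))
```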

The search returned 1072 different compounds. Let's find out the names of those compounds.

In [34]:
similarity_list = request_json['IdentifierList']['CID']

In [35]:
name_list = [] #empty list to append compound names to

for cid in similarity_list:
    #PUG View request (made with requests, not pubchempy) to obtain a json of the entire compound record
    names_request = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/%s/JSON" % str(cid)
    record = requests.get(names_request)
    record_json = record.json() #separate name so the similarity results in request_json are not overwritten

    names = record_json['Record']['RecordTitle'] #the compound name is contained at the very top of the json, fairly easy.

    name_list.append(names)
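The loop above issues one request per CID, which is slow for a thousand compounds. A possible alternative, sketched here under assumptions, is to ask the PUG REST property endpoint for the `Title` of many CIDs at once. The chunk size of 100 is an arbitrary choice, the helper names are hypothetical, and I have not verified that `Title` always matches the `RecordTitle` used above.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def title_url(cids):
    """Build one PUG REST property request for the Title of several CIDs."""
    cid_text = ",".join(str(cid) for cid in cids)
    return f"{BASE}/compound/cid/{cid_text}/property/Title/JSON"

def titles_from_reply(reply):
    """Map CID -> Title from a decoded PropertyTable reply."""
    rows = reply["PropertyTable"]["Properties"]
    return {row["CID"]: row.get("Title") for row in rows}

# Example wiring (network calls left commented out):
# import requests
# names = {}
# for chunk in chunked(similarity_list, 100):
#     names.update(titles_from_reply(requests.get(title_url(chunk)).json()))
```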

In [36]:
name_list

['Choline chloride',
 'Triethanolamine',
 'Deanol',
 '2-Diethylaminoethanol',
 'Choline',
 'N-Methyldiethanolamine',
 'Dimepranol',
 '1-Aziridineethanol',
 'N-Ethyldiethanolamine',
 'Choline hydroxide',
 'Choline Bromide',
 '(2-Hydroxyethyl)triethylammonium iodide',
 '2-((2-(Dimethylamino)ethyl)(methyl)amino)ethanol',
 'Choline chloride C-11',
 'Choline C-11',
 'Triethylcholine chloride',
 'Bis(2-hydroxyethyl)dimethylammonium chloride',
 'Tris(2-hydroxyethyl)methylammonium hydroxide',
 'Triethanolamine hydrochloride',
 'Choline iodide',
 '2-(Dimethylamino)propan-1-ol',
 '2-(Diethylamino)ethanol hydrochloride',
 '2-(Dipropylamino)ethanol',
 'Copper triethanolamine complex',
 '1-Ethyl-1-(2-hydroxyethyl)aziridinium chloride',
 "Ethanol, 2,2'-(propylimino)bis-",
 '2-Methylcholine',
 'Choline bicarbonate',
 'Copper triethanolamine',
 'Ethanaminium, N,N,N-triethyl-2-hydroxy-, bromide',
 '2-Hydroxyethyl-(iodomethyl)-dimethylazanium;iodide',
 'Fluoroethylcholine ion F-18',
 'Fluorocholine F-18

Scrolling through the list, we can see some compounds that look very interesting for our application as hydrogen bond acceptors (HBA) for deep eutectic solvents. Others seem less useful, such as the isotopically labelled compounds I mentioned earlier, or entries that are just the parent compounds of the halide salts. I'm not sure of a great way to further screen and reduce this list, but a start would be removing results that are not halide salts, since that is our criterion for an HBA.

In [37]:
sorted_names = []

halides = ("chloride", "bromide", "fluoride", "iodide")

for i in name_list:

    #keep the name if it contains any of the halide words
    if any(halide in i for halide in halides):
        sorted_names.append(i)

Keeping only the results whose name contains a halide seems to be the most straightforward way to screen these results. The downside is that we may miss compounds that go by a common name but are in fact halide salts. The match is also case-sensitive, so a title such as 'Choline Bromide' is skipped. I don't think there should be enough of those cases to worry about.
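The substring test above is case-sensitive, so a title like 'Choline Bromide' slips through. One way to close that gap, offered as a sketch rather than as the filter used for the counts in this notebook, is a case-insensitive regular expression. The names `HALIDE` and `keep_halides` are hypothetical.

```python
import re

# Matches any of the four halide words regardless of capitalisation;
# "hydrochloride" and "hydriodide" still match because they contain the halide word
HALIDE = re.compile(r"fluoride|chloride|bromide|iodide", re.IGNORECASE)

def keep_halides(names):
    """Keep only names that mention a halide, ignoring case."""
    return [name for name in names if HALIDE.search(name)]

sample_names = ["Choline Bromide", "Choline", "Triethanolamine hydrochloride", "Deanol"]
print(keep_halides(sample_names))
```

Applying this to the full list would be expected to return somewhat more names than the case-sensitive version.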


In [38]:
sorted_names

['Choline chloride',
 '(2-Hydroxyethyl)triethylammonium iodide',
 'Choline chloride C-11',
 'Triethylcholine chloride',
 'Bis(2-hydroxyethyl)dimethylammonium chloride',
 'Triethanolamine hydrochloride',
 'Choline iodide',
 '2-(Diethylamino)ethanol hydrochloride',
 '1-Ethyl-1-(2-hydroxyethyl)aziridinium chloride',
 'Ethanaminium, N,N,N-triethyl-2-hydroxy-, bromide',
 '2-Hydroxyethyl-(iodomethyl)-dimethylazanium;iodide',
 'Ethanol, 2-dimethylamino-, hydrochloride',
 '1-Propanaminium, 2-hydroxy-N,N,N-trimethyl-, chloride',
 '2-(Dimethylamino)ethanol hydrochloride',
 'Triethanolamine hydriodide',
 'Tetrakis(2-hydroxyethyl)ammonium chloride',
 'Dimethylethylcholine iodide',
 'Ethanaminium, 2-hydroxy-N,N-bis(2-hydroxyethyl)-N-methyl-, chloride',
 'Choline chloride-15N',
 '2-Methoxyethyl(trimethyl)azanium;chloride',
 'N,N-Diethyl-2-hydroxy-N-methylethan-1-aminium bromide',
 'Choline-1,1,2,2-d4 bromide',
 'Methyl-D9-choline chloride',
 'N-Ethyl-2-hydroxy-N,N-dimethylethan-1-aminium bromide',
 

In [39]:
len(sorted_names)

260

We have reduced the number of results from 1072 to 260, and most of these should be decent candidates to begin the screening process. Interestingly, some of these results are also complexed with metals. I'm not sure whether or not to include them.
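The filtered list still contains isotopically labelled entries such as 'Choline chloride C-11'. One possible further screen, sketched here with hand-written records, is to drop compounds whose `IsotopeAtomCount` is non-zero. This assumes that the records were fetched from the PUG REST property endpoint with `Title,IsotopeAtomCount` requested; the function name `drop_isotope_labelled` is hypothetical, and the example CIDs for the two labelled entries are placeholders, not real identifiers.

```python
def drop_isotope_labelled(records):
    """Keep records with no isotopically labelled atoms.

    Each record is assumed to be a dict with 'Title' and 'IsotopeAtomCount' keys.
    A missing count is treated as zero, which is a judgment call.
    """
    return [rec for rec in records if rec.get("IsotopeAtomCount", 0) == 0]

# Hand-written stand-ins for property records (placeholder CIDs for the labelled entries)
records = [
    {"CID": 6209, "Title": "Choline chloride", "IsotopeAtomCount": 0},
    {"CID": 1, "Title": "Choline chloride C-11", "IsotopeAtomCount": 1},
    {"CID": 2, "Title": "Methyl-D9-choline chloride", "IsotopeAtomCount": 9},
]

print([rec["Title"] for rec in drop_isotope_labelled(records)])
```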

Let's start writing this as a function so that we can test this on a larger dataset.

In [42]:
#function for finding the names of the results from their cid's

def get_names(cid_list):

    """This function will retrieve the chemical names from a list of cid's"""

    name_list = [] #empty list to append compound names to

    for cid in cid_list:
        #PUG View request (made with requests, not pubchempy) to obtain a json of the entire compound record
        names_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/%s/JSON" % str(cid)
        names_request = requests.get(names_url)
        names_request_json = names_request.json()

        names = names_request_json['Record']['RecordTitle'] #the compound name is contained at the very top of the json, fairly easy.

        name_list.append(names)

    return name_list

In [44]:
#function for keeping only halide salts

def keep_salts(name_list):

    """This function will remove non-halide salts from a list of chemical names"""

    halides = ("chloride", "bromide", "fluoride", "iodide")

    sorted_list = [] #empty list to append matching names to

    for i in name_list:

        #keep the name if it contains any of the halide words (case-sensitive)
        if any(halide in i for halide in halides):
            sorted_list.append(i)

    #return only after the whole list has been checked
    return sorted_list

In [45]:
#wrapper function starting from a dataframe containing smiles strings

def get_similar_structures(dataframe, smiles_column):

    """This function will return the names of halide salts similar to each smiles string in a dataframe column"""

    cid_list = [] #contains cid's of similar chemicals from results

    for i, row in dataframe.iterrows():

        smiles = row[smiles_column]

        request_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/%s/cids/JSON?Threshold=80" % str(smiles)

        request = requests.get(request_url)
        request_json = request.json()

        #extend (not append) so that cid_list stays a flat list of cid's for get_names
        cid_list.extend(request_json['IdentifierList']['CID'])

    name_list = get_names(cid_list)

    sorted_list = keep_salts(name_list)

    return sorted_list
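When several query molecules are searched, their hit lists will probably overlap, so it may be worth merging the per-query CID lists and dropping duplicates before looking up any names. The sketch below shows one way to do that while keeping first-seen order; `merge_cid_lists` is a hypothetical helper, not part of the notebook. Separately, PubChem asks users to keep to roughly five requests per second, so a short `time.sleep` between requests is a sensible addition for larger datasets.

```python
from itertools import chain

def merge_cid_lists(cid_lists):
    """Flatten per-query CID lists into one list without duplicates, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so the first occurrence of each CID wins
    return list(dict.fromkeys(chain.from_iterable(cid_lists)))

# Two invented hit lists that share some CIDs
hits_query_1 = [6209, 7618, 305]
hits_query_2 = [305, 7902, 6209]

print(merge_cid_lists([hits_query_1, hits_query_2]))  # prints [6209, 7618, 305, 7902]
```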