In [1]:
import pubchempy as pcp
import pandas as pd
import numpy as np
import requests
import json

### The purpose of this notebook is to use PubChem's online data retrieval resources, specifically PUG REST, to expand material candidate datasets by searching for other materials/compounds with similar structure. More in-depth and sophisticated methods surely exist, but this should be viewed as a straightforward and quick approach for basic material screening and discovery.

### We continue from the other notebook, using the heat capacity dataset as an initial example since it isn't very large, to write a function for this process that we can then apply to all the other scraped datasets.

In [2]:
#importing heat capacity dataset
heat_capacity = pd.read_csv('/Users/Jaime/Desktop/des-basis-set/scripts/webscrapping/heat_capacity/heat_capacity.csv')

In [3]:
heat_capacity

Unnamed: 0,"Heat capacity at constant pressure, J/K/mol",HBA Mole Fraction,"Pressure, kPa","Temperature, K",HBD_MW,HBA_MW,HBD,HBA
0,590.0,0.80,100,308.20,60.06,277.92,urea,tetrabutylammonium chloride
1,593.0,0.80,100,310.70,60.06,277.92,urea,tetrabutylammonium chloride
2,598.0,0.80,100,313.20,60.06,277.92,urea,tetrabutylammonium chloride
3,601.0,0.80,100,315.70,60.06,277.92,urea,tetrabutylammonium chloride
4,603.0,0.80,100,318.20,60.06,277.92,urea,tetrabutylammonium chloride
5,601.0,0.80,100,320.70,60.06,277.92,urea,tetrabutylammonium chloride
6,602.0,0.80,100,323.20,60.06,277.92,urea,tetrabutylammonium chloride
7,603.0,0.80,100,325.70,60.06,277.92,urea,tetrabutylammonium chloride
8,602.0,0.80,100,328.20,60.06,277.92,urea,tetrabutylammonium chloride
9,603.0,0.80,100,330.70,60.06,277.92,urea,tetrabutylammonium chloride


Starting with the HBA as an example, we will make a new dataframe with only the unique HBA.

In [12]:
HBA = pd.DataFrame()
HBA['HBA'] = heat_capacity['HBA']
HBA = HBA.drop_duplicates()
HBA = HBA.reset_index(drop = True)

In [13]:
HBA

Unnamed: 0,HBA
0,tetrabutylammonium chloride
1,methyltriphenylphosphonium bromide


Use the get_cid function written previously to retrieve the CIDs.

In [14]:
#import the get_cid function from other path
import sys
sys.path.insert(1, '/Users/Jaime/Desktop/des-basis-set/scripts/pubchem')

from get_cid import get_cid

In [15]:
HBA = get_cid(HBA, 'HBA','CID')

In [16]:
HBA

Unnamed: 0,HBA,CID
0,tetrabutylammonium chloride,70681
1,methyltriphenylphosphonium bromide,74505


In [20]:
#obtaining smiles strings
from get_properties import get_properties

HBA = get_properties(HBA, 'canonical_smiles', 'CID', 'HBA_')

In [34]:
HBA

Unnamed: 0,HBA,CID,HBA_CanonicalSMILES
0,tetrabutylammonium chloride,70681,CCCC[N+](CCCC)(CCCC)CCCC.[Cl-]
1,methyltriphenylphosphonium bromide,74505,C[P+](C1=CC=CC=C1)(C2=CC=CC=C2)C3=CC=CC=C3.[Br-]


It turns out the request returns the same result whether you search by CID or by SMILES string, so the extra step above to get the SMILES string isn't needed right now.

In [45]:
#searching for similar chemicals based on first item's CID as example
cid = HBA['CID'][0]

request_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/cid/%s/cids/JSON?Threshold=80"% str(cid)

request = requests.get(request_url)
request_json = request.json()
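Since the search can be keyed on either a CID or a SMILES string, here is a minimal sketch of building the request URL for both cases. The helper name `build_similarity_url` and the constant `PUG_REST_BASE` are my own, not part of PubChem or pubchempy, and percent-encoding the SMILES is an assumption about what is safest for the URL path. It only builds the URL, it does not send the request.

```python
from urllib.parse import quote

# Base path copied from the request URL used in the cell above.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"


def build_similarity_url(identifier, namespace="cid", threshold=80):
    """Build a fastsimilarity_2d URL for a CID or a SMILES string (hypothetical helper)."""
    if namespace not in ("cid", "smiles"):
        raise ValueError("namespace must be 'cid' or 'smiles'")
    # Percent-encode so brackets and charges in a SMILES string survive the URL path.
    encoded = quote(str(identifier), safe="")
    return (
        f"{PUG_REST_BASE}/fastsimilarity_2d/{namespace}/{encoded}"
        f"/cids/JSON?Threshold={threshold}"
    )


url_from_cid = build_similarity_url(70681)
url_from_smiles = build_similarity_url(
    "CCCC[N+](CCCC)(CCCC)CCCC.[Cl-]", namespace="smiles"
)
print(url_from_cid)
print(url_from_smiles)
```

Either URL could then be passed to `requests.get` exactly as in the cell above.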

In [47]:
request_json['IdentifierList']['CID']

[23558,
 8154,
 62581,
 7622,
 7616,
 5946,
 74236,
 21218,
 20708,
 8155,
 7879,
 2723671,
 78202,
 67553,
 16028,
 14227,
 2724141,
 165872,
 91822,
 88455,
 78667,
 74745,
 74723,
 70681,
 66074,
 66022,
 61906,
 31281,
 24952,
 18855,
 17249,
 16958,
 15743,
 14049,
 12429,
 12133,
 11769095,
 5743232,
 3014969,
 165075,
 137295,
 94377,
 81880,
 78073,
 76521,
 75483,
 75057,
 69525,
 67932,
 20513,
 18843,
 9559,
 14298629,
 9881569,
 3072561,
 3014224,
 2735155,
 2724292,
 519218,
 169536,
 110256,
 104201,
 89152,
 82489,
 80021,
 79880,
 78388,
 78207,
 78026,
 77071,
 76205,
 75056,
 70086,
 62582,
 44527,
 36209,
 36208,
 21541,
 17248,
 13834,
 134813759,
 60196394,
 23225441,
 20316921,
 16218668,
 14417848,
 14325059,
 14029864,
 13998878,
 13041689,
 12444412,
 11996614,
 11748636,
 11746670,
 11726816,
 11346348,
 11208119,
 11123851,
 11123692,
 10891295,
 10062191,
 9939907,
 9807837,
 6399486,
 3017185,
 2724287,
 2723680,
 559387,
 524212,
 522199,
 521159,
 520138,

3092 results returned

In [48]:
len(request_json['IdentifierList']['CID'])

3092

In [50]:
#putting results in list
similarity_list = request_json['IdentifierList']['CID']
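Note that the query compound itself (CID 70681) shows up in the returned list. A small sketch of pulling the CIDs out a bit more defensively is shown here. The helper `extract_cids` is hypothetical, the `Fault` key is my assumption about what an error response looks like, and the example payload is made up for illustration (the repeated 8154 is artificial).

```python
def extract_cids(payload, query_cid=None):
    """Return the unique CIDs from a similarity response, minus the query compound."""
    # Assumption: an error response carries a top-level "Fault" entry.
    if "Fault" in payload:
        fault = payload["Fault"]
        raise RuntimeError(
            f"PUG REST error: {fault.get('Code')}: {fault.get('Message')}"
        )
    cids = payload.get("IdentifierList", {}).get("CID", [])
    seen = set()
    unique = []
    for cid in cids:
        # Skip the compound we searched with and any repeats, keeping the order.
        if cid == query_cid or cid in seen:
            continue
        seen.add(cid)
        unique.append(cid)
    return unique


# Illustrative payload only, not a real response.
example_payload = {"IdentifierList": {"CID": [23558, 8154, 70681, 8154]}}
example_cids = extract_cids(example_payload, query_cid=70681)
print(example_cids)
```

With the real response this would be called as `extract_cids(request_json, query_cid=cid)`.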

Getting all the smiles strings for the similarity list

In [51]:
smiles_list = []
for cid in similarity_list:
    smiles_list.append(pcp.get_properties('canonical_smiles', cid))
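The loop above sends one request per CID, so about 3092 requests for this list. A possible alternative is to ask for the properties in batches, sketched below. The helpers `chunked` and `fetch_smiles_in_batches` are hypothetical, the batch size of 100 is an arbitrary guess, and `fake_fetch` is a stand-in so the sketch runs without network access.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_smiles_in_batches(cids, fetch, batch_size=100):
    """Call `fetch` once per batch of CIDs and collect all returned records."""
    records = []
    for batch in chunked(list(cids), batch_size):
        records.extend(fetch(batch))
    return records


def fake_fetch(batch):
    # Stand-in for a real lookup; returns one placeholder record per CID.
    return [{"CID": cid, "CanonicalSMILES": "C"} for cid in batch]


demo_records = fetch_smiles_in_batches(range(1, 251), fake_fetch, batch_size=100)
print(len(demo_records))
```

With pubchempy, `fetch` could be `lambda batch: pcp.get_properties('canonical_smiles', batch)`, since `get_properties` accepts a list of identifiers. The result would then be a flat list of dictionaries rather than a list of one-item lists, so the `[0]` indexing used in the screening cells would have to be dropped.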

In [52]:
smiles_list

[[{'CID': 23558, 'CanonicalSMILES': 'CCCCCCCCCC[N+](C)(C)CCCCCCCCCC.[Cl-]'}],
 [{'CID': 8154, 'CanonicalSMILES': 'CCCCCCCCCCCCCCCC[N+](C)(C)C.[Cl-]'}],
 [{'CID': 62581, 'CanonicalSMILES': 'CCCCCCCC[N+](C)(C)CCCCCCCC.[Cl-]'}],
 [{'CID': 7622, 'CanonicalSMILES': 'CCCCN(CCCC)CCCC'}],
 [{'CID': 7616, 'CanonicalSMILES': 'CCCN(CCC)CCC'}],
 [{'CID': 5946, 'CanonicalSMILES': 'CC[N+](CC)(CC)CC.[Cl-]'}],
 [{'CID': 74236, 'CanonicalSMILES': 'CCCC[N+](CCCC)(CCCC)CCCC.[Br-]'}],
 [{'CID': 21218,
   'CanonicalSMILES': 'CCCCCCCC[N+](C)(CCCCCCCC)CCCCCCCC.[Cl-]'}],
 [{'CID': 20708, 'CanonicalSMILES': 'CCCCCCCCCCCCCC[N+](C)(C)C.[Cl-]'}],
 [{'CID': 8155, 'CanonicalSMILES': 'CCCCCCCCCCCCCCCCCC[N+](C)(C)C.[Cl-]'}],
 [{'CID': 7879,
   'CanonicalSMILES': 'CCCCCCCCCCCCCCCCCC[N+](C)(C)CCCCCCCCCCCCCCCCCC.[Cl-]'}],
 [{'CID': 2723671, 'CanonicalSMILES': 'CCCC[N+](CCCC)(CCCC)CCCC.[OH-]'}],
 [{'CID': 78202, 'CanonicalSMILES': 'CCCCCCCCN(C)CCCCCCCC'}],
 [{'CID': 67553, 'CanonicalSMILES': 'CCCC[N+](CCCC)(CCCC)CCCC.[I-

Excluding all chemicals that do not contain an ammonium group. We are focusing on the ammonium salts, so this would exclude any phosphoniums later on, but this can easily be changed if we also want to explore those. The list of candidates was reduced from 3092 to 1910.

In [53]:
temp_list = []

for i in range(len(smiles_list)):
    
    if '[N+]' in smiles_list[i][0]['CanonicalSMILES']: #checking if ammonium group is present in smiles string
        
        temp_list.append(smiles_list[i][0])

In [55]:
len(temp_list)

1910
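For reference, the same screen can be written as a list comprehension. This is only a sketch on three records copied from the output above, and `keep_cationic_nitrogen` is a hypothetical name. It also shows one caveat of the substring check: `[N+]` matches any positively charged nitrogen, so a nitro group written in charge-separated form passes as well.

```python
# Three entries copied from the smiles_list output above (same nested shape).
records = [
    [{"CID": 23558, "CanonicalSMILES": "CCCCCCCCCC[N+](C)(C)CCCCCCCCCC.[Cl-]"}],
    [{"CID": 7622, "CanonicalSMILES": "CCCCN(CCCC)CCCC"}],
    [{"CID": 5946, "CanonicalSMILES": "CC[N+](CC)(CC)CC.[Cl-]"}],
]


def keep_cationic_nitrogen(nested_records):
    """Keep records whose SMILES contains '[N+]', flattening the one-item lists."""
    return [
        entry[0]
        for entry in nested_records
        if "[N+]" in entry[0]["CanonicalSMILES"]
    ]


kept = keep_cationic_nitrogen(records)
print([record["CID"] for record in kept])

# Caveat: nitromethane is not an ammonium salt, but the substring test accepts it.
nitromethane = "C[N+](=O)[O-]"
print("[N+]" in nitromethane)
```

If false positives like this turn out to matter, a substructure match with a cheminformatics toolkit would be the more robust option.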

We will perform another screen based on whether a halide anion is present in the SMILES string, since these candidates need to be salts to form a DES. This step would not apply to the HBD.

In [56]:
new_smiles_list = []

for i in range(len(temp_list)):
    if '.[Cl-]' in temp_list[i]['CanonicalSMILES']:
        new_smiles_list.append(temp_list[i])
        
    elif '.[Br-]' in temp_list[i]['CanonicalSMILES']:
        new_smiles_list.append(temp_list[i])
        
    elif '.[I-]' in temp_list[i]['CanonicalSMILES']:
        new_smiles_list.append(temp_list[i])
        
    elif '.[F-]' in temp_list[i]['CanonicalSMILES']:
        new_smiles_list.append(temp_list[i])
    
    else:
        pass

Performing this next screen further reduces the number of candidates down to 843.

In [57]:
new_smiles_list

[{'CID': 23558, 'CanonicalSMILES': 'CCCCCCCCCC[N+](C)(C)CCCCCCCCCC.[Cl-]'},
 {'CID': 8154, 'CanonicalSMILES': 'CCCCCCCCCCCCCCCC[N+](C)(C)C.[Cl-]'},
 {'CID': 62581, 'CanonicalSMILES': 'CCCCCCCC[N+](C)(C)CCCCCCCC.[Cl-]'},
 {'CID': 5946, 'CanonicalSMILES': 'CC[N+](CC)(CC)CC.[Cl-]'},
 {'CID': 74236, 'CanonicalSMILES': 'CCCC[N+](CCCC)(CCCC)CCCC.[Br-]'},
 {'CID': 21218, 'CanonicalSMILES': 'CCCCCCCC[N+](C)(CCCCCCCC)CCCCCCCC.[Cl-]'},
 {'CID': 20708, 'CanonicalSMILES': 'CCCCCCCCCCCCCC[N+](C)(C)C.[Cl-]'},
 {'CID': 8155, 'CanonicalSMILES': 'CCCCCCCCCCCCCCCCCC[N+](C)(C)C.[Cl-]'},
 {'CID': 7879,
  'CanonicalSMILES': 'CCCCCCCCCCCCCCCCCC[N+](C)(C)CCCCCCCCCCCCCCCCCC.[Cl-]'},
 {'CID': 67553, 'CanonicalSMILES': 'CCCC[N+](CCCC)(CCCC)CCCC.[I-]'},
 {'CID': 2724141, 'CanonicalSMILES': 'CCCC[N+](CCCC)(CCCC)CCCC.[F-]'},
 {'CID': 91822, 'CanonicalSMILES': 'CCCC[N+](C)(CCCC)CCCC.[Cl-]'},
 {'CID': 78667, 'CanonicalSMILES': 'CCCCC[N+](CCCCC)(CCCCC)CCCCC.[Cl-]'},
 {'CID': 74745, 'CanonicalSMILES': 'CCC[N+](CCC)(CC

In [59]:
len(new_smiles_list)

843
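The halide screen can also be written more compactly by keeping the four anion tokens in one tuple. The sketch below uses four records copied from the outputs above. `HALIDE_TOKENS`, `keep_halide_salts` and `count_by_halide` are hypothetical names, and the per-anion count is an extra I added for a quick look at the distribution.

```python
from collections import Counter

# Same four substrings tested by the if/elif chain above.
HALIDE_TOKENS = (".[F-]", ".[Cl-]", ".[Br-]", ".[I-]")


def keep_halide_salts(records):
    """Keep records whose SMILES contains one of the halide tokens."""
    return [
        record
        for record in records
        if any(token in record["CanonicalSMILES"] for token in HALIDE_TOKENS)
    ]


def count_by_halide(records):
    """Count records per halide, using the first matching token for each record."""
    counts = Counter()
    for record in records:
        for token in HALIDE_TOKENS:
            if token in record["CanonicalSMILES"]:
                counts[token.strip(".[]-")] += 1
                break
    return counts


sample = [
    {"CID": 74236, "CanonicalSMILES": "CCCC[N+](CCCC)(CCCC)CCCC.[Br-]"},
    {"CID": 2723671, "CanonicalSMILES": "CCCC[N+](CCCC)(CCCC)CCCC.[OH-]"},
    {"CID": 67553, "CanonicalSMILES": "CCCC[N+](CCCC)(CCCC)CCCC.[I-]"},
    {"CID": 5946, "CanonicalSMILES": "CC[N+](CC)(CC)CC.[Cl-]"},
]

halide_salts = keep_halide_salts(sample)
print([record["CID"] for record in halide_salts])
print(dict(count_by_halide(halide_salts)))
```

Both this version and the original only match a halide that follows a `.` in the string, so a SMILES that lists the anion first would be missed by either one.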

Now let's put this into a function that we can use on the rest of the datasets. For now, we will make one function for the HBA that includes the screening we just did above, and another for the HBD that doesn't have that screening.

In [None]:
#function for the HBA screen
def hba_screen(smiles_list):
    temp_list = []

    for i in range(len(smiles_list)):

        if '[N+]' in smiles_list[i][0]['CanonicalSMILES']: #checking if ammonium group is present in smiles string

            temp_list.append(smiles_list[i][0])
    