# PubChem Molecular Formula Search example

This notebook shows how to use `pubchem_api_crawler` to fetch the SMILES and other properties of PubChem compounds with a `MolecularFormulaSearch`.

In [1]:
%load_ext autoreload
%autoreload 2

We start by configuring logging so that the crawler's progress messages are displayed.

In [2]:
import logging

logger = logging.getLogger('pubchem_api_crawler')
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(ch)

In [3]:
from pubchem_api_crawler.molecular_search import MolecularFormulaSearch

Using `MolecularFormulaSearch`, we can look for compounds containing a certain combination of elements, and specify which properties we want to retrieve for them.

We'll start with a fairly restrictive query that fetches all compounds containing only carbon, hydrogen, boron and aluminium. Each element is given a wide range of 1 to 200 atoms so that no matching compound is missed.

In [13]:
mf = MolecularFormulaSearch()
df = mf.search(["C1-200", "H1-200", "B1-200", "Al1-200"], allow_other_elements=False, properties=["IUPACName", "MolecularFormula", "MolecularWeight", "CanonicalSMILES"])

2024-01-31 10:23:27,985 - pubchem_api_crawler.molecular_search - INFO - Executing Molecular Formula Search query
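
The element ranges passed to `search` above are plain strings of the form `<symbol><min>-<max>`. As a minimal sketch, they could be generated rather than typed by hand; `formula_ranges` is a hypothetical helper written for this illustration and is not part of `pubchem_api_crawler`.

```python
def formula_ranges(elements, low=1, high=200):
    """Build range strings such as "C1-200" for each element symbol.

    Hypothetical helper for illustration; not part of pubchem_api_crawler.
    """
    if low < 0 or high < low:
        raise ValueError("expected 0 <= low <= high")
    return [f"{symbol}{low}-{high}" for symbol in elements]


ranges = formula_ranges(["C", "H", "B", "Al"])
print(ranges)  # ['C1-200', 'H1-200', 'B1-200', 'Al1-200']
```

The resulting list has the same shape as the first argument of the `mf.search(...)` call above.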


In [14]:
df

CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
168084494,CH5AlB2,69.68,[BH].[BH].C[Al],
163556649,C16H14AlB,244.1,[B]CCC1=C2CCC=CC2=C(C3=CC=CC=C31)[Al],
161576177,C27H30AlB,392.3,[H+].[B-](C1=CC=CC=C1)(C2=CC=CC=C2)(C3=CC=CC=C...,hydron;tetraphenylboranuide;trimethylalumane
160469542,C8H26Al2B2,197.9,[B](C)C.[B](C)C.C[AlH]C.C[AlH]C,
160352291,C6H15AlB,124.98,[B].CC[Al](CC)CC,
159970515,C20H48Al2B2,364.2,B(C)(CCB(C)CCC)CCC.CCC[Al](C)CC[Al](C)CCC,methyl-[2-[methyl(propyl)alumanyl]ethyl]-propy...
159123289,C10H28AlB2,196.9,[B](C)C.[B](C)C.CCCC.C[Al]C,
158802573,C11H29AlB,199.14,B(C)(C)C.CCCC.CC[Al]CC,
158250967,C3H9AlB,82.9,[B].C[Al](C)C,
158044531,C2H6AlB,67.86,[B].C[Al]C,
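
As a quick sanity check on a restrictive search, one could parse each `MolecularFormula` and confirm that it contains only the requested elements. The sketch below uses only the standard library and a few formulas copied from the table above; `parse_formula` is a hypothetical helper, and it assumes simple formulas without parentheses (a trailing charge sign is ignored).

```python
import re

# An element symbol is one uppercase letter, optionally followed by a lowercase one,
# then an optional count (a missing count means 1).
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return a dict mapping element symbol to atom count for a simple formula."""
    counts = {}
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


allowed = {"C", "H", "B", "Al"}
formulas = ["CH5AlB2", "C16H14AlB", "C8H26Al2B2"]  # taken from the results above

for formula in formulas:
    counts = parse_formula(formula)
    # Every element in the formula must belong to the allowed set.
    assert set(counts) <= allowed, formula

print(parse_formula("C8H26Al2B2"))  # {'C': 8, 'H': 26, 'Al': 2, 'B': 2}
```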


Setting `allow_other_elements=True` also returns compounds that contain additional elements alongside the ones listed, so the search fetches more results.

In [15]:
df = mf.search(["C1-200", "H1-200", "B1-200", "Al1-200"], allow_other_elements=True, properties=["IUPACName", "MolecularFormula", "MolecularWeight", "CanonicalSMILES"])

2024-01-31 10:24:17,279 - pubchem_api_crawler.molecular_search - INFO - Executing Molecular Formula Search query


In [16]:
df

CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
73556239,C6H18Al2B3Cl3O9,427.0,B(OC)(OC)O[Al](OB(OC)OC)Cl.B(OC)(OC)O[Al](Cl)Cl,[chloro(dimethoxyboranyloxy)alumanyl] dimethyl...
167661984,C103H105AlBClF6LiN3NaO27S7,2258.6,[H-].[Li+].B(C1=CC(=CC=C1)CN)(O)O.CCOC(=O)CC1=...,
167652838,C46H43AlBF26N3O4,1233.6,[B].CC1C(C(=O)N(C1=O)CCCCC(C(C(C(C(C(F)(F)F)(F...,
167652450,C181H180AlBBr2F18LiN22O27,3642,[H-].[Li+].B1(OC(C(O1)(C)C)(C)C)C2=CC3=C(C=C2)...,
167651617,C130H131AlBClF9Li2N4O35S3,2663.8,[H-].[Li+].[Li+].B(C1=C(C(=CC=C1)CN)F)(O)O.CCO...,
...,...,...,...,...
160882943,C12H30Al2BN3Si,309.25,B1(C[Si](N1)(C(C)(C)C)C(C)(C)C)N2[Al](N([Al]2C...,"2,2-ditert-butyl-4-(2,3,4-trimethyl-1,3,2,4-di..."
163835366,C27H44AlBN8O7,630.5,B(C(CC(C)C)NC(=O)CNC(=O)CC1=CC(CC=C1)NC(=O)C2=...,[3-methyl-1-[[2-[[2-[3-[[1-methyl-4-[2-(methyl...
166171265,C25H14AlBN2O8,508.2,[B]N1C(=O)C2=C(C1=O)C=C(C=C2)C(=O)OC3=CC(=C(C=...,
167577056,C24H23AlBN3O4S,487.3,[B]C1=C(C(=C(C(=[Al]1)C)S(=O)(=O)NC(=O)C2=C(C=...,


If a search returns many results, it might time out. For example, searching for all compounds containing only carbon and hydrogen returns too many results to complete synchronously. Use the `_async=True` parameter to perform the search asynchronously.

In [22]:
df = mf.search(["C1-200", "H1-200"], allow_other_elements=False, properties=["IUPACName", "MolecularFormula", "MolecularWeight", "CanonicalSMILES"], _async=True)

2024-01-31 10:28:37,995 - pubchem_api_crawler.molecular_search - INFO - Executing Molecular Formula Async Search query
2024-01-31 10:28:49,166 - pubchem_api_crawler.molecular_search - INFO - Checking status for query 27351309396844755.
2024-01-31 10:28:52,800 - pubchem_api_crawler.molecular_search - INFO - Query is done.
2024-01-31 10:28:52,802 - pubchem_api_crawler.molecular_search - INFO - Retrieving async search results.
 22%|███████████████████████▍                                                                                | 9/40 [02:41<09:14, 17.89s/it]
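
The log lines above suggest a submit, poll, then retrieve pattern. The following is a generic sketch of the polling step only, not the library's actual implementation: `poll_until_done`, the `"running"`/`"done"` status values, and the stand-in status function are all assumptions made for illustration.

```python
import time


def poll_until_done(check_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call check_status() until it returns "done"; return the number of attempts made.

    Hypothetical helper: check_status stands in for a real status request.
    """
    for attempt in range(1, max_attempts + 1):
        if check_status() == "done":
            return attempt
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"query not finished after {max_attempts} attempts")


# Stand-in for a real status request: reports "running" twice, then "done".
_responses = iter(["running", "running", "done"])
attempts = poll_until_done(lambda: next(_responses), interval=0.0)
print(attempts)  # 3
```

Capping the number of attempts keeps a stuck query from blocking the notebook indefinitely.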


In [23]:
df

CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
169546171,C11H18,150.26,CC(C)C12CCC(C1)(C=C2)C,"(1S,4S)-1-methyl-4-propan-2-ylbicyclo[2.2.1]he..."
169544981,C27H20,344.4,C1C=CC=C2C1=CCC3=CC4=C(C=CC4)C=C5C=CC6=CC=CC=C...,"hexacyclo[17.8.0.02,11.03,8.013,17.022,27]hept..."
169544862,C14H12,180.24,C1CC2=CC=C3C2=C4C1=CC=C4CC3,"tetracyclo[5.5.2.04,13.010,14]tetradeca-1,3,7,..."
169544542,C15H24,204.35,CC1=CCC2(CCCC(C23C1C3)(C)C)C,"(4aR,8aR)-2,4a,8,8-tetramethyl-1,1a,4,5,6,7-he..."
12368920,C35H22,442.5,C1C2=CC=CC=C2C3=C(C4=C(C5=CC=CC6=C5C4=CC=C6)C(...,"3,13-diphenylhexacyclo[13.7.1.02,14.04,12.05,1..."
...,...,...,...,...
881,CH3-,15.035,[CH3-],carbanide
573,C40H56,536.9,CC1=C(C(CCC1)(C)C)C=CC(=CC=CC(=CC=CC=C(C)C=CC=...,"1,3,3-trimethyl-2-[3,7,12,16-tetramethyl-18-(2..."
356,C8H18,114.23,CCCCCCCC,octane
297,CH4,16.043,C,methane
