# PubChem API Crawler

This notebook shows how to use `pubchem_api_crawler` to search for compounds on PubChem and retrieve their properties.

We first set up some logging.

In [1]:
import logging

logger = logging.getLogger('pubchem_api_crawler')
logger.setLevel(logging.INFO)
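# Named `handler` so it is not confused with the `ch` dataframe created later in the notebook.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)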

## Molecular Formula Search

Using `MolecularFormulaSearch`, we can look for compounds containing a certain combination of elements, and specify which properties we want to retrieve for them.

We'll start with a fairly restrictive query that fetches all compounds containing only carbon, hydrogen, boron and aluminium. We allow a wide range of 1 to 200 atoms for each element to make sure we don't miss any.

In [2]:
from pubchem_api_crawler.molecular_search import MolecularFormulaSearch

In [3]:
mf = MolecularFormulaSearch()
mf.search(["C1-200", "H1-200", "B1-200", "Al1-200"], allow_other_elements=False, properties=["IUPACName", "MolecularFormula", "MolecularWeight", "CanonicalSMILES"])

2024-02-01 17:37:16,017 - pubchem_api_crawler.molecular_search - INFO - Executing Molecular Formula Search query


Unnamed: 0,CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
0,168084494,CH5AlB2,69.68,[BH].[BH].C[Al],
1,163556649,C16H14AlB,244.1,[B]CCC1=C2CCC=CC2=C(C3=CC=CC=C31)[Al],
2,161576177,C27H30AlB,392.3,[H+].[B-](C1=CC=CC=C1)(C2=CC=CC=C2)(C3=CC=CC=C...,hydron;tetraphenylboranuide;trimethylalumane
3,160469542,C8H26Al2B2,197.9,[B](C)C.[B](C)C.C[AlH]C.C[AlH]C,
4,160352291,C6H15AlB,124.98,[B].CC[Al](CC)CC,
5,159970515,C20H48Al2B2,364.2,B(C)(CCB(C)CCC)CCC.CCC[Al](C)CC[Al](C)CCC,methyl-[2-[methyl(propyl)alumanyl]ethyl]-propy...
6,159123289,C10H28AlB2,196.9,[B](C)C.[B](C)C.CCCC.C[Al]C,
7,158802573,C11H29AlB,199.14,B(C)(C)C.CCCC.CC[Al]CC,
8,158250967,C3H9AlB,82.9,[B].C[Al](C)C,
9,158044531,C2H6AlB,67.86,[B].C[Al]C,
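
The element strings passed to `search` above follow a `<symbol><min>-<max>` pattern. As a small convenience, the list can be built programmatically. The sketch below is only an illustration: `element_ranges` is a hypothetical helper, not part of `pubchem_api_crawler`, and the string format is inferred from the query shown above.

```python
from typing import Iterable, List


def element_ranges(symbols: Iterable[str], low: int = 1, high: int = 200) -> List[str]:
    """Build strings such as 'C1-200' for each element symbol (hypothetical helper)."""
    if low < 0 or high < low:
        raise ValueError("expected 0 <= low <= high")
    return [f"{symbol}{low}-{high}" for symbol in symbols]


# Same element list as in the query above.
elements = element_ranges(["C", "H", "B", "Al"])
print(elements)
```

The resulting list can be passed as the first argument of `mf.search`.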


Setting `allow_other_elements=True` will fetch more results.

In [4]:
mf.search(["C1-200", "H1-200", "B1-200", "Al1-200"], allow_other_elements=True, properties=["IUPACName", "MolecularFormula", "MolecularWeight", "CanonicalSMILES"])

2024-02-01 17:37:19,560 - pubchem_api_crawler.molecular_search - INFO - Executing Molecular Formula Search query


Unnamed: 0,CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
0,73556239,C6H18Al2B3Cl3O9,427.0,B(OC)(OC)O[Al](OB(OC)OC)Cl.B(OC)(OC)O[Al](Cl)Cl,[chloro(dimethoxyboranyloxy)alumanyl] dimethyl...
1,167661984,C103H105AlBClF6LiN3NaO27S7,2258.6,[H-].[Li+].B(C1=CC(=CC=C1)CN)(O)O.CCOC(=O)CC1=...,
2,167652838,C46H43AlBF26N3O4,1233.6,[B].CC1C(C(=O)N(C1=O)CCCCC(C(C(C(C(C(F)(F)F)(F...,
3,167652450,C181H180AlBBr2F18LiN22O27,3642,[H-].[Li+].B1(OC(C(O1)(C)C)(C)C)C2=CC3=C(C=C2)...,
4,167651617,C130H131AlBClF9Li2N4O35S3,2663.8,[H-].[Li+].[Li+].B(C1=C(C(=CC=C1)CN)F)(O)O.CCO...,
...,...,...,...,...,...
1714,160882943,C12H30Al2BN3Si,309.25,B1(C[Si](N1)(C(C)(C)C)C(C)(C)C)N2[Al](N([Al]2C...,"2,2-ditert-butyl-4-(2,3,4-trimethyl-1,3,2,4-di..."
1715,163835366,C27H44AlBN8O7,630.5,B(C(CC(C)C)NC(=O)CNC(=O)CC1=CC(CC=C1)NC(=O)C2=...,[3-methyl-1-[[2-[[2-[3-[[1-methyl-4-[2-(methyl...
1716,166171265,C25H14AlBN2O8,508.2,[B]N1C(=O)C2=C(C1=O)C=C(C=C2)C(=O)OC3=CC(=C(C=...,
1717,167577056,C24H23AlBN3O4S,487.3,[B]C1=C(C(=C(C(=[Al]1)C)S(=O)(=O)NC(=O)C2=C(C=...,
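
To make the effect of `allow_other_elements` concrete, here is a minimal local sketch of the filter as we understand it from the two result sets above. The actual filtering happens on PubChem's side; `element_counts` and `matches` are hypothetical names used only for this illustration, and the sketch ignores the atom-count ranges.

```python
import re
from typing import Dict, Iterable


def element_counts(formula: str) -> Dict[str, int]:
    """Parse a formula such as 'C6H18Al2B3Cl3O9' into {element: count}."""
    counts: Dict[str, int] = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
    return counts


def matches(formula: str, required: Iterable[str], allow_other_elements: bool) -> bool:
    """True if every required element is present and, unless other
    elements are allowed, no additional element appears."""
    present = set(element_counts(formula))
    wanted = set(required)
    if not wanted <= present:
        return False
    return allow_other_elements or present <= wanted


required = ["C", "H", "B", "Al"]
print(matches("CH5AlB2", required, allow_other_elements=False))
print(matches("C6H18Al2B3Cl3O9", required, allow_other_elements=False))
print(matches("C6H18Al2B3Cl3O9", required, allow_other_elements=True))
```

The first formula comes from the restrictive query, the second from the query with `allow_other_elements=True`, which also admits chlorine and oxygen.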


If a search returns many results, it might time out (the limit is 30 s). For example, searching for all compounds containing only carbon and hydrogen returns too many results to complete synchronously. Use the `_async=True` parameter to perform the search asynchronously.

In [7]:
ch = mf.search(["C1-200", "H1-200"], allow_other_elements=False, properties=["IUPACName", "MolecularFormula", "MolecularWeight", "CanonicalSMILES"], _async=True)
ch

2024-02-01 17:38:53,302 - pubchem_api_crawler.molecular_search - INFO - Executing Molecular Formula Async Search query
2024-02-01 17:39:04,473 - pubchem_api_crawler.molecular_search - INFO - Checking status for query 3547665118550663911.
2024-02-01 17:39:06,765 - pubchem_api_crawler.molecular_search - INFO - Query is done.
2024-02-01 17:39:06,767 - pubchem_api_crawler.molecular_search - INFO - Retrieving async search results.
2024-02-01 17:42:19,285 - pubchem_api_crawler.molecular_search - INFO - Retrieved all results.
 22%|███████████████████████▍                                                                                | 9/40 [03:12<11:03, 21.39s/it]


Unnamed: 0,CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
0,169546171,C11H18,150.26,CC(C)C12CCC(C1)(C=C2)C,"(1S,4S)-1-methyl-4-propan-2-ylbicyclo[2.2.1]he..."
1,169544981,C27H20,344.4,C1C=CC=C2C1=CCC3=CC4=C(C=CC4)C=C5C=CC6=CC=CC=C...,"hexacyclo[17.8.0.02,11.03,8.013,17.022,27]hept..."
2,169544862,C14H12,180.24,C1CC2=CC=C3C2=C4C1=CC=C4CC3,"tetracyclo[5.5.2.04,13.010,14]tetradeca-1,3,7,..."
3,169544542,C15H24,204.35,CC1=CCC2(CCCC(C23C1C3)(C)C)C,"(4aR,8aR)-2,4a,8,8-tetramethyl-1,1a,4,5,6,7-he..."
4,12368920,C35H22,442.5,C1C2=CC=CC=C2C3=C(C4=C(C5=CC=CC6=C5C4=CC=C6)C(...,"3,13-diphenylhexacyclo[13.7.1.02,14.04,12.05,1..."
...,...,...,...,...,...
499517,881,CH3-,15.035,[CH3-],carbanide
499518,573,C40H56,536.9,CC1=C(C(CCC1)(C)C)C=CC(=CC=CC(=CC=CC=C(C)C=CC=...,"1,3,3-trimethyl-2-[3,7,12,16-tetramethyl-18-(2..."
499519,356,C8H18,114.23,CCCCCCCC,octane
499520,297,CH4,16.043,C,methane
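
The log lines above ("Checking status for query ...", "Query is done.") suggest a submit-then-poll pattern. The sketch below shows what such a polling loop generally looks like. It is not the library's implementation: `wait_until_done`, the `"done"` status string, and the injected `check_status` and `sleep` callables are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_done(
    check_status: Callable[[], str],
    interval: float = 2.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Call check_status() until it returns 'done'; return the seconds waited.

    Raises TimeoutError once max_wait seconds have been spent waiting.
    """
    waited = 0.0
    while True:
        if check_status() == "done":
            return waited
        if waited >= max_wait:
            raise TimeoutError(f"query still running after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Simulated status endpoint: running twice, then done (no real waiting here).
statuses = iter(["running", "running", "done"])
waited = wait_until_done(lambda: next(statuses), interval=2.0, sleep=lambda s: None)
print(waited)
```

Injecting `sleep` keeps the loop easy to test; in real use the default `time.sleep` would apply.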


## Experimental Properties

When using PubChem's REST API, you can only retrieve computed compound properties (the list is available [here](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=Compound-Property-Tables)).
If you want to retrieve experimental property annotations, you can use the `Annotations` class of `pubchem_api_crawler`. The list of annotation headings (and their types) for which PubChem has data is available [here](https://pubchem.ncbi.nlm.nih.gov/rest/pug/annotations/headings/JSON).

In [8]:
from pubchem_api_crawler.annotations import Annotations

an = Annotations()

#### Getting annotations for specific compounds

Using PubChem API Crawler, you can get annotations for specific compounds using their CIDs. However, PubChem does not provide a batch method to get annotations for multiple compounds, so we need to send one request per compound, which takes a long time if you request annotations for many compounds.

In [15]:
an.get_compound_annotations(ch.tail())

2024-02-01 17:47:55,928 - pubchem_api_crawler.annotations - INFO - Getting Experimental Properties annotations for compounds.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.57it/s]


Unnamed: 0_level_0,Physical Description,Color/Form,Color/Form,Odor,Odor,Boiling Point,Boiling Point,Melting Point,Melting Point,Flash Point,...,Chemical Classes,CID,Taste,Taste,Henry's Law Constant,Henry's Law Constant,Stability/Shelf Life,Stability/Shelf Life,Relative Evaporation Rate,Relative Evaporation Rate
Unnamed: 0_level_1,Value,Value,Reference,Value,Reference,Value,Reference,Value,Reference,Value,...,Value,Unnamed: 13_level_1,Value,Reference,Value,Reference,Value,Reference,Value,Reference
0,N-octane is a colorless liquid with an odor of...,Colorless liquid,"Lewis, R.J. Sr.; Hawley's Condensed Chemical D...",Gasoline-like,NIOSH. NIOSH Pocket Guide to Chemical Hazards....,"258.1 °F at 760 mmHg (USCG, 1999)",U.S. Coast Guard. 1999. Chemical Hazard Respon...,"-70.2 °F (USCG, 1999)",U.S. Coast Guard. 1999. Chemical Hazard Respon...,"56 °F (USCG, 1999)",...,"Solvents -> Aliphatics, Saturated (<C12)",356,,,,,,,,
1,Gas or Vapor; Liquid,Clear liquid,"Lewis, R.J. Sr. (ed) Sax's Dangerous Propertie...",,,125.62 °C,"Haynes, W.M. (ed.). CRC Handbook of Chemistry ...",-56.73 °C,"Haynes, W.M. (ed.). CRC Handbook of Chemistry ...",56 °F (13 °C) (Closed cup),...,,356,,,,,,,,
2,Colorless liquid with a gasoline-like odor; [N...,,,,,125.00 to 126.00 °C. @ 760.00 mm Hg,The Good Scents Company Information System,-56.8 °C,,72 °F (22 °C) (Open cup),...,,356,,,,,,,,
3,Liquid,,,,,126 °C,,-56.8 °C,,13 °C c.c.,...,,356,,,,,,,,
4,COLOURLESS LIQUID WITH CHARACTERISTIC ODOUR.,,,,,258 °F,,-70 °F,,56 °F,...,,356,,,,,,,,
5,Colorless liquid with a gasoline-like odor.,,,,,258 °F,,-70 °F,,56 °F,...,,356,,,,,,,,
6,Colorless liquid with a gasoline-like odor.,,,,,,,,,,...,,356,,,,,,,,
0,Methane is a colorless odorless gas. It is als...,Colorless gas,"O'Neil, M.J. (ed.). The Merck Index - An Encyc...",Odorless,"O'Neil, M.J. (ed.). The Merck Index - An Encyc...","-258 °F at 760 mmHg (USCG, 1999)",U.S. Coast Guard. 1999. Chemical Hazard Respon...,"-296 °F (USCG, 1999)",U.S. Coast Guard. 1999. Chemical Hazard Respon...,"-306 °F (NTP, 1992)",...,Toxic Gases & Vapors -> Simple Asphyxiants,297,Tasteless,"Lewis, R.J. Sr.; Hawley's Condensed Chemical D...",,,,,,
1,Methane is a colorless odorless gas. It is als...,,,Weak odor,"U.S. Coast Guard, Department of Transportation...","-258.7 °F at 760 mmHg (NTP, 1992)","National Toxicology Program, Institute of Envi...","-296.5 °F (NTP, 1992)","National Toxicology Program, Institute of Envi...",-306 °F,...,,297,,,,,,,,
2,"Methane, refrigerated liquid (cryogenic liquid...",,,,,-161.50 °C,"Haynes, W.M. (ed.). CRC Handbook of Chemistry ...",-182.566 °C,"Haynes, W.M. (ed.). CRC Handbook of Chemistry ...",-188 °C (-306 °F) - closed cup,...,,297,,,,,,,,
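
To get a feel for why one request per compound becomes slow, here is a back-of-the-envelope estimate. It assumes the rate of about 1.57 requests per second shown in the progress bar above holds for a long run (it was measured on only 5 requests), and that the hydrocarbon dataframe has 499,521 rows (inferred from its last index, 499520). `estimated_hours` is a hypothetical helper.

```python
def estimated_hours(n_compounds: int, requests_per_second: float) -> float:
    """Rough wall-clock time, in hours, for one request per compound."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return n_compounds / requests_per_second / 3600


# All hydrocarbons from the asynchronous search, at the observed rate.
hours = estimated_hours(499_521, 1.57)
print(f"about {hours:.0f} hours")
```

Under these assumptions the full set would take roughly 88 hours, which is why the example above only requests annotations for `ch.tail()`.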


#### Getting all data for a given annotation heading

If you're instead interested in all available data for a specific annotation heading, use the `get_annotations` method. For example, let's retrieve all the `Autoignition Temperature` values available on PubChem.

In [17]:
ait = an.get_annotations("Autoignition Temperature")
ait

2024-02-01 17:49:52,399 - pubchem_api_crawler.annotations - INFO - Getting Autoignition Temperature annotations.
2024-02-01 17:50:00,327 - pubchem_api_crawler.annotations - INFO - Fetching 1 additional result pages.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.75s/it]


Unnamed: 0,SourceName,SourceID,URL,Value,Reference,CID
0,Hazardous Substances Data Bank (HSDB),30,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/30,270 °C (518 °F),Fire Protection Guide to Hazardous Materials. ...,4510
1,Hazardous Substances Data Bank (HSDB),35,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/35,928 °F (498 °C),National Fire Protection Association; Fire Pr...,241
2,Hazardous Substances Data Bank (HSDB),37,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/37,871 °F (466 °C),National Fire Protection Association; Fire Pr...,2537
3,Hazardous Substances Data Bank (HSDB),39,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/39,772 °F (411 °C),Fire Protection Guide to Hazardous Materials. ...,7835
4,Hazardous Substances Data Bank (HSDB),40,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/40,867 °F (463 °C),National Fire Protection Association; Fire Pr...,176
...,...,...,...,...,...,...
1728,Hazardous Substances Data Bank (HSDB),8499,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/8499,491 °F (255 °C),National Fire Protection Association; Fire Pr...,18635
1729,ILO-WHO International Chemical Safety Cards (I...,0471,https://www.ilo.org/dyn/icsc/showcard.display?...,>500 °C,,5462310
1730,ILO-WHO International Chemical Safety Cards (I...,1747,https://www.ilo.org/dyn/icsc/showcard.display?...,215 °C,,787
1731,ILO-WHO International Chemical Safety Cards (I...,1779,https://www.ilo.org/dyn/icsc/showcard.display?...,188 °C,,8301


Using the unique CID of each compound, we can merge the annotations dataframe with our previous dataframe of all compounds on PubChem containing only carbon and hydrogen. This gives us every `Autoignition Temperature` value available on PubChem for those compounds.

In [18]:
ait.merge(ch, how="inner", on="CID")

Unnamed: 0,SourceName,SourceID,URL,Value,Reference,CID,MolecularFormula,MolecularWeight,CanonicalSMILES,IUPACName
0,Hazardous Substances Data Bank (HSDB),35,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/35,928 °F (498 °C),National Fire Protection Association; Fire Pr...,241,C6H6,78.11,C1=CC=CC=C1,benzene
1,Hazardous Substances Data Bank (HSDB),60,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/60,473 °F (245 °C),National Fire Protection Association; Fire Pr...,8078,C6H12,84.16,C1CCCCC1,cyclohexane
2,Hazardous Substances Data Bank (HSDB),62,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/62,682 °F (361 °C),National Fire Protection Association; Fire Pr...,9253,C5H10,70.13,C1CCCC1,cyclopentane
3,Hazardous Substances Data Bank (HSDB),63,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/63,410 °F (210 °C),National Fire Protection Association; Fire Pr...,15600,C10H22,142.28,CCCCCCCCCC,decane
4,Hazardous Substances Data Bank (HSDB),75,https://pubchem.ncbi.nlm.nih.gov/source/hsdb/75,761 °F (405 °C).,National Fire Protection Association; Fire Pr...,6403,C6H14,86.18,CCC(C)(C)C,"2,2-dimethylbutane"
...,...,...,...,...,...,...,...,...,...,...
220,ILO-WHO International Chemical Safety Cards (I...,1674,https://www.ilo.org/dyn/icsc/showcard.display?...,>450 °C,,6734,C12H10,154.21,C1CC2=CC=CC3=C2C1=CC=C3,"1,2-dihydroacenaphthylene"
221,ILO-WHO International Chemical Safety Cards (I...,1704,https://www.ilo.org/dyn/icsc/showcard.display?...,265 °C,,6616,C10H16,136.23,CC1(C2CCC(C2)C1=C)C,"2,2-dimethyl-3-methylidenebicyclo[2.2.1]heptane"
222,ILO-WHO International Chemical Safety Cards (I...,1714,https://www.ilo.org/dyn/icsc/showcard.display?...,449 °C,,11345,C12H18,162.27,CC(C)C1=CC=CC=C1C(C)C,"1,2-di(propan-2-yl)benzene"
223,ILO-WHO International Chemical Safety Cards (I...,1773,https://www.ilo.org/dyn/icsc/showcard.display?...,450 °C,,10041,C5H12,72.15,CC(C)(C)C,"2,2-dimethylpropane"
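
The `Value` column holds free-text strings in mixed units (for example `928 °F (498 °C)` or `>450 °C`), so some normalization is needed before any numeric analysis. The sketch below is one possible approach, not part of `pubchem_api_crawler`: `to_celsius` is a hypothetical helper that prefers a Celsius reading when one is present and otherwise converts from Fahrenheit. It drops qualifiers such as `>` and, for ranges, keeps only the number directly before the unit.

```python
import re
from typing import Optional

# A number (optionally negative or decimal) followed by a degree sign and C or F.
_TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*([CF])")


def to_celsius(value: str) -> Optional[float]:
    """Extract a temperature in °C from a free-text annotation value."""
    readings = _TEMPERATURE.findall(value)
    for number, unit in readings:
        if unit == "C":
            return float(number)
    for number, unit in readings:
        if unit == "F":
            return round((float(number) - 32) * 5 / 9, 1)
    return None  # no recognizable temperature


# Sample strings copied from the outputs in this notebook.
for text in ["928 °F (498 °C)", "270 °C (518 °F)", ">450 °C", "258 °F"]:
    print(text, "->", to_celsius(text))
```

With pandas, this could be applied to the merged dataframe via `df["Value"].map(to_celsius)`.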
