### PubChem

PubChem is a database of chemical molecules maintained by the National Center for Biotechnology Information (NCBI). It comprises three databases: BioAssay, Compound and Substance. Our data were retrieved from the Compound DB, which provides, for each molecule, information about its chemical and physical properties together with links to other related DBs. We were interested in some basic information about each compound (i.e. Compound ID, SMILES, Name) and in the associated Pharmacologic Actions terms from the MeSH Ontology. [Example: Aspirin](https://pubchem.ncbi.nlm.nih.gov/compound/2244#section=Top)

<img src="mesh_aspirin.png">


Every NCBI database can be accessed programmatically through its API, the E-utilities. In essence, one can search a DB and retrieve information, as an XML response, through URL calls.

The following shows how the information for an example compound (Aspirin, CID: 2244) was retrieved.

In [5]:
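import os
import sys

# Make the project root importable so the local `preprocess` package is found.
parent_path = os.path.abspath(os.path.join('..'))
if parent_path not in sys.path:
    sys.path.append(parent_path)

# Helper functions wrapping the E-utility calls used below.
from preprocess.pubchem_api import *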

All E-utility calls share the same base URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
    
First, the ESearch utility is used to retrieve the list of IDs of the compounds matching a specific query. For our purpose, we specified the "pccompound_mesh_pharm" filter, which selects only compounds annotated with MeSH Pharmacological Actions. Because a large number of IDs is retrieved (up to 10000), the ESearch results are stored on the History server and accessed through the query_key and WebEnv variables.

Then, using ESummary, an XML summary is retrieved with information for each compound in the ID list resulting from the previous search.
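
The implementation of the `e_search` and `get_summary` helpers is not shown in this notebook, so the following is only a minimal sketch of what the two URL calls could look like. The function names `build_url` and `parse_history` are hypothetical, and the sample response is hand-written with made-up values; only the E-utility parameters (`db`, `term`, `usehistory`, `query_key`, `WebEnv`) come from the E-utilities themselves.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(utility, **params):
    """Compose an E-utility URL (hypothetical helper, not the notebook's own)."""
    return BASE_URL + utility + "?" + urlencode(params)


def parse_history(esearch_xml):
    """Extract (query_key, web_env) from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


# Step 1: ESearch with usehistory=y keeps the matching IDs on the History server.
search_url = build_url(
    "esearch.fcgi",
    db="pccompound",
    term="2244[uid]pccompound_mesh_pharm[filt]",
    usehistory="y",
)

# Hand-written stand-in for an ESearch response (values are made up).
sample_response = """<eSearchResult>
  <Count>1</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</eSearchResult>"""
query_key, web_env = parse_history(sample_response)

# Step 2: ESummary reads the stored ID list back through query_key and WebEnv.
summary_url = build_url(
    "esummary.fcgi", db="pccompound", query_key=query_key, WebEnv=web_env
)
print(search_url)
print(summary_url)
```

In a real run, the two URLs would be fetched (for example with `urllib.request.urlopen`) and the first response passed to `parse_history` in place of the sample string.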

In [6]:
db = 'pccompound'
query = '2244[uid]pccompound_mesh_pharm[filt]'

query_key, web_env = e_search(db, query)
summary = get_summary(query_key, web_env, db)
print(summary)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary pccompound 20170720//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20170720/esummary_pccompound.dtd">
<eSummaryResult>
<DocumentSummarySet status="OK">
<DbBuild>Build180211-0815m.1</DbBuild>

<DocumentSummary uid="2244">
	<CID>2244</CID>
	<SourceNameList>
	</SourceNameList>
	<SourceCategoryList>
		<string>Chemical Vendors</string>
		<string>Research and Development</string>
		<string>Curation Efforts</string>
		<string>Governmental Organizations</string>
		<string>NIH Initiatives</string>
		<string>Subscription Services</string>
		<string>Journal Publishers</string>
		<string>Legacy Depositors</string>
	</SourceCategoryList>
	<CreateDate>2004/09/16 00:00</CreateDate>
	<SynonymList>
		<string>aspirin</string>
		<string>ACETYLSALICYLIC ACID</string>
		<string>2-Acetoxybenzoic acid</string>
		<string>50-78-2</string>
		<string>2-(Acetyloxy)benzoic acid</string>
		<string>Acetylsalicylate</strin

As can be seen, the summary also contains a "PharmActionList" node, but it lists many more terms than those actually shown on the PubChem page (Anti-Inflammatory Agents, Non-Steroidal - Fibrinolytic Agents - Antipyretics - Cyclooxygenase Inhibitors):

```xml
<PharmActionList>
		<string>Cyclooxygenase Inhibitors</string>
		<string>Pharmacologic Actions</string>
		<string>Chemical Actions and Uses</string>
		<string>Enzyme Inhibitors</string>
		<string>Molecular Mechanisms of Pharmacological Action</string>
		<string>Analgesics</string>
		<string>Cardiovascular Agents</string>
		<string>Hematologic Agents</string>
		<string>Peripheral Nervous System Agents</string>
		<string>Platelet Aggregation Inhibitors</string>
		<string>Analgesics, Non-Narcotic</string>
		<string>Anti-Inflammatory Agents</string>
		<string>Anti-Inflammatory Agents, Non-Steroidal</string>
		<string>Antipyretics</string>
		<string>Antirheumatic Agents</string>
		<string>Fibrin Modulating Agents</string>
		<string>Fibrinolytic Agents</string>
		<string>Physiological Effects of Drugs</string>
		<string>Sensory System Agents</string>
		<string>Therapeutic Uses</string>
</PharmActionList>
```
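
As an illustration, the terms in this node could be pulled out of the summary with the standard library XML parser. This is a sketch on a hand-trimmed copy of the response shown above, and `pharm_actions` is a hypothetical helper rather than part of `preprocess.pubchem_api`.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed copy of the ESummary structure shown above (three terms only).
summary_xml = """<eSummaryResult>
<DocumentSummarySet status="OK">
<DocumentSummary uid="2244">
  <CID>2244</CID>
  <PharmActionList>
    <string>Cyclooxygenase Inhibitors</string>
    <string>Pharmacologic Actions</string>
    <string>Antipyretics</string>
  </PharmActionList>
</DocumentSummary>
</DocumentSummarySet>
</eSummaryResult>"""


def pharm_actions(xml_text):
    """Map each compound uid to the terms in its PharmActionList node."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.iter("DocumentSummary"):
        terms = [node.text for node in doc.findall("PharmActionList/string")]
        result[doc.get("uid")] = terms
    return result


print(pharm_actions(summary_xml))
```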

To retrieve only the terms directly linked to a compound, it is necessary to follow the link to the MeSH DB. For each CID, the MeSH IDs of the associated terms are retrieved. With each ID, it is then possible to get the term name and the associated tree numbers.

In [10]:
print(get_mesh_id('2244'))
print(get_mesh_info_from_cid('2244'))

['68058633', '68016861', '68010975', '68005343', '68000894']
(['D27.505.696.068', 'D27.505.519.389.310', 'D27.505.696.663.850.014.040.500.500', 'D27.505.954.158.030.500', 'D27.505.954.329.030.500', 'D27.505.954.502.780', 'D27.505.519.421.750', 'D27.505.954.411.320', 'D27.505.954.502.427', 'D27.505.696.663.850.014.040.500', 'D27.505.954.158.030', 'D27.505.954.329.030'], ['Antipyretics', 'Cyclooxygenase Inhibitors', 'Platelet Aggregation Inhibitors', 'Fibrinolytic Agents', 'Anti-Inflammatory Agents, Non-Steroidal'])
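
The internals of `get_mesh_id` are not shown here. One plausible way to obtain the linked MeSH IDs is the ELink utility, sketched below. The helper names `elink_url` and `linked_ids` are hypothetical, the `linkname` value is an assumption borrowed from the filter name used earlier, and the sample response is hand-written, reusing two of the IDs printed above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(cid, linkname="pccompound_mesh_pharm"):
    """URL asking ELink for MeSH records linked to a PubChem compound.

    The default linkname is an assumption, not confirmed by the notebook.
    """
    params = {"dbfrom": "pccompound", "db": "mesh", "id": cid, "linkname": linkname}
    return BASE_URL + "elink.fcgi?" + urlencode(params)


def linked_ids(elink_xml):
    """Collect the target IDs listed under LinkSet/LinkSetDb/Link/Id."""
    root = ET.fromstring(elink_xml)
    return [node.text for node in root.findall("./LinkSet/LinkSetDb/Link/Id")]


# Hand-written stand-in for an ELink response.
sample = """<eLinkResult><LinkSet>
  <DbFrom>pccompound</DbFrom>
  <IdList><Id>2244</Id></IdList>
  <LinkSetDb>
    <DbTo>mesh</DbTo>
    <LinkName>pccompound_mesh_pharm</LinkName>
    <Link><Id>68058633</Id></Link>
    <Link><Id>68016861</Id></Link>
  </LinkSetDb>
</LinkSet></eLinkResult>"""

print(elink_url("2244"))
print(linked_ids(sample))
```

Each returned ID could then be looked up in the `mesh` database to obtain the term name and tree numbers, which is what `get_mesh_info_from_cid` returns in the cell above.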


This procedure was iterated in batches of 10k UIDs (the maximum allowed by the ESearch utility) to retrieve the roughly 15k compounds with pharmacological annotations. The MeSH linking step is the bottleneck: it is very slow, and retrieving all the data took about 6 hours.
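
The batching itself can be sketched as a generator of `retstart`/`retmax` windows, the two E-utility paging parameters. The name `batch_windows` is hypothetical, and 15,000 is only the approximate total quoted above.

```python
def batch_windows(total, batch_size=10_000):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


# Roughly 15k compounds fetched in batches of at most 10k.
print(list(batch_windows(15_000)))
# → [(0, 10000), (10000, 5000)]
```

Each pair would be passed as `retstart` and `retmax` to the corresponding E-utility call. NCBI asks clients without an API key to stay below three requests per second, so a per-compound linking loop is necessarily slow.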