<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Website-Data" data-toc-modified-id="Website-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Website Data</a></span><ul class="toc-item"><li><span><a href="#Requests-Package" data-toc-modified-id="Requests-Package-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Requests Package</a></span></li><li><span><a href="#JSON-files" data-toc-modified-id="JSON-files-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>JSON files</a></span></li><li><span><a href="#Discussion:-What-could-go-wrong-if-this-approach-were-applied-to-a-new-compound?" data-toc-modified-id="Discussion:-What-could-go-wrong-if-this-approach-were-applied-to-a-new-compound?-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Discussion: What could go wrong if this approach were applied to a new compound?</a></span></li><li><span><a href="#Exercise:-Write-a-function-that-counts-the-number-of-C-H-bonds-in-ethanol" data-toc-modified-id="Exercise:-Write-a-function-that-counts-the-number-of-C-H-bonds-in-ethanol-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Exercise: Write a function that counts the number of C-H bonds in ethanol</a></span></li></ul></li><li><span><a href="#Application-Programming-Interfaces-(APIs)" data-toc-modified-id="Application-Programming-Interfaces-(APIs)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Application Programming Interfaces (APIs)</a></span><ul class="toc-item"><li><span><a href="#RESTful-API's" data-toc-modified-id="RESTful-API's-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>RESTful API's</a></span></li><li><span><a href="#Exercise:-Write-a-function-that-returns-the-CID-given-a-compound-name" data-toc-modified-id="Exercise:-Write-a-function-that-returns-the-CID-given-a-compound-name-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Exercise: Write a function that returns the CID given a compound name</a></span></li><li><span><a 
href="#Exercise:-Write-a-function-that-returns-the-SMILES-string-for-any-compound-based-on-CAS-number" data-toc-modified-id="Exercise:-Write-a-function-that-returns-the-SMILES-string-for-any-compound-based-on-CAS-number-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Exercise: Write a function that returns the SMILES string for any compound based on CAS number</a></span></li><li><span><a href="#Python-API's" data-toc-modified-id="Python-API's-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Python API's</a></span></li><li><span><a href="#Exercise:-Write-a-function-that-takes-an-arbitrary-chemical-name-or-CAS-number-and-returns-the-number-of-C-H-bonds." data-toc-modified-id="Exercise:-Write-a-function-that-takes-an-arbitrary-chemical-name-or-CAS-number-and-returns-the-number-of-C-H-bonds.-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Exercise: Write a function that takes an arbitrary chemical name or CAS number and returns the number of C-H bonds.</a></span></li></ul></li></ul></div>

# Online Data Access

A large amount of data is only available through the internet. There are many ways to access the data, but some are more convenient than others. In this lecture we will work with the PubChem website and database.

## Website Data

### Requests Package

Let's start by looking at the PubChem page for ethanol: https://pubchem.ncbi.nlm.nih.gov/compound/Ethanol

The main Python package for accessing online data is `requests`, which essentially makes HTTP requests for data. We can "request" data from this URL:

In [1]:
import requests

page = requests.get('https://pubchem.ncbi.nlm.nih.gov/compound/Ethanol')

In [2]:
#page.text

This is the raw text that describes the website; in this case it is HTML. It is possible to extract data directly from HTML, but it is challenging and tedious. Packages such as [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) can make the process much easier, but we won't cover them in this course.

### JSON files

An alternate approach is to use a more structured representation. Note that there is a "Download" button in the top right of the page. If you select "JSON" under "Data used to display this page" the result will be the `ethanol.json` file. We can load this with the Python `json` package:

In [3]:
import json

with open('data/ethanol.json') as f:
    etoh = json.load(f)

The loaded JSON data behaves like a Python dictionary, and can contain other dictionaries and lists within it:

In [4]:
#etoh

This is still pretty messy, but it's a little more organized. Working with JSON data can be challenging if there are many nested structures, headers, etc. It is very useful to use a visualization tool:

* [Code Beautify](https://codebeautify.org/jsonviewer)
* [Chrome Extension](https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh?hl=en-US)

From the visualizer we can see how to extract the information we need. Note the "search" feature!
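The visualizer's search can also be approximated in Python. The sketch below uses a hypothetical helper, `find_paths` (it is not part of the `json` package), that walks the nested dictionaries and lists and reports the path to every value equal to a target. The small `sample` dictionary is made up to stand in for the much larger PubChem record.

```python
def find_paths(obj, target, path=()):
    """Yield the key/index path to every value equal to `target`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_paths(value, target, path + (index,))
    elif obj == target:
        yield path


# Made-up miniature record, only for illustration
sample = {'Record': {'Section': [{'Information': [{'Value': {'String': 'CCO'}}]}]}}

print(list(find_paths(sample, 'CCO')))
# [('Record', 'Section', 0, 'Information', 0, 'Value', 'String')]
```

Each path lists, in order, the keys and indices needed to reach the value, so it can be copied into a chain of square brackets.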

Let's try to extract some information:

* SMILES string
* Molecular weight

In [5]:
SMILES = etoh['Record']['Section'][2]['Section'][1]['Section'][3]['Information'][0]['Value']['StringWithMarkup'][0]['String']
MW = etoh['Record']['Section'][3]['Section'][0]['Section'][0]['Information'][0]['Value']['Number'][0]
print('SMILES: {}'.format(SMILES))
print('Molecular Weight: {}'.format(MW))

SMILES: CCO
Molecular Weight: 46.07


### Discussion: What could go wrong if this approach were applied to a new compound?

The data is somewhat structured, but it is still very time consuming to find specific data. This is mainly due to the complex structure of the JSON file: the path to each value runs through many nested levels, and the hard-coded list indices may not point to the same property for a different compound. If there were fewer nested levels it would be much easier to find specific information. For example, consider this simpler JSON file that also contains data on ethanol (we will see how to get this later):

In [6]:
with open('data/ethanol_simple.json') as f:
    etoh_simple = json.load(f)
    
print(etoh_simple)

{'PC_Compounds': [{'id': {'id': {'cid': 702}}, 'atoms': {'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'element': [8, 6, 6, 1, 1, 1, 1, 1, 1]}, 'bonds': {'aid1': [1, 1, 2, 2, 2, 3, 3, 3], 'aid2': [2, 9, 3, 4, 5, 6, 7, 8], 'order': [1, 1, 1, 1, 1, 1, 1, 1]}, 'coords': [{'type': [1, 5, 255], 'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'conformers': [{'x': [3.732, 2.866, 2, 2.4675, 3.2646, 2.31, 1.4631, 1.69, 4.269], 'y': [0.25, -0.25, 0.25, -0.7249, -0.7249, 0.7869, 0.56, -0.2869, -0.06]}]}], 'charge': 0, 'props': [{'urn': {'label': 'Compound', 'name': 'Canonicalized', 'datatype': 5, 'release': '2019.01.04'}, 'value': {'ival': 1}}, {'urn': {'label': 'Compound Complexity', 'datatype': 7, 'implementation': 'E_COMPLEXITY', 'version': '3.4.6.11', 'software': 'Cactvs', 'source': 'xemistry.com', 'release': '2019.06.18'}, 'value': {'fval': 2.8}}, {'urn': {'label': 'Count', 'name': 'Hydrogen Bond Acceptor', 'datatype': 5, 'implementation': 'E_NHACCEPTORS', 'version': '3.4.6.11', 'software': 'Cactvs', 'source': 'x

We can extract the same information with significantly less effort:

In [7]:
SMILES = etoh_simple['PC_Compounds'][0]['props'][18]['value']['sval']
MW = etoh_simple['PC_Compounds'][0]['props'][17]['value']['fval']
print('SMILES: {}'.format(SMILES))
print('Molecular Weight: {}'.format(MW))

SMILES: CCO
Molecular Weight: 46.07
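The indices `17` and `18` above are tied to the order of the `props` list, which may differ for another compound or a later data release. A more defensive sketch is to search `props` by the `label` (and optionally `name`) stored in each `urn`. The helper `find_prop` is hypothetical, and the `'SMILES'`/`'Canonical'` label and name in the sample are assumptions about how that entry is tagged; the other two entries are copied from the printed output above. The sketch also assumes each `value` dictionary holds a single typed entry such as `sval`, `fval`, or `ival`.

```python
def find_prop(record, label, name=None):
    """Return the value of the first property whose urn matches label (and name)."""
    for prop in record['PC_Compounds'][0]['props']:
        urn = prop['urn']
        if urn.get('label') == label and (name is None or urn.get('name') == name):
            # Assumption: the value dict has exactly one typed entry ('sval', 'fval', 'ival', ...)
            return next(iter(prop['value'].values()))
    raise KeyError('no property with label {!r} and name {!r}'.format(label, name))


# Miniature record: the SMILES entry is an assumed layout, for illustration only
sample = {'PC_Compounds': [{'props': [
    {'urn': {'label': 'Compound', 'name': 'Canonicalized'}, 'value': {'ival': 1}},
    {'urn': {'label': 'Compound Complexity'}, 'value': {'fval': 2.8}},
    {'urn': {'label': 'SMILES', 'name': 'Canonical'}, 'value': {'sval': 'CCO'}},
]}]}

print(find_prop(sample, 'Compound Complexity'))    # 2.8
print(find_prop(sample, 'SMILES', 'Canonical'))    # CCO
```

Printing the `urn` of each entry in the real file is an easy way to check which labels and names are actually present before relying on them.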


### Exercise: Write a function that counts the number of C-H bonds in ethanol

Use the `ethanol_simple.json` file as input. You will need both `bonds` and `atoms` information. Note that `element` refers to the atomic number (e.g. hydrogen is `1`).

In [8]:
print(etoh_simple['PC_Compounds'][0]['bonds'])
print(etoh_simple['PC_Compounds'][0]['atoms'])

{'aid1': [1, 1, 2, 2, 2, 3, 3, 3], 'aid2': [2, 9, 3, 4, 5, 6, 7, 8], 'order': [1, 1, 1, 1, 1, 1, 1, 1]}
{'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'element': [8, 6, 6, 1, 1, 1, 1, 1, 1]}


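> Atoms 2 and 3 are carbon (`element` 6) and atoms 4 through 9 are hydrogen (`element` 1), so the number of C-H bonds is the number of bonds that join atom 2 or 3 to one of atoms 4-9. Atom 9 is bonded to the oxygen (atom 1), so it does not contribute.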
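One possible solution is sketched below. It looks up the atomic number of each atom ID rather than hard-coding which atoms are carbon or hydrogen. The function name `count_CH_from_record` is just a suggestion, and the inline `record` reproduces only the `atoms` and `bonds` entries printed above so that the example runs without the data file.

```python
def count_CH_from_record(record):
    """Count bonds that join a carbon (6) and a hydrogen (1) in a PC_Compounds record."""
    compound = record['PC_Compounds'][0]
    # Map each atom ID to its atomic number
    element = dict(zip(compound['atoms']['aid'], compound['atoms']['element']))
    bonds = compound['bonds']
    count = 0
    for a1, a2 in zip(bonds['aid1'], bonds['aid2']):
        # A set comparison accepts the pair in either order (C-H or H-C)
        if {element[a1], element[a2]} == {1, 6}:
            count += 1
    return count


record = {'PC_Compounds': [{
    'atoms': {'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9],
              'element': [8, 6, 6, 1, 1, 1, 1, 1, 1]},
    'bonds': {'aid1': [1, 1, 2, 2, 2, 3, 3, 3],
              'aid2': [2, 9, 3, 4, 5, 6, 7, 8],
              'order': [1, 1, 1, 1, 1, 1, 1, 1]},
}]}

print(count_CH_from_record(record))  # 5
```

With the file loaded as before, the same function can be called as `count_CH_from_record(etoh_simple)`.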

## Application Programming Interfaces (APIs)

APIs are like GUIs for experts. They are not limited to online data, or even data in general: API is a term for any programmatic interface that makes it easier to interact with more complex underlying code or data. They are particularly prevalent in data science because they make accessing data much less painful.

### RESTful API's

REST stands for "representational state transfer". Strictly speaking it is an architectural style rather than a protocol, and in practice it means that data can be accessed directly through a URL. This is a very common and very powerful approach because it allows the data provider to abstract the database back-end from the API. In other words, data providers can provide a uniform interface to data in relational (schema-driven) databases, schema-free databases, file servers, or services in any programming language. All the user needs to know is how to "query" from a URL. If you pay attention to URLs as you browse the web you will see that you use RESTful APIs all the time without knowing it!

<center>
<img src="images/RESTful.png" width="500">
</center>

RESTful API's are designed to return data in specific structures, and respond to specific queries that are embedded in the URL. A few notes:

* Many API's require a "key" or "token". This is to avoid spammers overloading their servers.
* Most API's also limit the amount of data per request, and the rate of requests.
* It is still necessary to understand the underlying structure of the data you are querying.
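Rate limits are easy to respect by pausing between requests. The sketch below is a minimal, hypothetical `Throttle` class (not part of `requests` or any PubChem tool) that enforces a minimum interval between calls. The interval used here is only a placeholder; the actual limit should be taken from the documentation of the API in question.

```python
import time


class Throttle:
    """Enforce a minimum time interval (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.05)  # placeholder value, not an official limit
for name in ['ethanol', 'methanol', 'propanol']:
    throttle.wait()
    # a request for `name` would go here
```

Calling `throttle.wait()` immediately before each request keeps a loop over many compounds from sending requests faster than the chosen interval.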

You should always start by reading the documentation of an API to learn what you can/can't do.

In this lecture we will work with the PubChem API:

[PubChem API tutorial documentation](http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial$_Toc458584421)

[PubChem API full documentation](http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest)

The nice thing about RESTful API's is that they can be accessed directly through HTTP requests. Let's try to find the compound identifier (CID) for ethanol.

First, we need to understand the structure of the query to decide how to search. From the documentation:

* prolog: `https://pubchem.ncbi.nlm.nih.gov/rest/pug`

* input: `/compound/name/ethanol`

* operation: `/cids`

* output: `/TXT`
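The four pieces are joined in this order to form the query URL. As a rough sketch, a small helper can assemble them; `build_pug_url` and its parameter names are made up for illustration, and `quote` from the standard library is used on the assumption that names containing spaces should be percent-encoded.

```python
from urllib.parse import quote

PROLOG = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'


def build_pug_url(identifier, operation='cids', output='TXT', namespace='name'):
    """Assemble prolog + input + operation + output into one query URL."""
    # quote() percent-encodes characters such as spaces so the identifier is safe in a URL path
    input_part = '/compound/{}/{}'.format(namespace, quote(identifier, safe=''))
    return '{}{}/{}/{}'.format(PROLOG, input_part, operation, output)


print(build_pug_url('ethanol'))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ethanol/cids/TXT
print(build_pug_url('acetic acid', operation='record', output='json'))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/acetic%20acid/record/json
```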

In [9]:
r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ethanol/cids/TXT')

In [10]:
print(r.text)

702



The "name" search is pretty flexible, and we can even search by CAS number. For example, the CAS number for ethanol is 64-17-5:

In [11]:
r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/64-17-5/cids/TXT')
print(r.text)

702



Note that if we ask for something that isn't there we get a 404 error that gives some insight into what went wrong:

In [12]:
r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/whiskey/cids/TXT')
print(r.text)

Status: 404
Code: PUGREST.NotFound
Message: No CID found
Detail: No CID found that matches the given name
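Because the error arrives as ordinary response text, code that expects a CID should check the HTTP status before using the result. The following is a minimal sketch, assuming the plain-text `/cids/TXT` output shown above. `parse_cids` is a hypothetical helper; it takes the status code and text as arguments so that it can be tried without a network connection.

```python
def parse_cids(status_code, text):
    """Convert a /cids/TXT response into a list of integer CIDs, or raise on failure."""
    if status_code != 200:
        # The body of a failed request carries the explanation (Status, Code, Message, ...)
        raise LookupError('request failed with status {}:\n{}'.format(status_code, text.strip()))
    # One CID per line; split() also discards the trailing blank lines
    return [int(line) for line in text.split()]


print(parse_cids(200, '702\n\n'))  # [702]

try:
    parse_cids(404, 'Status: 404\nCode: PUGREST.NotFound\nMessage: No CID found\n')
except LookupError as err:
    print(err)
```

With a live request this would be called as `parse_cids(r.status_code, r.text)`.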



### Exercise: Write a function that returns the CID given a compound name

In [13]:
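def returnCID(compound):
    # Insert the compound name (or CAS number) into the PUG REST query URL
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT'.format(compound))
    # Strip the trailing newlines from the plain-text response
    return r.text.strip()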

print(returnCID('ethanol'))

702



This is much easier, less memory intensive, and more robust than trying to extract the property from the JSON of the webpage (or the HTML).

An alternative intermediate strategy is to pull the full record using the API, then work with the resulting JSON for a single compound. The following function was written using the documentation for the PubChem RESTful interface:

In [14]:
def get_full(chemical):
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/record/json'.format(chemical))
    # r.json() parses the JSON body of the response into Python dictionaries and lists
    return r.json()

Let's use it to get information for ethanol:

In [15]:
etoh_json = get_full('ethanol')
print(etoh_json)

{'PC_Compounds': [{'id': {'id': {'cid': 702}}, 'atoms': {'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'element': [8, 6, 6, 1, 1, 1, 1, 1, 1]}, 'bonds': {'aid1': [1, 1, 2, 2, 2, 3, 3, 3], 'aid2': [2, 9, 3, 4, 5, 6, 7, 8], 'order': [1, 1, 1, 1, 1, 1, 1, 1]}, 'coords': [{'type': [1, 5, 255], 'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'conformers': [{'x': [3.732, 2.866, 2, 2.4675, 3.2646, 2.31, 1.4631, 1.69, 4.269], 'y': [0.25, -0.25, 0.25, -0.7249, -0.7249, 0.7869, 0.56, -0.2869, -0.06]}]}], 'charge': 0, 'props': [{'urn': {'label': 'Compound', 'name': 'Canonicalized', 'datatype': 5, 'release': '2019.01.04'}, 'value': {'ival': 1}}, {'urn': {'label': 'Compound Complexity', 'datatype': 7, 'implementation': 'E_COMPLEXITY', 'version': '3.4.6.11', 'software': 'Cactvs', 'source': 'xemistry.com', 'release': '2019.06.18'}, 'value': {'fval': 2.8}}, {'urn': {'label': 'Count', 'name': 'Hydrogen Bond Acceptor', 'datatype': 5, 'implementation': 'E_NHACCEPTORS', 'version': '3.4.6.11', 'software': 'Cactvs', 'source': 'x

Note that this is the same data as the `ethanol_simple.json` file we saw earlier. Now we could apply our bond-counting function or SMILES/molecular weight extraction to get this information for any compound if we know its name or CAS number.

### Exercise: Write a function that returns the SMILES string for any compound based on CAS number

In [16]:
def returnSMILES(CAS):
    r = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/record/json'.format(CAS))
    chem_json = json.loads(r.text)
    
    SMILES = chem_json['PC_Compounds'][0]['props'][18]['value']['sval']
    return SMILES

print(returnSMILES('64-17-5'))

CCO


### Python API's

RESTful APIs are widely used and easy to interact with. However, reading the documentation and converting more complex queries into the proper URL can be tedious and time consuming, especially because every RESTful API uses its own URL conventions. Furthermore, not all data sources provide a RESTful API.

Python is one of the most common languages for API's, and widely-used data sources (e.g. PubChem) will often have a Python "wrapper" for their RESTful API.

We can use the [PubChemPy](https://pypi.python.org/pypi/PubChemPy/1.0) API to achieve the same goal, but we will need to install it first:

In [17]:
# ! pip install PubChemPy

Now we can import the API and will have access to intuitive function names and documentation:

In [18]:
import pubchempy as pcpy
#help(pcpy)

Python APIs make code more readable, and are more intuitive to learn:

In [19]:
compounds = pcpy.get_compounds('Ethanol','name')
print(compounds)
etoh = compounds[0]
print(etoh.bonds[0].aid2)
# Atom IDs (aid) start at 1, but the atoms list is indexed from 0
print(etoh.atoms[etoh.bonds[0].aid1 - 1].element)
print(etoh.atoms[etoh.bonds[0].aid2 - 1].element)

[Compound(702)]
2
O
C


We see that the full .json output is already parsed into a nice Python data structure that can be accessed by attributes and has element symbols for each atom. We can also inspect this object like other Python objects:

In [20]:
dir(etoh)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_atoms',
 '_bonds',
 '_record',
 '_setup_atoms',
 '_setup_bonds',
 'aids',
 'atom_stereo_count',
 'atoms',
 'bond_stereo_count',
 'bonds',
 'cactvs_fingerprint',
 'canonical_smiles',
 'charge',
 'cid',
 'complexity',
 'conformer_id_3d',
 'conformer_rmsd_3d',
 'coordinate_type',
 'covalent_unit_count',
 'defined_atom_stereo_count',
 'defined_bond_stereo_count',
 'effective_rotor_count_3d',
 'elements',
 'exact_mass',
 'feature_selfoverlap_3d',
 'fingerprint',
 'from_cid',
 'h_bond_acceptor_count',
 'h_bond_donor_count',
 'heavy_atom_count',
 'inchi',
 'inchikey',
 'isomeric_smiles',
 'isotope_atom_count',
 'iupac_name',
 

We can also use the PubChemPy API to ask for specific attributes with the `get_properties` function:

In [21]:
p = pcpy.get_properties('CanonicalSMILES', 'ethanol', 'name')
print(p)

[{'CID': 702, 'CanonicalSMILES': 'CCO'}]


This provides a good tradeoff between the flexibility of the RESTful API and easy-to-read code.

### Exercise: Write a function that takes an arbitrary chemical name or CAS number and returns the number of C-H bonds.

In [22]:
def countCH(name):
    c = pcpy.get_compounds(name, 'name')
    bonds = c[0].bonds
    
    count = 0
    for bond in bonds:
        if c[0].atoms[bond.aid1 - 1].element == 'C' and c[0].atoms[bond.aid2 - 1].element == 'H':
            count += 1
        elif c[0].atoms[bond.aid1 - 1].element == 'H' and c[0].atoms[bond.aid2 - 1].element == 'C':
            count += 1
            
    return count

countCH('ethanol')

5

A few notes about accessing data with APIs:

* Every data source will have different structures and standards
* APIs can sometimes be outdated if they are not maintained properly 
* Some APIs require "keys" to gain access
* Many APIs (including PubChem) have limits on data transfer rates
* Some APIs have terms of use that should not be violated

In general, Python APIs are the best option for accessing online data with Python, though they can sometimes be difficult to understand, or may contain bugs if they are not developed by the official maintainers of the dataset. RESTful APIs are a good backup option, since they are relatively flexible and easy to access with Python. If neither is available, look for JSON or XML versions of the webpage or data source that can be parsed to extract data. Obtaining data by "scraping" HTML should only be done as a last resort, since it is time consuming and will break if the website updates its HTML structure.