# Programmatic Access to PubChem Troubleshooting

__goal__: figure out how to download tabular bioassay data from PubChem BioAssay based on queries and UID lists

__approach__: go through the PubChem docs to find out how to do this in the most efficient way possible

---
## Programmatic Access
https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access
* there are multiple points of programmatic access to PubChem:
    * __PUG-REST__ - a simplified, URL-based access system to PubChem data, designed for locating specific data from specific PubChem records. It has less overhead than the XML used by PUG and the SOAP envelopes used by PUG-SOAP, so it is easier and quicker to use, but it lacks some of the capabilities of the other web services. It is made to handle short (<30 seconds), synchronous requests (the result comes back while the request is still open). _Given the simple task you need to accomplish, this is likely your best bet_
        * _Representational state transfer (REST)_ - a set of rules for interacting with web services that adheres to a specific style of software architecture.
    * __PUG-View__ - a REST-based platform that returns summaries of records; used mainly for creating PubChem summary web pages
    * __Power User Gateway (PUG)__ - uses a common gateway interface (CGI) called <code>pug.cgi</code> to exchange XML data through HTTP POSTs
    * __PUG-SOAP__ - Simple Object Access Protocol (SOAP)-based access service. Easier programmatic access than the PUG gateway, with more flexible data access than the REST interfaces, but the SOAP envelopes it returns add complexity when interpreting results. Recommended for GUI workflow applications like Pipeline Pilot and for programming/scripting languages like Python
    * __PubChemRDF REST interface__ - REST interface for RDF data
    * __Entrez Utilities__ - E-utilities; not well suited to PubChem because they cannot easily access/return many of the data types unique to the PubChem system, such as large tables of bioactivity data, or chemical structures
* these were made because the chemical and bioassay data within PubChem are structured differently than the data in the rest of Entrez, so to make things easier, they created a new set of interfaces for PubChem data, starting with the Power User Gateway (PUG)

---
## PUG-SOAP
https://pubchemdocs.ncbi.nlm.nih.gov/pug-soap
* this looks like it's meant to handle more complex queries and information retrieval, so I'm going to shelve learning this for now
* just note that it is an option, if you find yourself needing to access information in a more complicated way across the database

---
## PUG-REST
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial
### introduction
* developed to be a simple interface for scripts, JavaScript applications in webpages, or other apps, to pull data from PubChem
* it is designed to pull specific types of information, as opposed to summary statements across records in the database

__usage notes__:
* it is not intended for millions of requests - smaller batches are preferred
* time limit per request: requests should not be too complicated (i.e. a single request cannot run for more than 30 seconds)
* request volume limitations: the number of requests per second is limited (do not exceed 5 req/sec); dynamic request throttling can be used to help you stay under the limit
* not adhering to these rules could get your IP address temporarily banned from using any PUG service
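
One way to stay under the 5 req/sec limit is to pause between calls on the client side. The sketch below is a minimal, hypothetical helper (the `Throttle` class and its parameters are my own names, not part of PubChem or any library); it only spaces out calls and does not react to any throttling feedback from the server.

```python
import time


class Throttle:
    """Block so that successive calls are at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the limit, then record the call time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before every request keeps a loop of requests at or below the configured rate; using a value a little below 5 leaves some headroom.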

### how PUG REST works
A PUG REST request (i.e. the URL) has three required parts that vary:
1. an input - how the records of interest are identified (e.g. a compound given as a SMILES string, or an assay given by its AID)
2. an operation - what you want done with those records (e.g. returning tabular bioactivity data)
3. an output - the format the data should be returned in (e.g. as a CSV file)

These parts are modular, so they can be mixed and matched across requests to pull different pieces of information about the same records

For example, to look up the InChI for Vioxx as text, the URL would be this:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/vioxx/property/InChI/TXT

When you go to the above link, PUG REST takes the HTTP/HTTPS request, parses it, figures out the input (the compound name Vioxx), performs the operation encoded in the URL path (look up the InChI string), and returns the result in the requested format (plain text)

In addition to these three parts, some requests accept optional parameters appended after a <code>?</code> at the end of the URL (e.g. <code>?sid=...</code> to restrict an assay request to specific SIDs)
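
The modular structure can be sketched as a small URL builder. This is only an illustration: `pug_rest_url` is a hypothetical helper of my own, and it does no validation of whether a given input/operation/output combination is actually supported.

```python
from urllib.parse import quote, urlencode

PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(input_parts, operation_parts, output, **options):
    """Join prolog, input, operation and output into one PUG-REST URL.

    Keyword arguments become the optional `?key=value` part of the URL.
    """
    parts = list(input_parts) + list(operation_parts)
    segments = [quote(str(part), safe=",") for part in parts]
    url = "/".join([PROLOG] + segments + [output])
    if options:
        url += "?" + urlencode(options, safe=",")
    return url


# Rebuilds the Vioxx example from its three parts
print(pug_rest_url(["compound", "name", "vioxx"], ["property", "InChI"], "TXT"))
```

Swapping only the last argument (e.g. `"JSON"` instead of `"TXT"`) changes the output format while the input and operation stay the same.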

### design of the URL
In the case of Vioxx above:
> __https://pubchem.ncbi.nlm.nih.gov/rest/pug__ | represents the invariant portion of the link, or prolog<br>
__/compound/name/vioxx__ | is the input (look up the compound whose name matches the query "vioxx")<br>
__/property/InChI__ | is the operation (look up the InChI string)<br>
__/TXT__ | is the output (return as text)<br>

if the request has characters that cannot be used in a URL, or is too long, then you can use HTTP POST to get around these shortcomings
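
A rough sketch of the POST route, under assumptions that should be checked against the PUG-REST docs: I am assuming the input value is left out of the URL path and sent instead as a form-encoded body field named after the input type (here `smiles`). The helper name `build_post_request` and the example SMILES string are mine. The request is only prepared here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, field, value):
    """Prepare (but do not send) a form-encoded POST carrying the input value."""
    body = urlencode({field: value}).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Illustrative input: a SMILES string containing '/', '=' and '#',
# which are awkward to place directly in a URL path
req = build_post_request(
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/TXT",
    "smiles",
    "C/C=C/C#N",
)
print(req.get_method(), req.data)
```

Passing `req` to `urllib.request.urlopen` would actually send it.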

### output
types of output formats:

| output format | description |
| --- | --- |
| XML | standard XML, for which a schema is available |
| JSON | JavaScript Object Notation |
| JSONP | like JSON, but wrapped in a callback function |
| ASNB | standard binary ASN.1 |
| ASNT | NCBI's human-readable text flavor of ASN.1 |
| SDF | chemical structure data format |
| CSV | comma separated values |
| PNG | PNG image |
| TXT | plain text |

note that the available formats depend on the type of record being retrieved (you cannot return an SDF for a data table, for example)

### error handling
when an error occurs during a request, PUG-REST returns a human readable error message as to what happened, in lieu of the requested data
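
In a script, it helps to turn that error message into an exception instead of silently parsing it as data. The sketch below rests on two assumptions of mine that are not stated in these notes: that failed requests come back with a non-2xx HTTP status, and that a JSON error body has the shape `{"Fault": {"Code": ..., "Message": ...}}`. If the body is not JSON, the raw text is used as the message. `PugRestError` and `check_response` are hypothetical names.

```python
import json


class PugRestError(Exception):
    """Raised when a response looks like an error rather than data."""


def check_response(status_code, body_text):
    """Return body_text on success; otherwise raise with the server's message."""
    if 200 <= status_code < 300:
        return body_text
    message = body_text.strip()
    try:
        # Assumed JSON error shape: {"Fault": {"Code": "...", "Message": "..."}}
        fault = json.loads(body_text).get("Fault", {})
        message = "{}: {}".format(fault.get("Code", "?"), fault.get("Message", message))
    except (ValueError, AttributeError):
        pass  # not JSON, or not the assumed shape: keep the raw text
    raise PugRestError("HTTP {}: {}".format(status_code, message))
```

With `requests`, this could be called as `check_response(r.status_code, r.text)` right after each `requests.get`.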

### Access to PubChem BioAssays
there are two main ways to access assay data on PubChem: via AID or via SID/CID

__accessing data via AID__
using a specific AID, you can access a variety of information surrounding the assay, and the assay data itself, in a straightforward manner:
* assay description - title, protocol, etc. associated with the assay
> e.g. https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/description/XML
* target of the assay - biological data surrounding the target of the assay
> e.g. the gene symbol, protein name, and other terms for a target of a given assay<br>
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/490,1000/targets/ProteinGI,ProteinName,GeneID,GeneSymbol/XML
* subsets of assay data or small assay data sets - note that by default, you can't go over 10,000 rows/request, so you cannot download large datasets in this way

> e.g. downloading a small data set: <br>
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/CSV

>e.g. downloading a subset of SIDs from a dataset:<br> 
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/XML?sid=104169547,109967232

>e.g. downloading dose-response data in a simplified output:<br> 
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/CSV?sid=104169547,109967232

* entire assay data sets - see the "dealing with lists of identifiers" section below to learn how to request very large datasets from PubChem BioAssays

__retrieving AIDs that match a query__
* you can search for assays and have the matching AIDs returned by using <code>aids</code> as the operation
* furthermore, you can restrict the search to a particular activity type using the <code>activity</code> input
> e.g. finding all assays that are measuring EC50<br>
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/activity/EC50/aids/JSON
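
Once the JSON comes back, the AIDs still need to be pulled out of it. The sketch below assumes the response has the shape `{"IdentifierList": {"AID": [...]}}`; that shape is my assumption and should be checked against a real response. `extract_aids` is a hypothetical helper and the sample payload is made up for illustration.

```python
import json


def extract_aids(json_text):
    """Pull the AID list out of an `.../aids/JSON` response body (assumed shape)."""
    payload = json.loads(json_text)
    return [int(aid) for aid in payload.get("IdentifierList", {}).get("AID", [])]


# Illustrative payload only, not real query results
sample = '{"IdentifierList": {"AID": [1000, 1001]}}'
print(extract_aids(sample))
```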

__accessing data via CID/SID__
* you can find all the assay information for a given CID/SID as follows:
>e.g. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV
* this base URL can be modified to pull actual assay data, or other information surrounding how the compound was discovered

### Dealing with lists of identifiers
__storing lists on the server__
* this is useful for very long lists of identifiers, e.g. when you're downloading thousands of SIDs from a given AID and need to get around the 10,000 row cap on a single request (if you try to request a dataset of >10,000 records, PUG-REST will return an error)
* alternatively, this is useful for when you have one set of IDs that you want to use on many requests
* for these examples, you can store the lists server-side, then retrieve them in batches to do operations on them
* a __list key__ is a key you use to access your server-side data
* you use the list key to repeatedly access the data and handle it in batches

__case study: downloading all of the SIDs in a large assay__:
first, request the list of SIDs associated with the assay, and store them as a listkey
>https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/XML?list_return=listkey

the above link contains, in XML format, the list key itself, and also descriptions of the data set - like the number of records found in it. The list key is stored under the ListKey leaf, and the # of records is stored under the Size leaf
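
Reading the two leaves can be sketched as below. The root element carries XML namespace attributes, so this version matches elements by local name rather than by full tag. `parse_listkey` is a hypothetical helper, and the sample XML (including its placeholder namespace URI) is made up to mirror the structure described above.

```python
import xml.etree.ElementTree as ET


def parse_listkey(xml_text):
    """Return (listkey, size) from an IdentifierList response, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    values = {}
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # '{namespace}ListKey' -> 'ListKey'
        values[local_name] = (elem.text or "").strip()
    return values["ListKey"], int(values["Size"])


# Illustrative response; the namespace URI is a placeholder, not PubChem's
sample = (
    '<?xml version="1.0"?>'
    '<IdentifierList xmlns="http://example.org/placeholder">'
    "<ListKey>2452489925683562238</ListKey><Size>96409</Size>"
    "</IdentifierList>"
)
print(parse_listkey(sample))
```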

next, use the list key to download the data in batches from PUG-REST. Request the URL below in a loop, increasing <code>listkey_start=</code> by <code>listkey_count</code> on each pass so that the batches cover the entire data set
> https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/CSV?sid=listkey&listkey=2452489925683562238&listkey_start=0&listkey_count=1000

after you reach the full size of the data set, you're done!
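
The batching arithmetic can be sketched without touching the network. Both functions are hypothetical helpers of my own; the URL pattern is the one shown in the case study above.

```python
def listkey_batches(size, count=1000):
    """Yield (listkey_start, listkey_count) pairs that cover `size` records exactly once."""
    for start in range(0, size, count):
        yield start, min(count, size - start)


def batch_url(aid, listkey, start, count):
    """Build the CSV download URL for one batch of a stored SID list."""
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/CSV"
        "?sid=listkey&listkey={listkey}"
        "&listkey_start={start}&listkey_count={count}"
    ).format(aid=aid, listkey=listkey, start=start, count=count)


batches = list(listkey_batches(96409, 1000))
print(len(batches), batches[0], batches[-1])
```

For a list of 96,409 records and a batch size of 1,000, this gives 97 requests, the last of which asks for only 409 records.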

---

## testing download of a single dataset
AID = 640
number of records = 96409
link to downloading the whole dataset from the GUI: https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=640

In [None]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd

In [None]:
# create a listkey
aid = '640'
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{}/sids/XML?list_return=listkey'.format(aid)
r = requests.get(url)
r.encoding = 'utf-8'
xml = r.text

In [None]:
# clean out the links within the attribute values of the IdentifierList tag to make ET happy
cleaned_xml = xml.replace('\n', '')

id_start = cleaned_xml.find('<IdentifierList')
listkey_start = cleaned_xml.find('ListKey')
cleaned_xml = cleaned_xml[0:id_start] + '<IdentifierList>' + cleaned_xml[listkey_start-1:]
cleaned_xml

In [None]:
# parse out the list key and size from the xml
root = ET.fromstring(cleaned_xml)
listkey = root.find('ListKey').text
size = int(root.find('Size').text)  # number of records stored under the list key

print(listkey, size)

In [None]:
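# inspect the URL of the most recent batch (dl_url is only defined once the download loop below has run)
dl_url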

In [None]:
# loop over the list key in batches and construct a dataframe
url_prolog = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/'
url_listkey = 'assay/aid/{aid}/CSV?sid=listkey&listkey={listkey}'.format(aid=aid,
                                                                        listkey=listkey)
url_base = url_prolog + url_listkey

n_records = int(size)  # Size is read from the XML, so make sure it is a number
listkey_start = 0
listkey_count = 1000

batches = []

# request batches until the whole list has been covered
while listkey_start < n_records:
    url_keycount = '&listkey_start={listkey_start}&listkey_count={listkey_count}'.format(listkey_start=listkey_start,
                                                                                         listkey_count=listkey_count)
    dl_url = url_base + url_keycount

    # read in the batch as DF
    # note: every batch starts with 3x description rows (result_type, result_descr, result_unit);
    # they are NOT removed here, so they end up in the combined dataframe
    temp_df = pd.read_csv(dl_url, index_col=0)
    batches.append(temp_df)

    if listkey_start % 10000 == 0:
        print('{row} rows of {size} rows requested'.format(row=listkey_start, size=n_records))

    listkey_start += listkey_count

# combine all batches in one go
data = pd.concat(batches, sort=False)

# path used to save the dataframe in the next cell
p = '../data/test_download/{}_test.csv'.format(aid)
data

In [None]:
data.to_csv(p)

the above code does not work right
downloading it that way leaves out certain rows - we have 96,288 rows instead of the 96,409 we were expecting

furthermore, the metadata rows:
>0,RESULT_TYPE,,,,,,,FLOAT<br>
1,RESULT_DESCR,,,,,,,Normalized % inhibition at 2 micromolar inhibitor concentration of the primary assay<br>
2,RESULT_UNIT,,,,,,,NONE

are repeated for each call to PubChem - meaning we're actually missing even MORE rows, because 97 * 3 = 291 of the rows in the above data set are these metadata rows. That leaves 96,288 - 291 = 95,997 actual data rows, i.e. 412 fewer than expected
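
A sketch of how the repeated metadata rows could be filtered out when combining batches, using only the standard library. It assumes the tag (`RESULT_TYPE`, `RESULT_DESCR`, `RESULT_UNIT`) sits in one of the first two columns of such a row, which matches the rows shown above; `merge_batches` and the column names in the usage example are hypothetical. This only removes the metadata rows and does nothing about the missing data rows.

```python
import csv
import io

METADATA_TAGS = {"RESULT_TYPE", "RESULT_DESCR", "RESULT_UNIT"}


def merge_batches(csv_texts):
    """Combine CSV batches: keep the first header, drop repeated headers and metadata rows."""
    header, rows = None, []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        batch_header = next(reader, None)
        if batch_header is None:
            continue  # empty batch
        if header is None:
            header = batch_header
        for row in reader:
            if not row:
                continue  # skip blank lines
            if METADATA_TAGS.intersection(row[:2]):
                continue  # description row repeated at the top of each batch
            rows.append(row)
    return header, rows


# Two tiny made-up batches with hypothetical column names
batch_1 = "TAG,SID,VALUE\nRESULT_TYPE,,FLOAT\nRESULT_DESCR,,desc\nRESULT_UNIT,,NONE\n1,11,0.5\n2,12,0.7\n"
batch_2 = "TAG,SID,VALUE\nRESULT_TYPE,,FLOAT\nRESULT_DESCR,,desc\nRESULT_UNIT,,NONE\n3,13,0.9\n"
header, rows = merge_batches([batch_1, batch_2])
print(header, len(rows))
```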

---

## conclusions
* given that the data seems to be missing rows and is loading extraordinarily slowly, the GUI is probably better for batch-downloading full datasets, regardless of what the documentation says
* I should use the GUI for downloading full bioassay datasets
* I can, however, use PUG-REST if I have more complicated queries across certain compounds