# troubleshooting importing data from Entrez
__goal__: figure out how to use Entrez to query and download data from pubchem bioassay

__objectives:__
* go through introductory and quick start Entrez guide to figure out how Entrez is working
* write a function that imports data sets from a list of Assay IDs
* write a function that can find data sets based on certain queries, and download subsets of them

# HHHNNNNNNNGGGGGGGGG!!!!!
according to the pubchem docs, Entrez is not well suited to downloading tabular bioactivity data: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access 

I, naturally, found this out after re-learning all the Entrez docs below

time to learn how to use the pubchem version of all of this; however, note that I got something working at the bottom that bypasses E-utils entirely, and just downloads using a link I modified off of the PubChem BioAssay GUI

In [None]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd

import os
import sys

## resources
https://www.ncbi.nlm.nih.gov/books/NBK25501/<br>
https://www.ncbi.nlm.nih.gov/books/NBK25500/<br>
https://www.ncbi.nlm.nih.gov/books/NBK25498/<br>
https://www.ncbi.nlm.nih.gov/books/NBK25497/

## A General Introduction to the E-utilities
https://www.ncbi.nlm.nih.gov/books/NBK25497/

### introduction
* E-utilities = Entrez Programming Utilities
* __E-utilities__ are a set of server-side programs at the NCBI that help you query and retrieve data from the Entrez database system; i.e. they're the interface that allows you to get data from all of the Entrez databases - pubmed, pubchem bio assay, etc.
* data is accessed by posting an E-utility URL to the NCBI; any scripting language that can post a URL to the server and interpret the XML response is capable of using this system (e.g. python)

### usage guidelines and requirements
* data is accessed via the E-utility base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
* timing and frequency:
    * requests should be limited to at most 3/second (without an API key)
    * large requests should be run on weekends or between 9 PM - 5 AM ET on weekdays
    * violating these limits can result in your IP address being blocked from using the E-utilities - the guide to getting unblocked is on the website
* API keys
  * allow for up to 10 requests/second
  * are taken from your NCBI account, under the settings page http://www.ncbi.nlm.nih.gov/account/
  * to use the API key, include it in the E-utility URL, e.g.: esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE12345
* minimizing the number of requests using the Entrez History server
    * to decrease request #, especially if you're doing a large query, use Entrez History
    * with this method, the UIDs are stored on a server, and you download the actual data for those UIDs in batches until the entire dataset is covered
* E-utilities URL syntax
    * use only lower case
    * replace all spaces with "+"
    * use URL encodings for special characters (e.g. %22 for ")
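
A minimal sketch of the URL rules above, using only the standard library. The helper name `build_eutils_url` is made up for this notebook, and the API key is the placeholder from the example above, not a real key. `urlencode` handles the `+` and `%22` encoding, so the URL doesn't have to be escaped by hand.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, api_key=None, **params):
    """Return a full E-utility URL for a utility name such as 'esummary'.

    `build_eutils_url` is a hypothetical helper, not part of any NCBI library.
    """
    if api_key is not None:
        params["api_key"] = api_key
    # urlencode quotes each value with quote_plus: spaces -> "+", '"' -> %22
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(params)}"


# same request as the API key example above (placeholder key)
summary_url = build_eutils_url("esummary", db="pubmed", id="123456", api_key="ABCDE12345")

# a quoted phrase shows the special-character encoding
search_url = build_eutils_url("esearch", db="pcassay", term='"kinase inhibitor"')
print(summary_url)
print(search_url)
```

Passing the parameters as keyword arguments keeps the db name, IDs and key separate from the string formatting, which makes it harder to forget an `&` or an encoding step.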

### the nine E-utilities in brief
* EInfo - provides statistics for a given database, e.g. the number of records indexed in each field, the last time the database was updated, and which other Entrez databases it links to
* ESearch - takes a query and converts it to a list of UIDs that match that query, for a given database
* EPost - accepts a list of UIDs and puts the corresponding data on the History Server; returns a query key and web environment that allows you to access the data from that server
    * you should be using this for your AID search - since you know a certain list of AIDs that you're trying to download the data from
    * e.g. <code>epost.fcgi?db=database&id=uid1,uid2,uid3,...</code>
* ESummary - accepts a list of UIDs and returns brief summaries of the resulting records
* EFetch - accepts a list of UIDs and returns the data that are associated with each of the records, for a given database; _also allows for specifying data types_
    * e.g. <code>efetch.fcgi?db=database&id=uid1,uid2,uid3&rettype=report_type&retmode=data_mode</code>
* ELink - accepts a list of UIDs from one database and returns a way to find associated UIDs in that database or another, specified database 
* EGQuery - takes a query and searches across all Entrez databases - returns the number of records that match each query in each data base
* ESpell - returns spelling suggestions for a given query at a given data base
* ECitMatch - returns a list of PMIDs based on a list of formatted citation strings
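
To make the "query in, UIDs out" idea concrete, here is a small sketch of pulling the UID list out of an ESearch XML reply. The XML below is hand-written in the shape of an ESearch response (a `Count` plus an `IdList` of `Id` elements), not real server output, and the AIDs in it are only illustrative. `parse_esearch` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ESearch reply (NOT real output).
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>1511</Id>
    <Id>1478</Id>
    <Id>2101</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return (total hit count, list of UID strings) from ESearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    uids = [el.text for el in root.findall("./IdList/Id")]
    return count, uids


count, uids = parse_esearch(SAMPLE_ESEARCH_XML)
print(count, uids)
```

Note that `Count` is the total number of matches, while `IdList` only holds the UIDs returned in this particular reply, so the two can differ for large queries.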

### understanding the E-utilities within Entrez
* The E-utilities access Entrez databases - note that, occasionally, some data hosted by NCBI are not in the Entrez system; therefore, those data cannot be accessed by the E-utilities - verify that the data are in Entrez if you run into this
* The Entrez system identifies records via their UIDs - each database has its own UID set - see the table on this page for specifics
    * __PubChem BioAssay UID__: AID ; __PubChem BioAssay Database Entrez Name__: pcassay

__E-utilities syntax__
> term1[field1] __Op__ term2[field2] __Op__ term3[field3] __Op__ ...

* term = search term
* [field] = type of value you're querying (e.g. [author] on pubmed)
* Op = operator (e.g. AND, OR, NOT)
* Boolean operators must be in all caps (e.g. OR)
* all other terms should be in lowercase
* spaces should be replaced with <code>+</code> in links e.g. <code>zhang[author]+AND+novartis[affiliation]</code>
* use URL encodings for special characters (e.g. %22 for ")
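
A quick sketch of the query syntax above: build the `term[field] Op term[field]` string first, then encode it for use in a URL. `build_term` and `encode_term` are made-up helper names. Square brackets are passed as "safe" so the encoded string matches the example above; as far as I can tell, letting them be percent-encoded would also be a valid URL.

```python
from urllib.parse import quote_plus

BOOLEAN_OPS = {"AND", "OR", "NOT"}


def build_term(clauses, op="AND"):
    """Join (term, field) pairs into an Entrez query string.

    Terms are lowercased; the Boolean operator must be upper case.
    """
    if op not in BOOLEAN_OPS:
        raise ValueError(f"operator must be one of {sorted(BOOLEAN_OPS)}")
    return f" {op} ".join(f"{term.lower()}[{field}]" for term, field in clauses)


def encode_term(term):
    """Encode a query string for a URL: spaces -> '+', '"' -> %22."""
    return quote_plus(term, safe="[]")


term = build_term([("Zhang", "author"), ("Novartis", "affiliation")])
encoded = encode_term(term)
print(term)
print(encoded)
```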

__Entrez History Server__
* method for storing long lists of UIDs on a temporary server, so records associated with the UIDs can be processed in batches - _this is exactly what you should do with your project; it obviates the need to do huge data batches at once, or multiple query calls; combine it with EPost to upload a precise list of UIDs_
* the History server works by assigning a __query key__ (a specific ID) for the UIDs that correspond to a certain query and a __web environment__ (a cookie string related to where the data is being processed)
* since the query key (i.e. the specific batch of UIDs) and the web environment (i.e. where the batches are processed) are separate, multiple data sets can be housed on the history server, and results can be combined between them using Boolean operators to discover data that's related across databases - however, this is not set by default, and has to be manually specified
* the History server works in two general steps (in the simplest case):
    1. an upload step that generates a web environment and a query key (note that esearch requires usehistory=y, while epost uses the history server by default)
        * e.g. <code>esearch.fcgi?db=database&term=query&usehistory=y</code>
        * e.g. <code>epost.fcgi?db=database&id=uid1,uid2,uid3,...</code>
    2. a download step that uses the web environment/query key to get the data you want
        * e.g. <code>esummary.fcgi?db=database&WebEnv=webenv&query_key=key</code>
        * e.g. <code>efetch.fcgi?db=database&WebEnv=webenv&query_key=key&rettype=report_type&retmode=data_mode</code>
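
A rough sketch of what the batched download step could look like: once the upload step has returned a web environment and query key, step through the stored UIDs with `retstart`/`retmax`. This only builds the URLs (no requests are sent). `history_batch_urls` is a made-up helper, the `WEBENV_STRING` value is a placeholder, and the batch size of 500 is an arbitrary choice of mine.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def history_batch_urls(db, web_env, query_key, total, batch_size=500):
    """Return one ESummary URL per batch of UIDs stored on the History server."""
    urls = []
    for retstart in range(0, total, batch_size):
        params = {
            "db": db,
            "WebEnv": web_env,
            "query_key": query_key,
            "retstart": retstart,  # index of the first record in this batch
            "retmax": batch_size,  # maximum number of records in this batch
        }
        urls.append(f"{EUTILS_BASE}esummary.fcgi?{urlencode(params)}")
    return urls


# 1200 stored UIDs in batches of 500 -> batches start at 0, 500 and 1000
batch_urls = history_batch_urls("pcassay", "WEBENV_STRING", 1, total=1200)
for u in batch_urls:
    print(u)
```

Each URL would then be requested in turn, staying under the request rate limits from the usage guidelines.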

## Sample Applications of E-utilities
https://www.ncbi.nlm.nih.gov/books/NBK25498/

from above, I think what I want to do is combine two protocols from these sample applications:
<code>EPost-EFetch</code> to download records associated with a specific list of UIDs

and the <code>application 3: retrieving large datasets</code> section, which uses the history server to retrieve data from EFetch in batches

__EPost - ESummary/EFetch example:__

        use LWP::Simple;

        # Download protein records corresponding to a list of GI numbers.

        $db = 'protein';
        $id_list = '194680922,50978626,28558982,9507199,6678417';

        #assemble the epost URL
        $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
        $url = $base . "epost.fcgi?db=$db&id=$id_list";

        #post the epost URL
        $output = get($url);

        #parse WebEnv and QueryKey
        $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
        $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

        ### include this code for EPost-ESummary
        #assemble the esummary URL
        $url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

        #post the esummary URL
        $docsums = get($url);
        print "$docsums";

        ### include this code for EPost-EFetch
        #assemble the efetch URL
        $url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
        $url .= "&rettype=fasta&retmode=text";

        #post the efetch URL
        $data = get($url);
        print "$data";


conda update pandas

In [None]:
# import record list
p = '../data/literature/KYHelal_etal_2016_JCIM_supplement.xlsx'
supp_table = pd.read_excel(p, engine='openpyxl')
supp_table.head()

In [None]:
# translating the code into python and tailoring it for PubChem BioAssay

# assemble db name and UID list
db = 'pcassay'
id_list = supp_table['AID'].values
id_list = id_list[0:5]

# join the AIDs into the comma-separated string that epost expects
id_string = ','.join(str(s) for s in id_list)

# assemble the epost URL
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
url = f"{base}epost.fcgi?db={db}&id={id_string}"

In [None]:
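# post the UIDs and parse the resulting response object to get the query_key and web_env
r = requests.get(url)
r.raise_for_status()  # stop on an HTTP error instead of trying to parse an error page
content = r.content

# the epost reply is XML with <QueryKey> and <WebEnv> directly under the root element
root = ET.fromstring(content)
key = root.find('QueryKey').text
web_env = root.find('WebEnv').text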

#retmax_value = 500
#retstart_value = 0

In [None]:
# fetch data
#assemble the efetch URL
fetch_url = base + "efetch.fcgi?db={db}&query_key={key}&WebEnv={web_env}".format(db = db,
                                                                                 key = key,
                                                                                 web_env = web_env)
fetch_url = fetch_url + "&rettype=datatable&actvty=all&retmode=csv"
fetch_url

example URL from downloading a single record through the PubChem BioAssay GUI:
https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=1511

hmmm ... the EFetch URL above isn't working ... I can't tell what rettype I should pass to the Entrez URL, which fits the PubChem docs' warning that Entrez isn't well suited to tabular bioactivity data

let's try just using the GUI download URL instead:

In [None]:
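dl_url_base = 'https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid='
data_dir = '../data/test_download/'
os.makedirs(data_dir, exist_ok=True)  # to_csv fails if the output directory is missing

for i in id_list:
    # read the CSV that the GUI download link serves for this AID
    dl_url = dl_url_base + str(i)
    temp_df = pd.read_csv(dl_url)
    # index=False keeps the pandas row index out of the saved file
    temp_df.to_csv(os.path.join(data_dir, str(i) + '.csv'), index=False)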
    

that technically worked ... although it was very slow and feels sketchy
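
To get closer to the "function that imports data sets from a list of Assay IDs" objective, the loop above could be wrapped roughly like this. It's a sketch, not a tested downloader: `download_assays` and `fetch_bytes` are made-up names, the 0.5 s pause between requests is a guess at being polite (I haven't found an official limit for this particular URL), and it saves the raw response instead of round-tripping through pandas. The fetch step is passed in as an argument so the function can be tried without hitting the server.

```python
import os
import time
import urllib.request

# GUI download link from above, with the AID appended at the end
DL_URL_BASE = ("https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi"
               "?query=download&record_type=datatable&actvty=all"
               "&response_type=save&aid=")


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_assays(aids, out_dir, fetch=fetch_bytes, delay=0.5):
    """Save one CSV per AID into out_dir; return (saved paths, failures)."""
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for n, aid in enumerate(aids):
        if n:
            time.sleep(delay)  # assumed pause between requests, not an official limit
        try:
            data = fetch(DL_URL_BASE + str(aid))
        except OSError as err:  # urllib's URLError/HTTPError are OSError subclasses
            failed.append((aid, str(err)))
            continue
        path = os.path.join(out_dir, f"{aid}.csv")
        with open(path, "wb") as fh:
            fh.write(data)
        saved.append(path)
    return saved, failed
```

Usage would be something like `saved, failed = download_assays(id_list, '../data/test_download/')`; collecting failures instead of raising means one bad AID doesn't kill a long run.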