# Intro
This is a general walkthrough of how to use the habanero library to retrieve Crossref URLs for full-text articles, and then download those articles using Wiley's API.

#### Prereqs
You should have logged in to https://apps.crossref.org/clickthrough/researchers and selected the licenses you are willing to accept. Note that institutional licenses supersede these clickthrough licenses, *but* you'll still need to agree to them to get the API keys for those licenses. You will need to have accepted Wiley's TDM agreement for the purpose of this tutorial.

#### Getting started:
First, let's make sure we have these libraries installed and loaded:

In [None]:
from habanero import Crossref
import requests
import logging
import sys
import time
import re

We'll be using a few libraries to run our code. Notably: the habanero library for accessing Crossref (instead of curl commands), the requests library for downloading files (again instead of curl commands), and re (the regular expression module, used here just to simplify the article titles).
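
To make the title simplification concrete, here is a minimal sketch of the `re.sub` call used later in this walkthrough. The helper name `simplify_title` and the sample title are made up for illustration.

```python
import re

def simplify_title(title):
    """Remove every run of non-word characters (spaces, punctuation) from a title."""
    return re.sub(r'\W+', '', title)

# Hypothetical title, used only for illustration
print(simplify_title("Deep learning: a review (2nd ed.)"))
# prints: Deeplearningareview2nded
```

Letters, digits and underscores survive; everything else is dropped, which keeps the result safe to use in a file name.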

#### Loading our information

Both Crossref and Wiley expect some information from you. Crossref wants contact information for something called the "polite pool" (a way to get better service, and to let Crossref contact you if there is a problem with your script). Wiley needs your API key to confirm that you are allowed to access its content.

You can store these in a file, which we'll load in the next section.

In [None]:

# for the standalone program, you can run it at the command prompt using:
# usage: python wiley_doi_retrieval.py "myContactFile.txt" "DOIfile.txt"

# myContactFile.txt should have format:
# your_url
# your_email
# your_wiley_api_token
#contactFile = sys.argv[1]
contactFile = "contact_info.txt"

# store your contact info to be on the polite list (crossref) as well as your Wiley API token
# alternatively edit this section to just load your API token
with open(contactFile, "r") as fp:
    base_url = fp.readline().rstrip('\n')
    mailto = fp.readline().rstrip('\n')
    clickThroughKey = fp.readline().rstrip('\n')

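# setup Crossref
# Pass mailto when creating the client so requests go to the polite pool.
# habanero's base_url argument is the API endpoint (default https://api.crossref.org),
# so the URL read from the contact file is not passed here.
cr = Crossref(mailto=mailto)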
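
For reference, the polite pool comes down to identifying yourself on each request. The sketch below builds such a request URL by hand, without habanero, assuming the public `https://api.crossref.org/works/{doi}` endpoint and its `mailto` query parameter. The helper name, DOI, and email address are placeholders.

```python
from urllib.parse import quote, urlencode

def polite_works_url(doi, mailto, base_url="https://api.crossref.org"):
    """Build a Crossref works URL that carries a contact address."""
    path = quote(doi, safe='/')            # keep the slash inside the DOI
    query = urlencode({'mailto': mailto})  # percent-encodes the '@'
    return "{}/works/{}?{}".format(base_url.rstrip('/'), path, query)

# Placeholder DOI and address, for illustration only
print(polite_works_url("10.1002/example.12345", "you@example.org"))
# prints: https://api.crossref.org/works/10.1002/example.12345?mailto=you%40example.org
```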


We also need to load a list of DOIs to retrieve.

In [None]:

# load DOIs
# The DOI file should be a plain text file with one DOI per line (DOIs always start with "10."):
# 10.234234/235235
# 10.4334234/15151253
# etc.

#doiFile = sys.argv[2]
doiFile = "doi_list.txt"
doiList = []
with open(doiFile, 'r') as fp:
    for line in fp:
        doi = line.strip()
        if doi:  # skip blank lines, e.g. a trailing newline at the end of the file
            doiList.append(doi)
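
A DOI file edited by hand can pick up blank lines or stray text, and each bad entry costs a failed Crossref lookup. Here is a minimal sketch of a stricter filter. The pattern is a loose heuristic chosen for this example (a `10.` prefix, a registrant code of 4 to 9 digits, a slash, and a non-empty suffix), not an official definition of a DOI. The function name and sample values are hypothetical.

```python
import re

# Loose heuristic, assumed for this example: "10.", 4-9 digits, "/", non-empty suffix
DOI_PATTERN = re.compile(r'^10\.\d{4,9}/\S+$')

def clean_doi_lines(lines):
    """Return stripped lines that look like DOIs, dropping everything else."""
    cleaned = []
    for line in lines:
        doi = line.strip()
        if doi and DOI_PATTERN.match(doi):
            cleaned.append(doi)
    return cleaned

# Made-up input lines, for illustration only
sample = ["10.1002/example.12345\n", "\n", "not-a-doi\n", "  10.1111/abc.678  \n"]
print(clean_doi_lines(sample))
# prints: ['10.1002/example.12345', '10.1111/abc.678']
```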



Finally, before we get started, it might be good to set some tracking info (in case your API requests get rejected).

In [None]:
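# setup logging environment
logging.basicConfig(level=logging.DEBUG)  # attach a handler so DEBUG messages are actually shown
# requests sends its traffic through urllib3; recent versions log under the "urllib3" name
requests_log = logging.getLogger("urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True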



#### Getting URLs from Crossref
For the first part, we need to get the links to the full-text PDFs. We'll use Crossref to look up each DOI, and then store the URL along with the title of the article.

In [None]:
# retrieve links to DOIs through crossref:
articleList = []
for doi in doiList:
    print("searching for DOI: " + doi)
    res = cr.works(ids=doi)
    article = res['message']
    title = article['title'][0]
    print("found " + title)
    print("getting link to: " + title)
    links = article.get('link', [])  # not every record carries full-text links
    if links and links[0]['intended-application'] == 'text-mining':  # double check you get the right url
        # store as a tuple with (URL, title)
        print("success")
        articleList.append((links[0]['URL'], re.sub(r'\W+', '', title)))
    else:
        # skip this DOI and move on to the next one
        print("url error for DOI: " + doi)
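
The cell above only inspects the first entry in `link`. A record can list several links with different `intended-application` values, so a slightly more defensive variant scans all of them. This is a sketch rather than part of the original script: `pick_tdm_link` is a hypothetical helper, and the sample record is hand-written to mimic the fields used above, not real Crossref output.

```python
def pick_tdm_link(message):
    """Return the URL of the first link flagged for text mining, or None."""
    for link in message.get('link', []):
        if link.get('intended-application') == 'text-mining':
            return link.get('URL')
    return None

# Hand-written record mimicking the fields used above; not real Crossref output
sample_message = {
    'title': ['An example article'],
    'link': [
        {'URL': 'https://example.org/article.html',
         'intended-application': 'similarity-checking'},
        {'URL': 'https://example.org/article.pdf',
         'intended-application': 'text-mining'},
    ],
}
print(pick_tdm_link(sample_message))
# prints: https://example.org/article.pdf
print(pick_tdm_link({'title': ['No links here']}))
# prints: None
```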




#### Getting the PDFs
Finally, we'll download and save the PDFs by making a request for each article from Wiley's servers.

In [None]:
# retrieve PDFs through Wiley API (you can fork this and adapt it to other APIs, e.g. Elsevier, if needed)
header = {'CR-Clickthrough-Client-Token': clickThroughKey}
for article in articleList:
    r = requests.get(article[0], allow_redirects=True, headers=header)
    if not r.ok:
        # e.g. the token or license was not accepted; don't save the error page as a PDF
        print("download failed ({}) for {}".format(r.status_code, article[0]))
    else:
        # you might also consider prefixing all the PDFs with something to make them easier to find.
        with open(article[1][:15] + ".pdf", 'wb') as fp:
            fp.write(r.content)
    time.sleep(10)  # this delay should be generous enough to avoid rate limits.
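
Truncating titles to 15 characters means that two articles whose titles start the same way will overwrite each other on disk. One possible way around this, sketched below, is to add a prefix and a running index to each file name. The helper `pdf_filename`, the `wiley_` prefix, and the sample titles are all made up for illustration.

```python
def pdf_filename(title, index, max_len=15, prefix="wiley_"):
    """Build a file name from a simplified title, with an index to avoid collisions."""
    return "{}{:03d}_{}.pdf".format(prefix, index, title[:max_len])

# Made-up simplified titles that share their first 15 characters
titles = ["Deeplearningareview", "Deeplearningareviewparttwo"]
names = [pdf_filename(t, i) for i, t in enumerate(titles, start=1)]
print(names)
# prints: ['wiley_001_Deeplearningare.pdf', 'wiley_002_Deeplearningare.pdf']
```

In the download loop this would mean iterating with `enumerate(articleList, start=1)` and passing the index along with `article[1]`.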