Copyright 2020 Frances M. Skinner
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# **A Guide to creating citations using only a DOI**
*Note: This notebook retrieves citations for papers in 3 output formats: JSON, BibTeX and HTML. Each citation includes the following information on a paper: Title, Authors, Journal Name, Volume Number, Page Range, Year, Hyperlink(s) to the Article and DOI of the Article. If you are looking for more detailed citations or other output formats, please refer to Step 3 after completing Steps 1 & 2!*

### You only need to click run on each cell in this notebook and everything should populate normally!

Step 1. Get the DOI of the paper you want to cite<br>
Step 2. Use the DOI to search for the Bibcode: enter the DOI in the prompt provided, and the output will be the Bibcode for the paper<br>
Step 3. The Bibcode is automatically populated for the ADS method. The output is customizable, with many formats possible. This method includes links to the paper by DOI URL and ADS URL<br>
Note. Some users may find that their paper is not in ADS, so there is no Bibcode for it. If there is no Bibcode, please move to Step 4<br>
Step 4. Use the DOI to search with the Urllib method (your DOI is already populated from Step 1). The output is the full citation as plain text/JSON. It can be converted to HTML in Step 5, or retrieved in BibTeX format in Step 6<br>
Step 5. The full JSON output from the Urllib method is populated and encoded as HTML<br>
Step 6. The DOI is already populated and you will receive the full BibTeX citation for the paper using the GScholar method<br>

### *All Done!* Now you have the citation for your paper in 3 different formats!

## Before Doing Anything, Import the Necessary Modules

In [1]:
# Import these modules
# Step 2 Bibcode
import requests
import json
# Step 4 urllib
import urllib.request
from urllib.error import HTTPError
#Step 5 html encoder
import html
# Step 6 bibtex
import re
import logging
from bs4 import BeautifulSoup  
from html.entities import name2codepoint  
import sys
from urllib.parse import quote
from urllib.request import Request, urlopen

## Step 1. Enter your DOI and Token

You can retrieve the DOI for your paper in many different ways.
1. What is a DOI?
    - The DOI is a unique alphanumeric string assigned by the International DOI Foundation, to identify content and provide a persistent link to its location on the Internet. It is written in the general format of '10.1000/xyz123'
2. Where Can I find a DOI?
    - The DOI is usually printed in the top left or top right corner of your paper, written as 'DOI:10.1000/xyz123'
    - The DOI should be listed in the details or citation section on the publisher's website where you found your paper
    - The DOI may also be written as a link next to the paper's information on the publisher's website, in the form https://doi.org/10.1000/xyz123 or https://dx.doi.org/10.1000/xyz123
3. How do I run this notebook?
    - To use this notebook, type the DOI in the '10.1000/xyz123' format, *NOT* as a hyperlink, wherever you see "ENTER HERE"
    - You only need to run each cell after entering your DOI, everything else will populate for you
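
Since the DOI is often copied from a publisher page as a link or with a `DOI:` label, a small helper can reduce it to the bare '10.1000/xyz123' form before it is used in the later steps. The sketch below is only an illustration: the name `normalize_doi` is hypothetical and not part of this notebook, and the pattern it checks is a common heuristic for modern DOIs rather than a full validation.

```python
import re

def normalize_doi(text):
    """Return a bare DOI such as '10.1000/xyz123' from common written forms.

    Hypothetical helper, not part of the original notebook.
    """
    doi = text.strip()
    # Drop a resolver prefix such as https://doi.org/ or http://dx.doi.org/
    doi = re.sub(r'^https?://(dx\.)?doi\.org/', '', doi, flags=re.IGNORECASE)
    # Drop a leading 'doi:' label, with or without a following space
    doi = re.sub(r'^doi:\s*', '', doi, flags=re.IGNORECASE)
    # Heuristic shape check: '10.', a registrant code, a slash, then a suffix
    if not re.match(r'^10\.\d{4,9}/\S+$', doi):
        raise ValueError('Not a DOI in 10.xxxx/suffix form: {!r}'.format(text))
    return doi

print(normalize_doi('https://doi.org/10.1016/j.jqsrt.2019.106711'))
print(normalize_doi('DOI:10.1000/xyz123'))
```

With a helper like this, `doi = normalize_doi(input("Enter DOI Here: "))` would accept either the link form or the bare form.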

In [2]:
# Enter your token here: you need it for Steps 2 and 3
# A token is only available once you have an account on NASA/ADS https://ui.adsabs.harvard.edu/.
# Once you have an account, click on 'Account', then click on 'Customize Settings' in the dropdown menu.
# In 'Customize Settings' there is a panel on the left of the screen; scroll down that panel to find 'API Token'.
# Click on 'API Token' and then click on 'generate a new key'.
# If you're using the Django application, you can enter your token in the settings file and import ADS_TOKEN from settings.py
token="Enter Your Token Here" 
doi = input("Enter DOI Here: ")

Enter DOI Here: 10.1016/j.jqsrt.2019.106711


## Step 2. Retrieve Bibcode

In [3]:
def get_citeproc_authors(cpd_author):
    if cpd_author is None:
        return None
    names = []
    for author in cpd_author:
        try:
            family = author['family'].title()
        except KeyError:
            name = author['name']
            names.append(name)
            continue
        try:
            given = author['given']
        except KeyError:
            # This author has no given name, only a family name
            names.append(family)
            continue
        initials = given.split()
        initials[0] = '{}.'.format(initials[0][0])
        initials = ' '.join(initials)
        names.append('{} {}'.format(initials, family))
    return ', '.join(names)
def parse_citeproc_json(citeproc_json):
    """Parse the provided JSON into a Ref object."""   
    cpd = json.loads(citeproc_json)
    try:
        if cpd['type'] != 'article-journal':
            return None
    except KeyError:
        return None
    authors = get_citeproc_authors(cpd.get('author', ''))
    title = cpd.get('title', '').replace('\n', '')
    journal = cpd.get('container-title', '')
    volume = cpd.get('volume', '')
    page_start, page_end = cpd.get('page', ''), ''
    if page_start and '-' in page_start:
        page_start, page_end = page_start.split('-', 1)
    article_number = cpd.get('article-number', '')
    doi = cpd.get('DOI', '')
    url = cpd.get('URL', '')
    try:
        year = cpd['issued']['date-parts'][0][0]
    except (KeyError, IndexError):
        year = None        
    try:
        bibcode = cpd.get('bibcode', '')
    except (KeyError, IndexError):
        bibcode = None        
# # =============================================================================
# #   OUTPUT
# # =============================================================================
    ref = [authors, 
        title, 
        journal, 
        volume,
        year, 
        page_start, 
        page_end, 
        doi,
        url, 
        article_number,
        citeproc_json]
    return ref 
def get_citeproc_json_from_doi(doi):
    base_url = 'http://dx.doi.org/'
    url = base_url + doi
    req = urllib.request.Request(url)
    req.add_header('Accept', 'application/citeproc+json')
    try:
        with urllib.request.urlopen(req) as f:
            citeproc_json = f.read().decode()
    except HTTPError as e:
        if e.code == 404:
            raise ValueError('DOI not found.')
        raise
    return citeproc_json
def get_source_from_doi(doi):
    citeproc_json = get_citeproc_json_from_doi(doi)
    ref = parse_citeproc_json(citeproc_json)
    return ref
doi_fetched = get_source_from_doi(doi)
rdoi = doi.replace("\\", "/")    # Treat a mistyped backslash as a forward slash
# Search ADS for the record with this DOI. requests URL-encodes the query,
# so the slash in the DOI does not need to be replaced with %2F by hand.
rurl = requests.get("https://api.adsabs.harvard.edu/v1/search/query",
                 params={"q": "doi:" + rdoi, "fl": "bibcode", "rows": 1},
                 headers={'Authorization': 'Bearer ' + token})
rurl.raise_for_status()          # Stop here if the token was rejected
todos          = rurl.json()
todos_response = todos.get('response', {})
docs = todos_response.get('docs', [])
if not docs:
    raise ValueError('No Bibcode found in ADS for this DOI; please move to Step 4.')
Bibcode = docs[0]['bibcode']
print("Bibcode:", Bibcode)

Bibcode: 2020JQSRT.24106711C
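
As noted in the overview, some papers are not in ADS, in which case the search returns no documents and there is no Bibcode to read. The sketch below shows one way to handle that case without an exception. The name `first_bibcode` is hypothetical, and the sample dictionaries are hand-written to mimic the `response` / `docs` / `bibcode` layout that the Step 2 code reads; they are not real API output.

```python
def first_bibcode(search_result):
    """Return the first bibcode in a parsed ADS search response, or None.

    Hypothetical helper; search_result is the dictionary parsed from the
    JSON body of the search request.
    """
    docs = search_result.get('response', {}).get('docs', [])
    if not docs:
        return None
    return docs[0].get('bibcode')

# Hand-written samples shaped like the response used in Step 2
found = {'response': {'docs': [{'bibcode': '2020JQSRT.24106711C'}]}}
missing = {'response': {'docs': []}}

for sample in (found, missing):
    bibcode = first_bibcode(sample)
    if bibcode is None:
        print('No Bibcode found, move to Step 4')
    else:
        print('Bibcode:', bibcode)
```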


## Step 3. ADS Method

### Note before using ADS Method
  1. Exporting using Bibcodes requires two things.
       - A Bibcode (which you got from Step 2)<br>
       - A Token. It must be used whenever you access the ADS database, and it is only available once you have an account on NASA/ADS https://ui.adsabs.harvard.edu/. Once you have an account, click on 'Account' in the top right-hand corner, then click on 'Customize Settings' in the dropdown menu. In 'Customize Settings' there is a panel on the left of the screen; scroll down that panel to find 'API Token'. Click on 'API Token' and then click on 'generate a new key'. <br>
         - You are technically using ADS's API when you are using this method. So for any questions/concerns please refer to the NASA/ADS API Information tool on GitHub https://github.com/adsabs/adsabs-dev-api#access-settings <br>
  2. The benefits of this method are the endless choices to customize your citation output.
     - You can get more information such as... the abstract, copyright, citation count, author affiliation, keywords, publication category and arXiv e-print number, etc.<br>
     - You can search more than 1 bibcode at a time<br>
     - You have more output options such as... EndNote, ProCite, RIS (Refman), RefWorks, MEDLARS, AASTeX, Icarus, MNRAS, Solar Physics (SoPh), DC (Dublin Core) XML, REF-XML, REFABS-XML, VOTables and RSS<br>
     - This notebook does not display examples of all of these output format options, if you are interested in any of these choices or extra features please refer to http://adsabs.github.io/help/actions/export <br>
  3. The first option is to retrieve a citation where the output keeps the HTML special characters as they are (JSON format)<br>
  4. The second option is to retrieve a citation where the output is in BibTeX format<br>
  5. The third option is to retrieve a citation where the output has the HTML special characters converted to entities
  
*Overall you need to make an account on ADS in order to use this method.*

*If you do not want to make an account, use the BibTeX citation from Step 6 and, if you want, use Steps 4 & 5 to retrieve HTML and JSON citation formats. In Steps 4 & 5 you only need to enter the DOI to retrieve citations (the DOI is set to populate for you automatically).*

*However, there are many benefits to using the ADS method: your citation output is completely customizable! So if you're willing and you have your Bibcode, it's recommended to use this method!*
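
The three cells in this step send almost the same request body, and the notes above mention that more than one Bibcode can be searched at a time. The sketch below builds that body from a list of Bibcodes and a dictionary of fields. The name `build_export_payload` is hypothetical, and the second Bibcode is a placeholder, not a real record. How the export string is laid out when several records are returned is not shown in this notebook, so parsing the response is left out.

```python
import json

def build_export_payload(bibcodes, fields):
    """Build the request body for the custom export used in this step.

    Hypothetical helper. fields maps output names to ADS format codes,
    for example {"title": "%T"}.
    """
    template = {"ref_json": fields}
    return {"bibcode": list(bibcodes),
            "sort": "first_author asc",
            "format": json.dumps(template)}

fields = {"authors": "%I", "title": "%T", "journal": "%J", "doi": "%d"}
payload = build_export_payload(
    ["2020JQSRT.24106711C", "SECOND_BIBCODE_HERE"],  # second entry is a placeholder
    fields)
print(json.dumps(payload, indent=2))
```

Because `json.dumps` quotes every value, a `"year": "%Y"` field built this way would come back as a string, whereas the cells below leave `%Y` unquoted so that the year is returned as a number.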

After running the cell below, you will receive an HTML reference with the characters &, <, >, and " included

In [4]:
# HTML with the characters &, <, >, and " included
payload = {"bibcode": ["{}".format(Bibcode)],
           "sort": "first_author asc",
           "format":
           '''{"ref_json": {"authors": "%I",
              "title": "%T",
              "journal": "%J",
              "volume": "%V",
              "start-page": "%p",
              "end-page": "%P",
              "year": %Y,
              "doi": "%d",
              "bibcode": "%u"}}'''
              }
r = requests.post("https://api.adsabs.harvard.edu/v1/export/custom", \
                 headers={"Authorization": "Bearer " + token, "Content-type": "application/json"}, \
                 data=json.dumps(payload))
response_json = r.json()
ref_json = json.loads(response_json['export'])['ref_json']
print('authors:', ref_json['authors'])
print('title:', ref_json['title'])
print('journal:', ref_json['journal'])
print('volume:', ref_json['volume'])
print('start-page:', ref_json['start-page'])
print('end-page:', ref_json['end-page'])
print('year:', ref_json['year'])
print('doi:', ref_json['doi'])
print('bibcode:', ref_json['bibcode'])

authors: Conway, E. K., I. E. Gordon, A. A. Kyuberis, O. L. Polyansky, J. Tennyson, and N. F. Zobov
title: Calculated line lists for H<SUB>2</SUB><SUP>16</SUP>O and H<SUB>2</SUB><SUP>18</SUP>O with extensive comparisons to theoretical and experimental sources including the HITRAN2016 database
journal: Journal of Quantitative Spectroscopy and Radiative Transfer
volume: 241
start-page: 106711
end-page: %P
year: 2020
doi: 10.1016/j.jqsrt.2019.106711
bibcode: https://ui.adsabs.harvard.edu/abs/2020JQSRT.24106711C


After running the cell below, you will receive a BibTeX reference

In [5]:
# BibTeX Reference
payload = {"bibcode": ["{}".format(Bibcode)],
           "sort": "first_author asc",
           "format": 
           '''{"ref_json": {"encoder": "%ZEncoding:latex\\bibitem",
              "journal": "%J",
              "title": "%T",
              "volume": "%V",
              "start-page": "%p",
              "end-page": "%P",
              "year": %Y,
              "authors": "%I",
              "doi": "%d",
              "bibcode": "%u"}}'''
              }
r = requests.post("https://api.adsabs.harvard.edu/v1/export/custom", \
                 headers={"Authorization": "Bearer " + token, "Content-type": "application/json"}, \
                 data=json.dumps(payload))
response_json = r.json() 
ref_json = json.loads(response_json['export'])['ref_json']
print('authors:', ref_json['authors'])
print('title:', ref_json['title'])
print('journal:', ref_json['journal'])
print('volume:', ref_json['volume'])
print('start-page:', ref_json['start-page'])
print('end-page:', ref_json['end-page'])
print('year:', ref_json['year'])
print('doi:', ref_json['doi'])
print('bibcode:', ref_json['bibcode'])
# Note: if this gives you an error, remove "encoder": "%ZEncoding:latex\\bibitem", and put a \ before each J, T, V,
# etc., so that "journal": "%J", becomes "journal": "%\J", which encodes the journal name for BibTeX.
# This error occurs when the BibTeX encoder cannot encode a section of the citation.

authors: Conway, E.~K., I.~E. Gordon, A.~A. Kyuberis, O.~L. Polyansky, J. Tennyson, and N.~F. Zobov
title: Calculated line lists for H$_{2}$$^{16}$O and H$_{2}$$^{18}$O with extensive comparisons to theoretical and experimental sources including the HITRAN2016 database
journal: Journal of Quantitative Spectroscopy and Radiative Transfer
volume: 241
start-page: 106711
end-page: %P
year: 2020
doi: 10.1016/j.jqsrt.2019.106711
bibcode: https://ui.adsabs.harvard.edu/abs/2020JQSRT.24106711C


After running the cell below, you will receive an HTML reference with the characters &, <, >, and " converted to `&amp;`, `&lt;`, `&gt;`, and `&quot;`, respectively.

In [6]:
# HTML with the characters &, <, >, and " converted to &amp;, &lt;, &gt;, and &quot;, respectively.
payload = {"bibcode": ["{}".format(Bibcode)],
           "sort": "first_author asc",
           "format": 
           '''{"ref_json": {"encoder": "%ZEncoding:html<P>",
              "authors": "%I",
              "title": "%T",
              "journal": "%J",
              "volume": "%V",
              "start-page": "%p",
              "end-page": "%P",
              "year": %Y,
              "doi": "%d",
              "bibcode": "%u"}}'''
              }
r = requests.post("https://api.adsabs.harvard.edu/v1/export/custom", \
                 headers={"Authorization": "Bearer " + token, "Content-type": "application/json"}, \
                 data=json.dumps(payload))
response_json = r.json()
ref_json = json.loads(response_json['export'])['ref_json']
print('authors:', ref_json['authors'])
print('title:', ref_json['title'])
print('journal:', ref_json['journal'])
print('volume:', ref_json['volume'])
print('start-page:', ref_json['start-page'])
print('end-page:', ref_json['end-page'])
print('year:', ref_json['year'])
print('doi:', ref_json['doi'])
print('bibcode:', ref_json['bibcode'])

authors: Conway, E. K., I. E. Gordon, A. A. Kyuberis, O. L. Polyansky, J. Tennyson, and N. F. Zobov
title: Calculated line lists for H&lt;SUB&gt;2&lt;/SUB&gt;&lt;SUP&gt;16&lt;/SUP&gt;O and H&lt;SUB&gt;2&lt;/SUB&gt;&lt;SUP&gt;18&lt;/SUP&gt;O with extensive comparisons to theoretical and experimental sources including the HITRAN2016 database
journal: Journal of Quantitative Spectroscopy and Radiative Transfer
volume: 241
start-page: 106711
end-page: %P
year: 2020
doi: 10.1016/j.jqsrt.2019.106711
bibcode: https://ui.adsabs.harvard.edu/abs/2020JQSRT.24106711C
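
In all three outputs above, `end-page` prints as the literal `%P`, which suggests that the code was left unexpanded for this record (the paper is identified by an article number rather than a page range). The sketch below blanks any value that is still a raw format code before the reference is displayed. The name `blank_unexpanded` is hypothetical, and the sample dictionary is copied by hand from the printed output rather than fetched.

```python
import re

def blank_unexpanded(ref):
    """Return a copy of ref where values that are still a raw format code
    (such as '%P' or '%\\J') are replaced by an empty string.

    Hypothetical helper, not part of the original notebook.
    """
    cleaned = {}
    for key, value in ref.items():
        # A raw code is '%', an optional backslash, then a single letter
        if isinstance(value, str) and re.fullmatch(r'%\\?[A-Za-z]', value):
            cleaned[key] = ''
        else:
            cleaned[key] = value
    return cleaned

# Values copied from the output printed above
sample = {'volume': '241', 'start-page': '106711', 'end-page': '%P', 'year': 2020}
print(blank_unexpanded(sample))
```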


## Step 4. Urllib method
If you do not have a Bibcode, or you want a plain text reference, use this method

In [7]:
#doi = input("Enter doi Here: ")
doi_fetched = get_source_from_doi('{}'.format(doi))

# Below are the fields of your citation
# If you would like to add or change anything, edit parse_citeproc_json in Step 2 above
# The first ten items are kept; the last item (the raw JSON) is left out
reference = tuple(doi_fetched[:10])

print ('Authors:', doi_fetched[0], '')
print ('Title:', doi_fetched[1], '')
print ('Journal:', doi_fetched[2], '')
print ('Volume:', doi_fetched[3], '')
print ('Year:', doi_fetched[4], '')
print ('Page Start:', doi_fetched[5], '')
print ('Page End:', doi_fetched[6], '')
print ('Article Number:', doi_fetched[9],'')
print ('DOI:', doi_fetched[7], '')
print ('URL:', doi_fetched[8], '')

Authors: E. K. Conway, I. E. Gordon, A. A. Kyuberis, O. L. Polyansky, J. Tennyson, N. F. Zobov 
Title: Calculated line lists for H216O and H218O with extensive comparisons to theoretical and experimental sources including the HITRAN2016 database 
Journal: Journal of Quantitative Spectroscopy and Radiative Transfer 
Volume: 241 
Year: 2020 
Page Start: 106711 
Page End:  
Article Number: 106711 
DOI: 10.1016/j.jqsrt.2019.106711 
URL: http://dx.doi.org/10.1016/j.jqsrt.2019.106711 
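
The cell above reads each field by its position in the list, which is easy to get wrong when fields are added. As a sketch of an alternative, the fields can be wrapped in a `namedtuple` so they are read by name, and then joined into a one-line plain text citation. The names `Ref`, `to_ref` and `plain_citation` are hypothetical, the field order follows the `ref` list built in Step 2, and the citation layout is an arbitrary choice rather than a particular journal style.

```python
from collections import namedtuple

# Field order follows the ref list built by parse_citeproc_json in Step 2
Ref = namedtuple('Ref', ['authors', 'title', 'journal', 'volume', 'year',
                         'page_start', 'page_end', 'doi', 'url',
                         'article_number'])

def to_ref(fields):
    """Wrap the first ten items of the Step 2 list in a Ref (hypothetical)."""
    return Ref(*fields[:10])

def plain_citation(ref):
    """Join the fields into one line of plain text (layout is arbitrary)."""
    pages = ref.page_start
    if ref.page_end:
        pages = '{}-{}'.format(ref.page_start, ref.page_end)
    return '{}, "{}", {} {}, {} ({}). https://doi.org/{}'.format(
        ref.authors, ref.title, ref.journal, ref.volume, pages, ref.year,
        ref.doi)

# Values copied by hand from the output printed above
sample = ['E. K. Conway, I. E. Gordon, A. A. Kyuberis, O. L. Polyansky, '
          'J. Tennyson, N. F. Zobov',
          'Calculated line lists for H216O and H218O with extensive '
          'comparisons to theoretical and experimental sources including '
          'the HITRAN2016 database',
          'Journal of Quantitative Spectroscopy and Radiative Transfer',
          '241', 2020, '106711', '', '10.1016/j.jqsrt.2019.106711',
          'http://dx.doi.org/10.1016/j.jqsrt.2019.106711', '106711']
print(plain_citation(to_ref(sample)))
```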


## Step 5. Encoding JSON in HTML
Reference is populated from the Urllib Method

In [8]:
# Here is the populated output from the urllib method
# html.escape replaces the characters & < > " ' with &amp; &lt; &gt; &quot; &#x27;
# The reference tuple is turned into text first, so its quote marks are escaped as well
html.escape("{}".format(reference))

'(&#x27;E. K. Conway, I. E. Gordon, A. A. Kyuberis, O. L. Polyansky, J. Tennyson, N. F. Zobov&#x27;, &#x27;Calculated line lists for H216O and H218O with extensive comparisons to theoretical and experimental sources including the HITRAN2016 database&#x27;, &#x27;Journal of Quantitative Spectroscopy and Radiative Transfer&#x27;, &#x27;241&#x27;, 2020, &#x27;106711&#x27;, &#x27;&#x27;, &#x27;10.1016/j.jqsrt.2019.106711&#x27;, &#x27;http://dx.doi.org/10.1016/j.jqsrt.2019.106711&#x27;, &#x27;106711&#x27;)'
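
The output above escapes the printed form of the whole tuple, so the quote marks and parentheses of the tuple itself end up in the result. If the goal is a citation that can be placed directly in a web page, one option is to escape each field on its own and then add the markup. The sketch below assumes that goal; the name `html_citation` and the layout of the paragraph are illustrative choices, not part of this notebook.

```python
import html

def html_citation(authors, title, journal, volume, year, doi):
    """Return one HTML paragraph for a citation, escaping every field.

    Hypothetical helper; the markup layout is an arbitrary choice.
    """
    link = 'https://doi.org/' + doi
    return ('<p>{}. <i>{}</i>. {} {} ({}). '
            '<a href="{}">{}</a></p>').format(
        html.escape(authors),
        html.escape(title),
        html.escape(journal),
        html.escape(str(volume)),
        html.escape(str(year)),
        html.escape(link, quote=True),  # quote=True also escapes " and '
        html.escape(doi))

print(html_citation('E. K. Conway, I. E. Gordon',
                    'Line lists for H2O & more <draft>',
                    'Journal of Quantitative Spectroscopy and Radiative Transfer',
                    '241', 2020, '10.1016/j.jqsrt.2019.106711'))
```

The title in this usage example is made up to show the escaping of `&`, `<` and `>`.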

## Step 6. BibTeX citation

In [9]:
"""Library to query Google Scholar.
Call the method query with a string which contains the full search
string. Query will return a list of citations.
"""
GOOGLE_SCHOLAR_URL = "https://scholar.google.com"
HEADERS = {'User-Agent': 'Mozilla/5.0'}
FORMAT_BIBTEX = 4
FORMAT_ENDNOTE = 3
FORMAT_REFMAN = 2
FORMAT_WENXIANWANG = 5
logger = logging.getLogger(__name__)
# we are using query in our code
def query(searchstr, outformat=FORMAT_BIBTEX, allresults=False):
    """Query google scholar.
    This method queries google scholar and returns a list of citations.
    Parameters
    ----------
    searchstr : str
        the query
    outformat : int, optional
        the output format of the citations. Default is bibtex.
    allresults : bool, optional
        return all results or only the first (i.e. best one)
    Returns
    -------
    result : list of strings
        the list with citations
    """
    logger.debug("Query: {sstring}".format(sstring=searchstr))
    searchstr = '/scholar?q='+quote(searchstr)
    url = GOOGLE_SCHOLAR_URL + searchstr
    header = dict(HEADERS)  # copy, so the shared HEADERS dictionary is not modified
    header['Cookie'] = "GSP=CF=%d" % outformat
    request = Request(url, headers=header)
    response = urlopen(request)
    html = response.read()
    html = html.decode('utf8')
    # grab the links
    tmp = get_links(html, outformat)
    # follow the bibtex links to get the bibtex entries
    result = list()
    if not allresults:
        tmp = tmp[:1]
    for link in tmp:
        url = GOOGLE_SCHOLAR_URL+link
        request = Request(url, headers=header)
        response = urlopen(request)
        bib = response.read()
        bib = bib.decode('utf8')
        result.append(bib)
    return result
def get_links(html, outformat):
    """Return a list of reference links from the html.
    Parameters
    ----------
    html : str
    outformat : int
        the output format of the citations
    Returns
    -------
    List[str]
        the links to the references
    """
    if outformat == FORMAT_BIBTEX:
        refre = re.compile(r'<a href="https://scholar.googleusercontent.com(/scholar\.bib\?[^"]*)')
    elif outformat == FORMAT_ENDNOTE:
        refre = re.compile(r'<a href="https://scholar.googleusercontent.com(/scholar\.enw\?[^"]*)"')
    elif outformat == FORMAT_REFMAN:
        refre = re.compile(r'<a href="https://scholar.googleusercontent.com(/scholar\.ris\?[^"]*)"')
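    elif outformat == FORMAT_WENXIANWANG:
        refre = re.compile(r'<a href="https://scholar.googleusercontent.com(/scholar\.ral\?[^"]*)"')
    else:
        raise ValueError('Unknown output format: {}'.format(outformat))
    reflist = refre.findall(html)
    # unescape html entities (for example &amp; back to &)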
    reflist = [re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
                      chr(name2codepoint[m.group(1)]), s) for s in reflist]
    return reflist
class Bibtex(object):
    """Convert a DOI to a BibTeX entry."""
    def __init__(self, doi=None, title=None):
        """
        Input a DOI or a title (actually any text/keyword).
        Stores the doi, the doi url and the title.
        """
        _base_url = "http://dx.doi.org/"
        self.doi = doi
        self.title = title
        self.url = _base_url + doi if doi else None
        self.bibtex = None
    # Beautiful Soup is a Python library for pulling data out of HTML and XML files
    def _soupfy(self, url):
        """Returns a soup object."""
        html = urlopen(url).read()
        self.soup = BeautifulSoup(html, 'html.parser')
        return self.soup
    def getGScholar(self):
        """Get bibtex entry from doi using Google database."""
        bib = query(self.doi, FORMAT_BIBTEX)[0]
        bib = bib.split('\n')
        self.bibtex = '\n'.join(bib[0:-1])  # drop the empty last line
        return self.bibtex
def main(argv=None):
    """Command-line style entry point; it is not called by this notebook."""
    import sys
    if argv is None:
        argv = sys.argv
    doi = argv[1]  # assumes the DOI is the first command-line argument
    def allfailed():
        """All failed message+google try."""
        bold, reset = "\033[1m", "\033[0;0m"
        bib.getGScholar()
        url = bold + "http://dx.doi.org/" + doi + reset
        msg = """Unable to resolve this DOI using database
        \nTry opening, \n\t{0}\nand download it manually.
        \n...or if you are lucky check the Google Scholar search below:
        \n{1}
        """.format(url, bib.bibtex)
        return msg
    bib = Bibtex(doi=doi)
doi = '{}'.format(doi)       
bib = Bibtex(doi)
bib = bib.getGScholar()
print(bib)

@article{conway2020calculated,
  title={Calculated line lists for H216O and H218O with extensive comparisons to theoretical and experimental sources including the HITRAN2016 database},
  author={Conway, Eamon K and Gordon, Iouli E and Kyuberis, Aleksandra A and Polyansky, Oleg L and Tennyson, Jonathan and Zobov, Nikolai F},
  journal={Journal of Quantitative Spectroscopy and Radiative Transfer},
  volume={241},
  pages={106711},
  year={2020},
  publisher={Elsevier}
}
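
The entry above comes from a Google Scholar lookup. As a sketch of an offline alternative, an entry with the same fields can be assembled from values that are already known. The name `bibtex_entry` is hypothetical, and the citation key rule (first author's family name, year, first word of the title) only imitates the key shown above; it is not claimed to be Google Scholar's actual rule. Special characters are not escaped for LaTeX here.

```python
import re

def bibtex_entry(authors, title, journal, volume, pages, year):
    """Assemble an @article entry from plain values.

    Hypothetical helper. authors is a list of 'Family, Given' strings.
    """
    family = authors[0].split(',')[0]
    first_word = title.split()[0]
    # Key: family name + year + first title word, letters and digits only
    key = re.sub(r'[^a-z0-9]', '', (family + str(year) + first_word).lower())
    fields = [('title', title),
              ('author', ' and '.join(authors)),
              ('journal', journal),
              ('volume', volume),
              ('pages', pages),
              ('year', year)]
    # Skip empty fields; each line looks like "  name={value}"
    body = ',\n'.join('  {}={{{}}}'.format(name, value)
                      for name, value in fields if value)
    return '@article{{{},\n{}\n}}'.format(key, body)

print(bibtex_entry(
    ['Conway, Eamon K', 'Gordon, Iouli E'],
    'Calculated line lists for H216O and H218O',
    'Journal of Quantitative Spectroscopy and Radiative Transfer',
    '241', '106711', 2020))
```

The author list and title in this usage example are shortened from the entry above.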
