# Converting from recid to arXiv identifier
Flip Tanedo
24 October 2015
for: Science Hack Day SF 2015, CiteMap team

**Background**: When working with bibliographic data in high energy physics (HEP), the InspireHEP database identifies each paper by a unique *recid*, while the arXiv uses a unique *arXiv identifier*. The simple functions below call the InspireHEP API to [try to] map between *recid*s and *arXiv identifiers*. They can also take a *recid* and return basic paper data.

For information about the InspireHEP API, see: http://inspirehep.net/info/hep/api?ln=en
Note that the examples there leave out the (perhaps obvious) detail that every request must be prefixed with 'http://inspirehep.net'.
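
A minimal sketch of that prefixing, assuming the documented examples are relative paths such as `/search` followed by a query string. The names `INSPIRE_HOST` and `build_inspire_url` are hypothetical helpers for illustration, not part of the InspireHEP API. The query parameters are percent-encoded by `urlencode`, which a web server would normally decode on its side.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Assumption: the host that the documented example paths are relative to.
INSPIRE_HOST = 'http://inspirehep.net'


def build_inspire_url(path, params):
    """
    Input: a path as written in the API examples (e.g. '/search')
           and a list of (key, value) query parameters
    Output: the full request URL, prefixed with the host
    """
    # Strip any leading slash so the host and path are joined by exactly one.
    return INSPIRE_HOST + '/' + path.lstrip('/') + '?' + urlencode(params)


# A list of pairs (rather than a dict) keeps the parameter order fixed.
url = build_inspire_url('/search', [
    ('p', 'eprint:arxiv:0801.1833'),
    ('of', 'recjson'),
    ('ot', 'recid,title'),
])
```

The resulting `url` could then be passed straight to `requests.get`.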

In [179]:
import json
import requests

inspire_location = 'http://inspirehep.net/'
output_type = 'ot=primary_report_number,recid,title,system_control_number,doi,filenames'
get_json = 'of=recjson'

def arXiv_to_recid(arXiv_id_str):
    """ 
    Input: arXiv ID ('hep-ph/0410364' or '1506.06131') 
    Output: InspireHEP recid
    """
    # New-style IDs ('1506.06131') start with four digits and are queried
    # with the 'arxiv:' prefix; old-style IDs ('hep-ph/0410364') are not.
    if arXiv_id_str[:4].isdigit():
        query = 'search?p=eprint:arxiv:' + arXiv_id_str
    else:
        query = 'search?p=eprint:' + arXiv_id_str
    my_request = inspire_location + query + '&' + get_json + '&' + output_type
    my_json = requests.get(my_request).json()
    if len(my_json) > 1:
        print("HEY, not a unique arXiv ID! Error in arXiv_to_recid.")
    return my_json[0]['recid']
# Examples: 
# arXiv_to_recid('hep-ph/0305127')
# arXiv_to_recid('0801.1833')

    
def recid_to_arXiv(recid_str):
    """
    Input: recid (e.g. '618609')
    Output: arXiv identifier
    Note: this is much more difficult since the arXiv IDs aren't encoded
    in a standardized way!
    """
    query = 'record/' + recid_str
    my_request = inspire_location + query + '?' + get_json + '&' + output_type
    my_json = requests.get(my_request).json()
    if len(my_json) > 1:
        print("HEY, not a unique recid! Error in recid_to_arXiv.")
    ## UNFINISHED:
    ##
    ## Need to extract the arXiv identifier from this
    ## it takes one of two forms: 
    ## (1) arXiv:0801.1833
    ## (2) arXiv:hep-ph/0305127 , where "hep-ph" can be different letters, see arXiv for examples
    ##
    return str(my_json)

## Testing and Examples

In [161]:
arXiv_to_recid('hep-ph/0305127')

618609

In [157]:
arXiv_to_recid('0801.1833')

777282

In [180]:
temp = recid_to_arXiv('618609')
print(temp)

[{u'doi': u'10.1016/j.nuclphysb.2003.08.033', u'title': {u'title': u'On the two loop Yukawa corrections to the MSSM Higgs boson masses at large tan beta'}, u'system_control_number': [{u'institute': u'arXiv', u'canceled': u'oai:arXiv.org:hep-ph/0305127', u'value': u'oai:arXiv.org:hep-ph/0305127'}, {u'institute': u'DESY', u'canceled': u'D03-09843'}, {u'institute': u'SPIRESTeX', u'value': u'Dedes:2003km'}, {u'institute': u'CDS', u'value': u'621776'}], u'filenames': [u'arXiv:hep-ph_0305127.pdf', u'arXiv:hep-ph_0305127'], u'recid': 618609, u'primary_report_number': [u'hep-ph/0305127', u'MPI-PHT-2003-21', u'TUM-HEP-507-03', u'RM3-TH-03-05']}]
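
The record above carries the arXiv identifier in more than one place: in the `system_control_number` entry whose `institute` is `arXiv` (as `oai:arXiv.org:hep-ph/0305127`), and as the first item of `primary_report_number`. Here is a minimal sketch of how the unfinished extraction step in `recid_to_arXiv` might look, assuming other records have the same shape as this one example. The names `extract_arxiv_id`, `_as_list`, `ARXIV_ID_PATTERN` and `OAI_PREFIX` are hypothetical, and the regular expression is only a guess at the two ID forms, so it may miss edge cases.

```python
import re

# Assumption: an ID is either old style ('hep-ph/0305127', optionally with a
# subject class such as 'math.AG/0305127') or new style ('0801.1833'),
# optionally prefixed with 'arXiv:' and optionally followed by a version.
ARXIV_ID_PATTERN = re.compile(
    r'^(?:arXiv:)?('
    r'[a-z][a-z\-]*(?:\.[A-Z]{2})?/\d{7}'   # old style: archive/YYMMNNN
    r'|\d{4}\.\d{4,5}'                      # new style: YYMM.NNNN(N)
    r')(?:v\d+)?$'
)

OAI_PREFIX = 'oai:arXiv.org:'


def _as_list(value):
    """Wrap a single value in a list, in case a field is not a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def extract_arxiv_id(records):
    """
    Input: parsed recjson list, as returned by requests.get(...).json()
    Output: arXiv identifier string, or None if no field looks like one
    """
    if not records:
        return None
    record = records[0]

    # First choice: the control number registered under institute 'arXiv'.
    for entry in _as_list(record.get('system_control_number')):
        value = entry.get('value', '')
        if entry.get('institute') == 'arXiv' and value.startswith(OAI_PREFIX):
            return value[len(OAI_PREFIX):]

    # Fallback: the first report number matching either arXiv ID form.
    for number in _as_list(record.get('primary_report_number')):
        match = ARXIV_ID_PATTERN.match(number)
        if match:
            return match.group(1)

    return None
```

With the record printed above, `extract_arxiv_id(my_json)` gives `'hep-ph/0305127'`, so `recid_to_arXiv` could return that value instead of `str(my_json)`.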
