## One way to access arXiv metadata

We use the `arxiv` wrapper for Python (install it first if you don't have it).

In [1]:
#!pip install arxiv
import arxiv

This notebook demonstrates how to look up the metadata of a paper given its arXiv ID.

You can also extract the arXiv metadata some other way, as long as you don't use the forbidden packages mentioned in the project description.

In [2]:
# Search by paper ID; the arxiv package supports many other kinds of queries
search = arxiv.Search(id_list=["1703.00663"])
# IMPORTANT NOTE: you can also pass multiple IDs in id_list at once to reduce calls to the web API
# E.g.: id_list=["1703.00001", "1703.00002", ..., "1703.99999"]

# search.results() returns a generator, so convert it to a list
# (newer releases of the arxiv package deprecate this in favor of arxiv.Client().results(search))
result = list(search.results())
# result will be [] if no paper exists for the given ID or IDs

# Also initialize a dictionary to store the paper's metadata
paper_info = {}
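
The note about sending several IDs per request can be sketched as a small batching helper. The name `chunk_ids` and the default batch size of 100 are illustrative assumptions, not part of the `arxiv` package; check the API's current limits before settling on a size.

```python
def chunk_ids(ids, batch_size=100):
    """Split a list of arXiv IDs into consecutive batches (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


all_ids = ["1703.00663", "1703.00001", "1703.00002"]
batches = chunk_ids(all_ids, batch_size=2)
print(batches)

# Each batch could then be sent as a single search, for example:
# for batch in batches:
#     search = arxiv.Search(id_list=batch)
#     results = list(search.results())
```

Fewer, larger requests mean fewer round trips to the web API, at the cost of larger responses per call.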

Below are <i>all</i> the metadata fields we can get from arXiv.

In [3]:
#title of paper
#We only have 1 result so access index 0
paper_info['title'] = result[0].title
print("title:", paper_info['title'])

#authors (they have to be taken one by one and converted into string)
paper_info['authors'] = []

for author in result[0].authors:
    paper_info['authors'].append(str(author))

print("authors:",paper_info['authors'])

#summary (abstract)
paper_info['summary'] = result[0].summary
print("summary/abstract:",paper_info['summary'])

#comment by author
paper_info['comment'] = result[0].comment
print("comment:",paper_info['comment'])

#journal reference (if present)
paper_info['journal_ref'] = result[0].journal_ref
print("journal_ref:",paper_info['journal_ref'])

#DOI (if present)
paper_info['doi'] = result[0].doi
print("doi:",paper_info['doi'])

#entry_id
paper_info['entry_id'] = result[0].entry_id
print("entry_id:", paper_info['entry_id'])

#last updated
paper_info['updated'] = str(result[0].updated)
print("last updated:",paper_info['updated'])

#first published date
paper_info['published'] = str(result[0].published)
print("original posting date:",paper_info['published'])


#Primary category
paper_info['primary_category'] = result[0].primary_category
print("primary category:", paper_info['primary_category'])

#All categories
paper_info['categories'] = result[0].categories
print("categories:", paper_info['categories'])

#links (stored as a string because the Link objects are not JSON-serializable)
paper_info['links'] = str(result[0].links)
print("links:",paper_info['links'])

#pdf_url
paper_info['pdf_url'] = result[0].pdf_url
print("pdf_url:",paper_info['pdf_url'])

title: Introduction to Nonnegative Matrix Factorization
authors: ['Nicolas Gillis']
summary/abstract: In this paper, we introduce and provide a short overview of nonnegative
matrix factorization (NMF). Several aspects of NMF are discussed, namely, the
application in hyperspectral imaging, geometry and uniqueness of NMF solutions,
complexity, algorithms, and its link with extended formulations of polyhedra.
In order to put NMF into perspective, the more general problem class of
constrained low-rank matrix approximation problems is first briefly introduced.
comment: 18 pages, 4 figures
journal_ref: SIAG/OPT Views and News 25 (1), pp. 7-16 (2017)
doi: None
entry_id: http://arxiv.org/abs/1703.00663v1
last updated: 2017-03-02 08:23:04+00:00
original posting date: 2017-03-02 08:23:04+00:00
primary category: cs.NA
categories: ['cs.NA', 'cs.CV', 'cs.LG', 'math.OC', 'stat.ML']
links: [arxiv.Result.Link('http://arxiv.org/abs/1703.00663v1', title=None, rel='alternate', content_type=None), arxiv.R

To keep only papers whose primary category is one of our desired categories, you can check:<br> <code>if paper_info['primary_category'] in ['cs.LG','cs.AI','cs.CC','cs.AR']:</code>
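
As a minimal sketch of that check, the helper below wraps it in a function. The names `DESIRED_CATEGORIES` and `in_desired_category` are made up for illustration, and the sample dictionary only copies two fields from the output above.

```python
DESIRED_CATEGORIES = {"cs.LG", "cs.AI", "cs.CC", "cs.AR"}


def in_desired_category(info, desired=DESIRED_CATEGORIES):
    """Return True if the paper's primary category is one we want."""
    # .get() avoids a KeyError if the field was never filled in
    return info.get("primary_category") in desired


sample = {
    "title": "Introduction to Nonnegative Matrix Factorization",
    "primary_category": "cs.NA",
}
print(in_desired_category(sample))
```

Note that the paper used in this notebook has primary category `cs.NA`, so it fails this check even though `cs.LG` appears in its full list of categories.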

Now you can store this dictionary (or build a list of dictionaries to store metadata for multiple papers).

In [4]:
import json

with open('paper_info.json', 'w') as fp:
    json.dump(paper_info, fp)
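
For multiple papers, one option is to collect the dictionaries in a list and dump the list in a single call. The sketch below assumes a hypothetical file name `papers_info.json`, written to the temporary directory so it does not overwrite anything; the second entry is a made-up placeholder, not a real paper.

```python
import json
import os
import tempfile

papers = [
    {
        "entry_id": "http://arxiv.org/abs/1703.00663v1",
        "title": "Introduction to Nonnegative Matrix Factorization",
        "primary_category": "cs.NA",
    },
    # Placeholder entry, made up for illustration only
    {"entry_id": "example-id", "title": "Example title", "primary_category": "cs.LG"},
]

# Hypothetical output path; adjust to wherever you keep your data
out_path = os.path.join(tempfile.gettempdir(), "papers_info.json")

with open(out_path, "w", encoding="utf-8") as fp:
    json.dump(papers, fp, indent=2)

with open(out_path, "r", encoding="utf-8") as fp:
    loaded = json.load(fp)

print(len(loaded))
```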

Reopen the JSON file to confirm that everything was saved correctly.

In [5]:
with open('paper_info.json', 'r') as fp:
    paper_info2 = json.load(fp)
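
The round trip only works because every stored value is a JSON-compatible type, which is why `updated` and `published` were converted with `str()` earlier. The following sketch illustrates this with a hand-built record; the variable names are illustrative and the values are copied from the output above.

```python
import json
from datetime import datetime, timezone

updated = datetime(2017, 3, 2, 8, 23, 4, tzinfo=timezone.utc)

# A raw datetime cannot be serialized by the json module
try:
    json.dumps({"updated": updated})
    raw_ok = True
except TypeError:
    raw_ok = False

# Converting to a string first lets the record survive the round trip;
# None is written as null and read back as None
record = {"updated": str(updated), "doi": None, "authors": ["Nicolas Gillis"]}
restored = json.loads(json.dumps(record))

print(raw_ok)
print(restored == record)
print(restored["updated"])
```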

In [6]:
paper_info2

{'title': 'Introduction to Nonnegative Matrix Factorization',
 'authors': ['Nicolas Gillis'],
 'summary': 'In this paper, we introduce and provide a short overview of nonnegative\nmatrix factorization (NMF). Several aspects of NMF are discussed, namely, the\napplication in hyperspectral imaging, geometry and uniqueness of NMF solutions,\ncomplexity, algorithms, and its link with extended formulations of polyhedra.\nIn order to put NMF into perspective, the more general problem class of\nconstrained low-rank matrix approximation problems is first briefly introduced.',
 'comment': '18 pages, 4 figures',
 'journal_ref': 'SIAG/OPT Views and News 25 (1), pp. 7-16 (2017)',
 'doi': None,
 'entry_id': 'http://arxiv.org/abs/1703.00663v1',
 'updated': '2017-03-02 08:23:04+00:00',
 'published': '2017-03-02 08:23:04+00:00',
 'primary_category': 'cs.NA',
 'categories': ['cs.NA', 'cs.CV', 'cs.LG', 'math.OC', 'stat.ML'],
 'links': "[arxiv.Result.Link('http://arxiv.org/abs/1703.00663v1', title=None, r