There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.
It's super hacky and was written in 4 hours. I'll perhaps keep polishing it a bit over time, but it already serves its purpose for me. The code uses the Arxiv API to fetch the most recent papers (as many as you want; I used the last 1100 papers over the last 3 months), then downloads all the papers, extracts the text, and creates tfidf vectors for each paper. Lastly, a flask interface lets you search through and filter similar papers using the vectors.
Main functionality is a search feature, and most useful is that you can click "sort by tfidf similarity to this", which returns all the most similar papers to that one in terms of tfidf bigrams. I find this quite useful.
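To make the "sort by tfidf similarity" idea concrete, here is a minimal sketch using scikit-learn. It is not the code from `analyze.py` or `serve.py`: the toy documents, the `most_similar` helper, and the `ngram_range=(1, 2)` setting are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-ins for the extracted paper texts (hypothetical content).
docs = [
    "convolutional neural networks for image classification",
    "deep convolutional neural networks for object detection",
    "bayesian inference for gaussian process regression",
]

# Unigrams plus bigrams; the exact settings used by the real script
# are not stated in this README, so this is an assumption.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix, rows L2-normalized by default


def most_similar(index, X):
    """Return document indices sorted by descending similarity to document `index`."""
    # Rows are unit length, so the dot product equals cosine similarity.
    scores = linear_kernel(X[index:index + 1], X).ravel()
    return [int(i) for i in np.argsort(-scores)]


ranking = most_similar(0, X)
print(ranking)  # the query document itself comes first
```

The query paper always ranks first because its similarity to itself is 1; the rest follow in order of shared unigrams and bigrams.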
You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer), and flask (for serving the results).
Requires reading code and getting hands dirty. Magic numbers throughout code.
1. Run `scrape.py`, which queries the most recent papers in Arxiv and dumps xml into folder `raw`
2. Run `parse_raw.py`, which reads all xml files in `raw` and creates a pickle with all critical information called `db.p`
3. Run `download_pdf.py`, which iterates over all papers in the parsed pickle and downloads the papers into folder `pdf`
4. Run `parse_pdf_to_text.py` to export all text from pdfs to files in `txt`
5. Run `analyze.py` to compute tfidf vectors for all documents based on bigrams. Saves a `tfidf.p` pickle file
6. Run `thumb_pdf.py` to export thumbnails of all pdfs to `thumb`
7. Run the flask server with `serve.py`. Visit localhost:5000 and enjoy sane viewing of papers
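As a rough illustration of what the first two steps involve, the sketch below builds an Arxiv API query URL and pulls ids and titles out of the returned Atom xml. It is an assumption-laden stand-in, not the contents of `scrape.py` or `parse_raw.py`: the helper names, the `cat:cs.CV` query, and the sample feed are hypothetical, and it uses the standard library's `xml.etree.ElementTree` in place of feedparser so that it runs without extra installs.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(search_query, start=0, max_results=100):
    """Build a query URL asking for the most recently updated papers first."""
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_entries(atom_xml):
    """Extract id and whitespace-normalized title from each Atom entry."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        papers.append({"id": raw_id.strip(), "title": " ".join(raw_title.split())})
    return papers


# Hypothetical response fragment, standing in for a real API reply.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A  Placeholder
      Title</title>
  </entry>
</feed>"""

url = build_query_url("cat:cs.CV")  # category chosen only as an example
papers = parse_entries(SAMPLE)
```

Fetching `url` (for example with `urllib.request.urlopen`) and paging with `start` would give the raw xml that the later steps consume.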
If you'd like to browse arxiv papers from the last 3 months, you can download the result of running steps 1-6 above and only run step 7 to browse. Here is the download link. Unzip it in the root folder and fire up flask with `serve.py`. Should work I think.
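Before starting the server on a downloaded snapshot, a quick sanity check that the unzipped files landed in the root folder can save some confusion. The helper below is a hypothetical sketch; the file and folder names come from the steps above, but which of them `serve.py` actually reads at startup is an assumption.

```python
import os

# Names taken from the pipeline steps; treating exactly these as
# required by the server is an assumption.
EXPECTED_FILES = ["db.p", "tfidf.p"]
EXPECTED_DIRS = ["thumb"]


def missing_artifacts(root="."):
    """Return the expected files and folders that are absent under `root`."""
    missing = [f for f in EXPECTED_FILES if not os.path.isfile(os.path.join(root, f))]
    missing += [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
    return missing
```

Calling `missing_artifacts()` from the root folder returns an empty list when everything is in place.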
If you'd like to run this flask server online (e.g. AWS/Terminal), make sure to uncomment `app.debug = True` in `serve.py`, and change `app.run()` to `app.run(host='0.0.0.0')` to make the app visible to the world.
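The effect of the `host` argument can be seen in a minimal flask app like the one below. This is a standalone sketch, not an excerpt from `serve.py`: the `index` route and the `start` helper are hypothetical, and the sketch leaves the debug setting at flask's default rather than guessing what `serve.py` does with it.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    # Placeholder response; the real app renders the paper listing.
    return "paper listing would be rendered here"


def start(public=False):
    """Start the development server, optionally reachable from other machines."""
    if public:
        # '0.0.0.0' binds to all network interfaces.
        app.run(host="0.0.0.0")
    else:
        # The default binds to localhost only, on port 5000.
        app.run()
```

Calling `start()` serves on localhost:5000 only, while `start(public=True)` accepts connections from other hosts.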