kylemcdonald/arxiv-sanity-preserver


arxiv sanity preserver

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It's super hacky and was written in 4 hours. I'll perhaps keep polishing it a bit over time, but it already serves its purpose for me. The code uses the Arxiv API to fetch the most recent papers (as many as you want; I used the last 1100 papers over the last 3 months), then downloads all the pdfs, extracts text, and creates tfidf vectors for each paper. Lastly, there is a flask interface for searching through and filtering similar papers using the vectors.
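To give a feel for the first stage, here is a minimal sketch of how paged query URLs for the Arxiv API could be built. It is not the code in scrape.py: the function names, the page size, and the `cat:cs.CV` query in the usage note are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint. Everything else here is an illustrative assumption.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=100):
    """Return an arXiv API URL for one page of results, most recently updated first."""
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def page_urls(search_query, total, page_size=100):
    """Return the URLs needed to fetch `total` papers in pages of `page_size`."""
    return [
        build_query_url(search_query, start, min(page_size, total - start))
        for start in range(0, total, page_size)
    ]
```

For example, `page_urls("cat:cs.CV", 1100)` yields eleven URLs. Each response is an Atom XML feed, which is the kind of file feedparser can read.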

The main functionality is a search feature. The most useful part is that you can click "sort by tfidf similarity to this", which returns the papers most similar to that one in terms of tfidf bigrams. I find this quite useful.
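The idea behind the similarity sort can be sketched in plain Python. This is a toy stand-in, not what analyze.py or serve.py do: the real pipeline uses scikit-learn's tfidf vectorizer, and the tokenizer, the bigram-only features, and all function names below are assumptions.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text and return its list of adjacent word pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return list(zip(words, words[1:]))


def tfidf_vectors(docs):
    """Map each document id to a sparse, L2-normalised tfidf vector (a dict)."""
    counts = {pid: Counter(bigrams(text)) for pid, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n = len(docs)
    vectors = {}
    for pid, c in counts.items():
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, so no term gets zero weight.
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[pid] = {t: w / norm for t, w in vec.items()}
    return vectors


def most_similar(vectors, query_id):
    """Return all ids sorted by cosine similarity to `query_id`, best first."""
    q = vectors[query_id]

    def score(pid):
        v = vectors[pid]
        # Vectors are unit length, so the dot product is the cosine similarity.
        return sum(w * v.get(t, 0.0) for t, w in q.items())

    return sorted(vectors, key=score, reverse=True)
```

Given a dict of paper id to extracted text, `most_similar(tfidf_vectors(docs), some_id)` returns every id with the query paper itself first, followed by the papers that share the most distinctive bigrams with it.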

user interface

Dependencies

You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer), and flask (for serving the results).
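A quick way to see which of these are missing before starting is sketched below. The helper name is made up for this example; the only non-obvious detail is that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the packages listed above (scikit-learn imports as "sklearn").
REQUIRED = ("numpy", "feedparser", "sklearn", "flask")


def missing_modules(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```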

Ugly "I don't have time" processing pipeline

Requires reading code and getting hands dirty. Magic numbers throughout code.

  1. Run scrape.py, which queries most recent papers in Arxiv and dumps xml into folder raw
  2. Run parse_raw.py, which reads all xml files in raw and creates a pickle with all critical information called db.p.
  3. Run download_pdf.py, which iterates over all papers in parsed pickle and downloads the papers into folder pdf
  4. Run parse_pdf_to_text.py to export all text from pdfs to files in txt
  5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p pickle file.
  6. Run thumb_pdf.py to export thumbnails of all pdfs to thumb
  7. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers
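If you'd rather not type steps 1-6 by hand, a small driver along these lines could run them in order. This is a sketch, not a file in the repository: the script names come from the list above, while the function names and the use of the current interpreter are assumptions.

```python
import subprocess
import sys

# Script names and order are taken from steps 1-6 above.
PIPELINE = (
    "scrape.py",
    "parse_raw.py",
    "download_pdf.py",
    "parse_pdf_to_text.py",
    "analyze.py",
    "thumb_pdf.py",
)


def commands(scripts=PIPELINE, python=sys.executable):
    """Build one command line per script, using the given interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each script in order and stop at the first failure."""
    for cmd in commands(scripts):
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root would run the six scripts one after another. Stopping at the first failure matters here because each step reads the output of the one before it.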

Prebuilt database

If you'd like to browse arxiv papers from the last 3 months, you can download the result of running steps 1-6 above and only run step 7 to browse. Here is the download link. Unzip it in the root folder and fire up flask with serve.py. Should work I think.
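A small check like the following could confirm that the unzipped files landed where expected before starting the server. The list of names is taken from the pipeline steps; whether the prebuilt archive contains exactly these files and folders is an assumption.

```python
from pathlib import Path

# Outputs named in steps 2-6 above. That the prebuilt archive ships all of
# them, and that the server needs all of them, is an assumption.
EXPECTED = ("db.p", "tfidf.p", "pdf", "txt", "thumb")


def missing_outputs(root=".", expected=EXPECTED):
    """Return the expected files or folders that are absent under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]
```

An empty list from `missing_outputs()` in the repository root suggests the archive was unpacked in the right place.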

Running online

If you'd like to run this flask server online (e.g. AWS/Terminal) make sure to uncomment app.debug = True in serve.py, and change app.run() to app.run(host='0.0.0.0') to make the app visible to the world.
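The host change is sketched below. The `app` object and the route are placeholders standing in for what serve.py defines, and `run_options` is a hypothetical helper; only `app.run(host=..., port=...)` is the real Flask call.

```python
def run_options(public=False, port=5000):
    """Return keyword arguments for Flask's app.run()."""
    # 127.0.0.1 accepts local connections only; 0.0.0.0 listens on all interfaces.
    host = "0.0.0.0" if public else "127.0.0.1"
    return {"host": host, "port": port}


def main(public=False):
    # Imported here so the helper above can be used without Flask installed.
    from flask import Flask

    app = Flask(__name__)  # placeholder for the app object defined in serve.py

    @app.route("/")
    def index():
        return "placeholder page"

    app.run(**run_options(public=public))
```

With `main(public=True)` the server accepts connections from other machines on port 5000, provided the host's firewall or security group allows it.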

About

Quick web interface for browsing recent arxiv submissions

Languages

  • Python 58.0%
  • HTML 42.0%