arxiv sanity preserver

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It's super hacky and was written in 4 hours. I'll perhaps keep polishing it a bit over time, but it already serves its purpose for me. The code uses the arXiv API to query the most recent papers (as many as you want; I used the last 1100 papers over the last 3 months), then downloads all the papers, extracts the text, and creates tfidf vectors for each paper. Lastly, a flask interface lets you search through the papers and filter for similar ones using the vectors.
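To give a feel for the first step, here is a rough sketch of how paged requests to the arXiv API could be built. It is not the code in scrape.py. The function name `build_query_urls`, the page size of 100 and the example category are assumptions for illustration; the endpoint and the `search_query`, `sortBy`, `sortOrder`, `start` and `max_results` parameters are the ones the arXiv API documents.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint; it returns an Atom (xml) feed.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_urls(search_query, total, per_request=100):
    """Return one arXiv API URL per page of results, newest papers first.

    search_query: arXiv query string, e.g. "cat:cs.CV" (hypothetical choice).
    total:        how many papers to ask for overall.
    per_request:  page size (an assumed value, not taken from scrape.py).
    """
    urls = []
    for start in range(0, total, per_request):
        params = {
            "search_query": search_query,
            "sortBy": "lastUpdatedDate",
            "sortOrder": "descending",
            "start": start,
            # The last page may be smaller than the others.
            "max_results": min(per_request, total - start),
        }
        urls.append(ARXIV_API + "?" + urlencode(params))
    return urls
```

Calling `build_query_urls("cat:cs.CV", total=1100)` gives 11 URLs, each of which can be fetched and saved as one xml file.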

The main functionality is a search feature. Most useful is that you can click "sort by tfidf similarity to this" on any paper, which returns the papers most similar to that one in terms of tfidf bigrams. I find this quite useful.
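The idea behind the similarity sort can be sketched in a few lines. This is a standard-library stand-in, not the code in analyze.py or serve.py, which rely on scikit-learn's tfidf vectorizer. The function names, the whitespace tokenizer and the smoothed idf formula are all assumptions made for this example.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text on whitespace and return its word bigrams."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(docs):
    """Map each document to a unit-length {bigram: weight} dict."""
    counts = [Counter(bigrams(d)) for d in docs]
    # Document frequency: in how many documents each bigram appears.
    df = Counter(term for c in counts for term in c)
    n = len(docs)
    vectors = []
    for c in counts:
        # Term frequency times a smoothed inverse document frequency.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def rank_by_similarity(vectors, query_index):
    """Return document indices sorted by cosine similarity to one document."""
    q = vectors[query_index]

    def cosine(v):
        # Vectors are unit length, so the dot product is the cosine.
        return sum(w * v.get(t, 0.0) for t, w in q.items())

    return sorted(range(len(vectors)), key=lambda i: cosine(vectors[i]), reverse=True)
```

The paper you clicked on always comes back first, since it is identical to itself, followed by the papers that share the most heavily weighted bigrams with it.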

Dependencies

You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer), and flask (for serving the results).

Ugly I don't have time processing pipeline

Requires reading code and getting hands dirty. Magic numbers throughout code.

1. Run scrape.py, which queries the most recent papers on arXiv and dumps the xml into the folder raw.
2. Run parse_raw.py, which reads all xml files in raw and creates a pickle called db.p with all the critical information.
3. Run download_pdfs.py, which iterates over all papers in the parsed pickle and downloads the papers into the folder pdf.
4. Run parse_pdf_to_text.py to export all text from the pdfs to files in txt.
5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p pickle file.
6. Run thumb_pdf.py to export thumbnails of all pdfs to thumb.
7. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers.
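If running the batch steps by hand gets tedious, something like the following could chain them. This is a hypothetical helper and not part of the repository: `run_pipeline` and its `run` parameter are made-up names, and it assumes each script exits with a nonzero code on failure, which the scripts may or may not do. The script names follow the files in the repository (note download_pdfs.py).

```python
import subprocess
import sys

# Batch steps in the order described above. serve.py is left out because it
# starts a long-running server rather than finishing like the other steps.
PIPELINE = [
    "scrape.py",
    "parse_raw.py",
    "download_pdfs.py",
    "parse_pdf_to_text.py",
    "analyze.py",
    "thumb_pdf.py",
]


def run_pipeline(scripts=PIPELINE, run=subprocess.run):
    """Run each script in order and stop at the first failure.

    Returns the list of scripts that completed successfully. The `run`
    argument exists only so the runner can be swapped out when testing.
    """
    done = []
    for script in scripts:
        # Use the same Python interpreter that is running this helper.
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}; stopping")
            break
        done.append(script)
    return done
```

Call `run_pipeline()` from the root folder of the repository, then start serve.py yourself once it finishes.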

Prebuilt database

If you'd like to browse arxiv papers from the last 3 months, you can download the result of running steps 1-6 above and only run step 7 to browse. Here is the download link. Unzip in the root folder and fire up flask with serve.py. Should work I think.

Running online

If you'd like to run this flask server online (e.g. AWS/Terminal) make sure to uncomment app.debug = True in serve.py, and change app.run() to app.run(host='0.0.0.0') to make the app visible to the world.

kylemcdonald/arxiv-sanity-preserver
