Here we mine papers from arXiv to derive the following:
- a count of the number of figures, and summary statistics over time
- extraction of equations
- extraction of article summaries
This work was modified from vsoch/arxiv-equations to run in an HPC environment. The goal is to generate embeddings from the equations, and find associations with terms so that a user can search using words to find equations, and vice versa.
The original src folder includes a subfolder of example datasets that were used for @dormaayan's original analysis, and that can also be used here to develop and test. A test extraction (without complete data) was done in test, and hopefully @vsoch can include all of the data here.
The files were copied from the NAS in Josh Sunshine's office, ultimately to the Sherlock cluster, using sftp. An inventory, which is a basic listing of the files, is included here as arxiv-inventory-newonly.tsv. We generate this inventory with 0.generateInventory.py.
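As a rough illustration of what such an inventory could look like, here is a minimal sketch that walks a folder and writes a tab-separated listing. It is not the contents of 0.generateInventory.py: the function name `generate_inventory` and the `path` and `size_bytes` columns are assumptions made for this example.

```python
import csv
import os


def generate_inventory(root, output_tsv):
    """Write one row per file under root: relative path and size in bytes.

    The column names are hypothetical; the real inventory may differ.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rows.append((os.path.relpath(full_path, root), os.path.getsize(full_path)))
    rows.sort()  # stable ordering makes two inventories easy to diff

    with open(output_tsv, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["path", "size_bytes"])
        writer.writerows(rows)
    return rows
```

Comparing the inventory written on the NAS side with one written after the transfer is a cheap way to spot files whose size changed in transit.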
The new archive data (in the arxiv folder on the NAS) has 947,943 subfolders.
We know these subfolders exist, each as a .tar.gz, but it remains to be seen whether each one actually has data inside (some could be empty, for example).
I will try to transfer the following again, but it could be that the originals were corrupt:
- '/scratch/users/vsochat/DATA/arxiv/1508.tar'
Update: I have re-transferred the file and was able to extract (I think) most of the contents, but the end of the file raises a ReadError, so we likely lost a small subset.
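For an archive that is damaged near the end, one option is to extract members one at a time and stop at the first read failure, keeping whatever came out before it. The sketch below uses Python's standard `tarfile` module; the function name `extract_what_we_can` is an assumption for this example, and this is not necessarily how the extraction was actually done here.

```python
import os
import tarfile


def extract_what_we_can(archive_path, dest):
    """Extract members in order until the archive can no longer be read.

    Returns (names_extracted, error_message); error_message is None when
    the whole archive was read without problems.
    """
    os.makedirs(dest, exist_ok=True)
    extracted = []
    error = None
    try:
        with tarfile.open(archive_path) as tar:
            for member in tar:
                tar.extract(member, path=dest)
                extracted.append(member.name)
    except (tarfile.ReadError, EOFError) as exc:
        # A truncated or corrupt tail ends up here; earlier members are kept.
        error = str(exc)
    return extracted, error
```

The member that triggered the error may leave an empty or partial file in the destination, so it is worth recording the returned error message alongside the list of names that did extract.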
We use the following scripts for this step:
Specifically, each input file is associated with an extracted archive, and we can determine which jobs need to be run based on the existence of the output folder.
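The check on output folders can be sketched as follows. This is a minimal example under stated assumptions: the function name `find_pending_jobs`, the flat layout of the input folder, and the rule that the output folder is named after the archive with its suffix removed are all hypothetical and may not match the actual scripts.

```python
import os


def find_pending_jobs(input_dir, output_dir, suffixes=(".tar.gz", ".tar")):
    """Return archives in input_dir that have no matching folder in output_dir.

    Assumes (hypothetically) that an archive named 1508.tar is extracted to
    a folder named 1508 inside output_dir.
    """
    pending = []
    for name in sorted(os.listdir(input_dir)):
        for suffix in suffixes:
            if name.endswith(suffix):
                stem = name[: -len(suffix)]
                if not os.path.isdir(os.path.join(output_dir, stem)):
                    pending.append(os.path.join(input_dir, name))
                break  # stop after the first matching suffix
    return pending
```

Because the decision depends only on what is already on disk, the same check can be rerun after a failed or interrupted batch to resubmit only the jobs that did not finish. Note that a job that created its output folder and then crashed would be skipped under this rule.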