0.generateInventory.py
1.clusterExtract.py
1.run_clusterExtract.py
LICENSE
README.md
arxiv-inventory-newonly.tsv
findMissingNpages.py
helpers.py
requirements.txt

README.md

ArXiv Miner Analysis

Here we mine papers from arXiv to derive the following:

a count of the number of figures, and summary statistics over time
extraction of equations
extraction of article summaries

This work was modified from vsoch/arxiv-equations to run in an HPC environment. The goal is to generate embeddings from the equations, and find associations with terms so that a user can search using words to find equations, and vice versa.

Development

The original src folder includes a subfolder of example datasets that were used for @dormaayan's original analysis and can also be used here for development and testing. A test extraction (without complete data) was done in test; the hope is that @vsoch can include all of the data here.

Step 1. Inventory of files

The files were copied from the NAS in Josh Sunshine's office and ultimately transferred to the Sherlock cluster using sftp. We generate a basic listing of the files with 0.generateInventory.py; the resulting inventory, arxiv-inventory-newonly.tsv, is included here.
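As a rough illustration of what a basic file inventory can look like, here is a minimal sketch. It is not the contents of 0.generateInventory.py: the function name `generate_inventory` and the two columns (`path`, `size_bytes`) are assumptions made for this example, and the real arxiv-inventory-newonly.tsv may use a different layout.

```python
import csv
import os


def generate_inventory(root, output_tsv):
    """Walk `root` and write one TSV row per file (relative path, size in bytes).

    Hypothetical helper: the column names are assumptions for illustration.
    Returns the number of files listed.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rows.append((os.path.relpath(full, root), os.path.getsize(full)))
    rows.sort()  # stable ordering makes two inventories easy to diff

    with open(output_tsv, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["path", "size_bytes"])
        writer.writerows(rows)
    return len(rows)
```

Writing the inventory outside of `root` avoids listing the inventory file itself on a later run.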

The new archive data (in the arxiv folder on the NAS) has 947,943 subfolders.

We can be sure that these subfolders exist, each as a .tar.gz, but it remains to be seen whether each one has data inside (some could be empty, for example).
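One way to check whether an archive actually holds data is to count its regular-file members with the standard library `tarfile` module. This is a sketch under assumptions, not code from this repository; the helper name `count_file_members` is hypothetical.

```python
import tarfile


def count_file_members(archive_path):
    """Return the number of regular files in a tar archive.

    Returns None if the archive cannot be read at all (hypothetical
    convention chosen for this sketch).
    """
    try:
        # "r:*" lets tarfile detect the compression (gz, bz2, xz, or none).
        with tarfile.open(archive_path, "r:*") as tar:
            return sum(1 for member in tar if member.isfile())
    except (tarfile.TarError, EOFError, OSError):
        return None
```

A result of 0 flags an archive that opens but contains no files, and None flags one that could not be read.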

.tar found corrupt

I will try to transfer these again, but it could be that the originals were corrupt:

'/scratch/users/vsochat/DATA/arxiv/1508.tar'

Update: I have re-transferred the file and was able to extract most (I think) of the contents, but the end of the file raises a ReadError, so we have likely lost a small subset.

Step 2. Extraction

We use the following scripts for this step:

Specifically, each input file is associated with an extracted archive, and we can determine which jobs need to be run based on the existence of the output folder.
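The idea of deciding which jobs still need to run from the existence of the output folder can be sketched as follows. This is not the logic of 1.run_clusterExtract.py; the function name `pending_jobs` and the rule that the output folder is named after the archive without its extension are assumptions for this example.

```python
import os

ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar")


def pending_jobs(input_files, output_root):
    """Return (input_path, output_dir) pairs whose output folder is missing.

    Assumes (hypothetically) one output folder per input archive, named
    after the archive with its extension removed.
    """
    pending = []
    for path in input_files:
        name = os.path.basename(path)
        for suffix in ARCHIVE_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        output_dir = os.path.join(output_root, name)
        if not os.path.isdir(output_dir):
            pending.append((path, output_dir))
    return pending
```

A check like this only tells us that a job started writing output, not that it finished; a job that failed partway would need a completion marker or a member count to be detected.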
