In [1]:
import urllib.request
import bs4 as bs
import time
import datetime
import dateutil.parser
import csv
import re
import os
from pathlib import Path

from my_utilities import read_dict, save_dict

# ArXiv Metadata Harvester

---

# Summary:

## Grab records from the requested timespan, from all or from one selected category

## Write to tab-delimited local csv:
## columns: *id, categories, title, abstract*
(Dealing with funny characters in the names of authors was beyond me. One could also get a date associated with each record, but it does not necessarily correspond to the date on which the authors posted the article.)
### There are two functions (the code is in *my_utilities.py*).
Both report their progress with print statements.
* *harvest_slice* requires the category (possibly 'all') and the file name as explicit arguments
    * it just appends lines to the file, so it is up to you not to make a mess


* *harvest_data* divides the timespan into slices of given length and harvests those using *harvest_slice*:
    * can make up the name of the file on its own
    * adds the header to the csv
    * default behavior when the file already exists is to quit
    * default category is 'all'
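
The file handling listed above can be sketched roughly as follows. The helper names `default_file_name` and `backup_name` are hypothetical; the naming pattern follows the one used by *harvest_data* further down, while the backup name is derived here with `pathlib` instead of the regular expression used in the actual code.

```python
from pathlib import Path

def default_file_name(category, date_0, date_1):
    # ':' is replaced by '--' so that set names like 'physics:hep-th'
    # give a file name that is safe on most file systems
    return f"arXivMeta_{category.replace(':', '--')}_from_{date_0}_to_{date_1}.csv"

def backup_name(file_name):
    # 'test.csv' -> 'test_bak.csv'
    path = Path(file_name)
    return str(path.with_name(path.stem + "_bak" + path.suffix))

print(default_file_name("all", "2018-08-01", "2018-11-01"))
print(backup_name("test.csv"))
```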

### It is slow.    
### Examples:
*  ~ 1 min,  2 MB >>> harvest_slice("2018-10-01", "2018-10-10", "math", "test.csv")
*  ~ 5 min, 11 MB >>> harvest_slice("2018-10-01", "2018-10-10", "all", "test.csv")
* ~ 10 min, 16 MB >>> harvest_data("2018-08-01", "2018-11-01", category="math", file_name = "test.csv", overwrite=True)
* ~ 1 h, 68 MB >>> harvest_data("2018-08-01", "2018-11-01")

### Examples of basic queries used in the code:
* http://export.arxiv.org/oai2?verb=ListRecords&from=2012-01-01&until=2018-02-01&set=physics:hep-th&metadataPrefix=arXiv
* http://export.arxiv.org/oai2?verb=ListSets

See https://arxiv.org/help/bulk_data for more info.

### Side effect:
The notebook also creates small text files that, loosely speaking, list the extracted labels:
   * categories.txt
   * top_cats.txt
   * physics_genres.txt 

---

# Explanation:

### Aside from having authors, a title and an abstract (a summary), articles on *ArXiv* are typically assigned to a category, e.g. Computer Science, Economics, etc. This information forms the metadata of an article, which is easily obtainable through an API.

### One can talk with *ArXiv* using two different interfaces.

### The first one serves to answer typical complicated search queries.
For example looking for articles by Stephen Hawking about black holes we could start with

In [139]:
search_query = "ti:black%20hole+AND+au:Hawking"

query = "http://export.arxiv.org/api/query?search_query=" + search_query
sauce = urllib.request.urlopen(query).read()    
soup = bs.BeautifulSoup(sauce, 'lxml')
entries = soup.find_all('entry')

and the data we typically get back looks as follows. Notice that there is both a *primary category* and a general *category* list

In [140]:
entry = entries[0]

print(entry.id.string)
print(entry.author.find('name').string)
print(entry.title.string)
print('primary category:', entry.find("arxiv:primary_category")['term'])
print('all categories:', [cat['term'] for cat in entry.find_all("category")])
print('abstract:', entry.summary.string[:200]+" ...")

http://arxiv.org/abs/hep-th/0507171v2
S. W. Hawking
Information Loss in Black Holes
primary category: hep-th
all categories: ['hep-th']
abstract:   The question of whether information is lost in black holes is investigated
using Euclidean path integrals. The formation and evaporation of black holes is
regarded as a scattering problem with all m ...


In another example we see that there can be more categories, e.g. one from Economics (*econ.EM*) and another one from Statistics (*stat.AP*), and that the first one in the list is the primary category

In [167]:
search_query = "1803.11233"
# search_query = "0707.3787"
query = "http://export.arxiv.org/api/query?search_query=" + search_query
sauce = urllib.request.urlopen(query).read()    
soup = bs.BeautifulSoup(sauce, 'lxml')
entry = soup.find('entry')
print(entry.id.string)
print(entry.author.find('name').string)
print(entry.title.string)
print('primary category:', entry.find("arxiv:primary_category")['term'])
print('all categories:', [cat['term'] for cat in entry.find_all("category")])
print('abstract:', entry.summary.string[:60]+" ...")

http://arxiv.org/abs/1803.11233v1
Kamil Jodź
Mortality in a heterogeneous population - Lee-Carter's methodology
primary category: econ.EM
all categories: ['econ.EM', 'stat.AP']
abstract:   The EU Solvency II directive recommends insurance companie ...


### However, this first API is not suited to bulk data downloads. Instead, we want to use the interface specified by the Open Archives Initiative (OAI), which ArXiv complies with.
This time we build the query by specifying the time slice from which we want the articles. We can also filter for one category if we want. Take the following query, for example

In [169]:
date_from = "2018-04-02"
date_until = "2018-04-02"
category = "econ" # Economics

search_query = f"&from={date_from}&until={date_until}&set={category}"
query = "http://export.arxiv.org/oai2?verb=ListRecords" + search_query + "&metadataPrefix=arXiv"
sauce = urllib.request.urlopen(query).read()    
soup = bs.BeautifulSoup(sauce, 'lxml')
records = soup.find_all('record')
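
As a small aside, the same query string can be assembled with `urllib.parse.urlencode` instead of manual concatenation. This is only a sketch: the function name `build_list_records_query` is hypothetical, and the parameters are the ones used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"

def build_list_records_query(date_from, date_until, category=None):
    # parameter order mirrors the hand-written query above
    params = {"verb": "ListRecords", "from": date_from, "until": date_until}
    if category and category != "all":
        params["set"] = category
    params["metadataPrefix"] = "arXiv"
    return BASE_URL + "?" + urlencode(params)

print(build_list_records_query("2018-04-02", "2018-04-02", "econ"))
```

Note that `urlencode` percent-encodes the colon in set names such as *physics:hep-th* (as `%3A`), which is a standard URL encoding of the same value.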

In [170]:
records[0]

<record>
<header>
<identifier>oai:arXiv.org:1803.11233</identifier>
<datestamp>2018-04-02</datestamp>
<setspec>econ</setspec>
</header>
<metadata>
<arxiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
<id>1803.11233</id><created>2018-03-29</created><authors><author><keyname>Jodź</keyname><forenames>Kamil</forenames></author></authors><title>Mortality in a heterogeneous population - Lee-Carter's methodology</title><categories>econ.EM stat.AP</categories><comments>12 pages</comments><license>http://arxiv.org/licenses/nonexclusive-distrib/1.0/</license><abstract>  The EU Solvency II directive recommends insurance companies to pay more
attention to the risk management methods. The sense of risk management is the
ability to quantify risk and apply methods that reduce uncertainty. In life
insurance, the risk is a consequence of the random variable describing the life
ex

Notice that this time there is only the single ***categories*** tag. **We will assume that, by convention, the first item on that list is the primary category of an article.**

In [171]:
print(records[0].id.string)
print(records[0].categories.string)

1803.11233
econ.EM stat.AP
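
Under the assumption stated above (the first item is the primary category), the *categories* string can be split as in the following sketch. The helper name `split_categories` is hypothetical.

```python
def split_categories(categories_string):
    """Return (primary, all_categories) from a space-separated string."""
    cats = categories_string.split()
    # assumption: the first listed category is the primary one
    primary = cats[0] if cats else None
    return primary, cats

primary, cats = split_categories("econ.EM stat.AP")
print(primary, cats)
```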


### Notice the *set* field in the last query. We can retrieve the list of all possible *sets* using another fixed query

In [3]:
# The query retrieves xml about the accessible 'sets', e.g.
# <set>
# <setspec>cs</setspec>
# <setname>Computer Science</setname>
# </set>

if not Path("categories.txt").is_file() :
    
    xml_query = "http://export.arxiv.org/oai2?verb=ListSets"
    sauce = urllib.request.urlopen(xml_query).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    sets = soup.find_all("set")

    categories = {}

    for set_ in sets:
        categories[set_.setspec.string] = set_.setname.string

    save_dict(categories, "categories.txt")
            

categories = read_dict("categories.txt")

categories

{'cs': 'Computer Science',
 'econ': 'Economics',
 'eess': 'Electrical Engineering and Systems Science',
 'math': 'Mathematics',
 'physics': 'Physics',
 'physics:astro-ph': 'Astrophysics',
 'physics:cond-mat': 'Condensed Matter',
 'physics:gr-qc': 'General Relativity and Quantum Cosmology',
 'physics:hep-ex': 'High Energy Physics - Experiment',
 'physics:hep-lat': 'High Energy Physics - Lattice',
 'physics:hep-ph': 'High Energy Physics - Phenomenology',
 'physics:hep-th': 'High Energy Physics - Theory',
 'physics:math-ph': 'Mathematical Physics',
 'physics:nlin': 'Nonlinear Sciences',
 'physics:nucl-ex': 'Nuclear Experiment',
 'physics:nucl-th': 'Nuclear Theory',
 'physics:physics': 'Physics (Other)',
 'physics:quant-ph': 'Quantum Physics',
 'q-bio': 'Quantitative Biology',
 'q-fin': 'Quantitative Finance',
 'stat': 'Statistics'}

Apparently physics enthusiasts get more options. 

### The matter of actual article categories is messier, see https://arxiv.org/ and https://arxiv.org/help/prep#subj
Physics gets an additional level of gradation: e.g. *physics:astro-ph* is a subset of *physics*. The categorization chosen by the authors themselves is finer and may include several labels, e.g. *cs.AI* (Computer Science: Artificial Intelligence) instead of just *cs*, together with *physics:astro-ph.GA* (Physics: Astrophysics: Astrophysics of Galaxies) instead of just *physics:astro-ph* (assuming that the article was about both Artificial Intelligence and galaxies). But, again, the first of the categories is the primary one.

In [4]:
# Create new dictionaries. One with the top-level categories, and the second with physics genres.

pattern = re.compile('physics:(.+)')

physics_genres = {}
top_categories = {}

for category, description in categories.items():
    match = pattern.match(category)
    if match:
        physics_genres[match.group(1)] = description
    else:
        top_categories[category] = description
        
save_dict(physics_genres, "physics_genres.txt")

Also, *Economics* and *Electrical Engineering* start only in 2017 (I've checked), so we will exclude *econ* and *eess* from *top_categories*

In [5]:
top_cats = {cat: cat_name for (cat, cat_name) in top_categories.items() if cat not in  ['econ' ,'eess']}
save_dict(top_cats, "top_cats.txt")

In [6]:
top_cats, physics_genres

({'cs': 'Computer Science',
  'math': 'Mathematics',
  'physics': 'Physics',
  'q-bio': 'Quantitative Biology',
  'q-fin': 'Quantitative Finance',
  'stat': 'Statistics'},
 {'astro-ph': 'Astrophysics',
  'cond-mat': 'Condensed Matter',
  'gr-qc': 'General Relativity and Quantum Cosmology',
  'hep-ex': 'High Energy Physics - Experiment',
  'hep-lat': 'High Energy Physics - Lattice',
  'hep-ph': 'High Energy Physics - Phenomenology',
  'hep-th': 'High Energy Physics - Theory',
  'math-ph': 'Mathematical Physics',
  'nlin': 'Nonlinear Sciences',
  'nucl-ex': 'Nuclear Experiment',
  'nucl-th': 'Nuclear Theory',
  'physics': 'Physics (Other)',
  'quant-ph': 'Quantum Physics'})
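
With these two dictionaries, a fine-grained label can be mapped back to a top-level category. The sketch below is one possible way to do it, not part of the harvester: the function name `top_level` is hypothetical, the key sets are copied from the output above, and it assumes labels of the form `archive.SUBJECT` (like *econ.EM*) or a bare archive name (like *hep-th*).

```python
TOP_CATS = {'cs', 'math', 'physics', 'q-bio', 'q-fin', 'stat'}
PHYSICS_GENRES = {'astro-ph', 'cond-mat', 'gr-qc', 'hep-ex', 'hep-lat',
                  'hep-ph', 'hep-th', 'math-ph', 'nlin', 'nucl-ex',
                  'nucl-th', 'physics', 'quant-ph'}

def top_level(category):
    # 'astro-ph.GA' -> 'astro-ph', 'cs.AI' -> 'cs', 'hep-th' -> 'hep-th'
    archive = category.split('.')[0]
    if archive in PHYSICS_GENRES:
        return 'physics'
    if archive in TOP_CATS:
        return archive
    # e.g. 'econ.EM': excluded from top_cats above, so no mapping
    return None

for label in ['cs.AI', 'astro-ph.GA', 'hep-th', 'econ.EM']:
    print(label, '->', top_level(label))
```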

---

# The code:

### The API will serve us 1000 records every 10 seconds (plus a considerable overhead for communication)
The imported *harvest_slice* function, given a time-slice, a category and a file path, works in a loop and
* sends the query
* saves the received records into a file
* forms the next query from the last *resumption token* (appended to the xml)
* finally returns the number of retrieved records
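
The resumption step can be illustrated in isolation. The sketch below uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages; the function name `next_query` is hypothetical and the sample response is hand-written for illustration (the token value is made up in the style of the tokens seen in the logs further down).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://export.arxiv.org/oai2"

def next_query(xml_text):
    """Return the follow-up URL, or None when the list is complete."""
    root = ET.fromstring(xml_text)
    token = root.find(f".//{OAI_NS}resumptionToken")
    # a missing or empty token means there is nothing left to fetch
    if token is None or not (token.text or "").strip():
        return None
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return BASE_URL + "?" + urlencode(params)

# hand-written sample response, not real server output
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken cursor="0" completeListSize="2328">3148006|1001</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(next_query(sample))
```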

In [6]:
def harvest_slice(date_from, date_until, category, file) -> int:
    # returns the number of downloaded records if successful
    
    base_query = "http://export.arxiv.org/oai2?verb=ListRecords"
    
    if category == "all":
        query = base_query + f"&from={date_from}&until={date_until}&metadataPrefix=arXiv"
    else:
        query = base_query + f"&from={date_from}&until={date_until}&set={category}&metadataPrefix=arXiv"
    
    retrieved = 0
    
    while query:
        
        time_0 = time.time()
        
        # try to download
        try:            
            sauce = urllib.request.urlopen(query).read()

        # catch Exception only, so that Ctrl-C (KeyboardInterrupt) still stops the harvest
        except Exception:
            print(f":( Failed requesting {query}\nMoving on")
            break
        
        # parse the xml looking for <record>'s
        soup = bs.BeautifulSoup(sauce, 'lxml')
        records = soup.find_all('record')

        retrieved = retrieved + len(records)

        with open(file, "a", encoding='utf-8', newline='') as dump:

            writer = csv.writer(dump, delimiter='\t')
            for record in records:                
                record_string = [(record.id.string if record.id else 'nan'),
                                 [(author.forenames.string+" " if author.forenames else "") + (author.keyname.string if author.keyname else 'nan') for author in record.find_all('author')],
                                 (record.title.string if record.title else 'nan'),
                                 (record.abstract.string if record.abstract else 'nan'),
                                 (record.categories.string if record.categories else 'nan')
                                ]
                writer.writerow(record_string)
        
        if len(records) == 0:
            print("".join([category," from ", f"{date_from}"," until ", f"{date_until}"," empty"]))
            break
        
        # info at the end of 'soup' about where to resume if the data stream was cut at 1000 records
        # None if the stream wasn't cut
        res_token = soup.find("resumptiontoken")
        
        if res_token:

            # data in the current loop started at this record in the 'query'
            started_at = int(res_token['cursor']) + 1
            
            # total number of records in the 'query', should be the same in each loop
            all_to_retrieve = int(res_token['completelistsize'])
            
            if res_token.string:
                # the identifier that allows to resume the query
                # None if the slice was completed

                query = base_query + f"&resumptionToken={res_token.string}"
                time.sleep(10)
            else:
                query = None
            
        else:
            started_at = 1 
            all_to_retrieve = len(records)
            query = None
        
        time_1 = time.time()
        
        print("".join([category,
                       " from ", f"{date_from}", " until ", f"{date_until}",
                       f" ({started_at:>5}-{started_at+len(records)-1:>5})/{all_to_retrieve:>5}",
                      " in ", f"{(time_1 - time_0):3.2f}", "s"]) )
    
    # end of while loop
     
    return retrieved



We are receiving the data in batches with 1000 records each. Each batch from the time period defined by the arguments ends up in the same file. For example

In [7]:
harvest_slice("2018-10-01", "2018-10-10", "math", "test.csv")

math from 2018-10-01 until 2018-10-10 (    1- 1000)/ 2328 in 24.69s
math from 2018-10-01 until 2018-10-10 ( 1001- 2000)/ 2328 in 25.26s
math from 2018-10-01 until 2018-10-10 ( 2001- 2328)/ 2328 in 6.21s


2328
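
The record ranges printed above follow directly from the batch size. As a quick check, here is a sketch that reproduces them; the helper name `batch_ranges` is hypothetical and the batch size of 1000 is taken from the log.

```python
def batch_ranges(total, batch_size=1000):
    """Yield (first, last) record numbers, 1-based and inclusive."""
    for first in range(1, total + 1, batch_size):
        yield first, min(first + batch_size - 1, total)

print(list(batch_ranges(2328)))
# → [(1, 1000), (1001, 2000), (2001, 2328)]
```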

If we were to download different categories separately like that, the records that belong to more than one category would be repeated in each file. Instead, we can download all categories at once

In [8]:
harvest_slice("2018-10-01", "2018-10-10", "all", "test.csv")

all from 2018-10-01 until 2018-10-10 (    1- 1000)/ 7482 in 26.81s
all from 2018-10-01 until 2018-10-10 ( 1001- 2000)/ 7482 in 27.57s
all from 2018-10-01 until 2018-10-10 ( 2001- 3000)/ 7482 in 26.85s
all from 2018-10-01 until 2018-10-10 ( 3001- 4000)/ 7482 in 27.77s
all from 2018-10-01 until 2018-10-10 ( 4001- 5000)/ 7482 in 28.97s
all from 2018-10-01 until 2018-10-10 ( 5001- 6000)/ 7482 in 25.77s
all from 2018-10-01 until 2018-10-10 ( 6001- 7000)/ 7482 in 32.03s
all from 2018-10-01 until 2018-10-10 ( 7001- 7482)/ 7482 in 8.67s


7482
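
If per-category files have already been downloaded separately, the repeated records could be dropped by id when merging them. This is a minimal sketch, assuming the id is the first column of every row; the helper name `unique_by_id` and the second sample row are made up for illustration.

```python
def unique_by_id(*row_sources):
    """Yield rows from all sources, skipping ids that were already seen."""
    seen = set()
    for rows in row_sources:
        for row in rows:
            if row and row[0] not in seen:
                seen.add(row[0])
                yield row

# the first record is listed under both econ and stat
econ_rows = [["1803.11233", "econ.EM stat.AP"]]
stat_rows = [["1803.11233", "econ.EM stat.AP"],
             ["0000.00001", "stat.ML"]]  # made-up record

for row in unique_by_id(econ_rows, stat_rows):
    print(row)
```

Each source can be any iterable of rows, for instance a `csv.reader(f, delimiter='\t')` over one of the harvested files.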

### Just as a precaution, let's split longer time-slices into multiple shorter ones, in case there is an upper limit on the total number of records we can retrieve with one query.
We divide the time-slice into periods of 92 days (by default) and write to an automatically named file.

In [9]:
# Wrapper around harvest_slice
# * handles file-names
# * slices the time period of papers into intervals of given number of days (days_in_slice)

def harvest_data(isoday_0, isoday_1, category='all', days_in_slice = 92, file_name=None, overwrite=False) -> int:

    date_0 = dateutil.parser.parse(isoday_0).date()
    date_1 = dateutil.parser.parse(isoday_1).date()

    if not file_name:
        # create a file with an overly descriptive name
        file = f"arXivMeta_{category.replace(':','--')}_from_{date_0}_to_{date_1}.csv"
    else:
        file = file_name
    
    # check if file already exists
    if Path(file).is_file():
        if overwrite :

            # try to backup the old file
            file_info = re.match(r"(\w.+)\.(\w\w+)", file)
            if file_info:
                new_file = "".join([ file_info.group(1), "_bak.", file_info.group(2) ])
                if not Path(new_file).is_file():
                    os.rename(file, new_file)
                    print(f"Old file backed up as {new_file}")

            # clear the file
            print(f"Overwriting {file}")
            with open(file, "w") as dump:
                dump.truncate(0)
            
        else:
            print(f"The file {file} already exists")
            return -1
    
    else:
        print(f"Writing to {file}")
    
    # same encoding as in harvest_slice, so header and records match
    with open(file, "a", encoding='utf-8', newline='') as dump:
        writer = csv.writer(dump, delimiter='\t')
        header = ['id', 'authors', 'title', 'abstract', 'categories']
        writer.writerow(header)
    
    # Start the clock
    time_0 = time.time()
    
    # Let's count all downloaded records
    retrieved = 0
    
    # We'll go from 'date_0' until 'date_1' in slices of 'days_in_slice' days
    # The server's response presumably maxes out at some number of records,
    # so we hope to have slices with fewer records than that.

    date_from = date_0

    while date_from <= date_1:
        
        date_until = min(date_1, date_from + datetime.timedelta(days_in_slice-1))

        # try to download the slice
        newly_retrieved = harvest_slice(date_from, date_until, category, file)
        retrieved = retrieved + newly_retrieved

        # move on to the next slice
        date_from = date_until + datetime.timedelta(days=1)
        
        # time-out
        time.sleep(10)

    time_1 = time.time()
    
    print("".join([category,
                   " from ", str(date_0), " until ", str(date_1),
                   " retrieved ", str(retrieved), " records"
                   ," in ", f"{(time_1 - time_0)/60:.0f}", " min\n"])
         )
    
    return retrieved



This time we can do for example

In [11]:
harvest_data("2018-08-01", "2018-11-01", category="math", file_name="test.csv", overwrite=True)

Old file backed up as test_bak.csv
Overwriting test.csv
math from 2018-08-01 until 2018-10-31 (    1- 1000)/18206 in 43.39s
math from 2018-08-01 until 2018-10-31 ( 1001- 2000)/18206 in 62.31s
math from 2018-08-01 until 2018-10-31 ( 2001- 3000)/18206 in 103.59s
math from 2018-08-01 until 2018-10-31 ( 3001- 4000)/18206 in 57.29s
math from 2018-08-01 until 2018-10-31 ( 4001- 5000)/18206 in 90.82s
math from 2018-08-01 until 2018-10-31 ( 5001- 6000)/18206 in 54.15s
math from 2018-08-01 until 2018-10-31 ( 6001- 7000)/18206 in 46.50s
math from 2018-08-01 until 2018-10-31 ( 7001- 8000)/18206 in 31.82s
math from 2018-08-01 until 2018-10-31 ( 8001- 9000)/18206 in 45.30s
math from 2018-08-01 until 2018-10-31 ( 9001-10000)/18206 in 25.56s
math from 2018-08-01 until 2018-10-31 (10001-11000)/18206 in 26.49s
math from 2018-08-01 until 2018-10-31 (11001-12000)/18206 in 27.74s
math from 2018-08-01 until 2018-10-31 (12001-13000)/18206 in 25.49s
math from 2018-08-01 until 2018-10-31 (13001-14000)/18206 i

18468

Notice that there were two time-slices, the first with 18206 records and the second with 262 records. All of them were saved in "test.csv".
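
The two slices come from the 92-day default. The slicing logic of *harvest_data* can be reproduced on its own, as in this sketch (the generator name `date_slices` is hypothetical; the arithmetic is the same as in the loop above):

```python
import datetime

def date_slices(date_0, date_1, days_in_slice=92):
    """Yield (date_from, date_until) pairs covering date_0..date_1 inclusive."""
    date_from = date_0
    while date_from <= date_1:
        date_until = min(date_1, date_from + datetime.timedelta(days=days_in_slice - 1))
        yield date_from, date_until
        date_from = date_until + datetime.timedelta(days=1)

for start, end in date_slices(datetime.date(2018, 8, 1), datetime.date(2018, 11, 1)):
    print(start, end)
# → 2018-08-01 2018-10-31
# → 2018-11-01 2018-11-01
```

The second slice covers a single day, which is why it contributes only a small number of records.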

---

# Harvest:

In [12]:
# single year split in two files
year = 2010

harvest_data(f"{year}-01-01", f"{year}-07-01")
harvest_data(f"{year}-07-02", f"{year}-12-31")

Writing to arXivMeta_all_from_2010-01-01_to_2010-07-01.csv
all from 2010-01-01 until 2010-04-02 (    1- 1000)/13727 in 34.98s
all from 2010-01-01 until 2010-04-02 ( 1001- 2000)/13727 in 34.47s
all from 2010-01-01 until 2010-04-02 ( 2001- 3000)/13727 in 39.02s
all from 2010-01-01 until 2010-04-02 ( 3001- 4000)/13727 in 37.09s
all from 2010-01-01 until 2010-04-02 ( 4001- 5000)/13727 in 32.30s
all from 2010-01-01 until 2010-04-02 ( 5001- 6000)/13727 in 32.11s
all from 2010-01-01 until 2010-04-02 ( 6001- 7000)/13727 in 96.90s
all from 2010-01-01 until 2010-04-02 ( 7001- 8000)/13727 in 65.82s
all from 2010-01-01 until 2010-04-02 ( 8001- 9000)/13727 in 76.95s
all from 2010-01-01 until 2010-04-02 ( 9001-10000)/13727 in 90.02s
all from 2010-01-01 until 2010-04-02 (10001-11000)/13727 in 83.90s
all from 2010-01-01 until 2010-04-02 (11001-12000)/13727 in 90.99s
all from 2010-01-01 until 2010-04-02 (12001-13000)/13727 in 71.82s
all from 2010-01-01 until 2010-04-02 (13001-13727)/13727 in 36.72s
all

28161

In [5]:
# multiple years (also split in two files each)

for year in ['2011','2012','2013','2014','2015','2016','2017']:
    harvest_data(f"{year}-01-01", f"{year}-07-01")
    harvest_data(f"{year}-07-02", f"{year}-12-31")

Writing to arXivMeta_all_from_2011-01-01_to_2011-07-01.csv
all from 2011-01-01 until 2011-04-02 (    1- 1000)/16544 in 28.84s
all from 2011-01-01 until 2011-04-02 ( 1001- 2000)/16544 in 28.68s
all from 2011-01-01 until 2011-04-02 ( 2001- 3000)/16544 in 26.71s
all from 2011-01-01 until 2011-04-02 ( 3001- 4000)/16544 in 26.42s
all from 2011-01-01 until 2011-04-02 ( 4001- 5000)/16544 in 27.60s
all from 2011-01-01 until 2011-04-02 ( 5001- 6000)/16544 in 26.40s
all from 2011-01-01 until 2011-04-02 ( 6001- 7000)/16544 in 28.16s
all from 2011-01-01 until 2011-04-02 ( 7001- 8000)/16544 in 29.50s
all from 2011-01-01 until 2011-04-02 ( 8001- 9000)/16544 in 25.17s
all from 2011-01-01 until 2011-04-02 ( 9001-10000)/16544 in 24.91s
all from 2011-01-01 until 2011-04-02 (10001-11000)/16544 in 25.32s
all from 2011-01-01 until 2011-04-02 (11001-12000)/16544 in 24.55s
all from 2011-01-01 until 2011-04-02 (12001-13000)/16544 in 25.32s
all from 2011-01-01 until 2011-04-02 (13001-14000)/16544 in 27.80s
all

all from 2012-10-02 until 2012-12-31 ( 7001- 8000)/14867 in 25.13s
all from 2012-10-02 until 2012-12-31 ( 8001- 9000)/14867 in 26.69s
all from 2012-10-02 until 2012-12-31 ( 9001-10000)/14867 in 26.10s
all from 2012-10-02 until 2012-12-31 (10001-11000)/14867 in 25.93s
all from 2012-10-02 until 2012-12-31 (11001-12000)/14867 in 26.02s
all from 2012-10-02 until 2012-12-31 (12001-13000)/14867 in 29.35s
all from 2012-10-02 until 2012-12-31 (13001-14000)/14867 in 25.38s
all from 2012-10-02 until 2012-12-31 (14001-14867)/14867 in 23.47s
all from 2012-07-02 until 2012-12-31 retrieved 31429 records in 15 min

Writing to arXivMeta_all_from_2013-01-01_to_2013-07-01.csv
all from 2013-01-01 until 2013-04-02 (    1- 1000)/17385 in 28.06s
all from 2013-01-01 until 2013-04-02 ( 1001- 2000)/17385 in 25.14s
all from 2013-01-01 until 2013-04-02 ( 2001- 3000)/17385 in 27.22s
all from 2013-01-01 until 2013-04-02 ( 3001- 4000)/17385 in 29.03s
all from 2013-01-01 until 2013-04-02 ( 4001- 5000)/17385 in 28.02

all from 2014-04-03 until 2014-07-01 (12001-13000)/17697 in 29.61s
all from 2014-04-03 until 2014-07-01 (13001-14000)/17697 in 28.44s
all from 2014-04-03 until 2014-07-01 (14001-15000)/17697 in 25.37s
all from 2014-04-03 until 2014-07-01 (15001-16000)/17697 in 27.12s
all from 2014-04-03 until 2014-07-01 (16001-17000)/17697 in 27.32s
all from 2014-04-03 until 2014-07-01 (17001-17697)/17697 in 14.97s
all from 2014-01-01 until 2014-07-01 retrieved 37860 records in 18 min

Writing to arXivMeta_all_from_2014-07-02_to_2014-12-31.csv
all from 2014-07-02 until 2014-10-01 (    1- 1000)/19557 in 34.67s
all from 2014-07-02 until 2014-10-01 ( 1001- 2000)/19557 in 27.18s
all from 2014-07-02 until 2014-10-01 ( 2001- 3000)/19557 in 27.03s
all from 2014-07-02 until 2014-10-01 ( 3001- 4000)/19557 in 28.25s
all from 2014-07-02 until 2014-10-01 ( 4001- 5000)/19557 in 27.63s
all from 2014-07-02 until 2014-10-01 ( 5001- 6000)/19557 in 27.80s
all from 2014-07-02 until 2014-10-01 ( 6001- 7000)/19557 in 27.60

all from 2015-04-03 until 2015-07-01 (28001-29000)/139925 in 30.37s
all from 2015-04-03 until 2015-07-01 (29001-30000)/139925 in 28.34s
all from 2015-04-03 until 2015-07-01 (30001-31000)/139925 in 31.13s
all from 2015-04-03 until 2015-07-01 (31001-32000)/139925 in 28.32s
all from 2015-04-03 until 2015-07-01 (32001-33000)/139925 in 30.69s
all from 2015-04-03 until 2015-07-01 (33001-34000)/139925 in 28.48s
all from 2015-04-03 until 2015-07-01 (34001-35000)/139925 in 27.84s
all from 2015-04-03 until 2015-07-01 (35001-36000)/139925 in 30.17s
all from 2015-04-03 until 2015-07-01 (36001-37000)/139925 in 28.18s
all from 2015-04-03 until 2015-07-01 (37001-38000)/139925 in 28.00s
all from 2015-04-03 until 2015-07-01 (38001-39000)/139925 in 29.41s
all from 2015-04-03 until 2015-07-01 (39001-40000)/139925 in 29.76s
:( Failed requesting http://export.arxiv.org/oai2?verb=ListRecords&resumptionToken=3148006|40001
Moving on
all from 2015-01-01 until 2015-07-01 retrieved 69949 records in 35 min

Writi

KeyboardInterrupt: 