# OCC Dataset
This is the second iteration of my work on setting up the framework for training and evaluation of parsers. The motivation for this second iteration is explained [here](https://colab.research.google.com/drive/1dfAmcbxoNHfLJ2lVq7_G2tN0nB-ClAWL?usp=sharing).

In [1]:
import sparql
import requests
import time
import random
import re
import pandas as pd
import pickle

In [2]:
with open('unshared_data/raw_doi67730.pickle', 'rb') as dbfile:
    raw_dois = pickle.load(dbfile)

In [3]:
len(raw_dois['dois']) * 2 / 3600

25.42

In [4]:
OC_API = 'https://w3id.org/oc/index/api/v1'
data = {'author': [],
    'year': [],
    'title': [],
    'page': [],
    'volume': [],
    'source_title': [],
    'issue': []}
def failed_get_meta():
    print('WARNING: Could not interpret response.')
    for ls in data.values():
        ls.append(None)
def save_meta(meta):
    for key, ls in data.items():
        ls.append(meta[key])
tstart = time.time()
for i, doi in enumerate(raw_dois['dois']):
    print('Sending request...', end=' ')
    t0 = time.time()
    meta = None
    wait = 30
    backoff = 5
    while meta is None:
        try:
            meta = requests.get(
                    OC_API + '/metadata/{}'.format(doi)
                )
        except ConnectionError as e:
            print('Bad connection. Retrying in {wait} seconds...')
            time.sleep(wait)
            wait *= backoff
    try:
        meta = meta.json()[0]
    except (ValueError, IndexError) as e:
        failed_get_meta()
        continue
    print('Response received in {:.4f} seconds.'.format(time.time()-t0))
    if i % 10 == 0:
        total_elapsed = time.time() - tstart
        proportion_complete = (i+1) / len(raw_dois['dois'])
        remaining_time = total_elapsed \
                * (1-proportion_complete) / proportion_complete
        print('{:.2f}% complete after {:.2f} seconds. {:.2f} hours'
              ' remaining.'.format(
                  proportion_complete * 100,
                  total_elapsed,
                  remaining_time / 3600))
    save_meta(meta)

.. Response received in 2.3265 seconds.
Sending request... Response received in 3.0493 seconds.
Sending request... Response received in 2.7111 seconds.
Sending request... Response received in 2.7164 seconds.
Sending request... Response received in 2.5950 seconds.
Sending request... Response received in 2.4539 seconds.
Sending request... Response received in 2.1367 seconds.
23.52% complete after 50413.79 seconds. 45.54 hours remaining.
Sending request... Response received in 2.3056 seconds.
Sending request... Response received in 4.2698 seconds.
Sending request... Response received in 2.4358 seconds.
Sending request... Response received in 2.5896 seconds.
Sending request... Response received in 2.6265 seconds.
Sending request... Response received in 2.5162 seconds.
Sending request... Response received in 2.4918 seconds.
Sending request... Response received in 2.3030 seconds.
Sending request... Response received in 2.2371 seconds.
Sending request... Response received in 2.3631 seconds.
2

ConnectionError: HTTPSConnectionPool(host='w3id.org', port=443): Max retries exceeded with url: /oc/index/api/v1/metadata/10.2307/3791349 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f0146560880>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

In [12]:
data['raw'] = raw_dois['raw_text']
occ = pd.DataFrame(data=data)
occ.head()

Unnamed: 0,author,year,title,page,volume,source_title,issue,raw
0,Michael N. Sawka; Timothy D. Noakes,2007,Does Dehydration Impair Exercise Performance?,1209-1217,39,Medicine & Science In Sports & Exercise,8,"Sawka, MN, Noakes, TD. Does dehydration impair..."
1,"Knechtle, B.; Schulze, I.",2008,Ernährungsverhalten Bei Ultraläufern - Deutsch...,243-251,97,Praxis,5,"Knechtle, B, Knechtle, P, Schulze, I, Kohler, ..."
2,"Sousa, Mónica; Fernandes, Maria João; Moreira,...",2013,Nutritional Supplements Usage By Portuguese At...,48-58,83,International Journal For Vitamin And Nutritio...,1,"Sousa, M, Fernandes, MJ, Moreira, P, Teixeira,..."
3,,2000,Nutrition And Athletic Performance,2130-2145,32,Medicine And Science In Sports And Exercise,12,"American College of Sports M, American Dieteti..."
4,"Schooler, Jonathan",2011,Unpublished Results Hide The Decline Effect,437-437,470,Nature,7335,"Schooler, J. (2011). Unpublished results hide ..."


In [13]:
with open('occ_dataset2.pickle', 'wb') as dbfile:
    pickle.dump(occ, dbfile, pickle.HIGHEST_PROTOCOL)