# OCC Dataset
This notebook is the second iteration of my framework for training and evaluating parsers. The motivation for this iteration is explained [here](https://colab.research.google.com/drive/1dfAmcbxoNHfLJ2lVq7_G2tN0nB-ClAWL?usp=sharing).

In [1]:
import sparql
import requests
import time
import random
import re
import pandas as pd
import pickle

In [2]:
with open('unshared_data/raw_doi67730.pickle', 'rb') as dbfile:
    raw_dois = pickle.load(dbfile)

In [3]:
# Estimated crawl time in hours, assuming roughly 2 seconds per request
len(raw_dois['dois']) * 2 / 3600

25.42
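
The estimate above and the running estimate printed by the crawl loop are both linear extrapolations. A minimal sketch of the two calculations as reusable helpers follows; the function names and the 2-second default are my own assumptions for illustration, not part of the original notebook.

```python
def estimate_total_hours(n_requests, seconds_per_request=2.0):
    """Up-front estimate: every request is assumed to cost the same time."""
    return n_requests * seconds_per_request / 3600


def estimate_remaining_hours(done, total, elapsed_seconds):
    """Running estimate: extrapolate the elapsed time to the remaining share."""
    if done <= 0 or total <= 0:
        raise ValueError('done and total must be positive')
    proportion_complete = done / total
    remaining_seconds = elapsed_seconds * (1 - proportion_complete) / proportion_complete
    return remaining_seconds / 3600
```

For example, `estimate_remaining_hours(50, 100, 3600)` says that one more hour is needed when half of the work took one hour.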

In [4]:
OC_API = 'https://w3id.org/oc/index/api/v1'
data = {'author': [],
    'year': [],
    'title': [],
    'page': [],
    'volume': [],
    'source_title': [],
    'issue': []}
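def failed_get_meta():
    # Record a row of None values so `data` stays aligned with the DOI list.
    print('WARNING: Could not interpret response.')
    for ls in data.values():
        ls.append(None)
def save_meta(meta):
    # Append each metadata field of interest to its column in `data`.
    for key, ls in data.items():
        ls.append(meta[key])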
tstart = time.time()
for i, doi in enumerate(raw_dois['dois']):
    t0 = time.time()
    meta = None
    wait = 300
    backoff = 5
    while meta is None:
        try:
            meta = requests.get(
                    OC_API + '/metadata/{}'.format(doi)
                )
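        except requests.exceptions.ConnectionError as e:
            # requests raises its own ConnectionError class, which the
            # builtin ConnectionError does not catch.
            print(f'Bad connection. Retrying in {wait} seconds...')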
            time.sleep(wait)
            wait *= backoff
    try:
        meta = meta.json()[0]
    except (ValueError, IndexError) as e:
        failed_get_meta()
        continue
    if i % 50 == 0:
        total_elapsed = time.time() - tstart
        proportion_complete = (i+1) / len(raw_dois['dois'])
        remaining_time = total_elapsed \
                * (1-proportion_complete) / proportion_complete
        print('{:.2f}% complete after {:.2f} hours. {:.2f} hours'
              ' remaining.'.format(
                  proportion_complete * 100,
                  total_elapsed / 3600,
                  remaining_time / 3600))
    save_meta(meta)

[...] complete after 3.74 hours. 33.87 hours remaining.
10.06% complete after 3.78 hours. 33.79 hours remaining.
10.16% complete after 3.81 hours. 33.70 hours remaining.
10.27% complete after 3.85 hours. 33.60 hours remaining.
10.38% complete after 3.88 hours. 33.47 hours remaining.
10.49% complete after 3.91 hours. 33.38 hours remaining.
10.60% complete after 3.95 hours. 33.27 hours remaining.
10.71% complete after 3.99 hours. 33.25 hours remaining.
10.82% complete after 4.03 hours. 33.20 hours remaining.
10.93% complete after 4.06 hours. 33.10 hours remaining.
11.04% complete after 4.09 hours. 32.98 hours remaining.
11.15% complete after 4.13 hours. 32.88 hours remaining.
11.26% complete after 4.16 hours. 32.78 hours remaining.
11.37% complete after 4.19 hours. 32.69 hours remaining.
11.48% complete after 4.23 hours. 32.61 hours remaining.
11.59% complete after 4.27 hours. 32.56 hours remaining.
11.69% complete after 4.30 hours. 32.49 hours remaining.
11.80% complete after 4.34 hours. 32.40 hours remaining.

ConnectionError: HTTPSConnectionPool(host='w3id.org', port=443): Max retries exceeded with url: /oc/index/api/v1/metadata/10.1021/j100101a045 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe228727130>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
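
The exception in this traceback is `requests.exceptions.ConnectionError`, which derives from `OSError` and not from Python's builtin `ConnectionError`, so a retry loop has to name the `requests` class (or a common base such as `OSError`) to catch it. Below is a minimal sketch of a retry helper with exponential backoff. The name `fetch_with_backoff` and its parameters are hypothetical; the defaults copy the `wait = 300` and `backoff = 5` values used in the crawl loop, and the injectable `sleep` argument exists only so the helper can be tried without waiting.

```python
import time


def fetch_with_backoff(fetch, retry_on=(OSError,), initial_wait=300,
                       backoff=5, max_attempts=4, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    retry_on defaults to OSError, an assumption that covers both the builtin
    ConnectionError and requests.exceptions.ConnectionError.
    """
    wait = initial_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            if attempt == max_attempts:
                # Give up and let the caller see the last error.
                raise
            print(f'Bad connection ({exc!r}). Retrying in {wait} seconds...')
            sleep(wait)
            wait *= backoff
```

In the crawl loop this could be used as `fetch_with_backoff(lambda: requests.get(OC_API + '/metadata/{}'.format(doi)))`. Capping the number of attempts is a design choice of this sketch; the original loop retries indefinitely.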

In [12]:
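# The crawl may have stopped early, so keep only the raw strings that have a metadata row.
data['raw'] = raw_dois['raw_text'][:len(data['author'])]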
occ = pd.DataFrame(data=data)
occ.head()

| | author | year | title | page | volume | source_title | issue | raw |
|---|---|---|---|---|---|---|---|---|
| 0 | Michael N. Sawka; Timothy D. Noakes | 2007 | Does Dehydration Impair Exercise Performance? | 1209-1217 | 39 | Medicine & Science In Sports & Exercise | 8 | Sawka, MN, Noakes, TD. Does dehydration impair... |
| 1 | Knechtle, B.; Schulze, I. | 2008 | Ernährungsverhalten Bei Ultraläufern - Deutsch... | 243-251 | 97 | Praxis | 5 | Knechtle, B, Knechtle, P, Schulze, I, Kohler, ... |
| 2 | Sousa, Mónica; Fernandes, Maria João; Moreira,... | 2013 | Nutritional Supplements Usage By Portuguese At... | 48-58 | 83 | International Journal For Vitamin And Nutritio... | 1 | Sousa, M, Fernandes, MJ, Moreira, P, Teixeira,... |
| 3 | | 2000 | Nutrition And Athletic Performance | 2130-2145 | 32 | Medicine And Science In Sports And Exercise | 12 | American College of Sports M, American Dieteti... |
| 4 | Schooler, Jonathan | 2011 | Unpublished Results Hide The Decline Effect | 437-437 | 470 | Nature | 7335 | Schooler, J. (2011). Unpublished results hide ... |


In [13]:
with open('occ_dataset2.pickle', 'wb') as dbfile:
    pickle.dump(occ, dbfile, pickle.HIGHEST_PROTOCOL)
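
Since the crawl runs for many hours and the result is pickled only once at the end, a network failure partway through can lose everything collected so far. A minimal sketch of periodic checkpointing follows, assuming the column-of-lists layout of `data` used in this notebook. The helper names, the checkpoint layout, and the `.tmp` suffix are hypothetical choices for illustration.

```python
import os
import pickle


def save_checkpoint(path, data, next_index):
    """Write the partial results plus the index of the next DOI to fetch."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as dbfile:
        pickle.dump({'data': data, 'next_index': next_index},
                    dbfile, pickle.HIGHEST_PROTOCOL)
    # Rename only after a complete write so an interrupted save
    # cannot corrupt the previous checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, columns):
    """Return (data, next_index), starting from scratch if no file exists."""
    if not os.path.exists(path):
        return {column: [] for column in columns}, 0
    with open(path, 'rb') as dbfile:
        state = pickle.load(dbfile)
    return state['data'], state['next_index']
```

The crawl loop could then call `save_checkpoint` every few hundred DOIs and, after a restart, skip ahead to the `next_index` returned by `load_checkpoint`.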