# OCC Dataset

This notebook extends the "Loading Datasets" section of ParserEvaluation.ipynb in the AWCA Google Drive. It is kept separate a) for brevity and b) out of necessity: it has to be run locally, because the servers used for Google Colab sit behind a firewall that apparently prevents me from opening a TCP connection to the endpoint.

In [47]:
import sparql
import requests
import time
import random
import re
import pandas as pd
import pickle

In [48]:
# This is a quick and dirty way to prevent name collisions between files
def timecode():
    seconds_per_day = 86400
    return round(time.time() % seconds_per_day)
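
Because `timecode()` only returns the number of seconds since midnight (UTC), two files written at the same second on different days would get the same name. A minimal sketch of an alternative that also encodes the date is shown below; the name `timestamp_code` is hypothetical and is not used elsewhere in this notebook.

```python
import time

def timestamp_code():
    # Hypothetical alternative to timecode(): date plus time of day (UTC),
    # e.g. '20200131-184530', so names also differ across days.
    return time.strftime('%Y%m%d-%H%M%S', time.gmtime())

# Same filename pattern as the pickle files written later in this notebook.
filename = 'raw_doi{}.pickle'.format(timestamp_code())
print(filename)
```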

## Downloading the Raw Texts

The cell below is all that is required. It sends the query to the same SPARQL endpoint that can be explored through a more user-friendly web interface [here](https://opencitations.net/sparql).

Note that I received a timeout error when I set the limit to 10^4.

In [55]:
n = 50000
query = """PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
PREFIX biro: <http://purl.org/spar/biro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX c4o: <http://purl.org/spar/c4o/>
SELECT ?cited_ref ?cited_url WHERE {
        ?cito cito:cites ?cited .
        { 
                ?cito frbr:part ?ref .
                ?ref biro:references ?cited ;
                        c4o:hasContent ?cited_ref 
        }
        {
                ?cited datacite:hasIdentifier [
                        datacite:usesIdentifierScheme datacite:url ;
                        literal:hasLiteralValue ?cited_url
                ]
        }
} LIMIT """ + str(round(n))

t0 = time.time()
result = sparql.query('https://opencitations.net/sparql', query)
elapsed = time.time() - t0
print('Time to get response: {:.4f} seconds '
      '({:.4f} seconds per row).'.format(elapsed, elapsed / n))

Time to get response: 96.1726 seconds (0.0019 seconds per row).
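
If a single large request times out, as noted above, one possible workaround is to split the request into pages with `LIMIT` and `OFFSET`. The sketch below only builds the query strings; `paged_queries` and the page size are hypothetical, and `base_query` stands for the query text above without its trailing `LIMIT` clause. Note that SPARQL does not guarantee a stable row order without `ORDER BY`, so pages fetched this way may overlap or miss rows, and I have not checked how the endpoint handles large offsets.

```python
def paged_queries(base_query, total, page_size):
    """Yield copies of base_query with LIMIT/OFFSET clauses appended."""
    for offset in range(0, total, page_size):
        # The final page may be shorter than page_size.
        limit = min(page_size, total - offset)
        yield '{}\nLIMIT {} OFFSET {}'.format(base_query, limit, offset)

# Toy query for illustration only; substitute the real query text.
base_query = 'SELECT ?s WHERE { ?s ?p ?o }'
pages = list(paged_queries(base_query, total=50000, page_size=20000))
for page in pages:
    print(page.splitlines()[-1])
```

Each generated string could then be passed to `sparql.query` in place of `query`, with the rows from all pages concatenated afterwards.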


## Building the DataFrame

To review, I have the raw text, but I still need four more fields:
* tags
* contributors
* title
* year

These fields are explained in ParserEvaluation.ipynb.

In [56]:
oc_api = 'https://w3id.org/oc/index/api/v1'
prefix = 'http://dx.doi.org/'
dois = []
raw_text = []
for raw, cited_url in result:
    # Keep only references whose URL is a DOI link, and strip the prefix.
    if cited_url.value.startswith(prefix):
        raw_text.append(raw.value)
        dois.append(cited_url.value[len(prefix):])
len(dois)

45756

In [58]:
with open('raw_doi{}.pickle'.format(timecode()), 'wb') as dbfile:
    pickle.dump({'raw_text': raw_text, 'dois': dois}, dbfile)

In [70]:
sample_size = 5000

In [78]:
with open('raw_doi67730.pickle', 'rb') as dbfile:
    raw_dois = pickle.load(dbfile)
selected = random.sample(
    list(zip(raw_dois['raw_text'], raw_dois['dois'])),
    sample_size)
selected_raw_text, dois = zip(*selected)
data = {
    'raw_text': selected_raw_text,
    'tags': [],
    'contributors': [],
    'title': [],
    'year': []
}
selected_raw_text[0], dois[0]

('Holowka, D, Wensel, T, Baird, B. A nanosecond fluorescence depolarization study on the segmental flexibility of receptor-bound immunoglobulin E, Biochemistry, 1990, 29, 4607, 12, PMID: 2142605',
 '10.1021/bi00471a015')

In [79]:
dois[:10]

('10.1021/bi00471a015',
 '10.1073/pnas.2336149100',
 '10.1016/j.pt.2007.07.010',
 '10.1186/isrctn10214981',
 '10.1515/cclm-2014-0210',
 '10.1016/j.virusres.2010.10.011',
 '10.1038/nrm1741',
 '10.1038/nature12322',
 '10.1016/j.ypmed.2009.06.009',
 '10.1016/j.addr.2007.04.019')

In [80]:
response = requests.get(oc_api + '/metadata/{}'.format(dois[0]))
response.json()

[{'source_title': 'Biochemistry',
  'citation_count': '11',
  'doi': '10.1021/bi00471a015',
  'reference': '',
  'page': '4607-4612',
  'oa_link': '',
  'source_id': 'issn:0006-2960; issn:1520-4995',
  'year': '1990',
  'citation': '10.1201/b14035-4; 10.1201/b17290-14; 10.1007/978-3-662-22022-1_2; 10.1074/jbc.m111.331967; 10.1007/s10895-007-0189-x; 10.1007/978-0-387-46312-4_11; 10.1007/978-0-387-46312-4_20; 10.1007/978-1-4757-3061-6_11; 10.1007/978-1-4757-3061-6_20; 10.1111/j.1600-065x.2007.00517.x; 10.1038/nsmb.2795',
  'author': 'Holowka, David; Wensel, Theodore; Baird, Barbara',
  'volume': '29',
  'title': 'A Nanosecond Fluorescence Depolarization Study On The Segmental Flexibility Of Receptor-Bound Immunoglobulin E',
  'issue': '19'}]
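
The response above is a list containing one record, and only three of its keys are needed here. As a minimal sketch, the extraction could be isolated in a small helper that returns placeholders when the response does not have the expected shape. The name `extract_fields` is hypothetical; the sample record is a subset of the response shown above.

```python
def extract_fields(records):
    """Return (contributors, year, title), or Nones if the shape is unexpected."""
    if not isinstance(records, list) or len(records) != 1:
        return None, None, None
    record = records[0]
    # .get() returns None for a missing key instead of raising KeyError.
    return record.get('author'), record.get('year'), record.get('title')

sample = [{
    'author': 'Holowka, David; Wensel, Theodore; Baird, Barbara',
    'year': '1990',
    'title': ('A Nanosecond Fluorescence Depolarization Study On The '
              'Segmental Flexibility Of Receptor-Bound Immunoglobulin E'),
}]
contributors, year, title = extract_fields(sample)
print(contributors, year)
```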

In [81]:
def failed_get_meta():
    # Append placeholders so that the lists in `data` stay aligned with `dois`.
    print('WARNING: Could not interpret response.')
    data['contributors'].append(None)
    data['year'].append(None)
    data['title'].append(None)

count_failed_decode = 0  # number of responses that were not valid JSON
for doi in dois:
    url = oc_api + '/metadata/{}'.format(doi)
    print('Sending request: GET ' + url)
    t0 = time.time()
    meta = requests.get(url)
    try:
        meta = meta.json()
    except ValueError:
        failed_get_meta()
        count_failed_decode += 1
        continue
    if len(meta) == 1:
        meta = meta[0]
        print('Response received in {:.4f} seconds.'.format(time.time()-t0))
        data['contributors'].append(meta['author'])
        data['year'].append(meta['year'])
        data['title'].append(meta['title'])
    else:
        failed_get_meta()
data

Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.1016/0003-4975%2894%2990134-1
Response received in 2.5341 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.3109/10826081003659543
Response received in 3.5373 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.4306/pi.2014.11.3.281
Response received in 2.6964 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.1021/acs.nanolett.5b00110
Response received in 2.4573 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.1016/j.neuron.2015.05.044
Response received in 3.2426 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.1371/journal.pone.0075955
Response received in 3.3238 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.1016/j.matbio.2006.08.261
Response received in 3.7184 seconds.
Sending request: GET https://w3id.org/oc/index/api/v1/metadata/10.1016/j.ajhg.2012.04.015
Response rec[... output truncated ...]

[...]e Asymmetries Of The Dental Arches, Jaws, And Skull, And Their Etiological Significance',
  'Midline Episiotomy And Anal Incontinence: Retrospective Cohort Study',
  'Perfil Epidemiológico Dos Pacientes Com Hanseníase No Extremo Sul De Santa Catarina, No Período De 2001 A 2007',
  'Quantitative And Developmental Analyses Of The Alarm Reaction In The Zebra Danio, Brachydanio Rerio',
  'Patterns Of Genetic Polymorphism Maintained By Fluctuating Selection With Overlapping Generations',
  'Neglect Of Mowing And Manuring Leads To Slower Nitrogen Cycling In Subalpine Grasslands',
  'In Vitro Feeding Assays For Hard Ticks',
  'Nucleolar Stress Characterized By Downregulation Of Nucleophosmin: A Novel Cause Of Neuronal Degeneration',
  'Reputation-Based Partner Choice Promotes Cooperation In Social Networks',
  'Recent Advances In Birefringence Studies At Thz Frequencies',
  'Diversified Egg And Clutch Sizes Among Local Populations Of The Fresh-Water Prawn Macrobrachium Nipponense (De Haan)',
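
The request loop above handles responses that cannot be decoded, but `requests.get` can also raise an exception on a connection error, which would stop the loop partway through several thousand requests. A minimal sketch of a generic retry wrapper follows. The name `call_with_retries` and its defaults are hypothetical, and it is demonstrated on a simulated flaky function rather than on the real API.

```python
import time

def call_with_retries(func, attempts=3, delay=0.0):
    """Call func() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            # Wait a little longer after each failure.
            time.sleep(delay * (attempt + 1))
    raise last_error

# Simulated call that fails twice and then succeeds.
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated failure')
    return 'ok'

outcome = call_with_retries(flaky, attempts=3, delay=0.0)
print(outcome, calls['n'])
```

In the loop, the request could then be written as `call_with_retries(lambda: requests.get(url, timeout=30), delay=1.0)`, where the timeout and delay values are arbitrary choices.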


In [83]:
data['contributors'][:5]

['Holowka, David; Wensel, Theodore; Baird, Barbara',
 'Arias-Salgado, E. G.; Lizano, S.; Sarkar, S.; Brugge, J. S.; Ginsberg, M. H.; Shattil, S. J.',
 'Kröber, Thomas; Guerin, Patrick M.',
 'Riso, Patrizia, 0000-0002-9204-7257',
 'Magrini, Laura; Gagliano, Giulia; Travaglino, Francesco; Vetrone, Francesco; Marino, Rossella; Cardelli, Patrizia; Salerno, Gerardo; Di Somma, Salvatore']

In [84]:
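# Convert years to integers; missing years stay None.
data['year'] = [int(year) if year else None for year in data['year']]
# Drop digits and hyphens from the contributor strings (this removes
# identifiers such as ORCIDs and also splits hyphenated names), then
# rejoin the remaining pieces with single spaces.
data['contributors'] = [
    (' '.join(re.findall(r'[^ \d\-]+', contribs))
      if contribs is not None else '')
    for contribs in data['contributors']]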
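
To make the effect of the regular expression concrete, the sketch below applies the same expression to two contributor strings taken from the output above. The wrapper name `clean_contributors` is hypothetical; the expression itself is the one used in the cell.

```python
import re

def clean_contributors(contribs):
    """Keep runs of characters that are not spaces, digits, or hyphens."""
    if contribs is None:
        return ''
    return ' '.join(re.findall(r'[^ \d\-]+', contribs))

# The ORCID is removed entirely, leaving a trailing comma.
with_orcid = clean_contributors('Riso, Patrizia, 0000-0002-9204-7257')
# The hyphenated surname is split into two words.
hyphenated = clean_contributors('Arias-Salgado, E. G.; Lizano, S.')
print(repr(with_orcid))
print(repr(hyphenated))
```

Commas, semicolons, and periods are kept, which is why the tagging step later splits names on those characters.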

In [88]:
data['tags'] = []
min_match_len = 3
for raw, contribs, year, title in zip(
        data['raw_text'], data['contributors'], data['year'], data['title']):
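    # Use '' rather than None for missing text so the loop below is skipped
    # instead of failing on None.find().
    raw = raw.lower() if raw else ''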
    contribs = contribs.lower() if contribs else None
    title = title.lower() if title else None
    year = str(year) if year else None
    tag = ''
    contribs_list = re.split(r'[\,; ]+', contribs) if contribs else None
    # prev_tag is used to create a hysteresis of sorts, to give the
    # tag generator a tendency to repeat the same tag as was given to
    # the previous word
    prev_tag = ''
    while raw != '':
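        # find() returns -1 when no space remains; the modulo maps that to
        # len(raw) - 1, so raw[next_space + 1:] empties the string after
        # the last word.
        next_space = raw.find(' ') % len(raw)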
        if contribs_list and any(
                (len(contrib_name) >= min_match_len
                 and -1 < raw.find(contrib_name) < next_space)
                 or (prev_tag == 'A' and (
                     -1 < raw.find(contrib_name) < next_space
                     or -1 < contrib_name.find(
                         re.sub(r'\W+', '', raw[:next_space])
                 )))
                 for contrib_name in contribs_list):
            prev_tag = 'A'
            tag += 'A '
            raw = raw[next_space + 1:]
        elif year \
                and raw.find(year) % len(raw) < next_space \
                and year == ''.join(filter(str.isdigit, raw[:next_space])):
            prev_tag = 'D'
            tag += 'D '
            # Go to the next word, or to the end if no spaces remain.
            raw = raw[next_space + 1:]
        elif title and (raw.lower().find(title) == 0
                or re.sub(r'\W+', '', raw.lower()).find(
                    re.sub(r'\W+', '', title)) == 0):
            prev_tag = 'T'
            tag += 'T ' * len(title.split())
            raw = raw[len(title):]
        else:
            prev_tag = 'O'
            tag += 'O '
            # Go to the next word, or to the end if no spaces remain.
            raw = raw[next_space + 1:]
        raw = raw.strip()
    data['tags'].append(tag)
data['tags']

[...] T T T T T T T T T T T T T T T T O O D O O O O O O O O ',
 'A A A A A A A A A A A A T T T T T T T T T T O O O O O O O O D O O O O ',
 'A A A A A A T T T T O O O O D O O O O O O ',
 'A O A A A A A A A O A O A O A O A A T T T T T T T T T T T T T T T O O O O O O ',
 'A O O A A D T T T T T T T T T T T O O O O O O O O O ',
 'O O O O O O O O O O O O O O ',
 'A A O O O O O O O O O O O O T T T T T T T T T O O O O O O D O O O O ',
 'A A A O A A A A A A A A O O O O O O O O O O O O O O O O O O O O D O O O O ',
 'A A A A A A A A A A A A A A A A A A A A T T T T T T T T T T T T T T T T T T T T T T T T T O O O O D O O O O ',
 'A A A O O T T T T T T T T T T T O O D O O ',
 'A A A A A A T T T T T T T T T T T T T T T T T T T O O O D O O O O O O ',
 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O ',
 'A A A A A A A A A D T T T T T T T T T T T T T T T T O O O O O O O O O ',
 'A A A A A A A O A A A A A A A A A A A O O O O O O O O O O O O O O O D O O O O O ',
 'A A A A T T T T [... output truncated ...]
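
Since the title branch appends one `T` per word of the API title and then advances by `len(title)` characters, the number of tags can drift from the number of whitespace-separated tokens when the raw citation spells the title differently. A minimal sketch of a sanity check is given below; `tags_align` is a hypothetical helper, and the example citation is made up for illustration.

```python
def tags_align(raw_text, tags):
    """Return True if there is exactly one tag per whitespace-separated token."""
    return len(raw_text.split()) == len(tags.split())

# Made-up example: five tokens and five tags.
example_raw = 'Smith, J. Some title. 1990'
example_tags = 'A A T T D '
print(tags_align(example_raw, example_tags))

# One tag too few for the same text.
print(tags_align(example_raw, 'A A T T '))
```

Applied to the real data, something like `sum(not tags_align(r, t) for r, t in zip(data['raw_text'], data['tags']))` would count the rows that need a closer look.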

In [89]:
dataset = pd.DataFrame(data)
with open('dataset{}.pickle'.format(timecode()), 'wb') as dbfile:
    pickle.dump(dataset, dbfile, pickle.HIGHEST_PROTOCOL)

In [90]:
dataset.head(10)

| | raw_text | tags | contributors | title | year |
|---|---|---|---|---|---|
| 0 | Holowka, D, Wensel, T, Baird, B. A nanosecond ... | A A A A A A A O O O O O O O O O O O O O D O O ... | Holowka, David; Wensel, Theodore; Baird, Barbara | A Nanosecond Fluorescence Depolarization Study... | 1990.0 |
| 1 | Arias-Salgado E. G. et al. Src kinase activati... | A A A O O O O O O O O O O O O O O O O O O O O ... | Arias Salgado, E. G.; Lizano, S.; Sarkar, S.; ... | Src Kinase Activation By Direct Interaction Wi... | 2003.0 |
| 2 | Kröber, T, Guerin, PM. In vitro feeding assays... | A A A A A O O O O O O O O D O O O O O O O O | Kröber, Thomas; Guerin, Patrick M. | In Vitro Feeding Assays For Hard Ticks | 2007.0 |
| 3 | Ministero Delle Politiche Agricole Alimentari ... | O O O O O O O O O O O | Riso, Patrizia, | Effect Of A Polyphenol-Rich Diet On Leaky Gut ... | 2017.0 |
| 4 | Magrini, L, Gagliano, G, Travaglino, F. Compar... | A A A A A A T T T T T T T T T T T T T T T T T ... | Magrini, Laura; Gagliano, Giulia; Travaglino, ... | Comparison Between White Blood Cell Count, Pro... | 2014.0 |
| 5 | Dhuruvasan K, Sivasubramanian G, Pellett PE. R... | A A A A A A T T T T T T T T T T O O O D O O O ... | Dhuruvasan, Kavitha; Sivasubramanian, Geetha; ... | Roles Of Host And Viral Micrornas In Human Cyt... | 2011.0 |
| 6 | Lukyanov KA, Chudakov DM, Lukyanov S, Verkhush... | A O A A A A A A O T T T O O O O O O D O O O O | Lukyanov, Konstantin A.; Chudakov, Dmitry M.; ... | Photoactivatable Fluorescent Proteins | 2005.0 |
| 7 | Zhang, R, Han, P, Yang, H, Ouyang, K, Lee, D, ... | A A A A A A A A A A A O A A A A A A A A A O O ... | Zhang, Ruilin; Han, Peidong; Yang, Hongbo; Ouy... | In Vivo Cardiac Reprogramming Contributes To Z... | 2013.0 |
| 8 | Reichert, FF, Azevedo, MR, Breier, A. Physical... | A O A O A A T T T T T T T T T T T T T T T O O ... | Reichert, Felipe F.; Azevedo, Mario R.; Breier... | Physical Activity And Prevalence Of Hypertensi... | 2009.0 |
| 9 | Vargas A, Zeisser-Labouèbe M, Lange N, Gurny R... | A A A A A A A A A A D T T T T T T T T T T T T ... | Vargas, A; Zeisserlabouebe, M; Lange, N; Gurny... | The Chick Embryo And Its Chorioallantoic Membr... | 2007.0 |
