# CCC Dataset

This notebook extends the "Loading Datasets" section of ParserEvaluation.ipynb in the AWCA Google Drive. It is kept separate (a) for brevity and (b) out of necessity: it has to be run locally, because the servers behind Google Colab have a firewall that apparently prevents me from opening the TCP connection the SPARQL client needs.

In [33]:
import sparql
import requests
import time

## Downloading the Raw Texts

The cell below is all that is required. It sends the query to the same SPARQL endpoint that has a more user-friendly web interface [here](https://opencitations.net/sparql).

In [34]:
query = """PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
PREFIX biro: <http://purl.org/spar/biro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX c4o: <http://purl.org/spar/c4o/>
SELECT ?cited_ref ?cited_url WHERE {
        ?cito cito:cites ?cited .
        { 
                ?cito frbr:part ?ref .
                ?ref biro:references ?cited ;
                        c4o:hasContent ?cited_ref 
        }
        {
                ?cited datacite:hasIdentifier [
                        datacite:usesIdentifierScheme datacite:url ;
                        literal:hasLiteralValue ?cited_url
                ]
        }
} LIMIT 100"""

result = sparql.query('https://opencitations.net/sparql', query)

## Building the DataFrame
To review, I have the raw text, but I need four more fields for each citation:
* tags
* contributors
* title
* year

These fields are explained in ParserEvaluation.ipynb.
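Because the raw text comes from the SPARQL rows and the other fields come from a separate metadata lookup, the two have to be matched record by record. The following is a minimal sketch of one way to do that, assuming each metadata record carries a `doi` key (an assumption about the response format, not something confirmed here). The helper name `align_by_doi` and the sample records are hypothetical.

```python
def align_by_doi(raw_by_doi, metas):
    """Build equal-length column lists, keeping only DOIs that have metadata."""
    columns = {'raw_text': [], 'contributors': [], 'title': [], 'year': []}
    for meta in metas:
        doi = meta.get('doi', '').lower()
        if doi not in raw_by_doi:
            continue  # metadata for a DOI that was not requested
        columns['raw_text'].append(raw_by_doi[doi])
        columns['contributors'].append(meta.get('author', ''))
        columns['title'].append(meta.get('title', ''))
        columns['year'].append(meta.get('year', ''))
    return columns

# Made-up records, for illustration only.
raw_by_doi = {'10.1000/a': 'Raw citation A', '10.1000/b': 'Raw citation B'}
metas = [{'doi': '10.1000/b', 'author': 'Doe, J.', 'title': 'B', 'year': '2001'}]
aligned = align_by_doi(raw_by_doi, metas)
```

Keying on the DOI means a missing or reordered metadata record drops a row instead of silently shifting every later row out of step with its raw text.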

In [35]:
data = {
    'raw_text': [],
    'tags': [],
    'contributors': [],
    'title': [],
    'year': []
}
oc_api = 'https://w3id.org/oc/index/api/v1'
prefix = 'http://dx.doi.org/'
dois = []
for raw, cited_url in result:
    if cited_url.value.startswith(prefix):
        data['raw_text'].append(raw.value)
        dois.append(cited_url.value[len(prefix):])
print('Sending request...')
t0 = time.time()
cited_metas = requests.get(
        oc_api + '/metadata/{}'.format('__'.join(dois))
    ).json()
print('Response received in {:.4f} seconds.'.format(time.time()-t0))
for meta in cited_metas:
    data['contributors'].append(meta['author'])
    data['year'].append(meta['year'])
    data['title'].append(meta['title'])
data

Sending request...


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
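The message "Expecting value: line 1 column 1 (char 0)" means the response body did not start with a JSON value, which happens when the body is empty or is something else, such as an HTML error page. A small guard can make that case explicit instead of raising. This is only a sketch: `decode_json_body` is a hypothetical helper, and treating any non-200 status as a failure is an assumption.

```python
import json

def decode_json_body(status_code, text):
    """Return the parsed JSON body, or None if it is not usable JSON."""
    if status_code != 200 or not text.strip():
        return None  # error status or empty body
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # body is not JSON, e.g. an HTML error page

# Illustrative inputs, not real API responses.
ok = decode_json_body(200, '[{"year": "2001"}]')
empty = decode_json_body(200, '')
html = decode_json_body(200, '<html></html>')
```

With a `requests` response object it could be called as `decode_json_body(response.status_code, response.text)`.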

In [40]:
response = requests.get(oc_api+'/metadata/'+'__'.join(dois))
response

KeyboardInterrupt: 

In [41]:
oc_api+'/metadata/'+'__'.join(dois)

'https://w3id.org/oc/index/api/v1/metadata/10.1007/s10822-014-9746-y__10.1093/protein/15.8.677__10.1001/jama.287.21.2805__10.1002/pro.257__10.1054/bjoc.2000.1559__10.1155/2011/217860__10.1002/prot.340110406__10.1016/j.patrec.2013.06.010__10.1093/bioinformatics/18.suppl_1.s22__10.1016/j.jbiotec.2004.12.016__10.1016/s0079-6603%2808%2960253-6__10.1093/bioinformatics/18.suppl_1.s22__10.1016/j.bpj.2012.11.3201__10.1080/07391102.1989.10506545__10.1080/07391102.1989.10506544__10.1159/000430249__10.1002/%28sici%291097-0215%2819990118%2980:2%3C250::aid-ijc14%3E3.0.co%3B2-d__10.1007/s00432-010-0963-z__10.1016/j.biocel.2008.08.030__10.1155/2014/849720__10.1038/nrn2282__10.4161/epi.25012__10.1097/00005792-196401000-00001__10.3109/15622975.2012.747699__10.1002/14651858.cd003498.pub2__10.1111/j.1365-2982.2010.01659.x__10.1155/2009/532640__10.1016/j.jns.2008.08.021__10.1007/s10803-008-0614-2__10.1038/ejcn.2014.13__10.1038/ejcn.2012.197__10.1007/978-1-61779-588-6_12__10.1016/s1360-1385%2801%2902149-5_
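The URL above is very long and contains at least one repeated DOI (`10.1093/bioinformatics/18.suppl_1.s22` appears twice). Whether either of those is the cause of the failure is not established here, but one thing to try is dropping duplicates and splitting the request into smaller batches. A minimal sketch follows; `batched_urls` and the default batch size of 10 are hypothetical choices.

```python
def batched_urls(dois, base, batch_size=10):
    """Build one metadata URL per batch of unique DOIs, preserving order."""
    unique = list(dict.fromkeys(dois))  # drop duplicates, keep first occurrence
    return [
        base + '/metadata/' + '__'.join(unique[i:i + batch_size])
        for i in range(0, len(unique), batch_size)
    ]

oc_api = 'https://w3id.org/oc/index/api/v1'
# Three of the DOIs from the URL above, with the duplicate included on purpose.
sample = [
    '10.1093/bioinformatics/18.suppl_1.s22',
    '10.1016/j.jbiotec.2004.12.016',
    '10.1093/bioinformatics/18.suppl_1.s22',
]
urls = batched_urls(sample, oc_api, batch_size=2)
```

Each URL in `urls` can then be requested separately and the resulting lists concatenated.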

In [43]:
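for doi in dois:
    t0 = time.time()
    # Request one DOI at a time. `doi` is a single string, so it must not be
    # passed to '__'.join(), which would insert '__' between its characters.
    cited_meta = requests.get(oc_api + '/metadata/{}'.format(doi))
    elapsed = time.time() - t0
    print('Time to get {}: {:.4f} seconds. {}'.format(
        doi, elapsed,
        '*********(SLOW)*********' if elapsed > 1 else ''
    ))
    try:
        cited_meta.json()
    except ValueError:  # raised when the body is not valid JSON
        print('****COULD NOT GET JSON****')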

Time to get 10.1007/s10822-014-9746-y: 0.7879 seconds. 
Time to get 10.1093/protein/15.8.677: 0.9592 seconds. 
Time to get 10.1001/jama.287.21.2805: 1.2624 seconds. (SLOW)
Time to get 10.1002/pro.257: 1.5330 seconds. (SLOW)
Time to get 10.1054/bjoc.2000.1559: 1.5174 seconds. (SLOW)
Time to get 10.1155/2011/217860: 0.9161 seconds. 
Time to get 10.1002/prot.340110406: 0.8096 seconds. 
Time to get 10.1016/j.patrec.2013.06.010: 0.7320 seconds. 
Time to get 10.1093/bioinformatics/18.suppl_1.s22: 0.7432 seconds. 
Time to get 10.1016/j.jbiotec.2004.12.016: 0.7729 seconds. 
Time to get 10.1016/s0079-6603%2808%2960253-6: 0.7738 seconds. 
Time to get 10.1093/bioinformatics/18.suppl_1.s22: 0.7391 seconds. 
Time to get 10.1016/j.bpj.2012.11.3201: 0.9577 seconds. 
Time to get 10.1080/07391102.1989.10506545: 0.7627 seconds. 
Time to get 10.1080/07391102.1989.10506544: 0.7192 seconds. 
Time to get 10.1159/000430249: 0.7693 seconds. 
Time to get 10.1002/%28sici%291097-0215%2819990118%2980:2%3C250::aid