# Problem
[Zenodo](https://zenodo.org) has a number of scientific packages. Unfortunately, its [API](http://developers.zenodo.org/#rest-api) is currently (April '18) only set up for deposit, so it cannot be used to investigate the collection. Its OAI-PMH interface has a limit of 100 records per page, so following the resumption tokens takes a lot of clicking to get the metadata. Fortunately, a majority of Zenodo packages have DOIs registered with [DataCite](https://datacite.org), so we can use the [DataCite API](https://api.datacite.org/) instead.

The DataCite API can filter records by resource type, such as:

* software
* datasets
* text
* images
* audiovisual
* interactive
* collection
* other

# How many DataCite DOIs have been registered by Zenodo since 2013?

Zenodo is registered with DataCite under CERN as `cern.zenodo`, so we can access its metadata by passing `cern.zenodo` as the `data-center-id`.

In [2]:
import requests
r = requests.get("https://api.datacite.org/works?data-center-id=cern.zenodo")
if r.status_code == 200:
    data = r.json()

DOIs for Zenodo packages per year:

In [3]:
data['meta']['registered']

[{'count': 100089, 'id': '2018', 'title': '2018'},
 {'count': 206033, 'id': '2017', 'title': '2017'},
 {'count': 82204, 'id': '2016', 'title': '2016'},
 {'count': 23015, 'id': '2015', 'title': '2015'},
 {'count': 5306, 'id': '2014', 'title': '2014'},
 {'count': 328, 'id': '2013', 'title': '2013'}]
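
The per-year counts above can be summed to answer the question in the heading. A minimal sketch, assuming the `registered` list is exactly the output shown above (it is copied in by hand here so the snippet runs without a network call; `counts_by_year` and `total_since_2013` are names chosen for illustration):

```python
# Per-year counts copied from the data['meta']['registered'] output above.
registered = [
    {'count': 100089, 'id': '2018', 'title': '2018'},
    {'count': 206033, 'id': '2017', 'title': '2017'},
    {'count': 82204, 'id': '2016', 'title': '2016'},
    {'count': 23015, 'id': '2015', 'title': '2015'},
    {'count': 5306, 'id': '2014', 'title': '2014'},
    {'count': 328, 'id': '2013', 'title': '2013'},
]

# Map year -> count, then add the years up.
counts_by_year = {int(entry['id']): entry['count'] for entry in registered}
total_since_2013 = sum(counts_by_year.values())

print(total_since_2013)  # 416975
```

With live data, `registered` would simply be `data['meta']['registered']`.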

Projected additional DOIs for the rest of 2018 (average records per day so far, times the days remaining in the year):

In [5]:
daysSoFar=114
daysToCome=365-daysSoFar
recordsToDate=100089
recordsPerDay=recordsToDate/daysSoFar
round(recordsPerDay*daysToCome)

220371

Looks like Zenodo is on pace to continue increasing in use year after year.
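
The cell above multiplies the daily average by the days remaining, which gives the DOIs still expected this year; the full-year figure adds the count to date. A minimal sketch of the same linear extrapolation as a function, assuming the snapshot was taken on 24 April 2018 (day 114 of the year); `project_year` is a hypothetical helper, not part of the original notebook:

```python
import calendar
from datetime import date


def project_year(records_to_date, as_of):
    """Linearly extrapolate a year-to-date count to the end of the year.

    Returns (remaining, total): the records still expected this year and
    the projected full-year total, both rounded to whole records.
    """
    days_so_far = as_of.timetuple().tm_yday
    days_in_year = 366 if calendar.isleap(as_of.year) else 365
    days_to_come = days_in_year - days_so_far
    records_per_day = records_to_date / days_so_far
    remaining = records_per_day * days_to_come
    return round(remaining), round(records_to_date + remaining)


# Assumed snapshot date: 24 April 2018 is day 114 of the year.
remaining, total = project_year(100089, date(2018, 4, 24))
print(remaining, total)  # 220371 320460
```

Deriving the day count from a date avoids hard-coding `daysSoFar` in every cell, and the projected total of roughly 320,000 is the number to compare against the 206,033 DOIs registered in 2017.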

### DATASETS

In [7]:
r2 = requests.get("https://api.datacite.org/works?data-center-id=cern.zenodo&resource-type-id=dataset")
if r2.status_code == 200:
    data2 = r2.json()

In [8]:
data2['meta']['registered']

[{'count': 10562, 'id': '2018', 'title': '2018'},
 {'count': 32312, 'id': '2017', 'title': '2017'},
 {'count': 1722, 'id': '2016', 'title': '2016'},
 {'count': 1430, 'id': '2015', 'title': '2015'},
 {'count': 266, 'id': '2014', 'title': '2014'},
 {'count': 38, 'id': '2013', 'title': '2013'}]

Estimated additional **dataset** DOIs for the rest of 2018:

In [9]:
daysSoFar=114
daysToCome=365-daysSoFar
recordsToDate=10562
recordsPerDay=recordsToDate/daysSoFar
round(recordsPerDay*daysToCome)

23255

### SOFTWARE

In [10]:
r3 = requests.get("https://api.datacite.org/works?data-center-id=cern.zenodo&resource-type-id=software")
if r3.status_code == 200:
    data3 = r3.json()

In [11]:
data3['meta']['registered']

[{'count': 7334, 'id': '2018', 'title': '2018'},
 {'count': 22588, 'id': '2017', 'title': '2017'},
 {'count': 8801, 'id': '2016', 'title': '2016'},
 {'count': 3918, 'id': '2015', 'title': '2015'},
 {'count': 1803, 'id': '2014', 'title': '2014'},
 {'count': 14, 'id': '2013', 'title': '2013'}]

Estimated additional **software** DOIs for the rest of 2018 (note that this cell uses 108 elapsed days rather than the 114 used above):

In [12]:
daysSoFar=108
daysToCome=365-daysSoFar
recordsToDate=7334
recordsPerDay=recordsToDate/daysSoFar
round(recordsPerDay*daysToCome)

17452

## HOW MANY SOFTWARE OR DATASET RECORDS HAVE JUPYTER IN THE DESCRIPTION OR TITLE?

Let's practice with up to **1000** DOIs of each type, created between 2018-01-01 and 2018-02-01:

In [61]:
r4 = requests.get("https://api.datacite.org/works?data-center-id=cern.zenodo&resource-type-id=software&from-created-date=2018-01-01&until-created-date=2018-02-01&page[size]=1000")
if r4.status_code == 200:
    data4 = r4.json()   

In [62]:
len(data4['data'])

1000

In [63]:
r5 = requests.get("https://api.datacite.org/works?data-center-id=cern.zenodo&resource-type-id=dataset&from-created-date=2018-01-01&until-created-date=2018-02-01&page[size]=1000")
if r5.status_code == 200:
    data5 = r5.json() 

In [64]:
len(data5['data'])

857
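
The requests above differ only in their query parameters. A minimal sketch of a URL builder for them, using only the parameter names that already appear in this notebook; `works_url` and its argument names are hypothetical, and the result is meant to be passed to `requests.get` as before:

```python
from urllib.parse import urlencode

API_ROOT = "https://api.datacite.org/works"


def works_url(resource_type=None, created_from=None, created_until=None,
              page_size=None, data_center="cern.zenodo"):
    """Build a DataCite works query URL from the filters used in this notebook."""
    params = {"data-center-id": data_center}
    if resource_type is not None:
        params["resource-type-id"] = resource_type
    if created_from is not None:
        params["from-created-date"] = created_from
    if created_until is not None:
        params["until-created-date"] = created_until
    if page_size is not None:
        params["page[size]"] = page_size
    # urlencode percent-encodes the brackets in page[size] as %5B and %5D.
    return API_ROOT + "?" + urlencode(params)


software_url = works_url("software")
january_datasets_url = works_url("dataset", "2018-01-01", "2018-02-01", 1000)
print(software_url)
print(january_datasets_url)
```

Keeping the filters in one place makes it harder for the software and dataset queries to drift apart when the date range or page size changes.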

In [79]:
jupyter_count = 0
# Scan the software records (data4) and the dataset records (data5) together.
for d in data4['data'] + data5['data']:
    # Treat a missing title or description as an empty string.
    title = d['attributes'].get('title') or ''
    description = d['attributes'].get('description') or ''
    if 'jupyter' in title.lower() or 'jupyter' in description.lower():
        jupyter_count += 1

In [80]:
print(jupyter_count)

13


### EXTRACTING GITHUB URLS

The DataCite JSON metadata does not include the URL of the GitHub repo for a software DOI as a field of its own. This critical information provides the _backlink_ to the actual software repository being cited. It can be recovered by decoding the base64-encoded `xml` attribute that comes with every DataCite JSON record: it holds the XML of the original Zenodo metadata, which **does** include the repo URL. Here is an example extraction of that metadata:

In [83]:
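def get_gh_urls(xml_md):
    """Return the GitHub URLs that a DataCite XML record lists as IsSupplementTo."""
    from lxml import etree

    xml_md_root = etree.fromstring(xml_md)
    # Select related identifiers of type URL that this record supplements.
    # Only elements in the kernel-3 namespace match this query.
    urls = xml_md_root.xpath(
        '//dcite:relatedIdentifiers/dcite:relatedIdentifier'
        '[@relationType="IsSupplementTo" and @relatedIdentifierType="URL"]/text()',
        namespaces={'dcite': 'http://datacite.org/schema/kernel-3'})

    return [u for u in urls if 'github' in u]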

In [95]:
import base64
get_gh_urls(base64.b64decode(data4['data'][50]['attributes']['xml']))

['https://github.com/miykael/LINEViewer/tree/0.2.2']
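
Because `get_gh_urls` binds its XPath query to the `kernel-3` namespace, a record stored under a different schema version would return an empty list. A minimal sketch of a namespace-agnostic variant, swapping in the standard library's `xml.etree.ElementTree` for lxml and filtering attributes in plain Python; the sample record and its DOIs are made up for illustration, and `github_urls` is a hypothetical name:

```python
import base64
import xml.etree.ElementTree as ET

# Made-up, minimal DataCite-style record; the DOIs are placeholders.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5281/zenodo.0000000</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsSupplementTo">https://github.com/miykael/LINEViewer/tree/0.2.2</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.0000001</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def github_urls(xml_bytes):
    """Return GitHub URLs marked IsSupplementTo, whatever the schema namespace."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for element in root.iter():
        # Tags look like '{namespace}relatedIdentifier'; drop the namespace part.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name != 'relatedIdentifier':
            continue
        if (element.get('relationType') == 'IsSupplementTo'
                and element.get('relatedIdentifierType') == 'URL'
                and element.text and 'github' in element.text):
            urls.append(element.text.strip())
    return urls


# Round-trip through base64 to mimic the 'xml' attribute of the API payload.
encoded = base64.b64encode(SAMPLE_XML)
found = github_urls(base64.b64decode(encoded))
print(found)
```

Matching on the local tag name trades a little XPath precision for not having to know which schema version each record was registered under.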

In [86]:
import base64

doi_gh_map = {}
for d in data4['data']:
    doi_gh_map[d['attributes']['doi']] = get_gh_urls(base64.b64decode(d['attributes']['xml']))

### HOW MANY OF THE 1000 SOFTWARE DOIS **DO NOT** HAVE GITHUB REPOS IN THE METADATA?

In [87]:
len(doi_gh_map)

1000

In [88]:
len([gh_list for gh_list in doi_gh_map.values() if not gh_list])

132

### REPEATS?

In [90]:
1000-132-len(set([gh_list[0] for gh_list in doi_gh_map.values() if gh_list]))

256

### NUMBER OF UNIQUE GH URLS IN THE 1000 DOIS?

In [91]:
1000-132-256

612
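
The three counts above (missing, repeated, unique) can also be derived in one pass instead of by hand arithmetic. A minimal sketch, assuming a map shaped like `doi_gh_map` (DOI to list of GitHub URLs) and, as in the cells above, comparing only the first URL of each record; `summarize` and the sample DOIs and URLs are hypothetical:

```python
def summarize(doi_gh_map):
    """Count DOIs without a GitHub URL, unique first URLs, and repeats."""
    first_urls = [urls[0] for urls in doi_gh_map.values() if urls]
    unique = len(set(first_urls))
    return {
        'total': len(doi_gh_map),
        'missing': len(doi_gh_map) - len(first_urls),
        'unique': unique,
        'repeats': len(first_urls) - unique,
    }


# Tiny made-up map for illustration; real input would be doi_gh_map.
sample_map = {
    '10.5281/zenodo.1': ['https://github.com/example/a/tree/v1'],
    '10.5281/zenodo.2': ['https://github.com/example/a/tree/v1'],
    '10.5281/zenodo.3': ['https://github.com/example/b/tree/v2'],
    '10.5281/zenodo.4': [],
}

summary = summarize(sample_map)
print(summary)
# {'total': 4, 'missing': 1, 'unique': 2, 'repeats': 1}
```

By construction `missing + unique + repeats` equals `total`, which is the same identity as 132 + 612 + 256 = 1000 in the cells above.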