# Leverage the DataCite REST API for metadata discovery and creation

## Task Automation: Harvesting a set of metadata records

### Overview

Participants will learn how to:
  1. Apply parameters to API requests using the Python requests library.
  1. Create a loop to perform pagination, retrieving over 1,000 DOIs by making multiple requests.
  1. Get results in different formats (JSON, CSV).
  1. Save results to a file.


* [DataCite REST API Reference](https://support.datacite.org/reference/get_dois)
* [Python Requests Package Reference](https://pypi.org/project/requests/)

### Applying parameters to API requests using `requests`

To begin, we will import the `requests` library and set some constants representing the API endpoints.

In [None]:
import requests
DATACITE_API_URL = "https://api.datacite.org"
DATACITE_DOI_API_URL = f"{DATACITE_API_URL}/dois/"
RESULTS_PER_PAGE = 1000

At its simplest, the API can be used to run a search based on a query string.
We can construct a URL that contains the query parameters:

In [None]:
my_query = "transcriptomes"

In [None]:
url = DATACITE_DOI_API_URL + "?" + "query=" + my_query
print(url)

https://api.datacite.org/dois/?query=transcriptomes
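
Joining strings by hand works for a single word, but a query containing spaces or square brackets needs percent-encoding. Here is a minimal sketch using the standard library's `urllib.parse.urlencode`. The query `climate change` is an assumption chosen only to show the encoding and is not part of the workshop data.

```python
from urllib.parse import urlencode

DATACITE_DOI_API_URL = "https://api.datacite.org/dois/"

# Hypothetical multi-word query, used only to show the encoding
example_params = {"query": "climate change", "page[size]": 10}

# urlencode escapes the space and the square brackets
encoded_url = DATACITE_DOI_API_URL + "?" + urlencode(example_params)
print(encoded_url)
```

The `requests` library applies the same kind of encoding automatically when parameters are passed as a dictionary.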


We can then fetch the data with a GET request.

In [None]:
response = requests.get(url)

Assuming the request succeeded, we can parse the response body as JSON.

In [None]:
results = response.json()

When we are checking things programmatically, we need to check for errors explicitly. This can be done using `response.status_code` or the shortcut `response.ok`.

In [None]:
response.status_code

200

In [None]:
response.ok

True
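
`requests` can also turn an error status into an exception through `response.raise_for_status()`, which is convenient in scripts that should stop at the first failure. The sketch below builds a `Response` object by hand so that it runs without network access. In real use the object would come from `requests.get`.

```python
import requests

# Hand-built response standing in for a failed API call (illustration only)
fake_response = requests.Response()
fake_response.status_code = 404

print(fake_response.ok)  # False for any 4xx or 5xx status

try:
    fake_response.raise_for_status()
    outcome = "no error"
except requests.HTTPError as error:
    outcome = f"request failed: {error.response.status_code}"

print(outcome)
```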

You can also create a dictionary of parameters and pass it to `requests.get`, which builds the query string for you.

In [None]:
rows = 10
params = {
    'query': my_query,
    'page[size]': rows
}

url = DATACITE_DOI_API_URL
response = requests.get(url, params=params)
if response.ok:
    results = response.json()

Let's take a look at the results.



In [None]:
results

{'data': [{'id': '10.7488/era/4551',
   'type': 'dois',
   'attributes': {'doi': '10.7488/era/4551',
    'identifiers': [{'identifier': 'https://hdl.handle.net/1842/41828',
      'identifierType': 'uri'}],
    'creators': [{'name': 'Treanor-Taylor, Mairi',
      'nameType': 'Personal',
      'givenName': 'Mairi',
      'familyName': 'Treanor-Taylor',
      'affiliation': [],
      'nameIdentifiers': []}],
    'titles': [{'title': 'Investigating disease progression of cutaneous squamous cell carcinoma'}],
    'publisher': 'The University of Edinburgh',
    'container': {},
    'publicationYear': 2024,
    'subjects': [{'subject': 'cutaneous squamous cell carcinoma'},
     {'subject': 'disease progression'},
     {'subject': 'metastasis'},
     {'subject': 'RNA-sequencing'},
     {'subject': 'whole genome sequencing'}],
    'contributors': [{'name': 'University Of Edinburgh',
      'affiliation': [],
      'contributorType': 'DataManager',
      'nameIdentifiers': []},
     {'name': 'Uni

To look at the DOIs in the response, we need to access the `data` key of the results dictionary.

In [None]:
dois = results['data']

In [None]:
len(dois)

10
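
Each element of `dois` is a dictionary whose metadata sits under `attributes`. Here is a small sketch of pulling out a few fields, using a sample record trimmed by hand from the output shown earlier. `summarize_record` is a hypothetical helper name.

```python
# Sample record trimmed from the output shown earlier in this notebook
sample_dois = [
    {
        "id": "10.7488/era/4551",
        "type": "dois",
        "attributes": {
            "doi": "10.7488/era/4551",
            "titles": [{"title": "Investigating disease progression of cutaneous squamous cell carcinoma"}],
            "publisher": "The University of Edinburgh",
            "publicationYear": 2024,
        },
    }
]

def summarize_record(record):
    """Return the DOI, first title, and publication year of one record."""
    attributes = record.get("attributes", {})
    titles = attributes.get("titles") or [{}]  # Guard against an empty titles list
    return {
        "doi": attributes.get("doi"),
        "title": titles[0].get("title"),
        "year": attributes.get("publicationYear"),
    }

summaries = [summarize_record(record) for record in sample_dois]
print(summaries[0]["doi"], summaries[0]["year"])
```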

Using the above method, you can change the query or use any of the search parameters outlined in the [API reference](https://support.datacite.org/reference/get_dois).

You can even increase the number of rows returned per request, up to a maximum of 1000.


In [None]:
rows = 1000
params = {
    'query': my_query,
    'page[size]': rows
}

url = DATACITE_DOI_API_URL
response = requests.get(url, params=params)
if response.ok:
    results = response.json()

### Pagination to retrieve over 1000 results

Note: with page numbers, we can only retrieve up to 10,000 results.

The page size is limited to 1000 rows, but we can request multiple pages of DOIs. First, let's look at the total number of results and the total number of pages. These are located in the `meta` sub-dictionary.

In [None]:
results['meta']['total'] # Total number of results


8946

In [None]:
results['meta']['totalPages'] # Total number of pages

9
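
The page count follows from the total and the page size: it is the total divided by the page size, rounded up. Here is a quick check of that arithmetic with the numbers above, combined with the 10,000-record ceiling mentioned in the note. `reachable_pages` is a hypothetical helper, and the total of 25,000 is an invented value used only to show the cap.

```python
import math

PAGE_NUMBER_LIMIT = 10_000  # Maximum records reachable with page numbers (see note above)

def reachable_pages(total, page_size):
    """Number of pages that page-number pagination can actually reach."""
    reachable_records = min(total, PAGE_NUMBER_LIMIT)
    return math.ceil(reachable_records / page_size)

print(reachable_pages(8946, 1000))    # 9, matching totalPages above
print(reachable_pages(25_000, 1000))  # 10, capped by the 10,000-record limit
```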

We will need to add the `page[number]` parameter to our dictionary of parameters.

In [None]:
params = {
        'query': my_query,
        'page[size]': rows,
        'page[number]': 1
    }


Let's put it all together. We will:

1.   Perform an initial search
2.   Get the total number of pages
3.   Create a for loop to run the search on each page

Along the way, we will collect all returned DOIs in the list `all_dois`.



In [None]:
url = DATACITE_DOI_API_URL

params = {
        'query': my_query,
        'page[size]': 0,
        'page[number]': 1
    }

response = requests.get(url, params)
response.json()

{'data': [],
 'meta': {'total': 8948,
  'totalPages': 0,
  'page': 1,
  'states': [{'id': 'findable', 'title': 'Findable', 'count': 8948}],
  'resourceTypes': [{'id': 'dataset', 'title': 'Dataset', 'count': 3434},
   {'id': 'collection', 'title': 'Collection', 'count': 2293},
   {'id': 'text', 'title': 'Text', 'count': 2019},
   {'id': 'image', 'title': 'Image', 'count': 469},
   {'id': 'journal-article', 'title': 'Journal Article', 'count': 186},
   {'id': 'software', 'title': 'Software', 'count': 160},
   {'id': 'other', 'title': 'Other', 'count': 124},
   {'id': 'dissertation', 'title': 'Dissertation', 'count': 69},
   {'id': 'audiovisual', 'title': 'Audiovisual', 'count': 62},
   {'id': 'preprint', 'title': 'Preprint', 'count': 21},
   {'id': 'workflow', 'title': 'Workflow', 'count': 14},
   {'id': 'peer-review', 'title': 'Peer Review', 'count': 10},
   {'id': 'output-management-plan',
    'title': 'Output Management Plan',
    'count': 9},
   {'id': 'project', 'title': 'Project', 

In [None]:
# Set up a list to collect the results
all_dois = []
rows = 1000
my_query = 'transcriptomes'
url = DATACITE_DOI_API_URL

# Initial search
params = {
    'query': my_query,
    'page[size]': rows,
    'page[number]': 1
}

response = requests.get(url, params=params)
response.raise_for_status()  # Stop here if the first request failed
results = response.json()
all_dois.extend(results['data'])

# Get the total number of pages
total_number_of_pages = results['meta']['totalPages']

# Loop through all the remaining pages
for page in range(2, total_number_of_pages + 1):
    # Update the page number; the other parameters stay the same
    params['page[number]'] = page

    response = requests.get(url, params=params)
    if response.ok:
        results = response.json()
        all_dois.extend(results['data'])
    else:
        # Report failed pages instead of skipping them silently
        print(f"Error on page {page}: {response.status_code}")
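
One possible way to make the pagination logic easier to test is to separate it from the network call. This sketch takes a `fetch_page` function as an argument (a hypothetical name; in the notebook it would wrap `requests.get`) and exercises it with an in-memory stand-in instead of the live API.

```python
def harvest_all_pages(fetch_page):
    """Collect records from every page.

    fetch_page(page_number) must return a dict shaped like the API
    response, with 'data' and 'meta' -> 'totalPages' keys.
    """
    first = fetch_page(1)
    records = list(first["data"])
    for page_number in range(2, first["meta"]["totalPages"] + 1):
        records.extend(fetch_page(page_number)["data"])
    return records

# In-memory stand-in for the API: 5 fake records served 2 per page
fake_records = [{"id": f"10.0000/example-{n}"} for n in range(5)]

def fake_fetch_page(page_number, page_size=2):
    start = (page_number - 1) * page_size
    return {
        "data": fake_records[start:start + page_size],
        "meta": {"total": len(fake_records), "totalPages": 3},
    }

harvested = harvest_all_pages(fake_fetch_page)
print(len(harvested))  # 5
```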


Let's check the size of the all_dois list

In [None]:
len(all_dois)


8948

Preview the results using the pandas library

In [None]:
import pandas as pd
df = pd.DataFrame(all_dois)
display(df)

Unnamed: 0,id,type,attributes,relationships
0,10.60692/pgtps-9b067,dois,"{'doi': '10.60692/pgtps-9b067', 'identifiers':...","{'client': {'data': {'id': 'iapx.bsycxq', 'typ..."
1,10.60692/85e10-n3c87,dois,"{'doi': '10.60692/85e10-n3c87', 'identifiers':...","{'client': {'data': {'id': 'iapx.bsycxq', 'typ..."
2,10.7488/era/4551,dois,"{'doi': '10.7488/era/4551', 'identifiers': [{'...","{'client': {'data': {'id': 'bl.ed', 'type': 'c..."
3,10.60692/k1t8q-38c08,dois,"{'doi': '10.60692/k1t8q-38c08', 'identifiers':...","{'client': {'data': {'id': 'iapx.bsycxq', 'typ..."
4,10.60692/7mnfs-10y79,dois,"{'doi': '10.60692/7mnfs-10y79', 'identifiers':...","{'client': {'data': {'id': 'iapx.bsycxq', 'typ..."
...,...,...,...,...
8943,10.6084/m9.figshare.c.5022629,dois,"{'doi': '10.6084/m9.figshare.c.5022629', 'iden...","{'client': {'data': {'id': 'figshare.ars', 'ty..."
8944,10.5281/zenodo.3885088,dois,"{'doi': '10.5281/zenodo.3885088', 'identifiers...","{'client': {'data': {'id': 'cern.zenodo', 'typ..."
8945,10.6084/m9.figshare.c.4958501.v1,dois,"{'doi': '10.6084/m9.figshare.c.4958501.v1', 'i...","{'client': {'data': {'id': 'figshare.ars', 'ty..."
8946,10.5281/zenodo.1493871,dois,"{'doi': '10.5281/zenodo.1493871', 'identifiers...","{'client': {'data': {'id': 'cern.zenodo', 'typ..."


### Convert the results to different file formats and save them

In [None]:
df.to_json("search_results.json", indent=2)
df.to_csv("search_results.csv", index=False)
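
Because `attributes` is a nested dictionary, the CSV written above stores it as one long text cell per row. If a flat table is more useful, one option is to pick out individual fields first. Here is a sketch with the standard `csv` module and a hand-made sample record. The chosen columns and the `write_flat_csv` name are assumptions.

```python
import csv
import io

def write_flat_csv(records, file_handle):
    """Write one row per record with a few selected attribute fields."""
    writer = csv.DictWriter(file_handle, fieldnames=["doi", "publisher", "year"])
    writer.writeheader()
    for record in records:
        attributes = record.get("attributes", {})
        writer.writerow({
            "doi": attributes.get("doi"),
            "publisher": attributes.get("publisher"),
            "year": attributes.get("publicationYear"),
        })

sample_records = [{
    "id": "10.7488/era/4551",
    "attributes": {
        "doi": "10.7488/era/4551",
        "publisher": "The University of Edinburgh",
        "publicationYear": 2024,
    },
}]

buffer = io.StringIO()  # Stands in for open("flat_results.csv", "w", newline="")
write_flat_csv(sample_records, buffer)
print(buffer.getvalue())
```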

## (Optional) Using cursors for paginating through results

As described in the [API reference](https://support.datacite.org/docs/pagination#method-2-cursor), you can pass the parameter `page[cursor]=1` in the initial search and then follow the `next` URL found under `links` in each response to move through the remaining results.

In [None]:
# Set up a list to collect the results
all_dois = []
rows = 1000
my_query = 'transcriptomes'
url = DATACITE_DOI_API_URL

# Initial search: page[cursor]=1 requests cursor-based pagination
params = {
    'query': my_query,
    'page[size]': rows,
    'page[cursor]': 1
}
while url:
    response = requests.get(url, params=params)
    if response.status_code == 200:
        # Drop page[cursor] after the first request; the next URL supplies its own cursor
        params.pop('page[cursor]', None)
        results = response.json()
        if results and 'data' in results:
            all_dois.extend(results['data'])
            url = results['links'].get('next', None)  # Move to the next page, if any
        else:
            break
    else:
        print(f"Error: {response.status_code}")
        break

In [None]:
len(all_dois)

8851
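
To see what a `next` link carries, it can be split into its query parameters with `urllib.parse`. The URL below is a made-up stand-in with a placeholder cursor value, written under the assumption that the link repeats the query and page size. Inspect `results['links']` from a live response to confirm the real shape.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical next link; EXAMPLECURSOR is a placeholder, not a real cursor
example_next = (
    "https://api.datacite.org/dois"
    "?page%5Bcursor%5D=EXAMPLECURSOR&page%5Bsize%5D=1000&query=transcriptomes"
)

# parse_qs decodes the percent-encoded keys and returns a list per parameter
next_params = parse_qs(urlsplit(example_next).query)
print(next_params["page[cursor]"])  # ['EXAMPLECURSOR']
print(next_params["query"])         # ['transcriptomes']
```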