# Zenodo API

# Setup

## Instructions

This notebook uses the Zenodo API. Follow these steps to get the credentials you need:
1. Create a Zenodo account at https://zenodo.org/signup/
2. After logging in, click on the user dropdown menu in the top right corner, and click on 'Applications'
3. Obtain an API key in one of two ways:
    - Create a Developer Application by clicking on 'New application'
    - Create a Personal Access Token by clicking on 'New Token'
4. Load the API key:
    - For repeated use, follow the instructions in `pickle_tutorial.ipynb` to create a `./credentials.pkl` file that holds a dictionary containing the entry `{'ZENODO_TOKEN': MYKEY}`, where MYKEY is your API key.
    - For occasional use, run the credentials cell and paste your API key when prompted.
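
As a rough illustration of step 4, the sketch below writes and reads a `credentials.pkl` file in the shape described above. The helper names `save_credentials` and `load_credentials` are hypothetical and are not taken from `pickle_tutorial.ipynb`; only the file name and the `ZENODO_TOKEN` key come from this notebook.

```python
import pickle
from pathlib import Path


def save_credentials(token, path="credentials.pkl"):
    """Store the token under 'ZENODO_TOKEN', keeping any other saved entries.

    Hypothetical helper: assumes the file, if present, holds a pickled dict.
    """
    credentials = {}
    if Path(path).exists():
        with open(path, "rb") as handle:
            credentials = pickle.load(handle)
    credentials["ZENODO_TOKEN"] = token
    with open(path, "wb") as handle:
        pickle.dump(credentials, handle)


def load_credentials(path="credentials.pkl"):
    """Return the stored token, or None if the file or the key is missing."""
    if not Path(path).exists():
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle).get("ZENODO_TOKEN")
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.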

## Additional Information

Documentation Guide:
- Zenodo API ([Zenodo](https://developers.zenodo.org))
- Zenodo Search Guide ([Guide](https://help.zenodo.org/guides/search/))

## Overview of workflow

#### Miro workflow image here

## Imports

In [1]:
# Import flatten_json, installing if necessary
try:
    from flatten_json import flatten
except ImportError:
    !pip3 install flatten_json
    from flatten_json import flatten

# Import libraries
import os
import requests
import pandas as pd
import pickle
import pprint as pp
from utils import flatten_nested_df
from collections import OrderedDict

In [2]:
# Load credentials
try:
    with open('credentials.pkl', 'rb') as credentials:
        ZENODO_TOKEN = pickle.load(credentials)['ZENODO_TOKEN']
except Exception:  # Fall back to a prompt if the file is missing or lacks the key
    ZENODO_TOKEN = input('Please enter your Zenodo API Key: ')

In [3]:
SEARCH_URL = 'https://zenodo.org/api/records'
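
To make the endpoint concrete, here is a small sketch of how search parameters end up in the request URL. It uses the standard library's `urlencode` rather than `requests`, so the exact encoding may differ slightly from what `requests` sends; the parameter values are illustrative only, and the access token is left out.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://zenodo.org/api/records'

# Illustrative values; the notebook also passes 'access_token'
example_params = {'q': '"chipmunk"', 'page': 1, 'size': 10}

example_url = f'{SEARCH_URL}?{urlencode(example_params)}'
print(example_url)
# https://zenodo.org/api/records?q=%22chipmunk%22&page=1&size=10
```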

# Data Wrangling

Function `get_all_search_outputs` queries the Zenodo API for every search term specified.
* Calls function `get_individual_search_output` for each search term
    * To account for Zenodo search limits, the API is queried for each search term in one-year increments
* Stores the results for each search term in an ordered dictionary

In [4]:
def get_all_search_outputs(search_terms, flatten_output=False):
    """
    Call the Zenodo API for each search term.
    Results are returned in results[(search_term,)] = (search_aggregation_info, search_df)
    
    Params:
    - search_terms (list-like): collection of search terms (strings) to query over
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - results (dict): dictionary mapping each (search_term,) key to the tuple
      returned by get_individual_search_output for that query
    """

    results = OrderedDict()

    # Iterate over the terms directly; each key is a one-element tuple
    for search_term in search_terms:
        results[(search_term,)] = get_individual_search_output(search_term, flatten_output)
        
    return results

Function `get_individual_search_output` queries the Zenodo API with the specified search term (e.g., “machine learning”). By default, the Zenodo API includes all published records, so, unlike with other APIs, there is no need to specify a search type.
* Searches across all returned pages
* Flattens highly nested JSON output to dataframe if specified in argument
* Each resulting dataframe contains *full metadata* about each object as well as high-level summary statistics of the search (i.e., number of hits)
    * Note that this means the resulting dataframe has thousands of columns

In [5]:
def get_individual_search_output(search_term, flatten_output=False):
    """
    Calls the Zenodo API with the specified search term and returns the search output results.
    
    Params:
    - search_term (str): keyword to search for
    - flatten_output (bool): optional (default=False); if True, flatten each nested record
   
    Returns:
    - search_aggregation_info (dict): the 'aggregations' entry of the initial search response
    - search_df (pandas.DataFrame): DataFrame containing the output of the search query
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    
    # Set search variables
    start_page = 1
    page_size = 1000 # Max = 10,000, Default = 10
    search_year = 2021
    search_df = pd.DataFrame()
    start_date = f'{search_year}-01-01'
    end_date = f'{search_year}-12-31'
    
    search_params = {
        'q': f'\"{search_term}\" AND created:[{start_date} TO {end_date}]',
        'access_token': ZENODO_TOKEN,
        'page': start_page,
        'size': page_size,
        }
    
    # Run initial search & extract output
    response = requests.get(SEARCH_URL, #Records — search published records
                        params = search_params)
    output = response.json()
    
    # Gather high-level search information from the 'aggregations' entry
    search_aggregation_info = output['aggregations']
    
    # Loop over search years - searches until the current search year does not return any results
    while output.get('hits').get('total'):
        # Loop over pages - searches until the current page is empty 
        while output.get('hits').get('hits'):
            # Flatten output
            if flatten_output:
                output_list = [flatten(result) for result in output['hits']['hits']]
            else:
                output_list = output['hits']['hits']
            
            # Turn outputs into DataFrame & add page info
            output_df = pd.DataFrame(output_list)
            output_df['page'] = search_params['page']
            
            # Append modified output df to our cumulative search DataFrame
            search_df = pd.concat([search_df, output_df]).reset_index(drop=True)

            # Increment page
            search_params['page'] += 1 
            
            # Run search & extract output
            response = requests.get(SEARCH_URL, #Records — search published records
                                params = search_params)
            output = response.json()
            
        # Change search year, reset search page
        search_year -= 1
        start_date = f'{search_year}-01-01'
        end_date = f'{search_year}-12-31'

        search_params['q'] = f'\"{search_term}\" AND created:[{start_date} TO {end_date}]'
        search_params['page'] = start_page

        # Run search & extract output
        response = requests.get(SEARCH_URL, #Records — search published records
                            params = search_params)
        output = response.json()
        
    return search_aggregation_info, search_df
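
The control flow of the function above (step back one year at a time, and within each year request pages until one comes back empty) can be sketched without any network access. This is a simplified illustration, not a replacement for the function: `fetch_page` is a hypothetical callable standing in for the `requests.get` call, and `year_query` and `collect_hits` are names introduced here.

```python
def year_query(search_term, year):
    """Build the one-year query string in the same form the function uses."""
    return f'"{search_term}" AND created:[{year}-01-01 TO {year}-12-31]'


def collect_hits(search_term, fetch_page, start_year=2021):
    """Gather hits year by year, paging until a page is empty.

    fetch_page(query, page) is a hypothetical callable that returns the
    list of hits for one page of one query.
    """
    all_hits = []
    year = start_year
    while True:
        page = 1
        hits = fetch_page(year_query(search_term, year), page)
        if not hits:
            break  # the first year without results ends the search
        while hits:
            all_hits.extend(hits)
            page += 1
            hits = fetch_page(year_query(search_term, year), page)
        year -= 1
    return all_hits
```

As in the function above, the loop stops at the first year that returns nothing, so records from earlier years are not reached if an intermediate year has no matches.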

#### Example search

In [6]:
search_aggregation_info, search_df = get_individual_search_output('chipmunk', flatten_output=True)

In [7]:
search_df.head()

Unnamed: 0,conceptrecid,created,doi,files_0_bucket,files_0_checksum,files_0_key,files_0_links_self,files_0_size,files_0_type,files_1_bucket,...,metadata_related_identifiers_902_resource_type,metadata_related_identifiers_902_scheme,metadata_related_identifiers_903_identifier,metadata_related_identifiers_903_relation,metadata_related_identifiers_903_resource_type,metadata_related_identifiers_903_scheme,metadata_related_identifiers_904_identifier,metadata_related_identifiers_904_relation,metadata_related_identifiers_904_resource_type,metadata_related_identifiers_904_scheme
0,4966916,2021-06-16T14:25:57.408732+00:00,10.5061/dryad.c3sg55p,8a26640e-153c-43fb-b97c-5833a35dadbb,md5:30f66705913a2690132e4f23274cc4dd,comparative_trees.tar,https://zenodo.org/api/files/8a26640e-153c-43f...,228352.0,tar,8a26640e-153c-43fb-b97c-5833a35dadbb,...,,,,,,,,,,
1,4936920,2021-06-12T14:37:37.488821+00:00,10.5061/dryad.pn371m9,fc2f8285-d275-42e0-94e7-00f4dbf3c9a3,md5:c9083c5bd0583be6b6998f8630010a03,Data_Emmering et al. JAB01946.xlsx,https://zenodo.org/api/files/fc2f8285-d275-42e...,228624.0,xlsx,,...,,,,,,,,,,
2,4990061,2021-06-19T03:43:48.528726+00:00,10.5061/dryad.3583j,62c232fc-b45a-4bed-b527-2054e9d657df,md5:f411331fe9f1d5a71c40a5de51a00262,covariance_matrices_neotamias.xls,https://zenodo.org/api/files/62c232fc-b45a-4be...,608256.0,xls,62c232fc-b45a-4bed-b527-2054e9d657df,...,,,,,,,,,,
3,4571490,2021-03-01T21:50:16.948943+00:00,10.5281/zenodo.4571491,523471b8-3327-4972-915b-8e89cf7a7996,md5:1232dec933ba5b6d4f9691679fbd997e,treatment.html,https://zenodo.org/api/files/523471b8-3327-497...,3039.0,html,,...,,,,,,,,,,
4,4971642,2021-06-17T06:13:10.904864+00:00,10.5061/dryad.52mp1,7930baef-435a-4937-9ff4-d8317e90c3f9,md5:af062e557e5409cb91bd8c9db014dd6b,quadgroup.nex,https://zenodo.org/api/files/7930baef-435a-493...,293823.0,nex,,...,,,,,,,,,,
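
Column names such as `files_0_bucket` and `metadata_related_identifiers_904_scheme` come from flattening nested JSON: each nested key and each list index becomes part of the column name. The sketch below is a minimal pure-Python illustration of that idea. It is not the `flatten_json` implementation, `flatten_record` is a name introduced here, and the record is made up.

```python
def flatten_record(value, prefix="", sep="_"):
    """Flatten nested dicts and lists into one level of 'a_0_b'-style keys."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}  # leaf value: emit it under the accumulated key

    flat = {}
    for key, child in items:
        child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_record(child, child_prefix, sep))
    return flat


# Made-up record, shaped loosely like a search hit
record = {
    "doi": "example-doi",
    "files": [{"key": "a.csv", "size": 10}, {"key": "b.csv", "size": 20}],
}
flat_record = flatten_record(record)
print(flat_record)
```

Because every list position gets its own set of columns, a single record with hundreds of related identifiers is enough to add thousands of columns to the dataframe.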
