# Zenodo API

# Setup

## Instructions

This notebook uses the Zenodo API. Follow these steps to obtain the necessary credentials:
1. Create a Zenodo account at https://zenodo.org/signup/
2. After logging in, open the user dropdown menu in the top right corner and click 'Applications'
3. Obtain an API key in one of two ways:
    - Create a Developer Application by clicking 'New application'
    - Create a Personal Access Token by clicking 'New Token'
4. Load the API key:
    - For repeated use, follow the instructions in `pickle_tutorial.ipynb` to create a `./credentials.pkl` file that holds a dictionary containing the entry `{'ZENODO_TOKEN': MYKEY}`, where MYKEY is your API key.
    - For occasional use, run the credentials cell and paste your API key when prompted.
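
The contents of `pickle_tutorial.ipynb` are not shown here, so the following is only a minimal sketch of how such a credentials file could be written and read back. The helper names `save_credentials` and `load_credentials` are hypothetical; only the file name and the `'ZENODO_TOKEN'` key come from the steps above.

```python
import pickle
from pathlib import Path


def save_credentials(token, path="credentials.pkl"):
    """Store the token in a pickled dict under the 'ZENODO_TOKEN' key."""
    with open(path, "wb") as handle:
        pickle.dump({"ZENODO_TOKEN": token}, handle)


def load_credentials(path="credentials.pkl"):
    """Return the stored token, or None if the file does not exist."""
    if not Path(path).exists():
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)["ZENODO_TOKEN"]
```

Calling `save_credentials("MYKEY")` once, with your own key in place of `"MYKEY"`, would leave a `credentials.pkl` file in the working directory for later runs. Keep that file out of version control, since it holds a secret.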

## Additional Information

Documentation Guide:
- Zenodo API ([Zenodo](https://developers.zenodo.org))
- Zenodo Search Guide ([Guide](https://help.zenodo.org/guides/search/))

## Overview of workflow

<img src="../images/Zenodo_workflow.jpg" width=500 height=500 align="left"/>

## Imports

In [1]:
# Import libraries
import requests
import pandas as pd
import pickle
from flatten_json import flatten
from collections import OrderedDict

In [2]:
# Load credentials
try:
    with open('credentials.pkl', 'rb') as credentials:
        ZENODO_TOKEN = pickle.load(credentials)['ZENODO_TOKEN']
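except (OSError, KeyError, EOFError, pickle.UnpicklingError):
    # Fall back to a prompt if the file is missing, unreadable, or lacks the key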
    ZENODO_TOKEN = input('Please enter your Zenodo API Key: ')

In [3]:
SEARCH_URL = 'https://zenodo.org/api/records'
HEADERS = {'Authorization': f'Bearer {ZENODO_TOKEN}'}

# Query #1: Query the API based on search terms

Function `get_all_search_outputs` queries the Zenodo API for all specified search terms
- Calls function `get_individual_search_output` for each search term
- To account for Zenodo search limits, each search term is queried in one-year increments
- Stores each result in an ordered dictionary keyed by search term
- Flattens the highly nested JSON output if `flatten_output` is set
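
The one-year increments can be illustrated without touching the network. The sketch below only builds the query strings, using the same `created:[... TO ...]` range syntax as the notebook. The helper names `build_yearly_query` and `yearly_queries` are hypothetical, and the fixed number of years is an assumption made for illustration; the notebook itself keeps stepping back until a year returns no results.

```python
def build_yearly_query(search_term, year):
    """Return a query string restricted to records created in `year`."""
    start_date = f"{year}-01-01"
    end_date = f"{year}-12-31"
    return f"{search_term} AND created:[{start_date} TO {end_date}]"


def yearly_queries(search_term, first_year, n_years):
    """Yield (year, query) pairs, walking backwards from `first_year`."""
    for year in range(first_year, first_year - n_years, -1):
        yield year, build_yearly_query(search_term, year)


# Show the queries for three consecutive years, newest first
for year, query in yearly_queries('"machine learning"', 2021, 3):
    print(year, query)
```

Splitting a search by year keeps each individual result set smaller, which is how the notebook works around the search limits mentioned above.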

In [4]:
def get_all_search_outputs(search_terms, flatten_output=False):
    """Call the Zenodo API for each search term.
    Results are returned as results[(search_term,)] = (aggregations, df).
    
    Parameters
    ----------
    search_terms : list-like
        Collection of search terms to query over.
    flatten_output : boolean, optional (default=False)
        Flag for flattening nested columns of output.
    
    Returns
    -------
    results : OrderedDict
        Dictionary mapping each (search_term,) key to the (aggregations, DataFrame) tuple returned by get_individual_search_output.
    """

    results = OrderedDict()

    for search_term in search_terms:
        results[(search_term,)] = get_individual_search_output(search_term, flatten_output)
        
    return results

Function `get_individual_search_output` queries the Zenodo API with the specified search term (e.g., “machine learning”)
- Searches across all returned pages
- Returns high-level summary statistics of the search (the `aggregations` entry of the first response) together with a dataframe
    - The dataframe contains *full metadata* about each object
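
The page-by-page search can be sketched in isolation. In the example below, `fetch_page` is a hypothetical stand-in for the HTTP request, and the fake responses only imitate the `hits` structure that the notebook reads; they are not real API output.

```python
def collect_all_pages(fetch_page, start_page=1):
    """Collect hits page by page until a page comes back empty.

    `fetch_page` is a hypothetical callable that takes a page number and
    returns a dict shaped like {'hits': {'hits': [...], 'total': int}}.
    """
    records = []
    page = start_page
    while True:
        output = fetch_page(page)
        hits = output.get("hits", {}).get("hits", [])
        if not hits:
            break
        # Tag each record with the page it came from, as the notebook does
        records.extend({**hit, "page": page} for hit in hits)
        page += 1
    return records


# Fake two-page result set standing in for real API responses
fake_pages = {
    1: {"hits": {"hits": [{"id": 1}, {"id": 2}], "total": 3}},
    2: {"hits": {"hits": [{"id": 3}], "total": 3}},
}
empty_page = {"hits": {"hits": [], "total": 3}}

records = collect_all_pages(lambda page: fake_pages.get(page, empty_page))
print(records)
```

Passing the fetch step in as a callable keeps the loop testable offline; in the notebook, the same role is played by `requests.get` with the `page` parameter incremented on each pass.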

In [5]:
def get_individual_search_output(search_term, flatten_output=False):
    """Calls the Zenodo API with the specified search term and returns the search output results.
    
    Parameters
    ----------
    search_term : str 
        Keyword to search for.
    flatten_output : boolean, optional (default=False)
        Flag for flattening nested columns of output.
   
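    Returns
    -------
    search_aggregation_info : dict
        High-level search information from the 'aggregations' entry of the first response.
    search_df : DataFrame
        DataFrame containing the output of the search query.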
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    
    # Set search variables
    start_page = 1
    page_size = 1000 # Max = 10,000, Default = 10
    search_year = 2021
    search_df = pd.DataFrame()
    start_date = f'{search_year}-01-01'
    end_date = f'{search_year}-12-31'
    
    search_params = {
        'q': f'{search_term} AND created:[{start_date} TO {end_date}]',
        'page': start_page,
        'size': page_size,
        }
    
    # Run initial search & extract output
    response = requests.get(SEARCH_URL, #Records — search published records
                        params = search_params)
    output = response.json()
    
    # Gather high-level search information from the 'aggregations' entry
    search_aggregation_info = output['aggregations']
    
    # Loop over search years - searches until the current search year does not return any results
    while output.get('hits').get('total'):
        # Loop over pages - searches until the current page is empty 
        while response.status_code == 200 and output.get('hits').get('hits'):
            # Flatten output
            if flatten_output:
                output_list = [flatten(result) for result in output['hits']['hits']]
            else:
                output_list = output['hits']['hits']
            
            # Turn outputs into DataFrame & add page info
            output_df = pd.DataFrame(output_list)
            output_df['page'] = search_params['page']
            
            # Append modified output df to our cumulative search DataFrame
            search_df = pd.concat([search_df, output_df]).reset_index(drop=True)

            # Increment page
            search_params['page'] += 1 
            
            # Run search & extract output
            response = requests.get(SEARCH_URL, #Records — search published records
                                params = search_params)
            output = response.json()
            
        # Change search year, reset search page
        search_year -= 1
        start_date = f'{search_year}-01-01'
        end_date = f'{search_year}-12-31'

        search_params['q'] = f'{search_term} AND created:[{start_date} TO {end_date}]'
        search_params['page'] = start_page

        # Run search & extract output
        response = requests.get(SEARCH_URL, #Records — search published records
                            params = search_params)
        output = response.json()
        
    return search_aggregation_info, search_df
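
When `flatten_output=True`, each record is passed through `flatten` from `flatten_json`. The sketch below is a simplified stand-in for that step, not the library's implementation: it assumes nested keys are joined with underscores and that list positions become part of the key, which is consistent with column names such as `files_0_key` in the example output. The function name `flatten_record` and the sample record are hypothetical.

```python
def flatten_record(value, parent_key="", separator="_"):
    """Flatten nested dicts and lists into a single-level dict."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: store it under the accumulated key
        return {parent_key: value}

    flat = {}
    for key, child in items:
        new_key = f"{parent_key}{separator}{key}" if parent_key else str(key)
        flat.update(flatten_record(child, new_key, separator))
    return flat


# Illustrative record; the values are made up
record = {
    "doi": "10.0000/example",
    "files": [{"key": "data.zip", "links": {"self": "https://example.org/f"}}],
}
print(flatten_record(record))
```

Because every list element gets its own numbered key, records with long lists (for example, hundreds of references) produce very wide dataframes, with most of those columns empty for other records.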

#### Example search

In [6]:
search_terms = ['\"machine learning\"', '\"artificial intelligence\"']

In [7]:
search_output_dict = get_all_search_outputs(search_terms, flatten_output=True)

In [8]:
sample_key = (search_terms[0],)
sample_df = search_output_dict[sample_key][1]

In [9]:
sample_df.head()

| | conceptdoi | conceptrecid | created | doi | files_0_bucket | files_0_checksum | files_0_key | files_0_links_self | files_0_size | files_0_type | ... | metadata_references_631 | metadata_references_632 | metadata_references_633 | metadata_references_634 | metadata_references_635 | metadata_references_636 | metadata_journal_year | metadata_thesis_supervisors_4_affiliation | metadata_thesis_supervisors_4_name | owners |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10.5281/zenodo.4738769 | 4738769 | 2021-05-05T10:21:43.604973+00:00 | 10.5281/zenodo.4738770 | fdefeabc-7897-4130-9628-438795c877c2 | md5:0c8ea118118b0300a150b7f54ffc56e8 | kratzert/multiple_forcing-v1.0.zip | https://zenodo.org/api/files/fdefeabc-7897-413... | 133317.0 | zip | ... | | | | | | | | | | |
| 1 | | 4768051 | 2021-05-17T17:53:16.165204+00:00 | 10.1007/s10994-021-05968-x | a43e8b77-a43a-488c-8e02-489f02047271 | md5:82cb35e198d55ae12aef1e51f1aefb10 | Škrlj2021_Article_AutoBOTEvolvingNeuro-symbol... | https://zenodo.org/api/files/a43e8b77-a43a-488... | 3000278.0 | pdf | ... | | | | | | | | | | |
| 2 | 10.5281/zenodo.4559516 | 4559516 | 2021-02-25T13:59:22.292039+00:00 | 10.5281/zenodo.4559517 | a3d35e2c-f833-4d73-a6c9-0ec34f1c4523 | md5:88feeb70a50b5156c37f923135a5edb3 | sars-cov2-em-gpmm-mean-reconstruction.h5 | https://zenodo.org/api/files/a3d35e2c-f833-4d7... | 2361200.0 | h5 | ... | | | | | | | | | | |
| 3 | 10.5281/zenodo.4670267 | 4670267 | 2021-04-07T17:34:57.971465+00:00 | 10.5281/zenodo.4670268 | 3356b853-271f-4c6a-8486-222c1e4e2e99 | md5:9a5d25809ef212e0967e54849f89bbcb | multi-forcing-models.zip | https://zenodo.org/api/files/3356b853-271f-4c6... | 6163467000.0 | zip | ... | | | | | | | | | | |
| 4 | 10.5281/zenodo.4456151 | 4456151 | 2021-01-22T08:59:53.920317+00:00 | 10.5281/zenodo.4456470 | 723fc682-b0bd-4b5a-ba84-5c13f158611e | md5:2b26b38d7191972f614763a0aec177cc | CAMP.zip | https://zenodo.org/api/files/723fc682-b0bd-4b5... | 58343170.0 | zip | ... | | | | | | | | | | |
