# Zenodo API

# Setup

## Instructions

This notebook uses the Zenodo API. Follow these steps to obtain the credentials you need to continue:
1. Create a Zenodo account at https://zenodo.org/signup/
2. After logging in, click on the user dropdown menu in the top right corner, and click on 'Applications'
3. Obtain an API key by either:
    - Creating a Developer Application by clicking on 'New application'
    - Creating a Personal Access Token by clicking on 'New Token'
4. Load API key:
    - For repeated use, follow the `pickle_tutorial.ipynb` instructions to create a `./credentials.pkl` file that holds a dictionary containing the entry `{'ZENODO_TOKEN': MYKEY}`, with MYKEY being your API key.
    - For occasional use, run the credentials cell and paste your API key when prompted.
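
As a minimal sketch of the repeated-use option, the credentials file could be written and read back as shown below. This is an assumption about what `pickle_tutorial.ipynb` does, not a copy of it; `save_credentials` and `load_token` are hypothetical helper names, and the token you pass in is your own key.

```python
import pickle

def save_credentials(token, path='credentials.pkl'):
    """Write a dictionary holding the Zenodo token to a pickle file (hypothetical helper)."""
    with open(path, 'wb') as f:
        pickle.dump({'ZENODO_TOKEN': token}, f)

def load_token(path='credentials.pkl'):
    """Read the token back, mirroring the credentials cell in this notebook."""
    with open(path, 'rb') as f:
        return pickle.load(f)['ZENODO_TOKEN']
```

Calling `save_credentials(MYKEY)` once is enough; later sessions can then read the token from the file instead of prompting for it.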

## Additional Information

Documentation Guide:
- Zenodo API ([Zenodo](https://developers.zenodo.org))
- Zenodo Search Guide ([Guide](https://help.zenodo.org/guides/search/))

## Overview of workflow

<img src="../images/Zenodo_workflow.jpg" width=500 height=500 align="left"/>

## Imports

In [None]:
# Import libraries
import requests
import pandas as pd
import pickle
from collections import OrderedDict

In [None]:
# Load credentials
try:
    with open('credentials.pkl', 'rb') as credentials:
        ZENODO_TOKEN = pickle.load(credentials)['ZENODO_TOKEN']
except Exception:  # fall back to manual entry if the file or key is missing or unreadable
    ZENODO_TOKEN = input('Please enter your Zenodo API Key: ')

In [None]:
SEARCH_URL = 'https://zenodo.org/api/records'
HEADERS = {'Authorization': f'Bearer {ZENODO_TOKEN}'}

# Query #1: query the API based on search terms

Function `get_all_search_outputs` queries the Zenodo API for every search term specified
- Calls function `get_individual_search_output` for each search term
- To account for Zenodo search limits, each search term is queried one year at a time, starting at 2021 and moving backward until a year returns no results
- Appends each page of results to a cumulative dataframe for that search term
- Flattens highly nested JSON output if `flatten_output=True`
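
The `flatten` helper that is called when `flatten_output=True` is not defined in this section. A minimal sketch of one possible implementation follows; the underscore-joined key names and the index-based handling of lists are assumptions for illustration, not necessarily what the notebook's own helper does.

```python
def flatten(record, parent_key='', sep='_'):
    """Flatten nested dicts/lists into a single-level dict (illustrative sketch)."""
    items = {}
    if isinstance(record, dict):
        pairs = record.items()
    elif isinstance(record, list):
        pairs = enumerate(record)  # list positions become part of the key
    else:
        return {parent_key: record}
    for key, value in pairs:
        new_key = f'{parent_key}{sep}{key}' if parent_key else str(key)
        if isinstance(value, (dict, list)) and value:
            # Recurse into non-empty containers
            items.update(flatten(value, new_key, sep))
        else:
            # Scalars and empty containers are stored as-is
            items[new_key] = value
    return items
```

With this sketch, a nested entry such as `{'metadata': {'title': 'X'}}` becomes `{'metadata_title': 'X'}`, so each nested field ends up in its own dataframe column.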

In [None]:
def get_all_search_outputs(search_terms, flatten_output=False):
    """
    Call the Zenodo API for each search term.
    Results are returned in results[(search_term,)] = (search_aggregation_info, df)
    
    Params:
    - search_terms (list-like): collection of search terms to query over
    - flatten_output (bool): optional (default=False); if True, flatten the nested JSON of each record
    
    Returns:
    - results (dict): maps each (search_term,) tuple to the (search_aggregation_info, DataFrame) tuple returned by get_individual_search_output
    """

    results = OrderedDict()

    for search_term in search_terms:
        results[(search_term,)] = get_individual_search_output(search_term, flatten_output)
        
    return results

Function `get_individual_search_output` queries the Zenodo API with the specified search term (e.g., “machine learning”)
- Searches across all returned pages
- Result is a tuple of the search's `aggregations` entry and a dataframe
    - The `aggregations` entry holds high-level summary statistics and is taken from the first (2021) query
    - The dataframe contains *full metadata* about each object, plus the page on which it was returned
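
The year-by-year, page-by-page strategy can be sketched without any network calls. In the sketch below, `build_query` and `collect_hits` are hypothetical names, and `fetch_page` is a hypothetical stand-in for the `requests.get` call: it takes a query string and a page number and returns a list of hits.

```python
def build_query(search_term, year):
    """Restrict a search term to records created within one calendar year."""
    return f'{search_term} AND created:[{year}-01-01 TO {year}-12-31]'

def collect_hits(search_term, fetch_page, start_year=2021):
    """Walk backward one year at a time, paging until a page comes back empty.

    fetch_page(query, page) is a hypothetical callable returning a list of hits.
    The search stops at the first year whose first page has no hits.
    """
    all_hits = []
    year = start_year
    while True:
        query = build_query(search_term, year)
        page = 1
        hits = fetch_page(query, page)
        if not hits:
            break  # a year with no results ends the search
        while hits:
            all_hits.extend(hits)
            page += 1
            hits = fetch_page(query, page)
        year -= 1
    return all_hits
```

One consequence of this stopping rule is that a year with no matching records ends the search, even if earlier years do have matches.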

In [None]:
def get_individual_search_output(search_term, flatten_output=False):
    """
    Calls the Zenodo API with the specified search term and returns the search output results.
    
    Params:
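    - search_term (str): keyword to search for
    - flatten_output (bool): optional (default=False)
   
    Returns:
    - search_aggregation_info (dict): the 'aggregations' entry of the first search response
    - search_df (pandas.DataFrame): DataFrame containing the output of the search query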
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    
    # Set search variables
    start_page = 1
    page_size = 1000 # Max = 10,000, Default = 10
    search_year = 2021
    search_df = pd.DataFrame()
    start_date = f'{search_year}-01-01'
    end_date = f'{search_year}-12-31'
    
    search_params = {
        'q': f'{search_term} AND created:[{start_date} TO {end_date}]',
        'page': start_page,
        'size': page_size,
        }
    
    # Run initial search & extract output
    response = requests.get(SEARCH_URL, #Records — search published records
                        params = search_params)
    response.raise_for_status()  # fail early with a clear HTTP error if the first request is rejected
    output = response.json()
    
    # Gather high-level search information from the 'aggregations' entry
    search_aggregation_info = output['aggregations']
    
    # Loop over search years - searches until the current search year does not return any results
    while output.get('hits').get('total'):
        # Loop over pages - searches until the current page is empty 
        while response.status_code == 200 and output.get('hits').get('hits'):
            # Flatten output
            if flatten_output:
                output_list = [flatten(result) for result in output['hits']['hits']]
            else:
                output_list = output['hits']['hits']
            
            # Turn outputs into DataFrame & add page info
            output_df = pd.DataFrame(output_list)
            output_df['page'] = search_params['page']
            
            # Append modified output df to our cumulative search DataFrame
            search_df = pd.concat([search_df, output_df]).reset_index(drop=True)

            # Increment page
            search_params['page'] += 1 
            
            # Run search & extract output
            response = requests.get(SEARCH_URL, #Records — search published records
                                params = search_params)
            output = response.json()
            
        # Change search year, reset search page
        search_year -= 1
        start_date = f'{search_year}-01-01'
        end_date = f'{search_year}-12-31'

        search_params['q'] = f'{search_term} AND created:[{start_date} TO {end_date}]'
        search_params['page'] = start_page

        # Run search & extract output
        response = requests.get(SEARCH_URL, #Records — search published records
                            params = search_params)
        output = response.json()
        
    return search_aggregation_info, search_df

#### Example search

In [None]:
search_terms = ['\"machine learning\" OR \"artificial intelligence\"', 
                '\"machine learning\"', 
                '\"artificial intelligence\"',
                '\"deep learning\"',
                '\"neural network\"',
                '\"supervised learning\"',
                '\"unsupervised learning\"',
                '\"reinforcement learning\"',
                '\"training data\"']

In [None]:
search_output_dict = get_all_search_outputs(search_terms, flatten_output=False)
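
Each value in `search_output_dict` is the `(search_aggregation_info, search_df)` tuple returned by `get_individual_search_output`, keyed by a one-element tuple containing the search term. A short sketch of summarizing the number of records per term follows; `count_records` is a hypothetical helper, and it relies only on `len()`, which gives the row count for a DataFrame.

```python
def count_records(results):
    """Map each search term to the number of records in its DataFrame."""
    counts = {}
    # Keys are one-element tuples; values are (aggregations, DataFrame) tuples
    for (search_term,), (_aggregations, df) in results.items():
        counts[search_term] = len(df)
    return counts
```

For example, `count_records(search_output_dict)` would return a dictionary with one entry per search term.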