# Zenodo API

# Setup

## Instructions

This notebook uses the Zenodo API. Follow these steps to obtain the credentials needed to continue:
1. Create a Zenodo account at https://zenodo.org/signup/
2. After logging in, click on the user dropdown menu in the top right corner, and click on 'Applications'
3. Obtain an API key in one of two ways:
    - Create a Developer Application by clicking on 'New application'
    - Create a Personal Access Token by clicking on 'New Token'
4. Load API key:
    - For repeated use, follow the ```pickle_tutorial.ipynb``` instructions to create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'ZENODO_TOKEN': MYKEY}```, with MYKEY being your API key.
    - For occasional use, run the credentials cell and paste your API key when prompted.
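
The credentials file described in step 4 can be produced with a few lines of standard-library code. The following is a minimal sketch, not the contents of `pickle_tutorial.ipynb`; the helper names `save_credentials` and `load_token` are assumptions made for illustration.

```python
import pickle

def save_credentials(token, path='credentials.pkl'):
    """Store the token in a pickled dictionary under the 'ZENODO_TOKEN' key.

    Note: this overwrites any existing file at `path`.
    """
    with open(path, 'wb') as f:
        pickle.dump({'ZENODO_TOKEN': token}, f)

def load_token(path='credentials.pkl'):
    """Read the token back from the pickled dictionary."""
    with open(path, 'rb') as f:
        return pickle.load(f)['ZENODO_TOKEN']
```

Calling `save_credentials(MYKEY)` once would let later sessions read the key with `load_token()`. If the file already holds keys for other services, load the existing dictionary and update it instead of overwriting it.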

## Additional Information

Documentation Guide:
- Zenodo API ([Zenodo](https://developers.zenodo.org))
- Zenodo Search Guide ([Guide](https://help.zenodo.org/guides/search/))

General Zenodo requests format:

```python
response = requests.get('https://zenodo.org/api/records',
                        params={'q': 'my title',
                                'access_token': ACCESS_TOKEN,
                                ...})
```
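
To see what such a request looks like as a URL, the sketch below builds the query string with the standard library instead of sending it. The helper name `build_search_url` and the placeholder token are hypothetical; the `q`, `access_token`, `page`, and `size` parameters are the ones used later in this notebook.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://zenodo.org/api/records'

def build_search_url(query, token, page=1, size=10):
    """Return the full GET URL for a records search without sending it."""
    params = {'q': query, 'access_token': token, 'page': page, 'size': size}
    return f'{SEARCH_URL}?{urlencode(params)}'

# PLACEHOLDER_TOKEN is a stand-in, not a real key
url = build_search_url('my title', 'PLACEHOLDER_TOKEN')
print(url)
```

Printing the URL is a quick way to check how a query is encoded before spending an API call on it. Because the token appears in the query string, avoid pasting such URLs into shared logs.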

## Imports

In [1]:
# Import flatten_json, installing if necessary
try:
    from flatten_json import flatten
except ImportError as e:
    !pip3 install flatten_json
    from flatten_json import flatten

# Import libraries
import os
import requests
import pandas as pd
import pickle
import pprint as pp
from utils import flatten_nested_df
from collections import OrderedDict

In [2]:
# Load credentials
try:
    with open('credentials.pkl', 'rb') as credentials:
        ZENODO_TOKEN = pickle.load(credentials)['ZENODO_TOKEN']
except Exception:  # File missing, unreadable, or lacking the 'ZENODO_TOKEN' entry
    ZENODO_TOKEN = input('Please enter your Zenodo API Key: ')

In [3]:
SEARCH_URL = 'https://zenodo.org/api/records'

# Data Wrangling

In [4]:
def get_all_search_outputs(search_terms, flatten_output=False):
    """
    Call the Zenodo API for each search term.
    Results are returned in results[(search_term,)] = (search_aggregation_info, df)
    
    Params:
    - search_terms (list-like): collection of search terms (strings) to query over
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - results (dict): dictionary mapping each (search_term,) key to the
      (search_aggregation_info, DataFrame) tuple returned by get_individual_search_output
    """

    results = OrderedDict()

    for search_term in search_terms:
        results[(search_term,)] = get_individual_search_output(search_term, flatten_output)
        
    return results

In [5]:
def get_individual_search_output(search_term, flatten_output=False):
    """
    Calls the Zenodo API with the specified search term and returns the search output results.
    
    Params:
    - search_term (str): keyword to search for
    - flatten_output (bool): optional (default=False)
   
    Returns:
    - search_aggregation_info (dict): the 'aggregations' entry of the first response
    - search_df (pandas.DataFrame): DataFrame containing the output of the search query
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    
    # Set search variables
    start_page = 1
    page_size = 1000 # Max = 10,000, Default = 10
    search_year = 2021
    search_df = pd.DataFrame()
    start_date = f'{search_year}-01-01'
    end_date = f'{search_year}-12-31'
    
    search_params = {
        'q': f'\"{search_term}\" AND created:[{start_date} TO {end_date}]',
        'access_token': ZENODO_TOKEN,
        'page': start_page,
        'size': page_size,
        }
    
    # Run initial search & extract output
    response = requests.get(SEARCH_URL, #Records — search published records
                        params = search_params)
    output = response.json()
    
    # Gather high-level search information from the 'aggregations' entry
    search_aggregation_info = output['aggregations']
    
    # Loop over search years - searches until the current search year does not return any results
    while output.get('hits').get('total'):
        # Loop over pages - searches until the current page is empty 
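        while output.get('hits').get('hits'):
            # Report progress (produces the 'year: ... page: ...' log)
            print(f"year: {search_year} page: {search_params['page']}")
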
            # Flatten output
            if flatten_output:
                output_list = [flatten(result) for result in output['hits']['hits']]
            else:
                output_list = output['hits']['hits']
            
            # Turn outputs into DataFrame & add page info
            output_df = pd.DataFrame(output_list)
            output_df['page'] = search_params['page']
            
            # Append modified output df to our cumulative search DataFrame
            search_df = pd.concat([search_df, output_df])

            # Increment page
            search_params['page'] += 1 
            
            # Run search & extract output
            response = requests.get(SEARCH_URL, #Records — search published records
                                params = search_params)
            output = response.json()
            
        # Change search year, reset search page
        search_year -= 1
        start_date = f'{search_year}-01-01'
        end_date = f'{search_year}-12-31'

        search_params['q'] = f'\"{search_term}\" AND created:[{start_date} TO {end_date}]'
        search_params['page'] = start_page

        # Run search & extract output
        response = requests.get(SEARCH_URL, #Records — search published records
                            params = search_params)
        output = response.json()
        
    return search_aggregation_info, search_df
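
The control flow above (pages within a year, then years walked backward until one returns nothing) can be exercised without network access. The sketch below replaces the HTTP call with a hypothetical `fetch(year, page)` callable that returns a dict shaped like the `hits` part of the response used above. The names `collect_records`, `fake_fetch`, and `FAKE_PAGES` are illustrative assumptions, not part of the notebook.

```python
def collect_records(fetch, start_year, start_page=1):
    """Walk backward year by year, paging within each year, until a year has no results.

    `fetch(year, page)` must return {'hits': {'total': int, 'hits': list}}.
    """
    records = []
    year = start_year
    output = fetch(year, start_page)
    # Outer loop: stop at the first year whose total is zero
    while output['hits']['total']:
        page = start_page
        # Inner loop: stop at the first empty page
        while output['hits']['hits']:
            # Tag each record with the page it came from
            records.extend({**hit, 'page': page} for hit in output['hits']['hits'])
            page += 1
            output = fetch(year, page)
        year -= 1
        output = fetch(year, start_page)
    return records


# Fake data standing in for the API: 2 pages in 2021, 1 page in 2020, nothing earlier
FAKE_PAGES = {
    (2021, 1): [{'id': 1}, {'id': 2}],
    (2021, 2): [{'id': 3}],
    (2020, 1): [{'id': 4}],
}

def fake_fetch(year, page):
    """Return a response-shaped dict built from FAKE_PAGES."""
    hits = FAKE_PAGES.get((year, page), [])
    total = sum(len(v) for (y, _), v in FAKE_PAGES.items() if y == year)
    return {'hits': {'total': total, 'hits': hits}}

records = collect_records(fake_fetch, start_year=2021)
print([(r['id'], r['page']) for r in records])
```

One consequence of this stopping rule is that a year with zero matches ends the traversal, even if earlier years would have returned results.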

In [6]:
search_aggregation_info, search_df = get_individual_search_output('artificial intelligence', flatten_output=True)

year: 2021 page: 1
year: 2020 page: 1
year: 2019 page: 1
year: 2018 page: 1
year: 2018 page: 2
year: 2018 page: 3
year: 2017 page: 1
year: 2016 page: 1
year: 2015 page: 1
year: 2014 page: 1


## Check out the data

In [7]:
search_df.head()

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files_0_bucket,files_0_checksum,files_0_key,files_0_links_self,files_0_size,files_0_type,...,metadata_references_3197,metadata_references_3198,metadata_references_3199,metadata_references_3200,metadata_references_3201,metadata_references_3202,metadata_references_3203,metadata_references_3204,metadata_communities_3_id,owners
0,10.5281/zenodo.4891946,4891946,2021-06-02T00:37:49.848437+00:00,10.5281/zenodo.4891947,b77cc60a-755f-407d-b841-48dd4fbe72e2,md5:34a2480f0d291788c6640b0fa979ac9c,BioAITeam/Sensitivity-of-deep-learning-applied...,https://zenodo.org/api/files/b77cc60a-755f-407...,825322.0,zip,...,,,,,,,,,,
1,10.5281/zenodo.4618693,4618693,2021-03-18T18:46:15.483976+00:00,10.5281/zenodo.4618694,87adf869-8456-433d-a2e3-f082954f5408,md5:f8abb0d8ad0c32cef2998cc0a8037bb9,1_K6wNayRTNOCh4ozH5Eus1g.jpeg,https://zenodo.org/api/files/87adf869-8456-433...,88856.0,jpeg,...,,,,,,,,,,
2,10.5281/zenodo.4702011,4702011,2021-04-19T21:47:32.368567+00:00,10.5281/zenodo.4702012,0f3fcb75-fcd6-4c6c-a8f0-d685c1690aa3,md5:631dd6286f13d0820542e98223c6d34a,submission.pdf,https://zenodo.org/api/files/0f3fcb75-fcd6-4c6...,59593690.0,pdf,...,,,,,,,,,,
3,10.5281/zenodo.4772094,4772094,2021-06-10T06:42:48.959739+00:00,10.5281/zenodo.4922385,a968eaf1-eb6f-4bb5-9a45-584bbe0a151d,md5:e1d23833576dc63306740c5f98dea924,legalnero.zip,https://zenodo.org/api/files/a968eaf1-eb6f-4bb...,21467440.0,zip,...,,,,,,,,,,
4,10.5281/zenodo.4626539,4626539,2021-03-22T08:38:45.921221+00:00,10.5281/zenodo.4626540,25c7a3a8-d527-4366-90fe-18ec08d1bdaf,md5:c255a0b0f57a7a370bbe713c44778249,RTASC.zip,https://zenodo.org/api/files/25c7a3a8-d527-436...,1645101000.0,zip,...,,,,,,,,,,
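
Column names such as `files_0_bucket` and `metadata_references_3197` come from flattening nested records: each key or list index along the path is joined with an underscore. The sketch below reproduces that naming scheme in plain Python to show where the wide, mostly empty columns come from. It is an illustrative reimplementation, not the `flatten_json` source, and the `flatten_record` name and the sample record are assumptions.

```python
def flatten_record(value, prefix='', sep='_'):
    """Flatten nested dicts and lists into a single-level dict with joined keys."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: store it under the accumulated key
        return {prefix: value}
    flat = {}
    for key, child in items:
        child_prefix = f'{prefix}{sep}{key}' if prefix else str(key)
        flat.update(flatten_record(child, child_prefix, sep))
    return flat

# Made-up record with the same nesting pattern as a search hit
record = {
    'doi': '10.5281/zenodo.4891947',
    'files': [{'key': 'data.zip', 'type': 'zip'}],
    'metadata': {'references': ['ref A', 'ref B']},
}
print(flatten_record(record))
```

Every list element becomes its own column, so a single record with thousands of references adds thousands of columns that are empty for every other row. This is why most of the `metadata_references_*` cells above are blank.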
