# Dryad API

# Setup

## Instructions

This notebook uses the Data Dryad API. No API token is required.

## Additional Information

Documentation Guide:
- Dryad API ([Dryad](https://datadryad.org/api/v2/docs/))
- Dryad API ([GitHub](https://github.com/CDL-Dryad/dryad-app/tree/main/documentation/apis))

## Overview of workflow

<img src="../images/Dryad_workflow.jpg" width=600 height=600 align="left" />

## Imports

In [1]:
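import os # For building output paths when saving results
import requests # For querying data from API
import pandas as pd # For storing/manipulating query data
from tqdm import tqdm # Gives status bar on loop completion
from collections import OrderedDict # For keeping results in query order
from flatten_json import flatten # For flattening nested JSON records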

## Setup

In [2]:
# Search constants
BASE_URL = 'https://datadryad.org/api/v2'

## Query #1: query API based on search terms

Function `get_all_search_outputs` queries the Dryad API for each specified search term and returns the results as a dictionary of dataframes (one dataframe per query)
- Calls function `get_individual_search_output` for each search term

In [3]:
def get_all_search_outputs(search_terms, flatten_output=False):
    """
    Call the Data Dryad API for each search term.
    Results are returned as results[(term,)] = df
    
    Params:
    - search_terms : list-like 
        collection of search terms to query over
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output
    
    Returns:
    - results : dict
        dictionary consisting of returned DataFrames from get_individual_search_output for each query
    """
    
    results = OrderedDict()

    for search_term in search_terms:
        results[(search_term,)] = get_individual_search_output(search_term, flatten_output)
        
    return results

Function `_conduct_search_over_pages` is a helper function used to iterate over search result pages

In [4]:
def _conduct_search_over_pages(search_url, search_params, flatten_output=False, delim=None):
    # Make sure proper delim passed in
    if delim:
        assert isinstance(delim, str), 'Incorrect delim parameter passed in. Must be of type str'
    
    search_df = pd.DataFrame()
    
    # Perform initial search & convert results to json
    response = requests.get(search_url, params=search_params)
    output = response.json()
    
    # Loops over the search as long as the page was not empty
    while output.get('count'):
        # Extract relevant output data
        output = output['_embedded']
        
        if delim:
            output = output[delim]
        
        # Flatten output if necessary
        if flatten_output:
            output = [flatten(result) for result in output]
        
        output_df = pd.DataFrame(output)
        output_df['page'] = search_params['page']
        
        search_df = pd.concat([search_df, output_df]).reset_index(drop=True)
        
        # Increment the page number to perform another search
        search_params['page'] += 1
        
        # Perform next search and convert results to json
        response = requests.get(search_url, params=search_params)
        output = response.json()

    return search_df
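
The flattened column names that appear in the outputs of this notebook (for example `_links_self_href` or `_links_curies_0_name`) come from joining nested keys and list indices with underscores. The sketch below is a simplified stand-in written for illustration, not the actual implementation of `flatten_json.flatten`; the function name `flatten_record` and the sample record are assumptions.

```python
def flatten_record(value, parent_key='', sep='_'):
    """Collapse nested dicts/lists into one flat dict of scalar values."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become part of the key
    else:
        # Scalar leaf: store it under the key accumulated so far
        return {parent_key: value}

    flat = {}
    for key, child in items:
        child_key = f'{parent_key}{sep}{key}' if parent_key else str(key)
        flat.update(flatten_record(child, child_key, sep))
    return flat


# Hand-written record shaped like one entry of _embedded['stash:files'];
# the values are illustrative only.
sample = {
    '_links': {
        'self': {'href': '/api/v2/files/1'},
        'curies': [{'name': 'stash', 'templated': 'true'}],
    },
    'path': 'example.csv',
    'size': 1266,
}

flat = flatten_record(sample)
for key in flat:
    print(key)
```

Each flat key then becomes one DataFrame column, which is why `flatten_output=True` produces wide tables.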

Function `get_individual_search_output` queries the Dryad API for the specified search term and returns the result as a dataframe
- Calls function `_conduct_search_over_pages` 

In [5]:
def get_individual_search_output(search_term, flatten_output=False):
    """
    Returns all datasets from the Data Dryad API that match the search term
    
    Params:
    - search_term : str
        term to search for
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output
        
    Returns:
    - pandas.DataFrame
        DataFrame containing the output of the search query
    """
    
    # Set search params
    search_url = f'{BASE_URL}/search'
    start_page = 1
    page_size = 100
    
    search_params = {
        'q': search_term,
        'page': start_page,
        'per_page': page_size
    }
    
    return _conduct_search_over_pages(search_url, search_params, flatten_output, delim='stash:datasets')
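
The paging logic in `_conduct_search_over_pages` keeps requesting pages until one reports an empty `count`, and it increments `search_params['page']` in place. A minimal sketch of the same stopping rule, separated from the HTTP call so it can be tried without network access, is shown below. The names `iter_pages`, `fake_fetch` and `FAKE_PAGES` are hypothetical, and the fake payloads only imitate the `_embedded` layout used above.

```python
def iter_pages(fetch_page, params):
    """Yield (page_number, payload) until a page reports no results."""
    params = dict(params)  # work on a copy so the caller's dict is untouched
    while True:
        payload = fetch_page(params)
        if not payload.get('count'):
            break
        yield params['page'], payload
        params['page'] += 1


# Fake two-page API used only for this demonstration
FAKE_PAGES = {
    1: {'count': 2, '_embedded': {'stash:datasets': [{'id': 1}, {'id': 2}]}},
    2: {'count': 1, '_embedded': {'stash:datasets': [{'id': 3}]}},
}


def fake_fetch(params):
    # Any page beyond the fake data behaves like an empty result page
    return FAKE_PAGES.get(params['page'], {'count': 0})


start_params = {'q': 'example', 'page': 1, 'per_page': 2}
rows = []
for page, payload in iter_pages(fake_fetch, start_params):
    for record in payload['_embedded']['stash:datasets']:
        rows.append({'page': page, **record})

print(rows)
print(start_params['page'])
```

Against the live API, `fetch_page` could be something like `lambda p: requests.get(search_url, params=p, timeout=30).json()`. Calling `response.raise_for_status()` before parsing would also keep an HTTP error from being mistaken for an empty page.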

#### Run query #1 functions - example

In [6]:
search_terms = ['\"machine learning\"', '\"artificial intelligence\"']
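
The double quotes are part of each search term, so they are sent to the API along with the words. The sketch below uses only the standard library to approximate the query string that `requests` builds from `search_params`; it is meant as a way to inspect the encoding, and the exact URL produced by `requests` may differ in minor details.

```python
from urllib.parse import urlencode

BASE_URL = 'https://datadryad.org/api/v2'
search_params = {'q': '"machine learning"', 'page': 1, 'per_page': 100}

# urlencode percent-encodes the quotes and turns the space into '+'
query_string = urlencode(search_params)
print(f'{BASE_URL}/search?{query_string}')
```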

In [7]:
search_output_dict = get_all_search_outputs(search_terms, flatten_output=True)

In [8]:
sample_key = (search_terms[0],)
sample_df = search_output_dict[sample_key]

## Query #2: query API for full metadata for hits from query #1

Function `get_query_metadata` retrieves the file metadata associated with each object and formats it as a dataframe
- Output is a single dataframe for each search query (matching each dataframe in the query #1 ordered dictionary)

In [9]:
def get_query_metadata(object_paths, flatten_output=False):
    """
    Retrieves the metadata for the file/files listed in object_paths
    
    Params:
    - object_paths : str/list-like
        string or list of strings containing the paths for the objects
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output
    
    Returns:
    - metadata_df : pandas.DataFrame
        DataFrame containing metadata for the requested dataset
    """
    
    # If a single object path is provided as a string, wrap it in a list
    if isinstance(object_paths, str):
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object id'
    
    # Set search variables
    start_page = 1
    metadata_df = pd.DataFrame()
    
    # Request the metadata for each object
    for object_path in tqdm(object_paths):
        # Set search variables
        search_url = f'{BASE_URL}/versions/{object_path}/files'
        search_params = {'page': start_page}
        
        # Conduct search
        object_df = _conduct_search_over_pages(search_url, search_params, flatten_output, delim='stash:files')

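        # Add relevant data to DataFrame and merge
        object_df['id'] = object_path
        # Note: this overwrites the per-row page numbers set in _conduct_search_over_pages
        # with the final page counter (the first empty page that was requested)
        object_df['page'] = search_params['page']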
        metadata_df = pd.concat([metadata_df, object_df]).reset_index(drop=True)
    
    return metadata_df

Function `get_all_metadata` loops over the query #1 results and stores one metadata dataframe per query in an ordered dictionary with matching keys
- Calls function `get_query_metadata`

In [10]:
def get_all_metadata(search_output_dict, flatten_output=False):
    """
    Retrieves all of the metadata that relates to the provided DataFrames
    
    Params:
    - search_output_dict : dict
        Dictionary of DataFrames from get_all_search_outputs
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output  
      
    Returns:
    - metadata_dict : collections.OrderedDict
        OrderedDict of DataFrames with metadata for each query
        Order matches the order of search_output_dict
    """
    
    # Build one metadata DataFrame per query, keyed like search_output_dict
    metadata_dict = OrderedDict()

    for query, df in search_output_dict.items():
        search_term = query[0]
        print(f'Retrieving {search_term} metadata')
        
        # Create object paths
        object_paths = df.id.convert_dtypes(convert_string=True).tolist()

        metadata_dict[query] = get_query_metadata(object_paths, flatten_output)
    
    return metadata_dict
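
Each ID taken from the search results turns into one request to `/versions/{id}/files`, so the number of requests grows with the number of rows. The sketch below shows how the request URLs are formed and drops repeated IDs first. The helper `build_file_urls` is hypothetical, and whether repeated IDs actually occur in a given search result is an assumption; the IDs are taken from the sample output in this notebook.

```python
BASE_URL = 'https://datadryad.org/api/v2'


def build_file_urls(ids):
    """Return one /versions/{id}/files URL per distinct ID, keeping order."""
    # dict.fromkeys removes duplicates while preserving first-seen order
    distinct_ids = dict.fromkeys(str(i) for i in ids)
    return [f'{BASE_URL}/versions/{i}/files' for i in distinct_ids]


urls = build_file_urls([28194, 17689, 17689, 50040])
for url in urls:
    print(url)
```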

#### Run query #2 functions - example

In [11]:
metadata_dict = get_all_metadata(search_output_dict, flatten_output=True)

Retrieving "machine learning" metadata
100%|██████████| 225/225 [02:00<00:00,  1.87it/s]
Retrieving "artificial intelligence" metadata
100%|██████████| 19/19 [00:09<00:00,  1.95it/s]


In [12]:
metadata_dict[sample_key]

Unnamed: 0,_links_self_href,_links_stash:dataset_href,_links_stash:version_href,_links_stash:files_href,_links_stash:file-download_href,_links_stash:download_href,_links_curies_0_name,_links_curies_0_href,_links_curies_0_templated,path,url,size,mimeType,status,digest,digestType,description,page,id
0,/api/v2/files/105376,/api/v2/datasets/doi%3A10.5061%2Fdryad.61pm78v,/api/v2/versions/28194,/api/v2/versions/28194/files,/api/v2/files/105376/download,/api/v2/files/105376/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,Datas-ZLL.xlsx,https://dryad-assetstore-east.s3.amazonaws.com...,39464.0,application/vnd.openxmlformats-officedocument....,created,eb69ba9befabb98fb4cbc75ab6eb23ca,md5,"Carbon concentration, carbon stock, and distri...",2,28194
1,/api/v2/files/59371,/api/v2/datasets/doi%3A10.5061%2Fdryad.48vb7,/api/v2/versions/17689,/api/v2/versions/17689/files,/api/v2/files/59371/download,/api/v2/files/59371/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,GundersonLeal15_AmNat_Anole_Data.txt,https://dryad-assetstore-east.s3.amazonaws.com...,11605.0,text/plain,created,7f10d45c751b82f8e05f2932e221b7a0,md5,Activity data for Anolis cristatellus,2,17689
2,/api/v2/files/59372,/api/v2/datasets/doi%3A10.5061%2Fdryad.48vb7,/api/v2/versions/17689,/api/v2/versions/17689/files,/api/v2/files/59372/download,/api/v2/files/59372/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,Christian varanid spp activity.csv,https://dryad-assetstore-east.s3.amazonaws.com...,1266.0,text/csv,created,6e44028c2e001c24d33427d172eb893d,md5,Data on activity for Varanids from Christian a...,2,17689
3,/api/v2/files/232146,/api/v2/datasets/doi%3A10.5061%2Fdryad.3xsj3txbt,/api/v2/versions/50040,/api/v2/versions/50040/files,/api/v2/files/232146/download,/api/v2/files/232146/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,jags_code.txt,,3219.0,text/plain,copied,,,,2,50040
4,/api/v2/files/232147,/api/v2/datasets/doi%3A10.5061%2Fdryad.3xsj3txbt,/api/v2/versions/50040,/api/v2/versions/50040/files,/api/v2/files/232147/download,/api/v2/files/232147/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,rcode_for_running_model.R,,1278.0,application/octet-stream,copied,,,,2,50040
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
562,/api/v2/files/77515,/api/v2/datasets/doi%3A10.5061%2Fdryad.t6v7t8s,/api/v2/versions/22425,/api/v2/versions/22425/files,/api/v2/files/77515/download,/api/v2/files/77515/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,Raw data from flow cytometric analysis.xlsx,https://dryad-assetstore-east.s3.amazonaws.com...,9522.0,application/vnd.openxmlformats-officedocument....,created,959a2337a82c5784fffc0d50342fb4b8,md5,,2,22425
563,/api/v2/files/77516,/api/v2/datasets/doi%3A10.5061%2Fdryad.t6v7t8s,/api/v2/versions/22425,/api/v2/versions/22425/files,/api/v2/files/77516/download,/api/v2/files/77516/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,Raw data from the germination trials.xlsx,https://dryad-assetstore-east.s3.amazonaws.com...,14437.0,application/vnd.openxmlformats-officedocument....,created,124673d7b9e6da22598a2a5ebaf965c2,md5,,2,22425
564,/api/v2/files/80783,/api/v2/datasets/doi%3A10.5061%2Fdryad.3s6rm0f,/api/v2/versions/23265,/api/v2/versions/23265/files,/api/v2/files/80783/download,/api/v2/files/80783/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,NZDOC.PKIMR_UVS.species_counts.1998_1999.csv,https://dryad-assetstore-east.s3.amazonaws.com...,7042.0,text/csv,created,2111c2535d4a322ae2265dbeed167ba7,md5,Data consist of counts of abundances of each o...,2,23265
565,/api/v2/files/80784,/api/v2/datasets/doi%3A10.5061%2Fdryad.3s6rm0f,/api/v2/versions/23265,/api/v2/versions/23265/files,/api/v2/files/80784/download,/api/v2/files/80784/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,README_for_NZDOC.PKIMR_UVS.species_counts.1998...,https://dryad-assetstore-east.s3.amazonaws.com...,1250.0,application/octet-stream,created,d0a48ce9052dffc8c04c81e3723db7e9,md5,Data consist of counts of abundances of each o...,2,23265


## Combine results of query #1 and query #2

Function `merge_search_and_metadata_dicts` merges the output dictionaries from query #1 and query #2 into a single ordered dictionary and optionally saves each merged dataframe as a csv file

In [13]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on=None, left_on=None, right_on=None, save=False):
    """
    Merges together all of the search and metadata DataFrames by the given 'on' key
    
    Params:
    - search_dict (dict): dictionary of search output results
    - metadata_dict (dict): dictionary of metadata results
    - on (str/list-like): column name(s) to merge the two dicts on
    - left_on (str/list-like): column name(s) to merge the left dict on
    - right_on (str/list-like): column name(s) to merge the right dict on
    - save=False, optional (bool/list-like): specifies if the output DataFrames should be saved
        If True: saves to file of format 'data/dryad/{search_term}.csv'
        If list-like: saves to the respective file name in the list, inside 'data/dryad'
            Must contain one entry per query (len(search_dict))
            
    Returns:
    - df_dict (OrderedDict): OrderedDict containing all of the merged search/metadata dicts
    """

    # Make sure the dictionaries contain the same searches
    assert search_dict.keys() == metadata_dict.keys(), 'Dictionaries must contain the same searches'
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except (AssertionError, TypeError):
        raise ValueError('Incorrect save value(s)')
        
    # Merge the DataFrames
    df_dict = OrderedDict()
    for (query_key, search_df), save_loc in zip(search_dict.items(), save):
        # Look up the metadata by key so the pairing does not rely on dictionary order
        metadata_df = metadata_dict[query_key]

        # Merge the search results with the file-level metadata
        df_all = pd.merge(search_df, metadata_df, on=on, left_on=left_on, right_on=right_on, how='outer')
            
        # Save DataFrame
        if save_loc:
            data_dir = os.path.join('data', 'dryad')
            # Ensure the dryad directory is already created
            os.makedirs(data_dir, exist_ok=True)

            if isinstance(save_loc, str):
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Keys are 1-tuples of the form (search_term,)
                search_term = query_key[0]
                output_file = f'{search_term}.csv'
            else:
                raise ValueError(f'Save type must be bool or str, not {type(save_loc)}')

            # Save the merged DataFrame rather than the search results alone
            df_all.to_csv(os.path.join(data_dir, output_file), index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict
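
The search terms in this notebook contain double quotes and spaces, which are awkward in file names and not allowed on some file systems. One possible way to derive a safer name is sketched below. The helper `safe_filename` and the `dryad_` prefix are assumptions for illustration and are not part of the function above.

```python
import re


def safe_filename(search_term, suffix='.csv'):
    """Turn a search term into a conservative file name."""
    # Replace every run of characters other than letters, digits, '-' or '_'
    cleaned = re.sub(r'[^A-Za-z0-9_-]+', '_', search_term).strip('_')
    return f'dryad_{cleaned}{suffix}'


print(safe_filename('"machine learning"'))
print(safe_filename('"artificial intelligence"'))
```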

#### Run merge function - example

In [14]:
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict, on='id')

In [15]:
df_dict[sample_key]

Unnamed: 0,_links_self_href_x,_links_stash:versions_href,_links_stash:version_href_x,_links_stash:download_href_x,_links_curies_0_name_x,_links_curies_0_href_x,_links_curies_0_templated_x,identifier,id,storageSize,...,_links_curies_0_templated_y,path,url,size,mimeType,status,digest,digestType,description,page_y
0,/api/v2/datasets/doi%3A10.5061%2Fdryad.nj2mg77,/api/v2/datasets/doi%3A10.5061%2Fdryad.nj2mg77...,/api/v2/versions/29027,/api/v2/datasets/doi%3A10.5061%2Fdryad.nj2mg77...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.nj2mg77,28194,1769456,...,true,Datas-ZLL.xlsx,https://dryad-assetstore-east.s3.amazonaws.com...,39464.0,application/vnd.openxmlformats-officedocument....,created,eb69ba9befabb98fb4cbc75ab6eb23ca,md5,"Carbon concentration, carbon stock, and distri...",2.0
1,/api/v2/datasets/doi%3A10.5061%2Fdryad.q6ft5,/api/v2/datasets/doi%3A10.5061%2Fdryad.q6ft5/v...,/api/v2/versions/17751,/api/v2/datasets/doi%3A10.5061%2Fdryad.q6ft5/d...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.q6ft5,17689,31656,...,true,GundersonLeal15_AmNat_Anole_Data.txt,https://dryad-assetstore-east.s3.amazonaws.com...,11605.0,text/plain,created,7f10d45c751b82f8e05f2932e221b7a0,md5,Activity data for Anolis cristatellus,2.0
2,/api/v2/datasets/doi%3A10.5061%2Fdryad.q6ft5,/api/v2/datasets/doi%3A10.5061%2Fdryad.q6ft5/v...,/api/v2/versions/17751,/api/v2/datasets/doi%3A10.5061%2Fdryad.q6ft5/d...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.q6ft5,17689,31656,...,true,Christian varanid spp activity.csv,https://dryad-assetstore-east.s3.amazonaws.com...,1266.0,text/csv,created,6e44028c2e001c24d33427d172eb893d,md5,Data on activity for Varanids from Christian a...,2.0
3,/api/v2/datasets/doi%3A10.7280%2FD1WQ3R,/api/v2/datasets/doi%3A10.7280%2FD1WQ3R/versions,/api/v2/versions/118084,/api/v2/datasets/doi%3A10.7280%2FD1WQ3R/download,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.7280/D1WQ3R,67581,241082358,...,,,,,,,,,,
4,/api/v2/datasets/doi%3A10.5061%2Fdryad.xksn02vbq,/api/v2/datasets/doi%3A10.5061%2Fdryad.xksn02v...,/api/v2/versions/55637,/api/v2/datasets/doi%3A10.5061%2Fdryad.xksn02v...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.xksn02vbq,40246,141267,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
661,/api/v2/datasets/doi%3A10.5061%2Fdryad.dr1c383,/api/v2/datasets/doi%3A10.5061%2Fdryad.dr1c383...,/api/v2/versions/23329,/api/v2/datasets/doi%3A10.5061%2Fdryad.dr1c383...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.dr1c383,23265,70472,...,true,NZDOC.PKIMR_UVS.species_counts.1998_1999.csv,https://dryad-assetstore-east.s3.amazonaws.com...,7042.0,text/csv,created,2111c2535d4a322ae2265dbeed167ba7,md5,Data consist of counts of abundances of each o...,2.0
662,/api/v2/datasets/doi%3A10.5061%2Fdryad.dr1c383,/api/v2/datasets/doi%3A10.5061%2Fdryad.dr1c383...,/api/v2/versions/23329,/api/v2/datasets/doi%3A10.5061%2Fdryad.dr1c383...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.dr1c383,23265,70472,...,true,README_for_NZDOC.PKIMR_UVS.species_counts.1998...,https://dryad-assetstore-east.s3.amazonaws.com...,1250.0,application/octet-stream,created,d0a48ce9052dffc8c04c81e3723db7e9,md5,Data consist of counts of abundances of each o...,2.0
663,/api/v2/datasets/doi%3A10.5061%2Fdryad.6m905qfzp,/api/v2/datasets/doi%3A10.5061%2Fdryad.6m905qf...,/api/v2/versions/117654,/api/v2/datasets/doi%3A10.5061%2Fdryad.6m905qf...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.6m905qfzp,61800,1455995,...,,,,,,,,,,
664,/api/v2/datasets/doi%3A10.5061%2Fdryad.078bn,/api/v2/datasets/doi%3A10.5061%2Fdryad.078bn/v...,/api/v2/versions/12722,/api/v2/datasets/doi%3A10.5061%2Fdryad.078bn/d...,stash,https://github.com/CDL-Dryad/stash/blob/main/s...,true,doi:10.5061/dryad.078bn,12676,89062,...,true,PFO and Cryptogenic Stroke Data File.xlsx,https://dryad-assetstore-east.s3.amazonaws.com...,65474.0,application/vnd.openxmlformats-officedocument....,created,2dd37c65722b1326a193917725851d33,md5,This file contains 4 sheet. PFO vs Medical The...,2.0
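
Because the merge uses `how='outer'`, a search result with several files appears once per file, and a search result without any matching metadata is kept with empty metadata columns, as in rows 3 and 4 above. The sketch below uses small stand-in frames (not real Dryad data) to show how the `indicator` and `validate` options of `pd.merge` can make both effects visible.

```python
import pandas as pd

# Stand-in frames: one row per dataset on the left, one row per file on the right
search_df = pd.DataFrame({'id': [1, 2, 3],
                          'identifier': ['doi:a', 'doi:b', 'doi:c']})
metadata_df = pd.DataFrame({'id': [1, 1, 3],
                            'path': ['a.csv', 'README.txt', 'c.xlsx']})

merged = pd.merge(search_df, metadata_df, on='id', how='outer',
                  indicator=True,          # adds a '_merge' column per row
                  validate='one_to_many')  # fails if 'id' repeats on the left

# Rows present only in the search results have no file metadata
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched['id'].tolist())
```

With these inputs, dataset 1 expands to two rows and dataset 2 is the only one flagged as `left_only`.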
