# Figshare API

# Setup

## Instructions

This notebook uses the Figshare API. Follow these steps to obtain the necessary credentials:
1. Create a Figshare account at https://figshare.com/account/register
2. After logging in, click on your account photo in the top right corner, and then click on 'Applications'
3. Obtain an API key by either:
    - Creating an application by clicking on 'Create Application'
    - Creating a personal token by clicking on 'Create Personal Token'
4. Load API key:
    - For repeated use, follow the ```pickle_tutorial.ipynb``` instructions to create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'FIGSHARE_TOKEN': MYKEY}```, with MYKEY being your API key.
    - For sparser use, users can run the credentials cell and paste their API key when prompted.
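
The tutorial notebook is not reproduced here. As a minimal sketch of what creating and reading such a credentials file could look like, assuming it is a plain pickled dictionary as described above (the token value is a hypothetical placeholder, and the file is written to a temporary directory rather than `./credentials.pkl`):

```python
import os
import pickle
import tempfile

# Hypothetical placeholder; substitute your own Figshare token.
MYKEY = 'my-figshare-token'

# Written to a temporary directory so the sketch has no side effects;
# the notebook itself expects the file at ./credentials.pkl.
credentials_path = os.path.join(tempfile.mkdtemp(), 'credentials.pkl')

# Store the token under the key the notebook looks up.
with open(credentials_path, 'wb') as f:
    pickle.dump({'FIGSHARE_TOKEN': MYKEY}, f)

# Reading it back mirrors what the credentials cell does.
with open(credentials_path, 'rb') as f:
    loaded_token = pickle.load(f)['FIGSHARE_TOKEN']
```

Because the file holds a secret, it is usually worth keeping it out of version control.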

## Additional Information

Documentation Guide:
- Figshare API ([Figshare](https://docs.figshare.com))

## Overview of workflow

#### Import necessary libraries

#### Load API credentials

#### Query #1: query API based on search terms and search types

Define functions to query API based on search terms and search types:

1.	Function `get_individual_search_output` queries the Figshare API with the specified search term (e.g., “machine learning”) and search type (i.e., articles or collections)
    - Figshare allows a variety of search types. This script searches (1) articles and (2) collections; the function rejects any other search type.
        - "Articles" in this context include a variety of item types: 1 - Figure, 2 - Media, 3 - Dataset, 5 - Poster, 6 - Journal contribution, 7 - Presentation, 8 - Thesis, 9 - Software, 11 - Online resource, 12 - Preprint, 13 - Book, 14 - Conference contribution, 15 - Chapter, 16 - Peer review, 17 - Educational resource, 18 - Report, 19 - Standard, 20 - Composition, 21 - Funding, 22 - Physical object, 23 - Data management plan, 24 - Workflow, 25 - Monograph, 26 - Performance, 27 - Event, 28 - Service, 29 - Model
    - Searches across all returned pages
    - Result is a dataframe (one dataframe per search term/search type combination)
    - Each dataframe contains high-level information about each object (e.g., id, title, doi, URL)


3.	Function `get_all_search_outputs` queries the Figshare API for all combinations of search terms and search types specified and returns the results as a dictionary of dataframes (one dataframe for each query combination)
    - Calls function `get_individual_search_output` for each combination of search term and search type

Run `get_all_search_outputs` for specified search terms and search types. Output is ordered dictionary of dataframes (result #1 "ordered_dict").
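
As a rough illustration of how the combinations are enumerated and keyed, the sketch below pairs every search term with every search type using `itertools.product`. `fake_search` is a hypothetical stand-in for `get_individual_search_output`, so nothing is sent to the API.

```python
import itertools
from collections import OrderedDict


def fake_search(search_term, search_type):
    """Hypothetical stand-in for get_individual_search_output."""
    return f'results for {search_term} in {search_type}'


search_terms = ['"machine learning"', '"deep learning"']
search_types = ['articles', 'collections']

# One entry per (term, type) combination, keyed by the tuple itself.
results = OrderedDict()
for term, kind in itertools.product(search_terms, search_types):
    results[(term, kind)] = fake_search(term, kind)

keys = list(results)
```

Keying by tuple keeps the term and type recoverable later without parsing a combined string.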

#### Query #2: query API for full metadata for hits from initial query

Following query #1 (resulting in result #1 "ordered_dict"), define functions to retrieve the full metadata associated with each object.

4.	Function `_retrieve_object_json` uses the URL for each object (from dataframe in result #1 "ordered_dict") to query API for metadata associated with each object. 
    - Returns flattened JSON object   
    
    
5.	Function `get_query_metadata` extracts metadata associated with each object and formats it as a dataframe
    - Calls function `_retrieve_object_json` to get full metadata for each object
    - Output is a single dataframe for each search query (matching each dataframe in result #1 "ordered_dict")


6.	Function `get_all_metadata` loops over the dataframes in result #1 "ordered_dict" and puts the resulting metadata dataframes into a matching ordered dictionary
    - Calls function `get_query_metadata`


Run `get_all_metadata` (which calls `get_query_metadata`) to pull metadata for each object returned by query #1. Output is an ordered dictionary of dataframes (result #2 "metadata_dict").

We now have a dictionary of results from query #1 (object "ordered_dict", which includes high level metadata such as ID and title) and a dictionary of results with additional metadata for each object (object "metadata_dict").
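
Query #2 addresses each object by a URL built from the search type and the object id returned by query #1. A minimal sketch of that construction follows; `build_object_urls` is a hypothetical helper used only for illustration, and the ids are made up.

```python
BASE_URL = 'https://api.figshare.com/v2'


def build_object_urls(search_type, object_ids):
    """Return one detail URL per object id (hypothetical helper)."""
    return [f'{BASE_URL}/{search_type}/{object_id}' for object_id in object_ids]


# Made-up ids, standing in for the 'id' column of a query #1 dataframe.
object_urls = build_object_urls('articles', [101, 102])
```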

#### Merge results
7.	Function `merge_search_and_metadata_dicts` merges these two dictionaries ("ordered_dict" and "metadata_dict") into a single ordered dictionary and optionally saves the results as csv files (one per query)

Run merge function to access full ordered dictionary and optionally save results to csv file.

## Imports

In [6]:
import requests # For querying data from API
import pandas as pd # For storing/manipulating query data
from tqdm import tqdm # Gives status bar on loop completion
import itertools # For efficient looping over queries
from collections import OrderedDict
from flatten_json import flatten

# For loading credentials
import pickle
import os 

API access tokens have been stored in credentials.pkl file

In [7]:
# Load credentials

# Check for credentials file
try:
    with open('credentials.pkl', 'rb') as credentials:
        FIGSHARE_TOKEN = pickle.load(credentials)['FIGSHARE_TOKEN']
except (OSError, KeyError, EOFError, pickle.UnpicklingError):
    FIGSHARE_TOKEN = input('Please enter your Figshare API Key: ')

In [8]:
# Search constants
BASE_URL = 'https://api.figshare.com/v2'
HEADERS = {'Authorization': f'token {FIGSHARE_TOKEN}'}
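
To see how these constants combine into a request, the sketch below builds (but never sends) a search request using only the standard library. The notebook itself uses `requests.get` with a `params` dictionary; `urllib` is swapped in here purely so the example runs offline. The token is a hypothetical placeholder, and only a subset of the notebook's search parameters is shown.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = 'https://api.figshare.com/v2'
FIGSHARE_TOKEN = 'my-figshare-token'  # hypothetical placeholder
HEADERS = {'Authorization': f'token {FIGSHARE_TOKEN}'}

# A subset of the parameters the notebook passes to the search endpoint.
search_params = {'search_for': '"machine learning"', 'page': 1, 'page_size': 10}

# Encode the parameters into the query string of the articles endpoint.
search_url = f'{BASE_URL}/articles?{urlencode(search_params)}'

# Build the request object with the authorization header; it is not sent.
request = Request(search_url, headers=HEADERS)
```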

## Query #1: query API based on search terms and search types

In [9]:
def get_all_search_outputs(search_terms, search_types, flatten_output=False):
    """
    Call the Figshare API for each search term and search type. 
    Results are returned as results[(term, type)] = df
    
    Params:
    - search_terms (list-like): collection of search terms to query over
    - search_types (list-like): collection of search types to query over
    - flatten_output (bool): optional, (default=False)
    
    Returns:
    - results (dict): dictionary consisting of returned DataFrames from get_individual_search_output for each query
    """
    
    num_searches = len(search_terms) * len(search_types)
    results = OrderedDict()

    for search_term, search_type in itertools.product(search_terms, search_types):
        results[(search_term, search_type)] = get_individual_search_output(search_term, search_type, flatten_output)
        
    return results

In [10]:
def get_individual_search_output(search_term, search_type, flatten_output=False):
    """
    Calls the Figshare API with the specified search term and returns the search output results.
    
    Params:
    - search_term (str): keyword to search for
    - search_type (str): objects to search over (must be either articles or collections)
    - flatten_output (bool): optional (default=False)
   
    Returns:
    - df (pandas.DataFrame): DataFrame containing the output of the search query
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    assert search_type in ('articles', 'collections'), \
        'Search can only be conducted over articles and collections'
        
    # Set search variables
    start_page = 1
    page_size = 1000 # Maximum page size (min = 10)
    output = None
    search_df = pd.DataFrame()
    output_df = pd.DataFrame()  # Last page fetched; stays empty if the first query returns nothing
    search_year = 1950
    prev_date = None
    search_date = f'{search_year}-01-01'
    
    search_params = {
        'search_for': search_term,
        'published_since': search_date,
        'order_direction': 'asc',
        'page': start_page, 
        'page_size': page_size,  
        }
        
    search_url = f'{BASE_URL}/{search_type}'
    
    ## Run search for public articles
    response = requests.get(search_url, params=search_params, headers=HEADERS)

    ## Put output into json format
    output = response.json()

    while response.status_code == 200:
        while response.status_code == 200 and output:
            # Flatten output if needed
            if flatten_output:
                output = [flatten(result) for result in output]

            # Turn outputs into DataFrame & add page info
            output_df = pd.DataFrame(output)
            output_df['search_page'] = search_params['page']
            output_df['publish_query'] = search_params['published_since']

            # Append modified output df to our cumulative search DataFrame
            search_df = pd.concat([search_df, output_df]).reset_index(drop=True)

            # Increment page number to query
            search_params['page'] += 1

            ## Run search for public articles
            response = requests.get(search_url, params=search_params, headers=HEADERS)

            ## Put output into json format
            output = response.json()

        if output_df.shape[0] < search_params['page_size']:
            return search_df

        # Get new date to search
        search_date = search_df['published_date'].values[-1].split('T')[0]
        search_params['published_since'] = search_date
        search_params['page'] = start_page
        
        ## Run search for public articles
        response = requests.get(search_url, params=search_params, headers=HEADERS)

        ## Put output into json format
        output = response.json()
        
    return search_df
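
The function above pages through results and, once paging stops while the last page was still full, restarts from the most recent publication date seen. The sketch below reproduces that control flow against a small in-memory stand-in so it can be run offline. `fake_search`, the record set, and the `MAX_PAGES` cap are all assumptions made for illustration; the notebook does not state why paging stops early. Unlike the notebook function, the sketch de-duplicates by id, because restarting from an inclusive date re-fetches the boundary day.

```python
# Hypothetical in-memory stand-in for the API: 25 records, one per day.
RECORDS = [{'id': i, 'published_date': f'2020-01-{i:02d}T00:00:00Z'}
           for i in range(1, 26)]
PAGE_SIZE = 4   # the notebook uses 1000
MAX_PAGES = 3   # assumed cap on paging depth, purely for illustration


def fake_search(published_since, page):
    """Return one page of records on or after published_since ([] past the cap)."""
    if page > MAX_PAGES:
        return []
    matches = [r for r in RECORDS if r['published_date'][:10] >= published_since]
    start = (page - 1) * PAGE_SIZE
    return matches[start:start + PAGE_SIZE]


def collect_all(published_since='1950-01-01'):
    """Page until empty, then restart from the last date seen until a short page."""
    collected = {}
    while True:
        page, last_page = 1, []
        while True:
            batch = fake_search(published_since, page)
            if not batch:
                break
            for record in batch:
                collected[record['id']] = record  # de-duplicate by id
            last_page = batch
            page += 1
        # A short (or empty) last page means the results are exhausted.
        if len(last_page) < PAGE_SIZE:
            return list(collected.values())
        # Otherwise restart paging from the last publication date seen.
        published_since = last_page[-1]['published_date'][:10]


all_records = collect_all()
```

With these made-up numbers a single pass reaches only 12 of the 25 records, so the restart step is what recovers the rest.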

#### Run query #1 functions - example

In [11]:
search_terms = ['\"machine learning\" OR \"artificial intelligence\"', 
                '\"machine learning\"', 
                '\"artificial intelligence\"',
                '\"deep learning\"',
                '\"neural network\"',
                '\"supervised learning\"',
                '\"unsupervised learning\"',
                '\"reinforcement learning\"',
                '\"training data\"']
search_types = ['articles', 'collections']

In [12]:
search_output_dict = get_all_search_outputs(search_terms, search_types, flatten_output=True)

In [13]:
sample_key = (search_terms[0], search_types[0])
sample_df = search_output_dict[sample_key]

In [16]:
for key, df in search_output_dict.items():
    print('Key:', key, 'Num results:', df.shape[0])

Key: ('"machine learning" OR "artificial intelligence"', 'articles') Num results: 379
Key: ('"machine learning" OR "artificial intelligence"', 'collections') Num results: 157
Key: ('"machine learning"', 'articles') Num results: 22926
Key: ('"machine learning"', 'collections') Num results: 6569
Key: ('"artificial intelligence"', 'articles') Num results: 12423
Key: ('"artificial intelligence"', 'collections') Num results: 3536
Key: ('"deep learning"', 'articles') Num results: 7332
Key: ('"deep learning"', 'collections') Num results: 1905
Key: ('"neural network"', 'articles') Num results: 15445
Key: ('"neural network"', 'collections') Num results: 5826
Key: ('"supervised learning"', 'articles') Num results: 2162
Key: ('"supervised learning"', 'collections') Num results: 787
Key: ('"unsupervised learning"', 'articles') Num results: 1126
Key: ('"unsupervised learning"', 'collections') Num results: 437
Key: ('"reinforcement learning"', 'articles') Num results: 1826
Key: ('"reinforcement lear

## Query #2: query API for full metadata for hits from query #1

In [None]:
def _retrieve_object_json(object_url, flatten_output=False):
    '''
    Queries Figshare for object data (json file) & returns the json data as a dictionary
    
    Params:
    - object_url (str): URL of the object to retrieve
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - object_data_dict (dict): dictionary containing json data
    '''
    
    # Download the metadata
    response = requests.get(object_url, headers=HEADERS)
    json_data = response.json()
    
    # Flatten json
    if flatten_output:
        json_data = flatten(json_data)
    
    return json_data
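
For readers unfamiliar with flattening, `flatten` from `flatten_json` collapses nested dictionaries and lists into a single level, joining keys with an underscore by default. The stand-in below mimics that behaviour for illustration only; it is not the library's implementation, and the sample record is made up rather than taken from a real API response.

```python
def flatten_sketch(nested, parent_key='', separator='_'):
    """Illustrative stand-in for flatten_json.flatten (not the library code)."""
    if isinstance(nested, dict):
        pairs = nested.items()
    elif isinstance(nested, list):
        pairs = enumerate(nested)  # list positions become key segments
    else:
        return {parent_key: nested}  # leaf value: emit under the joined key

    items = {}
    for key, value in pairs:
        new_key = f'{parent_key}{separator}{key}' if parent_key else str(key)
        items.update(flatten_sketch(value, new_key, separator))
    return items


# Made-up record with one nested dictionary and one list of dictionaries.
record = {
    'id': 1,
    'timeline': {'posted': '2020-01-01'},
    'authors': [{'full_name': 'A. Author'}],
}
flat_record = flatten_sketch(record)
```

Flattened keys map directly onto dataframe columns, which is why the notebook exposes `flatten_output` on each function.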

In [None]:
def get_query_metadata(object_paths, flatten_output=False):
    """
    Retrieves the metadata for the object/objects listed in object_paths
    
    Params:
    - object_paths (str/list-like): string or list of strings containing the paths for the objects
    - flatten_output (bool): optional, (default=False)
    
    Returns:
    - metadata_df (pandas.DataFrame): DataFrame containing metadata for the requested objects
    """
    
    # If a single object path is provided as a string, wrap it in a list
    if isinstance(object_paths, str):
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object id'
    
    # Collect one record per object, then build the DataFrame once at the end
    records = []

    # For each path, get full object details
    for object_path in tqdm(object_paths):
        # URL syntax for object details is: https://api.figshare.com/v2/{search_type}/{object_id}
        json_data = _retrieve_object_json(object_path, flatten_output)

        # Each record becomes one row, which collapses the first level of the JSON
        # For now, files, custom fields, authors, etc. can stay as lists of dictionaries
        records.append(json_data)

    metadata_df = pd.DataFrame(records)

    return metadata_df

#### Run query #2 functions - example

In [None]:
def get_all_metadata(search_output_dict, flatten_output=False):
    """
    Retrieves all of the metadata that relates to the provided DataFrames
    
    Params:
    - search_output_dict : dict
        Dictionary of DataFrames from get_all_search_outputs
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output  
      
    Returns:
    - metadata_dict : collections.OrderedDict
        OrderedDict of DataFrames with metadata for each query
        Order matches the order of search_output_dict
    """
    
    ## Extract IDs from DataFrame, and returns as list of strings
    metadata_dict = OrderedDict()

    for query, df in search_output_dict.items():
        print(f'Retrieving {query} metadata')
        # Create object paths
        _, search_type = query
        object_ids = df.id.convert_dtypes(convert_string=True).tolist()
        object_paths = [f'{BASE_URL}/{search_type}/{object_id}' for object_id in object_ids]

        metadata_dict[query] = get_query_metadata(object_paths, flatten_output)
    
    return metadata_dict

In [None]:
metadata_dict = get_all_metadata(search_output_dict, flatten_output=True)

## Merge results of query #1 and query #2

In [None]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on=None, left_on=None, right_on=None, save=False):
    """
    Merges together all of the search and metadata DataFrames by the given 'on' key
    
    Params:
    - search_dict (dict): dictionary of search output results
    - metadata_dict (dict): dictionary of metadata results
    - on (str/list-like): column name(s) to merge the two dicts on
    - left_on (str/list-like): column name(s) to merge the left dict on
    - right_on (str/list-like): column name(s) to merge the right dict on
    - save=False, optional (bool/list-like): specifies if the output DataFrames should be saved
        If True: saves to file of format 'data/figshare/{search_term}_{search_type}.csv'
        If list-like: saves to respective location in list of save locations
            Must contain enough strings (one per query; len(search_terms) * len(search_types))
            
    Returns:
    - df_dict (OrderedDict): OrderedDict containing all of the merged search/metadata dicts
    """

    # Make sure the dictionaries contain the same searches
    assert search_dict.keys() == metadata_dict.keys(), 'Dictionaries must contain the same searches'
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except (TypeError, AssertionError):
        raise ValueError('Incorrect save value(s)')
        
    # Merge the DataFrames
    df_dict = OrderedDict()
    for (query_key, search_df), (query_key, metadata_df), save_loc in zip(search_dict.items(), 
                                                                          metadata_dict.items(), 
                                                                          save):
        # Keep just search info, id and timeline from initial extract 
        # Timeline only present in some search types
        columns_to_keep = ['id', 'search_page']
        
        if 'timeline' in search_df.columns:
            columns_to_keep.append('timeline')
            
        search_df = search_df[columns_to_keep]

        # Merge small version of "full" dataframe with "detailed" dataframe
        df_all = pd.merge(search_df, metadata_df, on=on, left_on=left_on, right_on=right_on, how='outer')
            
        # Save DataFrame
        if save_loc:
            data_dir = os.path.join('data', 'figshare')
            if isinstance(save_loc, str):
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Ensure figshare directory exists (creates parent directories as needed)
                os.makedirs(data_dir, exist_ok=True)
                
                search_term, search_type = query_key
                output_file = f'{search_term}_{search_type}.csv'
            else:
                raise ValueError(f'Save type must be bool or str, not {type(save_loc)}')

            df_all.to_csv(os.path.join(data_dir, output_file), index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict

#### Run merge function - example

In [None]:
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict, on='id')
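
The merge uses `how='outer'`, so an object that appears in only one of the two dataframes is kept, with missing values in the columns from the other side. A small sketch with made-up rows (the ids and titles are hypothetical) shows the effect:

```python
import pandas as pd

# Made-up stand-ins for one search dataframe and one metadata dataframe.
search_df = pd.DataFrame({'id': [1, 2, 3], 'search_page': [1, 1, 1]})
metadata_df = pd.DataFrame({'id': [2, 3, 4], 'title': ['B', 'C', 'D']})

# Outer join on 'id': ids 1 and 4 each appear on one side only.
merged = pd.merge(search_df, metadata_df, on='id', how='outer')

missing_titles = int(merged['title'].isna().sum())
missing_pages = int(merged['search_page'].isna().sum())
```

Counting the missing values after the merge is a quick way to check whether every search hit actually returned metadata.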

#### Final example output

In [None]:
#results of query #1
output_df = search_output_dict[sample_key]

#results of query #2
metadata_df = metadata_dict[sample_key]

#result of merging datasets into "full" dataframe
full_df = df_dict[sample_key]

In [None]:
full_df.head()