# Figshare API

# Setup

## Instructions

This notebook uses the Figshare API. Follow these steps to get the necessary credentials:
1. Create a Figshare account at https://figshare.com/account/register
2. After logging in, click on your account photo in the top right corner, and then click on 'Applications'
3. Get an API key by either:
    - Creating an application by clicking on 'Create Application'
    - Creating an API key by clicking on 'Create Personal Token'
4. Load the API key:
    - For repeated use, follow the ```pickle_tutorial.ipynb``` instructions to create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'FIGSHARE_TOKEN': MYKEY}```, with MYKEY being your API key.
    - For occasional use, run the credentials cell and paste your API key when prompted.
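
For reference, a credentials file of the shape described in step 4 could be written with a short script like the one below. This is only a sketch: `save_credentials` and `load_token` are hypothetical helper names, not functions from `pickle_tutorial.ipynb`, and the default path simply mirrors the `./credentials.pkl` file mentioned above.

```python
import pickle


def save_credentials(token, path='credentials.pkl'):
    """Write a dictionary holding the Figshare token to a pickle file (hypothetical helper)."""
    credentials = {'FIGSHARE_TOKEN': token}
    with open(path, 'wb') as f:
        pickle.dump(credentials, f)


def load_token(path='credentials.pkl'):
    """Read the Figshare token back from the pickle file (hypothetical helper)."""
    with open(path, 'rb') as f:
        return pickle.load(f)['FIGSHARE_TOKEN']
```

Calling `save_credentials('MYKEY')` once, with MYKEY replaced by your API key, should produce a file that the credentials cell in this notebook can read.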

## Additional Information

Documentation Guide:
- Figshare API ([Figshare](https://docs.figshare.com))

## Workflow (largely the same as the Kaggle workflow)

Figshare allows a variety of search types. For this script, we search: (1) articles, (2) collections, and (3) projects.

"Articles" in this context include a variety of item types: 1 - Figure, 2 - Media, 3 - Dataset, 5 - Poster, 6 - Journal contribution, 7 - Presentation, 8 - Thesis, 9 - Software, 11 - Online resource, 12 - Preprint, 13 - Book, 14 - Conference contribution, 15 - Chapter, 16 - Peer review, 17 - Educational resource, 18 - Report, 19 - Standard, 20 - Composition, 21 - Funding, 22 - Physical object, 23 - Data management plan, 24 - Workflow, 25 - Monograph, 26 - Performance, 27 - Event, 28 - Service, 29 - Model
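
When inspecting article results, a small lookup table can turn these numeric codes into readable names. The sketch below only transcribes the list above; `ITEM_TYPES` and `item_type_name` are hypothetical names introduced for illustration and are not part of the Figshare API.

```python
# Item type IDs and names, transcribed from the list above.
# IDs 4 and 10 do not appear in that list, so they are left out here.
ITEM_TYPES = {
    1: 'Figure', 2: 'Media', 3: 'Dataset', 5: 'Poster',
    6: 'Journal contribution', 7: 'Presentation', 8: 'Thesis',
    9: 'Software', 11: 'Online resource', 12: 'Preprint', 13: 'Book',
    14: 'Conference contribution', 15: 'Chapter', 16: 'Peer review',
    17: 'Educational resource', 18: 'Report', 19: 'Standard',
    20: 'Composition', 21: 'Funding', 22: 'Physical object',
    23: 'Data management plan', 24: 'Workflow', 25: 'Monograph',
    26: 'Performance', 27: 'Event', 28: 'Service', 29: 'Model',
}


def item_type_name(type_id):
    """Return the item type name for an ID, or 'Unknown' if the ID is not in the table."""
    return ITEM_TYPES.get(type_id, 'Unknown')
```

For example, `item_type_name(3)` looks up the name for datasets, and an unlisted ID falls back to `'Unknown'` instead of raising an error.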

Figshare workflow:
1.	Specify search terms in a list
2.	Specify search types in a list
    - Search can only be conducted over articles, collections, or projects
3.	The function `get_individual_search_output` calls the Figshare API for each combination of search terms (e.g., "machine learning") and search types (i.e., articles, collections, projects)
    - Results are paginated, so for each search term/search type combination, results from every page are retrieved and combined into a single data frame
    - Results returned are high level only (id, title, doi, URL, etc.) – full object metadata is retrieved downstream
4.	Function `get_individual_search_output` is called within `get_all_search_outputs`
    - Returns ordered dict
5.	Use object-level URL from `get_individual_search_output` results to get full metadata for each object:
    - Function `_retrieve_object_json` calls Figshare API for each URL and returns JSON response
    - Function `get_metadata` actually calls `_retrieve_object_json` to do the search and formats the results as a dataframe 
6.	Not quite sure what the code under “perform metadata extraction” is doing
7.	Running `merge_search_and_metadata_dicts` returns `KeyError: "['search_type'] not in index"`
8.	Everything below “code that loops through search terms and objects” is old and can be deleted
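
Steps 1 through 4 of the workflow can be sketched without touching the network. The example below is an illustration only, not the notebook's implementation: `collect_pages`, `search_all`, and `fake_fetcher` are hypothetical names, and the fake fetcher stands in for the real API call by serving two pages of made-up results followed by an empty page.

```python
import itertools
from collections import OrderedDict


def collect_pages(fetch_page, start_page=1):
    """Request pages until an empty one comes back; return all results in one list.

    fetch_page is a hypothetical callable that takes a page number and
    returns a list of result dictionaries for that page.
    """
    results = []
    page = start_page
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        # Tag each result with the page it came from
        results.extend({**item, 'search_page': page} for item in batch)
        page += 1
    return results


def search_all(search_terms, search_types, make_fetcher):
    """Run one paginated search per (term, type) pair, keyed by that pair."""
    outputs = OrderedDict()
    for term, search_type in itertools.product(search_terms, search_types):
        outputs[(term, search_type)] = collect_pages(make_fetcher(term, search_type))
    return outputs


def fake_fetcher(term, search_type):
    """Stand-in for the API: two pages of made-up results, then empty pages."""
    pages = {1: [{'id': 1}, {'id': 2}], 2: [{'id': 3}]}
    return lambda page: pages.get(page, [])


outputs = search_all(['machine learning'], ['articles', 'projects'], fake_fetcher)
print(list(outputs.keys()))
print(len(outputs[('machine learning', 'articles')]))
```

Passing the fetcher in as an argument keeps the paging logic separate from the HTTP call, which makes the loop easy to check against fake data before pointing it at the real API.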


## Imports

In [1]:
import requests # For querying data from API
import pandas as pd # For storing/manipulating query data
from tqdm import tqdm # Gives status bar on loop completion
import itertools # For efficient looping over queries
from collections import OrderedDict
from flatten_json import flatten

# For loading credentials
import pickle
import os 

API access tokens have been stored in credentials.pkl file

In [2]:
# Load credentials

# Check for credentials file
try:
    with open('credentials.pkl', 'rb') as credentials:
        FIGSHARE_TOKEN = pickle.load(credentials)['FIGSHARE_TOKEN']
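except (FileNotFoundError, KeyError):
    # Fall back to manual entry if the file or the token entry is missing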
    FIGSHARE_TOKEN = input('Please enter your Figshare API Key: ')

In [3]:
# Search constants
BASE_URL = 'https://api.figshare.com/v2'  # No trailing slash; URLs are built as f'{BASE_URL}/{search_type}'
HEADERS = {'Authorization': f'token {FIGSHARE_TOKEN}'}

Overall workflow:
1. Extract object IDs for each combination of search terms (e.g., "machine learning") and search types (i.e., articles, collections, projects)

# Data Wrangling

## Extracting Object ID's

In [4]:
#needs name that is more different from above function

def get_all_search_outputs(search_terms, search_types, flatten_output=False):
    """
    Call the Figshare API for each search term and search type.
    Results are returned in results[(search_term, search_type)] = df
    
    Params:
    - search_terms (list-like): collection of search terms to query over
    - search_types (list-like): collection of search types to query over
    - flatten_output (bool): optional, (default=False)
    
    Returns:
    - results (OrderedDict): dictionary consisting of returned DataFrames from get_individual_search_output for each query
    """
    
    num_searches = len(search_terms) * len(search_types)
    results = OrderedDict()

    for search_term, search_type in itertools.product(search_terms, search_types):
        results[(search_term, search_type)] = get_individual_search_output(search_term, search_type, flatten_output)
        
    return results

In [5]:
def get_individual_search_output(search_term, search_type, flatten_output=False):
    """
    Calls the Figshare API with the specified search term and returns the search output results.
    
    Params:
    - search_term (str): keyword to search for
    - search_type (str): objects to search over (must be articles, collections, or projects)
    - flatten_output (bool): optional (default=False)
   
    Returns:
    - df (pandas.DataFrame): DataFrame containing the output of the search query
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    assert search_type in ('articles', 'collections', 'projects'), \
        'Search can only be conducted over articles, collections, or projects'
        
    # Set search variables
    start_page = 1
    page_size = 1000 # Maximum page size (min = 10)
    output = None
    search_df = pd.DataFrame()
    
    search_params = {
        'search_for': search_term,
        'page': start_page, 
        'page_size': page_size,  
        }
        
    search_url = f'{BASE_URL}/{search_type}'

    ## Run search for public articles
    response = requests.get(search_url, params=search_params, headers=HEADERS)

    ## Put output into json format
    output = response.json()
    
    # Continue searching until we reach an empty page
    while output != []:
        # Flatten output if needed
        if flatten_output:
            output = [flatten(result) for result in output]
        
        # Turn outputs into DataFrame & add page info
        output_df = pd.DataFrame(output)
        output_df['search_page'] = search_params['page']
        
        # Append modified output df to our cumulative search DataFrame
        search_df = pd.concat([search_df, output_df])
        
        # Increment page number to query
        search_params['page'] += 1

        ## Run search for public articles
        response = requests.get(search_url, params=search_params, headers=HEADERS)

        ## Put output into json format
        output = response.json()
    
    return search_df

### Perform Search

In [6]:
search_terms = ['iguana']
search_types = ['collections', 'projects', 'articles']

In [7]:
search_output_dict = get_all_search_outputs(search_terms, search_types, flatten_output=True)

In [8]:
sample_key = (search_terms[0], search_types[0])
sample_df = search_output_dict[sample_key]

In [9]:
sample_df.head()

Unnamed: 0,id,title,doi,handle,url,published_date,timeline_posted,search_page
0,3582815,Data from: Vascular patterns in iguanas and ot...,10.5061/dryad.27m63.2,,https://api.figshare.com/v2/collections/3582815,2016-11-25T19:47:11Z,2016-11-25T19:47:11,1
1,4755440,Data from: Vascular patterns in iguanas and ot...,10.5061/dryad.27m63,,https://api.figshare.com/v2/collections/4755440,2019-11-26T08:07:43Z,2019-11-26T08:07:43,1
2,4596320,Data from: Vascular patterns in iguanas and ot...,10.5061/dryad.27m63.1,,https://api.figshare.com/v2/collections/4596320,2019-07-30T16:44:02Z,2019-07-30T16:44:02,1
3,5234804,First known trace fossil of a nesting iguana (...,10.1371/journal.pone.0242935,,https://api.figshare.com/v2/collections/5234804,2020-12-09T18:32:22Z,2020-12-09T18:32:22,1
4,5311858,Systemic <i>Helicobacter</i> infection and ass...,10.1371/journal.pone.0247010,,https://api.figshare.com/v2/collections/5311858,2021-02-19T18:33:04Z,2021-02-19T18:33:04,1


## Get Metadata

In [10]:
def _retrieve_object_json(object_url, flatten_output=False):
    '''
    Queries Figshare for object data (json file) & returns the json data as a dictionary
    
    Params:
    - object_url (str): path for the dataset
    - flatten_output (bool): optional (default=False)
    
    Returns:
    - object_data_dict (dict): dictionary containing json data
    '''
    
    # Download the metadata
    response = requests.get(object_url, headers=HEADERS)
    json_data = response.json()
    
    # Flatten json
    if flatten_output:
        json_data = flatten(json_data)
    
    return json_data

In [11]:
def get_metadata(object_paths, flatten_output=False):
    """
    Retrieves the metadata for the object/objects listed in object_paths
    
    Params:
    - object_paths (str/list-like): string or list of strings containing the paths for the objects
    - flatten_output (bool): optional, (default=False)
    
    Returns:
    - metadata_df (pandas.DataFrame): DataFrame containing metadata for the requested objects
    """
    
    # If a single object path is provided as a string, wrap it in a list
    if isinstance(object_paths, str):
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object path'
    
    # Collect one metadata record (dict) per object
    records = []

    # For each path, get full object details
    for object_path in tqdm(object_paths):
        # URL syntax for object details is: https://api.figshare.com/v2/{search_type}/{object_id}
        json_data = _retrieve_object_json(object_path, flatten_output)
        
        # Each JSON record becomes one row of the final DataFrame
        # If not flattened, files, custom fields, authors, etc. stay as lists of dictionaries
        records.append(json_data)
        
    # DataFrame.append was removed in pandas 2.0, so build the DataFrame in one step
    metadata_df = pd.DataFrame(records)
        
    return metadata_df

### Perform Metadata Extraction

In [12]:
## Extract IDs from DataFrame, and returns as list of strings
metadata_dict = OrderedDict()

for query, df in search_output_dict.items():
    print(f'Retrieving {query} metadata')
    # Create object paths
    _, search_type = query
    object_ids = df.id.convert_dtypes(convert_string=True).tolist()
    object_paths = [f'{BASE_URL}/{search_type}/{object_id}' for object_id in object_ids]
    
    metadata_dict[query] = get_metadata(object_paths, flatten_output=True)

  0%|          | 0/41 [00:00<?, ?it/s]

Retrieving ('iguana', 'collections') metadata


100%|██████████| 41/41 [00:50<00:00,  1.23s/it]
  0%|          | 0/2 [00:00<?, ?it/s]

Retrieving ('iguana', 'projects') metadata


100%|██████████| 2/2 [00:02<00:00,  1.01s/it]
  0%|          | 0/152 [00:00<?, ?it/s]

Retrieving ('iguana', 'articles') metadata


100%|██████████| 152/152 [03:05<00:00,  1.22s/it]


## Combining Results

In [18]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on=None, left_on=None, right_on=None, save=False):
    """
    Merges together all of the search and metadata DataFrames by the given 'on' key
    
    Params:
    - search_dict (dict): dictionary of search output results
    - metadata_dict (dict): dictionary of metadata results
    - on (str/list-like): column name(s) to merge the two dicts on
    - left_on (str/list-like): column name(s) to merge the left dict on
    - right_on (str/list-like): column name(s) to merge the right dict on
    - save=False, optional (bool/list-like): specifies if the output DataFrames should be saved
        If True: saves to file of format 'data/figshare/{search_term}_{search_type}.csv'
        If list-like: saves to respective location in list of save locations
            Must contain enough strings (one per query; len(search_terms) * len(search_types))
            
    Returns:
    - df_dict (OrderedDict): OrderedDict containing all of the merged search/metadata dicts
    """

    # Make sure the dictionaries contain the same searches
    assert search_dict.keys() == metadata_dict.keys(), 'Dictionaries must contain the same searches'
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except (AssertionError, TypeError):
        raise ValueError('Incorrect save value(s)')
        
    # Merge the DataFrames
    df_dict = OrderedDict()
    for (query_key, search_df), (query_key, metadata_df), save_loc in zip(search_dict.items(), 
                                                                          metadata_dict.items(), 
                                                                          save):
        # Keep just search info, id and timeline from initial extract 
        # Timeline only present in some search types
        columns_to_keep = ['id', 'search_page']
        
        if 'timeline' in search_df.columns:
            columns_to_keep.append('timeline')
            
        search_df = search_df[columns_to_keep]

        # Merge small version of "full" dataframe with "detailed" dataframe
        df_all = pd.merge(search_df, metadata_df, on=on, left_on=left_on, right_on=right_on, how='outer')
            
        # Save DataFrame
        if save_loc:
            if isinstance(save_loc, str):
                # Caller supplied an explicit output path
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Ensure the figshare directory exists before writing
                data_dir = os.path.join('data', 'figshare')
                os.makedirs(data_dir, exist_ok=True)
                
                search_term, search_type = query_key
                output_file = os.path.join(data_dir, f'{search_term}_{search_type}.csv')
            else:
                raise ValueError(f'Save type must be bool or str, not {type(save_loc)}')

            # Save the merged DataFrame (search info plus metadata)
            df_all.to_csv(output_file, index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict

In [19]:
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict, on='id')

In [20]:
#check out the output

In [21]:
full_df = df_dict[('iguana', 'articles')]

In [22]:
full_df.head()

Unnamed: 0,id,search_page,authors_0_full_name,authors_0_id,authors_0_is_active,authors_0_orcid_id,authors_0_url_name,authors_1_full_name,authors_1_id,authors_1_is_active,...,categories_12_parent_id,categories_12_title,categories_13_id,categories_13_parent_id,categories_13_title,tags_27,tags_28,tags_29,tags_30,tags_31
0,9172331,1,William Ruger Porter,813288.0,0.0,,_,Lawrence M. Witmer,36585.0,0.0,...,,,,,,,,,,
1,9172307,1,William Ruger Porter,813288.0,0.0,,_,Lawrence M. Witmer,36585.0,0.0,...,,,,,,,,,,
2,9172310,1,William Ruger Porter,813288.0,0.0,,_,Lawrence M. Witmer,36585.0,0.0,...,,,,,,,,,,
3,9172322,1,William Ruger Porter,813288.0,0.0,,_,Lawrence M. Witmer,36585.0,0.0,...,,,,,,,,,,
4,9172319,1,William Ruger Porter,813288.0,0.0,,_,Lawrence M. Witmer,36585.0,0.0,...,,,,,,,,,,
