# Figshare API

# Setup

## Instructions

This notebook utilizes the Figshare API. Follow these steps in order to get the necessary credentials to continue:
1. Create a Figshare account at https://figshare.com/account/register
2. After logging in, click on your account photo in the top right corner, and then click on 'Applications'
3. Access API key either by:
    - Create an application by clicking on 'Create Application'
    - Create an API key by clicking on 'Create Personal Token'
4. Load API key:
    - For repeated use, follow the ```pickle_tutorial.ipynb``` instructions to create create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'OPENML_TOKEN': MYKEY}```, with MYKEY being your API key.
    - For sparser use, users can run the credentials cell and paste their API key when prompted.

## Additional Information

Documentation Guide:
- Figshare API ([Figshare](https://docs.figshare.com))

## Imports

In [29]:
import requests as rq # For querying data from API
import pandas as pd # For storing/manipulating query data
from tqdm import tqdm # Gives status bar on loop completion
import itertools # For efficient looping over queries
from collections import OrderedDict

# For loading credentials
import pickle
import os 

API access tokens have been stored in credentials.pkl file

In [34]:
# Load credentials

# Check for credentials file
try:
    with open('credentials.pkl', 'rb') as credentials:
        FIGSHARE_TOKEN = pickle.load(credentials)['FIGSHARE_TOKEN']
except:
    FIGSHARE_TOKEN = input('Please enter your Figshare API Key: ')

In [3]:
# Search constants
BASE_URL = 'https://api.figshare.com/v2/'
HEADERS = {'Authorization': f'token {FIGSHARE_TOKEN}'}

In [4]:
#Search seems to be all by type (articles, collections, etc.)
#All are public
#Plan is to search (1) articles, (2) collections, (3) projects
#Article search may be minimally useful, but if have linked DOIs in object, could be linked data

#Can set up to cycle through which search (articles, collections, data) and by page

Overall workflow: use main search to get IDs of matching objects, then use IDs to get full object details & associated files

# Data Wrangling

## Extracting Object ID's

In [5]:
def get_search_outputs(search_terms, search_types, save=False):
    """
    Call the Figshare API for each search term and search type. 
    Results are retured in results['{term}_{type}'] = df
    
    Params:
    - search_terms (list-like): collection of search terms to query over
    - search_types (list-like): collection of search types to query over
    
    Returns:
    - results (dict): dictionary consisting of returned DataFrames from get_search_output for each query
    """
    
    num_searches = len(search_terms) * len(search_types)
    results = OrderedDict()

    for search_term, search_type in itertools.product(search_terms, search_types):
        results[(search_term, search_type)] = get_search_output(search_term, search_type)
        
    return results

In [6]:
def get_search_output(search_term, search_type):
    """
    Calls the Figshare API with the specified search term and returns the search output results.
    
    Params:
    - search_term (str): keyword to seach for
    - search_type (str): objects to search over (must be either datasets or kernels)
   
    Returns:
    - df (pandas.DataFrame): DataFrame containing the output of the search query
    """
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    assert search_type in ('articles', 'collections', 'projects'), \
        'Search can only be conducted over articles, collections, or projects'
        
    # Set search variables
    start_page = 1
    page_size = 1000 # Maximum page size (min = 10)
    output = None
    search_df = pd.DataFrame()
    
    search_params = {
        'search_for': search_term,
        'page': start_page, 
        'page_size': page_size,  
        }
        
    search_url = f'{BASE_URL}/{search_type}'

    ## Run search for public articles
    response = rq.get(search_url, params=search_params, headers=HEADERS)

    ## Put output into json format
    output = response.json()
    
    # Continue searching until we reach an empty page
    while output != []:
        # Add output to our DataFrame
        search_df = search_df.append(output)
        
        # Add info about which search type and page this is associated with
        search_df['search_type'] = search_type
        search_df['search_page'] = search_params['page']
        
        # Increment page number to query
        search_params['page'] += 1

        ## Run search for public articles
        response = rq.get(search_url, params=search_params, headers=HEADERS)

        ## Put output into json format
        output = response.json()
        
    return search_df

### Perform Search

In [7]:
search_terms = ['iguana']
search_types = ['collections', 'projects']

In [8]:
search_output_dict = get_search_outputs(search_terms, search_types)

In [9]:
sample_key = (search_terms[0], search_types[0])
sample_df = search_output_dict[sample_key]

In [10]:
sample_df.head()

Unnamed: 0,id,title,doi,handle,url,published_date,timeline,search_type,search_page
0,3582815,Data from: Vascular patterns in iguanas and ot...,10.5061/dryad.27m63.2,,https://api.figshare.com/v2/collections/3582815,2016-11-25T19:47:11Z,{'posted': '2016-11-25T19:47:11'},collections,1
1,4596320,Data from: Vascular patterns in iguanas and ot...,10.5061/dryad.27m63.1,,https://api.figshare.com/v2/collections/4596320,2019-07-30T16:44:02Z,{'posted': '2019-07-30T16:44:02'},collections,1
2,4755440,Data from: Vascular patterns in iguanas and ot...,10.5061/dryad.27m63,,https://api.figshare.com/v2/collections/4755440,2019-11-26T08:07:43Z,{'posted': '2019-11-26T08:07:43'},collections,1
3,5234804,First known trace fossil of a nesting iguana (...,10.1371/journal.pone.0242935,,https://api.figshare.com/v2/collections/5234804,2020-12-09T18:32:22Z,{'posted': '2020-12-09T18:32:22'},collections,1
4,5311858,Systemic <i>Helicobacter</i> infection and ass...,10.1371/journal.pone.0247010,,https://api.figshare.com/v2/collections/5311858,2021-02-19T18:33:04Z,{'posted': '2021-02-19T18:33:04'},collections,1


## Get Metadata

In [11]:
def _retrieve_object_json(object_url):
    '''
    Queries Figshare for object data (json file) & returns the json data as a dictionary
    
    Params:
    - object_url (str): path for the dataset
    
    Returns:
    - object_data_dict (dict): dictionary containing json data
    '''
    
    # Download the metadata
    response = rq.get(object_url, headers=HEADERS)
    json_data = response.json()
    
    return json_data

In [12]:
def get_metadata(object_paths):
    """
    Retrieves the metadata for the object/objects listed in object_paths
    
    Params:
    - object_paths (str/list-like): string or list of strings containing the paths for the objects
    
    Returns:
    - metadata_df (pandas.DataFrame): DataFrame containing metadata for the requested objects
    """
    
    # If a singular search term is provided as a string, need to wrap it in a list
    if type(object_paths) == str:
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object id'
    
    #create empty pandas dataframe to put results in
    metadata_df = pd.DataFrame()

    #for each path, get full object details
    for object_path in tqdm(object_paths):
        #URL syntax for object details is: https://api.figshare.com/v2/{search_type}/{object_id}        
        json_data = _retrieve_object_json(object_path)
        
        #appending json collapses first level, which is a start
        #for now, can leave files, custom fields, author, etc as list of dictionary
        metadata_df = metadata_df.append(json_data, ignore_index=True)

        
    return metadata_df

### Perform Metadata Extraction

In [13]:
## Extract IDs from DataFrame, and returns as list of strings
metadata_dict = OrderedDict()

for query, df in search_output_dict.items():
    # Create object paths
    _, search_type = query
    object_ids = df.id.convert_dtypes(convert_string=True).tolist()
    object_paths = [f'{BASE_URL}/{search_type}/{object_id}' for object_id in object_ids]
    
    metadata_dict[query] = get_metadata(object_paths)

100%|██████████| 41/41 [00:36<00:00,  1.13it/s]
100%|██████████| 2/2 [00:01<00:00,  1.00it/s]


## Combining Results

In [26]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on, save=False):
    """
    Merges together all of the search and metadata DataFrames by the given 'on' key
    
    Params:
    - search_dict (dict): dictionary of search output results
    - metadata_dict (dict): dictionary of metadata results
    - on (str/list-like): column name(s) to merge the two dicts on
    - save=False, optional (bool/list-like): specifies if the output DataFrames should be saved
        If True: saves to file of format 'data/figshare/figshare_{search_term}_{search_type}.csv'
        If list-like: saves to respective location in list of save locations
            Must contain enough strings (one per query; len(search_terms) * len(search_types))
            
    Returns:
    - df_dict (OrderedDict): OrderedDict containing all of the merged search/metadata dicts
    """

    # Make sure the dictionaries contain the same searches
    assert search_dict.keys() == metadata_dict.keys(), 'Dictionaries must contain the same searches'
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except:
        raise ValueError('Incorrect save value(s)')

    # Ensure the on variable is proper
    try:
        assert len(on) == 2 or isinstance(on, str)
        if (len(on) == 2) and (not isinstance(on, str)):
            left_on, right_on = on
            on = None
    except:
        raise ValueError('Incorrect value of "on" passed')
        
    # Merge the DataFrames
    df_dict = OrderedDict()
    for (query_key, search_df), (query_key, metadata_df), save_loc in zip(search_dict.items(), 
                                                                          metadata_dict.items(), 
                                                                          save):
        # Keep just search info, id and timeline from initial extract 
        # Timeline only present in some search types
        columns_to_keep = ['id', 'search_type', 'search_page']
        
        if 'timeline' in search_df.columns:
            columns_to_keep.append('timeline')
            
        search_df = search_df[columns_to_keep]

        #Merge small version of "full" dataframe with "detailed" dataframe
        if on: # only one value to merge on
            df_all = pd.merge(search_df, metadata_df, on=on, how='inner')
        else:
            df_all = pd.merge(search_df, metadata_df, left_on=left_on, right_on=right_on)
            
        # Save DataFrame
        if save_loc:
            data_dir = os.path.join('data', 'figshare')
            if isinstance(save_loc, str):
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Ensure figshare directory is already created
                if not os.path.isdir(data_dir):
                    os.path.mkdir(data_dir)
                
                search_term, search_type = query_key
                output_file = f'{search_term}_{search_type}.csv'
            else:
                raise ValueError(f'Save type must be bool or str, not {type(save_loc)}')

            search_df.to_csv(os.path.join(data_dir, output_file), index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict

In [27]:
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict, on='id')

In [28]:
df_dict[('iguana', 'collections')]

Unnamed: 0,id,search_type,search_page,articles_count,authors,categories,citation,created_date,custom_fields,description,...,resource_doi,resource_id,resource_link,resource_title,resource_version,tags,timeline,title,url,version
0,3582815,collections,1,6.0,"[{'id': 813288, 'full_name': 'William Ruger Po...","[{'id': 1, 'title': 'Biophysics', 'parent_id':...","Ruger Porter, William; M. Witmer, Lawrence (20...",2016-11-25T19:47:11Z,"[{'name': 'dwc.ScientificName', 'value': 'Igua...",Squamates use the circulatory system to regula...,...,10.1371/journal.pone.0139215,10.5061/dryad.27m63.2,,"Porter WR, Witmer LM (2015) Vascular patterns ...",0.0,"[vasculature, diapsid, iguana, blood vessels, ...","{'posted': '2016-11-25T19:47:11', 'publisherPu...",Data from: Vascular patterns in iguanas and ot...,https://api.figshare.com/v2/collections/3582815,1.0
1,4596320,collections,1,5.0,"[{'id': 813288, 'full_name': 'William Ruger Po...","[{'id': 1, 'title': 'Biophysics', 'parent_id':...","Ruger Porter, William; M. Witmer, Lawrence (20...",2019-07-30T16:44:02Z,"[{'name': 'dc.type.embargo', 'value': ''}, {'n...",Squamates use the circulatory system to regula...,...,10.1371/journal.pone.0139215,10.5061/dryad.27m63.1,,"Porter WR, Witmer LM (2015) Vascular patterns ...",0.0,"[vasculature, diapsid, iguana, blood vessels, ...","{'posted': '2019-07-30T16:44:02', 'publisherPu...",Data from: Vascular patterns in iguanas and ot...,https://api.figshare.com/v2/collections/4596320,1.0
2,4755440,collections,1,6.0,"[{'id': 7850966, 'full_name': 'William Ruger P...","[{'id': 1, 'title': 'Biophysics', 'parent_id':...","Porter, William Ruger; Witmer, Lawrence M. (20...",2019-11-26T08:07:43Z,"[{'name': 'dc.type.embargo', 'value': ''}, {'n...",Squamates use the circulatory system to regula...,...,10.1371/journal.pone.0139215,10.5061/dryad.27m63,,"Porter WR, Witmer LM (2015) Vascular patterns ...",0.0,"[vasculature, diapsid, iguana, blood vessels, ...","{'posted': '2019-11-26T08:07:43', 'publisherPu...",Data from: Vascular patterns in iguanas and ot...,https://api.figshare.com/v2/collections/4755440,1.0
3,5234804,collections,1,10.0,"[{'id': 9749630, 'full_name': 'Anthony J. Mart...","[{'id': 24, 'title': 'Evolutionary Biology', '...","Martin, Anthony J.; Stearns, Dorothy; Whitten,...",2020-12-09T18:32:22Z,[],<div><p>Most species of modern iguanas (Iguani...,...,10.1371/journal.pone.0242935,10.1371/journal.pone.0242935,,First known trace fossil of a nesting iguana (...,0.0,"[San Salvador, cross-bedded oolitic eolianite,...","{'posted': '2020-12-09T18:32:22', 'publisherPu...",First known trace fossil of a nesting iguana (...,https://api.figshare.com/v2/collections/5234804,1.0
4,5311858,collections,1,11.0,"[{'id': 10169338, 'full_name': 'Kenneth J. Con...","[{'id': 7, 'title': 'Medicine', 'parent_id': 4...","Conley, Kenneth J.; Seimon, Tracie A.; Popescu...",2021-02-19T18:33:04Z,[],<div><p>The Blue Iguana Recovery Programme mai...,...,10.1371/journal.pone.0247010,10.1371/journal.pone.0247010,,Systemic <i>Helicobacter</i> infection and ass...,0.0,"[GCBI 1, iguana, Queen Elizabeth II Botanic Pa...","{'posted': '2021-02-19T18:33:04', 'publisherPu...",Systemic <i>Helicobacter</i> infection and ass...,https://api.figshare.com/v2/collections/5311858,1.0
5,1631183,collections,1,6.0,"[{'id': 158053, 'full_name': 'Seiji Wada', 'is...","[{'id': 16, 'title': 'Physiology', 'parent_id'...","Wada, Seiji; Kawano-Yamashita, Emi; Koyanagi, ...",2015-12-04T13:52:52Z,[],<div><p>The pineal-related organs of lower ver...,...,10.1371/journal.pone.0039003,,http://dx.plos.org/10.1371/journal.pone.0039003,Expression of UV-Sensitive Parapinopsin in the...,0.0,"[uv-sensitive, parapinopsin, iguana, parietal,...","{'posted': '2012-06-14T01:04:51', 'revision': ...",Expression of UV-Sensitive Parapinopsin in the...,https://api.figshare.com/v2/collections/1631183,1.0
6,5180264,collections,1,1.0,"[{'id': 7485488, 'full_name': 'Tassika Koomgun...","[{'id': 1064, 'title': 'Gene and Molecular The...","Koomgun, Tassika; Laopichienpong, Nararat; Sin...",2020-10-20T04:28:01Z,[],<p>The majority of lizards classified in the s...,...,10.3389/fgene.2020.556267,556267,https://www.frontiersin.org/articles/10.3389/f...,Genome Complexity Reduction High-Throughput Ge...,0.0,"[open access, DArTseqTM, Iguanoidea, SNP, supe...","{'posted': '2020-10-20T04:28:01', 'revision': ...",Genome Complexity Reduction High-Throughput Ge...,https://api.figshare.com/v2/collections/5180264,1.0
7,4700840,collections,1,5.0,"[{'id': 7507544, 'full_name': 'Gregory A. Lewb...","[{'id': 39, 'title': 'Ecology', 'parent_id': 3...","Lewbart, Gregory A.; Grijalva, Colon J.; Calle...",2019-10-16T17:35:00Z,[],"<div><p>The land iguanas, <i>Conolophus pallid...",...,10.1371/journal.pone.0222884,10.1371/journal.pone.0222884,http://dx.plos.org/10.1371/journal.pone.0222884,Health assessment of <i>Conolophus subcristatu...,1.0,"[subcristatus X Amblyrhynchus cristatus, South...","{'posted': '2019-10-16T17:35:00', 'publisherPu...",Health assessment of <i>Conolophus subcristatu...,https://api.figshare.com/v2/collections/4700840,1.0
8,1752338,collections,1,9.0,"[{'id': 306022, 'full_name': 'Xinping Wang', '...","[{'id': 4, 'title': 'Biochemistry', 'parent_id...","Wang, Xinping; Deng, Xuliang; Zhang, Xichen (2...",2015-12-04T14:12:59Z,[],<p>Detection of amelogenin expression in iguan...,...,10.1371/journal.pone.0045871,,http://dx.plos.org/10.1371/journal.pone.0045871,Identification of a Novel Splicing Form of Ame...,0.0,"[amelogenin, iguana]","{'posted': '2013-02-19T22:48:24', 'revision': ...",Identification of a Novel Splicing Form of Ame...,https://api.figshare.com/v2/collections/1752338,1.0
9,3543345,collections,1,8.0,"[{'id': 3277482, 'full_name': 'Amy MacLeod', '...","[{'id': 13, 'title': 'Genetics', 'parent_id': ...","MacLeod, Amy; Rodríguez, Ariel; Vences, Miguel...",2016-10-28T12:29:53Z,"[{'name': 'dwc.ScientificName', 'value': 'Ambl...",The effects of the direct interaction between ...,...,10.1098/rspb.2015.0425,10.5061/dryad.pp6bm,,"MacLeod A, Rodríguez A, Vences M, Orozco-terWe...",0.0,"[speciation, hybridization, islands]","{'posted': '2016-10-28T12:29:53', 'publisherPu...",Data from: Hybridization masks speciation in t...,https://api.figshare.com/v2/collections/3543345,1.0


## Code that loops through search terms and objects

### also an issue with projects objects that stops code

In [None]:
search_type = 'collections'
search_term = 'iguana'
search_url = f'https://api.figshare.com/v2/{search_type}'

PARAMS = {
        'search_for': search_term, #search term
        'page': 1, 
        'page_size': 10,  
        }

response = rq.get(search_url, params=PARAMS, headers=HEADERS)

In [None]:
#FIGSHARE_TOKEN imported from credentials.pkl


#List of search terms
SEARCH_TERMS = ['iguana']

#List of Figshare object types to search
SEARCH_TYPES = ['articles', 'collections', 'projects']

# Set dummy json output for page loop
output = None

###### LOOP 1 - loop through search terms ######
for search_term in SEARCH_TERMS:
    print('Searching:', search_term)

    #Specify page to return with search (results are paginated) - TO DO: FIX HARD CODING
    PAGE = 1
    
    #Specify number of results included on a page (default is 10, max is 1000) - TO DO: FIX HARD CODING
    PAGE_SIZE = 1000

    #Set params term
    PARAMS = {
        'search_for': search_term, #search term
        'page': PAGE, 
        'page_size': PAGE_SIZE,  
        }
    
    ###### LOOP 2 - loop through object types ######

    for search_type in SEARCH_TYPES:
        print(f'\tSearching over: {search_type}')
        
        URL_j = f'https://api.figshare.com/v2/{search_type}'

        ## Run search for public articles
        response = rq.get(URL_j, params=PARAMS, headers=HEADERS)

        ## Put output into json format
        output = response.json()
        
        # Continue searching until we reach an empty page
        while output != []:
            #Convert output to pd dataframe and see table format
            df_full = pd.DataFrame(output)

            ## Extract IDs
            full_ids = list(df_full.id)

            ####### LOOP 3 - loop to to extract object  details by object ID ######

            #create empty pandas dataframe to put results in
            df_detailed = pd.DataFrame()

            #for each ID, get full object details
            for i in full_ids:
                id_i = str(i)
                #URL syntax for object details is: https://api.figshare.com/v2/{search_type}/{object_id}
                URL_i = f'{URL_j}/{id_i}'
                response_i = rq.get(URL_i, headers=HEADERS)
                output_i = response_i.json()
                #appending json collapses first level, which is a start
                #for now, can leave files, custom fields, author, etc as list of dictionary
                df_detailed = df_detailed.append(output_i, ignore_index=True)

                #Keep just id and timeline from initial extract (all others are in detailed extract)
                #'timeline' is NOT in projects, so...do we really need it?
                df_small = df_full[['id','timeline']]

                #Add info about which search type and page this is associated with
                df_detailed['Search_type'] = search_type
                df_detailed['search_page'] = PARAMS['page']

                #Merge small version of "full" dataframe with "detailed" dataframe
                df_all = pd.merge(df_small, df_detailed, on='id', how='inner')

            #Write to csv
            output_file = f'Figshare_{search_term}_{search_type}.csv'
            df_all.to_csv(output_file, index=False)
            
            # Increment page number to query
            PARAMS['page'] += 1
            
            ## Run search for public articles
            response = rq.get(URL_j, params=PARAMS, headers=HEADERS)
            
            ## Put output into json format
            output = response.json()

    print(f'Finished {search_term} search')

In [None]:
df_all.head()

## OLD CODE
#### Specify search parameters

In [None]:
#FIGSHARE_TOKEN imported from credentials.pkl
HEADERS = {'Authorization': 'token '+ FIGSHARE_TOKEN}

#Specify search terms
SEARCH = 'machine learning'

#turn into list for loop: SEARCH = ['machine learning'] 
#with whatever search terms we decide to use

#Specify which search (collections, articles, projects, data)
ARTICLE_URL = 'https://api.figshare.com/v2/articles'
COLLECTIONS_URL = "https://api.figshare.com/v2/collections"
PROJECTS_URL = "https://api.figshare.com/v2/projects"

#turn into list for loop: SEARCH_TYPES = ['articles', 'collections', 'projects']
#for j in SEARCH_TYPES:
    #URL_j = "https://api.figshare.com/v2/" + SEARCH_TYPES_j
    
#article search should include datasets
#Only return articles with the respective type. Mapping for item_type is: 
#1 - Figure, 2 - Media, 3 - Dataset, 5 - Poster, 6 - Journal contribution, 7 - Presentation, 
#8 - Thesis, 9 - Software, 11 - Online resource, 12 - Preprint, 13 - Book, 14 - Conference contribution, 
#15 - Chapter, 16 - Peer review, 17 - Educational resource, 18 - Report, 19 - Standard, 20 - Composition, 
#21 - Funding, 22 - Physical object, 23 - Data management plan, 24 - Workflow, 
#25 - Monograph, 26 - Performance, 27 - Event, 28 - Service, 29 - Model

#should make a df with this and merge to get more informative item_type column in final output

#Specify page to return with search (results are paginated)
PAGE = 1

#Specify number of results included on a page (default is 10, max is 1000)
PAGE_SIZE = 10

## Could iterate through pages until get response: {'message': 'Bad Request', 'code': 'BadRequest'} 
## Seems like there should be a better way

#Specify page and page size parameters
#Other search options are available, including limit and offset, but at the moment page/page size seem most useful
    #if set both page/page size and limit/offset, get:
    #{'message': 'Pagination options can be set either via page/page_size or limit/offset params','code': 'ConflictingPaginationOptions'}
    
    #'limit': 1000, #Number of results included on a page. Used for pagination with query (optional) - not sure how differs from page_size
    #'offset': 1 #, #Where to start the listing(the offset of the first result). Used for pagination with limit (optional)
#We don't need to specify other search parameters like institution, group, modified since, etc.

#Full search parameters
PARAMS = {
    'search_for': SEARCH, #search term
    'page': PAGE, 
    'page_size': PAGE_SIZE,  
    }

## Draft workflow: search public articles

In [None]:
## Run search for public articles
response = rq.get(ARTICLE_URL, params=PARAMS, headers=HEADERS)

## Put output into json format
output = response.json()

#Convert output to pd dataframe and see table format
df_full = pd.DataFrame(output)

In [None]:
## See what output looks like
output

In [None]:
## See what df looks like
df_full

In [None]:
## Extract IDs
full_ids = list(df_full.id)

In [None]:
## Loop to extract article details by object ID

#URL syntax is: https://api.figshare.com/v2/articles/{article_id}

#create empty pandas dataframe
df_detailed = pd.DataFrame()

#for each ID, 
for i in full_ids:
    id_i = str(i)
    URL_i = 'https://api.figshare.com/v2/articles/' + id_i
    #print(URL_i)
    response_i = rq.get(URL_i, headers=HEADERS)
    json_i = response_i.json()
    #json_normalize collapses first level, which is a start
    #for now, can leave files, custom fields, author, etc as list of dictionary
    df_i = pd.json_normalize(json_i)
    df_detailed = df_detailed.append(df_i)

In [None]:
df_detailed

### Compare high level metadata extract with detailed extract

In [None]:
## Column names in the "full" initial API call
df_full.columns

In [None]:
## Column names in the "detailed" API call
df_detailed.columns

In [None]:
## What do these have in common?
overlap = list(set(df_full).intersection(set(df_detailed)))
overlap

In [None]:
## Are they the same?
sorted(df_full.columns) == sorted(overlap)

In [None]:
## So which ones are in one but not the other?
sorted(df_full.columns) 

In [None]:
sorted(overlap)

In [None]:
# timeline is not in detailed extract, but it's the only one that's not the same

### Combine full and detailed API extracts by object ID

In [None]:
## Keep just id and timeline from initial extract (all others are in detailed extract)
## CHECK IF TIMELINE VAR IS IN COLLECTIONS AND PROJECT OBJECTS
df_small = df_full[['id','timeline']]

df_small

In [None]:
## Merge small version of "full" dataframe with "detailed" dataframe
df_all = pd.merge(df_small, df_detailed, on = "id", how = "inner")

df_all

In [None]:
## List files associated with public articles (by ID)

#URL syntax is: https://api.figshare.com/v2/articles/{article_id}/files

#looks like files are already pulled from article details, so don't need a separate call - all info should already be there
#at some point go back and confirm that there's no new info in files API call