# Harvard Dataverse API

# Setup

## Instructions

This notebook uses the Harvard Dataverse API. Follow these steps to obtain the credentials it needs:

1. Create a Harvard Dataverse account at [Harvard Dataverse](https://dataverse.harvard.edu/dataverseuser.xhtml;jsessionid=797ccf2a28f987da3f1895ad81df?editMode=CREATE&redirectPage=%2Fdataverse_homepage.xhtml)
2. After logging in, click on the user dropdown menu in the top right corner, and click on 'API Token'
3. Click on 'Create Token' to receive API Token
4. Load API Token:
    - For repeated use, follow the ```pickle_tutorial.ipynb``` instructions to create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'DATAVERSE_TOKEN': MYKEY}```, with MYKEY being your API token.
    - For occasional use, run the credentials cell and paste your API token when prompted.
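
The `pickle_tutorial.ipynb` notebook is not reproduced here, so the following is only a minimal sketch of how such a credentials file could be written, assuming the dictionary layout described in step 4. The `save_credentials` helper name and the `'MYKEY'` placeholder are hypothetical.

```python
import pickle

def save_credentials(token, path='credentials.pkl'):
    """Store the API token in a pickled dictionary (hypothetical helper)."""
    with open(path, 'wb') as credentials:
        pickle.dump({'DATAVERSE_TOKEN': token}, credentials)

# Replace the placeholder with your own API token before running.
# Note that this overwrites any existing credentials.pkl in the working directory.
save_credentials('MYKEY')
```

Treat the resulting file like a password and keep it out of version control.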

## Additional Information

Documentation Guide:
- Dataverse API ([Dataverse](https://guides.dataverse.org/en/latest/user/index.html))
- Harvard Dataverse ([Harvard](https://dataverse.harvard.edu))

## Imports

In [1]:
import requests # For querying data from API
import pandas as pd # For storing/manipulating query data
from tqdm import tqdm # Gives status bar on loop completion
import itertools # For efficient looping over queries
from collections import OrderedDict
from flatten_json import flatten
import re
import os # For creating output directories and file paths

import pickle # For loading credentials

# Scraping imports (for metadata)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

from bs4 import BeautifulSoup # Parsing scraped webpage data

In [2]:
# Load credentials
try:
    with open('credentials.pkl', 'rb') as credentials:
        DATAVERSE_TOKEN = pickle.load(credentials)['DATAVERSE_TOKEN']
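except (OSError, KeyError, pickle.UnpicklingError):
    # Fall back to manual entry if the file is missing, unreadable, or lacks the key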
    DATAVERSE_TOKEN = input('Please enter your Dataverse API Key: ')

# Data Wrangling

## Setup

In [3]:
BASE_URL = 'https://dataverse.harvard.edu/api'
file_url = 'https://dataverse.harvard.edu/file.xhtml?fileId='
HEADERS = {'X-Dataverse-key': DATAVERSE_TOKEN}
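
To make the request format concrete, here is a small sketch of the URL that a search against `BASE_URL` resolves to, using the `q`, `type`, `start`, and `per_page` parameters that the search function below passes. The `build_search_url` helper is hypothetical and only for illustration; `requests` performs this encoding itself when given `params`.

```python
from urllib.parse import urlencode

BASE_URL = 'https://dataverse.harvard.edu/api'

def build_search_url(search_term, search_type, start=0, per_page=100):
    """Return the full search URL for one page of results (hypothetical helper)."""
    params = {'q': search_term, 'type': search_type, 'start': start, 'per_page': per_page}
    return f'{BASE_URL}/search?{urlencode(params)}'

print(build_search_url('"artificial intelligence"', 'dataset'))
# https://dataverse.harvard.edu/api/search?q=%22artificial+intelligence%22&type=dataset&start=0&per_page=100
```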

## Extracting

In [4]:
def get_all_search_outputs(search_terms, search_types, flatten_output=False):
    """Call the Dataverse API for each search term. 
    
    Results are returned as results[(search_term, search_type)] = df
    
    Parameters
    ----------
    search_terms : list-like
        collection of search terms to query over.
    search_types : list-like
        collection of objects to search over (must be either dataset or file).
    flatten_output : boolean, optional (default=False)
        Flag for specifying if nested output should be flattened.
    
    Returns
    -------
    results : dict
        dictionary consisting of returned DataFrames from get_search_output for each query.
    """

    results = OrderedDict()

    for search_term, search_type in tqdm(itertools.product(search_terms, search_types)):
        results[(search_term, search_type)] = get_individual_search_output(search_term, search_type, flatten_output)
        
    return results

In [5]:
def _convert_major_minor_version(row):
    major = int(row['majorVersion'])
    minor = int(row['minorVersion'])
    return float(f'{major}.{minor}')

In [6]:
def get_individual_search_output(search_term, search_type, flatten_output=False):
    """Calls the Dataverse API with the specified search term and returns the search output results.
    
    Parameters
    ----------
    search_term : str
        Term to query for.
    search_type : str
        Object type to search over (must be either dataset or file).
    flatten_output : boolean, optional (default=False)
        Flag for specifying if nested output should be flattened.
   
    Returns
    -------
    search_df : DataFrame
        DataFrame containing the output of the search query.
    """
    
    # Set search URL
    search_url = f'{BASE_URL}/search'
    
    # Make sure our input is valid
    assert isinstance(search_term, str), 'Search term must be a string'
    assert search_type in ('dataset', 'file'), 'Search can only be conducted over "dataset" or "file"'
    
    # Set search parameters
    start = 0
    page_size = 100
    search_df = pd.DataFrame()
    
    search_params = {
        'q': search_term,
        'per_page': page_size,
        'start': start,
        'type': search_type
    }
    
    # Conduct initial query, extract json results
    response = requests.get(search_url, params=search_params, headers=HEADERS)
    output = response.json()
    output = output['data']
    
    # Search until no more items are returned
    while output.get('items'):
        # Extract relevant output data
        output = output['items']
        
        # Flatten output if necessary
        if flatten_output:
            output = [flatten(result) for result in output]
        
        output_df = pd.DataFrame(output)
        output_df['page'] = search_params['start'] // search_params['per_page'] + 1
        
        search_df = pd.concat([search_df, output_df]).reset_index(drop=True)
        
        # Increment result offset to perform another search
        search_params['start'] += search_params['per_page']
        
        # Perform next search and convert results to json
        response = requests.get(search_url, params=search_params, headers=HEADERS)
        output = response.json()
        output = output['data']
    
    if 'majorVersion' in search_df.columns and 'minorVersion' in search_df.columns:
        # Drop null versions since version is required for metadata extraction
        search_df = search_df.dropna(subset = ('majorVersion', 'minorVersion'), how='any')
        # Add query-friendly dataset version column (for metadata extraction)
        search_df['version'] = search_df.apply(_convert_major_minor_version, axis=1)
    
    if search_type == 'file':
        search_df['url'] = search_df.apply(lambda x: f'{file_url}{x.file_id}', axis=1)

    return search_df
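
The paging logic above can be illustrated without network access. The sketch below replaces the API with a hypothetical `fake_search` stub that serves slices of an in-memory list; only the `start`/`per_page` loop and the page numbering mirror the function above.

```python
def fake_search(start, per_page, total=250):
    """Stand-in for the search endpoint (hypothetical): returns one page of items."""
    items = [{'id': i} for i in range(start, min(start + per_page, total))]
    return {'data': {'items': items}}

def fetch_all(per_page=100):
    """Collect items page by page until an empty page is returned."""
    start = 0
    collected = []
    page = fake_search(start, per_page)['data']
    while page.get('items'):
        for item in page['items']:
            # Record the 1-based page number, as the notebook does
            collected.append({**item, 'page': start // per_page + 1})
        # Advance the offset and request the next page
        start += per_page
        page = fake_search(start, per_page)['data']
    return collected

results = fetch_all()
print(len(results), results[-1])  # 250 {'id': 249, 'page': 3}
```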

### Run initial API query functions

In [7]:
search_terms = ['"artificial intelligence"']
search_types = ['dataset', 'file']

In [8]:
search_output_dict = get_all_search_outputs(search_terms, search_types, flatten_output=True)

2it [00:14,  7.31s/it]


#### Take a look at the results

In [9]:
sample_key = (search_terms[0], search_types[0])
sample_df = search_output_dict[sample_key]

In [10]:
sample_df

Unnamed: 0,name,type,url,global_id,description,published_at,publisher,citationHtml,identifier_of_dataverse,name_of_dataverse,...,publications_10_citation,publications_11_citation,publications_11_url,publications_12_citation,publications_13_citation,publications_13_url,publications_14_citation,publications_15_citation,page,version
0,Replication Data for: The Impact of Automation...,dataset,https://doi.org/10.7910/DVN/6TDHPF,doi:10.7910/DVN/6TDHPF,Discourse surrounding the future of work often...,2021-08-05T15:49:04Z,Harvard Dataverse,"Nazareno, Lu&iacute;sa; Schiff, Daniel, 2021, ...",harvard,Harvard Dataverse,...,,,,,,,,,1,1.0
1,"Appendix for ""U.S. Public Opinion on the Gover...",dataset,https://doi.org/10.7910/DVN/BREC5M,doi:10.7910/DVN/BREC5M,Appendix for Baobao Zhang and Allan Dafoe. 202...,2019-12-22T19:55:33Z,GovAI AI Public Opinion,"Baobao Zhang; Allan Dafoe, 2019, ""Appendix for...",govaipublicopinion,GovAI AI Public Opinion,...,,,,,,,,,1,1.0
2,Replication Data for: Artificial Intelligence:...,dataset,https://doi.org/10.7910/DVN/SGFRYA,doi:10.7910/DVN/SGFRYA,This repository contains the replication files...,2019-07-22T19:35:05Z,GovAI AI Public Opinion,"Zhang, Baobao; Dafoe, Allan, 2019, ""Replicatio...",govaipublicopinion,GovAI AI Public Opinion,...,,,,,,,,,1,1.0
3,Replication Data for: Assessing Public Value F...,dataset,https://doi.org/10.7910/DVN/LIGARA,doi:10.7910/DVN/LIGARA,In the context of rising delegation of adminis...,2021-05-06T15:33:51Z,Public Administration,"Schiff, Daniel; Schiff, Kaylyn Jackson; Pierso...",pa,Public Administration,...,,,,,,,,,1,2.0
4,Artificial Intelligence Reconstructs Missing S...,dataset,https://doi.org/10.7910/DVN/U6B15L,doi:10.7910/DVN/U6B15L,The retrodiction and prediction of solar activ...,2020-10-13T18:01:27Z,Harvard Dataverse,"Velasco Herrera, Victor Manuel, 2020, ""Artific...",harvard,Harvard Dataverse,...,,,,,,,,,1,1.0
5,"Replication Data for: ""Artificial intelligence...",dataset,https://doi.org/10.7910/DVN/EX4NG2,doi:10.7910/DVN/EX4NG2,"Replication Data for: ""Artificial intelligence...",2020-09-26T19:56:57Z,Nathaniel Hendrix's Dataverse,"Hendrix, Nathaniel, 2020, ""Replication Data fo...",nhendrix,Nathaniel Hendrix's Dataverse,...,,,,,,,,,1,3.0
6,Replication Data for: Artificial intelligence ...,dataset,https://doi.org/10.7910/DVN/UMJVWA,doi:10.7910/DVN/UMJVWA,This research explores an Artificial Intellige...,2020-12-20T19:05:28Z,Harvard Dataverse,"Nistal-Nu&ntilde;o, Beatriz, 2020, ""Replicatio...",harvard,Harvard Dataverse,...,,,,,,,,,1,1.0
7,"Replication data for ""Artificial intelligence ...",dataset,https://doi.org/10.7910/DVN/62BRBP,doi:10.7910/DVN/62BRBP,Image data set for the paper: https://doi.org/...,2021-05-31T09:56:14Z,Harvard Dataverse,"Dom&iacute;nguez-Rodrigo, Manuel, 2021, ""Repli...",harvard,Harvard Dataverse,...,,,,,,,,,1,1.0
8,The association between predicting of blood pr...,dataset,https://doi.org/10.7910/DVN/BXQ9TI,doi:10.7910/DVN/BXQ9TI,Artificial intelligence (AI) analysis was perf...,2021-04-17T01:34:39Z,Harvard Dataverse,"Qi, benling, 2021, ""The association between pr...",harvard,Harvard Dataverse,...,,,,,,,,,1,1.0
9,Replication Data for: Digital Tracks: Applicat...,dataset,https://doi.org/10.7910/DVN/CUFZKT,doi:10.7910/DVN/CUFZKT,Replication Data for: Digital Tracks: Applicat...,2021-07-14T21:15:54Z,Social Data Lab Dataverse,"Sotelo Docio, Susana; Benitez-Baleato, Jesus M...",socialdatalab,Social Data Lab Dataverse,...,,,,,,,,,1,1.2


## Metadata

In [11]:
dataset_metadata_attribute_paths = {
    'deposit_date': '#metadata_dateOfDeposit > td',
    'num_downloads': '#metrics-body > div'
}

file_metadata_attribute_paths = {
    'deposit_date': '#fileDepositDateBlock > td',
    'num_downloads': '#metrics-body > div'
}

In [12]:
chrome_options = Options()
chrome_options.add_argument('--headless')

In [13]:
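# Selenium 4 expects the driver path to be wrapped in a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)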



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [/Users/michaelbaluja/.wdm/drivers/chromedriver/mac64/92.0.4515.107/chromedriver] found in cache


In [14]:
def clean_results(results):
    """Cleans the results scraped from the page.
    
    Parameters
    ----------
    results : dict
    
    Returns
    -------
    results : dict
    """
    
    num_downloads = results.get('num_downloads')

    if num_downloads:
        results['num_downloads'] = re.findall(r'\d+', num_downloads)[0]
        
    return results
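
Note that the `\d+` pattern stops at the first non-digit, so a count rendered with a thousands separator would be truncated. The sketch below is one possible way to handle that, assuming the scraped text looks like `'1,234 Downloads'`. That text format and the `parse_download_count` name are assumptions, not taken from the Dataverse page.

```python
import re

def parse_download_count(text):
    """Return the first number in text as an int, or None if there is none."""
    match = re.search(r'\d[\d,]*', text or '')
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group().replace(',', ''))

print(parse_download_count('1,234 Downloads'))  # 1234
print(parse_download_count('61 Downloads'))     # 61
print(parse_download_count(None))               # None
```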

In [15]:
def get_attribute_value(soup, path):
    try:
        return soup.select_one(path).text
    except AttributeError:
        return None

In [16]:
def get_attribute_values(driver, **kwargs):
    """Returns attribute values for all relevant given attribute path dicts.
    
    Parameters
    ----------
    driver : WebDriver
        Selenium webdriver to use for html extraction.
    kwargs : dict, optional
        Attribute dicts to parse through. Accepts landing page, metadata, and terms dicts.
    
    Returns
    -------
    attribute_value_dict : dict
    """
    
    attribute_value_dict = dict()
    
    # Extract attribute path dicts
    landing_attribute_paths = kwargs.get('landing_attribute_paths')
    metadata_attribute_paths = kwargs.get('metadata_attribute_paths')
    terms_attribute_paths = kwargs.get('terms_attribute_paths')
    
    if landing_attribute_paths:
        # Retrieve html data and create parsable object
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        
        landing_attribute_values = {attribute: get_attribute_value(soup, path) 
                                    for attribute, path in landing_attribute_paths.items()}
        attribute_value_dict = {**attribute_value_dict, **landing_attribute_values}
    if metadata_attribute_paths:
        driver.find_element(By.LINK_TEXT, 'Metadata').click()
        
        # Retrieve html data and create parsable object
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        
        metadata_attribute_values = {attribute: get_attribute_value(soup, path) 
                                    for attribute, path in metadata_attribute_paths.items()}
        attribute_value_dict = {**attribute_value_dict, **metadata_attribute_values}
    if terms_attribute_paths:
        driver.find_element(By.LINK_TEXT, 'Terms').click()
        
        # Retrieve html data and create parsable object
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        
        terms_attribute_values = {attribute: get_attribute_value(soup, path) 
                                    for attribute, path in terms_attribute_paths.items()}
        attribute_value_dict = {**attribute_value_dict, **terms_attribute_values}
        
    return attribute_value_dict

In [17]:
def get_query_metadata(object_paths, driver, flatten_output=False, **kwargs):
    """Retrieves the dataset metadata for the object/objects listed in object_paths
    
    Parameters
    ----------
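    object_paths : str/list-like
        URL(s) of the dataset or file pages to scrape.
    driver : WebDriver
        Selenium webdriver used to load each page.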
    flatten_output : boolean, optional (default=False)
        Flag for specifying if nested output should be flattened.
    kwargs : dict, optional
        Additional parameters, including attribute path dictionaries.
    
    Returns
    -------
    metadata_df : DataFrame
    """
    
    # Accept a single URL as well as a collection of URLs
    if isinstance(object_paths, str):
        object_paths = [object_paths]
    
    records = []
    
    for object_path in tqdm(object_paths):
        # Retrieve webpage
        driver.get(object_path)
        
        # Extract & clean attribute values
        object_dict = get_attribute_values(driver, **kwargs)
        object_dict['url'] = object_path
        object_dict = clean_results(object_dict)
        
        records.append(object_dict)
    
    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    metadata_df = pd.DataFrame(records)
    
    return metadata_df

In [18]:
def get_all_metadata(search_output_dict, flatten_output=False):
    """Retrieves all of the metadata that relates to the provided DataFrames
    
    Parameters
    ----------
    search_output_dict : dict
        Dictionary of DataFrames from get_all_search_outputs.
    flatten_output : bool, optional (default=False)
        Flag for flattening nested columns of output.
      
    Returns
    -------
    metadata_dict : OrderedDict
        OrderedDict of DataFrames with metadata for each query.
        Order matches the order of search_output_dict.
    """
    
    metadata_dict = OrderedDict()

    for query, df in search_output_dict.items():
        search_term, search_type = query
        
        object_paths = df['url']
        
        metadata_dict[query] = get_query_metadata(object_paths, driver, flatten_output, **path_dict[search_type])
    
    return metadata_dict

In [19]:
path_dict = {
    'dataset': {
        'metadata_attribute_paths': dataset_metadata_attribute_paths
    },
    'file': {
        'landing_attribute_paths': file_metadata_attribute_paths
    }
}

### Get Metadata Results

In [20]:
metadata_dict = get_all_metadata(search_output_dict, flatten_output=True)

100%|██████████| 39/39 [01:49<00:00,  2.81s/it]
100%|██████████| 6/6 [00:06<00:00,  1.02s/it]


## Combine all results

In [21]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on=None, left_on=None, right_on=None, save=False):
    """Merges together all of the search and metadata DataFrames by the given 'on' key.
    
    Parameters
    ----------
    search_dict : dict
        Dictionary of search output results.
    metadata_dict : dict
        Dictionary of metadata results.
    on : str/list-like
        Column name(s) to merge the two dicts on.
    left_on : str/list-like
        Column name(s) to merge the left dict on.
    right_on : str/list-like
        Column name(s) to merge the right dict on.
    save : boolean, optional (default=False)
        Specifies if the output DataFrames should be saved.
        If True: saves to file of format 'data/dataverse/{search_term}_{search_type}.csv'.
        If list-like: saves to respective location in list of save locations.
            Must contain enough strings (one per query; len(search_terms) * len(search_types)).
            
    If the on/left_on/right_on values are not explicitly specified, behavior defaults to that of
    pandas.merge, which merges on the columns common to both DataFrames.
    
    Returns
    -------
    df_dict : OrderedDict
        OrderedDict containing all of the merged search/metadata dicts.
    """

    # Make sure the dictionaries contain the same searches
    assert search_dict.keys() == metadata_dict.keys(), 'Dictionaries must contain the same searches'
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except (TypeError, AssertionError):
        raise ValueError('Incorrect save value(s)')
        
    # Merge the DataFrames
    df_dict = OrderedDict()
    for (query_key, search_df), (query_key, metadata_df), save_loc in zip(search_dict.items(), 
                                                                          metadata_dict.items(), 
                                                                          save):

        # Merge small version of "full" dataframe with "detailed" dataframe
        df_all = pd.merge(search_df, metadata_df, on=on, left_on=left_on, right_on=right_on, how='outer')
            
        # Save DataFrame
        if save_loc:
            if isinstance(save_loc, str):
                # Use the caller-supplied location as given
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Ensure the dataverse directory exists
                data_dir = os.path.join('data', 'dataverse')
                os.makedirs(data_dir, exist_ok=True)
                
                search_term, search_type = query_key
                output_file = os.path.join(data_dir, f'{search_term}_{search_type}.csv')
            else:
                raise ValueError(f'Save type must be bool or str, not {type(save_loc)}')

            # Save the merged DataFrame, not just the search results
            df_all.to_csv(output_file, index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict
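
Because the default file name is built from the raw search term, a quoted term such as `"artificial intelligence"` ends up with double quotes and a space in the file name, which some filesystems reject. A minimal sketch of one way to sanitize it follows; the `safe_filename` helper is hypothetical and not part of the original notebook.

```python
import re

def safe_filename(search_term, search_type):
    """Build a CSV file name using only letters, digits, '-' and '_'."""
    # Collapse every run of other characters into a single underscore
    slug = re.sub(r'[^A-Za-z0-9_-]+', '_', search_term).strip('_')
    return f'{slug}_{search_type}.csv'

print(safe_filename('"artificial intelligence"', 'dataset'))
# artificial_intelligence_dataset.csv
```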

### Run merge function

In [22]:
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict)

In [23]:
df_dict[sample_key]

Unnamed: 0,name,type,url,global_id,description,published_at,publisher,citationHtml,identifier_of_dataverse,name_of_dataverse,...,publications_11_url,publications_12_citation,publications_13_citation,publications_13_url,publications_14_citation,publications_15_citation,page,version,deposit_date,num_downloads
0,Replication Data for: The Impact of Automation...,dataset,https://doi.org/10.7910/DVN/6TDHPF,doi:10.7910/DVN/6TDHPF,Discourse surrounding the future of work often...,2021-08-05T15:49:04Z,Harvard Dataverse,"Nazareno, Lu&iacute;sa; Schiff, Daniel, 2021, ...",harvard,Harvard Dataverse,...,,,,,,,1,1.0,2021-08-05,1
1,"Appendix for ""U.S. Public Opinion on the Gover...",dataset,https://doi.org/10.7910/DVN/BREC5M,doi:10.7910/DVN/BREC5M,Appendix for Baobao Zhang and Allan Dafoe. 202...,2019-12-22T19:55:33Z,GovAI AI Public Opinion,"Baobao Zhang; Allan Dafoe, 2019, ""Appendix for...",govaipublicopinion,GovAI AI Public Opinion,...,,,,,,,1,1.0,2019-12-22,61
2,Replication Data for: Artificial Intelligence:...,dataset,https://doi.org/10.7910/DVN/SGFRYA,doi:10.7910/DVN/SGFRYA,This repository contains the replication files...,2019-07-22T19:35:05Z,GovAI AI Public Opinion,"Zhang, Baobao; Dafoe, Allan, 2019, ""Replicatio...",govaipublicopinion,GovAI AI Public Opinion,...,,,,,,,1,1.0,2019-07-21,1
3,Replication Data for: Assessing Public Value F...,dataset,https://doi.org/10.7910/DVN/LIGARA,doi:10.7910/DVN/LIGARA,In the context of rising delegation of adminis...,2021-05-06T15:33:51Z,Public Administration,"Schiff, Daniel; Schiff, Kaylyn Jackson; Pierso...",pa,Public Administration,...,,,,,,,1,2.0,2021-04-09,70
4,Artificial Intelligence Reconstructs Missing S...,dataset,https://doi.org/10.7910/DVN/U6B15L,doi:10.7910/DVN/U6B15L,The retrodiction and prediction of solar activ...,2020-10-13T18:01:27Z,Harvard Dataverse,"Velasco Herrera, Victor Manuel, 2020, ""Artific...",harvard,Harvard Dataverse,...,,,,,,,1,1.0,2020-10-13,80
5,"Replication Data for: ""Artificial intelligence...",dataset,https://doi.org/10.7910/DVN/EX4NG2,doi:10.7910/DVN/EX4NG2,"Replication Data for: ""Artificial intelligence...",2020-09-26T19:56:57Z,Nathaniel Hendrix's Dataverse,"Hendrix, Nathaniel, 2020, ""Replication Data fo...",nhendrix,Nathaniel Hendrix's Dataverse,...,,,,,,,1,3.0,2020-07-16,211
6,Replication Data for: Artificial intelligence ...,dataset,https://doi.org/10.7910/DVN/UMJVWA,doi:10.7910/DVN/UMJVWA,This research explores an Artificial Intellige...,2020-12-20T19:05:28Z,Harvard Dataverse,"Nistal-Nu&ntilde;o, Beatriz, 2020, ""Replicatio...",harvard,Harvard Dataverse,...,,,,,,,1,1.0,2020-04-26,16
7,"Replication data for ""Artificial intelligence ...",dataset,https://doi.org/10.7910/DVN/62BRBP,doi:10.7910/DVN/62BRBP,Image data set for the paper: https://doi.org/...,2021-05-31T09:56:14Z,Harvard Dataverse,"Dom&iacute;nguez-Rodrigo, Manuel, 2021, ""Repli...",harvard,Harvard Dataverse,...,,,,,,,1,1.0,2021-05-31,0
8,The association between predicting of blood pr...,dataset,https://doi.org/10.7910/DVN/BXQ9TI,doi:10.7910/DVN/BXQ9TI,Artificial intelligence (AI) analysis was perf...,2021-04-17T01:34:39Z,Harvard Dataverse,"Qi, benling, 2021, ""The association between pr...",harvard,Harvard Dataverse,...,,,,,,,1,1.0,2021-04-16,3
9,Replication Data for: Digital Tracks: Applicat...,dataset,https://doi.org/10.7910/DVN/CUFZKT,doi:10.7910/DVN/CUFZKT,Replication Data for: Digital Tracks: Applicat...,2021-07-14T21:15:54Z,Social Data Lab Dataverse,"Sotelo Docio, Susana; Benitez-Baleato, Jesus M...",socialdatalab,Social Data Lab Dataverse,...,,,,,,,1,1.2,2021-07-14,1
