# Setup

NOTE: A few of the changes made (global variables being referenced from inside functions) were done in order to ease the transition to object oriented design without having to change any of the function structures

## Instructions

This notebook utilizes the OpenML API. Follow these steps in order to get the necessary credentials to continue (additional information is available at the OpenML documentation under "Additional Information" below):

1. Create an OpenML account at https://www.openml.org/register
2. After logging in, open your account page (click the avatar on the top right)
3. Open 'Account Settings', then 'API authentication' to find your API key

There are multiple ways of authenticating. Any of the following will work for this notebook:

Temporarily:
- When prompted below (if none of the following methods are completed), enter your API key in the text box.
    - This method is the easiest, but must be repeated every time the notebook is loaded.

Permanently:
- Following the pickle_tutorial.ipynb instructions, create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'OPENML_TOKEN': MYKEY}```, with MYKEY being your API key.
- Use the openml CLI tool with ```openml configure apikey MYKEY```, with MYKEY being your API key.
- Create a plain text file ```~/.openml/config``` that contains the line ```apikey=MYKEY```, with MYKEY being your API key. 

## Additional Information

Documentation Guide:
- OpenML API ([OpenML](https://docs.openml.org/Python-start/))
- OpenML API ([GitHub](https://github.com/openml/openml-python)) 

Issues:
- When importing arff exceptions, they may not be found. If this is the case, uninstall arff and install liac-arff
- Datasets and Tasks are slow to iterate over after ~100-120 queries. Shouldn't have anything to do with setup since the loop over query id's is the same as the API code w/ added error handling

## Imports

In [1]:
# Import openml, installing if necessary
try:
    import openml
except ImportError as e:
    !pip3 install openml
    import openml

import pandas as pd # For storing/manipulating query data
import pickle # For loading credentials
import warnings # For warning users who do things they shouldn't
import os # For loading credentials
from tqdm import tqdm # Gives status bar on loop completion
from itertools import product # Used for iterating over nested for loops faster
from flatten_json import json
from utils import flatten_nested_df
from collections import OrderedDict

In [2]:
# Load credentials

# Check if config file or CLI variable already set key value
try:
    assert openml.config.apikey != ''
except AssertionError:
    # Check for credentials file
    if os.path.exists('credentials.pkl'):
        with open('credentials.pkl', 'rb') as credentials:
            openml.config.apikey = pickle.load(credentials)['OPENML_TOKEN']
    else:
        openml.config.apikey = input('Please enter your OpenML API Key: ')

## Helper Functions

In [3]:
def _get_value_attributes(obj):
    """
    Given an object, returns a list of the object's value-based variables
    
    Params:
    - obj : list-like 
        object to be analyzed 
    
    Returns:
    - attributes : list
        value-based variables for the object given
    """  
    
    # This code will pull all of the attributes of the provided class that are not callable or "private" 
    # for the class. 
    attributes = [attr for attr in dir(obj) if 
                           not hasattr(getattr(obj, attr), '__call__')
                           and not attr.startswith('_')]
    
    return attributes

In [4]:
def _get_evaluations_search_output(flatten_output=False):
    # Get different evaluation measures we can search for
    evaluations_measures = openml.evaluations.list_evaluation_measures()
    
    # Create DataFrame to store attributes
    evaluations_df = pd.DataFrame()

    # Get evaluation data for each available measure
    for measure in tqdm(evaluations_measures):
        # Query all data for a given evaluation measure
        evaluations_dict = openml.evaluations.list_evaluations(measure, size=size_limit)

        try:
            # Grab one of the evaluations in order to extract attributes
            sample_evaluation = next(iter(evaluations_dict.items()))[1]
        # StopIteration will occur in the preceding code if an evaluation search returns no results for a given measure
        except StopIteration:
            continue

        # Get list of attributes the evaluation offers
        evaluations_attributes = _get_value_attributes(sample_evaluation) 

        # Adds the queried data to the DataFrame
        for query in evaluations_dict.values():
            attribute_dict = {attribute: getattr(query, attribute) for attribute in evaluations_attributes}
            evaluations_df = evaluations_df.append(attribute_dict, ignore_index=True)

        evaluations_df = flatten_nested_df(evaluations_df)
        
    return evaluations_df

In [5]:
def get_all_search_outputs(search_types, flatten_output=False):
    """
    Call the OpenML API for each search type. 
    Results are retured in results['({type},)'] = df
    
    Params:
    - search_types : list-like 
        collection of search types to query over
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output
    
    Returns:
    - results : dict
        dictionary consisting of returned DataFrames from get_search_output for each query
    """
    
    results = OrderedDict()

    for search_type in search_types:
        results[(search_type,)] = get_individual_search_output(search_type, flatten_output)
        
    return results

In [6]:
def get_individual_search_output(search_type, flatten_output=False):
    """
    Calls the OpenML API with the specified search term and returns the search output results.
    
    Params:
    - search_type : str
        Must be in ('conferences', 'datasets', 'evaluations', 'papers', 'tasks')
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output
   
    Returns:
    - query_df : pandas.DataFrame
        DataFrame containing the output of the search query
    """
    # Ensure proper instance type is passed in
    try:
        assert search_type in ('datasets', 'runs', 'tasks', 'evaluations')
    except AssertionError:
        raise ValueError(f'\'{search_type}\' is not a valid instance type')
    
    # Handle special case for evaluations
    if search_type == 'evaluations':
        return _get_evaluations_search_output(flatten_output)
    
    # Use query type to get necessary openml api functions
    base_command = getattr(openml, search_type)
    list_queries = getattr(base_command, f'list_{search_type}')

    # Get base information about every object listed on OpenML for the given query type
    query_dict = list_queries(size=size_limit)
    query_df = pd.DataFrame(query_dict).transpose().reset_index(drop=True)
    
    # Flatten the nested DataFrame
    if flatten_output:
        query_df = flatten_nested_df(query_df)
    
    return query_df

# Retrieve Data

In [7]:
# For testing purposes, we set the following "small"-scale range over which collections to search
size_limit = 25

In [8]:
search_types = ['datasets', 'runs', 'tasks']

In [9]:
search_output_dict = get_all_search_outputs(search_types, flatten_output=True)

## Get Metadata

In [10]:
def get_query_metadata(object_paths, search_type, flatten_output=False):
    """
    Retrieves the metadata for the object/objects listed in object_paths
    
    Params:
    - object_paths : str/list-like
    - search_type : str
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output
    
    Returns:
    - metadata_df : pandas.DataFrame
    """
    
    # If a singular search term is provided as a string, need to wrap it in a list
    if type(object_paths) == str:
        object_paths = [object_paths]
    
    # Make sure our input is valid
    assert len(object_paths) > 0, 'Please enter at least one object id'
    
    base_command = getattr(openml, search_type)
    get_query = getattr(base_command, f'get_{search_type[:-1:]}')
    
    # Request each query
    queries = []
    error_queries = []
    for object_path in tqdm(object_paths):
        try:
            queries.append(get_query(object_path))
        except:
            error_queries.append(object_path)
    
    
    # Get list of attributes the queries offer
    query_attributes = _get_value_attributes(queries[0])

    # Create DataFrame to store attributes
    query_attribute_df = pd.DataFrame(columns=query_attributes)

    # Append attributes of each dataset to the DataFrame
    for query in tqdm(queries):
        attribute_dict = {attribute: getattr(query, attribute) for attribute in query_attributes}
        query_attribute_df = query_attribute_df.append(attribute_dict, ignore_index=True)
        
    # Flatten the nested DataFrame
    if flatten_output:
        query_attribute_df = flatten_nested_df(query_attribute_df)

    return query_attribute_df

In [11]:
def get_all_metadata(search_output_dict, flatten_output=False):
    """
    Retrieves all of the metadata that relates to the provided DataFrames
    
    Params:
    - search_output_dict : dict
        Dictionary of DataFrames from get_all_search_outputs
    - flatten_output : bool, optional (default=False)
        flag for flattening nested columns of output  
      
    Returns:
    - metadata_dict : collections.OrderedDict
        OrderedDict of DataFrames with metadata for each query
        Order matches the order of search_output_dict
    """

    metadata_dict = OrderedDict()

    for query, df in search_output_dict.items():
        print(f'Retrieving {query} metadata')

        # Get ID name
        search_type = query[0]

        if search_type == 'datasets':
            id_name = 'did'
        elif search_type == 'runs':
            id_name = 'run_id'
        elif search_type == 'tasks':
            id_name = 'tid'

        # Grab the object paths as the id's from the DataFrame
        object_paths = df[id_name].values

        metadata_dict[query] = get_query_metadata(object_paths, search_type, flatten_output)
        
    return metadata_dict

In [12]:
metadata_dict = get_all_metadata(search_output_dict, flatten_output=True)

  0%|          | 0/25 [00:00<?, ?it/s]

Retrieving ('datasets',) metadata


100%|██████████| 25/25 [00:08<00:00,  2.94it/s]
100%|██████████| 25/25 [00:00<00:00, 199.69it/s]
100%|██████████| 25/25 [00:00<00:00, 451.83it/s]
100%|██████████| 24/24 [00:00<00:00, 241.89it/s]

Retrieving ('runs',) metadata



  0%|          | 0/25 [00:00<?, ?it/s]

Retrieving ('tasks',) metadata


100%|██████████| 25/25 [00:08<00:00,  2.94it/s]
100%|██████████| 25/25 [00:00<00:00, 352.84it/s]


## Combining Results

In [13]:
def merge_search_and_metadata_dicts(search_dict, metadata_dict, on=None, left_on=None, right_on=None, save=False):
    """
    Merges together all of the search and metadata DataFrames by the given 'on' key
    
    Params:
    - search_dict : dict
        dictionary of search output results
    - metadata_dict : dict
        dictionary of metadata results
    - on : str/list-like
        column name(s) to merge the two dicts on
    - left_on : str/list-like
        column name(s) to merge the left dict on
    - right_on : str/list-like
        column name(s) to merge the right dict on
    - save : bool, optional (default=False)
        specifies if the output DataFrames should be saved
        If True: saves to file of format 'data/kaggle/kaggle_{search_term}_{search_type}.csv'
        If list-like: saves to respective location in list of save locations
            Must contain enough strings (one per query; len(search_terms) * len(search_types))
            
    If the on/left_on/right_on values are not explicitely specified, behavior defaults to what is done
    in the pandas documentation
    
    Returns:
    - df_dict : OrderedDict
        OrderedDict containing all of the merged search/metadata dicts
    """

    # Make sure the dictionaries contain the same searches
    assert search_dict.keys() == metadata_dict.keys(), 'Dictionaries must contain the same searches'
    
    num_dataframes = len(search_dict)
    
    # Ensure the save variable data is proper
    try:
        if isinstance(save, bool):
            save = [save] * num_dataframes
        assert len(save) == num_dataframes
    except:
        raise ValueError('Incorrect save value(s)')

    # Merge the DataFrames
    df_dict = OrderedDict()
    for (query_key, search_df), (query_key, metadata_df), save_loc in zip(search_dict.items(), 
                                                                          metadata_dict.items(), 
                                                                          save):

        # Merge small version of "full" dataframe with "detailed" dataframe
        df_all = pd.merge(search_df, metadata_df, on=on, left_on=left_on, right_on=right_on, how='outer')
            
        # Save DataFrame
        if save_loc:
            data_dir = os.path.join('data', 'openml')
            if isinstance(save_loc, str):
                output_file = save_loc
            elif isinstance(save_loc, bool):
                # Ensure kaggle directory is already created
                if not os.path.isdir(data_dir):
                    os.path.mkdir(data_dir)

                search_type = query_key[0]
                output_file = f'{search_type}.csv'
            else:
                raise ValueError('Save type must be bool or str')

            search_df.to_csv(os.path.join(data_dir, output_file), index=False)
        
        df_dict[query_key] = df_all
    
    return df_dict

In [14]:
df_dict = merge_search_and_metadata_dicts(search_output_dict, metadata_dict)

In [15]:
# Add evaluations data (doesn't have metadata so had to be handled separately)
df_dict[('evaluations',)] = get_individual_search_output('evaluations', flatten_output=True)

100%|██████████| 71/71 [01:05<00:00,  1.09it/s]


In [16]:
df_dict[('evaluations',)]

Unnamed: 0,array_data,data_id,data_name,flow_id,flow_name,function,run_id,setup_id,task_id,upload_time,uploader,uploader_name,value,values
0,"[0.93111,0.999975,0.994856,0.0,1,0.990326]",1.0,anneal,76.0,weka.Bagging_REPTree(1),area_under_roc_curve,62.0,17.0,1.0,2014-04-06 23:57:45,1.0,Jan van Rijn,0.995034,
1,"[0.730267,0.998862,0.976922,0.0,1,0.978059]",1.0,anneal,59.0,weka.JRip(1),area_under_roc_curve,237.0,4.0,1.0,2014-04-07 01:34:48,1.0,Jan van Rijn,0.978916,
2,"[0.973736,0.998217,0.990664,0.0,1,0.991929]",1.0,anneal,67.0,weka.BayesNet_K2(1),area_under_roc_curve,359.0,12.0,1.0,2014-04-07 04:08:17,1.0,Jan van Rijn,0.992099,
3,"[0.936728,0.999975,0.998962,0.0,1,0.999009]",1.0,anneal,65.0,weka.RandomForest(1),area_under_roc_curve,413.0,10.0,1.0,2014-04-07 04:35:45,1.0,Jan van Rijn,0.998598,
4,"[0.874438,0.999368,0.997455,0.0,1,0.999446]",1.0,anneal,74.0,weka.Logistic(1),area_under_roc_curve,500.0,15.0,1.0,2014-04-07 06:52:21,1.0,Jan van Rijn,0.996849,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
920,"[0.245111,0.252288,0.999395]",40670.0,dna,16347.0,sklearn.pipeline.Pipeline(simpleimputer=sklear...,unweighted_recall,10423855.0,8255697.0,167140.0,2019-12-05 03:03:03,8323.0,Heinrich Peters,0.498931,
921,"[0.950456,0.951634,0.962515]",40670.0,dna,16347.0,sklearn.pipeline.Pipeline(simpleimputer=sklear...,unweighted_recall,10423856.0,8255805.0,167140.0,2019-12-05 03:06:05,8323.0,Heinrich Peters,0.954868,
922,"[0.954368,0.951634,0.964933]",40670.0,dna,16347.0,sklearn.pipeline.Pipeline(simpleimputer=sklear...,unweighted_recall,10423857.0,8255887.0,167140.0,2019-12-05 03:09:54,8323.0,Heinrich Peters,0.956978,
923,"[0,0,0,0,0,1]",1478.0,har,16345.0,sklearn.pipeline.Pipeline(simpleimputer=sklear...,unweighted_recall,10423858.0,8255841.0,14970.0,2019-12-05 03:14:02,8323.0,Heinrich Peters,0.166667,
