# Setup

## Instructions

This notebook uses the OpenML API. Follow these steps to obtain the credentials it needs (see the OpenML documentation under "Additional Information" below for more detail):

1. Create an OpenML account at https://www.openml.org/register
2. After logging in, open your account page (click the avatar on the top right)
3. Open 'Account Settings', then 'API authentication' to find your API key

There are multiple ways of authenticating. Any of the following will work for this notebook:

Temporarily:
- When prompted below, enter your API key in the text box. The prompt appears only if none of the permanent methods has been set up.
    - This method is the easiest, but it must be repeated every time the notebook is loaded.

Permanently:
- Following the pickle_tutorial.ipynb instructions, create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'OPENML_TOKEN': MYKEY}```, with MYKEY being your API key.
- Use the openml CLI tool with ```openml configure apikey MYKEY```, with MYKEY being your API key.
- Create a plain text file ```~/.openml/config``` that contains the line ```apikey=MYKEY```, with MYKEY being your API key. 
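
As a rough sketch of the first permanent option, the credentials file can be written with the standard `pickle` module. The helper names `save_credentials` and `load_credentials` are illustrative assumptions and are not taken from pickle_tutorial.ipynb; only the `OPENML_TOKEN` entry and the `credentials.pkl` file name come from the instructions above.

```python
import pickle


def save_credentials(api_key, path="credentials.pkl"):
    """Write a dictionary holding the API key to a pickle file (hypothetical helper)."""
    with open(path, "wb") as credentials_file:
        pickle.dump({"OPENML_TOKEN": api_key}, credentials_file)


def load_credentials(path="credentials.pkl"):
    """Read the API key back from the pickle file (hypothetical helper)."""
    with open(path, "rb") as credentials_file:
        return pickle.load(credentials_file)["OPENML_TOKEN"]
```

Calling `save_credentials("MYKEY")` once, with MYKEY replaced by your API key, should leave a `credentials.pkl` file in the working directory. Keep that file out of version control, since it stores the key in recoverable form.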

## Additional Information

Documentation Guide:
- OpenML API ([OpenML](https://docs.openml.org/Python-start/))
- OpenML API ([GitHub](https://github.com/openml/openml-python)) 

Issues:
- Importing exceptions from arff may fail because the names are not found. If so, uninstall the arff package and install liac-arff instead.
- Datasets and tasks become slow to iterate over after roughly 100-120 queries. This is unlikely to be caused by the setup, since the loop over query IDs is the same as in the API code, with added error handling.

## Imports

In [None]:
# Import openml, installing if necessary
try:
    import openml
except ImportError:
    !pip3 install openml
    import openml

import pandas as pd # For storing/manipulating query data
import pickle # For loading credentials
import warnings # For warning users who do things they shouldn't
import os # For loading credentials
from tqdm import tqdm # Gives status bar on loop completion
from itertools import product # Used for iterating over nested for loops faster

In [None]:
# Load credentials

# Use the key from the config file or CLI if one is already set
if not openml.config.apikey:
    # Otherwise fall back to the credentials file, if present
    if os.path.exists('credentials.pkl'):
        with open('credentials.pkl', 'rb') as credentials:
            openml.config.apikey = pickle.load(credentials)['OPENML_TOKEN']
    else:
        # Last resort: prompt for the key
        openml.config.apikey = input('Please enter your OpenML API Key: ')

## Exception Imports

In [None]:
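# Exceptions to skip gracefully when retrieving each query type
from openml.exceptions import OpenMLServerException
dataset_exceptions = (OpenMLServerException,)
run_exceptions = (TypeError, OpenMLServerException)

# Provided by the liac-arff package (see "Issues" above if this import fails)
from arff import BadRelationFormat, BadDataFormat
task_exceptions = (NotImplementedError, BadRelationFormat, BadDataFormat)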

## Helper Functions

In [None]:
def get_value_attributes(obj):
    '''
    Given an object, returns a list of the object's value-based variables
    
    Params:
    - obj (any object): object to be analyzed
    
    Returns:
    - attributes (list): names of the value-based variables for the object given
    '''
    # Keep every attribute that is neither "private" (leading underscore)
    # nor callable (methods and other functions).
    attributes = [attr for attr in dir(obj)
                  if not attr.startswith('_')
                  and not callable(getattr(obj, attr))]
    
    return attributes
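
To make the filtering concrete, here is a minimal sketch that applies the same rule to a small stand-in object. `ExampleRun` is a hypothetical class invented for illustration and is not part of the openml package; the helper is restated compactly so the snippet runs on its own.

```python
def get_value_attributes(obj):
    """Return the public, non-callable attribute names of obj."""
    return [attr for attr in dir(obj)
            if not attr.startswith('_') and not callable(getattr(obj, attr))]


class ExampleRun:
    """Hypothetical stand-in for an OpenML object."""

    def __init__(self):
        self.run_id = 1
        self.flow_name = 'example-flow'
        self._cache = {}       # skipped: private by naming convention

    def publish(self):         # skipped: callable
        return None


attributes = get_value_attributes(ExampleRun())
print(attributes)  # ['flow_name', 'run_id']
```

The names come back in alphabetical order because `dir()` returns a sorted list.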

In [None]:
def retrieve_all_data(query_type, exceptions=(), query_limit=None, report_error_queries=False):
    '''
    Retrieves all possible data that the OpenML API will return for a given query type.
    
    Params:
    - query_type (str): type of data to pull. options: (datasets, runs, tasks)
    - exceptions=(), optional (exception class or list-like of exception classes): exceptions to handle when querying data
        ex: exceptions=(OpenMLServerException,) will gracefully skip any queries that throw an OpenMLServerException
            (can occur when a query, such as a run, has been deleted)
    - query_limit=None, optional (int): maximum number of queries to return (None for no limit)
    - report_error_queries=False, optional (bool): if True, prints the (query id, exception) pairs that were skipped
    
    Returns:
    - query_df (DataFrame): base information from the listing call, one row per query
    - query_submission_df (DataFrame): value-based attributes of each retrieved query object, one row per query
    '''
    # Ensure proper query type is passed in
    if query_type not in ('datasets', 'runs', 'tasks'):
        raise ValueError(f'\'{query_type}\' is not a valid query type')
    
    # Make sure exceptions are proper
    # If exceptions are not iterable (e.g. a single exception class), wrap them
    try:
        exceptions = tuple(exceptions)
    except TypeError:
        exceptions = (exceptions,)
    # An except clause only accepts an exception class or a tuple of them
    if not all(isinstance(exception, type) and issubclass(exception, BaseException)
               for exception in exceptions):
        raise ValueError(f'Invalid exception in \'{exceptions}\'')


    # Use query type to get necessary openml api functions,
    # e.g. 'runs' -> openml.runs.list_runs and openml.runs.get_run
    base_command = getattr(openml, query_type)
    list_queries = getattr(base_command, f'list_{query_type}')
    get_query = getattr(base_command, f'get_{query_type[:-1]}')

    # Get base information about every object listed on OpenML for the given query type
    query_dict = list_queries(size=query_limit)
    query_df = pd.DataFrame(query_dict).transpose().reset_index(drop=True)
    
    # Gather specific query object
    query_ids = query_dict.keys()

    queries = []
    error_queries = []
    for query_id in tqdm(query_ids):
        try:
            queries.append(get_query(query_id))
        except exceptions as e:
            error_queries.append((query_id, e))
            
    # Report error queries
    if report_error_queries:
        print('Error queries:\n', error_queries)
            
    # Get list of attributes the queries offer
    # (nothing can be inferred if every query was skipped)
    query_attributes = get_value_attributes(queries[0]) if queries else []
    
    # Collect the attributes of each query, one row per query
    rows = [{attribute: getattr(query, attribute) for attribute in query_attributes}
            for query in tqdm(queries)]
    
    # Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
    query_submission_df = pd.DataFrame(rows, columns=query_attributes)
        
    return query_df, query_submission_df
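
The skip-and-record loop at the heart of `retrieve_all_data` can be sketched without any network access. In the example below, `fetch_all` and `fake_fetch` are hypothetical names invented for illustration, and `KeyError` merely stands in for an exception such as `OpenMLServerException`. The point it shows is that an `except` clause needs an exception class or a tuple of classes, so a list is converted first.

```python
def fetch_all(ids, fetch, exceptions=()):
    """Fetch every id, skipping (and recording) the ones that raise a listed exception."""
    exceptions = tuple(exceptions)  # an except clause needs a class or a tuple of classes
    results, errors = [], []
    for item_id in ids:
        try:
            results.append(fetch(item_id))
        except exceptions as error:
            errors.append((item_id, error))
    return results, errors


def fake_fetch(item_id):
    """Hypothetical fetcher that fails for id 2, standing in for a deleted run."""
    if item_id == 2:
        raise KeyError(item_id)
    return {'id': item_id}


results, errors = fetch_all([1, 2, 3], fake_fetch, exceptions=[KeyError])
print(results)                             # [{'id': 1}, {'id': 3}]
print([item_id for item_id, _ in errors])  # [2]
```

Any exception that is not listed still propagates, so unexpected failures remain visible instead of being silently dropped.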

# Retrieve Data

In [None]:
# For testing purposes, limit each collection to a small number of queries
size_limit = 300

## Datasets

Note: the dataset code could be simplified via the get_datasets() function, but for uniformity's sake, we follow the convention used for the runs/tasks code.

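In [None]:
dataset_df, dataset_submission_df = retrieve_all_data(query_type='datasets',
                                                      exceptions=dataset_exceptions,
                                                      query_limit=size_limit)

In [None]:
dataset_df.head()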

In [None]:
dataset_submission_df.head()

## Evaluations

In [None]:
# Retrieves the attributes of an evaluation object
# Note: this only works because the attributes that we want to track are all parameters of the class.
# This retrieves the same end data as the get_value_attributes function, but does not require an actual instance

# To dissect, openml.evaluations.OpenMLEvaluation is the class that defines our evaluation objects.
# The .__init__ segment refers to the initialization function for the class
# The .__code__.co_varnames segment then returns the parameter names of that function
# (followed by the names of any local variables the function defines)
# The [1:] returns all but the first name. Since this is an instance method, the first is always 'self'
evaluations_attributes = openml.evaluations.OpenMLEvaluation.__init__.__code__.co_varnames[1:]
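
The introspection trick can be tried on a small class to see what `co_varnames` actually contains. `ExampleEvaluation` below is a hypothetical class made up for illustration, not the real `OpenMLEvaluation`. The sketch also shows `inspect.signature` as an alternative that ignores local variables.

```python
import inspect


class ExampleEvaluation:
    """Hypothetical class standing in for openml.evaluations.OpenMLEvaluation."""

    def __init__(self, run_id, function, value):
        self.run_id = run_id
        self.function = function
        self.value = value
        rounded = round(value, 2)  # a local variable, not a parameter
        self.rounded = rounded


code = ExampleEvaluation.__init__.__code__

# co_varnames lists the parameters first, followed by local variables
print(code.co_varnames)  # ('self', 'run_id', 'function', 'value', 'rounded')

# Slicing up to co_argcount keeps only the positional parameters
print(code.co_varnames[1:code.co_argcount])  # ('run_id', 'function', 'value')

# inspect.signature yields the same parameter names without touching __code__
parameters = list(inspect.signature(ExampleEvaluation.__init__).parameters)[1:]
print(parameters)  # ['run_id', 'function', 'value']
```

A plain `[1:]` slice therefore matches the parameter list only when `__init__` defines no local variables of its own.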

In [None]:
# Get different evaluation measures we can search for
evaluations_measures = openml.evaluations.list_evaluation_measures()

In [None]:
# Collect one row of attributes per evaluation
evaluation_rows = []

# Get evaluation data for each available measure
for measure in tqdm(evaluations_measures):
    # Query all data for a given evaluation measure
    evaluations_dict = openml.evaluations.list_evaluations(measure, size=size_limit)
    
    # Adds the queried data to the list of rows
    for query in evaluations_dict.values():
        evaluation_rows.append({attribute: getattr(query, attribute)
                                for attribute in evaluations_attributes})

# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
evaluations_df = pd.DataFrame(evaluation_rows, columns=list(evaluations_attributes))

In [None]:
evaluations_df.head()

## Runs

In [None]:
runs_df, runs_submission_df = retrieve_all_data(query_type='runs',
                                                exceptions=run_exceptions,
                                                query_limit=size_limit)

In [None]:
runs_df.head()

In [None]:
runs_submission_df.head()

## Tasks

In [None]:
tasks_df, tasks_submission_df = retrieve_all_data(query_type='tasks', 
                                                  exceptions=task_exceptions,
                                                  query_limit=size_limit)

In [None]:
tasks_df.head()

In [None]:
tasks_submission_df.head()