# Setup

## Instructions

This notebook uses the OpenML API. Follow these steps to obtain the credentials needed to continue (more detail is available in the OpenML documentation linked under "Additional Information" below):

1. Create an OpenML account at https://www.openml.org/register
2. After logging in, open your account page (click the avatar on the top right)
3. Open 'Account Settings', then 'API authentication' to find your API key

There are multiple ways of authenticating. Any of the following will work for this notebook:

Temporarily:
- If none of the permanent methods below has been set up, you will be prompted for your API key when the credentials cell runs; enter it in the text box.
    - This method is the easiest, but must be repeated every time the notebook is loaded.

Permanently:
- Following the pickle_tutorial.ipynb instructions, create a ```./credentials.pkl``` file that holds a dictionary containing the entry ```{'OPENML_TOKEN': MYKEY}```, with MYKEY being your API key.
- Use the openml CLI tool with ```openml configure apikey MYKEY```, with MYKEY being your API key.
- Create a plain text file ```~/.openml/config``` that contains the line ```apikey=MYKEY```, with MYKEY being your API key. 
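
As a rough illustration of the first permanent option, the sketch below writes and reads back a pickle file with the `{'OPENML_TOKEN': MYKEY}` layout. The helper names `save_credentials` and `load_credentials` are hypothetical (they do not come from pickle_tutorial.ipynb), and the demo writes to a temporary directory with a placeholder key rather than to `./credentials.pkl`.

```python
import os
import pickle
import tempfile


def save_credentials(api_key, path='credentials.pkl'):
    """Store the API key in a pickle file under the 'OPENML_TOKEN' entry."""
    with open(path, 'wb') as credentials:
        pickle.dump({'OPENML_TOKEN': api_key}, credentials)


def load_credentials(path='credentials.pkl'):
    """Read the API key back from the pickle file."""
    with open(path, 'rb') as credentials:
        return pickle.load(credentials)['OPENML_TOKEN']


# Round-trip demo in a temporary directory; 'MYKEY' is a placeholder
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'credentials.pkl')
    save_credentials('MYKEY', demo_path)
    loaded_key = load_credentials(demo_path)

print(loaded_key)  # MYKEY
```

To create the real file, calling `save_credentials` with your own key and the default path would place `credentials.pkl` in the notebook's working directory.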

## Additional Information

Documentation Guide:
- OpenML API ([OpenML](https://docs.openml.org/Python-start/))
- OpenML API ([GitHub](https://github.com/openml/openml-python)) 

Issues:
- Importing the arff exceptions may fail because they cannot be found. If this happens, uninstall `arff` and install `liac-arff`.
- Datasets and tasks become slow to iterate over after roughly 100-120 queries. This is unlikely to be caused by the setup here, since the loop over query IDs matches the API code apart from the added error handling.
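
One way to check for the arff issue before running the notebook is sketched below. The function name `arff_exceptions_available` is hypothetical; it only tests whether the installed `arff` module exposes the two exception classes imported later in this notebook, and it makes no attempt to fix the installation.

```python
import importlib


def arff_exceptions_available():
    """Return True if the installed arff module provides the needed exceptions."""
    try:
        arff = importlib.import_module('arff')
    except ImportError:
        # No arff module is installed at all
        return False
    return hasattr(arff, 'BadRelationFormat') and hasattr(arff, 'BadDataFormat')


if not arff_exceptions_available():
    print('arff exceptions not found: uninstall arff and install liac-arff')
```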

## Imports

In [1]:
# Import openml, installing if necessary
try:
    import openml
except ImportError:
    !pip3 install openml
    import openml

import pandas as pd # For storing/manipulating query data
import pickle # For loading credentials
import warnings # For warning users who do things they shouldn't
import os # For loading credentials
from tqdm import tqdm # Gives status bar on loop completion
from itertools import product # Flattens nested for loops into a single loop

In [2]:
# Load credentials

# Keep the key if the config file or CLI has already set it
if not openml.config.apikey:
    # Otherwise fall back to the credentials file, then to a prompt
    if os.path.exists('credentials.pkl'):
        with open('credentials.pkl', 'rb') as credentials:
            openml.config.apikey = pickle.load(credentials)['OPENML_TOKEN']
    else:
        openml.config.apikey = input('Please enter your OpenML API Key: ')

## Exception Imports

In [3]:
from openml.exceptions import OpenMLServerException
dataset_exceptions = (OpenMLServerException,)
run_exceptions = (TypeError, OpenMLServerException)

from arff import BadRelationFormat, BadDataFormat
task_exceptions = (NotImplementedError, BadRelationFormat, BadDataFormat)

## Helper Functions

In [4]:
def get_value_attributes(obj):
    """
    Given an object, returns a list of the object's value-based variables
    
    Params:
    - obj (object): object to be analyzed
    
    Returns:
    - attributes (list): names of the value-based variables for the object given
    """

    # Pull every attribute of the object that is neither callable nor
    # "private" (prefixed with an underscore)
    attributes = [attr for attr in dir(obj)
                  if not callable(getattr(obj, attr))
                  and not attr.startswith('_')]

    return attributes
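
A quick way to see what the helper keeps and drops is to run it on a small stand-in object. In the sketch below, `DemoRun` is a hypothetical class invented for illustration (it is not an OpenML type), and the helper is repeated in compact form so the example runs on its own.

```python
def get_value_attributes(obj):
    # Compact copy of the helper above: keep public, non-callable attributes
    return [attr for attr in dir(obj)
            if not callable(getattr(obj, attr))
            and not attr.startswith('_')]


class DemoRun:
    """Hypothetical stand-in for an object returned by the API."""

    def __init__(self):
        self.run_id = 1
        self.flow_name = 'demo'
        self._cache = None  # "private", so it is skipped

    def publish(self):  # callable, so it is skipped
        pass


attrs = get_value_attributes(DemoRun())
print(attrs)  # ['flow_name', 'run_id']
```

The names come back in alphabetical order because `dir()` returns a sorted list.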

In [5]:
def retrieve_all_data(query_type, exceptions=(), query_limit=None, report_error_queries=False):
    """
    Retrieves all possible data that the OpenML API will return for a given query type.
    
    Params:
    - query_type (str): type of data to pull. options: (datasets, runs, tasks)
    - exceptions=(), optional (tuple-like): exceptions to handle when querying data
        ex: exceptions=(OpenMLServerException,) will gracefully skip any queries that throw an OpenMLServerException
            (can occur when a query, such as a run, has been deleted)
    - query_limit=None, optional (int): number of queries to return 
    - report_error_queries=False, optional (bool): whether to print the (id, exception) pairs of skipped queries
    
    Returns:
    - query_df (pd.DataFrame): DataFrame of all surface level information about the listing of an instance
    - query_attribute_df (pd.DataFrame): DataFrame of all attributes contained in an instance
    """
    
    # Ensure proper instance type is passed in
    if query_type not in ('datasets', 'runs', 'tasks'):
        raise ValueError(f"'{query_type}' is not a valid instance type")

    # Normalize exceptions to a tuple, since an except clause only accepts
    # an exception class or a tuple of exception classes
    try:
        exceptions = tuple(exceptions)
    except TypeError:
        # A single, non-iterable exception class was passed in
        exceptions = (exceptions,)

    # Make sure every entry is an exception class
    try:
        valid = all(issubclass(exception, BaseException) for exception in exceptions)
    except TypeError:
        valid = False
    if not valid:
        raise ValueError(f"Invalid exception in '{exceptions}'")


    # Use query type to get necessary openml api functions
    base_command = getattr(openml, query_type)
    list_queries = getattr(base_command, f'list_{query_type}')
    get_query = getattr(base_command, f'get_{query_type[:-1]}')

    # Get base information about every object listed on OpenML for the given query type
    query_dict = list_queries(size=query_limit)
    query_df = pd.DataFrame(query_dict).transpose().reset_index(drop=True)
    
    # Gather specific query object
    query_ids = query_dict.keys()

    queries = []
    error_queries = []
    for query_id in tqdm(query_ids):
        try:
            queries.append(get_query(query_id))
        except exceptions as e:
            error_queries.append((query_id, e))
            
    # Report error queries
    if report_error_queries:
        print('Error queries:\n', error_queries)
            
    # Nothing to extract if every query failed
    if not queries:
        return query_df, pd.DataFrame()

    # Get list of attributes the queries offer
    query_attributes = get_value_attributes(queries[0])

    # Collect the attributes of each query, then build the DataFrame once
    # (DataFrame.append was removed in pandas 2.0)
    attribute_rows = [
        {attribute: getattr(query, attribute) for attribute in query_attributes}
        for query in tqdm(queries)
    ]
    query_attribute_df = pd.DataFrame(attribute_rows, columns=query_attributes)
        
    return query_df, query_attribute_df
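
The skip-and-record loop at the heart of this function can be tried without network access. The sketch below is a simplified stand-in rather than the function itself: `fetch_all` and `fake_get` are hypothetical names, and `fake_get` only imitates a lookup that fails for one ID.

```python
def fetch_all(ids, fetch, exceptions=()):
    """Call fetch on each id, recording (id, error) for the listed exceptions."""
    results, errors = [], []
    for item_id in ids:
        try:
            results.append(fetch(item_id))
        except exceptions as e:
            # Only the listed exception types are skipped; others propagate
            errors.append((item_id, e))
    return results, errors


def fake_get(item_id):
    # Hypothetical stand-in for a get_* call: id 2 behaves like a deleted entry
    if item_id == 2:
        raise KeyError(item_id)
    return {'id': item_id}


results, errors = fetch_all([1, 2, 3], fake_get, exceptions=(KeyError,))
print([r['id'] for r in results])          # [1, 3]
print([item_id for item_id, _ in errors])  # [2]
```

Note that `exceptions` is a tuple here: an `except` clause accepts an exception class or a tuple of classes, but not a list.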

# Retrieve Data

In [6]:
# For testing purposes, limit each query to a small number of results
size_limit = 25

## Datasets

Note: the dataset code could be simplified with the `get_datasets()` function, but for uniformity's sake we follow the same convention as the runs/tasks code.

In [7]:
dataset_df, dataset_submission_df = retrieve_all_data(query_type='datasets',
                                                exceptions=dataset_exceptions,
                                                query_limit=size_limit)

100%|██████████| 25/25 [00:11<00:00,  2.26it/s]
100%|██████████| 25/25 [00:00<00:00, 193.66it/s]


In [8]:
dataset_df.head()

Unnamed: 0,did,name,version,uploader,status,format,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
0,2,anneal,1,1,active,ARFF,684.0,7.0,8.0,5.0,39.0,898.0,898.0,22175.0,6.0,33.0
1,3,kr-vs-kp,1,1,active,ARFF,1669.0,3.0,1527.0,2.0,37.0,3196.0,0.0,0.0,0.0,37.0
2,4,labor,1,1,active,ARFF,37.0,3.0,20.0,2.0,17.0,57.0,56.0,326.0,8.0,9.0
3,5,arrhythmia,1,1,active,ARFF,245.0,13.0,2.0,13.0,280.0,452.0,384.0,408.0,206.0,74.0
4,6,letter,1,1,active,ARFF,813.0,26.0,734.0,26.0,17.0,20000.0,0.0,0.0,16.0,1.0


In [9]:
dataset_submission_df.head()

Unnamed: 0,cache_format,citation,collection_date,contributor,creator,data_feather_file,data_file,data_pickle_file,dataset_id,default_target_attribute,...,parquet_file,qualities,row_id_attribute,tag,update_comment,upload_date,url,version,version_label,visibility
0,pickle,https://archive.ics.uci.edu/ml/citation_policy...,1990,David Sterling and Wray Buntine,"[David Sterling, Wray Buntine]",,/Users/michaelbaluja/.openml/org/openml/www/da...,,2,class,...,,"{'AutoCorrelation': 0.6064659977703456, 'CfsSu...",,"[study_1, study_14, study_34, study_37, study_...",,2014-04-06T23:19:24,https://www.openml.org/data/v1/download/166687...,1,1,public
1,pickle,https://archive.ics.uci.edu/ml/citation_policy...,1989-08-01,Rob Holte,Alen Shapiro,,/Users/michaelbaluja/.openml/org/openml/www/da...,,3,class,...,,"{'AutoCorrelation': 0.9990610328638497, 'CfsSu...",,"[mythbusting_1, OpenML-CC18, OpenML100, study_...",,2014-04-06T23:19:28,https://www.openml.org/data/v1/download/3/kr-v...,1,1,public
2,pickle,https://archive.ics.uci.edu/ml/citation_policy...,1988-11-01,Stan Matwin,Collective Bargaining Review of Labour Canada,,/Users/michaelbaluja/.openml/org/openml/www/da...,,4,class,...,,"{'AutoCorrelation': 0.75, 'CfsSubsetEval_Decis...",,"[mythbusting_1, study_1, study_15, study_20, s...",,2014-04-06T23:19:30,https://www.openml.org/data/v1/download/4/labo...,1,1,public
3,pickle,https://archive.ics.uci.edu/ml/citation_policy...,1998-01-01,,"[H. Altay Guvenir, Burak Acar, Haldun Muderris...",,/Users/michaelbaluja/.openml/org/openml/www/da...,,5,class,...,,"{'AutoCorrelation': 0.35476718403547675, 'CfsS...",,"[sport, study_1, study_41, study_76, study_93,...",,2014-04-06T23:19:36,https://www.openml.org/data/v1/download/5/arrh...,1,1,public
4,pickle,"P. W. Frey and D. J. Slate. ""Letter Recognitio...",1991-01-01,,David J. Slate,,/Users/michaelbaluja/.openml/org/openml/www/da...,,6,class,...,,"{'AutoCorrelation': 0.04090204510225511, 'CfsS...",,"[AzurePilot, AzurePilot1, OpenML-CC18, OpenML1...",,2014-04-06T23:19:41,https://www.openml.org/data/v1/download/6/lett...,1,1,public


## Evaluations

In [10]:
# Get different evaluation measures we can search for
evaluations_measures = openml.evaluations.list_evaluation_measures()

In [11]:
# Collect one row of attributes per evaluation
evaluation_rows = []

# Get evaluation data for each available measure
for measure in tqdm(evaluations_measures):
    # Query all data for a given evaluation measure
    evaluations_dict = openml.evaluations.list_evaluations(measure, size=size_limit)

    # Skip measures for which the evaluation search returns no results
    if not evaluations_dict:
        continue

    # Grab one of the evaluations in order to extract attributes
    sample_evaluation = next(iter(evaluations_dict.values()))

    # Get list of attributes the evaluation offers
    evaluations_attributes = get_value_attributes(sample_evaluation)

    # Add the queried data to the collected rows
    for query in evaluations_dict.values():
        evaluation_rows.append({attribute: getattr(query, attribute) for attribute in evaluations_attributes})

# Build the DataFrame once (DataFrame.append was removed in pandas 2.0)
evaluations_df = pd.DataFrame(evaluation_rows)

100%|██████████| 71/71 [01:16<00:00,  1.08s/it]


In [12]:
evaluations_df.head()

Unnamed: 0,array_data,data_id,data_name,flow_id,flow_name,function,run_id,setup_id,task_id,upload_time,uploader,uploader_name,value,values
0,"[0.93111,0.999975,0.994856,0.0,1,0.990326]",1.0,anneal,76.0,weka.Bagging_REPTree(1),area_under_roc_curve,62.0,17.0,1.0,2014-04-06 23:57:45,1.0,Jan van Rijn,0.995034,
1,"[0.730267,0.998862,0.976922,0.0,1,0.978059]",1.0,anneal,59.0,weka.JRip(1),area_under_roc_curve,237.0,4.0,1.0,2014-04-07 01:34:48,1.0,Jan van Rijn,0.978916,
2,"[0.973736,0.998217,0.990664,0.0,1,0.991929]",1.0,anneal,67.0,weka.BayesNet_K2(1),area_under_roc_curve,359.0,12.0,1.0,2014-04-07 04:08:17,1.0,Jan van Rijn,0.992099,
3,"[0.936728,0.999975,0.998962,0.0,1,0.999009]",1.0,anneal,65.0,weka.RandomForest(1),area_under_roc_curve,413.0,10.0,1.0,2014-04-07 04:35:45,1.0,Jan van Rijn,0.998598,
4,"[0.874438,0.999368,0.997455,0.0,1,0.999446]",1.0,anneal,74.0,weka.Logistic(1),area_under_roc_curve,500.0,15.0,1.0,2014-04-07 06:52:21,1.0,Jan van Rijn,0.996849,


## Runs

In [13]:
runs_df, runs_submission_df = retrieve_all_data(query_type='runs',
                                                exceptions=run_exceptions,
                                                query_limit=size_limit)

100%|██████████| 25/25 [00:00<00:00, 287.72it/s]
100%|██████████| 24/24 [00:00<00:00, 248.64it/s]


In [14]:
runs_df.head()

Unnamed: 0,run_id,task_id,setup_id,flow_id,uploader,task_type,upload_time,error_message
0,1,68,6,61,1,TaskType.LEARNING_CURVE,2014-04-06 23:30:40,
1,2,72,16,75,1,TaskType.LEARNING_CURVE,2014-04-06 23:31:13,
2,3,95,8,63,1,TaskType.LEARNING_CURVE,2014-04-06 23:32:38,
3,7,88,13,70,1,TaskType.LEARNING_CURVE,2014-04-06 23:36:01,
4,8,85,2,57,1,TaskType.LEARNING_CURVE,2014-04-06 23:38:24,


In [15]:
runs_submission_df.head()

Unnamed: 0,data_content,dataset_id,description_text,error_message,evaluations,flow,flow_id,flow_name,fold_evaluations,id,...,setup_id,setup_string,tags,task,task_evaluation_measure,task_id,task_type,trace,uploader,uploader_name
0,,13,,,"{'area_under_roc_curve': 0.6867257828504536, '...",,75,weka.AdaBoostM1_DecisionStump(1),{},2,...,16,weka.classifiers.meta.AdaBoostM1 -- -P 100 -S ...,[testing],,predictive_accuracy,72,Learning Curve,,1,Jan van Rijn
1,,36,,,"{'area_under_roc_curve': 0.963585211421575, 'a...",,63,weka.HoeffdingTree(1),{},3,...,8,weka.classifiers.trees.HoeffdingTree -- -L 2 -...,,,predictive_accuracy,95,Learning Curve,,1,Jan van Rijn
2,,29,,,"{'area_under_roc_curve': 0.8574182903700429, '...",,70,weka.SMO_PolyKernel(1),{},7,...,13,weka.classifiers.functions.SMO -- -C 1.0 -L 0....,,,predictive_accuracy,88,Learning Curve,,1,Jan van Rijn
3,,26,,,"{'area_under_roc_curve': 0.7862987608291605, '...",,57,weka.OneR(1),{},8,...,2,weka.classifiers.rules.OneR -- -B 6,,,predictive_accuracy,85,Learning Curve,,1,Jan van Rijn
4,,32,,,"{'area_under_roc_curve': 0.9878527592419466, '...",,67,weka.BayesNet_K2(1),{},9,...,12,weka.classifiers.bayes.BayesNet -- -D -Q weka....,,,predictive_accuracy,91,Learning Curve,,1,Jan van Rijn


## Tasks

In [16]:
tasks_df, tasks_submission_df = retrieve_all_data(query_type='tasks', 
                                                  exceptions=task_exceptions,
                                                  query_limit=size_limit)

100%|██████████| 25/25 [00:10<00:00,  2.45it/s]
100%|██████████| 25/25 [00:00<00:00, 244.93it/s]


In [17]:
tasks_df.head()

Unnamed: 0,tid,ttid,did,name,task_type,status,estimation_procedure,evaluation_measures,source_data,target_feature,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures
0,2,TaskType.SUPERVISED_CLASSIFICATION,2,anneal,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,2,class,684,7,8,5,39,898,898,22175,6,33
1,3,TaskType.SUPERVISED_CLASSIFICATION,3,kr-vs-kp,Supervised Classification,active,10-fold Crossvalidation,,3,class,1669,3,1527,2,37,3196,0,0,0,37
2,4,TaskType.SUPERVISED_CLASSIFICATION,4,labor,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,4,class,37,3,20,2,17,57,56,326,8,9
3,5,TaskType.SUPERVISED_CLASSIFICATION,5,arrhythmia,Supervised Classification,active,10-fold Crossvalidation,predictive_accuracy,5,class,245,13,2,13,280,452,384,408,206,74
4,6,TaskType.SUPERVISED_CLASSIFICATION,6,letter,Supervised Classification,active,10-fold Crossvalidation,,6,class,813,26,734,26,17,20000,0,0,16,1


In [18]:
tasks_submission_df.head()

Unnamed: 0,class_labels,cost_matrix,dataset_id,estimation_parameters,estimation_procedure,estimation_procedure_id,evaluation_measure,id,openml_url,split,target_name,task_id,task_type,task_type_id
0,"[1, 2, 3, 4, 5, U]",,2,"{'number_repeats': '1', 'number_folds': '10', ...","{'type': 'crossvalidation', 'parameters': {'nu...",1,predictive_accuracy,2,https://www.openml.org/t/2,,class,2,Supervised Classification,TaskType.SUPERVISED_CLASSIFICATION
1,"[nowin, won]",,3,"{'number_repeats': '1', 'number_folds': '10', ...","{'type': 'crossvalidation', 'parameters': {'nu...",1,,3,https://www.openml.org/t/3,,class,3,Supervised Classification,TaskType.SUPERVISED_CLASSIFICATION
2,"[bad, good]",,4,"{'number_repeats': '1', 'number_folds': '10', ...","{'type': 'crossvalidation', 'parameters': {'nu...",1,predictive_accuracy,4,https://www.openml.org/t/4,,class,4,Supervised Classification,TaskType.SUPERVISED_CLASSIFICATION
3,"[1, 10, 11, 12, 13, 14, 15, 16, 2, 3, 4, 5, 6,...",,5,"{'number_repeats': '1', 'number_folds': '10', ...","{'type': 'crossvalidation', 'parameters': {'nu...",1,predictive_accuracy,5,https://www.openml.org/t/5,,class,5,Supervised Classification,TaskType.SUPERVISED_CLASSIFICATION
4,"[A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, ...",,6,"{'number_repeats': '1', 'number_folds': '10', ...","{'type': 'crossvalidation', 'parameters': {'nu...",1,,6,https://www.openml.org/t/6,,class,6,Supervised Classification,TaskType.SUPERVISED_CLASSIFICATION
