# Setup

In [1]:
# Import openml, installing if necessary
try:
    import openml
except ImportError:
    !pip3 install openml
    import openml

import pandas as pd # For storing/manipulating query data
import pickle # For loading credentials
import os # For loading credentials

In [2]:
# Load credentials

# Use the key from the config file or CLI variable if one is already set
if not openml.config.apikey:
    # Otherwise fall back to a local credentials file
    if os.path.exists('credentials.pkl'):
        with open('credentials.pkl', 'rb') as credentials:
            openml.config.apikey = pickle.load(credentials)['OPENML_TOKEN']
    else:
        # Last resort: ask for the key interactively
        openml.config.apikey = input('Please enter your OpenML API Key: ')
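
The cell above expects credentials.pkl to hold a pickled dictionary with an 'OPENML_TOKEN' key. A minimal sketch of writing and reading such a file is shown below. The helper names save_token and load_token are hypothetical (they are not part of openml), and the token is a placeholder.

```python
import os
import pickle
import tempfile


def save_token(path, token):
    """Pickle a dict in the layout the loading cell expects."""
    with open(path, 'wb') as handle:
        pickle.dump({'OPENML_TOKEN': token}, handle)


def load_token(path):
    """Read the token back out of the pickled dict."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)['OPENML_TOKEN']


# Round trip inside a temporary directory so no real credentials file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'credentials.pkl')
    save_token(demo_path, 'placeholder-token')
    restored_token = load_token(demo_path)
```

Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.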

## References

\[1\] https://www.w3schools.com/python/ref_func_vars.asp

\[2\] https://www.geeksforgeeks.org/python-dir-function/

# Walkthrough

Since the general workflow is identical for Datasets, Runs, and Tasks, the initial walkthrough shows it in detail for Datasets only.

## Datasets

We first query OpenML for a couple of datasets using the list_datasets() function.

In [3]:
dataset_list = openml.datasets.list_datasets(size=2)

The list_datasets() function returns high-level information about the datasets as an OrderedDict. Each entry in the OrderedDict is a dictionary containing information for a single dataset.

In [4]:
# We get the OrderedDict keys in order to take a look at an example of the data returned.
# Since the odict_keys object returned is not subscriptable, we cast it as a list and then take the first key
dataset_list_keys = dataset_list.keys()
sample_dataset_key = list(dataset_list_keys)[0]
sample_dataset_info = dataset_list[sample_dataset_key]
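
The same first-entry lookup can be done without building the full key list, by asking an iterator over the dictionary for its first key. The sketch below uses a small stand-in dictionary rather than a live query; the second entry is a made-up placeholder.

```python
from collections import OrderedDict

# Stand-in for the listing structure; not a live OpenML response
listing = OrderedDict([
    (2, {'did': 2, 'name': 'anneal'}),
    (99, {'did': 99, 'name': 'placeholder'}),
])

first_key = next(iter(listing))              # first key in insertion order
first_info = listing[first_key]

# Shortcut when only the value is needed
first_value = next(iter(listing.values()))
```

For a listing of two entries the difference is negligible, but this form avoids copying every key when the listing is large.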

In [5]:
sample_dataset_info

{'did': 2,
 'name': 'anneal',
 'version': 1,
 'uploader': '1',
 'status': 'active',
 'format': 'ARFF',
 'MajorityClassSize': 684.0,
 'MaxNominalAttDistinctValues': 7.0,
 'MinorityClassSize': 8.0,
 'NumberOfClasses': 5.0,
 'NumberOfFeatures': 39.0,
 'NumberOfInstances': 898.0,
 'NumberOfInstancesWithMissingValues': 898.0,
 'NumberOfMissingValues': 22175.0,
 'NumberOfNumericFeatures': 6.0,
 'NumberOfSymbolicFeatures': 33.0}

The 'did' entry contains the (d)ataset (id), which we then use to retrieve the dataset object via the get_dataset() function.

In [6]:
# We extract the dataset id from the sample dataset in order to search for it
sample_id = sample_dataset_info['did']
sample_dataset = openml.datasets.get_dataset(sample_id)

In [7]:
sample_dataset

OpenML Dataset
Name..........: anneal
Version.......: 1
Format........: ARFF
Upload Date...: 2014-04-06 23:19:24
Licence.......: Public
Download URL..: https://www.openml.org/data/v1/download/1666876/anneal.arff
OpenML URL....: https://www.openml.org/d/2
# of features.: 39
# of instances: 898

Taking a look at the dataset, we see that there isn't much here that we can use: most of the information was already returned by list_datasets(). To get a better understanding of what these dataset objects contain, we can use the vars() function, which returns the object's \_\_dict\_\_, the dictionary holding its writable attributes \[1\].

In [8]:
sample_dataset_vars = vars(sample_dataset)

In [9]:
sample_dataset_vars

{'dataset_id': 2,
 'name': 'anneal',
 'version': 1,
 'description': "**Author**: Unknown. Donated by David Sterling and Wray Buntine  \n\n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Annealing) - 1990  \n\n**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)  \n\n\n\nThe original Annealing dataset from UCI. The exact meaning of the features and classes is largely unknown. Annealing, in metallurgy and materials science, is a heat treatment that alters the physical and sometimes chemical properties of a material to increase its ductility and reduce its hardness, making it more workable. It involves heating a material to above its recrystallization temperature, maintaining a suitable temperature, and then cooling. (Wikipedia)\n\n\n\n### Attribute Information:\n\n     1. family:          --,GB,GK,GS,TN,ZA,ZF,ZH,ZM,ZS\n\n     2. product-type:    C, H, G\n\n     3. steel:           -,R,A,U,K,M,S,W,V\n\n     4. carbon:          continuous\n\n     5. hardnes

From the vars() output, we can see quite a bit more information about the dataset than what the OpenML object reported. While some of this may seem unnecessary (md5checksum, for example), it may still be beneficial to have for later analysis (who is contributing the most to these platforms and how reusable are their contributions?)
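
One detail worth keeping in mind is that vars() hands back the object's own attribute dictionary rather than a copy, so later changes to the object show up in it. The sketch below illustrates this with a toy class (Record is hypothetical and only stands in for an OpenML object).

```python
class Record:
    """Toy stand-in for an OpenML object (hypothetical, for illustration)."""

    kind = 'toy'  # class attribute: not stored in the instance __dict__

    def __init__(self, record_id, name):
        self.record_id = record_id
        self.name = name


record = Record(2, 'anneal')

live_view = vars(record)        # the instance __dict__ itself
snapshot = dict(vars(record))   # shallow copy, safe to store or modify

record.name = 'renamed'         # visible through live_view, not through snapshot
```

If the attribute values are going to be stored for later analysis, taking a copy as in snapshot avoids surprises when the object changes afterwards.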

In addition to the vars() function for listing the actual data attributes, we can take a look at the dir() function. The dir() function returns a list of all of the attributes and methods for an object \[2\]. 

In [10]:
sample_dataset_dir = dir(sample_dataset)

In [11]:
sample_dataset_dir

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_apply_repr_template',
 '_cache_compressed_file_from_file',
 '_compressed_cache_file_paths',
 '_convert_array_format',
 '_dataset',
 '_download_data',
 '_entity_letter',
 '_get_arff',
 '_get_file_elements',
 '_get_repr_body_fields',
 '_load_data',
 '_minio_url',
 '_parse_data_from_arff',
 '_parse_publish_response',
 '_to_dict',
 '_to_xml',
 '_unpack_categories',
 'cache_format',
 'citation',
 'collection_date',
 'contributor',
 'creator',
 'data_feather_file',
 'data_file',
 'data_pickle_file',
 'dataset_id',
 'default_target_attribute',
 'description',
 'feather_attrib

From this, we see all of the previous object attribute names, along with all of the functions available to the object's class. While this may not seem any more beneficial than the vars() function for the purpose of retrieving information from the object, it does allow us to squeeze out a couple more variables.
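
Since the dir() listing is long, it can help to group the names before reading them. The following is a small sketch of one way to do that; group_names and the Sample class are hypothetical helpers written for this illustration.

```python
def group_names(obj):
    """Split dir(obj) into dunder, private, and public names."""
    groups = {'dunder': [], 'private': [], 'public': []}
    for name in dir(obj):
        if name.startswith('__') and name.endswith('__'):
            groups['dunder'].append(name)
        elif name.startswith('_'):
            groups['private'].append(name)
        else:
            groups['public'].append(name)
    return groups


class Sample:
    """Toy class with one public attribute, one private attribute, one method."""

    def __init__(self):
        self.name = 'anneal'
        self._dataset = None

    def get_data(self):
        return None


groups = group_names(Sample())
public_names = groups['public']   # holds both 'name' and 'get_data'
```

Note that the public group still mixes data attributes and methods, which is why a further filter on callability is needed.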

In [12]:
def get_attributes(obj):
    # Keep names that are neither callable nor private (leading underscore)
    attributes = [attr for attr in dir(obj)
                  if not callable(getattr(obj, attr))
                  and not attr.startswith('_')]
    return attributes

We use the get_attributes() function to retrieve attribute names from dir() that may or may not have been available from vars(). It takes every name listed by dir() and keeps those that

1. are not functions (cannot be "called"), and
2. are not private to the class (do not start with a leading underscore).
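
To see the two rules at work on something small, the sketch below applies an equivalent filter to a toy class. ToyEntity is hypothetical and only mimics the general shape of such objects; its property is an example of a name that dir() lists but vars() does not.

```python
def get_attributes(obj):
    # Equivalent filter: drop callables and names with a leading underscore
    return [attr for attr in dir(obj)
            if not callable(getattr(obj, attr))
            and not attr.startswith('_')]


class ToyEntity:
    """Hypothetical class used only for this illustration."""

    def __init__(self, entity_id):
        self.entity_id = entity_id   # stored on the instance
        self._cache = None           # private by convention, filtered out

    @property
    def id(self):                    # computed on access, absent from vars()
        return self.entity_id

    def download(self):              # callable, filtered out
        return None


entity = ToyEntity(2)
found = get_attributes(entity)
extra = set(found).difference(vars(entity).keys())
```

Because the filter calls getattr() on every name, any property on the object is evaluated while filtering. That is harmless here, but it could be slow or trigger side effects on objects whose properties do real work.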

In [13]:
dataset_attributes = get_attributes(sample_dataset)

In [14]:
set(dataset_attributes).difference(sample_dataset_vars.keys())

{'id', 'openml_url'}

Although the only differences between the attributes we scrape from dir() and those presented by vars() do not seem very useful (the id already appears as dataset_id, and we currently have no use for the URLs), the cost of retrieving our data this way is low compared to the cost of the rest of the data querying. We can also take a look at which additional attributes this method provides for runs and tasks.

In [15]:
run_list = openml.runs.list_runs(size=2)
task_list = openml.tasks.list_tasks(size=2)

# Same steps as for datasets (extract a run/task, then query it by id), condensed for brevity.
# Note that for runs we index at 1 instead of 0 due to an error in the way
# OpenML stores the run associated with the first run_id

sample_run_info = run_list[list(run_list.keys())[1]]
sample_run_id = sample_run_info['run_id']

sample_task_info = task_list[list(task_list.keys())[0]]
sample_task_id = sample_task_info['tid']

sample_run = openml.runs.get_run(sample_run_id)
sample_task = openml.tasks.get_task(sample_task_id)
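
The hard-coded index 1 above works around a single entry that fails to load. A more general sketch is to try the ids in order and keep the first one that loads. The helper first_loadable and the stand-in fetcher below are hypothetical, and since the exception raised for the broken run is not shown here, the sketch catches Exception broadly.

```python
def first_loadable(ids, fetch):
    """Return (id, object) for the first id that fetch() manages to load.

    `fetch` is any one-argument callable, e.g. a get_run-style function.
    """
    errors = {}
    for candidate in ids:
        try:
            return candidate, fetch(candidate)
        except Exception as exc:  # broad on purpose: the failure mode is unknown
            errors[candidate] = exc
    raise LookupError(f'none of the ids could be loaded: {errors}')


def fake_fetch(run_id):
    """Stand-in fetcher that fails for id 1, imitating one broken entry."""
    if run_id == 1:
        raise ValueError('simulated server-side problem')
    return {'run_id': run_id}


loaded_id, loaded_run = first_loadable([1, 2], fake_fetch)
```

With the real client, the call would presumably look like first_loadable(list(run_list.keys()), openml.runs.get_run), at the cost of one failed request per broken entry.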

In [16]:
sample_run_info

{'run_id': 2,
 'task_id': 72,
 'setup_id': 16,
 'flow_id': 75,
 'uploader': 1,
 'task_type': <TaskType.LEARNING_CURVE: 3>,
 'upload_time': '2014-04-06 23:31:13',
 'error_message': ''}

In [17]:
sample_task_info

{'tid': 2,
 'ttid': <TaskType.SUPERVISED_CLASSIFICATION: 1>,
 'did': 2,
 'name': 'anneal',
 'task_type': 'Supervised Classification',
 'status': 'active',
 'estimation_procedure': '10-fold Crossvalidation',
 'evaluation_measures': 'predictive_accuracy',
 'source_data': '2',
 'target_feature': 'class',
 'MajorityClassSize': 684,
 'MaxNominalAttDistinctValues': 7,
 'MinorityClassSize': 8,
 'NumberOfClasses': 5,
 'NumberOfFeatures': 39,
 'NumberOfInstances': 898,
 'NumberOfInstancesWithMissingValues': 898,
 'NumberOfMissingValues': 22175,
 'NumberOfNumericFeatures': 6,
 'NumberOfSymbolicFeatures': 33}

We can take a look at the sample run and sample task to get an understanding of what data is present in those objects.

In [18]:
sample_run

OpenML Run
Uploader Name...: Jan van Rijn
Uploader Profile: https://www.openml.org/u/1
Metric..........: predictive_accuracy
Result..........: 0.7136363636363636
Run ID..........: 2
Run URL.........: https://www.openml.org/r/2
Task ID.........: 72
Task Type.......: Learning Curve
Task URL........: https://www.openml.org/t/72
Flow ID.........: 75
Flow Name.......: weka.AdaBoostM1_DecisionStump(1)
Flow URL........: https://www.openml.org/f/75
Setup ID........: 16
Setup String....: weka.classifiers.meta.AdaBoostM1 -- -P 100 -S 1 -I 10 -W weka.classifiers.trees.DecisionStump
Dataset ID......: 13
Dataset URL.....: https://www.openml.org/d/13

In [19]:
sample_task

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 2
Task URL.............: https://www.openml.org/t/2
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 6
Cost Matrix..........: Available

In [20]:
run_attributes = get_attributes(sample_run)
task_attributes = get_attributes(sample_task)

sample_run_vars = vars(sample_run)
sample_task_vars = vars(sample_task)

In [21]:
sample_run_vars

{'uploader': 1,
 'uploader_name': 'Jan van Rijn',
 'task_id': 72,
 'task_type': 'Learning Curve',
 'task_evaluation_measure': 'predictive_accuracy',
 'flow_id': 75,
 'flow_name': 'weka.AdaBoostM1_DecisionStump(1)',
 'setup_id': 16,
 'setup_string': 'weka.classifiers.meta.AdaBoostM1 -- -P 100 -S 1 -I 10 -W weka.classifiers.trees.DecisionStump',
 'parameter_settings': [OrderedDict([('oml:name', 'I'),
               ('oml:value', '10'),
               ('oml:component', '75')]),
  OrderedDict([('oml:name', 'P'),
               ('oml:value', '100'),
               ('oml:component', '75')]),
  OrderedDict([('oml:name', 'S'),
               ('oml:value', '1'),
               ('oml:component', '75')]),
  OrderedDict([('oml:name', 'W'),
               ('oml:value', 'weka.classifiers.trees.DecisionStump'),
               ('oml:component', '75')])],
 'dataset_id': 13,
 'evaluations': OrderedDict([('area_under_roc_curve', 0.6867257828504536),
              ('average_cost', 0.0),
              ('f_

In [22]:
sample_task_vars

{'task_id': 2,
 'task_type_id': <TaskType.SUPERVISED_CLASSIFICATION: 1>,
 'task_type': 'Supervised Classification',
 'dataset_id': 2,
 'evaluation_measure': 'predictive_accuracy',
 'estimation_procedure': {'type': 'crossvalidation',
  'parameters': {'number_repeats': '1',
   'number_folds': '10',
   'percentage': '',
   'stratified_sampling': 'true'},
  'data_splits_url': 'https://www.openml.org/api_splits/get/2/Task_2_splits.arff'},
 'estimation_procedure_id': 1,
 'split': None,
 'target_name': 'class',
 'class_labels': ['1', '2', '3', '4', '5', 'U'],
 'cost_matrix': None}

In [23]:
set(run_attributes).difference(sample_run_vars.keys())

{'id', 'openml_url'}

In [24]:
set(task_attributes).difference(sample_task_vars.keys())

{'estimation_parameters', 'id', 'openml_url'}

We see that for runs, we again don't retrieve any additional useful information. However, we now get the estimation parameters used for running tasks, which is useful for purposes of reuse.
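
Putting the pieces of this section together, the attribute names can be paired with their values to build one flat record per object. The sketch below does this in a single pass so that each attribute is read only once. The to_record helper and the ToyTask class are hypothetical; ToyTask only imitates a task with an extra computed attribute.

```python
def to_record(obj):
    """Collect public, non-callable attributes into a plain dict."""
    record = {}
    for name in dir(obj):
        if name.startswith('_'):
            continue
        value = getattr(obj, name)  # read once; properties are evaluated here
        if not callable(value):
            record[name] = value
    return record


class ToyTask:
    """Hypothetical stand-in; real OpenML tasks carry many more fields."""

    def __init__(self, task_id, number_folds):
        self.task_id = task_id
        self._number_folds = number_folds

    @property
    def estimation_parameters(self):
        return {'number_folds': self._number_folds}

    def download_split(self):
        return None


records = [to_record(task) for task in (ToyTask(2, '10'), ToyTask(3, '5'))]
```

A list of dictionaries like records can be passed directly to pandas.DataFrame to get one row per object.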

## Evaluations

Since the evaluations workflow is a bit different, let's also take a look at the information that we retrieve when using the dir() and vars() functions on the evaluation objects.

While the datasets/runs/tasks did not require any information when listing, we are required to provide an evaluation measure when listing evaluations. To do this, we first query the different evaluation measures available.

In [25]:
evaluation_measures = openml.evaluations.list_evaluation_measures()

We can then use one of the measures returned to retrieve some evaluations of that type using the list_evaluations() function.

In [26]:
sample_measure = evaluation_measures[0]
evaluation_list = openml.evaluations.list_evaluations(sample_measure, size=2)
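
Taking index 0 picks whichever measure happens to come first. When a particular measure is wanted, it can be selected by name with a fallback, as in the sketch below. The pick_measure helper is hypothetical, and the list is a stand-in containing two measure names that appear elsewhere in this walkthrough.

```python
def pick_measure(available, preferred='predictive_accuracy'):
    """Return `preferred` if it is offered, otherwise the first available measure."""
    if not available:
        raise ValueError('no evaluation measures were returned')
    return preferred if preferred in available else available[0]


# Stand-in for the list of measure names; not a live OpenML response
measures = ['area_under_roc_curve', 'predictive_accuracy']

chosen = pick_measure(measures)
fallback = pick_measure(measures, preferred='not_a_real_measure')
```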

Similar to before, we grab one of the OrderedDict keys and take a look at an example evaluation object.

In [27]:
evaluation_list_keys = evaluation_list.keys()
sample_evaluation_key = list(evaluation_list_keys)[0]
sample_evaluation_info = evaluation_list[sample_evaluation_key]

In [28]:
sample_evaluation_info

OpenML Evaluation
Run ID.........: 62
OpenML Run URL.: https://www.openml.org/r/62
Task ID........: 1
OpenML Flow URL: https://www.openml.org/f/76
Setup ID.......: 17
Data ID........: 1
Data Name......: anneal
OpenML Data URL: https://www.openml.org/d/1
Metric Used....: area_under_roc_curve
Result.........: 0.995034

Taking a look at a sample evaluation, we can see that it differs from what list_datasets() returned. Instead of a dictionary of information, we receive an OpenML object, similar to what get_dataset() returned. There is no further function for evaluations that returns additional information, so this is the terminal result that we can retrieve.

Again, let's take a look at what the vars() and dir() functions give us for this object.

In [29]:
sample_evaluation_vars = vars(sample_evaluation_info)

In [30]:
sample_evaluation_vars

{'run_id': 62,
 'task_id': 1,
 'setup_id': 17,
 'flow_id': 76,
 'flow_name': 'weka.Bagging_REPTree(1)',
 'data_id': 1,
 'data_name': 'anneal',
 'function': 'area_under_roc_curve',
 'upload_time': '2014-04-06 23:57:45',
 'uploader': 1,
 'uploader_name': 'Jan van Rijn',
 'value': 0.995034,
 'values': None,
 'array_data': '[0.93111,0.999975,0.994856,0.0,1,0.990326]'}
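
Note that array_data arrives as a string rather than a list. The value shown above happens to be valid JSON, so one way to turn it into numbers is json.loads(). The sketch below assumes that other array_data strings follow the same format, which is not verified here; parse_array_data is a hypothetical helper.

```python
import json


def parse_array_data(raw):
    """Parse an array_data string into a list, passing None through."""
    if raw is None:
        return None
    return json.loads(raw)


# String copied from the output above
array_data = '[0.93111,0.999975,0.994856,0.0,1,0.990326]'
parsed_values = parse_array_data(array_data)
```

json.loads() keeps the bare 1 as an int while the other entries become floats, so a conversion with float() may be wanted before numeric work.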

In [31]:
sample_evaluation_dir = dir(sample_evaluation_info)

In [32]:
sample_evaluation_dir

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'array_data',
 'data_id',
 'data_name',
 'flow_id',
 'flow_name',
 'function',
 'run_id',
 'setup_id',
 'task_id',
 'upload_time',
 'uploader',
 'uploader_name',
 'value',
 'values']

The dir() results for the sample evaluation don't appear to provide much more useful information than what is present in the vars, but let's take a look at what attributes we get.

In [33]:
evaluation_attributes = get_attributes(sample_evaluation_info)

In [34]:
set(evaluation_attributes).difference(sample_evaluation_vars.keys())

set()

As hinted at above, there is no difference between retrieving attributes with our own get_attributes() function and simply using vars(). Because of this, the evaluations portion of our code uses vars() instead, as a time-saving measure.
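
A minimal sketch of that vars()-based approach is shown below, using a toy class in place of real evaluation objects. ToyEvaluation is hypothetical, and the second entry is a made-up placeholder; only the first entry reuses values from the output above.

```python
from collections import OrderedDict


class ToyEvaluation:
    """Hypothetical stand-in with a few of the fields listed above."""

    def __init__(self, run_id, function, value):
        self.run_id = run_id
        self.function = function
        self.value = value


# Stand-in for the listing of evaluations; not a live OpenML response
evaluations = OrderedDict([
    (62, ToyEvaluation(62, 'area_under_roc_curve', 0.995034)),
    (63, ToyEvaluation(63, 'area_under_roc_curve', 0.5)),  # placeholder values
])

# One plain dict per evaluation; dict() copies so the records are independent
records = [dict(vars(evaluation)) for evaluation in evaluations.values()]
```

Each record is a copy, so changing an evaluation object afterwards does not alter what has already been collected.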