In [None]:
%matplotlib inline


# Tasks

A tutorial on how to list, download and create tasks.


In [1]:
# License: BSD 3-Clause

import openml
from openml.tasks import TaskType
import pandas as pd

Tasks are identified by IDs and can be accessed in two different ways:

1. In a list providing basic information on all tasks available on OpenML.
   This function does not download the actual tasks; it downloads only the
   metadata, which can be used to filter the tasks and retrieve a set of IDs.
   The list can be filtered, for example to tasks carrying a specific tag or
   to tasks of a specific type such as *supervised classification*.
2. A single task by its ID. It contains all meta information, the target
   metric, the splits and an iterator which can be used to access the
   splits in a useful manner.



## Listing tasks

We will start by simply listing only *supervised classification* tasks:



In [2]:
tasks = openml.tasks.list_tasks(task_type=TaskType.SUPERVISED_CLASSIFICATION)

**openml.tasks.list_tasks()** returns a dictionary of dictionaries by default, which we convert
into a
`pandas dataframe <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html>`_
to have better visualization capabilities and easier access:



In [3]:
tasks = pd.DataFrame.from_dict(tasks, orient="index")
print(tasks.columns)
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

# As conversion to a pandas dataframe is a common task, we have added this functionality to the
# OpenML-Python library which can be used by passing ``output_format='dataframe'``:
tasks_df = openml.tasks.list_tasks(
    task_type=TaskType.SUPERVISED_CLASSIFICATION, output_format="dataframe"
)
print(tasks_df.head())

Index(['tid', 'ttid', 'did', 'name', 'task_type', 'status',
       'estimation_procedure', 'evaluation_measures', 'source_data',
       'target_feature', 'MajorityClassSize', 'MaxNominalAttDistinctValues',
       'MinorityClassSize', 'NumberOfClasses', 'NumberOfFeatures',
       'NumberOfInstances', 'NumberOfInstancesWithMissingValues',
       'NumberOfMissingValues', 'NumberOfNumericFeatures',
       'NumberOfSymbolicFeatures', 'cost_matrix'],
      dtype='object')
First 5 of 3750 tasks:
   tid                                ttid  did        name  \
2    2  TaskType.SUPERVISED_CLASSIFICATION    2      anneal   
3    3  TaskType.SUPERVISED_CLASSIFICATION    3    kr-vs-kp   
4    4  TaskType.SUPERVISED_CLASSIFICATION    4       labor   
5    5  TaskType.SUPERVISED_CLASSIFICATION    5  arrhythmia   
6    6  TaskType.SUPERVISED_CLASSIFICATION    6      letter   

                   task_type  status     estimation_procedure  \
2  Supervised Classification  active  10-fold Crossvalidation 

We can filter the list of tasks to keep only those whose dataset has more than
500 but fewer than 1000 samples:



In [4]:
filtered_tasks = tasks.query("NumberOfInstances > 500 and NumberOfInstances < 1000")
print(list(filtered_tasks.index))

[2, 11, 15, 29, 37, 41, 49, 53, 232, 241, 245, 259, 267, 271, 279, 283, 1766, 1775, 1779, 1793, 1801, 1805, 1813, 1817, 1882, 1891, 1895, 1909, 1917, 1921, 1929, 1933, 1945, 1952, 1956, 1967, 1973, 1977, 1983, 1987, 2079, 2125, 2944, 3022, 3034, 3047, 3049, 3053, 3054, 3055, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 4189, 4191, 4197, 4198, 4199, 4217, 4223, 4225, 4226, 4234, 4240, 4254, 4265, 4266, 4288, 4328, 4341, 4345, 4365, 4395, 4396, 4397, 4409, 4411, 4423, 4499, 4508, 4515, 4517, 4518, 4519, 4522, 4538, 4557, 4558, 4562, 4565, 4572, 4582, 4584, 4591, 4618, 4676, 4684, 4697, 4704, 7286, 7307, 7543, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 12738, 12739, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 14583

In [5]:
# Number of tasks
print(len(filtered_tasks))

295


Then, we can further restrict the tasks to all have the same resampling strategy:



In [6]:
filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(filtered_tasks.index))

[2, 11, 15, 29, 37, 41, 49, 53, 2079, 3022, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 7286, 7307, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146818, 146819, 168300, 168907, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146, 233171, 359953, 359954, 359955, 360857, 360865, 360868, 360869, 360951, 360953, 360964]


In [7]:
# Number of tasks
print(len(filtered_tasks))

124
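
The two `query` calls above amount to a conjunction of conditions on each row: strict bounds on the number of instances, followed by a match on the resampling strategy. As a minimal sketch of that logic, the snippet below applies the same conditions in plain Python to a small hand-made dictionary of dictionaries shaped like the default return value of `list_tasks`. The three entries and their values are invented for illustration and are not real OpenML metadata.

```python
# Hand-made stand-in for the dict-of-dicts returned by list_tasks();
# the IDs and values are illustrative only, not real OpenML metadata.
sample_tasks = {
    101: {"tid": 101, "NumberOfInstances": 750,
          "estimation_procedure": "10-fold Crossvalidation"},
    102: {"tid": 102, "NumberOfInstances": 1000,
          "estimation_procedure": "10-fold Crossvalidation"},
    103: {"tid": 103, "NumberOfInstances": 600,
          "estimation_procedure": "33% Holdout set"},
}


def keep(task):
    """Mirror the two query() filters: strict size bounds, then resampling strategy."""
    n = task.get("NumberOfInstances")
    return (
        n is not None
        and 500 < n < 1000
        and task.get("estimation_procedure") == "10-fold Crossvalidation"
    )


selected_ids = [tid for tid, task in sample_tasks.items() if keep(task)]
print(selected_ids)
```

Entry 102 is dropped because both bounds are strict, and entry 103 is dropped because its resampling strategy differs.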


Resampling strategies can be found on the
`OpenML Website <https://www.openml.org/search?type=measure&q=estimation%20procedure>`_.

Similar to listing tasks by task type, we can list tasks by tags:



In [8]:
tasks = openml.tasks.list_tasks(tag="OpenML100", output_format="dataframe")
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

First 5 of 91 tasks:
    tid                                ttid  did           name  \
3     3  TaskType.SUPERVISED_CLASSIFICATION    3       kr-vs-kp   
6     6  TaskType.SUPERVISED_CLASSIFICATION    6         letter   
11   11  TaskType.SUPERVISED_CLASSIFICATION   11  balance-scale   
12   12  TaskType.SUPERVISED_CLASSIFICATION   12  mfeat-factors   
14   14  TaskType.SUPERVISED_CLASSIFICATION   14  mfeat-fourier   

                    task_type  status     estimation_procedure source_data  \
3   Supervised Classification  active  10-fold Crossvalidation           3   
6   Supervised Classification  active  10-fold Crossvalidation           6   
11  Supervised Classification  active  10-fold Crossvalidation          11   
12  Supervised Classification  active  10-fold Crossvalidation          12   
14  Supervised Classification  active  10-fold Crossvalidation          14   

   target_feature  MajorityClassSize  MaxNominalAttDistinctValues  \
3           class               1669  

In [17]:
tasks

[OpenML Classification Task
 Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
 Task ID..............: 2
 Task URL.............: https://www.openml.org/t/2
 Estimation Procedure.: crossvalidation
 Evaluation Measure...: predictive_accuracy
 Target Feature.......: class
 # of Classes.........: 6
 Cost Matrix..........: Available,
 OpenML Classification Task
 Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
 Task ID..............: 1891
 Task URL.............: https://www.openml.org/t/1891
 Estimation Procedure.: crossvalidation
 Evaluation Measure...: predictive_accuracy
 Target Feature.......: class
 # of Classes.........: 3
 Cost Matrix..........: Available,
 OpenML Classification Task
 Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
 Task ID..............: 31
 Task URL.............: https://www.openml.org/t/31
 Estimation Procedure.: crossvalidation
 Target Feature.......: class
 #

Furthermore, we can list tasks based on the dataset id:



In [18]:
tasks = openml.tasks.list_tasks(data_id=1471, output_format="dataframe")
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

First 5 of 24 tasks:
         tid                                ttid   did           name  \
9983    9983  TaskType.SUPERVISED_CLASSIFICATION  1471  eeg-eye-state   
14951  14951  TaskType.SUPERVISED_CLASSIFICATION  1471  eeg-eye-state   
56483  56483         TaskType.SUBGROUP_DISCOVERY  1471  eeg-eye-state   
56484  56484         TaskType.SUBGROUP_DISCOVERY  1471  eeg-eye-state   
56485  56485         TaskType.SUBGROUP_DISCOVERY  1471  eeg-eye-state   

                       task_type  status     estimation_procedure source_data  \
9983   Supervised Classification  active  10-fold Crossvalidation        1471   
14951  Supervised Classification  active  10-fold Crossvalidation        1471   
56483         Subgroup Discovery  active                      NaN        1471   
56484         Subgroup Discovery  active                      NaN        1471   
56485         Subgroup Discovery  active                      NaN        1471   

      target_feature  MajorityClassSize  ...  NumberO
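
A single dataset can back several task types, as the listing above shows for dataset 1471. A quick way to summarise such a listing is to count the entries per task type. The sketch below does this in plain Python on the five rows printed above, copied by hand into a dictionary of dictionaries and reduced to two fields; it does not contact the server.

```python
from collections import Counter

# The first five rows of the listing above, reduced to two fields.
rows = {
    9983: {"tid": 9983, "task_type": "Supervised Classification"},
    14951: {"tid": 14951, "task_type": "Supervised Classification"},
    56483: {"tid": 56483, "task_type": "Subgroup Discovery"},
    56484: {"tid": 56484, "task_type": "Subgroup Discovery"},
    56485: {"tid": 56485, "task_type": "Subgroup Discovery"},
}

# Count how many of the listed tasks fall under each task type.
counts = Counter(row["task_type"] for row in rows.values())
for task_type, n in counts.most_common():
    print(f"{task_type}: {n}")
```

On the DataFrame form, `tasks["task_type"].value_counts()` should give the same kind of summary.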

In addition, a size limit and an offset can be applied both separately and simultaneously:



In [19]:
tasks = openml.tasks.list_tasks(size=10, offset=50, output_format="dataframe")
print(tasks)

    tid                                ttid  did             name  \
59   59  TaskType.SUPERVISED_CLASSIFICATION   61             iris   
60   60  TaskType.SUPERVISED_CLASSIFICATION   62              zoo   
62   62             TaskType.LEARNING_CURVE    2           anneal   
63   63             TaskType.LEARNING_CURVE    3         kr-vs-kp   
64   64             TaskType.LEARNING_CURVE    4            labor   
65   65             TaskType.LEARNING_CURVE    5       arrhythmia   
66   66             TaskType.LEARNING_CURVE    7        audiology   
67   67             TaskType.LEARNING_CURVE    8  liver-disorders   
68   68             TaskType.LEARNING_CURVE    9            autos   
69   69             TaskType.LEARNING_CURVE   10            lymph   

                    task_type  status             estimation_procedure  \
59  Supervised Classification  active          10-fold Crossvalidation   
60  Supervised Classification  active          10-fold Crossvalidation   
62             Lea
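
Used together, `size` and `offset` allow a long listing to be fetched page by page. The loop below is a minimal sketch of that pattern. `fetch_page` is a hypothetical stand-in that slices a local list so the example runs offline; in real use it would be replaced by a call such as `openml.tasks.list_tasks(size=size, offset=offset, output_format="dataframe")`. How the real function behaves once the offset runs past the end of the listing is not covered here.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[int]], size: int
) -> Iterator[List[int]]:
    """Yield successive pages until the source returns fewer than `size` items."""
    offset = 0
    while True:
        page = fetch_page(size, offset)
        if not page:
            break
        yield page
        if len(page) < size:
            break  # a short page means the listing is exhausted
        offset += size


# Hypothetical stand-in for the server: 25 fake task IDs held in memory.
FAKE_IDS = list(range(1, 26))


def fetch_page(size: int, offset: int) -> List[int]:
    """Return at most `size` fake IDs, skipping the first `offset` of them."""
    return FAKE_IDS[offset:offset + size]


pages = list(iter_pages(fetch_page, size=10))
print([len(p) for p in pages])
```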

**OpenML 100**
is a curated list of 100 tasks to start using OpenML. They are all
supervised classification tasks with more than 500 and fewer than 50000
instances per task. To make things easier, the tasks contain neither highly
unbalanced nor sparse data. However, the tasks do include missing values and
categorical features. You can find out more about the *OpenML 100* on
`the OpenML benchmarking page <https://docs.openml.org/benchmark/>`_.

Finally, it is also possible to list all tasks on OpenML with:



In [20]:
tasks = openml.tasks.list_tasks(output_format="dataframe")
print(len(tasks))

46483


### Exercise

Search for the tasks on the 'eeg-eye-state' dataset.



In [21]:
tasks.query('name=="eeg-eye-state"')

Unnamed: 0,tid,ttid,did,name,task_type,status,estimation_procedure,evaluation_measures,source_data,target_feature,...,NumberOfNumericFeatures,NumberOfSymbolicFeatures,number_samples,cost_matrix,source_data_labeled,target_feature_event,target_feature_left,target_feature_right,quality_measure,target_value
3511,9983,TaskType.SUPERVISED_CLASSIFICATION,1471,eeg-eye-state,Supervised Classification,active,10-fold Crossvalidation,,1471,Class,...,14.0,1.0,,,,,,,,
4692,14951,TaskType.SUPERVISED_CLASSIFICATION,1471,eeg-eye-state,Supervised Classification,active,10-fold Crossvalidation,,1471,Class,...,14.0,1.0,,,,,,,,
8032,56483,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Cortana Quality,1.0
8033,56484,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Information gain,1.0
8034,56485,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Binomial test,1.0
8035,56486,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Jaccard,1.0
8036,56487,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Cortana Quality,2.0
8037,56488,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Information gain,2.0
8038,56489,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Binomial test,2.0
8039,56490,TaskType.SUBGROUP_DISCOVERY,1471,eeg-eye-state,Subgroup Discovery,active,,,1471,Class,...,14.0,1.0,,,,,,,Jaccard,2.0
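
The empty cells in the output above suggest that not every task type reports the same fields: the subgroup discovery rows fill `quality_measure` and `target_value`, while the classification rows fill `estimation_procedure`. When working with the dictionary form rather than a DataFrame, a field may therefore be missing, and `dict.get` is safer than indexing. The sketch below uses two hand-made entries modelled on tasks 9983 and 56483 above. Whether the library omits absent fields or stores them as empty values is an assumption here.

```python
# Two entries modelled on the rows above; only a few fields are kept.
listing = {
    9983: {
        "tid": 9983,
        "task_type": "Supervised Classification",
        "estimation_procedure": "10-fold Crossvalidation",
    },
    56483: {
        "tid": 56483,
        "task_type": "Subgroup Discovery",
        "quality_measure": "Cortana Quality",
    },
}

# Union of all field names, comparable to the columns of the DataFrame.
all_fields = sorted({key for entry in listing.values() for key in entry})
print(all_fields)

# dict.get returns None instead of raising KeyError for a missing field.
for tid, entry in listing.items():
    print(tid, entry.get("estimation_procedure"))
```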


## Downloading tasks

We provide two functions to download tasks, one which downloads only a
single task by its ID, and one which takes a list of IDs and downloads
all of these tasks:



In [22]:
task_id = 31
task = openml.tasks.get_task(task_id)

Properties of the task are stored as member variables:



In [23]:
print(task)

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 31
Task URL.............: https://www.openml.org/t/31
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available


To download several tasks at once, pass a list of IDs to **openml.tasks.get_tasks()**:



In [24]:
ids = [2, 1891, 31, 9983]
tasks = openml.tasks.get_tasks(ids)
print(tasks[0])

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 2
Task URL.............: https://www.openml.org/t/2
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 6
Cost Matrix..........: Available


## Creating tasks

You can also create new tasks. Take the following into account:

* You can only create tasks on *active* datasets
* For now, only the following tasks are supported: classification, regression,
  clustering, and learning curve analysis.
* For now, tasks can only be created on a single dataset.
* The exact same task must not already exist.

Creating a task requires the following input:

* task_type: The task type ID (see below). Required.
* dataset_id: The dataset ID. Required.
* target_name: The name of the attribute you aim to predict. Optional.
* estimation_procedure_id: The ID of the estimation procedure used to create train-test
  splits. Optional.
* evaluation_measure: The name of the evaluation measure. Optional.
* Any additional inputs for specific tasks

It is best to leave the evaluation measure open unless a specific measure is strictly
required. OpenML will always compute all appropriate measures and you can filter
or sort results on your favourite measure afterwards. Only add an evaluation measure if
necessary (e.g. when other measures make no sense), since specifying one creates a new
task, which scatters results across tasks.
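
One way to follow this advice in code is to collect the inputs in a dictionary and drop every optional entry that was left unset, so that an evaluation measure is only sent when it was chosen deliberately. `build_task_arguments` below is a hypothetical helper, not part of OpenML-Python. The parameter names are the ones listed above, and the task type is passed as a plain string only to keep the sketch free of third-party imports.

```python
from typing import Any, Dict, Optional


def build_task_arguments(
    task_type: Any,
    dataset_id: int,
    target_name: Optional[str] = None,
    estimation_procedure_id: Optional[int] = None,
    evaluation_measure: Optional[str] = None,
) -> Dict[str, Any]:
    """Return keyword arguments for task creation, omitting unset optional inputs."""
    arguments = {
        "task_type": task_type,
        "dataset_id": dataset_id,
        "target_name": target_name,
        "estimation_procedure_id": estimation_procedure_id,
        "evaluation_measure": evaluation_measure,
    }
    required = {"task_type", "dataset_id"}
    # Keep required inputs unconditionally; keep optional ones only if set.
    return {k: v for k, v in arguments.items() if k in required or v is not None}


# Evaluation measure left open, as recommended above.
args = build_task_arguments(
    "SUPERVISED_CLASSIFICATION",  # placeholder for a real TaskType member
    dataset_id=128,
    target_name="class",
    estimation_procedure_id=1,
)
print(args)
```

The resulting dictionary could then be unpacked into a task-creation call, with a real `TaskType` member in place of the placeholder string.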



### Example

Let's create a classification task on a dataset. In this example we will do this on the
Iris dataset (ID=128 on the test server). We'll use 10-fold cross-validation (ID=1),
and *predictive accuracy* as the predefined measure (this can also be left open).
If a task with these parameters already exists, the server raises an exception.
If no such task exists, a task will be created and the corresponding task_id
will be returned.



In [27]:
# NOTE: dataset ID 128 refers to Iris on the OpenML test server, so this cell
# assumes the client has been configured to use that server.
try:
    my_task = openml.tasks.create_task(
        task_type=TaskType.SUPERVISED_CLASSIFICATION,
        dataset_id=128,
        target_name="class",
        evaluation_measure="predictive_accuracy",
        estimation_procedure_id=1,
    )
    my_task.publish()
except openml.exceptions.OpenMLServerException as e:
    # Error code for 'task already exists'
    if e.code == 614:
        # Look up the existing task with the same parameters
        tasks = openml.tasks.list_tasks(data_id=128, output_format="dataframe")
        tasks = tasks.query(
            'task_type == "Supervised Classification" '
            'and estimation_procedure == "10-fold Crossvalidation" '
            'and evaluation_measures == "predictive_accuracy"'
        )
        task_id = tasks.loc[:, "tid"].values[0]
        print("Task already exists. Task ID is", task_id)
    else:
        # Any other server error is unexpected here, so do not swallow it
        raise



* `Complete list of task types <https://www.openml.org/search?type=task_type>`_.
* `Complete list of model estimation procedures <https://www.openml.org/search?q=%2520measure_type%3Aestimation_procedure&type=measure>`_.
* `Complete list of evaluation measures <https://www.openml.org/search?q=measure_type%3Aevaluation_measure&type=measure>`_.


