In [None]:
%matplotlib inline


# Benchmark suites

This is a brief showcase of OpenML benchmark suites, which were introduced by
[Bischl et al. (2019)](https://arxiv.org/abs/1708.03731v2). A benchmark suite standardizes the
datasets and splits to be used in an experiment or paper. Suites are fully integrated into OpenML
and make it easier to share both the experimental setup and the results.


In [18]:
# License: BSD 3-Clause

import openml
from sklearn.datasets import fetch_openml

## OpenML-CC18

As an example, we look at OpenML-CC18, a suite of 72 classification datasets from OpenML that
were carefully selected to be usable by many algorithms and to represent datasets commonly used
in machine learning research. The suite contains all OpenML datasets as of mid-2018 that
satisfy a large set of clear requirements for thorough yet practical benchmarking:

1. the number of observations is between 500 and 100,000, to focus on medium-sized datasets,
2. the number of features does not exceed 5,000, to keep the runtime of the algorithms low,
3. the target attribute has at least two classes, with no class having fewer than 20
   observations,
4. the ratio of the minority class to the majority class is above 0.05 (to eliminate highly
   imbalanced datasets, which require special treatment for both algorithms and evaluation
   measures).

A full description can be found in the
[OpenML benchmarking docs](https://docs.openml.org/benchmark/#openml-cc18).
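The four requirements listed above can be written as simple checks. The following is a minimal sketch, not the selection code used to build the suite: the function name `meets_cc18_criteria`, the result keys, and the toy labels are hypothetical. It also assumes that "minority" and "majority" refer to the smallest and largest class.

```python
from collections import Counter


def meets_cc18_criteria(labels, n_features):
    """Check the four listed requirements for one dataset.

    `labels` holds one class label per observation and `n_features` is the
    number of input features. Returns a dict mapping each check to a bool.
    """
    counts = Counter(labels)
    n_observations = len(labels)
    smallest = min(counts.values()) if counts else 0
    largest = max(counts.values()) if counts else 0
    return {
        # 1. between 500 and 100,000 observations
        "size": 500 <= n_observations <= 100_000,
        # 2. at most 5,000 features
        "features": n_features <= 5_000,
        # 3. at least two classes, each with at least 20 observations
        "classes": len(counts) >= 2 and smallest >= 20,
        # 4. smallest-to-largest class ratio above 0.05 (assumed reading)
        "balance": largest > 0 and smallest / largest > 0.05,
    }


# Hypothetical toy labels: 900 observations of class "a", 100 of class "b".
labels = ["a"] * 900 + ["b"] * 100
checks = meets_cc18_criteria(labels, n_features=30)
print(checks)
print(all(checks.values()))
```

Returning one flag per requirement, rather than a single boolean, makes it easy to see which requirement a candidate dataset fails.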

In this example we'll focus on how to use benchmark suites in practice.



## Downloading benchmark suites



In [3]:
suite = openml.study.get_suite(99)
print(suite)

OpenML Benchmark Suite
ID..............: 99
Name............: OpenML-CC18 Curated Classification benchmark
Status..........: active
Main Entity Type: task
Study URL.......: https://www.openml.org/s/99
# of Data.......: 72
# of Tasks......: 72
Creator.........: https://www.openml.org/u/1
Upload Time.....: 2019-02-21 18:47:13


Fetching a benchmark suite does not download the included tasks and datasets; the suite object
only holds the list of IDs of the tasks that constitute it.

Tasks can then be accessed via



In [4]:
tasks = suite.tasks
print(tasks)

[3, 6, 11, 12, 14, 15, 16, 18, 22, 23, 28, 29, 31, 32, 37, 43, 45, 49, 53, 219, 2074, 2079, 3021, 3022, 3481, 3549, 3560, 3573, 3902, 3903, 3904, 3913, 3917, 3918, 7592, 9910, 9946, 9952, 9957, 9960, 9964, 9971, 9976, 9977, 9978, 9981, 9985, 10093, 10101, 14952, 14954, 14965, 14969, 14970, 125920, 125922, 146195, 146800, 146817, 146819, 146820, 146821, 146822, 146824, 146825, 167119, 167120, 167121, 167124, 167125, 167140, 167141]


and iterated over for benchmarking. For speed reasons we only iterate over the first three tasks:



In [5]:
for task_id in tasks[:3]:
    task = openml.tasks.get_task(task_id)
    print(task)

OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 3
Task URL.............: https://www.openml.org/t/3
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 6
Task URL.............: https://www.openml.org/t/6
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 26
Cost Matrix..........: Available
OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 11
Task URL.............: https://www.openml.org/t/11
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 3
Cost Matrix..........: Available
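In a real benchmark, the loop over task IDs would evaluate a model on each task instead of just printing it. The sketch below shows one possible shape for such a loop. The names `run_benchmark`, `fake_fetch`, and `fake_evaluate` are hypothetical and not part of OpenML. The fetch and evaluation steps are passed in as functions so the loop can be tried offline with stand-ins.

```python
from statistics import mean


def run_benchmark(task_ids, fetch_task, evaluate):
    """Fetch each task by ID, evaluate it, and collect one score per task.

    `fetch_task` maps a task ID to a task object, and `evaluate` maps a task
    object to a list of per-fold scores. Both are supplied by the caller.
    """
    results = {}
    for task_id in task_ids:
        task = fetch_task(task_id)
        fold_scores = evaluate(task)
        # Summarize the per-fold scores as their arithmetic mean.
        results[task_id] = mean(fold_scores)
    return results


# Offline stand-ins (hypothetical) so the loop runs without a server.
def fake_fetch(task_id):
    return {"task_id": task_id}


def fake_evaluate(task):
    return [0.5, 1.0]


print(run_benchmark([3, 6, 11], fake_fetch, fake_evaluate))
```

With the real library, `openml.tasks.get_task` could serve as `fetch_task` and `suite.tasks[:3]` as `task_ids`. For scikit-learn models, `openml.runs.run_model_on_task` is one way to implement the evaluation step, though its exact arguments and requirements depend on the installed version.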


In [14]:
data_2 = openml.datasets.get_dataset(3, download_data=True)

In [29]:
data = fetch_openml(data_id=1088)

In [31]:
data.data

,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var54666,Var54667,Var54668,Var54669,Var54670,Var54671,Var54672,Var54673,Var54674,Var54675
0,1.061,1.369,2.666,0.649,3.778,1.859,2.016,0.873,1.712,1.849,...,2.211,1.653,1.379,2.801,1.231,3.228,0.495,1.971,2.896,0.796
1,0.864,1.077,3.020,1.345,3.486,2.060,1.686,1.412,1.440,0.603,...,1.474,1.156,0.812,2.294,1.513,2.893,0.855,2.013,2.833,0.961
2,1.526,1.764,2.202,1.740,3.593,1.767,1.360,1.109,2.109,1.495,...,2.313,2.548,0.869,2.731,1.664,3.629,0.930,2.200,3.224,1.253
3,1.507,1.225,2.531,1.110,3.528,2.597,0.843,1.651,1.402,1.688,...,2.304,3.483,1.378,2.788,2.398,3.216,0.591,2.001,2.770,0.981
4,0.836,1.320,2.572,0.702,3.511,1.639,2.453,1.663,1.209,1.327,...,1.787,1.861,0.936,2.315,1.438,3.280,1.126,2.219,2.850,0.581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378,1.299,2.929,0.929,1.258,3.459,2.805,3.091,1.973,0.775,0.799,...,2.193,2.974,1.066,2.434,1.134,3.428,0.895,2.343,2.641,0.965
379,1.001,1.444,2.727,0.717,3.705,2.994,1.206,0.712,0.303,1.072,...,2.311,3.075,1.442,2.809,1.719,3.686,0.965,1.734,2.589,0.836
380,0.835,1.345,0.893,1.643,3.648,1.479,3.661,2.283,0.464,1.586,...,1.939,3.182,1.218,2.983,1.956,3.297,0.500,2.055,2.723,0.943
381,1.458,1.463,1.657,0.944,3.495,1.983,2.086,1.159,1.639,0.867,...,2.567,3.141,1.035,2.782,1.798,3.534,0.892,2.097,2.995,0.824


In [25]:
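import openml
# A bare `import sklearn` does not load submodules such as `sklearn.impute`;
# import the submodule explicitly to avoid an AttributeError on `sklearn.impute`.
import sklearn.impute
from sklearn.ensemble import RandomForestClassifier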

## Further examples

* `sphx_glr_examples_30_extended_suites_tutorial.py`
* `sphx_glr_examples_30_extended_study_tutorial.py`
* `sphx_glr_examples_40_paper_2018_ida_strang_example.py`

