# OpenML in Python 
Joaquin Vanschoren @joavanschoren

## Authentication
* Create an OpenML account on http://www.openml.org
* Go to your OpenML account, click 'Account Settings', and then 'API authentication' to find your API key.
<center><img src="images/openml_login.png" width="800"></center>
* Create a small file at ~/.openml/config to store the API key and the cache directory locally:

In [None]:
apikey=FILL_IN_API_KEY
cachedir=FILL_IN_CACHE_DIR
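The config file is a plain list of `key=value` lines. Below is a minimal sketch of writing such a file and reading it back with only the standard library. The helper names `write_config` and `read_config` are hypothetical, the values are placeholders, and a temporary directory stands in for the home directory so that no real file is overwritten. How the `openml` package itself parses the file is not shown here.

```python
import os
import tempfile


def write_config(path, apikey, cachedir):
    """Write the two key=value lines described above (hypothetical helper)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        fh.write("apikey=%s\n" % apikey)
        fh.write("cachedir=%s\n" % cachedir)


def read_config(path):
    """Parse key=value lines into a dict, skipping lines without '='."""
    settings = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# A temporary directory stands in for the home directory in this sketch.
demo_home = tempfile.mkdtemp()
config_path = os.path.join(demo_home, ".openml", "config")
write_config(config_path, "FILL_IN_API_KEY", os.path.join(demo_home, "cache"))

config = read_config(config_path)
print(sorted(config))
```

To create the real file, pass a path under your own home directory and your actual API key instead of the placeholders.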

# Data

## List ALL the datasets

In [7]:
from openml import datasets, tasks, runs
import numpy as np
import pandas as pd
import os

datalist = datasets.list_datasets()

data = pd.DataFrame(datalist).transpose()
print("First 10 of %s datasets..." % len(datalist))
print(data[:10][['did','name','NumberOfInstances','NumberOfFeatures']])

First 10 of 19492 datasets...
   did             name NumberOfInstances NumberOfFeatures
1    1           anneal               898               39
2    2           anneal               898               39
3    3         kr-vs-kp              3196               37
4    4            labor                57               17
5    5       arrhythmia               452              280
6    6           letter             20000               17
7    7        audiology               226               70
8    8  liver-disorders               345                7
9    9            autos               205               26
10  10            lymph               148               19
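The listing can also be filtered before any data is downloaded. The sketch below assumes the listing behaves like a dictionary of per-dataset dictionaries keyed by dataset ID, which is what the `pd.DataFrame(datalist).transpose()` call above suggests. The rows are copied from the output above with the counts written as integers (the real listing may use another type), and `select_datasets` is a hypothetical helper.

```python
# A few rows copied from the listing output above, keyed by dataset ID.
listing = {
    1: {"did": 1, "name": "anneal", "NumberOfInstances": 898, "NumberOfFeatures": 39},
    3: {"did": 3, "name": "kr-vs-kp", "NumberOfInstances": 3196, "NumberOfFeatures": 37},
    4: {"did": 4, "name": "labor", "NumberOfInstances": 57, "NumberOfFeatures": 17},
    5: {"did": 5, "name": "arrhythmia", "NumberOfInstances": 452, "NumberOfFeatures": 280},
    6: {"did": 6, "name": "letter", "NumberOfInstances": 20000, "NumberOfFeatures": 17},
}


def select_datasets(listing, min_instances=0, max_features=None):
    """Return the entries within the given size bounds, ordered by dataset ID."""
    selected = []
    for did in sorted(listing):
        entry = listing[did]
        if entry["NumberOfInstances"] < min_instances:
            continue
        if max_features is not None and entry["NumberOfFeatures"] > max_features:
            continue
        selected.append(entry)
    return selected


# Keep datasets with at least 500 instances and at most 50 features.
selected = select_datasets(listing, min_instances=500, max_features=50)
for entry in selected:
    print(entry["did"], entry["name"], entry["NumberOfInstances"])
```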


Download a specific dataset by its dataset ID (the 'did' column in the table above).

In [8]:
dataset = datasets.get_dataset(1471)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])

This is dataset 'eeg-eye-state', the target feature is called 'Class'
URL: http://www.openml.org/data/download/1587924/phplE7q6h
**Author**: Oliver Roesler, it12148'@'lehre.dhbw-stuttgart.de  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State), Baden-Wuerttemberg, Cooperative State University (DHBW), Stuttgart, Germany  
**Please cite**:   

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video fr


Get the actual data: the feature matrix `X`, the target vector `y`, and the attribute names.

In [9]:
X, y, attribute_names = dataset.get_data(target=dataset.default_target_attribute, return_attribute_names=True)
eeg = pd.DataFrame(X, columns=attribute_names)
eeg['class'] = y
print(eeg[:10])

            V1           V2           V3           V4           V5  \
0  4329.229980  4009.229980  4289.229980  4148.209961  4350.259766   
1  4324.620117  4004.620117  4293.850098  4148.720215  4342.049805   
2  4327.689941  4006.669922  4295.379883  4156.410156  4336.919922   
3  4328.720215  4011.790039  4296.410156  4155.899902  4343.589844   
4  4326.149902  4011.790039  4292.310059  4151.279785  4347.689941   
5  4321.029785  4004.620117  4284.100098  4153.330078  4345.640137   
6  4319.490234  4001.030029  4280.509766  4151.790039  4343.589844   
7  4325.640137  4006.669922  4278.459961  4143.080078  4344.100098   
8  4326.149902  4010.770020  4276.410156  4139.490234  4345.129883   
9  4326.149902  4011.280029  4276.919922  4142.049805  4344.100098   

            V6           V7           V8           V9          V10  \
0  4586.149902  4096.919922  4641.029785  4222.049805  4238.459961   
1  4586.669922  4097.439941  4638.970215  4210.770020  4226.669922   
2  4583.589844  409

Train a scikit-learn model on the data (this cell loads dataset 61 rather than the EEG dataset above).

In [10]:
from sklearn import preprocessing, ensemble

dataset = datasets.get_dataset(61)
X, y = dataset.get_data(target=dataset.default_target_attribute)
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

You can also ask which features are categorical, so that you can do your own encoding.

In [11]:
X, y, categorical = dataset.get_data(target=dataset.default_target_attribute, return_categorical_indicator=True)
enc = preprocessing.OneHotEncoder(categorical_features=categorical)
X = enc.fit_transform(X)
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
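To make the role of the categorical indicator concrete, here is a pure-Python sketch of one-hot encoding driven by a boolean mask. It is an illustration, not scikit-learn's implementation: `one_hot_encode` is a hypothetical helper, the toy rows are made up, and the encoded columns are expanded in place, so the column order may differ from what `OneHotEncoder` produces.

```python
def one_hot_encode(rows, categorical):
    """Expand every column flagged in `categorical` into 0/1 indicator columns."""
    # Collect the sorted distinct values of each categorical column.
    categories = {
        j: sorted({row[j] for row in rows})
        for j, is_cat in enumerate(categorical)
        if is_cat
    }
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            if j in categories:
                # One indicator per distinct value of this column.
                out.extend(1 if value == c else 0 for c in categories[j])
            else:
                # Numeric columns are passed through unchanged.
                out.append(value)
        encoded.append(out)
    return encoded


# Toy data: column 0 is categorical (values 0, 1, 2), column 1 is numeric.
rows = [[0, 5.1], [2, 4.9], [1, 6.3], [2, 5.8]]
categorical = [True, False]

encoded = one_hot_encode(rows, categorical)
for row in encoded:
    print(row)
```

Each categorical column with k distinct values becomes k indicator columns, which is why the encoded matrix is wider than the original.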

# Tasks

To run benchmarks consistently (also across studies and tools), OpenML offers Tasks, which include specific train-test splits and other information.

## List ALL the tasks

In [26]:
task_list = tasks.list_tasks(size=10)

mytasks = pd.DataFrame(task_list).transpose()
print("First 5 of %s tasks:" % len(mytasks))
#print(mytasks.columns)
print(mytasks[:5][['tid','did','name','task_type','estimation_procedure']])

First 5 of 10 tasks:
  tid did        name                  task_type     estimation_procedure
1   1   1      anneal  Supervised Classification  10-fold Crossvalidation
2   2   2      anneal  Supervised Classification  10-fold Crossvalidation
3   3   3    kr-vs-kp  Supervised Classification  10-fold Crossvalidation
4   4   4       labor  Supervised Classification  10-fold Crossvalidation
5   5   5  arrhythmia  Supervised Classification  10-fold Crossvalidation


## Download tasks

In [27]:
from pprint import pprint
task = tasks.get_task(10)
pprint(vars(task))

{'class_labels': ['normal', 'metastases', 'malign_lymph', 'fibrosis'],
 'cost_matrix': None,
 'dataset_id': 10,
 'estimation_parameters': {'number_folds': '10',
                           'number_repeats': '1',
                           'percentage': '',
                           'stratified_sampling': 'true'},
 'estimation_procedure': {'data_splits_url': 'http://www.openml.org/api_splits/get/10/Task_10_splits.arff',
                          'parameters': {'number_folds': '10',
                                         'number_repeats': '1',
                                         'percentage': '',
                                         'stratified_sampling': 'true'},
                          'type': 'crossvalidation'},
 'evaluation_measure': 'predictive_accuracy',
 'target_name': 'class',
 'task_id': 10,
 'task_type': 'Supervised Classification'}
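In the output above, the estimation parameters are all strings, including the numbers and the boolean. A minimal sketch of converting them into Python types is shown below. `parse_estimation_parameters` is a hypothetical helper, and treating the empty string as "not set" is an assumption based only on this one task.

```python
# Copied from the task output above; every value is a string.
estimation_parameters = {
    "number_folds": "10",
    "number_repeats": "1",
    "percentage": "",
    "stratified_sampling": "true",
}


def parse_estimation_parameters(params):
    """Convert string values to None, bool, or int where possible."""
    parsed = {}
    for key, value in params.items():
        if value == "":
            parsed[key] = None  # assumed to mean "not set"
        elif value in ("true", "false"):
            parsed[key] = value == "true"
        else:
            try:
                parsed[key] = int(value)
            except ValueError:
                parsed[key] = value  # leave anything unexpected untouched
    return parsed


parsed = parse_estimation_parameters(estimation_parameters)

# With 10 folds and 1 repeat, the task defines 10 train-test splits.
n_splits = parsed["number_folds"] * parsed["number_repeats"]
print(parsed)
print(n_splits)
```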


# Runs

Run a scikit-learn classifier on the task, using the train-test splits that the task defines.

In [28]:
from openml.runs import run_task

clf = ensemble.RandomForestClassifier()
run = run_task(task, clf)
print("RandomForest has run on the task.")

RandomForest has run on the task.


In [39]:
%%capture
clf = ensemble.RandomForestClassifier()
run = run_task(task, clf)
return_code = run.publish()
print(return_code)

Upload the run to the OpenML server

In [40]:
%%capture
import xmltodict

myrun = run.publish()

#if(return_code == 200):
#    response_dict = xmltodict.parse(response)
#    run_id = response_dict['oml:upload_run']['oml:run_id']
#    print("Uploaded run with id %s" % (run_id))
#    print("Check it at www.openml.org/r/%s" % (run_id))

## OpenML with Python
You can easily run and share scikit-learn experiments on OpenML.

In [41]:
print(myrun.run_id)


1843272


In [36]:
from sklearn import pipeline, ensemble, preprocessing
from openml import tasks, runs, datasets
task = tasks.get_task(59)
pipe = pipeline.Pipeline(steps=[
            ('Imputer', preprocessing.Imputer(strategy='median')),
            ('OneHotEncoder', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore')),
            ('Classifier', ensemble.RandomForestClassifier())
           ])
run = runs.run_task(task, pipe)
#vars(run)
response = run.publish()

<?xml version="1.0" encoding="utf-8"?>
<oml:run xmlns:oml="http://openml.org/openml">
	<oml:task_id>59</oml:task_id>
	<oml:flow_id>5432</oml:flow_id>
	<oml:parameter_setting>
		<oml:name>verbose</oml:name>
		<oml:value>0</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>min_samples_leaf</oml:name>
		<oml:value>1</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>class_weight</oml:name>
		<oml:value>None</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>bootstrap</oml:name>
		<oml:value>True</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>n_jobs</oml:name>
		<oml:value>1</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>oob_score</oml:name>
		<oml:value>False</oml:value>
		<om

OpenMLServerError: OpenML Server error: <oml:error xmlns:oml="http://openml.org/openml">
	<oml:code>213</oml:code>
	<oml:message>Parameter in run xml unknown</oml:message>
		<oml:additional_information>Name: Classifier, flow id (component): 5432</oml:additional_information>
	</oml:error>
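The server reports the failure as a small XML document. The sketch below pulls out the code and message with the standard library's `xml.etree.ElementTree`, using the error text copied from the output above. `parse_server_error` is a hypothetical helper and is not part of the `openml` package.

```python
import xml.etree.ElementTree as ET

# Error body copied from the server response above.
error_xml = """<oml:error xmlns:oml="http://openml.org/openml">
	<oml:code>213</oml:code>
	<oml:message>Parameter in run xml unknown</oml:message>
	<oml:additional_information>Name: Classifier, flow id (component): 5432</oml:additional_information>
</oml:error>"""

# Map the 'oml' prefix to the namespace URI declared in the document.
NS = {"oml": "http://openml.org/openml"}


def parse_server_error(xml_text):
    """Extract code, message, and extra information from an OpenML error document."""
    root = ET.fromstring(xml_text)
    return {
        "code": int(root.findtext("oml:code", namespaces=NS)),
        "message": root.findtext("oml:message", namespaces=NS),
        "additional_information": root.findtext(
            "oml:additional_information", namespaces=NS
        ),
    }


error = parse_server_error(error_xml)
print(error["code"], error["message"])
print(error["additional_information"])
```

Here the additional information names the pipeline step `Classifier` and flow 5432, which points at the pipeline's parameter description rather than at the task.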


In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [42]:
from sklearn import ensemble
from openml import tasks, runs
task = tasks.get_task(14951)
clf = ensemble.RandomForestClassifier()
run = runs.run_task(task, clf)
myrun = run.publish()
print(myrun.run_id)


<?xml version="1.0" encoding="utf-8"?>
<oml:run xmlns:oml="http://openml.org/openml">
	<oml:task_id>14951</oml:task_id>
	<oml:flow_id>4830</oml:flow_id>
	<oml:parameter_setting>
		<oml:name>verbose</oml:name>
		<oml:value>0</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>min_samples_leaf</oml:name>
		<oml:value>1</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>class_weight</oml:name>
		<oml:value>None</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>bootstrap</oml:name>
		<oml:value>True</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>n_jobs</oml:name>
		<oml:value>1</oml:value>
		<oml:component>4830</oml:component>
	</oml:parameter_setting>
	<oml:parameter_setting>
		<oml:name>oob_score</oml:name>
		<oml:value>False</oml:value>
		

In [48]:
print(myrun.detailed_evaluations)

None


In [46]:
myruns = runs.list_runs_by_task(59)
print(len(myruns))
print(myruns)

2422
[{'setup_id': 12, 'run_id': 81, 'uploader': 1, 'flow_id': 67, 'task_id': 59}, {'setup_id': 13, 'run_id': 161, 'uploader': 1, 'flow_id': 70, 'task_id': 59}, {'setup_id': 1, 'run_id': 234, 'uploader': 1, 'flow_id': 56, 'task_id': 59}, {'setup_id': 6, 'run_id': 447, 'uploader': 1, 'flow_id': 61, 'task_id': 59}, {'setup_id': 18, 'run_id': 473, 'uploader': 1, 'flow_id': 77, 'task_id': 59}, {'setup_id': 7, 'run_id': 491, 'uploader': 1, 'flow_id': 62, 'task_id': 59}, {'setup_id': 16, 'run_id': 550, 'uploader': 1, 'flow_id': 75, 'task_id': 59}, {'setup_id': 11, 'run_id': 6088, 'uploader': 2, 'flow_id': 66, 'task_id': 59}, {'setup_id': 12, 'run_id': 6157, 'uploader': 2, 'flow_id': 67, 'task_id': 59}, {'setup_id': 3, 'run_id': 6158, 'uploader': 2, 'flow_id': 58, 'task_id': 59}, {'setup_id': 47, 'run_id': 6159, 'uploader': 2, 'flow_id': 119, 'task_id': 59}, {'setup_id': 3, 'run_id': 6160, 'uploader': 2, 'flow_id': 58, 'task_id': 59}, {'setup_id': 47, 'run_id': 6161, 'uploader': 2, 'flow_id':
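A list of run records like the one above is easier to read once it is summarised. The sketch below counts runs per flow and per uploader with `collections.Counter`, using the first twelve complete records copied from the output. It assumes only that each record is a dictionary with the keys shown.

```python
from collections import Counter

# The first twelve records copied from the output above.
myruns = [
    {"setup_id": 12, "run_id": 81, "uploader": 1, "flow_id": 67, "task_id": 59},
    {"setup_id": 13, "run_id": 161, "uploader": 1, "flow_id": 70, "task_id": 59},
    {"setup_id": 1, "run_id": 234, "uploader": 1, "flow_id": 56, "task_id": 59},
    {"setup_id": 6, "run_id": 447, "uploader": 1, "flow_id": 61, "task_id": 59},
    {"setup_id": 18, "run_id": 473, "uploader": 1, "flow_id": 77, "task_id": 59},
    {"setup_id": 7, "run_id": 491, "uploader": 1, "flow_id": 62, "task_id": 59},
    {"setup_id": 16, "run_id": 550, "uploader": 1, "flow_id": 75, "task_id": 59},
    {"setup_id": 11, "run_id": 6088, "uploader": 2, "flow_id": 66, "task_id": 59},
    {"setup_id": 12, "run_id": 6157, "uploader": 2, "flow_id": 67, "task_id": 59},
    {"setup_id": 3, "run_id": 6158, "uploader": 2, "flow_id": 58, "task_id": 59},
    {"setup_id": 47, "run_id": 6159, "uploader": 2, "flow_id": 119, "task_id": 59},
    {"setup_id": 3, "run_id": 6160, "uploader": 2, "flow_id": 58, "task_id": 59},
]

# Count how many runs each flow and each uploader contributed.
runs_per_flow = Counter(record["flow_id"] for record in myruns)
runs_per_uploader = Counter(record["uploader"] for record in myruns)

# Flows that were run more than once on this task.
repeated_flows = sorted(flow for flow, count in runs_per_flow.items() if count > 1)

print(dict(runs_per_uploader))
print(repeated_flows)
```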