# OpenML in Python 
Joaquin Vanschoren @joavanschoren

## How to...
- Download datasets and tasks in Python and Jupyter
- Use scikit-learn to build classifiers
- Upload the results to the server


# Setup

## Installation
Via source

In [1]:
#! git clone https://github.com/openml/openml-python
#! cd openml-python
#! python setup.py install

## Authentication
* Create an OpenML account on http://www.openml.org
* Go to your OpenML account, click 'Account Settings', and then 'API authentication'. 
<center><img src="images/openml_login.png" width="800"></center>
* Create a little file to store the API key locally: ~/.openml/config

In [None]:
apikey=FILL_IN_API_KEY
cachedir=FILL_IN_CACHE_DIR
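
The config file holds plain `key=value` lines. As a quick sanity check, here is a minimal sketch that parses such text. `parse_config` is a hypothetical helper written for this illustration, not part of the openml package, and treating `#` as a comment marker is an assumption.

```python
def parse_config(text):
    """Parse key=value lines into a dict (hypothetical helper, not an openml API)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines and (assumed) comment lines
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


example = "apikey=FILL_IN_API_KEY\ncachedir=FILL_IN_CACHE_DIR\n"
config = parse_config(example)
print(sorted(config))
```

Running it on your own file (read with `open(...).read()`) shows whether both placeholders were actually filled in.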

## Alternative authentication

In [2]:
import openml

# by default, the API key is read from ~/.openml/config; it can also be set in code
# amueller's read/write key that he will throw away later
openml.config.apikey='610344db6388d9ba34f6db45a3cf71de'
# note: this URL is the production server; point it at a test server to avoid polluting the production system
openml.config.server = "http://www.openml.org/api/v1/xml"

# Data

## List ALL the datasets

In [12]:
from openml import datasets, tasks, runs
import numpy as np
import pandas as pd
import os

datalist = datasets.list_datasets()

data = pd.DataFrame(datalist)
print("First 10 of %s datasets..." % len(datalist))
print(data[:10][['did','name','NumberOfInstances','NumberOfFeatures']])

First 10 of 19538 datasets...
   did             name  NumberOfInstances  NumberOfFeatures
0    1           anneal                898                39
1    2           anneal                898                39
2    3         kr-vs-kp               3196                37
3    4            labor                 57                17
4    5       arrhythmia                452               280
5    6           letter              20000                17
6    7        audiology                226                70
7    8  liver-disorders                345                 7
8    9            autos                205                26
9   10            lymph                148                19


Subset based on any property

In [6]:
bin_data = data.loc[data['NumberOfClasses'] == 2]
print("First 10 of %s datasets..." % len(bin_data))
print(bin_data[:10][['did','name', 'NumberOfInstances','NumberOfFeatures']])

First 10 of 605 datasets...
    did            name  NumberOfInstances  NumberOfFeatures
2     3        kr-vs-kp               3196                37
3     4           labor                 57                17
12   13   breast-cancer                286                10
14   15        breast-w                699                10
23   24        mushroom               8124                23
24   25           colic                368                28
26   27           colic                368                23
28   29        credit-a                690                16
30   31        credit-g               1000                21
32   33  cylinder-bands                540                40


Filter on a numeric property and sort the result

In [13]:
big_data = data.loc[data['NumberOfInstances'] > 60000]
big_data = big_data.sort_values(by='NumberOfInstances', ascending=True)
print("First 10 of %s datasets..." % len(big_data))
print(big_data[:10][['did','name', 'NumberOfInstances']])

         did                          name  NumberOfInstances
1305    1588                           w8a              64700
2424    4533  KEGGMetabolicReactionNetwork              65554
1308    1591                     connect-4              67557
423      554                     mnist_784              70000
1296    1578                      real-sim              72309
1062    1213                       BNG(mv)              78732
19509  23396               COMET_MC_SAMPLE              89640
19508  23395               COMET_MC_SAMPLE              89640
19537  23512                         higgs              98050
2423    4532                         higgs              98050
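
The same filter-then-sort pattern can be sketched without pandas, on plain dictionaries. The rows below are copied by hand from the listing printed earlier; the threshold of 400 instances is an arbitrary choice for illustration.

```python
# A few rows copied from the dataset listing shown earlier in this notebook
rows = [
    {"did": 3, "name": "kr-vs-kp", "NumberOfInstances": 3196, "NumberOfFeatures": 37},
    {"did": 4, "name": "labor", "NumberOfInstances": 57, "NumberOfFeatures": 17},
    {"did": 5, "name": "arrhythmia", "NumberOfInstances": 452, "NumberOfFeatures": 280},
    {"did": 6, "name": "letter", "NumberOfInstances": 20000, "NumberOfFeatures": 17},
]

# keep datasets with more than 400 instances, smallest first
selected = sorted(
    (r for r in rows if r["NumberOfInstances"] > 400),
    key=lambda r: r["NumberOfInstances"],
)
names = [r["name"] for r in selected]
print(names)
```

Any other column can be used in the condition or as the sort key in the same way.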


Download a specific dataset. This is done based on the dataset ID (called 'did' in the table above).

In [10]:
dataset = openml.datasets.get_dataset(23512)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])

This is dataset 'higgs', the target feature is called 'class'
URL: http://www.openml.org/data/download/2063675/phpZLgL9q
**Author**: Daniel Whiteson daniel'@'uci.edu", Assistant Professor, Physics, Univ. of California Irvine  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/HIGGS)  
**Please cite**: Baldi, P., P. Sadowski, and D. Whiteson. Searching for Exotic Particles in High-energy Physics with Deep Learning. Nature Communications 5 (July 2, 2014).  

**Note: This is the UCI Higgs dataset, same as version 1, but it fixes the definition of the class attribute, which is categorical, not numeric.**

Data


In [14]:
from pprint import pprint
pprint(vars(dataset))

{'citation': None,
 'collection_date': None,
 'contributor': None,
 'creator': None,
 'data_file': '/Users/joa/.openml/cache/datasets/23512/dataset.arff',
 'data_pickle_file': '/Users/joa/.openml/cache/datasets/23512/dataset.pkl',
 'default_target_attribute': 'class',
 'description': '**Author**: Daniel Whiteson daniel\'@\'uci.edu", Assistant '
                'Professor, Physics, Univ. of California Irvine  \n'
                '**Source**: '
                '[UCI](https://archive.ics.uci.edu/ml/datasets/HIGGS)  \n'
                '**Please cite**: Baldi, P., P. Sadowski, and D. Whiteson. '
                'Searching for Exotic Particles in High-energy Physics with '
                'Deep Learning. Nature Communications 5 (July 2, 2014).  \n'
                '\n'
                '**Note: This is the UCI Higgs dataset, same as version 1, but '
                'it fixes the definition of the class attribute, which is '
                'categorical, not numeric.**\n'
                '\n'

Get the actual data

In [15]:
X, y, attribute_names = dataset.get_data(target=dataset.default_target_attribute, return_attribute_names=True)
higgs = pd.DataFrame(X, columns=attribute_names)
higgs['class'] = y
print(higgs[:10])

   lepton_pT  lepton_eta  lepton_phi  missing_energy_magnitude  \
0   0.907542    0.329147    0.359412                  1.497970   
1   0.798835    1.470639   -1.635975                  0.453773   
2   1.344385   -0.876626    0.935913                  1.992050   
3   1.105009    0.321356    1.522401                  0.882808   
4   1.595839   -0.607811    0.007075                  1.818450   
5   0.409391   -1.884684   -1.027292                  1.672452   
6   0.933895    0.629130    0.527535                  0.238033   
7   1.405144    0.536603    0.689554                  1.179567   
8   1.176566    0.104161    1.397002                  0.479721   
9   0.945974    1.111244    1.218337                  0.907639   

   missing_energy_phi    jet1pt   jet1eta   jet1phi  jet1b-tag    jet2pt  \
0           -0.313010  1.095531 -0.557525 -1.588230   2.173076  0.812581   
1            0.425629  1.104875  1.282322  1.381664   0.000000  0.851737   
2            0.882454  1.786066 -1.646778 -0.

Train a scikit-learn model on the data

In [21]:
from sklearn import preprocessing, ensemble

dataset = datasets.get_dataset(61)
X, y = dataset.get_data(target=dataset.default_target_attribute)
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

You can also ask which features are categorical to do your own encoding

In [20]:
X, y, categorical = dataset.get_data(target=dataset.default_target_attribute,
                                     return_categorical_indicator=True)
enc = preprocessing.OneHotEncoder(categorical_features=categorical)
X = enc.fit_transform(X)
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
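
To make the role of the categorical indicator concrete, here is a minimal pure-Python sketch of one-hot encoding driven by a list of booleans. `one_hot_encode` and the toy data are hypothetical and written for this illustration; this is not the scikit-learn implementation, and unlike some encoders it keeps the columns in their original order.

```python
def one_hot_encode(rows, categorical):
    """One-hot encode the columns flagged True in `categorical`.

    Illustrative helper only; it is not the scikit-learn implementation.
    """
    # collect the sorted distinct values of every categorical column
    categories = {
        j: sorted({row[j] for row in rows})
        for j, is_cat in enumerate(categorical) if is_cat
    }
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            if j in categories:
                # one indicator per known category value
                out.extend(1.0 if value == c else 0.0 for c in categories[j])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded


X_toy = [[0, 5.1], [2, 4.9], [1, 6.3]]  # column 0 is categorical
categorical_toy = [True, False]
X_encoded = one_hot_encode(X_toy, categorical_toy)
print(X_encoded)
```

Each categorical column expands into one indicator column per distinct value, while numeric columns pass through unchanged.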

# Tasks

To run benchmarks consistently (also across studies and tools), OpenML offers Tasks, which include specific train-test splits and other information.
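
Why fixed splits matter can be sketched with a small, deterministic k-fold generator. This is an illustration of the idea only, not how OpenML computes or stores its splits; `kfold_indices` and its `seed` parameter are hypothetical names.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Return a list of (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> identical splits every time
    # deal the shuffled indices round-robin into n_folds test sets
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = kfold_indices(n_samples=148, n_folds=10)
print(len(splits), "folds; first test fold has", len(splits[0][1]), "samples")
```

Because everyone evaluating on the same task uses the same index sets, results from different tools and studies stay directly comparable.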

## List ALL the tasks

In [22]:
task_list = tasks.list_tasks()

# use a separate name so the imported `tasks` module is not overwritten
task_df = pd.DataFrame(task_list)
print("First 5 of %s tasks:" % len(task_df))
print(task_df[:5][['tid','did','name','task_type','estimation_procedure']])

First 5 of 46562 tasks:
   tid  did        name                  task_type     estimation_procedure
0    1    1      anneal  Supervised Classification  10-fold Crossvalidation
1    2    2      anneal  Supervised Classification  10-fold Crossvalidation
2    3    3    kr-vs-kp  Supervised Classification  10-fold Crossvalidation
3    4    4       labor  Supervised Classification  10-fold Crossvalidation
4    5    5  arrhythmia  Supervised Classification  10-fold Crossvalidation


## Download tasks

In [15]:
task = tasks.get_task(10)
print(task)

OpenMLTask instance.
Task ID: 10
Task type: Supervised Classification
Dataset id: 10


# Runs

Run a scikit-learn classifier on the task (using the right splits)

In [16]:
from openml.runs import run_task

clf = ensemble.RandomForestClassifier()
run = run_task(task, clf)
print("RandomForest has run on the task.")

2823
RandomForest has run on the task.


Upload the run to the OpenML server

In [17]:
import xmltodict

return_code, response = run.publish()

if return_code == 200:
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s" % run_id)
    print("Check it at www.openml.org/r/%s" % run_id)

Uploaded run with id 538241
Check it at www.openml.org/r/538241
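
If `xmltodict` is not installed, the run id can also be pulled out with the standard library. The sketch below runs on a hypothetical response body: the element names follow the keys used above, but the namespace URI and the overall shape of the XML are assumptions, and `extract_run_id` is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; element names follow the keys used above,
# while the namespace URI is an assumption made for this sketch.
sample_response = (
    '<oml:upload_run xmlns:oml="http://openml.org/openml">'
    "<oml:run_id>538241</oml:run_id>"
    "</oml:upload_run>"
)


def extract_run_id(xml_text):
    """Return the text of the first element whose local name is 'run_id', or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # ElementTree expands prefixes to '{namespace}localname'
        if element.tag.rsplit("}", 1)[-1] == "run_id":
            return element.text
    return None


run_id = extract_run_id(sample_response)
print("Check it at www.openml.org/r/%s" % run_id)
```

Matching on the local name keeps the lookup independent of the exact namespace the server uses.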


## TL;DR
You can easily run and share scikit-learn experiments on OpenML.

In [24]:
import xmltodict
from sklearn import tree
from openml import tasks, runs

task = tasks.get_task(14951)
clf = tree.DecisionTreeClassifier()
run = runs.run_task(task, clf)
return_code, response = run.publish()

# get the run id for reference
if return_code == 200:
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s. Check it at www.openml.org/r/%s" % (run_id, run_id))

4074
Uploaded run with id 595118. Check it at www.openml.org/r/595118


## Challenge: Build the 'best' model on the Higgs dataset together
* Check progress on: http://www.openml.org/t/52950

In [None]:
import xmltodict
from openml import tasks, runs
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
# Imputer ships with older scikit-learn releases; newer ones provide
# sklearn.impute.SimpleImputer instead (which has no `axis` argument)
from sklearn.preprocessing import Imputer

task = tasks.get_task(52950)
# treat zeros as missing values and replace them with the column mean
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("knn", KNeighborsClassifier())])
run = runs.run_task(task, estimator)
return_code, response = run.publish()

# get the run id for reference
if return_code == 200:
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s. Check it at www.openml.org/r/%s" % (run_id, run_id))