# OpenML in Python 
(Work in progress)  
Joaquin Vanschoren @joavanschoren

## How to...
- Download datasets and tasks
- Use scikit-learn to build classifiers
- Upload the results to the server


# Setup

## Authentication
Authentication uses an API key (e.g. 'rgu9hw1h03g24j974hr3586g4j598fgh').  
Always keep your key secret.

Go to your OpenML account settings and see 'API authentication' to retrieve your key. 
<center><img src="files/openml_login.png" width="800"></center>
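
One way to keep the key out of a notebook is to read it from the environment or from a config file at run time. The sketch below is only an illustration: the helper `load_api_key`, the environment variable name `OPENML_APIKEY`, and the `apikey = value` file layout are assumptions made for this example, not part of the OpenML package.

```python
import os
import tempfile


def load_api_key(config_path, environ=None, env_var="OPENML_APIKEY"):
    """Return an API key from the environment or from a key=value config file."""
    environ = os.environ if environ is None else environ
    # Prefer the environment variable when it is set (hypothetical name).
    key = environ.get(env_var)
    if key:
        return key
    # Otherwise scan the file for a line such as "apikey = <value>".
    with open(config_path) as handle:
        for line in handle:
            name, sep, value = line.partition("=")
            if sep and name.strip() == "apikey":
                return value.strip()
    raise KeyError("no apikey entry found in %s" % config_path)


# Demo with a throwaway file and a made-up key.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config")
    with open(demo_path, "w") as handle:
        handle.write("# demo config\napikey = not-a-real-key\n")
    # Pass an empty mapping so the real environment is ignored in the demo.
    demo_key = load_api_key(demo_path, environ={})

print(demo_key)
```

The returned string could then be assigned to the configuration instead of a literal key.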

## Installation
Install from source:

In [1]:
# Each `!` command runs in its own shell, so `cd` must share a line with the install step.
#! git clone https://github.com/openml/openml-python
#! cd openml-python && python setup.py install

## Imports and configuration

In [7]:
from sklearn import preprocessing, ensemble

import openml
import numpy as np
import pandas as pd
import os

# The API key can also be stored in ~/.openml/config; here it is set explicitly.
# This is amueller's read/write key, which he will discard later. Use your own key instead.
openml.config.apikey = 'ef035bbd1ca47af239c384aea9124ec5'

# Data

## List ALL the datasets

In [8]:
datasets = openml.datasets.list_datasets()

data = pd.DataFrame(datasets)
print("First 10 of %s datasets..." % len(datasets))
print(data[:10][['did','name','NumberOfInstances','NumberOfFeatures']])

First 10 of 3307 datasets...
   did             name  NumberOfInstances  NumberOfFeatures
0    1           anneal              898.0              39.0
1    2           anneal              898.0              39.0
2    3         kr-vs-kp             3196.0              37.0
3    4            labor               57.0              17.0
4    5       arrhythmia              452.0             280.0
5    6           letter            20000.0              17.0
6    7        audiology              226.0              70.0
7    8  liver-disorders              345.0               7.0
8    9            autos              205.0              26.0
9   10            lymph              148.0              19.0


Subset based on any property, for example binary classification datasets:

In [9]:
bin_data = data.loc[data['NumberOfClasses'] == 2]
print("First 10 of %s datasets..." % len(bin_data))
print(bin_data[:10][['did','name', 'NumberOfInstances','NumberOfFeatures']])

First 10 of 738 datasets...
    did            name  NumberOfInstances  NumberOfFeatures
2     3        kr-vs-kp             3196.0              37.0
3     4           labor               57.0              17.0
12   13   breast-cancer              286.0              10.0
14   15        breast-w              699.0              10.0
23   24        mushroom             8124.0              23.0
24   25           colic              368.0              28.0
26   27           colic              368.0              23.0
28   29        credit-a              690.0              16.0
30   31        credit-g             1000.0              21.0
32   33  cylinder-bands              540.0              40.0


Subset and sort, for example datasets with more than 60,000 instances, smallest first:

In [10]:
big_data = data.loc[data['NumberOfInstances'] > 60000]
big_data = big_data.sort_values(by='NumberOfInstances', ascending=True)
print("First 10 of %s datasets..." % len(big_data))
print(big_data[:10][['did','name', 'NumberOfInstances']])

       did                          name  NumberOfInstances
1302  1588                           w8a            64700.0
2421  4533  KEGGMetabolicReactionNetwork            65554.0
1305  1591                     connect-4            67557.0
423    554                     mnist_784            70000.0
1293  1578                      real-sim            72309.0
1062  1213                       BNG(mv)            78732.0
2420  4532                         higgs            98050.0
247    357                vehicle_sensIT            98528.0
1080  1242                   vehicleNorm            98528.0
1307  1593       SensIT-Vehicle-Combined            98528.0
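
Several conditions can be combined in one subset. The sketch below uses a few rows copied by hand from the first listing above instead of calling the server, so it runs offline; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

# A few rows copied by hand from the dataset listing printed earlier.
sample = pd.DataFrame({
    "did": [3, 4, 5, 6],
    "name": ["kr-vs-kp", "labor", "arrhythmia", "letter"],
    "NumberOfInstances": [3196.0, 57.0, 452.0, 20000.0],
    "NumberOfFeatures": [37.0, 17.0, 280.0, 17.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mask = (sample["NumberOfInstances"] > 500) & (sample["NumberOfFeatures"] < 40)

# Keep the matching rows and put the largest dataset first.
selected = sample.loc[mask].sort_values(by="NumberOfInstances", ascending=False)
print(selected[["did", "name", "NumberOfInstances", "NumberOfFeatures"]])
```

The same pattern applies to the full `data` frame built from `list_datasets()`.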


Download a specific dataset. This is done based on the dataset ID (called 'did' in the table above).

In [11]:
dataset = openml.datasets.get_dataset(61)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])


In [None]:
from pprint import pprint
pprint(vars(dataset))

Get the actual data

In [None]:
X, y, attribute_names = dataset.get_dataset(target=dataset.default_target_attribute, return_attribute_names=True)
iris = pd.DataFrame(X, columns=attribute_names)
iris['class'] = y
print(iris[:10])

Have fun with it

In [None]:
iris.plot(kind='scatter', x='petallength', y='petalwidth', c='class', s=50);

Train a scikit-learn model on the data

In [None]:
dataset = openml.datasets.get_dataset(61)
X, y = dataset.get_dataset(target=dataset.default_target_attribute)
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)

In [None]:
# Helper code by Gilles Louppe
# % matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

def plot_surface(clf, X, y, 
                 xlim=(0, 7), ylim=(0, 3), n_steps=250, 
                 subplot=None, show=True):
    if subplot is None:
        fig = plt.figure()
    else:
        plt.subplot(*subplot)
        
    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], n_steps), 
                         np.linspace(ylim[0], ylim[1], n_steps))
    
    if hasattr(clf, "decision_function"):
        z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    else:
        z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
        
    z = z.reshape(xx.shape)
    plt.contourf(xx, yy, z, alpha=0.8, cmap=plt.cm.RdBu_r)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.xlim(*xlim)
    plt.ylim(*ylim)
    
    if show:
        plt.show()

In [None]:
X_2d = X[:,2:4]
clf.fit(X_2d, y)
plot_surface(clf, X_2d, y)

You can also ask which features are categorical to do your own encoding

In [None]:
X, y, categorical = dataset.get_dataset(target=dataset.default_target_attribute,
                                        return_categorical_indicator=True)
enc = preprocessing.OneHotEncoder(categorical_features=categorical)
X = enc.fit_transform(X)
clf.fit(X, y)
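
To make the idea of "your own encoding" concrete, here is a minimal plain-Python sketch of one-hot encoding driven by a boolean indicator list like `categorical`. It is an illustration only, not the scikit-learn implementation: the helper `one_hot_encode` and the toy rows are made up, and the column order may differ from what a library encoder produces.

```python
def one_hot_encode(rows, categorical):
    """Expand the columns flagged in `categorical` into 0/1 indicator columns."""
    n_cols = len(categorical)
    # Collect the sorted distinct values (levels) of every categorical column.
    levels = {
        j: sorted({row[j] for row in rows})
        for j in range(n_cols) if categorical[j]
    }
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            if categorical[j]:
                # One indicator per level; exactly one of them is 1.
                out.extend(1.0 if value == level else 0.0 for level in levels[j])
            else:
                # Numeric columns are passed through unchanged.
                out.append(float(value))
        encoded.append(out)
    return encoded


# Toy data: the first column is categorical (levels 0, 1, 2), the second is numeric.
rows = [[0, 5.1], [2, 4.9], [1, 6.3], [2, 5.8]]
categorical = [True, False]
encoded = one_hot_encode(rows, categorical)
for row in encoded:
    print(row)
```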

# Tasks

To run benchmarks consistently (also across studies and tools), OpenML offers Tasks, which include specific train-test splits and other information.
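
The value of fixed splits can be shown with a small sketch: if everyone evaluates on the same train-test partitions, results are directly comparable. The helper `fixed_kfold_splits` below is hypothetical and only mimics the idea with a seeded shuffle; OpenML tasks ship their own splits, and this is not how the server generates them.

```python
import random


def fixed_kfold_splits(n_samples, n_folds, seed):
    """Return (train_indices, test_indices) pairs that are reproducible for a given seed."""
    indices = list(range(n_samples))
    # A dedicated, seeded generator makes the shuffle repeatable.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [sorted(indices[i::n_folds]) for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # Everything outside the current test fold is used for training.
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


# Two independent calls with the same seed yield identical splits.
splits_a = fixed_kfold_splits(n_samples=10, n_folds=5, seed=0)
splits_b = fixed_kfold_splits(n_samples=10, n_folds=5, seed=0)
print(splits_a == splits_b)
```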

## List ALL the tasks

In [None]:
task_list = openml.tasks.list_tasks()

tasks = pd.DataFrame(task_list)
print("First 5 of %s tasks:" % len(tasks))
print(tasks[:5][['tid','did','name','task_type','estimation_procedure']])

## Download tasks

In [None]:
task = openml.tasks.get_task(10)
print(task)

# Runs

Run a scikit-learn classifier on the task, using the task's predefined splits

In [None]:
from openml.runs import run_task

clf = ensemble.RandomForestClassifier()
run = run_task(task, clf)
print("RandomForest has run on the task.")

Upload the run to the OpenML server

In [None]:
import xmltodict

return_code, response = run.publish()

if return_code == 200:
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s" % (run_id))
    print("Check it at www.openml.org/r/%s" % (run_id))
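
The parsing step above can be tried offline. The sketch below extracts the run id from a made-up response using only the standard library instead of `xmltodict`. The sample XML, the run id 12345, and the namespace URI are assumptions for illustration; only the element names `oml:upload_run` and `oml:run_id` come from the code above.

```python
import xml.etree.ElementTree as ET

# Made-up response for illustration; 12345 is not a real run.
SAMPLE_RESPONSE = (
    '<oml:upload_run xmlns:oml="http://openml.org/openml">'
    '<oml:run_id>12345</oml:run_id>'
    '</oml:upload_run>'
)


def extract_run_id(xml_text, namespace="http://openml.org/openml"):
    """Return the text of the run_id child element, or None if it is missing."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced tags as "{namespace-uri}localname".
    node = root.find("{%s}run_id" % namespace)
    return None if node is None else node.text


run_id = extract_run_id(SAMPLE_RESPONSE)
print("Check it at www.openml.org/r/%s" % run_id)
```

Returning `None` for a missing element lets the caller decide how to handle an unexpected response.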

More to come soon...