## OpenML in research, iris

This notebook covers downloading datasets and tasks, building classifiers with scikit-learn, and uploading the results to the server.


Initialization and login. This assumes you have a .openml directory in your home directory, with a subdirectory for caches and a file containing your API key. You can find your API key in your account settings on openml.org.
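The setup described above could be sketched as follows. The file name `apikey.txt` and the helper `read_api_key` are assumptions made for illustration; the text only says that some file in the .openml directory holds the key.

```python
import os


def read_api_key(openml_dir, filename="apikey.txt"):
    """Return the API key stored in openml_dir/filename, or None if absent.

    The file name is hypothetical; use whichever file holds your key.
    """
    path = os.path.join(openml_dir, filename)
    if not os.path.isfile(path):
        return None
    with open(path) as handle:
        # Strip surrounding whitespace, e.g. a trailing newline.
        return handle.read().strip()


openml_dir = os.path.join(os.path.expanduser("~"), ".openml")
key = read_api_key(openml_dir)  # None when the file does not exist
```

Keeping the key in a file rather than in the notebook avoids publishing it by accident when the notebook is shared.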

In [1]:
!pip freeze

backports.ssl-match-hostname==3.4.0.2
bokeh==0.10.0
climate==0.4.5
decorator==4.0.4
downhill==0.3.1
Flask==0.10.1
funcsigs==0.4
ipykernel==4.1.1
ipython==4.0.0
ipython-genutils==0.1.0
itsdangerous==0.24
Jinja2==2.8
jsonschema==2.4.0
jupyter-client==4.1.1
jupyter-core==4.0.6
liac-arff==2.1.1.dev0
MarkupSafe==0.23
matplotlib==1.4.3
mistune==0.7.1
mock==1.3.0
mpld3==0.2
nbconvert==4.0.0
nbformat==4.0.1
neurolab==0.3.5
nose==1.3.7
notebook==4.0.6
numexpr==2.4.4
numpy==1.10.1
openml==0.2.1
openpyxl==1.8.6
pandas==0.17.0
path.py==0.0.0
pbr==1.8.1
pexpect==3.3
pickleshare==0.5
plac==0.9.1
plotly==1.8.6
ptyprocess==0.5
PyBrain==0.3
Pygments==2.0.2
pyparsing==2.0.3
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
pyzmq==14.7.0
rep==0.6.5
requests==2.8.1
root-numpy==4.3.0.dev0
rootpy==0.8.0.dev0
runipy==0.1.3
scikit-learn==0.16.1
scipy==0.16.0
simplegeneric==0.8.1
six==1.10.0
tables==3.2.2
terminado==0.5
theanets==0.6.2
Theano==0.7.0
torn

In [1]:
%%bash

python --version
which pip

/root/miniconda/envs/rep_py2/bin/pip


Python 2.7.10 :: Continuum Analytics, Inc.


In [3]:
from sklearn import preprocessing, ensemble
from openml.apiconnector import APIConnector
import numpy as np
import pandas as pd
import os
import openml
import xmltodict
from IPython.core.display import display, HTML


openml_dir = os.path.join(os.path.expanduser("~"), ".openml")
if not os.path.exists(openml_dir):
    os.makedirs(openml_dir)
cache_dir = os.path.join(openml_dir, "cache")
key = ""  # Put your API key here as a string

In [4]:
con = APIConnector(cache_directory=cache_dir, apikey=key)

List all datasets on OpenML

In [5]:
datasets = openml.datasets.get_dataset_list(con)

datasets_df = pd.DataFrame(datasets)
datasets_df.head()


Unnamed: 0,MajorityClassSize,MaxNominalAttDistinctValues,MinorityClassSize,NumBinaryAtts,NumberOfClasses,NumberOfFeatures,NumberOfInstances,NumberOfInstancesWithMissingValues,NumberOfMissingValues,NumberOfNumericFeatures,NumberOfSymbolicFeatures,did,format,name,status
0,684,10,0,14,6,39,898,0,0,6,32,1,ARFF,anneal,active
1,684,9,0,7,6,39,898,898,22175,6,32,2,ARFF,anneal,active
2,1669,3,1527,34,2,37,3196,0,0,0,36,3,ARFF,kr-vs-kp,active
3,37,3,20,3,2,17,57,56,326,8,8,4,ARFF,labor,active
4,245,2,0,73,16,280,452,384,408,206,73,5,ARFF,arrhythmia,active


Download a specific dataset. This is done based on the dataset ID (called 'did' in the table above).

In [6]:
from pprint import pprint
import arff

N = 61
print("Downloading dataset %d." % N)
dataset = openml.datasets.download_dataset(con, N)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("More info, including the location of the .arff file on disk:")
pprint(vars(dataset))


Downloading dataset 61.
This is dataset 'iris', the target feature is called 'class'
More info, including the location of the .arff file on disk:
{'citation': None,
 'collection_date': u'1936',
 'contributor': None,
 'creator': u'R.A. Fisher',
 'data_file': '/root/.openml/cache/datasets/61/dataset.arff',
 'data_pickle_file': '/root/.openml/cache/datasets/61/dataset.pkl',
 'default_target_attribute': u'class',
 'description': u"**Author**: R.A. Fisher  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Iris) - 1936 - Donated by Michael Marshall  \n**Please cite**:   \n\n**Iris Plants Database**  \nThis is perhaps the best known database to be found in the pattern recognition literature.  Fisher's paper is a classic in the field and is referenced frequently to this day.  (See Duda & Hart, for example.)  The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.  One class is     linearly separable from the other 2; the latter are NOT 

In [7]:
X, y = dataset.get_dataset(target=dataset.default_target_attribute)

In [8]:
features = openml.datasets.download_dataset_features(con, N)

In [9]:
features

OrderedDict([(u'@xmlns:oml', u'http://openml.org/openml'),
             (u'oml:feature',
              [OrderedDict([(u'oml:index', u'0'),
                            (u'oml:name', u'sepallength'),
                            (u'oml:data_type', u'numeric'),
                            (u'oml:is_target', u'false'),
                            (u'oml:is_ignore', u'false'),
                            (u'oml:is_row_identifier', u'false')]),
               OrderedDict([(u'oml:index', u'1'),
                            (u'oml:name', u'sepalwidth'),
                            (u'oml:data_type', u'numeric'),
                            (u'oml:is_target', u'false'),
                            (u'oml:is_ignore', u'false'),
                            (u'oml:is_row_identifier', u'false')]),
               OrderedDict([(u'oml:index', u'2'),
                            (u'oml:name', u'petallength'),
                            (u'oml:data_type', u'numeric'),
                            (u'oml:is

In [10]:
iris = pd.DataFrame(X)
iris['class'] = y
print(iris[:10])

     0    1    2    3  class
0  5.1  3.5  1.4  0.2      0
1  4.9  3.0  1.4  0.2      0
2  4.7  3.2  1.3  0.2      0
3  4.6  3.1  1.5  0.2      0
4  5.0  3.6  1.4  0.2      0
5  5.4  3.9  1.7  0.4      0
6  4.6  3.4  1.4  0.3      0
7  5.0  3.4  1.5  0.2      0
8  4.4  2.9  1.4  0.2      0
9  4.9  3.1  1.5  0.1      0
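
The columns above are labelled 0 to 3 because `X` carries no feature names. A minimal sketch of recovering the names from a feature description, assuming it has the nested structure printed for `features` earlier. The helper `feature_names` is hypothetical, and `sample` is a hand-written, abbreviated stand-in (its `class` entry in particular is an assumption), not real server output.

```python
def feature_names(features, include_target=False):
    """Collect the 'oml:name' values from a parsed feature description."""
    names = []
    for feature in features["oml:feature"]:
        is_target = feature["oml:is_target"] == "true"
        if include_target or not is_target:
            names.append(feature["oml:name"])
    return names


# Abbreviated, hand-written stand-in for the parsed description.
sample = {
    "oml:feature": [
        {"oml:name": "sepallength", "oml:is_target": "false"},
        {"oml:name": "sepalwidth", "oml:is_target": "false"},
        {"oml:name": "petallength", "oml:is_target": "false"},
        {"oml:name": "class", "oml:is_target": "true"},  # assumed entry
    ]
}

names = feature_names(sample)
```

With the full description, and if `X` is a dense array, the resulting list could be passed as `columns=` when building the DataFrame.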


Train a scikit-learn model on the data

In [11]:
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

You can also ask which features are categorical, so that you can do your own encoding.
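
One possible way to do this by hand is sketched below. It assumes the feature description marks categorical features with `oml:data_type` equal to `nominal` (the output earlier only shows `numeric`), and it uses two hypothetical helpers, `categorical_mask` and `one_hot`. The sample description is made up for illustration.

```python
def categorical_mask(features):
    """Return one boolean per non-target feature: True if it is nominal."""
    return [
        feature["oml:data_type"] == "nominal"
        for feature in features["oml:feature"]
        if feature["oml:is_target"] != "true"
    ]


def one_hot(column):
    """One-hot encode a list of category codes.

    Returns (rows, categories), where rows[i][j] is 1 exactly when
    column[i] equals categories[j].
    """
    categories = sorted(set(column))
    position = {value: index for index, value in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return rows, categories


# Made-up description: one nominal feature, one numeric, one target.
sample = {
    "oml:feature": [
        {"oml:name": "colour", "oml:data_type": "nominal", "oml:is_target": "false"},
        {"oml:name": "length", "oml:data_type": "numeric", "oml:is_target": "false"},
        {"oml:name": "class", "oml:data_type": "nominal", "oml:is_target": "true"},
    ]
}

mask = categorical_mask(sample)        # [True, False]
rows, categories = one_hot([0, 2, 0, 1])
# categories == [0, 1, 2]
# rows == [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

For iris all four input features are numeric, so the mask would be all False and no encoding is needed.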

In [12]:
task_list = openml.tasks.get_task_list(con)

In [13]:
print(task_list[0], len(task_list))

({'status': u'active', u'target_feature': u'class', 'task_type': u'Supervised Classification', u'NumberOfInstancesWithMissingValues': 0, 'name': u'anneal', u'estimation_procedure': u'10-fold Crossvalidation', 'did': 1, u'NumberOfSymbolicFeatures': 32, u'NumBinaryAtts': 14, u'NumberOfMissingValues': 0, u'MajorityClassSize': 684, u'NumberOfInstances': 898, u'source_data': u'1', u'evaluation_measures': u'predictive_accuracy', u'MaxNominalAttDistinctValues': 10, u'NumberOfClasses': 6, 'tid': 1, u'MinorityClassSize': 0, u'NumberOfFeatures': 39, u'NumberOfNumericFeatures': 6}, 1906)


In [14]:
tasks = pd.DataFrame(task_list)
tasks[['tid','did','task_type','NumberOfInstances','NumberOfFeatures','NumberOfClasses']].head(11)

Unnamed: 0,tid,did,task_type,NumberOfInstances,NumberOfFeatures,NumberOfClasses
0,1,1,Supervised Classification,898,39,6.0
1,2,2,Supervised Classification,898,39,6.0
2,3,3,Supervised Classification,3196,37,2.0
3,4,4,Supervised Classification,57,17,2.0
4,5,5,Supervised Classification,452,280,16.0
5,6,6,Supervised Classification,20000,17,26.0
6,7,7,Supervised Classification,226,70,24.0
7,8,8,Supervised Classification,345,7,
8,9,9,Supervised Classification,205,26,7.0
9,10,10,Supervised Classification,148,19,4.0
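
To find the tasks defined on one particular dataset, the task list can be filtered on `did`. A minimal sketch on plain dictionaries, assuming each entry carries the `tid` and `did` keys shown in the printed task above. The helper `tasks_for_dataset` is hypothetical, and the sample list is a small stand-in whose ID pairs are taken from this notebook.

```python
def tasks_for_dataset(task_list, did):
    """Return the IDs of all tasks whose source dataset has the given ID."""
    return [task["tid"] for task in task_list if task["did"] == did]


# Stand-in for the list returned by the server (only two keys kept).
sample_tasks = [
    {"tid": 1, "did": 1},
    {"tid": 10, "did": 10},
    {"tid": 10107, "did": 61},
]

iris_tasks = tasks_for_dataset(sample_tasks, 61)  # [10107]
```

On the DataFrame built above, the equivalent selection would be `tasks[tasks['did'] == 61]`.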


Download a single OpenML task (id=10107), create a scikit-learn classifier (RandomForest), and run it on the task

In [15]:
T = 10107
task = openml.tasks.download_task(con, T)
print(task)

clf = ensemble.RandomForestClassifier()

OpenMLTask instance.
Task ID: 10107
Task type: Supervised Classification
Dataset id: 61


In [16]:
run = openml.runs.openml_run(con, task, clf)
print("RandomForest has run on the task.")

2638
RandomForest has run on the task.


Upload the run to the OpenML server

In [17]:
return_code, response = run.publish(con)

  if value is None or value == u'' or value != value:


In [18]:
if return_code == 200:
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s" % (run_id))
    display(HTML("<a href='http://www.openml.org/r/{0}' target=_blank>http://www.openml.org/r/{0}</a>".format(run_id)))


Uploaded run with id 537768
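
The response handling above can also be written with the standard library only, which avoids the `xmltodict` dependency. The sketch below assumes the response is an `oml:upload_run` element containing an `oml:run_id` child, as the dictionary lookup above suggests, and that it uses the `http://openml.org/openml` namespace seen in the feature description. The helper `extract_run_id` and the sample XML string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping, taken from the feature description output.
OML_NS = {"oml": "http://openml.org/openml"}


def extract_run_id(return_code, response):
    """Return the run ID from an upload response, or None on failure."""
    if return_code != 200:
        return None
    root = ET.fromstring(response)
    # The root is assumed to be oml:upload_run; look for its run_id child.
    node = root.find("oml:run_id", OML_NS)
    return node.text if node is not None else None


# Hand-written example response, not captured from the server.
sample_response = (
    '<oml:upload_run xmlns:oml="http://openml.org/openml">'
    "<oml:run_id>537768</oml:run_id>"
    "</oml:upload_run>"
)

run_id = extract_run_id(200, sample_response)  # "537768"
```

Returning None instead of raising keeps the calling cell simple: it can test the result before building the link to the run page.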
