## OpenML in research, iris

This notebook covers downloading datasets and tasks, building classifiers with scikit-learn, and uploading the results to the OpenML server.


Initialization and login. This assumes you have a .openml directory in your home directory, with a subdirectory for caches and a file containing your API key. You can find your API key in your account settings on openml.org.
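
The key can be read from that file instead of being pasted into the notebook. The following is a minimal sketch, assuming the key is stored as a single line in a file called apikey.txt inside ~/.openml. Both the file name and the read_api_key helper are hypothetical and should be adapted to your own setup.

```python
import os


def read_api_key(path):
    """Return the API key stored in `path`, stripped of surrounding whitespace."""
    with open(path) as handle:
        key = handle.read().strip()
    if not key:
        raise ValueError("No API key found in %s" % path)
    return key


# Hypothetical location: the file name is an assumption, not an OpenML convention.
key_file = os.path.join(os.path.expanduser("~"), ".openml", "apikey.txt")
if os.path.exists(key_file):
    key = read_api_key(key_file)
```

Keeping the key in a file outside the notebook avoids committing it by accident when the notebook is shared.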

In [1]:
from sklearn import preprocessing, ensemble
from openml.apiconnector import APIConnector
import numpy as np
import pandas as pd
import os
import openml
import xmltodict
from IPython.display import display, HTML


openml_dir = os.path.join(os.path.expanduser("~"), ".openml")
os.makedirs(openml_dir, exist_ok=True)  # create ~/.openml if it does not exist yet
cache_dir = os.path.join(openml_dir, "cache")
key = "YOUR_API_KEY"  # Replace with your API key, as a string

In [2]:
con = APIConnector(cache_directory=cache_dir, apikey=key)

List all datasets on OpenML

In [3]:
datasets = openml.datasets.get_dataset_list(con)

datasets_df = pd.DataFrame(datasets)
datasets_df.head()


|   | MajorityClassSize | MaxNominalAttDistinctValues | MinorityClassSize | NumBinaryAtts | NumberOfClasses | NumberOfFeatures | NumberOfInstances | NumberOfInstancesWithMissingValues | NumberOfMissingValues | NumberOfNumericFeatures | NumberOfSymbolicFeatures | did | format | name | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 684.0 | 10.0 | 0.0 | 14.0 | 6.0 | 39.0 | 898.0 | 0.0 | 0.0 | 6.0 | 32.0 | 1 | ARFF | anneal | active |
| 1 | 684.0 | 9.0 | 0.0 | 7.0 | 6.0 | 39.0 | 898.0 | 898.0 | 22175.0 | 6.0 | 32.0 | 2 | ARFF | anneal | active |
| 2 | 1669.0 | 3.0 | 1527.0 | 34.0 | 2.0 | 37.0 | 3196.0 | 0.0 | 0.0 | 0.0 | 36.0 | 3 | ARFF | kr-vs-kp | active |
| 3 | 37.0 | 3.0 | 20.0 | 3.0 | 2.0 | 17.0 | 57.0 | 56.0 | 326.0 | 8.0 | 8.0 | 4 | ARFF | labor | active |
| 4 | 245.0 | 2.0 | 0.0 | 73.0 | 16.0 | 280.0 | 452.0 | 384.0 | 408.0 | 206.0 | 73.0 | 5 | ARFF | arrhythmia | active |
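
The listing can be narrowed down before anything is downloaded. The sketch below assumes that the dataset list is a list of dictionaries keyed by the column names in the table above (it is passed directly to pd.DataFrame, which suggests this shape). The select_datasets helper is hypothetical, and the sample rows are copied from the table.

```python
def select_datasets(datasets, max_instances, allow_missing=False):
    """Return the active datasets with at most `max_instances` rows."""
    selected = []
    for d in datasets:
        if d["status"] != "active":
            continue
        if d["NumberOfInstances"] > max_instances:
            continue
        if not allow_missing and d["NumberOfMissingValues"] > 0:
            continue
        selected.append(d)
    return selected


# Three rows copied from the table above, reduced to the fields used here.
sample = [
    {"did": 1, "name": "anneal", "status": "active",
     "NumberOfInstances": 898.0, "NumberOfMissingValues": 0.0},
    {"did": 3, "name": "kr-vs-kp", "status": "active",
     "NumberOfInstances": 3196.0, "NumberOfMissingValues": 0.0},
    {"did": 4, "name": "labor", "status": "active",
     "NumberOfInstances": 57.0, "NumberOfMissingValues": 326.0},
]

small = select_datasets(sample, max_instances=1000)
print([d["did"] for d in small])  # [1]
```

The same filter can be written as a boolean mask on datasets_df. The plain-Python version is shown only because it does not depend on pandas.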


Download a specific dataset. This is done based on the dataset ID (called 'did' in the table above).

In [4]:
from pprint import pprint
import arff

N = 61
print("Downloading dataset %d." % N)
dataset = openml.datasets.download_dataset(con, N)

print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
print("More info, including the location of the .arff file on disk:")
pprint(vars(dataset))


Downloading dataset 61.
This is dataset 'iris', the target feature is called 'class'
More info, including the location of the .arff file on disk:
{'citation': None,
 'collection_date': '1936',
 'contributor': None,
 'creator': 'R.A. Fisher',
 'data_file': '/Users/anaderi/.openml/cache/datasets/61/dataset.arff',
 'data_pickle_file': '/Users/anaderi/.openml/cache/datasets/61/dataset.pkl',
 'default_target_attribute': 'class',
 'description': '**Author**: R.A. Fisher  \n'
                '**Source**: '
                '[UCI](https://archive.ics.uci.edu/ml/datasets/Iris) - 1936 - '
                'Donated by Michael Marshall  \n'
                '**Please cite**:   \n'
                '\n'
                '**Iris Plants Database**  \n'
                'This is perhaps the best known database to be found in the '
                "pattern recognition literature.  Fisher's paper is a classic "
                'in the field and is referenced frequently to this day.  (See '
                'D

In [5]:
X, y = dataset.get_dataset(target=dataset.default_target_attribute)

In [6]:
features = openml.datasets.download_dataset_features(con, N)

In [7]:
features

OrderedDict([('@xmlns:oml', 'http://openml.org/openml'),
             ('oml:feature',
              [OrderedDict([('oml:index', '0'),
                            ('oml:name', 'sepallength'),
                            ('oml:data_type', 'numeric'),
                            ('oml:is_target', 'false'),
                            ('oml:is_ignore', 'false'),
                            ('oml:is_row_identifier', 'false')]),
               OrderedDict([('oml:index', '1'),
                            ('oml:name', 'sepalwidth'),
                            ('oml:data_type', 'numeric'),
                            ('oml:is_target', 'false'),
                            ('oml:is_ignore', 'false'),
                            ('oml:is_row_identifier', 'false')]),
               OrderedDict([('oml:index', '2'),
                            ('oml:name', 'petallength'),
                            ('oml:data_type', 'numeric'),
                            ('oml:is_target', 'false'),
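
The nested structure above is easier to work with once it is flattened. The following sketch builds a name-to-type mapping from it. It assumes only the keys visible in the output ('oml:feature', 'oml:index', 'oml:name', 'oml:data_type'). The feature_types helper is hypothetical, and the example dictionary is a shortened stand-in for the real metadata.

```python
from collections import OrderedDict


def feature_types(features):
    """Map each feature name to its OpenML data type, in column order."""
    entries = features["oml:feature"]
    # Parsers such as xmltodict return a single dict, not a list,
    # when the document contains only one feature element.
    if isinstance(entries, dict):
        entries = [entries]
    ordered = sorted(entries, key=lambda f: int(f["oml:index"]))
    return OrderedDict((f["oml:name"], f["oml:data_type"]) for f in ordered)


# Shortened stand-in for the structure shown above.
example = {
    "oml:feature": [
        {"oml:index": "1", "oml:name": "sepalwidth", "oml:data_type": "numeric"},
        {"oml:index": "0", "oml:name": "sepallength", "oml:data_type": "numeric"},
    ]
}
print(list(feature_types(example).items()))
```

The keys of the returned mapping can serve as column names for a DataFrame built from X, provided the target column is handled separately.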
              

In [8]:
iris = pd.DataFrame(X)
iris['class'] = y
print(iris[:10])

     0    1    2    3  class
0  5.1  3.5  1.4  0.2      0
1  4.9  3.0  1.4  0.2      0
2  4.7  3.2  1.3  0.2      0
3  4.6  3.1  1.5  0.2      0
4  5.0  3.6  1.4  0.2      0
5  5.4  3.9  1.7  0.4      0
6  4.6  3.4  1.4  0.3      0
7  5.0  3.4  1.5  0.2      0
8  4.4  2.9  1.4  0.2      0
9  4.9  3.1  1.5  0.1      0


Training a scikit-learn model with the data

In [9]:
clf = ensemble.RandomForestClassifier()
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
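
The cell above fits the forest on all 150 rows, so it gives no estimate of how well the model generalizes. A minimal hold-out evaluation could look like the sketch below. It uses scikit-learn's bundled copy of iris so that it runs without an OpenML connection, and it assumes scikit-learn 0.18 or newer for sklearn.model_selection. The split ratio and the random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# scikit-learn's bundled iris copy, used so the sketch runs without OpenML.
X_demo, y_demo = load_iris(return_X_y=True)

# Keep 30% of the rows aside; stratify so each class is equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)
demo_clf.fit(X_train, y_train)

# Mean accuracy on the rows the model has not seen during training.
accuracy = demo_clf.score(X_test, y_test)
print("Hold-out accuracy: %.2f" % accuracy)
```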

You can also ask which features are categorical, so that you can do your own encoding.
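
One way to do this is to read the 'oml:data_type' field of the features metadata shown earlier. The sketch below is only an illustration: it assumes that categorical features carry the data type 'nominal', the metadata in the example is made up (the four iris inputs are all numeric), and both helpers are hypothetical.

```python
def nominal_indices(features):
    """Return the column indices of non-target features of type 'nominal'."""
    return [int(f["oml:index"]) for f in features["oml:feature"]
            if f["oml:data_type"] == "nominal" and f["oml:is_target"] != "true"]


def one_hot(column):
    """One-hot encode a list of category labels; returns (categories, rows)."""
    categories = sorted(set(column))
    position = {c: i for i, c in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


# Made-up metadata with one categorical input feature.
meta = {"oml:feature": [
    {"oml:index": "0", "oml:name": "length",
     "oml:data_type": "numeric", "oml:is_target": "false"},
    {"oml:index": "1", "oml:name": "colour",
     "oml:data_type": "nominal", "oml:is_target": "false"},
    {"oml:index": "2", "oml:name": "class",
     "oml:data_type": "nominal", "oml:is_target": "true"},
]}

print(nominal_indices(meta))                      # [1]
categories, encoded = one_hot(["red", "blue", "red"])
print(categories)                                 # ['blue', 'red']
print(encoded)                                    # [[0, 1], [1, 0], [0, 1]]
```

In practice an encoder from sklearn.preprocessing would replace the hand-written one_hot. The plain version is shown to make the transformation explicit.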

List all tasks on OpenML

In [10]:
task_list = openml.tasks.get_task_list(con)

In [11]:
print(task_list[0], len(task_list))

{'NumberOfMissingValues': 0, 'task_type': 'Supervised Classification', 'NumberOfInstances': 898, 'MajorityClassSize': 684, 'MinorityClassSize': 0, 'NumberOfSymbolicFeatures': 32, 'NumberOfClasses': 6, 'MaxNominalAttDistinctValues': 10, 'NumberOfInstancesWithMissingValues': 0, 'estimation_procedure': '10-fold Crossvalidation', 'NumBinaryAtts': 14, 'NumberOfFeatures': 39, 'name': 'anneal', 'tid': 1, 'source_data': '1', 'evaluation_measures': 'predictive_accuracy', 'did': 1, 'NumberOfNumericFeatures': 6, 'target_feature': 'class', 'status': 'active'} 1906


In [12]:
tasks = pd.DataFrame(task_list)
tasks[['tid','did','task_type','NumberOfInstances','NumberOfFeatures','NumberOfClasses']].head(11)

|   | tid | did | task_type | NumberOfInstances | NumberOfFeatures | NumberOfClasses |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Supervised Classification | 898 | 39 | 6.0 |
| 1 | 2 | 2 | Supervised Classification | 898 | 39 | 6.0 |
| 2 | 3 | 3 | Supervised Classification | 3196 | 37 | 2.0 |
| 3 | 4 | 4 | Supervised Classification | 57 | 17 | 2.0 |
| 4 | 5 | 5 | Supervised Classification | 452 | 280 | 16.0 |
| 5 | 6 | 6 | Supervised Classification | 20000 | 17 | 26.0 |
| 6 | 7 | 7 | Supervised Classification | 226 | 70 | 24.0 |
| 7 | 8 | 8 | Supervised Classification | 345 | 7 |  |
| 8 | 9 | 9 | Supervised Classification | 205 | 26 | 7.0 |
| 9 | 10 | 10 | Supervised Classification | 148 | 19 | 4.0 |
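
To find the tasks defined on a particular dataset, the task list can be filtered on its 'did' field. This sketch assumes each entry is a dictionary with at least the 'tid', 'did' and 'task_type' keys printed above. The tasks_for_dataset helper is hypothetical, and the sample entries are taken from the first rows of the table.

```python
def tasks_for_dataset(task_list, did, task_type="Supervised Classification"):
    """Return the task ids (tid) defined on dataset `did` with the given type."""
    return [t["tid"] for t in task_list
            if t["did"] == did and t["task_type"] == task_type]


# First rows of the table above, reduced to the fields used here.
sample_tasks = [
    {"tid": 1, "did": 1, "task_type": "Supervised Classification"},
    {"tid": 2, "did": 2, "task_type": "Supervised Classification"},
    {"tid": 3, "did": 3, "task_type": "Supervised Classification"},
]

print(tasks_for_dataset(sample_tasks, did=2))  # [2]
```

Called on the full task_list with did=61, the same function would list the classification tasks available for iris.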


Download a single OpenML task (id=10107, a supervised classification task on the iris dataset), create a scikit-learn classifier (RandomForest), and run it on the task

In [13]:
T = 10107
task = openml.tasks.download_task(con, T)
print(task)

clf = ensemble.RandomForestClassifier()

OpenMLTask instance.
Task ID: 10107
Task type: Supervised Classification
Dataset id: 61


In [14]:
run = openml.runs.openml_run(con, task, clf)
print("RandomForest has run on the task.")

2799
RandomForest has run on the task.


Upload the run to the OpenML server

In [15]:
return_code, response = run.publish(con)

In [17]:
if return_code == 200:
    response_dict = xmltodict.parse(response)
    run_id = response_dict['oml:upload_run']['oml:run_id']
    print("Uploaded run with id %s" % (run_id))
    display(HTML("<a href='http://www.openml.org/r/{0}' target=_blank>http://www.openml.org/r/{0}</a>".format(run_id)))


Uploaded run with id 537752
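
The run id can also be extracted with the standard library alone, which makes it easier to handle a response that lacks the expected element. In the sketch below, the sample response is hypothetical and shaped after the keys used in the cell above. The namespace URI is assumed to be the same one that appears in the features output, and extract_run_id is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace, taken from the '@xmlns:oml' entry in the features output.
OML_NS = {"oml": "http://openml.org/openml"}


def extract_run_id(response_xml):
    """Return the run id from an upload response, or None if it is absent."""
    root = ET.fromstring(response_xml)
    node = root.find("oml:run_id", OML_NS)
    return node.text if node is not None else None


# Hypothetical response body, shaped after the keys used in the cell above.
sample_response = (
    '<oml:upload_run xmlns:oml="http://openml.org/openml">'
    '<oml:run_id>537752</oml:run_id>'
    '</oml:upload_run>'
)
print(extract_run_id(sample_response))  # 537752
```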
