# OpenML in Python 
OpenML is an online collaboration platform for machine learning: 

* Share/reuse machine learning datasets, algorithms, models, experiments
* Well documented/annotated datasets, uniform access
* APIs in Java, R, Python\*,... to download/upload everything
* Better reproducibility of experiments, reuse of machine learning models 
* Works well with machine learning libraries such as scikit-learn
* Large scale benchmarking, compare to state of the art

In [23]:
# General imports and settings
from preamble import *
%matplotlib inline
InteractiveShell.ast_node_interactivity = "all"
HTML('''<style>html, body{overflow: visible !important} .CodeMirror{min-width:105% !important;} .rise-enabled .CodeMirror, .rise-enabled .output_subarea{font-size:140%; line-height:1.2; overflow: visible;} .output_subarea pre{width:110%}</style>''') # For slides

## Authentication

* Create a (free) OpenML account on http://www.openml.org. 
* After logging in, open your account page (avatar on the top right)
* Open 'Account Settings', then 'API authentication' to find your API key.

There are two ways to authenticate:  

* Create a plain text file `~/.openml/config` with the line 'apikey=MYKEY', replacing MYKEY with your API key.
* Run the code below, replacing 'MYKEY' with your API key.

In [None]:
# This is a temporary read-only OpenML key. Replace with your own key. 
oml.config.apikey = '11e82c8d91c5abece86f424369c71590'
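
The first option, the config file, can also be created from Python. This is a minimal sketch: the helper name `write_openml_config` is made up for this example and is not part of the openml package, and the demo writes to a temporary directory so that an existing `~/.openml/config` is not overwritten.

```python
from pathlib import Path
import tempfile


def write_openml_config(api_key, config_dir):
    """Write a minimal config file containing only the apikey line."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config"
    config_path.write_text("apikey=%s\n" % api_key)
    return config_path


# Demonstration in a throwaway directory; for real use, pass
# Path.home() / ".openml" and your own key instead of "MYKEY".
with tempfile.TemporaryDirectory() as tmp:
    path = write_openml_config("MYKEY", Path(tmp) / ".openml")
    config_text = path.read_text()

print(config_text)
```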

# Data sets
We can list, select, and download all OpenML datasets.

### List datasets

In [None]:
datalist = oml.datasets.list_datasets() # Returns a dict
datalist = pd.DataFrame.from_dict(datalist, orient='index') # Create a DataFrame
print("First 10 of %s datasets..." % len(datalist))
datalist[:10][['did','name','NumberOfInstances',
               'NumberOfFeatures','NumberOfClasses']]

There are many properties that we can query

In [None]:
list(datalist)
datalist = datalist[['did','name','NumberOfInstances',
               'NumberOfFeatures','NumberOfClasses']]

and we can filter or sort on all of them

In [None]:
datalist[datalist.NumberOfInstances>10000
        ].sort_values(['NumberOfInstances'])[:20]

or find specific ones

In [None]:
datalist.query('name == "eeg-eye-state"')

In [None]:
datalist.query('NumberOfClasses > 50')
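
The same filtering, sorting, and querying patterns can be tried offline on a small hand-made table. The rows below are invented for illustration (only the column names come from the dataset listing), and the sketch assumes a pandas version that provides `sort_values` (0.17 or later).

```python
import pandas as pd

# Hypothetical listing: column names follow the tutorial, the rows are invented.
toy = pd.DataFrame.from_dict({
    1: {'did': 1, 'name': 'small', 'NumberOfInstances': 150, 'NumberOfClasses': 3},
    2: {'did': 2, 'name': 'large', 'NumberOfInstances': 50000, 'NumberOfClasses': 2},
    3: {'did': 3, 'name': 'medium', 'NumberOfInstances': 12000, 'NumberOfClasses': 60},
}, orient='index')

# A boolean mask keeps rows with more than 10000 instances; sort_values orders them.
big = toy[toy.NumberOfInstances > 10000].sort_values('NumberOfInstances')
big_names = list(big['name'])

# query() expresses the same kind of filter as a string.
many_classes = list(toy.query('NumberOfClasses > 50')['name'])

print(big_names, many_classes)
```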

Download a specific dataset. This is done based on the dataset ID (called 'did').

In [None]:
dataset = oml.datasets.get_dataset(1471)

print("This is dataset '%s', the target feature is '%s'" % 
      (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])

Convert the data to a DataFrame for easier processing/plotting

In [None]:
X, y, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute, 
    return_attribute_names=True)
eeg = pd.DataFrame(X, columns=attribute_names)
eeg['class'] = y
print(eeg[:10])

In [None]:
eegs = eeg.sample(n=1000)
_ = pd.plotting.scatter_matrix(eegs.iloc[:100,:4], c=eegs[:100]['class'], figsize=(10, 10),
                               marker='o', hist_kwds={'bins': 20},
                               alpha=.8, cmap='viridis')

## Train models
Train a scikit-learn model on the data manually

In [None]:
from sklearn import neighbors

dataset = oml.datasets.get_dataset(1471)
X, y = dataset.get_data(target=dataset.default_target_attribute)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

You can also ask which features are categorical to do your own encoding

In [None]:
from sklearn import preprocessing
dataset = oml.datasets.get_dataset(10)
X, y, categorical = dataset.get_data(
    target=dataset.default_target_attribute,
    return_categorical_indicator=True)
print("Categorical features: %s" % categorical)
enc = preprocessing.OneHotEncoder(categorical_features=categorical)
X = enc.fit_transform(X)
clf.fit(X, y)
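
To make the idea of doing your own encoding concrete, here is a minimal sketch in plain Python that one-hot encodes only the columns flagged in a boolean indicator list such as `categorical`. The function `one_hot_encode` and the toy rows are assumptions made for this example; it is not how scikit-learn's `OneHotEncoder` is implemented.

```python
def one_hot_encode(rows, categorical):
    """One-hot encode the columns flagged True in `categorical`; copy the rest."""
    # Collect the sorted distinct values of every categorical column.
    levels = {
        j: sorted({row[j] for row in rows})
        for j, is_cat in enumerate(categorical) if is_cat
    }
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            if j in levels:
                # One indicator per level, 1.0 where the level matches.
                out.extend(1.0 if value == level else 0.0 for level in levels[j])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded


# Invented toy data: column 0 is categorical (integer codes), column 1 is numeric.
rows = [[0, 1.5], [2, 0.5], [0, 3.0]]
categorical = [True, False]
encoded = one_hot_encode(rows, categorical)
print(encoded)
```

Each categorical column expands into one indicator column per distinct value, while numeric columns pass through unchanged.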

# Tasks

To compare models consistently (across studies and tools), OpenML offers Tasks, which include specific train-test splits and other information to define a scientific task. Tasks are typically created via the website by the dataset provider.
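
Why fixed splits matter can be illustrated with a small sketch: if everyone derives the folds from the same seed, everyone evaluates on exactly the same train-test partitions. The function `fixed_folds` is hypothetical and is not how OpenML generates or stores its splits; it only shows the principle.

```python
import random


def fixed_folds(n_instances, n_folds, seed):
    """Return (train, test) index lists for each fold, reproducible for a given seed."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # Everything outside the current test fold is training data.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(test)))
    return splits


# Two independent calls with the same seed yield identical splits.
splits_a = fixed_folds(n_instances=10, n_folds=5, seed=0)
splits_b = fixed_folds(n_instances=10, n_folds=5, seed=0)
print(splits_a == splits_b)
```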

## Listing tasks

In [None]:
task_list = oml.tasks.list_tasks(size=5000) # Get first 5000 tasks

mytasks = pd.DataFrame(task_list).transpose()
print("First 5 of %s tasks:" % len(mytasks))

In [None]:
mytasks = mytasks[['tid','did','name','task_type','estimation_procedure','evaluation_measures']]
print(mytasks.head())

Search for the tasks you need

In [None]:
print(mytasks.query('name=="eeg-eye-state"'))

## Download tasks

In [None]:
task = oml.tasks.get_task(14951)
pprint(vars(task))

# Runs: Train models on tasks
We can run (many) scikit-learn algorithms on (many) OpenML tasks.

In [None]:
task = oml.tasks.get_task(14951)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
run = oml.runs.run_task(task, clf)
run.model

Share the run on the OpenML server

In [None]:
myrun = run.publish()
print("Uploaded to http://www.openml.org/r/" + str(myrun.run_id))

### It also works with pipelines

In [None]:
from sklearn import pipeline, ensemble, preprocessing
from openml import tasks, runs, datasets
task = tasks.get_task(59)
pipe = pipeline.Pipeline(steps=[
            ('Imputer', preprocessing.Imputer(strategy='median')),
            ('OneHotEncoder', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore')),
            ('Classifier', ensemble.RandomForestClassifier())
           ])
run = runs.run_task(task, pipe)
myrun = run.publish()
print("Uploaded to http://www.openml.org/r/" + str(myrun.run_id))

## All together
Train a scikit-learn model on an OpenML task and upload the run to OpenML in a few lines of code.

In [None]:
from sklearn.linear_model import LogisticRegression

task = oml.tasks.get_task(145677)
clf = LogisticRegression()
run = oml.runs.run_task(task, clf)
run.model
myrun = run.publish()
print("Uploaded to http://www.openml.org/r/" + str(myrun.run_id))

## A Challenge
We'll see many machine learning algorithms in this course. Try to build the best possible models on several OpenML tasks, compare your results with those of the rest of the class, and learn from them. Some tasks you could try (or browse openml.org for more):

* EEG eye state: data_id:[1471](http://www.openml.org/d/1471), task_id:[14951](http://www.openml.org/t/14951)
* Mice protein: data_id:[4550](http://www.openml.org/d/4550), task_id:[34538](http://www.openml.org/t/34538), 1k instances, 80 features, missing values. Easy.
* Walking activity: data_id:[1509](http://www.openml.org/d/1509), task_id:[9945](http://www.openml.org/t/9945), 150k instances
* Thoracic_surgery: data_id:[4329](http://www.openml.org/d/4329), task_id:[145679](http://www.openml.org/t/145679), 0.5k instances, no missing values.
* Diabetes130US: data_id:[23512](http://www.openml.org/d/23512), task_id:[?](http://www.openml.org/t/?). 100k instances, missing values
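
The links in this list follow a simple pattern: `/d/` for datasets and `/t/` for tasks, plus `/r/` for runs as in the upload messages earlier. A small helper, hypothetical and not part of the openml package, can build them from IDs:

```python
BASE_URL = "http://www.openml.org"

# Path prefixes as they appear in the links of this tutorial.
PREFIXES = {"dataset": "d", "task": "t", "run": "r"}


def openml_url(kind, entity_id):
    """Build the web URL for a dataset, task, or run ID."""
    if kind not in PREFIXES:
        raise ValueError("kind must be one of %s" % sorted(PREFIXES))
    return "%s/%s/%d" % (BASE_URL, PREFIXES[kind], entity_id)


eeg_dataset_url = openml_url("dataset", 1471)
eeg_task_url = openml_url("task", 14951)
print(eeg_dataset_url)
print(eeg_task_url)
```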

Easy benchmarking:

In [None]:
import openml as oml
from sklearn import neighbors

for task_id in [14951,10103,9945]:
    task = oml.tasks.get_task(task_id)
    data = oml.datasets.get_dataset(task.dataset_id)
    clf = neighbors.KNeighborsClassifier(n_neighbors=5)
    run = oml.runs.run_task(task, clf)
    myrun = run.publish()
    print("kNN on %s: http://www.openml.org/r/%d" % (data.name, myrun.run_id))

## Other possibilities
OpenML's Python API is still under development. To be added soon:

* Organizing data sets, algorithms, and experiments into studies
* Sharing data and experiments with circles of friends
* Downloading previous experiments, evaluations and models
* Uploading new datasets to OpenML via Python
* Filters for listings (e.g. filter by author, tags, other properties)

All of this is already possible with the R and Java API.