# OpenML in Python 
OpenML is an online collaboration platform for machine learning: 

* Find or share interesting, well-documented datasets
* Define research / modelling goals (tasks)
* Explore large numbers of machine learning algorithms, with APIs in Java, R, and Python
* Log and share reproducible experiments, models, results 
* Works seamlessly with scikit-learn and other libraries
* Large scale benchmarking, compare to state of the art

# Installation

* `pip install openml`

In a notebook, prefix the command with `!` to run it from a cell: `!pip install openml`

### Exercise
- Find datasets with more than 10000 examples
- Find a dataset called 'eeg_eye_state'
- Find all datasets with more than 50 classes
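One way to approach the exercise above is to fetch the dataset listing once and filter it locally. The sketch below is a hedged illustration, not the tutorial's reference solution: `filter_datasets` and `fetch_dataset_records` are hypothetical helper names, and the `output_format="dataframe"` argument as well as the column names `NumberOfInstances` and `NumberOfClasses` are assumptions about the installed `openml` version.

```python
def filter_datasets(records, min_instances=None, min_classes=None, name=None):
    """Return the records that satisfy every filter that is not None.

    `records` is a list of dicts, one per dataset (hypothetical helper).
    """
    selected = []
    for rec in records:
        n_inst = rec.get("NumberOfInstances")
        n_cls = rec.get("NumberOfClasses")
        # Missing values (None or NaN) never satisfy a "more than" filter.
        if min_instances is not None and not (n_inst is not None and n_inst > min_instances):
            continue
        if min_classes is not None and not (n_cls is not None and n_cls > min_classes):
            continue
        if name is not None and rec.get("name") != name:
            continue
        selected.append(rec)
    return selected


def fetch_dataset_records():
    """Download the dataset listing and convert it to a list of dicts."""
    import openml  # imported lazily so the filter can be used without the package

    # Assumption: this version accepts output_format="dataframe".
    listing = openml.datasets.list_datasets(output_format="dataframe")
    return listing.to_dict("records")
```

With a network connection, `filter_datasets(fetch_dataset_records(), min_instances=10000)` would cover the first bullet; the `name` and `min_classes` arguments cover the other two.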

## Download datasets
Download the `eeg_eye_state` dataset. This is done based on the dataset ID ('did').

Get the actual data.  
It is returned as a NumPy array, together with meta-information (e.g. target feature, feature names, ...).
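The download and data-access steps might look roughly as follows. This is a sketch under assumptions: `download_dataset` and `load_features_and_target` are hypothetical helpers, and the shape of the value returned by `get_data` differs between `openml` releases (newer ones are assumed to return a 4-tuple and may hand back a pandas DataFrame rather than a NumPy array).

```python
def download_dataset(dataset_id):
    """Fetch a dataset object by its ID ('did')."""
    import openml  # lazy import: only needed when actually downloading

    return openml.datasets.get_dataset(dataset_id)


def load_features_and_target(dataset):
    """Split a dataset into features X, target y, and feature names.

    Works with any object that offers `default_target_attribute` and
    `get_data(target=...)`, which is what an OpenML dataset is assumed to provide.
    """
    target = dataset.default_target_attribute
    result = dataset.get_data(target=target)
    # Assumption: newer releases return (X, y, categorical_indicator, names),
    # older ones may return only (X, y).
    X, y = result[0], result[1]
    names = result[3] if len(result) > 3 else None
    return X, y, names
```

The dataset ID to pass to `download_dataset` is the `did` found in the dataset listing for `eeg_eye_state`.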

### Exercise
- Explore the data visually

# Task

The function `openml.evaluations.list_evaluations(...)` returns a dict mapping from `run_id` to [OpenMLEvaluation](https://openml.github.io/openml-python/master/generated/openml.OpenMLEvaluation.html#openml.OpenMLEvaluation). It accepts several filter arguments to keep the result set small (keep in mind that OpenML has almost 10 million runs and more than a billion evaluation records). The function is documented in the [API docs](https://openml.github.io/openml-python/master/generated/openml.evaluations.list_evaluations.html#openml.evaluations.list_evaluations). Examples of filters are `task`, `flow` and `function`. Note that one of these, `function`, is mandatory.

* Obtain a subset of 100 predictive accuracy (`predictive_accuracy`) results on the letter dataset (task id = 6).
* Obtain a subset of 100 predictive accuracy (`predictive_accuracy`) results per task in the OpenML 100 and plot them.
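A possible starting point for the first bullet is sketched below. `fetch_accuracies` and `summarize_evaluations` are hypothetical helpers; the keyword `tasks` (a list) is an assumption about newer `openml` releases, while older ones used `task`, so check the signature of your installed version. The summary only relies on each evaluation having a numeric `value` attribute.

```python
from statistics import mean


def fetch_accuracies(task_id, size=100):
    """Request up to `size` predictive-accuracy evaluations for one task."""
    import openml  # lazy import: only needed for the actual request

    # Assumption: the filter keyword is `tasks`; older releases used `task`.
    return openml.evaluations.list_evaluations(
        function="predictive_accuracy", tasks=[task_id], size=size
    )


def summarize_evaluations(evaluations):
    """Summarize a dict mapping run_id -> evaluation (object with `.value`)."""
    values = [ev.value for ev in evaluations.values() if ev.value is not None]
    if not values:
        return None
    return {
        "count": len(values),
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
    }
```

For the letter dataset this would be `summarize_evaluations(fetch_accuracies(6))`; repeating the call per task ID gives the values to plot for the second bullet.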

# Dataset Upload

There are various ways to upload a dataset. The most convenient ones are documented in [this example](https://github.com/openml/openml-python/blob/master/examples/create_upload_tutorial.py); the simplest starts from a [pandas dataframe](https://github.com/openml/openml-python/blob/a0ef724fec6ab31f6381d3ac2a84827ab535170d/examples/create_upload_tutorial.py#L206). Additionally, we need to create an [OpenMLDataset](https://openml.github.io/openml-python/master/generated/openml.OpenMLDataset.html#openml.OpenMLDataset) object containing information about the dataset. Most notably, the arguments `name`, `default_target_attribute`, `attributes` and `data` need to be set.

* Find your favorite dataset (on your laptop), load it as pandas dataframe and upload it to OpenML.
* Common problem: the server returns error 131. This means that the description file was incomplete. The [XSD](https://github.com/openml/OpenML/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd) for uploading the dataset indicates which fields are mandatory.
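To make the role of `attributes` and `data` more concrete, here is a hedged sketch that builds an ARFF-style attribute list from plain Python rows and passes it to `openml.datasets.create_dataset`. The helper names `infer_attributes` and `upload_dataset` are hypothetical, the type inference is deliberately simplistic (numeric columns become `REAL`, everything else nominal), and the exact set of keyword arguments accepted by `create_dataset` is an assumption that may differ between releases.

```python
def infer_attributes(column_names, rows):
    """Build a list of (name, type) pairs from row-oriented data.

    Numeric columns map to "REAL"; all other columns are treated as nominal
    and mapped to the sorted list of their distinct values.
    """
    attributes = []
    for idx, name in enumerate(column_names):
        values = [row[idx] for row in rows if row[idx] is not None]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            attributes.append((name, "REAL"))
        else:
            attributes.append((name, sorted({str(v) for v in values})))
    return attributes


def upload_dataset(name, description, column_names, rows, target):
    """Create and publish a dataset (requires a configured API key)."""
    import openml  # lazy import: only needed for the actual upload

    # Assumption: these keyword arguments match the installed release.
    dataset = openml.datasets.create_dataset(
        name=name,
        description=description,
        creator=None,
        contributor=None,
        collection_date=None,
        language="English",
        licence=None,
        default_target_attribute=target,
        row_id_attribute=None,
        ignore_attribute=None,
        citation=None,
        attributes=infer_attributes(column_names, rows),
        data=rows,
    )
    dataset.publish()
    return dataset
```

If the server answers with error 131, comparing the fields set here against the mandatory fields in the XSD is a reasonable first check.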


# Create a task

Create a task on the previously uploaded dataset. Note that each task can only be created once. Once you have successfully created a task, you need to find its ID using the search functionality or the frontend. For functions related to creating and finding tasks, see `openml.tasks.create_task` and `openml.tasks.list_tasks`. The task type ID of a supervised classification task is 1, and that of a supervised regression task is 2.

For a list of all estimation procedures, see the following URL: `https://test.openml.org/api/v1/estimationprocedure/list`
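Task creation could be sketched as below. `task_arguments` and `create_and_publish_task` are hypothetical helpers. The positional order passed to `openml.tasks.create_task` and the presence of a `TaskType` enum in newer releases are assumptions; the estimation procedure ID is left as a required argument and should be taken from the list at the URL above.

```python
# Task type IDs as stated in the text above.
TASK_TYPE_IDS = {"classification": 1, "regression": 2}


def task_arguments(kind, dataset_id, target_name, estimation_procedure_id):
    """Collect the values needed to create a supervised task."""
    if kind not in TASK_TYPE_IDS:
        raise ValueError(f"unknown task kind: {kind!r}")
    return {
        "task_type_id": TASK_TYPE_IDS[kind],
        "dataset_id": dataset_id,
        "estimation_procedure_id": estimation_procedure_id,
        "target_name": target_name,
    }


def create_and_publish_task(arguments):
    """Create the task on the server (requires a configured API key)."""
    import openml  # lazy import: only needed for the actual request

    # Assumption: newer releases expect a TaskType enum, older ones a plain int.
    task_type_enum = getattr(openml.tasks, "TaskType", None)
    type_id = arguments["task_type_id"]
    task_type = task_type_enum(type_id) if task_type_enum is not None else type_id

    task = openml.tasks.create_task(
        task_type,
        arguments["dataset_id"],
        arguments["estimation_procedure_id"],
        target_name=arguments["target_name"],
    )
    task.publish()
    return task
```

Because each task can only be created once, a second call with the same arguments is expected to fail on the server side.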


Find the task created above. Although we recorded the task ID when creating it, we might lose it, so we should be able to find the task without relying on the printed output above.
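One way to recover a lost task ID is to filter the task listing by dataset ID. In this sketch, `find_task_ids` and `fetch_task_records` are hypothetical helpers, and the record keys `tid`, `did` and `target_feature` as well as the `output_format="dataframe"` argument are assumptions about the listing returned by `openml.tasks.list_tasks`.

```python
def find_task_ids(task_records, dataset_id, target_name=None):
    """Return the IDs of all tasks defined on `dataset_id`.

    If `target_name` is given, only tasks with that target feature match.
    """
    return [
        rec["tid"]
        for rec in task_records
        if rec.get("did") == dataset_id
        and (target_name is None or rec.get("target_feature") == target_name)
    ]


def fetch_task_records():
    """Download the task listing and convert it to a list of dicts."""
    import openml  # lazy import: only needed for the actual request

    # Assumption: this version accepts output_format="dataframe".
    listing = openml.tasks.list_tasks(output_format="dataframe")
    return listing.to_dict("records")
```

The full listing is large, so restricting the request with the filters that `list_tasks` offers in your version is likely worthwhile.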


# Run a scikit-learn flow on the task

Find a scikit-learn model that you like (e.g., a random forest) and run it on the aforementioned task. For this, you typically use the function `openml.runs.run_model_on_task`. Don't forget to publish the run and view it on the website. Note that most scikit-learn algorithms can handle neither nominal values nor missing values.
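Since nominal and missing values need preprocessing, a run might wrap the model in a pipeline, as in the following sketch. The helper names are hypothetical, the choice of imputation strategies and one-hot encoding is just one reasonable option, and whether your `openml` release can serialize a `ColumnTransformer` is an assumption to verify.

```python
def split_columns(categorical_indicator):
    """Split column indices into (nominal, numeric) using a list of booleans."""
    nominal = [i for i, is_cat in enumerate(categorical_indicator) if is_cat]
    numeric = [i for i, is_cat in enumerate(categorical_indicator) if not is_cat]
    return nominal, numeric


def build_pipeline(categorical_indicator):
    """Impute missing values, one-hot encode nominal columns, fit a forest."""
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    nominal, numeric = split_columns(categorical_indicator)
    nominal_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("numeric", SimpleImputer(strategy="median"), numeric),
        ("nominal", nominal_steps, nominal),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("classifier", RandomForestClassifier(n_estimators=100)),
    ])


def run_and_publish(task_id, pipeline):
    """Run the pipeline on a task and upload the result (needs an API key)."""
    import openml  # lazy import: only needed for the actual run

    task = openml.tasks.get_task(task_id)
    # Keyword arguments avoid depending on the positional order.
    run = openml.runs.run_model_on_task(model=pipeline, task=task)
    run.publish()
    return run
```

The `categorical_indicator` list is assumed to come from the dataset's `get_data(target=...)` call, which in newer releases reports which feature columns are nominal.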


# Experiment, Repeat

This is all you need to know about experimenting on OpenML. Now, run the same algorithm with various hyperparameters, and observe the results on the website, or use the API function `openml.evaluations.list_evaluations` to list all evaluations on the flow. 
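Varying hyperparameters can be sketched as a small grid loop. `parameter_grid` and `run_grid` are hypothetical helpers, the random forest is only an example model, and publishing each run assumes a configured API key.

```python
from itertools import product


def parameter_grid(options):
    """Expand {name: [values, ...]} into a list of parameter dicts.

    Names are sorted so that the order of the combinations is deterministic.
    """
    names = sorted(options)
    combos = product(*(options[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def run_grid(task_id, options):
    """Run one random forest per parameter combination and publish each run."""
    import openml  # lazy imports: only needed for the actual runs
    from sklearn.ensemble import RandomForestClassifier

    task = openml.tasks.get_task(task_id)
    runs = []
    for params in parameter_grid(options):
        model = RandomForestClassifier(**params)
        run = openml.runs.run_model_on_task(model=model, task=task)
        run.publish()
        runs.append(run)
    return runs
```

For example, `parameter_grid({"n_estimators": [10, 100], "max_depth": [None, 5]})` yields four combinations, each of which would become one published run.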