Automate Machine Learning with TPOT
===================================

This example shows how [TPOT](https://epistasislab.github.io/tpot/) can be used with Dask.

TPOT is an [automated machine learning](https://en.wikipedia.org/wiki/Automated_machine_learning) library.
It evaluates many scikit-learn pipelines and hyperparameter combinations to find a model that works well for your data. Evaluating all of these candidates is computationally expensive, but amenable to parallelism. TPOT can use Dask to distribute the work across a cluster of machines.
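
To illustrate why this workload parallelizes well, here is a minimal sketch that scores a few hand-picked candidate pipelines concurrently. It is not how TPOT works internally: TPOT generates its candidates with genetic programming and can hand them to Dask, whereas this sketch uses a fixed, hypothetical set of candidates and a standard-library thread pool. The point is only that each candidate's cross-validation is independent of the others.

```python
from concurrent.futures import ThreadPoolExecutor

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Hypothetical candidates chosen for illustration only;
# TPOT searches a much larger space automatically.
candidates = {
    "scaled_knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "naive_bayes": make_pipeline(GaussianNB()),
    "tree": make_pipeline(DecisionTreeClassifier(max_depth=8, random_state=0)),
}


def evaluate(item):
    """Cross-validate one candidate; no state is shared between candidates."""
    name, pipeline = item
    scores = cross_val_score(pipeline, X, y, cv=2)
    return name, scores.mean()


# Each evaluation is an independent task, so the tasks can run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(evaluate, candidates.items()))

best_name = max(results, key=results.get)
print(results)
print("best candidate:", best_name)
```

Because the tasks share no state, the same pattern scales from threads on one machine to workers on a cluster.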

This notebook can be run interactively on the [dask examples binder](https://github.com/dask/dask-examples).
The following video shows a larger version of this notebook on a cluster.

In [1]:
from IPython.display import HTML

HTML('<div style="position:relative;height:0;padding-bottom:56.25%"><iframe src="https://www.youtube.com/embed/uyx9nBuOYQQ?ecver=2" width="640" height="360" frameborder="0" allow="autoplay; encrypted-media" style="position:absolute;width:100%;height:100%;left:0" allowfullscreen></iframe></div>')

In [2]:
!pip install tpot

Collecting tpot
  Downloading TPOT-0.11.6-py3-none-any.whl (86 kB)
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting deap>=1.2
  Downloading deap-1.3.1-cp38-cp38-manylinux2010_x86_64.whl (157 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py) ... done
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11956 sha256=73f86511779ec19c55dec5c458249a6b742d77c359d1dfe77bbfd07007295897
  Stored in directory: /home/runner/.cache/pip/wheels/a8/bb/8f/6b9328d23c2dcedbfeb8498b9f650d55d463089e3b8fc0bfb2
Successfully built stopit
Installing collected packages: update-checker, deap, stopit, tpot
Successfully installed deap-1.3.1 stopit-1.1.2 tpot-0.11.6 update-checker-0.18.0

In [3]:
import tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split



## Setup Dask

We first start a Dask client to get access to the Dask dashboard, which shows progress and performance metrics.

You can view the dashboard by clicking on the dashboard link after you run the cell.

In [4]:
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client

Client
  Scheduler: tcp://127.0.0.1:38813
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 4
  Cores: 4
  Memory: 7.29 GB


## Create Data

We'll use the digits dataset.
To ensure the example runs quickly, we'll make the training dataset relatively small.

In [5]:
digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
)

These are just small, in-memory NumPy arrays. This example is not applicable to larger-than-memory Dask arrays.
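
As a quick sanity check, the sketch below repeats the split and prints the array type, shapes, and memory footprint of the training data. The `random_state=0` argument is an addition for reproducibility and is not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same split as above; random_state is an assumption added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
    random_state=0,
)

# Plain NumPy arrays: only a small fraction of the images is used for training.
print(type(X_train).__name__, X_train.shape, X_test.shape)
print(f"training features: {X_train.nbytes / 1024:.1f} KiB")
```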

## Using Dask

TPOT follows the scikit-learn API; we specify a `TPOTClassifier` with a few hyperparameters, and then fit it on some data.
By default, TPOT trains on a single machine.
To ensure your cluster is used, specify the `use_dask` keyword.

In [6]:
# To scale up, increase TPOT parameters such as population_size and generations.
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    config_dict=tpot.config.classifier_config_dict_light,
    use_dask=True,  # evaluate candidate pipelines on the Dask cluster
)

In [7]:
tp.fit(X_train, y_train)

TPOTClassifier(config_dict={'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean',
                                                                                  'l1',
                                                                                  'l2',
                                                                                  'manhattan',
                                                                                  'cosine'],
                                                                     'linkage': ['ward',
                                                                                 'complete',
                                                                                 'average']},
                            'sklearn.decomposition.PCA': {'iterated_power': range(1, 11),
                                                          'svd_solver': ['randomized']},
                            'sklearn.feature_selection.SelectFwe': {'alpha': array([0.
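
After fitting, a natural next step is to score the result on the held-out data. The sketch below defines a small hypothetical helper, `holdout_report`, that calls `score` and, for a fitted TPOT estimator, picks up the selected scikit-learn pipeline from its `fitted_pipeline_` attribute. So that the block runs without TPOT or a cluster, it is demonstrated on a plain scikit-learn pipeline as a stand-in; the stand-in model and `random_state=0` are assumptions for illustration, not part of this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def holdout_report(model, X_test, y_test):
    """Return the held-out score and the underlying fitted pipeline."""
    accuracy = model.score(X_test, y_test)
    # A fitted TPOT estimator keeps its selected pipeline in `fitted_pipeline_`;
    # plain scikit-learn estimators have no such attribute, so fall back to the model.
    pipeline = getattr(model, "fitted_pipeline_", model)
    return accuracy, pipeline


digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.05, test_size=0.95, random_state=0
)

# Stand-in for the fitted TPOTClassifier (`tp` in the cells above).
stand_in = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
stand_in.fit(X_train, y_train)

accuracy, pipeline = holdout_report(stand_in, X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
print("pipeline steps:", [name for name, _ in pipeline.steps])
```

In the notebook itself, the equivalent call would be `holdout_report(tp, X_test, y_test)`.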

## Learn More

See the [Dask-ML](http://ml.dask.org/) and [TPOT](https://epistasislab.github.io/tpot/) documentation for more information on using Dask and TPOT.