[View in Colaboratory](https://colab.research.google.com/github/leobezerra/automl-pybr14/blob/master/AutoML.ipynb)

# AutoML with Python

In this notebook, we'll create a classifier for a digit classification problem in an automated way.

We'll use the ```auto-sklearn``` package for that, which implements the AutoSklearn approach to AutoML.

AutoSklearn is an ensemble-based approach that selects and configures algorithms in an automated way.

It takes as input a labeled training set and produces an algorithm expected to perform well on that input problem.

AutoSklearn can be used both for classification and regression.

In this example, we'll create a classifier for the MNIST handwritten digit classification problem. 

## Installing auto-sklearn

The ```auto-sklearn``` package has a few dependencies we need to install first:

**Note that the command below works on Colab and on Debian-based Linux distributions. If you use another platform, you will need to install this library manually.**

In [0]:
!apt-get install build-essential swig

Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.4ubuntu1).
Suggested packages:
  swig-doc swig-examples swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig swig3.0
0 upgraded, 2 newly installed, 0 to remove and 12 not upgraded.
Need to get 1,100 kB of archives.
After this operation, 5,822 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig3.0 amd64 3.0.12-1 [1,094 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig amd64 3.0.12-1 [6,460 B]
Fetched 1,100 kB in 1s (1,297 kB/s)
Selecting previously unselected package swig3.0.
(Reading database ... 22278 files and directories currently installed.)
Preparing to unpack .../swig3.0_3.0.12-1_amd64.deb ...
Unpacking swig3.0 (3.0.12-1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_3.0.12-1_amd64.deb ...
Unpacking s

In [0]:
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   209  100   209    0     0   1514      0 --:--:-- --:--:-- --:--:--  1514
Collecting nose
  Downloading https://files.pythonhosted.org/packages/15/d8/dd071918c040f50fa1cf80da16423af51ff8ce4a0f2399b7bf8de45ac3d9/nose-1.3.7-py3-none-any.whl (154kB)
Installing collected packages: nose
Successfully installed nose-1.3.7
Collecting Cython
  Downloading https://files.pythonhosted.org/packages/64/3f/cac281f3f019b825bbc03fa8cb7eb03d9c355f4aa9eef978279a4966cb21/Cython-0.29-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
Installing collected packages: Cython
Successfully installed Cython-0.29
Collecting xgboost==0.7.pos

**Note that pip expects the name ```auto-sklearn``` rather than ```autosklearn```**

In [0]:
!pip install auto-sklearn

Collecting auto-sklearn
  Downloading https://files.pythonhosted.org/packages/0a/a6/cbbff9205cb7dc71d67a6c05ecdd5aa05856bc1638360238d25a4a01d670/auto-sklearn-0.4.0.tar.gz (3.4MB)
Collecting pynisher<0.5,>=0.4 (from auto-sklearn)
  Downloading https://files.pythonhosted.org/packages/d2/cd/4e0469a55fd280df177af2d5e94d72541d3bb0115280e31a23c8922987e6/pynisher-0.4.2.tar.gz
Building wheels for collected packages: auto-sklearn, pynisher
  Running setup.py bdist_wheel for auto-sklearn ... done
  Stored in directory: /root/.cache/pip/wheels/3f/4e/d9/489ca4cb2f6fd94f58180b0073d15746583f772f25d9178b94
  Running setup.py bdist_wheel for pynisher ... done
  Stored in directory: /root/.cache/pip/wheels/81/35/cb/37fe9c279ac6e56fc8805e146a431c27550dce1ad868ffa04e
Successfully built auto-sklearn pynisher
Installing collected packages: pynisher, auto-sk

## Importing the libraries

For this example, we'll use the classification module of AutoSklearn.

In [0]:
from autosklearn.classification import AutoSklearnClassifier

We'll also need the metrics module from scikit-learn when evaluating the model produced. We'll use the accuracy and confusion matrix tools.

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix

## Reading the data

We'll use the MNIST dataset available in TensorFlow.

**You can also download this dataset from other sources if you don't have (or don't want) TensorFlow installed on your computer.**
 

In [0]:
from tensorflow.examples.tutorials.mnist import input_data

In [0]:
# Download MNIST into the local "MNIST" folder (if needed) and load it
m = input_data.read_data_sets("MNIST")

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Instructions for updating:
Please use tf.data to implement this functionality.


Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST/train-images-idx3-ubyte.gz




Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST/t10k-labels-idx1-ubyte.gz


This dataset comes pre-split into a training and a testing subset. 

**This is extremely important to ensure that the produced model is good at classifying new samples, not just the ones it already knows.** 

### Training examples

The training subset contains two structures: ```images``` and  ```labels```.

Let's check what's in ```images```:

**By convention, the machine learning community uses X for examples and y for labels.** 

In [0]:
X_train = m.train.images
print("X_train")
print(X_train)

X_train
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


From what we can see, X_train is a matrix with lots of zeros. 

Let's check a bit more:

In [0]:
type(X_train)

numpy.ndarray

If you're used to pandas, scikit-learn or other data science / machine learning packages, you've probably come across ```ndarrays``` from the ```numpy``` package.

It is basically a more efficient version of a Python ```list``` for numerical data, and we can check its dimensions using the attribute ```shape```:

In [0]:
X_train.shape

(55000, 784)

The shape of the ```images``` collection reveals that it contains 55000 images, each represented by a 784-element array.

Each vector is the result of flattening the original 28x28 image (28 * 28 = 784).

**This used to be a very common representation in computer vision problems before deep learning algorithms, since flattening an image makes the data compatible with most algorithms but discards the vertical spatial relation between pixels.**
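
To make the flattening step concrete, here is a minimal sketch using NumPy. The `image` array is made up for illustration (its pixel values simply count upwards); it is not taken from MNIST.

```python
import numpy as np

# A made-up 28x28 "image" whose pixel values simply count from 0 to 783.
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Flatten row by row (NumPy's default C order) into a 784-element vector.
flat = image.reshape(-1)
print(flat.shape)  # (784,)

# Horizontal neighbours stay adjacent in the flattened vector...
assert flat[1] == image[0, 1]
# ...but the pixel directly below (0, 0) is now 28 positions away.
assert flat[28] == image[1, 0]

# The original layout can be recovered as long as we remember the shape.
restored = flat.reshape(28, 28)
assert np.array_equal(restored, image)
```

The two assertions in the middle show why vertical neighbourhood is lost: pixels that touch vertically in the image end up 28 positions apart in the vector.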

Now let's check what's in the ```labels``` collection.

In [0]:
y_train = m.train.labels
print("y_train")
print(y_train)

y_train
[7 3 4 ... 5 6 8]


This time we have a one-dimensional array, but what do the values mean?

Let's check its type.

In [0]:
type(y_train)

numpy.ndarray

Again, we have an ```ndarray```,  so we can check its shape:

In [0]:
y_train.shape

(55000,)

The ```labels``` collection states which class each of the 55000 images from the ```images``` collection belongs to.

In this case, labels range from 0 to 9, since those are the handwritten digits present in the dataset.

**We can do some descriptive analysis of our data using Pandas and plotting libraries. Here I'll keep it simple and check only the distribution of the training examples across the different classes.**

In [0]:
from pandas import Series
Series(y_train).value_counts()

1    6179
7    5715
3    5638
2    5470
9    5454
0    5444
6    5417
8    5389
4    5307
5    4987
dtype: int64

Note that the data set contains many more images of 1 than of 5. 

**It's possible to preprocess the data to balance the number of examples from each class. I won't do it here to keep the notebook simple, but this is often helpful for the performance of the models.**
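
As a rough illustration of how the imbalance could be quantified, the sketch below reuses the class counts printed above. The weighting formula, `n_samples / (n_classes * count)`, is the heuristic scikit-learn documents for `class_weight="balanced"`; it is shown here only as one possible option, and nothing in this notebook applies it.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 6179, 7: 5715, 3: 5638, 2: 5470, 9: 5454,
          0: 5444, 6: 5417, 8: 5389, 4: 5307, 5: 4987}

n_samples = sum(counts.values())   # 55000
n_classes = len(counts)            # 10

# Ratio between the most and the least frequent classes.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.3f}")

# "Balanced" weights: rare classes get weights above 1, common ones below 1.
weights = {label: n_samples / (n_classes * count)
           for label, count in sorted(counts.items())}
for label, weight in weights.items():
    print(label, round(weight, 3))
```

With these counts the digit 1 is weighted below 1 and the digit 5 above 1, which is the direction a rebalancing step would push the data.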

### Testing examples

Now let's check what the testing subset looks like. Again, it contains collections of ```images``` and ```labels```.

In [0]:
X_test = m.test.images
print("X_test")
print(X_test)

X_test
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Indeed, it looks exactly like the training set images.

Let's check its type:

In [0]:
type(X_test)

numpy.ndarray

Yup, ```ndarray``` as expected.  What about its shape?

In [0]:
X_test.shape

(10000, 784)

Now we have 10000 examples, each a flattened image with 784 values.

**Note that the ratio between the training and testing subset sizes is quite in favor of training. This could be a problem in a real-world application, but it has been used as default for this particular problem.**
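
For reference, the split implied by the two shapes printed above works out as follows (plain arithmetic, no assumptions beyond those numbers):

```python
n_train = 55000  # X_train.shape[0]
n_test = 10000   # X_test.shape[0]

train_fraction = n_train / (n_train + n_test)
print(f"training: {train_fraction:.1%}, testing: {1 - train_fraction:.1%}")
# training: 84.6%, testing: 15.4%
```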

Finally, let's check what's in the ```labels``` collection.

In [0]:
y_test = m.test.labels
print("y_test")
print(y_test)

y_test
[7 2 1 ... 4 5 6]


As expected, we have a list of values that should represent the class each testing example belongs to.

Let's confirm that:

In [0]:
type(y_test)

numpy.ndarray

In [0]:
y_test.shape

(10000,)

## Creating a classifier with autosklearn

The AutoML approach we're studying here is trying to solve a problem known as **CASH** (*combined algorithm selection and hyperparameter optimization*).

This problem addresses two questions simultaneously: 
* what is the best algorithm for my problem?
* how can it be configured to perform best on my problem?
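
The toy sketch below illustrates what answering both questions at once means: it searches a joint space of (algorithm, configuration) pairs and keeps the best one. Everything in it is hypothetical: the search space, the `toy_validation_score` function and its invented numbers only stand in for real training and validation, and the exhaustive loop is far simpler than the search AutoSklearn actually performs.

```python
from itertools import product

# Hypothetical search space: each algorithm has its own hyperparameter grid.
search_space = {
    "knn": {"n_neighbors": [1, 5, 15]},
    "svm": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
}

def toy_validation_score(algorithm, params):
    """Stand-in for 'train on the training split, score on the validation split'.

    The numbers are invented so that the example runs instantly.
    """
    if algorithm == "knn":
        return 0.90 - 0.002 * abs(params["n_neighbors"] - 5)
    bonus = 0.03 if params["kernel"] == "rbf" else 0.0
    return 0.92 + bonus - 0.01 * abs(params["C"] - 1.0)

def solve_cash(space, score_fn):
    """Exhaustively search the joint (algorithm, configuration) space."""
    best = None
    for algorithm, grid in space.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            params = dict(zip(names, values))
            score = score_fn(algorithm, params)
            if best is None or score > best[0]:
                best = (score, algorithm, params)
    return best

best_score, best_algorithm, best_params = solve_cash(search_space, toy_validation_score)
print(best_algorithm, best_params, round(best_score, 3))
```

Note that the winner is a pair: choosing the algorithm first and tuning it afterwards would be a different, sequential procedure.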

Addressing these two questions simultaneously is what lets AutoSklearn create a model that is already configured and is expected to perform well on your input problem.

To do that, we must provide a setup for AutoSklearn to run. We do that through arguments of the ```AutoSklearnClassifier``` constructor. 

Let's see some of the most important:
* `time_left_for_this_task`: the total time (in seconds) AutoSklearn has to create your model. A suitable value depends on which algorithms you let AutoSklearn test and how long they take to run on your problem. As a rule of thumb, the total runtime should let AutoSklearn run at least 1000 experiments (that is, try 1000 different algorithms/configurations).
* `per_run_time_limit`: the maximum time (in seconds) a single algorithm/configuration is allowed to run. Again, this depends on which algorithms you let AutoSklearn test and how long they take to run on your problem. As a rule of thumb, you should not allow a single run to use over 1% of your total runtime, but depending on the application this can be relaxed to 10%.
* `resampling_strategy`: the strategy used internally by AutoSklearn to split the data into training and validation subsets. Holdout is the simplest (and the default), but other techniques such as cross-validation can also be used.
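
The 1% to 10% rule of thumb for `per_run_time_limit` can be turned into a small helper. The function name `per_run_limit_range` is made up for this sketch and is not part of auto-sklearn.

```python
def per_run_limit_range(total_seconds, low_pct=1, high_pct=10):
    """Return the (min, max) per-run limit, in seconds, for a given total budget."""
    return total_seconds * low_pct // 100, total_seconds * high_pct // 100

total_budget = 20 * 60  # a 20-minute budget, in seconds
low_limit, high_limit = per_run_limit_range(total_budget)
print(total_budget, low_limit, high_limit)  # 1200 12 120
```

For a 20-minute budget, a per-run limit of 2 minutes (120 seconds) therefore sits at the relaxed 10% end of the range.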

The code below configures an `AutoSklearnClassifier` with a 20-minute budget to produce a configured, high-performing algorithm, using holdout as the validation strategy and limiting each run to a maximum of 2 minutes.

**Note that the default settings of `AutoSklearnClassifier` will have it run for 1h. This is still not considered enough in practical scenarios -- one should try between 24h and 72h, which means leaving the computer working for you while you go for a weekend on the beach :D**

In [0]:
automl_20min = AutoSklearnClassifier(time_left_for_this_task=1200, per_run_time_limit=120, resampling_strategy='holdout')
automl = automl_20min

In [0]:
#automl_1h = AutoSklearnClassifier()
#automl = automl_1h

We now tell the `AutoSklearnClassifier` object to select the configured algorithm that best fits our data:

**Beware: this will take as long as you have configured it to take.**

In [0]:
automl.fit(X_train, y_train)


AutoSklearnClassifier(delete_output_folder_after_terminate=True,
           delete_tmp_folder_after_terminate=True,
           disable_evaluator_output=False, ensemble_nbest=50,
           ensemble_size=50, exclude_estimators=None,
           exclude_preprocessors=None, get_smac_object_callback=None,
           include_estimators=None, include_preprocessors=None,
           initial_configurations_via_metalearning=25,
           ml_memory_limit=3072, output_folder=None,
           per_run_time_limit=120, resampling_strategy='holdout',
           resampling_strategy_arguments=None, seed=1, shared_mode=False,
           smac_scenario_args=None, time_left_for_this_task=1200,
           tmp_folder=None)

The object referenced by ```automl``` is now a fitted classifier, which we can use to make predictions.

## Evaluating the classifier

To assess how good the classifier produced by ```AutoSklearnClassifier``` is, we need to have it predict labels for our testing subset:

In [0]:
y_predicted = automl.predict(X_test)

Let's see what the `predict` method outputs:

In [0]:
print("y_predicted")
print(y_predicted)

y_predicted
[7 2 1 ... 4 5 6]


As expected, we get a list of labels predicted for each example. Let's just double-check that: 

In [0]:
type(y_predicted)

numpy.ndarray

In [0]:
y_predicted.shape

(10000,)

Yep, we have 10000 predictions in `y_predicted`.

Now let's see how those predictions compare to the true labels we had in `y_test`.

We can do that using a **confusion matrix**. 

If you have m classes in your problem, a confusion matrix is an m × m matrix whose main diagonal counts the number of times the prediction was correct.

Otherwise, if an example from class i is mistakenly predicted as belonging to class j, this is reported in cell (i, j):

In [0]:
confusion_matrix(y_test, y_predicted)

array([[ 970,    0,    1,    0,    0,    3,    2,    1,    3,    0],
       [   0, 1119,    3,    3,    0,    2,    4,    1,    3,    0],
       [   5,    0,  999,    6,    2,    1,    3,   10,    6,    0],
       [   1,    0,   13,  963,    0,   12,    0,    9,    9,    3],
       [   1,    0,    2,    0,  951,    0,    6,    0,    3,   19],
       [   5,    1,    1,   14,    4,  853,    7,    1,    5,    1],
       [   7,    3,    1,    0,    4,    7,  930,    0,    6,    0],
       [   0,    4,   23,    4,    4,    0,    0,  979,    3,   11],
       [   4,    0,    7,   12,    7,    9,    4,    4,  918,    9],
       [   7,    5,    3,   10,   11,    4,    1,    4,    4,  960]])
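
To make the row/column convention concrete, here is a minimal pure-Python sketch on made-up labels for a 3-class problem. The helper `toy_confusion_matrix` is written for illustration only; it follows the same convention described above (rows are true classes, columns are predicted classes) but is not scikit-learn's implementation.

```python
def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Count how often true class i (row) was predicted as class j (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Made-up labels for a 3-class problem.
y_true_toy = [0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0, 2]

toy_matrix = toy_confusion_matrix(y_true_toy, y_pred_toy, n_classes=3)
for row in toy_matrix:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 2]
```

Reading the MNIST matrix above the same way, the 23 in row 7, column 2 counts sevens that were predicted as twos, and the 19 in row 4, column 9 counts fours that were predicted as nines.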

Clearly, our model correctly classifies most of the examples in the testing set.

Let's use an analytical measure to quantify this:

In [0]:
print("Accuracy score",accuracy_score(y_test, y_predicted))

Accuracy score 0.9642


The above result means your classifier is correct roughly 96.4% of the time when you ask it to classify a handwritten digit similar to the ones from the MNIST dataset :D
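
The accuracy score and the confusion matrix are two views of the same predictions: accuracy is the sum of the main diagonal divided by the total number of examples. The sketch below checks this using the matrix printed earlier, copied by hand into a plain list; the per-digit recall at the end is an extra computed here for illustration, not something the notebook reports.

```python
# Confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
cm = [
    [970, 0, 1, 0, 0, 3, 2, 1, 3, 0],
    [0, 1119, 3, 3, 0, 2, 4, 1, 3, 0],
    [5, 0, 999, 6, 2, 1, 3, 10, 6, 0],
    [1, 0, 13, 963, 0, 12, 0, 9, 9, 3],
    [1, 0, 2, 0, 951, 0, 6, 0, 3, 19],
    [5, 1, 1, 14, 4, 853, 7, 1, 5, 1],
    [7, 3, 1, 0, 4, 7, 930, 0, 6, 0],
    [0, 4, 23, 4, 4, 0, 0, 979, 3, 11],
    [4, 0, 7, 12, 7, 9, 4, 4, 918, 9],
    [7, 5, 3, 10, 11, 4, 1, 4, 4, 960],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # main diagonal
total = sum(sum(row) for row in cm)              # all test examples
print(correct, total, correct / total)  # 9642 10000 0.9642

# Per-digit recall: fraction of each true digit that was predicted correctly.
recall = {digit: cm[digit][digit] / sum(cm[digit]) for digit in range(len(cm))}
worst = min(recall, key=recall.get)
print("hardest digit:", worst)
```

The diagonal sums to 9642 out of 10000 test examples, which reproduces the reported accuracy of 0.9642.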

## Critical discussion

Given the small amount of time we gave AutoSklearn, the results are pretty good, especially if you have no background at all in computer vision.

This is exactly the kind of scenario AutoML was designed for: one has little background in machine learning and/or the application problem.

However, if you have specialized knowledge, you know that one can get 97% accuracy using an SVM or above 99% using deep learning.

Also, setting up AutoSklearnClassifier involves a number of decisions that directly influence the quality of the results, such as the validation strategy.

That's why the guys behind the autosklearn package and the AutoML community in general (like me) keep researching this topic -- **feel free to join us :D**