# Getting started with steps

This notebook shows how to **create** steps, **fit** them to data, **transform** new data, and take advantage of persistence.

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from steps.base import Step, BaseTransformer

## Grabbing some data

We'll import a dataset from scikit-learn for our experiments and split it into training and test sets.

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

num_train = int(0.8 * len(y_digits))

sample_ids = np.random.permutation(len(y_digits))

train_ids = sample_ids[:num_train]
test_ids = sample_ids[num_train:]

print('{} samples for training'.format(len(train_ids)))
print('{} samples for test'.format(len(test_ids)))

Steps pass data to each other as plain **Python dictionaries**. This makes it easy to pass collections of **arbitrary data types** (NumPy arrays, pandas DataFrames, etc.). The basic structure is as follows (fancier layouts are possible, but we leave those to the next example):

In [None]:
data_train = {'input':
                {
                     'X': X_digits[train_ids, :],
                     'y': y_digits[train_ids],
                }
            }

data_test = {'input':
                {
                     'X': X_digits[test_ids, :],
                     'y': y_digits[test_ids],
                }
            }

## Creating steps
Let's create a simple step. First we define an **adapter** that tells the step how to interpret its input data dictionary (adapters allow lots of clever things, but we'll stick to the basics for now).

In [None]:
# This adapter just extracts the values under keys 'X' and 'y' from the node 'input'
input_adapter = {
                     'X': [('input', 'X')],
                     'y': [('input', 'y')]
                 }
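
To make the adapter's role concrete, here is a minimal sketch of how such a mapping could be turned into keyword arguments for a transformer. The helper `resolve_adapter` is hypothetical and not part of the steps library; it only assumes that each entry maps an argument name to a list of `(node, key)` pairs and that the first pair is the one to use.

```python
def resolve_adapter(adapter, data):
    """Build keyword arguments for a transformer from a nested data dict.

    Hypothetical helper, not part of the steps library: each adapter entry
    maps an argument name to a list of (node, key) pairs, and only the
    first pair is used here.
    """
    kwargs = {}
    for arg_name, sources in adapter.items():
        node, key = sources[0]
        kwargs[arg_name] = data[node][key]
    return kwargs


# Toy data with the same layout as data_train, but using plain lists
data = {'input': {'X': [[0, 1], [2, 3]], 'y': [0, 1]}}
adapter = {'X': [('input', 'X')], 'y': [('input', 'y')]}

kwargs = resolve_adapter(adapter, data)
print(sorted(kwargs))
```

The resulting dictionary could then be unpacked into a call such as `transformer.fit(**kwargs)`, which is roughly the job the adapter is described as doing.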

The second ingredient of a step is a transformer, which is where the real action happens. You just have to define a class following a **simple API** and then it's up to you to be as creative as you want!

... or you can just **wrap your favorite scikit-learn estimator** like we do here:

In [None]:
from sklearn.ensemble import RandomForestClassifier
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

class RandomForestTransformer(BaseTransformer):
    def __init__(self):
        self.estimator = RandomForestClassifier()
        
    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        y_pred = self.estimator.predict(X)
        # Return a dictionary so the next step can pick up outputs by key
        return {'y_pred': y_pred}
    
    def save(self, filepath):
        joblib.dump(self.estimator, filepath)
        
    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self

So what does the transformer do? It must be able to:
* **initialize** itself
* **fit** and **transform** the incoming data prepared by the adapter; when transforming, the result should be returned as a **dictionary** that can be **passed on to the next step**
* **save** and **load** its parameters; this is handy when you're trying to avoid re-computing things over and over.

See how flexible this is? You can just as easily wrap your Keras or PyTorch models.
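
As a rough illustration of the four responsibilities listed above, the sketch below defines a toy transformer that always predicts the most frequent training label. It is an assumption-laden example: the class `MajorityClassTransformer` is made up for this purpose, it deliberately does not subclass `BaseTransformer` so that it runs without steps installed, and it persists its single parameter with `json` rather than `joblib`.

```python
import json
import os
import tempfile
from collections import Counter


class MajorityClassTransformer:
    """Toy transformer: always predicts the most frequent training label.

    Assumption: it mirrors the fit/transform/save/load methods shown above
    but does not subclass BaseTransformer, so it runs without steps.
    """

    def __init__(self):
        self.majority_label = None

    def fit(self, X, y):
        self.majority_label = Counter(y).most_common(1)[0][0]
        return self

    def transform(self, X, **kwargs):
        # Return a dictionary so a later step can look up outputs by key
        return {'y_pred': [self.majority_label for _ in X]}

    def save(self, filepath):
        with open(filepath, 'w') as f:
            json.dump({'majority_label': self.majority_label}, f)

    def load(self, filepath):
        with open(filepath) as f:
            self.majority_label = json.load(f)['majority_label']
        return self


X_toy = [[0.0], [1.0], [2.0]]
y_toy = [1, 1, 0]

fitted = MajorityClassTransformer().fit(X_toy, y_toy)

# Round-trip the fitted parameter through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'majority.json')
    fitted.save(path)
    restored = MajorityClassTransformer().load(path)

toy_preds = restored.transform([[5.0], [6.0]])
```

The restored object makes the same predictions as the fitted one without calling `fit` again, which is the point of the save/load pair.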

Now let's combine our adapter and transformer into a step:

In [None]:
classifier_step = Step(name='classifier',
                       transformer=RandomForestTransformer(),
                       input_data=['input'],
                       adapter=input_adapter,
                       cache_dirpath='./cache'
                      )
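
This notebook does not spell out what `Step` does with `cache_dirpath`, so the following is only a sketch of the general fit-or-load idea behind persistence, not the library's actual behavior. The helper `fit_or_load` and the function `expensive_fit` are hypothetical names introduced for illustration.

```python
import os
import pickle
import tempfile


def fit_or_load(fit_fn, cache_path):
    """Return cached parameters if present, otherwise fit and cache them.

    Hypothetical helper: fit_fn is any zero-argument callable that returns
    a picklable object.
    """
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    params = fit_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(params, f)
    return params


calls = []


def expensive_fit():
    calls.append(1)  # record each time fitting actually runs
    return {'mean': 2.5}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'classifier.pkl')
    first = fit_or_load(expensive_fit, cache_path)   # fits and writes the cache
    second = fit_or_load(expensive_fit, cache_path)  # reads the cache instead
```

Under this scheme, deleting the cached file forces a fresh fit on the next call.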

And that's our one-step pipeline finished. You can visualize it too:

In [None]:
classifier_step

This is just about the simplest pipeline you can imagine. Now let's train it!

## Training

In [None]:
classifier_step.clean_cache()
preds_train = classifier_step.fit_transform(data_train);

In [None]:
acc_train = np.sum(preds_train['y_pred'] == data_train['input']['y']) / data_train['input']['y'].size
print('Training accuracy = {:.4f}'.format(acc_train))

## Generating test predictions

Running test data through our pipeline is as easy as this:

In [None]:
preds_test = classifier_step.transform(data_test)

In [None]:
acc_test = np.sum(preds_test['y_pred'] == data_test['input']['y']) / data_test['input']['y'].size
print('Test accuracy = {:.4f}'.format(acc_test))

Let's look at some predictions to see if they're sensible:

In [None]:
fig, axs = plt.subplots(4, 8, figsize=(10, 6))
for ii, ax in enumerate(axs.ravel()):
    ax.imshow(data_test['input']['X'][ii].reshape(8, 8), cmap='gray')
    ax.axis('off')
    ax.set_title('pred = {}'.format(preds_test['y_pred'][ii]))

And that's about it for a start! Have a look at the next notebook for a more advanced example.