# Getting started with steps

This notebook shows how to **create** steps, **fit** them to data, **transform** new data, and take advantage of persistence.

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from steppy.base import Step, BaseTransformer

EXPERIMENT_DIR = './ex1'

In [None]:
import shutil

# By default, pipelines try to load previously trained models, so we delete the cache to be sure we're starting from scratch
shutil.rmtree(EXPERIMENT_DIR, ignore_errors=True)

## Grabbing some data

We'll import a dataset from scikit-learn for our experiments and divide it into training and test sets.

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, test_size=0.2, stratify=y_digits, random_state=42)

print('{} samples for training'.format(len(y_train)))
print('{} samples for test'.format(len(y_test)))

Steps pass data to each other as plain **Python dictionaries**. This makes it easy to pass collections of **arbitrary data types** (NumPy arrays, pandas DataFrames, etc.). The basic structure is as follows (fancier layouts are possible, but we leave those for the next example):

In [None]:
data_train = {'input':
                {
                     'X': X_train,
                     'y': y_train,
                }
            }

data_test = {'input':
                {
                     'X': X_test,
                     'y': y_test,
                }
            }
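
To make the dictionary convention concrete, here is a minimal sketch in plain Python of how a consumer could pull its arguments out of such a nested dictionary. The helper `select_inputs` is hypothetical and is not part of steppy; it only illustrates the idea that the outer key names a data source and the inner keys name the arguments handed to `fit` and `transform`.

```python
def select_inputs(data, source_names):
    """Merge the inner dictionaries of the requested sources into one dict.

    Hypothetical helper for illustration only; steppy's own logic may differ.
    """
    selected = {}
    for name in source_names:
        # Each outer key ('input' here) maps to a dict of named arguments.
        selected.update(data[name])
    return selected


# Toy stand-ins for X_train and y_train so the sketch runs on its own.
toy_data = {'input': {'X': [[0.0, 1.0], [1.0, 0.0]], 'y': [1, 0]}}

kwargs = select_inputs(toy_data, ['input'])
print(sorted(kwargs))  # ['X', 'y']
```

With this layout, adding another data source later only means adding another outer key, without changing how existing consumers look up `X` and `y`.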

## Creating steps

The main component of a step is a transformer. You only need to define a class that follows the **simple API** of `BaseTransformer`; beyond that, you can be as creative as you want!

... or you can just **wrap your favorite Scikit-learn estimator** like we do here:

In [None]:
from sklearn.ensemble import RandomForestClassifier
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23

class RandomForestTransformer(BaseTransformer):
    def __init__(self):
        self.estimator = RandomForestClassifier()
        
    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        y_pred = self.estimator.predict(X)
        return {'y_pred': y_pred}
    
    def persist(self, filepath):
        joblib.dump(self.estimator, filepath)
        
    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self

So what does the transformer do? It must be able to:
* **initialize** itself
* **fit** and **transform** the incoming data prepared by the adapter; when transforming, the result should be returned as a **dictionary** that can be **passed on to the next step**
* **persist** and **load** its parameters; this is handy when you're trying to avoid re-computing things over and over.
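
The same contract can be tried out without any machine-learning library. The sketch below is a stand-alone toy: `MeanCenterTransformer` is a hypothetical class that deliberately does not inherit from `BaseTransformer` (so that it runs without steppy installed) and uses `pickle` instead of `joblib`. It only mirrors the method names used by `RandomForestTransformer` above.

```python
import os
import pickle
import tempfile


class MeanCenterTransformer:
    """Toy transformer: learns the mean of X and subtracts it."""

    def __init__(self):
        self.mean = None  # the only learned parameter

    def fit(self, X):
        self.mean = sum(X) / len(X)
        return self

    def transform(self, X, **kwargs):
        # Return a dictionary so a downstream step could consume it by key.
        return {'X_centered': [x - self.mean for x in X]}

    def persist(self, filepath):
        with open(filepath, 'wb') as f:
            pickle.dump(self.mean, f)

    def load(self, filepath):
        with open(filepath, 'rb') as f:
            self.mean = pickle.load(f)
        return self


fitted = MeanCenterTransformer().fit([1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mean_center.pkl')
    fitted.persist(path)
    # A fresh instance recovers the learned state without refitting.
    restored = MeanCenterTransformer().load(path)

print(restored.transform([4.0, 5.0]))  # {'X_centered': [2.0, 3.0]}
```

Note that `persist` stores only the learned state, so `load` can rebuild a working transformer from a freshly constructed instance.
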

See how flexible this is? You can just as easily wrap your Keras or PyTorch models.

Now let's turn our transformer into a step:

In [None]:
classifier_step = Step(name='classifier',
                       transformer=RandomForestTransformer(),
                       input_data=['input'],                 
                       experiment_directory=EXPERIMENT_DIR
                      )

And that's our one-step pipeline finished. You can visualize it too:

In [None]:
classifier_step

This is just about the simplest pipeline you can imagine. Now let's train it!

## Training

Training a pipeline is a one-liner:

In [None]:
preds_train = classifier_step.fit_transform(data_train);
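
The cache-clearing cell at the top of this notebook exists because a pipeline tries to load previously trained models before fitting new ones. The sketch below illustrates that fit-or-load pattern in plain Python. It is not steppy's actual implementation: `fit_or_load`, `CountingTransformer` and the file name are hypothetical, chosen only to show why a stale cache can silently skip training.

```python
import os
import pickle
import tempfile


class CountingTransformer:
    """Toy transformer that records how many times fit() was called."""

    def __init__(self):
        self.fit_calls = 0
        self.total = None

    def fit(self, X):
        self.fit_calls += 1
        self.total = sum(X)
        return self

    def persist(self, filepath):
        with open(filepath, 'wb') as f:
            pickle.dump(self.total, f)

    def load(self, filepath):
        with open(filepath, 'rb') as f:
            self.total = pickle.load(f)
        return self


def fit_or_load(transformer, filepath, X):
    """Load persisted state if it exists; otherwise fit and persist it."""
    if os.path.exists(filepath):
        return transformer.load(filepath)
    transformer.fit(X)
    transformer.persist(filepath)
    return transformer


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, 'classifier.pkl')
    first = fit_or_load(CountingTransformer(), path, [1, 2, 3])
    # Second call finds the file and skips fitting, even with different data.
    second = fit_or_load(CountingTransformer(), path, [10, 20, 30])

print(first.fit_calls, second.fit_calls, second.total)  # 1 0 6
```

The second call returns the state learned from the first dataset, which is why deleting the experiment directory matters whenever the data or the model definition changes.
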

Let's see how well we do on our training data:

In [None]:
from sklearn.metrics import accuracy_score
acc_train = accuracy_score(data_train['input']['y'], preds_train['y_pred'])
print('Training accuracy = {:.4f}'.format(acc_train))
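
For reference, the accuracy reported here is simply the fraction of samples whose predicted label equals the true label. The sketch below recomputes that quantity by hand on made-up labels; `manual_accuracy` is a hypothetical helper, and the toy values are not results from this notebook.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration only.
toy_true = [0, 1, 2, 3, 4]
toy_pred = [0, 1, 2, 3, 9]

print('Toy accuracy = {:.4f}'.format(manual_accuracy(toy_true, toy_pred)))  # Toy accuracy = 0.8000
```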

## Generating test predictions

Running test data through our pipeline is a one-liner too:

In [None]:
preds_test = classifier_step.transform(data_test)

How good is our test score?

In [None]:
acc_test = accuracy_score(data_test['input']['y'], preds_test['y_pred'])
print('Test accuracy = {:.4f}'.format(acc_test))

That's pretty good for a first attempt!

Let's have a look at some predictions to make sure they're sensible:

In [None]:
fig, axs = plt.subplots(4, 8, figsize=(10, 6))
for i, ax in enumerate(axs.ravel()):
    ax.imshow(data_test['input']['X'][i].reshape(8, 8), cmap='gray')
    ax.axis('off')
    ax.set_title('pred = {}'.format(preds_test['y_pred'][i]))

And that's about it for a start! As you can see:
* It's easy to create steps by inheriting from `BaseTransformer`
* Transferring data between steps with Python dicts gives you a lot of flexibility
* Steps wrap easily around Scikit-learn estimators
* You can display a graph showing the structure of your pipeline
* Training and testing are pretty much one-liners

At this point it may seem like a lot of work for not much benefit, but once we start moving towards more complex pipelines, the reasoning behind all the components will become clearer. Have a look at the next notebook for a more advanced, multi-step pipeline!