# Adapters: Creating steps with multiple inputs

This notebook shows how to create a more complex pipeline, including steps with multiple inputs.

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from steppy.base import Step, BaseTransformer
from steppy.adapter import Adapter, E

EXPERIMENT_DIR = './ex2'

In [None]:
import shutil

# By default, pipelines try to load previously trained models, so we delete the cache to be sure we're starting from scratch
shutil.rmtree(EXPERIMENT_DIR, ignore_errors=True)

## Data

As before, we'll import a dataset from Scikit-learn for our experiments and divide it into training and test sets.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dset = load_breast_cancer()
X_dset, y_dset = dset.data, dset.target

X_train, X_test, y_train, y_test = train_test_split(X_dset, y_dset, test_size=0.2, stratify=y_dset, random_state=42)

print('{} samples for training'.format(len(y_train)))
print('{} samples for test'.format(len(y_test)))

data_train = {'input':
                {
                     'X': X_train,
                     'y': y_train,
                }
            }

data_test = {'input':
                {
                     'X': X_test,
                     'y': y_test,
                }
            }

## Creating pipeline components

This time we want to build a fancier pipeline. We'll normalize our data, run PCA on the normalized features to compute features of a different flavour, and then feed those features, together with the labels taken directly from the input, into a final logistic regression step.

Our first step will be a normalization step. We could use the one from Scikit-learn, but we'll write a pure NumPy implementation to show how this could be done:

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import the standalone package instead
import joblib

class NormalizationTransformer(BaseTransformer):
    def __init__(self):
        self.mean = None
        self.std = None
    
    # Having only X as input ensures that we don't accidentally fit y
    def fit(self, X):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X, **kwargs):
        X_tfm = (X - self.mean) / self.std
        return {'X': X_tfm}
    
    def persist(self, filepath):
        joblib.dump([self.mean, self.std], filepath)
        
    def load(self, filepath):
        self.mean, self.std = joblib.load(filepath)
        return self
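
To make the arithmetic behind `fit` and `transform` concrete, here is a small worked example using only the standard library. It is a sketch, not part of the pipeline: the helper names `fit_column_stats` and `standardize` and the toy data are made up for illustration. The key point is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import statistics


def fit_column_stats(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # pstdev is the population standard deviation, matching np.std's default (ddof=0)
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data (hypothetical): two training rows, one test row, two features
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

means, stds = fit_column_stats(train)         # statistics come from train only
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)     # test reuses the train statistics

print(means, stds)
print(train_std)
print(test_std)
```

Note that a constant feature would have a standard deviation of zero and cause a division by zero; the dataset used here has no such column, so the transformer above does not guard against it.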

We'll also construct a PCA transformer for our normalized features:

In [None]:
from sklearn.decomposition import PCA

class PCATransformer(BaseTransformer):
    def __init__(self):
        self.estimator = PCA(n_components=10)
        
    def fit(self, X):
        self.estimator.fit(X)
        return self

    def transform(self, X, **kwargs):
        X_tfm = self.estimator.transform(X)
        return {'X': X_tfm}
    
    def persist(self, filepath):
        joblib.dump(self.estimator, filepath)
        
    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self

Finally, we'll use logistic regression as our classifier:

In [None]:
from sklearn.linear_model import LogisticRegression

class LogRegTransformer(BaseTransformer):
    def __init__(self):
        self.estimator = LogisticRegression()
        
    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        y_pred = self.estimator.predict(X)
        return {'y_pred': y_pred}
    
    def persist(self, filepath):
        joblib.dump(self.estimator, filepath)
        
    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self

## Assembling the pipeline
Now we'll create steps from our transformers and link them all together.

Our normalization step requires only the features from the input, not the labels. In fact, we want to *avoid* passing it the labels in case the implementation leaks data (the first rule of data science is that you don't trust anyone). To achieve this, we pass the special `adapter` argument to the step constructor, which lets us extract just the required variables from the data dictionary.

In [None]:
norm_step = Step(name='Normalizer',
                 transformer=NormalizationTransformer(),
                 input_data=['input'],
                 adapter=Adapter({
                     'X': E('input', 'X')
                 }),
                 experiment_directory=EXPERIMENT_DIR)

The notation `E('input', 'X')` is a placeholder that tells steppy to extract the item `X` from the input data named `input`.

In [None]:
pca_step = Step(name='PCA',
                transformer=PCATransformer(),
                input_steps=[norm_step],
                experiment_directory=EXPERIMENT_DIR)

Our classifier step will have to combine two data flows: the features processed by PCA, and the labels fed directly from input. Therefore, we will have to use the `adapter` argument to specify how to map those inputs to transformer arguments.

In [None]:
lr_step = Step(name='LogReg',
               transformer=LogRegTransformer(),
               input_steps=[pca_step],
               input_data=['input'],
               adapter=Adapter({
                   'X': E('PCA', 'X'),
                   'y': E('input', 'y')
               }),
               experiment_directory=EXPERIMENT_DIR)
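
Conceptually, an adapter is a recipe for building the keyword arguments that a transformer receives. The plain-Python sketch below mimics that idea with nested dictionaries. It is not steppy's implementation: `resolve_inputs` is a hypothetical helper, `(source, key)` tuples stand in for `E(source, key)`, and the strings stand in for real arrays.

```python
def resolve_inputs(mapping, available):
    """Build transformer keyword arguments from named data sources.

    mapping:   {argument_name: (source_name, key)}
    available: {source_name: {key: value}}
    """
    kwargs = {}
    for arg_name, (source, key) in mapping.items():
        kwargs[arg_name] = available[source][key]
    return kwargs


# Everything the classifier step can see: the raw input and the PCA output
available = {
    'input': {'X': 'raw features', 'y': 'labels'},
    'PCA': {'X': 'PCA features'},
}

# Same shape as the adapter above: X comes from PCA, y comes from the input
mapping = {'X': ('PCA', 'X'), 'y': ('input', 'y')}

kwargs = resolve_inputs(mapping, available)
print(kwargs)
```

Both sources contain a key named `X`, which is exactly why the mapping has to name the source explicitly.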

You may think it's a bit cumbersome to create your transformers and then wrap them in steps. However, there is an advantage to this separation:
* The **transformer** is the ***implementation*** of a machine learning algorithm. It has inputs and outputs, but it doesn't know what these are connected to.
* The **steps** define the ***connections*** between different transformers. At this level of abstraction, all the algorithmic details are hidden. The code that defines steps and connects them together is compact, so it's easier to see what is connected to what.

So what does our pipeline look like?

In [None]:
lr_step

This looks about right - let's move on to training!

## Training

Training a pipeline is a one-liner. When we fit the final logistic regression step, it goes back to its input steps and fits them too (assuming there are no cached or persisted outputs, which is why we deleted any leftover cache at the start of the notebook). This works recursively: the parent steps ask the grandparent steps to fit, and so on.

In [None]:
preds_train = lr_step.fit_transform(data_train)
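
The recursion described above can be illustrated with a toy class. This is only a sketch of the idea, not how steppy implements it: `ToyStep` is hypothetical, it ignores caching and persistence, and it builds a string instead of transforming data.

```python
class ToyStep:
    """Hypothetical stand-in for a pipeline step; not steppy's Step class."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.is_fitted = False

    def fit_transform(self, data, fit_order):
        # Ask every parent for its output first; each parent recurses in turn
        upstream = [parent.fit_transform(data, fit_order) for parent in self.parents]
        # Only after all parents are done is this step itself fitted
        self.is_fitted = True
        fit_order.append(self.name)
        # Steps without parents read the raw data directly
        source = ', '.join(upstream) or data
        return '{}({})'.format(self.name, source)


normalizer = ToyStep('Normalizer')
pca = ToyStep('PCA', parents=[normalizer])
logreg = ToyStep('LogReg', parents=[pca])

fit_order = []
result = logreg.fit_transform('input', fit_order)
print(fit_order)
print(result)
```

Calling `fit_transform` on the last step alone is enough to fit the whole chain, starting from the step furthest upstream.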

Let's see how well we do on our training data:

In [None]:
from sklearn.metrics import accuracy_score
acc_train = accuracy_score(data_train['input']['y'], preds_train['y_pred'])
print('Training accuracy = {:.4f}'.format(acc_train))

## Generating test predictions

Running test data through our pipeline is a one-liner too:

In [None]:
preds_test = lr_step.transform(data_test)

What is our test score?

In [None]:
acc_test = accuracy_score(data_test['input']['y'], preds_test['y_pred'])
print('Test accuracy = {:.4f}'.format(acc_test))

That seems pretty good. Have a look at the next notebook for even more complex pipelines with parallel branches.