# Better custom transformers in ML pipelines

One of the most convenient features in `scikit-learn` is the ability to build complex models by chaining transformers and estimators into pipelines.

![Optimus Prime](optimus-thumb.png)


Importantly, all (hyper-)parameters of each transformer remain accessible and *tunable*. The simplicity suffers somewhat once we need to add custom preprocessing functions into the pipeline. The "standard" approach using [`sklearn.preprocessing.FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) felt decidedly unsatisfactory once I tried to define some parameter search spaces, so I looked into implementing a more usable alternative:

> Beautiful is better than ugly!

<!--more-->

## Example and motivation

The pipeline approach simplifies model selection (including hyperparameter tuning), provides a simple way
to persist models, and thus solves many deployment and reproducibility issues.

`scikit-learn` provides a wide range of transformers for common data preprocessing tasks. Consider the following example:

In [1]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

Now, take some data:

In [3]:
X = np.array([
    [1., 2.],
    [3., 4.],
])

and put it through the pipeline:

In [4]:
pipeline.fit_transform(X)

array([[-1., -1.],
       [ 1.,  1.]])

Now, let us inspect the parameters of the pipeline:

In [5]:
pipeline.get_params()

{'memory': None,
 'steps': [('scaler',
   StandardScaler(copy=True, with_mean=True, with_std=True))],
 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True}

and change some of them:

In [6]:
pipeline.set_params(scaler__with_mean=False)
# or:
pipeline.named_steps['scaler'].set_params(with_mean=False)
pipeline.fit_transform(X)

array([[1., 2.],
       [3., 4.]])

This mechanism is hugely useful for saving models or searching over parameters, e.g. in a cross-validated grid search. A search grid could be specified as

In [7]:
param_grid = {
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False]
}
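
As a rough sketch of how such a grid is consumed, the snippet below feeds it to `GridSearchCV`. The synthetic data, the `Ridge` step named `model`, and the `demo_` variable names are assumptions made for illustration; they are not part of the example above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 2))
y_demo = X_demo @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=40)

# A search needs something to score, so a final estimator is appended
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge()),
])
demo_grid = {
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
}

search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
search.fit(X_demo, y_demo)

# Each parameter is listed independently; the search expands the combinations
n_candidates = len(search.cv_results_['params'])
print(n_candidates)  # 4
```

The two lists of two values expand to four candidate settings without any of them being written out by hand.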

### The problem with `FunctionTransformer`
What if we want to apply a custom function? Let's consider a simple stateless transform (i.e. one that does not need to store any fitted parameters):

In [8]:
def scale(x, factor=1.):
    """Scale array by given factor."""
    return x * factor

To wrap this in a pipeline, we can use the built-in `FunctionTransformer`:

In [9]:
from sklearn.preprocessing import FunctionTransformer

In [10]:
pipeline = Pipeline([
    ('scaler', FunctionTransformer(scale))
])
pipeline.fit_transform(X)



array([[1., 2.],
       [3., 4.]])

How do we change the `factor` parameter? `get_params` is suddenly much less useful:

In [11]:
pipeline.get_params()

{'memory': None,
 'steps': [('scaler',
   FunctionTransformer(accept_sparse=False, check_inverse=True,
             func=<function scale at 0x11c4db6a8>, inv_kw_args=None,
             inverse_func=None, kw_args=None, pass_y='deprecated',
             validate=None))],
 'scaler': FunctionTransformer(accept_sparse=False, check_inverse=True,
           func=<function scale at 0x11c4db6a8>, inv_kw_args=None,
           inverse_func=None, kw_args=None, pass_y='deprecated',
           validate=None),
 'scaler__accept_sparse': False,
 'scaler__check_inverse': True,
 'scaler__func': <function __main__.scale(x, factor=1.0)>,
 'scaler__inv_kw_args': None,
 'scaler__inverse_func': None,
 'scaler__kw_args': None,
 'scaler__pass_y': 'deprecated',
 'scaler__validate': None}

What we are forced to use is `FunctionTransformer`'s catch-all `kw_args`:

In [12]:
pipeline.set_params(scaler__kw_args={'factor': 2.})
pipeline.fit_transform(X)



array([[2., 4.],
       [6., 8.]])

If we wanted to perform a hyperparameter search, we would need to define a grid in a rather cumbersome way:

In [13]:
param_grid = {
    'scaler__kw_args': [{'factor': 1.}, {'factor': 2.}]
}

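Importantly, we lose the ability to factorize the search space: instead of listing the candidate values of each parameter independently, we have to spell out every combination of keyword arguments as a separate dictionary.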
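
To make the cost concrete, here is a small sketch of the workaround one ends up writing. The helper `kw_args_grid` and the second parameter `offset` are hypothetical; `scale` above only takes `factor`.

```python
from itertools import product


def kw_args_grid(**param_values):
    """Expand per-parameter value lists into a list of kw_args dictionaries."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Every combination becomes one opaque dictionary in the grid
param_grid = {
    'scaler__kw_args': kw_args_grid(factor=[1., 2.], offset=[0., 1.])
}
print(len(param_grid['scaler__kw_args']))  # 4
print(param_grid['scaler__kw_args'][0])    # {'factor': 1.0, 'offset': 0.0}
```

The search then sees a single parameter with four values, so anything that relies on per-parameter structure (for example, sampling `factor` from a distribution) no longer applies directly.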

Alternatively, we can wrap our function in a custom transformer class:

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin

class ScaleTransformer(BaseEstimator, TransformerMixin):
    """Custom scaling transformer"""
    def __init__(self, factor=1.):
        self.factor = factor
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return scale(X, factor=self.factor)

In [15]:
pipeline = Pipeline([
    ('scaler', ScaleTransformer())
])
pipeline.set_params(scaler__factor=2.)
pipeline.fit_transform(X)

array([[2., 4.],
       [6., 8.]])

The magic happens under the hood: `BaseEstimator.get_params` inspects the signature of the transformer's `__init__` method to determine which parameters are available, and `Pipeline` exposes them under the `<step>__<param>` naming scheme. However, writing all this boilerplate for each parametric function seems repetitive and outright un-Pythonic.
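
The idea behind that inspection can be sketched with the standard library alone. This is a simplified illustration, not `scikit-learn`'s actual code; `ScaleLike` and `discover_params` are hypothetical names.

```python
import inspect


class ScaleLike:
    """Hypothetical stand-in for a transformer; not an sklearn class."""

    def __init__(self, factor=1., offset=0.):
        self.factor = factor
        self.offset = offset


def discover_params(obj):
    """Read parameter names from __init__ and look up the matching attributes."""
    init_params = inspect.signature(type(obj).__init__).parameters
    names = [
        name for name, p in init_params.items()
        if name != 'self' and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    ]
    return {name: getattr(obj, name) for name in names}


print(discover_params(ScaleLike(factor=2.)))  # {'factor': 2.0, 'offset': 0.0}
```

This also shows why each `__init__` argument has to be stored under an attribute of the same name: the lookup relies on it.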

What I wanted was a *transformer factory*, which can construct the equivalent transformer *class* (or *instance*) from the function alone, along the lines of:

```python
pipeline = Pipeline([
    ('scaler', function_transformer(scale))
])

pipeline.set_params(scaler__factor=2.)
```

## Creating custom transformers dynamically

To create our transformer with desired properties dynamically, we need to solve three problems:
1. Determine the signature of the input function
2. Create functions for class methods `__init__`, `fit`, `transform`
3. Create the transformer class

In [16]:
func = scale

### Getting the function signature
Using the all-powerful `inspect` module, we can get the function's positional arguments, keyword arguments, and their default values; the name and docstring are available as attributes of the function itself:

In [17]:
import inspect

signature = inspect.signature(func)
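# Parameters without a default are treated as positional inputs (the data)
args = [name for name, param in signature.parameters.items() if param.default is inspect.Parameter.empty]
# Parameters with a default become the tunable hyperparameters
kwargs_defaults = [(name, param.default) for name, param in signature.parameters.items() if param.default is not inspect.Parameter.empty]
# Guard against functions without keyword arguments, where zip(*[]) cannot be unpacked
kwargs, defaults = zip(*kwargs_defaults) if kwargs_defaults else ((), ())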
all_args = list(args) + list(kwargs)

In [18]:
func.__name__, func.__doc__

('scale', 'Scale array by given factor.')

In [19]:
all_args, args, kwargs, defaults

(['x', 'factor'], ['x'], ('factor',), (1.0,))
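
The same steps can be wrapped in a small reusable helper. The name `split_signature` and the `identity` function are hypothetical, chosen only for this sketch.

```python
import inspect


def scale(x, factor=1.):
    """Scale array by given factor."""
    return x * factor


def identity(x):
    return x


def split_signature(func):
    """Return (positional names, keyword names, keyword defaults) of func."""
    params = inspect.signature(func).parameters
    empty = inspect.Parameter.empty
    args = [name for name, p in params.items() if p.default is empty]
    kwargs = tuple(name for name, p in params.items() if p.default is not empty)
    defaults = tuple(params[name].default for name in kwargs)
    return args, kwargs, defaults


print(split_signature(scale))     # (['x'], ('factor',), (1.0,))
print(split_signature(identity))  # (['x'], (), ())
```

Note that `*args` and `**kwargs` parameters also have no default, so in this simple form they would end up in the positional list; a more careful version would filter on `param.kind`.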

### Creating the class methods

Unfortunately, the only way to create the class methods seems to rely on `eval` - the `FunctionMaker` from the `decorator` module provides some respite.

In [20]:
from decorator import FunctionMaker

In [21]:
init_signature = '__init__(self, {args})'.format(args=', '.join(kwargs))
init_kwarg_string = '\n'.join(['self.{kwarg}={kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
init_body = """self.func = func
{init_kwarg_string}""".format(init_kwarg_string=init_kwarg_string)

proto__init = FunctionMaker.create(init_signature, init_body, {'func': func}, defaults=defaults)
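# Accept an optional y (default None) so the transformer also works in supervised pipelines
proto_fit = FunctionMaker.create('fit(self, x, y)', 'return self', {}, defaults=(None,))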

kwarg_string = ', '.join(['{kwarg}=self.{kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
transform_body = 'return self.func(x, {kwarg_string})'.format(kwarg_string=kwarg_string)
proto_transform = FunctionMaker.create('transform(self, x)', transform_body, {})
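
To see what is actually handed to `FunctionMaker`, it can help to print the generated strings. The sketch below only repeats the string formatting for the `scale` example and needs no third-party imports.

```python
kwargs = ('factor',)  # keyword names extracted from scale(x, factor=1.)

init_signature = '__init__(self, {args})'.format(args=', '.join(kwargs))
init_kwarg_string = '\n'.join(
    'self.{kwarg}={kwarg}'.format(kwarg=kwarg) for kwarg in kwargs
)
init_body = 'self.func = func\n' + init_kwarg_string

kwarg_string = ', '.join(
    '{kwarg}=self.{kwarg}'.format(kwarg=kwarg) for kwarg in kwargs
)
transform_body = 'return self.func(x, {kwarg_string})'.format(kwarg_string=kwarg_string)

print(init_signature)   # __init__(self, factor)
print(init_body)        # self.func = func
                        # self.factor=factor
print(transform_body)   # return self.func(x, factor=self.factor)
```

In other words, the generated methods are the same ones written by hand in `ScaleTransformer`, with `func` supplied through the evaluation namespace.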

In [22]:
proto_dict = {
    '__doc__': func.__doc__,
    '__init__': proto__init,
    'fit': proto_fit,
    'transform': proto_transform
}

### Creating the new class

In [23]:
from sklearn.base import BaseEstimator, TransformerMixin
new_class = type('FunctionTransformer_'+func.__name__, (BaseEstimator, TransformerMixin), proto_dict)

In [24]:
new_transformer = new_class()

In [25]:
??new_transformer

Type:        FunctionTransformer_scale
String form: FunctionTransformer_scale(factor=1.0)
Docstring:   Scale array by given factor.


In [26]:
new_transformer.__doc__

'Scale array by given factor.'

In [27]:
new_transformer.get_params()

{'factor': 1.0}

In [28]:
new_transformer.set_params(factor=3)

FunctionTransformer_scale(factor=3)

In [29]:
new_transformer.fit_transform(X)

array([[ 3.,  6.],
       [ 9., 12.]])
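
One possible way to avoid generated source altogether is to write a generic `__init__` and attach a synthetic signature to it, since `inspect.signature` honours a function's `__signature__` attribute. This is a sketch of an alternative technique, not the approach used above; `make_transformer_class` is a hypothetical name, and the default `bases=(object,)` keeps the example free of third-party imports. Passing `bases=(BaseEstimator, TransformerMixin)` should let `get_params` pick up the same parameters, but that is an assumption not exercised here.

```python
import inspect


def scale(x, factor=1.):
    """Scale array by given factor."""
    return x * factor


def make_transformer_class(func, bases=(object,)):
    """Build a class whose __init__ exposes the keyword arguments of func."""
    empty = inspect.Parameter.empty
    kw_params = [p for p in inspect.signature(func).parameters.values()
                 if p.default is not empty]
    self_param = inspect.Parameter('self', inspect.Parameter.POSITIONAL_OR_KEYWORD)
    init_signature = inspect.Signature([self_param] + kw_params)

    def __init__(self, *args, **kwargs):
        # Bind against the synthetic signature so defaults and errors behave as usual
        bound = init_signature.bind(self, *args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            if name != 'self':
                setattr(self, name, value)

    # inspect.signature() now reports (self, factor=1.0) instead of (*args, **kwargs)
    __init__.__signature__ = init_signature

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return func(x, **{p.name: getattr(self, p.name) for p in kw_params})

    namespace = {
        '__doc__': func.__doc__,
        '__init__': __init__,
        'fit': fit,
        'transform': transform,
    }
    return type('FunctionTransformer_' + func.__name__, bases, namespace)


ScaleT = make_transformer_class(scale)
t = ScaleT(factor=3)
print(list(inspect.signature(ScaleT.__init__).parameters))  # ['self', 'factor']
print(t.fit([1, 2]).transform(2))  # 6
```

The trade-off is that tracebacks and editor tooltips see a generic `__init__`, whereas the `FunctionMaker` route produces methods with real source.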

## Complete code

Putting it all together, I arrived at the implementation found [here](https://github.com/ig248/mario/blob/master/mario/factory/):

In [30]:
from sklearn.pipeline import Pipeline
from mario.factory import function_transformer

In [31]:
pipeline = Pipeline([
    ('identity', function_transformer()),
    ('scaler', function_transformer(scale, factor=2))
])
pipeline.set_params(scaler__factor=10)
pipeline.fit_transform(X)

array([[10., 20.],
       [30., 40.]])

No more parameter grids over lists of `kw_args` dictionaries!
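
For comparison, a factorized grid over a named parameter can be sketched as follows. To keep the snippet independent of `mario`, it uses a hand-written `ScaleTransformer` as a stand-in for the factory-made transformer; the candidate values are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline


class ScaleTransformer(BaseEstimator, TransformerMixin):
    """Stand-in for a transformer generated from scale(x, factor=1.)."""

    def __init__(self, factor=1.):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X * self.factor


X = np.array([
    [1., 2.],
    [3., 4.],
])
pipeline = Pipeline([('scaler', ScaleTransformer())])

# The parameter is addressed by name, one list of candidates per parameter
grid = ParameterGrid({'scaler__factor': [1., 2., 10.]})
results = {
    params['scaler__factor']: pipeline.set_params(**params).fit_transform(X)
    for params in grid
}
print(results[10.])
# [[10. 20.]
#  [30. 40.]]
```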