# A Better FunctionTransformer
While trying to simplify building machine learning pipelines in Python, I found myself digging into some of the deeper machinery of the language, such as metaclasses. If something seems cumbersome, there must be a better way! (Though it might take some effort to find it!)

## Pipelines and parameters
One of the most convenient features of `scikit-learn` is the ability to build complex models by chaining transformers and estimators into pipelines, and to read and set (hyper-)parameters *after* the transformer (or pipeline) has been initialized.

Let us create a simple pipeline with a single step:

In [1]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
pipeline = Pipeline([
    ('scaler', StandardScaler())
])

Now, take some data:

In [3]:
X = np.array([
    [1, 2],
    [3, 4],
    [5, 6]
], dtype=float)

and put it through the pipeline:

In [4]:
pipeline.fit_transform(X)

array([[-1.22474487, -1.22474487],
       [ 0.        ,  0.        ],
       [ 1.22474487,  1.22474487]])

Now, let us inspect the parameters of the pipeline:

In [5]:
pipeline.get_params()

{'memory': None,
 'steps': [('scaler',
   StandardScaler(copy=True, with_mean=True, with_std=True))],
 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True}

and change some of them:

In [6]:
pipeline.set_params(scaler__with_mean=False)
# or:
pipeline.named_steps['scaler'].set_params(with_mean=False)
pipeline.fit_transform(X)

array([[0.61237244, 1.22474487],
       [1.83711731, 2.44948974],
       [3.06186218, 3.67423461]])

This mechanism is hugely useful for saving models and for searching over parameters, e.g. in a cross-validated grid search.
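
The `<step>__<parameter>` names seen above can be read as a simple routing rule: everything before the double underscore selects a step, and the rest names a parameter on that step. The following is a minimal sketch of that idea using only the standard library. `Step` and `set_nested_params` are hypothetical stand-ins invented for illustration, not scikit-learn's actual implementation.

```python
class Step:
    """Hypothetical stand-in for an estimator with one tunable attribute."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean


def set_nested_params(steps, **params):
    """Route each 'name__param' key to the step registered under 'name'."""
    for key, value in params.items():
        step_name, delim, param_name = key.partition('__')
        if not delim:
            raise ValueError(f"expected '<step>__<param>', got {key!r}")
        if step_name not in steps:
            raise ValueError(f"unknown step {step_name!r}")
        step = steps[step_name]
        if not hasattr(step, param_name):
            raise ValueError(f"invalid parameter {param_name!r} for step {step_name!r}")
        setattr(step, param_name, value)
    return steps


steps = {'scaler': Step()}
set_nested_params(steps, scaler__with_mean=False)
print(steps['scaler'].with_mean)  # False
```

Because every parameter is addressable by a flat string key, a parameter search only needs a dictionary of such keys and candidate values.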

## Custom transformers
What if we want to apply a custom function? Let's consider a simple stateless transform (i.e. one that does not need to store any fitted parameters):

In [7]:
def scale(x, factor=1.):
    """Scale array x by given factor"""
    return x * factor

To wrap this in a pipeline, we can use the built-in `FunctionTransformer`:

In [8]:
from sklearn.preprocessing import FunctionTransformer

In [10]:
pipeline = Pipeline([
    ('scaler', FunctionTransformer(scale))
])
pipeline.fit_transform(X)



array([[1., 2.],
       [3., 4.],
       [5., 6.]])

How do we change the `factor` parameter? `get_params` is suddenly much less useful:

In [11]:
pipeline.get_params()

{'memory': None,
 'steps': [('scaler',
   FunctionTransformer(accept_sparse=False, check_inverse=True,
             func=<function scale at 0x118b96730>, inv_kw_args=None,
             inverse_func=None, kw_args=None, pass_y='deprecated',
             validate=None))],
 'scaler': FunctionTransformer(accept_sparse=False, check_inverse=True,
           func=<function scale at 0x118b96730>, inv_kw_args=None,
           inverse_func=None, kw_args=None, pass_y='deprecated',
           validate=None),
 'scaler__accept_sparse': False,
 'scaler__check_inverse': True,
 'scaler__func': <function __main__.scale(x, factor=1.0)>,
 'scaler__inv_kw_args': None,
 'scaler__inverse_func': None,
 'scaler__kw_args': None,
 'scaler__pass_y': 'deprecated',
 'scaler__validate': None}

What we need is `FunctionTransformer`'s parameter `kw_args`:

In [12]:
# pipeline.set_params(scaler__factor=2.) # raises ValueError
pipeline.set_params(scaler__kw_args={'factor': 2.})
pipeline.fit_transform(X)



array([[ 2.,  4.],
       [ 6.,  8.],
       [10., 12.]])

This is not especially elegant, but we can wrap it up in an object that behaves just like a native transformer:

In [13]:
from sklearn.base import BaseEstimator, TransformerMixin

class ScaleTransformer(BaseEstimator, TransformerMixin):
    """Custom scaling transformer"""
    def __init__(self, factor=1.):
        self.factor = factor
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return scale(X, factor=self.factor)

In [14]:
pipeline = Pipeline([
    ('scaler', ScaleTransformer())
])
pipeline.set_params(scaler__factor=2.)
pipeline.fit_transform(X)

array([[ 2.,  4.],
       [ 6.,  8.],
       [10., 12.]])

The magic happens under the hood: `BaseEstimator.get_params` inspects the signature of the transformer's `__init__` method to determine which parameters are available, and `Pipeline` exposes them under the `<step>__<parameter>` names. However, writing all this boilerplate for each parametric function seems repetitive and outright un-Pythonic.

What I wanted was a *transformer factory*, which can construct the equivalent transformer *class* (or *instance*) from the function alone, along the lines of:

```python
pipeline = Pipeline([
    ('scaler', BetterFunctionTransformer(scale))
])

pipeline.set_params(scaler__factor=2.)
```
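
Before building this, it helps to see why `__init__` matters so much. Parameter discovery boils down to reading the names in the constructor signature and expecting an attribute of the same name on the instance. Below is a toy, standard-library-only sketch of that convention. `ParamsMixin` is a hypothetical class written for illustration and is much simpler than scikit-learn's `BaseEstimator`.

```python
import inspect


class ParamsMixin:
    """Toy stand-in for the parameter handling of an estimator base class."""

    @classmethod
    def _param_names(cls):
        # Read the parameter names from the signature of __init__, skipping self.
        sig = inspect.signature(cls.__init__)
        return sorted(name for name in sig.parameters if name != 'self')

    def get_params(self):
        # Each parameter is expected to be stored under an attribute of the same name.
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class Scale(ParamsMixin):
    def __init__(self, factor=1.0):
        self.factor = factor


s = Scale()
print(s.get_params())                    # {'factor': 1.0}
print(s.set_params(factor=2.0).factor)   # 2.0
```

This is why a generated transformer needs a real `__init__` with named parameters: a generic `**kwargs` constructor would leave nothing to discover.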

## Dynamically created transformer class
To create a transformer with the desired properties dynamically, we need to solve three problems:
1. Determine the signature of the input function
2. Create the functions that will serve as the methods `__init__`, `fit` and `transform`
3. Create the transformer class

In [15]:
func = scale

### Getting the function signature
Using `inspect.signature`, we get the function's positional arguments, its keyword arguments and their default values. The name and docstring are available as `func.__name__` and `func.__doc__`.

In [18]:
import inspect

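signature = inspect.signature(func)
params = signature.parameters.items()
# Parameters without a default are treated as positional arguments
args = [name for name, param in params if param.default is inspect.Parameter.empty]
# Parameters with a default become the tunable keyword arguments
kwargs_defaults = [(name, param.default) for name, param in params if param.default is not inspect.Parameter.empty]
# zip(*[]) cannot be unpacked into two names, so handle functions without keyword arguments
kwargs, defaults = zip(*kwargs_defaults) if kwargs_defaults else ((), ())
all_args = list(args) + list(kwargs)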

In [19]:
func.__name__, func.__doc__

('scale', 'Scale array x by given factor')

In [20]:
all_args, args, kwargs, defaults

(['x', 'factor'], ['x'], ('factor',), (1.0,))
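
The same extraction can be packaged as a small helper that also copes with functions that have no keyword arguments or that accept `*args` / `**kwargs`. This is a sketch under my own assumptions: `split_signature` and `identity` are hypothetical names, and skipping variadic parameters is one possible policy rather than anything the code above prescribes.

```python
import inspect


def split_signature(func):
    """Return (required parameter names, {optional name: default}) for func."""
    required, optional = [], {}
    for name, param in inspect.signature(func).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue  # *args / **kwargs cannot become named parameters
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional[name] = param.default
    return required, optional


def scale(x, factor=1.):
    """Scale array x by given factor"""
    return x * factor


def identity(x):
    return x


print(split_signature(scale))     # (['x'], {'factor': 1.0})
print(split_signature(identity))  # (['x'], {})
```

Returning a dictionary instead of two parallel tuples avoids the unpacking problem when there are no defaults at all.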

### Creating the class methods

In [21]:
from decorator import FunctionMaker

In [22]:
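# __init__ takes the keyword arguments of func and stores each one on self
init_signature = '__init__(self, {args})'.format(args=', '.join(kwargs))
init_kwarg_string = '\n'.join(['self.{kwarg}={kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
init_body = """self.func = func
{init_kwarg_string}""".format(init_kwarg_string=init_kwarg_string)

# FunctionMaker compiles the signature and body into a real function;
# default values are passed separately through `defaults`
proto__init = FunctionMaker.create(init_signature, init_body, {'func': func}, defaults=defaults)
# fit is a no-op for a stateless transform; y defaults to None as in the scikit-learn API
proto_fit = FunctionMaker.create('fit(self, x, y)', 'return self', {}, defaults=(None,))

# transform calls func with the parameter values currently stored on self
kwarg_string = ', '.join(['{kwarg}=self.{kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
transform_body = 'return self.func(x, {kwarg_string})'.format(kwarg_string=kwarg_string)
proto_transform = FunctionMaker.create('transform(self, x)', transform_body, {})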

In [23]:
proto_dict = {
    '__init__': proto__init,
    'fit': proto_fit,
    'transform': proto_transform
}
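
For readers who prefer not to depend on the `decorator` package, a similar `__init__` can be generated with the built-in `exec`. The sketch below is written in the same spirit as the code above but is not equivalent to `FunctionMaker` in every detail. `make_init` and `Holder` are hypothetical names, and the parameter names are assumed to come from a real function signature, so they are valid identifiers.

```python
import inspect


def make_init(func, defaults):
    """Build __init__(self, <name>=<default>, ...) that stores each value on self."""
    names = list(defaults)
    # Defaults are looked up in the exec namespace, so arbitrary objects work.
    params = ', '.join(f'{n}=_defaults[{n!r}]' for n in names)
    body = '\n'.join(f'    self.{n} = {n}' for n in names) or '    pass'
    source = f'def __init__(self, {params}):\n    self.func = _func\n{body}\n'
    namespace = {'_func': func, '_defaults': dict(defaults)}
    exec(source, namespace)
    return namespace['__init__']


def scale(x, factor=1.):
    """Scale array x by given factor"""
    return x * factor


init = make_init(scale, {'factor': 1.0})
print(inspect.signature(init))  # (self, factor=1.0)


class Holder:
    __init__ = init


h = Holder(factor=2.0)
print(h.factor, h.func(3, factor=h.factor))  # 2.0 6.0
```

The important property is that the generated function has a genuine signature with named parameters, which is what parameter introspection relies on.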

### Creating the new class

In [24]:
from sklearn.base import BaseEstimator, TransformerMixin
new_class = type('FunctionTransformer_'+func.__name__, (BaseEstimator, TransformerMixin), proto_dict)

In [25]:
new_transformer = new_class()
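
The three-argument form of `type` used above does the same job as a `class` statement: it takes a name, a tuple of base classes and a dictionary of attributes. A minimal illustration with made-up names (`Base`, `Dynamic`, `_init`) that are not part of the transformer code:

```python
class Base:
    def describe(self):
        return f"{type(self).__name__}(factor={self.factor})"


def _init(self, factor=1.0):
    self.factor = factor


# Equivalent to:
#   class Dynamic(Base):
#       def __init__(self, factor=1.0):
#           self.factor = factor
Dynamic = type('Dynamic', (Base,), {'__init__': _init})

obj = Dynamic(factor=3)
print(obj.describe())         # Dynamic(factor=3)
print(isinstance(obj, Base))  # True
print(type(Dynamic) is type)  # True
```

The last line shows where metaclasses come in: `type` is itself the class of ordinary classes, so calling it creates a new class at runtime.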

### Voila!

In [26]:
new_transformer.get_params()

{'factor': 1.0}

In [27]:
new_transformer.set_params(factor=3)

FunctionTransformer_scale(factor=3)

In [28]:
new_transformer.fit_transform(X)

array([[ 3.,  6.],
       [ 9., 12.],
       [15., 18.]])

In [29]:
X

array([[1., 2.],
       [3., 4.],
       [5., 6.]])

## Complete code

Putting it all together, I arrived at the code in `sklearn_transformer_factory.py`.

In [32]:
from mario.factory.transformer_factory import make_transformer

In [33]:
pipeline = Pipeline([
    ('identity', make_transformer()),
    ('scaler', make_transformer(scale, factor=2))
])
pipeline.set_params(scaler__factor=10)
pipeline.fit_transform(X)

array([[10., 20.],
       [30., 40.],
       [50., 60.]])
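
The module itself is not reproduced here. As a rough, standard-library-only illustration of how the three steps can be assembled, the sketch below uses closures and an explicit `__signature__` instead of generated source code. It is my own assumption of one possible design, not the contents of `sklearn_transformer_factory.py`. `make_transformer_class` and its `bases` argument are hypothetical.

```python
import inspect
from inspect import Parameter, Signature


def make_transformer_class(func, bases=(object,)):
    """Hypothetical factory: build a transformer class around func."""
    # Keyword arguments of func become constructor parameters.
    defaults = {
        name: p.default
        for name, p in inspect.signature(func).parameters.items()
        if p.default is not Parameter.empty
    }
    init_sig = Signature(
        [Parameter('self', Parameter.POSITIONAL_OR_KEYWORD)]
        + [Parameter(name, Parameter.POSITIONAL_OR_KEYWORD, default=value)
           for name, value in defaults.items()]
    )

    def __init__(self, *args, **kwargs):
        # Validate the call against the advertised signature, then store values.
        bound = init_sig.bind(self, *args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            if name != 'self':
                setattr(self, name, value)

    # inspect.signature() reports this instead of (self, *args, **kwargs).
    __init__.__signature__ = init_sig

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return func(X, **{name: getattr(self, name) for name in defaults})

    methods = {'__init__': __init__, 'fit': fit, 'transform': transform}
    return type('FunctionTransformer_' + func.__name__, bases, methods)


def scale(x, factor=1.):
    """Scale array x by given factor"""
    return x * factor


ScaleTransformer = make_transformer_class(scale)
t = ScaleTransformer(factor=2)
print(ScaleTransformer.__name__)                     # FunctionTransformer_scale
print(inspect.signature(ScaleTransformer.__init__))  # (self, factor=1.0)
print(t.fit([[1.0]]).transform(5.0))                 # 10.0
```

The intent is that passing `bases=(BaseEstimator, TransformerMixin)` would yield a class usable in a `Pipeline`; that part is not exercised in this sketch. Capturing `func` in a closure, rather than storing it on the instance, keeps the set of instance attributes identical to the set of constructor parameters.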