This tutorial shows how to transform your data with the Transformer class and how to write your own class for data transformation.

## 1. Data Format for Pipelines

To use your data in Reskit, convert it to a dictionary with the fields X and y. Here we use a scikit-learn function to generate a classification problem.

In [1]:
from sklearn.datasets import make_classification


X, y = make_classification()
data = {'X': X, 'y': y}
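
A quick sanity check of this format can be useful before the data enters a pipeline. The sketch below uses a hypothetical helper, check_data_format, which is not part of Reskit; it only assumes the dictionary layout described above.

```python
import numpy as np
from sklearn.datasets import make_classification


def check_data_format(data):
    """Hypothetical helper (not part of Reskit): verify the {'X', 'y'} layout."""
    missing = {'X', 'y'} - set(data)
    if missing:
        raise KeyError('Missing keys: {}'.format(sorted(missing)))
    X, y = np.asarray(data['X']), np.asarray(data['y'])
    if X.shape[0] != y.shape[0]:
        raise ValueError('X and y must have the same number of samples.')
    return X.shape, y.shape


X, y = make_classification(random_state=0)
data = {'X': X, 'y': y}

# With the default arguments, make_classification returns
# 100 samples with 20 features each.
shapes = check_data_format(data)
print(shapes)
```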

## 2. Function Usage in Pipelines Through Transformer

Let's say we want to scale our data to zero mean and unit variance. scikit-learn's scale function does this. First, look at the column means and standard deviations of the unscaled data.

In [4]:
print('Means of columns: ', data['X'].mean(axis=0), '\n')
print('Stds of columns: ', data['X'].std(axis=0))

Means of columns:  [ 0.0895726   0.02239302 -0.01146629 -0.06274079 -0.00419747  0.0623276
 -0.15522739  0.15054792 -0.06350153  0.16210763  0.06918616  0.07071597
  0.05752406  0.0592492   0.17829296 -0.0224572  -0.0471177   0.14207094
 -0.08877845  0.06817185] 

Stds of columns:  [ 1.21537283  0.37932576  0.9745023   1.03091299  1.09148538  1.2685434
  0.81497409  1.07653402  1.11846887  0.91349843  0.9596036   0.9126007
  1.19550076  1.0238642   0.889447    0.88625815  0.9614719   0.99481902
  1.12363487  0.93610273]


In [15]:
from sklearn.preprocessing import scale
import numpy as np

def check_mean_and_std_of_columns(X):
    """ Calculates mean and std for columns of matrix. """
    means = np.round(X.mean(axis=0), 15)
    stds = np.round(X.std(axis=0), 15)

    print('Means of columns: ', means, '\n')
    print('Stds of columns: ', stds)
    
scaled_X = scale(data['X'])
check_mean_and_std_of_columns(scaled_X)

Means of columns:  [-0.  0.  0. -0. -0.  0.  0. -0.  0.  0.  0.  0.  0. -0. -0. -0. -0.  0.
  0.  0.] 

Stds of columns:  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.]


Suppose we want to use it in pipelines. Then you should rewrite your function so that it takes and returns the data dictionary.

In [16]:
def my_function(data):
    # Return a shallow copy with X replaced, so the input
    # dictionary is not modified in place.
    return dict(data, X=scale(data['X']))

Let's check the result.

In [17]:
from reskit.core import Transformer


transformer = Transformer(func=my_function)
scaled_data = transformer.fit_transform(data)

# Inspect the transformer output directly; scaling it again would hide errors.
scaled_X = scaled_data['X']
check_mean_and_std_of_columns(scaled_X)

Means of columns:  [-0. -0.  0. -0. -0. -0.  0. -0. -0.  0. -0.  0.  0. -0. -0. -0. -0.  0.
  0.  0.] 

Stds of columns:  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.]


Finally, let's use it in a pipeline.

In [18]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('scaler', Transformer(func=my_function))])
scaled_data = pipeline.fit_transform(data)

# Inspect the pipeline output directly; scaling it again would hide errors.
scaled_X = scaled_data['X']
check_mean_and_std_of_columns(scaled_X)

Means of columns:  [-0. -0.  0. -0. -0. -0.  0. -0. -0.  0. -0.  0.  0. -0. -0. -0. -0.  0.
  0.  0.] 

Stds of columns:  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.]


Thus, you can write simple functions that take and return a dictionary with a fixed structure.
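
Because such a function receives the whole dictionary, it can change X and y together, for example by dropping samples. The following is a minimal sketch of that idea, not code from Reskit: the function name drop_outlier_samples and its threshold parameter are made up for illustration, and the function is called directly here rather than through Transformer.

```python
import numpy as np


def drop_outlier_samples(data, threshold=3.0):
    """Illustrative function: remove samples that have any feature more than
    `threshold` standard deviations away from its column mean."""
    X, y = np.asarray(data['X']), np.asarray(data['y'])
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    z_scores = np.abs((X - X.mean(axis=0)) / std)
    keep = (z_scores <= threshold).all(axis=1)
    # Rows are removed from X and y together, so they stay aligned.
    # A new dictionary is returned; the input is left unchanged.
    return {'X': X[keep], 'y': y[keep]}


toy = {'X': np.array([[0.0], [0.1], [-0.1], [0.2], [-0.2], [100.0]]),
       'y': np.array([0, 1, 0, 1, 0, 1])}
cleaned = drop_outlier_samples(toy, threshold=2.0)
print(cleaned['X'].shape, cleaned['y'].shape)
```

The threshold has a default value so that the function can still be called with the dictionary alone, like my_function above.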

## 3. Transforming Data to (X, y) Form

One restriction of scikit-learn pipelines is that they cannot change y through transformations. Using a dictionary solves this problem, but scikit-learn classifiers and some transformers accept data only in (X, y) form. So you should specify at which step the Transformer class converts the dictionary to (X, y) form.

In [28]:
transformer = Transformer(func=my_function, collect='X')
X, y = transformer.fit_transform(data)
check_mean_and_std_of_columns(X)

Means of columns:  [-0.  0.  0. -0. -0.  0.  0. -0.  0.  0.  0.  0.  0. -0. -0. -0. -0.  0.
  0.  0.] 

Stds of columns:  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.]


## 4. Your Own Transformer

If you need more flexibility in a transformation, you can implement your own transformer. The simplest example:

In [1]:
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        #
        # Put code here if the transformer needs
        # to learn anything from the data.
        #
        # If there is nothing to learn,
        # just return self.
        #
        return self
    
    def transform(self, X):
        #
        # Write your transformation here
        #
        return X
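
To make the skeleton concrete, here is a sketch of a transformer that learns something in fit and applies it in transform. It is an illustration under assumptions, not part of Reskit: the class name ColumnCenterer is made up, and the transformer is assumed to receive the same dictionary with the fields X and y that is used throughout this tutorial.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnCenterer(BaseEstimator, TransformerMixin):
    """Illustrative transformer: subtract the column means learned in fit."""

    def fit(self, data, y=None, **fit_params):
        # Learn the column means from the training data only.
        self.means_ = np.asarray(data['X']).mean(axis=0)
        return self

    def transform(self, data):
        # Apply the learned means; y passes through unchanged.
        return {'X': np.asarray(data['X']) - self.means_, 'y': data['y']}


train = {'X': np.array([[1.0, 10.0], [3.0, 30.0]]), 'y': np.array([0, 1])}
test = {'X': np.array([[2.0, 25.0]]), 'y': np.array([1])}

centerer = ColumnCenterer().fit(train)
centered_train = centerer.transform(train)
centered_test = centerer.transform(test)

print(centered_train['X'])
print(centered_test['X'])
```

Keeping the learned state in fit means that new data is shifted by the training means rather than by its own means, which is the usual reason to write a class instead of a plain function.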