# Using dstoolbox transformers

## Table of contents

1. Slicing
    1. [ItemSelector](#ItemSelector)
2. Encoding
    1. [XLabelEncoder](#XLabelEncoder)
3. Preprocessing
    1. [ParallelFunctionTransformer](#ParallelFunctionTransformer)
4. Casting
    1. [ToDataFrame](#ToDataFrame)

## Imports

In [1]:
import re

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [3]:
from dstoolbox.transformers import ItemSelector
from dstoolbox.transformers import XLabelEncoder
from dstoolbox.transformers import ParallelFunctionTransformer
from dstoolbox.transformers import ToDataFrame

## Slicing

### ItemSelector

`ItemSelector` selects one or more columns (a slice along `axis=1`) from a numpy array.

In [4]:
X = np.eye(5)

In [5]:
X

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

#### Just using the 2nd column

In [6]:
# pass a slice object
pipeline = Pipeline([
    ('selector', ItemSelector(slice(1, 2))),
    ('scaler', StandardScaler()),
])

In [7]:
pipeline.fit_transform(X)

array([[-0.5],
       [ 2. ],
       [-0.5],
       [-0.5],
       [-0.5]])
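
The values `2.` and `-0.5` follow from standardizing a column that contains a single `1` and four `0`s. As a minimal sketch of the same arithmetic in plain NumPy, assuming `StandardScaler`'s default behaviour (subtract the column mean, divide by the population standard deviation):

```python
import numpy as np

X = np.eye(5)
col = X[:, 1:2]                 # 2nd column, kept 2d: shape (5, 1)

mean = col.mean(axis=0)         # one 1 and four 0s
std = col.std(axis=0)           # population standard deviation (ddof=0)
scaled = (col - mean) / std     # same formula StandardScaler applies by default

print(scaled.ravel())
```

The mean is 0.2 and the standard deviation is 0.4, so the single `1` becomes (1 - 0.2) / 0.4 = 2 and every `0` becomes (0 - 0.2) / 0.4 = -0.5.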

In [8]:
# or a list
pipeline = Pipeline([
    ('selector', ItemSelector([1])),
    ('scaler', StandardScaler()),
])

In [9]:
pipeline.fit_transform(X)

array([[-0.5],
       [ 2. ],
       [-0.5],
       [-0.5],
       [-0.5]])

#### Using the 2nd and 4th columns

In [10]:
pipeline = Pipeline([
    ('selector', ItemSelector([1, 3])),
    ('scaler', StandardScaler()),
])

In [11]:
pipeline.fit_transform(X)

array([[-0.5, -0.5],
       [ 2. , -0.5],
       [-0.5, -0.5],
       [-0.5,  2. ],
       [-0.5, -0.5]])

#### Using a slice

In [12]:
pipeline = Pipeline([
    ('selector', ItemSelector(np.s_[2:6:2])),
    ('scaler', StandardScaler()),
])

In [13]:
pipeline.fit_transform(X)

array([[-0.5, -0.5],
       [-0.5, -0.5],
       [ 2. , -0.5],
       [-0.5, -0.5],
       [-0.5,  2. ]])
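
The selections above correspond to ordinary NumPy indexing along the second axis. The following sketch assumes that `ItemSelector` simply applies its key along `axis=1`; it uses no dstoolbox code and only illustrates which columns each key picks:

```python
import numpy as np

X = np.eye(5)

second = X[:, slice(1, 2)]      # slice object: 2nd column, shape (5, 1)
second_list = X[:, [1]]         # list of indices: same column, same shape
two_cols = X[:, [1, 3]]         # 2nd and 4th columns
stepped = X[:, np.s_[2:6:2]]    # np.s_[2:6:2] is slice(2, 6, 2): columns 2 and 4

print(second.shape, two_cols.shape, stepped.shape)
```

Note that the stop value `6` lies beyond the last column. Slices are clipped to the array bounds, so only columns 2 and 4 are returned.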

#### Applying slicing to a pandas DataFrame

In [14]:
X = pd.DataFrame(data={
        'names': ['Alice', 'Bob', 'Charles', 'Dora', 'Eve'],
        'surnames': ['Carroll', 'Meister', 'Darwin', 'Explorer', 'Wally'],
        'age': [14., 30., 55., 7., 25.]}
)

In [15]:
X

|   | age  | names   | surnames |
|---|------|---------|----------|
| 0 | 14.0 | Alice   | Carroll  |
| 1 | 30.0 | Bob     | Meister  |
| 2 | 55.0 | Charles | Darwin   |
| 3 | 7.0  | Dora    | Explorer |
| 4 | 25.0 | Eve     | Wally    |


In [16]:
# use a string as key
item_selector = ItemSelector('names')
item_selector.fit_transform(X)

0      Alice
1        Bob
2    Charles
3       Dora
4        Eve
Name: names, dtype: object

In [17]:
# use list of strings as keys
item_selector = ItemSelector(['names', 'age'])
item_selector.fit_transform(X)

|   | names   | age  |
|---|---------|------|
| 0 | Alice   | 14.0 |
| 1 | Bob     | 30.0 |
| 2 | Charles | 55.0 |
| 3 | Dora    | 7.0  |
| 4 | Eve     | 25.0 |


In [18]:
# use list of ints as keys
item_selector = ItemSelector([1, 0])
item_selector.fit_transform(X)

|   | names   | age  |
|---|---------|------|
| 0 | Alice   | 14.0 |
| 1 | Bob     | 30.0 |
| 2 | Charles | 55.0 |
| 3 | Dora    | 7.0  |
| 4 | Eve     | 25.0 |


#### Using functions to determine columns of the DataFrame

Sometimes you don't know the column names beforehand. In that case, you can pass a function to `ItemSelector`. The function is called with each column name, and every column for which it returns a truthy value is selected.

In [19]:
# only return columns that end with 'names'
def func(s):
    return s.endswith('names')

item_selector = ItemSelector(func)
item_selector.fit_transform(X)

|   | names   | surnames |
|---|---------|----------|
| 0 | Alice   | Carroll  |
| 1 | Bob     | Meister  |
| 2 | Charles | Darwin   |
| 3 | Dora    | Explorer |
| 4 | Eve     | Wally    |


In [20]:
# use in combination with regular expressions
pattern = re.compile(r'n*a')
item_selector = ItemSelector(pattern.match)
item_selector.fit_transform(X)

|   | age  | names   |
|---|------|---------|
| 0 | 14.0 | Alice   |
| 1 | 30.0 | Bob     |
| 2 | 55.0 | Charles |
| 3 | 7.0  | Dora    |
| 4 | 25.0 | Eve     |
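
To make the matching rule concrete, here is a rough equivalent in plain pandas. The helper `select_columns` is hypothetical and only assumes that the function is applied to each column name; it is not the dstoolbox implementation:

```python
import re

import pandas as pd

X = pd.DataFrame({
    'names': ['Alice', 'Bob'],
    'surnames': ['Carroll', 'Meister'],
    'age': [14., 30.],
})


def select_columns(df, func):
    """Hypothetical helper: keep the columns whose name makes func truthy."""
    return df[[col for col in df.columns if func(col)]]


by_suffix = select_columns(X, lambda s: s.endswith('names'))
by_regex = select_columns(X, re.compile(r'n*a').match)

print(list(by_suffix.columns))
print(list(by_regex.columns))
```

`pattern.match` only matches at the beginning of a string. That is why `'surnames'` is not selected by `r'n*a'`, although it contains the substring `'na'`. Use `pattern.search` to match anywhere in the column name.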


#### Forcing a 2d shape

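`sklearn` transformers often require a 2d array as input, whereas selecting a single column by name returns 1d data. In that case, pass the `force_2d=True` argument.

*This would raise a warning (or, in recent `sklearn` versions, an error)*: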

    pipeline = Pipeline([
        ('selector', ItemSelector('age')),
        ('scaler', StandardScaler()),
    ])
    pipeline.fit_transform(X)

*This works*:

In [21]:
pipeline = Pipeline([
    ('selector', ItemSelector('age', force_2d=True)),
    ('scaler', StandardScaler()),
])
pipeline.fit_transform(X)

array([[-0.73897334],
       [ 0.23017202],
       [ 1.74446165],
       [-1.16297444],
       [-0.0726859 ]])
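
The difference between the two pipelines is only the shape of the selected data. As a small sketch in plain NumPy, assuming that `force_2d=True` amounts to reshaping the 1d selection into a single column:

```python
import numpy as np

age = np.array([14., 30., 55., 7., 25.])   # 1d, like a single selected column
age_2d = age.reshape(-1, 1)                # one column, as many rows as needed

print(age.shape)     # (5,)
print(age_2d.shape)  # (5, 1)
```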

## Encoding

### XLabelEncoder

`sklearn`'s `LabelEncoder` is intended for target data. Sometimes, however, we would like to encode feature data. The problem is that `LabelEncoder` raises an error when `transform` encounters a label that it did not see during `fit`. `XLabelEncoder` encodes such unseen labels as `0` instead. Furthermore, the encoded data has shape `n x 1`, so that it can be used directly as feature data (e.g. in a `FeatureUnion`).

In [22]:
X = np.array(['a', 'b', 'c', 'a', 'c'])

In [23]:
encoder = XLabelEncoder().fit(X)

#### When all labels are known, `XLabelEncoder` maps to the values 1..n. It returns a 2d array.

In [24]:
encoder.transform(X)

array([[1],
       [2],
       [3],
       [1],
       [3]])

#### When new labels are encountered, they are mapped to `0`.

In [25]:
encoder.transform(np.array(['a', 'b', 'c', 'd', 'e', 'a']))

array([[1],
       [2],
       [3],
       [0],
       [0],
       [1]])
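
The behaviour shown above can be sketched in a few lines. `ZeroUnknownEncoder` is a hypothetical stand-in written for illustration, not the dstoolbox implementation. It assumes that known labels are numbered from 1 in sorted order and that `0` is reserved for everything else:

```python
import numpy as np


class ZeroUnknownEncoder:
    """Hypothetical stand-in that mimics the outputs shown above."""

    def fit(self, X):
        # Known labels get the codes 1..n in sorted order; 0 stays reserved.
        labels = sorted(set(X))
        self.mapping_ = {label: i for i, label in enumerate(labels, start=1)}
        return self

    def transform(self, X):
        # Labels missing from the mapping fall back to 0.
        codes = [self.mapping_.get(label, 0) for label in X]
        return np.array(codes).reshape(-1, 1)  # n x 1, usable as a feature


encoder = ZeroUnknownEncoder().fit(['a', 'b', 'c', 'a', 'c'])
encoded = encoder.transform(['a', 'b', 'c', 'd', 'e', 'a'])
print(encoded.ravel())  # [1 2 3 0 0 1]
```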

## Preprocessing

### ParallelFunctionTransformer

The `ParallelFunctionTransformer`, as its name suggests, transforms data in a parallelized manner. The data will be partitioned into `n_jobs` equally sized parts and then be transformed in parallel.

As parallelization induces overhead, use this only when the map function is slow. Furthermore, some functions don't lend themselves to parallelization, as shown below.

In [26]:
X = np.arange(10).reshape(-1, 1)

Remember: We cannot use a `lambda` function, since it cannot be pickled.

In [27]:
def plus_one_func(X):
    return X + 1
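
The remark about `lambda` functions can be checked with the standard library alone. The sketch below assumes that the function is sent to the worker processes with the standard `pickle` module, which stores a function by its importable name; a `lambda` has no such name. The helper `can_pickle` is introduced here for illustration only:

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


print(can_pickle(lambda X: X + 1))  # False
```

A function defined with `def` at module level, such as `plus_one_func`, can be looked up by name in the worker process and is therefore safe to use.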

In [28]:
transformer = ParallelFunctionTransformer(func=plus_one_func, n_jobs=2)

In [29]:
transformer.fit_transform(X)

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

#### Caution

Functions such as 'adding the standard deviation' or 'dividing by the max value' will not work as expected for `n_jobs > 1`, because they need information about the whole data set, while each job only sees its own part.

In [30]:
def max_of(X):
    return np.ones_like(X) * np.max(X)

In [31]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=1)

In [32]:
transformer.fit_transform(X)

array([[9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9]])

In [33]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=2)

In [34]:
transformer.fit_transform(X)

array([[4],
       [4],
       [4],
       [4],
       [4],
       [9],
       [9],
       [9],
       [9],
       [9]])
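
The result for `n_jobs=2` can be reproduced without any parallelism. The sketch below assumes that the data is split into `n_jobs` consecutive chunks, that each chunk is transformed independently, and that the results are stacked back together:

```python
import numpy as np


def max_of(X):
    return np.ones_like(X) * np.max(X)


X = np.arange(10).reshape(-1, 1)

# Assumption: two consecutive chunks, rows 0-4 and rows 5-9.
chunks = np.array_split(X, 2)

# Each chunk only sees its own maximum (4 and 9, respectively).
result = np.vstack([max_of(chunk) for chunk in chunks])

print(result.ravel())  # [4 4 4 4 4 9 9 9 9 9]
```

Element-wise functions such as `plus_one_func` are unaffected, because their result for one row does not depend on any other row.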

## Casting

### ToDataFrame

This is a helper class that simplifies the common use case of converting data in a `Pipeline` to a pandas `DataFrame`. It handles several input types and, for some of them, lets you set the column names.

#### numpy arrays

In [35]:
X = np.arange(5)
transformer = ToDataFrame()
transformer.fit_transform(X)

|   | 0 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |


In [36]:
X = np.eye(5)
transformer.fit_transform(X)

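|   | 0   | 1   | 2   | 3   | 4   |
|---|-----|-----|-----|-----|-----|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |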


Pass a list of strings to the `columns` argument to determine the `DataFrame`'s column names.

In [37]:
transformer = ToDataFrame(columns=['col_%d' % i for i in range(5)])
transformer.fit_transform(X)

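|   | col_0 | col_1 | col_2 | col_3 | col_4 |
|---|-------|-------|-------|-------|-------|
| 0 | 1.0   | 0.0   | 0.0   | 0.0   | 0.0   |
| 1 | 0.0   | 1.0   | 0.0   | 0.0   | 0.0   |
| 2 | 0.0   | 0.0   | 1.0   | 0.0   | 0.0   |
| 3 | 0.0   | 0.0   | 0.0   | 1.0   | 0.0   |
| 4 | 0.0   | 0.0   | 0.0   | 0.0   | 1.0   |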


#### pandas Series

In [38]:
X = pd.Series(data=np.arange(5), name='col')
transformer = ToDataFrame()
transformer.fit_transform(X)

|   | col |
|---|-----|
| 0 | 0   |
| 1 | 1   |
| 2 | 2   |
| 3 | 3   |
| 4 | 4   |


For series, it is possible to give another name by passing a string to the `columns` argument.

In [39]:
X = pd.Series(data=np.arange(5), name='col')
transformer = ToDataFrame(columns='other-name')
transformer.fit_transform(X)

|   | other-name |
|---|------------|
| 0 | 0          |
| 1 | 1          |
| 2 | 2          |
| 3 | 3          |
| 4 | 4          |


#### dicts

When a `dict` is passed, the keys determine the column names. This can be especially useful in conjunction with `dstoolbox.pipeline.DictFeatureUnion`.

In [40]:
X = {'col0': np.arange(5), 'col1': np.linspace(0, 1, 5)}
transformer = ToDataFrame()
transformer.fit_transform(X)

|   | col0 | col1 |
|---|------|------|
| 0 | 0    | 0.0  |
| 1 | 1    | 0.25 |
| 2 | 2    | 0.5  |
| 3 | 3    | 0.75 |
| 4 | 4    | 1.0  |


#### lists

`ToDataFrame` also works when a simple `list` of data needs to be transformed to a `DataFrame`. Again, you may or may not set the `columns` argument.

In [41]:
X = [5, 4, 3, 2, 1]
transformer = ToDataFrame(columns=['my-col'])
transformer.fit_transform(X)

|   | my-col |
|---|--------|
| 0 | 5      |
| 1 | 4      |
| 2 | 3      |
| 3 | 2      |
| 4 | 1      |
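
For comparison, the conversions in this section can be written with plain pandas. The function `to_dataframe` below is a hypothetical helper that mirrors the cases shown above. It is a sketch of the idea, not the dstoolbox implementation, and it may differ from it in edge cases:

```python
import numpy as np
import pandas as pd


def to_dataframe(X, columns=None):
    """Hypothetical helper mirroring the cases above with plain pandas."""
    if isinstance(X, pd.Series):
        # For a Series, `columns` is a single name; None keeps X.name.
        return X.to_frame(name=columns) if columns is not None else X.to_frame()
    if isinstance(X, dict):
        # The keys become the column names.
        return pd.DataFrame(X)
    # numpy arrays and lists; `columns` is an optional list of names.
    return pd.DataFrame(X, columns=columns)


series = pd.Series(np.arange(5), name='col')
print(to_dataframe(series).columns.tolist())
print(to_dataframe(series, columns='other-name').columns.tolist())
print(to_dataframe([5, 4, 3, 2, 1], columns=['my-col']).columns.tolist())
```

Wrapping this logic in a transformer has the advantage that the conversion can be placed as a step inside a `Pipeline`.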
