# Using dstoolbox transformers

## Table of contents

1. Slicing
    1. [ItemSelector](#ItemSelector)
2. Text
    1. [W2VTransformer](#W2VTransformer)
3. Encoding
    1. [XLabelEncoder](#XLabelEncoder)
4. Preprocessing
    1. [ParallelFunctionTransformer](#ParallelFunctionTransformer)

## Imports

In [1]:
import re

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [3]:
from dstoolbox.transformers import ItemSelector
from dstoolbox.transformers import W2VTransformer
from dstoolbox.transformers import XLabelEncoder
from dstoolbox.transformers.preprocessing import ParallelFunctionTransformer

## Slicing

### ItemSelector

Select a column or a slice along `axis=1` from a numpy array.

In [4]:
X = np.eye(5)

In [5]:
X

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

#### Just using the 2nd column

In [6]:
# pass a slice object
pipeline = Pipeline([
    ('selector', ItemSelector(slice(1, 2))),
    ('scaler', StandardScaler()),
])

In [7]:
pipeline.fit_transform(X)

array([[-0.5],
       [ 2. ],
       [-0.5],
       [-0.5],
       [-0.5]])

In [8]:
# or a list
pipeline = Pipeline([
    ('selector', ItemSelector([1])),
    ('scaler', StandardScaler()),
])

In [9]:
pipeline.fit_transform(X)

array([[-0.5],
       [ 2. ],
       [-0.5],
       [-0.5],
       [-0.5]])

#### Using the 2nd and 4th columns

In [10]:
pipeline = Pipeline([
    ('selector', ItemSelector([1, 3])),
    ('scaler', StandardScaler()),
])

In [11]:
pipeline.fit_transform(X)

array([[-0.5, -0.5],
       [ 2. , -0.5],
       [-0.5, -0.5],
       [-0.5,  2. ],
       [-0.5, -0.5]])

#### Using a slice

In [12]:
pipeline = Pipeline([
    ('selector', ItemSelector(np.s_[2:6:2])),
    ('scaler', StandardScaler()),
])

In [13]:
pipeline.fit_transform(X)

array([[-0.5, -0.5],
       [-0.5, -0.5],
       [ 2. , -0.5],
       [-0.5, -0.5],
       [-0.5,  2. ]])
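
For numpy input, the selections above correspond to ordinary indexing along `axis=1`. The following is a minimal sketch in plain numpy, assuming `ItemSelector` simply applies the given key to the column axis as described at the top of this section; it is not the library's actual implementation.

```python
import numpy as np

X = np.eye(5)

# slice(1, 2) and [1] both pick the 2nd column and keep the 2d shape
second_col = X[:, slice(1, 2)]
second_col_list = X[:, [1]]

# a list of ints picks several columns at once
second_and_fourth = X[:, [1, 3]]

# np.s_[2:6:2] is shorthand for slice(2, 6, 2);
# a stop value beyond the last column is clipped by numpy
third_and_fifth = X[:, np.s_[2:6:2]]

print(second_col.shape, second_and_fourth.shape, third_and_fifth.shape)
```

Because every variant keeps a 2d shape, the result can be passed directly to `StandardScaler`.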

#### Applying slicing to a pandas DataFrame

In [14]:
X = pd.DataFrame(data={
        'names': ['Alice', 'Bob', 'Charles', 'Dora', 'Eve'],
        'surnames': ['Carroll', 'Meister', 'Darwin', 'Explorer', 'Wally'],
        'age': [14., 30., 55., 7., 25.]}
)

In [15]:
X

    age    names  surnames
0  14.0    Alice   Carroll
1  30.0      Bob   Meister
2  55.0  Charles    Darwin
3   7.0     Dora  Explorer
4  25.0      Eve     Wally


In [16]:
# use a string as key
item_selector = ItemSelector('names')
item_selector.fit_transform(X)

0      Alice
1        Bob
2    Charles
3       Dora
4        Eve
Name: names, dtype: object

In [17]:
# use list of strings as keys
item_selector = ItemSelector(['names', 'age'])
item_selector.fit_transform(X)

     names   age
0    Alice  14.0
1      Bob  30.0
2  Charles  55.0
3     Dora   7.0
4      Eve  25.0


In [18]:
# use list of ints as keys
item_selector = ItemSelector([1, 0])
item_selector.fit_transform(X)

     names   age
0    Alice  14.0
1      Bob  30.0
2  Charles  55.0
3     Dora   7.0
4      Eve  25.0


#### Using functions to determine columns of the DataFrame

Sometimes you don't know the column names beforehand. In that case, you can pass a function to `ItemSelector`; it is called on each column name, and every column for which it returns a truthy value is selected.

In [19]:
# only return columns that end with 'names'
def func(s):
    return s.endswith('names')

item_selector = ItemSelector(func)
item_selector.fit_transform(X)

     names  surnames
0    Alice   Carroll
1      Bob   Meister
2  Charles    Darwin
3     Dora  Explorer
4      Eve     Wally


In [20]:
# use in combination with regular expressions
pattern = re.compile(r'n*a')
item_selector = ItemSelector(pattern.match)
item_selector.fit_transform(X)

    age    names
0  14.0    Alice
1  30.0      Bob
2  55.0  Charles
3   7.0     Dora
4  25.0      Eve
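
The function-based selection can be approximated in plain pandas. The sketch below uses a hypothetical helper, `select_columns`, which is not part of `dstoolbox`; it only illustrates the idea of filtering column names with a predicate. Note that `pattern.match` anchors at the start of the string, which is why `'surnames'` does not match `r'n*a'` while `'names'` and `'age'` do.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'names': ['Alice', 'Bob'],
    'surnames': ['Carroll', 'Meister'],
    'age': [14., 30.],
})


def select_columns(df, predicate):
    """Hypothetical helper: keep the columns whose name satisfies predicate."""
    matching = [col for col in df.columns if predicate(col)]
    return df[matching]


# columns whose name ends with 'names'
ends_with_names = select_columns(df, lambda s: s.endswith('names'))

# re.match returns a match object (truthy) or None (falsy)
regex_selected = select_columns(df, re.compile(r'n*a').match)

print(list(ends_with_names.columns))
print(list(regex_selected.columns))
```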


#### Forcing a 2d shape

`sklearn` transformers often require a 2d array as input. In that case, use the `force_2d=True` argument.

*This would raise a warning (an error in newer `sklearn` versions), because the scaler receives 1d input*:

    pipeline = Pipeline([
        ('selector', ItemSelector('age')),
        ('scaler', StandardScaler()),
    ])
    pipeline.fit_transform(X)

*This works*:

In [21]:
pipeline = Pipeline([
    ('selector', ItemSelector('age', force_2d=True)),
    ('scaler', StandardScaler()),
])
pipeline.fit_transform(X)

array([[-0.73897334],
       [ 0.23017202],
       [ 1.74446165],
       [-1.16297444],
       [-0.0726859 ]])
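
To see what the 2d shape buys us, here is a minimal sketch in plain numpy. It assumes that `force_2d=True` amounts to reshaping the selected column to shape `(n, 1)`, which is inferred from the output above rather than from the implementation. The standardization is done by hand with the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

age = np.array([14., 30., 55., 7., 25.])  # 1d, shape (5,)
age_2d = age.reshape(-1, 1)               # 2d, shape (5, 1)

# subtract the column mean, divide by the population std (ddof=0)
scaled = (age_2d - age_2d.mean(axis=0)) / age_2d.std(axis=0)

print(age.shape, age_2d.shape)
print(scaled)
```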

## Text

### W2VTransformer

#### Mocking the fit method so that we don't need real word2vec data

In [22]:
word2idx = {"herren": 0, "damen": 1, "nike": 2}
word_embeddings = np.arange(15).reshape(3, 5).astype(float)

def mock_fit(self, X=None, y=None):
    self.word2idx_ = word2idx
    self.syn0_ = word_embeddings
    return self

In [23]:
print(word2idx)
print(word_embeddings)

{'herren': 0, 'damen': 1, 'nike': 2}
[[  0.   1.   2.   3.   4.]
 [  5.   6.   7.   8.   9.]
 [ 10.  11.  12.  13.  14.]]


In [24]:
setattr(W2VTransformer, 'fit', mock_fit)

#### Applying W2VTransformer with mean aggregation (default)

In [25]:
X = ['damen Herren', 'schuhe nike', 'unbekannt', 'schuhe herren damen']

In [26]:
transformer = W2VTransformer('path/to/word2vec/file')

In [27]:
transformer.fit()

W2VTransformer(aggr_func=<function mean at 0x7f70f41ed048>,
        analyze=<bound method VectorizerMixin.build_analyzer of CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)>,
        max_words=100, path_w2v='path/to/word2vec/file')

In [28]:
transformer.transform(X)

array([[  2.5,   3.5,   4.5,   5.5,   6.5],
       [ 10. ,  11. ,  12. ,  13. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [  2.5,   3.5,   4.5,   5.5,   6.5]])

Note that unknown words are simply skipped; if the sentence contains no known words, a vector of zeros is returned.

#### Applying W2VTransformer with max aggregation

In [29]:
transformer = W2VTransformer('path/to/word2vec/file', aggr_func=np.max)

In [30]:
transformer.fit()

W2VTransformer(aggr_func=<function amax at 0x7f70f41eca60>,
        analyze=<bound method VectorizerMixin.build_analyzer of CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)>,
        max_words=100, path_w2v='path/to/word2vec/file')

In [31]:
transformer.transform(X)

array([[  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [  0.,   0.,   0.,   0.,   0.],
       [  5.,   6.,   7.,   8.,   9.]])

#### Applying different aggregation functions using a FeatureUnion

In [32]:
features = FeatureUnion([
    ('w2v_min', W2VTransformer('path/to/word2vec/file', aggr_func=np.min)),
    ('w2v_median', W2VTransformer('path/to/word2vec/file', aggr_func=np.median)),
    ('w2v_max', W2VTransformer('path/to/word2vec/file', aggr_func=np.max)),
])

In [33]:
features.fit(X)

FeatureUnion(n_jobs=1,
       transformer_list=[('w2v_min', W2VTransformer(aggr_func=<function amin at 0x7f70f41ecae8>,
        analyze=<bound method VectorizerMixin.build_analyzer of CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
    ...       tokenizer=None, vocabulary=None)>,
        max_words=100, path_w2v='path/to/word2vec/file'))],
       transformer_weights=None)

In [34]:
features.transform(X)

array([[  0. ,   1. ,   2. ,   3. ,   4. ,   2.5,   3.5,   4.5,   5.5,
          6.5,   5. ,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  11. ,  12. ,  13. ,  14. ,  10. ,  11. ,  12. ,  13. ,
         14. ,  10. ,  11. ,  12. ,  13. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,
          0. ,   0. ,   0. ,   0. ,   0. ,   0. ],
       [  0. ,   1. ,   2. ,   3. ,   4. ,   2.5,   3.5,   4.5,   5.5,
          6.5,   5. ,   6. ,   7. ,   8. ,   9. ]])

#### It is possible to use a custom preprocessor for the text data

In [35]:
transformer = W2VTransformer('path/to/word2vec/file', analyze=lambda x: x.upper().split('E'))

In [36]:
transformer.analyze(X[0])

['DAM', 'N H', 'RR', 'N']

#### It is possible to set a maximum number of words to analyze, which may be useful if there are rare, very long texts

In [37]:
transformer = W2VTransformer('path/to/word2vec/file', max_words=2)

In [38]:
transformer.fit()

W2VTransformer(aggr_func=<function mean at 0x7f70f41ed048>,
        analyze=<bound method VectorizerMixin.build_analyzer of CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)>,
        max_words=2, path_w2v='path/to/word2vec/file')

In [39]:
transformer.transform(X)

array([[  2.5,   3.5,   4.5,   5.5,   6.5],
       [ 10. ,  11. ,  12. ,  13. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [  0. ,   1. ,   2. ,   3. ,   4. ]])

Note that in the last sample, the word "damen" has been dropped.
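
The behaviour shown in this section (aggregating word vectors, skipping unknown words, returning zeros for fully unknown texts, truncating to `max_words`) can be sketched in a few lines of numpy. The function `embed` below is a hypothetical illustration, not the `W2VTransformer` implementation, and it tokenizes with a simple lowercase split instead of the `CountVectorizer` analyzer.

```python
import numpy as np

word2idx = {"herren": 0, "damen": 1, "nike": 2}
word_embeddings = np.arange(15).reshape(3, 5).astype(float)


def embed(text, aggr_func=np.mean, max_words=100):
    """Hypothetical sketch: aggregate the vectors of all known words."""
    # simplified tokenizer: lowercase, split on whitespace, truncate
    words = text.lower().split()[:max_words]
    # unknown words are skipped
    indices = [word2idx[word] for word in words if word in word2idx]
    if not indices:
        # no known word at all: return a vector of zeros
        return np.zeros(word_embeddings.shape[1])
    return aggr_func(word_embeddings[indices], axis=0)


print(embed('damen Herren'))
print(embed('schuhe nike', aggr_func=np.max))
print(embed('unbekannt'))
print(embed('schuhe herren damen', max_words=2))
```

With `max_words=2`, only "schuhe" and "herren" are considered for the last text, so the result is the vector for "herren" alone.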

## Encoding

### XLabelEncoder

`sklearn`'s `LabelEncoder` is intended for target data. Sometimes, however, we would like to encode feature data. The problem is that `LabelEncoder` raises an error when it encounters labels that were not seen during fitting. `XLabelEncoder` encodes such unseen labels as `0` instead. Furthermore, the encoded data has shape `n x 1`, so that it can later be used as feature data (e.g. in a `FeatureUnion`).

In [40]:
X = np.array(['a', 'b', 'c', 'a', 'c'])

In [41]:
encoder = XLabelEncoder().fit(X)

#### When all labels are known, `XLabelEncoder` maps to the values 1..n. It returns a 2d array.

In [42]:
encoder.transform(X)

array([[1],
       [2],
       [3],
       [1],
       [3]])

#### When new labels are encountered, they are mapped to `0`.

In [43]:
encoder.transform(np.array(['a', 'b', 'c', 'd', 'e', 'a']))

array([[1],
       [2],
       [3],
       [0],
       [0],
       [1]])
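
The encoding scheme can be sketched with a plain dictionary. The helpers `fit_mapping` and `encode` are hypothetical and only mirror the outputs shown above; in particular, assigning codes in sorted label order is an assumption.

```python
import numpy as np


def fit_mapping(values):
    """Hypothetical helper: map known labels to 1..n (sorted order assumed)."""
    return {label: i for i, label in enumerate(sorted(set(values)), start=1)}


def encode(values, mapping):
    """Hypothetical helper: unseen labels fall back to 0; output is n x 1."""
    codes = [mapping.get(value, 0) for value in values]
    return np.array(codes).reshape(-1, 1)


mapping = fit_mapping(['a', 'b', 'c', 'a', 'c'])
encoded = encode(['a', 'b', 'c', 'd', 'e', 'a'], mapping)

print(mapping)
print(encoded)
```

Reserving `0` for unseen labels means the known labels start at `1`, which matches the outputs of `XLabelEncoder` above.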

## Preprocessing

### ParallelFunctionTransformer

The `ParallelFunctionTransformer`, as its name suggests, transforms data in a parallelized manner. The data will be partitioned into `n_jobs` equally sized parts and then be transformed in parallel.

As parallelization induces overhead, use this only when the map function is slow. Furthermore, some functions don't lend themselves to parallelization, as shown below.

In [44]:
X = np.arange(10).reshape(-1, 1)

Remember: We cannot use a `lambda` function, since it cannot be pickled.
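
The pickling restriction can be checked directly with the standard library. Functions are pickled by reference to their importable name, and a lambda has no such name. The helper `can_pickle` below is hypothetical and only wraps `pickle.dumps`; the built-in `abs` stands in for any named, importable function.

```python
import pickle


def can_pickle(obj):
    """Hypothetical helper: report whether pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


named_ok = can_pickle(abs)                # importable by name
lambda_ok = can_pickle(lambda x: x + 1)   # anonymous, no importable name

print(named_ok, lambda_ok)
```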

In [45]:
def plus_one_func(X):
    return X + 1

In [46]:
transformer = ParallelFunctionTransformer(func=plus_one_func, n_jobs=2)

In [47]:
transformer.fit_transform(X)

array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

#### Caution

Functions such as 'adding the standard deviation' or 'dividing by the max value' will not give the expected result for `n_jobs > 1`, because they require information about the whole dataset, whereas each job only sees its own partition.

In [48]:
def max_of(X):
    return np.ones_like(X) * np.max(X)

In [49]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=1)

In [50]:
transformer.fit_transform(X)

array([[9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9],
       [9]])

In [51]:
transformer = ParallelFunctionTransformer(func=max_of, n_jobs=2)

In [52]:
transformer.fit_transform(X)

array([[4],
       [4],
       [4],
       [4],
       [4],
       [9],
       [9],
       [9],
       [9],
       [9]])
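
The difference between the two results can be reproduced without any parallelism. The sketch below uses `np.array_split` as a stand-in for the partitioning step; how `ParallelFunctionTransformer` actually splits the data is an assumption here. It also shows one possible way around the problem: compute the global statistic once, before the data is split.

```python
import numpy as np


def max_of(X):
    return np.ones_like(X) * np.max(X)


X = np.arange(10).reshape(-1, 1)

# assumed stand-in for the parallel step: split, apply per chunk, stack
chunks = np.array_split(X, 2)
per_chunk = np.vstack([max_of(chunk) for chunk in chunks])

# computing the statistic on the full data first avoids the discrepancy
global_max = np.max(X)
with_global = np.vstack([np.ones_like(chunk) * global_max for chunk in chunks])

print(per_chunk.ravel())
print(with_global.ravel())
```

Each chunk only knows its own maximum (4 and 9), whereas the precomputed value is the same for every chunk.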