## The transformative abilities of `sklearn.compose`: a life-saver in disguise?

<br>
<div>
    <center>
        <img src="imgs/scikit.png" style="width: 400px;">
    </center>
</div>

- [Scikit-learn](https://scikit-learn.org/stable/): one of the most popular libraries for machine learning (ML)
    - Algorithms, feature selection, pipelining, evaluation, etc.
- `sklearn.compose` was introduced in scikit-learn 0.20 (released September 2018)
    - Still relatively slim, but powerful when combined with existing scikit-learn modules like `sklearn.preprocessing` and the general scikit-learn API

**Goal of this tutorial: to demonstrate how to implement a configuration-based approach to machine learning dataset creation.**

> At the time of writing, the most [recent stable release of scikit-learn](https://scikit-learn.org/dev/versions.html) is version 0.21.3. `sklearn.compose` first appeared in version 0.20, so the capabilities presented by this part of scikit-learn are relatively new.

## What dataset will we be using?

The [University of California, Irvine (UCI) Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult) contains a treasure trove of datasets for ML work. I chose the ["Adult" dataset](https://archive.ics.uci.edu/ml/datasets/Adult), which tasks the analyst with predicting, based on a variety of inputs, whether an adult makes more or less than $50k per year. This dataset comes with a mixture of real, categorical, and integer features, which ought to make for a much more "real-world" dataset-processing example.

## Let's get started

## First, some housekeeping

If you haven't already, run `sh setup.sh` from the base directory to:

1) Set up a virtual environment for dependency management

2) Start the Jupyter Notebook server

## Loading our dataset

In [43]:
import numpy as np
np.random.seed(100)

import pandas as pd
from pprint import pprint

# Gathered from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
cols = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per',
    'native-country',
    'makes_gt_50k'
]

df = pd.read_csv(
    'https://archive.ics.uci.edu/'
    'ml/machine-learning-databases/'
    'adult/adult.data',
    names=cols
)

Next, let's take a look at some metadata:

In [44]:
print(f'Shape of dataset: {df.shape}')
print(f'Data sample:\n{df.head()}')

Shape of dataset: (32561, 15)
Data sample:
   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per  native-country makes_gt_50k  
0          2174          

In [45]:
print(f'Data types:\n{df.dtypes}')
print(f'Number of unique values by field, for non-numeric features:\n{df.select_dtypes(include=["object"]).nunique()}')

Data types:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per          int64
native-country    object
makes_gt_50k      object
dtype: object
Number of unique values by field, for non-numeric features:
workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex                2
native-country    42
makes_gt_50k       2
dtype: int64


- Quite a bit of diversity in this dataset
    - Mixture of continuous (`age`, `capital-gain`, `hours-per`, etc.) and categorical (`workclass`, `education`, etc.) features
    - Relatively low cardinality for categorical features

> A logical next step in the process of building a predictive model would be to perform some exploratory data analysis on each of the potential input features. **For the sake of this exercise**, let's assume we've done that and proceed straight to feature-engineering.

## Feature engineering

- "[It has been a common trope that 80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis.](https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists)"
- Seemingly endless amount of preprocessing required to transform a "raw" dataset into something ready for analysis/modeling
- For example, mixed types not supported in many of Python's core ML libraries
    - Can't use `sex` like `['male', 'female', 'male', 'female']` – we need to encode this field in a numerical fashion ("one-hot encoding")

> Note: oftentimes, preprocessing will be applied across the entire dataset - not just for categorical features. For the sake of brevity, I'll only demonstrate the one-hot-encoding approach and leave it up to you to incorporate more sophisticated encoding strategies for features of other types.

### Using `pandas.get_dummies`

The data-manipulation library `pandas` has a function called `get_dummies`, which creates ["dummy" variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), given some input. Here's an example of how we might encode `sex` using `pandas.get_dummies`.

In [14]:
print(f"Original column:\n{df['sex'].head()}")
print(pd.get_dummies(df['sex'], prefix='sex').head())

Original column:
0       Male
1       Male
2       Male
3       Male
4     Female
Name: sex, dtype: object
   sex_ Female  sex_ Male
0            0          1
1            0          1
2            0          1
3            0          1
4            1          0
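
The dummy column names above (`sex_ Female`, `sex_ Male`) carry a leading space because the raw file separates fields with a comma followed by a space. Here is a minimal sketch, assuming you would rather strip that whitespace at load time, using the `skipinitialspace` option of `pandas.read_csv`. The two-row sample is made up for illustration and is not taken from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data
raw = "39, State-gov, Male\n28, Private, Female\n"
names = ['age', 'workclass', 'sex']

# Default parsing keeps the space that follows each comma
padded = pd.read_csv(io.StringIO(raw), names=names)

# skipinitialspace=True drops whitespace that follows the delimiter
clean = pd.read_csv(io.StringIO(raw), names=names, skipinitialspace=True)

print(padded['sex'].tolist())
print(clean['sex'].tolist())
```

The rest of this tutorial keeps the values exactly as loaded, which is why the leading space shows up in later outputs.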


### Using [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):

In [15]:
from sklearn.preprocessing import OneHotEncoder

# Note: when using models prone to perfect collinearity, you'll want to set `drop='first'`
# (`sparse` was renamed to `sparse_output` in scikit-learn 1.2)
enc = OneHotEncoder(sparse=False)
print(enc.fit_transform(df['sex'].values.reshape(-1, 1))[:5])

[[0. 1.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]]


### A short summary of these two (out of *many* potential) approaches

- Both perfectly 👌 ways to perform one-hot encoding
- *But*, the latter approach (using `OneHotEncoder`) plays much more nicely with the rest of the `sklearn.compose` module

> Technically, `pd.get_dummies` could work as well, but it would take a bit more work, and the main benefit of the second approach is staying within the `scikit-learn` API.
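
To make the "bit more work" concrete, here is a small sketch of what `pd.get_dummies` leaves to you: it has no fitted state, so the columns it produces depend on whichever values happen to be present. The toy frames and the `'Unknown'` value are assumptions made up for illustration.

```python
import pandas as pd

# Toy "training" and "new" data; the values are illustrative only
train = pd.DataFrame({'sex': ['Male', 'Female', 'Male']})
test = pd.DataFrame({'sex': ['Male', 'Unknown']})

train_X = pd.get_dummies(train['sex'], prefix='sex')
test_X = pd.get_dummies(test['sex'], prefix='sex')

# The two frames end up with different columns
print(list(train_X.columns))
print(list(test_X.columns))

# Manual alignment: keep only the training columns, fill the gaps with 0
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_aligned)
```

A fitted `OneHotEncoder` remembers the categories it saw, so this alignment step is not needed.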

### So how does `sklearn.compose` help with all of this preprocessing?

- The [main page for `sklearn.compose`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) shows how few functions/classes exist within that submodule:

<img src="imgs/sklearn.compose.module.png" style="width: 700px;">

**We're mostly concerned with `ColumnTransformer` and `make_column_transformer`**

From the `ColumnTransformer` description:

> Applies transformers to columns of an array or pandas DataFrame.
This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

- `make_column_transformer` is simply a convenience wrapper for `ColumnTransformer`
    - It names the transformers automatically and doesn't support as many options, so we'll focus primarily on `ColumnTransformer` itself
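
As a rough side-by-side of the two entry points, the sketch below builds the same one-hot step both ways on a made-up frame. The names `toy` and `sex_onehot` are illustrative. The `(transformer, columns)` tuple order is the one recent scikit-learn releases expect, so check the documentation for your installed version. `sparse_threshold=0` is used only to force a dense result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up data for illustration
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'], 'age': [39, 28, 50]})

# ColumnTransformer: we choose the step name ourselves
named = ColumnTransformer(
    [('sex_onehot', OneHotEncoder(), ['sex'])],
    sparse_threshold=0  # always return a dense array
)

# make_column_transformer: the step name is generated from the class name
auto = make_column_transformer(
    (OneHotEncoder(), ['sex']),
    sparse_threshold=0
)

print(named.fit_transform(toy))
print(auto.fit_transform(toy))
print([name for name, _, _ in auto.transformers])
```

Both transformers drop the untouched `age` column here, because `remainder` defaults to `'drop'`.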

**`sklearn.compose` will allow us to very easily construct ready-for-modeling datasets, using pre-defined encoding patterns.**

### Enough talk – how does a `ColumnTransformer` work?

Let's go through the one-hot encoding example from above, using a `ColumnTransformer`.

In [16]:
from sklearn.compose import ColumnTransformer

col = 'sex'
enc = OneHotEncoder(sparse=False)
trans = ColumnTransformer([(col, enc, [col])])
trans.fit_transform(df)[:5]

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.]])

<div>
    <center>
        <img src="imgs/ytho.jpg" style="width: 400px;">
    </center>
</div>

- Power of `ColumnTransformer`: abstraction of more complex encoding pipelines across a "wide" dataset
    - Encoding just one feature is trivial
    - Need to apply *d* encoding strategies across *n* columns in a consistent, "summarized" way? `ColumnTransformer`s can help.

### A more complex example: using configuration files to manage encoding strategies

> Note: you could also get at the type of feature by relying on pandas' default data-type parsing when the file is initially read, i.e. looking at `df.dtypes`. What is shown in this tutorial is a more explicit approach.
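
For comparison, the dtype-based alternative mentioned in the note could look roughly like this. It is a sketch under assumptions: `infer_strategies` is a hypothetical helper, the toy frame is made up, and treating every numeric column as continuous will mislabel integer-coded categories.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame with the same mix of types as the real dataset
toy = pd.DataFrame({
    'sex': ['Male', 'Female'],
    'race': ['White', 'Black'],
    'age': [39, 28]
})

def infer_strategies(frame):
    """Label numeric columns 'continuous' and everything else 'categorical'."""
    return [
        {
            'col': col,
            'kind': 'continuous' if is_numeric_dtype(frame[col]) else 'categorical'
        }
        for col in frame.columns
    ]

print(infer_strategies(toy))
```

The explicit configuration used in the rest of this section trades that convenience for control over each column.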

In [46]:
strategies = [
    {
        'col': 'sex',
        'kind': 'categorical'
    },
    {
        'col':  'race',
        'kind': 'categorical'
    },
    {
        'col': 'age',
        'kind': 'continuous'
    }
]

In [47]:
transformers = []
for s in strategies:
    col, kind = s['col'], s['kind']
    
    # An opinionated encoding mechanism
    if kind == 'categorical':
        transformer = OneHotEncoder(sparse=False)
    elif kind == 'continuous':
        # Default to not applying any preprocessing to continuous features
        transformer = 'passthrough'
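    else:
        # Other data types aren't supported yet; fail loudly rather than
        # silently reusing the previous iteration's transformer
        raise NotImplementedError(f'Kind "{kind}" is not supported yet')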
    
    result = (col, transformer, [col])
    transformers.append(result)

master_trans = ColumnTransformer(transformers)
master_trans.fit_transform(df)

array([[ 0.,  1.,  0., ...,  0.,  1., 39.],
       [ 0.,  1.,  0., ...,  0.,  1., 50.],
       [ 0.,  1.,  0., ...,  0.,  1., 38.],
       ...,
       [ 1.,  0.,  0., ...,  0.,  1., 58.],
       [ 0.,  1.,  0., ...,  0.,  1., 22.],
       [ 1.,  0.,  0., ...,  0.,  1., 52.]])

### Additional topics

We're just scratching the surface of `sklearn.compose`!

#### Getting names of encoded features

We can use the `get_feature_names` method to obtain the names of the encoded features. Let's try it out:

In [48]:
master_trans.get_feature_names()

NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.

This is to be expected!
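
A note for readers on newer installs: the sketch below assumes scikit-learn 1.0 or later, where transformers expose `get_feature_names_out` and, as far as I can tell, `'passthrough'` columns are handled. It is guarded with `hasattr` so that it simply yields `None` on older versions such as the one used in this tutorial. The toy frame is made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up data for illustration
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'], 'age': [39, 28, 50]})

ct = ColumnTransformer(
    [('sex', OneHotEncoder(), ['sex']), ('age', 'passthrough', ['age'])],
    sparse_threshold=0  # always return a dense array
)
ct.fit(toy)

# Assumption: `get_feature_names_out` only exists on newer releases
if hasattr(ct, 'get_feature_names_out'):
    names = list(ct.get_feature_names_out())
else:
    names = None

print(names)
```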

We can use inheritance to create our own `get_feature_names` method for these objects:

In [49]:
from sklearn.impute import SimpleImputer
from sklearn.base import TransformerMixin, BaseEstimator

# Inherit sklearn's BaseEstimator and TransformerMixin classes to make these new
# classes play nicely with the rest of the `sklearn.compose` functionality we're using
class PassthroughEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        self.feature_names = list(X)
        return self
    def transform(self, X):
        return X
    def get_feature_names(self):
        return self.feature_names

class SimpleImputerWithFeatureNames(SimpleImputer):
    # X is a pandas.DataFrame or pandas.Series, in this case
    def fit(self, X, y = None):
        self.feature_names = list(X)
        # Execute parent class method - the nuts and bolts of this child object
        super().fit(X)
        return self
    def get_feature_names(self):
        return self.feature_names

In [50]:
transformers = []
for s in strategies:
    col, kind = s['col'], s['kind']
    
    # An opinionated encoding mechanism
    if kind == 'categorical':
        # Note that OneHotEncoder already has a `get_feature_names` method
        transformer = OneHotEncoder(sparse=False)
    elif kind == 'continuous':
        # Default to not applying any preprocessing to continuous features
        transformer = PassthroughEncoder()
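    else:
        # Other data types aren't supported yet; fail loudly rather than
        # silently reusing the previous iteration's transformer
        raise NotImplementedError(f'Kind "{kind}" is not supported yet')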
    
    result = (col, transformer, [col])
    transformers.append(result)

trans = ColumnTransformer(transformers)
trans.fit_transform(df)

array([[ 0.,  1.,  0., ...,  0.,  1., 39.],
       [ 0.,  1.,  0., ...,  0.,  1., 50.],
       [ 0.,  1.,  0., ...,  0.,  1., 38.],
       ...,
       [ 1.,  0.,  0., ...,  0.,  1., 58.],
       [ 0.,  1.,  0., ...,  0.,  1., 22.],
       [ 1.,  0.,  0., ...,  0.,  1., 52.]])

Let's check out the feature names for this `ColumnTransformer`:

In [51]:
trans.get_feature_names()

['sex__x0_ Female',
 'sex__x0_ Male',
 'race__x0_ Amer-Indian-Eskimo',
 'race__x0_ Asian-Pac-Islander',
 'race__x0_ Black',
 'race__x0_ Other',
 'race__x0_ White',
 'age__age']

Not terribly clean, but decent!

> We can use `lambda s: s.split('__')[0]` (for originally continuous features) and `lambda s: re.sub('__x0', '', s)` (for categorical features) to clean these names up.
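
The two clean-up rules can be folded into one helper. This is a sketch: `clean_feature_name` is a hypothetical name, and unlike the `re.sub` above it also strips the space that follows `x0_`, which is my own choice rather than anything the library does.

```python
import re

# A few names in the format printed above
raw_names = ['sex__x0_ Female', 'sex__x0_ Male', 'race__x0_ Black', 'age__age']

def clean_feature_name(name):
    """Tidy a '<transformer>__<feature>' name from ColumnTransformer."""
    if '__x0_' in name:
        # One-hot encoded feature: drop the 'x0' placeholder and any
        # whitespace that follows it, keeping '<column>_<value>'
        return re.sub(r'__x0_\s*', '_', name)
    # Passed-through feature: keep only the column name
    return name.split('__')[0]

print([clean_feature_name(n) for n in raw_names])
```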

#### One procedure to rule them all

Wouldn't it be nice to simply pass some configuration to some function and get back our fully encoded analytic dataset? Say no more! But first, let's create a more complex and realistic setup, with many more features.

In [52]:
cols = {
    'categorical': [
        'workclass',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'native-country'
    ],
    'continuous': [
        'age',
        'fnlwgt',
        'education-num',
        'capital-gain',
        'capital-loss',
        'hours-per'
    ]
}

strategies = [
    {
        'col': col,
        'kind': kind,
        'fill_value': 'median' if kind == 'continuous' else np.nan
    }
    for kind in cols
    for col in cols[kind]
]

In [53]:
from pprint import pprint
pprint(strategies)

[{'col': 'workclass', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'marital-status', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'occupation', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'relationship', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'race', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'sex', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'native-country', 'fill_value': nan, 'kind': 'categorical'},
 {'col': 'age', 'fill_value': 'median', 'kind': 'continuous'},
 {'col': 'fnlwgt', 'fill_value': 'median', 'kind': 'continuous'},
 {'col': 'education-num', 'fill_value': 'median', 'kind': 'continuous'},
 {'col': 'capital-gain', 'fill_value': 'median', 'kind': 'continuous'},
 {'col': 'capital-loss', 'fill_value': 'median', 'kind': 'continuous'},
 {'col': 'hours-per', 'fill_value': 'median', 'kind': 'continuous'}]


Here are a couple of functions (the former is called by the latter in this particular implementation) that will enable us to create our "master" transformer in one fell swoop:

In [54]:
def strategy_handler(
    kind,
    fill_value,
    ignore_fill_value,
    handle_unknown):

    ''' Generic feature strategy handler '''
    if kind == 'categorical':
        return OneHotEncoder(sparse=False, handle_unknown=handle_unknown)
    elif kind == 'continuous':
        if ignore_fill_value:
            return PassthroughEncoder()
        else:
            # These are the four strategies supported by SimpleImputer currently
            if fill_value in {'mean', 'median', 'most_frequent', 'constant'}:
                params = {'strategy': fill_value, 'fill_value': None}
            else:
                params = {'strategy': 'constant', 'fill_value': fill_value}
            return SimpleImputerWithFeatureNames(**params)
    else:
        raise ValueError(f'Kind "{kind}" invalid. Try "continuous" or "categorical"')
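
To see what the two `params` branches above ask `SimpleImputer` to do, here is a tiny standalone sketch on a made-up column. The replacement value `-1.0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column with a single missing entry
X = np.array([[1.0], [np.nan], [3.0], [10.0]])

# Named strategy: the gap is filled with the median of the observed values
median_filled = SimpleImputer(strategy='median').fit_transform(X)

# Any other fill value: strategy='constant' plus the value itself
constant_filled = SimpleImputer(
    strategy='constant', fill_value=-1.0
).fit_transform(X)

print(median_filled.ravel())
print(constant_filled.ravel())
```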

In [55]:
def build_transformer(strategies, ignore_fill_value, handle_unknown):
    transformers = []
    for s in strategies:
        col, kind, fill_value = s['col'], s['kind'], s['fill_value']
        transformer = strategy_handler(
            kind,
            fill_value,
            ignore_fill_value,
            handle_unknown
        )
        result = (col, transformer, [col])
        transformers.append(result)
    return ColumnTransformer(transformers)

Let's use `build_transformer` to construct our transformer.

In [56]:
trans = build_transformer(
    strategies=strategies,
    ignore_fill_value=False,
    # We'll come back to this
    handle_unknown='ignore'
)

X = trans.fit_transform(df)
print(X)
print(trans.get_feature_names())

[[0.0000e+00 0.0000e+00 0.0000e+00 ... 2.1740e+03 0.0000e+00 4.0000e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 1.3000e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 4.0000e+01]
 ...
 [0.0000e+00 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 4.0000e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 ... 0.0000e+00 0.0000e+00 2.0000e+01]
 [0.0000e+00 0.0000e+00 0.0000e+00 ... 1.5024e+04 0.0000e+00 4.0000e+01]]
['workclass__x0_ ?', 'workclass__x0_ Federal-gov', 'workclass__x0_ Local-gov', 'workclass__x0_ Never-worked', 'workclass__x0_ Private', 'workclass__x0_ Self-emp-inc', 'workclass__x0_ Self-emp-not-inc', 'workclass__x0_ State-gov', 'workclass__x0_ Without-pay', 'marital-status__x0_ Divorced', 'marital-status__x0_ Married-AF-spouse', 'marital-status__x0_ Married-civ-spouse', 'marital-status__x0_ Married-spouse-absent', 'marital-status__x0_ Never-married', 'marital-status__x0_ Separated', 'marital-status__x0_ Widowed', 'occupation__x0_ ?', 'occupation__x

#### Using transformers on unseen data / in production

We most certainly will want to use the strategies and transformation logic assigned to a particular `ColumnTransformer` instance to transform new datasets. There are two common use cases here: 1) creating test/validation datasets for model evaluation; 2) using an already-fitted transformer instance to transform new data in "production" (whatever that looks like).

Using the functions developed above, we can very easily demonstrate how transformers can be used in both the training and testing/consumption of a model.

In [57]:
trans = build_transformer(
    strategies=strategies,
    ignore_fill_value=False,
    # We came back to this
    handle_unknown='ignore'
)

train_df = df.copy()
test_df = df.copy()

# Randomly assign "new" values to some of the unseen dataset, `test_df`
rand_idxs = np.random.randint(test_df.shape[0], size=50)
test_df.loc[rand_idxs, ['sex', 'native-country']] = 'DEFINITELY A NEVER-BEFORE-SEEN VALUE'

In [58]:
train_X = trans.fit_transform(train_df)
test_X = trans.transform(test_df)

print(f'Training dataset shape: {train_X.shape}')
print(f'Testing dataset shape: {test_X.shape}')

Training dataset shape: (32561, 92)
Testing dataset shape: (32561, 92)


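So we see that setting `handle_unknown='ignore'` lets the transform succeed despite the new values: a category that wasn't seen during fitting is encoded as all zeros across that feature's one-hot columns, so the output keeps the same width. Let's see what happens if we set `handle_unknown='error'`.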

In [59]:
trans = build_transformer(
    strategies=strategies,
    ignore_fill_value=False,
    handle_unknown='error'
)

train_X = trans.fit_transform(train_df)
test_X = trans.transform(test_df)

ValueError: Found unknown categories ['DEFINITELY A NEVER-BEFORE-SEEN VALUE'] in column 0 during transform

That's a very long traceback, but here's what we're interested in:

```python
ValueError: Found unknown categories ['DEFINITELY A NEVER-BEFORE-SEEN VALUE'] in column 0 during transform
```

- This is totally expected behavior
    - New value in at least one of the features and our `ColumnTransformer` doesn't know how to handle it

> Note: handling unseen values like this is out of the scope of this talk, so I'll leave that for you to ponder!

- More concisely:
    - Could certainly use `handle_unknown='ignore'` when moving your model to production
    - This will ignore any new values and coerce the unseen dataset to adhere to the original transformer's "width" configuration
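
A minimal sketch of what "ignore" means in practice, on made-up values: the unseen category keeps the output width but becomes a row of zeros.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['Female'], ['Male']])

# The default output is sparse; .toarray() makes it dense for printing
encoded = enc.transform([['Male'], ['Some-new-value']]).toarray()
print(encoded)
```

The second row is all zeros, so a downstream model cannot tell an unseen value apart from "none of the known categories".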

Why not iterate through each transformer in the "master" transformer, and see what new values have emerged for all applicable features?

In [60]:
def monitor(df, transformer):
    '''
    Iterate through each transformer in an
    already-fitted ColumnTransformer to check
    whether new data raises any Exceptions (intended or otherwise)
    '''

    if not hasattr(transformer, 'transformers_'):
        print('Transformer must be fitted first')
    else:
        # Select only those transformers that inherit from the BaseEstimator class
        transformers = [
            (t_inst, t_col)
            for _, t_inst, t_col in transformer.transformers_
            if isinstance(t_inst, BaseEstimator)
        ]

        for t_inst, t_col in transformers:
            try:
                t_inst.transform(df[t_col])
            except Exception as e:
                e = str(e)
                if 'unknown categories' in e:
                    print(f'{e} ... found in column "{t_col}"')
                else:
                    raise

In [61]:
# Fit two effectively identical transformers to the same dataset
production_trans = build_transformer(strategies, True, 'ignore')
monitoring_trans = build_transformer(strategies, True, 'error')

production_trans.fit_transform(train_df)
monitoring_trans.fit(train_df)

# Remember, `test_df` has "new" values
production_trans.transform(test_df)
monitor(test_df, monitoring_trans)

Found unknown categories ['DEFINITELY A NEVER-BEFORE-SEEN VALUE'] in column 0 during transform ... found in column "['sex']"
Found unknown categories ['DEFINITELY A NEVER-BEFORE-SEEN VALUE'] in column 0 during transform ... found in column "['native-country']"
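
An alternative to catching exceptions is to compare incoming values against the `categories_` that each fitted `OneHotEncoder` stores. The sketch below assumes one encoder per column, as in this tutorial. `unseen_values` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up "training" and "new" data
train = pd.DataFrame({'sex': ['Male', 'Female'], 'race': ['White', 'Black']})
new = pd.DataFrame({'sex': ['Male', 'Other'], 'race': ['Black', 'White']})

def unseen_values(fitted_encoders, frame):
    """Map each column to the set of values that were absent at fit time."""
    report = {}
    for col, enc in fitted_encoders.items():
        known = set(enc.categories_[0])
        unseen = set(frame[col].unique()) - known
        if unseen:
            report[col] = unseen
    return report

# One fitted encoder per column
encoders = {col: OneHotEncoder().fit(train[[col]]) for col in train.columns}
print(unseen_values(encoders, new))
```

This reports the offending values directly instead of parsing an error message.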


## FAQ

### What about the leave-one-out procedure for avoiding perfect [collinearity](https://en.wikipedia.org/wiki/Multicollinearity)?

Currently, this pipeline doesn't accommodate "leave-one-out" functionality (dropping one dummy column per categorical feature), as is required for typical linear models like logistic and linear regression. I'm fairly sure this could be accomplished with a few small tweaks to how `OneHotEncoder` is configured (it has accepted a `drop` parameter since version 0.21), but I haven't had the time to investigate. I almost exclusively use these transformers with tree-based models, which don't succumb to collinearity concerns.
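
For reference, dropping one level per feature could look like the sketch below, on a made-up column. It shows the `drop='first'` option of `OneHotEncoder` and the `drop_first` flag of `pd.get_dummies`. It is not wired into `build_transformer`, and some scikit-learn releases refuse to combine `drop` with `handle_unknown='ignore'`, so treat it as a starting point only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column for illustration
sex = pd.DataFrame({'sex': ['Male', 'Female', 'Male']})

# drop='first' removes the first category (in sorted order) of each feature
enc = OneHotEncoder(drop='first')
reduced = enc.fit_transform(sex).toarray()

print(enc.categories_[0])  # categories learned during fit
print(reduced)             # a single column remains: "is Male"

# pandas equivalent
dummies = pd.get_dummies(sex['sex'], prefix='sex', drop_first=True)
print(list(dummies.columns))
```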

### What about the ["rule-of-*n*"](https://whatis.techtarget.com/definition/rule-of-five-statistics)?

In its current state, this transformation workflow also doesn't account for the total number of occurrences of a particular value for a particular feature. Oftentimes, depending on your task, you might choose to remove values for features with little [support](https://en.wikipedia.org/wiki/Association_rule_learning#Support), because a handful of cases with that condition is typically not enough to draw valid inferences. Again, most tree-based model implementations (e.g. those in `sklearn` and `xgboost`) will skirt this issue by limiting the number of instances required to occur at a leaf node. Features with support less than that limit will not be chosen for splits in these kinds of models.
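
One simple way to apply such a rule before encoding is to fold rare values into a catch-all label. This is a sketch under assumptions: `collapse_rare`, the `'Other'` label, and the threshold are all hypothetical choices, and the sample values are made up.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label='Other'):
    """Replace values seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else
    return series.where(~series.isin(rare), other_label)

# Made-up values for illustration
country = pd.Series(['US', 'US', 'US', 'Mexico', 'Mexico', 'Laos'])
print(collapse_rare(country, min_count=2).tolist())
```

In practice the counts should come from the training data only, and the same mapping should then be reused on new data.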