# The transformative abilities of `sklearn.compose`: a life-saver in disguise?

[Scikit-learn](https://scikit-learn.org/stable/) is undoubtedly one of the most popular libraries for machine learning (ML). From the algorithms provided in its core API to other useful capabilities like feature selection, pipelining, and evaluation, scikit-learn has positioned itself as a must-have on the toolbelt of many data folks. In mid-2018, a new submodule for the core scikit-learn library was initiated: `sklearn.compose`. While still relatively slim, this module, when coupled with existing scikit-learn modules like `sklearn.preprocessing`, can be powerful. The goal of this tutorial is to demonstrate how to implement a configuration-based approach to machine learning dataset creation. Specifically, we'll use the [sklearn.compose](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) and [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) modules.

At the time of writing, the most [recent stable release of scikit-learn](https://scikit-learn.org/dev/versions.html) is version 0.21.3. `sklearn.compose` first appeared in version 0.20, so the capabilities presented by this section of scikit-learn are relatively new.

## What dataset will we be using?

The [University of California, Irvine (UCI) Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult) contains a treasure-trove of datasets for ML work. I chose the ["Adult" dataset](https://archive.ics.uci.edu/ml/datasets/Adult), which tasks the analyst with predicting, based on a variety of inputs, whether an adult makes more or less than $50k per year. This dataset comes with a mixture of real, categorical, and integer features, which ought to make for a much more "real-world" dataset-processing example.

## First, some housekeeping

If you haven't already, run `sh setup.sh` from the base directory to:

1) Download the "Adult" dataset

2) Set up a virtual environment for dependency management

3) Start the Jupyter Notebook server

## Getting started with the actual exercise

First, we'll load the adult dataset:

In [2]:
import numpy as np
import pandas as pd
from pprint import pprint

# Gathered from the adult.names file and posted here for your convenience
cols = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per',
    'native-country',
    'makes_gt_50k'
]

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=cols)

Next, let's take a look at some metadata:

In [3]:
print(f'Shape of dataset: {df.shape}')
print(f'Data sample:\n{df.head()}')
print(f'Data types:\n{df.dtypes}')
print(f'Number of unique values by field, for non-numeric features:\n{df.select_dtypes(include=["object"]).nunique()}')

Shape of dataset: (32561, 15)
Data sample:
   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per  native-country makes_gt_50k  
0          2174          

As we can see, there is quite a diversity of fields in this dataset. We have a mixture of continuous (`age`, `capital-gain`, `hours-per`, etc.) and categorical (`workclass`, `education`, etc.) features.

Now, a logical next step in the process of building a predictive model would be to perform some exploratory data analysis on each of the potential input features. **For the sake of this exercise**, let's assume we've done that and proceed straight to feature-engineering.

## Feature engineering

Ah, feature engineering - as the old (well, as old as the term "data scientist") adage goes, about 80% of your time will be spent pulling together features for whatever model you're building. Now, the vast majority of this time is spent working with stakeholders, thinking about the domain, and trying to come up with the most relevant predictors for whatever predictive task you're after.

However, once you've got all of your main features pulled together, oftentimes that's just the first step (albeit a very large one): you'll likely need to preprocess a lot of the fields in order to make your data play nicely with whatever ML algorithm software you're trying to use.

For example, most of the algorithms in Python's main ML libraries don't natively support mixed types in input datasets. That is to say, instead of feeding a vector for `sex` like `['male', 'female', 'male', 'female']` as an input feature, we will instead need to encode this field in a numerical fashion. By far the most common approach for encoding categorical vectors is called "one-hot encoding." Below, I'll show a few (of many) examples of how one-hot encoding can be accomplished in Python.

> Note: oftentimes, preprocessing will be applied across the entire dataset - not just for categorical features. For the sake of brevity, I'll only demonstrate the one-hot-encoding approach and leave it up to you to incorporate more sophisticated encoding strategies for features of other types.

### Using `pandas.get_dummies`

The data-manipulation library `pandas` has a function called `get_dummies`, which creates ["dummy" variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), given some input. Here's an example of how we might encode `sex` using `pandas.get_dummies`:

In [4]:
print(f"Original column:\n{df['sex'].head(10)}")
print(f"That same column, one-hot-encoded:\n{pd.get_dummies(df['sex'], prefix='sex').head(10)}")

Original column:
0       Male
1       Male
2       Male
3       Male
4     Female
5     Female
6     Female
7       Male
8     Female
9       Male
Name: sex, dtype: object
That same column, one-hot-encoded:
   sex_ Female  sex_ Male
0            0          1
1            0          1
2            0          1
3            0          1
4            1          0
5            1          0
6            1          0
7            0          1
8            1          0
9            0          1
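
A side note on the column names above: the leading space in `sex_ Female` and `sex_ Male` comes from the raw file, which separates fields with a comma followed by a space. If you'd rather not carry that whitespace around, one option is the `skipinitialspace` argument of `pandas.read_csv`. Here's a minimal sketch on two made-up rows (the sample text is mine, not taken from the real file):

```python
import io
import pandas as pd

# Two made-up rows that mimic the ", " delimiter used in adult.data
raw = "39, State-gov, Male\n28, Private, Female\n"
names = ["age", "workclass", "sex"]

# Default parsing keeps the space that follows each comma
as_is = pd.read_csv(io.StringIO(raw), names=names)

# skipinitialspace=True drops whitespace that follows each delimiter
stripped = pd.read_csv(io.StringIO(raw), names=names, skipinitialspace=True)

print(as_is["sex"].tolist())
print(stripped["sex"].tolist())
```

The rest of this tutorial keeps the raw values, so any comparison against a category (for example `' >50K'`) still includes the leading space.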


Another approach would be to use [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):

In [5]:
from sklearn.preprocessing import OneHotEncoder

# Note: when using models prone to perfect collinearity, you'll want to set `drop='first'`
enc = OneHotEncoder(sparse=False)
print(f"That same column, one-hot-encoded:\n{enc.fit_transform(df['sex'].values.reshape(-1, 1))[:10]}")

That same column, one-hot-encoded:
[[0. 1.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]]


Now, both of these approaches are perfectly fine ways of performing one-hot encoding. However, the latter approach will play very nicely with the rest of the `sklearn.compose` module, which I'm here to demonstrate. Technically, `pd.get_dummies` could work as well, but it would take a bit more work, and the main benefit of the second approach is staying within the `scikit-learn` API.
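
To make the "bit more work" concrete, here's a small sketch of what I have in mind. `pd.get_dummies` derives its columns from whatever data it is handed, while a fitted `OneHotEncoder` remembers the categories it saw during `fit`. The two toy frames below are made up for illustration, and I call `.toarray()` on the default sparse output instead of passing `sparse=False`, only so the snippet doesn't depend on that argument:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"race": ["White", "Black", "Other"]})
new = pd.DataFrame({"race": ["Black", "Asian"]})  # "Asian" never appears in train

# get_dummies builds columns from the data in front of it,
# so the two frames end up with different layouts
train_cols = list(pd.get_dummies(train["race"], prefix="race").columns)
new_cols = list(pd.get_dummies(new["race"], prefix="race").columns)
print(train_cols)
print(new_cols)

# A fitted encoder keeps the training layout; unseen values become all zeros
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["race"]])
encoded_new = enc.transform(new[["race"]]).toarray()
print(encoded_new)
```

With `get_dummies` you would have to reindex the new frame against the training columns yourself; the encoder does that bookkeeping for you.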

### So how does `sklearn.compose` help with all of this preprocessing?

If you look at the [main page for `sklearn.compose`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose), you'll notice how few functions/classes exist within that submodule. We're most concerned with `ColumnTransformer` and `make_column_transformer`. From the `ColumnTransformer` description:

> Applies transformers to columns of an array or pandas DataFrame.
This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

`make_column_transformer` is simply a shorthand constructor for `ColumnTransformer`: it names each transformer automatically and doesn't support as many options, so for this exercise we'll concern ourselves primarily with `ColumnTransformer` itself.
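
For a quick feel of the difference between the two, here's a side-by-side sketch on a made-up three-row frame. The name `sex_onehot` is my own choice, and I'm assuming a scikit-learn version where `make_column_transformer` takes `(transformer, columns)` tuples, which as far as I know is every release from 0.20.1 onward:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})

# Explicit form: we pick the transformer's name ourselves
explicit = ColumnTransformer(
    [("sex_onehot", OneHotEncoder(), ["sex"])],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Shorthand form: the name is derived from the transformer's class
shorthand = make_column_transformer(
    (OneHotEncoder(), ["sex"]),
    sparse_threshold=0,
)

explicit_out = explicit.fit_transform(toy)
shorthand_out = shorthand.fit_transform(toy)

explicit_names = [name for name, _, _ in explicit.transformers]
shorthand_names = [name for name, _, _ in shorthand.transformers]
print(explicit_names, shorthand_names)
```

Both objects produce the same encoded array; the only visible difference here is who chose the transformer's name.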

Effectively, the code provided through the `compose` submodule will allow us to very easily construct analytic, ready-to-be-modeled-off-of datasets, using pre-defined encoding patterns.

### Enough talk – how does a `ColumnTransformer` work?

Let's go through the one-hot encoding example from above, using a `ColumnTransformer`:

In [6]:
from sklearn.compose import ColumnTransformer

col = 'sex'
enc = OneHotEncoder(sparse=False)
trans = ColumnTransformer([(col, enc, [col])])
trans.fit_transform(df)[:10]

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.]])

As you can see, we've successfully used a `ColumnTransformer` and the typical scikit-learn `fit` and `transform` patterns to one-hot-encode the `sex` column, just like we did above.

<img src="imgs/ytho.jpg" style="width: 400px;"/>


The power of the `ColumnTransformer` is seen when you're dealing with trying to abstract more complex encoding pipelines across a "wide" dataset. Sure, properly encoding `sex` by itself is a trivial task. But what if you have to apply *d*-number of encoding strategies across *n*-number of columns, and want a consistent and "summarized" way of doing so? Allow me to demonstrate.

We're effectively going to treat the encoding/preprocessing step as a configuration file problem. First, we'll select a few columns (of whatever type), and specify what type of feature the column represents.

> Note: you could also get at the type of feature by relying on pandas' default data-type parsing when the file is initially read, i.e. looking at `df.dtypes`. What is shown in this tutorial is a more explicit approach.
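
If you did want to go the `df.dtypes` route mentioned in the note, a rough sketch might look like the following. The toy frame and the rule "numeric means continuous, everything else means categorical" are my own simplifications; integer-coded categories would be mislabeled by it, which is one reason I prefer the explicit configuration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["Male", "Female"],
    "race": ["White", "Black"],
    "age": [39, 28],
})

# Treat numeric dtypes as continuous and everything else as categorical
inferred = [
    {
        "col": col,
        "kind": "continuous" if pd.api.types.is_numeric_dtype(dtype) else "categorical",
    }
    for col, dtype in toy.dtypes.items()
]
print(inferred)
```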

In [7]:
strategies = [
    {
        'col': 'sex',
        'kind': 'categorical'
    },
    {
        'col':  'race',
        'kind': 'categorical'
    },
    {
        'col': 'age',
        'kind': 'continuous'
    }
]

transformers = []
for s in strategies:
    col, kind = s['col'], s['kind']
    
    # An opinionated encoding mechanism
    if kind == 'categorical':
        transformer = OneHotEncoder(sparse=False)
    elif kind == 'continuous':
        # Default to not applying any preprocessing to continuous features
        transformer = 'passthrough'
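    else:
        # Other data types aren't supported yet; fail loudly instead of
        # silently reusing the transformer from the previous iteration
        raise ValueError(f'Unsupported kind "{kind}" for column "{col}"')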
    
    result = (col, transformer, [col])
    transformers.append(result)

master_trans = ColumnTransformer(transformers)
master_trans.fit_transform(df)

array([[ 0.,  1.,  0., ...,  0.,  1., 39.],
       [ 0.,  1.,  0., ...,  0.,  1., 50.],
       [ 0.,  1.,  0., ...,  0.,  1., 38.],
       ...,
       [ 1.,  0.,  0., ...,  0.,  1., 58.],
       [ 0.,  1.,  0., ...,  0.,  1., 22.],
       [ 1.,  0.,  0., ...,  0.,  1., 52.]])
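
The `'passthrough'` default leaves `age` on its raw scale. Since a `ColumnTransformer` accepts any scikit-learn transformer, swapping in something like `StandardScaler` is a small change to the configuration logic. Below is a sketch on a made-up four-row frame; the `encoder_factories` lookup is a hypothetical helper of mine, not part of scikit-learn:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [20, 30, 40, 50],
})

toy_strategies = [
    {"col": "sex", "kind": "categorical"},
    {"col": "age", "kind": "continuous"},
]

# Hypothetical lookup from feature kind to the class that encodes it
encoder_factories = {
    "categorical": OneHotEncoder,
    "continuous": StandardScaler,
}

toy_trans = ColumnTransformer(
    [(s["col"], encoder_factories[s["kind"]](), [s["col"]]) for s in toy_strategies],
    sparse_threshold=0,  # always return a dense NumPy array
)
toy_out = toy_trans.fit_transform(toy)

# Two one-hot columns for sex, followed by the standardized age column
print(toy_out.shape)
```

A dictionary lookup like this also fails with a `KeyError` on an unknown kind, which I find easier to debug than a silently skipped column.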

### Additional topics

In the example above, we went from a raw, unprocessed dataset to something ready for further analysis/modeling, all in very few lines of code thanks to our dataset-as-a-configuration-file approach.

But there is still more to touch on here. Below I'll detail some additional topics related to `sklearn.compose`.

#### Getting names of encoded features

All we see above is a NumPy matrix containing a bunch of numbers. What if we wanted to more easily inspect or share this dataset? One of the first things we may want to know is the human-readable name for each column.

In comes the `get_feature_names` method off of our `ColumnTransformer` object. Let's see what results we get:

In [8]:
master_trans.get_feature_names()

NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.

This error is to be expected (at least at this point in `sklearn.compose`'s development cycle)! Since we've used the `'passthrough'` option for some of our features, we don't have the ability to get the name of the feature that was originally transformed. But, just because that capability isn't yet enabled for the "passthrough" transformation doesn't mean we can't write that functionality ourselves! 😈

> Note: the `sklearn.impute.SimpleImputer` class also doesn't have a `get_feature_names` method - so we'll work around that below as well.

In [10]:
from sklearn.impute import SimpleImputer
from sklearn.base import TransformerMixin, BaseEstimator

# Inherit sklearn's BaseEstimator and TransformerMixin classes to make these new
# classes play nicely with the rest of the `sklearn.compose` functionality we're using
class PassthroughEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        self.feature_names = list(X)
        return self
    def transform(self, X):
        return X
    def get_feature_names(self):
        return self.feature_names

class SimpleImputerWithFeatureNames(SimpleImputer):
    # X is a pandas.DataFrame or pandas.Series, in this case
    def fit(self, X, y = None):
        self.feature_names = list(X)
        # Execute parent class method - the nuts and bolts of this child object
        super().fit(X, y)
        return self
    def get_feature_names(self):
        return self.feature_names
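
Both classes above record `list(X)` during `fit`. That trick relies on the fact that iterating over a DataFrame yields its column labels. As far as I can tell, the `ColumnTransformer` hands each transformer a DataFrame when the column selection is a list such as `[col]`, which is how we specify columns throughout this tutorial. A quick sketch of the difference, on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"age": [39, 50], "fnlwgt": [77516, 83311]})

frame_slice = toy[["age"]]   # list selection -> DataFrame
series_slice = toy["age"]    # scalar selection -> Series

frame_labels = list(frame_slice)    # iterating a DataFrame gives column labels
series_values = list(series_slice)  # iterating a Series gives its values

print(frame_labels)
print(series_values)
```

So if a Series ever reaches `fit`, `list(X)` would capture the data rather than the column name, and something like `X.name` would be needed instead.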

In [14]:
transformers = []
for s in strategies:
    col, kind = s['col'], s['kind']
    
    # An opinionated encoding mechanism
    if kind == 'categorical':
        transformer = OneHotEncoder(sparse=False)
    elif kind == 'continuous':
        # Default to not applying any preprocessing to continuous features
        transformer = PassthroughEncoder()
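    else:
        # Other data types aren't supported yet; fail loudly instead of
        # silently reusing the transformer from the previous iteration
        raise ValueError(f'Unsupported kind "{kind}" for column "{col}"')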
    
    result = (col, transformer, [col])
    transformers.append(result)

trans = ColumnTransformer(transformers)
trans.fit_transform(df)

array([[ 0.,  1.,  0., ...,  0.,  1., 39.],
       [ 0.,  1.,  0., ...,  0.,  1., 50.],
       [ 0.,  1.,  0., ...,  0.,  1., 38.],
       ...,
       [ 1.,  0.,  0., ...,  0.,  1., 58.],
       [ 0.,  1.,  0., ...,  0.,  1., 22.],
       [ 1.,  0.,  0., ...,  0.,  1., 52.]])

Let's check out the feature names for this `ColumnTransformer`:

In [15]:
trans.get_feature_names()

['sex__x0_ Female',
 'sex__x0_ Male',
 'race__x0_ Amer-Indian-Eskimo',
 'race__x0_ Asian-Pac-Islander',
 'race__x0_ Black',
 'race__x0_ Other',
 'race__x0_ White',
 'age__age']

In [None]:
def strategy_handler(
    kind,
    fill_value,
    ignore_fill_value,
    handle_unknown):

    ''' Generic feature strategy handler '''
    if kind == 'categorical':
        return OneHotEncoder(sparse=False, handle_unknown=handle_unknown)
    elif kind == 'continuous':
        if ignore_fill_value:
            return PassthroughEncoder()
        else:
            # These are the four strategies supported by SimpleImputer currently
            if fill_value in {'mean', 'median', 'most_frequent', 'constant'}:
                params = {'strategy': fill_value, 'fill_value': None}
            else:
                params = {'strategy': 'constant', 'fill_value': fill_value}
            return SimpleImputerWithFeatureNames(**params)
    else:
        raise ValueError('Kind "{}" invalid. Try "continuous" or "categorical"'.format(kind))
        
def build_transformer(strategies, ignore_fill_value, handle_unknown):
    ''' Take entire `strategies` (see model_config.py) and create master transformer '''
    transformers = []
    for s in strategies:
        name, kind, fill_value = s['name'], s['kind'], s['fill_value']
        transformer = strategy_handler(
            kind,
            fill_value,
            ignore_fill_value,
            handle_unknown
        )
        result = (name, transformer, [name])
        transformers.append(result)
    return ColumnTransformer(transformers)
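
The `strategy_handler` function above turns `fill_value` into `SimpleImputer` arguments in one of two ways: either a named strategy such as `'mean'`, or a constant to fill with. Here's a small sketch of both branches on a made-up column, using plain `SimpleImputer` so the snippet stands on its own:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

# A named strategy: the gap is filled with the column mean
by_mean = SimpleImputer(strategy="mean").fit_transform(toy)

# Any other value: the gap is filled with that constant
by_constant = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(toy)

print(by_mean.ravel())
print(by_constant.ravel())
```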

In [None]:
def monitor(df, transformer):
    '''
    Iterate through each transformer in an
    already-fitted ColumnTransformer to check
    whether new data raises any Exceptions (intended or otherwise)
    '''

    if not hasattr(transformer, 'transformers_'):
        print('Transformer must be fitted first')
    else:
        # Select only those transformers that inherit from the BaseEstimator class
        transformers = [
            (t_inst, t_col)
            for _, t_inst, t_col in transformer.transformers_
            if isinstance(t_inst, BaseEstimator)
        ]

        for t_inst, t_col in transformers:
            try:
                t_inst.transform(df[t_col])
            except Exception as e:
                e = str(e)
                if 'unknown categories' in e:
                    print(f'{e} ... found in column "{t_col}"')
                else:
                    raise
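
The `monitor` function leans on how `OneHotEncoder` reacts to categories it didn't see during `fit`. A minimal sketch of the two behaviors, on made-up frames:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"workclass": ["Private", "State-gov"]})
new = pd.DataFrame({"workclass": ["Private", "Never-worked"]})  # unseen value

strict = OneHotEncoder(handle_unknown="error").fit(train)
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)

# The strict encoder refuses to transform a category it has never seen
caught = None
try:
    strict.transform(new)
except ValueError as err:
    caught = str(err)
print(caught)

# The lenient encoder encodes the unseen category as a row of zeros
lenient_out = lenient.transform(new).toarray()
print(lenient_out)
```

One consequence worth keeping in mind: an encoder built with `handle_unknown='ignore'` never raises here, so `monitor` only reports unknown categories for encoders built with `handle_unknown='error'`.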

In [None]:
strategies = [
    {
        'name': 'age',
        'kind': 'continuous',
        'fill_value': np.nan
    },
    {
        'name': 'workclass',
        'kind': 'categorical',
        'fill_value': np.nan
    },
    {
        'name': 'fnlwgt',
        'kind': 'continuous',
        'fill_value': np.nan
    },
#     {
#         'name': 'education',
#         'kind': 'categorical',
#         'fill_value': np.nan
#     },
    {
        'name': 'education-num',
        'kind': 'continuous',
        'fill_value': np.nan
    },
    {
        'name': 'marital-status',
        'kind': 'categorical',
        'fill_value': np.nan
    },
    {
        'name': 'occupation',
        'kind': 'categorical',
        'fill_value': np.nan
    },
    {
        'name': 'relationship',
        'kind': 'categorical',
        'fill_value': np.nan
    },
    {
        'name': 'race',
        'kind': 'categorical',
        'fill_value': np.nan
    },
    {
        'name': 'sex',
        'kind': 'categorical',
        'fill_value': np.nan
    },
    {
        'name': 'capital-gain',
        'kind': 'continuous',
        'fill_value': np.nan
    },
    {
        'name': 'capital-loss',
        'kind': 'continuous',
        'fill_value': np.nan
    },
    {
        'name': 'hours-per',
        'kind': 'continuous',
        'fill_value': np.nan
    },
    {
        'name': 'native-country',
        'kind': 'categorical',
        'fill_value': np.nan
    },
]

# Binary target: 1 when income exceeds $50k, else 0 (raw values keep a leading space)
df['make_gt_50k'] = np.where(df['makes_gt_50k'] == ' >50K', 1, 0)

In [None]:
transformer = build_transformer(strategies, True, 'ignore')
transformer.fit_transform(df)

TODO: show original versus monitoring transformers

1) leave-one-out, rule-of-*n*

2) PassthroughTransformer to get at `get_feature_names`

3) Fill values

4) Accepts any sort-of scikit Estimator, with fit and fit_transform stuff