# The transformative abilities of `sklearn.compose`: a life-saver in disguise?

> Note: the initial working title for this talk was "MLPaaCF: Machine Learning Preprocessing as a Config File," which robbed me of the opportunity to make [Transformers](https://en.wikipedia.org/wiki/Transformers_(film_series)) puns several times throughout.

[Scikit-learn](https://scikit-learn.org/stable/) is undoubtedly one of the most popular libraries for machine learning (ML). From the algorithms provided in its core API to other useful capabilities like feature selection, pipelining, and evaluation, scikit-learn has positioned itself as a must-have on the toolbelt of many data folks. In mid-2018, a new submodule for the core scikit-learn library was initiated: `sklearn.compose`. While still relatively slim, this module, when coupled with existing scikit-learn modules like `sklearn.preprocessing`, can be powerful. The goal of this tutorial is to demonstrate how to implement a configuration-based approach to machine learning dataset creation. Specifically, we'll use the [sklearn.compose](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) and [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) modules.

At the time of writing, the most [recent stable release of scikit-learn](https://scikit-learn.org/dev/versions.html) is version 0.21.3. `sklearn.compose` first appeared in version 0.20, so the capabilities presented by this section of scikit-learn are relatively new.

## What dataset will we be using?

The [University of California, Irvine (UCI) Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult) contains a treasure trove of datasets for ML work. I chose the ["Adult" dataset](https://archive.ics.uci.edu/ml/datasets/Adult), which tasks the analyst with predicting, based on a variety of inputs, whether an adult makes more or less than $50k per year. This dataset comes with a mixture of real, categorical, and integer features, which ought to make for a much more "real-world" dataset-processing example.

## First, some housekeeping

If you haven't already, run `sh setup.sh` from the base directory to:

1) Download the "Adult" dataset

2) Set up a virtual environment for dependency management

3) Start the Jupyter Notebook server

## Getting started with the actual exercise

First, we'll load the adult dataset:

In [35]:
import pandas as pd
from pprint import pprint

# Gathered from the adult.names file and posted here for your convenience
cols = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per',
    'native-country',
    'makes_gt_50k'
]

df = pd.read_csv('data/adult.data', names=cols)

Next, let's take a look at some metadata:

In [36]:
print(f'Shape of dataset: {df.shape}')
print(f'Data sample:\n{df.head()}')
print(f'Data types:\n{df.dtypes}')
print(f'Number of unique values by field, for non-numeric features:\n{df.select_dtypes(include=["object"]).nunique()}')

Shape of dataset: (32561, 15)
Data sample:
   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per  native-country makes_gt_50k  
0          2174          

As we can see, there is quite a diversity of fields in this dataset. We have a mixture of continuous (`age`, `capital-gain`, `hours-per`, etc.) and categorical (`workclass`, `education`, etc.) features.
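One way to make that split explicit is to let pandas group the columns by dtype. The snippet below is a minimal sketch on a hypothetical toy frame (`toy`, with made-up rows that only mimic a few of the Adult columns), not on the real `df`. Note that a dtype-based split treats integer-coded categories as numeric, so the resulting lists may still need a manual pass.

```python
import pandas as pd

# Hypothetical stand-in for a few Adult columns (illustrative values only)
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private'],
    'hours-per': [40, 13, 40],
    'sex': ['Male', 'Male', 'Male'],
})

# Numeric dtypes are candidates for scaling; everything else for encoding
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(f'Numeric columns: {numeric_cols}')
print(f'Categorical columns: {categorical_cols}')
```

Lists like these are a natural starting point for the configuration-based preprocessing this tutorial is building toward.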

Now, a logical next step in the process of building a predictive model would be to perform some exploratory data analysis on each of the potential input features. **For the sake of this exercise**, let's assume we've done that and proceed straight to feature-engineering.

## Feature engineering

Most of the algorithms in Python's main ML libraries don't natively support mixed types in input datasets. That is to say, rather than feeding in a vector for `sex` like `['male', 'female', 'male', 'female']` as an input feature, we need to preprocess this field into numbers first. By far the most common approach for encoding categorical vectors is called "one-hot encoding," which replaces the field with one 0/1 indicator column per category. Below, I'll show a few (of many) examples of how one-hot encoding can be accomplished in Python.
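As a first, library-light illustration, here is what one-hot encoding does mechanically. This is a hand-rolled sketch using only NumPy, meant to show the idea rather than to replace the library helpers; the variable names are my own.

```python
import numpy as np

values = np.array(['male', 'female', 'male', 'female'])

# np.unique returns the sorted categories and, for each value,
# the index of its category within that sorted array
categories, codes = np.unique(values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing with the codes yields one row per original value
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one `1`, in the column belonging to that row's category.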

### Using `pandas.get_dummies`

The data-manipulation library `pandas` has a function called `get_dummies`, which creates ["dummy" variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), given some input. Here's an example of how we might encode `sex` using `pandas.get_dummies`:

In [25]:
print(f"Original column:\n{df['sex'].head(10)}")
print(f"That same column, one-hot-encoded:\n{pd.get_dummies(df['sex'], prefix='sex').head(10)}")

Original column:
0       Male
1       Male
2       Male
3       Male
4     Female
5     Female
6     Female
7       Male
8     Female
9       Male
Name: sex, dtype: object
That same column, one-hot-encoded:
   sex_ Female  sex_ Male
0            0          1
1            0          1
2            0          1
3            0          1
4            1          0
5            1          0
6            1          0
7            0          1
8            1          0
9            0          1


### Using `sklearn.preprocessing.OneHotEncoder`

Another approach would be to use [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):

In [34]:
from sklearn.preprocessing import OneHotEncoder

# Note: when using models prone to perfect collinearity, you'll want to set `drop='first'`
# (in scikit-learn >= 1.2, `sparse=False` is spelled `sparse_output=False`)
enc = OneHotEncoder(sparse=False)
print(f"That same column, one-hot-encoded:\n{enc.fit_transform(df['sex'].values.reshape(-1, 1))[:10]}")

That same column, one-hot-encoded:
[[0. 1.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]]
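Unlike a one-off function call, a fitted encoder remembers the categories it saw, so the same column layout can be reapplied to new data. The sketch below uses hypothetical toy frames (`train` and `new`, not the Adult data) and assumes `handle_unknown='ignore'` is the behaviour you want for unseen values. It calls `.toarray()` on the default sparse output rather than passing `sparse=False`, so it should run across scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; 'Unknown' never appears in the training frame
train = pd.DataFrame({'sex': ['Male', 'Female', 'Male']})
new = pd.DataFrame({'sex': ['Female', 'Unknown']})

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at transform time
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# Categories learned during fit, one array per input column
print(enc.categories_)

# The default output is a sparse matrix; densify it for display
encoded_new = enc.transform(new).toarray()
print(encoded_new)
```

The first row of `encoded_new` marks `Female`, while the unseen `Unknown` row is all zeros.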


Both of these approaches are perfectly fine ways of performing one-hot encoding. However, the latter will play very nicely with the `sklearn.compose` module, which I'm here to demonstrate. Technically, `pd.get_dummies` could work as well, but it would take a bit more work: it keeps no fitted state, so you would have to keep the dummy columns consistent between datasets yourself. The main benefit of the second approach is staying within the scikit-learn API.
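To give a flavour of where this is heading, here is a minimal sketch of `OneHotEncoder` inside a `sklearn.compose.ColumnTransformer`. The toy frame, the step names (`scale_numeric`, `one_hot`), and the choice of `StandardScaler` for the numeric columns are my own assumptions for illustration, not a prescribed recipe for the Adult data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy stand-in for the Adult data (illustrative values only)
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'hours-per': [40, 13, 40, 40],
    'sex': ['Male', 'Male', 'Female', 'Female'],
})

# Each entry is (name, transformer, columns): effectively a small config
preprocessor = ColumnTransformer(
    transformers=[
        ('scale_numeric', StandardScaler(), ['age', 'hours-per']),
        ('one_hot', OneHotEncoder(), ['sex']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Two scaled numeric columns followed by one indicator column per category
features = preprocessor.fit_transform(toy)
print(features.shape)
print(features)
```

Because the column-to-transformer mapping lives in a single list, changing the preprocessing becomes a matter of editing that list rather than rewriting transformation code.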

leave-one-out, rule-of-*n*