In [1]:
import numpy as np
np.set_printoptions(precision=3, linewidth=100)

# Introducing the `ColumnTransformer`: applying different transformations to different features in a scikit-learn pipeline

*This work is supported by the Université Paris-Saclay Center for Data Science*

<!-- PELICAN_BEGIN_SUMMARY -->
<p>
Short summary: the <code>ColumnTransformer</code>, which makes it possible to apply different transformers to different features, has landed in scikit-learn (the <a href="https://github.com/scikit-learn/scikit-learn/pull/9012">PR</a> has been merged into master and will be included in the upcoming 0.20 release).
</p>
<!-- PELICAN_END_SUMMARY -->

---
Real-world data often contains heterogeneous data types. When preprocessing the data before fitting the final prediction model, we typically want to apply different preprocessing steps and transformations to those different types of columns.  
A simple example: we may want to scale the numerical features and one-hot encode the categorical features. 

Currently, scikit-learn does not provide a good solution to do this out of the box. You can do the preprocessing beforehand using e.g. pandas, or you can select subsets of columns and apply different transformers on them manually. But that does not make it easy to put those preprocessing steps in a scikit-learn `Pipeline`, which can be important to avoid data leakage or to do a grid search over preprocessing parameters.
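To make the problem concrete, here is a minimal sketch of the manual approach on a tiny made-up DataFrame. The column names and values are only illustrative, and I am assuming a released scikit-learn in which `OneHotEncoder` accepts string columns. Each subset of columns is transformed separately and the results are stacked by hand:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [29.0, 2.0, 30.0, 25.0],
    "fare": [211.3, 151.6, 151.6, 26.0],
    "sex": ["female", "male", "female", "male"],
})

# Fit one transformer per subset of columns ...
scaler = StandardScaler().fit(df[["age", "fare"]])
encoder = OneHotEncoder().fit(df[["sex"]])

# ... and glue the outputs together manually.
scaled = scaler.transform(df[["age", "fare"]])
encoded = encoder.transform(df[["sex"]]).toarray()  # one-hot output is sparse by default
manual = np.hstack([scaled, encoded])

print(manual.shape)
```

The fitted `scaler` and `encoder` then have to be carried around and re-applied in the same order to any new data, which is the bookkeeping a `Pipeline` would otherwise take care of.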

There are third-party projects that try to address this. For example, the [`sklearn_pandas`](https://github.com/scikit-learn-contrib/sklearn-pandas) package has a `DataFrameMapper` that maps subsets of a DataFrame's columns to a specific transformation. Many thanks to the authors of this library, as such "contrib" packages are essential in extending the functionality of scikit-learn, and in exploring things that would take a long time in scikit-learn itself.  
The `ColumnTransformer` aims to bring this functionality into the core scikit-learn library, with support for numpy arrays and sparse matrices, and good integration with the rest of scikit-learn.



### Example

To illustrate the basic usage of the `ColumnTransformer`, let's load the Titanic survival dataset:

In [2]:
import pandas as pd
titanic = pd.read_csv("https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv")
# there is still a small problem with using the CategoricalEncoder and missing values,
# so for now I am going to assume there are no missing values by dropping them
titanic = titanic.dropna(subset=['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'])

Selecting some of the features and target:

In [3]:
target = titanic.survived.values
features = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]

In [4]:
features.head()

|   | pclass | sex | age | fare | embarked |
|---|---|---|---|---|---|
| 0 | 1 | female | 29.0 | 211.3375 | S |
| 1 | 1 | male | 0.9167 | 151.55 | S |
| 2 | 1 | female | 2.0 | 151.55 | S |
| 3 | 1 | male | 30.0 | 151.55 | S |
| 4 | 1 | female | 25.0 | 151.55 | S |


This dataset contains some categorical variables ("pclass", "sex" and "embarked") and some numerical variables ("age" and "fare"). Note that "pclass", although categorical, is already encoded as integers in the dataset.  
So let's use the `ColumnTransformer` to combine transformers for those two types of features:

In [5]:
from sklearn.preprocessing import StandardScaler, CategoricalEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

In [6]:
preprocess = make_column_transformer(
    (['age', 'fare'], StandardScaler()),
    (['pclass', 'sex', 'embarked'], CategoricalEncoder())
)

The above creates a simple preprocessing pipeline (which will be combined into a full prediction pipeline below) that scales the numerical features and one-hot encodes the categorical features.  
We can check that this is indeed working as expected by transforming the input data:

In [7]:
preprocess.fit_transform(features).toarray()[:5]

array([[-0.057,  3.136,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  1.   ],
       [-2.012,  2.063,  1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.   ],
       [-1.937,  2.063,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  1.   ],
       [ 0.013,  2.063,  1.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  1.   ],
       [-0.335,  2.063,  1.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  1.   ]])

In the above, we specified the subsets of columns as lists. We can also use boolean masks (e.g. to select columns based on their data types), integer positions and slices. Further, the `ColumnTransformer` allows you to specify whether to drop or pass through the columns that were not specified. See the [development docs](http://scikit-learn.org/dev/modules/generated/sklearn.compose.ColumnTransformer.html) for more details.
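As a small sketch of those alternative column specifications, the example below uses a made-up numeric array and the `ColumnTransformer` class from a released scikit-learn, with `(name, transformer, columns)` tuples. The array values and the transformer names are only illustrative:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric array, made up for illustration: 4 rows, 4 columns.
X = np.array([
    [1.0, 10.0, 100.0, 7.0],
    [2.0, 20.0, 200.0, 7.0],
    [3.0, 30.0, 300.0, 7.0],
    [4.0, 40.0, 400.0, 7.0],
])

ct = ColumnTransformer(
    [
        ("standard", StandardScaler(), [0]),      # list of integer positions
        ("minmax", MinMaxScaler(), slice(1, 3)),  # slice: columns 1 and 2
    ],
    remainder="drop",  # column 3 is not listed and is dropped
)
out = ct.fit_transform(X)
print(out.shape)  # (4, 3)

# A boolean mask that selects the first three columns.
mask = np.array([True, True, True, False])
ct_mask = ColumnTransformer([("standard", StandardScaler(), mask)], remainder="drop")
out_mask = ct_mask.fit_transform(X)
print(out_mask.shape)  # (4, 3)
```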

**! This is new functionality in scikit-learn, so you are very welcome to try out the development version, experiment with it in your use cases, and provide feedback!** I am sure there are ways to further improve this functionality (see the [PR](https://github.com/scikit-learn/scikit-learn/pull/9012)).

### Integrating in a full pipeline

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

In [9]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)

In [10]:
model = make_pipeline(
    preprocess,
    LogisticRegression())

In [11]:
model.fit(X_train, y_train)
print("logistic regression score: %f" % model.score(X_test, y_test))

logistic regression score: 0.804598


In [12]:
from sklearn.ensemble import RandomForestClassifier

In [13]:
preprocess2 = make_column_transformer(
    (['age', 'fare'], StandardScaler()),
    (['pclass', 'sex', 'embarked'], CategoricalEncoder(encoding='ordinal'))
)

In [14]:
rf = make_pipeline(
    preprocess2,
    RandomForestClassifier(n_estimators=500, random_state=0))

In [15]:
rf.fit(X_train, y_train)
print("random forest score: %f" % rf.score(X_test, y_test))

random forest score: 0.793103


In [16]:
rf.get_params()

{'columntransformer': ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
          transformers=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True), ['age', 'fare']), ('categoricalencoder', CategoricalEncoder(categories='auto', dtype=<class 'numpy.float64'>,
           encoding='ordinal', handle_unknown='error'), ['pclass', 'sex', 'embarked'])]),
 'columntransformer__categoricalencoder': CategoricalEncoder(categories='auto', dtype=<class 'numpy.float64'>,
           encoding='ordinal', handle_unknown='error'),
 'columntransformer__categoricalencoder__categories': 'auto',
 'columntransformer__categoricalencoder__dtype': numpy.float64,
 'columntransformer__categoricalencoder__encoding': 'ordinal',
 'columntransformer__categoricalencoder__handle_unknown': 'error',
 'columntransformer__n_jobs': 1,
 'columntransformer__remainder': 'passthrough',
 'columntransformer__standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True)
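The double-underscore names in a `get_params()` listing like the one above are the keys that a grid search accepts, which is what makes tuning preprocessing parameters possible. The following is a rough sketch under a few assumptions: it uses synthetic data, a released scikit-learn where `OneHotEncoder` takes the place of `CategoricalEncoder`, and an explicitly named `ColumnTransformer` so that the parameter keys are predictable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, generated only to make the example self-contained.
rng = np.random.RandomState(0)
n = 60
X = pd.DataFrame({
    "age": rng.uniform(1, 80, size=n),
    "fare": rng.uniform(5, 500, size=n),
    "sex": rng.choice(["female", "male"], size=n),
})
y = (X["sex"] == "female").astype(int).to_numpy()

preprocess = ColumnTransformer([
    ("scaler", StandardScaler(), ["age", "fare"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
pipe = make_pipeline(preprocess, LogisticRegression())

# Keys follow the <step>__<transformer>__<parameter> pattern from get_params().
param_grid = {
    "columntransformer__scaler__with_mean": [True, False],
    "logisticregression__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the preprocessing lives inside the pipeline, it is re-fitted on the training part of every cross-validation split, which avoids leaking information from the held-out part.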

In [44]:
numerical_features = features.dtypes == 'float'
categorical_features = ~numerical_features

In [45]:
preprocess = make_column_transformer(
    (numerical_features, StandardScaler()),
    (categorical_features, CategoricalEncoder()),
    )
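For comparison, later scikit-learn releases (0.22 and newer, as far as I know) ship a `make_column_selector` helper that expresses the same dtype-based selection without building the masks by hand. The sketch below uses a made-up DataFrame and `OneHotEncoder` in place of `CategoricalEncoder`. Note that released versions of `make_column_transformer` expect `(transformer, columns)` tuples, the reverse of the order used in this post.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [29.0, 2.0, 30.0, 25.0],
    "fare": [211.3, 151.6, 151.6, 26.0],
    "sex": ["female", "male", "female", "male"],
    "embarked": ["S", "S", "C", "Q"],
})

# Released versions expect (transformer, columns) tuples here.
preprocess_by_dtype = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number")),
    (OneHotEncoder(), make_column_selector(dtype_exclude="number")),
    sparse_threshold=0,  # always return a dense array
)
out = preprocess_by_dtype.fit_transform(df)
print(out.shape)
```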

In [94]:
lr = make_pipeline(preprocess, LogisticRegression()).fit(X_train, y_train)
print("logistic regression score: %f" % lr.score(X_test, y_test))


logistic regression score: 0.804598


In [101]:
preprocess = ColumnTransformer([
    ('scaler', StandardScaler(), ['age', 'fare']),
    ('onehot', CategoricalEncoder(), ['pclass', 'sex', 'embarked'],),
], remainder='drop')

In [102]:
lr = make_pipeline(preprocess, LogisticRegression())
lr.fit(X_train, y_train)
print("logistic regression score: %f" % lr.score(X_test, y_test))

logistic regression score: 0.804598


In [103]:
preprocess = ColumnTransformer([
    ('scaler', StandardScaler(), ['age', 'fare']),
    ('onehot', CategoricalEncoder(), ['pclass', 'sex', 'embarked'],),
], remainder='passthrough')

In [104]:
lr = make_pipeline(preprocess, LogisticRegression())
lr.fit(X_train, y_train)
print("logistic regression score: %f" % lr.score(X_test, y_test))

logistic regression score: 0.808429


In [100]:
preprocess.fit_transform(X_train)

<782x12 sparse matrix of type '<class 'numpy.float64'>'
	with 4368 stored elements in Compressed Sparse Row format>
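The effect of `remainder` is easiest to see on the shape of the transformed output. Below is a small sketch with a made-up DataFrame in which one column ("sibsp") is deliberately left out of the transformer list. The `build` helper is hypothetical and only there to avoid repetition.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration; "sibsp" is not listed in the transformer.
df = pd.DataFrame({
    "age": [29.0, 2.0, 30.0, 25.0],
    "fare": [211.3, 151.6, 151.6, 26.0],
    "sibsp": [0, 1, 1, 0],
})

def build(remainder):
    # Hypothetical helper: same transformer, different remainder setting.
    return ColumnTransformer(
        [("scaler", StandardScaler(), ["age", "fare"])],
        remainder=remainder,
    )

dropped = build("drop").fit_transform(df)
passed = build("passthrough").fit_transform(df)

print(dropped.shape)  # (4, 2): only the scaled columns
print(passed.shape)   # (4, 3): scaled columns plus the untouched "sibsp"
print(passed[:, -1])  # the passed-through column is appended last
```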

In [113]:
lr.get_params()

{'columntransformer': ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
          transformers=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True), ['age', 'fare']), ('onehot', CategoricalEncoder(categories='auto', dtype=<class 'numpy.float64'>,
           encoding='onehot', handle_unknown='error'), ['pclass', 'sex', 'embarked'])]),
 'columntransformer__n_jobs': 1,
 'columntransformer__onehot': CategoricalEncoder(categories='auto', dtype=<class 'numpy.float64'>,
           encoding='onehot', handle_unknown='error'),
 'columntransformer__onehot__categories': 'auto',
 'columntransformer__onehot__dtype': numpy.float64,
 'columntransformer__onehot__encoding': 'onehot',
 'columntransformer__onehot__handle_unknown': 'error',
 'columntransformer__remainder': 'passthrough',
 'columntransformer__scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'columntransformer__scaler__copy': True,
 'columntransformer__scaler__with_mean': True,


In [105]:
from sklearn.ensemble import RandomForestClassifier

In [111]:
preprocess = ColumnTransformer([
    ('scaler', StandardScaler(), ['age', 'fare']),
    ('ordinal', CategoricalEncoder(encoding='ordinal'), ['pclass', 'sex', 'embarked']),
], remainder='drop')

In [112]:
rf = make_pipeline(preprocess, RandomForestClassifier(n_estimators=500, random_state=0))
rf.fit(X_train, y_train)
print("random forest score: %f" % rf.score(X_test, y_test))

random forest score: 0.793103


See the [development docs](http://scikit-learn.org/dev/modules/preprocessing.html#encoding-categorical-features) for more information.

Having this conversion available as a scikit-learn transformer also makes it easier to put in a `Pipeline`. On its own this is not fully straightforward, because the output of the categorical encoder has to be combined with the other numeric columns. This was already possible with the [FeatureUnion](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html) or the [DataFrameMapper](https://github.com/scikit-learn-contrib/sklearn-pandas#transformation-mapping) from the sklearn-pandas project, and the `ColumnTransformer` shown above now provides a built-in way to make this much easier (see the PR: https://github.com/scikit-learn/scikit-learn/pull/9012).


**! This is brand new functionality in scikit-learn, so feedback is very welcome! (the [PR](https://github.com/scikit-learn/scikit-learn/pull/9151))**

### Want more categorical encoders?

The `CategoricalEncoder` only provides two ways to encode (one-hot or dummy encoding, and ordinal encoding), but there are many more possible ways to convert your categorical variables into numeric features suitable for feeding into models. [Category Encoders](http://contrib.scikit-learn.org/categorical-encoding/) is a scikit-learn-contrib package that provides a whole suite of scikit-learn compatible transformers for different types of categorical encodings.
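To show the difference between those two encodings, here is a minimal sketch on a single made-up column. It assumes a released scikit-learn, where the two options are available as the separate `OneHotEncoder` and `OrdinalEncoder` classes rather than through the `CategoricalEncoder` of the development version used in this post.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data, made up for illustration only.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

onehot = OneHotEncoder().fit_transform(df).toarray()
ordinal = OrdinalEncoder().fit_transform(df)

print(onehot)   # one column per category, in sorted order (C, Q, S)
print(ordinal)  # a single column of integer codes stored as floats
```

One-hot encoding avoids imposing an order on the categories, which matters for linear models. The more compact ordinal codes are often good enough for tree-based models such as the random forest above.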