# One-Hot Encoding in Scikit-Learn with DataFrames of Mixed Column Types

## Some Toy Data

- Imagine we have a dataset that consists of both numerical and categorical features.
- We want to convert only the categorical features into a one-hot encoding, while leaving the numerical features untouched.

In [1]:
import pandas as pd

In [2]:
feature_1 = [
    1.1, 2.1, 3.1, 4.2,
    5.1, 6.1, 7.1, 8.1,
    1.2, 2.1, 3.1, 4.1
]

feature_2 = [
    'b', 'b', 'b', 'b',
    'a', 'a', 'a', 'a',
    'c', 'c', 'c', 'c'
]

df = pd.DataFrame({'numerical': feature_1, 'categorical': feature_2})
df

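| | numerical | categorical |
|---|---|---|
| 0 | 1.1 | b |
| 1 | 2.1 | b |
| 2 | 3.1 | b |
| 3 | 4.2 | b |
| 4 | 5.1 | a |
| 5 | 6.1 | a |
| 6 | 7.1 | a |
| 7 | 8.1 | a |
| 8 | 1.2 | c |
| 9 | 2.1 | c |
| 10 | 3.1 | c |
| 11 | 4.1 | c |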


## One-Hot Encoding

- We can use, for example, scikit-learn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to expand the categorical column into one-hot encoded columns.
- By default, the `OneHotEncoder` treats every column it receives as categorical and one-hot encodes all of them (including the numerical ones), which is not what we want for mixed-type datasets.
- To encode only specific columns, we can wrap the encoder in a [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and tell it which columns to transform.
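
To make the second point concrete, here is a minimal sketch that applies a default `OneHotEncoder` to the whole toy DataFrame. The variable names (`ohe_all`, `X_all`) are illustrative and not part of the original notebook. Every distinct value of the numerical column ends up as its own indicator column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the section above.
df = pd.DataFrame({
    'numerical': [1.1, 2.1, 3.1, 4.2, 5.1, 6.1, 7.1, 8.1, 1.2, 2.1, 3.1, 4.1],
    'categorical': list('bbbbaaaacccc'),
})

# Default encoder applied to *all* columns (illustrative, not recommended here).
ohe_all = OneHotEncoder()
X_all = ohe_all.fit_transform(df).toarray()  # default output is a sparse matrix

n_unique_numerical = df['numerical'].nunique()
n_unique_categorical = df['categorical'].nunique()

# One column per distinct value in each input column.
print(X_all.shape)
print(n_unique_numerical, n_unique_categorical)
```

The number of output columns equals the number of distinct numerical values plus the number of categories, so the original numerical values are no longer available as a single feature.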

In [3]:
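import sklearn  # imported so that %watermark can report its version
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# `sparse=False` returns a dense array; this matches scikit-learn 1.0.2
# (see the version info below). The argument was renamed to
# `sparse_output` in scikit-learn 1.2.
# drop='first' drops the first category ('a') of each encoded feature.
ohe = OneHotEncoder(sparse=False, drop='first', dtype='float')

categorical_features = ['categorical']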

col_transformer = ColumnTransformer(
    transformers=[
        ('cat', ohe, categorical_features),
    ],
    # include the numerical column(s) via passthrough:
    remainder='passthrough',
)

col_transformer.fit(df)
X_t = col_transformer.transform(df)
X_t

array([[1. , 0. , 1.1],
       [1. , 0. , 2.1],
       [1. , 0. , 3.1],
       [1. , 0. , 4.2],
       [0. , 0. , 5.1],
       [0. , 0. , 6.1],
       [0. , 0. , 7.1],
       [0. , 0. , 8.1],
       [0. , 1. , 1.2],
       [0. , 1. , 2.1],
       [0. , 1. , 3.1],
       [0. , 1. , 4.1]])

In [4]:
%load_ext watermark
%watermark --iversions

pandas : 1.4.0
sklearn: 1.0.2

