<div class="alert alert-block alert-success">
    <h1 align="center">Scikit-Learn Tips</h1>
    <h3 align="center">Tip 05: ColumnTransformers</h3>
</div>

Use ColumnTransformer to apply different preprocessing to different columns:

- select columns from a DataFrame by name
- pass through or drop the columns you don't specify

See example 👇

In [31]:
import pandas as pd
df = pd.read_csv('data.csv')

In [32]:
cols = ['fare', 'embarked', 'sex', 'age']
X = df[cols]

In [33]:
X

           fare embarked     sex    age
0      211.3375        S  female  29.00
1      151.5500        S    male   0.92
2      151.5500        S  female   2.00
3      151.5500        S    male  30.00
4      151.5500        S  female  25.00
...         ...      ...     ...    ...
1304    14.4542        C  female  14.50
1305    14.4542        C  female    NaN
1306     7.2250        C    male  26.50
1307     7.2250        C    male  27.00


In [34]:
# reassign instead of using inplace=True: X is a slice of df, and modifying it
# in place triggers pandas' SettingWithCopyWarning
X = X.dropna(subset=['embarked'])


In [35]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1307 entries, 0 to 1308
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   fare      1306 non-null   float64
 1   embarked  1307 non-null   object 
 2   sex       1307 non-null   object 
 3   age       1044 non-null   float64
dtypes: float64(2), object(2)
memory usage: 51.1+ KB


In [27]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

In [28]:
ohe = OneHotEncoder()
imp = SimpleImputer()

In [29]:
ct = make_column_transformer(
    (ohe, ['embarked', 'sex']),  # apply OneHotEncoder to embarked and sex
    (imp, ['age']),              # apply SimpleImputer (mean by default) to age
    remainder='passthrough')     # include the remaining column (fare) unchanged in the output
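
To see what `remainder` changes, here is a minimal sketch on a tiny made-up DataFrame (the `toy` values and the `build` helper are hypothetical, not taken from `data.csv`). It compares `remainder='passthrough'` with `remainder='drop'`, which is the default when `remainder` is not given.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'fare': [7.25, 71.28, 8.05, 53.10],
    'embarked': ['S', 'C', 'Q', 'S'],
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, 38.0, np.nan, 35.0],
})

def build(remainder):
    """Same transformer layout as above, with a configurable remainder."""
    return make_column_transformer(
        (OneHotEncoder(), ['embarked', 'sex']),
        (SimpleImputer(), ['age']),
        remainder=remainder)

# passthrough keeps fare as an extra output column
kept_shape = build('passthrough').fit_transform(toy).shape

# drop discards every column not listed in a transformer
dropped_shape = build('drop').fit_transform(toy).shape

print(kept_shape, dropped_shape)
```

With three `embarked` categories and two `sex` categories in the toy data, the passthrough version should have one more column than the drop version.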

In [30]:
# output column order: embarked (3 columns), sex (2 columns), age (1 column), fare (1 column)
ct.fit_transform(X)

array([[  0.    ,   0.    ,   1.    , ...,   0.    ,  29.    , 211.3375],
       [  0.    ,   0.    ,   1.    , ...,   1.    ,   0.92  , 151.55  ],
       [  0.    ,   0.    ,   1.    , ...,   0.    ,   2.    , 151.55  ],
       ...,
       [  1.    ,   0.    ,   0.    , ...,   1.    ,  26.5   ,   7.225 ],
       [  1.    ,   0.    ,   0.    , ...,   1.    ,  27.    ,   7.225 ],
       [  0.    ,   0.    ,   1.    , ...,   1.    ,  29.    ,   7.875 ]])
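
One thing worth checking: `X.info()` reports 1306 non-null `fare` values out of 1307 rows, and passed-through columns are not transformed, so a missing value there would reach the output as-is. The sketch below illustrates this on hypothetical toy data (not the real dataset) and shows one possible remedy, listing `fare` next to `age` for the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing fare and one missing age
toy = pd.DataFrame({
    'fare': [7.25, np.nan, 8.05, 53.10],
    'embarked': ['S', 'C', 'Q', 'S'],
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, 38.0, np.nan, 35.0],
})

# fare is passed through untouched, so its NaN remains in the output
ct_passthrough = make_column_transformer(
    (OneHotEncoder(), ['embarked', 'sex']),
    (SimpleImputer(), ['age']),
    remainder='passthrough')
out_passthrough = ct_passthrough.fit_transform(toy)
n_missing_passthrough = int(np.isnan(out_passthrough).sum())

# listing fare alongside age imputes both (column mean by default)
ct_imputed = make_column_transformer(
    (OneHotEncoder(), ['embarked', 'sex']),
    (SimpleImputer(), ['age', 'fare']))
out_imputed = ct_imputed.fit_transform(toy)
n_missing_imputed = int(np.isnan(out_imputed).sum())

print(n_missing_passthrough, n_missing_imputed)
```

Whether mean imputation is appropriate for `fare` depends on the modeling task; the point here is only that `passthrough` leaves missing values in place.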

In [37]:
ct.fit_transform(X).shape

(1307, 7)