# Day 26: Column Transformer in Machine Learning

The Column Transformer is a tool in the scikit-learn library that lets you apply different preprocessing and feature extraction steps to different columns, or subsets of columns, in a dataset. It is especially useful for heterogeneous data, where columns have different data types or need different preprocessing.

## Basic Problem

Consider a dataset with both numerical and categorical columns. You may want to apply different preprocessing steps to the numerical and categorical columns. For example, you may want to scale the numerical columns and one-hot encode the categorical columns. Using the Column Transformer, you can apply these steps to the appropriate columns without having to split the dataset manually.
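As a rough illustration of the idea, here is a minimal sketch that scales one numerical column and one-hot encodes one categorical column in a single step. The `toy` DataFrame and its column names are made up for this illustration and are not part of the dataset used later in this post.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for this sketch only.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "color": ["red", "blue", "red", "green"],
})

ct = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["height_cm"]),  # numerical column
        ("onehot", OneHotEncoder(), ["color"]),      # categorical column
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

toy_out = ct.fit_transform(toy)
print(toy_out.shape)  # one scaled column plus one column per color
```

Each transformer only sees the columns listed for it, and the results are stacked side by side in the order the transformers are declared.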

## Example

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [2]:
df = pd.read_csv("covid_toy.csv")

In [3]:
df.head(10)

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No
5,84,Female,,Mild,Bangalore,Yes
6,14,Male,101.0,Strong,Bangalore,No
7,20,Female,,Strong,Mumbai,Yes
8,19,Female,100.0,Strong,Bangalore,No
9,64,Female,101.0,Mild,Delhi,No


In [4]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In this dataset, "age" and "fever" are numerical, "gender", "cough", and "city" are categorical, and "has_covid" is the target. "fever" has 10 missing values, so we fill them with a Simple Imputer. "cough" has a natural order (Mild < Strong), so we use Ordinal Encoding. "gender" and "city" have no order, so we use One Hot Encoding. "age" has no missing values and does not require any encoding.

## Without Column Transformer

### Applying Simple Imputer to the fever column

In [5]:
si = SimpleImputer()
df_fever = si.fit_transform(df[["fever"]])
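With no arguments, `SimpleImputer` replaces missing values with the column mean (`strategy="mean"` is the default). The sketch below shows this on a tiny made-up array; `fever_demo` is a hypothetical stand-in, not the real "fever" column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical values, invented for this sketch only.
fever_demo = np.array([[100.0], [np.nan], [102.0]])

imputer = SimpleImputer()  # default strategy is "mean"
filled = imputer.fit_transform(fever_demo)

print(imputer.statistics_)  # the mean learned from the non-missing values
print(filled.ravel())       # the NaN is replaced by that mean
```

This is presumably why a non-round fever value such as 100.84444444444445 shows up in the transformed rows further down: it is the mean of the observed fever values.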

### Ordinal encoding for cough

In [6]:
oe = OrdinalEncoder(categories=[['Mild', 'Strong']])
df_cough = oe.fit_transform(df[['cough']])
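The order of the list passed to `categories` decides which integer each label gets. A small sketch with made-up values (`cough_demo` is hypothetical) makes the mapping visible:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical values, invented for this sketch only.
cough_demo = np.array([["Strong"], ["Mild"], ["Mild"]], dtype=object)

# 'Mild' comes first in the list, so it maps to 0.0; 'Strong' maps to 1.0.
enc = OrdinalEncoder(categories=[["Mild", "Strong"]])
codes = enc.fit_transform(cough_demo)

print(codes.ravel())
```

If `categories` is left out, the encoder sorts the labels itself, which happens to give the same order here but would not for labels such as Low, Medium, High.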

### One-hot encoding for gender and city

In [7]:
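# sparse_output replaced the sparse argument in scikit-learn 1.2; use sparse=False on older versions.
ohe = OneHotEncoder(drop='first', sparse_output=False)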
df_gender_city = ohe.fit_transform(df[['gender', 'city']])



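### Extracting age

In [8]:
# Dropping the four transformed columns leaves "age" and the "has_covid" target.
df_age = df.drop(columns=['gender','fever','cough','city']).values

### Combining the transformed columns

In [9]:
df_transformed = np.concatenate((df_age, df_fever, df_gender_city, df_cough), axis=1)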

In [10]:
df_transformed[10]

array([75, 'No', 100.84444444444445, 0.0, 1.0, 0.0, 0.0, 0.0],
      dtype=object)

## With Column Transformer

In [11]:
from sklearn.compose import ColumnTransformer

In [12]:
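# Columns not listed in any transformer ("age", "has_covid") are kept unchanged by remainder='passthrough'.
# sparse_output replaced the sparse argument in scikit-learn 1.2; use sparse=False on older versions.
transformer = ColumnTransformer(transformers=[
    ('tnf1', SimpleImputer(), ['fever']),
    ('tnf2', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
    ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['gender', 'city'])
], remainder='passthrough')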

In [13]:
df_transformed = transformer.fit_transform(df)



In [14]:
df_transformed[10]

array([100.84444444444445, 0.0, 0.0, 1.0, 0.0, 0.0, 75, 'No'],
      dtype=object)