- **Why Use ColumnTransformer?**
    - Automation: Automatically applies preprocessing to the specified columns.
    - Modularity: Easily define different transformations for numerical and categorical features.
    - Integration: Works seamlessly within a Pipeline, combining preprocessing and model training.

In [1]:
import pandas as pd
# Sample data
df = pd.DataFrame({
    "Age": [25, 32, 47, 19, 38],
    "Salary": [50000, 60000, 120000, 20000, 75000],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Country": ["India", "USA", "UK", "India", "Germany"],
    "Purchased": [0, 1, 1, 0, 1]
})
df

   Age  Salary  Gender  Country  Purchased
0   25   50000    Male    India          0
1   32   60000  Female      USA          1
2   47  120000  Female       UK          1
3   19   20000    Male    India          0
4   38   75000  Female  Germany          1


In [2]:
X = df[['Age','Salary','Gender','Country']]
X 

   Age  Salary  Gender  Country
0   25   50000    Male    India
1   32   60000  Female      USA
2   47  120000  Female       UK
3   19   20000    Male    India
4   38   75000  Female  Germany


In [3]:
y = df['Purchased']
y 

0    0
1    1
2    1
3    0
4    1
Name: Purchased, dtype: int64

- **1. List of Column Names**
    - You can explicitly provide a list of column names:

In [4]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Salary']),  # Scale numerical columns
    ('cat', OneHotEncoder(), ['Gender']),          # One-hot encode the categorical column
])  # 'Country' is not listed, so it is dropped (remainder='drop' is the default)
preprocessor

In [5]:
from sklearn.model_selection import train_test_split 

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


In [6]:
from sklearn.pipeline import Pipeline # it is a class in sklearn 
from sklearn.ensemble import RandomForestClassifier 


pipeline_obj = Pipeline(steps=[('preprocessing',preprocessor),('model',RandomForestClassifier())])
pipeline_obj

In [7]:
pipeline_obj.fit(X_train,y_train)

In [8]:
X_test 

   Age  Salary  Gender  Country
1   32   60000  Female      USA
4   38   75000  Female  Germany


In [9]:
y_test

1    1
4    1
Name: Purchased, dtype: int64

In [10]:
y_pred =pipeline_obj.predict(X_test)
y_pred 

array([0, 1])

- **2. Index of Columns**
    - If you're working with a dataset without named columns (e.g., a NumPy array or a DataFrame without column names), you can use column indices:
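
As a minimal sketch of the NumPy case mentioned above, the following cell uses the same five rows stored in an object array with no column names, so the integer positions are the only way to refer to columns. The names `X_array` and `ct_by_index` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same five rows as the sample data, but without column names.
# Positions: 0 = Age, 1 = Salary, 2 = Gender, 3 = Country
X_array = np.array([
    [25, 50000, "Male", "India"],
    [32, 60000, "Female", "USA"],
    [47, 120000, "Female", "UK"],
    [19, 20000, "Male", "India"],
    [38, 75000, "Female", "Germany"],
], dtype=object)

ct_by_index = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),                         # scale positions 0 and 1
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2, 3]),   # encode positions 2 and 3
])

encoded = ct_by_index.fit_transform(X_array)
print(encoded.shape)  # 2 scaled + 2 Gender + 4 Country columns
```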

In [20]:
df 

   Age  Salary  Gender  Country  Purchased
0   25   50000    Male    India          0
1   32   60000  Female      USA          1
2   47  120000  Female       UK          1
3   19   20000    Male    India          0
4   38   75000  Female  Germany          1


In [21]:
from sklearn.pipeline import Pipeline  # Pipeline is a class in sklearn
from sklearn.ensemble import RandomForestClassifier

# Without handle_unknown='ignore', OneHotEncoder raises an error when the test set
# contains categories that were not present in the training set (here the Country
# values 'USA' and 'Germany' appear only in X_test). With handle_unknown='ignore',
# unknown categories are encoded as all zeros instead of causing an error.

# Integer indices refer to column positions in X: 0=Age, 1=Salary, 2=Gender, 3=Country
preprocessing = ColumnTransformer([
    ('num', StandardScaler(), [0, 1]),
    ('cat', OneHotEncoder(handle_unknown='ignore'), [2, 3]),
])

def pipeline(preprocessing):
    pipeline_obj = Pipeline(steps=[
        ('preprocessing', preprocessing),
        ('model', RandomForestClassifier())
    ])

    pipeline_obj.fit(X_train, y_train)
    y_pred = pipeline_obj.predict(X_test)
    display(y_pred)

pipeline(preprocessing)

array([0, 0])
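
The effect of `handle_unknown='ignore'` can be seen in isolation. The sketch below fits two encoders on the countries that appear in the training rows and then transforms the countries that appear only in the test rows. The frame names `train` and `test` are illustrative stand-ins, not the notebook's `X_train` and `X_test`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Country": ["India", "UK", "India"]})
test = pd.DataFrame({"Country": ["USA", "Germany"]})  # categories never seen during fit

# Default behaviour (handle_unknown='error'): transform raises on unseen categories
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("Default encoder:", err)

# handle_unknown='ignore': unseen categories become a row of zeros
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = tolerant.transform(test).toarray()
print(encoded_test)
```

An all-zero row carries no country information, so the model treats those rows as "none of the known countries".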

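- **3. Using a Callable Function**
    - The cells below first compute the column names with `select_dtypes(...).columns` and pass that result to ColumnTransformer. A function or lambda that takes X and returns the columns can be passed instead; it is then evaluated when `fit` is called.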

In [28]:
X 

   Age  Salary  Gender  Country
0   25   50000    Male    India
1   32   60000  Female      USA
2   47  120000  Female       UK
3   19   20000    Male    India
4   38   75000  Female  Germany


In [30]:
type(X)

pandas.core.frame.DataFrame

In [32]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Age      5 non-null      int64 
 1   Salary   5 non-null      int64 
 2   Gender   5 non-null      object
 3   Country  5 non-null      object
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes


In [31]:
X.select_dtypes(include='number')

   Age  Salary
0   25   50000
1   32   60000
2   47  120000
3   19   20000
4   38   75000


In [33]:
X.select_dtypes(include='number').columns

Index(['Age', 'Salary'], dtype='object')

In [34]:
X.select_dtypes(include='object')

   Gender  Country
0    Male    India
1  Female      USA
2  Female       UK
3    Male    India
4  Female  Germany


In [35]:
X.select_dtypes(include='object').columns

Index(['Gender', 'Country'], dtype='object')

In [36]:
type(X.select_dtypes(include='object').columns)

pandas.core.indexes.base.Index

In [40]:
def pipelines():
    # The column names are computed here, from X, before the transformer is fitted
    numeric_cols = X.select_dtypes(include='number').columns      # numeric columns
    categorical_cols = X.select_dtypes(include='object').columns  # categorical columns

    preprocessing = ColumnTransformer([
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])
    pipeline_obj = Pipeline(
        steps=[
            ('preprocessing', preprocessing),
            ('model', RandomForestClassifier())
        ]
    )
    pipeline_obj.fit(X_train, y_train)
    y_pred = pipeline_obj.predict(X_test)
    display(y_pred)

pipelines()

array([0, 1])

- **4. Using make_column_selector**

In [42]:
from sklearn.compose import make_column_selector 

make_column_selector(dtype_include='number')

<sklearn.compose._column_transformer.make_column_selector at 0x740d26867550>

In [43]:
make_column_selector(dtype_include='object') 

<sklearn.compose._column_transformer.make_column_selector at 0x740d26867190>
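
The repr above only shows that a selector object was created. A selector is itself a callable: calling it on a DataFrame returns the list of matching column names, which is what ColumnTransformer does internally during `fit`. The sketch below calls two selectors by hand. `X_demo` is an illustrative copy of the sample data, and `dtype_exclude='number'` is used for the second selector as an assumption-free way to pick the text columns across pandas versions.

```python
import pandas as pd
from sklearn.compose import make_column_selector

X_demo = pd.DataFrame({
    "Age": [25, 32, 47, 19, 38],
    "Salary": [50000, 60000, 120000, 20000, 75000],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Country": ["India", "USA", "UK", "India", "Germany"],
})

select_numeric = make_column_selector(dtype_include="number")
select_other = make_column_selector(dtype_exclude="number")

# Calling a selector on a DataFrame returns a plain list of column names
numeric_cols = select_numeric(X_demo)
other_cols = select_other(X_demo)
print(numeric_cols)
print(other_cols)
```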

In [47]:
def pipelines():
    preprocessing = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), make_column_selector(dtype_include='number')),
            # handle_unknown='ignore' keeps the pipeline usable on rows with unseen categories
            ('cat', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include='object'))
        ]
    )
    pipeline_obj = Pipeline(
        steps=[
            ('preprocessing', preprocessing),
            ('model', RandomForestClassifier())
        ]
    )

    pipeline_obj.fit(X_train, y_train)
    # Note: this predicts on the training rows, so it is a sanity check, not an evaluation
    y_pred = pipeline_obj.predict(X_train)
    display(y_pred)

pipelines()

array([1, 0, 0])

- **5. Using Logical Selection**

In [58]:
def pipelines():
    preprocessing = ColumnTransformer(
        transformers = [
            ('num',StandardScaler(),lambda X : [col for col in X.columns if col in ['Age','Salary']]),
            ('cat',OneHotEncoder(handle_unknown='ignore'),lambda X : [col for col in X.columns if col in ['Gender','Country']])
        ]
    )
    pipeline_obj = Pipeline(
        steps= [
            ('preprocessing',preprocessing),
            ('model',RandomForestClassifier()) 
        ]
    )
    
    pipeline_obj.fit(X_train,y_train)
    display(pipeline_obj.score(X_test, y_test))
    
pipelines()

0.5

In [60]:
# 'preprocessing' here is the index-based ColumnTransformer defined in section 2
pd.DataFrame(preprocessing.fit_transform(X))

          0         1    2    3    4    5    6    7
0 -0.735767 -0.456435  0.0  1.0  0.0  1.0  0.0  0.0
1 -0.020438 -0.152145  1.0  0.0  0.0  0.0  0.0  1.0
2  1.512410  1.673597  1.0  0.0  0.0  0.0  1.0  0.0
3 -1.348907 -1.369306  0.0  1.0  0.0  1.0  0.0  0.0
4  0.592701  0.304290  1.0  0.0  1.0  0.0  0.0  0.0


<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Pipeline and ColumnTransformer Notes</title>
    
</head>
<body>

<h1>Why You Don't Pass <code>fit_transform(data)</code> Directly into a Pipeline</h1>

<p>You don't pass the <code>fit_transform(data)</code> result of a <code>ColumnTransformer</code> into a <code>Pipeline</code> because the <code>Pipeline</code> is designed to handle the full end-to-end process of data preprocessing and modeling, starting from raw data.</p>

<h2>How <code>Pipeline</code> Works</h2>
<p>A <code>Pipeline</code> allows you to chain together multiple steps, such as:</p>
<ul>
    <li>Preprocessing (e.g., scaling, encoding).</li>
    <li>Model training (e.g., fitting a Random Forest).</li>
</ul>
<p>When you pass raw data into the <code>Pipeline</code>:</p>
<ul>
    <li>It automatically applies the transformations (like scaling and encoding) using the <code>fit_transform()</code> method during training.</li>
    <li>It also applies the same transformations using <code>transform()</code> when making predictions.</li>
</ul>

<h2>What Happens If You Use <code>fit_transform(data)</code> Outside?</h2>
<p>If you manually apply <code>fit_transform(data)</code> on your data <strong>before</strong> passing it into the <code>Pipeline</code>, you break the automated preprocessing functionality. For example:</p>

<div class="highlight">
    <h3>Loss of Reusability</h3>
    <p>During prediction (<code>predict()</code>), the <code>Pipeline</code> expects raw data and sends it through the same transformations. If you pass data that you have already transformed manually, the <code>ColumnTransformer</code> step tries to transform it a second time, which typically fails because the original columns no longer exist.</p>

    <h3>Inconsistent Transformations</h3>
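    <p>If you preprocess outside the <code>Pipeline</code>, you must remember to apply exactly the same fitted transformer (with <code>transform()</code>, not <code>fit_transform()</code>) to every new or unseen dataset, such as test data. Refitting on the test data, or forgetting a step, produces features that do not match what the model was trained on. This defeats the purpose of automation.</p>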

    <h3>Separation of Steps</h3>
    <p>By manually transforming data, you now need to manage preprocessing and model training/testing as separate steps, leading to a more error-prone and less maintainable workflow.</p>
</div>

<h2>Correct Approach: Pass the <code>ColumnTransformer</code> to the <code>Pipeline</code></h2>
<p>Instead of manually transforming the data, include the <code>ColumnTransformer</code> directly in the <code>Pipeline</code>:</p>

<pre>
<code>
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Sample dataset
data = pd.DataFrame({
    'Age': [25, 32, 47],
    'Salary': [50000, 60000, 70000],
    'Gender': ['Male', 'Female', 'Male'],
    'Country': ['India', 'USA', 'Germany']
})

X = data[['Age', 'Salary', 'Gender', 'Country']]
y = [0, 1, 0]  # Target variable

# Define ColumnTransformer
preprocessing = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'Salary']),
    ('cat', OneHotEncoder(), ['Gender', 'Country'])
])

# Create the Pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),            # fit_transform() during fit(), transform() during predict()
    ('model', RandomForestClassifier())
])

# Train the Pipeline
pipeline.fit(X, y)

# Predict using the Pipeline
y_pred = pipeline.predict(X)
print("Predictions:", y_pred)
</code>
</pre>

<h2>Key Workflow with <code>Pipeline</code></h2>
<ol>
    <li><strong>Raw data</strong> is passed to the <code>Pipeline</code>.</li>
    <li>The <code>ColumnTransformer</code> (inside <code>Pipeline</code>) handles <strong>all transformations</strong> during <code>fit()</code> and <code>predict()</code>.</li>
    <li>The model is trained on the transformed data.</li>
    <li>When making predictions, raw data is again transformed automatically before predictions are made.</li>
</ol>

<h2>Why This Is Better:</h2>
<ul>
    <li><strong>Automation:</strong> You don't need to worry about applying transformations manually during training or prediction.</li>
    <li><strong>Consistency:</strong> The same transformations are applied to both training and test data, ensuring that the model performs reliably on unseen data.</li>
    <li><strong>Code Clarity:</strong> The workflow is cleaner and less error-prone because everything is encapsulated in the <code>Pipeline</code>.</li>
</ul>

</body>
</html>
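
The workflow described in the notes above can be checked with a small sketch: calling `predict` on the fitted Pipeline with raw rows should give the same result as transforming those rows with the fitted preprocessing step and then calling the model directly. The rows in `new_rows` are made-up values for illustration only, and `random_state=0` is an assumption added to make the forest reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_demo = pd.DataFrame({
    "Age": [25, 32, 47, 19, 38],
    "Salary": [50000, 60000, 120000, 20000, 75000],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Country": ["India", "USA", "UK", "India", "Germany"],
})
y_demo = [0, 1, 1, 0, 1]

pipe = Pipeline(steps=[
    ("preprocessing", ColumnTransformer([
        ("num", StandardScaler(), ["Age", "Salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Country"]),
    ])),
    ("model", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_demo, y_demo)  # preprocessing is fitted and applied inside fit()

# Hypothetical unseen rows; 'France' is a category the encoder never saw
new_rows = pd.DataFrame({
    "Age": [29, 51],
    "Salary": [52000, 98000],
    "Gender": ["Male", "Female"],
    "Country": ["India", "France"],
})

# Route 1: raw rows go straight into the pipeline
via_pipeline = pipe.predict(new_rows)

# Route 2: the same two steps done by hand, using transform() only
manual_features = pipe.named_steps["preprocessing"].transform(new_rows)
via_manual = pipe.named_steps["model"].predict(manual_features)

print(np.array_equal(via_pipeline, via_manual))
```

The second route is what the Pipeline does for you; keeping it inside the Pipeline removes the chance of calling `fit_transform` on new data by mistake.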
