# Scikit-Learn Pipelines: A Complete Guide

## **Why Use a Pipeline?**
A `Pipeline` in Scikit-Learn is useful for chaining multiple preprocessing steps and an estimator into one object. This ensures that:
- Data transformations are **applied consistently** to both training and test sets.
- **Data leakage** is prevented by ensuring transformations (e.g., imputation) are learned only from `X_train` and then applied to `X_test`.
- Code is **cleaner and more maintainable**.

## **How a Pipeline Works Internally**
A pipeline consists of **transformers** (like `SimpleImputer`, `StandardScaler`) and an **estimator** (like `LogisticRegression`, `RandomForestClassifier`).
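
As a minimal sketch of this structure, the snippet below builds a three-step pipeline and inspects it. The step names (`'imputer'`, `'scaler'`, `'classifier'`) are arbitrary labels chosen for illustration; any unique strings would work.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Every step except the last must be a transformer; the last is the estimator.
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

step_names = [name for name, _ in pipeline.steps]
transformers = pipeline[:-1]   # slicing returns a sub-pipeline of the transformers
final_estimator = pipeline[-1]  # integer indexing returns the step object itself

print(step_names)
print(type(final_estimator).__name__)
```

Slicing with `pipeline[:-1]` is handy when you want to look at the transformed features without running the estimator.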

### **What Happens When You Call `pipeline.fit(X_train, y_train)`?**
1. **Transformers (like `SimpleImputer`)**:
   - `fit(X_train)`: Learns parameters (e.g., mean for missing values).
   - `transform(X_train)`: Applies transformations (e.g., fills missing values).
2. **Estimator (like `LogisticRegression`)**:
   - `fit(X_train_transformed, y_train)`: Trains the model using the transformed `X_train`.

### **What Happens When You Call `pipeline.predict(X_test)`?**
1. **Transformers (like `SimpleImputer`)**:
   - `transform(X_test)`: Uses **previously learned** parameters (e.g., mean from `X_train`) to transform `X_test`.
   - 🚨 **No `fit()` is called here**, ensuring `X_test` is not used to compute new transformation parameters.
2. **Estimator (like `LogisticRegression`)**:
   - `predict(X_test_transformed)`: Uses the trained model to make predictions.
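
The two flows above can be checked against a hand-written equivalent. The sketch below uses small hypothetical arrays (not the dataset used later in this guide, and with salary expressed in thousands) and compares a pipeline with the same steps performed manually.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: columns are age and salary (in thousands)
X_train = np.array([[25.0, 50.0], [np.nan, 54.0], [35.0, 62.0], [40.0, np.nan]])
y_train = np.array([0, 1, 0, 1])
X_test = np.array([[np.nan, 67.0], [50.0, np.nan]])

# Pipeline version: fit() runs fit_transform on the imputer, then fits the model
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_predictions = pipeline.predict(X_test)

# Manual version of the same steps
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)   # learn means from X_train only
model = LogisticRegression().fit(X_train_imputed, y_train)
manual_predictions = model.predict(imputer.transform(X_test))  # transform only, no fit

# The means stored inside the fitted pipeline come from X_train alone
learned_means = pipeline.named_steps['imputer'].statistics_
print(learned_means)
print(pipeline_predictions, manual_predictions)
```

Both versions should produce the same predictions, since the pipeline is only automating the manual sequence.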

---

## **Pipeline vs. Manual Preprocessing**
### **Manual Preprocessing (Without a Pipeline)**
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Sample Data
data = pd.DataFrame({
    'age': [25, np.nan, 35, 40, np.nan, 50],
    'salary': [50000, 54000, 62000, 70000, 67000, np.nan],
    'purchased': [0, 1, 0, 1, 1, 0]
})
X = data[['age', 'salary']]
y = data['purchased']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Manual Imputation
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)  # Fit on X_train
X_test_imputed = imputer.transform(X_test)  # Transform X_test (no fit!)

# Train Model
model = LogisticRegression()
model.fit(X_train_imputed, y_train)

# Make Predictions
predictions = model.predict(X_test_imputed)
print(predictions)
```
✅ **We must manually call `.transform(X_test)`, ensuring we don’t fit again on `X_test`.**

---

### **Using a Pipeline (Automated Preprocessing & Prediction)**
```python
from sklearn.pipeline import Pipeline

# Define a pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', LogisticRegression())
])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)  # Imputer is fit and applied, then the model is fit on the imputed data

# Predict using pipeline
predictions = pipeline.predict(X_test)  # X_test is automatically imputed before prediction
print(predictions)
```
✅ **Pipeline automatically applies the same transformations to `X_test` without manual intervention.**

---

## **Key Takeaways**
✔️ `fit(X_train, y_train)`:
- Applies `fit_transform(X_train)` for transformers.
- Trains the estimator using transformed `X_train`.

✔️ `predict(X_test)`:
- Applies `transform(X_test)` using parameters learned from `X_train`.
- Uses the trained model to predict.

✔️ **Using a pipeline prevents data leakage and keeps the workflow clean and reproducible.**



**Why don't we fit on `X_test`?**

1. `SimpleImputer(strategy='mean')` learns the mean of each column during `.fit()`.

2. These means are computed only from `X_train` and are then reused to impute missing values in both `X_train` and `X_test`.

If you call `.fit()` on `X_test`, you are "cheating": statistics from the test data leak into preprocessing, so the evaluation no longer measures performance on truly unseen data.
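
To make the difference concrete, here is a small sketch with hypothetical single-column data. It compares the fill value learned from the training set with the one a (leaky) imputer would learn if it were fit on the test set instead.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data, chosen so the two means differ clearly
X_train = np.array([[25.0], [35.0], [np.nan], [40.0]])
X_test = np.array([[np.nan], [80.0]])

# Correct: learn the mean from X_train, then only transform X_test
imputer = SimpleImputer(strategy='mean').fit(X_train)
train_mean = imputer.statistics_[0]          # mean of 25, 35 and 40
X_test_correct = imputer.transform(X_test)

# Leaky: fitting on X_test makes the fill value depend on test data
leaky_imputer = SimpleImputer(strategy='mean').fit(X_test)
leaky_mean = leaky_imputer.statistics_[0]    # mean of the single observed test value
X_test_leaky = leaky_imputer.transform(X_test)

print(train_mean, leaky_mean)
print(X_test_correct.ravel(), X_test_leaky.ravel())
```

The missing test value is filled with the training mean in the correct version, and with a value derived from the test set itself in the leaky version.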

# ColumnTransformer and Its Connection to Pipelines

## **Why Use a ColumnTransformer?**
A `ColumnTransformer` is used when different preprocessing steps need to be applied to different columns in a dataset. Without it, a `Pipeline` applies the same transformations to all columns, which is problematic when dealing with mixed data types (e.g., numerical and categorical features).

### **Key Benefits:**
- **Handles different feature types separately** (e.g., impute and scale numerical data, one-hot encode categorical data).
- **Prevents data leakage** by ensuring transformations learned from `X_train` are correctly applied to `X_test`.
- **Keeps preprocessing structured** and avoids errors from applying incorrect transformations to certain columns.

## **How ColumnTransformer Works Inside a Pipeline**
When a `ColumnTransformer` is included in a `Pipeline`, the flow works as follows:
1. **During `pipeline.fit(X_train, y_train)`**:
   - The `ColumnTransformer` applies `fit_transform(X_train)` to learn transformations for each column type.
   - The transformed `X_train` is then passed to the estimator (e.g., `LogisticRegression.fit()`).

2. **During `pipeline.predict(X_test)`**:
   - The `ColumnTransformer` applies `transform(X_test)` (without refitting).
   - The transformed `X_test` is then used for predictions.
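
The same fit-then-transform flow can be observed on a `ColumnTransformer` by itself. The sketch below uses a hypothetical two-column frame; `sparse_threshold=0` is set only so the output is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames with one numeric and one categorical column
X_train = pd.DataFrame({'age': [25.0, np.nan, 35.0, 40.0],
                        'gender': ['M', 'F', 'F', 'M']})
X_test = pd.DataFrame({'age': [np.nan, 50.0],
                       'gender': ['F', 'M']})

preprocessor = ColumnTransformer(
    [
        ('num', SimpleImputer(strategy='mean'), ['age']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender']),
    ],
    sparse_threshold=0,  # always return a dense array (for readability here)
)

# fit_transform learns the age mean and the gender categories from X_train
X_train_t = preprocessor.fit_transform(X_train)
# transform reuses them on X_test without refitting
X_test_t = preprocessor.transform(X_test)

print(X_train_t.shape)
print(X_test_t)
```

The output columns follow the order of the transformer list: the imputed `age` column first, then one column per `gender` category.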

## **Comparison: With and Without ColumnTransformer**
### **🚨 Without ColumnTransformer (Incorrect Handling of Mixed Data)**
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Sample Data
data = pd.DataFrame({
    'age': [25, None, 35, 40, None, 50],
    'salary': [50000, 54000, 62000, 70000, 67000, None],
    'gender': ['M', 'F', 'F', 'M', 'M', 'F'],
    'purchased': [0, 1, 0, 1, 1, 0]
})

X = data[['age', 'salary', 'gender']]
y = data['purchased']

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# This fails: the mean imputer (and the scaler after it) cannot handle the string column 'gender'
pipeline.fit(X, y)
```
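🚨 **Issue**: Every step is applied to all columns, including the string column `'gender'`. `SimpleImputer(strategy='mean')` cannot compute a mean over non-numeric data, so `fit` raises an error at the imputation step, before `StandardScaler` ever runs. (`StandardScaler` would reject the strings as well.)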
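
To see where the failure comes from, the sketch below fits only the first step on a hypothetical mixed-type frame and captures the exception. The exact message text may vary between scikit-learn versions, so it is printed rather than assumed.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type frame: one numeric column, one string column
X_mixed = pd.DataFrame({'age': [25.0, 35.0], 'gender': ['M', 'F']})

error_message = None
try:
    # The first pipeline step already fails on the string column
    SimpleImputer(strategy='mean').fit(X_mixed)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```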

---

### **✅ Correct Approach: Using ColumnTransformer in a Pipeline**
```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Define preprocessing for numerical and categorical columns
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['age', 'salary']),  # Applies to numeric columns

    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender'])  # Applies to categorical columns
])

# Full pipeline with preprocessing and model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit on training data
pipeline.fit(X_train, y_train)

# Predict on test data (Only calls transform, no fit_transform!)
predictions = pipeline.predict(X_test)
print(predictions)
```
✅ **Fix**: Now, numerical and categorical columns are preprocessed separately before training, avoiding errors.

## **Key Takeaways**
- ✔️ Use `Pipeline` **alone** only when all columns undergo the same transformation.
- ✔️ Use `ColumnTransformer` **inside a Pipeline** when different transformations are needed for different column types.
- ✔️ `pipeline.fit(X_train, y_train)` → Calls `fit_transform(X_train)` on transformers and trains the model.
- ✔️ `pipeline.predict(X_test)` → Calls `transform(X_test)`, ensuring consistent preprocessing without refitting.
- ✔️ **Always split your data** before calling `.fit()` to prevent data leakage.

This ensures that preprocessing and model training are done efficiently and correctly. 🚀


