# 📘 Notes: Custom Transformer + ColumnTransformer + Pipeline (scikit-learn)

### ✨ Concept:
- Real datasets often mix **categorical features** (like “Hero”, “Power”) and **numerical features** (like “Strength”, “Age”).  
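- Most ML algorithms (Logistic Regression, for example) only accept **numeric features**, so categoricals must be encoded first. LightGBM can handle categoricals natively, but only once they are stored as integer codes or a pandas `category` dtype, not as raw strings.  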
- Target (`Alive`) is what we want to predict (binary classification: Alive=1, Dead=0).  

---

### ✨ Concept:
1. **Why inherit `BaseEstimator, TransformerMixin`?**
   - `BaseEstimator`: gives `get_params` and `set_params`. Enables scikit-learn tools (like `GridSearchCV`) to work with this transformer.  
   - `TransformerMixin`: provides `fit_transform()` automatically, built from your own `fit()` and `transform()`. A `Pipeline` requires its intermediate steps to implement `fit` and `transform`; the mixin completes that interface.

2. **Logic inside transformer**:
   - `.fit()`: learns a mapping from category → integer.
     Example: `"Hero": {"Hulk":0, "IronMan":1, "Thor":2}`
   - `.transform()`: applies the learned mapping to the data. If a category was not seen during training, it is mapped to `-1`.  
     👉 This ensures consistency between train and test data.

3. **Why store `cat_maps_`?**
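   - Keeps the mapping learned from the training data so the same mapping can be applied to unseen (test/production) data.
   - Without this, each dataset would be encoded independently: the same category could get a different integer in train and test, and the model's predictions would be silently wrong.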
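
A minimal sketch of what the two base classes provide. `TinyCatEncoder` is a hypothetical single-column encoder invented for this illustration (it is not the `CatToIntTransformer` from these notes); it only defines `fit` and `transform`, yet `get_params()` and `fit_transform()` are available.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TinyCatEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical single-column encoder, used only for this illustration."""

    def __init__(self, column="Hero"):
        # __init__ only stores hyperparameters; nothing is learned here.
        self.column = column

    def fit(self, X, y=None):
        # Learned state gets a trailing underscore and is created in fit().
        categories = sorted(X[self.column].unique())
        self.cat_map_ = {cat: code for code, cat in enumerate(categories)}
        return self

    def transform(self, X):
        X = X.copy()
        # Categories missing from the learned mapping become -1.
        X[self.column] = [self.cat_map_.get(v, -1) for v in X[self.column]]
        return X


train = pd.DataFrame({"Hero": ["IronMan", "Thor", "Hulk"]})
test = pd.DataFrame({"Hero": ["Thor", "Loki"]})  # "Loki" never appears in train

enc = TinyCatEncoder()
params = enc.get_params()                 # inherited from BaseEstimator
encoded_train = enc.fit_transform(train)  # inherited from TransformerMixin
encoded_test = enc.transform(test)        # reuses the mapping learned on train

print(params)
print(encoded_train["Hero"].tolist())
print(encoded_test["Hero"].tolist())
```

Because the mapping lives on the fitted object (`cat_map_`), `"Thor"` gets the same code in both frames, and the unseen `"Loki"` falls back to `-1`.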


### ✨ Concept:
- `ColumnTransformer` lets you apply **different preprocessing to different feature subsets**.  
- `"cat"`: Apply `CatToIntTransformer` to categorical columns only (`Hero`, `Power`).  
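- `remainder="passthrough"`: Keep every column not listed in a transformer (here the numerical features `Strength`, `Age`) unchanged. These columns are appended after the transformed ones in the output.  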

👉 This avoids manually splitting datasets into categoricals/numericals and recombining them.


### ✨ Concept:
- A scikit-learn `Pipeline` stitches multiple steps together:
  - Step 1: Preprocessing (`preprocessor`) → transforms raw data into all-numeric data.
  - Step 2: Model (`LogisticRegression`) → trains on that transformed data.
- Once built, you can `.fit()` and `.predict()` on raw data → pipeline automatically handles both preprocessing and prediction.  

👉 Pipelines guarantee that **training and prediction use the exact same transformations.** This removes a whole class of mistakes where test data is preprocessed differently from training data.
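
A hedged sketch of the two-step pipeline on a reduced version of the notes' data (two feature columns only). `OrdinalEncoder` again stands in for the custom transformer, and `max_iter=1000` is an arbitrary choice for this tiny, unscaled example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

raw = pd.DataFrame({
    "Hero": ["IronMan", "Thor", "Hulk", "Thor", "Hulk", "IronMan"],
    "Strength": [90, 95, 100, 92, 99, 88],
    "Alive": [1, 1, 0, 1, 0, 1],
})
X, y = raw[["Hero", "Strength"]], raw["Alive"]

pipe = Pipeline(steps=[
    # Step 1: turn raw data into all-numeric data
    ("preprocessor", ColumnTransformer(
        [("cat", OrdinalEncoder(), ["Hero"])],
        remainder="passthrough",
    )),
    # Step 2: model trained on the transformed data
    ("classifier", LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)

# Raw strings go straight into predict(); the pipeline encodes them itself.
new_rows = pd.DataFrame({"Hero": ["Thor", "Hulk"], "Strength": [93, 98]})
preds = pipe.predict(new_rows)
print(preds)
```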


### ✨ Concept:
- Split into train/test data.  
- `.fit()`:
  - First calls `fit_transform()` on the `preprocessor`: your `CatToIntTransformer` learns its mappings from `X_train` only.  
  - That same call then encodes the categoricals and passes the numericals through.  
  - Finally, trains `LogisticRegression` on the fully numeric dataset.  
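
The sequence above can be spelled out by hand and compared with a single `pipeline.fit()` call. This is a sketch under assumptions: `OrdinalEncoder` replaces the custom transformer, and the four rows are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({
    "Hero": ["IronMan", "Hulk", "Hulk", "Thor"],
    "Strength": [88, 100, 99, 92],
})
y_train = pd.Series([1, 0, 0, 1])

preprocessor = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["Hero"])], remainder="passthrough"
)
model = LogisticRegression(max_iter=1000)

# By hand, on fresh unfitted copies of both steps
manual_pre = clone(preprocessor)
manual_model = clone(model)
X_encoded = manual_pre.fit_transform(X_train)  # learn mappings, then encode
manual_model.fit(X_encoded, y_train)           # train on the numeric matrix

# The same work in one call
pipe = Pipeline([
    ("preprocessor", clone(preprocessor)),
    ("classifier", clone(model)),
])
pipe.fit(X_train, y_train)

same_coefficients = np.allclose(
    manual_model.coef_, pipe.named_steps["classifier"].coef_
)
print(same_coefficients)
```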


### ✨ Concept:
- Shows what the raw `X_train` looks like after preprocessing.
- Example:  
  Hero strings → `[0,1,2]`, Power strings → `[0,1,2]`, numericals unchanged.  

---

### ✨ Concept:
- When you call `pipeline.predict()`:
  - New raw test data passes through the *same transformer mappings* (`cat_maps_`).  
  - Model predicts based on transformed features.  
- Ensures that **training → prediction** is consistent.
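
The consistency point can be seen without any model at all. The sketch below uses plain pandas and made-up hero names: the mapping is learned once from training data and reused at prediction time, then contrasted with the mistake of re-encoding the new data independently.

```python
import pandas as pd

train_hero = pd.Series(["IronMan", "Thor", "Hulk", "Thor"])
test_hero = pd.Series(["Thor", "Loki"])  # "Loki" never appears in training

# "fit": learn the mapping once, from training data only
categories = pd.Categorical(train_hero).categories
cat_map = {cat: code for code, cat in enumerate(categories)}

# "predict": reuse that mapping; unknown categories fall back to -1
test_codes = test_hero.map(cat_map).fillna(-1).astype(int)

# Anti-pattern: re-learning the encoding on the test data
refit_codes = pd.Categorical(test_hero).codes

print(cat_map)
print(test_codes.tolist())   # codes the model was trained to understand
print(refit_codes.tolist())  # codes that mean something different
```

With the reused mapping, `"Thor"` keeps the code it had during training. Re-encoding the test data assigns `"Thor"` a different integer, which the model would interpret as a different hero.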

---

# ✅ Key Takeaways
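1. **BaseEstimator**: provides `get_params`/`set_params` → the custom transformer works with cloning and tools like `GridSearchCV`, just like built-in estimators.  
2. **TransformerMixin**: gives `fit_transform()` automatically. A Pipeline only requires the `fit`/`transform` methods, so the mixin is a convenience, not a hard requirement.  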
3. **Custom Transformer (`CatToIntTransformer`)**:
   - Learns mappings of categorical → integer in training (`fit`).  
   - Reuses those mappings for test/prediction (`transform`).  
   - Ensures unseen categories don’t break the pipeline.  
4. **ColumnTransformer**: applies different preprocessing to different feature subsets (categorical encoding + passthrough numericals).  
5. **Pipeline**: chains preprocessing + model → ensures end-to-end consistency and reproducibility.  
6. **Practical use**: This is exactly how `basic_model.py` in your Databricks/MLflow demo ensures LightGBM can train with mixed categorical + numerical features.  

---

👉 **Analogy:**  
- `BaseEstimator` = Scikit-learn’s “rulebook” → ensures consistency across all models/transformers.  
- `CatToIntTransformer` = Custom **translator** → turns text categories into numbers.  
- `ColumnTransformer` = **task assigner** → sends categorical columns to translator, leaves numeric columns alone.  
- `Pipeline` = **orchestra conductor** → makes sure preprocessing + model play together, in sync.  

In [2]:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [5]:
data = pd.DataFrame({
    "Hero": ["IronMan", "Thor", "Hulk", "Thor", "Hulk", "IronMan"],
    "Power": ["Tech", "Thunder", "Smash", "Thunder", "Smash", "Tech"],
    "Strength": [90, 95, 100, 92, 99, 88],
    "Age": [40, 1500, 50, 1490, 55, 42],
    "Alive": [1, 1, 0, 1, 0, 1]   # Target column
})

cat_features = ["Hero", "Power"]
num_features = ["Strength", "Age"]
target = "Alive"

In [4]:
data

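| | Hero | Power | Strength | Age | Alive |
|---|---|---|---|---|---|
| 0 | IronMan | Tech | 90 | 40 | 1 |
| 1 | Thor | Thunder | 95 | 1500 | 1 |
| 2 | Hulk | Smash | 100 | 50 | 0 |
| 3 | Thor | Thunder | 92 | 1490 | 1 |
| 4 | Hulk | Smash | 99 | 55 | 0 |
| 5 | IronMan | Tech | 88 | 42 | 1 |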


In [9]:
pd.Categorical(data[cat_features])

['Hero', 'Power']
Categories (2, object): ['Hero', 'Power']

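## pd.Categorical
- Before `pd.Categorical` (as strings): the column stores the full string for every single row: ["Medium", "Small", "Large", "Medium", "Medium", ...]. If you have a million rows, you're storing the word "Medium" hundreds of thousands of times.
- After `pd.Categorical`: it works like a smart reference key. It finds all the unique categories ("Small", "Medium", "Large"), keeps that list once, and stores only a small integer code per row.
- Note: `pd.Categorical` expects a single column of values. Passing a whole DataFrame, as in the cell above, iterates over its column labels, which is why the output lists `'Hero'` and `'Power'` rather than hero names. Pass one column instead, e.g. `pd.Categorical(data["Hero"])`.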
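
A short sketch of the "reference key" idea, using the size labels from the bullets above. The `lookup` dict is built the same way the transformer in these notes builds its per-column mapping.

```python
import pandas as pd

sizes = ["Medium", "Small", "Large", "Medium", "Medium"]
cat = pd.Categorical(sizes)

categories = list(cat.categories)  # unique labels, stored once (sorted)
codes = cat.codes.tolist()         # one small integer per row

# category -> integer lookup, derived from the position in `categories`
lookup = {label: code for code, label in enumerate(cat.categories)}

print(categories)
print(codes)
print(lookup)
```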

In [10]:
class CatToIntTransformer(BaseEstimator, TransformerMixin):
    """
    Encode categorical columns into integer codes.
    Unknown categories at transform time -> -1
    """

    def __init__(self, cat_features):
        # Only store constructor arguments here; learned state belongs in fit()
        self.cat_features = cat_features

    def fit(self, X, y=None):
        # Learn mapping {category -> int} for each categorical column.
        # The dict is rebuilt on every fit, so refitting never keeps stale maps.
        self.cat_maps_ = {}
        for col in self.cat_features:
            c = pd.Categorical(X[col])
            self.cat_maps_[col] = {cat: code for code, cat in enumerate(c.categories)}
        return self

    def transform(self, X):
        # Apply learned mappings, map new/unseen categories to -1
        X = X.copy()
        for col in self.cat_features:
            mapping = self.cat_maps_[col]
            # The integer codes are stored with pandas "category" dtype
            X[col] = X[col].map(lambda val, mapping=mapping: mapping.get(val, -1)).astype("category")
        return X

In [15]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", CatToIntTransformer(cat_features=cat_features), cat_features)
    ],
    remainder="passthrough"  # keep numerical cols unchanged
)

In [16]:
preprocessor

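**ColumnTransformer**

| Parameter | Value |
|---|---|
| transformers | [('cat', ...)] |
| remainder | 'passthrough' |
| sparse_threshold | 0.3 |
| n_jobs | None |
| transformer_weights | None |
| verbose | False |
| verbose_feature_names_out | True |
| force_int_remainder_cols | 'deprecated' |

**cat: CatToIntTransformer**

| Parameter | Value |
|---|---|
| cat_features | ['Hero', 'Power'] |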


In [22]:
# X_train comes from the train/test split cell (In [19]), which was executed before this one
preprocessor.fit_transform(X_train)

array([[1, 1, 88, 42],
       [0, 0, 100, 50],
       [0, 0, 99, 55],
       [2, 2, 92, 1490]], dtype=object)

In [18]:
pipeline = Pipeline(
    steps=[
        ("preprocessor" , preprocessor),
        ("classifier" , LogisticRegression())
    ]
)
pipeline

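**Pipeline**

| Parameter | Value |
|---|---|
| steps | [('preprocessor', ...), ('classifier', ...)] |
| transform_input | None |
| memory | None |
| verbose | False |

**preprocessor: ColumnTransformer**

| Parameter | Value |
|---|---|
| transformers | [('cat', ...)] |
| remainder | 'passthrough' |
| sparse_threshold | 0.3 |
| n_jobs | None |
| transformer_weights | None |
| verbose | False |
| verbose_feature_names_out | True |
| force_int_remainder_cols | 'deprecated' |

**cat: CatToIntTransformer**

| Parameter | Value |
|---|---|
| cat_features | ['Hero', 'Power'] |

**classifier: LogisticRegression**

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |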


In [19]:
X = data[cat_features + num_features]
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

pipeline.fit(X_train, y_train)

print("✅ Model trained successfully")

✅ Model trained successfully
