# Tabular Regression Workflow Template (EDA → Skew → Outliers → Model Choice)

This notebook is a **reusable template** for tabular regression problems (e.g. Kaggle competitions).  
It includes both **code** and a **granular decision workflow** so you can follow the same thought process every time.

---

## 🔁 High-Level Workflow

1. **Set config & load data**
2. **Understand structure**: dtypes, missingness, basic stats
3. **Explore numeric features**: distributions, feature–target relationships, correlations
4. **Explore categorical & boolean features**
5. **Quantify skewness & kurtosis** and decide on transformations
6. **Detect & handle outliers** (winsorize, remove, or flag)
7. **Assess relationship shape** (linear vs monotonic vs nonlinear)
8. **Choose model family** based on aggregate shape (linear vs trees)
9. **(Later) Build preprocessing + baseline model + CV**

You can duplicate this notebook for any new regression competition and only change the config (paths, target name, etc.).


## 🧭 Decision Workflow Cheat Sheet (Granular Rules)

Use this as a **mental and practical checklist** every time.

### 1️⃣ Data & Structure

1. Load train (and test if available).
2. Check:
   - `shape` (rows, columns)
   - dtypes
   - missing values
   - obvious ID columns
3. Identify initial column groups:
   - Numeric features
   - Categorical features
   - Boolean / 0–1 features
   - Target column
   - ID column(s)

> 📌 **Action**: If something looks wrong (e.g. target all zeros, date parsed as object, weird dtypes), fix **before** going further.
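
A minimal sketch of these structure checks on a toy frame. The column names (`id`, `price`, `sold_on`) and the `find_id_like_columns` helper are hypothetical and only illustrate the idea; real ID detection usually needs a manual look as well.

```python
import pandas as pd

# Toy frame with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "price": [10.0, None, 12.5, 11.0],
    "sold_on": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
})


def find_id_like_columns(df: pd.DataFrame) -> list:
    """Return integer or string columns in which every row has a distinct value."""
    candidates = []
    for col in df.columns:
        is_int_or_str = (
            pd.api.types.is_integer_dtype(df[col])
            or pd.api.types.is_object_dtype(df[col])
            or pd.api.types.is_string_dtype(df[col])
        )
        if is_int_or_str and df[col].nunique(dropna=True) == len(df):
            candidates.append(col)
    return candidates


missing_share = toy.isna().mean()                # fraction of missing values per column
toy["sold_on"] = pd.to_datetime(toy["sold_on"])  # fix a date that was read as text
id_like = find_id_like_columns(toy)

print(missing_share)
print(toy.dtypes)
print("ID-like columns:", id_like)
```

Columns flagged this way are candidates to exclude from the feature lists, not a definitive answer.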

---

### 2️⃣ Numeric EDA: Distributions & Relationships

For numeric columns (excluding target and IDs):

1. Plot histograms for each numeric feature.
2. Plot **target distribution** (hist + boxplot).
3. Plot **scatter plots** of feature vs target for a subset of numeric features.
4. Compute correlations:
   - Pearson (linear) 
   - Spearman (rank / monotonic) for sanity checks later (optional).

**Interpretation rules:**

- If a feature’s scatter vs target looks roughly like a **straight band** → relationship is approximately **linear**.
- If it is curved (U-shape, log curve, exponential, plateauing) → **nonlinear**.
- If there is no clear pattern → likely **weak/no signal** or dominated by noise.

> 📌 **If most of your strong features look linear:**  
> → Linear models (Ridge/ElasticNet) are a good first baseline (after transformations).  
> 📌 **If most look clearly nonlinear/curved/step-like:**  
> → Start with tree-based models (LightGBM/XGBoost/CatBoost).  
> 📌 **If it’s a mix or unclear:**  
> → Start with a tree model (safe default), then experiment with linear models later.

---

### 3️⃣ Categorical & Boolean EDA

For categorical features:

- Look at value counts.
- Compute target statistics by category (mean, count, etc.).
- Plot **boxplots/violins** of target vs category.
- Compute **Cramér’s V** between categoricals to find redundancies.

For boolean / 0–1 features:

- Compute **point-biserial correlation** with the target.
- Plot boxplots of target vs boolean value.

**Interpretation rules:**

- Categories with very different target means are **highly informative**.
- Categoricals strongly associated with each other (high Cramér’s V) may be redundant.
- Boolean features with high |correlation| with target are good candidates to keep; others might be weak.

> 📌 **Action**:  
> - Plan encodings (one-hot, target encoding, CatBoost handling).  
> - Consider merging rare categories if cardinality is high.
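
One possible way to merge rare categories is sketched below. The helper name `merge_rare_categories` and its parameters `min_count` and `other_label` are illustrative assumptions, not part of this notebook's helpers.

```python
import pandas as pd


def merge_rare_categories(s: pd.Series, min_count: int = 2, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; missing values stay missing.
    return s.where(~s.isin(rare), other_label)


colors = pd.Series(["red", "red", "blue", "blue", "blue", "teal", "mauve"])
merged = merge_rare_categories(colors, min_count=2)
print(merged.value_counts())
```

In practice the set of rare categories would be learned on the training data and reused for the test data, so both frames share the same mapping.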

---

### 4️⃣ Skewness & Kurtosis: Shape of Numeric Distributions

For each numeric feature (including target):

- Compute **skewness** and **kurtosis**.

**Skewness rules of thumb:**

- `|skew| < 0.5` → approximately symmetric
- `0.5 ≤ |skew| ≤ 1.0` → moderately skewed
- `|skew| > 1.0` → highly skewed

**Kurtosis rules (Pearson definition, i.e. `fisher=False` in SciPy):**

- `≈ 3` → roughly normal tails
- `> 3` → heavy tails (more outliers)
- `< 3` → light tails
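
The thresholds above assume the Pearson convention, where a normal distribution scores 3. A small sketch of that definition follows, using a hand-written population estimator; `pearson_kurtosis` is a hypothetical helper name. Subtracting 3 gives the excess (Fisher) value, which is the convention `pandas.Series.kurt()` reports.

```python
import numpy as np


def pearson_kurtosis(x) -> float:
    """Fourth standardized moment (population version); about 3 for normal data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / m2 ** 2


rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
heavy_sample = rng.standard_t(df=5, size=100_000)  # Student-t: heavier tails
uniform_sample = rng.uniform(size=100_000)         # uniform: lighter tails

for name, sample in [
    ("normal", normal_sample),
    ("student-t", heavy_sample),
    ("uniform", uniform_sample),
]:
    k = pearson_kurtosis(sample)
    print(f"{name:10s} Pearson kurtosis = {k:.2f}, excess (Fisher) = {k - 3:.2f}")
```

Mixing the two conventions is an easy mistake: a value of 0.2 is near-normal on the excess scale but extremely light-tailed on the Pearson scale.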

**Transformation decisions:**

- If `|skew| < 0.5` → leave as is (no transform needed for shape).
- If `0.5 ≤ |skew| ≤ 1.0`:
  - Consider **log1p** or **sqrt** transform for **right-skewed** (positive skew) features.
  - For **left-skewed** features, you can reflect: `x' = max(x) - x + 1`, then log/sqrt.
- If `|skew| > 1.0`:
  - Strong candidate for transformation:
    - `log1p(x)` if x ≥ 0
    - Box-Cox (strictly positive x) or Yeo–Johnson (any sign) if more flexibility is needed
  - Also examine for outliers.

> 📌 **Model choice impact:**  
> - Linear models prefer **low skew + near-normal residuals**.  
> - Tree models are largely unaffected by monotonic transforms of a feature, so feature skew rarely matters; transforming a heavily skewed **target** can still help.
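
The transformation decisions above can be sketched on synthetic data. The lognormal sample and variable names are assumptions chosen only to produce a clear right and left skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))  # long right tail
left_skewed = right_skewed.max() - right_skewed                            # mirrored: long left tail

# Right skew: log1p compresses the long upper tail (requires x >= 0).
right_fixed = np.log1p(right_skewed)

# Left skew: reflect so the tail points right, then apply the same transform.
# Note that reflection reverses the ordering of the original values.
reflected = left_skewed.max() - left_skewed + 1
left_fixed = np.log1p(reflected)

print(f"right-skewed: {right_skewed.skew():.2f} -> {right_fixed.skew():.2f}")
print(f"left-skewed : {left_skewed.skew():.2f} -> {left_fixed.skew():.2f}")
```

Because reflection flips the direction of the feature, the sign of any fitted coefficient or correlation for the reflected feature flips as well.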

---

### 5️⃣ Outlier Detection & Handling

Goal: reduce the effect of **unreasonably extreme values** that can distort training, especially for linear models.

Steps:

1. For key numeric features (and/or all of them):
   - Use **IQR rule** or **z-score** to flag outliers.
   - Optionally use **IsolationForest** for multivariate detection.
2. Compare target distribution **with vs without** outliers to see their impact.

**Decision rules:**

- **If outliers are legitimate signal** (e.g., very rich customers, very large valid sales):  
  - Consider **keeping them**, especially if using tree-based models.
- **If outliers are likely errors / noise / impossible values**:  
  - Remove those rows outright.
- **If outliers are extreme but plausible and hurting linear models**:  
  - Use **winsorization** (clip to low/high percentiles, e.g. 1% and 99%).  
  - Or transform (log) then clip less aggressively.

General strategies:

- `strategy="winsorize"` → good baseline for regression.  
- `strategy="remove"` → use cautiously; track % of data removed.  
- `strategy="flag"` → keep original values but add `_outlier` indicator features.

> 📌 **Best practice:** For contest work, start with winsorizing or flagging rather than deleting.
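
A small sketch of the winsorize and flag strategies, with the clip bounds learned on the training column and then reused for the test column. The helper `fit_clip_bounds` and the toy values are hypothetical; the 10%/90% quantiles are used only because the toy sample is tiny.

```python
import pandas as pd


def fit_clip_bounds(s: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> tuple:
    """Learn winsorization bounds from the training column only."""
    return s.quantile(lower_q), s.quantile(upper_q)


train_col = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 250.0])
test_col = pd.Series([0.5, 3.0, 400.0])

low, high = fit_clip_bounds(train_col, lower_q=0.10, upper_q=0.90)

# "flag" strategy: record which rows were extreme before changing anything.
train_flag = ((train_col < low) | (train_col > high)).astype(int)

# "winsorize" strategy: clip train and test with the SAME train-derived bounds.
train_clipped = train_col.clip(low, high)
test_clipped = test_col.clip(low, high)

print(f"bounds: [{low:.2f}, {high:.2f}]")
print("flagged rows in train:", int(train_flag.sum()))
print("clipped test values  :", test_clipped.tolist())
```

Recomputing the quantiles on the test set instead would give train and test different clipping rules for the same feature.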

---

### 6️⃣ Relationship Shape & Model Family

Use scatter plots, Pearson vs Spearman correlations, and your EDA impressions to classify features:

- **Linear relationship**: roughly straight trend in scatter; Pearson and Spearman are both high and close to each other.
- **Nonlinear monotonic**: curved trend but always increasing/decreasing; Spearman is noticeably higher than Pearson.
- **Nonlinear non-monotonic**: U-shapes, plateaus, or complicated patterns.
- **No clear relationship**: cloud with no pattern.
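
The four shapes can be told apart numerically by comparing the two correlations. The sketch below uses synthetic data; `pearson_and_spearman` is a hypothetical helper that computes Spearman as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd


def pearson_and_spearman(x: pd.Series, y: pd.Series) -> tuple:
    """Return (Pearson, Spearman); Spearman is the Pearson correlation of the ranks."""
    pearson = x.corr(y)
    spearman = x.rank().corr(y.rank())
    return pearson, spearman


rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 5, size=2_000))
noise = rng.normal(0, 0.1, size=2_000)

shapes = {
    "linear": 2 * x + noise,
    "monotonic curve": np.exp(2 * x) + noise,
    "U-shape": (x - 2.5) ** 2 + noise,
}

results = {name: pearson_and_spearman(x, y) for name, y in shapes.items()}
for name, (p, s) in results.items():
    print(f"{name:16s} Pearson={p:+.2f}  Spearman={s:+.2f}")
```

The U-shape case is why the scatter plot still matters: both correlations sit near zero even though the feature is strongly related to the target.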

**Model choice rules:**

- If **most strong features are linear** **and** you’re comfortable with transformations:  
  → Try a **linear regression / Ridge / ElasticNet baseline** after fixing skew & outliers.
- If **many features are clearly nonlinear or monotonic but curved**:  
  → Prefer **tree-based gradient boosting** (LightGBM/XGBoost/CatBoost).
- If the picture is **mixed** (some linear, some nonlinear) or unclear:  
  → Start with **LightGBM** (good default for tabular).  
  → Later, build a **linear baseline** to compare.

> ❗ You almost never build separate models per feature.  
> You **transform features** based on their shapes, then feed them into a single model (or ensemble).

---

### 7️⃣ Putting It All Together (Execution Flow)

When you open a new regression dataset, follow this order:

1. **Config & Data Load**
   - Set paths, target name, ID column.
   - Load `train_df` (and `test_df` if available).

2. **Initial Structure Check**
   - Run `summarize_dataframe(train_df)`.
   - Fix obvious issues (dtypes, weird IDs, broken target).

3. **Column Typing**
   - Use helpers to get `num_cols`, `cat_cols`, `bool_cols`.

4. **Numeric EDA**
   - Plot target distribution.  
   - Plot numeric feature histograms and a subset of feature-vs-target scatter plots.  
   - Check correlation with target (Pearson).

5. **Categorical & Boolean EDA**
   - Examine value counts.  
   - Summarize target by category and plot box/violin.  
   - Compute Cramér’s V matrix for categoricals; point-biserial correlations for booleans.

6. **Skewness & Kurtosis**
   - Compute skew/kurtosis for all numeric features.  
   - Decide which features are candidates for log / other transformations.

7. **Outliers**
   - Use IQR/Z-score/IsolationForest to flag outliers on key columns.  
   - Compare target with/without to see impact.  
   - Apply chosen strategy: winsorize / remove / flag.

8. **Model Strategy Planning**
   - Based on shapes and correlations:  
     - If mostly linear → plan a linear model baseline + engineered features.  
     - If mostly nonlinear → plan tree-based models.  
     - If mixed → start with trees, later add linear baseline.

9. **(Next Notebook Sections)**
   - Implement preprocessing (encoders, scalers, transformers).  
   - Build train/validation split (KFold, TimeSeriesSplit, etc.).  
   - Train baseline models and compare metrics.  
   - Iterate with feature engineering and ensembling.


In [None]:
# ========== 1. Imports & Config ==========

import os
from pathlib import Path
from typing import Optional, List

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so the helpers also run outside a notebook

from scipy.stats import skew, kurtosis, chi2_contingency, pointbiserialr, zscore
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import IsolationForest

# Display & plotting options
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100

# ---- Project-level config (edit per dataset/competition) ----
DATA_DIR = Path("../input")      # change to your data path
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"           # set to None if no test set

TARGET_COL = "target"            # change to your target column
ID_COL = "id"                    # change or set to None if no ID

RANDOM_STATE = 42


## 2️⃣ Load Data

Edit the `DATA_DIR`, `TRAIN_FILE`, `TEST_FILE`, and `TARGET_COL` in the config cell above to match your dataset.

Then run this cell to load your train (and test, if present) data.


In [None]:
def load_data(
    data_dir: Path = DATA_DIR,
    train_file: str = TRAIN_FILE,
    test_file: Optional[str] = TEST_FILE,
):
    """Load train/test DataFrames from CSV."""
    train_path = data_dir / train_file
    if not train_path.exists():
        raise FileNotFoundError(f"Train file not found: {train_path}")
        
    train_df = pd.read_csv(train_path)
    
    test_df = None
    if test_file is not None:
        test_path = data_dir / test_file
        if test_path.exists():
            test_df = pd.read_csv(test_path)
        else:
            print(f"Test file not found: {test_path} (continuing without test_df)")
    
    print("Train shape:", train_df.shape)
    if test_df is not None:
        print("Test shape :", test_df.shape)
    else:
        print("Test data  : None")
    
    return train_df, test_df


train_df, test_df = load_data()


## 3️⃣ Column Typing & Initial Summary

Use these helpers to:

- Identify numeric, categorical, and boolean/0–1 features
- Get a quick overview of the data structure, missing values, and basic stats


In [None]:
def get_numeric_features(df: pd.DataFrame, exclude: Optional[List[str]] = None) -> List[str]:
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if exclude:
        num_cols = [c for c in num_cols if c not in exclude]
    return num_cols


def get_categorical_features(df: pd.DataFrame) -> List[str]:
    return df.select_dtypes(include=["object", "category"]).columns.tolist()


def get_boolean_features(df: pd.DataFrame) -> List[str]:
    """Return bool columns plus integer columns whose only values are 0 and/or 1."""
    bool_cols = df.select_dtypes(include=["bool"]).columns.tolist()
    # np.integer covers every NumPy integer width (e.g. int8 after downcasting).
    for col in df.select_dtypes(include=[np.integer]).columns:
        unique_vals = df[col].dropna().unique()
        if len(unique_vals) <= 2 and set(unique_vals).issubset({0, 1}):
            bool_cols.append(col)
    return list(dict.fromkeys(bool_cols))


def summarize_dataframe(df: pd.DataFrame, name: str = "df"):
    print(f"===== {name} SUMMARY =====")
    print("Shape:", df.shape)
    
    print("\nFirst 5 rows:")
    display(df.head())

    print("\nDtypes:")
    display(df.dtypes)

    print("\nMissing values (count):")
    display(df.isna().sum().sort_values(ascending=False))

    print("\nBasic describe (numeric):")
    display(df.describe().T)

    print("\nPossible categorical columns (heuristic):")
    cat_like = []
    for col in df.columns:
        if df[col].dtype == "object":
            cat_like.append(col)
        else:
            unique_vals = df[col].nunique()
            if unique_vals < 20 and str(df[col].dtype).startswith("int"):
                cat_like.append(col)
    print(cat_like)


summarize_dataframe(train_df, name="train_df")


num_cols = get_numeric_features(
    train_df,
    exclude=[TARGET_COL] + ([ID_COL] if ID_COL in train_df.columns else [])
)
cat_cols = get_categorical_features(train_df)
bool_cols = get_boolean_features(train_df)

print("Numeric features (first 10):", num_cols[:10], "..." if len(num_cols) > 10 else "")
print("Categorical features:", cat_cols)
print("Boolean features:", bool_cols)


## 4️⃣ Numeric EDA: Distributions & Correlations

Follow this sequence:

1. Inspect target distribution.  
2. Inspect numeric feature distributions.  
3. Look at scatter plots of feature vs target.  
4. Examine correlations with the target.  


In [None]:
def plot_target_distribution(df: pd.DataFrame, target_col: str = TARGET_COL):
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    sns.histplot(df[target_col], kde=True, ax=axes[0])
    axes[0].set_title(f"Distribution of {target_col}")

    sns.boxplot(x=df[target_col], ax=axes[1])
    axes[1].set_title(f"Boxplot of {target_col}")

    plt.tight_layout()
    plt.show()


def plot_numeric_distributions(df: pd.DataFrame, max_cols: int = 12):
    num_cols_local = get_numeric_features(df, exclude=[TARGET_COL, ID_COL])  # skip the ID column too
    num_cols_local = num_cols_local[:max_cols]

    n = len(num_cols_local)
    if n == 0:
        print("No numeric features to plot.")
        return

    n_cols = 3
    n_rows = int(np.ceil(n / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(num_cols_local):
        sns.histplot(df[col], kde=False, ax=axes[i])
        axes[i].set_title(col)

    for j in range(i + 1, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


def plot_feature_vs_target(df: pd.DataFrame, target_col: str = TARGET_COL, max_cols: int = 6):
    num_cols_local = get_numeric_features(df, exclude=[target_col, ID_COL])  # skip the ID column too
    num_cols_local = num_cols_local[:max_cols]

    n = len(num_cols_local)
    if n == 0:
        print("No numeric features to plot vs target.")
        return

    n_cols = 3
    n_rows = int(np.ceil(n / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(num_cols_local):
        sns.scatterplot(x=df[col], y=df[target_col], ax=axes[i], alpha=0.4)
        axes[i].set_xlabel(col)
        axes[i].set_ylabel(target_col)
        axes[i].set_title(f"{col} vs {target_col}")

    for j in range(i + 1, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


def correlation_with_target(df: pd.DataFrame, target_col: str = TARGET_COL, top_n: int = 20):
    num_cols_local = df.select_dtypes(include=[np.number]).columns.tolist()
    if target_col not in num_cols_local:
        print(f"Target {target_col} is not numeric or not in df.")
        return

    # Drop the target's self-correlation (always 1.0) so it does not top the list.
    corr = (
        df[num_cols_local].corr()[target_col]
        .drop(target_col)
        .sort_values(ascending=False)
    )
    print("Top positively correlated with target:")
    display(corr.head(top_n))
    print("\nTop negatively correlated with target:")
    display(corr.tail(top_n))


def plot_correlation_heatmap(df: pd.DataFrame, target_col: str = TARGET_COL, top_n: int = 20):
    num_cols_local = df.select_dtypes(include=[np.number]).columns.tolist()
    if target_col not in num_cols_local:
        print(f"Target {target_col} is not numeric or not in df.")
        return

    corr_series = df[num_cols_local].corr()[target_col].drop(target_col)
    top_features = corr_series.abs().sort_values(ascending=False).head(top_n).index.tolist()
    cols_to_plot = top_features + [target_col]

    corr_matrix = df[cols_to_plot].corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=False, cmap="coolwarm", center=0)
    plt.title(f"Correlation heatmap (top {top_n} correlated with {target_col})")
    plt.tight_layout()
    plt.show()


# Run core numeric EDA
plot_target_distribution(train_df, TARGET_COL)
plot_numeric_distributions(train_df)
plot_feature_vs_target(train_df, TARGET_COL)
correlation_with_target(train_df, TARGET_COL)
plot_correlation_heatmap(train_df, TARGET_COL)


## 5️⃣ Categorical & Boolean EDA

Now examine **categorical** and **boolean** predictors:

- How the target varies across categories
- How categoricals relate to each other (Cramér's V)
- How booleans relate to the target (point-biserial correlation)


In [None]:
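def cramers_v(x, y) -> float:
    """Cramér's V (no bias correction) between two categorical series, in [0, 1]."""
    contingency = pd.crosstab(x, y)  # contingency table of co-occurrence counts
    chi2, _, _, _ = chi2_contingency(contingency)
    n = contingency.to_numpy().sum()
    k = min(contingency.shape) - 1
    if k == 0:
        # One of the variables has a single level, so V is undefined.
        return np.nan
    return np.sqrt((chi2 / n) / k)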


def cramers_v_matrix(df: pd.DataFrame, cols: Optional[List[str]] = None) -> pd.DataFrame:
    if cols is None:
        cols = get_categorical_features(df)

    n = len(cols)
    result = pd.DataFrame(np.ones((n, n)), index=cols, columns=cols)

    for i in range(n):
        for j in range(i + 1, n):
            v = cramers_v(df[cols[i]], df[cols[j]])
            result.iloc[i, j] = v
            result.iloc[j, i] = v

    return result


def plot_cramers_v_heatmap(df: pd.DataFrame, cols: Optional[List[str]] = None):
    cv_mat = cramers_v_matrix(df, cols)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cv_mat, annot=False, cmap="coolwarm", vmin=0, vmax=1)
    plt.title("Cramér's V between categorical features")
    plt.tight_layout()
    plt.show()


def summarize_target_by_category(
    df: pd.DataFrame,
    cat_col: str,
    target_col: str = TARGET_COL,
    sort_by: str = "mean",
) -> pd.DataFrame:
    summary = (
        df.groupby(cat_col)[target_col]
        .agg(["count", "mean", "std", "min", "max"])
        .sort_values(by=sort_by, ascending=False)
    )
    display(summary)
    return summary


def plot_target_by_category(
    df: pd.DataFrame,
    cat_col: str,
    target_col: str = TARGET_COL,
    max_categories: int = 20,
    kind: str = "box",
):
    if df[cat_col].nunique() > max_categories:
        top_cats = df[cat_col].value_counts().head(max_categories).index
        data = df[df[cat_col].isin(top_cats)].copy()
    else:
        data = df

    plt.figure(figsize=(12, 6))
    if kind == "box":
        sns.boxplot(x=cat_col, y=target_col, data=data)
    elif kind == "violin":
        sns.violinplot(x=cat_col, y=target_col, data=data, cut=0)
    else:
        raise ValueError("kind must be 'box' or 'violin'")

    plt.title(f"{target_col} by {cat_col}")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()


def pointbiserial_correlations_with_target(
    df: pd.DataFrame,
    bool_cols: Optional[List[str]] = None,
    target_col: str = TARGET_COL,
) -> pd.DataFrame:
    if bool_cols is None:
        bool_cols = get_boolean_features(df)

    results = []
    for col in bool_cols:
        series = df[col]
        if series.dtype == "bool":
            series = series.astype(int)

        mask = series.notna() & df[target_col].notna()
        if mask.sum() == 0:
            corr = np.nan
            pval = np.nan
        else:
            corr, pval = pointbiserialr(series[mask], df[target_col][mask])
        results.append({"feature": col, "corr": corr, "p_value": pval})

    res_df = pd.DataFrame(results).sort_values("corr", key=lambda x: x.abs(), ascending=False)
    display(res_df)
    return res_df


def plot_target_by_boolean(df: pd.DataFrame, bool_col: str, target_col: str = TARGET_COL):
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[bool_col].astype(str), y=df[target_col])
    plt.title(f"{target_col} by {bool_col} (bool)")
    plt.xlabel(bool_col)
    plt.tight_layout()
    plt.show()


# Run basic categorical/boolean EDA if there are such columns
if len(cat_cols) > 0:
    if len(cat_cols) > 1:
        # The heatmap needs at least two categorical columns, so only announce it then.
        print("\nCramér's V heatmap for categorical features:")
        plot_cramers_v_heatmap(train_df, cat_cols)
    for col in cat_cols[:5]:
        print(f"\n=== {col} vs {TARGET_COL} ===")
        summarize_target_by_category(train_df, col, TARGET_COL)
        plot_target_by_category(train_df, col, TARGET_COL, kind="box")


if len(bool_cols) > 0:
    print("\nPoint-biserial correlations with target for boolean features:")
    pointbiserial_correlations_with_target(train_df, bool_cols, TARGET_COL)
    for col in bool_cols:
        plot_target_by_boolean(train_df, col, TARGET_COL)


## 6️⃣ Skewness & Kurtosis: Shape & Transform Suggestions

Use skewness and kurtosis to decide **which numeric features need transformation**.

Rules (built into your workflow):

- `|skew| < 0.5` → leave as is.
- `0.5 ≤ |skew| ≤ 1.0` → consider log/sqrt transform.
- `|skew| > 1.0` → strong candidate for transformation (log1p, Box-Cox, Yeo–Johnson).
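
These thresholds translate directly into a small lookup. The sketch below uses a hypothetical `skew_bucket` helper and toy columns; it relies on `DataFrame.skew()`, which applies a bias correction, so values can differ slightly from `scipy.stats.skew` with its default settings.

```python
import pandas as pd


def skew_bucket(skew_value: float) -> str:
    """Map a skewness value to the rule-of-thumb decision used in this notebook."""
    magnitude = abs(skew_value)
    if magnitude < 0.5:
        return "leave as is"
    if magnitude <= 1.0:
        return "consider log/sqrt"
    return "strong transform candidate"


toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "long_tail": [1, 1, 1, 2, 2, 3, 4, 20, 100],
})

report = pd.DataFrame({"skewness": toy.skew()})
report["decision"] = report["skewness"].apply(skew_bucket)
print(report)
```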


In [None]:
def skew_kurtosis_summary(df: pd.DataFrame) -> pd.DataFrame:
    num_cols_local = df.select_dtypes(include=[np.number]).columns.tolist()
    summary = []

    for col in num_cols_local:
        col_data = df[col].dropna()
        if len(col_data) == 0:
            continue
        summary.append({
            "feature": col,
            "skewness": skew(col_data),
            "kurtosis": kurtosis(col_data, fisher=False)
        })

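    # Explicit columns keep set_index from failing when there are no numeric columns.
    result = pd.DataFrame(summary, columns=["feature", "skewness", "kurtosis"]).set_index("feature")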
    display(result.sort_values("skewness", key=lambda x: x.abs(), ascending=False))
    return result


def suggest_log_transform(df: pd.DataFrame, skew_threshold: float = 1.0) -> List[str]:
    num_cols_local = df.select_dtypes(include=[np.number]).columns
    candidates = []

    for col in num_cols_local:
        col_data = df[col].dropna()
        if len(col_data) == 0:
            continue
        s = skew(col_data)
        # log1p only compresses a long right tail, so require positive skew and non-negative values.
        if s > skew_threshold and col_data.min() >= 0:
            candidates.append((col, s))

    print(f"Log-transform candidates (skew > {skew_threshold} and min >= 0):")
    display(pd.DataFrame(candidates, columns=["feature", "skewness"]))
    return [c[0] for c in candidates]


skew_kurtosis_df = skew_kurtosis_summary(train_df)
log_candidates = suggest_log_transform(train_df)


## 7️⃣ Outlier Detection & Handling

Now use simple, consistent rules to detect and handle outliers.

Recommended flow:

1. Start with **IQR or z-score** on key numeric columns.  
2. Check how many points are flagged and how much they shift target stats.  
3. Decide whether to **keep**, **winsorize**, **remove**, or **flag**.


In [None]:
def detect_outliers_iqr(df: pd.DataFrame, col: str, multiplier: float = 1.5):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    mask = (df[col] < lower) | (df[col] > upper)
    return mask, lower, upper


def detect_outliers_z(df: pd.DataFrame, col: str, threshold: float = 3.0):
    """Flag rows whose |z-score| exceeds `threshold`; missing values are never flagged."""
    valid = df[col].dropna()
    mask = pd.Series(False, index=df.index)
    # Assign by label with .loc so the mask stays aligned with df's index.
    mask.loc[valid.index] = np.abs(np.asarray(zscore(valid))) > threshold
    return mask


def detect_outliers_isoforest(df: pd.DataFrame, num_cols: Optional[List[str]] = None, contamination: float = 0.01):
    if num_cols is None:
        num_cols = get_numeric_features(df, exclude=[TARGET_COL])

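    # Median-fill missing values before fitting; this is a simple placeholder choice.
    X = df[num_cols].fillna(df[num_cols].median())
    iso = IsolationForest(contamination=contamination, random_state=RANDOM_STATE)
    preds = iso.fit_predict(X)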
    mask = preds == -1
    return pd.Series(mask, index=df.index)


def compare_target_with_without_outliers(df: pd.DataFrame, mask: pd.Series, target_col: str = TARGET_COL):
    full_mean = df[target_col].mean()
    clean_mean = df[~mask][target_col].mean()
    print(f"Full target mean:      {full_mean:.4f}")
    print(f"Without outliers mean: {clean_mean:.4f}")
    print(f"Difference:            {clean_mean - full_mean:.4f}")
    print(f"Outlier count:         {mask.sum()} / {len(df)}")


def winsorize_series(s: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    lower = s.quantile(lower_q)
    upper = s.quantile(upper_q)
    return s.clip(lower, upper)


def process_outliers(
    df: pd.DataFrame,
    num_cols: Optional[List[str]] = None,
    strategy: str = "winsorize",
    z_thresh: float = 3.0,
    lower_q: float = 0.01,
    upper_q: float = 0.99,
) -> pd.DataFrame:
    if num_cols is None:
        num_cols = get_numeric_features(df, exclude=[TARGET_COL])

    df_processed = df.copy()

    for col in num_cols:
        if strategy == "remove":
            mask = detect_outliers_z(df_processed, col, threshold=z_thresh)
            df_processed = df_processed[~mask]
        elif strategy == "winsorize":
            df_processed[col] = winsorize_series(df_processed[col], lower_q, upper_q)
        elif strategy == "flag":
            mask = detect_outliers_z(df_processed, col, threshold=z_thresh)
            df_processed[f"{col}_outlier"] = mask.astype(int)
        else:
            raise ValueError("strategy must be one of: 'remove', 'winsorize', 'flag'")

    return df_processed


# Example: examine outliers for a single numeric feature, if any exist
example_col = num_cols[0] if len(num_cols) > 0 else None
if example_col:
    mask_iqr, low, high = detect_outliers_iqr(train_df, example_col)
    print(f"Example IQR outlier bounds for {example_col}: [{low:.4f}, {high:.4f}]")
    compare_target_with_without_outliers(train_df, mask_iqr, TARGET_COL)

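# Example: create a winsorized version of the training data.
# Reuse num_cols (target and ID already excluded) and skip 0/1 columns, since
# clipping a rare binary flag at the 1st/99th percentile can erase it entirely.
train_df_processed = process_outliers(
    train_df,
    num_cols=[c for c in num_cols if c not in bool_cols],
    strategy="winsorize",
    lower_q=0.01,
    upper_q=0.99,
)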
print("Original shape:", train_df.shape)
print("Processed shape:", train_df_processed.shape)


## 8️⃣ Relationship Shape & Model Family Choice (Conceptual)

At this point, you have:

- A clearer understanding of the data
- Insight into numeric and categorical behaviors
- Skew and outliers under control

Now you decide **what kind of model to try first**.

### A. Quick Relationship Diagnostics (Recommended)

For each numeric feature:

1. Look at the scatter vs target plots.
2. Optionally compute:
   - Pearson correlation (linear)
   - Spearman correlation (monotonic)

**Heuristics:**

- If **Pearson and Spearman are both high** and scatter looks straight-ish → **linear** relationship.
- If **Spearman > Pearson** and the scatter is curved but monotonic → **nonlinear monotonic**.
- If both correlations are low and scatter is a cloud → likely **no strong relationship**.

### B. Aggregate Decisions

- If **most strong features are linear** and you’re willing to transform skewed ones:  
  → Start with a **linear baseline** (Ridge/ElasticNet).  
  - Standardize features.  
  - Consider adding polynomial or interaction terms for obvious curves.

- If **many features show nonlinear / monotonic patterns**:  
  → Start with **tree-based gradient boosting** (LightGBM/XGBoost/CatBoost).  
  - Captures nonlinearity and interactions automatically.  
  - Is largely insensitive to feature skew and feature outliers, because splits depend only on the ordering of values.

- If the dataset is **mixed** (some linear, some nonlinear, some categorical-heavy):  
  → Start with **LightGBM or CatBoost** as a robust baseline, then later:
  - Build a linear model for interpretability.
  - Compare performance and use both if needed (stacking/ensembling).

> ✅ You **do not** build separate models per feature.  
> You **transform the features** based on their shapes and feed them into a single model (or ensemble).

In a future section of this notebook (you can add later), you’ll:

- Build preprocessing pipelines (encoders, scalers, transformers)
- Define train/validation splits
- Train baseline models (LGBM / XGB / ElasticNet)
- Evaluate and iterate

For now, this notebook serves as your **guided EDA + data-shape analysis foundation** for any regression task.


## 9️⃣ Feature Engineering Strategy (From Raw Columns to Useful Signals)

This section is about **creating better features**, not just cleaning them.

So far we have focused on:

- Understanding distributions (EDA)
- Handling skewness
- Handling outliers
- Understanding numeric vs categorical relationships

Now we answer:

> “How can I **transform** these raw columns into features that make the model’s job easier?”

---

### 🧩 What Feature Engineering Is (vs Encoding)

- **Encoding** = turning categorical values into numbers so the model can accept them  
  (e.g., one-hot, target encoding)

- **Feature Engineering** = transforming/expanding columns to expose useful structure  
  (e.g., extracting year/month from date, splitting `"Ford-F150-XL"` into three features)

**Order of operations (big picture):**

1. Handle missing values  
2. Handle skew and outliers  
3. **Feature engineering (this section)**  
4. Encode categoricals  
5. Scale numeric features (if needed for linear models)  
6. Model training

---

### 🧭 Types of Feature Engineering You’ll Do

Common buckets:

1. **Date / time decomposition**
   - From a single datetime column → `year`, `month`, `day`, `weekday`, `hour`, etc.
   - Good when seasonal, monthly, weekly patterns matter.

2. **String / code splitting**
   - Split structured text like `"Ford-F150-XL"` or `"A12-B3"` into multiple meaningful parts.
   - Example: `Brand`, `Model`, `Trim`.

3. **Ratios / rates / per-unit features**
   - E.g., `income_per_person`, `sales_per_store`, `goals_per_minute`.
   - Often more informative than raw counts.

4. **Bucketization / binning**
   - Turn a continuous numeric feature into a small number of ordered categories.
   - E.g., `age` → `age_bin` of `[0–18, 19–35, 36–60, 60+]`.

5. **Polynomial / interaction terms**
   - `x²`, `x*y`, etc., when you suspect curved or interacting relationships for **linear models**.
   - Tree models often learn these interactions on their own.

6. **Counts / frequency features**
   - Replace a category with its frequency in the data, or add a parallel "count" feature.
   - E.g., number of times a customer ID appears.
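
Three of these buckets (ratios, binning, and frequency features) can be sketched in a few lines. The frame and its column names (`store_id`, `sales`, `n_staff`, `age`) are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "store_id": ["A", "A", "B", "C", "A"],
    "sales": [100.0, 150.0, 80.0, 0.0, 120.0],
    "n_staff": [4, 5, 2, 0, 4],
    "age": [17, 25, 40, 61, 36],
})

# Ratio feature: turn a zero denominator into NaN instead of dividing by zero.
df["sales_per_staff_ratio"] = df["sales"] / df["n_staff"].replace(0, np.nan)

# Binning: ordered age buckets matching [0-18, 19-35, 36-60, 60+].
df["age_bin"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, np.inf],
    labels=["0-18", "19-35", "36-60", "60+"],
    include_lowest=True,
)

# Frequency feature: how often each store_id appears in the data.
df["store_id_count"] = df["store_id"].map(df["store_id"].value_counts())

print(df)
```

Leaving the undefined ratio as NaN keeps the decision about imputation explicit rather than hiding it behind an arbitrary fill value.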

---

### 🧠 Model-Type-Aware Feature Engineering

The model family changes how aggressive you need to be:

- For **tree-based models** (LightGBM, XGBoost, CatBoost):
  - They handle nonlinearity and interactions automatically.
  - Feature engineering helps, but you can keep it **simple and safe**:
    - Ratios
    - Date parts
    - Easy binning
    - Count features

- For **linear models** (Ridge, Lasso, ElasticNet):
  - You must **manually provide nonlinearity**:
    - Log transforms
    - Polynomial terms (squares, interactions)
    - Carefully chosen bins

> ✅ Rule of thumb:  
> Start with simple, interpretable feature engineering that you can justify.  
> Add more complex transforms only if you see consistent gains in CV.
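
To illustrate why linear models need nonlinearity supplied by hand, the sketch below fits ordinary least squares with NumPy on synthetic quadratic data, once with `x` alone and once with an added `x**2` column. The data, the `fit_r2` helper, and the in-sample R² comparison are assumptions for illustration; a real comparison would use cross-validation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=500)
# Curved ground truth with a little noise.
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(0, 0.5, size=500)


def fit_r2(features: np.ndarray, target: np.ndarray) -> float:
    """Ordinary least squares with an intercept; returns in-sample R^2."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1.0 - residuals.var() / target.var()


r2_raw = fit_r2(x.reshape(-1, 1), y)               # linear in x only
r2_poly = fit_r2(np.column_stack([x, x ** 2]), y)  # x plus the engineered x^2 term

print(f"R^2 with x only    : {r2_raw:.3f}")
print(f"R^2 with x and x^2 : {r2_poly:.3f}")
```

A tree-based model would typically pick up this curve from `x` alone, which is why the squared term matters mainly for the linear family.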

---

### 🧪 Practical Workflow for Feature Engineering

When approaching a new dataset:

1. **Look at EDA findings**:
   - Columns that correlate with target?
   - Any code-like strings? Dates? Ratios that make sense?

2. **Decide which transforms are relevant**:
   - If you see dates → add date parts.
   - If you see codes like `"A-B-C"` → split into sub-features.
   - If you see “counts over time or groups” → create ratios or per-unit measures.
   - If relationships are clearly curved in scatter plots → consider polynomials or bins.

3. **Apply transformations to both `train_df` and `test_df`** consistently:
   - Same functions, same parameters, same mappings.

4. **Keep track of what you added**:
   - Maintain a list of new feature names.
   - Optionally tag engineered features (e.g., name them with suffixes like `_fe`, `_ratio`, `_bin`).
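
Step 3 matters most for data-dependent transforms. Calling `pd.qcut` separately on train and test produces different bin edges for each frame, so the sketch below learns quantile edges on train and reuses them on test with `pd.cut`. The helper names `fit_quantile_edges` and `apply_edges` are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_quantile_edges(values, n_bins):
    """Learn quantile bin edges on train; the outer edges are opened to +/- inf."""
    edges = np.unique(np.quantile(values.dropna(), np.linspace(0, 1, n_bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    return edges


def apply_edges(values, edges):
    """Bin any frame with previously learned edges; returns integer bin codes."""
    return pd.cut(values, bins=edges, labels=False)


train_demo = pd.DataFrame({"age": [5, 18, 25, 33, 41, 58, 64, 80]})
test_demo = pd.DataFrame({"age": [2, 30, 95]})

age_edges = fit_quantile_edges(train_demo["age"], n_bins=4)
train_demo["age_bin"] = apply_edges(train_demo["age"], age_edges)
test_demo["age_bin"] = apply_edges(test_demo["age"], age_edges)
```

Opening the outer edges to infinity keeps test values outside the train range from becoming NaN.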

We’ll now add **reusable helper functions** for common feature engineering patterns.


In [None]:
# ========== 9. Feature Engineering Helpers ==========

from typing import Tuple, Dict

def add_datetime_parts(
    df: pd.DataFrame,
    col: str,
    prefix: str | None = None,
    drop_original: bool = False
) -> pd.DataFrame:
    """
    Convert a column to datetime (if not already) and add common date parts:
    - year, month, day, dayofweek
    Optionally use a prefix and optionally drop the original column.
    """
    df = df.copy()
    if prefix is None:
        prefix = col

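    # is_datetime64_any_dtype also recognizes timezone-aware columns,
    # which np.issubdtype cannot handle.
    if not pd.api.types.is_datetime64_any_dtype(df[col]):
        df[col] = pd.to_datetime(df[col], errors="coerce")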

    df[f"{prefix}_year"] = df[col].dt.year
    df[f"{prefix}_month"] = df[col].dt.month
    df[f"{prefix}_day"] = df[col].dt.day
    df[f"{prefix}_dow"] = df[col].dt.dayofweek  # 0 = Monday

    if drop_original:
        df = df.drop(columns=[col])

    return df


def split_column(
    df: pd.DataFrame,
    col: str,
    sep: str,
    new_cols: list[str],
    drop_original: bool = False
) -> pd.DataFrame:
    """
    Split a string column on a delimiter into multiple new columns.
    Example:
      col = "CarModel", value = "Ford-F150-XL"
      sep = "-"
      new_cols = ["Brand", "Model", "Trim"]
    """
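    df = df.copy()
    # Cast to str but keep missing values as NaN; a plain astype(str) would turn
    # them into the literal string "nan".
    as_text = df[col].astype(str).where(df[col].notna())
    split_data = as_text.str.split(sep, expand=True)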

    if split_data.shape[1] < len(new_cols):
        print(f"Warning: Not enough split parts in {col} for all new_cols")

    for i, new_col in enumerate(new_cols):
        if i < split_data.shape[1]:
            df[new_col] = split_data[i]
        else:
            df[new_col] = np.nan

    if drop_original:
        df = df.drop(columns=[col])

    return df


def add_ratio_feature(
    df: pd.DataFrame,
    num_col: str,
    denom_col: str,
    new_col: str | None = None,
    epsilon: float = 1e-6
) -> pd.DataFrame:
    """
    Create a ratio feature: new_col = num_col / (denom_col + epsilon).
    Useful for per-unit or per-something rates.
    """
    df = df.copy()
    if new_col is None:
        new_col = f"{num_col}_per_{denom_col}"

    df[new_col] = df[num_col] / (df[denom_col] + epsilon)
    return df


def bin_numeric_feature(
    df: pd.DataFrame,
    col: str,
    bins: int | list[float],
    labels: list[str] | None = None,
    new_col: str | None = None,
    strategy: str = "quantile"
) -> pd.DataFrame:
    """
    Bin a numeric feature into categories.

    Parameters
    ----------
    bins:
        - If strategy="uniform": number of equal-width bins or explicit bin edges.
        - If strategy="quantile": number of quantile-based bins.
    strategy:
        - "uniform": use pd.cut
        - "quantile": use pd.qcut
    """
    df = df.copy()
    if new_col is None:
        new_col = f"{col}_bin"

    if strategy == "uniform":
        df[new_col] = pd.cut(df[col], bins=bins, labels=labels, include_lowest=True)
    elif strategy == "quantile":
        df[new_col] = pd.qcut(df[col], q=bins, labels=labels, duplicates="drop")
    else:
        raise ValueError("strategy must be 'uniform' or 'quantile'")

    return df


def add_polynomial_features(
    df: pd.DataFrame,
    cols: list[str],
    degree: int = 2,
    interaction_only: bool = False,
    prefix: str = "poly"
) -> Tuple[pd.DataFrame, list[str]]:
    """
    Add polynomial (and optionally interaction) features for a set of numeric columns.
    - For degree=2 and interaction_only=False, you get squares and pairwise products.
    - Only degree-2 terms are implemented; any degree > 2 is treated as 2.
    - This is more useful for linear models; tree models often don't need this.

    Returns:
      (df_with_poly, new_feature_names)
    """
    from itertools import combinations

    df = df.copy()
    new_features = []

    if degree < 2:
        return df, new_features

    cols = [c for c in cols if c in df.columns]

    if interaction_only:
        # Only pairwise products (no squares)
        for c1, c2 in combinations(cols, 2):
            new_col = f"{prefix}_{c1}_x_{c2}"
            df[new_col] = df[c1] * df[c2]
            new_features.append(new_col)
    else:
        # Squares + pairwise products
        # Squares
        for c in cols:
            new_col = f"{prefix}_{c}_sq"
            df[new_col] = df[c] ** 2
            new_features.append(new_col)
        # Interactions
        for c1, c2 in combinations(cols, 2):
            new_col = f"{prefix}_{c1}_x_{c2}"
            df[new_col] = df[c1] * df[c2]
            new_features.append(new_col)

    return df, new_features


In [None]:
# ========== 9.1 Example Feature Engineering Usage (Customize per dataset) ==========

# Make copies so you don't overwrite original until you're happy
fe_train = train_df_processed.copy() if "train_df_processed" in globals() else train_df.copy()
fe_test = test_df.copy() if test_df is not None else None

# 1) Date / time decomposition
# If you have date columns, add them here:
date_columns = []  # e.g. ["date", "order_date"]
for col in date_columns:
    fe_train = add_datetime_parts(fe_train, col, prefix=col, drop_original=False)
    if fe_test is not None and col in fe_test.columns:
        fe_test = add_datetime_parts(fe_test, col, prefix=col, drop_original=False)

# 2) String / code splitting
# Example: if you have something like "Ford-F150-XL" in a column:
code_columns_config: Dict[str, Dict] = {
    # "CarModel": {"sep": "-", "new_cols": ["Brand", "Model", "Trim"], "drop_original": False},
}
for col, cfg in code_columns_config.items():
    fe_train = split_column(
        fe_train,
        col=col,
        sep=cfg["sep"],
        new_cols=cfg["new_cols"],
        drop_original=cfg.get("drop_original", False),
    )
    if fe_test is not None and col in fe_test.columns:
        fe_test = split_column(
            fe_test,
            col=col,
            sep=cfg["sep"],
            new_cols=cfg["new_cols"],
            drop_original=cfg.get("drop_original", False),
        )

# 3) Ratio features
# Example: income_per_person = income / household_size
ratio_pairs = [
    # ("income", "household_size", "income_per_person"),
    # ("sales", "num_stores", None),  # None -> auto name
]
for num_col, denom_col, new_col in ratio_pairs:
    if num_col in fe_train.columns and denom_col in fe_train.columns:
        fe_train = add_ratio_feature(fe_train, num_col, denom_col, new_col=new_col)
        if fe_test is not None and num_col in fe_test.columns and denom_col in fe_test.columns:
            fe_test = add_ratio_feature(fe_test, num_col, denom_col, new_col=new_col)

# 4) Binning / bucketization
# Example: create age bins
bin_config = [
    # {"col": "age", "bins": [0, 18, 35, 60, 120], "labels": ["0-18", "19-35", "36-60", "60+"], "strategy": "uniform"},
]
for cfg in bin_config:
    col = cfg["col"]
    if col in fe_train.columns:
        fe_train = bin_numeric_feature(
            fe_train,
            col=col,
            bins=cfg["bins"],
            labels=cfg.get("labels"),
            new_col=cfg.get("new_col", f"{col}_bin"),
            strategy=cfg.get("strategy", "uniform"),
        )
        if fe_test is not None and col in fe_test.columns:
            fe_test = bin_numeric_feature(
                fe_test,
                col=col,
                bins=cfg["bins"],
                labels=cfg.get("labels"),
                new_col=cfg.get("new_col", f"{col}_bin"),
                strategy=cfg.get("strategy", "uniform"),
            )

# 5) Polynomial / interaction features (mainly for linear models)
# Choose a small subset of important numeric features to avoid explosion
poly_base_cols = []  # e.g. ["feature1", "feature2"]
if poly_base_cols:
    fe_train, poly_new_features = add_polynomial_features(
        fe_train,
        cols=poly_base_cols,
        degree=2,
        interaction_only=False,
        prefix="poly",
    )
    if fe_test is not None:
        # Ensure all base cols exist in fe_test
        missing_cols = [c for c in poly_base_cols if c not in fe_test.columns]
        if missing_cols:
            print("Warning: some polynomial base cols missing in fe_test:", missing_cols)
        else:
            fe_test, _ = add_polynomial_features(
                fe_test,
                cols=poly_base_cols,
                degree=2,
                interaction_only=False,
                prefix="poly",
            )

print("Feature-engineered train shape:", fe_train.shape)
if fe_test is not None:
    print("Feature-engineered test shape:", fe_test.shape)


## 9️⃣ Missing Values & Imputation Strategy

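Before encoding or modeling, we must resolve missing values. Many estimators (linear models, neural nets, distance-based methods) **cannot** handle NaNs directly, and missingness can distort distance-based metrics, scaling, and encoding. Gradient-boosting libraries such as LightGBM and XGBoost accept NaNs natively, but an explicit strategy lets every model family in this notebook train on the same dataset.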

### 🎯 Goals of Imputation
- Replace missing values with **reasonable estimates**
- Avoid introducing **bias** or **information leakage**
- Preserve or enhance predictive signal
- Ensure train/test consistency

---

### 🧭 How to Choose an Imputation Strategy

Use the following logic:

### **1. Understand the Missingness Type**

There are three missingness patterns:

| Missingness Type | Meaning | Model Impact |
|------------------|---------|-------------|
| MCAR | Missing completely at random | Safe to impute with simple stats |
| MAR | Missing depends on other columns | Might need more careful imputation |
| MNAR | Missing depends on the value itself (e.g., salary missing for high earners) | Missingness itself is informative → add a missing flag |

**Rule:**  
If you believe missingness conveys information, **create a `*_missing` flag** before imputing.
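
A rough way to judge whether missingness conveys information is to compare the target between rows where a feature is missing and rows where it is present. The sketch below runs on synthetic data; `salary` and `spend` are made-up columns, and a gap in means is only a hint, not proof of MNAR.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
salary = rng.normal(50_000, 15_000, size=n)
spend = 0.1 * salary + rng.normal(0, 500, size=n)  # target
# Assumption for the demo: high earners tend to withhold their salary.
hidden = salary > 65_000
demo = pd.DataFrame({"salary": np.where(hidden, np.nan, salary), "spend": spend})

# Compare the target between missing and non-missing rows.
flag = demo["salary"].isna()
mean_missing = demo.loc[flag, "spend"].mean()
mean_present = demo.loc[~flag, "spend"].mean()
gap = mean_missing - mean_present
print(f"Mean target gap (missing - present): {gap:.1f}")

# A large gap suggests keeping a flag before imputing.
demo["salary_missing"] = flag.astype(int)
```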

---

### **2. Decide Based on Feature Type**

| Feature Type | SIMPLE Strategy (Baseline) | ADVANCED Strategy |
|--------------|---------------------------|------------------|
| Numeric | Median | IterativeImputer / KNNImputer |
| Categorical | Most frequent value | Target encoding with `Unknown` bucket |
| Boolean | Mode or treat as categorical | Rarely requires more |
| Datetime | No median → forward/backfill or extract date parts first | Model-based time-series fills |

**Why median for numeric?**  
- Median handles skew and outliers better than mean.
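
A tiny numeric illustration of that point, using a made-up feature with one extreme value:

```python
import numpy as np

# Hypothetical skewed feature: five ordinary values and one extreme outlier.
observed = np.array([30.0, 32.0, 35.0, 38.0, 40.0, 1_000.0])

mean_fill = observed.mean()        # pulled far toward the outlier
median_fill = np.median(observed)  # stays inside the bulk of the data

print(f"mean fill:   {mean_fill:.2f}")
print(f"median fill: {median_fill:.2f}")
```

Filling gaps with the mean here would place imputed rows far above every ordinary observation, while the median stays among them.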

---

### **3. When to Use Simple vs Advanced**

**Use SIMPLE imputation when:**
- Dataset size is medium/large
- Data has low to moderate missingness (< 20%)
- Model is tree-based (LightGBM/XGBoost/CatBoost)
- You are in early prototyping

**Use ADVANCED imputation when:**
- Missingness is high or patterned
- Linear models are planned
- The feature is critical and sensitive
- Imputation affects model stability

---

### 🧪 General Workflow
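
A minimal sketch of the flow this section follows, on toy data with made-up column names: measure missingness, fit the imputer on train only, keep missing indicators, then apply the fitted imputer unchanged to test. It relies on `SimpleImputer(add_indicator=True)`, which appends one indicator column per feature that had missing values during `fit`; the `_missing` suffix is this notebook's naming convention, applied by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_demo = pd.DataFrame({"age": [20.0, np.nan, 40.0, 60.0], "income": [10.0, 20.0, 30.0, 40.0]})
test_demo = pd.DataFrame({"age": [np.nan, 35.0], "income": [15.0, 25.0]})

# 1) Measure missingness on train.
missing_pct = train_demo.isna().mean() * 100

# 2) Fit on train only; add_indicator appends flags for columns with NaNs at fit time.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_arr = imputer.fit_transform(train_demo)

# 3) Reuse the fitted medians (and the same flag columns) on test.
test_arr = imputer.transform(test_demo)

flag_names = [f"{train_demo.columns[i]}_missing" for i in imputer.indicator_.features_]
out_cols = list(train_demo.columns) + flag_names
train_out = pd.DataFrame(train_arr, columns=out_cols, index=train_demo.index)
test_out = pd.DataFrame(test_arr, columns=out_cols, index=test_demo.index)
```

The key property is that the test frame never influences the fill values or the set of flag columns.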



In [None]:
# ========== 9. Simple Imputation ==========

from sklearn.impute import SimpleImputer

def summarize_missing(df: pd.DataFrame):
    """Display missing percentages for each column."""
    missing = df.isna().mean() * 100
    display(missing[missing > 0].sort_values(ascending=False))
    return missing

print("Missing values in training data:")
missing_report = summarize_missing(train_df)

# ---- SIMPLE IMPUTERS ----

numeric_imputer = SimpleImputer(strategy="median")
categorical_imputer = SimpleImputer(strategy="most_frequent")


def apply_simple_imputation(train_df: pd.DataFrame, test_df: pd.DataFrame | None = None):
    """
    Applies simple type-aware imputation to numeric and categorical features.
    Does NOT encode or scale features.
    """
    train = train_df.copy()
    test = test_df.copy() if test_df is not None else None

    num_cols = get_numeric_features(train, exclude=[TARGET_COL])
    cat_cols = get_categorical_features(train)

    # Numeric imputation
    if len(num_cols) > 0:
        train[num_cols] = numeric_imputer.fit_transform(train[num_cols])
        if test is not None:
            test[num_cols] = numeric_imputer.transform(test[num_cols])

    # Categorical imputation
    if len(cat_cols) > 0:
        train[cat_cols] = categorical_imputer.fit_transform(train[cat_cols])
        if test is not None:
            test[cat_cols] = categorical_imputer.transform(test[cat_cols])

    return train, test


train_imputed, test_imputed = apply_simple_imputation(fe_train, fe_test if "fe_test" in globals() else None)
print("After simple imputation:", train_imputed.shape)


In [None]:
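def add_missing_flags(
    df: pd.DataFrame,
    source: pd.DataFrame | None = None,
    cols: list[str] | None = None,
    threshold: float = 0.0,
) -> pd.DataFrame:
    """
    Add binary `<col>_missing` flags to `df`, reading missingness from `source`
    (the frame BEFORE imputation; defaults to `df`). Pass `cols` to fix the flag set.
    """
    df = df.copy()
    source = df if source is None else source
    if cols is None:
        cols = [c for c in source.columns if source[c].isna().mean() > threshold]
    for col in cols:
        df[col + "_missing"] = source[col].isna().astype(int)
    return df

# After imputation there are no NaNs left, so flags come from the pre-imputation frames.
# The flag columns are chosen on train and reused for test to keep both frames aligned.
flag_cols = [c for c in fe_train.columns if c != TARGET_COL and fe_train[c].isna().any()]
train_imputed = add_missing_flags(train_imputed, source=fe_train, cols=flag_cols)
if test_imputed is not None:
    test_flag_cols = [c for c in flag_cols if c in fe_test.columns]
    test_imputed = add_missing_flags(test_imputed, source=fe_test, cols=test_flag_cols)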


### 🔁 Advanced Imputation Strategies (Use Selectively)

Upgrade from simple imputation when:

- Missingness is correlated with important features
- Linear models behave poorly
- You need smoother estimates than median
- Data has strong local neighborhoods

#### Available Advanced Approaches:

| Method | When to Use | Pros | Cons |
|-------|-------------|------|------|
| `KNNImputer` | Numeric features, local similarity | Uses nearest rows | Slow on large data |
| `IterativeImputer` | Complex datasets, MAR patterns | Learns relationships | Risk of leakage |
| `SoftImpute` | Matrix-like data | Captures latent structure | Heavy assumptions |
| Group-based imputation | Business entities, customers | Logical, domain-aware | Requires domain sense |
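
Group-based imputation is listed in the table but has no helper in this notebook. A minimal sketch, assuming a hypothetical `store` grouping column and `sales` feature: per-group medians are learned on train, and groups that are unseen (or had no observed values) fall back to the global train median.

```python
import numpy as np
import pandas as pd


def fit_group_medians(train, group_col, value_col):
    """Learn per-group medians plus a global fallback, on train only."""
    return train.groupby(group_col)[value_col].median(), train[value_col].median()


def apply_group_medians(df, group_col, value_col, group_medians, global_median):
    """Fill NaNs with the group's median; unseen or all-NaN groups use the global median."""
    df = df.copy()
    fill = df[group_col].map(group_medians).fillna(global_median)
    df[value_col] = df[value_col].fillna(fill)
    return df


train_demo = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B"],
    "sales": [10.0, np.nan, 14.0, 100.0, 120.0],
})
test_demo = pd.DataFrame({"store": ["A", "B", "C"], "sales": [np.nan, np.nan, np.nan]})

group_medians, global_median = fit_group_medians(train_demo, "store", "sales")
train_filled = apply_group_medians(train_demo, "store", "sales", group_medians, global_median)
test_filled = apply_group_medians(test_demo, "store", "sales", group_medians, global_median)
```

Splitting the logic into a fit step and an apply step mirrors the `fit_transform` / `transform` pattern used by the sklearn imputers in this section.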


In [None]:
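# IterativeImputer is still experimental in scikit-learn, so it must be enabled
# explicitly before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

# KNN Imputer (better for local numeric structure).
# Distances use raw values, so large-scale features dominate; consider scaling first.
knn_imputer = KNNImputer(n_neighbors=5)

# Iterative Imputer (model-based estimation)
iterative_imputer = IterativeImputer(random_state=RANDOM_STATE)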


def apply_advanced_imputation(train_df, test_df=None, strategy="knn"):
    df_train = train_df.copy()
    df_test = test_df.copy() if test_df is not None else None

    num_cols = get_numeric_features(df_train, exclude=[TARGET_COL])

    if strategy == "knn":
        df_train[num_cols] = knn_imputer.fit_transform(df_train[num_cols])
        if df_test is not None:
            df_test[num_cols] = knn_imputer.transform(df_test[num_cols])

    elif strategy == "iterative":
        df_train[num_cols] = iterative_imputer.fit_transform(df_train[num_cols])
        if df_test is not None:
            df_test[num_cols] = iterative_imputer.transform(df_test[num_cols])

    else:
        raise ValueError("strategy must be 'knn' or 'iterative'")

    return df_train, df_test


### 🏁 Summary: Imputation Decision Rules

Use **Simple Imputation** when:
- You are using tree models
- Missingness is low-to-moderate
- Speed and stability matter

Use **Advanced Imputation** when:
- Missingness is patterned or high
- Linear models behave poorly
- Numeric relationships are strong
- You need smoother, model-aware values

Add **Missingness Flags** when:
- Missingness is likely informative
- Data is customer/entity-based
- You suspect MNAR

> 🚀 Start simple. Only escalate when evidence tells you to.


## 1️⃣2️⃣ Baseline Modeling & Model Comparison

At this point, the data has passed through:

1. EDA → understanding distributions and relationships  
2. Skewness & outlier handling  
3. Feature engineering (date parts, ratios, bins, etc.)  
4. Missingness and imputation

We now:

- Build **modular pipelines** for different model families
- Use the **same preprocessed dataset** (`train_imputed`) for all of them
- Compare performance (RMSE, R²) to see which family fits this problem best

> 🧠 Rule: only change **`model_type` strings** to switch between linear, tree-based, and deep learning models.  
> Preprocessing is chosen automatically based on model config (scaling, encoding, etc.).


## 1️⃣1️⃣ Model Families & Modular Selection

Up to this point, the notebook prepares the data:

- EDA → Skew → Outliers → Feature Engineering → Missingness
- (Next) Encoding & Scaling

Now we design a **modular way to pick models**, including:

- Tree-based models (LightGBM, XGBoost, RandomForest)
- Linear models (Ridge, ElasticNet)
- Neural nets / deep learning models (e.g., Keras MLP using tensors)

Instead of hard-coding everything per model, we create:

1. A **model registry**: a dictionary that describes each model type’s needs.
2. A **ModelConfig** object: captures what preprocessing is required.
3. A **factory function** that:
   - looks at `model_type` (e.g., `"lightgbm"`, `"elasticnet"`, `"keras_mlp"`)
   - builds a consistent pipeline:
     - numeric preprocessing (imputation, scaling)
     - categorical preprocessing (encoding)
     - model itself

---

### 🧠 Why abstract like this?

Different families have different needs:

| Model Family | Needs scaling? | Handles nonlinearity? | Handles categoricals directly? |
|--------------|----------------|------------------------|---------------------------------|
| Linear (Ridge, ElasticNet) | ✅ Yes | ❌ Only if we engineer | ❌ Needs encoding |
| Tree-based (RF, LGBM, XGB) | ❌ Not needed | ✅ Yes | ❌ Usually need encoding* |
| CatBoost | ❌ | ✅ | ✅ Yes (native cats) |
| Neural nets (Keras MLP) | ✅ Strongly recommended | ✅ (via layers) | ❌ Needs encoding & scaling |

\* some libraries have native cat support, but standard sklearn trees do not.

So we want our code to **understand** those requirements and set up preprocessing accordingly.
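
The "Needs scaling?" column can be checked empirically. The sketch below uses synthetic data (an assumption: two equally informative features on very different scales) and compares a Ridge model and a decision tree before and after `StandardScaler`. It illustrates the table rather than benchmarking anything.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
# Two equally informative features on very different scales (synthetic assumption).
x_big = rng.normal(size=n)
x_tiny = rng.normal(size=n) * 1e-2
X_raw = np.column_stack([x_big, x_tiny])
y = x_big + 100.0 * x_tiny + rng.normal(scale=0.1, size=n)

X_scaled = StandardScaler().fit_transform(X_raw)

# Ridge penalizes the large coefficient the tiny-scale feature needs, so scaling changes the fit.
ridge_r2_raw = Ridge(alpha=1.0).fit(X_raw, y).score(X_raw, y)
ridge_r2_scaled = Ridge(alpha=1.0).fit(X_scaled, y).score(X_scaled, y)

# Tree splits depend only on the ordering of values, so predictions are unchanged by scaling.
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_raw, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)
same_tree_preds = bool(np.allclose(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled)))

print(f"Ridge R² raw: {ridge_r2_raw:.3f} | scaled: {ridge_r2_scaled:.3f}")
print(f"Tree predictions identical: {same_tree_preds}")
```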

---

### 🧩 Design

We’ll define:

- A `ModelConfig` that stores:
  - `name`
  - `family` (e.g., `"tree"`, `"linear"`, `"neural_net"`)
  - `needs_scaling`
  - `handles_categoricals_natively`
  - `notes`
- A `MODEL_REGISTRY` dict mapping model names to `ModelConfig`.
- A function `get_model_config(model_type)` that returns the config.
- A function `build_model_pipeline(model_type, df, target_col)` that:
  - infers the numeric and categorical feature lists from `df`
  - builds a `ColumnTransformer` for numeric/categorical parts
  - attaches the appropriate model
  - returns a sklearn-style pipeline

For deep learning / tensors:

- We’ll add a **Keras MLP** option where the model is built via a small `build_keras_mlp_regressor` function.
- This gives you a path to use TensorFlow/tensors in the same framework.


In [None]:
# ========== 11. Model Family Config & Pipeline Factory ==========

from dataclasses import dataclass
from typing import Optional, List, Literal, Dict

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.ensemble import RandomForestRegressor

# Optional: comment out if packages aren't installed
try:
    from xgboost import XGBRegressor
except ImportError:
    XGBRegressor = None

try:
    from lightgbm import LGBMRegressor
except ImportError:
    LGBMRegressor = None

# Optional: TensorFlow / Keras for deep learning models
try:
    import tensorflow as tf
    from tensorflow import keras
except ImportError:
    tf = None
    keras = None


ModelFamily = Literal["linear", "tree", "neural_net"]


@dataclass
class ModelConfig:
    name: str
    family: ModelFamily
    needs_scaling: bool
    handles_categoricals_natively: bool = False   # e.g., CatBoost
    notes: str = ""


MODEL_REGISTRY: Dict[str, ModelConfig] = {
    # Linear models
    "ridge": ModelConfig(
        name="ridge",
        family="linear",
        needs_scaling=True,
        notes="Good baseline linear model with L2 regularization."
    ),
    "elasticnet": ModelConfig(
        name="elasticnet",
        family="linear",
        needs_scaling=True,
        notes="Mix of L1/L2 regularization, can do feature selection."
    ),

    # Tree-based models
    "random_forest": ModelConfig(
        name="random_forest",
        family="tree",
        needs_scaling=False,
        notes="Bagged trees, robust but can be slower."
    ),
    "xgboost": ModelConfig(
        name="xgboost",
        family="tree",
        needs_scaling=False,
        notes="Gradient boosting trees, strong for tabular."
    ),
    "lightgbm": ModelConfig(
        name="lightgbm",
        family="tree",
        needs_scaling=False,
        notes="Fast gradient boosting, great for large tabular."
    ),
    # You could add CatBoost separately with handles_categoricals_natively=True

    # Neural nets / deep learning
    "keras_mlp": ModelConfig(
        name="keras_mlp",
        family="neural_net",
        needs_scaling=True,
        notes="Dense neural net via Keras; expects scaled numeric + encoded cats."
    ),
}


def get_model_config(model_type: str) -> ModelConfig:
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type '{model_type}'. Available: {list(MODEL_REGISTRY.keys())}")
    return MODEL_REGISTRY[model_type]


# --- Builders for actual estimator objects ---

def build_sklearn_regressor(model_type: str):
    """Return an instantiated sklearn-compatible regressor given a model_type."""
    cfg = get_model_config(model_type)

    if cfg.family == "linear":
        if model_type == "ridge":
            return Ridge(alpha=1.0, random_state=RANDOM_STATE)
        elif model_type == "elasticnet":
            return ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=RANDOM_STATE)
    
    if cfg.family == "tree":
        if model_type == "random_forest":
            return RandomForestRegressor(
                n_estimators=300,
                max_depth=None,
                n_jobs=-1,
                random_state=RANDOM_STATE,
            )
        elif model_type == "xgboost":
            if XGBRegressor is None:
                raise ImportError("XGBRegressor not available. Install xgboost.")
            return XGBRegressor(
                n_estimators=500,
                learning_rate=0.05,
                max_depth=6,
                subsample=0.8,
                colsample_bytree=0.8,
                random_state=RANDOM_STATE,
                tree_method="hist",
            )
        elif model_type == "lightgbm":
            if LGBMRegressor is None:
                raise ImportError("LGBMRegressor not available. Install lightgbm.")
            return LGBMRegressor(
                n_estimators=500,
                learning_rate=0.05,
                max_depth=-1,
                subsample=0.8,
                subsample_freq=1,  # LightGBM ignores row subsampling unless this is > 0
                colsample_bytree=0.8,
                random_state=RANDOM_STATE,
            )

    if cfg.family == "neural_net" and model_type == "keras_mlp":
        if keras is None:
            raise ImportError("TensorFlow/Keras not available. Install tensorflow.")
        # We'll build this via a wrapper below
        return None  # placeholder; pipeline builder will handle

    raise ValueError(f"Model builder not implemented for model_type={model_type}")


# --- Keras MLP builder (for deep learning / tensors) ---

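# The return annotation is a string so this cell still runs when TensorFlow is
# not installed (in that case `keras` is None).
def build_keras_mlp_regressor(input_dim: int) -> "keras.Model":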
    """
    Simple dense neural net for regression.
    Assumes inputs are:
      - fully numeric
      - scaled
      - all categorical already encoded
    """
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1)  # regression output
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss="mse",
        metrics=["mae"]
    )
    return model


In [None]:
from sklearn.base import BaseEstimator, RegressorMixin
import numpy as np

# Optional: a simple sklearn-compatible wrapper for Keras
class KerasRegressorWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, build_fn, epochs=20, batch_size=32, verbose=0):
        self.build_fn = build_fn
        self.epochs = epochs
        self.batch_size = batch_size
        self.verbose = verbose
        self.model_ = None

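    @staticmethod
    def _to_dense(X):
        # The preprocessor may return a scipy sparse matrix (one-hot output);
        # Keras needs a dense float array.
        if hasattr(X, "toarray"):
            X = X.toarray()
        return np.asarray(X, dtype="float32")

    def fit(self, X, y):
        X = self._to_dense(X)
        y = np.asarray(y, dtype="float32")
        self.model_ = self.build_fn(input_dim=X.shape[1])
        self.model_.fit(
            X,
            y,
            epochs=self.epochs,
            batch_size=self.batch_size,
            verbose=self.verbose,
        )
        return self

    def predict(self, X):
        X = self._to_dense(X)
        preds = self.model_.predict(X, verbose=0)
        return preds.ravel()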


def build_model_pipeline(
    model_type: str,
    df: pd.DataFrame,
    target_col: str = TARGET_COL
) -> Pipeline:
    """
    Build a unified sklearn pipeline:
      - ColumnTransformer for numeric + categorical features
      - Optional scaling (depending on model family)
      - Chosen model estimator (sklearn or Keras wrapper)
    """
    cfg = get_model_config(model_type)

    # Identify feature sets
    num_cols = get_numeric_features(df, exclude=[target_col])
    cat_cols = get_categorical_features(df)

    numeric_transformers = []
    categorical_transformers = []

    # We assume imputation has already been done and we are now at encoding + scaling.
    # If you want imputation inside the pipeline instead, you can add SimpleImputer here.

    # Scaling: only if the model needs it (linear models & neural nets)
    if cfg.needs_scaling and len(num_cols) > 0:
        numeric_transformers.append(("scaler", StandardScaler()))
    # If no scaling needed, we just pass-through numeric features.

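    # One-hot encode categoricals by default (for models that don't handle cats natively).
    # The output is sparse by default; the old `sparse` keyword was renamed to
    # `sparse_output` in newer scikit-learn releases, so it is left at its default here.
    if len(cat_cols) > 0 and not cfg.handles_categoricals_natively:
        categorical_transformers.append(
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        )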

    # Build ColumnTransformer
    transformers = []
    if len(num_cols) > 0:
        transformers.append(("num", Pipeline(numeric_transformers) if numeric_transformers else "passthrough", num_cols))
    if len(cat_cols) > 0 and not cfg.handles_categoricals_natively:
        transformers.append(("cat", Pipeline(categorical_transformers), cat_cols))

    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder="drop",  # drop any columns not explicitly listed
    )

    # Build estimator
    if cfg.family == "neural_net" and model_type == "keras_mlp":
        if keras is None:
            raise ImportError("TensorFlow/Keras not installed, cannot build keras_mlp.")
        # We'll use the wrapper; input_dim will be inferred at fit time
        estimator = KerasRegressorWrapper(
            build_fn=build_keras_mlp_regressor,
            epochs=30,
            batch_size=64,
            verbose=0,
        )
    else:
        estimator = build_sklearn_regressor(model_type)

    model_pipeline = Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("model", estimator),
        ]
    )

    return model_pipeline


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Choose which preprocessed data to use:
# e.g., train_imputed from earlier steps (after FE + imputation)
data_for_model = train_imputed  # or fe_train / fe_train_imputed, etc.

X = data_for_model.drop(columns=[TARGET_COL])
y = data_for_model[TARGET_COL]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

for model_type in ["lightgbm", "xgboost", "random_forest", "elasticnet", "keras_mlp"]:
    try:
        print(f"\n=== Training model: {model_type} ===")
        cfg = get_model_config(model_type)
        print("Config:", cfg)

        pipeline = build_model_pipeline(model_type, data_for_model, target_col=TARGET_COL)
        pipeline.fit(X_train, y_train)

        y_pred = pipeline.predict(X_valid)
        # np.sqrt keeps this working on scikit-learn versions that no longer accept `squared=False`.
        rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
        r2 = r2_score(y_valid, y_pred)
        print(f"RMSE: {rmse:.4f} | R²: {r2:.4f}")
    except Exception as e:
        print(f"Skipping {model_type} due to error: {e}")


In [None]:
# ========== 12. Baseline Modeling & Model Comparison ==========

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_models(
    data: pd.DataFrame,
    target_col: str = TARGET_COL,
    id_col: str | None = ID_COL,
    model_types: list[str] | None = None,
    test_size: float = 0.2,
    random_state: int = RANDOM_STATE,
) -> pd.DataFrame:
    """
    Train several model types on a simple train/validation split
    and return a comparison table with RMSE and R².
    """
    if model_types is None:
        model_types = ["lightgbm", "xgboost", "random_forest", "elasticnet", "keras_mlp"]

    df = data.copy()

    # Drop ID column if present
    drop_cols = [target_col]
    if id_col is not None and id_col in df.columns:
        drop_cols.append(id_col)

    X = df.drop(columns=drop_cols)
    y = df[target_col]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    results = []

    for model_type in model_types:
        print(f"\n=== Training model: {model_type} ===")
        try:
            cfg = get_model_config(model_type)
            print("Config:", cfg)

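            # Build the pipeline only from columns that exist in X (the ID column
            # has been dropped), plus the target.
            pipeline = build_model_pipeline(
                model_type, df[list(X.columns) + [target_col]], target_col=target_col
            )
            pipeline.fit(X_train, y_train)

            y_pred = pipeline.predict(X_valid)
            # np.sqrt keeps this working on scikit-learn versions that no longer accept `squared=False`.
            rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
            r2 = r2_score(y_valid, y_pred)
            print(f"RMSE: {rmse:.4f} | R²: {r2:.4f}")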

            results.append({
                "model_type": model_type,
                "family": cfg.family,
                "needs_scaling": cfg.needs_scaling,
                "handles_categoricals_natively": cfg.handles_categoricals_natively,
                "rmse": rmse,
                "r2": r2,
            })
        except Exception as e:
            print(f"Skipping {model_type} due to error: {e}")
            results.append({
                "model_type": model_type,
                "family": None,
                "needs_scaling": None,
                "handles_categoricals_natively": None,
                "rmse": np.nan,
                "r2": np.nan,
            })

    results_df = pd.DataFrame(results).sort_values("rmse")
    display(results_df)
    return results_df


In [None]:
# Choose which dataset to model on:
# After all steps, this should usually be train_imputed
data_for_model = train_imputed  # or fe_train/train_df_processed if you're experimenting

# Pick which models to compare; remove any you don't have installed
models_to_try = [
    "lightgbm",      # if lightgbm installed
    "xgboost",       # if xgboost installed
    "random_forest",
    "elasticnet",
    "keras_mlp",     # if tensorflow/keras installed
]

model_results = evaluate_models(
    data=data_for_model,
    target_col=TARGET_COL,
    id_col=ID_COL,
    model_types=models_to_try,
    test_size=0.2,
    random_state=RANDOM_STATE,
)


## 🔮 Next Steps Roadmap (For Future Refinement)

The current notebook covers:

1. EDA: distributions, correlations, relationships  
2. Skew & outliers  
3. Feature engineering (date parts, ratios, bins, polynomial options)  
4. Missingness & imputation (simple + advanced)  
5. Modular modeling: linear, tree-based, and neural net (Keras MLP)

To go further in a competition setting, consider adding these **later**, in this order:

---

### 1️⃣ Encoding Variants (Beyond One-Hot)

**When:**
- You have high-cardinality categoricals (many unique values)
- One-hot encoding explodes feature count
- You want more signal from rare categories

**What to add:**
- Target encoding for high-cardinality features (computed out-of-fold, so a row's own target never leaks into its encoding)
- Frequency / count encoding
- Leave-one-out encoding

**Where in notebook:**
- Replace or extend the categorical branch inside `build_model_pipeline`:
  - Instead of `OneHotEncoder`, use a target encoder (e.g. `TargetEncoder` from the `category_encoders` library) for selected columns.
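
As a rough illustration of the first two options, here is a minimal sketch of frequency encoding and out-of-fold target encoding using only pandas and scikit-learn's `KFold`. The helper names (`frequency_encode`, `oof_target_encode`), the `smoothing` parameter, and the toy `city` / `price` columns are assumptions made for this example; they are not part of the notebook's existing code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def frequency_encode(train_col: pd.Series, apply_col: pd.Series) -> pd.Series:
    """Map each category to its relative frequency in the training column."""
    freqs = train_col.value_counts(normalize=True)
    # Categories never seen in training get frequency 0.
    return apply_col.map(freqs).fillna(0.0)


def oof_target_encode(df, col, target_col, n_splits=5, smoothing=10.0, random_state=42):
    """Out-of-fold mean target encoding with additive smoothing (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target_col].mean()
        stats = fit_part.groupby(col)[target_col].agg(["sum", "count"])
        # Shrink each category mean toward the fold's global mean.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Rows in the held-out part are encoded from the other folds only.
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data, invented purely for illustration
demo = pd.DataFrame({
    "city": ["a", "a", "b", "b", "c", "a", "b", "c", "a", "b"],
    "price": [10.0, 12.0, 20.0, 22.0, 30.0, 11.0, 21.0, 31.0, 9.0, 19.0],
})
demo["city_te"] = oof_target_encode(demo, "city", "price", n_splits=5)
demo["city_freq"] = frequency_encode(demo["city"], demo["city"])
```

For test data you would typically fit the category means on the full training set and map them onto the test column, since test rows have no target to leak.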

---

### 2️⃣ Cross-Validation Instead of Single Train/Valid Split

**When:**
- You need more stable and reliable scores
- The public leaderboard (LB) is noisy, and you want robust local validation

**What to add:**
- KFold (general regression); StratifiedKFold (classification, or regression only after binning the target)
- GroupKFold (if grouped entities like customers, stores)
- TimeSeriesSplit (if time-ordered data)

**Where in notebook:**
- Replace the `train_test_split` inside `evaluate_models` with a CV loop:
  - For each fold: fit pipeline → predict → collect metrics → average.
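
The fold loop could look roughly like the sketch below. It uses a plain `Ridge` model on synthetic NumPy arrays so it runs on its own; `cross_validate_rmse` is a hypothetical helper name, and in the notebook you would pass the pipeline from `build_model_pipeline` and index DataFrames with `.iloc` instead.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold


def cross_validate_rmse(estimator, X, y, n_splits=5, random_state=42):
    """Fit a fresh clone per fold and return averaged RMSE / R² (hypothetical helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    fold_rmse, fold_r2 = [], []
    for train_idx, valid_idx in kf.split(X):
        model = clone(estimator)  # never reuse a fitted model across folds
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[valid_idx])
        fold_rmse.append(float(np.sqrt(mean_squared_error(y[valid_idx], y_pred))))
        fold_r2.append(float(r2_score(y[valid_idx], y_pred)))
    return {
        "rmse_mean": float(np.mean(fold_rmse)),
        "rmse_std": float(np.std(fold_rmse)),
        "r2_mean": float(np.mean(fold_r2)),
        "fold_rmse": fold_rmse,
    }


# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
cv_scores = cross_validate_rmse(Ridge(alpha=1.0), X_demo, y_demo)
```

Reporting the standard deviation next to the mean helps you judge whether a difference between two models is larger than fold-to-fold noise.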

---

### 3️⃣ Model Ensembling / Blending

**When:**
- You have several strong but different models (e.g., LGBM, XGB, ElasticNet)
- Their errors are only weakly correlated (models that make the same mistakes add little when blended)
- You want the last bit of LB improvement

**What to add:**
- Simple average of predictions (mean of model outputs)
- Weighted average based on validation RMSE
- Stacking model that takes multiple model predictions as input features

**Where in notebook:**
- After `evaluate_models`, add a cell that:
  - Collects each model's predictions on the held-out validation set and chooses blend weights there
  - Refits the chosen models on the whole training data and applies the same weights to their test predictions
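
A weighted average based on validation RMSE can be sketched in plain NumPy as follows. The inverse-RMSE weighting shown here is one simple choice among many, and the function names and synthetic predictions are assumptions made for this example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def inverse_rmse_weights(y_valid, preds_by_model):
    """Weight each model by 1 / validation RMSE, normalized to sum to 1.

    Assumes no model has an RMSE of exactly zero.
    """
    inv = {name: 1.0 / rmse(y_valid, p) for name, p in preds_by_model.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}


def blend(preds_by_model, weights):
    """Weighted average of per-model prediction arrays."""
    return sum(weights[name] * np.asarray(p) for name, p in preds_by_model.items())


# Synthetic validation predictions with independent errors, for illustration only
rng = np.random.default_rng(0)
y_valid_demo = rng.normal(size=2000)
preds_demo = {
    "model_a": y_valid_demo + rng.normal(scale=1.0, size=2000),
    "model_b": y_valid_demo + rng.normal(scale=1.2, size=2000),
}
weights_demo = inverse_rmse_weights(y_valid_demo, preds_demo)
blended_demo = blend(preds_demo, weights_demo)
```

In this toy setup the errors are independent by construction, which is the best case for blending; real models trained on the same data usually gain less.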

---

### 4️⃣ Moving Imputation/Encoding/Scaling Fully Inside Pipelines

**When:**
- You are ready for cleaner, production-like code
- You want to avoid mistakes between train/test transformations

**What to add:**
- Move SimpleImputer and/or advanced imputers inside the `ColumnTransformer`
- Ensure all transforms are fit only on training data

**Where in notebook:**
- Modify `build_model_pipeline`:
  - Add `SimpleImputer` to numeric and categorical transformers
  - Remove external imputation steps (so you no longer need `train_imputed` and can feed `fe_train` directly).
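
One possible shape for this change is sketched below. `build_pipeline_with_imputation` is a hypothetical stand-in, not the notebook's `build_model_pipeline`, and the imputation strategies (median, most frequent) and toy columns are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline_with_imputation(numeric_cols, categorical_cols, model):
    """Impute, scale, and encode inside the pipeline so every step is fit on training data only."""
    numeric_tf = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_tf = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_tf, numeric_cols),
        ("cat", categorical_tf, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Toy frames with missing values, invented purely for illustration
demo_train = pd.DataFrame({
    "area": [50.0, np.nan, 80.0, 120.0, 65.0, 95.0],
    "city": pd.Series(["a", "b", np.nan, "a", "b", "a"], dtype=object),
    "price": [100.0, 150.0, 160.0, 240.0, 130.0, 190.0],
})
demo_test = pd.DataFrame({
    "area": [np.nan, 70.0],
    "city": pd.Series(["c", np.nan], dtype=object),  # "c" is unseen in training
})

pipe = build_pipeline_with_imputation(["area"], ["city"], ElasticNet(alpha=0.1))
pipe.fit(demo_train[["area", "city"]], demo_train["price"])
test_preds = pipe.predict(demo_test)
```

Because the imputers live inside the pipeline, the same object also works unchanged inside a cross-validation loop: each fold refits the medians and modes on its own training part.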

---

📌 **For now:**  
You have a complete, end-to-end, logically ordered notebook that:

- Teaches you the process
- Runs with simple, robust defaults
- Lets you plug in trees, linear models, and a Keras MLP

When you’re comfortable with this, pick **one** of the roadmap items above and implement it incrementally instead of all at once.
