# Tabular Regression Workflow Template (EDA → Skew → Outliers → Model Choice)

This notebook is a **reusable template** for tabular regression problems (e.g. Kaggle competitions).  
It includes both **code** and a **granular decision workflow** so you can follow the same thought process every time.

---

## 🔁 High-Level Workflow

1. **Set config & load data**
2. **Understand structure**: dtypes, missingness, basic stats
3. **Explore numeric features**: distributions, feature–target relationships, correlations
4. **Explore categorical & boolean features**
5. **Quantify skewness & kurtosis** and decide on transformations
6. **Detect & handle outliers** (winsorize, remove, or flag)
7. **Assess relationship shape** (linear vs monotonic vs nonlinear)
8. **Choose model family** based on aggregate shape (linear vs trees)
9. **(Later) Build preprocessing + baseline model + CV**

You can duplicate this notebook for any new regression competition and only change the config (paths, target name, etc.).


## 🧭 Decision Workflow Cheat Sheet (Granular Rules)

Use this as a **mental and practical checklist** every time.

### 1️⃣ Data & Structure

1. Load train (and test if available).
2. Check:
   - `shape` (rows, columns)
   - dtypes
   - missing values
   - obvious ID columns
3. Identify initial column groups:
   - Numeric features
   - Categorical features
   - Boolean / 0–1 features
   - Target column
   - ID column(s)

> 📌 **Action**: If something looks wrong (e.g. target all zeros, date parsed as object, weird dtypes), fix it **before** going further.
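
As an illustration of the kind of fix meant here, the sketch below repairs two common dtype problems on a toy frame. The column names (`sale_date`, `price`) are hypothetical, and `errors="coerce"` is one possible policy: it turns unparseable entries into `NaT`/`NaN` so they surface in the missing-value check instead of raising.

```python
import pandas as pd

# Toy frame with hypothetical column names: a date and a number both read in as strings
raw_df = pd.DataFrame({
    "sale_date": ["2024-01-05", "2024-02-17", "not a date"],
    "price": ["100.5", "250", "n/a"],
})

fixed_df = raw_df.copy()
# Unparseable entries become NaT / NaN instead of raising an error
fixed_df["sale_date"] = pd.to_datetime(fixed_df["sale_date"], errors="coerce")
fixed_df["price"] = pd.to_numeric(fixed_df["price"], errors="coerce")

print(fixed_df.dtypes)
print(fixed_df.isna().sum())
```

After a fix like this, re-run the structure check so the new missing values are counted before any EDA.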

---

### 2️⃣ Numeric EDA: Distributions & Relationships

For numeric columns (excluding target and IDs):

1. Plot histograms for each numeric feature.
2. Plot **target distribution** (hist + boxplot).
3. Plot **scatter plots** of feature vs target for a subset of numeric features.
4. Compute correlations:
   - Pearson (linear) 
   - Spearman (rank / monotonic) for sanity checks later (optional).

**Interpretation rules:**

- If a feature's scatter vs target looks roughly like a **straight band** → relationship is approximately **linear**.
- If it is curved (U-shape, log curve, exponential, plateauing) → **nonlinear**.
- If there is no clear pattern → likely **weak/no signal** or dominated by noise.

> 📌 **If most of your strong features look linear:**  
> → Linear models (Ridge/ElasticNet) are a good first baseline (after transformations).  
> 📌 **If most look clearly nonlinear/curved/step-like:**  
> → Start with tree-based models (LightGBM/XGBoost/CatBoost).  
> 📌 **If it's a mix or unclear:**  
> → Start with a tree model (safe default), then experiment with linear models later.

---

### 3️⃣ Categorical & Boolean EDA

For categorical features:

- Look at value counts.
- Compute target statistics by category (mean, count, etc.).
- Plot **boxplots/violins** of target vs category.
- Compute **Cramér's V** between categoricals to find redundancies.

For boolean / 0‚Äì1 features:

- Compute **point-biserial correlation** with the target.
- Plot boxplots of target vs boolean value.

**Interpretation rules:**

- Categories with very different target means are **highly informative**.
- Categoricals strongly associated with each other (high Cramér's V) may be redundant.
- Boolean features with high |correlation| with target are good candidates to keep; others might be weak.

> 📌 **Action**:  
> - Plan encodings (one-hot, target encoding, CatBoost handling).  
> - Consider merging rare categories if cardinality is high.
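
One way to merge rare categories is sketched below. The helper name `merge_rare_categories`, the `min_count` threshold and the `__other__` label are illustrative assumptions, not part of this notebook's helpers; the sketch also assumes an object/string column (a pandas `category` column would need the new label added to its categories first).

```python
import pandas as pd

def merge_rare_categories(s: pd.Series, min_count: int = 10, other_label: str = "__other__") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with a single label."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition is True and swaps in other_label elsewhere;
    # missing values are never in `rare`, so they stay missing
    return s.where(~s.isin(rare), other_label)

city = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 3 + ["D"] * 2)
merged = merge_rare_categories(city, min_count=10)
print(merged.value_counts())
```

For modeling, the set of rare levels would normally be learned on the training data and then applied unchanged to validation and test data.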

---

### 4️⃣ Skewness & Kurtosis: Shape of Numeric Distributions

For each numeric feature (including target):

- Compute **skewness** and **kurtosis**.

**Skewness rules of thumb:**

- `|skew| < 0.5` → approximately symmetric
- `0.5 ≤ |skew| ≤ 1.0` → moderately skewed
- `|skew| > 1.0` → highly skewed

**Kurtosis rules (Pearson definition, i.e. `fisher=False` in SciPy):**

- `≈ 3` → roughly normal tails
- `> 3` → heavy tails (more outliers)
- `< 3` → light tails
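
To make the "3" reference point concrete, here is a small sketch that computes Pearson kurtosis by hand (fourth central moment over squared variance) on three synthetic samples. The helper `pearson_kurtosis` is written for illustration only; it should agree with `scipy.stats.kurtosis(x, fisher=False)` under SciPy's default biased estimator.

```python
import numpy as np

def pearson_kurtosis(x) -> float:
    """Fourth central moment divided by squared variance (a normal distribution gives 3)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / m2 ** 2

rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(size=100_000),
    "laplace (heavy tails)": rng.laplace(size=100_000),
    "uniform (light tails)": rng.uniform(size=100_000),
}
kurt_values = {name: pearson_kurtosis(x) for name, x in samples.items()}
for name, k in kurt_values.items():
    # Excess (Fisher) kurtosis is simply the Pearson value minus 3
    print(f"{name:<22} Pearson kurtosis = {k:.2f}, excess (Fisher) = {k - 3:.2f}")
```

Note that pandas' `Series.kurt()` reports excess kurtosis (normal is 0), so its values should be compared against 0 rather than 3.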

**Transformation decisions:**

- If `|skew| < 0.5` → leave as is (no transform needed for shape).
- If `0.5 ≤ |skew| ≤ 1.0`:
  - Consider **log1p** or **sqrt** transform for **right-skewed** (positive skew) features.
  - For **left-skewed** features, you can reflect: `x' = max(x) - x + 1`, then log/sqrt.
- If `|skew| > 1.0`:
  - Strong candidate for transformation:
    - `log1p(x)` if x ≥ 0
    - Box-Cox (strictly positive values only) or Yeo–Johnson (also handles zeros and negatives) if more flexibility is needed
  - Also examine for outliers.
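
The transformation rules above can be sketched on synthetic data as follows. The two series are made up for illustration (a lognormal sample and its mirror image); the sketch only shows the `log1p` and reflect-then-log options, not Box-Cox or Yeo–Johnson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))        # long right tail
left_skewed = pd.Series(10.0 - rng.lognormal(mean=0.0, sigma=0.5, size=5_000))  # long left tail

# Right skew: log1p compresses the long upper tail (defined for x > -1)
right_fixed = np.log1p(right_skewed)

# Left skew: reflect so the tail points right (x' = max(x) - x + 1 >= 1), then take the log
reflected = left_skewed.max() - left_skewed + 1
left_fixed = np.log(reflected)

# pandas' .skew() is bias-corrected, so values differ slightly from scipy.stats.skew defaults
skew_report = pd.DataFrame({
    "before": [right_skewed.skew(), left_skewed.skew()],
    "after": [right_fixed.skew(), left_fixed.skew()],
}, index=["right_skewed -> log1p", "left_skewed -> reflect + log"])
print(skew_report.round(3))
```

Keep in mind that reflecting reverses the ordering of the feature, so the sign of its relationship with the target flips.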

> 📌 **Model choice impact:**  
> - Linear models prefer **low skew + near-normal residuals**.  
> - Tree models are insensitive to monotonic transforms of the *features*, so skewed inputs are fine; transforming a heavily skewed *target* can still improve generalization.

---

### 5️⃣ Outlier Detection & Handling

Goal: reduce the effect of **unreasonably extreme values** that can distort training, especially for linear models.

Steps:

1. For key numeric features (and/or all of them):
   - Use **IQR rule** or **z-score** to flag outliers.
   - Optionally use **IsolationForest** for multivariate detection.
2. Compare target distribution **with vs without** outliers to see their impact.

**Decision rules:**

- **If outliers are legitimate signal** (e.g., very rich customers, very large valid sales):  
  - Consider **keeping them**, especially if using tree-based models.
- **If outliers are likely errors / noise / impossible values**:  
  - Remove those rows outright.
- **If outliers are extreme but plausible and hurting linear models**:  
  - Use **winsorization** (clip to low/high percentiles, e.g. 1% and 99%).  
  - Or transform (log) then clip less aggressively.

General strategies:

- `strategy="winsorize"` → good baseline for regression.  
- `strategy="remove"` → use cautiously; track % of data removed.  
- `strategy="flag"` → keep original values but add `_outlier` indicator features.

> 📌 **Best practice:** For contest work, start with winsorizing or flagging rather than deleting.
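
When a test set is involved, a common precaution is to learn the clipping bounds on the training data and reuse them on the test data, so both are clipped identically. The sketch below assumes that approach; `fit_clip_bounds` and `apply_clip_bounds` are hypothetical helper names and the `income` column is toy data.

```python
import pandas as pd

def fit_clip_bounds(train: pd.DataFrame, cols, lower_q: float = 0.01, upper_q: float = 0.99) -> dict:
    """Learn per-column clipping bounds from the training data only."""
    return {c: (train[c].quantile(lower_q), train[c].quantile(upper_q)) for c in cols}

def apply_clip_bounds(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Clip each column to previously learned bounds; other columns are left untouched."""
    out = df.copy()
    for col, (low, high) in bounds.items():
        out[col] = out[col].clip(low, high)
    return out

toy_train = pd.DataFrame({"income": list(range(1, 101))})  # values 1..100
toy_test = pd.DataFrame({"income": [-50, 10, 500]})

bounds = fit_clip_bounds(toy_train, ["income"])
toy_test_clipped = apply_clip_bounds(toy_test, bounds)
print(bounds)
print(toy_test_clipped)
```

The same `bounds` dictionary can be applied inside each cross-validation fold, fitted on the training part of the fold only.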

---

### 6️⃣ Relationship Shape & Model Family

Use scatter plots, Pearson vs Spearman correlations, and your EDA impressions to classify features:

- **Linear relationship**: roughly straight trend in scatter, high Pearson & Spearman.
- **Nonlinear monotonic**: curved trend but always increasing/decreasing; Spearman noticeably higher than Pearson.
- **Nonlinear non-monotonic**: U-shapes, plateaus, or complicated patterns.
- **No clear relationship**: cloud with no pattern.

**Model choice rules:**

- If **most strong features are linear** **and** you're comfortable with transformations:  
  → Try a **linear regression / Ridge / ElasticNet baseline** after fixing skew & outliers.
- If **many features are clearly nonlinear or monotonic but curved**:  
  → Prefer **tree-based gradient boosting** (LightGBM/XGBoost/CatBoost).
- If the picture is **mixed** (some linear, some nonlinear) or unclear:  
  → Start with **LightGBM** (good default for tabular).  
  → Later, build a **linear baseline** to compare.

> ❗ You almost never build separate models per feature.  
> You **transform features** based on their shapes, then feed them into a single model (or ensemble).

---

### 7️⃣ Putting It All Together (Execution Flow)

When you open a new regression dataset, follow this order:

1. **Config & Data Load**
   - Set paths, target name, ID column.
   - Load `train_df` (and `test_df` if available).

2. **Initial Structure Check**
   - Run `summarize_dataframe(train_df)`.
   - Fix obvious issues (dtypes, weird IDs, broken target).

3. **Column Typing**
   - Use helpers to get `num_cols`, `cat_cols`, `bool_cols`.

4. **Numeric EDA**
   - Plot target distribution.  
   - Plot numeric feature histograms and a subset of feature-vs-target scatter plots.  
   - Check correlation with target (Pearson).

5. **Categorical & Boolean EDA**
   - Examine value counts.  
   - Summarize target by category and plot box/violin.  
   - Compute Cramér's V matrix for categoricals; point-biserial correlations for booleans.

6. **Skewness & Kurtosis**
   - Compute skew/kurtosis for all numeric features.  
   - Decide which features are candidates for log / other transformations.

7. **Outliers**
   - Use IQR/Z-score/IsolationForest to flag outliers on key columns.  
   - Compare target with/without to see impact.  
   - Apply chosen strategy: winsorize / remove / flag.

8. **Model Strategy Planning**
   - Based on shapes and correlations:  
     - If mostly linear → plan a linear model baseline + engineered features.  
     - If mostly nonlinear → plan tree-based models.  
     - If mixed → start with trees, later add linear baseline.

9. **(Next Notebook Sections)**
   - Implement preprocessing (encoders, scalers, transformers).  
   - Build train/validation split (KFold, TimeSeriesSplit, etc.).  
   - Train baseline models and compare metrics.  
   - Iterate with feature engineering and ensembling.


In [None]:
# ========== 1. Imports & Config ==========

import os
from pathlib import Path
from typing import Optional, List

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import skew, kurtosis, chi2_contingency, pointbiserialr, zscore
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import IsolationForest

# Display & plotting options
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100

# ---- Project-level config (edit per dataset/competition) ----
DATA_DIR = Path("../input")      # change to your data path
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"           # set to None if no test set

TARGET_COL = "target"            # change to your target column
ID_COL = "id"                    # change or set to None if no ID

RANDOM_STATE = 42


## 2️⃣ Load Data

Edit the `DATA_DIR`, `TRAIN_FILE`, `TEST_FILE`, and `TARGET_COL` in the config cell above to match your dataset.

Then run this cell to load your train (and test, if present) data.


In [None]:
def load_data(
    data_dir: Path = DATA_DIR,
    train_file: str = TRAIN_FILE,
    test_file: Optional[str] = TEST_FILE,
):
    """Load train/test DataFrames from CSV."""
    train_path = data_dir / train_file
    if not train_path.exists():
        raise FileNotFoundError(f"Train file not found: {train_path}")
        
    train_df = pd.read_csv(train_path)
    
    test_df = None
    if test_file is not None:
        test_path = data_dir / test_file
        if test_path.exists():
            test_df = pd.read_csv(test_path)
        else:
            print(f"Test file not found: {test_path} (continuing without test_df)")
    
    print("Train shape:", train_df.shape)
    if test_df is not None:
        print("Test shape :", test_df.shape)
    else:
        print("Test data  : None")
    
    return train_df, test_df


train_df, test_df = load_data()


## 3️⃣ Column Typing & Initial Summary

Use these helpers to:

- Identify numeric, categorical, and boolean/0–1 features
- Get a quick overview of the data structure, missing values, and basic stats


In [None]:
def get_numeric_features(df: pd.DataFrame, exclude: Optional[List[str]] = None) -> List[str]:
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if exclude:
        num_cols = [c for c in num_cols if c not in exclude]
    return num_cols


def get_categorical_features(df: pd.DataFrame) -> List[str]:
    return df.select_dtypes(include=["object", "category"]).columns.tolist()


def get_boolean_features(df: pd.DataFrame) -> List[str]:
    bool_cols = df.select_dtypes(include=["bool"]).columns.tolist()
    for col in df.select_dtypes(include=["int64", "int32", "int16"]).columns:
        unique_vals = df[col].dropna().unique()
        if len(unique_vals) <= 2 and set(unique_vals).issubset({0, 1}):
            bool_cols.append(col)
    return list(dict.fromkeys(bool_cols))


def summarize_dataframe(df: pd.DataFrame, name: str = "df"):
    print(f"===== {name} SUMMARY =====")
    print("Shape:", df.shape)
    
    print("\nFirst 5 rows:")
    display(df.head())

    print("\nDtypes:")
    display(df.dtypes)

    print("\nMissing values (count):")
    display(df.isna().sum().sort_values(ascending=False))

    print("\nBasic describe (numeric):")
    display(df.describe().T)

    print("\nPossible categorical columns (heuristic):")
    cat_like = []
    for col in df.columns:
        if df[col].dtype == "object":
            cat_like.append(col)
        else:
            unique_vals = df[col].nunique()
            if unique_vals < 20 and str(df[col].dtype).startswith("int"):
                cat_like.append(col)
    print(cat_like)


summarize_dataframe(train_df, name="train_df")


num_cols = get_numeric_features(
    train_df,
    exclude=[TARGET_COL] + ([ID_COL] if ID_COL in train_df.columns else [])
)
cat_cols = get_categorical_features(train_df)
bool_cols = get_boolean_features(train_df)

print("Numeric features (first 10):", num_cols[:10], "..." if len(num_cols) > 10 else "")
print("Categorical features:", cat_cols)
print("Boolean features:", bool_cols)


## 4️⃣ Numeric EDA: Distributions & Correlations

Follow this sequence:

1. Inspect target distribution.  
2. Inspect numeric feature distributions.  
3. Look at scatter plots of feature vs target.  
4. Examine correlations with the target.  


In [None]:
def plot_target_distribution(df: pd.DataFrame, target_col: str = TARGET_COL):
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    sns.histplot(df[target_col], kde=True, ax=axes[0])
    axes[0].set_title(f"Distribution of {target_col}")

    sns.boxplot(x=df[target_col], ax=axes[1])
    axes[1].set_title(f"Boxplot of {target_col}")

    plt.tight_layout()
    plt.show()


def plot_numeric_distributions(df: pd.DataFrame, max_cols: int = 12):
    # Exclude the ID column as well: its histogram carries no information
    num_cols_local = get_numeric_features(df, exclude=[TARGET_COL, ID_COL])
    num_cols_local = num_cols_local[:max_cols]

    n = len(num_cols_local)
    if n == 0:
        print("No numeric features to plot.")
        return

    n_cols = 3
    n_rows = int(np.ceil(n / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(num_cols_local):
        sns.histplot(df[col], kde=False, ax=axes[i])
        axes[i].set_title(col)

    for j in range(i + 1, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


def plot_feature_vs_target(df: pd.DataFrame, target_col: str = TARGET_COL, max_cols: int = 6):
    # Exclude the ID column as well so it does not take up one of the scatter plots
    num_cols_local = get_numeric_features(df, exclude=[target_col, ID_COL])
    num_cols_local = num_cols_local[:max_cols]

    n = len(num_cols_local)
    if n == 0:
        print("No numeric features to plot vs target.")
        return

    n_cols = 3
    n_rows = int(np.ceil(n / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(num_cols_local):
        sns.scatterplot(x=df[col], y=df[target_col], ax=axes[i], alpha=0.4)
        axes[i].set_xlabel(col)
        axes[i].set_ylabel(target_col)
        axes[i].set_title(f"{col} vs {target_col}")

    for j in range(i + 1, len(axes)):
        axes[j].axis("off")

    plt.tight_layout()
    plt.show()


def correlation_with_target(df: pd.DataFrame, target_col: str = TARGET_COL, top_n: int = 20):
    num_cols_local = df.select_dtypes(include=[np.number]).columns.tolist()
    if target_col not in num_cols_local:
        print(f"Target {target_col} is not numeric or not in df.")
        return

    # Drop the target's correlation with itself (always 1.0) before ranking
    corr = df[num_cols_local].corr()[target_col].drop(target_col).sort_values(ascending=False)
    print("Top positively correlated with target:")
    display(corr.head(top_n))
    print("\nTop negatively correlated with target:")
    display(corr.tail(top_n))


def plot_correlation_heatmap(df: pd.DataFrame, target_col: str = TARGET_COL, top_n: int = 20):
    num_cols_local = df.select_dtypes(include=[np.number]).columns.tolist()
    if target_col not in num_cols_local:
        print(f"Target {target_col} is not numeric or not in df.")
        return

    corr_series = df[num_cols_local].corr()[target_col].drop(target_col)
    top_features = corr_series.abs().sort_values(ascending=False).head(top_n).index.tolist()
    cols_to_plot = top_features + [target_col]

    corr_matrix = df[cols_to_plot].corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=False, cmap="coolwarm", center=0)
    plt.title(f"Correlation heatmap (top {top_n} correlated with {target_col})")
    plt.tight_layout()
    plt.show()


# Run core numeric EDA
plot_target_distribution(train_df, TARGET_COL)
plot_numeric_distributions(train_df)
plot_feature_vs_target(train_df, TARGET_COL)
correlation_with_target(train_df, TARGET_COL)
plot_correlation_heatmap(train_df, TARGET_COL)


## 5️⃣ Categorical & Boolean EDA

Now examine **categorical** and **boolean** predictors:

- How the target varies across categories
- How categoricals relate to each other (Cramér's V)
- How booleans relate to the target (point-biserial correlation)


In [None]:
def cramers_v(x, y) -> float:
    confusion_matrix = pd.crosstab(x, y)
    chi2, p, dof, expected = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    k = min(confusion_matrix.shape) - 1
    if k == 0:
        return np.nan
    return np.sqrt((chi2 / n) / k)


def cramers_v_matrix(df: pd.DataFrame, cols: Optional[List[str]] = None) -> pd.DataFrame:
    if cols is None:
        cols = get_categorical_features(df)

    n = len(cols)
    result = pd.DataFrame(np.ones((n, n)), index=cols, columns=cols)

    for i in range(n):
        for j in range(i + 1, n):
            v = cramers_v(df[cols[i]], df[cols[j]])
            result.iloc[i, j] = v
            result.iloc[j, i] = v

    return result


def plot_cramers_v_heatmap(df: pd.DataFrame, cols: Optional[List[str]] = None):
    cv_mat = cramers_v_matrix(df, cols)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cv_mat, annot=False, cmap="coolwarm", vmin=0, vmax=1)
    plt.title("Cramér's V between categorical features")
    plt.tight_layout()
    plt.show()


def summarize_target_by_category(
    df: pd.DataFrame,
    cat_col: str,
    target_col: str = TARGET_COL,
    sort_by: str = "mean",
) -> pd.DataFrame:
    summary = (
        df.groupby(cat_col)[target_col]
        .agg(["count", "mean", "std", "min", "max"])
        .sort_values(by=sort_by, ascending=False)
    )
    display(summary)
    return summary


def plot_target_by_category(
    df: pd.DataFrame,
    cat_col: str,
    target_col: str = TARGET_COL,
    max_categories: int = 20,
    kind: str = "box",
):
    if df[cat_col].nunique() > max_categories:
        top_cats = df[cat_col].value_counts().head(max_categories).index
        data = df[df[cat_col].isin(top_cats)].copy()
    else:
        data = df

    plt.figure(figsize=(12, 6))
    if kind == "box":
        sns.boxplot(x=cat_col, y=target_col, data=data)
    elif kind == "violin":
        sns.violinplot(x=cat_col, y=target_col, data=data, cut=0)
    else:
        raise ValueError("kind must be 'box' or 'violin'")

    plt.title(f"{target_col} by {cat_col}")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()


def pointbiserial_correlations_with_target(
    df: pd.DataFrame,
    bool_cols: Optional[List[str]] = None,
    target_col: str = TARGET_COL,
) -> pd.DataFrame:
    if bool_cols is None:
        bool_cols = get_boolean_features(df)

    results = []
    for col in bool_cols:
        series = df[col]
        if series.dtype == "bool":
            series = series.astype(int)

        mask = series.notna() & df[target_col].notna()
        if mask.sum() == 0:
            corr = np.nan
            pval = np.nan
        else:
            corr, pval = pointbiserialr(series[mask], df[target_col][mask])
        results.append({"feature": col, "corr": corr, "p_value": pval})

    res_df = pd.DataFrame(results).sort_values("corr", key=lambda x: x.abs(), ascending=False)
    display(res_df)
    return res_df


def plot_target_by_boolean(df: pd.DataFrame, bool_col: str, target_col: str = TARGET_COL):
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[bool_col].astype(str), y=df[target_col])
    plt.title(f"{target_col} by {bool_col} (bool)")
    plt.xlabel(bool_col)
    plt.tight_layout()
    plt.show()


# Run basic categorical/boolean EDA if there are such columns
if len(cat_cols) > 0:
    # Cramér's V needs at least two categorical columns to compare
    if len(cat_cols) > 1:
        print("\nCramér's V heatmap for categorical features:")
        plot_cramers_v_heatmap(train_df, cat_cols)
    for col in cat_cols[:5]:
        print(f"\n=== {col} vs {TARGET_COL} ===")
        summarize_target_by_category(train_df, col, TARGET_COL)
        plot_target_by_category(train_df, col, TARGET_COL, kind="box")


if len(bool_cols) > 0:
    print("\nPoint-biserial correlations with target for boolean features:")
    pointbiserial_correlations_with_target(train_df, bool_cols, TARGET_COL)
    for col in bool_cols:
        plot_target_by_boolean(train_df, col, TARGET_COL)


## 6️⃣ Skewness & Kurtosis: Shape & Transform Suggestions

Use skewness and kurtosis to decide **which numeric features need transformation**.

Rules (built into your workflow):

- `|skew| < 0.5` → leave as is.
- `0.5 ≤ |skew| ≤ 1.0` → consider log/sqrt transform.
- `|skew| > 1.0` → strong candidate for transformation (log1p, Box-Cox, Yeo–Johnson).


In [None]:
def skew_kurtosis_summary(df: pd.DataFrame) -> pd.DataFrame:
    num_cols_local = df.select_dtypes(include=[np.number]).columns.tolist()
    summary = []

    for col in num_cols_local:
        col_data = df[col].dropna()
        if len(col_data) == 0:
            continue
        summary.append({
            "feature": col,
            "skewness": skew(col_data),
            "kurtosis": kurtosis(col_data, fisher=False)
        })

    result = pd.DataFrame(summary).set_index("feature")
    display(result.sort_values("skewness", key=lambda x: x.abs(), ascending=False))
    return result


def suggest_log_transform(df: pd.DataFrame, skew_threshold: float = 1.0) -> List[str]:
    num_cols_local = df.select_dtypes(include=[np.number]).columns
    candidates = []

    for col in num_cols_local:
        col_data = df[col].dropna()
        if len(col_data) == 0:
            continue
        s = skew(col_data)
        # log1p only reduces right (positive) skew; left-skewed features need reflecting first
        if s > skew_threshold and col_data.min() >= 0:
            candidates.append((col, s))

    print(f"Log-transform candidates (skew > {skew_threshold} and min>=0):")
    display(pd.DataFrame(candidates, columns=["feature", "skewness"]))
    return [c[0] for c in candidates]


skew_kurtosis_df = skew_kurtosis_summary(train_df)
log_candidates = suggest_log_transform(train_df)


## 7️⃣ Outlier Detection & Handling

Now use simple, consistent rules to detect and handle outliers.

Recommended flow:

1. Start with **IQR or z-score** on key numeric columns.  
2. Check how many points are flagged and how much they shift target stats.  
3. Decide whether to **keep**, **winsorize**, **remove**, or **flag**.


In [None]:
def detect_outliers_iqr(df: pd.DataFrame, col: str, multiplier: float = 1.5):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    mask = (df[col] < lower) | (df[col] > upper)
    return mask, lower, upper


def detect_outliers_z(df: pd.DataFrame, col: str, threshold: float = 3.0):
    col_data = df[col]
    col_z = zscore(col_data.dropna())
    mask_raw = np.abs(col_z) > threshold
    mask = pd.Series(False, index=df.index)
    mask[col_data.dropna().index] = mask_raw
    return mask


def detect_outliers_isoforest(df: pd.DataFrame, num_cols: Optional[List[str]] = None, contamination: float = 0.01):
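    if num_cols is None:
        num_cols = get_numeric_features(df, exclude=[TARGET_COL, ID_COL])

    iso = IsolationForest(contamination=contamination, random_state=RANDOM_STATE)
    # IsolationForest may reject NaNs depending on the scikit-learn version;
    # median-fill here for detection only (the original df is not modified)
    X = df[num_cols].fillna(df[num_cols].median())
    preds = iso.fit_predict(X)
    mask = preds == -1
    return pd.Series(mask, index=df.index)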


def compare_target_with_without_outliers(df: pd.DataFrame, mask: pd.Series, target_col: str = TARGET_COL):
    full_mean = df[target_col].mean()
    clean_mean = df[~mask][target_col].mean()
    print(f"Full target mean:      {full_mean:.4f}")
    print(f"Without outliers mean: {clean_mean:.4f}")
    print(f"Difference:            {clean_mean - full_mean:.4f}")
    print(f"Outlier count:         {mask.sum()} / {len(df)}")


def winsorize_series(s: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    lower = s.quantile(lower_q)
    upper = s.quantile(upper_q)
    return s.clip(lower, upper)


def process_outliers(
    df: pd.DataFrame,
    num_cols: Optional[List[str]] = None,
    strategy: str = "winsorize",
    z_thresh: float = 3.0,
    lower_q: float = 0.01,
    upper_q: float = 0.99,
) -> pd.DataFrame:
    if num_cols is None:
        num_cols = get_numeric_features(df, exclude=[TARGET_COL, ID_COL])

    df_processed = df.copy()
    # Rows to drop are accumulated across columns so the result does not depend on column order
    remove_mask = pd.Series(False, index=df.index)

    for col in num_cols:
        if strategy == "remove":
            remove_mask |= detect_outliers_z(df, col, threshold=z_thresh)
        elif strategy == "winsorize":
            df_processed[col] = winsorize_series(df_processed[col], lower_q, upper_q)
        elif strategy == "flag":
            mask = detect_outliers_z(df_processed, col, threshold=z_thresh)
            df_processed[f"{col}_outlier"] = mask.astype(int)
        else:
            raise ValueError("strategy must be one of: 'remove', 'winsorize', 'flag'")

    if strategy == "remove":
        df_processed = df_processed[~remove_mask]

    return df_processed


# Example: examine outliers for a single numeric feature, if any exist
example_col = num_cols[0] if len(num_cols) > 0 else None
if example_col:
    mask_iqr, low, high = detect_outliers_iqr(train_df, example_col)
    print(f"Example IQR outlier bounds for {example_col}: [{low:.4f}, {high:.4f}]")
    compare_target_with_without_outliers(train_df, mask_iqr, TARGET_COL)

# Example: create a winsorized version of the training data
train_df_processed = process_outliers(
    train_df,
    num_cols=num_cols,  # already excludes the target and the ID column
    strategy="winsorize",
    lower_q=0.01,
    upper_q=0.99,
)
print("Original shape:", train_df.shape)
print("Processed shape:", train_df_processed.shape)


## 8️⃣ Relationship Shape & Model Family Choice (Conceptual)

At this point, you have:

- A clear understanding of the data
- Insight into numeric and categorical behaviors
- Skew and outliers under control

Now you decide **what kind of model to try first**.

### A. Quick Relationship Diagnostics (Recommended)

For each numeric feature:

1. Look at the scatter vs target plots.
2. Optionally compute:
   - Pearson correlation (linear)
   - Spearman correlation (monotonic)

**Heuristics:**

- If **Pearson and Spearman are both high** and scatter looks straight-ish → **linear** relationship.
- If **Spearman > Pearson** and the scatter is curved but monotonic → **nonlinear monotonic**.
- If both correlations are low and scatter is a cloud → likely **no strong relationship**.
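
The heuristics above can be checked numerically with a side-by-side table. The following is a minimal sketch on synthetic data; `pearson_spearman_table` and the toy feature names are illustrative, and Spearman is computed as the Pearson correlation of ranks so that only pandas and NumPy are needed.

```python
import numpy as np
import pandas as pd

def pearson_spearman_table(df: pd.DataFrame, features, target: str) -> pd.DataFrame:
    """Pearson and Spearman correlation of each feature with the target."""
    rows = []
    for col in features:
        pair = df[[col, target]].dropna()
        pearson = pair[col].corr(pair[target])
        # Spearman is the Pearson correlation of the ranks
        spearman = pair[col].rank().corr(pair[target].rank())
        rows.append({"feature": col, "pearson": pearson, "spearman": spearman,
                     "gap": abs(spearman) - abs(pearson)})
    return pd.DataFrame(rows).set_index("feature")

rng = np.random.default_rng(0)
n = 5_000
latent = rng.uniform(0, 5, size=n)
toy = pd.DataFrame({
    "linear": latent,                                               # y is linear in this feature
    "curved_monotonic": np.exp(latent),                             # y grows like log(feature)
    "u_shaped": rng.choice([-1.0, 1.0], size=n) * np.sqrt(latent),  # y grows like feature**2
    "noise": rng.normal(size=n),                                    # unrelated to y
    "y": 2 * latent + rng.normal(scale=0.5, size=n),
})
table = pearson_spearman_table(toy, ["linear", "curved_monotonic", "u_shaped", "noise"], "y")
print(table.round(3))
```

In this toy setup the U-shaped feature and the pure-noise feature both show near-zero correlations, which is why the scatter plot remains part of the check: low correlations alone cannot separate "no relationship" from "non-monotonic relationship".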

### B. Aggregate Decisions

- If **most strong features are linear** and you're willing to transform skewed ones:  
  → Start with a **linear baseline** (Ridge/ElasticNet).  
  - Standardize features.  
  - Consider adding polynomial or interaction terms for obvious curves.

- If **many features show nonlinear / monotonic patterns**:  
  → Start with **tree-based gradient boosting** (LightGBM/XGBoost/CatBoost).  
  - Captures nonlinearity and interactions automatically.  
  - Handles skewed features and feature outliers better by default.

- If the dataset is **mixed** (some linear, some nonlinear, some categorical-heavy):  
  → Start with **LightGBM or CatBoost** as a robust baseline, then later:
  - Build a linear model for interpretability.
  - Compare performance and use both if needed (stacking/ensembling).
  - Build a linear model for interpretability.
  - Compare performance and use both if needed (stacking/ensembling).

> ✅ You **do not** build separate models per feature.  
> You **transform the features** based on their shapes and feed them into a single model (or ensemble).

In a later section of this notebook (to be added), you'll:

- Build preprocessing pipelines (encoders, scalers, transformers)
- Define train/validation splits
- Train baseline models (LGBM / XGB / ElasticNet)
- Evaluate and iterate

For now, this notebook serves as your **guided EDA + data-shape analysis foundation** for any regression task.
