# Complete Synthetic Regression Simulation
__`End-to-End Realistic Data Pathology Benchmark`__
## Objective

This notebook generates a single synthetic regression dataset that simultaneously includes:

- Missingness (MCAR, MAR, MNAR)

- Outliers (univariate and multivariate)

- High cardinality categorical features

- Ordinal categorical features

- Target skewness / imbalance (regression-appropriate)

- Non-linear relationships

- Heteroskedastic noise

- Latent leakage-prone features

This dataset is intended to serve as a long-term benchmark for:

- EDA

- Preprocessing techniques

- Feature engineering

- Robust modeling

- Validation and leakage detection

## Imports and Configuration

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


# Generate Dataset

##  Global Parameters

In [2]:
np.random.seed(2010)

N = 8000

## Core Numeric Features

In [3]:
age = np.random.randint(18, 75, size=N)

income = np.random.lognormal(mean=10.7, sigma=0.7, size=N)

tenure_years = np.random.exponential(scale=6, size=N)

monthly_usage = np.random.gamma(shape=2.5, scale=40, size=N)


##  Ordinal Categorical Feature

In [4]:
education_level = np.random.choice(
    ["High School", "Bachelor", "Master", "PhD"],
    size=N,
    p=[0.35, 0.4, 0.2, 0.05]
)

education_map = {
    "High School": 0,
    "Bachelor": 1,
    "Master": 2,
    "PhD": 3
}

education_ordinal = pd.Series(education_level).map(education_map)


##  High Cardinality Categorical Feature

In [5]:
customer_segment = np.random.choice(
    [f"segment_{i}" for i in range(1, 101)],
    size=N
)


100 levels → deliberate high-cardinality encoding challenge.

## Region (Low Cardinality)

In [8]:
region = np.random.choice(
    ["North", "South", "East", "West"],
    size=N,
    p=[0.4, 0.25, 0.2, 0.15]
)


## True Signal Construction (Regression Target)

Target: Annual Customer Value

In [7]:
base_value = (
    0.03 * income +
    1200 * np.log1p(tenure_years) +
    8 * monthly_usage +
    2500 * education_ordinal
)

regional_effect = pd.Series(region).map({
    "North": 1.1,
    "South": 0.95,
    "East": 1.0,
    "West": 1.05
}).values

true_value = base_value * regional_effect


## Heteroskedastic Noise

Noise increases with income (realistic business behavior).

In [9]:
noise = np.random.normal(
    loc=0,
    scale=0.15 * income / income.mean(),
    size=N
)

annual_customer_value = true_value + noise


## Inject Outliers
### Extreme Target Outliers (Top 1%)

In [10]:
outlier_idx = np.random.choice(
    np.arange(N),
    size=int(0.01 * N),
    replace=False
)

annual_customer_value[outlier_idx] *= np.random.uniform(2.5, 4.0, size=len(outlier_idx))


### Feature Outliers

In [11]:
monthly_usage[outlier_idx] *= np.random.uniform(3, 6, size=len(outlier_idx))

# Create DataFrame

In [12]:
df = pd.DataFrame({
    "customer_id": range(1, N + 1),
    "age": age,
    "income": income,
    "tenure_years": tenure_years,
    "monthly_usage": monthly_usage,
    "education_level": education_level,
    "customer_segment": customer_segment,
    "region": region,
    "annual_customer_value": annual_customer_value
})

df.head()


Unnamed: 0,customer_id,age,income,tenure_years,monthly_usage,education_level,customer_segment,region,annual_customer_value
0,1,18,78000.816898,14.440569,155.332076,Master,segment_13,North,11273.442537
1,2,18,95995.286838,0.248255,22.465606,Bachelor,segment_74,North,6408.490395
2,3,67,70092.926345,9.625033,195.203083,Bachelor,segment_75,South,8550.323559
3,4,64,89249.664825,0.677753,186.829061,High School,segment_74,North,4553.038385
4,5,37,51381.573393,6.999693,86.45363,High School,segment_68,North,4491.894205


## Missingness Injection
### MCAR (Random)

In [13]:
mcar_idx = np.random.choice(df.index, size=int(0.05 * N), replace=False)
df.loc[mcar_idx, "monthly_usage"] = np.nan


### MAR (Depends on income)

In [14]:
mar_mask = df["income"] > df["income"].quantile(0.8)
df.loc[mar_mask.sample(frac=0.15, random_state=42).index, "tenure_years"] = np.nan


### MNAR (Depends on target)

In [15]:
mnar_mask = df["annual_customer_value"] > df["annual_customer_value"].quantile(0.9)
df.loc[mnar_mask.sample(frac=0.25, random_state=42).index, "income"] = np.nan


## Leakage-Prone Feature (Deliberate)

___`NOTE:   This feature must not be used in training.`___

In [16]:
df["future_discount_applied"] = (
    df["annual_customer_value"] * 0.02 +
    np.random.normal(0, 50, size=N)
)


__`This feature must be removed before modeling.`__


## Target Skewness (Regression “Imbalance”)



In [17]:
df["annual_customer_value"].describe()


count     8000.000000
mean      7229.886019
std       3501.537485
min        662.137318
25%       4907.683732
50%       6856.232395
75%       8997.826931
max      50289.918012
Name: annual_customer_value, dtype: float64

Characteristics:

- Long-tailed distribution

- Small population of extreme high-value customers

- Common in revenue modeling

# Dataset Summary

In [18]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   customer_id              8000 non-null   int64  
 1   age                      8000 non-null   int32  
 2   income                   6000 non-null   float64
 3   tenure_years             6800 non-null   float64
 4   monthly_usage            7600 non-null   float64
 5   education_level          8000 non-null   object 
 6   customer_segment         8000 non-null   object 
 7   region                   8000 non-null   object 
 8   annual_customer_value    8000 non-null   float64
 9   future_discount_applied  8000 non-null   float64
dtypes: float64(5), int32(1), int64(1), object(3)
memory usage: 593.9+ KB


In [19]:
df.isna().mean().sort_values(ascending=False)


income                     0.25
tenure_years               0.15
monthly_usage              0.05
customer_id                0.00
age                        0.00
education_level            0.00
customer_segment           0.00
region                     0.00
annual_customer_value      0.00
future_discount_applied    0.00
dtype: float64

## Save Dataset

In [20]:
df.to_csv(
    "../datasets/synthetic_customer_value_regression_complete.csv",
    index=False
)


In [21]:
df

Unnamed: 0,customer_id,age,income,tenure_years,monthly_usage,education_level,customer_segment,region,annual_customer_value,future_discount_applied
0,1,18,,14.440569,155.332076,Master,segment_13,North,11273.442537,204.852853
1,2,18,95995.286838,0.248255,22.465606,Bachelor,segment_74,North,6408.490395,55.691786
2,3,67,70092.926345,9.625033,195.203083,Bachelor,segment_75,South,8550.323559,77.920800
3,4,64,89249.664825,0.677753,186.829061,High School,segment_74,North,4553.038385,127.854392
4,5,37,51381.573393,6.999693,86.453630,High School,segment_68,North,4491.894205,96.421251
...,...,...,...,...,...,...,...,...,...,...
7995,7996,63,69008.660132,2.189009,112.173461,High School,segment_77,North,4577.170654,182.978886
7996,7997,57,49750.384441,0.344267,160.676037,Bachelor,segment_5,North,6195.846927,77.317460
7997,7998,20,79547.734504,1.573367,167.778888,Bachelor,segment_44,North,7731.307992,188.006950
7998,7999,26,48761.263261,0.468053,,Bachelor,segment_5,North,5973.494967,170.542220


## What This Dataset Contains


| Aspect             | Included |
| ------------------ | -------- |
| Regression target  | Yes      |
| Non-linear signal  | Yes      |
| Ordinal features   | Yes      |
| High cardinality   | Yes      |
| MCAR / MAR / MNAR  | Yes      |
| Outliers           | Yes      |
| Heteroskedasticity | Yes      |
| Leakage trap       | Yes      |
| Business realism   | Yes      |


## Intended Downstream Usage

This dataset feeds directly into:

    01_Exploratory_Data_Analysis/

    02_Data_Preprocessing/

    03_Feature_Engineering/

    04_Supervised_Learning/

    06_Model_Evaluation_and_Validation/

    09_Pipelines_and_Workflows/

<br><br><br><br><br><br><br><br>


# Preprocessing and Modeling Pipelines
___`Leakage-Safe, Robust, and Business-Ready`___

## Objective of the Pipeline

This pipeline addresses the following challenges already embedded in the dataset:

- MCAR / MAR / MNAR missingness

- Ordinal and nominal categorical features

- High-cardinality categorical feature

- Outliers and skewed distributions

- Leakage-prone feature

- Regression target with long-tail behavior

Design principles:

- No data leakage

- Column-type–aware transformations

- Single source of truth via Pipeline

- Model-agnostic preprocessing

## Train / Test Split (Leakage Control)

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
TARGET = "annual_customer_value"
LEAKAGE_FEATURES = ["future_discount_applied", "customer_id"]

X = df.drop(columns=[TARGET] + LEAKAGE_FEATURES)
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

## Feature Type Definition (Explicit Contract)

In [25]:
numeric_features = [
    "age",
    "income",
    "tenure_years",
    "monthly_usage"
]

ordinal_features = ["education_level"]

nominal_low_cardinality = ["region"]

nominal_high_cardinality = ["customer_segment"]


## Preprocessing Components
### Numeric Pipeline

Design choices:

- Median imputation → robust to outliers and skewness

- Log transform → stabilize heavy tails

- RobustScaler → reduce outlier impact

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, FunctionTransformer

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("log_transform", FunctionTransformer(np.log1p, feature_names_out="one-to-one")),
    ("scaler", RobustScaler())
])

### Ordinal Encoding Pipeline

Ordinality is preserved explicitly.

In [27]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(
        categories=[["High School", "Bachelor", "Master", "PhD"]]
    ))
])


### Nominal Low-Cardinality Pipeline

In [28]:
from sklearn.preprocessing import OneHotEncoder

nominal_low_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(
        drop="first",
        handle_unknown="ignore"
    ))
])


### Nominal High-Cardinality Pipeline
Strategy:

- Frequency encoding via OneHotEncoder(min_frequency=...)

- Avoids explosion of sparse dimensions

In [29]:
nominal_high_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(
        min_frequency=0.02,
        handle_unknown="ignore"
    ))
])


## ColumnTransformer (Unified Preprocessing)

In [32]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("ord", ordinal_pipeline, ordinal_features),
        ("nom_low", nominal_low_pipeline, nominal_low_cardinality),
        ("nom_high", nominal_high_pipeline, nominal_high_cardinality)
    ],
    remainder="drop"
)


This object is now the single, authoritative preprocessing contract.

# Modeling Pipeline (Baseline)
## Model Choice

Random Forest chosen because:

- Handles non-linearity

- Robust to outliers

- Strong baseline for tabular data

In [30]:
from sklearn.ensemble import RandomForestRegressor


## Full Pipeline

In [33]:
model_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_leaf=5,
        random_state=42,
        n_jobs=-1
    ))
])


## Model Training

In [34]:
model_pipeline.fit(X_train, y_train)


## Evaluation (Initial Sanity Check)

In [35]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_pred = model_pipeline.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R²:", r2_score(y_test, y_pred))


MAE: 1265.4593608469022
RMSE: 1920.9031742765894
R²: 0.6405873064199028


# Why This Pipeline Is Benchmark-Grade



| Aspect                   | Covered           |
| ------------------------ | ----------------- |
| Leakage prevention       | Explicit          |
| Mixed feature types      | Yes               |
| Missingness handling     | MCAR / MAR / MNAR |
| Ordinality preserved     | Yes               |
| High-cardinality control | Yes               |
| Outlier robustness       | Yes               |
| Reusability              | High              |
| Deployment-ready         | Yes               |
