# 01 – Domain-Driven Feature Engineering

Translating Business Knowledge into Predictive Signal

## Objective

This notebook focuses on **domain-driven feature engineering**, covering:

- Why business knowledge matters more than algorithms
- Translating raw variables into meaningful signals
- Ratio, interaction, and lifecycle-based features
- Customer behavior and risk proxies
- Guardrails to avoid leakage

It answers:

> How do we transform raw data into features that reflect real-world business processes?

## Why Domain-Driven Features Matter

On tabular business data, performance gains often come from **better features** rather than from more complex models.

Domain-driven features:
- Encode business logic
- Improve interpretability
- Reduce model complexity
- Generalize better under distribution shift

A weak model with strong features often outperforms a strong model with weak features.




## Imports and Dataset


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


In [3]:
df = pd.read_csv("../datasets/synthetic_customer_churn_classification_complete.csv")
df.head()


| | customer_id | age | income | tenure_years | avg_monthly_usage | support_tickets_last_year | satisfaction_level | customer_segment | region | churn | future_retention_offer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | NaN | 2.012501 | 138.021163 | 1 | NaN | segment_18 | South | 0 | -0.069047 |
| 1 | 2 | 18 | 58991.061162 | 9.00555 | 213.043003 | 2 | Very High | segment_98 | West | 0 | -0.226607 |
| 2 | 3 | 67 | 31130.298545 | 3.633058 | 68.591582 | 2 | Medium | segment_134 | North | 0 | -0.065741 |
| 3 | 4 | 64 | NaN | 4.295957 | 28.790894 | 1 | NaN | segment_72 | North | 0 | 0.061886 |
| 4 | 5 | 37 | 22301.231175 | 2.549855 | 100.136569 | 2 | High | segment_147 | East | 1 | 1.073678 |


## Step 1 – Business Context Framing

We assume a **subscription-based customer churn problem**.

Key business questions:
- Who is at risk of churn?
- What behaviors precede churn?
- Which customers are most valuable to retain?

Feature engineering should reflect these questions.


## Step 2 – Raw Feature Review


In [4]:
df.describe(include="all").transpose()


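| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| customer_id | 10000.0 | NaN | NaN | NaN | 5000.5 | 2886.89568 | 1.0 | 2500.75 | 5000.5 | 7500.25 | 10000.0 |
| age | 10000.0 | NaN | NaN | NaN | 48.6211 | 17.922116 | 18.0 | 33.0 | 49.0 | 64.0 | 79.0 |
| income | 7000.0 | NaN | NaN | NaN | 55256.084045 | 55657.710502 | 3325.613998 | 24298.26878 | 40650.189882 | 67759.20165 | 1100057.685041 |
| tenure_years | 10000.0 | NaN | NaN | NaN | 5.442411 | 5.4356 | 0.000661 | 1.602746 | 3.786117 | 7.560793 | 54.238486 |
| avg_monthly_usage | 9600.0 | NaN | NaN | NaN | 80.613112 | 66.529768 | 0.302308 | 39.688541 | 66.481462 | 104.682739 | 1322.966577 |
| support_tickets_last_year | 10000.0 | NaN | NaN | NaN | 1.7959 | 1.352119 | 0.0 | 1.0 | 2.0 | 3.0 | 9.0 |
| satisfaction_level | 7500.0 | 5.0 | Medium | 2573.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| customer_segment | 10000.0 | 150.0 | segment_105 | 90.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| region | 10000.0 | 4.0 | North | 3961.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| churn | 10000.0 | NaN | NaN | NaN | 0.1852 | 0.388479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |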


## Step 3 – Lifecycle and Tenure Features

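Customer behavior changes with lifecycle stage.

The relationship between tenure and churn is often **non-linear**:
- New customers carry high early churn risk
- Mid-term customers tend to be stable
- Long-tenured customers may show late-stage fatigue

Binning tenure into stages lets a model treat each phase separately instead of assuming a single linear trend.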


In [5]:
df["tenure_stage"] = pd.cut(
    df["tenure_years"],
    bins=[0, 1, 3, 5, 10, np.inf],
    labels=["New", "Early", "Mid", "Established", "Loyal"]
)

df["tenure_stage"].value_counts()


tenure_stage
Early          2622
Established    2460
Mid            1758
New            1618
Loyal          1542
Name: count, dtype: int64
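
Whether the stages actually separate churn risk can be checked by comparing churn rates per stage. The following is a minimal sketch on a small hand-made frame (`toy` and its values are illustrative, not taken from the dataset); the same `groupby` could be run on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df; values are illustrative only.
toy = pd.DataFrame({
    "tenure_years": [0.5, 0.8, 2.0, 2.5, 4.0, 7.0, 12.0, 15.0],
    "churn":        [1,   1,   0,   1,   0,   0,   0,    1],
})

# Same bin edges and labels as the cell above.
toy["tenure_stage"] = pd.cut(
    toy["tenure_years"],
    bins=[0, 1, 3, 5, 10, np.inf],
    labels=["New", "Early", "Mid", "Established", "Loyal"],
)

# Churn rate and sample size per stage; observed=False keeps empty stages visible.
stage_summary = (
    toy.groupby("tenure_stage", observed=False)["churn"]
       .agg(churn_rate="mean", customers="count")
)
print(stage_summary)
```

Reporting the customer count next to the rate helps avoid over-reading stages that contain only a handful of customers.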

## Step 4 – Financial Intensity Features

Absolute values often hide risk: the same charge weighs more on a low-income customer than on a high-income one.

Ratios reveal this **relative burden**.


In [10]:
# Left commented out: the dataset loaded above has no `monthly_charges` column.
# df["charges_to_income_ratio"] = df["monthly_charges"] / df["income"]

# df[["monthly_charges", "income", "charges_to_income_ratio"]].head()
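
A ratio feature needs a rule for missing or non-positive denominators, and `income` is missing for roughly 30% of customers here. The sketch below shows one way to handle that. It assumes a hypothetical `monthly_charges` column (not present in this dataset) and assumes `income` is annual; `safe_ratio` is an illustrative helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is missing or not positive."""
    valid = denominator > 0  # comparisons with NaN evaluate to False
    return (numerator / denominator).where(valid)

# Toy data; `monthly_charges` is a hypothetical column.
toy = pd.DataFrame({
    "monthly_charges": [50.0, 50.0, 80.0, 30.0],
    "income":          [24000.0, 120000.0, np.nan, 0.0],
})

# Assumption: income is annual, so compare charges against monthly income.
monthly_income = toy["income"] / 12
toy["charges_to_income_ratio"] = safe_ratio(toy["monthly_charges"], monthly_income)
print(toy)
```

The first two customers pay the same charge, yet the ratio differs by a factor of five, which is the relative burden the raw column hides. Rows with missing or zero income stay `NaN` so that imputation remains an explicit, separate decision.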


## Step 5 – Usage Efficiency Features

High charges combined with low usage may indicate dissatisfaction: the customer is paying for value they do not use.


In [11]:
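# Left commented out: the dataset loaded above has no `monthly_charges` column.
# The small constant guards against division by zero when usage is 0.
# df["cost_per_usage_unit"] = df["monthly_charges"] / (df["avg_monthly_usage"] + 1e-6)

# df[["monthly_charges", "avg_monthly_usage", "cost_per_usage_unit"]].head()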
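
Adding a tiny constant to the denominator avoids a division error, but it turns zero-usage customers into extreme values (a charge of 40 divided by `1e-6` is 40,000,000). An alternative, sketched here on toy data with a hypothetical `monthly_charges` column, is to leave the ratio undefined for zero usage and record that case in a separate flag (`zero_usage_flag` is an illustrative name).

```python
import numpy as np
import pandas as pd

# Toy data; `monthly_charges` is a hypothetical column.
toy = pd.DataFrame({
    "monthly_charges":   [40.0, 40.0, 40.0],
    "avg_monthly_usage": [100.0, 5.0, 0.0],
})

# Treat zero usage as "ratio undefined" instead of dividing by a tiny number.
usage = toy["avg_monthly_usage"].replace(0, np.nan)
toy["cost_per_usage_unit"] = toy["monthly_charges"] / usage

# Keep the zero-usage signal explicitly, since it is informative on its own.
toy["zero_usage_flag"] = (toy["avg_monthly_usage"] == 0).astype(int)
print(toy)
```

Which treatment is preferable depends on the downstream model: tree-based models that accept missing values can use the `NaN` directly, while linear models would need the ratio imputed or capped.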


## Step 6 – Behavioral Risk Flags

Binary flags often capture strong business signals.


In [12]:
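# The price/usage flag stays commented out: the dataset loaded above has no `monthly_charges` column.
# df["high_price_low_usage_flag"] = (
#     (df["monthly_charges"] > df["monthly_charges"].median()) &
#     (df["avg_monthly_usage"] < df["avg_monthly_usage"].median())
# ).astype(int)

# Missing income can be informative in itself, so record it before any imputation.
df["income_missing_flag"] = df["income"].isna().astype(int)

# df[[
#     "high_price_low_usage_flag",
#     "income_missing_flag"
# ]].head()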
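
Flags built from medians depend on the data used to compute those medians. To keep them consistent between training and scoring, the thresholds can be learned on the training split and then reused unchanged. This is a minimal sketch under that assumption; `fit_thresholds`, `add_risk_flags`, and the `monthly_charges` column are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_thresholds(train: pd.DataFrame) -> dict:
    """Learn cut-offs from the training split only."""
    return {
        "charges_median": train["monthly_charges"].median(),
        "usage_median": train["avg_monthly_usage"].median(),
    }

def add_risk_flags(frame: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Apply previously fitted cut-offs to any split without refitting."""
    out = frame.copy()
    out["high_price_low_usage_flag"] = (
        (out["monthly_charges"] > thresholds["charges_median"])
        & (out["avg_monthly_usage"] < thresholds["usage_median"])
    ).astype(int)
    out["income_missing_flag"] = out["income"].isna().astype(int)
    return out

# Toy splits; `monthly_charges` is a hypothetical column.
train = pd.DataFrame({
    "monthly_charges":   [20.0, 40.0, 60.0],
    "avg_monthly_usage": [30.0, 60.0, 90.0],
    "income":            [30000.0, np.nan, 50000.0],
})
new_customers = pd.DataFrame({
    "monthly_charges":   [70.0, 70.0, 30.0],
    "avg_monthly_usage": [10.0, 80.0, 10.0],
    "income":            [np.nan, 50000.0, 20000.0],
})

thresholds = fit_thresholds(train)
flagged = add_risk_flags(new_customers, thresholds)
print(flagged[["high_price_low_usage_flag", "income_missing_flag"]])
```

Only the first new customer is both above the training median for charges and below it for usage, so only that row receives the flag.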


## Step 7 – Interaction Features

Interactions encode **conditional relationships**: the effect of one variable depends on the value of another.


In [13]:
# Left commented out: the dataset loaded above has no `contract_type` column.
# df["long_contract_new_customer"] = (
#     (df["contract_type"] == "Two-Year") &
#     (df["tenure_years"] < 1)
# ).astype(int)

# df["long_contract_new_customer"].value_counts()
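
The same pattern can be illustrated with columns this dataset does contain. The sketch below flags new customers who already raised many support tickets; the feature name `new_customer_many_tickets` and both cut-offs (under 1 year of tenure, 3 or more tickets) are assumptions chosen for illustration, not values from the notebook.

```python
import pandas as pd

# Toy data using two columns that exist in the dataset.
toy = pd.DataFrame({
    "tenure_years":              [0.4, 0.6, 6.0, 8.0],
    "support_tickets_last_year": [4,   0,   4,   1],
})

# Assumed cut-offs: "new" means under 1 year, "many" means 3 or more tickets.
is_new = toy["tenure_years"] < 1
has_many_tickets = toy["support_tickets_last_year"] >= 3

# Four tickets from a new customer is a different signal than four
# tickets from a customer with six years of tenure.
toy["new_customer_many_tickets"] = (is_new & has_many_tickets).astype(int)
print(toy)
```

Neither condition alone isolates the first customer: the third row has the same ticket count, and the second row has similar tenure.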


## Step 8 – Aggregation-Based Features (Conceptual)

In real systems, aggregation features often include:
- Rolling averages
- Trend indicators
- Peer group benchmarks

These require temporal data and careful validation.
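
The current dataset has one row per customer, so these features cannot be built from it. The sketch below uses a hypothetical monthly usage log (`usage_log`, with illustrative values) to show the three ideas. Every feature is shifted by one period so that a row only sees earlier months.

```python
import pandas as pd

# Hypothetical monthly usage log: one row per customer per month.
usage_log = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "month": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"] * 2
    ),
    "usage": [100.0, 90.0, 80.0, 70.0, 20.0, 30.0, 40.0, 50.0],
}).sort_values(["customer_id", "month"])

per_customer = usage_log.groupby("customer_id")["usage"]

# Rolling average of up to 3 prior months (shift(1) excludes the current month).
usage_log["usage_roll3_prior"] = per_customer.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

# Trend indicator: change between the two most recent prior months.
usage_log["usage_trend"] = per_customer.transform(lambda s: s.shift(1).diff())

# Peer benchmark: prior usage relative to the average across customers that month.
peer_mean = usage_log.groupby("month")["usage_roll3_prior"].transform("mean")
usage_log["usage_vs_peers"] = usage_log["usage_roll3_prior"] / peer_mean
print(usage_log)
```

In practice the peer group would usually be narrower than "all customers in a month", for example a segment or region, and validation should respect time order rather than use a random split.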


## Step 9 – Leakage Guardrails

Domain features must:
- Use only historical information
- Avoid target-derived logic
- Be computable at prediction time
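
One way to enforce these rules is to keep an explicit list of columns that must never reach the model. The dataset preview shows a `future_retention_offer` column; its name suggests it is recorded after the churn outcome, which is an assumption here rather than something the notebook states. `POST_OUTCOME_COLUMNS` and `select_model_features` are hypothetical names.

```python
import pandas as pd

# Assumption: these columns are only known after the prediction date.
POST_OUTCOME_COLUMNS = {"future_retention_offer"}
TARGET = "churn"

def select_model_features(frame: pd.DataFrame,
                          target: str = TARGET,
                          blocked=POST_OUTCOME_COLUMNS) -> pd.DataFrame:
    """Return only columns that are neither the target nor on the block list."""
    forbidden = set(blocked) | {target}
    allowed = [col for col in frame.columns if col not in forbidden]
    return frame[allowed]

# Toy frame with illustrative values.
toy = pd.DataFrame({
    "age": [18, 67],
    "tenure_years": [2.0, 3.6],
    "churn": [0, 1],
    "future_retention_offer": [-0.07, 1.07],
})

X = select_model_features(toy)
print(list(X.columns))
```

A block list only catches leakage that someone has already identified, so it complements a review of how and when each column is populated rather than replacing it.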


## Step 10 – Feature Sanity Checks


In [14]:
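# Left commented out: these features depend on columns
# (`monthly_charges`, `contract_type`) that the dataset loaded above lacks.
# engineered_features = [
#     "charges_to_income_ratio",
#     "cost_per_usage_unit",
#     "high_price_low_usage_flag",
#     "long_contract_new_customer"
# ]

# df[engineered_features + ["churn"]].corr()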
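
A correlation matrix is one check among several. The sketch below adds missing rate and number of distinct values, and skips features that were never created. `feature_sanity_report` is a hypothetical helper and the toy values are illustrative.

```python
import pandas as pd

def feature_sanity_report(frame: pd.DataFrame, features, target: str = "churn") -> pd.DataFrame:
    """Summarise missingness, cardinality, and target correlation per feature."""
    present = [f for f in features if f in frame.columns]
    return pd.DataFrame({
        "missing_rate": frame[present].isna().mean(),
        "n_unique": frame[present].nunique(),
        # Constant features yield NaN here, which is itself a warning sign.
        "corr_with_target": frame[present].corrwith(frame[target]),
    })

# Toy frame with illustrative values.
toy = pd.DataFrame({
    "income_missing_flag": [1, 0, 0, 1, 0, 0],
    "constant_flag":       [0, 0, 0, 0, 0, 0],
    "churn":               [1, 0, 0, 1, 0, 1],
})

report = feature_sanity_report(
    toy, ["income_missing_flag", "constant_flag", "not_created_yet"]
)
print(report)
```

A feature with a single distinct value carries no signal, and a correlation close to 1 with the target is more likely to indicate leakage than a genuinely strong feature.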


## Business Interpretability Check

Each engineered feature should answer:
- What behavior does this represent?
- Can it be explained to a stakeholder?
- Would it remain valid in the future?

## Common Mistakes (Avoided)

- Creating mathematically clever but meaningless features
- Encoding target leakage
- Over-engineering without validation
- Ignoring business interpretability


## Summary Table

| Feature Type | Example |
|------------|--------|
| Lifecycle | Tenure stage |
| Ratio | Charges / income |
| Efficiency | Cost per usage |
| Flag | High price, low usage |
| Interaction | Contract × tenure |


## Key Takeaways

- Business logic drives feature value
- Ratios often outperform raw values
- Flags capture risk efficiently
- Interpretability is a competitive advantage
- Features must survive production constraints


## Next Notebook

03_Feature_Engineering/

└── [02_interaction_features.ipynb](02_interaction_features.ipynb)



# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

- [00_Data_Generation_and_Simulation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/00_Data_Generation_and_Simulation)


- [01_Exploratory_Data_Analysis_(EDA)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/01_Exploratory_Data_Analysis_(EDA))


- [02_Data_Preprocessing](https://github.com/lei-soares/Data-Science-Techniques/tree/main/02_Data_Preprocessing)


- [03_Feature_Engineering](https://github.com/lei-soares/Data-Science-Techniques/tree/main/03_Feature_Engineering)


- [04_Supervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/04_Supervised_Learning)


- [05_Unsupervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/05_Unsupervised_Learning)


- [06_Model_Evaluation_and_Validation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/06_Model_Evaluation_and_Validation)


- [07_Model_Tuning_and_Optimization](https://github.com/lei-soares/Data-Science-Techniques/tree/main/07_Model_Tuning_and_Optimization)


- [08_Interpretability_and_Explainability](https://github.com/lei-soares/Data-Science-Techniques/tree/main/08_Interpretability_and_Explainability)


- [09_Pipelines_and_Workflows](https://github.com/lei-soares/Data-Science-Techniques/tree/main/09_Pipelines_and_Workflows)


- [10_Natural_Language_Processing_(NLP)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/10_Natural_Language_Processing_(NLP))


- [11_Time_Series](https://github.com/lei-soares/Data-Science-Techniques/tree/main/11_Time_Series)


- [12_Anomaly_and_Fraud_Detection](https://github.com/lei-soares/Data-Science-Techniques/tree/main/12_Anomaly_and_Fraud_Detection)


- [13_Imbalanced_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/13_Imbalanced_Learning)


- [14_Deployment_and_Production_Concepts](https://github.com/lei-soares/Data-Science-Techniques/tree/main/14_Deployment_and_Production_Concepts)


- [15_Business_and_Experimental_Design](https://github.com/lei-soares/Data-Science-Techniques/tree/main/15_Business_and_Experimental_Design)