## **02 - Feature Engineering (FE)**

**Source**

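This notebook starts from the cleaned dataset, schema, and summary artifacts produced by EDA (`eda_cleaned.parquet`, `eda_cleaned_schema.json`, `eda_summary.json`).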

**Goals**
- Transform the canonical EDA dataset into a model-ready feature matrix without redefining the prediction problem
- Preserve a strict separation of concerns between:
    - EDA (problem definition and data semantics)
    - Feature engineering (input representation)
    - Modeling (learning and evaluation)
- Ensure all feature transformations are:
    - Deterministic
    - Reproducible
    - Auditable
- Minimize the risk of data leakage by:
    - Using only information available at prediction time
    - Applying only target-independent transformations
- Maintain model-agnostic feature representations that can be reused across:
    - Linear models
    - Tree-based models
    - Ensemble methods
- Prioritize interpretability and transparency, especially in a credit risk context
- Enable a clean handoff to a production-grade feature build script driven by a single feature specification
- Support iterative experimentation through explicit versioning, rather than ad-hoc feature changes

**To-Do Checklist** 

1. Inputs & Data Contracts
- [ ] Load cleaned dataset produced by EDA
- [ ] Load eda_summary.json
- [ ] Verify EDA summary contains target, drop columns, numeric, categorical, and datetime columns
- [ ] Validate dataset schema matches EDA expectations
- [ ] Fail early if schema mismatch is detected

2. Target Variable Handling
- Confirm target column exists
- Verify target encoding is binary
- Validate class distribution
- Freeze target definition
- Exclude target from feature transforms
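
The target checks above could be sketched roughly as follows. This is a minimal illustration, assuming the target is already encoded as 0/1; the helper name `check_binary_target` and the `min_minority_share` threshold are hypothetical and not part of the EDA artifacts.

```python
import pandas as pd


def check_binary_target(y: pd.Series, min_minority_share: float = 0.01) -> pd.Series:
    """Validate a 0/1 target and return its class shares (hypothetical helper)."""
    if y.isna().any():
        raise ValueError("Target contains missing values")
    values = set(y.unique())
    if values != {0, 1}:
        raise ValueError(f"Target must contain exactly the classes 0 and 1, found {values}")
    shares = y.value_counts(normalize=True)
    # Assumed threshold: flag a minority class that is too small to model
    if shares.min() < min_minority_share:
        raise ValueError(f"Minority class share {shares.min():.4f} is below {min_minority_share}")
    return shares
```

Returning the class shares lets the notebook record them once and compare against them later, when checking that the target distribution is unchanged.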

3. Feature Exclusion Logic
- Drop explicitly excluded columns
- Drop constant and low-variance columns
- Drop identifier columns
- Verify dropped columns are not used downstream
- Log total columns removed

4. Feature Type Validation
- Validate numeric dtypes and ranges
- Validate categorical cardinality
- Validate datetime parsing
- Assert column type disjointness
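
One way to assert column type disjointness is a pairwise overlap check across the column groups. The sketch below uses only the standard library; `assert_disjoint` is a hypothetical name, and the group labels are whatever the caller passes in.

```python
from itertools import combinations


def assert_disjoint(groups: dict[str, list[str]]) -> None:
    """Raise if any column appears in more than one type group (hypothetical helper)."""
    for (name_a, cols_a), (name_b, cols_b) in combinations(groups.items(), 2):
        overlap = set(cols_a) & set(cols_b)
        if overlap:
            raise ValueError(
                f"Columns listed as both {name_a} and {name_b}: {sorted(overlap)}"
            )
```

In this notebook the natural input would be the numeric, categorical, and datetime column lists from the EDA summary.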

5. Datetime Feature Engineering
- Convert datetime columns explicitly
- Derive year, month, quarter features
- Validate temporal logic
- Drop raw datetime columns
- Document decisions

6. Categorical Feature Engineering
- Compute cardinality
- Assign encoding strategy by tier
- Consolidate rare categories
- Handle missing categories
- Document encoding per feature
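
Rare-category consolidation and missing-category handling might look like the sketch below. The labels `"OTHER"` and `"MISSING"`, the 1% default threshold, and the function name are assumptions for illustration. In the build script the category shares would be computed on training data only.

```python
import pandas as pd


def consolidate_rare(
    s: pd.Series,
    min_share: float = 0.01,
    other_label: str = "OTHER",
    missing_label: str = "MISSING",
) -> pd.Series:
    """Map missing values to `missing_label` and infrequent levels to `other_label`."""
    filled = s.astype("object").where(s.notna(), missing_label)
    shares = filled.value_counts(normalize=True)
    # Keep the missing indicator as its own level even when it is infrequent
    rare = set(shares[shares < min_share].index) - {missing_label}
    return filled.where(~filled.isin(list(rare)), other_label)
```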

7. Numerical Feature Engineering
- Inspect distributions
- Identify skew
- Apply safe transforms
- Preserve originals unless justified
- Avoid premature normalization
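
The skew check and safe transform could be sketched as follows. The skewness threshold of 1.0 and the name `log1p_if_skewed` are assumptions, not values taken from this notebook. Originals are kept, and only non-negative columns are transformed because `log1p` is undefined below -1.

```python
import numpy as np
import pandas as pd


def log1p_if_skewed(frame: pd.DataFrame, cols: list[str], threshold: float = 1.0) -> pd.DataFrame:
    """Add a log_<col> column for each non-negative, right-skewed column."""
    out = frame.copy()
    for col in cols:
        s = out[col]
        if s.min() >= 0 and s.skew() > threshold:
            out[f"log_{col}"] = np.log1p(s)  # original column is preserved
    return out
```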

8. Feature Interactions (Optional)
- Identify justified interactions
- Avoid combinatorial explosion
- Validate inference availability
- Document rationale

9. Dimensionality Reduction (Optional / Conditional)
- Evaluate whether dimensionality reduction is necessary based on:
    - Feature count after encoding
    - Multicollinearity severity
    - Model class requirements
- Prefer feature selection over projection-based methods by default
- If used:
    - Fit dimensionality reduction on training data only
    - Treat the reducer as part of the preprocessing pipeline
    - Persist the fitted reducer artifact
- Validate:
    - Information retention
    - Impact on downstream model performance
    - Stability across folds
- Document interpretability and regulatory tradeoffs
- Explicitly justify inclusion or exclusion

    Notes
    - Projection-based methods (e.g., PCA) are generally avoided for:
        - Linear credit models
        - Regulated or explainability-sensitive settings
    - Tree-based models often do not require dimensionality reduction

10. Missing Value Strategy (Design Only)
- Identify missingness
- Define numeric and categorical strategies
- Do not fit imputers here
- Defer fitting to build script

11. Feature Leakage Checks
- Ensure no post-outcome features
- Validate prediction-time availability
- Check datetime cutoffs
- Document assumptions

12. Candidate Feature Definition & Evaluation (Design Phase)
(Optional section; occurs before Feature Specification Output)
- Enumerate candidate engineered features by type:
    - Datetime-derived features
    - Encoded categorical features
    - Transformed numerical features
    - Interaction features (if any)
- Clearly label features as candidates, not final
- Validate:
    - Availability at prediction time
    - No target leakage
    - Interpretability
- Evaluate candidates using:
    - Simple univariate metrics
    - Stability checks
    - Sanity plots
- Track:
    - Included
    - Excluded
    - Deferred (v2)

13. Feature Specification Output
- Create feature spec
- Include version, features, drops, transforms
- Store under configs/
- Treat as single source of truth
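
A feature spec with the fields listed above could be written out as JSON. This is a sketch only: the file naming scheme `feature_spec_<version>.json` and the helper name are hypothetical, and the spec's internal layout is left to the author.

```python
import json
from pathlib import Path

REQUIRED_SPEC_KEYS = ("version", "features", "drops", "transforms")


def write_feature_spec(spec: dict, config_dir: Path) -> Path:
    """Validate the top-level keys and write the spec under `config_dir`."""
    missing = [key for key in REQUIRED_SPEC_KEYS if key not in spec]
    if missing:
        raise KeyError(f"Feature spec is missing keys: {missing}")
    config_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per spec version
    path = config_dir / f"feature_spec_{spec['version']}.json"
    path.write_text(json.dumps(spec, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Sorting keys keeps the file stable across runs, so version-control diffs show only real changes to the spec.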

14. Feature Preview Artifact (Exploratory Only)
- Generate preview matrix
- Validate shape, names, dtypes
- Check NaNs/infs
- Save with version tag
- Mark as non-authoritative

15. Validation & Sanity Checks
- No duplicates
- No unexpected NaNs or infs
- Reasonable feature count
- Target distribution unchanged
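
Some of these checks can be collected into a single report. The sketch below covers duplicate column names, NaNs, infs, and feature count. It assumes "no duplicates" refers to column names, and the `max_features` limit is an arbitrary placeholder. It returns a list of problems rather than raising, so the caller decides which NaNs are expected.

```python
import numpy as np
import pandas as pd


def sanity_report(X: pd.DataFrame, max_features: int = 500) -> list[str]:
    """Return human-readable problems found in a feature matrix (empty list if none)."""
    problems = []
    dup_cols = X.columns[X.columns.duplicated()].tolist()
    if dup_cols:
        problems.append(f"duplicate column names: {dup_cols}")
    if X.shape[1] > max_features:
        problems.append(f"feature count {X.shape[1]} exceeds {max_features}")
    nan_mask = X.isna().any().to_numpy()
    if nan_mask.any():
        problems.append(f"NaN values in: {X.columns[nan_mask].tolist()}")
    numeric = X.select_dtypes(include="number")
    values = numeric.to_numpy(dtype=float, na_value=np.nan)
    inf_mask = np.isinf(values).any(axis=0)
    if inf_mask.any():
        problems.append(f"inf values in: {numeric.columns[inf_mask].tolist()}")
    return problems
```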

16. Reproducibility & Versioning
- Record version
- Freeze seeds if applicable
- Log feature counts
- Store metadata
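
One possible shape for the stored metadata is a small record that ties the version to a content hash of the spec. The field names and the helper `build_run_metadata` are assumptions; the spec is assumed to be a JSON-serializable dict with `version` and `features` keys.

```python
import hashlib
import json
from typing import Optional


def build_run_metadata(spec: dict, seed: Optional[int] = None) -> dict:
    """Summarize a feature spec: version, feature count, seed, and content hash."""
    # Canonical serialization so key order does not change the hash
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return {
        "version": spec["version"],
        "n_features": len(spec["features"]),
        "seed": seed,
        "spec_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

The hash changes if the spec is edited without a version bump, which makes ad-hoc feature changes visible.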

17. Handoff to Feature Build Script
- Identify logic to move to src/features
- Confirm script inputs and outputs
- Ensure notebook not required for rebuild

18. Notebook Completion Criteria
- Decisions documented
- Spec finalized
- No hidden logic
- Rebuild reproducible via script

19. Reviewer-Facing Summary
- Why these features
- Tradeoffs
- Limitations
- Planned v2 changes

---

Definition of Done  
A model can be trained without this notebook using only the feature spec and the feature build script.



# 1. Inputs & Data Contracts

We will:

- Load cleaned dataset produced by EDA
- Load eda_summary.json
- Verify EDA summary contains target, drop columns, numeric, categorical, and datetime columns
- Validate dataset schema matches EDA expectations
- Fail early if schema mismatch is detected
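
The summary verification step can be sketched as a small helper that runs before any key is read. The key names are the ones this notebook reads from `eda_summary.json`; the helper name `validate_eda_summary` is hypothetical.

```python
# Hypothetical helper; key names mirror those read from eda_summary.json
REQUIRED_LIST_KEYS = (
    "numerical_cols",
    "categorical_cols",
    "datetime_cols",
    "cols_to_drop",
    "low_variance_cols",
    "constant_cols",
)


def validate_eda_summary(summary: dict) -> None:
    """Raise early if the EDA summary lacks a key or a value has the wrong type."""
    required = ("target_variable",) + REQUIRED_LIST_KEYS
    missing = [key for key in required if key not in summary]
    if missing:
        raise KeyError(f"EDA summary is missing required keys: {missing}")
    if not isinstance(summary["target_variable"], str):
        raise TypeError("'target_variable' must be a column name (str)")
    for key in REQUIRED_LIST_KEYS:
        if not isinstance(summary[key], list):
            raise TypeError(f"'{key}' must be a list of column names")
```

Calling this right after the summary is loaded turns a missing key into one clear error message instead of a bare `KeyError` partway through the contract checks.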

In [None]:
import json
from pathlib import Path

import pandas as pd

# -----------------------------
# Paths (single source of truth)
# -----------------------------
DATA_DIR = Path("../data/_artifacts_preview")

EDA_DATA_PATH = DATA_DIR / "eda_cleaned.parquet"
EDA_SCHEMA_PATH = DATA_DIR / "eda_cleaned_schema.json"
EDA_SUMMARY_PATH = DATA_DIR / "eda_summary.json"

# -----------------------------
# Load artifacts
# -----------------------------
if not EDA_DATA_PATH.exists():
    raise FileNotFoundError(f"Missing EDA dataset: {EDA_DATA_PATH}")

if not EDA_SCHEMA_PATH.exists():
    raise FileNotFoundError(f"Missing EDA schema: {EDA_SCHEMA_PATH}")

if not EDA_SUMMARY_PATH.exists():
    raise FileNotFoundError(f"Missing EDA summary: {EDA_SUMMARY_PATH}")

df = pd.read_parquet(EDA_DATA_PATH)

with open(EDA_SCHEMA_PATH, "r") as f:
    eda_schema = json.load(f)

with open(EDA_SUMMARY_PATH, "r") as f:
    eda_summary = json.load(f)

# -----------------------------
# Contract validation (lightweight)
# -----------------------------
TARGET = eda_summary["target_variable"]

if TARGET not in df.columns:
    raise KeyError(f"Target column '{TARGET}' not found in EDA dataset")

num_minus_default_cols = [col for col in eda_summary["numerical_cols"] if col != TARGET]

expected_feature_cols = (
    set(eda_summary["numerical_cols"])
    | set(eda_summary["categorical_cols"])
    | set(eda_summary["datetime_cols"])
)

expected_feature_cols -= {TARGET}  # ensure target is not in features

columns_to_drop = (
    set(eda_summary["cols_to_drop"])
    | set(eda_summary["low_variance_cols"])
    | set(eda_summary["constant_cols"])
)

expected_feature_cols -= set(columns_to_drop)

actual_feature_cols = set(df.columns) - {TARGET}

missing_cols = expected_feature_cols - actual_feature_cols
extra_cols = actual_feature_cols - expected_feature_cols

if missing_cols:
    raise ValueError(f"Missing expected feature columns: {sorted(missing_cols)}")

if extra_cols:
    raise ValueError(f"Unexpected extra columns in dataset: {sorted(extra_cols)}")

# -----------------------------
# Separate features and target
# -----------------------------
X = df.drop(columns=[TARGET])
y = df[TARGET]

print("Feature Engineering input loaded successfully")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print("Target distribution:")
print(y.value_counts(normalize=True))

Expected feature columns: 77
Expected feature columns sample: ['acc_open_past_24mths', 'addr_state', 'all_util', 'annual_inc', 'annual_inc_joint', 'application_type', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'delinq_2yrs', 'dti', 'dti_joint', 'emp_length', 'fico_range_high', 'fico_range_low', 'funded_amnt', 'grade', 'home_ownership', 'il_util', 'inq_fi', 'installment', 'int_rate', 'issue_d', 'loan_amnt', 'max_bal_bc', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'open_acc', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'open_rv_12m', 'open_rv_24m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec', 'pub_rec_bankruptcies', 'purpose', 'revol_bal', 'revol_bal_joint', 'sec_app_chargeoff_within_12_mths', 'sec_app_

# 12. Candidate Feature Definition & Evaluation (Design Phase)



In [None]:
import numpy as np  # needed for np.log1p below

df_fe = df.copy()

# Treat non-positive income as missing so the ratios never become inf
if "annual_inc" in df_fe.columns:
    safe_income = df_fe["annual_inc"].where(df_fe["annual_inc"] > 0)

# Loan-to-income ratio
if {"loan_amnt", "annual_inc"}.issubset(df_fe.columns):
    df_fe["loan_to_income"] = df_fe["loan_amnt"] / safe_income

# Installment-to-income ratio
if {"installment", "annual_inc"}.issubset(df_fe.columns):
    df_fe["installment_to_income"] = df_fe["installment"] / safe_income

# Term in months
if "term" in df_fe.columns:
    df_fe["term_months"] = df_fe["term"].astype(str).str.extract(r"(\d+)", expand=False).astype(float)

# Grade numeric
if "grade" in df_fe.columns:
    grade_map = {g: i for i, g in enumerate(sorted(df_fe["grade"].dropna().unique()))}
    df_fe["grade_numeric"] = df_fe["grade"].map(grade_map)

# Subgrade numeric (e.g., A1, B3)
if "sub_grade" in df_fe.columns:
    sub_sorted = sorted(df_fe["sub_grade"].dropna().unique())
    sub_map = {sg: i for i, sg in enumerate(sub_sorted)}
    df_fe["sub_grade_numeric"] = df_fe["sub_grade"].map(sub_map)

# Log transforms for skewed variables
for col in ["annual_inc", "loan_amnt"]:
    if col in df_fe.columns:
        df_fe[f"log_{col}"] = np.log1p(df_fe[col])

df_fe.head()