# Notebook 02 — Feature Engineering (Silver & Gold Layers)

This notebook performs the feature engineering steps required to prepare the dataset
for machine learning. The transformations follow a standard Databricks/Spark ML workflow.

### Steps:
- Load Bronze table
- Identify categorical and numerical features
- Clean categorical values ("NA" → "unknown")
- Apply StringIndexer + OneHotEncoder
- Normalize numerical features
- Assemble final feature vector
- Generate Silver and Gold Delta tables



## Load Bronze Table

Read the Bronze Delta table generated in Notebook 01.  
This table contains standardized raw data ready for feature engineering.


In [0]:
df_bronze = spark.read.format("delta").load(
    "dbfs:/Volumes/workspace/credit-risk/credit-risk/bronze"
)

df_bronze.display()



## Identify Feature Types

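Programmatically detect categorical (string) and numerical (int/double) columns from the Bronze schema.
This keeps the pipeline dynamic and schema-driven, but every matching column becomes a feature. Columns of other types (for example bigint or float) are skipped, and if the Bronze table contains an identifier or target column of a matching type, exclude it from these lists so it does not leak into the feature vector.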


In [0]:
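# DataFrame.dtypes returns (column name, Spark SQL type name) pairs.
categorical_cols = [c for c, t in df_bronze.dtypes if t == "string"]
# Only int and double are matched; bigint, float and decimal columns are skipped.
numeric_cols = [c for c, t in df_bronze.dtypes if t in ("int", "double")]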

categorical_cols, numeric_cols
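
The selection rule above can be tested without a Spark session. The following is a minimal sketch, assuming a list of `(name, type)` pairs shaped like `DataFrame.dtypes`. The helper `split_feature_columns`, the `NUMERIC_TYPES` set, and the sample schema (including the `id`, `age`, `credit_amount` and `risk` columns) are hypothetical and not part of this notebook. Unlike the cell above, the sketch also accepts wider numeric types and takes an explicit exclusion list.

```python
# Hypothetical helper: classify columns from a Spark-style dtypes list.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def split_feature_columns(dtypes, exclude=()):
    """Split (name, type) pairs into categorical and numeric column names."""
    excluded = set(exclude)
    categorical, numeric = [], []
    for name, dtype in dtypes:
        if name in excluded:
            continue  # e.g. identifier or target columns
        if dtype == "string":
            categorical.append(name)
        elif dtype in NUMERIC_TYPES or dtype.startswith("decimal"):
            numeric.append(name)
    return categorical, numeric


# Hypothetical schema, shaped like the output of df.dtypes.
sample_dtypes = [
    ("id", "bigint"),
    ("age", "int"),
    ("credit_amount", "double"),
    ("saving_accounts", "string"),
    ("risk", "string"),
]

cat_cols, num_cols = split_feature_columns(sample_dtypes, exclude=["id", "risk"])
print(cat_cols)
print(num_cols)
```

Passing the exclusion list explicitly keeps the schema-driven approach while making it visible which columns are deliberately left out of the features.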

## Clean Categorical Values

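Replace the literal "NA" token in `saving_accounts` and `checking_account` with a consistent placeholder ("unknown").
This turns missing account information into an explicit category for the indexer. Genuine nulls do not match the comparison and are left unchanged.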


In [0]:
from pyspark.sql.functions import when

df_fixed = df_bronze

cols_with_na = ["saving_accounts", "checking_account"]

for col_name in cols_with_na:
    # Rows equal to "NA" become "unknown"; all other values (including nulls) pass through.
    df_fixed = df_fixed.withColumn(
        col_name,
        when(df_fixed[col_name] == "NA", "unknown").otherwise(df_fixed[col_name])
    )

df_fixed.display()
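
The mapping rule applied by the `when(...).otherwise(...)` expression can be written as a plain Python function, which is convenient for unit-testing the cleaning logic outside Spark. This is only an illustrative sketch: the function name `clean_category` and the sample value `"little"` are hypothetical.

```python
def clean_category(value, na_token="NA", placeholder="unknown"):
    """Mirror the notebook rule: only the exact NA token is replaced."""
    if value == na_token:
        return placeholder
    # Everything else, including None, is returned unchanged.
    return value


# Hypothetical sample values for a column such as saving_accounts.
raw_values = ["NA", "little", None]
cleaned = [clean_category(v) for v in raw_values]
print(cleaned)
```

The `None` entry stays `None`, which matches the Spark behaviour where a null comparison falls through to the `otherwise` branch.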


## Encode Categorical Features

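Use:
- StringIndexer → maps each category string to a numerical index (`handleInvalid="keep"` assigns unseen values to an extra index instead of raising an error)
- OneHotEncoder → converts each index into a sparse binary vector

One-hot encoding prevents the model from reading the index values as an ordering of the categories.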


In [0]:
from pyspark.ml.feature import StringIndexer

indexers = [
    StringIndexer(
        inputCol=c,
        outputCol=f"{c}_idx",
        handleInvalid="keep"
    )
    for c in categorical_cols
]


In [0]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=[f"{c}_idx" for c in categorical_cols],
    outputCols=[f"{c}_ohe" for c in categorical_cols]
)
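
To make the two encoding steps concrete, here is a small plain-Python sketch of what they compute. It assumes the default settings as I understand them: StringIndexer orders labels by descending frequency, and OneHotEncoder drops the last category (`dropLast=True`), so that category is represented by an all-zero vector. The helper names and the `housing` sample values are hypothetical, the sketch returns dense lists rather than Spark sparse vectors, and it does not model the extra bucket added by `handleInvalid="keep"`.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical categorical column.
housing = ["own", "own", "rent", "free", "own", "rent"]

mapping = fit_string_index(housing)
encoded = [one_hot(mapping[v], len(mapping)) for v in housing]

print(mapping)
for value, vector in zip(housing, encoded):
    print(value, vector)
```

Each category ends up on its own axis, so no numeric distance between the index values 0, 1 and 2 reaches the model.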


## Normalize Numerical Features

Assemble all numerical columns into a single vector and rescale it with MinMaxScaler, which maps each feature to the [0, 1] range by default.
This keeps features with large magnitudes from dominating features with small ones during model training.


In [0]:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

numeric_assembler = VectorAssembler(
    inputCols=numeric_cols,
    outputCol="numeric_vector"
)

scaler = MinMaxScaler(
    inputCol="numeric_vector",
    outputCol="numeric_scaled"
)
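
The arithmetic behind min-max scaling is simple enough to sketch in plain Python. The function below follows the formula given in the Spark ML documentation for MinMaxScaler, applied to a single feature: `(x - E_min) / (E_max - E_min) * (max - min) + min`, with a constant feature mapped to the midpoint of the target range. The function name `min_max_scale` and the sample `ages` values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale one feature linearly so its observed range maps to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: every value maps to the midpoint of the target range.
        midpoint = 0.5 * (new_min + new_max)
        return [midpoint for _ in values]
    span = hi - lo
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]


# Hypothetical numeric column.
ages = [20.0, 35.0, 50.0]
scaled = min_max_scale(ages)
print(scaled)
```

Because the minimum and maximum are learned when the scaler is fitted, values outside that range in later data can fall outside [0, 1].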


## Assemble Final Feature Vector

Concatenate all encoded categorical vectors and the normalized numeric vector into the 
`features` column used by Spark ML models.


In [0]:
final_features = [f"{c}_ohe" for c in categorical_cols] + ["numeric_scaled"]

assembler = VectorAssembler(
    inputCols=final_features,
    outputCol="features"
)



## Fit Pipeline & Generate Gold Table

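Fit the end-to-end transformation pipeline on the cleaned data and apply it to produce the Gold table. The two outputs of this notebook are:
- Silver table (`df_fixed`: Bronze data with cleaned categorical values; no indexing or encoding applied)
- Gold table (`df_gold`: the Silver columns plus the indexed, one-hot encoded, scaled, and assembled `features` columns)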


In [0]:
from pyspark.ml import Pipeline

pipeline = Pipeline(
    stages=indexers + [encoder, numeric_assembler, scaler, assembler]
)

model = pipeline.fit(df_fixed)
df_gold = model.transform(df_fixed)

df_gold.display()


## Save Silver & Gold Tables

Persist the engineered tables in Delta format for downstream model training.


In [0]:
df_fixed.write.format("delta").mode("overwrite").save(
    "dbfs:/Volumes/workspace/credit-risk/credit-risk/silver"
)


In [0]:
df_gold.write.format("delta").mode("overwrite").save(
    "dbfs:/Volumes/workspace/credit-risk/credit-risk/gold"
)


In [0]:
df_gold.display()
