# Notebook 01: Ingestion & Initial Cleaning

This notebook performs the first stage of the data engineering pipeline:
- Load the raw dataset into Spark
- Diagnose schema, nulls, and data quality
- Standardize column names
- Remove unnecessary columns
- Save the cleaned dataset into the Bronze layer (Delta format)

This step is critical because every downstream transformation depends on clean,
consistent, and well-structured data.


In [0]:
spark


In [0]:
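# Import only the functions this notebook uses; a wildcard import from
# pyspark.sql.functions would shadow Python builtins such as sum, min, and max.
from pyspark.sql.functions import col, count, when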


## Load raw dataset
Read the raw CSV file stored in a workspace volume.
Schema inference is enabled for this initial inspection.



In [0]:
df_raw = spark.read.csv(
    "dbfs:/Volumes/workspace/credit-risk/credit-risk/german_credit_data.csv",
    header=True,
    inferSchema=True
)

df_raw.display()


## Null analysis
Count null values per column to verify dataset completeness.



In [0]:
from pyspark.sql.functions import col, when, count

nulls = df_raw.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in df_raw.columns
])

display(nulls)


In [0]:
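# Spark assigns the name "_c0" to a first CSV column with an empty header;
# here it is presumably an exported row index and carries no information.
# drop() is a no-op if the column does not exist.
df_clean = df_raw.drop("_c0")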


## Clean column names
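Convert to lowercase snake_case for Spark ML compatibility. Delta also rejects column names that contain spaces or certain special characters unless column mapping is enabled, so this step is needed before the Bronze write.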


In [0]:
df_clean = df_clean.toDF(*[
    c.lower()
     .replace(" ", "_")
     .replace("-", "_")
     .replace("/", "_")
    for c in df_clean.columns
])

df_clean.display()
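The chained `replace` calls above handle three specific characters. A more general variant, sketched below, collapses any run of non-alphanumeric characters into a single underscore and checks that no two columns end up with the same name. Both `to_snake_case` and `rename_all` are hypothetical helpers, not part of the original notebook.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace each run of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def rename_all(columns):
    """Normalize every name and fail loudly if two columns collide after cleaning."""
    new_names = [to_snake_case(c) for c in columns]
    dupes = sorted({n for n in new_names if new_names.count(n) > 1})
    if dupes:
        raise ValueError(f"Column names collide after normalization: {dupes}")
    return new_names
```

In the notebook this would be applied as `df_clean.toDF(*rename_all(df_clean.columns))`. The collision check matters because two distinct raw names, for example an illustrative `"a b"` and `"a-b"`, would otherwise both become `a_b`.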


## Write Bronze table
Only minimal transformations are applied; the result is stored in Delta format for downstream feature engineering.


In [0]:
df_clean.write.format("delta").mode("overwrite").save(
    "dbfs:/Volumes/workspace/credit-risk/credit-risk/bronze"
)


In [0]:
df_bronze = spark.read.format("delta").load(
    "dbfs:/Volumes/workspace/credit-risk/credit-risk/bronze"
)

df_bronze.display()
