### Silver Layer Transformation

This notebook processes Bronze Delta tables into the Silver layer.
The objective is to cleanse, standardize, and enrich data while applying core business rules.


In [20]:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder.appName("Silver Ingestion") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("WARN")

### Loading Bronze Tables

Read Bronze Delta tables as the input source for Silver transformations.
These datasets represent raw ingested data with minimal preprocessing.

In [21]:
orders_clean = spark.read.format("delta").load("../delta/01_bronze/orders")
order_items_clean = spark.read.format("delta").load("../delta/01_bronze/order_items")
order_payments_clean = spark.read.format("delta").load("../delta/01_bronze/order_payments")
customers_df = spark.read.format("delta").load("../delta/01_bronze/customers")
products_df = spark.read.format("delta").load("../delta/01_bronze/products")
sellers_df = spark.read.format("delta").load("../delta/01_bronze/sellers")

### Data Cleansing

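Drop payment records with a missing `order_id` or `payment_installments`, then aggregate payments to one row per order, where `payment_count` is the sum of `payment_installments` for that order.
Only data that meets quality and consistency requirements is propagated to the Silver layer.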

In [22]:
payments_agg = (
    order_payments_clean
        .dropna(subset=["order_id", "payment_installments"])
        .groupBy("order_id")
        .agg(F.sum("payment_installments").alias("payment_count"))
)
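
To make the effect of this step concrete, the plain-Python sketch below mirrors the same logic (skip rows with a missing key or installment count, then sum installments per order) on a few made-up rows. The sample data and the helper name `aggregate_payments` are assumptions for illustration only; the notebook itself performs this step with Spark.

```python
from collections import defaultdict

def aggregate_payments(rows):
    """Sum payment_installments per order_id, skipping incomplete rows."""
    totals = defaultdict(int)
    for row in rows:
        order_id = row.get("order_id")
        installments = row.get("payment_installments")
        # Mirrors dropna(subset=[...]): skip rows where either field is missing.
        if order_id is None or installments is None:
            continue
        totals[order_id] += installments
    return dict(totals)

# Hypothetical sample rows, not taken from the Bronze tables.
sample_payments = [
    {"order_id": "A", "payment_installments": 3},
    {"order_id": "A", "payment_installments": 1},
    {"order_id": "B", "payment_installments": None},
    {"order_id": None, "payment_installments": 2},
    {"order_id": "C", "payment_installments": 1},
]

payment_count = aggregate_payments(sample_payments)
print(payment_count)
```

Order `B` has no complete payment rows, so it is absent from the aggregate; when payments are later left-joined to orders, such orders carry a null `payment_count`.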

### Creating Table Aliases

In [23]:
o = orders_clean.alias("o")
i = order_items_clean.alias("i")
p = payments_agg.alias("p")
c = customers_df.alias("c")
pr = products_df.alias("pr")
s = sellers_df.alias("s")

### Dataset Integration

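Join related datasets to create enriched, relationally consistent Silver tables.
Orders are inner-joined to order items, so the result has one row per order item; payments, customers, products, and sellers are left-joined so that a missing match does not drop rows.
Each left-joined key must be unique in its source table, otherwise the join duplicates rows.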

In [24]:
orders_enriched = (
    o
        .join(i, F.col("o.order_id") == F.col("i.order_id"), "inner")
        .join(p, F.col("o.order_id") == F.col("p.order_id"), "left")
        .join(c, F.col("o.customer_id") == F.col("c.customer_id"), "left")
        .join(pr, F.col("i.product_id") == F.col("pr.product_id"), "left")
        .join(s, F.col("i.seller_id") == F.col("s.seller_id"), "left")
)
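
The joins above only preserve the one-row-per-order-item grain if each right-hand key (`order_id` in the payment aggregate, `customer_id`, `product_id`, `seller_id`) is unique in its table. A minimal plain-Python sketch of such a uniqueness check follows; the helper `find_duplicate_keys` and the sample rows are hypothetical, and on the real DataFrames an equivalent check could be expressed with `groupBy(key).count()` followed by a filter on `count > 1`.

```python
from collections import Counter

def find_duplicate_keys(rows, key):
    """Return the key values that occur more than once, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Hypothetical dimension rows; "c2" appears twice on purpose.
sample_customers = [
    {"customer_id": "c1", "customer_state": "SP"},
    {"customer_id": "c2", "customer_state": "RJ"},
    {"customer_id": "c2", "customer_state": "MG"},
]

duplicates = find_duplicate_keys(sample_customers, "customer_id")
print(duplicates)
```

A non-empty result means a left join on that key would emit one output row per duplicate, inflating the order-item count.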

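### Adding Columns

Compute the derived columns requested in the document: `total_price` (item price plus freight), `profit_margin` (item price minus freight), and `delivery_time_days` (days between purchase and delivery to the customer).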

In [25]:
orders_enriched = (
    orders_enriched
        .withColumn(
            "total_price",
            F.col("i.price") + F.col("i.freight_value")
        )
        .withColumn(
            "profit_margin",
            F.col("i.price") - F.col("i.freight_value")
        )
        .withColumn(
            "delivery_time_days",
            F.datediff(
                F.col("o.order_delivered_customer_date"),
                F.col("o.order_purchase_timestamp")
            )
        )
)
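
As a worked example of the three formulas, the plain-Python sketch below applies them to one made-up order item. The values and the helper name `derive_metrics` are assumptions; the day difference is computed on calendar dates, which matches how Spark's `datediff` treats timestamp inputs (the time of day is ignored).

```python
from datetime import datetime

def derive_metrics(price, freight_value, purchased_at, delivered_at):
    """Compute the derived columns for a single order item."""
    return {
        "total_price": price + freight_value,
        "profit_margin": price - freight_value,
        # Calendar-day difference; time of day is ignored.
        "delivery_time_days": (delivered_at.date() - purchased_at.date()).days,
    }

# Hypothetical order item, not taken from the dataset.
metrics = derive_metrics(
    price=100.0,
    freight_value=15.5,
    purchased_at=datetime(2018, 5, 7, 23, 30),
    delivered_at=datetime(2018, 5, 15, 9, 0),
)
print(metrics)
```

The sketch assumes both timestamps are present. In Spark, an order without a delivery date yields a null `delivery_time_days` rather than an error.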

### Enriched Orders Projection

Define the final structure of the `orders_enriched` dataset by selecting key identifiers, financial metrics, delivery indicators, and date attributes.
The resulting schema is optimized for downstream Silver persistence and Gold-level analytics.

In [26]:
orders_enriched = orders_enriched.select(
    F.col("o.order_id").alias("order_id"),
    F.col("o.customer_id").alias("customer_id"),
    F.col("c.customer_state").alias("customer_state"),
    F.col("i.product_id").alias("product_id"),
    F.col("pr.product_category_name").alias("product_category_name"),
    F.col("i.seller_id").alias("seller_id"),
    F.col("o.order_purchase_timestamp"),
    F.col("i.price"),
    F.col("i.freight_value"),
    F.col("total_price"),
    F.col("profit_margin"),
    F.col("delivery_time_days"),
    F.col("payment_count"),
    F.col("o.year").alias("year"),
    F.col("o.month").alias("month"),
    F.col("o.day").alias("day")
)

Check the columns and data types of the `orders_enriched` DataFrame.

In [27]:
orders_enriched.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)
 |-- total_price: double (nullable = true)
 |-- profit_margin: double (nullable = true)
 |-- delivery_time_days: integer (nullable = true)
 |-- payment_count: long (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)



In [28]:
orders_enriched.select("year", "month", "day").distinct().show()

+----+-----+---+
|year|month|day|
+----+-----+---+
|2018|    8|  6|
|2018|    8|  7|
|2017|   11| 25|
|2017|   11| 28|
|2018|    7| 31|
|2017|   11| 24|
|2018|    5| 14|
|2017|   11| 29|
|2018|    5|  8|
|2018|    5| 16|
|2017|   11| 27|
|2018|    5|  7|
|2017|   12|  4|
|2018|    5| 15|
|2018|    1| 15|
|2018|    5|  9|
|2017|   11| 26|
|2018|    8| 15|
|2018|    2| 22|
|2018|    8| 16|
+----+-----+---+
only showing top 20 rows

### Writing the Data into the Silver Layer

In [29]:
(
    orders_enriched.write
        .format("delta")
        .mode("overwrite")
        .option("mergeSchema", "true")
        .partitionBy("year", "month", "day")
        .save("../delta/02_silver/orders_enriched")
)

26/01/16 00:19:13 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers