# Bronze → Silver: Order Items

## Purpose
Clean the Olist order line items and add calculated fields (`total_amount`, `ingestion_timestamp`).

## Transformations
- Drop rows with a null `order_id` or `product_id`
- Calculate `total_amount = price + freight_value`
- Drop rows with a negative `price`
- Add an `ingestion_timestamp` column

## Input
- **Source**: `bronze/olist/orders/OLIST.OLIST_ORDER_ITEMS_BASE.parquet`
- **Records**: 112,650 (7 columns)

## Output
- **Destination**: `silver/order_items_clean/`
- **Format**: Delta Lake

**Author:** Kevin  
**Date:** Feb 9, 2026


In [0]:
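# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum() for the rest of the notebook.
from pyspark.sql.functions import col, current_timestamp, count, when, sum, avg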

storage_account_name = "stgolistmigration"
account_key = ""  # empty placeholder: supply the storage account key at runtime, do not commit it

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

def get_bronze_path(folder, filename):
    return f"abfss://bronze@{storage_account_name}.dfs.core.windows.net/olist/{folder}/{filename}"

def get_silver_path(table):
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded
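The config cell leaves `account_key` empty. One possible way to supply it at runtime, sketched below, is to read it from an environment variable and fail fast when it is missing. The variable name `OLIST_STORAGE_ACCOUNT_KEY` and the helper `load_account_key` are hypothetical names chosen for this sketch; on Databricks, `dbutils.secrets.get(scope=..., key=...)` with your own scope and key names would be the more usual source.

```python
import os


def load_account_key(env_var: str = "OLIST_STORAGE_ACCOUNT_KEY") -> str:
    """Return the storage account key from the environment, or raise if unset.

    The default variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Storage account key not found: set the {env_var} environment variable"
        )
    return key
```

With this in place, the config cell could call `account_key = load_account_key()` before `spark.conf.set(...)`, so a missing key stops the notebook immediately instead of failing later on the first read.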


In [0]:
bronze_path = get_bronze_path("orders", "OLIST.OLIST_ORDER_ITEMS_BASE.parquet")

print(f"📖 Reading: {bronze_path}")

df_items_bronze = spark.read.parquet(bronze_path)

print(f"✅ Loaded: {df_items_bronze.count():,} rows")
print(f"   Columns: {len(df_items_bronze.columns)}")

df_items_bronze.limit(3).show(truncate=False, vertical=True)


📖 Reading: abfss://bronze@stgolistmigration.dfs.core.windows.net/olist/orders/OLIST.OLIST_ORDER_ITEMS_BASE.parquet
✅ Loaded: 112,650 rows
   Columns: 7
-RECORD 0-----------------------------------------------
 ORDER_ID            | 00010242fe8c5a6d1ba2dd792cb16214 
 ORDER_ITEM_ID       | 1.000000000000000000             
 PRODUCT_ID          | 4244733e06e7ecb4970a6e2683c13e61 
 SELLER_ID           | 48436dade18ac8b2bce089ec2a041202 
 SHIPPING_LIMIT_DATE | 2017-09-19 09:45:35              
 PRICE               | 58.9                             
 FREIGHT_VALUE       | 13.29                            
-RECORD 1-----------------------------------------------
 ORDER_ID            | 00018f77f2f0320c557190d7a144bdd3 
 ORDER_ITEM_ID       | 1.000000000000000000             
 PRODUCT_ID          | e5f2d52b802189ee658865ca93d83a8f 
 SELLER_ID           | dd7ddc04e1b6c2c614352b383efe2d36 
 SHIPPING_LIMIT_DATE | 2017-05-03 11:05:13              
 PRICE               | 239.9                      
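The preview shows seven upper-case column names, while the later cells refer to them in lower case (for example `col("price")`). That works because Spark resolves column names case-insensitively by default. A small schema check can make the expectation explicit before any transformation runs. This is a sketch only: `EXPECTED_COLUMNS`, `missing_columns`, and `check_schema` are hypothetical names, and the expected set is taken from the preview above.

```python
# Column names taken from the bronze preview, lower-cased for comparison.
EXPECTED_COLUMNS = {
    "order_id", "order_item_id", "product_id", "seller_id",
    "shipping_limit_date", "price", "freight_value",
}


def missing_columns(actual_columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from actual_columns, ignoring case."""
    present = {name.lower() for name in actual_columns}
    return sorted(expected - present)


def check_schema(actual_columns):
    """Raise ValueError if any expected column is missing."""
    missing = missing_columns(actual_columns)
    if missing:
        raise ValueError(f"Bronze file is missing columns: {missing}")
```

In the notebook this would be called as `check_schema(df_items_bronze.columns)` right after the read.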

In [0]:
print("🔍 Data Quality Check")
print("=" * 80)

# Null counts
print("\n1️⃣ NULL VALUES:")
null_counts = df_items_bronze.select([
    count(when(col(c).isNull(), c)).alias(c) 
    for c in df_items_bronze.columns
])
null_counts.show(vertical=True, truncate=False)

# Check for negative prices
negative_prices = df_items_bronze.filter(col("price") < 0).count()
print(f"\n2️⃣ NEGATIVE PRICES: {negative_prices:,}")

# Price statistics
print("\n3️⃣ PRICE STATS:")
df_items_bronze.select(
    avg("price").alias("avg_price"),
    # Sum of item prices only; freight is excluded here
    sum("price").alias("total_revenue"),
    avg("freight_value").alias("avg_freight")
).show(truncate=False)

print("=" * 80)


🔍 Data Quality Check

1️⃣ NULL VALUES:
-RECORD 0------------------
 ORDER_ID            | 0   
 ORDER_ITEM_ID       | 0   
 PRODUCT_ID          | 0   
 SELLER_ID           | 0   
 SHIPPING_LIMIT_DATE | 0   
 PRICE               | 0   
 FREIGHT_VALUE       | 0   


2️⃣ NEGATIVE PRICES: 0

3️⃣ PRICE STATS:
+------------------+--------------------+------------------+
|avg_price         |total_revenue       |avg_freight       |
+------------------+--------------------+------------------+
|120.65373901477277|1.3591643700014152E7|19.990319928983567|
+------------------+--------------------+------------------+



In [0]:
print("🔄 Transforming order items...")

df_items_silver = (
    df_items_bronze
    .filter(col("order_id").isNotNull())      # drop rows without an order key
    .filter(col("product_id").isNotNull())    # drop rows without a product key
    .filter(col("price") >= 0)                # drop negative prices (null prices are dropped too)
    .withColumn("total_amount", col("price") + col("freight_value"))
    .withColumn("ingestion_timestamp", current_timestamp())
)

silver_count = df_items_silver.count()
removed = df_items_bronze.count() - silver_count

print("✅ Transformation complete")
print(f"   Silver rows: {silver_count:,}")
print(f"   Removed: {removed:,}")


🔄 Transforming order items...
✅ Transformation complete
   Silver rows: 112,650
   Removed: 0
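The cell above reports only the total number of removed rows. When that number is non-zero it helps to know which rule rejected each row. The sketch below mirrors the three filters on plain Python dictionaries so the rules can be unit-tested without a cluster. The names `reject_reason` and `summarize_rejections` are hypothetical, and this is an illustration of the rules rather than the notebook's own implementation.

```python
from collections import Counter


def reject_reason(row):
    """Return why a row would be dropped, or None if it passes every rule.

    A null price is rejected too: in Spark, `price >= 0` evaluates to NULL
    for a null price, and filter() keeps only rows whose condition is true.
    """
    if row.get("order_id") is None:
        return "null_order_id"
    if row.get("product_id") is None:
        return "null_product_id"
    price = row.get("price")
    if price is None:
        return "null_price"
    if price < 0:
        return "negative_price"
    return None


def summarize_rejections(rows):
    """Count rejected rows per reason (the first failing rule wins)."""
    reasons = (reject_reason(row) for row in rows)
    return Counter(reason for reason in reasons if reason is not None)
```

For the bronze data shown here every count would be zero, since the quality check found no nulls and no negative prices.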


In [0]:
print("📊 Silver Preview")
print("=" * 80)

print("\nRevenue stats:")
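df_items_silver.select(
    count("*").alias("items"),
    # total_amount includes freight, so this exceeds the price-only sum from the quality check
    sum("total_amount").alias("total_revenue"),
    # Despite the alias, this is the mean per line item, not per order
    avg("total_amount").alias("avg_order_value"),
    avg("price").alias("avg_price"),
    avg("freight_value").alias("avg_freight")
).show(truncate=False)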

print("\nSample records:")
df_items_silver.limit(3).show(truncate=False, vertical=True)


📊 Silver Preview

Revenue stats:
+------+--------------------+------------------+----------------+------------------+
|items |total_revenue       |avg_order_value   |avg_price       |avg_freight       |
+------+--------------------+------------------+----------------+------------------+
|112650|1.5843553239998823E7|140.64405894362028|120.653739014649|19.990319928983567|
+------+--------------------+------------------+----------------+------------------+


Sample records:
-RECORD 0-----------------------------------------------
 ORDER_ID            | 00010242fe8c5a6d1ba2dd792cb16214 
 ORDER_ITEM_ID       | 1.000000000000000000             
 PRODUCT_ID          | 4244733e06e7ecb4970a6e2683c13e61 
 SELLER_ID           | 48436dade18ac8b2bce089ec2a041202 
 SHIPPING_LIMIT_DATE | 2017-09-19 09:45:35              
 PRICE               | 58.9                             
 FREIGHT_VALUE       | 13.29                            
 total_amount        | 72.19                            
 ingestion_
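The `avg_price` printed by the quality check (120.65373901477277) and by the silver preview (120.653739014649) differ in their trailing digits even though no rows were removed. This is most likely because `PRICE` is stored as a double: floating-point addition is not associative, so an aggregate can change slightly with the order in which values are combined. The sketch below shows the effect in plain Python and an exact alternative using `decimal.Decimal`. The helper `to_money` is a hypothetical name, and two decimal places is an assumption about the data. In Spark, the analogous step would be a cast such as `col("price").cast("decimal(18,2)")`.

```python
from decimal import Decimal


def to_money(value) -> Decimal:
    """Convert a float to an exact two-decimal amount via its short repr."""
    return Decimal(str(value)).quantize(Decimal("0.01"))


# Float addition depends on grouping, so totals can vary between runs.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Decimal addition is exact, so grouping does not matter.
a, b, c = to_money(0.1), to_money(0.2), to_money(0.3)
print((a + b) + c == a + (b + c))  # True

# First sample row: price 58.9 plus freight 13.29.
print(to_money(58.9) + to_money(13.29))  # 72.19
```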

In [0]:
output_path = get_silver_path("order_items_clean")

print(f"💾 Writing to: {output_path}")

df_items_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(output_path)

print("✅ Order Items Silver complete!")


💾 Writing to: abfss://silver@stgolistmigration.dfs.core.windows.net/order_items_clean/
✅ Order Items Silver complete!


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} order items")
# Olist amounts are in Brazilian reais (BRL); the "$" below is only a display prefix
print(f"   Total revenue: ${df_verify.agg(sum('total_amount')).collect()[0][0]:,.2f}")

print("=" * 80)
print("🎉 Order Items Bronze → Silver complete!")


🔍 Verifying...
✅ Verified: 112,650 order items
   Total revenue: $15,843,553.24
🎉 Order Items Bronze → Silver complete!
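The verification cell prints the row count and total, but a mismatch would not stop the notebook. A possible hardening, sketched below, compares the written table against the in-memory silver DataFrame and raises on any difference. The function name `reconcile` and the half-cent tolerance are assumptions made for this sketch.

```python
import math


def reconcile(expected_rows, actual_rows, expected_total, actual_total, abs_tol=0.005):
    """Raise ValueError if row counts differ or totals differ by more than abs_tol.

    The default tolerance of half a cent absorbs floating-point summation noise
    while still catching a one-cent discrepancy.
    """
    if expected_rows != actual_rows:
        raise ValueError(
            f"Row count mismatch: expected {expected_rows:,}, got {actual_rows:,}"
        )
    if not math.isclose(expected_total, actual_total, rel_tol=0.0, abs_tol=abs_tol):
        raise ValueError(
            f"Total mismatch: expected {expected_total:,.2f}, got {actual_total:,.2f}"
        )
```

In the notebook, the expected values would come from `df_items_silver` (for example `silver_count` and its summed `total_amount`) and the actual values from `df_verify`, so a partial or stale write fails the run instead of passing silently.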
