# Bronze → Silver: Payments (Aggregated)

## Purpose
Aggregate payment records to one row per `order_id`. About 3% of orders have more than one payment record.

## Transformations
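- Drop rows with a null `order_id` or a non-positive `payment_value`
- Normalize `payment_type` (trim, lowercase)
- Group by `order_id`
- Sum payment values
- Count payment records per order
- Average the installments
- Collect payment types into an array
- Keep the highest `payment_sequential`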
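
To make the semantics of these steps concrete, here is a minimal Spark-free sketch in plain Python. The function name `aggregate_payments` and the sample rows are illustrative assumptions, not part of the notebook; the actual job further down uses PySpark.

```python
from collections import defaultdict

def aggregate_payments(rows):
    """Collapse payment rows into one summary dict per order_id.

    Each row is a dict with order_id, payment_type,
    payment_installments and payment_value keys.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["order_id"]].append(row)

    summary = {}
    for order_id, payments in grouped.items():
        summary[order_id] = {
            "total_payment_value": sum(p["payment_value"] for p in payments),
            "payment_count": len(payments),
            "avg_installments": sum(p["payment_installments"] for p in payments) / len(payments),
            # Like collect_list, this keeps one entry per payment row, duplicates included.
            "payment_types": [p["payment_type"] for p in payments],
        }
    return summary

# Hypothetical sample: order "a" is paid with a card plus a voucher.
sample = [
    {"order_id": "a", "payment_type": "credit_card", "payment_installments": 2, "payment_value": 50.0},
    {"order_id": "a", "payment_type": "voucher", "payment_installments": 1, "payment_value": 25.0},
    {"order_id": "b", "payment_type": "boleto", "payment_installments": 1, "payment_value": 10.0},
]
result = aggregate_payments(sample)
print(result["a"]["total_payment_value"])  # 75.0
```

The list of payment types is kept as-is rather than de-duplicated, so its length always equals `payment_count`.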

## Input
- **Source**: `bronze/olist/payments/OLIST.OLIST_ORDER_PAYMENTS_BASE.parquet`
- **Records**: ~103,886 payment records

## Output
- **Destination**: `silver/payments_clean/`
- **Format**: Delta Lake
- **Expected Records**: ~99,440 (one row per order; 99,437 in this run, after rows with a non-positive `payment_value` are filtered out)

**Author:** Kevin  
**Date:** Feb 9, 2026


In [0]:
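# Note: this import shadows Python's built-in sum and max with the Spark column functions.
from pyspark.sql.functions import (
    col, sum, count, avg, max, collect_list,
    current_timestamp, when, lower, trim
)

storage_account_name = "stgolistmigration"
account_key = ""  # blank in this notebook; must be set before running, and should not be committed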

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

def get_bronze_path(folder, filename):
    return f"abfss://bronze@{storage_account_name}.dfs.core.windows.net/olist/{folder}/{filename}"

def get_silver_path(table):
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded
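
The account key is an empty string in the cell above. One way to avoid hardcoding it is to read it from the environment at runtime. The sketch below assumes a hypothetical variable name `OLIST_STORAGE_ACCOUNT_KEY` and two hypothetical helpers, `load_account_key` and `account_key_conf_name`; none of these appear in the notebook.

```python
import os

def load_account_key(env_var="OLIST_STORAGE_ACCOUNT_KEY"):
    """Return the storage key from the environment, failing fast if it is missing."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set {env_var} before running this notebook")
    return key

def account_key_conf_name(storage_account_name):
    """Build the Spark conf name used in the config cell for account-key auth."""
    return f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net"
```

With these in place, the config cell could call `spark.conf.set(account_key_conf_name(storage_account_name), load_account_key())`, which raises immediately instead of failing later on the first read.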


In [0]:
bronze_path = get_bronze_path("payments", "OLIST.OLIST_ORDER_PAYMENTS_BASE.parquet")

print(f"📖 Reading: {bronze_path}")

df_payments_bronze = spark.read.parquet(bronze_path)

print(f"✅ Loaded: {df_payments_bronze.count():,} payment records")
print(f"   Columns: {len(df_payments_bronze.columns)}")

df_payments_bronze.limit(3).show(truncate=False, vertical=True)


📖 Reading: abfss://bronze@stgolistmigration.dfs.core.windows.net/olist/payments/OLIST.OLIST_ORDER_PAYMENTS_BASE.parquet
✅ Loaded: 103,886 payment records
   Columns: 5
-RECORD 0------------------------------------------------
 ORDER_ID             | b81ef226f3fe1789b1e8b2acac839d17 
 PAYMENT_SEQUENTIAL   | 1.000000000000000000             
 PAYMENT_TYPE         | credit_card                      
 PAYMENT_INSTALLMENTS | 8.000000000000000000             
 PAYMENT_VALUE        | 99.33                            
-RECORD 1------------------------------------------------
 ORDER_ID             | a9810da82917af2d9aefd1278f1dcfa0 
 PAYMENT_SEQUENTIAL   | 1.000000000000000000             
 PAYMENT_TYPE         | credit_card                      
 PAYMENT_INSTALLMENTS | 1.000000000000000000             
 PAYMENT_VALUE        | 24.39                            
-RECORD 2------------------------------------------------
 ORDER_ID             | 25e8ea4e93396b6fa0d3dd708e76c1bd 
 PAYMENT_SEQUENTIAL 

In [0]:
print("🔍 Data Quality Check")
print("=" * 80)

# Null counts
print("\n1️⃣ NULL VALUES:")
null_counts = df_payments_bronze.select([
    count(when(col(c).isNull(), c)).alias(c) 
    for c in df_payments_bronze.columns
])
null_counts.show(vertical=True, truncate=False)

# Payment type distribution
print("\n2️⃣ PAYMENT TYPES:")
df_payments_bronze.groupBy("payment_type") \
    .count() \
    .orderBy(col("count").desc()) \
    .show(truncate=False)

# Check for multiple payments per order
print("\n3️⃣ ORDERS WITH MULTIPLE PAYMENTS:")
multi_payments = df_payments_bronze.groupBy("order_id") \
    .count() \
    .filter(col("count") > 1) \
    .count()

total_orders = df_payments_bronze.select("order_id").distinct().count()
print(f"Total unique orders: {total_orders:,}")
print(f"Orders with multiple payments: {multi_payments:,}")
print(f"Percentage: {multi_payments/total_orders*100:.1f}%")

print("=" * 80)


🔍 Data Quality Check

1️⃣ NULL VALUES:
-RECORD 0-------------------
 ORDER_ID             | 0   
 PAYMENT_SEQUENTIAL   | 0   
 PAYMENT_TYPE         | 0   
 PAYMENT_INSTALLMENTS | 0   
 PAYMENT_VALUE        | 0   


2️⃣ PAYMENT TYPES:
+------------+-----+
|payment_type|count|
+------------+-----+
|credit_card |76795|
|boleto      |19784|
|voucher     |5775 |
|debit_card  |1529 |
|not_defined |3    |
+------------+-----+


3️⃣ ORDERS WITH MULTIPLE PAYMENTS:
Total unique orders: 99,440
Orders with multiple payments: 2,961
Percentage: 3.0%


In [0]:
print("🔄 Aggregating payments by order_id...")

df_payments_silver = (
    df_payments_bronze
    # Drop rows that cannot be attributed to an order.
    .filter(col("order_id").isNotNull())
    # Keep positive payments only; orders whose payments are all non-positive drop out.
    .filter(col("payment_value") > 0)
    # Normalize payment type labels before collecting them.
    .withColumn("payment_type_clean", lower(trim(col("payment_type"))))
    .groupBy("order_id")
    .agg(
        sum("payment_value").alias("total_payment_value"),
        count("*").alias("payment_count"),
        # collect_list keeps duplicates: one entry per payment record.
        collect_list("payment_type_clean").alias("payment_types"),
        avg("payment_installments").alias("avg_installments"),
        max("payment_sequential").alias("max_payment_sequential")
    )
    .withColumn("ingestion_timestamp", current_timestamp())
)

silver_count = df_payments_silver.count()
bronze_count = df_payments_bronze.count()

print("✅ Aggregation complete")
print(f"   Bronze records: {bronze_count:,} (payment transactions)")
print(f"   Silver records: {silver_count:,} (unique orders)")
print(f"   Compression ratio: {bronze_count/silver_count:.2f}x")


🔄 Aggregating payments by order_id...
✅ Aggregation complete
   Bronze records: 103,886 (payment transactions)
   Silver records: 99,437 (unique orders)
   Compression ratio: 1.04x
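
Silver has 99,437 rows while bronze has 99,440 distinct orders. Since no `order_id` is null, the three missing orders must be ones whose payment rows all fail the `payment_value > 0` filter. The plain-Python sketch below shows that reconciliation on made-up rows; `dropped_orders` is a hypothetical helper, and the same check could be done in Spark with an anti-join between the bronze order IDs and the silver table.

```python
def dropped_orders(rows):
    """Return order_ids that appear in the input but have no positive payment."""
    all_orders = {r["order_id"] for r in rows}
    kept_orders = {r["order_id"] for r in rows if r["payment_value"] > 0}
    return sorted(all_orders - kept_orders)

# Made-up rows for illustration only.
rows = [
    {"order_id": "a", "payment_value": 99.33},
    {"order_id": "b", "payment_value": 0.0},   # only payment is zero: the order disappears
    {"order_id": "c", "payment_value": 0.0},   # this row is dropped, but the order survives
    {"order_id": "c", "payment_value": 24.39},
]
missing = dropped_orders(rows)
print(missing)  # ['b']
```

Listing the dropped IDs this way makes it easy to decide whether those orders should be excluded or kept with a zero total.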


In [0]:
print("📊 Silver Preview")
print("=" * 80)

print("\nPayment summary:")
df_payments_silver.select(
    count("*").alias("orders_with_payments"),
    sum("total_payment_value").alias("total_revenue"),
    avg("total_payment_value").alias("avg_payment_per_order"),
    avg("payment_count").alias("avg_payments_per_order"),
    avg("avg_installments").alias("overall_avg_installments")
).show(truncate=False)

print("\nSample aggregated records:")
df_payments_silver.limit(3).show(truncate=False, vertical=True)

print("\nOrders with multiple payments:")
df_payments_silver.filter(col("payment_count") > 1) \
    .limit(5) \
    .select("order_id", "payment_count", "payment_types", "total_payment_value") \
    .show(truncate=False)


📊 Silver Preview

Payment summary:
+--------------------+--------------------+---------------------+----------------------+----------------------------+
|orders_with_payments|total_revenue       |avg_payment_per_order|avg_payments_per_order|overall_avg_installments    |
+--------------------+--------------------+---------------------+----------------------+----------------------------+
|99437               |1.6008872119999776E7|160.99512374669163   |1.0446513873105585    |2.91475681318144385211975747|
+--------------------+--------------------+---------------------+----------------------+----------------------------+


Sample aggregated records:
-RECORD 0--------------------------------------------------
 order_id               | 85be7c94bcd3f908fc877157ee21f755 
 total_payment_value    | 72.75                            
 payment_count          | 1                                
 payment_types          | [credit_card]                    
 avg_installments       | 1.000000000000000000

In [0]:
output_path = get_silver_path("payments_clean")

print(f"💾 Writing to: {output_path}")

df_payments_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(output_path)

print("✅ Payments Silver complete!")


💾 Writing to: abfss://silver@stgolistmigration.dfs.core.windows.net/payments_clean/
✅ Payments Silver complete!


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} orders with payment data")
print(f"   Total revenue: ${df_verify.agg(sum('total_payment_value')).collect()[0][0]:,.2f}")

print("\nPayment distribution:")
df_verify.groupBy("payment_count") \
    .count() \
    .orderBy("payment_count") \
    .show(10, truncate=False)

print("=" * 80)
print("🎉 Payments Bronze → Silver complete!")
print("\n🏆 ALL 4 SILVER TABLES COMPLETE! 🏆")


🔍 Verifying...
✅ Verified: 99,437 orders with payment data
   Total revenue: $16,008,872.12

Payment distribution:
+-------------+-----+
|payment_count|count|
+-------------+-----+
|1            |96476|
|2            |2383 |
|3            |303  |
|4            |105  |
|5            |52   |
|6            |36   |
|7            |28   |
|8            |11   |
|9            |9    |
|10           |5    |
+-------------+-----+
only showing top 10 rows
🎉 Payments Bronze → Silver complete!

🏆 ALL 4 SILVER TABLES COMPLETE! 🏆
