# Bronze → Silver: Customers Transformation

## Purpose
Clean and standardize the Olist customer records from the Bronze layer and write them to the Silver layer as a Delta table.

## Transformations
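- Remove rows with a null `customer_id`
- Standardize state names (trim, uppercase) and city names (trim, initial capitals)
- Format zip codes (left-pad with zeros to 5 digits)
- Remove duplicate `customer_id` rows
- Add an `ingestion_timestamp` column and rename columns to `zip_code`, `city`, `state`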
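
The rules above can be sketched on a single record in plain Python, which may be handy for unit-testing the intended behavior outside Spark. This is a minimal illustration rather than the notebook's implementation: `normalize_customer` and the local `initcap` are hypothetical helpers, the sample values are adapted from the Bronze preview with extra whitespace added, and the sketch assumes the zip prefix can arrive as a decimal-formatted string.

```python
from decimal import Decimal


def initcap(text: str) -> str:
    """Approximate Spark's initcap: capitalize the first letter of each space-separated word."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))


def normalize_customer(record: dict) -> dict:
    """Apply the trim / case / zero-pad rules to one customer record (hypothetical helper)."""
    # Drop the decimal part before padding, so "9790.000..." becomes "9790"
    zip_digits = str(int(Decimal(str(record["customer_zip_code_prefix"]))))
    return {
        "customer_id": record["customer_id"],
        "zip_code": zip_digits.zfill(5),
        "city": initcap(record["customer_city"].strip()),
        "state": record["customer_state"].strip().upper(),
    }


# Illustrative input: whitespace and lowercase state are added for the example
sample = {
    "customer_id": "18955e83d337fd6b2def6b18a428ac77",
    "customer_zip_code_prefix": "9790.000000000000000000",
    "customer_city": " sao bernardo do campo ",
    "customer_state": " sp",
}
normalized = normalize_customer(sample)
print(normalized)
```

Null filtering and de-duplication are set-level operations and are left out of this per-record sketch.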

## Input
- **Source**: `bronze/olist/customers/OLIST.OLIST_CUSTOMERS_BASE.parquet`
- **Records**: 99,441

## Output
- **Destination**: `silver/customers_clean/`
- **Format**: Delta Lake

**Author:** Kevin  
**Date:** Feb 9, 2026


## Config & Imports


In [0]:
from pyspark.sql.functions import (
    col, upper, trim, initcap, lpad, 
    current_timestamp, count, when
)

storage_account_name = "stgolistmigration"
account_key = ""  # Left blank here; set the storage account key before running

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

# Path helpers
def get_bronze_path(folder, filename):
    return f"abfss://bronze@{storage_account_name}.dfs.core.windows.net/olist/{folder}/{filename}"

def get_silver_path(table):
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded
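
Because `account_key` is blank in the cell above, the key has to come from somewhere at run time. One possible approach, sketched below, is to read it from an environment variable and fail fast when it is missing. The function `load_account_key` and the variable name `OLIST_STORAGE_ACCOUNT_KEY` are assumptions made for this example, not part of the notebook. On Databricks, a secret scope read through `dbutils.secrets.get` would likely be the more usual home for the key.

```python
import os


def load_account_key(env_var: str = "OLIST_STORAGE_ACCOUNT_KEY") -> str:
    """Return the storage account key from the environment, failing fast if it is missing.

    The environment variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Storage account key not set: export {env_var} before running")
    return key


# Example with a placeholder value (not a real key)
os.environ["OLIST_STORAGE_ACCOUNT_KEY"] = "placeholder-key"
account_key = load_account_key()
print("Key loaded:", bool(account_key))
```

Failing before `spark.conf.set` is called gives a clearer error than an authentication failure surfacing later during the read.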


## Read Bronze Data


In [0]:
# The storage account key must be configured (see Config & Imports) before reading.
bronze_path = get_bronze_path("customers", "OLIST.OLIST_CUSTOMERS_BASE.parquet")

print(f"📖 Reading: {bronze_path}")

df_customers_bronze = spark.read.parquet(bronze_path)

print(f"✅ Loaded: {df_customers_bronze.count():,} rows")
print(f"   Columns: {len(df_customers_bronze.columns)}")

# Show sample
df_customers_bronze.limit(3).show(truncate=False, vertical=True)


📖 Reading: abfss://bronze@stgolistmigration.dfs.core.windows.net/olist/customers/OLIST.OLIST_CUSTOMERS_BASE.parquet
✅ Loaded: 99,441 rows
   Columns: 5
-RECORD 0----------------------------------------------------
 CUSTOMER_ID              | 06b8999e2fba1a1fbc88172c00ba8bc7 
 CUSTOMER_UNIQUE_ID       | 861eff4711a542e4b93843c6dd7febb0 
 CUSTOMER_ZIP_CODE_PREFIX | 14409.000000000000000000         
 CUSTOMER_CITY            | franca                           
 CUSTOMER_STATE           | SP                               
-RECORD 1----------------------------------------------------
 CUSTOMER_ID              | 18955e83d337fd6b2def6b18a428ac77 
 CUSTOMER_UNIQUE_ID       | 290c77bc529b7ac935b93aa66c333dc3 
 CUSTOMER_ZIP_CODE_PREFIX | 9790.000000000000000000          
 CUSTOMER_CITY            | sao bernardo do campo            
 CUSTOMER_STATE           | SP                               
-RECORD 2----------------------------------------------------
 CUSTOMER_ID              | 4e7b3e00288586

## Data Quality Check


In [0]:
print("🔍 Data Quality Check")
print("=" * 80)

# Null counts
print("\n1️⃣ NULL VALUES:")
null_counts = df_customers_bronze.select([
    count(when(col(c).isNull(), c)).alias(c) 
    for c in df_customers_bronze.columns
])
null_counts.show(vertical=True, truncate=False)

# Check duplicates
total = df_customers_bronze.count()
unique = df_customers_bronze.select("customer_id").distinct().count()
dups = total - unique

print("\n2️⃣ DUPLICATES:")
print(f"Total: {total:,}")
print(f"Unique: {unique:,}")
print(f"Duplicates: {dups:,}")

# State distribution (top 10)
print("\n3️⃣ TOP 10 STATES:")
df_customers_bronze.groupBy("customer_state") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(10) \
    .show(truncate=False)

print("=" * 80)


🔍 Data Quality Check

1️⃣ NULL VALUES:
-RECORD 0-----------------------
 CUSTOMER_ID              | 0   
 CUSTOMER_UNIQUE_ID       | 0   
 CUSTOMER_ZIP_CODE_PREFIX | 0   
 CUSTOMER_CITY            | 0   
 CUSTOMER_STATE           | 0   


2️⃣ DUPLICATES:
Total: 99,441
Unique: 99,441
Duplicates: 0

3️⃣ TOP 10 STATES:
+--------------+-----+
|customer_state|count|
+--------------+-----+
|SP            |41746|
|RJ            |12852|
|MG            |11635|
|RS            |5466 |
|PR            |5045 |
|SC            |3637 |
|BA            |3380 |
|DF            |2140 |
|ES            |2033 |
|GO            |2020 |
+--------------+-----+



## Apply Transformations


In [0]:
print("🔄 Transforming customers...")

df_customers_silver = df_customers_bronze \
    .filter(col("customer_id").isNotNull()) \
    .withColumn(
        "customer_state_clean",
        upper(trim(col("customer_state")))
    ) \
    .withColumn(
        "customer_city_clean",
        initcap(trim(col("customer_city")))
    ) \
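    .withColumn(
        "customer_zip_code",
        # Bronze stores the prefix as a decimal (e.g. 7847.000...). Cast to int first;
        # otherwise lpad truncates the string to "7847." instead of padding to "07847".
        lpad(col("customer_zip_code_prefix").cast("int").cast("string"), 5, "0")
    ) \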
    .withColumn("ingestion_timestamp", current_timestamp()) \
    .dropDuplicates(["customer_id"]) \
    .select(
        "customer_id",
        "customer_unique_id",
        col("customer_zip_code").alias("zip_code"),
        col("customer_city_clean").alias("city"),
        col("customer_state_clean").alias("state"),
        "ingestion_timestamp"
    )

silver_count = df_customers_silver.count()
removed = total - silver_count

print("✅ Transformation complete")
print(f"   Silver rows: {silver_count:,}")
print(f"   Removed: {removed:,}")


🔄 Transforming customers...
✅ Transformation complete
   Silver rows: 99,441
   Removed: 0
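
The removed-row count can also be reconciled against what the data quality check predicted, so that an unexpected drop does not pass silently. The sketch below is one way to express that check in plain Python. `reconcile_counts` is a hypothetical helper, and it assumes null IDs and duplicate IDs are counted disjointly.

```python
def reconcile_counts(bronze_rows: int, silver_rows: int,
                     null_ids: int, duplicate_ids: int) -> int:
    """Return the number of rows removed, raising if it differs from the prediction.

    Assumption: null_ids and duplicate_ids do not overlap.
    """
    removed = bronze_rows - silver_rows
    expected_removed = null_ids + duplicate_ids
    if removed != expected_removed:
        raise ValueError(
            f"Removed {removed:,} rows but expected {expected_removed:,}"
        )
    return removed


# Figures taken from the quality check and transformation output in this notebook
removed = reconcile_counts(
    bronze_rows=99_441, silver_rows=99_441, null_ids=0, duplicate_ids=0
)
print(f"Removed: {removed:,}")
```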


## Validate & Preview


In [0]:
print("📊 Silver Layer Preview")
print("=" * 80)

# Show state distribution
print("\nTop 10 states:")
df_customers_silver.groupBy("state") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(10) \
    .show(truncate=False)

# Show sample records
print("\nSample records:")
df_customers_silver.limit(5).show(truncate=False, vertical=True)


📊 Silver Layer Preview

Top 10 states:
+-----+-----+
|state|count|
+-----+-----+
|SP   |41746|
|RJ   |12852|
|MG   |11635|
|RS   |5466 |
|PR   |5045 |
|SC   |3637 |
|BA   |3380 |
|DF   |2140 |
|ES   |2033 |
|GO   |2020 |
+-----+-----+


Sample records:
-RECORD 0-----------------------------------------------
 customer_id         | e3c7e245a96d7fa339fe6c16f8da4e90 
 customer_unique_id  | 79051ee5ee98c4bd6982e67e2e79dbcb 
 zip_code            | 7847.                            
 city                | Franco Da Rocha                  
 state               | SP                               
 ingestion_timestamp | 2026-02-09 12:45:57.572021       
-RECORD 1-----------------------------------------------
 customer_id         | a56b03f5e6015f1a502b9810309b98b7 
 customer_unique_id  | b6cbe1a8674ee23e9fb086e3c61677b8 
 zip_code            | 41308                            
 city                | Salvador                         
 state               | BA                               
 inges
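
The first sample record shows a `zip_code` of `7847.`, which is not a five-digit value. A format check along the following lines would flag such values during validation. This is a plain-Python sketch with a hypothetical helper name (`invalid_zip_codes`); in Spark, the same pattern could probably be applied with `Column.rlike`.

```python
import re

# A valid zip code here is assumed to be exactly five digits
ZIP_PATTERN = re.compile(r"\d{5}")


def invalid_zip_codes(zip_codes):
    """Return the values that are missing or are not exactly five digits."""
    return [z for z in zip_codes if z is None or not ZIP_PATTERN.fullmatch(z)]


# The two zip_code values visible in the sample records
bad = invalid_zip_codes(["7847.", "41308"])
print(bad)
```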

## Write to Silver


In [0]:
output_path = get_silver_path("customers_clean")

print(f"💾 Writing to: {output_path}")

df_customers_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(output_path)

print("✅ Customers Silver layer complete!")


💾 Writing to: abfss://silver@stgolistmigration.dfs.core.windows.net/customers_clean/
✅ Customers Silver layer complete!


## Verify


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} customers")
print(f"   Columns: {len(df_verify.columns)}")

# Show summary by state
print("\nCustomers by State (top 10):")
df_verify.groupBy("state") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(10) \
    .show(truncate=False)

print("=" * 80)
print("🎉 Customers Bronze → Silver complete!")


🔍 Verifying...
✅ Verified: 99,441 customers
   Columns: 6

Customers by State (top 10):
+-----+-----+
|state|count|
+-----+-----+
|SP   |41746|
|RJ   |12852|
|MG   |11635|
|RS   |5466 |
|PR   |5045 |
|SC   |3637 |
|BA   |3380 |
|DF   |2140 |
|ES   |2033 |
|GO   |2020 |
+-----+-----+

🎉 Customers Bronze → Silver complete!
