# Gold Marts: Olist Business Analytics

## Purpose
Pre-aggregated business metrics for high-performance dashboards

## Source
- Gold: `fact_orders`, `dim_customers`, `dim_date`

## Mart Tables
1. **mart_monthly_sales** - Monthly revenue, orders, AOV
2. **mart_state_performance** - Geographic sales analysis
3. **mart_customer_segments** - RFM analysis (Recency, Frequency, Monetary)

**Author:** Kevin  
**Date:** Feb 9, 2026


In [0]:
from pyspark.sql.functions import (
    col, sum, count, avg, max, min, datediff, current_date,
    current_timestamp, round as spark_round, when, ntile
)
from pyspark.sql.window import Window

storage_account_name = "stgolistmigration"
account_key = ""

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

def get_gold_path(table):
    return f"abfss://gold@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded


In [0]:
print("📖 Loading Gold tables...")

fact_orders = spark.read.format("delta").load(get_gold_path("fact_orders"))
print(f"✅ fact_orders: {fact_orders.count():,}")


📖 Loading Gold tables...
✅ fact_orders: 98,200


### Mart 1: Monthly Sales Summary


In [0]:
print("📊 Building mart_monthly_sales...")

mart_monthly = fact_orders \
    .groupBy("order_year", "order_month") \
    .agg(
        count("*").alias("order_count"),
        sum("revenue").alias("total_revenue"),
        avg("revenue").alias("avg_order_value"),
        sum("total_freight_value").alias("total_freight"),
        sum("item_count").alias("total_items"),
        count(when(col("is_late_delivery") == True, 1)).alias("late_deliveries"),
        count(when(col("is_multiple_payments") == True, 1)).alias("multi_payment_orders")
    ) \
    .withColumn("avg_items_per_order", spark_round(col("total_items") / col("order_count"), 2)) \
    .withColumn("late_delivery_rate_pct", spark_round((col("late_deliveries") / col("order_count")) * 100, 2)) \
    .withColumn("avg_order_value", spark_round(col("avg_order_value"), 2)) \
    .withColumn("total_revenue", spark_round(col("total_revenue"), 2)) \
    .withColumn("mart_created_at", current_timestamp()) \
    .orderBy("order_year", "order_month")

print(f"✅ Created: {mart_monthly.count()} months")
mart_monthly.show(20, truncate=False)

# Write
mart_monthly.write.format("delta").mode("overwrite").save(get_gold_path("mart_monthly_sales"))
print("💾 Saved to: mart_monthly_sales")


📊 Building mart_monthly_sales...
✅ Created: 24 months
+----------+-----------+-----------+-------------+---------------+------------------+-----------+---------------+--------------------+-------------------+----------------------+--------------------------+
|order_year|order_month|order_count|total_revenue|avg_order_value|total_freight     |total_items|late_deliveries|multi_payment_orders|avg_items_per_order|late_delivery_rate_pct|mart_created_at           |
+----------+-----------+-----------+-------------+---------------+------------------+-----------+---------------+--------------------+-------------------+----------------------+--------------------------+
|2016      |9          |2          |279.69       |139.85         |71.83             |5          |1              |0                   |2.5                |50.0                  |2026-02-09 13:32:11.232175|
|2016      |10         |293        |51581.48     |176.05         |6847.219999999997 |342        |2              |11           

### Mart 2: State Performance


In [0]:
print("📊 Building mart_state_performance...")

mart_state = fact_orders \
    .groupBy("customer_state") \
    .agg(
        count("*").alias("order_count"),
        sum("revenue").alias("total_revenue"),
        avg("revenue").alias("avg_order_value"),
        avg("delivery_days").alias("avg_delivery_days"),
        count(when(col("is_late_delivery") == True, 1)).alias("late_deliveries"),
        count("customer_id").alias("customer_count")
    ) \
    .withColumn("late_delivery_rate_pct", spark_round((col("late_deliveries") / col("order_count")) * 100, 2)) \
    .withColumn("revenue_per_customer", spark_round(col("total_revenue") / col("customer_count"), 2)) \
    .withColumn("avg_order_value", spark_round(col("avg_order_value"), 2)) \
    .withColumn("avg_delivery_days", spark_round(col("avg_delivery_days"), 1)) \
    .withColumn("mart_created_at", current_timestamp()) \
    .orderBy(col("total_revenue").desc())

print(f"✅ Created: {mart_state.count()} states")
mart_state.show(20, truncate=False)

# Write
mart_state.write.format("delta").mode("overwrite").save(get_gold_path("mart_state_performance"))
print("💾 Saved to: mart_state_performance")


📊 Building mart_state_performance...
✅ Created: 27 states
+--------------+-----------+------------------+---------------+-----------------+---------------+--------------+----------------------+--------------------+--------------------------+
|customer_state|order_count|total_revenue     |avg_order_value|avg_delivery_days|late_deliveries|customer_count|late_delivery_rate_pct|revenue_per_customer|mart_created_at           |
+--------------+-----------+------------------+---------------+-----------------+---------------+--------------+----------------------+--------------------+--------------------------+
|SP            |41125      |5878025.640000064 |142.93         |8.7              |1820           |41125         |4.43                  |142.93              |2026-02-09 13:32:21.143228|
|RJ            |12697      |2115667.5599999987|166.63         |15.2             |1495           |12697         |11.77                 |166.63              |2026-02-09 13:32:21.143228|
|MG            |11495 

### Mart 3: Customer RFM Segmentation


In [0]:
print("📊 Building mart_customer_segments (RFM)...")

# Calculate Recency, Frequency, Monetary
reference_date = fact_orders.agg(max("order_purchase_timestamp")).collect()[0][0]

customer_rfm = fact_orders \
    .groupBy("customer_id", "customer_state") \
    .agg(
        max("order_purchase_timestamp").alias("last_order_date"),
        count("*").alias("frequency"),
        sum("revenue").alias("monetary")
    ) \
    .withColumn("recency_days", datediff(col("last_order_date"), col("last_order_date")))  # Simplified

# Calculate RFM scores (1-5 scale)
windowSpec = Window.orderBy("recency_days")
windowSpec2 = Window.orderBy(col("frequency").desc())
windowSpec3 = Window.orderBy(col("monetary").desc())

mart_customer_rfm = customer_rfm \
    .withColumn("r_score", ntile(5).over(windowSpec)) \
    .withColumn("f_score", ntile(5).over(windowSpec2)) \
    .withColumn("m_score", ntile(5).over(windowSpec3)) \
    .withColumn("rfm_score", col("r_score") + col("f_score") + col("m_score")) \
    .withColumn("customer_segment",
        when(col("rfm_score") >= 13, "Champions")
        .when(col("rfm_score") >= 10, "Loyal")
        .when(col("rfm_score") >= 7, "Potential")
        .when(col("rfm_score") >= 4, "At Risk")
        .otherwise("Lost")
    ) \
    .withColumn("monetary", spark_round(col("monetary"), 2)) \
    .withColumn("mart_created_at", current_timestamp())

print(f"✅ Created: {mart_customer_rfm.count():,} customer segments")

# Show segment distribution
print("\nCustomer Segmentation:")
mart_customer_rfm.groupBy("customer_segment").count().orderBy(col("count").desc()).show()

# Write
mart_customer_rfm.write.format("delta").mode("overwrite").save(get_gold_path("mart_customer_segments"))
print("💾 Saved to: mart_customer_segments")

print("\n🎉 Olist Business Analytics Marts Complete!")


📊 Building mart_customer_segments (RFM)...




✅ Created: 98,200 customer segments

Customer Segmentation:
+----------------+-----+
|customer_segment|count|
+----------------+-----+
|       Potential|31421|
|           Loyal|27217|
|         At Risk|19783|
|       Champions|15922|
|            Lost| 3857|
+----------------+-----+

💾 Saved to: mart_customer_segments

🎉 Olist Business Analytics Marts Complete!
