# 02_silver_to_gold - Silver to Gold curation (Microsoft Fabric Lakehouse)

**Purpose**  
Transform cleaned **Silver** tables into denormalised **Gold** Delta tables aligned to **Customer Insights - Data** contracts:

- **Profiles:** `profiles.Customer`  
- **Activities:** `activities.Orders` (line-level), `activities.LoyaltyPoints`  
- **Supporting:** `supporting.Products`, `supporting.Channels`, `supporting.Calendar`, `supporting.LoyaltyRewardPoints`, `supporting.LoyaltyPrograms`  
- **Analytics (optional):** `analytics.CustomerMetrics` snapshot for downstream KPIs.

**Design rules (CI-Data compatible)**  
- **Delta min reader version <= 2**; **deletion vectors disabled**; **~15 days `_delta_log`** retention.  
- Activity PKs + non-null **timestamp**; FK to `profiles.Customer`.  
- Partition **Orders** by `OrderDate` (daily).

> This notebook is intentionally **modular** and **comment-rich**, so it can serve as a clear reference implementation.
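
The design rules above can be checked mechanically before a table is handed to Customer Insights - Data. The following is a minimal sketch in plain Python and is not part of the notebook: `check_ci_data_props` is a hypothetical helper, and the property keys mirror the ones this notebook sets through `DELTA_PROPS`. It only inspects a dictionary of table properties, so it runs without a Spark session.

```python
def check_ci_data_props(props: dict) -> list:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []

    # Rule 1: the reader protocol version must not exceed 2.
    reader = props.get("delta.minReaderVersion")
    if reader is None or int(reader) > 2:
        problems.append(f"delta.minReaderVersion must be <= 2, got {reader!r}")

    # Rule 2: deletion vectors must be disabled
    # (assumption: a missing key is treated as disabled in this sketch).
    dv = str(props.get("delta.enableDeletionVectors", "false")).lower()
    if dv != "false":
        problems.append("delta.enableDeletionVectors must be 'false'")

    # Rule 3: a log retention interval should be set explicitly.
    if not props.get("delta.logRetentionDuration"):
        problems.append("delta.logRetentionDuration is not set")

    return problems


good = {
    "delta.minReaderVersion": "2",
    "delta.enableDeletionVectors": "false",
    "delta.logRetentionDuration": "15 days",
}
bad = {"delta.minReaderVersion": "3", "delta.enableDeletionVectors": "true"}

print(check_ci_data_props(good))      # []
print(len(check_ci_data_props(bad)))  # 3
```

In a notebook, the input dictionary could be collected from the table's properties after the write; how that is done is left open here.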


In [7]:
# ==============================
# Configuration & parameters
# ==============================

# OneLake Connection
TARGET_DB = "ci_lakehouse"
SOURCE_DB = "ci_lakehouse"
spark.sql(f"USE `{TARGET_DB}`")

# Root folder for Gold Delta tables inside the Lakehouse.
GOLD_ROOT = "Files/Tables_gold"    # folder under the Lakehouse
# Simple and idempotent: every run rewrites the full table.
# TODO: replace overwrite with merge/upsert when the schema is stable.
USE_OVERWRITE_SNAPSHOT = True

# Delta table properties required by CI Data compatibility and our guardrails.
DELTA_PROPS = {
    "delta.minReaderVersion": "2",
    "delta.enableDeletionVectors": "false",
    "delta.logRetentionDuration": "15 days",
}

# Expected Silver inputs (produced by 01_bronze_to_silver.ipynb).
SILVER_TABLES = {
    "customer_silver": {},
    "salestable_silver": {},
    "salesline_silver": {},
    "retailchannel_silver": {},
    "loyalty_card_silver": {},
    "loyalty_card_tier_silver": {},
    "loyalty_point_trans_silver": {},
    "loyalty_program_affiliation_silver": {},
    # Optional: "product_silver", "store_silver" can be integrated later.
}

# Column precision for monetary/quantity fields.
DECIMAL_PRECISION = 18
DECIMAL_SCALE = 4



In [8]:
# ==============================
# Utilities
# ==============================
from pyspark.sql import functions as F, Window as W
from pyspark.sql import types as T

def log(msg: str) -> None:
    """Lightweight logger for notebook steps."""
    print(f"[gold] {msg}")

def table_exists(name: str) -> bool:
    """Return True if a table is available in the current Lakehouse database."""
    try:
        return spark.catalog.tableExists(name)
    except Exception:
        return False

def safe_read_table(name: str):
    """Read a Silver table or return None if it does not exist."""
    if not table_exists(name):
        log(f"SKIP: missing source -> {name}")
        return None
    return spark.table(name)

def ensure_dir(path: str) -> None:
    """No-op placeholder: Fabric creates the target folder on write, so nothing is created here."""
    # Kept so write_delta() reads symmetrically; `path` is intentionally unused.
    return None

def register_delta_table(table_name: str, path: str) -> None:
    """Create a Delta table at a given LOCATION if it doesn't already exist."""
    spark.sql(f"""CREATE TABLE IF NOT EXISTS `{table_name}`
                     USING DELTA
                     LOCATION '{path}'""")

def set_delta_properties(table_name: str, props: dict) -> None:
    """Set Delta table properties (minReaderVersion, deletion vectors, log retention, ...)."""
    kv = ", ".join([f"{k}='{v}'" for k, v in props.items()])
    spark.sql(f"ALTER TABLE `{table_name}` SET TBLPROPERTIES ({kv})")

def write_delta(df, table_name: str, partition_cols=None, enforce_props: bool = True) -> None:
    """Write DataFrame to Gold Delta at fixed LOCATION, register, enforce properties, refresh."""
    path = f"{GOLD_ROOT}/{table_name}"
    ensure_dir(path)
    writer = (df.write
              .format("delta")
              .option("mergeSchema", "true"))
    if USE_OVERWRITE_SNAPSHOT:
        writer = writer.mode("overwrite")
    else:
        writer = writer.mode("overwrite")  # placeholder for future MERGE
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.save(path)
    register_delta_table(table_name, path)
    if enforce_props:
        set_delta_properties(table_name, DELTA_PROPS)
    spark.sql(f"REFRESH TABLE `{table_name}`")
    log(f"Wrote Gold table: {table_name} -> {path}")

def cast_decimal(col, precision=DECIMAL_PRECISION, scale=DECIMAL_SCALE):
    """Cast a column (by name) to the shared monetary/quantity decimal type."""
    return F.col(col).cast(T.DecimalType(precision, scale))

def to_date_utc(col):
    """Truncate a timestamp column to a date; the input is assumed to already be in UTC."""
    return F.to_date(F.col(col))

def to_utc_ts(col):
    """Cast a column to timestamp; no time zone conversion is applied (input assumed UTC)."""
    return F.col(col).cast("timestamp")

def assert_pk_unique(df, pk_cols, table_label: str):
    """Sanity check: log a warning (does not raise) if the primary key is not unique."""
    dup = (df.groupBy(*[F.col(c) for c in pk_cols])
             .count()
             .filter(F.col("count") > 1))
    n_dup = dup.count()
    if n_dup > 0:
        log(f"WARN: {table_label} has {n_dup} duplicate PK(s). Investigate before CI Data consumption.")
    else:
        log(f"OK: {table_label} PK uniqueness holds.")

def customer_id_expr(data_area_col: str, account_col: str):
    """Stable composite CustomerId = lower(dataarea) + '_' + lower(account)."""
    return F.lower(F.concat_ws("_", F.col(data_area_col), F.col(account_col)))

def order_line_id_expr(sales_id_col: str, line_recid_col: str):
    """Composite OrderLineId = sales_id + '_' + line_recid (record id cast to string)."""
    return F.concat_ws("_", F.col(sales_id_col), F.col(line_recid_col).cast("string"))

# ==============================
# Transformation functions: each returns a DataFrame ready for write_delta()
# ==============================
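
To make the key conventions used by `customer_id_expr` and `order_line_id_expr` concrete, here is a small plain-Python sketch of the same string logic. It is an illustration only: the function names below are hypothetical stand-ins, the sample values are made up, and the one Spark behaviour it relies on is that `concat_ws` skips null inputs rather than returning null.

```python
from typing import Optional


def concat_ws(sep: str, *parts: Optional[str]) -> str:
    """Plain-Python stand-in for Spark's concat_ws: None parts are skipped."""
    return sep.join(p for p in parts if p is not None)


def customer_id(data_area_id: Optional[str], account_num: Optional[str]) -> str:
    """Mirror of customer_id_expr: join with '_' first, then lower-case the result."""
    return concat_ws("_", data_area_id, account_num).lower()


def order_line_id(sales_id: Optional[str], line_recid: Optional[int]) -> str:
    """Mirror of order_line_id_expr: the record id is cast to string before joining."""
    recid = None if line_recid is None else str(line_recid)
    return concat_ws("_", sales_id, recid)


print(customer_id("USMF", "C-0001"))            # usmf_c-0001
print(order_line_id("SO-000123", 5637144576))   # SO-000123_5637144576
print(customer_id("USMF", None))                # usmf (the null account is skipped)
```

The last call shows a consequence worth keeping in mind: a null account number does not produce a null key, it collapses to the company code alone, so rows with missing account numbers may need to be filtered or flagged upstream.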


In [9]:
def build_profiles_customer(customer_silver):
    """Create Gold Profiles.Customer from Silver customer.

    Requirements:
      - PK: CustomerId (composite of company + account)
      - Denormalised profile details (name, contact, address)
    """
    if customer_silver is None:
        return None

    df = (customer_silver
          .withColumn("customer_id", customer_id_expr("data_area_id", "account_num"))
          .select(
              # Keys
              F.col("customer_id").alias("CustomerId"),
              F.col("data_area_id").alias("DataAreaId"),
              F.col("account_num").alias("AccountNum"),
              # Party / contact
              F.col("party_type").alias("PartyType"),
              F.col("name").alias("Name"),
              F.col("known_as").alias("KnownAs"),
              F.col("language_id").alias("LanguageId"),
              F.col("primary_email").alias("PrimaryEmail"),
              F.col("primary_phone").alias("PrimaryPhone"),
              F.col("person_first_name").alias("PersonFirstName"),
              F.col("person_middle_name").alias("PersonMiddleName"),
              F.col("person_last_name").alias("PersonLastName"),
              # Address (current)
              F.col("address_line").alias("AddressLine"),
              F.col("street").alias("Street"),
              F.col("street_number").alias("StreetNumber"),
              F.col("city").alias("City"),
              F.col("state").alias("State"),
              F.col("county").alias("County"),
              F.col("postal_code").alias("PostalCode"),
              F.col("country_region_id").alias("CountryRegionId"),
              F.col("latitude").alias("Latitude"),
              F.col("longitude").alias("Longitude"),
              # Lineage / audit
              F.col("party_recid").alias("PartyRecordId"),
              F.col("primary_address_location_recid").alias("PrimaryAddressLocationId"),
              F.current_timestamp().alias("RowModifiedUtc")
          ))
    return df


def build_activities_orders(sales_hdr, sales_line, retail_channel, customer_silver=None):
    """Create Gold Activities.Orders (line-level).

    Requirements:
      - PK: OrderLineId = sales_id + '_' + line_recid
      - FK: CustomerId (from Profiles)
      - Timestamp: OrderDate (non-null), plus OrderTimestamp for fidelity
      - Partition by OrderDate
    """
    if sales_hdr is None or sales_line is None:
        return None

    # Header x Line
    orders = (sales_line.alias("L")
              .join(sales_hdr.alias("H"), on=["sales_id", "data_area_id"], how="inner"))

    # Channel lookup
    if retail_channel is not None:
        rc = retail_channel.dropDuplicates(["channel_recid"]).alias("RC")
        orders = (orders
                  .join(
                      rc,
                      orders["retail_channel_recid"] == F.col("RC.channel_recid"),
                      "left"
                  )
                  .withColumn("channel_id", F.col("RC.channel_id")))
    else:
        orders = orders.withColumn("channel_id", F.lit(None).cast("string"))

    # CustomerId derivation
    if customer_silver is not None:
        cust_key = (customer_silver
                    .withColumn("customer_id", customer_id_expr("data_area_id", "account_num"))
                    .select(F.col("data_area_id").alias("C_data_area_id"),
                            F.col("account_num").alias("C_account_num"),
                            F.col("customer_id")))
        orders = (orders
                  .join(cust_key,
                        (orders["data_area_id"] == cust_key["C_data_area_id"]) &
                        (orders["cust_account"] == cust_key["C_account_num"]),
                        "left")
                  .drop("C_data_area_id", "C_account_num"))
        # customer_id now comes from cust_key; because this is a left join, orders
        # without a matching Silver customer keep a null CustomerId.
    else:
        orders = orders.withColumn("customer_id", customer_id_expr("data_area_id", "cust_account"))

    # Business columns
    orders = (orders
              .withColumn("order_line_id", order_line_id_expr("sales_id", "line_recid"))
              .withColumn("order_timestamp", to_utc_ts("created_datetime_utc"))
              .withColumn("order_date", to_date_utc("created_datetime_utc"))
              .withColumn("unit_price", cast_decimal("sales_price"))
              .withColumn("line_discount", cast_decimal("line_discount"))
              .withColumn("sales_qty", cast_decimal("sales_qty"))
              .withColumn("line_amount_calc",
                          F.coalesce(F.col("line_amount"),
                                     (F.col("sales_qty") * F.col("unit_price")) - F.col("line_discount")))
              .withColumn("line_amount", cast_decimal("line_amount_calc"))
             )

    # Select final schema (denormalised)
    orders = (orders.select(
        F.col("order_line_id").alias("OrderLineId"),          # PK
        F.col("sales_id").alias("OrderId"),
        F.col("customer_id").alias("CustomerId"),             # FK -> Profiles.Customer
        F.col("channel_id").alias("ChannelId"),
        F.col("item_id").alias("ItemId"),
        F.col("product_name").alias("ProductName"),
        F.col("order_date").alias("OrderDate"),               # for partitioning
        F.col("order_timestamp").alias("OrderTimestamp"),     # high-fidelity time
        F.col("sales_qty").alias("SalesQty"),
        F.col("unit_price").alias("UnitPrice"),
        F.col("line_discount").alias("LineDiscount"),
        F.col("line_amount").alias("LineAmount"),
        cast_decimal("cost_price").alias("CostPrice"),
        F.col("currency_code").alias("CurrencyCode"),
        F.col("sales_unit").alias("SalesUnit"),
        F.col("price_unit").alias("PriceUnit"),
        F.col("sales_status").alias("SalesStatus"),
        F.col("data_area_id").alias("DataAreaId"),
        F.current_timestamp().alias("RowModifiedUtc")
    ))

    # Basic guards
    orders = orders.filter(F.col("OrderDate").isNotNull())
    return orders


def build_supporting_products(sales_line, product_silver=None):
    """Create Supporting.Products.

    Prefers product_silver if present; otherwise derives one row per item_id from
    sales_line (if names differ for an item, an arbitrary product_name is kept).
    """
    if product_silver is not None:
        df = (product_silver
              .select(
                  F.col("product_id").alias("ProductId"),
                  F.col("product_name").alias("ProductName"),
                  F.current_timestamp().alias("RowModifiedUtc")
              )
              .dropDuplicates(["ProductId"]))  # enforce PK uniqueness
    elif sales_line is not None:
        df = (sales_line
              .select(
                  F.col("item_id").alias("ProductId"),
                  F.col("product_name").alias("ProductName"),
              )
              .dropna(subset=["ProductId"])
              .dropDuplicates(["ProductId"])
              .withColumn("RowModifiedUtc", F.current_timestamp()))
    else:
        return None
    return df


def build_supporting_channels(retail_channel):
    """Create Supporting.Channels from retailchannel_silver."""
    if retail_channel is None:
        return None
    df = (retail_channel
          .select(
              F.col("channel_id").alias("ChannelId"),
              F.col("channel_recid").alias("ChannelRecId"),
              F.current_timestamp().alias("RowModifiedUtc"),
          )
          .dropna(subset=["ChannelId"])
          .dropDuplicates(["ChannelId"]))
    return df


def build_supporting_calendar(orders=None, sales_line=None):
    """Create Supporting.Calendar as a simple date dimension covering observed order dates.

    Uses Orders if available; otherwise derives dates from sales_line.created_datetime_utc.
    """
    src = None
    if orders is not None and orders.columns:
        src = orders.select(F.col("OrderDate").alias("d")).where(F.col("d").isNotNull())
        date_bounds = src.agg(F.min("d").alias("min_d"), F.max("d").alias("max_d")).collect()[0]
    elif sales_line is not None:
        src = sales_line.select(F.to_date(F.col("created_datetime_utc")).alias("d")).where(F.col("d").isNotNull())
        date_bounds = src.agg(F.min("d").alias("min_d"), F.max("d").alias("max_d")).collect()[0]
    else:
        return None

    start_d, end_d = date_bounds["min_d"], date_bounds["max_d"]
    if start_d is None or end_d is None:
        return None

    # Build date sequence [start_d, end_d] with 1-day step
    seq = (spark.createDataFrame([(1,)], ["id"])
           .select(F.explode(F.sequence(F.lit(start_d), F.lit(end_d), F.expr("interval 1 day"))).alias("Date")))

    cal = (seq
           .withColumn("Year", F.year("Date"))
           .withColumn("Quarter", F.quarter("Date"))
           .withColumn("Month", F.month("Date"))
           .withColumn("Day", F.dayofmonth("Date"))
           .withColumn("WeekOfYear", F.weekofyear("Date"))
           .withColumn("DayOfWeek", F.date_format("Date", "E"))
           .withColumn("RowModifiedUtc", F.current_timestamp()))

    return cal


def build_activities_loyalty_points(loyalty_trans, loyalty_card=None, loyalty_tier=None, sales_hdr=None, retail_channel=None):
    """Create Gold activities.LoyaltyPoints from Silver loyalty datasets."""
    if loyalty_trans is None:
        return None

    df = loyalty_trans.alias("LT")

    if loyalty_card is not None:
        card_lookup = (
            loyalty_card
            .select(
                F.col("card_recid"),
                F.col("card_number"),
                F.col("customer_id")
            )
            .dropDuplicates(["card_recid"])
        )
        df = df.join(card_lookup, "card_recid", "left")
    else:
        df = (
            df
            .withColumn("card_number", F.lit(None).cast("string"))
            .withColumn("customer_id", F.lit(None).cast("string"))
        )

    if loyalty_tier is not None:
        tier_lookup = (
            loyalty_tier
            .filter(F.col("is_current") == F.lit(True))
            .select(
                F.col("card_recid").alias("tier_card_recid"),
                F.col("loyalty_tier").alias("tier_loyalty_tier"),
                F.col("affiliation_id").alias("tier_program_id")
            )
            .dropDuplicates(["tier_card_recid"])
        )
        df = df.join(tier_lookup, df.card_recid == tier_lookup.tier_card_recid, "left")
        df = df.drop("tier_card_recid")
    else:
        df = (
            df
            .withColumn("tier_loyalty_tier", F.lit(None).cast("string"))
            .withColumn("tier_program_id", F.lit(None).cast("string"))
        )

    if sales_hdr is not None:
        sales_lookup = (
            sales_hdr
            .select(
                F.col("sales_id").alias("sh_sales_id"),
                F.col("retail_channel_recid").alias("sh_retail_channel_recid")
            )
            .dropDuplicates(["sh_sales_id"])
        )
        df = df.join(sales_lookup, df.sales_id == sales_lookup.sh_sales_id, "left")
        df = df.drop("sh_sales_id")
        df = df.withColumn("retail_channel_recid", F.col("sh_retail_channel_recid"))
        df = df.drop("sh_retail_channel_recid")
    elif "retail_channel_recid" not in df.columns:
        df = df.withColumn("retail_channel_recid", F.lit(None).cast("long"))

    if retail_channel is not None:
        channel_lookup = (
            retail_channel
            .select(
                F.col("channel_recid").alias("rc_channel_recid"),
                F.col("channel_id").alias("rc_channel_id")
            )
            .dropDuplicates(["rc_channel_recid"])
        )
        df = df.join(channel_lookup, df.retail_channel_recid == channel_lookup.rc_channel_recid, "left")
        df = df.drop("rc_channel_recid")
        df = df.withColumn("channel_id", F.col("rc_channel_id"))
        df = df.drop("rc_channel_id")
    else:
        df = df.withColumn("channel_id", F.lit(None).cast("string"))

    df = df.withColumn(
        "event_timestamp",
        F.coalesce(F.col("event_timestamp_utc"), F.col("expiration_date"))
    )
    # event_timestamp already falls back to expiration_date, so no second coalesce is needed.
    df = df.withColumn("event_date", F.to_date(F.col("event_timestamp")))

    df = df.withColumn(
        "loyalty_event_id",
        F.sha2(
            F.concat_ws(
                "|",
                F.lower(F.coalesce(F.col("card_number").cast("string"), F.lit(""))),
                F.lower(F.coalesce(F.col("reward_point_id").cast("string"), F.lit(""))),
                F.lower(F.coalesce(F.col("sales_id").cast("string"), F.lit(""))),
                F.lower(F.coalesce(F.col("transaction_id").cast("string"), F.lit(""))),
                F.lower(F.coalesce(F.col("rec_id").cast("string"), F.lit("")))
            ),
            256
        )
    )

    df = df.withColumn("program_id_final", F.col("tier_program_id").cast("string"))
    df = df.withColumn(
        "loyalty_tier_final",
        F.coalesce(F.col("tier_loyalty_tier").cast("string"), F.col("loyalty_tier").cast("string"))
    )

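    # The hash is built from coalesced strings, so it is never null; this filter is a safety net only.
    df = df.filter(F.col("loyalty_event_id").isNotNull())
    # Rows with neither an event timestamp nor an expiration date cannot be partitioned and are dropped.
    df = df.filter(F.col("event_date").isNotNull())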

    result = (
        df.select(
            F.col("loyalty_event_id").alias("LoyaltyEventId"),
            F.col("customer_id").alias("CustomerId"),
            F.col("card_number").alias("CardNumber"),
            F.col("reward_point_id").alias("RewardPointId"),
            F.col("points_delta").alias("PointsDelta"),
            F.col("remaining_balance").alias("RemainingBalance"),
            F.col("program_id_final").alias("ProgramId"),
            F.col("loyalty_tier_final").alias("LoyaltyTier"),
            F.col("sales_id").alias("OrderId"),
            F.col("channel_id").alias("ChannelId"),
            F.col("store_id").alias("StoreId"),
            F.col("terminal_id").alias("TerminalId"),
            F.col("transaction_id").alias("TransactionId"),
            F.col("staff_id").alias("StaffId"),
            F.col("loyalty_transaction_type").alias("LoyaltyTransactionType"),
            F.col("data_area_id").alias("DataAreaId"),
            F.col("event_timestamp").alias("EventTimestamp"),
            F.col("event_date").alias("EventDate"),
            F.col("expiration_date").alias("ExpirationDate"),
            F.current_timestamp().alias("RowModifiedUtc")
        )
    )
    return result


def build_supporting_loyalty_reward_points():
    """Create Supporting.LoyaltyRewardPoints placeholder table (schema only)."""
    schema = T.StructType([
        T.StructField("RewardPointId", T.StringType(), True),
        T.StructField("RewardPointType", T.StringType(), True),
        T.StructField("Redeemable", T.BooleanType(), True),
        T.StructField("RewardPointCurrency", T.StringType(), True),
        T.StructField("RowModifiedUtc", T.TimestampType(), True),
    ])
    return spark.createDataFrame([], schema)


def build_supporting_loyalty_programs(loyalty_program):
    """Create Supporting.LoyaltyPrograms from loyalty program affiliation Silver."""
    if loyalty_program is None:
        return None
    df = (
        loyalty_program
        .select(
            F.col("program_id").alias("ProgramId"),
            F.col("program_name").alias("ProgramName"),
            F.col("affiliation_type").alias("AffiliationType"),
            F.col("pool_related_cards").alias("PoolRelatedCards"),
            F.col("hide_in_channels").alias("HideInChannels"),
            F.col("data_area_id").alias("DataAreaId"),
            F.col("row_modified_utc").alias("RowModifiedUtc")
        )
        .dropDuplicates(["ProgramId"])
    )
    return df



## Customer metrics (optional)

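- Aggregates customer-level KPIs (lifetime revenue and margin as a simple CLV proxy, order frequency, recency, loyalty points) into `analytics_CustomerMetrics`.
- Partitioned by `AsOfDate` so consumers can pick a snapshot or opt in as needed. Because the write mode is `overwrite`, each run replaces the table by default, so only the latest snapshot is retained as current data.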
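
The derived KPIs follow a few simple guard rules (no division by zero, tenure floored at one month). The sketch below restates them in plain Python for a single customer. It is an illustration under stated assumptions, not the notebook's implementation: `customer_kpis` is a hypothetical name, the sample figures are made up, and `months_between` is a simplified re-implementation of Spark's function for date-only inputs (whole months when the day of month matches or both dates are month ends, otherwise a fraction based on 31 days).

```python
import calendar
from datetime import date


def months_between(d1: date, d2: date) -> float:
    """Approximate Spark's months_between for plain dates (no time component)."""
    last1 = d1.day == calendar.monthrange(d1.year, d1.month)[1]
    last2 = d2.day == calendar.monthrange(d2.year, d2.month)[1]
    whole = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    if d1.day == d2.day or (last1 and last2):
        return float(whole)
    return round(whole + (d1.day - d2.day) / 31.0, 8)


def customer_kpis(total_orders, total_revenue, total_cost, first_order, last_order, as_of):
    """Compute a few derived metrics with the same guards as build_customer_metrics."""
    has_orders = total_orders > 0
    margin = total_revenue - total_cost
    return {
        "GrossMargin": margin,
        "GrossMarginPct": margin / total_revenue if total_revenue > 0 else None,
        "AverageOrderValue": total_revenue / total_orders if has_orders else None,
        # Tenure is floored at one month so very new customers do not get inflated rates.
        "OrderFrequencyPerMonth": (
            total_orders / max(1.0, months_between(as_of, first_order))
            if has_orders and first_order is not None else None
        ),
        "RecencyDays": (as_of - last_order).days if last_order is not None else None,
        "ActiveDays": (
            (last_order - first_order).days + 1
            if first_order is not None and last_order is not None else None
        ),
    }


kpis = customer_kpis(4, 200.0, 120.0, date(2024, 1, 15), date(2024, 3, 15), date(2024, 4, 15))
print(kpis["AverageOrderValue"], kpis["RecencyDays"], kpis["ActiveDays"])  # 50.0 31 61

# A customer with a profile but no orders gets None for ratio and date metrics.
empty = customer_kpis(0, 0.0, 0.0, None, None, date(2024, 4, 15))
print(empty["AverageOrderValue"], empty["RecencyDays"])  # None None
```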


In [10]:
def build_customer_metrics(profiles_df, orders_df, loyalty_df=None, as_of_date=None):
    """Aggregate customer-level metrics for optional analytics use.

    Parameters
    ----------
    as_of_date : optional date or string
        Provide a fixed date for reproducible runs; defaults to current_date() if None.
    """
    if profiles_df is None or orders_df is None:
        return None
    if as_of_date is None:
        as_of_date_col = F.current_date()
    elif isinstance(as_of_date, str):
        as_of_date_col = F.lit(as_of_date).cast("date")
    else:
        as_of_date_col = F.lit(as_of_date)

    orders_metrics = (
        orders_df
        .groupBy("CustomerId")
        .agg(
            F.countDistinct("OrderId").alias("total_orders"),
            F.sum("LineAmount").alias("total_revenue"),
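            # Assumption to verify: CostPrice is summed as-is; if it is a per-unit cost, sum CostPrice * SalesQty instead.
            F.sum("CostPrice").alias("total_cost"),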
            F.sum("LineDiscount").alias("total_discount"),
            F.min("OrderDate").alias("first_order_date"),
            F.max("OrderDate").alias("last_order_date"),
            F.countDistinct("ChannelId").alias("channels_count"),
            F.countDistinct("ItemId").alias("products_count"),
        )
    )

    loyalty_metrics = None
    if loyalty_df is not None and "CustomerId" in loyalty_df.columns:
        loyalty_metrics = (
            loyalty_df
            .groupBy("CustomerId")
            .agg(
                F.sum(F.when(F.col("PointsDelta") > 0, F.col("PointsDelta")).otherwise(F.lit(0))).alias("points_earned"),
                F.sum(F.when(F.col("PointsDelta") < 0, F.abs(F.col("PointsDelta"))).otherwise(F.lit(0))).alias("points_redeemed"),
                F.sum("PointsDelta").alias("points_balance"),
            )
        )

    profiles_cols = [
        F.col("CustomerId"),
        F.col("DataAreaId"),
        F.col("AccountNum"),
        F.col("PartyType"),
        F.col("Name"),
        F.col("LanguageId"),
    ]
    base = profiles_df.select(*profiles_cols)

    metrics = base.join(orders_metrics, on="CustomerId", how="left")
    if loyalty_metrics is not None:
        metrics = metrics.join(loyalty_metrics, on="CustomerId", how="left")

    metrics = (
        metrics
        .withColumn("AsOfDate", as_of_date_col)
        .withColumn("TotalOrders", F.coalesce(F.col("total_orders"), F.lit(0)))
        .withColumn("TotalRevenue", F.coalesce(F.col("total_revenue"), F.lit(0.0)))
        .withColumn("TotalCost", F.coalesce(F.col("total_cost"), F.lit(0.0)))
        .withColumn("TotalDiscount", F.coalesce(F.col("total_discount"), F.lit(0.0)))
        .withColumn("GrossMargin", F.col("TotalRevenue") - F.col("TotalCost"))
        .withColumn("GrossMarginPct", F.when(F.col("TotalRevenue") > 0, F.col("GrossMargin") / F.col("TotalRevenue")).otherwise(F.lit(None)))
        .withColumn("AverageOrderValue", F.when(F.col("TotalOrders") > 0, F.col("TotalRevenue") / F.col("TotalOrders")).otherwise(F.lit(None)))
        .withColumn("OrderFrequencyPerMonth", F.when((F.col("TotalOrders") > 0) & F.col("first_order_date").isNotNull(), F.col("TotalOrders") / F.greatest(F.lit(1.0), F.months_between(as_of_date_col, F.col("first_order_date")))).otherwise(F.lit(None)))
        .withColumn("FirstOrderDate", F.col("first_order_date"))
        .withColumn("LastOrderDate", F.col("last_order_date"))
        .withColumn("RecencyDays", F.when(F.col("LastOrderDate").isNotNull(), F.datediff(as_of_date_col, F.col("LastOrderDate"))).otherwise(F.lit(None)))
        .withColumn("ActiveDays", F.when(F.col("FirstOrderDate").isNotNull() & F.col("LastOrderDate").isNotNull(), F.datediff(F.col("LastOrderDate"), F.col("FirstOrderDate")) + F.lit(1)).otherwise(F.lit(None)))
        .withColumn("DistinctChannels", F.coalesce(F.col("channels_count"), F.lit(0)))
        .withColumn("DistinctProducts", F.coalesce(F.col("products_count"), F.lit(0)))
        .withColumn("PointsEarned", F.coalesce(F.col("points_earned"), F.lit(0.0)))
        .withColumn("PointsRedeemed", F.coalesce(F.col("points_redeemed"), F.lit(0.0)))
        .withColumn("PointsBalance", F.coalesce(F.col("points_balance"), F.lit(0.0)))
        .withColumn("HasLoyaltyActivity", F.when((F.col("PointsEarned") + F.col("PointsRedeemed")) > 0, F.lit(True)).otherwise(F.lit(False)))
        .withColumn("DiscountShare", F.when(F.col("TotalRevenue") > 0, F.col("TotalDiscount") / F.col("TotalRevenue")).otherwise(F.lit(None)))
        .drop("total_orders", "total_revenue", "total_cost", "total_discount", "channels_count", "products_count", "points_earned", "points_redeemed", "points_balance", "first_order_date", "last_order_date")
    )

    return metrics



## Build pipeline
This section reads the Silver inputs, constructs each Gold model via the functions above, and writes each one
to `Files/Tables_gold/<name>` with the required Delta properties applied.
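
Every table in the pipeline goes through the same three moves: skip if the builder returned nothing, check primary-key uniqueness, then write under the Gold root. The sketch below shows that control flow in plain Python with lists of dicts standing in for DataFrames. `run_step` and `duplicate_keys` are hypothetical names, and the write is simulated by recording the target path, so this is an illustration of the pattern rather than a replacement for `write_delta()`.

```python
from collections import Counter

GOLD_ROOT = "Files/Tables_gold"  # same root folder as the notebook configuration


def duplicate_keys(rows, pk_cols):
    """Return the primary-key tuples that occur more than once."""
    counts = Counter(tuple(row[c] for c in pk_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def run_step(label, rows, pk_cols, written):
    """Skip when the builder returned None; otherwise check the PK and record the write."""
    if rows is None:
        return f"SKIP: {label} - source not available."
    table_name = label.replace(".", "_")  # profiles.Customer -> profiles_Customer
    dups = duplicate_keys(rows, pk_cols)
    # Stand-in for write_delta(): as in the notebook, duplicates only warn, they do not block the write.
    written[table_name] = f"{GOLD_ROOT}/{table_name}"
    status = "OK" if not dups else f"WARN: {len(dups)} duplicate PK(s)"
    return f"{status}: {label} -> {written[table_name]}"


written = {}
customers = [{"CustomerId": "usmf_c1"}, {"CustomerId": "usmf_c2"}, {"CustomerId": "usmf_c1"}]
first = run_step("profiles.Customer", customers, ["CustomerId"], written)
second = run_step("supporting.Products", None, ["ProductId"], written)
print(first)   # WARN: 1 duplicate PK(s): profiles.Customer -> Files/Tables_gold/profiles_Customer
print(second)  # SKIP: supporting.Products - source not available.
```

Whether a duplicate key should only warn or should stop the write is a design choice; the notebook warns and continues, which keeps runs idempotent but leaves the investigation to the reader of the log.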

In [11]:
# Read Silver inputs
customer_silver = safe_read_table("customer_silver")
loyalty_card_silver = safe_read_table("loyalty_card_silver")
loyalty_card_tier_silver = safe_read_table("loyalty_card_tier_silver")
loyalty_point_trans_silver = safe_read_table("loyalty_point_trans_silver")
loyalty_program_affiliation_silver = safe_read_table("loyalty_program_affiliation_silver")
sales_hdr_silver = safe_read_table("salestable_silver")
sales_line_silver = safe_read_table("salesline_silver")
retail_channel_silver = safe_read_table("retailchannel_silver")
# Optional input: existence is checked first so a missing table does not log a SKIP line.
product_silver = safe_read_table("product_silver") if table_exists("product_silver") else None
# Build Profiles.Customer
profiles_customer_df = build_profiles_customer(customer_silver)
if profiles_customer_df is not None:
    # PK uniqueness check
    assert_pk_unique(profiles_customer_df, ["CustomerId"], "profiles.Customer")
    write_delta(profiles_customer_df, "profiles_Customer")
else:
    log("SKIP: profiles.Customer - source not available.")
# Build Activities.Orders
activities_orders_df = build_activities_orders(
    sales_hdr=sales_hdr_silver,
    sales_line=sales_line_silver,
    retail_channel=retail_channel_silver,
    customer_silver=customer_silver
)
if activities_orders_df is not None:
    assert_pk_unique(activities_orders_df, ["OrderLineId"], "activities.Orders")
    write_delta(activities_orders_df, "activities_Orders", partition_cols=["OrderDate"])  # partition by date
else:
    log("SKIP: activities.Orders - required sources not available.")
# Build Supporting.Products
supporting_products_df = build_supporting_products(sales_line=sales_line_silver, product_silver=product_silver)
if supporting_products_df is not None:
    assert_pk_unique(supporting_products_df, ["ProductId"], "supporting.Products")
    write_delta(supporting_products_df, "supporting_Products")
else:
    log("SKIP: supporting.Products - source not available.")
# Build Supporting.Channels
supporting_channels_df = build_supporting_channels(retail_channel_silver)
if supporting_channels_df is not None:
    assert_pk_unique(supporting_channels_df, ["ChannelId"], "supporting.Channels")
    write_delta(supporting_channels_df, "supporting_Channels")
else:
    log("SKIP: supporting.Channels - source not available.")
# Build Supporting.Calendar
supporting_calendar_df = build_supporting_calendar(orders=activities_orders_df, sales_line=sales_line_silver)
if supporting_calendar_df is not None:
    assert_pk_unique(supporting_calendar_df, ["Date"], "supporting.Calendar")
    write_delta(supporting_calendar_df, "supporting_Calendar")
else:
    log("SKIP: supporting.Calendar - could not infer date range.")
# Build Activities.LoyaltyPoints
activities_loyalty_points_df = build_activities_loyalty_points(
    loyalty_trans=loyalty_point_trans_silver,
    loyalty_card=loyalty_card_silver,
    loyalty_tier=loyalty_card_tier_silver,
    sales_hdr=sales_hdr_silver,
    retail_channel=retail_channel_silver
)
if activities_loyalty_points_df is not None:
    assert_pk_unique(activities_loyalty_points_df, ["LoyaltyEventId"], "activities.LoyaltyPoints")
    write_delta(activities_loyalty_points_df, "activities_LoyaltyPoints", partition_cols=["EventDate"])
else:
    log("SKIP: activities.LoyaltyPoints - required sources not available.")
# Build Supporting.LoyaltyRewardPoints (stub schema)
supporting_loyalty_reward_points_df = build_supporting_loyalty_reward_points()
assert_pk_unique(supporting_loyalty_reward_points_df, ["RewardPointId"], "supporting.LoyaltyRewardPoints")
write_delta(supporting_loyalty_reward_points_df, "supporting_LoyaltyRewardPoints")
# Build Supporting.LoyaltyPrograms
supporting_loyalty_programs_df = build_supporting_loyalty_programs(loyalty_program_affiliation_silver)
if supporting_loyalty_programs_df is not None:
    assert_pk_unique(supporting_loyalty_programs_df, ["ProgramId"], "supporting.LoyaltyPrograms")
    write_delta(supporting_loyalty_programs_df, "supporting_LoyaltyPrograms")
else:
    log("SKIP: supporting.LoyaltyPrograms - source not available.")
# Build analytics.CustomerMetrics (optional snapshot)
customer_metrics_df = build_customer_metrics(
    profiles_df=profiles_customer_df,
    orders_df=activities_orders_df,
    loyalty_df=activities_loyalty_points_df
)
if customer_metrics_df is not None:
    assert_pk_unique(customer_metrics_df, ['CustomerId', 'AsOfDate'], 'analytics.CustomerMetrics')
    write_delta(customer_metrics_df, 'analytics_CustomerMetrics', partition_cols=['AsOfDate'])
else:
    log('SKIP: analytics.CustomerMetrics - required sources not available.')


StatementMeta(, 83a6b924-3a83-46d6-acb1-2221b657faf5, 9, Finished, Available, Finished)

[gold] OK: profiles.Customer PK uniqueness holds.
[gold] Wrote Gold table: profiles_Customer -> Files/Tables_gold/profiles_Customer
[gold] OK: activities.Orders PK uniqueness holds.
[gold] Wrote Gold table: activities_Orders -> Files/Tables_gold/activities_Orders
[gold] OK: supporting.Products PK uniqueness holds.
[gold] Wrote Gold table: supporting_Products -> Files/Tables_gold/supporting_Products
[gold] OK: supporting.Channels PK uniqueness holds.
[gold] Wrote Gold table: supporting_Channels -> Files/Tables_gold/supporting_Channels
[gold] OK: supporting.Calendar PK uniqueness holds.
[gold] Wrote Gold table: supporting_Calendar -> Files/Tables_gold/supporting_Calendar
[gold] OK: activities.LoyaltyPoints PK uniqueness holds.
[gold] Wrote Gold table: activities_LoyaltyPoints -> Files/Tables_gold/activities_LoyaltyPoints
[gold] OK: supporting.LoyaltyRewardPoints PK uniqueness holds.
[gold] Wrote Gold table: supporting_LoyaltyRewardPoints -> Files/Tables_gold/supporting_LoyaltyRewardPoint
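
Each table in the cell above follows the same sequence: build the DataFrame, log a SKIP line if the builder returned `None`, otherwise assert primary-key uniqueness and write. The sketch below shows one way that sequence could be factored into a single helper. `publish_or_skip` and its parameters are hypothetical names that do not exist in this notebook; the validator, writer, and logger are passed in as plain callables (standing in for `assert_pk_unique`, `write_delta`, and `log`) so the sketch runs without Spark.

```python
from typing import Any, Callable, List, Optional


def publish_or_skip(
    df: Optional[Any],
    label: str,
    table_name: str,
    pk_cols: List[str],
    assert_pk: Callable[[Any, List[str], str], None],
    write: Callable[..., None],
    log: Callable[[str], None] = print,
    partition_cols: Optional[List[str]] = None,
    skip_reason: str = "required sources not available.",
) -> bool:
    """Validate and write one Gold table; return True if it was written.

    Hypothetical helper: mirrors the build -> assert -> write / SKIP
    pattern used per table, with the notebook helpers injected.
    """
    if df is None:
        # Builder could not produce the table: log and move on.
        log(f"SKIP: {label} - {skip_reason}")
        return False

    # Fail fast on duplicate keys before anything is written.
    assert_pk(df, pk_cols, label)

    if partition_cols:
        write(df, table_name, partition_cols=partition_cols)
    else:
        write(df, table_name)
    return True
```

In the notebook this might be called as `publish_or_skip(activities_loyalty_points_df, "activities.LoyaltyPoints", "activities_LoyaltyPoints", ["LoyaltyEventId"], assert_pk_unique, write_delta, log, partition_cols=["EventDate"])`, assuming those helpers keep the signatures used above.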

## Summary
A quick row-count snapshot of the generated Gold tables; a table that does not exist is reported as 0.

In [12]:
def count_or_zero(name: str) -> int:
    """Return the row count of table `name`, or 0 if the table does not exist."""
    return spark.table(name).count() if table_exists(name) else 0

gold_tables = [
    "profiles_Customer",
    "activities_Orders",
    "activities_LoyaltyPoints",
    "supporting_Products",
    "supporting_Channels",
    "supporting_Calendar",
    "supporting_LoyaltyRewardPoints",
    "supporting_LoyaltyPrograms",
    "analytics_CustomerMetrics",
]

for t in gold_tables:
    try:
        c = count_or_zero(t)
        log(f"Table {t} count = {c}")
    except Exception as e:
        log(f"Table {t} not available: {e}")


StatementMeta(, 83a6b924-3a83-46d6-acb1-2221b657faf5, 10, Finished, Available, Finished)

[gold] Table profiles_Customer count = 338
[gold] Table activities_Orders count = 185716
[gold] Table activities_LoyaltyPoints count = 4
[gold] Table supporting_Products count = 518
[gold] Table supporting_Channels count = 49
[gold] Table supporting_Calendar count = 45634
[gold] Table supporting_LoyaltyRewardPoints count = 0
[gold] Table supporting_LoyaltyPrograms count = 5
[gold] Table analytics_CustomerMetrics count = 338
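
A count of 0 deserves a second look, since it may indicate a stub table (as with `supporting_LoyaltyRewardPoints`) or a table that was never written. As a minimal sketch, the captured log lines can be scanned for such tables in plain Python. `parse_counts` and `empty_tables` are hypothetical helper names, and the line format is assumed to match the `[gold] Table <name> count = <n>` lines shown above.

```python
import re
from typing import Dict, Iterable, List

# Assumed log format: "[gold] Table <name> count = <n>"
_COUNT_LINE = re.compile(r"^\[gold\] Table (?P<name>\S+) count = (?P<count>\d+)$")


def parse_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Extract {table_name: row_count} from captured log lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        match = _COUNT_LINE.match(line.strip())
        if match:  # lines in any other format are ignored
            counts[match.group("name")] = int(match.group("count"))
    return counts


def empty_tables(counts: Dict[str, int]) -> List[str]:
    """Return the names of tables whose reported count is 0."""
    return [name for name, count in counts.items() if count == 0]


captured = [
    "[gold] Table profiles_Customer count = 338",
    "[gold] Table activities_LoyaltyPoints count = 4",
    "[gold] Table supporting_LoyaltyRewardPoints count = 0",
    "[gold] Table supporting_LoyaltyPrograms count = 5",
]
print(empty_tables(parse_counts(captured)))
# -> ['supporting_LoyaltyRewardPoints']
```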
