# Fraud Detection - Gold Layer

## Purpose
Create analytics-ready, denormalized tables by joining silver layer tables for:
- Machine learning model training
- Business intelligence and reporting
- Feature-rich datasets with all relevant dimensions

## Gold Tables Created
1. `tx_train_gold` - Training dataset (labeled transactions + all dimensions)
2. `tx_score_gold` - Scoring dataset (unlabeled transactions + all dimensions)

## Data Flow
Silver Tables → Join Dimensions → Gold Tables (denormalized, analytics-ready)
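
Because the join step folds several silver schemas into one, any column name that exists on both sides of a join (other than a key joined by name) appears twice in the result, and a Delta write rejects duplicate column names. The sketch below shows one way to list such overlaps before joining. The helper `overlapping_columns` and the sample column lists are hypothetical; in the notebook the lists would come from `df.columns`.

```python
def overlapping_columns(left_cols, right_cols, join_keys=()):
    """Return names present on both sides, excluding keys joined by name."""
    keys = set(join_keys)
    right = set(right_cols)
    # Keep the left-hand order so the report is stable between runs
    return [c for c in left_cols if c in right and c not in keys]


# Hypothetical column lists, for illustration only
tx_cols = ["id", "client_id", "card_id", "amount", "mcc"]
card_cols = ["id", "client_id", "card_brand"]
mcc_cols = ["mcc", "description"]

# Joined on an expression (card_id == id), so both overlaps need a drop()
print(overlapping_columns(tx_cols, card_cols))          # ['id', 'client_id']

# Joined by name on "mcc", which Spark keeps as a single column
print(overlapping_columns(tx_cols, mcc_cols, ["mcc"]))  # []
```

A non-empty result marks the columns that need a `drop()` or a rename after the join.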

## Joins Performed
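- Transactions + Labels (inner join for train, left anti for score); done upstream in the silver layer, so `tx_train_silver` and `tx_score_silver` arrive already split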
- Transactions + Users (left join on client_id)
- Transactions + Cards (left join on card_id)
- Transactions + MCC (left join on mcc code)
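
The join types listed above can be illustrated without Spark. The following is a minimal plain-Python sketch of their semantics on lists of dicts, not the notebook's implementation. The key `tx_id`, the column `age`, and all sample rows are assumptions made up for the example.

```python
def inner_join(left, right, key):
    """Keep only left rows that have a match on the right; merge the columns."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def left_anti_join(left, right, key):
    """Keep only left rows that have NO match on the right; add no columns."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]


def left_join(left, right, left_key, right_key):
    """Keep every left row; unmatched rows get None for the right columns."""
    right_cols = [c for c in (right[0] if right else {}) if c != right_key]
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[left_key])
        if matches:
            for r in matches:
                out.append({**l, **{c: r[c] for c in right_cols}})
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out


# Hypothetical sample rows
transactions = [
    {"tx_id": 1, "client_id": 10},
    {"tx_id": 2, "client_id": 11},
    {"tx_id": 3, "client_id": 12},
]
labels = [{"tx_id": 1, "label": 0}, {"tx_id": 2, "label": 1}]
users = [{"id": 10, "age": 34}, {"id": 11, "age": 51}]

train = inner_join(transactions, labels, "tx_id")      # labeled rows only
score = left_anti_join(transactions, labels, "tx_id")  # unlabeled rows only
score_enriched = left_join(score, users, "client_id", "id")

print(train)
print(score_enriched)
```

In this toy data, transaction 3 has no label and no matching user, so it lands in the scoring set and keeps its row with a null `age`. That is the behaviour that distinguishes a left join from an inner join on the dimensions.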

In [0]:
from pyspark.sql.functions import broadcast

# Load silver tables
tx_train = spark.table("workspace.fraud.tx_train_silver")  # Already has labels joined
usr = spark.table("workspace.fraud.users_silver")
crd = spark.table("workspace.fraud.cards_silver")
mcc = spark.table("workspace.fraud.mcc_dim_silver")

# Create denormalized gold table with all dimensions
tx_train_gold = (
    tx_train
      .join(broadcast(usr), tx_train.client_id == usr.id, "left")
      .drop(usr.id)  # Drop user id to avoid duplicate
      .join(broadcast(crd), tx_train.card_id == crd.id, "left")
      .drop(crd.id).drop(crd.client_id)  # Drop card id and client_id to avoid duplicates (chained: older PySpark allows only one Column per drop())
      .join(broadcast(mcc), "mcc", "left")  # Join on mcc code
)

# Save to gold table
tx_train_gold.write.mode("overwrite").option("overwriteSchema", "true").format("delta") \
    .saveAsTable("workspace.fraud.tx_train_gold")

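# Count from the saved table instead of re-running the joins
row_count = spark.table("workspace.fraud.tx_train_gold").count()
print(f"Created tx_train_gold with {row_count:,} rows")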

In [0]:
from pyspark.sql.functions import broadcast

# Load silver tables
tx_score = spark.table("workspace.fraud.tx_score_silver")  # Unlabeled transactions
usr = spark.table("workspace.fraud.users_silver")
crd = spark.table("workspace.fraud.cards_silver")
mcc = spark.table("workspace.fraud.mcc_dim_silver")

# Create denormalized gold table with all dimensions
tx_score_gold = (
    tx_score
      .join(broadcast(usr), tx_score.client_id == usr.id, "left")
      .drop(usr.id)  # Drop user id to avoid duplicate
      .join(broadcast(crd), tx_score.card_id == crd.id, "left")
      .drop(crd.id).drop(crd.client_id)  # Drop card id and client_id to avoid duplicates (chained: older PySpark allows only one Column per drop())
      .join(broadcast(mcc), "mcc", "left")  # Join on mcc code
)

# Save to gold table
tx_score_gold.write.mode("overwrite").option("overwriteSchema", "true").format("delta") \
    .saveAsTable("workspace.fraud.tx_score_gold")

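# Count from the saved table instead of re-running the joins
row_count = spark.table("workspace.fraud.tx_score_gold").count()
print(f"Created tx_score_gold with {row_count:,} rows")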

---
## Verification
Verify that both gold tables were created, then inspect their row counts, the label distribution in the training set, and the column counts.
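
One further check that may be worth adding: a left join keeps every fact row and only adds rows when a key matches more than one dimension row, so each gold table should have the same row count as its silver source, assuming `id` in the users and cards tables and `mcc` in the MCC table are unique. The sketch below compares the counts through an injected `get_count` callable so it can run without a cluster. `check_row_counts` and the counts in `fake_counts` are hypothetical.

```python
def check_row_counts(get_count, pairs):
    """Compare row counts for (silver, gold) table-name pairs.

    get_count is any callable that maps a table name to its row count.
    """
    report = []
    for silver, gold in pairs:
        silver_rows = get_count(silver)
        gold_rows = get_count(gold)
        report.append({
            "gold": gold,
            "silver_rows": silver_rows,
            "gold_rows": gold_rows,
            "match": silver_rows == gold_rows,
        })
    return report


pairs = [
    ("workspace.fraud.tx_train_silver", "workspace.fraud.tx_train_gold"),
    ("workspace.fraud.tx_score_silver", "workspace.fraud.tx_score_gold"),
]

# Made-up counts standing in for Spark; the second pair simulates fan-out
fake_counts = {
    "workspace.fraud.tx_train_silver": 1000,
    "workspace.fraud.tx_train_gold": 1000,
    "workspace.fraud.tx_score_silver": 500,
    "workspace.fraud.tx_score_gold": 503,
}

report = check_row_counts(fake_counts.get, pairs)
for row in report:
    status = "OK" if row["match"] else "MISMATCH"
    print(f"{row['gold']:35} {row['silver_rows']:>8,} -> {row['gold_rows']:>8,}  {status}")
```

In the notebook, the stub would be replaced with `lambda name: spark.table(name).count()`. A mismatch points to duplicate keys in one of the dimension tables.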

In [0]:
# Summary of all gold tables
print("Gold Layer Tables Summary")
print("=" * 60)

gold_tables = [
    "tx_train_gold",
    "tx_score_gold"
]

for table in gold_tables:
    try:
        count = spark.sql(f"SELECT COUNT(*) as cnt FROM workspace.fraud.{table}").collect()[0]['cnt']
        print(f"{table:25} {count:>15,} rows")
    except Exception as e:
        print(f"{table:25} ERROR: {str(e)[:200]}")  # Keep enough of the message to diagnose the failure

print("=" * 60)
print()

# Show label distribution in training set
print("Label Distribution in tx_train_gold:")
spark.sql("SELECT label, COUNT(*) as count FROM workspace.fraud.tx_train_gold GROUP BY label ORDER BY label").show()

# Show column count
print(f"tx_train_gold has {len(spark.table('workspace.fraud.tx_train_gold').columns)} columns (denormalized)")
print(f"tx_score_gold has {len(spark.table('workspace.fraud.tx_score_gold').columns)} columns (denormalized)")