# Financial Services: Medallion Architecture Demo

## Overview

This notebook demonstrates a **Medallion Architecture** implementation in Oracle AI Data Platform (AIDP) Workbench using a financial services analytics use case. The medallion architecture organizes data into three layers:

- **Bronze Layer**: Raw data ingestion and storage
- **Silver Layer**: Cleaned, validated, and structured data
- **Gold Layer**: Aggregated, analytics-ready data with ML models

### What is Medallion Architecture?

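Data is refined step by step as it moves from one layer to the next. In this notebook:

- **Bronze**: Generated transaction records are stored as ingested, with no cleaning
- **Silver**: Records are validated, flagged with `is_valid`, scored for quality, and enriched with `month`, `day_of_week`, and `hour`
- **Gold**: Monthly per-account aggregates and a labeled feature table feed analytics and ML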

### Use Case: Transaction Fraud Detection and Customer Analytics

We'll analyze financial transaction records from a bank across all three layers, culminating in ML-powered fraud detection.

### AIDP Environment Setup

This notebook uses the Spark session that your AIDP environment already provides as `spark`, so no session setup is required.

## Setup: Create Financial Services Catalog and Medallion Schemas

### Catalog and Schema Design

We'll create:
- `finance.bronze`: Raw transaction data
- `finance.silver`: Cleaned and validated transactions
- `finance.gold`: Analytics and ML-ready data

This structure provides data isolation and governance across layers.

In [None]:
# Create financial services catalog and medallion schemas

spark.sql("CREATE CATALOG IF NOT EXISTS finance")

spark.sql("CREATE SCHEMA IF NOT EXISTS finance.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.gold")

print("Financial services catalog and medallion schemas created successfully!")
print("- finance.bronze: Raw transaction data")
print("- finance.silver: Cleaned and validated data")
print("- finance.gold: Analytics and ML-ready data")

Financial services catalog and medallion schemas created successfully!
- finance.bronze: Raw transaction data
- finance.silver: Cleaned and validated data
- finance.gold: Analytics and ML-ready data


## Bronze Layer: Raw Data Ingestion

### Bronze Layer Design

The bronze layer stores raw transaction data as ingested, with minimal processing. We'll use Delta tables with liquid clustering, which keeps rows with similar clustering-key values together so that queries filtering on those columns read fewer files.

### Table: `account_transactions_bronze`

- Raw transaction records with all original fields
- Liquid clustering on `account_id` and `transaction_date`
- Preserves data integrity and auditability

In [None]:
# Create Bronze Layer Delta table with liquid clustering

spark.sql("""
CREATE TABLE IF NOT EXISTS finance.bronze.account_transactions_bronze (
    account_id STRING,
    transaction_date TIMESTAMP,
    transaction_type STRING,
    amount DECIMAL(15,2),
    merchant_category STRING,
    location STRING,
    risk_score INT,
    ingestion_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (account_id, transaction_date)
""")

print("Bronze layer table created successfully!")
print("Liquid clustering will automatically optimize data layout for account_id and transaction_date queries.")

Bronze layer table created successfully!
Liquid clustering will automatically optimize data layout for account_id and transaction_date queries.


In [None]:
# Generate sample financial transaction data for Bronze layer

import random
from datetime import datetime, timedelta

# Define financial data constants
TRANSACTION_TYPES = ['Deposit', 'Withdrawal', 'Transfer', 'Payment', 'ATM']
MERCHANT_CATEGORIES = ['Retail', 'Restaurant', 'Online', 'Utilities', 'Entertainment', 'Groceries', 'Healthcare', 'Transportation']
LOCATIONS = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Miami, FL', 'Online', 'ATM']

# Generate account transaction records
transaction_data = []
base_date = datetime(2024, 1, 1)

# Create 5,000 accounts with 10-50 transactions each
for account_num in range(1, 5001):
    account_id = f"ACC{account_num:08d}"
    
    # Each account gets 10-50 transactions over 12 months
    num_transactions = random.randint(10, 50)
    
    for i in range(num_transactions):
        # Spread transactions uniformly at random across 2024
        days_offset = random.randint(0, 365)
        hours_offset = random.randint(0, 23)
        transaction_date = base_date + timedelta(days=days_offset, hours=hours_offset)
        
        # Select transaction type
        transaction_type = random.choice(TRANSACTION_TYPES)
        
        # Amount based on transaction type
        if transaction_type in ['Deposit', 'Transfer']:
            amount = round(random.uniform(100, 10000), 2)
        elif transaction_type == 'ATM':
            amount = round(random.uniform(20, 500), 2) * -1
        else:
            amount = round(random.uniform(10, 2000), 2) * -1
        
        # Select merchant category and location
        merchant_category = random.choice(MERCHANT_CATEGORIES)
        if transaction_type == 'ATM':
            location = 'ATM'
        # 'Online' is a merchant category, not a transaction type
        elif merchant_category == 'Online':
            location = 'Online'
        else:
            location = random.choice(LOCATIONS)
        
        # Risk score (0-100, higher = more suspicious)
        risk_score = random.randint(0, 100)
        
        transaction_data.append({
            "account_id": account_id,
            "transaction_date": transaction_date,
            "transaction_type": transaction_type,
            "amount": amount,
            "merchant_category": merchant_category,
            "location": location,
            "risk_score": risk_score
        })

print(f"Generated {len(transaction_data)} raw account transaction records for Bronze layer")
print("Sample record:", transaction_data[0])

Generated 150236 raw account transaction records for Bronze layer
Sample record: {'account_id': 'ACC00000001', 'transaction_date': datetime.datetime(2024, 12, 3, 8, 0), 'transaction_type': 'ATM', 'amount': -414.05, 'merchant_category': 'Restaurant', 'location': 'ATM', 'risk_score': 25}
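Because the generator above is not seeded, the record count and the sample values change on every run. The following is a minimal sketch of a reproducible variant. The helper name `generate_transactions` and the use of a private `random.Random(seed)` instance are assumptions for illustration, not part of the original notebook. It applies the same amount rules and, as an assumption about the intended behavior, sets the location to `Online` when the merchant category is `Online`.

```python
import random
from datetime import datetime, timedelta

TRANSACTION_TYPES = ['Deposit', 'Withdrawal', 'Transfer', 'Payment', 'ATM']
MERCHANT_CATEGORIES = ['Retail', 'Restaurant', 'Online', 'Utilities',
                       'Entertainment', 'Groceries', 'Healthcare', 'Transportation']
LOCATIONS = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX',
             'Miami, FL', 'Online', 'ATM']


def generate_transactions(num_accounts, seed=42):
    """Hypothetical helper: same rules as the cell above, but reproducible."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    base_date = datetime(2024, 1, 1)
    records = []
    for account_num in range(1, num_accounts + 1):
        account_id = f"ACC{account_num:08d}"
        for _ in range(rng.randint(10, 50)):
            transaction_type = rng.choice(TRANSACTION_TYPES)
            if transaction_type in ('Deposit', 'Transfer'):
                amount = round(rng.uniform(100, 10000), 2)   # positive amount
            elif transaction_type == 'ATM':
                amount = -round(rng.uniform(20, 500), 2)     # negative amount
            else:
                amount = -round(rng.uniform(10, 2000), 2)    # negative amount
            merchant_category = rng.choice(MERCHANT_CATEGORIES)
            if transaction_type == 'ATM':
                location = 'ATM'
            elif merchant_category == 'Online':
                location = 'Online'
            else:
                location = rng.choice(LOCATIONS)
            records.append({
                "account_id": account_id,
                "transaction_date": base_date + timedelta(
                    days=rng.randint(0, 365), hours=rng.randint(0, 23)),
                "transaction_type": transaction_type,
                "amount": amount,
                "merchant_category": merchant_category,
                "location": location,
                "risk_score": rng.randint(0, 100),
            })
    return records


sample = generate_transactions(num_accounts=20, seed=7)
print(len(sample), sample[0]["account_id"])
```

Calling the helper twice with the same seed returns identical records, which makes downstream counts comparable between runs.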


In [None]:
# Insert raw data into Bronze layer

# Create DataFrame from generated data
df_bronze = spark.createDataFrame(transaction_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_bronze.printSchema()

print("\nSample Bronze Data:")
df_bronze.show(5)

# Write data to the Bronze table (overwrite mode replaces any existing contents)
df_bronze.write.mode("overwrite").saveAsTable("finance.bronze.account_transactions_bronze")

print(f"\nSuccessfully inserted {df_bronze.count()} raw records into Bronze layer")
print("Data is now available for Silver layer processing.")

Bronze Layer DataFrame Schema:
root
 |-- account_id: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- location: string (nullable = true)
 |-- merchant_category: string (nullable = true)
 |-- risk_score: long (nullable = true)
 |-- transaction_date: timestamp (nullable = true)
 |-- transaction_type: string (nullable = true)


Sample Bronze Data:


+-----------+--------+-----------+-----------------+----------+-------------------+----------------+
| account_id|  amount|   location|merchant_category|risk_score|   transaction_date|transaction_type|
+-----------+--------+-----------+-----------------+----------+-------------------+----------------+
|ACC00000001| -414.05|        ATM|       Restaurant|        25|2024-12-03 08:00:00|             ATM|
|ACC00000001|-1275.77|        ATM|   Transportation|        25|2024-12-25 17:00:00|      Withdrawal|
|ACC00000001|  -40.76|        ATM|   Transportation|        87|2024-05-16 03:00:00|             ATM|
|ACC00000001| -501.45|Houston, TX|       Restaurant|        74|2024-07-23 09:00:00|      Withdrawal|
|ACC00000001| -298.66|        ATM|   Transportation|        52|2024-06-18 02:00:00|             ATM|
+-----------+--------+-----------+-----------------+----------+-------------------+----------------+
only showing top 5 rows




Successfully inserted 150236 raw records into Bronze layer
Data is now available for Silver layer processing.
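The inferred schema printed above types `amount` as `double` and `risk_score` as `long`, and it has no `ingestion_timestamp` column, whereas the Bronze DDL declares `DECIMAL(15,2)`, `INT`, and `ingestion_timestamp TIMESTAMP`. One way to close that gap is to conform each record before building the DataFrame. The sketch below is plain Python; `conform_record` is a hypothetical helper that is not part of the notebook, and the example makes no assumption about how the platform handles a schema mismatch on overwrite.

```python
from datetime import datetime, timezone
from decimal import Decimal, ROUND_HALF_UP


def conform_record(record, ingested_at):
    """Hypothetical helper: align one generated record with the Bronze DDL."""
    conformed = dict(record)  # copy, so the raw record is left unchanged
    # DECIMAL(15,2): convert via str() so binary float noise is not carried over
    conformed["amount"] = Decimal(str(record["amount"])).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP)
    conformed["risk_score"] = int(record["risk_score"])
    conformed["ingestion_timestamp"] = ingested_at
    return conformed


raw = {
    "account_id": "ACC00000001",
    "transaction_date": datetime(2024, 12, 3, 8, 0),
    "transaction_type": "ATM",
    "amount": -414.05,
    "merchant_category": "Restaurant",
    "location": "ATM",
    "risk_score": 25,
}
row = conform_record(raw, ingested_at=datetime.now(timezone.utc))
print(row["amount"], type(row["amount"]).__name__)
```

The conformed rows could then be passed to `spark.createDataFrame` together with an explicit schema, so that the DataFrame columns match the table definition.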


## Silver Layer: Data Cleaning and Validation

### Silver Layer Design

The silver layer provides cleaned, validated, and enriched data. We'll:

- Remove invalid records
- Standardize data formats
- Add data quality metrics
- Enrich with derived fields

### Table: `account_transactions_silver`

- Cleaned transaction data with validation flags
- Enhanced with temporal features
- Ready for analytical processing

In [None]:
# Create Silver Layer Delta table

spark.sql("""
CREATE TABLE IF NOT EXISTS finance.silver.account_transactions_silver (
    account_id STRING,
    transaction_date TIMESTAMP,
    transaction_type STRING,
    amount DECIMAL(15,2),
    merchant_category STRING,
    location STRING,
    risk_score INT,
    month INT,
    day_of_week INT,
    hour INT,
    is_valid BOOLEAN,
    data_quality_score DOUBLE,
    processed_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (account_id, transaction_date)
""")

print("Silver layer table created successfully!")

Silver layer table created successfully!
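The `month`, `day_of_week`, and `hour` columns of this table are derived from `transaction_date`. Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `datetime.weekday()`. A small plain-Python sketch of the same derivation, with `temporal_features` as a hypothetical helper name:

```python
from datetime import datetime


def temporal_features(ts):
    """Hypothetical helper mirroring Spark's month(), dayofweek(), and hour()."""
    return {
        "month": ts.month,
        # isoweekday(): Monday=1 ... Sunday=7; shift to Sunday=1 ... Saturday=7
        "day_of_week": ts.isoweekday() % 7 + 1,
        "hour": ts.hour,
    }


features = temporal_features(datetime(2024, 11, 7, 14, 0))  # a Thursday
print(features)
```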


In [None]:
# Process Bronze data to Silver layer

from pyspark.sql.functions import col, when, month, dayofweek, hour, lit

# Read from Bronze layer
bronze_df = spark.table("finance.bronze.account_transactions_bronze")

print(f"Read {bronze_df.count()} records from Bronze layer")

# Data validation and cleaning
silver_df = bronze_df \
    .withColumn("month", month(col("transaction_date"))) \
    .withColumn("day_of_week", dayofweek(col("transaction_date"))) \
    .withColumn("hour", hour(col("transaction_date"))) \
    .withColumn("is_valid", 
                when((col("amount").isNotNull()) & 
                     (col("account_id").isNotNull()) & 
                     (col("transaction_date").isNotNull()), True).otherwise(False)) \
    .withColumn("data_quality_score", 
                when(col("is_valid"), 
                     (lit(1.0) - (col("risk_score") / lit(100.0))) * lit(0.7) + lit(0.3)).otherwise(0.0))

# Keep only valid records
valid_silver_df = silver_df.filter(col("is_valid"))

print(f"After validation: {valid_silver_df.count()} valid records")
print(f"Filtered out {bronze_df.count() - valid_silver_df.count()} invalid records")

# Show sample cleaned data
print("\nSample Silver Layer Data:")
valid_silver_df.select("account_id", "transaction_date", "transaction_type", "amount", "risk_score", "is_valid", "data_quality_score").show(5)

Read 150236 records from Bronze layer


After validation: 150236 valid records


Filtered out 0 invalid records

Sample Silver Layer Data:


+-----------+-------------------+----------------+--------+----------+--------+-------------------+
| account_id|   transaction_date|transaction_type|  amount|risk_score|is_valid| data_quality_score|
+-----------+-------------------+----------------+--------+----------+--------+-------------------+
|ACC00001217|2024-11-07 14:00:00|      Withdrawal| -834.63|        38|    true|              0.734|
|ACC00001217|2024-07-19 09:00:00|         Deposit| 1955.05|        19|    true|              0.867|
|ACC00001217|2024-03-27 19:00:00|      Withdrawal|-1649.43|        52|    true| 0.6359999999999999|
|ACC00001217|2024-11-24 12:00:00|         Deposit| 7660.27|        52|    true| 0.6359999999999999|
|ACC00001217|2024-12-27 05:00:00|         Deposit| 3054.73|        76|    true|0.46799999999999997|
+-----------+-------------------+----------------+--------+----------+--------+-------------------+
only showing top 5 rows
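The `data_quality_score` computed in the cell above is a linear rescaling of `risk_score`: valid rows get `0.7 * (1 - risk_score / 100) + 0.3`, so the score ranges from 0.3 (risk 100) to 1.0 (risk 0), and invalid rows get 0.0. A minimal plain-Python sketch of the same formula follows. `quality_score` is a hypothetical helper, and the rounding to three decimals is an addition that hides floating-point artifacts such as `0.6359999999999999` in the sample output.

```python
def quality_score(risk_score, is_valid=True):
    """Hypothetical helper: same formula as the Silver transformation."""
    if not is_valid:
        return 0.0
    raw = (1.0 - risk_score / 100.0) * 0.7 + 0.3
    return round(raw, 3)  # rounding is an addition, not in the notebook


for risk in (0, 38, 52, 100):
    print(risk, quality_score(risk))
```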



In [None]:
# Insert cleaned data into Silver layer

valid_silver_df.write.mode("overwrite").saveAsTable("finance.silver.account_transactions_silver")

print(f"Successfully inserted {valid_silver_df.count()} cleaned records into Silver layer")
print("Data is now validated, enriched, and ready for Gold layer analytics.")

Successfully inserted 150236 cleaned records into Silver layer
Data is now validated, enriched, and ready for Gold layer analytics.


## Gold Layer: Analytics and ML-Ready Data

### Gold Layer Design

The gold layer provides business-ready analytics and ML features. We'll create:

- Aggregated analytics tables
- ML-ready feature engineering
- Fraud detection model training

### Tables in Gold Layer

- `transaction_analytics_gold`: Aggregated transaction metrics
- `fraud_detection_model_gold`: ML-ready features for fraud detection

In [None]:
# Create Gold Layer Analytics Table

spark.sql("""
CREATE TABLE IF NOT EXISTS finance.gold.transaction_analytics_gold (
    account_id STRING,
    month_year STRING,
    total_transactions BIGINT,
    total_amount DECIMAL(15,2),
    avg_amount DECIMAL(15,2),
    avg_risk_score DOUBLE,
    high_risk_transactions BIGINT,
    transaction_types MAP<STRING, BIGINT>,
    merchant_categories MAP<STRING, BIGINT>,
    created_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (account_id, month_year)
""")

print("Gold layer analytics table created successfully!")

Gold layer analytics table created successfully!


In [None]:
# Create Gold Layer ML Features Table

spark.sql("""
CREATE TABLE IF NOT EXISTS finance.gold.fraud_detection_model_gold (
    account_id STRING,
    transaction_date TIMESTAMP,
    transaction_type STRING,
    amount DECIMAL(15,2),
    merchant_category STRING,
    location STRING,
    risk_score INT,
    month INT,
    day_of_week INT,
    hour INT,
    is_fraud INT,
    created_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (account_id, transaction_date)
""")

print("Gold layer ML features table created successfully!")

Gold layer ML features table created successfully!


In [None]:
# Generate Gold Layer Analytics from Silver Data

from pyspark.sql.functions import col, date_format, collect_list, map_from_entries, struct, sum as sum_func, avg, count, when

# Read Silver data
silver_df = spark.table("finance.silver.account_transactions_silver")

# Create monthly aggregations
analytics_df = silver_df \
    .withColumn("month_year", date_format(col("transaction_date"), "yyyy-MM")) \
    .groupBy("account_id", "month_year") \
    .agg(
        count("*").alias("total_transactions"),
        sum_func("amount").alias("total_amount"),
        avg("amount").alias("avg_amount"),
        avg("risk_score").alias("avg_risk_score"),
        sum_func(when(col("risk_score") > 60, 1).otherwise(0)).alias("high_risk_transactions")
    )

# Add transaction type and merchant category distributions
type_dist = silver_df \
    .withColumn("month_year", date_format(col("transaction_date"), "yyyy-MM")) \
    .groupBy("account_id", "month_year", "transaction_type") \
    .agg(count("*").alias("count")) \
    .groupBy("account_id", "month_year") \
    .agg(collect_list(struct("transaction_type", "count")).alias("transaction_types_list"))

merchant_dist = silver_df \
    .withColumn("month_year", date_format(col("transaction_date"), "yyyy-MM")) \
    .groupBy("account_id", "month_year", "merchant_category") \
    .agg(count("*").alias("count")) \
    .groupBy("account_id", "month_year") \
    .agg(collect_list(struct("merchant_category", "count")).alias("merchant_categories_list"))

# Join aggregations
gold_analytics = analytics_df \
    .join(type_dist, ["account_id", "month_year"], "left") \
    .join(merchant_dist, ["account_id", "month_year"], "left") \
    .withColumn("transaction_types", map_from_entries(col("transaction_types_list"))) \
    .withColumn("merchant_categories", map_from_entries(col("merchant_categories_list"))) \
    .drop("transaction_types_list", "merchant_categories_list")

print(f"Generated {gold_analytics.count()} monthly analytics records")
print("\nSample Gold Analytics:")
gold_analytics.select("account_id", "month_year", "total_transactions", "total_amount", "avg_risk_score", "high_risk_transactions").show(5)

Generated 52825 monthly analytics records

Sample Gold Analytics:


+-----------+----------+------------------+------------------+------------------+----------------------+
| account_id|month_year|total_transactions|      total_amount|    avg_risk_score|high_risk_transactions|
+-----------+----------+------------------+------------------+------------------+----------------------+
|ACC00001246|   2024-07|                 3|           2394.27|52.333333333333336|                     1|
|ACC00001259|   2024-08|                 2|           7260.81|               4.5|                     0|
|ACC00001309|   2024-06|                 5|          18283.59|              45.8|                     2|
|ACC00001347|   2024-04|                 3|           5311.36|40.333333333333336|                     0|
|ACC00001347|   2024-01|                 4|-950.8899999999999|              59.0|                     3|
+-----------+----------+------------------+------------------+------------------+----------------------+
only showing top 5 rows
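To make the aggregation logic concrete, the sketch below reproduces the monthly roll-up in plain Python for the five `ACC00000001` rows that appear in the Bronze query example later in this notebook. `monthly_rollup` is a hypothetical helper, and `Decimal` is used here so that the sums stay exact.

```python
from collections import Counter, defaultdict
from datetime import datetime
from decimal import Decimal

rows = [
    ("ACC00000001", datetime(2024, 12, 25, 17), "Withdrawal", Decimal("-1275.77"), 25),
    ("ACC00000001", datetime(2024, 12, 3, 8), "ATM", Decimal("-414.05"), 25),
    ("ACC00000001", datetime(2024, 10, 28, 3), "Transfer", Decimal("5309.08"), 50),
    ("ACC00000001", datetime(2024, 10, 15, 14), "Deposit", Decimal("987.68"), 71),
    ("ACC00000001", datetime(2024, 10, 5, 6), "Deposit", Decimal("9044.44"), 53),
]


def monthly_rollup(rows, high_risk_threshold=60):
    """Hypothetical helper: group by (account_id, yyyy-MM) like the Gold cell."""
    groups = defaultdict(list)
    for account_id, ts, tx_type, amount, risk in rows:
        groups[(account_id, ts.strftime("%Y-%m"))].append((tx_type, amount, risk))
    result = {}
    for key, items in groups.items():
        result[key] = {
            "total_transactions": len(items),
            "total_amount": sum((amount for _, amount, _ in items), Decimal("0")),
            "avg_risk_score": sum(risk for _, _, risk in items) / len(items),
            "high_risk_transactions": sum(
                risk > high_risk_threshold for _, _, risk in items),
            # Plain-dict equivalent of the MAP<STRING, BIGINT> column
            "transaction_types": dict(Counter(tx_type for tx_type, _, _ in items)),
        }
    return result


for key, metrics in sorted(monthly_rollup(rows).items()):
    print(key, metrics)
```

For October 2024 this yields three transactions, a total of 15341.20, an average risk score of 58.0, and one high-risk transaction, which agrees with the Gold query output for the same account.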



In [None]:
# Insert analytics into Gold layer

gold_analytics.write.mode("overwrite").saveAsTable("finance.gold.transaction_analytics_gold")

print(f"Successfully inserted {gold_analytics.count()} analytics records into Gold layer")

Successfully inserted 52825 analytics records into Gold layer


In [None]:
# Prepare ML Features for Gold Layer

# Derive a synthetic fraud label from Silver data: risk_score > 60 is treated as fraud
ml_data = silver_df.withColumn("is_fraud", when(col("risk_score") > 60, 1).otherwise(0))

print(f"Prepared {ml_data.count()} records for ML feature engineering")
print("Fraud distribution:")
ml_data.groupBy("is_fraud").count().show()

Prepared 150236 records for ML feature engineering
Fraud distribution:


+--------+-----+
|is_fraud|count|
+--------+-----+
|       1|59577|
|       0|90659|
+--------+-----+
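The class balance follows directly from how the label is built: `risk_score` is drawn uniformly from the 101 integers 0 to 100, and 40 of them (61 to 100) exceed the threshold of 60. A quick arithmetic check against the counts printed above:

```python
# Expected share of positives when risk_score is a uniform integer in 0..100
expected_rate = sum(1 for score in range(0, 101) if score > 60) / 101

# Counts taken from the distribution table above
observed_rate = 59577 / (59577 + 90659)

print(f"expected: {expected_rate:.4f}, observed: {observed_rate:.4f}")
```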



In [None]:
# Insert into Gold layer

# For simplicity, store the raw feature columns and the label as-is.
# Indexing, vector assembly, and scaling happen later in the training pipeline.
ml_gold_df = ml_data.select(
    "account_id", "transaction_date", "transaction_type", "amount", 
    "merchant_category", "location", "risk_score", "month", 
    "day_of_week", "hour", "is_fraud"
)

ml_gold_df.write.mode("overwrite").saveAsTable("finance.gold.fraud_detection_model_gold")

print(f"Successfully inserted {ml_gold_df.count()} ML-ready records into Gold layer")

Successfully inserted 150236 ML-ready records into Gold layer


## Gold Layer: ML Model Training and Evaluation

### Fraud Detection Model

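Now we'll train a Random Forest model using the ML-ready data from the Gold layer to predict fraudulent transactions. Note that `is_fraud` is a synthetic label (`risk_score > 60`) and `risk_score` itself is not among the model's input features. Because the remaining features are generated independently of the risk score, the model has no real signal to learn, and an AUC close to 0.5 is the expected outcome for this demo data.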

In [None]:
# Load Gold layer ML data for training

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Load data from Gold layer
gold_ml_data = spark.table("finance.gold.fraud_detection_model_gold")

print(f"Loaded {gold_ml_data.count()} records from Gold layer for ML training")
gold_ml_data.groupBy("is_fraud").count().show()

Loaded 150236 records from Gold layer for ML training


+--------+-----+
|is_fraud|count|
+--------+-----+
|       1|59577|
|       0|90659|
+--------+-----+



In [None]:
# Feature engineering pipeline

# Create indexers for categorical variables
transaction_type_indexer = StringIndexer(inputCol="transaction_type", outputCol="transaction_type_index")
merchant_category_indexer = StringIndexer(inputCol="merchant_category", outputCol="merchant_category_index")
location_indexer = StringIndexer(inputCol="location", outputCol="location_index")

# Assemble features
assembler = VectorAssembler(
    inputCols=["amount", "month", "day_of_week", "hour", 
               "transaction_type_index", "merchant_category_index", "location_index"],
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Define the Random Forest classifier (training happens in the next cell)
rf = RandomForestClassifier(
    labelCol="is_fraud", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[transaction_type_indexer, merchant_category_indexer, location_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = gold_ml_data.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} records")
print(f"Test set: {test_data.count()} records")

Training set: 120258 records


Test set: 29978 records
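`StringIndexer` turns each categorical string into a numeric index. With its default ordering (`frequencyDesc`), the most frequent label gets index 0. The plain-Python sketch below imitates that behavior for illustration only. `fit_string_index` is a hypothetical helper, and the alphabetical tie-break is an assumption based on the Spark 3.x documentation.

```python
from collections import Counter


def fit_string_index(values):
    """Hypothetical helper: label -> index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumed)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


transaction_types = ["ATM", "Withdrawal", "ATM", "Deposit", "Withdrawal", "ATM"]
mapping = fit_string_index(transaction_types)
encoded = [mapping[value] for value in transaction_types]
print(mapping)
print(encoded)
```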


In [None]:
# Train the fraud detection model

print("Training fraud detection model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="is_fraud", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select("account_id", "amount", "risk_score", "is_fraud", "prediction", "probability").show(10)

Training fraud detection model...


Model AUC: 0.5077


+-----------+--------+----------+--------+----------+--------------------+
| account_id|  amount|risk_score|is_fraud|prediction|         probability|
+-----------+--------+----------+--------+----------+--------------------+
|ACC00001217| 8519.66|        32|       0|       0.0|[0.58610508814702...|
|ACC00001217| -479.98|        59|       0|       0.0|[0.61198712376148...|
|ACC00001217|-1732.56|        20|       0|       0.0|[0.63416382394754...|
|ACC00001217| -1741.8|        17|       0|       0.0|[0.64010742470468...|
|ACC00001217| -139.08|        52|       0|       0.0|[0.60456059677813...|
|ACC00001217| 2030.16|        60|       0|       0.0|[0.60494424132128...|
|ACC00001217| -485.66|        56|       0|       0.0|[0.58573687275789...|
|ACC00001218|-1464.53|        52|       0|       0.0|[0.64161189482873...|
|ACC00001218| 3339.41|        35|       0|       0.0|[0.59683919284296...|
|ACC00001218|  5661.1|        26|       0|       0.0|[0.61682637643500...|
+-----------+--------+----------+--------+----------+--------------------+
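An AUC of about 0.5 is what a classifier achieves when its scores carry no information about the label. In this notebook the label depends only on the random `risk_score`, which is not among the model's input features. The plain-Python simulation below illustrates the effect with scores that are independent of the labels. `auc_from_scores` is a hypothetical helper that uses the rank-sum (Mann-Whitney) formulation of AUC and assumes there are no tied scores.

```python
import random


def auc_from_scores(labels, scores):
    """Hypothetical helper: AUC via the rank-sum formula (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Sum of the 1-based ranks of the positive examples
    positive_rank_sum = sum(
        rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
# Label rule from the notebook: fraud when a uniform 0..100 risk score exceeds 60
labels = [1 if rng.randint(0, 100) > 60 else 0 for _ in range(5000)]
# Scores drawn independently of the labels, standing in for an uninformative model
uninformative = [rng.random() for _ in labels]
# Scores that separate the classes perfectly (label plus a tiny jitter)
informative = [y + rng.random() * 1e-6 for y in labels]

auc_uninformative = auc_from_scores(labels, uninformative)
auc_informative = auc_from_scores(labels, informative)
print(f"uninformative: {auc_uninformative:.3f}, informative: {auc_informative:.3f}")
```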

In [None]:
# Model evaluation and business insights

# Calculate confusion matrix
confusion_matrix = predictions.groupBy("is_fraud", "prediction").count()
confusion_matrix.show()

# Feature importance (names are listed in the same order as the VectorAssembler inputs)
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = ["amount", "month", "day_of_week", "hour", "transaction_type", "merchant_category", "location"]

print("\n=== Feature Importance ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Summarize how many transactions the model flagged and their total value
fraud_predictions = predictions.filter("prediction = 1")
high_risk_transactions = fraud_predictions.count()
total_flagged_amount = fraud_predictions.agg(F.sum(F.abs("amount"))).collect()[0][0] or 0

total_test_amount = test_data.agg(F.sum(F.abs("amount"))).collect()[0][0] or 0

print(f"Total test transactions: {test_data.count()}")
print(f"Transactions flagged as high-risk: {high_risk_transactions}")
print(f"Percentage flagged: {(high_risk_transactions/test_data.count())*100:.1f}%")
print(f"Total amount of flagged transactions: ${total_flagged_amount:,.2f}")

# Accuracy metrics (each count is computed once to avoid repeated Spark jobs)
total = predictions.count()
true_pos = predictions.filter("prediction = 1 AND is_fraud = 1").count()
pred_pos = predictions.filter("prediction = 1").count()
actual_pos = predictions.filter("is_fraud = 1").count()
accuracy = predictions.filter("is_fraud = prediction").count() / total
precision = true_pos / pred_pos if pred_pos > 0 else 0
recall = true_pos / actual_pos if actual_pos > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

+--------+----------+-----+
|is_fraud|prediction|count|
+--------+----------+-----+
|       1|       0.0|11954|
|       0|       0.0|18020|
|       1|       1.0|    1|
|       0|       1.0|    3|
+--------+----------+-----+


=== Feature Importance ===
amount: 0.2039
month: 0.1645
day_of_week: 0.1137
hour: 0.2013
transaction_type: 0.0601
merchant_category: 0.1366
location: 0.1200

=== Business Impact Analysis ===


Total test transactions: 29978
Transactions flagged as high-risk: 4


Percentage flagged: 0.0%
Total amount of flagged transactions: $7,527.98



Model Performance:
Accuracy: 0.6011
Precision: 0.2500
Recall: 0.0001
AUC: 0.5077
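The metrics above can be recomputed directly from the four confusion-matrix counts, which also makes it easy to compare the model against a trivial baseline that always predicts "not fraud". A sketch in plain Python follows; `classification_metrics` is a hypothetical helper, and the counts are copied from the output above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Hypothetical helper: accuracy, precision, and recall from raw counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Accuracy of always predicting the negative class
        "majority_baseline": (tn + fp) / total,
    }


metrics = classification_metrics(tp=1, fp=3, fn=11954, tn=18020)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

With these counts the model's accuracy (about 0.6011) is marginally below the always-negative baseline (about 0.6012), and recall is roughly 0.00008, which the four-decimal output rounds to 0.0001.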


## Query Examples Across Medallion Layers

### Bronze Layer Queries
Raw data access for audit and debugging

In [None]:
# Bronze Layer: Raw data queries

print("=== Bronze Layer: Raw Transaction Data ===")
bronze_sample = spark.sql("""
SELECT account_id, transaction_date, transaction_type, amount, risk_score
FROM finance.bronze.account_transactions_bronze
WHERE account_id = 'ACC00000001'
ORDER BY transaction_date DESC
LIMIT 5
""")
bronze_sample.show()

=== Bronze Layer: Raw Transaction Data ===


+-----------+-------------------+----------------+--------+----------+
| account_id|   transaction_date|transaction_type|  amount|risk_score|
+-----------+-------------------+----------------+--------+----------+
|ACC00000001|2024-12-25 17:00:00|      Withdrawal|-1275.77|        25|
|ACC00000001|2024-12-03 08:00:00|             ATM| -414.05|        25|
|ACC00000001|2024-10-28 03:00:00|        Transfer| 5309.08|        50|
|ACC00000001|2024-10-15 14:00:00|         Deposit|  987.68|        71|
|ACC00000001|2024-10-05 06:00:00|         Deposit| 9044.44|        53|
+-----------+-------------------+----------------+--------+----------+



### Silver Layer Queries
Cleaned and validated data for analysis

In [None]:
# Silver Layer: Cleaned data queries

print("=== Silver Layer: Validated Transaction Data ===")
silver_sample = spark.sql("""
SELECT account_id, transaction_date, transaction_type, amount, risk_score, is_valid, data_quality_score
FROM finance.silver.account_transactions_silver
WHERE account_id = 'ACC00000001' AND is_valid = true
ORDER BY transaction_date DESC
LIMIT 5
""")
silver_sample.show()

=== Silver Layer: Validated Transaction Data ===


+-----------+-------------------+----------------+--------+----------+--------+------------------+
| account_id|   transaction_date|transaction_type|  amount|risk_score|is_valid|data_quality_score|
+-----------+-------------------+----------------+--------+----------+--------+------------------+
|ACC00000001|2024-12-25 17:00:00|      Withdrawal|-1275.77|        25|    true|             0.825|
|ACC00000001|2024-12-03 08:00:00|             ATM| -414.05|        25|    true|             0.825|
|ACC00000001|2024-10-28 03:00:00|        Transfer| 5309.08|        50|    true|0.6499999999999999|
|ACC00000001|2024-10-15 14:00:00|         Deposit|  987.68|        71|    true|             0.503|
|ACC00000001|2024-10-05 06:00:00|         Deposit| 9044.44|        53|    true|             0.629|
+-----------+-------------------+----------------+--------+----------+--------+------------------+



### Gold Layer Queries
Aggregated metrics for reporting

In [None]:
# Gold Layer: Analytics queries

print("=== Gold Layer: Account Analytics ===")
gold_sample = spark.sql("""
SELECT account_id, month_year, total_transactions, total_amount, avg_risk_score, high_risk_transactions
FROM finance.gold.transaction_analytics_gold
WHERE account_id = 'ACC00000001'
ORDER BY month_year DESC
LIMIT 3
""")
gold_sample.show()

=== Gold Layer: Account Analytics ===


+-----------+----------+------------------+------------+--------------+----------------------+
| account_id|month_year|total_transactions|total_amount|avg_risk_score|high_risk_transactions|
+-----------+----------+------------------+------------+--------------+----------------------+
|ACC00000001|   2024-12|                 2|    -1689.82|          25.0|                     0|
|ACC00000001|   2024-10|                 3|     15341.2|          58.0|                     1|
|ACC00000001|   2024-08|                 2|    -3408.02|          51.0|                     1|
+-----------+----------+------------------+------------+--------------+----------------------+



## Key Takeaways: Medallion Architecture in AIDP

### What We Demonstrated

1. **Bronze Layer**: Raw data ingestion with Delta liquid clustering
2. **Silver Layer**: Data validation, cleaning, and enrichment
3. **Gold Layer**: Analytics aggregation and ML model training
4. **End-to-End Pipeline**: Complete medallion architecture in a single notebook

### AIDP Advantages

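- **Unified Platform**: All three layers and the model training run in one notebook on the same Spark session
- **Governance**: Each layer lives in its own schema under the `finance` catalog
- **Performance**: Delta tables are clustered on the columns used for filtering
- **ML Integration**: Spark ML pipelines train directly on Gold tables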

### Business Benefits

1. **Data Quality**: Progressive improvement through layers
2. **Analytics Ready**: Business-focused aggregations
3. **ML Automation**: Fraud detection and risk assessment
4. **Scalability**: Handles large financial datasets
5. **Governance**: Audit trails and data lineage

### Best Practices

1. **Layer Isolation**: Keep raw data separate from processed data
2. **Incremental Processing**: Build upon validated foundations
3. **Business Alignment**: Gold layer matches business needs
4. **Performance Optimization**: Use clustering strategically
5. **ML Integration**: Include predictive analytics in gold layer

This notebook demonstrates how Oracle AI Data Platform enables sophisticated financial services analytics with proper data architecture and governance.