# Master Data Management (MDM) - Spanner Native Streaming Processing

This notebook demonstrates a complete end-to-end streaming Master Data Management pipeline using Spanner's native capabilities:

- **Golden Record Bootstrap**: Load existing golden records from BigQuery batch processing
- **Spanner Infrastructure**: Set up minimal Spanner instance for real-time processing
- **Data Migration**: Transfer golden records to Spanner for real-time matching
- **Streaming Data Generation**: Create 100 new customer records for processing
- **4-Way Real-time Matching**: Exact, fuzzy, vector, and business rules matching
- **Synchronous Processing**: Sub-second processing with immediate feedback
- **Golden Record Updates**: Apply survivorship rules and update master entities
- **Live Performance Tracking**: Real-time metrics and Spanner transaction logging

## Architecture Overview

This implementation demonstrates **Traditional 4-Way Matching** for streaming MDM:

### 🎯 **Traditional 4-Way Streaming Flow**
1. **BigQuery Golden Records** → **Spanner Migration** (with embeddings)
2. **New Streaming Record** → **4-Way Matching Process**
3. **Exact Matching** → **Email/Phone Index Lookups**
4. **Fuzzy Matching** → **Name/Address Similarity**
5. **Vector Matching** → **Existing Embeddings Search**
6. **Business Rules** → **Company/Location Logic**
7. **Score Combination** → **Weighted Decision**
8. **Golden Record Updates** → **Spanner Transaction Logging**
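
Step 7 can be pictured as a weighted sum of the four per-strategy scores. The sketch below assumes the rounded weights that the processor prints in Section 7 (33/28/22/17 percent) and a plain linear combination. The actual logic inside `StreamingMDMProcessor` is not shown in this notebook, so `ASSUMED_WEIGHTS` and `combine_scores` are illustrative names only.

```python
# Illustrative only: weights are the rounded percentages printed in Section 7.
ASSUMED_WEIGHTS = {"exact": 0.33, "fuzzy": 0.28, "vector": 0.22, "business": 0.17}


def combine_scores(scores, weights=ASSUMED_WEIGHTS):
    """Return the weighted sum of per-strategy scores (each in [0, 1]).

    Strategies missing from `scores` contribute 0.0.
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())


# Made-up per-strategy scores for a single candidate pair.
example = {"exact": 1.0, "fuzzy": 0.5, "vector": 0.0, "business": 0.0}
print(f"combined score: {combine_scores(example):.2f}")
```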


### 🚧 **Current Vector Matching Limitation**

**Important**: This demo has a vector matching gap that affects the 4-way strategy:

- ✅ **Exact Matching**: Fully operational (email/phone indexes)
- ✅ **Fuzzy Matching**: Fully operational (name/address similarity)  
- 🚧 **Vector Matching**: **Architecturally supported but operationally limited**
- ✅ **Business Rules**: Fully operational (company/location logic)

**Root Cause**: New streaming records arrive without embeddings, and the system doesn't generate them in real-time.

**Current Behavior**: Vector matching always returns empty (contributes 0.0 to combined score).

**Roadmap**: Full 4-way matching will be enabled when Vertex AI integration is added for real-time embedding generation (+200-500ms latency cost).
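
To see what a permanently zero vector score could mean for the decision, the arithmetic below assumes a plain weighted sum with the rounded weights reported in Section 7 and the 0.8 auto-merge threshold from the configuration. Whether the processor renormalizes the remaining weights is not shown in this notebook, so treat the result as a what-if rather than a description of its behavior.

```python
import math

# Assumptions: rounded weights from the Section 7 output, no renormalization.
weights = {"exact": 0.33, "fuzzy": 0.28, "vector": 0.22, "business": 0.17}
AUTO_MERGE_THRESHOLD = 0.8

# Best case for a streaming record: every strategy except vector scores 1.0.
best_case = {"exact": 1.0, "fuzzy": 1.0, "vector": 0.0, "business": 1.0}
ceiling = math.fsum(weights[name] * best_case[name] for name in weights)

print(f"maximum combined score without vector matching: {ceiling:.2f}")
print(f"can reach auto-merge threshold: {ceiling >= AUTO_MERGE_THRESHOLD}")
```

Under these assumptions the ceiling sits just below the auto-merge threshold, so it is worth checking how the real processor handles the missing strategy before relying on automatic merges.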


### 🎯 **Consistent Processing Approach**
- **All Records**: Run all 4 strategies for comprehensive matching
- **Vector Search**: Uses existing embeddings only (no generation)
- **Performance**: Targets sub-400ms processing per record (the run recorded in Section 9 averaged about 1.7 s, so this target is not yet met)
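
As a rough picture of "run all four strategies for every record", the sketch below scores one streaming record against one golden record. Everything here is a stand-in: the function names, the `difflib` similarity metric, the company rule, and the `master_company` and `embedding` fields are assumptions for illustration, not the logic of `StreamingMDMProcessor`. The vector strategy returns 0.0 whenever an embedding is missing, which mirrors the limitation described above.

```python
import math
import re
from difflib import SequenceMatcher


def normalize_phone(phone):
    """Keep digits only, so differently punctuated numbers compare equal."""
    return re.sub(r"\D", "", str(phone or ""))


def exact_score(record, golden):
    """1.0 if the email or the normalized phone is identical, else 0.0."""
    email = (record.get("email") or "").strip().lower()
    if email and email == (golden.get("master_email") or "").strip().lower():
        return 1.0
    phone = normalize_phone(record.get("phone"))
    if phone and phone == normalize_phone(golden.get("master_phone")):
        return 1.0
    return 0.0


def fuzzy_score(record, golden):
    """Name similarity in [0, 1]; difflib's ratio is a stand-in metric."""
    a = (record.get("full_name") or "").upper()
    b = (golden.get("master_name") or "").upper()
    return SequenceMatcher(None, a, b).ratio() if a and b else 0.0


def vector_score(record, golden):
    """Cosine similarity, or 0.0 when either side has no embedding."""
    a, b = record.get("embedding"), golden.get("embedding")
    if not a or not b:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def business_score(record, golden):
    """1.0 if company names match case-insensitively (hypothetical rule)."""
    a = (record.get("company") or "").strip().lower()
    b = (golden.get("master_company") or "").strip().lower()
    return 1.0 if a and a == b else 0.0


def run_all_strategies(record, golden):
    """Run every strategy for every record and return per-strategy scores."""
    return {
        "exact": exact_score(record, golden),
        "fuzzy": fuzzy_score(record, golden),
        "vector": vector_score(record, golden),
        "business": business_score(record, golden),
    }


# Made-up records; the streaming record has no "embedding" key.
streaming = {"full_name": "Jon Smith", "email": "jon@example.com",
             "phone": "(555) 010-1234", "company": "Acme"}
golden = {"master_name": "JOHN SMITH", "master_email": "john.smith@example.com",
          "master_phone": "5550101234", "master_company": "ACME",
          "embedding": [0.1, 0.2, 0.3]}
print(run_all_strategies(streaming, golden))
```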

## 1. Setup and Configuration

In [1]:
# Import required libraries
import os
import random
import sys
import time
import warnings
from datetime import datetime

# Extend the module search path BEFORE importing the local project modules,
# so that packages living in the parent directory can be resolved.
sys.path.append('..')

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots

from batch_mdm_gcp.bigquery_utils import BigQueryMDMHelper
from batch_mdm_gcp.data_generator import MDMDataGenerator
from spanner_utils import SpannerMDMHelper
from streaming_processor import StreamingMDMProcessor

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [2]:
# =============================================================================
# CONFIGURATION CONSTANTS - Centralized Settings
# =============================================================================

# GCP Configuration
PROJECT_ID = "your-project-id"  # Replace with your GCP project ID
DATASET_ID = "mdm_demo"  # BigQuery dataset (from batch processing)
INSTANCE_ID = "mdm-streaming-demo"  # Spanner instance
DATABASE_ID = "mdm_streaming"  # Spanner database
LOCATION = "US"

# Processing Configuration
NUM_STREAMING_RECORDS = 100
PROCESSING_DELAY_SEC = 0.1  # Minimum gap between records (at most 10 records per second)
TARGET_LATENCY_MS = 400  # Target processing time per record

# Decision Thresholds
AUTO_MERGE_THRESHOLD = 0.8
CREATE_NEW_THRESHOLD = 0.6

print("📋 Configuration loaded:")
print(f"  Target records: {NUM_STREAMING_RECORDS}")
print(f"  Target latency: <{TARGET_LATENCY_MS}ms")
print(f"  Auto-merge threshold: ≥{AUTO_MERGE_THRESHOLD}")
print(f"  Create new threshold: <{CREATE_NEW_THRESHOLD}")

📋 Configuration loaded:
  Target records: 100
  Target latency: <400ms
  Auto-merge threshold: ≥0.8
  Create new threshold: <0.6
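
The two thresholds split the combined score into three bands. The sketch below assumes the middle band maps to `HUMAN_REVIEW`, which is consistent with the processing log in Section 8 (0.66 leads to `HUMAN_REVIEW`, 0.33 to `CREATE_NEW`). The `decide` function is an illustrative name, not part of `StreamingMDMProcessor`.

```python
AUTO_MERGE_THRESHOLD = 0.8
CREATE_NEW_THRESHOLD = 0.6


def decide(combined_score,
           auto_merge=AUTO_MERGE_THRESHOLD,
           create_new=CREATE_NEW_THRESHOLD):
    """Map a combined score in [0, 1] to an action name."""
    if combined_score >= auto_merge:
        return "AUTO_MERGE"
    if combined_score < create_new:
        return "CREATE_NEW"
    # Scores between the two thresholds are left for a person to resolve.
    return "HUMAN_REVIEW"


for score in (0.33, 0.66, 0.85):
    print(f"{score:.2f} -> {decide(score)}")
```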


In [3]:
# Initialize helpers
try:
    # BigQuery helper (for loading golden records)
    bq_helper = BigQueryMDMHelper(PROJECT_ID, DATASET_ID)
    print(f"✅ Connected to BigQuery project: {PROJECT_ID}")
    print(f"📊 BigQuery dataset: {bq_helper.dataset_ref}")

    # Spanner helper (for streaming processing)
    spanner_helper = SpannerMDMHelper(PROJECT_ID, INSTANCE_ID, DATABASE_ID)
    print(f"✅ Connected to Spanner project: {PROJECT_ID}")
    print(f"🗃️ Spanner instance: {INSTANCE_ID}")
    print(f"🗃️ Spanner database: {DATABASE_ID}")

except Exception as e:
    print(f"❌ Error connecting: {e}")
    print("Please ensure you have:")
    print("1. Set up Google Cloud authentication")
    print("2. Enabled BigQuery and Spanner APIs")
    print("3. Updated PROJECT_ID above")

✅ Connected to BigQuery project: johanesa-playground-326616
📊 BigQuery dataset: johanesa-playground-326616.mdm_demo
✅ Connected to Spanner project: johanesa-playground-326616
🗃️ Spanner instance: mdm-streaming-demo
🗃️ Spanner database: mdm_streaming


## 2. Helper Functions

In [4]:
def update_statistics(result, action_counts, confidence_counts):
    """Update running statistics with current result."""
    action = result.get('action', 'ERROR')
    confidence = result.get('confidence', 'LOW')

    action_counts[action] = action_counts.get(action, 0) + 1
    confidence_counts[confidence] = confidence_counts.get(confidence, 0) + 1


print("✅ Helper functions defined")

✅ Helper functions defined
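
For comparison, the same tallies can be produced with `collections.Counter`. The sample results below are made up; the fallbacks (`ERROR` for a missing action, `LOW` for a missing confidence) follow `update_statistics` above.

```python
from collections import Counter

# Made-up results shaped like the dictionaries passed to update_statistics.
results = [
    {"action": "CREATE_NEW", "confidence": "LOW"},
    {"action": "HUMAN_REVIEW", "confidence": "MEDIUM"},
    {"action": "CREATE_NEW", "confidence": "LOW"},
    {},  # a result with no fields falls back to ERROR / LOW
]

action_counts = Counter(r.get("action", "ERROR") for r in results)
confidence_counts = Counter(r.get("confidence", "LOW") for r in results)

print(dict(action_counts))
print(dict(confidence_counts))
```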


## 3. Spanner Infrastructure Setup

Create minimal Spanner infrastructure for the streaming demo.

In [5]:
print("🔄 Setting up Spanner infrastructure...")
print("⚠️ Remember to delete the instance after demo to avoid charges")
print()

try:
    # Create Spanner instance (minimal configuration)
    spanner_helper.create_instance_if_needed(processing_units=100)

    # Create database
    spanner_helper.create_database_if_needed()

    # Create schema (aligned with BigQuery golden_records)
    spanner_helper.create_or_replace_schema()

    print("\n✅ Spanner infrastructure ready!")
    print(f"📊 Instance: {INSTANCE_ID} (100 processing units)")
    print(f"🗃️ Database: {DATABASE_ID}")
    print(f"📋 Schema: golden_entities, match_results tables created")

except Exception as e:
    print(f"❌ Error setting up Spanner infrastructure: {e}")
    print("Please check your GCP permissions and try again.")

🔄 Setting up Spanner infrastructure...
⚠️ Remember to delete the instance after demo to avoid charges

  ✅ Instance mdm-streaming-demo already exists
  ✅ Database mdm_streaming already exists
  🔄 Checking schema status...


Created multiplexed session.


  ✅ Schema exists and ready (fast path)
  ✅ Vector index already exists

✅ Spanner infrastructure ready!
📊 Instance: mdm-streaming-demo (100 processing units)
🗃️ Database: mdm_streaming
📋 Schema: golden_entities, match_results tables created


## 4. Load Golden Records from BigQuery

Bootstrap the streaming system with existing golden records from batch processing.

In [6]:
print("🔄 Loading golden records from BigQuery batch processing...")

try:
    # Load golden records from BigQuery
    golden_count = spanner_helper.load_golden_records_from_bigquery(bq_helper)

    if golden_count > 0:
        print(
            f"\n✅ Successfully migrated {golden_count} golden records to Spanner")

        # Verify the migration
        current_count = spanner_helper.get_table_count("golden_entities")
        print(f"📊 Current golden entities in Spanner: {current_count}")

        # Show sample records
        sample_query = """
        SELECT entity_id, master_name, master_email, master_phone,
               source_record_count, processing_path
        FROM golden_entities
        LIMIT 5
        """

        sample_df = spanner_helper.execute_sql(sample_query)
        if not sample_df.empty:
            print("\n🔍 Sample Golden Records in Spanner:")
            sample_df.columns = ['entity_id', 'master_name',
                                 'master_email', 'master_phone', 'source_count', 'path']
            display(sample_df)
    else:
        print("⚠️ No golden records found in BigQuery")
        print("💡 Run the batch processing notebook first to create golden records")

except Exception as e:
    print(f"❌ Error loading golden records: {e}")
    print("💡 Make sure you've run the batch processing notebook first")

🔄 Loading golden records from BigQuery batch processing...
  🔄 Loading golden records from BigQuery...
  🗑️ Cleared table: golden_entities
  ✅ Loaded 100 golden records from BigQuery

✅ Successfully migrated 100 golden records to Spanner
📊 Current golden entities in Spanner: 100

🔍 Sample Golden Records in Spanner:


| | entity_id | master_name | master_email | master_phone | source_count | path |
|---|---|---|---|---|---|---|
| 0 | 0284c4456f5ecfa6b768f23eb98d6f0f694b | JAMES HORTON | jross@example.net | 1901463 | 2 | batch_migrated |
| 1 | 0337d5978aa00c9cc65779d8e410d858387d | LAUREN BYRD | davidtodd@outlook.com | 433620616 | 3 | batch_migrated |
| 2 | 064acc1718b073764e63539db2e41fe84351 | SHAUNXJONES | patrickdarin@example.com | 18124382 | 4 | batch_migrated |
| 3 | 07a2038841964b97649af78fea800ebcb29c | DAVID WALKER | andrew83@outlook.com | 5016031051 | 3 | batch_migrated |
| 4 | 0894b8e347814bdbb43518d52da1f72e0cd4 | ANTHONY VAUGHAN | leescott@example.com | 563768970 | 3 | batch_migrated |


## 5. Embedding Sync Pipeline

Sync embeddings from BigQuery to Spanner for vector matching functionality.

In [7]:
print("🔄 Syncing embeddings from BigQuery to Spanner...")
print()

try:
    # Check if embeddings exist in BigQuery
    embeddings_check_query = f"""
    SELECT COUNT(*) as total_records,
           COUNT(ml_generate_embedding_result) as records_with_embeddings
    FROM `{bq_helper.dataset_ref}.customers_with_embeddings`
    WHERE ml_generate_embedding_result IS NOT NULL
    """

    embeddings_stats = bq_helper.execute_query(embeddings_check_query)

    if not embeddings_stats.empty and embeddings_stats.iloc[0]['records_with_embeddings'] > 0:
        total_embeddings = embeddings_stats.iloc[0]['records_with_embeddings']
        print(
            f"✅ Found {total_embeddings} records with embeddings in BigQuery")

        # Sync embeddings to Spanner
        synced_count = spanner_helper.sync_embeddings_from_bigquery(bq_helper)

        if synced_count > 0:
            print(
                f"✅ Successfully synced {synced_count} embeddings to Spanner")

            # Verify embedding sync
            embedding_count_query = """
            SELECT COUNT(*) as embedding_count
            FROM entity_embeddings
            WHERE embedding IS NOT NULL
            """

            embedding_count_df = spanner_helper.execute_sql(
                embedding_count_query)
            if not embedding_count_df.empty:
                spanner_embedding_count = embedding_count_df.iloc[0]['col_0']
                print(f"📊 Embeddings in Spanner: {spanner_embedding_count}")

                # Show sample embeddings
                sample_embeddings_query = """
                SELECT entity_id, ARRAY_LENGTH(embedding) as vector_length, created_at
                FROM entity_embeddings
                WHERE embedding IS NOT NULL
                LIMIT 5
                """

                sample_embeddings_df = spanner_helper.execute_sql(
                    sample_embeddings_query)
                if not sample_embeddings_df.empty:
                    sample_embeddings_df.columns = [
                        'entity_id', 'vector_length', 'created_at']
                    print("\n🔍 Sample Embeddings in Spanner:")
                    display(sample_embeddings_df)

                print("\n✅ Embeddings synced from BigQuery")
                print("  ✅ Native COSINE_DISTANCE search enabled")
                print("  ✅ Fast-path optimization for new entities")
        else:
            print("⚠️ No embeddings were synced to Spanner")
    else:
        print("⚠️ No embeddings found in BigQuery")
        print("💡 Run the batch processing notebook first to generate embeddings")
except Exception as e:
    print(f"❌ Error syncing embeddings: {e}")

🔄 Syncing embeddings from BigQuery to Spanner...

✅ Found 284 records with embeddings in BigQuery
  🔄 Syncing embeddings from BigQuery ML...
  🗑️ Cleared table: entity_embeddings
  ✅ Synced 284 embeddings from BigQuery ML
✅ Successfully synced 284 embeddings to Spanner
📊 Embeddings in Spanner: 284

🔍 Sample Embeddings in Spanner:


| | entity_id | vector_length | created_at |
|---|---|---|---|
| 0 | 000c4e82-861b-4d19-b3aa-9760dcf8744e | 3072 | 2025-09-24 11:07:15.199316+00:00 |
| 1 | 0080fd5d-8565-4b3b-b411-b42407bbbb8e | 3072 | 2025-09-24 11:07:15.199316+00:00 |
| 2 | 01a40d30-24d1-4da1-93b8-1f77f6aa71c6 | 3072 | 2025-09-24 11:07:15.199316+00:00 |
| 3 | 030867cc-0ec7-4a74-9665-cff3b6f2fcd4 | 3072 | 2025-09-24 11:07:15.199316+00:00 |
| 4 | 03d3eb89-e2a4-4267-842c-c22ab8540fcc | 3072 | 2025-09-24 11:07:15.199316+00:00 |



✅ Embeddings synced from BigQuery
  ✅ Native COSINE_DISTANCE search enabled
  ✅ Fast-path optimization for new entities
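
The "native COSINE_DISTANCE search" mentioned in the output ranks stored embeddings by cosine distance, that is, one minus cosine similarity. The pure-Python sketch below shows the quantity being computed on toy 3-dimensional vectors (the synced embeddings have 3072 dimensions). `cosine_distance` and `nearest` are illustrative helpers, not part of `SpannerMDMHelper`, and a real lookup would run inside Spanner rather than in Python.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, candidates, k=3):
    """Return the k closest (entity_id, distance) pairs, closest first.

    `candidates` maps an entity id to its embedding vector.
    """
    ranked = sorted(
        ((entity_id, cosine_distance(query, vector))
         for entity_id, vector in candidates.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


# Toy embeddings keyed by made-up entity ids.
toy_embeddings = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 1.0, 0.0]}
for entity_id, distance in nearest([1.0, 0.0, 0.0], toy_embeddings, k=2):
    print(f"{entity_id}: {distance:.4f}")
```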


## 6. Generate New Streaming Data

Create new customer records to simulate streaming data.

In [8]:
print(
    f"🔄 Generating {NUM_STREAMING_RECORDS} streaming records (80% new, 20% existing with variations)...")

try:
    # Calculate 80-20 split for realistic simulation
    new_records_count = int(NUM_STREAMING_RECORDS * 0.8)  # 80 new records
    existing_records_count = NUM_STREAMING_RECORDS - \
        new_records_count  # 20 existing records

    print(
        f"  📊 Split: {new_records_count} new + {existing_records_count} existing with variations")

    # Part 1: Generate 80% new records (CREATE_NEW path)
    print(f"  🆕 Generating {new_records_count} completely new records...")
    generator = MDMDataGenerator(num_unique_customers=new_records_count)
    new_streaming_datasets = generator.generate_all_datasets()

    # Combine new streaming records
    new_streaming_records = []
    for source, df in new_streaming_datasets.items():
        for _, record in df.iterrows():
            record_dict = record.to_dict()
            record_dict['record_type'] = 'new'  # Tag for tracking
            new_streaming_records.append(record_dict)

    # Part 2: Query 20% existing records from BigQuery (AUTO_MERGE path)
    print(
        f"  🔄 Querying {existing_records_count} existing records from BigQuery...")
    existing_records_query = f"""
    SELECT record_id, full_name, email, phone, address, city, state, company, source_system
    FROM `{bq_helper.dataset_ref}.customers_with_embeddings`
    WHERE ml_generate_embedding_result IS NOT NULL
    ORDER BY RAND()
    LIMIT {existing_records_count}
    """

    try:
        existing_df = bq_helper.execute_query(existing_records_query)

        if not existing_df.empty:
            print(
                f"  ✅ Retrieved {len(existing_df)} existing records from BigQuery")

            # Part 3: Add realistic variations to simulate data drift
            print(f"  🔀 Adding realistic variations to simulate data drift...")
            varied_existing_records = StreamingMDMProcessor.add_realistic_variations(
                existing_df)

            # Tag as existing with variations
            for record in varied_existing_records:
                record['record_type'] = 'existing_varied'  # Tag for tracking

            print(
                f"  ✅ Created {len(varied_existing_records)} varied existing records")

        else:
            print("  ⚠️ No existing records found - using all new records instead")
            varied_existing_records = []

    except Exception as e:
        print(f"  ⚠️ Could not query existing records: {e}")
        print("  💡 Using all new records instead")
        varied_existing_records = []

    # Part 4: Combine and shuffle for realistic streaming order
    print(f"  🔀 Combining and shuffling records for realistic streaming...")
    all_streaming_records = new_streaming_records + varied_existing_records
    random.shuffle(all_streaming_records)

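    # Take exactly NUM_STREAMING_RECORDS.
    # Note: the generator can emit more than new_records_count rows (a customer
    # can appear in several source systems), so truncating after the shuffle
    # may skew the intended 80/20 split.
    streaming_records = all_streaming_records[:NUM_STREAMING_RECORDS]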

    print(f"\n📈 Realistic Streaming Data Summary:")
    print(f"  Total streaming records: {len(streaming_records)}")

    # Show record type distribution
    type_counts = {}
    source_counts = {}
    for record in streaming_records:
        record_type = record.get('record_type', 'unknown')
        source = record.get('source_system', 'unknown')
        type_counts[record_type] = type_counts.get(record_type, 0) + 1
        source_counts[source] = source_counts.get(source, 0) + 1

    print(f"\n🎯 Record Type Distribution (for matching simulation):")
    for record_type, count in type_counts.items():
        percentage = (count / len(streaming_records)) * 100
        expected_action = "CREATE_NEW" if record_type == 'new' else "AUTO_MERGE (likely)"
        print(f"  {record_type}: {count} ({percentage:.1f}%) → {expected_action}")

    print(f"\n📊 Source System Distribution:")
    for source, count in source_counts.items():
        print(f"  {source.upper()}: {count} records")

    print(f"\n🔍 Sample Records (showing variation types):")
    sample_records = []
    for i, record in enumerate(streaming_records[:5]):
        sample_record = {
            'record_id': record['record_id'],
            'full_name': record['full_name'],
            'email': record['email'],
            'phone': record.get('phone', ''),
            'source_system': record['source_system'],
            'type': record.get('record_type', 'unknown')
        }
        sample_records.append(sample_record)

    sample_df = pd.DataFrame(sample_records)
    display(sample_df)

    print(f"\n✅ Realistic 80-20 streaming data ready!")
    print(f"🎯 Expected outcomes:")
    print(
        f"  • New records ({new_records_count}) → CREATE_NEW")
    print(
        f"  • Varied existing ({len(varied_existing_records)}) → AUTO_MERGE (full 4-way matching)")
    print(f"  • This will demonstrate both fast-path and full matching scenarios!")

except Exception as e:
    print(f"❌ Error generating realistic streaming data: {e}")
    print("💡 Falling back to simple generation...")

    # Fallback to simple generation
    generator = MDMDataGenerator(num_unique_customers=NUM_STREAMING_RECORDS)
    streaming_datasets = generator.generate_all_datasets()

    all_streaming_records = []
    for source, df in streaming_datasets.items():
        for _, record in df.iterrows():
            all_streaming_records.append(record.to_dict())

    random.shuffle(all_streaming_records)
    streaming_records = all_streaming_records[:NUM_STREAMING_RECORDS]
    print(f"  📈 Fallback: {len(streaming_records)} new records generated")

🔄 Generating 100 streaming records (80% new, 20% existing with variations)...
  📊 Split: 80 new + 20 existing with variations
  🆕 Generating 80 completely new records...
  🔄 Querying 20 existing records from BigQuery...
  ✅ Retrieved 20 existing records from BigQuery
  🔀 Adding realistic variations to simulate data drift...
  ✅ Created 20 varied existing records
  🔀 Combining and shuffling records for realistic streaming...

📈 Realistic Streaming Data Summary:
  Total streaming records: 100

🎯 Record Type Distribution (for matching simulation):
  new: 89 (89.0%) → CREATE_NEW
  existing_varied: 11 (11.0%) → AUTO_MERGE (likely)

📊 Source System Distribution:
  ECOMMERCE: 31 records
  CRM: 31 records
  STREAMING_VARIATION: 11 records
  ERP: 27 records

🔍 Sample Records (showing variation types):


| | record_id | full_name | email | phone | source_system | type |
|---|---|---|---|---|---|---|
| 0 | a0e16d87-2be8-4936-8134-64c6f763b1b3 | Christopher Shaffer | edwardsnicole@yahoo.com | (822)918-909 | ecommerce | new |
| 1 | 5211d960-5479-4541-a023-b1b490032a21 | Phillip Anderson | maryavila@outlook.com | | crm | new |
| 2 | c7548c7c-1b98-445d-af5c-9bae17f80c20 | Raymond Ramirez | rasmussenjoshua@yahoo.com | 768.750.1429 | ecommerce | new |
| 3 | 8a5c64d4-021b-48b7-9422-f786d1595a8a | Julie Thompson | opatel@outlook.com | +17518606 | ecommerce | new |
| 4 | d13dd894-c3cb-4096-93fe-df3d85a7790a | Stephen Brown | tchambers@example.net | 7643039106 | ecommerce | new |



✅ Realistic 80-20 streaming data ready!
🎯 Expected outcomes:
  • New records (80) → CREATE_NEW
  • Varied existing (20) → AUTO_MERGE (full 4-way matching)
  • This will demonstrate both fast-path and full matching scenarios!
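
In the recorded run the final mix is 89 new to 11 existing rather than 80/20, apparently because the generator emits more than 80 rows across the source systems before the combined list is shuffled and truncated. One way to pin the split is to sample each group to its target size first. The sketch below uses a hypothetical `build_stream` helper and made-up records; it is not part of the project code.

```python
import random


def build_stream(new_records, existing_records, total, new_fraction=0.8, seed=None):
    """Return `total` shuffled records with a fixed new/existing split.

    If too few existing records are available, new records fill the gap.
    """
    rng = random.Random(seed)
    want_existing = min(len(existing_records), total - int(total * new_fraction))
    want_new = min(len(new_records), total - want_existing)
    stream = (rng.sample(new_records, want_new)
              + rng.sample(existing_records, want_existing))
    rng.shuffle(stream)
    return stream


# Made-up inputs: more new rows than needed, exactly 20 varied existing rows.
new_pool = [{"record_id": f"n{i}", "record_type": "new"} for i in range(240)]
existing_pool = [{"record_id": f"e{i}", "record_type": "existing_varied"}
                 for i in range(20)]

stream = build_stream(new_pool, existing_pool, total=100, seed=7)
new_count = sum(1 for r in stream if r["record_type"] == "new")
print(f"{len(stream)} records: {new_count} new, {len(stream) - new_count} existing")
```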


## 7. Initialize Streaming Processor

Set up the 4-way matching processor.

In [None]:
print("🔄 Initializing 4-way streaming processor...")

try:
    # Initialize the streaming processor
    processor = StreamingMDMProcessor(spanner_helper)

    print("\n📊 Processor Configuration:")
    print(f"  Matching strategies: 4 (exact, fuzzy, vector, business)")
    print(f"  Strategy weights:")
    for strategy, weight in processor.weights.items():
        print(f"    {strategy}: {weight*100:.0f}%")

    print(f"\n🚧 Vector Matching Limitation:")
    print(f"  Current: Vector matching deferred (no real-time embedding generation)")
    print(f"  Roadmap: Full 4-way matching with Vertex AI integration (+200-500ms)")
    print(f"  Impact: Vector strategy contributes 0.0 to all scores")

    print(f"\n⚖️ Decision Thresholds:")
    print(f"  Auto-merge: ≥{processor.auto_merge_threshold}")
    print(f"  Create new: <{processor.create_new_threshold}")

    print("\n✅ Streaming processor ready!")
    print(f"🎯 Target: <{TARGET_LATENCY_MS}ms processing time per record")

except Exception as e:
    print(f"❌ Error initializing processor: {e}")
    processor = None

🔄 Initializing 4-way streaming processor...

📊 Processor Configuration:
  Matching strategies: 4 (exact, fuzzy, vector, business)
  Strategy weights:
    exact: 33%
    fuzzy: 28%
    vector: 22%
    business: 17%

⚖️ Decision Thresholds:
  Auto-merge: ≥0.8
  Create new: <0.6

✅ Streaming processor ready!
🎯 Target: <400ms processing time per record


## 8. Streaming Processing Loop

Process each record in turn, sleeping between records to simulate a real-time pipeline.
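
The pacing idea is to sleep only for whatever remains of the interval after a record has been handled, so slow records do not add extra delay. A minimal sketch, with a hypothetical `paced` generator whose clock and sleep functions can be swapped out for testing:

```python
import time


def paced(items, interval_sec, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than one per `interval_sec`.

    The time the caller spends handling each item counts toward the interval.
    """
    for item in items:
        started = clock()
        yield item  # the caller processes the item here
        remaining = interval_sec - (clock() - started)
        if remaining > 0:
            sleep(remaining)


# Usage: three made-up records, at most one every 0.1 seconds.
for record in paced(["r1", "r2", "r3"], interval_sec=0.1):
    print(f"processing {record}")
```

If processing a record takes longer than the interval, the loop moves straight on to the next record without sleeping.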

In [None]:
print(
    f"🚀 Starting Streaming MDM Simulation ({NUM_STREAMING_RECORDS} records, per 100ms)")
print("=" * 80)
print()

# Validate prerequisites
if not streaming_records:
    print("❌ No streaming records available. Please run data generation first.")
elif not processor:
    print("❌ Processor not initialized. Please run processor setup first.")
else:
    # Track overall statistics
    start_time = time.time()
    total_processing_time = 0
    action_counts = {}
    confidence_counts = {}

    # Process each record
    for i, record in enumerate(streaming_records, 1):
        record_start = time.time()

        # Process the record with match details
        result = processor.process_record(
            record, i, NUM_STREAMING_RECORDS, include_match_details=True)

        # Store match result in Spanner
        try:
            match_id = processor.store_match_result(record, result)
            print(
                f"  🗃️ → Stored match result in Spanner (match_id: {match_id[:8]}...)")
        except Exception as e:
            print(f"  ⚠️ → Failed to store match result: {e}")

        # Update statistics
        total_processing_time += result.get('processing_time_ms', 0)
        update_statistics(result, action_counts, confidence_counts)

        # Sleep to maintain processing pace
        elapsed = time.time() - record_start
        sleep_time = max(0, PROCESSING_DELAY_SEC - elapsed)
        if sleep_time > 0:
            print(f"  ⏱️ Next record in {sleep_time:.1f}s...")
            time.sleep(sleep_time)

        print()  # Empty line for readability

    # Calculate final statistics
    total_time = time.time() - start_time

    print("🎉 Streaming Simulation Complete!")
    print("=" * 50)
    print(f"📊 Processing Summary:")
    print(f"  Records processed: {NUM_STREAMING_RECORDS}")
    print(f"  Total time: {total_time:.1f} seconds")
    print(
        f"  Average processing time: {total_processing_time/NUM_STREAMING_RECORDS:.0f}ms")
    print(
        f"  Throughput: {NUM_STREAMING_RECORDS/total_time:.1f} records/second")

    print(f"\n⚖️ Decision Distribution:")
    for action, count in action_counts.items():
        percentage = (count / NUM_STREAMING_RECORDS) * 100
        print(f"  {action}: {count} ({percentage:.1f}%)")

    print(f"\n🎯 Confidence Distribution:")
    for confidence, count in confidence_counts.items():
        percentage = (count / NUM_STREAMING_RECORDS) * 100
        print(f"  {confidence}: {count} ({percentage:.1f}%)")

    print(f"\n📁 Results stored in Spanner table: match_results")
    print("💡 Use Section 9 to analyze the results from Spanner")

🚀 Starting Streaming MDM Simulation (100 records, per 100ms)

📨 Record 1/100: Christopher Shaffer (edwardsnicole@yahoo.com) - ecommerce Source
  🧮 Vector matching: Skipped (no embedding for streaming record)
  ⚡ Exact matching: 1 matches found
  🔍 Fuzzy matching: 1 matches found
  🧮 Vector matching: 0 matches found
  📋 Business rules: 2 matches found
  📊 Combined score: 0.66 (MEDIUM confidence) → HUMAN_REVIEW
  🗃️ → HUMAN_REVIEW Spanner (entity_id: 37927760..., merged with existing 37927760...)
  ⏱️ Processing time: 1727ms
  🗃️ → Stored match result in Spanner (match_id: b4a61beb...)

📨 Record 2/100: Phillip Anderson (maryavila@outlook.com) - crm Source
  🧮 Vector matching: Skipped (no embedding for streaming record)
  ⚡ Exact matching: 0 matches found
  🔍 Fuzzy matching: 1 matches found
  🧮 Vector matching: 0 matches found
  📋 Business rules: 2 matches found
  📊 Combined score: 0.33 (LOW confidence) → CREATE_NEW
  📝 Staged entity bc4be7dd... for batch processing
  🗃️ → CREATE_NEW Span

## 9. Analysis and Visualization

Analyze the streaming processing results.

In [11]:
print("📊 Analyzing streaming processing results from Spanner...")

# Query transaction data from Spanner
transactions_query = """
SELECT
    record1_id, source1,
    exact_score, fuzzy_score, vector_score, business_score,
    combined_score, confidence_level, match_decision,
    processing_time_ms, matched_at
FROM match_results
WHERE matched_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY matched_at DESC
"""

transactions_df = spanner_helper.execute_sql(transactions_query)

# Rename columns for compatibility with existing analysis
transactions_df.columns = [
    'record_id', 'source_system',
    'exact_score', 'fuzzy_score', 'vector_score', 'business_score',
    'combined_score', 'confidence', 'action',
    'processing_time_ms', 'matched_at'
]

# Add binary flags (1/0) marking whether each strategy produced a non-zero score
transactions_df['exact_matches'] = (
    transactions_df['exact_score'] > 0).astype(int)
transactions_df['fuzzy_matches'] = (
    transactions_df['fuzzy_score'] > 0).astype(int)
transactions_df['vector_matches'] = (
    transactions_df['vector_score'] > 0).astype(int)
transactions_df['business_matches'] = (
    transactions_df['business_score'] > 0).astype(int)


# Performance analysis
print("\n⚡ Performance Analysis:")
print(
    f"  Average processing time: {transactions_df['processing_time_ms'].mean():.0f}ms")
print(
    f"  Median processing time: {transactions_df['processing_time_ms'].median():.0f}ms")
print(
    f"  95th percentile: {transactions_df['processing_time_ms'].quantile(0.95):.0f}ms")
print(
    f"  Max processing time: {transactions_df['processing_time_ms'].max():.0f}ms")

# Matching effectiveness
print("\n🎯 Matching Effectiveness:")
print(
    f"  Average combined score: {transactions_df['combined_score'].mean():.3f}")
print(
    f"  Records with exact matches: {(transactions_df['exact_matches'] > 0).sum()}")
print(
    f"  Records with fuzzy matches: {(transactions_df['fuzzy_matches'] > 0).sum()}")
print(
    f"  Records with vector matches: {(transactions_df['vector_matches'] > 0).sum()}")
print(
    f"  Records with business matches: {(transactions_df['business_matches'] > 0).sum()}")

📊 Analyzing streaming processing results from Spanner...

⚡ Performance Analysis:
  Average processing time: 1694ms
  Median processing time: 1649ms
  95th percentile: 2060ms
  Max processing time: 3443ms

🎯 Matching Effectiveness:
  Average combined score: 0.594
  Records with exact matches: 182
  Records with fuzzy matches: 177
  Records with vector matches: 0
  Records with business matches: 189
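
The averages above sit well over the 400 ms target from the configuration, so it can help to report how many records actually meet it. The sketch below uses made-up latencies and hypothetical helper names (`percentile`, `latency_report`); the percentile uses linear interpolation, the same convention as the pandas default.

```python
def percentile(values, q):
    """Linear-interpolated percentile for q in [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def latency_report(latencies_ms, target_ms=400):
    """Summarize latencies against a per-record target."""
    within = sum(1 for value in latencies_ms if value < target_ms)
    return {
        "p50_ms": percentile(latencies_ms, 0.50),
        "p95_ms": percentile(latencies_ms, 0.95),
        "share_within_target": within / len(latencies_ms),
    }


# Made-up latencies in milliseconds, not taken from the run above.
sample_latencies = [100, 200, 300, 400, 500]
print(latency_report(sample_latencies))
```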


In [12]:
# Create visualizations
print("📈 Creating performance visualizations...")

# Build a 2x2 dashboard: two histograms and two pie charts
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Processing Time Distribution', 'Action Distribution',
                    'Confidence Distribution', 'Combined Score Distribution'),
    specs=[[{'type': 'histogram'}, {'type': 'pie'}],
           [{'type': 'pie'}, {'type': 'histogram'}]]
)

# Processing time histogram
fig.add_trace(
    go.Histogram(
        x=transactions_df['processing_time_ms'], name='Processing Time (ms)'),
    row=1, col=1
)

# Action distribution pie chart
action_counts = transactions_df['action'].value_counts()
fig.add_trace(
    go.Pie(labels=action_counts.index,
           values=action_counts.values, name='Actions'),
    row=1, col=2
)

# Confidence distribution pie chart
confidence_counts = transactions_df['confidence'].value_counts()
fig.add_trace(
    go.Pie(labels=confidence_counts.index,
           values=confidence_counts.values, name='Confidence'),
    row=2, col=1
)

# Combined score histogram
fig.add_trace(
    go.Histogram(x=transactions_df['combined_score'], name='Combined Score'),
    row=2, col=2
)

fig.update_layout(
    title_text="Streaming MDM Performance Analysis", showlegend=False)
fig.show()

print("✅ Visualizations created!")

📈 Creating performance visualizations...


✅ Visualizations created!


In [13]:
# Strategy effectiveness analysis
print("🎯 4-Strategy Effectiveness Analysis:")

strategy_stats = pd.DataFrame({
    'Strategy': ['Exact', 'Fuzzy', 'Vector', 'Business'],
    'Records_with_Matches': [
        (transactions_df['exact_matches'] > 0).sum(),
        (transactions_df['fuzzy_matches'] > 0).sum(),
        (transactions_df['vector_matches'] > 0).sum(),
        (transactions_df['business_matches'] > 0).sum()
    ],
    'Average_Score': [
        transactions_df['exact_score'].mean(),
        transactions_df['fuzzy_score'].mean(),
        transactions_df['vector_score'].mean(),
        transactions_df['business_score'].mean()
    ]
})

display(strategy_stats)

# Strategy effectiveness chart
fig = px.bar(
    strategy_stats,
    x='Strategy',
    y='Records_with_Matches',
    title='4-Strategy Matching Effectiveness',
    labels={'Records_with_Matches': 'Records with Matches'}
)
fig.show()

print("✅ Strategy analysis complete!")

🎯 4-Strategy Effectiveness Analysis:


| | Strategy | Records_with_Matches | Average_Score |
|---|---|---|---|
| 0 | Exact | 182 | 0.91 |
| 1 | Fuzzy | 177 | 0.881364 |
| 2 | Vector | 0 | 0.0 |
| 3 | Business | 189 | 0.278 |


✅ Strategy analysis complete!


## 10. Final Golden Record Analysis

Analyze the final state of golden records in Spanner.

In [14]:
print("🏆 Analyzing final golden record state...")

# Get final golden record count
final_count = spanner_helper.get_table_count("golden_entities")
print(f"\n📊 Final golden entities count: {final_count}")

# Analyze processing paths
path_query = """
SELECT processing_path, COUNT(*) as count
FROM golden_entities
GROUP BY processing_path
ORDER BY count DESC
"""

path_df = spanner_helper.execute_sql(path_query)
if not path_df.empty:
    path_df.columns = ['processing_path', 'count']
    print("\n🔄 Processing Path Distribution:")
    display(path_df)

# Analyze source record counts
source_query = """
SELECT source_record_count, COUNT(*) as entities
FROM golden_entities
GROUP BY source_record_count
ORDER BY source_record_count
"""

source_df = spanner_helper.execute_sql(source_query)
if not source_df.empty:
    source_df.columns = ['source_record_count', 'entities']
    print("\n📈 Source Record Count Distribution:")
    display(source_df)

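# Show the 10 most recently updated records on the 'stream' path
# (records on the 'stream_updated' path are not included by this filter)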
updated_query = """
SELECT entity_id, master_name, master_email, source_record_count,
       processing_path, updated_at
FROM golden_entities
WHERE processing_path = 'stream'
ORDER BY updated_at DESC
LIMIT 10
"""

updated_df = spanner_helper.execute_sql(updated_query)
if not updated_df.empty:
    updated_df.columns = ['entity_id', 'master_name',
                          'master_email', 'source_count', 'path', 'updated_at']
    print("\n🔄 Sample Updated Records (Streaming):")
    display(updated_df)

print("\n✅ Golden record analysis complete!")

🏆 Analyzing final golden record state...

📊 Final golden entities count: 110

🔄 Processing Path Distribution:


| | processing_path | count |
|---|---|---|
| 0 | batch_migrated | 95 |
| 1 | stream | 10 |
| 2 | stream_updated | 5 |



📈 Source Record Count Distribution:


| | source_record_count | entities |
|---|---|---|
| 0 | 1 | 7 |
| 1 | 2 | 24 |
| 2 | 3 | 37 |
| 3 | 4 | 18 |
| 4 | 5 | 16 |
| 5 | 6 | 5 |
| 6 | 7 | 2 |
| 7 | 22 | 1 |



🔄 Sample Updated Records (Streaming):


| | entity_id | master_name | master_email | source_count | path | updated_at |
|---|---|---|---|---|---|---|
| 0 | afca4bc55c06d6b016f0374eb7e88da6b692 | LUIS HARPER | joel28@example.org | 3 | stream | 2025-09-24 11:11:26.498296+00:00 |
| 1 | e0be3dd816d8966379ff644e2f9f0c243164 | JIM HORTON | jross@hotmail.com | 1 | stream | 2025-09-24 11:11:19.854006+00:00 |
| 2 | c71b00bcf2341334f04a5b958f0e85259d97 | SCOTT SAMPSON | cory58@outlook.com | 1 | stream | 2025-09-24 11:10:53.855958+00:00 |
| 3 | 047db959202c14e4e60d5db53b947b7f910e | VANESSA REED | xanderson@yahoo.com | 2 | stream | 2025-09-24 11:10:30.040264+00:00 |
| 4 | b07c7424dacd851bd8f9a1bff4e0dce13af4 | C THOMPSON | smithjoshua@outlook.com | 1 | stream | 2025-09-24 11:10:09.373901+00:00 |
| 5 | 13783ebe13f403c9b494be11c2026d3f432a | JEFFERY COLEMAN | kennethkidd@yahoo.com | 1 | stream | 2025-09-24 11:10:02.668206+00:00 |
| 6 | 6655038bac53f8dcc5a95d2945083a465820 | MARY COCHRAN | lweaver@yahoo.com | 1 | stream | 2025-09-24 11:09:24.984220+00:00 |
| 7 | 35884410c041b1e3e8a0c9a823b1e7eb01d3 | VINCENT HERRERA | jennifercarter@example.org | 2 | stream | 2025-09-24 11:09:22.944822+00:00 |
| 8 | f7b3d79062bf72b781fa42e8bc4ca4c37b7c | SARAH PERRY | pshaw@example.com | 1 | stream | 2025-09-24 11:09:00.813258+00:00 |
| 9 | bc4be7dd2ec10068e2848ecccc9b83d20213 | PHILLIP ANDERSON | maryavila@outlook.com | 1 | stream | 2025-09-24 11:08:09.740419+00:00 |



✅ Golden record analysis complete!
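
The source record count distribution can be collapsed into a few summary figures. The sketch below is illustrative only: it hard-codes the distribution printed above, and `summarize_source_counts` is a hypothetical helper rather than a method of `spanner_helper`.

```python
def summarize_source_counts(distribution):
    """Summarize a {source_record_count: number_of_entities} mapping."""
    total_entities = sum(distribution.values())
    if total_entities == 0:
        raise ValueError("distribution must contain at least one entity")
    total_sources = sum(count * n for count, n in distribution.items())
    return {
        "entities": total_entities,
        "source_records": total_sources,
        "mean_sources_per_entity": round(total_sources / total_entities, 2),
        "max_sources": max(distribution),
    }


# Values copied from the distribution table above.
distribution = {1: 7, 2: 24, 3: 37, 4: 18, 5: 16, 6: 5, 7: 2, 22: 1}
summary = summarize_source_counts(distribution)
print(summary)
```

The entity total recomputed this way equals the 110 golden entities reported by the cell, and the single entity with 22 source records stands out as a candidate for manual review.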


## 11. Performance Metrics and Summary

Calculate key performance indicators for the 4-strategy streaming MDM pipeline.

In [15]:
print("📈 Calculating 4-Strategy Streaming MDM Performance Metrics...")

# Overall pipeline statistics
initial_golden_count = golden_count if 'golden_count' in locals() else 0
final_golden_count = spanner_helper.get_table_count("golden_entities")
new_entities_created = final_golden_count - initial_golden_count

print("\n📊 Pipeline Statistics:")
print(f"  Initial golden records (from BigQuery): {initial_golden_count}")
print(f"  Streaming records processed: {NUM_STREAMING_RECORDS}")
print(f"  Final golden records: {final_golden_count}")
print(f"  Net new entities created: {new_entities_created}")
print(
    f"  Entity consolidation rate: {((NUM_STREAMING_RECORDS - new_entities_created) / NUM_STREAMING_RECORDS * 100):.1f}%")

# Performance metrics from transactions
if 'transactions_df' in locals() and not transactions_df.empty:
    print("\n⚡ Performance Metrics:")
    print(
        f"  Average processing time: {transactions_df['processing_time_ms'].mean():.0f}ms")
    print(
        f"  Sub-second guarantee: {(transactions_df['processing_time_ms'] < 1000).sum()}/{len(transactions_df)} ({(transactions_df['processing_time_ms'] < 1000).mean()*100:.1f}%)")
    print(
        f"  Target <400ms: {(transactions_df['processing_time_ms'] < 400).sum()}/{len(transactions_df)} ({(transactions_df['processing_time_ms'] < 400).mean()*100:.1f}%)")

    print("\n🎯 4-Strategy Matching Results:")
    print(
        f"  Auto-merge rate: {action_counts.get('AUTO_MERGE', 0)}/{NUM_STREAMING_RECORDS} ({action_counts.get('AUTO_MERGE', 0)/NUM_STREAMING_RECORDS*100:.1f}%)")
    print(
        f"  New entity rate: {action_counts.get('CREATE_NEW', 0)}/{NUM_STREAMING_RECORDS} ({action_counts.get('CREATE_NEW', 0)/NUM_STREAMING_RECORDS*100:.1f}%)")
    print(
        f"  Average confidence score: {transactions_df['combined_score'].mean():.3f}")

print("\n✅ Performance analysis complete!")

📈 Calculating 4-Strategy Streaming MDM Performance Metrics...

📊 Pipeline Statistics:
  Initial golden records (from BigQuery): 100
  Streaming records processed: 100
  Final golden records: 110
  Net new entities created: 10
  Entity consolidation rate: 90.0%

⚡ Performance Metrics:
  Average processing time: 1694ms
  Sub-second guarantee: 0/200 (0.0%)
  Target <400ms: 0/200 (0.0%)

🎯 4-Strategy Matching Results:
  Auto-merge rate: 0/100 (0.0%)
  New entity rate: 32/100 (32.0%)
  Average confidence score: 0.594

✅ Performance analysis complete!
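
The consolidation rate printed above is `(processed - net_new) / processed`, and an average latency can hide the tail, so percentiles are a useful complement. The following is a minimal sketch under stated assumptions: `consolidation_rate`, `percentile`, and `latency_summary` are hypothetical helpers, and `sample_ms` is made-up illustrative data, not latencies from this run.

```python
import math


def consolidation_rate(processed, net_new):
    """Percent of processed records that did not create a new entity."""
    if processed <= 0:
        raise ValueError("processed must be positive")
    return 100.0 * (processed - net_new) / processed


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms, target_ms=1000):
    """Report median, tail, and the share of requests under target_ms."""
    within = sum(1 for value in latencies_ms if value < target_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "max_ms": max(latencies_ms),
        "within_target_pct": round(100.0 * within / len(latencies_ms), 1),
    }


# Figures from the pipeline statistics above: 100 processed, 10 net new.
rate = consolidation_rate(processed=100, net_new=10)
print(f"Entity consolidation rate: {rate:.1f}%")

# Hypothetical latencies in milliseconds, for illustration only.
sample_ms = [1200, 1500, 1650, 1700, 1800, 1900, 2100, 2400, 950, 1750]
latency = latency_summary(sample_ms)
print(latency)
```

In the notebook itself, the same summary could be computed by passing `transactions_df['processing_time_ms'].tolist()` to `latency_summary`.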


## 12. Cleanup and Cost Management

Optional: delete the Spanner instance to avoid ongoing charges. The cell below only prints the cleanup command; it does not delete anything.

In [16]:
print("🧹 Demo Cleanup Options:")
print("=" * 50)
print(f"💰 Current Spanner instance: {INSTANCE_ID}")
print("📊 Processing units: 100 (regional)")
print(f"🗃️ Database: {DATABASE_ID}")
print()
print("⚠️ To avoid ongoing charges, you can delete the Spanner instance:")
print(f"   gcloud spanner instances delete {INSTANCE_ID} --quiet")
print()
print("💡 The BigQuery golden records remain unchanged for future use.")
print("✅ Streaming MDM demo completed successfully!")

🧹 Demo Cleanup Options:
==================================================
💰 Current Spanner instance: mdm-streaming-demo
📊 Processing units: 100 (regional)
🗃️ Database: mdm_streaming

⚠️ To avoid ongoing charges, you can delete the Spanner instance:
   gcloud spanner instances delete mdm-streaming-demo --quiet

💡 The BigQuery golden records remain unchanged for future use.
✅ Streaming MDM demo completed successfully!
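
The printed `gcloud` command can also be issued from a notebook cell. The sketch below is one possible wrapper, assuming the `gcloud` CLI is installed and authenticated. `delete_spanner_instance` is a hypothetical helper, and it defaults to a dry run so that nothing is deleted by accident.

```python
import shlex
import subprocess


def delete_spanner_instance(instance_id, dry_run=True):
    """Build the delete command; run it only when dry_run is False."""
    cmd = ["gcloud", "spanner", "instances", "delete", instance_id, "--quiet"]
    if dry_run:
        # Show the command without executing it.
        print("Would run:", shlex.join(cmd))
        return cmd
    # Raises CalledProcessError if gcloud exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd


# Dry run against the instance name shown in the output above.
command = delete_spanner_instance("mdm-streaming-demo")
```

Calling `delete_spanner_instance(INSTANCE_ID, dry_run=False)` would perform the deletion, which is irreversible and removes the instance together with its databases.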
