# Master Data Management (MDM) - Spanner Native Streaming Processing

This notebook demonstrates a complete end-to-end streaming Master Data Management pipeline using Spanner's native capabilities:

- **Golden Record Bootstrap**: Load existing golden records from BigQuery batch processing
- **Spanner Infrastructure**: Set up minimal Spanner instance for real-time processing
- **Data Migration**: Transfer golden records to Spanner for real-time matching
- **Streaming Data Generation**: Create 100 new customer records for processing
- **4-Way Real-time Matching**: Exact, fuzzy, vector, and business rules matching
- **Synchronous Processing**: Per-record processing with immediate feedback (target under 400 ms; the run recorded here averaged about 1.9 s per record)
- **Golden Record Updates**: Apply survivorship rules and update master entities
- **Live Performance Tracking**: Real-time metrics and CSV transaction logging

## Architecture Overview

This implementation follows the streaming processing path:
1. **BigQuery Golden Records** → **Spanner Migration**
2. **Kafka-like Stream** → **Real-time Standardization**
3. **Spanner Vector Search** → **4-Way Matching Engine**
4. **Confidence Scoring** → **AUTO_MERGE/CREATE_NEW Decisions**
5. **Golden Record Updates** → **CSV Transaction Logging**
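
The standardization step in stage 2 is not shown in this notebook. As a rough sketch, assuming it upper-cases names, lower-cases emails, and reduces phone numbers to digits (consistent with sample golden records in section 4 such as `CARLA ROBLES` and `6266634163`), it could look like the following. The function name `standardize_record` is hypothetical and is not part of `StreamingMDMProcessor`.

```python
import re


def standardize_record(record: dict) -> dict:
    """Hypothetical standardization; the real rules live in streaming_processor."""
    name = (record.get("full_name") or "").strip()
    email = (record.get("email") or "").strip()
    phone = record.get("phone") or ""
    return {
        **record,
        "full_name": " ".join(name.upper().split()),  # collapse repeated whitespace
        "email": email.lower(),
        "phone": re.sub(r"\D", "", str(phone)),  # keep digits only
    }


raw = {
    "full_name": "  Tina   Ferguson ",
    "email": "QVargas@Example.org",
    "phone": "562-972-0465",
}
clean = standardize_record(raw)
print(clean)
```

Normalizing before matching means the exact-match strategy compares like with like, so `562-972-0465` and `5629720465` are treated as the same phone number.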

## 1. Setup and Configuration

In [1]:
# Import required libraries
import csv
import os
import random
import sys
import time
import warnings
from datetime import datetime

# Make the parent directory importable before loading the project modules
sys.path.append('..')

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots

from batch_mdm_gcp.bigquery_utils import BigQueryMDMHelper
from batch_mdm_gcp.data_generator import MDMDataGenerator
from spanner_utils import SpannerMDMHelper
from streaming_processor import StreamingMDMProcessor

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [None]:
# =============================================================================
# CONFIGURATION CONSTANTS - Centralized Settings
# =============================================================================

# GCP Configuration
PROJECT_ID = "your-project-id"  # Replace with your GCP project ID
DATASET_ID = "mdm_demo"  # BigQuery dataset (from batch processing)
INSTANCE_ID = "mdm-streaming-demo"  # Spanner instance
DATABASE_ID = "mdm_streaming"  # Spanner database
LOCATION = "US"

# Processing Configuration
NUM_STREAMING_RECORDS = 100
PROCESSING_DELAY_SEC = 0.1  # 10 records per second for demo
TARGET_LATENCY_MS = 400  # Target processing time per record

# Decision Thresholds
AUTO_MERGE_THRESHOLD = 0.85
CREATE_NEW_THRESHOLD = 0.65

# File Paths
RESULTS_DIR = 'results'
CSV_FILENAME = 'streaming_transactions.csv'

print("📋 Configuration loaded:")
print(f"  Target records: {NUM_STREAMING_RECORDS}")
print(f"  Target latency: <{TARGET_LATENCY_MS}ms")
print(f"  Auto-merge threshold: ≥{AUTO_MERGE_THRESHOLD}")
print(f"  Create new threshold: <{CREATE_NEW_THRESHOLD}")

📋 Configuration loaded:
  Target records: 100
  Target latency: <400ms
  Auto-merge threshold: ≥0.85
  Create new threshold: <0.65


In [3]:
# Initialize helpers
try:
    # BigQuery helper (for loading golden records)
    bq_helper = BigQueryMDMHelper(PROJECT_ID, DATASET_ID)
    print(f"✅ Connected to BigQuery project: {PROJECT_ID}")
    print(f"📊 BigQuery dataset: {bq_helper.dataset_ref}")

    # Spanner helper (for streaming processing)
    spanner_helper = SpannerMDMHelper(PROJECT_ID, INSTANCE_ID, DATABASE_ID)
    print(f"✅ Connected to Spanner project: {PROJECT_ID}")
    print(f"🗃️ Spanner instance: {INSTANCE_ID}")
    print(f"🗃️ Spanner database: {DATABASE_ID}")

except Exception as e:
    print(f"❌ Error connecting: {e}")
    print("Please ensure you have:")
    print("1. Set up Google Cloud authentication")
    print("2. Enabled BigQuery and Spanner APIs")
    print("3. Updated PROJECT_ID above")

✅ Connected to BigQuery project: johanesa-playground-326616
📊 BigQuery dataset: johanesa-playground-326616.mdm_demo
✅ Connected to Spanner project: johanesa-playground-326616
🗃️ Spanner instance: mdm-streaming-demo
🗃️ Spanner database: mdm_streaming


## 2. Helper Functions

In [4]:
def log_transaction_result(result, csv_writer, record):
    """
    Log transaction result to CSV with all required fields.
    Simplified logging function with clear error handling.
    """
    try:
        csv_writer.writerow([
            datetime.now().isoformat(),
            result.get('record_num', 0),
            result.get('record_id', ''),
            record.get('source_system', ''),
            record.get('full_name', ''),
            record.get('email', ''),
            record.get('phone', ''),
            record.get('company', ''),
            result.get('exact_count', 0),
            result.get('fuzzy_count', 0),
            result.get('vector_count', 0),
            result.get('business_count', 0),
            result.get('strategy_scores', {}).get('exact', 0),
            result.get('strategy_scores', {}).get('fuzzy', 0),
            result.get('strategy_scores', {}).get('vector', 0),
            result.get('strategy_scores', {}).get('business', 0),
            result.get('combined_score', 0),
            result.get('confidence', 'LOW'),
            result.get('action', 'ERROR'),
            result.get('entity_id', ''),
            result.get('processing_time_ms', 0)
        ])
        return True
    except Exception as e:
        print(f"  ⚠️ Error logging to CSV: {e}")
        return False


def update_statistics(result, action_counts, confidence_counts):
    """
    Update running statistics with current result.
    """
    action = result.get('action', 'ERROR')
    confidence = result.get('confidence', 'LOW')

    action_counts[action] = action_counts.get(action, 0) + 1
    confidence_counts[confidence] = confidence_counts.get(confidence, 0) + 1


print("✅ Helper functions defined")
print("📋 Functions available:")
print("  • log_transaction_result() - CSV logging")
print("  • update_statistics() - Running metrics")

✅ Helper functions defined
📋 Functions available:
  • log_transaction_result() - CSV logging
  • update_statistics() - Running metrics
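
`update_statistics()` mutates two plain dictionaries in place. A minimal sketch of the same tallying with `collections.Counter`, using made-up result dictionaries and not real processor output, shows the expected shape of the running counts:

```python
from collections import Counter

# Hypothetical results; only the 'action' and 'confidence' keys are read.
results = [
    {"action": "AUTO_MERGE", "confidence": "MEDIUM"},
    {"action": "CREATE_NEW", "confidence": "LOW"},
    {"action": "AUTO_MERGE", "confidence": "MEDIUM"},
    {},  # a result with missing keys falls back to ERROR / LOW
]

action_tally = Counter()
confidence_tally = Counter()
for result in results:
    action_tally[result.get("action", "ERROR")] += 1
    confidence_tally[result.get("confidence", "LOW")] += 1

print(dict(action_tally))
print(dict(confidence_tally))
```

A `Counter` behaves like the dictionaries used in the notebook but removes the need for the `.get(key, 0) + 1` idiom.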


## 3. Spanner Infrastructure Setup

Create minimal Spanner infrastructure for the streaming demo.

In [5]:
print("🔄 Setting up Spanner infrastructure...")
print("💰 Cost estimate: ~$65/month for 100 processing units (regional)")
print("⚠️ Remember to delete the instance after demo to avoid charges")
print()

try:
    # Create Spanner instance (minimal configuration)
    spanner_helper.create_instance_if_needed(processing_units=100)

    # Create database
    spanner_helper.create_database_if_needed()

    # Create schema (aligned with BigQuery golden_records)
    spanner_helper.create_or_replace_schema()

    print("\n✅ Spanner infrastructure ready!")
    print(f"📊 Instance: {INSTANCE_ID} (100 processing units)")
    print(f"🗃️ Database: {DATABASE_ID}")
    print(f"📋 Schema: golden_entities, match_results tables created")

except Exception as e:
    print(f"❌ Error setting up Spanner infrastructure: {e}")
    print("Please check your GCP permissions and try again.")

🔄 Setting up Spanner infrastructure...
💰 Cost estimate: ~$65/month for 100 processing units (regional)
⚠️ Remember to delete the instance after demo to avoid charges

  ✅ Instance mdm-streaming-demo already exists
  ✅ Database mdm_streaming already exists
  🔄 Checking schema status...


Created multiplexed session.


  ✅ Schema exists - truncating data only (fast path)
  🗑️ Truncating existing tables...
  🗑️ Cleared table: match_results
    ✅ Truncated: match_results
  🗑️ Cleared table: golden_entities
    ✅ Truncated: golden_entities
  ✅ All tables truncated successfully
  ✅ Schema ready (optimized - ~2-3 seconds)

✅ Spanner infrastructure ready!
📊 Instance: mdm-streaming-demo (100 processing units)
🗃️ Database: mdm_streaming
📋 Schema: golden_entities, match_results tables created


## 4. Load Golden Records from BigQuery

Bootstrap the streaming system with existing golden records from batch processing.

In [6]:
print("🔄 Loading golden records from BigQuery batch processing...")

try:
    # Load golden records from BigQuery
    golden_count = spanner_helper.load_golden_records_from_bigquery(bq_helper)

    if golden_count > 0:
        print(
            f"\n✅ Successfully migrated {golden_count} golden records to Spanner")

        # Verify the migration
        current_count = spanner_helper.get_table_count("golden_entities")
        print(f"📊 Current golden entities in Spanner: {current_count}")

        # Show sample records
        sample_query = """
        SELECT entity_id, master_name, master_email, master_phone,
               source_record_count, processing_path
        FROM golden_entities
        LIMIT 5
        """

        sample_df = spanner_helper.execute_sql(sample_query)
        if not sample_df.empty:
            print("\n🔍 Sample Golden Records in Spanner:")
            sample_df.columns = ['entity_id', 'master_name',
                                 'master_email', 'master_phone', 'source_count', 'path']
            display(sample_df)
    else:
        print("⚠️ No golden records found in BigQuery")
        print("💡 Run the batch processing notebook first to create golden records")

except Exception as e:
    print(f"❌ Error loading golden records: {e}")
    print("💡 Make sure you've run the batch processing notebook first")

🔄 Loading golden records from BigQuery batch processing...
  🔄 Loading golden records from BigQuery...
  🗑️ Cleared table: golden_entities
  ✅ Loaded 265 golden records from BigQuery

✅ Successfully migrated 265 golden records to Spanner
📊 Current golden entities in Spanner: 265

🔍 Sample Golden Records in Spanner:


Unnamed: 0,entity_id,master_name,master_email,master_phone,source_count,path
0,01f1a69f-3d9a-4853-9549-94722b6e534f,CARLA ROBLES,jacob86@example.com,6266634163,1,batch_migrated
1,037a3eb8-51b8-46a9-a94b-949a501d8aa3,MICHELLE MARTINEZ,johndawson@example.org,1767929,1,batch_migrated
2,037b0420-e22f-4e57-81a4-f0a03df2da96,LANCE SMITH,hdavis@example.com,5279484677,1,batch_migrated
3,045e6c25-285a-4836-940e-321afd67376c,SHAUN JONES,patrickdarin@example.com,18124382,1,batch_migrated
4,05c3fa47-48c2-4e78-8762-d4d95ab000fd,TODD COOK,ugibson@example.org,1939263,1,batch_migrated


## 5. Generate New Streaming Data

Create new customer records to simulate streaming data.

In [7]:
print(f"🔄 Generating {NUM_STREAMING_RECORDS} new streaming records...")

try:
    # Generate new streaming data (different from batch data)
    generator = MDMDataGenerator(num_unique_customers=NUM_STREAMING_RECORDS)
    streaming_datasets = generator.generate_all_datasets()

    # Combine all streaming records
    all_streaming_records = []
    for source, df in streaming_datasets.items():
        for _, record in df.iterrows():
            all_streaming_records.append(record.to_dict())

    # Shuffle to simulate random streaming order
    random.shuffle(all_streaming_records)

    # Keep at most NUM_STREAMING_RECORDS records
    streaming_records = all_streaming_records[:NUM_STREAMING_RECORDS]

    print(f"\n📈 Streaming Data Summary:")
    print(f"  Total streaming records: {len(streaming_records)}")

    # Show source distribution
    source_counts = {}
    for record in streaming_records:
        source = record.get('source_system', 'unknown')
        source_counts[source] = source_counts.get(source, 0) + 1

    for source, count in source_counts.items():
        print(f"  {source.upper()}: {count} records")

    print(f"\n🔍 Sample Streaming Records:")
    sample_streaming = pd.DataFrame(streaming_records[:3])
    display(sample_streaming[['record_id', 'full_name',
            'email', 'phone', 'source_system']].head(3))

    print("\n✅ Streaming data ready for processing!")

except Exception as e:
    print(f"❌ Error generating streaming data: {e}")
    streaming_records = []

🔄 Generating 100 new streaming records...

📈 Streaming Data Summary:
  Total streaming records: 100
  ECOMMERCE: 32 records
  CRM: 41 records
  ERP: 27 records

🔍 Sample Streaming Records:


Unnamed: 0,record_id,full_name,email,phone,source_system
0,29162980-a086-4e19-9abf-13e3e210f993,Sarah Pope,keith98@example.com,,ecommerce
1,39ac8046-f4d6-4691-999c-31c4745000cd,Tina Ferguson,qvargas@example.org,562-972-0465,crm
2,b81665f0-503d-49d1-b9a6-ba2cb0735847,Justin Hughes,melissachan@example.org,433.446.8007,ecommerce



✅ Streaming data ready for processing!


## 6. Initialize CSV Transaction Logging

Set up a CSV file that records each streaming transaction.

In [8]:
print("🔄 Preparing CSV transaction logging...")

try:
    # Create results directory if it doesn't exist
    os.makedirs(RESULTS_DIR, exist_ok=True)

    # CSV file for transaction logging
    csv_file = os.path.join(RESULTS_DIR, CSV_FILENAME)

    # Remove existing file for fresh start
    if os.path.exists(csv_file):
        os.remove(csv_file)
        print(f"  🗑️ Removed existing {csv_file}")

    # Create CSV with headers
    csv_headers = [
        'timestamp', 'record_num', 'record_id', 'source_system',
        'full_name', 'email', 'phone', 'company',
        'exact_matches', 'fuzzy_matches', 'vector_matches', 'business_matches',
        'exact_score', 'fuzzy_score', 'vector_score', 'business_score',
        'combined_score', 'confidence', 'action', 'entity_id',
        'processing_time_ms'
    ]

    with open(csv_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(csv_headers)

    print(f"✅ CSV transaction log initialized: {csv_file}")
    print(f"📋 Headers: {len(csv_headers)} columns")
    print("📝 Ready to log each streaming transaction!")

except Exception as e:
    print(f"❌ Error setting up CSV logging: {e}")
    csv_file = None

🔄 Preparing CSV transaction logging...
✅ CSV transaction log initialized: results/streaming_transactions.csv
📋 Headers: 21 columns
📝 Ready to log each streaming transaction!
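
The header list defined here and the positional row built in `log_transaction_result()` must stay in the same order, which is easy to break when a column is added. One way to guard against that, sketched below with an in-memory buffer and a shortened, illustrative column list, is `csv.DictWriter`, which places values by column name:

```python
import csv
import io

# Shortened column list for illustration; the notebook's log has 21 columns.
fieldnames = ["record_num", "full_name", "combined_score", "action"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
# Keys may be given in any order; missing keys are filled with restval.
writer.writerow({
    "action": "AUTO_MERGE",
    "record_num": 1,
    "full_name": "Sarah Pope",
    "combined_score": 0.73,
})
writer.writerow({"record_num": 2, "full_name": "Tina Ferguson"})

rows = buffer.getvalue().splitlines()
print(rows)
```

This is an alternative design and not what the notebook does; the positional `csv.writer` approach works as long as both lists are edited together.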


## 7. Initialize Streaming Processor

Set up the 4-way matching processor.

In [9]:
print("🔄 Initializing 4-way streaming processor...")

try:
    # Initialize the streaming processor
    processor = StreamingMDMProcessor(spanner_helper)

    print("\n📊 Processor Configuration:")
    print(f"  Matching strategies: 4 (exact, fuzzy, vector, business)")
    print(f"  Strategy weights:")
    for strategy, weight in processor.weights.items():
        print(f"    {strategy}: {weight*100:.0f}%")

    print(f"\n⚖️ Decision Thresholds:")
    print(f"  Auto-merge: ≥{processor.auto_merge_threshold}")
    print(f"  Create new: <{processor.create_new_threshold}")

    print("\n✅ Streaming processor ready!")
    print(f"🎯 Target: <{TARGET_LATENCY_MS}ms processing time per record")

except Exception as e:
    print(f"❌ Error initializing processor: {e}")
    processor = None

🔄 Initializing 4-way streaming processor...

📊 Processor Configuration:
  Matching strategies: 4 (exact, fuzzy, vector, business)
  Strategy weights:
    exact: 40%
    fuzzy: 30%
    vector: 20%
    business: 10%

⚖️ Decision Thresholds:
  Auto-merge: ≥0.85
  Create new: <0.65

✅ Streaming processor ready!
🎯 Target: <400ms processing time per record
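
The processor's scoring code is not shown here, but the printed weights suggest a weighted sum of the four strategy scores. The sketch below assumes exactly that; the per-strategy scores are invented for illustration, and the `REVIEW` label for the middle band is hypothetical. Note that the run logged in section 8 reports `AUTO_MERGE` at a combined score of 0.73, so the processor's actual decision rule evidently differs from this straightforward reading of the two thresholds.

```python
WEIGHTS = {"exact": 0.40, "fuzzy": 0.30, "vector": 0.20, "business": 0.10}
AUTO_MERGE_THRESHOLD = 0.85
CREATE_NEW_THRESHOLD = 0.65


def combined_score(scores: dict) -> float:
    """Weighted sum; strategies without a score contribute 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def decide(score: float) -> str:
    """Naive three-band reading of the thresholds (an assumption)."""
    if score >= AUTO_MERGE_THRESHOLD:
        return "AUTO_MERGE"
    if score < CREATE_NEW_THRESHOLD:
        return "CREATE_NEW"
    return "REVIEW"  # hypothetical label for the middle band


# Invented per-strategy scores for a single record.
example = {"exact": 1.0, "fuzzy": 1.0, "vector": 0.0, "business": 0.3}
score = combined_score(example)
print(f"{score:.2f} -> {decide(score)}")
```

Because exact and fuzzy matching together carry 70% of the weight, a record that matches on both can reach 0.70 before the vector and business strategies contribute anything.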


## 8. Streaming Processing Loop

Process each record with a short sleep in between to simulate a real-time pipeline.

In [10]:
print(
    f"🚀 Starting Streaming MDM Simulation ({NUM_STREAMING_RECORDS} records, per 100ms)")
print("=" * 80)
print()

# Validate prerequisites
if not streaming_records:
    print("❌ No streaming records available. Please run data generation first.")
elif not processor:
    print("❌ Processor not initialized. Please run processor setup first.")
elif not csv_file:
    print("❌ CSV logging not set up. Please run CSV setup first.")
else:
    # Track overall statistics
    start_time = time.time()
    total_processing_time = 0
    action_counts = {}
    confidence_counts = {}

    # Process each record with simplified loop
    with open(csv_file, 'a', newline='') as f:
        csv_writer = csv.writer(f)

        for i, record in enumerate(streaming_records, 1):
            record_start = time.time()

            # Process the record with match details
            result = processor.process_record(
                record, i, NUM_STREAMING_RECORDS, include_match_details=True)

            # Update statistics
            total_processing_time += result.get('processing_time_ms', 0)
            update_statistics(result, action_counts, confidence_counts)

            # Log to CSV using helper function
            log_success = log_transaction_result(result, csv_writer, record)

            if log_success:
                print(f"  📝 → Written to CSV (row {i})")
            else:
                print(f"  ⚠️ → CSV logging failed for row {i}")

            # Sleep to maintain processing pace
            elapsed = time.time() - record_start
            sleep_time = max(0, PROCESSING_DELAY_SEC - elapsed)
            if sleep_time > 0:
                print(f"  ⏱️ Next record in {sleep_time:.1f}s...")
                time.sleep(sleep_time)

            print()  # Empty line for readability

    # Calculate final statistics
    total_time = time.time() - start_time
    processed = len(streaming_records)  # actual number processed, which may be below the configured target

    print("🎉 Streaming Simulation Complete!")
    print("=" * 50)
    print(f"📊 Processing Summary:")
    print(f"  Records processed: {processed}")
    print(f"  Total time: {total_time:.1f} seconds")
    print(
        f"  Average processing time: {total_processing_time/processed:.0f}ms")
    print(
        f"  Throughput: {processed/total_time:.1f} records/second")

    print(f"\n⚖️ Decision Distribution:")
    for action, count in action_counts.items():
        percentage = (count / processed) * 100
        print(f"  {action}: {count} ({percentage:.1f}%)")

    print(f"\n🎯 Confidence Distribution:")
    for confidence, count in confidence_counts.items():
        percentage = (count / processed) * 100
        print(f"  {confidence}: {count} ({percentage:.1f}%)")

    print(f"\n📁 Results saved to: {csv_file}")

🚀 Starting Streaming MDM Simulation (100 records, per 100ms)

📨 Record 1/100: Sarah Pope (keith98@example.com) - ecommerce Source
  ⚡ Exact matching: 2 matches found
  🔍 Fuzzy matching: 2 matches found
  🧮 Vector matching: 0 matches found
  📋 Business rules: 4 matches found
  📊 Combined score: 0.73 (MEDIUM confidence) → AUTO_MERGE
  🗃️ → AUTO_MERGE Spanner (entity_id: 72d8f45a..., merged with existing 72d8f45a-5812-4b47-8453-813ea9376122)
  ⏱️ Processing time: 1722ms
  📝 → Written to CSV (row 1)

📨 Record 2/100: Tina Ferguson (qvargas@example.org) - crm Source
  ⚡ Exact matching: 4 matches found
  🔍 Fuzzy matching: 2 matches found
  🧮 Vector matching: 0 matches found
  📋 Business rules: 4 matches found
  📊 Combined score: 0.73 (MEDIUM confidence) → AUTO_MERGE
  🗃️ → AUTO_MERGE Spanner (entity_id: 0d88d875..., merged with existing 0d88d875-be30-4ad7-a540-02ec158fe8fa)
  ⏱️ Processing time: 1898ms
  📝 → Written to CSV (row 2)

📨 Record 3/100: Justin Hughes (melissachan@example.org) - eco
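
The pacing logic in the loop sleeps only for whatever is left of the per-record budget after processing. A small sketch of that calculation, with the function name `remaining_delay` chosen here for illustration, makes the consequence visible: once processing takes longer than `PROCESSING_DELAY_SEC`, the sleep drops to zero and throughput is bounded by processing time, not by the configured delay.

```python
PROCESSING_DELAY_SEC = 0.1  # same pacing budget as the notebook configuration


def remaining_delay(elapsed_sec: float, budget_sec: float = PROCESSING_DELAY_SEC) -> float:
    """Time left to sleep so that each record takes at least budget_sec."""
    return max(0.0, budget_sec - elapsed_sec)


# A fast record leaves part of the budget to sleep through.
print(remaining_delay(0.03))
# A record that takes 1.7 s, as in the logged run, leaves nothing to sleep.
print(remaining_delay(1.7))
```

With the processing times of roughly 1.7 to 1.9 s shown in the output above, the loop never sleeps, so it runs at about one record every two seconds and not at ten per second.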

## 9. Analysis and Visualization

Analyze the streaming processing results.

In [11]:
print("📊 Analyzing streaming processing results...")

# Load transaction log
transactions_df = pd.read_csv(csv_file)
print(f"\n📋 Loaded {len(transactions_df)} transaction records")

# Display sample transactions
print("\n🔍 Sample Transactions:")
display(transactions_df[['record_num', 'full_name', 'email',
        'combined_score', 'confidence', 'action', 'processing_time_ms']].head(10))

# Performance analysis
print("\n⚡ Performance Analysis:")
print(
    f"  Average processing time: {transactions_df['processing_time_ms'].mean():.0f}ms")
print(
    f"  Median processing time: {transactions_df['processing_time_ms'].median():.0f}ms")
print(
    f"  95th percentile: {transactions_df['processing_time_ms'].quantile(0.95):.0f}ms")
print(
    f"  Max processing time: {transactions_df['processing_time_ms'].max():.0f}ms")

# Matching effectiveness
print("\n🎯 Matching Effectiveness:")
print(
    f"  Average combined score: {transactions_df['combined_score'].mean():.3f}")
print(
    f"  Records with exact matches: {(transactions_df['exact_matches'] > 0).sum()}")
print(
    f"  Records with fuzzy matches: {(transactions_df['fuzzy_matches'] > 0).sum()}")
print(
    f"  Records with vector matches: {(transactions_df['vector_matches'] > 0).sum()}")
print(
    f"  Records with business matches: {(transactions_df['business_matches'] > 0).sum()}")

📊 Analyzing streaming processing results...

📋 Loaded 100 transaction records

🔍 Sample Transactions:


Unnamed: 0,record_num,full_name,email,combined_score,confidence,action,processing_time_ms
0,1,Sarah Pope,keith98@example.com,0.73,MEDIUM,AUTO_MERGE,1722.106457
1,2,Tina Ferguson,qvargas@example.org,0.73,MEDIUM,AUTO_MERGE,1898.457527
2,3,Justin Hughes,melissachan@example.org,0.72,MEDIUM,AUTO_MERGE,1695.567846
3,4,Vincent Herrera,jennifercarter@example.org,0.169297,LOW,CREATE_NEW,1712.272167
4,5,Heidi Spencer,robertbentley@example.net,0.72,MEDIUM,AUTO_MERGE,1728.089571
5,6,Michael Ball,wbishop@example.net,0.0,LOW,CREATE_NEW,1827.126265
6,7,Douglas Taylor,julie69@yahoo.com,0.73,MEDIUM,AUTO_MERGE,1966.059685
7,8,Michael Norton,watsonrichard@example.net,0.73,MEDIUM,AUTO_MERGE,1989.791155
8,9,Jeffrey Black,finleycasey@outlook.com,0.73,MEDIUM,AUTO_MERGE,2003.081322
9,10,Julie Thompson,opatel@example.net,0.73,MEDIUM,AUTO_MERGE,2008.635998



⚡ Performance Analysis:
  Average processing time: 1922ms
  Median processing time: 1943ms
  95th percentile: 2030ms
  Max processing time: 2480ms

🎯 Matching Effectiveness:
  Average combined score: 0.660
  Records with exact matches: 84
  Records with fuzzy matches: 88
  Records with vector matches: 83
  Records with business matches: 87


In [12]:
# Create visualizations
print("📈 Creating performance visualizations...")

# 2x2 grid: two histograms (processing time, combined score) and two pie charts (action, confidence)
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Processing Time Distribution', 'Action Distribution',
                    'Confidence Distribution', 'Combined Score Distribution'),
    specs=[[{'type': 'histogram'}, {'type': 'pie'}],
           [{'type': 'pie'}, {'type': 'histogram'}]]
)

# Processing time histogram
fig.add_trace(
    go.Histogram(
        x=transactions_df['processing_time_ms'], name='Processing Time (ms)'),
    row=1, col=1
)

# Action distribution pie chart
action_counts = transactions_df['action'].value_counts()
fig.add_trace(
    go.Pie(labels=action_counts.index,
           values=action_counts.values, name='Actions'),
    row=1, col=2
)

# Confidence distribution pie chart
confidence_counts = transactions_df['confidence'].value_counts()
fig.add_trace(
    go.Pie(labels=confidence_counts.index,
           values=confidence_counts.values, name='Confidence'),
    row=2, col=1
)

# Combined score histogram
fig.add_trace(
    go.Histogram(x=transactions_df['combined_score'], name='Combined Score'),
    row=2, col=2
)

fig.update_layout(
    title_text="Streaming MDM Performance Analysis", showlegend=False)
fig.show()

print("✅ Visualizations created!")

📈 Creating performance visualizations...


✅ Visualizations created!


In [13]:
# Strategy effectiveness analysis
print("🎯 4-Strategy Effectiveness Analysis:")

strategy_stats = pd.DataFrame({
    'Strategy': ['Exact', 'Fuzzy', 'Vector', 'Business'],
    'Records_with_Matches': [
        (transactions_df['exact_matches'] > 0).sum(),
        (transactions_df['fuzzy_matches'] > 0).sum(),
        (transactions_df['vector_matches'] > 0).sum(),
        (transactions_df['business_matches'] > 0).sum()
    ],
    'Average_Score': [
        transactions_df['exact_score'].mean(),
        transactions_df['fuzzy_score'].mean(),
        transactions_df['vector_score'].mean(),
        transactions_df['business_score'].mean()
    ]
})

display(strategy_stats)

# Strategy effectiveness chart
fig = px.bar(
    strategy_stats,
    x='Strategy',
    y='Records_with_Matches',
    title='4-Strategy Matching Effectiveness',
    labels={'Records_with_Matches': 'Records with Matches'}
)
fig.show()

print("✅ Strategy analysis complete!")

🎯 4-Strategy Effectiveness Analysis:


Unnamed: 0,Strategy,Records_with_Matches,Average_Score
0,Exact,84,0.84
1,Fuzzy,88,0.874365
2,Vector,83,0.185301
3,Business,87,0.25


✅ Strategy analysis complete!


## 10. Final Golden Record Analysis

Analyze the final state of golden records in Spanner.

In [14]:
print("🏆 Analyzing final golden record state...")

# Get final golden record count
final_count = spanner_helper.get_table_count("golden_entities")
print(f"\n📊 Final golden entities count: {final_count}")

# Analyze processing paths
path_query = """
SELECT processing_path, COUNT(*) as count
FROM golden_entities
GROUP BY processing_path
ORDER BY count DESC
"""

path_df = spanner_helper.execute_sql(path_query)
if not path_df.empty:
    path_df.columns = ['processing_path', 'count']
    print("\n🔄 Processing Path Distribution:")
    display(path_df)

# Analyze source record counts
source_query = """
SELECT source_record_count, COUNT(*) as entities
FROM golden_entities
GROUP BY source_record_count
ORDER BY source_record_count
"""

source_df = spanner_helper.execute_sql(source_query)
if not source_df.empty:
    source_df.columns = ['source_record_count', 'entities']
    print("\n📈 Source Record Count Distribution:")
    display(source_df)

# Show sample updated records
updated_query = """
SELECT entity_id, master_name, master_email, source_record_count,
       processing_path, updated_at
FROM golden_entities
WHERE processing_path = 'stream'
ORDER BY updated_at DESC
LIMIT 10
"""

updated_df = spanner_helper.execute_sql(updated_query)
if not updated_df.empty:
    updated_df.columns = ['entity_id', 'master_name',
                          'master_email', 'source_count', 'path', 'updated_at']
    print("\n🔄 Sample Updated Records (Streaming):")
    display(updated_df)

print("\n✅ Golden record analysis complete!")

🏆 Analyzing final golden record state...

📊 Final golden entities count: 282

🔄 Processing Path Distribution:


Unnamed: 0,processing_path,count
0,batch_migrated,265
1,stream,17



📈 Source Record Count Distribution:


Unnamed: 0,source_record_count,entities
0,1,217
1,2,49
2,3,15
3,5,1



🔄 Sample Updated Records (Streaming):


Unnamed: 0,entity_id,master_name,master_email,source_count,path,updated_at
0,6bb82b5f-c4de-4964-824a-a60664d2fa88,JENNY LEE,nicholas99@example.org,1,stream,2025-09-24 00:11:13.346612+00:00
1,324517a1-5359-407d-9439-ba1372dc984f,MICHELLE ANDERSON,kristy39@example.com,2,stream,2025-09-24 00:11:11.611483+00:00
2,46dfe130-d61c-42d1-b384-53999b6e6676,THOMAS SANTOS,manuel01@example.net,1,stream,2025-09-24 00:11:07.725426+00:00
3,5d7f3e60-01c2-4f24-8b66-c87696ac797f,PATRICIA BUSH,kdunlap@example.net,2,stream,2025-09-24 00:11:05.969990+00:00
4,f4c15377-cf29-4b09-9187-3ed480a1f435,VANESSA REED,xanderson@example.net,2,stream,2025-09-24 00:10:56.459707+00:00
5,7f33b779-63a0-4baf-9b00-8dbb90629579,ELAINE NELSON,robertroach@example.net,2,stream,2025-09-24 00:10:42.862195+00:00
6,24f6520c-4ae4-44c1-b560-6cc040878672,LISA SAWYER,michael56@example.net,1,stream,2025-09-24 00:10:22.092090+00:00
7,c226b643-d8b3-4f3e-8b91-63f8e662d565,VINCENT HERRERA,jennifercarter@example.org,2,stream,2025-09-24 00:10:16.456482+00:00
8,225b974e-eead-45f6-8419-e18555b07d4d,NICHOLAS CHAVEZ,mitchellgriffith@outlook.com,1,stream,2025-09-24 00:10:08.700013+00:00
9,bf2ea1e3-bc9c-4455-807b-8dfbb9d1315d,ERIC YU,yorkcasey@outlook.com,1,stream,2025-09-24 00:10:04.955328+00:00



✅ Golden record analysis complete!


## 11. Performance Metrics and Summary

Calculate key performance indicators for the 4-strategy streaming MDM pipeline.

In [15]:
print("📈 Calculating 4-Strategy Streaming MDM Performance Metrics...")

# Overall pipeline statistics
initial_golden_count = golden_count if 'golden_count' in locals() else 0
final_golden_count = spanner_helper.get_table_count("golden_entities")
new_entities_created = final_golden_count - initial_golden_count

print(f"\n📊 Pipeline Statistics:")
print(f"  Initial golden records (from BigQuery): {initial_golden_count}")
print(f"  Streaming records processed: {NUM_STREAMING_RECORDS}")
print(f"  Final golden records: {final_golden_count}")
print(f"  Net new entities created: {new_entities_created}")
print(
    f"  Entity consolidation rate: {((NUM_STREAMING_RECORDS - new_entities_created) / NUM_STREAMING_RECORDS * 100):.1f}%")

# Performance metrics from transactions
if 'transactions_df' in locals() and not transactions_df.empty:
    print(f"\n⚡ Performance Metrics:")
    print(
        f"  Average processing time: {transactions_df['processing_time_ms'].mean():.0f}ms")
    print(
        f"  Sub-second guarantee: {(transactions_df['processing_time_ms'] < 1000).sum()}/{len(transactions_df)} ({(transactions_df['processing_time_ms'] < 1000).mean()*100:.1f}%)")
    print(
        f"  Target <400ms: {(transactions_df['processing_time_ms'] < 400).sum()}/{len(transactions_df)} ({(transactions_df['processing_time_ms'] < 400).mean()*100:.1f}%)")

    print(f"\n🎯 4-Strategy Matching Results:")
    print(
        f"  Auto-merge rate: {action_counts.get('AUTO_MERGE', 0)}/{NUM_STREAMING_RECORDS} ({action_counts.get('AUTO_MERGE', 0)/NUM_STREAMING_RECORDS*100:.1f}%)")
    print(
        f"  New entity rate: {action_counts.get('CREATE_NEW', 0)}/{NUM_STREAMING_RECORDS} ({action_counts.get('CREATE_NEW', 0)/NUM_STREAMING_RECORDS*100:.1f}%)")
    print(
        f"  Average confidence score: {transactions_df['combined_score'].mean():.3f}")

print("\n✅ Performance analysis complete!")

📈 Calculating 4-Strategy Streaming MDM Performance Metrics...

📊 Pipeline Statistics:
  Initial golden records (from BigQuery): 265
  Streaming records processed: 100
  Final golden records: 282
  Net new entities created: 17
  Entity consolidation rate: 83.0%

⚡ Performance Metrics:
  Average processing time: 1922ms
  Sub-second guarantee: 0/100 (0.0%)
  Target <400ms: 0/100 (0.0%)

🎯 4-Strategy Matching Results:
  Auto-merge rate: 83/100 (83.0%)
  New entity rate: 17/100 (17.0%)
  Average confidence score: 0.660

✅ Performance analysis complete!
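
The consolidation rate is the share of streaming records absorbed into existing entities instead of creating new ones. As a worked check with the counts reported above (the helper name `consolidation_rate` is introduced here for illustration):

```python
def consolidation_rate(initial_count: int, final_count: int, processed: int) -> float:
    """Percentage of processed records that did not create a new entity."""
    if processed <= 0:
        raise ValueError("processed must be positive")
    new_entities = final_count - initial_count
    return 100 * (processed - new_entities) / processed


# Counts from the run above: 265 golden records before, 282 after, 100 streamed.
rate = consolidation_rate(265, 282, 100)
print(f"{rate:.1f}%")
```

The 17 net new entities equal the 17 `CREATE_NEW` decisions, so in this run the consolidation rate and the auto-merge rate coincide at 83%.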


## 12. Cleanup and Cost Management

Optional cleanup to avoid ongoing Spanner charges.

In [16]:
print("🧹 Demo Cleanup Options:")
print("=" * 50)
print(f"💰 Current Spanner instance cost: ~$65/month for {INSTANCE_ID}")
print(f"📊 Processing units: 100 (regional)")
print(f"🗃️ Database: {DATABASE_ID}")
print()
print("⚠️ To avoid ongoing charges, you can delete the Spanner instance:")
print(f"   gcloud spanner instances delete {INSTANCE_ID} --quiet")
print()
print("📁 Results preserved in:")
print(f"   {csv_file}")
print(f"   This notebook with all outputs")
print()
print("💡 The BigQuery golden records remain unchanged for future use.")
print("✅ Streaming MDM demo completed successfully!")

🧹 Demo Cleanup Options:
💰 Current Spanner instance cost: ~$65/month for mdm-streaming-demo
📊 Processing units: 100 (regional)
🗃️ Database: mdm_streaming

⚠️ To avoid ongoing charges, you can delete the Spanner instance:
   gcloud spanner instances delete mdm-streaming-demo --quiet

📁 Results preserved in:
   results/streaming_transactions.csv
   This notebook with all outputs

💡 The BigQuery golden records remain unchanged for future use.
✅ Streaming MDM demo completed successfully!
