# Master Data Management (MDM) - BigQuery Native Batch Processing

This notebook demonstrates a complete end-to-end Master Data Management pipeline using BigQuery's native capabilities:

- **Data Generation**: Create realistic sample data with duplicates and variations
- **Data Ingestion**: Load data into BigQuery from multiple sources
- **Data Standardization**: Clean and normalize data using SQL
- **Embedding Generation**: Use BigQuery ML with `gemini-embedding-001`
- **Vector Indexing**: Create vector indexes for fast similarity search
- **Entity Matching**: Implement exact, fuzzy, vector, business rules, and AI natural language matching
- **Confidence Scoring**: Calculate match confidence and make decisions
- **Golden Record Creation**: Generate master entities with survivorship rules
- **Analysis & Visualization**: Analyze results and performance

## Architecture Overview

This implementation follows the batch processing path from the MDM architecture:
1. **Files/APIs/Databases** → **BigQuery Raw Tables**
2. **BigQuery Standardization** → **BigQuery Staging**
3. **BigQuery ML Embeddings** → **BigQuery with Embeddings**
4. **BigQuery Vector Search** → **Unified Matching Engine**
5. **Confidence Scoring** → **Golden Record Creation**
6. **Master Entities** → **Analytics & Distribution**

## 1. Setup and Configuration

In [None]:
# Import required libraries
from bigquery_utils import (
    BigQueryMDMHelper,
    generate_standardization_sql,
    generate_union_sql,
    generate_embedding_sql,
    generate_exact_matching_sql,
    generate_fuzzy_matching_sql,
    generate_vector_matching_sql,
    generate_business_rules_sql,
    generate_combined_scoring_sql,
    generate_ai_natural_language_matching_sql
)
from data_generator import MDMDataGenerator
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from google.cloud import bigquery
from google.auth import default
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries imported successfully")

In [None]:
# Configuration
PROJECT_ID = "your-gcp-project-id"  # Replace with your GCP project ID
DATASET_ID = "mdm_demo"
LOCATION = "US"

# Initialize BigQuery helper
try:
    bq_helper = BigQueryMDMHelper(PROJECT_ID, DATASET_ID)
    print(f"✅ Connected to BigQuery project: {PROJECT_ID}")
    print(f"📊 Dataset: {bq_helper.dataset_ref}")
except Exception as e:
    print(f"❌ Error connecting to BigQuery: {e}")
    print("Please ensure you have:")
    print("1. Set up Google Cloud authentication")
    print("2. Enabled BigQuery API")
    print("3. Updated PROJECT_ID above")

## 2. Generate Sample Data

Create realistic customer data from multiple sources with intentional duplicates and variations.
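
The internals of `MDMDataGenerator` are not shown in this notebook. As a rough illustration of what "intentional duplicates and variations" can mean, the sketch below derives a near-duplicate from a base record by changing exactly one field. The function `make_variant` and the four variation types are hypothetical, not the generator's actual logic.

```python
import random


def make_variant(record: dict, rng: random.Random) -> dict:
    """Return a copy of `record` with exactly one field altered (hypothetical rules)."""
    variant = dict(record)
    choice = rng.choice(["name_case", "email_case", "phone_format", "drop_phone"])
    if choice == "name_case":
        # Same name, different casing
        variant["full_name"] = record["full_name"].upper()
    elif choice == "email_case":
        # Same address, different casing
        variant["email"] = record["email"].upper()
    elif choice == "phone_format":
        # Same digits, different punctuation
        digits = "".join(ch for ch in record["phone"] if ch.isdigit())
        variant["phone"] = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    else:
        # Missing value, as often happens in a secondary source system
        variant["phone"] = None
    return variant


rng = random.Random(42)  # fixed seed so the sample is reproducible
base = {"full_name": "Jane Doe", "email": "jane.doe@example.com", "phone": "555-123-4567"}
variants = [make_variant(base, rng) for _ in range(5)]
for v in variants:
    print(v)
```

Each variant still refers to the same customer, which is what the matching steps later in the notebook are meant to recover.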

In [None]:
# Generate sample data
print("🔄 Generating sample customer data...")
generator = MDMDataGenerator(num_unique_customers=120)
datasets = generator.generate_all_datasets()

# Display summary statistics
print("\n📈 Dataset Summary:")
total_records = 0
for source, df in datasets.items():
    print(f"  {source.upper()}: {len(df):,} records")
    total_records += len(df)

print(f"\n📊 Total records: {total_records:,}")
print(f"👥 Unique customers: {generator.num_unique_customers:,}")
print(
    f"🔄 Duplication factor: {total_records / generator.num_unique_customers:.2f}x")

# Show sample records from each source
print("\n🔍 Sample Records:")
for source, df in datasets.items():
    print(f"\n{source.upper()} Sample:")
    display(df[['record_id', 'full_name', 'email',
            'phone', 'address', 'source_system']].head(3))

## 3. Data Ingestion to BigQuery

Load the generated data into BigQuery raw tables.

In [None]:
# Create dataset
print("🔄 Creating BigQuery dataset...")
bq_helper.create_dataset()

# Load data to BigQuery
print("\n🔄 Loading data to BigQuery...")
for source, df in datasets.items():
    table_name = f"raw_{source}_customers"
    print(f"  Loading {source} data to {table_name}...")
    bq_helper.load_dataframe_to_table(df, table_name)

print("\n✅ Data ingestion completed!")

# Verify data loading
print("\n📊 Table Information:")
for source in datasets.keys():
    table_name = f"raw_{source}_customers"
    info = bq_helper.get_table_info(table_name)
    if info:
        print(
            f"  {table_name}: {info['num_rows']:,} rows, {info['num_bytes']:,} bytes")

## 4. Data Standardization

Clean and standardize data from all sources using BigQuery SQL.
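
The SQL produced by `generate_standardization_sql` is not shown here. The following Python sketch illustrates the kind of rules that typically sit behind columns such as `email_clean`, `phone_clean`, and `full_name_clean`. The exact rules (lowercasing emails, keeping 10-digit phone numbers, uppercasing names) are assumptions for illustration and may differ from what the helper generates.

```python
import re


def clean_email(email):
    """Trim and lowercase; treat values without '@' as missing (assumed rule)."""
    if email is None:
        return None
    cleaned = email.strip().lower()
    return cleaned if "@" in cleaned else None


def clean_phone(phone):
    """Keep digits only; accept 10 digits, or 11 digits with a leading 1 (assumed rule)."""
    if phone is None:
        return None
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def clean_name(name):
    """Collapse runs of whitespace and uppercase for case-insensitive comparison."""
    if name is None:
        return None
    collapsed = re.sub(r"\s+", " ", name).strip()
    return collapsed.upper() or None


print(clean_email("  John.Doe@Example.COM "))
print(clean_phone("+1 (555) 123-4567"))
print(clean_name("  jane   DOE "))
```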

In [None]:
# Combine all raw data into a single table
print("🔄 Combining raw data from all sources...")

combine_sql = generate_union_sql(bq_helper.dataset_ref)

bq_helper.execute_query(combine_sql)
print("✅ Raw data combined")

# Standardize the combined data
print("\n🔄 Standardizing data...")
standardization_sql = generate_standardization_sql(
    f"{bq_helper.dataset_ref}.raw_customers_combined",
    f"{bq_helper.dataset_ref}.customers_standardized"
)

bq_helper.execute_query(standardization_sql)
print("✅ Data standardization completed")

# Show standardization results
sample_query = f"""
SELECT
  record_id,
  source_system,
  full_name,
  full_name_clean,
  email,
  email_clean,
  phone,
  phone_clean,
  address,
  address_clean
FROM `{bq_helper.dataset_ref}.customers_standardized`
LIMIT 5
"""

sample_df = bq_helper.execute_query(sample_query)
print("\n🔍 Standardization Sample:")
display(sample_df)

## 5. Embedding Generation with BigQuery ML

Generate embeddings with BigQuery ML's `ML.GENERATE_EMBEDDING` function, using a remote model that points at the `gemini-embedding-001` endpoint.
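
For orientation, here is a minimal sketch of what an embedding statement can look like. It is not the output of `generate_embedding_sql`; the function name `build_embedding_sql` and the choice of fields concatenated into `content` are assumptions. `ML.GENERATE_EMBEDDING` expects the input query to expose a `content` column, and with `flatten_json_output` enabled the vector is expected in the `ml_generate_embedding_result` column used later in this notebook.

```python
def build_embedding_sql(source_table: str, target_table: str, model: str) -> str:
    """Build a CREATE TABLE statement that embeds one text string per record.

    Hypothetical helper: the concatenated fields are an illustrative choice.
    """
    return f"""
CREATE OR REPLACE TABLE `{target_table}` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `{model}`,
  (
    SELECT
      *,
      -- Text to embed: one string per record built from the cleaned fields
      CONCAT(
        COALESCE(full_name_clean, ''), ' ',
        COALESCE(email_clean, ''), ' ',
        COALESCE(address_clean, '')
      ) AS content
    FROM `{source_table}`
  ),
  STRUCT(TRUE AS flatten_json_output)
)
""".strip()


sql = build_embedding_sql(
    "my-project.mdm_demo.customers_standardized",    # placeholder table names
    "my-project.mdm_demo.customers_with_embeddings",
    "my-project.mdm_demo.embedding_model",
)
print(sql)
```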

In [None]:
# Create embedding model
print("\n🔄 Creating embedding model...")
model_sql = f"""
CREATE OR REPLACE MODEL `{bq_helper.dataset_ref}.embedding_model`
REMOTE WITH CONNECTION DEFAULT
OPTIONS(
  ENDPOINT = 'gemini-embedding-001'
)
"""

try:
    bq_helper.execute_query(model_sql)
    print("✅ Embedding model created successfully")
except Exception as e:
    print(f"❌ Error creating model: {e}")
    print("Please ensure:")
    print("1. You have necessary permissions")
    print("2. Vertex AI API is enabled")

In [None]:
# Generate embeddings
print("🔄 Generating embeddings...")
embedding_sql = generate_embedding_sql(
    f"{bq_helper.dataset_ref}.customers_standardized",
    f"{bq_helper.dataset_ref}.customers_with_embeddings",
    f"{bq_helper.dataset_ref}.embedding_model"
)

try:
    bq_helper.execute_query(embedding_sql)
    print("✅ Embeddings generated successfully")

    # Check embedding coverage and dimension.
    # Count over all rows: a missing embedding may be stored as an empty
    # array rather than NULL, so test the array length instead.
    check_sql = f"""
    SELECT
      COUNT(*) AS total_records,
      COUNTIF(ARRAY_LENGTH(ml_generate_embedding_result) > 0) AS records_with_embeddings,
      MAX(ARRAY_LENGTH(ml_generate_embedding_result)) AS embedding_dimension
    FROM `{bq_helper.dataset_ref}.customers_with_embeddings`
    """

    result = bq_helper.execute_query(check_sql)
    if not result.empty:
        print(f"📊 Embedding Statistics:")
        print(f"  Total records: {result.iloc[0]['total_records']:,}")
        print(
            f"  Records with embeddings: {result.iloc[0]['records_with_embeddings']:,}")
        print(
            f"  Embedding dimension: {result.iloc[0]['embedding_dimension']}")

except Exception as e:
    print(f"❌ Error generating embeddings: {e}")
    print("This might be due to:")
    print("1. Insufficient permissions")
    print("2. API quotas or limits")

## 6. Vector Index Creation

Create vector indexes for efficient similarity search.
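
To make the idea concrete, the sketch below shows what an unindexed (brute-force) cosine search does: it compares the query vector against every stored vector and keeps the closest ones. An index avoids this full scan on large tables. The record IDs and two-dimensional vectors are made-up illustration data.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, candidates, k=2):
    """Brute-force search: score every candidate, return the k closest."""
    scored = [(cosine_distance(query, vec), rid) for rid, vec in candidates.items()]
    scored.sort()
    return [(rid, dist) for dist, rid in scored[:k]]


# Made-up embeddings keyed by record ID
embeddings = {"crm_1": [1.0, 0.0], "erp_7": [0.9, 0.1], "web_3": [0.0, 1.0]}
nearest = top_k([1.0, 0.0], embeddings, k=2)
print(nearest)
```

The cost of this scan grows linearly with the number of rows, which is acceptable for a few hundred sample records.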

In [None]:
# Note: creating an IVF vector index requires a table with at least 5,000 rows.
# The sample dataset is far smaller, so the CREATE VECTOR INDEX statement below
# is expected to fail here. Vector search still works without an index by
# scanning all rows, which is efficient enough for small datasets.

# Attempt to create a vector index for fast similarity search
print("🔄 Creating vector index...")

vector_index_sql = f"""
CREATE VECTOR INDEX IF NOT EXISTS customer_embedding_index
ON `{bq_helper.dataset_ref}.customers_with_embeddings`(ml_generate_embedding_result)
OPTIONS(
  index_type = 'IVF',
  distance_type = 'COSINE'
)
"""

try:
    bq_helper.execute_query(vector_index_sql)
    print("✅ Vector index created successfully")
    print("📈 This will significantly speed up vector similarity searches")
except Exception as e:
    print(f"⚠️ Vector index creation failed: {e}")
    print("Vector search will still work without an index, using a full scan")
    print("Note: IVF vector indexes require at least 5,000 rows in the base table")

## 7. Entity Matching

Implement multiple matching strategies using BigQuery SQL:
- **Exact Matching**: Direct field comparison
- **Fuzzy Matching**: String similarity algorithms
- **Vector Matching**: Semantic similarity using embeddings
- **Business Rules**: Domain-specific logic
- **AI Natural Language**: Direct AI comparison using Gemini 2.5 Pro
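
As an illustration of the fuzzy strategy, the sketch below turns Levenshtein edit distance into a similarity score between 0 and 1. BigQuery provides an `EDIT_DISTANCE` function that can serve the same purpose in SQL; whether `generate_fuzzy_matching_sql` uses it, and how it normalizes the result, is an assumption here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_score(a: str, b: str) -> float:
    """Normalize edit distance by the longer string: 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


print(edit_distance("kitten", "sitting"))
print(fuzzy_score("JOHN SMITH", "JON SMITH"))
```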

In [None]:
# 7.1 Exact Matching
print("🔄 Running exact matching...")
exact_sql = generate_exact_matching_sql(
    f"{bq_helper.dataset_ref}.customers_with_embeddings")
bq_helper.execute_query(exact_sql)
print("✅ Exact matching completed")

# Check exact match results
exact_count_sql = f"""
SELECT
  COUNT(*) as total_exact_matches,
  COUNT(CASE WHEN email_exact_score > 0 THEN 1 END) as email_matches,
  COUNT(CASE WHEN phone_exact_score > 0 THEN 1 END) as phone_matches,
  COUNT(CASE WHEN id_exact_score > 0 THEN 1 END) as id_matches
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_exact_matches`
"""

exact_stats = bq_helper.execute_query(exact_count_sql)
print(
    f"📊 Exact Match Results: {exact_stats.iloc[0]['total_exact_matches']} total matches")
print(f"  📧 Email matches: {exact_stats.iloc[0]['email_matches']}")
print(f"  📞 Phone matches: {exact_stats.iloc[0]['phone_matches']}")
print(f"  🆔 ID matches: {exact_stats.iloc[0]['id_matches']}")

In [None]:
# 7.2 Fuzzy Matching
print("🔄 Running fuzzy matching...")
fuzzy_sql = generate_fuzzy_matching_sql(
    f"{bq_helper.dataset_ref}.customers_with_embeddings")
bq_helper.execute_query(fuzzy_sql)
print("✅ Fuzzy matching completed")

# Check fuzzy match results
fuzzy_count_sql = f"""
SELECT
  COUNT(*) as total_fuzzy_matches,
  AVG(name_fuzzy_score) as avg_name_score,
  AVG(address_fuzzy_score) as avg_address_score,
  AVG(fuzzy_overall_score) as avg_overall_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_fuzzy_matches`
"""

fuzzy_stats = bq_helper.execute_query(fuzzy_count_sql)
print(
    f"📊 Fuzzy Match Results: {fuzzy_stats.iloc[0]['total_fuzzy_matches']} total matches")
print(f"  👤 Avg name score: {fuzzy_stats.iloc[0]['avg_name_score']:.3f}")
print(f"  🏠 Avg address score: {fuzzy_stats.iloc[0]['avg_address_score']:.3f}")
print(f"  📈 Avg overall score: {fuzzy_stats.iloc[0]['avg_overall_score']:.3f}")

In [None]:
# 7.3 Vector Matching
print("🔄 Running vector similarity matching...")
vector_sql = generate_vector_matching_sql(
    f"{bq_helper.dataset_ref}.customers_with_embeddings")

try:
    bq_helper.execute_query(vector_sql)
    print("✅ Vector matching completed")

    # Check vector match results
    vector_count_sql = f"""
    SELECT
      COUNT(*) as total_vector_matches,
      AVG(vector_similarity_score) as avg_similarity,
      MIN(vector_similarity_score) as min_similarity,
      MAX(vector_similarity_score) as max_similarity
    FROM `{bq_helper.dataset_ref}.customers_with_embeddings_vector_matches`
    """

    vector_stats = bq_helper.execute_query(vector_count_sql)
    print(
        f"📊 Vector Match Results: {vector_stats.iloc[0]['total_vector_matches']} total matches")
    print(f"  📈 Avg similarity: {vector_stats.iloc[0]['avg_similarity']:.3f}")
    print(f"  📉 Min similarity: {vector_stats.iloc[0]['min_similarity']:.3f}")
    print(f"  📈 Max similarity: {vector_stats.iloc[0]['max_similarity']:.3f}")

except Exception as e:
    print(f"⚠️ Vector matching failed: {e}")
    print("This might be due to missing embeddings or vector index issues")

In [None]:
# 7.4 Business Rules Matching
print("🔄 Running business rules matching...")
business_sql = generate_business_rules_sql(
    f"{bq_helper.dataset_ref}.customers_with_embeddings")
bq_helper.execute_query(business_sql)
print("✅ Business rules matching completed")

# Check business rules results
business_count_sql = f"""
SELECT
  COUNT(*) as total_business_matches,
  COUNT(CASE WHEN same_company_score > 0 THEN 1 END) as company_matches,
  COUNT(CASE WHEN same_location_score > 0 THEN 1 END) as location_matches,
  COUNT(CASE WHEN age_compatibility_score > 0 THEN 1 END) as age_matches,
  COUNT(CASE WHEN income_compatibility_score > 0 THEN 1 END) as income_matches
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_business_matches`
"""

business_stats = bq_helper.execute_query(business_count_sql)
print(
    f"📊 Business Rules Results: {business_stats.iloc[0]['total_business_matches']} total matches")
print(f"  🏢 Company matches: {business_stats.iloc[0]['company_matches']}")
print(f"  📍 Location matches: {business_stats.iloc[0]['location_matches']}")
print(f"  🎂 Age matches: {business_stats.iloc[0]['age_matches']}")
print(f"  💰 Income matches: {business_stats.iloc[0]['income_matches']}")

In [None]:
# Create Gemini 2.5 Pro model
print("\n🔄 Creating Gemini 2.5 Pro model...")
model_sql = f"""
CREATE OR REPLACE MODEL `{bq_helper.dataset_ref}.gemini_25_pro_model`
REMOTE WITH CONNECTION DEFAULT
OPTIONS(
  ENDPOINT = 'gemini-2.5-pro'
)
"""

try:
    bq_helper.execute_query(model_sql)
    print("✅ Gemini model created successfully")
except Exception as e:
    print(f"❌ Error creating model: {e}")
    print("Please ensure:")
    print("1. You have necessary permissions")
    print("2. Vertex AI API is enabled")

In [None]:
# 7.5 AI Natural Language Matching
print("🤖 Running AI natural language matching...")

# Generate AI natural language matching
ai_sql = generate_ai_natural_language_matching_sql(
    f"{bq_helper.dataset_ref}.customers_with_embeddings",
    f"{bq_helper.dataset_ref}.gemini_25_pro_model"
)
bq_helper.execute_query(ai_sql)
print("✅ AI natural language matching completed")

# Check AI match results
ai_count_sql = f"""
SELECT
  COUNT(*) as total_ai_matches,
  AVG(ai_score) as avg_ai_score,
  AVG(confidence) as avg_confidence,
  MIN(ai_score) as min_ai_score,
  MAX(ai_score) as max_ai_score
FROM `{bq_helper.dataset_ref}.ai_natural_language_matches`
"""

ai_stats = bq_helper.execute_query(ai_count_sql)
print(
    f"📊 AI Natural Language Results: {ai_stats.iloc[0]['total_ai_matches']} total matches")
print(f"  🤖 Avg AI score: {ai_stats.iloc[0]['avg_ai_score']:.3f}")
print(f"  🎯 Avg confidence: {ai_stats.iloc[0]['avg_confidence']:.3f}")
print(f"  📉 Min AI score: {ai_stats.iloc[0]['min_ai_score']:.3f}")
print(f"  📈 Max AI score: {ai_stats.iloc[0]['max_ai_score']:.3f}")

# Show sample AI explanations
sample_explanations_sql = f"""
SELECT
  ai_score,
  confidence,
  explanation
FROM `{bq_helper.dataset_ref}.ai_natural_language_matches`
ORDER BY ai_score DESC
LIMIT 5
"""

explanations = bq_helper.execute_query(sample_explanations_sql)
print("\n🔍 Sample AI Explanations:")
for _, row in explanations.iterrows():
    print(
        f"  Score: {row['ai_score']:.3f} | Confidence: {row['confidence']:.3f}")
    print(f"  Explanation: {row['explanation']}")
    print()

## 8. Combined Scoring and Confidence Assessment

Combine all 5 matching strategies with weighted scoring and calculate confidence levels.
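
The weights and thresholds used by `generate_combined_scoring_sql` are not visible in this notebook. The sketch below shows the general shape of a weighted ensemble with a three-way decision; the weights and both thresholds are hypothetical values chosen for illustration. Only the decision labels (`auto_merge`, `human_review`, `no_match`) come from the queries in this notebook.

```python
# Hypothetical weights (sum to 1.0) and thresholds, for illustration only
WEIGHTS = {"exact": 0.30, "fuzzy": 0.25, "vector": 0.20, "business": 0.15, "ai": 0.10}
AUTO_MERGE_THRESHOLD = 0.8
HUMAN_REVIEW_THRESHOLD = 0.6


def combined_score(scores: dict) -> float:
    """Weighted sum of per-strategy scores; a missing strategy counts as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def match_decision(score: float) -> str:
    """Map a combined score to one of the three decision labels."""
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto_merge"
    if score >= HUMAN_REVIEW_THRESHOLD:
        return "human_review"
    return "no_match"


pair_scores = {"exact": 1.0, "fuzzy": 0.8, "vector": 0.9}  # no business or AI score
score = combined_score(pair_scores)
print(round(score, 2), match_decision(score))
```

Treating a missing strategy as 0 penalizes pairs that only some strategies evaluated; renormalizing over the strategies that are present is an alternative design.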

In [None]:
# Combine the scores from all five matching strategies
print("🔄 Combining match scores from 5 strategies...")
combined_sql = generate_combined_scoring_sql(
    bq_helper.dataset_ref,
    "customers_with_embeddings"
)
bq_helper.execute_query(combined_sql)
print("✅ Combined scoring completed")

# Analyze combined results
analysis_sql = f"""
SELECT
  match_decision,
  confidence_level,
  COUNT(*) as count,
  AVG(combined_score) as avg_score,
  MIN(combined_score) as min_score,
  MAX(combined_score) as max_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
GROUP BY match_decision, confidence_level
ORDER BY avg_score DESC
"""

analysis_df = bq_helper.execute_query(analysis_sql)
print("\n📊 Match Decision Summary:")
display(analysis_df)

# Show top matches
top_matches_sql = f"""
SELECT
  record1_id,
  record2_id,
  source1,
  source2,
  exact_score,
  fuzzy_score,
  vector_score,
  business_score,
  ai_score,
  combined_score,
  match_decision,
  confidence_level
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
ORDER BY combined_score DESC
LIMIT 10
"""

top_matches_df = bq_helper.execute_query(top_matches_sql)
print("\n🏆 Top 10 Matches (5-Strategy Analysis):")
display(top_matches_df)

## 9. Golden Record Creation

Create master entities using survivorship rules and merge decisions.
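
Golden records depend on two steps: grouping matched pairs into clusters, then picking one surviving value per field. The sketch below shows both in plain Python, using union-find so that chains of matches (A matches B, B matches C) end up in one cluster. The function names and sample IDs are illustrative and are not part of the notebook's helpers.

```python
def cluster_records(pairs):
    """Group record IDs into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one for a stable cluster ID
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for record_id in parent:
        clusters.setdefault(find(record_id), []).append(record_id)
    return {root: sorted(members) for root, members in clusters.items()}


def longest_value(values):
    """Survivorship rule 'most complete': the longest non-empty value wins."""
    present = [v for v in values if v]
    return max(present, key=len) if present else None


matched_pairs = [("a", "b"), ("b", "c"), ("d", "e")]
clusters = cluster_records(matched_pairs)
print(clusters)
print(longest_value(["J DOE", None, "JANE DOE"]))
```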

In [None]:
# Create golden records with survivorship rules
print("🔄 Creating golden records...")

golden_record_sql = f"""
CREATE OR REPLACE TABLE `{bq_helper.dataset_ref}.golden_records` AS
WITH match_clusters AS (
  -- Create clusters of matching records
  SELECT
    record1_id,
    record2_id,
    combined_score
  FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
  WHERE match_decision IN ('auto_merge', 'human_review')
),
survivorship_rules AS (
  -- Apply survivorship rules to select best values
  SELECT
    GENERATE_UUID() as master_id,
    ARRAY_AGG(record_id) as source_record_ids,

    -- Name: Most complete (longest)
    ARRAY_AGG(full_name_clean ORDER BY LENGTH(full_name_clean) DESC LIMIT 1)[OFFSET(0)] as master_name,

    -- Email: Most recent and complete
    ARRAY_AGG(email_clean ORDER BY
      CASE WHEN email_clean IS NOT NULL THEN 1 ELSE 0 END DESC,
      processed_at DESC LIMIT 1)[OFFSET(0)] as master_email,

    -- Phone: Most recent and complete
    ARRAY_AGG(phone_clean ORDER BY
      CASE WHEN phone_clean IS NOT NULL THEN 1 ELSE 0 END DESC,
      processed_at DESC LIMIT 1)[OFFSET(0)] as master_phone,

    -- Address: Most complete
    ARRAY_AGG(address_clean ORDER BY LENGTH(address_clean) DESC LIMIT 1)[OFFSET(0)] as master_address,
    ARRAY_AGG(city_clean ORDER BY LENGTH(city_clean) DESC LIMIT 1)[OFFSET(0)] as master_city,
    ARRAY_AGG(state_clean ORDER BY LENGTH(state_clean) DESC LIMIT 1)[OFFSET(0)] as master_state,

    -- Other fields: Most recent
    ARRAY_AGG(company ORDER BY processed_at DESC LIMIT 1)[OFFSET(0)] as master_company,
    ARRAY_AGG(annual_income ORDER BY processed_at DESC LIMIT 1)[OFFSET(0)] as master_income,
    ARRAY_AGG(customer_segment ORDER BY processed_at DESC LIMIT 1)[OFFSET(0)] as master_segment,

    -- Metadata
    COUNT(*) as source_record_count,
    ARRAY_AGG(DISTINCT source_system) as source_systems,
    MIN(registration_date) as first_seen,
    MAX(last_activity_date) as last_activity,
    CURRENT_TIMESTAMP() as created_at

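  FROM `{bq_helper.dataset_ref}.customers_with_embeddings` c
  JOIN (
    -- Simplified clustering: each record joins the cluster named after the
    -- smallest record_id among itself and its direct matches. Chains such as
    -- A-B, B-C are not fully merged; that needs a connected-components pass.
    SELECT record_id, LEAST(record_id, MIN(neighbor_id)) AS cluster_id
    FROM (
      SELECT record1_id AS record_id, record2_id AS neighbor_id FROM match_clusters
      UNION ALL
      SELECT record2_id AS record_id, record1_id AS neighbor_id FROM match_clusters
    )
    GROUP BY record_id
  ) USING (record_id)
  GROUP BY cluster_id  -- one golden record per cluster, not per source row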
)
SELECT * FROM survivorship_rules
"""

try:
    bq_helper.execute_query(golden_record_sql)
    print("✅ Golden records created successfully")

    # Check golden record statistics
    golden_stats_sql = f"""
    SELECT
      COUNT(*) as total_golden_records,
      AVG(source_record_count) as avg_sources_per_record,
      MAX(source_record_count) as max_sources_per_record
    FROM `{bq_helper.dataset_ref}.golden_records`
    """

    golden_stats = bq_helper.execute_query(golden_stats_sql)
    print(f"📊 Golden Record Statistics:")
    print(
        f"  Total golden records: {golden_stats.iloc[0]['total_golden_records']}")
    print(
        f"  Avg sources per record: {golden_stats.iloc[0]['avg_sources_per_record']:.2f}")
    print(
        f"  Max sources per record: {golden_stats.iloc[0]['max_sources_per_record']}")

except Exception as e:
    print(f"❌ Error creating golden records: {e}")

## 10. Analysis and Visualization

Analyze the MDM pipeline results and create visualizations.

In [None]:
# Create comprehensive analysis
print("📊 Analyzing MDM Pipeline Results...")

# Get overall statistics
overall_stats_sql = f"""
WITH stats AS (
  SELECT
    'Raw Records' as stage,
    COUNT(*) as record_count
  FROM `{bq_helper.dataset_ref}.raw_customers_combined`

  UNION ALL

  SELECT
    'Standardized Records' as stage,
    COUNT(*) as record_count
  FROM `{bq_helper.dataset_ref}.customers_standardized`

  UNION ALL

  SELECT
    'Records with Embeddings' as stage,
    COUNT(*) as record_count
  FROM `{bq_helper.dataset_ref}.customers_with_embeddings`
  WHERE ARRAY_LENGTH(ml_generate_embedding_result) > 0

  UNION ALL

  SELECT
    'Golden Records' as stage,
    COUNT(*) as record_count
  FROM `{bq_helper.dataset_ref}.golden_records`
)
SELECT * FROM stats ORDER BY record_count DESC
"""

overall_stats = bq_helper.execute_query(overall_stats_sql)
print("\n📈 Pipeline Statistics:")
display(overall_stats)

# Visualize pipeline flow
fig = px.funnel(
    overall_stats,
    x='record_count',
    y='stage',
    title='MDM Pipeline Data Flow',
    labels={'record_count': 'Number of Records', 'stage': 'Pipeline Stage'}
)
fig.show()

In [None]:
# Analyze 5-strategy matching effectiveness
strategy_analysis_sql = f"""
SELECT
  'Exact Matching' as strategy,
  COUNT(*) as matches_found,
  AVG(exact_score) as avg_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
WHERE exact_score > 0

UNION ALL

SELECT
  'Fuzzy Matching' as strategy,
  COUNT(*) as matches_found,
  AVG(fuzzy_score) as avg_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
WHERE fuzzy_score > 0

UNION ALL

SELECT
  'Vector Matching' as strategy,
  COUNT(*) as matches_found,
  AVG(vector_score) as avg_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
WHERE vector_score > 0

UNION ALL

SELECT
  'Business Rules' as strategy,
  COUNT(*) as matches_found,
  AVG(business_score) as avg_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
WHERE business_score > 0

UNION ALL

SELECT
  'AI Natural Language' as strategy,
  COUNT(*) as matches_found,
  AVG(ai_score) as avg_score
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
WHERE ai_score > 0

ORDER BY matches_found DESC
"""

strategy_stats = bq_helper.execute_query(strategy_analysis_sql)
print("\n🎯 5-Strategy Matching Effectiveness:")
display(strategy_stats)

# Create visualization
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Matches Found by Strategy', 'Average Score by Strategy'),
    specs=[[{'type': 'bar'}, {'type': 'bar'}]]
)

fig.add_trace(
    go.Bar(x=strategy_stats['strategy'],
           y=strategy_stats['matches_found'], name='Matches'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=strategy_stats['strategy'],
           y=strategy_stats['avg_score'], name='Avg Score'),
    row=1, col=2
)

fig.update_layout(title_text="5-Strategy Matching Analysis", showlegend=False)
fig.show()

In [None]:
# Analyze confidence distribution
confidence_dist_sql = f"""
SELECT
  ROUND(combined_score, 1) as score_bucket,
  COUNT(*) as count,
  match_decision
FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
GROUP BY score_bucket, match_decision
ORDER BY score_bucket
"""

confidence_dist = bq_helper.execute_query(confidence_dist_sql)
print("\n📊 Confidence Score Distribution:")
display(confidence_dist.head(10))

# Create confidence distribution plot
# The counts are already aggregated per bucket, so plot them as bars
fig = px.bar(
    confidence_dist,
    x='score_bucket',
    y='count',
    color='match_decision',
    title='Distribution of Match Confidence Scores (5-Strategy)',
    labels={'score_bucket': 'Confidence Score', 'count': 'Number of Matches'}
)
fig.show()

## 11. Performance Metrics and Summary

Calculate key performance indicators for the 5-strategy MDM pipeline.
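
The two data quality measures used below are completeness (share of non-null values) and uniqueness (distinct values divided by non-null values). A small Python sketch with made-up emails shows the arithmetic, including the empty-input case that would otherwise cause a division by zero.

```python
def completeness(values):
    """Share of entries that are not None; None if there are no entries."""
    if not values:
        return None
    return sum(v is not None for v in values) / len(values)


def uniqueness(values):
    """Distinct non-null values divided by all non-null values."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return len(set(present)) / len(present)


emails = ["a@x.com", "a@x.com", None, "b@x.com"]  # made-up sample
print(f"completeness: {completeness(emails):.2%}")
print(f"uniqueness:   {uniqueness(emails):.2%}")
```

A uniqueness well below 100% on a field like email is a direct signal of duplicates in the source data.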

In [None]:
# Calculate key metrics
print("📈 Calculating 5-Strategy MDM Performance Metrics...")

# Data quality metrics
quality_metrics_sql = f"""
WITH quality_stats AS (
  SELECT
    COUNT(*) as total_records,
    -- SAFE_DIVIDE returns NULL instead of failing when a denominator is 0
    SAFE_DIVIDE(COUNTIF(email_clean IS NOT NULL), COUNT(*)) as email_completeness,
    SAFE_DIVIDE(COUNTIF(phone_clean IS NOT NULL), COUNT(*)) as phone_completeness,
    SAFE_DIVIDE(COUNTIF(address_clean IS NOT NULL), COUNT(*)) as address_completeness,
    SAFE_DIVIDE(COUNT(DISTINCT email_clean), COUNTIF(email_clean IS NOT NULL)) as email_uniqueness,
    SAFE_DIVIDE(COUNT(DISTINCT phone_clean), COUNTIF(phone_clean IS NOT NULL)) as phone_uniqueness
  FROM `{bq_helper.dataset_ref}.customers_standardized`
)
SELECT
  total_records,
  ROUND(email_completeness * 100, 2) as email_completeness_pct,
  ROUND(phone_completeness * 100, 2) as phone_completeness_pct,
  ROUND(address_completeness * 100, 2) as address_completeness_pct,
  ROUND(email_uniqueness * 100, 2) as email_uniqueness_pct,
  ROUND(phone_uniqueness * 100, 2) as phone_uniqueness_pct
FROM quality_stats
"""

quality_metrics = bq_helper.execute_query(quality_metrics_sql)
print("\n📊 Data Quality Metrics:")
display(quality_metrics)

# Matching effectiveness
matching_metrics_sql = f"""
WITH matching_stats AS (
  SELECT
    COUNT(*) as total_potential_matches,
    COUNT(CASE WHEN match_decision = 'auto_merge' THEN 1 END) as auto_merge_count,
    COUNT(CASE WHEN match_decision = 'human_review' THEN 1 END) as human_review_count,
    COUNT(CASE WHEN match_decision = 'no_match' THEN 1 END) as no_match_count,
    AVG(combined_score) as avg_combined_score
  FROM `{bq_helper.dataset_ref}.customers_with_embeddings_combined_matches`
)
SELECT
  total_potential_matches,
  auto_merge_count,
  human_review_count,
  no_match_count,
  ROUND(SAFE_DIVIDE(auto_merge_count, total_potential_matches) * 100, 2) as auto_merge_rate_pct,
  ROUND(SAFE_DIVIDE(human_review_count, total_potential_matches) * 100, 2) as human_review_rate_pct,
  ROUND(avg_combined_score, 3) as avg_combined_score
FROM matching_stats
"""

matching_metrics = bq_helper.execute_query(matching_metrics_sql)
print("\n🎯 5-Strategy Matching Effectiveness:")
display(matching_metrics)

In [None]:
# Final summary
print("\n" + "="*60)
print("🎉 5-STRATEGY MDM PIPELINE EXECUTION SUMMARY")
print("="*60)

print(f"\n📊 DATA PROCESSING:")
print(
    f"  • Generated {total_records:,} sample records from {len(datasets)} sources")
print(f"  • Representing {generator.num_unique_customers:,} unique customers")
print(
    f"  • Duplication factor: {total_records / generator.num_unique_customers:.2f}x")

if not quality_metrics.empty:
    print(f"\n📈 DATA QUALITY:")
    print(
        f"  • Email completeness: {quality_metrics.iloc[0]['email_completeness_pct']:.1f}%")
    print(
        f"  • Phone completeness: {quality_metrics.iloc[0]['phone_completeness_pct']:.1f}%")
    print(
        f"  • Address completeness: {quality_metrics.iloc[0]['address_completeness_pct']:.1f}%")

if not matching_metrics.empty:
    print(f"\n🎯 5-STRATEGY MATCHING RESULTS:")
    print(
        f"  • Total potential matches: {matching_metrics.iloc[0]['total_potential_matches']:,}")
    print(
        f"  • Auto-merge rate: {matching_metrics.iloc[0]['auto_merge_rate_pct']:.1f}%")
    print(
        f"  • Human review rate: {matching_metrics.iloc[0]['human_review_rate_pct']:.1f}%")
    print(
        f"  • Average combined score: {matching_metrics.iloc[0]['avg_combined_score']:.3f}")

print("\n🏗️ ARCHITECTURE HIGHLIGHTS:")
print("  • Matching, scoring, and golden records implemented in BigQuery SQL")
print("  • gemini-embedding-001 embeddings for vector matching")
print("  • Gemini 2.5 Pro for AI natural language matching")
print("  • Optional vector index for similarity search (needs at least 5,000 rows)")
print("  • 5-strategy matching (exact, fuzzy, vector, rules, AI)")
print("  • Weighted ensemble scoring")
print("  • AI-generated explanations for match decisions")
print("  • Automated confidence scoring and decision making")
print("  • Survivorship rules for golden record creation")

print(f"\n✅ 5-STRATEGY PIPELINE COMPLETED SUCCESSFULLY!")
print("="*60)