# MOSTLY AI vs. SDV Comparison - Multi-Table Scenario  <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/multi-table-scenario.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

This notebook provides a comprehensive comparison between two leading synthetic data generation platforms:
- **SDV (Synthetic Data Vault)** - Business Source License
- **MOSTLY AI Synthetic Data SDK** - Apache 2.0 License - Open Source

In this comparison, we are going to walk through the synthesis of a relational multi-table structure using the [Berka dataset](https://github.com/mostly-ai/public-demo-data/tree/dev/berka/data).

## Comparison Methodology

1. **Data Preparation**: Load, inspect, and preprocess the multi-table dataset
2. **Data Splitting**: Create train/test splits while maintaining referential integrity
3. **Model Training**: Train both SDV and MOSTLY AI generators on the training data
4. **Synthetic Data Generation**: Generate synthetic datasets using both platforms
5. **Performance Analysis**: Compare training time, generation speed, and data quality
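
The timing part of step 5 relies on wall-clock measurements taken around each training and generation call. A minimal sketch of a reusable timer follows; the `timed` helper, the `timings` dictionary, and `records_per_second` are illustrative names and not part of either library.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: collects wall-clock durations (in seconds) per label.
timings = {}

@contextmanager
def timed(label):
    """Record how long the wrapped block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

def records_per_second(n_records, seconds):
    """Throughput; returns 0.0 for a zero-length interval to avoid division by zero."""
    return n_records / seconds if seconds > 0 else 0.0

# Usage with a stand-in workload instead of a real training call
with timed("demo"):
    total = sum(range(100_000))

print(f"demo took {timings['demo']:.4f} s")
print(f"{records_per_second(1_000, 2.0):,.0f} records/second")
```

Wrapping both frameworks' calls in the same helper keeps the measurement method identical, which matters more for a fair comparison than the absolute numbers.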

## Key Challenges in Multi-Table Synthesis

- **Referential Integrity**: Maintaining foreign key relationships between tables
- **Sequential Dependencies**: Preserving temporal patterns in transaction data
- **Data Quality**: Ensuring synthetic data maintains statistical properties and business logic
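
To make the first challenge concrete, the sketch below counts orphaned foreign keys in a toy parent/child pair. The rows and the `find_orphans` helper are made up for illustration and use plain Python structures rather than DataFrames.

```python
def find_orphans(child_rows, fk_column, parent_ids):
    """Return child rows whose foreign key does not match any parent ID."""
    return [row for row in child_rows if row[fk_column] not in parent_ids]

# Toy data (hypothetical values, not taken from the Berka dataset)
accounts = [{"account_id": 1}, {"account_id": 2}]
transactions = [
    {"trans_id": 10, "account_id": 1},
    {"trans_id": 11, "account_id": 2},
    {"trans_id": 12, "account_id": 99},  # points at a non-existent account
]

parent_ids = {row["account_id"] for row in accounts}
orphans = find_orphans(transactions, "account_id", parent_ids)
print(f"{len(orphans)} orphaned transaction(s): {[o['trans_id'] for o in orphans]}")
```

A synthetic dataset with intact referential integrity should yield an empty list for every parent/child pair.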

In [None]:
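# Install the SDK in LOCAL mode (this notebook runs with MostlyAI(local=True))
!uv pip install -U 'mostlyai[local]' sdv graphviz
# Or install in CLIENT mode instead (uncomment the line below and drop the line above)
# !uv pip install -U mostlyai sdv graphviz
# Note: Restart the kernel session after installation!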

## 1. Data Loading and Initial Exploration

First, let's load our multi-table dataset and examine its structure to understand:
- Table schemas and data types
- Data quality and completeness
- Relationships between tables
- Business logic and constraints


In [None]:
import pandas as pd

base_url = "https://github.com/mostly-ai/public-demo-data/raw/dev/berka/data/"
originals = {
    "client": pd.read_csv(base_url + "client.csv.gz", low_memory=False),
    "disposition": pd.read_csv(base_url + "disp.csv.gz", low_memory=False),
    "card": pd.read_csv(base_url + "card.csv.gz", low_memory=False),
    "account": pd.read_csv(base_url + "account.csv.gz", low_memory=False),
    "transaction": pd.read_csv(base_url + "trans.csv.gz", low_memory=False),
    "loan": pd.read_csv(base_url + "loan.csv.gz", low_memory=False),
    "order": pd.read_csv(base_url + "order.csv.gz", low_memory=False),
}
originals["account"]["date"] = pd.to_datetime(originals["account"]["date"])
originals["transaction"]["date"] = pd.to_datetime(originals["transaction"]["date"])

for k in originals:
    print("===", k, "===")
    display(originals[k].sample(n=3))

## 2. Data Preprocessing: Establishing Foreign Key Relationships

Before synthesis, we validate the foreign key relationships between the tables and set any key that does not resolve to a parent record to missing. This step is crucial for:
- Maintaining referential integrity in synthetic data
- Enabling proper multi-table synthesis
- Ensuring realistic transaction patterns

In [None]:
print("🔗 Validating foreign key relationships...")

# Build lookup sets for validation
client_ids = set(originals["client"]["client_id"])
account_ids = set(originals["account"]["account_id"])
disp_ids = set(originals["disposition"]["disp_id"])

# Define FK relationships as structured data
fk_relationships = [
    {"table": "disposition", "column": "client_id", "target_table": "client", "valid_ids": client_ids},
    {"table": "disposition", "column": "account_id", "target_table": "account", "valid_ids": account_ids},
    {"table": "card", "column": "disp_id", "target_table": "disposition", "valid_ids": disp_ids},
    {"table": "transaction", "column": "account_id", "target_table": "account", "valid_ids": account_ids},
    {"table": "loan", "column": "account_id", "target_table": "account", "valid_ids": account_ids},
    {"table": "order", "column": "account_id", "target_table": "account", "valid_ids": account_ids},
]

validation_summary = []

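# Copy each DataFrame so that validation never mutates the originals
# (dict.copy() alone is shallow and would share the same DataFrame objects)
dataframes = {name: df.copy() for name, df in originals.items()}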

def validate_fk(df, column, valid_ids, table, target_table):
    before = len(df)
    missing_mask = ~df[column].isin(valid_ids)
    missing_count = missing_mask.sum()
    if missing_count > 0:
        message = (
            f"⚠️ {table}.{column} → {target_table}: {missing_count:,} invalid foreign keys detected "
            f"(checked {before:,} rows)"
        )
        df.loc[missing_mask, column] = pd.NA
    else:
        message = (
            f"✅ {table}.{column} → {target_table}: All foreign keys valid "
            f"({before:,} rows checked)"
        )
    validation_summary.append(message)
    return df

# Perform validation based on the relationship list
for rel in fk_relationships:
    table_name = rel["table"]
    column = rel["column"]
    valid_ids = rel["valid_ids"]
    target_table = rel["target_table"]
    dataframes[table_name] = validate_fk(
        dataframes[table_name], column, valid_ids, table_name, target_table
    )

print("\n✅ Referential integrity validation complete!\n")
print("📄 Foreign Key Validation Summary:")
for line in validation_summary:
    print(line)


## 3. Strategic Data Splitting for Multi-Table Scenarios

When dealing with related tables like accounts, dispositions and transactions, data splitting becomes more complex than simple random sampling due to the links between data tables. 

In order to make coherent assessments of data quality, we need to create meaningful train and test cohorts.

**Key Considerations:**
- **Referential Integrity:** Ensure foreign key relationships remain valid in both splits.
- **Business Logic:** Dispositions, loans, orders, and transactions can only exist in the training set if their associated account is also in the training set.
- **Data Leakage Prevention:** Avoid information bleeding between train/test sets.

**Our Approach:**
1. **Split accounts first (80/20 train/test)**: We split the `account` table using an 80/20 ratio. This keeps all records that belong to one account together in the same split.
   
2. **Assign related tables based on account membership:**  
   - **Dispositions:** Linked using `account_id`. Only dispositions referencing accounts in the training set go to the training set.
   - **Clients:** Linked via the `client_id` values of the assigned dispositions.
   - **Cards:** Linked via `disp_id`. Cards follow the disposition-account split.
   - **Transactions, Loans, Orders:** All linked via `account_id`. These tables follow the same assignment logic.
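
The approach can be sketched on toy data as follows. This is a simplified illustration in plain Python, not the notebook's implementation (which uses `train_test_split` on DataFrames); `split_ids` and `assign_children` are hypothetical helper names.

```python
import random

def split_ids(ids, test_share=0.2, seed=42):
    """Shuffle parent IDs and split them into disjoint train/test sets."""
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return set(shuffled[n_test:]), set(shuffled[:n_test])

def assign_children(rows, fk_column, train_ids, test_ids):
    """Route each child row to the split that holds its parent."""
    train = [r for r in rows if r[fk_column] in train_ids]
    test = [r for r in rows if r[fk_column] in test_ids]
    return train, test

# Toy data: 10 accounts with 3 transactions each (hypothetical values)
account_ids = list(range(1, 11))
transactions = [
    {"trans_id": a * 100 + i, "account_id": a} for a in account_ids for i in range(3)
]

train_ids, test_ids = split_ids(account_ids)
trans_train, trans_test = assign_children(transactions, "account_id", train_ids, test_ids)

# Leakage checks: no account in both splits, and no transaction lost or duplicated
assert train_ids.isdisjoint(test_ids)
assert len(trans_train) + len(trans_test) == len(transactions)
print(f"train accounts: {len(train_ids)}, test accounts: {len(test_ids)}")
print(f"train transactions: {len(trans_train)}, test transactions: {len(trans_test)}")
```

Splitting the parent first and routing children by key is what prevents one account's transactions from landing in both cohorts.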


In [None]:
from sklearn.model_selection import train_test_split

print("✂️ Performing strategic multi-table data splitting based on accounts...")

# Step 1: Split accounts using 80/20 ratio
accounts_train, accounts_test = train_test_split(
    dataframes["account"], test_size=0.2, random_state=42
)

print(f"🏦 Account split:")
print(f"   - Training set: {len(accounts_train):,} accounts ({len(accounts_train) / len(dataframes['account']) * 100:.1f}%)")
print(f"   - Test set: {len(accounts_test):,} accounts ({len(accounts_test) / len(dataframes['account']) * 100:.1f}%)")

# Step 2: Create account ID sets for lookup
train_account_ids = set(accounts_train["account_id"])
test_account_ids = set(accounts_test["account_id"])


In [None]:
print("🔄 Assigning related tables based on account split...")

# Dispositions linked to accounts
dispositions_train = dataframes["disposition"][dataframes["disposition"]["account_id"].isin(train_account_ids)].copy()
dispositions_test = dataframes["disposition"][dataframes["disposition"]["account_id"].isin(test_account_ids)].copy()

train_disp_ids = set(dispositions_train["disp_id"])
test_disp_ids = set(dispositions_test["disp_id"])

# Cards linked to dispositions
cards_train = dataframes["card"][dataframes["card"]["disp_id"].isin(train_disp_ids)].copy()
cards_test = dataframes["card"][dataframes["card"]["disp_id"].isin(test_disp_ids)].copy()

# Clients linked via disposition → account
client_ids_train = set(dispositions_train["client_id"])
client_ids_test = set(dispositions_test["client_id"])

clients_train = dataframes["client"][dataframes["client"]["client_id"].isin(client_ids_train)].copy()
clients_test = dataframes["client"][dataframes["client"]["client_id"].isin(client_ids_test)].copy()

# Transactions linked to accounts
transactions_train = dataframes["transaction"][dataframes["transaction"]["account_id"].isin(train_account_ids)].copy()
transactions_test = dataframes["transaction"][dataframes["transaction"]["account_id"].isin(test_account_ids)].copy()

# Loans linked to accounts
loans_train = dataframes["loan"][dataframes["loan"]["account_id"].isin(train_account_ids)].copy()
loans_test = dataframes["loan"][dataframes["loan"]["account_id"].isin(test_account_ids)].copy()

# Orders linked to accounts
orders_train = dataframes["order"][dataframes["order"]["account_id"].isin(train_account_ids)].copy()
orders_test = dataframes["order"][dataframes["order"]["account_id"].isin(test_account_ids)].copy()

print("✅ Splitting complete!")
print(f"   - Training accounts: {len(accounts_train):,}")
print(f"   - Training dispositions: {len(dispositions_train):,}")
print(f"   - Training clients: {len(clients_train):,}")
print(f"   - Training cards: {len(cards_train):,}")
print(f"   - Training transactions: {len(transactions_train):,}")
print(f"   - Training loans: {len(loans_train):,}")
print(f"   - Training orders: {len(orders_train):,}")

print(f"   - Test accounts: {len(accounts_test):,}")
print(f"   - Test dispositions: {len(dispositions_test):,}")
print(f"   - Test clients: {len(clients_test):,}")
print(f"   - Test cards: {len(cards_test):,}")
print(f"   - Test transactions: {len(transactions_test):,}")
print(f"   - Test loans: {len(loans_test):,}")
print(f"   - Test orders: {len(orders_test):,}")


In [None]:
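import os

print("💾 Saving split train/test tables...")

# Create the output folder first; to_parquet does not create missing directories
os.makedirs('./data', exist_ok=True)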

dispositions_train.to_parquet('./data/dispositions_train.parquet', index=False)
dispositions_test.to_parquet('./data/dispositions_test.parquet', index=False)
cards_train.to_parquet('./data/cards_train.parquet', index=False)
cards_test.to_parquet('./data/cards_test.parquet', index=False)
accounts_train.to_parquet('./data/accounts_train.parquet', index=False)
accounts_test.to_parquet('./data/accounts_test.parquet', index=False)
transactions_train.to_parquet('./data/transactions_train.parquet', index=False)
transactions_test.to_parquet('./data/transactions_test.parquet', index=False)
loans_train.to_parquet('./data/loans_train.parquet', index=False)
loans_test.to_parquet('./data/loans_test.parquet', index=False)
orders_train.to_parquet('./data/orders_train.parquet', index=False)
orders_test.to_parquet('./data/orders_test.parquet', index=False)
clients_train.to_parquet('./data/clients_train.parquet', index=False)
clients_test.to_parquet('./data/clients_test.parquet', index=False)

print("✅ All train/test splits saved to ./data/")


## 4. SDV (Synthetic Data Vault) Implementation

**About SDV:**
- Business Source License Python library for synthetic data generation
- Supports single-table and multi-table scenarios
- Uses statistical modeling and machine learning approaches
- Provides HMASynthesizer for hierarchical multi-table synthesis

**Key Features:**
- **Metadata Detection**: Automatically infers data types and relationships
- **Relationship Modeling**: Handles parent-child table relationships
- **Privacy Protection**: Generates synthetic data that preserves statistical properties while protecting individual privacy
- **Extensible**: Multiple synthesizer options (GaussianCopula, CTGAN, CopulaGAN, etc.)

**Limitations:**
- Current version only supports one parent per child table
- Complex multi-parent relationships require modeling simplification
- Performance scales with data complexity

### 4.1 SDV Metadata Configuration


In [None]:
from sdv.multi_table import HMASynthesizer
from sdv.metadata import Metadata

print("🏗️ Building SDV metadata configuration...")

metadata = Metadata.detect_from_dataframes(
    data={
        'client': clients_train,
        'disposition': dispositions_train,
        'account': accounts_train,
        'card': cards_train,
        'transaction': transactions_train,
        'loan': loans_train,
        'order': orders_train
    },
    infer_keys='primary_and_foreign'
)

print("✅ Base metadata auto-detected with relationships")

# View auto-detected relationships graphically
metadata.visualize()

# Inspect relationships and table configuration as raw dictionary
metadata_dict = metadata.to_dict()
print("\n📋 Complete SDV Metadata Dictionary:")
print(metadata_dict)


### 4.2 SDV Model Training

**HMASynthesizer Overview:**
- **Hierarchical Modeling**: Learns parent-child relationships
- **Statistical Approach**: Uses copulas and Gaussian distributions
- **Multi-step Process**: 
  1. Preprocesses tables and infers constraints
  2. Learns relationships between parent and child tables
  3. Models individual table distributions
  
**Training Phases:**
- **Preprocess Tables**: Data cleaning and type inference
- **Learning Relationships**: Analyzing foreign key dependencies  
- **Modeling Tables**: Learning statistical distributions for synthesis


In [None]:
import time
from sdv.multi_table import HMASynthesizer

print("🚀 Starting SDV training process...")
print("This will involve multiple phases - preprocessing, relationship learning, and table modeling")

start_time = time.time()

# Initialize the HMASynthesizer with our configured metadata
print("🔧 Initializing HMASynthesizer...")
synthesizer = HMASynthesizer(metadata)

# Fit the synthesizer on training data
# This process will:
# 1. Preprocess all tables (clean data, infer constraints)
# 2. Learn multi-table relationships
# 3. Model the statistical distributions of each table
print("📊 Training synthesizer on multi-table data...")
synthesizer.fit({
    'client': clients_train,
    'disposition': dispositions_train,
    'account': accounts_train,
    'card': cards_train,
    'transaction': transactions_train,
    'loan': loans_train,
    'order': orders_train
})

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV training completed successfully!")
print(f"⏱️ Total training time: {elapsed_minutes:.2f} minutes")

# Report table sizes for clarity
print("📈 Training data breakdown:")
print(f"   - Clients: {len(clients_train):,}")
print(f"   - Dispositions: {len(dispositions_train):,}")
print(f"   - Accounts: {len(accounts_train):,}")
print(f"   - Cards: {len(cards_train):,}")
print(f"   - Transactions: {len(transactions_train):,}")
print(f"   - Loans: {len(loans_train):,}")
print(f"   - Orders: {len(orders_train):,}")


### 4.3 SDV Synthetic Data Generation

**Generation Process:**
- **Scale Parameter**: Controls the number of synthetic records (1.0 = same size as training data)
- **Hierarchical Generation**: First generates parent records (e.g., accounts), then child records (e.g., transactions)
- **Relationship Preservation**: Ensures all synthetic child records (e.g., transactions) reference valid synthetic parent records (e.g., accounts)
- **Statistical Sampling**: Uses learned distributions to create realistic synthetic data
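
As a rough guide to what a given `scale` implies, the sketch below multiplies table sizes by the scale factor. The row counts are placeholders and `expected_rows` is a hypothetical helper. Treat the result as an approximation only: in hierarchical sampling, the size of a child table depends on how many child rows are drawn per parent, so actual counts can deviate.

```python
def expected_rows(table_sizes, scale):
    """Approximate target row count per table for a given scale factor."""
    return {name: round(n_rows * scale) for name, n_rows in table_sizes.items()}

# Placeholder training sizes (hypothetical, not the actual Berka counts)
training_sizes = {"account": 1_000, "transaction": 50_000}

for scale in (0.5, 1.0, 1.25):
    print(scale, expected_rows(training_sizes, scale))
```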


In [None]:
print("🎲 Starting SDV synthetic data generation...")
print("Generating synthetic data using learned statistical distributions...")

start_time = time.time()

# Generate synthetic data with 25% more records than training data
# Scale parameter: 1.0 = same size, 1.25 = 25% larger, 0.5 = half size
print("⚙️ Generating 1.25x the training data size...")
sdv_synthetic_data = synthesizer.sample(scale=1.25)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV generation completed successfully!")
print(f"⏱️  Generation time: {elapsed_minutes:.2f} minutes")

# Calculate total synthetic records
synthetic_client_count = len(sdv_synthetic_data['client'])
synthetic_disposition_count = len(sdv_synthetic_data['disposition'])
synthetic_account_count = len(sdv_synthetic_data['account'])
synthetic_card_count = len(sdv_synthetic_data['card'])
synthetic_transaction_count = len(sdv_synthetic_data['transaction'])
synthetic_loan_count = len(sdv_synthetic_data['loan'])
synthetic_order_count = len(sdv_synthetic_data['order'])

total_synthetic_records = (
    synthetic_client_count +
    synthetic_disposition_count +
    synthetic_account_count +
    synthetic_card_count +
    synthetic_transaction_count +
    synthetic_loan_count +
    synthetic_order_count
)

generation_rate = total_synthetic_records / (end_time - start_time)

print(f"🚀 Generation rate: {generation_rate:,.0f} records/second")
print(f"📊 Synthetic data breakdown:")
print(f"   - Clients: {synthetic_client_count:,}")
print(f"   - Dispositions: {synthetic_disposition_count:,}")
print(f"   - Accounts: {synthetic_account_count:,}")
print(f"   - Cards: {synthetic_card_count:,}")
print(f"   - Transactions: {synthetic_transaction_count:,}")
print(f"   - Loans: {synthetic_loan_count:,}")
print(f"   - Orders: {synthetic_order_count:,}")

# Quality metric example: Dispositions per client ratio
dispositions_per_client = synthetic_disposition_count / synthetic_client_count if synthetic_client_count > 0 else 0

print(f"\n🔍 Generation Quality Metrics:")
print(f"   - Dispositions per client: {dispositions_per_client:.1f}")
print(f"   - Scale factor achieved: {synthetic_client_count / len(clients_train):.2f}x")

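# Preview of generated synthetic data
# (display() renders both tables; a bare expression would only show the last one)
display(sdv_synthetic_data['client'].head())
display(sdv_synthetic_data['disposition'].head())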

In [None]:
# Save SDV synthetic data for comparison
output_folder = './data/'

synthetic_files = {
    'client': f'{output_folder}sdv_client.parquet',
    'disposition': f'{output_folder}sdv_disposition.parquet',
    'account': f'{output_folder}sdv_account.parquet',
    'card': f'{output_folder}sdv_card.parquet',
    'transaction': f'{output_folder}sdv_transaction.parquet',
    'loan': f'{output_folder}sdv_loan.parquet',
    'order': f'{output_folder}sdv_order.parquet',
}

for table_name, file_path in synthetic_files.items():
    sdv_synthetic_data[table_name].to_parquet(file_path, index=False)
    print(f"💾 Saved {table_name} synthetic data to: {file_path}")


## 5. MOSTLY AI Implementation

**About MOSTLY AI Synthetic Data SDK:**
- Open-source (Apache 2) synthetic data SDK with advanced AI capabilities
- Also cloud-based service with enterprise-grade security and compliance
- Supports complex multi-table scenarios with multiple foreign keys
- Uses deep learning and autoregressive-based models

**Getting Started:**
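- **API Access**: Only needed in CLIENT mode, which requires valid API credentials for the cloud platform; the LOCAL mode used in this notebook (`MostlyAI(local=True)`) runs without them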
- **API Key Generation**: Get your free API key at: https://app.mostly.ai/settings/api-keys

**Key Advantages:**
- **Advanced AI Models**: Utilizes state-of-the-art generative AI including language models
- **Multi-Parent Support**: Can handle complex relationships (multiple foreign keys per table)
- **Mixed Data Types**: Excels at both tabular and text data synthesis
- **Enterprise Features**: Privacy guarantees, compliance reporting, and scalability

**Architecture:**
- **Tabular Models**: For structured data (demographics, financials)
- **Language Models**: For text fields (names, addresses, emails) using LLMs like Llama-3.2
- **Sequential Models**: For time-series and ordered data patterns

In [None]:
from mostlyai.sdk import MostlyAI

print("🔧 Initializing MOSTLY AI Synthetic Data SDK...")

# Initialize MOSTLY AI Synthetic Data SDK
mostly = MostlyAI(local=True)


print("✅ MOSTLY AI Synthetic Data SDK initialized successfully")

### 5.1 MOSTLY AI Configuration Summary

**Berka Multi-Table Setup Highlights:**

- **Clients Table**
  - Primary Key: `client_id`
  - Standard demographic modeling

- **Accounts Table**
  - Primary Key: `account_id`
  - Referenced by multiple tables as foreign key

- **Dispositions Table**
  - Dual Foreign Keys: `client_id` (non-context) + `account_id` (context)
  - Captures client-account relationships

- **Cards Table**
  - Foreign Key: `disp_id` (context)
  - Linked to disposition

- **Transactions, Loans, Orders Tables**
  - Foreign Key: `account_id` (context)
  - Context-aware modeling for financial records
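
The context tagging above can be sanity-checked before training. The sketch below assumes, based on how the setup in this notebook is laid out, that each table should carry at most one context foreign key and that every referenced table must be defined. The `check_foreign_keys` helper is hypothetical and not part of the SDK.

```python
def check_foreign_keys(tables):
    """Return a list of problems found in a list of table configs."""
    known = {t["name"] for t in tables}
    problems = []
    for table in tables:
        fks = table.get("foreign_keys", [])
        n_context = sum(1 for fk in fks if fk.get("is_context"))
        if n_context > 1:
            problems.append(f"{table['name']}: {n_context} context foreign keys")
        for fk in fks:
            if fk["referenced_table"] not in known:
                problems.append(
                    f"{table['name']}.{fk['column']}: unknown table {fk['referenced_table']!r}"
                )
    return problems

# Trimmed-down version of the Berka layout (data and model settings omitted)
tables = [
    {"name": "client", "primary_key": "client_id"},
    {"name": "account", "primary_key": "account_id"},
    {"name": "disposition", "primary_key": "disp_id", "foreign_keys": [
        {"column": "client_id", "referenced_table": "client", "is_context": False},
        {"column": "account_id", "referenced_table": "account", "is_context": True},
    ]},
    {"name": "card", "primary_key": "card_id", "foreign_keys": [
        {"column": "disp_id", "referenced_table": "disposition", "is_context": True},
    ]},
]

print(check_foreign_keys(tables) or "no problems found")
```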

**Global Configuration Notes:**
- Model: `MOSTLY_AI/Medium` for all tables
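- Training Time: Capped at 10 minutes per table (`max_training_time`)
- Flexible Generation and Value Protection: Disabled for the `client` table; the other tables keep the SDK defaults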
- Model Reports: Disabled for streamlined runtime
- Multi-level Foreign Keys: Enabled with context tagging where applicable
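
Because most tables share the same model settings, the per-table entries could be built with a small helper instead of being written out by hand. This is a sketch only: `make_table_config` is a hypothetical function, and the dictionary keys are the ones used in this notebook's configuration.

```python
def make_table_config(name, data, primary_key, foreign_keys=None, **model_overrides):
    """Build one table entry using the shared model settings listed above."""
    model_config = {
        "model": "MOSTLY_AI/Medium",
        "max_training_time": 10,
        "enable_model_report": False,
    }
    model_config.update(model_overrides)
    table = {
        "name": name,
        "data": data,
        "primary_key": primary_key,
        "tabular_model_configuration": model_config,
    }
    if foreign_keys:
        table["foreign_keys"] = foreign_keys
    return table

# `data=None` is a placeholder; in the notebook this would be a training DataFrame
loan = make_table_config(
    "loan", None, "loan_id",
    foreign_keys=[{"column": "account_id", "referenced_table": "account", "is_context": True}],
)
client = make_table_config("client", None, "client_id", enable_flexible_generation=False)
print(loan["tabular_model_configuration"])
print(client["tabular_model_configuration"])
```

Keeping the shared settings in one place makes it harder for individual tables to drift apart unintentionally.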


In [None]:
print("⚙️ Configuring advanced MOSTLY AI generator...")
print("Setting up sophisticated multi-table configuration with multi-level foreign keys...")

# Configure the generator for comprehensive multi-table setup
config = {
    'name': 'Berka Multi-Table Generator',
    'tables': [
        {
            'name': 'client',
            'data': clients_train,
            'primary_key': 'client_id',
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_flexible_generation': False,
                'value_protection': False,
                'enable_model_report': False
            }
        },
        {
            'name': 'account',
            'data': accounts_train,
            'primary_key': 'account_id',
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_model_report': False
            }
        },
        {
            'name': 'disposition',
            'data': dispositions_train,
            'primary_key': 'disp_id',
            'foreign_keys': [
                {'column': 'client_id', 'referenced_table': 'client', 'is_context': False},
                {'column': 'account_id', 'referenced_table': 'account', 'is_context': True}
            ],
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_model_report': False
            }
        },
        {
            'name': 'card',
            'data': cards_train,
            'primary_key': 'card_id',
            'foreign_keys': [
                {'column': 'disp_id', 'referenced_table': 'disposition', 'is_context': True}
            ],
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_model_report': False
            }
        },
        {
            'name': 'transaction',
            'data': transactions_train,
            'primary_key': 'trans_id',
            'foreign_keys': [
                {'column': 'account_id', 'referenced_table': 'account', 'is_context': True}
            ],
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_model_report': False
            }
        },
        {
            'name': 'loan',
            'data': loans_train,
            'primary_key': 'loan_id',
            'foreign_keys': [
                {'column': 'account_id', 'referenced_table': 'account', 'is_context': True}
            ],
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_model_report': False
            }
        },
        {
            'name': 'order',
            'data': orders_train,
            'primary_key': 'order_id',
            'foreign_keys': [
                {'column': 'account_id', 'referenced_table': 'account', 'is_context': True}
            ],
            'tabular_model_configuration': {
                'model': 'MOSTLY_AI/Medium',
                'max_training_time': 10,
                'enable_model_report': False
            }
        }
    ]
}

print("✅ MOSTLY AI generator configuration ready.")


### 5.2 MOSTLY AI Training Process

**Training Process:**
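1. **Provide Data**: In LOCAL mode, as used in this notebook, training data stays on the local machine; in CLIENT mode it is uploaded to the MOSTLY AI platform
2. **Model Configuration**: Apply the complex multi-table configuration
3. **AI Training**: Train a tabular generative model for each table (this configuration defines no language models)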
4. **Quality Validation**: Automatic quality checks during training


In [None]:
print("🚀 Starting MOSTLY AI training...")
print("📤 Handing training data to the MOSTLY AI SDK (LOCAL mode keeps it on this machine)...")

start_time = time.time()

# Train the MOSTLY AI generator with the configuration defined above
# This will:
# 1. Hand the training data to the SDK (kept local in LOCAL mode, uploaded in CLIENT mode)
# 2. Configure a tabular model for each table
# 3. Train the models for each table and their relationships
# 4. Wait for training completion with progress monitoring
g = mostly.train(config=config, start=True, wait=True)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ MOSTLY AI training completed successfully!")
print(f"⏱️ Total training time: {elapsed_minutes:.2f} minutes")
print(f"🧠 Advanced AI models trained for multi-table synthesis")
    

In [None]:
print("🎲 Starting MOSTLY AI synthetic data generation...")
print("🧠 Using the trained generator for multi-table synthesis...")

start_time = time.time()

# Generate synthetic data using the trained MOSTLY AI generator
# Key advantages over SDV:
# - Handles multiple foreign keys properly
# - Maintains complex statistical relationships across multiple tables

print(f"📊 Generating {len(clients_train):,} synthetic client records...")

sd = mostly.generate(g, size=len(clients_train))
mostlyai_synthetic_data = sd.data()

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

# Calculate generation statistics
total_records = sum(len(mostlyai_synthetic_data[table]) for table in mostlyai_synthetic_data.keys())
generation_rate = total_records / (end_time - start_time)

print(f"✅ MOSTLY AI generation completed successfully!")
print(f"⏱️ Generation time: {elapsed_minutes:.2f} minutes")
print(f"🚀 Generation rate: {generation_rate:,.0f} records/second")

print("📊 Synthetic data breakdown:")
for table_name in mostlyai_synthetic_data:
    print(f"   - {table_name.capitalize()}: {len(mostlyai_synthetic_data[table_name]):,} rows")

# Quality metrics example: Dispositions per client ratio
dispositions_per_client = len(mostlyai_synthetic_data['disposition']) / len(mostlyai_synthetic_data['client'])
print(f"\n🔍 Quality Metrics:")
print(f"   - Dispositions per client: {dispositions_per_client:.1f}")
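orphaned_dispositions = (~mostlyai_synthetic_data['disposition']['account_id'].isin(mostlyai_synthetic_data['account']['account_id'])).sum()
print(f"   - Dispositions without a matching synthetic account: {orphaned_dispositions:,}")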

# Preview (display() renders both tables; a bare expression would only show the last one)
display(mostlyai_synthetic_data['client'].head())
display(mostlyai_synthetic_data['disposition'].head())

In [None]:
# Save MOSTLY AI synthetic data for comparison
output_folder = './data/'

mostlyai_files = {
    'client': f'{output_folder}mostlyai_client.parquet',
    'disposition': f'{output_folder}mostlyai_disposition.parquet',
    'account': f'{output_folder}mostlyai_account.parquet',
    'card': f'{output_folder}mostlyai_card.parquet',
    'transaction': f'{output_folder}mostlyai_transaction.parquet',
    'loan': f'{output_folder}mostlyai_loan.parquet',
    'order': f'{output_folder}mostlyai_order.parquet',
}

for table_name, file_path in mostlyai_files.items():
    mostlyai_synthetic_data[table_name].to_parquet(file_path, index=False)
    print(f"💾 Saved {table_name} synthetic data to: {file_path}")


## 6. Synthetic Data Quality Assessment

After generating synthetic data with both SDV and MOSTLY AI, we evaluate the quality and privacy of the generated datasets. This section assesses the synthetic transaction tables against the training and holdout data, using the account tables as context.

### 6.1 Statistical Quality Assessment with MOSTLY AI QA Library

The MOSTLY AI QA library provides enterprise-grade quality assessment capabilities that evaluate synthetic data across multiple dimensions. This assessment generates comprehensive HTML reports and quantitative metrics that help understand:

- **Accuracy Scores**: Overall statistical fidelity of synthetic data
- **Distance to Closest Record (DCR)**: Privacy risk measurement 
- **Univariate & Bivariate Distributions**: Preservation of individual column and column-pair statistics
- **Correlation Analysis**: Maintenance of relationships between variables
- **Similarity Metrics**: Overall resemblance to training data while avoiding overfitting
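
To build intuition for two of these metrics, the sketch below computes a simplified univariate accuracy (one minus the total variation distance between category frequencies) and a simplified DCR share (the fraction of synthetic records that lie closer to a training record than to a holdout record). It is an illustration on toy data under those stated definitions, with ties counted as one half by assumption. The QA library's own computation differs in detail, so these numbers will not match its reports.

```python
from collections import Counter
import math

def univariate_accuracy(real_values, synthetic_values):
    """1 - total variation distance between two categorical distributions."""
    real, syn = Counter(real_values), Counter(synthetic_values)
    categories = set(real) | set(syn)
    tvd = 0.5 * sum(
        abs(real[c] / len(real_values) - syn[c] / len(synthetic_values))
        for c in categories
    )
    return 1.0 - tvd

def nearest_distance(point, records):
    """Euclidean distance from `point` to its closest record."""
    return min(math.dist(point, r) for r in records)

def dcr_share(synthetic, training, holdout):
    """Share of synthetic records closer to training than to holdout (ties count half)."""
    score = 0.0
    for s in synthetic:
        d_trn, d_hol = nearest_distance(s, training), nearest_distance(s, holdout)
        score += 1.0 if d_trn < d_hol else 0.5 if d_trn == d_hol else 0.0
    return score / len(synthetic)

# Toy data (hypothetical values)
real_types = ["credit"] * 6 + ["debit"] * 4
syn_types = ["credit"] * 5 + ["debit"] * 5
training = [(0.0, 0.0), (10.0, 10.0)]
holdout = [(0.0, 2.0), (10.0, 8.0)]
synthetic = [(0.0, 0.5), (10.0, 8.5), (5.0, 5.0), (10.0, 9.8)]

print(f"accuracy:  {univariate_accuracy(real_types, syn_types):.2f}")
print(f"DCR share: {dcr_share(synthetic, training, holdout):.2f}")
```

A DCR share far above 0.5 would mean the synthetic records hug the training data more closely than unseen holdout data, which is the pattern to watch for as a privacy signal.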

In [None]:
# Import and initialize the quality assessment framework
from mostlyai import qa

# Initialize logging to see detailed evaluation progress
qa.init_logging()
print("🔍 Quality assessment framework initialized")

In [None]:
# Load the split files from the disk
transactions_train = pd.read_parquet('./data/transactions_train.parquet')
transactions_test = pd.read_parquet('./data/transactions_test.parquet')
accounts_train = pd.read_parquet('./data/accounts_train.parquet')
accounts_test = pd.read_parquet('./data/accounts_test.parquet')

print("📂 Training and test datasets loaded successfully")


In [None]:
print("📊 Evaluating SDV Transaction synthetic data quality...")

# Load the SDV synthetic dataset
sdv_transaction = pd.read_parquet('./data/sdv_transaction.parquet')

# Define ID columns to exclude from QA analysis (do not exclude account_id here!)
id_columns_to_exclude = ['trans_id']

def remove_id_columns(df, columns_to_remove):
    return df.drop(columns=[col for col in columns_to_remove if col in df.columns])

# Prepare transaction data (remove only trans_id)
sdv_transaction_qa = remove_id_columns(sdv_transaction, id_columns_to_exclude)
transactions_train_qa = remove_id_columns(transactions_train, id_columns_to_exclude)
transactions_test_qa = remove_id_columns(transactions_test, id_columns_to_exclude)

report_path, metrics = qa.report(
    syn_tgt_data = sdv_transaction_qa,
    trn_tgt_data = transactions_train_qa,
    hol_tgt_data = transactions_test_qa,
    syn_ctx_data = pd.read_parquet('./data/sdv_account.parquet'),
    trn_ctx_data = accounts_train,
    hol_ctx_data = accounts_test,
    ctx_primary_key = "account_id",
    tgt_context_key = "account_id",
    max_sample_size_embeddings=10_000,
    report_path='sdv_transaction_qa_report.html'
)

print(f"📋 SDV Transaction Quality Report saved to: {report_path}")
print("\n📈 SDV Transaction Quality Metrics:")
print(metrics.model_dump_json(indent=4))

sdv_transaction_accuracy = metrics.accuracy.overall
sdv_transaction_dcr_share = metrics.distances.dcr_share
print(f"\n🎯 SDV Transaction Summary:")
print(f"   Overall Accuracy: {sdv_transaction_accuracy:.3f}")
print(f"   DCR Share: {sdv_transaction_dcr_share:.3f}")


In [None]:
print("📊 Evaluating MOSTLY AI Transactions synthetic data quality...")

# Load the MOSTLY AI synthetic dataset
mostlyai_transaction = pd.read_parquet('./data/mostlyai_transaction.parquet')

# Define ID columns to exclude from QA analysis (do not exclude account_id here!)
id_columns_to_exclude = ['trans_id']

def remove_id_columns(df, columns_to_remove):
    return df.drop(columns=[col for col in columns_to_remove if col in df.columns])

# Prepare transaction data (remove only trans_id)
mostlyai_transaction_qa = remove_id_columns(mostlyai_transaction, id_columns_to_exclude)
transactions_train_qa = remove_id_columns(transactions_train, id_columns_to_exclude)
transactions_test_qa = remove_id_columns(transactions_test, id_columns_to_exclude)

report_path, metrics = qa.report(
    syn_tgt_data = mostlyai_transaction_qa,
    trn_tgt_data = transactions_train_qa,
    hol_tgt_data = transactions_test_qa,
    syn_ctx_data = pd.read_parquet('./data/mostlyai_account.parquet'),
    trn_ctx_data = accounts_train,
    hol_ctx_data = accounts_test,
    ctx_primary_key = "account_id",
    tgt_context_key = "account_id",
    max_sample_size_embeddings=10_000,
    report_path='mostlyai_transaction_qa_report.html'
)

print(f"📋 MOSTLY AI Transaction Quality Report saved to: {report_path}")
print("\n📈 MOSTLY AI Transaction Quality Metrics:")
print(metrics.model_dump_json(indent=4))

mostlyai_transaction_accuracy = metrics.accuracy.overall
mostlyai_transaction_dcr_share = metrics.distances.dcr_share
print(f"\n🎯 MOSTLY AI Transaction Summary:")
print(f"   Overall Accuracy: {mostlyai_transaction_accuracy:.3f}")
print(f"   DCR Share: {mostlyai_transaction_dcr_share:.3f}")


In [None]:
# Add a final comparison section
print("\n" + "="*60)
print("🏆 FINAL COMPARISON")
print("="*60)
print(f"SDV Transaction      - Accuracy: {sdv_transaction_accuracy:.3f}, DCR Share: {sdv_transaction_dcr_share:.3f}")
print(f"MOSTLY AI Transaction- Accuracy: {mostlyai_transaction_accuracy:.3f}, DCR Share: {mostlyai_transaction_dcr_share:.3f}")

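print("\n🔍 METRIC INTERPRETATION:")
print("• Higher accuracy = better statistical fidelity")
print("• DCR Share ~0.5 = synthetic records are as close to holdout data as to training data")

print("\n📊 ANALYSIS:")
# Derive the statements from the measured values instead of hard-coding a conclusion
accuracy_gap = mostlyai_transaction_accuracy - sdv_transaction_accuracy
leader = "MOSTLY AI" if accuracy_gap > 0 else "SDV" if accuracy_gap < 0 else "Neither framework"
print(f"• {leader} shows the higher transaction accuracy in this run (gap: {abs(accuracy_gap):.3f})")
print("• Check both DCR Share values against 0.5; values well above it indicate synthetic records sit closer to training data")
print("• These figures cover the transaction table only; the other tables would need their own reports")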

print("\n⚠️  RECOMMENDATION:")
print("• Review detailed HTML reports for nuanced privacy insights")
print("• Pay attention to discriminator AUC and feature-wise similarity scores")
print("• Align final choice with your privacy-utility balance requirements")
