# Synthetic Data Generation: SDV vs MOSTLY AI Comparison

## Framework Comparison
This notebook compares two synthetic data generation approaches on a large-scale dataset:

- **SDV (Synthetic Data Vault)** - Business Source License
- **MOSTLY AI SDK** - Apache 2.0 License - Open Source

## Dataset & Objective
We'll use the **US Census Income dataset (10M records)** to:
- Compare training performance and generation speed
- Evaluate synthetic data quality using comprehensive metrics
- Assess privacy preservation capabilities
- Provide practical guidance for framework selection

## Key Takeaways
- Performance benchmarks on large-scale data
- Quality comparison metrics
- Privacy assessment results

In [None]:
%uv pip install -U sdv mostlyai-qa 'mostlyai[local]'

# 1. Data Preparation

## Loading the Dataset
We'll use the US Census Income dataset with 10M records containing demographic, employment, and financial information - ideal for testing synthetic data generation at scale.

In [None]:
import pandas as pd

# Load the US Census dataset (10M records) from remote Parquet file
# Note: This is a large dataset - initial load may take a few minutes
data = pd.read_parquet('https://mostly-public-tutorials.s3.eu-central-1.amazonaws.com/datasets/census/census_10_mil.parquet')

# Display basic info about the dataset
print(f"Dataset shape: {data.shape}")
print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print("\nFirst 5 rows:")
print(data.head())

| Age | Workclass         | FNLWGT | Education  | Education Num | Marital Status      | Occupation         | Relationship   | Race  | Sex    | Capital Gain | Capital Loss | Hours/Week | Native Country  | Income |
|-----|-------------------|--------|------------|----------------|----------------------|---------------------|----------------|-------|--------|---------------|---------------|-------------|------------------|--------|
| 39  | State-gov         | 77516  | Bachelors  | 13             | Never-married        | Adm-clerical        | Not-in-family  | White | Male   | 2174          | 0             | 40          | United-States     | <=50K  |
| 50  | Self-emp-not-inc  | 83311  | Bachelors  | 13             | Married-civ-spouse   | Exec-managerial     | Husband        | White | Male   | 0             | 0             | 13          | United-States     | <=50K  |
| 38  | Private           | 215646 | HS-grad    | 9              | Divorced             | Handlers-cleaners   | Not-in-family  | White | Male   | 0             | 0             | 40          | United-States     | <=50K  |
| 53  | Private           | 234721 | 11th       | 7              | Married-civ-spouse   | Handlers-cleaners   | Husband        | Black | Male   | 0             | 0             | 40          | United-States     | <=50K  |
| 28  | Private           | 338409 | Bachelors  | 13             | Married-civ-spouse   | Prof-specialty      | Wife           | Black | Female | 0             | 0             | 40          | Cuba              | <=50K  |


## Dataset Overview

The dataset contains 15 columns with mixed data types:
- **Numerical**: age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week
- **Categorical**: workclass, education, marital_status, occupation, relationship, race, sex, native_country, income

This combination of numerical and categorical data makes it ideal for testing both frameworks' capabilities.

In [None]:
# Display column names and basic data types
print("Column names:")
print(data.columns.tolist())
print(f"\nData types:")
print(data.dtypes)
print(f"\nMissing values per column:")
print(data.isnull().sum())

## Train/Holdout Split

We split the data into:
- **Training Set (80% - 8M records)**: For model training
- **Holdout Set (20% - 2M records)**: For quality evaluation

This split ensures we can properly assess synthetic data quality against unseen real data.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training/holdout sets
# Passing stratify=data['income'] would keep the class balance identical in both sets,
# but with 10M rows a plain random split is close enough
# random_state=1 ensures reproducible results
train, holdout = train_test_split(
    data, 
    test_size=0.2,      # 20% for holdout evaluation
    random_state=1,     # Fixed seed for reproducibility
    shuffle=True        # Ensure random sampling
)

print(f"Training set: {train.shape[0]:,} records ({train.shape[0]/len(data)*100:.1f}%)")
print(f"Holdout set:  {holdout.shape[0]:,} records ({holdout.shape[0]/len(data)*100:.1f}%)")

# Verify the split maintains similar distributions
print(f"\nIncome distribution in training set:")
print(train['income'].value_counts(normalize=True))
print(f"\nIncome distribution in holdout set:")
print(holdout['income'].value_counts(normalize=True))

# 2. SDV Metadata Configuration

## Metadata Setup
SDV requires metadata to understand your data structure. We'll use auto-detection to identify column types (numerical vs categorical), then validate the configuration.

## Auto-Detecting Metadata
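SDV can infer column types directly from the data. For this dataset, auto-detection is expected to classify each column as numerical or categorical as listed in the Dataset Overview; the cells that follow print the detected types so you can confirm them before training.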
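
To make the metadata structure concrete, here is a minimal pure-Python sketch of type detection that produces a dictionary shaped like the single-table output of `metadata.to_dict()`. The `detect_sdtypes` helper and its rule (numbers become `numerical`, everything else `categorical`) are assumptions for illustration only. SDV's real detection supports more types and is more careful.

```python
from numbers import Number


def detect_sdtypes(rows):
    """Guess an sdtype per column from a list of row dicts (hypothetical helper)."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        # bool is a Number subclass in Python, so exclude it explicitly
        is_numeric = bool(values) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        columns[name] = {"sdtype": "numerical" if is_numeric else "categorical"}
    # Same nesting the notebook reads: ['tables']['table']['columns']
    return {"tables": {"table": {"columns": columns}}}


# Two toy rows with a subset of the census columns
sample_rows = [
    {"age": 39, "workclass": "State-gov", "hours_per_week": 40, "income": "<=50K"},
    {"age": 50, "workclass": "Self-emp-not-inc", "hours_per_week": 13, "income": "<=50K"},
]
sketch_metadata = detect_sdtypes(sample_rows)
print(sketch_metadata["tables"]["table"]["columns"])
```

A column that is stored as integers but really encodes categories would be mislabeled by a rule like this, which is why reviewing the detected types is worthwhile.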

In [30]:
from sdv.metadata import Metadata

# Auto-detect metadata from the training data
# Note: We wrap the DataFrame in a dict with table name 'table' as required by SDV
# Using only training data to avoid data leakage
metadata = Metadata.detect_from_dataframes({'table': train})

In [None]:
# Display the auto-detected metadata
print('Auto-detected metadata structure:\n')
print(metadata)

# Show a summary of detected column types
table_metadata = metadata.to_dict()['tables']['table']['columns']
numerical_cols = [col for col, info in table_metadata.items() if info['sdtype'] == 'numerical']
categorical_cols = [col for col, info in table_metadata.items() if info['sdtype'] == 'categorical']

print(f"\n📊 Metadata Summary:")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")

In [None]:
# Validate the metadata structure
try:
    metadata.validate()
    print("✅ Metadata validation passed")
except Exception as e:
    print(f"❌ Metadata validation failed: {e}")
    # You would fix metadata issues here if any exist

In [None]:
# Validate that the metadata matches the actual data structure
try:
    metadata.validate_data(data={'table': train})  # Use train data for consistency
    print("✅ Data validation against metadata passed")
except Exception as e:
    print(f"❌ Data validation failed: {e}")
    # This would indicate mismatches between metadata and actual data

# 3. SDV: Training and Generation

## Gaussian Copula Synthesizer
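We'll use SDV's Gaussian Copula Synthesizer. It learns a marginal distribution for each column and captures the dependencies between columns with a multivariate Gaussian over the transformed values, so sampled rows approximately preserve both the per-column distributions and the correlations between columns.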
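
The idea behind a Gaussian copula can be sketched for two numeric columns in plain Python. This is a toy illustration under simplifying assumptions, not SDV's implementation: it uses empirical ranks instead of fitted marginals, handles only two columns, and the names `to_normal_scores`, `pearson`, and `sample_copula` are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def to_normal_scores(values):
    """Map a column to standard-normal scores via its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sample_copula(col_a, col_b, num_rows, seed=0):
    """Sample rows whose normal scores share the correlation of the inputs."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    n = len(sorted_a)
    rows = []
    for _ in range(num_rows):
        z_a = rng.gauss(0, 1)
        # Correlated second score; max() guards against tiny rounding errors
        z_b = rho * z_a + math.sqrt(max(0.0, 1 - rho**2)) * rng.gauss(0, 1)
        # Map each score back to data space through the empirical quantile
        a = sorted_a[min(int(STD_NORMAL.cdf(z_a) * n), n - 1)]
        b = sorted_b[min(int(STD_NORMAL.cdf(z_b) * n), n - 1)]
        rows.append((a, b))
    return rho, rows


# Toy columns (illustrative values, not taken from the census data)
ages = [22, 25, 31, 38, 44, 50, 57, 63]
hours = [20, 35, 30, 40, 45, 40, 50, 38]
rho, synthetic_rows = sample_copula(ages, hours, num_rows=5)
print(f"estimated correlation of normal scores: {rho:.2f}")
print(synthetic_rows)
```

Because each value is drawn from the observed quantiles, this sketch can only reproduce values already present in the input; fitted marginals avoid that limitation.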

In [None]:
import time
from sdv.single_table import GaussianCopulaSynthesizer

# Initialize the synthesizer with our metadata
# GaussianCopula is good for mixed data types and preserving correlations
synthesizer = GaussianCopulaSynthesizer(metadata)

print("🚀 Starting SDV training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Train the synthesizer on our training data
# This learns the statistical relationships between variables
synthesizer.fit(train)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV training completed in {elapsed_minutes:.2f} minutes")


In [None]:
print("🎲 Starting SDV synthetic data generation...")

start_time = time.time()

# Generate synthetic data with the same number of rows as original dataset
# You can adjust num_rows based on your needs
target_rows = len(data)  # Generate same size as original
sdv_synthetic_data = synthesizer.sample(num_rows=target_rows)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️  Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(sdv_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
print(sdv_synthetic_data.head())
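
The start/stop timing pattern used for training and generation can be wrapped in a small context manager so every framework is measured the same way. This is one possible sketch; the `timed` helper and the `timings` dictionary are hypothetical names, and the list comprehension merely stands in for a call such as `synthesizer.sample(...)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("toy_generation", timings):
    generated = [i * i for i in range(100_000)]  # stand-in for a sampling call

# Guard against a zero duration on very fast runs
rate = len(generated) / max(timings["toy_generation"], 1e-9)
print(f"{timings['toy_generation']:.4f} s, {rate:,.0f} records/second")
```

Collecting all durations in one dictionary also makes it easy to print a side-by-side summary at the end of the notebook.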

In [None]:
import os

# Save SDV synthetic data
output_file = './data/sdv_synthetic_data.parquet'
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # Create ./data if it does not exist
sdv_synthetic_data.to_parquet(output_file, index=False)

# Get file size in MB
file_size_mb = os.path.getsize(output_file) / 1024**2

print(f"💾 SDV synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_mb:.1f} MB")


# 4. MOSTLY AI: Training and Generation

## Deep Learning Approach
The MOSTLY AI SDK trains a deep generative model for tabular data. With `local=True`, training runs on your own machine, and the training configuration exposes settings such as a maximum training time and optional differential privacy.

In [None]:
from mostlyai.sdk import MostlyAI

# Initialize Mostly AI SDK for local training
# local=True means we'll train models locally rather than using cloud API
print("🔧 Initializing Mostly AI SDK...")
mostly = MostlyAI(local=True)
print("✅ Mostly AI SDK initialized successfully")

In [None]:
print("🚀 Starting Mostly AI training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Configure and start training
# Mostly AI automatically detects column types and optimizes model architecture
g = mostly.train(
    config={
        "name": "US Census Income 10 million",
        "tables": [
            {
                "name": "census",
                "data": train,
                "tabularModelConfiguration": {
                    "max_training_time": 100,  # Limit training time (minutes)
                    # Optional: Add differential privacy
                    # 'differential_privacy': {
                    #     'max_epsilon': 5.0,      # Privacy budget
                    #     'delta': 1e-5,           # Privacy parameter
                    # }
                    # Optional: Model tuning
                    # "max_epochs": 50,
                    # "batch_size": 1024,
                },
                # Optional: Column-specific configurations
                # "columns": {
                #     "income": {"encode": "target"},  # Mark as target variable
                # }
            }
        ],
    },
    start=True,  # Start training immediately
    wait=True,   # Wait for completion before proceeding
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ Mostly AI training completed in {elapsed_minutes:.2f} minutes")

In [None]:
print("🎲 Starting Mostly AI synthetic data generation...")

start_time = time.time()

# Generate synthetic data using the trained generator
# size parameter controls how many records to generate
target_rows = len(data)
sd = mostly.generate(g, size=target_rows)
mostlyai_synthetic_data = sd.data()

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ Mostly AI generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️  Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(mostlyai_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
print(mostlyai_synthetic_data.head())

# Basic quality check
print(f"\nMissing values in synthetic data: {mostlyai_synthetic_data.isnull().sum().sum()}")

In [None]:
import os

# Save MOSTLY AI synthetic data for comparison
output_file = './data/mostlyai_synthetic_data.parquet'
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # Create ./data if it does not exist
mostlyai_synthetic_data.to_parquet(output_file, index=False)
file_size_bytes = os.path.getsize(output_file)
print(f"💾 Mostly AI synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_bytes / 1024**2:.1f} MB")

# 5. Quality Assessment and Comparison

## Evaluation Framework
We'll use MOSTLY AI's open-source QA library (`mostlyai-qa`) to evaluate both synthetic datasets against the same training and holdout data. The assessment includes:

- **Accuracy Metrics**: How well synthetic data preserves statistical distributions (univariate, bivariate, trivariate)
- **Similarity Analysis**: Comparison between training, holdout, and synthetic data
- **DCR Privacy Metrics**: Distance to Closest Record analysis for privacy assessment
- **Overall Quality Score**: Combined metric for synthetic data fidelity
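
As a rough illustration of the accuracy idea, the sketch below scores a single categorical column as 1 minus the total variation distance (TVD) between the real and synthetic category frequencies. It is a simplified assumption about how such a score can be computed: `univariate_accuracy` is a hypothetical helper, the income lists are toy values rather than results from this dataset, and the library's own computation (binning, bivariate and trivariate combinations, aggregation) is more involved.

```python
from collections import Counter


def univariate_accuracy(real_values, synthetic_values):
    """Return 1 - TVD between the category frequencies of two columns."""
    real_freq = Counter(real_values)
    syn_freq = Counter(synthetic_values)
    n_real, n_syn = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(syn_freq)
    # TVD is half the sum of absolute differences in relative frequency
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - syn_freq[c] / n_syn) for c in categories
    )
    return 1.0 - tvd


# Toy columns: 76/24 split in the real data, 70/30 in the synthetic data
real_income = ["<=50K"] * 76 + [">50K"] * 24
synthetic_income = ["<=50K"] * 70 + [">50K"] * 30
print(f"{univariate_accuracy(real_income, synthetic_income):.2f}")  # 0.94
```

Identical frequency tables give a score of 1.0, and columns with no category in common give 0.0.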

### Key Privacy Metrics:
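- **DCR Share**: Proportion of synthetic records that are closer to a training record than to a holdout record. It should not be significantly above 0.5; values well above 0.5 suggest the generator reproduces training records too closely.
- **DCR Training**: Average distance from each synthetic record to its closest training record. Very small values point to near-copies of training data, so interpret it relative to the same distance measured against holdout records.
- **Optimal DCR Share**: ~0.5, meaning synthetic records are as close to unseen holdout data as they are to training data.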
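
The DCR share can be illustrated on a handful of 2-D points. In this sketch it is the fraction of synthetic points whose nearest training point is closer than their nearest holdout point. The helper names and toy coordinates are assumptions for illustration; the QA library works on embeddings of sampled records (hence `max_sample_size_embeddings`) rather than raw Euclidean distances like these.

```python
import math


def nearest_distance(point, records):
    """Euclidean distance from point to its closest record."""
    return min(math.dist(point, r) for r in records)


def dcr_share(synthetic, training, holdout):
    """Share of synthetic points strictly closer to training than to holdout."""
    closer_to_training = sum(
        nearest_distance(s, training) < nearest_distance(s, holdout)
        for s in synthetic
    )
    return closer_to_training / len(synthetic)


# Toy points (illustrative only)
training = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
holdout = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5)]
synthetic = [(0.1, 0.0), (1.4, 1.5), (2.1, 2.0), (2.4, 2.6)]

print(dcr_share(synthetic, training, holdout))  # 0.5
# A generator that simply copies training rows is flagged with a share of 1.0
print(dcr_share(training, training, holdout))   # 1.0
```

A share near 0.5 means the synthetic points treat training and holdout data alike, which is the behavior expected from a generator that has learned general patterns rather than individual records.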

Let's compare the results from both frameworks:

In [None]:
# Import and initialize the quality assessment framework
from mostlyai import qa

# Initialize logging to see detailed evaluation progress
qa.init_logging()
print("🔍 Quality assessment framework initialized")

In [None]:
print("📊 Evaluating SDV synthetic data quality...")

# Load the SDV synthetic dataset
sdv_synthetic_data = pd.read_parquet('./data/sdv_synthetic_data.parquet')

# Run comprehensive quality assessment
# This compares synthetic data against training and holdout sets
report_path, metrics = qa.report(
    syn_tgt_data=sdv_synthetic_data,    # SDV synthetic data
    trn_tgt_data=train,                 # Original training data
    hol_tgt_data=holdout,               # Holdout data for validation
    max_sample_size_embeddings=10_000,  # Limit sample size for efficiency
    report_path='sdv_qa_report.html'    # HTML report output
)

print(f"📋 SDV Quality Report saved to: {report_path}")
print("\n📈 SDV Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
sdv_accuracy = metrics.accuracy.overall
sdv_dcr_share = metrics.distances.dcr_share
sdv_dcr_training = metrics.distances.dcr_training
print(f"\n🎯 SDV Summary:")
print(f"   Overall Accuracy: {sdv_accuracy:.3f}")
print(f"   DCR Share: {sdv_dcr_share:.3f} (target ~0.5; well above 0.5 signals privacy risk)")
print(f"   DCR Training: {sdv_dcr_training:.3f} (larger = synthetic records are farther from training records)")

In [None]:
print(train.describe())

In [None]:
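from sklearn.preprocessing import StandardScaler

# Standardize numeric columns using statistics from the training data only.
# Note: these scaled copies are for ad-hoc inspection; qa.report below receives the unscaled data.
# Select numeric columns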
numeric_cols = train.select_dtypes(include='number').columns

# Fit scaler on training data
scaler = StandardScaler()
train_scaled = train.copy()
train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])

holdout_scaled = holdout.copy()
holdout_scaled[numeric_cols] = scaler.transform(holdout[numeric_cols])

mostlyai_scaled = mostlyai_synthetic_data.copy()
mostlyai_scaled[numeric_cols] = scaler.transform(mostlyai_synthetic_data[numeric_cols])


In [None]:
print("📊 Evaluating Mostly AI synthetic data quality...")

# Load the Mostly AI synthetic dataset
mostlyai_synthetic_data = pd.read_parquet('./data/mostlyai_synthetic_data.parquet')

# Run comprehensive quality assessment for Mostly AI
report_path, metrics = qa.report(
    syn_tgt_data=mostlyai_synthetic_data,  # Mostly AI synthetic data
    trn_tgt_data=train,                    # Original training data
    hol_tgt_data=holdout,                  # Holdout data for validation
    max_sample_size_embeddings=10_000,     # Limit sample size for efficiency
    report_path='mostlyai_qa_report.html'  # HTML report output
)

print(f"📋 Mostly AI Quality Report saved to: {report_path}")
print("\n📈 Mostly AI Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
mai_accuracy = metrics.accuracy.overall
mai_dcr_share = metrics.distances.dcr_share
mai_dcr_training = metrics.distances.dcr_training
print(f"\n🎯 Mostly AI Summary:")
print(f"   Overall Accuracy: {mai_accuracy:.3f}")
print(f"   DCR Share: {mai_dcr_share:.3f} (target ~0.5; well above 0.5 signals privacy risk)")
print(f"   DCR Training: {mai_dcr_training:.3f} (larger = synthetic records are farther from training records)")

In [47]:
# Add a final comparison section
print("\n" + "="*60)
print("🏆 FINAL COMPARISON")
print("="*60)
print(f"SDV      - Accuracy: {sdv_accuracy:.3f}, DCR Share: {sdv_dcr_share:.3f}")
print(f"MostlyAI - Accuracy: {mai_accuracy:.3f}, DCR Share: {mai_dcr_share:.3f}")
print("\nInterpretation:")
print("• Higher accuracy = better statistical fidelity")
print("• DCR Share = share of synthetic records closer to training than to holdout data")
print("• DCR Share ~0.5 is the target; values well above 0.5 suggest memorized training records")
print("• Check HTML reports for detailed analysis")


🏆 FINAL COMPARISON
SDV      - Accuracy: 0.738, DCR Share: 0.538
MostlyAI - Accuracy: 0.979, DCR Share: 0.507

Interpretation:
• Higher accuracy = better statistical fidelity
• DCR Share = share of synthetic records closer to training than to holdout data
• DCR Share ~0.5 is the target; values well above 0.5 suggest memorized training records
• Check HTML reports for detailed analysis
