# Synthetic Data Generation: SDV vs MOSTLY AI Comparison

## Framework Comparison
This notebook compares two synthetic data generation approaches on a large-scale dataset:

- **SDV (Synthetic Data Vault)** - Business Source License
- **MOSTLY AI SDK** - Apache 2.0 License - Open Source

## Dataset & Objective
We'll use the **US Census Income dataset (10M records)** to:
- Compare training performance and generation speed
- Evaluate synthetic data quality using comprehensive metrics
- Assess privacy preservation capabilities
- Provide practical guidance for framework selection

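## Key Takeaways
- **Speed**: SDV's Gaussian Copula synthesizer trained in 14.2 minutes and generated 10M records in 1.6 minutes; MOSTLY AI trained in 105.1 minutes (with a 100-minute training limit) and generated 10M records in 2.7 minutes
- **Quality**: Overall accuracy was 0.738 for SDV and 0.981 for MOSTLY AI
- **Privacy**: DCR Share was close to 0.5 for both frameworks (0.510 for SDV, 0.513 for MOSTLY AI)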

In [1]:
%pip install -U sdv mostlyai-qa 'mostlyai[local]'

Note: you may need to restart the kernel to use updated packages.


# 1. Data Preparation

## Loading the Dataset
We'll use the US Census Income dataset with 10M records containing demographic, employment, and financial information - ideal for testing synthetic data generation at scale.

In [2]:
import pandas as pd

# Load the US Census dataset (10M records) from remote Parquet file
# Note: This is a large dataset - initial load may take a few minutes
data = pd.read_parquet('https://mostly-public-tutorials.s3.eu-central-1.amazonaws.com/datasets/census/census_10_mil.parquet')

# Display basic info about the dataset
print(f"Dataset shape: {data.shape}")
print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print("\nFirst 5 rows:")
print(data.head())

Dataset shape: (10000000, 15)
Memory usage: 6106.2 MB

First 5 rows:

| Age | Workclass         | FNLWGT | Education  | Education Num | Marital Status      | Occupation         | Relationship   | Race  | Sex    | Capital Gain | Capital Loss | Hours/Week | Native Country  | Income |
|-----|-------------------|--------|------------|----------------|----------------------|---------------------|----------------|-------|--------|---------------|---------------|-------------|------------------|--------|
| 39  | State-gov         | 77516  | Bachelors  | 13             | Never-married        | Adm-clerical        | Not-in-family  | White | Male   | 2174          | 0             | 40          | United-States     | <=50K  |
| 50  | Self-emp-not-inc  | 83311  | Bachelors  | 13             | Married-civ-spouse   | Exec-managerial     | Husband        | White | Male   | 0             | 0             | 13          | United-States     | <=50K  |
| 38  | Private           | 215646 | HS-grad    | 9              | Divorced             | Handlers-cleaners   | Not-in-family  | White | Male   | 0             | 0             | 40          | United-States     | <=50K  |
| 53  | Private           | 234721 | 11th       | 7              | Married-civ-spouse   | Handlers-cleaners   | Husband        | Black | Male   | 0             | 0             | 40          | United-States     | <=50K  |
| 28  | Private           | 338409 | Bachelors  | 13             | Married-civ-spouse   | Prof-specialty      | Wife           | Black | Female | 0             | 0             | 40          | Cuba              | <=50K  |


## Dataset Overview

The dataset contains 15 columns with mixed data types:
- **Numerical**: age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week
- **Categorical**: workclass, education, marital_status, occupation, relationship, race, sex, native_country, income

This combination of numerical and categorical data makes it ideal for testing both frameworks' capabilities.

In [3]:
# Display column names and basic data types
print("Column names:")
print(data.columns.tolist())
print(f"\nData types:")
print(data.dtypes)
print(f"\nMissing values per column:")
print(data.isnull().sum())

Column names:
['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

Data types:
age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

Missing values per column:
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
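The six `int64` columns need only about 460 MB (6 columns × 8 bytes × 10M rows), so most of the 6,106 MB reported at load time comes from the nine `object` columns, which store Python strings. If memory is tight, one option is to convert low-cardinality string columns to the pandas `category` dtype before training. The helper below is an illustrative sketch and not part of the original notebook: `to_category` and its `max_unique_ratio` threshold are hypothetical names, and it is worth verifying that each downstream library handles `category` columns as expected.

```python
import pandas as pd


def to_category(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with low-cardinality string columns stored as 'category'."""
    out = df.copy()
    for col in out.columns:
        is_text = pd.api.types.is_string_dtype(out[col])
        few_values = out[col].nunique(dropna=False) <= max_unique_ratio * len(out)
        if is_text and few_values:
            out[col] = out[col].astype("category")
    return out


# Tiny stand-in for the census frame (values borrowed from the preview rows)
demo = pd.DataFrame({
    "age": [39, 50, 38, 53] * 250,
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"] * 250,
})
compact = to_category(demo)

before = demo.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

The saving grows with row count because each distinct string is stored once and rows hold small integer codes.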


## Train/Holdout Split

We split the data into:
- **Training Set (80% - 8M records)**: For model training
- **Holdout Set (20% - 2M records)**: For quality evaluation

This split ensures we can properly assess synthetic data quality against unseen real data.

In [4]:
from sklearn.model_selection import train_test_split

# Split data into training/holdout sets
# A stratified split would be preferable for classification tasks, but it is not critical here
# random_state=1 makes the split reproducible
train, holdout = train_test_split(
    data, 
    test_size=0.2,      # 20% for holdout evaluation
    random_state=1,     # Fixed seed for reproducibility
    shuffle=True        # Ensure random sampling
)

print(f"Training set: {train.shape[0]:,} records ({train.shape[0]/len(data)*100:.1f}%)")
print(f"Holdout set:  {holdout.shape[0]:,} records ({holdout.shape[0]/len(data)*100:.1f}%)")

# Verify the split maintains similar distributions
print(f"\nIncome distribution in training set:")
print(train['income'].value_counts(normalize=True))
print(f"\nIncome distribution in holdout set:")
print(holdout['income'].value_counts(normalize=True))

Training set: 8,000,000 records (80.0%)
Holdout set:  2,000,000 records (20.0%)

Income distribution in training set:
income
<=50K    0.760662
>50K     0.239338
Name: proportion, dtype: float64

Income distribution in holdout set:
income
<=50K    0.760922
>50K     0.239078
Name: proportion, dtype: float64
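The random split already yields near-identical income proportions because the dataset is so large. For smaller or more imbalanced data, a stratified split enforces this by construction. A minimal sketch on a made-up toy frame (the `demo` data and the `train_s`/`holdout_s` names are illustrative only), using the `stratify` argument of scikit-learn's `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the same 76/24 class balance as the census income column
demo = pd.DataFrame({
    "age": range(1000),
    "income": ["<=50K"] * 760 + [">50K"] * 240,
})

train_s, holdout_s = train_test_split(
    demo,
    test_size=0.2,
    random_state=1,
    stratify=demo["income"],  # keep class proportions equal in both splits
)

print(train_s["income"].value_counts(normalize=True))
print(holdout_s["income"].value_counts(normalize=True))
```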


# 2. SDV Metadata Configuration

## Metadata Setup
SDV requires metadata to understand your data structure. We'll use auto-detection to identify column types (numerical vs categorical), then validate the configuration.

## Auto-Detecting Metadata
SDV can automatically detect column types from the data. The auto-detection correctly identifies our numerical and categorical columns.

In [5]:
from sdv.metadata import Metadata

# Auto-detect metadata from the training data
# Note: We wrap the DataFrame in a dict with table name 'table' as required by SDV
# Using only training data to avoid data leakage
metadata = Metadata.detect_from_dataframes({'table': train})

In [6]:
# Display the auto-detected metadata
print('Auto-detected metadata structure:\n')
print(metadata)

# Show a summary of detected column types
table_metadata = metadata.to_dict()['tables']['table']['columns']
numerical_cols = [col for col, info in table_metadata.items() if info['sdtype'] == 'numerical']
categorical_cols = [col for col, info in table_metadata.items() if info['sdtype'] == 'categorical']

print(f"\n📊 Metadata Summary:")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")

Auto-detected metadata structure:

{
    "tables": {
        "table": {
            "columns": {
                "age": {
                    "sdtype": "numerical"
                },
                "workclass": {
                    "sdtype": "categorical"
                },
                "fnlwgt": {
                    "sdtype": "numerical"
                },
                "education": {
                    "sdtype": "categorical"
                },
                "education_num": {
                    "sdtype": "numerical"
                },
                "marital_status": {
                    "sdtype": "categorical"
                },
                "occupation": {
                    "sdtype": "categorical"
                },
                "relationship": {
                    "sdtype": "categorical"
                },
                "race": {
                    "sdtype": "categorical"
                },
                "sex": {
                    "sdtype": "categori

In [7]:
# Validate the metadata structure
try:
    metadata.validate()
    print("✅ Metadata validation passed")
except Exception as e:
    print(f"❌ Metadata validation failed: {e}")
    # You would fix metadata issues here if any exist

✅ Metadata validation passed


In [8]:
# Validate that the metadata matches the actual data structure
try:
    metadata.validate_data(data={'table': train})  # Use train data for consistency
    print("✅ Data validation against metadata passed")
except Exception as e:
    print(f"❌ Data validation failed: {e}")
    # This would indicate mismatches between metadata and actual data

✅ Data validation against metadata passed


# 3. SDV: Training and Generation

## Gaussian Copula Synthesizer
We'll use SDV's Gaussian Copula Synthesizer, which learns a marginal distribution for each column, models the dependencies between columns with a Gaussian copula, and then samples synthetic rows that preserve those relationships.

In [9]:
import time
from sdv.single_table import GaussianCopulaSynthesizer

# Initialize the synthesizer with our metadata
# GaussianCopula is good for mixed data types and preserving correlations
synthesizer = GaussianCopulaSynthesizer(metadata)

print("🚀 Starting SDV training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Train the synthesizer on our training data
# This learns the statistical relationships between variables
synthesizer.fit(train)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV training completed in {elapsed_minutes:.2f} minutes")



We strongly recommend saving the metadata using 'save_to_json' for replicability in future SDV versions.



🚀 Starting SDV training...
Training on 8,000,000 records with 15 features
✅ SDV training completed in 14.23 minutes
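The same start/stop timing pattern recurs in every training and generation cell. One way to cut the repetition is a small context manager. This is a sketch rather than part of the original benchmark; the `timed` helper and the `timings` dictionary are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Measure wall-clock time of the enclosed block; store seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label} completed in {results[label] / 60:.2f} minutes")


timings = {}
with timed("demo step", timings):
    time.sleep(0.05)  # stand-in for a call such as synthesizer.fit(train)
```

`time.perf_counter()` is used here instead of `time.time()` because it is monotonic and intended for measuring intervals.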


In [10]:
print("🎲 Starting SDV synthetic data generation...")

start_time = time.time()

# Generate synthetic data with the same number of rows as original dataset
# You can adjust num_rows based on your needs
target_rows = len(data)  # Generate same size as original
sdv_synthetic_data = synthesizer.sample(num_rows=target_rows)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️  Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(sdv_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
print(sdv_synthetic_data.head())

🎲 Starting SDV synthetic data generation...
✅ SDV generation completed in 1.57 minutes
⏱️  Generation rate: 106,032 records/second
📊 Generated 10,000,000 synthetic records

First 5 synthetic records:
   age workclass  fnlwgt     education  education_num      marital_status  \
0   50   Private  201012     Bachelors              9  Married-civ-spouse   
1   24   Private  116032     Bachelors             12       Never-married   
2   21   Private  287433  Some-college             12       Never-married   
3   40   Private  174244     Assoc-voc             12  Married-civ-spouse   
4   35   Private  400403    Assoc-acdm              7  Married-civ-spouse   

          occupation   relationship   race     sex  capital_gain  \
0              Sales        Husband  White    Male          2915   
1       Adm-clerical      Unmarried  Black    Male             3   
2              Sales  Not-in-family  White    Male           319   
3      Other-service  Not-in-family  White  Female         22380 

In [11]:
import os

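# Save SDV synthetic data (create the output directory if it does not exist)
output_file = './data/sdv_synthetic_data.parquet'
os.makedirs(os.path.dirname(output_file), exist_ok=True)
sdv_synthetic_data.to_parquet(output_file, index=False)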

# Get file size in MB
file_size_mb = os.path.getsize(output_file) / 1024**2

print(f"💾 SDV synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_mb:.1f} MB")


💾 SDV synthetic data saved to: ./data/sdv_synthetic_data.parquet
📁 File size: 135.6 MB


# 4. MOSTLY AI: Training and Generation

## Deep Learning Approach
MOSTLY AI trains a deep generative model on tabular data. With `local=True`, the SDK runs training on the local machine, and the training configuration exposes options such as a maximum training time and differential privacy.

In [12]:
from mostlyai.sdk import MostlyAI

# Initialize Mostly AI SDK for local training
# local=True means we'll train models locally rather than using cloud API
print("🔧 Initializing Mostly AI SDK...")
mostly = MostlyAI(local=True)
print("✅ Mostly AI SDK initialized successfully")

🔧 Initializing Mostly AI SDK...


✅ Mostly AI SDK initialized successfully


In [14]:
print("🚀 Starting Mostly AI training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Configure and start training
# Mostly AI automatically detects column types and optimizes model architecture
g = mostly.train(
    config={
        "name": "US Census Income 10 million",
        "tables": [
            {
                "name": "census",
                "data": train,
                "tabularModelConfiguration": {
                    "max_training_time": 100,  # Limit training time (minutes)
                    # Optional: Add differential privacy
                    # 'differential_privacy': {
                    #     'max_epsilon': 5.0,      # Privacy budget
                    #     'delta': 1e-5,           # Privacy parameter
                    # }
                    # Optional: Model tuning
                    # "max_epochs": 50,
                    # "batch_size": 1024,
                },
                # Optional: Column-specific configurations
                # "columns": {
                #     "income": {"encode": "target"},  # Mark as target variable
                # }
            }
        ],
    },
    start=True,  # Start training immediately
    wait=True,   # Wait for completion before proceeding
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ Mostly AI training completed in {elapsed_minutes:.2f} minutes")

🚀 Starting Mostly AI training...
Training on 8,000,000 records with 15 features


divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul



✅ Mostly AI training completed in 105.14 minutes


In [16]:
print("🎲 Starting Mostly AI synthetic data generation...")

start_time = time.time()

# Generate synthetic data using the trained generator
# size parameter controls how many records to generate
target_rows = len(data)
sd = mostly.generate(g, size=target_rows)
mostlyai_synthetic_data = sd.data()

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ Mostly AI generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️  Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(mostlyai_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
print(mostlyai_synthetic_data.head())

# Basic quality check
print(f"\nMissing values in synthetic data: {mostlyai_synthetic_data.isnull().sum().sum()}")

🎲 Starting Mostly AI synthetic data generation...


✅ Mostly AI generation completed in 2.68 minutes
⏱️  Generation rate: 62,146 records/second
📊 Generated 10,000,000 synthetic records

First 5 synthetic records:
   age workclass  fnlwgt     education  education_num      marital_status  \
0   70         ?  183583       5th-6th              3             Widowed   
1   23   Private  427427  Some-college             10       Never-married   
2   22   Private  123206          11th              7       Never-married   
3   39   Private  224820          10th              6  Married-civ-spouse   
4   21   Private   43014       HS-grad              9       Never-married   

          occupation    relationship                race     sex  \
0                  ?   Not-in-family               White    Male   
1       Adm-clerical  Other-relative               White  Female   
2  Handlers-cleaners       Own-child  Amer-Indian-Eskimo    Male   
3       Craft-repair         Husband               Black    Male   
4    Protective-serv       Own-child

In [19]:
# Save Mostly AI synthetic data for comparison
output_file = './data/mostlyai_synthetic_data.parquet'
mostlyai_synthetic_data.to_parquet(output_file, index=False)
file_size_bytes = os.path.getsize(output_file)
print(f"💾 Mostly AI synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_bytes / 1024**2:.1f} MB")

💾 Mostly AI synthetic data saved to: ./data/mostlyai_synthetic_data.parquet
📁 File size: 103.7 MB
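Before running a full quality report, a lightweight schema-level check can catch obvious problems such as unseen categories or out-of-range numbers. The helper below is an illustrative sketch: `basic_checks` is a hypothetical name, and the toy frames stand in for `train` and a synthetic dataset. It does not replace a proper quality assessment.

```python
import pandas as pd


def basic_checks(real: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Collect simple schema-level discrepancies between real and synthetic data."""
    issues = {}
    if list(real.columns) != list(synthetic.columns):
        issues["columns"] = "column names or order differ"
        return issues
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            lo, hi = real[col].min(), real[col].max()
            n_out = int(((synthetic[col] < lo) | (synthetic[col] > hi)).sum())
            if n_out:
                issues[col] = f"{n_out} values outside real range [{lo}, {hi}]"
        else:
            unseen = set(synthetic[col].dropna().unique()) - set(real[col].dropna().unique())
            if unseen:
                issues[col] = f"unseen categories: {sorted(unseen)}"
    return issues


real_demo = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})
syn_demo = pd.DataFrame({"age": [41, 95, 40], "sex": ["Male", "Other", "Female"]})
print(basic_checks(real_demo, syn_demo))
```

Values outside the observed range are not necessarily errors, since a generator may legitimately extrapolate, so the result is best treated as a list of things to inspect.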


# 5. Quality Assessment and Comparison

## Evaluation Framework
We'll use MOSTLY AI's QA library (`mostlyai-qa`) to evaluate both synthetic datasets against the same training and holdout data. The assessment includes:

- **Accuracy Metrics**: How well synthetic data preserves statistical distributions (univariate, bivariate, trivariate)
- **Similarity Analysis**: Comparison between training, holdout, and synthetic data
- **DCR Privacy Metrics**: Distance to Closest Record analysis for privacy assessment
- **Overall Quality Score**: Combined metric for synthetic data fidelity
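To make the accuracy idea concrete: for a single categorical column, one common fidelity measure is one minus the total variation distance (TVD) between the real and synthetic category frequencies. The sketch below is a simplified pure-Python illustration of that idea, not the QA library's implementation, which also handles numerical columns and column pairs and triples. `univariate_accuracy` is a hypothetical helper name and the toy samples are made up.

```python
from collections import Counter


def univariate_accuracy(real_values, synthetic_values) -> float:
    """Return 1 - TVD between the category frequencies of two samples."""
    real_freq = Counter(real_values)
    syn_freq = Counter(synthetic_values)
    n_real, n_syn = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(syn_freq)
    # TVD is half the L1 distance between the two probability vectors
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - syn_freq[c] / n_syn) for c in categories
    )
    return 1.0 - tvd


real_income = ["<=50K"] * 76 + [">50K"] * 24
syn_income = ["<=50K"] * 70 + [">50K"] * 30
print(f"{univariate_accuracy(real_income, syn_income):.2f}")  # prints 0.94
```

Here the synthetic sample shifts six percentage points of mass from one class to the other, so the TVD is 0.06 and the accuracy is 0.94.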

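### Key Privacy Metrics:
- **DCR Share**: Proportion of synthetic records that are closer to a training record than to a holdout record. Values well above 0.5 suggest the synthetic data sits closer to the training data than unseen real data does, which is a privacy risk
- **DCR Training**: Average distance from each synthetic record to its closest training record. Larger values mean synthetic records lie farther from training records, but very large values can also reflect low fidelity, so read this alongside accuracy
- **Optimal DCR Share**: ~0.5 indicates that synthetic records are no closer to the training data than to the holdout data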
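A distance-to-closest-record comparison can be illustrated in a few lines: for each synthetic record, find the distance to its nearest training record and to its nearest holdout record, then count how often the training record is the closer one. The sketch below is a simplified pure-Python illustration on made-up 2D points, not the QA library's implementation (which works on embeddings of the full records). `dcr_summary` is a hypothetical helper, and splitting ties evenly is an assumption.

```python
import math
import random


def nearest_distance(point, records):
    """Euclidean distance from point to its closest record."""
    return min(math.dist(point, r) for r in records)


def dcr_summary(synthetic, training, holdout):
    """Return (share of synthetic rows closer to training than holdout, mean DCR to training)."""
    closer_to_training = 0.0
    total_dcr_training = 0.0
    for s in synthetic:
        d_trn = nearest_distance(s, training)
        d_hol = nearest_distance(s, holdout)
        total_dcr_training += d_trn
        if d_trn < d_hol:
            closer_to_training += 1.0
        elif d_trn == d_hol:
            closer_to_training += 0.5  # assumption: split ties evenly
    n = len(synthetic)
    return closer_to_training / n, total_dcr_training / n


rng = random.Random(0)


def draw(n):
    """Draw n points from a 2D standard normal distribution."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


training, holdout = draw(200), draw(200)
independent = draw(200)                  # fresh draws from the same distribution
memorized = training[:100] + draw(100)   # half of the rows copied from training

share_ind, dcr_ind = dcr_summary(independent, training, holdout)
share_mem, dcr_mem = dcr_summary(memorized, training, holdout)
print(f"independent: share={share_ind:.2f}, mean DCR to training={dcr_ind:.3f}")
print(f"memorized:   share={share_mem:.2f}, mean DCR to training={dcr_mem:.3f}")
```

Independent draws should land near a share of 0.5, while the partly memorized sample is pulled well above 0.5 and has a smaller average distance to the training data.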

Let's compare the results from both frameworks:

In [20]:
# Import and initialize the quality assessment framework
from mostlyai import qa

# Initialize logging to see detailed evaluation progress
qa.init_logging()
print("🔍 Quality assessment framework initialized")

🔍 Quality assessment framework initialized


In [22]:
print("📊 Evaluating SDV synthetic data quality...")

# Load the SDV synthetic dataset
sdv_synthetic_data = pd.read_parquet('./data/sdv_synthetic_data.parquet')

# Run comprehensive quality assessment
# This compares synthetic data against training and holdout sets
report_path, metrics = qa.report(
    syn_tgt_data=sdv_synthetic_data,    # SDV synthetic data
    trn_tgt_data=train,                 # Original training data
    hol_tgt_data=holdout,               # Holdout data for validation
    max_sample_size_embeddings=10_000,  # Limit sample size for efficiency
    report_path='sdv_qa_report.html'    # HTML report output
)

print(f"📋 SDV Quality Report saved to: {report_path}")
print("\n📈 SDV Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
sdv_accuracy = metrics.accuracy.overall
sdv_dcr_share = metrics.distances.dcr_share
sdv_dcr_training = metrics.distances.dcr_training
print(f"\n🎯 SDV Summary:")
print(f"   Overall Accuracy: {sdv_accuracy:.3f}")
print(f"   DCR Share: {sdv_dcr_share:.3f} (close to 0.5 is ideal)")
print(f"   DCR Training: {sdv_dcr_training:.3f} (higher = farther from training data)")

📊 Evaluating SDV synthetic data quality...
[2025-07-04 15:30:36,083] INFO   : prepared training data for accuracy: (8000000, 15)
[2025-07-04 15:30:40,346] INFO   : prepared holdout data for accuracy: (2000000, 15)
[2025-07-04 15:31:09,223] INFO   : prepared synthetic data for accuracy: (10000000, 15)
[2025-07-04 15:31:09,868] INFO   : encode datasets for embeddings
[2025-07-04 15:31:10,676] INFO   : calculated embeddings: syn=(10000, 25), trn=(10000, 25), hol=(10000, 25)
[2025-07-04 15:31:10,677] INFO   : report accuracy and correlations
[2025-07-04 15:31:10,677] INFO   : calculate original data bins
[2025-07-04 15:31:20,522] INFO   : store original data bins
[2025-07-04 15:31:20,539] INFO   : calculate synthetic data bins
[2025-07-04 15:31:28,608] INFO   : calculate correlations
[2025-07-04 15:31:36,043] INFO   : calculate correlations
[2025-07-04 15:31:42,696] INFO   : calculated univariates for 15 columns in 1.58 seconds
[2025-07-04 15:31:52,799] INFO   : calculated bivariate accura


divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul



[2025-07-04 15:34:17,751] INFO   : calculate and plot distances
[2025-07-04 15:34:17,751] INFO   : calculate distances
[2025-07-04 15:34:17,991] INFO   : calculated DCRs for data.shape=(10000, 25) and query.shape=(10000, 25) in 0.17s
[2025-07-04 15:34:18,161] INFO   : calculated DCRs for data.shape=(10000, 25) and query.shape=(10000, 25) in 0.17s
[2025-07-04 15:34:18,331] INFO   : calculated DCRs for data.shape=(10000, 25) and query.shape=(10000, 25) in 0.17s
[2025-07-04 15:34:18,334] INFO   : DCR Share: 48.9%, NNDR Ratio: 0.949 - ALL columns
[2025-07-04 15:34:18,386] INFO   : calculated DCRs for data.shape=(10000, 8) and query.shape=(10000, 8) in 0.05s
[2025-07-04 15:34:18,429] INFO   : calculated DCRs for data.shape=(10000, 8) and query.shape=(10000, 8) in 0.04s
[2025-07-04 15:34:18,446] INFO   : calculated DCRs for data.shape=(10000, 8) and query.shape=(10000, 8) in 0.02s
[2025-07-04 15:34:18,447] INFO   : DCR Share: 51.0%, NNDR Ratio: 1.347 - 8 columns [[2, 6, 8, 9, 12, 13, 20, 23]

In [24]:
print("📊 Evaluating Mostly AI synthetic data quality...")

# Load the Mostly AI synthetic dataset
mostlyai_synthetic_data = pd.read_parquet('./data/mostlyai_synthetic_data.parquet')

# Run comprehensive quality assessment for Mostly AI
report_path, metrics = qa.report(
    syn_tgt_data=mostlyai_synthetic_data,  # Mostly AI synthetic data
    trn_tgt_data=train,                    # Original training data
    hol_tgt_data=holdout,                  # Holdout data for validation
    max_sample_size_embeddings=10_000,     # Limit sample size for efficiency
    report_path='mostlyai_qa_report.html'  # HTML report output
)

print(f"📋 Mostly AI Quality Report saved to: {report_path}")
print("\n📈 Mostly AI Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
mai_accuracy = metrics.accuracy.overall
mai_dcr_share = metrics.distances.dcr_share
mai_dcr_training = metrics.distances.dcr_training
print(f"\n🎯 Mostly AI Summary:")
print(f"   Overall Accuracy: {mai_accuracy:.3f}")
print(f"   DCR Share: {mai_dcr_share:.3f} (close to 0.5 is ideal)")
print(f"   DCR Training: {mai_dcr_training:.3f} (higher = farther from training data)")

📊 Evaluating Mostly AI synthetic data quality...
[2025-07-04 15:39:12,164] INFO   : prepared training data for accuracy: (8000000, 15)
[2025-07-04 15:39:16,490] INFO   : prepared holdout data for accuracy: (2000000, 15)
[2025-07-04 15:41:18,932] INFO   : prepared synthetic data for accuracy: (10000000, 15)
[2025-07-04 15:41:19,605] INFO   : encode datasets for embeddings
[2025-07-04 15:41:20,425] INFO   : calculated embeddings: syn=(10000, 25), trn=(10000, 25), hol=(10000, 25)
[2025-07-04 15:41:20,425] INFO   : report accuracy and correlations
[2025-07-04 15:41:20,426] INFO   : calculate original data bins
[2025-07-04 15:41:30,057] INFO   : store original data bins
[2025-07-04 15:41:30,075] INFO   : calculate synthetic data bins
[2025-07-04 15:41:43,024] INFO   : calculate correlations
[2025-07-04 15:41:50,190] INFO   : calculate correlations
[2025-07-04 15:41:56,291] INFO   : calculated univariates for 15 columns in 1.48 seconds
[2025-07-04 15:42:05,724] INFO   : calculated bivariate 


divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul



[2025-07-04 15:44:44,130] INFO   : calculate and plot distances
[2025-07-04 15:44:44,130] INFO   : calculate distances
[2025-07-04 15:44:44,305] INFO   : calculated DCRs for data.shape=(10000, 25) and query.shape=(10000, 25) in 0.16s
[2025-07-04 15:44:44,473] INFO   : calculated DCRs for data.shape=(10000, 25) and query.shape=(10000, 25) in 0.17s
[2025-07-04 15:44:44,640] INFO   : calculated DCRs for data.shape=(10000, 25) and query.shape=(10000, 25) in 0.17s
[2025-07-04 15:44:44,642] INFO   : DCR Share: 50.5%, NNDR Ratio: 0.897 - ALL columns
[2025-07-04 15:44:44,682] INFO   : calculated DCRs for data.shape=(10000, 12) and query.shape=(10000, 12) in 0.04s
[2025-07-04 15:44:44,724] INFO   : calculated DCRs for data.shape=(10000, 12) and query.shape=(10000, 12) in 0.04s
[2025-07-04 15:44:44,766] INFO   : calculated DCRs for data.shape=(10000, 12) and query.shape=(10000, 12) in 0.04s
[2025-07-04 15:44:44,767] INFO   : DCR Share: 50.0%, NNDR Ratio: 1.035 - 12 columns [[0, 3, 4, 5, 7, 10, 1

In [25]:
# Add a final comparison section
print("\n" + "="*60)
print("🏆 FINAL COMPARISON")
print("="*60)
print(f"SDV      - Accuracy: {sdv_accuracy:.3f}, DCR Share: {sdv_dcr_share:.3f}, DCR Training: {sdv_dcr_training:.3f}")
print(f"MostlyAI - Accuracy: {mai_accuracy:.3f}, DCR Share: {mai_dcr_share:.3f}, DCR Training: {mai_dcr_training:.3f}")
print("\nInterpretation:")
print("• Higher accuracy = better statistical fidelity")
print("• DCR Share ~0.5 = synthetic records are no closer to training data than to holdout data")
print("• DCR Share well above 0.5 = synthetic records sit closer to training data (privacy risk)")
print("• Higher DCR Training = synthetic records farther from training data")
print("• Check HTML reports for detailed analysis")


🏆 FINAL COMPARISON
SDV      - Accuracy: 0.738, DCR Share: 0.510, DCR Training: 0.123
MostlyAI - Accuracy: 0.981, DCR Share: 0.513, DCR Training: 0.014

Interpretation:
• Higher accuracy = better statistical fidelity
• DCR Share ~0.5 = synthetic records are no closer to training data than to holdout data
• DCR Share well above 0.5 = synthetic records sit closer to training data (privacy risk)
• Higher DCR Training = synthetic records farther from training data
• Check HTML reports for detailed analysis
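The figures reported in this run can be put side by side with a few lines of arithmetic. The numbers below are copied from the cell outputs in this notebook; the dictionary layout and variable names are only for illustration.

```python
# Timings (minutes) and QA metrics as printed by the cells in this notebook
results = {
    "SDV": {
        "train_min": 14.23, "generate_min": 1.57,
        "accuracy": 0.738, "dcr_share": 0.510, "dcr_training": 0.123,
    },
    "MOSTLY AI": {
        "train_min": 105.14, "generate_min": 2.68,
        "accuracy": 0.981, "dcr_share": 0.513, "dcr_training": 0.014,
    },
}

sdv, mai = results["SDV"], results["MOSTLY AI"]
train_ratio = mai["train_min"] / sdv["train_min"]
generate_ratio = mai["generate_min"] / sdv["generate_min"]
accuracy_gap = mai["accuracy"] - sdv["accuracy"]

print(f"Training time ratio (MOSTLY AI / SDV):   {train_ratio:.1f}x")
print(f"Generation time ratio (MOSTLY AI / SDV): {generate_ratio:.1f}x")
print(f"Accuracy gap (MOSTLY AI - SDV):          {accuracy_gap:+.3f}")
```

In this run, MOSTLY AI took roughly 7.4 times longer to train and 1.7 times longer to generate, in exchange for an accuracy gain of about 0.24, with DCR Share near 0.5 for both.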
