# MOSTLY AI vs. SDV Comparison - Single Table Scenario

## Framework Comparison
This notebook compares two synthetic data generation libraries on a large-scale dataset:

- **SDV (Synthetic Data Vault)** - Business Source License
- **MOSTLY AI SDK** - Apache 2.0 License - Open Source

## Dataset & Objective
We'll use the **US Census Income dataset (ACS Income 2018, about 1.48M records)** to:
- Compare training performance and generation speed
- Evaluate synthetic data quality using comprehensive metrics
- Assess privacy preservation capabilities
- Provide practical guidance for framework selection

## Key Takeaways
- Performance benchmarks on large-scale data
- Quality comparison metrics
- Privacy assessment results

In [1]:
# Install SDK in CLIENT mode
!uv pip install -U mostlyai
# Or install in LOCAL mode (needed for the local training used in this notebook)
!uv pip install -U 'mostlyai[local]'
# Note: Restart kernel session after installation!

!uv pip install -q scikit-learn seaborn lightgbm sdv

Using Python 3.10.18 environment at: /Users/kennethhamilton/Desktop/sdv-mostly-experiment/venv
Resolved 65 packages in 607ms
Prepared 2 packages in 0.92ms
Uninstalled 2 packages in 110ms
Installed 2 packages in 54ms
 - pandas==2.2.3
 + pandas==2.3.1
 - psutil==5.9.8
 + psutil==7.0.0
Using Python 3.10.18 environment at: /Users/kennethhamilton/Desktop/sdv-mostly-experiment/venv
Resolved 173 packages in 850ms
Prepared 2 packages in 0.47ms
Uninstalled 2 packages in 89ms

# 1. Data Preparation

## Loading the Dataset
We'll use the US Census Income dataset (ACS Income 2018, about 1.48M records), which contains demographic, employment, and financial information and is well suited to testing synthetic data generation at scale.

In [2]:
import pandas as pd

# Load the ACS Income dataset (1.4M records) from remote Parquet file
# Note: This is a large dataset - initial load may take a while
data = pd.read_parquet(
    "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census/acs-income-2018.parquet"
).iloc[:, :15]
# drop unused categorical labels, so that SDV does not crash
for col in data.select_dtypes(["category"]).columns:
    data[col] = data[col].cat.remove_unused_categories()

# Display basic info about the dataset
print(f"Dataset shape: {data.shape}")
print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

data.head()

Dataset shape: (1476217, 15)
Memory usage: 43.7 MB


| | State | Region | Division | Income | Age | Workclass | Education | Marital Status | Occupation | Place of birth | Gender | Race | Citizenship status | Self-care difficulty | Hearing difficulty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Louisiana/LA | South | West South Central (South Region) | 90000.0 | 55.0 | Employee of a private for-profit company or bu... | 1 or more years of college credit, no degree | Married | EXT-Underground Mining Machine Operators | Louisiana/LA | Male | Black or African American alone | Born in the U.S. | No | No |
| 1 | Arizona/AZ | West | Mountain (West region) | 45000.0 | 24.0 | Local government employee (city, county, etc.) | Some college, but less than 1 year | Never married or under 15 years old | RPR-Automotive Service Technicians And Mechanics | California/CA | Male | White alone | Born in the U.S. | No | No |
| 2 | New York/NY | Northeast | Middle Atlantic (Northeast region) | 22600.0 | 30.0 | Employee of a private for-profit company or bu... | Regular high school diploma | Never married or under 15 years old | OFF-Hotel, Motel, And Resort Desk Clerks | New York/NY | Female | Some Other Race alone | Born in the U.S. | No | No |
| 3 | Delaware/DE | South | South Atlantic (South region) | 55000.0 | 23.0 | Employee of a private for-profit company or bu... | Bachelor's degree | Never married or under 15 years old | OFF-First-Line Supervisors Of Office And Admin... | North Carolina/NC | Male | Black or African American alone | Born in the U.S. | No | No |
| 4 | New York/NY | Northeast | Middle Atlantic (Northeast region) | 100000.0 | 52.0 | Employee of a private for-profit company or bu... | Associate's degree | Married | CON-Electricians | New York/NY | Male | White alone | Born in the U.S. | No | No |


## Dataset Overview

The dataset contains 15 columns with mixed data types. This combination of numerical and categorical data makes it ideal for testing both frameworks' capabilities.

In [3]:
# Display column names and basic data types
print("\nColumns:")
print(data.dtypes)


Columns:
State                   category
Region                  category
Division                category
Income                   float64
Age                      float64
Workclass               category
Education               category
Marital Status          category
Occupation              category
Place of birth          category
Gender                  category
Race                    category
Citizenship status      category
Self-care difficulty    category
Hearing difficulty      category
dtype: object


## Train/Holdout Split

We split the data into:
- **Training Set (80% - 1.2M records)**: For model training
- **Holdout Set (20% - 0.3M records)**: For quality evaluation

This split ensures we can properly assess synthetic data quality against unseen real data.
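
The record counts for an 80/20 split can be checked by hand. The sketch below assumes the holdout count is the 20% share rounded up and the training set takes the remainder, which matches the counts printed for this split; the variable names are ours.

```python
import math

n_total = 1_476_217  # rows in the loaded dataset
test_size = 0.2      # holdout fraction used in this notebook

# Assumption: the holdout count is rounded up, the rest goes to training
n_holdout = math.ceil(test_size * n_total)
n_train = n_total - n_holdout

print(f"train={n_train:,} holdout={n_holdout:,}")
```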

In [4]:
from sklearn.model_selection import train_test_split

# Split data into training/holdout sets
# Using stratified split would be better for classification tasks, but not critical here
# random_state=1 ensures reproducible results
train, holdout = train_test_split(
    data,
    test_size=0.2,  # 20% for holdout evaluation
    random_state=1,  # Fixed seed for reproducibility
    shuffle=True,  # Ensure random sampling
)

print(f"Training set: {train.shape[0]:,} records ({train.shape[0] / len(data) * 100:.1f}%)")
print(f"Holdout set:  {holdout.shape[0]:,} records ({holdout.shape[0] / len(data) * 100:.1f}%)")

Training set: 1,180,973 records (80.0%)
Holdout set:  295,244 records (20.0%)


# 2. SDV Metadata Configuration

## Metadata Setup
SDV requires metadata to understand your data structure. We'll use auto-detection to identify column types (numerical vs categorical), then validate the configuration.

## Auto-Detecting Metadata
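SDV can automatically detect column types from the data. In this run, auto-detection classifies `Income` and `Age` as numerical and 12 columns as categorical. `State` appears in neither list, which means it was assigned a different sdtype, so review the detected types before training. Note that the SDV sample preview in Section 3 shows `State` as `NaN`.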
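
To spot columns that were detected as something other than numerical or categorical, it helps to group the detected sdtypes. The following is a minimal sketch that works on a plain dictionary shaped like `metadata.to_dict()["tables"]["table"]["columns"]`. The helper name `group_columns_by_sdtype` and the example dictionary are hypothetical, and the sdtype shown for `State` is a placeholder rather than the value SDV actually detected.

```python
from collections import defaultdict


def group_columns_by_sdtype(columns: dict) -> dict:
    """Group column names by their detected sdtype."""
    groups = defaultdict(list)
    for name, info in columns.items():
        groups[info.get("sdtype", "unknown")].append(name)
    return dict(groups)


# Hypothetical excerpt of a detected-columns dictionary.
# "some_other_sdtype" is a placeholder, not SDV's real detection result.
example_columns = {
    "State": {"sdtype": "some_other_sdtype"},
    "Income": {"sdtype": "numerical"},
    "Age": {"sdtype": "numerical"},
    "Gender": {"sdtype": "categorical"},
}

groups = group_columns_by_sdtype(example_columns)

# Anything outside the two expected sdtypes deserves a manual look
unexpected = {
    sdtype: cols
    for sdtype, cols in groups.items()
    if sdtype not in ("numerical", "categorical")
}
print(unexpected)
```

In the notebook, the same helper could be applied to `table_metadata` right after detection.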

In [5]:
from sdv.metadata import Metadata

# Auto-detect metadata from the training data
# Note: We wrap the DataFrame in a dict with table name 'table' as required by SDV
# Using only training data to avoid data leakage
metadata = Metadata.detect_from_dataframes({"table": train})

# Show a summary of detected column types
table_metadata = metadata.to_dict()["tables"]["table"]["columns"]
numerical_cols = [col for col, info in table_metadata.items() if info["sdtype"] == "numerical"]
categorical_cols = [col for col, info in table_metadata.items() if info["sdtype"] == "categorical"]

print("\n📊 Metadata Summary:")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")

# Validate the metadata structure
try:
    metadata.validate()
    print("✅ Metadata validation passed")
except Exception as e:
    print(f"❌ Metadata validation failed: {e}")
    # You would fix metadata issues here if any exist

# Validate that the metadata matches the actual data structure
try:
    metadata.validate_data(data=({"table": train}))  # Use train data for consistency
    print("✅ Data validation against metadata passed")
except Exception as e:
    print(f"❌ Data validation failed: {e}")
    # This would indicate mismatches between metadata and actual data


📊 Metadata Summary:
Numerical columns (2): ['Income', 'Age']
Categorical columns (12): ['Region', 'Division', 'Workclass', 'Education', 'Marital Status', 'Occupation', 'Place of birth', 'Gender', 'Race', 'Citizenship status', 'Self-care difficulty', 'Hearing difficulty']
✅ Metadata validation passed
✅ Data validation against metadata passed


# 3. SDV: Training and Generation

## Gaussian Copula Synthesizer
We'll use SDV's Gaussian Copula Synthesizer, which fits a marginal distribution to each column, models the dependencies between columns with a multivariate Gaussian, and then samples from that joint model to generate synthetic rows.

In [6]:
import time

from sdv.single_table import GaussianCopulaSynthesizer

# Initialize the synthesizer with our metadata
# GaussianCopula is good for mixed data types and preserving correlations
synthesizer = GaussianCopulaSynthesizer(metadata)

print("🚀 Starting SDV training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Train the synthesizer on our training data
# This learns the statistical relationships between variables
synthesizer.fit(train)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV training completed in {elapsed_minutes:.2f} minutes")


We strongly recommend saving the metadata using 'save_to_json' for replicability in future SDV versions.



🚀 Starting SDV training...
Training on 1,180,973 records with 15 features
✅ SDV training completed in 1.54 minutes


In [7]:
print("🎲 Starting SDV synthetic data generation...")

start_time = time.time()

# Generate synthetic data with the same number of rows as original dataset
# You can adjust num_rows based on your needs
target_rows = len(data)  # Generate same size as original
sdv_synthetic_data = synthesizer.sample(num_rows=target_rows)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️  Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(sdv_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
print(sdv_synthetic_data.head())

🎲 Starting SDV synthetic data generation...
✅ SDV generation completed in 0.20 minutes
⏱️  Generation rate: 120,129 records/second
📊 Generated 1,476,217 synthetic records

First 5 synthetic records:
  State     Region                             Division   Income   Age  \
0   NaN  Northeast   Middle Atlantic (Northeast region)  53240.0  38.0   
1   NaN       West  East North Central (Midwest region)  15979.0  42.0   
2   NaN      South                Pacific (West region)  35126.0  32.0   
3   NaN      South  East North Central (Midwest region)  88097.0  69.0   
4   NaN       West                Pacific (West region)  24163.0  27.0   

                                           Workclass  \
0     Local government employee (city, county, etc.)   
1     Working without pay in family business or farm   
2     Local government employee (city, county, etc.)   
3  Self-employed in own incorporated business, pr...   
4     Local government employee (city, county, etc.)   

                   

In [8]:
import os

# Save SDV synthetic data
output_file = "./data/sdv_synthetic_data.parquet"
sdv_synthetic_data.to_parquet(output_file, index=False)

# Get file size in MB
file_size_mb = os.path.getsize(output_file) / 1024**2

print(f"💾 SDV synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_mb:.1f} MB")

💾 SDV synthetic data saved to: ./data/sdv_synthetic_data.parquet
📁 File size: 15.2 MB


# 4. MOSTLY AI: Training and Generation

## Deep Learning Approach
MOSTLY AI uses deep learning models designed for tabular data. The SDK can train locally, with configurable parameters such as maximum training time and privacy settings. Here we cap training at 10 minutes.

In [9]:
from mostlyai.sdk import MostlyAI

# Initialize Mostly AI SDK for local training
# local=True means we'll train models locally rather than using cloud API
print("🔧 Initializing Mostly AI SDK...")
mostly = MostlyAI(local=True)
print("✅ Mostly AI SDK initialized successfully")

🔧 Initializing Mostly AI SDK...


✅ Mostly AI SDK initialized successfully


In [10]:
print("🚀 Starting Mostly AI training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Configure and start training
# Mostly AI automatically detects column types and optimizes model architecture
g = mostly.train(
    config={
        "name": "ACS Income",
        "tables": [
            {
                "name": "census",
                "data": train,
                "tabularModelConfiguration": {
                    "max_training_time": 10,  # Limit training time (minutes)
                    "enable_model_report": False,  # Skip the built-in report; QA is run separately in Section 5
                },
            }
        ],
    },
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ Mostly AI training completed in {elapsed_minutes:.2f} minutes")

🚀 Starting Mostly AI training...
Training on 1,180,973 records with 15 features


Output()

✅ Mostly AI training completed in 10.83 minutes


In [11]:
print("🎲 Starting Mostly AI synthetic data generation...")

start_time = time.time()

# Generate synthetic data using the trained generator
# size parameter controls how many records to generate
target_rows = len(data)
sd = mostly.generate(g, size=target_rows)
mostlyai_synthetic_data = sd.data()

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ Mostly AI generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️ Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(mostlyai_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
mostlyai_synthetic_data.head()

🎲 Starting Mostly AI synthetic data generation...


Output()

✅ Mostly AI generation completed in 0.74 minutes
⏱️ Generation rate: 33,425 records/second
📊 Generated 1,476,217 synthetic records

First 5 synthetic records:


| | State | Region | Division | Income | Age | Workclass | Education | Marital Status | Occupation | Place of birth | Gender | Race | Citizenship status | Self-care difficulty | Hearing difficulty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pennsylvania/PA | Northeast | Middle Atlantic (Northeast region) | 24652 | 58 | Employee of a private for-profit company or bu... | Regular high school diploma | Widowed | PRD-Miscellaneous Production Workers, Includin... | Illinois/IL | Male | White alone | Born in the U.S. | No | No |
| 1 | Pennsylvania/PA | Northeast | Middle Atlantic (Northeast region) | 42386 | 32 | Employee of a private for-profit company or bu... | Regular high school diploma | Married | PRD-First-Line Supervisors Of Production And O... | Pennsylvania/PA | Female | White alone | Born in the U.S. | No | No |
| 2 | California/CA | West | Pacific (West region) | 174233 | 52 | Employee of a private for-profit company or bu... | Bachelor's degree | Married | BUS-Management Analysts | Philippines | Male | Asian alone | U.S. citizen by naturalization | No | No |
| 3 | New York/NY | Northeast | Middle Atlantic (Northeast region) | 97392 | 43 | Self-employed in own incorporated business, pr... | Bachelor's degree | Married | OFF-Bookkeeping, Accounting, And Auditing Clerks | New York/NY | Female | White alone | Born in the U.S. | No | No |
| 4 | California/CA | West | Pacific (West region) | 100764 | 53 | Employee of a private for-profit company or bu... | Bachelor's degree | Married | ENG-Chemical Engineers | California/CA | Female | Two or More Races | Born in the U.S. | No | No |


In [12]:
# Save Mostly AI synthetic data for comparison
output_file = "./data/mostlyai_synthetic_data.parquet"
mostlyai_synthetic_data.to_parquet(output_file, index=False)
file_size_bytes = os.path.getsize(output_file)
print(f"💾 MOSTLY AI synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_bytes / 1024**2:.1f} MB")

💾 MOSTLY AI synthetic data saved to: ./data/mostlyai_synthetic_data.parquet
📁 File size: 15.5 MB


# 5. Quality Assessment and Comparison

## Evaluation Framework
We'll use MOSTLY AI's comprehensive [Synthetic Data Quality Assurance](https://github.com/mostly-ai/mostlyai-qa) framework to evaluate both synthetic datasets. The assessment includes:

- **Accuracy Metrics**: How well synthetic data preserves statistical distributions (univariate, bivariate, trivariate)
- **Similarity Analysis**: Comparison between training, holdout, and synthetic data
- **DCR Privacy Metrics**: Distance to Closest Record analysis for privacy assessment
- **Overall Quality Score**: Combined metric for synthetic data fidelity
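
To give a feel for what an accuracy score measures, here is a minimal sketch for a single categorical column. It assumes accuracy is computed as 1 minus the total variation distance (TVD) between the original and synthetic distributions, which is our simplified reading of the approach. The actual framework also bins numeric columns and aggregates over univariate, bivariate, and trivariate distributions. The function name and toy data are ours.

```python
from collections import Counter


def univariate_accuracy(original, synthetic):
    """Return 1 - total variation distance between two categorical samples."""
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    # TVD is half the summed absolute difference in category shares
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd


# Toy marital-status columns with slightly different category shares
original = ["Married"] * 50 + ["Never married"] * 30 + ["Widowed"] * 20
synthetic = ["Married"] * 45 + ["Never married"] * 40 + ["Widowed"] * 15

print(f"{univariate_accuracy(original, synthetic):.2f}")  # prints 0.90
```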

### Key Privacy Metrics:
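- **DCR Share**: Proportion of synthetic records that are closer to a training record than to a holdout record. Values well above 0.5 suggest the generator may be reproducing training records (lower privacy).
- **DCR Training**: Average distance from each synthetic record to its closest training record (higher = synthetic records sit farther from the training data)
- **Optimal DCR Share**: ~0.5 means synthetic records are no closer to the training data than to unseen holdout data, indicating a good balance between utility and privacy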
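
The DCR share can be illustrated with a brute-force sketch on toy 2D points. It follows the definition of DCR share as the proportion of synthetic records whose closest training record is nearer than their closest holdout record. The real assessment works on embeddings of equally sized samples (10,000 rows each in the logs below). The function names, the tie handling, and the toy points here are our assumptions.

```python
import math


def nearest_distance(point, candidates):
    """Euclidean distance from point to its closest candidate."""
    return min(math.dist(point, c) for c in candidates)


def dcr_share(synthetic, training, holdout):
    """Share of synthetic points closer to training than to holdout."""
    closer_to_training = 0.0
    for s in synthetic:
        d_trn = nearest_distance(s, training)
        d_hol = nearest_distance(s, holdout)
        if d_trn < d_hol:
            closer_to_training += 1.0
        elif d_trn == d_hol:
            closer_to_training += 0.5  # assumption: split ties evenly
    return closer_to_training / len(synthetic)


training = [(0.0, 0.0), (10.0, 10.0)]
holdout = [(0.0, 1.0), (10.0, 11.0)]

# A generator that copies training rows is always closest to training
memorized = list(training)
# A generator that lands between both sets is closer to each about equally often
balanced = [(0.0, 0.4), (10.0, 10.6)]

print(dcr_share(memorized, training, holdout))
print(dcr_share(balanced, training, holdout))
```

The memorizing generator scores 1.0 and the balanced one scores 0.5, which is why values near 0.5 are the target.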

Let's compare the results from both frameworks:

In [13]:
# Import and initialize the quality assessment framework
from mostlyai import qa

# Initialize logging to see detailed evaluation progress
qa.init_logging()
print("🔍 Quality assessment framework initialized")

🔍 Quality assessment framework initialized


In [14]:
print("📊 Evaluating SDV synthetic data quality...")

# Load the SDV synthetic dataset
sdv_synthetic_data = pd.read_parquet("./data/sdv_synthetic_data.parquet")

# Run comprehensive quality assessment
# This compares synthetic data against training and holdout sets
report_path, metrics = qa.report(
    syn_tgt_data=sdv_synthetic_data,  # SDV synthetic data
    trn_tgt_data=train,  # Original training data
    hol_tgt_data=holdout,  # Holdout data for validation
    max_sample_size_embeddings=10_000,  # Limit sample size for efficiency
    report_path="sdv_qa_report.html",  # HTML report output
)

print(f"📋 SDV Quality Report saved to: {report_path}")
print("\n📈 SDV Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
sdv_accuracy = metrics.accuracy.overall
sdv_dcr_share = metrics.distances.dcr_share
sdv_dcr_training = metrics.distances.dcr_training
print("\n🎯 SDV Summary:")
print(f"   Overall Accuracy: {sdv_accuracy:.3f}")
print(f"   DCR Share: {sdv_dcr_share:.3f} (close to 0.5 is expected; well above 0.5 signals privacy risk)")
print(f"   DCR Training: {sdv_dcr_training:.3f} (higher is better for privacy)")

📊 Evaluating SDV synthetic data quality...
[2025-07-14 08:35:26,218] INFO   : prepared training data for accuracy: (1180973, 15)
[2025-07-14 08:35:26,622] INFO   : prepared holdout data for accuracy: (295244, 15)
[2025-07-14 08:35:28,499] INFO   : prepared synthetic data for accuracy: (1476217, 15)
[2025-07-14 08:35:28,539] INFO   : encode datasets for embeddings



divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul


[2025-07-14 08:35:29,356] INFO   : calculated embeddings: syn=(10000, 34), trn=(10000, 34), hol=(10000, 34)
[2025-07-14 08:35:29,356] INFO   : report accuracy and correlations
[2025-07-14 08:35:29,356] INFO   : calculate original data bins
[2025-07-14 08:35:31,124] INFO   : store original data bins
[2025-07-14 08:35:31,138] INFO   : calculate synthetic data bins
[2025-07-14 08:35:32,624] INFO   : calculate correlations
[2025-07-14 08:35:34,613] INFO   : calculate correlations
[2025-07-14 08:35:36,625] INFO   : calculated univariates for 15 columns in 0.91 seconds
[2025-07-14 08:35:38,665] INFO   : calculated bivariate accuracies for 210 combinations in 2.04 seconds
[2025-07-14 08:35:49,867] INFO   : calculated trivariate accuracies for 455 combinations in 11.20 seconds
[2025-07-14 08:35:50,028] INFO   : calculate numeric univariate kdes
[2025-07-14 08:35:51,018] INFO   : calculate numeric univariate kdes
[2025-07-14 08:35:51,960] INFO   : calculate categorical univariate counts
[2025-0


[2025-07-14 08:36:01,765] INFO   : calculate and plot distances
[2025-07-14 08:36:01,765] INFO   : calculate distances
[2025-07-14 08:36:02,042] INFO   : calculated DCRs for data.shape=(10000, 34) and query.shape=(10000, 34) in 0.21s
[2025-07-14 08:36:02,252] INFO   : calculated DCRs for data.shape=(10000, 34) and query.shape=(10000, 34) in 0.21s
[2025-07-14 08:36:02,459] INFO   : calculated DCRs for data.shape=(10000, 34) and query.shape=(10000, 34) in 0.21s
[2025-07-14 08:36:02,464] INFO   : DCR Share: 53.0%, NNDR Ratio: 0.983 - ALL columns
[2025-07-14 08:36:02,482] INFO   : calculated DCRs for data.shape=(10000, 3) and query.shape=(10000, 3) in 0.02s
[2025-07-14 08:36:02,499] INFO   : calculated DCRs for data.shape=(10000, 3) and query.shape=(10000, 3) in 0.02s
[2025-07-14 08:36:02,513] INFO   : calculated DCRs for data.shape=(10000, 3) and query.shape=(10000, 3) in 0.01s
[2025-07-14 08:36:02,514] INFO   : DCR Share: 50.4%, NNDR Ratio: 0.892 - 3 columns [[1, 14, 15]]
[2025-07-14 08:

In [15]:
print("📊 Evaluating Mostly AI synthetic data quality...")

# Load the Mostly AI synthetic dataset
mostlyai_synthetic_data = pd.read_parquet("./data/mostlyai_synthetic_data.parquet")

# Run comprehensive quality assessment for Mostly AI
report_path, metrics = qa.report(
    syn_tgt_data=mostlyai_synthetic_data,  # Mostly AI synthetic data
    trn_tgt_data=train,  # Original training data
    hol_tgt_data=holdout,  # Holdout data for validation
    max_sample_size_embeddings=10_000,  # Limit sample size for efficiency
    report_path="mostlyai_qa_report.html",  # HTML report output
)

print(f"📋 Mostly AI Quality Report saved to: {report_path}")
print("\n📈 Mostly AI Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
mai_accuracy = metrics.accuracy.overall
mai_dcr_share = metrics.distances.dcr_share
mai_dcr_training = metrics.distances.dcr_training
print("\n🎯 Mostly AI Summary:")
print(f"   Overall Accuracy: {mai_accuracy:.3f}")
print(f"   DCR Share: {mai_dcr_share:.3f} (close to 0.5 is expected; well above 0.5 signals privacy risk)")
print(f"   DCR Training: {mai_dcr_training:.3f} (higher is better for privacy)")

📊 Evaluating Mostly AI synthetic data quality...
[2025-07-14 08:36:24,308] INFO   : prepared training data for accuracy: (1180973, 15)
[2025-07-14 08:36:24,756] INFO   : prepared holdout data for accuracy: (295244, 15)
[2025-07-14 08:36:43,992] INFO   : prepared synthetic data for accuracy: (1476217, 15)
[2025-07-14 08:36:44,042] INFO   : encode datasets for embeddings



divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul


[2025-07-14 08:36:44,898] INFO   : calculated embeddings: syn=(10000, 34), trn=(10000, 34), hol=(10000, 34)
[2025-07-14 08:36:44,899] INFO   : report accuracy and correlations
[2025-07-14 08:36:44,899] INFO   : calculate original data bins
[2025-07-14 08:36:46,714] INFO   : store original data bins
[2025-07-14 08:36:46,729] INFO   : calculate synthetic data bins
[2025-07-14 08:36:49,069] INFO   : calculate correlations
[2025-07-14 08:36:50,266] INFO   : calculate correlations
[2025-07-14 08:36:52,042] INFO   : calculated univariates for 15 columns in 0.70 seconds
[2025-07-14 08:36:53,823] INFO   : calculated bivariate accuracies for 210 combinations in 1.78 seconds
[2025-07-14 08:37:05,097] INFO   : calculated trivariate accuracies for 455 combinations in 11.27 seconds
[2025-07-14 08:37:05,406] INFO   : calculate numeric univariate kdes
[2025-07-14 08:37:06,377] INFO   : calculate numeric univariate kdes
[2025-07-14 08:37:07,872] INFO   : calculate categorical univariate counts
[2025-0


[2025-07-14 08:37:17,660] INFO   : calculate and plot distances
[2025-07-14 08:37:17,660] INFO   : calculate distances
[2025-07-14 08:37:17,883] INFO   : calculated DCRs for data.shape=(10000, 34) and query.shape=(10000, 34) in 0.21s
[2025-07-14 08:37:18,084] INFO   : calculated DCRs for data.shape=(10000, 34) and query.shape=(10000, 34) in 0.20s
[2025-07-14 08:37:18,294] INFO   : calculated DCRs for data.shape=(10000, 34) and query.shape=(10000, 34) in 0.21s
[2025-07-14 08:37:18,298] INFO   : DCR Share: 50.3%, NNDR Ratio: 0.923 - ALL columns
[2025-07-14 08:37:18,340] INFO   : calculated DCRs for data.shape=(10000, 13) and query.shape=(10000, 13) in 0.04s
[2025-07-14 08:37:18,381] INFO   : calculated DCRs for data.shape=(10000, 13) and query.shape=(10000, 13) in 0.04s
[2025-07-14 08:37:18,421] INFO   : calculated DCRs for data.shape=(10000, 13) and query.shape=(10000, 13) in 0.04s
[2025-07-14 08:37:18,422] INFO   : DCR Share: 50.3%, NNDR Ratio: 1.558 - 13 columns [[2, 3, 4, 5, 7, 20, 2

In [16]:
# Add a final comparison section
print("\n" + "=" * 60)
print("🏆 FINAL COMPARISON")
print("=" * 60)
print(f"SDV      - Accuracy: {sdv_accuracy:.3f}, DCR Share: {sdv_dcr_share:.3f}")
print(f"MostlyAI - Accuracy: {mai_accuracy:.3f}, DCR Share: {mai_dcr_share:.3f}")
print("\nInterpretation:")
print("• Higher accuracy = better statistical fidelity")
print("• DCR Share well above 0.5 = synthetic records sit closer to training than holdout data (privacy risk)")
print("• DCR Share ~0.5 indicates good balance between utility and privacy")
print("• Check HTML reports for detailed analysis")


🏆 FINAL COMPARISON
SDV      - Accuracy: 0.527, DCR Share: 0.530
MostlyAI - Accuracy: 0.978, DCR Share: 0.503

Interpretation:
• Higher accuracy = better statistical fidelity
• DCR Share well above 0.5 = synthetic records sit closer to training than holdout data (privacy risk)
• DCR Share ~0.5 indicates good balance between utility and privacy
• Check HTML reports for detailed analysis
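
To put the timing and quality numbers side by side, the sketch below collects the figures reported in this notebook into one summary. The dictionary layout and the `summarize` helper are ours. The gap to 0.5 is one simple way to read DCR share, not an official metric.

```python
# Figures copied from the cell outputs above (minutes, accuracy, DCR share)
results = {
    "SDV": {"train_min": 1.54, "gen_min": 0.20, "accuracy": 0.527, "dcr_share": 0.530},
    "MOSTLY AI": {"train_min": 10.83, "gen_min": 0.74, "accuracy": 0.978, "dcr_share": 0.503},
}


def summarize(results):
    """Build one summary row per framework."""
    rows = []
    for name, r in results.items():
        rows.append(
            {
                "framework": name,
                "accuracy": r["accuracy"],
                # Distance of DCR share from the 0.5 target
                "dcr_share_gap": round(abs(r["dcr_share"] - 0.5), 3),
                # Training plus generation time
                "total_min": round(r["train_min"] + r["gen_min"], 2),
            }
        )
    return rows


summary = summarize(results)
for row in summary:
    print(row)
```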
