# Tabular Data Synthesis with Genesis

This notebook covers the three main tabular data generators in Genesis:
- **CTGAN**: a GAN-based deep learning model
- **TVAE**: a variational autoencoder
- **Gaussian Copula**: a statistical method

We fit each one to the same dataset, compare runtime and quality scores, and finish with guidance on choosing the right one.

In [None]:
import numpy as np
import pandas as pd
import time

from genesis import SyntheticGenerator, QualityEvaluator
from genesis.generators.tabular import CTGANGenerator, TVAEGenerator, GaussianCopulaGenerator

## Create Sample Dataset

In [None]:
np.random.seed(42)
n = 5000

# Build numeric columns with known correlations
age = np.random.normal(40, 12, n).clip(18, 80)
experience = (age - 18) * np.random.uniform(0.5, 1.0, n)  # Scales with age, so the two are correlated
education_years = np.random.choice([12, 14, 16, 18, 20], n, p=[0.3, 0.2, 0.3, 0.15, 0.05])
# Salary depends on experience and education, plus Gaussian noise
salary = 20000 + experience * 2000 + education_years * 3000 + np.random.normal(0, 10000, n)

data = pd.DataFrame({
    'age': age.astype(int),
    'experience': experience.round(1),
    'education_years': education_years,
    'salary': salary.clip(20000, None),
    'department': np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR'], n),
    'is_manager': np.random.choice([True, False], n, p=[0.2, 0.8])
})

discrete_cols = ['education_years', 'department', 'is_manager']

print(f"Dataset: {data.shape}")
data.describe()
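
Before fitting any generator, it helps to confirm that the correlations built into the sample data are actually present, since these are the patterns a good synthesizer should reproduce. The sketch below rebuilds the numeric columns with the same recipe and prints their Pearson correlation matrix. It uses only NumPy and pandas.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 5000

# Same recipe as the sample dataset above (numeric columns only)
age = np.random.normal(40, 12, n).clip(18, 80)
experience = (age - 18) * np.random.uniform(0.5, 1.0, n)
education_years = np.random.choice([12, 14, 16, 18, 20], n, p=[0.3, 0.2, 0.3, 0.15, 0.05])
salary = 20000 + experience * 2000 + education_years * 3000 + np.random.normal(0, 10000, n)

numeric = pd.DataFrame({
    'age': age.astype(int),
    'experience': experience.round(1),
    'education_years': education_years,
    'salary': salary.clip(20000, None),
})

# Pairwise Pearson correlations between the numeric columns
corr = numeric.corr()
print(corr.round(2))
```

Age and experience should be strongly correlated, and salary should track experience, while age and education_years are drawn independently and should show almost no correlation.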

## 1. Gaussian Copula Generator

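A statistical method: it models each column's marginal distribution and ties the columns together through a Gaussian copula. It is fast and works well on smaller datasets.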

In [None]:
start = time.time()

gc_generator = GaussianCopulaGenerator(verbose=True)
gc_generator.fit(data, discrete_columns=discrete_cols)
gc_synthetic = gc_generator.generate(n_samples=5000)

gc_time = time.time() - start
print(f"\nTime: {gc_time:.2f}s")
gc_synthetic.head()
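
To make the idea behind a Gaussian copula concrete, here is a minimal NumPy sketch of the general technique on two toy columns. It is not the Genesis implementation, whose internals are not shown in this notebook; the toy data and all variable names are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n = 2000

# Toy "real" data: two correlated, non-Gaussian columns
x = rng.exponential(1.0, n)
y = x + rng.exponential(0.5, n)
real = np.column_stack([x, y])

std_normal = NormalDist()
to_normal = np.vectorize(std_normal.inv_cdf)   # uniform -> standard normal
to_uniform = np.vectorize(std_normal.cdf)      # standard normal -> uniform

# 1. Rank-transform each column to (0, 1), then to normal scores
ranks = real.argsort(axis=0).argsort(axis=0) + 1
u = ranks / (n + 1)
z = to_normal(u)

# 2. Estimate the correlation matrix of the normal scores
corr = np.corrcoef(z, rowvar=False)

# 3. Sample new correlated normals and map them back to (0, 1)
z_new = rng.multivariate_normal(np.zeros(2), corr, size=n)
u_new = to_uniform(z_new)

# 4. Push the uniforms through each column's empirical quantile function
synthetic = np.column_stack(
    [np.quantile(real[:, j], u_new[:, j]) for j in range(real.shape[1])]
)

print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1].round(3))
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(3))
```

The marginals are reproduced through the empirical quantiles, and the dependence is carried entirely by one correlation matrix. That is why the approach is fast, and also why it can miss dependencies that a single correlation matrix cannot express.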

In [None]:
# Evaluate
gc_evaluator = QualityEvaluator(data, gc_synthetic)
gc_report = gc_evaluator.evaluate()
print(f"Gaussian Copula - Fidelity: {gc_report.fidelity_score*100:.1f}%, Utility: {gc_report.utility_score*100:.1f}%")
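
As an independent sanity check on scores like these, you can compare real and synthetic columns directly. The sketch below uses one common approach: one minus the Kolmogorov-Smirnov statistic for numeric columns, and one minus the total variation distance for categorical ones. This is an assumption for illustration, not a description of how `QualityEvaluator` computes its scores; the helper names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two numeric samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def tv_distance(a, b):
    """Total variation distance between two categorical samples."""
    p = a.value_counts(normalize=True)
    q = b.value_counts(normalize=True)
    return float(p.sub(q, fill_value=0).abs().sum() / 2)

def column_similarity(real, synth):
    """Per-column similarity in [0, 1]; 1 means identical distributions."""
    scores = {}
    for col in real.columns:
        is_numeric = (pd.api.types.is_numeric_dtype(real[col])
                      and not pd.api.types.is_bool_dtype(real[col]))
        if is_numeric:
            scores[col] = 1 - ks_statistic(real[col].to_numpy(), synth[col].to_numpy())
        else:
            scores[col] = 1 - tv_distance(real[col], synth[col])
    return pd.Series(scores)

# Toy data: one faithful and one deliberately poor "synthetic" table
rng = np.random.default_rng(0)
depts = ['Engineering', 'Sales']
real = pd.DataFrame({'salary': rng.normal(60000, 15000, 2000),
                     'department': rng.choice(depts, 2000)})
good = pd.DataFrame({'salary': rng.normal(60000, 15000, 2000),
                     'department': rng.choice(depts, 2000)})
bad = pd.DataFrame({'salary': rng.normal(90000, 15000, 2000),
                    'department': ['Sales'] * 2000})

print(column_similarity(real, good))
print(column_similarity(real, bad))
```

In the notebook you could call `column_similarity(data, gc_synthetic)` to see which individual columns a generator reproduces well or poorly.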

## 2. CTGAN Generator

A GAN-based deep learning model. It handles complex distributions and suits larger datasets, at the cost of longer training.

In [None]:
start = time.time()

ctgan_generator = CTGANGenerator(
    epochs=100,  # Lower this for a faster demo run
    batch_size=500,
    verbose=True
)
ctgan_generator.fit(data, discrete_columns=discrete_cols)
ctgan_synthetic = ctgan_generator.generate(n_samples=5000)

ctgan_time = time.time() - start
print(f"\nTime: {ctgan_time:.2f}s")
ctgan_synthetic.head()
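
To get a rough feel for how `epochs` and `batch_size` drive training cost, the arithmetic below counts gradient steps. It assumes one pass over the data per epoch in mini-batches, which is typical for this kind of model but not confirmed for Genesis; `training_steps` is a hypothetical helper.

```python
import math

def training_steps(n_rows, batch_size, epochs):
    """Mini-batch steps per epoch and in total, assuming one data pass per epoch."""
    steps_per_epoch = math.ceil(n_rows / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Settings used in the CTGAN cell above: 5000 rows, batch_size=500, epochs=100
steps_per_epoch, total_steps = training_steps(n_rows=5000, batch_size=500, epochs=100)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

Under this assumption, halving `epochs` halves the number of steps, which makes it the simplest knob for shortening a demo run.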

In [None]:
# Evaluate
ctgan_evaluator = QualityEvaluator(data, ctgan_synthetic)
ctgan_report = ctgan_evaluator.evaluate()
print(f"CTGAN - Fidelity: {ctgan_report.fidelity_score*100:.1f}%, Utility: {ctgan_report.utility_score*100:.1f}%")

## 3. TVAE Generator

A variational autoencoder. It offers a balance between training speed and output quality.

In [None]:
start = time.time()

tvae_generator = TVAEGenerator(
    epochs=100,
    batch_size=500,
    verbose=True
)
tvae_generator.fit(data, discrete_columns=discrete_cols)
tvae_synthetic = tvae_generator.generate(n_samples=5000)

tvae_time = time.time() - start
print(f"\nTime: {tvae_time:.2f}s")
tvae_synthetic.head()

In [None]:
# Evaluate
tvae_evaluator = QualityEvaluator(data, tvae_synthetic)
tvae_report = tvae_evaluator.evaluate()
print(f"TVAE - Fidelity: {tvae_report.fidelity_score*100:.1f}%, Utility: {tvae_report.utility_score*100:.1f}%")

## Comparison Summary

In [None]:
comparison = pd.DataFrame({
    'Method': ['Gaussian Copula', 'CTGAN', 'TVAE'],
    'Time (s)': [gc_time, ctgan_time, tvae_time],
    'Fidelity (%)': [gc_report.fidelity_score*100, ctgan_report.fidelity_score*100, tvae_report.fidelity_score*100],
    'ML Utility (%)': [gc_report.utility_score*100, ctgan_report.utility_score*100, tvae_report.utility_score*100],
    'Privacy (%)': [gc_report.privacy_score*100, ctgan_report.privacy_score*100, tvae_report.privacy_score*100],
})
comparison
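
The three timing cells above repeat the same fit, generate, and time pattern. A small helper can run that loop for any object exposing `fit(data, discrete_columns=...)` and `generate(n_samples=...)`, the signatures used in this notebook. The sketch below demonstrates it with a hypothetical `BootstrapGenerator` stand-in, so that it runs without Genesis installed.

```python
import time
import numpy as np
import pandas as pd

class BootstrapGenerator:
    """Hypothetical stand-in for a generator: resamples training rows."""

    def fit(self, data, discrete_columns=None):
        self._data = data.reset_index(drop=True)
        return self

    def generate(self, n_samples):
        sample = self._data.sample(n=n_samples, replace=True, random_state=0)
        return sample.reset_index(drop=True)

def benchmark(generators, data, discrete_columns, n_samples):
    """Fit and sample from each generator, recording wall-clock time."""
    rows, samples = [], {}
    for name, gen in generators.items():
        start = time.perf_counter()
        gen.fit(data, discrete_columns=discrete_columns)
        samples[name] = gen.generate(n_samples=n_samples)
        rows.append({'Method': name, 'Time (s)': time.perf_counter() - start})
    return pd.DataFrame(rows), samples

# Toy data so the sketch runs on its own
toy = pd.DataFrame({'x': np.arange(100), 'group': ['a', 'b'] * 50})
timings, samples = benchmark({'Bootstrap': BootstrapGenerator()}, toy, ['group'], n_samples=50)
print(timings)
```

In this notebook you would pass a dictionary such as `{'Gaussian Copula': gc_generator, 'CTGAN': ctgan_generator, 'TVAE': tvae_generator}` together with `data` and `discrete_cols`.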

## Choosing the Right Method

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Gaussian Copula** | Small datasets, quick prototyping | Fast, no GPU needed | May miss complex patterns |
| **CTGAN** | Large datasets, complex patterns | Best quality for complex data | Slower, needs tuning |
| **TVAE** | Medium datasets | Good balance of speed and quality | Moderate complexity |

Pass `method='auto'` to `SyntheticGenerator` to let Genesis choose a method based on your data characteristics.

In [None]:
# Auto-selection example
auto_generator = SyntheticGenerator(method='auto')
auto_generator.fit(data, discrete_columns=discrete_cols)

print(f"Selected method: {auto_generator.selected_method}")
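
If you want a similar rule of thumb in your own code, the table above can be turned into a simple size-based heuristic. The sketch below is purely illustrative: the row thresholds and the returned method names are assumptions, and it does not reflect how Genesis's `'auto'` mode actually decides.

```python
def suggest_method(n_rows, small=10_000, large=100_000):
    """Hypothetical heuristic mirroring the table: pick a method by dataset size.

    The `small` and `large` thresholds are illustrative assumptions.
    """
    if n_rows < small:
        return 'gaussian_copula'   # small data, quick prototyping
    if n_rows < large:
        return 'tvae'              # medium data, balanced choice
    return 'ctgan'                 # large data, complex patterns

for rows in (5_000, 50_000, 500_000):
    print(rows, '->', suggest_method(rows))
```

A real selector would likely also consider column types and distribution shape, so treat this as a starting point and compare it against the quality scores you measure.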