# Genesis Quickstart Guide

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/genesis-synth/genesis/blob/main/examples/01_quickstart.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/genesis-synth/genesis/main?labpath=examples%2F01_quickstart.ipynb)

This notebook demonstrates the basic usage of Genesis for synthetic data generation.

## Installation

```bash
pip install genesis-synth
# For all features:
pip install "genesis-synth[all]"
```

In [None]:
import numpy as np
import pandas as pd

# Import Genesis
from genesis import SyntheticGenerator, QualityEvaluator, PrivacyConfig, Constraint

## Create Sample Data

Let's create a simple dataset to demonstrate synthetic data generation.

In [None]:
# Create sample data
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'gender': np.random.choice(['M', 'F'], n_samples),
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n_samples),
    'has_mortgage': np.random.choice([True, False], n_samples, p=[0.4, 0.6])
})

print(f"Shape: {data.shape}")
data.head()

## Basic Synthetic Data Generation

The simplest workflow has two steps: fit a generator on the real data, then sample as many synthetic rows as needed.

In [None]:
# Create generator (auto-selects best method)
generator = SyntheticGenerator(method='auto')

# Fit on real data
generator.fit(
    data, 
    discrete_columns=['gender', 'city', 'has_mortgage']
)

# Generate synthetic data
synthetic_data = generator.generate(n_samples=1000)

print(f"Synthetic data shape: {synthetic_data.shape}")
synthetic_data.head()

## Adding Constraints

Enforce business rules on the generated data.

In [None]:
# Define constraints
constraints = [
    Constraint.positive('income'),         # Income must be positive
    Constraint.range('age', 18, 100),      # Age must be 18-100
    Constraint.range('credit_score', 300, 850),  # Credit score range
]

# Create generator with constraints
generator = SyntheticGenerator(method='gaussian_copula')
generator.fit(data, discrete_columns=['gender', 'city', 'has_mortgage'], constraints=constraints)

synthetic_data = generator.generate(n_samples=1000)

# Verify constraints are satisfied
print(f"Min income: {synthetic_data['income'].min():.2f}")
print(f"Age range: {synthetic_data['age'].min()} - {synthetic_data['age'].max()}")
print(f"Credit score range: {synthetic_data['credit_score'].min()} - {synthetic_data['credit_score'].max()}")
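
The printed minima and maxima can also be turned into hard checks. The sketch below uses only pandas and does not depend on the Genesis API: `check_bounds` is a hypothetical helper written for this illustration, and the small `sample` frame is a stand-in for `synthetic_data`.

```python
import pandas as pd


def check_bounds(df, bounds):
    """Return the number of rows violating each (low, high) bound.

    `bounds` maps column name -> (low, high); use None for an open end.
    Hypothetical helper for illustration, not part of Genesis.
    """
    violations = {}
    for col, (low, high) in bounds.items():
        bad = pd.Series(False, index=df.index)
        if low is not None:
            bad |= df[col] < low
        if high is not None:
            bad |= df[col] > high
        violations[col] = int(bad.sum())
    return violations


# Stand-in for `synthetic_data`; the last row breaks two of the rules.
sample = pd.DataFrame({
    "age": [25, 40, 17],
    "income": [52000.0, 38000.0, -10.0],
    "credit_score": [700, 640, 810],
})

bounds = {
    "age": (18, 100),
    "income": (0, None),          # flags negatives only; tighten if zero must be rejected too
    "credit_score": (300, 850),
}

violations = check_bounds(sample, bounds)
print(violations)  # {'age': 1, 'income': 1, 'credit_score': 0}
```

Passing `synthetic_data` instead of `sample` should yield all zeros if the constraints were enforced during generation.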

## Privacy-Enhanced Generation

Generate synthetic data with differential privacy guarantees.

In [None]:
# Configure privacy settings
privacy_config = PrivacyConfig(
    enable_differential_privacy=True,
    epsilon=1.0,  # Privacy budget (lower = more private)
    suppress_rare_categories=True,
    rare_category_threshold=0.01,
)

# Create privacy-enhanced generator
private_generator = SyntheticGenerator(
    method='gaussian_copula', 
    privacy=privacy_config
)

private_generator.fit(data, discrete_columns=['gender', 'city', 'has_mortgage'])
private_synthetic = private_generator.generate(n_samples=1000)

print("Privacy-enhanced synthetic data generated!")
private_synthetic.head()
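
To make "lower = more private" concrete, here is a minimal sketch of the Laplace mechanism, the textbook way to release a count under epsilon-differential privacy. It only illustrates how `epsilon` controls the amount of noise; it is not a description of how Genesis implements its privacy option, and `noisy_count` is a hypothetical helper.

```python
import numpy as np


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)


rng = np.random.default_rng(0)
true_count = 400  # e.g. the number of rows with has_mortgage == True

# Average absolute error over many releases, for several privacy budgets.
mean_abs_error = {}
for epsilon in (0.1, 1.0, 10.0):
    draws = np.array([noisy_count(true_count, epsilon, rng) for _ in range(5000)])
    mean_abs_error[epsilon] = float(np.mean(np.abs(draws - true_count)))
    print(f"epsilon={epsilon:>4}: mean |error| = {mean_abs_error[epsilon]:.2f}")
```

For Laplace noise the expected absolute error equals the noise scale, which is 1/epsilon for a count, so each tenfold reduction in `epsilon` costs roughly a tenfold loss in accuracy.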

## Quality Evaluation

Evaluate how well the synthetic data matches the real data.

In [None]:
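# Evaluate quality of the constrained `synthetic_data` generated above
# (pass `private_synthetic` instead to score the privacy-enhanced output)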
evaluator = QualityEvaluator(data, synthetic_data)
report = evaluator.evaluate()

# Print summary
print(report.summary())

In [None]:
# Get detailed scores
print(f"Overall Score: {report.overall_score:.1f}%")
print(f"Statistical Fidelity: {report.fidelity_score * 100:.1f}%")
print(f"ML Utility: {report.utility_score * 100:.1f}%")
print(f"Privacy Score: {report.privacy_score * 100:.1f}%")
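
For intuition about what a fidelity score measures, the sketch below computes two common per-column distances by hand: the two-sample Kolmogorov-Smirnov statistic for a numeric column and the total variation distance for a categorical column. Whether Genesis uses these metrics internally is not stated in this notebook, so treat them as an independent sanity check. Both helper functions and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def total_variation(a, b):
    """Half the L1 distance between category frequencies (0 = identical, 1 = disjoint)."""
    p = pd.Series(a).value_counts(normalize=True)
    q = pd.Series(b).value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)  # categories missing on one side count as 0
    return float((p - q).abs().sum() / 2)


rng = np.random.default_rng(42)
real_income = rng.normal(50_000, 15_000, 1_000)
close_income = rng.normal(50_000, 15_000, 1_000)    # same distribution
shifted_income = rng.normal(70_000, 15_000, 1_000)  # mean shifted upward

ks_close = ks_statistic(real_income, close_income)
ks_shifted = ks_statistic(real_income, shifted_income)
tv_cities = total_variation(
    ["NYC", "LA", "LA", "Chicago"],
    ["NYC", "NYC", "LA", "Houston"],
)

print(f"KS (same distribution):    {ks_close:.3f}")
print(f"KS (shifted distribution): {ks_shifted:.3f}")
print(f"TV distance (cities):      {tv_cities:.3f}")
```

Applied to the notebook's frames, the same functions can be called as `ks_statistic(data['income'], synthetic_data['income'])` and `total_variation(data['city'], synthetic_data['city'])`.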

## Compare Distributions

Visually compare real vs synthetic data distributions.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

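# Numeric columns: share bin edges so the two histograms are directly comparable
for i, col in enumerate(['age', 'income', 'credit_score']):
    ax = axes[0, i]
    bins = np.histogram_bin_edges(pd.concat([data[col], synthetic_data[col]]), bins=30)
    ax.hist(data[col], bins=bins, alpha=0.5, label='Real', density=True)
    ax.hist(synthetic_data[col], bins=bins, alpha=0.5, label='Synthetic', density=True)
    ax.set_title(col)
    ax.legend()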

# Categorical columns
for i, col in enumerate(['gender', 'city', 'has_mortgage']):
    ax = axes[1, i]
    real_counts = data[col].value_counts(normalize=True)
    # Align to the real categories; a category missing from the synthetic data counts as 0
    syn_counts = synthetic_data[col].value_counts(normalize=True).reindex(real_counts.index, fill_value=0)

    positions = np.arange(len(real_counts))
    width = 0.35
    ax.bar(positions - width / 2, real_counts.values, width, label='Real')
    ax.bar(positions + width / 2, syn_counts.values, width, label='Synthetic')
    ax.set_xticks(positions)
    ax.set_xticklabels(real_counts.index.astype(str))
    ax.set_title(col)
    ax.legend()

plt.tight_layout()
plt.show()

## Save and Export

Save synthetic data and quality reports.

In [None]:
# Save synthetic data
synthetic_data.to_csv('synthetic_data.csv', index=False)

# Save quality report
report.save_html('quality_report.html')
report.save_json('quality_report.json')

print("Files saved!")
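
A quick round-trip check confirms that the CSV reloads with the same columns and dtypes. This sketch uses only pandas and a temporary directory; `frame` is a small stand-in for `synthetic_data`, with values chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for `synthetic_data`, with one column of each dtype used above.
frame = pd.DataFrame({
    "age": [25, 40, 61],
    "income": [52000.5, 38000.0, 71250.25],
    "city": ["NYC", "LA", "Chicago"],
    "has_mortgage": [True, False, True],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "synthetic_data.csv"
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, column order, or dtypes changed.
pd.testing.assert_frame_equal(frame, reloaded)
print(reloaded.dtypes)
```

CSV stores no type information, so columns such as dates or categoricals would need explicit `dtype` or `parse_dates` arguments to `pd.read_csv` when reloading.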

## Next Steps

- **02_tabular_synthesis.ipynb**: Deep dive into CTGAN, TVAE, Gaussian Copula
- **03_time_series.ipynb**: Time series data generation
- **04_text_generation.ipynb**: LLM-based text generation
- **05_privacy_config.ipynb**: Advanced privacy configuration
- **06_healthcare_example.ipynb**: Healthcare use case
- **07_finance_example.ipynb**: Financial data use case
- **08_multitable_example.ipynb**: Multi-table with relationships