# Getting Started with the SDK  <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/quick-start/quick-start.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this notebook, we take our first steps with the SDK: we train a basic single-table generator and then probe it for new synthetic samples.

Note that the chosen dataset is for demo purposes and intentionally very small, so that this tutorial runs fast. Expect significantly higher quality with more training samples. See the other tutorials for reference.

In [None]:
# Install SDK in CLIENT mode
!uv pip install -U mostlyai
# Or install in LOCAL mode
!uv pip install -U 'mostlyai[local]'
# Note: Restart kernel session after installation!

## Load Original Data

Fetch some original data that will be used for training the generator.

In [None]:
import pandas as pd

# fetch some original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz")
df_original
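
Before training, it can help to check the column types, missing values and cardinalities of the original data. The following is a minimal sketch that uses only pandas; `summarize_columns` is a hypothetical helper (not part of the SDK), and the small `toy` frame holds made-up values so that the cell runs without a download.

```python
import pandas as pd


def summarize_columns(df):
    """Return dtype, number of missing values and number of distinct values per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),  # missing values are not counted as a distinct value
        }
    )


# made-up stand-in for df_original
toy = pd.DataFrame({"age": [39, 50, None], "sex": ["Male", "Female", "Male"]})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df_original)` would give the same overview for the census data.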

## Initialize the SDK

Create an SDK instance. It is used for all subsequent training and generation calls.

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

## Train a Generator

Train a synthetic data generator.

In [None]:
# train a generator, with defaults
g = mostly.train(data=df_original)

## Generate Synthetic Data

Probe for a single synthetic sample.

In [None]:
mostly.probe(g)

Probe the trained generator for 100 representative synthetic samples.

In [None]:
mostly.probe(g, size=100)

Generate a larger-scale, representative synthetic dataset.

In [None]:
sd = mostly.generate(g, size=1_000_000)
df_synthetic = sd.data()
df_synthetic
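
A quick, informal way to sanity-check generated data is to compare the share of each category between the original and the synthetic table. The sketch below assumes nothing beyond pandas; `compare_shares` is a hypothetical helper (not an SDK function), and the two toy frames contain made-up values that merely stand in for `df_original` and `df_synthetic`.

```python
import pandas as pd


def compare_shares(original, synthetic, column):
    """Return the relative frequency of each category in both frames, side by side."""
    shares = pd.DataFrame(
        {
            "original": original[column].value_counts(normalize=True),
            "synthetic": synthetic[column].value_counts(normalize=True),
        }
    ).fillna(0.0)  # a category missing from one frame gets a share of 0
    shares["abs_diff"] = (shares["original"] - shares["synthetic"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


# made-up stand-ins for df_original and df_synthetic
toy_original = pd.DataFrame({"sex": ["Male", "Male", "Female", "Female"]})
toy_synthetic = pd.DataFrame({"sex": ["Male", "Male", "Male", "Female"]})
result = compare_shares(toy_original, toy_synthetic, "sex")
print(result)
```

With the real data, `compare_shares(df_original, df_synthetic, "sex")` would list the categories with the largest deviation first.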

Conditionally generate 1,000 records of 70-year-old male citizens.

In [None]:
df_seed = pd.DataFrame(
    {
        "age": [70] * 1_000,
        "sex": ["Male"] * 1_000,
    }
)
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
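
The seed above fixes every row to the same values. A seed can also mix several combinations of fixed values. The sketch below builds such a seed from a list of `(values, n_rows)` pairs; `build_seed` is a hypothetical helper rather than part of the SDK, and `"Female"` is assumed to be a valid category of the `sex` column.

```python
import pandas as pd


def build_seed(groups):
    """Build a seed DataFrame from (fixed_values, n_rows) pairs.

    Hypothetical helper for illustration; not an SDK function.
    """
    frames = [pd.DataFrame([values] * n_rows) for values, n_rows in groups]
    return pd.concat(frames, ignore_index=True)


# 600 seed rows for 70-year-old males and 400 for 30-year-old females
# ("Female" is an assumed category value)
df_seed_mixed = build_seed(
    [
        ({"age": 70, "sex": "Male"}, 600),
        ({"age": 30, "sex": "Female"}, 400),
    ]
)
print(df_seed_mixed.shape)
```

The result could then be passed as `seed=df_seed_mixed` to `mostly.probe`, in the same way as `df_seed` above.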

## Quality Assurance

Inspect the automated Quality Assurance report to learn about the accuracy, similarity, and novelty of the generated synthetic samples.

In [None]:
# display the quality assurance report
g.reports(display=True)

## Export Generator

Export the generator to share it with other SDK users. Zipped generators can also be imported into a MOSTLY AI platform.

In [None]:
# export the generator
g.export_to_file("census-generator.zip")

## Advanced Options

Several configuration parameters are available that allow fine-grained control over the training of the generator. See `?mostly.train` for further examples. See [GeneratorConfig](https://mostly-ai.github.io/mostlyai/api_domain/#mostlyai.sdk.domain.GeneratorConfig) as well as [SourceTableConfig](https://mostly-ai.github.io/mostlyai/api_domain/#mostlyai.sdk.domain.SourceTableConfig) for all available configuration settings.

The following example trains a differentially private generator for a maximum of 2 minutes.

In [None]:
g = mostly.train(
    config={
        "name": "Census",  # name of the generator
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {  # tabular model configuration (optional)
                    "max_training_time": 2,  # - limit training time (in minutes)
                    # model, max_epochs,..        # further model configurations (optional)
                    "differential_privacy": {  # differential privacy configuration (optional)
                        "max_epsilon": 5.0,  # - max epsilon value, used as stopping criterion
                        "delta": 1e-5,  # - delta value for differentially private training (DP-SGD)
                        "noise_multiplier": 1.5,  # - noise level for privacy for DP-SGD
                        "max_grad_norm": 1.0,  # - maximum norm of the per-sample gradients for DP-SGD
                        "value_protection_epsilon": 2.0,  # - DP epsilon for determining value ranges / data domains
                    },
                },
                # columns, keys, compute,..        # further table configurations (optional)
            }
        ],
    },
    start=False,  # start training immediately (default: True)
    wait=False,  # wait for completion (default: True)
)

In [None]:
# launch training asynchronously
g.training.start()

In [None]:
# wait for training to finish, while observing its status
g.training.wait()
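
When several generators are trained with different settings, it can be convenient to assemble the nested configuration programmatically. The sketch below is one possible way to do that; `make_generator_config` is a hypothetical helper, and it only uses configuration keys that already appear in the example above. A placeholder takes the place of `df_original` so that the cell runs on its own.

```python
def make_generator_config(name, table_name, data, max_training_time=None, differential_privacy=None):
    """Assemble a config dict with the same keys as the example above.

    Hypothetical helper; optional settings are only included when provided.
    """
    model_config = {}
    if max_training_time is not None:
        model_config["max_training_time"] = max_training_time  # in minutes
    if differential_privacy is not None:
        model_config["differential_privacy"] = dict(differential_privacy)

    table = {"name": table_name, "data": data}
    if model_config:
        table["tabular_model_configuration"] = model_config
    return {"name": name, "tables": [table]}


# placeholder for df_original, so that the sketch runs without any data
placeholder_data = None
config = make_generator_config(
    "Census",
    "census",
    placeholder_data,
    max_training_time=2,
    differential_privacy={"max_epsilon": 5.0, "delta": 1e-5},
)
print(config["tables"][0]["tabular_model_configuration"])
```

With a real DataFrame in place of the placeholder, such a dict could be passed as `config=` to `mostly.train`, as in the example above.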

## Conclusion

This tutorial demonstrated the basic usage of the Synthetic Data SDK. You trained a generator from scratch on the original data, and then used it to sample new records according to your specifications.

See the other tutorials for further exercises.