Merged
18 commits
052aa48
feat: make SDK local-first and remove platform-only surfaces
mplatzer Apr 25, 2026
c15e83f
refactor: remove remaining user-account platform API surface
mplatzer Apr 25, 2026
acf8da6
refactor: drop unsupported artifacts and integrations clients
mplatzer Apr 25, 2026
d008167
refactor: hard-prune domain models to SDK-supported API
mplatzer Apr 25, 2026
81b7fde
refactor: remove computes API and local dummy endpoint
mplatzer Apr 25, 2026
e35a3ef
refactor: remove remaining platform-era auth and metadata surface
mplatzer Apr 25, 2026
b33e84f
refactor: remove usage statistics from SDK domain models
mplatzer Apr 25, 2026
b0c035a
refactor: remove sort_by from list endpoints
mplatzer Apr 25, 2026
f948a56
chore: remove legacy client CI section and stale domain filters
mplatzer Apr 25, 2026
19ea662
docs: remove MOSTLY_BASE_URL configuration mention
mplatzer Apr 25, 2026
c690ed2
refactor: remove about and models endpoints from SDK API
mplatzer Apr 25, 2026
e291ff1
refactor: remove MOSTLY_LOCAL environment mode switch
mplatzer Apr 25, 2026
d122ba7
refactor: remove slft query parameter from local routes
mplatzer Apr 25, 2026
fdcec9c
chore: clean up local-first leftovers in packaging and metadata
mplatzer Apr 25, 2026
8d85af8
docs: remove SDV comparison and external support/blog references
mplatzer Apr 25, 2026
e81f820
docs: make quick start local-only with uv install
mplatzer Apr 25, 2026
dd88603
docs: simplify quick start local wording
mplatzer Apr 25, 2026
398389e
refactor: drop hive, databricks, and redshift connectors
cursoragent Apr 25, 2026
5 changes: 0 additions & 5 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -1,6 +1 @@
blank_issues_enabled: false # Disables the option to create blank issues

contact_links:
- name: Contact Support
url: mailto:support@mostly.ai
about: For any support-related queries, please email us directly at support@mostly.ai.
34 changes: 0 additions & 34 deletions .github/workflows/run-tests-cpu.yaml
@@ -40,37 +40,3 @@ jobs:
uv run --no-sync pytest -vv tests/_data/unit
uv run --no-sync pytest -vv tests/_local/unit
uv run --no-sync pytest -vv tests/test_domain.py

run-test-cpu-client:
if: false
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Setup | Checkout
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
with:
fetch-depth: 1
submodules: 'true'

- name: Setup | uv
uses: astral-sh/setup-uv@ed21f2f24f8dd64503750218de024bcf64c7250a # v7.1.5
with:
enable-cache: false
python-version: '3.11'

- name: Setup | Dependencies
run: |
uv sync --frozen --only-group dev
uv pip install --index-strategy unsafe-first-match torch==2.9.1+cpu torchvision==0.24.1+cpu ".[local]" --extra-index-url https://download.pytorch.org/whl/cpu

- name: Test | End-to-End (Client mode only)
env:
MOSTLY_API_KEY: ${{ secrets.E2E_CLIENT_MOSTLY_API_KEY }}
MOSTLY_BASE_URL: ${{ secrets.E2E_CLIENT_MOSTLY_BASE_URL }}
E2E_CLIENT_S3_ACCESS_KEY: ${{ secrets.E2E_CLIENT_S3_ACCESS_KEY }}
E2E_CLIENT_S3_SECRET_KEY: ${{ secrets.E2E_CLIENT_S3_SECRET_KEY }}
E2E_CLIENT_S3_BUCKET: ${{ secrets.E2E_CLIENT_S3_BUCKET }}
run: |
uv run --no-sync pytest -vv tests/_local/end_to_end -k 'client and mode'
3 changes: 1 addition & 2 deletions Dockerfile
@@ -66,10 +66,9 @@ RUN uv pip install torch==2.9.1 torchvision==0.24.1 --torch-backend=cpu
COPY mostlyai ./mostlyai
COPY README.md ./
RUN uv pip install -e .
COPY ./tools/docker_entrypoint.py /app/entrypoint.py

USER nonroot

EXPOSE 8080
ENTRYPOINT [ "uv", "run", "--no-sync", "--project", "/app", "--"]
CMD ["/app/entrypoint.py"]
CMD ["python", "-m", "mostlyai.sdk._local.docker_entrypoint"]
34 changes: 0 additions & 34 deletions Makefile
@@ -1,42 +1,8 @@
# Internal Variables
PUBLIC_API_FULL_URL = https://raw.githubusercontent.com/mostly-ai/mostly-openapi/refs/heads/main/public-api.yaml
PUBLIC_API_OUTPUT_PATH = mostlyai/sdk/domain.py

# Targets
.PHONY: help
help: ## show definition of each function
@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf "\033[36m%-25s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST)

.PHONY: gen-public-model
gen-public-model: ## build pydantic models for public api
@echo "Updating custom Jinja2 templates"
python tools/extend_model.py
@echo "Generating Pydantic models from $(PUBLIC_API_FULL_URL)"
datamodel-codegen --url $(PUBLIC_API_FULL_URL) $(COMMON_OPTIONS)
#datamodel-codegen --input ../mostly-app-v2/public-api/public-api.yaml $(COMMON_OPTIONS)
python tools/postproc_domain.py
# run pre-commit hooks to add license and lint the generated code; ignore the exit code to avoid confusion
uv run --no-sync pre-commit run --all-files > /dev/null 2>&1 || true

# Common options for both targets
COMMON_OPTIONS = \
--input-file-type openapi \
--output $(PUBLIC_API_OUTPUT_PATH) \
--snake-case-field \
--target-python-version 3.11 \
--use-schema-description \
--use-union-operator \
--use-standard-collections \
--field-constraints \
--collapse-root-models \
--use-one-literal-as-default \
--enum-field-as-literal one \
--use-subclass-enum \
--set-default-enum-member \
--output-model-type pydantic_v2.BaseModel \
--base-class mostlyai.sdk.client.base.CustomBaseModel \
--custom-template-dir tools/custom_template

.PHONY: clean
clean: ## Remove .gitignore files
git clean -fdX
39 changes: 18 additions & 21 deletions README.md
@@ -7,13 +7,12 @@
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai)](https://pypi.org/project/mostlyai/)
[![GitHub stars](https://img.shields.io/github/stars/mostly-ai/mostlyai?style=social)](https://github.com/mostly-ai/mostlyai/stargazers)

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)
[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/)

The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Generators, that were trained locally, can be easily imported to a platform for further sharing.
- **LOCAL** mode trains and generates synthetic data locally on your own compute resources (default).
- **CLIENT** mode connects to a remote SDK endpoint for training & generating synthetic data there.

## Overview

@@ -30,8 +29,6 @@ The SDK allows you to programmatically create, browse and manage 3 key resources
| Live probe the generator on demand | `df = mostly.probe(g, config)` | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe) |
| Connect to any data source within your org | `c = mostly.connect(config)` | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect) |

https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Key Features

- **Broad Data Support**
@@ -62,10 +59,10 @@ https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Quick Start <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

Install the SDK via `pip` (see [Installation](#installation) for further details):
Install the SDK for LOCAL mode (see [Installation](#installation) for further details):

```shell
pip install -U mostlyai # or 'mostlyai[local]' for LOCAL mode
uv pip install -U 'mostlyai[local]'
```

Generate synthetic samples using a pre-trained generator:
@@ -191,17 +188,9 @@ uv run --with jupyter jupyter lab

</details>

### CLIENT mode

This is a light-weight installation for using the SDK in CLIENT mode only. It communicates to a MOSTLY AI platform to perform requested tasks. See e.g. [app.mostly.ai](https://app.mostly.ai/) for a free-to-use hosted version.

```shell
uv pip install -U mostlyai
```

### CLIENT + LOCAL mode
### LOCAL mode (default)

This is a full installation for using the SDK in both CLIENT and LOCAL mode. It includes all dependencies, incl. PyTorch, for training and generating synthetic data locally.
This is the recommended installation for running the SDK in LOCAL mode (and optionally CLIENT mode as well). It includes all dependencies, incl. PyTorch, for training and generating synthetic data locally.

```shell
uv pip install -U 'mostlyai[local]'
@@ -225,15 +214,23 @@ uv pip install --index-strategy unsafe-first-match -U torch==2.9.1+cpu torchvisi
pip install -U torch==2.9.1+cpu torchvision==0.24.1+cpu 'mostlyai[local]' --extra-index-url https://download.pytorch.org/whl/cpu
```

### CLIENT mode

This is a light-weight installation for explicit CLIENT-only usage against a remote SDK endpoint.

```shell
uv pip install -U mostlyai
```


> **Note for Google Colab users**: Installing any of the local extras (`mostlyai[local]`, or `mostlyai[local-gpu]`) might need restarting the runtime after installation for the changes to take effect.

### Data Connectors

Add any of the following extras for further data connectors support in LOCAL mode: `databricks`, `googlebigquery`, `hive`, `mssql`, `mysql`, `oracle`, `postgres`, `redshift`, `snowflake`. E.g.
Add any of the following extras for further data connectors support in LOCAL mode: `googlebigquery`, `mssql`, `mysql`, `oracle`, `postgres`, `snowflake`. E.g.

```shell
uv pip install -U 'mostlyai[local, databricks, snowflake]'
uv pip install -U 'mostlyai[local, snowflake]'
```
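
For reviewers who want to check install commands against the trimmed connector list, here is a minimal sketch that builds the extras specifier and rejects the three connectors dropped in this PR. The helper `build_install_spec` and the two sets are hypothetical and not part of the SDK; only the extra names are taken from the README lines above.

```python
# Hypothetical helper, not part of mostlyai: validates connector extras
# against the list documented in the README after this PR.
SUPPORTED_CONNECTORS = {"googlebigquery", "mssql", "mysql", "oracle", "postgres", "snowflake"}
REMOVED_CONNECTORS = {"databricks", "hive", "redshift"}


def build_install_spec(connectors: list[str]) -> str:
    """Return a requirement specifier such as 'mostlyai[local,snowflake]'."""
    removed = sorted(set(connectors) & REMOVED_CONNECTORS)
    if removed:
        raise ValueError(f"connector extras no longer available: {', '.join(removed)}")
    unknown = sorted(set(connectors) - SUPPORTED_CONNECTORS)
    if unknown:
        raise ValueError(f"unknown connector extras: {', '.join(unknown)}")
    # 'local' always comes first; connectors follow, de-duplicated and sorted
    extras = ["local", *sorted(set(connectors))]
    return f"mostlyai[{','.join(extras)}]"


print(build_install_spec(["snowflake", "postgres"]))  # mostlyai[local,postgres,snowflake]
```

The resulting string can be passed to `uv pip install -U`. A script that still requests `databricks`, `hive`, or `redshift` would fail early here rather than at install time.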

### Using Docker
@@ -276,7 +273,7 @@ As an alternative, you can also build a Docker image, which provides you with an

<summary>Connect to the container</summary>

<p>You can now connect to the SDK running within the container by initializing the SDK in <code>CLIENT</code>> mode on the host machine.</p>
<p>You can now connect to the SDK running within the container from the host machine via an explicit <code>base_url</code>.</p>

```python
from mostlyai.sdk import MostlyAI
11 changes: 1 addition & 10 deletions docs/api_domain.md
@@ -5,20 +5,11 @@ hide:

# Schema References for `mostlyai.sdk.domain`

This module is auto-generated to represent `pydantic`-based classes of the defined schema in the [Public API](https://github.com/mostly-ai/mostly-openapi/blob/main/public-api.yaml).
This module is maintained manually to represent the SDK-supported `pydantic` domain classes.

::: mostlyai.sdk.domain
options:
show_root_heading: true
show_root_full_path: true
show_object_full_path: false
show_root_toc_entry: false
filters:
- "!^Assistant.*"
- "!^Share.*"
- "!^ResourceShares"
- "!^LiteLlm.*"
- "!^DataLlm.*"
- "!.*PatchConfig.*"
- "!.*CloneConfig.*"
- "!^UsageReport.*"
73 changes: 4 additions & 69 deletions docs/syntax.md
@@ -10,17 +10,12 @@ hide:
```python
from mostlyai.sdk import MostlyAI

# local mode (with TCP port)
mostly = MostlyAI(
local=True,
local_dir='~/mostlyai',
local_port=8080,
)
# local mode (default)
mostly = MostlyAI(local_dir='~/mostlyai')

# client mode
# client mode (explicit remote endpoint)
mostly = MostlyAI(
base_url='https://app.mostly.ai', # or set env var `MOSTLY_BASE_URL`
api_key='INSERT_YOUR_API_KEY', # or set env var `MOSTLY_API_KEY`
base_url='https://remote-sdk.example.com',
)
```

@@ -164,67 +159,7 @@ c.open()
c.delete()
```

## Datasets

Datasets can be used to train generators or create artifacts and may be created with or without a corresponding connector. Datasets are only available in `client` mode.

```python
# create a new dataset
ds = mostly.datasets.create(config: dict | DatasetConfig)

# retrieve a dataset by id
ds = mostly.datasets.get(dataset_id: str)

# list datasets (iterator with optional filters)
for ds in mostly.datasets.list(
offset: int = 0,
limit: int | None = None,
search_term: str | None = None,
owner_id: str | list[str] | None = None,
visibility: str | list[str] | None = None, # e.g., PUBLIC, PRIVATE, UNLISTED
created_from: str | None = None, # YYYY-MM-DD
created_to: str | None = None, # YYYY-MM-DD
sort_by: str | list[str] | None = None, # NO_OF_THREADS | NO_OF_LIKES | RECENCY
):
print(ds.id, ds.name)

# update an existing dataset (partial patch)
ds = mostly.datasets._update(dataset_id: str, config: dict | DatasetPatchConfig)

# delete a dataset
mostly.datasets._delete(dataset_id: str)

# download a file from a dataset
content_bytes, filename = mostly.datasets._download_file(
dataset_id: str,
file_path: str, # path inside the dataset
)

# upload a file to a dataset
mostly.datasets._upload_file(
dataset_id: str,
file_path: str | Path, # local path to file
)

# delete a file from a dataset
mostly.datasets._delete_file(
dataset_id: str,
file_path: str | Path, # path inside the dataset
)
```

## Miscellaneous

```python
# fetch info on your user account
mostly.me()

# fetch info about the platform
mostly.about()

# list all available models
mostly.models()

# list all available computes
mostly.computes()
```
2 changes: 0 additions & 2 deletions docs/tutorials.md
@@ -25,5 +25,3 @@ Explore synthetic data tutorials with the option to run them **either in Google
| Close gaps in your data with **Smart Imputation** | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/smart-imputation/smart-imputation.ipynb) | [View Notebook](./tutorials/smart-imputation/smart-imputation.ipynb) |
| Calculate accuracy and privacy metrics for **Quality Assurance** | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/quality-assurance/quality-assurance.ipynb) | [View Notebook](./tutorials/quality-assurance/quality-assurance.ipynb) |
| Enrich Sensitive Data with LLMs using Synthetic Replicas | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/synthetic-enrich/synthetic-enrich.ipynb) | [View Notebook](./tutorials/synthetic-enrich/synthetic-enrich.ipynb) |
| MOSTLY AI vs. SDV comparison: single-table scenario | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb) | [View Notebook](./tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb) |
| MOSTLY AI vs. SDV comparison: sequential scenario | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb) | [View Notebook](./tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb) |
11 changes: 2 additions & 9 deletions docs/tutorials/differential-privacy/differential-privacy.ipynb
@@ -6,17 +6,12 @@
"metadata": {},
"source": [
"# Differentially Private Synthetic Data <a href=\"https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/differential-privacy/differential-privacy.ipynb\" target=\"_blank\"><img src=\"https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab\" alt=\"Run on Colab\"></a>\n",
"\n",
"In this notebook, we demonstrate how a generator can be trained with differential privacy guarantees, and explore how the various settings can impact the data fidelity.\n",
"\n",
"How Differential Privacy is applied:\n",
"- Value ranges: DP is used to define value bounds for each column. The epsilon budget for this step is split evenly across columns.\n",
"- Model training: DP-SGD by [Opacus](https://github.com/pytorch/opacus) is used for training, with a separate epsilon (and delta) value.\n",
"- The total privacy budget is the sum of both parts.\n",
"\n",
"See also the schema reference for [DifferentialPrivacyConfig](https://mostly-ai.github.io/mostlyai/api_domain/#mostlyai.sdk.domain.DifferentialPrivacyConfig) for all available configuration parameters.\n",
"\n",
"For further background and analysis see also [this blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) on \"_Differentially Private Synthetic Data with MOSTLY AI_\"."
"See also the schema reference for [DifferentialPrivacyConfig](https://mostly-ai.github.io/mostlyai/api_domain/#mostlyai.sdk.domain.DifferentialPrivacyConfig) for all available configuration parameters.\n"
]
},
{
@@ -244,7 +239,6 @@
"metadata": {},
"source": [
"## Further exercises\n",
"\n",
"In addition to walking through the above instructions, we suggest..\n",
"* to experiment with different DP settings\n",
"* to study the impact of the total size of the training data on final eps\n",
@@ -257,8 +251,7 @@
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This tutorial demonstrated how to train with and without differential privacy guarantees. Note: DP just provides additional mathematical guarantees for use cases that require these. However, given the other privacy mechanism in-built into the SDK, synthetic data can also without stricter DP guarantees be considered to be anonymous. See again [here](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) for a further discussion."
"This tutorial demonstrated how to train with and without differential privacy guarantees. Note: DP just provides additional mathematical guarantees for use cases that require these. However, given the other privacy mechanism in-built into the SDK, synthetic data can also without stricter DP guarantees be considered to be anonymous."
]
}
],