# Enrich Sensitive Data with LLMs using Synthetic Replicas <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/synthetic-enrich/synthetic-enrich.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

This notebook shows how to safely enrich sensitive datasets with new LLM-generated columns without sharing any private data with the LLM. We first create a privacy-safe synthetic replica of the original dataset and enrich that replica via an LLM. We then train a generator on the enriched replica. Finally, the generator applies the same enrichment to the original sensitive data, without the data ever leaving your environment.

📋 Steps

1.   Create a synthetic replica of your dataset
2.   Use an LLM to add new columns to the replica
3.   Train a generator on the enriched replica
4.   Generate enriched original data using the trained generator

🔐 Key Benefits

* No data exposure: Original data stays secure.
* Enrichment at scale: LLMs enrich synthetic data; the generator brings that intelligence back.
* Reusable logic: Once trained, the generator acts as a secure enrichment adapter - no repeated LLM calls needed.


## Install Dependencies

In [None]:
# Install the SDK in CLIENT mode
!uv pip install -U mostlyai mostlyai-mock
# Or, to use LOCAL mode, run this line instead
# !uv pip install -U 'mostlyai[local]' mostlyai-mock
# Note: Restart the kernel session after installation!

## Load Original Data

Fetch a sample of the census dataset. It stands in for the sensitive proprietary data that we want to enrich while keeping it private.

In [None]:
# load sample of original data
import pandas as pd

df_orig = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz", nrows=2000)
df_orig.head()

## Initialize SDK

The SDK will handle model training and synthetic data generation, while `mostlyai-mock` will provide the LLM enrichment capabilities.

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

## Train a Generator on the Original Data

Train a generator on the sensitive original data.

In [None]:
# train generator on original data
g = mostly.train(data=df_orig)

## Generate Synthetic Data

Create synthetic data that will serve as a proxy for the sensitive original dataset. This synthetic data will be shared with the LLM for enrichment.

In [None]:
# generate synthetic data
df_syn = mostly.probe(g, size=len(df_orig))
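
Before sharing the replica with an LLM, it can be useful to check that it still resembles the original data. The sketch below is one simple way to do that, not part of the SDK: it computes the total variation distance between the category shares of one column in two samples. The helper names and the toy values are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def category_shares(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Half the summed absolute difference in category shares (0 = identical, 1 = disjoint)."""
    shares_a = category_shares(a)
    shares_b = category_shares(b)
    categories = set(shares_a) | set(shares_b)
    return 0.5 * sum(abs(shares_a.get(c, 0.0) - shares_b.get(c, 0.0)) for c in categories)


# Toy values standing in for one categorical column of the original and synthetic data.
orig_values = ["Sales", "Sales", "Tech-support", "Craft-repair"]
syn_values = ["Sales", "Tech-support", "Tech-support", "Craft-repair"]
tvd = total_variation_distance(orig_values, syn_values)
print(f"total variation distance: {tvd:.2f}")
```

In the notebook, the same function could be called as `total_variation_distance(df_orig["occupation"], df_syn["occupation"])`, assuming the census sample has an `occupation` column without missing values. Small values suggest that the replica preserves that column's distribution.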

## Enrich via MOCK

Use the LLM to enrich the synthetic data with 2 new columns: **specific job title** and **work category**. This is where we expose data to the LLM, but only the synthetic proxy data, not our sensitive original data.

In [None]:
# Set your OpenAI API key as environment variable, required by mostlyai-mock
# import os
# os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"

from mostlyai import mock

mock_config_tables = {
    "census": {
        "prompt": "U.S. Census data with demographic and employment-related columns",
        "columns": {
            "specific_job_title": {
                "prompt": (
                    "Generate a realistic, specific job title for a person "
                    "based on their occupation, education, and income level. "
                    "The job title should be more specific than the general "
                    "occupation category."
                ),
                "dtype": "string",
            },
            "work_category": {
                "prompt": """categorize the occupation into work category, considering the actual job duties and level.
                                Examples of correct categorizations:
                                - Handlers-cleaners → Manual Labor
                                - Machine-op-inspct → Manual Labor
                                - Craft-repair → Manual Labor
                                - Transport-moving → Manual Labor
                                - Farming-fishing → Manual Labor
                                - Exec-managerial → Management
                                - Prof-specialty → Professional
                                - Tech-support → Technical
                                - Sales → Service Work
                                - Other-service → Service Work
            
                                Categories and their meanings:
                                - Manual Labor: physical work, manufacturing, construction, cleaning, transportation, farming, machine operation, craft work, manual repairs, physical labor
                                - Service Work: customer service, retail, hospitality, food service, personal care, non-physical service roles
                                - Professional: doctors, lawyers, engineers, scientists, specialized knowledge workers
                                - Management: supervisors, executives, administrators, team leaders
                                - Technical: IT, technical support, specialized technical skills, maintenance""",
                "dtype": "category",
                "values": ["Manual Labor", "Service Work", "Professional", "Management", "Technical"],
            },
        },
    }
}

In [None]:
# this will run for ~5min
df_syn_enriched = mock.sample(
    tables=mock_config_tables,
    existing_data={"census": df_syn},
    model="openai/gpt-4.1-nano",
)

In [None]:
df_syn_enriched.head()
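
LLM output is worth a quick sanity check before it is used as training data. The following sketch, which is an illustration rather than a feature of `mostlyai-mock`, counts category values outside the allowed list and measures how often the generated `work_category` agrees with the example mappings given in the prompt. The function name `check_enrichment` and the toy rows are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Tuple, Union

# Allowed values, copied from the "values" entry of the work_category column config.
ALLOWED_CATEGORIES = {"Manual Labor", "Service Work", "Professional", "Management", "Technical"}

# Occupation -> category examples, copied from the work_category prompt.
PROMPT_EXAMPLES = {
    "Handlers-cleaners": "Manual Labor",
    "Machine-op-inspct": "Manual Labor",
    "Craft-repair": "Manual Labor",
    "Transport-moving": "Manual Labor",
    "Farming-fishing": "Manual Labor",
    "Exec-managerial": "Management",
    "Prof-specialty": "Professional",
    "Tech-support": "Technical",
    "Sales": "Service Work",
    "Other-service": "Service Work",
}


def check_enrichment(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Union[int, Optional[float]]]:
    """Summarize (occupation, work_category) pairs against the prompt's rules."""
    invalid = compared = agreed = 0
    for occupation, category in pairs:
        if category not in ALLOWED_CATEGORIES:
            invalid += 1
        expected = PROMPT_EXAMPLES.get(occupation)
        if expected is not None:  # only occupations that the prompt gives an example for
            compared += 1
            if category == expected:
                agreed += 1
    agreement = agreed / compared if compared else None
    return {"invalid": invalid, "compared": compared, "agreement": agreement}


# Toy rows standing in for the enriched synthetic table.
toy_pairs = [
    ("Sales", "Service Work"),
    ("Tech-support", "Technical"),
    ("Craft-repair", "Professional"),  # disagrees with the prompt example
    ("Adm-clerical", "Office Work"),   # value outside the allowed list
]
report = check_enrichment(toy_pairs)
print(report)
```

On the real data, the pairs could be built with `zip(df_syn_enriched["occupation"], df_syn_enriched["work_category"])`, assuming the census sample exposes the occupation under the column name `occupation`.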

## Train a Generator on the Enriched Synthetic Data

Train a generator on the enriched synthetic data to encode the LLM intelligence into a reusable, privacy-safe enrichment model. This enables it to later apply the same intelligence to sensitive original data.

In [None]:
# train generator on enriched synthetic data
config = {
    "name": "Enriched Census",
    "tables": [
        {
            "name": "Census",
            "data": df_syn_enriched,
            "tabular_model_configuration": {
                "value_protection": False,  # not needed as training data is not private
            },
        }
    ],
}
g = mostly.train(config=config)

## Use Generator to Enrich Original Data

Now we use the generator trained on the enriched synthetic data to add the same new columns to the original sensitive data. We do this by passing the original data as the seed input, so the generator keeps those values fixed and generates only the new columns, following the patterns it learned from the enriched replica. The original values remain untouched, yet the dataset is now extended with the same enrichments, and the sensitive data is never sent to the LLM.

In [None]:
# generate enriched original data using original data as seed
df_orig_enriched = mostly.probe(g, seed=df_orig)

In [None]:
# display sample of enriched original data
df_orig_enriched.sample(n=10)
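
To confirm that seeding left the original values untouched, the enriched output can be compared with the seed row by row. The sketch below is a minimal check under two assumptions: the output keeps the row order of the seed, and both tables are available as lists of row dictionaries. The helper names are hypothetical and the rows are toy data.

```python
import math
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def _same(a: Any, b: Any) -> bool:
    """Treat two NaN values as equal; otherwise use ordinary equality."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def seed_mismatches(seed_rows: List[Row], enriched_rows: List[Row]) -> List[Tuple[int, str]]:
    """Return (row index, column) for every seed value that differs in the enriched rows."""
    if len(seed_rows) != len(enriched_rows):
        raise ValueError("seed and enriched data must have the same number of rows")
    mismatches = []
    for i, (seed, enriched) in enumerate(zip(seed_rows, enriched_rows)):
        for column, value in seed.items():
            if column not in enriched or not _same(value, enriched[column]):
                mismatches.append((i, column))
    return mismatches


def added_columns(seed_rows: List[Row], enriched_rows: List[Row]) -> List[str]:
    """List the columns that appear only in the enriched rows."""
    if not seed_rows or not enriched_rows:
        return []
    return [c for c in enriched_rows[0] if c not in seed_rows[0]]


# Toy rows standing in for the original and the enriched original data.
seed = [{"age": 39, "occupation": "Sales"}, {"age": 50, "occupation": float("nan")}]
enriched = [
    {"age": 39, "occupation": "Sales", "work_category": "Service Work"},
    {"age": 50, "occupation": float("nan"), "work_category": "Technical"},
]
mismatches = seed_mismatches(seed, enriched)
new_cols = added_columns(seed, enriched)
print(mismatches, new_cols)
```

With pandas, the row dictionaries can be obtained via `df_orig.to_dict("records")` and `df_orig_enriched.to_dict("records")`. An empty mismatch list indicates that only new columns were added.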

## Conclusion

This tutorial demonstrated how to securely enrich sensitive proprietary data by:

1. Creating a synthetic replica
2. Enriching the proxy with an LLM
3. Training a generator on the enriched replica
4. Applying the enrichment to the sensitive data

The sensitive data never leaves your secure environment, maintaining privacy while enabling LLM-based enrichment.