In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Tutorial: Generating Differentially Private Synthetic Data

**Copyright 2025 DeepMind Technologies Limited.**

Before diving into this notebook, we highly recommend familiarizing yourself with the ["Tutorial of DP-SGD LoRA fine-tuning Gemma3 in Keras on SAMSum dataset"]() (referred to as the "DP-SGD LoRA tutorial" from now on). This tutorial builds upon that previous work, and therefore, code concepts introduced there will not be explained in detail here.

This tutorial will guide you through generating **differentially private (DP) synthetic data**. This is a widespread use case, particularly for organizations handling sensitive data who wish to augment it while preserving individual privacy. Generating private synthetic data unlocks many possibilities, such as public data releases, training downstream tasks, or simply replacing sensitive original datasets to mitigate risks.

For a more in-depth understanding of synthetic data use cases and related concepts, you can refer to the paper ["Harnessing large-language models to generate private synthetic text"](https://arxiv.org/pdf/2306.01684). This tutorial draws inspiration from that paper and adopts a similar experimental setup.

In this example, we'll generate synthetic movie reviews that are similar to the [IMDb dataset](http://kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The overall approach mirrors the DP-SGD LoRA tutorial:

1.  We will **DP fine-tune** the [Gemma3 base model](https://www.kaggle.com/models/keras/gemma3) on the IMDb dataset. We'll use a concise prompt: `[imdb][{label}]:`. The model is expected to learn that this prompt signals the task of generating a review aligned with the provided label (positive or negative).
2.  Next, we'll **sample** several thousand examples in a "diverse" manner using the DP fine-tuned model, producing our DP synthetic data.
3.  Finally, we'll **evaluate performance** using the [MAUVE metric](https://github.com/krishnap25/mauve), comparing the generated DP synthetic data with the test portion of the IMDb dataset.

This tutorial leverages the **[Keras API]()** within Jax Privacy for DP model fine-tuning.

Performance will be evaluated using the **MAUVE metric**, calculated against real test data across the following setups:

1.  **Baseline 1:** Handcrafted prompt without model fine-tuning.
2.  **Baseline 2:** Handcrafted prompt with a few examples from the dataset, without model fine-tuning.
3.  **DP Fine-tuning:** Model fine-tuned with differential privacy, followed by sampling.
4.  **Non-DP Fine-tuning:** Model fine-tuned without differential privacy, followed by sampling.
5.  **Real Train Data:** The performance of using the real training data itself, serving as an estimation of a "good" result.
6.  **Real Test Data:** The performance of using the real test data itself, serving as an ultimate upper bound for achievable results.

The precise configurations for each experiment are detailed at the end of this notebook.

The results presented in this tutorial were obtained using 4 Cloud TPU v5p devices.

A test run of this tutorial (e.g., to verify your environment is correctly configured) can be completed in a few minutes, for example, by using a free v2-8 TPU in Google Colab.



## Install and import dependencies

In [2]:
%%capture

# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"
!pip uninstall -y -q keras-hub
!pip install -q -U keras-hub
!pip install tqdm
!pip install ipywidgets

!pip install dp_accounting jaxtyping drjax
!pip install jax_privacy==1.0.0

In [3]:
import os

os.environ["KERAS_HOME"] = os.getcwd() # Use the current working directory as the Keras home; it should have enough space for model weights.
os.environ["KERAS_BACKEND"] = "jax"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00" # Avoid memory fragmentation on JAX backend.

import keras
import keras_hub
import tensorflow as tf
import tensorflow_datasets as tfds
import tqdm
import numpy as np
import jax
import jax.numpy as jnp
import json

# Jax Privacy deps
from jax_privacy.keras import keras_api

In [4]:
import kagglehub

kagglehub.login()

# If you are using Colab, you can alternatively set KAGGLE_USERNAME and KAGGLE_KEY
# values in user data, and then uncomment and run the following code:
#
# from google.colab import userdata
#
# os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
# os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
#
# Using userdata keeps the Kaggle API key safe. Alternatively, you can
# hardcode the values, but this is not recommended because of the risk of
# leaking the API key.


# If you're not using Colab, set the env vars as appropriate for your system.
# For example, to set the env vars on Linux you can run in terminal:
# ```
# export KAGGLE_USERNAME="your_username"
# export KAGGLE_KEY="your_key"
# ```

## Hyperparameter Setup

Here, we'll highlight a few important differences from the DP-SGD LoRA tutorial.

First, the default sampler is greedy: it always chooses the next token with the highest probability, so it would produce identical synthetic examples every time we perform inference. To introduce **variability**, we must change the sampler. We use a [TopPSampler](https://keras.io/keras_hub/api/samplers/top_p_sampler/) with `TOP_P=0.95`. At each step, tokens are ordered by decreasing probability, and the next token is sampled from the smallest set of top tokens whose cumulative probability reaches 0.95. As you'll see, a `TOP_P` value of 0.95 allows for significant diversity in our sampling.
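
To make the top-p rule concrete, here is a minimal NumPy sketch on a toy five-token distribution. It only illustrates the idea and is not KerasHub's implementation; the helper `top_p_filter` and the toy probabilities are made up for this example.

```python
import numpy as np


def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
  """Zeroes out tokens outside the nucleus and renormalizes the rest."""
  order = np.argsort(-probs, kind="stable")  # Most probable token first.
  cumulative = np.cumsum(probs[order])
  # Keep the shortest prefix whose cumulative probability reaches p.
  cutoff = int(np.searchsorted(cumulative, p)) + 1
  kept = order[:cutoff]
  filtered = np.zeros_like(probs)
  filtered[kept] = probs[kept]
  return filtered / filtered.sum()


# Toy next-token distribution (hypothetical values).
toy_probs = np.array([0.50, 0.30, 0.16, 0.03, 0.01])
nucleus = top_p_filter(toy_probs, p=0.95)

# Sampling from the filtered distribution can return a different token
# on each call, unlike greedy decoding.
rng = np.random.default_rng(0)
next_token = rng.choice(len(toy_probs), p=nucleus)
print(nucleus, next_token)
```

In this toy case the first three tokens already cover 0.96 of the probability mass, so the two least likely tokens are never sampled.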

We also define the **validation size**, which specifies how many dataset entries are used for validation during training. Note that these entries will differ with each validation step. This number can be quite small, or even zero. While a validation dataset is typically used to monitor model overfitting and enable early stopping, early stopping is less critical when fine-tuning for synthetic data generation. An overfit model can still perform well as a synthetic text generator, and there's generally no issue with that. In fact, if you have ample, non-sensitive data, there's little reason to generate synthetic data; you could simply use the data directly. If your dataset size is insufficient for your goals but the data isn't sensitive, non-DP fine-tuning is an option. However, in many real-world scenarios, data originates from individuals and is thus sensitive, even if not directly identifiable. In such cases, **differentially private fine-tuning** is crucial to ensure the generated synthetic data is also DP.

Another set of important parameters relates to the **MAUVE metric**. MAUVE quantifies the similarity between any two text datasets, ranging from 0 to 1. A score of 0 indicates completely different types of text, while 1 suggests the texts are highly similar. When calculating MAUVE, it's crucial for the datasets to contain enough samples; the [original paper](https://arxiv.org/pdf/2102.01454) recommends 5,000 samples as a good balance between speed and quality. To ensure the results aren't specific to a particular sampled dataset, it's advisable to generate more than 5,000 samples (e.g., 10,000) and then calculate MAUVE on multiple (e.g., 5) 5,000-sample subsampled datasets. Subsampling is performed uniformly from the larger dataset. Finally, the mean and standard deviation are calculated across these trials. In our experiment configuration, the following constants control this process: `NUM_SAMPLES` (the larger 10,000-sample dataset), `NUM_MAUVE_SAMPLES` (the size of each subsampled dataset for MAUVE calculation), and `NUM_MAUVE_TRIALS` (the number of times to subsample and calculate MAUVE). These constants apply to both synthetic data and real data.

The configuration also specifies the **base prompt** and whether to **fine-tune the model**, as some experiments do not involve fine-tuning and instead use different handcrafted prompts. You can find the specific prompts and other parameters for each experiment at the end of this notebook.

Since certain operations, such as generating 10,000 samples, can be time-consuming, we've included helper functions to save intermediate results. These results will be stored in the `SAVE_ROOT_DIR` folder.


In [5]:
CONFIG = {
    "GEMMA3_MODEL_TYPE": "gemma3_instruct_12b",
    "SEQUENCE_LENGTH": 512,
    "EPOCHS": 3,
    "BATCH_SIZE": 32,  # Should be a multiple of the number of accelerator devices (GPUs/TPUs) available.
    "GRADIENT_ACCUMULATION_STEPS": 32,  # Effective batch size is BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS; we recommend 1024.
    "LORA_RANK": 32,
    "LEARNING_RATE": 0.003,
    # Validation size can be small when fine-tuning for synthetic data generation.
    "VALIDATION_SIZE": 64,
    "SEED": 42,
    # TOP_P is crucial for "diverse" sampling, ensuring different results across sampling calls.
    "TOP_P": 0.95,
    # Number of synthetic examples to generate.
    "NUM_SAMPLES": 10000,
    # Batch size for sampling inference.
    "SAMPLE_BATCH_SIZE": 24,
    # Number of samples to use when computing MAUVE.
    "NUM_MAUVE_SAMPLES": 5000,
    # Number of times to calculate MAUVE on subsamples for a more precise estimate (mean and standard deviation).
    "NUM_MAUVE_TRIALS": 5,
    # Whether to use bfloat16 (16-bit float) weights. Not all GPUs support bfloat16 (e.g., V100 does not, A100 does).
    "USE_MIXED_PRECISION": False,
    # The prompt used for synthetic data generation.
    "PROMPT": "[imdb][{label}]:",
    # Set to False for experiments that do not require model fine-tuning.
    "FINETUNE_MODEL": True,
    # Set to True for differentially private fine-tuning.
    "USE_DP": True,
    # DP-SGD parameters. Only applicable if USE_DP is True.
    "EPSILON": 10.0,
    "DELTA": 1.5e-5,  # Chosen to be close to 1/n^1.1 (about 1.45e-5), where n = 25000 (number of training examples).
    "CLIPPING_NORM": 1.0,
    # If TEST_RUN is True, the code will execute on a small subset of data and a smaller model for quick verification.
    "TEST_RUN": True,
    # The root directory for saving all experiment data.
    "SAVE_ROOT_DIR": "./experiment_data"
}

if CONFIG["TEST_RUN"]:
    CONFIG["GEMMA3_MODEL_TYPE"] = "gemma3_instruct_1b"
    CONFIG["SEQUENCE_LENGTH"] = 128
    CONFIG["MAX_TRAIN_SIZE"] = 3000
    CONFIG["LORA_RANK"] = 4
    CONFIG["NUM_SAMPLES"] = 100
    CONFIG["SAMPLE_BATCH_SIZE"] = 8
    CONFIG["NUM_MAUVE_SAMPLES"] = 40
    CONFIG["NUM_MAUVE_TRIALS"] = 3
    CONFIG["USE_MIXED_PRECISION"] = False  # The 1b model is small and should fit into most GPUs.

print(f"Experiment config:\n{json.dumps(CONFIG, indent=2)}")
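
A few quantities follow directly from the full (non-test) configuration, and it can help to compute them once. The sketch below is plain arithmetic on the values above. The variable names are introduced here for illustration, and the count of optimizer updates assumes that weights are updated once every `GRADIENT_ACCUMULATION_STEPS` batches.

```python
# Values copied from the full (non-test) configuration above.
train_size = 25_000  # Number of IMDb training reviews.
batch_size = 32
gradient_accumulation_steps = 32
epochs = 3

# Number of examples contributing to each weight update.
effective_batch_size = batch_size * gradient_accumulation_steps

# Number of physical batches processed over the whole run.
micro_steps = epochs * (train_size // batch_size)

# Assumption: one weight update per `gradient_accumulation_steps` batches.
optimizer_updates = micro_steps // gradient_accumulation_steps

# Reference value 1/n^1.1 mentioned in the DELTA comment.
delta_reference = 1.0 / train_size**1.1

print(f"{effective_batch_size=} {micro_steps=} {optimizer_updates=}")
print(f"{delta_reference=:.3e}")
```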

In [6]:
def save_config():
  os.makedirs(CONFIG["SAVE_ROOT_DIR"], exist_ok=True)
  file_path = os.path.join(CONFIG["SAVE_ROOT_DIR"], "config.json")
  with open(file_path, "w") as f:
    # Save as json so it is human-readable and easy to load programmatically.
    json.dump(CONFIG, f, indent=2)
  print(f"Saved config to {file_path}")

save_config()

## Prepare the data

In [7]:
SOURCE_TRAIN_DS, SOURCE_VALIDATION_DS = tfds.load('imdb_reviews', split=['train', 'test'])

if CONFIG["TEST_RUN"]:
  print("TEST RUN, sampling datasets")
  SOURCE_TRAIN_DS = SOURCE_TRAIN_DS.take(CONFIG["MAX_TRAIN_SIZE"])
  SOURCE_VALIDATION_DS = SOURCE_VALIDATION_DS.take(CONFIG["MAX_TRAIN_SIZE"])

Let's examine an IMDb dataset entry.

Each entry is a pair: a **label** (an integer, 1 for positive or 0 for negative) and the **movie review text**.

In [8]:
SOURCE_EXAMPLE_DS = SOURCE_VALIDATION_DS.take(1).batch(1, drop_remainder=True)
SOURCE_EXAMPLE = SOURCE_EXAMPLE_DS.as_numpy_iterator().next()
for key, val in SOURCE_EXAMPLE.items():
  if isinstance(val[0], np.int64):
    decoded_val = str(val[0])
  else:
    decoded_val = val[0].decode('utf-8')
  print(f'{key}:\n"{decoded_val}"\n')

Now, let's transform the dataset entries into model prompts and their corresponding expected responses. For labels, instead of using 1 or 0, our prompts will use the English words "positive" or "negative". The expected responses will remain the original review texts, without any modifications.

In [9]:
def get_prompt(label_id):
  label_str = "positive" if label_id == 1 else "negative"
  return CONFIG["PROMPT"].format(label=label_str)

def source_to_gemma3_format(review_dict):
  def map_source_example(label, text):
    return get_prompt(label), text

  prompt, response = tf.py_function(
        func=map_source_example,
        inp=[review_dict['label'], review_dict['text']],
        Tout=[tf.string, tf.string]
    )
  prompt.set_shape(())
  response.set_shape(())
  return {
      "prompts": prompt,
      "responses": response
  }

In [10]:
TRAIN_DS = SOURCE_TRAIN_DS.map(source_to_gemma3_format)
VALIDATION_DS = SOURCE_VALIDATION_DS.map(source_to_gemma3_format).take(CONFIG["VALIDATION_SIZE"])

Let's examine what the model's input will look like. If we are not fine-tuning the model, we won't need `responses`. During inference, we'll provide plain prompts as a list of strings, rather than a dictionary of prompts and responses.

In [11]:
EXAMPLE_DS = VALIDATION_DS.take(1).batch(1, drop_remainder=True)
EXAMPLE = EXAMPLE_DS.as_numpy_iterator().next()
for key, val in EXAMPLE.items():
  decoded_val = val[0].decode('utf-8')
  print(f'{key}:\n"{decoded_val}"\n')

In [12]:
# Train size is important for DP-SGD.
TRAIN_SIZE = int(TRAIN_DS.cardinality().numpy())
print(f'Train size: {TRAIN_SIZE}')
VALIDATION_SIZE = int(VALIDATION_DS.cardinality().numpy())
print(f'Validation size: {VALIDATION_SIZE}')

TRAIN_DS = TRAIN_DS.shuffle(buffer_size=2048).batch(CONFIG["BATCH_SIZE"], drop_remainder=True)
VALIDATION_DS = VALIDATION_DS.batch(CONFIG["BATCH_SIZE"], drop_remainder=True)

In [13]:
if len(jax.devices()) > 1:
  DATA_PARALLEL = keras.distribution.DataParallel()
  # Shows how many devices the data will be distributed over.
  print(DATA_PARALLEL)
  keras.distribution.set_distribution(DATA_PARALLEL)
else:
  print("Only one device, there will be no data parallelism")

## Model setup and fine-tuning

In [14]:
MODEL_WEIGHTS_DTYPE = None # use default dtype
if CONFIG["USE_MIXED_PRECISION"]:
  print("Using mixed precision")
  keras.mixed_precision.set_global_policy('mixed_bfloat16')
  MODEL_WEIGHTS_DTYPE = "bfloat16"

gemma_lm = keras_hub.models.Gemma3CausalLM.from_preset(CONFIG["GEMMA3_MODEL_TYPE"],
                                                       dtype=MODEL_WEIGHTS_DTYPE)

assert isinstance(gemma_lm.preprocessor, keras_hub.models.Gemma3CausalLMPreprocessor)
gemma_lm.preprocessor.sequence_length = CONFIG["SEQUENCE_LENGTH"]
gemma_lm.summary()

In [15]:
if CONFIG["FINETUNE_MODEL"]:
  gemma_lm.backbone.enable_lora(rank=CONFIG["LORA_RANK"])
  gemma_lm.summary()
else:
  print("Not finetuning model")

In [16]:
def load_lora_weights(filename: str):
  filepath = os.path.join(CONFIG["SAVE_ROOT_DIR"], filename)
  gemma_lm.backbone.load_lora_weights(filepath)
  print(f"LoRA weights loaded from: {filepath}")

# Uncomment the next line if you've already fine-tuned the model and want to load the weights.
# load_lora_weights("weights.lora.h5")

### Enabling Differentially Private Fine-tuning

The pre-trained model itself is not differentially private with respect to its initial training data, which we consider non-sensitive. However, the data used for fine-tuning in real-world scenarios usually *is* sensitive. To ensure the privacy of this sensitive data (IMDb reviews in our case), we prepare the model so that any subsequent training on this data will be differentially private. This process makes our fine-tuned model differentially private with respect to the sensitive fine-tuning data.
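
For intuition, the core of a DP-SGD update can be sketched in a few lines of NumPy: clip each example's gradient to a maximum L2 norm, add Gaussian noise scaled to that norm, and average. This is only a conceptual illustration, not how Jax Privacy implements it. The function `dp_mean_gradient` and the `noise_multiplier` argument are hypothetical; in a real run the noise scale is calibrated by a privacy accountant from `EPSILON`, `DELTA`, and the training schedule.

```python
import numpy as np


def dp_mean_gradient(per_example_grads, clipping_norm, noise_multiplier, rng):
  """Clips each example's gradient, then averages with Gaussian noise added."""
  norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
  # Scale down only the gradients whose L2 norm exceeds the clipping norm.
  scale = np.minimum(1.0, clipping_norm / np.maximum(norms, 1e-12))
  clipped = per_example_grads * scale
  # Noise standard deviation is proportional to the clipping norm.
  noise = rng.normal(
      0.0, noise_multiplier * clipping_norm, size=clipped.shape[1]
  )
  return (clipped.sum(axis=0) + noise) / len(per_example_grads)


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # L2 norms: 5.0 and 0.5.

# With zero noise, only the clipping effect is visible.
noiseless = dp_mean_gradient(grads, clipping_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_mean_gradient(grads, clipping_norm=1.0, noise_multiplier=1.0, rng=rng)
print(noiseless, noisy)
```

Clipping bounds how much any single review can influence an update, and the noise masks whatever influence remains.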

In [17]:
if CONFIG["USE_DP"]:
  params = keras_api.DPKerasConfig(
        epsilon=CONFIG["EPSILON"],
        delta=CONFIG["DELTA"],
        clipping_norm=CONFIG["CLIPPING_NORM"],
        batch_size=CONFIG["BATCH_SIZE"],
        train_steps=CONFIG["EPOCHS"] * (TRAIN_SIZE // CONFIG["BATCH_SIZE"]),
        train_size=TRAIN_SIZE,
        gradient_accumulation_steps=CONFIG["GRADIENT_ACCUMULATION_STEPS"],
        seed=CONFIG["SEED"],
  )
  gemma_lm = keras_api.make_private(gemma_lm, params)
  print(
      "DP training: "
      f"{CONFIG['CLIPPING_NORM']=} {CONFIG['EPOCHS']=} {CONFIG['BATCH_SIZE']=}"
  )
else:
  print("Non-DP training")

In [18]:
if CONFIG["FINETUNE_MODEL"]:
  optimizer = keras.optimizers.Adam(
      learning_rate=CONFIG["LEARNING_RATE"],
      gradient_accumulation_steps=CONFIG["GRADIENT_ACCUMULATION_STEPS"],
  )

  gemma_lm.compile(
      loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=optimizer,
      weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
  )
else:
  print("Not finetuning model")

In [19]:
def save_lora_weights(filename: str):
  filepath = os.path.join(CONFIG["SAVE_ROOT_DIR"], filename)
  gemma_lm.backbone.save_lora_weights(filepath)
  print(f"LoRA weights saved to: {filepath}")

### Do LoRA fine-tuning

In [20]:
if CONFIG["FINETUNE_MODEL"]:
  gemma_lm.fit(x=TRAIN_DS,
              epochs=CONFIG["EPOCHS"],
              validation_data=VALIDATION_DS)
  save_lora_weights("weights.lora.h5")
else:
  print("Not finetuning model")

## Synthetic Data Generation

In this section, we will generate synthetic data using the model we've loaded and, optionally, fine-tuned.

Once you have the model, the synthetic data generation process is straightforward: we simply provide prompts (omitting expected responses since we're no longer training) and instruct the model to complete them (i.e., perform standard inference). These prompts can be either a short task-specific prompt, such as `[imdb][<positive/negative>]:` (if the model has been fine-tuned), or a longer, human-readable prompt explaining the task to the model, potentially including examples from the real dataset if fine-tuning has not occurred. It's important to note that if you include examples from the original dataset, your generated data is, strictly speaking, no longer differentially private. In such cases, a minimum safeguard is to manually review the included examples.

First, as discussed earlier, we must update the model's sampler to ensure it produces different results with each inference.

In [21]:
sampler = keras_hub.samplers.TopPSampler(p=CONFIG["TOP_P"])
gemma_lm.compile(sampler=sampler)

Let's manually examine some generated examples. If the model hasn't been fine-tuned, the examples might appear "okayish," but with fine-tuning, they will more closely resemble real data.

In [22]:
def generate_n_examples(n: int, is_positive: bool) -> list[str]:
  label = 1 if is_positive else 0
  return [gemma_lm.generate(get_prompt(label)) for _ in range(n)]

def print_generated_texts(texts: list[str], label: str):
  examples_to_print = '\n------\n'.join(texts)
  print(f"Generated {label} examples:\n{examples_to_print}")

print_generated_texts(generate_n_examples(n=3, is_positive=True), "positive")
print("\n=========\n")
print_generated_texts(generate_n_examples(n=3, is_positive=False), "negative")

Now, let's generate the synthetic data! We will produce it in the same format as the input dataset: a list of Python dictionaries, each containing `{"label": <value>, "text": <value>}`. Since generating thousands of samples can take several hours, it's a good practice to save the data immediately to prevent loss.

**Note**: Generation can take a significant amount of time if you have a large model, long sequence length, small physical batch size, or a small number of GPUs for data parallelization. Adjust the parameters accordingly to speed up generation.
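
If a multi-hour run might be interrupted, one option is to append each generated batch to a JSON Lines file instead of writing everything at the end. The sketch below is a possible approach under that assumption; the helpers `append_jsonl` and `read_jsonl` and the demo records are hypothetical and not part of this notebook's pipeline.

```python
import json
import os
import tempfile


def append_jsonl(records, file_path):
  """Appends one JSON object per line so earlier batches survive a crash."""
  with open(file_path, "a", encoding="utf-8") as f:
    for record in records:
      f.write(json.dumps(record) + "\n")


def read_jsonl(file_path):
  """Reads back all records written so far."""
  with open(file_path, "r", encoding="utf-8") as f:
    return [json.loads(line) for line in f if line.strip()]


# Demo with two tiny "batches" written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "synthetic_data.jsonl")
append_jsonl([{"label": 1, "text": "A lovely film."}], demo_path)
append_jsonl([{"label": 0, "text": "A dull film."}], demo_path)
recovered = read_jsonl(demo_path)
print(recovered)
```

Calling such a helper once per batch inside the generation loop would keep everything generated before an interruption.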

In [23]:
#@title Synthetic Data Generation Functions

def extract_label_and_review(responses, prompts_by_label):
  cleaned_texts = []
  for text in responses:
      for label, prompt in prompts_by_label.items():
          if text.startswith(prompt):
              text = text[len(prompt):].replace("<end_of_turn>", "")
              cleaned_texts.append({"label": label, "text": text})
              break
  return cleaned_texts

def generate_synthetic_data(num_samples, sample_batch_size, seed):
  base_prompts = {
      0: get_prompt(0), # negative
      1: get_prompt(1), # positive
  }
  prompt_rng = jax.random.key(seed)
  prompt_arange = jnp.arange(len(base_prompts))
  prompt_indices = jax.random.choice(
      prompt_rng, prompt_arange, shape=(num_samples,)
  )
  # Use ceiling division so that no batch exceeds sample_batch_size.
  num_batches = -(-num_samples // sample_batch_size)
  prompt_splits = jnp.array_split(prompt_indices, num_batches)
  pbar = tqdm.tqdm(total=num_samples)
  result = []
  for local_indices in prompt_splits:
    prompts = [base_prompts[int(i.item())] for i in local_indices]
    responses = gemma_lm.generate(prompts)
    label_and_reviews = extract_label_and_review(responses, base_prompts)
    # Batches can differ in size, so advance by the actual batch length.
    pbar.update(len(local_indices))
    result += label_and_reviews
  pbar.close()
  return result

def save_data(data, filename):
  os.makedirs(CONFIG["SAVE_ROOT_DIR"], exist_ok=True)
  file_path = os.path.join(CONFIG["SAVE_ROOT_DIR"], filename)
  with open(file_path, "w") as f:
    json.dump(data, f, indent=2)
  print(f"Saved data to {file_path}")

def load_data(filename):
  file_path = os.path.join(CONFIG["SAVE_ROOT_DIR"], filename)
  with open(file_path, "r") as f:
    data = json.load(f)
  print(f"Loaded data from {file_path}")
  return data


In [24]:
SYNTHETIC_DATA = generate_synthetic_data(CONFIG["NUM_SAMPLES"], CONFIG["SAMPLE_BATCH_SIZE"], CONFIG["SEED"])
save_data(SYNTHETIC_DATA, "synthetic_data.json")
# Comment out the two previous lines and uncomment the next line to load the already generated data.
# SYNTHETIC_DATA = load_data("synthetic_data.json")
print(f"\nGenerated {len(SYNTHETIC_DATA)} synthetic examples.")
print(f"First synthetic example:\n{SYNTHETIC_DATA[0]}")
SYNTHETIC_DATA_TEXTS = [example["text"] for example in SYNTHETIC_DATA]

## Evaluation

As previously mentioned, we evaluate the quality of the synthetic data using the **MAUVE metric**.

Unfortunately, at the time of writing this tutorial, the scalable library we used for MAUVE calculation is not open-sourced. Therefore, we cannot provide code for directly calculating MAUVE here. We plan to implement an alternative solution using open-source libraries soon, and this notebook will be updated accordingly.

For now, we will explain how you can perform this calculation using available open-source libraries and present the results we obtained.

You can use the official [Mauve library](https://github.com/krishnap25/mauve) from GitHub. Its interface is quite straightforward. The primary limitation is that it's not highly parallelizable, as it runs on a single GPU, making text-to-embedding conversion a bottleneck. To overcome this, we recommend calculating the text embeddings in parallel independently and then providing these pre-computed embeddings to the MAUVE library. This approach significantly speeds up score calculation. Keep in mind that our methodology involves subsampling and multiple MAUVE calculations; this logic needs to be implemented manually as it's not part of the library. The subsampling strategy was described earlier: generate 10,000 samples, then uniformly subsample 5,000 samples five times for both synthetic and real data, and calculate the MAUVE score for each subsample.
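
The subsample-and-repeat logic described above can be sketched as follows. The scoring function is passed in as an argument so that the example runs without the MAUVE library. `subsampled_scores` and `toy_score` are hypothetical names, the random arrays stand in for real embeddings, and `toy_score` is only a placeholder similarity, not MAUVE.

```python
import numpy as np


def subsampled_scores(p_features, q_features, score_fn, num_subsamples,
                      num_trials, seed=0):
  """Scores several uniform subsamples and returns (mean, std) of the scores."""
  rng = np.random.default_rng(seed)
  scores = []
  for _ in range(num_trials):
    # Uniform subsampling without replacement from each dataset.
    p_idx = rng.choice(len(p_features), size=num_subsamples, replace=False)
    q_idx = rng.choice(len(q_features), size=num_subsamples, replace=False)
    scores.append(score_fn(p_features[p_idx], q_features[q_idx]))
  return float(np.mean(scores)), float(np.std(scores))


def toy_score(p, q):
  """Placeholder similarity in (0, 1]; replace with a real MAUVE computation."""
  return float(np.exp(-np.linalg.norm(p.mean(axis=0) - q.mean(axis=0))))


# Random stand-ins for pre-computed text embeddings.
rng = np.random.default_rng(0)
synthetic_embeddings = rng.normal(size=(200, 8))
real_embeddings = rng.normal(size=(200, 8))

mean_score, std_score = subsampled_scores(
    synthetic_embeddings, real_embeddings, toy_score,
    num_subsamples=100, num_trials=5,
)
print(f"{mean_score:.3f} ± {std_score:.3f}")
```

With the official library, `score_fn` could be a thin wrapper that calls `mauve.compute_mauve(p_features=p, q_features=q)` and returns the `mauve` attribute of the result, as we understand the library's README; check the current documentation before relying on it.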

The method for calculating embeddings is crucial. In our evaluation, we used the [Gecko English model](https://arxiv.org/pdf/2403.20327), which has 110 million parameters and produces 768-dimensional embedding vectors. While the Gecko model weights are not publicly released, you can perform inference using Vertex AI on Google Cloud (refer to the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api) and this [notebook](https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/text_embedding_new_api.ipynb)). If you aim to reproduce our results, use `text-embedding-005`, which is the most similar model available.

Alternatively, you can create text embeddings by using any other general-purpose LLM and extracting its last hidden layer. For instance, you could take the base, non-instruct Gemma3 model and use the last hidden layer of its last non-padding token. The advantage of this approach is that you can utilize the same resources running this notebook to their maximum capacity, such as computing embeddings for multiple texts in parallel.
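
Selecting the hidden state of the last non-padding token can be sketched with NumPy indexing. The example assumes you already have the last layer's hidden states with shape `[batch, seq_len, dim]` and a boolean padding mask with shape `[batch, seq_len]`; how to obtain them depends on the model API and is not shown. The helper `last_token_embedding` is a hypothetical name.

```python
import numpy as np


def last_token_embedding(hidden_states, padding_mask):
  """Returns the hidden state of each sequence's last non-padding token.

  hidden_states: float array of shape [batch, seq_len, dim].
  padding_mask: bool array of shape [batch, seq_len], True for real tokens.
  """
  seq_len = padding_mask.shape[1]
  # Position of the last True entry in each row of the mask.
  last_index = seq_len - 1 - np.argmax(padding_mask[:, ::-1], axis=1)
  return hidden_states[np.arange(hidden_states.shape[0]), last_index]


# Toy inputs: 2 sequences, 4 positions, 3-dimensional hidden states.
hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]], dtype=bool)
embeddings = last_token_embedding(hidden, mask)
print(embeddings)
```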

### Results

With the instruction-tuned Gemma3 12B model (`gemma3_instruct_12b`), you can expect the following MAUVE scores (mean ± standard deviation across trials):

| Experiment | MAUVE |
|---|---|
| `baseline_without_examples`: Baseline (no fine-tuning, no examples) | 0.006 ± 0.000 |
| `baseline_with_examples`: Baseline (no fine-tuning, 6 examples) | 0.008 ± 0.000 |
| `dp_ft`: DP Synthetic Data | 0.795 ± 0.010 |
| `non_dp_ft`: Non-DP Synthetic Data | 0.856 ± 0.008 |
| `use_train`: Train real vs. test real data | 0.806 ± 0.013 |
| `upper_bound`: Test real vs. test real data | 0.964 ± 0.003 |

The results align with our expectations. The baselines score very poorly because the model, without fine-tuning, lacks knowledge of the desired format and style for generating reviews. DP fine-tuning performs somewhat worse than non-DP fine-tuning (0.795 vs. 0.856), but the difference is not dramatic. Both fine-tuning approaches yield lower scores than simply comparing two samples from the test data (`upper_bound`, 0.964), which represents the theoretical best performance achievable in synthetic text data generation.

One potentially surprising observation is that using the training data directly yields roughly similar results to DP synthetic data, yet worse results than non-DP synthetic data. This phenomenon is explained by how the dataset was split into training and testing sets, as detailed in the [original paper](https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf). The train and test splits do not overlap in terms of specific movies, leading to slightly different linguistic patterns. The fact that synthetic data generated by a fine-tuned model is more similar to the test dataset can be attributed to the LLM learning to produce "generic" movie reviews that capture the essential characteristics of a movie review, irrespective of the particular film. This generalized understanding allows its output to statistically align more closely with the test set.

Below, you'll find the detailed configurations for each experiment.


<details>
<summary><b>baseline_without_examples</b></summary>

```json
{
  "GEMMA3_MODEL_TYPE": "gemma3_instruct_12b",
  "SEQUENCE_LENGTH": 512,
  "EPOCHS": 3,
  "BATCH_SIZE": 32,
  "GRADIENT_ACCUMULATION_STEPS": 32,
  "LORA_RANK": 32,
  "LEARNING_RATE": 0.003,
  "VALIDATION_SIZE": 64,
  "SEED": 42,
  "TOP_P": 0.95,
  "NUM_SAMPLES": 10000,
  "SAMPLE_BATCH_SIZE": 24,
  "NUM_MAUVE_SAMPLES": 5000,
  "NUM_MAUVE_TRIALS": 5,
  "USE_MIXED_PRECISION": false,
  "FINETUNE_MODEL": false,
  "PROMPT": "As an IMDb movie reviewer, generate a realistic, one paragraph {label} movie review. Make up concrete names for characters or other details in your review. Just provide the review text itself, no ratings, sections, or extra info.",
  "USE_DP": false,
  "EPSILON": 10.0,
  "DELTA": 1.5e-05,
  "CLIPPING_NORM": 1.0,
  "TEST_RUN": false
}
```

</details>

<details>
<summary><b>baseline_with_examples</b></summary>

Since we are inserting examples, the prompt becomes significantly larger (~500 tokens). For proper evaluation, we must increase the maximum sequence length to provide enough space for the model to generate its prediction.

```json
{
  "GEMMA3_MODEL_TYPE": "gemma3_instruct_12b",
  "SEQUENCE_LENGTH": 1024,
  "EPOCHS": 3,
  "BATCH_SIZE": 16,
  "GRADIENT_ACCUMULATION_STEPS": 64,
  "LORA_RANK": 32,
  "LEARNING_RATE": 0.003,
  "VALIDATION_SIZE": 64,
  "SEED": 42,
  "TOP_P": 0.95,
  "NUM_SAMPLES": 10000,
  "SAMPLE_BATCH_SIZE": 16,
  "NUM_MAUVE_SAMPLES": 5000,
  "NUM_MAUVE_TRIALS": 5,
  "USE_MIXED_PRECISION": false,
  "FINETUNE_MODEL": false,
  "PROMPT": "As an IMDb movie reviewer, generate a realistic, one paragraph movie review in the same format and similar style as the examples below. Make up concrete names for characters or other details in your review. Only provide the review text.\n\nHere are three examples of positive movie reviews:\n\n* `This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.`\n* `I just saw the movie on tv. I really enjoyed it. I like a good mystery. and this one had me guessing up to the end. Sean Connery did a good job. I would recomend it to a friend.`\n* `I really enjoyed this movie.I was fifteen when this movie came out and I could relate. This will be a movie I would show my kids to let them know, the feelings they are having are normal. It is funny to see how we could be so devestated by things at such a young age..who knew that we would bounce back....again and again....Great movie!!!!`\n\nHere are three examples of negative movie reviews:\n\n* `It was disgusting and painful. What a waste of a cast! I swear, the audience (1/2 full) laughed TWICE in 90 minutes. This is not a lie. Do not even rent it.<br /><br />Zeta Jones was just too mean to be believable.<br /><br />Cusack was OK. Just OK. I felt sorry for him (the actor) in case people remember this mess.<br /><br />Roberts was the same as she always is. Charming and sweet, but with no purpose. The \"romance\" with John was completely unbelievable.`\n* `I'm sorry, I had high hopes for this movie. Unfortunately, it was too long, too thin and too weak to hold my attention. When I realized the whole movie was indeed only about an older guy reliving his dream, I felt cheated. Surely it could have been a device to bring us into something deeper, something more meaningful.<br /><br />So, don't buy a large drink or you'll be running to the rest room. My kids didn't enjoy it either. Ah well.`\n* `Corky Romano has to be one of the most jaw dropping and horrific \"comedy's\" ever made.<br /><br />While the sometimes amusing Chris Kattan who pulled off a very funny performance in the hilarious 'Undercover Brother' his character in Corky is so stupid and so unfunny-which is a shame since the premise is a wonderful idea. To bad they ran out of them when they got to page 3 on the script.`\n\nNow it's your turn, generate one {label} movie review which is similar to the {label} examples above:",
  "USE_DP": false,
  "EPSILON": 10.0,
  "DELTA": 1.5e-05,
  "CLIPPING_NORM": 1.0,
  "TEST_RUN": false
}
```

</details>

<details>
<summary><b>dp_ft</b></summary>

```json
{
  "GEMMA3_MODEL_TYPE": "gemma3_instruct_12b",
  "SEQUENCE_LENGTH": 512,
  "EPOCHS": 3,
  "BATCH_SIZE": 32,
  "GRADIENT_ACCUMULATION_STEPS": 32,
  "LORA_RANK": 32,
  "LEARNING_RATE": 0.003,
  "VALIDATION_SIZE": 64,
  "SEED": 42,
  "TOP_P": 0.95,
  "NUM_SAMPLES": 10000,
  "SAMPLE_BATCH_SIZE": 24,
  "NUM_MAUVE_SAMPLES": 5000,
  "NUM_MAUVE_TRIALS": 5,
  "USE_MIXED_PRECISION": false,
  "FINETUNE_MODEL": true,
  "PROMPT": "[imdb][{label}]:",
  "USE_DP": true,
  "EPSILON": 10.0,
  "DELTA": 1.5e-05,
  "CLIPPING_NORM": 1.0,
  "TEST_RUN": false
}
```

</details>

<details>
<summary><b>non_dp_ft</b></summary>

```json
{
  "GEMMA3_MODEL_TYPE": "gemma3_instruct_12b",
  "SEQUENCE_LENGTH": 512,
  "EPOCHS": 3,
  "BATCH_SIZE": 32,
  "GRADIENT_ACCUMULATION_STEPS": 32,
  "LORA_RANK": 32,
  "LEARNING_RATE": 0.003,
  "VALIDATION_SIZE": 64,
  "SEED": 42,
  "TOP_P": 0.95,
  "NUM_SAMPLES": 10000,
  "SAMPLE_BATCH_SIZE": 24,
  "NUM_MAUVE_SAMPLES": 5000,
  "NUM_MAUVE_TRIALS": 5,
  "USE_MIXED_PRECISION": false,
  "FINETUNE_MODEL": true,
  "PROMPT": "[imdb][{label}]:",
  "USE_DP": false,
  "EPSILON": 10.0,
  "DELTA": 1.5e-05,
  "CLIPPING_NORM": 1.0,
  "TEST_RUN": false
}
```

</details>

<details>
<summary><b>use_train</b></summary>

No specific configuration is provided here, as this simply involves calculating the MAUVE score between samples from the train and test datasets.

</details>

<details>
<summary><b>upper_bound</b></summary>

No specific configuration is provided here, as this simply involves calculating the MAUVE score between two **different** samples from the test dataset.

</details>