<a target="_blank" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/navft_dp_sample.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🤫 Training Navigator Fine Tuning with Differential Privacy

Here is a quick example of how to train Navigator Fine Tuning with differential privacy (DP).

In [None]:
%%capture
!pip install datasets gretel-client


## 💾 Loading the Dataset

Let's first load a dataset. We use an e-commerce dataset that contains both a free-text column and some numerical/categorical columns. We preprocess the dataset for simplicity.

In [None]:
from datasets import load_dataset

ds = load_dataset("saattrupdan/womens-clothing-ecommerce-reviews")
df_train = ds["train"].to_pandas()

# For simplicity, we remove non-standard chars and truncate the review text
df_train["review_text"] = df_train["review_text"].str.replace(r'[^A-Za-z0-9 \.!?\']+', '', regex=True)
df_train["review_text"] = df_train["review_text"].str.slice(0, 128)

df_train.head()
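
To make the cleaning step concrete, here is a small standalone sketch that applies the same regular expression and 128-character truncation to a single string using Python's `re` module. The helper name `clean_review` and the sample sentence are made up for illustration and are not part of the dataset or of any library.

```python
import re

# Same character whitelist as in the cell above: letters, digits, space, . ! ? and '
NON_STANDARD = re.compile(r"[^A-Za-z0-9 \.!?\']+")

def clean_review(text: str, max_len: int = 128) -> str:
    """Drop non-whitelisted characters, then keep the first max_len characters."""
    return NON_STANDARD.sub("", text)[:max_len]

# Hypothetical review text, used only to show the effect of the cleaning.
sample = "Love it!! So soft & comfy (runs small), 5/5."
print(clean_review(sample))
```

Removed characters are not replaced by a space, so neighboring tokens can merge (for example `5/5` becomes `55`).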

## 🏃🏽‍♀️ Running Fine-Tuning

Navigator Fine Tuning uses a large language model to generate synthetic output from training datasets with numeric, categorical, and/or free-text columns.

Let us first run a job without DP, so we have a baseline. This will take around 12 minutes, so feel free to grab a coffee ☕

In [None]:
from gretel_client import Gretel

gretel = Gretel(api_key="prompt", project_name='navft-dp-sample', validate=True)

In [None]:
yaml_config_nodp = """
    schema_version: 1.0
    name: "navft-nodp"
    models:
    - navigator_ft:
        group_training_examples_by: null
        order_training_examples_by: null

        params:
            num_input_records_to_sample: auto

        generate:
            num_records: 1000
"""

nodp_model = gretel.submit_train(base_config=yaml_config_nodp, data_source=df_train)

To enable DP, we'll specify `privacy_params`.

- `dp: true` activates fine tuning with DP
- `epsilon` is the privacy loss parameter. Smaller epsilon values provide stronger guarantees that there will not be leakage of training data.
- `delta` bounds the probability that the epsilon guarantee fails to hold, that is, the probability of accidentally leaking information. By default, delta is set automatically from the characteristics of your dataset to be at most 1/n^1.2, where n is the number of training records.
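
To get a feel for the size of the default delta, the sketch below evaluates the 1/n^1.2 upper bound for a few record counts. The function name `default_delta_upper_bound` is hypothetical, and the value Gretel actually selects may be smaller than this bound.

```python
def default_delta_upper_bound(n_records: int) -> float:
    """Upper bound 1 / n^1.2 on delta for n training records (illustrative helper)."""
    if n_records < 1:
        raise ValueError("n_records must be a positive integer")
    return 1.0 / n_records ** 1.2

# Example record counts chosen only for illustration.
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: delta <= {default_delta_upper_bound(n):.2e}")
```

Because the exponent is larger than 1, the bound shrinks faster than 1/n as the dataset grows.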

Additionally, we adjust the standard parameters slightly to account for the incorporation of differential privacy.

- `batch_size: 8`
    - While the default Navigator Fine Tuning batch size is `1`, we increase it to `8` or `16` when training with DP. A larger batch size is common practice with DP because the standard deviation of the noise added to the average batch gradient decreases almost linearly as the batch size grows. Note that if this value is too high, out-of-memory errors may occur.
- `use_structured_generation: true`
    - Structured generation allows us to utilize the schema of the dataset to enforce structure in the outputs by manipulating output logits.
        - Note that this assumes that the schema of the table, including numerical ranges and categories, is not private. If the schema is considered private, please set `use_structured_generation` to `false`.
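
To illustrate the batch size argument, here is a minimal sketch that assumes the standard DP-SGD recipe: clip each per-example gradient to norm `clip_norm`, sum the clipped gradients, add Gaussian noise with standard deviation `noise_multiplier * clip_norm`, and divide by the batch size. The function and parameter names are illustrative, not Gretel's API, and the actual training loop may differ.

```python
def noise_std_on_mean_gradient(noise_multiplier: float,
                               clip_norm: float,
                               batch_size: int) -> float:
    """Std of the Gaussian noise on the averaged batch gradient under DP-SGD.

    Noise with std noise_multiplier * clip_norm is added to the SUM of clipped
    per-example gradients, so averaging divides that std by the batch size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return noise_multiplier * clip_norm / batch_size

# Assumed values: the noise multiplier and clipping norm are held fixed here.
for batch_size in (1, 8, 16):
    std = noise_std_on_mean_gradient(noise_multiplier=1.0, clip_norm=1.0,
                                     batch_size=batch_size)
    print(f"batch_size = {batch_size:>2}: noise std on mean gradient = {std:.4f}")
```

With the noise multiplier held fixed, the decrease is exactly linear. In practice it is only almost linear, since for a fixed epsilon the required noise multiplier typically changes with the batch size as well.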

This will take a bit longer, around 25 minutes, so you can grab a couple more coffees ☕☕ (but please be careful with your caffeine intake - maybe a decaf?)

In [None]:
yaml_config_dp = """
    schema_version: 1.0
    name: "navft-dp"
    models:
    - navigator_ft:
        group_training_examples_by: null
        order_training_examples_by: null

        params:
            num_input_records_to_sample: auto
            batch_size: 8

        privacy_params:
            dp: true
            epsilon: 8

        generate:
            num_records: 1000
            use_structured_generation: true
"""

dp_model = gretel.submit_train(base_config=yaml_config_dp, data_source=df_train)
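
The two configs differ only in `batch_size`, `privacy_params`, and `use_structured_generation`. As an optional sketch, a small helper can generate both from one template so the differences are explicit. The helper `make_navft_config` and its arguments are hypothetical; it only emits the keys already used in the configs above.

```python
def make_navft_config(name, epsilon=None, batch_size=None, structured=False):
    """Build a Navigator Fine Tuning YAML config string (illustrative helper).

    Passing epsilon enables DP; batch_size and structured are optional overrides.
    """
    lines = [
        "schema_version: 1.0",
        f'name: "{name}"',
        "models:",
        "- navigator_ft:",
        "    group_training_examples_by: null",
        "    order_training_examples_by: null",
        "    params:",
        "        num_input_records_to_sample: auto",
    ]
    if batch_size is not None:
        lines.append(f"        batch_size: {batch_size}")
    if epsilon is not None:
        lines += [
            "    privacy_params:",
            "        dp: true",
            f"        epsilon: {epsilon}",
        ]
    lines += ["    generate:", "        num_records: 1000"]
    if structured:
        lines.append("        use_structured_generation: true")
    return "\n".join(lines) + "\n"

print(make_navft_config("navft-dp", epsilon=8, batch_size=8, structured=True))
```

The resulting string could then be passed as `base_config` in the same way as `yaml_config_dp` above.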

## 📋 Comparing Results

Let's examine the results. In particular, we want to see how they compare in terms of:

- the **data privacy score (DPS)**, a measure of how hard it is to extract information about the original data from the synthetic data
- the **synthetic quality score (SQS)**, a measure of how close the synthetic data generated is to the original data

Typically, we should observe a slightly higher DPS for the differentially private model than for the non-DP one. Conversely, we'd expect a slightly higher SQS for the non-DP model. However, because training and generation are stochastic, the scores can vary from run to run.

In [None]:
print("The DPS for the no-DP model is:", nodp_model.report.quality_scores["data_privacy_score"])
print("The DPS for the DP model is:", dp_model.report.quality_scores["data_privacy_score"])

In [None]:
print("The SQS for the no-DP model is:", nodp_model.report.quality_scores["synthetic_data_quality_score"])
print("The SQS for the DP model is:", dp_model.report.quality_scores["synthetic_data_quality_score"])
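
To view both scores side by side, a small helper can collect them into one structure. This is a sketch: `compare_scores` is a hypothetical function, and the numbers in the example are made-up placeholders rather than real results. It assumes `quality_scores` behaves like a dictionary with the two keys used above.

```python
SCORE_KEYS = ("data_privacy_score", "synthetic_data_quality_score")

def compare_scores(nodp_scores, dp_scores):
    """Return per-score values for both models and the DP minus no-DP difference."""
    return {
        key: {
            "no_dp": nodp_scores[key],
            "dp": dp_scores[key],
            "difference": dp_scores[key] - nodp_scores[key],
        }
        for key in SCORE_KEYS
    }

# Placeholder values for illustration only. With trained models, pass
# nodp_model.report.quality_scores and dp_model.report.quality_scores instead.
example = compare_scores(
    {"data_privacy_score": 80, "synthetic_data_quality_score": 90},
    {"data_privacy_score": 85, "synthetic_data_quality_score": 84},
)
for key, row in example.items():
    print(key, row)
```

A positive difference means the DP model scored higher on that metric.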