# 🤫 Training Navigator FT with Differential Privacy

Here is a quick example of how to train Navigator FT with differential privacy (DP).

In [1]:
%%capture
!pip install datasets gretel-client


## 💾 Loading the Dataset

Let's first load a dataset. We use an e-commerce dataset that contains a free-text column alongside several numerical and categorical columns. To keep the example simple, we lightly preprocess the review text.

In [2]:
from datasets import load_dataset

ds = load_dataset("saattrupdan/womens-clothing-ecommerce-reviews")
df_train = ds["train"].to_pandas()

# For simplicity, we remove non-standard chars and truncate the review text
df_train["review_text"] = df_train["review_text"].str.replace(r'[^A-Za-z0-9 \.!?\']+', '', regex=True)
df_train["review_text"] = df_train["review_text"].str.slice(0, 128)

df_train.head()

|   | review_text | age | rating | positive_feedback_count | division_name | department_name | class_name | recommended_ind |
|---|---|---|---|---|---|---|---|---|
| 0 | I loved this shirt until the first time i wash... | 39 | 1 | 0 | General | Tops | Knits | 0 |
| 1 | This sweater was unflattering me very boxy and... | 44 | 3 | 0 | General | Tops | Sweaters | 0 |
| 2 | I fell in love with these bottoms at first sit... | 41 | 5 | 2 | General | Bottoms | Pants | 1 |
| 3 | I love the dress! i purchased this dress to w... | 34 | 5 | 0 | General | Dresses | Dresses | 1 |
| 4 | I fell in love with this dress when i saw it o... | 46 | 5 | 0 | General Petite | Dresses | Dresses | 1 |
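
To see what the cleanup step does to a single review, here is a minimal sketch that applies the same character filter and 128-character cap with the standard `re` module. The helper name `clean_review` and the sample string are made up for illustration; the notebook itself uses the pandas `.str` accessors shown above.

```python
import re

# Matches every run of characters that is NOT a letter, digit, space, or one of . ! ? '
DISALLOWED = re.compile(r"[^A-Za-z0-9 \.!?']+")

def clean_review(text: str, max_chars: int = 128) -> str:
    """Drop disallowed characters, then truncate to max_chars."""
    return DISALLOWED.sub("", text)[:max_chars]

sample = "Great fit (size M), but it's $40 too pricey!"  # made-up review text
print(clean_review(sample))
# prints: Great fit size M but it's 40 too pricey!
```

Note that the filter also removes commas and parentheses, so the cleaned text is slightly less natural than the original.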


## 🏃🏽‍♀️ Running Fine-Tuning

Navigator Fine Tuning uses a large language model to generate synthetic data from training datasets with numeric, categorical, and/or free-text columns.

Let us first run a job without DP, so we have a baseline. This will take around 12 minutes, so feel free to grab a coffee ☕

In [None]:
from gretel_client import Gretel

gretel = Gretel(api_key="prompt", project_name='navft-dp-sample', validate=True)

Found cached Gretel credentials
Using endpoint https://api-dev.gretel.cloud
Logged in as andre.manoel@gretel.ai ✅
Project URL: https://console-dev.gretel.ai/proj_2oqOccJYr2HzGYFXlcCTQxoxkhR


In [None]:
yaml_config_nodp = """
    schema_version: 1.0
    name: "navft-nodp"
    models:
    - navigator_ft:
        group_training_examples_by: null
        order_training_examples_by: null

        params:
            num_input_records_to_sample: auto

        generate:
            num_records: 1000
"""

nodp_model = gretel.submit_train(base_config=yaml_config_nodp, data_source=df_train)

Submitting NAVIGATOR FINE TUNING training job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning
Console URL: https://console-dev.gretel.ai/proj_2oqOccJYr2HzGYFXlcCTQxoxkhR/models/67375dc12176460c965fe1d6/activity
Model ID: 67375dc12176460c965fe1d6
Analyzing input data and checking for auto-params... 
<< 🧭 Navigator FT >> Preparing for training 
<< 🧭 Navigator FT >> Tokenizing records 
<< 🧭 Navigator FT >> Number of unique train records: 19608 
<< 🧭 Navigator FT >> Assembling examples from 127.5% of the input records 
<< 🧭 Navigator FT >> Training Example Statistics: 

╒════════╤═════════════════════╤══════════════════════╤═══════════════════════╕
│        │   Tokens per record │   Tokens per example │   Records per example │
╞════════╪═════════════════════╪══════════════════════╪═══════════════════════╡
│ min    │                  60 │                 1943 │                    21 │
├────────┼─────────────────────┼────────────────

Let us now do the same, but with DP. The DP setup differs from the baseline in a few settings; the config below spells out only some of them explicitly:
- `max_sequences_per_example: 1`: NavFT typically packs as many records as possible into a single example, which helps it capture correlations in the dataset. DP, however, needs to bound the impact of each record, and that requires exactly 1 record per example.
- `batch_size: 8`: Because NavFT normally fills up the context of the language model, the default batch size is 1. With a single record per example that is no longer the case, so we can try a larger batch size. If the value is too high, however, we might get an out-of-memory error.
- `privacy_params`: This is where all the DP-related parameters go. We enable DP with `dp: true`, set the privacy budget with `epsilon: 8`, and set the maximum per-sample gradient norm with `per_sample_max_grad_norm: 0.1`. The value of delta is set automatically to $n^{-1.2}$, where $n$ is the number of training records.
- `use_structured_generation: true`: With DP, the model may struggle to learn the tabular format, so structured generation helps produce more valid records.
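
To make the role of `privacy_params` more concrete, here is a minimal sketch of the generic DP-SGD recipe that such parameters usually refer to: clip each record's gradient to a maximum norm, add Gaussian noise to the sum, then average. It is an illustration only, not Gretel's implementation. All helper names and the `noise_multiplier` value are hypothetical (in practice a privacy accountant derives the noise level from epsilon and delta), and `auto_delta` assumes $n$ is the number of training records, which the training logs in this notebook report as 19608.

```python
import math
import random

def auto_delta(n_records: int) -> float:
    """delta = n^(-1.2), assuming n is the number of training records."""
    return n_records ** -1.2

def clip_to_norm(grad: list[float], max_norm: float) -> list[float]:
    """Scale one record's gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_average(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-record gradient, sum, add Gaussian noise, and average."""
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_to_norm(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / batch_size for x in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.03, 0.04], [-6.0, 8.0]]  # toy per-record gradients

clipped = [clip_to_norm(g, 0.1) for g in grads]
print([round(math.hypot(*g), 3) for g in clipped])  # every norm is at most 0.1

# noise_multiplier=1.0 is a placeholder, not a value derived from epsilon=8
print(dp_average(grads, max_norm=0.1, noise_multiplier=1.0, rng=rng))

print(auto_delta(19608))  # roughly 7e-6 for the record count in the logs
```

Clipping is what bounds the influence of any single record on an update, which is why each training example must contain exactly one record for the guarantee to apply at the record level.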

This will take a bit longer, around 25 minutes, so you can grab a couple more coffees ☕☕ (but please be careful with your caffeine intake - maybe a decaf?)

In [None]:
yaml_config_dp = """
    schema_version: 1.0
    name: "navft-dp"
    models:
    - navigator_ft:
        group_training_examples_by: null
        order_training_examples_by: null

        params:
            num_input_records_to_sample: auto
            batch_size: 8

        privacy_params:
            dp: true
            epsilon: 8

        generate:
            num_records: 1000
            use_structured_generation: true
"""

dp_model = gretel.submit_train(base_config=yaml_config_dp, data_source=df_train)

Submitting NAVIGATOR FINE TUNING training job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning
Console URL: https://console-dev.gretel.ai/proj_2oqOccJYr2HzGYFXlcCTQxoxkhR/models/673760b89adab2fa03095d99/activity
Model ID: 673760b89adab2fa03095d99
Analyzing input data and checking for auto-params... 
<< 🧭 Navigator FT >> Preparing for training 
<< 🧭 Navigator FT >> Tokenizing records 
<< 🧭 Navigator FT >> Number of unique train records: 19608 
<< 🧭 Navigator FT >> Assembling examples from 127.5% of the input records 
<< 🧭 Navigator FT >> Training Example Statistics: 

╒════════╤═════════════════════╤══════════════════════╤═══════════════════════╕
│        │   Tokens per record │   Tokens per example │   Records per example │
╞════════╪═════════════════════╪══════════════════════╪═══════════════════════╡
│ min    │                  60 │                  143 │                     1 │
├────────┼─────────────────────┼────────────────
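
The two training logs show the effect of packing: without DP the minimum is 21 records (1943 tokens) per example, while with DP it drops to 1 record (143 tokens) per example. The sketch below mimics that behavior with a simple greedy strategy under a token budget. The actual packing algorithm is not described in this notebook, so the function, its parameters, and the token counts are all hypothetical.

```python
def pack_records(token_counts, max_tokens_per_example, max_records_per_example=None):
    """Greedily group consecutive records into examples under a token budget.

    If max_records_per_example is set, an example is also closed once it
    holds that many records.
    """
    examples, current, current_tokens = [], [], 0
    for n_tokens in token_counts:
        full = (
            max_records_per_example is not None
            and len(current) >= max_records_per_example
        )
        over_budget = current_tokens + n_tokens > max_tokens_per_example
        if current and (full or over_budget):
            examples.append(current)
            current, current_tokens = [], 0
        current.append(n_tokens)
        current_tokens += n_tokens
    if current:
        examples.append(current)
    return examples

token_counts = [60, 75, 90, 64, 80, 71]  # hypothetical tokens per record

packed = pack_records(token_counts, max_tokens_per_example=256)
one_per_example = pack_records(
    token_counts, max_tokens_per_example=256, max_records_per_example=1
)

print([len(e) for e in packed])           # prints: [3, 3]
print([len(e) for e in one_per_example])  # prints: [1, 1, 1, 1, 1, 1]
```

With one record per example, each example uses only a small part of the context window, which is why a larger batch size becomes feasible in the DP run.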

## 📋 Comparing Results

Let's examine the results. In particular, we want to see how they compare in terms of:

- the **data privacy score (DPS)**, a measure of how hard it is to extract information about the original data from the synthetic data
- the **synthetic quality score (SQS)**, a measure of how close the synthetic data generated is to the original data

Typically, we should observe a slightly higher DPS for the differentially private model than for the non-differentially private one. Conversely, we'd expect a slightly higher SQS for the non-DP model. However, due to the stochastic nature of the algorithm, the scores may vary from run to run.

In [None]:
print("The DPS for the no-DP model is:", nodp_model.report.quality_scores["data_privacy_score"])
print("The DPS for the DP model is:", dp_model.report.quality_scores["data_privacy_score"])

The DPS for the no-DP model is: 88
The DPS for the DP model is: 87


In [6]:
print("The SQS for the no-DP model is:", nodp_model.report.quality_scores["synthetic_data_quality_score"])
print("The SQS for the DP model is:", dp_model.report.quality_scores["synthetic_data_quality_score"])

The SQS for the no-DP model is: 87
The SQS for the DP model is: 76
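
To summarize the trade-off at a glance, the sketch below computes the score gap between the two models using the values printed above. The `scores` dictionary and the `score_gap` helper are illustrative; in a live session the same numbers could be read from `nodp_model.report.quality_scores` and `dp_model.report.quality_scores`.

```python
# Scores copied from the outputs above (they will differ from run to run)
scores = {
    "no-DP": {"data_privacy_score": 88, "synthetic_data_quality_score": 87},
    "DP": {"data_privacy_score": 87, "synthetic_data_quality_score": 76},
}

def score_gap(scores, metric, baseline="no-DP", candidate="DP"):
    """Return candidate minus baseline for the given metric."""
    return scores[candidate][metric] - scores[baseline][metric]

for metric in ("data_privacy_score", "synthetic_data_quality_score"):
    print(f"{metric}: {score_gap(scores, metric):+d}")
# prints:
# data_privacy_score: -1
# synthetic_data_quality_score: -11
```

In this particular run the DP model gives up 11 SQS points, while the DPS is essentially unchanged, which falls within the run-to-run variation mentioned earlier.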
