<a href="https://colab.research.google.com/gist/zredlined/798670a15869533851df13725d589e4e/bigframes-demo-2-creating-differentially-private-synthetic-text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Unlock Sensitive Text in BigQuery with Differentially Private Synthetic Text

Harness the power of sensitive text data for AI and analytics using [Gretel](https://gretel.ai)'s differentially private synthetic data, [Google BigQuery](https://cloud.google.com/bigquery), and the [BigFrames SDK](https://cloud.google.com/python/docs/reference/bigframes/latest).

## 🔍 In this Notebook:

1. Retrieve 30k sensitive clinical notes from BigQuery
2. Generate differentially private synthetic notes (ε = 5) using Gretel GPT
3. Evaluate synthetic data quality and utility
4. Store AI-ready synthetic data in BigQuery for downstream applications

## 💪 Why It Matters:

- **Robust Privacy**: Training with ε = 5 gives a formal differential privacy guarantee that bounds how much any single record can influence the model, while maintaining data utility
- **Efficiency at Scale**: High-quality results with just 30k records, versus millions typically required
- **Versatile Applications**: Safely use in healthcare, finance, customer support, and more
- **Unrestricted Usage**: Train ML models or perform analytics on synthetic records instead of the original sensitive notes
- **Potential to Outperform**: At scale, synthetic data can often exceed real data in ML tasks

Gretel's approach combines state-of-the-art LLMs with differential privacy, processing ~30k records in about 2 hours on a single GPU.

[Explore Gretel GPT](https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-gpt) | [Learn about DP Synthetic Text](https://gretel.ai/blog/generate-differentially-private-synthetic-text-with-gretel-gpt)
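
For intuition about what ε = 5 means: under (ε, δ)-differential privacy, adding or removing any single training record can change the probability of any outcome by at most a factor of e^ε, plus a small additive δ. The sketch below only evaluates that bound for a few ε values; the helper name `max_probability_ratio` is illustrative and is not part of the Gretel SDK.

```python
import math


def max_probability_ratio(epsilon: float) -> float:
    """Return e^epsilon, the multiplicative bound in the DP definition."""
    return math.exp(epsilon)


# Compare the bound for a few privacy budgets, including the one used here.
for eps in (0.1, 1, 5, 10):
    bound = max_probability_ratio(eps)
    print(f"epsilon = {eps:>4}: probability ratio bound = {bound:,.1f}")
```

A lower ε tightens the bound, typically at some cost in the quality of the generated text.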

In [None]:
%%capture
!pip install -Uqq "gretel-client>=0.22.0"

In [None]:
# Install bigframes if it's not already installed in the environment.

# %%capture
# !pip install bigframes

In [None]:
from gretel_client import Gretel
from gretel_client.bigquery import BigFrames

gretel = Gretel(api_key="prompt", validate=True, project_name="bigframes-dp")

# This is the core interface we will use moving forward!
gretel_bigframes = BigFrames(gretel)

In [None]:
import bigframes.pandas as bpd

BIGQUERY_PROJECT = "gretel-vertex-demo"

# Set BigFrames options
bpd.options.display.progress_bar = None
bpd.options.bigquery.project = BIGQUERY_PROJECT

In [None]:
# Define the source project and dataset
project_id = "gretel-public"
dataset_id = "public"
table_id = "clinical-notes"

# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Read the table into a DataFrame
df = bpd.read_gbq_table(table_path)

In [None]:
import textwrap

def print_dataset_statistics(data_source):
    """Print high level dataset statistics"""
    num_rows = data_source.shape[0]
    num_chars = data_source['text'].str.len().sum()

    print(f"\nNumber of rows: {num_rows}")
    print(f"Number of characters: {num_chars}")

def print_wrapped_text(text, width=128):
    """Print text wrapped to a specified width"""
    wrapped_text = textwrap.fill(text, width=width)
    print(wrapped_text)

print("Sample Clinical Note:\n")
print_wrapped_text(df.iloc[0]['text'])
print_dataset_statistics(df)


## 🧬 Generating Differentially Private Synthetic Text with Gretel GPT

Gretel GPT offers cutting-edge capabilities for generating high-quality, domain-specific synthetic text with differential privacy guarantees. Key features include:

- Achieves strong privacy protection with a differential privacy (DP) epsilon value of 5
- Maintains high semantic quality of generated text
- Requires significantly less input data compared to traditional approaches (10k+ records vs 1M+)
- Leverages pre-trained language models to enhance output quality
- Balances data utility with rigorous privacy protection

Gretel GPT enables the creation of synthetic text that captures the nuances of your specific domain while providing formal privacy guarantees. This approach is particularly valuable for regulated industries such as healthcare and finance, where data sensitivity is paramount.

With DP-SGD training optimizations and FlashAttention-2, Gretel GPT achieves 5x faster training and generation, completing the process in about 2 hours on a single GPU in Gretel Hybrid on GCP. This efficiency, combined with the ability to work with smaller datasets, makes it a practical option for organizations that want to use sensitive text data safely and effectively.

[Learn more about Gretel GPT and Differential Privacy](https://gretel.ai/blog/generate-differentially-private-synthetic-text-with-gretel-gpt)
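
To make the DP-SGD idea mentioned above concrete, here is a minimal, generic sketch of one DP-SGD gradient aggregation step: each example's gradient is clipped to a maximum L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clip norm is added, and the result is averaged. This is the textbook form of the technique in plain Python, not Gretel's implementation; the function name, parameters, and toy gradients are all assumptions made for illustration.

```python
import math
import random


def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]  # made-up per-example gradients
update = dp_sgd_step(toy_grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(update)
```

Clipping bounds how much any one record can move the model, and the noise masks what remains; together they are what allow a privacy budget such as ε = 5 to be accounted for during training.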

In [None]:
# Submit the fine-tuning job to Gretel

# Configuration for fine-tuning job
fine_tune_config = {
    "base_config": "natural-language",
    "job_label": "clinicalnotes_epsilon_5",
    "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "params": {
        "batch_size": 16,
        "steps": 2500,
        "weight_decay": 0.01,
        "warmup_steps": 100,
        "lr_scheduler": "linear",
        "learning_rate": 0.001,
        "max_tokens": 512,
    },
    "peft_params": {
        "lora_r": 8,
        "lora_alpha_over_r": 1,
    },
    "privacy_params": {
        "dp": True,
        "epsilon": 5,
        "delta": "auto"
    },
    "generate": {
        "num_records": 80,
        "temperature": 0.8,
        "maximum_text_length": 512
    }
}

# Submit the job and get the model ID
train_results = gretel_bigframes.submit_train(dataframe=df, **fine_tune_config)
model_id = train_results.model_id

### 🔄 Loading the Fine-tuned Model

To reload the training results later, for example after restarting the notebook kernel, fetch them by model ID:

```python
train_results = gretel_bigframes.fetch_train_job_results(model_id)
```

In [None]:
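# Attach to an existing training job by its model ID.
# The ID below comes from a previous run; replace it with your own `model_id`.
train_results = gretel_bigframes.fetch_train_job_results("66e9a333bff4baa0b71844ce")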

# train_results.wait_for_completion()

In [None]:
# Check the status of the training job

train_results.refresh()
train_results.job_status

In [None]:
# Display the full report within this notebook

train_results.report.display_in_notebook()

In [None]:
# Fetch the synthetically generated data

df_synth = train_results.fetch_report_synthetic_data()

print("Sample Synthetically Generated Clinical Notes:\n")
print_wrapped_text(df_synth.iloc[1]['text'])

In [None]:
# Write the synthetically generated data to your table in BQ
# NOTE: The BQ Dataset must already exist!

project_id = BIGQUERY_PROJECT
dataset_id = "syntheticdata"
table_id = "clinical-notes"

# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Uncomment the line below to write the synthetic data to the destination table in BigQuery.
# df_synth.to_gbq(table_path)

## 🌱 Preparing Seed Data for Conditional Generation

Seed data allows us to guide the synthetic data generation process. By providing partial information, we can:

- Generate context-specific synthetic records
- Explore various scenarios or patient profiles
- Ensure the generated data aligns with specific use cases or research questions

In this example, we're creating seed data with initial clinical contexts to demonstrate conditional generation.
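
When exploring many scenarios, seed prompts can be built from a template rather than written by hand. The sketch below is one possible approach in plain Python; the helper `build_seed_prompts`, the template wording, and the example profiles are hypothetical and chosen only for illustration.

```python
from itertools import product


def build_seed_prompts(ages, sexes, complaints):
    """Return a {"text": [...]} dict of partial notes for the model to complete."""
    template = "A {age}-year-old {sex} presented with {complaint}. "
    prompts = [
        template.format(age=age, sex=sex, complaint=complaint)
        for age, sex, complaint in product(ages, sexes, complaints)
    ]
    return {"text": prompts}


# Made-up profiles: every combination becomes one seed prompt.
seed_dict = build_seed_prompts(
    ages=[28, 73],
    sexes=["man", "woman"],
    complaints=["a left knee injury sustained while skiing"],
)
for prompt in seed_dict["text"]:
    print(prompt)
```

The resulting dict has a single `text` column, the same shape as the seed data in this notebook, so it could be passed to `bpd.DataFrame(...)` in the same way.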

In [None]:
import bigframes.pandas as bpd

# A dataframe with example clinical contexts to complete.
data = {
    "text": [
        "A 73-year-old man presented after a fall down 13 stairs at his home while intoxicated. His past medical history ",
        "A 28-year-old female presented to our clinic with a left knee injury that had occurred a few days before while skiing.",
    ]
}

seed_data = bpd.DataFrame(data)

## 🤖 Generate Additional Differentially Private Synthetic Data

Now that we have our fine-tuned model and seed data, we can generate more synthetic records. This process:

- Maintains the differential privacy guarantees of our original training
- Allows for flexible data generation based on different seeds or prompts
- Can be used to augment datasets or create specialized subsets for specific analyses

Remember, you can adjust parameters like `temperature` to control the creativity of the generated text.
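
For intuition on `temperature`: in language model sampling, the model's raw token scores are divided by the temperature before being turned into probabilities. The sketch below shows that effect on three made-up scores; it is a generic illustration of temperature scaling, not Gretel's internal code, and the function name is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature = {t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring token, giving more predictable text; higher temperatures flatten the distribution and produce more varied output.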

In [None]:
generate_results = gretel_bigframes.submit_generate(
    train_results.model_id,  # the fine-tuned model attached above
    seed_data=seed_data,
    temperature=0.8,
    maximum_text_length=512
)

In [None]:
generate_job_id = generate_results.record_id  # Save the generation job ID so the job can be restored later
generate_results.wait_for_completion()

# Restore the generation job if needed
# generate_results = gretel_bigframes.fetch_generate_job_results(train_results.model_id, generate_job_id)

In [None]:
# Inspect conditionally generated data

print("\n\nSample Clinical Notes:\n")
print_wrapped_text(data['text'][1] + generate_results.synthetic_data.iloc[1]['text'])