<a href="https://colab.research.google.com/gist/zredlined/b613e96c3b66b0f3d04648c15df16cb7/bigframes-demo-1-synthesizing-data-with-navigator-ft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Synthesize Private Data with Gretel, BigFrames, and BigQuery

This notebook demonstrates a powerful workflow for generating high-quality, privacy-safe synthetic data using [Gretel](https://gretel.ai)'s suite of tools in conjunction with [Google BigQuery](https://cloud.google.com/bigquery) and the [BigFrames SDK](https://cloud.google.com/python/docs/reference/bigframes/latest).

## 🔍 What We'll Do:

- Retrieve real-world data from BigQuery using BigFrames SDK
- De-identify sensitive information with Gretel Transform v2 (TV2)
- Generate AI-ready, privacy-safe synthetic data using Gretel Navigator Fine-Tuning
- Seamlessly work with large-scale datasets in BigQuery


## 💪 Why It Matters:
This integrated approach enables organizations to:

- Safely leverage sensitive data for AI and ML use cases
- Break down data silos, promoting broader data accessibility
- Unlock the potential of restricted datasets
- Accelerate innovation while maintaining privacy and compliance
- Scale data operations seamlessly across large datasets

This notebook goes beyond simple PII removal, which on its own can leave records open to re-identification through combinations of the remaining fields. Synthetic data generation instead creates new records that are not based on any single individual, which reduces exposure to re-identification and related privacy attacks.


Let's explore the power of privacy-preserving synthetic data generation! 🚀

[Learn more about Gretel Transform v2](https://docs.gretel.ai/create-synthetic-data/models/transform/v2) and [Gretel's Synthetic Data Generation](https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning)

In [None]:
%%capture
!pip install -Uqq "gretel-client>=0.22.0"

In [None]:
# Install bigframes if it's not already installed in the environment.

# %%capture
# !pip install bigframes

In [None]:
from gretel_client import Gretel
from gretel_client.bigquery import BigFrames

gretel = Gretel(api_key="prompt", validate=True, project_name="bigframes-demo")

# This is the core interface we will use moving forward!
gretel_bigframes = BigFrames(gretel)

In [None]:
import bigframes.pandas as bpd
import bigframes

BIGQUERY_PROJECT = "gretel-vertex-demo"

# Set BigFrames options
bpd.options.display.progress_bar = None
bpd.options.bigquery.project = BIGQUERY_PROJECT

In [None]:
# Define the source project and dataset
project_id = "gretel-public"
dataset_id = "public"
table_id = "sample-patient-events"

# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Read the table into a DataFrame
df = bpd.read_gbq_table(table_path)

# Display the DataFrame
df.peek()

## 🛡️ De-identifying and Processing Data with Gretel Transform v2

Before generating synthetic data, de-identifying personally identifiable information (PII) is a crucial first step towards data anonymization. Gretel's Transform v2 (TV2) provides a powerful and scalable framework for this and various other data processing tasks. TV2 combines advanced transformation techniques with named entity recognition (NER) capabilities, enabling efficient handling of large datasets. Beyond PII de-identification, TV2 can be used for data cleansing, formatting, and other preprocessing steps, making it a versatile tool in the data preparation pipeline. [Learn more about Gretel Transform v2](https://docs.gretel.ai/create-synthetic-data/models/transform/v2).

In [None]:
# De-identification configuration

transform_config = """
schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            update:
              - name: patient_id
                value: this | hash | truncate(10, end='')
              - name: first_name
                value: >
                  fake.first_name_female() if row.sex == 'Female' else
                  fake.first_name_male() if row.sex == 'Male' else
                  fake.first_name()
              - name: last_name
                value: fake.last_name()
"""
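
To make the three rules in this config concrete, here is a minimal local sketch of the same idea in plain Python. It is an illustration only, not how Transform v2 is implemented: the name pools are hypothetical stand-ins for the `fake` object, and plain SHA-256 is an assumption about what `hash` does, so the actual output of the service will differ.

```python
import hashlib
import random

# Hypothetical name pools standing in for the `fake` object used in the config.
FEMALE_NAMES = ["Alice", "Maria", "Priya"]
MALE_NAMES = ["James", "Omar", "Wei"]
LAST_NAMES = ["Nguyen", "Garcia", "Smith"]


def pseudonymize_id(value: str, length: int = 10) -> str:
    """Hash the identifier, then keep the first `length` hex characters."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return digest[:length]


def deidentify(row: dict, rng: random.Random) -> dict:
    """Return a copy of `row` with the three configured columns replaced."""
    out = dict(row)
    out["patient_id"] = pseudonymize_id(row["patient_id"])
    if row.get("sex") == "Female":
        out["first_name"] = rng.choice(FEMALE_NAMES)
    elif row.get("sex") == "Male":
        out["first_name"] = rng.choice(MALE_NAMES)
    else:
        out["first_name"] = rng.choice(FEMALE_NAMES + MALE_NAMES)
    out["last_name"] = rng.choice(LAST_NAMES)
    return out


rng = random.Random(0)
rows = [
    {"patient_id": "P-001", "first_name": "Jane", "last_name": "Doe", "sex": "Female"},
    {"patient_id": "P-001", "first_name": "Jane", "last_name": "Doe", "sex": "Female"},
]
deidentified = [deidentify(r, rng) for r in rows]
for r in deidentified:
    print(r)
```

Because hashing is deterministic in this sketch, both rows for the same patient receive the same pseudonymous `patient_id`, so records can still be grouped per patient after de-identification.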

In [None]:
# Submit a transform job against the BigFrames table

transform_results = gretel_bigframes.submit_transforms(transform_config, df)

In [None]:
# Print the Model ID. It can be reused later to restore these results.

model_id = transform_results.model_id

print(f"Gretel Model ID: {model_id}\n")

print(f"Gretel Console URL: {transform_results.model_url}")

In [None]:
# Restore an existing Transform model if needed

# model_id = "66db3d13e85d10df07c188c7"
# transform_results = gretel_bigframes.fetch_transforms_results(model_id)

In [None]:
transform_results.wait_for_completion()

In [None]:
transform_results.refresh()

In [None]:
# Take a look at the newly transformed BigFrames DataFrame

transform_results.transformed_df.head()

## 🧬 Generating Synthetic Data with Navigator Fine-Tuning

Gretel Navigator Fine-Tuning (Navigator-FT) generates high-quality, domain-specific synthetic data by fine-tuning pre-trained models on your datasets. Key features include:

- Handles multiple data modalities: numeric, categorical, free text, time series, and JSON
- Maintains complex relationships across data types and rows
- Can introduce meaningful new patterns, potentially improving ML/AI task performance
- Balances data utility with privacy protection

Navigator-FT builds on Gretel Navigator's capabilities, enabling the creation of synthetic data that captures the nuances of your specific domain while leveraging the strengths of pre-trained models. [Learn more](https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning).
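
One way to picture "relationships across rows" is to look at what grouping by `patient_id` and ordering by `event_date` means for the data, since the training call in this notebook passes exactly those two columns. The sketch below is a conceptual illustration in plain Python, not Gretel's internal preprocessing. The sample records and the `event` column are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical event records; only patient_id and event_date mirror the
# column names used in this notebook.
events = [
    {"patient_id": "b2", "event_date": "2023-03-01", "event": "discharge"},
    {"patient_id": "a1", "event_date": "2023-01-12", "event": "follow-up"},
    {"patient_id": "b2", "event_date": "2023-02-20", "event": "admission"},
    {"patient_id": "a1", "event_date": "2023-01-05", "event": "admission"},
]

# Sort by patient, then by date, so each patient's history is contiguous
# and chronological. ISO 8601 date strings sort correctly as text.
ordered = sorted(events, key=itemgetter("patient_id", "event_date"))

# Collect one sequence of events per patient.
examples = {
    patient: [e["event"] for e in group]
    for patient, group in groupby(ordered, key=itemgetter("patient_id"))
}
print(examples)
# {'a1': ['admission', 'follow-up'], 'b2': ['admission', 'discharge']}
```

Presenting each patient's events together and in time order is what allows a model to learn sequences within a patient rather than treating every row as independent.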

In [None]:
# Prepare the training configuration
base_config = "navigator-ft"     # Base configuration for training

# Define the generation parameters
generate_params = {
    "num_records": len(df),  # Number of records to generate
    "temperature": 0.7       # Temperature parameter for data generation
}

# Submit the training job to Gretel
train_results = gretel_bigframes.submit_train(
    base_config=base_config,
    dataframe=transform_results.transformed_df,
    job_label="synthetic_patient_data",
    generate=generate_params,
    group_training_examples_by="patient_id",  # Group training examples by patient_id
    order_training_examples_by="event_date"   # Order training examples by event_date
)
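
The `temperature` value in `generate_params` controls how adventurous sampling is. As a generic illustration of how temperature usually reshapes a sampling distribution in language models, the sketch below divides raw scores by the temperature before applying softmax. Whether Navigator-FT applies temperature in exactly this way is an assumption. The scores are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring candidate, while higher temperatures flatten the distribution and produce more varied output.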

In [None]:
# Inspect model metadata. The Model ID can be used to restore training results later.

print(f"Gretel Model ID: {train_results.model_id}\n")

print(f"Gretel Console URL: {train_results.model_url}\n")

In [None]:
train_results.wait_for_completion()
train_results.refresh()

In [None]:
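# Optionally restore training results from an existing Model ID.
# By default, keep using the model trained above.
model_id = train_results.model_id

# model_id = "66e87fb4e95431a2ba067bbf"
# train_results = gretel_bigframes.fetch_train_job_results(model_id)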

In [None]:
# Display the full report within this notebook
train_results.report.display_in_notebook()

In [None]:
# Fetch the synthetically generated data
df_synth = train_results.fetch_report_synthetic_data()
df_synth.head()

In [None]:
# Write the synthetically generated data to your table in BQ
# NOTE: The BQ Dataset must already exist!

project_id = BIGQUERY_PROJECT
dataset_id = "syntheticdata"
table_id = "patient-events"

# Construct the table path
table_path = f"{project_id}.{dataset_id}.{table_id}"

# Write to the destination table in BQ. Un-comment the next line to actually write.
# df_synth.to_gbq(table_path)
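
Since the destination dataset must exist before writing, one option is to create it up front if it is missing. The helper below is a minimal sketch: `ensure_dataset` and `RecordingClient` are hypothetical names introduced for illustration, and the stand-in client only exists so the snippet runs without credentials. The real call it mirrors is `create_dataset(..., exists_ok=True)` from the `google-cloud-bigquery` client.

```python
def ensure_dataset(client, project_id: str, dataset_id: str) -> str:
    """Create the dataset if it is missing and return its qualified name."""
    qualified = f"{project_id}.{dataset_id}"
    # exists_ok=True makes the call a no-op when the dataset already exists.
    client.create_dataset(qualified, exists_ok=True)
    return qualified


class RecordingClient:
    """Stand-in that records calls so the sketch runs without credentials."""

    def __init__(self):
        self.calls = []

    def create_dataset(self, dataset, exists_ok=False):
        self.calls.append((dataset, exists_ok))


demo_client = RecordingClient()
print(ensure_dataset(demo_client, "my-project", "syntheticdata"))

# With real credentials, the same helper would be called as:
#   from google.cloud import bigquery
#   client = bigquery.Client(project=BIGQUERY_PROJECT)
#   ensure_dataset(client, BIGQUERY_PROJECT, "syntheticdata")
```

Running this before writing avoids a failed write when the dataset is absent, at the cost of requiring permission to create datasets in the project.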

## ⚙️ Generate Additional Data

Given a trained synthetic model, you can now generate additional records.

In [None]:
generate_results = gretel_bigframes.submit_generate(model_id, num_records=100)

In [None]:
generate_job_id = generate_results.record_id

print(f"Generation Job ID: {generate_job_id}")

In [None]:
# Optionally restore a generation result object

# generate_job_id = "66db4e67ae94eef3abbcacf5"
# generate_results = gretel_bigframes.fetch_generate_job_results(train_results.model_id, generate_job_id)

In [None]:
generate_results.wait_for_completion()

In [None]:
generate_results.synthetic_data