<a target="_blank" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/evaluate/evaluate_with_pii_replay.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# PII Replay Notebook

## 💾 Install Gretel SDK

In [None]:
%%capture
!pip install -U gretel-client

## 🌐 Configure your Gretel Session

In [None]:
from gretel_client import Gretel

gretel = Gretel(api_key="prompt", validate=True, project_name="pii-replay-project")

## 🔬 Preview input data
The dataset is taken from Kaggle: https://www.kaggle.com/datasets/ravindrasinghrana/employeedataset

In [None]:
import pandas as pd

datasource = "https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/kaggle/employee_data.csv"
df = pd.read_csv(datasource)
test_df = None

# Drop columns to simplify example
df = df.drop(columns=["Supervisor", "BusinessUnit", "EmployeeType", "PayZone", "EmployeeClassificationType", "TerminationType", "TerminationDescription", "DepartmentType", "JobFunctionDescription", "DOB", "LocationCode", "RaceDesc", "MaritalDesc"])

df.head()

## ✂ Split train and test
To run [Membership Inference Protection](https://docs.gretel.ai/optimize-synthetic-data/evaluate/synthetic-data-quality-report#membership-inference-protection) in Evaluate, we hold out a test set (`test_df`) that is kept separate from the training data (`train_df`):

In [None]:
# Shuffle the dataset randomly to ensure a random test set
# Set random_state to ensure reproducibility
shuffled_df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into test (5% holdout) and train
split_index = int(len(shuffled_df) * 0.05)
test_df = shuffled_df.iloc[:split_index]
train_df = shuffled_df.iloc[split_index:]
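
The holdout is only useful if the test rows never reach the model during training, so it can be worth sanity-checking the split. The sketch below mirrors the same shuffle-and-slice logic on a plain Python list so that it runs without the dataset. `holdout_split` and the integer IDs are hypothetical stand-ins, not part of pandas or the Gretel SDK.

```python
import random


def holdout_split(rows, test_frac=0.05, seed=42):
    """Shuffle rows reproducibly, then slice off the first test_frac as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    split_index = int(len(shuffled) * test_frac)
    return shuffled[split_index:], shuffled[:split_index]


# Hypothetical stand-in for employee rows; the notebook itself uses DataFrame rows.
employee_ids = list(range(3000))
train_rows, test_rows = holdout_split(employee_ids)

# Every row should land in exactly one of the two sets.
assert len(train_rows) + len(test_rows) == len(employee_ids)
assert set(train_rows).isdisjoint(test_rows)
print(len(train_rows), len(test_rows))
```

On the notebook's DataFrames, the equivalent checks would be `len(train_df) + len(test_df) == len(df)` and `train_df.index.intersection(test_df.index).empty`.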

## 🏋️‍♂️ Train a generative model

- The [navigator-ft](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/navigator-ft.yml) base config tells Gretel we want to train with **Navigator Fine Tuning** using its default parameters.

- **Navigator Fine Tuning** is an LLM under the hood. Before training begins, information about how the input data was tokenized and assembled into examples will be logged in the cell output (as well as in Gretel's Console).

- Generation of a dataset for evaluation will begin immediately after the model completes training. The rate at which the model produces valid records will be logged to help assess how well the model is performing.

In [None]:
nav_ft_trained = gretel.submit_train("navigator-ft", data_source=train_df, evaluate={"skip": True}, generate={"num_records": 1000})
nav_ft_result = nav_ft_trained.fetch_report_synthetic_data()

## 🟰 Evaluate PII Replay for Model result without Transform

In [None]:
EVALUATE_CONFIG = """
schema_version: "1.0"

name: "evaluate-config"
models:
  - evaluate:
      data_source: "__tmp__"
      pii_replay:
        skip: false
        entities: ["first_name","last_name","email","state"]
"""
evaluate_report = gretel.submit_evaluate(EVALUATE_CONFIG, data_source=nav_ft_result, ref_data=train_df, test_data=test_df).evaluate_report
evaluate_report.display_in_notebook()
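
To build intuition for what the report measures, PII replay can be pictured as counting how many distinct training values of an entity reappear verbatim in the synthetic data. The sketch below is a simplified exact-match illustration under that assumption, not Gretel's implementation, which may detect and match entities differently. `replay_summary` and the sample names are made up for the example.

```python
def replay_summary(train_values, synthetic_values):
    """Count distinct training values that reappear verbatim in the synthetic data."""
    train_unique = set(train_values)
    replayed = train_unique & set(synthetic_values)
    percent = 100.0 * len(replayed) / len(train_unique) if train_unique else 0.0
    return {
        "unique_in_training": len(train_unique),
        "unique_replayed": len(replayed),
        "percent_replayed": percent,
    }


# Made-up values for illustration only; they are not taken from the employee dataset.
train_first_names = ["Ana", "Ben", "Chloe", "Dev", "Ana"]
synthetic_first_names = ["Ana", "Eli", "Dev", "Fay"]

summary = replay_summary(train_first_names, synthetic_first_names)
print(summary)
```

Here two of the four distinct training names ("Ana" and "Dev") show up in the synthetic list, so the sketch reports 50% replay for this entity.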

## 🔀 Define Transform Configuration and Train Transform Model

- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.

- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console.

In [None]:
TRANSFORM_CONFIG = """
schema_version: "1.0"
name: transform-config
models:
  - transform_v2:
      globals:
        locales:
          - en_US
        classify:
          enable: true
          entities:
            - first_name
            - last_name
            - email
            - state
          auto_add_entities: true
          num_samples: 3
      steps:
        - rows:
            update:
              - name: FirstName
                value: fake.first_name_male() if row['GenderCode'] == 'Male' else
                  fake.first_name_female()
              - name: LastName
                value: fake.last_name()
              - name: ADEmail
                value: row["FirstName"] + "." + row["LastName"] + "@bilearner.com"
              - name: State
                value: fake.state_abbr()
"""
transform_result = gretel.submit_transform(TRANSFORM_CONFIG, data_source=train_df).transformed_df
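
The row updates in the config can be read as a small per-row function. The sketch below restates them in plain Python, assuming the updates are applied top to bottom and that each expression sees the values written by the earlier ones (which is why `ADEmail` would end up matching the replaced names). It does not call Faker or Gretel; the value pools and `transform_row` are hypothetical stand-ins.

```python
import random

# Hypothetical value pools standing in for Faker's en_US providers.
MALE_NAMES = ["James", "Robert", "Michael"]
FEMALE_NAMES = ["Mary", "Linda", "Patricia"]
LAST_NAMES = ["Smith", "Johnson", "Garcia"]
STATE_ABBRS = ["CA", "NY", "TX", "WA"]


def transform_row(row, rng):
    """Apply the four updates in listed order, assuming each sees earlier results."""
    row = dict(row)  # copy so the input row is left untouched
    pool = MALE_NAMES if row["GenderCode"] == "Male" else FEMALE_NAMES
    row["FirstName"] = rng.choice(pool)
    row["LastName"] = rng.choice(LAST_NAMES)
    # Built from the already-replaced names, so the email stays consistent with them.
    row["ADEmail"] = row["FirstName"] + "." + row["LastName"] + "@bilearner.com"
    row["State"] = rng.choice(STATE_ABBRS)
    return row


# Made-up input record for illustration.
original = {
    "FirstName": "Jo",
    "LastName": "Doe",
    "ADEmail": "jo.doe@bilearner.com",
    "GenderCode": "Female",
    "State": "MA",
}
transformed = transform_row(original, random.Random(0))
print(transformed)
```

Columns that the config does not list, such as `GenderCode`, pass through unchanged in this sketch.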

## 🏋️‍♂️ Train a generative model on the transformed data

In [None]:
tr_nav_ft_trained = gretel.submit_train("navigator-ft", data_source=transform_result, evaluate={"skip": True}, generate={"num_records": 1000})
tr_nav_ft_result = tr_nav_ft_trained.fetch_report_synthetic_data()

## 🟰 Evaluate PII Replay for Transform + Model result
In general, running Transform before Synthetics should decrease PII replay. To see this, compare the report below with the earlier report for Synthetics without Transform. Because training and generation are stochastic, the exact numbers can vary from run to run.

In many cases we should not expect, or even want, a PII Replay of 0 for every entity, even when running Transform first.

Consider each column in the context of both the dataset and the real world. Rarer entities, such as a full address or full name, should generally show less PII replay than common ones, such as a first name or US state.

In [None]:
evaluate_report = gretel.submit_evaluate(EVALUATE_CONFIG, data_source=tr_nav_ft_result, ref_data=train_df, test_data=test_df).evaluate_report
evaluate_report.display_in_notebook()
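
The point about common versus rare entities can be illustrated with a small simulation. In the sketch below, the "training" and "synthetic" columns are drawn independently from the same domain, so any overlap comes from chance alone rather than from memorization. The domains, sizes, and the `percent_replayed` helper are assumptions chosen for illustration and are unrelated to the employee dataset or to Gretel's scoring.

```python
import random

rng = random.Random(0)


def percent_replayed(train_values, synthetic_values):
    """Percentage of distinct training values that also occur in the synthetic values."""
    train_unique = set(train_values)
    return 100.0 * len(train_unique & set(synthetic_values)) / len(train_unique)


def random_email():
    """Hypothetical email-like identifier drawn from a domain of one million values."""
    return f"user{rng.randrange(1_000_000)}@example.com"


# Hypothetical small domain: 50 state-like codes.
states = [f"S{i:02d}" for i in range(50)]

# Both columns are sampled independently; neither is copied from the other.
train_states = [rng.choice(states) for _ in range(200)]
synth_states = [rng.choice(states) for _ in range(1000)]
train_emails = [random_email() for _ in range(200)]
synth_emails = [random_email() for _ in range(1000)]

state_pct = percent_replayed(train_states, synth_states)
email_pct = percent_replayed(train_emails, synth_emails)
print(f"state-like replay: {state_pct:.1f}%  email-like replay: {email_pct:.1f}%")
```

With only 50 possible values, nearly every training state reappears by coincidence, while the email-like column shows almost no overlap. A nonzero replay score for a low-cardinality entity is therefore not, by itself, evidence of leakage.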