# Getting Started: Transforming Data with Gretel Transform v2 🚀

Welcome to this hands-on guide to [Gretel Transform v2](https://docs.gretel.ai/create-synthetic-data/models/transform/v2), a tool for detecting and transforming entities in both structured and unstructured datasets. This notebook walks you through the process step by step, helping you:

* Configure and run a model to detect and process sensitive entities.
* Replace detected entities by faking, hashing, or applying custom transformations.

Let’s get started! 🎉

## Step 1: Install Dependencies
First, let's install the `gretel_client` package to interact with Gretel's API.

In [None]:
!pip install -Uqq gretel_client

## Step 2: Set Up Gretel Client
Log in to Gretel and create or load a project. Get a free API key at https://console.gretel.ai/users/me/key

In [None]:
from gretel_client import Gretel

gretel = Gretel(
    project_name="redact-pii",
    api_key="prompt",
    validate=True,
)
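If you would rather not paste the key at a prompt on every run, one option is to read it from an environment variable and fall back to the interactive prompt only when it is missing. The sketch below is a minimal illustration: the helper `resolve_api_key` is hypothetical, and the variable name `GRETEL_API_KEY` is an assumption you can change to whatever your environment uses.

```python
import os


def resolve_api_key(env_var: str = "GRETEL_API_KEY") -> str:
    """Return the key stored in `env_var`, or "prompt" to ask for it interactively.

    Both the helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    return key if key else "prompt"


api_key = resolve_api_key()
# The result can then be passed to the client, e.g. Gretel(api_key=api_key, ...)
```

Keeping the key out of the notebook itself also makes it safer to share or commit the file.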

## Step 3: Load the Dataset
We'll load a sample dataset containing personally identifiable information (PII). Update the link to load your dataset of choice.

Let's review the first few rows of the dataset below.

In [None]:
import pandas as pd

df = pd.read_csv('https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/gretel_generated_table_simpsons_pii.csv')
df.head(5)
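Beyond `head()`, a quick per-column summary can help you spot empty or unexpectedly typed columns before any transformation runs. The sketch below uses a small stand-in DataFrame so it runs on its own. The helper `summarize_columns` and the sample column names are hypothetical, and in the notebook you would call it on `df` instead.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # distinct non-missing values
    })


# Stand-in data for illustration only
sample = pd.DataFrame({
    "first_name": ["Homer", "Marge", None],
    "email": ["homer@example.com", "marge@example.com", "bart@example.com"],
})
print(summarize_columns(sample))
```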

## Step 4: Configure and Run the Model

Let’s set up a **Transform v2** model to detect and anonymize entities in the dataset. Each detected entity is replaced with a fake value when a matching Faker function exists and hashed otherwise. The configuration is written in YAML, and the model and its outputs are stored in the Gretel project from Step 2.

Learn more in the docs at: https://docs.gretel.ai/create-synthetic-data/models/transform/v2/reference
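To make the fake-or-hash idea concrete, here is a minimal local sketch of the fallback logic in plain Python. It is not Gretel's implementation: the generator table, the helper names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Stand-in generators keyed by entity type (assumption: a real run would use Faker).
FAKE_GENERATORS = {
    "first_name": lambda: "Alex",
    "email": lambda: "alex@example.com",
}


def hash_fallback(value: str, length: int = 9) -> str:
    """Hash the value with SHA-256 (an assumed algorithm) and keep `length` hex chars."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def transform_value(value: str, entity: str) -> str:
    """Fake the value when a generator exists for its entity, otherwise hash it."""
    generator = FAKE_GENERATORS.get(entity)
    return generator() if generator is not None else hash_fallback(value)


print(transform_value("Homer", "first_name"))   # handled by the stand-in generator
print(transform_value("123-45-6789", "us_ssn"))  # no generator, so it is hashed
```

Hashing is deterministic, so the same input always maps to the same token, whereas the faked values carry no trace of the original.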

In [None]:
# De-identification configuration
config = """
schema_version: "1.0"
name: "Replace PII"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities:
            - first_name
            - last_name
            - email
            - phone_number
            - street_address
          num_samples: 100
      steps:
        - rows:
            update:
              # Detect and replace values in PII columns, hash if no Faker available
              - condition: column.entity is in globals.classify.entities
                value: column.entity | fake
                fallback_value: this | hash | truncate(9,true,"")

              # Detect and replace entities within free text columns
              - type: text
                value: this | fake_entities(on_error="hash")

              # Replace email addresses with first + last name to retain correlations
              - name: email_address
                value: 'row.first_name + "." + row.last_name + "@" + fake.free_email_domain()'
"""

transform_result = gretel.submit_transform(
    config=config,
    data_source=df,
    job_label="Transform PII data"
)

transformed_df = transform_result.transformed_df
transformed_df.head()
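The last rule in the configuration rebuilds the email address from the row's first and last name so that the columns stay consistent with each other. A rough local equivalent in plain Python might look like the sketch below. The helper `rebuild_email` and the domain list are hypothetical stand-ins for `fake.free_email_domain()`.

```python
import random

# Hypothetical stand-ins for the domains a Faker provider would return
FREE_EMAIL_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com"]


def rebuild_email(row: dict, rng: random.Random) -> str:
    """Compose an address from the row's first and last name plus a random domain."""
    domain = rng.choice(FREE_EMAIL_DOMAINS)
    return f'{row["first_name"]}.{row["last_name"]}@{domain}'


rng = random.Random(0)  # seeded so repeated runs give the same result
row = {
    "first_name": "Alex",
    "last_name": "Rivera",
    "email_address": "someone@old-domain.example",
}
row["email_address"] = rebuild_email(row, rng)
print(row["email_address"])
```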

In [None]:
import pandas as pd

def highlight_detected_entities(report_dict):
    """
    Process the report dictionary, extract columns with detected entities,
    and highlight cells with non-empty entity labels.

    Args:
        report_dict (dict): The report dictionary from transform_result.report.as_dict.

    Returns:
        pd.io.formats.style.Styler: Highlighted DataFrame.
    """
    # Parse the columns and extract 'Detected Entities'
    columns_data = report_dict['columns']
    df = pd.DataFrame([
        {
            'Column Name': col['name'],
            'Detected Entities': ', '.join(
                entity['label'] for entity in col['entities'] if entity['label']
            )
        }
        for col in columns_data
    ])

    # Highlighting logic
    def highlight_entities(s):
        return ['background-color: lightgreen' if len(val) > 0 else '' for val in s]

    # Apply highlighting
    return df.style.apply(highlight_entities, subset=['Detected Entities'], axis=1)


# Pass the report dictionary directly, as the docstring above describes
highlight_detected_entities(transform_result.report.as_dict)

Nice! We successfully de-identified both column-level PII entities and PII entities within unstructured free text using this default configuration.
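If you want the detection results as plain data rather than a styled table, for example to assert in a test that a given column was classified, a small helper can flatten the report. This sketch assumes the report has the same shape the highlighting function above reads (`columns`, each with a `name` and a list of `entities` carrying a `label`). The helper name and the sample report are hypothetical.

```python
def entities_by_column(report: dict) -> dict:
    """Map each column name to the list of non-empty entity labels detected in it."""
    return {
        col["name"]: [e["label"] for e in col.get("entities", []) if e.get("label")]
        for col in report.get("columns", [])
    }


# Hypothetical report fragment, for illustration only
sample_report = {
    "columns": [
        {"name": "first_name", "entities": [{"label": "first_name"}]},
        {"name": "notes", "entities": [{"label": "email"}, {"label": "phone_number"}]},
        {"name": "id", "entities": []},
    ]
}
print(entities_by_column(sample_report))
```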

## Summary
Finally, we'll do a side-by-side comparison of the first row of data before and after transformation.

In [None]:
# Preview the differences of the first row of real vs transformed data
pd.set_option('display.max_colwidth', None)

first_row_df1 = df.iloc[0].to_frame('Original')
first_row_df2 = transformed_df.iloc[0].to_frame('Transformed')

# Join the transposed rows
comparison_df = first_row_df1.join(first_row_df2)

def highlight_differences(row):
    is_different = row['Original'] != row['Transformed']
    color = 'background-color: lightgreen' if is_different else ''
    return ['', f'{color}; min-width: 500px']

styled_df = comparison_df.style.apply(highlight_differences, axis=1).format(escape="html")
styled_df
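One caveat with the `!=` check above: a missing value (NaN) never equals itself, so a field that is empty in both the original and the transformed row would be highlighted as changed. The sketch below shows one way to list only the fields that really differ. The helper `changed_fields` and the sample rows are hypothetical.

```python
import pandas as pd


def changed_fields(original: pd.Series, transformed: pd.Series) -> list:
    """Return the labels whose values differ, treating two missing values as equal."""
    both_missing = original.isna() & transformed.isna()
    differs = (original != transformed) & ~both_missing
    return differs[differs].index.tolist()


# Stand-in rows for illustration only
original_row = pd.Series({"first_name": "Homer", "middle_name": None, "age": 42})
transformed_row = pd.Series({"first_name": "Alex", "middle_name": None, "age": 42})
print(changed_fields(original_row, transformed_row))
```

In the notebook, the same function could be applied to `df.iloc[0]` and `transformed_df.iloc[0]`, assuming both rows share the same column labels.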