# Data Transformation with Gretel [MAKE COPY]
`May 2024`

This notebook demonstrates how to use [Gretel Transform v2](https://docs.gretel.ai/create-synthetic-data/models/transform/v2) to detect and transform entities in both structured and unstructured data. You will learn how to:
- Load and inspect a dataset
- Configure and run a model to detect entities
- Transform detected entities with synthetic data

Please make your own copy of the notebook before proceeding.

## Step 1: Install Dependencies
First, let's install the `gretel_client` package to interact with Gretel's API.

In [31]:
!pip install -Uqq gretel_client

## Step 2: Set Up Gretel Client
Enter your Gretel API key and endpoint in the form below, then click the button to authenticate.

In [32]:
import ipywidgets as widgets
from IPython.display import display

from gretel_client import Gretel

# Create input forms for API key and endpoint
api_key = widgets.Password(description="API Key:")
endpoint = widgets.Text(description="Endpoint:", value="https://api.gretel.cloud")
display(api_key, endpoint)

# Function to set up Gretel client
def setup_gretel(api_key, endpoint):
    return Gretel(api_key=api_key.value, endpoint=endpoint.value, validate=True)

# Button to submit API key and endpoint
button = widgets.Button(description="Set Up Gretel Client")
output = widgets.Output()
display(button, output)

def on_button_clicked(b):
    # Expose the client to later cells instead of discarding it as a local
    global gretel
    with output:
        output.clear_output()
        gretel = setup_gretel(api_key, endpoint)
        print("Gretel client set up successfully!")

button.on_click(on_button_clicked)

Password(description='API Key:')

Text(value='https://api.gretel.cloud', description='Endpoint:')

Button(description='Set Up Gretel Client', style=ButtonStyle())

Output()
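
When the notebook runs non-interactively (for example in a scheduled job), the widgets above are not available. A minimal sketch of an alternative, assuming the API key is stored in an environment variable. The variable names `GRETEL_API_KEY` and `GRETEL_ENDPOINT` and the helper `credentials_from_env` are choices made for this example, not something the notebook defines:

```python
import os

DEFAULT_ENDPOINT = "https://api.gretel.cloud"  # same default as the form above


def credentials_from_env(env=None):
    """Return keyword arguments for Gretel(...) read from environment variables.

    GRETEL_API_KEY and GRETEL_ENDPOINT are names assumed for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("GRETEL_API_KEY", "").strip()
    if not api_key:
        raise ValueError("GRETEL_API_KEY is not set")
    endpoint = env.get("GRETEL_ENDPOINT", DEFAULT_ENDPOINT).strip()
    return {"api_key": api_key, "endpoint": endpoint, "validate": True}


# Usage (requires a valid key and network access):
# from gretel_client import Gretel
# gretel = Gretel(**credentials_from_env())
```

Keeping the key out of the notebook source this way also avoids committing it by accident.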

## Step 3: Load the Dataset
We'll load a sample dataset containing personally identifiable information (PII). Update the URL to load your dataset of choice.

Let's review the first few rows of the dataset below.

In [33]:
import pandas as pd

df = pd.read_csv('https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/gretel_generated_table_simpsons_pii.csv')
df.head(5)

Unnamed: 0,first_name,last_name,email_address,street_address,city,country,age,favorite_hobby
0,Homer,Simpson,homer@example.com,742 Evergreen Terrace,Springfield,USA,40,Eating donuts with Lenny and Carl
1,Marge,Simpson,marge@example.com,742 Evergreen Terrace,Springfield,USA,38,Painting and spending time with Maggie
2,Bart,Simpson,bart@example.com,742 Evergreen Terrace,Springfield,USA,10,Skateboarding and causing trouble with Milhouse
3,Lisa,Simpson,lisa@example.com,742 Evergreen Terrace,Springfield,USA,8,Playing the saxophone and solving mysteries with Milhouse
4,Ned,Flanders,ned@example.com,744 Evergreen Terrace,Springfield,USA,45,Attending church and being a good neighbor
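
Before configuring detection, it can help to check by hand which columns obviously contain PII. The sketch below is a rough heuristic, not how Gretel's classifier works: it parses three of the sample rows shown above with the standard library and flags columns whose values all look like email addresses. The regular expression and the helper name `email_like_columns` are assumptions made for this example.

```python
import csv
import io
import re

# Three rows copied from the preview above (index column omitted)
SAMPLE_CSV = """first_name,last_name,email_address,street_address,city,country,age,favorite_hobby
Homer,Simpson,homer@example.com,742 Evergreen Terrace,Springfield,USA,40,Eating donuts with Lenny and Carl
Marge,Simpson,marge@example.com,742 Evergreen Terrace,Springfield,USA,38,Painting and spending time with Maggie
Ned,Flanders,ned@example.com,744 Evergreen Terrace,Springfield,USA,45,Attending church and being a good neighbor
"""

# Deliberately simple pattern: something@something.tld
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def email_like_columns(csv_text):
    """Return the columns in which every non-empty value looks like an email."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    flagged = []
    for column in rows[0].keys():
        values = [row[column] for row in rows if row[column]]
        if values and all(EMAIL_RE.fullmatch(v) for v in values):
            flagged.append(column)
    return flagged


print(email_like_columns(SAMPLE_CSV))
```

A column-level check like this misses names embedded in free text such as `favorite_hobby`, which is why the configuration in Step 4 also applies entity detection to text columns.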


## Step 4: Configure and Run the Model
Now the fun begins! First, let's configure a Transform v2 model that detects entities in the dataset and replaces each one with a fake value when the entity type has a matching Faker function, or with a hash otherwise. The configuration is specified in YAML format. We also need to select or create a Gretel project to store the model and outputs.

We'll then preview the transformed data and review the detected entities.

In [39]:
import yaml
import pandas as pd
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

# YAML configuration
config = """
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities:

            - first_name
            - last_name
            - email
            - phone_number
            - street_address

          num_samples: 10
      steps:
        - rows:
            update:
              # Detect and replace values in PII columns
              - condition: column.entity is in globals.classify.entities
                value: column.entity | fake
                fallback_value: this | hash | truncate(9,true,"")
              # Detect and replace entities within free text columns
              - type: text
                value: this | fake_entities(on_error="hash")
"""

# Create project and model object
project = create_or_get_unique_project(name="transform-v2")
model = project.create_model_obj(model_config=yaml.safe_load(config), data_source=df)

# Run the model and poll for completion
model.submit_cloud()
poll(model, verbose=False)

# Retrieve and display the transformed data
transformed_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
transformed_df.head(5)

Creating Transform V2 Model 
Transform V2 getting column classifier from environment 
Generating Transform V2 data artifact... 
Saving model archive 
Running model... 
Uploading artifacts to Gretel Cloud... 
Upload to Gretel Cloud is completed. 


Unnamed: 0,first_name,last_name,email_address,street_address,city,country,age,favorite_hobby
0,Sarah,Wade,stephen42@example.net,57981 Scott Courts,Springfield,USA,40,Eating donuts with Ashlee and Morris
1,Barry,Wade,marymcpherson@example.org,57981 Scott Courts,Springfield,USA,38,Painting and spending time with Cynthia
2,Leslie,Wade,andrewblake@example.net,57981 Scott Courts,Springfield,USA,10,Skateboarding and causing trouble with Leah
3,Adam,Wade,msims@example.com,57981 Scott Courts,Springfield,USA,8,Playing the saxophone and solving mysteries with Shannon
4,Tracy,Collins,gyoung@example.com,12857 Katie Vista Suite 096,Springfield,USA,45,Attending church and being a good neighbor
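
The first rule in the configuration replaces a value with a fake when its entity type has a matching Faker function, and otherwise falls back to `this | hash | truncate(9,true,"")`. The sketch below mimics that decision in plain Python to make the behavior concrete. It is not Gretel's implementation: the `FAKERS` table is a hypothetical stand-in for Faker, and the fallback assumes an unsalted SHA-256 hex digest, which may differ from what Gretel's `hash` filter produces. The nine-character cut mirrors Jinja's `truncate(9, true, "")`, which keeps the first nine characters of a long string and appends nothing.

```python
import hashlib

# Hypothetical stand-ins for Faker providers, keyed by entity type
FAKERS = {
    "first_name": lambda: "Sarah",
    "last_name": lambda: "Wade",
}


def fake_or_hash(entity, value):
    """Return a fake value for known entity types, else a 9-character hash prefix."""
    faker = FAKERS.get(entity)
    if faker is not None:
        return faker()
    # Assumption: unsalted SHA-256; Gretel's hash filter may behave differently
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return digest[:9]  # same effect as truncate(9, true, "") on a long string


print(fake_or_hash("first_name", "Homer"))  # Sarah
print(fake_or_hash("unknown_entity", "742 Evergreen Terrace"))  # 9 hex characters
```

Because the fallback is derived from the input, identical original values map to identical replacements in this sketch, whereas a fake is drawn independently of the input.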


In [10]:
# Highlight columns with detected column-level entities
def style_entities(val):
    """
    Applies styling to non-None values.
    """
    color = 'lightgreen' if pd.notna(val) else ''
    return f'background-color: {color}'

report = pd.read_json(model.get_artifact_link("report_json"), compression="gzip")
# Note: pandas 2.1+ deprecates Styler.applymap in favor of Styler.map
report.style.applymap(style_entities, subset=['entities'])

Unnamed: 0,entities
age,
city,
country,
email_address,email
favorite_hobby,
first_name,first_name
last_name,last_name
street_address,street_address


In [29]:
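# Show number of rows with NER detections in free text columns
# (only columns without a column-level entity are checked)
for column in report.index[report["entities"].isna()]:
    # Ignore rows where both values are missing, since NaN != NaN is True
    both_missing = df[column].isna() & transformed_df[column].isna()
    modified_row_count = ((df[column] != transformed_df[column]) & ~both_missing).sum()
    if modified_row_count:
        print(f"{column}: {modified_row_count} out of {len(df)} rows had one or more detected entities")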

favorite_hobby: 27 out of 50 rows had one or more detected entities


Nice! We successfully de-identified both column-level PII entities and PII entities within unstructured free text using this default configuration.

Feel free to customize the configuration to suit your needs. For example, if
you would like the email address format to match the first and last names, you
could add the rule below to the `rows.update` section:
```yaml
- name: email_address
  value: 'row.first_name + "." + row.last_name + "@" + fake.free_email_domain()'
```

This sets the `email_address` column in each record to the format `{first_name}.{last_name}@{domain}`, where `{first_name}` and `{last_name}` are the newly generated fake first and last names, and `{domain}` is one of gmail.com, yahoo.com, or hotmail.com.
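
To make the effect of this rule concrete, the sketch below reproduces the same string concatenation in plain Python. It is an illustration only: `random.choice` over the three domains named above stands in for `fake.free_email_domain()`, and the helper name `build_email` is hypothetical.

```python
import random

# Domains listed above; a stand-in for fake.free_email_domain()
FREE_EMAIL_DOMAINS = ("gmail.com", "yahoo.com", "hotmail.com")


def build_email(first_name, last_name, rng=random):
    """Concatenate the parts exactly as the rule does: first.last@domain."""
    return first_name + "." + last_name + "@" + rng.choice(FREE_EMAIL_DOMAINS)


email = build_email("Sarah", "Wade")
print(email)  # e.g. Sarah.Wade@gmail.com (the domain varies between runs)
```

The concatenation keeps the capitalization of the names, so the address is not lowercased unless the expression does that explicitly.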

## Summary
Finally, we'll do a side-by-side comparison of the first row of data before and after transformation.

In [30]:
# Preview the differences of the first row of real vs transformed data
pd.set_option('display.max_colwidth', None)

first_row_df1 = df.iloc[0].to_frame('Original')
first_row_df2 = transformed_df.iloc[0].to_frame('Transformed')

# Join the transposed rows
comparison_df = first_row_df1.join(first_row_df2)

def highlight_differences(row):
    is_different = row['Original'] != row['Transformed']
    color = 'background-color: lightgreen' if is_different else ''
    return ['', f'{color}; min-width: 500px']

styled_df = comparison_df.style.apply(highlight_differences, axis=1).format(escape="html")
styled_df

Unnamed: 0,Original,Transformed
first_name,Homer,Eric
last_name,Simpson,Smith
email_address,homer@example.com,randy36@example.org
street_address,742 Evergreen Terrace,85045 Rogers Cliffs Apt. 262
city,Springfield,Springfield
country,USA,USA
age,40,40
favorite_hobby,Eating donuts with Lenny and Carl,Eating donuts with Ryan and Stephanie
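
The same comparison can be made without styling, which is handy in scripts or tests. The sketch below uses plain dictionaries holding the first-row values shown above; the helper name `changed_fields` is hypothetical.

```python
# First-row values copied from the comparison table above
original = {
    "first_name": "Homer",
    "last_name": "Simpson",
    "email_address": "homer@example.com",
    "street_address": "742 Evergreen Terrace",
    "city": "Springfield",
    "country": "USA",
    "age": 40,
    "favorite_hobby": "Eating donuts with Lenny and Carl",
}
transformed = {
    "first_name": "Eric",
    "last_name": "Smith",
    "email_address": "randy36@example.org",
    "street_address": "85045 Rogers Cliffs Apt. 262",
    "city": "Springfield",
    "country": "USA",
    "age": 40,
    "favorite_hobby": "Eating donuts with Ryan and Stephanie",
}


def changed_fields(before, after):
    """Return the keys whose values differ, in the order they appear in `before`."""
    return [key for key in before if before[key] != after.get(key)]


print(changed_fields(original, transformed))
```

`city`, `country`, and `age` are absent from the result because their values are identical in both rows.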


If you found this notebook useful, try some advanced examples:

* [PDF content extraction and anonymization](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/transform/extract_and_anonymize_pdf_contents.ipynb)