In [None]:
!pip install -Uqq spacy gretel-client datasets # spaCy is installed for its visualization helper, displaCy

# Work Safely with Free Text Using Gretel

Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label a set of email dumps, looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that replaces any sensitive strings in the email messages with fake values.

At the end of the notebook we'll have a dataset that is safe to share and analyze without compromising a user's personal information.

## Setup

In [None]:
import pandas as pd
import datasets
from gretel_client import get_cloud_client

pd.set_option('max_colwidth', None)

client = get_cloud_client(prefix="api", api_key="prompt")

In [None]:
client.install_packages()

## Load the dataset

Using Hugging Face's [datasets](https://github.com/huggingface/datasets) library, we load a dataset containing a dump of [Enron emails](https://huggingface.co/datasets/aeslc). This data contains unstructured emails that we will pass through a NER pipeline for labeling and PII discovery.

In [None]:
source_dataset = datasets.load_dataset("aeslc")
source_df = pd.DataFrame(source_dataset["train"]).sample(n=300, random_state=99)

In [None]:
source_df.head()

## Label the source text

With the data loaded into the notebook, we now create a Gretel project and upload the records to it for labeling.

In [None]:
project = client.get_project(create=True)

`detection_mode` configures which models the NER pipeline uses for labeling. Setting `detection_mode="all"` labels records using all available models.

In [None]:
project.send_dataframe(source_df, detection_mode="all")

For extra credit, you can navigate to the project's console view to better inspect and visualize the source dataset.

In [None]:
project.get_console_url()

## Inspect labeled data

In the next cell, we download the labeled records and inspect each email to see which entities were detected. Gretel uses a combination of NLP models, regex, and custom heuristics to detect named entities in structured and unstructured data.

For a list of entities that Gretel can detect, [click here](https://gretel.ai/gretel-cloud-faqs/what-types-of-entities-can-gretel-identify).
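
To make the regex part of that combination concrete, here is a minimal sketch of a pattern-based detector. It is not Gretel's implementation: the two patterns, the `find_entities` helper, and the sample sentence are illustrative assumptions, and only the label names are borrowed from this notebook.

```python
import re

# Illustrative patterns only (assumptions for this sketch); production
# detectors combine many more signals than two regular expressions.
PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone_number": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_entities(text):
    """Return detected entities as dicts, sorted by start offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(
                {
                    "label": label,
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                }
            )
    return sorted(found, key=lambda entity: entity["start"])


sample = "Call Jane at 713-555-0142 or write to jane.doe@example.com."
entities = find_entities(sample)
for entity in entities:
    print(entity)
```

Note that the name "Jane" goes undetected here, which is why pattern matching is usually paired with NLP models for entities such as person names.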

In [None]:
from gretel_helpers.spacy import display_entities

TEXT_FIELD = "email_body"

for record in project.iter_records(direction="backward", record_limit=5):
    display_entities(record, TEXT_FIELD)

## Build a transformation pipeline

After labeling the dataset, we've identified emails that contain PII, such as names and email addresses. The final step in this blueprint is to build a transformation pipeline that will replace names and other identifying information with fake representations of the data.

We make a point to replace rather than redact sensitive information. This preservation ensures the dataset remains valuable for downstream use cases such as machine learning, where the structure and contents of the data are essential.

To learn more about data transformation pipelines with Gretel, check our [website](https://gretel.ai/platform/transform) or [SDK documentation](https://gretel-client.readthedocs.io/en/stable/transformers/api_ref.html).
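
The difference between redacting and replacing can be sketched with plain string handling. This is a toy illustration, not the Gretel transformer API: `rewrite_spans`, the hard-coded `FAKES` table, and the sample sentence are hypothetical.

```python
def rewrite_spans(text, spans, replacement_for):
    """Rewrite each (start, end, label) span using replacement_for(label, original)."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + replacement_for(label, text[start:end]) + text[end:]
    return text


def span_of(text, value, label):
    """Locate value in text and return its (start, end, label) span."""
    start = text.index(value)
    return (start, start + len(value), label)


text = "Please forward this to Jane Doe at jane.doe@example.com."
spans = [
    span_of(text, "Jane Doe", "person_name"),
    span_of(text, "jane.doe@example.com", "email_address"),
]

# Redaction removes the value and leaves only a marker.
redacted = rewrite_spans(text, spans, lambda label, original: f"[{label.upper()}]")

# Replacement swaps in a plausible value of the same kind (hypothetical fakes).
FAKES = {"person_name": "Alex Smith", "email_address": "alex.smith@example.org"}
replaced = rewrite_spans(text, spans, lambda label, original: FAKES[label])

print(redacted)
print(replaced)
```

The replaced sentence still reads like an email, which is what keeps the text useful for downstream models.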

In [None]:
import uuid

from gretel_client.transformers import DataPath, DataTransformPipeline
from gretel_client.transformers import FakeConstantConfig

SEED = uuid.uuid1().int

Configure the pipeline. `FakeConstantConfig` will replace any entities configured under `labels` with a fake version of the entity.
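
Assuming the purpose of the seed is to make replacements repeatable, the idea can be sketched by deriving the fake value from the original value and the seed. This is not how `FakeConstantConfig` is implemented; `fake_name` and the name lists are hypothetical.

```python
import hashlib
import hmac

# Hypothetical pools of replacement names for this sketch.
FIRST_NAMES = ["Alex", "Blair", "Casey", "Devon", "Emery", "Finley"]
LAST_NAMES = ["Hart", "Iverson", "Jensen", "Keller", "Lane", "Morgan"]


def fake_name(original, seed):
    """Derive a stable fake name from the original value and a seed."""
    key = str(seed).encode("utf-8")
    digest = hmac.new(key, original.encode("utf-8"), hashlib.sha256).digest()
    # Use two digest bytes to pick a first and last name.
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


seed = 1234
first_call = fake_name("Jane Doe", seed)
second_call = fake_name("Jane Doe", seed)
print(first_call, first_call == second_call)
```

With this approach, every occurrence of the same name maps to the same fake name for a given seed, so references stay consistent across records.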

In [None]:
fake_xf = FakeConstantConfig(seed=SEED, labels=["person_name", "email_address", "phone_number"])

paths = [
    DataPath(input=TEXT_FIELD, xforms=[fake_xf]),
    DataPath(input="*"),
]

pipeline = DataTransformPipeline(paths)
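
The role of the trailing `DataPath(input="*")` can be illustrated with a small stand-in for path matching. The sketch assumes that the first matching path wins and that fields matching no path are left out of the output; `apply_paths` is a hypothetical helper, not part of `gretel_client`.

```python
from fnmatch import fnmatchcase


def apply_paths(record, paths):
    """Apply the first matching (pattern, transform) pair to each field.

    A transform of None copies the value unchanged. Fields that match no
    pattern are dropped from the output (an assumption of this sketch).
    """
    output = {}
    for field, value in record.items():
        for pattern, transform in paths:
            if fnmatchcase(field, pattern):
                output[field] = transform(value) if transform else value
                break
    return output


record = {"email_body": "Hi Jane", "subject_line": "Lunch"}
paths = [
    ("email_body", lambda value: value.replace("Jane", "Alex")),
    ("*", None),  # catch-all: pass every other field through unchanged
]

with_catch_all = apply_paths(record, paths)
without_catch_all = apply_paths(record, paths[:1])
print(with_catch_all)
print(without_catch_all)
```

Under these assumptions, omitting the catch-all pattern would leave only the transformed field in each output record.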

Run the pipeline to replace any sensitive strings.

In [None]:
xf_records = [
    pipeline.transform_record(record)["record"]
    for record in project.iter_records(direction="backward")
]

xf_df = pd.DataFrame(xf_records)

Inspect a transformed email from the dataset.

In [None]:
from gretel_client.demo_helpers import show_record_diff


# Lookup the comparison email by subject line.
c_key = "subject_line"
c_value = "Confidentiality Agreement-Human Code"

# The comparison email contains multiple lines. For this
# demonstration we only want to examine the first line,
# so we split on newlines and keep the first element.
orig = source_df[source_df[c_key] == c_value][TEXT_FIELD].iloc[0].split("\n")[0]
xf = xf_df[xf_df[c_key] == c_value][TEXT_FIELD].iloc[0].split("\n")[0]

show_record_diff({"": orig}, {"": xf})
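
If the demo helper is unavailable, a similar comparison can be approximated with the standard library. The sketch below uses `difflib` to list the token runs that differ between two lines; `changed_tokens` and the sample sentences are hypothetical and unrelated to how `show_record_diff` works.

```python
import difflib


def changed_tokens(before, after):
    """Return (old, new) pairs for the token runs that differ between two lines."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


before = "Please send the draft to Jane Doe today."
after = "Please send the draft to Alex Hart today."
changes = changed_tokens(before, after)
print(changes)
```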

Now that you've completed this notebook, you've seen how it's possible to take a corpus of free text, label it using Gretel's NER pipeline, and safely anonymize the dataset while retaining its utility.