In [92]:
!pip install -Uqq spacy gretel-client

# Working Safely with Sensitive Free Text Using Gretel Cloud and NLP

Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs, looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that replaces any sensitive strings in the chat messages with fake values.

At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information.

## Setup

In [5]:
import pandas as pd
from gretel_client import get_cloud_client

client = get_cloud_client(prefix="api", api_key="prompt")

In [None]:
client.install_packages()

## Load the dataset

For this blueprint, we use a modified dataset from the Ubuntu Chat Corpus. It represents an archived set of IRC logs from Ubuntu's technical support channel. This data primarily contains free-form text that we will pass through an NER pipeline for labeling and PII discovery.

In [76]:
source_df = pd.read_csv("https://gretel-public-website.s3.us-west-2.amazonaws.com/blueprints/nlp_text_analysis/chat_logs.csv")

In [77]:
source_df.head()

Unnamed: 0,folder,dialogueID,date,from,to,text
0,3,71195.tsv,2010-08-27T11:31:00.000Z,KB1JWQ,mjwalker,phone number 123.453.8920
1,3,126125.tsv,2008-04-23T14:55:00.000Z,bad_image,,location San Diego
2,3,126125.tsv,2008-04-23T14:56:00.000Z,bad_image,,my name is Linus
3,3,126125.tsv,2008-04-23T14:57:00.000Z,lordleemo,bad_image,location United States
4,3,64545.tsv,2009-08-01T06:22:00.000Z,mechtech,,city is San Diego email address is test@exampl...


## Label the source text

With the data loaded into the notebook, we now create a Gretel Project, and upload the records to the project for labeling.

In [78]:
project = client.get_project(create=True)

`detection_mode` configures the NER pipeline that is responsible for labeling the data. Using `detection_mode=all` we configure the records to be labeled using all of 

In [95]:
project.send_dataframe(source_df, detection_mode="all")

6 records [00:00, 15.01 records/s]         


For extra credit, you can navigate to the project's console view to better inspect and visualize the dataset.

In [91]:
project.get_console_url()

'https://console.gretel.cloud/drew-0a1c3'

## Inspect labeled data

In this next cell, we download the labeled records and inspect each chat message to see what entities were detected. Gretel uses a combination of NLP models, regex, and custom heuristics to detect named entities in structured and unstructured data.

For a list of entities that Gretel can detect, [click here](https://gretel.ai/gretel-cloud-faqs/what-types-of-entities-can-gretel-identify).
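To make the regex part of that description concrete, here is a minimal sketch of pattern-based entity detection using only the Python standard library. It is not Gretel's implementation: the `PATTERNS` table and the `find_entities` helper are hypothetical names introduced for illustration, and the patterns are deliberately simple. The labels reuse the `email_address` and `phone_number` names that appear later in this notebook.

```python
import re

# Hypothetical patterns for illustration only; production detectors are far more thorough.
PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone_number": re.compile(r"\b\d{3}[.\- ]\d{3}[.\- ]\d{4}\b"),
}


def find_entities(text):
    """Return (label, start, end, matched_text) tuples, ordered by start offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.start(), match.end(), match.group()))
    return sorted(found, key=lambda item: item[1])


entities = find_entities("phone number 123.453.8920")
```

A regex-only approach like this misses entities such as person names and locations, which is presumably where the NLP models and heuristics mentioned above come in.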

In [None]:
from gretel_helpers.spacy import display_entities

TEXT_FIELD = "text"

for record in project.iter_records(direction="backward"):
    display_entities(record, TEXT_FIELD)

## Build a transformation pipeline

After labeling the dataset, we've identified messages that contain PII, such as names and emails. The final step in this blueprint is to build a transformation pipeline that will replace names and other identifying information with fake representations of the data.

We make a point to replace rather than redact sensitive information. This preservation ensures the dataset remains valuable for downstream use cases such as machine learning, where the structure and contents of the data are essential.

To learn more about data transformation pipelines with Gretel, check our [website](https://gretel.ai/platform/transform) or [SDK documentation](https://gretel-client.readthedocs.io/en/stable/transformers/api_ref.html).
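The difference between redacting and replacing can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not how Gretel's transformers work internally: `FAKE_NAMES`, `redact`, and `replace_with_fake` are hypothetical, and the seed-plus-hash scheme is just one way to make a replacement repeatable.

```python
import hashlib

# Hypothetical pool of stand-in names; a real pipeline would draw from a faker library.
FAKE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Carter", "Robin Diaz"]


def redact(text, value):
    """Blank out a sensitive value, losing the shape of the original message."""
    return text.replace(value, "[REDACTED]")


def replace_with_fake(text, value, seed):
    """Swap a sensitive value for a fake one chosen deterministically from seed and value."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    fake = FAKE_NAMES[int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)]
    return text.replace(value, fake)


message = "my name is Linus"
redacted = redact(message, "Linus")                      # "my name is [REDACTED]"
replaced = replace_with_fake(message, "Linus", seed=42)  # still reads like a real message
```

Because the fake value depends only on the seed and the original value, the same input maps to the same replacement within a run, so repeated mentions of one person stay consistent across messages.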

In [81]:
import uuid

from gretel_client.transformers import DataPath, DataTransformPipeline
from gretel_client.transformers import FakeConstantConfig

SEED = uuid.uuid1().int

Configure the pipeline. `FakeConstantConfig` will replace any entities configured under `labels` with a fake version of the entity.

In [86]:
fake_xf = FakeConstantConfig(seed=SEED, labels=["person_name", "email_address", "phone_number"])

paths = [
    DataPath(input=TEXT_FIELD, xforms=[fake_xf]),
    DataPath(input="*"),
]

pipeline = DataTransformPipeline(paths)

Run the pipeline to replace any sensitive strings with fake values.

In [87]:
xf_records = [
    pipeline.transform_record(record)["record"]
    for record in project.iter_records(direction="backward")
]

Inspect the transformed version of the dataset. Notice that entities such as names and emails have been replaced with fake instances of the entity.

In [101]:
xf_df = pd.DataFrame(xf_records)

xf_df[[TEXT_FIELD]]

Unnamed: 0,text
0,location United States
1,my name is Darrell Long
2,city is San Diego email address is garciajim@y...
3,location San Diego
4,phone number 532-677-7284
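A quick sanity check after a transformation like this is to confirm that known sensitive values from the source no longer appear in the output. The sketch below is a hypothetical helper, not part of the Gretel SDK. It uses the untruncated sample rows shown in this notebook, and `find_leaks` is a name introduced here for illustration.

```python
# Sensitive values visible in the sample source rows earlier in the notebook.
known_sensitive = ["123.453.8920", "Linus"]

# Transformed messages copied from the sample output (untruncated rows only).
transformed = [
    "location United States",
    "my name is Darrell Long",
    "location San Diego",
    "phone number 532-677-7284",
]


def find_leaks(messages, sensitive_values):
    """Return (message, value) pairs where a known sensitive value survived."""
    return [
        (message, value)
        for message in messages
        for value in sensitive_values
        if value.lower() in message.lower()
    ]


leaks = find_leaks(transformed, known_sensitive)
```

In the notebook itself, `transformed` could instead be built with `xf_df[TEXT_FIELD].tolist()`. An empty result only shows that the listed values are gone; it does not prove that every sensitive string was detected.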


Now that you've completed this notebook, you've seen how it's possible to take a corpus of free text, label it using Gretel's NER pipeline, and safely anonymize the dataset while retaining its utility.