In [None]:
!pip install -Uqq gretel-client

In [None]:
# remove me, local boilerplate

%load_ext autoreload
%autoreload 2

import os
os.chdir(os.path.normpath("../../../transformers/src"))

os.environ["GRETEL_URI"] = "gretel://api-dev.gretel.cloud/bike-orders"

# Auto-anonymization Pipeline

The objective of this notebook is to anonymize a dataset containing PII well enough that it can be shared among users without revealing any identifying or sensitive details.

Using Gretel's [Data Catalog](https://gretel.ai/platform/data-catalog) and [Transformation](https://gretel.ai/platform/transform) features, this blueprint will walk through creating a pipeline for automatically anonymizing a dataset.

## Setup

First, we'll import the Gretel client dependencies and build a client that points to the project.

In [None]:
from gretel_client import project_from_uri

project = project_from_uri("prompt")

In [None]:
project.client.install_packages(version="dev")

## Inspect source dataset

For this demonstration we've chosen a dataset containing bike order details. As you will see, this dataset contains personally identifying information such as names, email addresses, and personal financial details.

In [None]:
project.head()

## Build the pipeline

In [None]:
from gretel_auto_xf.pipeline import build_pipeline
from gretel_auto_xf.helpers import rule_inspector, df_diff

`build_pipeline` will analyze the source dataset and generate a transformation pipeline that can be used to create an anonymized version of the source dataset.

In [None]:
pipeline = build_pipeline(project, show_progress=True)

In [None]:
rule_inspector(pipeline)
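
To make the idea of a generated transformation pipeline more concrete, here is a minimal, hand-rolled sketch of the general pattern: look at each field, pick a transformer, then apply the transformers record by record. This is not how `build_pipeline` works internally. The helper names (`hash_value`, `redact`, `infer_rules`, `apply_rules`), the heuristics, and the sample record are all hypothetical and exist only for illustration.

```python
import hashlib
import re

# Hypothetical helpers for illustration; none of these are part of gretel_auto_xf.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with the first 12 hex characters of a salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def redact(value: str) -> str:
    """Replace every character with 'X', preserving the original length."""
    return "X" * len(value)


def infer_rules(record: dict) -> dict:
    """Choose a transformer per string field using simple, assumed heuristics."""
    rules = {}
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # only string fields are considered in this toy example
        if EMAIL_RE.match(value):
            rules[field] = hash_value
        elif "name" in field.lower():
            rules[field] = redact
    return rules


def apply_rules(record: dict, rules: dict) -> dict:
    """Return a transformed copy; fields without a rule pass through unchanged."""
    return {
        field: rules[field](value) if field in rules else value
        for field, value in record.items()
    }


# Made-up sample record, not taken from the bike orders dataset.
sample = {"CustomerID": 1, "FirstName": "Ada", "Email": "ada@example.com", "Total": 19.99}
rules = infer_rules(sample)
anonymized = apply_rules(sample, rules)
print(anonymized)
```

In this toy version the rules are inferred from a single record. A real pipeline would need to look at many records before deciding how to treat a field.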

## Run the anonymization pipeline

Now that we've selected which transformations to apply, we can run the pipeline against the Gretel project. `xf_project` will retrieve the original records from the Gretel project and apply the anonymization pipeline.

The result of `xf_project` is an anonymized version of the original dataset.

In [None]:
anonymized_df, scores = pipeline.xf_project(as_df=True, show_progress=True)

## Compare datasets

Let's compare the two datasets. `df_diff` performs a row-wise comparison by field.

In [None]:
df_diff(project.head(), anonymized_df, key="CustomerID", value=16625)
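
For readers who want to see what a field-by-field comparison of a single record involves, here is a small sketch using plain dictionaries. It is not the implementation of `df_diff`. The functions `diff_record` and `diff_by_key` and the sample values are hypothetical, and the sketch assumes the key field itself is left untouched by the transformation so that the two versions of a record can be matched.

```python
def diff_record(original: dict, transformed: dict) -> dict:
    """Map each changed field to an (original, transformed) pair of values."""
    return {
        field: (original[field], transformed.get(field))
        for field in original
        if original[field] != transformed.get(field)
    }


def diff_by_key(originals: list, transformed: list, key: str, value) -> dict:
    """Locate the record where record[key] == value in both lists and diff it."""
    before = next(r for r in originals if r[key] == value)
    after = next(r for r in transformed if r[key] == value)
    return diff_record(before, after)


# Made-up records for illustration only.
source_rows = [
    {"CustomerID": 16625, "Name": "Ada Lovelace", "Email": "ada@example.com", "Qty": 2},
    {"CustomerID": 16626, "Name": "Alan Turing", "Email": "alan@example.com", "Qty": 1},
]
anonymized_rows = [
    {"CustomerID": 16625, "Name": "XXXXXXXXXXXX", "Email": "3f2a9c@example.org", "Qty": 2},
    {"CustomerID": 16626, "Name": "XXXXXXXXXXX", "Email": "b71d04@example.org", "Qty": 1},
]

changes = diff_by_key(source_rows, anonymized_rows, key="CustomerID", value=16625)
for field, (old, new) in changes.items():
    print(f"{field}: {old!r} -> {new!r}")
```

Fields that are identical in both versions, such as `Qty` here, are left out of the result, which keeps attention on what the transformation actually changed.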

`scores` contains a score for each transformed record. The scores are expected to be grouped closely together; any outliers may indicate records that weren't properly anonymized.

In [None]:
scores.plot.hist()
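
A histogram makes outliers visible, but it can also help to list them explicitly. The sketch below flags values that fall outside the usual interquartile-range fences. It assumes the scores can be converted to a plain list of numbers; the function `flag_outliers`, the multiplier `k=1.5`, and the example values are illustrative choices rather than anything prescribed by Gretel.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up scores for illustration; one value sits far from the rest.
example_scores = [0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.35, 0.93]
outliers = flag_outliers(example_scores)
print(outliers)
```

The returned indices can then be used to pull the corresponding rows from the anonymized dataset for manual review.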

## Save the anonymized dataset

Now that we've generated an anonymized version of the dataset, let's save it so it can be shared.

In [None]:
# index=False keeps the DataFrame's row index out of the shared file
anonymized_df.to_csv("bike_orders_anonymized.csv", index=False)

Alternatively, we can upload the anonymized dataset to Gretel where it can be safely accessed by users.

In [None]:
anonymized_project = project.client.get_project(display_name="Sample Blueprint: Anonymized Bike Orders", create=True)
anonymized_project.send_dataframe(anonymized_df, use_progress_widget=True)

print(f"Your new Gretel project has been created! Access it here, {anonymized_project.get_console_url()}.")