In [None]:
!pip install -Uqq gretel-client

# Auto-anonymize production datasets for development

Data seeded in development, test, and other pre-production environments often doesn't have parity with production data. This gap in quality makes bugs hard to track down during development, and often leads to bugs that only surface in production.

In this blueprint, we take a production dataset containing sensitive, personally identifying details and generate a fake, anonymized copy of it. The resulting dataset has the same shape and can be loaded into pre-production databases, but can't be linked back to any customer.

Using Gretel's [Data Catalog](https://gretel.ai/platform/data-catalog) and [Transformation](https://gretel.ai/platform/transform) tools, we walk through a notebook that analyzes a source dataset and automatically generates a data pipeline to transform a production dataset. While this demonstration runs as a notebook, the same pipeline can be deployed into a variety of data stacks.

## Setup

First we'll import the Gretel package dependencies and instantiate a client pointing to the newly created project.

In [None]:
from gretel_client import project_from_uri

project = project_from_uri("prompt")

In [None]:
project.client.install_packages(version="dev")

## Inspect source dataset

For this demonstration we've chosen a dataset containing bicycle order details. The dataset contains identifying information such as names, email addresses, and individual financial details. Gretel's Data Catalog will extract entities such as names, emails, and locations using custom pattern matching and machine-learning-based NLP models. We'll use these entities to help determine which fields need to be anonymized.
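
To make the pattern-matching idea concrete, here is a minimal sketch using only the Python standard library. It is not Gretel's detector: the regular expression, the `detect_email_fields` helper, the `threshold` parameter, and the sample records are all assumptions made up for illustration. It flags a field when most of its non-empty values look like email addresses.

```python
import re

# Hypothetical pattern for illustration; a real catalog combines many detectors.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_email_fields(records, threshold=0.5):
    """Return field names where at least `threshold` of non-empty values look like emails."""
    hits, totals = {}, {}
    for record in records:
        for field, value in record.items():
            if value in (None, ""):
                continue  # skip empty values so they don't dilute the ratio
            totals[field] = totals.get(field, 0) + 1
            if EMAIL_RE.fullmatch(str(value)):
                hits[field] = hits.get(field, 0) + 1
    return sorted(f for f, n in totals.items() if hits.get(f, 0) / n >= threshold)


# Made-up sample records, not taken from the bicycle order dataset.
sample_records = [
    {"order_id": 1001, "contact": "ana@example.com", "total": "249.99"},
    {"order_id": 1002, "contact": "li.wei@example.org", "total": "1199.00"},
    {"order_id": 1003, "contact": "", "total": "89.50"},
]

flagged = detect_email_fields(sample_records)
print(flagged)
```

Detecting at the field level rather than per value keeps one stray match from flagging an entire column.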

In [None]:
project.head()

## Build the pipeline

`gretel_auto_xf` is a package built by Gretel that helps build transformation pipelines. The package uses a set of rules and heuristics to automatically generate transformations based on the contents and metadata of the dataset.

In [None]:
from gretel_auto_xf.pipeline import build_pipeline
from gretel_auto_xf.helpers import rule_inspector, df_diff

`build_pipeline` will analyze the source dataset and generate a transformation pipeline that can be used to create an anonymized version of the source dataset.

In [None]:
pipeline = build_pipeline(project, show_progress=True)

After the dataset is analyzed, the set of matching rules is automatically collected into a pipeline. Using `rule_inspector` you may select or deselect rules based on your specific requirements or privacy constraints.
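
As a rough mental model of rule matching and selection, the sketch below pairs each rule with a predicate over a field's detected entity labels, then collects the matches. It does not reflect the internals of `gretel_auto_xf`: the `Rule` class, the `match_rules` helper, the rule names, and the entity labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[str, set], bool]  # (field name, entity labels) -> bool
    enabled: bool = True


# Hypothetical field metadata: field name -> entity labels detected in that field.
field_entities = {
    "customer_name": {"person_name"},
    "email": {"email_address"},
    "order_total": set(),
}

rules = [
    Rule("fake_name", lambda field, labels: "person_name" in labels),
    Rule("fake_email", lambda field, labels: "email_address" in labels),
]


def match_rules(field_entities, rules):
    """Collect (field, rule name) pairs for every enabled rule that matches a field."""
    return [
        (field, rule.name)
        for field, labels in field_entities.items()
        for rule in rules
        if rule.enabled and rule.matches(field, labels)
    ]


all_steps = match_rules(field_entities, rules)

# Deselect a rule, as one might after inspecting the matches, and rebuild.
rules[0].enabled = False
selected_steps = match_rules(field_entities, rules)

print(all_steps)
print(selected_steps)
```

Fields with no matching rule, such as `order_total` here, simply pass through without a transformation step.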

In [None]:
rule_inspector(pipeline)

## Run the anonymization pipeline

Now that we've selected what transformations to apply, we can run the pipeline against the Gretel project. `xf_project` will retrieve the original records from the Gretel project and apply the anonymization pipeline.

The result of `xf_project` is an anonymized version of the original dataset.
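
For intuition about what applying a pipeline to records involves, here is a minimal sketch that runs per-field transforms over a list of records while preserving the dataset's shape. It is an assumption-laden stand-in, not how `xf_project` works: keyed hashing substitutes for real fake-value generation, and `SECRET_KEY`, `pseudonym`, `apply_pipeline`, and the sample records are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration only


def pseudonym(value, prefix):
    """Map a value to a stable token; the same input always yields the same output."""
    digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"


# Hypothetical per-field transforms; fields not listed here pass through unchanged.
transforms = {
    "customer_name": lambda v: pseudonym(v, "name"),
    "email": lambda v: pseudonym(v, "user") + "@example.com",
}


def apply_pipeline(records, transforms):
    """Return new records with the same fields; only fields with a transform change."""
    return [
        {
            field: transforms[field](value) if field in transforms else value
            for field, value in record.items()
        }
        for record in records
    ]


# Made-up records, not taken from the bicycle order dataset.
original = [
    {"customer_name": "Ana Diaz", "email": "ana@example.com", "order_total": 249.99},
    {"customer_name": "Li Wei", "email": "li.wei@example.org", "order_total": 1199.00},
]

anonymized = apply_pipeline(original, transforms)
print(anonymized[0])
```

Because the mapping is deterministic, a value that repeats across rows maps to the same token each time, which keeps joins and group-bys meaningful in the transformed copy.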

In [None]:
anonymized_df = pipeline.xf_project(as_df=True, show_progress=True, batch_pipeline=True)

## Compare datasets

Let's compare the two datasets. `df_diff` performs a row-wise comparison by field.
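
A row-wise, field-by-field comparison can be sketched in plain Python as follows. This is not the implementation of `df_diff`; the `diff_row` helper and the two sample rows are assumptions made up to show the idea.

```python
def diff_row(original_row, anonymized_row):
    """Return one entry per field: both values and whether the field changed."""
    return [
        {
            "field": field,
            "original": original_row[field],
            "anonymized": anonymized_row.get(field),
            "changed": original_row[field] != anonymized_row.get(field),
        }
        for field in original_row
    ]


# Made-up rows for illustration.
before = {"customer_name": "Ana Diaz", "email": "ana@example.com", "order_total": 249.99}
after = {"customer_name": "Jo Park", "email": "user_1@example.com", "order_total": 249.99}

report = diff_row(before, after)
changed_fields = [entry["field"] for entry in report if entry["changed"]]

for entry in report:
    print(entry)
```

A quick look at `changed_fields` confirms that sensitive fields were rewritten and that non-sensitive fields were left alone.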

In [None]:
df_diff(project.head(), anonymized_df, index=1)

## Save the anonymized dataset

Now that we've generated an anonymized version of the dataset, let's save it so it can be shared.

In [None]:
anonymized_df.to_csv("bike_orders_anonymized.csv", index=False)

Alternatively, we can upload the anonymized dataset to Gretel where it can be safely accessed by other users.

In [None]:
project_xf = project.client.get_project(display_name="Anonymized Bike Orders", create=True)
project_xf.send_dataframe(anonymized_df, use_progress_widget=True)

print(f"Your new Gretel project has been created! Access it here: {project_xf.get_console_url()}")