# Gretel Transformers Walkthrough

Welcome to the Gretel Transformers walkthrough! In this tutorial we will take you through the process of creating a data pipeline to apply a variety of transformations to your data.

This tutorial assumes you have already uploaded data to Gretel.

The transformers in this example work on entity labels only. We have chosen a subset of labels we see often in data.

If you would like to build field-level transforms, please look through our blueprints directory (in the top level of the repository) for examples.

In [None]:
import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")

## Create a Gretel Project Instance

In the code below, we install the gretel-client package and use it to load the Gretel project whose records we will transform.

In [2]:
%%capture
!pip install "gretel-client==0.7.0.rc3" --upgrade

ERROR: Could not find a version that satisfies the requirement gretel-client==0.7.0.rc3 (from versions: 0.5.0, 0.5.1, 0.6.0, 0.7.0rc1, 0.7.0rc2)
ERROR: No matching distribution found for gretel-client==0.7.0.rc3


In [None]:
# Load your Gretel project into the Python Client. Be sure to have your Gretel Project URI!

from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)

In [None]:
# We can see how many records we've ingested and how many fields we've discovered, just to show the
# project is active.
print(f'Total Records Received: {project.record_count}')
print(f'Total Fields Discovered: {project.field_count}')

print("")
print('Previewing project dataframe')
project.head(5)

## Sample Entity Transformations

Below we build a series of entity-specific transformers.
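To make the positional masks in this walkthrough concrete, here is a minimal plain-Python sketch of the idea: keep everything before a start position and replace the rest with a mask character. The helper `mask_from` and the `"X"` mask character are hypothetical names chosen for illustration. This is not the gretel-client implementation, which may handle punctuation or mask characters differently.

```python
def mask_from(value: str, start_pos: int, mask_char: str = "X") -> str:
    """Keep the characters before start_pos and mask everything after.

    A negative start_pos counts from the end of the string, as in a slice.
    Hypothetical helper for illustration only, not part of gretel-client.
    """
    kept = value[:start_pos]
    return kept + mask_char * (len(value) - len(kept))


# Keep the first three characters of an email address
print(mask_from("user@example.com", 3))

# Mask the last two digits of a zip code
print(mask_from("94107", -2))  # prints 941XX
```

Under this reading, a positive start position protects a prefix, while a negative one masks only a suffix, which is why the zip code example keeps the first three digits.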


In [None]:
from gretel_client.transformers import (
    DataPath,
    DataTransformPipeline,
    BucketCreationParams,
    BucketConfig,
    RedactWithCharConfig,
    RedactWithLabelConfig,
    FakeConstantConfig,
    SecureHashConfig,
    StringMask,
    bucket_creation_params_to_list
)

# let's mask email addresses, by only keeping the first few chars
# this will automatically find emails in any field based on entity labeling
email_mask = StringMask(start_pos=3)
email_transformer = [RedactWithCharConfig(labels=["email_address"], mask=[email_mask])]

# let's mask the last 2 digits of zip codes
zip_mask = StringMask(start_pos=-2)
zip_transformer = [RedactWithCharConfig(labels=["us_zip_code"], mask=[zip_mask])]

# token redactor
# find any sensitive programming tokens that might exist and hash them
token_labels = ["generic_key", "slack_secrets", "jwt", "twilio_data", "square_api_key", "stripe_api_key"]
token_transformer = [SecureHashConfig(labels=token_labels, secret="hash_enc_key")]

# let's replace phone numbers with totally fake, but consistent ones
phone_transformer = [FakeConstantConfig(labels=["phone_number"], seed=1234, fake_method="phone_number")]

# aggressively mask all locations
location_transformer = [RedactWithLabelConfig(labels=["location"])]

# let's bucket latitudes and longitudes into less precise places
lat_lon_boundaries = BucketCreationParams(-180.0, 180.0, 0.5)
buckets = bucket_creation_params_to_list(lat_lon_boundaries)
lat_lon_transformer = [BucketConfig(buckets=buckets, labels=["latitude", "longitude"])]

# since we are only working on automatic transforms based on labels
# they can all go into one datapath

all_transformers = (
    email_transformer
    + zip_transformer
    + token_transformer
    + phone_transformer
    + location_transformer
    + lat_lon_transformer
)
data_path = [
    DataPath(input="*", xforms=all_transformers)
]

pipeline = DataTransformPipeline(data_paths=data_path)
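Two of the transforms above are deterministic: hashing tokens with a secret and bucketing coordinates. The sketch below illustrates both ideas with the standard library only. It assumes the hash is a keyed HMAC-SHA256 and that each value is replaced by the lower edge of its bucket. Both are assumptions made for illustration, and `keyed_hash` and `bucket_floor` are hypothetical helpers, not gretel-client functions.

```python
import hashlib
import hmac


def keyed_hash(value: str, secret: str) -> str:
    """Return the hex HMAC-SHA256 digest of value under secret.

    The same value and secret always produce the same digest, so hashed
    tokens stay joinable across records without exposing the original.
    """
    return hmac.new(secret.encode(), value.encode(), hashlib.sha256).hexdigest()


def bucket_floor(value: float, low: float, high: float, width: float) -> float:
    """Snap value to the lower edge of its width-sized bucket in [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    index = int((value - low) // width)
    return low + index * width


# Hashing is repeatable for a fixed secret
token = "sk_test_example"  # made-up token, for illustration only
print(keyed_hash(token, "hash_enc_key") == keyed_hash(token, "hash_enc_key"))  # prints True

# Coordinates lose precision but stay in the same half-degree cell
print(bucket_floor(37.7749, -180.0, 180.0, 0.5))    # prints 37.5
print(bucket_floor(-122.4194, -180.0, 180.0, 0.5))  # prints -122.5
```

A keyed hash is used in this sketch rather than a bare digest because short or guessable tokens could otherwise be recovered by hashing candidate values.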

## Transform some sample records from your Gretel Project

With the pipeline built, we can run some sample records through it.


In [None]:
# Sample records from your project

records = project.sample()

In [None]:
# Those same records transformed

transformed_records = [pipeline.transform_record(rec) for rec in records]

In [None]:
from gretel_client.demo_helpers import show_record_diff

# Print out Git-style diffs between source and transformed records
for original, transformed in zip(records, transformed_records):
    show_record_diff(original["data"], transformed["data"])
    # Pause until Enter is pressed so each diff can be read before the next one prints
    input()
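If the demo helper is not available, a Git-style diff between two records can be approximated with the standard library. The sketch below is a stand-in, not the implementation of `show_record_diff`: it assumes each record's `data` payload is a JSON-serializable dict, and `record_diff` is a hypothetical helper name.

```python
import difflib
import json


def record_diff(original: dict, transformed: dict) -> str:
    """Return a unified diff of two dicts rendered as sorted, indented JSON."""
    before = json.dumps(original, indent=2, sort_keys=True).splitlines()
    after = json.dumps(transformed, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(
            before, after, fromfile="original", tofile="transformed", lineterm=""
        )
    )


# Made-up records for illustration
source = {"name": "a", "zip": "94107"}
masked = {"name": "a", "zip": "941XX"}
print(record_diff(source, masked))
```

Sorting the keys before diffing keeps unchanged fields aligned, so only the fields a transformer actually touched appear as `-` and `+` lines.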