# Gretel Transformers Walkthrough

Welcome to the Gretel Transformers walkthrough! In this tutorial we will take you through the process of creating a data pipeline to apply a variety of transformations to your data.

This tutorial assumes you have already uploaded data to a [Gretel Project](https://console.gretel.cloud).

The transformers in this example operate only on entity labels: they match values by the label Gretel detected, in any field, rather than by field name. We have chosen a subset of labels that we often see in data.

If you would like to build field-level transforms or see more advanced use cases, please look through our [blueprints directory](https://github.com/gretelai/gretel-python-client/tree/master/blueprints) for more examples.

For a more exhaustive list of possible transformations, please reference our [documentation](https://gretel-client.readthedocs.io/en/latest/transformers/api_ref.html#module-reference-transformers).

In [None]:
# NOTE: Run this cell and paste your Gretel URI into the prompt that appears.
# The prompt is skipped if the GRETEL_URI environment variable is already set.

import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")

## Create a Gretel Project Instance

In the code below, we use the gretel-client to create a Project instance, which we can then use to iterate over
labeled records.

In [None]:
%%capture
!pip install gretel-client --upgrade

In [None]:
# Load your Gretel project into the Python Client. Be sure to have your Gretel Project URI!

from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)

In [None]:
# Example JSON record and Gretel Metadata from the Project stream

# Components of a record:
# - id: A unique ID that represents the record's position in the stream
# - data: A flattened version of the raw record that was received
# - metadata: A dictionary of metadata, keyed by field name

project.sample()[0]
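
To make the three components concrete, here is a hand-written sketch of what a single record might look like. The field names, the values, and the inner shape of each `metadata` entry (`labels`, `label`, `score`) are hypothetical and simplified for illustration, as is the `labels_by_field` helper; inspect the output of `project.sample()[0]` for the real structure.

```python
# Hypothetical, simplified record; the real metadata layout may differ.
example_record = {
    "id": "0000000001",
    "data": {"email": "jane@example.com", "zip": "92101"},
    "metadata": {
        "email": {"labels": [{"label": "email_address", "score": 0.9}]},
        "zip": {"labels": [{"label": "us_zip_code", "score": 0.8}]},
    },
}


def labels_by_field(record):
    """Map each field name to the sorted entity labels attached to it."""
    return {
        field: sorted(entry["label"] for entry in meta.get("labels", []))
        for field, meta in record["metadata"].items()
    }


print(labels_by_field(example_record))
```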

## Sample Entity Transformations

Below we build a series of entity-specific transformers.


In [None]:
from gretel_client.transformers import (
    DataPath,
    DataTransformPipeline,
    Score,
    BucketCreationParams,
    BucketConfig,
    RedactWithCharConfig,
    RedactWithLabelConfig,
    FakeConstantConfig,
    SecureHashConfig,
    StringMask,
    bucket_creation_params_to_list
)

# let's mask email addresses, by only keeping the first few chars
# this will automatically find emails in any field based on entity labeling
email_mask = StringMask(start_pos=3)
email_transformer = [RedactWithCharConfig(labels=["email_address"], minimum_score=Score.MED, mask=[email_mask])]

# let's mask the last 6 characters of IP addresses
ip_mask = StringMask(start_pos=-6)
ip_transformer = [RedactWithCharConfig(labels=["ip_address"], minimum_score=Score.MED, mask=[ip_mask])]

# let's mask the last 2 digits of zip codes
zip_mask = StringMask(start_pos=-2)
zip_transformer = [RedactWithCharConfig(labels=["us_zip_code"], minimum_score=Score.MED, mask=[zip_mask])]

# token redactor
# find any sensitive programming tokens that might exist and hash them
token_labels = ["generic_key", "slack_secrets", "jwt", "twilio_data", "square_api_key", "stripe_api_key"]
token_transformer = [SecureHashConfig(labels=token_labels, minimum_score=Score.MED, secret="hash_enc_key")]

# let's replace phone numbers with totally fake, but consistent ones
phone_transformer = [FakeConstantConfig(labels=["phone_number"], minimum_score=Score.MED, seed=1234, fake_method="phone_number")]

# let's replace person names with totally fake, but consistent ones
person_transformer = [FakeConstantConfig(labels=["person_name"], minimum_score=Score.MED, seed=1234, fake_method="person_name")]

# aggressively mask all locations
location_transformer = [RedactWithLabelConfig(labels=["location"], minimum_score=Score.MED)]

# let's bucket latitudes and longitudes into less precise places
lat_lon_boundaries = BucketCreationParams(-180.0, 180.0, 0.5)
buckets = bucket_creation_params_to_list(lat_lon_boundaries)
lat_lon_transformer = [BucketConfig(buckets=buckets, labels=["latitude", "longitude"], minimum_score=Score.MED)]

# since we are only working on automatic transforms based on labels
# they can all go into one datapath

all_transformers = (
    email_transformer
    + ip_transformer
    + zip_transformer
    + token_transformer
    + phone_transformer
    + person_transformer
    + location_transformer
    + lat_lon_transformer
)
data_path = [
    DataPath(input="*", xforms=all_transformers)
]

pipeline = DataTransformPipeline(data_paths=data_path)
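
To build intuition for what the masking, hashing, and bucketing transformers above do, here is a plain-Python sketch of the three ideas. These are stand-ins, not the gretel-client implementation: the function names `mask_from`, `keyed_hash`, and `bucket_floor` are hypothetical, the `"X"` mask character is an assumption, HMAC-SHA256 is simply one common way to build a keyed hash, and labeling a bucket by its lower edge is an assumption as well. The library may behave differently in the details, for example in which characters it masks.

```python
import hashlib
import hmac
import math


def mask_from(value, start_pos, char="X"):
    """Replace every character from start_pos to the end with char.

    A negative start_pos counts from the end of the string, as in Python slicing.
    """
    kept = value[:start_pos]
    return kept + char * (len(value) - len(kept))


def keyed_hash(value, secret):
    """Return a hex HMAC-SHA256 digest; the same value and secret always match."""
    return hmac.new(secret.encode(), value.encode(), hashlib.sha256).hexdigest()


def bucket_floor(value, minimum, width):
    """Snap value down to the lower edge of the bucket that contains it."""
    return minimum + math.floor((value - minimum) / width) * width


print(mask_from("jane.doe@example.com", 3))  # keeps only the first 3 characters
print(mask_from("92101", -2))                # masks only the last 2 characters
print(keyed_hash("api-key-123", "hash_enc_key")[:16])
print(bucket_floor(32.7157, -180.0, 0.5))    # nearby coordinates share a bucket
```

The keyed hash and the seeded fakes share one property worth noting: identical inputs map to identical outputs, so joins and group-bys on the transformed values still line up.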

## Transform some sample records from your Gretel Project

Now that the pipeline is built, we can run some sample records through it.


In [None]:
# Sample records from your project

records = project.sample()

In [None]:
# Those same records transformed

transformed_records = [pipeline.transform_record(rec) for rec in records]

In [None]:
from gretel_client.demo_helpers import show_record_diff

# Print out Git-style diffs between source and transformed records
for original, transformed in zip(records, transformed_records):
    show_record_diff(original["data"], transformed["data"])
    input("Press enter / return to go to the next record")
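
If you want a similar before/after view outside of the demo helper, the standard library can produce one. The sketch below is an assumption-laden stand-in: `record_diff` is a hypothetical name, it is not part of gretel-client, and its output format will differ from what `show_record_diff` prints. The sample values are made up for illustration.

```python
import difflib
import json


def record_diff(original, transformed):
    """Return a unified diff of two dicts, compared as pretty-printed JSON."""
    before = json.dumps(original, indent=2, sort_keys=True).splitlines()
    after = json.dumps(transformed, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(before, after, "original", "transformed", lineterm="")
    )


# Hypothetical before/after values, for illustration only
print(record_diff({"zip": "92101", "city": "San Diego"},
                  {"zip": "921XX", "city": "San Diego"}))
```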

In [None]:
# If data is continuously being ingested into the Gretel API, you can consume the labeled
# data and automatically apply your transforms like so:
#
# NOTE: If no data is currently being ingested, this operation will block until records are received
#
for record in project.iter_records():
    # from here you may route your transformed records anywhere!
    transformed = pipeline.transform_record(record)
    print(transformed["record"])
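
As one example of routing, transformed records could be appended to a JSON Lines file. This is a minimal sketch under stated assumptions: `write_jsonl` is a hypothetical helper, not part of gretel-client, and the demo records and file name are made up.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Append each record to path as one JSON object per line; return the count."""
    count = 0
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


# Demo with made-up records written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "transformed.jsonl")
print(write_jsonl([{"zip": "921XX"}, {"zip": "100XX"}], demo_path))
```

Because `write_jsonl` accepts any iterable, you could pass it a generator that calls `pipeline.transform_record` on each record from `project.iter_records()`; in that case it keeps writing for as long as records arrive.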