# Gretel Transformers Walkthrough

Welcome to the Gretel Transformers walkthrough! In this tutorial we will take you through the process of creating a data pipeline to apply a variety of transformations to your data.

This tutorial assumes you have already uploaded data to Gretel.

Let's get started!

## Configuration

- If using Google Colab, we recommend you change to a GPU runtime.

- Input your Gretel URI String

- Create your Gretel Synthetic Configuration Template
  - See [our documentation](https://gretel-synthetics.readthedocs.io/en/stable/api/config.html) for additional config options

In [None]:
from pathlib import Path
import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")

## Create a Gretel Project Instance

In the code below, we use the gretel-client to create an instance of the project whose records we will transform.

In [None]:
%%capture
# mark: cell_id=gretel_client_deps

!pip install gretel-client --upgrade
!pip install "gretel-client[fpe]==0.7.0.rc2"

In [None]:
# mark: cell_id=gretel_client_boilerplate
from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)

In [None]:
# We can see how many records we've ingested and how many fields we've discovered, just to show the
# project is active.
print(f'Total Records Received: {project.record_count}\n')
print(f'Total Fields Discovered: {project.field_count}')

## Choose Entity types to transform

Gretel supports a range of transformations for numeric and string data.  Below we leverage methods of the Project class to find some representative fields and apply sample transformers to them.  First, let's go looking for some common entity types.


In [None]:
# See all the entity types detected in this project.
entity_types = [d['entity'] for d in project.entities]
print(f"Detected entity types: {entity_types}")

# Let's look for some identifiers... We will filter these examples against what we actually found.
identifying_entities = ["person_name", "email_address", "ip_address", "uuid"]
identifying_entities = [e for e in identifying_entities if e in entity_types]

person_name_fields = [d['field'] for d in project.get_field_details(entity="person_name")]

# And some places, both strings and numbers if we can...
location_entities = ["city", "us_zip_code", "latitude", "longitude"]
location_entities = [e for e in location_entities if e in entity_types]

# Everyone loves working with dates.
time_entities = ["date", "datetime"]
time_entities = [e for e in time_entities if e in entity_types]



## Changing identifying entities with string transformations

Now we start building up our pipeline. We will define transformers and then specify the fields they act on with a data path. The list of data paths we accumulate makes up our data pipeline. Let's start with some transformations we might want to apply to identifiers: redact them, encrypt them, fake them, or just drop them. We will choose one at random for each field.


In [None]:
import random

from gretel_client.transformers import (
    DropConfig,
    FakeConstantConfig,
    FpeStringConfig,
    RedactWithCharConfig,
    RedactWithLabelConfig,
    RedactWithStringConfig,
    StringMask,
    DataPath,
    DataTransformPipeline,
    DataRestorePipeline
)
from gretel_client.transformers.fakers import FAKER_MAP

# Define a seed value to use for faker transformations to ensure consistent output
SEED = 6251

# Define a secret for format preserving encryption.  
# "Do not store it plain text in github" boilerplate goes here :)
SECRET = "2B7E151628AED2A6ABF7158809CF4F3CEF4359D8D580AA4F7F036D6F04FC6A94"

data_paths = []

for entity in identifying_entities:
    # Get all the project fields tagged as this entity type
    entity_fields = [d['field'] for d in project.get_field_details(entity=entity)]
    for field in entity_fields:
        dice_roll = random.randint(1,6)
        xf = []
        if dice_roll == 1:
            print(f"Dropping field {field}")
            xf = [DropConfig()]
        if dice_roll == 2:
            print(f"Faking field {field}")
            xf = [FakeConstantConfig(seed=SEED, fake_method=FAKER_MAP.get(entity))]
        if dice_roll == 3:
            print(f"Encrypting field {field}")
            # radix 62 will encrypt alphanumeric but no special characters
            xf = [FpeStringConfig(secret=SECRET, radix=62)]
        if dice_roll == 4:
            print(f"Character redacting field {field}")
            # Use a fancier mask for emails
            if entity == "email_address":
                xf = [RedactWithCharConfig(
                        char="X",
                        mask=[StringMask(start_pos=3, mask_until="@"), 
                            StringMask(mask_after="@", mask_until=".", greedy=True)])]
            else:
                xf = [RedactWithCharConfig("#", mask=[StringMask(start_pos=3)])]
        if dice_roll == 5:
            print(f"Label redacting field {field}")
            xf = [RedactWithLabelConfig(labels=[entity])]
        if dice_roll == 6:
            print(f"String redacting field {field}")
            xf = [RedactWithStringConfig(string="CLASSIFIED")]
        data_paths.append(DataPath(input=field, xforms=xf))
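
To make the character redaction above more concrete, here is a minimal pure-Python sketch of masking from a start position up to a delimiter. It does not use gretel-client; the helper name `mask_chars` and its exact semantics are assumptions chosen for illustration, and the real `StringMask` behavior (including `mask_after` and `greedy`) may differ.

```python
def mask_chars(value, char="#", start_pos=0, mask_until=None):
    """Replace characters with `char`, starting at `start_pos`.

    Masking stops just before the first occurrence of `mask_until`
    (searched from `start_pos`), or runs to the end of the string if
    `mask_until` is None or not found. Hypothetical helper, not the
    gretel-client API.
    """
    end = len(value)
    if mask_until is not None:
        found = value.find(mask_until, start_pos)
        if found != -1:
            end = found
    masked_length = max(end - start_pos, 0)
    return value[:start_pos] + char * masked_length + value[end:]


email = mask_chars("alice.smith@example.com", char="X", start_pos=3, mask_until="@")
name = mask_chars("Jane Doe", char="#", start_pos=3)
print(email)
print(name)
```

The first call keeps three leading characters and the domain, which mirrors the intent of the first email mask in the cell above. The second corresponds to the simpler mask used for the other entity types.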
        

## Rounding numeric latitudes and longitudes

Let's keep going. For string locations (cities and ZIP codes) we reuse the fake transformer from above. For numeric latitudes and longitudes, we round the values into buckets 0.1 degrees wide.


In [None]:
from gretel_client.transformers import (
    bucket_creation_params_to_list,
    BucketCreationParams,
    BucketConfig
)

for entity in location_entities:
    # Get all the project fields tagged as this entity type
    entity_fields = [d['field'] for d in project.get_field_details(entity=entity)]
    for field in entity_fields:
        xf = []
        if entity in ["city", "us_zip_code"]:
            print(f"Faking field {field}")
            xf = [FakeConstantConfig(seed=SEED, fake_method=FAKER_MAP[entity])]
        else:
            print(f"Rounding field {field}")
            min_max_width_tuple = BucketCreationParams(-180.0, 180.0, 0.1)
            buckets = bucket_creation_params_to_list(min_max_width_tuple)
            xf = [BucketConfig(buckets=buckets)]
        data_paths.append(DataPath(input=field, xforms=xf))
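
As a rough illustration of what bucketing with a width of 0.1 does to a coordinate, the sketch below maps a value to the lower edge of its bucket. This is plain Python rather than gretel-client code; `bucket_floor` is a hypothetical helper, and the choice of the lower edge as the bucket label is an assumption (the library may label buckets differently).

```python
import math


def bucket_floor(value, minimum=-180.0, width=0.1):
    """Return the lower edge of the fixed-width bucket containing `value`.

    Buckets start at `minimum` and are `width` wide. Values are assumed
    to lie at or above `minimum`. Hypothetical helper for illustration.
    """
    # Round the ratio first so values sitting exactly on a bucket edge
    # are not pushed into the previous bucket by floating-point error.
    index = math.floor(round((value - minimum) / width, 9))
    return round(minimum + index * width, 10)


lat = bucket_floor(37.7749)
lon = bucket_floor(-122.4194)
print(lat, lon)
```

Both coordinates lose their finer decimal places, which is the point of the transformation: nearby locations collapse onto the same coarse grid cell.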


## Shifting date values

Finally, in addition to the transformers above, you can also shift dates.  Here we keep it simple, but there are options to modify the shift based on another input field and you can specify other formats.


In [None]:
from gretel_client.transformers import (
    DateShiftConfig
)

for entity in time_entities:
    # Get all the project fields tagged as this entity type
    entity_fields = [d['field'] for d in project.get_field_details(entity=entity)]
    for field in entity_fields:
        print(f"Date shifting field {field}")
        xf = [DateShiftConfig(secret=SECRET, lower_range_days=-10, upper_range_days=25)]
        data_paths.append(DataPath(input=field, xforms=xf))
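
The idea of a secret-driven, reversible date shift can be sketched with the standard library alone. The code below is an assumption-laden illustration, not how `DateShiftConfig` is implemented: it derives an offset in days from an HMAC of a key (for example, the value of another field) and applies or undoes it. The names `shift_days` and `shift_date` are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_days(secret, key, lower=-10, upper=25):
    """Derive a deterministic offset in [lower, upper] days from secret and key."""
    digest = hmac.new(secret.encode(), key.encode(), hashlib.sha256).digest()
    span = upper - lower + 1
    return lower + int.from_bytes(digest[:8], "big") % span


def shift_date(d, secret, key="", restore=False):
    """Shift `d` by the derived offset, or undo the shift when restore=True."""
    offset = timedelta(days=shift_days(secret, key))
    return d - offset if restore else d + offset


original = date(2020, 1, 15)
shifted = shift_date(original, secret="not-a-real-secret", key="user-1")
restored = shift_date(shifted, secret="not-a-real-secret", key="user-1", restore=True)
print(original, shifted, restored)
```

Because the offset depends only on the secret and the key, the same inputs always produce the same shift, and anyone holding the secret can reverse it.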


## The pipeline in action, er, book with a pipeline animal

Now we can create our data pipeline.  We will run some sample records through it.


In [None]:
# Add one last catch all
# data_paths.append(DataPath(input="*"))
# Make the pipe
pipe = DataTransformPipeline(data_paths)
# Bonus trick for the end
restore_pipe = DataRestorePipeline(data_paths)
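
Conceptually, a pipeline like this pairs field names with ordered lists of transformers and applies them record by record. The toy sketch below shows that shape in plain Python. It is not the gretel-client implementation: `apply_pipeline` and the `keep_unlisted` flag are hypothetical, and how the real pipeline treats fields that no data path matches is not something this sketch establishes.

```python
def apply_pipeline(record, field_transforms, keep_unlisted=True):
    """Apply each field's transforms in order and return a new record.

    A transform returning None drops the field. Fields with no entry in
    `field_transforms` are copied through when `keep_unlisted` is True.
    Toy illustration only.
    """
    out = {}
    for field, value in record.items():
        if field not in field_transforms:
            if keep_unlisted:
                out[field] = value
            continue
        for transform in field_transforms[field]:
            value = transform(value)
            if value is None:
                break
        if value is not None:
            out[field] = value
    return out


record = {"name": "Jane Doe", "city": "Oslo", "note": "hello", "id": 7}
transforms = {
    "name": [lambda v: "CLASSIFIED"],       # redact with a fixed string
    "city": [str.upper, lambda v: v[:2]],   # two transforms chained in order
    "note": [lambda v: None],               # drop the field
}
result = apply_pipeline(record, transforms)
print(result)
```

Chaining matters here: the `city` field passes through both transforms in sequence, much as a data path can carry more than one transformer in its `xforms` list.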


In [None]:
# Sample records from your project
records = project.sample()
print(records)

In [None]:
# Those same records transformed
transformed_records = [pipe.transform_record(rec) for rec in records]
print(transformed_records)

In [None]:
# Recover anything that used the SECRET
restored_records = [restore_pipe.transform_record(rec) for rec in transformed_records]
print(restored_records)