# Gretel Transformers Walkthrough

Welcome to the Gretel Transformers walkthrough! In this tutorial we will take you through the process of creating a data pipeline to apply a variety of transformations to your data.

This tutorial assumes you have already uploaded data to Gretel.

The transformers in this example work on entity labels only. We have chosen a subset of labels we see often in data.

If you would like to build field-level transforms, please look through our blueprints directory (in the top level of the repository) for examples.

In [1]:
import getpass
import os

gretel_uri = os.getenv("GRETEL_URI") or getpass.getpass("Your Gretel URI")

Your Gretel URI········


## Create a Gretel Project Instance

In the code below, we use the gretel-client to create an instance of the project whose records we will sample and transform.

In [2]:
%%capture
!pip install gretel-client --upgrade

In [3]:
# Load your Gretel project into the Python Client. Be sure to have your Gretel Project URI!

from gretel_client import project_from_uri

project = project_from_uri(gretel_uri)

In [4]:
# We can see how many records we've ingested and how many fields we've discovered, just to show the
# project is active.
print(f'Total Records Received: {project.record_count}')
print(f'Total Fields Discovered: {project.field_count}')

print("")
print('Previewing project dataframe')
project.head(5)

Total Records Received: 731234
Total Fields Discovered: 110

Previewing project dataframe


Unnamed: 0,payload.device_urn,payload.device_class,payload.device_sn,payload.device,payload.when_captured,payload.loc_lat,payload.loc_lon,payload.loc_alt,payload.loc_olc,payload.env_temp,...,payload.ip_address,payload.ip_country_code,payload.ip_city,payload.ip_country_name,payload.ip_subdivision,payload.location,origin,payload.lnd_7128ec,payload.lnd_7318u,payload.dev_ntp_count
0,pointcast:10041,pointcast,Pointcast #10041,10041,2020-07-27T21:51:15Z,37.796306,140.514413,65,8R92QGW7+GQF,26.6,...,103.67.223.51,JP,,Japan,,"37.796306,140.514413",arn:aws:sns:us-west-2:985752656544:ingest-meas...,,,
1,pointcast:10041,pointcast,Pointcast #10041,10041,2020-07-27T21:51:10Z,37.796306,140.514413,65,8R92QGW7+GQF,,...,103.67.223.51,JP,,Japan,,"37.796306,140.514413",arn:aws:sns:us-west-2:985752656544:ingest-meas...,16.0,,
2,pointcast:10041,pointcast,Pointcast #10041,10041,2020-07-27T21:51:03Z,37.796306,140.514413,65,8R92QGW7+GQF,,...,103.67.223.51,JP,,Japan,,"37.796306,140.514413",arn:aws:sns:us-west-2:985752656544:ingest-meas...,,62.0,
3,pointcast:10001,pointcast,Pointcast #10001,10001,2020-07-27T21:51:01Z,37.659,140.459,209,8R92MF55+JJ2,30.8,...,153.232.218.63,JP,Fukushima,Japan,Fukushima-ken,"37.659,140.459",arn:aws:sns:us-west-2:985752656544:ingest-meas...,,,1.0
4,pointcast:10001,pointcast,Pointcast #10001,10001,2020-07-27T21:51:00Z,37.659,140.459,209,8R92MF55+JJ2,,...,153.232.218.63,JP,Fukushima,Japan,Fukushima-ken,"37.659,140.459",arn:aws:sns:us-west-2:985752656544:ingest-meas...,13.0,,


## Sample Entity Transformations

Below we build a series of entity-specific transformers.
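
Two of the transformers configured in the next cell take a key-like argument (`secret` for `SecureHashConfig`, `seed` for `FakeConstantConfig`), and the phone-number comment describes the replacement values as "consistent": the same input yields the same output. The sketch below illustrates that property with a keyed hash from the Python standard library. HMAC-SHA256 is an assumption chosen for illustration, not a statement about how `SecureHashConfig` is implemented, and `keyed_hash` and the token value are hypothetical.

```python
import hashlib
import hmac


def keyed_hash(value: str, secret: str) -> str:
    """Return the hex HMAC-SHA256 digest of `value` keyed with `secret`."""
    return hmac.new(
        secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Hypothetical token value, for illustration only.
token = "sk_test_example_token"

first = keyed_hash(token, "hash_enc_key")
second = keyed_hash(token, "hash_enc_key")
other_key = keyed_hash(token, "a_different_key")

print(first == second)     # same secret and input give the same digest
print(first == other_key)  # a different secret gives a different digest
```

Because the digest depends on the secret, two datasets hashed with the same secret can still be joined on the hashed field, while someone without the secret cannot recompute the mapping by hashing guessed values.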


In [5]:
from gretel_client.transformers import (
    DataPath,
    DataTransformPipeline,
    BucketCreationParams,
    BucketConfig,
    RedactWithCharConfig,
    RedactWithLabelConfig,
    FakeConstantConfig,
    SecureHashConfig,
    StringMask,
    bucket_creation_params_to_list
)

# let's mask email addresses, by only keeping the first few chars
# this will automatically find emails in any field based on entity labeling
email_mask = StringMask(start_pos=3)
email_transformer = [RedactWithCharConfig(labels=["email_address"], mask=[email_mask])]

# let's mask the last 2 digits of zip codes
zip_mask = StringMask(start_pos=-2)
zip_transformer = [RedactWithCharConfig(labels=["us_zip_code"], mask=[zip_mask])]

# token redactor
# find any sensitive programming tokens that might exist and hash them
token_labels = ["generic_key", "slack_secrets", "jwt", "twilio_data", "square_api_key", "stripe_api_key"]
token_transformer = [SecureHashConfig(labels=token_labels, secret="hash_enc_key")]

# let's replace phone numbers with totally fake, but consistent ones
phone_transformer = [FakeConstantConfig(labels=["phone_number"], seed=1234, fake_method="phone_number")]

# aggressively mask all locations
location_transformer = [RedactWithLabelConfig(labels=["location"])]

# let's bucket latitudes and longitudes into less precise places
lat_lon_boundaries = BucketCreationParams(-180.0, 180.0, 0.5)
buckets = bucket_creation_params_to_list(lat_lon_boundaries)
lat_lon_transformer = [BucketConfig(buckets=buckets, labels=["latitude", "longitude"])]

# since we are only working on automatic transforms based on labels
# they can all go into one datapath

all_transformers = email_transformer + zip_transformer + token_transformer + phone_transformer + location_transformer + lat_lon_transformer
data_path = [
    DataPath(input="*", xforms=all_transformers)
]

pipeline = DataTransformPipeline(data_paths=data_path)
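
To make the masking and bucketing settings above concrete, here is a minimal plain-Python sketch of the behavior they appear to produce. It is inferred from the code comments and from the sample diff later in this walkthrough (`10041` becomes `100XX`, `37.796306` becomes `37.5`); it is not the gretel-client implementation. `mask_from` and `bucket_floor` are hypothetical helper names, and the `X` mask character is an assumption based on that output.

```python
import math


def mask_from(value: str, start_pos: int, mask_char: str = "X") -> str:
    """Keep value[:start_pos] and replace every later character with mask_char."""
    kept = value[:start_pos]
    return kept + mask_char * (len(value) - len(kept))


def bucket_floor(value: float, lower: float, upper: float, width: float) -> float:
    """Return the lower edge of the `width`-sized bucket that contains `value`.

    Buckets start at `lower`; values outside [lower, upper] are rejected.
    """
    if not lower <= value <= upper:
        raise ValueError(f"{value} is outside [{lower}, {upper}]")
    index = math.floor((value - lower) / width)
    return lower + index * width


# start_pos=3 keeps the first three characters (hypothetical address).
print(mask_from("alice@example.com", 3))

# start_pos=-2 masks the last two characters.
print(mask_from("10041", -2))  # 100XX

# Same boundaries as BucketCreationParams(-180.0, 180.0, 0.5) above.
print(bucket_floor(37.796306, -180.0, 180.0, 0.5))   # 37.5
print(bucket_floor(140.514413, -180.0, 180.0, 0.5))  # 140.5
```

With a bucket width of 0.5 degrees, every coordinate snaps to the nearest half degree at or below it, which keeps coarse location while discarding the precise position.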

## Transform some sample records from your Gretel Project

Now that the data pipeline is built, we will run some sample records through it.


In [6]:
# Sample records from your project

records = project.sample()

In [7]:
# Those same records transformed

transformed_records = [pipeline.transform_record(rec) for rec in records]

In [8]:
from gretel_client.demo_helpers import show_record_diff

# Print out Git-style diffs between source and transformed records
for original, transformed in zip(records, transformed_records):
    show_record_diff(original["data"], transformed["data"])
    input()  # wait for Enter before showing the next diff

--- original

+++ transformed

@@ -1,4 +1,4 @@

-origin:arn:aws:sns:us-west-2:985752656544:ingest-measurements-prd
+origin:arn:aws:sns:LOCATION-west-2:129917740844:ingest-measurements-prd
 payload.bat_voltage:8.34
 payload.dev_comms_failures:1976
 payload.dev_free_memory:50588
@@ -6,17 +6,17 @@

 payload.dev_restarts:590
 payload.device:10041
 payload.device_class:pointcast
-payload.device_sn:Pointcast #10041
-payload.device_urn:pointcast:10041
+payload.device_sn:Pointcast #100XX
+payload.device_urn:pointcast:100XX
 payload.env_temp:26.6
 payload.ip_address:103.67.223.51
 payload.ip_city:None
-payload.ip_country_code:JP
-payload.ip_country_name:Japan
+payload.ip_country_code:LOCATION
+payload.ip_country_name:LOCATION
 payload.ip_subdivision:None
 payload.loc_alt:65
-payload.loc_lat:37.796306
-payload.loc_lon:140.514413
+payload.loc_lat:37.5
+payload.loc_lon:140.5
 payload.loc_olc:8R92QGW7+GQF
 payload.location:37.796306,140.514413
 payload.service_handler:i-051a2a353509414f0

--- origi
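
The output above has the shape of a unified diff over sorted `key:value` lines. If you want similar output without the demo helper, the following sketch builds it with the standard library's `difflib`. It approximates the format only and is not the source of `show_record_diff`; `record_diff` is a hypothetical helper, and the two small records reuse values from the sample output above.

```python
import difflib


def record_diff(original: dict, transformed: dict) -> str:
    """Return a unified diff of two flat records as a single string."""

    def to_lines(record: dict) -> list:
        # One "key:value" line per field, sorted so both sides line up.
        return [f"{key}:{record[key]}" for key in sorted(record)]

    diff = difflib.unified_diff(
        to_lines(original),
        to_lines(transformed),
        fromfile="original",
        tofile="transformed",
        lineterm="",
    )
    return "\n".join(diff)


original = {
    "payload.device": 10041,
    "payload.loc_lat": 37.796306,
    "payload.loc_lon": 140.514413,
}
transformed = {
    "payload.device": 10041,
    "payload.loc_lat": 37.5,
    "payload.loc_lon": 140.5,
}

result = record_diff(original, transformed)
print(result)
```

Lines starting with `-` come from the source record, lines starting with `+` come from the transformed record, and unchanged fields appear with a leading space as context.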


