# PII Detection and Masking with Hugging Face

Do you have unstructured text in which you need to detect and mask personally identifiable information (PII)?

This example uses a Named Entity Model (NEM) from Hugging Face to detect PII and a Llama processor to mask it based on the NEM output.
The entire process runs as a serverless, event-driven pipeline on GlassFlow.
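
To make the masking step concrete, here is a minimal sketch of replacing detected spans with placeholders. It assumes the detector returns character offsets in the shape produced by the Hugging Face token-classification pipeline with an aggregation strategy (`entity_group`, `score`, `start`, `end`). The helpers `mask_entities` and `make_span` are hypothetical names used for illustration only; they are not taken from this example's `transform.py`, and the plain string substitution stands in for whatever the real masking step does.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], min_score: float = 0.5) -> str:
    """Replace each detected span with a [LABEL] placeholder (illustrative helper)."""
    # Process spans from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent.get("score", 1.0) < min_score:
            continue  # skip low-confidence detections
        label = ent.get("entity_group", "PII")
        text = text[: ent["start"]] + f"[{label}]" + text[ent["end"]:]
    return text


def make_span(text: str, word: str, label: str, score: float = 0.99) -> Dict:
    """Build a fake detection for `word`; stands in for real model output."""
    start = text.index(word)
    return {"entity_group": label, "score": score, "start": start, "end": start + len(word)}


sample = "An order was created by Jane Doe to be shipped to Berlin."
detected = [
    make_span(sample, "Jane Doe", "PER"),
    make_span(sample, "Berlin", "LOC"),
]
masked = mask_entities(sample, detected)
print(masked)
```

This sketch does not handle overlapping spans; a real detector's output may need to be merged or deduplicated first.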

## Prerequisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Get your Hugging Face API token from [Hugging Face](https://huggingface.co/).

In [None]:
%pip install "glassflow>=2.0.5" pandas faker

In [None]:
import glassflow

In [None]:
# Replace with your own personal access token from https://app.glassflow.dev/profile
personal_access_token = ""

# Replace with your own Hugging Face API token
HUGGING_FACE_TOKEN = ""


## Create Pipeline

In [None]:
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [None]:
# Get the space named "examples" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "examples"
for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"], 
            name=s["name"]
        )
        break
else:
    space = client.create_space(name=space_name)

print(f"Space \"{space.name}\" with ID: {space.id}")

### Transformation Function

In [None]:
%pycat transform.py

### Requirements file

In [None]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
print(requirements_txt)

### Environment variables

In [None]:
env_vars = [{
  "name": "HUGGING_FACE_TOKEN",
  "value": HUGGING_FACE_TOKEN
}]
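
Inside the transformation function, the token can then be read back at runtime. The sketch below assumes the pipeline exposes `env_vars` to the function as ordinary process environment variables; `get_hf_token` is a hypothetical helper, not part of the GlassFlow SDK.

```python
import os
from typing import Mapping


def get_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token, failing early if it is missing."""
    token = env.get("HUGGING_FACE_TOKEN", "")
    if not token:
        # Failing here gives a clearer error than a rejected API call later on.
        raise RuntimeError("HUGGING_FACE_TOKEN is not set")
    return token


# Example with an explicit mapping instead of the real environment
print(get_hf_token({"HUGGING_FACE_TOKEN": "hf_example"}))
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real environment.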

### Create Pipeline

In [None]:
pipeline_name = "pii-detection-masking-example"

pipeline = client.create_pipeline(
    name=pipeline_name, 
    transformation_file='transform.py',
    space_id=space.id, 
    env_vars=env_vars,
    requirements=requirements_txt
)
print("Pipeline ID:", pipeline.id)

In [None]:
print("Pipeline is deployed!")
print(f"Pipeline ID: {pipeline.id}")
print(f"Pipeline URL: https://app.glassflow.dev/pipelines/{pipeline.id}")

## Produce data and send it to your pipeline

### Create a dummy data generator using the Python Faker library

In [None]:
from faker import Faker

def data_generator():
    fake = Faker()
    return {
        'text': f"An order was created by {fake.name()} to be shipped to {fake.address()}. {fake.text()}".replace("\n", " ")
    }

### Example of generated data

In [None]:
display(data_generator())

### Get the pipeline data source object to publish events to the pipeline

In [None]:
data_source = pipeline.get_source()

In [None]:
# Generate some events and send them to the pipeline.
# Keep a local copy to compare with the pipeline output later.
n_events = 10
input_events = []
for _ in range(n_events):
    event = data_generator()
    input_events.append(event)
    data_source.publish(event)

### Display data sent to the pipeline

In [None]:
import pandas as pd

display(pd.DataFrame(input_events))

## Consume events from the pipeline 

Get the pipeline data sink to consume the transformed events from the pipeline.

In [None]:
data_sink = pipeline.get_sink()

In [None]:
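# Poll the sink until every published event has been received
output_events = []
while True:
    resp = data_sink.consume()
    if resp.status_code == 200:
        # A 200 response carries one transformed event
        output_events.append(resp.json())
    if len(output_events) >= n_events:
        # All events have been consumed
        break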
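
The loop above keeps polling until all events arrive, so it never ends if an event is dropped or the transformation fails. A bounded variant could look like the sketch below. `consume_with_timeout` and its `consume_once` callable are hypothetical names introduced here; with this notebook's sink, `consume_once` could wrap `data_sink.consume()` and return `resp.json()` on a 200 response and `None` otherwise.

```python
import time
from typing import Any, Callable, List, Optional


def consume_with_timeout(
    consume_once: Callable[[], Optional[Any]],
    expected: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
) -> List[Any]:
    """Collect up to `expected` events, giving up once `timeout_s` has elapsed."""
    events: List[Any] = []
    deadline = time.monotonic() + timeout_s
    while len(events) < expected and time.monotonic() < deadline:
        event = consume_once()
        if event is None:
            # Nothing available yet: wait briefly instead of busy-looping.
            time.sleep(poll_interval_s)
            continue
        events.append(event)
    return events


# Demo with a fake sink that alternates between "no event" and an event
_fake_responses = iter([None, {"id": 1}, None, {"id": 2}])


def fake_consume() -> Optional[dict]:
    return next(_fake_responses, None)


demo_events = consume_with_timeout(fake_consume, expected=2, timeout_s=5.0, poll_interval_s=0.01)
print(demo_events)
```

If fewer events than expected come back, the caller can compare `len(events)` with `expected` and decide whether to retry or inspect the pipeline logs.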

In [None]:
import pandas as pd

display(pd.DataFrame(output_events))

## Monitor the pipeline

Go to the logs of the pipeline you created to monitor events in real time.

In [None]:
# Explore the pipeline logs in the web UI
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}/logs"
print(pipeline_url)