# PII Detection and Masking with Hugging Face

Do you have unstructured text in which you need to detect and mask personally identifiable information (PII)?

This example uses a named entity recognition (NER) model from Hugging Face to detect PII, and the LlamaIndex `NERPIINodePostprocessor` to mask it using the NER output.
The entire process runs as a serverless, event-driven pipeline on GlassFlow.

## Pre-requisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Get your Hugging Face API token from [Hugging Face](https://huggingface.co/).

In [None]:
%pip install "glassflow>=2.0.5" pandas faker

In [1]:
import glassflow

In [2]:
# Replace with your own personal access token from https://app.glassflow.dev/profile
personal_access_token = ""

# Replace with your own Hugging Face API token
HUGGING_FACE_TOKEN = ""
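
Hard-coding tokens in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the two variables above could instead be read from the environment. The helper `read_token` and the variable name `GLASSFLOW_PERSONAL_ACCESS_TOKEN` are assumptions made for illustration; only `HUGGING_FACE_TOKEN` appears in this example.

```python
import os


def read_token(env_name, fallback=""):
    """Return the token stored in the given environment variable, or the fallback."""
    return os.environ.get(env_name, fallback).strip()


# Hypothetical variable name; export it in your shell before starting Jupyter.
personal_access_token = read_token("GLASSFLOW_PERSONAL_ACCESS_TOKEN")
HUGGING_FACE_TOKEN = read_token("HUGGING_FACE_TOKEN")
```

If a variable is not set, the sketch falls back to an empty string, which matches the placeholder values used in the cell above.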


## Create Pipeline

In [3]:
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [None]:
# Get the space named "examples" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "examples"
for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"], 
            name=s["name"]
        )
        break
else:
    space = client.create_space(name=space_name)

print(f"Space \"{space.name}\" with ID: {space.id}")
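
The cell above relies on Python's `for ... else`: the `else` branch runs only when the loop finishes without reaching `break`, which here means no space with the requested name exists. A small self-contained sketch of the same pattern on plain dictionaries follows; `find_or_create` and `make_new` are hypothetical names and are not part of the GlassFlow SDK.

```python
def find_or_create(items, name, create):
    """Return the item with the given name, creating it when none matches."""
    for item in items:
        if item["name"] == name:
            found = item
            break
    else:
        # Runs only when the loop completed without hitting `break`.
        found = create(name)
    return found


def make_new(name):
    # Stand-in for a real creation call.
    return {"id": "new", "name": name}


existing = [{"id": "a1", "name": "demo"}, {"id": "b2", "name": "examples"}]
print(find_or_create(existing, "examples", make_new))
print(find_or_create(existing, "sandbox", make_new))
```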

### Transformation Function

In [5]:
%pycat transform.py

from llama_index.core.postprocessor import NERPIINodePostprocessor
import os


def ner(mytext):
    HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")
    import requests
    API_URL = "https://api-inference.huggingface.co/models/dbmdz/bert-large-cased-finetuned-conll03-english"
    headers = {"Authorization": f"Bearer {HUGGING_FACE_TOKEN}"}

    def query(payload):
        ...

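The masking step turns NER output into placeholders such as `[PER_24]`, where the number matches the character offset at which the entity starts. The following is a minimal sketch of that idea, not the code in `transform.py` or in LlamaIndex. It assumes each detected entity is a dictionary with `entity_group`, `start`, and `end` keys; the function name `mask_entities` and the sample data are made up for illustration.

```python
def mask_entities(text, entities):
    """Replace each detected entity with a [LABEL_start] placeholder.

    Returns the masked text and a mapping from placeholder to original value.
    """
    mapping = {}
    masked = text
    # Work right to left so the character offsets of earlier entities stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = f"[{ent['entity_group']}_{ent['start']}]"
        masked = masked[:ent["start"]] + tag + masked[ent["end"]:]
        mapping[tag] = text[ent["start"]:ent["end"]]
    return masked, mapping


sample = "An order was created by Kyle Garcia to be shipped."
# Hand-written stand-in for what an NER model might return.
detected = [{"entity_group": "PER", "start": 24, "end": 35}]

masked_text, entity_map = mask_entities(sample, detected)
print(masked_text)  # An order was created by [PER_24] to be shipped.
print(entity_map)   # {'[PER_24]': 'Kyle Garcia'}
```

Keeping the mapping alongside the masked text is what lets a trusted consumer restore the original values later.
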
### Requirements file

In [6]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
print(requirements_txt)

llama-index==0.11.18
requests


### Environment variables

In [8]:
env_vars = [{
  "name": "HUGGING_FACE_TOKEN",
  "value": HUGGING_FACE_TOKEN
}]

### Create Pipeline

In [None]:
pipeline_name = "pii-detection-masking-example"

pipeline = client.create_pipeline(
    name=pipeline_name, 
    transformation_file='transform.py',
    space_id=space.id, 
    env_vars=env_vars,
    requirements=requirements_txt
)
print("Pipeline ID:", pipeline.id)

In [None]:
print("Pipeline is deployed!")
print(f"Pipeline ID: {pipeline.id}")
print(f"Pipeline URL: https://app.glassflow.dev/pipelines/{pipeline.id}")

## Produce data and send it to your pipeline

### Create a dummy data generator using the Python Faker library

In [12]:
from faker import Faker

def data_generator():
    fake = Faker()
    return {
        'text': f"An order was created by {fake.name()} to be shipped to {fake.address()}. {fake.text()}".replace("\n", " ")
    }

### Example of generated data

In [13]:
display(data_generator())

{'text': 'An order was created by Paul Sherman to be shipped to 0296 Glenn Valley Meyertown, NC 37549. Culture type tell these. Enough actually guy himself produce value wide. Fill accept push information evening party. Sound necessary charge realize let into. Story and fear.'}

### Get the pipeline data source to publish events to the pipeline

In [14]:
data_source = pipeline.get_source()

In [20]:
# Generate some data and send it to the pipeline. Keep a local copy to compare with the output later.
n_events = 10
input_events = []
for i in range(n_events):
    event = data_generator()
    input_events.append(event)
    data_source.publish(event)

### Display data sent to the pipeline

In [21]:
import pandas as pd

display(pd.DataFrame(input_events))

|   | text |
|---|------|
| 0 | An order was created by Kyle Garcia to be ship... |
| 1 | An order was created by Alan Gonzalez to be sh... |
| 2 | An order was created by Rhonda Miles to be shi... |
| 3 | An order was created by Jessica Leonard to be ... |
| 4 | An order was created by Fernando Johnson to be... |
| 5 | An order was created by Brooke Newton to be sh... |
| 6 | An order was created by David Garcia to be shi... |
| 7 | An order was created by Luke Stephenson to be ... |
| 8 | An order was created by Chad Harris to be ship... |
| 9 | An order was created by Shawn Evans to be ship... |


## Consume events from the pipeline 

Get the pipeline data sink to consume the transformed events from the pipeline.

In [22]:
data_sink = pipeline.get_sink()

In [23]:
output_events = []
while True:
    resp = data_sink.consume()
    if resp.status_code == 200:
        output_events.append(resp.json())
    if len(output_events) == n_events:
        # all events have been consumed
        break
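
The loop above keeps polling until all `n_events` have arrived, so it never finishes if an event is dropped or the transformation fails. A minimal sketch of a bounded variant is shown below. It assumes, as the loop above does, that `consume()` returns a response with `status_code` and `json()`, and that a non-200 status means no event is ready yet. The names `consume_all`, `timeout_s`, `pause_s`, and the stub classes are hypothetical and exist only so the sketch runs without a live pipeline.

```python
import time


def consume_all(sink, n_expected, timeout_s=60.0, pause_s=0.5):
    """Poll the sink until n_expected events arrive or the timeout expires."""
    events = []
    deadline = time.monotonic() + timeout_s
    while len(events) < n_expected and time.monotonic() < deadline:
        resp = sink.consume()
        if resp.status_code == 200:
            events.append(resp.json())
        else:
            # Nothing ready yet: wait briefly instead of busy-looping.
            time.sleep(pause_s)
    return events


class StubResponse:
    """Stand-in for a sink response, used only for this demonstration."""

    def __init__(self, status_code, body=None):
        self.status_code = status_code
        self._body = body

    def json(self):
        return self._body


class StubSink:
    """Stand-in for a data sink that replays a fixed list of responses."""

    def __init__(self, responses):
        self._responses = list(responses)

    def consume(self):
        if self._responses:
            return self._responses.pop(0)
        return StubResponse(204)


stub = StubSink([
    StubResponse(204),
    StubResponse(200, {"text": "first"}),
    StubResponse(200, {"text": "second"}),
])
demo_events = consume_all(stub, n_expected=2, timeout_s=5.0, pause_s=0.0)
print(demo_events)
```

With the real sink, the call would look like `output_events = consume_all(data_sink, n_events)`, and a result shorter than `n_events` signals that the timeout was reached.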

In [24]:
import pandas as pd

display(pd.DataFrame(output_events))

|   | text | text_masked | entities |
|---|------|-------------|----------|
| 0 | An order was created by Kyle Garcia to be ship... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Kyle Garcia', '[ORG_59]': 'Gould... |
| 1 | An order was created by Alan Gonzalez to be sh... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Alan Gonzalez', '[LOC_60]': 'Cry... |
| 2 | An order was created by Rhonda Miles to be shi... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Rhonda Miles'} |
| 3 | An order was created by Jessica Leonard to be ... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Jessica Leonard', '[ORG_61]': 'C... |
| 4 | An order was created by Fernando Johnson to be... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Fernando Johnson', '[LOC_64]': '... |
| 5 | An order was created by Brooke Newton to be sh... | An order was created by [ORG_24] to be shipped... | {'[ORG_24]': 'Brooke Newton', '[ORG_61]': 'Wea... |
| 6 | An order was created by David Garcia to be shi... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'David Garcia', '[ORG_54]': 'PSC'} |
| 7 | An order was created by Luke Stephenson to be ... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Luke Stephenson', '[ORG_61]': 'R... |
| 8 | An order was created by Chad Harris to be ship... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Chad Harris', '[ORG_59]': 'Roy',... |
| 9 | An order was created by Shawn Evans to be ship... | An order was created by [PER_24] to be shipped... | {'[PER_24]': 'Shawn Evans', '[LOC_58]': 'Cook ... |
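
Each output event carries the masked text together with an `entities` mapping from placeholder to original value. The sketch below shows two things a consumer might do with that pair: restore the original text, and check that no mapped value still appears in the masked text. The function names `unmask` and `find_leaks` and the sample event are assumptions for illustration; only the `text_masked` and `entities` field names come from the output above.

```python
def unmask(text_masked, entities):
    """Restore the original text by substituting each placeholder with its value."""
    restored = text_masked
    for tag, value in entities.items():
        restored = restored.replace(tag, value)
    return restored


def find_leaks(text_masked, entities):
    """Return the mapped values that still appear verbatim in the masked text."""
    return [value for value in entities.values() if value in text_masked]


# Hand-written sample shaped like one output event.
event = {
    "text_masked": "An order was created by [PER_24] to be shipped.",
    "entities": {"[PER_24]": "Kyle Garcia"},
}

print(unmask(event["text_masked"], event["entities"]))
print(find_leaks(event["text_masked"], event["entities"]))
```

Note that the label in a placeholder reflects the model's guess and can be wrong: in row 5 above a person's name is tagged `ORG`, yet the value is still masked.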


## Monitor the pipeline

Go to the logs of the pipeline you created to monitor events in real time.

In [57]:
# Explore the pipeline logs in the web UI
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}/logs"
print(pipeline_url)

https://app.glassflow.dev/pipelines/477e28a5-aa8a-4ee8-9f58-3bed238112a0/logs
