# PII Recognizer

A function that detects PII (personally identifiable information) in text and anonymizes each detected entity.
In this notebook we go over the function's documentation and outputs, then walk through an end-to-end example of running it.

1. [Documentation](#chapter1)
2. [Results](#chapter2)
3. [End-to-end Demo](#chapter3)

<a id="chapter1"></a>
## 1. Documentation

The function receives a string as input and returns three values: a string with all PII entities anonymized, an HTML string with the PII entities highlighted, and a JSON string that explains how the models reached their results.


### 1.1. Parameters:
* **text**: `str`

    The input text that may contain PII.
    
* **model**: `str`
    
    - One of "whole", "spacy", "pattern", or "flair". The default is "whole". 
    
    
    Each model detects a specific set of entities. The "whole" model combines all three models and can detect every entity listed below. 
    
    
    - "spacy" : ["LOCATION", "PERSON","NRP","ORGANIZATION","DATE_TIME"]
    
    - "pattern": ["CREDIT_CARD", "SSN", "PHONE", "EMAIL"]
    
    - "flair": [ "LOCATION",
        "PERSON",
        "NRP",
        "GPE",
        "ORGANIZATION",
        "MAC_ADDRESS",
        "US_BANK_NUMBER",
        "IMEI",
        "TITLE",
        "LICENSE_PLATE",
        "US_PASSPORT",
        "CURRENCY",
        "ROUTING_NUMBER",
        "US_ITIN",
        "US_DRIVER_LICENSE",
        "AGE",
        "PASSWORD",
        "SWIFT_CODE"
        ]
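
The relationship between the model options can be sketched as a small lookup table. The dictionary below only restates the lists above; the names `ENTITIES_BY_MODEL` and `supported_entities` are illustrative and are not part of `pii_recognizer.py`, and treating "whole" as the plain union of the three lists is an assumption based on the description above.

```python
from typing import List

# Illustrative only: restates the entity lists documented above.
ENTITIES_BY_MODEL = {
    "spacy": ["LOCATION", "PERSON", "NRP", "ORGANIZATION", "DATE_TIME"],
    "pattern": ["CREDIT_CARD", "SSN", "PHONE", "EMAIL"],
    "flair": [
        "LOCATION", "PERSON", "NRP", "GPE", "ORGANIZATION", "MAC_ADDRESS",
        "US_BANK_NUMBER", "IMEI", "TITLE", "LICENSE_PLATE", "US_PASSPORT",
        "CURRENCY", "ROUTING_NUMBER", "US_ITIN", "US_DRIVER_LICENSE",
        "AGE", "PASSWORD", "SWIFT_CODE",
    ],
}


def supported_entities(model: str = "whole") -> List[str]:
    """Return the sorted entity labels a model option is documented to detect."""
    if model == "whole":
        # Assumption: "whole" is the union of the three individual models.
        labels = set()
        for names in ENTITIES_BY_MODEL.values():
            labels.update(names)
        return sorted(labels)
    if model not in ENTITIES_BY_MODEL:
        raise ValueError(f"unknown model option: {model!r}")
    return sorted(set(ENTITIES_BY_MODEL[model]))


print(supported_entities("pattern"))
print(len(supported_entities("whole")))
```

A lookup like this is handy for checking up front whether the entity you care about is covered by the model option you plan to pass.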

### 1.2. Outputs:

The function has three outputs.

* **anonymized_text**: `str`

    The input text with every PII entity replaced by its label.
    
* **html_str**: `str`
    
    An HTML string in which every PII entity is labeled and highlighted.
    
    
* **stats**: `str`
    
    A JSON-like string that explains which model detected each PII entity and with what confidence score.
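
To make the link between detected spans and `anonymized_text` concrete, here is a minimal sketch of label replacement. It is not the function's actual implementation: `replace_with_labels`, the sample sentence, and the hand-written spans are hypothetical, and the sketch assumes non-overlapping spans given as `start`/`end` character offsets like those in the report.

```python
from typing import Dict, List


def replace_with_labels(text: str, spans: List[Dict]) -> str:
    """Replace each [start, end) span with its <ENTITY_TYPE> label."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        label = "<" + span["entity_type"] + ">"
        text = text[: span["start"]] + label + text[span["end"]:]
    return text


# Hypothetical input and spans, written by hand for illustration.
sample = "Reach Jane at jane@example.com"
spans = [
    {"entity_type": "PERSON", "start": 6, "end": 10},
    {"entity_type": "EMAIL", "start": 14, "end": 30},
]
print(replace_with_labels(sample, spans))
```

Replacing from the last span backwards avoids recomputing offsets, because a label is rarely the same length as the text it replaces.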
    

<a id="chapter2"></a>
## 2. Results

The outputs of the function look like the following.

For example, if the input string is 

`John Doe 's ssn is 182838483, connect john doe with john_doe@gmail.com or 6288389029, he can pay you with 41482929939393`

The anonymized_text is 

`<PERSON>'s <ORGANIZATION> is <SSN>, connect <PERSON> with <PERSON> <EMAIL> or <PHONE>, he can pay you with <CREDIT_CARD>`

The html_str is

<html><body><p><span><span style="display:inline-flex;flex-direction:row;align-items:center;background:#21c35466;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">John Doe&#x27;s<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">PERSON</span></span> <span style="display:inline-flex;flex-direction:row;align-items:center;background:#80849566;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">ssn<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">ORGANIZATION</span></span> is <span style="display:inline-flex;flex-direction:row;align-items:center;background:#ffa42133;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">182838483<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">SSN</span></span>, connect me with <span style="display:inline-flex;flex-direction:row;align-items:center;background:#21c35466;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">john_doe@gmail.com<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">PERSON</span></span><span style="display:inline-flex;flex-direction:row;align-items:center;background:#ff4b4b33;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">john_doe@gmail.com<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">EMAIL</span></span> or <span style="display:inline-flex;flex-direction:row;align-items:center;background:#ff4b4b33;border-radius:0.5rem;padding:0.25rem 
0.5rem;overflow:hidden;line-height:1">6288389029<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">PHONE</span></span>, he can pay you with <span style="display:inline-flex;flex-direction:row;align-items:center;background:#ffa42133;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">41482929939393<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">CREDIT_CARD</span></span>
</span></p></body></html>
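
The markup above wraps each entity in a colored `span` followed by a small label. A simplified sketch of producing one such fragment is shown below; `highlight` is a hypothetical helper, not the function's own code, and the inline styles are a trimmed subset of the ones in the output above.

```python
import html


def highlight(text: str, label: str, color: str = "#21c35466") -> str:
    """Wrap an entity in a colored span followed by a small, faded label."""
    return (
        f'<span style="background:{color};border-radius:0.5rem;padding:0.25rem 0.5rem">'
        f"{html.escape(text)}"
        f'<span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">'
        f"{html.escape(label)}</span>"
        "</span>"
    )


print(highlight("John Doe's", "PERSON"))
```

Escaping the entity text is what turns the apostrophe into `&#x27;`, as seen in the output above.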

The JSON report that explains the output is

```yaml

[
  {
    "entity_type": "PERSON", # the label assigned to the entity
    "start": 0, # start position of the entity
    "end": 9,  # end position of the entity
    "score": 0.99, # the model's confidence score plus any context improvement
    "analysis_explanation": {
      "recognizer": "FlairRecognizer", # the recognizer that detected this entity
      "pattern_name": null,
      "pattern": null,
      "original_score": 0.99, # the original confidence score from the pre-trained model
      "score": 0.99, # the final score = original_score + score_context_improvement
      "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
      "score_context_improvement": 0, # the improvement contributed by the context
      "supportive_context_word": "",
      "validation_result": null
    },
    "recognition_metadata": {
      "recognizer_identifier": "Flair Analytics_5577088640",
      "recognizer_name": "Flair Analytics"
    }
  },
  ....
]
```
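
Because the report is a JSON string, it can be loaded with the standard `json` module and summarized. The sketch below uses a trimmed copy of the entry shown above (without the explanatory comments, which are not valid JSON); `summarize` is a hypothetical helper, and the score check simply restates the relation given in the comments.

```python
import json

# A trimmed copy of the entry shown above, without the explanatory comments.
stats = """[{"entity_type": "PERSON", "start": 0, "end": 9, "score": 0.99,
  "analysis_explanation": {"recognizer": "FlairRecognizer",
                           "original_score": 0.99, "score": 0.99,
                           "score_context_improvement": 0}}]"""


def summarize(report: str):
    """Return (entity_type, start, end, recognizer, recomputed_score) per entry."""
    rows = []
    for entry in json.loads(report):
        explanation = entry["analysis_explanation"]
        # final score = original_score + score_context_improvement
        recomputed = (
            explanation["original_score"] + explanation["score_context_improvement"]
        )
        rows.append(
            (
                entry["entity_type"],
                entry["start"],
                entry["end"],
                explanation["recognizer"],
                recomputed,
            )
        )
    return rows


for row in summarize(stats):
    print(row)
```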



<a id="chapter3"></a>
## 3. End-to-end Demo


```python
# Prepare the text file to analyze.
with open("./pii.txt", "w") as text_file:
    text_file.write("John smith's ssn is 182838483, connect him with John_smith@gmail.com or 6288389029, he can pay you with 41482929939393")
```

```python
# Prepare the MLRun function that generates the three output files and stores them as artifacts.
import mlrun

artifact_path = "./"
fn = mlrun.code_to_function(
    name="pii_recognizer",
    filename="pii_recognizer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="pii_recognize",
    description="This function is used to recognize PII in a given text",
)

run_obj = fn.run(
    artifact_path=artifact_path,
    params={
        "model": "whole",
        "artifact_input_path": "./pii.txt",
        "output_key": "output",
        "html_key": "hightlighted",
        "rpt_key": "explanation",
    },
    local=True,
)
```

> 2023-07-10 22:12:21,438 [info] Storing function: {'name': 'pii-recognizer-pii-recognize', 'uid': 'bb696b4076424a8a88412f17168c6a53', 'db': None}
2023-07-10 22:12:21,673 loading file /Users/Peng_Wei/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-10 22:12:23,904 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
2023-07-10 22:12:25,940 loading file /Users/Peng_Wei/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-10 22:12:28,072 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...8c6a53 | 0 | Jul 11 05:12:21 | completed | pii-recognizer-pii-recognize | v3io_user=PengWei, kind=, owner=PengWei, host=M-C02G416TML87 | | model=whole, artifact_input_path=./pii.txt, output_key=output, html_key=hightlighted, rpt_key=explanation | | output, hightlighted, explanation |





> 2023-07-10 22:12:36,160 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-pii-recognize'}


```python
# Check the anonymized text.
context = mlrun.get_or_create_ctx("pii")
arti_output = context.get_cached_artifact("output")
output = mlrun.get_dataitem(arti_output.get_target_path()).get().decode("utf-8")
print(output)
```

`<PERSON> <ORGANIZATION> is <SSN>, connect him with <EMAIL> or <PHONE>, he can pay you with <CREDIT_CARD>`


```python
# Check the highlighted HTML.
from IPython.display import HTML, display

html_output = context.get_cached_artifact("hightlighted")
html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
display(HTML(html_str))
```


```python
# Check the JSON report with the explanation.
import pprint

rpt_output = context.get_cached_artifact("explanation")
rpt_str = mlrun.get_dataitem(rpt_output.get_target_path()).get().decode("utf-8")
pprint.pprint(rpt_str)
```


('[{"entity_type": "PERSON", "start": 53, "end": 58, "score": 1.0, '
 '"analysis_explanation": {"recognizer": "FlairRecognizer", "pattern_name": '
 'null, "pattern": null, "original_score": 1.0, "score": 1.0, '
 '"textual_explanation": "Identified as PER by Flair\'s Named Entity '
 'Recognition", "score_context_improvement": 0, "supportive_context_word": "", '
 '"validation_result": null}, "recognition_metadata": '
 '{"recognizer_identifier": "Flair Analytics_6005305888", "recognizer_name": '
 '"Flair Analytics"}}, {"entity_type": "PERSON", "start": 0, "end": 12, '
 '"score": 0.97, "analysis_explanation": {"recognizer": "FlairRecognizer", '
 '"pattern_name": null, "pattern": null, "original_score": 0.97, "score": '
 '0.97, "textual_explanation": "Identified as PER by Flair\'s Named Entity '
 'Recognition", "score_context_improvement": 0, "supportive_context_word": "", '
 '"validation_result": null}, "recognition_metadata": '
 '{"recognizer_identifier": "Flair Analytics_6005305888", "re