# PII Recognizer

A function that detects PII (personally identifiable information) in text and anonymizes each detected entity.

In this notebook we go over the function's documentation and outputs, and walk through an end-to-end example of running it.

1. [Documentation](#chapter1)
2. [Results](#chapter2)
3. [End-to-end Demo](#chapter3)

<a id="chapter1"></a>
## 1. Documentation

The function receives the path of a directory that contains text files. It walks through the directory and collects all the text files, detects the PII entities in each file, and replaces each entity with its label. It also generates an HTML file with all the PII entities highlighted, and a JSON report that explains the detection process.
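The flow described above can be sketched in plain Python. This is a minimal illustration and not the function's actual implementation: `anonymize_text`, `anonymize_directory`, and the single e-mail regex are hypothetical stand-ins for the real recognizers, and the HTML and JSON outputs are left out.

```python
import re
from pathlib import Path

# Toy stand-in for the real recognizers: one deliberately simple e-mail regex.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize_text(text: str) -> str:
    """Replace every e-mail address in the text with the <EMAIL> label."""
    return EMAIL_PATTERN.sub("<EMAIL>", text)


def anonymize_directory(input_path: str, output_path: str, output_suffix: str) -> list:
    """Anonymize every .txt file under input_path and write the copies to output_path."""
    input_dir = Path(input_path).resolve()
    output_dir = Path(output_path).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for txt_file in sorted(input_dir.rglob("*.txt")):
        # Skip results of a previous run when the output directory is nested in the input one.
        if output_dir in txt_file.parents:
            continue
        anonymized = anonymize_text(txt_file.read_text(encoding="utf-8"))
        # pii.txt with suffix "anonymized" becomes pii_anonymized.txt
        target = output_dir / f"{txt_file.stem}_{output_suffix}{txt_file.suffix}"
        target.write_text(anonymized, encoding="utf-8")
        written.append(target)
    return written
```

The file list is materialized with `sorted(...)` before anything is written, so files created during the run are never picked up by the same run.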


### 1.1. Parameters:
* **context**: `mlrun.MLClientCtx`
    
    The MLRun context
    
* **model**: `str`
    
    - "whole", "spacy", "pattern", "flair". The default is "whole".
    
    Each model can detect a specific set of entities. The "whole" model combines all three models, so it can detect all the entities listed below.
    
    
    - "spacy" : ["LOCATION", "PERSON","NRP","ORGANIZATION","DATE_TIME"]
    
    - "pattern": ["CREDIT_CARD", "SSN", "PHONE", "EMAIL"]
    
    - "flair": ["LOCATION", "PERSON", "NRP", "GPE", "ORGANIZATION", "MAC_ADDRESS", "US_BANK_NUMBER", "IMEI", "TITLE", "LICENSE_PLATE", "US_PASSPORT", "CURRENCY", "ROUTING_NUMBER", "US_ITIN", "US_DRIVER_LICENSE", "AGE", "PASSWORD", "SWIFT_CODE"]
       
       
* **input_path**: `str`

    The input directory that contains all the text files.

* **output_path**: `str`

    The directory used to store the anonymized text files. MLRun also uses it to log the artifact as a zip file.

* **output_suffix**: `str`

    The suffix added to each input file name. For example, if the input text file is pii.txt and output_suffix is "anonymized", the output file is pii_anonymized.txt.

* **html_key**: `str`

    The artifact name of the HTML file.
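As a rough illustration of how the `model` parameter maps to entity types, the sketch below resolves a model name to the entities it can detect. The entity lists are copied from the parameter description above; the `MODEL_ENTITIES` table and the `supported_entities` helper are hypothetical names introduced here and are not part of the function's API.

```python
# Entity lists copied from the parameter description above.
MODEL_ENTITIES = {
    "spacy": ["LOCATION", "PERSON", "NRP", "ORGANIZATION", "DATE_TIME"],
    "pattern": ["CREDIT_CARD", "SSN", "PHONE", "EMAIL"],
    "flair": [
        "LOCATION", "PERSON", "NRP", "GPE", "ORGANIZATION", "MAC_ADDRESS",
        "US_BANK_NUMBER", "IMEI", "TITLE", "LICENSE_PLATE", "US_PASSPORT",
        "CURRENCY", "ROUTING_NUMBER", "US_ITIN", "US_DRIVER_LICENSE", "AGE",
        "PASSWORD", "SWIFT_CODE",
    ],
}


def supported_entities(model: str = "whole") -> list:
    """Return the entity types a model can detect (hypothetical helper)."""
    if model == "whole":
        # "whole" combines the three models, so it covers every list.
        names = [name for group in MODEL_ENTITIES.values() for name in group]
    elif model in MODEL_ENTITIES:
        names = MODEL_ENTITIES[model]
    else:
        raise ValueError(f"Unknown model: {model!r}")
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(names))


print(supported_entities("pattern"))
```

Several entity types (for example "PERSON") appear in more than one list, which is why the combined list is deduplicated.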
 
 
 
 

### 1.2. Outputs:

This function has three outputs.

* **output_path**: `str`
    
    The directory where all the anonymized text files are stored

* **rpt_json**: `dict`

    A report dict that explains how the model detected each PII entity
    
* **errors** : `dict`

    A dict of the errors, if any, raised while processing the text files
    

<a id="chapter2"></a>
## 2. Results

The results of the function look like the following.

For example, if the input string is:

`John Doe 's ssn is 182838483, connect john doe with john_doe@gmail.com or 6288389029, he can pay you with 41482929939393`

The anonymized_text is 

`<PERSON>'s <ORGANIZATION> is <SSN>, connect <PERSON> with <PERSON> <EMAIL> or <PHONE>, he can pay you with <CREDIT_CARD>`

The html_str is

<html><body><p><span><span style="display:inline-flex;flex-direction:row;align-items:center;background:#21c35466;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">John Doe&#x27;s<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">PERSON</span></span> <span style="display:inline-flex;flex-direction:row;align-items:center;background:#80849566;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">ssn<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">ORGANIZATION</span></span> is <span style="display:inline-flex;flex-direction:row;align-items:center;background:#ffa42133;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">182838483<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">SSN</span></span>, connect me with <span style="display:inline-flex;flex-direction:row;align-items:center;background:#21c35466;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">john_doe@gmail.com<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">PERSON</span></span><span style="display:inline-flex;flex-direction:row;align-items:center;background:#ff4b4b33;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">john_doe@gmail.com<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">EMAIL</span></span> or <span style="display:inline-flex;flex-direction:row;align-items:center;background:#ff4b4b33;border-radius:0.5rem;padding:0.25rem 
0.5rem;overflow:hidden;line-height:1">6288389029<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">PHONE</span></span>, he can pay you with <span style="display:inline-flex;flex-direction:row;align-items:center;background:#ffa42133;border-radius:0.5rem;padding:0.25rem 0.5rem;overflow:hidden;line-height:1">41482929939393<span style="border-left:1px solid;opacity:0.1;margin-left:0.5rem;align-self:stretch"></span><span style="margin-left:0.5rem;font-size:0.75rem;opacity:0.5">CREDIT_CARD</span></span>
</span></p></body></html>

The JSON report that explains the output is:

```yaml

[
  {
    "entity_type": "PERSON", # the label assigned to the entity
    "start": 0, # start position of the entity
    "end": 9,  # end position of the entity
    "score": 0.99, # the model's confidence score + context improvement
    "analysis_explanation": {
      "recognizer": "FlairRecognizer", # which recognizer is used to recognize this entity
      "pattern_name": null,
      "pattern": null,
      "original_score": 0.99, # the original confidence score from the pre-trained model
      "score": 0.99, # the final score = original_score + score_context_improvement
      "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
      "score_context_improvement": 0, # the improvement from the context
      "supportive_context_word": "",
      "validation_result": null
    },
    "recognition_metadata": {
      "recognizer_identifier": "Flair Analytics_5577088640",
      "recognizer_name": "Flair Analytics"
    }
  },
  ....
]
```
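To make the `start`, `end`, and `entity_type` fields concrete, here is a small sketch that applies report entries to a text and counts the entity types found. It is not the function's own code: `apply_report` and `count_entity_types` are hypothetical helpers, and the sketch assumes that `end` is exclusive (Python slice style) and that the spans do not overlap.

```python
from collections import Counter


def apply_report(text: str, entries: list) -> str:
    """Replace each reported span with its <ENTITY_TYPE> label."""
    result = text
    # Work from the last span to the first so earlier offsets stay valid.
    for entry in sorted(entries, key=lambda e: e["start"], reverse=True):
        label = f"<{entry['entity_type']}>"
        result = result[: entry["start"]] + label + result[entry["end"]:]
    return result


def count_entity_types(entries: list) -> Counter:
    """Count how many entities of each type the report contains."""
    return Counter(entry["entity_type"] for entry in entries)


# Illustrative entries with made-up offsets, shaped like the report above.
text = "John Doe's ssn is 182838483"
entries = [
    {"entity_type": "PERSON", "start": 0, "end": 8},
    {"entity_type": "SSN", "start": 18, "end": 27},
]
print(apply_report(text, entries))  # <PERSON>'s ssn is <SSN>
print(count_entity_types(entries))
```

Replacing from right to left avoids recomputing offsets after each substitution, since a label rarely has the same length as the text it replaces.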



<a id="chapter3"></a>
## 3. End-to-end Demo


```python
# Prepare the MLRun function that generates the files and stores them as artifacts

import mlrun

artifact_path = "./"
fn = mlrun.code_to_function(
    name="pii_recognizer",
    filename="pii_recognizer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="recognize_pii",
    description="This function is used to recognize PII in a given text",
)

run_obj = fn.run(
    artifact_path=artifact_path,
    params={
        "model": "whole",
        "input_path": "./data/",
        "output_path": "./data/output/",
        "output_suffix": "output",
        "html_key": "highlighted",
    },
    returns=["output_path: path", "rpt_json: file", "errors: file"],
    local=True,
)
```

> 2023-07-21 14:43:02,946 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': 'd56a535779f4475cbc2582d87f30845b', 'db': None}
2023-07-21 14:43:03,065 loading file /Users/Peng_Wei/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-21 14:43:09,028 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
Model loaded


Processing files:   0%|          | 0/2 [00:00<?, ?file/s]



| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| default | ...30845b | 0 | Jul 21 21:43:02 | completed | pii-recognizer-recognize-pii | v3io_user=PengWei<br>kind=<br>owner=PengWei<br>host=M-C02G416TML87 | | model=whole<br>input_path=./data/<br>output_path=./data/output/<br>output_suffix=output<br>html_key=highlighted | | highlighted<br>output_path<br>rpt_json<br>errors |





> 2023-07-21 14:43:16,148 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}


```python
# Get the MLRun context
context = mlrun.get_or_create_ctx('pii')
```

```python
# Check the highlighted HTML
from IPython.display import HTML, display

html_output = context.get_cached_artifact("highlighted")
html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
display(HTML(html_str))
```


```python
# Check the JSON report that explains the results
import json

rpt_output = context.get_cached_artifact("rpt_json")
rpt_str = mlrun.get_dataitem(rpt_output.get_target_path()).get().decode("utf-8")
obj = json.loads(rpt_str)

# Pretty-print the JSON
json_formatted_str = json.dumps(obj, indent=4)
print(json_formatted_str)
```


{
    "data/letter.txt": [
        {
            "entity_type": "PERSON",
            "start": 9,
            "end": 17,
            "score": 1.0,
            "analysis_explanation": {
                "recognizer": "FlairRecognizer",
                "pattern_name": null,
                "pattern": null,
                "original_score": 1.0,
                "score": 1.0,
                "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
                "score_context_improvement": 0,
                "supportive_context_word": "",
                "validation_result": null
            },
            "recognition_metadata": {
                "recognizer_identifier": "Flair Analytics_5394533440",
                "recognizer_name": "Flair Analytics"
            }
        },
        {
            "entity_type": "LOCATION",
            "start": 248,
            "end": 255,
            "score": 1.0,
            "analysis_explanation": {
                "recognizer"