# Anonymization task

Let's compare several methods:

- Microsoft Presidio
- Flair
- Llama 2
- GPT-4 32k

Read about the performance of each below.

## Presidio: example from docs

https://microsoft.github.io/presidio/getting_started/

In [84]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text="My phone number is 212-555-5555"

# Set up the engine, loads the NLP module (spaCy model by default) 
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text=text,
                           entities=["PHONE_NUMBER"],
                           language='en')
print(results)

# Analyzer results are passed to the AnonymizerEngine for anonymization

anonymizer = AnonymizerEngine()

anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results)

print(anonymized_text)

[type: PHONE_NUMBER, start: 19, end: 31, score: 0.75]
text: My phone number is <PHONE_NUMBER>
items:
[
    {'start': 19, 'end': 33, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'}
]



## NPD data

In [85]:
from datasets import load_dataset, ClassLabel, Value, Features

features = Features({'doc_id': Value('string'), 'meta': Value('string'), 'raw_content': Value('string')})

force_llm_dataset_scrubbed = load_dataset("json", data_files="../data/force_llm_corpus_scrubbed.jsonl", features=features)


In [86]:
force_llm_dataset_scrubbed

DatasetDict({
    train: Dataset({
        features: ['doc_id', 'meta', 'raw_content'],
        num_rows: 2941971
    })
})

In [87]:
text = force_llm_dataset_scrubbed['train'][140]['raw_content']
text

"2.2 HSEQ On the M07-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the Karmfork. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on the Billy Pugh for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th of July 2010, t

### Add some PII

In [89]:
text = """2.2 HSEQ On the M07-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the Karmfork. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on the Billy Pugh for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th of July 2010, the second incident occurred, 30 cc base oil spilled into sea. The problem occurred with the base oil wash pump for the shakers. Lars Hansen (the SCS engineer) changed out the pump for spare. He disconnected the used pump and lined up spare pump. It was observed that base oil drips were coming from the beam below shakers. The shaker hand closed in base oil supply. It was determined that there was failure of seal on the crowfoot hose connection. The beams were cleaned instantly with rags and no further oil dripped into the sea. On the 6th of July, the third incident occurred while drilling the 12 1/4 hole with three mud pumps on line and 1/2 liners installed. The pump pressure was steady at 4100psi while pumping 915 gpm. significant pressure drop was observed, it was thought the reason was due to one off the pop-offs going off. Mechanical pop-offs Retsco type 'C' are set at 4800 psi and electrical pop-offs at 4600 psi. 
Upon further investigation of the pump room, it was discovered that the upper body had parted from the lower body due to shearing of the cap screws which hold the bodies together."""
text

"2.2 HSEQ On the M07-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the Karmfork. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on the Billy Pugh for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th of July 2010, t

In [90]:
results = analyzer.analyze(text=text,
                           entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
                           language='en')
print(results)

[type: PERSON, start: 16, end: 19, score: 0.85, type: PERSON, start: 550, end: 558, score: 0.85, type: PERSON, start: 766, end: 780, score: 0.85, type: PERSON, start: 1127, end: 1138, score: 0.85]


In [91]:
anonymizer = AnonymizerEngine()

anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results)

print(anonymized_text)

text: 2.2 HSEQ On the <PERSON>-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the <PERSON>. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on <PERSON> for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th of July 201
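The offsets in the result list are hard to judge by eye, so it helps to print the text each span covers. The following is a minimal sketch, not part of Presidio: `flagged_text` is a hypothetical helper, and plain `(entity_type, start, end)` tuples stand in for the analyzer's result objects so the snippet runs without the library installed.

```python
def flagged_text(text, spans):
    """Return the substring covered by each (entity_type, start, end) span."""
    return [(entity_type, text[start:end]) for entity_type, start, end in spans]


# The first span reported above, applied to the start of the same paragraph.
sample = "2.2 HSEQ On the M07-07 well number of incidents/near misses occurred"
print(flagged_text(sample, [("PERSON", 16, 19)]))  # prints [('PERSON', 'M07')]
```

With the real results, the spans could be built as `[(r.entity_type, r.start, r.end) for r in results]`. Read this way, the anonymized output above shows that `M07`, `Karmfork` and `the Billy Pugh` were all replaced with `<PERSON>`, so three of the four hits look like false positives.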

## Try `flair`

https://github.com/flairNLP/flair

In [92]:
from flair.data import Sentence
from flair.nn import Classifier

# load the NER tagger
tagger = Classifier.load('ner')

2023-12-04 14:27:42,340 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
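The tag dictionary above follows the BIOES scheme: `B-`, `I-` and `E-` mark the beginning, inside and end of a multi-token entity, `S-` marks a single-token entity, and `O` marks everything else. Flair decodes these tags into spans internally; the sketch below only illustrates the idea. `bioes_to_spans` is a hypothetical helper, and the tokens and tags are hand-written for illustration, not model output.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (label, text) entity spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":  # single-token entity
            spans.append((kind, token))
            current, label = [], None
        elif prefix == "B":  # a multi-token entity begins
            current, label = [token], kind
        elif prefix in ("I", "E") and current and kind == label:
            current.append(token)
            if prefix == "E":  # the entity ends here
                spans.append((label, " ".join(current)))
                current, label = [], None
        else:  # "O", a special tag, or an inconsistent sequence
            current, label = [], None
    return spans


# Hand-written example tags (an assumption, not what the tagger returned).
tokens = ["Lars", "Hansen", "changed", "the", "pump", "on", "Elbe"]
tags = ["B-PER", "E-PER", "O", "O", "O", "O", "S-MISC"]
print(bioes_to_spans(tokens, tags))
```

For anonymization, only the `PER` spans would normally be of interest.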


In [93]:
# make a sentence
sentence = Sentence(text)

# run NER over sentence
tagger.predict(sentence)

# print the sentence with all annotations
print(sentence)

Sentence[421]: "2.2 HSEQ On the M07-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the Karmfork. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on the Billy Pugh for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th 

## Llama 2

Loosely following https://swharden.com/blog/2023-07-29-ai-chat-locally-with-python/

Model: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/blob/main/llama-2-7b-chat.ggmlv3.q8_0.bin (the cell below loads a `.gguf` file with the same base name)

In [94]:
# load the large language model file
from llama_cpp import Llama
LLM = Llama(model_path="../data/llama/llama-2-7b-chat.ggmlv3.q8_0.gguf")

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../data/llama/llama-2-7b-chat.ggmlv3.q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:           blk.0.attn_norm.weight f32      [  4096,   

In [95]:
# create a text prompt
prompt = "Q: What are the names of the days of the week? A:"

# generate a response (takes several seconds)
output = LLM(prompt)

# display the response
print(output["choices"][0]["text"])

 The names of the days of the week in English are: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday.



llama_print_timings:        load time =   14159.15 ms
llama_print_timings:      sample time =       6.93 ms /    33 runs   (    0.21 ms per token,  4761.22 tokens per second)
llama_print_timings: prompt eval time =   14159.08 ms /    16 tokens (  884.94 ms per token,     1.13 tokens per second)
llama_print_timings:        eval time =    7441.48 ms /    32 runs   (  232.55 ms per token,     4.30 tokens per second)
llama_print_timings:       total time =   21679.07 ms


In [96]:
output

{'id': 'cmpl-315ea745-9859-42f9-8015-999a834b8f50',
 'object': 'text_completion',
 'created': 1701696499,
 'model': '../data/llama/llama-2-7b-chat.ggmlv3.q8_0.gguf',
 'choices': [{'text': ' The names of the days of the week in English are: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday.',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 16, 'completion_tokens': 32, 'total_tokens': 48}}

### Try on data

In [99]:
short_text = """2.2 HSEQ On the M07-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the Karmfork. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on the Billy Pugh for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th of July 2010, the second incident occurred, 30 cc base oil spilled into sea. The problem occurred with the base oil wash pump for the shakers. Lars Hansen (the SCS engineer) changed out the pump for spare. He disconnected the used pump and lined up spare pump. It was observed that base oil drips were coming from the beam below shakers. The shaker hand closed in base oil supply. It was determined that there was failure of seal on the crowfoot hose connection. The beams were cleaned instantly with rags and no further oil dripped into the sea."""
short_text  # Removed last paragraph.

"2.2 HSEQ On the M07-07 well number of incidents/near misses occurred, as indicated below: The first incident happened during the rig move on the 1st of June 2010. The rig was pinned on stand-off location at Cirrus M7-A platform. The anchor was issued to tug MV Elbe. It is not clear what exactly happened, the statement of the IP was not clear. Deck crew was connecting the towing wire with shackle to the buoy. The IP was inserting safety pin when the wire suddenly tensioned. The thumb of the IP's left hand got trapped between the shackle and the Karmfork. The statement of the IP is not clear as he himself did not comprehend what happened. The captain of the Elbe required medical assistance from the rig medic. The IP was capable of walking to and stepping on the Billy Pugh for transport with assistance of NLB crewmember. On board the IP was examined. The IP had trapped his left thumb sustaining approximately 2cm open wound over the top of first left thumb joint On the 4th of July 2010, t

In [102]:
prompt = (
    f"""Q: In the following paragraph, 'IP' refers to 'injured party'. Please identify any personally identifiable """
    f"""information (PII, such as personal names, email addresses, etc) in the following paragraph. Note that there """
    f"""might be none.\n\n{short_text}\n\nA:"""
)

In [107]:
output = LLM(prompt, temperature=0.9)

# display the response
print(output["choices"][0]["text"])

Llama.generate: prefix-match hit


 The following information was identified as potentially identifiable in the provided paragraph:
1. Name (Lars Hansen)
2. Email address (not provided)
3. Personal names (IP, captain of the Elbe)
4. Contact information (not provided)



llama_print_timings:        load time =   14159.15 ms
llama_print_timings:      sample time =      17.25 ms /    57 runs   (    0.30 ms per token,  3305.11 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   14562.78 ms /    57 runs   (  255.49 ms per token,     3.91 tokens per second)
llama_print_timings:       total time =   14755.03 ms


## ChatGPT

In [116]:
from openai import AzureOpenAI
from dotenv import load_dotenv
import os

# Load the shared and secret environment variables.
load_dotenv("../.env.shared")
load_dotenv("../.env.secret")

client = AzureOpenAI(
    api_version=os.environ["OPENAI_API_VERSION"],
    azure_endpoint=os.environ["OPENAI_API_BASE"],
    api_key=os.environ["OPENAI_API_KEY"]
)

model = os.environ["GPT432k_DEPLOYMENT"]

question = (
    """Q: In the following paragraph, 'IP' refers to 'injured party'. Please identify any personally identifiable """
    """information (PII, such as personal names, email addresses, etc) in the following paragraph. Note that there """
    """might be none.\n\n{}\n\nA:"""
)

prompt = question.format(text)

completion = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])

print(completion.choices[0].message.content)

The personally identifiable information in the paragraph is the name 'Lars Hansen'.


In [117]:
completion.choices[0]

Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="The personally identifiable information in the paragraph is the name 'Lars Hansen'.", role='assistant', function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})

### Without PII

In [112]:
text0 = force_llm_dataset_scrubbed['train'][140]['raw_content']

In [118]:
prompt = question.format(text0)

completion = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])

print(completion.choices[0].message.content)

This paragraph does not contain any personally identifiable information (PII).


## Strategy

- Run a cheap first pass over the documents (e.g. Presidio)
- Analyse what's left with an LLM
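One possible reading of this strategy is sketched below: scrub pattern-based PII cheaply, then hand the scrubbed text to an LLM to look for what remains. Everything here is an assumption for illustration. Two regexes stand in for Presidio, `fake_llm` is a stub standing in for Llama 2 or GPT-4, and the function names are hypothetical.

```python
import re

# Stand-in for the cheap first pass (Presidio in this notebook): two simple regexes.
CHEAP_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def cheap_pass(text):
    """Replace pattern-based PII with <ENTITY_TYPE> placeholders."""
    for entity_type, pattern in CHEAP_PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


def two_stage(docs, llm_find_pii):
    """Run the cheap pass on every document, then ask the LLM about the result.

    `llm_find_pii` is any callable that takes a string and returns a list of
    PII strings it found.
    """
    results = []
    for doc in docs:
        cleaned = cheap_pass(doc)
        results.append((cleaned, llm_find_pii(cleaned)))
    return results


def fake_llm(text):
    """Hypothetical stub so the sketch runs offline; it only knows one name."""
    return [name for name in ["Lars Hansen"] if name in text]


docs = ["Call Lars Hansen on 212-555-5555.", "The pump pressure was steady."]
for cleaned, leftover in two_stage(docs, fake_llm):
    print(cleaned, leftover)
```

Keeping the LLM behind a plain callable means the expensive stage can be swapped or skipped without touching the cheap one.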

## Ideas

### Redaction with references

- Smart redaction: instead of replacing a personal name with `<PERSON>`, substitute a descriptive reference such as "the captain of the vessel" or "the first author of the X report". Replacing three different people's names with the same `<PERSON>` tag loses a lot of meaning.
- Better still if these impersonal references stay consistent within a document, or across multiple documents. At the very least, use `PERSON_1`, `PERSON_2`, etc.
- Given a knowledge graph, or access to an org chart, job titles could be substituted for names fairly easily.
- GPT seems to have a chance at this kind of task.
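The numbered-placeholder idea can be sketched in a few lines. This is a minimal illustration, assuming the names have already been detected by some earlier step; `pseudonymize` is a hypothetical helper, and "Kari Berg" is a made-up name added for the example.

```python
def pseudonymize(text, names, mapping=None, label="PERSON"):
    """Replace each distinct name with a stable numbered placeholder.

    Pass the returned mapping back in to keep numbering consistent
    across several documents.
    """
    mapping = {} if mapping is None else mapping
    for name in names:
        if name not in mapping:
            mapping[name] = f"<{label}_{len(mapping) + 1}>"
    # Replace longer names first so a short name never clobbers part of a longer one.
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, mapping[name])
    return text, mapping


text = "Lars Hansen changed the pump. Kari Berg helped. Lars Hansen closed the valve."
redacted, mapping = pseudonymize(text, ["Lars Hansen", "Kari Berg"])
print(redacted)

# Reusing the mapping keeps the same person on the same placeholder.
later, mapping = pseudonymize("Kari Berg signed the report.", ["Kari Berg"], mapping)
print(later)
```

Swapping the placeholder for a job title would only need a different value in the mapping, for example one looked up from an org chart.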

### Etc