# Code from [here](https://github.com/microsoft/presidio/blob/main/docs/analyzer/nlp_engines/transformers.md)

This notebook tries out Presidio's transformers-based NLP engine.
It first runs the given (English) configuration on an English text, then applies the same configuration to a Finnish text, and finally switches to a Finnish model.
The model configurations are stored in tr-[language]-config.yml.
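
As a rough illustration of what such a configuration holds, the sketch below builds the equivalent structure as a Python dictionary. The layout follows the linked Presidio documentation, and `dslim/bert-base-ner` is the English model reported in the output further down. The helper name `build_transformers_config`, the spaCy model names, and the Finnish placeholder are assumptions for illustration, not the contents of the actual config files.

```python
def build_transformers_config(models):
    """Build an NLP engine configuration dict.

    models maps a language code to a (spacy_model, transformers_model) pair.
    """
    return {
        "nlp_engine_name": "transformers",
        "models": [
            {
                "lang_code": lang,
                "model_name": {"spacy": spacy_name, "transformers": hf_name},
            }
            for lang, (spacy_name, hf_name) in models.items()
        ],
    }


config = build_transformers_config(
    {
        "en": ("en_core_web_sm", "dslim/bert-base-ner"),
        # Assumption: the Finnish entry is a placeholder, not a tested model
        "fi": ("fi_core_news_sm", "<finnish-ner-model>"),
    }
)
print(config)
```

A dictionary like this can likely be passed as `NlpEngineProvider(nlp_configuration=config)` instead of a file path, which may be convenient when switching models between runs.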

***
## English model

In [1]:
!pip -q install "presidio-analyzer[transformers]"
!pip -q install presidio-anonymizer

In [1]:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

text = "My name is Don and my phone number is 555-223-4495"

# Path to the configuration file that names the NLP engine and its models
conf_file = "./testi-config.yml"

# Create NLP engine based on configuration
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine, 
    supported_languages=["en","fi"]
)

results = analyzer.analyze(text=text, language="en")
print(results)

Some weights of the model checkpoint at dslim/bert-base-ner were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[type: PHONE_NUMBER, start: 38, end: 50, score: 0.75]
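
The result above reports character offsets into the input string. The sketch below shows how those offsets map back to the matched text. `Span` is a hypothetical stand-in that mirrors the printed fields, so the example runs without Presidio installed. With the real `results` list, the same slicing should work on each result's `start` and `end`.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields printed in the output above
Span = namedtuple("Span", ["entity_type", "start", "end", "score"])

sample_text = "My name is Don and my phone number is 555-223-4495"
sample_spans = [Span("PHONE_NUMBER", 38, 50, 0.75)]


def matched_substrings(text, spans):
    """Return (entity_type, matched_text, score) for every span."""
    return [(s.entity_type, text[s.start:s.end], s.score) for s in spans]


matches = matched_substrings(sample_text, sample_spans)
print(matches)
```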


In [2]:
# The cell above runs the analysis; this cell anonymizes the detected entities
from presidio_anonymizer import AnonymizerEngine

# Analyzer results are passed to the AnonymizerEngine for anonymization

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized_text)


text: My name is Don and my phone number is <PHONE_NUMBER>
items:
[
    {'start': 38, 'end': 52, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'}
]
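
To make the `replace` operator in the output above concrete, here is a minimal sketch of offset-based replacement in plain Python. It is not Presidio's implementation. The function name `replace_spans` is hypothetical, and the sketch assumes the spans do not overlap.

```python
def replace_spans(text, spans):
    """Replace each (start, end, entity_type) span with an <ENTITY_TYPE> tag.

    Assumes spans do not overlap.
    """
    out = text
    # Work from the last span to the first so earlier offsets stay valid
    for start, end, entity_type in sorted(spans, reverse=True):
        out = out[:start] + f"<{entity_type}>" + out[end:]
    return out


sample = "My name is Don and my phone number is 555-223-4495"
redacted = replace_spans(sample, [(38, 50, "PHONE_NUMBER")])
print(redacted)
```

The replaced text ends at offset 52 rather than 50 because the tag `<PHONE_NUMBER>` is two characters longer than the original number, which matches the `'end': 52` reported by the anonymizer.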



## Finnish text on English model

In [None]:
# Finnish for: "My name is Aku and my phone number is 044 2235595"
text = "Minun nimeni on Aku ja puhelinnumeroni on 044 2235595"

results = analyzer.analyze(text=text, language="en")
print(results)
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text)

## Finnish on Finnish model

In [None]:
# Finnish for: "My name is Aku and my phone number is 051-223-5595"
text = "Minun nimeni on Aku ja puhelinnumeroni on 051-223-5595"

results = analyzer.analyze(text=text, language="fi")
#print(results)

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized_text)

## Test set?

In [None]:
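# Load the test set: one JSON object per line (JSON Lines)
import json

file_path = 'testset.jsonl'

data = []
# Read as UTF-8 explicitly so Finnish characters decode the same on every platform
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        if line.strip():  # skip blank lines, which json.loads would reject
            data.append(json.loads(line))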

In [None]:
for d in data:
    results = analyzer.analyze(text=d["original"], language=d["language"])
    anonymized_text = anonymizer.anonymize(text=d["original"], analyzer_results=results)
    print(anonymized_text.text)
    print(d["redacted"])
    print("---------")
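
Comparing the printed pairs by eye does not scale, so one option is to score them. The sketch below computes a simple exact-match rate between predicted and expected redactions. The function name `exact_match_rate` and the two sample pairs are made up for illustration; they are not taken from `testset.jsonl`.

```python
def exact_match_rate(pairs):
    """Return the fraction of (predicted, expected) pairs that match exactly.

    Surrounding whitespace is ignored; an empty list scores 0.0.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, expected in pairs
               if predicted.strip() == expected.strip())
    return hits / len(pairs)


# Illustrative pairs only, not real test-set records
sample_pairs = [
    ("My name is Don and my phone number is <PHONE_NUMBER>",
     "My name is <PERSON> and my phone number is <PHONE_NUMBER>"),
    ("Call <PHONE_NUMBER>", "Call <PHONE_NUMBER>"),
]
rate = exact_match_rate(sample_pairs)
print(f"{rate:.0%}")
```

In the loop above, the pairs could be collected as `(anonymized_text.text, d["redacted"])` for each record. Exact match is a strict measure: a single missed entity fails the whole record, so a per-entity comparison may be more informative.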