# spaCy implementation, from the [Presidio spaCy/Stanza documentation](https://github.com/microsoft/presidio/blob/main/docs/analyzer/nlp_engines/spacy_stanza.md)

This notebook uses only spaCy models, which include an [NER component](https://spacy.io/models). The model configurations are in sc-[language]-config.YAML.
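
The same configuration can also be expressed as a Python dict instead of a YAML file. The sketch below shows the assumed shape of such a configuration for the two languages used in this notebook. `build_spacy_config` is a hypothetical helper, and the model names (`en_core_web_lg`, `fi_core_news_lg`) are assumptions, since the config file itself is not shown here.

```python
def build_spacy_config(models_by_lang):
    """Build the dict form of a Presidio NLP engine configuration (hypothetical helper)."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": model}
            for lang, model in models_by_lang.items()
        ],
    }

# Assumed model names; substitute whichever spaCy models are installed
configuration = build_spacy_config({"en": "en_core_web_lg", "fi": "fi_core_news_lg"})
print(configuration)

# With presidio-analyzer installed, the dict can be passed in place of a file path:
# provider = NlpEngineProvider(nlp_configuration=configuration)
```

Each model named in the configuration has to be downloaded separately, for example with `python -m spacy download en_core_web_lg`.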

In [1]:
!pip -q install presidio-analyzer
!pip -q install presidio-anonymizer

In [2]:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine

text = "My name is Don Quixote and my phone number is 555-223-4495"

# Path to the YAML configuration file that names the NLP engine and its models
conf_file = "./spacy-config.yml"

# Create NLP engine based on configuration
provider = NlpEngineProvider(conf_file=conf_file)
nlp_engine = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["en", "fi"]
)

results = analyzer.analyze(text=text, language="en")
print(results)
anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text)

[type: PERSON, start: 11, end: 22, score: 0.85, type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]
text: My name is <PERSON> and my phone number is <PHONE_NUMBER>
items:
[
    {'start': 43, 'end': 57, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
    {'start': 11, 'end': 19, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]
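
The `start` and `end` values under `items` refer to positions in the anonymized text, which is why they differ from the analyzer offsets printed on the first output line. A minimal sketch of the idea, assuming a plain replace operator and non-overlapping spans; `replace_spans` is a hypothetical helper for illustration, not Presidio's implementation:

```python
def replace_spans(text, spans):
    """Replace each (start, end, entity_type) span with an <ENTITY_TYPE> placeholder."""
    # Work from the last span to the first so that earlier offsets stay valid
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text

text = "My name is Don Quixote and my phone number is 555-223-4495"
# Offsets taken from the analyzer output above
spans = [(11, 22, "PERSON"), (46, 58, "PHONE_NUMBER")]
anonymized = replace_spans(text, spans)
print(anonymized)
print(anonymized.find("<PERSON>"), anonymized.find("<PHONE_NUMBER>"))
```

Because `<PERSON>` is three characters shorter than `Don Quixote`, every later offset moves left by three, so the phone number placeholder starts at 43 rather than 46.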



In [3]:
text = "Minun nimeni on Aku Hirviniemi ja puhelinnumeroni on 044 2235595"
results = analyzer.analyze(text=text, language="fi")
print(results)
# Reuses the AnonymizerEngine instance created in the previous cell
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text)

[type: PERSON, start: 16, end: 30, score: 0.85, type: PHONE_NUMBER, start: 53, end: 64, score: 0.4]
text: Minun nimeni on <PERSON> ja puhelinnumeroni on <PHONE_NUMBER>
items:
[
    {'start': 47, 'end': 61, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
    {'start': 16, 'end': 24, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]



In [4]:
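# Load the test set: one JSON object per line (JSON Lines)
import json

file_path = 'testset.jsonl'

data = []
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        if line.strip():  # skip blank lines
            data.append(json.loads(line))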
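
Each record is expected to carry `original`, `language` and `redacted` keys, which the next cell uses to print the predicted and expected redactions side by side. Comparing them by eye gets tedious, so here is a rough sketch of a tag-level comparison. `tag_counts` and `compare_tags` are hypothetical helpers, and the approach only counts placeholder labels; it does not check that they cover the same words.

```python
import re
from collections import Counter

TAG = re.compile(r"<([A-Z_]+)>")

def tag_counts(redacted_text):
    """Count the <ENTITY_TYPE> placeholders in a redacted string."""
    return Counter(TAG.findall(redacted_text))

def compare_tags(predicted, expected):
    """Compare placeholder labels between a predicted and an expected redaction."""
    pred, exp = tag_counts(predicted), tag_counts(expected)
    return {
        "matched": sum((pred & exp).values()),  # labels present in both
        "missed": dict(exp - pred),             # expected but not predicted
        "spurious": dict(pred - exp),           # predicted but not expected
    }

# One predicted/expected pair from this test set
predicted = "<PERSON> prefers using Apple for his work, contacting clients with his email <EMAIL_ADDRESS>."
expected = "<PERSON> prefers using <PRODUCT> for his work, contacting clients with his email <EMAIL>."
report = compare_tags(predicted, expected)
print(report)
```

Note that the test set labels email addresses as `<EMAIL>` while Presidio emits `<EMAIL_ADDRESS>`, so a label mapping would be needed before such counts are meaningful.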

In [5]:
for d in data:
    results = analyzer.analyze(text=d["original"], language=d["language"])
    anonymized_text = anonymizer.anonymize(text=d["original"], analyzer_results=results)
    print(anonymized_text.text)
    print(d["redacted"])
    print("---------")

<PERSON> called <PHONE_NUMBER> from her office at the company in <LOCATION>.
<PERSON> called <PHONE_NUMBER> from her office at the company in <LOCATION>.
---------
<PERSON> prefers using Apple for his work, contacting clients with his email <EMAIL_ADDRESS>.
<PERSON> prefers using <PRODUCT> for his work, contacting clients with his email <EMAIL>.
---------
<PERSON> visited <LOCATION> and enjoyed a delightful meal at Le Jules Verne restaurant.
<PERSON> visited <LOCATION> and enjoyed a delightful meal at <LOCATION>.
---------
At the conference, Dr. <PERSON> discussed advancements in medicine at the Mayo Clinic.
At the conference, Dr. <PERSON> discussed advancements in medicine at <ORG>.
---------
Samantha's phone number <UK_NHS> was written on a note left at the Starbucks cafe.
<PERSON>'s phone number <PHONE_NUMBER> was written on a note left at the <ORG> cafe.
---------
Mr. <PERSON>, CEO of ABC Corporation, announced record profits in their latest <DATE_TIME> report.
Mr. <PERSON>, CEO of