Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bagatur/multilingual anon #10346

Merged
merged 3 commits into from
Sep 7, 2023
Merged

Bagatur/multilingual anon #10346

merged 3 commits into from
Sep 7, 2023

Conversation

baskaryan
Copy link
Collaborator

No description provided.

maks-operlejn-ds and others added 2 commits September 7, 2023 14:42
### Description

Add multiple language support to Anonymizer

PII detection in Microsoft Presidio relies on several components - in
addition to the usual pattern matching (e.g. using regex), the analyser
uses a model for Named Entity Recognition (NER) to extract entities such
as:
- `PERSON`
- `LOCATION`
- `DATE_TIME`
- `NRP`
- `ORGANIZATION`


[[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)

To handle NER in specific languages, we utilize unique models from the
`spaCy` library, recognized for its extensive selection covering
multiple languages and sizes. However, it's not restrictive, allowing
for integration of alternative frameworks such as
[Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/)
or
[transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/)
when necessary.

### Future works

- **automatic language detection** - instead of passing the language as
a parameter in `anonymizer.anonymize`, we could detect the language/s
beforehand and then use the corresponding NER model. We have discussed
this internally and @mateusz-wosinski-ds will look into a standalone
language detection tool/chain for LangChain 😄

### Twitter handle
@deepsense_ai / @MaksOpp

### Tag maintainer
@baskaryan @hwchase17 @hinthornw
@vercel
Copy link

vercel bot commented Sep 7, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 7, 2023 10:06pm

@dosubot dosubot bot added the 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Sep 7, 2023
@baskaryan baskaryan merged commit 8826293 into master Sep 7, 2023
17 checks passed
@baskaryan baskaryan deleted the bagatur/multilingual_anon branch September 7, 2023 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants