Bagatur/multilingual anon #10346

Merged
merged 3 commits into from
Sep 7, 2023

Commits on Sep 7, 2023

  1. Multilingual anonymization (#10327)

    ### Description
    
    Add multiple language support to Anonymizer
    
    PII detection in Microsoft Presidio relies on several components. In
    addition to the usual pattern matching (e.g. using regex), the analyzer
    uses a model for Named Entity Recognition (NER) to extract entities such
    as:
    - `PERSON`
    - `LOCATION`
    - `DATE_TIME`
    - `NRP`
    - `ORGANIZATION`
    
    
    [[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)
    
    To handle NER in a specific language, we use language-specific models
    from the `spaCy` library, which offers a wide selection of models
    covering multiple languages and sizes. This is not a restriction:
    alternative frameworks such as
    [Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/)
    or
    [transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/)
    can be integrated when necessary.
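
A minimal sketch of the per-language model setup described above, assuming Presidio's `NlpEngineProvider` configuration format. The helper names `build_nlp_configuration` and `create_analyzer` are hypothetical, and the model names are illustrative choices that are not taken from this PR; they would need to be installed separately.

```python
from typing import Dict


def build_nlp_configuration(models: Dict[str, str]) -> dict:
    """Map language codes to spaCy model names in Presidio's config format."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": name}
            for lang, name in models.items()
        ],
    }


def create_analyzer(models: Dict[str, str]):
    """Create an AnalyzerEngine with one spaCy NER model per language.

    Requires presidio-analyzer and the listed spaCy models to be installed.
    """
    # Imported lazily so the configuration helper works without Presidio.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(
        nlp_configuration=build_nlp_configuration(models)
    )
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=list(models),
    )


# Illustrative model choices (assumptions, not from the PR).
MODELS = {"en": "en_core_web_lg", "es": "es_core_news_md"}
config = build_nlp_configuration(MODELS)
```

With the models installed, `create_analyzer(MODELS).analyze(text="Me llamo Sofía", language="es")` would run the Spanish pipeline, so the NER entities listed earlier are extracted by the Spanish model rather than the English one.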
    
    ### Future work
    
    - **automatic language detection** - instead of passing the language as
    a parameter in `anonymizer.anonymize`, we could detect the language/s
    beforehand and then use the corresponding NER model. We have discussed
    this internally and @mateusz-wosinski-ds will look into a standalone
    language detection tool/chain for LangChain 😄
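
The detect-then-dispatch idea above could look roughly like the sketch below. Everything here is hypothetical: `detect_language` is a toy stopword heuristic standing in for a real detection tool, and `anonymize_auto` only assumes that the anonymizer callable accepts a `language` keyword, as the description of `anonymizer.anonymize` suggests.

```python
from typing import Callable

# Tiny stopword lists for a toy detector (hypothetical; a real tool
# would use a trained language-identification model).
STOPWORDS = {
    "en": {"the", "and", "is", "my", "name"},
    "es": {"el", "la", "y", "es", "mi", "me"},
}


def detect_language(text: str, default: str = "en") -> str:
    """Return the language whose stopwords overlap most with the text."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default when no stopword matched at all.
    return best if scores[best] > 0 else default


def anonymize_auto(
    text: str,
    anonymize: Callable[..., str],
    detector: Callable[[str], str] = detect_language,
) -> str:
    """Detect the language first, then pass it to the anonymizer callable."""
    return anonymize(text, language=detector(text))


def fake_anonymize(text: str, language: str) -> str:
    """Stand-in for a real anonymizer; only tags the text with the language."""
    return f"[{language}] {text}"
```

Passing the detector in as a parameter keeps the dispatch logic independent of whichever detection tool is eventually chosen.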
    
    ### Twitter handle
    @deepsense_ai / @MaksOpp
    
    ### Tag maintainer
    @baskaryan @hwchase17 @hinthornw
    maks-operlejn-ds committed Sep 7, 2023
    Commit 274c3dc
  2. Commit 1d2b6c3
  3. Commit 41a2548