Analyzer - multiple languages and nlp engines #312
Conversation
Hi @dhpollack, thanks for your awesome contribution! We'll look at it and add comments.
For my use case, speed was less of a priority than accuracy. Currently I'm only using a single-core instance to test things out and it was fast enough for my use case, but I am sure that it would be much faster on a GPU machine. My main goal was to see how difficult it would be to add models from different libraries to the analyzer (and ultimately the anonymization) and to make it more user-friendly to add non-English pipelines. An ancillary goal was to play with stanza and see how easy it would be to use as an alternative to spacy. Feel free to cherry-pick certain parts of this PR. It looks like there is another PR which adds another language (Hebrew), but I wanted to throw this out there as an alternative way of doing it. Instead of creating new Python classes for each language, one can get the same results with something like the following:

```python
il_domain_recognizer = DomainRecognizer(
    pattern_groups=[("IL Domain()", u"(www.)?[\u05D0-\u05EA]{2,63}.co.il", 0.5)],
    supported_entity="IL_DOMAIN_NAME",
    supported_language="he",
)
```

Then you can even use config files for different languages / recognizers that are based on the predefined recognizers. Another option for me would be to use the presidio-research repo and then reimplement the anonymization part in Python. I've just gotten started with this project, so I'll take a look into that as well.
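The config-file idea can be sketched with plain data structures: recognizer definitions live in per-language config entries and are instantiated in a loop. The `PatternRecognizer` class below is a hypothetical stand-in for the recognizer API discussed in this PR, not presidio's actual class, and the config field names are illustrative:

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a pattern-based recognizer: everything that varied
# per language in the predefined recognizers becomes plain data.
@dataclass
class PatternRecognizer:
    name: str
    regex: str
    score: float
    supported_entity: str
    supported_language: str

    def analyze(self, text):
        # Return (start, end, entity, score) tuples for every regex hit.
        return [(m.start(), m.end(), self.supported_entity, self.score)
                for m in re.finditer(self.regex, text)]

# Per-language recognizer definitions, as they might look after loading a
# YAML/JSON config file (field names are illustrative):
CONFIG = [
    {"name": "IL Domain()",
     "regex": u"(www.)?[\u05D0-\u05EA]{2,63}.co.il",
     "score": 0.5,
     "supported_entity": "IL_DOMAIN_NAME",
     "supported_language": "he"},
]

recognizers = [PatternRecognizer(**entry) for entry in CONFIG]
# A sentence containing a Hebrew-letter .co.il domain:
results = recognizers[0].analyze(u"\u05d1\u05e7\u05e8\u05d5 www.\u05d3\u05d5\u05d2\u05de\u05d0.co.il")
```

Adding a new language or recognizer then means adding one config entry rather than one Python class.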
One thing we had issues with regarding languages is the context phrases. Even in patterns like credit cards (which are international), the context words are still language-specific. One option would be to create a recognizer per language; the other is to provide a set of context phrases for each language. What do you think would be the most practical and user-friendly?
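The second option (one recognizer, per-language context phrases) can be sketched as follows; the context words and the boost value are illustrative, not presidio's actual defaults:

```python
# Per-language context phrases for a single, language-agnostic pattern
# (words and scores here are illustrative):
CREDIT_CARD_CONTEXT = {
    "en": ["credit", "card", "visa", "mastercard"],
    "de": ["kreditkarte", "karte", "visa", "mastercard"],
}

def context_boost(text, language, base_score=0.5, boost=0.35):
    """Raise a match's score when a language-specific context word is present."""
    words = set(text.lower().split())
    if words & set(CREDIT_CARD_CONTEXT.get(language, [])):
        return min(1.0, base_score + boost)
    return base_score
```

A credit-card pattern match near the German word "Kreditkarte" would then score higher for `language="de"` without a separate German recognizer class.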
I made the context an input as well. I just didn't use it in the example because the PR with the example used the same context as the default.
I standardized the format of the predefined recognizers. If you look at the code, you should be able to see what's configurable and what's not. But do you mean at run time, loading all the predefined recognizers for each language? I hadn't tackled this problem yet, but off the top of my head I would use a configuration file with a list of recognizers and load those; otherwise load the defaults.
@dhpollack I would be very interested in testing this PR. I installed the python wheel from your PR and downloaded the spacy models, but when I try:

```python
from presidio_analyzer import AnalyzerEngine

engine = AnalyzerEngine(default_language="de")
text = "My name is David and I live in Miami"
response = engine.analyze(correlation_id=0, text=text, entities=[],
                          language='de', all_fields=True, score_threshold=0.5)
```

I am getting a:

Are you able to run that example?
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine.spacy_nlp_engine import SpacyNlpEngine

registry = RecognizerRegistry()
nlp = SpacyNlpEngine({"de": "de"})
registry.load_predefined_recognizers(["de"], "spacy")
engine = AnalyzerEngine(registry=registry, nlp_engine=nlp, default_language="de")
text = "My name is David and I live in Miami"
response = engine.analyze(correlation_id=0, text=text, entities=[],
                          language='de', all_fields=True, score_threshold=0.5)
```

I did the above and get a different error about not having a hash for the API store. I think this is because I don't have redis running on my machine. However, this is a simplified version of what app.py does. I've never tried this outside of using app.py, so I'm not exactly sure how it should work.
@dhpollack that works. The hash error is fortunately just logging output. Thanks a lot!
We have a task on ignoring this error when running the analyzer in a standalone way.
First, thanks for the great contribution. It makes the code much more readable and robust.
I would suggest looking at spacy-stanza. As it's maintained by Explosion AI, I would assume it's production-ready. This would make the NLP engine (and other parts) much simpler to understand and maintain.
@omri374 What are your thoughts on dropping pylint in favor of black? It's also possible to continue using pylint, but there are a few things that black does that pylint complains about; I guess we could just ignore those in the pylintrc file. I started doing a bit of programming in Rust and C++ and found that standardized code formatters remove a lot of the cognitive energy that goes into silly things like "should I put these parameters each on their own line, all on one line, two per line, ...". I especially love it on other OSS projects that I've worked on.
Ok, now I'm using pytest marks to mark the tests that behave differently with the big model vs the small one. Also, the test times on my machine are 1.8s vs 6.3s vs 22s (spacy, spacy + stanza, spacy + stanza + en_core_web_lg). There is currently only 1 test that gets different results when using the big model.
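The gating described above can be done with a small helper plus a pytest mark; `model_available` is a hypothetical helper name, and the mark usage is shown in comments since it only matters inside a test suite:

```python
import importlib.util

def model_available(package_name):
    """True if the given spacy model package (or any module) is importable."""
    return importlib.util.find_spec(package_name) is not None

# Inside a pytest suite, tests that only pass with the big model could then be
# marked and skipped automatically when it isn't installed, e.g.:
#
#   large_model = pytest.mark.skipif(
#       not model_available("en_core_web_lg"),
#       reason="en_core_web_lg is not installed",
#   )
#
#   @large_model
#   def test_ner_with_large_model():
#       ...
```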
@dhpollack regarding Black, I'm not familiar enough with it. We'll add it to the backlog to investigate. Thanks for the suggestion!
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@omri374 do you want to break out these changes separately yourself? If you cherry-pick certain changes but not all, then it gets harder to merge / rebase into master. I maintain a separate branch internally, but I can keep this branch roughly up to date with that one. Also, I think you can run your CI pipeline and it should work.
@dhpollack I think we should consider spacy-stanza first. This is the main blocker for accepting this PR. Breaking it into sub-PRs or cherry-picking might require some manual work, but it would be easier to review. We need another reviewer to merge this.
Spacy-stanza is in there. I made a few significant changes since the original PR, spacy-stanza being one and pytest-style tests being the other. But it might be good to get a fresh set of eyes on it and see what they think.
@dhpollack thanks for the updates. If you prefer that we go over this together in a meeting, feel free to reach out: presidio@microsoft.com
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@dhpollack CI fails on linting. Could you please verify that pylint and flake8 pass successfully?
@omri374 done
* made spacy required
* using spacy-stanza for stanza models
* refactor tests to use pytest
* make one test reliant on big model optional
All tests have been refactored to use pytest. Previously, there was a mix of unittest, pytest and miscellaneous global initializations. This commit moves everything to pytest. There is now extensive use of fixtures instead of global variables and parametrized tests instead of duplicated code for each test. The major difference is that parametrized tests are not individually named.
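The shape of the refactor described above looks roughly like this (the fixture body and the test cases are illustrative stand-ins, not the PR's actual tests):

```python
import pytest

# A fixture replaces the old module-level globals: it is created once per
# session and injected into any test that asks for it by argument name.
@pytest.fixture(scope="session")
def analyzer_setup():
    return {"default_language": "en"}  # stand-in for an AnalyzerEngine

# Parametrization replaces one near-identical test function per input case;
# the trade-off noted above is that each case is no longer individually named.
@pytest.mark.parametrize("text,expected_entity", [
    ("My credit card is 4012-8888-8888-1881", "CREDIT_CARD"),
    ("Write to david@example.com", "EMAIL_ADDRESS"),
])
def test_recognizer(text, expected_entity):
    # A real test would run the engine from analyzer_setup over `text` and
    # assert that `expected_entity` appears in the results.
    assert text and expected_entity.isupper()
```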
This installs the big spacy model by default in the Docker and the Azure pipeline.
* add documentation and doc strings
* change yaml field names to be more logical
@omri374 I made the changes based on the review. I am getting a branch conflict error, but it might just be an error with the github frontend and my browser cache. It should have been fixed with this commit.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
…e_langs
* fix merge conflicts with documentation
@balteravishay @eladiw could you please review the non-python and non-md files and add your comments? |
Merged! Thanks a lot @dhpollack ! |
Initially this was my attempt to use stanza, which is an NLP library by Stanford. But more generally, it's an update that makes it easier to add NLP engines and custom recognizers. Specifically, I standardized the format of the recognizers, removed the use of global variables where possible, and removed a lot of hard-coding of defaults. I am thinking of using presidio for several non-English projects at work, and these are several of the changes that I made.
Below is a list of the changes:
* of the string. This version will find IBANs in sentences.
* run.sh file, so just run dockers without rebuilding them

"Breaking" Changes:
* not super friendly with pylint. My suggestion is to drop pylint and use black instead.
* en rather than en_core_web_lg, and no spacy models are downloaded by default. The idea is to let the user choose which models they want. For non-English users, it saves a lot of time at installation because you don't need to install the large spacy model that you aren't using.
Relevant Issues:
#302
Signed-off-by: David Pollack d.pollack@solvemate.com