Analyzer - multiple languages and nlp engines #312
Conversation
Hi @dhpollack, thanks for your awesome contribution! We'll look at it and add comments.
For my use case, speed was less of a priority than accuracy. Currently I'm only using a single-core instance to test things out and it was fast enough for my use case, but I am sure that it would be much faster on a GPU machine. My main goal was to see how difficult it would be to add models from different libraries to the analyzer (and ultimately the anonymization) and to make it more user-friendly to add non-English pipelines. An ancillary goal was to play with stanza and see how easy it would be to use as an alternative to spacy. Feel free to cherry-pick certain parts of this PR. It looks like there is another PR which adds another language (Hebrew), but I wanted to throw this out there as an alternative way of doing it. Instead of creating new Python classes for each language, one can get the same results with something like the following:

```python
il_domain_recognizer = DomainRecognizer(
    pattern_groups=[("IL Domain()", u"(www.)?[\u05D0-\u05EA]{2,63}.co.il", 0.5)],
    supported_entity="IL_DOMAIN_NAME",
    supported_language="he",
)
```

Then you can even use config files for different languages / recognizers that are based on the predefined recognizers. Another option for me would be to use the presidio-research repo and then reimplement the anonymization part in Python. I've just gotten started with this project, so I'll take a look into that as well.
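The config-file idea can be sketched with plain data structures: recognizer definitions live in per-language config entries and are instantiated in a loop. The `PatternRecognizer` class below is a hypothetical stand-in for the recognizer API discussed in this PR, not presidio's actual class, and the config field names are illustrative:

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a pattern-based recognizer: everything that varied
# per language in the predefined recognizers becomes plain data.
@dataclass
class PatternRecognizer:
    name: str
    regex: str
    score: float
    supported_entity: str
    supported_language: str

    def analyze(self, text):
        # Return (start, end, entity, score) tuples for every regex hit.
        return [(m.start(), m.end(), self.supported_entity, self.score)
                for m in re.finditer(self.regex, text)]

# Per-language recognizer definitions, as they might look after loading a
# YAML/JSON config file (field names are illustrative):
CONFIG = [
    {"name": "IL Domain()",
     "regex": u"(www.)?[\u05D0-\u05EA]{2,63}.co.il",
     "score": 0.5,
     "supported_entity": "IL_DOMAIN_NAME",
     "supported_language": "he"},
]

recognizers = [PatternRecognizer(**entry) for entry in CONFIG]
# A sentence containing a Hebrew-letter .co.il domain:
results = recognizers[0].analyze(u"\u05d1\u05e7\u05e8\u05d5 www.\u05d3\u05d5\u05d2\u05de\u05d0.co.il")
```

Adding a new language or recognizer then means adding one config entry rather than one Python class.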
One thing we had issues with regarding languages is the context phrases. Even in patterns like credit cards (which are international), the context words are still language-specific. One option would be to create a recognizer per language; the other is to provide a set of context phrases for each language. What do you think would be the most practical and user-friendly?
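The second option (one recognizer, per-language context phrases) can be sketched as follows; the context words and the boost value are illustrative, not presidio's actual defaults:

```python
# Per-language context phrases for a single, language-agnostic pattern
# (words and scores here are illustrative):
CREDIT_CARD_CONTEXT = {
    "en": ["credit", "card", "visa", "mastercard"],
    "de": ["kreditkarte", "karte", "visa", "mastercard"],
}

def context_boost(text, language, base_score=0.5, boost=0.35):
    """Raise a match's score when a language-specific context word is present."""
    words = set(text.lower().split())
    if words & set(CREDIT_CARD_CONTEXT.get(language, [])):
        return min(1.0, base_score + boost)
    return base_score
```

A credit-card pattern match near the German word "Kreditkarte" would then score higher for `language="de"` without a separate German recognizer class.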
I made the context an input as well. I just didn't use it in the example because the PR with the example used the same context as the default.
I standardized the format of the predefined recognizers. If you look at the code, you should be able to see what's configurable and what's not. But do you mean at run time, loading all the predefined recognizers for each language? I hadn't tackled this problem yet, but off the top of my head I would use a configuration file with a list of recognizers and load those; otherwise load the defaults.
@dhpollack I would be very interested in testing this PR. I installed the python wheel from your PR and downloaded the spacy models, but when I try:

```python
from presidio_analyzer import AnalyzerEngine

engine = AnalyzerEngine(default_language="de")
text = "My name is David and I live in Miami"
response = engine.analyze(correlation_id=0, text=text, entities=[],
                          language='de', all_fields=True, score_threshold=0.5)
```

I am getting a:

Are you able to run that example?
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine.spacy_nlp_engine import SpacyNlpEngine

registry = RecognizerRegistry()
nlp = SpacyNlpEngine({"de": "de"})
registry.load_predefined_recognizers(["de"], "spacy")
engine = AnalyzerEngine(registry=registry, nlp_engine=nlp, default_language="de")
text = "My name is David and I live in Miami"
response = engine.analyze(correlation_id=0, text=text, entities=[],
                          language='de', all_fields=True, score_threshold=0.5)
```

I did the above and get a different error about not having a hash for the API store. I think this is because I don't have redis running on my machine. However, this is a simplified version of what app.py does. I've never tried this outside of using app.py, so I'm not exactly sure how it should work.
@dhpollack that works. The hash error is fortunately just logging output. Thanks a lot!
We have a task on ignoring this error when running the analyzer in a standalone way.
First, thanks for the great contribution. It makes the code much more readable and robust.
I would suggest looking at spacy-stanza. As it's maintained by Explosion AI, I would assume it's production-ready. This would make the NLP engine (and other parts) much simpler to understand and maintain.
@omri374 What are your thoughts on dropping pylint in favor of black? It's also possible to continue using pylint, but there are a few things that black does that pylint complains about; I guess we could just ignore those in the pylintrc file. I started doing a bit of programming in Rust and C++ and found that standardized code formatters remove a lot of the cognitive energy that goes into silly things like "should I put these parameters each on their own line, all on one line, two per line, ...". I especially love it on other OSS projects that I've worked on.
Ok, now I'm using pytest marks to mark the tests that behave differently with the big model vs the small one. Also, the test times on my machine are 1.8s vs 6.3s vs 22s (spacy, spacy + stanza, spacy + stanza + en_core_web_lg). There is currently only 1 test that gets different results when using the big model.
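The gating described above can be done with a small helper plus a pytest mark; `model_available` is a hypothetical helper name, and the mark usage is shown in comments since it only matters inside a test suite:

```python
import importlib.util

def model_available(package_name):
    """True if the given spacy model package (or any module) is importable."""
    return importlib.util.find_spec(package_name) is not None

# Inside a pytest suite, tests that only pass with the big model could then be
# marked and skipped automatically when it isn't installed, e.g.:
#
#   large_model = pytest.mark.skipif(
#       not model_available("en_core_web_lg"),
#       reason="en_core_web_lg is not installed",
#   )
#
#   @large_model
#   def test_ner_with_large_model():
#       ...
```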
@dhpollack regarding Black, I'm not familiar enough with it. We'll add it to the backlog to investigate. Thanks for the suggestion!
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@omri374 do you want to break out these changes separately yourself? If you cherry-pick certain changes but not all, then it gets harder to merge / rebase into master. I maintain a separate branch internally, but I can keep this branch roughly up to date with that one. Also, I think you can run your CI pipeline and it should work.
@dhpollack I think we should consider spacy-stanza first. This is the main blocker for accepting this PR. Breaking it into sub-PRs or cherry-picking might require some manual work, but it would be easier to review. We need another reviewer to merge this.
Spacy-stanza is in there. I made a few significant changes since the original PR, spacy-stanza being one and pytest-style tests being the other. But it might be good to get a fresh set of eyes on it and see what they think.
@dhpollack thanks for the updates. If you prefer that we go over this together in a meeting, feel free to reach out: presidio@microsoft.com
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@dhpollack CI fails on linting. Could you please verify that pylint and flake8 pass successfully?
@omri374 done
* made spacy required
* using spacy-stanza for stanza models
* refactor tests to use pytest
* make one test reliant on big model optional
All tests have been refactored to use pytest. Previously, there was a mix of unittest, pytest and miscellaneous global initializations. This commit moves everything to pytest. There is now extensive use of fixtures instead of global variables and parametrized tests instead of duplicated code for each test. The major difference is that parametrized tests are not individually named.
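The shape of the refactor described above looks roughly like this (the fixture body and the test cases are illustrative stand-ins, not the PR's actual tests):

```python
import pytest

# A fixture replaces the old module-level globals: it is created once per
# session and injected into any test that asks for it by argument name.
@pytest.fixture(scope="session")
def analyzer_setup():
    return {"default_language": "en"}  # stand-in for an AnalyzerEngine

# Parametrization replaces one near-identical test function per input case;
# the trade-off noted above is that each case is no longer individually named.
@pytest.mark.parametrize("text,expected_entity", [
    ("My credit card is 4012-8888-8888-1881", "CREDIT_CARD"),
    ("Write to david@example.com", "EMAIL_ADDRESS"),
])
def test_recognizer(text, expected_entity):
    # A real test would run the engine from analyzer_setup over `text` and
    # assert that `expected_entity` appears in the results.
    assert text and expected_entity.isupper()
```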
This installs the big spacy model by default in the Docker and the Azure pipeline.
* add documentation and doc strings
* change yaml field names to be more logical
@omri374 I made the changes based on the review. I am getting a branch conflict error, but it might just be an error with the github frontend and my browser cache. It should have been fixed with this commit.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
…e_langs
* fix merge conflicts with documentation
@balteravishay @eladiw could you please review the non-python and non-md files and add your comments? |
Merged! Thanks a lot @dhpollack ! |
Initially this was my attempt to use stanza, which is an NLP library by Stanford. But more generally, it's an update that makes it easier to add NLP engines and custom recognizers. Specifically, I standardized the format of the recognizers, removed the use of global variables where possible, and removed a lot of hard-coding of defaults. I am thinking of using presidio for several non-English projects at work, and these are several of the changes that I made.
Below is a list of the changes:
* of the string. This version will find IBANs in sentences.
* run.sh file, so just run dockers without rebuilding them

"Breaking" Changes:
* not super friendly with pylint. My suggestion is to drop pylint and use black instead.
* en rather than en_core_web_lg, and no spacy models are downloaded by default. The idea is to let the user choose which models they want. For non-English users, it saves a lot of time at installation because you don't need to install the large spacy model that you aren't using.
Relevant Issues:
#302
Signed-off-by: David Pollack d.pollack@solvemate.com