pass in loaded spacy model #822

Closed
lsmith77 opened this issue Feb 1, 2022 · 11 comments · Fixed by #854

@lsmith77
Contributor

lsmith77 commented Feb 1, 2022

We have built a custom NLP API using spaCy. We plan to use Presidio to remove PII from data we send to Sentry. Since our API already uses spaCy, we would like to reuse the same models and load them only once.

In this spirit, we are wondering if you would be open to supporting passing in a spaCy model instance via configuration, rather than only allowing a model name that is then loaded via spacy.load() in the Presidio code:

lang_code: spacy.load(model_name, disable=["parser"])

I have to admit that I have not done any benchmarking on this yet, but I assume this should shave off time loading the models and also reduce the memory footprint.
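The "load once, reuse everywhere" idea can be sketched independently of Presidio. The following is only a minimal sketch: `get_model`, `_MODEL_CACHE` and `fake_loader` are hypothetical names introduced for illustration, and the stand-in loader takes the place of `spacy.load` so the example runs without any model installed.

```python
from typing import Any, Callable, Dict

# Hypothetical process-wide cache: model name -> loaded model object.
_MODEL_CACHE: Dict[str, Any] = {}


def get_model(name: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached model for `name`, calling `loader` only on first use."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = loader(name)
    return _MODEL_CACHE[name]


# Stand-in loader for illustration; in a real service this would be spacy.load.
load_calls = []


def fake_loader(name: str) -> dict:
    load_calls.append(name)
    return {"model": name}


# Two consumers (e.g. the NLP API and the PII scrubber) ask for the same model.
api_model = get_model("en_core_web_sm", fake_loader)
scrubber_model = get_model("en_core_web_sm", fake_loader)
```

Both consumers end up holding the same object, so the expensive load happens once and the model is held in memory once.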

@omri374
Contributor

omri374 commented Feb 2, 2022

Hi @lsmith77,
Good point! We don't have an official solution for this, but here's an idea which might work:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy

# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):

    def __init__(self, loaded_spacy_model):
        self.nlp = {"en": loaded_spacy_model}

# Load a model a-priori
nlp = spacy.load("en_core_web_sm")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model = nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine = loaded_nlp_engine)

# Analyze text
analyzer.analyze(text="My name is Bob", language="en")

This hack might work, but your suggestion is great for future improvement. If you'd like to create a PR, I'd be happy to help review it.

@lsmith77
Contributor Author

lsmith77 commented Feb 2, 2022

Thank you, I will try it out and might create a PR.

@lsmith77
Contributor Author

lsmith77 commented Feb 2, 2022

I can confirm that this works.

I guess it might make more sense to just document this "hack", rather than add this class, wdyt?

@omri374
Contributor

omri374 commented Apr 22, 2022

Hi @lsmith77, sorry for the delayed response. Yes, this could be addressed either through documentation or through extended functionality. Any contribution would be greatly appreciated!

@lsmith77
Contributor Author

I went the documentation route #854

@orsinium

@lsmith77 sorry for necroposting. Do you have your presidio-powered scrubber for Sentry open-sourced? We're solving the same problem right now, and I wonder if we could reuse some of your work.

@omri374
Contributor

omri374 commented Jun 13, 2024

Hi @orsinium, what kind of integration are you looking for? For Presidio to run as part of Sentry and detect PII?

@lsmith77
Contributor Author

> @lsmith77 sorry for necroposting. Do you have your presidio-powered scrubber for Sentry open-sourced? We're solving the same problem right now, and I wonder if we could reuse some of your work.

We ended up not using Presidio, as it seemed unable to cover non-Western names. So we built a very simple scrubber in Python ourselves that just handles URLs, emails and numbers.
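For readers wondering what such a scrubber might look like, here is a minimal sketch. It is not the code described above, which was not shared in this thread; the patterns, the placeholder strings and the `scrub_text` name are all assumptions, and the regexes are deliberately loose.

```python
import re

# Illustrative patterns only; they are loose and will miss edge cases.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+")


def scrub_text(text: str) -> str:
    """Replace URLs, email addresses and digit runs with placeholders."""
    # Order matters: URLs and emails may contain digits, so numbers go last.
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = NUMBER_RE.sub("[NUMBER]", text)
    return text


result = scrub_text("Mail bob@example.com or see https://example.com/u/42 ref 12345")
print(result)  # Mail [EMAIL] or see [URL] ref [NUMBER]
```

Note that the URL pattern consumes everything up to the next whitespace, so trailing punctuation directly after a URL is swallowed as well.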

@omri374
Contributor

omri374 commented Jun 13, 2024

Thanks for the feedback @lsmith77. Have you looked into other NER models? or worked with the default spaCy one?

@lsmith77
Contributor Author

> Thanks for the feedback @lsmith77. Have you looked into other NER models? or worked with the default spaCy one?

Yes, we are using spaCy, and we also use the NER in their LG models for English and German. It works ok-ish for detecting names based on sentence structure.

We also tried some NER models on Hugging Face and found them to be more accurate, but in the end we decided to stick with spaCy: we were already using it for other purposes, so it was "cheaper" to accept its limitations.

@orsinium

> what kind of integration are you looking for? For Presidio to run as part of Sentry and detect PII?

Sentry has a "scrubber", a middleware that removes sensitive data from all events before sending them to the server. The default one is very basic: it only removes top-level values (and doesn't check nested values) based on a pre-defined deny list of words like "password". We looked into our Sentry and found a lot of PII. Luckily, you can provide your own scrubber:

https://docs.sentry.io/platforms/python/data-management/sensitive-data/

I started to look for a solution, and I'm considering making a custom scrubber based on Presidio. A search for "Sentry" in the issues led me here.
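To make the nested-value gap concrete, here is a minimal sketch of a key-based scrubber that walks the whole event instead of only the top level. It uses only the standard library; `DENYLIST`, `MASK` and `scrub_nested` are hypothetical names, and the `(event, hint)` signature is assumed to match the SDK's `before_send` hook described in the linked docs.

```python
from typing import Any

# Hypothetical deny list; extend it to match your own event payloads.
DENYLIST = {"password", "token", "secret", "email"}
MASK = "[Filtered]"


def scrub_nested(value: Any) -> Any:
    """Return a copy of `value` with deny-listed keys masked at any depth."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in DENYLIST else scrub_nested(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub_nested(item) for item in value]
    return value


def before_send(event: dict, hint: dict) -> dict:
    # Assumed to be registered as the SDK's before_send callback.
    return scrub_nested(event)


event = {
    "user": {"email": "bob@example.com", "id": 7},
    "extra": {"request": [{"password": "hunter2"}, {"path": "/login"}]},
}
scrubbed = before_send(event, {})
```

This only matches on keys. Free-text values (for example a log message containing a name) would still need value-level detection, which is where Presidio or a regex pass would come in.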

> so we built a very simple scrubber in Python ourselves that just handles URLs, emails and numbers.

Got it, thank you. I might also end up not overthinking it and just using a big hardcoded deny list of names and a bunch of regexes :)
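A hardcoded deny list of names can be sketched in a few lines. This is only an illustration under assumptions: the three names, the `[NAME]` placeholder and the `scrub_names` helper are made up, and a real list would be far longer and would still miss names it does not contain.

```python
import re

# Hypothetical deny list; a real one would be much longer, e.g. loaded from a file.
NAME_DENYLIST = ["Bob", "Alice", "Nguyen"]

# One alternation, longest names first, matched on word boundaries and
# case-insensitively so that "bob" and "BOB" are caught as well.
_names = sorted(NAME_DENYLIST, key=len, reverse=True)
NAME_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in _names) + r")\b",
    re.IGNORECASE,
)


def scrub_names(text: str) -> str:
    """Replace any deny-listed name with a placeholder."""
    return NAME_RE.sub("[NAME]", text)


print(scrub_names("My name is Bob"))  # My name is [NAME]
```

The word boundaries keep "Bobcat" intact, but case-insensitive matching means ordinary words that happen to equal a listed name are masked too.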
