Export the train data and imported to presidio #78

Open
Bamorem opened this issue Sep 4, 2023 · 9 comments
Comments

@Bamorem

Bamorem commented Sep 4, 2023

I have been running presidio_analyzer on my local machine.

I used presidio-research to generate a dataset with Faker and then split it into train/test/dev.

The output is 3 JSON files.
I used the same generated dataset to generate the .spacy file.

Now my question is: how do I integrate the trained data into my local presidio_analyzer so that it runs with it?

I feel like the integration steps are missing ^^

@omri374
Contributor

omri374 commented Sep 5, 2023

Hi, have you trained a spaCy model using those samples, or would you like to evaluate Presidio using those samples?

@Bamorem
Author

Bamorem commented Sep 5, 2023

Maybe I don't get how it should work, but I have Presidio using en_core_web_trf (a spaCy model, right?).

But Presidio (presidio_analyzer) doesn't perform well in some cases, so I wanted to train my presidio_analyzer to perform better.

That's when I found presidio-research, which says it can train it. I followed all the steps and assumed the output .spacy file would need to be imported into my presidio_analyzer. Am I wrong? How can I train my local presidio_analyzer so that it performs better on my dataset?

I would love some help ^^

@omri374
Contributor

omri374 commented Sep 5, 2023

One option to consider is training a new model using spaCy (or another library) and then integrating it into Presidio.
More on training using spaCy: https://spacy.io/usage/training
Once you have the model installed locally, you can configure Presidio to use the custom model instead of en_core_web_trf: https://microsoft.github.io/presidio/tutorial/05_languages/
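
As a rough sketch of that configuration step: Presidio's `NlpEngineProvider` accepts a configuration dictionary naming the spaCy model for each language. The package name `en_custom_pii` below is hypothetical and assumes the trained pipeline has already been packaged and installed locally; `build_nlp_configuration` and `create_analyzer` are illustrative helper names, not part of Presidio.

```python
from typing import Any, Dict


def build_nlp_configuration(model_name: str, lang_code: str = "en") -> Dict[str, Any]:
    """Build the NLP engine configuration dict passed to NlpEngineProvider."""
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": lang_code, "model_name": model_name}],
    }


def create_analyzer(configuration: Dict[str, Any]):
    """Create an AnalyzerEngine backed by the configured spaCy model."""
    # Imported lazily so the configuration helper works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
    languages = [model["lang_code"] for model in configuration["models"]]
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=languages)


# Usage (requires presidio-analyzer and the hypothetical "en_custom_pii" package):
# analyzer = create_analyzer(build_nlp_configuration("en_custom_pii"))
```

Keeping the configuration in a plain dictionary makes it easy to switch back to en_core_web_trf for a side-by-side comparison.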

@Bamorem
Author

Bamorem commented Sep 6, 2023

Okay, so using the link that you sent me (https://spacy.io/usage/training), I ran these two steps:

python -m spacy init fill-config ./base_config.cfg ./config.cfg
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

Here again, I guess the file ./train.spacy is the one generated like presidio-research suggests:

from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")

Am I right, or are dataset.spacy and train.spacy two different things? If so, how do I get the train.spacy file?

I think I can use presidio-research :

Tell me if I'm wrong.

@omri374
Contributor

omri374 commented Sep 6, 2023

Hi, we split the dataset into three parts: train.spacy and dev.spacy are used for training/validation, and test.spacy is used for evaluation. dataset.spacy is just an example of how to create a spaCy dataset.
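
A three-way split like this could be sketched in plain Python as follows. This is only an illustration: `split_three_ways`, the 70/15/15 ratios and the seed are assumptions, not something presidio-research prescribes. Each resulting list would then be written out with the `InputSample.create_spacy_dataset(..., output_path=...)` call quoted earlier in the thread.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_three_ways(
    samples: Sequence[T],
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle samples and cut them into train/dev/test lists (hypothetical helper)."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    # Whatever remains after train and dev becomes the test split.
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Stand-in data; in the thread these would be InputSample objects.
train, dev, test = split_three_ways(list(range(100)))

# Each split would then be saved separately, e.g. (call shown in the thread):
# InputSample.create_spacy_dataset(train, output_path="train.spacy")
# InputSample.create_spacy_dataset(dev, output_path="dev.spacy")
# InputSample.create_spacy_dataset(test, output_path="test.spacy")
```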

@Bamorem
Author

Bamorem commented Sep 6, 2023

After running the training, I get these outputs.

Now, how do I integrate the result into my local presidio_analyzer?

I guess this is the part missing from the documentation.

Screenshot 2023-09-06 at 21 47 13

@omri374
Contributor

omri374 commented Sep 7, 2023

To serialize/deserialize, see this spaCy doc: https://spacy.io/usage/saving-loading

To integrate it into Presidio, see this issue: microsoft/presidio#822

@Bamorem
Author

Bamorem commented Sep 7, 2023

I'm sorry but I got even more lost...

So the next step after spacy train is to serialize/deserialize the output model?

Also, the second link (for the integration) still uses en_core_web_sm. How do I use the model that I trained?

Sorry for asking, but is it possible to get a step-by-step guide on how to train an existing model and integrate it into the Presidio analyzer?

This helps to generate a dataset that will be used to train an existing model so that it performs better on the kind of data the dataset represents.

This will generate two files that will be used for spacy train.

This will generate two outputs, one called model-best and one called model-last (most of the time, model-best is the one we want to use).

  • 4 - serialize/deserialize
    What is this for, and how do I run it?

  • 5 - integrate into Presidio
    In my case, how does it get the model name? Is it given as a path, or do I need to include it some other way? Where do I need to move it? ...

Sorry for this, but I still don't see how to do it.

@omri374
Contributor

omri374 commented Sep 10, 2023

The question is more related to spaCy than to Presidio. Let me try to help although I might be wrong as I haven't done this in a while.

  • Serialize: save the model to disk.
  • Deserialize: Load the model from disk.

Does this work?

import spacy

nlp = spacy.load("PATH_TO_MODEL_BEST")
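
If loading fails, one thing worth checking first is whether the path really points at a saved pipeline directory. A minimal sketch, assuming the `./output` layout produced by the `spacy train` command earlier in the thread (a saved spaCy v3 pipeline directory contains `config.cfg` and `meta.json`); the helper names `looks_like_spacy_pipeline` and `pick_trained_model` are made up for illustration:

```python
from pathlib import Path


def looks_like_spacy_pipeline(model_dir: str) -> bool:
    """Return True if the directory holds the files a saved spaCy pipeline contains."""
    path = Path(model_dir)
    return (path / "config.cfg").is_file() and (path / "meta.json").is_file()


def pick_trained_model(output_dir: str) -> Path:
    """Prefer model-best, fall back to model-last, inside a `spacy train` output dir."""
    for name in ("model-best", "model-last"):
        candidate = Path(output_dir) / name
        if looks_like_spacy_pipeline(str(candidate)):
            return candidate
    raise FileNotFoundError(f"No trained pipeline found under {output_dir}")


# Usage (requires spaCy and a finished training run in ./output):
# import spacy
# nlp = spacy.load(pick_trained_model("./output"))
```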

If yes, then you should use the same approach when creating your NlpEngine, as described before:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy


# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):

    def __init__(self, loaded_spacy_model):
        # Note: newer Presidio releases may also need a super().__init__() call here
        # Map the language code to the already-loaded spaCy pipeline
        self.nlp = {"en": loaded_spacy_model}


# Load the model ahead of time
nlp = spacy.load("PATH_TO_MODEL_BEST")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine=loaded_nlp_engine)

From here, you can continue with the evaluating Presidio notebook.
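
Before moving on to evaluation, a quick sanity check on a sentence or two can show whether the custom model is actually being used. A small sketch, assuming an `analyzer` built as in the snippet above; `find_entities` is a hypothetical helper that only relies on `AnalyzerEngine.analyze` and the `entity_type`, `start`, `end` and `score` attributes of its results:

```python
from typing import Any, List, Tuple


def find_entities(analyzer: Any, text: str, language: str = "en") -> List[Tuple[str, str, float]]:
    """Run the analyzer and return (entity_type, matched_text, score) tuples."""
    results = analyzer.analyze(text=text, language=language)
    # Slice the original text with each result's character offsets.
    return [(r.entity_type, text[r.start:r.end], r.score) for r in results]


# Usage (requires the `analyzer` object created above):
# for entity_type, span, score in find_entities(analyzer, "My name is David"):
#     print(entity_type, span, score)
```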
