Pass in custom trained spacy model #851

Closed
vajjasaikiran opened this issue Apr 11, 2022 · 8 comments

@vajjasaikiran

We have trained a custom spaCy model with entities that the stock spaCy models do not have. We plan to use that model as the default spaCy NLP engine.

I tried the code mentioned in #822, but I am not getting the required entities.

This is the code I tried:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy

#Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):

    def __init__(self, loaded_spacy_model):
        self.nlp = {"en": loaded_spacy_model}

#Load a model a-priori
nlp = spacy.load("/path/to/custom_model")

#Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model = nlp)

#Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine = loaded_nlp_engine)

#Analyze text
analyzer.analyze(text="My name is Bob. I work for Google as an ML engineer.", language="en") 

Expected entities: [PERSON, ORG, CUSTOM]
Predicted entities: [PERSON, ORG]

Can somebody explain whether there is a hack or some other way to achieve this?

@omri374
Contributor

omri374 commented Apr 12, 2022

Hi @vajjasaikiran,
Presidio uses spaCy in two ways: first as an NLP engine, and second to extract named entities. For the latter, there's a recognizer called SpacyRecognizer.

It has the en_core_web_lg entity types by default, but it's possible to pass others. For example:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

# Define the new entities supported by the custom model
spacy_entities = ["PERS", "LOC", "ORG", "TIME", "DATE", "MONEY", "PERCENT", "MISC__AFF", "MISC__ENT"]

# Translate the model's entity types to Presidio's (if needed; in this example we map them 1:1)
spacy_label_groups = [({ent}, {ent}) for ent in spacy_entities]

spacy_recognizer = SpacyRecognizer(supported_language="en", 
                                   supported_entities=spacy_entities, 
                                   check_label_groups=spacy_label_groups)

# Create Presidio Analyzer Engine
analyzer = AnalyzerEngine()

# List existing (predefined) recognizers
print([rec.name for rec in analyzer.registry.recognizers])

# Remove the previous SpacyRecognizer
analyzer.registry.recognizers = [rec for rec in analyzer.registry.recognizers if rec.name != "SpacyRecognizer"]

# Add the new custom SpacyRecognizer
analyzer.registry.add_recognizer(spacy_recognizer)

# Run Analyzer Engine
res = analyzer.analyze(text="text with custom entities", language="en")
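
The 1:1 mapping above covers the case where the model's labels are already the names you want in the output. When they differ, each tuple pairs a set of Presidio-side entity names with the set of model labels that should map to them. The sketch below is plain Python with no Presidio import: `check_label` is a hypothetical helper illustrating how such (entities, labels) pairs can be matched, and the `PERSON`, `LOCATION`, `ORGANIZATION` and `DATE_TIME` target names are assumptions chosen for illustration.

```python
# Each tuple is (Presidio-side entity names, model labels that map to them).
# The labels come from the list above; the target names are illustrative.
label_groups = [
    ({"PERSON"}, {"PERS"}),
    ({"LOCATION"}, {"LOC"}),
    ({"ORGANIZATION"}, {"ORG"}),
    ({"DATE_TIME"}, {"DATE", "TIME"}),
]


def check_label(entity, label, groups):
    """Return True if the model's `label` counts as the requested `entity`."""
    return any(entity in ents and label in labels for ents, labels in groups)


# Entities to request from the recognizer: the union of the left-hand sets.
supported_entities = sorted(set().union(*(ents for ents, _ in label_groups)))

print(supported_entities)
print(check_label("DATE_TIME", "TIME", label_groups))  # two labels, one entity
print(check_label("PERSON", "ORG", label_groups))      # unrelated pair
```

With groups like these, the entity names on the left-hand side (not the raw model labels) are what you would list as supported entities and request at analysis time.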

Hope this helps!

@vajjasaikiran
Author

Hi @omri374. Thank you for the response.

I can see that we are adding new custom entities and label groups to the Analyzer Engine. But where are we passing the custom model weight file into the Analyzer Engine?
Could you please modify the above code, or explain how the custom model weight file is used in the Analyzer Engine pipeline?

@omri374
Contributor

omri374 commented Apr 13, 2022

Hi @vajjasaikiran,

This is a good point. This is the flow at a high level:

  1. The NlpEngine creates an object called NlpArtifacts, which contains the output of the spaCy pipeline (tokens, entities, etc.).
  2. The SpacyRecognizer does not run the model itself; it simply extracts the requested entities out of the NlpArtifacts.

So the actual model weights are being used when the NlpEngine runs the input text through the model. Then, the outputs are propagated to all other recognizers, including the SpacyRecognizer.

This example shows this in more detail. It takes tokens out of the NlpArtifacts to extract token attributes, but a similar logic is used to extract entities in the SpacyRecognizer class:
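
Separately from that example, the two steps can be imitated with a dependency-free toy. None of the classes below are Presidio's: `ToyNlpEngine`, `ToyArtifacts` and `ToyNerRecognizer` are hypothetical stand-ins, and a lookup table plays the role of the model weights. The point is only that the model runs once, and that a recognizer returns nothing beyond the labels it was told to support, even when the model predicted more.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyArtifacts:
    """Stand-in for NlpArtifacts: what the model produced for one text."""
    tokens: List[str]
    entities: List[Tuple[str, int, int]]  # (label, start_char, end_char)


class ToyNlpEngine:
    """Stand-in for the NLP engine; the lookup table acts as model weights."""

    def __init__(self, gazetteer: Dict[str, str]):
        self.gazetteer = gazetteer
        self.calls = 0  # counts how often the "model" is actually run

    def process_text(self, text: str) -> ToyArtifacts:
        self.calls += 1
        entities = []
        for phrase, label in self.gazetteer.items():
            start = text.find(phrase)
            if start != -1:
                entities.append((label, start, start + len(phrase)))
        return ToyArtifacts(tokens=text.split(), entities=entities)


class ToyNerRecognizer:
    """Stand-in for an NER recognizer: it only filters the artifacts."""

    def __init__(self, supported_labels):
        self.supported_labels = set(supported_labels)

    def analyze(self, text: str, artifacts: ToyArtifacts):
        return [e for e in artifacts.entities if e[0] in self.supported_labels]


engine = ToyNlpEngine({"Bob": "PERSON", "Google": "ORG", "ML engineer": "JOB"})
text = "My name is Bob. I work for Google as an ML engineer."
artifacts = engine.process_text(text)  # step 1: the model runs once

default_hits = ToyNerRecognizer(["PERSON", "ORG"]).analyze(text, artifacts)
custom_hits = ToyNerRecognizer(["PERSON", "ORG", "JOB"]).analyze(text, artifacts)

print([label for label, _, _ in default_hits])  # JOB is predicted but dropped
print([label for label, _, _ in custom_hits])
```

In this toy, the custom `JOB` label is present in the artifacts either way; it only shows up in the output once the recognizer lists it as supported.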

@vajjasaikiran
Author

Hi @omri374 ,

I understood the flow. One thing is still not clear to me: how can I pass my new model weights to the NlpEngine? By default, Presidio loads the en_core_web_lg model during initialisation. I want to pass my custom-trained model to the engine.

You can think of this as a double-NER pipeline: Predict(spaCy entities) + Predict(custom entities), then combine them and return the result.

My custom model can predict 5 entities that the default spaCy model does not. I want the NlpEngine to predict both the spaCy entities and the custom entities. Could you please share sample code where I pass in my custom model object and it is simply added to Presidio's default pipeline? If there is no way to do this right now, could you help me with a workaround using the available classes?

@omri374
Contributor

omri374 commented Apr 14, 2022

Hi @vajjasaikiran, so if I understand correctly the ambition is to have both en_core_web_lg and an additional custom model.

In this case I would suggest creating a new recognizer which loads the custom model. Here's an example implementation. It uses the same logic as the SpacyRecognizer, just with a model loaded by the recognizer itself instead of what's received in the NlpArtifacts:

from typing import Tuple, Set

from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

import spacy

class CustomSpacyRecognizer(SpacyRecognizer):
    
    def __init__(self, path_to_model:str):
        """
        SpacyRecognizer with a new/custom model, 
        to run in parallel with the model in NlpEngine.
        :param path_to_model: Path to the custom model's location
        """
        
        self.path_to_model = path_to_model
        self.model = None # Model will be loaded on .load()
        
        entities = ["ORG"] # TODO change to the custom model's entities
        spacy_label_groups = [({ent}, {ent}) for ent in entities]
        
        super().__init__(
                supported_language='en',
                supported_entities=entities,
                ner_strength=0.85,
                check_label_groups=spacy_label_groups
        )
    
    def load(self):
        self.model = spacy.load(self.path_to_model)
    
    def analyze(self, text, entities, nlp_artifacts=None):
        """
        Analyze using a spaCy model. Similar to SpacyRecognizer.analyze, 
        except it has an actual call to a spaCy model loaded as part of this recognizer.
        """
        results = []

        doc = self.model(text)
        
        ner_entities = doc.ents

        for entity in entities:
            if entity not in self.supported_entities:
                continue
            for ent in ner_entities:
                if not self.__check_label(entity, ent.label_, self.check_label_groups):
                    continue
                textual_explanation = f"Identified as {ent.label_} by the spaCy model: {self.path_to_model}"
                explanation = self.build_spacy_explanation(
                    self.ner_strength, textual_explanation
                )
                spacy_result = RecognizerResult(
                    entity_type=entity,
                    start=ent.start_char,
                    end=ent.end_char,
                    score=self.ner_strength,
                    analysis_explanation=explanation,
                    recognition_metadata={
                        RecognizerResult.RECOGNIZER_NAME_KEY: self.name
                    },
                )
                results.append(spacy_result)

        return results
    
    @staticmethod
    def __check_label(
        entity: str, label: str, check_label_groups: Tuple[Set, Set]
    ) -> bool:
        return any(
            [entity in egrp and label in lgrp for egrp, lgrp in check_label_groups]
        )

Adding the new recognizer (which, in this example, only detects the ORG entity):

custom_spacy = CustomSpacyRecognizer(path_to_model="en_core_web_sm")

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(custom_spacy)

results = analyzer.analyze(text="David Smith works at IBM", language="en", return_decision_process=True)

Results (shown with the decision process, so you can see which recognizer produced each entity: PERSON came from the default spaCy model and ORG from the custom model, in this case en_core_web_sm, but it could be any other model):

[res.__dict__ for res in results]
[{'entity_type': 'PERSON',
  'start': 0,
  'end': 11,
  'score': 0.85,
  'analysis_explanation': {'recognizer': 'SpacyRecognizer', 'pattern_name': None, 'pattern': None, 'original_score': 0.85, 'score': 0.85, 'textual_explanation': "Identified as PERSON by Spacy's Named Entity Recognition", 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None},
  'recognition_metadata': {'recognizer_name': 'SpacyRecognizer'}},
 {'entity_type': 'ORG',
  'start': 21,
  'end': 24,
  'score': 0.85,
  'analysis_explanation': {'recognizer': 'CustomSpacyRecognizer', 'pattern_name': None, 'pattern': None, 'original_score': 0.85, 'score': 0.85, 'textual_explanation': 'Identified as ORG by the spaCy model: en_core_web_sm', 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None},
  'recognition_metadata': {'recognizer_name': 'CustomSpacyRecognizer'}}]
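
If both models support the same entity type, the same span can come back from two recognizers. How Presidio itself resolves such cases is not covered in this thread; the dependency-free sketch below only illustrates one possible policy (keep the highest-scoring result per identical type and span). `merge_results` is a hypothetical helper, it works on plain dicts rather than `RecognizerResult` objects, and the 0.6 score is made up for illustration.

```python
from typing import Dict, List, Tuple


def merge_results(*result_lists: List[dict]) -> List[dict]:
    """Keep one result per (entity_type, start, end): the highest-scoring one."""
    best: Dict[Tuple[str, int, int], dict] = {}
    for results in result_lists:
        for res in results:
            key = (res["entity_type"], res["start"], res["end"])
            if key not in best or res["score"] > best[key]["score"]:
                best[key] = res
    # Return results in reading order.
    return sorted(best.values(), key=lambda r: (r["start"], r["end"]))


# Illustrative inputs for the text "David Smith works at IBM".
default_model = [
    {"entity_type": "PERSON", "start": 0, "end": 11, "score": 0.85,
     "recognizer": "SpacyRecognizer"},
    {"entity_type": "ORG", "start": 21, "end": 24, "score": 0.6,
     "recognizer": "SpacyRecognizer"},
]
custom_model = [
    {"entity_type": "ORG", "start": 21, "end": 24, "score": 0.85,
     "recognizer": "CustomSpacyRecognizer"},
]

merged = merge_results(default_model, custom_model)
for res in merged:
    print(res["entity_type"], res["start"], res["end"], res["recognizer"])
```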

@vajjasaikiran
Author

Hi @omri374

Thank you so much for your quick responses. I tried this and it is working as expected.

@efka84

efka84 commented Dec 23, 2023

@omri374 @vajjasaikiran

I have installed my custom-trained NER model as a Python package. How can I use it with the final piece of code provided above (the accepted solution)?

@omri374
Contributor

omri374 commented Dec 24, 2023

@efka84 is it a spaCy model? If yes, you can pass a model loaded by spaCy into Presidio.
spaCy model loading: https://spacy.io/usage/saving-loading
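
`spacy.load` accepts either a directory path or the name of an installed package, so the `path_to_model` argument of the recognizer above can presumably be the package name (the example in this thread already passes `en_core_web_sm` that way). To hand over an already-loaded pipeline instead, the core of `analyze()` only needs a callable that returns an object with `.ents`. The sketch below assumes exactly that: `extract_entities` and `fake_nlp` are hypothetical names, and `fake_nlp` stands in for a real pipeline so the snippet runs without spaCy installed.

```python
from types import SimpleNamespace


def extract_entities(nlp, text, wanted_labels):
    """Run any spaCy-style callable and keep only the wanted labels."""
    doc = nlp(text)
    return [
        {"entity_type": ent.label_, "start": ent.start_char, "end": ent.end_char}
        for ent in doc.ents
        if ent.label_ in wanted_labels
    ]


def fake_nlp(text):
    """Stand-in for a loaded pipeline: tags the literal string 'IBM' as ORG."""
    start = text.find("IBM")
    if start == -1:
        return SimpleNamespace(ents=[])
    ent = SimpleNamespace(label_="ORG", start_char=start, end_char=start + 3)
    return SimpleNamespace(ents=[ent])


found = extract_entities(fake_nlp, "David Smith works at IBM", {"ORG"})
print(found)
```

With spaCy installed, `fake_nlp` would be replaced by the object returned from `spacy.load(...)` called with your package name, and the same loop could live inside a recognizer's `analyze()` method.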
