Add custom detection function for language detection #444
Conversation
Hi @saucam, thank you for the initial PR! I was thinking of reusing the `langdetect` parameter in the constructor:

```python
def __init__(self, path="facebook/m2m100_418M", quantize=False, gpu=True, batch=64, langdetect=DEFAULT_LANG_DETECT, findmodels=True):
```

Then, instead of adding a parameter, `detect` could check whether `langdetect` is a function:

```python
def detect(self, texts):
    """
    Detects the language for each element in texts.

    Args:
        texts: list of text

    Returns:
        list of languages
    """

    if not FASTTEXT:
        raise ImportError('Language detection is not available - install "pipeline" extra to enable')

    if not self.detector:
        # Check if langdetect is a function
        if isinstance(self.langdetect, types.FunctionType) or hasattr(self.langdetect, "__call__"):
            self.detector = self.langdetect
        else:
            # Suppress unnecessary warning
            fasttext.FastText.eprint = lambda x: None

            # Load language detection model
            path = cached_download(self.langdetect, legacy_cache_layout=True)
            self.detector = fasttext.load_model(path).predict

    # Transform texts to format expected by language detection model
    texts = [x.lower().replace("\n", " ").replace("\r\n", " ") for x in texts]

    return [x[0].split("__")[-1] for x in self.detector(texts)[0]]
```

In terms of the language ids being returned, I'd leave that as an implementation detail of the custom function. Custom functions should use ISO 639-1 two-letter codes.
While I see the advantage of reusing the existing variable, I think it is overly convoluted to have a variable that can be either a function or a string. Besides, we have to add a check to determine whether it is a function.
It also will not work for custom functions, because the return type of the fasttext model is different (something like `([['__label__en']], [array([0.9033877], dtype=float32)])`).
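To illustrate the mismatch: fasttext's `predict` returns a tuple of nested label lists plus probabilities, which the pipeline then flattens into plain two-letter codes, whereas a custom function would return the codes directly. A sketch with a hard-coded tuple standing in for the fasttext output (real fasttext returns numpy arrays for the probabilities):

```python
# Stand-in for fasttext predict output for two inputs (probabilities simplified)
raw = ([["__label__en"], ["__label__fr"]], [[0.90], [0.85]])

# The pipeline keeps only the labels and strips the "__label__" prefix
codes = [labels[0].split("__")[-1] for labels in raw[0]]
print(codes)  # ['en', 'fr']

# A custom detection function skips all of this and returns codes directly
def customdetect(texts):
    # Illustrative stub only
    return ["en" for _ in texts]
```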
I would prefer not to add another parameter. While that's easier for this change, I think the right approach is figuring out a good way to reuse `langdetect`; it's not actually a very useful parameter right now. I expect there is a way to create a default detect function for fasttext and have it return two-letter language codes. I'll think about it some more and write back.
Ok, thank you!
@davidmezzetti did you get a chance to give this some thought?
Not yet, I've been tied up on other issues, but I will take a look after the 5.4 release goes out.
Sorry for the delay here. This is the rough pseudocode to repurpose `langdetect` in a backwards-compatible way.

Change the default for `langdetect` to `None`:

```python
def __init__(self, path="facebook/m2m100_418M", quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
```

Load the default detector if `langdetect` is not provided (or is a model path), otherwise use the external function:

```python
def detect(self, texts):
    """
    Detects the language for each element in texts.

    Args:
        texts: list of text

    Returns:
        list of languages
    """

    # Backwards compatible: load fasttext model when langdetect is empty or a model path
    if not self.langdetect or isinstance(self.langdetect, str):
        return self.defaultdetect(texts)

    # Call external language detector
    return self.langdetect(texts)
```

Default fasttext language detector:

```python
def defaultdetect(self, texts):
    if not self.detector:
        if not FASTTEXT:
            raise ImportError('Language detection is not available - install "pipeline" extra to enable')

        # Default path
        path = self.langdetect
        if not path:
            path = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"

        # Suppress unnecessary warning
        fasttext.FastText.eprint = lambda x: None

        # Load language detection model
        path = cached_download(path, legacy_cache_layout=True)
        self.detector = fasttext.load_model(path)

    # Transform texts to format expected by language detection model
    texts = [x.lower().replace("\n", " ").replace("\r\n", " ") for x in texts]

    return [x[0].split("__")[-1] for x in self.detector.predict(texts)[0]]
```
@davidmezzetti thanks! I updated the PR with these changes.
Thanks, I'll get this merged in once the build is done. And congratulations, you're the 1,000th commit to this repo!
Thank you! Just fixed the failing tests.
Merged, thanks for the contribution! |
Serves #423
@davidmezzetti I was wondering what language ids should be returned by the custom function. Do the language ids change based on the model used? Do we need to return the same language ids as returned by the fasttext model?
Please take a look at the initial attempt.