Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom detection function for language detection #444

Merged
merged 4 commits into from
Mar 11, 2023

Conversation

saucam
Copy link
Contributor

@saucam saucam commented Feb 25, 2023

Serves #423

@davidmezzetti I was wondering what language ids should be returned by the custom function. Do the language ids change based on the model used? Do we need to return the same language ids as returned by the fasttext model?

Please take a look at the initial attempt.

@davidmezzetti
Copy link
Member

Hi @saucam Thank you for the initial PR!

I was thinking of reusing the langdetect parameter in the constructor:

    def __init__(self, path="facebook/m2m100_418M", quantize=False, gpu=True, batch=64, langdetect=DEFAULT_LANG_DETECT, findmodels=True):

Then instead of adding a parameter to detect the function would analyze self.langdetect to determine if it's a function. Otherwise, it would fallback to the existing logic. Something like this.

    def detect(self, texts):
        """
        Detects the language for each element in texts.

        Args:
            texts: list of text

        Returns:
            list of languages
        """

        if not FASTTEXT:
            raise ImportError('Language detection is not available - install "pipeline" extra to enable')

        if not self.detector:
            # Check if langdetect is a function
            if isinstance(self.langdetect, types.FunctionType) or hasattr(self.langdetect, "__call__"):
                self.detector = self.langdetect
            else:
                # Suppress unnecessary warning
                fasttext.FastText.eprint = lambda x: None
    
                # Load language detection model
                path = cached_download(self.langdetect, legacy_cache_layout=True)
                self.detector = fasttext.load_model(path).predict

        # Transform texts to format expected by language detection model
        texts = [x.lower().replace("\n", " ").replace("\r\n", " ") for x in texts]

        return [x[0].split("__")[-1] for x in self.detector(texts)[0]]

In terms of the language ids being returned, I'd leave that as an implementation detail for the custom function. Custom functions should use ISO-639-1 2 letter codes.

@saucam
Copy link
Contributor Author

saucam commented Feb 25, 2023

While I see the advantage of re-using existing variable, I think it is overly convoluted to use a variable that can be either a function or a string! Besides we have to add a check to find if it is a function.
Isn't the approach of having an extra variable simpler?
Also, since are we doing

[x[0].split("__")[-1] for x in self.detector(texts)[0]]

it will not work for custom functions because return type from fasttext model is different (something like ([['__label__en']], [array([0.9033877], dtype=float32)]))
whereas returned codes are expected to be ISO-639-1 2 letter codes from custom function

@davidmezzetti
Copy link
Member

I would prefer not to add another parameter. While it's easier for this change, I think the right approach is figuring out a good way to reuse langdetect. It's actually not really a useful parameter right now.

I expect there is a way to create a default detect function for fasttext and make it return two letter country codes. I'll think about it some more and write back.

@saucam
Copy link
Contributor Author

saucam commented Feb 28, 2023

ok thank you !

@saucam
Copy link
Contributor Author

saucam commented Mar 5, 2023

@davidmezzetti did you get a chance to give this a thought?

@davidmezzetti
Copy link
Member

Not yet, I've been tied up on other issues but I will take a look after the 5.4 release goes out.

@davidmezzetti
Copy link
Member

davidmezzetti commented Mar 10, 2023

Sorry for the delay here.

This is the rough pseudocode to repurpose langdetect in a backwards compatible way.

Change default for langdetect to None

def __init__(self, path="facebook/m2m100_418M", quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):

Load default detector if langdetect not provided, otherwise use external function.

def detect(self, texts):
    """
    Detects the language for each element in texts.

    Args:
        texts: list of text

    Returns:
        list of languages
    """

    # Backwards compatible to load fasttext model
    if not self.langdetect or isinstance(self.langdetect, str):
        return self.defaultdetect(texts)

    # Call external language detector
    return self.langdetect(texts)

Default fasttext language detector

def defaultdetect(self, texts):
    if not self.detector:
        if not FASTTEXT:
            raise ImportError('Language detection is not available - install "pipeline" extra to enable')

        # Default path
        path = self.langdetect
        if not path: 
            path = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"

        # Suppress unnecessary warning
        fasttext.FastText.eprint = lambda x: None

        # Load language detection model
        path = cached_download(self.langdetect, legacy_cache_layout=True)
        self.detector = fasttext.load_model(path)

    # Transform texts to format expected by language detection model
    texts = [x.lower().replace("\n", " ").replace("\r\n", " ") for x in texts]

    return [x[0].split("__")[-1] for x in self.detector.predict(texts)[0]]

@saucam
Copy link
Contributor Author

saucam commented Mar 11, 2023

@davidmezzetti thanks! I updated the PR with these changes

@davidmezzetti davidmezzetti added this to the v5.5.0 milestone Mar 11, 2023
@davidmezzetti
Copy link
Member

Thanks, I'll get this merged in once the builds done. And congratulations you're the 1,000th commit to this repo!

@saucam
Copy link
Contributor Author

saucam commented Mar 11, 2023

Thank you! just fixed the failing tests.

@davidmezzetti
Copy link
Member

Merged, thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Modify translation pipeline langdetect parameter to accept language detection function
2 participants