Add custom detection function for language detection #444
Conversation
Hi @saucam, thank you for the initial PR! I was thinking of reusing the `langdetect` parameter in the constructor:

```python
def __init__(self, path="facebook/m2m100_418M", quantize=False, gpu=True, batch=64, langdetect=DEFAULT_LANG_DETECT, findmodels=True):
```

Then, instead of adding a parameter, `detect` could check whether `langdetect` is a function:

```python
def detect(self, texts):
    """
    Detects the language for each element in texts.

    Args:
        texts: list of text

    Returns:
        list of languages
    """

    if not FASTTEXT:
        raise ImportError('Language detection is not available - install "pipeline" extra to enable')

    if not self.detector:
        # Check if langdetect is a function
        if isinstance(self.langdetect, types.FunctionType) or hasattr(self.langdetect, "__call__"):
            self.detector = self.langdetect
        else:
            # Suppress unnecessary warning
            fasttext.FastText.eprint = lambda x: None

            # Load language detection model
            path = cached_download(self.langdetect, legacy_cache_layout=True)
            self.detector = fasttext.load_model(path).predict

    # Transform texts to format expected by language detection model
    texts = [x.lower().replace("\n", " ").replace("\r\n", " ") for x in texts]

    return [x[0].split("__")[-1] for x in self.detector(texts)[0]]
```

In terms of the language ids being returned, I'd leave that as an implementation detail of the custom function. Custom functions should use ISO 639-1 two-letter codes.
While I see the advantage of reusing the existing variable, I think it is overly convoluted to have a variable that can be either a function or a string. Besides, we have to add a check to determine whether it is a function.
It also will not work for custom functions, because the return type of the fasttext model is different (something like `([['__label__en']], [array([0.9033877], dtype=float32)])`).
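To illustrate the mismatch: fasttext's `predict` returns a tuple of nested label lists plus probabilities, which the pipeline then flattens into plain two-letter codes, whereas a custom function would return the codes directly. A sketch with a hard-coded tuple standing in for the fasttext output (real fasttext returns numpy arrays for the probabilities):

```python
# Stand-in for fasttext predict output for two inputs (probabilities simplified)
raw = ([["__label__en"], ["__label__fr"]], [[0.90], [0.85]])

# The pipeline keeps only the labels and strips the "__label__" prefix
codes = [labels[0].split("__")[-1] for labels in raw[0]]
print(codes)  # ['en', 'fr']

# A custom detection function skips all of this and returns codes directly
def customdetect(texts):
    # Illustrative stub only
    return ["en" for _ in texts]
```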
I would prefer not to add another parameter. While that's easier for this change, I think the right approach is figuring out a good way to reuse `langdetect`; it's not actually a very useful parameter right now. I expect there is a way to create a default detect function for fasttext and have it return two-letter language codes. I'll think about it some more and write back.
Ok, thank you!
@davidmezzetti did you get a chance to give this some thought?
Not yet, I've been tied up on other issues, but I will take a look after the 5.4 release goes out.
Sorry for the delay here. This is the rough pseudocode to repurpose `langdetect` in a backwards-compatible way.

Change the default for `langdetect` to `None`:

```python
def __init__(self, path="facebook/m2m100_418M", quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
```

Load the default detector if `langdetect` is not provided (or is a model path), otherwise use the external function:

```python
def detect(self, texts):
    """
    Detects the language for each element in texts.

    Args:
        texts: list of text

    Returns:
        list of languages
    """

    # Backwards compatible: load fasttext model when langdetect is empty or a model path
    if not self.langdetect or isinstance(self.langdetect, str):
        return self.defaultdetect(texts)

    # Call external language detector
    return self.langdetect(texts)
```

Default fasttext language detector:

```python
def defaultdetect(self, texts):
    if not self.detector:
        if not FASTTEXT:
            raise ImportError('Language detection is not available - install "pipeline" extra to enable')

        # Default path
        path = self.langdetect
        if not path:
            path = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"

        # Suppress unnecessary warning
        fasttext.FastText.eprint = lambda x: None

        # Load language detection model
        path = cached_download(path, legacy_cache_layout=True)
        self.detector = fasttext.load_model(path)

    # Transform texts to format expected by language detection model
    texts = [x.lower().replace("\n", " ").replace("\r\n", " ") for x in texts]

    return [x[0].split("__")[-1] for x in self.detector.predict(texts)[0]]
```
@davidmezzetti thanks! I updated the PR with these changes.
Thanks, I'll get this merged in once the build is done. And congratulations, you're the 1,000th commit to this repo!
Thank you! Just fixed the failing tests.
Merged, thanks for the contribution! |
Serves #423
@davidmezzetti I was wondering what language ids should be returned by the custom function. Do the language ids change based on the model used? Do we need to return the same language ids as returned by the fasttext model?
Please take a look at the initial attempt.