I'm comparing Lingua with the new FastText model for NLLB using your benchmark (BTW, if you are interested, I can submit a PR with the changes needed to run the benchmark with the new FastText model). This model uses ISO 639-3, but I found some differences between the set of codes in the FastText NLLB model and in Lingua. They arise because FT always (or almost always) uses individual language codes instead of inclusive codes, which Lingua uses in most cases.
The ideal, I think, would be to identify all possible languages and therefore always use individual codes, but I know this is hard, especially for pluricentric languages (like Malay or Serbo-Croatian), and even more so if the variants are mutually intelligible. Or maybe there is no data to train a model for each variant.
So, I just wanted to point out these differences in case they are useful for you. These are the conversions I'm doing:

    use lingua::Language;
    use strum::IntoEnumIterator; // assumed source of Language::iter()

    fn map_fasttext_to_lingua(label: &str) -> Option<Language> {
        // Remove the __label__ prefix: splitting on '_' leaves the code at index 4.
        // nth(4)? returns None for malformed labels instead of panicking.
        let language = label.split('_').nth(4)?;

        // Convert FT codes to Lingua codes that do not match directly
        match language {
            // FT uses individual Azerbaijani (North, South) codes, Lingua uses inclusive
            "azb" | "azj" => return Some(Language::Azerbaijani),
            // FT uses the individual Tosk Albanian code, Lingua uses the inclusive SQ code.
            // It seems that all the text in the test set is Tosk
            "als" => return Some(Language::Albanian),
            // FT uses the individual Standard Latvian code
            "lvs" | "lvg" => return Some(Language::Latvian),
            // Lingua uses the individual Indonesian code, but it also uses the
            // inclusive Malay code
            "zsm" => return Some(Language::Malay),
            // Same with Mongolian
            "khk" => return Some(Language::Mongolian),
            "pes" | "prs" => return Some(Language::Persian),
            _ => {}
        }

        // Otherwise look for a Lingua language with the same ISO 639-3 code
        for lingua_language in Language::iter() {
            if language == lingua_language.iso_code_639_3().to_string() {
                return Some(lingua_language);
            }
        }
        println!("Language code '{}' not found", language);
        None
    }

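Taking the element at position 4 of the split relies on the exact shape of the label. A slightly more defensive way to pull the code out is sketched below, assuming labels look like `__label__eng_Latn` (prefix, ISO 639-3 code, underscore, script). `parse_label` is a hypothetical helper name, not part of FastText or Lingua:

```rust
/// Splits a label such as "__label__eng_Latn" into
/// (language code, optional script). Returns None if the
/// "__label__" prefix is missing or the code is empty.
fn parse_label(label: &str) -> Option<(&str, Option<&str>)> {
    let rest = label.strip_prefix("__label__")?;
    // Split only once, so the script part is kept whole.
    let mut parts = rest.splitn(2, '_');
    let code = parts.next().filter(|c| !c.is_empty())?;
    Some((code, parts.next()))
}
```

The returned code can then go through the same `match` as above, and a label without the expected prefix yields `None` instead of a wrong code.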
I do not speak any of the languages that differ and do not know the source of the test data, so I cannot tell whether this is 100% true. But for some test sets where the FastText model supports both variants, it reports only one of them.
So maybe Lingua uses inclusive codes but in practice covers only one of the variants of each inclusive code?
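One way to look into this is to count which individual codes the FastText model predicts on each Lingua test set. A minimal sketch, assuming the predictions have already been reduced to ISO 639-3 codes; `tally_codes` is a hypothetical helper and any inputs passed to it here are made up, not benchmark results:

```rust
use std::collections::HashMap;

/// Counts how often each code occurs, most frequent first
/// (ties are broken alphabetically so the order is deterministic).
fn tally_codes(predictions: &[&str]) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for code in predictions {
        *counts.entry(*code).or_insert(0) += 1;
    }
    let mut sorted: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(code, n)| (code.to_string(), n))
        .collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    sorted
}
```

If, say, the Persian test set came back almost entirely as `pes`, that would hint that the data covers a single variant, though it would not prove it.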
For context, this is the list of inclusive and individual codes and names from Wikipedia:
Latvian lav – inclusive code
Farsi fas – inclusive code
Azerbaijani aze – inclusive code
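When scoring, a prediction with an individual code could be counted as correct if it belongs to the expected inclusive code. A rough sketch that uses only some of the codes named in this issue (the real macrolanguages may have more members); `members` and `is_match` are hypothetical helpers:

```rust
/// Individual codes from this issue for each inclusive code.
/// This is a partial table, not the full ISO 639-3 macrolanguage mapping.
fn members(inclusive: &str) -> &'static [&'static str] {
    match inclusive {
        "aze" => &["azb", "azj"],
        "fas" => &["pes", "prs"],
        "lav" => &["lvs"],
        _ => &[],
    }
}

/// True if `predicted` equals `expected` or is one of its listed members.
fn is_match(expected: &str, predicted: &str) -> bool {
    expected == predicted || members(expected).contains(&predicted)
}
```

With this, `is_match("fas", "pes")` holds while `is_match("fas", "azb")` does not, so both tools can be compared without forcing either one onto the other's code set.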
There is also the case of Malay, where Lingua uses the inclusive code `msa`, but this code includes Indonesian (`ind`). Maybe the Lingua code should be Standard Malay (`zsm`)? But this is a difficult case and may need much more work, since Wikipedia says they are close to mutually intelligible and we already know from the benchmark that tools struggle to differentiate between them.

Sorry about this "brick" of text and thank you for your tool, it is really helpful!