This issue was moved to a discussion.

Individual vs inclusive 639-3 codes #282

Closed
ZJaume opened this issue Dec 15, 2023 · 0 comments

ZJaume commented Dec 15, 2023

Hi,

I'm comparing Lingua with the new FastText model for NLLB using your benchmark (BTW, if you are interested, I can submit a PR with the changes needed to run the benchmark with the new FastText model). This model uses ISO 639-3, but I found some differences between the set of codes in the FastText NLLB model and in Lingua. The reason is that FT always (or almost always) uses individual language codes, whereas Lingua uses inclusive codes in most cases.

Ideally, I think, a tool would identify every possible language and therefore always use individual codes, but I know this is hard, especially for pluricentric languages (like Malay or Serbo-Croatian), and even more so when the variants are mutually intelligible. Or maybe there is no data to train a model for each variant.

So, I just wanted to point out these differences in case they are useful for you. These are the conversions I'm doing:

use lingua::Language;
use strum::IntoEnumIterator; // needed for Language::iter()

fn map_fasttext_to_lingua(label: &str) -> Option<Language> {
    // "__label__lvs_Latn" splits on '_' into ["", "", "label", "", "lvs", "Latn"],
    // so index 4 is the language code (this panics on labels without that shape)
    let language = label.split('_').collect::<Vec<&str>>()[4];

    // Convert FT codes to Lingua codes that do not match directly
    match language {
        // FT uses individual Azerbaijani (North, South) codes, Lingua uses inclusive
        "azb" | "azj" => return Some(Language::Azerbaijani),
        // FT uses individual Albanian Tosk code, Lingua uses SQ inclusive code
        // Seems that all the text in the test set is Tosk
        "als" => return Some(Language::Albanian),
        // FT uses the individual Standard Latvian code
        "lvs" | "lvg" => return Some(Language::Latvian),
        // Although Lingua uses the individual Indonesian code, it also uses
        // the inclusive Malay code
        "zsm" => return Some(Language::Malay),
        // Same with Mongolian
        "khk" => return Some(Language::Mongolian),
        // Same with Persian (FT has Iranian Persian and Dari)
        "pes" | "prs" => return Some(Language::Persian),
        _ => {},
    }

    for lingua_language in Language::iter() {
        if language == lingua_language.iso_code_639_3().to_string() {
            return Some(lingua_language);
        }
    }
    println!("Language code '{}' not found", language);
    None
}
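As a side note, the label could also be split without relying on a fixed index. This is only a minimal sketch using the standard library; `parse_fasttext_label` is a hypothetical helper name, and it assumes every label has the shape `__label__<language>_<script>` seen in the outputs in this issue:

```rust
/// Splits a FastText label such as "__label__lvs_Latn" into its
/// language code and script code.
/// Returns `None` when the label does not have the assumed shape.
fn parse_fasttext_label(label: &str) -> Option<(&str, &str)> {
    // Drop surrounding whitespace, then the "__label__" prefix
    let rest = label.trim().strip_prefix("__label__")?;
    // Split at the first '_' only: "lvs_Latn" -> ("lvs", "Latn")
    let (language, script) = rest.split_once('_')?;
    if language.is_empty() || script.is_empty() {
        return None;
    }
    Some((language, script))
}
```

With this, a malformed label yields `None` instead of an out-of-bounds panic, so the caller can decide how to report it.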

I do not speak any of the languages that differ and do not know the source of the test data, so I cannot tell whether this is 100% correct. But for some test sets where the FastText model supports both variants, it predicts almost exclusively one of them:

$ cat lingua-rs/language-models/lv/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    989 __label__lvs_Latn
      4 __label__est_Latn
      3 __label__lit_Latn
      1 __label__pol_Latn
      1 __label__oci_Latn
      1 __label__kor_Hang
      1 __label__hun_Latn
$ cat lingua-rs/language-models/fa/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    985 __label__pes_Arab
     13 __label__prs_Arab
      1 __label__yue_Hant
      1 __label__arb_Arab
$ cat lingua-rs/language-models/az/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    999 __label__azj_Latn
      1 __label__tur_Latn

So maybe Lingua is using inclusive codes but, in practice, covers only one of the variants included in each code?

For context, this is the list of inclusive and individual codes and names from Wikipedia:
Latvian lav – inclusive code

Farsi fas – inclusive code

Azerbaijani aze – inclusive code

  • azj – North Azerbaijani
  • azb – South Azerbaijani
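The relation between these codes can be written down as a small lookup. This is just a sketch: `inclusive_code` and `same_inclusive_language` are hypothetical helpers, and the table only contains the individual codes that appear in this issue for these three languages, not the full ISO 639-3 macrolanguage list:

```rust
/// Maps an individual ISO 639-3 code to its inclusive code.
/// Assumption: only the codes discussed in this issue are listed.
fn inclusive_code(individual: &str) -> Option<&'static str> {
    match individual {
        "lvs" => Some("lav"),         // Standard Latvian -> Latvian
        "pes" | "prs" => Some("fas"), // Iranian Persian, Dari -> Farsi
        "azj" | "azb" => Some("aze"), // North, South Azerbaijani -> Azerbaijani
        _ => None,
    }
}

/// Two codes count as the same language when they are equal after
/// replacing known individual codes by their inclusive code.
fn same_inclusive_language(a: &str, b: &str) -> bool {
    let inclusive_a = inclusive_code(a).unwrap_or(a);
    let inclusive_b = inclusive_code(b).unwrap_or(b);
    inclusive_a == inclusive_b
}
```

Comparing at the inclusive level lets a prediction of `pes` count as correct for a tool that reports `fas`, which is the kind of comparison the benchmark needs here.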

There is also the case of Malay, where Lingua uses the inclusive code msa, but this code includes Indonesian ind. Maybe the Lingua code should be Standard Malay zsm? This is a difficult case that may need much more work, though, since Wikipedia says they are close to mutually intelligible and we already know from the benchmark that tools struggle to differentiate between them:

$ cat lingua-rs/language-models/ms/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    802 __label__ind_Latn
    186 __label__zsm_Latn
      5 __label__eng_Latn
      3 __label__jav_Latn
      1 __label__yue_Hant
      1 __label__pol_Latn
      1 __label__hrv_Latn
      1 __label__cat_Latn
$ cat lingua-rs/language-models/id/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr 
    957 __label__ind_Latn
     38 __label__zsm_Latn
      4 __label__jav_Latn
      1 __label__sun_Latn
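To make the effect of the code choice concrete, the `uniq -c` output above can be summed under different sets of accepted codes. A minimal sketch, assuming each line has the form `<count> __label__<language>_<script>`; `count_accepted` is a hypothetical helper, not part of Lingua or FastText:

```rust
/// Sums the counts of a `sort | uniq -c` listing of FastText labels.
/// Returns (count of lines whose language code is in `accepted`, total count).
/// Lines that do not parse as "<count> <label>" are skipped.
fn count_accepted(uniq_output: &str, accepted: &[&str]) -> (u32, u32) {
    let mut matched = 0;
    let mut total = 0;
    for line in uniq_output.lines() {
        let mut fields = line.split_whitespace();
        let (Some(count), Some(label)) = (fields.next(), fields.next()) else {
            continue;
        };
        let Ok(count) = count.parse::<u32>() else {
            continue;
        };
        total += count;
        // "__label__ind_Latn" -> "ind"
        let code = label
            .strip_prefix("__label__")
            .and_then(|rest| rest.split('_').next());
        if code.map_or(false, |c| accepted.contains(&c)) {
            matched += count;
        }
    }
    (matched, total)
}
```

On the Malay test set output above, accepting only `zsm` gives 186 of 1000 sentences, while accepting both `zsm` and `ind` (as the inclusive code msa would) gives 988 of 1000.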

Sorry about this "brick" of text and thank you for your tool, it is really helpful!

Repository owner locked and limited conversation to collaborators Dec 19, 2023
@pemistahl pemistahl converted this issue into discussion #287 Dec 19, 2023
