Individual vs inclusive 639-3 codes #282

ZJaume · 2023-12-15T16:39:35Z

Hi,

I'm comparing Lingua with the new FastText model for NLLB using your benchmark (by the way, if you are interested, I can submit a PR with the changes needed to run the benchmark with the new FastText model). This model uses ISO 639-3, but I found some differences between the set of codes in the FastText NLLB model and in Lingua. They arise because FT always (or almost always) uses individual language codes, whereas Lingua uses inclusive codes in most cases.

Ideally, I think, a detector would identify every individual language and therefore always use individual codes, but I know this is hard, especially for pluricentric languages (like Malay or Serbo-Croatian), and even more so when the variants are mutually intelligible. Or maybe there is no data to train a model for each variant.

So, I just wanted to point out these differences in case they are useful for you. These are the conversions I'm doing:

use lingua::Language;
use strum::IntoEnumIterator; // provides Language::iter()

fn map_fasttext_to_lingua(label: &str) -> Option<Language> {
    // Strip the "__label__" prefix and the script suffix: "__label__lvs_Latn" -> "lvs"
    let language = label.strip_prefix("__label__")?.split('_').next()?;

    // Convert FT codes to Lingua codes that do not match directly
    match language {
        // FT uses individual Azerbaijani (North, South) codes, Lingua uses the inclusive code
        "azb" | "azj" => return Some(Language::Azerbaijani),
        // FT uses the individual Tosk Albanian code, Lingua uses the inclusive Albanian code
        // It seems that all the text in the test set is Tosk
        "als" => return Some(Language::Albanian),
        // FT uses individual Standard Latvian (lvs) and Latgalian (ltg) codes
        "lvs" | "ltg" => return Some(Language::Latvian),
        // Lingua uses the individual Indonesian code, but it also uses the
        // inclusive Malay code, so Standard Malay maps to Malay
        "zsm" => return Some(Language::Malay),
        // Same for Mongolian: FT uses the individual Halh Mongolian code
        "khk" => return Some(Language::Mongolian),
        // FT uses individual Iranian Persian (pes) and Dari (prs) codes
        "pes" | "prs" => return Some(Language::Persian),
        _ => {}
    }

    // Every other code is expected to match a Lingua ISO 639-3 code directly
    for lingua_language in Language::iter() {
        if language == lingua_language.iso_code_639_3().to_string() {
            return Some(lingua_language);
        }
    }
    println!("Language code '{}' not found", language);
    None
}
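
The same conversions can also be written as a plain code-to-code table that does not depend on Lingua at all. This is only a minimal sketch: `parse_label` and `to_inclusive` are hypothetical helper names, not part of Lingua or FastText, and the table covers only the pairs discussed in this issue.

```rust
/// Split a FastText NLLB label such as "__label__lvs_Latn" into its
/// ISO 639-3 code and script tag. Returns None for malformed labels.
/// (Hypothetical helper, not a FastText or Lingua API.)
fn parse_label(label: &str) -> Option<(&str, &str)> {
    label.strip_prefix("__label__")?.split_once('_')
}

/// Map an individual ISO 639-3 code to the inclusive code it belongs to.
/// Only the pairs discussed in this issue are listed; any other code is
/// returned unchanged. Indonesian (ind) is deliberately left alone
/// because Lingua treats it as a separate language.
fn to_inclusive(code: &str) -> &str {
    match code {
        "azj" | "azb" => "aze", // North / South Azerbaijani
        "als" => "sqi",         // Tosk Albanian
        "lvs" | "ltg" => "lav", // Standard Latvian / Latgalian
        "zsm" => "msa",         // Standard Malay
        "khk" => "mon",         // Halh Mongolian
        "pes" | "prs" => "fas", // Iranian Persian / Dari
        other => other,
    }
}
```

For example, `parse_label("__label__prs_Arab")` yields the pair `("prs", "Arab")`, and `to_inclusive("prs")` yields `"fas"`, which can then be compared against Lingua's ISO 639-3 code for Persian.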

I do not speak any of the languages that differ and do not know the source of the test data, so I cannot tell whether this mapping is 100% correct. But for some test sets the FastText model supports both variants and still assigns almost every sentence to a single variant:

$ cat lingua-rs/language-models/lv/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    989 __label__lvs_Latn
      4 __label__est_Latn
      3 __label__lit_Latn
      1 __label__pol_Latn
      1 __label__oci_Latn
      1 __label__kor_Hang
      1 __label__hun_Latn
$ cat lingua-rs/language-models/fa/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    985 __label__pes_Arab
     13 __label__prs_Arab
      1 __label__yue_Hant
      1 __label__arb_Arab
$ cat lingua-rs/language-models/az/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    999 __label__azj_Latn
      1 __label__tur_Latn

So maybe Lingua uses inclusive codes but in practice covers only one of the variants under each inclusive code?
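
To put numbers on this, the `sort | uniq -c` output above can be tallied per language code. The following is a rough sketch using only the standard library; `tally` and `share` are hypothetical names, and the input is assumed to have the "<count> __label__<code>_<script>" shape shown in the listings.

```rust
use std::collections::BTreeMap;

/// Parse the output of `sort | uniq -c` over FastText predictions into a
/// map from ISO 639-3 code to count. Lines that do not look like
/// "<count> __label__<code>_<script>" are skipped.
fn tally(output: &str) -> BTreeMap<String, u32> {
    let mut counts = BTreeMap::new();
    for line in output.lines() {
        let mut fields = line.split_whitespace();
        let count = fields.next().and_then(|c| c.parse::<u32>().ok());
        let code = fields
            .next()
            .and_then(|l| l.strip_prefix("__label__"))
            .and_then(|l| l.split('_').next());
        if let (Some(count), Some(code)) = (count, code) {
            *counts.entry(code.to_string()).or_insert(0) += count;
        }
    }
    counts
}

/// Fraction of all predictions that fall on any of the given codes.
/// Returns 0.0 when there are no predictions at all.
fn share(counts: &BTreeMap<String, u32>, codes: &[&str]) -> f64 {
    let total: u32 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    let hits: u32 = codes.iter().filter_map(|c| counts.get(*c)).sum();
    f64::from(hits) / f64::from(total)
}
```

On the Persian listing, `share(&counts, &["pes"])` is 985/1000 and `share(&counts, &["pes", "prs"])` is 998/1000, so nearly all of the test set is labelled as the single variant `pes`.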

For context, this is the list of inclusive and individual codes and names from Wikipedia:

Latvian lav – inclusive code

lvs – Standard Latvian language
ltg – Latgalian language

Farsi fas – inclusive code

pes – Iranian Persian
prs – Dari

Azerbaijani aze – inclusive code

azj – North Azerbaijani
azb – South Azerbaijani

There is also the case of Malay, where Lingua uses the inclusive code msa, but this code also includes Indonesian (ind). Maybe the Lingua code should be Standard Malay (zsm)? But this is a difficult case and may need much more work, since Wikipedia says the two are close to mutually intelligible, and we already know from the benchmark that tools struggle to tell them apart:

$ cat lingua-rs/language-models/ms/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr
    802 __label__ind_Latn
    186 __label__zsm_Latn
      5 __label__eng_Latn
      3 __label__jav_Latn
      1 __label__yue_Hant
      1 __label__pol_Latn
      1 __label__hrv_Latn
      1 __label__cat_Latn
$ cat lingua-rs/language-models/id/testdata/sentences.txt | ./fastertext/build/fasttext predict lid218e.bin - | sort | uniq -c | sort -nr 
    957 __label__ind_Latn
     38 __label__zsm_Latn
      4 __label__jav_Latn
      1 __label__sun_Latn
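
The two listings above can be read as a small confusion matrix between Malay and Indonesian. The sketch below is only illustrative: the `Confusion` type is hypothetical, it assumes the test-set labels are correct (which is exactly what is in doubt here), and the precision figures consider only these two test sets, ignoring sentences from other languages.

```rust
/// Counts from the two FastText runs quoted above (1000 sentences per
/// test set). Predictions for other languages are ignored.
struct Confusion {
    ms_as_zsm: u32, // Malay test set predicted as Standard Malay
    ms_as_ind: u32, // Malay test set predicted as Indonesian
    id_as_ind: u32, // Indonesian test set predicted as Indonesian
    id_as_zsm: u32, // Indonesian test set predicted as Standard Malay
    sentences_per_set: u32,
}

/// Safe division that returns 0.0 for an empty denominator.
fn ratio(numerator: u32, denominator: u32) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        f64::from(numerator) / f64::from(denominator)
    }
}

impl Confusion {
    /// The counts reported in this issue.
    fn from_issue() -> Self {
        Confusion {
            ms_as_zsm: 186,
            ms_as_ind: 802,
            id_as_ind: 957,
            id_as_zsm: 38,
            sentences_per_set: 1000,
        }
    }

    /// Share of Malay sentences labelled zsm.
    fn malay_recall(&self) -> f64 {
        ratio(self.ms_as_zsm, self.sentences_per_set)
    }

    /// Share of Indonesian sentences labelled ind.
    fn indonesian_recall(&self) -> f64 {
        ratio(self.id_as_ind, self.sentences_per_set)
    }

    /// Share of zsm predictions that came from the Malay test set.
    fn malay_precision(&self) -> f64 {
        ratio(self.ms_as_zsm, self.ms_as_zsm + self.id_as_zsm)
    }

    /// Share of ind predictions that came from the Indonesian test set.
    fn indonesian_precision(&self) -> f64 {
        ratio(self.id_as_ind, self.id_as_ind + self.ms_as_ind)
    }
}
```

With these counts, Malay recall is 186/1000 (18.6%) and Indonesian recall is 957/1000 (95.7%), while only 957 of the 1759 `ind` predictions (about 54%) come from the Indonesian test set.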

Sorry about this "brick" of text and thank you for your tool, it is really helpful!

Repository owner locked and limited conversation to collaborators Dec 19, 2023

pemistahl converted this issue into discussion #287 Dec 19, 2023
