Most frequent alphabet detection causes inaccuracies when multiple alphabets have same occurrence count #105

Marcono1234 · 2021-06-09T09:24:16Z

The mostFrequentAlphabet detection of filterLanguagesByRules uses maxByOrNull:

lingua/src/main/kotlin/com/github/pemistahl/lingua/api/LanguageDetector.kt

Line 318 in 7e415ae

    val mostFrequentAlphabet = detectedAlphabets.entries.maxByOrNull { it.value }!!.key

When a text contains words from multiple alphabets that have the same occurrence count, maxByOrNull picks only one of them as the most frequent one.
For example, the following returns only Greek with a confidence of 1.0:

LanguageDetectorBuilder.fromLanguages(Language.GREEK, Language.BENGALI).build()
  .computeLanguageConfidenceValues("Παπασταθόπουλου ভয়াবহ")

When the set of all alphabets with the maximum count is used instead, the result is (roughly): {GREEK=1.0, BENGALI=0.6349401470311586}
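
To illustrate the difference, here is a minimal standalone sketch. It does not use Lingua's actual classes: the Alphabet enum and the mostFrequentAlphabets helper are hypothetical names made up for this example, and the counts are hard-coded to mimic one Greek and one Bengali word.

```kotlin
// Hypothetical stand-in for Lingua's internal alphabet enum (assumption, not the real type).
enum class Alphabet { GREEK, BENGALI, LATIN }

// Hypothetical helper: returns every alphabet whose count equals the maximum,
// instead of a single winner.
fun mostFrequentAlphabets(counts: Map<Alphabet, Int>): Set<Alphabet> {
    val maxCount = counts.values.maxOrNull() ?: return emptySet()
    return counts.filterValues { it == maxCount }.keys
}

fun main() {
    // One word per alphabet, so both counts tie at 1.
    val counts = mapOf(Alphabet.GREEK to 1, Alphabet.BENGALI to 1)

    // maxByOrNull keeps only the first maximal entry in iteration order.
    val single = counts.entries.maxByOrNull { it.value }!!.key
    println(single) // GREEK

    // Collecting all maximal entries keeps both alphabets in play.
    println(mostFrequentAlphabets(counts)) // [GREEK, BENGALI]
}
```

With a tie, the single-winner variant silently drops Bengali, while the set-based variant would let the languages of both alphabets reach the later detection steps.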


Marcono1234 · 2021-06-12T23:28:02Z

Or maybe in general it would be good to adjust the rule-based detection so that it does not make rash decisions. For example, when a text is half Japanese and half English (with the English part being the translation), Lingua will most likely return only Japanese with a confidence of 1.0.

Edit: Though maybe this is actually a feature request asking for detection of multiple languages in a text, similar to #38, except without having to know where the sections in different languages are; approximate percentages might suffice.

pemistahl · 2021-11-23T13:58:28Z

This problem has been fixed now.

Marcono1234 mentioned this issue Jun 9, 2021

Chinese and Japanese word special casing is obsolete #104

Closed

pemistahl added a commit that referenced this issue Nov 14, 2021

Fix non-deterministic language detection (#105)

cbcd5a6

pemistahl added this to the Lingua 1.1.1 milestone Nov 22, 2021

pemistahl added the bug label Nov 22, 2021

pemistahl added a commit that referenced this issue Nov 23, 2021

Add unit test to check deterministic language detection (#105)

c37e15d

pemistahl closed this as completed Nov 23, 2021
