Most frequent alphabet detection causes inaccuracies when multiple alphabets have same occurrence count #105

Closed
Marcono1234 opened this issue Jun 9, 2021 · 2 comments

@Marcono1234
Contributor

The mostFrequentAlphabet detection of filterLanguagesByRules uses maxByOrNull:

val mostFrequentAlphabet = detectedAlphabets.entries.maxByOrNull { it.value }!!.key

When the text contains words from multiple alphabets that have the same occurrence count, maxByOrNull picks just one of them as the most frequent.
For example, the following returns only Greek, with a confidence of 1.0:

LanguageDetectorBuilder.fromLanguages(Language.GREEK, Language.BENGALI).build()
  .computeLanguageConfidenceValues("Παπασταθόπουλου ভয়াবহ")

If the set of all alphabets with the maximum count is used instead, the result is (roughly): {GREEK=1.0, BENGALI=0.6349401470311586}
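
A minimal sketch of what "getting a set of alphabets with maximum count" could look like. The `Alphabet` enum and the `mostFrequentAlphabets` function are hypothetical stand-ins for illustration, not Lingua's actual internals, and this is not necessarily how the fix was implemented:

```kotlin
// Hypothetical stand-in for the alphabet type used by the rule-based filter.
enum class Alphabet { GREEK, BENGALI, LATIN }

// Returns every alphabet whose count equals the maximum,
// instead of the single entry that maxByOrNull would return on a tie.
fun mostFrequentAlphabets(detectedAlphabets: Map<Alphabet, Int>): Set<Alphabet> {
    val maxCount = detectedAlphabets.values.maxOrNull() ?: return emptySet()
    return detectedAlphabets.filterValues { it == maxCount }.keys
}
```

With one Greek word and one Bengali word, both alphabets would be returned, so languages of both could be kept as candidates for the statistical step.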

@Marcono1234
Contributor Author

Marcono1234 commented Jun 12, 2021

Or maybe in general it would be good to adjust the rule-based detection so that it does not make rash decisions. For example, when a text is half Japanese and half English (with the English part being the translation), Lingua will most likely return only Japanese with a confidence of 1.0.

Edit: Though maybe this is actually a feature request asking for detection of multiple languages in a text, similar to #38, except without having to know where the sections in the different languages are; approximate percentages might suffice.
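
To illustrate what "approximate percentages" could mean at the script level, here is a rough sketch. `scriptShares` is a hypothetical helper and not part of Lingua; it relies only on the JDK's `Character.UnicodeScript`, and scripts are of course not the same thing as languages:

```kotlin
// Hypothetical helper: fraction of letters belonging to each Unicode script.
fun scriptShares(text: String): Map<Character.UnicodeScript, Double> {
    val counts = mutableMapOf<Character.UnicodeScript, Int>()
    text.codePoints().forEach { cp ->
        // Only letters are counted; digits, whitespace and punctuation are skipped.
        if (Character.isLetter(cp)) {
            val script = Character.UnicodeScript.of(cp)
            counts[script] = (counts[script] ?: 0) + 1
        }
    }
    val total = counts.values.sum()
    return if (total == 0) emptyMap() else counts.mapValues { it.value.toDouble() / total }
}
```

Shares like these could tell the rule-based step that a text is mixed, without it needing to know where the individual sections start and end.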

@pemistahl pemistahl added this to the Lingua 1.1.1 milestone Nov 22, 2021
@pemistahl pemistahl added the bug Something isn't working label Nov 22, 2021
@pemistahl
Owner

This problem has been fixed now.
