This repository has been archived by the owner on Jun 15, 2024. It is now read-only.

Some of the English words detect as different language #76

Open
kooshansari opened this issue Nov 24, 2022 · 4 comments

Comments

@kooshansari

Please check the sheet below. Most simple English words are detected as a different language.
image

@kooshansari changed the title from "Some of the English words" to "Some of the English words detect as different language" on Nov 24, 2022
@janheinrichmerker

Most language detectors don't work well on very short texts (in this case a single word).
You could use the model's output scores to define a threshold below which no language is reported. Otherwise, the language labels on short texts will probably be noisy.
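
A minimal sketch of that thresholding idea, in Python for illustration only. It assumes the detector returns `(label, score)` pairs with scores in the range 0 to 1. The function name `pick_language` and the default cutoff of `0.7` are hypothetical and not part of this library's API, so the cutoff would need tuning on real data.

```python
from typing import Optional, Sequence, Tuple


def pick_language(
    scores: Sequence[Tuple[str, float]],
    min_score: float = 0.7,  # hypothetical cutoff, tune on your own data
) -> Optional[str]:
    """Return the top-scoring label, or None if it is not confident enough."""
    if not scores:
        return None
    # Take the candidate with the highest score.
    label, score = max(scores, key=lambda pair: pair[1])
    # Report "no language" rather than a noisy guess.
    return label if score >= min_score else None


# Made-up scores for a single short word versus a longer sentence.
print(pick_language([("en", 0.35), ("ko", 0.30)]))
print(pick_language([("en", 0.92), ("de", 0.05)]))
```

Returning `None` lets the caller fall back to a default language or ask for more text instead of trusting a low-confidence label.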

@bfischer1121

Why are language detectors so bad on short text? I get that the sample size is small, but one would think they would fall back to a basic sanity check. For example, the characters "age" have absolutely no overlap with the characters found in Korean. This seems to be an issue with every language detection library we've used: pure randomness!
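
One way such a sanity check could look, sketched in Python with only the standard library. It rejects a predicted label when the text contains no letters from the script that language is normally written in. The `EXPECTED_SCRIPT` table and the function names are hypothetical, the table is deliberately tiny, and matching on Unicode character names is a rough stand-in for a proper script lookup (Korean text containing only Hanja would be wrongly rejected, for example).

```python
import unicodedata

# Hypothetical, incomplete mapping from language code to the prefix
# that Unicode character names use for that language's main script.
EXPECTED_SCRIPT = {
    "en": "LATIN",
    "ko": "HANGUL",
    "ru": "CYRILLIC",
    "el": "GREEK",
}


def uses_script(text: str, script_prefix: str) -> bool:
    """True if any letter in text has a Unicode name starting with script_prefix."""
    return any(
        ch.isalpha() and unicodedata.name(ch, "").startswith(script_prefix)
        for ch in text
    )


def is_plausible(text: str, label: str) -> bool:
    """Reject labels whose expected script does not appear in the text."""
    prefix = EXPECTED_SCRIPT.get(label)
    if prefix is None:
        return True  # unknown language: nothing to check against
    return uses_script(text, prefix)


print(is_plausible("age", "ko"))   # False: no Hangul letters in "age"
print(is_plausible("age", "en"))   # True
print(is_plausible("나이", "ko"))  # True
```

A check like this only filters out impossible labels. It cannot separate languages that share a script, such as English and German.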

@AmitMY

AmitMY commented Jul 7, 2023

I feel like this one might be a little better: https://mediapipe-studio.webapps.google.com/demo/language_detector

@bfischer1121

Nice suggestion (detects 6/7 correctly)
