
Proposition: Using prior language probability to increase likelihood #101

Open

slavaGanzin opened this issue Jan 3, 2023 · 9 comments

slavaGanzin commented Jan 3, 2023

@pemistahl Peter, I think it would be beneficial for this library to have a separate method that adds a probability prior (in the Bayesian sense) to the mix.

Let's look at the statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

About 57% of the text you see on the internet is in English, so if you predicted "English" for every input you would be wrong only 43% of the time. It's like a stopped clock, except this one is right on more than every second try.

For example: #100

Based on that premise, plain character statistics alone make "как дела" ("how are you") look more Macedonian than Russian. But if we add language statistics to the mix, lingua-py would be "wrong" less often overall.

There are more Russian-speaking users of this library than Macedonian-speaking ones, simply because there are more Russian speakers overall. So when a random user writes "как дела", it is "more accurate" to predict Russian than Macedonian, because that is what such users generally expect.

So my proposal is to add a detector.detect_language_with_prior function that factors the prior in: likelihood = probability × prior_probability

For example: #97

detector.detect_language_of("Hello")

"ITALIAN": 0.9900000000000001,
"SPANISH": 0.8457074930316446,
"ENGLISH": 0.6405700388041755,
"FRENCH": 0.260556921899765,
"GERMAN": 0.01,
"CHINESE": 0,
"RUSSIAN": 0
detector.detect_language_with_prior("Hello")

# Of course, the constants are for illustrative purposes only.
# Results should be normalized afterwards.
"ENGLISH": 0.6405700388041755 * 0.577,
"SPANISH": 0.8457074930316446 * 0.045,
"ITALIAN": 0.9900000000000001 * 0.017,
"FRENCH": 0.260556921899765 * 0.039,

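Below is a minimal sketch of how this could be implemented on top of lingua's public compute_language_confidence_values API (recent lingua-py versions return ConfidenceValue objects with language and value attributes). The detect_language_with_prior helper, the prior table, and its values are all illustrative, not part of the library:

from lingua import Language, LanguageDetectorBuilder

# Illustrative priors, loosely following the web-content shares cited above;
# a real table would need a proper source and an entry for every language
# loaded into the detector.
WEB_CONTENT_PRIORS = {
    Language.ENGLISH: 0.577,
    Language.SPANISH: 0.045,
    Language.FRENCH: 0.039,
    Language.ITALIAN: 0.017,
}

detector = LanguageDetectorBuilder.from_languages(*WEB_CONTENT_PRIORS).build()

def detect_language_with_prior(text, priors=WEB_CONTENT_PRIORS):
    # Weight each confidence value by its prior, then renormalize so the
    # weighted values sum to 1 again.
    confidences = detector.compute_language_confidence_values(text)
    weighted = {c.language: c.value * priors.get(c.language, 0.0) for c in confidences}
    total = sum(weighted.values()) or 1.0  # guard against an all-zero result
    return {language: value / total for language, value in weighted.items()}

print(detect_language_with_prior("Hello"))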

pemistahl (Owner) commented:

Hi @slavaGanzin, thank you for this very interesting idea. :) I will evaluate whether the overall accuracy improves when applying prior probabilities.


duboff commented Jan 9, 2023

I agree this should dramatically increase quality. After using lingua-py in production at scale, we've noticed quite a few instances of small languages (e.g. Bulgarian, Macedonian) being predicted over much more likely ones.


nickchomey commented Feb 7, 2023

Another related suggestion: allow us to pass in a dictionary of language:probability pairs to indicate what language is expected, and either use it to break ties or even build it into the model's probability calculation somehow. Beyond the possibility that such a mechanism might improve results in general, it would give us significantly more control over our specific domains and use cases.

Let's say we're using social media data and we know (or have concluded) the primary language for each user. It would be useful to be able to tell lingua (perhaps even with some sort of probability, calculated from the language breakdown of the user's prior posts) what the expected language might be.

E.g. I post in English 99% of the time, but sometimes I write in Spanish. So, in an ambiguous situation, it would be better to conclude that the text is English. But if other contextual metadata were available (e.g. knowing that the post comes from a Spanish-centric group/page/hashtag), the supplied prior could be different.

If no argument is passed in, it could fall back to some global default, perhaps the distribution suggested by the OP, which we could override for our own domains via a .env file. That .env file could also make it easier to filter the permissible languages that are normally passed in as an argument: if nothing is passed, use the languages set in the env; if nothing is in the env, use all languages.
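
A sketch of what this per-user prior could look like, reusing the hypothetical detect_language_with_prior helper from the sketch above; the numbers are made up:

# Hypothetical prior derived from the language breakdown of one user's past posts.
user_priors = {
    Language.ENGLISH: 0.99,
    Language.SPANISH: 0.01,
}

# "final" is a word in both English and Spanish; with this user's prior,
# the ambiguity resolves toward English. A different prior could be passed
# when contextual metadata (e.g. a Spanish-centric group) suggests otherwise.
print(detect_language_with_prior("final", priors=user_priors))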


slavaGanzin commented Feb 8, 2023

@nickchomey The .env approach sounds scary. This could simply be a second parameter to the function, with default values equal to the general language distribution, which you could override by providing your own.
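
In terms of the earlier sketch, that is just the default value of the priors parameter, overridden per call (the detector would of course need Russian and Macedonian loaded for this hypothetical example):

# Override the general-distribution default with caller-supplied priors.
detect_language_with_prior("как дела", priors={Language.RUSSIAN: 0.9, Language.MACEDONIAN: 0.1})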

bhaveshkr commented:

> I agree this should dramatically increase quality. After using lingua-py in production at scale, we've noticed quite a few instances of small languages (e.g. Bulgarian, Macedonian) being predicted over much more likely ones.

Hi @duboff, I find lingua to be extremely slow, around 10-20 strings/sec on a MacBook Pro. Can you suggest an approach to make it usable in a production environment?

pemistahl (Owner) commented:

@bhaveshkr I've just written down some performance tips in the README. You probably want to read them.


duboff commented Sep 27, 2023

@pemistahl It's great to see a new version! I was getting a bit worried. Without putting undue pressure on you, do you think you are likely to consider the idea in this issue, or something similar, any time soon?

> Hi @duboff, I find lingua to be extremely slow, around 10-20 strings/sec on a MacBook Pro. Can you suggest an approach to make it usable in a production environment?

I just did exactly what the README told me, but our use case is typically short-ish strings. We run it on AWS Lambda, where it works fine with an increased timeout.

pemistahl (Owner) commented:

> Without putting undue pressure on you, do you think you are likely to consider the idea in this issue, or something similar, any time soon?

@duboff Half a year ago or so, I did a quick evaluation of applying hard-coded prior probabilities, but the overall detection accuracy decreased significantly. So the approach proposed in this issue is not as promising as you might expect. I've kept the issue open because I think it's worth doing more experiments in this direction; not having enough free time is the limiting factor. This is an open source project, however, so feel free to fork it and implement improvements yourself. I'm always happy about pull requests.

nickchomey commented:

I'm just going to reiterate that I think the approach I suggested is clearly the right one: let us pass in our own probabilities rather than hard-coding them.

#101 (comment)
