IllegalStateException thrown for unusual case #24

Closed
RichardInnocent opened this issue Feb 4, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@RichardInnocent

I'm able to configure the LanguageDetector as follows:

LanguageDetector languageDetector =
    LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.UNKNOWN)
                           .build();

When trying to compute the probabilities of the languages for the content 그 가격으로는 최상, the following exception is thrown:

Exception in thread "main" java.lang.IllegalStateException: inputStream must not be null
	at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:346)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:353)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:72)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:336)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:312)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:299)
	at com.github.pemistahl.lingua.api.LanguageDetector.addNgramProbabilities$lingua(LanguageDetector.kt:164)
	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:116)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$3(LanguageDetectionTimeAnalysis.java:83)
	at java.util.ArrayList.forEach(ArrayList.java:1257)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$2(LanguageDetectionTimeAnalysis.java:81)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$TimedEvent.time(LanguageDetectionTimeAnalysis.java:107)
	at com.feefo.entity.feedback.statistics.LanguageDetectionTimeAnalysis$LinguaTimeCheck.lambda$1(LanguageDetectionTimeAnalysis.java:89)

This exception is not thrown for other clearly non-English content (e.g. 여보세요). Changing Language.UNKNOWN to Language.GERMAN also avoids the exception.

If Language.UNKNOWN is not meant to be included in the fromLanguages collection, a suitable exception should be thrown to indicate this.


As a side note, I include Language.ENGLISH and Language.UNKNOWN because I only care whether or not the language is English, so I would prefer to keep the ability to include Language.UNKNOWN.

@krzysztofcybulski

You can hack the API by passing the same language twice:
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.ENGLISH)

@RichardInnocent
Author

Thanks @krzysztofpcy, that's a great workaround for now. I still think this issue should be resolved in an upcoming version.

@pemistahl
Owner

It never ceases to amaze me how creative people become in extending a tool to use cases that were never intended to be supported. :-)

Language.UNKNOWN is not meant to be used as input for the method. It serves only as a return value. Your exception is thrown because the library tries to load a language model for Language.UNKNOWN from disk into memory, and of course no such model exists. In the cases where you didn't get the exception, the rule-based engine could determine the language successfully, so loading the language models was not necessary.
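
To illustrate the mechanism described above, here is a minimal, self-contained Java sketch. It is not Lingua's actual code: the class LazyModelSketch, the resource path, and the rulesCanDecide predicate are all assumptions made for illustration. It only shows how a rule-based shortcut can hide a missing model until some input finally reaches the model-loading step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Predicate;

// Illustrative only: names and the resource layout are hypothetical, not Lingua's.
final class LazyModelSketch {

    // Opens a model file from the classpath. getResourceAsStream returns null
    // when the resource does not exist, which is turned into an exception here.
    static InputStream openModel(String languageName) {
        InputStream in = LazyModelSketch.class.getResourceAsStream(
                "/language-models/" + languageName + "/unigrams.json");
        if (in == null) {
            throw new IllegalStateException("inputStream must not be null");
        }
        return in;
    }

    // Reports which step decided. The model is opened only if the rules cannot decide.
    static String detect(String text, Predicate<String> rulesCanDecide, String modelName) {
        if (rulesCanDecide.test(text)) {
            return "rule-based"; // the model is never touched on this path
        }
        try (InputStream in = openModel(modelName)) {
            return "statistical";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a predicate that returns true, detect never touches the model. With a predicate that returns false, it fails with the same message as in the stack trace, assuming no such resource is on the classpath.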

If you just want to determine whether some text is English or not and you cannot reliably exclude any other language in your data set, then please use LanguageDetectorBuilder.fromAllBuiltInLanguages() and throw away everything that does not return Language.ENGLISH. If I find the time to implement some kind of confidence scoring, this use case can perhaps be handled more easily.
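
A rough sketch of that filtering approach follows. To keep it runnable without the library, the detector is passed in as a function and Lang is a stand-in enum; both are assumptions. With Lingua, the function would presumably wrap detectLanguageOf on a detector built via LanguageDetectorBuilder.fromAllBuiltInLanguages().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-in for the library's Language enum; only a few values are listed.
enum Lang { ENGLISH, GERMAN, KOREAN, UNKNOWN }

final class EnglishFilter {

    // Keeps only the texts that the supplied detector classifies as English.
    // Everything else, including UNKNOWN, is thrown away.
    static List<String> keepEnglish(List<String> texts, Function<String, Lang> detect) {
        List<String> english = new ArrayList<>();
        for (String text : texts) {
            if (detect.apply(text) == Lang.ENGLISH) {
                english.add(text);
            }
        }
        return english;
    }
}
```

Because the detector covers every language, no Language.UNKNOWN entry is needed as input; it only ever shows up as a result that gets filtered out.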

In any case, I will change the API so that an exception is thrown whenever Language.UNKNOWN is passed as an input language. Thanks for letting me know about this, @RichardInnocent.
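
One way such a check could look is sketched below. This only illustrates the idea and is not the change that went into the library: InputLanguage and requireKnown are hypothetical names, and the choice of IllegalArgumentException is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the library's Language enum.
enum InputLanguage { ENGLISH, GERMAN, UNKNOWN }

final class InputValidation {

    // Rejects UNKNOWN up front, so the caller gets a clear message at
    // configuration time instead of a failure during detection.
    static List<InputLanguage> requireKnown(InputLanguage... languages) {
        for (InputLanguage language : languages) {
            if (language == InputLanguage.UNKNOWN) {
                throw new IllegalArgumentException(
                        "UNKNOWN is a return value only and cannot be used as an input language");
            }
        }
        return Arrays.asList(languages);
    }
}
```

Failing at build time rather than at detection time means the error no longer depends on which text happens to reach the model-loading step.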

@pemistahl added the bug label (Something isn't working) on Feb 4, 2020
@pemistahl added this to the Lingua 0.6.1 milestone on Feb 4, 2020
@RichardInnocent
Author

Thanks for your response and advice with my use case - much appreciated.

I know you've already tagged this for the next release, but confidence scoring would be really useful to me too, as it would allow me to avoid the overhead of including all other languages. I look forward to it.
