lingua.Unknown is not handled appropriately if included in the set of input languages #7

Closed
marians opened this issue Nov 24, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@marians

marians commented Nov 24, 2021

I started testing the library a few days ago and just saw my first nil pointer panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4314fd8]

goroutine 22414 [running]:
github.com/pemistahl/lingua-go.loadJson(0xa834340, 0x45537a0)
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/json.go:37 +0x178
github.com/pemistahl/lingua-go.languageDetector.loadLanguageModels({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/detector.go:612 +0x8d
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/detector.go:552 +0x146
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/detector.go:526 +0x145
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/detector.go:484 +0xcb
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/detector.go:452 +0xb7
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/marian/go/pkg/mod/github.com/pemistahl/lingua-go@v1.0.3/detector.go:176 +0x455

It seems that even reading from an embedded file can fail at some point.

I'm using go version go1.17.2 darwin/amd64.

@pemistahl
Owner

Thank you for reporting this @marians. Can you please show me your code which calls the library? When exactly did this error occur? I need more context to find out what's going on.

@marians
Author

marians commented Nov 25, 2021

The code is not public currently. I'll copy a reduced version of my invocation here.

languages := []lingua.Language{
	lingua.Albanian,
	lingua.Basque,
	lingua.Bosnian,
	lingua.Bulgarian,
	lingua.Chinese,
	lingua.Catalan,
	lingua.Croatian,
	lingua.Czech,
	lingua.Danish,
	lingua.Dutch,
	lingua.English,
	lingua.Estonian,
	lingua.Finnish,
	lingua.French,
	lingua.German,
	lingua.Greek,
	lingua.Hungarian,
	lingua.Italian,
	lingua.Japanese,
	lingua.Latvian,
	lingua.Lithuanian,
	lingua.Macedonian,
	lingua.Polish,
	lingua.Portuguese,
	lingua.Romanian,
	lingua.Serbian,
	lingua.Slovak,
	lingua.Slovene,
	lingua.Spanish,
	lingua.Swedish,
	lingua.Unknown,
}

detector := lingua.NewLanguageDetectorBuilder().
	FromLanguages(languages...).
	Build()

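// This part is called within a loop for many different text values.
confidenceValues := detector.ComputeLanguageConfidenceValues(text)
if len(confidenceValues) > 0 {
	// Take the ISO 639-1 code of the highest-ranked language.
	result := confidenceValues[0].Language().IsoCode639_1().String()
	_ = result // placeholder: this reduced snippet does not use result further
}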

@pemistahl
Owner

Ah, I think I know what's wrong. You are trying to build the language detector by including lingua.Unknown. This does not make sense because this is not a real language. The function loadJson tries to find language models for lingua.Unknown but it cannot find any, of course. I've forgotten to check for lingua.Unknown and handle this case appropriately, so your issue is still valid. I'm going to fix it soon.

For the time being, please remove lingua.Unknown from your set of languages and you should be fine.
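
If the language list is built dynamically, the workaround can be sketched as a small filter that drops the unknown sentinel before the list reaches the builder. This is only a sketch: the `Language` type and its constants are local stand-ins so the snippet compiles without the library, and `withoutUnknown` is a hypothetical helper name, not part of lingua-go.

```go
package main

import "fmt"

// Language is a local stand-in for lingua.Language (assumption for this sketch).
type Language int

const (
	English Language = iota
	German
	Unknown // stand-in for lingua.Unknown
)

// withoutUnknown returns a copy of languages with every Unknown entry dropped.
// Hypothetical helper; not part of lingua-go.
func withoutUnknown(languages []Language) []Language {
	filtered := make([]Language, 0, len(languages))
	for _, language := range languages {
		if language != Unknown {
			filtered = append(filtered, language)
		}
	}
	return filtered
}

func main() {
	languages := []Language{English, German, Unknown}
	// Only the real languages remain after filtering.
	fmt.Println(len(withoutUnknown(languages)))
}
```

With the real library, the same loop would compare against `lingua.Unknown` and the filtered slice would be passed to `FromLanguages(...)`.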

pemistahl added the bug label on Nov 25, 2021

@marians
Author

marians commented Nov 25, 2021

Thanks for the info!

My expectation (without having read up on this) was that, with Unknown in the list, the classifier would be able to give "unknown" as the most likely result if the text was in fact not classified as one of the other languages. If that makes sense.

@pemistahl
Owner

Well, one could guess so. But no, if the language cannot be reliably detected, lingua.Unknown is returned regardless of whether it has been included in the set of languages. I will make this clearer in the documentation. Thanks again for making me aware of this.
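
On the calling side, this means Unknown can be treated as a "no reliable detection" result rather than as a language to configure. A rough sketch of that handling follows; the `Language` type, its constants, and the `label` function are local stand-ins invented for illustration, not the library's API.

```go
package main

import "fmt"

// Language is a local stand-in for lingua.Language (assumption for this sketch).
type Language int

const (
	English Language = iota
	German
	Unknown // stand-in for lingua.Unknown
)

// label maps a detection result to a display string. Unknown is handled as
// "no reliable detection", not as a language of its own.
// Hypothetical helper; not part of lingua-go.
func label(detected Language) string {
	switch detected {
	case English:
		return "en"
	case German:
		return "de"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(label(German))
	fmt.Println(label(Unknown))
}
```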

pemistahl changed the title from "Function loadJson() does not handle errors, can cause nil pointer panic" to "lingua.Unknown is not handled appropriately if included in the set of input languages" on Nov 25, 2021