Fix chardet test and add ordering option #11621
Signed-off-by: Andrew Thornton <art27@cantab.net>
It is missing Latvian windows-1257
@lafriks it appears that github.com/gogs/chardet doesn't detect or assign to windows-1257
This is the right direction, of course! However, I'm concerned that this will need to be done every time. IMHO the best way to address this is to assign the priority inside gogs/chardet (we would need to take over that library). I elaborated about this here: #8474 (comment)
OK, I believe this is a good compromise.
Nice work!!
modules/charset/charset.go
Outdated
priority, has := setting.Repository.DetectedCharsetScore[strings.ToLower(strings.TrimSpace(topResult.Charset))]
for _, result := range results {
	if result.Confidence == topConfidence {
		resultPriority, resultHas := setting.Repository.DetectedCharsetScore[strings.ToLower(strings.TrimSpace(result.Charset))]
In the future we could attempt to normalize our list casing to the lib's casing in order to avoid calling ToLower() in a loop.
Hehehe have you looked at the lib's casing? I copied the names exactly into the setting, app.ini.sample and config cheat sheet. There's literally no fixed pattern for them.
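For illustration, here is a minimal sketch of the normalization idea discussed above: fold the configured names to lower case once, when the ordered list is loaded, so only the detector's output needs normalizing at lookup time. `buildCharsetScore` is a hypothetical helper, not the code in this PR, and the charset names in `main` are only examples. It assumes that a lower index means a higher priority.

```go
package main

import (
	"fmt"
	"strings"
)

// buildCharsetScore is a hypothetical helper (not taken from this PR).
// It turns an ordered list of charset names into a lookup table keyed by
// the trimmed, lower-cased name. Assumption: lower index = higher priority.
func buildCharsetScore(order []string) map[string]int {
	score := make(map[string]int, len(order))
	for i, name := range order {
		key := strings.ToLower(strings.TrimSpace(name))
		if key == "" {
			continue // skip empty entries, e.g. from a trailing comma
		}
		// Keep the first occurrence so a duplicate cannot lower a priority.
		if _, seen := score[key]; !seen {
			score[key] = i
		}
	}
	return score
}

func main() {
	score := buildCharsetScore([]string{"UTF-8", " windows-1252", "ISO-8859-1"})
	fmt.Println(score["utf-8"], score["windows-1252"], score["iso-8859-1"]) // prints "0 1 2"
}
```

Because the library's names follow no fixed casing pattern, the detected name would still have to be lower-cased on each lookup; this sketch only removes the need to normalize the configured side repeatedly.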
One thing (sorry about the afterthought): if
But not before utf-8
Heh ANSI charset simply overrides any detected charset that isn't utf8... It's not a very sensible option. I'm happy to make a breaking change by removing it and adding a default undetected charset or something like that?
@zeripath on second thought you're right, it would be breaking. Charset detection is a sensitive issue. Maybe we should leave the (*) We could deprecate it, but ANSI character sets should already be a thing of the past. I believe most people dealing with them have really no choice, as they need to deal with ages of old code. Deprecating the option in a way they can't force anymore doesn't sound like a nice thing to do.
Sorry, bad operation! 😓
Ping LG-TM
* Fix chardet test and add ordering option
* minor fixes
* remove log
* remove log2
* only iterate through top results
* Update docs/content/doc/advanced/config-cheat-sheet.en-us.md
* slight restructure of for loop

Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
Add DETECTED_CHARSET_ORDER to repository config to allow setting of tie-breaking for detected charset ordering.
Fixes intermittent failure of chardet test
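As a rough sketch of the tie-breaking described above (not the PR's actual implementation): among the results that share the top confidence, prefer the charset with the best configured priority. The `result` type is a local stand-in for the detector's result, `pickCharset` and `normalize` are hypothetical names, and the scores are assumed to map lower-cased names to a rank where lower is better.

```go
package main

import (
	"fmt"
	"strings"
)

// result is a local stand-in for a detector result (assumed shape).
type result struct {
	Charset    string
	Confidence int
}

// normalize matches the key format assumed for the score map.
func normalize(name string) string {
	return strings.ToLower(strings.TrimSpace(name))
}

// pickCharset is a hypothetical helper. It returns the charset with the
// highest confidence and breaks ties using score, where a lower value
// means a higher priority. Charsets missing from score never displace
// one that is listed.
func pickCharset(results []result, score map[string]int) (string, bool) {
	if len(results) == 0 {
		return "", false
	}
	// Find the top confidence without relying on the input order.
	top := results[0]
	for _, r := range results[1:] {
		if r.Confidence > top.Confidence {
			top = r
		}
	}
	bestPriority, bestHas := score[normalize(top.Charset)]
	// Only results tied at the top confidence take part in the tie-break.
	for _, r := range results {
		if r.Confidence != top.Confidence {
			continue
		}
		p, has := score[normalize(r.Charset)]
		if has && (!bestHas || p < bestPriority) {
			top, bestPriority, bestHas = r, p, true
		}
	}
	return top.Charset, true
}

func main() {
	score := map[string]int{"utf-8": 0, "windows-1252": 1, "iso-8859-1": 2}
	results := []result{
		{"ISO-8859-1", 50},
		{"windows-1252", 50},
		{"UTF-8", 30},
	}
	charset, _ := pickCharset(results, score)
	fmt.Println(charset) // prints "windows-1252"
}
```

With every tied charset present in the configured order, the choice no longer depends on the order in which the detector returns its results, which is the kind of nondeterminism that can make a test fail intermittently.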
Signed-off-by: Andrew Thornton art27@cantab.net