Use charlock_holmes instead of nkf at FetchLinkCardService #4080
The spec succeeds even with nkf. Spec the parsed card and you'll find it returns windows-1252 for sjis and ISO-8859-1 for koi8-r. It is a regression and does not solve the problem you expected the change to solve.
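To illustrate why a wrong detection result matters for the card, here is a minimal sketch using only Ruby's standard encoding support. `decode_as` is a hypothetical helper (not part of Mastodon, nkf, or charlock_holmes), and the sample strings are made up for illustration.

```ruby
# Decode raw bytes under an assumed charset and convert them to UTF-8.
# `decode_as` is a hypothetical helper used only for this illustration.
def decode_as(bytes, charset)
  bytes.dup.force_encoding(charset).encode('UTF-8')
end

# Sample inputs (assumptions): short texts encoded as Shift_JIS and KOI8-R.
sjis_bytes = 'こんにちは'.encode('Shift_JIS').b
koi8_bytes = 'Привет'.encode('KOI8-R').b

# Decoding with the real charset recovers the original text.
puts decode_as(sjis_bytes, 'Shift_JIS')
puts decode_as(koi8_bytes, 'KOI8-R')

# Decoding with the misdetected charset does not raise,
# but the result is mojibake rather than the original text.
puts decode_as(sjis_bytes, 'Windows-1252')
puts decode_as(koi8_bytes, 'ISO-8859-1')
```

Because the misdetected decode succeeds silently, a spec that only checks that a card was created cannot catch it; asserting on the decoded title or on the inferred charset can.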
charlock_holmes uses ICU as its backend. Show that ICU beats NKF before adopting this change.
Also, give
From a more general perspective, I'm very wary about adding new external, non-Gemfile dependencies after the problems many admins had deploying language detection.
I'm not really worried about the external dependency. People have managed to get cld3-ruby to work anyway. (I was expecting more fatal problems at the time, by the way. It was surprising to me that the library, tested only in my environment, worked on lots of computers. A more popular library would certainly work.)
As @akihikodaki advised I modified to pass Now koi8-r is detected correctly, but sjis is still detected as windows-1252. Maybe it is a trade-off for adopting an international charset detector instead of a Japanese-encoding-specific implementation.
👍
The string may be too short to infer the encoding. Please test with different texts and evaluate whether its accuracy is acceptable. Also, do not forget to spec the inferred results.
OK, I improved the test.
Thanks for the series of fixes.
I think the Docker image will probably need to install ICU.
I added icu-dev and checked
At the Interop Tokyo conference, one of the comments I got was that upgrading Mastodon was too bothersome. I would like to not add system-level dependencies unless absolutely necessary. This is used only for FetchLinkCardService? It could be better to just fail for unknown encodings. Aren't all modern pages served in UTF-8?
Newly created pages should be served in UTF-8 for sure, but there are many websites that were created before UTF-8 became dominant. I think users will not suffer much from adding a dependency on ICU, because it seems Ubuntu Server 16.04 and 14.04 include ICU by default.
Paging @Jehops who is maintaining the FreeBSD package for Mastodon - would this be another PITA?
No. Unless I'm missing something, it would be straightforward to add the dependency to the FreeBSD package.
I suspect NKF is only for Japanese encodings, so we might have to use a more generic way for language detection.
After applying this PR, we use charlock_holmes instead of nkf to detect the charset.
In addition, the code tries to use the value of charset specified in the Content-Type HTTP response header, and uses charset detection only when the charset is not specified in the HTTP response header, or when parsing the response body fails because the header specifies the wrong charset.
Pros
Cons
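The fallback order from the description above (Content-Type charset first, detection only as a fallback) can be sketched roughly as follows. This is not the PR's actual code: all method names are hypothetical, the detector is passed in as a lambda so the sketch runs without charlock_holmes, and "failed to parse" is approximated here by an invalid byte sequence under the declared charset.

```ruby
# Extract the charset parameter from a Content-Type header value, if present.
# Hypothetical helper; returns nil when no charset is declared.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

# Try to decode the body as `charset`; return nil if the charset is missing,
# unknown to Ruby, or the bytes are not valid in it.
def try_decode(body, charset)
  return nil if charset.nil?
  text = body.dup.force_encoding(Encoding.find(charset))
  text.valid_encoding? ? text.encode('UTF-8') : nil
rescue ArgumentError, EncodingError
  nil
end

# Prefer the declared charset; call the detector only when that fails.
# `detector` is any callable taking the raw body and returning a charset name
# (an assumption standing in for a real charset detector).
def decode_body(body, content_type, detector)
  try_decode(body, charset_from_content_type(content_type)) ||
    try_decode(body, detector.call(body))
end

# Usage with a stub detector that always answers KOI8-R.
stub_detector = ->(_body) { 'KOI8-R' }
body = 'Привет'.encode('KOI8-R').b
puts decode_body(body, 'text/html; charset=utf-8', stub_detector)
```

One limitation of this sketch: a single-byte charset such as windows-1252 accepts almost any byte sequence, so a wrong header of that kind would not trigger the fallback here.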