Use charlock_holmes instead of nkf at FetchLinkCardService #4080

nullkal · 2017-07-05T15:55:59Z

I suspect NKF is only for Japanese encodings, so we might need a more generic approach to charset detection.

After applying this PR, we use charlock_holmes instead of nkf to detect the charset.
In addition, the code first tries the charset specified in the Content-Type HTTP response header, and falls back to charset detection only when the header specifies no charset or the response body fails to parse because the header specifies the wrong one.

Pros

charlock_holmes uses ICU to detect the charset, so it is reliable for non-Japanese encodings.

Cons

charlock_holmes depends on ICU, so icu4c must be installed on the system via each environment's package manager.
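
The header-first behaviour described above can be sketched roughly as follows. This is a minimal illustration using only Ruby's standard library, not the code in this PR: `decode_html` and its `detect:` callable are hypothetical names, and the callable stands in for whatever detector (nkf, charlock_holmes) is plugged in.

```ruby
# Hypothetical helper, not taken from FetchLinkCardService.
# body:           raw response bytes
# header_charset: charset from the Content-Type header, or nil
# detect:         callable returning an encoding name for the bytes
#                 (an assumption; stands in for the real detector)
def decode_html(body, header_charset, detect:)
  if header_charset
    begin
      candidate = body.dup.force_encoding(Encoding.find(header_charset))
      # Trust the header only if the bytes are valid in that encoding.
      return candidate.encode('UTF-8') if candidate.valid_encoding?
    rescue ArgumentError, EncodingError
      # Unknown charset name or undecodable bytes: fall through to detection.
    end
  end

  # No usable charset in the header, so ask the detector.
  body.dup.force_encoding(detect.call(body)).encode('UTF-8')
end
```

With this shape the detector only runs when the header is missing or demonstrably wrong, which is the fallback order the description lays out.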

akihikodaki

The spec succeeds even with nkf. Spec the parsed card and you'll find it returns windows-1252 for sjis and ISO-8859-1 for koi8-r. That is a regression, and it does not solve the problem you expected the change to solve.
charlock_holmes uses ICU as its backend. Show that ICU beats NKF before we adopt this change.

akihikodaki · 2017-07-05T20:16:16Z

Also, give response.charset to CharlockHolmes::EncodingDetector.detect as hint_enc instead of a fallback.

nightpool · 2017-07-05T20:17:14Z

from a more general perspective, i'm very wary about adding new external, non-Gemfile dependencies after the problems many admins had deploying language detection.

akihikodaki · 2017-07-05T20:32:32Z

I'm not really worried about the external dependency. People have managed to get cld3-ruby to work anyway. (I was expecting more fatal problems at the time, by the way. It was surprising to me that the library, tested only in my environment, worked on lots of computers. A more popular library would certainly work.)

nullkal · 2017-07-06T06:21:17Z

As @akihikodaki advised, I changed the code to pass response.charset to CharlockHolmes::EncodingDetector as hint_enc. I also set CharlockHolmes::EncodingDetector#strip_tags to true.

Now koi8-r is detected correctly, but sjis is still detected as windows-1252. Maybe this is a trade-off of adopting an international charset detector instead of a Japanese-encoding-specific implementation.
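
For readers following along, the hint_enc and strip_tags change might look roughly like the sketch below. `build_detector` and `guess_charset` are hypothetical names chosen for illustration. The charlock_holmes calls (`EncodingDetector.new`, `strip_tags=`, and `detect` with an optional second hint argument) are used as I understand the gem's API, and the default detector needs the gem and ICU installed.

```ruby
# Sketch only. Loading the default detector requires the charlock_holmes
# gem (and ICU); pass your own detector to avoid that dependency.
def build_detector
  require 'charlock_holmes'
  detector = CharlockHolmes::EncodingDetector.new
  detector.strip_tags = true # ignore HTML markup when sampling the bytes
  detector
end

# html:           raw response body
# header_charset: response.charset, passed as a hint rather than a fallback
def guess_charset(html, header_charset, detector: build_detector)
  result = detector.detect(html, header_charset) # second argument is hint_enc
  result && result[:encoding]
end
```

The hint only biases the detector, so a wrong Content-Type charset can still be overridden by what the bytes look like.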

akihikodaki · 2017-07-06T07:31:58Z

> As @akihikodaki advised, I changed the code to pass response.charset to CharlockHolmes::EncodingDetector as hint_enc. I also set CharlockHolmes::EncodingDetector#strip_tags to true.

👍

> Now koi8-r is detected correctly, but sjis is still detected as windows-1252. Maybe this is a trade-off of adopting an international charset detector instead of a Japanese-encoding-specific implementation.

The string may be too short to infer the encoding. Please test with different texts and evaluate whether its accuracy is acceptable. Also, do not forget to spec the inferred results.
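
One way to see why a short sample is hard to classify: the bytes alone often pass as valid text in several encodings, so a detector has to lean on statistics, and those need enough input. The sketch below uses only Ruby's standard library, not ICU or charlock_holmes; `CANDIDATES` and `plausible_encodings` are hypothetical names and the candidate list is an arbitrary choice.

```ruby
# Encodings to try; an arbitrary, illustrative selection.
CANDIDATES = %w[UTF-8 Shift_JIS KOI8-R Windows-1252].freeze

# Returns the candidate encodings in which the bytes form a valid string.
# Single-byte encodings accept every byte sequence, so they always remain.
def plausible_encodings(bytes)
  CANDIDATES.select do |name|
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

# Three Japanese characters encoded as Shift_JIS: six bytes in total.
sample = "\u65E5\u672C\u8A9E".encode('Shift_JIS').b
puts plausible_encodings(sample).inspect
```

Byte validity rules out UTF-8 here but cannot separate Shift_JIS from the single-byte candidates, which is consistent with sjis being reported as windows-1252 on a short string.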

nullkal · 2017-07-06T08:15:34Z

OK, I improved the test.

akihikodaki

Thanks for the series of fixes.

abcang · 2017-07-06T08:33:49Z

I think the Docker image will probably need ICU installed as well.

nullkal · 2017-07-06T10:30:36Z

I added icu-dev and confirmed that docker-compose build runs successfully.

Gargron · 2017-07-06T22:41:34Z

At the Interop Tokyo conference, one of the comments I got was that upgrading Mastodon was too bothersome. I would like to not add system-level dependencies unless absolutely necessary. This is used only for FetchLinkCardService? It could be better to just fail for unknown encodings. Aren't all modern pages served in UTF-8?

nullkal · 2017-07-07T00:17:05Z

Newly created pages should be served in UTF-8 for sure, but there are many websites that were created before UTF-8 became dominant. attempt_opengraph also gets information from obsolete pages, not only from modern pages that contain <meta property="og:*">.

I don't think users will suffer much from the added dependency on ICU, because it seems Ubuntu Server 16.04 and 14.04 include ICU by default.

Gargron · 2017-07-07T22:23:37Z

Paging @Jehops who is maintaining the FreeBSD package for Mastodon - would this be another PITA?

Jehops · 2017-07-08T08:28:37Z

No. Unless I'm missing something, it would be straightforward to add the dependency to the FreeBSD package.

nullkal added 4 commits July 6, 2017 00:32

Specs for language detection

cd908ce

Use CharlockHolmes instead of NKF

d85fd7a

Correct mistakes

530b65e

Correct style

c259e89

akihikodaki suggested changes Jul 5, 2017

Set hint_enc instead of falling back and strip_tags

dda682c

Improve specs

10c52a9

akihikodaki approved these changes Jul 6, 2017

Add dependencies

b71ffbf

abcang approved these changes Jul 6, 2017

Gargron approved these changes Jul 8, 2017

Merge branch 'master' into charlock_holmes_for_charcode_detection

226fe4f

Gargron merged commit 007ab33 into mastodon:master Jul 8, 2017

nullkal deleted the charlock_holmes_for_charcode_detection branch July 10, 2017 02:55
