Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use charlock_holmes instead of nkf at FetchLinkCardService #4080

Merged
merged 8 commits into from Jul 8, 2017
Merged

Use charlock_holmes instead of nkf at FetchLinkCardService #4080

merged 8 commits into from Jul 8, 2017

Conversation

nullkal
Copy link
Contributor

@nullkal nullkal commented Jul 5, 2017

I suspect NKF is only for Japanese encodings, so we might have to use more generic way for language detection.

After applying this PR, we use charlock_holmes instead of nkf to detect charset.
In addition to it, the code tries to use the value of charset specified in Content-Type HTTP response header, and use the charset detection only when charset is not be specified in HTTP response header or failed to parse the response body due to mis-specifying of the charset in HTTP response header.

Pros

  • charlock_holmes uses ICU to detecting charset, so it is reliable for non-Japanese encodings.

Cons

  • charlock_holmes depends on ICU, and we need to install icu4c to the system via each environment's package manager.

Copy link
Contributor

@akihikodaki akihikodaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec succeeds even with nkf. Spec the parsed card and you'll find it returns windows-1252 for sjis and ISO-8859-1 for koi8-r. It is a regression and does not solve the problem you expected the change to solve.
charlock_holmes uses ICU as its backend. Show ICU beats NKF to adopt this change.

@akihikodaki
Copy link
Contributor

Also, give response.charset to CharlockHolmes::EncodingDetector.detect as hint_enc instead of a fallback.

@nightpool
Copy link
Member

from a more general perspective, i'm very wary about adding new external, non-Gemfile dependencies after the problems many admins had deploying language detection.

@akihikodaki
Copy link
Contributor

I'm not really worrying about the external dependency. People have managed to get cld3-ruby work anyway. (I was expecting more fatal problems at the time, by the way. It was surprising for me that the library, tested only on my environment, worked on lots of computers. A more popular library would certainly work.)

@nullkal
Copy link
Contributor Author

nullkal commented Jul 6, 2017

As @akihikodaki advised I modified to pass response.charset to CharlockHolmes::EncodingDetector as hint_enc. And I also set CharlockHolmes::EncodingDetector#strip_tags to true.

Now koi8-r is detected correctly, but sjis is still detected as windows-1252. Maybe it is trade-off for adopting international charset detector instead of Japanese-encoding-specific implementation.

@akihikodaki
Copy link
Contributor

As @akihikodaki advised I modified to pass response.charset to CharlockHolmes::EncodingDetector as hint_enc. And I also set CharlockHolmes::EncodingDetector#strip_tags to true.

👍

Now koi8-r is detected correctly, but sjis is still detected as windows-1252. Maybe it is trade-off for adopting international charset detector instead of Japanese-encoding-specific implementation.

The string may be too short to infer the encoding. Please test with different texts and evaluate whether its accuracy is acceptable. Also, do not forget to spec the inferred results.

@nullkal
Copy link
Contributor Author

nullkal commented Jul 6, 2017

OK, I improved the test.

Copy link
Contributor

@akihikodaki akihikodaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the series of fixes.

@abcang
Copy link
Contributor

abcang commented Jul 6, 2017

Probably, I think Docker will need to install icu.

@nullkal
Copy link
Contributor Author

nullkal commented Jul 6, 2017

I added icu-dev and checked docker-compose build ran successfully.

@Gargron
Copy link
Member

Gargron commented Jul 6, 2017

At the Interop Tokyo conference, one of the comments I got was that upgrading Mastodon was too bothersome. I would like to not add system-level dependencies unless absolutely necessary. This is used only for FetchLinkCardService? It could be better to just fail for unknown encodings. Aren't all modern pages served in UTF8?

@nullkal
Copy link
Contributor Author

nullkal commented Jul 7, 2017

Newly created pages should be served in UTF-8 for sure, but there are many websites which are created before UTF-8 became dominant. attempt_opengraph gets information also from obsolete pages, not only from modern pages which contain <meta property="og:*">.

I think users will not suffer from adding a dependency on ICU so much, because It seems Ubuntu Server 16.04 and 14.04 contains ICU by default.

@Gargron
Copy link
Member

Gargron commented Jul 7, 2017

Paging @Jehops who is maintaining the FreeBSD package for Mastodon - would this be another PITA?

@Jehops
Copy link
Contributor

Jehops commented Jul 8, 2017

No. Unless I'm missing something, it would be straightforward to add the dependency to the FreeBSD package.

@Gargron Gargron merged commit 007ab33 into mastodon:master Jul 8, 2017
@nullkal nullkal deleted the charlock_holmes_for_charcode_detection branch July 10, 2017 02:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants