Cope with Unicode properly where we try to #9
Conversation
…ft hyphens and non-breaking spaces don't blat other unrelated characters.
…hat breaks text into words, not at the word level. It's too late then!
…n we've explicitly said "I have no idea what would happen in $foo mode with this input file"
…say that there is.
…than UTF-8, I think, and while normally that would be a good thing, this distribution works in UTF-8 internally rather than Unicode strings.
… tests to be skipped didn't work. Use a simpler approach.
Thanks - have merged that and am trying to beat Travis CI into shape :-/ Sorry for previously losing your pull request - it's been a hectic period.
Looks like this change breaks handling of non-breaking spaces. See the test just produced, t/spaces.t. The problem appears to be that &nbsp; gets mapped to a \xA0 in the character stream from the HTML parser. This then becomes a U+FFFD (Unicode replacement character) when it goes through the decode in _convert_spacelike_characters_to_space. @skington - are you able to think of a way to handle this?
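(As an aside, the failure mode described here is easy to reproduce: a lone Latin-1 \xA0 byte is not valid UTF-8, so a default decode silently substitutes the replacement character. This snippet is purely illustrative and not part of the distribution's test suite.)

```perl
use Encode qw(decode);

# 0xA0 is NBSP in Latin-1 but a malformed sequence in UTF-8, so the
# default (non-croaking) decode substitutes U+FFFD for it.
my $decoded = decode('UTF-8', "\xA0");
printf "U+%04X\n", ord($decoded);    # prints "U+FFFD"
```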
I'll have a look; my gut instinct is that the HTML parser isn't Unicode-clean.
The problem is that HTML::Parser isn't Unicode-clean, but whatever it uses for HTML entities is. (Note that _convert_spacelike_characters_to_space says "supplied with a string in bytes" - i.e. bytes rather than codepoints or graphemes.) HTML::Parser->parse_file sets binmode on the filehandle and reads the contents in 512-byte chunks, for instance, so anything we get from that route is going to be in UTF8 if we're lucky.

It therefore makes sense to expect bytes, decode them from UTF8 in the if statement, and only if we see an actual \xA0 or \xAD once we've done that, assume that we actually did have UTF8-encoded text, make the appropriate substitutions, and then turn the Unicode string back into bytes again.

A stop-gap solution would be to use utf8_mode to ensure that the entities matched the guessed-at file encoding. But IMO a proper solution would be to directly address the problem of encodings.
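(A minimal sketch of the decode-substitute-re-encode round trip described above. The function name echoes the one mentioned in the thread, but the body is illustrative rather than the distribution's actual code, and the croak-on-malformed guard is an added assumption about how "decode them from UTF8 in the if statement" might be done.)

```perl
use strict;
use warnings;
use Encode qw(decode encode);

sub _convert_spacelike_characters_to_space {
    my ($bytes) = @_;

    # Tentatively decode a copy of the input, croaking on malformed
    # UTF-8 so we can tell genuine UTF-8 apart from arbitrary bytes.
    my $string = eval {
        decode('UTF-8', my $copy = $bytes, Encode::FB_CROAK);
    };
    return $bytes unless defined $string;

    # Only if we see an actual NBSP or soft hyphen at the codepoint
    # level do we assume the input really was UTF-8-encoded text.
    return $bytes unless $string =~ /[\x{A0}\x{AD}]/;

    $string =~ s/\x{A0}/ /g;  # non-breaking space -> ordinary space
    $string =~ s/\x{AD}//g;   # soft hyphen: dropped here; the exact
                              # substitution is the distribution's choice
    return encode('UTF-8', $string);
}
```

The eval around the strict decode is what lets non-UTF-8 input fall through untouched instead of being corrupted into U+FFFD. The utf8_mode stop-gap mentioned above is HTML::Parser's own switch ($p->utf8_mode(1)) for parsing raw undecoded UTF-8, so that expanded entities come out as UTF-8 byte sequences compatible with the surrounding text.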
This is PR 5 rewritten for a cleaner commit history. It fails the author tests, but I believe those fail in master as well.