Conversation

@skington (Contributor) commented Nov 24, 2016

This is PR 5 rewritten for a cleaner commit history. It fails the author tests, but I believe those fail in master as well.

nigelm merged commit 0d89aae into nigelm:master on Nov 25, 2016
@nigelm (Owner) commented Nov 25, 2016

Thanks - have merged that and am trying to beat Travis CI into shape :-/

Sorry for previously losing your pull request - it's been a hectic period.

@nigelm (Owner) commented Jan 4, 2017

Looks like this change breaks handling of non-breaking spaces.

See the test just produced - t/spaces.t

Problem appears to be that `&nbsp;` gets mapped to a \xA0 in the character stream from the HTML parser. This then becomes a U+FFFD (Unicode replacement character) when it goes through the decode in _convert_spacelike_characters_to_space() - which is not matched by the character replacement regexp.
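To make the failure concrete, here is a minimal illustrative snippet (not code from the module) showing what Encode's default fallback does when handed a lone \xA0 byte:

```perl
use Encode qw(decode);

# HTML::Parser expands &nbsp; to the single Latin-1 byte \xA0.
my $bytes = "\xA0";

# A lone \xA0 is not valid UTF-8, so decode's default fallback
# substitutes U+FFFD, the Unicode replacement character.
my $chars = decode('UTF-8', $bytes);
printf "U+%04X\n", ord($chars);    # prints U+FFFD
```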

@skington - are you able to think of a way to handle this?

@skington (Contributor, Author) commented Jan 4, 2017

I'll have a look; my gut instinct is that the HTML parser isn't Unicode-clean.

@skington (Contributor, Author) commented Jan 4, 2017

The problem is that HTML::Parser isn't Unicode-clean, but whatever it uses for HTML entities is. (Note that _convert_spacelike_characters_to_space says "supplied with a string in bytes" - i.e. bytes rather than codepoints or graphemes.)

HTML::Parser->parse_file sets binmode on the filehandle and reads the contents in 512-byte chunks, for instance, so anything we get via that route is going to be UTF-8 bytes if we're lucky. It therefore makes sense to expect bytes, decode them from UTF-8 in the if statement, and only if we see an actual \xA0 or \xAD once we've done that, assume that we really did have UTF-8-encoded text, make the appropriate substitutions, and then turn the Unicode string back into bytes again.
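A rough sketch of that approach (hypothetical code - the method name comes from the module, but the body here is illustrative and assumes it both receives and returns bytes):

```perl
use Encode qw(decode encode);

sub _convert_spacelike_characters_to_space {
    my ($bytes) = @_;

    # Try to decode the supplied bytes as UTF-8. FB_CROAK makes decode
    # die on malformed input instead of silently inserting U+FFFD.
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };

    # Only if the decode succeeded *and* we see a genuine \xA0 or \xAD
    # do we assume the input really was UTF-8-encoded text.
    if (defined $chars && $chars =~ /[\xA0\xAD]/) {
        $chars =~ s/\xA0/ /g;    # non-breaking space -> ordinary space
        $chars =~ s/\xAD//g;     # soft hyphen -> removed entirely
        return encode('UTF-8', $chars);
    }

    # Otherwise leave the bytes untouched.
    return $bytes;
}
```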

A stop-gap solution would be to use utf8_mode to ensure that the entities matched the guessed-at file encoding. But IMO a proper solution would be to directly address the problem of encodings.
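For reference, the stop-gap would look something like this (utf8_mode is a real HTML::Parser option; the surrounding setup is illustrative):

```perl
use HTML::Parser;

my $parser = HTML::Parser->new(api_version => 3);

# With utf8_mode on, entities such as &nbsp; are expanded to UTF-8 byte
# sequences instead of Latin-1 bytes, matching raw UTF-8 file input.
$parser->utf8_mode(1);
$parser->parse_file('input.html');
```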
