Conversation

@skington (Contributor) commented Nov 24, 2016

This is PR 5 rewritten for a cleaner commit history. It fails the author tests, but I believe those fail in master as well.

nigelm merged commit 0d89aae into nigelm:master on Nov 25, 2016
@nigelm (Owner) commented Nov 25, 2016

Thanks - have merged that and am trying to beat Travis CI into shape :-/

Sorry for previously losing your pull request - it's been a hectic period.

@nigelm (Owner) commented Jan 4, 2017

Looks like this change breaks handling of non-breaking spaces.

See the test just produced - t/spaces.t

Problem appears to be that `&nbsp;` gets mapped to a \xA0 in the character stream from the HTML parser. This then becomes a U+FFFD (Unicode replacement character) when it goes through the decode in _convert_spacelike_characters_to_space() - which is not matched by the character replacement regexp.
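To make the failure concrete, here is a minimal illustrative snippet (not code from the module) showing what Encode's default fallback does when handed a lone \xA0 byte:

```perl
use Encode qw(decode);

# HTML::Parser expands &nbsp; to the single Latin-1 byte \xA0.
my $bytes = "\xA0";

# A lone \xA0 is not valid UTF-8, so decode's default fallback
# substitutes U+FFFD, the Unicode replacement character.
my $chars = decode('UTF-8', $bytes);
printf "U+%04X\n", ord($chars);    # prints U+FFFD
```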

@skington - are you able to think of a way to handle this?

@skington (Contributor, Author) commented Jan 4, 2017

I'll have a look; my gut instinct is that the HTML parser isn't Unicode-clean.

@skington (Contributor, Author) commented Jan 4, 2017

The problem is that HTML::Parser isn't Unicode-clean, but whatever it uses for HTML entities is. (Note that _convert_spacelike_characters_to_space says "supplied with a string in bytes" - i.e. bytes rather than codepoints or graphemes.)

HTML::Parser->parse_file sets binmode on the filehandle and reads the contents in 512-byte chunks, for instance, so anything we get via that route is going to be UTF-8 bytes if we're lucky. It therefore makes sense to expect bytes, decode them from UTF-8 in the if statement, and only if we see an actual \xA0 or \xAD once we've done that, assume that we really did have UTF-8-encoded text, make the appropriate substitutions, and then turn the Unicode string back into bytes again.
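A rough sketch of that approach (hypothetical code - the method name comes from the module, but the body here is illustrative and assumes it both receives and returns bytes):

```perl
use Encode qw(decode encode);

sub _convert_spacelike_characters_to_space {
    my ($bytes) = @_;

    # Try to decode the supplied bytes as UTF-8. FB_CROAK makes decode
    # die on malformed input instead of silently inserting U+FFFD.
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };

    # Only if the decode succeeded *and* we see a genuine \xA0 or \xAD
    # do we assume the input really was UTF-8-encoded text.
    if (defined $chars && $chars =~ /[\xA0\xAD]/) {
        $chars =~ s/\xA0/ /g;    # non-breaking space -> ordinary space
        $chars =~ s/\xAD//g;     # soft hyphen -> removed entirely
        return encode('UTF-8', $chars);
    }

    # Otherwise leave the bytes untouched.
    return $bytes;
}
```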

A stop-gap solution would be to use utf8_mode to ensure that the entities matched the guessed-at file encoding. But IMO a proper solution would be to directly address the problem of encodings.
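For reference, the stop-gap would look something like this (utf8_mode is a real HTML::Parser option; the surrounding setup is illustrative):

```perl
use HTML::Parser;

my $parser = HTML::Parser->new(api_version => 3);

# With utf8_mode on, entities such as &nbsp; are expanded to UTF-8 byte
# sequences instead of Latin-1 bytes, matching raw UTF-8 file input.
$parser->utf8_mode(1);
$parser->parse_file('input.html');
```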
