New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cope with Unicode properly where we try to #5

Closed
wants to merge 16 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@skington
Contributor

skington commented Mar 8, 2015

This is RT issue 9700.

The fundamental problem with the previous code was that it was assuming that it was dealing with Latin1-encoded text when in fact it was dealing with Unicode text decoded into bytes. As such, it made checks for \xA0 or \xAD thinking it was finding \x{00A0} NO-BREAK SPACE or \x{00AD} SOFT HYPHEN, when in fact it was merely finding any byte in a multi-byte character, such as a couple of Thai characters (this was what was mentioned in the RT ticket) which had an A0 or AD somewhere in their multiple-byte sequence.

Even if it had correctly avoided characters that had nothing to do with non-breaking spaces or soft hyphens, the code would still have had a problem because this sort of thing should have been done before splitting text into words. So the check has been moved out of out() and into textflow().

The corresponding code only happened in the Text and Markdown formatters, and I have no idea about PostScript and don't want to know about / test RTF, so I updated the tests so they used a common framework for finding test files, that I could extend to say "I have no idea what the output for this format should be, and that's fine".

I updated the main module's documentation as it's clear that people are expecting to get Unicode / UTF-8 back, and the fact that the code internally behaves as if it were dealing with ASCII is mostly not a problem as there's very little string-munging going on. There's an argument for making the module Unicode-aware and e.g. handling everything as Unicode strings, and decoding/encoding only at the outer boundary, but that wouldn't be consistent with requiring perl 5.6.1.

skington added some commits Mar 8, 2015

Test that Unicode gets passed through correctly, and the tests for so…
…ft hyphens and non-breaking spaces don't blat other unrelated characters.
Non-breaking spaces and soft hyphens need to be handled by the code t…
…hat breaks text into words, not at the word level. It's too late then!
Revert "The Markdown output for this is the same as plaintext as ther…
…e's no HTML fanciness involved."

This reverts commit cbb4697.
Factor out this code to a common library that knows to skip tests whe…
…n we've explicitly said "I have no idea what would happen in $foo mode with this input file"
@skington

This comment has been minimized.

Contributor

skington commented Mar 9, 2015

Annoyingly, dzil test says stuff like this:

Could not decode UTF-8 t/data/expected/unicode.md; filename set by GatherDir (Dist::Zilla::Plugin::GatherDir line 215); encoded_content added by @NIGELM/GatherDir (Dist::Zilla::Plugin::GatherDir line 216); error was: utf8 "\xD83D" does not map to Unicode at /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/x86_64-linux/Encode.pm line 200.
; maybe you need the [Encoding] plugin to specify an encoding at /opt/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/Dist/Zilla/Role/File.pm line 153.

I was hoping that TextMate 1 was Unicode-savvy-enough that if I cut and pasted stuff into it it would sort out Unicode stuff properly. Either I was wrong or the Dist::Zilla plugins are wrong. Or maybe we're both wrong!

t/01_ps.t Outdated
use File::Spec; # try to keep pathnames neutral
use Test::More 0.96;
BEGIN { use_ok("HTML::FormatPS"); }
use lib "$FindBin::Bin/lib";

This comment has been minimized.

@karenetheridge

karenetheridge Nov 23, 2016

Contributor

The two FindBin lines can be replaced with use lib 't/lib';, which also avoids adding another dependency.

This comment has been minimized.

@skington

skington Nov 23, 2016

Contributor

Is FindBin a dependency per se? It's been in core since 5.003_07.

This comment has been minimized.

@karenetheridge

karenetheridge Nov 23, 2016

Contributor

it's still an unnecessary complication to a test. FindBin is only really useful for relocatable scripts, outside of the cpan context.

This comment has been minimized.

@skington

skington Nov 23, 2016

Contributor

This PR is a year and a half old, so I can't remember exactly why I added that in, but I suspect it was because I was saying e.g. prove -I ../lib 01_is.t and that was breaking because the code assumed it was being run from a specific directory, which IMO is brittle. I'll revert if this is a blocker, though. (Still looking at other changes at the moment, trying to remember how all of this stuff works.)

This comment has been minimized.

@karenetheridge

karenetheridge Nov 23, 2016

Contributor

If you keep your working directory as the repository root (as the installer does), then you can not only make test, but prove -lr t, etc etc and everything will DTRT.

I'm not a maintainer of this module, so my opinions are just that -- but I found this PR outstanding and was interested in getting it moving again so it could be shipped along with the other issue I filed for this distribution.

This comment has been minimized.

@skington

skington Nov 23, 2016

Contributor

The flipside is that if you have a hefty test directory structure, it's nice to be able to run just part of the tests by saying cd t/complex/directory/hierarchy and then either prove *.t or even perl -d that-one-test-that-fails.t and not have to prefix everything with t/complex/directory/hierarchy all the time.

(I'm from a RISC OS background, where making assumptions about the current working directory is one of the greatest sins. I fear I'm fighting a losing battle in Linux-land, though, given that e.g. git blame /path/to/repository/file fails if you're not actually currently inside /path/to/repository yourself, for reasons I don't understand but are presumably due to kernel developers thinking "why would you ever want to do that?")

But I'll fix, in the interests of getting this resolved, and because it's not a major win for such a small test suite like this.

@karenetheridge

This comment has been minimized.

Contributor

karenetheridge commented Nov 23, 2016

@skington Dist::Zilla is correct here -- "\xD83D" isn't a valid unicode character.

The file is correctly encoded as utf8, and when I open it in my editor I see:

<d83d><dca9> → ☃

Were you perhaps intending to say 💩, which is \x1f4a9? (It encodes as \x{d83d}\x{dca9})

@karenetheridge

(referring to 149e203) Can you squash this commit with the one it reverts (which ought to make both of them vanish)?

@skington

This comment has been minimized.

Contributor

skington commented Nov 23, 2016

OK, I definitely blame my editor. And of course the tests passed because garbage in, garbage out is a perfectly good round trip. I'll try to beef up the tests' robustness.

@skington

This comment has been minimized.

Contributor

skington commented Nov 23, 2016

OK, I think this is all the issues raised by code review dealt with.

Note in particular that I had to revert the attempt at making the tests more robust by using HTML entities. I was trying to avoid the problems with the tests previously, where Unicode pile-of-poo and friends were accidentally mis-encoded, but this wasn't noticed because the error was present in both the input and the output file, so the tests passed. Unfortunately, HTML::Formatter operates on (typically) UTF8-encoded strings internally, but decoding the HTML entity produced a Unicode string instead, so I've switched back to what should be valid UTF8 on both ends.

@skington

This comment has been minimized.

Contributor

skington commented Nov 24, 2016

Right, the tests are failing, but that's because of author tests (spelling and perlcritic) that are failing upstream as well.

@karenetheridge

This comment has been minimized.

Contributor

karenetheridge commented Nov 24, 2016

When you've got things back into a working state, you can rebase the commits down into a clean series and the pull request will get updated automatically when you force-push. Right now, the tree looks like:

* d7b8913 2016-11-24 Sam Kington (upstream/pr/5) I *said*, this is not a symlink!
* fed82a1 2016-11-24 Sam Kington Not sure what happened to this file. Yay git and symlinks!
* f24d7e8 2016-11-24 Sam Kington Yes, this is no longer a symlink.
* 798ed79 2016-11-24 Sam Kington Put back these files now that Travis has checked out a version of the code without them, hopefully cleaning up the issue with symlinks not resolving or whatever its problem was.
* 42ec208 2016-11-24 Sam Kington Completely trash these files a second time.
* b0b4cb2 2016-11-24 Sam Kington ...and put them back as normal files.
* 196c350 2016-11-24 Sam Kington These symlinks turned into straight files got git confused. So start off by trashing them...
* 71ed2dc 2016-11-23 Sam Kington No need for use_ok for an internal test class.
* 2bb0846 2016-11-23 Sam Kington dzil test turns symlinks into ordinary files so this way of detecting tests to be skipped didn't work. Use a simpler approach.
*   65f2021 2016-11-23 Sam Kington Merge upstream/master into master
|\  
| * b54d80a 2015-10-13 Nigel Metheringham (tag: release/2.14, upstream/master, origin/master, origin/HEAD, master) v2.14
| *   80a5514 2015-10-12 Nigel Metheringham Merge branch 'master' of github.com:nigelm/html-formatter
| |\  
| | * 48bf910 2015-10-12 Nigel Metheringham Link in Travis-CI status into README
| * | 6d23e6f 2015-10-12 Nigel Metheringham Link in Travis-CI status into README
| |/  
| * 01a91ab 2015-10-12 Nigel Metheringham (tag: release/2.13-TRIAL) v2.13
| * f72de44 2015-10-12 Nigel Metheringham Change distribution name to HTML-Formatter
| * d0fd3e3 2015-10-10 Nigel Metheringham (tag: release/2.12) v2.12
| * 3ee3e95 2015-10-10 Nigel Metheringham Update changes
| *   30bc5dd 2015-10-10 Nigel Metheringham Merge pull request #6 from karenetheridge/topic/file_slurper
| |\  
| | * 827132d 2015-09-24 Karen Etheridge (upstream/pr/6, topic/sprintf_arg) switch from File::Slurp to File::Slurper
| * |   40f46cd 2015-10-10 Nigel Metheringham Merge branch 'master' of github.com:nigelm/html-format
| |\ \  
| | |/  
| * | aaadb44 2015-10-10 Nigel Metheringham Resolve minor test issues
| * | da8b2f6 2015-10-10 Nigel Metheringham Add travis-ci setup
* | | dd0386b 2016-11-23 Sam Kington Remove FindBin and debugging.
* | | c1d915c 2016-11-23 Sam Kington Prefer simplicity over robustness.
* | | f7ae182 2016-11-23 Sam Kington Don't use HTML entities, they get turned into Unicode strings rather than UTF-8, I think, and while normally that would be a good thing, this distribution works in UTF-8 internally rather than Unicode strings.
* | | 1d4f310 2016-11-23 Sam Kington Don't use the same test title for every test.
* | | 5967b62 2016-11-23 Sam Kington Use entities rather than broken raw UTF-8.
* | | 824b7ee 2016-11-23 Sam Kington Fix the encoding in this file.
* | | ef4009b 2015-03-08 Sam Kington There isn't actually any conversion from Unicode to Latin1, so don't say that there is.
* | | d27c950 2015-03-08 Sam Kington Markdown output for this test is exactly the same as text.
* | | d414f34 2015-03-08 Sam Kington Factor out this code to a common library that knows to skip tests when we've explicitly said "I have no idea what would happen in $foo mode with this input file"
* | | 149e203 2015-03-08 Sam Kington Revert "The Markdown output for this is the same as plaintext as there's no HTML fanciness involved."
* | | cbb4697 2015-03-08 Sam Kington The Markdown output for this is the same as plaintext as there's no HTML fanciness involved.
* | | ad6f376 2015-03-08 Sam Kington Non-breaking spaces and soft hyphens need to be handled by the code that breaks text into words, not at the word level. It's too late then!
* | | 0b0bc70 2015-03-08 Sam Kington Test that Unicode gets passed through correctly, and the tests for soft hyphens and non-breaking spaces don't blat other unrelated characters.
| |/  
|/|   
* |   ddc7080 2013-10-28 Nigel Metheringham Merge pull request #3 from neilbowers/master
|\ \  
| |/  
|/|   
| * cffa738 2013-10-27 Neil Bowers (upstream/pr/3) Reformatted as per CPAN::Changes::Spec
|/  
* 1b8b176 2013-10-27 Nigel Metheringham (tag: release/2.11) v2.11

skington added some commits Nov 23, 2016

dzil test turns symlinks into ordinary files so this way of detecting…
… tests to be skipped didn't work. Use a simpler approach.
@skington

This comment has been minimized.

Contributor

skington commented Nov 24, 2016

I've squashed the commits where I was fighting with Travis and git, but couldn't obviously rebase because of a merge commit. So instead I branched from current master and cherry-picked into the new branch, squashing a commit that I'd immediately reverted in the process. Hope this is cleaner!

@karenetheridge

This comment has been minimized.

Contributor

karenetheridge commented Nov 24, 2016

but couldn't obviously rebase because of a merge commit

Rebasing squashes your merges, but not any merges from the thing you're rebasing onto.

@skington

This comment has been minimized.

Contributor

skington commented Nov 24, 2016

but couldn't obviously rebase because of a merge commit

Rebasing squashes your merges, but not any merges from the thing you're rebasing onto.

Quite possibly. All I know is, I squashed all of the symlink-related commits together and that was fine; then I tried to rebase, and got a non-obvious error that I couldn't immediately find a solution to.

Bear in mind that when I merged from master I got merge conflicts because of the function I'd added just above the warning about the distribution name change from upstream. I can well imagine that that might make rebasing tricky.

Either way, though, the choices I had were (1) having squashed all of my commits as seemed reasonable, to cherry-pick them into a clean branch, or (2) to fight with my git client and/or the git command-line. I chose the easy route at first hoping that was enough - as a proof of concept if nothing else. I can struggle with git some more if you want, but given that at this point the only difference between the two branches is the github comments, and I've linked from one PR to the other, I'd rather not.

@nigelm

This comment has been minimized.

Owner

nigelm commented Nov 25, 2016

Closing as superceded by your later pull request

@nigelm nigelm closed this Nov 25, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment