Fix handling of NBSP in Win-1252 encoded strings #993

pointlessone · 2016-10-23T09:33:39Z

This is a subtle bug with a complex solution.

What is the bug?

While working on the spec for the manual (#949) I noticed that word wrapping works differently in MRI and JRuby. This led to different docs being generated depending on the interpreter.

Here's an example.

MRI: [screenshot]

JRuby: [screenshot]

What causes it?

For code indentation (and actually for any space) in the manual we use Non-breaking Space (NBSP). We do this because Prawn strips any leading whitespace when it does text layout (presumably to keep the text properly aligned).

When Prawn lays out text it uses a bunch of regular expressions to break the text into words. Those regular expressions contain \s for whitespace, among other things. It turns out that /\s/ matches NBSP as well. Sometimes.

In MRI /\s/ matches NBSP in Win-1252 strings but doesn't in UTF-8 strings. JRuby doesn't match it in either.
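The difference can be checked directly on whichever interpreter is at hand. The following is a minimal probe, not code from Prawn; the constant and method names are made up for illustration, and the output depends on the interpreter and version it runs on.

```ruby
# Probe how /\s/ treats NBSP in two encodings on the running interpreter.
# All names here are illustrative; none of this is Prawn code.
NBSP_UTF8 = "\u00A0" # U+00A0 as a UTF-8 string
NBSP_WIN1252 = [0xA0].pack("C").force_encoding(Encoding::Windows_1252) # byte 0xA0

# Returns true when /\s/ matches anywhere in the string.
def whitespace_match?(str)
  !(str =~ /\s/).nil?
end

results = {
  "UTF-8" => whitespace_match?(NBSP_UTF8),
  "Windows-1252" => whitespace_match?(NBSP_WIN1252)
}

results.each do |encoding, matched|
  puts "#{encoding}: /\\s/ matches NBSP? #{matched}"
end
```

Running the same script under MRI and JRuby makes any disagreement between the two visible without involving Prawn at all.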

What's up with Windows-1252 strings? Why not use UTF-8 all the time?

PDF has 14 "built-in" fonts. They're defined in the spec and are supposed to be present on any system where the PDF can be read. This gives authors some choice of fonts and also makes documents smaller because these fonts do not have to be embedded into the PDF file.

Now, authoring software (such as Prawn) needs font metric data to properly lay out text on a page. This data is distributed in the form of AFM (Adobe Font Metrics) files.

A sidenote about text encoding in PDF. PDF, generally speaking, doesn't support any text encodings. It uses whatever is provided by the font currently in use. Most fonts provide encoding tables that correspond to one (or more) standardised encodings such as ASCII or Unicode.

The 14 standard fonts happen to correspond to the Windows-1252 encoding. So the easiest way to write text in those fonts is to just use the proper encoding on the input strings.
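As a rough illustration of what "the proper encoding on the input strings" means in plain Ruby, here is a sketch that transcodes UTF-8 text to Windows-1252 and shows what happens for characters that have no Windows-1252 equivalent. The helper name `to_win1252` is hypothetical and is not part of Prawn's API.

```ruby
# Hypothetical helper: transcode to Windows-1252, or return nil when the
# text contains characters that Windows-1252 cannot represent.
def to_win1252(text)
  text.encode(Encoding::Windows_1252)
rescue Encoding::UndefinedConversionError
  nil
end

latin = "caf\u00E9\u00A0au lait" # e-acute and NBSP both exist in Windows-1252
converted = to_win1252(latin)
puts converted.encoding
puts converted.bytes.inspect # one byte per character after transcoding

japanese = "\u65E5\u672C" # no Windows-1252 equivalents
puts to_win1252(japanese).inspect
```

Text that survives the conversion can be measured against metrics keyed by Windows-1252 byte values; text that does not needs a font that actually covers it.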

The solution

The inconsistent behaviour of \s in regular expressions is easy to fix. We just need to break it down into the individual characters we want. While MRI is the standard and JRuby is likely the one whose bug it is, we actually want the JRuby behaviour. NBSP's purpose is to display a space but not be used to break lines into words, after all.
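A minimal sketch of the idea: spell out the breaking whitespace characters instead of relying on \s. The character set and the names below are assumptions chosen for illustration; they are not necessarily the exact set or names Prawn uses.

```ruby
# Explicit list of whitespace characters that may break a line.
# NBSP (U+00A0, byte 0xA0 in Windows-1252) is deliberately left out.
# This particular set is an assumption, not Prawn's actual list.
BREAKING_WHITESPACE = /[ \t\r\n\f]+/

# Illustrative helper: split text into words on breaking whitespace only.
def split_words(text)
  text.split(BREAKING_WHITESPACE)
end

utf8 = "def\u00A0foo bar"
win1252 = utf8.encode(Encoding::Windows_1252)

p split_words(utf8)
p split_words(win1252)
```

Because the character class lists only ASCII characters, the same pattern can be applied to both UTF-8 and Windows-1252 strings, and NBSP stays inside its word in either case.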

But wait! There's more!

On the way I found that Prawn doesn't properly preserve the encoding of fragments. This can actually affect the resulting documents because fragments that are supposed to be Win-1252-encoded end up UTF-8-encoded and contain bytes that may produce unwanted results.

So I fixed it. But it turns out that a string's encoding and a regular expression's encoding have to be the same for them to match properly. Otherwise matching raises an exception, and we don't want that. This added a bit of complexity to the whole regex business, since the encoding of the fragment needs to be carried over to regex construction.
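The encoding constraint can be reproduced in plain Ruby. The sketch below uses a soft hyphen as the non-ASCII break character and a hypothetical helper, `break_regex_for`, that transcodes the character into the fragment's encoding before building the regex. It illustrates the approach only and is not the code from this PR.

```ruby
SOFT_HYPHEN = "\u00AD" # exists in both Unicode and Windows-1252 (byte 0xAD)

fragment = "co\u00ADop".encode(Encoding::Windows_1252)

# A regex built from a UTF-8 pattern with a non-ASCII character is tied to UTF-8.
utf8_regex = Regexp.new(Regexp.escape(SOFT_HYPHEN))
begin
  fragment =~ utf8_regex
rescue Encoding::CompatibilityError => e
  puts "mismatch: #{e.message}"
end

# Hypothetical helper: build the regex in the fragment's own encoding.
def break_regex_for(fragment, char)
  Regexp.new(Regexp.escape(char.encode(fragment.encoding)))
end

matching_regex = break_regex_for(fragment, SOFT_HYPHEN)
puts matching_regex.encoding
puts fragment =~ matching_regex
```

Patterns that contain only ASCII characters do not hit this problem, which is why the mismatch only shows up once non-ASCII break characters are involved.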

Caveats

Prawn uses a few characters that are not universally present in all encodings, for example Zero-width Space (ZWSP). Prawn substituted an empty string for it. After construction, the word-breaking regex looked like this: /\s||-/ (simplified for demonstration purposes). Notice the double pipe. It matches an empty string. This produced some extra blank tokens. There's a spec for this behaviour, but I don't think it's correct. Prawn should not produce empty tokens. I fixed that.
Built-in fonts are not capable of rendering Unicode text (outside of ASCII). In a few places Prawn explicitly checks for that and raises exceptions. Other parts (like the AFM font handling code) just expect Win-1252-encoded strings and treat all input as if it were properly encoded, without checks. There was a spec that explicitly went against this. It expected a fully Unicode string to be properly laid out while the font simply had no metrics for the characters in the string. I fixed the spec to expect an appropriate exception with the default font and to properly lay out text with a UTF-8 font. There's also an associated PR, "Commit failing UTF-8 test case for Line Wrap" (#693), describing what seems to be the expected behaviour, even if it is signaled by a not quite appropriate exception.
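The empty-alternative problem from the first caveat is easy to reproduce outside Prawn. In this sketch the helper names and the list of pieces are hypothetical; the empty string stands in for a ZWSP that has no equivalent in the target encoding.

```ruby
# Naive construction: joining the pieces leaves an empty alternative ("||").
def naive_break_regex(pieces)
  Regexp.new(pieces.join("|"))
end

# Safer construction: drop empty pieces and escape the rest.
def safe_break_regex(pieces)
  Regexp.union(pieces.reject(&:empty?))
end

pieces = [" ", "", "-"] # "" stands in for an unrepresentable ZWSP

naive = naive_break_regex(pieces)
safe = safe_break_regex(pieces)

p "ab".scan(naive)    # empty matches even though "ab" has no break characters
p "ab".scan(safe)     # no matches at all
p "a b-c".scan(safe)  # only the real break characters
```

Filtering out the empty pieces before building the alternation is one way to guarantee the regex can never match a zero-length string.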


yob · 2016-10-23T09:42:38Z

Couldn't help myself and had to comment on this PR - great research, and awesome PR description 😀

pointlessone · 2016-10-23T09:44:23Z

@yob Anything to make people interested in reviewing it. ;)


pointlessone force-pushed the nbsp-fix branch from b885b64 to 439c4fc Compare October 23, 2016 09:36

pointlessone added a commit that referenced this pull request Oct 23, 2016

Fix handling of NBSP in Win-1252 encoded strings

439c4fc

See #993 for more info.

pointlessone mentioned this pull request Oct 23, 2016

Manual spec #949

Merged

6 tasks

Fix handling of NBSP in Win-1252 encoded strings

cd40dcb

See #993 for more info.

pointlessone force-pushed the nbsp-fix branch from 439c4fc to cd40dcb Compare December 21, 2016 16:00

pointlessone merged commit cd40dcb into master Dec 21, 2016

pointlessone deleted the nbsp-fix branch December 23, 2016 08:27

pointlessone mentioned this pull request Feb 12, 2017

Refactor for MRI Regexp bug? #980

Closed

jessedoyle mentioned this pull request Aug 25, 2020

Getting Prawn::Errors::IncompatibleStringEncoding jessedoyle/prawn-icon#44

Closed
