Fix handling of NBSP in Win-1252 encoded strings #993

Merged
merged 1 commit into from
Dec 21, 2016

pointlessone
Member

@pointlessone pointlessone commented Oct 23, 2016

This is a subtle bug with a complex solution.

What is the bug?

While working on the spec for the manual (#949) I noticed that word wrapping works differently in MRI and JRuby. This led to different docs being generated depending on the interpreter.

Here's an example.

MRI:
[Screenshot: manual page as rendered under MRI]

JRuby:
[Screenshot: manual page as rendered under JRuby]

What causes it?

For code indentation (and actually for any space) in the manual we use Non-breaking Space (NBSP). We do this because Prawn strips any leading whitespace when it does text layout (presumably to keep the text properly aligned).

When Prawn lays out text it uses a regular expression to break the text into words. That regular expression contains \s for whitespace, among other things. It turns out that /\s/ matches NBSP as well. Sometimes.

In MRI /\s/ matches NBSP in Win-1252 strings but doesn't in UTF-8 strings. JRuby doesn't match it in either.
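A quick way to check this on a given interpreter is a small probe script. This is only a sketch: the helper name `whitespace_class_matches?` is made up for illustration, and the script deliberately reports what it observes rather than asserting it, since the answers may differ between interpreters and versions.

```ruby
# NBSP (U+00A0) in two encodings. In Windows-1252 it is the single byte 0xA0.
NBSP_UTF8 = "\u00A0"
NBSP_WIN1252 = NBSP_UTF8.encode(Encoding::Windows_1252)

# Hypothetical helper: true when /\s/ finds a match anywhere in the string.
def whitespace_class_matches?(string)
  !/\s/.match(string).nil?
end

# Collect the answers instead of asserting them, because they are
# interpreter-dependent.
NBSP_PROBE = {
  NBSP_UTF8.encoding.name => whitespace_class_matches?(NBSP_UTF8),
  NBSP_WIN1252.encoding.name => whitespace_class_matches?(NBSP_WIN1252)
}

puts "#{RUBY_ENGINE} #{RUBY_VERSION}"
NBSP_PROBE.each { |encoding, matched| puts "#{encoding}: #{matched}" }
```

Running it under both MRI and JRuby should make any difference visible side by side.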

What's up with Windows-1252 strings? Why not use UTF-8 all the time?

PDF has 14 "built-in" fonts. They're defined in the spec and are supposed to be present on any system where the PDF can be read. This gives authors some choice of fonts and also makes documents smaller because these fonts do not have to be embedded into the PDF file.

Now, authoring software (such as Prawn) needs font metric data to properly lay out text on a page. This data is distributed in the form of AFM (Adobe Font Metrics) files.

A side note about text encoding in PDF. PDF, generally speaking, doesn't support any text encodings. It uses whatever is provided by the font currently in use. Most fonts provide encoding tables that correspond to one (or more) standardised encodings such as ASCII or Unicode.

The 14 standard fonts happen to correspond to the Windows-1252 encoding. So the easiest way to write text in those fonts is to just use the proper encoding on the input strings.
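To make the encoding difference concrete, here is a small sketch using plain Ruby string APIs (nothing Prawn-specific; the constant names are made up). It shows the same text in both encodings, and what happens if UTF-8 bytes are read as though they were Windows-1252.

```ruby
# "café", an NBSP, then "bar".
TEXT_UTF8 = "caf\u00E9\u00A0bar"
TEXT_WIN1252 = TEXT_UTF8.encode(Encoding::Windows_1252)

# Same characters, different bytes: é and NBSP take two bytes each in UTF-8
# but one byte each in Windows-1252.
puts TEXT_UTF8.bytesize
puts TEXT_WIN1252.bytesize

# Relabel the UTF-8 bytes as Windows-1252 without converting them, then
# convert back to UTF-8 so the result can be inspected. Every two-byte
# character has turned into two separate characters.
MISREAD = TEXT_UTF8.dup
                   .force_encoding(Encoding::Windows_1252)
                   .encode(Encoding::UTF_8)
puts MISREAD.length
```

The misread string has two more characters than the original, which is the kind of stray output a mis-encoded string can cause with a built-in font.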

The solution

The inconsistent behaviour of \s in regular expressions is easy to fix. We just need to break it down into the individual characters we want. While MRI is the standard and JRuby is likely the one whose bug it is, we actually want JRuby's behaviour. NBSP's purpose is to display a space without being used to break lines into words, after all.
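As a rough illustration of the idea (not the actual regex from this PR), an explicit character class avoids relying on what \s means for a given encoding. The names `BREAKING_WHITESPACE` and `split_words` are made up for this sketch.

```ruby
# Hypothetical constant: the whitespace characters we are willing to break on,
# listed explicitly so that NBSP (U+00A0) is never among them.
BREAKING_WHITESPACE = /[ \t\n\x0B\f\r]+/

# Hypothetical helper: split text into words on breaking whitespace only.
def split_words(text)
  text.split(BREAKING_WHITESPACE)
end

# NBSP between "def" and "foo", ordinary space and tab elsewhere.
SAMPLE = "def\u00A0foo bar\tbaz"

WORDS_UTF8 = split_words(SAMPLE)
WORDS_WIN1252 = split_words(SAMPLE.encode(Encoding::Windows_1252))

p WORDS_UTF8
p WORDS_WIN1252.map { |word| word.encode(Encoding::UTF_8) }
```

Because the class lists only ASCII characters, the same regex can be applied to both UTF-8 and Windows-1252 strings, and "def" and "foo" stay together in both.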

But wait! There's more!

Along the way I found that Prawn doesn't properly preserve the encoding of fragments. This can actually affect the resulting documents because fragments that are supposed to be Win-1252-encoded end up UTF-8-encoded and contain bytes that may produce unwanted results.

So I fixed it. But it turns out that the encodings of a string and a regular expression have to be the same for them to match properly. Otherwise matching raises an exception, and we don't want that. This added a bit of complexity to the whole regex business, since the encoding of the fragment needs to be carried over to regex construction.
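The mismatch can be reproduced in plain Ruby. The sketch below is not the code from this PR; `regex_for` is a hypothetical helper that just shows the general idea of building the regex in the encoding of the text it will scan. It uses the soft hyphen (U+00AD) as the pattern because that character exists in both encodings.

```ruby
# A Windows-1252 fragment containing a soft hyphen and an NBSP.
FRAGMENT = "foo\u00ADbar\u00A0baz".encode(Encoding::Windows_1252)
SOFT_HYPHEN = "\u00AD"

# A regex built from a UTF-8 string with a non-ASCII character is fixed to UTF-8.
UTF8_REGEX = Regexp.new(Regexp.escape(SOFT_HYPHEN))

# Matching it against a non-ASCII Windows-1252 string raises instead of matching.
MISMATCH_ERROR =
  begin
    UTF8_REGEX.match(FRAGMENT)
    nil
  rescue Encoding::CompatibilityError => e
    e
  end

# Hypothetical helper: transcode the pattern source to the text's encoding
# before compiling it, so that regex and text agree.
def regex_for(text, pattern_source)
  Regexp.new(Regexp.escape(pattern_source.encode(text.encoding)))
end

MATCHING_REGEX = regex_for(FRAGMENT, SOFT_HYPHEN)
MATCH_INDEX = MATCHING_REGEX =~ FRAGMENT

puts MISMATCH_ERROR.class
puts MATCHING_REGEX.encoding
puts MATCH_INDEX
```

Note that the transcoding step itself can fail for characters the target encoding lacks, so real code needs a fallback for that case.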

Caveats

  1. Prawn uses a few characters that are not present in all encodings, for example Zero-width Space (ZWSP). Where the character was missing, Prawn substituted an empty string for it. After construction, the word-breaking regex looked like this: /\s||-/ (simplified for demonstration purposes). Notice the double pipe. It matches an empty string, which produced some extra blank tokens. There's a spec for this behaviour, but I don't think it's correct: Prawn should not produce empty tokens. I fixed that.
  2. Built-in fonts are not capable of rendering Unicode text (outside of ASCII). In a few places Prawn explicitly checks for that and raises exceptions. Other parts (like the AFM font handling code) just expect Win-1252-encoded strings and treat all input as if it's properly encoded, without checks. There was a spec that explicitly went against this. It expected a fully Unicode string to be properly laid out even though the font simply had no metrics for the characters in the string. I changed the spec to expect an appropriate exception with the default font and to properly lay out the text with a UTF-8 font. There's also an associated PR, "Commit failing UTF-8 test case for Line Wrap" (#693), describing what seems to be the expected behaviour, even if it is signalled by a not quite appropriate exception.
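The empty-alternative problem from the first caveat can be sketched in a few lines. This is a simplified illustration, not Prawn's actual code: the break character list and the names `encode_or_nil`, `naive_pattern` and `safe_pattern` are all made up here.

```ruby
# Illustrative break characters: space, zero-width space (ZWSP), hyphen.
BREAK_CHARS = [" ", "\u200B", "-"].freeze

# Hypothetical helper: transcode a character, or return nil when the target
# encoding has no such character (ZWSP does not exist in Windows-1252).
def encode_or_nil(char, encoding)
  char.encode(encoding)
rescue Encoding::UndefinedConversionError
  nil
end

# Naive: substituting "" for a missing character leaves an empty alternative.
def naive_pattern(encoding)
  parts = BREAK_CHARS.map { |c| Regexp.escape(encode_or_nil(c, encoding) || "") }
  Regexp.new(parts.join("|"))
end

# Safer: drop characters that cannot be encoded before joining.
def safe_pattern(encoding)
  parts = BREAK_CHARS.map { |c| encode_or_nil(c, encoding) }.compact
  Regexp.new(parts.map { |c| Regexp.escape(c) }.join("|"))
end

NAIVE = naive_pattern(Encoding::Windows_1252)
SAFE = safe_pattern(Encoding::Windows_1252)
TEXT = "a-b c".encode(Encoding::Windows_1252)

puts NAIVE.source           # contains "||", an alternative that matches ""
puts SAFE.source
p TEXT.scan(NAIVE)          # includes empty tokens
p TEXT.scan(SAFE)           # only the real break characters
```

Filtering before joining keeps the pattern free of empty alternatives regardless of which characters the target encoding happens to lack.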

@yob
Member

yob commented Oct 23, 2016

Couldn't help myself and had to comment on this PR - great research, and awesome PR description 😀

@pointlessone
Member Author

@yob Anything to make people interested in reviewing it. ;)
