Fix handling of NBSP in Win-1252 encoded strings #993
Merged
This is a subtle bug with a complex solution.
What is the bug?
While working on the spec for the manual (#949), I noticed that word wrapping works differently in MRI and JRuby. This led to different docs being generated depending on the interpreter.
Here's an example.
MRI:
JRuby:
What causes it?
For code indentation (and actually for any space) in the manual we use non-breaking spaces (NBSP). We do this because Prawn strips any leading whitespace when it lays out text (presumably to keep the text properly aligned).
When Prawn lays out text, it uses a bunch of regular expressions to break the text into words. Those regular expressions contain \s for whitespace, among other things. It turns out that /\s/ matches NBSP as well. Sometimes.
In MRI, /\s/ matches NBSP in Win-1252 strings but not in UTF-8 strings. JRuby doesn't match it in either.
What's up with Windows-1252 strings? Why not use UTF-8 all the time?
PDF has 14 "built-in" fonts. They're defined in the spec and are supposed to be present on any system where the PDF can be read. This gives authors some choice of fonts and also makes documents smaller because these fonts do not have to be embedded into the PDF file.
Now, authoring software (such as Prawn) needs font metric data to lay out text properly on a page. This data is distributed in the form of AFM (Adobe Font Metrics) files.
A sidenote about text encoding in PDF. PDF, generally speaking, doesn't support any text encodings. It uses whatever is provided by the font currently in use. Most fonts provide encoding tables that correspond to one (or more) standardised encodings such as ASCII or Unicode.
The 14 standard fonts happen to correspond to the Windows-1252 encoding. So the easiest way to write text in those fonts is to just use the proper encoding on the input strings.
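To make the encoding point concrete, here is a small sketch in plain Ruby (no Prawn involved) of how NBSP is represented in each encoding. The variable names are only illustrative.

```ruby
# NBSP (U+00A0) written as a UTF-8 string literal.
nbsp_utf8 = "\u00A0"

# Transcode it to Windows-1252, the encoding the standard fonts correspond to.
nbsp_win1252 = nbsp_utf8.encode(Encoding::Windows_1252)

# In UTF-8 the character takes two bytes; in Windows-1252 it is the single byte 0xA0.
utf8_bytes    = nbsp_utf8.bytes     # [0xC2, 0xA0]
win1252_bytes = nbsp_win1252.bytes  # [0xA0]

# Characters with no Windows-1252 equivalent cannot be transcoded at all.
unconvertible =
  begin
    "\u65E5".encode(Encoding::Windows_1252)
    false
  rescue Encoding::UndefinedConversionError
    true
  end
```

The same visible character is therefore a different byte sequence depending on the string's encoding, which is why the encoding of each string matters when it reaches the font.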
The solution
Inconsistent behaviour of \s in regular expressions is easy to fix. We just need to break it down into the individual characters we want. While MRI is the standard and JRuby is likely the one whose bug it is, we actually want the JRuby behaviour. NBSP's purpose is to display a space but not be used to break lines into words, after all.
But wait! There's more!
Along the way I found that Prawn doesn't properly preserve the encoding of fragments. This can actually affect the resulting documents, because fragments that are supposed to be Win-1252-encoded end up UTF-8-encoded and contain bytes that may produce unwanted results.
So I fixed it. But it turns out that a string and a regular expression have to have the same encoding to match properly. Otherwise matching raises an exception, and we don't want that. So that added a bit of complexity to the whole regex business, since the encoding of the fragment needs to be carried over to regex construction.
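A minimal sketch of both ideas together: an explicit whitespace list instead of \s, and a regex built in the fragment's own encoding. The helper name, the constants, and the use of a soft hyphen as the non-ASCII character in the pattern are assumptions for illustration; this is not Prawn's actual code.

```ruby
# Explicit whitespace characters; NBSP is deliberately left out. (Illustrative list.)
WHITESPACE  = " \t\r\n\f\v"
# A non-ASCII character in the pattern, so the regex gets a fixed encoding.
SOFT_HYPHEN = "\u00AD"

# Hypothetical helper: build the pattern in the given encoding.
def whitespace_or_soft_hyphen(encoding)
  source = "[#{WHITESPACE}]+|#{SOFT_HYPHEN}"
  Regexp.new(source.encode(encoding))
end

utf8_text    = "foo\u00A0bar baz"                       # NBSP between foo and bar
win1252_text = utf8_text.encode(Encoding::Windows_1252)

# With matching encodings, only the ordinary space is found in either string.
utf8_matches    = utf8_text.scan(whitespace_or_soft_hyphen(utf8_text.encoding))
win1252_matches = win1252_text.scan(whitespace_or_soft_hyphen(win1252_text.encoding))

# A UTF-8 regex with non-ASCII content against a non-ASCII Win-1252 string raises.
mismatch_raised =
  begin
    win1252_text.scan(whitespace_or_soft_hyphen(Encoding::UTF_8))
    false
  rescue Encoding::CompatibilityError
    true
  end
```

Because the character class lists its members explicitly, the result no longer depends on how a given interpreter defines \s for a given encoding.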
Caveats
The existing regular expression was roughly /\s||-/ (simplified for demonstration purposes). Notice the double pipe. It matches an empty string, which produced some extra blank tokens. There's a spec for this behaviour, but I don't think it's correct: Prawn should not produce empty tokens. I fixed that.
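The effect of the empty alternative can be reproduced in isolation. The sketch below uses String#scan on the simplified pattern; that Prawn tokenizes this way is an assumption made for the demonstration.

```ruby
text = "a b-c"

# The double pipe adds an empty alternative, which can match the empty string
# at positions where no whitespace is found.
with_empty_alternative = text.scan(/\s||-/)

# Without the empty alternative, only real whitespace and hyphens are matched.
without_empty_alternative = text.scan(/\s|-/)

has_blank_tokens = with_empty_alternative.include?("")
```

Since alternatives are tried left to right, the empty alternative also wins before the hyphen branch is ever tried, so removing it changes more than just the blank tokens.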