Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change text box to return remaining text as UTF-8, improve Win1252 handling internally, raise errors or warnings rather than silently replacing invalid glyphs #793

Closed
wants to merge 7 commits into from

Conversation

practicingruby
Copy link
Member

Currently if using an AFM-font,Prawn's text box will return overflow text in Win1252 encoding, even though the user must provide their text in UTF-8 format. To render this overflowed text, the user needs to pass :skip_encoding, which is a fairly awkward workflow.

In an ideal setting, we should treat any transcoding to Win1252 (which isn't even supported by Ruby, so we implement it ourselves) as an implementation detail, and never allow the transcoded text to cross the public API boundary. In other words, Prawn should be UTF-8 in, UTF-8 out to keep things simple for users.

See related discussion on #777, and on #779.

@practicingruby
Copy link
Member Author

This patch almost works... we see one test failure and one manual example failure currently. I'm a little nervous about merging this even if we can fix the remaining failure, because it constitutes an API change to the 1.0 stable API.

That said, we can preserve nearly perfect backwards compatibility in practice by making :skip_encoding a no-op. This would make it so that old code which passed remaining text back into text_box would still work correctly. The only possible breakage would be if someone was actually doing some of their own processing on the win1252 text to convert it back into UTF-8, but that seems unlikely.

@@ -101,6 +101,10 @@ def normalize_encoding(text)
"Arguments to text methods must be UTF-8 encoded"
end

def to_utf8(text)
text.bytes.pack("U*")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: This may not actually be a correct way of converting back into UTF-8. Needs closer investigation.

@practicingruby
Copy link
Member Author

In theory, this should be working now! The only thing I haven't completely dealt with yet is making :skip_encoding a no-op, or removing it entirely.

@airblade: Can you please try this code in your original use case and see if it solves the problem for you? It implements UTF-8 in, UTF-8 out, so you shouldn't get any encoding surprises anymore.

@practicingruby
Copy link
Member Author

@packetmonkey: This is another "I'm worried it's going to break stuff" pull request. Are you using built-in fonts in any of your non-trivial Prawn code? If so, please take this for a spin.

@practicingruby
Copy link
Member Author

Note to all... I am not sure why we were using our own Win1252 implementation... perhaps it wasn't supported in Ruby 1.8 back when we still needed to support 1.8? It's possible to use Ruby's M17N system now for this purpose, so this patch switches to using Ruby's implementation rather than our own under the hood.

@packetmonkey
Copy link
Contributor

We use both the built in fonts and additional fonts. I have a lot of mock data with letter accents. All of my text gets added via a call to #formatted_text_box.

I'll give it a look over and see if I trip into anything.

@airblade
Copy link
Contributor

@sandal This fixes the problem for me. Thank you!

@practicingruby practicingruby changed the title Change text box to return remaining text as UTF-8 Change text box to return remaining text as UTF-8, improve Win1252 handling internally Oct 21, 2014
@practicingruby
Copy link
Member Author

a68eef6 contains a scary but probably good change: It raises an error if non-Windows-1252 glyphs are attempted to be rendered in AFM fonts. The behavior of existing versions of Prawn is to replace with an underscore, which is pretty much never what users will want, and masks bugs.

This latest also changes or gets rid of tests that actually only worked because they were exploiting this bad behavior.

@practicingruby practicingruby changed the title Change text box to return remaining text as UTF-8, improve Win1252 handling internally Change text box to return remaining text as UTF-8, improve Win1252 handling internally, raise errors or warnings whenever we might be rendering garbage text Oct 21, 2014
@practicingruby practicingruby changed the title Change text box to return remaining text as UTF-8, improve Win1252 handling internally, raise errors or warnings whenever we might be rendering garbage text Change text box to return remaining text as UTF-8, improve Win1252 handling internally, raise errors or warnings rather than silently replacing invalid glyphs Oct 21, 2014
@practicingruby
Copy link
Member Author

@packetmonkey: I'm going to merge this, but if anything breaks open up issues and we'll get them sorted out.

practicingruby added a commit that referenced this pull request Oct 30, 2014
A squashed merge of #793, containing the following improvements:

* Text for all Prawn methods is now UTF-8-in, UTF-8-out, so the user
does not need to handle Windows-1252 strings.

* Internally, we're now using Ruby's M17n system to handle the encoding
into Windows-1252, so text.encoding will come back as Windows-1252
when `AFM#normalize_encoding` is called, rather than `ASCII-8Bit`

* When using AFM fonts + ASCII only text, no warning will be seen.

* When using AFM fonts + non-ASCII characters that are supported in
 Windows-1252, users will see a warning about the limited
internationalization support, along with a recommendation to use a TTF
 font instead.

* The warning includes instructions on how to disable it (just set
`Prawn::Font::AFM.hide_m17_warning = true`)

* When using AFM fonts + non-ASCII characters that are NOT supported in
* WIndows-1252, an exception will be raised rather than replacing w.
 `_`.

* None of the above will apply to anyone using TTF fonts with sane UTF-8
support, everything should "just work" for those folks.
@practicingruby
Copy link
Member Author

Merged in adaf18c!

@airblade
Copy link
Contributor

@sandal Great news!

@isaiah
Copy link

isaiah commented Oct 22, 2015

a68eef6 contains a scary but probably good change: It raises an error if non-Windows-1252 glyphs are attempted to be rendered in AFM fonts. The behavior of existing versions of Prawn is to replace with an underscore, which is pretty much never what users will want, and masks bugs.

@practicingruby I think this change makes the fallback_fonts ignored.
I have a document that contains Cyrillic, CJK, Arabic, Greek or anything, so I cannot really tell the document which font to use, previously I depends on the fallback_fonts feature, the UTF-8 glyphs are rendered correctly, but it's broken since 2.0.0

@practicingruby
Copy link
Member Author

@isaiah Do not mix AFM fonts with fallback fonts... use TTF fonts throughout and you'll be fine!

@isaiah
Copy link

isaiah commented Oct 22, 2015

@practicingruby How can I do that? Currently I monkey patched the Prawn::Document#font_families method and #fallback_fonts method, to add my own TTF fonts. Should I disable the original fonts completely?

@pointlessone pointlessone deleted the utf8-remaining-text branch December 30, 2020 07:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants