Encoding compatibility cleanup (including removal of unnecessary String packing/unpacking on Ruby 1.9) #390

Merged
merged 4 commits into from Aug 21, 2012

Conversation

Projects
None yet
2 participants
Contributor

alexdowad commented Aug 19, 2012

This all started when I was studying Prawn's text wrapping code and found some seemingly-redundant code for iterating over a String one UTF-8 character at a time. It turned out to be something which was done purely for compatibility with Ruby 1.8. Investigation revealed several similar places in the codebase.

I've tried to concentrate this String encoding-related compatibility stuff in compatibility.rb. With this patch, packing and unpacking of Strings to get a sequence of UTF-8 characters is only done under Ruby 1.8; under 1.9, it will just use the built-in methods for dealing with encodings.

The specs pass under 1.8.7, 1.9.2, and 1.9.3. I couldn't run the specs under 1.8.6, because the Rakefile requires Bundler, and Bundler requires Ruby >= 1.8.7. Upon discovering this, I took the liberty of removing a little code which was there only for < 1.8.6 compatibility. Now that compatibility with Ruby 1.8.6 has been dropped, the header comment for prawn/compatibility.rb no longer made sense, so I rewrote it.

When running the specs under different implementations, I discovered one test was failing under Ruby 1.9.2 (but not 1.9.3). What it came down to was: Font::TTF#encode_text was returning text encoded as ASCII-8BIT under 1.9.3, but as US-ASCII under 1.9.2. I think that US-ASCII strings are supposed to be upgraded to ASCII-8BIT if non-ASCII chars (ie. chars with MSB set) are added to the string, but Ruby 1.9.2 isn't doing that for some reason. It may be a platform bug.

I modified the spec in question so it passes regardless of the String encoding which is returned by Font#encode_text.

The conditional is unnecessary, because for implementations which don't have it, File.binread is defined in prawn/compatibility.rb.

Already present in 1.8.7.

This is already present in 1.8.7 too. Originally, I mistakenly thought that String#each_char respected UTF-8 encoding even under 1.8.7, but that's not true. We could patch it to match the behavior of 1.9, but then you might break client code. So instead, I defined a new helper below.

When adding helper methods to core classes like String, you want to choose names which are very unlikely to collide with methods added by the client programmer (or by other gems). Originally I thought of "characters", but I think that's too likely to collide. Hopefully "unicode_characters" won't -- it's also a bit more descriptive.

Contributor

alexdowad commented Aug 19, 2012

WAIT! SORRY! Logically, this patch should improve performance, but it seems to make it worse! There may be a performance bug somewhere in here... let me diagnose it before merging, please!

Contributor

alexdowad commented Aug 19, 2012

OK, forget what I said about performance being worse with this patch. My benchmark was only exercising Prawn::Font::AFM, not Prawn::Font::TTF. Cases where Prawn::Font::TTF is used is where the unpacking/repacking code which I originally questioned is exercised. So there is no measurable performance benefit to this patch when using AFM fonts -- whether it measures as "faster" or "slower" on any given test run is wholly determined by random fluctuations in the results.

Interestingly, wrapping text with TTF fonts seems to be faster than using AFM fonts.

@bradediger bradediger merged commit a5a7fc1 into prawnpdf:master Aug 21, 2012

Member

bradediger commented Aug 21, 2012

Merged. Thank you!

Contributor

alexdowad commented Aug 21, 2012

And thanks for reviewing the code... I would really appreciate any specific comments which you (or anyone else) has.

alexdowad deleted the alexdowad:encoding_compatibility_cleanup branch Dec 22, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment