Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HAML and 1.9.2 Encoding #519

Closed
nesquena opened this issue May 8, 2011 · 39 comments
Closed

HAML and 1.9.2 Encoding #519

nesquena opened this issue May 8, 2011 · 39 comments
Assignees
Milestone

Comments

@nesquena
Copy link
Member

nesquena commented May 8, 2011

Chatting with wayneeseguin: http://twitter.com/#!/wayneeseguin/statuses/67353013299322880

Looks like he is experiencing encoding errors when he uses Padrino with HAML but not with Sinatra+Haml. It works if he does:

 -# coding: utf-8

in each HAML file but this shouldn't be necessary. We need to investigate why this happens and correct the issue for the next release.

@ghost ghost assigned DAddYE May 8, 2011
@wayneeseguin
Copy link

This is the layout file, completely fresh generated application w/Sequel & Sqlite3, haml

https://gist.github.com/743e07cb1ecdc25bac52

And the 'index' action was literally an empty file ala 'touch app/base/index.haml'

@DAddYE
Copy link
Member

DAddYE commented May 8, 2011

I think is it a Tilt issue: rtomayko/tilt#83

@nesquena
Copy link
Member Author

nesquena commented May 8, 2011

If so, why would it work in Plain Sinatra+HAML which Wayne tested separately? Have you experienced this before Dave in any of your apps?

@wayneeseguin
Copy link

For clarification, I ran

$ cp ../../sinatra_base_app/views/layouts/application.haml app/views/layouts/

So the file was identical.

@wayneeseguin
Copy link

Additionally,

∴ ruby -v
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

For a bit of an exercise, I went through and found all the areas in Sinatra mentioning encodings (v1.2.6):

Most interesting:

def force_encoding
set :default_encoding

References:

def content_type (references default_encoding)
def render (references default_encoding)
def call! (references force_encoding method)

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

More research, the only lines actually causing this error were these two characters:

%a{ :href => "#docindex"}
        ∴ Index

and

Working With Rails
–
Thank You!

With those two symbols removed the error goes away. So to reproduce this simply in Padrino, simply do:

# app/layouts/base.haml
!!!
%html
  %head
    %title Encoding Fail
    %meta{ :content => "text/html; charset=UTF-8", "http-equiv" => "Content-Type" }
  %body
    %p ∴ Foo
    %p – Bar
    %p= yield

Render anything with that template and you will see:

Encoding::UndefinedConversionError at /
"\xE2" from ASCII-8BIT to UTF-8

Getting a bit closer...

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

If I look at the backtrace in Padrino when that encoding issue occurs, I see:

gems/sinatra-1.2.6/lib/sinatra/base.rb:548:in `render'

is fundamentally still doing the rendering as you'd expect. So the default encoding is properly set, if I were to add:

raise settings.default_encoding

I see: utf-8 as the output right before the render hits an exception. So the default_encoding is intact.

@wayneeseguin
Copy link

Brilliant deductive work thus far!

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

@wayneeseguin @DAddYE Ok I have figured out the exact source of the issue in Padrino. At least we tracked it down. I have a Padrino app and a Sinatra app both with the same template files. The Padrino app returns:

Encoding::UndefinedConversionError at /
"\xE2" from ASCII-8BIT to UTF-8

and the Sinatra app returns with no errors. I tracked the culprit in Padrino to "set_encoding" method:

def set_encoding
  if RUBY_VERSION < '1.9'
    $KCODE='u'
  else
    Encoding.default_external = Encoding::UTF_8
    Encoding.default_internal = Encoding::UTF_8 # <- THIS
  end
  nil
end

By setting Encoding.default_internal to UTF-8, this error begins appearing in plain Sinatra as well! By default this is set to nil and when that is true, rendering works in both Padrino and Sinatra. To verify check out:

# PLAIN SINATRA
get "/" do
  Encoding.default_internal = Encoding::UTF_8
  haml :simple, :layout => true
end

will have an exception and:

# PADRINO
get "/" do
  Encoding.default_internal = nil
  haml :simple, :layout => true
end

works like a charm.

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

Quick fix for right now @wayneeseguin is to simply set:

# config/boot.rb
Padrino.after_load do
  Encoding.default_internal = nil
end

Padrino.load!

and Padrino appears to render UTF-8 perfectly fine. Unexpectedly it is actually setting the default_internal to utf-8 that broke this in the first place.

@DAddYE How should we proceed? My feeling is to simply remove that declaration? Although why does that cause this exception to occur?

@wayneeseguin
Copy link

That seems very counter-intuitive (what is actually happening) however thank you for the fix!

I am very excited about using Padrino this speed bump aside :)

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

Glad I could fix this with only about an hour of investigation. I was actually surprised and tried almost everything else before removing that line as the culprit. I actually prefer that internal encoding being set to UTF-8.

It also feels like it must be a bug of some sort in HAML. As if trying to force the 'internal' encoding to be UTF-8 exposes an issue with HAML itself.

Anyhow glad the fix was pretty easy and glad to hear you are using Padrino :)

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

@nex3 @chriseppstein If you can read through this and give your opinion, I would be greatful. The tl;dr is:

# PLAIN SINATRA
get "/" do
  Encoding.default_internal = Encoding::UTF_8
  haml :simple, :layout => true
end

Fails with an exception in Sinatra:

Encoding::UndefinedConversionError at /
"\xE2" from ASCII-8BIT to UTF-8

but:

# PLAIN SINATRA
get "/" do
  Encoding.default_internal = nil
  haml :simple, :layout => true
end

Renders without issue given the layout:

# app/layouts/base.haml
!!!
%html
  %head
    %title Encoding Fail
    %meta{ :content => "text/html; charset=UTF-8", "http-equiv" => "Content-Type" }
  %body
    %p ∴ Foo
    %p – Bar
    %p= yield

Any ideas? I am on the latest haml (3.1.1) and ruby 1.9.2

@DAddYE
Copy link
Member

DAddYE commented May 9, 2011

Interesting also encoding reference: http://haml-lang.com/docs/yardoc/file.HAML_REFERENCE.html

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

Yeah I saw that, but why does that mean that default_internal cannot be set to UTF8?

@DAddYE
Copy link
Member

DAddYE commented May 9, 2011

Mmm no Idea, Im interested in what @rkh think about that.

@rkh
Copy link

rkh commented May 9, 2011

Will look at this later, maybe you should also get the attention of @raggi and @nex3? I usually suck when dealing with encodings.

@DAddYE
Copy link
Member

DAddYE commented May 9, 2011

An interesting thing is that with:

Encoding.default_internal = nil
Haml::Template.options[:encoding] = "utf-8"

but the strange thing is that Haml set Haml::Template.options[:encoding] => Encoding.default_internal if Haml::Template.options[:encoding] == nil

@DAddYE
Copy link
Member

DAddYE commented May 9, 2011

With Encoding.default_internal = Encoding::BINARY works

@DAddYE
Copy link
Member

DAddYE commented May 9, 2011

By default haml think that source is ASCI-8BIT so when u set Encoding.default_internal convert it...

@nex3
Copy link

nex3 commented May 9, 2011

Here's what's happening, as far as I can tell:

  • You have a file, simple.haml, that has UTF-8-encoded non-ASCII characters in it.
  • Your Ruby process has Encoding.default_internal set to "utf-8".
  • The framework loads simple.haml as ASCII, since it doesn't know any better.
  • The framework passes the simple.haml text to Haml.
  • Haml sees that the internal encoding is UTF-8, so it attempts to transform the simple.haml text to UTF-8.
  • Since the simple.haml text is marked as ASCII, this fails because it contains non-ASCII characters.

The fundamental problem here is that simple.haml is marked as ASCII. There are two ways around this: you could set Encoding.default_external to "utf-8", which should cause simple.haml to be loaded as UTF-8 rather than ASCII; or you could add -# encoding: utf-8 to the top of simple.haml, which will cause Haml to force the encoding flag to UTF-8 without doing a transformation.

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

@nex3 Even with Encoding.default_external set to 'utf-8' sinatra still throws an exception regardless it doesn't seem to fix the issue. With both internal and external set to 'utf-8' as far as I can tell the problem persists. Why would that be?

Assuming we don't want to set the encoding in every single file, what's the best fallback? Set default_internal to nil?

@nex3
Copy link

nex3 commented May 9, 2011

If Encoding.default_external isn't working, the question is why the file's encoding is being set to ASCII in the first place. default_external should affect everything that's read from the filesystem. I think the best bet here is to trace through the framework and check the encoding of the file at various points in the program. Either it's being loaded as ASCII initially, or it's getting transformed to ASCII somewhere along the line.

Setting default_internal to nil isn't a great long-term solution; by doing that you do away with all the encoding support in Ruby 1.9, which is bound to bite you sooner or later.

@rkh
Copy link

rkh commented May 9, 2011

It should not be set to US-ASCII, but to ASCII-8BIT, since we use File.binread in Tilt if available, to avoid potential exceptions while reading. We then have HAML generate the Ruby code, and (since we'll pass it to eval), insert our own encoding comment if missing, which defaults to UTF-8 when coming from Sinatra. We do it this way so we avoid dealing with implementation details of the template engine.

Note: I do not know if this fully applies for Padrino, this is what's going on in vanilla Sinatra.

@nex3
Copy link

nex3 commented May 9, 2011

Well, there's the problem. Haml's getting passed an ASCII-8BIT file by Tilt, which it then tries to re-encode to UTF-8, causing the failure.

@rkh: the policy of passing binary data to template engines seems mistaken to me. Haml, for instance, assumes it's being passed a well-formed and properly-encoded document, an assumption that a binary-flagged UTF-8-encoded document violates.

@rkh
Copy link

rkh commented May 9, 2011

@nex3: I disagree, parsing as ASCII-8BIT is the only option we have. We cannot know any encoding before Haml parses the source code. Otherwise we will have to re-implement simple parsers for all template engine supported (currently 22, not counting those added by extensions) to figure out what encoding the files are in. What if it is some Japanese encoding, or UTF-16 (likely on Windows), simply assuming the file is in UTF-8 would break that. Moreover, Rack is following a strict everything we are not 100% sure of is ASCII-8BIT policy, with the basic idea that this will have Strings behaving exactly as on 1.8. In contrast to Rails, Sinatra/Tilt is not opinionated and we cannot simply assume files are in UTF-8.

Might be an alternative option to not pass :default_encoding to Haml but instead do encoding handling completely on our side? (It is actually an option we use in Tilt and just happen to pass that on to Haml.) An alternative I see would be for Haml to use force_encoding if the string is binary.

@rkh
Copy link

rkh commented May 9, 2011

Note: It is not just that way because we thought it makes sense (I had everything default to UTF-8 at one point), but because we got actual bug reports for this. There are still ppl out there using different encodings.

@nex3
Copy link

nex3 commented May 9, 2011

@rkh: Ruby has a built-in mechanism for specifying what encoding external files are (Encoding.default_external). Tilt seems to ignore this completely. That seems wrong, and is causing substantial pain here. It's also worth noting that if you're forcing everything into BINARY, any non-ASCII-compatible encoding (such as UTF-16 or a Japanese encoding) will break horribly for any template engine, since all such engines I've seen use ASCII-based parsers.

You seem to assume that all files will specify their encodings internally. But as @nesquena points out, people don't want to add an encoding comment to each file. They want to be able to set a default, and you're ignoring that default.

The way Rails does this, for reference, is not to force everything into UTF-8. Rails assumes external files use Encoding.default_external, but provides a safety valve in case some template-engine-specific encoding comment is used.

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

@rkh @nex3 Yes, we are using the exact same Sinatra and Tilt rendering as you guys at the end of the day. It is indeed the Tilt based ASCII-8BIT encoding which causes the exception with HAML when default_internal is set. If we can fix this for Sinatra and Tilt then it will be fixed in Padrino as well. Seems this might actually be a Tilt issue where it needs to respect the Encoding defaults if set in 1.9?

@nesquena
Copy link
Member Author

nesquena commented May 9, 2011

This presents a bit of a problem for us in the meantime with that Tilt ASCII-8BIT encoding issue but I suspect the best thing for Padrino to do is leave the default encoding set to UTF8 as it is now since this is actually correct and then these changes will be made in the downstream. Man I hate encoding issues. Just to restate for Padrino users, the quick workaround hack if anyone experiences this problem is to:

# config/boot.rb
Padrino.after_load do
  Encoding.default_internal = nil
end

@rkh
Copy link

rkh commented May 10, 2011

@nex3: The default value of Encoding.default_external depends on ENV['LC_CTYPE'] and ENV['LANG'] and therefore is usually UTF-8 on OSX and US-ASCII on most Linux distributions. Simply using it will likely cause applications to break on deployment.
We already had this solution, too, and I want to fix this properly.

There is also a corresponding Tilt issue, btw: rtomayko/tilt#75

@nex3
Copy link

nex3 commented May 10, 2011

Sure, using Encoding.default_external relies on the user to specify their desired encoding. Fundamentally, any solution will work this way: have some default, and rely on the user to override the default when necessary. Ruby uses methods for overriding the default, Encoding.default_external and encoding comments (where appropriate); Ruby libraries should respect both.

Having the default be BINARY and ignoring the standard mechanism for overriding that default is not the proper solution.

@rkh
Copy link

rkh commented May 10, 2011

Ok, but what is? Loading UTF-8 files as US-ASCII is neither, nor is having US-ASCII templates that work properly and then insert UTF-8 values from the HTTP communication.

@nex3
Copy link

nex3 commented May 10, 2011

The UTF-8-over-HTTP problem is going to exist just as much with BINARY templates as with default_external-encoded ones. Fundamentally, you just can't hide from users the fact that they must specify the encodings of their external files.

From a template engine's point of view, Rails' solution is pretty much ideal: use default_external (which defaults to UTF-8) when loading a template, but provide a hook for template engines to recognize encoding comments. If no encoding comment is found and the file isn't valid in default_external, throw an error.

@rkh
Copy link

rkh commented May 10, 2011

It is not ideal, default_external only defaults to UTF-8 on OSX, it is US-ASCII on Debian and other Linux distros ($LC_CTYPE is usually not set on Linux, and $LANG is set to C, whereas $LANG is not set on OSX, and $LC_CTYPE is set to UTF-8). And we let users specify HTTP encoding (via the :default_encoding setting, which we default to UTF-8), which we also use for templates, since having two different encodings for say params and templates just leads to more issues. What we could do, is hand the template in our default encoding to Haml. Moreover, to play nicer with other setups, we could have our default encoding default to Encoding.default_external if that is different from US-ASCII (UTF-8 is a super set of US-ASCII), assuming it is that way intentionally, I am not sure though this will still lead to deployment issues for Japanese people.

@nesquena
Copy link
Member Author

Perhaps we should consider moving our discussion back to rtomayko/tilt#75 since that is where this should probably be fixed. I am closing this ticket since nothing needs to change in Padrino, and there's an easy hacky workaround if anyone encounters this problem. Thanks guys!

@ujifgc
Copy link
Member

ujifgc commented Mar 28, 2012

So, I'm tired of waiting for Tilt to fix it and bored of writing magic comments and BOM markers to my templates. I have a tiny monkeypatch to help Haml deal with utf-8 files. Hope this helps those who don't feel like setting Encoding.default_internal to nil.

module Tilt
  class HamlTemplate
    def prepare
      @data.force_encoding Encoding.default_external  # magic line
      options = @options.merge(:filename => eval_file, :line => line)
      @engine = ::Haml::Engine.new(data, options)
    end
  end
end

@apohllo
Copy link

apohllo commented May 25, 2012

Thanks for that patch. I agree with @nex3 that tilt should respect default_external - that was the first thing I thought of, since this is the built-in mechanism for resolving such issues and making the code bullet-proof. default_internal = nil is a mass :(

DAddYE added a commit that referenced this issue Aug 31, 2013
Use tilt 1.4.0+ (fixes haml & UTF-8 issue #519)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants