Frontmatter is using system default encoding instead of template encoding or even app.encoding #589

Closed
utensil opened this Issue · 11 comments

4 participants

@utensil

Minimal reproduction:

For a haml file

--- 
title: 中文
date: 2012/09/08
tags: tech
---

%div
  blah

with its layout file

-# coding: utf-8
!!! 5
%html
  %head
    %meta{:charset => "utf-8"}
      / Always force latest IE rendering engine (even in intranet) & Chrome Frame
    %meta{:content => "IE=edge,chrome=1", "http-equiv" => "X-UA-Compatible"}
      /[if lt IE 9]
        %script{:src => "http://html5shim.googlecode.com/svn/trunk/html5.js", :type => "text/javascript"}
    %meta{:content => "width=device-width, initial-scale=1.0", :name => "viewport"}
    %title= current_page.data.title 
    = yield_content :head
  %body
    #container
      #main{:role => "main"}
        %div.content
          = yield

will encounter

Encoding::CompatibilityError at /tech_blog/2012/09/08/encoding-test.html
incompatible character encodings: UTF-8 and GBK

I looked around, and it seems that when Middleman renders the template, it uses the encoding of the haml file (the -# coding: utf-8) or the encoding set in config.rb (the set :encoding, "utf-8"), but when Middleman parses the frontmatter, it uses the system default encoding (in my case, GBK). So when metadata is concatenated into the template (%title= current_page.data.title), the encodings are incompatible.
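
To illustrate the failure mode in isolation, here is a minimal sketch, not Middleman code. The helper name concat_error is hypothetical, and force_encoding("GBK") is used to mimic a UTF-8 file being read on a system whose default external encoding is GBK.

```ruby
# encoding: utf-8

# Hypothetical helper: returns the error message raised when the two strings
# cannot be concatenated because of their encodings, or nil if they can.
def concat_error(template_text, frontmatter_value)
  template_text + frontmatter_value
  nil
rescue Encoding::CompatibilityError => e
  e.message
end

# Template text tagged UTF-8, as it would be with "-# coding: utf-8".
utf8_template = "<title>标题 "

# The same UTF-8 bytes, but tagged GBK (assumption: this mimics a read
# under a GBK system default, which tags bytes without transcoding them).
gbk_title = "中文".dup.force_encoding("GBK")

puts concat_error(utf8_template, gbk_title)
```

ASCII-only values do not trigger the error, which is probably why the problem only shows up with a non-ASCII title.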

Specifically, in C:\Ruby192\lib\ruby\gems\1.9.1\gems\middleman-core-3.0.2\lib\middleman-core\core_extensions\front_matter.rb:

      # Get the frontmatter and plain content from a file
      # @param [String] path
      # @return [Array<Thor::CoreExt::HashWithIndifferentAccess, String>]
      def frontmatter_and_content(path)
        full_path = File.expand_path(path, @app.source_dir)

        content = File.read(full_path)
        data = {}

        begin
          if content =~ /\A.*coding:/
            lines = content.split(/\n/)
            lines.shift
            content = lines.join("\n")
          end

          if result = parse_yaml_front_matter(content)
            data, content = result
          elsif result = parse_json_front_matter(content)
            data, content = result
          end
        rescue => e
          # Probably a binary file, move on
        end

The string returned by content = File.read(full_path) is tagged as GBK, so when it's parsed by parse_yaml_front_matter(content), the strings in the metadata are GBK too. That's what's causing the problem.

Besides plain bundle exec middleman server, I also tried LANG=zh_CN.utf-8 bundle exec middleman server under MinGW/MSYS, but the execjs gem doesn't work in that case (it tries to split PATH, which is GBK, as utf-8), so I have to stick with not setting Encoding.default_external or Encoding.default_internal.

@utensil

The easiest and most naive fix might be changing content = File.read(full_path) to content = File.read(full_path, :encoding => ::Middleman::Application.defaults[:encoding]), which solved the problem in my case.
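
As a rough sketch of what the :encoding option changes, the following reads the same bytes twice. The helpers read_source and compare_reads are hypothetical and not part of Middleman; passing "GBK" explicitly stands in for a GBK system default.

```ruby
# encoding: utf-8
require "tempfile"

# Hypothetical helper: read a source file and tag the resulting string
# with an explicit encoding instead of relying on the system default.
def read_source(path, encoding)
  File.read(path, :encoding => encoding)
end

# Writes a small UTF-8 frontmatter sample, reads it back with two different
# encoding options, and returns the encoding name of each result.
def compare_reads
  Tempfile.create(["page", ".haml"]) do |f|
    f.close
    # binwrite stores the bytes as-is, independent of any default encoding.
    File.binwrite(f.path, "---\ntitle: 中文\n---\n")
    [read_source(f.path, "GBK"), read_source(f.path, "utf-8")].map do |s|
      s.encoding.name
    end
  end
end

p compare_reads
```

The bytes on disk are identical in both reads; only the encoding tag on the returned string differs.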

But I guess my fix is too naive and might break something, so I'm raising this issue to see if it can be solved in a better way. I noticed that #416 seems to be an earlier look into this problem, but it seems that what I found is what was left undone in that patch?

@tdreyno
Owner

Thanks for digging in.

@utensil

Thanks for the comment, but will there be any follow-up on this issue?

@tdreyno
Owner

Haven't had a chance to look yet, but happy to accept a pull request with tests.

@bhollis
Owner

I tried to repro this in Cucumber but failed - my machine uses LANG=en_US.UTF-8 so the example just works, and changing the environment from within Cucumber doesn't work. Even setting LANG=GBK in my shell and running Cucumber doesn't work because Cucumber doesn't force features to load using the right encoding (at least until cucumber/gherkin@96b3902 is released).

@utensil's solution just forces all files to be read as utf-8, which I'm not against - it's always simpler to say "Middleman treats all files as utf-8 so deal with it" but I'd understand if that's not a position @tdreyno would want to take.

@tdreyno
Owner

Honestly, I'm not knowledgeable enough on encoding issues to know the correct path. My understanding is that Rails defaults to UTF-8, and that sounds good to me, but they also require Ruby 1.9, which I'm not sure we can do yet (damn OS X default Ruby).

@utensil

I just came across a long discussion in Tilt about template encodings (rtomayko/tilt#75), which could be a useful reference for dealing with such problems.

@bhollis, I wouldn't say my solution forces all files to be read as utf-8; it reads them as ::Middleman::Application.defaults[:encoding], which can be set arbitrarily by a user.

My solution feels wrong to me because template files could be in different encodings (say, via the magic comment -# coding: utf-8 in haml). After some consideration, it seems better if we

use the same encoding as the template file, which is detected by the template engine and defaults to ::Middleman::Application.defaults[:encoding]

instead of my

forcing all the template files to be encoded in ::Middleman::Application.defaults[:encoding] to be more consistent.

It would also be reasonable for Middleman to adopt the latter, following convention over configuration, since it's uncommon to use different encodings for different templates in the same site. I can think of one scenario, though: a Chinese site might use GBK and BIG5 for the Simplified and Traditional Chinese locales instead of utf-8, which is less common (because Chinese characters in utf-8 typically take 3 bytes each, which doesn't save bandwidth).
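
A rough sketch of the first option, assuming the encoding is taken from a leading "coding:" magic comment and falls back to a configured default otherwise. The helper read_with_detected_encoding is hypothetical; real template engines such as Tilt may detect encodings differently.

```ruby
# encoding: utf-8

# Hypothetical helper: tag raw file bytes with the encoding named in a
# first-line "coding:" magic comment, or with the fallback if there is none.
def read_with_detected_encoding(raw_bytes, fallback = "utf-8")
  raw = raw_bytes.b  # binary copy, so the match ignores the current tag
  name = raw.lines.first.to_s[/coding[:=]\s*([\w.-]+)/, 1] || fallback
  raw.force_encoding(Encoding.find(name))
end

puts read_with_detected_encoding("-# coding: gbk\n%p hi\n").encoding
puts read_with_detected_encoding("%p hi\n").encoding
```

Encoding.find raises ArgumentError for an unknown name, so a real implementation would probably want to rescue that and use the fallback.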

@bhollis
Owner

I'm afraid I've lost track of this issue. Is this still a problem as of 3.0.11? @utensil, what would we need to do to move forward w/ a solution?

@utensil

I haven't updated my gem to 3.0.11 yet; I can test it ASAP.

I wonder, if it's solved in 3.0.11, what strategy did Middleman use?

I still prefer reading frontmatter with the same encoding as the template file, which is detected by the template engine and defaults to ::Middleman::Application.defaults[:encoding].

What do you think?

@Abhra1992

I am using 3.0.11, and I'm sad to say the issue might still be there. Frontmatter gets parsed if templates are saved in ANSI mode, but not if they are saved in UTF-8. My layout is in ANSI mode. It is probably occurring due to this issue in Tilt: rtomayko/tilt#75. I am using this gem to take care of the problem: https://github.com/msadouni/middleman-utf8-partial

@tdreyno closed this

@utensil

I recently updated Middleman for my blog to the latest version, tried to reproduce this bug, and found that it had disappeared.

After some digging, I noticed that although ruby -e 'puts Encoding.default_external' outputs 'GBK' on my machine, in front_matter.rb it's already 'utf-8', so some magic must have happened.

And the magic is middleman/middleman@283576a:

@@ -148,6 +149,11 @@ def initialize(&block)
       # Setup the default values from calls to set before initialization
       self.class.config.load_settings(self.class.superclass.config.all_settings)

+      if Object.const_defined?(:Encoding)
+        Encoding.default_internal = config[:encoding]
+        Encoding.default_external = config[:encoding]
+      end
+
       # Evaluate a passed block if given
       instance_exec(&block) if block_given?
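
To show why that change affects File.read, here is a small sketch with a hypothetical helper, encoding_when_default_is. It only changes Encoding.default_external and restores the previous values afterwards; default_internal is cleared temporarily so that no transcoding happens during the read.

```ruby
# encoding: utf-8
require "tempfile"

# Hypothetical helper: returns the encoding name File.read assigns to a
# string while Encoding.default_external is set to the given name.
def encoding_when_default_is(name)
  previous_external = Encoding.default_external
  previous_internal = Encoding.default_internal
  Encoding.default_internal = nil   # avoid transcoding on read
  Encoding.default_external = name
  Tempfile.create("frontmatter") do |f|
    f.close
    File.binwrite(f.path, "title: 中文\n")
    File.read(f.path).encoding.name  # no :encoding option, default applies
  end
ensure
  # Restore the process-wide defaults so the demo has no lasting effect.
  Encoding.default_external = previous_external
  Encoding.default_internal = previous_internal
end

puts encoding_when_default_is("GBK")
puts encoding_when_default_is("utf-8")
```

Because the default is process-wide, setting it once at startup changes every later File.read that has no explicit :encoding option, which matches what I observed in front_matter.rb.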

So, happy ending for this issue.
