Frontmatter is using system default encoding instead of template encoding or even app.encoding #589

utensil opened this Issue Sep 8, 2012 · 12 comments



utensil commented Sep 8, 2012

Minimal reproduction:

For a haml file

title: 中文
date: 2012/09/08
tags: tech



with its layout file

-# coding: utf-8
!!! 5
    %meta{:charset => "utf-8"}
      / Always force latest IE rendering engine (even in intranet) & Chrome Frame
    %meta{:content => "IE=edge,chrome=1", "http-equiv" => "X-UA-Compatible"}
      /[if lt IE 9]
        %script{:src => "", :type => "text/javascript"}
    %meta{:content => "width=device-width, initial-scale=1.0", :name => "viewport"}
    = yield_content :head
      #main{:role => "main"}
          = yield

will encounter

Encoding::CompatibilityError at /tech_blog/2012/09/08/encoding-test.html
incompatible character encodings: UTF-8 and GBK

I looked around, and it seems that when Middleman renders the template, it uses the encoding of the Haml file (the -# coding: utf-8 comment) or the encoding set in config.rb (set :encoding, "utf-8"), but when Middleman parses the frontmatter, it uses the system default encoding (GBK in my case). So when the metadata is concatenated into the template (e.g. via %title=), the encodings are incompatible.
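
To make the failure mode concrete, here is a minimal sketch outside Middleman. It assumes the frontmatter string carries UTF-8 bytes but was tagged GBK when it was read; the variable names and the `concat_error` helper are made up for illustration and are not part of Middleman.

```ruby
# A literal written with \u escapes is always UTF-8, like the rendered template.
template_part = "<title>\u4E2D\u6587</title>"

# The same UTF-8 bytes, but tagged GBK, as if read under a GBK default_external.
frontmatter_title = "\u4E2D\u6587".dup.force_encoding("GBK")

# Hypothetical helper: return the error raised by concatenation, or nil.
def concat_error(left, right)
  left + right
  nil
rescue Encoding::CompatibilityError => e
  e
end

error = concat_error(template_part, frontmatter_title)
puts error.class
puts error.message
```

Both strings contain non-ASCII bytes and carry different encoding tags, so Ruby refuses to join them rather than guess.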

Specifically, in C:\Ruby192\lib\ruby\gems\1.9.1\gems\middleman-core-3.0.2\lib\middleman-core\core_extensions\front_matter.rb:

      # Get the frontmatter and plain content from a file
      # @param [String] path
      # @return [Array<Thor::CoreExt::HashWithIndifferentAccess, String>]
      def frontmatter_and_content(path)
        full_path = File.expand_path(path, @app.source_dir)

        content =
        data = {}

          if content =~ /\A.*coding:/
            lines = content.split(/\n/)
            content = lines.join("\n")

          if result = parse_yaml_front_matter(content)
            data, content = result
          elsif result = parse_json_front_matter(content)
            data, content = result
        rescue => e
          # Probably a binary file, move on

The string assigned to content is tagged as GBK, so when parse_yaml_front_matter(content) parses it, the strings in the metadata are GBK too. That is what causes the problem.

Besides a plain bundle exec middleman server, I also tried LANG=zh_CN.utf-8 bundle exec middleman server under MinGW/MSYS, but the execjs gem does not work in that case (it tries to split PATH, which is GBK, as UTF-8), so I have to stick with not setting Encoding.default_external or Encoding.default_internal.


utensil commented Sep 8, 2012

The easiest and most naive fix might be to change the content = line so that the call on it also receives :encoding => ::Middleman::Application.defaults[:encoding], which solved the problem in my case.
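
As a rough illustration of the idea (not the actual Middleman code), Ruby's File.read accepts an :encoding option, so a file can be read with a chosen encoding regardless of the system default. The `read_source` helper and the temporary file name are assumptions made for this sketch.

```ruby
require "tempfile"

# Hypothetical helper: read a source file with an explicit encoding
# instead of relying on Encoding.default_external.
def read_source(path, encoding = "utf-8")
  File.read(path, :encoding => encoding)
end

Tempfile.create(["encoding-test", ".haml"]) do |file|
  file.binmode                               # write the raw UTF-8 bytes untouched
  file.write("title: \u4E2D\u6587\n")
  file.flush

  content = read_source(file.path)
  puts content.encoding                      # the tag comes from the option, not the locale
  puts content.valid_encoding?
end
```

In Middleman the second argument would presumably come from the configured encoding rather than a hard-coded default.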

But I guess my fix is too naive and might break something, so I am raising this issue to see whether it can be solved in a better way. I noticed that #416 seems to be an earlier look into this problem, but what I discovered seems to be what that patch left undone?


tdreyno commented Sep 8, 2012

Thanks for digging in.


utensil commented Sep 27, 2012

Thanks for the comment, but will there be any follow-up on this issue?


tdreyno commented Sep 27, 2012

Haven't had a chance to look yet, but happy to accept a pull request with tests.


bhollis commented Oct 10, 2012

I tried to repro this in Cucumber but failed - my machine uses LANG=en_US.UTF-8 so the example just works, and changing the environment from within Cucumber doesn't work. Even setting LANG=GBK in my shell and running Cucumber doesn't work because Cucumber doesn't force features to load using the right encoding (at least until cucumber-attic/gherkin@96b3902 is released).

@utensil's solution just forces all files to be read as utf-8, which I'm not against - it's always simpler to say "Middleman treats all files as utf-8 so deal with it" but I'd understand if that's not a position @tdreyno would want to take.


tdreyno commented Oct 10, 2012

Honestly, I'm not knowledgeable enough on encoding issues to know the correct path. My understanding is that Rails defaults to UTF-8, and that sounds good to me, but they also require Ruby 1.9, which I'm not sure we can do yet (damn OS X default Ruby).


utensil commented Oct 10, 2012

I just came across a long discussion in Tilt about template encodings (rtomayko/tilt#75), which could be a useful reference for dealing with such problems.

@bhollis, I wouldn't say my solution forces all files to be read as utf-8; it reads them as ::Middleman::Application.defaults[:encoding], which a user can set arbitrarily.

My solution feels wrong to me because template files could be in different encodings (say, via the magic comment -# coding: utf-8 in Haml). After some consideration, it seems better, and more consistent, if we

use the same encoding as the template file, which is detected by the template engine and defaults to ::Middleman::Application.defaults[:encoding]

instead of my

forcing all the template files to be encoded in ::Middleman::Application.defaults[:encoding].

It would also be reasonable for Middleman to adopt the latter, because of convention over configuration, and because it is less common to use different encodings for different templates in the same site. I can think of one scenario, though: a Chinese site might use GBK and Big5 for the Simplified and Traditional Chinese locales instead of UTF-8, which is less common there (because common Chinese characters take 3 bytes each in UTF-8, which does not save bandwidth).
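
The first option could look roughly like the sketch below: look for a magic comment on the first line and fall back to a default when none is found. This is only an illustration; `decode_template` is a hypothetical name, the regex is modelled on the /\A.*coding:/ check quoted from front_matter.rb, and real template engines may detect encodings differently.

```ruby
# Hypothetical helper: tag raw template bytes with the encoding named in a
# first-line magic comment, or with the given default when there is none.
def decode_template(raw_bytes, default_encoding = "utf-8")
  # `.` does not match newlines, so only the first line is inspected.
  name = raw_bytes[/\A.*coding:\s*([\w.-]+)/, 1] || default_encoding
  encoding = begin
    Encoding.find(name)
  rescue ArgumentError
    # Unknown encoding name in the comment: fall back to the default.
    Encoding.find(default_encoding)
  end
  raw_bytes.dup.force_encoding(encoding)
end

with_comment    = decode_template("-# coding: gbk\n%p hello\n".b)
without_comment = decode_template("title: test\n".b)
puts with_comment.encoding
puts without_comment.encoding
```

The frontmatter would then be parsed from the string this returns, so metadata and template body share one encoding.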


bhollis commented Feb 11, 2013

I'm afraid I've lost track of this issue. Is this still a problem as of 3.0.11? @utensil, what would we need to do to move forward with a solution?


utensil commented Feb 13, 2013

I haven't updated my gem to 3.0.11 yet; I can test it ASAP.

I wonder, if it's solved in 3.0.11, what strategy did middleman use?

To read the frontmatter, I still prefer to use the same encoding as the template file, which is detected by the template engine and defaults to ::Middleman::Application.defaults[:encoding].

What do you think?

I am using 3.0.11. I'm sad to say, the issue might still be there. Frontmatter gets parsed if templates are saved in ANSI mode, but not if they are saved in UTF-8. My layout is in ANSI mode. It is probably occurring due to this issue in Tilt: rtomayko/tilt#75. I am using this gem to take care of the problem.

tdreyno closed this Jun 3, 2013


utensil commented Jun 15, 2013

I recently updated Middleman for my blog to the latest version, tried to reproduce this bug, and found that it had disappeared.

After some digging, I noticed that although ruby -e 'puts Encoding.default_external' outputs 'GBK' on my machine, inside front_matter.rb it is already 'utf-8', so some magic must have happened.

And the magic is 283576a :

@@ -148,6 +149,11 @@ def initialize(&block)
       # Setup the default values from calls to set before initialization

+      if Object.const_defined?(:Encoding)
+        Encoding.default_internal = config[:encoding]
+        Encoding.default_external = config[:encoding]
+      end
       # Evaluate a passed block if given
       instance_exec(&block) if block_given?
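
For anyone wondering why this changes what front_matter.rb sees: a plain File.read tags its result with whatever Encoding.default_external is at the time of the call. A small sketch, assuming Encoding.default_internal is left unset; `encoding_read_under` is a made-up helper name.

```ruby
require "tempfile"

# Hypothetical helper: report the encoding File.read assigns while
# Encoding.default_external is temporarily set to `default`.
def encoding_read_under(default, path)
  previous = Encoding.default_external
  Encoding.default_external = default
  File.read(path).encoding
ensure
  # Always restore the process-wide setting.
  Encoding.default_external = previous
end

Tempfile.create("frontmatter") do |file|
  file.binmode
  file.write("title: \u4E2D\u6587\n")
  file.flush

  puts encoding_read_under("GBK", file.path)    # what a GBK system would see
  puts encoding_read_under("UTF-8", file.path)  # what the commit arranges
end
```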

So, happy ending for this issue.

lwr commented Sep 2, 2016

Unfortunately, the ending above is not quite happy enough.

The problem still exists when loading YAML files in data, because the configured encoding is applied too late:

281      # Before config is parsed. Mostly used for extensions.
282      execute_callbacks(:before_configuration)
... ...
298      if Object.const_defined?(:Encoding)
299        Encoding.default_external = config[:encoding]
300      end

Error stack trace:

/usr/local/rvm/gems/ruby-2.1.8/gems/middleman-core-4.1.10/lib/middleman-core/util/data.rb:60:in `match': invalid byte sequence in GB18030 (ArgumentError)
        from /usr/local/rvm/gems/ruby-2.1.8/gems/middleman-core-4.1.10/lib/middleman-core/util/data.rb:60:in `parse'
... ...
        from /usr/local/rvm/gems/ruby-2.1.8/gems/middleman-core-4.1.10/lib/middleman-core/application.rb:282:in `initialize'
        from /usr/local/rvm/gems/ruby-2.1.8/gems/middleman-cli-4.1.10/bin/middleman:51:in `new'
        from /usr/local/rvm/gems/ruby-2.1.8/gems/middleman-cli-4.1.10/bin/middleman:51:in `<top (required)>'
        from /usr/local/rvm/gems/ruby-2.1.8/bin/middleman:23:in `load'
        from /usr/local/rvm/gems/ruby-2.1.8/bin/middleman:23:in `<main>'
        from /usr/local/rvm/gems/ruby-2.1.8/bin/ruby_executable_hooks:15:in `eval'
        from /usr/local/rvm/gems/ruby-2.1.8/bin/ruby_executable_hooks:15:in `<main>'
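
The ArgumentError itself can be reproduced in isolation, I think. The sketch below assumes the data file holds UTF-8 bytes that end up tagged GB18030; the regex is a simplified stand-in for the one in util/data.rb, and `match_error` is a hypothetical helper. A single Chinese character is used on purpose, because its three UTF-8 bytes cannot be split into valid GB18030 sequences before the newline.

```ruby
# UTF-8 bytes for "title: 中\n", tagged GB18030 as a GB18030 locale would do.
raw        = "title: \u4E2D\n".b
mislabeled = raw.dup.force_encoding("GB18030")
correct    = raw.dup.force_encoding("UTF-8")

# Hypothetical helper: return the error message from a regex match, or nil.
def match_error(string, pattern)
  string =~ pattern
  nil
rescue ArgumentError => e
  e.message
end

puts mislabeled.valid_encoding?
puts match_error(mislabeled, /\Atitle:/).inspect
puts match_error(correct, /\Atitle:/).inspect
```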

Can I reopen this issue, or should I open a new one?

More precisely, the YAML specification requires files to be written only in UTF-8 or UTF-16, so Middleman's current implementation is incorrect. It should force the encoding to UTF-8 when loading YAML files, or even detect whether the file is written in UTF-16, and never depend on the system charset!
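
A rough sketch of what I mean, not a patch: read the bytes, look for a UTF-16 byte order mark, and otherwise treat them as UTF-8. `load_yaml_bytes` is a hypothetical name, and checking only for a BOM is a simplifying assumption; UTF-16 files without a BOM are not handled here.

```ruby
require "yaml"

# Hypothetical helper: decode YAML bytes without consulting the system charset.
def load_yaml_bytes(bytes)
  bytes = bytes.b
  text =
    if bytes.start_with?("\xFF\xFE".b)        # UTF-16 little-endian BOM
      bytes.byteslice(2..-1).force_encoding("UTF-16LE").encode("UTF-8")
    elsif bytes.start_with?("\xFE\xFF".b)     # UTF-16 big-endian BOM
      bytes.byteslice(2..-1).force_encoding("UTF-16BE").encode("UTF-8")
    else
      bytes.sub(/\A\xEF\xBB\xBF/n, "").force_encoding("UTF-8")  # drop a UTF-8 BOM
    end
  YAML.safe_load(text)
end

utf8_bytes  = "title: \u4E2D\u6587\n"
utf16_bytes = "\uFEFFtitle: \u4E2D\u6587\n".encode("UTF-16LE")

puts load_yaml_bytes(utf8_bytes).inspect
puts load_yaml_bytes(utf16_bytes).inspect
```

With File.binread feeding this, the result would no longer depend on Encoding.default_external at all.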
