Adding "encoding" configuration (4th version) #1449

Merged
merged 8 commits into from Sep 24, 2013

Projects

None yet

6 participants

@shigeya
Contributor
shigeya commented Aug 24, 2013

Rebased towards current master, to support elimination of monkey patch to File.read.

Need tests. (require some discussion)

@shigeya
Contributor
shigeya commented Aug 24, 2013

I have considered how to implement the test.

For generation part of the program, I prefer to have both encoding=utf-8 case and without encoding (current default).
I have reviewed test on https://github.com/mojombo/jekyll/blob/master/test/test_site.rb .

First option is test whole site generation with or without utf-8, but it will add tons of tests (when run in Ruby >=1.9).
Other option is add a fraction of site generation in utf-8 only with Ruby >=1.9
Which maintainers prefer?

Also, do you need to have test for configuration? It looks like there are no tests among default configuration parameters. I can add, but I have no idea whether it is meaningful or not.

@shigeya
Contributor
shigeya commented Aug 24, 2013

btw, while I think monkey-patching String is bad idea, monkey-patching on File.read on options seems reasonable. It clearer than if RUBY_VERSION in various places. If maintainers agree on this, I can modify that way, too.

@mattr- mattr- commented on an outdated diff Aug 24, 2013
lib/jekyll/convertible.rb
+ self.content = if RUBY_VERSION < "1.9"
@mattr-
mattr- Aug 24, 2013 Member

Please wrap this conditional into a method call so that we don't have to parse what's going on. A method name like read_content would work better here instead.

@mattr- mattr- commented on an outdated diff Aug 24, 2013
lib/jekyll/tags/include.rb
@@ -62,7 +62,11 @@ def render(context)
Dir.chdir(includes_dir) do
choices = Dir['**/*'].reject { |x| File.symlink?(x) }
if choices.include?(@file)
- source = File.read(@file)
+ source = if RUBY_VERSION < "1.9"
@mattr-
mattr- Aug 24, 2013 Member

Similar to the comment above, I think this should go into its own method as well.

@shigeya
Contributor
shigeya commented Aug 24, 2013

@mattr- I added a commit. Is this the style what you like to see?

@mattr-
Member
mattr- commented Aug 25, 2013

Yes, that's a good start. There's a couple of other places that could use some method extraction as well. 😃

@shigeya
Contributor
shigeya commented Aug 25, 2013

I don't understand how far you want to make these descriptive compare to other codes. I see line in initializer of site.rb can be described in other way, but I think it's enough minimal and descriptive.

Of course It's possible to do that way one by one. Can you provide guidelines or conditions on judging such? Then, you don't necessary to comment every commit this way and easier.

@mattr-
Member
mattr- commented Aug 25, 2013

The conditionals around the RUBY_VERSION checks need their own methods,
IMO. Once that's done, then I think it's good to merge.

On Sat, Aug 24, 2013 at 10:54 PM, Shigeya Suzuki
notifications@github.comwrote:

I don't understand how far you want to make these descriptive compare to
other codes. I see line in initialized of site.rb can be described in
other way, but I think it's enough minimal and descriptive.

Of course It's possible to do that way one by one. Can you provide
guidelines or conditions on judging such? Then, you don't necessary to
comment every commit this way and easier.


Reply to this email directly or view it on GitHubhttps://github.com/mojombo/jekyll/pull/1449#issuecomment-23221459
.

@shigeya
Contributor
shigeya commented Aug 25, 2013

Introduced File.read_with_options instead of patching File.read. How about this?

@mattr-
Member
mattr- commented Aug 27, 2013

❤️ it!

@shigeya
Contributor
shigeya commented Aug 28, 2013

Good. then, how about this performance fix? It's marginally faster.

https://gist.github.com/shigeya/6362099

I haven't committed/push this yet.

@mattn
mattn commented Aug 28, 2013

Good Job

@mattr-
Member
mattr- commented Aug 28, 2013

@shigeya That also works. 😃

@shigeya
Contributor
shigeya commented Aug 28, 2013

@mattr- Ok, thanks. merged that change, and rebased on top of current master.

@shigeya
Contributor
shigeya commented Aug 28, 2013

Test passed on this branch. (I have tested on master and it's still failing on the encoding). Travis should say OK too.

@parkr parkr and 1 other commented on an outdated diff Aug 28, 2013
lib/jekyll/tags/include.rb
@@ -48,6 +48,11 @@ def validate_syntax
end
end
+ # Grab file read opts in the context
+ def file_read_opts_from_context(context)
@parkr
parkr Aug 28, 2013 Member

Does this need to specify that they come from the context? I think it'd be ok to just have file_read_opts here :)

@shigeya
shigeya Aug 28, 2013 Contributor

Okay, then, just mentioning file_read_opts is enough, then, placing context.registers[:site].file_read_opts at the option of read_with_options is sufficient, isn't it (without refactoring that as method)? On contrary, if we name the method file_read_opts, passing context to it seems wrong to me. The other option is put the read_file_opts into context or wherever place. which seems odd.

@parkr
parkr Aug 28, 2013 Member

When you pass the context, it's inferred that the context has something to do with it. I think def file_read_opts(context) says "based on this context, fetch me the file read options.

@shigeya
shigeya Aug 28, 2013 Contributor

If you say so, ok. I'll change that way.

@parkr parkr commented on an outdated diff Aug 28, 2013
site/docs/configuration.md
@@ -99,6 +99,18 @@ class="flag">flags</code> (specified on the command-line) that control them.
<p><code class="option">timezone: TIMEZONE</code></p>
</td>
</tr>
+ <tr class='setting'>
+ <td>
+ <p class='name'><strong>Encoding</strong></p>
+ <p class="description">
+ Set the encoding of files. The default is the system encoding,
+ which determined by LANG environment variable.
@parkr
parkr Aug 28, 2013 Member

It'd be awesome to have a link to a list of possible encodings. Maybe a Wikipedia article for reference?

@parkr
Member
parkr commented Aug 28, 2013

LGTM after my other couple of comments.

@shigeya
Contributor
shigeya commented Aug 28, 2013

Fixed.

@mattn
mattn commented Sep 9, 2013

There's no issue to merge? :)

@shigeya
Contributor
shigeya commented Sep 12, 2013

Any reason for delay for merge? if you want rebase, I can at any time...

@parkr
Member
parkr commented Sep 12, 2013

We need a rebase! Thanks for being willing to do so :) I will take a look at it after the rebase :)

@shigeya
Contributor
shigeya commented Sep 12, 2013

Will work on it tomorrow. I tried, but it's little complex to do clean merge for a drunk guy:D

@shigeya
Contributor
shigeya commented Sep 13, 2013

It was painful:< @parkr, please review and merge.

@mattn
mattn commented Sep 13, 2013

good job

@mattn
mattn commented Sep 18, 2013

Why not merge this? As I said this or another issue, jekyll doesn't work to read multi-byte encoding texts on windows ruby2.0 . I think you are taking too many time to fix this problem. :-(

@parkr parkr commented on the diff Sep 18, 2013
lib/jekyll/core_ext.rb
@@ -69,3 +69,18 @@ def glob_include?(e)
any? { |exp| File.fnmatch?(exp, e) }
end
end
+
+# Ruby 1.8's File.read don't support option.
+# read_with_options ignore optional parameter for 1.8,
+# and act as alias for 1.9 or later.
+class File
+ if RUBY_VERSION < '1.9'
+ def self.read_with_options(path, opts = {})
+ self.read(path)
+ end
+ else
+ def self.read_with_options(path, opts = {})
+ self.read(path, opts)
+ end
+ end
@parkr
parkr Sep 18, 2013 Member

is it possible to do this here:

class File
  def self.read_with_options(path, opts = {})
    if RUBY_VERSION < '1.9'
      self.read(path)
    else
      self.read(path, opts)
    end
  end
end

It's a bit DRYer. :)

@shigeya
shigeya Sep 19, 2013 Contributor

That was previous version's code. Current version is just a little faster, since it is does not require eval of RUBY_VERSION on every call. I asked whether you guys like this or not and imported. Please decide which is better.

@parkr
parkr Sep 19, 2013 Member

The current version is fine given that reason.

@parkr parkr and 1 other commented on an outdated diff Sep 18, 2013
site/docs/configuration.md
@@ -99,6 +99,22 @@ class="flag">flags</code> (specified on the command-line) that control them.
<p><code class="option">timezone: TIMEZONE</code></p>
</td>
</tr>
+ <tr class='setting'>
+ <td>
+ <p class='name'><strong>Encoding</strong></p>
+ <p class="description">
+ Set the encoding of files by name. Only avaialble for Ruby
@parkr
parkr Sep 18, 2013 Member

There's a slight spelling issue here: available :)

@shigeya
shigeya Sep 19, 2013 Contributor

Will fix this. thanks.

@parkr parkr and 2 others commented on an outdated diff Sep 18, 2013
site/docs/configuration.md
@@ -99,6 +99,22 @@ class="flag">flags</code> (specified on the command-line) that control them.
<p><code class="option">timezone: TIMEZONE</code></p>
</td>
</tr>
+ <tr class='setting'>
+ <td>
+ <p class='name'><strong>Encoding</strong></p>
+ <p class="description">
+ Set the encoding of files by name. Only avaialble for Ruby
+ 1.9 or later).
+ The default value is nil, which assumes default encoding,
+ <code>ASCII-8BIT</code>.
@parkr
parkr Sep 18, 2013 Member

I think the default value is $LANG on Unix then maybe defaults back to ASCII-8BIT

@mattn
mattn Sep 18, 2013

I prefer utf-8 as default. $LANG is locale name not encoding name.

On 9/18/13, Parker Moore notifications@github.com wrote:

@@ -99,6 +99,22 @@ class="flag">flags (specified on the
command-line) that control them.

timezone: TIMEZONE



  •  <td>
    
  •    <p class='name'><strong>Encoding</strong></p>
    
  •    <p class="description">
    
  •        Set the encoding of files by name. Only avaialble for Ruby
    
  •        1.9 or later).
    
  •        The default value is nil, which assumes default encoding,
    
  •        <code>ASCII-8BIT</code>.
    

I think the default value is $LANG on Unix then maybe defaults back to
ASCII-8BIT


Reply to this email directly or view it on GitHub:
https://github.com/mojombo/jekyll/pull/1449/files#r6435126

  • Yasuhiro Matsumoto
@parkr
parkr Sep 18, 2013 Member

Encoding is in there :)

~/code/jekyll$ echo $LANG
en_US.UTF-8
@parkr
parkr Sep 18, 2013 Member

Also the default of UTF-8 is smart but isn't compatible with some systems This note here about ASCII-8BIT is about the usual-default system encoding being ASCII – we don't impose a default.

@mattn
mattn Sep 18, 2013

You need to remove en_US. if it use $LANG.

we don't impose a default.

Then, it need to set Encoding.default_external?

@shigeya
shigeya Sep 19, 2013 Contributor

Since Ruby's default behavior without locale configuration such as LANG is default to ASCII-8BIT, default should be ASCII-8BIT, since it may break other user's page build.

We can discuss whether defaults to be utf-8 or not later. This discussion can be endless. Thus, I recommend to stick with current behavior, unless clearly configured with encoding config. This allows us to configure to use utf-8 while keeping other page builds happy.

@parkr parkr and 1 other commented on an outdated diff Sep 18, 2013
site/docs/configuration.md
@@ -99,6 +99,22 @@ class="flag">flags</code> (specified on the command-line) that control them.
<p><code class="option">timezone: TIMEZONE</code></p>
</td>
</tr>
+ <tr class='setting'>
+ <td>
+ <p class='name'><strong>Encoding</strong></p>
+ <p class="description">
+ Set the encoding of files by name. Only avaialble for Ruby
+ 1.9 or later).
+ The default value is nil, which assumes default encoding,
+ <code>ASCII-8BIT</code>.
+ Available encoding for the ruby in use, can be shown by
+ command <code>ruby -e 'puts Encoding::list.join("\n")</code>
@parkr
parkr Sep 18, 2013 Member

The Ruby code here doesn't end. Should have the ending quotation mark: ruby -e 'puts Encoding::list.join("\n")'

@shigeya
shigeya Sep 19, 2013 Contributor

Thanks. Will fix this too..

@parkr parkr commented on the diff Sep 18, 2013
test/test_convertible.rb
@@ -40,7 +40,7 @@ class TestConvertible < Test::Unit::TestCase
should "not parse if there is encoding error in file" do
name = 'broken_front_matter3.erb'
out = capture_stdout do
- ret = @convertible.read_yaml(@base, name)
+ ret = @convertible.read_yaml(@base, name, :encoding => 'utf-8')
@parkr
parkr Sep 18, 2013 Member

Should we not use the system encoding / config'd encoding here?

@shigeya
shigeya Sep 19, 2013 Contributor

We should stick with this way, since the text trying to read is actually utf-8 here.

@parkr
parkr Sep 19, 2013 Member

All yaml is always UTF-8? I'm confused.

@shigeya
shigeya Sep 19, 2013 Contributor

No, the target file 'broken_front_matter3.erb' is encoded in utf-8 see line 41.

@parkr
parkr Sep 19, 2013 Member

Ah, ok, thanks :)

@parkr
Member
parkr commented Sep 18, 2013

Thanks for all your work on this!

@shigeya
Contributor
shigeya commented Sep 19, 2013

I will update the branch today.

@parkr
Member
parkr commented Sep 19, 2013

Thank you!

@shigeya
Contributor
shigeya commented Sep 19, 2013

Rebased (again), fixed, tested. For convenience, I have reordered and squashed some of commits.

@parkr
Member
parkr commented Sep 19, 2013

Thank you thank you thank you!

@mattr- Let's get this merged and out in a v1.2.2 with any other fixes we deem necessary. This is a whopper!

@mattn
mattn commented Sep 19, 2013

it will be coming! so cool! :)

@mattr- mattr- commented on the diff Sep 20, 2013
lib/jekyll/convertible.rb
@@ -20,16 +20,23 @@ def to_s
self.content || ''
end
+ # Returns merged optin hash for File.read of self.site (if exists)
+ # and a given param
+ def merged_file_read_opts(opts)
+ (self.site ? self.site.file_read_opts : {}).merge(opts)
@mattr-
mattr- Sep 20, 2013 Member

I don't see a reason to check for site here. Convertible classes will always have a site object.

@shigeya
shigeya Sep 20, 2013 Contributor

Hmm. I have merged previous code without thought on that. Will change quickly.

@shigeya
shigeya Sep 20, 2013 Contributor

Nope. you're wrong.

After I changed to:

self.site.file_read_opts.merge(opts)

Then, on test:

Error reading file /Users/shigeya/git/jekyll/test/fixtures/broken_front_matter1.erb: undefined method `file_read_opts' for nil:NilClass

please merge as is.

@mattr-
Member
mattr- commented Sep 20, 2013

One small tiny change and then I think we're good.

@parkr
Member
parkr commented Sep 20, 2013

@mattr- Looks like @shigeya has run into a problem with your change, but the current code works. I'm 👍 to ship and release v1.2.2 this weekend or early next week.

@shigeya
Contributor
shigeya commented Sep 24, 2013

Please?

@parkr
Member
parkr commented Sep 24, 2013

Looks good to me. @mattr-, just need your final 👍 then we can merge it in.

@mattr-
Member
mattr- commented Sep 24, 2013

👍 :shipit:

On Mon, Sep 23, 2013 at 9:10 PM, Parker Moore notifications@github.comwrote:

Looks good to me. @mattr- https://github.com/mattr-, just need your
final [image: 👍] then we can merge it in.


Reply to this email directly or view it on GitHubhttps://github.com/mojombo/jekyll/pull/1449#issuecomment-24970283
.

@parkr parkr merged commit 2734759 into jekyll:master Sep 24, 2013

1 check passed

default The Travis CI build passed
Details
@parkr parkr added a commit that referenced this pull request Sep 24, 2013
@parkr parkr Update history to reflect merge of #1449 5badf7d
@shigeya
Contributor
shigeya commented Sep 24, 2013

Thank you!

@parkr
Member
parkr commented Sep 24, 2013

Our pleasure – thank YOU!

@mattn
mattn commented Sep 24, 2013

Coooooooooooooooool!

@parkr
Member
parkr commented Sep 24, 2013

This will be released with v1.3.0 :) Thanks for your work on this! 🎉

@shigeya
Contributor
shigeya commented Sep 24, 2013

You're welcome. I (and some of my friends) plan to use jekyll, so, hope to contribute some other area, too. I believe basic UTF-8 support accelerate use of jekyll among Japanese communities.

@penibelst
Member

@shigeya I’m curious which setup causes your encoding issues. I don’t know much about the Japanese alphabet, but I create Russian and German content in Jekyll without any problems. These alphabets have many characters outside of ASCII. My OS is Ubuntu, all content files are UTF-8 encoded.

What is the common setup among Japanese communities?

@shigeya
Contributor
shigeya commented Sep 24, 2013

@penibelst That's a very, very long story. Following are my understandings.

First, Ruby does not configure default_external to UTF-8 according to LC_CTYPE locale configuration environment variable, since Ruby startup code intentionally reset LC_CTYPE to null string on startup (see main.c). Don't ask me why:)

Second, whether conversion succeed or not depends on pair of encodings (source and destination) and the code point of interest. Ruby raise exception, when string translated from one encoding to another, if some of the code point which does not have translation rule (internally).

What's the output of ruby -e 'puts Encoding.default_external' in your environment? If the output is 'ASCII-8BIT' or 'US-ASCII' (I bet on this), translation rule between the encoding you're using (in files) and default externals (ASCII) on the code points you're using are translatable, cause no problem.

Now, for many of Japanese character code points, ASCII-8BIT and UTF-8 are not mappable.

Thus, we need Ruby to set default external encoding to UTF-8 manually, either by setting Encoding.default_external or -E option. The patch provided here allows us to configure the external encoding to be used on file read on various locations.

Hope this helps.

@penibelst
Member

@shigeya This sounds very bad. I thought Ruby was born in Japan.

The output of your command in my environment is -e:1: uninitialized constant Encoding (NameError).

@parkr
Member
parkr commented Sep 25, 2013

@penibelst You may need to require 'encoding' before you call Encoding.default_external

@shigeya
Contributor
shigeya commented Sep 25, 2013

@penibelst Ah, I have overlooked that reason. You're using 1.8.7. Ruby 1.8.7 does not have encoding library. Its story is very different since it doesn't have encoding handling like as described above. The feature we discussed here is for 1.9.3 and later. On 1.8.7, as far as you're using same encoding, it should not cause encoding error. On contrary, if you'd like to have safer encoding translation, you need to use Ruby 1.9.3 or later.

On the Encoding implementation, while it is little complex and cause some difficulties in certain situation, I believe this is one of most forward looking and complete implementation of encoding among currently available computer languages. As far as the encoding is specified correctly, it work perfectly, and will not cause issues on translation and also detect issues on translations.

Additional note on above discussion, I have reviewed source code and found that, Ruby 1.9.3+ control encoding default according to LANG environment variable. Default encoding (set as default_external, etc) is platform dependent. Unfortunately, documentation on this topic is not good enough. Check Ruby 1.9.3's Encoding doc and search for LANG.

@penibelst
Member

@parkr no such file to load -- encoding (LoadError)

@shigeya You are right, my Ruby is older than 1.9.3

puts RUBY_VERSION, ENV['LANG']
1.8.7
de_DE.UTF-8
@shigeya shigeya deleted the shigeya:config-encoding-and-yaml-opts branch Sep 26, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment