New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode URLs in utf-8 when escaping and unescaping #2420

Merged
merged 6 commits into from May 21, 2014

Conversation

Projects
None yet
6 participants
@albertogg
Member

albertogg commented May 18, 2014

I've been doing some tests to try to fix issue #2379 after reading @fabianrbz comment. In particular I analyzed what was the encoding behavior when using CGI.escape(path), what is the behavior with URI.escape(path, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/). I also tested how the Addressable gem worked using Addressable::URI.escape(path), none the less I read a bit about RFC 3986 and all of this led me to this pull request.

Here are the examples:

What Jekyll is doing right now: It gets a UTF-8 string, when it gets escaped, the string is converted to ASCII, but when it is unescaped it remains ASCII and this is when the problem occurs. We can easily fixit by encoding the string to UTF-8 prior unescaping it.

# require 'uri'

pry(main)> uri_uri = uri_uri = '/rails笔记/2014/04/20/go-to-top-button-in-rails/'
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
uri_uri.encoding
#=> #<Encoding:UTF-8>
pry(main)> uri_escape = URI.escape(uri_uri, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/)
#=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_escape.encoding
#=> #<Encoding:US-ASCII>
pry(main)> uri_unescape = URI.unescape(uri_escape)
#=> "/rails\xE7\xAC\x94\xE8\xAE\xB0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_unescape.encoding
#=> #<Encoding:US-ASCII>
pry(main)> uri_unescape = URI.unescape(uri_escape.encode('utf-8'))
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_unescape.encoding
#=> #<Encoding:UTF-8>
pry(main)> uri_escape = URI.escape(uri_uri, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/).encode('utf-8')
#=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_escape.encoding
#=> #<Encoding:UTF-8>
pry(main)> uri_unescape = URI.unescape(uri_escape)
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_escape.encoding
#=> #<Encoding:UTF-8>

This is what Jekyll was doing with CGI: when it gets a UTF-8 string and escapes it or unescapes it, the string remained UTF-8. If it gets an ASCII string it will escape it as an ASCII string, but when unescapes it, it will change it to UTF-8. So that why the problem never existed in Jekyll < 2.0.0.

# require 'cgi'

pry(main)> cgi_uri = "/rails笔记/2014/04/20/go-to-top-button-in-rails/".force_encoding(Encoding::ASCII)
#=> ""/rails\xE7\xAC\x94\xE8\xAE\xB0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_uri.encoding
#=> #<Encoding:US-ASCII>
pry(main)> cgi_escape = CGI.escape(cgi_uri)
#=> "%2Frails%E7%AC%94%E8%AE%B0%2F2014%2F04%2F20%2Fgo-to-top-button-in-rails%2F"
pry(main)> cgi_escape.encoding
#=> #<Encoding:US-ASCII>
pry(main)> cgi_unescape = CGI.unescape(cgi_escape)
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_unescape.encoding
#=> #<Encoding:UTF-8>

pry(main)> cgi_uri = "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_uri.encoding
#=> #<Encoding:UTF-8>
pry(main)> cgi_escape = CGI.escape(cgi_uri)
#=> "%2Frails%E7%AC%94%E8%AE%B0%2F2014%2F04%2F20%2Fgo-to-top-button-in-rails%2F"
pry(main)> cgi_escape.encoding
#=> #<Encoding:UTF-8>
pry(main)> cgi_unescape = CGI.unescape(cgi_escape)
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_unescape.encoding
#=> #<Encoding:UTF-8>

Addressable gem Always converts to UTF-8 when escaping or unescaping, so the problem never existed.
_update:_ I missed that when the Addressable gem receives a ASCII string (not escaped) it explodes while escaping it with the same error ArgumentError: invalid byte sequence in US-ASCII we are having and .force_encoding('utf-8') is needed.

# require 'addressable'

pry(main)> ad_uri = "/rails笔记/2014/04/20/go-to-top-button-in-rails/".force_encoding(Encoding::ASCII)
#=> "/rails\xE7\xAC\x94\xE8\xAE\xB0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_uri.encoding
#=> #<Encoding:US-ASCII>
pry(main)> ad_escape = Addressable::URI.escape(ad_uri)
# ArgumentError: invalid byte sequence in US-ASCII from /Users/albertogg/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/addressable2.3.6/lib/addressable/uri.rb:105:in `scan'
pry(main)> ad_escape = Addressable::URI.escape(ad_uri.force_encoding('utf-8'))
=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_escape.encoding
#=> #<Encoding:UTF-8>

pry(main)> ad_escape_ascii = ad_escape.force_encoding(Encoding::ASCII)
=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_escape_ascii.encoding
#=> #<Encoding:US-ASCII>
pry(main)> ad_unescape = Addressable::URI.unescape(ad_escape_ascii)
=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_unescape.encoding
#=> #<Encoding:UTF-8>

Wow, this was really hard for me to write. I hope everyone understand what I was trying to express.

Encode URLs in utf-8 when escaping and unescaping
There is a problem while returning a path that has some special and possible Non-ASCII characters that may lead jekyll to break while doing the unescaping process. This is can be addressed by “forcing” ASCII to UTF-8.
@parkr

This comment has been minimized.

Member

parkr commented May 18, 2014

This is awesome! 👍 from me. @mattr-?

@parkr

This comment has been minimized.

Member

parkr commented May 18, 2014

Can you add a should block for these two methods in the tests, please? 😃 Then I'll merge.

@fabianrbz

This comment has been minimized.

Member

fabianrbz commented May 18, 2014

@albertogg awesome work! after taking a look at what addressable does, this is the simplest solution I can think of that fixes the issue, maybe we should add a test case to avoid regressions in the future.

@albertogg

This comment has been minimized.

Member

albertogg commented May 18, 2014

@parkr I really don't know how to test this, I tried but I have no clue, everything seems to be utf-8 while testing, I'm sorry 😞. If any of you guys can guide me or do it that will be great 😅

@parkr

This comment has been minimized.

Member

parkr commented May 19, 2014

Looking for something like:

should "return a UTF-8 encoded string" do
  assert_equal "utf-8", URL.escape_path("blah").encoding
end
should "return a UTF-8 encoded string" do
  assert_equal "utf-8", URL.unescape_path("blah").encoding
end

😃

albertogg added some commits May 19, 2014

Add tests to validate encoding of URLs
Added tests to validate the encoding of returned URL strings
after been escaped or unescaped.
Force encoding before escaping or unescaping
This will reassure not having any errors when escaping or
unescaping.
@albertogg

This comment has been minimized.

Member

albertogg commented May 19, 2014

Thanks again @parkr ❤️! I really need to level up my testing fu.

Also, I kept testing this and got to a conclusion that is better to force the encoding on the path, that way we will reassure (I think) that no matter what the encoding is set on the user machine we will be able to re-encode to utf-8 with no errors. If you think this is not ok, I can remove the forcing thing. Thanks!

@parkr

This comment has been minimized.

parkr commented on 2a1054b May 20, 2014

1.9.3 doesn't read these characters?

This comment has been minimized.

Owner

albertogg replied May 20, 2014

@parkr no, the hole file needs to have # encoding: utf-8 at the top. It seems that utf-8 was not the default before Ruby < 2.0.0

This comment has been minimized.

parkr replied May 20, 2014

Try adding # encoding: utf-8 at the top instead. We should have some UTF-8 characters in our test, I think 😄

@parkr

This comment has been minimized.

Member

parkr commented May 20, 2014

I always feel just horrid about using force_encoding. I think all it does is change the "header" of the string – it doesn't actually re-encode the string back to UTF-8 or whatever. I'd feel better using encode.

@albertogg

This comment has been minimized.

Member

albertogg commented May 20, 2014

Ok, let me revert that and that's it.

Revert changes and add encoding to the test file
* Add encoding to the test file as Ruby 1.9.3 doesn’t defaults to utf-8.
* Remove the forced encoding as encode seems too aggressive.
end
should "return a UTF-8 unescaped string" do
assert_equal Encoding::UTF_8, URL.unescape_path("/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/").encoding

This comment has been minimized.

@albertogg

albertogg May 20, 2014

Member

Now I feel this like should be:

assert_equal Encoding::UTF_8, URL.unescape_path("/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/".encode(Encoding::ASCII)).encoding

As in the test that string will always be utf-8, or at least that's what I think. But I'm not sure tho.

This comment has been minimized.

@parkr

parkr May 20, 2014

Member

Definitely!

Update unescaped string test
Nothing was being tested without explicitly making the string
encoding ASCII.
@albertogg

This comment has been minimized.

Member

albertogg commented May 20, 2014

It worked, hope it's all good!

parkr added a commit that referenced this pull request May 21, 2014

@parkr parkr merged commit 7e05b6e into jekyll:master May 21, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details

parkr added a commit that referenced this pull request May 21, 2014

@parkr

This comment has been minimized.

Member

parkr commented May 21, 2014

Thanks so much, @albertogg!

@albertogg

This comment has been minimized.

Member

albertogg commented May 21, 2014

No problem! I'm glad I could help!

@albertogg albertogg deleted the albertogg:fix-encoding-issue branch May 21, 2014

@Gutek

This comment has been minimized.

Gutek commented Oct 28, 2014

don't know about you, but i have this issue still with jekyll 2.4.0 and on mac. on windows it works.

site i'm trying to build:
https://github.com/Gutek/gutek.github.io

/Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll.rb:125:in `gsub!': invalid byte sequence in UTF-8 (ArgumentError)
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll.rb:125:in `sanitized_path'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/post.rb:271:in `destination'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:43:in `block in new_files'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:448:in `block (2 levels) in each_site_file'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:447:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:447:in `block in each_site_file'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:446:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:446:in `each_site_file'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:43:in `new_files'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:24:in `obsolete_files'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:15:in `cleanup!'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:279:in `cleanup'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:47:in `process'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/command.rb:28:in `process_site'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/commands/build.rb:55:in `build'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/commands/build.rb:33:in `process'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/commands/build.rb:17:in `block (2 levels) in init_with_program'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `call'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `block in execute'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `execute'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/program.rb:35:in `go'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary.rb:22:in `program'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/bin/jekyll:18:in `<top (required)>'
    from /usr/bin/jekyll:23:in `load'
    from /usr/bin/jekyll:23:in `<main>'

@tobalr

This comment has been minimized.

tobalr commented Dec 4, 2014

I had the some issues with special characters too. In the end I got it to work by converting the .md files to UTF8 without BOM.

@jekyll jekyll locked and limited conversation to collaborators Feb 27, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.