Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode URLs in utf-8 when escaping and unescaping #2420

Merged
merged 6 commits into from
May 21, 2014

Conversation

albertogg
Copy link
Member

I've been doing some tests to try to fix issue #2379 after reading @fabianrbz comment. In particular I analyzed what was the encoding behavior when using CGI.escape(path), what is the behavior with URI.escape(path, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/). I also tested how the Addressable gem worked using Addressable::URI.escape(path), none the less I read a bit about RFC 3986 and all of this led me to this pull request.

Here are the examples:

What Jekyll is doing right now: It gets a UTF-8 string, when it gets escaped, the string is converted to ASCII, but when it is unescaped it remains ASCII and this is when the problem occurs. We can easily fixit by encoding the string to UTF-8 prior unescaping it.

# require 'uri'

pry(main)> uri_uri = uri_uri = '/rails笔记/2014/04/20/go-to-top-button-in-rails/'
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
uri_uri.encoding
#=> #<Encoding:UTF-8>
pry(main)> uri_escape = URI.escape(uri_uri, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/)
#=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_escape.encoding
#=> #<Encoding:US-ASCII>
pry(main)> uri_unescape = URI.unescape(uri_escape)
#=> "/rails\xE7\xAC\x94\xE8\xAE\xB0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_unescape.encoding
#=> #<Encoding:US-ASCII>
pry(main)> uri_unescape = URI.unescape(uri_escape.encode('utf-8'))
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_unescape.encoding
#=> #<Encoding:UTF-8>
pry(main)> uri_escape = URI.escape(uri_uri, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/).encode('utf-8')
#=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_escape.encoding
#=> #<Encoding:UTF-8>
pry(main)> uri_unescape = URI.unescape(uri_escape)
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> uri_escape.encoding
#=> #<Encoding:UTF-8>

This is what Jekyll was doing with CGI: when it gets a UTF-8 string and escapes it or unescapes it, the string remained UTF-8. If it gets an ASCII string it will escape it as an ASCII string, but when unescapes it, it will change it to UTF-8. So that why the problem never existed in Jekyll < 2.0.0.

# require 'cgi'

pry(main)> cgi_uri = "/rails笔记/2014/04/20/go-to-top-button-in-rails/".force_encoding(Encoding::ASCII)
#=> ""/rails\xE7\xAC\x94\xE8\xAE\xB0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_uri.encoding
#=> #<Encoding:US-ASCII>
pry(main)> cgi_escape = CGI.escape(cgi_uri)
#=> "%2Frails%E7%AC%94%E8%AE%B0%2F2014%2F04%2F20%2Fgo-to-top-button-in-rails%2F"
pry(main)> cgi_escape.encoding
#=> #<Encoding:US-ASCII>
pry(main)> cgi_unescape = CGI.unescape(cgi_escape)
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_unescape.encoding
#=> #<Encoding:UTF-8>

pry(main)> cgi_uri = "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_uri.encoding
#=> #<Encoding:UTF-8>
pry(main)> cgi_escape = CGI.escape(cgi_uri)
#=> "%2Frails%E7%AC%94%E8%AE%B0%2F2014%2F04%2F20%2Fgo-to-top-button-in-rails%2F"
pry(main)> cgi_escape.encoding
#=> #<Encoding:UTF-8>
pry(main)> cgi_unescape = CGI.unescape(cgi_escape)
#=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> cgi_unescape.encoding
#=> #<Encoding:UTF-8>

Addressable gem Always converts to UTF-8 when escaping or unescaping, so the problem never existed.
_update:_ I missed that when the Addressable gem receives a ASCII string (not escaped) it explodes while escaping it with the same error ArgumentError: invalid byte sequence in US-ASCII we are having and .force_encoding('utf-8') is needed.

# require 'addressable'

pry(main)> ad_uri = "/rails笔记/2014/04/20/go-to-top-button-in-rails/".force_encoding(Encoding::ASCII)
#=> "/rails\xE7\xAC\x94\xE8\xAE\xB0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_uri.encoding
#=> #<Encoding:US-ASCII>
pry(main)> ad_escape = Addressable::URI.escape(ad_uri)
# ArgumentError: invalid byte sequence in US-ASCII from /Users/albertogg/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/addressable2.3.6/lib/addressable/uri.rb:105:in `scan'
pry(main)> ad_escape = Addressable::URI.escape(ad_uri.force_encoding('utf-8'))
=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_escape.encoding
#=> #<Encoding:UTF-8>

pry(main)> ad_escape_ascii = ad_escape.force_encoding(Encoding::ASCII)
=> "/rails%E7%AC%94%E8%AE%B0/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_escape_ascii.encoding
#=> #<Encoding:US-ASCII>
pry(main)> ad_unescape = Addressable::URI.unescape(ad_escape_ascii)
=> "/rails笔记/2014/04/20/go-to-top-button-in-rails/"
pry(main)> ad_unescape.encoding
#=> #<Encoding:UTF-8>

Wow, this was really hard for me to write. I hope everyone understand what I was trying to express.

There is a problem while returning a path that has some special and possible Non-ASCII characters that may lead jekyll to break while doing the unescaping process. This is can be addressed by “forcing” ASCII to UTF-8.
@parkr
Copy link
Member

parkr commented May 18, 2014

This is awesome! 👍 from me. @mattr-?

@parkr
Copy link
Member

parkr commented May 18, 2014

Can you add a should block for these two methods in the tests, please? 😃 Then I'll merge.

@fabianrbz
Copy link
Member

@albertogg awesome work! after taking a look at what addressable does, this is the simplest solution I can think of that fixes the issue, maybe we should add a test case to avoid regressions in the future.

@albertogg
Copy link
Member Author

@parkr I really don't know how to test this, I tried but I have no clue, everything seems to be utf-8 while testing, I'm sorry 😞. If any of you guys can guide me or do it that will be great 😅

@parkr
Copy link
Member

parkr commented May 19, 2014

Looking for something like:

should "return a UTF-8 encoded string" do
  assert_equal "utf-8", URL.escape_path("blah").encoding
end
should "return a UTF-8 encoded string" do
  assert_equal "utf-8", URL.unescape_path("blah").encoding
end

😃

Added tests to validate the encoding of returned URL strings
after been escaped or unescaped.
This will reassure not having any errors when escaping or
unescaping.
@albertogg
Copy link
Member Author

Thanks again @parkr ❤️! I really need to level up my testing fu.

Also, I kept testing this and got to a conclusion that is better to force the encoding on the path, that way we will reassure (I think) that no matter what the encoding is set on the user machine we will be able to re-encode to utf-8 with no errors. If you think this is not ok, I can remove the forcing thing. Thanks!

@parkr
Copy link
Member

parkr commented May 20, 2014

I always feel just horrid about using force_encoding. I think all it does is change the "header" of the string – it doesn't actually re-encode the string back to UTF-8 or whatever. I'd feel better using encode.

@albertogg
Copy link
Member Author

Ok, let me revert that and that's it.

* Add encoding to the test file as Ruby 1.9.3 doesn’t defaults to utf-8.
* Remove the forced encoding as encode seems too aggressive.
end

should "return a UTF-8 unescaped string" do
assert_equal Encoding::UTF_8, URL.unescape_path("/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/").encoding
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I feel this like should be:

assert_equal Encoding::UTF_8, URL.unescape_path("/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/".encode(Encoding::ASCII)).encoding

As in the test that string will always be utf-8, or at least that's what I think. But I'm not sure tho.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely!

Nothing was being tested without explicitly making the string
encoding ASCII.
@albertogg
Copy link
Member Author

It worked, hope it's all good!

@parkr
Copy link
Member

parkr commented May 21, 2014

Thanks so much, @albertogg!

@albertogg
Copy link
Member Author

No problem! I'm glad I could help!

@albertogg albertogg deleted the fix-encoding-issue branch May 21, 2014 02:42
@Gutek
Copy link

Gutek commented Oct 28, 2014

don't know about you, but i have this issue still with jekyll 2.4.0 and on mac. on windows it works.

site i'm trying to build:
https://github.com/Gutek/gutek.github.io

/Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll.rb:125:in `gsub!': invalid byte sequence in UTF-8 (ArgumentError)
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll.rb:125:in `sanitized_path'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/post.rb:271:in `destination'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:43:in `block in new_files'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:448:in `block (2 levels) in each_site_file'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:447:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:447:in `block in each_site_file'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:446:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:446:in `each_site_file'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:43:in `new_files'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:24:in `obsolete_files'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/cleaner.rb:15:in `cleanup!'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:279:in `cleanup'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/site.rb:47:in `process'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/command.rb:28:in `process_site'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/commands/build.rb:55:in `build'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/commands/build.rb:33:in `process'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/lib/jekyll/commands/build.rb:17:in `block (2 levels) in init_with_program'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `call'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `block in execute'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/command.rb:220:in `execute'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary/program.rb:35:in `go'
    from /Library/Ruby/Gems/2.0.0/gems/mercenary-0.3.4/lib/mercenary.rb:22:in `program'
    from /Library/Ruby/Gems/2.0.0/gems/jekyll-2.4.0/bin/jekyll:18:in `<top (required)>'
    from /usr/bin/jekyll:23:in `load'
    from /usr/bin/jekyll:23:in `<main>'

@tobalr
Copy link

tobalr commented Dec 4, 2014

I had the some issues with special characters too. In the end I got it to work by converting the .md files to UTF8 without BOM.

@jekyll jekyll locked and limited conversation to collaborators Feb 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants