Encode URLs in utf-8 when escaping and unescaping #2420
There is a problem when returning a path that contains special, possibly non-ASCII characters: it may cause Jekyll to break during the unescaping process. This can be addressed by “forcing” ASCII to UTF-8.
This is awesome! 👍 from me. @mattr-?
Can you add a |
@albertogg awesome work! after taking a look at what |
@parkr I really don't know how to test this. I tried, but I have no clue; everything seems to be utf-8 while testing. I'm sorry 😞. If any of you can guide me or do it, that would be great 😅
Looking for something like:

    should "return a UTF-8 encoded string" do
      assert_equal Encoding::UTF_8, URL.escape_path("blah").encoding
    end

    should "return a UTF-8 encoded string" do
      assert_equal Encoding::UTF_8, URL.unescape_path("blah").encoding
    end

😃
Added tests to validate the encoding of returned URL strings after being escaped or unescaped.
This helps ensure there are no errors when escaping or unescaping.
Thanks again @parkr ❤️! I really need to level up my testing fu. Also, I kept testing this and came to the conclusion that it is better to force the encoding on the path. That way we ensure (I think) that no matter what encoding is set on the user's machine, we will be able to re-encode to utf-8 with no errors. If you think this is not ok, I can remove the forcing. Thanks!
I always feel just horrid about using |
Ok, let me revert that and that's it. |
* Add encoding to the test file, as Ruby 1.9.3 doesn’t default to utf-8.
* Remove the forced encoding, as it seems too aggressive.
    end

    should "return a UTF-8 unescaped string" do
      assert_equal Encoding::UTF_8, URL.unescape_path("/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/").encoding
Now I feel like this should be:

    assert_equal Encoding::UTF_8, URL.unescape_path("/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/".encode(Encoding::ASCII)).encoding

As in the test, that string will always be utf-8, or at least that's what I think. But I'm not sure, though.
Definitely!
Nothing was being tested without explicitly making the string encoding ASCII.
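To illustrate why the ASCII input matters, here is a minimal sketch. The two helper methods are hypothetical stand-ins for an implementation without and with the fix; they are not Jekyll's real `URL.unescape_path`. With a UTF-8 input, the encoding assertion passes even when nothing is converted, so only the ASCII input can tell the two apart.

```ruby
# Hypothetical stand-in WITHOUT the fix: returns the path in whatever
# encoding it arrived in.
def path_without_fix(path)
  path.dup
end

# Hypothetical stand-in WITH the fix: transcodes the path to UTF-8.
def path_with_fix(path)
  path.encode(Encoding::UTF_8)
end

# The same escaped path, once labelled UTF-8 and once labelled US-ASCII.
UTF8_INPUT  = "/rails%E7%AC%94%E8%AE%B0/2014/04/20/escaped/".encode(Encoding::UTF_8)
ASCII_INPUT = UTF8_INPUT.encode(Encoding::US_ASCII)

# A UTF-8 input cannot detect the missing fix; an ASCII input can.
puts path_without_fix(UTF8_INPUT).encoding  == Encoding::UTF_8
puts path_without_fix(ASCII_INPUT).encoding == Encoding::UTF_8
puts path_with_fix(ASCII_INPUT).encoding    == Encoding::UTF_8
```

`Encoding::ASCII`, used in the suggestion above, is an alias for `Encoding::US_ASCII`.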
It worked, hope it's all good!
Thanks so much, @albertogg!
No problem! I'm glad I could help!
don't know about you, but i have this issue still with jekyll 2.4.0 and on mac. on windows it works. site i'm trying to build:
|
I had the same issues with special characters too. In the end I got it to work by converting the .md files to UTF-8 without BOM.
I've been doing some tests to try to fix issue #2379 after reading @fabianrbz's comment. In particular, I analyzed the encoding behavior of CGI.escape(path) and of URI.escape(path, /[^a-zA-Z\d\-._~!$&\'()*+,;=:@\/]/). I also tested how the Addressable gem behaves using Addressable::URI.escape(path). I also read a bit about RFC 3986, and all of this led me to this pull request.
Here are the examples:
What Jekyll is doing right now: it gets a UTF-8 string; when the string is escaped, it is converted to ASCII, but when it is unescaped it remains ASCII, and this is when the problem occurs. We can easily fix it by encoding the string to UTF-8 prior to unescaping it.
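The behavior described above can be modelled in a small sketch. `naive_unescape` is a hypothetical decoder written for illustration; it keeps the input's encoding label, as the thread describes, and is not the real implementation of `URI.unescape`. `unescape_as_utf8` is likewise an assumed name for the proposed fix of transcoding to UTF-8 before decoding.

```ruby
# Hypothetical percent-decoder that keeps the input's encoding label.
# This models the behavior described in the thread; it is NOT URI.unescape.
def naive_unescape(path)
  # Work on a binary copy so every %XX can become a raw byte.
  decoded = path.b.gsub(/%(\h\h)/) { [$1].pack("H2") }
  # Re-apply the original label without touching the bytes.
  decoded.force_encoding(path.encoding)
end

# Sketch of the proposed fix: transcode to UTF-8 first, then decode.
def unescape_as_utf8(path)
  naive_unescape(path.encode(Encoding::UTF_8))
end

# An escaped path labelled US-ASCII, like the output of escaping.
ESCAPED_ASCII = "/rails%E7%AC%94%E8%AE%B0/".encode(Encoding::US_ASCII)

broken = naive_unescape(ESCAPED_ASCII)
puts broken.encoding         # still labelled US-ASCII
puts broken.valid_encoding?  # the decoded bytes are not valid US-ASCII

fixed = unescape_as_utf8(ESCAPED_ASCII)
puts fixed.encoding
puts fixed.valid_encoding?
```

Transcoding the escaped string is safe here because, before decoding, it contains only ASCII characters.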
This is what Jekyll was doing with CGI: when it gets a UTF-8 string and escapes or unescapes it, the string remains UTF-8. If it gets an ASCII string, it escapes it as an ASCII string, but when it unescapes it, it changes it to UTF-8. That is why the problem never existed in Jekyll < 2.0.0.
The Addressable gem always converts to UTF-8 when escaping or unescaping, so the problem never existed.
_update:_ I missed that when the Addressable gem receives an ASCII string (not escaped), it explodes while escaping it with the same error we are having, ArgumentError: invalid byte sequence in US-ASCII, and .force_encoding('utf-8') is needed.
Wow, this was really hard for me to write. I hope everyone understands what I was trying to express.
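As a rough sketch of the difference between the two approaches discussed in this thread: `String#encode` transcodes and fails on bytes that are invalid in the source encoding, while `String#force_encoding` only changes the label. The helper names and the sample string below are assumptions made for illustration.

```ruby
# Assumed sample: the raw UTF-8 bytes of two non-ASCII characters,
# mislabelled as US-ASCII (an unescaped path with the wrong label).
def mislabelled_sample
  "\u7B14\u8BB0".dup.force_encoding(Encoding::US_ASCII)
end

# encode transcodes; it raises when the bytes are invalid in the
# source encoding. Returns nil in that case.
def try_encode(str)
  str.encode(Encoding::UTF_8)
rescue Encoding::InvalidByteSequenceError
  nil
end

# force_encoding only relabels a copy; the bytes stay the same.
def relabel(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

sample = mislabelled_sample
puts sample.valid_encoding?           # bytes above 0x7F are invalid US-ASCII
puts try_encode(sample).inspect       # transcoding fails
puts relabel(sample).valid_encoding?  # relabelling yields valid UTF-8
```

Relabelling is only correct when the bytes really are UTF-8, which may be why forcing the encoding was described as too aggressive earlier in the thread.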