Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode Character Expansion #42

Closed
blakeembrey opened this issue Feb 26, 2015 · 14 comments
Closed

Unicode Character Expansion #42

blakeembrey opened this issue Feb 26, 2015 · 14 comments
Labels

Comments

@blakeembrey
Copy link
Member

I'm opening this issue to create a conversation around supporting unicode and other illegal characters in the path string. For example, matching 我 currently requires using %E6%88%91 as part of the path in the current version of the module. Any thoughts?

@dougwilson
Copy link

I'm fine with this, as long as it expands to the correct Unicode normalization form, whatever the specs say IRIs are expanded to.

@blakeembrey
Copy link
Member Author

@dougwilson Awesome, I'll start an investigation on that 👍

@dougwilson
Copy link

Looks like it's RFC 3987 section 3.1 :)

Also, if it helps, Mojolicious already does this stuff, so it'd be interesting to know how they are doing their transformation, since it has worked well for a long time.

@blakeembrey
Copy link
Member Author

Very nice, thank you 😄

@blakeembrey
Copy link
Member Author

@dougwilson Just looking through the implementation, it looks like they decode it at the framework level before passing it to route matches? Did I understand that correctly? http://mojolicio.us/perldoc/Mojo/Path#to_route

Is that something that could be considered for Express 5.0?

@dougwilson
Copy link

I'll have to look more at their source code (which can be browsed at https://metacpan.org/source/SRI/Mojolicious-6.0/lib), but I'm open to whatever we want to do for Express 5.0. Currently Express 5.0 just passes the raw req.url down to this module, so theoretically that makes it the most flexible of all, but if it's not flexible enough, we can always make it more flexible :)

But really, what I wanted to know is how do they do the UTF-8 -> URI transformation. The characters in JavaScript source code are UCS-2 and so we need to have some kind of way to transform source code strings to URIs reliably. For example, if I type ú in source code, there are actually several different byte sequences that result in that character and even multiple different Unicode code points (U+00FA or U+0075 U+0301). Which ones do we support, or do we treat them the same?

Example:

decodeURIComponent('%C3%BA') // -> ú
decodeURIComponent('u%CC%81') // -> ú

Should a user have to understand that the ú they type in their editor may not match the ú that comes in the URI?

@bajtos
Copy link

bajtos commented Mar 2, 2015

decode it at the framework level before passing it to route matches

This is a great idea, especially if there is a way how to get it working reliably for all edge cases. It should solve the problem where there are multiple ways how to encode a single character, be it ú as mentioned by @dougwilson, or characters like ~ that some clients encode and some don't (IIRC).

It may be possible to implement it in a backward-compatible way if we url-decode the path inputs passed by API users too.

@blakeembrey
Copy link
Member Author

@dougwilson So ES6 has a built in method for this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

decodeURIComponent('u%CC%81').normalize() === decodeURIComponent('%C3%BA')

Edit: Only 0.12 for String#normalize, but we can also use a third party module like https://github.com/walling/unorm for older node versions. I think putting this at the framework level actually makes more sense here, because we'll be decoding it a bunch of times and it'll make the flow more confusing otherwise. If the path accepts unicode, I think we should be able to manually compare too req.url === 'ú'.

@nicooprat
Copy link

Any news on this?

@blakeembrey
Copy link
Member Author

@nicooprat This issue seems like the opposite to what you're looking for (this was about unicode inputs to path matching). From kadirahq/flow-router#599, it seems like what you'd like is the ability to specify the way to encode - currently it's only pretty but that won't be enough to handle unicode characters outside of ASCII.

@blakeembrey
Copy link
Member Author

blakeembrey commented Aug 22, 2017

By the way, I took another stab at doing this for the path-to-regexp 2.0. I don't think it's reasonably possible since every single character could be encoded differently. This has reinforced my view that it needs to be encoded at the router level. Would love additional input though.

For a simple example, type /+ into the browser. The + is not encoded. Now do encodeURIComponent (which you should be using programmatically to safely encode user input). Now you've got an escaped plus. The route will only match one or the other. This gets even more complex when you have multiple representations of the same character input. I'd recommend decoding at the framework level (it's more performant than using path-to-regexp and having this library decode the same string hundreds of times).

@blakeembrey
Copy link
Member Author

blakeembrey commented Aug 22, 2017

Here's a framework level function that should work (but could use some improvements):

function encoding (ignore) {
  return {
    encode: function (value) {
      return encodeURIComponent(value)
    },
    decode: function (value) {
      return value.replace(/(?:%[ef][0-9a-f](?:%[0-9a-f]{2}){2}|%[cd][0-9a-f]%[0-9a-f]{2}|%[0-9a-f]{2})/ig, function (m) {
        const char = decodeURIComponent(m)

        if (ignore.indexOf(char) > -1) return m

        return char
      })
    }
  }
}

Note: You can not technically just use decodeURI() because a : won't be decoded but it can be encoded.

@blakeembrey
Copy link
Member Author

I have added a basic normalizePathname function to this library in 2c3baf1. It uses a simpler implementation than above so you need to make sure of two things if you use this function:

  1. You don't double decode a string (e.g. normalizePathname already calls decodeURIComponent, you shouldn't call it again in application code)
  2. You don't mind that %2F will become /, which will change the match (e.g. /test/route and /test%2Froute were previously different)

@blakeembrey
Copy link
Member Author

Unfortunately that was a quick 9 hours. I've released a 5.x which removes normalizePathname - I looked around to some routing libraries online and played with URL, and it wasn't consistent. Also noted that String.prototype.normalize would not be supported on IE. Instead, I've opted to document the solution in the README.

Note for implementors: if you're using URL, I added an encode option to pathToRegexp so you can encode paths using encodeURI. This will ensure path names from URL match against the RegEx as you cannot use the solution in the README (URL will always encode the pathname).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants