Unicode Character Expansion #42

blakeembrey · 2015-02-26T15:07:57Z

I'm opening this issue to create a conversation around supporting unicode and other illegal characters in the path string. For example, matching 我 currently requires using %E6%88%91 as part of the path in the current version of the module. Any thoughts?

The text was updated successfully, but these errors were encountered:

dougwilson · 2015-02-26T15:10:49Z

I'm fine with this, as long as it expands to the correct Unicode normalization form, whatever the specs say IRIs are expanded to.

blakeembrey · 2015-02-26T15:12:07Z

@dougwilson Awesome, I'll start an investigation on that 👍

dougwilson · 2015-02-26T15:13:45Z

Looks like it's RFC 3987 section 3.1 :)

Also, if it helps, Mojolicious already does this stuff, so it'd be interesting to know how they are doing their transformation, since it has worked well for a long time.

blakeembrey · 2015-02-26T15:15:08Z

Very nice, thank you 😄

blakeembrey · 2015-02-28T10:03:37Z

@dougwilson Just looking through the implementation, it looks like they decode it at the framework level before passing it to route matches? Did I understand that correctly? http://mojolicio.us/perldoc/Mojo/Path#to_route

Is that something that could be considered for Express 5.0?

dougwilson · 2015-02-28T18:05:35Z

I'll have to look more at their source code (which can be browsed at https://metacpan.org/source/SRI/Mojolicious-6.0/lib), but I'm open to whatever we want to do for Express 5.0. Currently Express 5.0 just passes the raw req.url down to this module, so theoretically that makes it the most flexible of all, but if it's not flexible enough, we can always make it more flexible :)

But really, what I wanted to know is how do they do the UTF-8 -> URI transformation. The characters in JavaScript source code are UCS-2 and so we need to have some kind of way to transform source code strings to URIs reliably. For example, if I type ú in source code, there are actually several different byte sequences that result in that character and even multiple different Unicode code points (U+00FA or U+0075 U+0301). Which ones do we support, or do we treat them the same?

Example:

decodeURIComponent('%C3%BA') // -> ú
decodeURIComponent('u%CC%81') // -> ú

Should a user have to understand that the ú they type in their editor may not match the ú that comes in the URI?

bajtos · 2015-03-02T09:31:06Z

decode it at the framework level before passing it to route matches

This is a great idea, especially if there is a way how to get it working reliably for all edge cases. It should solve the problem where there are multiple ways how to encode a single character, be it ú as mentioned by @dougwilson, or characters like ~ that some clients encode and some don't (IIRC).

It may be possible to implement it in a backward-compatible way if we url-decode the path inputs passed by API users too.

blakeembrey · 2015-03-03T10:37:52Z

@dougwilson So ES6 has a built in method for this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

decodeURIComponent('u%CC%81').normalize() === decodeURIComponent('%C3%BA')

Edit: Only 0.12 for String#normalize, but we can also use a third party module like https://github.com/walling/unorm for older node versions. I think putting this at the framework level actually makes more sense here, because we'll be decoding it a bunch of times and it'll make the flow more confusing otherwise. If the path accepts unicode, I think we should be able to manually compare too req.url === 'ú'.

nicooprat · 2016-07-26T14:13:56Z

Any news on this?

blakeembrey · 2017-08-22T02:27:57Z

@nicooprat This issue seems like the opposite to what you're looking for (this was about unicode inputs to path matching). From kadirahq/flow-router#599, it seems like what you'd like is the ability to specify the way to encode - currently it's only pretty but that won't be enough to handle unicode characters outside of ASCII.

blakeembrey · 2017-08-22T03:01:30Z

By the way, I took another stab at doing this for the path-to-regexp 2.0. I don't think it's reasonably possible since every single character could be encoded differently. This has reinforced my view that it needs to be encoded at the router level. Would love additional input though.

For a simple example, type /+ into the browser. The + is not encoded. Now do encodeURIComponent (which you should be using programmatically to safely encode user input). Now you've got an escaped plus. The route will only match one or the other. This gets even more complex when you have multiple representations of the same character input. I'd recommend decoding at the framework level (it's more performant than using path-to-regexp and having this library decode the same string hundreds of times).

blakeembrey · 2017-08-22T04:06:26Z

Here's a framework level function that should work (but could use some improvements):

function encoding (ignore) {
  return {
    encode: function (value) {
      return encodeURIComponent(value)
    },
    decode: function (value) {
      return value.replace(/(?:%[ef][0-9a-f](?:%[0-9a-f]{2}){2}|%[cd][0-9a-f]%[0-9a-f]{2}|%[0-9a-f]{2})/ig, function (m) {
        const char = decodeURIComponent(m)

        if (ignore.indexOf(char) > -1) return m

        return char
      })
    }
  }
}

Note: You can not technically just use decodeURI() because a : won't be decoded but it can be encoded.

blakeembrey · 2019-11-12T03:13:28Z

I have added a basic normalizePathname function to this library in 2c3baf1. It uses a simpler implementation than above so you need to make sure of two things if you use this function:

You don't double decode a string (e.g. normalizePathname already calls decodeURIComponent, you shouldn't call it again in application code)
You don't mind that %2F will become /, which will change the match (e.g. /test/route and /test%2Froute were previously different)

blakeembrey · 2019-11-12T12:08:22Z

Unfortunately that was a quick 9 hours. I've released a 5.x which removes normalizePathname - I looked around to some routing libraries online and played with URL, and it wasn't consistent. Also noted that String.prototype.normalize would not be supported on IE. Instead, I've opted to document the solution in the README.

Note for implementors: if you're using URL, I added an encode option to pathToRegexp so you can encode paths using encodeURI. This will ensure path names from URL match against the RegEx as you cannot use the solution in the README (URL will always encode the pathname).

blakeembrey mentioned this issue Feb 26, 2015

Reject routes containing an unescaped space character expressjs/express#2511

Open

blakeembrey mentioned this issue Mar 2, 2015

Deprecate usage of white-space character in path #41

Closed

blakeembrey mentioned this issue Mar 3, 2015

Support unicode URLs pillarjs/router#15

Closed

gertsonderby mentioned this issue Sep 7, 2015

Fix url decoding visionmedia/page.js#309

Closed

taion mentioned this issue Apr 14, 2016

Implement Cyrillic route paths remix-run/react-router#3317

Closed

nicooprat mentioned this issue Jul 26, 2016

Support for special characters in FlowRouter kadirahq/flow-router#599

Open

pshrmn mentioned this issue Jan 3, 2017

[Bug] Unicode routes remix-run/react-router#4319

Closed

blakeembrey added the discuss label Oct 14, 2017

blakeembrey closed this as completed Nov 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode Character Expansion #42

Unicode Character Expansion #42

blakeembrey commented Feb 26, 2015

dougwilson commented Feb 26, 2015

blakeembrey commented Feb 26, 2015

dougwilson commented Feb 26, 2015

blakeembrey commented Feb 26, 2015

blakeembrey commented Feb 28, 2015

dougwilson commented Feb 28, 2015

bajtos commented Mar 2, 2015

blakeembrey commented Mar 3, 2015

nicooprat commented Jul 26, 2016

blakeembrey commented Aug 22, 2017

blakeembrey commented Aug 22, 2017 •

edited

Loading

blakeembrey commented Aug 22, 2017 •

edited

Loading

blakeembrey commented Nov 12, 2019

blakeembrey commented Nov 12, 2019

Unicode Character Expansion #42

Unicode Character Expansion #42

Comments

blakeembrey commented Feb 26, 2015

dougwilson commented Feb 26, 2015

blakeembrey commented Feb 26, 2015

dougwilson commented Feb 26, 2015

blakeembrey commented Feb 26, 2015

blakeembrey commented Feb 28, 2015

dougwilson commented Feb 28, 2015

bajtos commented Mar 2, 2015

blakeembrey commented Mar 3, 2015

nicooprat commented Jul 26, 2016

blakeembrey commented Aug 22, 2017

blakeembrey commented Aug 22, 2017 • edited Loading

blakeembrey commented Aug 22, 2017 • edited Loading

blakeembrey commented Nov 12, 2019

blakeembrey commented Nov 12, 2019

blakeembrey commented Aug 22, 2017 •

edited

Loading

blakeembrey commented Aug 22, 2017 •

edited

Loading