Added more supporting methods for URN paths. #201

hardlyknowem · 2015-03-25T17:39:32Z

First, I apologize if this should really be an issue instead of a PR, so that the solution could be properly planned in advance.

I work on a product is required to accept URIs which may or may not be properly encoded and reformat them as properly-encoded URIs. This is a hard problem to solve on our own, and so URI.js has been a godsend. Unfortunately, the most common type of URI we handle is a URN; and URI.js does not currently have anything like recodePath for URNs. This patch implements a recodeURNPath method for URNs and calls it while normalizing a URN path.

This PR might carry a risk with it, unfortunately. There is some equivocation in the documentation about the difference between URNs and URIs, which is reflected in the overall structure of URI.js:

URLs are used to address the individual resources of your website. URNs are usually used for hooking into other applications, as mailto:, magnet: or spotify: suggest. While RFC 3986 defines the structure of an URL in depth, URNs are not. The structure (and meaning) of URNs are up to their distinct specifications.

Technically, URN refers to a very specific thing: Uniform Resource Names as defined by RFC 2141 (the syntactic requirements of which are implemented in this patch). Though there is some degree of freedom to the structure of URNs in a specific namespace, they cannot break the rules laid out in 2141 (such as which characters are valid in a URN). In addition, this means that mailto: and spotify: URIs are not actually URNs at all (and in fact, the RFC defining mailto: scheme, RFC 2368, calls it a URL).

I think URI.js is drawing its inspiration from this paragraph in RFC 3986 (which is mentioned in the quoted documentation above):

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). The term "Uniform Resource Name" (URN) has been used historically to refer to both URIs under the "urn" scheme [RFC2141], which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.

The problem here is that even this is still a purely semantic distinction, while the distinction between URNs and URIs as implemented in URI.js is syntactic: URLs are anything that looks like protocol://user:password@hostname/path/with/slashes?query=string and URNs are anything that looks like scheme:path:with:colons?query=string.

All this is to say: introducing the syntactic requirements to normalize real URNs as defined in RFC 2141 may or may not actually apply to the things URI.js treats as URNs.

That said, in practice I think the risk isn't very significant. For one, users of URN-ish URIs already could not rely on URI.js for proper normalization of paths, and so I see no harm in URI.js applying its best effort. If a consumer needs a particular kind of syntax for their custom URI scheme, they always have had to implement that themselves, and so nothing changes here.

Use case: I work on some software that needs to accept URNs that may be improperly formatted and encode them properly. This patch introduces a new method, `recodeURNPath`, that will break the URN's path into segments and make sure that each part of it is encoded properly. Risks: URI.js treats "URN" as a catch-all category for everything that doesn't look like an HTTP URL. Many of these schemes may or may not follow the URN syntactical rules.

rodneyrehm · 2015-03-26T00:08:28Z

I think this is the single best PR I've been sent on any of my projects so far. Thank you.

I agree with your assessments, thank you for the explanation. I have indeed considered the distinction between URLs and URNs to be syntactic. If mailto: et al are not URNs (like I assumed) we may want to discuss what other parts of URI.js need changeing. It's very likely I did not dig deep enough into what makes URNs tick - at least that's what I think when reading that : as path delim is semantics I imposed.

rodneyrehm · 2015-03-26T00:08:31Z

src/URI.js

+      // for usage in a URN. RFC2141 also calls out "-", ".", and "_" as acceptable characters, but
+      // these aren't encoded by encodeURIComponent, so we don't have to call them out here. Also
+      // note that the colon character is not featured in the encoding map; this is because URI.js
+      // gives the colons in URNs semantic meaning as the delimiters of path segements, and so it


If : is neither the URN path delimiter, nor the "industry default path delimiter", we need to talk about what is.

hardlyknowem · 2015-03-26T18:57:19Z

So before I address any specific comments, I think it might be better to have the discussion about how much more would need to change—just so we don't do any work that gets obsoleted soon.

Personally, I think the distinction URI.js makes right now between URIs and URLs is fine, so long as it is explained in the docs and perhaps given a different name. There are some URI schemes out there that aren't technically URNs in the full RFC 2141 sense, but they are used to name things rather than locate them (and so they would fall under the "historical usage" talked about in RFC 3986). More importantly, many of these look a whole lot like URNs: see the DOI scheme (e.g., doi:10.1000/182).

So, in practice, I think it's perfectly valid to have URI.js distinguish between two kinds of URIs: those that look like HTTP URLs (protocol:// with '/' as a path separator) and those that look like URNs (scheme:, with ':' as a path separator, more on that below), which URI.js already has. It's not technically right, but it can't be the responsibility of URI.js to accommodate every possible URI scheme. Most URIs are going to fall into those two camps.

The change I might suggest instead is to change the documentation so that the distinction is made between URN-ish and URL-ish URIs.

As for the URN syntax and the use of the colon: According to the RFC, URNs have the following syntax: urn:<nid>:<nss>, where <nid> is the "Namespace Identifier" and <nss> is the "Namespace-Specific String." And so, the colon does have special syntactic meaning for URNs at least once. The place you could get into trouble is in the namespace-specific string.

I just went and skimmed the RFCs of the first twenty formal URN namespaces registered here and it appears that every single one of them uses the colon as something like a path separator. So I think for all practical purposes URI.js is totally valid in giving the colon character special significance as a path separator. Again, there's no way that URI.js can accommodate all possible URI syntaxes (and all possible URN namespace-specific syntaxes), so I think practicality can win the day here.

So, in conclusion, I think that after cleaning this current PR up, the only thing that would really need to change is the documentation, to make it clearer what URI.js really means by "URN". If you agree with that assessment, I'll go ahead and start cleaning up this PR.

rodneyrehm · 2015-03-26T21:12:17Z

… it can't be the responsibility of URI.js to accommodate every possible URI scheme

well. You should be able to work with every scheme. But that doesn't mean that stuff like .segment() should work for every scheme (unless we make the separator a scheme-configurable thing, with : being the default).

If you agree with that assessment, I'll go ahead and start cleaning up this PR.

I do, for better or worse. Do you also want take a shot at the docs (may be a second PR if you wish)?

hardlyknowem · 2015-03-27T17:58:32Z

I'll take a shot at the docs, but it may or may not be this PR. Let me figure out how hard it would be to do that.

My local changes created a bug that caused an exception to be thrown during the `normalize` call of these methods, which then caused later tests to fail because the wrong decoding/encoding function was being used.

…and URNs.

hardlyknowem · 2015-03-27T22:05:48Z

Alright, I've responded to the comments and made the changes to the docs. Let me know if you'd like other things changed.

rodneyrehm · 2015-03-27T22:36:02Z

great stuff!

What resources besides RFC 2141, the list of official URNs, RFC 2368 (other RFC for popular schemes?) do you think should go into the Readme.md?

hardlyknowem · 2015-03-27T23:14:38Z

Thanks! I'll think on that last question. I'll address the new comments (as well as compile a list of resources) over the weekend.

1. Used more meaningful tags in the documentation. 2. method names with "URN" in them have been camel-cased to use "Urn" instead. 3. A list of resources and a changelog has been added to the README.

hardlyknowem · 2015-03-30T16:01:22Z

Responded to all above comments. Do you want the above work to be rebased/squashed into a single commit?

rodneyrehm · 2015-03-31T18:14:19Z

README.md

@@ -243,6 +246,7 @@ URI.js is published under the [MIT license](http://www.opensource.org/licenses/m

 ### master (will become 1.15.0)

+* URNs are now normalized based on the syntax given by [RFC 2141](https://www.ietf.org/rfc/rfc2141.txt)


proper format and content would be:

fixed .pathname() to properly en/decode URN paths - (Issue #201, mlefoster)

fixing URI normalization to properly handle URN paths based on RFC 2141 syntax - (Issue #201, mlefoster)

fixed .normalize() and .normalizePath() to properly normalize URN paths

added URI.encodeUrnPathSegment()

added URI.decodeUrnPathSegment()

added URI.decodeUrnPath()

added URI.recodeUrnPath()

rodneyrehm · 2015-03-31T18:14:46Z

nice!

Responded to all above comments. Do you want the above work to be rebased/squashed into a single commit?

I can't say I have an opinion on that. But sure, you can squash the commits if you want to.

hardlyknowem · 2015-03-31T20:30:19Z

README file updated.

I won't bother with squashing, since there are some comments on commits. (Rebasing changes commit hashes, which means the comments become hard to locate.)

Added more supporting methods for URN paths.

rodneyrehm · 2015-03-31T22:01:05Z

thank you!

rodneyrehm · 2015-03-31T22:31:31Z

released v1.15.0, thank you for your support!

hardlyknowem · 2015-04-01T23:03:34Z

Thanks for including it!

rodneyrehm · 2015-04-10T12:07:10Z

Hey @mlefoster , follow-up question: @Munter has created a repository identifying various official iana and unoffical schemes. By itself I don't see how it could be helpful to URI.js in providing better URN support. The question is what data would we want to add to scheme in order to accomplish anything significant? Or is there nothing left to be improved/simplified?

Munter · 2015-04-10T12:12:44Z

I can see that you have an ad-hoc scheme to default port mapping in URI.js. That part might be nice to add to the schemes data object, if there are official resources I can scrape for them

hardlyknowem · 2015-04-11T19:12:40Z

For my own part, I mainly use URI.js as a syntactical helper (in particular, normalizing and sanitizing bad inputs, particularly with respect to percent-encoding). From that perspective, the two things I can think of would be 1) a better mapping of schemes to default ports for those URIs that have ports and 2) maybe a mapping of schemes to their general syntax: whether the scheme supports a query string, what the path separator is, etc.

The former is something URI.js already supports; it's just a matter of filling it out more if appropriate. As for 2), the question is whether that's practical. Are there URI schemes that fall outside the general HTTP-like and URN-like dichotomy? And if there are, are those schemes common enough that making those changes would be worth it? Are there enough handwritten/poorly-formatted URIs of those schemes floating around that people have a need to normalize them?

rodneyrehm · 2015-04-11T20:31:04Z

Are there enough handwritten/poorly-formatted URIs of those schemes floating around that people have a need to normalize them?

I don't think normalizing is the main feature to go for. Could a proper scheme mapping (whatever that would look like) support semantics of a particular scheme? Think about mailto:<address>?subject=&body&… provided a map { address: …, subject: …, body: … } and thus simpler to handle. That said, I have no idea if people actually need it.

rodneyrehm reviewed Mar 26, 2015
View reviewed changes

rodneyrehm added Feature Path labels Mar 26, 2015

Matthew Lefavor added 3 commits March 27, 2015 14:48

Added finally clause to iso8859 and unicode methods.

8d68ead

My local changes created a bug that caused an exception to be thrown during the `normalize` call of these methods, which then caused later tests to fail because the wrong decoding/encoding function was being used.

Responded to PR comments.

1840662

Updated documentation to better reflect the distinction between URLs …

46b270e

…and URNs.

Responded to PR review comments (round 2).

8b057a7

1. Used more meaningful tags in the documentation. 2. method names with "URN" in them have been camel-cased to use "Urn" instead. 3. A list of resources and a changelog has been added to the README.

rodneyrehm reviewed Mar 31, 2015
View reviewed changes

Cleaned up README file.

158011d

hardlyknowem force-pushed the master branch from 4a25e5f to 158011d Compare March 31, 2015 20:27

rodneyrehm added a commit that referenced this pull request Mar 31, 2015

Merge pull request #201 from RusticiSoftware/master

55d5a98

Added more supporting methods for URN paths.

rodneyrehm merged commit 55d5a98 into medialize:master Mar 31, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added more supporting methods for URN paths. #201

Added more supporting methods for URN paths. #201

hardlyknowem commented Mar 25, 2015

rodneyrehm commented Mar 26, 2015

rodneyrehm Mar 26, 2015

hardlyknowem commented Mar 26, 2015

rodneyrehm commented Mar 26, 2015

hardlyknowem commented Mar 27, 2015

hardlyknowem commented Mar 27, 2015

rodneyrehm commented Mar 27, 2015

hardlyknowem commented Mar 27, 2015

hardlyknowem commented Mar 30, 2015

rodneyrehm Mar 31, 2015

rodneyrehm commented Mar 31, 2015

hardlyknowem commented Mar 31, 2015

rodneyrehm commented Mar 31, 2015

rodneyrehm commented Mar 31, 2015

hardlyknowem commented Apr 1, 2015

rodneyrehm commented Apr 10, 2015

Munter commented Apr 10, 2015

hardlyknowem commented Apr 11, 2015

rodneyrehm commented Apr 11, 2015

		@@ -243,6 +246,7 @@ URI.js is published under the [MIT license](http://www.opensource.org/licenses/m

		### master (will become 1.15.0)

		* URNs are now normalized based on the syntax given by [RFC 2141](https://www.ietf.org/rfc/rfc2141.txt)

Added more supporting methods for URN paths. #201

Added more supporting methods for URN paths. #201

Conversation

hardlyknowem commented Mar 25, 2015

rodneyrehm commented Mar 26, 2015

rodneyrehm Mar 26, 2015

Choose a reason for hiding this comment

hardlyknowem commented Mar 26, 2015

rodneyrehm commented Mar 26, 2015

hardlyknowem commented Mar 27, 2015

hardlyknowem commented Mar 27, 2015

rodneyrehm commented Mar 27, 2015

hardlyknowem commented Mar 27, 2015

hardlyknowem commented Mar 30, 2015

rodneyrehm Mar 31, 2015

Choose a reason for hiding this comment

rodneyrehm commented Mar 31, 2015

hardlyknowem commented Mar 31, 2015

rodneyrehm commented Mar 31, 2015

rodneyrehm commented Mar 31, 2015

hardlyknowem commented Apr 1, 2015

rodneyrehm commented Apr 10, 2015

Munter commented Apr 10, 2015

hardlyknowem commented Apr 11, 2015

rodneyrehm commented Apr 11, 2015