Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added more supporting methods for URN paths. #201

Merged
merged 6 commits into from
Mar 31, 2015

Conversation

hardlyknowem
Copy link
Contributor

First, I apologize if this should really be an issue instead of a PR, so that the solution could be properly planned in advance.

I work on a product is required to accept URIs which may or may not be properly encoded and reformat them as properly-encoded URIs. This is a hard problem to solve on our own, and so URI.js has been a godsend. Unfortunately, the most common type of URI we handle is a URN; and URI.js does not currently have anything like recodePath for URNs. This patch implements a recodeURNPath method for URNs and calls it while normalizing a URN path.

This PR might carry a risk with it, unfortunately. There is some equivocation in the documentation about the difference between URNs and URIs, which is reflected in the overall structure of URI.js:

URLs are used to address the individual resources of your website. URNs are usually used for hooking into other applications, as mailto:, magnet: or spotify: suggest. While RFC 3986 defines the structure of an URL in depth, URNs are not. The structure (and meaning) of URNs are up to their distinct specifications.

Technically, URN refers to a very specific thing: Uniform Resource Names as defined by RFC 2141 (the syntactic requirements of which are implemented in this patch). Though there is some degree of freedom to the structure of URNs in a specific namespace, they cannot break the rules laid out in 2141 (such as which characters are valid in a URN). In addition, this means that mailto: and spotify: URIs are not actually URNs at all (and in fact, the RFC defining mailto: scheme, RFC 2368, calls it a URL).

I think URI.js is drawing its inspiration from this paragraph in RFC 3986 (which is mentioned in the quoted documentation above):

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). The term "Uniform Resource Name" (URN) has been used historically to refer to both URIs under the "urn" scheme [RFC2141], which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.

The problem here is that even this is still a purely semantic distinction, while the distinction between URNs and URIs as implemented in URI.js is syntactic: URLs are anything that looks like protocol://user:password@hostname/path/with/slashes?query=string and URNs are anything that looks like scheme:path:with:colons?query=string.

All this is to say: introducing the syntactic requirements to normalize real URNs as defined in RFC 2141 may or may not actually apply to the things URI.js treats as URNs.

That said, in practice I think the risk isn't very significant. For one, users of URN-ish URIs already could not rely on URI.js for proper normalization of paths, and so I see no harm in URI.js applying its best effort. If a consumer needs a particular kind of syntax for their custom URI scheme, they always have had to implement that themselves, and so nothing changes here.

Use case: I work on some software that needs to accept URNs that may be
improperly formatted and encode them properly. This patch introduces a new
method, `recodeURNPath`, that will break the URN's path into segments and
make sure that each part of it is encoded properly.

Risks: URI.js treats "URN" as a catch-all category for everything that doesn't
look like an HTTP URL. Many of these schemes may or may not follow the URN
syntactical rules.
@rodneyrehm
Copy link
Member

I think this is the single best PR I've been sent on any of my projects so far. Thank you.

I agree with your assessments, thank you for the explanation. I have indeed considered the distinction between URLs and URNs to be syntactic. If mailto: et al are not URNs (like I assumed) we may want to discuss what other parts of URI.js need changeing. It's very likely I did not dig deep enough into what makes URNs tick - at least that's what I think when reading that : as path delim is semantics I imposed.

// for usage in a URN. RFC2141 also calls out "-", ".", and "_" as acceptable characters, but
// these aren't encoded by encodeURIComponent, so we don't have to call them out here. Also
// note that the colon character is not featured in the encoding map; this is because URI.js
// gives the colons in URNs semantic meaning as the delimiters of path segements, and so it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If : is neither the URN path delimiter, nor the "industry default path delimiter", we need to talk about what is.

@hardlyknowem
Copy link
Contributor Author

So before I address any specific comments, I think it might be better to have the discussion about how much more would need to change—just so we don't do any work that gets obsoleted soon.

Personally, I think the distinction URI.js makes right now between URIs and URLs is fine, so long as it is explained in the docs and perhaps given a different name. There are some URI schemes out there that aren't technically URNs in the full RFC 2141 sense, but they are used to name things rather than locate them (and so they would fall under the "historical usage" talked about in RFC 3986). More importantly, many of these look a whole lot like URNs: see the DOI scheme (e.g., doi:10.1000/182).

So, in practice, I think it's perfectly valid to have URI.js distinguish between two kinds of URIs: those that look like HTTP URLs (protocol:// with '/' as a path separator) and those that look like URNs (scheme:, with ':' as a path separator, more on that below), which URI.js already has. It's not technically right, but it can't be the responsibility of URI.js to accommodate every possible URI scheme. Most URIs are going to fall into those two camps.

The change I might suggest instead is to change the documentation so that the distinction is made between URN-ish and URL-ish URIs.

As for the URN syntax and the use of the colon: According to the RFC, URNs have the following syntax: urn:<nid>:<nss>, where <nid> is the "Namespace Identifier" and <nss> is the "Namespace-Specific String." And so, the colon does have special syntactic meaning for URNs at least once. The place you could get into trouble is in the namespace-specific string.

I just went and skimmed the RFCs of the first twenty formal URN namespaces registered here and it appears that every single one of them uses the colon as something like a path separator. So I think for all practical purposes URI.js is totally valid in giving the colon character special significance as a path separator. Again, there's no way that URI.js can accommodate all possible URI syntaxes (and all possible URN namespace-specific syntaxes), so I think practicality can win the day here.

So, in conclusion, I think that after cleaning this current PR up, the only thing that would really need to change is the documentation, to make it clearer what URI.js really means by "URN". If you agree with that assessment, I'll go ahead and start cleaning up this PR.

@rodneyrehm
Copy link
Member

… it can't be the responsibility of URI.js to accommodate every possible URI scheme

well. You should be able to work with every scheme. But that doesn't mean that stuff like .segment() should work for every scheme (unless we make the separator a scheme-configurable thing, with : being the default).

If you agree with that assessment, I'll go ahead and start cleaning up this PR.

I do, for better or worse. Do you also want take a shot at the docs (may be a second PR if you wish)?

@hardlyknowem
Copy link
Contributor Author

I'll take a shot at the docs, but it may or may not be this PR. Let me figure out how hard it would be to do that.

Matthew Lefavor added 3 commits March 27, 2015 14:48
My local changes created a bug that caused an exception to be thrown during the `normalize` call of
these methods, which then caused later tests to fail because the wrong decoding/encoding function
was being used.
@hardlyknowem
Copy link
Contributor Author

Alright, I've responded to the comments and made the changes to the docs. Let me know if you'd like other things changed.

@rodneyrehm
Copy link
Member

great stuff!

What resources besides RFC 2141, the list of official URNs, RFC 2368 (other RFC for popular schemes?) do you think should go into the Readme.md?

@hardlyknowem
Copy link
Contributor Author

Thanks! I'll think on that last question. I'll address the new comments (as well as compile a list of resources) over the weekend.

1. Used more meaningful tags in the documentation.
2. method names with "URN" in them have been camel-cased to use "Urn" instead.
3. A list of resources and a changelog has been added to the README.
@hardlyknowem
Copy link
Contributor Author

Responded to all above comments. Do you want the above work to be rebased/squashed into a single commit?

@@ -243,6 +246,7 @@ URI.js is published under the [MIT license](http://www.opensource.org/licenses/m

### master (will become 1.15.0)

* URNs are now normalized based on the syntax given by [RFC 2141](https://www.ietf.org/rfc/rfc2141.txt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proper format and content would be:

@rodneyrehm
Copy link
Member

nice!

Responded to all above comments. Do you want the above work to be rebased/squashed into a single commit?

I can't say I have an opinion on that. But sure, you can squash the commits if you want to.

@hardlyknowem
Copy link
Contributor Author

README file updated.

I won't bother with squashing, since there are some comments on commits. (Rebasing changes commit hashes, which means the comments become hard to locate.)

rodneyrehm added a commit that referenced this pull request Mar 31, 2015
Added more supporting methods for URN paths.
@rodneyrehm rodneyrehm merged commit 55d5a98 into medialize:master Mar 31, 2015
@rodneyrehm
Copy link
Member

thank you!

@rodneyrehm
Copy link
Member

released v1.15.0, thank you for your support!

@hardlyknowem
Copy link
Contributor Author

Thanks for including it!

@rodneyrehm
Copy link
Member

Hey @mlefoster , follow-up question: @Munter has created a repository identifying various official iana and unoffical schemes. By itself I don't see how it could be helpful to URI.js in providing better URN support. The question is what data would we want to add to scheme in order to accomplish anything significant? Or is there nothing left to be improved/simplified?

@Munter
Copy link

Munter commented Apr 10, 2015

I can see that you have an ad-hoc scheme to default port mapping in URI.js. That part might be nice to add to the schemes data object, if there are official resources I can scrape for them

@hardlyknowem
Copy link
Contributor Author

For my own part, I mainly use URI.js as a syntactical helper (in particular, normalizing and sanitizing bad inputs, particularly with respect to percent-encoding). From that perspective, the two things I can think of would be 1) a better mapping of schemes to default ports for those URIs that have ports and 2) maybe a mapping of schemes to their general syntax: whether the scheme supports a query string, what the path separator is, etc.

The former is something URI.js already supports; it's just a matter of filling it out more if appropriate. As for 2), the question is whether that's practical. Are there URI schemes that fall outside the general HTTP-like and URN-like dichotomy? And if there are, are those schemes common enough that making those changes would be worth it? Are there enough handwritten/poorly-formatted URIs of those schemes floating around that people have a need to normalize them?

@rodneyrehm
Copy link
Member

Are there enough handwritten/poorly-formatted URIs of those schemes floating around that people have a need to normalize them?

I don't think normalizing is the main feature to go for. Could a proper scheme mapping (whatever that would look like) support semantics of a particular scheme? Think about mailto:<address>?subject=&body&… provided a map { address: …, subject: …, body: … } and thus simpler to handle. That said, I have no idea if people actually need it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants