This repository has been archived by the owner. It is now read-only.

Fix for url.parse() leaving trailing ":" on the protocol/scheme #1580

Closed

Before:
% node -e 'require("url").parse("http://www.google.com/").protocol'
http:

After:
% node -e 'require("url").parse("http://www.google.com/").protocol'
http

Scratch that, I'll update all the tests, too.

isaacs commented Aug 23, 2011

Why?

isaacs closed this Aug 23, 2011

Bad day at the office? Or is today "close-without-reason Tuesday"? I'll resubmit tomorrow - hopefully with better luck. Perhaps there's good science to be had in such experiments!

In the meantime, you can mend your possibly bad day with some UCB skits, like this one: http://www.ucbcomedy.com/videos/play/6904/who-poisoned-whose-tea

isaacs commented Aug 24, 2011

Sorry, my terseness was not coming from any sort of disrespect or grumpiness. What I mean is, why is this a good change?

There's been zero discussion of this idea. It is purely cosmetic but will break lots of node programs. What's the benefit?

I closed the issue because, in lieu of some very compelling reason, there's no way this is happening. There's a "reopen" button if such reason can be found, but otherwise, it may as well not take up space on the list.

So...

Why?

Well, I filed this because I believe I found a bug. I sent it as a pull because it was an easy patch. There's been zero discussion because nobody had filed this bug before. The comment sections of github issues and pulls are a great arena for such discussion (and code review).

As for compelling reasons, here are a few:

  • python's urlparse module says "http"
  • ruby's uri lib says "http"
  • java's java.net.URL class says "http"
  • perl's URI module says "http"
  • url.js is nowhere near RFC1738, and that sucks (X)

Code samples for each of 4 languages mentioned above:

% python -c 'import urlparse; print urlparse.urlparse("http://www.google.com").scheme'
http
% ruby -ruri -e 'puts URI.parse("http://www.google.com/").scheme'
http
% jruby -rjava -e 'puts java.net.URL.new("http://www.google.com/").getProtocol'
http
% perl -mURI -le 'print URI->new("http://www.google.com")->scheme'
http

(X) url.js is pretty broken with respect to RFC1738 and real-world url stuff, but much of that is out of scope for this issue.

Specific snippets of grammar from RFC1738:

 ; The generic form of a URL is:
genericurl     = scheme ":" schemepart

; the scheme is in lower case; interpreters should use case-ignore
scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]
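
The scheme production above translates fairly directly into a regular expression. A minimal sketch, assuming a hypothetical helper named `isRfc1738Scheme` (it is not part of url.js); it lowercases its input because the RFC asks interpreters to ignore case:

```javascript
// Sketch of the RFC 1738 scheme production quoted above:
//   scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
// isRfc1738Scheme is a hypothetical helper, not part of node's url.js.
const RFC1738_SCHEME = /^[a-z0-9+.-]+$/;

function isRfc1738Scheme(candidate) {
  // The RFC asks interpreters to ignore case, so normalize before matching.
  return RFC1738_SCHEME.test(String(candidate).toLowerCase());
}

console.log(isRfc1738Scheme('http'));    // true
console.log(isRfc1738Scheme('svn+ssh')); // true
console.log(isRfc1738Scheme('http:'));   // false, ":" is not in the grammar
```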

"scheme" above is what url.js calls "protocol" - note how the colon isn't part of the scheme. The scheme grammar includes no allowance for colon characters.

chjj commented Aug 24, 2011

To be fair, I can think of one URL parser that does keep the trailing colon: the browser.

F12 + window.location.protocol
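
The same convention can be checked outside a browser console. A small sketch using the WHATWG `URL` class, which is a global in current Node releases and in browsers (it did not exist in node when this thread was written):

```javascript
// The WHATWG URL class keeps the trailing colon on protocol,
// just like window.location.protocol in the browser.
const parsed = new URL('http://www.google.com/');

console.log(parsed.protocol);              // "http:"
console.log(parsed.protocol.slice(0, -1)); // "http"
```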

isaacs commented Aug 24, 2011

Node's url.js borrows its naming conventions from the location object in the browser, extended (but not changed) in the following ways:

  • add query, since that's such a common use case, and we have a querystring parser as well, so it's really easy
  • add auth, and special handling for mailto:, file:, javascript: and a few others, since node sees these, but client-side JS never does
  • lastly, parser is designed to handle a "path-only" url, as is most commonly found on HTTP requests.

The browser is the reference implementation as far as url parsing and resolving is concerned. People who aren't familiar with every relevant RFC (ie, almost everyone) expect it to work the same.
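
The extensions listed above can be seen with the legacy parser itself. This is only an illustrative sketch: the example URLs are made up, and newer Node releases document `url.parse()` as deprecated in favor of the WHATWG `URL` class, although it is still available:

```javascript
// Legacy parser discussed in this thread (still shipped, but deprecated
// in newer Node releases).
const url = require('url');

// A "path-only" URL, as found on the request line of an HTTP request.
// Passing true as the second argument also parses the query string.
const pathOnly = url.parse('/search?q=colon', true);
console.log(pathOnly.protocol); // null
console.log(pathOnly.pathname); // "/search"
console.log(pathOnly.query.q);  // "colon"

// A full URL with credentials: auth is split out, protocol keeps its colon.
const full = url.parse('http://user:pass@example.com:8080/p/a/t/h?x=1#hash');
console.log(full.protocol); // "http:"
console.log(full.auth);     // "user:pass"
console.log(full.host);     // "example.com:8080"
```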

slaskis commented Aug 24, 2011

@jordansissel I found this a bit strange as well and it made me write https://github.com/publicclass/addressable which is basically Ruby's extended URI gem "addressable" for js. It parses urls closer to the RFC with some extra features found in the ruby gem...

Even if you're ignoring everything else, and only using 'the browser' as your specification/reference, url.js still falls down pretty hard.

Further, I'm not sure what you're calling "the browser" (which browser? which version?). For example, Google Chrome 13 fails to recognize "svn+ssh://foo.com/" as a valid URL, but Firefox 4.0.1 does fine. Additionally, url.js doesn't properly parse data urls, while most modern browsers seem to handle them fine. So is url.js really based on a browser, or was it based on some fantasy of what some browser somewhere at some time might maybe have done?

Whatever happens, it would be nice if folks like @slaskis didn't have to look at the core / standard library of a given tool and go "that's mad broken, dog" and have to fix a standard, broken thing by making a new third-party thing.

(edit: google chrome might recognize svn+ssh://, but the behavior of unknown url schemes seems to be to search them instead of saying 'i don't know what this is')
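
For comparison, here is how the two cases mentioned above come out of the WHATWG `URL` class in current Node releases and browsers. This is a sketch of today's standardized behaviour, not of url.js or the browsers of 2011, and the example URLs are made up:

```javascript
// A scheme containing "+" parses, and protocol still carries the colon.
const svn = new URL('svn+ssh://foo.com/repo');
console.log(svn.protocol); // "svn+ssh:"
console.log(svn.hostname); // "foo.com"

// A data URL has no host; everything after the scheme lands in pathname.
const data = new URL('data:text/plain,hello');
console.log(data.protocol); // "data:"
console.log(data.pathname); // "text/plain,hello"
```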

chjj commented Aug 24, 2011

If we're going to bring IETF specs into this, here is the regex that the URI RFC recommends for parsing URIs: http://tools.ietf.org/html/rfc3986#page-51 (I use a modified version of this myself; it's pretty fast and it works well.)

/^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/

Which for the following URL:

http://www.ics.uci.edu/pub/ietf/uri/#Related

Results in the following captures:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

They capture both the trailing colon and the protocol without it. What to do now? ;)
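
The captures above can be reproduced in node. A minimal sketch, assuming a hypothetical helper named `splitUri`; the forward slashes are escaped so that the RFC's expression is a valid JavaScript regex literal, and the component mapping (groups 2, 4, 5, 7 and 9) is the one RFC 3986 gives right after the regex:

```javascript
// RFC 3986, Appendix B, with "/" escaped for use as a JavaScript literal.
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUri is a hypothetical helper name. Every group in the regex is
// optional, so exec() always returns a match for a string input.
function splitUri(uri) {
  const m = URI_RE.exec(uri);
  return {
    scheme: m[2],    // without the trailing ":"
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const parts = splitUri('http://www.ics.uci.edu/pub/ietf/uri/#Related');
console.log(parts.scheme);    // "http"
console.log(parts.authority); // "www.ics.uci.edu"
console.log(parts.path);      // "/pub/ietf/uri/"
console.log(parts.query);     // undefined
console.log(parts.fragment);  // "Related"
```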

isaacs commented Aug 24, 2011

It'd be nice if url.parse handled + in protocols. I don't think there's an open issue on this, feel free to post one. It's not strictly allowed for hyperlinks, but as you point out, urls are more than href attributes.

I'd also like to figure out some deterministic way to handle urls that separate the hostname from a cwd-relative path with :, like ssh://foo@bar.com:some-dir. It would make it easier to parse git remote urls in npm.

It would also be nice if the return value of url.parse was an instance of a class that had a toString method which returned the formatted href, since that's the only omission from what the browser implementations provide. But that's not terribly important. (Incidentally, since new is often faster in v8 than creating an object literal, it might be a slight speed improvement, but it'd be slight, and I doubt that any node program is spending enough time parsing urls to notice.)

> Whatever happens, it would be nice if folks like @slaskis didn't have to look at the core / standard library of a given tool and go "that's mad broken, dog" and have to fix a standard, broken thing by making a new third-party thing.

It is not the intent of node to be a batteries-included platform. https://github.com/joyent/node/wiki/node-core-vs-userland

Url parsing is an extremely common task that most web servers and clients need to do. There is an existing quorum in the leading JavaScript implementations on how urls are to be parsed, so we're following it at least as closely as any browser follows any other.

> google chrome might recognize svn+ssh://

It doesn't fetch it, and hyperlinks with svn+ssh hrefs aren't followed. If you configure chrome to open a different application for this protocol, then it'll open that application.

> Further, I'm not sure what you're calling "the browser" (which browser? which version?).

window.location parsing is fairly consistent across browsers I've worked with. I originally tested url.js against Firefox 3 and 4, Chrome stable and dev (10-ish, I think?), Safari 4/Webkit nightly, and MSIE 6 and 7. Typically, Chrome and Firefox are considered authoritative in this case.

> It is not the intent of node to be a batteries-included platform. https://github.com/joyent/node/wiki/node-core-vs-userland

Foot-stomp and 'you shall not pass' heard and noted. I withdraw my patch and bug report.

isaacs commented Aug 25, 2011

> Foot-stomp and 'you shall not pass' heard and noted.

Hahah. I think @ry is the gandalf character here. You know he once removed url parsing entirely?

slaskis commented Aug 25, 2011

@chjj yep, that's the regex I use in addressable. Works a treat :)

@chjj and @isaacs: $1 only collects the scheme's trailing ":" because the grammar for "collecting-parens-but-who-cares" hadn't yet been implemented in POSIX regex :D

Read the next section of the RFC; the sentence starting "Therefore, we can determine the value of the five components as ..." is kind of important. "$1" is irrelevant. It is an ex-regex. It has perished. It does not matter.

How about the ascii-art section "Syntax Components", or section 3? Given that none of the art there notes the trailing ":" as being remotely important, how about dumping it? How about noting that the only purpose that section gives the ":" is "the character between the first part we care about and the second part we care about"? It's a string literal - a character that could, should and must be thrown away in order for code at a higher level to use the library code sensibly!

[Edited to remove alcoholic content]

Member

mikeal commented Aug 29, 2011

we need to match expectations, the expectation is that this behaves like the browser.

we live with the browser's mistakes, that's the web.

wjessop commented Aug 29, 2011

A URL parsing lib including a trailing : in the scheme part looks like a bug to me, regardless of the mistakes "the browsers" have made in the past.

yeah, that shit's fucking insane. I don't give a damn about your ideology, any ideology which leads you to write something like that has its head up its ass.

Sorry, nothing personal. Immensely useful library. I use it and love it. But you have got to be fucking kidding me.

starwed commented Aug 29, 2011

Heh, I read a blog post a few days ago about the differences between various URL parsing/slicing.

http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces

Quick reference image:
http://farm7.static.flickr.com/6203/6082913622_c953b1fc96_o.png

donpark commented Aug 29, 2011

well, I'll take practical sense over logical sense, head up its ass or not.

Um, guys. Why not just include protocol with the colon (eg "http:") and scheme without it ("http"). That way everyone wins.

jwatte commented Aug 29, 2011

I'm a reasonably experienced developer, and my expectation was not that it would include the colon.
How about adding a "scheme" property, without the colon, to the parser result?
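
What this proposal might look like in userland, as a rough sketch. `parseWithScheme` is a hypothetical helper, not an existing node API, and it leaves `protocol` untouched so existing code keeps working:

```javascript
const url = require('url');

// parseWithScheme is a hypothetical helper, not part of node's url module.
function parseWithScheme(urlString) {
  const parsed = url.parse(urlString);
  // protocol is null for path-only URLs, so guard before stripping the ":".
  parsed.scheme = parsed.protocol ? parsed.protocol.replace(/:$/, '') : null;
  return parsed;
}

const result = parseWithScheme('http://www.google.com/');
console.log(result.protocol); // "http:"
console.log(result.scheme);   // "http"
```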
Also, staying compatible with all software in all versions is how you end up with technologies like Windows ME, or PHP. Successful, but painful. Is that the goal of Node?

isaacs commented Aug 29, 2011

@xenomonkey Adding scheme without the colon is not out of the question. But why? (Seriously. Make a compelling case beyond "ruby and python users will stop telling node users that node is unusable.")

@gilesbowkett I'm not sure who you're insulting.

@jwatte How long did it take you to adapt your program to handle the ":" in the protocol property? A whole minute? Less?

I'm really not seeing a bug here.

wjessop commented Aug 29, 2011

> But why?

@isaacs: For me @jordansissel already covered it, but to summarise, convention, standards and some sense of correctness.

@isaacs Upon reflection I think my suggestion is crap (results in bloated code for, as you say, no gain). You're quite right the only reason for doing it would be to make node more similar to other languages and slightly more compliant with the RFC. Either scheme (without the colon) or protocol (with the colon) should be supported but not both. Personally I don't care either way (I'm quite capable of adding/stripping a character from a string if I need to).

I would suspect that since node is based on javascript that you should stick with the most expected solution. For javascript programmers that would be to use protocol and leave the colon on the end.

Yeah, what the hell was @gilesbowkett on about?

Anyway, I don’t know why everybody’s flipping out about this. @isaacs’ point is perfectly appropriate: Node is JavaScript; Node is intended to parallel the browser; Node is intended to keep everything in userspace code. There is no real argument for the change, except that “some other languages do it this way.” What browsers do ≥ what other languages do, because Node is intended to ape the browser, not those other languages.

> But why? (Seriously. Make a compelling case beyond "ruby and python users will stop telling node users that node is unusable.")

Because asking for a common nomenclature is silly, right?

Your reply sounds quite like troll baiting. You asked for a compelling case and were provided data. You disregarded that data and revised your response, "Make a compelling case beyond [all data provided]". If more data is provided, you'll simply repeat the same pattern above, right?

I would recommend folks stop updating this issue, since you'll likely just get stuck in a loop of providing data and having the data be discarded. I'd delete it if I could find such a button on github.

I wasn't looking for flames or fanaticism. I just wanted a bug fixed.

> Node is intended to parallel the browser

Other than Node and browsers both using Javascript, I've never realized that there is supposed to be a parallel. Is there?

chjj commented Aug 29, 2011

Is it me, or is this argument over a single colon?

haikusw commented Aug 29, 2011

I'm confused why the browser behavior would be the source of expectations for server side development...

My vote is that url parsing should match the RFC (that's what the RFCs are for - to define "correct" behavior, are they not?).

is the ":" included in the definition of "protocol/scheme" in the specification for URLs? I think the answer is "No" and if that's the case, then it shouldn't be included.

> Node is intended to parallel the browser

"Node's goal is to provide an easy way to build scalable network programs. " - http://nodejs.org/#about

I lol’d

I lol’d. @chjj++

> Is it me, or is this argument over a single colon?

@chjj: most of url.js is pretty wonky as I stated earlier. This patch was sort of a water-testing attempt to see if it was worth fixing it up and sending a bigger patch set. Nodejs has a strong history of backwards-incompatibility and experimentation between stable releases (see 0.2 vs 0.4) so I figured it was worth some effort to fix this library.

blarg! garrrfulnubb. blarp

foca commented Aug 29, 2011

Don't fight, you guys

we have hit the meme gif event horizon

Member

mikeal commented Aug 30, 2011

the browser does not provide a module system or a streaming HTTP client, if it did we might have used the same API.

tj commented Aug 30, 2011

on() vs addEventListener(). go!

aseemk commented Aug 30, 2011

Another option that comes to mind is to explicitly differentiate between Location objects (matching the browser spec) and URI objects (matching the various RFCs and all other server-side platforms). E.g. require('location').parse(...) vs. require('uri').parse(...). Perhaps the 'url' module could be deprecated, mapping to 'location' for the transition period. This might be overkill, but might be nice.

Not to incite things any further here, but @mikeal and @isaacs what would be the harm in adding convenience properties for .scheme and .fragment? I'm not sure that in all of the madness of this thread that there was a direct answer to that question. In fact I believe that @isaacs was the first to even mention that as a possibility (just needed a "reason"). If you can see through all the flames, I believe this thread itself is a reason; there is obvious passion in the Node.js community on even this small issue.

I believe everyone here cares for Node and appreciates the hard work that has gone into it. Part of the heat in this thread may simply be that many of us resent the terrible APIs we've had to deal with for years and are just afraid to see the same thing server-side. Like it or not, Node (along with several other recent projects) represents more than just a great platform or a way to take the language we love and use it outside of the browser; it represents the hope of the community to change Javascript for the better, to not be stuck by antiquated mistakes of 10yr old browsers and hasty decisions from its birth; it represents the best parts of the Javascript language, not the annoyances; it represents the future of JS, not the past.

Node.js is still in developmental stages; the api is still going through growing pains. I can't imagine there would be any significant performance overhead or development work. The Node.js community has spoken here and exclaimed "Terseness!" and "Compatibility!", so why not give them both?

(If you really care about extraneous API properties long-term, you could always take a look a few months from now and see what the majority of github projects wind up using... and then pick one or the other for v1).

donpark commented Aug 30, 2011

Guys. Please stop. We should stop at providing feedback, and I think we've gone past that. If coercion is your intention, please fork. Otherwise, stop here and let them decide.

Member

bmeck commented Aug 30, 2011

tl;dr

If it is Javascript it should look and act like Javascript.
http://msdn.microsoft.com/en-us/library/ms534353(v=vs.85).aspx
https://developer.mozilla.org/En/Window.location

I certainly hate that I have to have a colon when I test something, oh god, the humanity of adding a ':' to act like all other popular implementations of the language. Oh god, the horror, oh god how will I ever concatenate this into a new url when I can't see the ':' being added back manually. Save me from the insanity of history and stuff that existed prior to my other favorite language, why can't protocol be called scheme, why can't scheme be added onto an object that already works, why don't we split up the url and argue about when pathname is actually a hostname, it's just not acceptable that my input is not correct, I put a hostname in there, why is it being called a pathname! Having this much argumentation from this many people means nothing, my way is the best! X doesn't agree with me therefore I shall ignore his questions about both sides.

isaacs commented Aug 30, 2011

@bmeck You can make a starting // imply a hostname by passing "true" as the third argument.
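
A short sketch of the third argument mentioned here, using the `//foo/bar` example from node's own url documentation. The legacy signature is `url.parse(urlString, parseQueryString, slashesDenoteHost)`:

```javascript
const url = require('url');

// Default: a leading "//" is treated as part of the path.
const asPath = url.parse('//foo/bar');
console.log(asPath.host);     // null
console.log(asPath.pathname); // "//foo/bar"

// With slashesDenoteHost = true, the token after "//" becomes the host.
const asHost = url.parse('//foo/bar', false, true);
console.log(asHost.host);     // "foo"
console.log(asHost.pathname); // "/bar"
```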

The only harm in adding .scheme and .fragment would be the implied message that the best way to effect a change to node's API is to get twitter and hacker news involved. This isn't a democracy; more votes don't matter. And this thread is completely spoiled at this point.

Start a discussion on the dev mailing list. Present a use-case. If it's reasonable, and useful enough to justify increasing the API surface for it, and doesn't negatively affect other uses or performance, we'll do it.

detro commented Sep 2, 2011

Ehm... add an extra property to the result of the ".parse" call to hold the "protocol without colon" result?

It will not break anyone's code and will add the "handy" feature (not that a string truncation is THAT difficult anyway).

liz-mars commented Sep 2, 2011

I agree that we should try to stick to the RFCs, and that the current implementation deviates. However, I also agree that we should make sure we stick close to how JS is implemented in most browsers. What to do?

I thought to myself, 'How would a browser developer handle this?' And it hit me. We should have two modes of operation in node. The first -- we can call it 'quirks Node' -- will be the default, and it will maintain backward compatibility with existing Node APIs. The second, 'standards Node', will activate based on a complicated heuristic that takes into account how well the code is commented, whether it's composed in ASCII or UTF-8, and the current output of /dev/random.

To make it simple for developers, Standards Node will always track the latest trends in development, and will change frequently.

Tongue firmly in cheek,
~Beth, who is very happy that node exists at all, and that so many people care so much about it so as to argue for hours over a detail as small as a colon.

isaacs commented Sep 2, 2011

@paulbjensen wins 9000 internets!

Member

mikeal commented Sep 2, 2011

I HAVE AN OPINION!

tj commented Sep 2, 2011

MOAR COLONS

chjj commented Sep 2, 2011

What are colons? Are they webscale? 9000 internets is not enough for webscale.

I: think: this: issue: has: not: received: enough: attention::

Too: many: developers: have: been: failing: to: end: every: string: with: a: colon::

BroDotJS commented Sep 2, 2011

Losers always whine about RFCs and back compat. Bros go home with the prom queen:

require.colonsblow = function (moduleName) {
    var mod = require(moduleName);
    var k, fn;

    function colonic(x) {
        return (typeof x === 'string' && x.slice(-1) === ':') ? x.slice(0, -1) : x;
    }

    function wrap(fn) {
        return function () {
            var result = fn.apply(this, arguments);
            var k;

            if (result instanceof Array) {
                result = result.map(function (v, i, a) {
                    return colonic(v);
                });
            } else if (result && typeof result === 'object') {
                for (k in result) {
                    if (result.hasOwnProperty(k)) {
                        result[k] = colonic(result[k]);
                    }
                }
            }

            return colonic(result);
        };
    }

    for (k in mod) {
        if (mod.hasOwnProperty(k) && typeof mod[k] === 'function') {
            mod[k] = wrap(mod[k]);
        }
    }

    return mod;
};

var ballinURLz = require.colonsblow('url');

console.log(ballinURLz.parse('http://your.mom.com/so/fat'));

Doneski. This bro is headed to Twin Peaks for shots and steaks. Who's in? No nerds.

Marak commented Sep 2, 2011

Isaacs was the prom queen.

<ref>The Rock</ref>

chjj commented Sep 2, 2011

The Rock

Ah, beat me to it.

tj commented Sep 2, 2011

Marak commented Sep 2, 2011

sbussard commented Sep 2, 2011

+1 for getting rid of useless chars

Compatibility with non-standard APIs is stupid. 99% of the users will perform a replace() on the thing anyways, just get rid of it now.

Member

mikeal commented Sep 2, 2011

@BonsaiDen @sbussard this thread is now closed to serious comments, only jokes are allowed now

Leave : Alone

At this rate, we might as well just get rid of semicolons too

sbussard commented Sep 2, 2011

what other things end with a colon? ... :(){ :|:& };:

donpark commented Sep 3, 2011

How do I unparticipate from this 4chanish thread?

Marak commented Sep 3, 2011

Member

Qard commented Sep 3, 2011

Below the comment box. Click "Disable notifications for this Pull Request"

donpark commented Sep 3, 2011

Doh. Thx @Qard & @Marak

ghost commented Sep 2, 2012

What the hell did I just read?

johan commented Dec 8, 2012

+1 keeping protocol as-is
+1 adding colon-less scheme property

The first is useful for sharing code and APIs between front and back end code.
The second is useful because it's a reasonable expectation for those of us coming from the RFCs.
We can have both, just as we can have both .hostname and .host (with port), which are also useful.
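
The `.host` / `.hostname` precedent referred to here, as a small sketch with the legacy parser (the example URL is made up). Both properties coexist on the same result object, one with the port and one without:

```javascript
const url = require('url');

const parsed = url.parse('http://example.com:8080/docs');
console.log(parsed.host);     // "example.com:8080" (with port)
console.log(parsed.hostname); // "example.com"      (without port)
console.log(parsed.port);     // "8080"
```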

/ web front-and-back-end developer for 15 years

vicary commented Dec 8, 2012

What's the point of digging up arguments from last year? Esp. when it comes to punctuation, it's strictly personal and your opinion is not likely to work.

isaacs commented Dec 9, 2012

@github Can you please please give us the ability to close comments on issues after some period of time? This is a textbook example of a thing that is well beyond the point where any good can possibly come from additional conversation.

rubys pushed a commit to webspecs/url that referenced this pull request Nov 27, 2014

Remove a note per @zcorpan and add an exciting logo!
Background behind the colon is nodejs/node-v0.x-archive#1580
(be sure to read it all the way)

KenanY referenced this pull request in nodejs/node Mar 22, 2015

Closed

lib: remove `:` from protocol in Url.parse(). #1237

@github -1 on the ability to close comments after some period of time. As a firm believer in free speech, we should not let people stifle constructive conversation.

I've been a fan of using github as a source of high quality entertainment for over 5 years, you would be doing your key demographic a disservice in implementing said feature.

Also I feel like @gilesbowkett comments were on point.

thanks guys for going so deep into this colon issue. the library really needed some cleansing. felt like it was getting full of waste.
those colons can be a real pain in the you-know-what! too bad that this thread had to go down that dark hole.

bascht referenced this pull request in fhemberger/good-logstash Apr 27, 2016

Merged

Supply full URL instead of param bits. #1
