Handling of page-relative vs. root-relative paths #16

Closed
Markbnj opened this issue Aug 26, 2013 · 5 comments

@Markbnj

Markbnj commented Aug 26, 2013

First, thanks for your work on furl. I've found the API very useful for slicing and dicing URLs. However, I think I have found an issue with the way the class handles relative paths. I'm using 0.3.4, installed via pip. Consider the following:

>>> from furl import furl
>>> f1 = furl('http://www.domain.com/somewhere/over')
>>> f2 = furl('the/rainbow')
>>> print f2.path
/the/rainbow
>>> print f1.join(f2.url)
http://www.domain.com/the/rainbow

I think the addition of the forward slash to the path in f2 is a bug, since it turns a page-relative path into a root-relative path.

@gruns
Owner

gruns commented Aug 28, 2013

> First, thanks for your work on furl. I've found the API very useful for slicing and dicing URLs.

No - thank you for using furl.

This behavior is a result of the ambiguity of incomplete URLs. For example

>>> f = furl('the/rainbow')

is clearly a path. But what about

>>> f = furl('google.com')

Is the intended URL the path '/google.com' or the domain 'google.com/'? It's
ambiguous.

By default, furl treats ambiguous inputs as paths. Then, when a path-only furl
is serialized to a URL, it's prepended with a '/' if it doesn't start with one
already.

>>> f = furl('google.com')
>>> f.url
'/google.com'
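
For comparison, here is a minimal sketch of how Python's standard library splits the same kind of input. It uses `urllib.parse.urlsplit` (Python 3), not furl, so it only illustrates the ambiguity and does not show furl's own parsing. Without a leading `//`, the string lands in the path; with it, the string lands in the netloc.

```python
from urllib.parse import urlsplit

# No scheme and no leading '//': the whole string is treated as a path.
bare = urlsplit('google.com')
print(repr(bare.netloc), repr(bare.path))

# A leading '//' marks what follows as a network location (host).
slashed = urlsplit('//google.com')
print(repr(slashed.netloc), repr(slashed.path))
```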

This is natural because in a full URL a path cannot start without a '/'. For
example

>>> f = furl('a/path')
>>> f.host = 'google.com'

f.url should now be

>>> f.url
'google.com/a/path'

not

>>> f.url
'google.coma/path'

Note the automatically prepended '/' to 'a/path' in the final URL.

It's this automatic prepending of a '/' to path-only furls that results in the
unexpected behavior observed with furl.join().

I'll think about how this ambiguity and resultant unexpected behavior can be
mitigated.

@gruns
Owner

gruns commented Sep 20, 2013

It makes sense for path-only URLs to be prepended with a '/' when serialized to
a URL. Paths in a URL must be preceded by a '/'.

>>> f = furl('a/path')
>>> f.url
'/a/path'

I think the best course of action is to remove the invariant that URL Paths are
always absolute. URL Paths should be optionally absolute, like Fragment Paths.

>>> f = furl('a/path')
>>> f.url
'/a/path'
>>> str(f.path)
'a/path'
>>> f.path.isabsolute
False
>>> f.path.isabsolute = True
>>> str(f.path)
'/a/path'

So, if your intention is to join() a non-absolute path to a URL, like
originally proposed, you would join() with the Path object, not the
serialized URL.

>>> f1 = furl('http://www.domain.com/somewhere/over')
>>> f2 = furl('the/rainbow')
>>> f2.url
'/the/rainbow'
>>> str(f2.path)
'the/rainbow'
>>> f2.path.isabsolute
False
>>> f1.join(str(f2.path)).url
'http://www.domain.com/somewhere/over/the/rainbow'

What do you think?

@Markbnj
Author

Markbnj commented Sep 20, 2013

You get to the correct results, but I'm not a fan of f2.url producing the path with the slash prepended. First, let me challenge the statement: "Paths in a URL must be preceded by a '/'." The URL RFC explicitly allows partial URLs. Here are the W3C rules on expanding them: http://www.w3.org/Addressing/URL/4_3_Partial.html. The key point is that these partial URLs commonly appear in web pages, and users of your package will definitely be trying to parse them with it. A URL of the form given in your paraphrasing of my example, i.e. "the/rainbow", has a specific meaning within the context of a parent object, and you can't arbitrarily change that meaning by prepending a '/' to it.

In your earlier example, "google.com," this is a case of trying to help the implementer more than he or she deserves. According to all the rules of URLs that is a partial path. You and I might recognize it as a domain name and treat it specially, but there is no reason for your library to do so. In short, given a base URL and a list of URLs to be joined with it, this is what I would expect to happen:

Base URL: http://www.domain.com/somewhere/over

the/rainbow - http://www.domain.com/somewhere/over/the/rainbow
/the/rainbow - http://www.domain.com/the/rainbow
//the/rainbow - //the/rainbow
google.com - http://www.domain.com/somewhere/over/google.com

The last case may seem strange, but it is the fault of the implementer, not your library. In this case adhering to the rule and allowing things to come apart at the seams is probably the kinder way to proceed.
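
As a point of reference, the four references above can be run through the standard library's resolver. This is a sketch using `urllib.parse.urljoin` (Python 3), not furl, and the variable names are only illustrative. Note that with the base exactly as written (no trailing slash), this resolver replaces the last base segment `over` for the page-relative rows, and it reads `//the/rainbow` as a scheme-relative reference whose host is `the`, so some rows may come out differently from the list above.

```python
from urllib.parse import urljoin

# Base URL exactly as given in the comment (no trailing slash).
BASE = 'http://www.domain.com/somewhere/over'

references = ['the/rainbow', '/the/rainbow', '//the/rainbow', 'google.com']

# Resolve each reference against the base using the standard library.
resolved = {ref: urljoin(BASE, ref) for ref in references}

for ref, target in resolved.items():
    print(f'{ref:15} -> {target}')
```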

@gruns
Owner

gruns commented Sep 25, 2013

You're right: prepending '/' to non-absolute paths when they're serialized to a
URL is confusing. furl is, in effect, modifying the input data without being
instructed to do so. It's confusing if one feeds 'a/path' into furl, makes no
changes to the furl object, but doesn't get 'a/path' back out.

A strong, natural solution is the one I mentioned before: remove the invariant
that URL Paths are always absolute. Thus, the new behavior will be:

>>> f = furl('a/path')
>>> f.url
'a/path'
>>> f.path.isabsolute
False

Instead of the current behavior:

>>> f = furl('a/path')
>>> f.url
'/a/path'
>>> f.path.isabsolute
True

For the second issue, treating 'google.com' in furl('google.com') as a path, not
a domain, is already in place and will remain so. furl will not give paths that
resemble domains special treatment.

I'm leaving this ticket open until I fix the path issue. Pull requests welcome.

gruns pushed a commit that referenced this issue Oct 4, 2013
… a netloc (a username, password, host, and/or port). Fix issue #16. Thanks to Markbnj.
@gruns
Owner

gruns commented Oct 4, 2013

This issue has been fixed in furl v0.3.5. URL paths are no longer always absolute if non-empty; they're now only always absolute in the presence of a netloc (a username, password, host, and/or port).

>>> from furl import furl
>>> f = furl('/a/path')
>>> f.path.isabsolute
True
>>> f.path
Path('/a/path')
>>> f.path.isabsolute = False
>>> f.path
Path('a/path')
>>> f.host = 'arc.io'
>>> f
furl('arc.io/a/path')
>>> f.path.isabsolute
True
>>> f.path.isabsolute = True
Traceback (most recent call last):
  ...
AttributeError: Path.isabsolute is True and read-only for URLs with a netloc (a username, password, host, and/or port). A URL path must start with a '/' to separate itself from a netloc.
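
The rule described above can be sketched in a few lines. This is a simplified, hypothetical stand-in (the `SketchURL` class and its attributes are invented for illustration and are not furl's implementation): the path may be relative while there is no host, and it becomes absolute and read-only once a host is set. It models only a host, not the full netloc.

```python
class SketchURL:
    """Hypothetical model of the 'absolute only with a netloc' rule."""

    def __init__(self, path, host=''):
        self.host = host
        self.segments = [s for s in path.split('/') if s]
        self._isabsolute = path.startswith('/')

    @property
    def isabsolute(self):
        # A host forces the path to be absolute; otherwise use the stored flag.
        return bool(self.host) or self._isabsolute

    @isabsolute.setter
    def isabsolute(self, value):
        if self.host:
            raise AttributeError('isabsolute is read-only when a host is set')
        self._isabsolute = bool(value)

    @property
    def url(self):
        path = '/'.join(self.segments)
        if self.isabsolute:
            path = '/' + path
        return self.host + path


u = SketchURL('a/path')
print(u.url)          # relative path is returned unchanged
u.host = 'arc.io'
print(u.url)          # a '/' now separates host and path
```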

Your original example now works (though somewhere/over should be somewhere/over/ for
the joined path to become /somewhere/over/the/rainbow, as probably intended).

>>> f1 = furl('http://www.domain.com/somewhere/over/')
>>> f2 = furl('the/rainbow')
>>> print f2.path
the/rainbow
>>> print f1.join(f2.url)
http://www.domain.com/somewhere/over/the/rainbow
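
The effect of the trailing slash can also be seen with the standard library. The sketch below uses `urllib.parse.urljoin` (Python 3) rather than furl, and `join_below` is a hypothetical helper, not part of either library. It assumes the base has no query string or fragment, since it simply appends a '/' to the base when one is missing.

```python
from urllib.parse import urljoin


def join_below(base, reference):
    """Hypothetical helper: resolve reference as if base named a directory."""
    if not base.endswith('/'):
        base += '/'
    return urljoin(base, reference)


base = 'http://www.domain.com/somewhere/over'

# Without the trailing slash, the last segment 'over' is replaced.
print(urljoin(base, 'the/rainbow'))

# With the trailing slash added, the reference is appended below 'over/'.
print(join_below(base, 'the/rainbow'))

# A root-relative reference still resolves from the site root.
print(join_below(base, '/the/rainbow'))
```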

Upgrade to furl v0.3.5 with

pip install furl --upgrade

Thank you for bringing this issue to my attention and for your input and suggestions, Markbnj.

@gruns gruns closed this as completed Oct 4, 2013