requests.Session should send "Referer" headers #2079

Closed
za-creature opened this issue Jun 1, 2014 · 5 comments
@za-creature

From my understanding, the point of requests.Session is that it emulates what a dumb* browser does when asked to render a webpage. This means keeping cookies, following redirects, and, in my opinion, setting the HTTP "Referer" header.

The Session class doesn't seem to touch either the referer header or its properly spelled variant. I consider this non-pythonic because both the documentation and the "Session" class name suggest that all requests in a session are performed sequentially, and each individual request may update the state of the session by setting cookies, or by returning 3xx responses which in turn forward the session to a different URL.

As such, I propose setting a default value for the "referer" header (perhaps setting the properly spelled "referrer" header as well, for future-proofing) to the final form, after redirects, of the most recently executed request's URL. This is consistent with the way cookies are handled, in that performing a request automatically parses and updates any cookies in the jar. Since this would break backwards compatibility (for a few servers, an undefined referrer is better than a wrong one), it should probably be gated behind a Session.follow_referrer flag that defaults to False (a rough sketch of the idea follows the footnote below).

I can create a patch for this if the package maintainer agrees that this is a bug.

  • By dumb I mean a browser incapable of rendering images, executing scripts or handling stylesheets and limited to just parsing and rendering HTML, which is arguably what most robots do.
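
Roughly, the proposal could look like the following hypothetical subclass (ReferrerSession and its attributes are illustrative names only, not part of requests):

import requests

class ReferrerSession(requests.Session):
    """Sketch of the proposal: when follow_referrer is True, send the
    final URL of the previous response as the next request's Referer."""

    def __init__(self):
        super().__init__()
        self.follow_referrer = False  # off by default, as proposed
        self._last_url = None

    def request(self, method, url, **kwargs):
        if self.follow_referrer and self._last_url:
            # copy so we don't mutate a caller-supplied dict
            headers = dict(kwargs.get("headers") or {})
            headers.setdefault("Referer", self._last_url)
            kwargs["headers"] = headers
        response = super().request(method, url, **kwargs)
        # remember the final form (post-redirect) of this request's URL
        self._last_url = response.url
        return response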
@Lukasa
Member

Lukasa commented Jun 1, 2014

Thanks for raising this issue!

This is an interesting idea. The key problem is that it's difficult for requests to correctly fill out the Referer header. From RFC 2616:

The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (the "referrer", although the header field is misspelled.)

The problem is that, even with a Session, we don't know what resource provided the URI. We can guess that it might be the resource we just fetched, but we simply can't prove it. In a lot of cases it won't be, and in those cases we're not being helpful, we're just leaking user data to unrelated websites (and that's a terrible thing to do).

Browsers can do this correctly because they know when you've clicked on a link versus when you typed in the URI bar, but we don't have that luxury.

For that reason, I don't think this is a good idea. Does that seem like a convincing argument to you?

@za-creature
Author

I definitely agree with you on the "it's difficult" part. Figuring out who provided the target URI is basically guesswork, and I won't deny that. I stumbled across this because I'm implementing a crawler for a website that doesn't seem to like robots (well, it likes GoogleBot, but since they discriminate by user agent, I figured I'd just emulate a regular browser that they can't afford to ban).

You definitely can't prove that the referrer is the most recently fetched resource, and yes, in a lot of cases it won't be. This is why I was suggesting that the functionality should be implemented in an opt-in fashion (not unlike the redirect following mechanism).

As for leaking data, while that's a valid concern, this information is already exposed by virtually every browser in its default config. Whenever a webpage requests a resource (image, stylesheet, script, object; local or remote, it doesn't matter), pretty much every browser sets the referrer to the requesting webpage, without any user interaction whatsoever. I'm not 100% sure about AJAX requests, but those probably default to the URL they were loaded from as well.

So with all due respect, I have to say that no, your argument is not yet convincing enough for me to drop this.

@Lukasa
Member

Lukasa commented Jun 1, 2014

Here are my follow-ups:

This is why I was suggesting that the functionality should be implemented in an opt-in fashion (not unlike the redirect following mechanism).

Redirect-following isn't opt-in, it's opt-out. =) More generally, the requests philosophy is that either a behaviour ought to be the default (as in, generally useful), or it should be easy to achieve externally. Maintaining a referrer is very easy to do:

import requests

s = requests.Session()
headers = {}

# urls is assumed to be a list (used as a stack) of URLs to visit
while urls:
    url = urls.pop()
    r = s.get(url, headers=headers)
    # r.request.url is the final URL of the request, after any redirects
    headers['Referer'] = r.request.url

So right now the requirement is to convince me why something that's easy to do and generally not needed should be an opt-out behaviour (because we don't add opt-in behaviours).

As for leaking data, while that's a valid concern, this information is already exposed by virtually every browser in its default config. Whenever a webpage requests a resource (image, stylesheet, script, object; local or remote, it doesn't matter), pretty much every browser sets the referrer to the requesting webpage, without any user interaction whatsoever.

Right, but that wasn't my concern: that is the correct use of Referer. My concern is when you're hitting unrelated URLs, at which point you are both a) sending invalid referrers and b) effectively sending a third party your HTTP history when you should not have.

@za-creature
Author

Your provided solution is more or less what I did:

import requests

# queue is assumed to be a collections.deque of (url, data) pairs
http = requests.Session()
referrer = None  # no Referer on the very first request
while queue:
    url, data = queue.popleft()
    headers = {"Referer": referrer} if referrer else {}
    response = http.get(url, headers=headers)
    referrer = response.url  # final URL, after any redirects

I was not aware of requests' "opt-out-only" philosophy, but I can't really say that I disagree with it. Either way, it seems that my primary argument is null and void, and as such, I don't have any more objections if you want to close the issue.

As for your other point, for the sake of argument I'll say that if you're hitting unrelated URLs, you shouldn't be using sessions to begin with (at the very least, cookies will not persist across domains). If brevity is the concern, similar functionality can be achieved by combining regular requests with functools.partial, as sketched below.
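
For example, a minimal sketch of the functools.partial approach (the User-Agent value is just a placeholder):

import functools
import requests

# pin a set of default headers onto plain requests.get, without a Session
get = functools.partial(requests.get, headers={"User-Agent": "my-crawler/1.0"})

response = get("https://example.com/")

Note that, unlike a Session, partial simply replaces the headers argument outright if the caller passes its own, and there is no connection pooling.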

Anyway, I just want to say thanks for the library. While not perfect (nothing ever is, nor can any objective definition of perfection ever be stated), it's definitely the best HTTP library I've come across (pretty cool that it's greenlet-ready) and well...

Keep up the good work!

@Lukasa
Member

Lukasa commented Jun 1, 2014

No problem! Glad we could talk this out.

By the way, we totally expect you to use Session objects across domains. =) They provide centralised configuration (set headers only once, etc.) and connection pooling, so any time you're doing anything more than trivial work, we expect you to use a Session.
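
For illustration (the URLs and User-Agent value are placeholders), a minimal example of that centralised configuration:

import requests

s = requests.Session()
# set once; applies to every request made through this session
s.headers.update({"User-Agent": "my-crawler/1.0"})

# connections are pooled per host, so repeated requests are cheaper
r1 = s.get("https://example.com/")
r2 = s.get("https://example.org/")  # same defaults; cookies still won't cross domains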
