requests.Session should send "Referer" headers #2079
Comments
Thanks for raising this issue! This is an interesting idea. The key problem is that it's difficult for requests to correctly fill out the Referer header. From RFC 2616:
The problem is that, even with a Browsers can do this correctly because they know when you've clicked on a link versus when you typed in the URI bar, but we don't have that luxury. For that reason, I don't think this is a good idea. Does that seem like a convincing argument to you?
I definitely agree with you on the "it's difficult" part. Figuring out who provided the target URI is basically guesswork, and I won't deny that. I've stumbled across this because I am implementing a crawler for a website that doesn't seem to like robots (well, it likes GoogleBot, but since they're being racist about it, I figured I'll just emulate a regular browser that they can't afford to ban). You definitely can't prove that the referrer is the most recently fetched resource, and yes, in a lot of cases it won't be. This is why I was suggesting that the functionality should be implemented in an opt-in fashion (not unlike the redirect following mechanism). As for leaking data, while it's a valid concern, it's nothing that isn't already readily available in virtually every browser that uses the default config. Whenever a webpage requests a resource (image, stylesheet, script, object; doesn't matter if local or remote), pretty much every browser sets the referrer to the requesting webpage (and this is without any user interaction whatsoever). I'm not 100% sure about AJAX requests, but those probably default to the URL they were loaded from as well. So with all due respect, I have to say that no, your argument is not yet convincing enough for me to drop this.
Here are my follow-ups:
Redirect-following isn't opt-in, it's opt-out. =) More generally, the requests philosophy is that either a behaviour ought to be the default (as in, generally useful), or it should be easy to achieve externally. Maintaining a referrer is very easy to do:

    import requests

    s = requests.Session()
    headers = {}
    # urls is the list of URLs still to be fetched
    while urls:
        url = urls.pop()
        r = s.get(url, headers=headers)
        # send the URL we just fetched as the Referer of the next request
        headers['Referer'] = r.request.url

So right now the requirement is to convince me why something that's easy to do and generally not needed should be an opt-out behaviour (because we don't add opt-in behaviours).
Right, but that wasn't my concern: this is the correct use of
Your provided solution is more or less what I did:

    import requests

    http = requests.Session()
    referrer = None  # nothing to send before the first request
    # queue is a collections.deque of (url, data) pairs
    while queue:
        url, data = queue.popleft()
        headers = {"Referer": referrer} if referrer else {}
        response = http.get(url, headers=headers)
        referrer = response.url

I was not aware of requests' "opt-out-only" philosophy, but I can't really say that I disagree with it. Either way, it seems that my primary argument is null and void, and as such, I don't have any more objections if you want to close the issue. As for your other point, for the sake of argument I'll say that if you're hitting unrelated URLs, you shouldn't be using sessions to begin with (at the very least, cookies will not persist across domains). For brevity, similar functionality can be achieved by combining regular requests with functools.partial. Anyway, I just want to say thanks for the library. While not perfect (nothing ever is, nor can any objective definition of perfection ever be stated), it's definitely the best HTTP library I've come across (pretty cool that it's greenlet-ready) and well... Keep up the good work!
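As a rough illustration of the functools.partial idea mentioned above, the sketch below threads a Referer through a sequence of plain, session-less requests. The `crawl` function and its `get` parameter are hypothetical names introduced only for this example; `get` is assumed to be any callable that accepts `(url, headers=...)` and returns an object with a `.url` attribute, which is how `requests.get` behaves.

```python
import functools


def crawl(urls, get):
    """Fetch each URL in order, sending the previous final URL as Referer.

    `get` is an assumption of this sketch: any callable shaped like
    requests.get(url, headers=...) that returns an object with `.url`.
    """
    referrer = None
    responses = []
    for url in urls:
        # No Referer on the first request, the previous final URL afterwards.
        headers = {"Referer": referrer} if referrer else {}
        # Bind the headers into the getter instead of relying on a Session.
        fetch = functools.partial(get, headers=headers)
        response = fetch(url)
        responses.append(response)
        referrer = response.url
    return responses
```

With requests installed, this could be called as `crawl(["https://example.com/a", "https://example.com/b"], requests.get)`. Since no Session is involved, cookies are not carried between the requests.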
No problem! Glad we could take this out. By the way, we totally expect you to use
From my understanding, the point of requests.Session is that it emulates what a dumb* browser does when asked to render a webpage. This means keeping cookies, following redirects and in my opinion, setting the HTTP "Referer" header.
The Session class doesn't seem to touch either the "Referer" header or its properly spelled variant. I consider this to be non-pythonic because both the documentation and the "Session" class name suggest that all requests in a session are performed sequentially, and that each individual request may update the state of the session by setting cookies, or by returning 3xx responses which in turn forward the session to a different URL.
As such, I propose setting the default value of the "Referer" header (perhaps setting the properly spelled "referrer" header as well for future-proofing) to the final URL of the most recently executed request. This seems to be consistent with the way cookies are handled, in that performing a request will automatically parse and update any cookies in the jar. Since this will break backwards compatibility (for a few servers, an undefined referrer is better than a wrong one), this should probably be implemented behind a check of Session.follow_referrer, which should default to False.
I can create a patch for this if the package maintainer agrees that this is a bug.
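To make the proposal concrete, here is a minimal sketch of the opt-in behaviour, written as a wrapper around a session rather than as a patch to requests itself. Everything in it is hypothetical: `ReferrerTrackingSession`, `last_url` and the `follow_referrer` flag do not exist in requests, and the wrapped object is only assumed to offer a `get(url, headers=..., **kwargs)` method returning something with a `.url` attribute, as `requests.Session` does.

```python
class ReferrerTrackingSession:
    """Hypothetical wrapper illustrating the proposal; not part of requests."""

    def __init__(self, session, follow_referrer=False):
        self.session = session                  # e.g. a requests.Session()
        self.follow_referrer = follow_referrer  # opt-in, defaults to off
        self.last_url = None                    # final URL of the last request

    def get(self, url, **kwargs):
        headers = dict(kwargs.pop("headers", None) or {})
        # An explicitly supplied Referer (in any letter case) always wins.
        explicit = any(name.lower() == "referer" for name in headers)
        if self.follow_referrer and self.last_url and not explicit:
            headers["Referer"] = self.last_url
        response = self.session.get(url, headers=headers, **kwargs)
        # Remember where we ended up (after any redirects) for the next call.
        self.last_url = response.url
        return response
```

With the flag left at its default of False the wrapper only records the last URL and never adds a header, which mirrors the backwards-compatibility concern raised above.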