
Not getting the caching I expected #95

Closed
toolforger opened this issue Aug 6, 2015 · 14 comments

Comments

@toolforger
Contributor

Here's my SSCCE (Short, Self-Contained, Correct Example):

import github3
import cachecontrol
g = github3.GitHub()
cachecontrol.CacheControl(g._session)
print(g.rate_limit()['resources']['core']) # initial rate limit
print(g.rate_limit()['resources']['core']) # rate_limit - should not count
repository = g.repository('sigmavirus24', 'github3.py')
print(g.rate_limit()['resources']['core']) # get repo data - should count
repository = g.repository('sigmavirus24', 'github3.py')
print(g.rate_limit()['resources']['core']) # get repo again - should be served from cache

In the output, I'm seeing that the rate limit is being ticked off for each g.repository call.

{u'reset': 1438863563, u'limit': 60, u'remaining': 53}
{u'reset': 1438863563, u'limit': 60, u'remaining': 53}
{u'reset': 1438863563, u'limit': 60, u'remaining': 52}
{u'reset': 1438863563, u'limit': 60, u'remaining': 51}

With logging cranked up to the max and tons of extra logging added (see #93 for the exact code, 50a76aa to be specific), I'm seeing the long trace below.
Conspicuously, there's no Updating cache with response from "http://...", which is the message I added at https://github.com/toolforger/cachecontrol/blob/50a76aa0f022a47d34c65c13c4c813ecb1f2c086/cachecontrol/controller.py#L228, which suggests that cachecontrol.controller.CacheController.cache_response is never called.
Since that call is stashed away in a functools.partial, I have no idea where and when it should have happened, so I've hit a dead end.

INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
INFO:github3:Missed the cache building the url
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): api.github.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
{u'reset': 1438867921, u'limit': 60, u'remaining': 60}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
{u'reset': 1438867922, u'limit': 60, u'remaining': 60}
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'repos', 'sigmavirus24', 'github3.py')
INFO:github3:Missed the cache building the url
DEBUG:github3:GET https://api.github.com/repos/sigmavirus24/github3.py with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/repos/sigmavirus24/github3.py" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /repos/sigmavirus24/github3.py HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'repos', 'sigmavirus24', 'github3.py')
DEBUG:github3:GET https://api.github.com/repos/sigmavirus24/github3.py with {}
{u'reset': 1438867922, u'limit': 60, u'remaining': 59}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/repos/sigmavirus24/github3.py" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /repos/sigmavirus24/github3.py HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
{u'reset': 1438867922, u'limit': 60, u'remaining': 58}
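For context on why the deferred call is hard to trace: functools.partial freezes the arguments up front but runs nothing until some other component invokes the resulting callable. A minimal sketch of the pattern (names hypothetical, not CacheControl's actual code):

```python
import functools

calls = []

def cache_response(request, body):
    # Hypothetical stand-in for CacheController.cache_response
    calls.append((request, body))

# The adapter pre-binds the arguments now, but nothing runs yet...
deferred = functools.partial(cache_response, "GET /rate_limit", b"{}")
assert calls == []  # the deferred call has not fired

# ...only when downstream code invokes the partial does caching happen.
deferred()
assert calls == [("GET /rate_limit", b"{}")]
```

If the downstream code never calls the partial, the cache write silently never happens, which matches the missing log line above.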
@toolforger
Contributor Author

I fixed an embarrassing typo in the logging modifications, but the log output still contains no mention of Updating cache with response, so the typo seems to be inconsequential.
(I've updated the log output above, just to be on the safe side.)

@toolforger
Contributor Author

Ping @ionrock - any chance that this is going to be attacked anytime soon?

@sigmavirus24
Contributor

@toolforger the fastest way to a solution is for you to pick this up and run with it. I'm sure @ionrock has more important real life things to do at the moment.

@toolforger
Contributor Author

I doubt that it's any of my (or anybody else's) business what his real-life things are. What is my business is whether he's going to do anything about it in the foreseeable future or not. If no, fine; if yes, fine again; if unknown or no response, I'll take that as a no and get on with it.

I don't know what you mean by "pick this up and run with it".
If it's "just use it for what it's worth" - well, it's not caching anything for me, so I should remove it, I guess.
If it's "fix whatever needs fixing" - impossible: there are too many ifs in the code that are clearly motivated by some RFC, but I can't check whether they're correct because the RFC is never mentioned.

@sigmavirus24
Contributor

If it's "fix whatever I need fixing" - impossible

I doubt it's impossible. Perhaps it's impossible to do in 30 seconds or 5 minutes, but it's not impossible. In fact, it's impossible that it's impossible. ;)

@ionrock
Contributor

ionrock commented Aug 30, 2015

@toolforger Indeed @sigmavirus24 is correct. Life has very much been getting in the way of providing you a response. My apologies.

Can you send along information on what the response looks like? Specifically, I need to see the headers in the response. Usually, if something isn't getting cached, it is because the server is providing a header that is telling the client not to cache things. Once we figure out what the issue is, we can look at writing a heuristic that might work around the problem.
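The header check described here can be illustrated with a rough sketch that looks only at Cache-Control directives. The real rules (RFC 7234) are considerably more involved, and `is_cacheable` is a hypothetical helper, not CacheControl's API:

```python
def is_cacheable(headers):
    """Very rough cacheability check based only on Cache-Control.

    Illustrative sketch: a server sending no-store or no-cache is
    telling the client not to (re)use the response from cache.
    """
    cc = headers.get("cache-control", "")
    directives = {d.strip().split("=")[0] for d in cc.split(",") if d.strip()}
    return not ({"no-store", "no-cache"} & directives)

# GitHub's headers in this issue allow caching for 60 seconds:
assert is_cacheable({"cache-control": "public, max-age=60, s-maxage=60"})
assert not is_cacheable({"cache-control": "no-store"})
```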

@toolforger
Contributor Author

On 2015-08-31 at 00:39, Ian Cordasco wrote:

If it's "fix whatever I need fixing" - impossible

I doubt it's impossible. [...]

As I said: unknown specs. At least three RFCs and an unknown host of
practical experience went into that code; there's no way to know what it's supposed to do.

Perhaps it's impossible to do in 30 seconds or 5 minutes, but it's not impossible.

I can tell you that it's impossible even after five days of reading
CacheControl, single-stepping it, adding logging statements, and reading
up on the pertinent RFCs.
I can admit that I didn't dive into the details of the tests.

On 2015-08-31 at 00:47, Eric Larson wrote:

Life has very much been getting in the way of providing you a response. My apologies.

No worries, I can understand that. I just wanted to know, I didn't want
to slap my priorities on you.
You might want to find somebody else to be in charge of things if/when
CacheControl becomes a dependency of other projects.

Can you send along information on what the response looks like? Specifically, I need to see the headers in the response.

URL sent:
https://api.github.com/repos/sigmavirus24/github3.py

Headers sent:

{
   'Accept-Encoding': 'gzip, deflate',
   'Accept': 'application/vnd.github.v3.full+json',
   'User-Agent': 'github3.py/0.9.4',
   'Accept-Charset': 'utf-8',
   'Connection': 'keep-alive',
   'Content-Type': 'application/json'
}

Headers received:

{
   'vary': 'Accept, Accept-Encoding',
   'x-served-by': 'cee4c0729c8e9147e7abcb45b9d69689',
   'x-xss-protection': '1; mode=block',
   'x-content-type-options': 'nosniff',
   'etag': 'W/"38aaa1047b2cb6026f733367ceb37b91"',
   'access-control-allow-credentials': 'true',
   'status': '200 OK',
   'x-ratelimit-remaining': '59',
   'x-github-media-type': 'github.v3; param=full; format=json',
   'access-control-expose-headers':
     'ETag, Link, X-GitHub-OTP, X-RateLimit-Limit,
      X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes,
      X-Accepted-OAuth-Scopes, X-Poll-Interval',
   'transfer-encoding': 'chunked',
   'x-github-request-id': '5795EE32:15DFE:634F841:55E3F59E',
   'cache-control': 'public, max-age=60, s-maxage=60',
   'last-modified': 'Fri, 28 Aug 2015 21:48:05 GMT',
   'date': 'Mon, 31 Aug 2015 06:35:10 GMT',
   'access-control-allow-origin': '*',
   'content-security-policy': "default-src 'none'",
   'content-encoding': 'gzip',
   'strict-transport-security':
     'max-age=31536000; includeSubdomains; preload',
   'server': 'GitHub.com',
   'x-ratelimit-limit': '60',
   'x-frame-options': 'deny',
   'content-type': 'application/json; charset=utf-8',
   'x-ratelimit-reset': '1441006510'
}

I'm not sure how etag and date/max-age interact. In fact, it would be
nice if date/max-age could be ignored entirely; I'd let GitHub decide
whether they want to resend the data or not.
I'm just realizing that my use case isn't about keeping the traffic
down, it's about avoiding GitHub's rate limit. That is, avoiding round
trips is merely nice to have; what's important is getting as many Not
Modified responses as possible (because those don't count against the
rate limit).
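The revalidation described here, maximizing 304 Not Modified replies, works by echoing the cached validators back to the server on the next request. A sketch of building those headers (`conditional_headers` and the `cached` dict are hypothetical, not CacheControl's API):

```python
def conditional_headers(cached):
    """Build revalidation headers from a cached response's metadata.

    A 304 Not Modified reply to a request carrying these headers lets
    the client reuse its cached body; per this issue, GitHub does not
    count 304s against the rate limit. (Sketch; `cached` is a
    hypothetical dict of lower-cased response headers.)
    """
    headers = {}
    if "etag" in cached:
        headers["If-None-Match"] = cached["etag"]
    if "last-modified" in cached:
        headers["If-Modified-Since"] = cached["last-modified"]
    return headers

# Using the ETag from the response headers shown above:
h = conditional_headers({"etag": 'W/"38aaa1047b2cb6026f733367ceb37b91"'})
assert h == {"If-None-Match": 'W/"38aaa1047b2cb6026f733367ceb37b91"'}
```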

Usually, if something isn't getting cached, it is because the server is providing a header that is telling the client not to cache things.
Once we figure out what the issue is, we can look at writing a heuristic to to add that might work around the problem.

Hopefully the heuristics can be applied on a per-site basis. I'd hate to
add heuristics that might make CacheControl less useful for other sites!

@toolforger
Contributor Author

After some more debugging, I'm seeing that CacheController.cache_response is simply never called.

What's happening is that CacheControlAdapter.build_response wraps the cache_response call in a functools.partial and places it in response._fp, evidently expecting the Requests library to call it at some later time.
Which it never does: response._fp is managed by Requests' copy of urllib3, and Requests never calls anything that uses _fp. I have no idea why; I'd have to dig deeper into the software stack to find out.
I suspect it's easier to rewrite CacheControl to avoid modifying internal variables, though. It's trying to make sure the stream gets closed eventually. Then again, I wonder why CacheControl even bothers with that use case; if a response is too large to keep in RAM, you wouldn't want it cached either. So my "fix" would be to do away with CallbackFileWrapper - what do people think?
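For readers unfamiliar with CallbackFileWrapper's role: the general pattern is a proxy around a file-like object that fires a callback once the stream has been fully read, so the body is cached only after it has actually been consumed. A simplified sketch (not CacheControl's actual implementation):

```python
import io

class CallbackWrapper:
    """Proxy a file-like object; invoke callback(body) once at EOF."""

    def __init__(self, fp, callback):
        self._fp = fp
        self._callback = callback
        self._buf = io.BytesIO()  # accumulates everything read so far

    def read(self, amt=None):
        data = self._fp.read(amt) if amt is not None else self._fp.read()
        self._buf.write(data)
        if data == b"" or amt is None:
            # Stream exhausted: hand the accumulated body to the callback.
            if self._callback is not None:
                self._callback(self._buf.getvalue())
                self._callback = None  # fire at most once
        return data

captured = []
wrapped = CallbackWrapper(io.BytesIO(b"response body"), captured.append)
assert wrapped.read() == b"response body"
assert captured == [b"response body"]
```

The catch this issue uncovers: the callback only fires if the wrapped read path is actually the one being used, which is not the case for chunked responses.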

@ionrock
Contributor

ionrock commented Sep 3, 2015

@toolforger The general use case is that if you don't use the response, you don't want to cache it. If your app was using a simple in-memory cache and made a request that resulted in a huge response, it could fill up the available memory or have other unintended consequences, depending on the cache store. While I can agree that it does make things slightly more complex, the result is safer for users.

As for caching things larger than RAM, this is really dependent on the cache store. There are many tools that can handle larger data, such as MongoDB's GridFS, S3, or even the FileStore. Sometimes the largest responses are the ones you want cached the most.

Thanks for providing info on the response. I'll take a look. In the meantime, it sounds like the file handle of the response is not getting consumed. Take a look at the body content docs to see how you can ensure the file handle is exhausted.

@ionrock
Contributor

ionrock commented Sep 3, 2015

@toolforger I wrote a simple tool to hopefully help sort out why your response isn't getting cached. It doesn't handle the use case where the response is not used, so keep that in mind. If you do try it, please feel free to give any feedback in PR #100.

@toolforger
Contributor Author

Yeah, after some more thinking I suspected myself that eliminating CallbackFileWrapper wasn't that great of an idea. Good to know more of the background though.

Streaming is not the culprit here; I have stream = False. requests.sessions.Session.send touches r.content in line 605, causing r to read the entire response.
In fact, I never did anything about streaming; my use case has no need for it, and I wasn't aware of the streaming machinery before I started digging into the code.

The immediate cause seems to be chunking. I'm seeing r.content hit requests.packages.urllib3.response.HTTPResponse.stream, which in line 302 actively avoids using self._fp if self.chunked is set. (I'm not sure why it does that.)

@toolforger
Contributor Author

Some more analysis...

CallbackFileWrapper hooks into httplib's HTTPResponse.read, but what gets called for chunked responses is HTTPResponse._safe_read (which, ironically, carries a comment amounting to "people could always have used read instead of _safe_read").

It sounds easy to solve - just hook _safe_read alongside read.
Well, problem 1: we're getting end-of-chunk markers ("\r\n") into CallbackFileWrapper's __buf. I didn't check whether that's good or bad, because of
problem 2: the last call to _safe_read happens before close. In fact, during and after close, we get multiple accesses to CallbackFileWrapper.fp, but nothing specific enough to detect that the response is being closed, so we never get the trigger that caches the response.

In the light of these findings, wouldn't hooking the Response object itself instead of its internal _fp attribute be the better approach?
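For reference, the end-of-chunk markers mentioned in problem 1 come from HTTP/1.1 chunked transfer encoding, where each chunk is a hex length line, the data, and a trailing "\r\n", terminated by a zero-length chunk. A minimal decoder sketch (illustration only; real parsing per RFC 7230 also handles extensions and trailers):

```python
def decode_chunked(raw):
    """Decode a minimal HTTP/1.1 chunked body (illustration only).

    Each chunk is '<hex size>\r\n<data>\r\n'; a zero-size chunk ends
    the body. The trailing '\r\n' after each chunk is exactly the
    end-of-chunk marker that leaked into CallbackFileWrapper's buffer
    via _safe_read.
    """
    body = b""
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol], 16)   # hex chunk-size line
        if size == 0:
            break                      # zero-size chunk terminates the body
        start = eol + 2
        body += raw[start:start + size]
        pos = start + size + 2         # skip data plus the trailing \r\n
    return body

assert decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") == b"Wikipedia"
```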

toolforger added a commit to toolforger/cachecontrol that referenced this issue Sep 6, 2015
Old code used to hook the HTTPResponse._fp object, on the code path that deals
just with non-chunked responses. Since _fp isn't responsible for detecting the
end of chunked responses, it cannot know when to create the cache entry.

New code hooks the HTTPResponse itself, on the stream() function, which is the
common code path for chunked and non-chunked responses.

Fixes psf#95.
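The approach the commit describes, hooking the response's stream() method rather than its internal _fp, can be sketched generically. Here `hook_stream` and `FakeResponse` are illustrative stand-ins, not the actual patch:

```python
def hook_stream(response, on_complete):
    """Wrap response.stream() so on_complete(body) fires once the
    generator is exhausted -- the code path shared by chunked and
    non-chunked responses. (Generic sketch, not the merged patch.)
    """
    original = response.stream

    def stream(*args, **kwargs):
        chunks = []
        for chunk in original(*args, **kwargs):
            chunks.append(chunk)
            yield chunk
        # Generator exhausted: the full body is known, cache it now.
        on_complete(b"".join(chunks))

    response.stream = stream
    return response

# Demo with a stand-in response object:
class FakeResponse:
    def stream(self):
        yield b"hel"
        yield b"lo"

cached = []
r = hook_stream(FakeResponse(), cached.append)
assert b"".join(r.stream()) == b"hello"
assert cached == [b"hello"]
```

The advantage over hooking _fp is that exhaustion of the stream is observable directly, regardless of which low-level read method the transport uses.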
@ionrock
Contributor

ionrock commented Mar 24, 2016

@toolforger I've merged a patch to work with chunked responses. If you are still interested, please give master a try and see if that solves the issue.

Otherwise, I'll plan on closing this issue. Thanks for all your work debugging and working on a fix to this!

@sigmavirus24
Contributor

So I installed cachecontrol from PyPI and github3.py, and ran a simple test:

import cachecontrol
import github3

g = github3.GitHub()
cachecontrol.CacheControl(g.session)
print(g.rate_limit()['rate']['remaining'])
print(g.user('sigmavirus24'))
print(g.rate_limit()['rate']['remaining'])
print(g.user('sigmavirus24'))
print(g.rate_limit()['rate']['remaining'])

Assuming everything is working correctly, you would see:

60
<User [sigmavirus24:Ian Cordasco]>
59
<User [sigmavirus24:Ian Cordasco]>
59

What I saw initially was a 58 at the end instead. I then installed git+https://github.com/ionrock/cachecontrol and saw:

58
<User [sigmavirus24:Ian Cordasco]>
57
<User [sigmavirus24:Ian Cordasco]>
57

So I believe it's working now. 👍

@ionrock ionrock closed this as completed Sep 13, 2016