Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileCache does not work #111

Closed
burtgulash opened this issue Jan 19, 2016 · 13 comments
Closed

FileCache does not work #111

burtgulash opened this issue Jan 19, 2016 · 13 comments

Comments

@burtgulash
Copy link

I tried to test for existence of cache directory, similar to included test test_storage_filecache.py but it does not get created. forever=True flag does not help, changing directory .web_cache to something else neither.

import os.path
import logging
logging.basicConfig(level=logging.DEBUG)

import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

webcache_dir = ".web_cache"
cache = FileCache(webcache_dir)
sess = CacheControl(requests.Session(), cache=cache)
response = sess.get("http://google.com")

print()
print(cache)
print("%s exists?" % webcache_dir, os.path.exists(webcache_dir))

Attached log:

INFO:urllib3.connectionpool:Starting new HTTP connection (1): google.com
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET / HTTP/1.1" 302 258
INFO:urllib3.connectionpool:Starting new HTTP connection (1): www.google.cz
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET /?gfe_rd=cr&ei=DnKeVs2tOOWI8QfDyYbwDw HTTP/1.1" 200 7699

<cachecontrol.caches.file_cache.FileCache object at 0x7f72120f4b00>
.web_cache exists? False
@wcraigtrader
Copy link

I'm experiencing this problem as well with requests 2.9.1. Tracing through the implementation, it appears that the FileCache.set() method is never called, because the CallbackFileWrapper.read() method is never called. In my case, the code looks like:

webcache_dir = ".web_cache"
cache = FileCache(webcache_dir)
sess = CacheControl(requests.Session(), cache=cache)
response = sess.get("http://google.com")
print (response.text)

I believe the problem is that by the time that CacheControlAdapter.build_response() is called, the entire content of the request has already been read and stored in the response text, so the CallbackFileWrapper never gets called.

@ionrock
Copy link
Contributor

ionrock commented Mar 24, 2016

@burtgulash @wcraigtrader is correct. Unless the response is actually read, it won't be cached. The reason for this is that unless you use the result, there isn't a good reason to exhaust the file handle and save the result, potentially, in memory.

I'm going to close this out. Feel free to open a new issue if you find some other issue!

@ionrock ionrock closed this as completed Mar 24, 2016
@wcraigtrader
Copy link

My diagnosis may be correct, but that still doesn't solve the problem: it isn't caching data. When I use 'response.text', I am using the data, therefore the request and its result should be cached, and that isn't happening.

@sigmavirus24
Copy link
Contributor

Unless the response is actually read, it won't be cached.

@ionrock looking at @burtgulash's example, they are not using stream=True so the urllib3 response would be read and cached by requests. Even so, @wcraigtrader seems to have a way to reproduce this even when calling response.text.

@ionrock
Copy link
Contributor

ionrock commented Mar 28, 2016

@sigmavirus24 Very true! The other thing to recognize is that if the response isn't cacheable, then it won't preemptively create the cache dir either.

[ ~/Work/cachecontrol ] $ xe doesitcache http://google.com
xe doesitcache http://google.com
Updating cache with response from "http://www.google.com/"
Looking up "http://www.google.com/" in the cache
No cache entry available
Not cached :(

I'm not sure what headers might be getting in the way offhand, but it looks like google is asking that the response isn't cached.

@sigmavirus24
Copy link
Contributor

Good catch @ionrock

@wcraigtrader
Copy link

FWIW, I substituted google.com in that example as a generic, rather than tying this to the actual websites I'm working with, but I assure you that if you choose a real domain that doesn't have any caching headers, the example works fine. Try using trader.name instead and you'll get the result I described: cachecontrol won't cache the page if you use response.text.

@sigmavirus24
Copy link
Contributor

@wcraigtrader I just used doesitcache (which comes with cachecontrol) to check http://trader.name and it reported that it would not cache. I then checked it with httpie:

~ http http://trader.name
HTTP/1.1 200 OK
Accept-Ranges: bytes
Connection: Keep-Alive
Content-Encoding: gzip
Content-Length: 2264
Content-Type: text/html
Date: Tue, 29 Mar 2016 20:22:51 GMT
Keep-Alive: timeout=5, max=100
Server: Apache/2.4.7 (Ubuntu)
Vary: Accept-Encoding

<html>
  <head>
    <title>W. Craig Trader: Software Engineer / Technology Generalist</title>
        <!--
    <link rel="openid.server" href="http://getopenid.com/action/authenticate/" />
    <link rel="openid.delegate" href="https://getopenid.com/wcraigtrader/" />
    -->
    <!-- MyOpenID -->

I didn't see any cache headers in there. To make sure this wasn't something with requests, I tried curl and received:

~ curl -iv http://trader.name
* Rebuilt URL to: http://trader.name/
*   Trying 71.178.19.94...
* Connected to trader.name (71.178.19.94) port 80 (#0)
> GET / HTTP/1.1
> Host: trader.name
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Tue, 29 Mar 2016 20:25:41 GMT
Date: Tue, 29 Mar 2016 20:25:41 GMT
< Server: Apache/2.4.7 (Ubuntu)
Server: Apache/2.4.7 (Ubuntu)
< Accept-Ranges: bytes
Accept-Ranges: bytes
< Vary: Accept-Encoding
Vary: Accept-Encoding
< Transfer-Encoding: chunked
Transfer-Encoding: chunked
< Content-Type: text/html
Content-Type: text/html

<
<html>
  <head>
    <title>W. Craig Trader: Software Engineer / Technology Generalist</title>
        <!--
    <link rel="openid.server" href="http://getopenid.com/action/authenticate/" />
    <link rel="openid.delegate" href="https://getopenid.com/wcraigtrader/" />
    -->
    <!-- MyOpenID -->

Is there a site that this can be reproduced against?

@wcraigtrader
Copy link

@sigmavirus24 You prove my point -- without any cache headers, cachecontrol should be caching that page, and it doesn't. That demonstrates the bug.

@sigmavirus24
Copy link
Contributor

So looking at the Controller which handles the caching logic, it looks like cachecontrol isn't caching because all that exists on the site is a Date header without anything to tell the library when the entry should cease being cached. See this elif block. I'm not familiar enough with the relevant RFC(s) to say whether this is correct or not, but it seems rather sensible to me.

@ionrock
Copy link
Contributor

ionrock commented Mar 30, 2016

@wcraigtrader Thanks for taking the time to dig into this. The rfc (as I've read them) leave this aspect a little ambiguous, I think, somewhat intentionally in order to prescribe caching mechanisms that can work across many different use cases. CacheControl, by default, tries to use a very explicit caching pattern where a response won't be cached unless the server explicitly provides headers that make the response cacheable. This is essentially what @sigmavirus24 has confirmed. Thanks @sigmavirus24!

With that said, browsers typically take on the heuristics to use the data and make a guess that something can be cached. This was why I added heuristics in the first place! For example, CacheControl does come with a heuristic that is meant to act more like a web browser (https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/heuristics.py#L91). In CacheControl, heuristics take an original response and allow adjusting the headers before the caching logic is applied. This allows you to essentially make something that isn't normally cacheable, able to be cached using CacheControl.

If you give the heuristics a try or have suggestions to improve them, please let me know in a ticket or pull request. The feature, while powerful, I've left rather bare bones until there is a clear consensus on how to improve them. Thanks again for digging into this issue, reading the code and commenting on the ticket. This conversation makes me think it would be a good idea to move the doesitcache tool to the README as well as make note to consider heuristics if something isn't getting cached.

@sigmavirus24
Copy link
Contributor

@ionrock could doesitcache be improved to allow the user to specify a heuristic so users could test the heuristic(s) they might want to apply?

@ionrock
Copy link
Contributor

ionrock commented Mar 30, 2016

@sigmavirus24 Great idea!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants