FileCache does not work #111

burtgulash · 2016-01-19T17:35:42Z

I tried to test for existence of cache directory, similar to included test test_storage_filecache.py but it does not get created. forever=True flag does not help, changing directory .web_cache to something else neither.

import os.path
import logging
logging.basicConfig(level=logging.DEBUG)

import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

webcache_dir = ".web_cache"
cache = FileCache(webcache_dir)
sess = CacheControl(requests.Session(), cache=cache)
response = sess.get("http://google.com")

print()
print(cache)
print("%s exists?" % webcache_dir, os.path.exists(webcache_dir))

Attached log:

INFO:urllib3.connectionpool:Starting new HTTP connection (1): google.com
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET / HTTP/1.1" 302 258
INFO:urllib3.connectionpool:Starting new HTTP connection (1): www.google.cz
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET /?gfe_rd=cr&ei=DnKeVs2tOOWI8QfDyYbwDw HTTP/1.1" 200 7699

<cachecontrol.caches.file_cache.FileCache object at 0x7f72120f4b00>
.web_cache exists? False

The text was updated successfully, but these errors were encountered:

wcraigtrader · 2016-02-19T09:48:48Z

I'm experiencing this problem as well with requests 2.9.1. Tracing through the implementation, it appears that the FileCache.set() method is never called, because the CallbackFileWrapper.read() method is never called. In my case, the code looks like:

webcache_dir = ".web_cache"
cache = FileCache(webcache_dir)
sess = CacheControl(requests.Session(), cache=cache)
response = sess.get("http://google.com")
print (response.text)

I believe the problem is that by the time that CacheControlAdapter.build_response() is called, the entire content of the request has already been read and stored in the response text, so the CallbackFileWrapper never gets called.

ionrock · 2016-03-24T14:42:24Z

@burtgulash @wcraigtrader is correct. Unless the response is actually read, it won't be cached. The reason for this is that unless you use the result, there isn't a good reason to exhaust the file handle and save the result, potentially, in memory.

I'm going to close this out. Feel free to open a new issue if you find some other issue!

wcraigtrader · 2016-03-27T23:19:49Z

My diagnosis may be correct, but that still doesn't solve the problem: it isn't caching data. When I use 'response.text', I am using the data, therefore the request and its result should be cached, and that isn't happening.

sigmavirus24 · 2016-03-28T16:29:15Z

Unless the response is actually read, it won't be cached.

@ionrock looking at @burtgulash's example, they are not using stream=True so the urllib3 response would be read and cached by requests. Even so, @wcraigtrader seems to have a way to reproduce this even when calling response.text.

ionrock · 2016-03-28T21:26:04Z

@sigmavirus24 Very true! The other thing to recognize is that if the response isn't cacheable, then it won't preemptively create the cache dir either.

[ ~/Work/cachecontrol ] $ xe doesitcache http://google.com
xe doesitcache http://google.com
Updating cache with response from "http://www.google.com/"
Looking up "http://www.google.com/" in the cache
No cache entry available
Not cached :(

I'm not sure what headers might be getting in the way offhand, but it looks like google is asking that the response isn't cached.

sigmavirus24 · 2016-03-28T21:56:40Z

Good catch @ionrock

wcraigtrader · 2016-03-29T19:11:24Z

FWIW, I substituted google.com in that example as a generic, rather than tying this to the actual websites I'm working with, but I assure you that if you choose a real domain that doesn't have any caching headers, the example works fine. Try using trader.name instead and you'll get the result I described: cachecontrol won't cache the page if you use response.text.

sigmavirus24 · 2016-03-29T20:36:22Z

@wcraigtrader I just used doesitcache (which comes with cachecontrol) to check http://trader.name and it reported that it would not cache. I then checked it with httpie:

~ http http://trader.name
HTTP/1.1 200 OK
Accept-Ranges: bytes
Connection: Keep-Alive
Content-Encoding: gzip
Content-Length: 2264
Content-Type: text/html
Date: Tue, 29 Mar 2016 20:22:51 GMT
Keep-Alive: timeout=5, max=100
Server: Apache/2.4.7 (Ubuntu)
Vary: Accept-Encoding

<html>
  <head>
    <title>W. Craig Trader: Software Engineer / Technology Generalist</title>
        <!--
    <link rel="openid.server" href="http://getopenid.com/action/authenticate/" />
    <link rel="openid.delegate" href="https://getopenid.com/wcraigtrader/" />
    -->
    <!-- MyOpenID -->

I didn't see any cache headers in there. To make sure this wasn't something with requests, I tried curl and received:

~ curl -iv http://trader.name
* Rebuilt URL to: http://trader.name/
*   Trying 71.178.19.94...
* Connected to trader.name (71.178.19.94) port 80 (#0)
> GET / HTTP/1.1
> Host: trader.name
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Tue, 29 Mar 2016 20:25:41 GMT
Date: Tue, 29 Mar 2016 20:25:41 GMT
< Server: Apache/2.4.7 (Ubuntu)
Server: Apache/2.4.7 (Ubuntu)
< Accept-Ranges: bytes
Accept-Ranges: bytes
< Vary: Accept-Encoding
Vary: Accept-Encoding
< Transfer-Encoding: chunked
Transfer-Encoding: chunked
< Content-Type: text/html
Content-Type: text/html

<
<html>
  <head>
    <title>W. Craig Trader: Software Engineer / Technology Generalist</title>
        <!--
    <link rel="openid.server" href="http://getopenid.com/action/authenticate/" />
    <link rel="openid.delegate" href="https://getopenid.com/wcraigtrader/" />
    -->
    <!-- MyOpenID -->

Is there a site that this can be reproduced against?

wcraigtrader · 2016-03-30T02:12:31Z

@sigmavirus24 You prove my point -- without any cache headers, cachecontrol should be caching that page, and it doesn't. That demonstrates the bug.

sigmavirus24 · 2016-03-30T14:27:32Z

So looking at the Controller which handles the caching logic, it looks like cachecontrol isn't caching because all that exists on the site is a Date header without anything to tell the library when the entry should cease being cached. See this elif block. I'm not familiar enough with the relevant RFC(s) to say whether this is correct or not, but it seems rather sensible to me.

ionrock · 2016-03-30T14:50:02Z

@wcraigtrader Thanks for taking the time to dig into this. The rfc (as I've read them) leave this aspect a little ambiguous, I think, somewhat intentionally in order to prescribe caching mechanisms that can work across many different use cases. CacheControl, by default, tries to use a very explicit caching pattern where a response won't be cached unless the server explicitly provides headers that make the response cacheable. This is essentially what @sigmavirus24 has confirmed. Thanks @sigmavirus24!

With that said, browsers typically take on the heuristics to use the data and make a guess that something can be cached. This was why I added heuristics in the first place! For example, CacheControl does come with a heuristic that is meant to act more like a web browser (https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/heuristics.py#L91). In CacheControl, heuristics take an original response and allow adjusting the headers before the caching logic is applied. This allows you to essentially make something that isn't normally cacheable, able to be cached using CacheControl.

If you give the heuristics a try or have suggestions to improve them, please let me know in a ticket or pull request. The feature, while powerful, I've left rather bare bones until there is a clear consensus on how to improve them. Thanks again for digging into this issue, reading the code and commenting on the ticket. This conversation makes me think it would be a good idea to move the doesitcache tool to the README as well as make note to consider heuristics if something isn't getting cached.

sigmavirus24 · 2016-03-30T15:48:12Z

@ionrock could doesitcache be improved to allow the user to specify a heuristic so users could test the heuristic(s) they might want to apply?

ionrock · 2016-03-30T16:00:32Z

@sigmavirus24 Great idea!

int-ua mentioned this issue Mar 11, 2016

Can anyone help me in explaining why cachecontrol of not caching url? #116

Closed

ionrock closed this as completed Mar 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FileCache does not work #111

FileCache does not work #111

burtgulash commented Jan 19, 2016

wcraigtrader commented Feb 19, 2016

ionrock commented Mar 24, 2016

wcraigtrader commented Mar 27, 2016

sigmavirus24 commented Mar 28, 2016

ionrock commented Mar 28, 2016

sigmavirus24 commented Mar 28, 2016

wcraigtrader commented Mar 29, 2016

sigmavirus24 commented Mar 29, 2016

wcraigtrader commented Mar 30, 2016

sigmavirus24 commented Mar 30, 2016

ionrock commented Mar 30, 2016

sigmavirus24 commented Mar 30, 2016

ionrock commented Mar 30, 2016

FileCache does not work #111

FileCache does not work #111

Comments

burtgulash commented Jan 19, 2016

wcraigtrader commented Feb 19, 2016

ionrock commented Mar 24, 2016

wcraigtrader commented Mar 27, 2016

sigmavirus24 commented Mar 28, 2016

ionrock commented Mar 28, 2016

sigmavirus24 commented Mar 28, 2016

wcraigtrader commented Mar 29, 2016

sigmavirus24 commented Mar 29, 2016

wcraigtrader commented Mar 30, 2016

sigmavirus24 commented Mar 30, 2016

ionrock commented Mar 30, 2016

sigmavirus24 commented Mar 30, 2016

ionrock commented Mar 30, 2016