
Redesign the cache API #240

Closed · wants to merge 30 commits
Conversation

hexagonrecursion (Contributor)

The current cache API is:

class Cache:
    def get(self, key):
        pass

    def set(self, key, value, expires=None):
        pass

    def delete(self, key):
        pass

    def close(self):
        pass

This makes fixing #145, #180 and #238 difficult because you have to store the entire response body somewhere before you can save it with Cache.set().

Currently CacheControl buffers the response body in memory, which is just wrong: thou shalt not assume that every file fits in RAM. This is the cause of #145 and #180. It also makes CacheControl completely unusable with response bodies larger than (2^32)-1 bytes, no matter how much RAM you have, due to a design limitation of the msgpack format (#238).

I have considered a less invasive approach of buffering the response body in a temporary file, but there are at least two problems with this:

  1. On Linux /tmp is often a tmpfs, a filesystem backed by virtual memory. tmpfs can swap out, but swap is not always enabled, tmpfs has a configurable maximum size, and swap usually has its own size limit. In short, this would defeat the point: we would not be able to reliably cache a 100GB response even if there is enough storage for the cache.
  2. Suboptimal I/O: to cache a 10GB response we would write 10GB to the temporary file, read those 10GB back, and write 10GB again into the cache, instead of writing the 10GB once.

I'm going to make big changes to the interface between CacheControl and a cache implementation. This makes all existing cache implementations suspect. I'll have to reimplement them later.
hexagonrecursion (Contributor, Author) commented Feb 16, 2021

Fixing #238 would also require a new serialization format, because cc=4 stores both the body and the headers in a single msgpack object.
Edit: fixing #145 and #180 will also require a new serialization format.
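
For illustration, the coupling looks roughly like this (a simplified sketch, not the actual code in cachecontrol/serialize.py):

# Simplified sketch of the cc=4 layout. The real serializer stores more
# fields, but the key point stands: the whole body sits inside a single
# msgpack object, so it has to be fully buffered and is capped at
# (2^32)-1 bytes by the msgpack bin format.
import msgpack

def dumps_v4_sketch(headers: dict, body: bytes) -> bytes:
    data = {
        "response": {
            "body": body,          # entire body as one bytes object
            "headers": headers,
        }
    }
    return b"cc=4," + msgpack.dumps(data, use_bin_type=True)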

A cache has to be able to distinguish between two different scenarios:

1. CacheControl has successfully written all the data it intended to write and wants to save it.
2. The download got canceled partway through (e.g. due to SIGINT) and CacheControl wants to release all resources associated with it.

In the second scenario the cache must not return the incomplete data from a subsequent open_read.
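
For concreteness, a minimal sketch of what such an interface could look like (the names open_write, commit and abort are placeholders of mine; only open_read is taken from the description above, so this is not necessarily the API in this PR):

# Illustrative sketch only, not the final API of this pull request.
from abc import ABC, abstractmethod
from typing import IO, Optional


class CacheWriter(ABC):
    @abstractmethod
    def write(self, chunk: bytes) -> None:
        """Append a chunk of the response body as it arrives from the network."""

    @abstractmethod
    def commit(self) -> None:
        """Scenario 1: all data was written; make the entry visible to readers."""

    @abstractmethod
    def abort(self) -> None:
        """Scenario 2: the download was canceled; discard the partial entry
        so a subsequent open_read never sees incomplete data."""


class StreamingCache(ABC):
    @abstractmethod
    def open_write(self, key: str) -> CacheWriter:
        """Start writing a new entry; the body is streamed, never buffered whole."""

    @abstractmethod
    def open_read(self, key: str) -> Optional[IO[bytes]]:
        """Return a readable stream for a committed entry, or None on a miss."""

This would keep the 10GB case to a single write: the transport hands chunks to the writer as they arrive, and commit/abort decide whether the entry ever becomes visible.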
ionrock (Contributor) commented Feb 19, 2021

Hey! Thanks for looking at this. I know this limitation has been considered before, but the discrepancies between different operating systems and potential use cases made it challenging to find a valid fix. The result was we did the simplest thing! One idea I had when looking through your commentary was that you might be able to extend the heuristic functionality to allow someone to implement a large response handler. For example, if the Content-Length is larger than some value (1GB maybe?) then you might trigger a custom buffer that avoids the extra reads / writes.

I haven't looked at the code for a while, but I think heuristics are only for adjusting the response object before it is sent to the actual caching logic. You'd likely have to pass some extra info to trigger the buffering, or potentially swap the response's body out for your custom buffer. There are probably other colors to paint the bikeshed too!

I mention it as I suspect changing the core API might be challenging at this point whereas extending the heuristics might provide a more reasonable upgrade path over time.
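
A rough, purely hypothetical sketch of the kind of size gate described here (none of these names exist in CacheControl, and where the body actually lives while it is being downloaded is exactly the question raised in the next reply):

# Hypothetical illustration only; not an existing CacheControl API.
import io
import tempfile

LARGE_BODY_THRESHOLD = 1 * 1024**3  # the 1GB value floated above

def choose_body_buffer(headers: dict):
    """Pick an in-memory or disk-backed buffer based on Content-Length."""
    try:
        length = int(headers.get("Content-Length", 0))
    except (TypeError, ValueError):
        length = 0
    if length > LARGE_BODY_THRESHOLD:
        # Spool oversized bodies to a temporary file instead of RAM.
        return tempfile.TemporaryFile()
    return io.BytesIO()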

hexagonrecursion (Contributor, Author)

discrepancies between different operating systems

What kind of discrepancies?

potential use cases

What kind of use cases?

One idea I had when looking through your commentary was that you might be able to extend the heuristic functionality to allow someone to implement a large response handler. For example, if the Content-Length is larger than some value (1GB maybe?) then you might trigger a custom buffer that avoids the extra reads / writes.

Exactly how would this work? Where would the body be stored while it's being downloaded?

hexagonrecursion added a commit to hexagonrecursion/cachecontrol that referenced this pull request Feb 21, 2021
This should make it easier to change the cache API
see psf#240