
Redesign the cache API #240

Closed · wants to merge 30 commits
Conversation

hexagonrecursion (Contributor)

The current cache API is:

class Cache:
    def get(self, key):
        pass

    def set(self, key, value, expires=None):
        pass

    def delete(self, key):
        pass

    def close(self):
        pass

This makes fixing #145, #180 and #238 difficult because you have to store the entire response body somewhere before you can save it with Cache.set().

Currently CacheControl buffers the response body in memory, which is just wrong: thou shalt not assume that every file fits in RAM. This is the cause of #145 and #180. It also makes CacheControl completely unusable with response bodies larger than (2^32)-1 bytes, no matter how much RAM you have, due to a design limitation of the msgpack format (#238).

I have considered a less invasive approach of buffering the response body in a temporary file, but there are at least two problems with this:

  1. On Linux /tmp is often a tmpfs, a filesystem backed by virtual memory. tmpfs can swap out, but swap is not always enabled, tmpfs has a configurable maximum size, and swap usually has its own size limit. In short, this would defeat the point: we would not be able to reliably cache a 100GB response even if there is enough storage for the cache.
  2. Suboptimal I/O: to cache a 10GB response we would write 10GB to the temporary file, read those 10GB back, and write 10GB again into the cache, instead of writing the 10GB once.

I'm going to make big changes to the interface between CacheControl and a cache implementation. This makes all existing cache implementations suspect. I'll have to reimplement them later.
hexagonrecursion (Contributor, Author) commented Feb 16, 2021

Fixing #238 would also require a new serialization format, because cc=4 stores both the body and the headers in a single msgpack object.
Edit: fixing #145 and #180 will also require a new serialization format.
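
For illustration, the coupling looks roughly like this (a simplified sketch, not the actual code in cachecontrol/serialize.py):

# Simplified sketch of the cc=4 layout. The real serializer stores more
# fields, but the key point stands: the whole body sits inside a single
# msgpack object, so it has to be fully buffered and is capped at
# (2^32)-1 bytes by the msgpack bin format.
import msgpack

def dumps_v4_sketch(headers: dict, body: bytes) -> bytes:
    data = {
        "response": {
            "body": body,          # entire body as one bytes object
            "headers": headers,
        }
    }
    return b"cc=4," + msgpack.dumps(data, use_bin_type=True)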

A cache has to be able to distinguish between two different scenarios:

1. CacheControl has successfully written all the data it intended to write and wants to save it.
2. The download got canceled partway through (e.g. due to SIGINT) and CacheControl wants to release all resources associated with it.

In the second scenario the cache must not return the incomplete data from a subsequent open_read.
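
For concreteness, a minimal sketch of what such an interface could look like (the names open_write, commit and abort are placeholders of mine; only open_read is taken from the description above, so this is not necessarily the API in this PR):

# Illustrative sketch only, not the final API of this pull request.
from abc import ABC, abstractmethod
from typing import IO, Optional


class CacheWriter(ABC):
    @abstractmethod
    def write(self, chunk: bytes) -> None:
        """Append a chunk of the response body as it arrives from the network."""

    @abstractmethod
    def commit(self) -> None:
        """Scenario 1: all data was written; make the entry visible to readers."""

    @abstractmethod
    def abort(self) -> None:
        """Scenario 2: the download was canceled; discard the partial entry
        so a subsequent open_read never sees incomplete data."""


class StreamingCache(ABC):
    @abstractmethod
    def open_write(self, key: str) -> CacheWriter:
        """Start writing a new entry; the body is streamed, never buffered whole."""

    @abstractmethod
    def open_read(self, key: str) -> Optional[IO[bytes]]:
        """Return a readable stream for a committed entry, or None on a miss."""

This would keep the 10GB case to a single write: the transport hands chunks to the writer as they arrive, and commit/abort decide whether the entry ever becomes visible.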
ionrock (Contributor) commented Feb 19, 2021

Hey! Thanks for looking at this. I know this limitation has been considered before, but the discrepancies between different operating systems and potential use cases made it challenging to find a valid fix. The result was we did the simplest thing! One idea I had when looking through your commentary was that you might be able to extend the heuristic functionality to allow someone to implement a large response handler. For example, if the Content-Length is larger than some value (1GB maybe?) then you might trigger a custom buffer that avoids the extra reads / writes.

I haven't looked at the code for a while, but I think heuristics are only for adjusting the response object before it is sent to the actual caching logic. You'd likely have to pass some extra info to trigger the buffering, or potentially swap the response's body out for your custom buffer. There are probably other colors to paint the bikeshed too!

I mention it as I suspect changing the core API might be challenging at this point whereas extending the heuristics might provide a more reasonable upgrade path over time.
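
A rough, purely hypothetical sketch of the kind of size gate described here (none of these names exist in CacheControl, and where the body actually lives while it is being downloaded is exactly the question raised in the next reply):

# Hypothetical illustration only; not an existing CacheControl API.
import io
import tempfile

LARGE_BODY_THRESHOLD = 1 * 1024**3  # the 1GB value floated above

def choose_body_buffer(headers: dict):
    """Pick an in-memory or disk-backed buffer based on Content-Length."""
    try:
        length = int(headers.get("Content-Length", 0))
    except (TypeError, ValueError):
        length = 0
    if length > LARGE_BODY_THRESHOLD:
        # Spool oversized bodies to a temporary file instead of RAM.
        return tempfile.TemporaryFile()
    return io.BytesIO()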

hexagonrecursion (Contributor, Author)

discrepancies between different operating systems

What kind of discrepancies?

potential use cases

What kind of use cases?

One idea I had when looking through your commentary was that you might be able to extend the heuristic functionality to allow someone to implement a large response handler. For example, if the Content-Length is larger than some value (1GB maybe?) then you might trigger a custom buffer that avoids the extra reads / writes.

Exactly how would this work? Where would the body be stored while it's being downloaded?

hexagonrecursion added a commit to hexagonrecursion/cachecontrol that referenced this pull request Feb 21, 2021
This should make it easier to change the cache API
see psf#240