
add support for SplitBodyCache #1971

Merged (3 commits, Jun 5, 2023)

Conversation

@tgolsson (Contributor) commented Jun 2, 2023

Pull Request Checklist

  • A news fragment is added in news/ describing what is new.
  • Test cases added for changed code.

Describe what you have changed in this PR.

Supersedes #1498 rather than rebasing it. This adds support for using a split-body cache. In the current single-body cache, each file in the HTTP cache contains two different fragments: the serialized HTTP response, followed by the HTTP body data. This means that whenever the response headers change (including timestamps), the whole cache object is rewritten to disk.

With this change the body is not written again, reducing disk pressure.

This change is compatible with existing caches, to the best of my ability to test it. However, it completely prevents running an older PDM against a new cache: in that case, the body in the file is "None" and deserialization breaks. I'm not sure if there's a great workaround.
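As a rough sketch of the idea (hypothetical names, not PDM's actual FileCache API): the cache keeps the serialized response and the body in separate files, so a header-only change rewrites just the small metadata file.

```python
import os
import tempfile

# Illustrative sketch only; class and method names are hypothetical,
# not PDM's real implementation.
class SplitBodyCacheSketch:
    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, key)

    def set(self, key, metadata, body=None):
        # Metadata (serialized response) always gets rewritten; it is small.
        with open(self._path(key), "wb") as f:
            f.write(metadata)
        # body=None means "unchanged": skip rewriting the (possibly large) body.
        if body is not None:
            with open(self._path(key) + ".body", "wb") as f:
                f.write(body)

    def get_body(self, key):
        # Mirrors the BinaryIO-returning design discussed in this thread:
        # hands back an open file object rather than the bytes themselves.
        try:
            return open(self._path(key) + ".body", "rb")
        except OSError:
            return None

with tempfile.TemporaryDirectory() as d:
    cache = SplitBodyCacheSketch(d)
    cache.set("pkg", b"headers-v1", b"wheel-bytes")
    cache.set("pkg", b"headers-v2")  # headers changed, body left untouched
    with cache.get_body("pkg") as body:
        data = body.read().decode()
    print(data)  # wheel-bytes
```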

@frostming (Collaborator) left a comment

However, it completely prevents running an older PDM against a new cache.

I don't think we need to consider this issue.

@frostming merged commit 06fd648 into pdm-project:main on Jun 5, 2023
19 checks passed
@sanmai-NL (Contributor)

@tgolsson Why does caches.get_body() return a BinaryIO rather than binary data? This easily leads to file handle leaks.

@frostming (Collaborator)

@sanmai-NL It's based on https://github.com/psf/cachecontrol, so this may be the wrong place to ask.

@tgolsson (Contributor, Author)

@sanmai-NL Why would it lead to file handle leaks?

@sanmai-NL (Contributor) commented Jun 12, 2023

@tgolsson Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak. That's why I asked why this function needs to return BinaryIO rather than plain binary data.

@sanmai-NL (Contributor)

@sanmai-NL It's based on https://github.com/psf/cachecontrol, so this may be the wrong place to ask.

I failed to find references to BytesIO in there.


@frostming (Collaborator) commented Jun 12, 2023

Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak

The IO object will be passed in to create an HTTPResponse object, which will close it when consumed. I don't see why it would lead to leaks; every time you get a response, it carries an open fp with it. It is no different from opening one ourselves and passing it to the response.

@sanmai-NL (Contributor) commented Jun 12, 2023

Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak

The IO object will be passed in to create an HTTPResponse object, which will close it when consumed. I don't see why it would lead to leaks; every time you get a response, it carries an open fp with it. It is no different from opening one ourselves and passing it to the response.

I noticed that CacheControl is (exclusively?) meant to be used with urllib3. urllib3 takes ownership of the IO object upon construction and closes HTTP responses after reading all bytes, rather than upon destruction. I see you designed get_body() to work with file handles rather than data, to conserve memory. Setting aside the current urllib3 dependency, can you comment on why a binary data generator wouldn't work? That could be context-managed, unlike a bare IO object being passed around.
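A minimal sketch of that alternative, assuming a hypothetical open_body_chunks helper (not an actual PDM or CacheControl API): a context-managed chunk generator ties the close to the with block instead of to garbage collection.

```python
import contextlib
import os
import tempfile

# Hypothetical helper, sketched for this discussion only.
@contextlib.contextmanager
def open_body_chunks(path, chunk_size=64 * 1024):
    f = open(path, "rb")
    try:
        # iter(callable, sentinel) keeps calling read() until it returns b"".
        yield iter(lambda: f.read(chunk_size), b"")
    finally:
        f.close()  # deterministic close, independent of interpreter GC

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"cached body bytes")
with open_body_chunks(tmp.name, chunk_size=4) as chunks:
    data = b"".join(chunks)
os.unlink(tmp.name)
print(data.decode())  # cached body bytes
```

This still streams the body (memory use bounded by chunk_size) while guaranteeing the descriptor is released when the with block exits.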

@tgolsson (Contributor, Author) commented Jun 12, 2023

@tgolsson Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak. That's why I asked why this function needs to return BinaryIO rather than plain binary data.

Which Python version? Python file handles from open are closed when they get garbage collected, in every case I know of.

Setting aside the current urllib3 dependency, can you comment on why a binary data generator wouldn't work? That could be context-managed, unlike a bare IO object being passed around.

There were actually some issues with returning a non-file in some situations, though the original PR is many months old at this point (cachecontrol was quite dead back then), so I can't say whether it would still reproduce or was specific to that version. I think in general it's nicer to avoid loading multiple gigabytes into memory if it can be avoided, especially when it's not certain the data will be needed. As I recall, in some cases the body just gets streamed out again into the build directory, so we never need to keep it all in memory.

@sanmai-NL (Contributor)

@tgolsson I'll readily believe it wasn't a major defect with PDM. But please read this: https://realpython.com/why-close-file-python/

#2005

@frostming mentioned this pull request Jun 12, 2023
@sanmai-NL (Contributor) commented Jun 12, 2023

Re #2005 (comment)

Not sure what you mean by "manually close". Do you mean somehow enforcing that close() is called? I proposed a redesign to that effect, so that urllib3 would lazily load a cached file based on its path, without a file descriptor having to be passed between functions. I'm not sure how to solve it, since I'm not deeply involved in this.

@frostming (Collaborator)

Using a global context manager that is closed after CLI handling, like what pip does.

@tgolsson (Contributor, Author) commented Jun 12, 2023

urllib3 will always close the file (when the response is GC'd), so I don't see why we'd need to close them manually. I guess if we had some malformed response data we could potentially leak them on non-CPython interpreters, as we'd not be creating an HTTPResponse object.

@sanmai-NL (Contributor)

When will garbage collection happen? If PDM keeps running and accumulates responses to cache that aren't being read, the number of open files will increase without limit, right? Perhaps pointing out the actual source code paths will keep us focused.

@frostming (Collaborator) commented Jun 13, 2023

When will garbage collection happen? If PDM keeps running and accumulates responses to cache that aren't being read, the number of open files will increase without limit, right? Perhaps pointing out the actual source code paths will keep us focused.

Continuing to argue from two different philosophies seems pointless. Can you quantify how many files are opened and how many are not closed when running a PDM command?

@sanmai-NL (Contributor)

If you explain the main code path from actual requests done by PDM into this caching layer, then I'll look into measurement.

@tgolsson (Contributor, Author) commented Jun 13, 2023

To quote the docs (which you can find here):

.. c:function:: void Py_DECREF(PyObject *o)

   Decrement the reference count for object *o*.

   If the reference count reaches zero, the object's type's deallocation
   function (which must not be ``NULL``) is invoked.

   This function is usually used to delete a :term:`strong reference` before
   exiting its scope.

   The object must not be ``NULL``; if you aren't sure that it isn't ``NULL``,
   use :c:func:`Py_XDECREF`.

   .. warning::

      The deallocation function can cause arbitrary Python code to be invoked (e.g.
      when a class instance with a :meth:`__del__` method is deallocated).  While
      exceptions in such code are not propagated, the executed code has free access to
      all Python global variables.  This means that any object that is reachable from
      a global variable should be in a consistent state before :c:func:`Py_DECREF` is
      invoked.  For example, code to delete an object from a list should copy a
      reference to the deleted object in a temporary variable, update the list data
      structure, and then call :c:func:`Py_DECREF` for the temporary variable.

So the object is garbage-collected immediately when the refcount reaches 0, i.e., when it becomes unreachable from regular Python code. That is to say, unless there are islands of self-referential objects, it will be cleaned up immediately. If it does exist in such an island, it will be handled during the cyclic garbage-collection phase.

If it isn't at 0, and isn't in an island, someone is keeping it around and I'd say forcing it to be closed would be a potential future bug.
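The refcounting behavior described above can be seen with a small CPython experiment (on PyPy and other interpreters the finalizer may run later, as discussed elsewhere in the thread):

```python
# When the last reference to an object goes away, CPython runs its
# finalizer immediately, with no waiting for a collection sweep. This is
# why a wrapper that closes its file in __del__ is cleaned up promptly
# on the happy path.
closed = []

class TracksClose:
    def __del__(self):
        closed.append(True)

obj = TracksClose()
assert closed == []   # still referenced, finalizer has not run
obj = None            # refcount hits zero; __del__ runs right here on CPython
print(closed)         # [True]
```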

@sanmai-NL (Contributor) commented Jun 13, 2023

@tgolsson To be clear, I knew that. The reference counting behavior is CPython specific, by the way. And even there, a deallocation does not guarantee that operating system resources related to the freed object, such as file handles, are released as well. See https://stackoverflow.com/questions/49512990/does-python-gc-close-files-too.

I'm just putting a question mark at offering an API that increases the likelihood of leaking memory or other resources. So I asked, are there no equally performant but more robust alternatives? The scenario I envisage is that a programmer calls get_body() many times in a frame that remains active, and so the refcount does not reach 0. Like PDM, doing a lot of get_body() calls, processing the results in a function that goes out of scope only after everything is done (seems typical for a CLI invocation). That would result in a leak if some IO object isn't closed due to a defect, or in any case a waste of resources at runtime. Indeed, in the container space I do have actual experience with resource-constrained operating system instances that run out of file descriptors at a peak in activity.

But I'm keen to measure this, as asked by @frostming. I'd just like more explanation of how the caching layer is actually (to be) used.

@tgolsson (Contributor, Author) commented Jun 13, 2023

@tgolsson To be clear, I knew that. The reference counting behavior is CPython specific, by the way. And even there, a deallocation does not guarantee that operating system resources related to the freed object, such as file handles, are released as well. See https://stackoverflow.com/questions/49512990/does-python-gc-close-files-too.

I'm aware that it is CPython-specific. It is, after all, in the previous link you told me to read. Perhaps you forgot? It's also quite obvious from the fact that I linked the CPython documentation.

And as I've written before, the HTTPResponse does close the file explicitly when dropped, via __del__. So the issue occurs if/when the body for some reason doesn't get constructed, which would happen if we switch request headers, and in a few other potential cases:

https://github.com/psf/cachecontrol/blob/ffdfe8dd65715573911925a5d231bc530f2db4e9/cachecontrol/serialize.py#L100-L150

That doesn't mean there's anything wrong with the code in PDM or this PR. For what it's worth, the caching layer is used to construct the session used by unearth when calling PyPI. This code is here:

https://github.com/pdm-project/pdm/blob/4b06c6a40f0b3026c2283a1362d44cf168dc3dae/src/pdm/models/session.py#L38C1-L60
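The ownership pattern described here - the response adopts the body file and closes it when consumed, or at latest in its finalizer - can be sketched like this (CachedResponseSketch is illustrative, not urllib3's actual HTTPResponse):

```python
import io

# Sketch of the pattern only: the response object owns the cached-body
# file handle for the rest of its lifetime.
class CachedResponseSketch:
    def __init__(self, body):
        self._fp = body

    def read(self):
        try:
            return self._fp.read()
        finally:
            self._fp.close()      # closed as soon as the body is consumed...

    def __del__(self):
        if not self._fp.closed:
            self._fp.close()      # ...or when the response itself is dropped

fp = io.BytesIO(b"cached body")
resp = CachedResponseSketch(fp)
data = resp.read()
print(data, fp.closed)  # b'cached body' True
```

The leak scenario in the thread is the case where the file is opened but such a response object never gets constructed, so nothing owns the handle.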

@sanmai-NL (Contributor) commented Jun 13, 2023

Please calm down sir. I'll respond to your comments soon, but I would prefer if you keep it businesslike.

@tgolsson (Contributor, Author)

My apologies; I wasn't trying to be rude. But reciting back what I'm saying and explaining it to me is quite condescending, as is telling me to 'go read up' by just pointing at other things to read (which is why I instead quote and reference what's important in the docs!). It gets quite frustrating, which is why I called it out.

With that out of the way, I'll try to be better. Back on topic: I hooked the FileCache class to validate my hypothesis that files get closed.

[image]

This isn't an error or a fake graph; those are actually two lines almost perfectly superimposed.

[image]

This is the wrapper:

import time

class Wrap:
    _open = []
    _close = []
    _files_open = 0
    _files_open_now = []

    def __init__(self, inner):
        self.buffer = inner
        Wrap._open.append(time.perf_counter_ns())
        Wrap._files_open += 1
        Wrap._files_open_now.append(Wrap._files_open)

    def close(self):
        self.buffer.close()
        Wrap._files_open -= 1
        Wrap._close.append(time.perf_counter_ns())

    def __getattr__(self, name):
        # Only called for attributes not found through normal lookup
        # (so never for close, defined above); delegate to the wrapped file.
        return getattr(self.buffer, name)

and then simply wrapping the file in get_body like so:

    def get_body(self, key: str) -> BinaryIO | None:
        path = self._get_cache_path(key)
        with contextlib.suppress(OSError):
            return cast(BinaryIO, Wrap(open(f"{path}.body", "rb")))

        return None

With pdm lock on PDM itself, I see 168 files opened and closed within the ~10s that takes, and at no point is more than one file open. Same with pdm sync, pdm install, etc., with varying timestamps and setups.

@sanmai-NL (Contributor) commented Jun 16, 2023

@tgolsson I think you've misunderstood my inquiry.

@tgolsson Why does caches.get_body() return a BinaryIO rather than binary data? This easily leads to file handle leaks.

@tgolsson Because every caller of get_body() then needs to close its returned file object at some point, or else the file handle will leak. That's why I asked why this function needs to return BinaryIO rather than plain binary data.

You finally performed an experiment. Thanks for that; it adds some information.
But this experiment only gives evidence suggesting that the current implementation in PDM hardly leaks file handles or any other resource through the caching layer. Was that what I claimed? I only asked about the motivation for this design, compared to an alternative design that guarantees proper closing of file handles and other resources and does not rely on careful usage of the API. I expected you, or others with something to say, to respond with a discussion of design alternatives, not with some sort of defense.

Judging by your defensive response, you seem to believe most of the following statements of fact. But they are disconnected from my own input in this thread:

  • The current implementation in PDM (this PR) leaks file handles.
  • The current implementation is designed by @tgolsson and is not the responsibility of the cachecontrol architecture.
  • The current implementation leaks file handles and this is unacceptable to me.
  • I think that @tgolsson should provide evidence that the current implementation as used with PDM does not leak file handles.
  • An implementation that's only correct under CPython is acceptable for PDM.

Lastly, you write that you understand how CPython vs. non-CPython behavior interacts with the design. I gave you pointers; you responded with information that was already known, apparently, to both of us. I hope you now see why that doesn't help the discussion.

  1. Instead, I expected you to motivate your own concession that under non-CPython, leaks may occur in more types of cases. (Please mind, I'm not claiming your concession in this referenced message covers all cases or is correct itself.)
  2. Moreover, I expected you to analyze the cases I described in add support for SplitBodyCache #1971 (comment).

@sanmai-NL (Contributor) commented Jun 16, 2023

And as I've written before, the HTTPResponse does close the file explicitly when dropped, via __del__. So the issue occurs if/when the body for some reason doesn't get constructed, which would happen if we switch request headers, and in a few other potential cases:

https://github.com/psf/cachecontrol/blob/ffdfe8dd65715573911925a5d231bc530f2db4e9/cachecontrol/serialize.py#L100-L150

Tracing back from HTTPResponse in cachecontrol through urllib3 and its superclasses, starting from your reference, I didn't find a __del__ implementation that closes the file handle. In the C source code of CPython's standard library, garbage collection is tied (to some extent) to releasing resources like file handles. Please kindly point out what code I missed; surely I will have missed the destructor method you referred to while navigating quickly across GitHub.

@tgolsson (Contributor, Author)

Sorry for the wall of text.

[...] I expected you or others who have something to say, to respond with a discussion on design alternatives, not in some sort of defense.

You did ask, but you also claimed that the design leads to leaking file handles. So if you expect discussion, I could as easily say I expect proof - otherwise it's just theorycrafting about file handles.

But I also don't agree with that sentiment, because in the intended usage and under regular (i.e., happy-path) operation the leaks cannot occur. The API is designed in an upstream package, and proper in-place usage is safe. It isn't a general-purpose cache utility, and I don't think it has to be. So I'm more interested in proving whether or not it leaks as-is, as otherwise any change would be churn.

Re your statements, or claims to my thoughts; I hope my above reasoning makes sense. Code only needs to be fit for the purpose it's intended for; nothing more. The design could be improved - like any design can be - but if it works I think it doesn't matter. So proving it doesn't work is the most tractable path so we know what to change. So that leaves the following point:

An implementation that's only correct under CPython is acceptable for PDM.

And while I do not have any say, as I'm merely a passing contributor, I do think the opposite can be argued: an implementation that does not support X is acceptable for PDM. What X is can be argued; X might be Cython, Jython, or a solo project by John Doe. For example, I know Jython has IOBase, and IOBase as defined by the Python documentation must exhibit certain behavior. I'm arguing this behavior guarantees correctness on the happy path.

For sanity, we thus have to find an implementation of Python that is correct by the standard but does not guarantee good behavior in some situation. This - and this is a leap of faith - could be any non-CPython interpreter, iff we also do not construct the HTTPResponse in cachecontrol (because the happy path is correct).

Sadly, I do not have every interpreter at hand. As a remedy, I've tried several, online and in containers, using this snippet:

import os

def foo():
    # /proc/<pid>/fd is Linux-specific; its entries are this process's open fds
    print(len(os.listdir('/proc/%s/fd' % os.getpid())))
    x = open('x.txt', 'w')  # deliberately never closed
    print(len(os.listdir('/proc/%s/fd' % os.getpid())))


print(len(os.listdir('/proc/%s/fd' % os.getpid())))
foo()
print(len(os.listdir('/proc/%s/fd' % os.getpid())))

This outputs something like 3 3 4 3 on a "good" interpreter and 3 3 4 4 on a bad one. Doing this, I can say that Cython and Stackless Python behave exactly like CPython. PyPy behaves differently, but in a way that I think is still passable: since it doesn't GC the same way CPython does, it just takes a bit longer. If I explicitly call gc.collect(), it also closes the fd. That - as far as I can tell - also means that io.IOBase cannot be reliably used without calling gc.collect(); if that's fine, then the behavior of file handles is also fine. As proof, I used my above Wrap, and its __del__ method never gets called either.

Ergo, if the behavior here leads to leaking file handles, then using io.IOBase leads to leaking file handles under PyPy. The extension of that reasoning is that PyPy isn't a correctly implemented interpreter under those specific rules. I.e.,

x = WrapWithDel(open(...))

and

x = IOBaseDerivative(open(...))

would both leak the file handle, even if both are guaranteed to close it if __del__ is called.

Tracing back from HTTPResponse in cachecontrol through urllib3 and its superclasses, starting from your reference, I didn't find a __del__ implementation that closes the file handle. In the C source code of CPython's standard library, garbage collection is tied (to some extent) to releasing resources like file handles. Please kindly point out what code I missed; surely I will have missed the destructor method you referred to while navigating quickly across GitHub.

It's in the code you link to - Py_tp_finalize corresponds to __del__. The binding occurs on line 882, and it's defined on line 296. The call to close occurs on line 321. Since that invokes it through self, it finds the most derived close to call, i.e., the one on HTTPResponse, which in turn calls close on the file, just like it called Wrap.close in my experiment.

I'm just putting a question mark at offering an API that increases the likelihood of leaking memory or other resources. So I asked, are there no equally performant but more robust alternatives? The scenario I envisage is that a programmer calls get_body() many times in a frame that remains active, and so the refcount does not reach 0. Like PDM, doing a lot of get_body() calls, processing the results in a function that goes out of scope only after everything is done (seems typical for a CLI invocation).

A programmer does not call get_body many times in a frame, because this isn't a general-purpose API; it is only used for caching urllib3 responses. Chainsaws are dangerous in the hands of a fool; that isn't a flaw in their design if you use them for their intended purpose. Given the API here - and those upstream - changing them would be a breaking change and would have to be dealt with as such, whether planned or not. The other case, of course, is adding new code in PDM, in which case whoever does it has to design their code so that file handles are closed, which also applies to the correct-by-definition happy path.

That would result in a leak if some IO object isn't closed due to a defect, or in any case a waste of resources at runtime. Indeed, in the container space I do have actual experience with resource-constrained operating system instances that run out of file descriptors at a peak in activity.

I agree that if an IO object isn't closed due to a defect, that's a bug to be fixed. ;-) And if your resource-constrained operating system is doing pdm install in containers, you probably have an architectural problem. It's just not an environment where most build tools - whether PDM, Pants, Bazel, Cargo, or CMake - would thrive, or one they're designed for.

@sanmai-NL (Contributor)

Thanks for the analysis. No problem with the amount of text; whoever doesn't want to read it can skip the topic.

My general response is: building robust interfaces matters. We can argue that there's no problem now, but what I'm looking for is that there's no problem tomorrow as well. The criterion for me is how much of an investment it would have been to implement an alternative, more robust design.

I agree that if an IO object isn't closed due to a defect, that's a bug to be fixed. ;-) And if your resource-constrained operating system is doing pdm install in containers you probably have an architectural problem. It's just not an environment where most build tools - whether it's PDM, Pants, Bazel, Cargo, or CMake would thrive - or be designed for.

The resource-constrained OS resides within the container. The maximum number of file descriptors can sometimes be low by default (e.g., 1024). I hope you agree that building software within containers isn't unusual.

@j178 mentioned this pull request Apr 3, 2024