ENH: Changing FFT cache to a bounded LRU cache #7686
Conversation
```python
c.setdefault(3, [np.array([1, 2])])
assert_array_almost_equal(c[3][0], np.array([1, 2]))

assert c._get_size() == np.ones(2).nbytes * 3
```
Use `assert_(...)`, plain old `assert` goes away in optimized Python.
Interesting point. For pytest and nose I usually just use plain asserts, but your point is very valid. I changed it to the `assertEqual()` method of the test class. I hope that's ok.
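For reference, a small sketch of the difference the reviewer is pointing at (the cached value here is a hypothetical stand-in, not the test's actual object): `numpy.testing.assert_` still performs its check under `python -O`, where a bare `assert` statement is compiled away.

```python
import numpy as np
from numpy.testing import assert_

# Hypothetical stand-in for a cached work-array list, as in the test above.
cache_entry = [np.array([1, 2])]

# Unlike a bare `assert`, this check also runs under `python -O`.
assert_(cache_entry[0].nbytes == np.array([1, 2]).nbytes)

try:
    assert_(False, msg="still raises under -O")
except AssertionError as exc:
    print("raised:", exc)  # raised: still raises under -O
```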
Fails on Linux 386 and Windows. Makes me suspect a dtype issue.

This enhancement looks reasonable to me, but should be discussed on the list. Please make a post.
I think it failed because I carelessly assumed the dtype of some array creation functions. I pushed a fix. Let's see if CI passes.
Will do.
I wonder why this cache is even there. At least for a few simple examples, the creation time of the work array is much shorter than running the FFT:
Here is a quick and dirty but more extensive benchmark - it tests the total time for 100 FFTs of equal length with and without the cache, which IMHO is one of the most interesting quantities. Lots of real world applications require a ton of FFTs of the same length.

```python
import time
import numpy as np

print("NPTS       t for 100 runs w cache    t for 100 runs w/o cache")
print("================================================================")
for length in [1E3, 1E4, 1E5, 1E6, 1E7]:
    length = int(length)
    d = np.random.random(length)
    d = np.require(d, dtype=np.float64)
    # Warmup
    np.fft.fft(d)
    np.fft.fftpack._fft_cache.clear()

    total_time = 0
    for _ in range(100):
        a = time.time()
        np.fft.fft(d)
        b = time.time()
        total_time += b - a

    total_time_no_cache = 0
    for _ in range(100):
        np.fft.fftpack._fft_cache.clear()
        a = time.time()
        np.fft.fft(d)
        b = time.time()
        total_time_no_cache += b - a

    print("%7i %11.5f %11.5f" % (
        length, total_time, total_time_no_cache))
```

Results on my machine:
IMHO the cache is more than worth it as it will greatly speed up many real world workflows and many people will not have enough experience to use the FFTW wrappers or other FFT implementations.
@krischer - yes, that is quite convincing! I agree doing many FFTs in sequence is a common use case and I'll definitely take that speed improvement! Then the remaining question is what is a reasonably optimal cache size, and how do we define size. On the mailing list, I suggested also adding a limit to the number of entries. What do you think? In any case, just to be clear: I think this PR is an improvement no matter what.
I guess there's no point in spending much time trying to fine tune the cache eviction heuristics, because we have absolutely no data :-). Any limit is obviously better than no limit; beyond that we just don't know. So I think there are two productive things we might do: either find some source of data - maybe a real program that benefits from the cache? - or else merge and move on :-). We can always fine tune later...
I did read that and I think it is a good idea. I'll add it - it will then be limited by either memory size or total item count - whichever reaches its limit first. The limits should be discussed. They could be made configurable but that is IMHO too much complexity, and really knowledgeable people could just monkey patch the caches or use another FFT implementation. I think 8 for the max item count as you proposed on the mailing list is a bit on the low side. I would propose at least 16-64 if not even higher. The trickier limit is the total allowed cache size - this PR so far has an arbitrary number of 100 MB but that might even be a bit too low - the following plot shows the size of a single cache item for various lengths of FFTs.
Yes, work array size is length * 2 * 16 bytes (complex128), so 320 MB for 1e7 elements. (Actually, given this your

In my workloads at least, I often have one large FFT and many smaller ones. One could envision something like
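The size estimate above works out as follows (a trivial check; the 2 * 16 bytes-per-element figure is taken straight from the comment, the function name is mine):

```python
# Roughly 2 complex128 (16-byte) values of twiddle factors per element
# of the transform, per the estimate in the discussion above.
def work_array_bytes(length):
    return length * 2 * 16

print(work_array_bytes(10**7))  # 320000000 bytes, i.e. 320 MB
```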
@njsmith - you're right that we should not fine-tune this too much. I'd be happy with just putting a limit of
I like that. The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again. I'd be happy with that as people doing these calculations can be assumed to have lots of memory. If nobody objects I'll implement that.
True - but then I'd have to specialize for full and real valued FFTs. The current way is IMHO simpler and fast enough.
But that is true now too, and there doesn't seem to be an easy way (barring a time-based clearing). (And at least now I know why it is often good to just exit python and restart... It seems there are many other places where "handy caches" are kept.)
The magic size of the cache is: dim =
The C code is very dense and I don't have the time right now to break it down, but cache size already is
pyfftw has a timed cache implemented with a separate thread but that is probably way too much complexity for the core numpy package: https://github.com/pyFFTW/pyFFTW/blob/master/pyfftw/interfaces/cache.py

In any case - barring CI failures this PR is done from my point of view - please review it or let me know of any other desired changes. There are still two caches - one for real- and one for complex-valued FFTs. Old values will be evicted upon getting/setting if either the total size of a cache or its total item count exceeds its limit.
The cache will also never be fully cleared; at least one item will always remain, as otherwise single large items would never get stored in the cache, and one also wants to benefit from the cache in these cases.
Looks all good to me! I also like that if anybody looks at the code, it is quite obvious what to do if one wants a different cache.
Needs mention in the release notes.
It is currently branched off master but it might make sense to rebase it on top of one of the maintenance branches. Whether to do so, and onto which of them, is up to you. Or would you rather see a backport?
@krischer Everything starts in master and backports are restricted to bug fixes.
Note that the cached data is linear in the transform size, so that for very large transforms the relative time spent in the computation of the twiddle factors will decrease.
@krischer Still needs mention in the release notes.
Done. I guess this PR could also be considered a bugfix for unexpected and unwanted behavior, but it appears not to have caused a lot of practical problems in the past, so I'm fine with having it only in future numpy versions.
```python
class _FFTCache(object):
    """
    Cache for the FFT init functions as an LRU (least recently used) cache.
```
Should give this a standard docstring, see doc/HOWTO_DOCUMENT.rst.txt. All that is really needed here is a Parameters section with the two arguments to `__init__`.
LGTM modulo class docstring.
☔ The latest upstream changes (presumably #7704) made this pull request unmergeable. Please resolve the merge conflicts.
Needs rebase also, probably the release notes.
Rebased, force pushed, and improved the docstring.
Great. One more thing ;) Could you squash the commits --
Replaces the simple dictionary caches for the twiddle factors of numpy.fft to bounded LRU (least recently used) caches. The caches can thus no longer grow without bounds. See numpy#7686.
Sure thing :-)
Thanks @krischer.
Hmm, I got an error report on merge. Looks like it occurred in
Not sure if that is a threading problem or a timeout problem.
How did you trigger it? CI and local tests pass for me. EDIT: I should read more carefully. Can you reproduce this reliably or does it only happen occasionally? Also, what OS did you run it on?
Travis CI runs the tests after merges and emails me the result, I didn't run the tests myself. You can see the report at https://travis-ci.org/numpy/numpy/builds/135682171. It is probably sporadic, which is the worst kind of problem...
Mhm. I can occasionally reproduce it locally by greatly increasing the number of threads the tests run with. I'll look into it.
I'm pretty sure this happens because

```python
# As soon as we put wsave back into the cache, another thread could pick it
# up and start using it, so we must not do this until after we're
# completely done using it ourselves.
fft_cache[n].append(wsave)
```

now has a
I would implement the latter variant, but I'm not sure how to reliably test it. Any ideas?
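The suspected race can be illustrated deterministically without threads (a simplified model, not the numpy code): an LRU lookup is a pop followed by a re-insert, i.e. two separate steps, and a second lookup that runs between the two steps finds the key missing.

```python
from collections import OrderedDict

# Hypothetical cache contents; the key name is made up for illustration.
cache = OrderedDict(n16="twiddle factors for n=16")

value_thread_1 = cache.pop("n16")   # thread 1 starts its lookup...

outcome = "no error"
try:
    cache.pop("n16")                # ...thread 2 looks up the same key
except KeyError:
    outcome = "thread 2 sees KeyError"

cache["n16"] = value_thread_1       # thread 1 finishes by re-inserting
print(outcome)  # thread 2 sees KeyError
```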
Well, lots of threads ;) Probably the best thing up front is for some experienced folks to look at the code and think about it ;)
Fair enough. To reliably trigger the bug as I interpret it, change the

```python
def __getitem__(self, key):
    # pop + add to move it to the end.
    value = self._dict.pop(key)
    import time
    time.sleep(0.01)
    ...
```

Both proposed solutions would work.
Don't like that much. Maybe derive the FFTCache from threading.local
Good idea. It's a one line change and seems to work fine. But I also have little experience in that regard so maybe there are some hidden pitfalls? The downside is that the total possible cache size would then depend on the number of threads. A simple lock around the
I believe each thread would get its own cache, not sure if the initial data is copied from the parent thread. The total amount of possible data will grow with the number of threads, but I expect anyone running hundreds of threads will have the resources needed to handle that. I'm not sure the simple lock will work, although I expect that a relevant process will hold a reference to the data even if it is ejected from the cache, so maybe all is well. There is a context manager for handling such locks and I suppose the lock contention is likely to be small.
Of course, with a lot of threads doing different size transforms, much cache data is likely to be evicted before it can be used. Python threading is really time slicing, it doesn't run on multiple processors and ideally there would be a single thread dedicated to, say, running ffts while other threads might handle io, data acquisition, user interfaces and other such. I don't think either multiple caches or a lock would be a problem in practice.
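A minimal sketch of the threading.local idea discussed above (class and method names are my assumption, not numpy's code): deriving the cache from `threading.local` gives every thread an independent underlying dict, so no locking is needed, at the cost of one cache per thread.

```python
import threading


class _PerThreadCache(threading.local):
    """Sketch: each thread transparently sees its own _dict."""

    def __init__(self):
        # threading.local calls __init__ again on first use in a new thread,
        # so every thread starts with an empty dict.
        self._dict = {}

    def put(self, key, value):
        self._dict[key] = value

    def get(self, key):
        return self._dict.get(key)


cache = _PerThreadCache()
cache.put(8, "twiddle factors for n=8")

seen_in_worker = []


def worker():
    # A fresh thread gets a fresh, empty per-thread dict.
    seen_in_worker.append(cache.get(8))


t = threading.Thread(target=worker)
t.start()
t.join()
print(cache.get(8))       # visible in the main thread
print(seen_in_worker[0])  # None: the worker never saw the main thread's entry
```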
How about just wrapping a lock around the cache access methods? Contention should be basically nil in any reasonable code.
@njsmith If they all use the same lock that should work.
Right, specifically I'm suggesting that
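A sketch of the shared-lock approach (hypothetical names; not the merged implementation): one lock serializes every access method, so the pop/re-insert inside `__getitem__` can no longer interleave across threads.

```python
import threading
from collections import OrderedDict


class _LockedLRUCache:
    """Sketch: an LRU cache whose access methods share a single lock."""

    def __init__(self, max_items=32):
        self._dict = OrderedDict()
        self._max_items = max_items
        self._lock = threading.Lock()

    def __setitem__(self, key, value):
        with self._lock:
            self._dict[key] = value
            while len(self._dict) > self._max_items:
                self._dict.popitem(last=False)  # evict least recently used

    def __getitem__(self, key):
        with self._lock:
            # pop + re-insert marks the entry as most recently used; under
            # the lock, no other thread can observe the intermediate state.
            value = self._dict.pop(key)
            self._dict[key] = value
            return value


cache = _LockedLRUCache(max_items=2)
cache[4] = "w4"
cache[8] = "w8"
_ = cache[4]       # touch 4 so it becomes most recently used
cache[16] = "w16"  # evicts 8, the least recently used entry
print(sorted(cache._dict))  # [4, 16]
```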
why can't we add a simple
The currently existing simple dictionary cache can grow without limit. With this PR neither cache can grow much larger than 100 MB (an arbitrary limit). It is implemented as an LRU (least recently used) cache, so it will always remove the items that have been get/set least recently. This should still reap most benefits of the original cache implementation without the potential to use a large amount of memory.
The following snippet will fill a lot of memory without this patch:
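The snippet itself is elided in this transcript; the following is a reconstruction of the idea, not the original code. In fftpack-era NumPy, every distinct transform length adds its own work array to the module-level cache and nothing is ever evicted, so transforming many distinct lengths accumulates memory without bound.

```python
import numpy as np

# Each distinct length n adds a new twiddle-factor work array to the
# unbounded module-level cache (in NumPy versions before this PR).
for n in range(1, 512):
    np.fft.fft(np.zeros(n))

# Rough footprint at ~32 bytes of cached work array per element (per the
# size estimate in the discussion); large lengths dominate this very fast.
estimated_bytes = sum(32 * n for n in range(1, 512))
print(estimated_bytes)  # 4186112, already ~4 MB for these small sizes
```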
This is not just a theoretical issue - we've encountered this in practice here: obspy/obspy#1424
I'm not entirely sure if the additional complexity is really worth it but I don't see a simpler solution to achieve the same effect. The proposed LRU cache will always retain the most recently used items and throws away old ones as soon as the specified memory limit is reached. Thus most users will still benefit from the cache. I think it still works correctly when used in threaded code but I did not extensively test it.
It currently has a limit of 100 MB for each cache which is an arbitrary limit that works for my use cases.
The PR is currently based on the current master - let me know if you want it rebased on some other branch.