
ENH: Changing FFT cache to a bounded LRU cache #7686

Merged
merged 1 commit into numpy:master on Jun 6, 2016

Conversation

krischer
Contributor

The existing simple dictionary caches can grow without limit. With this PR, neither cache can grow much larger than 100 MB (an arbitrary limit). It is implemented as an LRU (least recently used) cache, so it always evicts the items that have been read or written least recently. This should still reap most of the benefits of the original cache implementation without the potential to use a large amount of memory.

The following snippet will fill a lot of memory without this patch:

import numpy as np

for i in range(10000):
    print(i)
    np.fft.ifft(np.empty(10000 + i, dtype=np.complex128))

This is not just a theoretical issue - we've encountered this in practice here: obspy/obspy#1424

I'm not entirely sure the additional complexity is really worth it, but I don't see a simpler solution that achieves the same effect. The proposed LRU cache always retains the most recently used items and throws away old ones as soon as the specified memory limit is reached, so most users will still benefit from the cache. I think it still works correctly when used in threaded code, but I did not test that extensively.

Each cache currently has a limit of 100 MB, which is an arbitrary value that works for my use cases.

The PR is based on current master - let me know if you want it rebased onto some other branch.
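As an illustration of the mechanism only - the class name, the method names, and the 100 MB default below are made up and do not match the PR's actual _FFTCache - a byte-bounded LRU cache can be sketched on top of collections.OrderedDict:

import collections


class BoundedLRUCache(object):
    """Minimal LRU cache bounded by the total bytes of its (array) values."""

    def __init__(self, max_size_in_mb=100):
        self._max_bytes = max_size_in_mb * 1024 ** 2
        self._dict = collections.OrderedDict()

    def __setitem__(self, key, value):
        # Re-inserting moves the key to the most-recently-used end.
        self._dict.pop(key, None)
        self._dict[key] = value
        self._prune()

    def __getitem__(self, key):
        # Pop and re-insert so that a cache hit also counts as "recently used".
        value = self._dict.pop(key)
        self._dict[key] = value
        return value

    def _total_bytes(self):
        return sum(v.nbytes for v in self._dict.values())

    def _prune(self):
        # Drop least-recently-used entries until we are under the byte limit,
        # but never evict the entry that was just added.
        while len(self._dict) > 1 and self._total_bytes() > self._max_bytes:
            self._dict.popitem(last=False)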

@charris charris changed the title Changing FFT cache to a limited LRU cache ENH: Changing FFT cache to a limited LRU cache May 27, 2016
c.setdefault(3, [np.array([1, 2])])
assert_array_almost_equal(c[3][0], np.array([1, 2]))

assert c._get_size() == np.ones(2).nbytes * 3
Member

Use assert_(...), plain old assert goes away in optimized Python.
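(For context: plain assert statements are compiled out when Python runs with -O, while numpy.testing.assert_ is an ordinary function call and always executes. A tiny standalone illustration, not taken from the test under review:)

import numpy as np
from numpy.testing import assert_

a = np.ones(2)
# Under `python -O` the next statement is removed entirely and checks nothing.
assert a.nbytes == 16
# assert_ is a normal function call, so it still runs under -O.
assert_(a.nbytes == 16)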

Contributor Author

Interesting point. For pytest and nose I usually just use plain asserts but your point is very valid. I changed it to the assertEqual() method of the test class. I hope that's ok.

@charris
Member

charris commented May 27, 2016

Fails on Linux 386 and Windows. Makes me suspect a C long problem somewhere; Windows (C) long is always 32 bits.

This enhancement looks reasonable to me, but should be discussed on the list. Please make a post.

@krischer
Contributor Author

I think it failed because I carelessly assumed the dtype of some array creation functions. I pushed a fix. Let's see if CI passes.

This enhancement looks reasonable to me, but should be discussed on the list. Please make a post.

Will do.

@krischer krischer changed the title ENH: Changing FFT cache to a limited LRU cache ENH: Changing FFT cache to a bounded LRU cache May 27, 2016
@mhvk
Contributor

mhvk commented May 27, 2016

I wonder why this cache is even there. At least for a few simple examples, creating the work array takes much less time than running the FFT:

In [15]: %timeit np.fft.fftpack.fftpack.cffti(10000)
1000 loops, best of 3: 463 µs per loop

In [16]: %timeit np.fft.ifft(np.empty(10000, dtype=np.complex128)) 
100 loops, best of 3: 3.76 ms per loop

In [17]: %timeit np.fft.ifft(np.empty(100, dtype=np.complex128))
10000 loops, best of 3: 30.7 µs per loop

In [18]: %timeit np.fft.fftpack.fftpack.cffti(100)
100000 loops, best of 3: 5.58 µs per loop

@krischer
Contributor Author

krischer commented May 30, 2016

Here is a quick and dirty but more extensive benchmark - it tests the total time for 100 FFTs of equal length with and without the cache, which IMHO is one of the most interesting quantities. Lots of real-world applications require a ton of FFTs of the same length.

import time
import numpy as np

print("NPTS         t for 100 runs w cache     t for 100 runs w/o cache")
print("================================================================")

for length in [1E3, 1E4, 1E5, 1E6, 1E7]:
    length = int(length)
    d = np.random.random(length)
    d = np.require(d, dtype=np.float64)
    # Warmup
    np.fft.fft(d)
    np.fft.fftpack._fft_cache.clear()

    total_time = 0
    for _ in range(100):
        a = time.time()
        np.fft.fft(d)
        b = time.time()
        total_time += b - a

    total_time_no_cache = 0
    for _ in range(100):
        np.fft.fftpack._fft_cache.clear()
        a = time.time()
        np.fft.fft(d)
        b = time.time()
        total_time_no_cache += b - a

    print("%7i        %11.5f                    %11.5f" % (
        length, total_time, total_time_no_cache))

Results on my machine:

NPTS         t for 100 runs w cache     t for 100 runs w/o cache
================================================================
   1000            0.00235                        0.00450
  10000            0.02120                        0.04525
 100000            0.29368                        0.50182
1000000            4.40410                        7.07679
10000000           71.06963                      104.85043

IMHO the cache is more than worth it, as it will greatly speed up many real-world workflows, and many people will not have enough experience to use the FFTW wrappers or other FFT implementations.

@mhvk
Contributor

mhvk commented May 30, 2016

@krischer - yes, that is quite convincing! I agree doing many FFTs in sequence is a common use case and I'll definitely take that speed improvement!

Then the remaining question is what is a reasonably optimal cache size, and how do we define size. On the mailing list, I suggested also adding a limit to the number of entries. What do you think?

In any case, just to be clear: I think this PR is an improvement no matter what.

@njsmith
Member

njsmith commented May 30, 2016

I guess there's no point in spending much time trying to fine tune the cache eviction heuristics, because we have absolutely no data :-). Any limit is obviously better than no limit; beyond that we just don't know.

So I think there are two productive things we might do: either find some source of data - maybe a real program that benefits from the cache? - or else merge and move on :-). We can always fine tune later...

@krischer
Contributor Author

Then the remaining question is what is a reasonably optimal cache size, and how do we define size. On the mailing list, I suggested also adding a limit to the number of entries. What do you think?

I did read that and I think it is a good idea. I'll add it - it will then be limited by either memory size or total item count - whatever reaches its limit first.

The limits should be discussed. They could be made configurable but that is IMHO too much complexity and really knowledgeable people could just monkey patch the caches or use another FFT implementation.

I think 8 for the max item count, as you proposed on the mailing list, is a bit on the low side. I would propose at least 16-64 if not even higher. The trickier limit is the total allowed cache size - this PR so far has an arbitrary number of 100 MB, but that might even be a bit too low - the following plot shows the size of a single cache item for various FFT lengths.

[Figure: size of a single cache item as a function of FFT length]

@mhvk
Contributor

mhvk commented May 30, 2016

Yes, the work array size is length * 2 * 16 bytes (complex128), so 320 MB for 1e7 elements. (Actually, given this, your _get_size could just sum the keys and assume 32 bytes per element.)

In my workloads at least, I often have one large FFT and many smaller ones. One could envision something like max_cache_size = max(some_min_size, largest_item_size * 1.5)
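(A sketch of that rule with made-up names - it simply picks whichever bound is larger:)

def max_cache_bytes(cached_arrays, min_bytes=100 * 1024 ** 2):
    # Allow at least `min_bytes` (100 MB here), but if the largest cached
    # work array is bigger than that, stretch the limit to 1.5x its size so
    # that one huge transform can still live alongside the smaller ones.
    sizes = [a.nbytes for a in cached_arrays]
    largest_item_size = max(sizes) if sizes else 0
    return max(min_bytes, int(largest_item_size * 1.5))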

@mhvk
Contributor

mhvk commented May 30, 2016

@njsmith - you're right that we should not fine-tune this too much. I'd be happy with just putting a limit of 32 on the total number of items, and keeping it at 100MB or a single item otherwise.

@krischer
Contributor Author

In my workloads at least, I often have one large FFT and many smaller ones. One could envision something like max_cache_size = max(some_min_size, largest_item_size * 1.5)

I like that. The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again. I'd be happy with that as people doing these calculations can be assumed to have lots of memory. If nobody objects I'll implement that.

@krischer
Contributor Author

Yes, work array size is length * 2 * 16 bytes (complex128), so 320 MB for 1e7 elements. (Actually, given this your get_size could just sum the keys and assume those are 32 byte items).

True - but then I'd have to specialize for full and real valued FFTs. The current way is IMHO simpler and fast enough.

@mhvk
Contributor

mhvk commented May 30, 2016

The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again.

But that is true now too, and there doesn't seem to be an easy way around it (barring time-based clearing).

(And at least now I know why it is often good to just exit python and restart... It seems there are many other places where "handy caches" are kept.)

@seberg
Member

seberg commented May 30, 2016

The magic size of the cache is dim = 4*n + 15; the actually needed cache is only 2*n + 15 (i.e. you could halve the needed cache if you change the code a bit). Since the factorization (the +15) does not seem to be the very slow part, I guess we have no choice except adding some heuristic.
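(These numbers are easy to check against the fftpack backend this PR targets; on NumPy 1.17+ the numpy.fft.fftpack module no longer exists, so treat this purely as an illustration of the sizes quoted above:)

import numpy as np

n = 100
# Complex FFT initializer: work array of 4*n + 15 doubles, per the above.
print(np.fft.fftpack.fftpack.cffti(n).shape)   # expected: (415,)
# Real-valued FFT initializer: work array of 2*n + 15 doubles.
print(np.fft.fftpack.fftpack.rffti(n).shape)   # expected: (215,)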

@krischer
Contributor Author

krischer commented Jun 2, 2016

The magic size of the cache is dim = 4*n + 15; the actually needed cache is only 2*n + 15 (i.e. you could halve the needed cache if you change the code a bit). Since the factorization (the +15) does not seem to be the very slow part, I guess we have no choice except adding some heuristic.

The C code is very dense and I don't have the time right now to break it down, but the cache size already is 2*n + 15 for real valued input. It is only 4*n + 15 for complex valued input. As far as I currently understand the C code, it also uses all 4*n + 15 cache entries, but I might be mistaken.

But that is true now too, and there doesn't seem to be an easy way around it (barring time-based clearing).

pyfftw has a timed cache implemented with a separate thread but that is probably way too much complexity for the core numpy package: https://github.com/pyFFTW/pyFFTW/blob/master/pyfftw/interfaces/cache.py

In any case - barring CI failures this PR is done from my point of view - please review it or let me know of any other desired changes. There are still two caches - one for real and one for complex valued FFTs. Old values will be evicted upon getting/setting if:

cache size > max(100 MB, largest_item_size * 1.5) OR > 32 items in cache

The cache will also never be fully cleared: at least one item always remains, as otherwise single large items would never be stored in the cache, and one also wants to benefit from the cache in these cases.
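(Sketched with made-up names rather than the helper's real ones, the eviction step looks roughly like this: drop entries from the least-recently-used end while either bound is exceeded, and always stop before removing the last entry.)

def prune(cache, max_bytes=100 * 1024 ** 2, max_items=32):
    # `cache` is an OrderedDict mapping FFT length -> list of work arrays,
    # ordered from least to most recently used.
    def total_bytes():
        return sum(a.nbytes for arrays in cache.values() for a in arrays)

    sizes = [a.nbytes for arrays in cache.values() for a in arrays]
    largest_item = max(sizes) if sizes else 0
    # The byte limit is stretched to 1.5x the largest stored item so that a
    # single oversized work array may still stay in the cache.
    limit = max(max_bytes, int(1.5 * largest_item))
    while len(cache) > 1 and (total_bytes() > limit or len(cache) > max_items):
        cache.popitem(last=False)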

@mhvk
Contributor

mhvk commented Jun 2, 2016

Looks all good to me! I also like that if anybody looks at the code, it is quite obvious what to do if one wants a different cache.

@charris
Member

charris commented Jun 2, 2016

Needs mention in doc/release/1.12.0-notes.rst.

@krischer
Contributor Author

krischer commented Jun 3, 2016

It is currently branched off master, but it might make sense to rebase it on top of one of the maintenance branches. Whether to do so, and onto which branch, is up to you. Or would you rather see a backport?

@charris
Member

charris commented Jun 3, 2016

@krischer Everything starts in master and backports are restricted to bug fixes.

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jun 3, 2016
@charris
Member

charris commented Jun 3, 2016

Note that the cached data is linear in the transform size, so that for very large transforms the relative time spent in computation of the twiddle factors will decrease.

@charris
Member

charris commented Jun 3, 2016

@krischer Still needs mention in the release notes.

@krischer
Contributor Author

krischer commented Jun 3, 2016

@krischer Still needs mention in the release notes.

Done.

I guess this PR could also be considered a bugfix for unexpected and unwanted behavior but it appears to not have caused a lot of practical problems in the past so I'm fine with having it only in future numpy versions.


class _FFTCache(object):
    """
    Cache for the FFT init functions as an LRU (least recently used) cache.
Member

Should give this a standard docstring, see doc/HOWTO_DOCUMENT.rst.txt. All that is really needed here is a Parameters section with the two arguments to __init__.

@charris
Member

charris commented Jun 4, 2016

LGTM modulo class docstring.

@homu
Contributor

homu commented Jun 4, 2016

☔ The latest upstream changes (presumably #7704) made this pull request unmergeable. Please resolve the merge conflicts.

@charris
Member

charris commented Jun 5, 2016

Needs rebase also, probably the release notes.

@krischer
Contributor Author

krischer commented Jun 6, 2016

Rebased, force pushed, and improved the docstring.

@charris
Member

charris commented Jun 6, 2016

Great. One more thing ;) Could you squash the commits -- git rebase -i HEAD~8 -- and take a look at formatting the commit message as documented in doc/source/dev/gitwash/development_workflow.rst?

Replaces the simple dictionary caches for the twiddle factors of
numpy.fft with bounded LRU (least recently used) caches. The caches
can thus no longer grow without bounds.

See numpy#7686.
@krischer
Contributor Author

krischer commented Jun 6, 2016

Sure thing :-)

@charris charris merged commit 175476f into numpy:master Jun 6, 2016
@charris
Member

charris commented Jun 6, 2016

Thanks @krischer .

@charris
Member

charris commented Jun 6, 2016

Hmm, I got an error report on merge

Exception in thread Thread-20:
Traceback (most recent call last):
  File "/opt/python/2.7.9/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/opt/python/2.7.9/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/tests/test_fftpack.py", line 132, in worker
    q.put(func(*args))
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/fftpack.py", line 286, in ifft
    output = _raw_fft(a, n, axis, fftpack.cffti, fftpack.cfftb, _fft_cache)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/fftpack.py", line 89, in _raw_fft
    fft_cache[n].append(wsave)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/helper.py", line 262, in __getitem__
    value = self._dict.pop(key)
  File "/opt/python/2.7.9/lib/python2.7/collections.py", line 143, in pop
    raise KeyError(key)
KeyError: 200

Looks like it occurred in

ERROR: test_ifft (test_fftpack.TestFFTThreadSafe)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/tests/test_fftpack.py", line 154, in test_ifft
    self._test_mtsame(np.fft.ifft, a)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/tests/test_fftpack.py", line 145, in _test_mtsame
    assert_array_equal(q.get(timeout=5), expected,
  File "/opt/python/2.7.9/lib/python2.7/Queue.py", line 176, in get
    raise Empty
Empty

Not sure if that is a threading problem or a timeout problem.

@krischer
Contributor Author

krischer commented Jun 6, 2016

How did you trigger it? CI and local tests pass for me.

EDIT: I should read more carefully. Can you reproduce this reliably or does it only happen occasionally? Also what OS did you run it on?

@charris
Member

charris commented Jun 6, 2016

Travis CI runs the tests after merges and emails me the result; I didn't run the tests myself. You can see the report at https://travis-ci.org/numpy/numpy/builds/135682171. It is probably sporadic, which is the worst kind of problem...

@krischer
Contributor Author

krischer commented Jun 6, 2016

Mhm. I can occasionally reproduce it locally by greatly increasing the number of threads the tests run with. I'll look into it.

@charris
Member

charris commented Jun 6, 2016

My guess is that an entry gets deleted before access. You probably need some way to track whether an entry is in use before deletion, which means the cache could grow with the number of threads. This sounds like a difficult problem unless the cache can be made thread local.

@pv @njsmith Thoughts?

@krischer
Contributor Author

krischer commented Jun 6, 2016

I'm pretty sure this happens because

    # As soon as we put wsave back into the cache, another thread could pick it
    # up and start using it, so we must not do this until after we're
    # completely done using it ourselves.
    fft_cache[n].append(wsave)

now has a __getitem__() that consists of multiple Python-level steps and is thus no longer atomic under the GIL. So this could either be solved with a mutex, or with a simple try/except with the result that the occasional cache write will silently but harmlessly fail in the threaded case.

I would implement the latter variant but I'm not sure how to reliably test it. Any ideas?
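(A sketch of the try/except variant; fft_cache below is a hypothetical stand-in for the module-level cache, and the real write happens inside _raw_fft as the traceback above shows:)

# Hypothetical stand-in for numpy.fft.fftpack's module-level _fft_cache.
fft_cache = {}

def put_back(n, wsave):
    # If another thread evicted key `n` between the earlier lookup and this
    # write, the KeyError is harmless: we simply skip re-caching wsave.
    try:
        fft_cache[n].append(wsave)
    except KeyError:
        pass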

@charris
Member

charris commented Jun 6, 2016

Well, lots of threads ;) Probably the best thing up front is for some experienced folks to look at the code and think about it ;)

@krischer
Contributor Author

krischer commented Jun 6, 2016

Fair enough.

To reliably trigger the bug as I interpret it, change the __getitem__() method of the FFTCache object in numpy/fft/helper.py to:

    def __getitem__(self, key):
        # pop + add to move it to the end.
        value = self._dict.pop(key)
        import time
        time.sleep(0.01)
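        # The sleep widens the race window so another thread reliably pops the same key first.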
        ...

Both proposed solutions would work.

@charris
Member

charris commented Jun 6, 2016

Don't like that much. Maybe derive the FFTCache from threading.local? I don't have the experience to say it will work, but it is easy and seems worth trying.
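(A sketch of what deriving from threading.local buys: __init__ is executed once per thread that touches the instance, so every thread ends up with its own private dictionary and no locking is needed. The class below is illustrative, not the PR's _FFTCache.)

import threading


class ThreadLocalCache(threading.local):
    # Because the base class is threading.local, __init__ runs separately in
    # every thread that accesses the instance, so each thread sees its own
    # independent _dict.
    def __init__(self):
        self._dict = {}

    def __setitem__(self, key, value):
        self._dict[key] = value

    def __getitem__(self, key):
        return self._dict[key]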

@krischer
Contributor Author

krischer commented Jun 6, 2016

Good idea. It's a one-line change and seems to work fine. But I also have little experience in that regard, so maybe there are some hidden pitfalls?

The downside is that the total possible cache size would then depend on the number of threads. A simple lock around the __getattr__() method also works, and then the total cache size is bounded no matter the number of threads.

@charris
Member

charris commented Jun 6, 2016

I believe each thread would get its own cache; I'm not sure if the initial data is copied from the parent thread. The total amount of possible data will grow with the number of threads, but I expect anyone running hundreds of threads will have the resources needed to handle that. I'm not sure the simple lock will work, although I expect that a relevant process will hold a reference to the data even if it is ejected from the cache, so maybe all is well. There is a context manager for handling such locks, and I suppose the lock contention is likely to be small.

@charris
Member

charris commented Jun 6, 2016

Of course, with a lot of threads doing different-size transforms, much cache data is likely to be evicted before it can be used. Python threading is really time slicing; it doesn't run on multiple processors, and ideally there would be a single thread dedicated to, say, running FFTs while other threads handle I/O, data acquisition, user interfaces, and the like. I don't think either multiple caches or a lock would be a problem in practice.

@njsmith
Member

njsmith commented Jun 6, 2016

How about just wrapping a lock around the cache access methods? Contention should be basically nil in any reasonable code.

@charris
Member

charris commented Jun 6, 2016

@njsmith If they all use the same lock that should work.

@njsmith
Member

njsmith commented Jun 6, 2016

Right, specifically I'm suggesting that FFTCache.__init__ do something like self._lock = ..., and then all the method bodies do with self._lock: ...
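(Spelled out, the suggested pattern looks roughly like this; the class and method names are illustrative:)

import threading


class LockedFFTCache(object):
    def __init__(self):
        # One lock shared by all methods; it is held only for the brief
        # dictionary bookkeeping, so contention stays negligible.
        self._lock = threading.Lock()
        self._dict = {}

    def put(self, key, value):
        with self._lock:
            self._dict.setdefault(key, []).append(value)

    def pop(self, key):
        with self._lock:
            return self._dict.pop(key)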

@carlodri

carlodri commented Jan 2, 2018

The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again.

why can't we add a simple ._clear_cache() method to the _FFTCache object?
