
ENH: Changing FFT cache to a bounded LRU cache #7686

Merged
merged 1 commit into numpy:master on Jun 6, 2016

Conversation

krischer
Contributor

The existing simple dictionary caches can grow without limit. With this PR, neither cache can grow much larger than 100 MB (an arbitrary limit). It is implemented as an LRU (least recently used) cache, so it always evicts the items that have been read or written least recently. This should still reap most of the benefits of the original cache implementation without the potential to use a large amount of memory.

The following snippet will fill a lot of memory without this patch:

import numpy as np

for i in range(10000):
    print(i)
    np.fft.ifft(np.empty(10000 + i, dtype=np.complex128))

This is not just a theoretical issue - we've encountered this in practice here: obspy/obspy#1424

I'm not entirely sure the additional complexity is really worth it, but I don't see a simpler solution that achieves the same effect. The proposed LRU cache always retains the most recently used items and throws away old ones as soon as the specified memory limit is reached, so most users will still benefit from the cache. I think it still works correctly when used in threaded code, but I did not test that extensively.

Each cache currently has a limit of 100 MB, which is an arbitrary value that works for my use cases.

The PR is based on current master - let me know if you want it rebased onto some other branch.
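As an illustration of the mechanism only - the class name, the method names, and the 100 MB default below are made up and do not match the PR's actual _FFTCache - a byte-bounded LRU cache can be sketched on top of collections.OrderedDict:

import collections


class BoundedLRUCache(object):
    """Minimal LRU cache bounded by the total bytes of its (array) values."""

    def __init__(self, max_size_in_mb=100):
        self._max_bytes = max_size_in_mb * 1024 ** 2
        self._dict = collections.OrderedDict()

    def __setitem__(self, key, value):
        # Re-inserting moves the key to the most-recently-used end.
        self._dict.pop(key, None)
        self._dict[key] = value
        self._prune()

    def __getitem__(self, key):
        # Pop and re-insert so that a cache hit also counts as "recently used".
        value = self._dict.pop(key)
        self._dict[key] = value
        return value

    def _total_bytes(self):
        return sum(v.nbytes for v in self._dict.values())

    def _prune(self):
        # Drop least-recently-used entries until we are under the byte limit,
        # but never evict the entry that was just added.
        while len(self._dict) > 1 and self._total_bytes() > self._max_bytes:
            self._dict.popitem(last=False)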

@charris charris changed the title Changing FFT cache to a limited LRU cache ENH: Changing FFT cache to a limited LRU cache May 27, 2016
c.setdefault(3, [np.array([1, 2])])
assert_array_almost_equal(c[3][0], np.array([1, 2]))

assert c._get_size() == np.ones(2).nbytes * 3
Member

Use assert_(...), plain old assert goes away in optimized Python.
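(For context: plain assert statements are compiled out when Python runs with -O, while numpy.testing.assert_ is an ordinary function call and always executes. A tiny standalone illustration, not taken from the test under review:)

import numpy as np
from numpy.testing import assert_

a = np.ones(2)
# Under `python -O` the next statement is removed entirely and checks nothing.
assert a.nbytes == 16
# assert_ is a normal function call, so it still runs under -O.
assert_(a.nbytes == 16)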

Contributor Author

Interesting point. For pytest and nose I usually just use plain asserts but your point is very valid. I changed it to the assertEqual() method of the test class. I hope that's ok.

@charris
Member

charris commented May 27, 2016

Fails on Linux 386 and Windows. Makes me suspect a C long problem somewhere; Windows (C) long is always 32 bits.

This enhancement looks reasonable to me, but should be discussed on the list. Please make a post.

@krischer
Contributor Author

I think it failed because I carelessly assumed the dtype of some array creation functions. I pushed a fix. Let's see if CI passes.

This enhancement looks reasonable to me, but should be discussed on the list. Please make a post.

Will do.

@krischer krischer changed the title ENH: Changing FFT cache to a limited LRU cache ENH: Changing FFT cache to a bounded LRU cache May 27, 2016
@mhvk
Contributor

mhvk commented May 27, 2016

I wonder why this cache is even there. At least for a few simple examples, creating the work array takes much less time than running the FFT:

In [15]: %timeit np.fft.fftpack.fftpack.cffti(10000)
1000 loops, best of 3: 463 µs per loop

In [16]: %timeit np.fft.ifft(np.empty(10000, dtype=np.complex128)) 
100 loops, best of 3: 3.76 ms per loop

In [17]: %timeit np.fft.ifft(np.empty(100, dtype=np.complex128))
10000 loops, best of 3: 30.7 µs per loop

In [18]: %timeit np.fft.fftpack.fftpack.cffti(100)
100000 loops, best of 3: 5.58 µs per loop

@krischer
Contributor Author

krischer commented May 30, 2016

Here is a quick and dirty but more extensive benchmark - it tests the total time for 100 FFTs of equal length with and without the cache, which IMHO is one of the most interesting quantities. Lots of real-world applications require a ton of FFTs of the same length.

import time
import numpy as np

print("NPTS         t for 100 runs w cache     t for 100 runs w/o cache")
print("================================================================")

for length in [1E3, 1E4, 1E5, 1E6, 1E7]:
    length = int(length)
    d = np.random.random(length)
    d = np.require(d, dtype=np.float64)
    # Warmup
    np.fft.fft(d)
    np.fft.fftpack._fft_cache.clear()

    total_time = 0
    for _ in range(100):
        a = time.time()
        np.fft.fft(d)
        b = time.time()
        total_time += b - a

    total_time_no_cache = 0
    for _ in range(100):
        np.fft.fftpack._fft_cache.clear()
        a = time.time()
        np.fft.fft(d)
        b = time.time()
        total_time_no_cache += b - a

    print("%7i        %11.5f                    %11.5f" % (
        length, total_time, total_time_no_cache))

Results on my machine:

NPTS         t for 100 runs w cache     t for 100 runs w/o cache
================================================================
   1000            0.00235                        0.00450
  10000            0.02120                        0.04525
 100000            0.29368                        0.50182
1000000            4.40410                        7.07679
10000000           71.06963                      104.85043

IMHO the cache is more than worth it, as it will greatly speed up many real-world workflows, and many people will not have enough experience to use the FFTW wrappers or other FFT implementations.

@mhvk
Contributor

mhvk commented May 30, 2016

@krischer - yes, that is quite convincing! I agree doing many FFTs in sequence is a common use case and I'll definitely take that speed improvement!

Then the remaining question is what is a reasonably optimal cache size, and how do we define size. On the mailing list, I suggested also adding a limit to the number of entries. What do you think?

In any case, just to be clear: I think this PR is an improvement no matter what.

@njsmith
Member

njsmith commented May 30, 2016

I guess there's no point in spending much time trying to fine tune the cache eviction heuristics, because we have absolutely no data :-). Any limit is obviously better than no limit; beyond that we just don't know.

So I think there are two productive things we might do: either find some source of data - maybe a real program that benefits from the cache? - or else merge and move on :-). We can always fine tune later...

@krischer
Contributor Author

Then the remaining question is what is a reasonably optimal cache size, and how do we define size. On the mailing list, I suggested also adding a limit to the number of entries. What do you think?

I did read that and I think it is a good idea. I'll add it - it will then be limited by either memory size or total item count - whatever reaches its limit first.

The limits should be discussed. They could be made configurable but that is IMHO too much complexity and really knowledgeable people could just monkey patch the caches or use another FFT implementation.

I think 8 for the max item count, as you proposed on the mailing list, is a bit on the low side. I would propose at least 16-64 if not even higher. The trickier limit is the total allowed cache size - this PR so far has an arbitrary number of 100 MB, but that might even be a bit too low - the following plot shows the size of a single cache item for various FFT lengths.

[Figure: size of a single cache item as a function of FFT length]

@mhvk
Contributor

mhvk commented May 30, 2016

Yes, the work array size is length * 2 * 16 bytes (complex128), so 320 MB for 1e7 elements. (Actually, given this, your _get_size could just sum the keys and assume 32 bytes per element.)

In my workloads at least, I often have one large FFT and many smaller ones. One could envision something like max_cache_size = max(some_min_size, largest_item_size * 1.5)
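(A sketch of that rule with made-up names - it simply picks whichever bound is larger:)

def max_cache_bytes(cached_arrays, min_bytes=100 * 1024 ** 2):
    # Allow at least `min_bytes` (100 MB here), but if the largest cached
    # work array is bigger than that, stretch the limit to 1.5x its size so
    # that one huge transform can still live alongside the smaller ones.
    sizes = [a.nbytes for a in cached_arrays]
    largest_item_size = max(sizes) if sizes else 0
    return max(min_bytes, int(largest_item_size * 1.5))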

@mhvk
Contributor

mhvk commented May 30, 2016

@njsmith - you're right that we should not fine-tune this too much. I'd be happy with just putting a limit of 32 on the total number of items, and keeping it at 100MB or a single item otherwise.

@krischer
Contributor Author

In my workloads at least, I often have one large FFT and many smaller ones. One could envision something like max_cache_size = max(some_min_size, largest_item_size * 1.5)

I like that. The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again. I'd be happy with that as people doing these calculations can be assumed to have lots of memory. If nobody objects I'll implement that.

@krischer
Contributor Author

Yes, work array size is length * 2 * 16 bytes (complex128), so 320 MB for 1e7 elements. (Actually, given this your get_size could just sum the keys and assume those are 32 byte items).

True - but then I'd have to specialize for full and real valued FFTs. The current way is IMHO simpler and fast enough.

@mhvk
Contributor

mhvk commented May 30, 2016

The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again.

But that is true now too, and there doesn't seem to be an easy way around it (barring time-based clearing).

(And at least now I know why it is often good to just exit python and restart... It seems there are many other places where "handy caches" are kept.)

@seberg
Member

seberg commented May 30, 2016

The magic size of the cache is dim = 4*n + 15; the actually needed cache is only 2*n + 15 (i.e. you could halve the needed cache if you change the code a bit). Since the factorization (the +15) does not seem to be the very slow part, I guess we have no choice except adding some heuristic.
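(These numbers are easy to check against the fftpack backend this PR targets; on NumPy 1.17+ the numpy.fft.fftpack module no longer exists, so treat this purely as an illustration of the sizes quoted above:)

import numpy as np

n = 100
# Complex FFT initializer: work array of 4*n + 15 doubles, per the above.
print(np.fft.fftpack.fftpack.cffti(n).shape)   # expected: (415,)
# Real-valued FFT initializer: work array of 2*n + 15 doubles.
print(np.fft.fftpack.fftpack.rffti(n).shape)   # expected: (215,)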

@krischer
Contributor Author

krischer commented Jun 2, 2016

The magic size of the cache is dim = 4*n + 15; the actually needed cache is only 2*n + 15 (i.e. you could halve the needed cache if you change the code a bit). Since the factorization (the +15) does not seem to be the very slow part, I guess we have no choice except adding some heuristic.

The C code is very dense and I don't have the time right now to break it down, but the cache size already is 2*n + 15 for real valued input. It is only 4*n + 15 for complex valued input. As far as I currently understand the C code, it also uses all 4*n + 15 cache entries, but I might be mistaken.

But that is true now too, and there doesn't seem to be an easy way around it (barring time-based clearing).

pyfftw has a timed cache implemented with a separate thread but that is probably way too much complexity for the core numpy package: https://github.com/pyFFTW/pyFFTW/blob/master/pyfftw/interfaces/cache.py

In any case - barring CI failures this PR is done from my point of view - please review it or let me know of any other desired changes. There are still two caches - one for real and one for complex valued FFTs. Old values will be evicted upon getting/setting if:

cache size > max(100 MB, largest_item_size * 1.5) OR > 32 items in cache

The cache will also never be fully cleared: at least one item always remains, as otherwise single large items would never be stored in the cache, and one also wants to benefit from the cache in these cases.
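(Sketched with made-up names rather than the helper's real ones, the eviction step looks roughly like this: drop entries from the least-recently-used end while either bound is exceeded, and always stop before removing the last entry.)

def prune(cache, max_bytes=100 * 1024 ** 2, max_items=32):
    # `cache` is an OrderedDict mapping FFT length -> list of work arrays,
    # ordered from least to most recently used.
    def total_bytes():
        return sum(a.nbytes for arrays in cache.values() for a in arrays)

    sizes = [a.nbytes for arrays in cache.values() for a in arrays]
    largest_item = max(sizes) if sizes else 0
    # The byte limit is stretched to 1.5x the largest stored item so that a
    # single oversized work array may still stay in the cache.
    limit = max(max_bytes, int(1.5 * largest_item))
    while len(cache) > 1 and (total_bytes() > limit or len(cache) > max_items):
        cache.popitem(last=False)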

@mhvk
Contributor

mhvk commented Jun 2, 2016

Looks all good to me! I also like that if anybody looks at the code, it is quite obvious what to do if one wants a different cache.

@charris
Member

charris commented Jun 2, 2016

Needs mention in doc/release/1.12.0-notes.rst.

@krischer
Contributor Author

krischer commented Jun 3, 2016

It is currently branched off master, but it might make sense to rebase it on top of one of the maintenance branches. Whether to do so, and onto which branch, is up to you. Or would you rather see a backport?

@charris
Member

charris commented Jun 3, 2016

@krischer Everything starts in master and backports are restricted to bug fixes.

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jun 3, 2016
@charris
Member

charris commented Jun 3, 2016

Note that the cached data is linear in the transform size, so that for very large transforms the relative time spent in computation of the twiddle factors will decrease.

@charris
Member

charris commented Jun 3, 2016

@krischer Still needs mention in the release notes.

@krischer
Contributor Author

krischer commented Jun 3, 2016

@krischer Still needs mention in the release notes.

Done.

I guess this PR could also be considered a bugfix for unexpected and unwanted behavior but it appears to not have caused a lot of practical problems in the past so I'm fine with having it only in future numpy versions.


class _FFTCache(object):
    """
    Cache for the FFT init functions as an LRU (least recently used) cache.
Member

Should give this a standard docstring, see doc/HOWTO_DOCUMENT.rst.txt. All that is really needed here is a Parameters section with the two arguments to __init__.

@charris
Member

charris commented Jun 4, 2016

LGTM modulo class docstring.

@homu
Contributor

homu commented Jun 4, 2016

☔ The latest upstream changes (presumably #7704) made this pull request unmergeable. Please resolve the merge conflicts.

@charris
Member

charris commented Jun 5, 2016

Needs rebase also, probably the release notes.

@krischer
Contributor Author

krischer commented Jun 6, 2016

Rebased, force pushed, and improved the docstring.

@charris
Member

charris commented Jun 6, 2016

Great. One more thing ;) Could you squash the commits -- git rebase -i HEAD~8 -- and take a look at formatting the commit message as documented in doc/source/dev/gitwash/development_workflow.rst?

Replaces the simple dictionary caches for the twiddle factors of
numpy.fft with bounded LRU (least recently used) caches. The caches
can thus no longer grow without bounds.

See numpy#7686.
@krischer
Contributor Author

krischer commented Jun 6, 2016

Sure thing :-)

@charris charris merged commit 175476f into numpy:master Jun 6, 2016
@charris
Member

charris commented Jun 6, 2016

Thanks @krischer .

@charris
Member

charris commented Jun 6, 2016

Hmm, I got an error report on merge

Exception in thread Thread-20:
Traceback (most recent call last):
  File "/opt/python/2.7.9/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/opt/python/2.7.9/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/tests/test_fftpack.py", line 132, in worker
    q.put(func(*args))
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/fftpack.py", line 286, in ifft
    output = _raw_fft(a, n, axis, fftpack.cffti, fftpack.cfftb, _fft_cache)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/fftpack.py", line 89, in _raw_fft
    fft_cache[n].append(wsave)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/helper.py", line 262, in __getitem__
    value = self._dict.pop(key)
  File "/opt/python/2.7.9/lib/python2.7/collections.py", line 143, in pop
    raise KeyError(key)
KeyError: 200

Looks like it occurred in

ERROR: test_ifft (test_fftpack.TestFFTThreadSafe)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/tests/test_fftpack.py", line 154, in test_ifft
    self._test_mtsame(np.fft.ifft, a)
  File "/home/travis/build/numpy/numpy/builds/venv/lib/python2.7/site-packages/numpy/fft/tests/test_fftpack.py", line 145, in _test_mtsame
    assert_array_equal(q.get(timeout=5), expected,
  File "/opt/python/2.7.9/lib/python2.7/Queue.py", line 176, in get
    raise Empty
Empty

Not sure if that is a threading problem or a timeout problem.

@krischer
Contributor Author

krischer commented Jun 6, 2016

How did you trigger it? CI and local tests pass for me.

EDIT: I should read more carefully. Can you reproduce this reliably or does it only happen occasionally? Also what OS did you run it on?

@charris
Member

charris commented Jun 6, 2016

Travis CI runs the tests after merges and emails me the result; I didn't run the tests myself. You can see the report at https://travis-ci.org/numpy/numpy/builds/135682171. It is probably sporadic, which is the worst kind of problem...

@krischer
Contributor Author

krischer commented Jun 6, 2016

Mhm. I can occasionally reproduce it locally by greatly increasing the number of threads the tests run with. I'll look into it.

@charris
Member

charris commented Jun 6, 2016

My guess is that an entry gets deleted before access. You probably need some way to track whether an entry is in use before deletion, which means the cache could grow with the number of threads. This sounds like a difficult problem unless the cache can be made thread local.

@pv @njsmith Thoughts?

@krischer
Contributor Author

krischer commented Jun 6, 2016

I'm pretty sure this happens because

    # As soon as we put wsave back into the cache, another thread could pick it
    # up and start using it, so we must not do this until after we're
    # completely done using it ourselves.
    fft_cache[n].append(wsave)

now has a __getitem__() that consists of multiple Python-level steps and is thus no longer atomic under the GIL. So this could either be solved with a mutex, or with a simple try/except with the result that the occasional cache write will silently but harmlessly fail in the threaded case.

I would implement the latter variant but I'm not sure how to reliably test it. Any ideas?
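(A sketch of the try/except variant; fft_cache below is a hypothetical stand-in for the module-level cache, and the real write happens inside _raw_fft as the traceback above shows:)

# Hypothetical stand-in for numpy.fft.fftpack's module-level _fft_cache.
fft_cache = {}

def put_back(n, wsave):
    # If another thread evicted key `n` between the earlier lookup and this
    # write, the KeyError is harmless: we simply skip re-caching wsave.
    try:
        fft_cache[n].append(wsave)
    except KeyError:
        pass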

@charris
Member

charris commented Jun 6, 2016

Well, lots of threads ;) Probably the best thing up front is for some experienced folks to look at the code and think about it ;)

@krischer
Contributor Author

krischer commented Jun 6, 2016

Fair enough.

To reliably trigger the bug as I interpret it, change the __getitem__() method of the FFTCache object in numpy/fft/helper.py to:

    def __getitem__(self, key):
        # pop + add to move it to the end.
        value = self._dict.pop(key)
        import time
        time.sleep(0.01)
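        # The sleep widens the race window so another thread reliably pops the same key first.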
        ...

Both proposed solutions would work.

@charris
Member

charris commented Jun 6, 2016

Don't like that much. Maybe derive the FFTCache from threading.local? I don't have the experience to say it will work, but it is easy and seems worth trying.
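(A sketch of what deriving from threading.local buys: __init__ is executed once per thread that touches the instance, so every thread ends up with its own private dictionary and no locking is needed. The class below is illustrative, not the PR's _FFTCache.)

import threading


class ThreadLocalCache(threading.local):
    # Because the base class is threading.local, __init__ runs separately in
    # every thread that accesses the instance, so each thread sees its own
    # independent _dict.
    def __init__(self):
        self._dict = {}

    def __setitem__(self, key, value):
        self._dict[key] = value

    def __getitem__(self, key):
        return self._dict[key]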

@krischer
Contributor Author

krischer commented Jun 6, 2016

Good idea. It's a one-line change and seems to work fine. But I also have little experience in that regard, so maybe there are some hidden pitfalls?

The downside is that the total possible cache size would then depend on the number of threads. A simple lock around the __getattr__() method also works, and then the total cache size is bounded no matter the number of threads.

@charris
Member

charris commented Jun 6, 2016

I believe each thread would get its own cache; I'm not sure if the initial data is copied from the parent thread. The total amount of possible data will grow with the number of threads, but I expect anyone running hundreds of threads will have the resources needed to handle that. I'm not sure the simple lock will work, although I expect that a relevant process will hold a reference to the data even if it is ejected from the cache, so maybe all is well. There is a context manager for handling such locks, and I suppose the lock contention is likely to be small.

@charris
Member

charris commented Jun 6, 2016

Of course, with a lot of threads doing different-size transforms, much cache data is likely to be evicted before it can be used. Python threading is really time slicing; it doesn't run on multiple processors, and ideally there would be a single thread dedicated to, say, running FFTs while other threads handle I/O, data acquisition, user interfaces, and the like. I don't think either multiple caches or a lock would be a problem in practice.

@njsmith
Member

njsmith commented Jun 6, 2016

How about just wrapping a lock around the cache access methods? Contention should be basically nil in any reasonable code.

@charris
Member

charris commented Jun 6, 2016

@njsmith If they all use the same lock that should work.

@njsmith
Member

njsmith commented Jun 6, 2016

Right, specifically I'm suggesting that FFTCache.__init__ do something like self._lock = ..., and then all the method bodies do with self._lock: ...
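(Spelled out, the suggested pattern looks roughly like this; the class and method names are illustrative:)

import threading


class LockedFFTCache(object):
    def __init__(self):
        # One lock shared by all methods; it is held only for the brief
        # dictionary bookkeeping, so contention stays negligible.
        self._lock = threading.Lock()
        self._dict = {}

    def put(self, key, value):
        with self._lock:
            self._dict.setdefault(key, []).append(value)

    def pop(self, key):
        with self._lock:
            return self._dict.pop(key)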

@carlodri

carlodri commented Jan 2, 2018

The only problem with that is that a single large FFT will occupy a lot of memory in the cache that is never freed again.

why can't we add a simple ._clear_cache() method to the _FFTCache object?
