STFT pre-allocation #1514
Conversation
And we've hit a pretty serious snag that I hadn't considered: the optimization I've implemented to only pad at the edges breaks padding semantics for `wrap` mode. Now, I'm not entirely sure why we're using wrap padding in these tests; probably it's there to get around edge transients for the stationary test signals, but it's a pretty crude hack. However, given that this does change semantics in a pretty subtle way, I don't think we can include this optimization in a minor point release. We do have the option of rolling back to just having pre-allocated output arrays without the copy-pad optimization, but my preference would be to keep this intact and bump it to a later release (e.g. 0.10). |
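To make the semantic difference concrete, here's a small numpy illustration (not librosa code; values chosen arbitrarily) of why edge-only padding can't reproduce `wrap` mode: wrap padding pulls values from the far end of the full signal, which an edge-local pad never sees.

```python
import numpy as np

# Full-signal wrap padding pulls samples from the opposite end:
y = np.arange(8)
full = np.pad(y, 4, mode="wrap")
# full[:4] is [4, 5, 6, 7] -- wrapped from the END of y

# Edge-only padding can only wrap within the small edge chunk it sees:
left = np.pad(y[:4], (4, 0), mode="wrap")[:4]
# left is [0, 1, 2, 3] -- wrapped within y[:4], NOT from the far end of y
```

So an edge-only optimization silently produces different padded values whenever the pad mode depends on the whole signal.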
Per offline discussion, @lostanlen and I have agreed to punt this one for now. |
Codecov Report
@@ Coverage Diff @@
## main #1514 +/- ##
==========================================
+ Coverage 98.73% 98.75% +0.01%
==========================================
Files 32 32
Lines 3942 4001 +59
==========================================
+ Hits 3892 3951 +59
Misses 50 50
|
I've added exceptions for now-unsupported padding modes in stft. I also wound up disabling the […] for now. The remaining to-dos in the issue are still TBD though. |
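For reference, the exception pattern can be sketched like this (the mode set and function name here are illustrative, not the PR's actual code): pad modes whose values depend on the full signal can't be computed from small edge chunks, so they're rejected up front.

```python
# Illustrative sketch: modes whose padding depends on the whole signal
# (not just the edges) are incompatible with edge-only padding.
UNSUPPORTED_PAD_MODES = {"wrap", "mean", "maximum", "median", "minimum"}

def check_pad_mode(mode):
    # Raise early rather than silently producing different padded values
    if mode in UNSUPPORTED_PAD_MODES:
        raise ValueError(
            f"pad_mode='{mode}' is not supported with edge-only padding"
        )
    return mode
```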
All good - had to rebase this anyway, so i just rolled back that commit at the same time. |
Prototype of griffin-lim using pre-allocation for both forward and inverse STFTs gets us a minor boost. Example is 90 seconds of random noise at float32:

```python
>>> sr = 22050
>>> y = np.random.randn(22050 * 90).astype(np.float32)
>>> S = np.abs(librosa.stft(y))  # assumed: magnitude spectrogram input
>>> %timeit librosa.griffinlim(S, random_state=0, n_iter=64)
16.4 s ± 1.64 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %memit librosa.griffinlim(S, random_state=0, n_iter=64)
peak memory: 518.35 MiB, increment: 151.34 MiB
>>> %timeit griffinlim(S, random_state=0, n_iter=64)
15.1 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %memit griffinlim(S, random_state=0, n_iter=64)
peak memory: 518.42 MiB, increment: 151.40 MiB
```

Peak memory hasn't changed much, which I think just means the gc was doing a good job previously of cleaning up after itself. At this point, I think we're still paying primarily in temp storage from the intermediate reconstruction here (`librosa/librosa/core/spectrum.py`, line 2434 in 9f46e98);
we could probably absorb this directly into the angles array with an in-place multiply and cut down on storage a bit more.
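The in-place variant can be sketched as follows (standalone numpy; the array names are stand-ins, not the PR's actual variables): since the phase array already has the right shape and a complex dtype, the magnitude can be multiplied into it directly instead of allocating a temporary the size of the spectrogram.

```python
import numpy as np

# Stand-ins: a magnitude spectrogram S and unit-modulus phase estimates
rng = np.random.default_rng(0)
S = np.abs(rng.standard_normal((1025, 64))).astype(np.float32)
angles = np.exp(2j * np.pi * rng.random((1025, 64))).astype(np.complex64)

# Temporary-allocating version would be:  rebuilt = S * angles
# In-place version reuses the angles array as scratch storage:
angles *= S  # elementwise product, no new complex array allocated
```

After the in-place multiply, `angles` holds the magnitude-scaled phases directly.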
|
I had to add a couple of additional tweaks to GL here, mainly to preserve numerical precision in the angles dtype: previously we forced complex64 even if the input was higher precision. Since moving from a temporary multiply to an in-place update, I'm still not seeing a ton of reliable speed improvements here, though we should automatically inherit the previously reported speedups in stft. |
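The precision handling can be sketched with numpy's type promotion (the helper name here is mine, not the PR's): derive the complex dtype from the input's real precision rather than hard-coding complex64.

```python
import numpy as np

def complex_dtype_like(x):
    # float32 -> complex64, float64 -> complex128, via numpy promotion
    return np.result_type(x.dtype, np.complex64)
```

This way a float64 input keeps complex128 phase estimates instead of being silently downcast.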
Ok, I think this one is feature-complete. It does have some fairly substantial changes though. While everything here "should be" numerically equivalent to the 0.9.2 branch (notwithstanding the previous comment about numerical precision in griffin-lim phase estimates), I think this PR does warrant some careful scrutiny by someone apart from myself. |
Some additional benchmarking data, relevant to #599. I've been playing around with asv to get a more long-term sense of efficiency gains and regressions. Here are some benchmark results for time and peak memory (on my laptop) for the current main branch vs this PR when running STFT on data of different durations (in seconds), different hop lengths, and centering on/off: main branch (before preallocation optimizations):
PR 1514:
At some point, I'll have this up to run continuously on a server and publish results on a website. But for now, it at least gives a rough sense of time and memory improvements for this PR. |
Leaving a note for later: the update to the pcen-streaming example to use pre-allocated output arrays works here, but there's a chance that it could cause some trouble at the end of the stream. This got me thinking a little about how exactly we want to handle this generally. For reference, here's the pattern used in the example now:

```python
D = None
for y_block in stream:
    D = librosa.stft(y_block, n_fft=n_fft, hop_length=hop_length,
                     center=False, out=D)
```

The idea here is that on the first block, `out=None` triggers a fresh allocation, and every subsequent block reuses `D`. The trouble is that a final block may be shorter than the rest, so the required output shape no longer matches and the shape check rejects the reused buffer:

```python
if out is None:
    stft_matrix = np.zeros(shape, dtype=dtype, order="F")
elif not np.allclose(out.shape, shape):
    raise ParameterError(
        f"Shape mismatch for provided output array out.shape={out.shape} != {shape}"
    )
```

I think we can work around this by providing a simple "view trim" option: we could accept an over-sized output array and write results into an appropriately shaped view of it. |
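A minimal sketch of the "view trim" idea (the function name is hypothetical, not the PR's API): accept an over-sized buffer and return a zero-copy view of the required shape, so a caller can keep one max-size allocation across variable-length blocks.

```python
import numpy as np

def trim_view(out, shape):
    # Return a zero-copy view of the leading `shape` region of `out`
    if any(o < s for o, s in zip(out.shape, shape)):
        raise ValueError(f"out.shape={out.shape} smaller than required {shape}")
    return out[tuple(slice(0, s) for s in shape)]
```

In a streaming loop, a short final block would then write into a trimmed view of the pre-allocated buffer instead of failing an exact-shape check.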
Thinking this over, I'm wondering if we should make this "view trimming" mode the default behavior, or just always enable it. I see very little downside to allowing over-sized output buffers, and the benefits to ease-of-use seem substantial. |
Reference Issue
Fixes #871
What does this implement/fix? Explain your changes.
This PR adds an `out=` parameter to the `stft` and `istft` methods, allowing re-use of previously allocated output buffers. Additionally, it implements some more careful handling of edge padding (when using centered mode) to reduce unnecessary copying. The result should be a substantial speedup when processing long signals.
Any other comments?
This is currently WIP. To-dos:
That said, I'd like to get some early feedback on the implementation, mainly from the perspective of readability.