
STFT pre-allocation #1514

Merged

bmcfee merged 12 commits into main from stft-preallocate on Jul 14, 2022

Conversation

@bmcfee (Member) commented Jun 24, 2022

Reference Issue

Fixes #871

What does this implement/fix? Explain your changes.

This PR adds an out= parameter to the stft and istft functions, allowing re-use of previously allocated output buffers.

Additionally, it implements some more careful handling of edge padding (when using centered mode) to reduce unnecessary copying. The result should be a substantial speedup when processing long signals.
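In rough terms, the intended usage looks like this (a minimal sketch, assuming a librosa build that includes this PR and fixed-length inputs so the output shape does not change between calls):

    import numpy as np
    import librosa

    y = np.random.randn(22050 * 10).astype(np.float32)

    D = librosa.stft(y)            # first call allocates the output buffer
    D = librosa.stft(y, out=D)     # subsequent calls reuse the same buffer
    y_hat = librosa.istft(D)       # istft gains the same out= option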

Any other comments?

This is currently WIP. To-dos:

  • complete tests for stft pre-allocation
  • add istft and tests
  • update other parts of the library to make use of pre-allocation where possible, eg griffinlim
  • add an example to the gallery illustrating the utility, or maybe just extend the pcen example

That said, I'd like to get some early feedback on the implementation, mainly from the perspective of readability.

@bmcfee added the enhancement and functionality labels Jun 24, 2022
@bmcfee added this to the 0.9.2 milestone Jun 24, 2022
@bmcfee self-assigned this Jun 24, 2022
@bmcfee (Member, Author) commented Jun 24, 2022

And we've hit a pretty serious snag that I hadn't considered. The optimization I've implemented to only pad at the edges breaks padding semantics for wrap mode. This isn't a huge deal in the big picture, but in the short term it does break some of our unit tests (eg on reassigned frequencies).

Now, I'm not entirely sure why we're using wrap padding in these tests - probably this is to get around edge transients for the stationary test signals, but it's a pretty crude hack.

However, given that this does change semantics in a pretty subtle way, I don't think we can include this optimization in a minor point release. We do have the option of rolling back to just have pre-allocated output arrays without the copy-pad optimization, but my preference would be to keep this intact and bump it to a later release (eg 0.10).
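As a toy numpy illustration of why edge-only padding cannot reproduce wrap semantics (the wrapped values depend on the far end of the signal, which a copied edge chunk never sees; the array values here are illustrative, not taken from the PR):

    import numpy as np

    y = np.arange(8)

    # Full-signal wrap padding pulls samples from the opposite end of y:
    print(np.pad(y, 2, mode="wrap"))       # [6 7 0 1 2 3 4 5 6 7 0 1]

    # Padding only a short edge chunk wraps within that chunk instead:
    print(np.pad(y[:4], 2, mode="wrap"))   # [2 3 0 1 2 3 0 1]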

@bmcfee modified the milestones: 0.9.2, 0.10.0 Jun 24, 2022
@bmcfee (Member, Author) commented Jun 24, 2022

Per offline discussion, @lostanlen and I have agreed to punt this one for now.

@codecov bot commented Jul 1, 2022

Codecov Report

Merging #1514 (410dec6) into main (9f46e98) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1514      +/-   ##
==========================================
+ Coverage   98.73%   98.75%   +0.01%     
==========================================
  Files          32       32              
  Lines        3942     4001      +59     
==========================================
+ Hits         3892     3951      +59     
  Misses         50       50              
Flag Coverage Δ
unittests 98.75% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
librosa/core/spectrum.py 97.84% <100.00%> (+0.34%) ⬆️
librosa/display.py 96.93% <0.00%> (-0.04%) ⬇️


@bmcfee (Member, Author) commented Jul 1, 2022

I've added exceptions for now-unsupported padding modes in stft.

I wound up disabling the center=True test for reassigned spectrograms for now. We used pad_mode='wrap' not out of actual necessity, but just to minimize the influence of edge transients on our test case. Ultimately this was a bit of a hack, and I don't think it diminishes much to simplify this test and only check center=False (no padding).

The remaining to-do's in the issue are still TBD though.

@lostanlen (Contributor) commented:

@bmcfee apologies for that stray commit 4303fa8 🤦🏻; I meant to push it to #1485 instead.

@bmcfee (Member, Author) commented Jul 2, 2022

All good - I had to rebase this anyway, so I just rolled back that commit at the same time.

@bmcfee (Member, Author) commented Jul 2, 2022

A prototype of griffin-lim using pre-allocation for both the forward and inverse STFTs gets us a minor boost. The example is 90 seconds of random noise at float32.

Here, librosa.griffinlim (the current implementation) is not using pre-allocation, but does benefit from the optimized padding in this PR, while the bare griffinlim call is the pre-allocating prototype. Results are numerically identical:

>>> sr = 22050
>>> y = np.random.randn(22050 * 90).astype(np.float32)
>>> S = np.abs(librosa.stft(y))  # magnitude spectrogram of y (assumed)
>>> %timeit librosa.griffinlim(S, random_state=0, n_iter=64)
16.4 s ± 1.64 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %memit librosa.griffinlim(S, random_state=0, n_iter=64)
peak memory: 518.35 MiB, increment: 151.34 MiB
>>> %timeit griffinlim(S, random_state=0, n_iter=64)
15.1 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %memit griffinlim(S, random_state=0, n_iter=64)
peak memory: 518.42 MiB, increment: 151.40 MiB

Peak memory hasn't changed much, which I think just means the gc was doing a good job previously of cleaning up after itself.

At this point, I think we're still paying primarily in temporary storage for the intermediate reconstruction here, i.e. the S * angles product; we could probably absorb this directly into the angles array with an in-place multiply and cut down on storage a bit more.
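A rough, self-contained sketch of that idea (toy signal; the buffer names are illustrative, not librosa's internal ones):

    import numpy as np
    import librosa

    S = np.abs(librosa.stft(np.random.randn(22050).astype(np.float32)))
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape)).astype(np.complex64)

    # Temporary multiply: S * angles allocates a fresh complex array.
    rebuilt_tmp = librosa.istft(S * angles)

    # In-place multiply: reuse the angles buffer and avoid the temporary.
    # (Griffin-Lim re-derives the phases from the rebuilt signal on the next
    # iteration, so overwriting angles here is safe.)
    angles *= S
    rebuilt = librosa.istft(angles, out=rebuilt_tmp)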

@bmcfee (Member, Author) commented Jul 2, 2022

I had to add a couple of additional tweaks to GL here, mainly to preserve numerical precision in angles dtype.

Previously we forced complex64 even if the input was higher precision. When moving from a temporary multiply (istft(S * angles, ...)) to an in-place multiply (angles *= S), this was causing disagreements on double-precision inputs (single-precision retained numerical equivalence). This PR now has GL adapting the complex dtype to the precision of the input (S).
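For reference, a minimal sketch of that dtype adaptation, assuming librosa.util.dtype_r2c as the real-to-complex dtype mapping (float32 → complex64, float64 → complex128):

    import numpy as np
    import librosa

    # Double-precision magnitudes should get double-precision (complex128) phases.
    S = np.abs(librosa.stft(np.random.randn(22050).astype(np.float64)))

    angle_dtype = librosa.util.dtype_r2c(S.dtype)   # float64 -> complex128
    angles = np.empty(S.shape, dtype=angle_dtype)
    angles[:] = np.exp(2j * np.pi * np.random.rand(*S.shape))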

I'm still not seeing a ton of reliable speed improvements here, though we should automatically inherit the previously reported speedups in stft.

@bmcfee changed the title from [WIP] STFT pre-allocation to [CR needed] STFT pre-allocation Jul 2, 2022
@bmcfee (Member, Author) commented Jul 2, 2022

Ok, I think this one is feature-complete. It does have some fairly substantial changes though. While everything here "should be" numerically equivalent to the 0.9.2 branch (notwithstanding the previous comment about numerical precision in griffin-lim phase estimates), I think this PR does warrant some careful scrutiny by someone apart from myself.

@bmcfee (Member, Author) commented Jul 8, 2022

Some additional benchmarking data, relevant to #599. I've been playing around with asv to get a more long-term sense of efficiency gains and regressions. Here are some benchmark results for time and peak memory (on my laptop) for the current main branch vs this PR when running STFT on data of different durations (in seconds), different hop lengths, and centering on/off:

main branch (before preallocation optimizations):

[  0.00%] · For librosa commit 25538adb <abs2>:
[  0.00%] ·· Benchmarking conda-py3.9
[ 50.00%] ··· Running (benchmarks.STFTSuite.time_stft--).
[ 75.00%] ··· benchmarks.STFTSuite.peakmem_stft                                                                                                                                             ok
[ 75.00%] ··· ============ =========== =========== ============ ============ ============ =============
              --                                        center / duration                              
              ------------ ----------------------------------------------------------------------------
               hop_length   True / 10   True / 60   True / 240   False / 10   False / 60   False / 240 
              ============ =========== =========== ============ ============ ============ =============
                  256          154M        193M        336M         153M         188M          315M    
                  512          150M        172M        251M         149M         167M          230M    
                  1024         148M        162M        209M         148M         156M          188M    
              ============ =========== =========== ============ ============ ============ =============

[100.00%] ··· benchmarks.STFTSuite.time_stft                                                                                                                                                ok
[100.00%] ··· ============ ============= ============ ============ ============= ============ =============
              --                                          center / duration                                
              ------------ --------------------------------------------------------------------------------
               hop_length    True / 10    True / 60    True / 240    False / 10   False / 60   False / 240 
              ============ ============= ============ ============ ============= ============ =============
                  256       10.6±0.07ms   71.7±20ms     491±10ms     10.8±0.3ms   67.2±20ms      453±10ms  
                  512        5.81±0.2ms   31.3±0.4ms    253±20ms     5.47±0.1ms   31.2±0.8ms     205±7ms   
                  1024       3.65±0.4ms   18.5±0.7ms    132±6ms     3.09±0.05ms   17.1±0.5ms    77.5±20ms  
              ============ ============= ============ ============ ============= ============ =============

PR 1514:

[  0.00%] · For librosa commit c95314eb <stft-preallocate>:
[  0.00%] ·· Benchmarking conda-py3.9
[ 50.00%] ··· Running (benchmarks.STFTSuite.time_stft--).
[ 75.00%] ··· benchmarks.STFTSuite.peakmem_stft                                                                                                                                             ok
[ 75.00%] ··· ============ =========== =========== ============ ============ ============ =============
              --                                        center / duration                              
              ------------ ----------------------------------------------------------------------------
               hop_length   True / 10   True / 60   True / 240   False / 10   False / 60   False / 240 
              ============ =========== =========== ============ ============ ============ =============
                  256          153M        188M        315M         153M         188M          315M    
                  512          149M        167M        230M         149M         167M          230M    
                  1024         148M        156M        188M         148M         156M          188M    
              ============ =========== =========== ============ ============ ============ =============

[100.00%] ··· benchmarks.STFTSuite.time_stft                                                                                                                                                ok
[100.00%] ··· ============ ============= ============ ============ ============= ============ =============
              --                                          center / duration                                
              ------------ --------------------------------------------------------------------------------
               hop_length    True / 10    True / 60    True / 240    False / 10   False / 60   False / 240 
              ============ ============= ============ ============ ============= ============ =============
                  256        11.6±0.6ms   74.1±0.5ms    291±1ms      11.3±0.3ms   74.0±0.3ms     292±2ms   
                  512       5.94±0.02ms   33.1±0.8ms   149±0.9ms    5.72±0.09ms   31.8±0.8ms     149±1ms   
                  1024      3.32±0.04ms   17.6±0.6ms   77.8±0.8ms    3.24±0.1ms   17.7±0.7ms    77.4±0.9ms 
              ============ ============= ============ ============ ============= ============ =============

At some point, I'll have this up to run continuously on a server and publish results on a website. But for now, it at least gives a rough sense of time and memory improvements for this PR.
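For context, benchmarks like these can be written as an asv suite roughly along the following lines; the class name, parameter grid, and setup below are reconstructed from the output tables above and are not librosa's actual benchmark code:

    import numpy as np
    import librosa


    class STFTSuite:
        # Parameter grid matching the tables above.
        params = ([256, 512, 1024], [True, False], [10, 60, 240])
        param_names = ["hop_length", "center", "duration"]

        def setup(self, hop_length, center, duration):
            sr = 22050
            self.y = np.random.randn(sr * duration).astype(np.float32)

        def time_stft(self, hop_length, center, duration):
            librosa.stft(self.y, hop_length=hop_length, center=center)

        def peakmem_stft(self, hop_length, center, duration):
            librosa.stft(self.y, hop_length=hop_length, center=center)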

@bmcfee (Member, Author) commented Jul 14, 2022

Leaving a note for later:

The update to the pcen-streaming example to use pre-allocated output arrays works here, but there's a chance that it could cause some trouble at the end of the stream. This got me thinking a little about how exactly we want to handle this in general.

For reference, here's the pattern used in the example now:

D = None

for y_block in stream:
    D = librosa.stft(y_block, n_fft=n_fft, hop_length=hop_length,
                     center=False, out=D)

The idea here is that on the first block, D is None and there is no pre-allocation. On subsequent iterations, we reuse the old output array. This works fine when all blocks are of identical size. However, if the last y_block is shorter than previous blocks, this will fail because D.shape will not match the size requirement, per this check in the PR:

if out is None:
    stft_matrix = np.zeros(shape, dtype=dtype, order="F")
elif not np.allclose(out.shape, shape):
    raise ParameterError(
        f"Shape mismatch for provided output array out.shape={out.shape} != {shape}"
    )

I think we can work around this by providing a simple "view trim" option. We could accept an over-sized output array (out) but only write into a prefix slice of the array, eg stft_matrix = out[..., :n_frames].
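A hypothetical sketch of that option; the helper name and the plain ValueError are illustrative, not the PR's actual code:

    import numpy as np

    def _prepare_output(out, shape, dtype):
        """Allocate an output buffer, or return a prefix view of a user buffer."""
        if out is None:
            return np.zeros(shape, dtype=dtype, order="F")
        # Allow an over-sized buffer: leading dimensions must match exactly,
        # and the frame axis must be at least as long as required.
        if out.shape[:-1] != tuple(shape[:-1]) or out.shape[-1] < shape[-1]:
            raise ValueError(
                f"Shape mismatch for provided output array out.shape={out.shape} != {shape}"
            )
        return out[..., : shape[-1]]   # write only into the leading frames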

@bmcfee (Member, Author) commented Jul 14, 2022

Thinking this over, I'm wondering if we should make this "view trimming" mode the default behavior, or just always implement it. I see very little downside to allowing over-sized inputs, and the benefits to ease-of-use seem substantial.

@bmcfee merged commit ad90c4b into main Jul 14, 2022
@bmcfee changed the title from [CR needed] STFT pre-allocation to STFT pre-allocation Jul 14, 2022
@bmcfee deleted the stft-preallocate branch Jul 14, 2022
bmcfee added a commit that referenced this pull request Sep 7, 2022
bmcfee added a commit that referenced this pull request Sep 8, 2022
* fix a framing bug introduced in #1514, fixes #1567

* fixed an edge case when hop_length=1 in stft

* decorative comments
Labels
enhancement, functionality
Development

Successfully merging this pull request may close these issues.

STFT/ISTFT pre-allocate output buffers