
STFT pre-allocation #1514

Merged

bmcfee merged 12 commits into main from stft-preallocate on Jul 14, 2022

Conversation

@bmcfee (Member) commented Jun 24, 2022

Reference Issue

Fixes #871

What does this implement/fix? Explain your changes.

This PR adds an out= parameter to the stft and istft functions, allowing re-use of previously allocated output buffers.

Additionally, it implements some more careful handling of edge padding (when using centered mode) to reduce unnecessary copying. The result should be a substantial speedup when processing long signals.
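In rough terms, the intended usage looks like this (a minimal sketch, assuming a librosa build that includes this PR and fixed-length inputs so the output shape does not change between calls):

    import numpy as np
    import librosa

    y = np.random.randn(22050 * 10).astype(np.float32)

    D = librosa.stft(y)            # first call allocates the output buffer
    D = librosa.stft(y, out=D)     # subsequent calls reuse the same buffer
    y_hat = librosa.istft(D)       # istft gains the same out= option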

Any other comments?

This is currently WIP. To-dos:

  • complete tests for stft pre-allocation
  • add istft and tests
  • update other parts of the library to make use of pre-allocation where possible, eg griffinlim
  • add an example to the gallery illustrating the utility, or maybe just extend the pcen example

That said, I'd like to get some early feedback on the implementation, mainly from the perspective of readability.

@bmcfee added the enhancement and functionality labels Jun 24, 2022
@bmcfee added this to the 0.9.2 milestone Jun 24, 2022
@bmcfee self-assigned this Jun 24, 2022
@bmcfee (Member, Author) commented Jun 24, 2022

And we've hit a pretty serious snag that I hadn't considered. The optimization I've implemented to only pad at the edges breaks padding semantics for wrap mode. This isn't a huge deal in the big picture, but in the short term it does break some of our unit tests (eg on reassigned frequencies).

Now, I'm not entirely sure why we're using wrap padding in these tests - probably this is to get around edge transients for the stationary test signals, but it's a pretty crude hack.

However, given that this does change semantics in a pretty subtle way, I don't think we can include this optimization in a minor point release. We do have the option of rolling back to just have pre-allocated output arrays without the copy-pad optimization, but my preference would be to keep this intact and bump it to a later release (eg 0.10).
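As a toy numpy illustration of why edge-only padding cannot reproduce wrap semantics (the wrapped values depend on the far end of the signal, which a copied edge chunk never sees; the array values here are illustrative, not taken from the PR):

    import numpy as np

    y = np.arange(8)

    # Full-signal wrap padding pulls samples from the opposite end of y:
    print(np.pad(y, 2, mode="wrap"))       # [6 7 0 1 2 3 4 5 6 7 0 1]

    # Padding only a short edge chunk wraps within that chunk instead:
    print(np.pad(y[:4], 2, mode="wrap"))   # [2 3 0 1 2 3 0 1]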

@bmcfee modified the milestones: 0.9.2, 0.10.0 Jun 24, 2022
@bmcfee (Member, Author) commented Jun 24, 2022

Per offline discussion, @lostanlen and I have agreed to punt this one for now.

@codecov bot commented Jul 1, 2022

Codecov Report

Merging #1514 (410dec6) into main (9f46e98) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1514      +/-   ##
==========================================
+ Coverage   98.73%   98.75%   +0.01%     
==========================================
  Files          32       32              
  Lines        3942     4001      +59     
==========================================
+ Hits         3892     3951      +59     
  Misses         50       50              
Flag Coverage Δ
unittests 98.75% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
librosa/core/spectrum.py 97.84% <100.00%> (+0.34%) ⬆️
librosa/display.py 96.93% <0.00%> (-0.04%) ⬇️


@bmcfee (Member, Author) commented Jul 1, 2022

I've added exceptions for now-unsupported padding modes in stft.

I wound up disabling the center=True test for reassigned spectrograms for now. We used pad_mode='wrap' not out of actual necessity, but just to minimize the influence of edge transients on our test case. Ultimately this was a bit of a hack, and I don't think it diminishes much to simplify this test and only check center=False (no padding).

The remaining to-do's in the issue are still TBD though.

@lostanlen (Contributor) commented:

@bmcfee apologies for that stray commit 4303fa8 🤦🏻; I meant to push it to #1485 instead.

@bmcfee (Member, Author) commented Jul 2, 2022

All good - I had to rebase this anyway, so I just rolled back that commit at the same time.

@bmcfee (Member, Author) commented Jul 2, 2022

A prototype of griffin-lim using pre-allocation for both the forward and inverse STFTs gets us a minor boost. The example is 90 seconds of random noise at float32.

Here, librosa.griffinlim (the current implementation) is not using pre-allocation, but does benefit from the optimized padding in this PR, while the bare griffinlim call is the pre-allocating prototype. Results are numerically identical:

>>> sr = 22050
>>> y = np.random.randn(22050 * 90).astype(np.float32)
>>> S = np.abs(librosa.stft(y))  # magnitude spectrogram of y (assumed)
>>> %timeit librosa.griffinlim(S, random_state=0, n_iter=64)
16.4 s ± 1.64 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %memit librosa.griffinlim(S, random_state=0, n_iter=64)
peak memory: 518.35 MiB, increment: 151.34 MiB
>>> %timeit griffinlim(S, random_state=0, n_iter=64)
15.1 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %memit griffinlim(S, random_state=0, n_iter=64)
peak memory: 518.42 MiB, increment: 151.40 MiB

Peak memory hasn't changed much, which I think just means the gc was doing a good job previously of cleaning up after itself.

At this point, I think we're still paying primarily in temporary storage for the intermediate reconstruction here, i.e. the S * angles product; we could probably absorb this directly into the angles array with an in-place multiply and cut down on storage a bit more.
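A rough, self-contained sketch of that idea (toy signal; the buffer names are illustrative, not librosa's internal ones):

    import numpy as np
    import librosa

    S = np.abs(librosa.stft(np.random.randn(22050).astype(np.float32)))
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape)).astype(np.complex64)

    # Temporary multiply: S * angles allocates a fresh complex array.
    rebuilt_tmp = librosa.istft(S * angles)

    # In-place multiply: reuse the angles buffer and avoid the temporary.
    # (Griffin-Lim re-derives the phases from the rebuilt signal on the next
    # iteration, so overwriting angles here is safe.)
    angles *= S
    rebuilt = librosa.istft(angles, out=rebuilt_tmp)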

@bmcfee (Member, Author) commented Jul 2, 2022

I had to add a couple of additional tweaks to GL here, mainly to preserve numerical precision in angles dtype.

Previously we forced complex64 even if the input was higher precision. When moving from a temporary multiply (istft(S * angles, ...)) to an in-place multiply (angles *= S), this was causing disagreements on double-precision inputs (single-precision retained numerical equivalence). This PR now has GL adapting the complex dtype to the precision of the input (S).
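For reference, a minimal sketch of that dtype adaptation, assuming librosa.util.dtype_r2c as the real-to-complex dtype mapping (float32 → complex64, float64 → complex128):

    import numpy as np
    import librosa

    # Double-precision magnitudes should get double-precision (complex128) phases.
    S = np.abs(librosa.stft(np.random.randn(22050).astype(np.float64)))

    angle_dtype = librosa.util.dtype_r2c(S.dtype)   # float64 -> complex128
    angles = np.empty(S.shape, dtype=angle_dtype)
    angles[:] = np.exp(2j * np.pi * np.random.rand(*S.shape))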

I'm still not seeing a ton of reliable speed improvements here, though we should automatically inherit the previously reported speedups in stft.

@bmcfee changed the title from [WIP] STFT pre-allocation to [CR needed] STFT pre-allocation Jul 2, 2022
@bmcfee (Member, Author) commented Jul 2, 2022

Ok, I think this one is feature-complete. It does have some fairly substantial changes though. While everything here "should be" numerically equivalent to the 0.9.2 branch (notwithstanding the previous comment about numerical precision in griffin-lim phase estimates), I think this PR does warrant some careful scrutiny by someone apart from myself.

@bmcfee (Member, Author) commented Jul 8, 2022

Some additional benchmarking data, relevant to #599. I've been playing around with asv to get a more long-term sense of efficiency gains and regressions. Here are some benchmark results for time and peak memory (on my laptop) for the current main branch vs this PR when running STFT on data of different durations (in seconds), different hop lengths, and centering on/off:

main branch (before preallocation optimizations):

[  0.00%] · For librosa commit 25538adb <abs2>:
[  0.00%] ·· Benchmarking conda-py3.9
[ 50.00%] ··· Running (benchmarks.STFTSuite.time_stft--).
[ 75.00%] ··· benchmarks.STFTSuite.peakmem_stft                                                                                                                                             ok
[ 75.00%] ··· ============ =========== =========== ============ ============ ============ =============
              --                                        center / duration                              
              ------------ ----------------------------------------------------------------------------
               hop_length   True / 10   True / 60   True / 240   False / 10   False / 60   False / 240 
              ============ =========== =========== ============ ============ ============ =============
                  256          154M        193M        336M         153M         188M          315M    
                  512          150M        172M        251M         149M         167M          230M    
                  1024         148M        162M        209M         148M         156M          188M    
              ============ =========== =========== ============ ============ ============ =============

[100.00%] ··· benchmarks.STFTSuite.time_stft                                                                                                                                                ok
[100.00%] ··· ============ ============= ============ ============ ============= ============ =============
              --                                          center / duration                                
              ------------ --------------------------------------------------------------------------------
               hop_length    True / 10    True / 60    True / 240    False / 10   False / 60   False / 240 
              ============ ============= ============ ============ ============= ============ =============
                  256       10.6±0.07ms   71.7±20ms     491±10ms     10.8±0.3ms   67.2±20ms      453±10ms  
                  512        5.81±0.2ms   31.3±0.4ms    253±20ms     5.47±0.1ms   31.2±0.8ms     205±7ms   
                  1024       3.65±0.4ms   18.5±0.7ms    132±6ms     3.09±0.05ms   17.1±0.5ms    77.5±20ms  
              ============ ============= ============ ============ ============= ============ =============

PR 1514:

[  0.00%] · For librosa commit c95314eb <stft-preallocate>:
[  0.00%] ·· Benchmarking conda-py3.9
[ 50.00%] ··· Running (benchmarks.STFTSuite.time_stft--).
[ 75.00%] ··· benchmarks.STFTSuite.peakmem_stft                                                                                                                                             ok
[ 75.00%] ··· ============ =========== =========== ============ ============ ============ =============
              --                                        center / duration                              
              ------------ ----------------------------------------------------------------------------
               hop_length   True / 10   True / 60   True / 240   False / 10   False / 60   False / 240 
              ============ =========== =========== ============ ============ ============ =============
                  256          153M        188M        315M         153M         188M          315M    
                  512          149M        167M        230M         149M         167M          230M    
                  1024         148M        156M        188M         148M         156M          188M    
              ============ =========== =========== ============ ============ ============ =============

[100.00%] ··· benchmarks.STFTSuite.time_stft                                                                                                                                                ok
[100.00%] ··· ============ ============= ============ ============ ============= ============ =============
              --                                          center / duration                                
              ------------ --------------------------------------------------------------------------------
               hop_length    True / 10    True / 60    True / 240    False / 10   False / 60   False / 240 
              ============ ============= ============ ============ ============= ============ =============
                  256        11.6±0.6ms   74.1±0.5ms    291±1ms      11.3±0.3ms   74.0±0.3ms     292±2ms   
                  512       5.94±0.02ms   33.1±0.8ms   149±0.9ms    5.72±0.09ms   31.8±0.8ms     149±1ms   
                  1024      3.32±0.04ms   17.6±0.6ms   77.8±0.8ms    3.24±0.1ms   17.7±0.7ms    77.4±0.9ms 
              ============ ============= ============ ============ ============= ============ =============

At some point, I'll have this up to run continuously on a server and publish results on a website. But for now, it at least gives a rough sense of time and memory improvements for this PR.
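For context, benchmarks like these can be written as an asv suite roughly along the following lines; the class name, parameter grid, and setup below are reconstructed from the output tables above and are not librosa's actual benchmark code:

    import numpy as np
    import librosa


    class STFTSuite:
        # Parameter grid matching the tables above.
        params = ([256, 512, 1024], [True, False], [10, 60, 240])
        param_names = ["hop_length", "center", "duration"]

        def setup(self, hop_length, center, duration):
            sr = 22050
            self.y = np.random.randn(sr * duration).astype(np.float32)

        def time_stft(self, hop_length, center, duration):
            librosa.stft(self.y, hop_length=hop_length, center=center)

        def peakmem_stft(self, hop_length, center, duration):
            librosa.stft(self.y, hop_length=hop_length, center=center)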

@bmcfee (Member, Author) commented Jul 14, 2022

Leaving a note for later:

The update to the pcen-streaming example to use pre-allocated output arrays works here, but there's a chance that it could cause some trouble at the end of the stream. This got me thinking a little about how exactly we want to handle this in general.

For reference, here's the pattern used in the example now:

D = None

for y_block in stream:
    D = librosa.stft(y_block, n_fft=n_fft, hop_length=hop_length,
                     center=False, out=D)

The idea here is that on the first block, D is None and there is no pre-allocation. On subsequent iterations, we reuse the old output array. This works fine when all blocks are of identical size. However, if the last y_block is shorter than previous blocks, this will fail because D.shape will not match the size requirement, per this check in the PR:

if out is None:
    stft_matrix = np.zeros(shape, dtype=dtype, order="F")
elif not np.allclose(out.shape, shape):
    raise ParameterError(
        f"Shape mismatch for provided output array out.shape={out.shape} != {shape}"
    )

I think we can work around this by providing a simple "view trim" option. We could accept an over-sized output array (out) but only write into a prefix slice of the array, eg stft_matrix = out[..., :n_frames].
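A hypothetical sketch of that option; the helper name and the plain ValueError are illustrative, not the PR's actual code:

    import numpy as np

    def _prepare_output(out, shape, dtype):
        """Allocate an output buffer, or return a prefix view of a user buffer."""
        if out is None:
            return np.zeros(shape, dtype=dtype, order="F")
        # Allow an over-sized buffer: leading dimensions must match exactly,
        # and the frame axis must be at least as long as required.
        if out.shape[:-1] != tuple(shape[:-1]) or out.shape[-1] < shape[-1]:
            raise ValueError(
                f"Shape mismatch for provided output array out.shape={out.shape} != {shape}"
            )
        return out[..., : shape[-1]]   # write only into the leading frames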

@bmcfee (Member, Author) commented Jul 14, 2022

Thinking this over, I'm wondering if we should make this "view trimming" mode the default behavior, or just always implement it. I see very little downside to allowing over-sized inputs, and the benefits to ease-of-use seem substantial.

@bmcfee merged commit ad90c4b into main Jul 14, 2022
@bmcfee changed the title from [CR needed] STFT pre-allocation to STFT pre-allocation Jul 14, 2022
@bmcfee deleted the stft-preallocate branch Jul 14, 2022
bmcfee added a commit that referenced this pull request Sep 7, 2022
bmcfee added a commit that referenced this pull request Sep 8, 2022
* fix a framing bug introduced in #1514, fixes #1567

* fixed an edge case when hop_length=1 in stft

* decorative comments
Labels
enhancement, functionality
Development

Successfully merging this pull request may close these issues.

STFT/ISTFT pre-allocate output buffers