Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CR] Inverse CQT #435

Merged
merged 5 commits into from
Feb 10, 2018
Merged

[CR] Inverse CQT #435

merged 5 commits into from
Feb 10, 2018

Conversation

bmcfee
Copy link
Member

@bmcfee bmcfee commented Nov 6, 2016

Implements #165.

This version seems to work, but has some scaling issues that need to be resolved.


This change is Reviewable

@bmcfee bmcfee added the functionality Does this add new functionality? label Nov 6, 2016
@bmcfee bmcfee added this to the 0.5 milestone Nov 6, 2016
@bmcfee bmcfee self-assigned this Nov 6, 2016
@bmcfee bmcfee force-pushed the icqt branch 5 times, most recently from 5443e69 to 7ee43c6 Compare January 10, 2017 18:14
@bmcfee
Copy link
Member Author

bmcfee commented Jan 10, 2017

Added a basic unit test with a sine sweep. This test fails gloriously, and reveals that the output scale depends critically on the hop length, probably due to a lack of explicit correction for window modulation.

@bmcfee
Copy link
Member Author

bmcfee commented Jan 10, 2017

Factored out the window modulation logic from stft into its own function, and tried to patch it into icqt. It doesn't seem to do much -- I'm most likely doing it wrong, but it's also probably not the primary factor in the hop-dependent scale discrepancies I'm seeing:

Here's the test output for a sine sweep (unit amplitude) going C2 to C6 at 44.1KHz, and varying the hop length. The assert error output is the max-norm of the output (reconstructed signal), compared to the input (~1). Note that there's a dependence between hop_length (third parameter in the test: 64, 128, 384, 512) and the norm. The fact that 384 and 512 give the same scale indicates that this has to do with the downsample count (rounding to an integer power of two), and not a simple linear scaling.

======================================================================
FAIL: test_constantq.test_icqt(44100, False, 64, 1, array([ 0.0093187 ,  0.01863697,  0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.58766218480812893, 0.99999999947403728)

======================================================================
FAIL: test_constantq.test_icqt(44100, False, 128, 1, array([ 0.0093187 ,  0.01863697,  0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.29437417911466718, 0.99999999947403728)

======================================================================
FAIL: test_constantq.test_icqt(44100, False, 384, 1, array([ 0.0093187 ,  0.01863697,  0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.12826735736959327, 0.99999999947403728)

======================================================================
FAIL: test_constantq.test_icqt(44100, False, 512, 1, array([ 0.0093187 ,  0.01863697,  0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.128829889962883, 0.99999999947403728)

@bmcfee bmcfee force-pushed the icqt branch 2 times, most recently from af1d4d6 to 5bc0bdf Compare January 15, 2017 18:11
@bmcfee
Copy link
Member Author

bmcfee commented Jan 25, 2017

Rewrote icqt to work in the time-domain by convolution of the basis conjugate with each row of the CQT. This makes it a bit more obvious how to handle dynamic-length window normalization.

It works pretty well for short window lengths. Here's an exponential sine sweep padded with silence at the end:

sr = 44100
y = make_signal(sr, 3.0, fmin='C1', fmax='C9')
y = librosa.util.fix_length(y, 5*sr)

With the following CQT parameters, you get

hop_length = 128
over_sample = 3

fmin = librosa.note_to_hz('C1')

C = librosa.cqt(y,
                sr=sr,
                hop_length=hop_length,
                bins_per_octave=int(12*over_sample),
                n_bins=int(8 * 12 * over_sample),
                fmin=fmin,
                tuning=0.0,
                scale=True)

image

Top is the input CQT; middle is the CQT of the reconstructed signal without window normalization; bottom is with window normalization. This looks pretty good in CQT space; in time domain space, it's not horrible, but not close to perfect either:

image

(horiz lines at the bounds of the input signal)

The inflation at the end is due to the sum-squared scaling of the window function, and depends on the hop length at the top octave. If you repeat this with hop_length = 512, it gets more severe:

image

image

In stft, we hack around window underflow by the following type of calculation:

good_idx = wss > tiny(wss)
y[good_idx] /= wss[good_idx]

i.e., only divide out the window sum-square when it won't cause underflow problems. Doing the same thing here works up to a point, but since it's a different index set for each frequency, you end up with the ramp at high frequencies. I've tried hacking around this by putting a user-tunable parameter that can drop in place of tiny(wss), but this is a brutal hack and doesn't work well.

@dpwe got any suggestions for dealing with this more sanely?

@dpwe
Copy link
Contributor

dpwe commented Jan 25, 2017 via email

@bmcfee
Copy link
Member Author

bmcfee commented Jan 26, 2017

Why do we need to divide out the windows? Can't we make them sum to 1?
They are all different lengths depending on the center frequency?

They do some to 1, but because each channel has different overlap with respect to a fixed hop length, you get channel-dependent modulation that needs to be corrected for. (Unlike in stft, where the same modulation affects all channels because the window length is constant.)

Upon closer inspection, I think the problem I ran into above is due to aligning the window normalization curve with the reconstructed time-domain signal. It's gonna take me some time to track this down exactly.

@bmcfee
Copy link
Member Author

bmcfee commented Feb 3, 2017

Okay, I redid the window normalization and things should now be properly aligned.

I still see some smeared out transient effects at the boundaries of the test signal, but I have no idea how to fix it. At this point, I'd greatly appreciate a set of fresh eyes on the code.

EDIT: I don't think it's an alignment issue at this point, since the effect also appears at the beginning if you time-reverse the test signal. (That is, we're not just off by a window or anything like that.)

image

You can control this a bit by raising the threshold on the squared window that determines which samples we up-scale. For instance, going from 1e-8 (above) to 1e-2 gives:

image

Going all the way up to 1 (only down-scale, never up-scale):

image

Question is: is there a smart way to set this threshold in general?

@bmcfee
Copy link
Member Author

bmcfee commented Feb 6, 2017

Question is: is there a smart way to set this threshold in general?

Still poking at this. The up-sweep on the high frequencies at the end goes away nicely if you set amin=sqrt(2); at least for a few windows (hann, triangle, hamming).

image

On others (blackmanharris) it's way the hell off:

image

The scaling does seem to behave correctly with respect to hop length. The above were 44K @ hop=256. Here's one with hop=4096, and the magnitude seems to be well preserved:

image

similarly for hop=128:

image

@bmcfee
Copy link
Member Author

bmcfee commented Feb 6, 2017

In this continuing saga, latest batch of changes includes jit acceleration for time-domain reconstruction, and a correction to a phase offset error.

I'm now getting SDR of around +12.8 on the test sweep (44KHz, hop=512, 3x frequency over-sampling).

22KHz test sweep gets SDR of +6.1. A lot of the loss here can probably be attributed to the high-frequency components at the edges of padding in the test-signal construction.

On the included example audio, at 44K, 5-25 seconds, we get SDR=12.4. Dropping the hop to 256 gets us up to 12.5. Keeping the hop at 512 and going to 5x oversampling in frequency gets us to +14.

So..... I think it's all working, more or less, as it should. The window division threshold still seems strange to me, so any input on that would be great. Otherwise, I'm happy to cut this one loose, document it as unstable, and call it a day.

@stefan-balke
Copy link
Member

Beside the SDR, how does it sound? Okayish?

@bmcfee
Copy link
Member Author

bmcfee commented Feb 7, 2017

Beside the SDR, how does it sound? Okayish?

Pretty good, in general. There's obviously some high-frequency attenuation, and some of the transients get rattled around a bit. Overall, I'd say it sounds better than, say, an istft with random phase.

@bmcfee
Copy link
Member Author

bmcfee commented Feb 7, 2017

The window division threshold still seems strange to me, so any input on that would be great.

Quick follow-up on this: testing with a pure sine at C5, I see the reconstructed amplitude drop as I raise the amount of frequency over-sampling. I suppose there needs to be some kind of absolute scaling in terms of Q for this to all work out properly. Will investigate further later on.

@bmcfee
Copy link
Member Author

bmcfee commented Feb 8, 2017


Nevermind, figured it out, but it seems a bit broken.  I'll make a separate issue.

@bmcfee bmcfee changed the title [WIP] Inverse CQT Inverse CQT Feb 9, 2017
@bmcfee
Copy link
Member Author

bmcfee commented Feb 9, 2017

Update: I couldn't make heads nor tails of the Q-dependent amplitude scaling.

I think at this point, it's best to just flag it as unstable (already done) and let it loose. Maybe someone smart will come along and fix it. 😁

Anyone care to CR?

@bmcfee bmcfee modified the milestones: 0.5.1, 0.5 Feb 10, 2017
@dpwe
Copy link
Contributor

dpwe commented Sep 23, 2017 via email

@bmcfee
Copy link
Member Author

bmcfee commented Oct 3, 2017

Well I think this one's about as ready as can be, so CR would be appreciated at any point.

Updates on the tests described earlier in the thread:

Sine sweep

EDIT: updated to more closely match the previous examples in-thread.

Here's a log sweep from C1->C9 at 44.1KHz, padded with silence on either end. Plotted are cqt(y) and cqt(icqt(cqt(y))). 8 octaves, 36 bpo, starting at C1 with a hop of 512.

image

The amplitude scaling seems to behave properly:
image

There are obvious artifacts at the discontinuities, but they're totally inaudible. SDR is 6.87dB. Re-running with a test signal of C1-C8 brings this up to 9.33.

Real example

20 seconds of musical audio (billboard dataset) at 44.1KHz. Here we're running into band limiting issues, but the reconstruction sounds pretty good. As a function of frequency oversampling (bins_per_octave / 12) keeping everything else fixed, we get:

n_over SDR
1 5.7
3 10.2
5 12.6
7 14.6
9 15.8

Reconstruction image (bpo=36):
image

@bmcfee bmcfee added enhancement Does this improve existing functionality? Hacktoberfest labels Oct 3, 2017
@dpwe
Copy link
Contributor

dpwe commented Oct 4, 2017 via email

@bmcfee
Copy link
Member Author

bmcfee commented Oct 4, 2017

If there's any way I can make this one easier, please let me know.

Quick summary of changes:

  • cqt now includes an extra explicit scaling of sqrt(2) per octave here. This is there to implicitly scale the filters as well as the signal when downsampling. EDIT: disregard, that was already there just done differently.
  • sum-squared window normalization has been factored out of istft into its own helper, and numba-accelerated.
  • cqt filter bases are properly centered now.
  • icqt is implemented by overlap-adding (basis[i] (*) C[i]).real for each basis within an octave, and then upsampling to go to the next octave.
  • EDIT: updated for compatiblity with stft conjugates the correct half of the spectrum #634

The main part that I'm shaky on is this line:

  • window sum-square normalization ; needs to be thresholded for numerical stability, and sqrt(2) works for hann windows, but other windows seem to go off the rails.

@bmcfee
Copy link
Member Author

bmcfee commented Oct 10, 2017

Quick update: after merging #634 and updating this PR accordingly, we get a few more bits of precision on reconstruction:

n_over SDR (old) SDR (current)
1 5.7 5.8
3 10.2 10.3
5 12.6 13.0
7 14.6 15.4
9 15.8 17.4

@bmcfee
Copy link
Member Author

bmcfee commented Nov 3, 2017

Update: sorted out some of the window function issues here, so we at least get consistently (if incorrectly) scaled outputs for different window functions on the same signal. There seems to be some non-trivial gain kicking in that I haven't quite sorted out how to characterize.

Quick sample of estimated gain (y_inv.max() / y.max()) with different windows, hops, and over-sampling ratios:

window  over_sample hop_length gain
(kaiser, 4.0) 1 128 0.325122
(kaiser, 4.0) 1 256 0.344234
(kaiser, 4.0) 1 512 0.339830
(kaiser, 4.0) 3 128 0.292700
(kaiser, 4.0) 3 256 0.295633
(kaiser, 4.0) 3 512 0.299188
(kaiser, 4.0) 5 128 0.289119
(kaiser, 4.0) 5 256 0.291737
(kaiser, 4.0) 5 512 0.297369
blackmanharris 1 128 0.449154
blackmanharris 1 256 0.535853
blackmanharris 1 512 0.547814
blackmanharris 3 128 0.431386
blackmanharris 3 256 0.434321
blackmanharris 3 512 0.434058
blackmanharris 5 128 0.428521
blackmanharris 5 256 0.428522
blackmanharris 5 512 0.428522
hamming 1 128 0.364840
hamming 1 256 0.402344
hamming 1 512 0.394746
hamming 3 128 0.335422
hamming 3 256 0.337668
hamming 3 512 0.345552
hamming 5 128 0.331832
hamming 5 256 0.334356
hamming 5 512 0.341088
hann 1 128 0.372560
hann 1 256 0.372503
hann 1 512 0.372511
hann 3 128 0.354173
hann 3 256 0.354169
hann 3 512 0.354162
hann 5 128 0.351063
hann 5 256 0.351063
hann 5 512 0.351061
triangle 1 128 0.365550
triangle 1 256 0.445327
triangle 1 512 0.446967
triangle 3 128 0.323054
triangle 3 256 0.323209
triangle 3 512 0.323839
triangle 5 128 0.319909
triangle 5 256 0.319899
triangle 5 512 0.319945

@bmcfee
Copy link
Member Author

bmcfee commented Nov 3, 2017

Pushed up the latest version. It has some gnarly global scaling effects as noted above, so unit tests will fail for the time being. Note: I put in an extra correction of sqrt(n_fft) so the gain values above are no longer correct.

It does appear that the gain is a dynamic property that depends on the oversampling ratio, which is not currently corrected. For example, a pure tone at C3 processed with triangle windows results in the following:

  window over_sample hop_length gain
triangle 1 128 7.521
triangle 1 256 7.506
triangle 1 512 7.629
triangle 1 1024 7.568
triangle 3 128 14.077
triangle 3 256 14.077
triangle 3 512 14.077
triangle 3 1024 14.058
triangle 5 128 13.957
triangle 5 256 13.957
triangle 5 512 13.957
triangle 5 1024 13.953

Similarly, for hann windows:

window over_sample hop_length gain
hann 1 128 8.244
hann 1 256 8.241
hann 1 512 8.215
hann 1 1024 8.231
hann 3 128 15.473
hann 3 256 15.473
hann 3 512 15.473
hann 3 1024 15.477
hann 5 128 15.325
hann 5 256 15.325
hann 5 512 15.325
hann 5 1024 15.326

@bmcfee
Copy link
Member Author

bmcfee commented Nov 10, 2017

I added another bit of gain correction here that accounts for filter redundancy. There may be a bit of loss at the octave boundaries, since the normalization is computed from a single octave's worth of filters, so we lose a bit at the top and bottom.

It's still not exactly right, though. The SDR for estimated gain vs exact gain calculation (given the input signal and computing the RMSE ratio between yin and yout) is about half what it should be.

@bmcfee
Copy link
Member Author

bmcfee commented Nov 18, 2017

One more update on this one.

As mentioned above, I'm pretty confident that the last variable to tease out is the dependence of gain on the combination of (window, bins_per_octave) (or, maybe more precisely, window and Q).

Here's the most recent set of gain estimates. The test signal is a zero-padded Gb4 tone scaled to have unit (l2) norm, at 44100 Hz. The test code is as follows:

R = pd.DataFrame(columns=['window', 'over_sample', 'hop_length', 'gain', 'Q'])

for window in tqdm(['hann', 'triangle', 'boxcar', 'hamming', 'blackmanharris']):
    for over_sample in tqdm([3, 5, 7, 9]):
        Q = 1./(2.0**(1./(12 * over_sample)) - 1)
        for hop_length in tqdm([256, 512, 1024]):
            C = librosa.cqt(y,
                sr=sr,
                hop_length=hop_length,
                bins_per_octave=int(12*over_sample),
                n_bins=int(6 * 12 * over_sample),
                tuning=0.0,
                window=window, sparsity=0,
                scale=True)
            
            y2 = librosa.icqt(C, sr=sr,
                  hop_length=hop_length,
                  bins_per_octave=int(12*over_sample),
                  scale=True,
                  window=window)
            
            gain = np.sum(np.abs(y2**2))**0.5 / np.sum(np.abs(y**2))**0.5
            
            R = R.append(dict(window=window, over_sample=over_sample,
                              hop_length=hop_length, gain=gain, Q=Q), ignore_index=True)
            
R.sort_values(['window', 'over_sample', 'hop_length'], inplace=True)

And the results are:

window over_sample hop_length gain Q
blackmanharris 3 256 3.273691 51.438626
blackmanharris 3 512 3.273692 51.438626
blackmanharris 3 1024 3.273747 51.438626
blackmanharris 5 256 3.231796 86.062665
blackmanharris 5 512 3.231796 86.062665
blackmanharris 5 1024 3.231791 86.062665
blackmanharris 7 256 3.213130 120.687071
blackmanharris 7 512 3.213130 120.687071
blackmanharris 7 1024 3.213128 120.687071
blackmanharris 9 256 3.204726 155.311599
blackmanharris 9 512 3.204726 155.311599
blackmanharris 9 1024 3.204726 155.311599
boxcar 3 256 0.028354 51.438626
boxcar 3 512 0.048817 51.438626
boxcar 3 1024 0.105860 51.438626
boxcar 5 256 0.026040 86.062665
boxcar 5 512 0.032643 86.062665
boxcar 5 1024 0.042883 86.062665
boxcar 7 256 0.014753 120.687071
boxcar 7 512 0.021084 120.687071
boxcar 7 1024 0.045244 120.687071
boxcar 9 256 0.017771 155.311599
boxcar 9 512 0.023747 155.311599
boxcar 9 1024 0.039175 155.311599
hamming 3 256 1.120848 51.438626
hamming 3 512 1.120877 51.438626
hamming 3 1024 1.120938 51.438626
hamming 5 256 1.105181 86.062665
hamming 5 512 1.105189 86.062665
hamming 5 1024 1.105207 86.062665
hamming 7 256 1.098501 120.687071
hamming 7 512 1.098503 120.687071
hamming 7 1024 1.098517 120.687071
hamming 9 256 1.095247 155.311599
hamming 9 512 1.095248 155.311599
hamming 9 1024 1.095254 155.311599
hann 3 256 1.382946 51.438626
hann 3 512 1.382946 51.438626
hann 3 1024 1.382943 51.438626
hann 5 256 1.362803 86.062665
hann 5 512 1.362803 86.062665
hann 5 1024 1.362802 86.062665
hann 7 256 1.353827 120.687071
hann 7 512 1.353827 120.687071
hann 7 1024 1.353827 120.687071
hann 9 256 1.349483 155.311599
hann 9 512 1.349483 155.311599
hann 9 1024 1.349483 155.311599
triangle 3 256 1.220527 51.438626
triangle 3 512 1.220505 51.438626
triangle 3 1024 1.220164 51.438626
triangle 5 256 1.218933 86.062665
triangle 5 512 1.218931 86.062665
triangle 5 1024 1.218873 86.062665
triangle 7 256 1.217677 120.687071
triangle 7 512 1.217677 120.687071
triangle 7 1024 1.217658 120.687071
triangle 9 256 1.217132 155.311599
triangle 9 512 1.217132 155.311599
triangle 9 1024 1.217123 155.311599

Aside from boxcar, which is a mess, everything else seems to be pretty stable for a fixed choice of window, and for larger over_sample values, the gain seems to converge downward. The consistency across choice of hop_length indicates that the window sum-square normalization is behaving correctly.

When doing the overlap-add, each filter is scaled up by this estimate of its overlap with neighboring filters. This is probably incorrect, so any insight on how to do this properly would be much appreciated.

@bmcfee
Copy link
Member Author

bmcfee commented Feb 2, 2018

Offline consensus: I'll get the tests passing again, mark this feature as unstable and then cut it loose. We can fix it later.

working on scale issues

more or less fixed scaling issues in icqt

added basic unit test for icqt

expanded and tightened icqt tests

Added sparsity to icqt basis

testing icqt with odd-multiple hop

removed spurious print

factored out window sum-squared calculation

implemented but commented out icqt window inversion

fixed plots for window_sumsquare example

remove window modulation for now

added channel-dependent squared window normalization to icqt

fixed some import issues

simplified some normalization code in icqt

refactored window sumsquare to avoid numba compilation warning

fixed an edge case in window sumsquare

fixed a regression in window ss

jit-enabled icqt reconstruction. sdr looks good

linting icqt

weakened icqt test a hell of a lot

fixed optional_jit usage error
fixed some integer window length issues

tightened test requirements for icqt

relaxed icqt test for py3.4 build

raised test accuracy on icqt for py3.4 build

forgot to remove cqt real test

removed optional jits
cleaning up window normalization in icqt

simplifying icqt normalizations

revised gain correction

normalized scaling within icqt

windowing stft within cqt

re-introduced scale estimation

reverted scaling mode
reverted internal windowing to preserve pseudo/hybrid behavior.
@bmcfee
Copy link
Member Author

bmcfee commented Feb 10, 2018

Reviewed 1 of 4 files at r1, 1 of 3 files at r3, 2 of 2 files at r4.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@bmcfee bmcfee merged commit d3e7bfe into master Feb 10, 2018
@bmcfee bmcfee deleted the icqt branch February 10, 2018 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Does this improve existing functionality? functionality Does this add new functionality?
Development

Successfully merging this pull request may close these issues.

None yet

3 participants