[CR] Inverse CQT #435
Force-pushed from 5443e69 to 7ee43c6.
Added a basic unit test with a sine sweep. This test fails gloriously, and reveals that the output scale depends critically on the hop length, probably due to a lack of explicit correction for window modulation.
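For intuition about why the uncorrected output scale tracks the hop length, here's a small standalone sketch (an assumed example, not the PR's code; librosa later gained a similar `window_sumsquare` helper) that accumulates the squared analysis window at every hop. The steady-state level of that envelope is `mean(w**2) * win_length / hop_length`, so halving the hop doubles the uncorrected output:

```python
# Sketch: overlap-add of the squared (periodic) Hann window.
import numpy as np

def window_sumsquare(win_length, hop_length, n_frames, n):
    """Sum of squared, hop-shifted periodic Hann windows over n samples."""
    t = np.arange(win_length)
    w2 = (0.5 - 0.5 * np.cos(2 * np.pi * t / win_length)) ** 2
    env = np.zeros(n)
    for i in range(n_frames):
        start = i * hop_length
        end = min(start + win_length, n)
        env[start:end] += w2[:end - start]
    return env

for hop in (64, 128, 256):
    env = window_sumsquare(1024, hop, 64, 1024 + 64 * hop)
    print(hop, env[2048])  # steady-state value: 0.375 * 1024 / hop
```

Without dividing this envelope back out, a smaller hop simply stacks more windows on each sample, inflating the reconstruction.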
Factored out the window modulation logic. Here's the test output for a sine sweep (unit amplitude) going C2 to C6 at 44.1KHz, with varying hop length. The assert error output is the max-norm of the output (reconstructed signal) compared to the input (~1). Note the dependence between the hop length and the reconstructed scale:

```
======================================================================
FAIL: test_constantq.test_icqt(44100, False, 64, 1, array([ 0.0093187 , 0.01863697, 0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.58766218480812893, 0.99999999947403728)
======================================================================
FAIL: test_constantq.test_icqt(44100, False, 128, 1, array([ 0.0093187 , 0.01863697, 0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.29437417911466718, 0.99999999947403728)
======================================================================
FAIL: test_constantq.test_icqt(44100, False, 384, 1, array([ 0.0093187 , 0.01863697, 0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.12826735736959327, 0.99999999947403728)
======================================================================
FAIL: test_constantq.test_icqt(44100, False, 512, 1, array([ 0.0093187 , 0.01863697, 0.02795402, ..., -0.99819204,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/bmcfee/miniconda/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/bmcfee/git/librosa/tests/test_constantq.py", line 359, in __test
    np.max(np.abs(y)))
AssertionError: (0.128829889962883, 0.99999999947403728)
```
Force-pushed from af1d4d6 to 5bc0bdf.
Rewrote `icqt` to work in the time domain by convolution of the basis conjugate with each row of the CQT. This makes it a bit more obvious how to handle dynamic-length window normalization.

It works pretty well for short window lengths. Here's an exponential sine sweep padded with silence at the end:

```python
sr = 44100
y = make_signal(sr, 3.0, fmin='C1', fmax='C9')
y = librosa.util.fix_length(y, 5*sr)
```

With the following CQT parameters:

```python
hop_length = 128
over_sample = 3
fmin = librosa.note_to_hz('C1')
C = librosa.cqt(y,
                sr=sr,
                hop_length=hop_length,
                bins_per_octave=int(12*over_sample),
                n_bins=int(8 * 12 * over_sample),
                fmin=fmin,
                tuning=0.0,
                scale=True)
```

Top is the input CQT; middle is the CQT of the reconstructed signal without window normalization; bottom is with window normalization. This looks pretty good in CQT space; in the time domain it's not horrible, but not close to perfect either (horizontal lines mark the bounds of the input signal).

The inflation at the end is due to the sum-squared scaling of the window function, and depends on the hop length at the top octave; repeating this with `hop_length = 512` makes it more severe.

In `stft`, we hack around window underflow with the following type of calculation:

```python
good_idx = wss > tiny(wss)
y[good_idx] /= wss[good_idx]
```

i.e., only divide out the window sum-square when it won't cause underflow problems. Doing the same thing here works up to a point, but since it's a different index set for each frequency, you end up with the ramp at high frequencies. I've tried hacking around this with a user-tunable parameter that can drop in place of `tiny(wss)`, but this is a brutal hack and doesn't work well.

@dpwe got any suggestions for dealing with this more sanely?
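For concreteness, here's a self-contained sketch of that underflow guard. `tiny` is written out explicitly here; librosa provides the same idea as `librosa.util.tiny`:

```python
# Underflow-safe division by the window sum-square envelope.
import numpy as np

def tiny(x):
    """Smallest positive normal number for x's dtype."""
    return np.finfo(np.asarray(x).dtype).tiny

def safe_normalize(y, wss):
    """Divide out the window sum-square only where it won't underflow."""
    y = y.copy()
    good_idx = wss > tiny(wss)
    y[good_idx] /= wss[good_idx]
    return y

wss = np.array([1.5, 0.5, 0.0, 1e-320])   # last two: zero and a subnormal
y = np.array([3.0, 1.0, 0.2, 0.2])
result = safe_normalize(y, wss)
print(result)  # zero/subnormal bins are left unscaled
```

The catch described above: with a fixed threshold, samples just above it get massively up-scaled, and for the CQT the set of such samples differs per frequency band.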
Why do we need to divide out the windows? Can't we make them sum to 1?
They are all different lengths depending on the center frequency?
DAn.
They do sum to 1, but because each channel has a different overlap with respect to a fixed hop length, you get channel-dependent modulation that needs to be corrected for. (Unlike in `stft`, where the same modulation affects all channels because the window length is constant.) Upon closer inspection, I think the problem I ran into above is due to misalignment between the window normalization curve and the reconstructed time-domain signal. It's gonna take me some time to track this down exactly.
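The channel-dependent modulation can be illustrated with a toy overlap-add (an assumed example, not PR code): even when every window sums to 1, the ripple of the overlap-add envelope depends on the window length relative to the hop, so each CQT channel sees a different modulation.

```python
# Overlap-add envelopes for sum-normalized windows of different lengths
# at a fixed hop; the peak-to-trough ripple is channel-dependent.
import numpy as np

hop, n = 128, 16384
ripple = {}
for win_length in (2048, 512, 128):  # longer window ~ lower center frequency
    w = np.hanning(win_length)
    w /= w.sum()                     # each window individually sums to 1
    env = np.zeros(n)
    for start in range(0, n - win_length, hop):
        env[start:start + win_length] += w
    seg = env[n // 2:n // 2 + 4 * hop]          # steady-state region
    ripple[win_length] = seg.max() - seg.min()  # peak-to-trough modulation
print(ripple)
```

The long (low-frequency) window yields an almost flat envelope, while the window whose length matches the hop ripples strongly, which is exactly why a single global normalization can't work.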
Okay, I redid the window normalization and things should now be properly aligned. I still see some smeared-out transient effects at the boundaries of the test signal, but I have no idea how to fix them. At this point, I'd greatly appreciate a set of fresh eyes on the code.

EDIT: I don't think it's an alignment issue at this point, since the effect also appears at the beginning if you time-reverse the test signal. (That is, we're not just off by a window or anything like that.) You can control this a bit by raising the threshold on the squared window that determines which samples we up-scale: for instance, going from 1e-8 (above) to 1e-2, or all the way up to 1 (only down-scale, never up-scale). The question is: is there a smart way to set this threshold in general?
Still poking at this. The up-sweep on the high frequencies at the end goes away nicely with some window functions; on others (blackmanharris) it's way the hell off. The scaling does seem to behave correctly with respect to hop length: the above were 44K @ hop=256, and the magnitude seems to be well preserved at hop=4096 and, similarly, at hop=128.
In this continuing saga, the latest batch of changes includes jit acceleration for time-domain reconstruction and a correction to a phase offset error. I'm now getting SDR of around +12.8 on the test sweep (44KHz, hop=512, 3x frequency over-sampling); the 22KHz test sweep gets SDR of +6.1. A lot of the loss here can probably be attributed to the high-frequency components at the edges of padding in the test-signal construction.

On the included example audio, at 44K, 5-25 seconds, we get SDR=12.4. Dropping the hop to 256 gets us up to 12.5; keeping the hop at 512 and going to 5x oversampling in frequency gets us to +14.

So... I think it's all working, more or less, as it should. The window division threshold still seems strange to me, so any input on that would be great. Otherwise, I'm happy to cut this one loose, document it as unstable, and call it a day.
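The SDR figures above can be reproduced with the usual energy-ratio definition; here's a minimal sketch (assuming time-aligned, equal-length signals; this is the generic formula, not the exact script used in the PR):

```python
# Signal-to-distortion ratio: signal energy over residual energy, in dB.
import numpy as np

def sdr(reference, estimate):
    """SDR in dB between a reference signal and its reconstruction."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(num / den)

t = np.linspace(0, 1, 44100, endpoint=False)
ref = np.sin(2 * np.pi * 440.0 * t)
est = 0.9 * ref                  # a pure 10% amplitude error
print(round(sdr(ref, est), 1))   # -> 20.0
```

So a flat gain error of 10% alone already caps SDR at 20 dB, which is why the global scaling issues discussed in this thread dominate the metric.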
Besides the SDR, how does it sound? Okay-ish?
Pretty good, in general. There's obviously some high-frequency attenuation, and some of the transients get rattled around a bit. Overall, I'd say it sounds better than, say, an istft with random phase. |
Quick follow-up on this: testing with a pure sine at C5, I see the reconstructed amplitude drop as I raise the amount of frequency over-sampling. I suppose there needs to be some kind of absolute scaling in terms of |
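The quantity elided above is presumably the filter Q (or something derived from it). For reference, using the same Q formula that appears later in this thread, Q grows roughly linearly with `bins_per_octave`, so filter lengths and redundancy grow with the over-sampling factor:

```python
# Q factor as a function of bins_per_octave, per the formula
# Q = 1 / (2**(1/bins_per_octave) - 1) used elsewhere in this thread.
qs = {}
for over_sample in (1, 3, 5):
    bpo = 12 * over_sample
    qs[bpo] = 1.0 / (2.0 ** (1.0 / bpo) - 1.0)
    print(bpo, round(qs[bpo], 1))
```

Roughly tripling `bins_per_octave` triples Q, which is consistent with an amplitude scaling that drifts as over-sampling increases.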
Update: I couldn't make heads or tails of the Q-dependent amplitude scaling. I think at this point it's best to just flag it as unstable (already done) and let it loose. Maybe someone smart will come along and fix it. 😁 Anyone care to CR?
I'm still here.
…On Fri, Sep 22, 2017 at 2:01 PM, Brian McFee ***@***.***> wrote:
I guess nobody's up to CR this. Should we just merge, document the feature
as unstable, and deal with the fallout later?
I want to do this. More accurately, I want to have done this.
DAn.
…On Tue, Oct 3, 2017 at 4:25 PM, Brian McFee ***@***.***> wrote:
Well I think this one's about as ready as can be, so CR would be
appreciated at any point.
Updates on the tests described earlier in the thread:
Sine sweep
Here's a log sweep from C2->C6 at 44.1KHz, padded with silence on either
end. Plotted are cqt(y) and cqt(icqt(cqt(y))). 8 octaves, 36 bpo,
starting at C1 with a hop of 512.
There are obvious artifacts at the discontinuities, but they're totally
inaudible. SDR is 29.67dB.
Real example
20 seconds of musical audio (billboard dataset) at 44.1KHz. Here we're
running into band limiting issues, but the reconstruction sounds pretty
good. As a function of frequency oversampling (bins_per_octave / 12)
keeping everything else fixed, we get:
| n_over | SDR |
|--------|-----|
| 1 | 5.7 |
| 3 | 10.2 |
| 5 | 12.6 |
| 7 | 14.6 |
| 9 | 15.8 |
Reconstruction image (bpo=36):
If there's any way I can make this one easier, please let me know. Quick summary of changes:
The main part that I'm shaky on is this line:
Quick update: after merging #634 and updating this PR accordingly, we get a few more bits of precision on reconstruction.
Update: sorted out some of the window function issues here, so we at least get consistently (if incorrectly) scaled outputs for different window functions on the same signal. There seems to be some non-trivial gain kicking in that I haven't quite sorted out how to characterize.
Pushed up the latest version. It has some gnarly global scaling effects as noted above, so unit tests will fail for the time being. Note: I put in an extra correction factor. It does appear that the gain is a dynamic property that depends on the oversampling ratio, which is not currently corrected; for example, a pure tone at C3 shows the effect when processed with triangle windows, and similarly with hann windows.
I added another bit of gain correction here that accounts for filter redundancy. There may be a bit of loss at the octave boundaries, since the normalization is computed from a single octave's worth of filters, so we lose a bit at the top and bottom. It's still not exactly right, though. The SDR for estimated gain vs. exact gain calculation (given the input signal, computing the RMSE ratio between…
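To make "filter redundancy" concrete, here's a hedged toy model (an assumed illustration, not the PR's correction): for a bank of Gaussian bandpass responses with constant-Q spacing, the aggregate squared frequency response counts how many filters cover each frequency, and dividing it out is one plausible way to normalize the overlap across channels. The Q of ~17 below matches the Q formula for 12 bins per octave used later in the thread.

```python
# Toy constant-Q filterbank: per-frequency redundancy from the sum of
# squared (Gaussian) magnitude responses.
import numpy as np

freqs = np.linspace(0, 1, 4096)                  # normalized frequency axis
centers = 0.05 * 2.0 ** (np.arange(48) / 12.0)   # 4 octaves, 12 bins/octave
bw = centers / 17.0                              # bandwidth ~ f_c / Q, Q ~ 17
H = np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / bw[:, None]) ** 2)
redundancy = np.sum(H ** 2, axis=0)              # per-frequency coverage
print(redundancy[819], redundancy[1638])         # ~equal inside the band
```

With constant-Q spacing the redundancy is roughly flat inside the covered band but rolls off at the band edges, which matches the observed loss at the top and bottom octave boundaries.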
One more update on this one. As mentioned above, I'm pretty confident that the last variable to tease out is the dependence of gain on the combination of `window`, `over_sample`, and `hop_length`.

Here's the most recent set of gain estimates. The test signal is zero-padded, as before:

```python
R = pd.DataFrame(columns=['window', 'over_sample', 'hop_length', 'gain', 'Q'])
for window in tqdm(['hann', 'triangle', 'boxcar', 'hamming', 'blackmanharris']):
    for over_sample in tqdm([3, 5, 7, 9]):
        Q = 1. / (2.0**(1. / (12 * over_sample)) - 1)
        for hop_length in tqdm([256, 512, 1024]):
            C = librosa.cqt(y,
                            sr=sr,
                            hop_length=hop_length,
                            bins_per_octave=int(12 * over_sample),
                            n_bins=int(6 * 12 * over_sample),
                            tuning=0.0,
                            window=window, sparsity=0,
                            scale=True)
            y2 = librosa.icqt(C, sr=sr,
                              hop_length=hop_length,
                              bins_per_octave=int(12 * over_sample),
                              scale=True,
                              window=window)
            gain = np.sum(np.abs(y2**2))**0.5 / np.sum(np.abs(y**2))**0.5
            R = R.append(dict(window=window, over_sample=over_sample,
                              hop_length=hop_length, gain=gain, Q=Q),
                         ignore_index=True)
R.sort_values(['window', 'over_sample', 'hop_length'], inplace=True)
```
When doing the overlap-add, each filter is scaled up by an estimate of its overlap with neighboring filters. This is probably incorrect, so any insight on how to do it properly would be much appreciated.
Offline consensus: I'll get the tests passing again, mark this feature as unstable, and then cut it loose. We can fix it later.
- working on scale issues
- more or less fixed scaling issues in icqt
- added basic unit test for icqt
- expanded and tightened icqt tests
- added sparsity to icqt basis
- testing icqt with odd-multiple hop
- removed spurious print
- factored out window sum-squared calculation
- implemented but commented out icqt window inversion
- fixed plots for window_sumsquare example
- remove window modulation for now
- added channel-dependent squared window normalization to icqt
- fixed some import issues
- simplified some normalization code in icqt
- refactored window sumsquare to avoid numba compilation warning
- fixed an edge case in window sumsquare
- fixed a regression in window ss
- jit-enabled icqt reconstruction. sdr looks good
- linting icqt
- weakened icqt test a hell of a lot
- fixed optional_jit usage error
- cleaning up window normalization in icqt
- simplifying icqt normalizations
- revised gain correction
- normalized scaling within icqt
- windowing stft within cqt
- re-introduced scale estimation
- reverted scaling mode
reverted internal windowing to preserve pseudo/hybrid behavior.
Reviewed 1 of 4 files at r1, 1 of 3 files at r3, 2 of 2 files at r4.
Implements #165.
This version seems to work, but has some scaling issues that need to be resolved.