Add correlate_template() function with 'full' normalization; deprecate 'domain' keyword in correlate() in favor of 'method' keyword #2042
What does this PR do?
This PR adds full normalization to cross-correlations, as discussed in #2035.
I have based this on master, given that it was flagged for 1.2.0 in #2035 and it isn't strictly a bug fix.
I still need to work out a good way to test for rounding errors. The usual case I use is to download some data from the 2016 M7.8 Kaikoura earthquake, which has enough dynamic range to produce floating-point errors in many correlation implementations when using a day of data. How do we feel about adding an extra network test that is quite memory intensive...?
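As a possible alternative to a network-heavy test, a synthetic trace with one large transient can expose the same class of rounding error. This is only a sketch of the idea, not part of the PR; the `rolling_mean` helper and all parameters here are illustrative:

```python
import numpy as np

def rolling_mean(x, n):
    """Rolling mean via cumulative sums (the cumsum trick used for the
    normalization); precision depends on the accumulator dtype."""
    c = np.cumsum(x, dtype=x.dtype)
    out = np.empty(len(x) - n + 1, dtype=x.dtype)
    out[0] = c[n - 1]
    out[1:] = c[n:] - c[:-n]
    return out / n

rng = np.random.default_rng(42)
# Low-amplitude noise plus one huge transient, mimicking the dynamic
# range of a day of data containing a large earthquake.
data = rng.standard_normal(864_000)  # a day at 10 Hz to keep it cheap
data[400_000:401_000] += 1e6 * rng.standard_normal(1_000)

# Compare float32 against float64 as a cheap proxy for rounding error;
# the error grows once the transient enters the cumulative sum.
err = np.abs(rolling_mean(data.astype(np.float32), 500).astype(np.float64)
             - rolling_mean(data.astype(np.float64), 500))
print(err.max())
```

A test built this way stays deterministic (seeded) and needs no network access.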
Looks pretty good to me. I think someone else should still review the cross correlation normalization as I'm not at all an expert in this.
Can you add a small test case that tests all possibilities of the normalize argument? There are a lot now, and it will be easy to introduce a regression in future changes if this is not explicitly tested. A mock test that checks which functions are being called might work well.
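The mock idea could look something like this. The helper names and dispatch dict below are purely illustrative, not ObsPy's actual internals; the point is that patching a normalizer lets the test assert it is called exactly when its keyword is passed:

```python
import numpy as np
from unittest import mock

def _norm_naive(cc, a, b):
    # one global normalization factor (stand-in implementation)
    return cc / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def _norm_full(cc, a, b):
    return cc  # placeholder for the windowed 'full' normalization

_NORMALIZERS = {"naive": _norm_naive, "full": _norm_full}

def correlate_template(a, b, normalize="full"):
    cc = np.correlate(a, b, mode="valid")
    if normalize is not None:
        cc = _NORMALIZERS[normalize](cc, a, b)
    return cc

# Patch one normalizer and assert it is called only for its keyword,
# so dispatch regressions are caught directly.
a, b = np.ones(10), np.ones(3)
with mock.patch.dict(_NORMALIZERS,
                     {"full": mock.Mock(side_effect=_norm_full)}):
    correlate_template(a, b, normalize="full")
    _NORMALIZERS["full"].assert_called_once()
    correlate_template(a, b, normalize=None)
    _NORMALIZERS["full"].assert_called_once()  # still exactly one call
```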
This looks also good to me. Here are my comments after a first superficial review:
Thanks for those reviews @trichter and @krischer, I'm making some changes down here and have fixed the things I have replied to. Agree that results are not quite as expected and I need to dig into this a little more.
I meant something else with that comment. In line 155 you have to substitute
Also, I think I was not correct on the point that the correlations with different normalizations are the same for len(a) == len(b). But you can check that full_xcorr == naive_xcorr for the mid sample and that full_xcorr > naive_xcorr for all other samples.
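That property check could be written out by hand along these lines. This is a sketch, not the PR's actual functions; note the strict inequality only holds where the raw correlation is positive, so absolute values are compared here:

```python
import numpy as np

def _overlaps(a, b):
    """Yield the overlapping segments of a and b for every lag of a
    'full'-mode cross-correlation (len(a) == len(b) assumed)."""
    n = len(a)
    for s in range(-(n - 1), n):
        if s < 0:
            yield a[:n + s], b[-s:]
        else:
            yield a[s:], b[:n - s]

def naive_xcorr(a, b):
    # one global normalization factor for every lag
    norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return np.array([np.dot(x, y) for x, y in _overlaps(a, b)]) / norm

def full_xcorr(a, b):
    # each lag is normalized by the energy of the overlap only
    return np.array([np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))
                     for x, y in _overlaps(a, b)])

rng = np.random.default_rng(0)
a, b = rng.standard_normal(50), rng.standard_normal(50)
f, nv = full_xcorr(a, b), naive_xcorr(a, b)
mid = len(a) - 1
assert np.isclose(f[mid], nv[mid])                # identical at zero lag
assert np.all(np.abs(f) >= np.abs(nv) - 1e-12)    # elsewhere |full| only grows
```

The equality at the mid sample follows because the overlap there is the full array, so the local energies equal the global ones.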
And last but not least here is another input:
I don't have strong opinions about this but I slightly prefer the first version. But if you think this use case warrants a second function I'd personally also be fine with it.
I think I slightly prefer the second option.
OK, I've been doing some coding on the _normxcorr function and I finally came up with:
Did not push to this branch, because I didn't want to overwrite @calum-chamberlain's code.
Edited code block.
So that would work, but should be less memory efficient for the
""" :note: For ``normalization='full'`` the shorter of ``(a, b)`` will be slid through the long data, such that the zeroth element of the returned array corresponds to the correlation of the short input array with the first n elements of the long input array where n is the length of the short input array. """
Opinions welcome on that, sounds a little opaque to me.
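To make the note concrete, here is a hand-rolled toy version of the sliding behavior (not the PR's implementation; function name and data are illustrative):

```python
import numpy as np

def sliding_normxcorr(data, template):
    """out[i] correlates template with data[i:i + len(template)], so
    out[0] uses the first n samples of the longer array."""
    n = len(template)
    t = template - template.mean()
    t /= np.sqrt(np.dot(t, t))
    out = np.empty(len(data) - n + 1)
    for i in range(len(out)):
        w = data[i:i + n] - data[i:i + n].mean()
        out[i] = np.dot(w, t) / np.sqrt(np.dot(w, w))
    return out

data = np.array([0.0, 1.0, 2.0, 5.0, 3.0, 1.0, 0.0, 4.0])
template = data[2:5].copy()
cc = sliding_normxcorr(data, template)
# cc[2] == 1.0: the template matches itself at offset 2
```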
Yup, have changed that test to
I mildly prefer option 2 as well, but with some doc-string in
@trichter how do you want to proceed? Happy to reformat my stuff here with the
Still, we are talking about different issues here. It just has to be replaced in this one line to get the algorithm correct.
OK, I am going to refactor everything a bit and push my changes to your branch.
This is looking pretty good by now. I'll trust the two of you with the mathematical details and only have some very minor comments left.
Could also add a small regression test for normalize=True, which I think is not covered by the current test suite.
One other small request I have: Could you add something about this PR to this page here (https://github.com/obspy/obspy/wiki/What's-New-in-ObsPy-1.2) - then we don't have to scramble to create this page at release time.
We might want to add this to the GitHub PR template checklist, so that authors of major changes include them in the "what's new" page.
Jan 23, 2018
@calum-chamberlain Using your notebook I compared the performance of the new implementation with bottleneck. It is only marginally slower. The original numpy method failed with a memory error using data corresponding to 1 day sampled at 100Hz.
```python
import numpy as np
from scipy import signal

indices = np.arange(0, 24 * 3600, 0.01)  # a day of data sampled at 100 Hz
data = signal.gausspulse(indices - 2000, fc=0.5)
data += 0.3 * np.random.randn(len(data))
# let's use a little longer template too, which means a longer rolling window
template_start, template_stop = (198000, 202000)
template = data[template_start:template_stop]
```

```python
# no normalization
%timeit correlate_template(data, template, normalize=None)
%memit correlate_template(data, template, normalize=None)

# normalization with bottleneck
%timeit normxcorr(data, template, (len(data) // 2) - (len(template) // 2),
                  demean=True, domain='freq', norm_method='bn')
%memit normxcorr(data, template, (len(data) // 2) - (len(template) // 2),
                  demean=True, domain='freq', norm_method='bn')

# normalization with np.cumsum
%timeit correlate_template(data, template)
%memit correlate_template(data, template)
```