Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixup underperforming GENERIC kernel for volk_8u_x4_conv_k7_r2_8u #475

Merged
merged 1 commit into from Feb 6, 2022

Conversation

Aang23
Copy link
Contributor

@Aang23 Aang23 commented May 20, 2021

This addresses #473

I ended up comparing this implementation with others open-source implementations such as libcorrect and viterbi_lib from GNU Radio, and accordingly tuned

int METRICSHIFT = 1;
which seemed to be the only thing really different to me.

To my surprise, this did drastically improve performances to match the SSE / Spiral kernel ignoring a few details. Output from this "fixed?" generic kernel does slightly differ from the spiral implementation, but to a level that is entirely tolerable (down to a few bits on an entire 100Mb file) in my opinion.

Honestly, I do not know exactly how this fixes things in detail and did not have much clue what I was doing, so I would entirely understand if this is not accepted as a valid fix, but it does make this kernel usable compared to previous results and definitely as consistent as I would expect it to be.

I did not notice any degradation or weird behavior in my tests.

@Aang23
Copy link
Contributor Author

Aang23 commented May 20, 2021

For reference, here is once again the exact same data (METEOR-M 2 LRPT, which is a weather satellite downlinking a QPSK CCSDS r=1/2 k=7 signal).

SSE Kernel : https://ibb.co/cYFTFHs
Tuned generic Kernel : https://ibb.co/ZfGMkWY
Only minor differences are observed.

@michaelld
Copy link
Contributor

I ended up comparing this implementation with others open-source implementations such as libcorrect and viterbi_lib from GNU Radio, and accordingly tuned

Can you point us to the approximate code for comparison's sake? I did a quick look through & didn't see anything obvious (one way or the other). Maybe those other programs are more obvious as to what's going on Volk's?!

@Aang23
Copy link
Contributor Author

Aang23 commented May 20, 2021

I ended up comparing this implementation with others open-source implementations such as libcorrect and viterbi_lib from GNU Radio, and accordingly tuned

Can you point us to the approximate code for comparison's sake? I did a quick look through & didn't see anything obvious (one way or the other). Maybe those other programs are more obvious as to what's going on Volk's?!

Sure! - I do not know if I am really right on any of this, as I said my understanding of this algorithm isn't really good enough for that.

In libcorrect (https://github.com/quiet/libcorrect/blob/f5a28c74fba7a99736fe49d3a5243eca29517ae9/src/convolutional/decode.c#L60) what appears to play the same role is set to conv->rate (so 2 in this instance).

And it seems to be identical in this other algorithm I am using (slightly patched for CCSDS codes from an old GNU Radio version, but what's in the main repo currently is identical otherwise), https://github.com/altillimity/SatDump/blob/00e2dfadd7ced52d07d208a30b517eeb0d20a772/src-core/modules/common/viterbi_lib/viterbi.cpp#L65. (Shift applied twice, (state[i + 32].path << 1) | 1; ).

@rear1019
Copy link
Contributor

Mh, well, the generic and the SSE Spiral code kind of look like code from Phil Karn’s FEC library. Karn’s SSE code uses 5 bits for the branch metric, whereas the Spiral code uses 6 (see usage of _mm_srli_epi16() followed by masking with _mm_and_si128()). This makes me wonder why the Spiral code uses a higher threshold for normalization of the path metrics (210 instead of 105). My gut feeling is that this is wrong and that it’s probably masked by the usage of saturated arithmetic. And this is probably where things go wrong in the generic code. Making the branch metrics smaller (as done by this PR) fixes that issue.

@Aang23 Please test whether decreasing RENORMALIZE_THRESHOLD in the generic code fixes the issue as well (w/o the patch from this PR). It definitely must be <= (255-63) (if my theory is right).

@Aang23
Copy link
Contributor Author

Aang23 commented May 24, 2021

Mh, well, the generic and the SSE Spiral code kind of look like code from Phil Karn’s FEC library. Karn’s SSE code uses 5 bits for the branch metric, whereas the Spiral code uses 6 (see usage of _mm_srli_epi16() followed by masking with _mm_and_si128()). This makes me wonder why the Spiral code uses a higher threshold for normalization of the path metrics (210 instead of 105). My gut feeling is that this is wrong and that it’s probably masked by the usage of saturated arithmetic. And this is probably where things go wrong in the generic code. Making the branch metrics smaller (as done by this PR) fixes that issue.

I established the SSE Spiral kernel was initially generated by http://spiral.ece.cmu.edu/vitgen, which also explains why it is named "spiral" in the first place. (by the way, thanks for the in-depth explanation)

@Aang23 Please test whether decreasing RENORMALIZE_THRESHOLD in the generic code fixes the issue as well (w/o the patch from this PR). It definitely must be <= (255-63) (if my theory is right).

I will try this in a few minutes.

@Aang23
Copy link
Contributor Author

Aang23 commented May 24, 2021

@rear1019 I just tested your theory, with int METRICSHIFT = 1;. I tried with values ranging from 10 to 210 in increments of 5. None provided any improvement, rather a few differences with near-identical performances to sticking with int RENORMALIZE_THRESHOLD = 210;.

@Aang23
Copy link
Contributor Author

Aang23 commented Jul 15, 2021

Any new on this?

@jdemel
Copy link
Contributor

jdemel commented Sep 6, 2021

I can't tell if things improved now. Did you guys come to a conclusion?

It seems like this is not an issue we can simply test in our CI. This kind of FEC performance regression is terribly hard to detect.

Anyways, @Aang23 do you have some test code? Ideally, it is as simple as possible but allows some kind of BER/FER over SNR simulation to test performance. I could imagine adding this code to VOLK. We wouldn't run it as part of our normal CI but use it whenever we try to fix an issue such as this one.

Besides, we started the process to migrate VOLK to LGPL. It'd be great to receive your support. Ideally with a simple PR against the AUTHORS_RESUBMITTING_UNDER_LGPL_LICENSE.md file that adds your name to the list. Technically that counts as re-submitting your code under LGPL.

@rear1019
Copy link
Contributor

rear1019 commented Sep 6, 2021

@rear1019 I just tested your theory, with int METRICSHIFT = 1;. I tried with values ranging from 10 to 210 in increments of 5. None provided any improvement, rather a few differences with near-identical performances to sticking with int RENORMALIZE_THRESHOLD = 210;.

I just realized that RENORMALIZE_THRESHOLD is not used at all. The renormalize() function always performs a renormalize – the if statement which checks the threshold is suspiciously commented. Given that I failed to see another overflow issue in Volk’s code I looked into GNU Radio’s. We could consider changing the initial biasing of the known start state in cc_decoder_impl::init_viterbi():

https://github.com/gnuradio/gnuradio/blob/c56e76c74c60d7dfa2d8e3b910a68b7b1f9f0ad9/gr-fec/lib/cc_decoder_impl.cc#L275-L290

Smaller values indicate a better path metric due to the way the branch metrics are calculated. The bad paths of the trellis start off with a higher metric. Moreover, the metric of the correct path will grow slower than the metric of all other paths (assuming a low or moderate error rate). At some point a bad path may become a good one due to overflow.

@Aang23 Are you able to compile GNU Radio? If so, please test whether decreasing the value 63 in the code shown above helps. However, we should be careful with this change. It might affect the SSE kernel as well.

@jdemel
Copy link
Contributor

jdemel commented Sep 21, 2021

@Aang23 we're in the process to re-license VOLK under LGPL cf. #492 .
It'd be great if you'd support us and add yourself to the list. Just fill out the prepared statement and add your info to the list. Thanks!

@AlexandreRouma
Copy link
Contributor

I can generate some performance test code for this. I'm also having the same issue with this kernel and it's forced me to use libcorrect instead when volk alone could do the job (and do it much faster)

@Aang23
Copy link
Contributor Author

Aang23 commented Nov 29, 2021

I can't tell if things improved now. Did you guys come to a conclusion?

It seems like this is not an issue we can simply test in our CI. This kind of FEC performance regression is terribly hard to detect.

Anyways, @Aang23 do you have some test code? Ideally, it is as simple as possible but allows some kind of BER/FER over SNR simulation to test performance. I could imagine adding this code to VOLK. We wouldn't run it as part of our normal CI but use it whenever we try to fix an issue such as this one.

Besides, we started the process to migrate VOLK to LGPL. It'd be great to receive your support. Ideally with a simple PR against the AUTHORS_RESUBMITTING_UNDER_LGPL_LICENSE.md file that adds your name to the list. Technically that counts as re-submitting your code under LGPL.

I do sort of have some, (based on GNU Radio's CC Decoder), but due to the nature of the issue it is not that simple, the issue only really shows when feeding in a large (noisy) dataset. My method for testing this has been feeding real-world data from a CCSDS-compliant satellite and comparing the output quality and BER. The best I could provide to reliably show the issue would be a pre-processed (ready to be fed to the kernel) dataset to feed some basic decoding code.

Sorry for the delay in replying, for some reason all mail notifications from this repository ended up in my spam folder...

@rear1019 I just tested your theory, with int METRICSHIFT = 1;. I tried with values ranging from 10 to 210 in increments of 5. None provided any improvement, rather a few differences with near-identical performances to sticking with int RENORMALIZE_THRESHOLD = 210;.

I just realized that RENORMALIZE_THRESHOLD is not used at all. The renormalize() function always performs a renormalize – the if statement which checks the threshold is suspiciously commented. Given that I failed to see another overflow issue in Volk’s code I looked into GNU Radio’s. We could consider changing the initial biasing of the known start state in cc_decoder_impl::init_viterbi():

https://github.com/gnuradio/gnuradio/blob/c56e76c74c60d7dfa2d8e3b910a68b7b1f9f0ad9/gr-fec/lib/cc_decoder_impl.cc#L275-L290

Smaller values indicate a better path metric due to the way the branch metrics are calculated. The bad paths of the trellis start off with a higher metric. Moreover, the metric of the correct path will grow slower than the metric of all other paths (assuming a low or moderate error rate). At some point a bad path may become a good one due to overflow.

@Aang23 Are you able to compile GNU Radio? If so, please test whether decreasing the value 63 in the code shown above helps. However, we should be careful with this change. It might affect the SSE kernel as well.

Yes I am. I have tested this both in my software (SatDump, which is using an implementation based on GNU Radio's) and in GNU Radio. This did not help at all unfortunately. (With METRICSHIFT = 1)
Using METRICSHIFT = 2, the output was identical.

I can generate some performance test code for this. I'm also having the same issue with this kernel and it's forced me to use libcorrect instead when volk alone could do the job (and do it much faster)

I tried several times using shorter datasets but it really doesn't demonstrate the issue... Though, maybe you can do like I'm doing currently and provide your own fix if Spiral's not available? See https://github.com/altillimity/SatDump/blob/a739b3d9c52b30978452e52d213c7c2c023f3578/src-core/common/codings/viterbi/cc_decoder.cpp#L108

@jdemel
Copy link
Contributor

jdemel commented Jan 31, 2022

I'm curious, does this change improve decoding performance now or not? I'd be happy to merge this PR.

Could you add some more info here which configurations seem to work well and which don't? Which configuration does CCSDS use?
I know this PR is sitting here for quite a while now. And unfortunately, it is about exactly one of those bugs that are terribly hard to reproduce and debug. That's the reason I'm hesitant to merge.
@michaelld what's your opinion on this?

@Aang23
Copy link
Contributor Author

Aang23 commented Jan 31, 2022

I'm curious, does this change improve decoding performance now or not? I'd be happy to merge this PR.

On my end, I have been using this PR's contents in SatDump for months as a fallback if the Spiral kernel is not available. I've had plenty of confirmations things worked as expected since then.

As far as tests I have done on actual CCSDS signals go, it definitely does in every case.

Could you add some more info here which configurations seem to work well and which don't? Which configuration does CCSDS use?

Well, I have only tested with Voyager/CCSDS codes which are by far the most common for r=2,k=7 codes, so that was with G1 & G2 being 79, 109.
The voyager code compared to CCSDS is just G1 & G2 inverted.
Something else I have done as a test was porting over the pure C kernel generated by Spiral (http://spiral.ece.cmu.edu/vitgen/), which resulted in performances identical to this PR.

I know this PR is sitting here for quite a while now. And unfortunately, it is about exactly one of those bugs that are terribly hard to reproduce and debug. That's the reason I'm hesitant to merge.

I can only agree on this! The issue will only really show over a decent amount of data at low SNR. It does show decently in GNU Radio by chaining an encoder, noise source (Add + Gaussian noise source) and the decoder, but that's pretty much as good as it gets for reproducing the issue without straight up testing on actual signals.

@jdemel jdemel requested a review from michaelld February 6, 2022 16:10
Copy link
Contributor

@jdemel jdemel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. These changes work in https://github.com/altillimity/SatDump I expect to work them as least as well for everyone. I hope we can introduce some more extensive tests for our FEC kernels at some point.

Copy link
Contributor

@michaelld michaelld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@michaelld
Copy link
Contributor

Yes wishing we had a test for this, but for now let's just get it fixed

@michaelld michaelld merged commit 2a36ad9 into gnuradio:main Feb 6, 2022
@Aang23
Copy link
Contributor Author

Aang23 commented Feb 7, 2022

Thanks!

@jdemel
Copy link
Contributor

jdemel commented Feb 7, 2022

@Aang23 if you like, you can add your name etc. to https://github.com/gnuradio/volk/blob/main/.zenodo.json
Since this list is currently sorted alphabetically, your GH nick is the first in the list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

volk_8u_x4_conv_k7_r2_8u generic kernel inconsistent with SSE implementation
5 participants