Replace speex resampler with polyphase FIR interpolator. #51

Closed
wants to merge 8 commits into from

bmatherly
Contributor

This pull request builds on top of: #49
The only commit this request adds is: 48924de
Also, it passes all tests from: #50

@andrewrk
Collaborator

This looks really nice. How's the performance compared to master branch?

@bmatherly
Contributor Author

I don't have a good way to compare with master because I have never used speex.

In terms of accuracy, here is the relevant output from the test application (#50):
PASSED - seq-3341-15-24bit.wav.wav: -6.0002649391173914e+00
PASSED - seq-3341-16-24bit.wav.wav: -6.0333638212754002e+00
PASSED - seq-3341-17-24bit.wav.wav: -5.9851041846311235e+00
PASSED - seq-3341-18-24bit.wav.wav: -5.9983726572865628e+00
PASSED - seq-3341-19-24bit.wav.wav: 2.9766193982539342e+00
PASSED - seq-3341-20-24bit.wav.wav: -1.3018312868045112e-01
PASSED - seq-3341-21-24bit.wav.wav: -8.5927052174689350e-02
PASSED - seq-3341-22-24bit.wav.wav: -1.8143836032717714e-01
PASSED - seq-3341-23-24bit.wav.wav: -8.5926006441882030e-02

If someone could provide similar test results using master, we could compare accuracy.

In terms of CPU performance, I don't know what method speex uses. But I do know that it is really hard to beat the performance of the windowed-sinc method with an odd number of filter taps and the cutoff at the Nyquist frequency. The reason is that every fourth sinc coefficient (except the middle one) ends up being zero, and the Hann window zeros out the first and last samples.

So the total number of multiply/add operations per sample for 49 taps comes to ((48 x 3 / 4) + 1 - 2) / 4 = 8.75 for my implementation.

Compare that to the example filter provided by BS.1770-4, which has 48 taps. Its total number of multiply/add operations per sample comes to 48 / 4 = 12.

So the method I used needs about 27% fewer multiply/add operations than the BS.1770-4 example.

@andrewrk
Collaborator

Beautiful.

@jiixyj I'd like to merge this pull request along with the others (I would clean up some commits while retaining credit to the authors). Looks like I have permission to do this, but I'll give you a chance to step in first. If I don't hear anything in a week, I'll do it myself.

@audionuma
Contributor

audionuma commented Apr 20, 2016

Hello,
that's good news.
@andrewrk @jiixyj : maybe you should set up a branch (v1.2, maybe) to merge those three pull requests, so that we can do extensive testing on the whole package without modifying the current release, which is used by several packagers (at least Debian and MacPorts, to my knowledge). You could then merge that branch into master and publish a new release once everything seems OK.
@bmatherly : I have set up some quick and dirty code (see attached file) to compare the efficiency of the current release against your proposal as far as the true-peak measurement is concerned.
I have made two builds, one linked against the current libebur128 (with speexdsp) and one linked against your last commit in the true-peak branch.
A quick comparison of the two on a 6-minute stereo, 48 kHz, 16-bit file showed your code to be almost twice as slow as the current release (approx. 8 s for the current release vs. approx. 16 s for yours). Maybe my test isn't meaningful; any suggestions are welcome.
Thanks.
Manuel
mytest.c.zip

@bmatherly
Contributor Author

@audionuma : Thanks for doing the test.

Can you also compare the accuracy of the test results between the two implementations using #50?

It is possible that we could reduce the number of filter taps:
https://github.com/bmatherly/libebur128/blob/true_peak/ebur128/ebur128.c#L275
I chose 49 taps because it works for my needs, but I did not attempt to optimize it and I think it could be reduced. Perhaps try setting it to 27 and see how the performance and accuracy compare to speex?

@bmatherly
Contributor Author

bmatherly commented Apr 20, 2016

I just took a look at the speexdsp resampler implementation for the first time. It appears to be highly optimized with SSE instructions. I don't think we could hope to compete with it on performance using a straight C implementation.

Here are a couple of ideas of how to improve the performance of this patch:

  1. Find the lowest number of taps that produces acceptable results
  2. Switch from the Hann window to a Kaiser window (might allow the taps to be reduced a little further - or it might not).
  3. Try adding compiler options to optimize multiply/accumulate operations: http://stackoverflow.com/a/34461738/4355458

@audionuma
Contributor

I used a slightly modified version of #50 (see attached file), built and run on an x86_64 MacBook Pro.
tests.c.zip

Output for current release

Note: the tests do not have to pass with EXACT_PASSED.
Passing these tests does not mean that the library is 100% EBU R 128 compliant!


start : 4463
PASSED, EXACT_FAILED - seq-3341-1-16bit.wav: -2.2953554851420424e+01
PASSED, EXACT_FAILED - seq-3341-2-16bit.wav: -3.2959858907723813e+01
PASSED, EXACT_FAILED - seq-3341-3-16bit-v02.wav: -2.3014141652728306e+01
PASSED, EXACT_FAILED - seq-3341-4-16bit-v02.wav: -2.3014141652728327e+01
PASSED, EXACT_FAILED - seq-3341-5-16bit-v02.wav: -2.2979029141006247e+01
PASSED, EXACT_FAILED - seq-3341-6-5channels-16bit.wav: -2.3017156275006020e+01
PASSED, EXACT_FAILED - seq-3341-6-6channels-WAVEEX-16bit.wav: -2.3017156275006020e+01
PASSED, EXACT_FAILED - seq-3341-7_seq-3342-5-24bit.wav: -2.2986158804946516e+01
PASSED, EXACT_FAILED - seq-3341-2011-8_seq-3342-6-24bit-v02.wav: -2.2997820241139713e+01
PASSED, EXACT_PASSED - seq-3342-1-16bit.wav: 1.0001105488329134e+01
PASSED, EXACT_PASSED - seq-3342-2-16bit.wav: 4.9993734051522178e+00
PASSED, EXACT_PASSED - seq-3342-3-16bit.wav: 1.9995064067783115e+01
PASSED, EXACT_PASSED - seq-3342-4-16bit.wav: 1.4999273937723455e+01
PASSED, EXACT_PASSED - seq-3341-7_seq-3342-5-24bit.wav: 4.9747585878473721e+00
PASSED, EXACT_FAILED - seq-3341-2011-8_seq-3342-6-24bit-v02.wav: 1.4993218571417380e+01
PASSED - seq-3341-15-24bit.wav.wav: -6.0002649391173923e+00
PASSED - seq-3341-16-24bit.wav.wav: -6.0001575059669516e+00
PASSED - seq-3341-17-24bit.wav.wav: -6.0001926281968601e+00
PASSED - seq-3341-18-24bit.wav.wav: -6.0001471759263616e+00
PASSED - seq-3341-19-24bit.wav.wav: 3.0098260838209900e+00
PASSED - seq-3341-20-24bit.wav.wav: -1.3018312868045112e-01
PASSED - seq-3341-21-24bit.wav.wav: -1.3147499549516292e-01
PASSED - seq-3341-22-24bit.wav.wav: -1.3148971274153129e-01
PASSED - seq-3341-23-24bit.wav.wav: -1.3147341864881676e-01

start true-peak : 7083310

end : 7700355
ticks for true-peak : 617045

Output for true-peak branch

Note: the tests do not have to pass with EXACT_PASSED.
Passing these tests does not mean that the library is 100% EBU R 128 compliant!


start : 4109
PASSED, EXACT_FAILED - seq-3341-1-16bit.wav: -2.2953554851420424e+01
PASSED, EXACT_FAILED - seq-3341-2-16bit.wav: -3.2959858907723813e+01
PASSED, EXACT_FAILED - seq-3341-3-16bit-v02.wav: -2.3014141652728295e+01
PASSED, EXACT_FAILED - seq-3341-4-16bit-v02.wav: -2.3014141652728345e+01
PASSED, EXACT_FAILED - seq-3341-5-16bit-v02.wav: -2.2979029141006244e+01
PASSED, EXACT_FAILED - seq-3341-6-5channels-16bit.wav: -2.3017156275006020e+01
PASSED, EXACT_FAILED - seq-3341-6-6channels-WAVEEX-16bit.wav: -2.3017156275006020e+01
PASSED, EXACT_FAILED - seq-3341-7_seq-3342-5-24bit.wav: -2.2986158804946516e+01
PASSED, EXACT_FAILED - seq-3341-2011-8_seq-3342-6-24bit-v02.wav: -2.2997820241139717e+01
PASSED, EXACT_PASSED - seq-3342-1-16bit.wav: 1.0001105488329134e+01
PASSED, EXACT_PASSED - seq-3342-2-16bit.wav: 4.9993734051522178e+00
PASSED, EXACT_PASSED - seq-3342-3-16bit.wav: 1.9995064067783115e+01
PASSED, EXACT_PASSED - seq-3342-4-16bit.wav: 1.4999273937723455e+01
PASSED, EXACT_PASSED - seq-3341-7_seq-3342-5-24bit.wav: 4.9747585878473721e+00
PASSED, EXACT_FAILED - seq-3341-2011-8_seq-3342-6-24bit-v02.wav: 1.4993218571417380e+01
PASSED - seq-3341-15-24bit.wav.wav: -6.0002649391173923e+00
PASSED - seq-3341-16-24bit.wav.wav: -6.0333638212754002e+00
PASSED - seq-3341-17-24bit.wav.wav: -5.9851041846311244e+00
PASSED - seq-3341-18-24bit.wav.wav: -5.9983726572865610e+00
PASSED - seq-3341-19-24bit.wav.wav: 2.9766193982539346e+00
PASSED - seq-3341-20-24bit.wav.wav: -1.3018312868045112e-01
PASSED - seq-3341-21-24bit.wav.wav: -8.5927052174689350e-02
PASSED - seq-3341-22-24bit.wav.wav: -1.8143836032717717e-01
PASSED - seq-3341-23-24bit.wav.wav: -8.5926006441882030e-02

start true-peak : 6935958

end : 8216746
ticks for true-peak : 1280788

The results are good as far as standard compliance is concerned.
The true-peak measurement takes close to twice as long with the polyphase implementation.
I might work on your suggestions (reducing taps) tomorrow and will let you know what the results are.

@bmatherly
Contributor Author

@audionuma : Thanks for looking into all of that. I'm a little slammed for the rest of this week. But I might have some time to look for optimizations next week.

@audionuma
Contributor

Hello,
so I've been trying to add optimization flags to the libebur128 CMakeLists.txt.
These are the lines I added to the true-peak branch CMakeLists.txt (based on your previous link http://stackoverflow.com/questions/15933100/how-to-use-fused-multiply-add-fma-instructions-with-sse-avx/34461738#34461738 ). Note that I have absolutely no idea what these flags do ;-)

if ("${CMAKE_C_COMPILER_ID}" STREQUAL "Clang")
  message(STATUS "*****Using Clang******")
  #SET(MY_COMPILE_FLAGS "-O1 -mavx2 -mfma -ffp-contract=fast")
  SET(MY_COMPILE_FLAGS "-O2 -mavx2 -mfma")
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "AppleClang")
  message(STATUS "*****Using AppleClang******")
  #SET(MY_COMPILE_FLAGS "-O1 -mavx2 -mfma -ffp-contract=fast")
  SET(MY_COMPILE_FLAGS "-O2 -mavx2 -mfma")
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "GNU")
  message(STATUS "*****Using GCC******")
  SET(MY_COMPILE_FLAGS "-O2 -mavx2 -mfma")
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "Intel")
  message(STATUS "*****Using Intel******")
  SET(MY_COMPILE_FLAGS "-O1 -march=core-avx2")
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "MSVC")
  message(STATUS "*****Using MSVC******")
  SET(MY_COMPILE_FLAGS "/O1 /arch:AVX2 /fp:fast")
endif()

SET( CMAKE_C_FLAGS  "${CMAKE_C_FLAGS} ${MY_COMPILE_FLAGS}" )

Using the suggested flags for Clang (which I am using) leads to a runtime error; that is why they are commented out and replaced by the GCC ones.
Now, the good news.
Without flags : ticks for true-peak : 1277645
With flags : ticks for true-peak : 357127
And without real testing, it also seems to improve the speed of the whole library (i.e. for loudness measurement).
Issues
I am not able to test the script on a GNU/Linux or Windows platform.
I am not sure whether the flag setting should be in the top-level CMakeLists.txt or in ebur128/CMakeLists.txt.

@bmatherly
Contributor Author

bmatherly commented Apr 21, 2016

Interesting. The "-mavx2 -mfma" flags tell the compiler to use FMA instructions when possible. FMA has only been added to some CPUs in the last couple of years. It seems very likely that the compiler would use FMA for the multiply/accumulate code if the flags permit it. However, if it does, and you run the binary on a CPU that does not support FMA, it would probably cause a runtime error.

If we really want to use FMA, we would need to write two versions of the function: a native version and an FMA-optimized version. Then, at runtime, we would need to decide which function to use by checking the CPU ID/flags. Highly optimized libraries like ffmpeg do this all the time. We probably don't need that, and all this FMA stuff is a distraction.

What is probably making all the difference is the "-O2" flag. If you have the time, it would be good to re-run the comparisons of master and true-peak branches after compiling both with the "-O2" flag (and excluding the other flags). If you don't have time, I will try to make time this weekend. I hope that would narrow the performance gap between the master and true-peak branches because I really didn't expect the performance to be so different.

I'm no expert on packaging. So I'm not sure if it is best practice to specify optimization flags in the build scripts (cmake/make) or if it is best to leave those decisions to the packager or person compiling.

It is interesting to compare the Debian rules files for libebur128 and speex. The libebur128 rules do not specify any compiler options, while the speex rules explicitly force "-O2".
Maybe @andrewrk can provide some insight on best practices.
I think that one thing we have learned is that it is beneficial to use "-O2" and we should either add that to the libebur128 source, or get package maintainers to add it to the respective scripts.

@audionuma
Contributor

With only "-O2" flag set, I have ticks for true-peak : 341962

@andrewrk
Collaborator

When you make a debug build you don't get the -O2 flag but you do when you make a release build. Any speed comparisons should be done against release mode builds.

Make a release build by using this configure line:

cmake .. -DCMAKE_BUILD_TYPE=Release

This is how packages in distros such as Debian build.

@bmatherly
Contributor Author

Ok. So based on those results, it looks like we should be good to go with the patch as-is. Are there any other tests that should be done before we move forward?

@jiixyj
Owner

jiixyj commented Apr 22, 2016

Thank you for your work on this! I haven't had the chance to thoroughly review your patches, but if everyone else is happy, please go ahead. :)
Just one thing: Wouldn't the addition of the EBUR128_ERROR_INSUFFICIENT_DATA error break the existing API/ABI, so where previously user code got some loudness value, there is now an error? What was the return value previously if there is insufficient data? -HUGE_VAL? What does the spec say should happen in this case?
I would prefer a backwards compatible solution if possible, maybe by adding a new function that does this check?

@bmatherly
Contributor Author

bmatherly commented Apr 22, 2016

@jiixyj : Thanks for your consideration.

For those following along, jiixyj is referring to 9eda37f

Previously, the return value for loudness_shortterm() and loudness_momentary() was EBUR128_SUCCESS - even when there was insufficient data. The functions would always perform the calculations as if st->d->audio_data was full of valid data. That means that sometimes the loudness measurement would be -HUGE_VAL, and sometimes it would just be a number that doesn't represent anything. Because st->d->audio_data is allocated using malloc(), you can't even assume that the loudness measurement represents silence leading up to the data that is available - because malloc() might return a pointer with random data in it.

My patch only applies to loudness_shortterm() and loudness_momentary(). After the patch, if either function is called before enough valid data has been received, the return value is EBUR128_ERROR_INSUFFICIENT_DATA and the loudness parameter is returned unchanged.

I have checked the specifications and they don't define a meaning for shortterm and momentary loudness before 3000ms and 400ms have elapsed (respectively). So I don't know that there is a right answer of what to return in that scenario. Logically, I suppose, one could just calculate the loudness for the data that has been received so far. I'd be willing to make that patch if people prefer.

If the advantage of the patch does not outweigh the disadvantage (API change), I am OK if it is excluded. In my own application, I would have to keep track of how many samples have been provided and ignore the value until I have provided sufficient data. We could also just update the documentation to explain that the functions should not be used until sufficient data has been provided.
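
The application-side bookkeeping mentioned above could be sketched like this. The type and function names are hypothetical and not part of libebur128; the sketch only counts frames and compares the total against the 400 ms and 3000 ms windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-side bookkeeping; not part of libebur128. */
typedef struct {
    unsigned long samplerate;  /* frames per second */
    uint64_t frames_seen;      /* frames passed to the meter so far */
} window_tracker;

static void tracker_init(window_tracker *t, unsigned long samplerate)
{
    assert(t != NULL && samplerate > 0);
    t->samplerate = samplerate;
    t->frames_seen = 0;
}

/* Call next to every add-frames call, with the same frame count. */
static void tracker_add(window_tracker *t, unsigned long frames)
{
    t->frames_seen += frames;
}

/* Non-zero once at least `window_ms` milliseconds of audio were provided. */
static int tracker_has_window(const window_tracker *t, unsigned long window_ms)
{
    /* Frames needed to cover the window, rounded up. */
    uint64_t needed = ((uint64_t) t->samplerate * window_ms + 999u) / 1000u;
    return t->frames_seen >= needed;
}

static int momentary_is_valid(const window_tracker *t)
{
    return tracker_has_window(t, 400UL);   /* momentary window: 400 ms */
}

static int shortterm_is_valid(const window_tracker *t)
{
    return tracker_has_window(t, 3000UL);  /* short-term window: 3000 ms */
}
```

The application would ignore the momentary or short-term value until the matching check returns non-zero, and reset the tracker whenever it resets the meter.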

@audionuma
Contributor

@bmatherly has a very good point in raising this EBUR128_ERROR_INSUFFICIENT_DATA. It is a good way to deal with this issue. The hardware real-time meter I use doesn't display a value until there is sufficient data (i.e. it will not display a short-term value for the first three seconds after a reset).
But I understand @jiixyj's point of view on breaking the API/ABI.
I think that merging the polyphase implementation (and associated contributions) is in itself a big step, so we may wait a bit before modifying the API. We may either stress in the documentation that it is the library user's responsibility to ensure that enough data has been provided, or create an enhancement ticket that describes the issue and see whether there is strong support for the API change.

@bmatherly
Contributor Author

Hi. I added two more commits to this pull request.

d82b54b fixes a typo/bug from a1bcda2. Feel free to squash those two commits if you want.

8cc762d implements prev_sample_peak() and prev_true_peak() which provide the respective measurements for only the data from the previous call to add frames. I consider this the last feature that is needed to implement a real-time EBU compliant loudness meter. With this feature, it is possible to provide a real time display of peak and true_peak measurements in addition to the maximum peak for the entire program.
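
The difference between the two kinds of measurement can be illustrated with a minimal single-channel sketch. The names are hypothetical and this is not the code from the commit; it only shows why a per-call peak can fall again while the maximum peak cannot.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical single-channel tracker; not the libebur128 implementation. */
typedef struct {
    double max_peak;   /* largest |sample| since the start of the program */
    double prev_peak;  /* largest |sample| in the most recent block only */
} peak_tracker;

static void peak_tracker_init(peak_tracker *p)
{
    assert(p != NULL);
    p->max_peak = 0.0;
    p->prev_peak = 0.0;
}

/* Process one block of samples, as one add-frames call would. */
static void peak_tracker_add(peak_tracker *p, const double *samples, size_t n)
{
    double block_peak = 0.0;
    size_t i;

    assert(p != NULL && (samples != NULL || n == 0));
    for (i = 0; i < n; i++) {
        double v = fabs(samples[i]);
        if (v > block_peak) {
            block_peak = v;
        }
    }
    p->prev_peak = block_peak;      /* replaced on every call */
    if (block_peak > p->max_peak) {
        p->max_peak = block_peak;   /* can only ever increase */
    }
}
```

A real-time display would show `prev_peak` after each block, while `max_peak` remains available as the maximum for the entire program.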

I hope you will consider including these.

I think I am done adding features. I can't think of any other features that would be required for my real-time loudness meter. Of course, if I find any bugs, I'll pass the fixes along.

@jiixyj
Owner

jiixyj commented Aug 13, 2016

I've merged the interpolation patch into the review-pr-49 branch. Thank you again for the resampler, finally no more Speex dependency!

jiixyj added this to the 1.2.0 milestone Aug 13, 2016

@kylophone
Contributor

Haven't had a chance to test yet, but this is a welcome addition!

@jiixyj
Owner

jiixyj commented Oct 26, 2016

The new resampler is in master now. This PR can be closed if I didn't miss anything.

jiixyj closed this Oct 26, 2016

@bmatherly
Contributor Author

Thanks again for merging the interpolator. I think it is a good step forward. I see you excluded this commit:
8cc762d

Is it still under consideration? Shall I submit another pull request?

Without that commit, it is not possible to use libebur128 for a real-time peak/true-peak meter because the value will only ever go up when a new maximum is encountered.

@jiixyj
Owner

jiixyj commented Oct 28, 2016

Ah, I must have missed this commit. Could you submit another PR?

@bmatherly
Contributor Author

PR for previous peak/true peak functions:
#55
