High bitrate, small devices #34

Closed
mehrvarz opened this issue Aug 20, 2018 · 8 comments

@mehrvarz

Hi. Playing hi-res audio on an RPi can easily drive CPU load to 50-70%. You live by the "Output underflowed" message. It only takes another high-priority task (sshd) to move a little finger and... bang. But CPU load could be brought down easily. All it would take is for the decoder to interleave the audio channels: LLLLRRRRLLLLRRRR... Audio subsystems prefer it this way. Other decoders do this too. Can we have frame.Samples[i] without subframes? Would it be possible to pick the outgoing format when opening a file? Thank you for considering.

@mewmew
Member

mewmew commented Aug 20, 2018

The FLAC format is based on subframes, and to decode the samples from two channels you really have to decode the entire first subframe, before the second subframe (and its audio samples) even becomes visible in the FLAC bitstream.

That being said, it may be possible to achieve better latency if the buffer used for decoding and sending samples to output on the speakers is made smaller, because then it will decode a single frame, and use samples from that frame several times, before having to refill the now smaller buffer used for sending to the sound card.

To do this, change the buffer size when invoking speaker.Init of beep; or analogously for the library of choice you are using.

I've updated blip (see mewspring/blip@db40fe5) so you can experiment with different buffer sizes. The buffer size can be specified using the -dur command line flag of blip, which sets the duration between rebuffering of samples.

Hopefully you can experiment with this value and find a sweet spot for the RPi.

@mehrvarz
Author

> to decode the samples from two channels you really have to decode the entire first subframe, before the second subframe

Can you not fill all even words of a unified outgoing buffer with data from the 1st subframe and then fill all odd words with data from the 2nd subframe? The data needs to be zipped together no matter what. But it is super inefficient to do it in a separate step using an extra buffer. I would like to eliminate this code:

	// expensive: copies every sample of both channels into a second,
	// interleaved buffer (L0 R0 L1 R1 ...)
	for i := 0; i < int(frame.BlockSize); i++ {
		outbuf[2*i] = frame.Subframes[0].Samples[i]
		outbuf[2*i+1] = frame.Subframes[1].Samples[i]
	}

@mewmew
Member

mewmew commented Aug 20, 2018

> Can you not fill all even words of a unified outgoing buffer with data from the 1st subframe and then fill all odd words with data from the 2nd subframe? The data needs to be zipped together no matter what. But it is super inefficient to do it in a separate step using an extra buffer. I would like to eliminate this code:

I definitely see your point regarding performance. The update to unify samples from subframes into a single slice would be rather straightforward. However, not all applications and users of the API expect the data to be zipped. For instance, sound.Source of the zikichombo audio library expects samples from each channel to be placed directly after each other; e.g. n samples of the first channel in samples[0:n] and n samples of the second channel in samples[n:2*n].

Consolidating the API to handle both cases is possible of course with a conditional check, and perhaps that is the direction to take going forward.

As this change would update the API we'd have to do it in version 2.x, which is right now in the planning stage. So any feedback and input is welcome :) Feel free to join the discussion in the 2.x roadmap issue: #33

P.S. I've added a bullet to track the issue of having to duplicate the audio sample slices.

@mehrvarz
Author

mehrvarz commented Aug 21, 2018

I made a couple of measurements on the RPi.

1) mp3  16/44.1:  playmp3   5-7%   pulseaudio 1-2%    combined top:  9%
2) flac 16/44.1:  playflac 13-23%  pulseaudio 1-2%    combined top: 25%
3) flac 24/96 a:  playflac 32-50%  pulseaudio 6-9%    combined top: 59%
4) flac 24/96 b:  playflac 32-63%  pulseaudio 6-9%    combined top: 72%
5) flac 24/96 c:  playflac 32-58%  pulseaudio 6-9%    combined top: 67%

Cases 3-5 represent different strategies in my code to hand over 24bit data to portaudio. The different values (50-63%) demonstrate how the slicing and copying of data in my code does have a serious impact on the overall load. Case 3 represents the copy-loop shown two messages up (it is the most efficient of the three). Anyway, if this loop could be eliminated, I think cpu-load could drop from 50% to below 45%.

If you compare cases 1 and 2 (both 16/44.1) you see that playmp3 appears to be more efficient. I know it is apples and oranges but the amount of decoded data handed over to the audio subsystem is the same in both cases. Do you think that mewkiz/flac (disregarding the copy loop in the client app) could be made more efficient still? Or is decoding flac (vs. mp3) simply more cpu-intensive?
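For reference, the raw PCM data rates behind these cases can be worked out as in the sketch below. It assumes stereo material and samples packed into whole bytes (3 bytes for 24 bit); `pcmBytesPerSecond` is a hypothetical helper, and the actual format handed to portaudio may differ.

```go
package main

import "fmt"

// pcmBytesPerSecond returns the raw PCM data rate for the given format,
// assuming each sample is packed into bitsPerSample/8 bytes.
func pcmBytesPerSecond(sampleRate, bitsPerSample, channels int) int {
	return sampleRate * (bitsPerSample / 8) * channels
}

func main() {
	cd := pcmBytesPerSecond(44100, 16, 2)    // cases 1 and 2
	hires := pcmBytesPerSecond(96000, 24, 2) // cases 3-5
	fmt.Println(cd, hires)                   // 176400 576000
	fmt.Printf("ratio: %.2f\n", float64(hires)/float64(cd))
}
```

Under these assumptions the 24/96 cases move a bit more than three times as much decoded data per second as the 16/44.1 cases.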

(In my first message I mentioned "50-70%" cpu load. I was referring to the combined load of playflac + pulseaudio.)

@mehrvarz
Author

I created a mewkiz/flac "unified buffer" test implementation in which interleaved audio samples from all channels are being combined in one buffer. I allocated a new frame.BlockSamples[] in frame.Parse(). I then applied the following change to parseSubframe() where I zip the decoded data:

for i, sample := range subframe.Samples {
	//subframe.Samples[i] = sample << subframe.Wasted  // original code
	frame.BlockSamples[i*frame.Channels.Count() + channel] = sample << subframe.Wasted
}

I can hand over the resulting frame.BlockSamples to the audio subsystem without having to slice or copy anything in my client app. I took a shortcut and commented out correlate() because it expects the data to still be in subframe.Samples[]. The same is true for frame.Hash. The different functions that make assumptions about subframe.Samples[] and the order of the data within it appear to be the biggest work item when trying to fully implement this. One would also need to enable the client to somehow specify (or select) the desired output format.

I also wonder how beneficial it may be to use separate threads for decoding and outputting of data to the sound system. The client app could do this on its own, but there would need to be a way to hand over alternating frame buffers. Any thoughts? (Some Go channels have yet to be utilized.)

@mewmew
Member

mewmew commented Aug 21, 2018

> Do you think that mewkiz/flac (disregarding the copy loop in the client app) could be made more efficient still?

Oh, most definitely! There is a benchmark you could run (and feel free to extend the benchmark with samples from your own collection).

u@x61s ~/D/g/s/g/m/f/frame> go test -bench=. github.com/mewkiz/flac/frame
goos: linux
goarch: amd64
pkg: github.com/mewkiz/flac/frame
BenchmarkFrameParse-2   	       1	39170578822 ns/op
BenchmarkFrameHash-2    	       1	47362633952 ns/op
PASS
ok  	github.com/mewkiz/flac/frame	89.865s

From https://github.com/mewkiz/flac/blob/master/frame/frame_test.go#L61:

	// The file 151185.flac is a 119.5 MB public domain FLAC file used to
	// benchmark the flac library. Because of its size, it has not been included
	// in the repository, but is available for download at
	//
	//    http://freesound.org/people/jarfil/sounds/151185/

Profiling of the mewkiz/flac library shows that there is some low-hanging fruit that we could target for optimizing performance.

u@x61s ~/D/g/s/g/m/f/frame> go test -bench=. -cpuprofile=cpu.out .
goos: linux
goarch: amd64
pkg: github.com/mewkiz/flac/frame
BenchmarkFrameParse-2   	       1	39051467853 ns/op
BenchmarkFrameHash-2    	       1	43459214627 ns/op
PASS
ok  	github.com/mewkiz/flac/frame	86.464s

Specifically, the bit reader. It accounts for 65% of the cumulative time.

u@x61s ~/D/g/s/g/m/f/frame> go tool pprof cpu.out 
Local symbolization failed for frame.test: open /tmp/go-build489395259/b001/frame.test: no such file or directory
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: frame.test
Type: cpu
Time: Aug 22, 2018 at 8:07pm (JST)
Duration: 1.44mins, Total samples = 1.34mins (92.90%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 61.13s, 76.17% of 80.25s total
Dropped 143 nodes (cum <= 0.40s)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
    15.43s 19.23% 19.23%     52.77s 65.76%  github.com/mewkiz/flac/internal/bits.(*Reader).Read
     9.11s 11.35% 30.58%     30.66s 38.21%  io.(*teeReader).Read
     6.20s  7.73% 38.31%      6.20s  7.73%  github.com/mewkiz/flac/frame.(*Subframe).decodeLPC
     5.54s  6.90% 45.21%      7.87s  9.81%  bufio.(*Reader).Read
     4.98s  6.21% 51.41%     20.20s 25.17%  github.com/mewkiz/flac/internal/bits.(*Reader).ReadUnary
     4.69s  5.84% 57.26%     62.94s 78.43%  github.com/mewkiz/flac/frame.(*Subframe).decodeRiceResidual
     4.36s  5.43% 62.69%      4.36s  5.43%  github.com/mewkiz/flac/internal/hashutil/crc16.Update
     3.98s  4.96% 67.65%     34.64s 43.17%  io.ReadAtLeast
     3.88s  4.83% 72.49%      8.24s 10.27%  github.com/mewkiz/flac/internal/hashutil/crc16.(*digest).Write
     2.96s  3.69% 76.17%      5.44s  6.78%  github.com/mewkiz/flac/internal/hashutil/crc8.(*digest).Write

@mehrvarz
Author

Do you know how to make pprof show the # of function calls? I'd like to know this in relation to "time spent".

Can you please share your thoughts regarding the dual-thread idea? This could be a big win, especially if the interleaved samples idea cannot be easily implemented. The client should be able to hand over a fresh subframe buffer, say, to ParseNext(). It could then juggle two buffers, handing one to the decoder, while feeding the other to the audio sink. It wouldn't matter anymore if the client needs to spend some time reformatting data. The client could also use more than two frame buffers, should this ever be useful.

@mehrvarz
Author

This mpeg/mp3 package lets clients hand over the decode buffer:
https://github.com/bobertlo/go-mpg123/blob/master/mpg123/mpg123.go#L163

With this I can call the decoder multiple times, without having to copy any data. And when I do use the data, the decoder can continue crunching future data. I can provide any number of buffers, but in my console log below I am only using 4. This may be enough already to prevent all 24/96 underruns (even) on the RPi. Thoughts?

INFO   jukebox_mp3flac folder=/home/dave/Music/
INFO   jukebox_mp3flac tag string: [Milonga Tati - Quadro Nuevo - Tango Bitter Sweet]
INFO   jukebox_mp3flac decoder worker 0
INFO   jukebox_mp3flac decoder worker 1
INFO   jukebox_mp3flac decoder worker 2
INFO   jukebox_mp3flac decoder worker 3
INFO   jukebox_mp3flac sender  worker      0
INFO   jukebox_mp3flac decoder worker 0
INFO   jukebox_mp3flac sender  worker      1
INFO   jukebox_mp3flac decoder worker 1
INFO   jukebox_mp3flac sender  worker      2
INFO   jukebox_mp3flac sender  worker      3
INFO   jukebox_mp3flac decoder worker 2
INFO   jukebox_mp3flac sender  worker      0
INFO   jukebox_mp3flac sender  worker      1
INFO   jukebox_mp3flac decoder worker 3
INFO   jukebox_mp3flac sender  worker      2
INFO   jukebox_mp3flac decoder worker 0
INFO   jukebox_mp3flac decoder worker 1
INFO   jukebox_mp3flac decoder worker 2
INFO   jukebox_mp3flac sender  worker      3
INFO   jukebox_mp3flac sender  worker      0
INFO   jukebox_mp3flac decoder worker 3
INFO   jukebox_mp3flac decoder worker 0
INFO   jukebox_mp3flac sender  worker      1
INFO   jukebox_mp3flac sender  worker      2
INFO   jukebox_mp3flac decoder worker 1
INFO   jukebox_mp3flac decoder worker 2
INFO   jukebox_mp3flac sender  worker      3
INFO   jukebox_mp3flac sender  worker      0
INFO   jukebox_mp3flac decoder worker 3
INFO   jukebox_mp3flac decoder worker 0
INFO   jukebox_mp3flac sender  worker      1
INFO   jukebox_mp3flac sender  worker      2
INFO   jukebox_mp3flac decoder worker 1
INFO   jukebox_mp3flac decoder worker 2
INFO   jukebox_mp3flac sender  worker      3
INFO   jukebox_mp3flac sender  worker      0
INFO   jukebox_mp3flac decoder worker 3
INFO   jukebox_mp3flac decoder worker 0
INFO   jukebox_mp3flac sender  worker      1
INFO   jukebox_mp3flac sender  worker      2
INFO   jukebox_mp3flac decoder worker 1
INFO   jukebox_mp3flac decoder worker 2
INFO   jukebox_mp3flac sender  worker      3
INFO   jukebox_mp3flac sender  worker      0
INFO   jukebox_mp3flac decoder worker 3
INFO   jukebox_mp3flac decoder worker 0
...
