
remove autogenerated code from ChannelMixer #1966

Merged — 1 commit merged into mixxxdj:main on Nov 21, 2021

Conversation

Be-ing
Contributor

@Be-ing Be-ing commented Dec 27, 2018

Previously, the autogenerated code allowed for a vectorizing
optimization by combining gain application (channel faders +
crossfader) with channel mixing. When postfader effects were
implemented in Mixxx 2.1 (PR #1254), these two operations
had to be decoupled to process effects between gain
application and channel mixing. Therefore, the
autogenerated code lost its advantage and became an
overcomplication.

Unused autogenerated functions were removed from SampleUtil
as well. Some of the autogenerated SampleUtil functions are
still in use, so they were kept.

@daschuer
Member

Thank you.
Can you add some benchmarks comparing the performance effects of these changes?

@uklotzde uklotzde added this to the 2.3.0 milestone Jan 12, 2019
@Be-ing Be-ing modified the milestones: 2.3.0, 2.4.0 Feb 7, 2019
@Be-ing Be-ing closed this Apr 24, 2019
@Be-ing Be-ing deleted the audio_engine_cleanup branch April 24, 2019 23:48
@Be-ing Be-ing restored the audio_engine_cleanup branch April 24, 2019 23:48
@Be-ing
Contributor Author

Be-ing commented Apr 24, 2019

Whoops, I did not intend to close this.

@Be-ing Be-ing reopened this Apr 24, 2019
@Be-ing
Contributor Author

Be-ing commented Oct 23, 2019

I tested playing two decks for a few minutes (using the same tracks with each test):

master + 324b2fb cherry-picked:

Debug [Main]: Stat("EngineMaster::process after processChannels","count=8218,sum=4.20202e+08ns,average=51131.9ns,min=22165ns,max=1.15742e+06ns,variance=1.14174e+09ns^2,stddev=33789.6ns")

this branch:

Debug [Main]: Stat("EngineMaster::process after processChannels","count=7411,sum=4.26568e+08ns,average=57558.8ns,min=23141ns,max=1.55683e+06ns,variance=1.20842e+09ns^2,stddev=34762.3ns")

So this branch does seem to be slightly slower. The mean difference is 6426.9 nanoseconds. I think we can afford that to get rid of this awfully overcomplicated code.

@Be-ing Be-ing requested a review from rryan November 12, 2019 04:45
@Holzhaus Holzhaus added this to In progress in 2.4 release Apr 9, 2020
@Be-ing Be-ing moved this from In progress to Needs review in 2.4 release Jun 15, 2020
@Holzhaus
Member

Can you fix the merge conflicts? I'd like to reduce the number of open PRs. I think this is a candidate for merging.

@daschuer
Member

The benchmarks that prove this is a good idea are still missing. Without them, we should not touch this code.

@Be-ing
Contributor Author

Be-ing commented Sep 10, 2020

Huh? I posted that above.

@poelzi
Contributor

poelzi commented Sep 10, 2020

Our engine is still single-threaded, and with keylock enabled, one channel alone is near the buffer maximum even on my machine (T470p). I'd rather not slow down the engine just to remove some autogenerated code.

@Holzhaus
Member

@Be-ing Do you plan to pick it up and possibly fix the performance issues, or should we close this?

@Be-ing Be-ing changed the base branch from master to main October 23, 2020 23:28
@Be-ing
Contributor Author

Be-ing commented Oct 23, 2020

I do not know if the performance of this could be improved further. But I think this is the wrong place to be nitpicking about performance. The difference I measured before was 6426.9 nanoseconds. If we want to optimize the audio processing code, we should be talking about metering and timestretching. Other software uses assembler or Intel intrinsics to optimize metering.

@Holzhaus
Member

Holzhaus commented Oct 23, 2020

0.006426 milliseconds. I think we can live with that.

@Holzhaus
Member

Holzhaus commented Oct 23, 2020

Maybe we can use Intel SIMD intrinsics to work on multiple channels at once? But we'd need to rewrite that code and maintain multiple code paths for non-x86 architectures.

@Be-ing
Contributor Author

Be-ing commented Oct 23, 2020

A more practical approach is using SIMD instructions for multiple samples within a single buffer.

@Holzhaus
Member

If we abstract it like this, we can also extend it to other architectures that offer a similar SIMD instruction set: https://blog.molecular-matters.com/2011/10/18/simdifying-multi-platform-math/

@Swiftb0y
Member

I have dismissed my critical review long ago to free the space for another reviewer. No one has taken the chance since, so there is no need to blame me.

I'm sorry, this was not clear to me. So all that is left for this PR is for someone to take responsibility for approving it and then hit merge?

@daschuer
Member

I don't think it's super fair to expect anyone else to do this work if you are the only one insisting on the data resulting from that work.

I don't insist.

I guess I was not clear in my post. It is not too early to vote now.

Basing our votes on comments that see no regression from Uwe's test is too early. I hope my post has clarified the situation.

@daschuer
Member

.... oh I see it was already considered as bikeshedding.

That's not fair, especially because I have just complained about such a voice.

Comment on lines +69 to +70
for (int i = 0; i < activeChannels->size(); ++i) {
EngineMaster::ChannelInfo* pChannelInfo = activeChannels->at(i);
Member

Consider replacing this with a range-based for loop; it might even yield some performance benefit.

Contributor Author

How could that provide a performance benefit?

Member

Compilers may be able to optimize better when we convey the intent more clearly.

Member

For example, currently activeChannels->size() is probably called on every iteration, while the range-based implementation would only compare the begin and end iterators. I'm not saying this is a huge benefit; I just believe we should do what compiler writers expect from us so we get better code and they can possibly produce better binaries.

Contributor

Readability would benefit from a range-based for loop. That's the main advantage; everything else is just a bonus.

@Swiftb0y
Member

.... oh I see it was already considered as bikeshedding.

That's not fair, especially because I have just complained about such a voice.

please elaborate, I don't understand what you mean.

@daschuer
Member

The thumbs-down on my post and the comment: #1966 (comment)

@Swiftb0y
Member

I agree, continuously repeating statements as a means to pressure action is not nice either. I ask @Be-ing to refrain from this pattern of behavior; it's useless and happens too often.

@rryan
Member

rryan commented Nov 14, 2021

Just providing some context -- I can't remember when this was written, but it was a big win at the time performance wise on my Lenovo T400 and my eeePC. This was around the time I was rewriting the engine to support more than 2 decks, and we were adding samplers and other mixing sources.

I'm surprised there isn't a mixing benchmark testing this code. I was pretty sure I wrote one -- maybe I never committed it.

I don't unroll loops or produce annoying-to-maintain code for fun -- just so everyone is clear :).

@Be-ing
Contributor Author

Be-ing commented Nov 14, 2021

I don't unroll loops or produce annoying-to-maintain code for fun -- just so everyone is clear :).

Of course, I don't think anyone thought you did. :P But those optimizations became irrelevant with #1254 because mixing and applying the gain of the faders cannot be done in the same step with postfader effects.

@rryan
Member

rryan commented Nov 14, 2021

Just providing some context -- I can't remember when this was written

Ah.. 2013.
63b8411

Of course, I don't think anyone thought you did. :P But those optimizations became irrelevant with #1254 because mixing and applying the gain of the faders cannot be done in the same step with postfader effects.

Right.. the original code called out to an inlined autogenerated "mix and apply ramping gain to N channels" routine, which was the key performance win over serial application of ramping gain then accumulation into an output buffer. Writing the code that way let auto-vectorization take over and do good things. (at the time we were writing SIMD by hand in SampleUtil too, but that was a pain to maintain).

The one additional bit of information we're giving the compiler here is telling it that the number of elements in this channels array is small (<32) -- but I don't think the loop body in its current form can be optimized anyway.

@rryan
Member

rryan commented Nov 14, 2021

I'm scratching my head trying to explain the performance differences measured by @uklotzde or @Be-ing.

Even if EngineEffectsManager::processPostFaderAndMix can be inlined (I don't think it can), the code in there only deals with a single buffer and then adds it to the output buffer, so I believe it's no different from the serial "process, then add to output" flow we had before the generated code was added, or from the generated code in its current form.

I suspect the timer measurements are noisy potentially due to reasons @daschuer suggested (e.g. did you use the start/stop timing feature once the engine was already in a steady state, or was the timing from program startup? was mixxx the only program running? cpufreq "performance" governor enabled, etc.).

@rryan
Member

rryan commented Nov 14, 2021

I support merging this as is for the reasons in my above two comments.

Generally +1 to the request for microbenchmarks of channel mixing for a range of common channel counts, I'm sorry they don't already exist.

Can we merge this or is this another victim of bikeshedding?
No more bikeshedding. Vote now.

I think it's reasonable and not bikeshedding to request benchmarks for a change that is primarily concerning a maintenance vs. performance trade-off.

@Be-ing
Contributor Author

Be-ing commented Nov 19, 2021

After 5 days we have 3 👍 s and 0 👎 s.

@uklotzde
Contributor

The benchmarks should have been provided with the auto-generated code. Requesting them as a requisite for removing code that is difficult to maintain and prevents innovation is unfair. Especially in conjunction with a veto that puts all burden on the one who proposes the change. In this respect I support Jan's position as outlined in #1966 (comment)

@uklotzde
Contributor

Merge?

@uklotzde
Contributor

On a second thought: Even if we had those benchmarks we would need to run them on a variety of platforms including different CPU architectures for meaningful results. We don't have the resources to do that.

@JoergAtGithub
Member

On a second thought: Even if we had those benchmarks we would need to run them on a variety of platforms including different CPU architectures for meaningful results. We don't have the resources to do that.

For such encapsulated code units, you can execute a unit test under callgrind. Callgrind emulates an artificial CPU and reports the CPU cycles needed as a reproducible number. Due to the CPU emulation, it's independent of the real CPU and can be run in CI. Using an optimized release binary, this is a good indicator of whether compiler optimizations like SIMD vectorization work for a code unit.
This is a general remark; I do not recommend implementing this in this PR!

@Be-ing
Contributor Author

Be-ing commented Nov 20, 2021

Merge?

The vote is already 4 in favor with 0 opposing.

@Be-ing
Contributor Author

Be-ing commented Nov 20, 2021

@JoergAtGithub if you want to set up callgrind on CI in another PR, that seems like it could be quite helpful.

@Be-ing
Contributor Author

Be-ing commented Nov 20, 2021

Is anyone going to merge this or will I have to merge my own PR?

@uklotzde
Contributor

@Be-ing Please be patient. I will merge it.

PS: Being pressured or getting into crossfire is not fun. I don't need to deal with this and could turn away from Mixxx at any time for my own peace of mind.

@Be-ing
Contributor Author

Be-ing commented Nov 20, 2021

What is the reason for not merging it already? I don't see that anybody has articulated one.

@uklotzde
Contributor

I have left a comment and opened a PR for this branch as you noticed. But apparently you prefer to be impatient.

I will close my PR and not merge this PR. Done. Someone else may take over.

@Be-ing
Contributor Author

Be-ing commented Nov 20, 2021

I don't think it's fair to ask me to be patient on a 3 year old PR.

@uklotzde
Contributor

It is unfair to pressure me after I started to pick up this PR just a few hours ago. Please rethink your behavior.

@rryan
Member

rryan commented Nov 21, 2021

The benchmarks should have been provided with the auto-generated code.

I agree! Sorry, again :).

Requesting them as a requisite for removing code that is difficult to maintain and prevents innovation is unfair.

I only sort of agree -- the engine code is complex and difficult to maintain, in part because the problem domain of real-time friendly code is sometimes at odds with maintainability. In the general case, I think some changes that have an obvious performance risk should come with benchmarks to show it's safe (and it's ok to put the burden of proof on the code author who claims the change is safe).

@rryan
Member

rryan commented Nov 21, 2021

After 5 days we have 3 👍 s and 0 👎 s.

Speaking for myself, I voted yes to "merge as is".

Since this PR has generated so many bad feelings all around, I think we should merge and move on.

Thanks for the improvements PR, @uklotzde! If you retarget to main, will that work?

@rryan rryan merged commit 4cdc52b into mixxxdj:main Nov 21, 2021
2.4 release automation moved this from Needs review to Done Nov 21, 2021
Labels
build · code quality · stale (Stale issues that haven't been updated for a long time)

Projects
2.4 release: Done

8 participants