
I've added additional C++ optimizations and limited the number of threads we create #4

Merged
merged 7 commits into kidoman:master on Oct 3, 2013

Conversation

@m42a
Contributor

m42a commented Oct 3, 2013

I've measured a speedup of about 25% from these changes, not counting the added speedup from multithreading. Most of it comes from the first 2 commits; I don't know how applicable those are to Go though.
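The PR title also mentions limiting the number of threads created. A minimal sketch of one common way to cap worker threads in C++ (hypothetical, not the PR's actual code): clamp the job count to the hardware thread count reported by the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: never spawn more workers than there are
// hardware threads, and never more than there are jobs.
unsigned worker_count(unsigned jobs)
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        hw = 1; // hardware_concurrency() may return 0 if it can't tell
    return std::min(jobs, hw);
}
```

Spawning one thread per work item oversubscribes the cores and adds scheduling overhead, so capping at `hardware_concurrency()` is the usual fix.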


Owner

kidoman commented Oct 3, 2013

Alright let me get this merged in asap :) Kudos!

@kidoman kidoman merged commit e2616ac into kidoman:master Oct 3, 2013


Owner

kidoman commented Oct 3, 2013

Done... thanks for the pr


kidoman commented on ref/rays.cpp in c4ddd1e Oct 3, 2013

I would love to know why you decided to do this particular optimization


Contributor

m42a replied Oct 3, 2013

One of the biggest hotspots in the code looked like this:

//Compute b^2
1:  0.41% mulss  %xmm4,%xmm5
//Compute c
2:  9.13% subss  %xmm9,%xmm2
//Compute b^2-c
3:  0.15% subss  %xmm2,%xmm5
//Do comparison
4:  8.48% ucomis %xmm10,%xmm5
5: 16.81% jbe    150

Which means the function was spending over a third of its execution time in those last couple of instructions. The processor can't parallelize them well because instruction 3 depends on the previous 2 instructions, and instruction 4 depends on instruction 3. The idea was that since the check is usually false, by moving the final subtraction to after the jump we can speculatively do the jump without a stall, and since the jump points to the loop bounds check (which is all integer math) it won't have to wait on the floating point computations. And it seems to have worked:

//Compute b^2
1:  0.00% mulss  %xmm2,%xmm15
//Compute c
2:  6.35% subss  %xmm7,%xmm0
//Do comparison
3:  2.26% ucomis %xmm0,%xmm15
4: 18.94% jbe    b8

Now the only dependent instruction is the compare, so we can go straight into the predicted jump. It's only a 3% total improvement, but they all add up.
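The reordering described above can be sketched in C++ (a hypothetical ray-sphere hit test, not the actual ref/rays.cpp code): comparing b^2 against c directly lets the branch resolve without waiting on the b^2 - c subtraction, which is deferred to the rare hit path.

```cpp
#include <cassert>
#include <cmath>

struct Vec { float x, y, z; };

static float dot(Vec a, Vec b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hit test against a unit sphere at `center`.
// Before the change, the hot path was:
//     float disc = b * b - c;   // subss waits on BOTH the mulss and the c subss
//     if (disc <= 0) ...        // ucomiss then waits on disc
// After: compare b*b against c, so the ucomiss depends on two
// independent values instead of a chained subtraction.
bool hit_sphere(Vec o, Vec d, Vec center, float &t)
{
    Vec p{o.x - center.x, o.y - center.y, o.z - center.z};
    float b = -dot(p, d);
    float c = dot(p, p) - 1.0f;   // radius 1
    float b2 = b * b;
    if (b2 <= c)                  // usually true: branch retires without a stall
        return false;
    float q = std::sqrt(b2 - c);  // subtraction only happens on the rare hit
    t = b - q;                    // near intersection distance
    return true;
}
```

The result is identical either way; only the dependency chain feeding the branch gets shorter.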


kidoman replied Oct 4, 2013

Thanks for the explanation. I thought it would be something in this line, and did a few similar changes on the Go side and they seem to have paid off as well.

Thanks to your changes, C++ is way ahead of the Go version now. I am rerunning the benchmarks and putting up a comparo.


kidoman replied Oct 4, 2013

Also, did you actually benchmark and come up with the 3% number, or is there a better way to quantify these things?
