Sleep before ReadBuffer to resolve Nvidia's busywait issue #60

Kubuxu · 2016-11-11T13:14:46Z

I did not observe any negative influence on performance with this patch.
The CPU core usage is reduced to about 8% from 100%.

I am getting consistent 38Sols/s on MSI Gaming X 1070 on one instance
while testing. Sols/s on AMD cards (RX480) are not affected.

The clFlush it required as AMD will only start working when there is
blocking operation waiting which is delayed in this case or the queue is flushed.

This is only suggestion and solution that works for me, it isn't perfect. With this silentarmy is the best miner for Nvidia that I know.

Don't feel obligated to accept this PR, you can use parts of it or it in
whole to engineer a solution that you think is the best.

I have tried to sched_yield trick, it seems to stopped working
unfortunately.

Resolves #54

Singman33 · 2016-11-11T16:42:04Z

Tested and approved, compatible with eXternal Optimizations.

mbevand · 2016-11-12T02:11:52Z

Oh that's a good, working hack. Thanks! I might prefer your idea over krnlx's suggestion from #54

lhl · 2016-11-12T10:16:57Z

Just merged my branch back up w/ upstream and also tested this patch, and it seems to work great - freed up a core from 100% to now using about 7% (on an i7-4790k). 970 is mining at about 45 sol/s (unchanged), XC AVX2 -t6 is up from 24sol/s to 26+ sol/s.

Tested also on my 470 dev system, performance is unchanged (~77 sol/s w/ a memstrap 1500 system), CPU < 2%.

Kubuxu · 2016-11-12T10:48:28Z

The timing here is not perfect, with two timers, one measuring kernel time and one measuring the time spent in the read it should be possible to reduce the load even further.

The increased CPU load is because the the sleep time will be always 2% less than time required for the kernel to complete.

montvid · 2016-11-12T19:15:03Z

Awesome! My weak GeForce GT 740M is making a bit more sols than before. It sure helps for the weak gpu's.

Kubuxu · 2016-11-12T19:45:23Z

@mbevand something worth considering is changing the mining pipeline a bit.
Currently main.c does (simplified):

1. Enqueue equihash task
2. Block on ReadBuffer
3. Check the solutions
4. Repeat from one

In this patch I do:

1. Enqueue equihash task
2. Wait rolling average time - 2%, it takes to do one invocation
2. Block on ReadBuffer
3. Check the solutions
4. Repeat from 1.

Something worth considering:

1. Enqueue eqihash task
2. Enqueue non blocking ReadBuffer save the event E, save the estimated invocation finish time T, calculated by rolling average of previous runs
2. Check previous solutions if there are any
3. Sleep '(T - now())*0.75', 0.75 is quite arbitrary here and might have to be adjusted.
4. Check if event E is done; if not, jump to 3
5. Save the solution
6. Repeat from 1.

This is a bit more complex but has two benefits, 1. there is even less downtime (it might turn out that with this running with one instance will be better), 2. the timing adjustment is even better.

mbevand · 2016-11-12T21:03:33Z

Cool. Apparently zawawa was even able to include and compile your patch on Cygwin/Windows: https://bitcointalk.org/index.php?topic=1666489.msg16854647#msg16854647

I asked him to confirm that clock_gettime(CLOCK_MONOTONIC...) works on Cygwin/Windows.

Kubuxu · 2016-11-12T21:04:21Z

It doesn't he replaced it by something else, also Cygwin doesn't have nanosleep.

mbevand · 2016-11-12T21:05:34Z

I guess we don't have to worry about Windows right now.

Kubuxu · 2016-11-12T21:07:20Z

Give me 30min before merging it, I might try something else that will have lower load on CPU.

mbevand · 2016-11-12T21:11:59Z

Take your time. I will probably need 1-2 days before I merge it.

Kubuxu · 2016-11-12T21:13:52Z

Sure.

I did not observe any negative influence on performance with this patch. The CPU core usage is reduced to about 8% from 100%. I am getting consistent 38Sols/s on MSI Gaming X 1070 on one instance while testing. Sols/s on AMD cards (RX480) are not affected. The clFlush it required as AMD will only start working when there is blocking operation waiting which is delayed in this case. This is only suggestion and solution that works for me, it isn't perfect but it works for me. With this silentarmy is the best miner for Nvidia that I know. Don't feel obligated to accept this PR, you can use parts of it or it in whole to engineer a solution that you think is the best. I have tried to sched_yield trick, it seems to stopped working unfortunately.

Kubuxu · 2016-11-12T22:55:51Z

I reworked it. Getting the same hash rates as without the patch, the CPU usage is down to 4% from 100%. Parameters are tweakable.

mbevand · 2016-11-12T23:32:44Z

Awesome! Have you made sure it doesn't negatively impact AMD GPUs? Should we leave the code as-is or turn it on only when running against an Nvidia GPU?

zawawa replied me saying clock_gettime() does seem to work on Cygwin/Windows.

Kubuxu · 2016-11-12T23:35:47Z

I have run tests on AMD too, I couldn't see any difference, the whole thing can be disabled by setting SLEEP_SKIP_RATIO to 1.

Kubuxu · 2016-11-12T23:51:05Z

MSI RX480 X 8GB at test run 1000 nonces:
master: Total 1825 solutions in 25147.7 ms (72.6 Sol/s)
with patch: Total 1825 solutions in 25333.1 ms (72.0 Sol/s)
with patch and SLEEP_SKIP_RATIO=1: Total 1825 solutions in 25246.9 ms (72.3 Sol/s)

So the overhead is 0.6Sol/s or (0.3 with a tweak), it will be less on slower GPUs (as invocation time is longer).
This would be solved by my proposal here: #60 (comment) but I won't have time to implement it.

Kubuxu · 2016-11-12T23:54:31Z

It might also have no influence at all with two instances running.

krnlx · 2016-11-15T06:01:14Z

wait wait, I tested the last solution, it seems a little performance decrease. Celeron 1840, 6x1070, I think I will stay at Kubuxu v1 solution, it is fastest.

Kubuxu · 2016-11-15T10:04:59Z

@krnlx try playing with SLEEP_SKIP_RATIO it previously was at 0.02

mbevand · 2016-11-15T17:05:56Z

Thanks so much, I merged a cleaned-up version of your fix:
a6c3517

Please not that you had 2 clFlush() and only 1 is needed. Verify on Nvidia hardware I didn't screw things up :)

Kubuxu · 2016-11-15T17:08:45Z

I had two flushes as I wanted the GPU to start work ASAP (applies to AMD) but it didn't seem to have any difference IIRC.

mbevand · 2016-11-15T17:11:24Z

That makes sense. You had a flush before enqueuing kernel_sols and the
non-blocking buffer read, but these 2 calls should take at most tens or
hundreds of usec, so it won't make a difference for a 10-20 msec equihash
run.

On Tue, Nov 15, 2016 at 11:08 AM, Jakub Sztandera notifications@github.com
wrote:

I had two flushes as I wanted the GPU to start work ASAP (applies to AMD)
but it didn't seem to have any difference IIRC.

—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#60 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AC20HJxRjtYYomKXB5Q9dcBZ9QsfGV-mks5q-ecegaJpZM4KvvJ7
.

Kubuxu · 2016-11-15T17:27:02Z

Also confirming on Nvidia, CPU usage is ~3%.

Kubuxu · 2016-11-15T17:31:23Z

Final comparison:
GTX1070: with 69.6Sol/s (4% CPU), without 66.7Sol/s (100% CPU)
RX480: with 72.0 Sol/s, without 72.0Sol/s

Tested with LD_PRELOAD=/usr/lib/libOpenCL.so.1 ./sa-solver --use X --nonces 1000.

Kubuxu force-pushed the feat/flush-and-sleep branch 2 times, most recently from 159c3e2 to f29b4e0 Compare November 11, 2016 13:15

Kubuxu changed the title ~~Sleep before ReadBuffer to resolve Nvidia's busywait issueo~~ Sleep before ReadBuffer to resolve Nvidia's busywait issue Nov 11, 2016

Kubuxu force-pushed the feat/flush-and-sleep branch from f29b4e0 to 162cdef Compare November 11, 2016 13:24

Kubuxu force-pushed the feat/flush-and-sleep branch 2 times, most recently from 9bd836b to 6cdf385 Compare November 11, 2016 23:22

Kubuxu mentioned this pull request Nov 11, 2016

sa-solver 100% CPU core with nvidia #54

Closed

Kubuxu force-pushed the feat/flush-and-sleep branch 5 times, most recently from 6601f0b to acb7e15 Compare November 12, 2016 22:46

Rework sleeping and flushing to be smarter and faster

c00bab0

Kubuxu force-pushed the feat/flush-and-sleep branch from acb7e15 to c00bab0 Compare November 12, 2016 22:53

Kubuxu mentioned this pull request Nov 13, 2016

Windows #43

Open

Disable sleeping routine when SLEEP_SKIP_RATI=1

d41eca4

Kubuxu force-pushed the feat/flush-and-sleep branch from 38648a3 to d41eca4 Compare November 13, 2016 15:32

mbevand closed this Nov 15, 2016

Kubuxu deleted the feat/flush-and-sleep branch November 15, 2016 17:08

Squall-Leonhart mentioned this pull request Mar 12, 2017

Workaround: Nvidia GPU OpenCL uses 100% CPU one core openmm/openmm#1541

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sleep before ReadBuffer to resolve Nvidia's busywait issue #60

Sleep before ReadBuffer to resolve Nvidia's busywait issue #60

Kubuxu commented Nov 11, 2016 •

edited

Loading

Singman33 commented Nov 11, 2016

mbevand commented Nov 12, 2016

lhl commented Nov 12, 2016

Kubuxu commented Nov 12, 2016 •

edited

Loading

montvid commented Nov 12, 2016

Kubuxu commented Nov 12, 2016 •

edited

Loading

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

Kubuxu commented Nov 12, 2016 •

edited

Loading

Kubuxu commented Nov 12, 2016

krnlx commented Nov 15, 2016

Kubuxu commented Nov 15, 2016

mbevand commented Nov 15, 2016

Kubuxu commented Nov 15, 2016

mbevand commented Nov 15, 2016

Kubuxu commented Nov 15, 2016

Kubuxu commented Nov 15, 2016 •

edited

Loading

Sleep before ReadBuffer to resolve Nvidia's busywait issue #60

Sleep before ReadBuffer to resolve Nvidia's busywait issue #60

Conversation

Kubuxu commented Nov 11, 2016 • edited Loading

Singman33 commented Nov 11, 2016

mbevand commented Nov 12, 2016

lhl commented Nov 12, 2016

Kubuxu commented Nov 12, 2016 • edited Loading

montvid commented Nov 12, 2016

Kubuxu commented Nov 12, 2016 • edited Loading

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

mbevand commented Nov 12, 2016

Kubuxu commented Nov 12, 2016

Kubuxu commented Nov 12, 2016 • edited Loading

Kubuxu commented Nov 12, 2016

krnlx commented Nov 15, 2016

Kubuxu commented Nov 15, 2016

mbevand commented Nov 15, 2016

Kubuxu commented Nov 15, 2016

mbevand commented Nov 15, 2016

Kubuxu commented Nov 15, 2016

Kubuxu commented Nov 15, 2016 • edited Loading

Kubuxu commented Nov 11, 2016 •

edited

Loading

Kubuxu commented Nov 12, 2016 •

edited

Loading

Kubuxu commented Nov 12, 2016 •

edited

Loading

Kubuxu commented Nov 12, 2016 •

edited

Loading

Kubuxu commented Nov 15, 2016 •

edited

Loading