Benchmark results #26

Open · solardiz opened this issue Feb 6, 2019 · 16 comments

solardiz (Contributor) commented Feb 6, 2019

It should be possible to sanity-check the performance of a ProgPOW build against its developers' expectations. For that, this repository should include benchmark results on a few system setups (which should also be documented - both hardware and software) for the latest and/or other specified versions of ProgPOW. So far, I only found outdated results in the "Testing" section of the first comment in ZcashFoundation/GrantProposals-2018Q2#15, and these don't include detail on the system setups (they only list GPU types) nor the block number.

solardiz (Contributor, Author) commented Feb 6, 2019

My current best speeds for ProgPOW as of today, built as described in #27 and run on our HPC Village box with a GTX 1080, Titan X Maxwell, Titan Kepler, and Vega 64:

$ ethminer/ethminer --version
ethminer version 0.15.0.dev0+git.83acba5
Build: linux/release/gnu

$ ethminer/ethminer -U -M 30000 --cuda-devices 0
[...]
 cu  20:57:40|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
[...]
min/mean/max: 15056099/15085757/15154954 H/s
inner mean: 15072578 H/s

$ ethminer/ethminer -U -M 30000 --cuda-devices 1
[...]
 cu  21:05:14|cuda-0  |  Using device: GeForce GTX TITAN X  (Compute 5.2)
[...]
min/mean/max: 13795672/13887855/13999889 H/s
inner mean: 13881239 H/s

$ ethminer/ethminer -U -M 30000 --cuda-devices 2
[...]
 cu  21:05:54|cuda-0  |  Using device: GeForce GTX TITAN  (Compute 3.5)
[...]
min/mean/max: 5768898/5800013/5872612 H/s
inner mean: 5786185 H/s

$ ethminer/ethminer -G -M 30000 
[...]
 cl  21:07:23|cl-0    |  Device:   gfx900  / OpenCL 2.0 AMD-APP (2766.4)
[...]
min/mean/max: 19348162/19476735/19561810 H/s
inner mean: 19491235 H/s

The speed on GTX 1080 achieved above wouldn't persist long-term as that GPU's clock rate reduces very significantly as the GPU gets hotter. The remaining 3 GPUs would likely achieve similar speeds long-term.

For comparison, here are best speeds achieved at Ethash using current ethminer:

$ ethminer/ethminer --version

ethminer 0.18.0-alpha.3
Build: linux/release/gnu

$ ethminer/ethminer -M 30000
[...]
 m 20:08:14 ethminer 0:00 A0 87.43 Mh - cu0 21.39, cl1 28.40, cu2 18.69, cu3 18.95

cl1 is the Vega 64. cu0, cu2, and cu3 are the same NVIDIA GPUs as above.

Of course, in a sense comparing ProgPOW vs. Ethash is apples to oranges, but on the other hand we see that the old Titan Kepler is an outlier - it performed remarkably well at Ethash (same 18M+ speed as the newer Titan X Maxwell's), but a lot worse at ProgPOW (13.9M on Titan X Maxwell vs. 5.8M on Titan Kepler).

Forced use of OpenCL (rather than CUDA) on NVIDIA resulted in same or worse speeds for both Ethash and ProgPOW on all 3 NVIDIA GPUs.

solardiz (Contributor, Author) commented Feb 6, 2019

The above was for block 30k (1 GB DAG) as used in the testcase. The below is similar for block 7M (3 GB DAG), which would reflect current Ethereum.

ProgPOW:

$ ethminer/ethminer -U -M 7000000 --cuda-devices 0
[...]
 cu  00:03:59|cuda-0  |  Using device: GeForce GTX 1080  (Compute 6.1)
[...]
min/mean/max: 14997636/15074730/15156470 H/s
inner mean: 15073182 H/s

$ ethminer/ethminer -U -M 7000000 --cuda-devices 1
[...]
 cu  00:04:44|cuda-0  |  Using device: GeForce GTX TITAN X  (Compute 5.2)
[...]
Trial 1...
3604930
Trial 2...
4037825
Trial 3...
4038229
Trial 4...
4038229
Trial 5...
4038229
min/mean/max: 3604930/3951488/4038229 H/s
inner mean: 4038094 H/s

$ ethminer/ethminer -U -M 7000000 --cuda-devices 2
[...]
 cu  00:06:28|cuda-0  |  Using device: GeForce GTX TITAN  (Compute 3.5)
[...]
Trial 1...
3521225
Trial 2...
3619034
Trial 3...
3777517
Trial 4...
3777895
Trial 5...
3778274
min/mean/max: 3521225/3694789/3778274 H/s
inner mean: 3724815 H/s

$ ethminer/ethminer -G -M 7000000
[...]
 cl  00:08:16|cl-0    |  Device:   gfx900  / OpenCL 2.0 AMD-APP (2766.4)
[...]
Trial 1...
17711251
Trial 2...
19769611
Trial 3...
19719144
Trial 4...
19876478
Trial 5...
19876478
min/mean/max: 17711251/19390592/19876478 H/s
inner mean: 19788411 H/s

Ethash:

$ ethminer/ethminer -M 7000000


ethminer 0.18.0-alpha.3
Build: linux/release/gnu
[...]
 m 00:02:00 ethminer 0:00 A0 58.71 Mh - cu0 21.34, cl1 28.15, cu2 4.78, cu3 4.44

At this DAG size, somehow there's a huge speed drop not only on Titan Kepler but also on Titan X Maxwell, and not only at ProgPOW, but also at Ethash. Only the newer GTX 1080 and Vega 64 do well, at both ProgPOW and Ethash.

solardiz changed the title from "Add benchmark results" to "Benchmark results" on Feb 6, 2019
ifdefelse (Owner) commented

The huge speed drop on older hardware is due to the DAG size being >2GB. This is a known limitation of the hardware:
ethereum-mining/ethminer#544 (comment)

We have some slightly out-of-date expected hashrates at the end of this post:
https://medium.com/@ifdefelse/understanding-progpow-performance-and-tuning-d72713898db3

You're right that we should refresh them, though it'll take me a bit to collect all the hardware again.

solardiz (Contributor, Author) commented Feb 8, 2019

Thanks @ifdefelse! I learned a few things from those links.

  1. Per @ddobreff's comment, it sounds like NVIDIA's page_size or AMD's fragment_size of 64 KB isn't a hardware thing, but is technically adjustable from the driver, at least on AMD? If so, can it also be adjusted with some hack on NVIDIA? The comment doesn't explain the actual problem in detail, but I guess it's something along the lines of the page tables becoming too large to fit in a cache (or a TLB), which a larger page size then mitigates by making them smaller again (or needing fewer TLB entries). Correct? If so, perhaps another workaround would be to do larger sequential accesses, so that the cache miss cost (on page table read) is amortized across the larger data transfer (and more time). Is this a tweak ProgPOW could make? There could then be the issue of where to fetch that larger data to, but if our only goal of making it larger is amortizing the cost of a random lookup, then we don't actually have to prefetch any more data than we currently do. We can fetch in the same smaller portions as we currently use - we only need to make a few of them sequential rather than random (e.g., 4 sequential, then randomize the starting offset for the next 4, etc.; see the sketch right after this list).

  2. Your benchmark results for Ethash and ProgPOW on Vega 64 show significantly higher speeds than I obtained. Why do you think that is? Is your Vega 64 possibly running with a raised power limit and/or undervolted? You only mentioned stock clocks, but Vega 64's clock rate is a function of it hitting the power limit. (Some of your other results also look rather high compared to mine, but those are not directly comparable since those are different GPUs than what I have.)
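
To illustrate the amortization idea in item 1 with rough numbers (my own back-of-the-envelope sketch, not anything measured or taken from driver documentation):

// Assumed: a ~3 GB DAG mapped with 64 KB pages needs about 3 GiB / 64 KiB = 49152
// page table entries - presumably far more than a GPU TLB holds, so a fully random
// small DAG read tends to pay a page-table walk on top of the DRAM access.
// If N consecutive reads share one randomly chosen starting offset (and thus,
// almost always, one page), then roughly:
//   cost per read ~ dram_access + page_walk / N
// e.g. with N = 4 the page-walk overhead per read drops to a quarter, without the
// kernel ever having to hold more than one small portion in registers at a time.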

xazax310 commented Mar 1, 2019

@solardiz I've benchmarked a few GPUs I had. See here
https://medium.com/@infantry1337/the-miners-benchmark-progpow-e79cab6eabc3
I've encountered issues with Nvidia and AMD in how they handle core clocks, which are important in ProgPoW. This causes a loss of hashrate unless you specifically tell the driver to set core-clock speeds. That could explain the hashrate differences you're seeing. Unfortunately, I don't own a Vega card to test this against @ifdefelse's results.

I'm working on benchmarking the newest 0.9.3 spec with some AMD/Nvidia GPUs.

solardiz (Contributor, Author) commented Mar 4, 2019

With the AMDGPU-PRO driver under Linux, I am now using:

echo manual > '/sys/bus/pci/drivers/amdgpu/0000:05:00.0/power_dpm_force_performance_level'
echo 7 > '/sys/bus/pci/drivers/amdgpu/0000:05:00.0/pp_dpm_sclk'
echo 4 > '/sys/bus/pci/drivers/amdgpu/0000:05:00.0/pp_power_profile_mode'

Previously, the Vega 64 card would tend to run at 1084 MHz under load, and I thought this was what it'd have to be under its default power limit - but it seems not. The first two commands above try to set the frequency to the max in this card's default frequency list, which is 1663 MHz (might be vendor OC). With these settings, the actual frequency seen under load alternates between 1401 and 1576 MHz at Ethash, and stays at 1401 MHz at ProgPoW.
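
To double-check what the driver actually picks, the same sysfs node can be read back (the currently active sclk state is marked with an asterisk; same device path as above):

$ cat '/sys/bus/pci/drivers/amdgpu/0000:05:00.0/pp_dpm_sclk'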

The last "echo 4" is switching the card from video to compute mode (as mentioned in @xazax310 blog post referenced above), but this appears to make (almost?) no performance difference in my testing.

With these settings, I am getting speeds close to those seen in @ifdefelse's blog post also referenced above: ProgPoW block 7M speed went up from 19.8M to 22.7M (@ifdefelse's is 22.5M), and Ethash block 7M speed went up from 28.2M to 36.2M initially (@ifdefelse's is 37.1M) but drops to ~33M over longer runs as the GPU gets hotter and the lower 1401 MHz frequency is used more.

This ProgPoW speed on Vega 64 is close to @xazax310's reported 22.6M for GTX 1080 Ti, but perhaps Vega 64 consumes significantly more power at this test.

xazax310 commented Mar 5, 2019

@solardiz I'm going to get a Vega 56 for a day from a friend; I'll run a few benches. Unfortunately, I don't think I'll have enough time to test ROCm on Linux, but the gains from ROCm are supposed to be 10% or so.

In my further testing of the 0.9.3 spec I noticed that the core speed varies; this happens on both AMD and Nvidia. For example, an RX 480 with a stock 1266 MHz core speed clocks down to 1069 MHz, resulting in a loss of hashrate. I've had to set my core voltage to ~1000 mV; then the core speed stays set at 1266 MHz. I've reached out to @ifdefelse to get their thoughts on it.

lookfirst commented

@solardiz Awesome! Please also show how you're running ethminer's benchmark mode. I assume you're running against block 7M? It is important to try a few spread-out blocks, as you're going to get different results on each block.

solardiz (Contributor, Author) commented Mar 5, 2019

@lookfirst It's ethminer -G -M 7000000 as shown in one of my previous comments. Yes, I'll need to try different blocks, but for now I find these speeds close enough to what's expected to proceed with trying out ProgPoW algorithm tweaks as I had planned.

ProgPoW on Vega 64:

$ ethminer/ethminer -G -M 7000000
  m  21:08:56|ethminer|  ethminer version 0.15.0.dev0
  m  21:08:56|ethminer|  Build: linux / release +git. 83acba5
  21:08:57|ethminer|  Found suitable OpenCL device [ gfx900 ] with 8573157376  bytes of GPU memory
Benchmarking on platform: CL
Preparing DAG for block #7000000
 cl  21:08:57|cl-0    |  No work. Pause for 3 s.
Warming up...
 cl  21:09:00|cl-0    |  New epoch: 233
 cl  21:09:02|cl-0    |  Platform: AMD Accelerated Parallel Processing
 cl  21:09:02|cl-0    |  Device:   gfx900  / OpenCL 2.0 AMD-APP (2766.4)
 cl  21:09:04|cl-0    |  Build info: 
 cl  21:09:04|cl-0    |  Creating light cache buffer, size 47316928
 cl  21:09:04|cl-0    |  Creating DAG buffer, size 3028287104
 cl  21:09:04|cl-0    |  Loading kernels
 cl  21:09:04|cl-0    |  Writing light cache buffer
 cl  21:09:04|cl-0    |  Creating buffer for header.
 cl  21:09:04|cl-0    |  Creating mining buffer
 21:09:06|cl-0    |  2.82031  GB of DAG data generated in 2302 ms.
Trial 1... 
20449504
Trial 2... 
22706211
Trial 3... 
22656038
Trial 4... 
22760927
Trial 5... 
22760927
min/mean/max: 20449504/22266721/22760927 H/s
inner mean: 22707725 H/s

Ethash on all 4 GPUs mentioned before (Vega 64 is cl1):

$ ethminer/ethminer -M 7000000
[...]
cl 21:07:33 cl-1     2.82 GB of DAG data generated in 3,665 ms.
 m 21:07:38 ethminer 0:00 A0 0.00 h - cu0 0.00, cl1 0.00, cu2 0.00, cu3 0.00
cu 21:07:39 cuda-3   Generated DAG + Light in 9,170 ms. 9.06 GB left.
cu 21:07:39 cuda-2   Generated DAG + Light in 9,452 ms. 3.08 GB left.
cl 21:07:40 cl-1     Job: d4d2a7c0 Sol: 0x48399debe9cd3f21
 i 21:07:40 ethminer **Accepted   1 ms. localhost:0
cu 21:07:41 cuda-0   Generated DAG + Light in 11,606 ms. 5.06 GB left.
 m 21:07:43 ethminer 0:00 A1 36.72 Mh - cu0 0.15, cl1 36.21, cu2 0.17, cu3 0.18
 m 21:07:48 ethminer 0:00 A1 66.82 Mh - cu0 21.33, cl1 36.20, cu2 4.85, cu3 4.44
 m 21:07:53 ethminer 0:00 A1 66.81 Mh - cu0 21.34, cl1 36.17, cu2 4.86, cu3 4.44
 m 21:07:58 ethminer 0:00 A1 66.77 Mh - cu0 21.34, cl1 36.13, cu2 4.86, cu3 4.44
 m 21:08:03 ethminer 0:00 A1 66.74 Mh - cu0 21.34, cl1 36.10, cu2 4.86, cu3 4.44
 m 21:08:08 ethminer 0:00 A1 66.70 Mh - cu0 21.34, cl1 36.06, cu2 4.86, cu3 4.44

solardiz (Contributor, Author) commented Apr 4, 2019

I had partial success repairing the speeds on older GPUs, per my comment above from Feb 8. Specifically, after applying #35 we can then use these settings:

// blocks before changing the random program
#define PROGPOW_PERIOD          10
// lanes that work together calculating a hash
#define PROGPOW_LANES           16
// uint32 registers per lane
#define PROGPOW_REGS            64
// uint32 loads from the DAG per lane
#define PROGPOW_DAG_LOADS       8
// size of the cached portion of the DAG
#define PROGPOW_CACHE_BYTES     (16*1024)
// DAG accesses, also the number of loops executed
#define PROGPOW_CNT_DAG         32
// random cache accesses per loop
#define PROGPOW_CNT_CACHE       22
// random math instructions per loop
#define PROGPOW_CNT_MATH        36

This doubles PROGPOW_DAG_LOADS and compensates for that by also doubling (relative to the 0.9.3 proposal) PROGPOW_REGS, PROGPOW_CNT_CACHE, and PROGPOW_CNT_MATH and halving PROGPOW_CNT_DAG. The result is larger DAG loads (512 instead of 256 bytes), but half as many of them, keeping the total the same. The total compute also remains the same. The number of registers used doubles, which is a good thing if we can afford it - and in my testing yes, we can.
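
To spell out the "total stays the same" arithmetic (the 0.9.3 baseline values here are just the doubled/halved counterparts implied above):

// DAG bytes fetched per loop iteration, across all 16 lanes:
//   0.9.3:    PROGPOW_LANES * PROGPOW_DAG_LOADS * sizeof(uint32_t) = 16 * 4 * 4 = 256 bytes
//   proposed: 16 * 8 * 4 = 512 bytes
// DAG bytes fetched per hash:
//   0.9.3:    256 bytes * PROGPOW_CNT_DAG (64) = 16 KiB
//   proposed: 512 bytes * 32 = 16 KiB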

With this, I am getting the same speed as above (which was for ProgPoW 0.9.2) at block 7M on Vega 64 (still 22.7M), slightly lower speed on GTX 1080 (down from 15.15M to 14.8M), but much higher speed on Titan X Maxwell (up from 4.0M to 6.6M) and somewhat higher on the old Titan Kepler (up from 3.8M to 4.4M). I think the 65% speed increase on Maxwell may well justify the maybe 2.5% slowdown on Pascal. The twice larger register file usage might justify that, too, even if there were no speed increase on those older GPUs.

This is based on benchmark for one block number only. Since the generated program changes with these parameter changes, it's not a direct comparison of the two versions of the parameters. For a direct comparison, many benchmarks for different block numbers would need to be run for both versions and then average speeds compared. (So even the 2.5% slowdown observed on Pascal might as well not exist for real, or it might be different.)

Also, we don't strictly have to double 0.9.3's PROGPOW_CNT_CACHE and PROGPOW_CNT_MATH. We can try other suitable combinations, given that after the major change to DAG read size these need to be re-tuned across a variety of GPUs anyway. For example, I've also tried 28 and 29, respectively, and got similar speeds. I think such settings might be preferable since the "gather loads" from the cache are relatively demanding, more so than many of ProgPoW's math operations.
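
For reference, that alternative keeps the block of defines above except for these two (just restating the numbers from the previous paragraph):

// random cache accesses per loop
#define PROGPOW_CNT_CACHE       28
// random math instructions per loop
#define PROGPOW_CNT_MATH        29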

xazax310 commented Apr 5, 2019

So I encountered the same issue with the Vega 64 on Windows.
See my updated benchmark testing here: https://medium.com/altcoin-magazine/comprehensive-progpow-benchmark-715126798476

While I didn't get a lot of time to play with the Vega 64, it's definitely something with AMD's driver/software controls. I couldn't get it to run correctly for Ethash either.

> GTX 1080 (down from 15.15M to 14.8M), but much higher speed on Titan X Maxwell (up from 4.0M to 6.6M) and somewhat higher on the old Titan Kepler (up from 3.8M to 4.4M)

To your point about increasing DAG loads to 512 bytes, which benefits older GPUs: while interesting, it's irrelevant. As a fairly large farm operator myself, I don't run a single Maxwell GPU - in fact, nothing older than Pascal or Polaris. Most GPU farms, gamers, and hobbyists would not be running such old GPUs. If the increase had a positive effect on the newer generation I would say it makes sense, but we're degrading performance in favor of older GPUs. That makes no sense.

We should build the spec for current-generation GPUs. In two years, once a new generation comes out and many replace old equipment, ProgPoW could then be tuned towards Turing and Navi, as Kristy has said.

> (So even the 2.5% slowdown observed on Pascal might as well not exist for real, or it might be different.)

The block heights for ProgPoW make benchmarking fairly difficult since the output will always be different. In my opinion, it's best to check a few different block heights and average them.

solardiz (Contributor, Author) commented Apr 5, 2019

@xazax310 Thanks for that Medium post. I learned of some news from there (I'm not following Ethereum news normally, but am looking into ProgPoW as part of Zcash's potential interest in it).

You make a valid point about the irrelevance of older GPUs.

solardiz (Contributor, Author) commented Apr 5, 2019

Implementing more of my idea from Feb 8 with a further hack, I got a 3x+ speedup on Titan X Maxwell (up from 4.0M to 12.3M) at the cost of a 3.5% slowdown on GTX 1080 (down from 15.15M to 14.6M). Is this possibly enough speed for some miners to reconsider using Maxwell?

+++ b/libprogpow/ProgPow.cpp
@@ -133,6 +133,7 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
         ret << "barrier(CLK_LOCAL_MEM_FENCE);\n";
         ret << "offset = share[group_id];\n";
     }
+    ret << "uint32_t orig_offset = offset;\n";
     ret << "offset %= PROGPOW_DAG_ELEMENTS;\n";
     ret << "offset = offset * PROGPOW_LANES + (lane_id ^ loop) % PROGPOW_LANES;\n";
     ret << "data_dag = g_dag[offset];\n";
@@ -181,7 +182,12 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
         ret << "if (hack_false) __threadfence_block();\n";
     else
         ret << "if (hack_false) barrier(CLK_LOCAL_MEM_FENCE);\n";
+    ret << "if ((loop & 3) == 3) {\n";
     ret << merge("mix[0]", "data_dag.s[0]", rnd());
+    ret << "} else {\n";
+    ret << merge("mix[1]", "data_dag.s[0]", rnd());
+    ret << "mix[0] = orig_offset + 1;\n";
+    ret << "}\n";
     for (int i = 1; i < PROGPOW_DAG_LOADS; i++)
     {
         std::string dest = mix_dst();

Combined with #35 and the parameters change proposed above, this little patch implements sequential fetches of groups of 4 blocks of 512 bytes each, or effectively fetches of 2 KiB blocks, while not requiring that much room to fetch into.
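
To make the access pattern explicit, here is my reading of what the generated kernel ends up doing (an illustration of the patch above, not code from the repository):

// Effective DAG block index used on each loop iteration with "(loop & 3) == 3":
//   loop 0: index R0       (random, derived from the shared mix[0])
//   loop 1: index R0 + 1   (mix[0] was overwritten with orig_offset + 1)
//   loop 2: index R0 + 2
//   loop 3: index R0 + 3   (here data_dag.s[0] is merged into mix[0], re-randomizing it)
//   loop 4: index R1       (a new random group of 4 begins)
// So each group of four 512-byte loads covers 2 KiB of consecutive DAG data,
// while any single iteration still loads only 512 bytes into registers.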

I've also tested a variation with if (loop & 1), which fetches groups of 2 blocks of 512 bytes each, or effectively 1 KiB; it provided 10M on Titan X Maxwell at a similar ~3.5% slowdown on Pascal. And a variation with if ((loop & 7) == 7), which fetches groups of 8 blocks of 512 bytes each, or effectively 4 KiB; this one provided 12.5M on Titan X Maxwell, again at a ~3.5% slowdown on Pascal. 12.5M is slightly higher than 12.3M above, but I'm not convinced this is worth it, as the larger fetches allow for use of higher-latency memory in an "attack".

While the slowdowns on modern GPUs are really unfortunate, to me ProgPoW is far from final yet - I am considering many other tweaks - so performance differences of a few percent might be premature to take seriously, whereas the 3x+ speedup is a real thing. Consider this a proof of concept.

Disclaimer: in absence of test vectors for this revised code that we'd compare against a pure host-side implementation, it's always possible that I made some error and the code doesn't actually behave as I assume it does, which would invalidate the benchmark results. These results are consistent with my expectations, and make sense to me, but they'd need to be verified.

solardiz (Contributor, Author) commented Apr 7, 2019

Where I overwrite mix[0] in that PoC code, the previous value should ideally be made use of, e.g. with ret << "mix[1] += mix[0];\n"; just before the overwrite. Otherwise we might waste (let the compiler optimize out) some of the random math that changes mix[0] (in cases where the changed mix[0] wasn't yet used by further random math). I did say it's just a proof of concept, but I thought I'd mention this already-known way in which the PoC code is non-final.

ifdefelse (Owner) commented

As xazax said, while academically interesting, I don't see any reason to tune for old Maxwell cards. Maxwell cards have performed terribly at Ethash ever since the DAG grew to >2GB back at the end of 2017. They've probably all been retired from mining farms over the past 1.5 years. There's no reason to target hardware that no longer exists.

solardiz (Contributor, Author) commented Apr 8, 2019

Thank you for sharing your opinion @ifdefelse. I'm thinking of more than "just" Ethereum here. I guess some Maxwell GPUs are still mining some other altcoins. Those GPUs might switch to Ethereum if that becomes reasonable, or those other altcoins might switch to ProgPoW. But maybe I'm imagining this.

There's also the 2x increase in register file usage with these changes, which is of some value even on newer hardware, whereas the slight performance drop might or might not persist after other tweaks yet to be made for other reasons.
