
Increase timeout for the mining test 4 times (to 240s) #7371

Merged

Conversation


@mj-xmr commented Feb 11, 2021

The mining test fails quite often because RandomX takes longer to initialize than the fixed timeout allows. I have increased the timeout fourfold for now.
A likely cause is abnormal load on the CI machines, from our project and others, while RandomX is initializing.

I reproduced the problem by artificially setting the timeout to 1 and got the same error message we typically see.

In a possible future PR I will check whether I can obtain any feedback from RandomX itself.
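
For illustration, a minimal sketch (in Python, matching the functional test suite) of the kind of poll-until-height loop such a test relies on; the helper and RPC names here are assumptions, not the actual test code:

```python
import time

MINING_TIMEOUT = 240  # seconds; raised from 60 by this PR

def wait_for_height(daemon, target_height, timeout=MINING_TIMEOUT):
    """Poll the daemon until the chain reaches target_height or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = daemon.get_info()  # hypothetical RPC wrapper
        if info['height'] >= target_height:
            return
        time.sleep(1)
    raise AssertionError('Timed out waiting for height %d' % target_height)
```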

Active work time: ~0.5h

@mj-xmr marked this pull request as ready for review February 11, 2021 15:26
@iamamyth

I don't necessarily object to increasing the timeout, but I think that option should be a last resort, used only after other, potentially saner solutions have been exhausted. Otherwise, the functional tests lose a lot of their utility in finding real problems with the mining code's performance (quadrupling the timeout means someone could wreck performance and no one would notice).

It seems quite odd to me that init has such high variance but mining subsequent blocks doesn't. Did you inspect the error messages to confirm the timeout occurs on the first block rather than a subsequent one? In both cases the format string would be identical, so it's the parameters that matter (specifically, whether the target height equals the initial height).

Assuming randomx init really is the problem, such a high variance implies something awry on the test machine. Sure, you may have noisy neighbors, but I would suggest checking the machine's capabilities to ensure it isn't thrashing due to lack of RAM. It could also be that you see slow startup because some test machines use software AES; in that case, the test suite could be parameterized based on the host machine's capabilities.
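
To illustrate the capability-based parameterization suggested above, here is a hedged Python sketch that checks for the AES CPU flag on Linux and widens the timeout only for hosts falling back to software AES; the flag check and multiplier are illustrative assumptions, not the project's actual approach:

```python
def has_hardware_aes():
    """Return True if /proc/cpuinfo lists the 'aes' CPU flag (Linux-only heuristic)."""
    try:
        with open('/proc/cpuinfo') as f:
            for line in f:
                # 'flags' on x86, 'Features' on ARM
                if line.startswith('flags') or line.startswith('Features'):
                    return 'aes' in line.split(':', 1)[1].split()
    except OSError:
        pass
    return False

BASE_TIMEOUT = 60
# Only software-AES machines get the generous 4x allowance.
MINING_TIMEOUT = BASE_TIMEOUT * (1 if has_hardware_aes() else 4)
```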

@iamamyth

One minor addendum to my prior comment: if you end up tracing a fairly precise rationale, it should be recorded as a comment in the test. For example, "randomx is slow to init" might become "randomx init has high variance on CI machines due to noisy neighbors" (of course the actual comment would depend on the underlying issue). A well-articulated, evidence-based rationale helps others decide whether the high timeout still makes sense as the underlying conditions change over time. As it stands, if randomx init latency halved on a typical machine tomorrow, no one would have any idea whether to halve the timeout or leave it.


@mj-xmr commented Feb 14, 2021

These are all very good ideas. I will start another PR that addresses them. My CCS proposal is precisely about problems like this one, so I promise it will not go unnoticed.

I have yet another idea:
Just before starting the RandomX init, run a numerical calculation on all threads, such as computing 100 decimal places of Pi, and measure how long it takes. This yields a "score" of the currently available CPU power, from which I can estimate how long the RandomX init should take, allowing it, say, 25% longer than expected. I have enough local machines with varying CPU power to verify the method.

Another check should verify AES availability, which would simply adjust a multiplier on the "CPU score" obtained earlier.
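
A hedged sketch of the calibration idea: time a fixed CPU-bound workload on all cores, compare it to a reference measurement, and scale the expected RandomX-init time (plus the 25% margin mentioned above). The workload, reference constants, and AES multiplier are illustrative assumptions.

```python
import time
from multiprocessing import Pool, cpu_count

def _burn(n):
    # Any deterministic CPU-bound task works; a Pi digit computation would do equally well.
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def cpu_score(work=2_000_000):
    """Run the workload on every core and return the wall-clock seconds it took."""
    start = time.monotonic()
    with Pool(cpu_count()) as pool:
        pool.map(_burn, [work] * cpu_count())
    return time.monotonic() - start

REFERENCE_SCORE = 0.5   # seconds the workload takes on a reference machine (assumption)
EXPECTED_INIT = 60      # RandomX init time expected on that machine (assumption)

def adaptive_timeout(software_aes_multiplier=1.0):
    # Scale the expected init time by how much slower this host currently is,
    # apply the AES multiplier, then allow 25% headroom as proposed above.
    slowdown = cpu_score() / REFERENCE_SCORE
    return EXPECTED_INIT * slowdown * software_aes_multiplier * 1.25
```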

@iamamyth

Tightening the bound using a baseline calculation would certainly improve the utility of the test. To my mind, that's above and beyond even what I would initially attempt, so agreed it would be a separate PR. The main goal, at least at the start, would be to find a sensible "high watermark" in the common worst case (noisy neighbors + software AES), and then try to iteratively reduce that bound.

@luigi1111 merged commit 1614be4 into monero-project:master Feb 18, 2021
@mj-xmr deleted the functional-tests-mining-failed-fix branch March 8, 2021 21:32