
Increase timeout for the mining test 4 times (to 240s) #7371

Merged

Conversation


@mj-xmr commented Feb 11, 2021

The mining test fails quite often because RandomX takes longer to initialize than the fixed timeout allows. I have increased the timeout fourfold for now.
A likely cause is abnormal load on the CI machines, from our project and others, while RandomX is initializing.

I reproduced the problem by artificially setting the timeout to 1 and got the same error message we typically see.

In a possible future PR I will check whether I can obtain any feedback from RandomX itself.
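
For illustration, a minimal sketch (in Python, matching the functional test suite) of the kind of poll-until-height loop such a test relies on; the helper and RPC names here are assumptions, not the actual test code:

```python
import time

MINING_TIMEOUT = 240  # seconds; raised from 60 by this PR

def wait_for_height(daemon, target_height, timeout=MINING_TIMEOUT):
    """Poll the daemon until the chain reaches target_height or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = daemon.get_info()  # hypothetical RPC wrapper
        if info['height'] >= target_height:
            return
        time.sleep(1)
    raise AssertionError('Timed out waiting for height %d' % target_height)
```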

Active work time: ~0.5h

@mj-xmr marked this pull request as ready for review February 11, 2021 15:26
@iamamyth

I don't necessarily object to increasing the timeout, but I think that option should be a last resort, used only after other, potentially saner solutions have been exhausted. Otherwise, the functional tests lose a lot of their utility in finding real problems with the mining code's performance (quadrupling the timeout means someone could wreck performance and no one would notice).

It seems quite odd to me that init has such high variance but mining subsequent blocks doesn't. Did you inspect the error messages to confirm the timeout occurs on the first block rather than a subsequent one? In both cases the format string would be identical, so it's the parameters that matter (specifically, whether the target height equals the initial height).

Assuming randomx init really is the problem, such a high variance implies something awry on the test machine. Sure, you may have noisy neighbors, but I would suggest checking the machine's capabilities to ensure it isn't thrashing due to lack of RAM. It could also be that you see slow startup because some test machines use software AES; in that case, the test suite could be parameterized based on the host machine's capabilities.
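
To illustrate the capability-based parameterization suggested above, here is a hedged Python sketch that checks for the AES CPU flag on Linux and widens the timeout only for hosts falling back to software AES; the flag check and multiplier are illustrative assumptions, not the project's actual approach:

```python
def has_hardware_aes():
    """Return True if /proc/cpuinfo lists the 'aes' CPU flag (Linux-only heuristic)."""
    try:
        with open('/proc/cpuinfo') as f:
            for line in f:
                # 'flags' on x86, 'Features' on ARM
                if line.startswith('flags') or line.startswith('Features'):
                    return 'aes' in line.split(':', 1)[1].split()
    except OSError:
        pass
    return False

BASE_TIMEOUT = 60
# Only software-AES machines get the generous 4x allowance.
MINING_TIMEOUT = BASE_TIMEOUT * (1 if has_hardware_aes() else 4)
```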

@iamamyth

One minor addendum to my prior comment: if you end up tracing a fairly precise rationale, it should be recorded as a comment in the test. For example, "randomx is slow to init" might become "randomx init has high variance on CI machines due to noisy neighbors" (of course the actual comment would depend on the underlying issue). A well-articulated, evidence-based rationale helps others decide whether the high timeout still makes sense as the underlying conditions change over time. As it stands, if randomx init latency halved on a typical machine tomorrow, no one would have any idea whether to halve the timeout or leave it.


@mj-xmr commented Feb 14, 2021

These are all very good ideas. I will start another PR that addresses them. My CCS proposal is precisely about problems like this one, so I promise it will not go unnoticed.

I have yet another idea:
Just before starting the RandomX init, run a numerical calculation on all threads, such as computing 100 decimal places of Pi, and measure how long it takes. This yields a "score" of the currently available CPU power, from which I can estimate how long the RandomX init should take, allowing it, say, 25% longer than expected. I have enough local machines with varying CPU power to verify the method.

Another check should verify AES availability, which would simply adjust a multiplier on the "CPU score" obtained earlier.
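
A hedged sketch of the calibration idea: time a fixed CPU-bound workload on all cores, compare it to a reference measurement, and scale the expected RandomX-init time (plus the 25% margin mentioned above). The workload, reference constants, and AES multiplier are illustrative assumptions.

```python
import time
from multiprocessing import Pool, cpu_count

def _burn(n):
    # Any deterministic CPU-bound task works; a Pi digit computation would do equally well.
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def cpu_score(work=2_000_000):
    """Run the workload on every core and return the wall-clock seconds it took."""
    start = time.monotonic()
    with Pool(cpu_count()) as pool:
        pool.map(_burn, [work] * cpu_count())
    return time.monotonic() - start

REFERENCE_SCORE = 0.5   # seconds the workload takes on a reference machine (assumption)
EXPECTED_INIT = 60      # RandomX init time expected on that machine (assumption)

def adaptive_timeout(software_aes_multiplier=1.0):
    # Scale the expected init time by how much slower this host currently is,
    # apply the AES multiplier, then allow 25% headroom as proposed above.
    slowdown = cpu_score() / REFERENCE_SCORE
    return EXPECTED_INIT * slowdown * software_aes_multiplier * 1.25
```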

@iamamyth

Tightening the bound using a baseline calculation would certainly improve the utility of the test. To my mind, that's above and beyond even what I would initially attempt, so agreed it would be a separate PR. The main goal, at least at the start, would be to find a sensible "high watermark" in the common worst case (noisy neighbors + software AES), and then try to iteratively reduce that bound.

@luigi1111 merged commit 1614be4 into monero-project:master Feb 18, 2021
@mj-xmr deleted the functional-tests-mining-failed-fix branch March 8, 2021 21:32