core dump runtime - fopen failed #925

Open
kbi-user opened this issue Aug 30, 2021 · 10 comments

@kbi-user

kbi-user commented Aug 30, 2021

Working Directory: /srv/tmp/
Working Directory 2: /srv/tmp/
Plot Name: plot-k32-2021-08-30-09-33-20e24798982d5b5e8cb2777fcddc1eeb470ae05eca66e2212342398e1480fe44
[P1] Table 1 took 25.1035 sec
[P1] Table 2 took 192.124 sec, found 4295041098 matches
[P1] Table 3 took 200.674 sec, found 4295058971 matches
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: fopen() failed
Aborted (core dumped)
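The "terminate called … Aborted (core dumped)" pattern above is what the C++ runtime produces when an exception escapes a thread entry function uncaught. A minimal sketch (hypothetical, not the plotter's actual code) of the kind of helper that would produce exactly this message on a failed fopen:

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical sketch: open a temp file for writing, throw on failure with
// the same message seen in the log above. If such an exception escapes a
// std::thread entry function, the runtime calls std::terminate(), which is
// the "terminate called ... Aborted (core dumped)" shown in the output.
FILE* open_or_throw(const std::string& path) {
    FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) {
        throw std::runtime_error("thread failed with: fopen() failed");
    }
    return f;
}
```

So the core dump itself is not evidence of memory corruption; it is the standard abort path for an unhandled C++ exception triggered by a single failed fopen() call.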


In previous issues, a missing trailing slash caused this error, but here the trailing slash is present. Things to note about my system:

  1. Command used: ./build/chia_plot -t /srv/tmp/ -2 /srv/tmp/ -d /srv/tmp/ -f [key1] -p [key2]
  2. /srv/tmp/ is a specific high-end drive provided via a kernel module. I have no issues whatsoever with that drive, even at 44 parallel "normal" plots. Specs for that drive: up to 30 GB/s read, up to 15 GB/s write, ~100 ns latency on sequential reads and writes.
  3. Container environment is rootless podman (shouldn't make a difference, but mentioning it in case it matters for what the chia-plotter does)
  4. AMD Milan CPU (7443P) with 256 GB RAM, 200 GB RAM free when starting
@kbi-user
Author

ADD: RAM is server ECC, not overclocked, tested with a memtest86 burn-in - the RAM is fine...

@madMAx43v3r
Owner

that's gotta be a drive issue

@kbi-user
Author

kbi-user commented Aug 30, 2021

that's gotta be a drive issue

I doubt it's that simple: the drive is an ERA RAID from RAIDIX.

Right now I am testing the ERA RAID for a while. Experience so far:

  • +: used in HPC clusters; I have observed the insane performance myself on 20+ high-end NVMe drives in one RAID array
  • +: handles 80K+ open files easily (verified myself)
  • +: low latency (for an NVMe RAID)
  • -: has quite an unusual read and write pattern

Various suggestions, assuming you took a standard approach to file handling:

  1. retry fopen
  2. increase the timeout
  3. cross-check whether there may be a logical error on extremely fast drives with their own error handling (this drive offers far more bandwidth than the usual pseudo-RAM drives via tmpfs)
  4. potential timing issues with DKMS-based kernel-module drives
  5. ...
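Suggestions 1 and 2 could be sketched as a retry wrapper around fopen(). This is a hypothetical illustration, not the plotter's actual code; the function name and back-off parameters are made up:

```cpp
#include <chrono>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical sketch of "retry fopen" with a growing delay: papers over
// transient failures a kernel-module block device might return under load.
FILE* fopen_retry(const std::string& path, const char* mode,
                  int attempts = 5,
                  std::chrono::milliseconds delay = std::chrono::milliseconds(100)) {
    for (int i = 0; i < attempts; ++i) {
        if (FILE* f = std::fopen(path.c_str(), mode)) {
            return f;                        // success
        }
        std::this_thread::sleep_for(delay);  // wait before the next attempt
        delay *= 2;                          // simple exponential back-off
    }
    throw std::runtime_error("fopen() failed after retries: " + path);
}
```

If the failures on the ERA RAID are transient (e.g. momentary device busy states), such a wrapper would mask them; if the error is persistent, it would at least fail later with the path in the message.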

@kbi-user
Author

kbi-user commented Aug 30, 2021

deleted since replacement test disk was full (OS Samsung 980 Pro)

@madMAx43v3r
Owner

so problem solved?

@kbi-user
Author

kbi-user commented Aug 30, 2021

so problem solved?

No.

What I did: a test run on a physical NVMe (Samsung 980 Pro).

Result: completed successfully:
Phase 4 took 68.4926 sec, final plot size is 108807737395 bytes
Total plot creation time was 1370.53 sec (22.8421 min)

Overall too slow; system load dropped below 20% at times. So yes, I want to use the ERA RAID, which should be possible. It's tested and works perfectly fine without any hiccup - even when creating 50+ plots simultaneously the classic way.


What I did: checked several options by removing any limitations (which shouldn't be a problem; I removed them nonetheless) - the problem persists on the ERA RAID. I opened a support call with RAIDIX.

What I ask from you, if possible: look into where this issue could result from and point out where it may come from (to support me in my RAIDIX call).
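One thing that would help pinpoint the cause for the vendor: the bare "fopen() failed" message discards errno, which says *why* the open failed (EMFILE, EIO, ENOSPC, ENOENT, ...). A hypothetical diagnostic wrapper (names are made up, not the plotter's actual code) could preserve it:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical sketch: include errno's description in the exception message,
// so a failing fopen() reveals whether it was a missing path, an I/O error
// from the device, a file-descriptor limit, etc.
FILE* fopen_or_explain(const std::string& path, const char* mode) {
    FILE* f = std::fopen(path.c_str(), mode);
    if (!f) {
        const int err = errno;  // capture before anything else can clobber it
        throw std::runtime_error("fopen('" + path + "', '" + mode
                                 + "') failed: " + std::strerror(err));
    }
    return f;
}
```

With a message like that in the core dump or log, RAIDIX support could correlate the exact errno with their driver's behavior at that timestamp.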


I am aware this is a rather uncommon issue, since it involves a software RAID from an HPC solution. Yet when talking raw speed, I figure you may be really interested in making it work.

@kbi-user
Author

Additional remark: as long as your chia-plotter is running on the ERA RAID, I saw physical writes cap at ~15+ GB/s. There is still room for more, since that high write load only occurs for roughly 1-2 seconds out of every 10 seconds. Comparing the RAID with the slow 980 Pro almost hurts... when being unable to use it.

@kbi-user
Author

kbi-user commented Aug 30, 2021

From RAIDIX support:

Yes I see periodical core dump from chia_plot:
Aug 30 12:30:13 [servername] systemd[1]: Started Process Core Dump (PID 2744583/UID 0).
Aug 30 12:30:14 [servername] systemd-coredump[2744584]: Core file was truncated to 2147483648 bytes.
Aug 30 12:30:20 [servername] systemd-coredump[2744584]: Process 2737972 (chia_plot) of user [UID] dumped core.

                                                              Stack trace of thread 246:
                                                               #0  0x00007f7d814d018b n/a (n/a)

Aug 30 12:30:20 [servername] systemd[1]: systemd-coredump@11-2744583-0.service: Succeeded.
But nothing from ERA in the logs at that time. Maybe the additional information from support you're talking about will help us.

Any idea?

@kbi-user
Author

kbi-user commented Aug 31, 2021

@madMAx43v3r any chance you could support me in my call with RAIDIX?

In case I didn't state it clearly: the ERA RAID is fine with everything else, even the heaviest load over prolonged periods (10h+). I verified it again myself. Cooling is fine too (well below 60°C under heavy load).

@kbi-user
Author

More information:

Usually the fopen error comes up once a phase is finished - sometimes after phase 1, last time after phase 2:

Phase 2 took 302.038 sec
Wrote plot header with 268 bytes
terminate called after throwing an instance of 'std::runtime_error'
what(): fopen() failed
