core dump runtime - fopen failed #925

Open
kbi-user opened this issue Aug 30, 2021 · 10 comments

@kbi-user

kbi-user commented Aug 30, 2021

Working Directory: /srv/tmp/
Working Directory 2: /srv/tmp/
Plot Name: plot-k32-2021-08-30-09-33-20e24798982d5b5e8cb2777fcddc1eeb470ae05eca66e2212342398e1480fe44
[P1] Table 1 took 25.1035 sec
[P1] Table 2 took 192.124 sec, found 4295041098 matches
[P1] Table 3 took 200.674 sec, found 4295058971 matches
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: fopen() failed
Aborted (core dumped)
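The "terminate called … Aborted (core dumped)" pattern above is what the C++ runtime produces when an exception escapes a thread entry function uncaught. A minimal sketch (hypothetical, not the plotter's actual code) of the kind of helper that would produce exactly this message on a failed fopen:

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical sketch: open a temp file for writing, throw on failure with
// the same message seen in the log above. If such an exception escapes a
// std::thread entry function, the runtime calls std::terminate(), which is
// the "terminate called ... Aborted (core dumped)" shown in the output.
FILE* open_or_throw(const std::string& path) {
    FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) {
        throw std::runtime_error("thread failed with: fopen() failed");
    }
    return f;
}
```

So the core dump itself is not evidence of memory corruption; it is the standard abort path for an unhandled C++ exception triggered by a single failed fopen() call.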


In previous issues, a missing trailing slash caused this error, but here the trailing slash is present. Things to note about my system:

  1. Command used: ./build/chia_plot -t /srv/tmp/ -2 /srv/tmp/ -d /srv/tmp/ -f [key1] -p [key2]
  2. /srv/tmp/ is a specific high-end drive provided via a kernel module. I have no issues whatsoever with that drive, even at 44 parallel "normal" plots. Specs for that drive: up to 30 GB/s read, up to 15 GB/s write, ~100 ns latency on sequential reads and writes.
  3. Container environment is rootless podman (shouldn't make a difference, but mentioning it in case it matters for what the chia-plotter does)
  4. AMD Milan CPU (7443P) with 256 GB RAM, 200 GB RAM free when starting
@kbi-user
Author

ADD: RAM is server ECC, not overclocked, tested with a memtest86 burn-in - the RAM is fine...

@madMAx43v3r
Owner

that's gotta be a drive issue

@kbi-user
Author

kbi-user commented Aug 30, 2021

that's gotta be a drive issue

I doubt it's that simple: the drive is an ERA RAID from RAIDIX.

Right now I am testing the ERA RAID for a while. Experience so far:

  • +: used in HPC clusters; I have observed the insane performance myself on 20+ high-end NVMe drives in one RAID array
  • +: handles 80K+ open files easily (verified myself)
  • +: low latency (for an NVMe RAID)
  • -: has quite an unusual read and write pattern

Various suggestions, assuming you took a standard approach to file handling:

  1. retry fopen
  2. increase the timeout
  3. cross-check whether there may be a logical error on extremely fast drives with their own error handling (this drive offers far more bandwidth than the usual pseudo-RAM drives via tmpfs)
  4. potential timing issues with DKMS-based kernel-module drives
  5. ...
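Suggestions 1 and 2 could be sketched as a retry wrapper around fopen(). This is a hypothetical illustration, not the plotter's actual code; the function name and back-off parameters are made up:

```cpp
#include <chrono>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical sketch of "retry fopen" with a growing delay: papers over
// transient failures a kernel-module block device might return under load.
FILE* fopen_retry(const std::string& path, const char* mode,
                  int attempts = 5,
                  std::chrono::milliseconds delay = std::chrono::milliseconds(100)) {
    for (int i = 0; i < attempts; ++i) {
        if (FILE* f = std::fopen(path.c_str(), mode)) {
            return f;                        // success
        }
        std::this_thread::sleep_for(delay);  // wait before the next attempt
        delay *= 2;                          // simple exponential back-off
    }
    throw std::runtime_error("fopen() failed after retries: " + path);
}
```

If the failures on the ERA RAID are transient (e.g. momentary device busy states), such a wrapper would mask them; if the error is persistent, it would at least fail later with the path in the message.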

@kbi-user
Author

kbi-user commented Aug 30, 2021

deleted since replacement test disk was full (OS Samsung 980 Pro)

@madMAx43v3r
Owner

so problem solved?

@kbi-user
Author

kbi-user commented Aug 30, 2021

so problem solved?

No.

What I did: a test run on a physical NVMe (Samsung 980 Pro).

Result: completed successfully:
Phase 4 took 68.4926 sec, final plot size is 108807737395 bytes
Total plot creation time was 1370.53 sec (22.8421 min)

Overall too slow; system load dropped below 20% at times. So yes, I want to use the ERA RAID, which should be possible. It's tested and works perfectly fine without any hiccup - even when creating 50+ plots simultaneously the classic way.


What I did: checked several options by removing any limitations (which shouldn't be a problem; I removed them nonetheless) - the problem persists on the ERA RAID. I opened a support call with RAIDIX.

What I ask from you, if possible: look into where this issue could result from and point out where it may come from (to support me in my RAIDIX call).
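One thing that would help pinpoint the cause for the vendor: the bare "fopen() failed" message discards errno, which says *why* the open failed (EMFILE, EIO, ENOSPC, ENOENT, ...). A hypothetical diagnostic wrapper (names are made up, not the plotter's actual code) could preserve it:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical sketch: include errno's description in the exception message,
// so a failing fopen() reveals whether it was a missing path, an I/O error
// from the device, a file-descriptor limit, etc.
FILE* fopen_or_explain(const std::string& path, const char* mode) {
    FILE* f = std::fopen(path.c_str(), mode);
    if (!f) {
        const int err = errno;  // capture before anything else can clobber it
        throw std::runtime_error("fopen('" + path + "', '" + mode
                                 + "') failed: " + std::strerror(err));
    }
    return f;
}
```

With a message like that in the core dump or log, RAIDIX support could correlate the exact errno with their driver's behavior at that timestamp.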


I am aware this is a rather uncommon issue, since it involves a software RAID from an HPC solution. Yet when talking raw speed, I figure you may be really interested in making it work.

@kbi-user
Author

Additional remark: as long as your chia-plotter is running on the ERA RAID, I saw physical writes cap at ~15+ GB/s. There is still room for more, since that high write load only occurs for roughly 1-2 seconds out of every 10 seconds. Comparing the RAID with the slow 980 Pro almost hurts... when being unable to use it.

@kbi-user
Author

kbi-user commented Aug 30, 2021

From RAIDIX support:

Yes I see periodical core dump from chia_plot:
Aug 30 12:30:13 [servername] systemd[1]: Started Process Core Dump (PID 2744583/UID 0).
Aug 30 12:30:14 [servername] systemd-coredump[2744584]: Core file was truncated to 2147483648 bytes.
Aug 30 12:30:20 [servername] systemd-coredump[2744584]: Process 2737972 (chia_plot) of user [UID] dumped core.

                                                              Stack trace of thread 246:
                                                               #0  0x00007f7d814d018b n/a (n/a)

Aug 30 12:30:20 [servername] systemd[1]: systemd-coredump@11-2744583-0.service: Succeeded.
But nothing from ERA in the logs at that time. Maybe the additional information from support you're talking about will help us.

Any idea?

@kbi-user
Author

kbi-user commented Aug 31, 2021

@madMAx43v3r any chance you could support me in my call with RAIDIX?

In case I didn't state it clearly: the ERA RAID is fine with everything else, even the heaviest load over prolonged periods (10h+). I verified it again myself. Cooling is fine too (well below 60°C under heavy load).

@kbi-user
Author

More information:

Usually the fopen error comes up once a phase is finished - sometimes after phase 1, last time after phase 2:

Phase 2 took 302.038 sec
Wrote plot header with 268 bytes
terminate called after throwing an instance of 'std::runtime_error'
what(): fopen() failed
