
Many disks are shutdown due to heavy workload #4779

Closed
polyrabbit opened this issue Apr 25, 2024 · 7 comments
Labels
kind/bug Something isn't working

Comments

@polyrabbit
Contributor

What happened:
During a heavy IO workload, we observed more than 280 pending write requests to the cache store, which overwhelmed the disk and caused slow IO. The bad-disk detector then shut the cache store down forever, even though the disk recovered shortly after. Worse, we lost the read cache for good.

(screenshot of the pending write requests metric attached)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version:
    1.2-beta1
polyrabbit added the kind/bug label Apr 25, 2024
@jiefenghuang
Contributor

jiefenghuang commented Apr 25, 2024

#4749
In that PR we added a fault-tolerance range to reduce misjudgments, which matches your scenario. If needed, maxIOErrors can be made configurable.
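For illustration, a minimal sketch of what such a fault-tolerance range could look like; all identifiers (cacheStore, maxIOErrors, reportIOResult) are hypothetical and not the actual JuiceFS implementation. A store is only taken offline after a run of consecutive IO errors exceeds the threshold, and a successful IO resets the counter:

```go
// Minimal sketch of a fault-tolerance range for cache-store IO errors.
// All identifiers are illustrative; this is not the JuiceFS code.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type cacheStore struct {
	ioErrCnt    atomic.Int64 // consecutive IO errors observed
	maxIOErrors int64        // tolerance range before the store is disabled
	unavailable atomic.Bool
}

// reportIOResult is called after each cache read/write; only a run of
// errors longer than maxIOErrors takes the store offline.
func (s *cacheStore) reportIOResult(err error) {
	if err == nil {
		s.ioErrCnt.Store(0) // a successful IO resets the counter
		return
	}
	if s.ioErrCnt.Add(1) > s.maxIOErrors {
		s.unavailable.Store(true)
	}
}

func main() {
	s := &cacheStore{maxIOErrors: 3}
	for i := 0; i < 5; i++ {
		s.reportIOResult(errors.New("disk timeout"))
	}
	fmt.Println("unavailable:", s.unavailable.Load()) // true only after the threshold is crossed
}
```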

@polyrabbit
Contributor Author

Thanks, but we already included this commit in our build:
(screenshot of the commit present in our build)

@jiefenghuang
Contributor

For this write-heavy scenario, turning on maxStageWrite may be better.

@polyrabbit
Contributor Author

When one disk cache store is shut down, will we lose the staged files in that store?

@jiefenghuang
Contributor

The files still exist.

@jiefenghuang
Contributor

Conclusion: there are two scenarios to consider. One is long-term high-load writing, for which writing directly to object storage is recommended (the resource bottleneck is the disk cache or the object-storage bandwidth). The other is short-term high-load writing, where a concurrency limit can redirect the excess writes to object storage. This issue will be closed for now.
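For illustration, a rough sketch of the second scenario, assuming a maxStageWrite-style limit (all names here are hypothetical, not the JuiceFS code): while the number of in-flight staged writes is below the limit, blocks are staged on the local disk cache; once staging is saturated, further writes go straight to object storage.

```go
// Illustrative sketch of redirecting writes to object storage when local
// staging is saturated; the channel capacity plays the role of maxStageWrite.
package main

import "fmt"

type writeback struct {
	staging      chan struct{}            // capacity = max concurrent staged writes
	stageToDisk  func(block []byte) error // write into the local disk cache
	uploadDirect func(block []byte) error // write straight to object storage
}

func (w *writeback) write(block []byte) error {
	select {
	case w.staging <- struct{}{}: // a staging slot is free
		defer func() { <-w.staging }()
		return w.stageToDisk(block)
	default: // staging saturated: bypass the disk cache
		return w.uploadDirect(block)
	}
}

func main() {
	w := &writeback{
		staging:      make(chan struct{}, 10),
		stageToDisk:  func([]byte) error { fmt.Println("staged to disk cache"); return nil },
		uploadDirect: func([]byte) error { fmt.Println("uploaded to object storage"); return nil },
	}
	_ = w.write(make([]byte, 4<<20))
}
```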

@davies
Contributor

davies commented Apr 26, 2024

We should enable max-stage-write for this issue, maybe 10 as the default value.
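If this ships as a mount option, enabling it would look roughly like `juicefs mount --max-stage-write=10 META-URL MOUNTPOINT`; the exact flag spelling and placeholders here are assumptions based on the option name in this thread, so check `juicefs mount --help` on your version.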
