
Many disks are shutdown due to heavy workload #4779

Closed
polyrabbit opened this issue Apr 25, 2024 · 7 comments
Labels
kind/bug Something isn't working

Comments

@polyrabbit
Contributor

What happened:
During a heavy IO workload, we observed more than 280 pending write requests to the cache store, which overwhelmed the disk and caused slow IO. The bad-disk detector then shut the cache store down forever, even though the disk recovered shortly after. Worse, we lost the read cache for good.

(screenshot of the pending write requests metric attached)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version:
    1.2-beta1
polyrabbit added the kind/bug label Apr 25, 2024
@jiefenghuang
Contributor

jiefenghuang commented Apr 25, 2024

#4749
In that PR we added a fault-tolerance range to reduce misjudgments, which matches your scenario. If needed, maxIOErrors can be made configurable.
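For illustration, a minimal sketch of what such a fault-tolerance range could look like; all identifiers (cacheStore, maxIOErrors, reportIOResult) are hypothetical and not the actual JuiceFS implementation. A store is only taken offline after a run of consecutive IO errors exceeds the threshold, and a successful IO resets the counter:

```go
// Minimal sketch of a fault-tolerance range for cache-store IO errors.
// All identifiers are illustrative; this is not the JuiceFS code.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type cacheStore struct {
	ioErrCnt    atomic.Int64 // consecutive IO errors observed
	maxIOErrors int64        // tolerance range before the store is disabled
	unavailable atomic.Bool
}

// reportIOResult is called after each cache read/write; only a run of
// errors longer than maxIOErrors takes the store offline.
func (s *cacheStore) reportIOResult(err error) {
	if err == nil {
		s.ioErrCnt.Store(0) // a successful IO resets the counter
		return
	}
	if s.ioErrCnt.Add(1) > s.maxIOErrors {
		s.unavailable.Store(true)
	}
}

func main() {
	s := &cacheStore{maxIOErrors: 3}
	for i := 0; i < 5; i++ {
		s.reportIOResult(errors.New("disk timeout"))
	}
	fmt.Println("unavailable:", s.unavailable.Load()) // true only after the threshold is crossed
}
```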

@polyrabbit
Contributor Author

Thanks, but we already included this commit in our build:
(screenshot of the commit present in our build)

@jiefenghuang
Contributor

For this write-heavy scenario, turning on maxStageWrite may be better.

@polyrabbit
Contributor Author

When one disk cache store is shut down, will we lose the staged files in that store?

@jiefenghuang
Contributor

The files still exist.

@jiefenghuang
Contributor

Conclusion: there are two scenarios to consider. One is long-term high-load writing, for which writing directly to object storage is recommended (the resource bottleneck is the disk cache or the object-storage bandwidth). The other is short-term high-load writing, where a concurrency limit can redirect the excess writes to object storage. This issue will be closed for now.
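For illustration, a rough sketch of the second scenario, assuming a maxStageWrite-style limit (all names here are hypothetical, not the JuiceFS code): while the number of in-flight staged writes is below the limit, blocks are staged on the local disk cache; once staging is saturated, further writes go straight to object storage.

```go
// Illustrative sketch of redirecting writes to object storage when local
// staging is saturated; the channel capacity plays the role of maxStageWrite.
package main

import "fmt"

type writeback struct {
	staging      chan struct{}            // capacity = max concurrent staged writes
	stageToDisk  func(block []byte) error // write into the local disk cache
	uploadDirect func(block []byte) error // write straight to object storage
}

func (w *writeback) write(block []byte) error {
	select {
	case w.staging <- struct{}{}: // a staging slot is free
		defer func() { <-w.staging }()
		return w.stageToDisk(block)
	default: // staging saturated: bypass the disk cache
		return w.uploadDirect(block)
	}
}

func main() {
	w := &writeback{
		staging:      make(chan struct{}, 10),
		stageToDisk:  func([]byte) error { fmt.Println("staged to disk cache"); return nil },
		uploadDirect: func([]byte) error { fmt.Println("uploaded to object storage"); return nil },
	}
	_ = w.write(make([]byte, 4<<20))
}
```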

@davies
Contributor

davies commented Apr 26, 2024

We should enable max-stage-write for this issue, maybe 10 as the default value.
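If this ships as a mount option, enabling it would look roughly like `juicefs mount --max-stage-write=10 META-URL MOUNTPOINT`; the exact flag spelling and placeholders here are assumptions based on the option name in this thread, so check `juicefs mount --help` on your version.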
