Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not cap attempts to flush with retries, should always attempt to write out #815

Merged
merged 2 commits into from
Jul 31, 2018

Conversation

robskillington
Copy link
Collaborator

This change avoids potentially returning false from NeedsFlush(...) for a namespace when max retries have been exceeded for flushing.

There are two faults with this behavior:

  1. If false returns from NeedsFlush(...) due to retries being exceeded for a given block time start, the cleanup code may believe that it was successful and can cleanup commit logs and snapshots - hence losing data on the local node.
  2. Naturally if disk was not writeable for some time, we'd ideally like it to just natively recover and flush out blocks that haven't been flushed yet, rather than max retries being hit and then having to restart nodes to have them flush out their data.

@codecov
Copy link

codecov bot commented Jul 31, 2018

Codecov Report

Merging #815 into master will decrease coverage by 0.12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #815      +/-   ##
==========================================
- Coverage   78.08%   77.96%   -0.13%     
==========================================
  Files         368      368              
  Lines       31813    31802      -11     
==========================================
- Hits        24842    24794      -48     
- Misses       5302     5325      +23     
- Partials     1669     1683      +14
Flag Coverage Δ
#coordinator 60.86% <ø> (-0.12%) ⬇️
#dbnode 81.41% <100%> (-0.14%) ⬇️
#m3ninx 72.7% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2392fda...846a7a6. Read the comment docs.

@nerd0
Copy link
Contributor

nerd0 commented Jul 31, 2018

Very naive question, so the operator would have to know that a restart would address a process infinitely retrying the flush?

@robskillington
Copy link
Collaborator Author

robskillington commented Jul 31, 2018

@nerd0: Prior to this change, that was the status quo - if it gave up it wouldn't try again until restarted. With this change it'll naturally recover as it will continue to retry until successful.

Errors causing flushes is going to be due to some filesystem issue (like not correctly setting permissions on directory, etc) so auto-recovery is desirable here.

Copy link
Collaborator

@prateek prateek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robskillington robskillington merged commit d80e97f into master Jul 31, 2018
@robskillington robskillington deleted the r/remove-max-flush-retries branch July 31, 2018 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants