Do not cap attempts to flush with retries, should always attempt to write out #815

robskillington · 2018-07-31T15:35:07Z

This change avoids potentially returning false from NeedsFlush(...) for a namespace when max retries have been exceeded for flushing.

There are two faults with the previous behavior:

1. If NeedsFlush(...) returns false because retries were exceeded for a given block start time, the cleanup code may believe that the flush was successful and clean up commit logs and snapshots, hence losing data on the local node.
2. If the disk was not writeable for some time, we'd ideally like the node to recover on its own and flush out blocks that haven't been flushed yet, rather than hitting max retries and then requiring a node restart to flush out its data.
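
To make the first fault concrete, here is a minimal sketch of the two predicates. It is not the actual M3DB code: `blockState`, `needsFlushCapped`, `needsFlushUncapped`, and `safeToCleanup` are hypothetical names, and the real `NeedsFlush(...)` takes different arguments.

```go
package main

import "fmt"

// flushStatus is a hypothetical per-block flush status (illustration only).
type flushStatus int

const (
	flushNotStarted flushStatus = iota
	flushFailed
	flushSuccess
)

// blockState is a hypothetical record of flush progress for one block start.
type blockState struct {
	status      flushStatus
	numFailures int
}

// needsFlushCapped models the previous behavior: once failures reach
// maxRetries the block is reported as not needing a flush, even though
// nothing was written to disk.
func needsFlushCapped(s blockState, maxRetries int) bool {
	if s.status == flushSuccess {
		return false
	}
	return s.numFailures < maxRetries
}

// needsFlushUncapped models the behavior after this change: only a
// successful flush clears the need to flush.
func needsFlushUncapped(s blockState) bool {
	return s.status != flushSuccess
}

// safeToCleanup stands in for the cleanup decision: commit logs and
// snapshots may be removed only when nothing still needs flushing.
func safeToCleanup(needsFlush bool) bool {
	return !needsFlush
}

func main() {
	// A block that has failed to flush three times and was never written.
	s := blockState{status: flushFailed, numFailures: 3}

	fmt.Println("capped: cleanup allowed =", safeToCleanup(needsFlushCapped(s, 3)))
	// prints "capped: cleanup allowed = true" (data would be lost)

	fmt.Println("uncapped: cleanup allowed =", safeToCleanup(needsFlushUncapped(s)))
	// prints "uncapped: cleanup allowed = false"
}
```

With the cap in place, a block that was never written becomes indistinguishable from a flushed one as far as cleanup is concerned, which is the data-loss path described above.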


codecov · 2018-07-31T16:16:51Z

Codecov Report

Merging #815 into master will decrease coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #815      +/-   ##
==========================================
- Coverage   78.08%   77.96%   -0.13%     
==========================================
  Files         368      368              
  Lines       31813    31802      -11     
==========================================
- Hits        24842    24794      -48     
- Misses       5302     5325      +23     
- Partials     1669     1683      +14

| Flag | Coverage Δ | |
|---|---|---|
| #coordinator | `60.86% <ø> (-0.12%)` | ⬇️ |
| #dbnode | `81.41% <100%> (-0.14%)` | ⬇️ |
| #m3ninx | `72.7% <ø> (ø)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2392fda...846a7a6. Read the comment docs.

nerd0 · 2018-07-31T16:20:08Z

Very naive question, so the operator would have to know that a restart would address a process infinitely retrying the flush?

robskillington · 2018-07-31T16:29:58Z

@nerd0: Prior to this change, that was the status quo - if it gave up it wouldn't try again until restarted. With this change it'll naturally recover as it will continue to retry until successful.

Errors when flushing are going to be due to some filesystem issue (such as permissions not being set correctly on a directory), so auto-recovery is desirable here.
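
As a rough illustration of that auto-recovery, assume a periodic tick that retries any block still pending. The `flakyDisk` type and `tickUntilFlushed` function are hypothetical and only simulate a disk that becomes writeable again; they are not part of M3DB.

```go
package main

import (
	"errors"
	"fmt"
)

// errDiskNotWritable stands in for a filesystem error such as wrong
// directory permissions (hypothetical, for illustration).
var errDiskNotWritable = errors.New("disk not writable")

// flakyDisk is a hypothetical writer that fails for the first
// healthyAfter attempts and succeeds afterwards.
type flakyDisk struct {
	attempts     int
	healthyAfter int
}

func (d *flakyDisk) write() error {
	d.attempts++
	if d.attempts <= d.healthyAfter {
		return errDiskNotWritable
	}
	return nil
}

// tickUntilFlushed simulates up to maxTicks periodic ticks. On each tick
// the pending block is retried with no retry cap. It returns the tick on
// which the flush succeeded, or 0 if it never succeeded within maxTicks.
func tickUntilFlushed(d *flakyDisk, maxTicks int) int {
	for tick := 1; tick <= maxTicks; tick++ {
		if err := d.write(); err != nil {
			continue // block stays pending and is retried on the next tick
		}
		return tick
	}
	return 0
}

func main() {
	// The disk is fixed (for example, permissions corrected) after 5 failed attempts.
	d := &flakyDisk{healthyAfter: 5}
	fmt.Println("flushed on tick", tickUntilFlushed(d, 10))
	// prints "flushed on tick 6"
}
```

No restart is involved: once the underlying filesystem problem is fixed, the next tick writes the block out.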

prateek

LGTM

Rob Skillington added 2 commits July 31, 2018 11:23

Do not cap attempts to flush with flush retries, should always attemp…

8bc6584

…t to write them out

Fix tests

846a7a6

prateek approved these changes Jul 31, 2018

robskillington merged commit d80e97f into master Jul 31, 2018

robskillington deleted the r/remove-max-flush-retries branch July 31, 2018 16:37
