Add cleanup logic for corrupted index filesets #3430

notbdu · 2021-04-19T05:17:56Z

What this PR does / why we need it:

Adds cleanup logic for corrupted index filesets.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

src/dbnode/storage/index.go

linasm · 2021-04-27T06:16:38Z

src/dbnode/persist/fs/files.go

+		infoFilePath, ok := corrupted.InfoFilePath()
+		if !ok {
+			fn(corrupted, nil, true)
+			return
+		}
+		infoData, err := read(infoFilePath)
+		if err != nil {
+			// NB: If no info data is supplied, we assume that the
+			// info file itself is corrupted. Since this is the
+			// first file written to disk, this should be safe to remove.
+			fn(corrupted, nil, true)
+			return
+		}


IMHO to preserve these ordering guarantees (that the info file is the first to be persisted), we may need to fsync the info file (a File.Sync() call) immediately after the first write.

Do we? The info file is the first file we attempt to write. Shouldn't the OS flush the info file's buffered pages (from the first write) to disk before flushing buffered pages for other files that we've written later? Or does this happen out of order?

I have seen in practice a corrupt fileset where the checkpoint file was fully persisted but some other files were incomplete, even though the checkpoint file is the last to be written and closed.
So I think that without fsyncing the info file we can end up in a situation where the first part of the info file has been written, we continue writing the other files, the cleanup process picks up this fileset but does not see the info file yet, and so it deletes the other files.

vpranckaitis

I have some concerns about reading info files which haven't yet been fully written (see comments). Though I'm not sure if it's something to worry about.

src/dbnode/storage/index.go

vpranckaitis · 2021-04-28T12:25:11Z

src/dbnode/storage/index.go

+		if file.Info.BlockStart == 0 {
+			// Mark filesets w/ corrupted index info files for deletion right away.
+			toDelete = append(toDelete, file.AbsoluteFilePaths...)
+			continue
+		}


I wonder if it is possible that the info file will be read after it was created, but before the contents were written? Not sure about filesystem atomicity guarantees, but at least WriteFile() implementation [1, 2] hints that there might be a gap.

A comment above says that we intentionally skip the latest block start. Could we accidentally violate this by reading an info file which hasn't been written yet and assuming it's corrupted? If that file belonged to the latest block start, we wouldn't even know.

Hmm yea, I'm not sure - maybe @robskillington can comment on this.

We could err on the side of caution and not attempt to delete filesets w/ a potentially corrupt index info file.

codecov · 2021-04-29T05:41:59Z

Codecov Report

Merging #3430 (98c8f25) into master (93aef73) will decrease coverage by 0.0%.
The diff coverage is 65.3%.

@@            Coverage Diff            @@
##           master    #3430     +/-   ##
=========================================
- Coverage    72.4%    72.4%   -0.1%     
=========================================
  Files        1100     1100             
  Lines      102736   102830     +94     
=========================================
+ Hits        74479    74529     +50     
- Misses      23158    23196     +38     
- Partials     5099     5105      +6

| Flag | Coverage Δ | |
|---|---|---|
| aggregator | `76.9% <ø> (-0.1%)` | ⬇️ |
| cluster | `84.9% <ø> (-0.1%)` | ⬇️ |
| collector | `84.3% <ø> (ø)` | |
| dbnode | `79.0% <65.3%> (-0.1%)` | ⬇️ |
| m3em | `74.4% <ø> (ø)` | |
| m3ninx | `73.6% <ø> (-0.1%)` | ⬇️ |
| metrics | `19.7% <ø> (ø)` | |
| msg | `74.5% <ø> (-0.2%)` | ⬇️ |
| query | `67.1% <ø> (ø)` | |
| x | `80.5% <ø> (+0.1%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

linasm · 2021-04-29T12:56:00Z

src/dbnode/persist/fs/files.go

+		infoFilePath, ok := corrupted.InfoFilePath()
+		if !ok {
+			fn(corrupted, nil, true)
+			return
+		}
+		infoData, err := read(infoFilePath)
+		if err != nil {
+			// NB: If no info data is supplied, we assume that the
+			// info file itself is corrupted. Since this is the
+			// first file written to disk, this should be safe to remove.
+			fn(corrupted, nil, true)
+			return
+		}


I have seen in practice a corrupt fileset where the checkpoint file was fully persisted but some other files were incomplete, even though the checkpoint file is the last to be written and closed.
So I think that without fsyncing the info file we can end up in a situation where the first part of the info file has been written, we continue writing the other files, the cleanup process picks up this fileset but does not see the info file yet, and so it deletes the other files.

linasm · 2021-04-29T13:09:09Z

src/dbnode/storage/cleanup.go

+			"encountered errors when cleaning up index files for %v: %w", t, err))
+	}
+
+	if err := m.cleanupCorruptedIndexFiles(namespaces); err != nil {


This makes me wonder why the cleanup process is called from WarmFlushCleanup. It would clean up cold flush index filesets as well, right? Also, from what I can see it covers all namespaces, including computed namespaces which perform no warm/cold flushing, which makes the code somewhat confusing. Perhaps some refactoring/naming improvement could be useful here.

It's not just this fn, all of the index file cleanup fns are the same. We only clean up data files in the cold flush loop and index files in the warm flush loop. IIRC we only have cold flush cleanup because it was not safe to clean up data files in the warm flush loop (can't remember the exact reason). Otherwise, we would've done all cleanup in the warm flush loop.

linasm · 2021-04-29T13:17:03Z

src/dbnode/storage/index.go

 	)
+	// NB: Info files should be ordered by block start.


Nit: maybe worth adding an invariant check in the loop to guarantee this?

a3f7f73#diff-6fedd1c6b4deb503fca7c3a8a8d2fbedf6abeeccfcbc90d2d239a6c72cb12199R2128-R2140

src/dbnode/storage/index.go

linasm · 2021-04-29T13:21:18Z

src/dbnode/storage/index.go

+		// This intentionally skips the latest block start as that's the active block.
+		if file.Info.BlockStart > latestBlockStart {


But for cold flush (or computed namespace), active block may not necessarily be the latest one?

Hmm, maybe I should remove this comment, it's a little confusing. It's fine even if we iterate over the active block, or any block that we're actively writing to for that matter (we check against the latest vol idx).

I had the comment for why we're not running the cleanup logic for all blocks. Would need to wrap it in a fn and run it outside of the loop to cover all blocks. But running against the active block is essentially a no-op so I put the comment in.
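
One possible reading of "we check against latest vol idx" is sketched below. It assumes a corrupted fileset is only deleted when a strictly newer volume index exists for the same block start, which would make the check a no-op for a fileset that is still being written as the latest volume. The types and `selectStaleCorrupted` are hypothetical, and the actual condition in the PR may differ.

```go
package main

import "fmt"

// corruptedFileset is a hypothetical, simplified description of a
// corrupted index fileset.
type corruptedFileset struct {
	BlockStart  int64
	VolumeIndex int
	Paths       []string
}

// selectStaleCorrupted returns the paths of corrupted filesets that are
// superseded by a newer volume for the same block start. latestVolume maps
// a block start to its latest known volume index (an assumed input).
func selectStaleCorrupted(corrupted []corruptedFileset, latestVolume map[int64]int) []string {
	var toDelete []string
	for _, c := range corrupted {
		latest, ok := latestVolume[c.BlockStart]
		if !ok || c.VolumeIndex >= latest {
			// Unknown block or latest volume: possibly still being written, keep it.
			continue
		}
		toDelete = append(toDelete, c.Paths...)
	}
	return toDelete
}

func main() {
	corrupted := []corruptedFileset{
		{BlockStart: 100, VolumeIndex: 0, Paths: []string{"/data/100-0-info.db"}},
		{BlockStart: 100, VolumeIndex: 2, Paths: []string{"/data/100-2-info.db"}},
		{BlockStart: 200, VolumeIndex: 0, Paths: []string{"/data/200-0-info.db"}},
	}
	latest := map[int64]int{100: 2}

	fmt.Println(selectStaleCorrupted(corrupted, latest))
}
```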

codecov · 2021-04-30T05:10:19Z

Codecov Report

Merging #3430 (d395a4e) into master (d395a4e) will not change coverage.
The diff coverage is n/a.

❗ Current head d395a4e differs from pull request most recent head beb5139. Consider uploading reports for the commit beb5139 to get more accurate results

@@          Coverage Diff           @@
##           master   #3430   +/-   ##
======================================
  Coverage    56.3%   56.3%           
======================================
  Files         548     548           
  Lines       61902   61902           
======================================
  Hits        34898   34898           
  Misses      23886   23886           
  Partials     3118    3118

| Flag | Coverage Δ |
|---|---|
| aggregator | `57.3% <0.0%> (ø)` |
| cluster | `∅ <0.0%> (∅)` |
| collector | `54.3% <0.0%> (ø)` |
| dbnode | `60.9% <0.0%> (ø)` |
| m3em | `46.4% <0.0%> (ø)` |
| metrics | `19.8% <0.0%> (ø)` |
| msg | `74.7% <0.0%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

…eam corruptions

Co-authored-by: Vilius Pranckaitis <vpranckaitis@gmail.com>

Co-authored-by: Linas Medžiūnas <linasm@users.noreply.github.com>

src/dbnode/persist/fs/index_write.go

# Conflicts:
#	src/dbnode/storage/cleanup_test.go
#	src/dbnode/storage/index.go

# Conflicts:
#	src/dbnode/persist/fs/files.go
#	src/dbnode/storage/cleanup.go
#	src/dbnode/storage/index.go
#	src/dbnode/storage/index_test.go

I'm taking over the work on this PR

linasm · 2021-05-13T06:08:58Z