Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cleanup logic for corrupted index filesets #3430

Merged
merged 29 commits into from
May 20, 2021
Merged

Conversation

notbdu
Copy link
Contributor

@notbdu notbdu commented Apr 19, 2021

What this PR does / why we need it:

Adds cleanup logic for corrupted index filesets.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

@notbdu notbdu changed the title [WIP] Add cleanup logic for corrupted index filesets Add cleanup logic for corrupted index filesets Apr 27, 2021
src/dbnode/storage/index.go Outdated Show resolved Hide resolved
src/dbnode/storage/index.go Outdated Show resolved Hide resolved
Comment on lines +730 to +742
infoFilePath, ok := corrupted.InfoFilePath()
if !ok {
fn(corrupted, nil, true)
return
}
infoData, err := read(infoFilePath)
if err != nil {
// NB: If no info data is supplied, we assume that the
// info file itself is corrupted. Since this is the
// first file written to disk, this should be safe to remove.
fn(corrupted, nil, true)
return
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO to preserve these ordering guarantees (that info file is the first to be persisted), we may need to fsync (File.Sync() call) info file immediately after the first write.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we? The info file is the first file we attempt to write. Shouldn't the OS flush buffered info file (from first write) pages to disk first before flushing other buffered pages for other files to disk that we've written later? Or does this happen out of order?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen in practice a corrupt fileset where checkpoint file was fully persisted, but some other files were incomplete, even though checkpoint file is the last to be written and closed.
So I think that without fsyncing info file we can end up in a situation where the first part of info file has been written, we continue with further writing of other files, cleanup process picks up this fileset, but it does not see the info file yet and so it deletes other files.

@vpranckaitis vpranckaitis self-requested a review April 28, 2021 06:09
vpranckaitis
vpranckaitis previously approved these changes Apr 28, 2021
Copy link
Collaborator

@vpranckaitis vpranckaitis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about reading info files which haven't yet been fully written (see comments). Though I'm not sure if it's something to worry about.

src/dbnode/storage/index.go Outdated Show resolved Hide resolved
Comment on lines 2153 to 2156
if file.Info.BlockStart == 0 {
// Mark filesets w/ corrupted index info files for deletion right away.
toDelete = append(toDelete, file.AbsoluteFilePaths...)
continue
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is possible that the info file will be read after it was created, but before the contents were written? Not sure about filesystem atomicity guarantees, but at least WriteFile() implementation [1, 2] hints that there might be a gap.

A comment above says that we intentionally skip latest block start. Could we accidentally violate this by reading info file which haven't been written yet and assuming it's corrupted? If that file belonged to the latest block start, we wouldn't even know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yea, I'm not sure - maybe @robskillington can comment on this.

We could err on the side of caution and not attempt to delete filesets w/ a potentially corrupt index info file.

@codecov
Copy link

codecov bot commented Apr 29, 2021

Codecov Report

Merging #3430 (98c8f25) into master (93aef73) will decrease coverage by 0.0%.
The diff coverage is 65.3%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #3430     +/-   ##
=========================================
- Coverage    72.4%    72.4%   -0.1%     
=========================================
  Files        1100     1100             
  Lines      102736   102830     +94     
=========================================
+ Hits        74479    74529     +50     
- Misses      23158    23196     +38     
- Partials     5099     5105      +6     
Flag Coverage Δ
aggregator 76.9% <ø> (-0.1%) ⬇️
cluster 84.9% <ø> (-0.1%) ⬇️
collector 84.3% <ø> (ø)
dbnode 79.0% <65.3%> (-0.1%) ⬇️
m3em 74.4% <ø> (ø)
m3ninx 73.6% <ø> (-0.1%) ⬇️
metrics 19.7% <ø> (ø)
msg 74.5% <ø> (-0.2%) ⬇️
query 67.1% <ø> (ø)
x 80.5% <ø> (+0.1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93aef73...98c8f25. Read the comment docs.

Comment on lines +730 to +742
infoFilePath, ok := corrupted.InfoFilePath()
if !ok {
fn(corrupted, nil, true)
return
}
infoData, err := read(infoFilePath)
if err != nil {
// NB: If no info data is supplied, we assume that the
// info file itself is corrupted. Since this is the
// first file written to disk, this should be safe to remove.
fn(corrupted, nil, true)
return
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen in practice a corrupt fileset where checkpoint file was fully persisted, but some other files were incomplete, even though checkpoint file is the last to be written and closed.
So I think that without fsyncing info file we can end up in a situation where the first part of info file has been written, we continue with further writing of other files, cleanup process picks up this fileset, but it does not see the info file yet and so it deletes other files.

"encountered errors when cleaning up index files for %v: %w", t, err))
}

if err := m.cleanupCorruptedIndexFiles(namespaces); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think why does is the cleanup process called from WarmFlushCleanup. It would clean up cold flush index filesets as well, right? Also, from what I can see it covers all namespaces, including computed namespaces which perform no warm/cold flushing. Which makes the code somewhat confusing. Perhaps some refactoring/naming improvement could be useful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just this fn, all of the index file cleanup fns are the same. We only cleanup data files in the cold flush loop and index files in the warm flush loop. IIRC we only have cold flush cleanup because it was not safe to cleanup data files in the warm flush loop (can't remember the exact reason). Otherwise, we would've done all cleanup in the warm flush loop.

)
// NB: Info files should be ordered by block start.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe worth adding an invariant check in the loop to guarantee this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/dbnode/storage/index.go Outdated Show resolved Hide resolved
Comment on lines 2127 to 2128
// This intentionally skips the latest block start as that's the active block.
if file.Info.BlockStart > latestBlockStart {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for cold flush (or computed namespace), active block may not necessarily be the latest one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe I should remove this comment - it's a little confusing, it's fine even if we iterate over the active block or any block for that matter that we're actively writing to (we check against latest vol idx).

I had the comment for why we're not running the cleanup logic for all blocks. Would need to wrap it in a fn and run it outside of the loop to cover all blocks. But running against the active block is essentially a no-op so I put the comment in.

@codecov
Copy link

codecov bot commented Apr 30, 2021

Codecov Report

Merging #3430 (d395a4e) into master (d395a4e) will not change coverage.
The diff coverage is n/a.

❗ Current head d395a4e differs from pull request most recent head beb5139. Consider uploading reports for the commit beb5139 to get more accurate results

Impacted file tree graph

@@          Coverage Diff           @@
##           master   #3430   +/-   ##
======================================
  Coverage    56.3%   56.3%           
======================================
  Files         548     548           
  Lines       61902   61902           
======================================
  Hits        34898   34898           
  Misses      23886   23886           
  Partials     3118    3118           
Flag Coverage Δ
aggregator 57.3% <0.0%> (ø)
cluster ∅ <0.0%> (∅)
collector 54.3% <0.0%> (ø)
dbnode 60.9% <0.0%> (ø)
m3em 46.4% <0.0%> (ø)
metrics 19.8% <0.0%> (ø)
msg 74.7% <0.0%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d395a4e...beb5139. Read the comment docs.

@linasm linasm self-assigned this May 5, 2021
# Conflicts:
#	src/dbnode/storage/cleanup_test.go
#	src/dbnode/storage/index.go
# Conflicts:
#	src/dbnode/persist/fs/files.go
#	src/dbnode/storage/cleanup.go
#	src/dbnode/storage/index.go
#	src/dbnode/storage/index_test.go
@vpranckaitis vpranckaitis dismissed their stale review May 11, 2021 13:40

I'm taking over the work on this PR

@vpranckaitis vpranckaitis requested a review from linasm May 12, 2021 10:59
@vpranckaitis vpranckaitis self-assigned this May 13, 2021
opts := NewTestOptions(t).
SetNamespaces([]namespace.Metadata{ns})

// Test setup
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this comment appears twice (and in general doesn't look like it is adding value).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 57 to 58
SetRetentionOptions(nsROpts).SetIndexOptions(
namespace.NewIndexOptions().SetBlockSize(idxBlockSize).SetEnabled(true)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe try formatting these Sets differently for easier readability.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReaderBufferSize: setup.FilesystemOpts().InfoReaderBufferSize(),
})
require.NoError(t, err)
require.NotEmpty(t, filesets)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check the len(filesets) here instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace: ns.ID(),
ReaderBufferSize: setup.FilesystemOpts().InfoReaderBufferSize(),
})
require.NoError(t, err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no error returned by the code above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeIndexFileSetFiles(t, setup.StorageOpts(), ns, filesetsIdentifiers)

// Corrupt some of the written filesets.
deltaNow := now.Add(time.Minute)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you move this declaration closer to where it is being used (for readability)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 1421 to 1427
stat, err := os.Stat(filePath)
if err != nil {
return nil, err
}

size := int(stat.Size())
buf := make([]byte, size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this part of code is exactly the same as in readAndValidate. Might be worth extracting something like func bufferForEntireFile(filePath string) ([]byte, error)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

// Check for invariants.
for j := 0; j+1 < len(filesets); j++ {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this loop would be more readable in this form:

Suggested change
for j := 0; j+1 < len(filesets); j++ {
for j := 1; j < len(filesets); j++ {

And then adjusting the indexes in the loop body accordingly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for j := len(filesets) - 1; j >= 0; j-- {
f := filesets[j]

// NB: If the fileset info fields contains conflicting information, it means that info file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific about what you call "conflicting information" here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded it to "inconsistent". E.g. block start is written in filename and inside info file, and if those don't match, we might not be able to trust the other information, too.

bca22e4#diff-6fedd1c6b4deb503fca7c3a8a8d2fbedf6abeeccfcbc90d2d239a6c72cb12199R2149-R2150
92966b6#diff-6fedd1c6b4deb503fca7c3a8a8d2fbedf6abeeccfcbc90d2d239a6c72cb12199R2149-R2155

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though maybe it would be better return more info from ReadIndexInfoFiles() itself? Instead of simply returning corrupted bool, we could make it an enum and describe the type of corruption. There seems to be a handful corruption modes that we care about:

  • not corrupted – volume type known;
  • info file is missing / can't be parsed – volume type unknown;
  • info file is fine, though some other file is corrupted – volume type known.

// The most recent volume is excluded because it is more likely to be actively written to.
// If info file writes are not atomic, due to timing readers might observe the file
// to be corrupted, even though at that moment the file is being written/re-written.
if f.Corrupted && !f.ID.BlockStart.Equal(xtime.FromNanoseconds(f.Info.BlockStart)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning behind f.ID.BlockStart.Equal(xtime.FromNanoseconds(f.Info.BlockStart)) check?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checks for consistency of info file. IIUC, f.ID.BlockStart is extracted from filename while f.Info.BlockStart is read from info file.

if f.Info.IndexVolumeType != nil {
volType = idxpersist.IndexVolumeType(f.Info.IndexVolumeType.Value)
}
// Delete corrupted filesets if there are more recent volumes with with the same volume type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Delete corrupted filesets if there are more recent volumes with with the same volume type.
// Delete corrupted filesets if there are more recent volumes with the same volume type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpranckaitis vpranckaitis requested a review from linasm May 13, 2021 12:37
Copy link
Collaborator

@linasm linasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

type infoFileFn func(file FileSetFile, infoData []byte)
type infoFileFn func(file FileSetFile, infoData []byte, corrupted bool)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be possible to somehow get rid of this bool arg? To me, it doesn't look very clean. Maybe this corrupted flag could be embedded in FileSetFile?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that corrupted could be placed in FileSetFile struct. Though I'm not sure if now it's a good time to do that. FileSetFile struct is used in many functions across this file, most of which do not deal with corrupted filesets. We currently only care about corrupted filesets when calling ReadIndexInfoFiles().

So I guess it would make sense to move corrupted into FileSetFile if it was more widely used, but for now having it separate makes the notion of corrupted filesets more contained.

@linasm linasm removed their assignment May 19, 2021
@linasm linasm merged commit d3f729f into master May 20, 2021
@linasm linasm deleted the bdu/index-corrupted branch May 20, 2021 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants