-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cleanup logic for corrupted index filesets #3430
Conversation
813003e
to
721074c
Compare
0c395fc
to
a930af6
Compare
infoFilePath, ok := corrupted.InfoFilePath() | ||
if !ok { | ||
fn(corrupted, nil, true) | ||
return | ||
} | ||
infoData, err := read(infoFilePath) | ||
if err != nil { | ||
// NB: If no info data is supplied, we assume that the | ||
// info file itself is corrupted. Since this is the | ||
// first file written to disk, this should be safe to remove. | ||
fn(corrupted, nil, true) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO to preserve these ordering guarantees (that info file is the first to be persisted), we may need to fsync (File.Sync()
call) info file immediately after the first write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we? The info file is the first file we attempt to write. Shouldn't the OS flush buffered info file (from first write) pages to disk first before flushing other buffered pages for other files to disk that we've written later? Or does this happen out of order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen in practice a corrupt fileset where checkpoint file was fully persisted, but some other files were incomplete, even though checkpoint file is the last to be written and closed.
So I think that without fsync
ing info file we can end up in a situation where the first part of info file has been written, we continue with further writing of other files, cleanup process picks up this fileset, but it does not see the info file yet and so it deletes other files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about reading info files which haven't yet been fully written (see comments). Though I'm not sure if it's something to worry about.
src/dbnode/storage/index.go
Outdated
if file.Info.BlockStart == 0 { | ||
// Mark filesets w/ corrupted index info files for deletion right away. | ||
toDelete = append(toDelete, file.AbsoluteFilePaths...) | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it is possible that the info file will be read after it was created, but before the contents were written? Not sure about filesystem atomicity guarantees, but at least WriteFile()
implementation [1, 2] hints that there might be a gap.
A comment above says that we intentionally skip latest block start. Could we accidentally violate this by reading info file which haven't been written yet and assuming it's corrupted? If that file belonged to the latest block start, we wouldn't even know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yea, I'm not sure - maybe @robskillington can comment on this.
We could err on the side of caution and not attempt to delete filesets w/ a potentially corrupt index info file.
Codecov Report
@@ Coverage Diff @@
## master #3430 +/- ##
=========================================
- Coverage 72.4% 72.4% -0.1%
=========================================
Files 1100 1100
Lines 102736 102830 +94
=========================================
+ Hits 74479 74529 +50
- Misses 23158 23196 +38
- Partials 5099 5105 +6
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
infoFilePath, ok := corrupted.InfoFilePath() | ||
if !ok { | ||
fn(corrupted, nil, true) | ||
return | ||
} | ||
infoData, err := read(infoFilePath) | ||
if err != nil { | ||
// NB: If no info data is supplied, we assume that the | ||
// info file itself is corrupted. Since this is the | ||
// first file written to disk, this should be safe to remove. | ||
fn(corrupted, nil, true) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have seen in practice a corrupt fileset where checkpoint file was fully persisted, but some other files were incomplete, even though checkpoint file is the last to be written and closed.
So I think that without fsync
ing info file we can end up in a situation where the first part of info file has been written, we continue with further writing of other files, cleanup process picks up this fileset, but it does not see the info file yet and so it deletes other files.
"encountered errors when cleaning up index files for %v: %w", t, err)) | ||
} | ||
|
||
if err := m.cleanupCorruptedIndexFiles(namespaces); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me think why does is the cleanup process called from WarmFlushCleanup
. It would clean up cold flush index filesets as well, right? Also, from what I can see it covers all namespaces, including computed namespaces which perform no warm/cold flushing. Which makes the code somewhat confusing. Perhaps some refactoring/naming improvement could be useful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just this fn, all of the index file cleanup fns are the same. We only cleanup data files in the cold flush loop and index files in the warm flush loop. IIRC we only have cold flush cleanup because it was not safe to cleanup data files in the warm flush loop (can't remember the exact reason). Otherwise, we would've done all cleanup in the warm flush loop.
src/dbnode/storage/index.go
Outdated
) | ||
// NB: Info files should be ordered by block start. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe worth adding an invariant check in the loop to guarantee this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/dbnode/storage/index.go
Outdated
// This intentionally skips the latest block start as that's the active block. | ||
if file.Info.BlockStart > latestBlockStart { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for cold flush (or computed namespace), active block may not necessarily be the latest one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm maybe I should remove this comment - it's a little confusing, it's fine even if we iterate over the active block or any block for that matter that we're actively writing to (we check against latest vol idx).
I had the comment for why we're not running the cleanup logic for all blocks. Would need to wrap it in a fn and run it outside of the loop to cover all blocks. But running against the active block is essentially a no-op so I put the comment in.
Codecov Report
@@ Coverage Diff @@
## master #3430 +/- ##
======================================
Coverage 56.3% 56.3%
======================================
Files 548 548
Lines 61902 61902
======================================
Hits 34898 34898
Misses 23886 23886
Partials 3118 3118
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
Co-authored-by: Vilius Pranckaitis <vpranckaitis@gmail.com>
Co-authored-by: Linas Medžiūnas <linasm@users.noreply.github.com>
81b7d03
to
1247334
Compare
# Conflicts: # src/dbnode/storage/cleanup_test.go # src/dbnode/storage/index.go
# Conflicts: # src/dbnode/persist/fs/files.go # src/dbnode/storage/cleanup.go # src/dbnode/storage/index.go # src/dbnode/storage/index_test.go
opts := NewTestOptions(t). | ||
SetNamespaces([]namespace.Metadata{ns}) | ||
|
||
// Test setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this comment appears twice (and in general doesn't look like it is adding value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SetRetentionOptions(nsROpts).SetIndexOptions( | ||
namespace.NewIndexOptions().SetBlockSize(idxBlockSize).SetEnabled(true))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe try formatting these Set
s differently for easier readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReaderBufferSize: setup.FilesystemOpts().InfoReaderBufferSize(), | ||
}) | ||
require.NoError(t, err) | ||
require.NotEmpty(t, filesets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we check the len(filesets)
here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Namespace: ns.ID(), | ||
ReaderBufferSize: setup.FilesystemOpts().InfoReaderBufferSize(), | ||
}) | ||
require.NoError(t, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no error returned by the code above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writeIndexFileSetFiles(t, setup.StorageOpts(), ns, filesetsIdentifiers) | ||
|
||
// Corrupt some of the written filesets. | ||
deltaNow := now.Add(time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could you move this declaration closer to where it is being used (for readability)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/dbnode/persist/fs/files.go
Outdated
stat, err := os.Stat(filePath) | ||
if err != nil { | ||
return nil, err | ||
} | ||
|
||
size := int(stat.Size()) | ||
buf := make([]byte, size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this part of code is exactly the same as in readAndValidate
. Might be worth extracting something like func bufferForEntireFile(filePath string) ([]byte, error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/dbnode/storage/index.go
Outdated
} | ||
|
||
// Check for invariants. | ||
for j := 0; j+1 < len(filesets); j++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this loop would be more readable in this form:
for j := 0; j+1 < len(filesets); j++ { | |
for j := 1; j < len(filesets); j++ { |
And then adjusting the indexes in the loop body accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/dbnode/storage/index.go
Outdated
for j := len(filesets) - 1; j >= 0; j-- { | ||
f := filesets[j] | ||
|
||
// NB: If the fileset info fields contains conflicting information, it means that info file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more specific about what you call "conflicting information" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded it to "inconsistent". E.g. block start is written in filename and inside info file, and if those don't match, we might not be able to trust the other information, too.
bca22e4#diff-6fedd1c6b4deb503fca7c3a8a8d2fbedf6abeeccfcbc90d2d239a6c72cb12199R2149-R2150
92966b6#diff-6fedd1c6b4deb503fca7c3a8a8d2fbedf6abeeccfcbc90d2d239a6c72cb12199R2149-R2155
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though maybe it would be better return more info from ReadIndexInfoFiles()
itself? Instead of simply returning corrupted bool
, we could make it an enum and describe the type of corruption. There seems to be a handful corruption modes that we care about:
- not corrupted – volume type known;
- info file is missing / can't be parsed – volume type unknown;
- info file is fine, though some other file is corrupted – volume type known.
// The most recent volume is excluded because it is more likely to be actively written to. | ||
// If info file writes are not atomic, due to timing readers might observe the file | ||
// to be corrupted, even though at that moment the file is being written/re-written. | ||
if f.Corrupted && !f.ID.BlockStart.Equal(xtime.FromNanoseconds(f.Info.BlockStart)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning behind f.ID.BlockStart.Equal(xtime.FromNanoseconds(f.Info.BlockStart))
check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checks for consistency of info file. IIUC, f.ID.BlockStart
is extracted from filename while f.Info.BlockStart
is read from info file.
src/dbnode/storage/index.go
Outdated
if f.Info.IndexVolumeType != nil { | ||
volType = idxpersist.IndexVolumeType(f.Info.IndexVolumeType.Value) | ||
} | ||
// Delete corrupted filesets if there are more recent volumes with with the same volume type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Delete corrupted filesets if there are more recent volumes with with the same volume type. | |
// Delete corrupted filesets if there are more recent volumes with the same volume type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} | ||
|
||
type infoFileFn func(file FileSetFile, infoData []byte) | ||
type infoFileFn func(file FileSetFile, infoData []byte, corrupted bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be possible to somehow get rid of this bool
arg? To me, it doesn't look very clean. Maybe this corrupted
flag could be embedded in FileSetFile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that corrupted
could be placed in FileSetFile
struct. Though I'm not sure if now it's a good time to do that. FileSetFile
struct is used in many functions across this file, most of which do not deal with corrupted filesets. We currently only care about corrupted filesets when calling ReadIndexInfoFiles()
.
So I guess it would make sense to move corrupted
into FileSetFile
if it was more widely used, but for now having it separate makes the notion of corrupted filesets more contained.
# Conflicts: # src/dbnode/storage/storage_mock.go
What this PR does / why we need it:
Adds cleanup logic for corrupted index filesets.
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?: