
Fix (almost) infinite loop in Fileset writer when previous fileset encountered an error writing out index files #2058

Merged: 4 commits merged into master from ra/fix-writer-infinite-loop on Dec 5, 2019

Conversation

richardartoul (Contributor) commented on Dec 4, 2019

This PR fixes a bug in the fileset writer that would trigger a (near) infinite loop when the writer was reused after the previous set of files encountered an error while writing out their index files. The PR includes a regression test that verifies the fix and ensures the issue doesn't crop up again.

The bug was caused by the following sequence of events:

  1. For some reason (root cause still pending), writing out a set of fileset files encountered an error during the call to Close(), likely because duplicate series IDs had been written into the fileset file, which causes the writer to error out when it tries to write its index-related files.

  2. The writer is reused to write out an entirely different set of fileset files. Normally this is fine since the call to Open() performs an implicit reset of the writer's state. However, Open() has a bug where it does not reset w.indexEntries, the slice of all the time series that were written into the file. The result is that we're writing out filesets for files X while still holding on to indexEntries from files Y.

  3. The fileset being written happens to contain no time series for the current block start. This is a normal scenario that can occur when the M3DB nodes are not receiving any writes, or briefly after topology changes where a flush may occur for a shard that was recently closed.

  4. After writing 0 time series into the files, Close() is called on the fileset writer, which triggers the following block of code:

func (w *writer) writeIndexRelatedFiles() error {
	summariesApprox := float64(len(w.indexEntries)) * w.summariesPercent
	summaryEvery := 0
	if summariesApprox > 0 {
		summaryEvery = int(math.Floor(float64(len(w.indexEntries)) / summariesApprox))
	}

	// Write the index entries and calculate the bloom filter
	n, p := uint(w.currIdx), w.bloomFilterFalsePositivePercent
	m, k := bloom.EstimateFalsePositiveRate(n, p)
	bloomFilter := bloom.NewBloomFilter(m, k)

	err := w.writeIndexFileContents(bloomFilter, summaryEvery)
	if err != nil {
		return err
	}

Due to the implementation of EstimateFalsePositiveRate, passing a value of 0 for n will result in 9223372036854775808 being returned for the value of k (the number of hash functions the bloom filter will run for each value that is added to the bloom filter).
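For intuition, the textbook Bloom filter sizing formulas divide by n when computing k, so n = 0 produces a 0/0. The sketch below is only an illustration of that arithmetic, not the actual EstimateFalsePositiveRate implementation:

package main

import (
	"fmt"
	"math"
)

// estimate mirrors the textbook Bloom filter sizing formulas:
//   m = ceil(-n * ln(p) / (ln 2)^2)  // optimal number of bits
//   k = ceil((m / n) * ln 2)         // optimal number of hash functions
// It is only an illustration, not the actual bloom package code.
func estimate(n uint, p float64) (m, k float64) {
	m = math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	k = math.Ceil(m / float64(n) * math.Ln2)
	return m, k
}

func main() {
	m, k := estimate(0, 0.02) // n == 0, as in the empty-fileset case
	fmt.Println(m, k)         // prints "0 NaN": k is the result of 0/0
	// Converting that NaN to a uint is implementation-dependent in Go; per the
	// description above, it came out as 2^63 = 9223372036854775808 on the
	// affected nodes.
}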

Normally this isn't a big deal: when n is 0 there are no time series IDs to add to the bloom filter anyway, so bloomFilter.Add() never gets called.

However, due to the aforementioned error writing out the previous set of files, combined with the bug where indexEntries is not properly reset, the call to writeIndexFileContents (shown below) runs with stale entries:

func (w *writer) writeIndexFileContents(
	bloomFilter *bloom.BloomFilter,
	summaryEvery int,
) error {
	// NB(r): Write the index file in order, in the future we could write
	// these in order to avoid this sort at the end however that does require
	// significant changes in the storage/databaseShard to store things in order
	// which would sacrifice O(1) insertion of new series we currently have.
	//
	// Probably do want to do this at the end still however so we don't stripe
	// writes to two different files during the write loop.
	sort.Sort(w.indexEntries)

	var (
		offset      int64
		prevID      []byte
		tagsIter    = ident.NewTagsIterator(ident.Tags{})
		tagsEncoder = w.tagEncoderPool.Get()
	)
	defer tagsEncoder.Finalize()
	for i := range w.indexEntries {
		id := w.indexEntries[i].id.Bytes()
		// Need to check if i > 0 or we can never write an empty string ID
		if i > 0 && bytes.Equal(id, prevID) {
			// Should never happen, Write() should only be called once per ID
			return fmt.Errorf("encountered duplicate ID: %s", id)
		}

		var encodedTags []byte
		if tags := w.indexEntries[i].tags; tags.Values() != nil {
			tagsIter.Reset(tags)
			tagsEncoder.Reset()
			if err := tagsEncoder.Encode(tagsIter); err != nil {
				return err
			}
			data, ok := tagsEncoder.Data()
			if !ok {
				return errWriterEncodeTagsDataNotAccessible
			}
			encodedTags = data.Bytes()
		}

		entry := schema.IndexEntry{
			Index:       w.indexEntries[i].index,
			ID:          id,
			Size:        int64(w.indexEntries[i].size),
			Offset:      w.indexEntries[i].dataFileOffset,
			Checksum:    int64(w.indexEntries[i].checksum),
			EncodedTags: encodedTags,
		}

		w.encoder.Reset()
		if err := w.encoder.EncodeIndexEntry(entry); err != nil {
			return err
		}

		data := w.encoder.Bytes()
		if _, err := w.indexFdWithDigest.Write(data); err != nil {
			return err
		}

		// Add to the bloom filter, note this must be zero alloc or else this will
		// cause heavy GC churn as we flush millions of series at end of each
		// time window
		bloomFilter.Add(id)

		if i%summaryEvery == 0 {
			// Capture the offset for when we write this summary back, only capture
			// for every summary we'll actually write to avoid a few memcopies
			w.indexEntries[i].indexFileOffset = offset
		}

		offset += int64(len(data))

		prevID = id
	}

	return nil
}

Since w.indexEntries was never properly reset, bloomFilter.Add() is called and the goroutine gets stuck in a near-infinite loop trying to run 9223372036854775808 hash functions.

This issue was extremely hard to debug because it manifested as the M3DB processes turning into "zombies": one CPU core was constantly pegged, but the nodes would not respond to RPCs or networking in general, so standard pprof tooling could not be used.

This happened because all of the function calls within bloomFilter.Add() are inlined, leaving the Go runtime (and therefore the GC) unable to preempt the call until all 9223372036854775808 hash functions had completed.

So when the Go runtime initiated a stop-the-world GC, it stopped all the active goroutines that could have served network requests and then hung forever waiting for the goroutine running bloomFilter.Add() to reach a safe point so garbage collection could begin.
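This class of hang can be reproduced with a tiny program, assuming a Go runtime without asynchronous preemption (which was only added in Go 1.14); the sketch illustrates the mechanism and is not M3DB code:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(2)

	// A goroutine spinning in a loop whose body contains no function calls has
	// no preemption points on runtimes without asynchronous preemption.
	go func() {
		var sum uint64
		for i := uint64(0); i < 1<<62; i++ {
			sum += i
		}
		fmt.Println(sum)
	}()

	time.Sleep(100 * time.Millisecond)

	// Forcing a GC requires stopping the world. The runtime waits for the
	// spinning goroutine to reach a safe point, which never happens, so this
	// call hangs with one core pegged and every other goroutine stopped.
	runtime.GC()
	fmt.Println("GC finished") // not reached on affected runtimes
}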

This is clearly visible in the output from sudo perf top, which shows the Go runtime stuck trying to start a stop-the-world GC, and shows that bloomFilter.Add is stuck in a very long loop based on how much time is being spent on the highlighted assembly instructions.

[perf top screenshots: debug1, debug2, debug3, debug4]

This PR likely requires several follow-ups:

  1. Prevent EstimateFalsePositiveRate from returning absurdly large values of k when n is zero (see the sketch below).
  2. Figure out what combination of cold flushing and background repairs (and their interactions) causes fileset writes to fail immediately after topology changes.
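For follow-up (1), one option is a guard inside the estimator itself or simply at the call site quoted earlier. A call-site sketch only, not the change merged here:

// Inside writeIndexRelatedFiles (sketch): clamp n so the estimator never
// divides by zero. A 1-element filter is harmless because nothing will be
// added to it when there are no index entries.
n, p := uint(w.currIdx), w.bloomFilterFalsePositivePercent
if n == 0 {
	n = 1
}
m, k := bloom.EstimateFalsePositiveRate(n, p)
bloomFilter := bloom.NewBloomFilter(m, k)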

However, we will get this PR merged ASAP to prevent the nodes from getting stuck in undebuggable states.

codecov bot commented Dec 4, 2019

Codecov Report

Merging #2058 into master will decrease coverage by 15.8%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #2058      +/-   ##
==========================================
- Coverage    73.9%    58.1%   -15.9%     
==========================================
  Files        1013     1008       -5     
  Lines      103993   134996   +31003     
==========================================
+ Hits        76954    78553    +1599     
- Misses      22198    49862   +27664     
- Partials     4841     6581    +1740
Flag         Coverage Δ
#aggregator  71%    <ø>     (-9.5%)   ⬇️
#cluster     55%    <ø>     (-30.3%)  ⬇️
#collector   23.9%  <ø>     (-41%)    ⬇️
#dbnode      67.9%  <100%>  (+17.4%)  ⬆️
#m3em        21.1%  <ø>     (-39.2%)  ⬇️
#m3ninx      65.1%  <ø>     (-1.7%)   ⬇️
#m3nsch      60.1%  <ø>     (-10.6%)  ⬇️
#metrics     8.3%   <ø>     (-17.1%)  ⬇️
#msg         68.9%  <ø>     (-5.2%)   ⬇️
#query       38.6%  <ø>     (-31.2%)  ⬇️
#x           69.1%  <ø>     (-11.3%)  ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a585b99...d3a9932.

richardartoul changed the title from "Fix (almost) infinite loop in Fileset writes when previous fileset encountered an error writing out index files" to "Fix (almost) infinite loop in Fileset writer when previous fileset encountered an error writing out index files" on Dec 4, 2019
@@ -161,6 +161,11 @@ func (w *writer) Open(opts DataWriterOpenOptions) error {
w.currIdx = 0
w.currOffset = 0
w.err = nil
// This happens after writing the previous set of files index files, however, do it
prateek (Collaborator) commented on the diff hunk above:

mind grouping the resetting code into a new func (w *writer) reset() fn?
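For reference, a minimal sketch of what that could look like, using only the fields visible in this PR (the actual merged change may differ):

// reset clears all per-fileset state so a reused writer cannot leak entries
// from a previous (possibly failed) fileset into the next one.
func (w *writer) reset() {
	w.currIdx = 0
	w.currOffset = 0
	w.err = nil
	// Keeping the backing array avoids re-allocating on every flush; the
	// important part is that the length goes back to zero.
	w.indexEntries = w.indexEntries[:0]
}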

vdarulis (Collaborator) commented:

tangentially - should indexEntries be bounded / randomly re-allocated if it exceeds a certain size to reduce memory usage? not for this PR, just curious myself.

richardartoul (Contributor, Author) replied:

@prateek sure

richardartoul (Contributor, Author) replied:

@vdarulis Theoretically yeah, it probably should. In practice we only keep one of these around per node (for the most part), and the size of this slice will never exceed the number of series for a given block/shard combination, so it's usually in the 10s of thousands tops, and each item in the slice is not that big, so I think it's really unlikely to become an issue. There are definitely other things like this in the codebase where we're more paranoid because they could become an issue if left unchecked.
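For illustration, the kind of bounding being discussed could look like the sketch below (purely hypothetical and not part of this PR; maxRetainedIndexEntries is a made-up threshold):

// If an unusually large flush grew the slice, drop the backing array so the
// GC can reclaim it; otherwise just truncate and reuse it.
const maxRetainedIndexEntries = 1 << 20 // hypothetical threshold

func (w *writer) resetIndexEntries() {
	if cap(w.indexEntries) > maxRetainedIndexEntries {
		w.indexEntries = nil
	} else {
		w.indexEntries = w.indexEntries[:0]
	}
}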

prateek (Collaborator) left a review:
LGTM


richardartoul merged commit 42d1168 into master on Dec 5, 2019
richardartoul deleted the ra/fix-writer-infinite-loop branch on December 5, 2019 at 14:44