Log error message for invalid checksum #1713

Merged: 1 commit into grafana:master on Apr 24, 2020

Conversation

@adityacs (Contributor) commented Feb 18, 2020

What this PR does / why we need it:
Skip chunks with an invalid checksum and log an error message.

Which issue(s) this PR fixes:
Fixes #1346

Checklist

  • Tests updated

@adityacs (Contributor, Author)

@cyriltovena I have a few queries.

  1. What should we log in the error message for an invalid checksum? The available fields are blocksize, bytes, etc., which won't be useful to the user.
  2. I think we should just skip and continue here while iterating over the blocks of a chunk:
    return bc, ErrInvalidChecksum

    Is it OK to skip here as well, when the checksum of the chunk meta is not valid?
    return nil, ErrInvalidChecksum

return err
err := instance.consumeChunk(userCtx, lbls, chunk)
if err != nil {
	if err.Error() == "invalid checksum" {
		level.Error(util.Logger).Log("msg", "Checksum does not match for chunk, this chunk will be skipped", "err", "invalid checksum", "chunk", chunk.String())
Contributor Author

Not sure if logging chunk.String() is a good idea.

Contributor Author

I am not able to reproduce this issue (#1346). Can we ask the author of the issue to validate on their side using this commit and see what shows up in their logs?

Contributor

I don't think this change is necessary since now you're not sending back the invalid error.

Contributor

Let's leave the transfer erroring (rollback that file), I don't think this is causing any issue for #1346.

@codecov-io commented Feb 23, 2020

Codecov Report

Merging #1713 into master will increase coverage by 0.04%.
The diff coverage is 71.42%.


@@            Coverage Diff             @@
##           master    #1713      +/-   ##
==========================================
+ Coverage   64.75%   64.79%   +0.04%     
==========================================
  Files         122      122              
  Lines        9271     9283      +12     
==========================================
+ Hits         6003     6015      +12     
- Misses       2853     2854       +1     
+ Partials      415      414       -1     
Impacted Files               Coverage   Δ
pkg/chunkenc/interface.go    87.50%     <ø> (ø)
pkg/chunkenc/lazy_chunk.go   0.00%      <ø> (ø)
pkg/chunkenc/memchunk.go     73.31%     <0.00%> (-0.22%) ⬇️
pkg/storage/iterator.go      86.03%     <83.33%> (+0.21%) ⬆️
pkg/ingester/transfer.go     66.42%     <0.00%> (+1.42%) ⬆️

pkg/storage/iterator.go (outdated)
@@ -385,6 +385,7 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error {
}
level.Debug(log).Log("msg", "loading lazy chunks", "chunks", totalChunks)

invalidChunks := []*chunk.Fetcher{}
Contributor Author

The reason for creating this slice is that later, when we have to remove the invalid chunks, we can iterate over only the invalid chunks rather than over all fetchers and chunks.

Contributor

No need for this.

@@ -403,8 +404,15 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error {
}
chks, err := fetcher.FetchChunks(ctx, chks, keys)
if err != nil {
if err.Error() == chunk.ErrInvalidChecksum.Error() {
Contributor

Suggested change
- if err.Error() == chunk.ErrInvalidChecksum.Error() {
+ if err == chunk.ErrInvalidChecksum {

@@ -403,8 +404,15 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error {
}
chks, err := fetcher.FetchChunks(ctx, chks, keys)
if err != nil {
if err.Error() == chunk.ErrInvalidChecksum.Error() {
level.Error(util.Logger).Log("msg", "checksum of chunks does not match", "err", chunk.ErrInvalidChecksum)
invalidChunks = append(invalidChunks, fetcher)
Contributor

I don't think you need this invalid chunks array.

@@ -415,6 +423,13 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error {
}(fetcher, chunks)
}

// remove data of any invalid chunk
for _, fetcher := range invalidChunks {
Contributor

You need to remove the empty chunks from the chunks variable, and this needs to happen after we have waited for all the errors.

after:

var lastErr error
for i := 0; i < len(chksByFetcher); i++ {
	if err := <-errChan; err != nil {
		lastErr = err
	}
}

if lastErr != nil {
	return lastErr
}

// compact the slice in place, keeping only the chunks that have data
i := 0 // output index
for _, c := range chunks {
	if c.Chunk.Data != nil {
		chunks[i] = c
		i++
	}
}
chunks = chunks[:i]
return nil

Contributor Author

Got you

@adityacs (Contributor, Author) commented Mar 1, 2020

@cyriltovena I see a problem here. I may not be correct though.
From the following function https://github.com/grafana/loki/blob/master/pkg/storage/iterator.go#L280, we first call partitionBySeriesChunks, which partitions chunks based on overlapping iterators. At this stage, say chksBySeries has the chunks below.
chksBySeries after partitioning:

[0xc00011c400]
[0xc00011c480]
[0xc00011c500 0xc00011c600]
[0xc00011c580]

Later, the first chunks are loaded by loadFirstChunks, which is passed chksBySeries. After this, chksBySeries looks as below:

[0xc00011c400]
[0xc00011c480]
[0xc00011c500 0xc00011c600]
[0xc00011c580]

and in turn fetchLazyChunks would have picked the following chunks:
[0xc00011c400 0xc00011c480 0xc00011c500 0xc00011c580]

Even at this point chksBySeries has the values below:

[0xc00011c400]
[0xc00011c480]
[0xc00011c500 0xc00011c600]
[0xc00011c580]

Later, filterSeriesByMatchers is called with chksBySeries, and chksBySeries then has the following values:

[[0xc00011c500 0xc00011c600][0xc00011c580]]

Now, say the 0xc00011c600 chunk returns an invalid chunk checksum during fetchLazyChunks, so the chunks picked are [0xc00011c500], [0xc00011c580]. The iterator is not affected by this, because buildIterators is passed chksBySeries, which still iterates over the following chunks:
[[0xc00011c500 0xc00011c600][0xc00011c580]]. Since 0xc00011c600 has no data, the iterator errors saying the chunk is not loaded.

Maybe we should check for a chunk with nil Data here https://github.com/grafana/loki/blob/master/pkg/storage/iterator.go#L335 and skip it,
OR
rearrange chksBySeries after fetching the lazy chunks, before calling buildIterators here https://github.com/grafana/loki/blob/master/pkg/storage/iterator.go#L305.

PS: I am not sure how well I managed to convey this 😃
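
For illustration, a minimal self-contained sketch of the aliasing problem described above (stand-in types, not the actual Loki structs): filtering the flat slice that fetchLazyChunks works on does not remove the invalid chunk from the nested chksBySeries partitions, because both hold pointers to the same chunks.

package main

import "fmt"

// lazyChunk stands in for chunkenc.LazyChunk; Data == "" plays the role of
// a chunk whose data was dropped because of an invalid checksum.
type lazyChunk struct{ Data string }

func main() {
	c1, c2, c3, c4 := &lazyChunk{"a"}, &lazyChunk{"b"}, &lazyChunk{""}, &lazyChunk{"d"}

	// chksBySeries keeps its own nested slices of pointers...
	chksBySeries := [][]*lazyChunk{{c1}, {c2, c3}, {c4}}

	// ...while fetchLazyChunks works on a separate, flat slice of the same pointers.
	chunks := []*lazyChunk{c1, c2, c3, c4}

	// Filtering the flat slice in place (dropping c3, which has no data)...
	i := 0
	for _, c := range chunks {
		if c.Data != "" {
			chunks[i] = c
			i++
		}
	}
	chunks = chunks[:i]

	// ...does not touch the nested partitions: c3 is still referenced there,
	// so an iterator built from chksBySeries would still hit the empty chunk.
	fmt.Println(len(chunks))          // 3
	fmt.Println(len(chksBySeries[1])) // still 2
}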

Contributor

I'm off for a week. We'll get back to you then.

Contributor Author

Sure. I tried to add some dummy code here: https://play.golang.org/p/HFXOy5Ek2xd. It should explain what's actually happening.

Contributor

No, currently it crashes. I want this function to be changed as we discussed.

@adityacs (Contributor, Author) commented Mar 9, 2020

Sorry, I was explaining things without pushing code. The added code fails here. The test code is not neat; I'm just validating things. However, we get into the scenario below.

The chunks may initially be chunk1, chunk2, chunk3; after we run this https://github.com/grafana/loki/pull/1713/files#diff-7d61465906fc121b68e860420fcd1c87R440 the chunks may be
[chunk1, chunk3]. However, chksBySeries here https://github.com/grafana/loki/blob/64a1c0d96bd033d67c634aff195fe1b0b96441c5/pkg/storage/iterator.go#L305 might still have the chunks as [[chunk1, chunk2][chunk3]]. So, if chunk2 has nil data due to an invalid checksum, chksBySeries doesn't drop the chunk, which leads to "chunk is not loaded" here https://github.com/grafana/loki/blob/64a1c0d96bd033d67c634aff195fe1b0b96441c5/pkg/chunkenc/lazy_chunk.go#L29

We should rearrange the chunks before calling buildIterators https://github.com/grafana/loki/blob/64a1c0d96bd033d67c634aff195fe1b0b96441c5/pkg/storage/iterator.go#L305

Contributor

Yes, that's what I tried to explain: you should remove the lazy chunk from the array.

Contributor Author

I am just looking for easier ways to do this. Removing a chunk from an array is quite a costly operation: we would have to remove an element and shift the other elements. When we have a large number of chunks to load, this can turn into a time-consuming process.

One easy way is to label the chunk as valid or invalid, maybe with a boolean. Then in the iterators we can check whether the chunk is valid; if it's invalid we can skip and continue.

What are your thoughts on this?

Contributor Author

@cyriltovena Need your feedback on this

@adityacs force-pushed the 1346 branch 4 times, most recently from c427cda to 64a1c0d on March 9, 2020, 16:19
@cyriltovena (Contributor)

I think you're still working on this, right? We can close this and open a new one if you want to clear things up when you're ready.

@adityacs (Contributor, Author)

Yeah, I am still working on this. Anything is fine with me. However, since we have had a good discussion on this PR, I would like to get it merged rather than opening a new one.

It's your call, though.

Comment on lines 306 to 317
var validChunks map[*chunkenc.LazyChunk]bool
for _, series := range chksBySeries {
	for _, chunks := range series {
		for _, chunk := range chunks {
			if contains(chunk, allChunks) {
				validChunks[chunk] = true
			}
		}
	}
}

iters, err := buildIterators(ctx, chksBySeries, validChunks, filter, direction, from, through)
@adityacs (Contributor, Author) commented Mar 27, 2020

@cyriltovena This is one way we can check and avoid iterating over invalid chunks. However, this is going to be an O(n^3) operation.

We can add a field to LazyChunk like below, but we will have to change many references and disturb the code a bit.

type LazyChunk struct {
	Chunk   chunk.Chunk
	IsValid bool
	Fetcher *chunk.Fetcher
}

What do you suggest?

Contributor

type LazyChunk struct {
	Chunk   chunk.Chunk
	IsValid bool
	Fetcher *chunk.Fetcher
}

This is less intrusive for the code base; IMO I prefer this.

Contributor Author

Sure. Thanks

for i := range chks {
	iterators := make([]iter.EntryIterator, 0, len(chks[i]))
	for j := range chks[i] {
		if _, ok := validChunks[chks[i][j]]; !ok {
Contributor

This is exactly where you should filter invalid chunks. But I don't like a map; I prefer that you check a boolean on the chunk. I don't think you need a log here either, as this will already be logged by memchunk.go.
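
For illustration, a minimal self-contained sketch of the boolean-flag skip being suggested here (stand-in types and a hypothetical IsValid field; the real loop builds iter.EntryIterator values from chunkenc.LazyChunk):

package main

import "fmt"

// lazyChunk stands in for chunkenc.LazyChunk with the proposed IsValid flag.
type lazyChunk struct {
	IsValid bool
	Label   string
}

// collectValid mirrors the shape of the loop above: it walks the partitioned
// chunks and simply skips any chunk flagged as invalid.
func collectValid(chks [][]*lazyChunk) [][]string {
	out := make([][]string, 0, len(chks))
	for i := range chks {
		labels := make([]string, 0, len(chks[i]))
		for j := range chks[i] {
			if !chks[i][j].IsValid {
				// The checksum failure was already logged when the chunk was fetched.
				continue
			}
			labels = append(labels, chks[i][j].Label)
		}
		out = append(out, labels)
	}
	return out
}

func main() {
	chks := [][]*lazyChunk{
		{{IsValid: true, Label: "chunk1"}, {IsValid: false, Label: "chunk2"}},
		{{IsValid: true, Label: "chunk3"}},
	}
	fmt.Println(collectValid(chks)) // [[chunk1] [chunk3]]
}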

Contributor Author

👍

@@ -403,8 +406,14 @@ func fetchLazyChunks(ctx context.Context, chunks []*chunkenc.LazyChunk) error {
}
chks, err := fetcher.FetchChunks(ctx, chks, keys)
if err != nil {
if err.Error() == chunkenc.ErrInvalidChecksum.Error() || strings.Contains(err.Error(), chunkenc.ErrInvalidChecksum.Error()) {
Contributor Author

Just checking err == chunkenc.ErrInvalidChecksum doesn't work; we have to do err.Error() == chunkenc.ErrInvalidChecksum.Error().

Also, the error might come from Cortex, from here: https://github.com/cortexproject/cortex/blob/02af2474da0cef0cb4076106e03528eed8830166/pkg/chunk/chunk.go#L263. So, we have to check whether the wrapped error stack contains the invalid chunk checksum error.

Contributor

We should compare the error itself, not its string form, as the string could change.

If it can also come from Cortex, you can check for that one too, since they expose it.

Contributor Author

There is a catch here. Cortex returns errors of different types. Here it is a promql error: https://github.com/cortexproject/cortex/blob/master/pkg/chunk/chunk_store_utils.go#L165.

So, comparing err == chunkenc.ErrInvalidChecksum is not possible, since chunkenc.ErrInvalidChecksum is a plain errors.New error while err will be of type promql.ErrStorage.
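
For illustration, a small self-contained sketch (stand-in types; errors.Cause is from github.com/pkg/errors) of why the direct comparison fails once the sentinel error is wrapped, and how unwrapping plus a type assertion recovers it:

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// Stand-ins for chunkenc.ErrInvalidChecksum and promql.ErrStorage.
var errInvalidChecksum = errors.New("invalid chunk checksum")

type errStorage struct{ Err error }

func (e errStorage) Error() string { return e.Err.Error() }

func main() {
	var err error = errStorage{Err: errInvalidChecksum}

	// Direct comparison fails: err's concrete type is errStorage, not the sentinel error.
	fmt.Println(err == errInvalidChecksum) // false

	// Unwrap the cause, type-assert the wrapper, then compare the inner error.
	if e, ok := errors.Cause(err).(errStorage); ok {
		fmt.Println(e.Err == errInvalidChecksum) // true
	}
}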

Contributor

Totally understood. I think you should make a function:

func isInvalidCheckSum(err error) bool

You could check if it is a promql.ErrStorage first; if so, unwrap the error, then check it.

Bonus point: this means the function is easily testable. You should feed it chunkenc.ErrInvalidChecksum from Cortex and from Loki, wrapped with promql.ErrStorage, and also a random error.

Contributor Author

Perfect. Will add this

pkg/storage/util_test.go (outdated)
func isInvalidChunkError(err error) bool {
	err = errors.Cause(err)
	if err, ok := err.(promql.ErrStorage); ok {
		return err.Err == chunk.ErrInvalidChecksum
Contributor

I thought the error could be this one and the Loki one? Shouldn't you test for that?

Contributor Author

Yeah, we can add the Loki one as well. It's just the same error. Will change it

@adityacs (Contributor, Author) commented Apr 8, 2020

@cyriltovena I have made this change. Also, I will rebase on top of this #1914

Contributor

I thought the error could be this one and the Loki one?

I think it can be both. You also need a small test for that function; that was the point.

Almost there!

Contributor Author

Should we just use chunk.ErrInvalidChecksum from Cortex here as well: https://github.com/grafana/loki/blob/master/pkg/chunkenc/memchunk.go#L209?

If there is any change in Cortex, we would have to make sure to bring that change to Loki as well, otherwise we would break.

Contributor Author

@cyriltovena Added the test

@adityacs (Contributor, Author) commented Apr 14, 2020

The series test Test_store_GetSeries is failing. I have to find some other way to do the chunk test; doing this doesn't work out:
https://github.com/grafana/loki/blob/7c2274093f7a76f433a6fd32aad2bd0205e7c1c6/pkg/storage/util_test.go#L208

I will rework this.

Contributor Author

Changed the Batchsize in the Test_store_GetSeries test. That works.

WDYT, should we do it this way?

func Test_IsInvalidChunkError(t *testing.T) {
	tests := []struct {
		name string
		err  error
Contributor

You need to add at least one case where it is not an invalid chunk error.

Contributor Author

added
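
For illustration, a minimal table-driven sketch of such a test, including a negative case. The names are taken from the snippets in this thread (isInvalidChunkError, chunk.ErrInvalidChecksum, promql.ErrStorage); the package name, the use of testify's require, and the assumption that the function returns false for anything other than the wrapped checksum error are mine, not necessarily what the PR ended up with.

package storage

import (
	"errors"
	"testing"

	"github.com/cortexproject/cortex/pkg/chunk"
	"github.com/prometheus/prometheus/promql"
	"github.com/stretchr/testify/require"
)

func Test_IsInvalidChunkError(t *testing.T) {
	for _, tt := range []struct {
		name     string
		err      error
		expected bool
	}{
		// The wrapped Cortex checksum error should be detected...
		{"invalid checksum wrapped in promql.ErrStorage", promql.ErrStorage{Err: chunk.ErrInvalidChecksum}, true},
		// ...while unrelated or absent errors should not.
		{"unrelated error", errors.New("something else"), false},
		{"no error", nil, false},
	} {
		t.Run(tt.name, func(t *testing.T) {
			require.Equal(t, tt.expected, isInvalidChunkError(tt.err))
		})
	}
}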

@@ -373,7 +423,7 @@ func Test_store_GetSeries(t *testing.T) {
{Labels: mustParseLabels("{foo=\"bar\"}")},
{Labels: mustParseLabels("{foo=\"bazz\"}")},
},
1,
2,
Contributor

Why do you need to change that?

Contributor Author

This is still a hacky thing. However, I just want to keep it because it gives reasonably complete testing of the chunk along with the stream.

@adityacs force-pushed the 1346 branch 3 times, most recently from 2e7c297 to 5dd4a5a on April 15, 2020, 08:13
@cyriltovena (Contributor) left a comment

I'm a bit scared of those test changes. I think I'd prefer a new set of tests just for this change. It doesn't have to be complex; at least test isInvalidChunkError. The rest is less of a problem.

@adityacs (Contributor, Author)

@cyriltovena Tests for isInvalidChunkError are already added. I will try to add new tests for fetchLazyChunks, or at least make the existing tests look better.

},
},
})
secondChunk = newLazyInvalidChunk(logproto.Stream{
Contributor

Suggested change
- secondChunk = newLazyInvalidChunk(logproto.Stream{
+ invalid = newLazyInvalidChunk(logproto.Stream{

@@ -77,10 +86,11 @@ func newChunk(stream logproto.Stream) chunk.Chunk {
}
chk.Close()
c := chunk.NewChunk("fake", client.Fingerprint(l), l, chunkenc.NewFacade(chk), from, through)
// force the checksum creation
// // force the checksum creation
Contributor

Suggested change
- // // force the checksum creation
+ // force the checksum creation

},
} {
t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
ctx = user.InjectOrgID(context.Background(), "test-user")
Contributor

See, I think this test is sufficient. Remove the changes from the other tests in pkg/storage/store_test.go and the other changes made for those tests; let's keep only this one. And rename the chunk variable to invalidChunk.

@adityacs (Contributor, Author) commented Apr 23, 2020

@cyriltovena I have given adding tests for fetchLazyChunks one more try. Kindly review and let me know; if it's not good, I will revert it.

https://github.com/grafana/loki/blob/4fac8b37f126a1e76293ec6dfef2675955b7494c/pkg/storage/store_test.go#L485

@cyriltovena (Contributor) left a comment

LGTM

@cyriltovena (Contributor)

Tests are failing. I'm fine if you remove all those new tests and leave the one on heapIterator; it covers what we need.

@adityacs (Contributor, Author)

I have removed the new tests and retained the IsInvalidChunkError test.

@cyriltovena (Contributor) left a comment

LGTM

@cyriltovena merged commit c0a28cb into grafana:master on Apr 24, 2020