
move validation in distributor into separate middleware #3386

Merged: 10 commits into main from distributor_validate_forwarded_samples on Nov 11, 2022

Conversation

@replay (Contributor) commented Nov 4, 2022

This change moves the validation of incoming series ahead of the Distributor's forwarding functionality, so that invalid series are no longer forwarded.
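For readers skimming the diff, a minimal standalone sketch of the middleware-chaining pattern this PR relies on -- not Mimir's actual wiring: pushFunc, validateMiddleware, and forwardMiddleware are invented stand-ins for push.Func, the new prePushValidationMiddleware, and the forwarding stage.

package main

import "fmt"

// request is a simplified stand-in for a remote-write push request.
type request struct{ series []string }

// pushFunc mirrors the shape of a push handler; each middleware wraps
// the next stage and may drop or rewrite series before passing the
// request on.
type pushFunc func(req *request) error

func validateMiddleware(next pushFunc) pushFunc {
	return func(req *request) error {
		valid := req.series[:0]
		for _, s := range req.series {
			if s != "" { // stand-in for real per-series validation
				valid = append(valid, s)
			}
		}
		req.series = valid
		return next(req) // later stages only ever see validated series
	}
}

func forwardMiddleware(next pushFunc) pushFunc {
	return func(req *request) error {
		fmt.Printf("forwarding %d validated series\n", len(req.series))
		return next(req)
	}
}

func main() {
	// Validation wraps forwarding, so it runs first -- the ordering
	// this PR establishes.
	push := validateMiddleware(forwardMiddleware(func(req *request) error {
		fmt.Printf("ingesting %d series\n", len(req.series))
		return nil
	}))
	_ = push(&request{series: []string{"up", "", "go_goroutines"}})
}

Because validation wraps forwarding, the request that reaches the forwarding stage has already been stripped of invalid series.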

@replay force-pushed the distributor_validate_forwarded_samples branch from 4ac3590 to 2d6da45 on November 9, 2022 at 11:59
@replay marked this pull request as ready for review on November 9, 2022 at 12:13
@replay requested a review from a team as a code owner on November 9, 2022 at 12:13
@osg-grafana added the type/docs (Improvements or additions to documentation) label on Nov 9, 2022
@osg-grafana requested a review from a team on November 9, 2022 at 12:23
@osg-grafana (Contributor) commented Nov 9, 2022

Meta, please ignore: testing out the new subteam of the Docs Squad.

Review thread on CHANGELOG.md (outdated, resolved)

@osg-grafana (Contributor) left a comment

A nit

@replay (Contributor, author) commented Nov 9, 2022

I just ran the benchmark BenchmarkDistributor_Push and compared the results of main against this branch. I'm concerned about the increase in allocations; I'll try to optimize that:

name                                                      old time/op    new time/op    delta
Distributor_Push/too_many_labels_limit_reached-10            946µs ± 0%     892µs ± 0%     -5.65%
Distributor_Push/max_label_name_length_limit_reached-10     5.33ms ± 0%    2.95ms ± 0%    -44.69%
Distributor_Push/max_label_value_length_limit_reached-10    3.33ms ± 0%    0.62ms ± 0%    -81.52%
Distributor_Push/timestamp_too_new-10                        579µs ± 0%     636µs ± 0%     +9.95%
Distributor_Push/all_samples_successfully_pushed-10          762µs ± 0%     696µs ± 0%     -8.56%
Distributor_Push/ingestion_rate_limit_reached-10             486µs ± 0%     328µs ± 0%    -32.53%

name                                                      old alloc/op   new alloc/op   delta
Distributor_Push/too_many_labels_limit_reached-10           79.6kB ± 0%  2293.3kB ± 0%  +2780.62%
Distributor_Push/max_label_name_length_limit_reached-10      119kB ± 0%    1047kB ± 0%   +782.56%
Distributor_Push/max_label_value_length_limit_reached-10     104kB ± 0%    1034kB ± 0%   +896.58%
Distributor_Push/timestamp_too_new-10                       88.1kB ± 0%  1021.0kB ± 0%  +1059.20%
Distributor_Push/all_samples_successfully_pushed-10          143kB ± 0%     109kB ± 0%    -23.53%
Distributor_Push/ingestion_rate_limit_reached-10            25.5kB ± 0%     4.7kB ± 0%    -81.38%

name                                                      old allocs/op  new allocs/op  delta
Distributor_Push/too_many_labels_limit_reached-10            2.14k ± 0%     7.15k ± 0%   +233.96%
Distributor_Push/max_label_name_length_limit_reached-10      2.10k ± 0%     6.08k ± 0%   +190.31%
Distributor_Push/max_label_value_length_limit_reached-10     2.10k ± 0%     6.09k ± 0%   +190.79%
Distributor_Push/timestamp_too_new-10                        2.04k ± 0%     6.05k ± 0%   +196.57%
Distributor_Push/all_samples_successfully_pushed-10           48.0 ± 0%      47.0 ± 0%     -2.08%
Distributor_Push/ingestion_rate_limit_reached-10              49.0 ± 0%      44.0 ± 0%    -10.20%

@replay (Contributor, author) commented Nov 9, 2022

I now know why the benchmarks show many more allocations on this branch: the new validation middleware shortens the .Timeseries and .Metadata slices when there are invalid entries, and the dropped entries are never returned to the pools. Will fix that.
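The shape of that fix, as a rough standalone sketch (all types simplified; ReuseTimeseries mimics the mimirpb helper of the same name, everything else is invented for illustration):

package main

import (
	"fmt"
	"sync"
)

// TimeSeries is a simplified stand-in for mimirpb.TimeSeries.
type TimeSeries struct{ Labels []string }

var tsPool = sync.Pool{New: func() any { return &TimeSeries{} }}

// ReuseTimeseries resets the object and returns it to the pool,
// mimicking mimirpb.ReuseTimeseries.
func ReuseTimeseries(ts *TimeSeries) {
	ts.Labels = ts.Labels[:0]
	tsPool.Put(ts)
}

// dropInvalid shortens the slice in place; crucially, every dropped
// element is returned to the pool first, so shrinking the slice no
// longer leaks the pooled objects the dropped entries point to.
func dropInvalid(series []*TimeSeries, invalid map[int]bool) []*TimeSeries {
	kept := series[:0]
	for i, ts := range series {
		if invalid[i] {
			ReuseTimeseries(ts)
			continue
		}
		kept = append(kept, ts)
	}
	return kept
}

func main() {
	s := []*TimeSeries{{Labels: []string{"a"}}, {Labels: []string{"b"}}, {Labels: []string{"c"}}}
	s = dropInvalid(s, map[int]bool{1: true})
	fmt.Println(len(s)) // 2 -- the dropped entry went back to tsPool
}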

@replay (Contributor, author) commented Nov 9, 2022

Commit d84ed76 fixes this in the new middleware prePushValidationMiddleware and also in the old middleware prePushRelabelMiddleware; the benchmarks look much better now:

name                                                      old time/op    new time/op    delta
Distributor_Push/too_many_labels_limit_reached-10            946µs ± 0%     397µs ± 0%  -58.02%
Distributor_Push/max_label_name_length_limit_reached-10     5.33ms ± 0%    2.65ms ± 0%  -50.28%
Distributor_Push/max_label_value_length_limit_reached-10    3.33ms ± 0%    0.41ms ± 0%  -87.82%
Distributor_Push/timestamp_too_new-10                        579µs ± 0%     385µs ± 0%  -33.43%
Distributor_Push/all_samples_successfully_pushed-10          762µs ± 0%     675µs ± 0%  -11.38%
Distributor_Push/ingestion_rate_limit_reached-10             486µs ± 0%     316µs ± 0%  -35.07%

name                                                      old alloc/op   new alloc/op   delta
Distributor_Push/too_many_labels_limit_reached-10           79.6kB ± 0%    84.3kB ± 0%   +5.85%
Distributor_Push/max_label_name_length_limit_reached-10      119kB ± 0%     121kB ± 0%   +2.19%
Distributor_Push/max_label_value_length_limit_reached-10     104kB ± 0%     106kB ± 0%   +2.30%
Distributor_Push/timestamp_too_new-10                       88.1kB ± 0%    92.8kB ± 0%   +5.38%
Distributor_Push/all_samples_successfully_pushed-10          143kB ± 0%     109kB ± 0%  -23.93%
Distributor_Push/ingestion_rate_limit_reached-10            25.5kB ± 0%     4.7kB ± 0%  -81.37%

name                                                      old allocs/op  new allocs/op  delta
Distributor_Push/too_many_labels_limit_reached-10            2.14k ± 0%     2.15k ± 0%   +0.33%
Distributor_Push/max_label_name_length_limit_reached-10      2.10k ± 0%     2.09k ± 0%   -0.14%
Distributor_Push/max_label_value_length_limit_reached-10     2.10k ± 0%     2.09k ± 0%   -0.14%
Distributor_Push/timestamp_too_new-10                        2.04k ± 0%     2.05k ± 0%   +0.34%
Distributor_Push/all_samples_successfully_pushed-10           48.0 ± 0%      45.0 ± 0%   -6.25%
Distributor_Push/ingestion_rate_limit_reached-10              49.0 ± 0%      44.0 ± 0%  -10.20%

@@ -1007,155 +1144,57 @@ func (d *Distributor) push(ctx context.Context, pushReq *push.Request) (*mimirpb
		return nil, err
	}

	d.updateReceivedMetrics(req, userID)
@replay (Contributor, author) left an inline comment:

This could be moved below the return on :1150, but then certain metrics that are currently registered with a value of 0 would not be registered at all anymore, which would require me to update the tests. I didn't do that, because I'd prefer not to touch the tests at all in this PR, so it stays easy to see that they all still pass without any changes.

@pstibrany (Member) left a comment

The PR looks fine, but I think it needs a test showing that invalid series are no longer forwarded. The test should fail on the main branch but pass here. WDYT?

I would also suggest double-checking the reuse of timeseries -- I think we don't return the yoloSlice to the pool at all.

@@ -805,6 +806,9 @@ func (d *Distributor) prePushRelabelMiddleware(next push.Func) push.Func {
	}

	if len(removeTsIndexes) > 0 {
		for _, removeTsIndex := range removeTsIndexes {
			mimirpb.ReuseTimeseries(req.Timeseries[removeTsIndex].TimeSeries)
@pstibrany (Member) left an inline comment:

I think this doesn't properly return the yoloSlice from PreallocTimeseries back to the pool. Perhaps we need to introduce a ReusePreallocTimeseries that handles both the *TimeSeries and the yoloSlice?

I was wondering whether we could simplify the cleanup by 1) keeping the original slice in the push handler and 2) reusing that original slice. However, that wouldn't work, because elements in the slice get rewritten too (e.g. in this code, when we drop some of them).

@replay (Contributor, author) replied:

I'm not sure why this wouldn't return the yoloSlice to the pool; util.RemoveSliceIndexes shouldn't replace the slice, although it may shorten it by updating its length.

Now that I think about it: if util.RemoveSliceIndexes removes index 0 of the slice, then even though the underlying array doesn't get replaced, offset 0 gets shifted forward, and I suspect that the positions in the underlying array before the new offset 0 will never be accessible again. Is that what you were referring to?
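A tiny self-contained illustration of the reslicing behavior described above (values invented):

package main

import "fmt"

func main() {
	backing := []int{10, 20, 30}
	s := backing

	// "Removing" index 0 by reslicing keeps the same backing array but
	// moves the slice's base pointer forward: the 10 is still in memory,
	// yet no longer reachable through s, so any per-element cleanup has
	// to happen before this step.
	s = s[1:]

	fmt.Println(s)       // [20 30]
	fmt.Println(backing) // [10 20 30] -- the dropped slot still exists, invisible to s
}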

@pstibrany (Member) replied Nov 11, 2022:

> I'm not sure why this wouldn't return the yoloSlice to the pool; util.RemoveSliceIndexes shouldn't replace the slice, although it may shorten it by updating its length.

1. For elements that we're going to remove, we call mimirpb.ReuseTimeseries(req.Timeseries[removeTsIndex].TimeSeries) here.
2. For the entire slice, we eventually call mimirpb.ReuseSlice(req.Timeseries), added as a "cleanup" function in push.go. We perform this call on the updated req.Timeseries, which by the time cleanup runs only contains the timeseries that were actually pushed to the ingester, not the timeseries that were removed due to validation or relabeling.

The call to mimirpb.ReuseSlice(req.Timeseries) in step 2 calls ReuseTimeseries(ts[i].TimeSeries) on the individual elements, and then also returns their yoloSlice to the pool:

func ReuseSlice(ts []PreallocTimeseries) {
	for i := range ts {
		ReuseTimeseries(ts[i].TimeSeries)
		if ts[i].yoloSlice != nil {
			reuseYoloSlice(ts[i].yoloSlice)
			ts[i].yoloSlice = nil
		}
	}
	slicePool.Put(ts[:0]) //nolint:staticcheck //see comment on slicePool for more details
}

But in step 1 we only call mimirpb.ReuseTimeseries and never return the yoloSlice to the pool. And since these elements are no longer part of req.Timeseries by the time cleanup (step 2) runs, their yoloSlice will never be returned.
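A hedged sketch of the ReusePreallocTimeseries helper suggested above, shaped after the quoted ReuseSlice; the types and pools here are simplified stand-ins, not the real mimirpb definitions:

package mimirpb

import "sync"

// Simplified stand-ins for the real mimirpb types and pools.
type TimeSeries struct{ Labels []string }

type PreallocTimeseries struct {
	*TimeSeries
	yoloSlice *[]byte
}

var (
	tsPool    = sync.Pool{New: func() any { return &TimeSeries{} }}
	yoloPool  = sync.Pool{New: func() any { b := make([]byte, 0, 256); return &b }}
	slicePool = sync.Pool{New: func() any { return make([]PreallocTimeseries, 0, 128) }}
)

func ReuseTimeseries(ts *TimeSeries) {
	ts.Labels = ts.Labels[:0]
	tsPool.Put(ts)
}

func reuseYoloSlice(b *[]byte) {
	*b = (*b)[:0]
	yoloPool.Put(b)
}

// ReusePreallocTimeseries returns both halves of one element to their
// pools, so a middleware that drops an element can't forget the
// yoloSlice -- the leak described in step 1 above.
func ReusePreallocTimeseries(ts *PreallocTimeseries) {
	ReuseTimeseries(ts.TimeSeries)
	if ts.yoloSlice != nil {
		reuseYoloSlice(ts.yoloSlice)
		ts.yoloSlice = nil
	}
}

// ReuseSlice can then be expressed in terms of the same helper.
func ReuseSlice(ts []PreallocTimeseries) {
	for i := range ts {
		ReusePreallocTimeseries(&ts[i])
	}
	slicePool.Put(ts[:0]) //nolint:staticcheck // see comment on slicePool
}

This is also essentially the refactoring the post-merge nit below asks for.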

@replay (Contributor, author) replied:

You're right, I missed that.

Fixed here: ead8758

@pstibrany (Member) replied:

Thanks, fix looks good!

replay and others added 9 commits on November 10, 2022 at 23:13 (all Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>; one Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>)
@replay force-pushed the distributor_validate_forwarded_samples branch from d84ed76 to 193c8f2 on November 10, 2022 at 23:15
@replay (Contributor, author) commented Nov 10, 2022

> PR looks fine, but I think it needs a test showing that invalid series are no longer forwarded. Test should fail on main branch, but will pass here. WDYT?

That's a good idea. I added this test: https://github.com/grafana/mimir/pull/3386/files#diff-16b7662ec0e247fa72bf86b63ec8558a8cea51ac2280baf05a25887578132c44R3203-R3331

I force-pushed in order to move the test to the beginning of the PR as its first commit. This makes it possible to verify that the test fails without any of the other changes by checking that commit out directly; when checking out the last commit of the PR, it passes:

replay@Mauros-MBP mimir % git checkout 7d32a2d125a2f4e5397ad035e1a1b729795f838f
...
HEAD is now at 7d32a2d12 add test to show that validation happens before forwarding
replay@Mauros-MBP mimir % go test -run TestValidationBeforeForwarding github.com/grafana/mimir/pkg/distributor 
--- FAIL: TestValidationBeforeForwarding (1.00s)
    distributor_test.go:3318: 
                Error Trace:    /Users/replay/go/src/github.com/grafana/mimir/pkg/distributor/distributor_test.go:3318
                Error:          An error is expected but got nil.
                Test:           TestValidationBeforeForwarding
    distributor_test.go:3320: 
                Error Trace:    /Users/replay/go/src/github.com/grafana/mimir/pkg/distributor/distributor_test.go:3320
                Error:          Should be true
                Test:           TestValidationBeforeForwarding
FAIL
FAIL    github.com/grafana/mimir/pkg/distributor        2.352s
FAIL
replay@Mauros-MBP mimir % git checkout 193c8f220943bc6c11df9518eb98a02d375825de
Previous HEAD position was 7d32a2d12 add test to show that validation happens before forwarding
HEAD is now at 193c8f220 return timeseries to pool
replay@Mauros-MBP mimir % go test -run TestValidationBeforeForwarding github.com/grafana/mimir/pkg/distributor
ok      github.com/grafana/mimir/pkg/distributor        1.506s

@pstibrany (Member) commented:

Thanks for adding the test.

@pstibrany (Member) left a review

LGTM, thank you!

@replay (Contributor, author) commented Nov 11, 2022

After the latest changes I ran the distributor benchmark one more time, comparing the commit this PR is based on against the last commit of the PR. It looks good:

# git checkout c512e2bdadcac5148060e3e0ad99545969449d15
# go test -run=none -bench=BenchmarkDistributor_Push  ./pkg/distributor > before.bench 2>&1
# git checkout distributor_validate_forwarded_samples
# go test -run=none -bench=BenchmarkDistributor_Push  ./pkg/distributor > after.bench 2>&1
replay@Mauros-MBP mimir % benchstat -delta-test none ./before.bench ./after.bench 
name                                                      old time/op    new time/op    delta
Distributor_Push/all_samples_successfully_pushed-10          738µs ± 0%     674µs ± 0%   -8.71%
Distributor_Push/ingestion_rate_limit_reached-10             489µs ± 0%     322µs ± 0%  -34.12%
Distributor_Push/too_many_labels_limit_reached-10            948µs ± 0%     399µs ± 0%  -57.93%
Distributor_Push/max_label_name_length_limit_reached-10     5.33ms ± 0%    2.67ms ± 0%  -49.86%
Distributor_Push/max_label_value_length_limit_reached-10    3.11ms ± 0%    0.40ms ± 0%  -87.06%
Distributor_Push/timestamp_too_new-10                        575µs ± 0%     392µs ± 0%  -31.77%

name                                                      old alloc/op   new alloc/op   delta
Distributor_Push/all_samples_successfully_pushed-10          143kB ± 0%     109kB ± 0%  -24.01%
Distributor_Push/ingestion_rate_limit_reached-10            25.4kB ± 0%     4.7kB ± 0%  -81.43%
Distributor_Push/too_many_labels_limit_reached-10           79.6kB ± 0%    84.3kB ± 0%   +5.84%
Distributor_Push/max_label_name_length_limit_reached-10      119kB ± 0%     121kB ± 0%   +2.16%
Distributor_Push/max_label_value_length_limit_reached-10     104kB ± 0%     106kB ± 0%   +2.36%
Distributor_Push/timestamp_too_new-10                       88.0kB ± 0%    92.8kB ± 0%   +5.44%

name                                                      old allocs/op  new allocs/op  delta
Distributor_Push/all_samples_successfully_pushed-10           49.0 ± 0%      45.0 ± 0%   -8.16%
Distributor_Push/ingestion_rate_limit_reached-10              49.0 ± 0%      44.0 ± 0%  -10.20%
Distributor_Push/too_many_labels_limit_reached-10            2.14k ± 0%     2.15k ± 0%   +0.33%
Distributor_Push/max_label_name_length_limit_reached-10      2.10k ± 0%     2.09k ± 0%   -0.05%
Distributor_Push/max_label_value_length_limit_reached-10     2.10k ± 0%     2.09k ± 0%   -0.14%
Distributor_Push/timestamp_too_new-10                        2.04k ± 0%     2.05k ± 0%   +0.34%

@replay merged commit 5356edd into main on Nov 11, 2022
@replay deleted the distributor_validate_forwarded_samples branch on November 11, 2022 at 14:04
56quarters added a commit that referenced this pull request Nov 15, 2022
* Revert "Distributor push wrapper should only receive unforwarded samples. (#2980)"

This reverts commit 3d14b39.

See grafana/mimir-squad#973

* Revert "move validation in distributor into separate middleware (#3386)"

This reverts commit 5356edd.

See grafana/mimir-squad#973

* Pin doc-validator version

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
	}

	if ts.yoloSlice != nil {
		reuseYoloSlice(ts.yoloSlice)
A collaborator left an inline comment:

[nit] I would set ts.yoloSlice to nil and then reuse this function in ReuseSlice().

A member replied:

Done in #3464

masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
* add test to show that validation happens before forwarding

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* move validation in distributor into separate middleware

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* fixes

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* cleanup

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* dont send empty requests to ingesters and fix metrics

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* fix error handling

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* changelog

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* PR feedback

Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>

* return timeseries to pool

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* reuse yoloSlice correctly

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>