
move validation in distributor into separate middleware #3386

Merged: 10 commits into main from distributor_validate_forwarded_samples on Nov 11, 2022

Conversation

@replay (Contributor) commented Nov 4, 2022

This change moves the validation of incoming series ahead of the Distributor's forwarding functionality, so that invalid series are no longer forwarded.
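For readers skimming the diff, a minimal standalone sketch of the middleware-chaining pattern this PR relies on -- not Mimir's actual wiring: pushFunc, validateMiddleware, and forwardMiddleware are invented stand-ins for push.Func, the new prePushValidationMiddleware, and the forwarding stage.

package main

import "fmt"

// request is a simplified stand-in for a remote-write push request.
type request struct{ series []string }

// pushFunc mirrors the shape of a push handler; each middleware wraps
// the next stage and may drop or rewrite series before passing the
// request on.
type pushFunc func(req *request) error

func validateMiddleware(next pushFunc) pushFunc {
	return func(req *request) error {
		valid := req.series[:0]
		for _, s := range req.series {
			if s != "" { // stand-in for real per-series validation
				valid = append(valid, s)
			}
		}
		req.series = valid
		return next(req) // later stages only ever see validated series
	}
}

func forwardMiddleware(next pushFunc) pushFunc {
	return func(req *request) error {
		fmt.Printf("forwarding %d validated series\n", len(req.series))
		return next(req)
	}
}

func main() {
	// Validation wraps forwarding, so it runs first -- the ordering
	// this PR establishes.
	push := validateMiddleware(forwardMiddleware(func(req *request) error {
		fmt.Printf("ingesting %d series\n", len(req.series))
		return nil
	}))
	_ = push(&request{series: []string{"up", "", "go_goroutines"}})
}

Because validation wraps forwarding, the request that reaches the forwarding stage has already been stripped of invalid series.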

@replay force-pushed the distributor_validate_forwarded_samples branch from 4ac3590 to 2d6da45 on November 9, 2022 at 11:59
@replay marked this pull request as ready for review on November 9, 2022 at 12:13
@replay requested a review from a team as a code owner on November 9, 2022 at 12:13
@osg-grafana added the type/docs (Improvements or additions to documentation) label on Nov 9, 2022
@osg-grafana requested a review from a team on November 9, 2022 at 12:23
@osg-grafana (Contributor) commented Nov 9, 2022

Meta, please ignore: testing out the new subteam of the Docs Squad.

Review thread on CHANGELOG.md (outdated, resolved)

@osg-grafana (Contributor) left a comment

A nit

@replay (Contributor, author) commented Nov 9, 2022

I just ran the benchmark BenchmarkDistributor_Push and compared the results of main against this branch. I'm concerned about the increase in allocations; I'll try to optimize that:

name                                                      old time/op    new time/op    delta
Distributor_Push/too_many_labels_limit_reached-10            946µs ± 0%     892µs ± 0%     -5.65%
Distributor_Push/max_label_name_length_limit_reached-10     5.33ms ± 0%    2.95ms ± 0%    -44.69%
Distributor_Push/max_label_value_length_limit_reached-10    3.33ms ± 0%    0.62ms ± 0%    -81.52%
Distributor_Push/timestamp_too_new-10                        579µs ± 0%     636µs ± 0%     +9.95%
Distributor_Push/all_samples_successfully_pushed-10          762µs ± 0%     696µs ± 0%     -8.56%
Distributor_Push/ingestion_rate_limit_reached-10             486µs ± 0%     328µs ± 0%    -32.53%

name                                                      old alloc/op   new alloc/op   delta
Distributor_Push/too_many_labels_limit_reached-10           79.6kB ± 0%  2293.3kB ± 0%  +2780.62%
Distributor_Push/max_label_name_length_limit_reached-10      119kB ± 0%    1047kB ± 0%   +782.56%
Distributor_Push/max_label_value_length_limit_reached-10     104kB ± 0%    1034kB ± 0%   +896.58%
Distributor_Push/timestamp_too_new-10                       88.1kB ± 0%  1021.0kB ± 0%  +1059.20%
Distributor_Push/all_samples_successfully_pushed-10          143kB ± 0%     109kB ± 0%    -23.53%
Distributor_Push/ingestion_rate_limit_reached-10            25.5kB ± 0%     4.7kB ± 0%    -81.38%

name                                                      old allocs/op  new allocs/op  delta
Distributor_Push/too_many_labels_limit_reached-10            2.14k ± 0%     7.15k ± 0%   +233.96%
Distributor_Push/max_label_name_length_limit_reached-10      2.10k ± 0%     6.08k ± 0%   +190.31%
Distributor_Push/max_label_value_length_limit_reached-10     2.10k ± 0%     6.09k ± 0%   +190.79%
Distributor_Push/timestamp_too_new-10                        2.04k ± 0%     6.05k ± 0%   +196.57%
Distributor_Push/all_samples_successfully_pushed-10           48.0 ± 0%      47.0 ± 0%     -2.08%
Distributor_Push/ingestion_rate_limit_reached-10              49.0 ± 0%      44.0 ± 0%    -10.20%

@replay (Contributor, author) commented Nov 9, 2022

I now know why the benchmarks show many more allocations on this branch: the new validation middleware shortens the .Timeseries and .Metadata slices when there are invalid entries, and the dropped entries are never returned to the pools. Will fix that.
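The shape of that fix, as a rough standalone sketch (all types simplified; ReuseTimeseries mimics the mimirpb helper of the same name, everything else is invented for illustration):

package main

import (
	"fmt"
	"sync"
)

// TimeSeries is a simplified stand-in for mimirpb.TimeSeries.
type TimeSeries struct{ Labels []string }

var tsPool = sync.Pool{New: func() any { return &TimeSeries{} }}

// ReuseTimeseries resets the object and returns it to the pool,
// mimicking mimirpb.ReuseTimeseries.
func ReuseTimeseries(ts *TimeSeries) {
	ts.Labels = ts.Labels[:0]
	tsPool.Put(ts)
}

// dropInvalid shortens the slice in place; crucially, every dropped
// element is returned to the pool first, so shrinking the slice no
// longer leaks the pooled objects the dropped entries point to.
func dropInvalid(series []*TimeSeries, invalid map[int]bool) []*TimeSeries {
	kept := series[:0]
	for i, ts := range series {
		if invalid[i] {
			ReuseTimeseries(ts)
			continue
		}
		kept = append(kept, ts)
	}
	return kept
}

func main() {
	s := []*TimeSeries{{Labels: []string{"a"}}, {Labels: []string{"b"}}, {Labels: []string{"c"}}}
	s = dropInvalid(s, map[int]bool{1: true})
	fmt.Println(len(s)) // 2 -- the dropped entry went back to tsPool
}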

@replay (Contributor, author) commented Nov 9, 2022

Commit d84ed76 fixes this in the new middleware prePushValidationMiddleware and also in the old middleware prePushRelabelMiddleware; the benchmarks look much better now:

name                                                      old time/op    new time/op    delta
Distributor_Push/too_many_labels_limit_reached-10            946µs ± 0%     397µs ± 0%  -58.02%
Distributor_Push/max_label_name_length_limit_reached-10     5.33ms ± 0%    2.65ms ± 0%  -50.28%
Distributor_Push/max_label_value_length_limit_reached-10    3.33ms ± 0%    0.41ms ± 0%  -87.82%
Distributor_Push/timestamp_too_new-10                        579µs ± 0%     385µs ± 0%  -33.43%
Distributor_Push/all_samples_successfully_pushed-10          762µs ± 0%     675µs ± 0%  -11.38%
Distributor_Push/ingestion_rate_limit_reached-10             486µs ± 0%     316µs ± 0%  -35.07%

name                                                      old alloc/op   new alloc/op   delta
Distributor_Push/too_many_labels_limit_reached-10           79.6kB ± 0%    84.3kB ± 0%   +5.85%
Distributor_Push/max_label_name_length_limit_reached-10      119kB ± 0%     121kB ± 0%   +2.19%
Distributor_Push/max_label_value_length_limit_reached-10     104kB ± 0%     106kB ± 0%   +2.30%
Distributor_Push/timestamp_too_new-10                       88.1kB ± 0%    92.8kB ± 0%   +5.38%
Distributor_Push/all_samples_successfully_pushed-10          143kB ± 0%     109kB ± 0%  -23.93%
Distributor_Push/ingestion_rate_limit_reached-10            25.5kB ± 0%     4.7kB ± 0%  -81.37%

name                                                      old allocs/op  new allocs/op  delta
Distributor_Push/too_many_labels_limit_reached-10            2.14k ± 0%     2.15k ± 0%   +0.33%
Distributor_Push/max_label_name_length_limit_reached-10      2.10k ± 0%     2.09k ± 0%   -0.14%
Distributor_Push/max_label_value_length_limit_reached-10     2.10k ± 0%     2.09k ± 0%   -0.14%
Distributor_Push/timestamp_too_new-10                        2.04k ± 0%     2.05k ± 0%   +0.34%
Distributor_Push/all_samples_successfully_pushed-10           48.0 ± 0%      45.0 ± 0%   -6.25%
Distributor_Push/ingestion_rate_limit_reached-10              49.0 ± 0%      44.0 ± 0%  -10.20%

@@ -1007,155 +1144,57 @@ func (d *Distributor) push(ctx context.Context, pushReq *push.Request) (*mimirpb
		return nil, err
	}

	d.updateReceivedMetrics(req, userID)
@replay (Contributor, author) left an inline comment:

This could be moved below the return on :1150, but then certain metrics that are currently registered with a value of 0 would not be registered at all anymore, which would require me to update the tests. I didn't do that, because I'd prefer not to touch the tests at all in this PR, so it stays easy to see that they all still pass without any changes.

@pstibrany (Member) left a comment

The PR looks fine, but I think it needs a test showing that invalid series are no longer forwarded. The test should fail on the main branch but pass here. WDYT?

I would also suggest double-checking the reuse of timeseries -- I think we don't return the yoloSlice to the pool at all.

@@ -805,6 +806,9 @@ func (d *Distributor) prePushRelabelMiddleware(next push.Func) push.Func {
	}

	if len(removeTsIndexes) > 0 {
		for _, removeTsIndex := range removeTsIndexes {
			mimirpb.ReuseTimeseries(req.Timeseries[removeTsIndex].TimeSeries)
@pstibrany (Member) left an inline comment:

I think this doesn't properly return the yoloSlice from PreallocTimeseries back to the pool. Perhaps we need to introduce a ReusePreallocTimeseries that handles both the *TimeSeries and the yoloSlice?

I was wondering whether we could simplify the cleanup by 1) keeping the original slice in the push handler and 2) reusing that original slice. However, that wouldn't work, because elements in the slice get rewritten too (e.g. in this code, when we drop some of them).

@replay (Contributor, author) replied:

I'm not sure why this wouldn't return the yoloSlice to the pool; util.RemoveSliceIndexes shouldn't replace the slice, although it may shorten it by updating its length.

Now that I think about it: if util.RemoveSliceIndexes removes index 0 of the slice, then even though the underlying array doesn't get replaced, offset 0 gets shifted forward, and I suspect that the positions in the underlying array before the new offset 0 will never be accessible again. Is that what you were referring to?
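A tiny self-contained illustration of the reslicing behavior described above (values invented):

package main

import "fmt"

func main() {
	backing := []int{10, 20, 30}
	s := backing

	// "Removing" index 0 by reslicing keeps the same backing array but
	// moves the slice's base pointer forward: the 10 is still in memory,
	// yet no longer reachable through s, so any per-element cleanup has
	// to happen before this step.
	s = s[1:]

	fmt.Println(s)       // [20 30]
	fmt.Println(backing) // [10 20 30] -- the dropped slot still exists, invisible to s
}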

@pstibrany (Member) replied Nov 11, 2022:

> I'm not sure why this wouldn't return the yoloSlice to the pool; util.RemoveSliceIndexes shouldn't replace the slice, although it may shorten it by updating its length.

1. For elements that we're going to remove, we call mimirpb.ReuseTimeseries(req.Timeseries[removeTsIndex].TimeSeries) here.
2. For the entire slice, we eventually call mimirpb.ReuseSlice(req.Timeseries), added as a "cleanup" function in push.go. We perform this call on the updated req.Timeseries, which by the time cleanup runs only contains the timeseries that were actually pushed to the ingester, not the timeseries that were removed due to validation or relabeling.

The call to mimirpb.ReuseSlice(req.Timeseries) in step 2 calls ReuseTimeseries(ts[i].TimeSeries) on the individual elements, and then also returns their yoloSlice to the pool:

func ReuseSlice(ts []PreallocTimeseries) {
	for i := range ts {
		ReuseTimeseries(ts[i].TimeSeries)
		if ts[i].yoloSlice != nil {
			reuseYoloSlice(ts[i].yoloSlice)
			ts[i].yoloSlice = nil
		}
	}
	slicePool.Put(ts[:0]) //nolint:staticcheck //see comment on slicePool for more details
}

But in step 1 we only call mimirpb.ReuseTimeseries and never return the yoloSlice to the pool. And since these elements are no longer part of req.Timeseries by the time cleanup (step 2) runs, their yoloSlice will never be returned.
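A hedged sketch of the ReusePreallocTimeseries helper suggested above, shaped after the quoted ReuseSlice; the types and pools here are simplified stand-ins, not the real mimirpb definitions:

package mimirpb

import "sync"

// Simplified stand-ins for the real mimirpb types and pools.
type TimeSeries struct{ Labels []string }

type PreallocTimeseries struct {
	*TimeSeries
	yoloSlice *[]byte
}

var (
	tsPool    = sync.Pool{New: func() any { return &TimeSeries{} }}
	yoloPool  = sync.Pool{New: func() any { b := make([]byte, 0, 256); return &b }}
	slicePool = sync.Pool{New: func() any { return make([]PreallocTimeseries, 0, 128) }}
)

func ReuseTimeseries(ts *TimeSeries) {
	ts.Labels = ts.Labels[:0]
	tsPool.Put(ts)
}

func reuseYoloSlice(b *[]byte) {
	*b = (*b)[:0]
	yoloPool.Put(b)
}

// ReusePreallocTimeseries returns both halves of one element to their
// pools, so a middleware that drops an element can't forget the
// yoloSlice -- the leak described in step 1 above.
func ReusePreallocTimeseries(ts *PreallocTimeseries) {
	ReuseTimeseries(ts.TimeSeries)
	if ts.yoloSlice != nil {
		reuseYoloSlice(ts.yoloSlice)
		ts.yoloSlice = nil
	}
}

// ReuseSlice can then be expressed in terms of the same helper.
func ReuseSlice(ts []PreallocTimeseries) {
	for i := range ts {
		ReusePreallocTimeseries(&ts[i])
	}
	slicePool.Put(ts[:0]) //nolint:staticcheck // see comment on slicePool
}

This is also essentially the refactoring the post-merge nit below asks for.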

@replay (Contributor, author) replied:

You're right, I missed that.

Fixed here: ead8758

@pstibrany (Member) replied:

Thanks, fix looks good!

replay and others added 9 commits on November 10, 2022 at 23:13 (all Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>; one Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>)
@replay force-pushed the distributor_validate_forwarded_samples branch from d84ed76 to 193c8f2 on November 10, 2022 at 23:15
@replay (Contributor, author) commented Nov 10, 2022

> PR looks fine, but I think it needs a test showing that invalid series are no longer forwarded. Test should fail on main branch, but will pass here. WDYT?

That's a good idea. I added this test: https://github.com/grafana/mimir/pull/3386/files#diff-16b7662ec0e247fa72bf86b63ec8558a8cea51ac2280baf05a25887578132c44R3203-R3331

I force-pushed in order to move the test to the beginning of the PR as its first commit. This makes it possible to verify that the test fails without any of the other changes by checking that commit out directly; when checking out the last commit of the PR, it passes:

replay@Mauros-MBP mimir % git checkout 7d32a2d125a2f4e5397ad035e1a1b729795f838f
...
HEAD is now at 7d32a2d12 add test to show that validation happens before forwarding
replay@Mauros-MBP mimir % go test -run TestValidationBeforeForwarding github.com/grafana/mimir/pkg/distributor 
--- FAIL: TestValidationBeforeForwarding (1.00s)
    distributor_test.go:3318: 
                Error Trace:    /Users/replay/go/src/github.com/grafana/mimir/pkg/distributor/distributor_test.go:3318
                Error:          An error is expected but got nil.
                Test:           TestValidationBeforeForwarding
    distributor_test.go:3320: 
                Error Trace:    /Users/replay/go/src/github.com/grafana/mimir/pkg/distributor/distributor_test.go:3320
                Error:          Should be true
                Test:           TestValidationBeforeForwarding
FAIL
FAIL    github.com/grafana/mimir/pkg/distributor        2.352s
FAIL
replay@Mauros-MBP mimir % git checkout 193c8f220943bc6c11df9518eb98a02d375825de
Previous HEAD position was 7d32a2d12 add test to show that validation happens before forwarding
HEAD is now at 193c8f220 return timeseries to pool
replay@Mauros-MBP mimir % go test -run TestValidationBeforeForwarding github.com/grafana/mimir/pkg/distributor
ok      github.com/grafana/mimir/pkg/distributor        1.506s

@pstibrany (Member) commented:

Thanks for adding the test.

@pstibrany (Member) left a review

LGTM, thank you!

@replay (Contributor, author) commented Nov 11, 2022

After the latest changes I ran the distributor benchmark one more time, comparing the commit this PR is based on against the last commit of the PR. It looks good:

# git checkout c512e2bdadcac5148060e3e0ad99545969449d15
# go test -run=none -bench=BenchmarkDistributor_Push  ./pkg/distributor > before.bench 2>&1
# git checkout distributor_validate_forwarded_samples
# go test -run=none -bench=BenchmarkDistributor_Push  ./pkg/distributor > after.bench 2>&1
replay@Mauros-MBP mimir % benchstat -delta-test none ./before.bench ./after.bench 
name                                                      old time/op    new time/op    delta
Distributor_Push/all_samples_successfully_pushed-10          738µs ± 0%     674µs ± 0%   -8.71%
Distributor_Push/ingestion_rate_limit_reached-10             489µs ± 0%     322µs ± 0%  -34.12%
Distributor_Push/too_many_labels_limit_reached-10            948µs ± 0%     399µs ± 0%  -57.93%
Distributor_Push/max_label_name_length_limit_reached-10     5.33ms ± 0%    2.67ms ± 0%  -49.86%
Distributor_Push/max_label_value_length_limit_reached-10    3.11ms ± 0%    0.40ms ± 0%  -87.06%
Distributor_Push/timestamp_too_new-10                        575µs ± 0%     392µs ± 0%  -31.77%

name                                                      old alloc/op   new alloc/op   delta
Distributor_Push/all_samples_successfully_pushed-10          143kB ± 0%     109kB ± 0%  -24.01%
Distributor_Push/ingestion_rate_limit_reached-10            25.4kB ± 0%     4.7kB ± 0%  -81.43%
Distributor_Push/too_many_labels_limit_reached-10           79.6kB ± 0%    84.3kB ± 0%   +5.84%
Distributor_Push/max_label_name_length_limit_reached-10      119kB ± 0%     121kB ± 0%   +2.16%
Distributor_Push/max_label_value_length_limit_reached-10     104kB ± 0%     106kB ± 0%   +2.36%
Distributor_Push/timestamp_too_new-10                       88.0kB ± 0%    92.8kB ± 0%   +5.44%

name                                                      old allocs/op  new allocs/op  delta
Distributor_Push/all_samples_successfully_pushed-10           49.0 ± 0%      45.0 ± 0%   -8.16%
Distributor_Push/ingestion_rate_limit_reached-10              49.0 ± 0%      44.0 ± 0%  -10.20%
Distributor_Push/too_many_labels_limit_reached-10            2.14k ± 0%     2.15k ± 0%   +0.33%
Distributor_Push/max_label_name_length_limit_reached-10      2.10k ± 0%     2.09k ± 0%   -0.05%
Distributor_Push/max_label_value_length_limit_reached-10     2.10k ± 0%     2.09k ± 0%   -0.14%
Distributor_Push/timestamp_too_new-10                        2.04k ± 0%     2.05k ± 0%   +0.34%

@replay merged commit 5356edd into main on Nov 11, 2022
@replay deleted the distributor_validate_forwarded_samples branch on November 11, 2022 at 14:04
56quarters added a commit that referenced this pull request Nov 15, 2022
* Revert "Distributor push wrapper should only receive unforwarded samples. (#2980)"

This reverts commit 3d14b39.

See grafana/mimir-squad#973

* Revert "move validation in distributor into separate middleware (#3386)"

This reverts commit 5356edd.

See grafana/mimir-squad#973

* Pin doc-validator version

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
	}

	if ts.yoloSlice != nil {
		reuseYoloSlice(ts.yoloSlice)
A collaborator left an inline comment:

[nit] I would set ts.yoloSlice to nil and then reuse this function in ReuseSlice().

A member replied:

Done in #3464

masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
* add test to show that validation happens before forwarding

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* move validation in distributor into separate middleware

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* fixes

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* cleanup

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* dont send empty requests to ingesters and fix metrics

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* fix error handling

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* changelog

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* PR feedback

Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>

* return timeseries to pool

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* reuse yoloSlice correctly

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>