Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Improve Slice pool hit rate #1998

Merged
merged 8 commits into from
Sep 1, 2021
Merged

Improve Slice pool hit rate #1998

merged 8 commits into from
Sep 1, 2021

Conversation

shanson7
Copy link
Collaborator

These changes are to try to improve the SlicePool hit rate. Our query nodes have a very disappointing hit rate of ~80%.

I believe the reason is that most queries handle raw series that aren't 2000 datapoints long, yet will be added to the pool. Then most functions will be asking for 2000 (because they don't specify what they need). Because unfit slices are added back to the pool, we could (unverified) also be repeatedly grabbing the same low capacity slice over and over again.

So these changes:

  1. Call GetMin everywhere we have a good idea of how many points we need
  2. Make the min/default slice size configurable
  3. Unbias PointSlicePool metrics - The metrics as they existed lumped large with default, making it difficult to know how often the default size was insufficiently large.

Copy link
Contributor

@Dieterbe Dieterbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. pinging @robert-milan / @colega / @jesusvazquez as they have an interest in the expr library now too.

@@ -57,6 +58,7 @@ func ConfigSetup() {
apiCfg.BoolVar(&optimizations.PreNormalization, "pre-normalization", true, "enable pre-normalization optimization")
apiCfg.BoolVar(&optimizations.MDP, "mdp-optimization", false, "enable MaxDataPoints optimization (experimental)")
apiCfg.BoolVar(&middleware.LogHeaders, "log-headers", false, "output query headers in logs")
apiCfg.UintVar(&minSliceSize, "min-slice-pool-size", 2000, "Minimum (and default) length of slice to allocate from pool")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add this to metrictank-sample.ini and then run scripts/dev/sync-configs.sh

api/init.go Outdated
@@ -9,7 +9,7 @@ import (
var pointSlicePool *pointslicepool.PointSlicePool

func init() {
pointSlicePool = pointslicepool.New(pointslicepool.DefaultPointSliceSize)
pointSlicePool = pointslicepool.New(int(minSliceSize))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means we can now delete DefaultPointSliceSize from the code

p.putLarge.Inc()
} else {
} else if cap(s) < p.defaultSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we no longer track puts and get-makes if they equal defaultSize.
I can see how that helps with troubleshooting the kind of efficiencies you're looking into. OTOH only having metrics for a subset of the puts and get-makes is a bit weird, especially because the sum no longer relates to the get-candidate metrics. should we introduce get-make-default and put-default (or get-make-min and put-min) to track these? or are all these metrics getting out of hand (seems we rarely use them. but then again, that goes for most of the metrics i guess)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I debated this back and forth with myself. I don't want to over-instrument the code and these metrics aren't generally interesting except when trying to determine the efficacy of the slice pool in reducing allocations (which is exactly what I was just doing).

Ultimately, it would be nice to know:

  • Allocations/bytes saved using the pool
  • Allocation/bytes required through the pool (and maybe additional bytes due to default size)
  • Reason the allocation is required (miss or unfit)

I'm not sure I find the "put" metrics that interesting, except maybe in determining that we are "losing" slices that should be returned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lean towards adding in get-make-min and put-min, to make the set of metrics coherent.
if we decide the metrics are not (or no longer) useful, we can take them out all at once.

Copy link
Contributor

@Dieterbe Dieterbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
just want to keep this open for a while to give @robert-milan / @colega / @jesusvazquez a chance to check this out

@colega
Copy link
Contributor

colega commented Sep 1, 2021

Thanks for the heads up @Dieterbe, reviewed and makes sense to me.

Offtopic: we're importing this in a codebase that is instrumented with Prometheus so I'd like to make the instrumentation configurable in a separate PR (should be straightforward since both Prometheus and Graphite counters are interface { Inc() })

@Dieterbe Dieterbe merged commit 2833b60 into grafana:master Sep 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants