
Sample stats: total and peak #10369

Merged
merged 3 commits into from Mar 21, 2022

Conversation

@aughr (Contributor) commented Feb 28, 2022

This PR covers most of what we proposed in #10181. It's a combination of my work, @alanprot's work, and @Harkishen-Singh's work.

It adds new stats:

  • totalQueryableSamples: The total number of samples read out of the underlying Queryable instance inclusive of any samples that are buffered between range steps.
  • totalQueryableSamplesPerStep: The above statistic tracked per step in a range query. These will always add up to the value in totalQueryableSamples. The engine does not compute or report these by default. We only do the work and report the data if an engine flag is set (available via --enable-feature per-step-stats from the CLI) and a per-query flag is set (stats=all in the request). We introduce a new QueryOpts struct in the Engine's API to make having these kinds of per-query options easier in the future.
  • peakSamples: The high-water mark of the number of samples compared against MaxSamples during the course of query execution.
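As a sketch of how the two opt-ins combine (flag and parameter spellings here are taken from this PR's description and may differ in a released version):

```shell
# Engine-level opt-in: start Prometheus with the feature flag so the
# engine does the per-step bookkeeping at all.
prometheus --enable-feature=per-step-stats

# Per-query opt-in: request stats=all on the range-query endpoint.
# totalQueryableSamples and peakSamples are reported even without it.
curl 'http://localhost:9090/api/v1/query_range?query=max_over_time(up[5m])&start=1645813076&end=1645813376&step=30s&stats=all'
```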

Per @bboreham's feedback, this PR also changes the query stats response from a struct to an interface and adds a way to override that response. Implementations like Cortex can use this to read stats from the Context and insert them into the response that's sent to the user and/or remove information from the stats response.

This does not include the series stats we proposed because we have some more thinking to do about how best to represent them. More to come on that front.

The performance impact of the new statistics that are always on should be negligible per our local benchmarking. Even the per-step statistics don't have much impact outside the combination of very simple queries with many steps. I'll upload these benchmark runs in a bit.

An example JSON response with all these stats enabled, for max_over_time(up[5m]) with 30-second steps over 5 minutes. The underlying data here started only a minute before running the query:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "values": [
          [
            1645813076,
            "1"
          ],
          [
            1645813106,
            "1"
          ],
          [
            1645813136,
            "1"
          ],
          [
            1645813166,
            "1"
          ],
          [
            1645813196,
            "1"
          ],
          [
            1645813226,
            "1"
          ],
          [
            1645813256,
            "1"
          ],
          [
            1645813286,
            "1"
          ],
          [
            1645813316,
            "1"
          ],
          [
            1645813346,
            "1"
          ],
          [
            1645813376,
            "1"
          ]
        ]
      }
    ],
    "stats": {
      "timings": {
        "evalTotalTime": 0.006922879,
        "resultSortTime": 2.677e-06,
        "queryPreparationTime": 0.003402187,
        "innerEvalTime": 0.002431834,
        "execQueueTime": 0.000819559,
        "execTotalTime": 0.008257353
      },
      "samples": {
        "totalQueryableSamplesPerStep": [
          [
            1645813076,
            "4"
          ],
          [
            1645813106,
            "6"
          ],
          [
            1645813136,
            "8"
          ],
          [
            1645813166,
            "10"
          ],
          [
            1645813196,
            "12"
          ],
          [
            1645813226,
            "14"
          ],
          [
            1645813256,
            "16"
          ],
          [
            1645813286,
            "18"
          ],
          [
            1645813316,
            "20"
          ],
          [
            1645813346,
            "20"
          ],
          [
            1645813376,
            "20"
          ]
        ],
        "totalQueryableSamples": 148,
        "peakSamples": 31
      }
    }
  }
}

@beorn7 (Member) commented Mar 3, 2022

/prombench edfe657

@prombot (Contributor) commented Mar 3, 2022

Incorrect prombench syntax, please find correct syntax here.

@beorn7 (Member) commented Mar 3, 2022

/prombench main

@prombot (Contributor) commented Mar 3, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10369 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@roidelapluie (Member) left a comment

Thanks, in general, this LGTM.

Do we need to compute the peak and total? Could we leave that to the caller instead?


type Stats struct {
	TimerStats  *QueryTimers
	SampleStats *QuerySamples
Member

If we are concerned about impact, we could later make that an interface, so that when it is disabled we have a NoopSampleStats with less impact. I see @beorn7 has launched prombench already; it might not be needed.

cmd/prometheus/main.go (outdated, resolved)
promql/engine.go (outdated)
@@ -379,12 +394,12 @@ func (ng *Engine) SetQueryLogger(l QueryLogger) {
 }

 // NewInstantQuery returns an evaluation query for the given expression at the given time.
-func (ng *Engine) NewInstantQuery(q storage.Queryable, qs string, ts time.Time) (Query, error) {
+func (ng *Engine) NewInstantQuery(q storage.Queryable, opts QueryOpts, qs string, ts time.Time) (Query, error) {
Member

I really like this, because we will want a per-query lookback delta later on; it could be a query opt as well.

Member

Can we have QueryOpts as a pointer? We are passing it downstream multiple times, creating unneeded allocs. It would also be nicer in tests.

@prombot (Contributor) commented Mar 6, 2022

Benchmark tests are running for 3 days! If this is intended, ignore this message; otherwise you can cancel it by commenting: /prombench cancel

@beorn7 (Member) commented Mar 7, 2022

/prombench cancel

I assume you have seen enough benchmark data.

@prombot (Contributor) commented Mar 7, 2022

Benchmark cancel is in progress.

@alanprot (Contributor) commented Mar 7, 2022

Hi @roidelapluie, thanks a lot for looking into this.

Do we need to compute the peak and total? Could we leave that to the caller instead?

I don't think it's possible to compute one stat from the other, as they track different things: peak is not max(totalQueryableSamples) but the maximum number of samples kept in memory during query execution. The peak stat shows how close the query came to the query.max-samples limit, while totalQueryableSamples shows how many samples were read to run the query.

@roidelapluie (Member)


I mean, can we track them only per step, and not compute the overall maximum?

@aughr (Contributor, author) commented Mar 8, 2022

I mean, can we track them only per step, and not compute the overall maximum?

We're currently not recording peak samples per step, just the total samples queried. I think it's doable, so if you'd like to see it in this PR, I can take a shot at it. That said, because per-step stats are disabled by default, while overall stats are always on, I'd prefer to always compute overall stats.

@aughr (Contributor, author) commented Mar 11, 2022

One thing I've been thinking about doing before landing this is changing the stats suffix from PerTime to PerStep. Thoughts @roidelapluie and others?

@roidelapluie (Member)

👍 for PerStep

@roidelapluie (Member)

close/reopen to trigger circleci

@roidelapluie (Member) left a comment

Thanks! I've had time to review and test the pull request.

Somehow,

"samples": {
  "peakSamples": 153,
  "totalQueryableSamples": 126
},

are always reported. I think it's fine, we could even later add those fields to the UI.

In my review, I noted a wrong change in tsdb and comments that do not end with a full stop. The feature flag description is also incorrect.

Thanks!

Review comments on: cmd/prometheus/main.go, promql/engine.go, promql/engine_test.go, tsdb/head_test.go (all outdated, resolved).
@roidelapluie (Member)

Windows tests should be fixed by rebasing on top of main.

alanprot and others added 3 commits March 21, 2022 09:57
We always track total samples queried and add those to the standard set
of stats queries can report.

We also allow optionally tracking per-step samples queried. This must be
enabled both at the engine and query level to be tracked and rendered.
The engine flag is exposed via a Prometheus feature flag, while the
query flag is set when stats=all.

Co-authored-by: Alan Protasio <approtas@amazon.com>
Co-authored-by: Andrew Bloomgarden <blmgrdn@amazon.com>
Co-authored-by: Harkishen Singh <harkishensingh@hotmail.com>
Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>
This allows other implementations to inject their own statistics that
they're gathering in data linked from the context.Context. For example,
Cortex can inject its stats.Stats value under the `cortex` key.

Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>
This exactly corresponds to the statistic compared against MaxSamples
during the course of query execution, so users can see how close their
queries are to a limit.

Co-authored-by: Harkishen Singh <harkishensingh@hotmail.com>
Co-authored-by: Andrew Bloomgarden <blmgrdn@amazon.com>
Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>
@roidelapluie (Member)

triggering ci

@roidelapluie roidelapluie merged commit a64b9fe into prometheus:main Mar 21, 2022
@roidelapluie (Member)

Thanks!

aughr added a commit to aughr/prometheus that referenced this pull request Mar 23, 2022
Per Julien's feedback on prometheus#10369, we're choosing to be consistent with
data types inside the stats structure (ints) rather than with the points
format that is part of the normal query responses (strings). We have
this option because this data cannot be NaN/Inf.

Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>
roidelapluie pushed a commit that referenced this pull request Mar 24, 2022
Per Julien's feedback on #10369, we're choosing to be consistent with
data types inside the stats structure (ints) rather than with the points
format that is part of the normal query responses (strings). We have
this option because this data cannot be NaN/Inf.

Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>
@roidelapluie roidelapluie mentioned this pull request Apr 7, 2022
@roidelapluie (Member)

/prombench v2.34.0

@prombot (Contributor) commented Apr 13, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10369 and v2.34.0

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.34.0

@roidelapluie (Member)

/prombench cancel

@prombot (Contributor) commented Apr 14, 2022

Benchmark cancel is in progress.

@roidelapluie (Member)

This pull request has memory issues, but its head has been rebased on top of #10450, which I believe to be the actual issue.

prymitive added a commit to prymitive/client_golang that referenced this pull request Apr 22, 2022
prometheus/prometheus#10369 was part of the v2.35.0 release and exposes more stats.
There's currently no way to access those stats, since Query() and RangeQuery() return only a subset of the Prometheus API response.
To avoid bloating Query() return values, I'm adding this as a new Option where you can pass a pointer to a QueryStats struct that will hold all recorded stats, instead of adding another return value.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>

5 participants