
Add User Defined Functions to PromQL #12851

Open
zenador wants to merge 10 commits into base: main

Conversation

@zenador (Contributor) commented Sep 14, 2023

This PR adds User Defined Functions (UDFs) to PromQL.

Why use UDFs?

  • Users have asked for functions that we are reluctant to include, e.g. mad_over_time (median absolute deviation, similar to standard deviation but robust to outliers; see the worked example after this list)
  • Similar products like VictoriaMetrics have implemented these functions or something similar
  • It gives users the flexibility to define their own functions without having to make the case for each one and wait for it to be added to Prometheus, and lets them use their preferred variant (e.g. there are different kinds of MAD functions)
  • We would not need to support a lot of additional functions, and we would not have a lot of rarely-used functions confusing users
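
As a worked example of the difference (the numbers here are illustrative, not from the PR): for the window values [1, 2, 3, 100], the median is 2.5, the absolute deviations are [1.5, 0.5, 0.5, 97.5], and their median (the MAD) is 1.0. The population standard deviation of the same values is about 42.4, so a single outlier dominates stddev_over_time while barely affecting mad_over_time.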

Limitations of the current implementation:

  • Slower than native functions (see below benchmarks)
  • Adds a dependency on a new library (which is not updated that often)
  • Allowing scripting may add security risks
  • Uses a lesser-known scripting language, though one that is simple, Go-like, and implemented in Go (Tengo)
  • UDFs won't appear in query autocomplete
  • UDFs are shared across the whole instance, not separated by tenant
  • Cannot aggregate multiple series
  • Input and output types are limited, but they cover most of the functions that we currently have
  • Does not support native histograms yet; this should be theoretically possible to add, but I haven't tested it

Example YAML (add to the main Prometheus config):

udfs:
  - name: mad_over_time
    input_types:
      - matrix
    modules:
      - math
    util: true
    src: |
      math := import("math")
      util := import("util")

      mad := func(seq) {
        median := util.median(seq)
        if is_error(median) {
          return error("cannot compute median")
        }
        deltas := []
        for x in seq {
          deltas = append(deltas, math.abs(x-median))
        }
        return util.median(deltas)
      }

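      // input0 holds the evaluated matrix argument (presumably one inputN per entry
      // in input_types), and the value assigned to output below is what the UDF returns.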
      output := mad(input0)

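Once this configuration is loaded, the UDF can be used in a query like any other range-vector function, e.g. udf_mad_over_time(some_metric[10m]). (This assumes the enforced udf_ prefix mentioned at the end of this thread is applied to the configured name; the metric and range here are only illustrative.)
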
Benchmark results (before is native function, after is UDF):

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/promql
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                  │ BenchmarkStdDevOverTime_before.txt │  BenchmarkStdDevOverTime_after.txt   │
                  │               sec/op               │   sec/op     vs base                 │
StdDevOverTime-16                          16.22µ ± 1%   52.71µ ± 3%  +224.97% (p=0.000 n=10)

                  │ BenchmarkStdDevOverTime_before.txt │   BenchmarkStdDevOverTime_after.txt    │
                  │                B/op                │     B/op       vs base                 │
StdDevOverTime-16                         10.32Ki ± 0%   109.33Ki ± 0%  +959.63% (p=0.000 n=10)

                  │ BenchmarkStdDevOverTime_before.txt │  BenchmarkStdDevOverTime_after.txt   │
                  │             allocs/op              │  allocs/op   vs base                 │
StdDevOverTime-16                           145.0 ± 0%   1056.0 ± 0%  +628.28% (p=0.000 n=10)

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/promql
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
               │ BenchmarkMadOverTime_before.txt │     BenchmarkMadOverTime_after.txt     │
               │             sec/op              │    sec/op     vs base                  │
MadOverTime-16                       26.15µ ± 3%   288.25µ ± 4%  +1002.11% (p=0.000 n=10)

               │ BenchmarkMadOverTime_before.txt │     BenchmarkMadOverTime_after.txt      │
               │              B/op               │     B/op       vs base                  │
MadOverTime-16                      19.87Ki ± 0%   245.13Ki ± 0%  +1133.71% (p=0.000 n=10)

               │ BenchmarkMadOverTime_before.txt │    BenchmarkMadOverTime_after.txt     │
               │            allocs/op            │  allocs/op   vs base                  │
MadOverTime-16                        149.0 ± 0%   4052.0 ± 0%  +2619.46% (p=0.000 n=10)

BenchmarkStdDevOverTime is a better comparison as the native function and UDF are implemented in very similar ways. For BenchmarkMadOverTime, the implementations are very different, and there is room to optimise the UDF performance by using more efficient methods to calculate the median.
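
As an illustration of the kind of optimisation hinted at above (this is a sketch of mine, not code from this PR, and the util.median helper may already be implemented differently): the median can be computed with quickselect in expected O(n) time instead of the O(n log n) of a full sort. For example, in Go:

package main

import (
	"fmt"
	"math/rand"
)

// quickselect returns the k-th smallest value (0-based) of xs in expected O(n)
// time by partitioning around a random pivot and descending into one side only.
func quickselect(xs []float64, k int) float64 {
	for {
		if len(xs) == 1 {
			return xs[0]
		}
		pivot := xs[rand.Intn(len(xs))]
		var lo, eq, hi []float64
		for _, x := range xs {
			switch {
			case x < pivot:
				lo = append(lo, x)
			case x > pivot:
				hi = append(hi, x)
			default:
				eq = append(eq, x)
			}
		}
		switch {
		case k < len(lo):
			xs = lo
		case k < len(lo)+len(eq):
			return pivot
		default:
			k -= len(lo) + len(eq)
			xs = hi
		}
	}
}

// median assumes len(xs) > 0 and averages the two middle values for even lengths.
func median(xs []float64) float64 {
	n := len(xs)
	if n%2 == 1 {
		return quickselect(xs, n/2)
	}
	return (quickselect(xs, n/2-1) + quickselect(xs, n/2)) / 2
}

func main() {
	fmt.Println(median([]float64{1, 2, 3, 100})) // prints 2.5
}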

@zenador (Contributor, Author) commented Sep 14, 2023

The current test failure seems transient.

@beorn7 (Member) commented Sep 19, 2023

Thank you very much @zenador. That looks really cool.

We had discussed embedding a scripting engine before at the dev summit.
Note that there is a relevant proposed agenda item for the upcoming dev summit. I think we should discuss this approach in more detail. Having a concrete implementation on the table definitely helps. Topics that come to mind:

  • Do we generally want this?
  • What about the choice of language? (Here: Tengo. Maybe data scientists would prefer a Python-like language. We could also offer different choices… )
  • What about implementation details? (Provide code via YAML, see the listed limitations above, …)

@bboreham (Member) commented:

I think you should set out your requirements and proposal in a GitHub issue, then link this PR to it.
That way we can debate requirements and solution separately, and it might be that there are multiple proposed solutions.

Also, would this help with #11609?

@beorn7 (Member) commented Sep 20, 2023

I think #11609 is aiming to add new concepts to PromQL itself, while this PR uses a different language (Tengo) to add new functions to PromQL (which you could already do easily in Go if you are fine with compiling your own Prometheus code – this PR just lets you "script" functions in a Go-like language and run them within an already compiled standard Prometheus).

@zenador (Contributor, Author) commented Sep 20, 2023

@beorn7 Sure, do you mean discussing in a meeting, in a GitHub issue/PR, or at the dev summit?

I chose Tengo for this implementation as their benchmarks indicate that it is faster than the alternatives (including gpython), though I didn't check the benchmarks closely.

@bboreham Embedding a scripting language is something that has been discussed previously (it's also a dev summit item) and I think others have more context on this. I couldn't find an issue though, so please let me know if I should create one @beorn7.

I think it can help with #11609, but not directly: it doesn't do exactly what was requested there, but it should be able to achieve the same goals in another way. It doesn't allow defining variables in PromQL as requested, but you can define variables in Tengo inside a new UDF (so the UDF code should be more readable), and you can reuse the UDF in multiple places, so queries should be shorter and more understandable overall.

As mentioned in the dev summit item, it would help with issues like #5514 #12497 and #9016 where users want to use new functions, but Prometheus wants to be selective about the officially supported functions that need to be maintained. This way, users can add any functions they like without needing them to be accepted by Prometheus first (and it seems very difficult to push for new functions to be accepted), and functions can come in many varieties (see MAD and the discussion on holt_winters), so users can choose the variety they want for themselves.

@beorn7 (Member) commented Sep 20, 2023

Since the dev summit is literally next week, I would say we can just wait to see the topic discussed and take it from there.

@beorn7 (Member) commented Oct 4, 2023

Unfortunately, the dev summit didn't manage to discuss the topic relevant to this PR. It will still happen eventually at one of the monthly online summits, but maybe we can at least get an idea of how to proceed by discussing it here.

@juliusv @roidelapluie you two might be the most relevant custodians of PromQL, so flagging you (but I don't want to exclude anyone else, please speak up). What do you generally think about this approach? Personally, I think this is probably worth a design doc, but your input would be good to find out which direction to take in the doc. Or, worst case, if you are fundamentally against this, we could discuss the reasons before investing a lot of effort into writing a design doc.

@beorn7 (Member) commented Oct 31, 2023

We discussed this PR and the topics around it at the recent dev summit. I would like to summarize the relevant outcome here.

Generally, we didn't come to a consensus that we want this. However, there was a fairly easy consensus that more functions should be accepted into PromQL behind a feature flag. The bar was set at "at least somewhat generally useful" and "cannot be easily modeled with existing PromQL primitives". Since the new functions will be behind a feature flag, there is no danger of a "second holt_winters" (where a function was added under that name that isn't the "real" Holt-Winters and was also of limited use, but now we have to maintain it forever). New functions can then be graduated to stable (or removed again) depending on the experience gained in the wild.

Adding native functions more easily would reduce the weight of one of the basic motivations behind this PR, which in turn would increase the relative weight of the concerns.

While performance concerns were mentioned, the main concerns were more about operational aspects, like the following:

  • It is harder to share PromQL solutions if you also have to hand around your UDFs. Instead of a few "canonical" solutions, we'll breed a whole zoo of solutions involving different UDFs.
  • While many will probably use UDFs for more readable PromQL expressions, UDFs also hide what's actually going on, so in practice (and in the heat of an outage) it might actually be much harder to understand what a PromQL expression is doing if it involves UDFs.
  • Many different approaches for the same problem were discussed, and there is little clarity which one is best to pursue (plugins on the binary level, implement UDFs through a remote-read backend, allow some kind of UDFs within PromQL, …).

In favor of this approach, people mentioned the ease of adding new functions for problems at hand and enabling quick experimentation. This could also be used to see what kind of functions users implement most often, so that we can take that as an inspiration for future built-in PromQL functions.

To avoid name collisions, UDFs should come with some unique prefix (so that the later introduction of a built-in function of the same name doesn't cause an error), or maybe UDFs could be called through a special built-in PromQL function (e.g. call(my_own_mad_function, my_metric{foo="bar"})).

Overall, I would recommend waiting to see how the greater openness towards adding native functions plays out. Right now, it will probably take a lot of effort to get sufficient buy-in for UDFs via an embedded scripting engine.

@zenador (Contributor, Author) commented Oct 31, 2023

Noted, thanks for the summary!

Just an FYI, this PR already avoids name collisions by enforcing a udf_ prefix.
