
Add User Defined Functions to PromQL #12851

Open
zenador wants to merge 10 commits into base: main

Conversation

@zenador (Contributor) commented Sep 14, 2023

This PR adds User Defined Functions (UDFs) to PromQL.

Why use UDFs?

  • Users have asked for functions that we are reluctant to include, e.g. mad_over_time (median absolute deviation, similar to standard deviation but robust to outliers; see the worked example after this list)
  • Similar products like VictoriaMetrics have implemented these functions or something similar
  • It gives users the flexibility to define their own functions without having to make the case for each one and wait for it to be added to Prometheus, and lets them use their preferred variant (e.g. there are different kinds of MAD functions)
  • We would not need to support a lot of additional functions, and we would not have a lot of rarely-used functions confusing users
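
As a worked example of the difference (the numbers here are illustrative, not from the PR): for the window values [1, 2, 3, 100], the median is 2.5, the absolute deviations are [1.5, 0.5, 0.5, 97.5], and their median (the MAD) is 1.0. The population standard deviation of the same values is about 42.4, so a single outlier dominates stddev_over_time while barely affecting mad_over_time.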

Limitations of the current implementation:

  • Slower than native functions (see below benchmarks)
  • Adds a dependency on a new library (which is not updated that often)
  • Allowing scripting may add security risks
  • Uses a lesser-known scripting language, though one that is simple, Go-like, and implemented in Go (Tengo)
  • UDFs won't appear in query autocomplete
  • UDFs are shared across the whole instance, not separated by tenant
  • Cannot aggregate multiple series
  • Input and output types are limited, but they cover most of the functions that we currently have
  • Does not support native histograms yet; this should be theoretically possible to add, but I haven't tested it

Example YAML (add to the main Prometheus config):

udfs:
  - name: mad_over_time
    input_types:
      - matrix
    modules:
      - math
    util: true
    src: |
      math := import("math")
      util := import("util")

      mad := func(seq) {
        median := util.median(seq)
        if is_error(median) {
          return error("cannot compute median")
        }
        deltas := []
        for x in seq {
          deltas = append(deltas, math.abs(x-median))
        }
        return util.median(deltas)
      }

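      // input0 holds the evaluated matrix argument (presumably one inputN per entry
      // in input_types), and the value assigned to output below is what the UDF returns.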
      output := mad(input0)

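Once this configuration is loaded, the UDF can be used in a query like any other range-vector function, e.g. udf_mad_over_time(some_metric[10m]). (This assumes the enforced udf_ prefix mentioned at the end of this thread is applied to the configured name; the metric and range here are only illustrative.)
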
Benchmark results (before is native function, after is UDF):

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/promql
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                  │ BenchmarkStdDevOverTime_before.txt │  BenchmarkStdDevOverTime_after.txt   │
                  │               sec/op               │   sec/op     vs base                 │
StdDevOverTime-16                          16.22µ ± 1%   52.71µ ± 3%  +224.97% (p=0.000 n=10)

                  │ BenchmarkStdDevOverTime_before.txt │   BenchmarkStdDevOverTime_after.txt    │
                  │                B/op                │     B/op       vs base                 │
StdDevOverTime-16                         10.32Ki ± 0%   109.33Ki ± 0%  +959.63% (p=0.000 n=10)

                  │ BenchmarkStdDevOverTime_before.txt │  BenchmarkStdDevOverTime_after.txt   │
                  │             allocs/op              │  allocs/op   vs base                 │
StdDevOverTime-16                           145.0 ± 0%   1056.0 ± 0%  +628.28% (p=0.000 n=10)

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/promql
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
               │ BenchmarkMadOverTime_before.txt │     BenchmarkMadOverTime_after.txt     │
               │             sec/op              │    sec/op     vs base                  │
MadOverTime-16                       26.15µ ± 3%   288.25µ ± 4%  +1002.11% (p=0.000 n=10)

               │ BenchmarkMadOverTime_before.txt │     BenchmarkMadOverTime_after.txt      │
               │              B/op               │     B/op       vs base                  │
MadOverTime-16                      19.87Ki ± 0%   245.13Ki ± 0%  +1133.71% (p=0.000 n=10)

               │ BenchmarkMadOverTime_before.txt │    BenchmarkMadOverTime_after.txt     │
               │            allocs/op            │  allocs/op   vs base                  │
MadOverTime-16                        149.0 ± 0%   4052.0 ± 0%  +2619.46% (p=0.000 n=10)

BenchmarkStdDevOverTime is a better comparison as the native function and UDF are implemented in very similar ways. For BenchmarkMadOverTime, the implementations are very different, and there is room to optimise the UDF performance by using more efficient methods to calculate the median.
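
As an illustration of the kind of optimisation hinted at above (this is a sketch of mine, not code from this PR, and the util.median helper may already be implemented differently): the median can be computed with quickselect in expected O(n) time instead of the O(n log n) of a full sort. For example, in Go:

package main

import (
	"fmt"
	"math/rand"
)

// quickselect returns the k-th smallest value (0-based) of xs in expected O(n)
// time by partitioning around a random pivot and descending into one side only.
func quickselect(xs []float64, k int) float64 {
	for {
		if len(xs) == 1 {
			return xs[0]
		}
		pivot := xs[rand.Intn(len(xs))]
		var lo, eq, hi []float64
		for _, x := range xs {
			switch {
			case x < pivot:
				lo = append(lo, x)
			case x > pivot:
				hi = append(hi, x)
			default:
				eq = append(eq, x)
			}
		}
		switch {
		case k < len(lo):
			xs = lo
		case k < len(lo)+len(eq):
			return pivot
		default:
			k -= len(lo) + len(eq)
			xs = hi
		}
	}
}

// median assumes len(xs) > 0 and averages the two middle values for even lengths.
func median(xs []float64) float64 {
	n := len(xs)
	if n%2 == 1 {
		return quickselect(xs, n/2)
	}
	return (quickselect(xs, n/2-1) + quickselect(xs, n/2)) / 2
}

func main() {
	fmt.Println(median([]float64{1, 2, 3, 100})) // prints 2.5
}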

@zenador (Contributor, Author) commented Sep 14, 2023

The current test failure seems transient.

@beorn7 (Member) commented Sep 19, 2023

Thank you very much @zenador. That looks really cool.

We had discussed embedding a scripting engine before at the dev summit.
Note that there is a relevant proposed agenda item for the upcoming dev summit. I think we should discuss this approach in more detail. Having a concrete implementation on the table definitely helps. Topics that come to mind:

  • Do we generally want this?
  • What about the choice of language? (Here: Tengo. Maybe data scientists would prefer a Python-like language. We could also offer different choices… )
  • What about implementation details? (Provide code via YAML, see the listed limitations above, …)

@bboreham (Member) commented:

I think you should set out your requirements and proposal in a GitHub issue, then link this PR to it.
That way we can debate requirements and solution separately, and it might be that there are multiple proposed solutions.

Also, would this help with #11609?

@beorn7 (Member) commented Sep 20, 2023

I think #11609 is aiming to add new concepts to PromQL itself, while this PR uses a different language (Tengo) to add new functions to PromQL (which you could already do easily in Go if you are fine with compiling your own Prometheus code – this PR just lets you "script" functions in a Go-like language and run them within an already compiled standard Prometheus).

@zenador (Contributor, Author) commented Sep 20, 2023

@beorn7 Sure, do you mean discussing in a meeting, in a GitHub issue/PR, or at the dev summit?

I chose Tengo for this implementation as their benchmarks indicate that it is faster than the alternatives (including gpython), though I didn't check the benchmarks closely.

@bboreham Embedding a scripting language is something that has been discussed previously (it's also a dev summit item) and I think others have more context on this. I couldn't find an issue though, so please let me know if I should create one @beorn7.

I think it can help with #11609, but not directly: it doesn't do exactly what was requested there, but it should be able to achieve the same goals in another way. It doesn't allow defining variables in PromQL as requested, but you can define variables in Tengo inside a new UDF (so the UDF code should be more readable), and you can reuse the UDF in multiple places, so queries should be shorter and more understandable overall.

As mentioned in the dev summit item, it would help with issues like #5514 #12497 and #9016 where users want to use new functions, but Prometheus wants to be selective about the officially supported functions that need to be maintained. This way, users can add any functions they like without needing them to be accepted by Prometheus first (and it seems very difficult to push for new functions to be accepted), and functions can come in many varieties (see MAD and the discussion on holt_winters), so users can choose the variety they want for themselves.

@beorn7 (Member) commented Sep 20, 2023

Since the dev summit is literally next week, I would say we can just wait to see the topic discussed and take it from there.

@beorn7 (Member) commented Oct 4, 2023

Unfortunately, the dev summit didn't manage to discuss the topic relevant to this PR. It will still happen eventually at one of the monthly online summits, but maybe we can at least get an idea of how to proceed by discussing it here.

@juliusv @roidelapluie you two might be the most relevant custodians of PromQL, so flagging you (but I don't want to exclude anyone else, please speak up). What do you generally think about this approach? Personally, I think this is probably worth a design doc, but your input would be good to find out which direction to take in the doc. Or, worst case, if you are fundamentally against this, we could discuss the reasons before investing a lot of effort into writing a design doc.

@beorn7 (Member) commented Oct 31, 2023

We discussed this PR and the topics around it at the recent dev summit. I would like to summarize the relevant outcome here.

Generally, we didn't come to a consensus that we want this. However, there was a fairly easy consensus that more functions should be accepted into PromQL behind a feature flag. The bar was set at "at least somewhat generally useful" and "cannot be easily modeled with existing PromQL primitives". Since the new functions will be behind a feature flag, there is no danger of a "second holt_winters" (where a function was added under that name that isn't the "real" Holt-Winters and was also of limited use, but now we have to maintain it forever). New functions can then be graduated to stable (or removed again) depending on the experience gained in the wild.

Adding native functions more easily would reduce the weight of one of the basic motivations behind this PR, which in turn would increase the relative weight of the concerns.

While performance concerns were mentioned, the main concerns were more about operational aspects, like the following:

  • It is harder to share PromQL solutions if you also have to hand around your UDFs. Instead of a few "canonical" solutions, we'll breed a whole zoo of solutions involving different UDFs.
  • While many will probably use UDFs for more readable PromQL expressions, UDFs also hide what's actually going on, so in practice (and in the heat of an outage) it might actually be much harder to understand what a PromQL expression is doing if it involves UDFs.
  • Many different approaches for the same problem were discussed, and there is little clarity which one is best to pursue (plugins on the binary level, implement UDFs through a remote-read backend, allow some kind of UDFs within PromQL, …).

In favor of this approach, people mentioned the ease of adding new functions for problems at hand and enabling quick experimentation. This could also be used to see what kind of functions users implement most often, so that we can take that as an inspiration for future built-in PromQL functions.

To avoid name collisions, UDFs should come with some unique prefix (so that the later introduction of a built-in function of the same name doesn't cause an error), or maybe UDFs could be called through a special built-in PromQL function (e.g. call(my_own_mad_function, my_metric{foo="bar"})).

Overall, I would recommend waiting to see how the greater openness towards adding native functions plays out. Right now, it will probably take a lot of effort to get sufficient buy-in for UDFs via an embedded scripting engine.

@zenador (Contributor, Author) commented Oct 31, 2023

Noted, thanks for the summary!

Just an FYI, this PR already avoids name collisions by enforcing a udf_ prefix.
