Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'preserveTags' flag for divideSeries, divideSeriesLists, asPercent #2639

Open
shanson7 opened this issue Oct 7, 2020 · 26 comments
Open

'preserveTags' flag for divideSeries, divideSeriesLists, asPercent #2639

shanson7 opened this issue Oct 7, 2020 · 26 comments

Comments

@shanson7
Copy link
Contributor

shanson7 commented Oct 7, 2020

Is your feature request related to a problem? Please describe.

Several functions (at least divideSeries, divideSeriesLists, and asPercent) trash the tags of the inputs. This makes it difficult to use features in grafana that require tags (e.g. "Labels to Fields" transformation) or other functions like groupByTags post division.

It would be nice to have an option to preserve tags where the dividend and divisor have the same values.

Describe the solution you'd like

Add a flag (default to false for backwards compatibility) to preserve overlapping tags

Describe alternatives you've considered

N/A

@shanson7
Copy link
Contributor Author

shanson7 commented Oct 7, 2020

Perhaps even easier (more useful?) an option to keep the tags of the dividend for these functions, rather than strictly overlapping tags.

@deniszh
Copy link
Member

deniszh commented Oct 7, 2020

Hi @shanson7,
Sorry, but could you please elaborate your request a bit? What do you mean "trash the tags of the inputs" ?
Maybe add some useful example which can be used in tests?

@shanson7
Copy link
Contributor Author

shanson7 commented Oct 7, 2020

What do you mean "trash the tags of the inputs" ?

Tags go in but they don't come out.

Maybe add some useful example which can be used in tests?

Sure, what format?

@deniszh
Copy link
Member

deniszh commented Oct 7, 2020

Tags go in but they don't come out.

Sorry, I still not sure that I understand what that means. I do not use tags in production, though.

Sure, what format?

Example input, current output, desired output ?

@shanson7
Copy link
Contributor Author

shanson7 commented Oct 7, 2020

Example input, current output, desired output ?

Haha, yeah I meant what format do tests take for series/tags. I'll dig it out of the code.

@shanson7
Copy link
Contributor Author

shanson7 commented Oct 7, 2020

Ok so let's say we have 2 input series representing the number of requests and the total time taken (divisor and dividend respectively):

divisor.name = 'requests.count'
divisor.tags = {
  'host' : 'h1',
  'instanceName': 'instance1'
}

dividend.name = 'requests.seconds'
dividend.tags = {
  'host' : 'h1',
  'instanceName': 'instance1'
}

Currently, if we do divideSeries(dividend, divisor) we get a new series:

result.name = 'divideSeries(requests.seconds,requests.count)'
result.tags = {}

It would be desirable to maintain the tags like:

result.name = 'divideSeries(requests.seconds,requests.count)'
result.tags = {  
  'host' : 'h1',
  'instanceName': 'instance1'
}

You can expand this to divideSeriesLists where losing the tags makes the output series indistiguishable and we need to rely on functions like aliasByTags before the division to make it work. However, grafana is becoming more and more tag aware and features that rely on tags become inaccessible.

@DanCech
Copy link
Member

DanCech commented Oct 7, 2020

imho the fact that the resulting series have no tags should be considered a bug, and I agree that we should propagate tags in roughly the same way as we do for other aggregation functions. My suggestion would be that by default all tags of the dividend series should be propagated, along with a new tag to be consistent with the behavior of other aggregation-type functions. We may want to consider also propagating tags on the divisor series that are not present on the dividend, whether by default or with an option flag. Does that seem reasonable?

@shanson7
Copy link
Contributor Author

shanson7 commented Oct 7, 2020

My suggestion would be that by default all tags of the dividend series should be propagated, along with a new tag to be consistent with the behavior of other aggregation-type functions. We may want to consider also propagating tags on the divisor series that are not present on the dividend, whether by default or with an option flag. Does that seem reasonable?

I get a little nervous about changing this type of behavior after the fact (bug or not) because some people might be actively working around it and it could break existing dashboards. However, the behavior you describe does sound reasonable to me (backward compatibility aside).

@Dieterbe
Copy link
Contributor

imho the fact that the resulting series have no tags should be considered a bug, and I agree that we should propagate tags in roughly the same way as we do for other aggregation functions.

+1. I'm not too worried about accommodating people who may have worked around this lack of tags, as it is such as obvious bug; also tags are not commonly used yet. But ultimately this is a question for the graphite maintainers (which i'm not part of).

Note the asPercent signature: asPercent(seriesList, total=None, *nodes)
since *nodes swallows all remaining args, i don't think we can just add an arg at the end, we would probably add it in the middle (before or after total), because it would be an optional flag, this should be fine (in both graphite and metrictank)

Note that Dan suggests setting a new tag aggregatedBy: <func-name> as well as either of these:

  • all dividend tags only
  • all dividend tags + tags that are set only on divisor (this is effectively the union)

In that proposal, it is implied that if the same tag keys are set between dividend and divisor, but with different values, dividend wins. Keeping a tag value of one of the series as is, when the other series directly contradicts is, seems like a bad idea....

E.g. imagine a division between:

http_responses{service=A,code=200}

and

http_responses{service=A,code=total}

we have the choice between:

  1. let dividend or divisor overrule the other. per Dan (service=A,code=200)
  2. drop tag (service=A)
  3. generate a new tag that's the unison of them (service=A,code=200/total)

2 is probably not that useful (we may drop precisely what distinguishes different series), Dan's idea is actually not so bad (we typically look at a series "with respect to" another, where the "another" is implied and keeping only the former explicit may be good enough.), the 3rd one seems most factually correct because it acknowledges the contribution of both components of the division. but the naive formatting may be hard to decipher if either of the tag values already contain a slash.

Anyway, that tangent aside, I note that there's two more options to consider to dan's proposal:

  • intersection between dividend and divisor (key-wise intersection that is, value is subject to the same question above)
  • divisor tags only

I agree the latter isn't all that useful e.g. in asPercent() when we have multiple distinct dividends and a single divisor, we would want to distinguish based on divident tags; whereas in other cases where we map dividends and divisors 1:1 (this happens in divideSeries, divideSeriesList and certain asPercent invocations), I think "divisor tags only" is just as defensible as "dividend tags only".

Thus I think all these modes make sense:

  • dividend only
  • divisor only
  • union
  • intersection

and then i think it would make sense to have an additional flag to control the behavior in the latter 2 cases, in case of tag-matches with value mismatches:

  • drop
  • merge
  • divisor wins
  • dividend wins

@deniszh
Copy link
Member

deniszh commented Oct 12, 2020

Agreed with @DanCech here - tags should not be swallowed by default.
Not sure do we need to implement full spectrum of modes/flags, though. But I think we need listen what users of tags functionality want - otherwise indeed implement full fledged solution as mostly flexible, as @Dieterbe described.

@shanson7
Copy link
Contributor Author

We have been using tags for a while now, so I expect many dashboards currently "work" with the existing behavior. I expect many of those will be negatively impacted by a backwards incompatible change (not to mention any code that is making queries using these functions).

That being said, I'm not strictly against a backwards incompatible change, it just shouldn't be done lightly. The issue might be significantly more prevalent if you expand it to all functions that aggregate/combine multiple series, e.g multiplySeries or even sum/avg/etc.

@stale
Copy link

stale bot commented Dec 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 12, 2020
@Dieterbe
Copy link
Contributor

We plan to provide a concrete proposal for this feature after the holidays

@Dieterbe
Copy link
Contributor

Looking on how to expand these function signatures is interesting...

first of all, looking at the function signature we have today for asPercent, it seems to me that even when trying to not use total, the first node specified will always be assigned to total. See below. am i missing something? how does this work today?

$ cat stock.py
def asPercent(seriesList, total=None, *nodes):
    print ("series", seriesList)
    print ("total", total)
    print ("nodes", nodes)
asPercent("seriesList", 1, 2, 3)
$ python2 stock.py
('series', 'seriesList')
('total', 1)
('nodes', (2, 3))
$ python3 stock.py
series seriesList
total 1
nodes (2, 3)

that aside, trying to add the arg at the end, only works in python3 (invalid syntax in py2)

$ cat newargend.py
def asPercent(seriesList, total=None, *nodes, preserveTags=False):
    print ("series", seriesList)
    print ("total", total)
    print ("nodes", nodes)
    print("preserveTags", preserveTags)
asPercent("seriesList", "n1", "n2", "n3", preserveTags=True)
$ python2 newargend.py
  File "newargend.py", line 3
    def asPercent(seriesList, total=None, *nodes, preserveTags=False):
                                                             ^
SyntaxError: invalid syntax
$ python3 newargend.py
series seriesList
total n1
nodes ('n2', 'n3')
preserveTags True

adding the new arg as keyword arg before nodes and trying to set it as kwarg, does not work:

$ cat newargmid.py
def asPercent(seriesList, total=None, preserveTags=False, *nodes):
    print ("series", seriesList)
    print ("total", total)
    print("preserveTags", preserveTags)
    print ("nodes", nodes)
asPercent("seriesList", preserveTags=True, 1, 2, 3)
$ python2 newargmid.py
  File "newargmid.py", line 6
    asPercent("seriesList", preserveTags=True, 1, 2, 3)
SyntaxError: non-keyword arg after keyword arg
$ python3 newargmid.py
  File "/home/dieter/code/sandbox/graphite-preservetags/newargmid.py", line 6
    asPercent("seriesList", preserveTags=True, 1, 2, 3)
                                                      ^
SyntaxError: positional argument follows keyword argument

but trying to call it not specifying it as an arg suffers from the same problem as with total (unless I'm missing something)

$ cat newargmid.py
def asPercent(seriesList, total=None, preserveTags=False, *nodes):
    print ("series", seriesList)
    print ("total", total)
    print("preserveTags", preserveTags)
    print ("nodes", nodes)
asPercent("seriesList", 1, 2, 3)
$ python2 newargmid.py
('series', 'seriesList')
('total', 1)
('preserveTags', 2)
('nodes', (3,))
$ python3 newargmid.py
series seriesList
total 1
preserveTags 2
nodes (3,)
$ 

So the only solution I see is adding it at the end and dropping support for python2. but that seems a bit much to ask.

@deniszh
Copy link
Member

deniszh commented Jan 19, 2021

@Dieterbe: IMO it's easier to introduce new fuction then, with different arguments.

@Dieterbe
Copy link
Contributor

Gave this some more thought and discussed a bit internally with some set theory nerds at Grafana ( :) )

First of all, I think figuring out which tags to set on the results may become much easier if the division (and the "as-percenting") function matches up dividends and divisors by joining them on tags, so we may want to introduce tag-join-based versions of these functions at some point.

I note that divideSeries, divideSeriesList and asPercent functions are not tag aware. they simply match dividends and divisors in the order they were provided, without regard for which tags they have (at this point i would like to ask you @shanson7 , do you do anything special to make this work for you? as in, how do you assure you have the right pairs being matched up? Do you sort the dividend and divisorList somehow, or do you rely on them being sorted already?)

Additionally, or because of the former, I'm not excited about introducing a new set of functions that replicate the current ones (although possibly merge divideSeries and divideSeriesLists into one), for the sole reason of being able to pass extra parameters to control how labels are set on the outputs. That's just messy.

Thus, I'm hopeful we can simply agree on 1 behavior which we would consider sane and good enough, which doesn't require any flags, nor worries about backwards compatibility because we can just see it as bugfix for the current lack of tags setting.

So, my proposal is simply that we tweak divideSeries, divideSeriesLists, and asPercent, such that each output series gets all tags from the dividend, as well as tags from the divisor that have keys that weren't present in the divisor, as well as an aggregatedBy tag just like aggregations.

  1. this is @DanCech 's proposal, but no config flag
  2. this errs on the side of including more tags rather than fewer. If the user wants to get rid of some tags, we could simply write another function "stripTags" or something which strips a set of tags from series.

How does this sound ?

@Dieterbe
Copy link
Contributor

Dieterbe commented Jan 19, 2021

one more thing.. there is no precedent for something like aggregatedBy: divide or agregatedBy: asPercent. That's normal because there are no such aggregation functions. Maybe I took @DanCech too literally when he said "along with a new tag to be consistent with the behavior of other aggregation-type functions".
I do see other functions settings describing their own name, as well as a parameter e.g. series.tags['pow'] = factor or for when no parameter available they just set the function name series.tags['sigmoid'] = 'sigmoid'.

For consistency then, let me retract the aggregatedBy proposal, and instead do something like:

series.tags['divideSeries'] = 'divideSeries'
series.tags['divideSeriesLists'] = 'divideSeriesLists'
series.tags['asPercent'] = 'asPercent'

I don't see it as particularly useful, but many other series do it, so ...

@shanson7
Copy link
Contributor Author

I note that divideSeries, divideSeriesList and asPercent functions are not tag aware. they simply match dividends and divisors in the order they were provided, without regard for which tags they have (at this point i would like to ask you @shanson7 , do you do anything special to make this work for you? as in, how do you assure you have the right pairs being matched up? Do you sort the dividend and divisorList somehow, or do you rely on them being sorted already?)

Yes, we use aliasByTags and sortByName on the divisor and dividend series to make sure the order is correct.

So, my proposal is simply that we tweak divideSeries, divideSeriesLists, and asPercent, such that each output series gets all tags from the dividend, as well as tags from the divisor that have keys that weren't present in the divisor

This sounds reasonable. Would it still be based on input order or would there be tag matching?

For consistency then, let me retract the aggregatedBy proposal, and instead do something like:
series.tags['divideSeries'] = 'divideSeries'
series.tags['divideSeriesLists'] = 'divideSeriesLists'
series.tags['asPercent'] = 'asPercent'
I don't see it as particularly useful, but many other series do it, so ...

Ditto on not seeing these additional tags as useful. In fact, we often need to work around these tags in grafana by adding groupByTags with the list of tags that we want to keep.

this errs on the side of including more tags rather than fewer. If the user wants to get rid of some tags, we could simply write another function "stripTags" or something which strips a set of tags from series.

I think that a stripTags function would be useful, however it could result in duplicate series (unlike groupByTags which would merge any series that ended up with the same tagset). A version of groupByTags that worked on a set of tags to drop would be interesting, but not strictly required here.

@Dieterbe
Copy link
Contributor

This sounds reasonable. Would it still be based on input order or would there be tag matching?

I propose we don't change how the functions match up dividends and divisors. Merely that we fix the tags being set on the results. Tag based joining can eventually be done in a new set of functions (at that point we can also address other things, such as having 1 division function regardless of the number of inputs), but that's another project.

@shanson7
Copy link
Contributor Author

I'm still a little worried about changing this behavior backwards incompatibly, since I'm not sure where we will be impacted. However, creating a new divideByTags or something with straight-forward tag behavior sounds like an appealing solution, since users that are impacted can opt-in to using that (and there is value add)

@deniszh
Copy link
Member

deniszh commented Jan 21, 2021

I'm also voting for new and shiny function, why not?

@Dieterbe
Copy link
Contributor

proliferation of many functions that all work slightly (or not so slightly) differently is not something that leads to an elegant, intuitive user experience. That said, I see how it is a low cost way to introduce new behaviors without breaking compatibility. Introducing functions should be done thoughtfully and ideally the functions should "feel similar" to other existing ones. I'm working on a proposal and will keep you posted.

@Dieterbe
Copy link
Contributor

Dieterbe commented Jan 25, 2021

Proposal for a tag based division function

introduction

We desire to write a function that looks something along the lines of:

divideSeriesByTag(dividends seriesList, divisors seriesList, *tags)

It is inspired by divideSeries, divideSeriesList and asPercent, with these notes:

  • allowed inputs:
    • 1 dividend and 1 divisor
    • a list of dividends and divisors of same length for pairwise matching. no need for separate functions for these use cases
    • a list of dividends and divisors of unequal length. See below
  • since asPercent is at its core a division and scaling by 100, we won't implement dedicated "as percent" functionality, instead a division result can fed to scale()
  • no support for dividing series by a number. That could be added later without breaking backwards compatibility.
  • no support for specifying divisors=None and use "nodes" (or tags) to divvy up dividends into groups to be summed and used as divisors (like asPercent). That behavior is not obvious and not explicit enough

At the core, matching of dividends to divisors is done via tags (either manually specified or auto-discovered. see below), and also a key aspect of this function is how it sets tags on the resulting series. (in contrast to the above mentioned functions where tags from divisor and dividend don't make it into the output series.

terminology

  • join tags: a set of tags that is used to generate a join key
  • join key: the list of values (or list of key=value pairs) corresponding to the join tags after ordering them. Used to match up dividends and divisors with one another.
  • join key generation: generating the join key by collecting the values of all join tags
  • extra tags: tags that a dividend or divisor has, but that is not part of the join tags

explanation

So, dividends and divisors are both a seriesList of 1..N series.
(if either or both has no series, a runtime error is triggered)
each series has a set of tags. So let's imagine a set of dividends and a set of divisors

A basic case would look something like:

dividends
* name1{a=1, b=1}
* name1{a=1, b=2}
* name2{a=1, b=1}

divisors
* name3{a=1, b=1, c=3}
* name4{a=1, b=1, c=4}

How do we match up dividends and divisors? We need a set of join tags. We have 2 choices:

  1. have the user specify a set of tags
  2. auto discover the set of tags, e.g. pick the largest set of tag keys that is shared across all dividends (a, b), across all divisors (a,b,c) or both (a,b)

The most simple way, is to

  1. automatically take all keys shared across all dividends and all divisors,
  2. assert that the list of join keys is not empty. error otherwise
  3. assert that there are no two dividends or divisors sharing the same join key. (while this could be resolved by specifying an aggregator callback, this makes things overly complicated. we rather error in this case and tell the user to fix their query, possibly by preparing the input with groupByTags)
  4. possibly, assert that for each divisor join key there is a matching dividend join key. This feels overly restrictive as it seems quite natural to sometimes have more values on either side of the equation. In this case we can either discard unmatched series, or generate "MISSING" series, similar to asPercent. I propose the latter, as it is more explicit.

However, there is a benefit to manually specifying the join tags in some special cases. Namely, you can choose to ignore a tag that all series have in common, if it's not relevant for joining (e.g. a "description" tag),
likewise you can choose to include a tag even though not all input series have it, if you want to use it for joining on those series that have it. (see the 3rd item below)

looking at any tag of any given dividend and divisor, there are 7 scenarios. We list them below along with possible behaviors.

  • join tag with same val: contributes to a match, obviously
  • join tag with different val: makes the pair not match
  • join tag unset on one side: // would not happen for auto join tags, but this allows to use extra tags for joining if they're set (by manually specifying them)
    • strict: makes the pair not match
    • loose: considered a match
  • join tag unset on both sides: ignored
  • extra tags with same val:
    • drop
    • keep on the resulting series
  • extra tags with different val:
    • keep the one from dividend
    • keep the one from divisor
    • create a new value that is both of them joined together
    • drop the tag
  • extra tags with val not set one side
    • keep if it's set on the dividend
    • keep if it's set on the divisor
    • keep either one
    • drop the tag

It's not so hard to envision situations in which we may want a variety of these behaviors.
The proposal is to have a simple default case, but ample room to customize if needed.

divideSeriesByTag(dividends seriesList, divisors seriesList, options string, *tags)

join tags defaults to the set of tags that are seen across all dividends and divisors, but can be overridden.
to accommodate all possible behaviors outlined above, the user can specify a comma separated list of options.
as demonstrated here it's not possible
to have an options string that can be completely skipped, along with a variadic tags argument.
Thus, the user will always have to specify options, but in the base case they would specify defaults.
They can then set an option out of each of the categories below, to tweak the behavior.
The proposed default is equivalent to "join_strict,extra_same_keep,extra_diff_keep_dividend,extra_unset_keep_any"

one of:

  • join_strict
  • join_loose

one of:

  • extra_same_drop
  • extra_same_keep

one of:

  • extra_diff_keep_dividend
  • extra_diff_keep_divisor
  • extra_diff_join
  • extra_diff_drop

one of:

  • extra_unset_keep_dividend
  • extra_unset_keep_divisor
  • extra_unset_keep_any
  • extra_unset_drop

@Dieterbe
Copy link
Contributor

Btw, i wouldn't commit to implement all the options first go. I would implement at least the default behavior and a couple of the options, but the implementation phase might reveal some of the options would make things too complicated.

@shanson7
Copy link
Contributor Author

assert that the list of join keys is not empty. error otherwise

What if you want to divide a number of input series by a single tag-less series (e.g. the result of sum)? Why shouldn't this be allowed?

assert that there are no two dividends or divisors sharing the same join key.

Why would we need to assert this? What if there are "extra" tags that would make the result unique? Imagine we had some metric like "CPU usage per process" and we want to divide by the total CPU capacity of the host . Matching on just host should be fine, with a unique series output per processName/host combination.

process.cpu.util;processName=X1;host=Y
process.cpu.util;processName=X2;host=Y

host.cpu.capacity;host=Y

divideByTags(process.cpu.util, host.cpu.capacity);processName=X1;host=Y
divideByTags(process.cpu.util, host.cpu.capacity);processName=X2;host=Y

Theoretically, these 2 would have the same join key (host=Y). I could see how using the extra_unset_drop you would get multiple result series with the same name/tags but while possibly visually confusing in grafana is something you can easily do today with the alias functions.

@robert-milan
Copy link

robert-milan commented Mar 21, 2021

I will be picking up with @Dieterbe's proposal. @shanson7 and I discussed this and I agree with his comments (#2639 (comment)). So for the path forward, the proposal is what Dieter originally said (#2639 (comment)) with the following changes:

  1. We will allow an empty list of join keys for the divisor(s).
    a. Tagless divisors will create a new output series for every dividend they encounter
    b. When we encounter a tagless divisor we will ignore join, extraSame, extraDiff, and extraUnset, and instead use all tags from the dividend
    c. A sane default limit should be enabled with an option to adjust it manually

  2. We will allow more than two dividends and divisors to share the same join keys
    a. This may produce multiple output series with the same name/tags, but more often than not it will produce unique output series due to default sane options (specifically extra_unset_keep_any)

  3. A slight change to the naming of the options. Instead of having options such as join_strict and join_loose we will favor option=value. So those two options become join=strict and join=loose.
    a. proposed defaults would then change to join=strict,extraSame=keep,extraDiff=keepDividend,extraUnset=keepAny

Full options list with changes:

  1. join
    a. strict
    b. loose

  2. extraSame
    a. drop
    b. keep

  3. extraDiff
    a. keepDividend
    b. keepDivisor
    c. join
    d. drop

  4. extraUnset
    a. keepDividend
    b. keepDivisor
    c. keepAny
    d. drop

  5. maxOutseries (not the best name, open to suggestions).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants