Mathematics across measurements #3552

Open
srfraser opened this Issue Aug 4, 2015 · 90 comments

Comments

Projects
None yet
@srfraser

srfraser commented Aug 4, 2015

Apologies if this is a duplicate, I had a look and couldn't see a relevant issue.

I can see from the documentation how to select from multiple measurements (although it calls them series, still, at https://influxdb.com/docs/v0.9/query_language/data_exploration.html )

For example, with data inserted by telegraf, you can do:
select * from disk_used,disk_total where host = 'myhostname' and path = '/'

How would you express that as a percentage? I've tried variations of the following, and none seem to work:

select disk_used.value/disk_total.value from disk_used, disk_total where host = 'myhostname' and path='/'

The "mydb"."retentionpolicy"."measurement" syntax doesn't work there, either.

Is it a good idea to add aggregation functions for cases like diff(value1, value2) from m1, m2 and divide(value, value) from m1, m2, or should the arithmetic operators be working?

Also, I noticed when experimenting that it's also not possible to divide one derivative by another. For example, if I have two counters, bytes transferred and api calls made - both of which are constantly going up - how would you calculate the mean bytes per api call?

@srfraser srfraser changed the title from Mathematics across measurements to Mathematics across measurements with 0.9.2 Aug 4, 2015

@srfraser srfraser changed the title from Mathematics across measurements with 0.9.2 to [0.9.2] Mathematics across measurements Aug 4, 2015

@beckettsean beckettsean changed the title from [0.9.2] Mathematics across measurements to [feature request] Mathematics across measurements Aug 4, 2015

@beckettsean beckettsean added this to the Longer term milestone Aug 25, 2015

@hexluthor

This comment has been minimized.

Show comment
Hide comment
@hexluthor

hexluthor Sep 3, 2015

👍
I work with sensor networks and find this limitation frustrating. For example, I wish to compute weighted averages like this:
SELECT sum(oxygen_percentage.value * flow_rate.value) / sum(flow_rate.value) FROM oxygen_percentage, flow_rate WHERE site_id = '3'
But InfluxDB returns nothing. Even SELECT oxygen_percentage.value FROM oxygen_percentage doesn't work. Using 0.9.3-rc1 master (0163945).

👍
I work with sensor networks and find this limitation frustrating. For example, I wish to compute weighted averages like this:
SELECT sum(oxygen_percentage.value * flow_rate.value) / sum(flow_rate.value) FROM oxygen_percentage, flow_rate WHERE site_id = '3'
But InfluxDB returns nothing. Even SELECT oxygen_percentage.value FROM oxygen_percentage doesn't work. Using 0.9.3-rc1 master (0163945).

@ghost

This comment has been minimized.

Show comment
Hide comment
@ghost

ghost Sep 8, 2015

Same here. I'd also like to calculate values across different series like:

select * from mysql_value where type='mysql_commands' and type_instance='show_tables' +
select * from mysql_value where type='mysql_commands' and type_instance='show_databases'

Cheers,
Szop

ghost commented Sep 8, 2015

Same here. I'd also like to calculate values across different series like:

select * from mysql_value where type='mysql_commands' and type_instance='show_tables' +
select * from mysql_value where type='mysql_commands' and type_instance='show_databases'

Cheers,
Szop

@bbinet

This comment has been minimized.

Show comment
Hide comment
@bbinet

bbinet Sep 18, 2015

Contributor

same as @hexluthor, I feel this is very limiting: if we need to correlate data coming from various sensors we currently have to write all data as fields in the same measurement... But would it be a good idea in terms of data structure to have a single measurement with more than 50 fields? Will it impact query performance?
And this sensor data does not always get logged with the same sampling frequency, so this is not always possible to combine data in the same measurement if we want to keep data with high sampling frequency.

I'm not comfortable with distorting the data structure (dropping natural data organization) because of technical limitations.
In the sysadmin world, it would be like putting all the cpu, ram, disk, and apache response time metrics in the same measurement for the sole purpose of being able to correlate apache response time with cpu, ram, or disk metrics.

Contributor

bbinet commented Sep 18, 2015

same as @hexluthor, I feel this is very limiting: if we need to correlate data coming from various sensors we currently have to write all data as fields in the same measurement... But would it be a good idea in terms of data structure to have a single measurement with more than 50 fields? Will it impact query performance?
And this sensor data does not always get logged with the same sampling frequency, so this is not always possible to combine data in the same measurement if we want to keep data with high sampling frequency.

I'm not comfortable with distorting the data structure (dropping natural data organization) because of technical limitations.
In the sysadmin world, it would be like putting all the cpu, ram, disk, and apache response time metrics in the same measurement for the sole purpose of being able to correlate apache response time with cpu, ram, or disk metrics.

@bbinet

This comment has been minimized.

Show comment
Hide comment
@bbinet

bbinet Sep 18, 2015

Contributor

Also, what are the actual technical issues that prevent InfluxDB to support queries with simple math operations across measurements?

Contributor

bbinet commented Sep 18, 2015

Also, what are the actual technical issues that prevent InfluxDB to support queries with simple math operations across measurements?

@corylanou

This comment has been minimized.

Show comment
Hide comment
@corylanou

corylanou Sep 18, 2015

Contributor

This was recently changed to a "feature request" so that means it will be evaluated in future releases if we are going to add it or not. There are a couple work arounds right now, and that is to save a calculated field when you write data, such as storing another field for oxygen_percentage.value * flow_rate.value. I understand this isn't ideal, but it might get you moving forward.

Otherwise, I think these requests are sane, but they will take some work. I believe sum() / sum() is supposed to work already, but I thought I remember seeing a bug about math still not behaving properly.

Contributor

corylanou commented Sep 18, 2015

This was recently changed to a "feature request" so that means it will be evaluated in future releases if we are going to add it or not. There are a couple work arounds right now, and that is to save a calculated field when you write data, such as storing another field for oxygen_percentage.value * flow_rate.value. I understand this isn't ideal, but it might get you moving forward.

Otherwise, I think these requests are sane, but they will take some work. I believe sum() / sum() is supposed to work already, but I thought I remember seeing a bug about math still not behaving properly.

@bbinet

This comment has been minimized.

Show comment
Hide comment
@bbinet

bbinet Sep 21, 2015

Contributor

@corylanou about the work around you're talking about: the oxygen_percentage.value * flow_rate.value field should be created when new points are created or is there a way to compute the calculation afterwards in a continuous query?

Contributor

bbinet commented Sep 21, 2015

@corylanou about the work around you're talking about: the oxygen_percentage.value * flow_rate.value field should be created when new points are created or is there a way to compute the calculation afterwards in a continuous query?

@bbinet bbinet referenced this issue Sep 21, 2015

Closed

Joining series #3860

@corylanou

This comment has been minimized.

Show comment
Hide comment
@corylanou

corylanou Sep 21, 2015

Contributor

Yes, I believe you should be able to do that in a CQ and then you can select from that retention policy.

Contributor

corylanou commented Sep 21, 2015

Yes, I believe you should be able to do that in a CQ and then you can select from that retention policy.

@srfraser

This comment has been minimized.

Show comment
Hide comment
@srfraser

srfraser Sep 21, 2015

How can we do it in a continuous query? I thought the syntax of normal queries and continuous ones was the same, so if it's possible in one, it should be possible in the other.

How can we do it in a continuous query? I thought the syntax of normal queries and continuous ones was the same, so if it's possible in one, it should be possible in the other.

@corylanou

This comment has been minimized.

Show comment
Hide comment
@corylanou

corylanou Sep 21, 2015

Contributor

instead of sum(value & value), you are doing a CQ with select val * val as newval and then you can select sum(newval) from your new data that was calculated from a CQ.

Contributor

corylanou commented Sep 21, 2015

instead of sum(value & value), you are doing a CQ with select val * val as newval and then you can select sum(newval) from your new data that was calculated from a CQ.

@srfraser

This comment has been minimized.

Show comment
Hide comment
@srfraser

srfraser Sep 21, 2015

And that works across measurements? Using @bbinet's example, this would work?

select oxygen_percentage.value * flow_rate.value as newmeasurement from oxygen_percentage, flow_rate 

And that works across measurements? Using @bbinet's example, this would work?

select oxygen_percentage.value * flow_rate.value as newmeasurement from oxygen_percentage, flow_rate 
@corylanou

This comment has been minimized.

Show comment
Hide comment
@corylanou

corylanou Sep 21, 2015

Contributor

Hmm, it should, but I just tried this basic test and it crashed the server 😢

> create database math
> use math
Using database math
> insert mul a=1,b=2
> select * from mul
name: mul
---------
time                            a       b
2015-09-21T12:17:36.377625368Z  1       2

> select a*b as c from mul
ERR: Get http://localhost:8086/query?db=math&q=select+a%2Ab+as+c+from+mul: EOF

I logged another issue here: influxdb#4183

Contributor

corylanou commented Sep 21, 2015

Hmm, it should, but I just tried this basic test and it crashed the server 😢

> create database math
> use math
Using database math
> insert mul a=1,b=2
> select * from mul
name: mul
---------
time                            a       b
2015-09-21T12:17:36.377625368Z  1       2

> select a*b as c from mul
ERR: Get http://localhost:8086/query?db=math&q=select+a%2Ab+as+c+from+mul: EOF

I logged another issue here: influxdb#4183

@srfraser

This comment has been minimized.

Show comment
Hide comment
@srfraser

srfraser Sep 21, 2015

and that was only from one measurement :)

and that was only from one measurement :)

@corylanou

This comment has been minimized.

Show comment
Hide comment
@corylanou

corylanou Sep 21, 2015

Contributor

Hopefully this is a central bug in our post-processing that when fixed will fix all of it. I'll see if I can fix it today. It appears to be just a bad reference while putting the math together, so it might be a quick fix.

Contributor

corylanou commented Sep 21, 2015

Hopefully this is a central bug in our post-processing that when fixed will fix all of it. I'll see if I can fix it today. It appears to be just a bad reference while putting the math together, so it might be a quick fix.

@bbinet

This comment has been minimized.

Show comment
Hide comment
@bbinet

bbinet Sep 21, 2015

Contributor

Thanks @corylanou, but as @srfraser said in his previous comment, your example comes from the same measurement: is it supposed to work with multiple measurements?
I thought that queries running as continuous were the same as normal queries so if maths does not work across multiple measurements in a normal query, I thought it won't work neither in a continuous query.
Is it wrong?

Contributor

bbinet commented Sep 21, 2015

Thanks @corylanou, but as @srfraser said in his previous comment, your example comes from the same measurement: is it supposed to work with multiple measurements?
I thought that queries running as continuous were the same as normal queries so if maths does not work across multiple measurements in a normal query, I thought it won't work neither in a continuous query.
Is it wrong?

@corylanou

This comment has been minimized.

Show comment
Hide comment
@corylanou

corylanou Sep 21, 2015

Contributor

Ah, yes, I keep forgetting we don't calculate across values. Although in a simple query we should support this. The biggest problem is type checking and overflow so that when you take an unsigned int and multiple it by a float, etc. that we are able to properly convert to a common type for the math, and not overflow either.

Contributor

corylanou commented Sep 21, 2015

Ah, yes, I keep forgetting we don't calculate across values. Although in a simple query we should support this. The biggest problem is type checking and overflow so that when you take an unsigned int and multiple it by a float, etc. that we are able to properly convert to a common type for the math, and not overflow either.

@bbinet

This comment has been minimized.

Show comment
Hide comment
@bbinet

bbinet Sep 22, 2015

Contributor

Ok, I see.
That would be great if cross measurements calculation could be possible at least for series which shares the same type (since no type conversion would be needed)

Contributor

bbinet commented Sep 22, 2015

Ok, I see.
That would be great if cross measurements calculation could be possible at least for series which shares the same type (since no type conversion would be needed)

@drmclean

This comment has been minimized.

Show comment
Hide comment
@drmclean

drmclean Oct 7, 2015

+1 We REALLY want this for our use-case!

drmclean commented Oct 7, 2015

+1 We REALLY want this for our use-case!

@malnor

This comment has been minimized.

Show comment
Hide comment
@malnor

malnor Nov 26, 2015

+1, really missing this feature.

malnor commented Nov 26, 2015

+1, really missing this feature.

@Millnert

This comment has been minimized.

Show comment
Hide comment

+1

@alintuhut

This comment has been minimized.

Show comment
Hide comment

+1

@clongbottom

This comment has been minimized.

Show comment
Hide comment

+1

@beckettsean beckettsean referenced this issue in influxdata/docs.influxdata.com Jan 5, 2016

Closed

[new page] open feature requests #64

@xaniasd

This comment has been minimized.

Show comment
Hide comment

xaniasd commented Jan 27, 2016

👍

@deepujain

This comment has been minimized.

Show comment
Hide comment

👍

@plieningerweb

This comment has been minimized.

Show comment
Hide comment

👍

@cxreg

This comment has been minimized.

Show comment
Hide comment

cxreg commented Feb 20, 2016

+1

@migibert

This comment has been minimized.

Show comment
Hide comment

+1

@migibert migibert referenced this issue in grafana/grafana Feb 22, 2016

Closed

division of two series with influxDB ? #2349

@graphex

This comment has been minimized.

Show comment
Hide comment
@graphex

graphex Feb 23, 2016

+1 thought I was going crazy, but this is a pretty substantial omission that might mean I've got to use another project instead of influx. Many times there is just no way to get correlated information into the same measurement. Even after an arduous journey with CQs, I only found that tags aren't included in CQ writes so there is no way to even fan-in with multiple CQs. Why were the MERGE and JOIN features from 0.8 dropped without there being a replacement? With the 0.9 documentation recommending the optimal way to structure things is to have many series and a single field named “value” (or some other key of your choice) used consistently across all series. and there apparently being no way to migrate from that kind of structure to the sort recommended at https://docs.influxdata.com/influxdb/v0.10/concepts/schema_and_data_layout/ I'm worried we're left hanging.

A viable CQ approach would be OK, but it is a lot more work than simply joining time-grouped measurements at query time, the way that influx used to work.

graphex commented Feb 23, 2016

+1 thought I was going crazy, but this is a pretty substantial omission that might mean I've got to use another project instead of influx. Many times there is just no way to get correlated information into the same measurement. Even after an arduous journey with CQs, I only found that tags aren't included in CQ writes so there is no way to even fan-in with multiple CQs. Why were the MERGE and JOIN features from 0.8 dropped without there being a replacement? With the 0.9 documentation recommending the optimal way to structure things is to have many series and a single field named “value” (or some other key of your choice) used consistently across all series. and there apparently being no way to migrate from that kind of structure to the sort recommended at https://docs.influxdata.com/influxdb/v0.10/concepts/schema_and_data_layout/ I'm worried we're left hanging.

A viable CQ approach would be OK, but it is a lot more work than simply joining time-grouped measurements at query time, the way that influx used to work.

@catchagain

This comment has been minimized.

Show comment
Hide comment
@catchagain

catchagain Feb 23, 2016

@graphex not that it solves the main problem, but tags are in fact included in CQ writes if the CQs have something like group by time(30m), * in them.

@graphex not that it solves the main problem, but tags are in fact included in CQ writes if the CQs have something like group by time(30m), * in them.

@adrianlzt

This comment has been minimized.

Show comment
Hide comment

+1

@HarasimowiczKamil

This comment has been minimized.

Show comment
Hide comment
@kipe

This comment has been minimized.

Show comment
Hide comment
@kipe

kipe Sep 29, 2016

We're using InfluxDB to store measurements gathered from the LUT Green Campus. We mainly gather measurements related to energy, which typically means turning voltage and current measurements to power. The electrical system has three phases (L1, L2, L3), which are used as fields in the schema. An example query to calculate active_power from measured voltage, current and phi for a single phase L1 would be something similar to

SELECT mean(voltage.L1) * mean(current.L1) * mean(cos(phi.L1)) GROUP BY time(5m), panel_group

Furthermore, the values for apparent_power and reactive_power can be derived from these three measurements.

kipe commented Sep 29, 2016

We're using InfluxDB to store measurements gathered from the LUT Green Campus. We mainly gather measurements related to energy, which typically means turning voltage and current measurements to power. The electrical system has three phases (L1, L2, L3), which are used as fields in the schema. An example query to calculate active_power from measured voltage, current and phi for a single phase L1 would be something similar to

SELECT mean(voltage.L1) * mean(current.L1) * mean(cos(phi.L1)) GROUP BY time(5m), panel_group

Furthermore, the values for apparent_power and reactive_power can be derived from these three measurements.

@rdworth

This comment has been minimized.

Show comment
Hide comment
@rdworth

rdworth Sep 30, 2016

As another data point on use cases we have exactly the same use case as @kipe at Flywheel Building Intelligence.

rdworth commented Sep 30, 2016

As another data point on use cases we have exactly the same use case as @kipe at Flywheel Building Intelligence.

@jwilder jwilder modified the milestones: Longer term, 1.1.0 Oct 6, 2016

@dandv

This comment has been minimized.

Show comment
Hide comment
@dandv

dandv Oct 7, 2016

Contributor

Any chance for those who just "+1"ed to remove their comments and +1 the OP instead?

It would make it easier to follow the discussion. Thanks.

Contributor

dandv commented Oct 7, 2016

Any chance for those who just "+1"ed to remove their comments and +1 the OP instead?

It would make it easier to follow the discussion. Thanks.

@thealphaking01

This comment has been minimized.

Show comment
Hide comment
@thealphaking01

thealphaking01 Oct 12, 2016

Hey Guys,

Another use case which I got was calculating a percentage, where the denominator value is in one measurement and the numerator is in another, and I needed to do something like

SELECT (sum(a.value) / sum(b.value)) * 100

where a and b are separate measurements, which is similar to what you guys mentioned. Is there no roundabout way to do this for now?

thealphaking01 commented Oct 12, 2016

Hey Guys,

Another use case which I got was calculating a percentage, where the denominator value is in one measurement and the numerator is in another, and I needed to do something like

SELECT (sum(a.value) / sum(b.value)) * 100

where a and b are separate measurements, which is similar to what you guys mentioned. Is there no roundabout way to do this for now?

@jsternberg

This comment has been minimized.

Show comment
Hide comment
@jsternberg

jsternberg Oct 12, 2016

Contributor

@thealphaking01 for your example, is that supposed to be sum(a.value) / sum(b.value) or is the second value supposed to be the raw values?

Contributor

jsternberg commented Oct 12, 2016

@thealphaking01 for your example, is that supposed to be sum(a.value) / sum(b.value) or is the second value supposed to be the raw values?

@thealphaking01

This comment has been minimized.

Show comment
Hide comment
@thealphaking01

thealphaking01 Oct 12, 2016

@jsternberg
a.value is a subset of b.value. Didn't get what you meant by raw values? For a particular duration, sum(a.value) gives me the number satisfying a criteria, while sum(b.value) gives me all possible values. Hence the percentage calculation.

@jsternberg
a.value is a subset of b.value. Didn't get what you meant by raw values? For a particular duration, sum(a.value) gives me the number satisfying a criteria, while sum(b.value) gives me all possible values. Hence the percentage calculation.

@jsternberg

This comment has been minimized.

Show comment
Hide comment
@jsternberg

jsternberg Oct 12, 2016

Contributor

Raw values are the ones directly written into the system. So SELECT value FROM cpu returns the raw values for the value field in the cpu measurement. The reason why I ask is because I didn't knonw if your above example had a typo or not since a.value is inside of an aggregate and b.value is not.

Contributor

jsternberg commented Oct 12, 2016

Raw values are the ones directly written into the system. So SELECT value FROM cpu returns the raw values for the value field in the cpu measurement. The reason why I ask is because I didn't knonw if your above example had a typo or not since a.value is inside of an aggregate and b.value is not.

@thealphaking01

This comment has been minimized.

Show comment
Hide comment
@thealphaking01

thealphaking01 Oct 12, 2016

Sorry, that was a typo. Fixed it now.

Sorry, that was a typo. Fixed it now.

@claybar

This comment has been minimized.

Show comment
Hide comment
@claybar

claybar Oct 13, 2016

I'm not sure if this is the correct case to implement the issue of case #3525 for inequalities? I would like to do something along the lines of:

SELECT mean(value) > 40

where the result is returned as a full time series of boolean values. My specific use case for this is determining if a pump is operating based on the current draw it is using.

claybar commented Oct 13, 2016

I'm not sure if this is the correct case to implement the issue of case #3525 for inequalities? I would like to do something along the lines of:

SELECT mean(value) > 40

where the result is returned as a full time series of boolean values. My specific use case for this is determining if a pump is operating based on the current draw it is using.

@jwilder jwilder referenced this issue Oct 13, 2016

Closed

SQL Joins, FTW #7422

@nathaniel

This comment has been minimized.

Show comment
Hide comment
@nathaniel

nathaniel Oct 13, 2016

Hi Jonathan,
Apparently I received this email because it mentions "@nathaniel" and
nathaniel is my GitHub username; but I am not associated with this project
in any way.
Can I please be removed from this email chain? If you would remove me from
the CC line if/when you continue the conversation further, that would be
great.

Thanks,
Nathaniel Cunningham, Ph.D.
Lincoln, NE

On Tue, Sep 27, 2016 at 7:01 PM, Jonathan A. Sternberg <
notifications@github.com> wrote:

Problem Statement

While attempting to implement this, I ran into another issue which I'll
describe here and I'll describe the solution that we determined would be
most appropriate. For those following this issue closely, my apologies
for not writing up the problem before we started talking internally so
anybody who wanted could contribute. I'm writing it up here for the sake
of transparency before we attempt to implement it.

The first thing to understand is how series are joined together
currently. When you have a single measurement with multiple fields like
the following, we need to join the values together in some way.

cpu,host=server01 idle=20,user=70,system=10 0
cpu,host=server02 idle=30,user=50,system=20 0

When we execute a query to retrieve all of the fields, we are querying 6
different series. There are 3 fields and 2 series keys for a total of 6
series, but we are only returning 2 rows at time zero. How do we
determine which series are joined together? According to the other
proposal, the series would just be joined together in a fairly random
order so you could intermix values from cpu,host=server01 with values
from cpu,host=server02. There are a few problems with this though.

  1. It's confusing for users. They expect values that have the same
    series key to be matched with each other even if that's not how the storage
    engine works. The current system matches them up so why wouldn't it
    continue to work?
  2. You would be no longer able to query tags because you would be
    unable to determine which series to take tag information from.

But, if we don't use this method of joining together series, joining
series between measurements becomes very complicated. Imagine that you
have the following in the mem measurement and join it with the above.

mem,host=server01,region=uswest free=80 0

Since this series does not have a series key that matches anything
above, the value would not be joined with any row in cpu above. While
we could force series keys to be the same across measurements for this
feature to work, we think that's too limiting.
Updated Proposal

We are proposing a new syntax and new functionality to join. When series
are being joined (either between multiple measurements or a single
measurement), the user will be able to specify which parts of the series
key should be considered while joining. Only tags that are mentioned in
the join can be used in the resulting query (including GROUP BY). The
join will happen before anything else so series will be joined into a
single row before they are grouped into different series. I think this
contrasts with Kapacitor which does the grouping first and then joins
the series. /cc @nathaniel https://github.com/nathaniel

The syntax will be something like this:

SELECT cpu.idle, cpu.user, mem.free FROM cpu, mem JOIN SERIES ON host

JOIN MEASUREMENTS has been changed to JOIN SERIES to put the focus
on how we are joining series together. Series from different
measurements are not treated in a special way.

Series will be joined together with the cartesian product for any values
at the same time. So if you tell a query to join on host and have 2
series that match with 2 series in the other field, you will end up with
4 rows total. In particular, that means queries like this will work in a
way we think is most fitting.

cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0

SELECT cpu.value, mem.value FROM cpu, mem JOIN SERIES ON host
time cpu.value mem.value


0 20 10
0 30 10

Feedback

  • When there is a tag selected in the query, is it reasonable to infer
    that series should be joined on that tag?
  • When a GROUP BY is used, is it reasonable to infer that it should be
    included in the JOIN so the user doesn't have to repeat it?

For now, we're going to keep things explicit and try to make this more
user friendly with feedback in the future. If you can think of any
problems or you think that this is the wrong way to deal with joining
measurements, please post a comment so we can address any concerns. This
is a difficult problem and we appreciate and welcome any community
involvement.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#3552 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEKc6ccxw_WrBp-7oTmJiFjXBO6DJkXks5qua5sgaJpZM4FljNk
.

Hi Jonathan,
Apparently I received this email because it mentions "@nathaniel" and
nathaniel is my GitHub username; but I am not associated with this project
in any way.
Can I please be removed from this email chain? If you would remove me from
the CC line if/when you continue the conversation further, that would be
great.

Thanks,
Nathaniel Cunningham, Ph.D.
Lincoln, NE

On Tue, Sep 27, 2016 at 7:01 PM, Jonathan A. Sternberg <
notifications@github.com> wrote:

Problem Statement

While attempting to implement this, I ran into another issue which I'll
describe here and I'll describe the solution that we determined would be
most appropriate. For those following this issue closely, my apologies
for not writing up the problem before we started talking internally so
anybody who wanted could contribute. I'm writing it up here for the sake
of transparency before we attempt to implement it.

The first thing to understand is how series are joined together
currently. When you have a single measurement with multiple fields like
the following, we need to join the values together in some way.

cpu,host=server01 idle=20,user=70,system=10 0
cpu,host=server02 idle=30,user=50,system=20 0

When we execute a query to retrieve all of the fields, we are querying 6
different series. There are 3 fields and 2 series keys for a total of 6
series, but we are only returning 2 rows at time zero. How do we
determine which series are joined together? According to the other
proposal, the series would just be joined together in a fairly random
order so you could intermix values from cpu,host=server01 with values
from cpu,host=server02. There are a few problems with this though.

  1. It's confusing for users. They expect values that have the same
    series key to be matched with each other even if that's not how the storage
    engine works. The current system matches them up so why wouldn't it
    continue to work?
  2. You would be no longer able to query tags because you would be
    unable to determine which series to take tag information from.

But, if we don't use this method of joining together series, joining
series between measurements becomes very complicated. Imagine that you
have the following in the mem measurement and join it with the above.

mem,host=server01,region=uswest free=80 0

Since this series does not have a series key that matches anything
above, the value would not be joined with any row in cpu above. While
we could force series keys to be the same across measurements for this
feature to work, we think that's too limiting.
Updated Proposal

We are proposing a new syntax and new functionality to join. When series
are being joined (either between multiple measurements or a single
measurement), the user will be able to specify which parts of the series
key should be considered while joining. Only tags that are mentioned in
the join can be used in the resulting query (including GROUP BY). The
join will happen before anything else so series will be joined into a
single row before they are grouped into different series. I think this
contrasts with Kapacitor which does the grouping first and then joins
the series. /cc @nathaniel https://github.com/nathaniel

The syntax will be something like this:

SELECT cpu.idle, cpu.user, mem.free FROM cpu, mem JOIN SERIES ON host

JOIN MEASUREMENTS has been changed to JOIN SERIES to put the focus
on how we are joining series together. Series from different
measurements are not treated in a special way.

Series will be joined together with the cartesian product for any values
at the same time. So if you tell a query to join on host and have 2
series that match with 2 series in the other field, you will end up with
4 rows total. In particular, that means queries like this will work in a
way we think is most fitting.

cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0

SELECT cpu.value, mem.value FROM cpu, mem JOIN SERIES ON host
time cpu.value mem.value


0 20 10
0 30 10

Feedback

  • When there is a tag selected in the query, is it reasonable to infer
    that series should be joined on that tag?
  • When a GROUP BY is used, is it reasonable to infer that it should be
    included in the JOIN so the user doesn't have to repeat it?

For now, we're going to keep things explicit and try to make this more
user friendly with feedback in the future. If you can think of any
problems or you think that this is the wrong way to deal with joining
measurements, please post a comment so we can address any concerns. This
is a difficult problem and we appreciate and welcome any community
involvement.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#3552 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEKc6ccxw_WrBp-7oTmJiFjXBO6DJkXks5qua5sgaJpZM4FljNk
.

@inselbuch

This comment has been minimized.

Show comment
Hide comment
@inselbuch

inselbuch Oct 13, 2016

Opening a feature request kicks off a discussion.
Requests may be closed if we're not actively planning to work on them.

Proposal: Implement SQL Joins in the Influx Query Language
Current behavior: Not supported.
Desired behavior: Supported.

Use case:

When requesting data from InfluxDB it would be very useful to combine metadata, configuration data, etc., from one "table" with the time-series data. For example:

assets (measurement)
time (actually not used in this table)
id (tag) = cryptic internal identifier
friendly (field) = character string
port (field) = number (TCP/IP port number used by the device)

i.e.,
insert assets,id=y268938rjnau3 friendly='Asset1',port=64200

rundata (measurement)
time
id (tag) = cryptic internal identifier
temperature (field)
pressure (field)

This works great:
select last(temperature) from rundata group by mac

But this is what I really want:

Timestamp Machine Temperature

10/05/2016 8:04:06pm C2475 1675.4
10/02/2016 9:02:11am C7524 850.5

That might be done like this:

select
c.time as "Timestamp", a.friendly as "Machine",last(c.temperature) as "Temperature"
from
rundata c,assets a
where
c.id = a.id
group by
c.mac

Opening a feature request kicks off a discussion.
Requests may be closed if we're not actively planning to work on them.

Proposal: Implement SQL Joins in the Influx Query Language
Current behavior: Not supported.
Desired behavior: Supported.

Use case:

When requesting data from InfluxDB it would be very useful to combine metadata, configuration data, etc., from one "table" with the time-series data. For example:

assets (measurement)
time (actually not used in this table)
id (tag) = cryptic internal identifier
friendly (field) = character string
port (field) = number (TCP/IP port number used by the device)

i.e.,
insert assets,id=y268938rjnau3 friendly='Asset1',port=64200

rundata (measurement)
time
id (tag) = cryptic internal identifier
temperature (field)
pressure (field)

This works great:
select last(temperature) from rundata group by mac

But this is what I really want:

Timestamp Machine Temperature

10/05/2016 8:04:06pm C2475 1675.4
10/02/2016 9:02:11am C7524 850.5

That might be done like this:

select
c.time as "Timestamp", a.friendly as "Machine",last(c.temperature) as "Temperature"
from
rundata c,assets a
where
c.id = a.id
group by
c.mac

@pauldix

This comment has been minimized.

Show comment
Hide comment
@pauldix

pauldix Oct 13, 2016

Member

@nathaniel we don't have the capability to unsubscribe you, but if you click the link in the email notification and follow it to this issue's Github page, you can click the "Unsubscribe" button at the bottom of the right hand side. Sorry about the accidental mention.

Member

pauldix commented Oct 13, 2016

@nathaniel we don't have the capability to unsubscribe you, but if you click the link in the email notification and follow it to this issue's Github page, you can click the "Unsubscribe" button at the bottom of the right hand side. Sorry about the accidental mention.

@daviesalex

This comment has been minimized.

Show comment
Hide comment
@daviesalex

daviesalex Nov 2, 2016

Contributor

I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:

SELECT non_negative_derivative(last(PortXmitData * 4), $interval) from ibstats WHERE "interface" =~ /ib.*/ AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)

Note the PortXmitData * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.

One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.

Contributor

daviesalex commented Nov 2, 2016

I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:

SELECT non_negative_derivative(last(PortXmitData * 4), $interval) from ibstats WHERE "interface" =~ /ib.*/ AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)

Note the PortXmitData * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.

One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.

@inselbuch

This comment has been minimized.

Show comment
Hide comment
@inselbuch

inselbuch Nov 2, 2016

I understand your issue… makes sense… the same issue in my opinion.
But I don’t see tx_bytes*4 in your query…

I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:

SELECT non_negative_derivative(last(PortXmitData), $interval) from ibstats WHERE ("interface" =~ /ib./ OR "interface" =~ /p./) AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)

Note the tx_bytes * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.

One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.


You are receiving this because you commented.
Reply to this email directly, view it on GitHubhttps://github.com/influxdata/influxdb/issues/3552#issuecomment-257987465, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVlReR3OTMcZ_DcCy4OJy3ea4qU58p5Pks5q6PFqgaJpZM4FljNk.

I understand your issue… makes sense… the same issue in my opinion.
But I don’t see tx_bytes*4 in your query…

I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:

SELECT non_negative_derivative(last(PortXmitData), $interval) from ibstats WHERE ("interface" =~ /ib./ OR "interface" =~ /p./) AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)

Note the tx_bytes * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.

One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.


You are receiving this because you commented.
Reply to this email directly, view it on GitHubhttps://github.com/influxdata/influxdb/issues/3552#issuecomment-257987465, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVlReR3OTMcZ_DcCy4OJy3ea4qU58p5Pks5q6PFqgaJpZM4FljNk.

@daviesalex

This comment has been minimized.

Show comment
Hide comment
@daviesalex

daviesalex Nov 3, 2016

Contributor

@inselbuch my apologies, I fixed the example query.

The problem is that this works:
non_negative_derivative(last(PortXmitData), $interval)
And this does not:
non_negative_derivative(last(PortXmitData * 4), $interval)

Contributor

daviesalex commented Nov 3, 2016

@inselbuch my apologies, I fixed the example query.

The problem is that this works:
non_negative_derivative(last(PortXmitData), $interval)
And this does not:
non_negative_derivative(last(PortXmitData * 4), $interval)

@espiegel

This comment has been minimized.

Show comment
Hide comment
@espiegel

espiegel Nov 3, 2016

+1

Edit: +1ed the OP

espiegel commented Nov 3, 2016

+1

Edit: +1ed the OP

@cattt84

This comment has been minimized.

Show comment
Hide comment

cattt84 commented Nov 7, 2016

+1

@joshzitting

This comment has been minimized.

Show comment
Hide comment
@joshzitting

joshzitting Dec 6, 2016

I am currently in the process of trying to switch from graphite to influx with grafana as our front end.. I currently have queries like this for graphite but I havent found a way to convert them to influx yet..

asPercent(reporting1_dev.memory.used,reporting1_dev.memory.total)
asPercent(reporting1_dev.memory.buffers,reporting1_dev.memory.total)

Also like this

asPercent(nonNegativeDerivative(reporting1_dev.cpu.total.iowait),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))
asPercent(sumSeries(nonNegativeDerivative(reporting1_dev.cpu.total.{guest,iowait,nice,steal,irq,softirq})),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))

Any help would be great!!!
Thanks!
Josh

So I found that I if I am doing math all of the info has to be in the same table.

joshzitting commented Dec 6, 2016

I am currently in the process of trying to switch from graphite to influx with grafana as our front end.. I currently have queries like this for graphite but I havent found a way to convert them to influx yet..

asPercent(reporting1_dev.memory.used,reporting1_dev.memory.total)
asPercent(reporting1_dev.memory.buffers,reporting1_dev.memory.total)

Also like this

asPercent(nonNegativeDerivative(reporting1_dev.cpu.total.iowait),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))
asPercent(sumSeries(nonNegativeDerivative(reporting1_dev.cpu.total.{guest,iowait,nice,steal,irq,softirq})),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))

Any help would be great!!!
Thanks!
Josh

So I found that I if I am doing math all of the info has to be in the same table.

@ikkaro

This comment has been minimized.

Show comment
Hide comment
@ikkaro

ikkaro Dec 16, 2016

+1 we need to do some maths between measurements

ikkaro commented Dec 16, 2016

+1 we need to do some maths between measurements

@lpic10

This comment has been minimized.

Show comment
Hide comment
@lpic10

lpic10 Jan 9, 2017

This is a huge feature missing from InfluxDB, making it much less powerful than Prometheus.

lpic10 commented Jan 9, 2017

This is a huge feature missing from InfluxDB, making it much less powerful than Prometheus.

@titilambert titilambert referenced this issue in home-assistant/home-assistant Jan 9, 2017

Merged

Improve InfluxDB #5238

@actionjax

This comment has been minimized.

Show comment
Hide comment

+1

@trondvh

This comment has been minimized.

Show comment
Hide comment

trondvh commented Jan 18, 2017

+1

@andrewpuch

This comment has been minimized.

Show comment
Hide comment

+1

@joshzitting

This comment has been minimized.

Show comment
Hide comment
@joshzitting

joshzitting Jan 20, 2017

I think the point has been stated with all of the +1s... This thread should be locked until there is progress on it.

I think the point has been stated with all of the +1s... This thread should be locked until there is progress on it.

@tblok

This comment has been minimized.

Show comment
Hide comment

tblok commented Feb 10, 2017

+1

@jsternberg

This comment has been minimized.

Show comment
Hide comment
@jsternberg

jsternberg Feb 10, 2017

Contributor

I'm locking this again. While that means you won't be able to add your 👍 reaction, I think there are enough of them that we're well aware people want this feature. We do want to hear any resources that may be useful in terms of implementing this. You can look in the commit log to find my email. If you want to wait for comments regarding this issue, please use the "Subscribe" button on the issue instead of responding to the issue.

Thank you.

Contributor

jsternberg commented Feb 10, 2017

I'm locking this again. While that means you won't be able to add your 👍 reaction, I think there are enough of them that we're well aware people want this feature. We do want to hear any resources that may be useful in terms of implementing this. You can look in the commit log to find my email. If you want to wait for comments regarding this issue, please use the "Subscribe" button on the issue instead of responding to the issue.

Thank you.

@influxdata influxdata locked and limited conversation to collaborators Feb 10, 2017

@jsternberg jsternberg removed their assignment Apr 28, 2017

@rbetts rbetts removed this from the Longer term milestone Oct 27, 2017

@rbetts rbetts changed the title from [feature request] Mathematics across measurements to Mathematics across measurements Oct 27, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.