Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select a tag key when the tag and field conflict #6519

Closed
jsternberg opened this issue Apr 29, 2016 · 18 comments
Closed

Select a tag key when the tag and field conflict #6519

jsternberg opened this issue Apr 29, 2016 · 18 comments
Assignees
Labels
area/influxql Issues related to InfluxQL query language area/queries
Milestone

Comments

@jsternberg
Copy link
Contributor

jsternberg commented Apr 29, 2016

Problem

This is a single issue to discuss #4630.

When points are written to the same measurement where a tag key and a field key conflict with each other, it is impossible to query the tag key with InfluxQL. The following examples (rendered to different measurements to demonstrate these are different examples) leaves the tag not queryable.

Example 1

example1,host=server01 value=2
example1 host='server02',value=3

Example 2

example2,host=server01 host='server02',value=2

The following queries become ambiguous and cannot be resolved correctly, so they default to selecting the field rather than the tag.

SELECT host, value FROM example1
SELECT host, value FROM example2

Background

There is a bit of a debate on exactly what a tag is. Before talking about what a tag is, here is a list of exactly what capabilities a tag has:

  • They are indexed.
  • They are always strings.
  • Points that have the same tag keys/values make up a series.
  • They are associated with the series, not the individual point.

Since a tag is associated with the series and not the individual point, tags do not have a time associated with them. This makes it impossible to query for only a tag in a measurement since a select works by finding the time associated with a field and returning the value at that time. Since tags are not associated with points, they also do not have a time and cannot be iterated over. To query a tag with a select, they are wholly dependent on being queried with a field in the same query.

There is also no way to differentiate between a field and a tag in InfluxQL at the moment. The search path for how to treat a variable reference is:

  • Check if the measurement (not series) has a field with the value of the key.
  • Check if the series has a tag with the value of the key.
  • Pretend the cursor is a tag with a blank value.

Because of the first step, it makes it impossible to reference a tag once any series in a measurement writes uses a key as a field even if there was previous data with that tag.

Possible Solutions

Solution 1

Change the prioritization so it will read a tag over a field (if present) and restrict points from being written with a conflicting field and tag key as described and implemented in #6410.

This solution makes it so example 1 is possible and works correctly when querying so that the first series will return the value of the tag and the second will return the value in the field. It makes example 2 into an error so it becomes impossible to get into this situation to begin with.

Pros:

  • Tags become equivalent to indexed fields that are always strings.
  • No additional complexity is added to the query language or query engine beyond a small fix to how it creates cursors.
  • Makes SELECT value, host FROM cpu GROUP BY host use a consistent value (currently it will use the tag for the GROUP BY and the field for the actual selection).

Cons:

  • Tags only sometimes act as fields. As described earlier, you cannot query a tag by itself. If you do SELECT host FROM example1 you will get a weird result where it will only return the second point.
  • Old data written where the field and the tag were written as part of the same series will still be unreadable, but you won't be able to write any new data.
  • Breaking change.

Solution 2

Introduce new syntax to specify whether you meant to grab a tag or a field to InfluxQL as described and implemented in #6509 or as proposed in #4823.

This solution makes it so both examples are queryable and doesn't require any change to existing data or the point writer, but does require users to use new syntax to reference the correct value.

Pros:

  • Backwards compatible.
  • Keeps tags and fields as completely separate entities rather than tags being a special type of field.
  • May allow us to detect when someone tries to select only a tag as we will be able to notify the user that they must select at least one field. This benefit is only if this new syntax is made required to access tags (not backwards compatible anymore) or if somebody actually uses it.

Cons:

  • The implementation is more complex and InfluxQL will have to be modified.
  • To maintain backwards compatibility, you still have to allow tags to be referenced with the same variable reference as fields. If you write host as a tag and then write it once as a field to one series in the measurement, all of your previous queries will break. It is much more optimal to require that the new syntax is always used for tags, but then we're not backwards compatible anymore.
Proposed Syntax
Solution 1

Prepend the variable reference with tag or field.

SELECT tag.host, field.value FROM cpu

The syntax for this looks pretty ugly in my personal opinion and I dislike that it adds more reserved keywords to a language that already has a lot of them. It becomes impossible to have a key with these names which makes this a backwards incompatible change. We then also might have to add code that strips the front of the variable reference from the ident as the query engine passes around auxiliary fields as a raw string. That means the identifier cannot just be stripped at parse time since we need to send it to the underlying engine which doesn't accept the AST structs as arguments. We can modify the iterator options struct to include references for the auxiliary fields, but this changes the wire protocol for RPC and might make things more complicated with protobuf.

Solution 2
SELECT @host, value FROM cpu

Prepend the input with a special sigil to signify that a tag is being referenced. The syntax for this is common in programming languages like Perl and Ruby. Some programmers may be offput by the syntax and so it might not be desirable (Perl and Ruby's use of sigils is a holy war among programmers). The positive benefit is that it is one character (very easy to check and strip efficiently) and so it is easier to pass around as a string without involving complex AST structures. This is also a backwards compatible change since the @ symbol is not used like this anywhere else.

What sigil is used specifically is debatable. I would like to keep $ reserved for future InfluxQL support for a Template node in the AST since I think this will make some things, like Chronograf or Influx Stress, easier to implement using the influxql package. I also avoided # since that's a comment character even though I was very tempted to use it because tag and hashtag both reference tags (even if its real name is the pound sign).

Solution 3
SELECT host::tag, value::string FROM cpu

This syntax alleviates one of the problems from solution 1 because it is less likely to conflict with a real measurement name, but it still may be difficult to strip off the end in an efficient manner when passing it through auxiliary fields. Likely, the best solution would be to change the function signature of IteratorOptions and just accept the wire protocol will change. The syntax is supposed to correspond to the casting syntax from PostreSQL, but we're not technically doing a cast, but a selection. While true, we can extend this to also allow specifying the field type of a field. If a field type is given, that will be considered a cast for a field and will override the default field type chosen. Tag would be treated as a separate type.

@jsternberg
Copy link
Contributor Author

@jwilder
Copy link
Contributor

jwilder commented Apr 29, 2016

There was another syntax propsed as well which is modeled after Postgres's cast syntax.

Solution 3
SELECT host::tag, value::field FROM cpu

The first syntax may cause parsing trouble because '.' is allowed in field and tags name too.

@toddboom
Copy link
Contributor

I like #2 and #3. It seems like #2 would be cleaner to implement, and shouldn't cause too much drama if it's only required in the event of a conflict.

@jsternberg
Copy link
Contributor Author

Ah, I'll add that syntax too. I personally don't like it because we're not doing a cast and it has some of the same problems solution 1 has, but it deserves to be on there anyway.

@benbjohnson
Copy link
Contributor

I personally like syntax 3 (::). It's not technically a cast although I can see us using that syntax for forcing a data type in the future and I think it works well from a user's point of view.

e.g. SELECT value::string FROM cpu

@jsternberg
Copy link
Contributor Author

@benbjohnson I just added the syntax and pointed out that syntax should probably be used for forcing a specific type when types conflict. I think that syntax fits a lot better in your example than as a specifier for tag vs field.

@benbjohnson
Copy link
Contributor

I find the @ to be confusing. Someone who knows InfluxQL well could look at that and I know what it is but your average user is going to have no idea what that is if they saw a query with it. At least with ::tag the user will always know that it's specifying a tag.

@benbjohnson
Copy link
Contributor

Also, the field types don't overlap with tags. You'd never specify a value::string::tag. Either you're specifying that you want a "field of type X" or a "tag".

e.g.

SELECT value::integer FROM cpu
SELECT value::float FROM cpu
SELECT value::string FROM cpu
SELECT value::boolean FROM cpu
SELECT value::tag FROM cpu

@jsternberg
Copy link
Contributor Author

Hm, that could work. We would then not include a field cast and would say that tags are not "just strings". The concerns I would have though is it would require a wire protocol change for the RPC because the Aux field in IteratorOptions would need to be changed from []string to []*influxql.VarRef (or something else) so that it could have knowledge of both the name of the variable and the type that it's supposed to be cast to (with unknown as the default). Otherwise, the engine won't have any knowledge about what type it's supposed to select and it won't know if it's supposed to select a tag.

@benbjohnson
Copy link
Contributor

I don't think the RPC concern is specific to the :: syntax. The @ has the same issue. It either needs a flag or it needs to be parsed by the receiver.

@jsternberg
Copy link
Contributor Author

The sigil syntax is much easier to parse than the others so it can be embedded within the string without requiring any kind of special parser. The :: syntax requires looking at the entire string rather than just the first character.

@jwilder
Copy link
Contributor

jwilder commented Apr 29, 2016

I prefer :: because of the same reasons @benbjohnson mentioned. There is also precedence in other dbs so it's likely more understandable by users.

@jsternberg
Copy link
Contributor Author

Updated solution 3 to include the information from what @benbjohnson and I discussed.

@jwilder jwilder added area/queries area/influxql Issues related to InfluxQL query language labels Apr 30, 2016
@pauldix
Copy link
Member

pauldix commented Apr 30, 2016

There's still the problem of what to do about SELECT * FROM foo queries. Right now it will return all tags and fields. Having the field and tag prefix makes SELECT field::* FROM foo possible.

Maybe another way to make that work with the other syntaxes?

@benbjohnson
Copy link
Contributor

We could still use *::field to select all fields.

@jsternberg
Copy link
Contributor Author

Hm, I'm not really sure the best way to do that. It seems like we're putting too many features into one piece of syntax. It also seems like we're diverging from this being the cast operator. I dislike allowing ::field unless we're going to say, "We are casting to a field." I'm perfectly fine with ::tag because a tag can be considered to be a completely different data type.

Unless there's some precedent for using a cast operator with a wildcard, I don't think that makes sense and I'm not sure how useful the feature is.

The rationale for allowing ::float, ::int, ::string, and ::boolean are that we would be casting iterators to a specific type. Normally, we automatically choose which type to cast to. The cast operator would explicitly choose which type to cast to rather than it being automatic and values that can't be cast to another type (like string can't go to float) will just be discarded as they currently are.

jsternberg added a commit that referenced this issue May 1, 2016
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `tag1::tag` to specify that a tag should be selected.

This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement.

This changes the RPC wire protocol and is not backwards compatible.

Fixes #6519.
jsternberg added a commit that referenced this issue May 1, 2016
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `tag1::tag` to specify that a tag should be selected.

This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.

The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.

This changes the RPC wire protocol and is not backwards compatible.

Fixes #6519.
@jsternberg jsternberg self-assigned this May 2, 2016
jsternberg added a commit that referenced this issue May 2, 2016
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.

This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.

The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.

This changes the RPC wire protocol and is not backwards compatible.
InfluxDB will be able to process old messages, but it will not be able
to communicate with old coordinators.

Fixes #6519.
jsternberg added a commit that referenced this issue May 10, 2016
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.

This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.

The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.

This changes the RPC wire protocol and is not backwards compatible.
InfluxDB will be able to process old messages, but it will not be able
to communicate with old coordinators.

Fixes #6519.
@jsternberg jsternberg added this to the 1.0.0 milestone May 11, 2016
jsternberg added a commit that referenced this issue May 13, 2016
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.

This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.

The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.

This changes the RPC wire protocol and is not backwards compatible.
InfluxDB will be able to process old messages, but it will not be able
to communicate with old coordinators.

Fixes #6519.
jsternberg added a commit that referenced this issue May 16, 2016
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.

This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.

The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.

Fixes #6519.
@adampl
Copy link

adampl commented Jun 12, 2016

Hi. For me, as a user, the idea of a tag type is ugly because it's not really a type, and selecting either a field or a tag by the same name using the typecast sounds like a very dirty hack. I know I'm late in this discussion, but for me the idea of prepending tag name with a @ was much better.

Also it would be nice if selecting all fields or all tags, or both (grouped separately as fields and tags) were possible - and it would be consistent with the way the data is written.

( comment copied from the docs repo as advised by @beckettsean )

@timhallinflux timhallinflux modified the milestones: 1.0.0, 1.0.0 beta Dec 19, 2016
@jacktang
Copy link

jacktang commented Apr 4, 2017

Hello, @jsternberg

Has this feature been provided in InfluxDB 1.2? I tried but it threw syntax error

> select max(water_level), max(water_level)::time from h2o_feet where time > 1439856000000000000 and time < 1439859240000000000
ERR: error parsing query: found ::, expected FROM at line 1, char 42

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/influxql Issues related to InfluxQL query language area/queries
Projects
None yet
Development

No branches or pull requests

8 participants