Update Data Exploration #817

Merged

rkuchan merged 6 commits into master from rk-more-issues

Nov 3, 2016

Contributor

rkuchan commented Oct 18, 2016 (edited)

Restructures and reformats the Data Exploration page. Adds consistent headers (like: Syntax, Description of Syntax, Examples, Common Issues) for each section to make the doc easier to parse and understand.

It updates and edits all of the content on the page. I tried to include a lot more syntax-specific information.

Fixed issues:

#796: Creates a section dedicated to how to specify a measurement(s) in the FROM clause, including how to fully qualify the measurement.

#536: Changes the offset_interval examples to clarify its function. I spent a lot of time on this and am very worried I've made it worse.

rkuchan force-pushed the rk-more-issues branch from 4ebcc6a to b39fb44 Compare

October 18, 2016 18:53

beckettsean approved these changes

Contributor

beckettsean left a comment

looks great! I like the re-org

beckettsean suggested changes

Contributor

beckettsean left a comment

just a few things need updating, but overall great work

content/influxdb/v1.0/query_language/data_exploration.md

+              #### Syntax
+              ```
+              SELECT <function>(<field_key>) FROM <measurement_name> [WHERE <time_range>] GROUP BY [ * | <tag_key>[,<tag_key] ]

Contributor

beckettsean Oct 19, 2016

I think capitalize function:

SELECT <FUNCTION>(<field_key>) FROM <measurement_name> ...

Contributor

beckettsean Oct 19, 2016

Missing bracket:

[ * | <tag_key>[,<tag_key>]]

content/influxdb/v1.0/query_language/data_exploration.md


		#### Description of Basic Syntax

		`GROUP BY <tag>` queries require and InfluxQL [function](/influxdb/v1.0/query_language/functions/).

Contributor

beckettsean Oct 19, 2016

require an InfluxQL ...

content/influxdb/v1.0/query_language/data_exploration.md

-                  `w` weeks
+              #### Syntax
+              ```
+              SELECT <function>(<field_key>) FROM <measurement_name> WHERE <time_range> GROUP BY time(<time_interval>),[tag_key]

Contributor

beckettsean Oct 19, 2016

capitalize FUNCTION

Contributor

beckettsean Oct 19, 2016

Also, let's indicate tags in the GROUP BY can be 0 to many:

...GROUP BY time(<time_interval>)[,<tag_key>[,<tag_key>]]

content/influxdb/v1.0/query_language/data_exploration.md

		```
		> SELECT "water_level" FROM "h2o_feet" WHERE time >= '2015-08-18T00:00:00Z' AND time <= '2015-08-18T00:30:00Z'

Contributor

beckettsean Oct 19, 2016

SELECT "water_level", "location" FROM ...

content/influxdb/v1.0/query_language/data_exploration.md

		```
		> SELECT count("water_level") FROM "h2o_feet" WHERE "location"='coyote_creek' AND time >= '2015-08-18T00:06:00Z' AND time <= '2015-08-18T00:12:00Z' GROUP BY time(12m)

Contributor

beckettsean Oct 19, 2016

drop the "location" tag from the WHERE clause, since it's not in the data anyway.

I would also clarify in the following text that the lower time boundary is 00:06:00, because I missed that at first and couldn't understand why users would expect COUNT = 2 at 00:06:00.

content/influxdb/v1.0/query_language/data_exploration.md


		Explanation:

		Because the query covers a 12 minute time range and groups results into 12 minute

Contributor

beckettsean Oct 19, 2016

The query starts at 00:06:00 and covers 12 minute intervals. Many users expect the interval to start at 00:06:00, the explicitly supplied lower time boundary, and extend for 12 minute groups from there. The expectation is that the query would return a COUNT of 2 with the timestamp 2015-08-18T00:06:00Z.

content/influxdb/v1.0/query_language/data_exploration.md

+              #### Syntax
+              ```
+              SELECT <function>(<field_key>) FROM <measurement_name> WHERE <time_range> GROUP BY time(<time_interval>,<offset_interval>),[tag_key]

Contributor

beckettsean Oct 19, 2016

FUNCTION

Contributor

beckettsean Oct 19, 2016

...GROUP BY time(<time_interval>,<offset_interval>)[,<tag_key>[,<tag_key>]]

content/influxdb/v1.0/query_language/data_exploration.md

+              , and on InfluxDB's preset time boundaries to determine the raw data included in each time boundary
+              and the timestamps returned by the query.
+              #### Examples of Advanced Syntax

Contributor

beckettsean Oct 19, 2016

Looks like WIP, lemme know when it's ready for review.

content/influxdb/v1.0/query_language/data_exploration.md

+              name: h2o_feet
+              --------------
+              time                   count
+              2015-08-18T00:06:00Z   2
               ```
               ## The `GROUP BY` clause and `fill()`

Contributor

beckettsean Oct 19, 2016

Is fill(null) the default behavior? Seems like it is, but we should mention what the default is when fill() isn't specified.

rkuchan added the WIP label

rkuchan force-pushed the rk-more-issues branch 4 times, most recently from 8ec6c42 to 692cbe5 Compare

October 26, 2016 01:33

rkuchan force-pushed the rk-more-issues branch 2 times, most recently from 9d3e3f6 to a31c53e Compare

October 31, 2016 19:02

rkuchan changed the title ~~Edit existing docs with issues~~ Update Data Exploration

rkuchan added 3 commits

October 31, 2016 12:08


          Update data_exploration, #796,#536

5187d45


          Update links to match new data exploration headers

fd1f73f


          Add time and *influxql.VarRef to error page

e73ee67

rkuchan force-pushed the rk-more-issues branch from a31c53e to e73ee67 Compare

October 31, 2016 19:08

rkuchan removed the WIP label

Contributor

beckettsean commented Oct 31, 2016

Can't comment per-line because the diff is too big and I don't grok the GitHub Mac client enough to make PR comments.

line 192:

Identifiers must be double quoted if they contain characters other than [A-z,0-9,_], or if they are an InfluxQL keyword. While not always necessary, we recommend that you double quote identifiers.

Technically identifiers that start with a digit must also be quoted. I think that's worth mentioning here, since starting an identifier with a digit is common enough.

Identifiers must be double quoted if they contain characters other than [A-z,0-9,_], if they begin with a digit, or if they are an InfluxQL keyword. While not always necessary, we recommend that you double quote identifiers.
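
The quoting rule above can be sketched as a small check. This is only an illustration of the rule as worded here, not InfluxDB's parser: it reads `[A-z,0-9,_]` as ASCII letters, digits, and underscore, `SAMPLE_KEYWORDS` is a hypothetical and deliberately incomplete keyword sample, and treating keywords case-insensitively is an assumption.

```python
import re

# Hypothetical, incomplete sample; the real InfluxQL keyword list is longer.
SAMPLE_KEYWORDS = {"select", "from", "where", "group", "by", "limit", "offset"}

# An identifier that may stay unquoted: letters, digits, and underscore only,
# and it must not begin with a digit.
_UNQUOTED = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def needs_double_quotes(identifier, keywords=SAMPLE_KEYWORDS):
    """Return True if the identifier must be double quoted under the rule above."""
    if identifier.lower() in keywords:  # assumption: keywords match case-insensitively
        return True
    return _UNQUOTED.fullmatch(identifier) is None


checks = {name: needs_double_quotes(name)
          for name in ["h2o_feet", "1st_sensor", "water-level", "limit"]}
```

Under these assumptions only `h2o_feet` can stay unquoted; `1st_sensor` fails because of the leading digit, which is the case the suggested wording adds.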

line 387:
WHERE [<cond_expr> [(AND|OR) <cond_expr>]…]
<cond_expr> = [ field_key | tag_key | time_condition ] binary_operator [ 'string' | integer | float | boolean | "RFC3339_timestamp" ]

line 415:
Is regex valid for field comparisons in 1.1?

Contributor

beckettsean commented Oct 31, 2016

line 636:
GROUP BY <tag> queries group query results by a user-specified set of tags.

line 653:
maybe mention that the order of the GROUP BY tags or times is irrelevant?

line 785:
Mention that because there are only two tags the output is identical to the above, where we explicitly specified each of the two tags

line 830:
that query wouldn't produce the output shown, as "location" is not part of it.

Line 935:
reword to something like: "The following query covers a 12-minute time range and groups results into 12-minute intervals, but it returns _two_ results."

line 943:
field output would be called "count", as that is the function, not mean

line 950:
InfluxDB uses preset round-number time boundaries for GROUP BY intervals, independent of any time conditions in the WHERE clause. When it calculates the results, all returned data must occur within the query's explicit time range but the GROUP BY intervals will be based on the preset time boundaries.
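
A rough Python sketch of that behavior, for readers who want to see the arithmetic. It is not InfluxDB's implementation: it assumes the preset boundaries are whole multiples of the interval since the Unix epoch, and that the sample data has one point every six minutes. `bucket_start` and `count_per_bucket` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, interval):
    """Floor ts to the preset boundary at or before it (assumed epoch-aligned)."""
    return ts - (ts - EPOCH) % interval


def count_per_bucket(points, start, end, interval):
    """Count points inside [start, end], grouped by preset bucket start."""
    counts = {}
    for ts in points:
        if start <= ts <= end:  # the WHERE time range filters the raw data
            key = bucket_start(ts, interval)  # the bucket ignores the WHERE range
            counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))


t0 = datetime(2015, 8, 18, 0, 0, tzinfo=timezone.utc)
points = [t0 + timedelta(minutes=6 * i) for i in range(6)]  # assumed 6-minute spacing

# time >= 00:06:00 AND time <= 00:12:00, GROUP BY time(12m)
result = count_per_bucket(
    points,
    start=t0 + timedelta(minutes=6),
    end=t0 + timedelta(minutes=12),
    interval=timedelta(minutes=12),
)
```

Under those assumptions `result` has two entries, one point in the 00:00:00 bucket and one in the 00:12:00 bucket, rather than the single COUNT of 2 at 00:06:00 that many users expect.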

Contributor

beckettsean commented Nov 1, 2016

line 1079:
I still wasn't sure at first if the table that follows was with or without the offset. Since we give the offset query first, it's strange to see the results in the opposite order.
I think it would be helpful to explicitly state that we're looking at the last query results first.

The time boundaries and returned timestamps for the query without the offset_interval adhere to InfluxDB's preset time boundaries. Let's first examine the results without an offset:

line 1187:
It's a little confusing that shifting the buckets forward by +6m is the same as shifting them backward by -12m. It begs the question, which should I use? I think it actually doesn't really matter, but we should be explicit, I think. "Shift by whatever is most intuitive" or something like that. Maybe even talk about how each method creates an "empty" bucket with no results, since it lies entirely outside the WHERE time range. E.g. shifting forward 6 min means the entire 4th bucket falls outside the query range. Shifting back 12 min means the first bucket happens entirely before the start of the WHERE time range. If we show the empty buckets it might clarify things a bit, since there are still four buckets, more or less.
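
One way to see why the two offsets coincide, sketched in Python. This assumes an 18-minute interval (the interval is not stated in this comment; 18m is the value for which +6m and -12m differ by exactly one interval) and assumes boundaries are epoch-aligned multiples of the interval shifted by the offset. `bucket_start` is a hypothetical helper, not InfluxDB code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, interval, offset=timedelta(0)):
    """Floor ts to the nearest boundary of the form EPOCH + offset + k * interval."""
    return ts - (ts - EPOCH - offset) % interval


interval = timedelta(minutes=18)  # assumed interval
t0 = datetime(2015, 8, 18, 0, 0, tzinfo=timezone.utc)
samples = [t0 + timedelta(minutes=6 * i) for i in range(10)]

forward = [bucket_start(ts, interval, timedelta(minutes=6)) for ts in samples]
backward = [bucket_start(ts, interval, timedelta(minutes=-12)) for ts in samples]
same_boundaries = forward == backward
```

Because the two offsets differ by a whole interval, every timestamp lands in the same bucket either way, which supports the "shift by whatever is most intuitive" wording.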

line 1248:
Again, maybe show the empty second bucket past the end of the time range.

line 1280:
Any numerical value: Reports the given numerical value for time intervals with no data.

line 1292:
null: Reports null for time intervals with no data but returns a timestamp. This is the same as the default behavior.

line 1539:
This is backreferencing.
to
:MEASUREMENT is a backreference to each measurement matched in the FROM clause.

line 1706:
The prior examples don't use GROUP BY * to preserve tags, so shouldn't we see some tags turned into fields in the output?

line 1742:
Something to the effect that "ORDER BY time DESC must come after the WHERE clause if there's no GROUP BY clause."

line 1812:
Queries with a LIMIT clause require GROUP BY * for deterministic results. LIMIT queries without a GROUP BY * clause may return different results when the query is re-run.

line 1884:
@jwilder @jsternberg @benbjohnson Are the series returned by SLIMIT idempotent? Are they deterministic? Said another way, will I always get the same series for the same query? And is it possible to know in advance which series I will get? We need to be explicit one way or the other. If it's random, that's fine, but we need to tell people. If it's not random but can't be predicted, that's also good to note. If it's not random and is predictable, we need to give users the algorithm to determine the series returned. (If it is deterministic, I imagine it's related to sorting by the series key.)

line 1932:
Note that without SLIMIT 1, the query would return results for the two series associated with the h2o_feet measurement, location=coyote_creek and location=santa_monica.

line 1993:
Note that without LIMIT 2 SLIMIT 1, the query would return four points for each of the two series associated with the h2o_feet measurement.

line 2047:
The query returns the fourth, fifth, and sixth points from the two series

line 2071:
The LIMIT 2 clause limits the number of points returned per series to two.

line 2072:
The OFFSET 2 clause excludes the first two points per series.

line 2095:
The SOFFSET clause requires both GROUP BY * and an SLIMIT clause.
@jwilder @jsternberg why does SOFFSET require an SLIMIT clause? I don't see an obvious reason why that would be necessary.

line 2143:
I think this needs something at the end of the line so that it will start a newline for the next sentence. Right now they are appearing on the same line. (Missing space or something?)

line 2160:
can we link to a description elsewhere of the min time? Would be nice to explain to new users why that very random looking date is the lowest valid timestamp.

line 2175:
...date-time strings or epoch time. (note the E, not datA)

line 2182:
Let's be explicit that OR is not allowed for time ranges, and link to the FAQ and/or GitHub Issue

line 2216:
maybe
Epoch time is the number of duration literals that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970.

line 2344:
SELECT_clause FROM_clause WHERE time now() [[ - | + ] <duration_literal>]
added whitespace to the operators

line 2415:
Note that InfluxDB only returns points where the water_level field has data.

line 2468:
Regular expression comparisons are more computationally intensive than exact string comparisons and thus queries with regular expressions are not as performant as those without.

line 2539:
it's a little weird that there's no output. It's missing the CLI return for the query.

line 2573:
Might be worth putting a &pretty flag on the query string so that the JSON output is more readable

beckettsean suggested changes

Contributor

beckettsean left a comment

looks great! amazing improvement. lots of little notes but I think we can get this out before the end of the week.

rkuchan mentioned this pull request

Regex on field keys in the SELECT clause #834

Closed

Contributor

jsternberg commented Nov 2, 2016

Are the series returned by SLIMIT idempotent? Are they deterministic? Said another way, will I always get the same series for the same query? And is it possible to know in advance which series I will get? We need to be explicit one way or the other. If it's random, that's fine, but we need to tell people. If it's not random but can't be predicted, that's also good to note. If it's not random and is predictable, we need to give users the algorithm to determine the series returned. (If it is deterministic, I imagine it's related to sorting by the series key.)

SLIMIT <n> is idempotent, but it won't necessarily return <n> series because the series it chooses are based on the global list of series rather than what is in the shards that are being queried. So if you have cpu,host=server01, cpu,host=server02, and cpu,host=server03 and you use SLIMIT 1, you'll always get cpu,host=server01. But if the shard you query doesn't contain any points for cpu,host=server01 then you will get no results.

The SOFFSET clause requires both GROUP BY * and an SLIMIT clause. why does SOFFSET require an SLIMIT clause? I don't see an obvious reason why that would be necessary.

It shouldn't. This sounds like an error to me if it's happening. If you file a bug report, I'll try to see if I can get it fixed for 1.1.

Contributor

beckettsean commented Nov 2, 2016

@jsternberg thanks for that detail. So is it fair to say that on a single node OSS instance, SLIMIT is both idempotent and deterministic?

Contributor

jsternberg commented Nov 2, 2016

Yes. It should be to my knowledge. It just won't always give n series...

Contributor

beckettsean commented Nov 2, 2016

Right, n is the upper bound, but the lower bound is 0 if the query matches no series. It will always return 0 to n series, and always the same series given the same absolute WHERE conditions.
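
A toy Python model of the SLIMIT behavior described in this exchange, in case it helps with the doc wording. It is a sketch, not the engine's code: `slimit` is a hypothetical function, and sorting the global list by series key is an assumption based on the guess earlier in the thread.

```python
def slimit(global_series, shard_series, n):
    """Pick the first n series from the global list, then keep those the queried shards hold."""
    chosen = sorted(global_series)[:n]  # assumption: ordered by series key
    return [series for series in chosen if series in shard_series]


all_series = ["cpu,host=server02", "cpu,host=server01", "cpu,host=server03"]

# Shard holds every series: SLIMIT 1 always yields the same series.
full = slimit(all_series, set(all_series), 1)

# Shard holds no points for server01: SLIMIT 1 yields nothing.
partial = slimit(all_series, {"cpu,host=server02", "cpu,host=server03"}, 1)
```

In this model `full` is always `["cpu,host=server01"]` and `partial` is empty, matching the "0 to n series, always the same series" summary above.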

rkuchan added 2 commits

November 2, 2016 16:23


          Make edits

da921fe


          Merge branch 'master' into rk-more-issues

15a5629

Contributor Author

rkuchan commented Nov 2, 2016 (edited)

I opened an issue about using SLIMIT without GROUP BY *: influxdata/influxdb#7571.

I'm also having trouble with OFFSET without LIMIT and SOFFSET without SLIMIT (and GROUP BY *). Writing it up here because I'm not sure if I'm missing something obvious.

Issue 1: `OFFSET` without `LIMIT`

1. Write some data

> create database mydb
> use mydb
Using database mydb
> insert mymeas,color=yellow value=2
> insert mymeas,color=yellow value=3
> insert mymeas,color=yellow value=4
> insert mymeas,color=yellow value=5
> SELECT * FROM mymeas
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:55.659009728Z  yellow  2
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4
2016-11-02T23:28:01.064617644Z  yellow  5

2. Run an `OFFSET` query without `LIMIT`

> SELECT * FROM mymeas OFFSET 1
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:57.532365177Z  yellow  3

I can't really explain that result. Why does it only return one point? I would (maybe erroneously) have expected it to return the following:

name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4
2016-11-02T23:28:01.064617644Z  yellow  5

3. Run an `OFFSET` query with `LIMIT`

> SELECT * FROM mymeas LIMIT 2 OFFSET 1
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4

This one makes total sense to me. It returns two points and skips the first point.

Issue 2: `SOFFSET` without `SLIMIT`

1. Add another series to the data I wrote above

> insert mymeas,color=blue value=800
> insert mymeas,color=blue value=900
> SELECT * FROM mymeas
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:55.659009728Z  yellow  2
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4
2016-11-02T23:28:01.064617644Z  yellow  5
2016-11-02T23:29:08.953855972Z  blue    800
2016-11-02T23:29:11.427777943Z  blue    900

2. Run an `SOFFSET` query without `SLIMIT`

> SELECT * FROM mymeas SOFFSET 1
>

I get no results. To be honest, I'm not sure what I'd expect this query to do. In just a normal SELECT * FROM measurement query InfluxDB returns all series - so how would it paginate through series if it returns all of them by default?

3. Run an `SOFFSET` query with `SLIMIT`

> SELECT * FROM mymeas SLIMIT 1 SOFFSET 1
>

I also get no results. I'm assuming this has some relation to influxdata/influxdb#7571.

4. Run an `SOFFSET` query with `SLIMIT` and `GROUP BY *`

> SELECT * FROM mymeas GROUP BY * SLIMIT 1 SOFFSET 1
name: mymeas
tags: color=yellow
time                value
----                -----
2016-11-02T23:27:55.659009728Z  2
2016-11-02T23:27:57.532365177Z  3
2016-11-02T23:27:59.352763724Z  4
2016-11-02T23:28:01.064617644Z  5

This one makes total sense to me. If I don't include SOFFSET 1 then it returns everything for the mymeas color=blue series.

Contributor Author

rkuchan commented Nov 2, 2016

@beckettsean, @jsternberg ^

Contributor

beckettsean commented Nov 2, 2016

@rkuchan I agree with your assumptions on the behavior. I would expect the same results you expect.

Contributor

beckettsean commented Nov 3, 2016

@jwilder @benbjohnson any thoughts on @rkuchan's questions about unexpected results from OFFSET queries without LIMIT and SOFFSET queries without SLIMIT and GROUP BY *?

Contributor

benbjohnson commented Nov 3, 2016

@beckettsean @rkuchan Both of those seem like bugs. I would expect a missing LIMIT or SLIMIT to effectively default to ∞ and return all rows.
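
For reference, the expected semantics (a missing LIMIT behaving like an unbounded limit) can be sketched in Python. This models the behavior described as expected in this thread, not what the build under test actually returned; `paginate` is a hypothetical helper and the values are the four yellow points from the example above.

```python
def paginate(points, limit=None, offset=0):
    """Skip `offset` points, then return up to `limit` points (all of them if limit is None)."""
    end = None if limit is None else offset + limit
    return points[offset:end]


values = [2, 3, 4, 5]  # the four `value` fields written for color=yellow

offset_only = paginate(values, offset=1)            # models OFFSET 1
limit_offset = paginate(values, limit=2, offset=1)  # models LIMIT 2 OFFSET 1
```

Here `offset_only` is `[3, 4, 5]`, the result rkuchan expected for `SELECT * FROM mymeas OFFSET 1`, and `limit_offset` is `[3, 4]`, matching the `LIMIT 2 OFFSET 1` output shown above. The same model would apply per series for SLIMIT and SOFFSET.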


          Add short note about OFFSET without LIMIT and SOFFSET without SLIMIT

12b405e

Contributor Author

rkuchan commented Nov 3, 2016 (edited)

OK - opened two issues about OFFSET and SOFFSET and linked to them from the Data Exploration page.

influxdata/influxdb#7577
influxdata/influxdb#7578

(Thanks, @benbjohnson!)

rkuchan merged commit b2db266 into master

rkuchan deleted the rk-more-issues branch

November 3, 2016 18:36

rkuchan added a commit that referenced this pull request


          Add 1.0 updates in #817 to 1.1 docs

b92c6f7
