Update Data Exploration #817

Merged
rkuchan merged 6 commits into master from rk-more-issues on Nov 3, 2016

@rkuchan rkuchan commented Oct 18, 2016

Restructures and reformats the Data Exploration page. Adds consistent headers (like: Syntax, Description of Syntax, Examples, Common Issues) for each section to make the doc easier to parse and understand.

It updates and edits all of the content on the page. I tried to include a lot more syntax-specific information.

Fixed issues:

#796: Creates a section dedicated to how to specify a measurement(s) in the FROM clause, including how to fully qualify the measurement.

#536: Changes the offset_interval examples to clarify its function. I spent a lot of time on this and am very worried I've made it worse.

@beckettsean beckettsean left a comment

looks great! I like the re-org

@beckettsean beckettsean left a comment

just a few things need updating, but overall great work

#### Syntax

```
SELECT <function>(<field_key>) FROM <measurement_name> [WHERE <time_range>] GROUP BY [ * | <tag_key>[,<tag_key] ]
```

I think capitalize function:

SELECT <FUNCTION>(<field_key>) FROM <measurement_name> ...

Missing bracket:

[ * | <tag_key>[,<tag_key>]]


#### Description of Basic Syntax

`GROUP BY <tag>` queries require and InfluxQL [function](/influxdb/v1.0/query_language/functions/).

require an InfluxQL ...

`w` weeks
#### Syntax
```
SELECT <function>(<field_key>) FROM <measurement_name> WHERE <time_range> GROUP BY time(<time_interval>),[tag_key]
```

capitalize FUNCTION

Also, let's indicate tags in the GROUP BY can be 0 to many:

...GROUP BY time(<time_interval>)[,<tag_key>[,<tag_key>]]

```
> SELECT "water_level" FROM "h2o_feet" WHERE time >= '2015-08-18T00:00:00Z' AND time <= '2015-08-18T00:30:00Z'
```

SELECT "water_level", "location" FROM ...

```
> SELECT count("water_level") FROM "h2o_feet" WHERE "location"='coyote_creek' AND time >= '2015-08-18T00:06:00Z' AND time <= '2015-08-18T00:12:00Z' GROUP BY time(12m)
```

drop the "location" tag from the WHERE clause, since it's not in the data anyway.

I would also clarify in the following text that the lower time boundary is 00:06:00, because I missed that at first and couldn't understand why users would expect COUNT = 2 at 00:06:00.


Explanation:

Because the query covers a 12 minute time range and groups results into 12 minute

The query starts at 00:06:00 and covers 12 minute intervals. Many users expect the interval to start at 00:06:00, the explicitly supplied lower time boundary, and extend for 12 minute groups from there. The expectation is that the query would return a COUNT of 2 with the timestamp 2015-08-18T00:06:00Z.

#### Syntax

```
SELECT <function>(<field_key>) FROM <measurement_name> WHERE <time_range> GROUP BY time(<time_interval>,<offset_interval>),[tag_key]
```

FUNCTION

...GROUP BY time(<time_interval>,<offset_interval>)[,<tag_key>[,<tag_key>]]

, and on InfluxDB's preset time boundaries to determine the raw data included in each time boundary
and the timestamps returned by the query.

#### Examples of Advanced Syntax

Looks like WIP, lemme know when it's ready for review.

```
name: h2o_feet
--------------
time count
2015-08-18T00:06:00Z 2
```

## The `GROUP BY` clause and `fill()`

Is fill(null) the default behavior? Seems like it is, but we should mention what the default is when fill() isn't specified.

@rkuchan rkuchan added the WIP label Oct 19, 2016
@rkuchan rkuchan force-pushed the rk-more-issues branch 4 times, most recently from 8ec6c42 to 692cbe5 Compare October 26, 2016 01:33
@rkuchan rkuchan force-pushed the rk-more-issues branch 2 times, most recently from 9d3e3f6 to a31c53e Compare October 31, 2016 19:02
@rkuchan rkuchan changed the title Edit existing docs with issues Update Data Exploration Oct 31, 2016
@beckettsean

Can't comment per-line because the diff is too big and I don't grok the GitHub Mac client enough to make PR comments.

line 192:

Identifiers must be double quoted if they contain characters other than [A-z,0-9,_], or if they are an InfluxQL keyword. While not always necessary, we recommend that you double quote identifiers.

Technically identifiers that start with a digit must also be quoted. I think that's worth mentioning here, since starting an identifier with a digit is common enough.

Identifiers must be double quoted if they contain characters other than [A-z,0-9,_], if they begin with a digit, or if they are an InfluxQL keyword. While not always necessary, we recommend that you double quote identifiers.
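
The proposed wording can be read as a small decision rule. Below is a rough Python sketch of that rule, not InfluxDB's parser. It assumes `[A-z,0-9,_]` means ASCII letters, digits, and underscores, that keywords are matched case-insensitively, and it uses a deliberately partial keyword list. `must_quote` and `KEYWORDS` are hypothetical names.

```python
import re

# Assumption: a small illustrative subset, not the full InfluxQL keyword list.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "LIMIT", "OFFSET"}

# Starts with a letter or underscore, then letters, digits, or underscores.
UNQUOTED = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def must_quote(identifier):
    """Return True if the identifier needs double quotes under the proposed wording."""
    if UNQUOTED.fullmatch(identifier) is None:
        return True  # other characters, a leading digit, or an empty string
    # Assumption: keywords are compared case-insensitively.
    return identifier.upper() in KEYWORDS

for name in ("water_level", "h2o_feet", "2xl", "level description", "limit"):
    print(name, must_quote(name))
```

Only `2xl` (leading digit), `level description` (contains a space), and `limit` (keyword) come back as needing quotes, which is the case the added "begin with a digit" clause is meant to cover.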

line 387:
WHERE [<cond_expr> [(AND|OR) <cond_expr>]…]
<cond_expr> = [ field_key | tag_key | time_condition ] binary_operator [ 'string' | integer | float | boolean | "RFC3339_timestamp" ]

line 415:
Is regex valid for field comparisons in 1.1?

@beckettsean

line 636:
GROUP BY <tag> queries group query results by a user-specified set of tags.

line 653:
maybe mention that the order of the GROUP BY tags or times is irrelevant?

line 785:
Mention that because there are only two tags the output is identical to the above, where we explicitly specified each of the two tags

line 830:
that query wouldn't produce the output shown, as "location" is not part of it.

Line 935:
reword to something like The following query covers a 12-minute time range and groups results into 12 minute intervals, but it returns _two_ results.

line 943:
field output would be called "count", as that is the function, not mean

line 950:
InfluxDB uses preset round-number time boundaries for GROUP BY intervals, independent of any time conditions in the WHERE clause. When it calculates the results, all returned data must occur within the query's explicit time range but the GROUP BY intervals will be based on the preset time boundaries.
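
The boundary behavior described in the line 950 note can be sketched in Python. This is only a model, not InfluxDB code: it assumes bucket starts are the timestamp floored to a multiple of the interval counted from the Unix epoch, and that the sample data has one point every six minutes. `bucket_start` and `group_counts` are hypothetical helper names.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bucket_start(ts, interval):
    """Floor ts to a multiple of interval counted from the Unix epoch (assumed model)."""
    return EPOCH + ((ts - EPOCH) // interval) * interval

def group_counts(points, lower, upper, interval):
    """Count points inside [lower, upper], keyed by preset bucket start."""
    in_range = [ts for ts in points if lower <= ts <= upper]
    counts = Counter(bucket_start(ts, interval) for ts in in_range)
    return dict(sorted(counts.items()))

day = datetime(2015, 8, 18, tzinfo=timezone.utc)
# Assumption: one point every six minutes.
points = [day + timedelta(minutes=m) for m in (0, 6, 12, 18)]

result = group_counts(
    points,
    lower=day + timedelta(minutes=6),
    upper=day + timedelta(minutes=12),
    interval=timedelta(minutes=12),
)
for start, count in result.items():
    print(start.isoformat(), count)
```

Under these assumptions the 12-minute query range straddles two preset buckets (00:00 and 00:12), so the model returns two rows with a count of 1 each rather than a single row with a count of 2 at 00:06.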

@beckettsean

line 1079:
I still wasn't sure at first if the table that follows was with or without the offset. Since we give the offset query first, it's strange to see the results in the opposite order.
I think it would be helpful to explicitly state that we're looking at the last query results first.

The time boundaries and returned timestamps for the query without the offset_interval adhere to InfluxDB's preset time boundaries. Let's first examine the results without an offset:

line 1187:
It's a little confusing that shifting the buckets forward by +6m is the same as shifting them backward by -12m. It begs the question, which should I use? I think it actually doesn't really matter, but we should be explicit, I think. "Shift by whatever is most intuitive" or something like that. Maybe even talk about how each method creates an "empty" bucket with no results, since it lies entirely outside the WHERE time range. E.g. shifting forward 6 min means the entire 4th bucket falls outside the query range. Shifting back 12 min means the first bucket happens entirely before the start of the WHERE time range. If we show the empty buckets it might clarify things a bit, since there are still four buckets, more or less.
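
One way to see why the two offsets coincide is a small Python model. It assumes an 18-minute interval (not stated in this thread, but the value for which +6m and -12m differ by exactly one interval) and assumes the offset simply shifts an epoch-aligned grid. `bucket_start` is a hypothetical helper, not an InfluxDB function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bucket_start(ts, interval, offset=timedelta(0)):
    """Floor ts to the interval grid after shifting the grid by offset (assumed model)."""
    shifted = ts - EPOCH - offset
    return EPOCH + (shifted // interval) * interval + offset

day = datetime(2015, 8, 18, tzinfo=timezone.utc)
interval = timedelta(minutes=18)  # assumption: interval taken as 18 minutes
times = [day + timedelta(minutes=m) for m in range(0, 60, 6)]

forward = [bucket_start(t, interval, timedelta(minutes=6)) for t in times]
backward = [bucket_start(t, interval, timedelta(minutes=-12)) for t in times]
print(forward == backward)
```

In this model any two offsets that differ by a whole multiple of the interval produce the same bucket grid, which would support a "shift by whatever is most intuitive" recommendation.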

line 1248:
Again, maybe show the empty second bucket past the end of the time range.

line 1280:
Any numerical value Reports the given numerical value for time intervals with no data.

line 1292:
null Reports null for time intervals with no data but returns a timestamp. This is the same as the default behavior

line 1539:
This is backreferencing.
to
:MEASUREMENT is a backreference to each measurement matched in the FROM clause.

line 1706:
The prior examples don't use GROUP BY * to preserve tags, so shouldn't we see some tags turned into fields in the output?

line 1742:
Something to the effect that "ORDER BY time DESC must come after the WHERE clause if there's no GROUP BY clause."

line 1812:
Queries with a LIMIT clause require GROUP BY * for deterministic results. LIMIT queries without a GROUP BY * clause may return different results when the query is re-run.

line 1884:
@jwilder @jsternberg @benbjohnson Are the series returned by SLIMIT idempotent? Are they deterministic? Said another way, will I always get the same series for the same query? And is it possible to know in advance which series I will get? We need to be explicit one way or the other. If it's random, that's fine, but we need to tell people. If it's not random but can't be predicted, that's also good to note. If it's not random and is predictable, we need to give users the algorithm to determine the series returned. (If it is deterministic, I imagine it's related to sorting by the series key.)

line 1932:
Note that without SLIMIT 1, the query would return results for the two series associated with the h2o_feet measurement, location=coyote_creek and location=santa_monica.

line 1993:
Note that without LIMIT 2 SLIMIT 1, the query would return four points for each of the two series associated with the h2o_feet measurement.

line 2047:
The query returns the fourth, fifth, and sixth points from the two series

line 2071:
The LIMIT 2 clause limits the number of points returned per series to two.

line 2072:
The OFFSET 2 clause excludes the first two points per series.

line 2095:
The SOFFSET clause requires both GROUP BY * and an SLIMIT clause.
@jwilder @jsternberg why does SOFFSET require an SLIMIT clause? I don't see an obvious reason why that would be necessary.

line 2143:
I think this needs something at the end of the line so that it will start a newline for the next sentence. Right now they are appearing on the same line. (Missing space or something?)

line 2160:
can we link to a description elsewhere of the min time? Would be nice to explain to new users why that very random looking date is the lowest valid timestamp.

line 2175:
...date-time strings or epoch time. (note the E, not datA)

line 2182:
Let's be explicit that OR is not allowed for time ranges, and link to the FAQ and/or GitHub Issue

line 2216:
maybe
Epoch time is the number of duration literals that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970.

line 2344:
SELECT_clause FROM_clause WHERE time now() [[ - | + ] <duration_literal>]
added whitespace to the operators

line 2415:
Note that InfluxDB only returns points where the water_level field has data.

line 2468:
Regular expression comparisons are more computationally intensive than exact string comparisons and thus queries with regular expressions are not as performant as those without.

line 2539:
it's a little weird that there's no output. It's missing the CLI return for the query.

line 2573:
Might be worth putting a &pretty flag on the query string so that the JSON output is more readable

@beckettsean beckettsean left a comment

looks great! amazing improvement. lots of little notes but I think we can get this out before the end of the week.

@jsternberg

Are the series returned by SLIMIT idempotent? Are they deterministic? Said another way, will I always get the same series for the same query? And is it possible to know in advance which series I will get? We need to be explicit one way or the other. If it's random, that's fine, but we need to tell people. If it's not random but can't be predicted, that's also good to note. If it's not random and is predictable, we need to give users the algorithm to determine the series returned. (If it is deterministic, I imagine it's related to sorting by the series key.)

SLIMIT <n> is idempotent, but it won't necessarily return <n> series because the series it chooses are based on the global list of series rather than what is in the shards that are being queried. So if you have cpu,host=server01, cpu,host=server02, and cpu,host=server03 and you use SLIMIT 1, you'll always get cpu,host=server01. But if the shard you query doesn't contain any points for cpu,host=server01 then you will get no results.

The SOFFSET clause requires both GROUP BY * and an SLIMIT clause. why does SOFFSET require an SLIMIT clause? I don't see an obvious reason why that would be necessary.

It shouldn't. This sounds like an error to me if it's happening. If you file a bug report, I'll try to see if I can get it fixed for 1.1.

@beckettsean

@jsternberg thanks for that detail. So is it fair to say that on a single node OSS instance, SLIMIT is both idempotent and deterministic?

@jsternberg

Yes. It should be to my knowledge. It just won't always give n series...

@beckettsean

Right, n is the upper bound, but the lower bound is 0 if the query matches no series. It will always return 0 to n series, and always the same series given the same absolute WHERE conditions.
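
The "0 to n series, always the same ones" behavior discussed above can be modelled in a few lines of Python. This is a sketch of the description in this thread, not of InfluxDB internals: the ordering by series key is an assumption, and `slimit` is a hypothetical helper.

```python
def slimit(global_series, series_in_queried_shards, n):
    """Pick the first n series from the global list, then keep those with data in the queried shards."""
    # Assumption: the global list is ordered by series key.
    chosen = sorted(global_series)[:n]
    return [s for s in chosen if s in series_in_queried_shards]

all_series = ["cpu,host=server02", "cpu,host=server01", "cpu,host=server03"]

# Every series has points in the queried shards.
print(slimit(all_series, set(all_series), 1))

# The queried shards hold no points for server01.
print(slimit(all_series, {"cpu,host=server02", "cpu,host=server03"}, 1))
```

The first call yields `cpu,host=server01`; the second yields an empty list, because the selected series has no points in the queried shards and no other series is substituted for it.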

rkuchan commented Nov 2, 2016

I opened an issue about using SLIMIT without GROUP BY *: influxdata/influxdb#7571.

I'm also having trouble with OFFSET without LIMIT and SOFFSET without SLIMIT (and GROUP BY *). Writing it up here because I'm not sure if I'm missing something obvious.

Issue 1: OFFSET without LIMIT

1. Write some data

> create database mydb
> use mydb
Using database mydb
> insert mymeas,color=yellow value=2
> insert mymeas,color=yellow value=3
> insert mymeas,color=yellow value=4
> insert mymeas,color=yellow value=5
> SELECT * FROM mymeas
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:55.659009728Z  yellow  2
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4
2016-11-02T23:28:01.064617644Z  yellow  5

2. Run an OFFSET query without LIMIT

> SELECT * FROM mymeas OFFSET 1
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:57.532365177Z  yellow  3

I can't really explain that result. Why does it only return one point? I would (maybe erroneously) have expected it to return the following:

name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4
2016-11-02T23:28:01.064617644Z  yellow  5

3. Run an OFFSET query with LIMIT

> SELECT * FROM mymeas LIMIT 2 OFFSET 1
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4

This one makes total sense to me. It returns two points and skips the first point.

Issue 2: SOFFSET without SLIMIT

1. Add another series to the data I wrote above

> insert mymeas,color=blue value=800
> insert mymeas,color=blue value=900
> SELECT * FROM mymeas
name: mymeas
time                color   value
----                -----   -----
2016-11-02T23:27:55.659009728Z  yellow  2
2016-11-02T23:27:57.532365177Z  yellow  3
2016-11-02T23:27:59.352763724Z  yellow  4
2016-11-02T23:28:01.064617644Z  yellow  5
2016-11-02T23:29:08.953855972Z  blue    800
2016-11-02T23:29:11.427777943Z  blue    900

2. Run an SOFFSET query without SLIMIT

> SELECT * FROM mymeas SOFFSET 1
>

I get no results. To be honest, I'm not sure what I'd expect this query to do. In just a normal SELECT * FROM measurement query InfluxDB returns all series - so how would it paginate through series if it returns all of them by default?

3. Run an SOFFSET query with SLIMIT

> SELECT * FROM mymeas SLIMIT 1 SOFFSET 1
>

I also get no results. I'm assuming this has some relation to influxdata/influxdb#7571.

4. Run an SOFFSET query with SLIMIT and GROUP BY *

> SELECT * FROM mymeas GROUP BY * SLIMIT 1 SOFFSET 1
name: mymeas
tags: color=yellow
time                value
----                -----
2016-11-02T23:27:55.659009728Z  2
2016-11-02T23:27:57.532365177Z  3
2016-11-02T23:27:59.352763724Z  4
2016-11-02T23:28:01.064617644Z  5

This one makes total sense to me. If I don't include SOFFSET 1 then it returns everything for the mymeas color=blue series.

rkuchan commented Nov 2, 2016

@beckettsean, @jsternberg ^

@beckettsean

@rkuchan I agree with your assumptions on the behavior. I would expect the same results you expect.

@beckettsean

@jwilder @benbjohnson any thoughts on @rkuchan's questions about unexpected results from OFFSET queries without LIMIT and SOFFSET queries without SLIMIT and GROUP BY *?

@benbjohnson

@beckettsean @rkuchan Both of those seem like bugs. I would expect a missing LIMIT or SLIMIT to effectively default to and return all rows.
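
The expectation that a missing LIMIT or SLIMIT returns all rows matches ordinary slice semantics. A minimal Python sketch of that expected behavior (not of what InfluxDB did at the time) using the values from the test data above; `paginate` is a hypothetical helper.

```python
def paginate(rows, limit=None, offset=0):
    """Skip `offset` rows, then return `limit` rows, or all remaining rows when limit is None."""
    end = None if limit is None else offset + limit
    return rows[offset:end]

values = [2, 3, 4, 5]               # field values written to mymeas,color=yellow
series = ["color=blue", "color=yellow"]  # series order as observed with GROUP BY *

print(paginate(values, offset=1))           # expected for OFFSET 1
print(paginate(values, limit=2, offset=1))  # LIMIT 2 OFFSET 1
print(paginate(series, offset=1))           # expected for SOFFSET 1
```

Under this model `OFFSET 1` alone would return 3, 4, and 5, and `SOFFSET 1` alone would return the `color=yellow` series, which is the result the thread expected.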

rkuchan commented Nov 3, 2016

OK - opened two issues about OFFSET and SOFFSET and linked to them from the Data Exploration page.

influxdata/influxdb#7577
influxdata/influxdb#7578

(Thanks, @benbjohnson!)

@rkuchan rkuchan merged commit b2db266 into master Nov 3, 2016
@rkuchan rkuchan deleted the rk-more-issues branch November 3, 2016 18:36
rkuchan added a commit that referenced this pull request Nov 3, 2016