# Aggregating results by using GROUP BY
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This tutorial demonstrates how to work with [`GROUP BY`](https://druid.apache.org/docs/27.0.0/querying/sql#group-by) to aggregate rows and produce metrics from underlying measures at query time and during ingestion.

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

### Run with Docker

<!-- Profiles are:
`druid-jupyter` - just Jupyter and Druid
`all-services` - includes Jupyter, Druid, and Kafka
 -->

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).


## Initialization

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called `example-koalas-groupby`. Notice that only the columns required for this notebook are ingested from the overall sample dataset.

When completed, you'll see table details.

In [None]:
sql='''
REPLACE INTO "example-koalas-groupby" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "browser",
  "city",
  "continent",
  "country",
  "loaded_image",
  "os",
  "session_length"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-groupby')
display.table('example-koalas-groupby')

Finally, run the following cell to import additional Python modules that you will use for this notebook.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Introduction to `GROUP BY`

You can combine rows of common values in your results by using the [`GROUP BY` clause](https://druid.apache.org/docs/27.0.0/querying/sql#group-by), producing aggregations from the source values. `GROUP BY` is an important technique to [apply at ingestion time](https://druid.apache.org/docs/27.0.0/ingestion/rollup), allowing you to aggregate raw data and pre-calculate common aggregates.

This notebook focuses on SQL-based functions. Native equivalents exist for use in, for example, the [`metricsSpec` section](https://druid.apache.org/docs/27.0.0/ingestion/ingestion-spec#metricsspec) of JSON-based specifications of streaming ingestion.

### Generate simple aggregations

Run the following cell to generate a table of results from the example dataset.

The SQL includes the `GROUP BY` clause, combining rows in the results with a common value in `loaded_image`.

In [None]:
sql='''
SELECT
  "loaded_image",
  avg(session_length) AS "timetakenms_average",
  max(session_length) AS "timetakenms_max",
  min(session_length) AS "timetakenms_min",
  count(*) AS "count",
  count(DISTINCT "country") AS "countries"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
GROUP BY 1
'''

display.sql(sql)

The `GROUP BY` clause combines rows from the raw data that share a common `loaded_image` value, and generates three [aggregates](https://druid.apache.org/docs/27.0.0/querying/aggregations): the average, maximum, and minimum time taken to complete a session.

A `COUNT` is also generated, returning the number of events from the `TABLE` that are in the group. `COUNT` can also calculate the number of distinct values in a set by using the `DISTINCT` operator, used in the SQL above to `COUNT` the distinct number of countries per image.

The `WHERE` clause uses the `TIME_IN_INTERVAL` function to ensure we only retrieve rows for a specific time period, which is good practice for all Druid queries. In this case, only events between 10am and 6pm on the 25th August 2019 are included.

`GROUP BY 1` is a shorthand way of writing `GROUP BY "loaded_image"`.
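
To make the grouping semantics concrete, here is a minimal pure-Python sketch of the kind of calculation the query above performs. The sample rows and the `summarize` helper are hypothetical stand-ins for the Druid table, not part of the Druid API. Note also that Druid's `COUNT(DISTINCT ...)` is approximate by default, whereas this sketch counts exactly.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for "example-koalas-groupby".
rows = [
    {"loaded_image": "a.jpg", "session_length": 100, "country": "France"},
    {"loaded_image": "a.jpg", "session_length": 300, "country": "Japan"},
    {"loaded_image": "b.jpg", "session_length": 50, "country": "France"},
]

def summarize(rows, key):
    """Group rows by `key` and compute aggregates similar to those in the SQL."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = {}
    for value, members in groups.items():
        lengths = [m["session_length"] for m in members]
        result[value] = {
            "timetakenms_average": sum(lengths) / len(lengths),
            "timetakenms_max": max(lengths),
            "timetakenms_min": min(lengths),
            "count": len(members),
            # Exact distinct count; Druid approximates this by default.
            "countries": len({m["country"] for m in members}),
        }
    return result

summary = summarize(rows, "loaded_image")
print(summary["a.jpg"])
```

Each distinct `loaded_image` value becomes one output row, and every aggregate is computed only over the rows that fell into that group.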

### Generate objects

The following SQL uses the `STRING_AGG` and `ARRAY_AGG` functions in their `DISTINCT` form to create collections of the values from the source data.

Run this cell to see how the values in `continent` are handled for each `loaded_image`.

In [None]:
sql='''
SELECT
  "loaded_image",
  STRING_AGG(DISTINCT "continent", ',') AS "string",
  ARRAY_AGG(DISTINCT "continent") AS "array",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
GROUP BY 1
'''

display.sql(sql)

### Find the earliest and latest value

Several functions exist to determine the earliest and latest values in the source data.

Run the cell below, which uses the `EARLIEST_BY` and `LATEST_BY` functions to calculate, between 10am and 6pm on the 25th August, the earliest and latest recorded `country`, broken down by continent.

In [None]:
sql='''
SELECT
  "continent",
  EARLIEST_BY("country","__time",1024) AS "earliest_country",
  LATEST_BY("country","__time",1024) AS "latest_country"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
GROUP BY 1
'''

display.sql(sql)
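
As a rough illustration of what "earliest by" and "latest by" mean, the sketch below picks, for each continent, the country attached to the smallest and largest timestamp. The events are hypothetical, and the code only mirrors the idea behind the query above, not how Druid computes it.

```python
from collections import defaultdict

# Hypothetical (timestamp, continent, country) events.
# ISO 8601 strings in the same zone sort chronologically.
events = [
    ("2019-08-25T10:05:00", "Europe", "France"),
    ("2019-08-25T17:30:00", "Europe", "Spain"),
    ("2019-08-25T12:00:00", "Europe", "Italy"),
    ("2019-08-25T11:00:00", "Asia", "Japan"),
]

by_continent = defaultdict(list)
for ts, continent, country in events:
    by_continent[continent].append((ts, country))

earliest_latest = {
    continent: {
        # The value whose timestamp is smallest / largest within the group.
        "earliest_country": min(pairs, key=lambda p: p[0])[1],
        "latest_country": max(pairs, key=lambda p: p[0])[1],
    }
    for continent, pairs in by_continent.items()
}
print(earliest_latest)
```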

## Transformation

You can include an [expression](https://druid.apache.org/docs/latest/querying/math-expr) in the SQL statement to apply a function to the source data.

### Group by the results of an expression

Run the cell below, where a function has been applied to the `loaded_image` data to extract only the filename by using a regular expression via [`REGEXP_EXTRACT`](https://druid.apache.org/docs/latest/querying/sql-scalar#string-functions). The result of this function is then used in the `GROUP BY`, providing a results table that only contains the filename.

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'([a-zA-Z0-9\s_\\.\-\(\):])+(.jpg)$') AS "loaded_image_filename",
  avg(session_length) AS "timetakenms_average",
  max(session_length) AS "timetakenms_max",
  min(session_length) AS "timetakenms_min",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)

### Group by time

Run the next cell to apply a time function, [`TIME_EXTRACT`](https://druid.apache.org/docs/27.0.0/querying/sql-functions#time_extract), to extract the HOUR from the underlying data, providing a breakdown of the number of sessions by hour-of-the-day across the entire time period. The cell then stores the results in a pandas DataFrame and displays a bar chart.

In [None]:
sql='''
SELECT
  TIME_EXTRACT("__time", 'HOUR') AS "time",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T00:00:00/2019-08-30T00:00:00')
GROUP BY 1
'''

df = pd.DataFrame(sql_client.sql(sql))
df.plot.bar(x='time', y='count')
plt.show()

Druid can apply the `TIME_EXTRACT` function in two ways - one where timezones are specified, and one without.

Run this cell to see a table showing the hour from the source data, along with the equivalent hour in Los Angeles, Copenhagen, and Shanghai.

In [None]:
sql='''
SELECT
  TIME_EXTRACT("__time", 'HOUR') AS "time",
  TIME_EXTRACT("__time", 'HOUR', 'America/Los_Angeles') AS "time_LA",
  TIME_EXTRACT("__time", 'HOUR', 'Europe/Copenhagen') AS "time_Cop",
  TIME_EXTRACT("__time", 'HOUR', 'Asia/Shanghai') AS "time_Sha",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T00:00:00/2019-08-26T00:00:00')
GROUP BY 1,2,3,4
ORDER BY 2 ASC
'''

display.sql(sql)
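
The relationship between the four hour columns can be sketched in plain Python. This example assumes the source timestamps are in UTC and, to stay self-contained, uses fixed offsets that applied on 25 August 2019 (UTC-7 in Los Angeles, UTC+2 in Copenhagen, UTC+8 in Shanghai) rather than a time zone database. The `offsets` mapping and `extract_hours` helper are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed for 25 August 2019 (daylight saving time was in
# effect in Los Angeles and Copenhagen). A time zone database would
# normally work these out for you.
offsets = {
    "time": timezone.utc,
    "time_LA": timezone(timedelta(hours=-7)),
    "time_Cop": timezone(timedelta(hours=2)),
    "time_Sha": timezone(timedelta(hours=8)),
}

def extract_hours(utc_timestamp):
    """Return the hour-of-day of one UTC instant as seen in each zone."""
    return {name: utc_timestamp.astimezone(tz).hour for name, tz in offsets.items()}

hours = extract_hours(datetime(2019, 8, 25, 10, 0, tzinfo=timezone.utc))
print(hours)
```

The same instant lands in a different hour bucket in each zone, which is why grouping by a zone-specific hour can shift counts between rows.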

For data visualizations where time is on the x axis, the `TIME_FLOOR` function is particularly useful.

Run the next cell, which stores the results of a query in a dataframe and then plots them into a line chart for the period.

In [None]:
sql='''
SELECT
  STRING_FORMAT('%tR',TIME_FLOOR("__time", 'PT1H')) AS "time",
  count(*) AS "count",
  sum(session_length) AS "timetakenms"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T00:00:00/2019-08-26T00:00:00')
GROUP BY 1
'''

df = pd.DataFrame(sql_client.sql(sql))

fig, ax = plt.subplots()

df.plot(x = 'time', y = 'count', ax = ax) 
df.plot(x = 'time', y = 'timetakenms', ax = ax, secondary_y = True) 
plt.show()

* The [`TIME_FLOOR`](https://druid.apache.org/docs/latest/querying/sql-functions#time_floor) function is used against `__time` to return only the date and hour for each timestamp in the source data. The result of this is then passed to the `STRING_FORMAT` function to apply string formatting.

* The [`TIME_IN_INTERVAL`](https://druid.apache.org/docs/27.0.0/querying/sql-functions#time_in_interval) ensures the result set only contains results for events on the 25th August 2019.

* Two aggregates are calculated: the number of sessions (`count`) and the sum total length of all sessions (`timetakenms`).
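
The flooring and formatting steps described above can be sketched in plain Python. The timestamps and the `floor_to_hour` helper are hypothetical; `%H:%M` is used here because `%tR` produces a 24-hour `HH:MM` string.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps.
timestamps = [
    datetime(2019, 8, 25, 10, 5),
    datetime(2019, 8, 25, 10, 59),
    datetime(2019, 8, 25, 11, 0),
]

def floor_to_hour(ts):
    """Truncate a timestamp to the start of its hour, in the spirit of TIME_FLOOR(..., 'PT1H')."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Floor first, then format, then count rows per formatted bucket.
counts = Counter(floor_to_hour(ts).strftime("%H:%M") for ts in timestamps)
print(counts)
```

Because every timestamp within the same hour floors to the same value, the formatted string works as a grouping key.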

## Filtering

`WHERE` filters rows from source data used in the query, while `HAVING` filters result sets.

In this section, see how these two mechanisms for filtering data can be used with `GROUP BY` queries.

### Filter the source data

The results of the `REGEXP_EXTRACT` example query above include rows where no filename could be found in the source data.

To prevent this from happening, you can add a `LIKE` function in the `WHERE` clause to ensure source rows for the query contain a JPG image in the `loaded_image`.

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'([a-zA-Z0-9\s_\\.\-\(\):])+(.jpg)$') AS "loaded_image_filename",
  avg(session_length) AS "timetakenms_average",
  max(session_length) AS "timetakenms_max",
  min(session_length) AS "timetakenms_min",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "loaded_image" LIKE '%jpg'
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)

### Filter the results

A `WHERE` clause only filters source data. Since the `loaded_image_filename` dimension is calculated, we cannot use `WHERE` to filter the result set. This SQL, for example, would be invalid:

```sql
WHERE "loaded_image_filename" = 'koalas2.jpg'
```

The `HAVING` clause filters the final result set, allowing filters to be created that address calculated columns directly.

In the following SQL, the calculated column, `loaded_image_filename`, is used in the `HAVING` clause to remove any empty results, an alternative approach to the `WHERE` filter above.

Running this cell will show that the results match. Remember, however, that the `WHERE`-based query is far more efficient than this new query as it draws fewer rows out of the source `TABLE`. In this alternative form, filtering happens very late in query execution.

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'([a-zA-Z0-9\s_\\.\-\(\):])+(.jpg)$') AS "loaded_image_filename",
  avg(session_length) AS "timetakenms_average",
  max(session_length) AS "timetakenms_max",
  min(session_length) AS "timetakenms_min",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
GROUP BY 1
HAVING "loaded_image_filename" IS NOT NULL
ORDER BY 2 DESC
'''

display.sql(sql)

`HAVING` is commonly used to filter results based on the output of aggregate functions.

Run the following cell to calculate the average session length and return images whose average session length is over 300000 milliseconds (300 seconds).

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'([a-zA-Z0-9\s_\\.\-\(\):])+(.jpg)$') AS "loaded_image_filename",
  avg(session_length) AS "timetakenms_average",
  max(session_length) AS "timetakenms_max",
  min(session_length) AS "timetakenms_min",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "loaded_image" LIKE '%jpg'
GROUP BY 1
HAVING "timetakenms_average" > 300000
ORDER BY 2 DESC
'''

display.sql(sql)
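
The order in which the two filters act can be sketched in a few lines of Python. The rows are hypothetical, and the code only illustrates the logical order (filter rows, group, then filter groups), not Druid's execution.

```python
from collections import defaultdict

# Hypothetical (filename, session_length) rows.
rows = [("a.jpg", 400000), ("a.jpg", 300000), ("b.jpg", 100000), ("c.png", 900000)]

# WHERE: discard source rows before any grouping happens.
kept = [(name, length) for name, length in rows if name.endswith("jpg")]

# GROUP BY: collect the surviving rows per filename and aggregate them.
groups = defaultdict(list)
for name, length in kept:
    groups[name].append(length)
averages = {name: sum(values) / len(values) for name, values in groups.items()}

# HAVING: discard whole groups based on an aggregate value.
slow = {name: avg for name, avg in averages.items() if avg > 300000}
print(slow)
```

The `c.png` row never reaches the aggregation step, while `b.jpg` is aggregated and only then dropped by the `HAVING`-style filter.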

### Filter data used in the aggregate calculation

Expressions can themselves have a filter, restricting the rows that are included in the calculation of the specific aggregation.

Run the following cell where the `FILTER` clause has been added to the `COUNT` calculation.

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'([a-zA-Z0-9\s_\\.\-\(\):])+(.jpg)$') AS "loaded_image_filename",
  count(*) FILTER (WHERE "os" LIKE 'OS %') AS "count_OSX",
  count(*) FILTER (WHERE "os" LIKE 'Windows %') AS "count_windows"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "loaded_image" LIKE '%jpg'
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)

Two counts are returned: one that only counts rows with an `OS`-like operating system, and another that only counts rows with a `Windows`-like operating system.
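
A filtered aggregate behaves like a conditional count over the rows of each group. In this small sketch, the list of `os` values is hypothetical and `startswith` stands in for the `LIKE 'OS %'` and `LIKE 'Windows %'` patterns.

```python
# Hypothetical values of the "os" column for the rows in one group.
os_values = ["OS X", "Windows 10", "Windows 7", "Linux", "OS X"]

# Each aggregate applies its own predicate; the rows in the group are unchanged.
count_osx = sum(1 for os_name in os_values if os_name.startswith("OS "))
count_windows = sum(1 for os_name in os_values if os_name.startswith("Windows "))
print(count_osx, count_windows)
```

Unlike a `WHERE` condition, the predicate does not remove the `Linux` row from the group. It only excludes it from these two counts.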

## Apply advanced groupings

A `GROUP BY` clause creates a set of aggregations by each of the columns that you specify.

The `GROUP BY` statement in the next cell calculates a maximum session length and a `COUNT` of the events for each combination of `browser` and `os`. Or, put another way, grouped by browser and then by operating system.

In [None]:
sql='''
SELECT
  "browser",
  "os",
  max(session_length) AS "max_session",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "os" LIKE 'Windows%'
GROUP BY 1, 2
'''

display.sql(sql)

### Return independent groups

Rather than additive grouping, `GROUPING SETS` generates separate groups against each of the dimensions specified.

Run the following cell, which creates two sets of results: one grouped by `browser`, and another grouped by `os`.

In [None]:
sql='''
SELECT
  "browser",
  "os",
  max(session_length) AS "max_session",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "os" LIKE 'Windows%'
GROUP BY GROUPING SETS ("browser","os")
'''

display.sql(sql)
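
One way to picture `GROUPING SETS` is as several independent `GROUP BY` passes whose results are stacked into one table. The sketch below assumes hypothetical rows and a hypothetical `grouping_sets` helper, and represents a column that is not part of the current grouping set as `None`.

```python
from collections import Counter

# Hypothetical (browser, os) rows.
rows = [("Chrome", "Windows 10"), ("Chrome", "Windows 7"), ("Firefox", "Windows 10")]

def grouping_sets(rows, sets):
    """Count rows once per grouping set; columns outside the set become None."""
    columns = ("browser", "os")
    out = Counter()
    for active in sets:
        for row in rows:
            key = tuple(v if c in active else None for c, v in zip(columns, row))
            out[key] += 1
    return out

result = grouping_sets(rows, [("browser",), ("os",)])
print(result)
```

Every source row contributes once to each grouping set, so the counts within one set always add up to the total number of rows.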

### Return multiple groupings

It's also possible to combine approaches, executing a single query that provides multiple groupings that can be used in multiple ways by the calling application.

In the SQL below, the `GROUPING SETS` clause has been expanded so that three sets of results are provided. One is purely a `GROUP` on `continent`. The second is grouped by `continent` and then by `os`, and the final is grouped by `continent` and then by `browser`.

In [None]:
sql='''
SELECT
  "continent",
  "browser",
  "os",
  max(session_length) AS "max_session",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "os" LIKE 'Windows%'
AND "continent" LIKE '%America'
GROUP BY GROUPING SETS (
    "continent",
    ("continent","os"),
    ("continent","browser")
    )
'''

display.sql(sql)

### Rolling up groups

The SQL statement below incorporates the `GROUP BY ROLLUP` clause.

Run the following cell to see the effect.

In [None]:
sql='''
SELECT
  "continent",
  "browser",
  "os",
  max(session_length) AS "max_session",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "os" LIKE 'OS %'
AND "continent" LIKE '%America'
GROUP BY ROLLUP (
    "continent",
    "browser",
    "os"
    )
'''

display.sql(sql)

The results show four sets of grouping:

1. Grouped by continent, broken down by browser, and then by operating system
2. Grouped by continent and then by browser
3. Grouped by continent
4. Without any grouping
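
The four groupings listed above follow a simple pattern: `ROLLUP` over a list of columns is equivalent to grouping sets made of each prefix of that list, down to the empty set. A short sketch, where `rollup_sets` is a hypothetical helper name:

```python
def rollup_sets(columns):
    """Expand ROLLUP(c1, ..., cn) into its grouping sets, longest prefix first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

sets = rollup_sets(["continent", "browser", "os"])
print(sets)
```

The empty tuple at the end corresponds to the grand total row, where no column is used for grouping.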

### Return all possible groupings

The `CUBE` modifier to the `GROUP BY` clause prompts Druid to generate groupings for all possible combinations of the columns that we specify.

To keep this result set small enough for this notebook, a `HAVING` clause is applied, ensuring only rows in our results that have a `COUNT` of over 500 are included. You may want to remove this clause yourself to see how the full result set looks.

In [None]:
sql='''
SELECT
  "continent",
  "browser",
  "os",
  max(session_length) AS "max_session",
  count(*) AS "count"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T10:00:00/2019-08-25T18:00:00')
AND "os" LIKE 'OS %'
AND "continent" LIKE '%America'
GROUP BY CUBE (
    "continent",
    "os",
    "browser"
    )
HAVING "count" > 500
'''

display.sql(sql)
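
`CUBE` over a list of columns is equivalent to grouping sets for every subset of those columns. The following sketch enumerates them with `itertools.combinations`; `cube_sets` is a hypothetical helper name.

```python
from itertools import combinations

def cube_sets(columns):
    """Expand CUBE(c1, ..., cn) into every subset of the columns, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

sets = cube_sets(["continent", "os", "browser"])
print(len(sets))
for grouping in sets:
    print(grouping)
```

With n columns there are 2^n grouping sets, which is why the result set of a `CUBE` query grows quickly as columns are added.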

## Determine the `GROUP BY` execution plan

Several execution engines might be used for `GROUP BY` operations. `EXPLAIN PLAN` shows specifically which will be used for each type of query.

In this section, see `EXPLAIN PLAN` results for some `GROUP BY` queries.

The following cell contains a `GROUP BY` query that matches the [requirements](https://druid.apache.org/docs/latest/querying/sql-translation#query-types) for the [`timeseries`](https://druid.apache.org/docs/27.0.0/querying/timeseriesquery) execution.

Run the following cell to see the `EXPLAIN PLAN` for the query, noting that the `queryType` is `timeseries`.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "time",
  count(*) AS "count",
  sum(session_length) AS "timetakenms"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T00:00:00/2019-08-26T00:00:00')
GROUP BY 1
'''

print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

Review the SQL in the cell below.

This `GROUP BY` query additionally groups rows by the operating system (`os`), leading Druid to use the `groupBy` query type.

Run the cell to retrieve an `EXPLAIN PLAN` for the query. Notice that the `queryType` is `groupBy`. This indicates that Druid is using the [`groupby`](https://druid.apache.org/docs/latest/querying/groupbyquery) execution engine.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "__time_by_hour",
  "os",
  count(*) AS "count",
  sum(session_length) AS "timetakenms"
FROM "example-koalas-groupby"
WHERE TIME_IN_INTERVAL("__time", '2019-08-25T04:00:00/2019-08-25T06:00:00')
GROUP BY 1,2
'''

print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))
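
If you only need the query type rather than the full plan, you can pull it out of the parsed JSON. The sketch below assumes the plan is a JSON array of objects that each carry a `query` member with a `queryType` field. The `plan_text` value is a hypothetical, heavily trimmed example and `query_types` is a hypothetical helper, so check the shape against the actual output of the cell above before relying on it.

```python
import json

# Hypothetical, heavily trimmed plan string; a real plan carries many more fields.
plan_text = '[{"query": {"queryType": "groupBy", "granularity": {"type": "all"}}}]'

def query_types(plan_text):
    """Return the queryType of each native query in a JSON-encoded plan (assumed shape)."""
    return [entry["query"]["queryType"] for entry in json.loads(plan_text)]

print(query_types(plan_text))
```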

The `TopN` execution pattern applies approximation to `GROUP BY` results.

Try the [TopN](02-approx-ranking.ipynb) notebook on using approximation with `GROUP BY` queries.

## Cleanup

Run the following cell to drop the table.

In [None]:
druid.datasources.drop("example-koalas-groupby")

## Summary

* `GROUP BY` can be used at query and ingestion time to combine rows and generate aggregates.
* There are a wide range of aggregation functions available.
* `GROUP BY` operations can be approximate or accurate.
* Transformation and filtering can be incorporated into `GROUP BY` queries.
* There are several modes for `GROUP BY` that can be used to generate multiple useful sets of data from one query.
* Under the hood, Druid utilizes different native execution plans depending on the pattern of the SQL.

## Learn more

* Incorporate a `GROUP BY` into your SQL-based ingestion or, if using JSON-based ingestion, enable `rollup`, `queryGranularity`, and a `metricsSpec`
* Dig deeper into the `EARLIEST` and `LATEST` aggregations
* Try the [TopN](02-approx-ranking.ipynb) notebook on using approximation with `GROUP BY` queries.
* Review the available [aggregation functions](https://druid.apache.org/docs/latest/querying/sql-aggregations)
* Read more about the [groupby](https://druid.apache.org/docs/latest/querying/groupbyquery) execution engine.
* Find out [when each type of query mode is used](https://druid.apache.org/docs/latest/querying/sql-translation#query-types) from the documentation.
* Take a [look at](https://www.novixys.com/blog/java-string-format-examples/#31_Date_and_Time_Formatting) other `STRING_FORMAT` options
* Review the list of [`timezones`](https://www.joda.org/joda-time/timezones.html).