# Using window functions

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

A [window function](https://druid.apache.org/docs/latest/querying/sql-window-functions/#window-function-reference) operates over a "window" of rows that you specify, and then emits a value for each row of the query results.

This tutorial demonstrates how to define windows, and gives examples of window functions in action.

## Prerequisites

This tutorial works with Druid 31 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and the learning environment.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Use DRUID_HOST when it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Create a table using batch ingestion

Run the following cell to create a table using batch ingestion. Only the `channel`, `user`, and `delta` columns are ingested, along with the timestamp.

When completed, you'll see a description of the final table.

In [None]:
table_name = 'example-wikipedia-windows'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "user" VARCHAR, "delta" BIGINT))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "user",
  "delta"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

## Ordering windows

In this first section, you will define a window of ordered rows to rank rows, to number them, and to bucket them.

### Ranking results

Run the following cell to produce an hourly activity timeline for each channel and user, sorted in descending order of SUM(delta).

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel,
    user,
    SUM(delta) AS "net_user_changes"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY SUM(delta) DESC
"""

display.sql(sql)

Review the query in the next cell.

* The RANK window function adds a new column, `rank_edits`.
* An OVER clause defines the window to use with the window function.
* The window is ordered by SUM(delta) descending.

Run the cell to see the results.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel,
    user,
    SUM(delta) AS "net_user_changes",
    RANK() OVER (
        ORDER BY SUM(delta) DESC
        ) AS "rank_edits"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY 4 DESC
"""

display.sql(sql)

For each row, Druid has added a rank by taking the main GROUP BY result rows, ordering them by SUM(delta) descending, and then adding a rank value.

Run the next cell.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel,
    user,
    SUM(delta) AS "net_user_changes",
    RANK() OVER (
        ORDER BY SUM(delta)
        ) AS "rank_edits"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY 4 DESC
"""

display.sql(sql)

Notice that the main ORDER BY caused results to be ordered by SUM(delta) descending, but the rank has been computed over a window where SUM(delta) is in ascending order.

In the next query, the outer ORDER BY has been changed so that the final result list is ordered by time.

Run the next cell to see the results.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel,
    user,
    SUM(delta) AS "net_user_changes",
    RANK() OVER (
        ORDER BY SUM(delta)
        ) AS "rank_edits"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY 1
"""

display.sql(sql)

The `rank_edits` window remains in SUM(delta) ascending order, and the final results are ordered by `period`.
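
The independence of the window ordering from the outer ORDER BY can be reproduced on a handful of rows. The following is a minimal sketch that uses Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) as a stand-in for Druid. The `hourly` table and its values are made up for illustration and are not part of the tutorial data.

```python
import sqlite3

# Hypothetical, already-aggregated rows: (period, net_user_changes).
rows = [("01:00", 50), ("02:00", -10), ("03:00", 30)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hourly (period TEXT, net_user_changes INTEGER)")
con.executemany("INSERT INTO hourly VALUES (?, ?)", rows)

# The window's ORDER BY (ascending) decides the rank values.
# The outer ORDER BY only decides the order in which rows are listed.
ranked = con.execute("""
    SELECT period,
           net_user_changes,
           RANK() OVER (ORDER BY net_user_changes) AS rank_edits
    FROM hourly
    ORDER BY period
""").fetchall()

for row in ranked:
    print(row)
```

The rows come back in `period` order, while the smallest `net_user_changes` value still receives rank 1.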

In the next query:

* `rank_editors` is a new column produced by an additional RANK function.
* The window is ordered by COUNT(DISTINCT user) descending.

Run the cell to see the results.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    RANK( ) OVER ( ORDER BY SUM(delta) ) AS "rank_edits",
    RANK( ) OVER ( ORDER BY COUNT(DISTINCT user) DESC ) AS "rank_editors"
FROM "{table_name}"
WHERE channel = '#en.wikipedia'
AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1
ORDER BY 1
"""

display.sql(sql)

Notice that `rank_editors` is duplicated at position 10.

You may want to adjust the SQL above to include COUNT(DISTINCT user) in the results to find out why, then change the window function to DENSE_RANK to see how this affects the results.

As an additional experiment, add a PERCENT_RANK window function. Remember to define a window using OVER and ORDER BY.
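
To see how ties affect the three ranking functions, here is a small sketch that again uses Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) in place of Druid. The `hourly_editors` table and its counts are hypothetical, and the sketch assumes Druid follows the same standard definitions of RANK, DENSE_RANK, and PERCENT_RANK.

```python
import sqlite3

# Hypothetical rows: (period, editors). Two periods tie on 5 editors.
rows = [("01:00", 7), ("02:00", 5), ("03:00", 5), ("04:00", 2)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hourly_editors (period TEXT, editors INTEGER)")
con.executemany("INSERT INTO hourly_editors VALUES (?, ?)", rows)

# All three functions share the same ordering: editors descending.
# RANK leaves a gap after a tie, DENSE_RANK does not, and
# PERCENT_RANK is (rank - 1) / (rows in the window - 1).
ranks = con.execute("""
    SELECT period,
           editors,
           RANK()         OVER (ORDER BY editors DESC) AS rank_editors,
           DENSE_RANK()   OVER (ORDER BY editors DESC) AS dense_rank_editors,
           PERCENT_RANK() OVER (ORDER BY editors DESC) AS pct_rank_editors
    FROM hourly_editors
    ORDER BY period
""").fetchall()

for row in ranks:
    print(row)
```

The tied rows share rank 2 under both RANK and DENSE_RANK; the row after the tie is ranked 4 by RANK and 3 by DENSE_RANK.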

### Numbering and bucketing results

The following query includes two more window functions:

* ROW_NUMBER adds a monotonically increasing number, ordered by a count of the number of distinct users.
* NTILE splits the result rows into buckets of (near to) equal size, and gives each row a bucket number - again based on the count of distinct users.

The window for each is based on COUNT(DISTINCT user) in descending order.

Run the cell to see the results.

In [None]:
sql = f"""
SELECT 
    channel,
    SUM(delta) AS "edits",
    ROW_NUMBER() OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS row_no,
    NTILE(6) OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS ntile_val
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1
ORDER BY 3
"""

display.sql(sql)

For each row, Druid has added a monotonically increasing row number and a bucket number.

Try adjusting the window for `row_no` so that it is ordered by the number of edits.
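
When the row count does not divide evenly by the number of buckets, NTILE has to make some buckets larger than others. The sketch below illustrates this with Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) rather than Druid. The `daily` table, channel names, and counts are invented for the example, and SQLite places the larger buckets first; check the Druid output above to confirm it behaves the same way.

```python
import sqlite3

# Hypothetical rows: (channel, editors), seven channels in total.
rows = [("#a", 70), ("#b", 60), ("#c", 50), ("#d", 40),
        ("#e", 30), ("#f", 20), ("#g", 10)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (channel TEXT, editors INTEGER)")
con.executemany("INSERT INTO daily VALUES (?, ?)", rows)

# Seven rows split into three buckets gives sizes 3, 2, and 2.
numbered = con.execute("""
    SELECT channel,
           ROW_NUMBER() OVER (ORDER BY editors DESC) AS row_no,
           NTILE(3)     OVER (ORDER BY editors DESC) AS ntile_val
    FROM daily
    ORDER BY row_no
""").fetchall()

for row in numbered:
    print(row)
```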

In the next query, a LIMIT is applied to the main query.

What do you expect to happen?

Run the cell to see the result.

In [None]:
sql = f"""
SELECT 
    channel,
    SUM(delta) AS "edits",
    ROW_NUMBER() OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS row_no,
    NTILE(6) OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS ntile_val
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

display.sql(sql)
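
For comparison, the sketch below runs a similar query against Python's built-in `sqlite3` module (assuming SQLite 3.25 or later), using a made-up `daily` table. In SQLite, which follows the standard SQL evaluation order, window functions are computed over the full result set before the outer ORDER BY and LIMIT are applied. This is an illustration only; compare it with what Druid returned in the cell above.

```python
import sqlite3

# Hypothetical rows: (channel, edits, editors).
rows = [("#a", 100, 1), ("#b", 50, 9), ("#c", 900, 4), ("#d", 10, 7)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (channel TEXT, edits INTEGER, editors INTEGER)")
con.executemany("INSERT INTO daily VALUES (?, ?, ?)", rows)

# row_no is assigned across all four rows (ordered by editors),
# and only then are the top two rows by edits kept.
limited = con.execute("""
    SELECT channel,
           edits,
           ROW_NUMBER() OVER (ORDER BY editors DESC) AS row_no
    FROM daily
    ORDER BY edits DESC
    LIMIT 2
""").fetchall()

for row in limited:
    print(row)
```

The two surviving rows keep the row numbers they were given within the full window, so `row_no` does not restart at 1.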

## Partitioning windows

Use PARTITION BY in window definitions to align window function calculations with a dimension value in each result row.

In this section, you'll use partitioned windows with some [standard aggregation functions](https://druid.apache.org/docs/latest/querying/sql-aggregations/).

Run the following cell to calculate the total edits (SUM) in two specific channels on a single day of the data.

In [None]:
sql = f"""
SELECT 
    channel,
    SUM(delta) AS "edits"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
AND channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY 1
"""

display.sql(sql)

Review the query below.

* The main GROUP BY emits a sum of `edits` for each `channel` and `user`.
* ROW_NUMBER and NTILE work as before.
* The `channel_edits` column is new.
  * The SUM() function is calculated over a window.
  * The window is partitioned by `channel`.
* The results are limited to a specific day and two channels.

Run the next cell to see the results.

In [None]:
sql = f"""
SELECT 
    channel,
    user,
    SUM(delta) AS "edits",
    ROW_NUMBER() OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS "row_no",
    NTILE(6) OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS "ntile_val",
    SUM(SUM(delta)) OVER (
        PARTITION BY channel
        ) AS "channel_edits"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
AND channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY 1, 2
ORDER BY 1, 2
"""

display.sql(sql)

Notice how `channel_edits` matches the totals calculated earlier. Druid has calculated the SUM in partitions, one per channel.
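
The effect of PARTITION BY can be checked by hand on a tiny data set. This is a minimal sketch using Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) as a stand-in for Druid. The `user_edits` table and its rows are hypothetical and already grouped by channel and user, so `SUM(edits)` here plays the role of `SUM(SUM(delta))` in the query above.

```python
import sqlite3

# Hypothetical, already-grouped rows: (channel, user, edits).
rows = [("#x", "ann", 10), ("#x", "bob", 5), ("#y", "cat", 7)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_edits (channel TEXT, user TEXT, edits INTEGER)")
con.executemany("INSERT INTO user_edits VALUES (?, ?, ?)", rows)

# Each row receives the total for its own channel partition.
totals = con.execute("""
    SELECT channel,
           user,
           edits,
           SUM(edits) OVER (PARTITION BY channel) AS channel_edits
    FROM user_edits
    ORDER BY channel, user
""").fetchall()

for row in totals:
    print(row)
```

Both `#x` rows carry the same `channel_edits` value, which is the sum of their individual `edits`.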

The same technique can be used with a number of different aggregates. Run the next cell to see the results of a more complex query.

In [None]:
sql=f"""
SELECT user,
    channel,
    SUM(delta) AS "edits",
    ROW_NUMBER() OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS row_no,
    NTILE(6) OVER (
        ORDER BY COUNT(DISTINCT user) DESC
        ) AS ntile_val,
    SUM(SUM(delta)) OVER (
        PARTITION BY channel
        ) AS channel_edits,
    SUM(SUM(delta)) OVER (
        PARTITION BY user
        ) AS user_edits,
    COUNT(channel) OVER (
        PARTITION BY user
        ) AS user_row_count
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
AND user LIKE 'L%'
GROUP BY 1, 2
HAVING SUM(delta) > 5000
"""

display.sql(sql)

Notice that `Lsjbot` has made multiple edits across multiple channels in the same period.

## Ordering partitioned windows

Use ORDER BY and PARTITION BY together to order the rows inside each partitioned window.

* The outer GROUP BY and SUM generate the `edits` for each `period`, `channel`, and `user`.
* The RANK function is used to create `rank_within_channel_day`.
* The window is partitioned by channel and ordered by SUM(delta) descending.

Run the cell to see the results.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel, 
    user,
    SUM(delta) AS "edits",
    RANK() OVER (
        PARTITION BY channel
        ORDER BY SUM(delta) DESC
        ) AS "rank_within_channel_day"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL ("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY 2, 5
"""

display.sql(sql)

Notice how the `rank_within_channel_day` has been calculated per channel, and is ordered by the number of edits in descending order.
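
As a small, hedged illustration of ranking inside partitions, the following sketch uses Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) instead of Druid. The `user_edits` table and its values are made up, and the column is named `rank_within_channel` here only for the example.

```python
import sqlite3

# Hypothetical, already-grouped rows: (channel, user, edits).
rows = [("#x", "ann", 10), ("#x", "bob", 25),
        ("#y", "cat", 7), ("#y", "dan", 3), ("#y", "eve", 9)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_edits (channel TEXT, user TEXT, edits INTEGER)")
con.executemany("INSERT INTO user_edits VALUES (?, ?, ?)", rows)

# The rank restarts at 1 in every channel partition.
per_channel = con.execute("""
    SELECT channel,
           user,
           edits,
           RANK() OVER (
               PARTITION BY channel
               ORDER BY edits DESC
           ) AS rank_within_channel
    FROM user_edits
    ORDER BY channel, rank_within_channel
""").fetchall()

for row in per_channel:
    print(row)
```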

## Limiting windows

Run the following cell to produce a timeline of the number of edits in channels that begin with `#h`.

In [None]:
sql = f"""
SELECT
    channel,
    TIME_FLOOR(__time, 'PT1H') AS "period",
    SUM(delta) AS "edits"
FROM "{table_name}"
WHERE "channel" LIKE '#h%'
  AND TIME_IN_INTERVAL("__time",'2016-06-27/PT12H')
GROUP BY 1, 2
"""

display.sql(sql)

Use ROWS BETWEEN to [limit the window](https://druid.apache.org/docs/latest/querying/sql-window-functions/#window-function-syntax) over which a function operates.

In the following query, a new column `cumulative_channel_edits` has been added. It operates over a window that is:

* Ordered by `__time`, floored to the hour.
* Partitioned by `channel`.
* Scoped to include all preceding rows up to the current row.

The result is a cumulative total, built up from all the results of the GROUP BY.

A second new column, `rolling_max_edits`, operates over a window that is:

* Ordered by `__time`, floored to the hour.
* Partitioned by `channel`.
* Scoped to the preceding 3 rows and the current row.

This produces a channel-specific rolling maximum value over time.

In [None]:
sql = f"""
SELECT
    channel,
    TIME_FLOOR(__time, 'PT1H') AS "period",
    SUM(delta) AS "edits",
    SUM(SUM(delta)) OVER (
        PARTITION BY channel
        ORDER BY TIME_FLOOR(__time, 'PT1H')
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS "cumulative_channel_edits",
    MAX(SUM(delta)) OVER (
        PARTITION BY channel
        ORDER BY TIME_FLOOR(__time, 'PT1H')
        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
        ) AS "rolling_max_edits"
FROM "{table_name}"
WHERE "channel" LIKE '#h%'
  AND TIME_IN_INTERVAL("__time",'2016-06-27/PT12H')
GROUP BY 1, 2
"""

display.sql(sql)
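
The two frames can be traced row by row on a short series. The sketch below uses Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) as a stand-in for Druid; the `hourly` table, the `#x.example` channel, and the values are hypothetical, and the rows are already aggregated per hour.

```python
import sqlite3

# Hypothetical, already-aggregated rows: (channel, period, edits).
rows = [("#x.example", "00:00", 4), ("#x.example", "01:00", 9),
        ("#x.example", "02:00", 2), ("#x.example", "03:00", 1),
        ("#x.example", "04:00", 3), ("#x.example", "05:00", 0)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hourly (channel TEXT, period TEXT, edits INTEGER)")
con.executemany("INSERT INTO hourly VALUES (?, ?, ?)", rows)

# The first frame grows from the start of the partition (running total).
# The second frame slides: at most four rows, ending at the current row.
framed = con.execute("""
    SELECT period,
           edits,
           SUM(edits) OVER (
               PARTITION BY channel
               ORDER BY period
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative_channel_edits,
           MAX(edits) OVER (
               PARTITION BY channel
               ORDER BY period
               ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
           ) AS rolling_max_edits
    FROM hourly
    ORDER BY period
""").fetchall()

for row in framed:
    print(row)
```

In the last row, the value 9 has dropped out of the four-row frame, so the rolling maximum falls to 3 while the cumulative total is unchanged.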

## Row-specific windows

LAG, LEAD, and other functions address specific rows in the window that you define.

Run the next cell to see the results of a query where:

* `previous_edit_sum` returns the value of the previous row's SUM(delta).
* `nextNext_edit_sum` returns the value of SUM(delta) two rows ahead.
* `first_edit_sum` returns the very first SUM(delta) in the window.
* `final_edit_sum` returns the very last SUM(delta) in the window.

In [None]:
sql = f"""
SELECT TIME_FLOOR("__time",'PT1H') AS "period",
    channel,
    SUM(delta) AS "edit_sum",
    LAG(SUM(delta)) OVER (
        PARTITION BY channel
        ORDER BY TIME_FLOOR("__time",'PT1H')
        ) AS "previous_edit_sum",
    LEAD(SUM(delta), 2) OVER (
        PARTITION BY channel
        ORDER BY TIME_FLOOR("__time",'PT1H')
        ) AS "nextNext_edit_sum",
    FIRST_VALUE(SUM(delta)) OVER (
        PARTITION BY channel
        ) AS "first_edit_sum",
    LAST_VALUE(SUM(delta)) OVER (
        PARTITION BY channel
        ) AS "final_edit_sum"
FROM "{table_name}"
WHERE channel = '#de.wikipedia'
GROUP BY 1, 2
"""

display.sql(sql)
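
The row-addressing functions are easier to follow on four rows than on a full timeline. This sketch uses Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) in place of Druid, with a made-up `hourly` table. Unlike the query above, it gives FIRST_VALUE and LAST_VALUE an explicit ORDER BY and a frame that spans the whole partition; that is a choice made here to keep the output deterministic, not something the tutorial query does.

```python
import sqlite3

# Hypothetical, already-aggregated rows: (channel, period, edit_sum).
rows = [("#x.example", "00:00", 10), ("#x.example", "01:00", 20),
        ("#x.example", "02:00", 30), ("#x.example", "03:00", 40)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hourly (channel TEXT, period TEXT, edit_sum INTEGER)")
con.executemany("INSERT INTO hourly VALUES (?, ?, ?)", rows)

# LAG looks one row back, LEAD(..., 2) looks two rows ahead.
# Rows with no such neighbour get NULL (None in Python).
neighbours = con.execute("""
    SELECT period,
           edit_sum,
           LAG(edit_sum) OVER (
               PARTITION BY channel ORDER BY period
           ) AS previous_edit_sum,
           LEAD(edit_sum, 2) OVER (
               PARTITION BY channel ORDER BY period
           ) AS nextNext_edit_sum,
           FIRST_VALUE(edit_sum) OVER (
               PARTITION BY channel ORDER BY period
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS first_edit_sum,
           LAST_VALUE(edit_sum) OVER (
               PARTITION BY channel ORDER BY period
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS final_edit_sum
    FROM hourly
    ORDER BY period
""").fetchall()

for row in neighbours:
    print(row)
```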

Combining what you have seen on cumulative calculations, try adapting the query above to incorporate a new column that begins with zero edits, and then reaches `final_edit_sum`. Can you make it work on a per-channel basis for all channels that begin with `#e`?

## Re-using window definitions

Use WINDOW to create window definitions that you can re-use in your SQL statement.

The query above has been reworked to abstract the common window definitions:

* `channelTime_window` is partitioned by channel and ordered by time.
* `channel_window` is partitioned by channel.
* The LAG, LEAD, FIRST_VALUE, and LAST_VALUE functions have been adapted to address the WINDOW definitions.

In [None]:
sql = f"""
SELECT TIME_FLOOR("__time",'PT1H') AS "period",
    channel,
    SUM(delta) AS "edit_sum",
    LAG(SUM(delta)) OVER channelTime_window AS "previous_edit_sum",
    LEAD(SUM(delta), 2) OVER channelTime_window AS "nextNext_edit_sum",
    FIRST_VALUE(SUM(delta)) OVER channel_window AS "first_edit_sum",
    LAST_VALUE(SUM(delta)) OVER channel_window AS "final_edit_sum"
FROM "{table_name}"
WHERE channel = '#de.wikipedia'
GROUP BY 1, 2
WINDOW
    channelTime_window AS
    (
        PARTITION BY channel
        ORDER BY TIME_FLOOR("__time",'PT1H')
    ),
    channel_window AS
    (
        PARTITION BY channel
    )
"""

display.sql(sql)

Take a moment to review some of the queries in this notebook. Can you simplify them by using WINDOW?
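
As one possible starting point, the sketch below shows ROW_NUMBER and NTILE sharing a single named window. It runs against Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) rather than Druid, and the `daily` table, its values, and the `editors_window` name are all hypothetical.

```python
import sqlite3

# Hypothetical rows: (channel, editors).
rows = [("#a", 40), ("#b", 30), ("#c", 20), ("#d", 10)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (channel TEXT, editors INTEGER)")
con.executemany("INSERT INTO daily VALUES (?, ?)", rows)

# The window is defined once in the WINDOW clause and referenced twice.
shared = con.execute("""
    SELECT channel,
           ROW_NUMBER() OVER editors_window AS row_no,
           NTILE(2)     OVER editors_window AS ntile_val
    FROM daily
    WINDOW editors_window AS (ORDER BY editors DESC)
    ORDER BY row_no
""").fetchall()

for row in shared:
    print(row)
```

Defining the window once keeps the two functions in step: changing the ordering in `editors_window` changes both columns together.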

## Clean up

Run the following cell to remove the table used in this notebook from the database.

In [None]:
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

* Windows are defined in an OVER clause.
* The data in windows can be ordered, partitioned, or both.
* Many standard aggregations can operate over windows.
* Specific window functions exist for ranking, numbering, and bucketing results.
* There are additional functions that return values from specific rows in the window.
* Windows can be defined inline or by using WINDOW.

## Learn more

* Amend the queries in this notebook to try other functions, such as PERCENT_RANK.
* Try out different definitions for windows from the [official documentation](https://druid.apache.org/docs/latest/querying/sql-window-functions/#window-function-syntax), such as setting a scope to UNBOUNDED FOLLOWING.