# Learn the basics of the Druid window functions

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

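This tutorial demonstrates how to work with [window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions/#window-function-reference) in Apache Druid. In this tutorial you perform the following tasks:

- Rank rows using ordered windows.
- Aggregate across rows that share a dimension value using partitioned windows.
- Combine ordering and partitioning in a single window.
- Limit the rows in a window with ROWS BETWEEN.
- Read values from other rows with LAG, LEAD, FIRST_VALUE, and LAST_VALUE.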

## Prerequisites

This tutorial works with Druid 31 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Use DRUID_HOST when it is set; otherwise fall back to localhost.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Create a table using batch ingestion

Run the following cell to create a table using batch ingestion. Only the channel, user, and delta columns are ingested.

When completed, you'll see a description of the final table.

In [None]:
table_name = 'example-wikipedia-windows'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "user" VARCHAR, "delta" BIGINT))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "user",
  "delta"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

## Using window functions with ordered windows

Run the following cell to produce an hourly activity timeline by channel and user.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel,
    user,
    SUM(delta) AS "net_user_changes"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY 1, 2
"""

display.sql(sql)

In the following query, a window function is added. This gives a ranking for each row.

* RANK is the [window function](https://druid.apache.org/docs/latest/querying/sql-window-functions/#window-function-reference) applied.
* The ORDER BY clause specifies how to generate the rank itself.

Run the cell to see the results.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    channel,
    user,
    SUM(delta) AS "net_user_changes",
    RANK( ) OVER ( ORDER BY SUM(delta) DESC ) AS "rank_edits"
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1, 2, 3
ORDER BY 1, 2
"""

display.sql(sql)

You can use several window functions in the same query. In the next query, another RANK function is added, `rank_editors`, which ranks each row according to the number of distinct users in each group.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as "period", 
    RANK( ) OVER ( ORDER BY SUM(delta) DESC ) AS "rank_edits",
    RANK( ) OVER ( ORDER BY COUNT(DISTINCT user) DESC ) AS "rank_editors"
FROM "{table_name}"
WHERE channel = '#en.wikipedia'
AND TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1
ORDER BY 3
"""

display.sql(sql)

Notice that because RANK was used, some rankings for `rank_editors` are duplicated: rows that tie share a rank, and the ranks that follow skip ahead.

You may want to adjust the SQL above to use DENSE_RANK, which also gives tied rows the same rank but leaves no gaps in the sequence, and PERCENT_RANK, which calculates `(RANK - 1) / (number of rows - 1)`.
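
The difference between these ranking functions can be sketched in plain Python, without a Druid connection. The sketch below assumes the standard SQL definitions of RANK, DENSE_RANK, and PERCENT_RANK; the function name `rank_rows` and the sample counts are hypothetical and only serve the illustration.

```python
def rank_rows(values):
    """Return (value, rank, dense_rank, percent_rank) for values ordered descending."""
    ordered = sorted(values, reverse=True)
    total = len(ordered)
    rows = []
    rank = dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position   # RANK jumps to the row position, leaving gaps after ties
            dense_rank += 1   # DENSE_RANK moves up by one, leaving no gaps
            previous = value
        percent_rank = (rank - 1) / (total - 1) if total > 1 else 0.0
        rows.append((value, rank, dense_rank, percent_rank))
    return rows


# Hypothetical distinct-user counts with a three-way tie at 7
for row in rank_rows([12, 7, 7, 7, 3]):
    print(row)
```

The tied rows share a rank under both RANK and DENSE_RANK; only the rank given to the row after the tie differs.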

The following query shows two more useful functions:

* ROW_NUMBER gives each row a unique, sequential number, here in descending order of the count of distinct users.
* NTILE splits rows into buckets of (near to) equal size and gives each row a bucket number, again based on the count of distinct users.

Notice that the GROUP BY clause causes each row to relate to a single channel. The final ORDER BY and LIMIT mean that only the top 10 channels by edits are returned, but ROW_NUMBER and NTILE are calculated over _all_ result rows.
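
To see how NTILE spreads rows across buckets of near-equal size, here is a minimal Python sketch. It assumes the common SQL convention that, when the rows do not divide evenly, the earlier buckets each take one extra row; the helper name `ntile` is hypothetical.

```python
def ntile(row_count, buckets):
    """Return the 1-based bucket number for each of row_count ordered rows."""
    base, extra = divmod(row_count, buckets)
    assignments = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets hold one more row than the rest.
        size = base + (1 if bucket <= extra else 0)
        assignments.extend([bucket] * size)
    return assignments


# Ten ordered rows split into six buckets, as with NTILE(6)
print(ntile(10, 6))
```

With fewer rows than buckets, each row lands in its own bucket and the remaining buckets stay empty.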

In [None]:
sql = f"""
SELECT 
    channel,
    SUM(delta) AS "edits",
    ROW_NUMBER()   OVER (  ORDER BY COUNT(DISTINCT user) DESC ) AS row_no,
    NTILE(6)       OVER (  ORDER BY COUNT(DISTINCT user) DESC ) AS ntile_val
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

display.sql(sql)

## Using window functions with partitioned windows

PARTITION BY in the OVER clause calculates a window function result using other rows that share the same value in the given dimension.

In the query below:

* `total_user_edits` adds to each row the sum of `delta` across all rows for the same user.
* `total_channel_edits` adds to each row the sum of `delta` across all rows for the same channel.

Run the cell to see the results calculated over five minutes' worth of data.

In [None]:
sql = f"""
SELECT
    TIME_FLOOR(__time, 'PT1H') as time_hour, channel, user,
    SUM(delta) edits,
    SUM(SUM(delta)) OVER (PARTITION BY user ) AS total_user_edits,
    SUM(SUM(delta)) OVER (PARTITION BY channel ) AS total_channel_edits
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/PT5M')
GROUP BY TIME_FLOOR(__time, 'PT1H'),2,3
ORDER BY channel,TIME_FLOOR(__time, 'PT1H'), user
"""

display.sql(sql)

Notice that all rows for `#en.wikipedia` have the same `total_channel_edits`.

The GROUP BY clause ensures that each result is for a single time period on a single channel. Notice that, unlike many of the other users, `Lsjbot` has made edits across multiple channels in the same period: one is `#ceb.wikipedia`. Can you spot the others?
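
The effect of `SUM(...) OVER (PARTITION BY ...)` can be imitated in plain Python: every row keeps its own values and gains the total for its partition. This is only a sketch of the idea; the rows, user names, and the helper `sum_over_partition` are hypothetical.

```python
from collections import defaultdict

# Hypothetical grouped rows: (channel, user, edits)
rows = [
    ("#en.wikipedia", "alice", 120),
    ("#en.wikipedia", "bob", -30),
    ("#ceb.wikipedia", "bot", 500),
    ("#sv.wikipedia", "bot", 250),
]


def sum_over_partition(rows, key_index, value_index=2):
    """Append to each row the sum of the value column across its partition."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    # The row count is unchanged; only a new column is added.
    return [row + (totals[row[key_index]],) for row in rows]


by_channel = sum_over_partition(rows, key_index=0)  # like PARTITION BY channel
by_user = sum_over_partition(rows, key_index=1)     # like PARTITION BY user

for row in by_channel:
    print(row)
```

Unlike GROUP BY, the partition does not collapse rows: both `#en.wikipedia` rows remain and carry the same channel total.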

## Using aggregate functions with partitioned windows

Many aggregate functions can also be used with partitioned windows.

Run the following query, where a number of standard aggregations have been included.

A specific time period, a user string pattern, and a HAVING clause are used to make the results more readable.

In [None]:
sql=f"""
SELECT user,
    channel,
    SUM(delta) AS "edits",
    AVG(SUM(delta)) OVER (PARTITION BY user) AS "user_average_edits",
    MAX(SUM(delta)) OVER (PARTITION BY user) AS "user_most_edits",
    COUNT(channel) OVER (PARTITION BY user) AS "user_channel_count"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2016-06-27/P1D')
AND user LIKE 'L%'
GROUP BY 1, 2
HAVING SUM(delta) > 5000
"""

display.sql(sql)

Notice that the `user_...` aggregations contain the PARTITION BY clause. These aggregates are therefore specific to the `user` given in each row.

## Window functions with ordered partition windows

You can combine the two window types within a single query to perform ordered calculations within each partition of data.

The following query ranks each user's hourly activity within their channel:

In [None]:
sql = f"""
SELECT
    channel, 
    TIME_FLOOR(__time, 'PT1H') as time_hour, 
    user,
    SUM(delta) hourly_user_changes,
    RANK() OVER (PARTITION BY channel ORDER BY SUM(delta) DESC) AS rank_within_channel_day
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY 1, TIME_FLOOR(__time, 'PT1H'),3
ORDER BY channel, 5
"""

display.sql(sql)


## Limiting windows

Use `ROWS BETWEEN <range start> AND <range end>` to limit the set of rows in the window that the function uses.

`<range start>` and `<range end>` can take the following values:

- UNBOUNDED PRECEDING: from the beginning of the partition, as ordered by the \<order expression\>
- N ROWS PRECEDING: N rows before the current row, as ordered by the \<order expression\>
- CURRENT ROW: the current row
- N ROWS FOLLOWING: N rows after the current row, as ordered by the \<order expression\>
- UNBOUNDED FOLLOWING: to the end of the partition, as ordered by the \<order expression\>

The following query uses a few different window frames to analyze overall activity by channel:

In [None]:
sql = f"""
SELECT
    channel, 
    TIME_FLOOR(__time, 'PT1H')      AS time_hour, 
    SUM(delta)                      AS hourly_channel_changes,
    SUM(SUM(delta)) OVER (   
                         PARTITION BY channel 
                         ORDER BY TIME_FLOOR(__time, 'PT1H') 
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                     ) AS cumulative_activity_in_channel,
    SUM(SUM(delta)) OVER ( 
                    PARTITION BY channel 
                    ORDER BY TIME_FLOOR(__time, 'PT1H') 
                    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
                  )    AS csum5,
    COUNT(1) OVER ( 
                    PARTITION BY channel 
                    ORDER BY TIME_FLOOR(__time, 'PT1H') 
                    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
                  )           AS count5
FROM "{table_name}"
WHERE channel = '#en.wikipedia'
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY 1, TIME_FLOOR(__time, 'PT1H')
"""

display.sql(sql)

The query uses two window specifications, referred to here as `cumulative` and `moving5`. The query writes each specification inline in its OVER clauses; you can also define them once with the WINDOW clause and reuse them across many window function calculations.
- `cumulative` is partitioned by `channel` and includes all rows from the beginning of the partition up to the current row, as ordered by the hour of `__time`, which enables cumulative aggregation.
- `moving5` is also partitioned by `channel` but only includes up to the last 4 rows and the current row, as ordered by time.
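
As a sketch of how the same query could look with named windows, the cell below defines `cumulative` and `moving5` once in a WINDOW clause and refers to them by name in each OVER. It assumes the table name used earlier in this notebook, and it only builds and prints the SQL text.

```python
# Assumption: the same table name as used earlier in this notebook.
table_name = 'example-wikipedia-windows'

sql = f"""
SELECT
    channel,
    TIME_FLOOR(__time, 'PT1H')       AS time_hour,
    SUM(delta)                       AS hourly_channel_changes,
    SUM(SUM(delta)) OVER cumulative  AS cumulative_activity_in_channel,
    SUM(SUM(delta)) OVER moving5     AS csum5,
    COUNT(1) OVER moving5            AS count5
FROM "{table_name}"
WHERE channel = '#en.wikipedia'
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY 1, TIME_FLOOR(__time, 'PT1H')
WINDOW cumulative AS (
           PARTITION BY channel
           ORDER BY TIME_FLOOR(__time, 'PT1H')
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ),
       moving5 AS (
           PARTITION BY channel
           ORDER BY TIME_FLOOR(__time, 'PT1H')
           ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
       )
"""

# Print the SQL text; in the notebook, pass it to display.sql(sql) to run it.
print(sql)
```

Naming a window keeps the frame definition in one place, which helps when several columns, such as `csum5` and `count5`, must share exactly the same frame.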

Notice in the `count5` resulting column that the number of rows considered for the `moving5` window:
- starts at 1 because there are no rows before the current one
- and grows up to 5 as defined by `ROWS BETWEEN 4 PRECEDING AND CURRENT ROW`
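
The way the frame grows and then slides can be reproduced with list slicing. The following is a minimal sketch, not Druid's implementation; the hourly values and the helper `frame_aggregates` are hypothetical, and `preceding=None` stands in for UNBOUNDED PRECEDING.

```python
def frame_aggregates(values, preceding=4):
    """Return (value, frame_sum, frame_count) per row for a frame that ends at the current row."""
    results = []
    for i, value in enumerate(values):
        # The frame starts `preceding` rows back, or at the first row if unbounded.
        start = 0 if preceding is None else max(0, i - preceding)
        frame = values[start:i + 1]
        results.append((value, sum(frame), len(frame)))
    return results


hourly = [10, 20, 30, 40, 50, 60, 70]  # hypothetical hourly totals for one channel

# ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
for row in frame_aggregates(hourly, preceding=4):
    print(row)

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
cumulative = [total for _, total, _ in frame_aggregates(hourly, preceding=None)]
print(cumulative)
```

The count column starts at 1 and stops growing at 5, which mirrors the behavior of `count5`.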

When you compare the ranking functions side by side, notice the differences in the values they give to tied rows:
- `ROW_NUMBER()` grows monotonically by one for each row, regardless of ties.
- `RANK()` assigns the same rank to tied rows and then skips ahead. For example, if four rows tie at rank `5`, the next row is ranked `9` because there are 8 rows before it.
- `DENSE_RANK()` also assigns the same rank of `5` to the tied rows but then continues with `6`.

Distribution functions:
- `PERCENT_RANK()` calculates `(rank - 1) / (rows in partition - 1)`, a measure of what percentage of values fall before the current value in the distribution.
- `CUME_DIST()` calculates the cumulative distribution. You can read it as: the value in this row is in the top X percent of the population.
- `NTILE(N)` calculates which distribution bucket the row falls into, where N is the number of buckets. `NTILE(4)` calculates quartiles and `NTILE(100)` calculates percentiles.
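
As a rough illustration of the cumulative distribution, the sketch below assumes the standard SQL definition: the number of rows at or before the current row in the window order, counting ties, divided by the number of rows in the partition. The helper name `cume_dist` and the sample values are hypothetical.

```python
def cume_dist(values):
    """Return (value, cumulative_distribution) pairs for values ordered descending."""
    ordered = sorted(values, reverse=True)
    total = len(ordered)
    results = []
    for value in ordered:
        # Count rows that sort at or before this one, including its ties.
        at_or_before = sum(1 for other in ordered if other >= value)
        results.append((value, at_or_before / total))
    return results


# Hypothetical values with a three-way tie at 7
for row in cume_dist([12, 7, 7, 7, 3]):
    print(row)
```

Tied rows share the same result, and the last row in the ordering always reaches 1.0.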
   

## Value window functions

You can use the value functions LAG, LEAD, FIRST_VALUE, and LAST_VALUE to include values from other rows in the window alongside the current row.

In [None]:
sql = f"""
SELECT FLOOR(__time TO HOUR) AS event_time,
    channel,
    SUM(delta) total_activity,
    LAG(SUM(delta), 1, NULL) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS lag_val,
    LEAD(SUM(delta), 1, NULL) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS lead_val,
    FIRST_VALUE(SUM(delta)) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS first_val,
    LAST_VALUE(SUM(delta)) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS last_val
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, FLOOR(__time TO HOUR) 
"""

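# Build a request that explicitly sets the enableWindowing context flag.
req = sql_client.sql_request(sql)
req.add_context("enableWindowing", "true")
# display.sql runs the SQL text directly, as in the earlier cells.
display.sql(sql)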

The `total_activity` column shows the net change in bytes for the `channel` by hour `event_time`. The calculated window function result columns are:

- `lag_val` shows the value of `total_activity` from 1 row preceding the current row. The second parameter of the `LAG` function specifies how many rows back to look. For the first row in each partition, the prior value does not exist, so NULL is returned.
- `lead_val` shows the value of `total_activity` from 1 row following the current row. Again, the second parameter of the `LEAD` function specifies how many rows ahead to look. For the last row in each partition, the following value does not exist, so NULL is returned.
- `first_val` shows the first value in the `channel` partition, ordered by time.
- `last_val` shows the last value in the `channel` partition, ordered by time.
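
The four value functions can be mimicked over a single ordered partition with list indexing. This sketch assumes, as described above, that the first and last values are taken over the whole partition; the helper `value_functions` and the sample values are hypothetical.

```python
def value_functions(values, offset=1, default=None):
    """Return one dict per row with lag, lead, first, and last values for an ordered partition."""
    count = len(values)
    rows = []
    for i, value in enumerate(values):
        rows.append({
            "value": value,
            # Look back or ahead by `offset` rows, or use the default at the edges.
            "lag": values[i - offset] if i - offset >= 0 else default,
            "lead": values[i + offset] if i + offset < count else default,
            "first": values[0],
            "last": values[-1],
        })
    return rows


# Hypothetical hourly totals for one channel, ordered by time
for row in value_functions([5, 8, 2]):
    print(row)
```

The first row has no lag value and the last row has no lead value, so both fall back to the default, just as NULL appears at the partition edges in the query results.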

## Clean up

Run the following cell to remove the table used in this notebook from the database.

In [None]:
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

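* You learned how to rank rows and calculate aggregates over ordered windows, partitioned windows, and a combination of the two.
* Remember that ROWS BETWEEN limits which rows of a window a function uses.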

## Learn more

* Try these queries out on your own data.
* Read the [window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions/#window-function-reference) documentation.
* Explore further by adjusting the partitions, ordering, and window frames in the queries above.

## Inspect the data

The example queries demonstrate a comparison of the total delta value for a change event in Wikipedia by channel and by user. 

The dataset describes changes that each individual `user` has made to Wikipedia pages within a `channel` expressed as the number of bytes added or deleted in the `delta` column and where `__time` is when the change was submitted. 

Run this query to have a look at the data:

In [None]:
sql = f"""
SELECT
    __time,
    channel,
    user,
    delta
FROM "{table_name}"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
ORDER BY __time
"""
display.sql(sql)