# Learn the basics of the Druid window functions

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
[Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions) in Apache Druid produce values based upon the relationship of one row within a window of rows to the other rows within the same window. A window is a group of related rows within a result set, for example, rows with the same value for a specific dimension.

This tutorial uses Wikipedia data to demonstrate how window functions work in Druid.

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

> Note that window functions are an experimental feature.

In [None]:
import druidapi
import os

# Fall back to localhost when DRUID_HOST is not set in the environment.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Load example data

The example queries demonstrate a comparison of the total delta value for a change event in Wikipedia by channel and by user. For that reason, we only need the timestamp, channel, user, and delta columns for the source data.

In [None]:
sql='''
REPLACE INTO "example-wikipedia-windows" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "user" VARCHAR, "delta" BIGINT))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "user",
  "delta"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-windows')
display.table('example-wikipedia-windows')

The dataset describes the changes that each `user` made to Wikipedia pages within a `channel`. The `delta` column holds the number of bytes added or deleted, and `__time` records when the change was submitted.

Run this query to have a look at the data:

In [None]:
query = """
SELECT
    __time,
    channel,
    user,
    delta
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
ORDER BY __time
"""
display.sql(query)

## Window functions in Druid

Druid implements window functions over aggregate queries. The general syntax is:
```
SELECT
    <dimensions>,
    <aggregation function(s)>,
    window_function()
      OVER ( PARTITION BY <partitioning expression>
             ORDER BY <order expression>
             <window frame>
           )
FROM <table>
GROUP BY <dimensions>
```

The `GROUP BY <dimensions>` is applied first, calculating all non-window `<aggregation functions>` and then applying the window function over the aggregate results.
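
The two-stage evaluation described above can be sketched in plain Python. This is only an illustration of the order of operations, not how Druid executes the query; the sample rows and the `group_then_window` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (channel, user, delta)
rows = [
    ("#kk", "alice", 10),
    ("#kk", "alice", 5),
    ("#kk", "bob", -3),
    ("#lt", "carol", 7),
]

def group_then_window(rows):
    # Stage 1: GROUP BY channel, user with SUM(delta)
    grouped = defaultdict(int)
    for channel, user, delta in rows:
        grouped[(channel, user)] += delta

    # Stage 2: SUM(SUM(delta)) OVER (PARTITION BY channel),
    # computed over the aggregated rows rather than the raw rows
    channel_totals = defaultdict(int)
    for (channel, _user), total in grouped.items():
        channel_totals[channel] += total

    return [
        (channel, user, total, channel_totals[channel])
        for (channel, user), total in sorted(grouped.items())
    ]

result = group_then_window(rows)
for row in result:
    print(row)
```

Each output tuple holds the grouped dimensions, the grouped aggregate, and the window result that was computed from the aggregates.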

Start by defining the aggregation to use as the base query.
In this example, the query summarizes the Wikipedia activity metrics by hour, `channel`, and `user`:

In [None]:
query = """
SELECT
    channel, 
    TIME_FLOOR(__time, 'PT1H') as time_hour, 
    user,
    SUM(delta) net_user_changes
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY TIME_FLOOR(__time, 'PT1H'), channel, user
ORDER BY channel, TIME_FLOOR(__time, 'PT1H'), user

"""

req = sql_client.sql_request(query)
# Window functions are currently experimental. Set the enableWindowing
# context parameter to "true" to use them.
req.add_context("enableWindowing", "true")
display.sql(req)


## ORDER BY windows

When the window definition only specifies `ORDER BY <order expression>`, it sorts the aggregate data set and applies the function in that order.

The following query uses `ORDER BY SUM(delta) DESC` to rank user hourly activity from the most changed to the least changed within an hour:

In [None]:
query = """
SELECT
    TIME_FLOOR(__time, 'PT1H') as time_hour, 
    channel, 
    user,
    SUM(delta) net_user_changes,
    RANK( ) OVER ( ORDER BY SUM(delta) DESC ) editing_rank
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY TIME_FLOOR(__time, 'PT1H'), channel, user
ORDER BY 5 

"""

req = sql_client.sql_request(query)
# Window functions are currently experimental. Set the enableWindowing
# context parameter to "true" to use them.
req.add_context("enableWindowing", "true")
display.sql(req)

## PARTITION BY windows

When a window only specifies `PARTITION BY <partitioning expression>`, it calculates the aggregate window function over all the rows that share the same `<partitioning expression>` value within the selected dataset.

In this example, the query uses two different windows, `PARTITION BY channel` and `PARTITION BY user`, to calculate the total activity in the channel and the total activity by the user so that you can compare them to individual hourly activity.


In [None]:
query = """
SELECT
    TIME_FLOOR(__time, 'PT1H') as time_hour, channel, user,
    SUM(delta) hourly_user_changes,
    SUM(SUM(delta)) OVER (PARTITION BY user ) AS total_user_changes,
    SUM(SUM(delta)) OVER (PARTITION BY channel ) AS total_channel_changes
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY TIME_FLOOR(__time, 'PT1H'),2,3
ORDER BY channel,TIME_FLOOR(__time, 'PT1H'), user

"""

req = sql_client.sql_request(query)
# Window functions are currently experimental. Set the enableWindowing
# context parameter to "true" to use them.
req.add_context("enableWindowing", "true")
display.sql(req)


Since the window definition only uses the PARTITION BY clause, Druid performs the calculation over the whole dataset for each value of the `<partition expression>`. Since the dataset is filtered for a single day, the window function results represent the total activity for the day, for the `user` and for the `channel` respectively.

These results let you compare activity at different levels:
- the impact of a user's hourly activity on the channel, by comparing `hourly_user_changes` to `total_channel_changes`
- the impact of each user on the channel, by comparing `total_user_changes` to `total_channel_changes`
- the progress of each user's individual activity, by comparing `hourly_user_changes` to `total_user_changes`

## PARTITION BY + ORDER BY windows

You can combine the two window types within the same query to perform ordered calculations within each partition of data.
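
As a rough mental model, combining `PARTITION BY` and `ORDER BY` restarts the ordered calculation for each partition. The plain-Python sketch below emulates `RANK()` per channel on already-aggregated rows; the sample data and the `rank_within_partition` helper are hypothetical and do not reflect Druid's implementation.

```python
from itertools import groupby

# Hypothetical aggregated rows: (channel, user, hourly_user_changes)
aggregated = [
    ("#kk", "alice", 120),
    ("#kk", "bob", 300),
    ("#lt", "carol", 50),
    ("#lt", "dave", 50),
    ("#lt", "erin", 10),
]

def rank_within_partition(rows):
    """Emulate RANK() OVER (PARTITION BY channel ORDER BY changes DESC)."""
    ranked = []
    # Sort by the partition key first, then by the ordering expression descending.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    for channel, partition in groupby(ordered, key=lambda r: r[0]):
        previous_value, rank = None, 0
        for position, (_, user, changes) in enumerate(partition, start=1):
            # Tied values keep the rank of the first row in the tie.
            if changes != previous_value:
                rank = position
            previous_value = changes
            ranked.append((channel, user, changes, rank))
    return ranked

ranked_rows = rank_within_partition(aggregated)
for row in ranked_rows:
    print(row)
```

The rank restarts at 1 for each channel, and the two tied rows in the second partition share rank 1 while the next row receives rank 3.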

The following query ranks user hourly activity within the channel:

In [None]:
query = """
SELECT
    channel, 
    TIME_FLOOR(__time, 'PT1H') as time_hour, 
    user,
    SUM(delta) hourly_user_changes,
    RANK() OVER (PARTITION BY channel ORDER BY SUM(delta) DESC) AS rank_within_channel_day
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY 1, TIME_FLOOR(__time, 'PT1H'),3
ORDER BY channel, 5

"""

req = sql_client.sql_request(query)
# Window functions are currently experimental. Set the enableWindowing
# context parameter to "true" to use them.
req.add_context("enableWindowing", "true")
display.sql(req)


## Window frames

Window frames limit the set of rows used for the windowed aggregation.
The general form is:
```
<window function>
OVER (
        [ PARTITION BY <partition expression>] ORDER BY <order expression>
        ROWS BETWEEN <range start> AND <range end>
     )
```
`<range start>` and `<range end>` can take the following values:
- `UNBOUNDED PRECEDING` - from the beginning of the partition as ordered by the `<order expression>`
- `N PRECEDING` - N rows before the current row as ordered by the `<order expression>`
- `CURRENT ROW` - the current row
- `N FOLLOWING` - N rows after the current row as ordered by the `<order expression>`
- `UNBOUNDED FOLLOWING` - to the end of the partition as ordered by the `<order expression>`

The following query uses a few different window frames to analyze overall activity by channel:

In [None]:
query = """
SELECT
    channel, 
    TIME_FLOOR(__time, 'PT1H')      AS time_hour, 
    SUM(delta)                      AS hourly_channel_changes,
    SUM(SUM(delta)) OVER cumulative AS cumulative_activity_in_channel,
    SUM(SUM(delta)) OVER moving5    AS csum5,
    COUNT(1) OVER moving5           AS count5
FROM "example-wikipedia-windows"
WHERE channel = '#en.wikipedia'
  AND __time BETWEEN '2016-06-27' AND '2016-06-28'
GROUP BY 1, TIME_FLOOR(__time, 'PT1H')
WINDOW cumulative AS (   
                         PARTITION BY channel 
                         ORDER BY TIME_FLOOR(__time, 'PT1H') 
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                     )
                     ,
        moving5 AS ( 
                    PARTITION BY channel 
                    ORDER BY TIME_FLOOR(__time, 'PT1H') 
                    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
                  )
"""

req = sql_client.sql_request(query)
# Window functions are currently experimental. Set the enableWindowing
# context parameter to "true" to use them.
req.add_context("enableWindowing", "true")
display.sql(req)


This example uses the `WINDOW` clause to define multiple window specifications that several window function calculations can reuse.
The query uses two windows:
- `cumulative` is partitioned by `channel` and includes all rows from the beginning of the partition up to the current row as ordered by the hourly `__time`, which enables cumulative aggregation
- `moving5` is also partitioned by `channel` but only includes up to the last 4 rows and the current row as ordered by time

Notice in the `count5` column that the number of rows in the `moving5` window:
- starts at 1 because there are no rows before the first one
- and grows up to 5 as defined by `ROWS BETWEEN 4 PRECEDING AND CURRENT ROW`
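
The frame behavior can be mimicked with list slices in plain Python. This is a minimal sketch, assuming a single partition that is already ordered by time; the sample values and the `framed_sums` helper are hypothetical.

```python
# Hypothetical hourly totals for one channel, already ordered by time.
hourly_changes = [10, -4, 7, 3, 5, 8, -2]

def framed_sums(values, preceding=None):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW.

    preceding=None stands in for UNBOUNDED PRECEDING.
    Returns one (frame_sum, frame_row_count) tuple per row.
    """
    results = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        frame = values[start : i + 1]
        results.append((sum(frame), len(frame)))
    return results

cumulative = framed_sums(hourly_changes)            # like the `cumulative` window
moving5 = framed_sums(hourly_changes, preceding=4)  # like the `moving5` window

for total, count in moving5:
    print(total, count)
```

The row counts for the `moving5` sketch grow from 1 to 5 and then stay at 5, matching the pattern in the `count5` column.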

## Ranking functions
Ranking window functions calculate results based on the ORDER BY clause in the window definition.
The following example queries the activity of a single channel, `#en.wikipedia`, during a single hour, `__time BETWEEN '2016-06-27 00:00:00' AND '2016-06-27 01:00:00'`, and applies all the ranking functions ordered by the total activity per user rounded to the nearest hundred. The rounding causes some ties in the values, which illustrates the differences among the ranking functions:

In [None]:
query = """
SELECT 
    channel,
    user,
    ROUND(SUM(delta),-2) AS hourly_change_rounded
    ,ROW_NUMBER()   OVER (  ORDER BY ROUND(SUM(delta),-2) DESC ) AS row_no
    ,RANK()         OVER (  ORDER BY ROUND(SUM(delta),-2) DESC ) AS rank_no
    ,DENSE_RANK()   OVER (  ORDER BY ROUND(SUM(delta),-2) DESC ) AS dense_rank_no
    ,PERCENT_RANK() OVER (  ORDER BY ROUND(SUM(delta),-2) DESC ) AS pct_rank
    ,CUME_DIST()    OVER (  ORDER BY ROUND(SUM(delta),-2) DESC ) AS cumulative_dist
    ,NTILE(4)       OVER (  ORDER BY ROUND(SUM(delta),-2) DESC ) AS ntile_val
FROM "example-wikipedia-windows"
WHERE channel = '#en.wikipedia'
  AND __time BETWEEN '2016-06-27 00:00:00' AND '2016-06-27 01:00:00'
GROUP BY TIME_FLOOR(__time, 'PT1H'), channel, user
"""

req = sql_client.sql_request(query)
req.add_context("enableWindowing", "true")
display.sql(req)

Notice the differences in ordinal ranking values in each column:
- `row_no` - `ROW_NUMBER()` grows monotonically by one for each row, regardless of ties.
- `rank_no` - `RANK()` assigns the same rank value of `5` to the tied rows but then skips to `9` for the next row because there are 8 rows before it.
- `dense_rank_no` - `DENSE_RANK()` also assigns the same rank of `5` to the tied values but then continues with `6`.
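
The difference among the three ordinal functions can be reproduced on a small list in plain Python. This sketch uses hypothetical values and a hypothetical `ordinal_ranks` helper, and it assumes the input is already sorted as the window's `ORDER BY` would sort it.

```python
# Hypothetical rounded totals, already sorted in descending order.
values = [900, 700, 500, 500, 500, 200]

def ordinal_ranks(sorted_values):
    """Return (value, row_number, rank, dense_rank) for each row."""
    out = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK advances by one per distinct value
        previous = value
        out.append((value, row_number, rank, dense))
    return out

ranks = ordinal_ranks(values)
for row in ranks:
    print(row)
```

In this sample the three tied rows share rank 3; the following row gets rank 6 under `RANK()` semantics but 4 under `DENSE_RANK()` semantics.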

Distribution ranks:
- `pct_rank` - `PERCENT_RANK()` calculates `(RANK() - 1) / (rows in partition - 1)`, which measures what percentage of values fall before the current value in the distribution.
- `cumulative_dist` - `CUME_DIST()` calculates the cumulative distribution. You can read it as: the value in this row is in the top X percent of the population.
- `ntile_val` - `NTILE(N)` calculates which population distribution bucket the row corresponds to, where N is the number of buckets. `NTILE(4)` calculates quartiles and `NTILE(100)` calculates percentiles.
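
The distribution functions can also be sketched in plain Python. The code below follows the common SQL definitions of these functions, which is an assumption; Druid's output may differ in edge cases. The sample values and the `distribution_ranks` helper are hypothetical.

```python
# Hypothetical rounded totals, already sorted in descending order.
sorted_totals = [900, 700, 500, 500, 500, 200]

def distribution_ranks(sorted_values, buckets):
    """Return (value, pct_rank, cume_dist, ntile) for each row."""
    n = len(sorted_values)
    base, extra = divmod(n, buckets)  # the first `extra` buckets get one extra row
    out = []
    for i, value in enumerate(sorted_values):
        rank = sorted_values.index(value) + 1            # position of the first tied row
        last_peer = n - sorted_values[::-1].index(value)  # position of the last tied row
        pct_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        cume_dist = last_peer / n
        if i < extra * (base + 1):
            ntile = i // (base + 1) + 1
        else:
            ntile = extra + (i - extra * (base + 1)) // base + 1
        out.append((value, pct_rank, cume_dist, ntile))
    return out

dist = distribution_ranks(sorted_totals, buckets=4)
for row in dist:
    print(row)
```

Tied rows share the same `pct_rank` and `cume_dist` values, while the bucket number depends only on the row position.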
   

## Value window functions
You can use the following functions to include values from other rows within the window for the current row.

In [None]:
query = """
SELECT FLOOR(__time TO HOUR) AS event_time,
    channel,
    SUM(delta) total_activity,
    LAG(SUM(delta), 1, NULL) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS lag_val,
    LEAD(SUM(delta), 1, NULL) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS lead_val,
    FIRST_VALUE(SUM(delta)) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS first_val,
    LAST_VALUE(SUM(delta)) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO HOUR) ASC) AS last_val
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, FLOOR(__time TO HOUR) 
"""

req = sql_client.sql_request(query)
req.add_context("enableWindowing", "true")
display.sql(req)

The `total_activity` column shows the net change in bytes for the `channel` by hour `event_time`. The calculated window function result columns are:

- `lag_val` - shows the value of `total_activity` from 1 row preceding the current row. The second parameter of the `LAG` function specifies how many rows back to look. For the first row in each partition, the prior value does not exist, so the function returns NULL.
- `lead_val` - shows the value of `total_activity` from 1 row following the current row. The second parameter of the `LEAD` function specifies how many rows ahead to look. For the last row in each partition, the following value does not exist, so the function returns NULL.
- `first_val` - shows the first value in the `channel` partition ordered by time
- `last_val` - shows the last value in the `channel` partition ordered by time
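
The column descriptions above map onto simple list indexing. The following plain-Python sketch assumes a single partition that is already ordered by time and assumes that the first and last values are taken over the whole partition, as the descriptions state. The sample values and the `value_functions` helper are hypothetical.

```python
# Hypothetical hourly totals for one channel partition, ordered by event_time.
total_activity = [40, -15, 22, 8]

def value_functions(values, offset=1, default=None):
    """Emulate LAG, LEAD, FIRST_VALUE, and LAST_VALUE for one partition.

    offset is the number of rows to look back or ahead (must be >= 0);
    default is returned when that row does not exist.
    """
    rows = []
    for i, current in enumerate(values):
        lag_val = values[i - offset] if i - offset >= 0 else default
        lead_val = values[i + offset] if i + offset < len(values) else default
        rows.append({
            "total_activity": current,
            "lag_val": lag_val,
            "lead_val": lead_val,
            "first_val": values[0],
            "last_val": values[-1],
        })
    return rows

value_rows = value_functions(total_activity)
for row in value_rows:
    print(row)
```

The first row has no `lag_val` and the last row has no `lead_val`, so both fall back to the default of `None`, which plays the role of NULL here.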

## Aggregate window functions
Many of the normal aggregate functions can also be used in the context of a window specification.
The following query shows an example of the most common ones:

In [None]:
query="""
SELECT FLOOR(__time TO DAY) AS event_time,
    channel,
    user,
    SUM(delta) total_activity,
    AVG(SUM(delta)) OVER (PARTITION BY channel ) AS avg_user_daily_change,
    MIN(SUM(delta)) OVER (PARTITION BY channel ) AS min_user_daily_change,
    MAX(SUM(delta)) OVER (PARTITION BY channel ) AS max_user_daily_change,
    SUM(SUM(delta)) OVER (PARTITION BY channel ) AS total_daily_change
FROM "example-wikipedia-windows"
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, FLOOR(__time TO DAY), user 
"""
req = sql_client.sql_request(query)
req.add_context("enableWindowing", "true")
display.sql(req)

## Cleanup
Run the following cell to remove the objects created within this notebook.

In [None]:
print(f"Drop datasource: [{druid.datasources.drop('example-wikipedia-windows')}]")

## Conclusion
Window functions can be extremely useful for analytics. 
They are available as an experimental feature in Apache Druid 28.0.0 and later.

