# Learn the basics of Druid window functions

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
[Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions) in Apache Druid produce values based upon the relationship of one row within a window of rows to the other rows within the same window. A window is a group of related rows within a result set, for example, rows with the same value for a specific dimension.

This tutorial uses Wikipedia data to demonstrate the basics of how to work with window functions in Druid.

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

> Note that window functions are an experimental feature.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable when it is set; otherwise default to localhost.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Load example data

The example queries compare the total delta value of Wikipedia change events per channel. For that reason, the source data only needs the `timestamp`, `channel`, and `delta` columns.

In [None]:
sql='''
REPLACE INTO "example_wikipedia_windows" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "delta" BIGINT))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "delta"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example_wikipedia_windows')
display.table('example_wikipedia_windows')

## Example window functions queries

When your data becomes available, you can run window function queries on it. The following query uses a window function with the SUM aggregation to evaluate the cumulative changes over time within a window.

The OVER clause defines the window:
- PARTITION BY specifies the dimension that defines the rows within the window.
- ORDER BY orders the rows in the window according to the `__time` column.

For simplicity's sake, the query is limited to two channels. You can experiment with adding more channels to the query.
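To make the roles of PARTITION BY and ORDER BY concrete, the following plain-Python sketch computes a running sum per channel. It does not call Druid. The sample rows and the `running_sum` helper are hypothetical and exist only for illustration, and the sketch assumes each timestamp appears at most once per channel, so it does not model how tied rows are framed.

```python
from collections import defaultdict

# Hypothetical sample rows: (minute, channel, changes). Not real Wikipedia data.
rows = [
    ("10:02", "#lt.wikipedia", 5),
    ("10:00", "#kk.wikipedia", 10),
    ("10:01", "#kk.wikipedia", 4),
    ("10:00", "#lt.wikipedia", 7),
    ("10:03", "#kk.wikipedia", 1),
]

def running_sum(rows):
    """Return each row extended with a per-channel cumulative sum in time order."""
    # PARTITION BY channel: group the rows that share a channel.
    partitions = defaultdict(list)
    for minute, channel, changes in rows:
        partitions[channel].append((minute, changes))

    result = []
    for channel in sorted(partitions):
        total = 0
        # ORDER BY time ASC: walk each partition in time order.
        for minute, changes in sorted(partitions[channel]):
            total += changes
            result.append((minute, channel, changes, total))
    return result

for row in running_sum(rows):
    print(row)
```

The total restarts for each channel, which is the effect of partitioning: rows from one channel never contribute to the cumulative sum of another.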


In [None]:
query = """
SELECT
    FLOOR(__time TO MINUTE) as "time",
    channel,
    ABS(delta) AS changes,
    sum(ABS(delta)) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO MINUTE) ASC) AS cum_changes
FROM example_wikipedia_windows
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, __time, delta
"""

req = sql_client.sql_request(query)
# Window functions are currently experimental. Set the enableWindowing
# context parameter to "true" to use them.
req.add_context("enableWindowing", "true")
display.sql(req)
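
For readers who want to see what a query context looks like outside of `druidapi`, here is a minimal sketch that builds a request body for Druid's SQL API, which accepts a JSON object with `query` and `context` fields at `/druid/v2/sql`. The helper names `build_sql_payload` and `post_sql` are hypothetical, and this is not a description of how `druidapi` works internally. `post_sql` is defined but not called, because it needs a running Druid cluster.

```python
import json
import urllib.request

def build_sql_payload(query, context=None):
    """Build the JSON body for a Druid SQL API request (hypothetical helper)."""
    payload = {"query": query}
    if context:
        # Query context parameters travel alongside the SQL text.
        payload["context"] = dict(context)
    return payload

def post_sql(druid_host, payload):
    """POST the payload to the SQL endpoint and return the decoded response.

    Not called in this sketch because it requires a running Druid cluster.
    """
    request = urllib.request.Request(
        f"{druid_host}/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

payload = build_sql_payload(
    "SELECT COUNT(*) FROM example_wikipedia_windows",
    context={"enableWindowing": "true"},
)
print(json.dumps(payload, indent=2))
```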


The following query illustrates all of the built-in window functions using the same data:

* ROW_NUMBER which returns the numeric value for the row within the window
* RANK which returns the numeric rank of the row within the window
* DENSE_RANK which returns the rank of the row with no gaps. For example, if two rows tie for 1, the next row is ranked 2
* PERCENT_RANK which returns the percentage rank of the row within the window according to the formula (rank - 1)/(total window rows - 1)
* CUME_DIST which returns the cumulative distribution of the current row within the window, calculated as (number of window rows at the same rank or higher than the current row) / (total window rows)
* NTILE which divides the rows in the window as evenly as possible into a given number of tiles and returns the tile number for the row
* LAG which returns the value for the row that precedes the current row by a given offset
* LEAD which returns the value for the row that follows the current row by a given offset
* FIRST_VALUE which returns the value for the expression for the first row within the window
* LAST_VALUE which returns the value for the expression for the last row within the window
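
The ranking formulas in the list above can be checked by hand with a small plain-Python sketch. It models a single partition ordered ascending and does not call Druid. The `ranking_functions` helper and the sample values are hypothetical.

```python
def ranking_functions(values):
    """Compute ranking window functions for one partition, ordered ascending."""
    ordered = sorted(values)
    n = len(ordered)
    results = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that equals no real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position  # RANK skips ahead to the row position after a tie
            dense += 1       # DENSE_RANK advances by exactly one
            previous = value
        # PERCENT_RANK: (rank - 1) / (total window rows - 1)
        pct_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        # CUME_DIST: rows ordered at or before the current row / total window rows
        cume_dist = sum(1 for v in ordered if v <= value) / n
        results.append({
            "value": value,
            "row_number": position,
            "rank": rank,
            "dense_rank": dense,
            "percent_rank": pct_rank,
            "cume_dist": cume_dist,
        })
    return results

for row in ranking_functions([20, 10, 40, 20]):
    print(row)
```

With the two tied values of 20, RANK yields 1, 2, 2, 4 while DENSE_RANK yields 1, 2, 2, 3, which shows the gap that DENSE_RANK avoids.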

See [Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions) for syntax and more detail.

The query also demonstrates how you can alias a window within the SELECT clause and define it later with the WINDOW keyword.

In [None]:
query = """
SELECT FLOOR(__time TO DAY) AS event_time,
    channel,
    ABS(delta) AS change,
    ROW_NUMBER() OVER w AS row_no,
    RANK() OVER w AS rank_no,
    DENSE_RANK() OVER w AS dense_rank_no,
    PERCENT_RANK() OVER w AS pct_rank,
    CUME_DIST() OVER w AS cumulative_dist,
    NTILE(4) OVER w AS ntile_val,
    LAG(ABS(delta), 1, 0) OVER w AS lag_val,
    LEAD(ABS(delta), 1, 0) OVER w AS lead_val,
    FIRST_VALUE(ABS(delta)) OVER w AS first_val,
    LAST_VALUE(ABS(delta)) OVER w AS last_val
FROM example_wikipedia_windows
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, ABS(delta), FLOOR(__time TO DAY) 
WINDOW w AS (PARTITION BY channel ORDER BY ABS(delta) ASC)
"""

req = sql_client.sql_request(query)
req.add_context("enableWindowing", "true")
display.sql(req)
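
To help interpret the `lag_val`, `lead_val`, and `ntile_val` columns, here is a plain-Python sketch of the same ideas for a single, already ordered partition. It does not call Druid. The helper functions and sample values are hypothetical, and the tile assignment assumes the common SQL convention that any leftover rows go to the lowest-numbered tiles, which this tutorial does not confirm for Druid.

```python
def lag(values, offset=1, default=None):
    """Value from the row `offset` positions before each row, else `default`."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """Value from the row `offset` positions after each row, else `default`."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

def ntile(row_count, tiles):
    """Assign tile numbers 1..tiles to `row_count` ordered rows as evenly as possible."""
    base, extra = divmod(row_count, tiles)
    assignment = []
    for tile in range(1, tiles + 1):
        # Assumption: the first `extra` tiles each receive one additional row.
        size = base + (1 if tile <= extra else 0)
        assignment.extend([tile] * size)
    return assignment

# Hypothetical ABS(delta) values for one channel, already ordered ascending.
ordered = [1, 4, 10, 12, 30]
print(lag(ordered, 1, 0))
print(lead(ordered, 1, 0))
print(ntile(len(ordered), 4))
```

The third argument plays the same role as the `0` in `LAG(ABS(delta), 1, 0)`: it fills in rows that have no neighbor at the requested offset.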