# Learn the basics of Druid window functions

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
[Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions) in Apache Druid produce values based on the relationship of one row to the other rows within the same window. A window is a group of related rows within a result set, for example, rows that share the same value for a specific dimension.
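
To make the idea of a window concrete, the following plain-Python sketch groups rows into windows by a dimension and assigns each row a position within its window, roughly what `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` does. The sample rows and the helper name `row_number_by_partition` are hypothetical and only illustrate the concept. This is not how Druid implements window functions.

```python
from collections import defaultdict

# Hypothetical sample rows; channel names and deltas are made up for illustration.
rows = [
    {"channel": "#a", "delta": 30},
    {"channel": "#b", "delta": 5},
    {"channel": "#a", "delta": 10},
    {"channel": "#b", "delta": 20},
    {"channel": "#a", "delta": 20},
]


def row_number_by_partition(rows, partition_key, order_key):
    """Mimic ROW_NUMBER() OVER (PARTITION BY partition_key ORDER BY order_key)."""
    # Step 1: split the result set into windows, one per partition value.
    windows = defaultdict(list)
    for row in rows:
        windows[row[partition_key]].append(row)

    # Step 2: order the rows inside each window and number them from 1.
    result = []
    for key in sorted(windows):
        ordered = sorted(windows[key], key=lambda r: r[order_key])
        for position, row in enumerate(ordered, start=1):
            result.append({**row, "row_no": position})
    return result


numbered = row_number_by_partition(rows, "channel", "delta")
for row in numbered:
    print(row)
```

Each row keeps its own values and gains one extra column computed from its neighbors in the same window. That is the key difference from a plain `GROUP BY`, which collapses each group into a single row.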

This tutorial uses Wikipedia data to demonstrate the basics of how to work with window functions in Druid.

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

In [7]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ.keys():
    druid_host="http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Opening a connection to http://host.docker.internal:8888.


'29.0.0-SNAPSHOT'

## Load example data

The example queries compare the total delta value of change events in Wikipedia per channel. For that reason, the source data only needs the timestamp, channel, and delta columns.

In [22]:
sql='''
REPLACE INTO "example_wikipedia_windows" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "delta" BIGINT))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "delta"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example_wikipedia_windows')
display.table('example_wikipedia_windows')

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:31<00:00,  3.16it/s]


Position,Name,Type
1,__time,TIMESTAMP
2,channel,VARCHAR
3,delta,BIGINT


### Query your data

Now you can query the data. Because this tutorial runs in Jupyter, keep result sets small, for example by filtering rows or adding a `LIMIT` clause. The following cell filters the data to two channels and uses `SUM` as a window function to compute a running total of absolute changes for each channel, ordered by minute.


In [None]:
query = """
SELECT
    FLOOR(__time TO MINUTE) as "time",
    channel,
    ABS(delta) AS changes,
    sum(ABS(delta)) OVER (PARTITION BY channel ORDER BY FLOOR(__time TO MINUTE) ASC) AS cum_changes
FROM example_wikipedia_windows
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, __time, delta
"""

req = sql_client.sql_request(query)
req.add_context("enableWindowing", "true")
display.sql(req)
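
As a rough illustration of what the `cum_changes` column represents, the following sketch computes a per-channel running total of absolute deltas in plain Python. The sample events and the helper name `cumulative_changes` are hypothetical. The sketch accumulates rows one at a time, so rows that share the same minute each get their own running value, and an SQL engine may treat such ties differently depending on the window frame.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical (channel, minute, delta) events; values are made up for illustration.
events = [
    ("#kk", "10:01", -4),
    ("#lt", "10:00", 3),
    ("#kk", "10:00", 6),
    ("#lt", "10:02", -8),
    ("#kk", "10:03", 2),
]


def cumulative_changes(events):
    """Return (channel, minute, changes, cum_changes) rows, one per input event."""
    # Sort by channel, then minute, so each partition is contiguous and ordered.
    ordered = sorted(events, key=itemgetter(0, 1))
    result = []
    for channel, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # The running total restarts for every channel (partition).
        totals = accumulate(abs(delta) for _, _, delta in group)
        for (_, minute, delta), total in zip(group, totals):
            result.append((channel, minute, abs(delta), total))
    return result


running = cumulative_changes(events)
for row in running:
    print(row)
```

The total restarts at the first row of every channel, which mirrors the effect of `PARTITION BY channel` in the query.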


In [27]:
query = """
SELECT FLOOR(__time TO DAY) AS event_time,
    channel,
    ABS(delta) AS change,
    ROW_NUMBER() OVER w AS row_no,
    RANK() OVER w AS rank_no,
    DENSE_RANK() OVER w AS dense_rank_no,
    PERCENT_RANK() OVER w AS pct_rank,
    CUME_DIST() OVER w AS cumulative_dist,
    NTILE(4) OVER w AS ntile_val,
    LAG(ABS(delta), 1, 0) OVER w AS lag_val,
    LEAD(ABS(delta), 1, 0) OVER w AS lead_val,
    FIRST_VALUE(ABS(delta)) OVER w AS first_val,
    LAST_VALUE(ABS(delta)) OVER w AS last_val
FROM example_wikipedia_windows
WHERE channel IN ('#kk.wikipedia', '#lt.wikipedia')
GROUP BY channel, ABS(delta), FLOOR(__time TO DAY) 
WINDOW w AS (PARTITION BY channel ORDER BY ABS(delta) ASC)
"""

req = sql_client.sql_request(query)
req.add_context("enableWindowing", "true")
display.sql(req)

event_time,channel,change,row_no,rank_no,dense_rank_no,pct_rank,cumulative_dist,ntile_val,lag_val,lead_val,first_val,last_val
2016-06-27T00:00:00.000Z,#kk.wikipedia,1,1,1,1,0.0,0.125,1,,7.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,7,2,2,2,0.1428571428571428,0.25,1,1.0,56.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,56,3,3,3,0.2857142857142857,0.375,2,7.0,63.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,63,4,4,4,0.4285714285714285,0.5,2,56.0,91.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,91,5,5,5,0.5714285714285714,0.625,3,63.0,2440.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,2440,6,6,6,0.7142857142857143,0.75,3,91.0,2703.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,2703,7,7,7,0.8571428571428571,0.875,4,2440.0,6900.0,1,6900
2016-06-27T00:00:00.000Z,#kk.wikipedia,6900,8,8,8,1.0,1.0,4,2703.0,,1,6900
2016-06-27T00:00:00.000Z,#lt.wikipedia,1,1,1,1,0.0,0.1,1,,2.0,1,4358
2016-06-27T00:00:00.000Z,#lt.wikipedia,2,2,2,2,0.1111111111111111,0.2,1,1.0,13.0,1,4358
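
To help read the ranking columns in this output, the following sketch recomputes them in plain Python for a single window, using the conventional SQL definitions: `PERCENT_RANK` is `(rank - 1) / (rows - 1)`, `CUME_DIST` is the fraction of rows with a value less than or equal to the current one, and `NTILE(n)` splits the ordered rows into `n` buckets that are as even as possible. The helper name `ranking_columns` is hypothetical, and the sketch assumes Druid follows these standard definitions, which agrees with the `#kk.wikipedia` rows shown above.

```python
from bisect import bisect_right


def ranking_columns(values, buckets=4):
    """Compute ranking-style window columns for one window, ordered ascending."""
    ordered = sorted(values)
    n = len(ordered)
    # NTILE: the first `extra` buckets hold one more row than the rest.
    base, extra = divmod(n, buckets)
    rows = []
    rank = dense_rank = 0
    for index, value in enumerate(ordered):
        # Tied values share a rank; RANK leaves gaps after ties, DENSE_RANK does not.
        if index == 0 or value != ordered[index - 1]:
            rank = index + 1
            dense_rank += 1
        if index < extra * (base + 1):
            ntile = index // (base + 1) + 1
        else:
            ntile = extra + (index - extra * (base + 1)) // base + 1
        rows.append({
            "value": value,
            "row_no": index + 1,
            "rank": rank,
            "dense_rank": dense_rank,
            "pct_rank": (rank - 1) / (n - 1) if n > 1 else 0.0,
            "cume_dist": bisect_right(ordered, value) / n,
            "ntile": ntile,
        })
    return rows


# The `change` values of the #kk.wikipedia window from the output above.
kk_changes = [1, 7, 56, 63, 91, 2440, 2703, 6900]
for row in ranking_columns(kk_changes):
    print(row)
```

Because every `change` value in the `#kk.wikipedia` window is distinct, `row_no`, `rank_no`, and `dense_rank_no` coincide in the output. They only diverge when the ordering column contains ties.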


In addition to the query, you can define other properties in the request payload. For a full list, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).

This tutorial uses a context parameter and a dynamic parameter.

Context parameters control certain characteristics of a query, such as a custom timeout. For more information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. A custom ID makes the query easier to reference, for example when you need to cancel it.


Druid supports dynamic parameters, so you can either define parameter values explicitly within the query or insert a `?` as a placeholder and define the value in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).


The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`.

In [None]:
import json
import requests

endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

# Query the table created earlier in this tutorial.
payload = json.dumps({
  "query": "SELECT * FROM example_wikipedia_windows WHERE __time > ? LIMIT 1",
  "context": {
    "sqlQueryId" : "important-query" 
    },
  "parameters": [
    { "type": "TIMESTAMP", "value": "2016-06-27"}
  ]
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))
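
If you build such requests often, it can help to wrap the payload construction in small helpers. The following is a minimal sketch with no network calls. The function names `build_sql_payload` and `cancel_url` are hypothetical and not part of any Druid client library. The cancel URL assumes the SQL API's `DELETE /druid/v2/sql/{sqlQueryId}` endpoint, so check the Druid SQL API documentation for your version before relying on it.

```python
import json
from urllib.parse import quote


def build_sql_payload(query, parameters=None, query_id=None):
    """Serialize a Druid SQL API request body as a JSON string."""
    payload = {"query": query}
    if query_id is not None:
        # A custom sqlQueryId makes the query easier to reference later.
        payload["context"] = {"sqlQueryId": query_id}
    if parameters:
        # Each (type, value) pair is bound to one `?` placeholder, in order.
        payload["parameters"] = [
            {"type": sql_type, "value": value} for sql_type, value in parameters
        ]
    return json.dumps(payload)


def cancel_url(druid_host, query_id):
    """Build the URL to send a DELETE request to when canceling a query (assumed endpoint)."""
    return f"{druid_host}/druid/v2/sql/{quote(query_id, safe='')}"


body = build_sql_payload(
    "SELECT * FROM example_wikipedia_windows WHERE __time > ? LIMIT 1",
    parameters=[("TIMESTAMP", "2016-06-27")],
    query_id="important-query",
)
print(body)
print(cancel_url("http://localhost:8888", "important-query"))
```

Keeping the parameters as ordered pairs mirrors how the placeholders are bound: the first entry replaces the first `?`, the second entry replaces the second, and so on.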

## Next steps

This tutorial covers some of the basics of window functions in Druid, along with query options you can pass through the Druid SQL API. To learn more, see the documentation:

- [Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions)
- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)
- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)

You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.



