# Array data types in Druid
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

In this notebook you will find examples of how to work with the array data type in Druid: constructing arrays at query time and ingestion time using [scalar](https://druid.apache.org/docs/latest/querying/sql-array-functions) and [aggregation](https://druid.apache.org/docs/latest/querying/sql-aggregations) functions, and expanding them with UNNEST.

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Constructing arrays

Run the following cell to create a table, `example-koalas-arrays-1`.

Only specific source fields are ingested that you will use in later examples.

* The `language` field contains [multi-value strings](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings). In the ingestion, this is renamed to `language-mvd`.
* NULL replaces the value "N/A" in the `timezone` field.
* A field `percentClear` is added that records specific stages of the user's journey through the website.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-1" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  CASE WHEN "timezone" = 'N/A' THEN NULL ELSE "timezone" END AS "timezone",
  "browser",
  "language" AS "language-mvd",
  "session",
  "session_length",
  CASE WHEN ("event_type" = 'PercentClear') THEN
    (CASE WHEN ("event_subtype" = '') THEN 0 ELSE "event_subtype" END)
    ELSE NULL END AS "percentClear"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-arrays-1')
display.table('example-koalas-arrays-1')

### Construct an array with functions

The ARRAY constructor creates an array literal. Run the following cell to see an example of this in action.

* `staticArray` is a simple array of constants.
* The `tags` column contains the browser and timezone of each row.

In [None]:
sql='''
SELECT
  ARRAY['en-GB','fr'] AS "staticArray",
  ARRAY["browser","timezone"] AS "tags"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T14/PT2S')
'''

display.sql(sql)

The example data set contains a set of language tags for each row, stored as a [multi-value string](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings) in `language-mvd`.

In the following cell, an MV_TO_ARRAY function call is added to take the list of language tags and convert it to an array.

In [None]:
sql='''
SELECT
  ARRAY['en-GB','fr'] AS "staticArray",
  ARRAY["browser","timezone"] AS "tags",
  MV_TO_ARRAY("language-mvd") AS "tags-languages"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T14/PT2S')
'''

display.sql(sql)

Using ARRAY_CONCAT, two arrays can be concatenated together.

Run the next cell to see ARRAY_CONCAT extend `tags` to include all the tags from `language-mvd`, and also add an arbitrary array of browsers to the browser in each row.

In [None]:
sql='''
SELECT
  ARRAY_CONCAT(ARRAY['Chrome','Chrome Mobile'],ARRAY["browser"]) AS "tags-browsers-withChrome",
  ARRAY_CONCAT(ARRAY["browser","timezone"],MV_TO_ARRAY("language-mvd")) AS "tags"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T14/PT2S')
'''

display.sql(sql)

Using the ARRAY_PREPEND and ARRAY_APPEND functions, elements can be added to an array.

Running the following cell results in a number of combinations of the arrays from the previous examples.

* ARRAY_PREPEND adds Vogon (`vog`) to the beginning of all language tags.
* ARRAY_APPEND adds Klingon (`tlh`) to the end of all language tags.

In [None]:
sql='''
SELECT
  ARRAY_PREPEND('vog',MV_TO_ARRAY("language-mvd")) AS "everything-is-vogon-poetry",
  ARRAY_APPEND(MV_TO_ARRAY("language-mvd"),'tlh') AS "everything-is-klingon-opera"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T14/PT2S')
'''

display.sql(sql)
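
As a rough mental model only, the scalar functions in the cells above behave much like list operations in plain Python. The sketch below is an analogy rather than Druid code: the helper names (`array_concat`, `array_prepend`, `array_append`) and the sample values are made up for illustration.

```python
# Plain-Python analogy for the scalar array functions; not Druid code.
# Helper names and sample values are hypothetical.

def array_concat(arr1, arr2):
    """Like ARRAY_CONCAT: a new array with arr2's elements after arr1's."""
    return list(arr1) + list(arr2)

def array_prepend(element, arr):
    """Like ARRAY_PREPEND: a new array with element added at the start."""
    return [element] + list(arr)

def array_append(arr, element):
    """Like ARRAY_APPEND: a new array with element added at the end."""
    return list(arr) + [element]

languages = ["en-GB", "fr"]  # stands in for MV_TO_ARRAY("language-mvd")
tags = array_concat(["Chrome", "Europe/London"], languages)

print(tags)
print(array_prepend("vog", languages))
print(array_append(languages, "tlh"))
```

Each helper returns a new list and leaves its inputs untouched, which mirrors how the SQL functions produce a new value for each row rather than modifying stored data.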

### Construct an array with aggregators using GROUP BY

Arrays can be constructed from groups of rows by using the aggregators ARRAY_AGG and ARRAY_CONCAT_AGG.

In the following cell, a complete list of all browser tags is created in 10-second buckets using a roll-up style query for a minute of the data.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT10S') AS "time-bucket",
  ARRAY_AGG("browser") AS "tags-browsers"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT1M')
GROUP BY 1
'''

display.sql(sql)

As the size of the time window increases, more elements will be added to the array as the number of rows to be grouped increases.

To illustrate this, in the following cell the time floor has been increased to 1 minute (`PT1M`), and an optional second argument has been added to ARRAY_AGG to raise the maximum size of the aggregation, in bytes, so that larger arrays can be constructed.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1M') AS "time-bucket",
  ARRAY_AGG("browser", 65535) AS "tags-browsers"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT2M')
GROUP BY 1
'''

display.sql(sql)

In order to only return distinct values when the arrays are constructed, the following SQL contains the DISTINCT keyword.

This also means that the size parameter can be removed.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1M') AS "time-bucket",
  ARRAY_AGG(DISTINCT "browser") AS "tags-browsers"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT1M')
GROUP BY 1
'''

display.sql(sql)

To combine arrays, use the ARRAY_CONCAT_AGG function.

In the following example, the `language-mvd` field is parsed as an array for each row. The ARRAY_CONCAT_AGG function comes into play as each group of rows is combined. The DISTINCT clause ensures that, after this aggregation, only the unique values are returned as `tags-languages`.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1M') AS "time-bucket",
  ARRAY_CONCAT_AGG(DISTINCT MV_TO_ARRAY("language-mvd")) AS "tags-languages"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT5M')
GROUP BY 1
'''

display.sql(sql)
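
To make the difference between the two aggregators concrete, here is a small plain-Python sketch of what happens within each time bucket. It is an analogy under assumed sample data, not Druid code: the rows and the `distinct` helper are hypothetical, and the element order Druid returns for DISTINCT may differ from the first-seen order used here.

```python
from collections import defaultdict

# Hypothetical rows: (time bucket, browser, language tags).
rows = [
    ("10:00", "Chrome", ["en-GB", "en"]),
    ("10:00", "Chrome", ["fr"]),
    ("10:00", "Firefox", ["en"]),
    ("10:01", "Safari", ["de"]),
]

def distinct(values):
    """Drop repeats, keeping first-seen order for readability."""
    return list(dict.fromkeys(values))

array_agg = defaultdict(list)         # like ARRAY_AGG("browser")
array_concat_agg = defaultdict(list)  # like ARRAY_CONCAT_AGG(MV_TO_ARRAY("language-mvd"))

for bucket, browser, languages in rows:
    array_agg[bucket].append(browser)           # one element per row
    array_concat_agg[bucket].extend(languages)  # every element of every row's array

for bucket in array_agg:
    print(bucket,
          array_agg[bucket],
          distinct(array_agg[bucket]),
          distinct(array_concat_agg[bucket]))
```

ARRAY_AGG adds one element per row, whereas ARRAY_CONCAT_AGG flattens the arrays from every row in the group into a single array.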

These aggregating functions, ARRAY_AGG and ARRAY_CONCAT_AGG, can be used in combination with the scalar functions from previous examples.

In the following cell notice:

* Aggregators:
  * ARRAY_AGG produces an array of browsers for each group of rows.
  * ARRAY_CONCAT_AGG produces a distinct list of languages from the array version of the `language-mvd` data for each group of rows.
* Functions:
  * ARRAY_CONCAT merges two arrays and produces another array containing all tags.
  * ARRAY_APPEND ensures the ubiquitous presence of Vogon poetry by taking the array from above and adding the `vog` element.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1M') AS "time-bucket",
  ARRAY_APPEND(
    ARRAY_CONCAT(
      ARRAY_AGG(DISTINCT "browser"),
      ARRAY_CONCAT_AGG(DISTINCT MV_TO_ARRAY("language-mvd"))
      ),
    'vog') AS "tags"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT5M')
GROUP BY 1
'''

display.sql(sql)

Arrays can also be constructed to contain objects or indeed other arrays.

In the example below, the JSON_OBJECT function is used to produce an object that is then aggregated into an array via a GROUP BY.

In [None]:
sql='''
SELECT
  "session",
  ARRAY_AGG(JSON_OBJECT(KEY 'PercentClear' VALUE "percentClear", KEY 'timestamp' VALUE TIME_FORMAT("__time"))) AS "journey-object",
  ARRAY_AGG(ARRAY["percentClear", TIME_FORMAT("__time")], 65535) AS "journey-array"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT10S')
AND "percentClear" IS NOT NULL
GROUP BY "session"
'''

display.sql(sql)
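
The two result shapes in the query above can be pictured with plain Python structures. This is only an illustrative sketch: the events are invented, and keeping the percentage as a string inside the nested array is an assumption made here so that every element of the inner array shares one type.

```python
import json
from collections import defaultdict

# Hypothetical events: (session, percent clear, formatted timestamp).
events = [
    ("session-a", 10, "2019-08-25T10:00:01.000Z"),
    ("session-a", 20, "2019-08-25T10:00:04.000Z"),
    ("session-b", 10, "2019-08-25T10:00:02.000Z"),
]

journey_object = defaultdict(list)  # array of objects, one object per event
journey_array = defaultdict(list)   # array of arrays, one inner array per event

for session, percent_clear, timestamp in events:
    journey_object[session].append(
        {"PercentClear": percent_clear, "timestamp": timestamp}
    )
    # Assumption: the percentage is stringified so the inner array is single-typed.
    journey_array[session].append([str(percent_clear), timestamp])

print(json.dumps(journey_object["session-a"]))
print(json.dumps(journey_array["session-a"]))
```

The object form keeps named keys, while the nested-array form relies on position: element one is the percentage and element two is the timestamp.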

## Construct an array at ingestion time



### Create arrays from existing data

Handle arrays in source data using the MV_TO_ARRAY and STRING_TO_ARRAY functions as part of the ingestion SQL statement.

Run the following cell to create a new table, `example-koalas-arrays-2`.

For each row in the source data:

* The timestamp, session, browser, timezone, and session length are ingested as-is into individual dimensions.
* The incoming multi-value string dimension is transformed into an array as `tags-languages`.
* Two new columns, `journey-percent` and `journey-timestamp`, record how far through their journey a visitor was at that point in time.
* From these two new columns, `journey-array` is constructed, holding a key-value pair of the data.

Notice that the `arrayIngestMode` [context parameter](https://druid.apache.org/docs/latest/querying/sql-data-types/#arrays) has been set to `array`.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-2" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "session",
  "browser",
  "timezone",
  "session_length",
  MV_TO_ARRAY("language") AS "tags-languages",
  CASE WHEN ("event_type" = 'PercentClear') THEN
    (CASE WHEN
      ("event_subtype" = '') THEN 0
      ELSE CAST("event_subtype" AS DECIMAL)
      END)
    ELSE NULL END AS "journey-percent",
  CASE WHEN ("event_type" = 'PercentClear') THEN TIME_PARSE("timestamp")
    ELSE NULL END AS "journey-timestamp",
  CASE WHEN ("event_type" = 'PercentClear') THEN
    (CASE WHEN
      ("event_subtype" = '') THEN ARRAY[0, TIME_PARSE("timestamp")]
      ELSE ARRAY[CAST("event_subtype" AS DECIMAL), TIME_PARSE("timestamp")]
      END)
    ELSE NULL END AS "journey-array"
FROM "ext"
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-arrays-2')
display.table('example-koalas-arrays-2')

Run the following cell to see a sample of the data.

Notice that `journey-timestamp` is a true secondary timestamp, so it is stored as milliseconds since the Unix epoch. So that all elements of the `journey-array` array are of the same data type, the CAST function has been used on the journey stage to ensure it is a DECIMAL.

In [None]:
sql='''
SELECT *
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT5S')
'''

display.sql(sql)
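
If the epoch-millisecond values in the output are hard to read, the conversion can be reproduced outside Druid. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the sample timestamp is chosen simply because it falls within the example data's date.

```python
from datetime import datetime, timezone

def to_epoch_millis(iso_timestamp):
    """Convert an ISO 8601 timestamp with an explicit UTC offset to epoch milliseconds."""
    parsed = datetime.fromisoformat(iso_timestamp)
    return int(parsed.timestamp() * 1000)

def from_epoch_millis(millis):
    """Convert epoch milliseconds back to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

millis = to_epoch_millis("2019-08-25T10:00:00+00:00")
print(millis)
print(from_epoch_millis(millis).isoformat())
```

Within Druid SQL itself, MILLIS_TO_TIMESTAMP performs the same conversion from milliseconds back to a timestamp.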

### Create an array at ingestion time using rollup (GROUP BY)

Run the following cell to create a table called `example-koalas-arrays-3`.

Note the addition of the `finalizeAggregations` parameter as part of this request.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-3" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(EXTERN('{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', '{"type":"json"}')) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR)
)
SELECT
  TIME_FLOOR(TIME_PARSE("timestamp"), 'PT5S') AS "__time",
  "session",
  ARRAY_AGG(DISTINCT "browser", 65535) AS "tags-browsers",
  ARRAY_AGG(DISTINCT CASE WHEN "timezone" = 'N/A' THEN NULL ELSE "timezone" END, 65535) AS "tags-timezones",
  ARRAY_CONCAT_AGG(DISTINCT MV_TO_ARRAY("language"), 65535) AS "tags-languages",
  ARRAY_CONCAT_AGG(
    CASE WHEN ("event_type" = 'PercentClear') THEN (
      CASE WHEN ("event_subtype" = '') THEN ARRAY[0, TIME_PARSE("timestamp")]
      ELSE ARRAY[CAST("event_subtype" AS DECIMAL), TIME_PARSE("timestamp")]
      END)
    ELSE NULL END, 65535) AS "journey-array"
FROM "ext"
GROUP BY 1, 2
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")
req.add_context("finalizeAggregations", "true")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-arrays-3')
display.table('example-koalas-arrays-3')

In [None]:
sql='''
SELECT *
FROM "example-koalas-arrays-3"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT5S')
'''

display.sql(sql)

## Determine array size

By using ARRAY_LENGTH it's possible to count the number of elements in the array.

In the next cell, a query returns the average number of languages over a particular time period.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT15M') AS "period",
  AVG(ARRAY_LENGTH("tags-languages")) AS "average_languages"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT1H')
GROUP BY 1
'''

display.sql(sql)

## Filter results using arrays

The ARRAY_CONTAINS function tests for the presence of either an element or another array.

Run the following cell to determine how many visitors completed 20%, 50%, and 90% of their journey within a day of recorded site traffic. To achieve this, FILTER (WHERE) is appended to the COUNT function, and ARRAY_CONTAINS is used to test for a specific array element.

An additional filter in the WHERE clause narrows down the rows for the calculation to French _and_ British English visitors. This uses the ARRAY_CONTAINS function against another array, seeking out rows that contain both values in the second array.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1H') AS "period",
  COUNT(*) FILTER (WHERE ARRAY_CONTAINS("journey-array",'20')) AS "passed-20",
  COUNT(*) FILTER (WHERE ARRAY_CONTAINS("journey-array",'50')) AS "passed-50",
  COUNT(*) FILTER (WHERE ARRAY_CONTAINS("journey-array",'90')) AS "passed-90"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/P1D')
AND ARRAY_CONTAINS("tags-languages",ARRAY['en-GB','fr-FR'])
GROUP BY 1
'''

display.sql(sql)

The variation below uses ARRAY_OVERLAP. This time, results include French _or_ British English visitors.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1H') AS "period",
  COUNT(*) FILTER (WHERE ARRAY_CONTAINS("journey-array",'20')) AS "passed-20",
  COUNT(*) FILTER (WHERE ARRAY_CONTAINS("journey-array",'50')) AS "passed-50",
  COUNT(*) FILTER (WHERE ARRAY_CONTAINS("journey-array",'90')) AS "passed-90"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/P1D')
AND ARRAY_OVERLAP("tags-languages",ARRAY['en-GB','fr-FR'])
GROUP BY 1
'''

display.sql(sql)
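
The "and" versus "or" distinction between the two filters can be sketched with Python sets. This is an analogy with hypothetical helper names and sample visitors, not a description of how Druid evaluates the functions.

```python
def array_contains(arr, wanted):
    """Like ARRAY_CONTAINS with an array argument: True if every wanted element is in arr."""
    return set(wanted).issubset(arr)

def array_overlap(arr, wanted):
    """Like ARRAY_OVERLAP: True if arr and wanted share at least one element."""
    return not set(wanted).isdisjoint(arr)

wanted = ["en-GB", "fr-FR"]

# Hypothetical language arrays for three visitors.
visitor_a = ["en-GB", "fr-FR", "en"]
visitor_b = ["fr-FR"]
visitor_c = ["de-DE"]

for name, languages in [("a", visitor_a), ("b", visitor_b), ("c", visitor_c)]:
    print(name, array_contains(languages, wanted), array_overlap(languages, wanted))
```

Visitor `a` passes both filters, visitor `b` passes only the overlap filter, and visitor `c` passes neither.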

## Find elements in an array

ARRAY_ORDINAL and ARRAY_OFFSET return values from specific positions in the array.

In the SQL below, the timestamps of all events where a visitor reached 10% through their journey are returned.

In [None]:
sql='''
SELECT
  ARRAY_ORDINAL("journey-array",2) AS "timestamp"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T04/PT5S')
AND ARRAY_ORDINAL("journey-array",1) = 10
'''

display.sql(sql)
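
The only difference between the two lookup functions is where counting starts: ARRAY_ORDINAL is 1-based and ARRAY_OFFSET is 0-based. A minimal plain-Python sketch, with hypothetical helper names and a made-up key-value pair, shows the relationship.

```python
def array_offset(arr, index):
    """Like ARRAY_OFFSET: 0-based lookup, None when the index is out of range."""
    return arr[index] if 0 <= index < len(arr) else None

def array_ordinal(arr, position):
    """Like ARRAY_ORDINAL: 1-based lookup, None when the position is out of range."""
    return array_offset(arr, position - 1)

# Hypothetical pair: [percent through journey, epoch-millisecond timestamp].
journey_pair = [10.0, 1566705600000]

print(array_ordinal(journey_pair, 1), array_offset(journey_pair, 0))  # same element
print(array_ordinal(journey_pair, 2), array_offset(journey_pair, 1))  # same element
print(array_ordinal(journey_pair, 3))                                  # out of range
```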

By aggregating `journey-percent` and `journey-timestamp` into arrays on a per-session basis using ARRAY_AGG, it's possible to build up a visitor's journey during a specific period of data.

The following cells build up an advanced query that will return the maximum time taken for visitors to complete four key stages: zero to 25, 25 to 50, 50 to 75, and 75 to 95.

The first query builds a view of all the journey points on a session-by-session basis. Two arrays are constructed: `clear-percent-array` and `clear-timestamp-array`. In effect, the first array acts as an index to the second array, giving the time when that stage in the journey was reached.

This information is stored in a temporary table, `journey-snapshot`.

Run the cell below to see a sample of this result set.

Note that the timestamp array contains secondary timestamps, which are represented as milliseconds since the Unix epoch.

In [None]:
sql='''
WITH "journey-snapshot" AS (
SELECT
   "session",
   ARRAY_AGG("journey-percent", 5120) AS "clear-percent-array",
   ARRAY_AGG("journey-timestamp", 5120) AS "clear-timestamp-array"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T04/PT5S')
AND "journey-array" IS NOT NULL
GROUP BY "session"
)

SELECT
  *
FROM "journey-snapshot"
'''

display.sql(sql)

The next cell uses these arrays with ARRAY_ORDINAL_OF and ARRAY_ORDINAL.

First, ARRAY_ORDINAL_OF attempts to return the position of a zero-percent clear event in the array.

```sql
ARRAY_ORDINAL_OF("clear-percent-array",'0')
```

This position is then used in ARRAY_ORDINAL on the `clear-timestamp-array` to find the timestamp.

```sql
ARRAY_ORDINAL("clear-timestamp-array", ... )
```

The result of this function call is wrapped in MILLIS_TO_TIMESTAMP so that a proper timestamp is returned in results.

Each of these sets of functions is wrapped in a CASE statement to catch situations where the source data, `journey-snapshot`, does not contain an event with the percent-clear event we are looking for.

Run the cell below to see a sample.

In [None]:
sql='''
WITH "journey-snapshot" AS (
SELECT
   "session",
   ARRAY_AGG("journey-percent", 5120) AS "clear-percent-array",
   ARRAY_AGG("journey-timestamp", 5120) AS "clear-timestamp-array"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT5S')
AND "journey-array" IS NOT NULL
GROUP BY "session"
)

SELECT
  "session",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'0') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'0')))
      ELSE NULL END AS "time-0",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'25') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'25')))
      ELSE NULL END AS "time-25",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'50') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'50')))
      ELSE NULL END AS "time-50",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'75') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'75')))
      ELSE NULL END AS "time-75",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'95') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'95')))
      ELSE NULL END AS "time-95"
FROM "journey-snapshot"
'''

display.sql(sql)
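
The "index array plus value array" lookup used in this query can be sketched in plain Python. The helper names and sample arrays below are hypothetical, and the sketch uses `None` to mean "not found"; check the Druid documentation for what ARRAY_ORDINAL_OF returns in that case on your version and configuration.

```python
def array_ordinal_of(arr, value):
    """Like ARRAY_ORDINAL_OF: 1-based position of the first match, None if absent."""
    for position, element in enumerate(arr, start=1):
        if element == value:
            return position
    return None

def time_at_stage(percents, timestamps, stage):
    """Return the timestamp recorded when `stage` was first reached, or None if it never was."""
    position = array_ordinal_of(percents, stage)
    if position is None:
        return None  # plays the role of the CASE ... ELSE NULL guard
    return timestamps[position - 1]

# Hypothetical parallel arrays for one session.
clear_percent_array = [0, 25, 50]
clear_timestamp_array = [1000, 4000, 9000]

print(time_at_stage(clear_percent_array, clear_timestamp_array, 25))
print(time_at_stage(clear_percent_array, clear_timestamp_array, 75))
```

The lookup only works because both arrays were built from the same rows in the same order, so position `n` in one array describes the same event as position `n` in the other.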

This query is now wrapped up into another temporary table, `journey-timestamps`.

A final SQL statement then addresses that table.

* The temporary table, `journey-timestamps`, is JOINed to the original table, `example-koalas-arrays-2`.
* TIMESTAMPDIFF calculates the number of seconds between each of the timestamps and returns them as `to25`, `to50`, and so on.
* MAX is wrapped around TIMESTAMPDIFF to return the longest period of time.
* GROUP BY gives statistics per `browser`.

> When you run this cell, the `journey-snapshot` table will cover a longer period. Depending on the amount of memory you have available for your learning environment, you may need to reduce the time period in the TIME_IN_INTERVAL filter from `PT6H` to something smaller, like `PT1H` for just one hour.

In [None]:
sql='''
WITH "journey-snapshot" AS (
SELECT
   "session",
   ARRAY_AGG("journey-percent", 5120) AS "clear-percent-array",
   ARRAY_AGG("journey-timestamp", 5120) AS "clear-timestamp-array"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT6H')
AND "journey-array" IS NOT NULL
GROUP BY "session"
),

"journey-timestamps" AS (
SELECT
  "session",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'0') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'0')))
      ELSE NULL END AS "time-0",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'25') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'25')))
      ELSE NULL END AS "time-25",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'50') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'50')))
      ELSE NULL END AS "time-50",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'75') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'75')))
      ELSE NULL END AS "time-75",
  CASE WHEN
    ARRAY_ORDINAL_OF("clear-percent-array",'95') > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("clear-timestamp-array",ARRAY_ORDINAL_OF("clear-percent-array",'95')))
      ELSE NULL END AS "time-95"
FROM "journey-snapshot"
)

SELECT
  t."browser",
  MAX(TIMESTAMPDIFF(SECOND,"time-0","time-25")) AS "to25",
  MAX(TIMESTAMPDIFF(SECOND,"time-25","time-50")) AS "to50",
  MAX(TIMESTAMPDIFF(SECOND,"time-50","time-75")) AS "to75",
  MAX(TIMESTAMPDIFF(SECOND,"time-75","time-95")) AS "to95"
FROM "example-koalas-arrays-2" t
LEFT JOIN "journey-timestamps" jt ON jt."session" = t."session"
GROUP BY 1
'''

display.sql(sql)
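
The final aggregation step can be pictured as follows. This plain-Python sketch uses invented per-session stage times and hypothetical helper names; it only illustrates the idea of taking the longest gap between two stages per browser while skipping sessions that never reached a stage.

```python
from collections import defaultdict

# Hypothetical per-session stage times in epoch milliseconds; None marks a stage never reached.
sessions = [
    {"browser": "Chrome", "time-0": 0, "time-25": 4_000, "time-50": 10_000},
    {"browser": "Chrome", "time-0": 1_000, "time-25": 9_000, "time-50": None},
    {"browser": "Firefox", "time-0": 2_000, "time-25": None, "time-50": None},
]

def seconds_between(start, end):
    """Whole seconds from start to end, or None if either side is missing."""
    if start is None or end is None:
        return None
    return (end - start) // 1000

def max_ignoring_none(values):
    """Largest value present, or None when every value is missing."""
    present = [v for v in values if v is not None]
    return max(present) if present else None

by_browser = defaultdict(list)
for session in sessions:
    by_browser[session["browser"]].append(session)

to25 = {browser: max_ignoring_none(seconds_between(s["time-0"], s["time-25"]) for s in group)
        for browser, group in by_browser.items()}
to50 = {browser: max_ignoring_none(seconds_between(s["time-25"], s["time-50"]) for s in group)
        for browser, group in by_browser.items()}

print(to25)
print(to50)
```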

## Join to an array with UNNEST

By using UNNEST, an array can be expanded to a temporary table that can then be joined to another datasource.

Recall that `tags-languages` contains an array of language codes.

In [None]:
sql='''
SELECT
  "tags-languages",
  "browser",
  "session_length"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT2S')
'''

display.sql(sql)

To return a row for each language for each `browser` and each `session_length`, use UNNEST with a JOIN.

Run the following cell to see this in action.

In [None]:
sql='''
SELECT
  la."language",
  "browser",
  "session_length"
FROM "example-koalas-arrays-2"
CROSS JOIN UNNEST("tags-languages") AS la("language")
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT2S')
'''

display.sql(sql)

This data can then be used in a GROUP BY to generate metrics on a per-language basis.

In [None]:
sql='''
SELECT
  la."language",
  MIN("session_length") AS "shortest-session",
  MAX("session_length") AS "longest-session",
  COUNT(DISTINCT "session") AS "unique-sessions"
FROM "example-koalas-arrays-2"
CROSS JOIN UNNEST("tags-languages") AS la("language")
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT2S')
GROUP BY 1
'''

display.sql(sql)
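
Conceptually, the cross join with UNNEST turns one row holding an array into one row per array element, after which ordinary grouping applies. The sketch below shows that idea in plain Python with invented rows; in this sketch a row whose array is empty contributes no output rows.

```python
# Hypothetical rows, each holding an array of language codes.
rows = [
    {"browser": "Chrome", "session_length": 120, "languages": ["en-GB", "fr"]},
    {"browser": "Safari", "session_length": 45, "languages": ["fr"]},
    {"browser": "Firefox", "session_length": 300, "languages": []},
]

# Expand: one output row per (row, array element) pair.
unnested = [
    {"language": language, "browser": row["browser"], "session_length": row["session_length"]}
    for row in rows
    for language in row["languages"]
]

# Group: shortest and longest session per language.
per_language = {}
for row in unnested:
    shortest, longest = per_language.get(
        row["language"], (row["session_length"], row["session_length"])
    )
    per_language[row["language"]] = (
        min(shortest, row["session_length"]),
        max(longest, row["session_length"]),
    )

print(unnested)
print(per_language)
```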

## Clean up

Run the following cell to remove the tables used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-arrays-1")
druid.datasources.drop("example-koalas-arrays-2")
druid.datasources.drop("example-koalas-arrays-3")

## Summary

* Arrays can be created in a number of ways, including by conversion from delimited and multi-value strings.
* The UNNEST function, together with a JOIN, allows for arrays to be expanded into individual rows.
* Scalar array functions allow for items to be found and added.

## Learn more

* Read more about arrays on the [SQL data types](https://druid.apache.org/docs/latest/querying/sql-data-types/#arrays) page in the official documentation.
* See the full documentation on [scalar](https://druid.apache.org/docs/latest/querying/sql-array-functions) and [aggregation](https://druid.apache.org/docs/latest/querying/sql-aggregations) array functions.
* Check out the [official tutorial on UNNEST](https://druid.apache.org/docs/latest/tutorials/tutorial-unnest-arrays) with arrays.