# Array data types in Druid
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

In this notebook you will find examples of how to work with the array data type in Druid: constructing arrays at query time and ingestion time using [scalar](https://druid.apache.org/docs/latest/querying/sql-array-functions) and [aggregation](https://druid.apache.org/docs/latest/querying/sql-aggregations) functions, and expanding them with UNNEST.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Use DRUID_HOST from the environment when it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Construct arrays at query time

Run the following cell to create a table, `example-koalas-arrays-1`.

This table contains data you will use later in the notebook.

In the source data, the `language` field contains [multi-value strings](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings). In the ingestion, this is brought in as `language-mv`.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-1" OVERWRITE ALL WITH "ext" AS (
  SELECT 
    * 
  FROM 
    TABLE(
      EXTERN(
        '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', 
        '{"type":"json"}'
      )
    ) EXTEND (
      "timestamp" VARCHAR, "agent_category" VARCHAR, 
      "agent_type" VARCHAR, "browser" VARCHAR, 
      "browser_version" VARCHAR, "city" VARCHAR, 
      "continent" VARCHAR, "country" VARCHAR, 
      "version" VARCHAR, "event_type" VARCHAR, 
      "event_subtype" VARCHAR, "loaded_image" VARCHAR, 
      "adblock_list" VARCHAR, "forwarded_for" VARCHAR, 
      "language" VARCHAR, "number" VARCHAR, 
      "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, 
      "referrer" VARCHAR, "referrer_host" VARCHAR, 
      "region" VARCHAR, "remote_address" VARCHAR, 
      "screen" VARCHAR, "session" VARCHAR, 
      "session_length" BIGINT, "timezone" VARCHAR, 
      "timezone_offset" VARCHAR, "window" VARCHAR
    )
) 
SELECT 
  TIME_PARSE("timestamp") AS "__time", 
  "timezone", 
  "browser", 
  "language" AS "language-mv", 
  "session",
  "event_type",
  "event_subtype"
FROM 
  "ext" PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-arrays-1')
display.table('example-koalas-arrays-1')

### Array scalar functions

The ARRAY function constructs an array from a number of elements in a row. Run the following cell to see ARRAY being used to create two fields:

* `staticArray` is a simple array of constants.
* The `tags` column contains the browser and timezone of each row.

Note that here, and in all the other examples of this notebook, all the elements of an array [share the same data type](https://druid.apache.org/docs/latest/querying/sql-functions#array).

In [None]:
sql='''
SELECT 
  ARRAY[ 'en-GB', 'fr' ] AS "staticArray", 
  ARRAY[ "browser", "timezone" ] AS "tags" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T14/PT2S')
'''

display.sql(sql)

The SQL below extends this with an MV_TO_ARRAY function. It adds an array of language tags for each row from the source data's [multi-value string](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings) data that was ingested into `language-mv`.

In [None]:
sql='''
SELECT 
  ARRAY[ 'en-GB', 'fr' ] AS "staticArray", 
  ARRAY[ "browser", "timezone" ] AS "tags", 
  MV_TO_ARRAY("language-mv") AS "tags-languages" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T14/PT2S')
'''

display.sql(sql)

Using the ARRAY_PREPEND and ARRAY_APPEND functions, elements can be added to an array.

Run the cell below to see an example where:

* ARRAY_PREPEND adds Vogon to the beginning of all language tags when the browser is `Chrome`.
* ARRAY_APPEND adds Klingon to the end of all language tags in all other cases.

In [None]:
sql='''
SELECT 
  CASE "browser" WHEN 'Chrome' THEN ARRAY_PREPEND( 'vog', MV_TO_ARRAY("language-mv") )
    ELSE ARRAY_APPEND( MV_TO_ARRAY("language-mv"), 'tlh')
    END AS "tags-languages" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T14/PT2S')
'''

display.sql(sql)

ARRAY_CONCAT takes two arrays and outputs a single new array.

Using ARRAY_CONCAT we can create just one list of tags for each row covering languages and browsers.

* The first argument creates an array representation of both the browser and timezone.
* The second argument is the array returned by the conversion of the `language-mv` data into an array.

In [None]:
sql='''
SELECT 
  ARRAY_CONCAT(
    ARRAY[ "browser", "timezone" ], 
    MV_TO_ARRAY("language-mv") ) AS "tags" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T14/PT2S')
'''

display.sql(sql)
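
The behavior of these scalar functions can be sketched in plain Python. This is only an illustrative model, not Druid code: the helper functions mirror the argument order of ARRAY_PREPEND, ARRAY_APPEND, and ARRAY_CONCAT, and the sample row is made up rather than taken from the example data.

```python
# Plain-Python stand-ins for three Druid array functions (illustrative only).

def array_prepend(element, arr):
    """Model of ARRAY_PREPEND(expr, arr): the element goes first."""
    return [element] + list(arr)


def array_append(arr, element):
    """Model of ARRAY_APPEND(arr, expr): the element goes last."""
    return list(arr) + [element]


def array_concat(arr1, arr2):
    """Model of ARRAY_CONCAT(arr1, arr2): arr1 followed by arr2."""
    return list(arr1) + list(arr2)


# A made-up row; the values are not taken from the example data.
row = {"browser": "Chrome", "timezone": "Europe/London", "language-mv": ["en-GB", "en"]}

# Same branching as the CASE expression in the earlier cell.
if row["browser"] == "Chrome":
    tags_languages = array_prepend("vog", row["language-mv"])
else:
    tags_languages = array_append(row["language-mv"], "tlh")

# Same shape as the ARRAY_CONCAT cell: browser and timezone, then the languages.
tags = array_concat([row["browser"], row["timezone"]], row["language-mv"])

print(tags_languages)  # ['vog', 'en-GB', 'en']
print(tags)            # ['Chrome', 'Europe/London', 'en-GB', 'en']
```

Note that the new element is the first argument of ARRAY_PREPEND but the second argument of ARRAY_APPEND.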

### Array aggregators

While ARRAY creates an array from individual dimensions in a row, ARRAY_AGG creates an array from values _across_ rows as part of a GROUP BY query.

In the following cell:

* GROUP BY separates the source rows into 10-second buckets.
* Individual values from `browser` are merged from across the source rows into an array, one for each 10-second bucket.

In [None]:
sql='''
SELECT 
  TIME_FLOOR("__time", 'PT10S') AS "time-bucket", 
  ARRAY_AGG("browser") AS "tags-browsers" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T10/PT1M') 
GROUP BY 
  1
'''

display.sql(sql)

As the size of the time window increases, more elements will be added to the array as the number of rows to be grouped increases. As there is a limit on the size of arrays, an extra parameter may need to be added to the ARRAY_AGG function to increase the maximum size.

In the following cell the time floor has been increased to 1 minute (`PT1M`) along with an additional maximum size parameter of 65535 bytes.

Run this cell with and without the maximum size parameter for ARRAY_AGG.

In [None]:
sql='''
SELECT 
  TIME_FLOOR("__time", 'PT1M') AS "time-bucket", 
  ARRAY_AGG("browser", 65535) AS "tags-browsers" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T10/PT2M') 
GROUP BY 
  1
'''

display.sql(sql)

In order to only return distinct values when the arrays are constructed, the following SQL contains the DISTINCT keyword with ARRAY_AGG.

Notice that the reduction in the size of the array means that the additional maximum size parameter can be removed from ARRAY_AGG.

In [None]:
sql='''
SELECT 
  TIME_FLOOR("__time", 'PT1M') AS "time-bucket", 
  ARRAY_AGG(DISTINCT "browser") AS "tags-browsers" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T10/PT2M') 
GROUP BY 
  1
'''

display.sql(sql)
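
As a rough mental model, ARRAY_AGG inside a GROUP BY behaves like collecting values into one list per bucket. The sketch below is plain Python, not Druid code; the rows are made up and the function name `array_agg_by_minute` is hypothetical. The model keeps values in input order for readability, which Druid does not guarantee.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (timestamp, browser) rows; not taken from the example data.
events = [
    (datetime(2019, 8, 25, 10, 0, 1), "Chrome"),
    (datetime(2019, 8, 25, 10, 0, 30), "Firefox"),
    (datetime(2019, 8, 25, 10, 0, 45), "Chrome"),
    (datetime(2019, 8, 25, 10, 1, 5), "Safari"),
]


def array_agg_by_minute(rows, distinct=False):
    """Model of ARRAY_AGG grouped by TIME_FLOOR("__time", 'PT1M')."""
    buckets = defaultdict(list)
    for ts, browser in rows:
        # Flooring to the minute plays the role of TIME_FLOOR.
        bucket = buckets[ts.replace(second=0, microsecond=0)]
        if distinct and browser in bucket:
            continue  # DISTINCT keeps only the first occurrence of each value
        bucket.append(browser)
    return dict(buckets)


all_values = array_agg_by_minute(events)
distinct_values = array_agg_by_minute(events, distinct=True)

for bucket in sorted(all_values):
    print(bucket, all_values[bucket], distinct_values[bucket])
```

The first bucket holds three values without DISTINCT and two with it, which is why the distinct version is less likely to need a larger maximum size.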

Just as ARRAY_CONCAT combines two arrays on the same row, ARRAY_CONCAT_AGG merges arrays from across rows.

The following cell demonstrates this in much the same way as the example for ARRAY_AGG above:

* GROUP BY separates the source rows into 10-second buckets.
* Arrays from `language-mv` (via MV_TO_ARRAY) are merged from across the source rows into an array, one for each 10-second bucket.
* The DISTINCT clause ensures that, after this aggregation, only the unique values are returned as `tags-languages`.

You can also run this cell without DISTINCT, experimenting with a maximum array size parameter.

In [None]:
sql='''
SELECT 
  TIME_FLOOR("__time", 'PT10S') AS "time-bucket", 
  ARRAY_CONCAT_AGG(DISTINCT MV_TO_ARRAY("language-mv")) AS "tags-languages" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T10/PT2M') 
GROUP BY 
  1
'''

display.sql(sql)

Aggregating functions can be used in combination with the scalar functions.

In the following cell GROUP BY splits the source table rows into 1-minute buckets, and each bucket contains a single array containing all the tags in that bucket.

* ARRAY_AGG produces a distinct array of browsers.
* ARRAY_CONCAT_AGG produces a distinct array of languages from the array version of `language-mv`.
* ARRAY_CONCAT merges the two resulting arrays, producing an array that now contains both browsers and languages.
* ARRAY_APPEND adds the `vog` element to the end of the array.

In [None]:
sql='''
SELECT 
  TIME_FLOOR("__time", 'PT1M') AS "time-bucket", 
  ARRAY_APPEND(
    ARRAY_CONCAT(
      ARRAY_AGG(DISTINCT "browser"), 
      ARRAY_CONCAT_AGG(DISTINCT MV_TO_ARRAY("language-mv"))
    ), 
    'vog'
  ) AS "tags" 
FROM 
  "example-koalas-arrays-1" 
WHERE 
  TIME_IN_INTERVAL("__time", '2019-08-25T10/PT5M') 
GROUP BY 
  1
'''

display.sql(sql)

Through ARRAY_AGG, arrays can be constructed that contain nested objects and other arrays.

In the example that follows, source rows from the table are grouped into sessions using `session`.  Then, for each session, ARRAY_AGG is used twice:

* `journey-arrays` is built up of individual arrays through ARRAY, with each array containing the time, `event_type`, and `event_subtype`.
* `journey-objects` is built up of nested objects containing the same data, made using JSON_OBJECT.

Note that there is no explicit ORDER BY when the array is constructed. This means events in the arrays are [not guaranteed to be in any particular order](https://druid.apache.org/docs/latest/querying/sql-aggregations).

In [None]:
sql='''
SELECT
  "session",
  ARRAY_AGG(
    ARRAY[
      TIME_FORMAT("__time"),
      "event_type",
      "event_subtype"
    ], 65535) AS "journey-arrays",
  ARRAY_AGG(
    JSON_OBJECT(
      KEY 'timestamp' VALUE TIME_FORMAT("__time"),
      KEY 'event' VALUE "event_type",
      KEY 'event_sub' VALUE "event_subtype"
    )
  ) AS "journey-objects"
FROM "example-koalas-arrays-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT10S')
GROUP BY "session"
'''

display.sql(sql)

## Construct an array at ingestion time

You can use a variety of functions and aggregators at ingestion time to create arrays in your data.

In all the examples that follow, the `arrayIngestMode` [context parameter](https://druid.apache.org/docs/latest/querying/sql-data-types/#arrays) has been set to `array` to ensure that the [ARRAY](https://druid.apache.org/docs/latest/querying/sql-data-types#arrays) data type is applied to the columns in the table.

### Using existing data

Convert existing lists of values from source data using MV_TO_ARRAY and STRING_TO_ARRAY.

Run the following cell to create a new table, `example-koalas-arrays-2`.

For each row in the source data:

* The incoming multi-value string data is converted to an array in `tags-languages`.
* The ARRAY function is used to create `journey-array`:
  * A CASE test returns an array only when the event indicates a visitor's progress through the website, that is, when `event_type` is `PercentClear`.
  * A second CASE test cleanses the data: when someone has just started their journey, the progress arrives as an empty string rather than as zero, so it is replaced with 0.
  * The ARRAY function creates an array containing the timestamp, stored as a LONG, and the percentage progress recorded in the source data.
  * CAST ensures all the resulting array elements have the same data type.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-2" OVERWRITE ALL WITH "ext" AS (
  SELECT 
    * 
  FROM 
    TABLE(
      EXTERN(
        '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', 
        '{"type":"json"}'
      )
    ) EXTEND (
      "timestamp" VARCHAR, "agent_category" VARCHAR, 
      "agent_type" VARCHAR, "browser" VARCHAR, 
      "browser_version" VARCHAR, "city" VARCHAR, 
      "continent" VARCHAR, "country" VARCHAR, 
      "version" VARCHAR, "event_type" VARCHAR, 
      "event_subtype" VARCHAR, "loaded_image" VARCHAR, 
      "adblock_list" VARCHAR, "forwarded_for" VARCHAR, 
      "language" VARCHAR, "number" VARCHAR, 
      "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, 
      "referrer" VARCHAR, "referrer_host" VARCHAR, 
      "region" VARCHAR, "remote_address" VARCHAR, 
      "screen" VARCHAR, "session" VARCHAR, 
      "session_length" BIGINT, "timezone" VARCHAR, 
      "timezone_offset" VARCHAR, "window" VARCHAR
    )
) 
SELECT 
  TIME_PARSE("timestamp") AS "__time", 
  "timezone", 
  "browser", 
  MV_TO_ARRAY("language") AS "tags-languages", 
  "session", 
  "event_type", 
  "event_subtype", 
  "session_length", 
  CASE
    WHEN ("event_type" = 'PercentClear') THEN (
      CASE
        WHEN ("event_subtype" = '') THEN ARRAY[CAST(TIME_PARSE("timestamp") AS BIGINT), 0]
        ELSE ARRAY[CAST(TIME_PARSE("timestamp") AS BIGINT), CAST("event_subtype" AS BIGINT)]
        END
    )
    ELSE NULL END AS "journey-array"
FROM 
  "ext" PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-arrays-2')
display.table('example-koalas-arrays-2')

Run the following cell to see a sample of the data.

Notice that the first element of `journey-array` is a true secondary timestamp, so it is stored as milliseconds since the Unix epoch. So that all elements of `journey-array` are of the same data type, the CAST function has been used on the journey stage to ensure it is a BIGINT.

In [None]:
sql='''
SELECT *
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT5S')
'''

display.sql(sql)
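
The nested CASE expression that builds `journey-array` can be read as the following plain-Python function. It is a sketch of the logic only: the function name `journey_array`, the event type `OtherEvent`, and the sample epoch value are made up for illustration.

```python
def journey_array(event_type, event_subtype, timestamp_millis):
    """Model of the nested CASE expression used at ingestion time."""
    # Outer CASE: only progress events produce an array; everything else is NULL.
    if event_type != "PercentClear":
        return None
    # Inner CASE: an empty subtype means the journey has just started, so use 0.
    percent = 0 if event_subtype == "" else int(event_subtype)
    # Both elements are integers, mirroring the CAST(... AS BIGINT) calls.
    return [timestamp_millis, percent]


sample_millis = 1566727200000  # an example epoch-millisecond value

print(journey_array("PercentClear", "", sample_millis))    # [1566727200000, 0]
print(journey_array("PercentClear", "25", sample_millis))  # [1566727200000, 25]
print(journey_array("OtherEvent", "", sample_millis))      # None
```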

### Create an array at ingestion time using rollup (GROUP BY)

Run the following cell to create a table called `example-koalas-arrays-3`, where source data is grouped into 30-minute buckets.

A row is then emitted containing, for each group:

* A distinct array from ARRAY_AGG of each event's timezone - with a CASE replacing any 'N/A' entry with NULL.
* A distinct array of each event's browsers through ARRAY_AGG.
* A distinct array of the combination of each row's own language tag array through ARRAY_CONCAT_AGG.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-3" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(EXTERN('{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', '{"type":"json"}')) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR)
)
SELECT 
  TIME_FLOOR(TIME_PARSE("timestamp"),'PT30M') AS "__time",
  ARRAY_AGG(
    DISTINCT CASE
      WHEN "timezone" = 'N/A' THEN NULL
      ELSE "timezone"
      END
    , 65535
  ) AS "tags-timezones", 
  ARRAY_AGG(DISTINCT "browser", 65535) AS "tags-browsers", 
  ARRAY_CONCAT_AGG(DISTINCT MV_TO_ARRAY("language"), 65535) AS "tags-languages",
  COUNT(DISTINCT "session") AS "sessions",
  MAX("session_length") AS "longest_session",
  COUNT(*) AS "events"
FROM 
  "ext"
GROUP BY
  1
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")
req.add_context("finalizeAggregations", "true")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-arrays-3')
display.table('example-koalas-arrays-3')

Take a look at a sample of the resulting data by running the cell below.

In [None]:
sql='''
SELECT
  "__time",
  "tags-languages",
  "tags-timezones",
  "tags-browsers"
FROM "example-koalas-arrays-3"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T02/PT2H')
AND "longest_session" > 500000
'''

display.sql(sql)

## Determine array size

By using ARRAY_LENGTH it's possible to count the number of elements in the array.

The results of the next cell show the time periods with the largest number of languages over a six-hour period.

In [None]:
sql='''
SELECT
  "__time" AS "Period",
  MAX(ARRAY_LENGTH("tags-languages")) AS "Tagged Languages"
FROM "example-koalas-arrays-3"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T10/PT6H')
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)

## Filter results using arrays

The ARRAY_CONTAINS function tests for the presence of either an element or another array.

Running the next cell will show a count of the number of sessions broken down into 15-minute intervals. ARRAY_CONTAINS is used in combination with FILTER (WHERE) to break down the counts into English, French, and Spanish.

The WHERE clause also uses ARRAY_CONTAINS, this time with an array as its second argument. It restricts the rows used in the calculation to those that contain all three tags: `en`, `fr`, and `es`.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT15M') AS "Period",
  COUNT(DISTINCT "session") FILTER (WHERE ARRAY_CONTAINS("tags-languages",'en')) AS "English Sessions",
  COUNT(DISTINCT "session") FILTER (WHERE ARRAY_CONTAINS("tags-languages",'fr')) AS "French Sessions",
  COUNT(DISTINCT "session") FILTER (WHERE ARRAY_CONTAINS("tags-languages",'es')) AS "Spanish Sessions",
  COUNT(DISTINCT "session") AS "Total Sessions"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/PT6H')
AND ARRAY_CONTAINS("tags-languages",ARRAY['en','fr','es'])
GROUP BY 1
'''

display.sql(sql)

In order to include source rows that contain _any_ of the three target languages, the following cell uses the ARRAY_OVERLAP function instead of ARRAY_CONTAINS in the WHERE clause.

ARRAY_OVERLAP tests whether there is any overlap at all between the language tags on each row and the array of languages being tested against.

Run the cell to see how this affects results.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT15M') AS "Period",
  COUNT(DISTINCT "session") FILTER (WHERE ARRAY_CONTAINS("tags-languages",'en')) AS "English Sessions",
  COUNT(DISTINCT "session") FILTER (WHERE ARRAY_CONTAINS("tags-languages",'fr')) AS "French Sessions",
  COUNT(DISTINCT "session") FILTER (WHERE ARRAY_CONTAINS("tags-languages",'es')) AS "Spanish Sessions",
  COUNT(DISTINCT "session") AS "Total Sessions"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/PT6H')
AND ARRAY_OVERLAP("tags-languages",ARRAY['en','fr','es'])
GROUP BY 1
'''

display.sql(sql)
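
The difference between the two filters can be modeled in a few lines of plain Python. This is an illustrative sketch rather than Druid code, and the language arrays are made up: ARRAY_CONTAINS with an array argument requires every target element to be present, while ARRAY_OVERLAP requires at least one.

```python
def array_contains(arr, expr):
    """Model of ARRAY_CONTAINS: expr may be a single element or an array."""
    if isinstance(expr, (list, tuple)):
        return all(element in arr for element in expr)
    return expr in arr


def array_overlap(arr1, arr2):
    """Model of ARRAY_OVERLAP: true when the arrays share any element."""
    return any(element in arr1 for element in arr2)


target = ["en", "fr", "es"]

# Made-up language arrays, one per row.
rows = [
    ["en", "fr", "es", "de"],
    ["en", "en-GB"],
    ["de"],
]

contains_all = [row for row in rows if array_contains(row, target)]
overlaps = [row for row in rows if array_overlap(row, target)]

print(contains_all)  # [['en', 'fr', 'es', 'de']]
print(overlaps)      # [['en', 'fr', 'es', 'de'], ['en', 'en-GB']]
```

Swapping ARRAY_CONTAINS for ARRAY_OVERLAP in the WHERE clause can therefore only keep the same rows or add more, never remove any.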

## Find elements in an array

ARRAY_ORDINAL and ARRAY_OFFSET return values from specific positions in the array: ARRAY_ORDINAL takes a 1-based position and ARRAY_OFFSET takes a 0-based index.

To see these in action, run the following cell to create a new table, `example-koalas-arrays-4`.

The GROUP BY in this ingestion breaks up the rows into 30-minute blocks, and then into individual sessions, each together with its browser. Only the events that record someone's journey through their visit to the website are included - the WHERE clause selects them by filtering `event_type` to `PercentClear`.

Each row in the table then contains two columns:

* ARRAY_AGG is used to create `journey-timestamps` - it contains an array of the timestamps for every row in the group.
* Again for every row in the group, `journey-percentages` contains an array of the percentage through the journey.

In [None]:
sql='''
REPLACE INTO "example-koalas-arrays-4" OVERWRITE ALL WITH "ext" AS (
  SELECT 
    * 
  FROM 
    TABLE(
      EXTERN(
        '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}', 
        '{"type":"json"}'
      )
    ) EXTEND (
      "timestamp" VARCHAR, "agent_category" VARCHAR, 
      "agent_type" VARCHAR, "browser" VARCHAR, 
      "browser_version" VARCHAR, "city" VARCHAR, 
      "continent" VARCHAR, "country" VARCHAR, 
      "version" VARCHAR, "event_type" VARCHAR, 
      "event_subtype" VARCHAR, "loaded_image" VARCHAR, 
      "adblock_list" VARCHAR, "forwarded_for" VARCHAR, 
      "language" VARCHAR, "number" VARCHAR, 
      "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, 
      "referrer" VARCHAR, "referrer_host" VARCHAR, 
      "region" VARCHAR, "remote_address" VARCHAR, 
      "screen" VARCHAR, "session" VARCHAR, 
      "session_length" BIGINT, "timezone" VARCHAR, 
      "timezone_offset" VARCHAR, "window" VARCHAR
    )
) 
SELECT 
  TIME_FLOOR(TIME_PARSE("timestamp"), 'PT30M') AS "__time",
  "session",
  "browser",
  ARRAY_AGG(
    CASE
      WHEN ("event_subtype" = '') THEN 0
      ELSE CAST("event_subtype" AS BIGINT)
      END, 
    65535
  ) AS "journey-percentages", 
  ARRAY_AGG(TIME_PARSE("timestamp"), 65535) AS "journey-timestamps" 
FROM 
  "ext"
WHERE "event_type" = 'PercentClear'
GROUP BY 
  1, 2, 3
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")
req.add_context("finalizeAggregations", "true")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-arrays-4')
display.table('example-koalas-arrays-4')

The following cell shows the results for a specific session.

The source data has a one-to-one relationship between the `session` and the `browser`, so even though this was a grouped dimension during ingestion there remains only one row per session in the table.

Notice the correlation between the data in `journey-percentages` and `journey-timestamps`.

In [None]:
sql='''
SELECT *
FROM "example-koalas-arrays-4"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/PT1M')
AND "session" = 'S00079331'
'''

display.sql(sql)

The next cell uses these arrays with ARRAY_ORDINAL_OF and ARRAY_ORDINAL.

First, ARRAY_ORDINAL_OF attempts to return the position of a zero-percent clear event in the array.

```sql
ARRAY_ORDINAL_OF("journey-percentages",0)
```

This position is then used in ARRAY_ORDINAL on the `journey-timestamps` to find the timestamp.

```sql
ARRAY_ORDINAL("journey-timestamps", ... )
```

You will note that:

* The result of this function call is wrapped in MILLIS_TO_TIMESTAMP so that a proper timestamp is returned in results.
* For each column, a CASE test is made to catch errors that may occur if the percentage we're looking for is not in the array.

Run the cell below to see what the results are for the same session from the previous results.

In [None]:
sql='''
SELECT
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",0) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",0)))
      ELSE NULL END AS "time-0",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",25) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",25)))
      ELSE NULL END AS "time-25",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",50) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",50)))
      ELSE NULL END AS "time-50",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",75) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",75)))
      ELSE NULL END AS "time-75",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",95) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",95)))
      ELSE NULL END AS "time-95"
FROM "example-koalas-arrays-4"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/PT1M')
AND "session" = 'S00079331'
'''

display.sql(sql)
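
The lookup pattern in this query, finding a position in one array and reading the same position from another, can be sketched in plain Python. This is a model, not Druid code. It assumes the two arrays line up position by position, the sample values are made up, and a missing value is modeled as `None`; check the Druid documentation for the exact not-found result in your version.

```python
def array_ordinal_of(arr, value):
    """Model of ARRAY_ORDINAL_OF: 1-based position of the first match, else None."""
    for position, element in enumerate(arr, start=1):
        if element == value:
            return position
    return None


def array_ordinal(arr, position):
    """Model of ARRAY_ORDINAL: element at a 1-based position, else None."""
    if position is None or not 1 <= position <= len(arr):
        return None
    return arr[position - 1]


def array_offset(arr, index):
    """Model of ARRAY_OFFSET: element at a 0-based index, else None."""
    if index is None or not 0 <= index < len(arr):
        return None
    return arr[index]


# Made-up arrays for one session; positions are assumed to correspond.
journey_percentages = [0, 25, 50]
journey_timestamps = [1566691201000, 1566691215000, 1566691240000]


def stage_time(percentage):
    """Timestamp (epoch millis) at which a percentage was reached, or None."""
    position = array_ordinal_of(journey_percentages, percentage)
    return array_ordinal(journey_timestamps, position)


print(stage_time(25))  # 1566691215000
print(stage_time(95))  # None
```

In this model, `array_ordinal(arr, n)` and `array_offset(arr, n - 1)` read the same element.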

In a final step, let's use this data to calculate the longest time taken to move between these stages for each browser, focusing on a particular period of time in the data.

The query above is placed in a common table expression (CTE), `journey-timestamps`, and a main query then addresses it.

* TIMESTAMPDIFF calculates the number of seconds between each of the timestamps and returns them as `to25`, `to50`, and so on.
* MAX is wrapped around TIMESTAMPDIFF to return the longest period of time for each browser.

In [None]:
sql='''
WITH "journey-timestamps" AS (
SELECT
  "session",
  "browser",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",0) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",0)))
      ELSE NULL END AS "time-0",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",25) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",25)))
      ELSE NULL END AS "time-25",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",50) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",50)))
      ELSE NULL END AS "time-50",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",75) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",75)))
      ELSE NULL END AS "time-75",
  CASE WHEN
    ARRAY_ORDINAL_OF("journey-percentages",95) > 0
      THEN MILLIS_TO_TIMESTAMP(ARRAY_ORDINAL("journey-timestamps",ARRAY_ORDINAL_OF("journey-percentages",95)))
      ELSE NULL END AS "time-95"
FROM "example-koalas-arrays-4"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25/PT6H')
)

SELECT
  "browser",
  MAX(TIMESTAMPDIFF(SECOND,"time-0","time-25")) AS "to25",
  MAX(TIMESTAMPDIFF(SECOND,"time-25","time-50")) AS "to50",
  MAX(TIMESTAMPDIFF(SECOND,"time-50","time-75")) AS "to75",
  MAX(TIMESTAMPDIFF(SECOND,"time-75","time-95")) AS "to95"
FROM "journey-timestamps"
GROUP BY 1
'''

display.sql(sql)
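
The outer query can be pictured as the following plain-Python calculation for a single stage. It is only a sketch: the session rows are made up, the helper name `timestampdiff_seconds` is hypothetical, and a missing stage is modeled as `None` in the same way the CASE expressions return NULL.

```python
from datetime import datetime


def timestampdiff_seconds(start, end):
    """Model of TIMESTAMPDIFF(SECOND, start, end); None if either side is missing."""
    if start is None or end is None:
        return None
    return int((end - start).total_seconds())


# Made-up per-session stage timestamps, shaped like rows of the CTE.
sessions = [
    {"browser": "Chrome", "time-0": datetime(2019, 8, 25, 0, 0, 0), "time-25": datetime(2019, 8, 25, 0, 0, 12)},
    {"browser": "Chrome", "time-0": datetime(2019, 8, 25, 0, 1, 0), "time-25": datetime(2019, 8, 25, 0, 1, 30)},
    {"browser": "Firefox", "time-0": datetime(2019, 8, 25, 0, 2, 0), "time-25": None},
]

# GROUP BY browser with MAX over the differences; missing values are skipped.
to25 = {}
for session in sessions:
    diff = timestampdiff_seconds(session["time-0"], session["time-25"])
    current = to25.get(session["browser"])
    if diff is not None and (current is None or diff > current):
        current = diff
    to25[session["browser"]] = current

print(to25)  # {'Chrome': 30, 'Firefox': None}
```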

## Join to an array with UNNEST

By using UNNEST, an array can be expanded to a temporary table that can then be joined to another datasource.

Recall that the `example-koalas-arrays-2` table contains an array of languages, called `tags-languages`, built from the original `language` column.

In [None]:
sql='''
SELECT
  "tags-languages",
  "browser",
  "session_length"
FROM "example-koalas-arrays-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT2S')
'''

display.sql(sql)

The UNNEST function returns each entry in an array as a row of a new table.

In the following cell, UNNEST is used with CROSS JOIN to return a row for each language, together with the `session` and `browser` of the source row.

* Only `session` and `browser` are returned from `example-koalas-arrays-2`.
* UNNEST takes the language tags array for each of those rows and returns it as a table called `la` with one column, `language`.
* The CROSS JOIN pairs each source row with every row of its own unnested table, adding every `language` to the associated row from `example-koalas-arrays-2`.

In [None]:
sql='''
SELECT
  la."language",
  "session",
  "browser"
FROM "example-koalas-arrays-2"
CROSS JOIN UNNEST("tags-languages") AS la("language")
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT2S')
'''

display.sql(sql)

This data can then be used in a GROUP BY to generate metrics on a per-language basis.

In [None]:
sql='''
SELECT
  la."language",
  MIN("session_length") AS "shortest-session",
  MAX("session_length") AS "longest-session",
  COUNT(DISTINCT "session") AS "unique-sessions"
FROM "example-koalas-arrays-2"
CROSS JOIN UNNEST("tags-languages") AS la("language")
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T4/PT2S')
GROUP BY 1
'''

display.sql(sql)
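
A plain-Python model of CROSS JOIN UNNEST followed by GROUP BY may help picture what the two queries above do. It is a sketch, not Druid code: the rows are made up and the generator name `cross_join_unnest` is hypothetical.

```python
# Made-up rows shaped like "example-koalas-arrays-2".
rows = [
    {"session": "S1", "browser": "Chrome", "session_length": 1000, "tags-languages": ["en", "fr"]},
    {"session": "S2", "browser": "Firefox", "session_length": 4000, "tags-languages": ["en"]},
]


def cross_join_unnest(rows, array_column, alias):
    """Yield one output row per array element, carrying the source row's columns."""
    for row in rows:
        for element in row[array_column]:
            yield {**row, alias: element}


# GROUP BY language with MIN, MAX, and COUNT(DISTINCT session).
metrics = {}
for joined in cross_join_unnest(rows, "tags-languages", "language"):
    entry = metrics.setdefault(
        joined["language"],
        {"shortest": joined["session_length"], "longest": joined["session_length"], "sessions": set()},
    )
    entry["shortest"] = min(entry["shortest"], joined["session_length"])
    entry["longest"] = max(entry["longest"], joined["session_length"])
    entry["sessions"].add(joined["session"])

for language, entry in sorted(metrics.items()):
    print(language, entry["shortest"], entry["longest"], len(entry["sessions"]))
# en 1000 4000 2
# fr 1000 1000 1
```

A source row with two languages contributes to two groups, so totals across languages can exceed the number of source rows.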

<a id="ingest_array"></a>
## Ingest arrays directly from source data
Arrays of primitive values in the source data can be loaded directly into Druid using SQL-based or native ingestion. Druid indexes both the whole array values and the individual elements. Filters on whole array values use the array-valued index, and functions like ARRAY_CONTAINS use the element-valued index.

The following examples ingest some array data of different primitive types and demonstrate the use of these filtering mechanisms.
The data contains two string arrays `"pets"` and `"breeds"` and an array of numeric values `"weights"`:
```
{"time":"2024-01-10 05:00:00", "owner":"Alex",   "pets":["Max","Yoli"],"breeds":["boxer","lab"],"weights":[48,65]}
{"time":"2024-01-10 05:30:00", "owner":"Jill",   "pets":["Mowgli","Pelusa"],"breeds":["boxer","mix"],"weights":[56,27]}
{"time":"2024-01-10 06:00:00", "owner":"Devraj", "pets":["Linda","Frida"],"breeds":["beagle","basenji"],"weights":[40,45]}
{"time":"2024-01-10 06:30:00", "owner":"Kyle",   "pets":["Kala","Boots"],"breeds":["pitbull","siamese"],"weights":[58,10]}
```

Druid supports ARRAY datatypes in the EXTEND clause of the EXTERN table function defining the schema of the external data:
```
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":...',
      '{"type":"json"}'
    )
  ) EXTEND ("time" VARCHAR, "owner" VARCHAR, "pets" VARCHAR ARRAY, "breeds" VARCHAR ARRAY, "weights" DOUBLE ARRAY)
```
- Use the type \<DATATYPE> ARRAY to ingest arrays.
- In this example, "pets" and "breeds" are loaded as VARCHAR ARRAY.
- "weights" is loaded as DOUBLE ARRAY.

Run the following cell to ingest the data:

In [None]:
sql='''
REPLACE INTO "example-arrays-in-source" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\\"time\\":\\"2024-01-10 05:00:00\\", \\"owner\\":\\"Alex\\",   \\"pets\\":[\\"Max\\",\\"Yoli\\"],\\"breeds\\":[\\"boxer\\",\\"lab\\"],\\"weights\\":[48,65]}\\n{\\"time\\":\\"2024-01-10 05:30:00\\", \\"owner\\":\\"Jill\\",   \\"pets\\":[\\"Mowgli\\",\\"Pelusa\\"],\\"breeds\\":[\\"boxer\\",\\"mix\\"],\\"weights\\":[56,27]}\\n{\\"time\\":\\"2024-01-10 06:00:00\\", \\"owner\\":\\"Devraj\\", \\"pets\\":[\\"Linda\\",\\"Frida\\"],\\"breeds\\":[\\"beagle\\",\\"basenji\\"],\\"weights\\":[40,45]}\\n{\\"time\\":\\"2024-01-10 06:30:00\\", \\"owner\\":\\"Kyle\\",   \\"pets\\":[\\"Kala\\",\\"Boots\\"],\\"breeds\\":[\\"pitbull\\",\\"siamese\\"],\\"weights\\":[58,10]}"}',
      '{"type":"json"}'
    )
  ) EXTEND ("time" VARCHAR, "owner" VARCHAR, "pets" VARCHAR ARRAY, "breeds" VARCHAR ARRAY, "weights" DOUBLE ARRAY)
)
SELECT
  TIME_PARSE("time") as __time,
  "owner",
  "pets",
  "breeds",
  ARRAY_TO_MV("breeds") as "breeds_mv",
  "weights"
FROM "ext"
PARTITIONED BY ALL
'''
req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")

display.run_task(req)
sql_client.wait_until_ready('example-arrays-in-source')
display.table('example-arrays-in-source')

Notice that the ingestion loads the "breeds" column as both an array and a multi-value string column.
While this is unusual, it helps to show the difference between the two.

Take a look at the data:

In [None]:
sql='''
SELECT * 
FROM "example-arrays-in-source" x
'''

display.sql(sql)

The result shows "breeds" and "breeds_mv" that look identical except for the fact that "breeds_mv" is sorted.
The GROUP BY behavior of these two columns is quite different.
The SQL example below shows how grouping on an ARRAY column groups on the whole array value:

In [None]:
sql='''
SELECT "breeds", count(*) 
FROM "example-arrays-in-source" x
GROUP BY 1
'''

display.sql(sql)

The multi-value column behaves differently: grouping implicitly expands (unnests) the individual values, so each value forms its own group:

In [None]:
sql='''
SELECT "breeds_mv", count(*) 
FROM "example-arrays-in-source" x
GROUP BY 1
'''

display.sql(sql)
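The difference between the two grouping behaviors can be modeled in plain Python. This is only an illustrative sketch of the semantics, not how Druid implements grouping. The variable names are made up for the example, and the data is copied from the inline rows ingested earlier.

```python
from collections import Counter

# "breeds" values from the four sample rows.
breeds_rows = [
    ["boxer", "lab"],
    ["boxer", "mix"],
    ["beagle", "basenji"],
    ["pitbull", "siamese"],
]

# ARRAY-style grouping: the whole array is the grouping key.
# Tuples are used because Python lists are not hashable.
array_groups = Counter(tuple(row) for row in breeds_rows)

# Multi-value-style grouping: each element becomes its own grouping key.
mv_groups = Counter(breed for row in breeds_rows for breed in row)

print(array_groups)
print(mv_groups)
```

In this model the array grouping yields four groups of one row each, while the per-element grouping yields seven groups, with "boxer" counted twice.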

The same aggregation behavior as the multi-value column can be achieved by explicitly unnesting the "breeds" array column:

In [None]:
sql='''
SELECT b."breed", count(*) 
FROM "example-arrays-in-source" x, UNNEST("breeds") AS b("breed")
GROUP BY 1
'''

display.sql(sql)

You can filter on a literal array value. A match occurs only if the contents are identical, including the order of the values:

In [None]:
sql='''
SELECT * 
FROM "example-arrays-in-source" x
WHERE "pets"=ARRAY['Kala','Boots']
'''

display.sql(sql)

Use the ARRAY_CONTAINS function to find rows where a specific element appears in the array column.
The following SQL finds owners who have a pet boxer:

In [None]:
sql='''
SELECT * 
FROM "example-arrays-in-source" x
WHERE ARRAY_CONTAINS("breeds",'boxer')
'''

display.sql(sql)
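The two filters above differ in a way that is easy to reproduce with Python lists: equality is order-sensitive, while containment only checks for the presence of one element. The following is a rough analogy using the sample rows, not a description of Druid's internals. All variable names are illustrative.

```python
# "owner", "pets" and "breeds" values from the four sample rows.
rows = [
    {"owner": "Alex", "pets": ["Max", "Yoli"], "breeds": ["boxer", "lab"]},
    {"owner": "Jill", "pets": ["Mowgli", "Pelusa"], "breeds": ["boxer", "mix"]},
    {"owner": "Devraj", "pets": ["Linda", "Frida"], "breeds": ["beagle", "basenji"]},
    {"owner": "Kyle", "pets": ["Kala", "Boots"], "breeds": ["pitbull", "siamese"]},
]

# Analogous to: WHERE "pets" = ARRAY['Kala','Boots']
exact = [r["owner"] for r in rows if r["pets"] == ["Kala", "Boots"]]

# Same elements in a different order: no row matches.
reordered = [r["owner"] for r in rows if r["pets"] == ["Boots", "Kala"]]

# Analogous to: WHERE ARRAY_CONTAINS("breeds", 'boxer')
boxer_owners = [r["owner"] for r in rows if "boxer" in r["breeds"]]

print(exact, reordered, boxer_owners)
```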

Find owners with pets that weigh more than 50 pounds by unnesting the "weights" array and comparing the individual items:

In [None]:
sql='''
SELECT * 
FROM "example-arrays-in-source" x, UNNEST(x."weights") AS y("weight")
WHERE y."weight">50
'''

display.sql(sql)
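As a sketch of what the UNNEST and filter combination does, the same logic can be written as a nested comprehension over the sample data. The names `unnested` and `heavy` are made up for this illustration.

```python
# "owner" and "weights" values from the four sample rows.
rows = [
    ("Alex", [48, 65]),
    ("Jill", [56, 27]),
    ("Devraj", [40, 45]),
    ("Kyle", [58, 10]),
]

# UNNEST: emit one (owner, weight) pair per array element.
unnested = [(owner, weight) for owner, weights in rows for weight in weights]

# WHERE y."weight" > 50: keep only the pairs whose element passes the filter.
heavy = [(owner, weight) for owner, weight in unnested if weight > 50]

print(heavy)
```

Because the filter runs after the expansion, an owner with two pets over 50 pounds would appear twice in the result, once per matching element.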

<a id='json_array_of_objects'></a>
## Working with nested arrays of objects
Druid can ingest arrays of objects, which can then be unnested, filtered, and aggregated.
This is useful when the data contains lists of related objects associated with an event. Here's an IoT example that contains an array of metrics emitted by the sensors of a device:

```
{
    "time":"2024-01-01 10:00:00",
    "device":"ABF001",
    "loop":"NH3-100-01",
    "loop-seq":1,
    "process":
            {
                "name":"NH3-100",
                "session":"BATCH-000001",
                "metrics":[
                     {"name":"temperature","value":30},
                     {"name":"pressure","value":56},
                     {"name":"flow","value":10}
                  ]
            }
}
```

Given that different devices may have different sets of sensors, another example in the same set might look like:
```
{
    "time":"2024-01-01 10:00:00",
    "device":"HEAT001",
    "loop":"NH3-100-01",
    "loop-seq":2,
    "process":
        {
            "name":"NH3-100",
            "session":"BATCH-000001",
            "metrics":[
                 {"name":"temperature","value":455},
                 {"name":"pressure","value":100},
                 {"name":"fuel-input", "value":10}
              ]
        }
}
```

Use the following cell to load these two examples:

In [None]:
sql='''
REPLACE INTO "example-array-json-objects" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{ \\"time\\":\\"2024-01-01 10:00:00\\",  \\"device\\":\\"ABF001\\", \\"loop\\":\\"NH3-100-01\\", \\"loop-seq\\":1,\\"process\\": {\\"name\\":\\"NH3-100\\",\\"session\\":\\"BATCH-000001\\",\\"metrics\\":[ {\\"name\\":\\"temperature\\",\\"value\\":30}, {\\"name\\":\\"pressure\\",\\"value\\":56}, {\\"name\\":\\"flow\\",\\"value\\":10}]}}\\n{\\"time\\":\\"2024-01-01 10:00:00\\",\\"device\\":\\"HEAT001\\",\\"loop\\":\\"NH3-100-01\\",\\"loop-seq\\":2,\\"process\\":{ \\"name\\":\\"NH3-100\\", \\"session\\":\\"BATCH-000001\\",\\"metrics\\":[{\\"name\\":\\"temperature\\",\\"value\\":455},{\\"name\\":\\"pressure\\",\\"value\\":100},{\\"name\\":\\"fuel-input\\", \\"value\\":10}]}}\\n"}',      '{"type":"json"}'
    )
  ) EXTEND ("time" VARCHAR, "device" VARCHAR, "loop" VARCHAR, "loop-seq" BIGINT, "process" TYPE('COMPLEX<json>'))
)
SELECT
  TIME_PARSE(TRIM("time")) AS "__time",
  "device",
  "loop",
  "loop-seq",
  "process"
FROM "ext"
PARTITIONED BY DAY
'''
req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")

display.run_task(req)
sql_client.wait_until_ready('example-array-json-objects')
display.table('example-array-json-objects')

Use the JSON_QUERY_ARRAY function to access nested arrays of objects:

In [None]:
sql='''
SELECT "loop", JSON_QUERY_ARRAY( "process", '$.metrics') "metric_array" FROM "example-array-json-objects"
'''
display.sql(sql)

Combine JSON_QUERY_ARRAY with UNNEST to access the individual array elements, and use JSON_VALUE to extract fields from each element for filtering and aggregation:

In [None]:
sql='''
SELECT "loop", 
        JSON_VALUE( m."metric", '$.name') metric_name, 
        MIN( JSON_VALUE( m."metric", '$.value' RETURNING DOUBLE) ) min_metric_value,
        MAX( JSON_VALUE( m."metric", '$.value' RETURNING DOUBLE) ) max_metric_value,
        AVG( JSON_VALUE( m."metric", '$.value' RETURNING DOUBLE) ) avg_metric_value
FROM "example-array-json-objects", 
      UNNEST( JSON_QUERY_ARRAY( "process", '$.metrics')) AS m("metric")
WHERE JSON_VALUE( m."metric", '$.name') IN ('temperature', 'pressure')
GROUP BY 1,2
'''
display.sql(sql)
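To make the shape of that query easier to follow, here is a plain Python sketch of the same steps applied to the two sample events: extract the nested array, expand it, filter by metric name, and aggregate per loop and metric. It illustrates the logic only. The variable names are assumptions, and the events are trimmed to the fields the query uses.

```python
from collections import defaultdict
from statistics import mean

# The two sample events, reduced to "loop" and "process.metrics".
events = [
    {"loop": "NH3-100-01", "process": {"metrics": [
        {"name": "temperature", "value": 30},
        {"name": "pressure", "value": 56},
        {"name": "flow", "value": 10},
    ]}},
    {"loop": "NH3-100-01", "process": {"metrics": [
        {"name": "temperature", "value": 455},
        {"name": "pressure", "value": 100},
        {"name": "fuel-input", "value": 10},
    ]}},
]

wanted = {"temperature", "pressure"}
values = defaultdict(list)

for event in events:
    # JSON_QUERY_ARRAY + UNNEST: iterate over each object in the nested array.
    for metric in event["process"]["metrics"]:
        # WHERE JSON_VALUE(..., '$.name') IN ('temperature', 'pressure')
        if metric["name"] in wanted:
            # RETURNING DOUBLE: treat the value as a floating point number.
            values[(event["loop"], metric["name"])].append(float(metric["value"]))

# GROUP BY 1,2 with MIN, MAX and AVG per (loop, metric name).
summary = {key: (min(v), max(v), mean(v)) for key, v in values.items()}

for key, stats in summary.items():
    print(key, stats)
```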

## Clean up

Run the following cell to remove the tables used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-arrays-1")
druid.datasources.drop("example-koalas-arrays-2")
druid.datasources.drop("example-koalas-arrays-3")
druid.datasources.drop("example-koalas-arrays-4")
druid.datasources.drop("example-array-json-objects")
druid.datasources.drop("example-arrays-in-source")

## Summary

* Arrays can be created in a number of ways at query time, including by conversion from delimited and multi-value strings.
* With the right context parameters, arrays can be constructed from source data and created using aggregators.
* Scalar array functions allow for items to be found and added.
* The UNNEST function, together with a JOIN, allows for arrays to be expanded into individual rows.
* JSON_QUERY_ARRAY combined with UNNEST enables the use of arrays of objects that can be expanded into rows and columns.

## Learn more

* Read more about arrays on the [SQL data types](https://druid.apache.org/docs/latest/querying/sql-data-types/#arrays) page in the official documentation.
* See the full documentation on [scalar](https://druid.apache.org/docs/latest/querying/sql-array-functions) and [aggregation](https://druid.apache.org/docs/latest/querying/sql-aggregations) array functions.
* Check out the [official tutorial on UNNEST](https://druid.apache.org/docs/latest/tutorials/tutorial-unnest-arrays) with arrays.