# Generating and working with NULL values
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Databases can commonly record that a value on a row is missing by using a reserved representation: [NULL](https://en.wikipedia.org/wiki/Null_(SQL)). NULL values are stored, generated, and used in special ways by functions, operators, and aggregators.

In this notebook, you will generate [NULL](https://druid.apache.org/docs/latest/querying/sql-data-types#null-values) values in tables from example data, and work with them using a variety of functions and aggregations.

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

> __Using versions of Apache Druid prior to this may yield unexpected results.__
> 
> There are two modes for [NULL-handling](https://druid.apache.org/docs/latest/querying/sql-data-types#null-values) in Apache Druid. Druid 28 and later default to SQL-compatible NULL handling. Define which mode to use by setting the `druid.generic.useDefaultValueForNull` runtime property. Read more in the [handling null values](https://druid.apache.org/docs/latest/design/segments/#handling-null-values) documentation.

#### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Use DRUID_HOST when it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status
status_client.version

## Generating NULLs at ingestion time

Run the following ingestion to create a table called `example-koalas-null-1`.

You will use this later to see examples of IS NULL and IS NOT NULL.

Notice, too, that there are CASE statements which purposefully inject true NULL into the table under certain conditions:
* A CASE statement corrects values of "N/A" to NULL in `timezone`.
* `referrer-null` contains a NULL whenever `referrer` has a value of "Direct"; otherwise the original value from `referrer` is stored.
* A new column `session_length-EDTonly` is added that only contains session lengths for EDT timezone events.
* A new column, `session_length-PDTonly`, only contains the session length for PDT timezone events.
* Finally, `session_length-others` contains the session length for anything other than EDT or PDT.
* A new column called `percentClear` is created, correcting for missing values in the source data, and outputting NULL in other situations.
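To make the conditions above concrete before running the ingestion, here is a minimal Python sketch of the same CASE logic, with `None` standing in for SQL NULL. The sample rows and the `transform` helper are invented for illustration and are not part of the ingestion itself.

```python
# Hypothetical sample rows: the field names follow the source data,
# but the values are invented for illustration.
rows = [
    {"timezone": "EDT", "referrer": "Direct", "session_length": 30},
    {"timezone": "N/A", "referrer": "https://www.google.com/", "session_length": 12},
    {"timezone": "PDT", "referrer": "Direct", "session_length": 45},
    {"timezone": "CEST", "referrer": "https://example.com/", "session_length": 7},
]

def transform(row):
    """Mimic the CASE expressions; None plays the role of SQL NULL."""
    tz = row["timezone"]
    length = row["session_length"]
    # In SQL, comparing a NULL timezone with <> is never true,
    # so a missing timezone falls through to the ELSE NULL branch.
    is_other = tz is not None and tz not in ("EDT", "PDT")
    return {
        "timezone": None if tz == "N/A" else tz,
        "referrer": row["referrer"],
        "referrer-null": None if row["referrer"] == "Direct" else row["referrer"],
        "session_length": length,
        "session_length-EDTonly": length if tz == "EDT" else None,
        "session_length-PDTonly": length if tz == "PDT" else None,
        "session_length-others": length if is_other else None,
    }

transformed = [transform(r) for r in rows]
for r in transformed:
    print(r)
```

Note that the timezone-specific columns test the original `timezone` value, so a row with "N/A" still lands in `session_length-others`.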

In [None]:
sql='''
REPLACE INTO "example-koalas-null-1" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  CASE WHEN "timezone" = 'N/A' THEN NULL
    ELSE "timezone"
    END AS "timezone",
  "referrer",
  CASE WHEN "referrer" = 'Direct' THEN NULL
    ELSE "referrer" END AS "referrer-null",
  "session_length",
  CASE WHEN ("timezone" = 'EDT') THEN "session_length" ELSE NULL END AS "session_length-EDTonly",
  CASE WHEN ("timezone" = 'PDT') THEN "session_length" ELSE NULL END AS "session_length-PDTonly",
  CASE WHEN ("timezone" <> 'EDT' AND "timezone" <> 'PDT') THEN "session_length" ELSE NULL END AS "session_length-others",
  "event_type",
  CASE WHEN ("event_type" = 'PercentClear') THEN
      (CASE WHEN ("event_subtype" = '') THEN 0 ELSE "event_subtype" END)
      ELSE NULL END AS "percentClear"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-null-1')
display.table('example-koalas-null-1')

Not all source data formats explicitly allow for the storage of NULLs. Let us imagine that this is why the developer of the [KoalasToTheMax](https://www.koalastothemax.com) website used the word "Direct" in the `referrer` column whenever a `referrer` could not be identified.

In Druid, the data engineer decides to use NULL for this purpose instead, storing the revised data in a column called `referrer-null` using a CASE statement. It detects any value of "Direct" in the `referrer` column, and stores a true NULL when found. All other values are passed through from the `referrer` column as-is. The original data is maintained in the table in the `referrer` column.

The same applies to `timezone` - when the data contains "N/A" it's determined that this is really best handled as a NULL. In this instance, however, the original data is discarded and only the newly nulled `timezone` data is stored.

Run the following cell to see a count of records in `referrer` with the value `Direct`, and the equivalent count in the `referrer-null` column. A third count shows rows that do not have a timezone, which in the source data had a value of "N/A". Each count uses a FILTER (WHERE...) clause: the first filters against "Direct", and the other two filter using IS NULL.

In [None]:
sql='''
SELECT
  COUNT(*) FILTER (WHERE "referrer" = 'Direct') AS "referrer",
  COUNT(*) FILTER (WHERE "referrer-null" IS NULL) AS "referrer-null",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) AS "timezone-null"
FROM "example-koalas-null-1"
'''

display.sql(sql)

In the source data, `PercentClear`-type events are recorded over time as people interact with an image during their visit to the [KoalasToTheMax](https://www.koalastothemax.com) website. As a user uncovers more of an image, events are recorded with an increasing percentage clear in `event_subtype`.

There is never a "zero percent" event recorded in the source - rather the data in `event_subtype` is left empty. A NULL and the lack of a value are not equivalent, so to ensure clarity for analysts working with SQL downstream, imagine a decision is taken to concretely distinguish having 0% clear from having no record of how much has been cleared (NULL). The ingestion SQL therefore creates a new column, `percentClear`.

* The column only holds information about percentage of an image cleared (from `event_subtype`) - everything else is NULL.
* It handles the missing "zero percent" by recording a zero to indicate the start of their image-clearing journey - distinguishing it from a NULL.
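The rule described above can be sketched in a few lines of Python, with `None` standing in for NULL. The `percent_clear` helper and the non-`PercentClear` event name are hypothetical, and keeping the result as text is an assumption of this sketch.

```python
def percent_clear(event_type, event_subtype):
    """Return the percentage cleared, "0" for the empty start event, or None (NULL)."""
    if event_type != "PercentClear":
        return None  # not an image-clearing event: nothing is known
    return "0" if event_subtype == "" else event_subtype

# Invented events: (event_type, event_subtype)
events = [("PercentClear", ""), ("PercentClear", "55"), ("OtherEvent", "abc")]
values = [percent_clear(event_type, subtype) for event_type, subtype in events]
print(values)
```

The first event yields a concrete zero, while the last yields `None`: a known 0% is kept distinct from having no record at all.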

Run the following cell to see how this shows up in the data.

In [None]:
sql='''
SELECT
  CONCAT("percentClear",'%') AS "Percentage Cleared",
  COUNT(*) AS "events"
FROM "example-koalas-null-1"
WHERE "percentClear" IS NOT NULL
GROUP BY "percentClear"
ORDER BY CAST("percentClear" AS DOUBLE)
'''

display.sql(sql)

In the WHERE clause, notice the IS NOT NULL filter against `percentClear`. This ensures that the counts in the results only concern table rows related to `PercentClear`-type events. You may want to adjust the SQL above to see what effect removing this filter has.

## Scalar functions

In the following SQL, some simple string [scalar functions](https://druid.apache.org/docs/latest/querying/sql-scalar) are used to output a number of new values.

Run the cell to see how a NULL value affects results.

In [None]:
sql='''
SELECT
  CONCAT("timezone",' timezone') AS "timezone",
  LENGTH("timezone") AS "length",
  REPLACE("timezone",'T',' timezone') AS "easyToRead",
  REVERSE("timezone") AS "backwards",
  COUNT(*) AS "events"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1, 2, 3, 4
'''

display.sql(sql)

Use [NVL or COALESCE](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions) to return another value when an expression IS NULL.

Run the following cell to see NVL being used, together with a simple COALESCE example.

If the `timezone` is NULL, the value "UTC" is returned instead.

In [None]:
sql='''
SELECT
  "timezone",
  COALESCE("timezone",'UTC') AS "timezone-coalesce",
  NVL("timezone",'UTC') AS "timezone-nvl",
  "session_length"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

In the following example, COALESCE addresses all three of the timezone-specific session lengths, returning the first value in the list that is not NULL.
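As a rough mental model, COALESCE behaves like the small Python function below, with `None` standing in for NULL. The `coalesce` helper and the sample tuples are illustrative only.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

# Invented rows: (EDT-only, PDT-only, others) session lengths.
session_lengths = [(30, None, None), (None, 45, None), (None, None, 7), (None, None, None)]
combined = [coalesce(*row) for row in session_lengths]
print(combined)

# With two arguments, the same helper mirrors NVL.
print(coalesce(None, "UTC"))
```

A zero or an empty string is still a value, so `coalesce(0, 5)` returns `0` rather than skipping ahead to `5`.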

In [None]:
sql='''
SELECT
  COALESCE("timezone",'UTC') AS "timezone-coalesce",
  NVL("timezone",'UTC') AS "timezone-nvl",
  "session_length",
  "session_length-EDTonly",
  "session_length-PDTonly",
  "session_length-others",
  COALESCE("session_length-EDTonly","session_length-PDTonly","session_length-others") AS "sessionLength-coalesce"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

## Aggregations and NULL

In this section you will see examples of how aggregation functions like COUNT handle data that contains NULL values.

Run this cell to see the source data that will be used:

In [None]:
sql='''
SELECT
  "event_type",
  "percentClear",
  "session_length",
  "timezone"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

Run the following cell to see how COUNT works with NULL data.

* A total number of all the rows is output as `totalRows`.
* A count of all rows with a NULL `timezone` is made and output as `null-timezone-rows`.
* A count is made of the number of rows where `timezone` contains a non-NULL value, output as `nonNull-timezone-rows`.
* The NULL and non-NULL row counts are added together and output as `totalRows-2`, showing that they equal `totalRows`.

In [None]:
sql='''
SELECT
  COUNT(*) AS "totalRows",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) AS "null-timezone-rows",
  COUNT(*) FILTER (WHERE "timezone" IS NOT NULL) AS "nonNull-timezone-rows",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) + COUNT(*) FILTER (WHERE "timezone" IS NOT NULL) AS "totalRows-2"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

Rather than filtering where `timezone` IS NOT NULL, another form of COUNT can be used to count rows that contain data by specifying the `timezone` dimension instead of a "*".
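The difference between the two forms of COUNT is standard SQL behavior, so it can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in SQL engine with an invented five-row table. SQLite is not Druid, so treat it only as an illustration of the semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timezone TEXT)")
# Two of the five invented rows have no timezone (None becomes NULL).
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("EDT",), ("PDT",), (None,), ("EDT",), (None,)],
)

# COUNT(*) counts rows; COUNT(column) counts rows where the column is not NULL.
total, non_null = conn.execute(
    "SELECT COUNT(*), COUNT(timezone) FROM events"
).fetchone()
null_rows = conn.execute(
    "SELECT COUNT(*) FROM events WHERE timezone IS NULL"
).fetchone()[0]
conn.close()

print(total, non_null, null_rows)
```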

In [None]:
sql='''
SELECT
  COUNT(*) AS "totalRows",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) AS "null-timezone-rows",
  COUNT(*) FILTER (WHERE "timezone" IS NOT NULL) AS "nonNull-timezone-rows",
  COUNT("timezone") AS "nonNull-timezone-rows-2",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) + COUNT("timezone") AS "totalRows-2"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

The next cell counts the number of distinct timezones.

In [None]:
sql='''
SELECT
  COUNT(DISTINCT "timezone") AS "distinct-timezones"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

Run the following cell to see that, as with COUNT on a specific column, NULL does not count as a separate value in COUNT DISTINCT operations.
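The same behavior can be reproduced outside Druid. This sketch again uses Python's `sqlite3` module as a local stand-in, with invented data, and should be read as an illustration of standard SQL semantics rather than of Druid itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timezone TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("EDT",), ("PDT",), (None,), ("EDT",), (None,)],
)

# NULL is not treated as one of the distinct values.
distinct_all = conn.execute(
    "SELECT COUNT(DISTINCT timezone) FROM events"
).fetchone()[0]
# Restricting the input to NULL rows leaves nothing to count.
distinct_null_only = conn.execute(
    "SELECT COUNT(DISTINCT timezone) FROM events WHERE timezone IS NULL"
).fetchone()[0]
conn.close()

print(distinct_all, distinct_null_only)
```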

In [None]:
sql='''
SELECT
  COUNT(DISTINCT "timezone") AS "distinct-timezones",
  COUNT(DISTINCT "timezone") FILTER (WHERE "timezone" IS NOT NULL) AS "distinctNonNull-timezones",
  COUNT(DISTINCT "timezone") FILTER (WHERE "timezone" IS NULL) AS "distinctNull-timezones"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

The following query shows that NULL is returned as a separate row during GROUP BY.

In [None]:
sql='''
SELECT
  "timezone",
  COUNT(*) AS "totalEvents",
  SUM("session_length") AS "totalSessionLength",
  STRING_FORMAT('%.3f',AVG("session_length")) AS "avgSessionLength",
  MAX("session_length") AS "maxSessionLength",
  MIN("session_length") AS "minSessionLength"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1
ORDER BY timezone DESC
'''

display.sql(sql)

Recall that the `session_length-EDTonly` and `session_length-PDTonly` dimensions only contain a session length in seconds when the timezone is EDT or PDT respectively; otherwise they contain a NULL.

Run the following SQL to see that NULLs are ignored for other aggregations, too.

* As above, `totalEvents` is the complete number of rows in the data.
* `totalEvents-EDT` and `totalEvents-PDT` show the number of events that are known to have been recorded in the EDT and PDT timezones.
* SUM and MAX specifically reference the `session_length-EDTonly` and `session_length-PDTonly` dimensions.

Cross-reference the results of the query above with the results of this query.

In [None]:
sql='''
SELECT
  "timezone",
  COUNT(*) AS "totalEvents",
  COUNT(*) FILTER (WHERE "timezone" = 'EDT') AS "totalEvents-EDT",
  SUM("session_length-EDTonly") AS "totalSessionLength-EDT",
  MAX("session_length-EDTonly") AS "maxSessionLength-EDT",
  COUNT(*) FILTER (WHERE "timezone" = 'PDT') AS "totalEvents-PDT",
  SUM("session_length-PDTonly") AS "totalSessionLength-PDT",
  MAX("session_length-PDTonly") AS "maxSessionLength-PDT"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1
'''

display.sql(sql)

Some aggregation functions [return NULL](https://druid.apache.org/docs/latest/querying/sql-aggregations/) by default.

In the following query, COALESCE is used to combine the results of the query above into individual columns.

* The two arguments for each COALESCE are aggregates of EDT and PDT using the relevant specific dimensions.
* Each aggregate returns NULL if no rows are found to perform the aggregation on.
* GROUP BY produces one row for each timezone, including a row for NULL.
* COALESCE picks the first non-NULL value:
  * For the EDT row, COALESCE returns the aggregation on the EDT column.
  * For the PDT row, COALESCE returns the aggregation on the PDT column.
  * For all other rows, there is no value to return - thus NULL is returned.
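The steps above can be sketched locally with Python's `sqlite3` module acting as a stand-in SQL engine. The table, the simplified column names `edt_only` and `pdt_only`, and the values are all invented for illustration, and SQLite is not Druid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timezone TEXT, edt_only INTEGER, pdt_only INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("EDT", 10, None),
        ("EDT", 20, None),
        ("PDT", None, 5),
        ("CEST", None, None),
        (None, None, None),
    ],
)

# Each SUM yields NULL when its group holds no non-NULL input,
# so COALESCE can pick whichever aggregate produced a value.
rows = conn.execute(
    """
    SELECT timezone,
           SUM(edt_only),
           SUM(pdt_only),
           COALESCE(SUM(edt_only), SUM(pdt_only))
    FROM events
    GROUP BY timezone
    """
).fetchall()
conn.close()

results = {tz: (edt, pdt, combined) for tz, edt, pdt, combined in rows}
for tz, values in results.items():
    print(tz, values)
```

In this sketch the CEST group and the NULL group have nothing to sum in either column, so their combined value stays NULL.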

In [None]:
sql='''
SELECT
  "timezone",
  COALESCE(SUM("session_length-EDTonly"), SUM("session_length-PDTonly")) AS "total_session_length",
  COALESCE(MAX("session_length-EDTonly"), MAX("session_length-PDTonly")) AS "max_session_length"
FROM "example-koalas-null-1"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1
'''

display.sql(sql)

## Arrays and NULL

Run the following cell to create a new table, `example-koalas-null-2`.

The SQL for this ingestion includes a GROUP BY clause to roll up the source data.
* The `timestamp` is parsed with TIME_PARSE and then floored to the nearest 15 minutes using TIME_FLOOR.
* The [ARRAY_AGG](https://druid.apache.org/docs/latest/querying/sql-aggregations) aggregator takes a number of rows and produces an array, effectively tagging each resulting row with the timezones that it relates to, and storing the result in `timezone_array`.
* Other aggregates, such as SUM, MAX, and COUNT provide other information about the rolled up rows.

Notice that this ingestion API call includes the [`arrayIngestMode` context parameter](https://druid.apache.org/docs/latest/querying/sql-data-types#arrays) to instruct Druid to create a true array-type field for `timezone_array`.

In [None]:
sql='''
REPLACE INTO "example-koalas-null-2" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("timestamp"),'PT15M') AS "__time",
  ARRAY_AGG(DISTINCT CASE WHEN "timezone" = 'N/A' THEN NULL
    ELSE "timezone"
    END) AS "timezone_array",
  SUM("session_length") AS "total_session_length",
  MAX(CASE WHEN ("event_type" = 'PercentClear') THEN
      (CASE WHEN ("event_subtype" = '') THEN 0 ELSE CAST("event_subtype" AS DOUBLE) END)
      ELSE NULL END) AS "max_percentClear",
  COUNT(*) FILTER (WHERE "event_type" = 'PercentClear') AS "clear_events",
  COUNT(*) AS "events"
FROM "ext"
GROUP BY 1
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("arrayIngestMode", "array")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-null-2')
display.table('example-koalas-null-2')

You can return the position of values in an array using ARRAY_OFFSET_OF or ARRAY_ORDINAL_OF.

Run the next cell to see these functions being used to find NULL values in the table.

In [None]:
sql='''
SELECT
  ARRAY_OFFSET_OF("timezone_array", NULL) AS "array_offset-NULL",
  ARRAY_ORDINAL_OF("timezone_array", NULL) AS "array_ordinal-NULL",
  COUNT(*) AS "events"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:00/PT15M')
GROUP BY 1, 2
'''

display.sql(sql)

ARRAY_OFFSET_OF and ARRAY_ORDINAL_OF [array functions](https://druid.apache.org/docs/latest/querying/sql-array-functions) return NULL if a value cannot be found in an array. Run the following cell to see this in action.
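A rough Python model of the two functions may help when reading the results. It assumes, based on the Druid array function documentation as understood here, that the offset is 0-based and the ordinal is 1-based. The helper names and the sample array are invented, and `None` stands in for NULL.

```python
def array_offset_of(arr, value):
    """0-based position of the first match, or None when the value is absent."""
    for index, item in enumerate(arr):
        if item == value:
            return index
    return None

def array_ordinal_of(arr, value):
    """1-based position of the first match, or None when the value is absent."""
    offset = array_offset_of(arr, value)
    return None if offset is None else offset + 1

timezone_array = ["EDT", None, "PDT"]  # invented example

print(array_offset_of(timezone_array, "PDT"), array_ordinal_of(timezone_array, "PDT"))
print(array_offset_of(timezone_array, None), array_ordinal_of(timezone_array, None))
print(array_offset_of(timezone_array, "GMT"), array_ordinal_of(timezone_array, "GMT"))
```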

In [None]:
sql='''
SELECT
  ARRAY_OFFSET_OF("timezone_array", 'GMT') AS "array_offset-GMT",
  ARRAY_ORDINAL_OF("timezone_array", 'GMT') AS "array_ordinal-GMT",
  COUNT(*) AS "events"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:00/PT15M')
GROUP BY 1, 2
'''

display.sql(sql)

Use [UNNEST](https://druid.apache.org/docs/latest/querying/sql/#unnest) to explode an array. Together with CROSS JOIN, this allows a row to be returned for each entry in the array.

In the following SQL, the `timezone_array` is unnested and then joined to the `example-koalas-null-2` table, producing one row per array entry, across which a count of events is generated. The HAVING clause ensures only those timezones recorded against more than 10 rows in the data are returned.

Notice how there is a row for NULL, accounting for any source rows that have a NULL timezone.
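Conceptually, the unnest-and-join step turns each array entry into its own row before the grouping happens. The Python sketch below mimics that with invented rows, using `None` for NULL. It illustrates the idea only, not how Druid executes the query.

```python
from collections import Counter

# Invented rolled-up rows, each holding an array of timezones.
rows = [
    {"timezone_array": ["EDT", "PDT", None]},
    {"timezone_array": ["EDT"]},
    {"timezone_array": [None, "CEST"]},
]

# One output row per (source row, array entry) pair, as with CROSS JOIN UNNEST.
unnested = [
    (row_id, timezone)
    for row_id, row in enumerate(rows)
    for timezone in row["timezone_array"]
]

# GROUP BY timezone with COUNT(*); None is kept as its own group.
counts = Counter(timezone for _, timezone in unnested)
print(counts)
```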

In [None]:
sql='''
SELECT
  timezone_array.timezone AS "timezone",
  COUNT(*) AS "events"
FROM "example-koalas-null-2"
CROSS JOIN
  UNNEST("timezone_array") AS timezone_array(timezone)
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:00/PT6H')
GROUP BY 1
HAVING COUNT(*) > 10
ORDER BY COUNT(*) DESC
'''

display.sql(sql)

## Clean up

Run the following cell to remove the two tables used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-null-1")
druid.datasources.drop("example-koalas-null-2")

## Summary

* Druid has two modes for handling and storing NULL values - Druid 28.0.0 defaults to SQL-compatible NULL handling.
* You can transform source data to explicitly store NULLs using CASE statements.
* Aggregation and scalar functions handle NULL values in defined ways.

## Learn more

* Read the documentation on:
  * Enabling and disabling [SQL-compatible NULL handling](https://druid.apache.org/docs/latest/querying/sql-data-types#null-values) using `druid.generic.useDefaultValueForNull`.
  * How Druid stores [NULL during ingestion](https://druid.apache.org/docs/latest/design/segments#handling-null-values).
  * The default returned value for different [aggregations](https://druid.apache.org/docs/latest/querying/sql-aggregations/).
  * How [ARRAY functions](https://druid.apache.org/docs/latest/querying/sql-array-functions) work when not using SQL-compatible mode.
* If you tend to use native rather than SQL queries, read about the [NULL filter](https://druid.apache.org/docs/latest/querying/filters#null-filter) in the documentation.
* See the [table of default values](https://druid.apache.org/docs/latest/querying/sql-data-types/#standard-types) stored during ingestion when SQL-compatible NULL-handling is not turned on.
* Follow the [notebook on GROUP BY](./01-groupby.ipynb) to see how NULL appears in [GROUPING SETS](https://druid.apache.org/docs/latest/querying/sql/#group-by).
* Try out other scalar functions with NULL - check out the dedicated notebooks on [datetime](./07-functions-datetime.ipynb), [string](./08-functions-strings.ipynb), and [IP address](./10-functions-ip.ipynb) functions for examples.