# Handling NULL values using functions and operators
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

In this notebook, you will see examples of various functions and operators used with NULL values.

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

> __Using versions of Apache Druid prior to this may yield unexpected results.__
> 
> There are two modes for [NULL-handling](https://druid.apache.org/docs/latest/querying/sql-data-types#null-values) in Apache Druid. In Druid 28 and above, the default is SQL-compatible NULL handling. Choose the mode by setting the `druid.generic.useDefaultValueForNull` runtime property. Read more in the [handling null values](https://druid.apache.org/docs/latest/design/segments/#handling-null-values) documentation.
>
> Apache Druid can use SQL-compatible mode for boolean operators. Read more about how to turn [strict boolean behavior](https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes) on or off in configuration through `druid.expressions.useStrictBooleans`.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [83]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status
status_client.version

Opening a connection to http://router:8888.


'28.0.0-SNAPSHOT'

## Testing for NULL

Run the following ingestion to create a table called `example-koalas-null-1`.

You will use this later to see examples of IS NULL and IS NOT NULL.

Notice, too, that there are CASE statements which purposefully inject true NULL into the table under certain conditions:
* `referrer-null` contains a NULL whenever `referrer` has a value of "Direct" - else the original value from `referrer` is stored.
* A new column called `percentClear` is created, correcting for missing values in the source data and outputting NULL in other situations.

In [84]:
sql='''
REPLACE INTO "example-koalas-null-1" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "referrer",
  CASE WHEN "referrer" = 'Direct' THEN NULL
    ELSE "referrer" END AS "referrer-null",
  "event_type",
  "event_subtype",
  CASE WHEN ("event_type" = 'PercentClear') THEN
      (CASE WHEN ("event_subtype" = '') THEN 0 ELSE "event_subtype" END)
      ELSE NULL END AS "percentClear"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-null-1')
display.table('example-koalas-null-1')

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:12<00:00,  8.20it/s]


Position,Name,Type
1,__time,TIMESTAMP
2,referrer,VARCHAR
3,referrer-null,VARCHAR
4,event_type,VARCHAR
5,event_subtype,VARCHAR
6,percentClear,VARCHAR


Let us imagine that the developer of the KoalasToTheMax website had to work around a lack of "NULL" support in the data format being ingested by putting the word "Direct" into the `referrer` column whenever the referrer could not be identified.

Knowing this, you might either correct the data in place or, as is done here, create a new dimension that stores these values as true NULLs.

Review the CASE statement that produces the `referrer-null` column: anything recorded as "Direct" in `referrer` is stored as a true NULL in this new column.

Run the following cell to check the output of the CASE statement. You will see a count of records in `referrer` with the value `Direct` and the equivalent count of NULL values in the `referrer-null` column, using the FILTER (WHERE...) clause of the COUNT function.

In [85]:
sql='''
SELECT
  COUNT(*) FILTER (WHERE "referrer" = 'Direct') AS "referrer",
  COUNT(*) FILTER (WHERE "referrer-null" IS NULL) AS "referrer-null"
FROM "example-koalas-null-1"
'''

display.sql(sql)

referrer,referrer-null
192328,192328


In the source data, `PercentClear`-type events are recorded as people interact with an image on the [KoalasToTheMax](https://www.koalastothemax.com) website.

As a user uncovers more of an image, events are recorded with an increasing percentage clear recorded in `event_subtype`. There is never a "zero percent" event recorded in the source - rather the data in `event_subtype` is left empty.

The CASE statement creates a new column, `percentClear`.
* It ensures the column only holds information about percentage of an image cleared (from `event_subtype`) and nothing else (storing NULL).
* It handles the missing "zero percent" and records a true zero.

Having no value in some row in source data, such as in this example data, is very often not the same as NULL. Therefore, the CASE statement tests for an empty string, rather than using the IS NULL test.
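
The difference between an empty string and NULL can be sketched in plain Python, with `None` standing in for NULL. This is only a model of the two CASE expressions in the ingestion above, not code that Druid runs. The function names `referrer_null` and `percent_clear` are made up for illustration.

```python
from typing import Optional


def referrer_null(referrer: Optional[str]) -> Optional[str]:
    """Model of the "referrer-null" CASE: the placeholder 'Direct' becomes NULL."""
    return None if referrer == "Direct" else referrer


def percent_clear(event_type: Optional[str], event_subtype: Optional[str]) -> Optional[str]:
    """Model of the "percentClear" CASE expression (hypothetical helper)."""
    if event_type != "PercentClear":
        return None  # ELSE NULL: other event types carry no percentage
    # An empty string is a value, not NULL, so it is tested with = ''.
    return "0" if event_subtype == "" else event_subtype


print(referrer_null("Direct"))              # None
print(percent_clear("PercentClear", ""))    # 0
print(percent_clear("PercentClear", "35"))  # 35
print(percent_clear("GoodLoad", ""))        # None
```

Note that the empty string only becomes a zero for `PercentClear` events; for every other event type the result is NULL regardless of what `event_subtype` holds.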

Run the following cell to see how this shows up in the data.

In [86]:
sql='''
SELECT
  CONCAT("percentClear",'%') AS "Percentage Cleared",
  COUNT(*) AS "events"
FROM "example-koalas-null-1"
WHERE "percentClear" IS NOT NULL
GROUP BY "percentClear"
ORDER BY CAST("percentClear" AS DOUBLE)
'''

display.sql(sql)

Percentage Cleared,events
0%,32499
5%,23599
10%,21978
15%,20769
20%,19675
25%,18576
30%,17532
35%,16519
40%,15496
45%,14480


Notice the IS NOT NULL test is applied to `percentClear`. Because of the CASE statement, we can be sure that only rows relating to "PercentClear" events are included in the calculation.

Use NVL or COALESCE to return another value when an expression IS NULL.

Run the following ingestion to create a new table, `example-koalas-null-2`, that expands on the first table.

* A CASE statement corrects values of "N/A" to NULL in `timezone`.
* A new field, `session_length-EDTonly`, is added that only contains values for EDT timezone events.
* A new field, `session_length-PDTonly`, is added that only contains values for PDT timezone events.
* A new field, `session_length-others`, is added that contains the session length for any timezone other than EDT or PDT.

In [131]:
sql='''
REPLACE INTO "example-koalas-null-2" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "referrer",
  CASE WHEN "referrer" = 'Direct' THEN NULL
    ELSE "referrer" END AS "referrer-null",
  "loaded_image",
  "event_type",
  "event_subtype",
  CASE WHEN ("event_type" = 'PercentClear') THEN
      (CASE WHEN ("event_subtype" = '') THEN 0 ELSE "event_subtype" END)
      ELSE NULL END AS "percentClear",
  "session",
  "session_length",
  CASE WHEN ("timezone" = 'EDT') THEN "session_length" ELSE NULL END AS "session_length-EDTonly",
  CASE WHEN ("timezone" = 'PDT') THEN "session_length" ELSE NULL END AS "session_length-PDTonly",
  CASE WHEN ("timezone" <> 'EDT' AND "timezone" <> 'PDT') THEN "session_length" ELSE NULL END AS "session_length-others",
  "platform",
  "agent_category",
  "continent",
  "language",
  CASE WHEN "timezone" = 'N/A' THEN NULL
    ELSE "timezone"
    END AS "timezone"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-null-2')
display.table('example-koalas-null-2')

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:18<00:00,  5.46it/s]            


Position,Name,Type
1,__time,TIMESTAMP
2,referrer,VARCHAR
3,referrer-null,VARCHAR
4,loaded_image,VARCHAR
5,event_type,VARCHAR
6,event_subtype,VARCHAR
7,percentClear,VARCHAR
8,session,VARCHAR
9,session_length,BIGINT
10,session_length-EDTonly,BIGINT


Run the following cell to see NVL being used, together with a simple COALESCE example.

If the `timezone` is NULL, the value "UTC" is returned instead.

In [132]:
sql='''
SELECT
  "timezone",
  COALESCE("timezone",'UTC') AS "timezone-coalesce",
  NVL("timezone",'UTC') AS "timezone-nvl",
  "session_length"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

timezone,timezone-coalesce,timezone-nvl,session_length
PDT,PDT,PDT,144642
,UTC,UTC,53529
CEST,CEST,CEST,49
,UTC,UTC,39488
,UTC,UTC,57443
,UTC,UTC,1070453
,UTC,UTC,2487
CEST,CEST,CEST,2302
,UTC,UTC,17963
,UTC,UTC,60384


In the following example, COALESCE is given all three of the timezone-specific session lengths and returns the first one in the list that is not NULL.

In [134]:
sql='''
SELECT
  COALESCE("timezone",'UTC') AS "timezone-coalesce",
  NVL("timezone",'UTC') AS "timezone-nvl",
  "session_length",
  "session_length-EDTonly",
  "session_length-PDTonly",
  "session_length-others",
  COALESCE("session_length-EDTonly","session_length-PDTonly","session_length-others") AS "sessionLength-coalesce"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

timezone-coalesce,timezone-nvl,session_length,session_length-EDTonly,session_length-PDTonly,session_length-others,sessionLength-coalesce
PDT,PDT,144642,,144642.0,,144642
UTC,UTC,53529,,,53529.0,53529
CEST,CEST,49,,,49.0,49
UTC,UTC,39488,,,39488.0,39488
UTC,UTC,57443,,,57443.0,57443
UTC,UTC,1070453,,,1070453.0,1070453
UTC,UTC,2487,,,2487.0,2487
CEST,CEST,2302,,,2302.0,2302
UTC,UTC,17963,,,17963.0,17963
UTC,UTC,60384,,,60384.0,60384
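
The way COALESCE and NVL pick a value can be modeled in a few lines of Python, with `None` standing in for NULL. This is a minimal sketch of the semantics, not how Druid implements them. The helper names `coalesce` and `nvl` are illustrative, and the sample values are taken from rows in the output above.

```python
from typing import Any, Optional


def coalesce(*values: Any) -> Optional[Any]:
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


def nvl(value: Any, default: Any) -> Any:
    """Two-argument form, equivalent to coalesce(value, default)."""
    return coalesce(value, default)


# Arguments are ordered (EDT-only, PDT-only, others), as in the query above.
print(coalesce(None, 144642, None))  # 144642
print(coalesce(None, None, 53529))   # 53529
print(nvl(None, "UTC"))              # UTC
print(nvl("CEST", "UTC"))            # CEST
```

Because only `None` is skipped, an empty string or a zero would be returned as-is, which is why the ingestion converts placeholders to true NULL first.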


## Aggregations and NULL

In this section you will see examples of how aggregation functions like COUNT handle data that contains NULL values.

Run this cell to see the source data that will be used:

In [135]:
sql='''
SELECT
  "event_type",
  "percentClear",
  "session_length",
  "timezone"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

event_type,percentClear,session_length,timezone
PercentClear,35.0,144642,PDT
PercentClear,5.0,53529,
GoodLoad,,49,CEST
PercentClear,30.0,39488,
PercentClear,5.0,57443,
LayerClear,,1070453,
GoodLoad,,2487,
LayerClear,,2302,CEST
PercentClear,15.0,17963,
PercentClear,55.0,60384,


Run the following cell to see how COUNT works with NULL data.

* A total number of all the rows is output as `totalRows`.
* A count of all rows with a NULL `timezone` is made and output as `null-timezone-rows`.
* A count is made of the number of rows where `timezone` contains a non-NULL value, output as `nonNull-timezone-rows`.
* The NULL and non-NULL row counts are added together, showing they total `totalRows`.

In [136]:
sql='''
SELECT
  COUNT(*) AS "totalRows",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) AS "null-timezone-rows",
  COUNT(*) FILTER (WHERE "timezone" IS NOT NULL) AS "nonNull-timezone-rows",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) + COUNT(*) FILTER (WHERE "timezone" IS NOT NULL) AS "totalRows-2"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

totalRows,null-timezone-rows,nonNull-timezone-rows,totalRows-2
16,11,5,16


Rather than filtering where `timezone` IS NOT NULL, another form of COUNT can be used to count rows that contain data by specifying the `timezone` dimension instead of a "*".

In [137]:
sql='''
SELECT
  COUNT(*) AS "totalRows",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) AS "null-timezone-rows",
  COUNT(*) FILTER (WHERE "timezone" IS NOT NULL) AS "nonNull-timezone-rows",
  COUNT("timezone") AS "nonNull-timezone-rows-2",
  COUNT(*) FILTER (WHERE "timezone" IS NULL) + COUNT("timezone") AS "totalRows-2"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

totalRows,null-timezone-rows,nonNull-timezone-rows,nonNull-timezone-rows-2,totalRows-2
16,11,5,5,16


The next cell counts the number of distinct timezones.

In [138]:
sql='''
SELECT
  COUNT(DISTINCT "timezone") AS "distinct-timezones"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

distinct-timezones
3


Run the following cell to see that, as with COUNT on a specific column, NULL does not count as a separate value in COUNT DISTINCT operations.

In [140]:
sql='''
SELECT
  COUNT(DISTINCT "timezone") AS "distinct-timezones",
  COUNT(DISTINCT "timezone") FILTER (WHERE "timezone" IS NOT NULL) AS "distinctNonNull-timezones",
  COUNT(DISTINCT "timezone") FILTER (WHERE "timezone" IS NULL) AS "distinctNull-timezones"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
'''

display.sql(sql)

distinct-timezones,distinctNonNull-timezones,distinctNull-timezones
3,3,0
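
These COUNT rules are standard SQL, so they can be reproduced without a Druid cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in, with a hypothetical `events` table holding the same mix of timezones as the 16 rows queried above. It shows the general pattern only and does not connect to Druid.

```python
import sqlite3

# Hypothetical stand-in data: 5 rows with a timezone and 11 rows with NULL.
timezones = ["PDT", "EDT", "CEST", "CEST", "CEST"] + [None] * 11

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timezone TEXT)")
conn.executemany("INSERT INTO events VALUES (?)", [(tz,) for tz in timezones])

# A CASE inside SUM is used in place of FILTER (WHERE ...) so that the
# query also runs on older SQLite builds.
total_rows, null_rows, non_null_rows, distinct_timezones = conn.execute(
    """
    SELECT
      COUNT(*),
      SUM(CASE WHEN timezone IS NULL THEN 1 ELSE 0 END),
      COUNT(timezone),
      COUNT(DISTINCT timezone)
    FROM events
    """
).fetchone()
conn.close()

print(total_rows, null_rows, non_null_rows, distinct_timezones)  # 16 11 5 3
```

`COUNT(*)` counts every row, while `COUNT(timezone)` and `COUNT(DISTINCT timezone)` skip the NULL rows.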


The following query shows that NULL is returned as a separate row during GROUP BY.

In [141]:
sql='''
SELECT
  "timezone",
  COUNT(*) AS "totalEvents",
  SUM("session_length") AS "totalSessionLength",
  STRING_FORMAT('%.3f',AVG("session_length")) AS "avgSessionLength",
  MAX("session_length") AS "maxSessionLength",
  MIN("session_length") AS "minSessionLength"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1
ORDER BY timezone DESC
'''

display.sql(sql)

timezone,totalEvents,totalSessionLength,avgSessionLength,maxSessionLength,minSessionLength
PDT,1,144642,144642.0,144642,144642
EDT,1,131536,131536.0,131536,131536
CEST,3,4658,1552.667,2307,49
,11,1555276,141388.727,1070453,43
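
Grouping NULL values into a single group of their own is also standard SQL behavior. As a hedged illustration, the sketch below again uses Python's built-in `sqlite3` module as a stand-in for Druid. The `events` table and its five rows are hypothetical, with session lengths borrowed from earlier outputs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timezone TEXT, session_length INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("PDT", 144642),
        ("CEST", 49),
        ("CEST", 2302),
        (None, 53529),
        (None, 39488),
    ],
)

rows = conn.execute(
    """
    SELECT timezone, COUNT(*), SUM(session_length)
    FROM events
    GROUP BY timezone
    """
).fetchall()
conn.close()

# Key the result by timezone; the None key holds the NULL group.
groups = {tz: (events, total) for tz, events, total in rows}
print(groups[None])    # (2, 93017)
print(groups["CEST"])  # (2, 2351)
print(groups["PDT"])   # (1, 144642)
```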


Recall that the `session_length-EDTonly` and `session_length-PDTonly` dimensions only contain a session length in seconds when the timezone is EDT or PDT respectively; otherwise they contain NULL.

Run the following SQL to see that NULLs are ignored for other aggregations, too.

* As above, `totalEvents` is the complete number of rows in the data.
* `totalEvents-EDT` and `totalEvents-PDT` show the number of events that are known to have been recorded in the EDT and PDT timezones.
* SUM, AVG, and MAX specifically reference the `session_length-EDTonly` and `session_length-PDTonly` dimensions.

Cross-reference the results of the query above with the results of this query.

In [147]:
sql='''
SELECT
  "timezone",
  COUNT(*) AS "totalEvents",
  COUNT(*) FILTER (WHERE "timezone" = 'EDT') AS "totalEvents-EDT",
  SUM("session_length-EDTonly") AS "totalSessionLength-EDT",
  STRING_FORMAT('%.3f',AVG("session_length-EDTonly")) AS "avgSessionLength-EDT",
  MAX("session_length-EDTonly") AS "maxSessionLength-EDT",
  COUNT(*) FILTER (WHERE "timezone" = 'PDT') AS "totalEvents-PDT",
  SUM("session_length-PDTonly") AS "totalSessionLength-PDT",
  MAX("session_length-PDTonly") AS "maxSessionLength-PDT",
  STRING_FORMAT('%.3f',AVG("session_length-PDTonly")) AS "avgSessionLength-PDT"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1
'''

display.sql(sql)

timezone,totalEvents,totalEvents-EDT,totalSessionLength-EDT,avgSessionLength-EDT,maxSessionLength-EDT,totalEvents-PDT,totalSessionLength-PDT,maxSessionLength-PDT,avgSessionLength-PDT
,11,0,,nul,,0,,,nul
CEST,3,0,,nul,,0,,,nul
EDT,1,1,131536.0,131536.000,131536.0,0,,,nul
PDT,1,0,,nul,,1,144642.0,144642.0,144642.000


Some aggregation functions [return NULL](https://druid.apache.org/docs/latest/querying/sql-aggregations/) by default.

In the following query, COALESCE is used to combine the results of the query above into individual columns.

* The two arguments for each COALESCE are aggregates of EDT and PDT using the relevant specific dimensions.
* Each aggregate returns NULL if no rows are found to perform the aggregation on.
* GROUP BY produces rows, one for each timezone.
* COALESCE picks the first non-NULL value:
  * For the EDT row, COALESCE returns the aggregation on the EDT column.
  * For the PDT row, COALESCE returns the aggregation on the PDT column.
  * For all other rows, there is no value to return - thus NULL is returned.

In [151]:
sql='''
SELECT
  "timezone",
  COALESCE(SUM("session_length-EDTonly"), SUM("session_length-PDTonly")) AS "total_session_length",
  COALESCE(MAX("session_length-EDTonly"), MAX("session_length-PDTonly")) AS "max_session_length"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1
'''

display.sql(sql)

timezone,total_session_length,max_session_length
,,
CEST,,
EDT,131536.0,131536.0
PDT,144642.0,144642.0
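
The pattern of aggregates returning NULL when a group has no non-NULL input, and COALESCE then choosing between them, can be tried outside Druid as well. This sketch uses Python's built-in `sqlite3` module as a stand-in. The `events` table and the column names `edt_only` and `pdt_only` are hypothetical simplifications of the dimensions above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timezone TEXT, edt_only INTEGER, pdt_only INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("EDT", 131536, None),
        ("PDT", None, 144642),
        ("CEST", None, None),
        ("CEST", None, None),
        (None, None, None),
    ],
)

rows = conn.execute(
    """
    SELECT
      timezone,
      SUM(edt_only),
      MAX(pdt_only),
      COALESCE(SUM(edt_only), SUM(pdt_only))
    FROM events
    GROUP BY timezone
    """
).fetchall()
conn.close()

# Map each timezone to (SUM of EDT column, MAX of PDT column, COALESCE result).
results = {tz: (sum_edt, max_pdt, combined) for tz, sum_edt, max_pdt, combined in rows}
print(results["EDT"])   # (131536, None, 131536)
print(results["PDT"])   # (None, 144642, 144642)
print(results["CEST"])  # (None, None, None)
```

For the CEST and NULL groups, both aggregates are NULL, so COALESCE has nothing to return and the result is NULL.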


## Scalar functions

In the following SQL, some simple string scalar functions are used to output a number of new values.

Run the cell to see how a NULL value affects results.

In [157]:
sql='''
SELECT
  CONCAT("timezone",' timezone') AS "timezone",
  LENGTH("timezone") AS "length",
  REPLACE("timezone",'T',' timezone') AS "easyToRead",
  REVERSE("timezone") AS "backwards",
  COUNT(*) AS "events"
FROM "example-koalas-null-2"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T00:00:10/PT5S')
GROUP BY 1, 2, 3, 4
'''

display.sql(sql)

timezone,length,easyToRead,backwards,events
,,,,11
CEST timezone,4.0,CES timezone,TSEC,3
EDT timezone,3.0,ED timezone,TDE,1
PDT timezone,3.0,PD timezone,TDP,1
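
The first row of the output shows every function returning NULL when `timezone` is NULL. That behavior can be modeled in Python with a small decorator that short-circuits to `None` whenever any argument is `None`. This is a sketch of the pattern seen in the output, not Druid's implementation. The decorator name `null_propagating` and the four helpers are made up for illustration.

```python
from functools import wraps


def null_propagating(func):
    """Return None, without calling func, if any argument is None."""
    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@null_propagating
def concat(*parts):
    return "".join(parts)


@null_propagating
def length(text):
    return len(text)


@null_propagating
def replace(text, old, new):
    return text.replace(old, new)


@null_propagating
def reverse(text):
    return text[::-1]


for tz in ("CEST", None):
    print(concat(tz, " timezone"), length(tz), replace(tz, "T", " timezone"), reverse(tz))
# CEST timezone 4 CES timezone TSEC
# None None None None
```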


## Boolean operators and NULL

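In Druid SQL, use the OR and AND operators to combine boolean values. The `||` and `&&` symbols are the logical OR and AND operators of Druid's native [expression language](https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes); in SQL, `||` is the string concatenation operator.

Run the next cell for examples of the OR operator.

* The first two results are considered TRUE.
* The remaining calculations return NULL.

In [None]:
sql='''
SELECT
  TRUE OR CAST(NULL AS BOOLEAN) AS "true-or-null",
  CAST(NULL AS BOOLEAN) OR TRUE AS "null-or-true",
  FALSE OR CAST(NULL AS BOOLEAN) AS "false-or-null",
  CAST(NULL AS BOOLEAN) OR FALSE AS "null-or-false",
  CAST(NULL AS BOOLEAN) OR CAST(NULL AS BOOLEAN) AS "null-or-null"
'''

display.sql(sql)


Run the next cell for examples of the AND operator.

* The first calculation returns TRUE as both operands are TRUE.
* The next two calculations return FALSE.
* The remaining three return NULL.

In [None]:
sql='''
SELECT
  TRUE AND TRUE AS "true-and-true",
  FALSE AND CAST(NULL AS BOOLEAN) AS "false-and-null",
  CAST(NULL AS BOOLEAN) AND FALSE AS "null-and-false",
  TRUE AND CAST(NULL AS BOOLEAN) AS "true-and-null",
  CAST(NULL AS BOOLEAN) AND TRUE AS "null-and-true",
  CAST(NULL AS BOOLEAN) AND CAST(NULL AS BOOLEAN) AS "null-and-null"
'''

display.sql(sql)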
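
The results listed for OR and AND follow SQL's three-valued logic, in which NULL means "unknown". The sketch below models that logic in plain Python, with `None` standing in for NULL. The function names `sql_or` and `sql_and` are illustrative, and the code is a model of the rules rather than anything Druid executes.

```python
from typing import Optional


def sql_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: a single TRUE decides the result, otherwise NULL wins over FALSE."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: a single FALSE decides the result, otherwise NULL wins over TRUE."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


print(sql_or(True, None), sql_or(False, None), sql_or(None, None))
# True None None
print(sql_and(True, True), sql_and(False, None), sql_and(True, None))
# True False None
```

The result is NULL only when the unknown operand could still change the outcome: `TRUE OR NULL` is TRUE whatever the NULL stands for, whereas `FALSE OR NULL` depends on it.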


The following cell shows a query where data is [interpreted as NULL](https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes) according to non-SQL-compatible NULL-handling behavior.

In [None]:
SELECT
100 && 11,
0.7 || 0.3,
100 && 0,
'troo' && 'true',
'troo' || 'true'
FROM X

## Arrays

Use the dedicated [array functions](https://druid.apache.org/docs/latest/querying/sql-array-functions) to work with arrays.

In the following SQL, ARRAY_OFFSET returns NULL when the requested index is out of range, and ARRAY_OFFSET_OF returns NULL when the value cannot be found in the array.

> When SQL-compatible NULL-handling is not being used, ARRAY_OFFSET_OF returns -1 instead. Read more [here](https://druid.apache.org/docs/latest/querying/sql-array-functions).

In [None]:
sql='''
SELECT
  ARRAY_OFFSET(ARRAY['a', 'b', 'c'], 5) AS "outOfRange",
  ARRAY_OFFSET_OF(ARRAY['a', 'b', 'c'], 'z') AS "notFound"
'''

display.sql(sql)
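
The NULL-returning behavior of these two functions can be modeled in Python, with `None` standing in for NULL. This sketch assumes SQL-compatible NULL handling and 0-based indexes. The helper names `array_offset` and `array_offset_of` simply mirror the SQL function names and are not part of any Druid client API.

```python
from typing import Any, Optional, Sequence


def array_offset(arr: Sequence[Any], index: int) -> Optional[Any]:
    """Return the element at a 0-based index, or None when out of range."""
    # Python would accept negative indexes, so the range is checked explicitly.
    if 0 <= index < len(arr):
        return arr[index]
    return None


def array_offset_of(arr: Sequence[Any], value: Any) -> Optional[int]:
    """Return the 0-based index of the first match, or None when not found."""
    for position, element in enumerate(arr):
        if element == value:
            return position
    return None


letters = ["a", "b", "c"]
print(array_offset(letters, 1))       # b
print(array_offset(letters, 5))       # None
print(array_offset_of(letters, "c"))  # 2
print(array_offset_of(letters, "z"))  # None
```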

Run the following cell to see [how UNNEST handles NULL values](https://druid.apache.org/docs/latest/querying/sql/#unnest).

Notice that a record corresponding to each NULL is returned, rather than duplicates being removed.

## Clean up

Run the following cell to remove the tables used in this notebook from the database.

In [130]:
druid.datasources.drop("example-koalas-null-1")
druid.datasources.drop("example-koalas-null-2")

## Summary

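* NULL is not the same as an empty string or a placeholder such as "Direct" or "N/A". Use CASE statements at ingestion time to store true NULL values.
* Use IS NULL and IS NOT NULL to test for NULL, and NVL or COALESCE to return another value when an expression is NULL.
* COUNT(*) counts every row, while COUNT on a column, COUNT(DISTINCT), SUM, AVG, MAX, and MIN ignore NULL values. Some aggregations return NULL when there are no non-NULL values to aggregate.
* GROUP BY returns NULL as a separate row.
* The string scalar functions shown here return NULL when their input is NULL.
* OR and AND return NULL when the result depends on a NULL operand.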

## Learn more

* Read the documentation on:
  * Enabling and disabling [SQL-compatible NULL handling](https://druid.apache.org/docs/latest/querying/sql-data-types#null-values) using `druid.generic.useDefaultValueForNull`
  * How Druid stores [NULL during ingestion](https://druid.apache.org/docs/latest/design/segments#handling-null-values)
  * The default returned value for different [aggregations](https://druid.apache.org/docs/latest/querying/sql-aggregations/)
  * [Logical operator](https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes) modes
* If you tend to use native rather than SQL queries, read about the [NULL filter](https://druid.apache.org/docs/latest/querying/filters#null-filter) in the documentation.
* See the [table of default values](https://druid.apache.org/docs/latest/querying/sql-data-types/#standard-types) stored during ingestion when SQL-compatible NULL-handling is not turned on
* Follow the [notebook on GROUP BY](./01-groupby.ipynb) to see how NULL appears in [GROUPING SETS](https://druid.apache.org/docs/latest/querying/sql/#group-by)
* Try out other scalar functions with NULL - check out the dedicated notebooks on [datetime](./07-functions-datetime.ipynb), [string](./08-functions-strings.ipynb), and [IP address](./10-functions-ip.ipynb) functions for examples.
