# (Result) by (action) using (feature)
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Introductory paragraph - for example:

This tutorial demonstrates how to work with [feature](link to feature doc). In this tutorial you perform the following tasks:

- Task 1
- Task 2
- Task 3
- etc

## Prerequisites

This tutorial works with Druid XX.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells prepare the notebook and learning environment for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

COVERAGE

- https://druid.apache.org/docs/latest/querying/sql-data-types#null-values - what the default values are
- https://druid.apache.org/docs/latest/development/extensions-core/approximate-histograms#null-handling
- https://druid.apache.org/docs/latest/design/segments#handling-null-values
- https://druid.apache.org/docs/latest/querying/filters#null-filter
- https://druid.apache.org/docs/latest/querying/sql-aggregations/ - noting if aggregators return zero or NULL
- https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes - what happens when operators work with NULL
- https://druid.apache.org/docs/latest/misc/math-expr/#array-functions - whether it's -1 or NULL
- https://druid.apache.org/docs/latest/misc/math-expr/#math-functions - particularly safe_divide
- https://druid.apache.org/docs/latest/querying/sql/#unnest - on how it handles NULLs
- https://druid.apache.org/docs/latest/querying/sql/#group-by - how GROUP BY handles NULLs
- https://imply.io/blog/numeric-column-null-checks-apache-druid/ - 2020 blog by Clint
- https://druid.apache.org/docs/latest/querying/sql-data-types#standard-types - how default values are set

In [5]:
import druidapi
import os

# Use DRUID_HOST when it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status
status_client.version

Opening a connection to http://router:8888.


'27.0.0-SNAPSHOT'

### Load example data

Run the following cell to create a table called `example-koalas-null`. Notice that the query uses `CASE` expressions to derive empty-string, zero, and NULL variants of several columns, such as `referrer`, `session_length`, and `timezone`, and partitions the table by day.

When completed, you'll see a description of the final table.

In [11]:
sql='''
REPLACE INTO "example-koalas-null" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "referrer",
  CASE WHEN "referrer" = 'Direct' THEN ''
    ELSE "referrer" END AS "referrer-emptyString",
  CASE WHEN "referrer" = 'Direct' THEN NULL
    ELSE "referrer" END AS "referrer-null",
  "loaded_image",
  "event_type",
  "event_subtype",
  CASE WHEN "event_type" = 'PercentClear' AND "event_subtype" IS NULL THEN 0 ELSE "event_subtype" END AS "event_type-nullCheck",
  CASE WHEN "event_type" = 'PercentClear' AND "event_subtype" = '' THEN 0 ELSE "event_subtype" END AS "event_type-emptyStringCheck",
  "session",
  "session_length",
  CASE WHEN "session_length" > 10000 THEN "session_length"
    ELSE 0 END AS "session_length-zero",
  CASE WHEN "session_length" > 10000 THEN "session_length"
    ELSE NULL END AS "session_length-null",
  "platform",
  "agent_category",
  "continent",
  "language",
  "timezone",
  CASE WHEN "timezone" = 'N/A' THEN NULL
    ELSE "timezone"
    END AS "timezone-null",
  CASE WHEN "timezone" = 'N/A' THEN ''
    ELSE "timezone"
    END AS "timezone-emptyString"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-null')
display.table('example-koalas-null')

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:22<00:00,  4.46it/s]            


Position,Name,Type
1,__time,TIMESTAMP
2,referrer,VARCHAR
3,referrer-emptyString,VARCHAR
4,referrer_host,VARCHAR
5,referrer_host-null,VARCHAR
6,loaded_image,VARCHAR
7,event_type,VARCHAR
8,event_subtype,VARCHAR
9,event_type-nullCheck,VARCHAR
10,event_type-emptyStringCheck,VARCHAR
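
As a rough illustration of what the `CASE` expressions in the ingestion query do to each row, the following Python sketch mirrors the `referrer` and `session_length` derivations. The function names and the sample values are made up for this example; the real transformation runs inside Druid during ingestion, and Python's `None` only stands in for SQL NULL.

```python
from typing import Optional


def derive_referrer_variants(referrer: Optional[str]) -> dict:
    """Mirror the two CASE expressions applied to "referrer" (hypothetical helper)."""
    is_direct = referrer == "Direct"
    return {
        "referrer": referrer,
        # CASE WHEN "referrer" = 'Direct' THEN '' ELSE "referrer" END
        "referrer-emptyString": "" if is_direct else referrer,
        # CASE WHEN "referrer" = 'Direct' THEN NULL ELSE "referrer" END
        "referrer-null": None if is_direct else referrer,
    }


def derive_session_length_variants(session_length: Optional[int]) -> dict:
    """Mirror the two CASE expressions applied to "session_length" (hypothetical helper)."""
    # In SQL, a NULL input makes the comparison unknown, so the ELSE branch applies.
    keep = session_length is not None and session_length > 10000
    return {
        "session_length": session_length,
        # CASE WHEN "session_length" > 10000 THEN "session_length" ELSE 0 END
        "session_length-zero": session_length if keep else 0,
        # CASE WHEN "session_length" > 10000 THEN "session_length" ELSE NULL END
        "session_length-null": session_length if keep else None,
    }


# Made-up sample values, not rows from the dataset.
print(derive_referrer_variants("Direct"))
print(derive_referrer_variants("https://example.com/"))
print(derive_session_length_variants(500))
print(derive_session_length_variants(20000))
```

Each source column therefore yields one variant that marks "no value" with a placeholder (`''` or `0`) and one that marks it with NULL, which is what the later queries compare.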


## String NULL

Let's count how many NULL and empty-string values the two derived `referrer` columns contain.

In [12]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "referrer-emptyString" IS NULL),
    COUNT(*) FILTER (WHERE "referrer-emptyString" = ''),
    COUNT(*) FILTER (WHERE "referrer-null" IS NULL),
    COUNT(*) FILTER (WHERE "referrer-null" = '')
FROM "example-koalas-null"
'''

display.sql(sql)

EXPR\$0,EXPR\$1,EXPR\$2,EXPR\$3
192328,192328,192328,192328
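
All four counts above are identical, which is what you would expect if the cluster treats NULL and the empty string as interchangeable. That is an inference from the output, not something the notebook states; it is consistent with Druid's legacy (default-value) null handling. For contrast, here is a minimal sketch of the same counts under standard SQL three-valued logic. It uses Python's built-in `sqlite3` module purely as a stand-in engine, with three made-up rows and a table named `t`; it does not reproduce Druid's behavior or syntax exactly (`COUNT(CASE ...)` replaces `COUNT(*) FILTER`).

```python
import sqlite3

# In-memory stand-in table; the rows are illustrative, not taken from the dataset.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("referrer-emptyString" TEXT, "referrer-null" TEXT)')
rows = [
    ("", None),
    ("", None),
    ("https://example.com/", "https://example.com/"),
]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = '''
SELECT
    COUNT(CASE WHEN "referrer-emptyString" IS NULL THEN 1 END),
    COUNT(CASE WHEN "referrer-emptyString" = '' THEN 1 END),
    COUNT(CASE WHEN "referrer-null" IS NULL THEN 1 END),
    COUNT(CASE WHEN "referrer-null" = '' THEN 1 END)
FROM t
'''

string_counts = conn.execute(query).fetchone()
print(string_counts)  # (0, 2, 2, 0)
conn.close()
```

Under these semantics the empty string matches only `= ''` and NULL matches only `IS NULL`, so the four counts differ instead of being equal.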


## Numeric NULL

Now count how many NULL and zero values the two derived `session_length` columns contain.

In [9]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "session_length-zero" IS NULL),
    COUNT(*) FILTER (WHERE "session_length-zero" = 0),
    COUNT(*) FILTER (WHERE "session_length-null" IS NULL),
    COUNT(*) FILTER (WHERE "session_length-null" = 0)
FROM "example-koalas-null"
'''

display.sql(sql)

EXPR\$0,EXPR\$1,EXPR\$2,EXPR\$3
0,128614,0,128614
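
In the output above, neither column reports any NULLs and both report the same number of zeros, which suggests that numeric NULLs were stored as `0`. Again, this is an inference from the output and is consistent with Druid's legacy null handling. The sketch below shows how the same checks behave under standard SQL semantics, and why the distinction matters for aggregates such as `AVG`. It uses Python's built-in `sqlite3` module as a stand-in engine with four made-up rows in a table named `t`; treat it as an illustration, not as a statement of how any particular Druid version behaves.

```python
import sqlite3

# In-memory stand-in table; the rows are illustrative, not taken from the dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE t ("session_length-zero" INTEGER, "session_length-null" INTEGER)'
)
rows = [(0, None), (0, None), (20000, 20000), (40000, 40000)]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = '''
SELECT
    COUNT(CASE WHEN "session_length-zero" IS NULL THEN 1 END),
    COUNT(CASE WHEN "session_length-zero" = 0 THEN 1 END),
    COUNT(CASE WHEN "session_length-null" IS NULL THEN 1 END),
    COUNT(CASE WHEN "session_length-null" = 0 THEN 1 END),
    AVG("session_length-zero"),
    AVG("session_length-null")
FROM t
'''

numeric_counts = conn.execute(query).fetchone()
print(numeric_counts)  # (0, 2, 2, 0, 15000.0, 30000.0)
conn.close()
```

With NULLs preserved, `AVG` skips the missing rows and returns 30000.0; with zeros substituted, the same rows pull the average down to 15000.0.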


In [10]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "session_length-zero" = 0),
    COUNT(*) FILTER (WHERE "session_length-null" IS NULL)
FROM "example-koalas-null"
'''

display.sql(sql)

EXPR\$0,EXPR\$1
128614,0


## Clean up

Run the following cell to remove the table used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-null")

## Summary

* You learned this
* Remember this

## Learn more

* Try this out on your own data
* Solve for problem X that isn't covered here
* Read docs pages
* Watch or read something cool from the community
* Do some exploratory stuff on your own

In [None]:
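# STANDARD CODE BLOCKS

# Imports needed by the snippets below
import json
import pandas as pd
import matplotlib.pyplot as plt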

# When just wanting to display some SQL results
display.sql(sql)

# When ingesting data:
display.run_task(sql)
sql_client.wait_until_ready('example-koalas-null')
display.table('example-koalas-null')

# When you want to make an EXPLAIN look pretty
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

# When you want a simple plot
df = pd.DataFrame(sql_client.sql(sql))
df.plot(x='x-axis', y='y-axis', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

# When you want to add some query context parameters
req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

# When you want to compare two different sets of results
df3 = df1.compare(df2, keep_equal=True)
df3