# Returning values using CASE (if-then-else) functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

The CASE function returns different values depending on values in the data.

This tutorial demonstrates how to use both forms of this [scalar function](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions) at query time and at ingestion time.

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and the learning environment.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable when it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Run the following cell to create a table called `example-koalas-conditions` and load data from the Koalas to the Max dataset. Notice that only the columns this tutorial needs are ingested.

When the ingestion completes, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-koalas-conditions" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "browser",
  "event_type",
  "event_subtype",
  "loaded_image",
  "session_length",
  "session"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-conditions')
display.table('example-koalas-conditions')

## Understanding the dataset

The Koalas to the Max dataset records events from [Koalas to the Max](https://www.koalastothemax.com). Before you start, visit the site to see how it operates.

For a given user session, `event_type` records three types of events: "GoodLoad", "PercentClear", and "LayerClear".
Each event may have an `event_subtype`, a string, numeric, or null value that further describes the event.
- When a user first opens an image, the application posts a "GoodLoad" event.
- As the user uncovers more of the image, the application posts "PercentClear" events. The percentage of the image that has been cleared is recorded in `event_subtype`.
- An image is made up of layers. When the user clears a layer, the application issues a "LayerClear" event and identifies the cleared layer in `event_subtype`.
- When the user clears the entire image, the application records a "LayerClear" event with no value in `event_subtype`.

Run the following cell to see how this journey shows up in the data.

In [None]:
sql='''
SELECT "__time",
  "event_type",
  "event_subtype",
  "loaded_image"
FROM "example-koalas-conditions"
WHERE session = 'S89403399'
'''

display.sql(sql)

## CASE functions at query time

The CASE function provides if-then-else behaviour in two forms.

- The simple form switches between different outputs based on the value of a single dimension or expression.
- The searched form evaluates a separate condition in each branch, allowing for more complex comparisons.

This notebook uses the SQL versions of CASE. Native (JSON-based) equivalents [are also available](https://druid.apache.org/docs/latest/querying/math-expr#general-functions) for use in, for example, streaming ingestion.
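
As a rough mental model only, the two forms can be sketched in plain Python. The helper names `simple_case` and `searched_case` are hypothetical and are not part of Druid or the `druidapi` client. The sketch assumes standard SQL behaviour, where a NULL input never equals a WHEN value and a CASE without a matching branch or ELSE yields NULL (`None` here).

```python
def simple_case(value, branches, default=None):
    """Mimic CASE value WHEN candidate THEN result ... ELSE default END."""
    for candidate, result in branches:
        # In SQL, NULL = anything is never true, so None never matches.
        if value is not None and value == candidate:
            return result
    return default


def searched_case(row, branches, default=None):
    """Mimic CASE WHEN condition THEN result ... ELSE default END."""
    for condition, result in branches:
        if condition(row):
            return result
    return default


# Simple form: one value compared against a list of candidates.
browser_label = simple_case(
    "IE", [("IE", "legacy"), ("Chrome", "modern")], default="other"
)

# Searched form: each branch carries its own condition.
row = {"event_type": "LayerClear", "event_subtype": None}
is_conversion = searched_case(
    row,
    [(lambda r: r["event_type"] == "LayerClear" and r["event_subtype"] is None, "yes")],
    default="no",
)

print(browser_label, is_conversion)  # legacy yes
```

In both helpers the first matching branch wins, which mirrors the order-sensitive evaluation of WHEN clauses.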

### Simple CASE

In a simple CASE, the expression that follows the CASE keyword is the dimension to evaluate. Each WHEN ... THEN pair that follows gives a value to compare against and the result to return when it matches. You can include a final ELSE clause that gives the expression to evaluate if no WHEN value matches. The CASE expression is then closed with END.

The following SQL uses a simple CASE to adjust the results of a query based on the visitor's browser. The output of the CASE expression is averaged and returned in a new column called `average_session_length_maybe`.

* The CASE clause specifies that the comparison is against "browser".
* Each WHEN clause lists a specific value to compare against: here, "IE" and "Chrome".
* Each THEN clause gives the expression to evaluate on a match: here, two simple calculations.
* The ELSE clause returns `session_length` unchanged for every other browser.

In [None]:
sql='''
SELECT
  browser,
  AVG("session_length") AS "average_session_length",
  AVG(
      CASE "browser"
          WHEN 'IE' THEN session_length * 2
          WHEN 'Chrome' THEN session_length / 2
          ELSE session_length
          END
          ) AS "average_session_length_maybe"
FROM "example-koalas-conditions"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T0/PT1H')
GROUP BY 1
'''

display.sql(sql)
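
To make the arithmetic concrete, here is a minimal Python sketch of what the query above computes for each browser. The rows are made-up values, not taken from the dataset, and `adjusted_length` is a hypothetical helper standing in for the CASE expression. Note that Python's `/` always produces a float, whereas division and AVG over integer columns in SQL may truncate, so treat the numbers as illustrative.

```python
from collections import defaultdict

# Hypothetical rows: (browser, session_length)
rows = [("IE", 100), ("IE", 300), ("Chrome", 400), ("Firefox", 50)]


def adjusted_length(browser, session_length):
    """Stand-in for the simple CASE on "browser"."""
    if browser == "IE":
        return session_length * 2
    if browser == "Chrome":
        return session_length / 2
    return session_length  # ELSE branch


# GROUP BY browser
groups = defaultdict(list)
for browser, length in rows:
    groups[browser].append(length)

# (average_session_length, average_session_length_maybe) per browser
summary = {
    browser: (
        sum(lengths) / len(lengths),
        sum(adjusted_length(browser, n) for n in lengths) / len(lengths),
    )
    for browser, lengths in groups.items()
}

for browser, (plain, adjusted) in summary.items():
    print(browser, plain, adjusted)
```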

### Searched CASE

The searched form of CASE is not tied to a single dimension or expression. Each WHEN clause carries its own condition, which allows more complex logic.

A common use of CASE is to tag particular events as significant.

With some artistic license, we can apply this to the example dataset. The SQL statement below flags some events as impressions (when someone first saw an image) and others as conversions (when someone completed the task and finished clearing the image).

In [None]:
sql='''
SELECT
  CASE WHEN ("event_type" = 'GoodLoad') THEN 'yes' ELSE 'no' END AS "isImpression",
  CASE WHEN ("event_type" = 'LayerClear' AND "event_subtype" IS NULL) THEN 'yes' ELSE 'no' END AS "isConversion",
  COUNT(*) AS "Count"
FROM "example-koalas-conditions"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T0/PT4H')
GROUP BY 1, 2
'''

display.sql(sql)
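
The grouping in the query above can be pictured with a small Python sketch. The event tuples are invented for illustration, and `flags` is a hypothetical helper that plays the role of the two searched CASE expressions.

```python
from collections import Counter

# Hypothetical events: (event_type, event_subtype)
events = [
    ("GoodLoad", None),
    ("PercentClear", "25"),
    ("LayerClear", "3"),
    ("LayerClear", None),
]


def flags(event_type, event_subtype):
    """Return (isImpression, isConversion) for one event."""
    is_impression = "yes" if event_type == "GoodLoad" else "no"
    is_conversion = (
        "yes" if event_type == "LayerClear" and event_subtype is None else "no"
    )
    return is_impression, is_conversion


# GROUP BY 1, 2 with COUNT(*)
counts = Counter(flags(event_type, subtype) for event_type, subtype in events)

for (is_impression, is_conversion), count in counts.items():
    print(is_impression, is_conversion, count)
```

Only a "LayerClear" event with no subtype is counted as a conversion; the "LayerClear" event that names a layer falls into the "no", "no" group alongside the "PercentClear" event.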

Combining CASE with SUM gives a conditional count: each row contributes 1 when the condition is met and 0 otherwise. The SQL below uses this pattern to find the number of impressions and conversions for each image shown.

Notice that the SQL uses the [REGEXP_EXTRACT](https://druid.apache.org/docs/latest/querying/sql-scalar#string-functions) function to pull just the file name out of "loaded_image" and return it as "filename".

In [None]:
# Raw string, so Python does not treat the regex backslashes as escape sequences.
sql=r'''
SELECT
  REGEXP_EXTRACT("loaded_image",'[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))') AS "filename",
  SUM(CASE WHEN ("event_type" = 'GoodLoad') THEN 1 ELSE 0 END) AS "impressions",
  SUM(CASE WHEN ("event_type" = 'LayerClear' AND "event_subtype" IS NULL) THEN 1 ELSE 0 END) AS "conversions"
FROM "example-koalas-conditions"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T0/PT4H')
GROUP BY 1
'''
display.sql(sql)
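
The same conditional-count pattern can be sketched in Python. The URLs below are made up, and Python's `re` module stands in for Druid's regular expression engine, which may differ in edge cases; the pattern itself is copied from the query above.

```python
import re
from collections import defaultdict

# Same pattern as the REGEXP_EXTRACT call in the query above.
FILENAME = re.compile(r'[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))')

# Hypothetical events: (loaded_image, event_type, event_subtype)
events = [
    ("https://example.com/img/koalas.jpg", "GoodLoad", None),
    ("https://example.com/img/koalas.jpg", "LayerClear", None),
    ("https://example.com/img/koalas.jpg?size=large", "GoodLoad", None),
    ("https://example.com/img/wombat.png", "GoodLoad", None),
    ("https://example.com/img/wombat.png", "LayerClear", "2"),
]


def filename(loaded_image):
    """Return the first match of the pattern, or None when there is none."""
    match = FILENAME.search(loaded_image)
    return match.group(0) if match else None


totals = defaultdict(lambda: {"impressions": 0, "conversions": 0})
for loaded_image, event_type, event_subtype in events:
    row = totals[filename(loaded_image)]
    # SUM(CASE WHEN ... THEN 1 ELSE 0 END)
    row["impressions"] += 1 if event_type == "GoodLoad" else 0
    row["conversions"] += (
        1 if event_type == "LayerClear" and event_subtype is None else 0
    )

for name, row in totals.items():
    print(name, row["impressions"], row["conversions"])
```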

Run the following cell to return a new dimension, `funnelStage`, that indicates how far a visitor progressed through clearing an image. Stage 1 indicates that they saw the image and started playing with the site, stage 2 shows more interest, stage 3 indicates that they are really trying to do something, and stage 4 shows true determination!

The stage is determined by a searched CASE, with conditions based on the percentage of the image that is cleared as recorded in the `event_subtype` field. Because the CASE has no ELSE clause, it returns NULL for any value that matches none of the conditions.

Notice that the WHERE clause restricts the result set to "PercentClear" events, the only event type whose `event_subtype` holds a percentage.

In [None]:
sql='''
SELECT
  CASE
    WHEN "event_subtype" <= 24 THEN 1
    WHEN "event_subtype" BETWEEN 25 AND 49 THEN 2
    WHEN "event_subtype" BETWEEN 50 AND 74 THEN 3
    WHEN "event_subtype" BETWEEN 75 AND 99 THEN 4
    END
    AS "funnelStage",
  COUNT(*) AS "Count"
FROM "example-koalas-conditions"
WHERE "event_type" = 'PercentClear'
AND TIME_IN_INTERVAL(__time,'2019-08-25T0/PT4H')
GROUP BY 1
'''

display.sql(sql)
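
The bucketing logic of the CASE above can be sketched as a small Python function. The name `funnel_stage` is hypothetical, and the sketch assumes that `event_subtype` arrives as a string holding a number, as it is ingested as VARCHAR in this notebook.

```python
def funnel_stage(event_subtype):
    """Map a PercentClear subtype to a stage, or None when no branch matches."""
    try:
        percent = float(event_subtype)
    except (TypeError, ValueError):
        return None  # NULL or non-numeric input matches no WHEN clause
    if percent <= 24:
        return 1
    if 25 <= percent <= 49:
        return 2
    if 50 <= percent <= 74:
        return 3
    if 75 <= percent <= 99:
        return 4
    return None  # no ELSE clause, so the CASE yields NULL


for value in ("10", "25", "74", "99", "100", None):
    print(value, funnel_stage(value))
```

A value of 100 falls outside every range, so it produces `None`, just as the SQL CASE returns NULL for it.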

Wrapping the CASE in MAX uses the raw activity data in the table to find the furthest funnel stage that each session reached for each image within a specific time window.

In [None]:
# Raw string, so Python does not treat the regex backslashes as escape sequences.
sql=r'''
SELECT
  session,
  REGEXP_EXTRACT("loaded_image",'[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))') AS "filename",
  MAX(
    CASE
      WHEN "event_subtype" <= 24 THEN 1
      WHEN "event_subtype" BETWEEN 25 AND 49 THEN 2
      WHEN "event_subtype" BETWEEN 50 AND 74 THEN 3
      WHEN "event_subtype" BETWEEN 75 AND 99 THEN 4
      END)
    AS "funnelStage"
FROM "example-koalas-conditions"
WHERE "event_type" = 'PercentClear'
AND TIME_IN_INTERVAL(__time,'2019-08-25T0/PT4H')
GROUP BY 1, 2
LIMIT 10
'''

display.sql(sql)
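
A minimal Python sketch of the MAX-over-CASE pattern follows. The events are invented, the percentages are assumed to be whole numbers, and `stage` is a hypothetical stand-in for the CASE expression. Like the SQL MAX, the loop ignores `None` values and leaves a group at `None` when every value in it is `None`.

```python
# Hypothetical PercentClear events: (session, filename, percent_cleared)
events = [
    ("S1", "koalas.jpg", 10),
    ("S1", "koalas.jpg", 60),
    ("S2", "koalas.jpg", 30),
    ("S2", "wombat.png", 100),
]


def stage(percent):
    """Return the funnel stage for a whole-number percentage, or None."""
    for upper_bound, value in ((24, 1), (49, 2), (74, 3), (99, 4)):
        if percent <= upper_bound:
            return value
    return None  # above 99: no WHEN clause matches


# GROUP BY session, filename with MAX(CASE ...)
furthest = {}
for session, name, percent in events:
    key = (session, name)
    furthest.setdefault(key, None)
    current = stage(percent)
    if current is not None and (furthest[key] is None or current > furthest[key]):
        furthest[key] = current

for key, value in furthest.items():
    print(key, value)
```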

## Using CASE at ingestion time

Use functions at ingestion time to front-load work that would otherwise need to be done at query time.

The following SQL statement creates a table called `example-koalas-conditions-rollup` that applies CASE functions during ingestion. It reuses the CASE expressions introduced above to flag whether each event counts as an impression or a conversion and to calculate its funnel stage. The statement also rolls the data up into 15-minute buckets and records the number of source rows in each group as `events`.

In [None]:
# Raw string, so Python does not treat the regex backslashes as escape sequences.
sql=r'''
REPLACE INTO "example-koalas-conditions-rollup" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("timestamp"),'PT15M') AS "__time",
  "session",
  "browser",
  REGEXP_EXTRACT("loaded_image",'[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))') AS "filename",
  "event_type",
  "event_subtype",
  CASE "event_type" WHEN 'PercentClear' THEN
    CASE
      WHEN "event_subtype" <= 24 THEN 1
      WHEN "event_subtype" BETWEEN 25 AND 49 THEN 2
      WHEN "event_subtype" BETWEEN 50 AND 74 THEN 3
      WHEN "event_subtype" BETWEEN 75 AND 99 THEN 4
      ELSE NULL
      END
    END
    AS "funnelStage",
  CASE WHEN ("event_type" = 'GoodLoad') THEN 1 ELSE 0 END AS "isImpression",
  CASE WHEN ("event_type" = 'LayerClear' AND "event_subtype" IS NULL) THEN 1 ELSE 0 END AS "isConversion",
  COUNT(*) AS "events"
FROM "ext"
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-conditions-rollup')
display.table('example-koalas-conditions-rollup')
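
The GROUP BY in the ingestion statement above collapses rows that share the same 15-minute bucket and dimension values, keeping a count in `events`. A simplified Python sketch of that rollup, using made-up rows and only two of the dimensions, might look like this. The helper `floor_15_minutes` is hypothetical and approximates `TIME_FLOOR(..., 'PT15M')`.

```python
from collections import Counter
from datetime import datetime


def floor_15_minutes(ts):
    """Truncate a datetime to the start of its 15-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


# Hypothetical raw events: (timestamp, session, event_type)
raw = [
    ("2019-08-25T00:01:10", "S1", "GoodLoad"),
    ("2019-08-25T00:04:30", "S1", "PercentClear"),
    ("2019-08-25T00:09:59", "S1", "PercentClear"),
    ("2019-08-25T00:16:00", "S1", "PercentClear"),
]

# GROUP BY bucket, session, event_type with COUNT(*) AS "events"
rollup = Counter(
    (floor_15_minutes(datetime.fromisoformat(ts)), session, event_type)
    for ts, session, event_type in raw
)

for (bucket, session, event_type), events in rollup.items():
    print(bucket.isoformat(), session, event_type, events)
```

Four raw rows become three stored rows here, because the two "PercentClear" events in the first bucket share every grouping value.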

Notice that the calculated funnel stage needs two levels of CASE to ensure that the data is correct.

1. A simple CASE tests whether the event reports how much of the image was cleared ("PercentClear").
2. If it does, a searched CASE determines the funnel stage from the percentage of the image that was cleared.
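
The two levels described above can be sketched in Python as a guard followed by a range lookup. The function name `funnel_stage_at_ingestion` is hypothetical. The sketch shows why the outer test matters: without it, a "LayerClear" event whose subtype is a small layer number would be mistaken for a low percentage.

```python
def funnel_stage_at_ingestion(event_type, event_subtype):
    """Nested CASE: check the event type first, then bucket the percentage."""
    # Outer simple CASE: only PercentClear events carry a percentage.
    if event_type != "PercentClear":
        return None
    # Inner searched CASE: map the percentage to a stage.
    try:
        percent = float(event_subtype)
    except (TypeError, ValueError):
        return None
    if percent <= 24:
        return 1
    if 25 <= percent <= 49:
        return 2
    if 50 <= percent <= 74:
        return 3
    if 75 <= percent <= 99:
        return 4
    return None  # explicit ELSE NULL


print(funnel_stage_at_ingestion("PercentClear", "55"))  # 3
print(funnel_stage_at_ingestion("LayerClear", "3"))     # None
```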

Run the following cell to see how the data in the new dimensions might be used.

In [None]:
sql='''
SELECT
  TIME_FLOOR(__time, 'PT4H') AS "timebucket",
  SUM("isImpression") AS "totalImpressions",
  COUNT(DISTINCT "session") FILTER (WHERE "funnelStage" = 1) AS "reached_stage1",
  COUNT(DISTINCT "session") FILTER (WHERE "funnelStage" = 2) AS "reached_stage2",
  COUNT(DISTINCT "session") FILTER (WHERE "funnelStage" = 3) AS "reached_stage3",
  COUNT(DISTINCT "session") FILTER (WHERE "funnelStage" = 4) AS "reached_stage4",
  SUM("isConversion") AS "totalConversions"
FROM "example-koalas-conditions-rollup"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T0/P1D')
GROUP BY 1
ORDER BY 1 ASC
'''

display.sql(sql)
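
The filtered aggregations in the query above can be pictured with the following Python sketch for a single time bucket. The rows are invented for illustration. The sets produce exact distinct counts, whereas Druid's COUNT(DISTINCT) is approximate by default, so results on real data may differ slightly.

```python
from collections import defaultdict

# Hypothetical rollup rows: (session, funnelStage, isImpression, isConversion)
rows = [
    ("S1", None, 1, 0),
    ("S1", 1, 0, 0),
    ("S1", 2, 0, 0),
    ("S1", None, 0, 1),
    ("S2", None, 1, 0),
    ("S2", 1, 0, 0),
    ("S2", 1, 0, 0),
]

sessions_by_stage = defaultdict(set)
total_impressions = 0
total_conversions = 0

for session, funnel_stage, is_impression, is_conversion in rows:
    total_impressions += is_impression  # SUM("isImpression")
    total_conversions += is_conversion  # SUM("isConversion")
    if funnel_stage is not None:
        # COUNT(DISTINCT "session") FILTER (WHERE "funnelStage" = n)
        sessions_by_stage[funnel_stage].add(session)

reached = {n: len(sessions_by_stage[n]) for n in (1, 2, 3, 4)}

print(total_impressions, reached, total_conversions)
```

Session S2 appears twice at stage 1 but is counted once, which is the effect of the DISTINCT in the SQL.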

## Clean up

Run the following cell to remove the tables used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-conditions")
druid.datasources.drop("example-koalas-conditions-rollup")

## Summary

* There are two forms of CASE: simple and searched.
* CASE statements can be used at query time to add new fields to result sets.
* CASE statements at ingestion time can enrich tables ahead of time.
* There are both SQL and native versions of the CASE function.

## Learn more

* Check the documentation for the [SQL](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions) and [native](https://druid.apache.org/docs/latest/querying/math-expr#general-functions) versions of CASE.