# (Result) by (action) using (feature)
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Adding dimensions and metrics to your data can enhance its analytic value. It's common, for example, to add product categorization, user demographics, or additional location-based metrics to retail clickstream or POS data. In IoT scenarios, it's common to add information such as the metric type (temperature, pressure, flow, and so on) for a particular device. Devices can also be associated with a specific industrial process and grouped into components and subcomponents of the overall system being monitored, which is very useful in determining subsystem anomalies.

You can use lookups and joins at query time to enhance data in this fashion, but both carry a performance penalty, and joins more so than lookups. To keep analytic queries fast, you can apply joins at ingestion time instead.

This tutorial demonstrates how to work with [feature](link to feature doc). In this tutorial you perform the following tasks:

- Task 1
- Task 2
- Task 3
- etc

## Prerequisites

This tutorial works with Druid XX.0.0 or later.

<!-- Profiles are:
`druid-jupyter` - just Jupyter and Druid
`all-services` - includes Jupyter, Druid, and Kafka
 -->

Launch this tutorial and all prerequisites using the `<PLACE PROFILE NAME HERE>` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Set up a connection to Apache Kafka

<!-- Include these cells if your notebook uses Kafka. -->

Run the next cell to set up the connection to Apache Kafka.

In [None]:
# Kafka bootstrap servers are host:port pairs, without a URL scheme.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

### Set up a connection to the data generator

<!-- Include these cells if your notebook uses the data generator. -->

Run the next cell to set up the connection to the data generator.

In [None]:
import requests
import json

datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}

<!-- Include these cells if you need additional Python modules -->

### Import additional modules

Run the following cell to import additional Python modules that you will use to X, Y, Z.

In [None]:
# Add your modules here, remembering to align this with the prerequisites section

import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Create a table using batch ingestion

<!-- Use these cells if you are using batch ingestion for your notebook. -->

Run the following cell to create a table using batch ingestion. Notice {the use of X as a timestamp | only required columns are ingested | WHERE / expressions / GROUP BY are front-loaded | partitions on X period and clusters by Y}.

When completed, you'll see a description of the final table.

In [None]:
# Replace example-dataset-topic with a unique table name for this notebook.

# - Always prefix your table name with `example-`
# - If using the standard example datasets, use the following standard values for `dataset`:

#     wikipedia       wikipedia
#     koalas          KoalasToTheMax one day
#     koalanest       KoalasToTheMax one day (nested)
#     nyctaxi3        NYC Taxi cabs (3 files)
#     nyctaxi         NYC Taxi cabs (all files)
#     flights         FlightCarrierOnTime (1 month)

# Remember to apply good data modelling practice to your INSERT / REPLACE.

table_name = 'example-dataset-topic'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

## Create a table using streaming ingestion

In this section, you use the data generator to generate a stream of messages into an Apache Kafka topic. Next, you set up an ongoing ingestion into Druid.

### Use the data generator to populate a Kafka topic

Run the following cell to instruct the data generator to start producing data.

In [None]:
# For more information on the available configurations and settings for the data generator, see the dedicated notebook in "99-contributing"

# Replace example-dataset-topic with a unique table name for this notebook.

# - Always prefix your table name with `example-`
# - If using the standard example datasets, use the following standard values for `dataset`:

#     social/socialposts            social
#     clickstream/clickstream       clickstream

# Remember to apply good data modelling practice to your data schema.

datagen_topic = "example-dataset-topic"
datagen_job = datagen_topic
datagen_config = "social/social_posts.json"

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic  },
    "config_file": datagen_config, 
    "time":"10m",
    "concurrency":100
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
requests.get(f"{datagen_host}/status/{datagen_job}").json()

### Use streaming ingestion to populate the table

Ingest data from an Apache Kafka topic into Apache Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

Run the next cell to set up the [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig), [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig), and [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). Notice {the use of X as a timestamp | only required columns are ingested | WHERE / expressions / GROUP BY are front-loaded | partitions on X period and clusters by Y}.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": "true" }

tuningConfig = { "type": "kafka" }

# Replace example-dataset-topic with a unique table name for this notebook.

# - Always prefix your table name with `example-`
# - If using the standard example datasets, use the following standard values for `dataset`:

#     social/socialposts            social
#     clickstream/clickstream       clickstream

# Remember to apply good data modelling practice to your data schema.

table_name = 'example-dataset-topic'

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": "false", "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery" : "true"}
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
sql_client.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)

#### Broadcast joins - small lookup joins

SQL-based ingestion can process joins efficiently during ingestion using either broadcast or sort-merge joins. Broadcast is the default method, in which the right table of the join is broadcast in its entirety to all workers involved in the ingestion. Each worker keeps the content of the lookup in memory in order to process the join, so make sure that the whole set of lookup tables joined in this fashion for a given ingestion fits within the heap of each worker JVM.

Here's an example:

In [None]:
sql = '''
REPLACE INTO "example-kttm-enhanced-batch" OVERWRITE ALL
WITH
kttm_data AS (
  SELECT * 
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR)
),
country_lookup AS (
  SELECT * 
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://static.imply.io/example-data/lookup/countries.tsv"]}',
      '{"type":"tsv","findColumnsFromHeader":true}'
    )
  ) EXTEND ("Country" VARCHAR, "Capital" VARCHAR, "ISO3" VARCHAR, "ISO2" VARCHAR)
)

SELECT
  TIME_PARSE(kttm_data."timestamp") AS __time,
  kttm_data."session",
  kttm_data."agent_category",
  kttm_data."agent_type",
  kttm_data."browser",
  kttm_data."browser_version",
  kttm_data."language",
  kttm_data."os",
  kttm_data."city",
  kttm_data."country",
  country_lookup."Capital" AS "capital",
  country_lookup."ISO3" AS "iso3",
  kttm_data."forwarded_for" AS "ip_address",
  kttm_data."session_length",
  kttm_data."event_type"
FROM kttm_data
LEFT JOIN country_lookup ON country_lookup.Country = kttm_data.country
PARTITIONED BY DAY
'''
druid.display.run_task(sql)
druid.sql.wait_until_ready('example-kttm-enhanced-batch')

Data for both sources "kttm_data" and "country_lookup" are obtained from external sources:
```sql
WITH
kttm_data AS
(
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, ...)
),
country_lookup AS
(
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://static.imply.io/example-data/lookup/countries.tsv"]}',
      '{"type":"tsv","findColumnsFromHeader":true}'
    )
  ) EXTEND ("Country" VARCHAR, "Capital" VARCHAR, "ISO3" VARCHAR, "ISO2" VARCHAR)
)
```

Columns from both tables can be used in the SELECT expressions using the alias "country_lookup" to reference any joined column:
```sql
  kttm_data."country",
  country_lookup."Capital" AS "capital",
  country_lookup."ISO3" AS "iso3"
```

The join is specified in the FROM clause:
```sql
FROM kttm_data
LEFT JOIN country_lookup ON country_lookup.Country = kttm_data.country
```
The LEFT JOIN ensures that all the rows from the "kttm_data" source are ingested. An INNER JOIN would exclude rows from "kttm_data" if the value for "kttm_data.country" is not present in "country_lookup.Country".
Since no context parameters were set, the join is processed as a broadcast join. The first table in the FROM clause is the distributed table and all other joined tables will be shipped to the workers to execute the join.
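
To make the LEFT JOIN versus INNER JOIN behavior concrete, here is a minimal plain-Python sketch of a broadcast-style join. It is only an illustration, not how Druid implements the join: the function `broadcast_join` and the sample rows are hypothetical, and it assumes each lookup key appears once. The small table is turned into an in-memory index, standing in for the copy each worker would hold, and the larger table is streamed past it.

```python
# Illustrative sketch only: plain Python, not Druid code.
# broadcast_join, fact_rows and lookup_rows are hypothetical names.

def broadcast_join(fact_rows, lookup_rows, fact_key, lookup_key, how="left"):
    """Join each fact row against an in-memory copy of the lookup table."""
    # Build the in-memory index that every worker would hold a copy of.
    # Assumes lookup keys are unique; a repeated key keeps the last row.
    index = {row[lookup_key]: row for row in lookup_rows}
    joined = []
    for fact in fact_rows:
        match = index.get(fact[fact_key])
        if match is None:
            if how == "inner":
                continue  # INNER JOIN drops rows that have no match
            match = {}    # LEFT JOIN keeps the row; lookup columns stay empty
        joined.append({
            **fact,
            "capital": match.get("Capital"),
            "iso3": match.get("ISO3"),
        })
    return joined


fact_rows = [
    {"session": "s1", "country": "France"},
    {"session": "s2", "country": "Atlantis"},  # no matching lookup row
]
lookup_rows = [
    {"Country": "France", "Capital": "Paris", "ISO3": "FRA", "ISO2": "FR"},
]

left_result = broadcast_join(fact_rows, lookup_rows, "country", "Country", how="left")
inner_result = broadcast_join(fact_rows, lookup_rows, "country", "Country", how="inner")

print(len(left_result), len(inner_result))  # prints: 2 1
```

The left join keeps the unmatched row with empty lookup columns, while the inner join drops it, which mirrors the difference described above.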

Take a look at the data:

In [None]:
druid.display.sql("""
SELECT
  "iso3" AS "country_code", 
  "capital",
  count( DISTINCT "ip_address" ) distinct_users, 
  MIN("session_length")/1000 fastest_session_ms,
  MAX("session_length")/1000 slowest_session_ms
FROM "example-kttm-enhanced-batch"
WHERE "event_type"='LayerClear'
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
""")

#### Shuffle joins - "Large lookup to fact" or "fact to fact" joins

See [Shuffle joins in SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#sort-merge).
A shuffle join can join large tables to other large tables without fully loading either one into memory. Both sources involved in the join are scanned in parallel across all workers. The intermediate data for both sources is then redistributed among the workers based on the join column(s), so that rows from both sources with the same values end up on the same worker.

![](assets/shuffle-join.png)
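
The redistribution step can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Druid's implementation: `shuffle`, `sort_merge_left_join`, `NUM_WORKERS`, and the sample rows are all hypothetical names. Rows from both sources are assigned to a worker by hashing the join key, and each worker then sorts its two partitions and merges them.

```python
# Illustrative sketch only: plain Python, not Druid's implementation.
# shuffle, sort_merge_left_join and NUM_WORKERS are hypothetical names.
from zlib import crc32

NUM_WORKERS = 2


def shuffle(rows, key, num_workers=NUM_WORKERS):
    """Redistribute rows so that equal join keys land on the same worker."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        worker = crc32(str(row[key]).encode("utf-8")) % num_workers
        partitions[worker].append(row)
    return partitions


def sort_merge_left_join(left, right, key):
    """Left join two partitions by sorting both on the key, then merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l_row in left:
        # Skip right rows with smaller keys; no later left row can match them.
        while j < len(right) and right[j][key] < l_row[key]:
            j += 1
        k, matched = j, False
        # Emit one output row per right row that shares the key.
        while k < len(right) and right[k][key] == l_row[key]:
            out.append({**right[k], **l_row})
            matched = True
            k += 1
        if not matched:
            out.append(dict(l_row))  # LEFT JOIN keeps unmatched left rows
    return out


edits = [
    {"user": "ann", "page": "A"},
    {"user": "bob", "page": "B"},
    {"user": "ann", "page": "C"},
    {"user": "zed", "page": "D"},  # no profile for this user
]
users = [
    {"user": "ann", "group": "Bots"},
    {"user": "bob", "group": "Reviewers"},
]

joined = []
# Each pair of partitions plays the role of one worker's share of the data.
for edit_part, user_part in zip(shuffle(edits, "user"), shuffle(users, "user")):
    joined.extend(sort_merge_left_join(edit_part, user_part, "user"))

by_page = {row["page"]: row.get("group") for row in joined}
print(by_page)
```

Because both sources are partitioned with the same hash function, no worker needs to hold either table in full, only its own slice of each.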

To use a shuffle join, the query context must include:
```
{
   "sqlJoinAlgorithm":"sortMerge"
}
```
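
As a rough sketch of where this context sits in a request, the snippet below builds a JSON body that pairs a query with its context. The helper name `build_msq_task_payload` and the placeholder query are hypothetical, and it assumes that a body with `query` and `context` fields is what the SQL task API (`/druid/v2/sql/task`) accepts; check the Druid API reference before relying on it. Nothing is sent to Druid here.

```python
# Illustrative sketch only. build_msq_task_payload is a hypothetical helper,
# and the query text is a placeholder rather than a runnable statement.
import json


def build_msq_task_payload(sql, join_algorithm="sortMerge", max_num_tasks=2):
    """Build a request body that carries the join algorithm in the query context."""
    return {
        "query": sql,
        "context": {
            "sqlJoinAlgorithm": join_algorithm,  # "broadcast" is the default
            "maxNumTasks": max_num_tasks,
        },
    }


payload = build_msq_task_payload('REPLACE INTO "example-table" OVERWRITE ALL SELECT ...')
print(json.dumps(payload["context"], indent=2))
```

In this notebook, the same effect is achieved with `request.add_context(...)` on the Python client's request object.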

This example is meant to run on the local Docker Compose deployment, where two very large tables are not practical, so try it out with small sources and pretend they are big. You'll use the "wikipedia" sample data and join it with profile data from "example-wiki-users-batch". First, create the users table, because at the time of writing there was no matching "user" source available. The next cell reads from a table named "example-wikipedia-batch", so that table must already exist in Druid before you run it:

In [None]:
# since we don't have a source for example-wiki-users-batch let's
# create one using in-database transformation that
# generates user profiles from the wikipedia data by
# - grouping on "user"
# - injecting variability into the "group" column using modulus of the __time of the event
# - "edits" represent the number of edits done by the user, so just count the number of events
# - calculate the registration time using the minimum __time from events and adjusting it some variable years back in time
# - determine the preferred language of the user based on their earliest channel edit

sql = '''
REPLACE INTO "example-wiki-users-batch" OVERWRITE ALL
SELECT 
  "user", 
  EARLIEST(
    CASE 
      WHEN  MOD(TIMESTAMP_TO_MILLIS(__time),5) > 3 THEN 'Reviewers' 
      WHEN  MOD(TIMESTAMP_TO_MILLIS(__time),17) > 13 THEN 'Patrollers' 
      WHEN  MOD(TIMESTAMP_TO_MILLIS(__time),23) > 21 THEN 'Bots'
      ELSE 'Autoconfirmed'
    END,
    1024
  ) AS "group",
  count(*) "edits",
  TIME_SHIFT(MIN(__time), 'P1Y', -1 * MOD(MIN(EXTRACT (MICROSECOND FROM __time)),20) ) AS "registered_at_ms",
  EARLIEST(SUBSTRING("channel", 2, 2), 1024) AS "language"
FROM "example-wikipedia-batch"
GROUP BY 1
PARTITIONED BY ALL
'''
request = druid.sql.sql_request( sql)              # init request object
request.add_context( 'finalizeAggregations', True) # EARLIEST functions will store a partial aggregation otherwise
request.add_context( 'maxNumTasks', 2)             # can't go any higher in test env

druid.display.run_task(request)
druid.sql.wait_until_ready('example-wiki-users-batch')

The next cell runs an ingestion using the sortMerge join:

In [None]:
sql = '''
REPLACE INTO "example-wiki-merge-batch" OVERWRITE ALL
WITH "wikidata" AS 
(
    SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"http",
          "uris":[ "https://druid.apache.org/data/wikipedia.json.gz"]
         }',
        '{"type":"json"}'
      )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT 
  TIME_PARSE(d."timestamp") as "__time",
  d."isRobot", 
  d."channel" , 
  d."timestamp" , 
  d."flags" , 
  d."isUnpatrolled" , 
  d."page" , 
  d."diffUrl" , 
  d."added" , 
  d."comment" , 
  d."commentLength" , 
  d."isNew" , 
  d."isMinor" , 
  d."delta" , 
  d."isAnonymous" , 
  d."user" , 
  d."deltaBucket" , 
  d."deleted" , 
  d."namespace" , 
  d."cityName" , 
  d."countryName" , 
  d."regionIsoCode" , 
  d."metroCode" , 
  d."countryIsoCode" , 
  d."regionName", 
  u."group" AS "user_group",
  u."edits" AS "user_edits",
  u."registered_at_ms" AS "user_registration_epoch",
  u."language" AS "user_language"
FROM "wikidata" AS d 
  LEFT JOIN "example-wiki-users-batch" AS u ON u."user"=d."user"
PARTITIONED BY DAY
'''
request = druid.sql.sql_request( sql)                 # init request object
request.add_context( 'sqlJoinAlgorithm', 'sortMerge') # use sortMerge to join the sources
request.add_context( 'maxNumTasks', 2)                # use 2 tasks to run the ingestion

druid.display.run_task(request)
druid.sql.wait_until_ready('example-wiki-merge-batch')

Run the next cell to query the newly joined data:

In [None]:
druid.display.sql("""
SELECT "user_group",
  count( DISTINCT "user") "distinct_users",
  sum("user_edits") "total_activity"
FROM "example-wiki-merge-batch"
GROUP BY 1
ORDER BY 1, 3 DESC
""")

## Clean up

Run the following cell to remove the XXX used in this notebook from the database.

In [None]:
# Use this for batch ingested tables

print(f"Drop table: [{druid.datasources.drop(table_name)}]")

# Use this when doing streaming with the data generator

print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

* You learned this
* Remember this

## Learn more

* Try this out on your own data
* Solve for problem X that isn't covered here
* Read docs pages
* Watch or read something cool from the community
* Do some exploratory stuff on your own

In [None]:
# Here are some useful code elements that you can re-use.

# When just wanting to display some SQL results
sql = f'''SELECT * FROM "{table_name}" LIMIT 5'''
display.sql(sql)

# When ingesting data and wanting to describe the schema
display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

# When you want to show the native version of a SQL statement
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

# When you want a simple plot
df = pd.DataFrame(sql_client.sql(sql))
df.plot(x='x-axis', y='y-axis', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

# When you want to add some query context parameters
req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

# When you want to compare two different sets of results
df3 = df1.compare(df2, keep_equal=True)
df3

# When you want to see some messages from a Kafka topic
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers=kafka_host)
consumer.subscribe(topics=datagen_topic)
count = 0
for message in consumer:
    count += 1
    if count == 5:
        break
    print ("%d:%d: v=%s" % (message.partition,
                            message.offset,
                            message.value))
consumer.unsubscribe()