# Run asynchronous queries on data in deep storage using the asynchronous query API

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Druid provides two APIs to run SELECT queries: the interactive API and the asynchronous API.

The interactive query API uses data pre-cached on Historical services and data arriving from event streams. The asynchronous query API, by contrast, reads data from deep storage and can combine it with streaming data.

This tutorial focuses on using the asynchronous API to access data in [deep storage](https://druid.apache.org/docs/latest/api-reference/sql-api#query-from-deep-storage) in combination with data from [real-time ingestion](14-query-async-realtime.ipynb).

To work through examples that focus on historical data, including examples of result pagination, see the notebook on [historical data query](21-query-async-historical.ipynb).

In this tutorial you perform the following tasks:

- Use the data generator to publish 10 days of clickstream data to Apache Kafka and load it with streaming ingestion.
- Set up retention rules so that Historical services cache only the most recent week of data.
- Restart the data generator and the supervisor to stream live data into the table.
- Use interactive and asynchronous queries to compare the part of the timeline that each one covers.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import requests
import json

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Set up a connection to Apache Kafka

Run the next cell to set up the connection to Apache Kafka.

In [None]:
# Kafka bootstrap servers are host:port pairs, without a URL scheme.
if 'KAFKA_HOST' not in os.environ:
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"
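
The two connection cells above follow the same pattern: read a host name from an environment variable and fall back to `localhost`. As a minimal sketch, that pattern could be captured in one helper. The name `service_address` and its parameters are hypothetical and are not part of `druidapi`.

```python
import os


def service_address(env_var, port, default_host="localhost", scheme=None, environ=None):
    """Build 'host:port', optionally prefixed with a scheme, from an environment variable."""
    # Allow a dictionary to be passed in so the helper can be tried without touching os.environ.
    environ = os.environ if environ is None else environ
    host = environ.get(env_var, default_host)
    address = f"{host}:{port}"
    return f"{scheme}://{address}" if scheme else address


# Druid's router is reached over HTTP, so it needs a scheme.
print(service_address("DRUID_HOST", 8888, scheme="http", environ={}))
# Kafka bootstrap servers are plain host:port pairs.
print(service_address("KAFKA_HOST", 9092, environ={"KAFKA_HOST": "kafka"}))
```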

### Set up a connection to the data generator

Run the next cell to set up the connection to the data generator.

In [None]:
datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}

### Import additional modules

Run the following cell to import additional Python modules that you will use as part of data generation.

In [None]:
from datetime import datetime, timedelta
import time

## Create a table using streaming ingestion

In this section, you use the data generator to generate a stream of messages into an Apache Kafka topic and ingest it into a table in Druid.

The data generator configuration here produces clickstream data to Kafka starting 12 days ago (`time_type`, set from `datagen_time_type`) for a duration of 10 days (`time`).


In [None]:
datagen_topic = "example-clickstream-async"
datagen_job = datagen_topic
datagen_config = "clickstream/clickstream.json"
datagen_time_type = (datetime.now() - timedelta(days=12)).strftime("%Y-%m-%d %H:%M:%S")

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic  },
    "config_file": datagen_config,
    "time":"240h",
    "time_type": datagen_time_type,
    "concurrency": 1
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
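
To see which part of the timeline this configuration covers, the following sketch works through the arithmetic. It assumes, as described above, that `time_type` is the simulated start time and `time` is the simulated duration. The function name `generated_window` is hypothetical, and `now` is fixed only to make the example reproducible.

```python
from datetime import datetime, timedelta


def generated_window(now, start_days_ago=12, duration_hours=240):
    """Return the (start, end) of the simulated event timestamps."""
    start = now - timedelta(days=start_days_ago)
    end = start + timedelta(hours=duration_hours)
    return start, end


# A fixed "now" keeps the output stable; the notebook uses datetime.now().
now = datetime(2024, 1, 15, 12, 0, 0)
start, end = generated_window(now)

print("first event:", start.strftime("%Y-%m-%d %H:%M:%S"))
print("last event: ", end.strftime("%Y-%m-%d %H:%M:%S"))
print("days between last event and now:", (now - end).days)
```

Under these assumptions, the generated data stops two days before the current time, so the most recent days of the table stay empty until live streaming is started later in the notebook.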

Run the following cell to wait until the data generator has finished publishing events to Kafka.

In [None]:
datagen_done = False

while not datagen_done:
    result = requests.get(f"{datagen_host}/status/{datagen_job}").json()
    if result["status"] == 'COMPLETE':
        datagen_done = True
    else:
        time.sleep(1)
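
The loop above runs until the job reports `COMPLETE`, which means it never returns if the job fails or the generator is unreachable. A minimal sketch of the same polling idea with a timeout follows. The helper `wait_for` and its parameters are hypothetical, and the status check is replaced by a stand-in so that the example runs without the data generator.

```python
import time


def wait_for(check, timeout_s=600.0, interval_s=1.0):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for the sequence of statuses that the data generator might report.
statuses = iter(["RUNNING", "RUNNING", "COMPLETE"])
done = wait_for(lambda: next(statuses) == "COMPLETE", timeout_s=5, interval_s=0)
print("finished:", done)

# In the notebook, the check could instead be:
#   lambda: requests.get(f"{datagen_host}/status/{datagen_job}").json()["status"] == "COMPLETE"
```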

### Use streaming ingestion to populate the table

Ingest data from an Apache Kafka topic into Apache Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

Run the next cell to set up the [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig), [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig), and [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). Notice that the specification:

* Begins reading from the beginning of the stream (`useEarliestOffset`) since this is the [first time](https://druid.apache.org/docs/latest/ingestion/kafka-ingestion#io-configuration) this topic has been read.
* Has a primary partitioning (`segmentGranularity`) of `DAY`.
* Uses automatic schema discovery.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": "true" }

tuningConfig = { "type": "kafka" }

table_name = datagen_topic

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": "false", "segmentGranularity": "day" },
    "dimensionsSpec": { "useSchemaDiscovery" : "true"}
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
sql_client.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)
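
The cell above does not inspect the response returned by `requests.post`, so a rejected specification would go unnoticed until `wait_until_ready` stalls. As a sketch, a small guard could surface the error early. The helper `require_ok` is hypothetical, and the responses here are stand-in objects rather than real calls to Druid. With a real `requests` response, calling its `raise_for_status()` method is an alternative.

```python
from types import SimpleNamespace


def require_ok(response, action):
    """Return the response if its status code is 2xx, otherwise raise an error."""
    if not 200 <= response.status_code < 300:
        raise RuntimeError(
            f"{action} failed with HTTP {response.status_code}: {response.text}"
        )
    return response


# Stand-ins that mimic the two attributes of a requests.Response used above.
accepted = SimpleNamespace(status_code=200, text='{"id": "example-clickstream-async"}')
rejected = SimpleNamespace(status_code=400, text="made-up error body")

require_ok(accepted, "Submit supervisor")
try:
    require_ok(rejected, "Submit supervisor")
except RuntimeError as error:
    print(error)
```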

[Terminating or suspending](https://druid.apache.org/docs/latest/ingestion/supervisor#manage-a-supervisor) a supervisor will cause a [handoff](https://druid.apache.org/docs/latest/design/storage#indexing-and-handoff) from the indexers to deep storage, and from there to historicals.

Run the following cell to pause execution of this notebook for 5 seconds before terminating the supervisor.

In [None]:
time.sleep(5)
requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate", headers=druid_headers)

Run the following cell to see how the table is currently loaded in the cluster.

In [None]:
sql=f'''
SELECT
  a."start",
  a."end",
  c."server",
  c."tier",
  c."server_type"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = '{table_name}'
ORDER BY "start", "tier"
'''

display.sql(sql)

Run the cell above a number of times.

You will see how the data remains advertised on the ingestion tasks ("peons"), is loaded to historicals, and then eventually stops being advertised on ingestion tasks.

The default set of retention rules (`_default`) currently applies to the table, with the entire set of segments loaded to historicals.

## Apply retention rules

Force some of the data to be left on deep storage by changing the retention load rules for a table.

The retention rules in the next cell ensure that Druid loads data from the last week (a `period` of `P1W`) to Historical services in the `_default_tier` tier. Druid applies the second rule to all remaining data, which [leaves that data on deep storage](https://druid.apache.org/docs/latest/querying/query-deep-storage#keep-segments-in-deep-storage-only) because:

* `tieredReplicants` is empty.
* `useDefaultTierForNull` is set to false.

Run the cell to apply the rule.

In [None]:
retention_rules = [
  {
    "type": "loadByPeriod",
    "period": "P1W",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {},
    "useDefaultTierForNull": "false"
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)
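
Druid evaluates retention rules in order and applies the first rule that matches a segment. The following sketch imitates that evaluation for the two rules above to show which daily segments would stay on Historical services. It is a simplified model, not the Coordinator's implementation: it parses only whole-day and whole-week periods, treats a segment as matching `loadByPeriod` when its end falls inside the period, and ignores cluster-wide default rules. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

RULES = [
    {"type": "loadByPeriod", "period": "P1W", "tieredReplicants": {"_default_tier": 1}},
    {"type": "loadForever", "tieredReplicants": {}, "useDefaultTierForNull": "false"},
]


def period_to_timedelta(period):
    """Parse only whole-day or whole-week ISO 8601 periods such as P2D or P1W."""
    match = re.fullmatch(r"P(\d+)([DW])", period)
    if match is None:
        raise ValueError(f"Unsupported period: {period}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(weeks=amount) if unit == "W" else timedelta(days=amount)


def placement(segment_end, now, rules):
    """Return where a segment ends up under the first matching rule."""
    for rule in rules:
        if rule["type"] == "loadByPeriod":
            matches = segment_end > now - period_to_timedelta(rule["period"])
        else:
            matches = rule["type"] == "loadForever"
        if matches:
            # An empty tieredReplicants map means no Historical loads the segment.
            return "historical" if rule["tieredReplicants"] else "deep storage only"
    return "no rule matched"


now = datetime(2024, 1, 15, 12, 0)  # fixed for a reproducible example
for days_ago in (1, 6, 8, 12):
    start = datetime(2024, 1, 15) - timedelta(days=days_ago)
    end = start + timedelta(days=1)  # segmentGranularity is DAY
    print(start.date(), placement(end, now, RULES))
```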

Now run the next cell to confirm the change.

In [None]:
sql=f'''
SELECT
  a."start",
  a."end",
  c."server",
  c."tier"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = '{table_name}'
ORDER BY "start", "tier"
'''

display.sql(sql)

Run the cell above until some segments are shown without a server and tier, indicating that those segments exist in the database, but are not cached on historicals for query through the interactive API.
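
If you prefer a summary to scanning the table by eye, the rows can be grouped by segment. The sketch below assumes that the query result is available as a list of dictionaries keyed by column name. The sample rows and the helper `split_by_placement` are hypothetical.

```python
def split_by_placement(rows):
    """Split segments into those served by at least one server and those served by none."""
    servers_by_segment = {}
    for row in rows:
        servers = servers_by_segment.setdefault((row["start"], row["end"]), set())
        # The LEFT JOIN yields an empty server column for segments that no server has loaded.
        if row.get("server"):
            servers.add(row["server"])
    cached = sorted(key for key, servers in servers_by_segment.items() if servers)
    deep_only = sorted(key for key, servers in servers_by_segment.items() if not servers)
    return cached, deep_only


# Made-up rows in the shape returned by the query above.
sample_rows = [
    {"start": "2024-01-05T00:00:00.000Z", "end": "2024-01-06T00:00:00.000Z", "server": None, "tier": None},
    {"start": "2024-01-12T00:00:00.000Z", "end": "2024-01-13T00:00:00.000Z", "server": "historical:8083", "tier": "_default_tier"},
]

cached, deep_only = split_by_placement(sample_rows)
print("cached on a server:", cached)
print("deep storage only: ", deep_only)
```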

## Execute an asynchronous query

Use the `/druid/v2/sql/statements` API endpoint to run asynchronous queries using the multi-stage query (MSQ) engine.

Run the following cell, which uses the `async_sql` method of the `druidapi` package to call the API and return the results. The `async_sql` method handles the [necessary steps](https://druid.apache.org/docs/latest/tutorials/tutorial-query-deep-storage#query-from-deep-storage) to submit the query, wait for it to finish, and retrieve the results.

In [None]:
sql=f'''
SELECT
  TIME_FLOOR("__time",'P1D') AS "period",
  COUNT(*) as "events"
FROM "{table_name}"
GROUP BY 1
'''

req = sql_client.sql_request(sql)
result = sql_client.async_sql(req)
result.wait_until_done()

print(json.dumps(result.rows, indent=2))
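
The steps that `async_sql` hides can be sketched as three calls against the statements API: submit the query, poll its status, and fetch a page of results. The code below is an illustration of that flow, not the `druidapi` implementation. The function `run_statement` and the `FakeDruid` stand-in are hypothetical, the sample row is made up, and only the first page of results is fetched.

```python
import time


def run_statement(post, get, sql, context=None, poll_interval_s=1.0):
    """Submit an asynchronous query, wait for it to finish, and return the first result page."""
    # Step 1: submit the query with executionMode set to ASYNC.
    payload = {"query": sql, "context": {"executionMode": "ASYNC", **(context or {})}}
    status = post("/druid/v2/sql/statements", payload)
    query_id = status["queryId"]

    # Step 2: poll the status until the query leaves the ACCEPTED and RUNNING states.
    while status["state"] in ("ACCEPTED", "RUNNING"):
        time.sleep(poll_interval_s)
        status = get(f"/druid/v2/sql/statements/{query_id}")
    if status["state"] != "SUCCESS":
        raise RuntimeError(f"Query {query_id} ended in state {status['state']}")

    # Step 3: retrieve the first page of results.
    return get(f"/druid/v2/sql/statements/{query_id}/results?page=0")


class FakeDruid:
    """Stand-in transport so that the sketch runs without a Druid cluster."""

    def __init__(self):
        self.states = iter(["RUNNING", "SUCCESS"])
        self.submitted = None

    def post(self, path, payload):
        self.submitted = payload
        return {"queryId": "query-1", "state": "ACCEPTED"}

    def get(self, path):
        if path.endswith("/results?page=0"):
            return [{"period": "2024-01-01T00:00:00.000Z", "events": 42}]
        return {"queryId": "query-1", "state": next(self.states)}


fake = FakeDruid()
rows = run_statement(fake.post, fake.get, "SELECT 1", poll_interval_s=0)
print(rows)
```

Against a real cluster, `post` and `get` could wrap `requests.post(druid_host + path, json=payload).json()` and `requests.get(druid_host + path).json()`.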

Compare this to the faster set of results retrieved from the interactive Druid SQL API, which only includes segments that are cached on historicals.

In [None]:
sql=f'''
SELECT
  TIME_FLOOR("__time",'P1D') AS "period",
  COUNT(*) as "events"
FROM "{table_name}"
GROUP BY 1
'''

display.sql(sql)

To include results from streaming ingestion, set the `includeSegmentSource` parameter in the query context of the request that you POST to the API.

Run the following cell to resume the streaming of events into the Kafka topic and restart the supervisor.

In [None]:
datagen_time_type = 'REAL'

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic  },
    "config_file": datagen_config,
    "time":"15m",
    "time_type": datagen_time_type,
    "concurrency": 100
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)

Submit the query to the asynchronous API again by running the next cell.

This time, notice that `includeSegmentSource` is set in the context.

In [None]:
sql=f'''
SELECT
  TIME_FLOOR("__time",'P1D') AS "period",
  COUNT(*) as "events"
FROM "{table_name}"
GROUP BY 1
'''

req = sql_client.sql_request(sql)
req.add_context("includeSegmentSource", "realtime")
result = sql_client.async_sql(req)
result.wait_until_done()

print(json.dumps(result.rows, indent=2))
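
For reference, the request body that results from adding the context parameter might look like the output of the following sketch. The helper `statement_payload` is hypothetical, the value `realtime` is copied from the cell above, and `executionMode` is included on the assumption that the request goes directly to the statements API rather than through `druidapi`.

```python
import json


def statement_payload(sql, include_realtime=False):
    """Build the JSON body for a POST to /druid/v2/sql/statements."""
    context = {"executionMode": "ASYNC"}
    if include_realtime:
        # Ask the query to read segments served by ingestion tasks as well.
        context["includeSegmentSource"] = "realtime"
    return {"query": sql, "context": context}


body = json.dumps(
    statement_payload('SELECT COUNT(*) FROM "example-clickstream-async"', include_realtime=True),
    indent=2,
)
print(body)
```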

The result from the query covers the full timeline of available data: the ingested data held in deep storage and the data being streamed into the table continuously.

## Clean up

Run the following cell to stop the running data generator, shut down the supervisor and its tasks, reset the retention rules for the table, and drop the table.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print('Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')

retention_rules = []
requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)

print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

* SELECT operations can run online (interactive API) or offline (asynchronous API).
* The offline API can access data that has not been prefetched to historicals.
* The same API can also access data currently being ingested from streams.

## Learn more

* Read about [using EXTERN to export data](https://druid.apache.org/docs/latest/multi-stage-query/reference#extern-to-export-to-a-destination).
* See the [documentation](https://druid.apache.org/docs/latest/querying/query-deep-storage) and [tutorial](https://druid.apache.org/docs/latest/tutorials/tutorial-query-deep-storage) on querying from deep storage.
* Read about [durable storage](https://druid.apache.org/docs/latest/multi-stage-query/reference#durable-storage) in the documentation, including how to [configure](https://druid.apache.org/docs/latest/operations/durable-storage) it.
* See the [historical async query](21-query-async-historical.ipynb) notebook for examples of the API being used directly, and of using pagination in results.
* See more retention load rules in the [load rules](20-tiering-historicals.ipynb) notebook.