# Query from Deep Storage covering the whole timeline
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
Apache Druid is known for fast performance on both historical and real-time data. Interactive queries are resolved through the synchronous API endpoints `/druid/v2` and `/druid/v2/sql`, which use Druid's native engine.

Asynchronous queries from Deep Storage, served by the API endpoint `/druid/v2/sql/statements`, were introduced in the 27.0.0 release along with retention rules that keep segments in Deep Storage without caching them on Historicals. This allows Druid users to query longer timeframes without caching all data in the Historical layer. Users can optimize cluster cost by caching only recent data in the Historical layer and leaving older segment data available for queries in Deep Storage.

At that stage, asynchronous queries could only access segments that were available in Deep Storage, so the most recently streamed data was not visible in the query results.

With Druid 28.0.0, asynchronous queries can also read from real-time tasks, which gives this type of query access to the complete timeline.

This tutorial demonstrates how to work with [Query From Deep Storage](https://druid.apache.org/docs/latest/api-reference/sql-api#query-from-deep-storage). In this tutorial you perform the following tasks:

- Generate 3 months of data and use batch ingestion to load it.
- Set up stream ingestion for live data from the generator.
- Set up retention rules to hold only 1 month of data in the Historical cache.
- Use synchronous and asynchronous queries to show timeline coverage of each case.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import json
import time
from datetime import datetime, timedelta

# get druid host from param if available
if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"

# get kafka host from param if available
# (Kafka bootstrap servers are given as host:port, without a URL scheme)
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host="localhost:9092"
else:
    kafka_host=f"{os.environ['KAFKA_HOST']}:9092"

print(f"Opening a connection to {druid_host}.")

#setup Druid API clients
druid = druidapi.jupyter_client(druid_host)
display_client = druid.display
sql_client = druid.sql
status_client = druid.status
rest_client = druid.rest

# client for Data Generator API
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")

# define header for REST calls
headers = {
  'Content-Type': 'application/json'
}

status_client.version

## Helper functions
These functions help control the data flow in the notebook:
- `wait_for_datagen` - checks the status of a data generation job, displays it, and loops until the status is COMPLETE.
- `monitor_ingestion` - waits for the target table to reach a given number of rows; use it to monitor the completion of an ingestion.
- `stop_streaming_job` - suspends a streaming ingestion job, waits for its tasks to publish their segments, optionally resets its partition offsets so that it can be rerun from the start, and terminates it.
- `drop_table` - marks all of the segments in a table as unused and submits a `kill` task to clean up the metadata and Deep Storage.

In [None]:
import time
from IPython.display import clear_output

# wait for the messages to be fully published 
def wait_for_datagen( job_name): 
    done = False
    while not done:
        result = datagen.get_json(f"/status/{job_name}",'')
        clear_output(wait=True)
        print(json.dumps(result, indent=2))
        if result["status"] == 'COMPLETE':
            done = True
        else:
            time.sleep(1)


# monitor ingestion by counting the rows ingested until the expected number of rows have been loaded
def monitor_ingestion( target_table:str, target_rows:int):
    row_count=0
    while row_count<target_rows:
        res = sql_client.sql(f'SELECT count(1) as "count" FROM {target_table}')
        clear_output(wait=True)
        print(json.dumps(res, indent=2))
        row_count = res[0]['count']
        time.sleep(1)
        
# suspend the streaming ingestion job and wait for tasks to publish their segments
def stop_streaming_job( target_table: str, reset_offsets: bool = False):
    print(f'Pause streaming ingestion: [{druid.rest.post(f"/druid/indexer/v1/supervisor/{target_table}/suspend","", require_ok=False)}]')
    

    tasks = druid.tasks.tasks(state='running', table=target_table)
    tasks_done = 0
    while tasks_done<len(tasks):
        tasks_done = 0
        clear_output( wait=True)
        print(f'Waiting for running tasks to publish their segments ...')
        for task in tasks:
            status = druid.tasks.task_status(task['id'])
            print(f"Task [{task['id']}] Status:{status['status']['statusCode']} RunnerStatus:{status['status']['runnerStatusCode']}")
            if (status['status']['statusCode']!='RUNNING'): 
                tasks_done += 1 
        time.sleep(1)
            
    if reset_offsets:
        print(f'Reset offsets for re-runnability: [{druid.rest.post(f"/druid/indexer/v1/supervisor/{target_table}/reset","", require_ok=False)}]')
    print(f'Terminate streaming ingestion: [{druid.rest.post(f"/druid/indexer/v1/supervisor/{target_table}/terminate","", require_ok=False)}]')

# Remove table data and metadata from Druid
def drop_table( target_table: str):
    # mark segments as unused 
    druid.datasources.drop(target_table)
    # remove segment metadata and data for unused segments
    headers = {'Content-Type': 'application/json'}
    kill_task = {
      "type": "kill",
      "dataSource": target_table,
      "interval" : "2000-09-12/2999-09-13"
    }
    print(druid.rest.post(f"/druid/indexer/v1/task", json.dumps(kill_task),require_ok=False, headers=headers))
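
The polling helpers above loop until their condition is met, with no upper bound, so a stalled job would block the notebook indefinitely. The following is a minimal sketch of the same pattern with a timeout. `poll_until` and its parameters are hypothetical names chosen for illustration; they are not part of `druidapi`.

```python
import time

def poll_until(check, timeout_s=60.0, interval_s=1.0):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True if the condition was met and False on timeout.
    poll_until is an illustrative helper, not a druidapi function.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In this notebook, `check` could wrap the status call used by `wait_for_datagen`, so that the cell gives up after a fixed time instead of waiting forever.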

## Generate history
Run the following cell to generate 90 days (about 3 months) of history ending at the current time. Later in the notebook, retention rules split this history into the most recent month, which stays loaded on Historicals, and the older data, which is available only when [queried from Deep Storage](https://druid.apache.org/docs/latest/querying/query-deep-storage).

In [None]:
# generate 90 days of click data
days_of_history = 90

start_time = datetime.now()
start_time = start_time - timedelta(days=days_of_history)
start_date = start_time.strftime('%Y-%m-%dT%H:%M:%S.001')
print(f"Starting to generate history at {start_date}.")

# Give the datagen job a name for use in subsequent API calls
job_name="gen_clickstream_history"

# Generate a data file on the datagen server
datagen_request = {
    "name": job_name,
    "target": { "type": "file", "path":"clicks-90-days.json"},
    "config_file": "clickstream/clickstream.json", 
    "time_type": start_date,
    "time": f"{days_of_history*24}h",
    "concurrency":2
}

datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)

wait_for_datagen(job_name)

Wait here while the prior cell completes. Generation starts with a timestamp 90 days ago. Track progress in the output above: `total_records` shows the number of events generated and `status_msg` displays the simulated time as `Sim Clock`. The job finishes when the simulated clock has advanced 90 days from `start_time`, at which point `status` shows "COMPLETE".
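
To make the time arithmetic in the generation request concrete, the sketch below reproduces how the cell derives its two time values. It assumes, as the request suggests, that the generator treats `time_type` as the simulated start timestamp and `time` as the simulated duration. `history_window` is a hypothetical helper name, and the fixed date is only there to make the example reproducible.

```python
from datetime import datetime, timedelta

def history_window(now, days_of_history):
    """Return (start_date, duration) in the format used by the datagen request.

    history_window is an illustrative helper, not part of the notebook.
    """
    start = now - timedelta(days=days_of_history)
    # Same timestamp format as the notebook cell: fixed .001 millisecond suffix.
    start_date = start.strftime('%Y-%m-%dT%H:%M:%S.001')
    # The duration is expressed in hours of simulated time.
    duration = f"{days_of_history * 24}h"
    return start_date, duration

# A fixed "now" instead of datetime.now(), so the output is deterministic.
start_date, duration = history_window(datetime(2024, 4, 1, 12, 0, 0), 90)
print(start_date, duration)  # 2024-01-02T12:00:00.001 2160h
```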

In the following cell, the data is ingested from the generated data file using SQL-based ingestion.
When completed, you'll see a description of the final table.

In [None]:
# initiate ingestion job
sql='''
REPLACE INTO "example-clicks-full-timeline" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["http://datagen:9999/file/clicks-90-days.json"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("time" VARCHAR, "user_id" VARCHAR, "event_type" VARCHAR, "client_ip" VARCHAR, "client_device" VARCHAR, "client_lang" VARCHAR, "client_country" VARCHAR, "referrer" VARCHAR, "keyword" VARCHAR, "product" VARCHAR)
)
SELECT
  TIME_PARSE("time") AS "__time",
  "user_id",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country",
  "referrer",
  "keyword",
  "product"
FROM "ext"
PARTITIONED BY DAY
'''

display_client.run_task(sql)
sql_client.wait_until_ready('example-clicks-full-timeline')
display_client.table('example-clicks-full-timeline')
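
The ingestion statement lists every column twice: once in the `EXTEND` clause and once in the `SELECT` list. As a sketch of one way to keep the two in sync, the statement can be assembled from a single column list. `build_replace_sql` is a hypothetical helper written for this illustration. It assumes every input column is a `VARCHAR`, as in the cell above, and performs no escaping of names.

```python
import json

COLUMNS = ["user_id", "event_type", "client_ip", "client_device", "client_lang",
           "client_country", "referrer", "keyword", "product"]

def build_replace_sql(table, uri, columns, time_column="time"):
    """Assemble the REPLACE statement from one list of column names.

    Illustrative helper: all columns are declared VARCHAR and names are
    inserted as-is, so it is only suitable for trusted, simple identifiers.
    """
    extend = ", ".join(f'"{c}" VARCHAR' for c in [time_column] + columns)
    select = ",\n  ".join(f'"{c}"' for c in columns)
    source = json.dumps({"type": "http", "uris": [uri]}, separators=(",", ":"))
    return f'''REPLACE INTO "{table}" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{source}',
      '{{"type":"json"}}'
    )
  ) EXTEND ({extend})
)
SELECT
  TIME_PARSE("{time_column}") AS "__time",
  {select}
FROM "ext"
PARTITIONED BY DAY
'''

sql = build_replace_sql("example-clicks-full-timeline",
                        "http://datagen:9999/file/clicks-90-days.json",
                        COLUMNS)
print(sql)
```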

## Create data stream and streaming ingestion
To fully exercise this feature, complement the batch ingestion above with real-time ingestion.
The following cell starts real-time data generation, streaming events to the Kafka topic `clicks` for 4 hours.

In [None]:
# Give the datagen job a name for use in subsequent API calls
job_name="gen_clickstream_stream"

# Generate streaming data in real time
datagen_request = {
    "name": job_name,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": "clicks" },
    "config_file": "clickstream/clickstream.json", 
    "time_type": "REAL",
    "time": "4h",
    "concurrency":2
}
output = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)

output

### Start the streaming ingestion job in Druid
The following streaming ingestion job reads from the topic `clicks` and uses `useSchemaDiscovery: True` to [automatically detect the columns](https://druid.apache.org/docs/latest/ingestion/schema-design#schema-auto-discovery-for-dimensions) in the stream. It sets `segmentGranularity` to `day`, matching the `PARTITIONED BY DAY` clause of the batch ingestion you ran above.

In [None]:
# start streaming ingestion job
kafka_ingestion_spec = {
  "type": "kafka",
  "spec": {
    "ioConfig": { "type": "kafka",  "consumerProperties": { "bootstrap.servers": "kafka:9092" },
      "topic": "clicks",
      "inputFormat": { "type": "kafka", "valueFormat": { "type": "json" }   },
       "useEarliestOffset": True
    },
    "tuningConfig": { "type": "kafka"  },
    "dataSchema": {
      "dataSource": "example-clicks-full-timeline",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [ ],
        "useSchemaDiscovery": True
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": False,
        "segmentGranularity": "day"
      }
    }
  }
}
druid.rest.post("/druid/indexer/v1/supervisor", json.dumps(kafka_ingestion_spec), headers=headers)

### Nothing new so far

It might take a minute or two for the real-time data to become available.
The table holds 3 months of data, and the next query shows whether ingestion has caught up to the present. Run it a few times until you see the full 90+ days in the data and the most recent event within the last minute. Events are generated irregularly, but there should be at least one every minute.

In [None]:
sql='''
  SELECT 
      min(__time) "min_time", 
      max(__time) "max_time", 
      TIMESTAMPDIFF( DAY, min(__time), max(__time)) "days_of_data", 
      TIMESTAMPDIFF( SECOND, max(__time), CURRENT_TIMESTAMP) "recent_event_seconds_ago" 
  FROM "example-clicks-full-timeline"
'''
display_client.sql(sql)

You can see the number of segments assigned to the Historical and the Peon running the real-time ingestion with the following system table query:

In [None]:
sql='''
SELECT
  b."server",
  c."server_type",
  a."is_realtime",
  COUNT(*) AS "segments",
  SUM(a."num_replicas") AS "total_replicas"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = 'example-clicks-full-timeline'
GROUP BY 1, 2, 3
'''
display_client.sql(sql)

## Turn off pre-fetch for older data using retention rules
The [load and drop rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules) in the next cell adjust the rules that the [Coordinator](https://druid.apache.org/docs/latest/design/coordinator) follows when it distributes segments to the Historical processes. The `tieredReplicants` setting differs according to the period of time that a given segment of the `example-clicks-full-timeline` table covers.

- The first rule covers the period up to one month ago (`P1M`) and sets the number of replicas on the default Historical tier to 1.
- The second rule covers the period up to three months ago (`P3M`) and leaves `tieredReplicants` empty, instructing the Coordinator to keep this data only in Deep Storage and to remove it from Historicals that already have it.
- The third rule, `dropForever`, catches all other data from the table and marks it for deletion.

When you run the cell, the rules are updated and take effect in the Coordinator's next cycle. Because the data was already loaded and distributed to the Historical, the Coordinator instructs the Historical to remove segments older than one month from its local cache.

In [None]:
retention_rule = [
    {"type":"loadByPeriod", "period":"P1M", "tieredReplicants": { "_default_tier": 1} },
    {"type":"loadByPeriod", "period":"P3M", "tieredReplicants": { }, "useDefaultTierForNull": False }, 
    {"type":"dropForever" }
]

druid.rest.post("/druid/coordinator/v1/rules/example-clicks-full-timeline", json.dumps(retention_rule), headers=headers)
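
Retention rules are evaluated in order, and the first rule that matches a segment decides what happens to it. The sketch below illustrates that first-match behavior for the three rules above. It is a simplification rather than Druid's implementation: `placement` is a hypothetical function, the ISO periods are approximated as day counts (`P1M` as 30 days, `P3M` as 90 days), and a segment is matched only by its end time.

```python
from datetime import datetime, timedelta

# Simplified copies of the three rules, with ISO periods replaced by
# approximate day counts (assumption: P1M ~ 30 days, P3M ~ 90 days).
RULES = [
    {"type": "loadByPeriod", "days": 30, "tieredReplicants": {"_default_tier": 1}},
    {"type": "loadByPeriod", "days": 90, "tieredReplicants": {}},
    {"type": "dropForever"},
]

def placement(segment_end, now, rules=RULES):
    """Return where a segment ends up under first-match rule evaluation."""
    for rule in rules:
        if rule["type"] == "dropForever":
            return "dropped"
        # A period rule matches when the segment falls inside the look-back window.
        if segment_end > now - timedelta(days=rule["days"]):
            # Empty tieredReplicants means no Historical holds a copy.
            return "historical" if rule["tieredReplicants"] else "deep storage only"
    return "no rule matched"

now = datetime(2024, 6, 30)
for end in (datetime(2024, 6, 20), datetime(2024, 5, 1), datetime(2024, 1, 1)):
    print(end.date(), "->", placement(end, now))
```

The order matters: if the `P3M` rule came first, it would also match the most recent month, and no segment would be loaded on the Historical.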

## Synchronous queries with native engine 
It might take a minute or two for the cluster to re-organize the data.

The Coordinator asks the Historical servers to drop their copies of any segment older than one month, the period `P1M` of the first rule.

Run the following cell repeatedly until you see this result:
- `min_time` is now about 30 days ago.
- `max_time` continues to keep up with real time.
- `days_of_data` now reports 31 days: the past 30 plus the one that is still being loaded.
- `recent_event_seconds_ago` stays small, because events arrive about once a minute.

In [None]:
sql='''
  SELECT 
      min(__time) "min_time", 
      max(__time) "max_time", 
      TIMESTAMPDIFF( DAY, min(__time), max(__time)) "days_of_data", 
      TIMESTAMPDIFF( SECOND, max(__time), CURRENT_TIMESTAMP) "recent_event_seconds_ago" 
  FROM "example-clicks-full-timeline"
'''
display_client.sql(sql)

Run the following cell to look at the segments again. Notice that fewer segments are on the Historical, and most segments are no longer assigned to a server at all. These segments exist only in Deep Storage:

In [None]:
sql='''
SELECT
  b."server",
  c."server_type",
  a."is_realtime",
  COUNT(*) AS "segments",
  SUM(a."num_replicas") AS "total_replicas"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = 'example-clicks-full-timeline'
GROUP BY 1, 2, 3
'''
display_client.sql(sql)

## Asynchronous queries with MSQ 
Use the API endpoint `/druid/v2/sql/statements` to run asynchronous queries using the multi-stage query (MSQ) engine. It reads data directly from Deep Storage and can therefore access all 3 months of segments. The `druidapi` package includes the `async_sql` function, which uses this API.

That was a long setup to describe the new feature in Druid 28.0.0!

### Include real-time data in asynchronous queries
The `includeSegmentSource=realtime` context parameter instructs the database to include data from the real-time segments seen in the `sys` query results above.

Try without setting it first:

In [None]:
sql='''
  SELECT 
      TIME_FORMAT( min(__time), 'YYYY-MM-dd HH:mm:ss') "min_time", 
      TIME_FORMAT(max(__time), 'YYYY-MM-dd HH:mm:ss') "max_time", 
      TIMESTAMPDIFF( DAY, min(__time), max(__time)) "days_of_data", 
      TIMESTAMPDIFF( SECOND, max(__time), CURRENT_TIMESTAMP ) "recent_event_seconds_ago" 
  FROM "example-clicks-full-timeline"
'''
result = sql_client.async_sql(sql)
result.wait_until_done() # wait until the query has completed
display(result.rows)


Notice that the query includes all the ingested segments starting from 90 days ago, but the most recent events are not available: `recent_event_seconds_ago` is much larger than 60 seconds.

Now try with context parameter `includeSegmentSource = realtime`:

In [None]:
sql='''
  SELECT 
      TIME_FORMAT( min(__time), 'YYYY-MM-dd HH:mm:ss') "min_time", 
      TIME_FORMAT(max(__time), 'YYYY-MM-dd HH:mm:ss') "max_time", 
      TIMESTAMPDIFF( DAY, min(__time), max(__time)) "days_of_data", 
      TIMESTAMPDIFF( SECOND, max(__time), CURRENT_TIMESTAMP) "most_recent_to_now_s" 
  FROM "example-clicks-full-timeline"
'''
req = sql_client.sql_request(sql)
req.add_context("includeSegmentSource", "realtime")
result = sql_client.async_sql(req)
result.wait_until_done()  # wait until the query has completed
display(result.rows)


The query result covers the full timeline of available data: the ingested data in Deep Storage plus the latest data being streamed continuously into the table.
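
For readers who call `/druid/v2/sql/statements` without the Python client, the sketch below assembles a request body that carries the same context parameters. It only builds the JSON and sends nothing. `build_statement_payload` is a hypothetical helper, and the `executionMode` context key is taken from the Druid SQL API documentation as an assumption about what the endpoint expects. Check it against the documentation for your Druid version.

```python
import json

def build_statement_payload(sql, include_realtime=False, rows_per_page=None):
    """Build a request body for POST /druid/v2/sql/statements.

    Illustrative helper: "executionMode" is an assumption based on the
    Druid SQL API documentation; the other keys appear in this notebook.
    """
    context = {"executionMode": "ASYNC"}
    if include_realtime:
        # Also read segments held by real-time ingestion tasks.
        context["includeSegmentSource"] = "realtime"
    if rows_per_page is not None:
        # Paging, as in this notebook, writes results to durable storage.
        context["selectDestination"] = "durableStorage"
        context["rowsPerPage"] = rows_per_page
    return {"query": sql, "context": context}

payload = build_statement_payload(
    'SELECT COUNT(*) FROM "example-clicks-full-timeline"',
    include_realtime=True,
)
print(json.dumps(payload, indent=2))
```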

<a id='async_rows_per_page'></a>
## Retrieving results page by page 

Use the `rowsPerPage` query context parameter to control the size of the results page.
In the cell below, the page size is set by providing a parameter in the call to the Python API.

In [None]:
sql='''
SELECT * 
  FROM "example-clicks-full-timeline"
  WHERE __time > CURRENT_TIMESTAMP - INTERVAL '2' HOUR
  ORDER BY __time DESC
'''
req = sql_client.sql_request(sql)
req.add_context("includeSegmentSource", "realtime")
req.add_context("selectDestination", "durableStorage")  # needed to enable paging results

# use rowsPerPage parameter to define page size, defaults to 100000
result = sql_client.async_sql(req, rowsPerPage=10)

# wait for query to be processed
result.wait_until_done()

# retrieve results one page at a time
print("\nPAGE #0\n")
display(result.paged_rows(pageNum=0))

print("\nPAGE #1\n")
display(result.paged_rows(pageNum=1))
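
The cell above fetches two pages by number. To walk through every page, a loop can request pages until one comes back empty. The sketch below shows that loop against a stand-in data source so that it runs on its own. `iter_pages` and `fake_fetch` are hypothetical names, and the stop condition assumes that requesting a page past the end returns an empty list, which has not been verified for `paged_rows`.

```python
def iter_pages(fetch_page, max_pages=1000):
    """Yield (page_number, rows) until fetch_page returns an empty page.

    Illustrative helper; max_pages guards against an endless loop if the
    empty-page assumption does not hold.
    """
    for page_num in range(max_pages):
        rows = fetch_page(page_num)
        if not rows:
            break
        yield page_num, rows

# Stand-in for result.paged_rows: three pages of made-up rows.
FAKE_PAGES = [[{"n": 1}, {"n": 2}], [{"n": 3}, {"n": 4}], [{"n": 5}]]

def fake_fetch(page_num):
    return FAKE_PAGES[page_num] if page_num < len(FAKE_PAGES) else []

for page_num, rows in iter_pages(fake_fetch):
    print(f"PAGE #{page_num}: {len(rows)} rows")
```

With the notebook's result object, `fetch_page` could be `lambda n: result.paged_rows(pageNum=n)`, subject to the assumption stated above.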

## Clean up

Run the following cell to remove everything used in this notebook from the database and data generation engine.

In [None]:

stop_streaming_job("example-clicks-full-timeline")

display(datagen.post(f"/stop/gen_clickstream_stream", '', require_ok=False).json())
display(datagen.post(f"/stop/gen_clickstream_history", '', require_ok=False).json())

drop_table("example-clicks-full-timeline")


## Summary

You learned how to:
* set up retention rules that cache recent segments in the Historical tier
* keep older segments available in Deep Storage for asynchronous queries
* run asynchronous queries that also retrieve real-time data
* retrieve results a page at a time