# Run a query offline using the asynchronous query API
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Druid provides two APIs to run SELECT queries - the interactive API and the asynchronous API.

Whereas the interactive API uses data prefetched on historicals and arriving from event streams, the asynchronous API accesses data in deep storage in combination with streaming data.

This tutorial focuses on using the asynchronous API to access data in [deep storage](https://druid.apache.org/docs/latest/api-reference/sql-api#query-from-deep-storage). To see how to access real-time data, see the [full timeline queries](14-query-async-realtime.ipynb) notebook.

You will perform the following tasks:

- Ingest some data covering a long period of time.
- Apply a retention rule set to the table so that some of the data is not loaded to historicals.
- Execute a query using the asynchronous API.
- Retrieve the results of your query.
- Apply pagination to your results.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Fall back to localhost when the DRUID_HOST environment variable is not set.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Import additional modules

Run the following cell to import additional Python modules that you will use to call some APIs directly.

In [None]:
import json
import requests

## Create a table using batch ingestion

Run the following cell to create a table using batch ingestion. Notice that the `depaturetime` column is parsed with `TIME_PARSE` to provide the `__time` timestamp, that only the columns needed for this tutorial are ingested, and that the table is partitioned by day.

When completed, you'll see a description of the final table.

In [None]:
table_name = 'example-flights-querydeepstorage'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Year",
  "Reporting_Airline",
  "Origin",
  "Dest",
  "Distance"
FROM "ext"
WHERE "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

Run the following cell to see how the table is currently loaded in the cluster.

In [None]:
sql=f'''
SELECT
  a."start",
  a."end",
  c."server",
  c."tier",
  "num_rows",
  "size"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = '{table_name}'
ORDER BY "start", "tier"
'''

display.sql(sql)

Notice that, because the default set of retention rules (`_default`) has been applied to this new table, the entire set of segments is loaded to historicals.

## Apply retention rules

Force some of the data to be left on deep storage by changing the retention load rules for a table.

In the next cell, the first retention rule loads data from 18th November 2005 onward (the ten-year interval `2005-11-18/P10Y`) to historicals in the `_default_tier` tier. All other data is caught by the second rule, which has no tiered replicants and so ensures that data is not cached on historicals.

Run the cell to apply the rule.

In [None]:
retention_rules = [
  {
    "type": "loadByInterval",
    "interval": "2005-11-18/P10Y",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {},
    "useDefaultTierForNull": "false"
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)
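
The way the two rules divide the data can be sketched in plain Python. This is a simplified illustration, not Druid's implementation: it assumes rules are tried in order with the first match winning, it expands `P10Y` to an explicit end date by hand, and the `RULES` structure and `matching_tier` function are hypothetical names used only here.

```python
from datetime import date

# Simplified stand-ins for the two rules posted above (assumption: the
# P10Y period is expanded by hand to an explicit end date).
RULES = [
    {"type": "loadByInterval", "start": date(2005, 11, 18),
     "end": date(2015, 11, 18), "tier": "_default_tier"},
    # Empty tieredReplicants: matching segments are loaded nowhere.
    {"type": "loadForever", "tier": None},
]


def matching_tier(segment_start, rules=RULES):
    """Return the tier picked by the first rule that matches, or None."""
    for rule in rules:
        if rule["type"] == "loadForever":
            return rule["tier"]
        # Simplification: with day-aligned segments and a day-aligned rule
        # interval, checking the segment start is enough for this sketch.
        if rule["start"] <= segment_start < rule["end"]:
            return rule["tier"]
    return None


for day in (date(2005, 11, 10), date(2005, 11, 17), date(2005, 11, 18)):
    print(day, "->", matching_tier(day))
```

Under these assumptions, daily segments before 18th November fall through to the second rule and stay in deep storage only.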

Now run the next cell to confirm the change.

In [None]:
sql=f'''
SELECT
  a."start",
  a."end",
  c."server",
  c."tier",
  "num_rows",
  "size"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = '{table_name}'
ORDER BY "start", "tier"
'''

display.sql(sql)

Run the cell above until some segments are shown without a server and tier.

Notice that, when you run the following query, results are limited to data fetched to a historical.

In [None]:
sql=f'''
SELECT
  TIME_FLOOR("__time",'P1D') AS "period",
  COUNT(*) as "events"
FROM "{table_name}"
GROUP BY 1
'''
display.sql(sql)

## Execute an asynchronous query

Use the `/druid/v2/sql/statements` API endpoint to run asynchronous queries using the multi-stage query (MSQ) engine.

### Call the API using the druidapi package

The `async_sql` method handles the [necessary steps](https://druid.apache.org/docs/latest/tutorials/tutorial-query-deep-storage#query-from-deep-storage) not only to submit the query, but also to retrieve the results.

Run the following cell, which uses the `async_sql` method of the `druidapi` package to call the API and return the results.

In [None]:
sql=f'''
SELECT
  TIME_FLOOR("__time",'P1D') AS "period",
  COUNT(*) as "events"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2005-11-16/P1W')
GROUP BY 1
'''

result = sql_client.async_sql(sql)
result.wait_until_done()

print(json.dumps(result.rows, indent=2))

Notice that the results cover an entire week, and are not limited to those periods loaded to historicals.

### Call the API directly

Run the following cell to submit the query directly to the API.

* The `query_request_json` object contains the query and the necessary context parameters.
* The API is called using `requests.post`, with the response stored in `query_response`.
* The JSON response from the API is pretty-printed.

In [None]:
query_request_json = {
    "query":f"SELECT TIME_FLOOR(\"__time\",'P1D') AS \"period\", COUNT(*) as \"events\" FROM \"{table_name}\" WHERE TIME_IN_INTERVAL(\"__time\",'2005-11-16/P1W') GROUP BY 1",
    "context":{
        "executionMode":"ASYNC"
    }
}

query_response = requests.post(f"{druid_host}/druid/v2/sql/statements", json.dumps(query_request_json), headers=druid_headers)
query_response_json = json.loads(query_response.text)

print(json.dumps(query_response_json, indent=2))

Notice the `queryId` in the response object.

Run the next cell to use the API to see the status of the query as it executes. The cell captures the `queryId` and uses this to construct a GET to the API.

In [None]:
queryId = query_response_json["queryId"]

job_result=requests.get(f"{druid_host}/druid/v2/sql/statements/{queryId}")
job_result_json = json.loads(job_result.text)

print(json.dumps(job_result_json, indent=2))
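
Rather than re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is a hypothetical helper, not part of `druidapi`. It assumes the status JSON carries a `state` field that ends at `SUCCESS` or `FAILED`. It takes a callable instead of a URL so that it can be tried without a running cluster.

```python
import time


def wait_for_state(fetch_status, poll_seconds=1.0, timeout_seconds=300.0):
    """Call fetch_status() repeatedly until the statement finishes.

    fetch_status is any zero-argument callable returning the status JSON
    as a dict. Assumption: "SUCCESS" and "FAILED" are the terminal states.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("state") in ("SUCCESS", "FAILED"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("statement did not reach a terminal state in time")
        time.sleep(poll_seconds)
```

In this notebook, one way to call it would be `wait_for_state(lambda: requests.get(f"{druid_host}/druid/v2/sql/statements/{queryId}").json())`.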

Once the `state` shows as `SUCCESS`, retrieve the result by running the following cell.

The URL is constructed from:
* The URL of the API, appended with the query ID.
* The page of the results to return.
* The format of the results to return.
  
The results of the call to the API are captured in `results_response`, and then printed.

In [None]:
results_resultFormat = "objectLines"
results_page = "0"

results_request_url = f'{druid_host}/druid/v2/sql/statements/{queryId}/results?page={results_page}&resultFormat={results_resultFormat}'

results_response = requests.get(results_request_url, headers=druid_headers).text
print(results_response)
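
The `objectLines` format returns one JSON object per line, so the raw text can be turned into Python dictionaries with a few lines of code. The helper below is a minimal sketch with a hypothetical name, and the sample text uses made-up values purely to show the shape of the input.

```python
import json


def parse_object_lines(text):
    """Parse newline-delimited JSON objects into a list of dicts.

    Blank lines, such as a trailing empty line, are skipped.
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative input only: these values are invented for the example.
sample_text = '{"period":"2005-11-16T00:00:00.000Z","events":1}\n{"period":"2005-11-17T00:00:00.000Z","events":2}\n\n'

for row in parse_object_lines(sample_text):
    print(row["period"], row["events"])
```

Passing `results_response` from the cell above to `parse_object_lines` would give a list of rows that is easier to work with than the raw text.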

## Paginate results

Use the `rowsPerPage` query context parameter to control the size of the results page.

In the `query_request_json`, the page size is set with the `rowsPerPage` context parameter, and the destination of the query is set to [`durableStorage`](https://druid.apache.org/docs/latest/operations/durable-storage) with the `selectDestination` context parameter.

Durable Storage has been configured in the learning environment's [`environment`](https://github.com/implydata/learn-druid/blob/main/environment) file to point to a local volume.

In [None]:
query_request_json = {
    "query":f"SELECT TIME_FLOOR(\"__time\",'P1D') AS \"period\", COUNT(*) as \"events\" FROM \"{table_name}\" WHERE TIME_IN_INTERVAL(\"__time\",'2005-11-16/P1W') GROUP BY 1",
    "context":{
        "executionMode":"ASYNC",
        "selectDestination":"durableStorage",
        "rowsPerPage":3
    }
}

query_response = requests.post(f"{druid_host}/druid/v2/sql/statements", json.dumps(query_request_json), headers=druid_headers)
query_response_json = json.loads(query_response.text)

print(json.dumps(query_response_json, indent=2))
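
The request bodies used in this notebook differ only in their context parameters. As a rough sketch, a hypothetical helper such as `build_statement_request` (not part of `druidapi` or Druid) could assemble them from a multi-line SQL string, which also avoids escaping the double quotes by hand.

```python
import json


def build_statement_request(sql, rows_per_page=None, durable=False):
    """Build the JSON body for a POST to /druid/v2/sql/statements."""
    context = {"executionMode": "ASYNC"}
    if durable:
        # Write the results to durable storage.
        context["selectDestination"] = "durableStorage"
    if rows_per_page is not None:
        context["rowsPerPage"] = rows_per_page
    return {"query": sql.strip(), "context": context}


example_sql = '''
SELECT TIME_FLOOR("__time",'P1D') AS "period", COUNT(*) AS "events"
FROM "example-flights-querydeepstorage"
GROUP BY 1
'''

print(json.dumps(build_statement_request(example_sql, rows_per_page=3, durable=True), indent=2))
```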

Run the following cell to get the status of the query.

Repeat until the `state` shows as `SUCCESS`.

In [None]:
queryId = query_response_json["queryId"]

job_result=requests.get(f"{druid_host}/druid/v2/sql/statements/{queryId}")
job_result_json = json.loads(job_result.text)

print(json.dumps(job_result_json, indent=2))

Notice that there are now multiple `pages` to read from.

Run the next cell to retrieve only one page of the results.

In [None]:
results_page = "2"

results_request_url = f'{druid_host}/druid/v2/sql/statements/{queryId}/results?page={results_page}&resultFormat={results_resultFormat}'

results_response = requests.get(results_request_url, headers=druid_headers).text
print(results_response)
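
To read every page rather than a single one, the list of pages in the status response can drive the requests. The sketch below assumes the status JSON holds the pages under `result.pages`, each with an `id` field. Check the output of the status cell to confirm that shape before relying on it. The function name `result_page_urls` is hypothetical.

```python
def result_page_urls(base_url, status_json, result_format="objectLines"):
    """Return one results URL per page listed in the status response.

    Assumption: pages are listed under status_json["result"]["pages"]
    and each page has an "id".
    """
    query_id = status_json["queryId"]
    pages = status_json.get("result", {}).get("pages", [])
    return [
        f"{base_url}/druid/v2/sql/statements/{query_id}/results"
        f"?page={page['id']}&resultFormat={result_format}"
        for page in pages
    ]
```

In this notebook, looping over `result_page_urls(druid_host, job_result_json)` and calling `requests.get` on each URL would fetch the pages in order.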

## Clean up

Run the following cell to reset the retention rules for the table, and then to drop it from the database.

In [None]:
retention_rules = []
requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

* SELECTs can run online (interactive API) or offline (asynchronous API).
* The offline API can access data that has not been prefetched to historicals.
* Durable Storage can be used to hold results of queries.
* Results can be paginated and can have different formats.

## Learn more

* Read about [using EXTERN to export data](https://druid.apache.org/docs/latest/multi-stage-query/reference#extern-to-export-to-a-destination).
* See the [documentation](https://druid.apache.org/docs/latest/querying/query-deep-storage) and [tutorial](https://druid.apache.org/docs/latest/tutorials/tutorial-query-deep-storage) on querying from deep storage.
* Read about [durable storage](https://druid.apache.org/docs/latest/multi-stage-query/reference#durable-storage) in the documentation, including how to [configure](https://druid.apache.org/docs/latest/operations/durable-storage) it.
* See the [full timeline](14-sync-async-queries.ipynb) notebook to see how the asynchronous API can also incorporate real-time data.
* See more retention load rules in the [load rules](20-tiering-historicals.ipynb) notebook.