# Load table data to different historical tiers using retention rules
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

[Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) lets administrators provide cluster resources suited to different performance and storage requirements, for example by isolating heavy queries that involve complex subqueries or large result sets from high-priority, interactive queries.

This tutorial demonstrates how to work with [historical tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to load particular ages of data onto different processes. In turn, this causes queries to execute on different services depending on the period of time covered by a query.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

This tutorial requires a deployment of Druid with multiple historicals, and presumes that the additional tier is called "slow".

Launch this tutorial and all prerequisites using the `druid-jupyter-tiered-hist` profile of the Docker Compose file for Jupyter-based Druid tutorials to create a cluster with an additional historical.

For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [107]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Opening a connection to http://router:8888.


'30.0.0'
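The host selection in the cell above can also be written as a small function, which makes the fallback easier to check in isolation. This is a minimal sketch; `resolve_druid_host` is a hypothetical helper name, and the port `8888` and the `localhost` default are taken from the cell above.

```python
import os


def resolve_druid_host(environ=os.environ, default_host="localhost", port=8888):
    """Build the router URL from DRUID_HOST, falling back to a default host."""
    host = environ.get("DRUID_HOST", default_host)
    return f"http://{host}:{port}"


# Plain dictionaries stand in for the process environment here.
print(resolve_druid_host({}))
print(resolve_druid_host({"DRUID_HOST": "router"}))
```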

### Import additional modules

Run the following cell to import additional Python modules that you will use to make direct calls to some APIs in Druid.

In [108]:
import requests
import json

## Create a table using batch ingestion

In this section, you will create a table that contains data spanning a few years using batch ingestion.

Run the next cell to bring in the initial data. Only a subset of the columns that are available in the example dataset will be ingested.

When completed, you'll see a description of the final table.

In [109]:
table_name = 'example-wikipedia-tiering'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "isUnpatrolled",
  "page",
  "comment",
  "commentLength",
  "user"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:07<00:00, 14.05it/s]


Position,Name,Type
1,__time,TIMESTAMP
2,isRobot,VARCHAR
3,channel,VARCHAR
4,isUnpatrolled,VARCHAR
5,page,VARCHAR
6,comment,VARCHAR
7,commentLength,BIGINT
8,user,VARCHAR


Run another ingestion from the Wikipedia example dataset.

This time, INSERT is used instead of REPLACE INTO so that the data is appended, and the [TIME_SHIFT](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) function is used to shift each timestamp back by a year. This creates some "fake" data in the table that is a year older.

Run the next cell to append the shifted data to your table.

In [93]:
sql='''
INSERT INTO "''' + table_name + '''"
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_SHIFT(TIME_PARSE("timestamp"), 'P1Y', -1) AS "__time",
  "isRobot",
  "channel",
  "isUnpatrolled",
  "page",
  "comment",
  "commentLength",
  "user"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:07<00:00, 14.03it/s]
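To make the effect of the one-year shift concrete, the same calendar arithmetic can be sketched in plain Python. This is only an illustration of the idea, not Druid's implementation; `shift_years` is a hypothetical helper, and clamping 29 February to 28 February is a choice made for this sketch.

```python
from datetime import datetime, timezone


def shift_years(ts: datetime, years: int) -> datetime:
    """Move a timestamp by whole calendar years."""
    try:
        return ts.replace(year=ts.year + years)
    except ValueError:
        # 29 February does not exist in the target year, so clamp to the 28th.
        return ts.replace(year=ts.year + years, day=28)


# The example dataset starts on 27 June 2016.
original = datetime(2016, 6, 27, tzinfo=timezone.utc)
print(shift_years(original, -1).isoformat())
```

A negative step moves the timestamp into the past, which is why the ingestion above passes `-1` as the step.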


## Separate older data onto a different tier

In this section you will change the retention [load rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules) for the table so that some of the data is loaded onto a different tier.

You will then see how the table's data is spread across different processes. A query against the table is parallelised across, and executed on, whichever processes have cached the data that it covers.

### Generate some statistics about the data

Take a look at the distribution of your table across the years by running the next cell.

In [99]:
sql = f'''
SELECT
  TIME_EXTRACT(__time, 'YEAR') AS "year",
  COUNT(*) AS "rows",
  COUNT(DISTINCT "user") AS "users",
  CAST(AVG("commentLength") AS INTEGER) AS "average_comment"
FROM "{table_name}"
GROUP BY 1
'''

display.sql(sql)

year,rows,users,average_comment
2014,24433,7923,62
2015,24433,7923,62
2016,24433,7923,62


### Inspect the servers and current configuration

Use a query against the servers system table to see what historicals are available, and the tiers that they are assigned to.

In [100]:
sql='''
SELECT server, tier, curr_size
FROM "sys"."servers"
WHERE "server_type" = 'historical'
'''

display.sql(sql)

server,tier,curr_size
172.19.0.11:8083,slow,10484204
172.19.0.10:8083,_default_tier,10484204


You will see that there are multiple historical servers.

- One server belongs to the default tier, `_default_tier`.
- The `_default_tier` currently contains all the data for the table.
- An additional tier currently contains no data.
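The rows returned by the servers query can be grouped by tier to make this summary explicit. The sketch below works on a hard-coded sample copied from the output above; it assumes rows arrive as a list of dictionaries keyed by column name, and `servers_by_tier` is a hypothetical helper.

```python
from collections import defaultdict


def servers_by_tier(rows):
    """Group (server, curr_size) pairs under the tier each server belongs to."""
    tiers = defaultdict(list)
    for row in rows:
        tiers[row["tier"]].append((row["server"], row["curr_size"]))
    return dict(tiers)


# Sample rows in the shape of the sys.servers output shown above.
sample_rows = [
    {"server": "172.19.0.11:8083", "tier": "slow", "curr_size": 10484204},
    {"server": "172.19.0.10:8083", "tier": "_default_tier", "curr_size": 10484204},
]

for tier, servers in sorted(servers_by_tier(sample_rows).items()):
    total_bytes = sum(size for _, size in servers)
    print(f"{tier}: {len(servers)} server(s), {total_bytes} bytes loaded")
```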

Use the coordinator API to inspect and manage retention rules.

Run the following cell to call the API and print all the rules that currently apply.

In [85]:
print(json.dumps(json.loads(requests.get(f'{druid_host}/druid/coordinator/v1/rules').text), indent=2))

{
  "_default": [
    {
      "tieredReplicants": {
        "_default_tier": 2
      },
      "useDefaultTierForNull": true,
      "type": "loadForever"
    }
  ]
}


In a clean deployment, only one set of rules is configured, called `_default`.

By default, the `_default` rule set contains only one rule: a [load forever](https://druid.apache.org/docs/latest/operations/rule-configuration#forever-load-rule) rule (`loadForever`) with a replication factor (`tieredReplicants`) of 2 across servers in the `_default_tier`.
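When a rule set grows beyond one rule, a one-line-per-rule summary can be easier to read than the raw JSON. The following is a minimal sketch that works on the response shape shown above; `describe_rules` is a hypothetical helper, and the sample is copied from the output rather than fetched from the API.

```python
def describe_rules(rule_sets):
    """Return one summary line per rule: rule set, position, type and tier placement."""
    lines = []
    for name, rules in rule_sets.items():
        for position, rule in enumerate(rules, start=1):
            replicants = rule.get("tieredReplicants", {})
            placement = ", ".join(
                f"{tier} x{count}" for tier, count in sorted(replicants.items())
            ) or "no tiers"
            lines.append(f"{name} #{position}: {rule['type']} -> {placement}")
    return lines


# Sample copied from the coordinator response shown above.
sample_rule_sets = {
    "_default": [
        {
            "tieredReplicants": {"_default_tier": 2},
            "useDefaultTierForNull": True,
            "type": "loadForever",
        }
    ]
}

for line in describe_rules(sample_rule_sets):
    print(line)
```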

### Load data onto the additional tier

Since there are two tiers of historicals, create a load rule that also loads all data onto the additional tier.

Run the next cell to create a JSON object that holds the amended rule and to send it to the Coordinator API.

- Historicals in the `slow` tier have been added to the replication rules (`tieredReplicants`).
- The `slow` tier will receive one replica of the data.
- The `_default_tier` tier will receive one replica of the data.

In [86]:
retention_rules = [
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 1,
      "slow": 1
    }
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)

<Response [200]>
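Tier names in a rule are plain strings, so a typo is easy to make. One way to guard against this is to compare the tiers named in a rule list with the tiers reported by the servers query before posting. This is a minimal sketch; `unknown_tiers` and `example_rules` are hypothetical names, and the set of known tiers is passed in by hand rather than queried.

```python
def unknown_tiers(rules, known_tiers):
    """Return the tiers named in tieredReplicants that are not in known_tiers."""
    referenced = set()
    for rule in rules:
        referenced.update(rule.get("tieredReplicants", {}))
    return sorted(referenced - set(known_tiers))


# Same shape as the rule posted above.
example_rules = [
    {
        "type": "loadForever",
        "tieredReplicants": {"_default_tier": 1, "slow": 1},
    }
]

print(unknown_tiers(example_rules, {"_default_tier", "slow"}))
print(unknown_tiers(example_rules, {"_default_tier"}))
```

An empty list means that every tier named in the rules is one of the known tiers.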

Inspect the current distribution of data by running a query against the system tables.

In [69]:
sql='''
SELECT
  a."server",
  b."tier",
  c."start",
  c."end",
  COUNT(*) AS "Count",
  SUM(c."num_rows") AS "rows"
FROM "sys"."server_segments" a
LEFT JOIN "sys"."servers" b ON a."server" = b."server"
LEFT JOIN "sys"."segments" c ON a."segment_id" = c."segment_id"
GROUP BY 1, 2, 3, 4
ORDER BY "start", "tier"
'''

display.sql(sql)

server,tier,start,end,Count,rows
172.19.0.10:8083,_default_tier,2014-06-27T00:00:00.000Z,2014-06-28T00:00:00.000Z,3,48866
172.19.0.11:8083,slow,2014-06-27T00:00:00.000Z,2014-06-28T00:00:00.000Z,3,48866
172.19.0.10:8083,_default_tier,2015-06-27T00:00:00.000Z,2015-06-28T00:00:00.000Z,3,48866
172.19.0.11:8083,slow,2015-06-27T00:00:00.000Z,2015-06-28T00:00:00.000Z,3,48866
172.19.0.10:8083,_default_tier,2016-06-27T00:00:00.000Z,2016-06-28T00:00:00.000Z,2,48866
172.19.0.11:8083,slow,2016-06-27T00:00:00.000Z,2016-06-28T00:00:00.000Z,2,48866


For each year in the table, once rules have been applied to your deployment, there will be a replica on the `slow` tier and an additional replica on the `_default_tier` tier.

Run the cell above until you see this applied.
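Rather than re-running the cell by hand, the check can be wrapped in a small polling loop. The sketch below is an assumption-laden outline: `wait_for_tiers` and `tiers_holding_segments` are hypothetical helpers, `fetch_rows` is any callable that returns the distribution rows as dictionaries with a `tier` key, and a canned sequence of results stands in for the real query.

```python
import time


def tiers_holding_segments(rows):
    """Collect the set of tiers that appear in the distribution rows."""
    return {row["tier"] for row in rows}


def wait_for_tiers(fetch_rows, expected_tiers, attempts=30, delay_seconds=2.0):
    """Poll fetch_rows until every expected tier holds at least one segment."""
    expected = set(expected_tiers)
    for _ in range(attempts):
        if expected <= tiers_holding_segments(fetch_rows()):
            return True
        time.sleep(delay_seconds)
    return False


# Canned results standing in for two successive runs of the query.
canned_results = iter([
    [{"tier": "_default_tier"}],
    [{"tier": "_default_tier"}, {"tier": "slow"}],
])

print(wait_for_tiers(lambda: next(canned_results), {"_default_tier", "slow"}, delay_seconds=0))
```

Capping the number of attempts keeps the loop from running indefinitely if a tier never receives data.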

### Split table data across tiers according to age

Using [period load rules](https://druid.apache.org/docs/latest/operations/rule-configuration/#period-load-rule), your table can be split across different historical tiers according to the timestamp of the data.

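Run the following cell to replace the table's retention rules with a period load rule followed by a load forever rule. Rules set for a table are evaluated before the `_default` rules, so the new rules are posted for the table itself:

In [70]:
retention_rules = [
  {
    "type": "loadByPeriod",
    "period": "P1Y",
    "tieredReplicants": {
      "_default_tier": 1,
      "slow": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "slow": 1
    }
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)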

<Response [200]>

There are now two rules.

- `loadByPeriod`, covering the last year, which sets one cached replica on both the `_default_tier` and `slow` tiers.
- `loadForever`, requesting all data to be cached on the `slow` tier.

Run the following cell to see the resulting distribution:

In [71]:
sql='''
SELECT
  a."server",
  b."tier",
  c."start",
  c."end",
  COUNT(*) AS "Count",
  SUM(c."num_rows") AS "rows"
FROM "sys"."server_segments" a
LEFT JOIN "sys"."servers" b ON a."server" = b."server"
LEFT JOIN "sys"."segments" c ON a."segment_id" = c."segment_id"
GROUP BY 1, 2, 3, 4
ORDER BY "start", "tier"
'''

display.sql(sql)

server,tier,start,end,Count,rows
172.19.0.11:8083,slow,2014-06-27T00:00:00.000Z,2014-06-28T00:00:00.000Z,3,48866
172.19.0.11:8083,slow,2015-06-27T00:00:00.000Z,2015-06-28T00:00:00.000Z,3,48866
172.19.0.11:8083,slow,2016-06-27T00:00:00.000Z,2016-06-28T00:00:00.000Z,2,48866


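Rules are processed in order and all of the data in the table is _older_ than one year, so no segment matches the `loadByPeriod` rule. Every segment falls through to the `loadForever` rule, and only the `slow` tier receives any data.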
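The first-match behaviour can be modelled in a few lines of Python. This is a deliberately simplified sketch, not Druid's implementation: it only compares the end of a segment with the start of the period, it only understands whole-year periods such as `P1Y`, and it treats a year as 365 days. `first_matching_rule`, `period_to_timedelta`, and `example_rules` are hypothetical names, and the value of `now` is fixed for illustration.

```python
import re
from datetime import datetime, timedelta, timezone


def period_to_timedelta(period):
    """Convert 'PnY' to a timedelta, approximating a year as 365 days."""
    match = re.fullmatch(r"P(\d+)Y", period)
    if not match:
        raise ValueError(f"Only whole-year periods are handled in this sketch: {period}")
    return timedelta(days=365 * int(match.group(1)))


def first_matching_rule(segment_end, rules, now):
    """Return the first rule, in list order, that applies to the segment."""
    for rule in rules:
        if rule["type"] == "loadForever":
            return rule
        if rule["type"] == "loadByPeriod":
            if segment_end > now - period_to_timedelta(rule["period"]):
                return rule
    return None


example_rules = [
    {"type": "loadByPeriod", "period": "P1Y",
     "tieredReplicants": {"_default_tier": 1, "slow": 1}},
    {"type": "loadForever", "tieredReplicants": {"slow": 1}},
]

now = datetime(2024, 6, 27, 12, tzinfo=timezone.utc)  # fixed for illustration
for year in (2015, 2016, 2024):
    segment_end = datetime(year, 6, 28, tzinfo=timezone.utc)
    rule = first_matching_rule(segment_end, example_rules, now)
    print(year, rule["type"], sorted(rule["tieredReplicants"]))
```

In this model, reversing the order of the two rules would send every segment to the `loadForever` rule, because it matches everything.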

Run the cell below to ingest some data for the current year. To fake data for this year, the TIME_EXTRACT function calculates the number of years between the timestamp of the example data and today's date, and TIME_SHIFT moves each timestamp forward by that many years.

In [72]:
sql='''
INSERT INTO "''' + table_name + '''"
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_SHIFT(TIME_PARSE("timestamp"), 'P1Y', (TIME_EXTRACT(CURRENT_TIMESTAMP,'YEAR') - TIME_EXTRACT(TIME_PARSE("timestamp"),'YEAR'))) AS "__time",
  "isRobot",
  "channel",
  "isUnpatrolled",
  "page",
  "comment",
  "commentLength",
  "user"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

Loading data, status:[SUCCESS]: 100%|██████████| 100.0/100.0 [00:07<00:00, 14.01it/s]


Position,Name,Type
1,__time,TIMESTAMP
2,isRobot,VARCHAR
3,channel,VARCHAR
4,isUnpatrolled,VARCHAR
5,page,VARCHAR
6,comment,VARCHAR
7,commentLength,BIGINT
8,user,VARCHAR


Run the next cell to see how the load rules have now been applied.

In [73]:
sql='''
SELECT
  a."server",
  b."tier",
  c."start",
  c."end",
  COUNT(*) AS "Count",
  SUM(c."num_rows") AS "rows"
FROM "sys"."server_segments" a
LEFT JOIN "sys"."servers" b ON a."server" = b."server"
LEFT JOIN "sys"."segments" c ON a."segment_id" = c."segment_id"
GROUP BY 1, 2, 3, 4
ORDER BY "start", "tier"
'''

display.sql(sql)

server,tier,start,end,Count,rows
172.19.0.11:8083,slow,2014-06-27T00:00:00.000Z,2014-06-28T00:00:00.000Z,3,48866
172.19.0.11:8083,slow,2015-06-27T00:00:00.000Z,2015-06-28T00:00:00.000Z,3,48866
172.19.0.11:8083,slow,2016-06-27T00:00:00.000Z,2016-06-28T00:00:00.000Z,2,48866
172.19.0.10:8083,_default_tier,2024-06-27T00:00:00.000Z,2024-06-28T00:00:00.000Z,1,24433
172.19.0.11:8083,slow,2024-06-27T00:00:00.000Z,2024-06-28T00:00:00.000Z,1,24433


## Clean up

Run the following cell to remove the table used in this notebook from the database and delete your additional ruleset.

In [110]:
print(f"Drop table: [{druid.datasources.drop(table_name)}]")
retention_rules = []
requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)

Drop table: [None]


<Response [200]>

## Summary

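* Historicals are assigned to tiers, which you can inspect through the `sys.servers` table.
* Retention load rules use `tieredReplicants` to control how many replicas of the data each tier receives.
* Rules are processed in order, so a `loadByPeriod` rule followed by a `loadForever` rule keeps recent data on both tiers and older data on the `slow` tier only.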

## Learn more

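* Try this out on your own data.
* Read about [service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) and [historical tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering).
* Read about [load rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules), including the [period load rule](https://druid.apache.org/docs/latest/operations/rule-configuration/#period-load-rule).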