# Load table data to different historical tiers using retention rules
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

[Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) lets administrators provision cluster resources to suit different performance and storage requirements, such as isolating heavy queries involving complex subqueries or large results from high-priority, interactive queries.

This tutorial demonstrates how to work with [historical tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to load particular ages of data onto different processes. In turn, this causes queries to execute on different services depending on the period of time covered by a query.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

This tutorial requires a deployment of Druid with multiple historicals, and presumes that the additional tier is called "slow".

Launch this tutorial and all prerequisites using the `druid-jupyter-tiered-hist` profile of the Docker Compose file for Jupyter-based Druid tutorials to create a cluster with an additional historical.

For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Import additional modules

Run the following cell to import additional Python modules that you will use to make direct calls to some APIs in Druid.

In [None]:
import requests
import json

### Create some helper functions

Run the next cell to set up a standard piece of SQL that you will use in this notebook. It uses the `server_segments`, `servers`, and `segments` tables to produce a list of segments and where they have been cached.

In [None]:
table_name = 'example-wikipedia-tiering'

layout_query = f'''
SELECT
  a."start",
  a."end",
  c."server",
  c."tier",
  "num_rows",
  "size"
FROM "sys"."segments" a
LEFT JOIN "sys"."server_segments" b ON a."segment_id" = b."segment_id"
LEFT JOIN "sys"."servers" c ON b."server" = c."server"
WHERE "datasource" = '{table_name}'
ORDER BY "start", "tier"
'''
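
With one segment per day and one row per replica, the result of this query can become long. The following is a minimal sketch of how the rows could be condensed into a count of segment replicas per year and tier. It assumes the rows are available as a list of dictionaries keyed by the column names in the query; `segments_per_tier`, `sample_rows`, and the server names are hypothetical and are not part of the notebook or of the Druid Python client.

```python
from collections import Counter

def segments_per_tier(rows):
    """Count segment replicas per (year, tier).

    The tier is None for a segment that is not loaded on any server,
    because the layout query uses LEFT JOINs.
    """
    counts = Counter()
    for row in rows:
        year = row["start"][:4]  # "start" is an ISO 8601 timestamp string
        counts[(year, row["tier"])] += 1
    return dict(counts)

# Hypothetical rows shaped like the output of the layout query.
sample_rows = [
    {"start": "2016-06-27T00:00:00.000Z", "end": "2016-06-28T00:00:00.000Z",
     "server": "historical:8083", "tier": "_default_tier",
     "num_rows": 100, "size": 1000},
    {"start": "2016-06-27T00:00:00.000Z", "end": "2016-06-28T00:00:00.000Z",
     "server": "historical-slow:8083", "tier": "slow",
     "num_rows": 100, "size": 1000},
]

print(segments_per_tier(sample_rows))
# {('2016', '_default_tier'): 1, ('2016', 'slow'): 1}
```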

## Create a table using batch ingestion

In this section, you will create a table that contains data spanning a few years using batch ingestion, and then look at where this data has been cached for query.

### Ingest example data

Run the next cell to bring in the initial data. Only a subset of the columns that are available in the example dataset will be ingested.

When completed, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "isUnpatrolled",
  "page",
  "comment",
  "commentLength",
  "user"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

### Inspect the servers and current configuration

Use a query against the servers system table to see what historicals are available, and the tiers that they are assigned to.

In [None]:
sql='''
SELECT server, tier, curr_size
FROM "sys"."servers"
WHERE "server_type" = 'historical'
'''

display.sql(sql)

You will see that there are multiple historical servers.

- One historical belongs to the default tier of `_default_tier`.
- One historical belongs to the `slow` tier.

> If you do not see multiple servers on multiple tiers, stop now.
> See [prerequisites](#prerequisites) for more information.

Run the next cell to inspect the current distribution of data using the sys tables.

In [None]:
display.sql(layout_query)

All segments for the table, totalling around 20,000 rows, are loaded onto historicals in the `_default_tier` tier.

To understand why, run the following cell to use the Coordinator API to inspect the current retention load rules.

In [None]:
print(json.dumps(json.loads(requests.get(f'{druid_host}/druid/coordinator/v1/rules').text), indent=2))

On creation, tables have no rule set of their own. Instead, the cluster's default rule set, `_default`, is applied to the table.

By default, the `_default` rules set contains only one rule - a [load forever](https://druid.apache.org/docs/latest/operations/rule-configuration#forever-load-rule) rule (`loadForever`) with a replication factor (`tieredReplicants`) of 2 across servers in the `_default_tier`.

The entire timeline of data for your table is cached on historicals in the `_default_tier`, and queries will execute there.
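
The fallback to `_default` can be pictured as a simple lookup. The sketch below assumes that a table's own rules are evaluated first and that the `_default` rules follow them; `effective_rules` is a hypothetical helper, and `all_rules` is a hand-written dictionary modelled on the shape of the response from the `rules` endpoint, not real output.

```python
def effective_rules(all_rules, table, default_key="_default"):
    """Return the table-specific rules followed by the default rules."""
    return list(all_rules.get(table, [])) + list(all_rules.get(default_key, []))

# Hand-written example modelled on GET /druid/coordinator/v1/rules.
all_rules = {
    "_default": [
        {"type": "loadForever", "tieredReplicants": {"_default_tier": 2}}
    ]
}

# The table has no entry of its own, so only the default rule applies.
chain = effective_rules(all_rules, "example-wikipedia-tiering")
print([rule["type"] for rule in chain])
# ['loadForever']
```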

## Cache data on different tiers

In this section, you'll create a load rule that also loads data onto the `slow` tier. You will use a mixture of forever, period, and interval [load rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules).

### Load all data onto multiple tiers

Run the next cell to define a retention rule as a JSON-style object, send it to the Coordinator API, and then print the full set of rules currently held by the cluster.

- Historicals in the `slow` tier have been added to the replication rules (`tieredReplicants`).
- The `slow` tier will receive one replica of the data.
- The `_default_tier` tier will receive one replica of the data.
- The API call is made to the `rules` endpoint for the table (using the `table_name` variable).

In [None]:
retention_rules = [
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 1,
      "slow": 1
    }
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)
print(json.dumps(json.loads(requests.get(f'{druid_host}/druid/coordinator/v1/rules').text), indent=2))

In addition to the `_default` rule set, there is now a new rule set specific to the table you have created.

Run the next cell to see where the data has been cached.

In [None]:
display.sql(layout_query)

By setting up a table-specific rule set, where `tieredReplicants` includes both tiers, both the `slow` and `_default_tier` tiers have been loaded with all the segments of your table.

(Run the cell above again if you do not see this immediately.)
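
Segments take a little time to move after a rule change, which is why the layout may need to be checked more than once. A generic polling loop is one way to automate that check. This is a sketch only: `wait_for_tiers` and its parameters are hypothetical, and `fetch_tiers` stands in for whatever function you write to run the layout query and return the set of tiers it reports.

```python
import time

def wait_for_tiers(fetch_tiers, expected, timeout=60.0, interval=2.0):
    """Poll fetch_tiers() until it reports every tier in `expected`.

    Returns True on success and False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if set(expected) <= set(fetch_tiers()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in for a real check: the second poll sees both tiers.
responses = iter([{"_default_tier"}, {"_default_tier", "slow"}])
print(wait_for_tiers(lambda: next(responses), {"_default_tier", "slow"}, interval=0.01))
# True
```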

### Load tiers according to data age

Run the next cell to create some "fake" data in the table that is a year older.

* INSERT is used to append data instead of REPLACE INTO.
* The [TIME_SHIFT](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) function has been used to shift the parsed timestamp back by a year.

In [None]:
sql='''
INSERT INTO "''' + table_name + '''"
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_SHIFT(TIME_PARSE("timestamp"), 'P1Y', -1) AS "__time",
  "isRobot",
  "channel",
  "isUnpatrolled",
  "page",
  "comment",
  "commentLength",
  "user"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)

Review the following retention rule.

- An [interval load rule](https://druid.apache.org/docs/latest/operations/rule-configuration#interval-load-rule) (`loadByInterval`) covers the 10 years of data before 1st January 2016, and requests one cached replica on the `slow` tier only.
- `loadForever` requests that all remaining data be cached on the `_default_tier` tier.

Each segment is checked against the rules in order when the decision is made as to where it must be cached.

Run the cell to commit it to the database.

In [None]:
retention_rules = [
  {
    "type": "loadByInterval",
    "interval": "P10Y/2016",
    "tieredReplicants": {
      "slow": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
        "_default_tier": 1
    }
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)

Run the next cell to see where the data has been cached.

In [None]:
display.sql(layout_query)

Rules are processed in order. Therefore:

* Data for 2015 (`loadByInterval`) is only cached on the `slow` tier.
* All other data (`loadForever`) is only available on the `_default_tier` tier.

(Re-run the cell above if you do not see this immediately.)
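
The first-match behaviour can be reproduced offline. The sketch below is a simplified model, not Druid's implementation: it assumes a load rule matches when its interval overlaps the segment's interval, and the hypothetical `parse_interval` helper understands only whole-year forms such as `P10Y/2016`, not full ISO 8601.

```python
from datetime import datetime, timezone

def parse_side(text):
    """Parse one side of an interval: a 'PnY' duration or a 'YYYY' year."""
    if text.startswith("P") and text.endswith("Y"):
        return ("years", int(text[1:-1]))
    return ("instant", datetime(int(text), 1, 1, tzinfo=timezone.utc))

def parse_interval(interval):
    """Return (start, end) for 'YYYY/YYYY', 'PnY/YYYY' or 'YYYY/PnY' only."""
    (left_kind, left), (right_kind, right) = (
        parse_side(side) for side in interval.split("/")
    )
    if left_kind == "years":
        return right.replace(year=right.year - left), right
    if right_kind == "years":
        return left, left.replace(year=left.year + right)
    return left, right

def tiers_for(seg_start, seg_end, rules):
    """Return the tieredReplicants of the first rule that matches the segment."""
    for rule in rules:
        if rule["type"] == "loadForever":
            return rule["tieredReplicants"]
        if rule["type"] == "loadByInterval":
            start, end = parse_interval(rule["interval"])
            if start < seg_end and seg_start < end:  # the intervals overlap
                return rule["tieredReplicants"]
    return {}

def day(year, month, dom):
    return datetime(year, month, dom, tzinfo=timezone.utc)

rules = [
    {"type": "loadByInterval", "interval": "P10Y/2016",
     "tieredReplicants": {"slow": 1}},
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}},
]

print(tiers_for(day(2015, 6, 27), day(2015, 6, 28), rules))  # {'slow': 1}
print(tiers_for(day(2016, 6, 27), day(2016, 6, 28), rules))  # {'_default_tier': 1}
```

Swapping the two entries in `rules` makes `loadForever` match every segment first, so the interval rule is never reached.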

Run the following query on the table.

In [None]:
sql = f'''
SELECT
  TIME_FLOOR(__time, 'P1D') AS "date",
  COUNT(*) AS "rows",
  COUNT(DISTINCT "user") AS "users",
  CAST(AVG("commentLength") AS INTEGER) AS "average_comment"
FROM "{table_name}"
GROUP BY 1
'''

display.sql(sql)

Consider that, because of [time partitioning](https://druid.apache.org/docs/latest/multi-stage-query/concepts#partitioning-by-time), some parts of this query were calculated on the `slow` historical tier, and some were calculated on historicals in the `_default_tier` tier.

### Load data according to age

Run the cell below to ingest some data for the current year.

The TIME_EXTRACT function is used to calculate the number of years between the source timestamp and the current year, and TIME_SHIFT then moves each timestamp forward by that many years to fake data for this year.

In [None]:
sql='''
INSERT INTO "''' + table_name + '''"
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_SHIFT(TIME_PARSE("timestamp"), 'P1Y', (TIME_EXTRACT(CURRENT_TIMESTAMP,'YEAR') - TIME_EXTRACT(TIME_PARSE("timestamp"),'YEAR'))) AS "__time",
  "isRobot",
  "channel",
  "isUnpatrolled",
  "page",
  "comment",
  "commentLength",
  "user"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
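
The year arithmetic in the ingestion query above can be checked in plain Python. This is an illustrative sketch: `shift_to_year` is a hypothetical helper, and the source timestamp and target year are example values rather than values read from the data.

```python
from datetime import datetime, timezone

def shift_to_year(ts, target_year):
    """Move ts by whole years so that it lands in target_year.

    Returns the shifted timestamp and the number of years applied, which
    corresponds to the step count passed to TIME_SHIFT in the SQL.
    """
    years = target_year - ts.year
    return ts.replace(year=ts.year + years), years

source = datetime(2016, 6, 27, 12, 0, tzinfo=timezone.utc)
shifted, years = shift_to_year(source, 2025)
print(years, shifted.isoformat())
# 9 2025-06-27T12:00:00+00:00
```

Note that `datetime.replace` raises `ValueError` for 29 February when the target year is not a leap year, so this sketch is not a general-purpose replacement for TIME_SHIFT.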

Review the following retention load rules configuration, then run the cell to apply it to the table.

- A [period load rule](https://druid.apache.org/docs/latest/operations/rule-configuration/#period-load-rule) (`loadByPeriod`) covers data less than one year old, and requests one cached replica on both the `_default_tier` and `slow` tiers.
- `loadForever` requests that all remaining data be cached on the `slow` tier.

In [None]:
retention_rules = [
  {
    "type": "loadByPeriod",
    "period": "P1Y",
    "tieredReplicants": {
      "_default_tier": 1,
      "slow": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "slow": 1
    }
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)

Run the following cell to see the resulting distribution:

In [None]:
display.sql(layout_query)

When it has been applied:

* All data _younger_ than a year is cached on both tiers (caught by the opening `loadByPeriod` rule).
* All other data is cached exclusively on the `slow` tier (caught by the closing `loadForever` rule).
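
Unlike an interval rule, a period rule is measured back from the current time, so the same segment can match different rules as it ages. The sketch below models that under stated simplifications: `tiers_at` is a hypothetical helper, a year is approximated as 365 days, only `PnY` periods are understood, and future data is treated as matching the period rule.

```python
from datetime import datetime, timedelta, timezone

def tiers_at(seg_end, rules, now):
    """Return the tieredReplicants of the first matching rule, as of `now`."""
    for rule in rules:
        if rule["type"] == "loadForever":
            return rule["tieredReplicants"]
        if rule["type"] == "loadByPeriod":
            years = int(rule["period"][1:-1])                 # 'PnY' only
            window_start = now - timedelta(days=365 * years)  # approximation
            if seg_end > window_start:  # segment reaches into the window
                return rule["tieredReplicants"]
    return {}

rules = [
    {"type": "loadByPeriod", "period": "P1Y",
     "tieredReplicants": {"_default_tier": 1, "slow": 1}},
    {"type": "loadForever", "tieredReplicants": {"slow": 1}},
]

seg_end = datetime(2025, 6, 28, tzinfo=timezone.utc)

# While the segment is less than a year old, the period rule matches.
print(tiers_at(seg_end, rules, datetime(2025, 12, 1, tzinfo=timezone.utc)))
# {'_default_tier': 1, 'slow': 1}

# Once it is older than a year, it falls through to loadForever.
print(tiers_at(seg_end, rules, datetime(2027, 1, 1, tzinfo=timezone.utc)))
# {'slow': 1}
```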

### Leave some data uncached

To keep some of your table data accessible _only_ from [deep storage](https://druid.apache.org/docs/latest/querying/query-deep-storage), set `tieredReplicants` to an empty set and set `useDefaultTierForNull` to false.
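
When a rule like this is built in Python, it is worth noticing how the flag is serialized. The sketch below only shows what `json.dumps` produces for a Python boolean compared with a string; it makes no claim about which forms the Coordinator accepts.

```python
import json

# A Python boolean becomes the JSON literal false.
rule = {"type": "loadForever", "tieredReplicants": {}, "useDefaultTierForNull": False}
print(json.dumps(rule))
# {"type": "loadForever", "tieredReplicants": {}, "useDefaultTierForNull": false}

# A Python string stays a JSON string.
print(json.dumps({"useDefaultTierForNull": "false"}))
# {"useDefaultTierForNull": "false"}
```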

Review the rules below, and then run the cell to apply it to the table. This uses a mixture of all the retention load rules you have seen so far.

* A period load rule, which fires first, catches all data younger than a year, and loads it on `_default_tier` historicals.
* An interval load rule, catching data for 2015 and loading it onto historicals in the `slow` tier only.
* A final forever load rule, that, since it has no `tieredReplicants` and `useDefaultTierForNull` is `false`, ensures none of the remaining data is cached on historicals.

What do you predict will happen to data for 2016?

In [None]:
retention_rules = [
  {
    "type": "loadByPeriod",
    "period": "P1Y",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "loadByInterval",
    "interval": "2015/P1Y",
    "tieredReplicants": {
      "slow": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {},
    "useDefaultTierForNull": False
  }
]

requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)

Run the following cell to see the resulting distribution:

In [None]:
display.sql(layout_query)

Notice that, depending on the period of time they cover, some table segments are loaded onto historicals and some are not.

The order of the rules means:

1. Data younger than a year is loaded to `_default_tier` historicals.
2. Data covering 2015 is cached on `slow`-tier historicals.
3. No other data is loaded.

## Clean up

Run the following cell to remove the table used in this notebook from the database and delete your additional ruleset.

In [None]:
retention_rules = []
requests.post(f"{druid_host}/druid/coordinator/v1/rules/{table_name}", json.dumps(retention_rules), headers=druid_headers)
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

* All historical servers belong to a tier.
* The default tier for all historicals is `_default_tier`.
* Default retention rules, held in the `_default` rule set, apply to all tables.
* Out of the box, the default retention rule set has only one rule, loading all data onto the `_default_tier` tier.
* There is an API endpoint for amending load rules.
* Rule sets can be made up of a mixture of period, interval, and "forever" rules.
* Rules are applied in order.

## Learn more

* Amend `retention_rules` to try different periods, and to see what happens when the rule order is reversed.
* Read more about [retention rules](https://druid.apache.org/docs/latest/operations/rule-configuration), particularly [load rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules), in the documentation.