# Optimize table data layout by partitioning and clustering using compaction
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Through compaction, whether manual or running automatically, you can change the number and size of segments that make up a table.

This tutorial demonstrates how to work with [compaction](https://druid.apache.org/docs/latest/data-management/compaction) to partition and cluster the segments for an existing table.

In this tutorial you perform the following tasks:

- Create a table using batch ingestion with a very high number of segments.
- Run a PARTITION-style compaction job to reduce the number of segments by increasing their size.
- Run a CLUSTER-style compaction job to change the secondary partitioning scheme of an existing table.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells prepare the notebook and learning environment for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Import additional modules

Run the following cell to import additional Python modules that you will use to call Druid APIs directly.

In [None]:
import requests
import json

druid_headers = { 'Content-Type': 'application/json' }

## Create a table using batch ingestion

Run the following cell to create a table using batch ingestion. Notice that `TIME_PARSE` turns the `timestamp` column into the table's `__time` column, that only a subset of the source columns is ingested, and that the table is partitioned by `HOUR` with no CLUSTERED BY clause.

When completed, you'll see a description of the final table.

In [None]:
table_name = 'example-wikipedia-compaction'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "namespace",
  "page",
  "user",
  "channel",
  "added",
  "deleted",
  "commentLength",
  "isRobot",
  "isAnonymous",
  "regionIsoCode",
  "countryIsoCode"
FROM "ext"
PARTITIONED BY HOUR
'''

display.run_task(sql)
sql_client.wait_until_ready(f'{table_name}')
display.table(f'{table_name}')

## View the layout of a table

Use Druid's `SYS.SEGMENTS` table to get information about a table's segments. Run the cell below to see the segments created by the ingestion above.

In [None]:
sql=f'''
SELECT
  "start",
  "end",
  "num_rows",
  "size"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

Since you used PARTITIONED BY HOUR, you will see one segment per hour for the entire ingested data set.
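
To put a number on the layout before changing it, a summary query along the following lines may help. This is a minimal sketch: the helper name `segment_summary_sql` is made up for illustration, and it relies only on the `num_rows` and `size` columns of `sys.segments` already used above.

```python
def segment_summary_sql(table_name: str) -> str:
    """Build a query that summarizes the segments of one table (illustrative helper)."""
    return f'''
SELECT
  COUNT(*) AS "segments",
  SUM("num_rows") AS "total_rows",
  AVG("num_rows") AS "avg_rows_per_segment",
  SUM("size") AS "total_bytes"
FROM sys.segments
WHERE datasource = '{table_name}'
'''

# Build the summary query for the table created in this notebook.
summary_sql = segment_summary_sql('example-wikipedia-compaction')
print(summary_sql)
```

In this notebook, passing `summary_sql` to `display.sql` should show the totals. Running it again after each compaction gives a quick before-and-after comparison.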

## Apply changes to data layout through compaction

Compaction is a special type of native [Druid task](https://druid.apache.org/docs/latest/ingestion/tasks#all-task-types) that, like streaming ingestion, uses JSON specifications to define behaviors. Each compaction specification contains:

* An [ioConfig](https://druid.apache.org/docs/latest/data-management/manual-compaction#compaction-io-configuration), defining what the source data is for the job.
* A [tuningConfig](https://druid.apache.org/docs/latest/ingestion/native-batch#tuningconfig), detailing specific controls.
* Elements that control what happens to the data, much as you would find in a [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema) in streaming ingestion:
  * The dimensions to keep in the resulting data, given in a `dimensionsSpec`.
  * Any filters to apply to the data, given in the `transformSpec`.
  * Any aggregations to perform, given in the `metricsSpec` when `rollup` is enabled.
 
In the cells that follow you will see various examples of how to use compaction to affect segment layout and table data.
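
As a rough sketch of how these parts fit together, the function below assembles a compaction spec as a Python dictionary. The function name and its parameters are hypothetical, chosen for this illustration. The top-level keys are assumed to follow the manual compaction documentation linked above, and optional sections are left out unless supplied.

```python
import json

def build_compaction_spec(data_source, interval, granularity_spec=None,
                          tuning_config=None, dimensions_spec=None,
                          transform_spec=None, metrics_spec=None):
    """Assemble a compaction task spec; only supplied optional sections are included."""
    spec = {
        "type": "compact",
        "dataSource": data_source,
        # The ioConfig names the time interval whose segments are compacted.
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "interval", "interval": interval},
        },
    }
    optional = {
        "granularitySpec": granularity_spec,
        "tuningConfig": tuning_config,
        "dimensionsSpec": dimensions_spec,
        # Assumed key name, as spelled in the manual compaction documentation.
        "transformSpec": transform_spec,
        "metricsSpec": metrics_spec,
    }
    spec.update({key: value for key, value in optional.items() if value is not None})
    return spec

example_spec = build_compaction_spec(
    "example-wikipedia-compaction", "1970/2070",
    granularity_spec={"segmentGranularity": "DAY"})
print(json.dumps(example_spec, indent=2))
```

The cells in this notebook build the same structure step by step from separate variables, which makes it easier to change one section at a time.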

### Apply a different PARTITION BY scheme

A very low number of rows per segment is [one reason](https://druid.apache.org/docs/latest/data-management/compaction#compaction-guidelines) to run a compaction job that changes the partitioning scheme.

To change the PARTITIONED BY scheme through compaction, supply a [`granularitySpec`](https://druid.apache.org/docs/latest/data-management/manual-compaction/#compaction-granularity-spec) and set its `segmentGranularity` to `DAY`, which gives the table a daily primary partitioning scheme.

Run the next cell to build up a JSON ingestion specification for a compaction job:

In [None]:
compaction_ioConfig_inputSpec = {
    "type" : "interval",
    "interval" : "1970/2070" }

compaction_ioConfig = {
    "type" : "compact",
    "inputSpec" : compaction_ioConfig_inputSpec }

compaction_granularitySpec = { "segmentGranularity" : "DAY" }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "granularitySpec": compaction_granularitySpec
}

print(json.dumps(compaction_spec, indent=2))

Submit the task by running the next cell.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)
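
The segment listing only reflects the new layout once the compaction task has finished, so it can help to wait for the task explicitly. The following is a minimal sketch, not part of the Druid Python client: `wait_for_task` and its parameters are hypothetical names, and the status payload is assumed to have the shape `{"status": {"status": "RUNNING"}}` returned by the task status endpoint. A stand-in function replaces the HTTP call so that the sketch runs without a cluster.

```python
import time

def wait_for_task(get_status, task_id, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll a task until it reports SUCCESS or FAILED, or until the timeout expires.

    get_status is any callable that takes a task ID and returns the parsed JSON
    of the task status endpoint.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_status(task_id)["status"]["status"]
        if state in ("SUCCESS", "FAILED"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} still {state} after {timeout_seconds}s")
        time.sleep(poll_seconds)

# Stand-in for the HTTP call: reports RUNNING twice, then SUCCESS.
_fake_states = iter(["RUNNING", "RUNNING", "SUCCESS"])

def fake_get_status(task_id):
    return {"task": task_id, "status": {"id": task_id, "status": next(_fake_states)}}

final_state = wait_for_task(fake_get_status, "compact_example", poll_seconds=0.01)
print(final_state)
```

Against a real cluster, `get_status` could be `lambda task_id: requests.get(f"{druid_host}/druid/indexer/v1/task/{task_id}/status").json()`, with the task ID read from the `task` field of the JSON returned when the spec is posted.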

This task should not take too long to run. Once it has completed, look at the segments for the table by running the cell below.

In [None]:
sql=f'''
SELECT
  "start",
  "end",
  "num_rows",
  "size"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

### Apply a different CLUSTERED BY scheme through compaction

Use compaction to apply a data clustering scheme to your table, enabling greater parallelisation and pruning of filtering operations on the dimensions in question. This is particularly important for streaming ingestion.

During table creation, no CLUSTERED BY clause was used. Apply a clustering scheme to the table by running a compaction task. The following section applies a [`partitionsSpec`](https://druid.apache.org/docs/latest/ingestion/native-batch-simple-task#partitionsspec) inside the compaction job's [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/native-batch-simple-task#tuningconfig) so that the clustering scheme takes effect.

#### See the table layout when hash partitioning is used

The `compaction_tuningConfig_partitionsSpec` object contains the configuration needed to partition the table using hashing against the `channel` dimension.

This is placed inside the `compaction_tuningConfig` object, which:

* Sets the type to `index_parallel`, so that the table is processed using the native batch ingestion pipeline.
* Enables [perfect roll-up](https://druid.apache.org/docs/latest/ingestion/rollup#perfect-rollup-vs-best-effort-rollup) through `forceGuaranteedRollup`, which is required when partitioning by specific dimensions.

The `compaction_tuningConfig` object is then added to the compaction spec as a new `tuningConfig` section, and the spec is submitted.

In [None]:
compaction_tuningConfig_partitionsSpec = {
    "type" : "hashed",
    "partitionDimensions" : [
        "channel" ] }

compaction_tuningConfig = {
    "type" : "index_parallel",
    "forceGuaranteedRollup" : True,  # serialized as the JSON boolean true
    "partitionsSpec" : compaction_tuningConfig_partitionsSpec }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "tuningConfig" : compaction_tuningConfig,
    "granularitySpec": compaction_granularitySpec
}

print(json.dumps(compaction_spec, indent=2))

requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)

Run the following cell to see the segments for the table. Notice that the `shard_spec` now shows that the Murmur3 hash function (`murmur3_32_abs`) was applied.

In [None]:
sql=f'''
SELECT
  "start",
  "end",
  "shard_spec",
  "num_rows",
  "size"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

#### See the table layout when multi-dimension range partitioning is used

Now apply the compaction again, this time using [multi-dimension range partitioning](https://druid.apache.org/docs/latest/ingestion/native-batch/#multi-dimension-range-partitioning), which divides each time partition into segments that each cover a range of values across the chosen dimensions.

Run the next cell.

Notice that the `partitionsSpec` now uses `range`-type partitioning with multiple dimensions. For example purposes only, this notebook sets a target of 10,000 rows per segment.

In [None]:
compaction_tuningConfig_partitionsSpec = {
    "type" : "range",
    "partitionDimensions" : [
        "isRobot", "channel" ],
    "targetRowsPerSegment" : 10000 }

compaction_tuningConfig = {
    "type" : "index_parallel",
    "forceGuaranteedRollup" : True,  # serialized as the JSON boolean true
    "partitionsSpec" : compaction_tuningConfig_partitionsSpec }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "tuningConfig" : compaction_tuningConfig,
    "granularitySpec": compaction_granularitySpec
}

print(json.dumps(compaction_spec, indent=2))

requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)

See what the table segments look like by running the SQL below.

In [None]:
sql=f'''
SELECT
  "start",
  "end",
  "shard_spec"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

For each time partition (now one per DAY) there is a set of segments, each covering the range of values shown in its `shard_spec`.
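
The `shard_spec` column holds a JSON string, so it can be decoded for easier reading. The sketch below is illustrative only: `describe_shard_spec` is a hypothetical helper, the field names it reads are assumptions about the JSON layout, and the sample strings are made up rather than copied from a real run.

```python
import json

def describe_shard_spec(shard_spec_json: str) -> str:
    """Summarize one shard_spec value (a JSON string) in a single line."""
    spec = json.loads(shard_spec_json)
    kind = spec.get("type")
    if kind == "range":
        return (f"range on {spec.get('dimensions')}: "
                f"from {spec.get('start')} up to {spec.get('end')}")
    if kind == "hashed":
        return (f"hashed on {spec.get('partitionDimensions')}: "
                f"bucket {spec.get('partitionNum')}")
    return f"{kind} partition {spec.get('partitionNum')}"

# Hypothetical sample values, shaped like query output but made up for illustration.
samples = [
    '{"type":"range","dimensions":["isRobot","channel"],"start":null,'
    '"end":["false","#en.wikipedia"],"partitionNum":0}',
    '{"type":"hashed","partitionNum":0,"partitionDimensions":["channel"]}',
]
for sample in samples:
    print(describe_shard_spec(sample))
```

Applied to each row returned by the query above, a helper like this makes it easier to see where one segment's range of values ends and the next begins.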

## Change a table schema through compaction

As each segment defines its own schema, use the `SYS.SEGMENTS` table to view the dimensions of the table by running the cell below.

In [None]:
sql=f'''
SELECT DISTINCT
    "start",
    "end",
    "dimensions"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

By using the [`dimensionsSpec`](https://druid.apache.org/docs/latest/data-management/manual-compaction#compaction-dimensions-spec) in a compaction task, you can remove dimensions from a table, even for just part of its time range.

The cell below constructs a compaction task specification whose `dimensionsSpec` removes some of the original dimensions, but _only_ between 19:00 and 20:00.

Run the cell to print out the JSON.

In [None]:
compaction_ioConfig_inputSpec = {
    "type" : "interval",
    "interval" : "2016-06-27T19:00:00/PT1H" }

compaction_ioConfig = {
    "type" : "compact",
    "inputSpec" : compaction_ioConfig_inputSpec }

compaction_granularitySpec = { "segmentGranularity" : "HOUR" }

compaction_dimensionsSpec = {
    "dimensionExclusions" : [ "namespace", "isAnonymous", "user" ] }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "tuningConfig" : compaction_tuningConfig,
    "granularitySpec": compaction_granularitySpec,
    "dimensionsSpec": compaction_dimensionsSpec
}

print(json.dumps(compaction_spec, indent=2))

Taking a look at the specification, notice:

* The `interval` of the `inputSpec` (that is, the data that will be processed) covers a period of one hour starting at 19:00.
* Range partitioning will still be applied.
* The `granularitySpec` is set to `HOUR`. This is required because the segments must be small enough for the one-hour `interval` to be applied.
* The `namespace`, `isAnonymous`, and `user` dimensions will be removed from the table.
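
The interval in the spec is written as a start time followed by an ISO 8601 duration. As a small illustration of how that notation expands to a start and an end, here is a sketch that handles only the simple `PTnH`, `PTnM`, and `PTnS` durations. The function `expand_interval` is a hypothetical helper and is not how Druid itself parses intervals.

```python
import re
from datetime import datetime, timedelta

def expand_interval(interval: str):
    """Return (start, end) for a 'start/PTnH', 'start/PTnM', or 'start/PTnS' interval."""
    start_text, period_text = interval.split("/")
    match = re.fullmatch(r"PT(\d+)([HMS])", period_text)
    if match is None:
        raise ValueError(f"Unsupported period in this sketch: {period_text}")
    amount, unit = int(match.group(1)), match.group(2)
    unit_name = {"H": "hours", "M": "minutes", "S": "seconds"}[unit]
    start = datetime.fromisoformat(start_text)
    return start, start + timedelta(**{unit_name: amount})

# The interval used in the compaction spec above.
start, end = expand_interval("2016-06-27T19:00:00/PT1H")
print(start.isoformat(), "to", end.isoformat())
```

The same notation works for shorter windows: `2016-06-27T18:59:30/PT1M` covers the minute from 18:59:30 to 19:00:30, which straddles the start of the compacted hour.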

Run the cell below to begin the compaction.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)

Run this cell to see what the table's new segments look like, complete with a list of their dimensions.

In [None]:
sql=f'''
SELECT DISTINCT
    "start",
    "end",
    "shard_spec",
    "dimensions"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

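And, sure enough, the data is missing! Run the following cell to query the minute from 18:59:30 to 19:00:30, which returns rows from both sides of the start of the compacted hour. The removed dimensions no longer hold values in the rows from 19:00 onwards.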

In [None]:
sql=f'''
SELECT
  "__time",
  "namespace",
  "page",
  "user",
  "channel",
  "isRobot",
  "isAnonymous"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time, '2016-06-27T18:59:30/PT1M')
ORDER BY 1
'''

display.sql(sql)

## Apply a GROUP BY (rollup) through compaction

Use a `metricsSpec`, together with the `rollup` setting of the [`granularitySpec`](https://druid.apache.org/docs/latest/data-management/manual-compaction/#compaction-granularity-spec), to aggregate rows during compaction.
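
As a starting point, a rollup compaction spec might look something like the sketch below. It is an untested illustration under stated assumptions: the choice of grouping dimensions, the hourly `queryGranularity`, and the two `longSum` aggregators are examples picked for this table, not settings taken from this tutorial.

```python
import json

rollup_compaction_spec = {
    "type": "compact",
    "dataSource": "example-wikipedia-compaction",
    "ioConfig": {
        "type": "compact",
        "inputSpec": {"type": "interval", "interval": "1970/2070"},
    },
    # Truncate timestamps to the hour and combine rows that then share all dimension values.
    "granularitySpec": {
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": True,
    },
    # Assumed choice of grouping columns for this example.
    "dimensionsSpec": {"dimensions": ["channel", "isRobot"]},
    # Assumed choice of aggregations over numeric columns ingested earlier.
    "metricsSpec": [
        {"type": "longSum", "name": "added", "fieldName": "added"},
        {"type": "longSum", "name": "deleted", "fieldName": "deleted"},
    ],
}

print(json.dumps(rollup_compaction_spec, indent=2))
```

A spec like this could be submitted with the same `requests.post` call used earlier in the notebook. Because rollup discards the original row-level detail, it is worth trying on a copy of the table first.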

## Clean up

Run the following cell to remove the table used in this notebook from the database.

In [None]:
# Drop the batch-ingested table created in this notebook
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

## Summary

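* You learned how to use `sys.segments` to inspect the segments that make up a table.
* You learned how to submit compaction tasks that change a table's primary partitioning, its secondary partitioning (hashed or range), and its dimensions.
* Remember that a very low number of rows per segment is one reason to compact a table.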

## Learn more

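* Try this out on your own data.
* Read more about [compaction](https://druid.apache.org/docs/latest/data-management/compaction) and [manual compaction](https://druid.apache.org/docs/latest/data-management/manual-compaction) in the Druid documentation.
* Read about [multi-dimension range partitioning](https://druid.apache.org/docs/latest/ingestion/native-batch/#multi-dimension-range-partitioning) and [perfect roll-up](https://druid.apache.org/docs/latest/ingestion/rollup#perfect-rollup-vs-best-effort-rollup).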
