# Change table data by using compaction
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Through compaction, whether run manually or automatically, you can change the schema of a table, filter out data, and aggregate data, either across the entire table or within a particular time range.

This tutorial demonstrates how to use manual [compaction](https://druid.apache.org/docs/latest/data-management/compaction) to change a table's dimensions and to apply transformations. Compaction can also run [automatically](https://druid.apache.org/docs/latest/data-management/automatic-compaction).

In this tutorial you perform the following tasks:

- Create a table using batch ingestion.
- Run a compaction task to remove dimensions for a particular time period in the data.
- Run a compaction task to remove all data that matches particular criteria.
- Run a compaction task to change the granularity of the data (rollup).

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

Before following this notebook, it's recommended to complete the notebook on [changing data layout with compaction](./04-compaction-partitioning.ipynb).

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells prepare the notebook and learning environment for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Use the DRUID_HOST environment variable if it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Import additional modules

Run the following cell to import additional Python modules that you will use to call Druid APIs directly.

In [None]:
import requests
import json

## Create a table using batch ingestion

Run the following cell to create a table using batch ingestion. Specific dimensions are selected from the source data.

When completed, you'll see a description of the final table.

In [None]:
table_name = 'example-wikipedia-datacompaction'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "namespace",
  "page",
  "user",
  "channel",
  "added",
  "deleted",
  "commentLength",
  "isRobot",
  "isAnonymous",
  "regionIsoCode",
  "countryIsoCode"
FROM "ext"
PARTITIONED BY HOUR
'''

display.run_task(sql)
sql_client.wait_until_ready(f'{table_name}')
display.table(f'{table_name}')

## Apply changes to data using compaction

Compaction is a special type of native [Druid task](https://druid.apache.org/docs/latest/ingestion/tasks#all-task-types) that, like streaming ingestion, uses JSON specifications to define behaviors. Each compaction specification contains:

* An [ioConfig](https://druid.apache.org/docs/latest/data-management/manual-compaction#compaction-io-configuration), defining what the source data is for the job.
* A [tuningConfig](https://druid.apache.org/docs/latest/ingestion/native-batch#tuningconfig), detailing specific controls.
* Elements that control what happens to the data (as you would find in a [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema) in streaming ingestion):
  * The dimensions to put into the resulting data, given in a `dimensionsSpec`.
  * Any filters to apply to the data, listed in the `transformSpec`.
  * Any aggregations to perform, given in the `metricsSpec` when `rollup` is enabled.
 
In the cells that follow you will see various examples of how to use the `dimensionsSpec`, `transformSpec`, and `metricsSpec` to affect table data as part of compaction.
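The parts listed above can be assembled with a small helper. The following is a minimal sketch, not part of the Druid Python client: `build_compaction_spec` and its parameter names are hypothetical, and the function only builds a Python dictionary using the keys described in this section.

```python
import json

def build_compaction_spec(datasource, interval, segment_granularity="HOUR",
                          dimensions_spec=None, transform_spec=None,
                          metrics_spec=None, rollup=None, query_granularity=None):
    """Assemble a compaction task specification as a dict (hypothetical helper)."""
    granularity_spec = {"segmentGranularity": segment_granularity}
    if rollup is not None:
        granularity_spec["rollup"] = rollup
    if query_granularity is not None:
        granularity_spec["queryGranularity"] = query_granularity

    spec = {
        "type": "compact",
        "dataSource": datasource,
        "ioConfig": {
            "type": "compact",
            "inputSpec": {"type": "interval", "interval": interval},
        },
        "granularitySpec": granularity_spec,
    }

    # Optional sections are included only when the caller supplies them.
    if dimensions_spec is not None:
        spec["dimensionsSpec"] = dimensions_spec
    if transform_spec is not None:
        spec["transformSpec"] = transform_spec
    if metrics_spec is not None:
        spec["metricsSpec"] = metrics_spec
    return spec

example_spec = build_compaction_spec(
    "example-wikipedia-datacompaction",
    "2016-06-27T19:00:00/PT1H",
    dimensions_spec={"dimensionExclusions": ["namespace"]},
)
print(json.dumps(example_spec, indent=2))
```

The cells in this notebook build the same structure step by step so that each part is visible on its own.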

### Use dimensionsSpec to add or remove columns

Amend the dimensions in the table at compaction time by using a [`dimensionsSpec`](https://druid.apache.org/docs/latest/data-management/manual-compaction#compaction-dimensions-spec). Options include removing (`dimensionExclusions`) or explicitly including (`dimensions`) specific columns.

Because each segment defines its own schema, run the cell below to query the `sys.segments` table and view the dimensions of the table in each time partition.

In [None]:
sql=f'''
SELECT DISTINCT
    "start",
    "end",
    "dimensions"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

The cell below will construct a compaction task specification that includes a [`dimensionsSpec`](https://druid.apache.org/docs/latest/data-management/manual-compaction#compaction-dimensions-spec) to remove specific columns.

* The `ioConfig` is represented by `compaction_ioConfig`. It contains an `inputSpec` that has a restriction on the `interval` so that this task only affects data between 19:00 and 20:00.
* The `granularitySpec` matches the original PARTITIONED BY.
* The `dimensionsSpec` contains an explicit list of `dimensionExclusions` - these are what will be removed.

Finally, the `compaction_spec` object uses these objects to create the final JSON compaction specification.

Run the cell to print out the JSON.

In [None]:
compaction_ioConfig_inputSpec = {
    "type" : "interval",
    "interval" : "2016-06-27T19:00:00/PT1H" }

compaction_ioConfig = {
    "type" : "compact",
    "inputSpec" : compaction_ioConfig_inputSpec }

compaction_granularitySpec = { "segmentGranularity" : "HOUR" }

compaction_dimensionsSpec = {
    "dimensionExclusions" : [ "namespace", "isAnonymous", "user" ] }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "granularitySpec": compaction_granularitySpec,
    "dimensionsSpec": compaction_dimensionsSpec
}

print(json.dumps(compaction_spec, indent=2))
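The `interval` in the specification above is written in ISO 8601 start/duration form. As a rough illustration of what it covers, the sketch below resolves such a string into a start and an end. The `resolve_start_duration` function is hypothetical, handles only hour and minute durations, and ignores time zones.

```python
from datetime import datetime, timedelta
import re

def resolve_start_duration(interval):
    """Resolve an ISO 8601 'start/PTnHnM' interval into (start, end) datetimes.

    Hypothetical helper: only hour and minute durations are supported.
    """
    start_text, duration_text = interval.split("/")
    start = datetime.fromisoformat(start_text)
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration_text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration_text}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return start, start + timedelta(hours=hours, minutes=minutes)

start, end = resolve_start_duration("2016-06-27T19:00:00/PT1H")
print(start.isoformat(), "to", end.isoformat())
```

The start of the interval is inclusive and the end is exclusive, so this task touches the single hourly time chunk that begins at 19:00.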

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)
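The call above returns as soon as Druid accepts the task, not when the task finishes. One way to wait for completion is to poll the task's status. The following is a minimal sketch: `wait_for_task` and `fetch_status` are hypothetical names, and the status lookup is stubbed out here so that the example runs without a Druid cluster.

```python
import time

def wait_for_task(task_id, fetch_status, poll_seconds=5, timeout_seconds=600):
    """Poll fetch_status(task_id) until it reports SUCCESS or FAILED."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fetch_status(task_id)
        if state in ("SUCCESS", "FAILED"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in time")
        time.sleep(poll_seconds)

# Stubbed status source for illustration: reports RUNNING twice, then SUCCESS.
fake_states = iter(["RUNNING", "RUNNING", "SUCCESS"])
final_state = wait_for_task("demo-task", lambda _task_id: next(fake_states), poll_seconds=0)
print(final_state)

# Against a real cluster, fetch_status could be wired up along these lines,
# assuming the submission response carries the ID under "task" and the status
# endpoint returns {"status": {"status": ...}}:
#
#   response = requests.post(f"{druid_host}/druid/indexer/v1/task",
#                            json.dumps(compaction_spec), headers=druid_headers)
#   task_id = response.json()["task"]
#   def fetch_status(task_id):
#       r = requests.get(f"{druid_host}/druid/indexer/v1/task/{task_id}/status")
#       r.raise_for_status()
#       return r.json()["status"]["status"]
```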

Run the cell below to follow along as the compaction task runs. Rerun it until the segment for the 19:00 hour shows the new list of dimensions.

In [None]:
sql=f'''
SELECT DISTINCT
    "start",
    "end",
    "dimensions"
FROM sys.segments
WHERE datasource = '{table_name}'
AND "start" LIKE '2016-06-27T1%'
ORDER BY 1
'''

display.sql(sql)

When the task completes, you will see that the table no longer contains `namespace`, `isAnonymous`, or `user` between 19:00 and 20:00.

### Use a transformSpec to filter out data

Incorporate a [native filter](https://druid.apache.org/docs/latest/querying/filters) into a `transformSpec` during compaction to retain only the rows that match a particular condition.

Run the following cell to retrieve some data from the table between 10:00 and 11:00.

In [None]:
sql=f'''
SELECT channel,
   COUNT(*) AS "events"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL ("__time",'2016-06-27T10/PT1H')
GROUP BY 1
'''

display.sql(sql)

Run the cell below to construct a compaction specification that only retains events where the `channel` is `#en.wikipedia`. Notice that the `interval` constrains this job to events in the table between 10:00 and 11:00.

In [None]:
compaction_ioConfig_inputSpec = {
    "type" : "interval",
    "interval" : "2016-06-27T10:00:00/PT1H" }

compaction_ioConfig = {
    "type" : "compact",
    "inputSpec" : compaction_ioConfig_inputSpec }

compaction_granularitySpec = { "segmentGranularity" : "HOUR" }

compaction_transformSpec = {
    "filter":
        {
            "type": "selector",
            "dimension": "channel",
            "value": "#en.wikipedia"
        }
    }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "granularitySpec": compaction_granularitySpec,
    "transformSpec": compaction_transformSpec
}

print(json.dumps(compaction_spec, indent=2))

Run the next cell to submit the compaction.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)

Run the cell below to follow along as Druid processes the table data.

When the task completes, the selected time period contains _only_ rows for the `#en.wikipedia` channel.

In [None]:
sql=f'''
SELECT channel,
   COUNT(*) AS "events"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL ("__time",'2016-06-27T10:00/PT1H')
GROUP BY 1
'''

display.sql(sql)
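The filter in a `transformSpec` describes the rows to keep. To remove matching rows instead, the same filter can be wrapped in a `not` filter. The sketch below illustrates the difference on a few made-up rows. The `matches` function is a hypothetical stand-in that evaluates only the `selector`, `in`, and `not` filter types in plain Python; it does not reproduce how Druid handles nulls or multi-value dimensions.

```python
def matches(native_filter, row):
    """Evaluate a small subset of native filters against a dict row (illustration only)."""
    kind = native_filter["type"]
    if kind == "selector":
        return row.get(native_filter["dimension"]) == native_filter["value"]
    if kind == "in":
        return row.get(native_filter["dimension"]) in native_filter["values"]
    if kind == "not":
        return not matches(native_filter["field"], row)
    raise ValueError(f"unsupported filter type: {kind}")

keep_english = {"type": "selector", "dimension": "channel", "value": "#en.wikipedia"}
drop_english = {"type": "not", "field": keep_english}

# Made-up rows, used only to show which ones each filter retains.
sample_rows = [
    {"channel": "#en.wikipedia", "added": 10},
    {"channel": "#fr.wikipedia", "added": 3},
    {"channel": "#en.wikipedia", "added": 7},
]

retained_with_selector = [row for row in sample_rows if matches(keep_english, row)]
retained_with_not = [row for row in sample_rows if matches(drop_english, row)]
print(len(retained_with_selector), len(retained_with_not))

# A transformSpec that removes the #en.wikipedia rows rather than keeping them.
removal_transformSpec = {"filter": drop_english}
```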

### Aggregate table data using rollup

Enable [`rollup`](https://druid.apache.org/docs/latest/data-management/compaction#rollup) inside the `granularitySpec` and add a [`metricsSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#metricsspec) to apply a GROUP BY to table data through compaction.

Run the following cell to create a compaction specification.

* The `interval` is set to process only the first six hours of the data.
* `rollup` is set to true to apply a GROUP BY on the data.
* A `metricsSpec` is represented by the `compaction_metricsSpec` object, and sets up three metrics to be produced:
    * Instead of the raw data, `added` becomes a SUM of the source values.
    * `deleted` receives the same treatment as `added`, replacing the raw values with a SUM.
    * A new column called `theta_user` is added. It contains an Apache DataSketches Theta sketch of the underlying users for [approximate COUNT DISTINCT](https://druid.apache.org/docs/latest/querying/aggregations#approximate-aggregations) operations.
 
To improve the rollup ratio:

* `queryGranularity` is added to the `granularitySpec` so that timestamps are truncated to fifteen-minute precision.
* Only specific dimensions are retained - these are specified using a `dimensionsSpec` list of `dimensions` to retain. Notice that `user` is no longer in the list of dimensions.
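To see why these two choices matter, the sketch below imitates rollup in plain Python on a few made-up rows. It is an illustration only, not how Druid implements rollup: timestamps are truncated to fifteen minutes, rows are grouped by the truncated time plus the retained dimensions, `added` is summed, and an exact Python `set` stands in for the Theta sketch. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def truncate_to_fifteen_minutes(ts):
    """Floor a timestamp to the start of its fifteen-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)

def simulate_rollup(rows, dimensions):
    """Group rows by truncated time and dimensions, summing 'added' and collecting users."""
    groups = defaultdict(lambda: {"added": 0, "users": set()})
    for row in rows:
        key = (truncate_to_fifteen_minutes(row["__time"]),) + tuple(row[d] for d in dimensions)
        groups[key]["added"] += row["added"]
        groups[key]["users"].add(row["user"])  # exact set as a stand-in for a sketch
    return groups

# Made-up rows for illustration.
sample_rows = [
    {"__time": datetime(2016, 6, 27, 0, 1, 12), "channel": "#en.wikipedia", "user": "alice", "added": 10},
    {"__time": datetime(2016, 6, 27, 0, 7, 45), "channel": "#en.wikipedia", "user": "bob", "added": 5},
    {"__time": datetime(2016, 6, 27, 0, 14, 59), "channel": "#en.wikipedia", "user": "alice", "added": 2},
    {"__time": datetime(2016, 6, 27, 0, 16, 0), "channel": "#en.wikipedia", "user": "carol", "added": 1},
]

without_user = simulate_rollup(sample_rows, ["channel"])
with_user = simulate_rollup(sample_rows, ["channel", "user"])

for key, agg in sorted(without_user.items()):
    print(key[0].isoformat(), key[1], agg["added"], len(agg["users"]))
print(len(sample_rows), "rows become", len(without_user), "without user and", len(with_user), "with user")
```

Keeping a high-cardinality dimension such as `user` leaves more distinct groups, so fewer rows collapse together. Moving the user values into a sketch metric keeps the ability to count distinct users while allowing the rows to combine.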

Run the next cell to build the compaction specification and take a look at the result.

In [None]:
compaction_ioConfig_inputSpec = {
    "type" : "interval",
    "interval" : "2016-06-27/PT6H" }

compaction_ioConfig = {
    "type" : "compact",
    "inputSpec" : compaction_ioConfig_inputSpec }

compaction_granularitySpec = {
    "segmentGranularity" : "HOUR",
    "rollup" : True,  # Python boolean, serialized as JSON true
    "queryGranularity" : "fifteen_minute"}

compaction_metricsSpec = [
        { "type": "doubleSum", "name": "added", "fieldName": "added" },
        { "type": "doubleSum", "name": "deleted", "fieldName": "deleted" },
        { "type": "thetaSketch", "name": "theta_user", "fieldName": "user" }
    ]

compaction_dimensionsSpec = {
    "dimensions": [
        "namespace",
        "channel",
        "isRobot",
        "isAnonymous",
        "regionIsoCode",
        "countryIsoCode"
    ] }

compaction_spec = {
    "type": "compact",
    "dataSource": table_name,
    "ioConfig": compaction_ioConfig,
    "granularitySpec": compaction_granularitySpec,
    "dimensionsSpec": compaction_dimensionsSpec,
    "metricsSpec": compaction_metricsSpec
}

print(json.dumps(compaction_spec, indent=2))

Run the cell below to submit the compaction job.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/task", json.dumps(compaction_spec), headers=druid_headers)

Run the cell below while the compaction task runs.

When the compaction is finished, you will see that rows before 06:00 have timestamps truncated to fifteen-minute precision, so the table contains multiple rows with the same timestamp. From 06:00 onward, the precision of the table remains the same.

In [None]:
sql=f'''
SELECT
  "__time",
  COUNT(*) AS "rows"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time", '2016-06-27T04:30/PT2H')
GROUP BY 1
LIMIT 10
'''

display.sql(sql)

Take a look at the actual data in the table by running the SQL below.

In [None]:
sql=f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time", 'PT16M/2016-06-27T06:01')
ORDER BY __time DESC
LIMIT 20
'''

display.sql(sql)

## Clean up

Run the following cell to drop the table from the database.

In [None]:
druid.datasources.drop(f"{table_name}")

## Summary

* Compaction can be run manually or automatically.
* Adjustments can be made to specific periods of time.
* Data can be filtered out and the schema changed.
* Rows can be aggregated and metrics emitted.

## Learn more

* Take a look at more options for [filters](https://druid.apache.org/docs/latest/querying/filters) in `transformSpec` in the documentation and the [native filters notebook](../02-ingestion/14-native-filters.ipynb).
* Find the technical details in the documentation about the [`dimensionsSpec`](https://druid.apache.org/docs/latest/querying/dimensionspecs) and walk through examples in the [native dimensions](../02-ingestion/15-native-dimensions.ipynb) notebook.
* Read the documentation on [native aggregations](https://druid.apache.org/docs/latest/querying/aggregations) you can add to the `metricsSpec`.
* Understand the importance of approximation in the notebooks on [ranking](../03-query/02-approx-ranking.ipynb), [count distinct](../03-query/03-approx-count-distinct.ipynb), and [distribution](../03-query/04-approx-distribution.ipynb).
* Learn more about compaction-time [`rollup`](https://druid.apache.org/docs/latest/data-management/compaction#rollup) in the documentation.