# Aggregating source data by using rollup with streaming ingestion
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Data streams often contain many millions of rows at a very high time precision. This level of detail is often unnecessary, and it is far more than fits on a user's screen. With batch ingestion, you can incorporate a GROUP BY into the SQL to aggregate incoming rows. Streaming ingestion has a similar mechanism called "[rollup](https://druid.apache.org/docs/latest/ingestion/rollup)".

This tutorial demonstrates some rollup functionality against an example incoming data stream.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Use the DRUID_HOST environment variable when it is set; otherwise default to localhost.
if 'DRUID_HOST' not in os.environ.keys():
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Set up a connection to Apache Kafka


Run the next cell to set up the connection to Apache Kafka.

In [None]:
# Kafka bootstrap servers are given as host:port, without a URL scheme.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

### Set up a connection to the data generator


Run the next cell to set up the connection to the data generator.

In [None]:
import requests
import json

datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}

### Start data generation to a Kafka topic

Run the following cell to instruct the data generator to start producing data and display the status of the job.

In [None]:
datagen_topic = "example-clickstream-rollup"
datagen_job = datagen_topic
datagen_config = "social/social_posts.json"

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic  },
    "config_file": datagen_config, 
    "concurrency":100
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
requests.get(f"{datagen_host}/status/{datagen_job}").json()

### Take a look at a sample of the data

Run the cell below to show the data that you will use in this notebook by connecting to and printing out raw events from the Kafka stream.

In [None]:
from kafka import KafkaConsumer

# Read a handful of raw events from the topic, then stop.
consumer = KafkaConsumer(bootstrap_servers=kafka_host)
consumer.subscribe(topics=datagen_topic)
for count, message in enumerate(consumer, start=1):
    print("%d:%d: v=%s" % (message.partition,
                           message.offset,
                           message.value))
    # Stop after printing five messages.
    if count == 5:
        break
consumer.unsubscribe()

### Set re-usable elements for the ingestion specification

Ingest data from an Apache Kafka topic into Apache Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

An ingestion specification contains three parts:

* [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig).
* [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig).
* [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema).
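
As a rough sketch of how these three parts fit together, the snippet below wraps them in the outer structure that this notebook later posts to the supervisor API. The helper name `build_ingestion_spec` is hypothetical, and the placeholder values are deliberately incomplete; they only illustrate the shape of the document.

```python
import json

def build_ingestion_spec(io_config: dict, tuning_config: dict, data_schema: dict) -> dict:
    """Hypothetical helper: wrap the three parts in a Kafka supervisor spec."""
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": io_config,
            "tuningConfig": tuning_config,
            "dataSchema": data_schema,
        },
    }

# Placeholder parts, for illustration only; real specs need more fields.
example_spec = build_ingestion_spec(
    io_config={"type": "kafka", "topic": "example-clickstream-rollup"},
    tuning_config={"type": "kafka"},
    data_schema={"dataSource": "example-clickstream-rollup"},
)

print(json.dumps(example_spec, indent=3))
```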

Run the cell below to create objects that represent the `ioConfig` and `tuningConfig`, which you will re-use throughout the notebook, and a variable that holds the name of the table that receives the data from Kafka.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": "true" }

tuningConfig = { "type": "kafka" }

table_name = datagen_topic

## Create a table with aggregated source data

Aggregate streaming data at ingestion time by setting `rollup` to `true` in your ingestion specification's [`granularitySpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec), part of the [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). This instructs Druid to apply a GROUP BY to the incoming data.

In the following sections, you will:

* Create an aggregated view of the source data.
* Enrich the table with simple aggregates calculated from the source data.
* Enrich the table with sketch objects that are used for approximation.

### Use `rollup` and `queryGranularity` to aggregate source data rows

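The efficiency of GROUP BY depends on the cardinality of all dimensions, including the timestamp. The fewer distinct combinations of values there are, the more source rows collapse into each aggregated row.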

Reduce the cardinality of the timestamp by flooring the source time field using `queryGranularity`. Set this to the desired level of [granularity](https://druid.apache.org/docs/latest/querying/granularities) in the `granularitySpec`.
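
To picture the effect of flooring, here is a minimal pure-Python sketch. It is not Druid code: the `floor_time` helper and the event times are invented for illustration, and buckets are assumed to be counted from the Unix epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

def floor_time(ts: datetime, bucket: timedelta) -> datetime:
    """Truncate ts down to the start of its bucket, counting buckets from the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return ts - ((ts - epoch) % bucket)

# Hypothetical event times, invented for illustration.
event_times = [
    datetime(2024, 1, 1, 10, 1, 12, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 7, 45, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 14, 59, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 15, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 11, 30, 30, tzinfo=timezone.utc),
]

# Bucket sizes keyed by the matching granularity name.
granularities = {
    "minute": timedelta(minutes=1),
    "fifteen_minute": timedelta(minutes=15),
    "hour": timedelta(hours=1),
}

# Count the distinct timestamps that remain after flooring at each granularity.
distinct_counts = {
    name: len({floor_time(t, bucket) for t in event_times})
    for name, bucket in granularities.items()
}

print(distinct_counts)
```

The coarser the granularity, the fewer distinct timestamps remain, which gives rollup more rows to combine.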

A simple approach to handling the cardinality of all other dimensions is to be specific about the dimensions you wish to keep in the `dimensionsSpec`. Another is to apply an expression, such as truncation, inside a [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec).

Run the cell below to create a `granularitySpec` and a `dataSchema` that uses it.

* Rollup is enabled.
* Incoming timestamps will be truncated via `queryGranularity` to a precision of `FIFTEEN_MINUTE`.
* `post_title`, `views`, `upvotes`, and `comments`, which all have high cardinality, are removed because they are not listed in the `dimensionsSpec`.

In [None]:
dataSchema_granularitySpec = {
    "rollup": "true",
    "queryGranularity": "fifteen_minute",
    "segmentGranularity": "hour"
}

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": dataSchema_granularitySpec,
    "dimensionsSpec": {
        "dimensions": [
            "username",
            "edited"
            ]
        }
    }

Run the next cell to build the ingestion specification and print it out.

In [None]:
ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=3))

Run the cell below to post the ingestion specification and start ingesting from the topic. When finished, you'll see a description of the table.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=druid_headers)
sql_client.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)

To see some of the data in the table, run the next cell.

In [None]:
sql=f'''
SELECT * FROM "{table_name}"
'''

display.sql(sql)

As you might expect, the rows are grouped by the truncated timestamp, then by `username` and `edited`.
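
The grouping can be mimicked in a few lines of plain Python. This is only an illustrative sketch of the idea, not how Druid implements rollup; the events and the `fifteen_minute_bucket` helper are invented for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events, invented for illustration.
raw_events = [
    {"time": "2024-01-01T10:01:12Z", "username": "alice", "edited": False},
    {"time": "2024-01-01T10:07:45Z", "username": "alice", "edited": False},
    {"time": "2024-01-01T10:14:59Z", "username": "alice", "edited": True},
    {"time": "2024-01-01T10:16:00Z", "username": "alice", "edited": False},
    {"time": "2024-01-01T10:02:00Z", "username": "bob", "edited": False},
]

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def fifteen_minute_bucket(iso_time: str) -> str:
    """Floor an ISO 8601 timestamp string to the start of its 15-minute bucket."""
    ts = datetime.strptime(iso_time, TIME_FORMAT)
    ts = ts.replace(minute=ts.minute - ts.minute % 15, second=0)
    return ts.strftime(TIME_FORMAT)

# Group by (truncated time, username, edited) and count the rows in each group.
rolled_up = Counter(
    (fifteen_minute_bucket(e["time"]), e["username"], e["edited"])
    for e in raw_events
)

for key, row_count in sorted(rolled_up.items()):
    print(key, "<-", row_count, "source row(s)")
```

Five source rows collapse into four stored rows here, because two of the events share the same bucket, `username`, and `edited` value.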

Run the next cell to stop ingestion and drop the table.

In [None]:
print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

### Generate simple aggregations at rollup time

Rather than dropping the all-important measures from the source data, this section introduces a `metricsSpec`.

A [`metricsSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#metricsspec) contains a list of [native aggregations](https://druid.apache.org/docs/latest/querying/aggregations) to apply to the raw data as it arrives. 
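
As a rough mental model, each entry in the list reduces the raw values of one group of source rows to a single stored value. The sketch below imitates that with Python stand-ins for four aggregator types; the `AGGREGATORS` mapping, the `aggregate` helper, and the sample rows are assumptions made for illustration and are not Druid's implementation.

```python
# Python stand-ins for the aggregator types (an illustrative assumption).
AGGREGATORS = {
    "count": len,
    "longSum": sum,
    "longMax": max,
    "longMin": min,
}

metrics_spec = [
    {"name": "count", "type": "count"},
    {"name": "sum_views", "type": "longSum", "fieldName": "views"},
    {"name": "max_views", "type": "longMax", "fieldName": "views"},
    {"name": "min_views", "type": "longMin", "fieldName": "views"},
]

# Hypothetical raw rows that all fall into the same rollup group.
group_rows = [{"views": 10}, {"views": 25}, {"views": 3}]

def aggregate(rows: list, spec: list) -> dict:
    """Reduce one group of raw rows to a single row of metric columns."""
    result = {}
    for metric in spec:
        field = metric.get("fieldName")
        # "count" has no fieldName, so it operates on the rows themselves.
        values = [row[field] for row in rows] if field else rows
        result[metric["name"]] = AGGREGATORS[metric["type"]](values)
    return result

rolled_up_row = aggregate(group_rows, metrics_spec)
print(rolled_up_row)
```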

Run the cell below to create an object that holds the `metricsSpec`. Each entry instructs Druid to generate a new column in the table, called `name`, by applying the specified `type` of aggregator (SUM, MAX, MIN) to the `fieldName` column from the source data.

You'll see a print out of the final ingestion spec before it is submitted and, when done, a description of the table.

In [None]:
dataSchema_metricsSpec = [
    { "name" : "count", "type" : "count" },
    { "name" : "sum_views", "type" : "longSum", "fieldName" : "views" },
    { "name" : "max_views", "type" : "longMax", "fieldName" : "views" },
    { "name" : "min_views", "type" : "longMin", "fieldName" : "views" },
    { "name" : "sum_upvotes", "type" : "longSum", "fieldName" : "upvotes" },
    { "name" : "sum_comments", "type" : "longSum", "fieldName" : "comments" }
    ]

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": dataSchema_granularitySpec,
    "dimensionsSpec": {
        "dimensions": [
            "username",
            "edited"
            ]
        },
    "metricsSpec" : dataSchema_metricsSpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=3))

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=druid_headers)
sql_client.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)

Run the following cell to see the result.

Because this is live data fed from a stream, remember that the latest figures will change each time you run the cell!

In [None]:
sql=f'''
SELECT * FROM "{table_name}"
'''

display.sql(sql)
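
Because each stored row can now represent several source events, `COUNT(*)` counts rolled-up rows rather than events. The query sketched below sums the `count` metric to recover the number of source events and compares the two. It assumes the metric names from the `metricsSpec` above; the aliases are hypothetical, and the table name is repeated only so that the snippet stands alone.

```python
# Same table name as used earlier in this notebook.
table_name = "example-clickstream-rollup"

sql = f'''
SELECT
  "username",
  COUNT(*) AS "rolled_up_rows",
  SUM("count") AS "source_events",
  SUM("count") * 1.0 / COUNT(*) AS "rollup_ratio",
  SUM("sum_views") AS "total_views",
  MAX("max_views") AS "highest_views"
FROM "{table_name}"
GROUP BY 1
ORDER BY "source_events" DESC
LIMIT 10
'''

print(sql)
```

Inside this notebook, the statement can be passed to `display.sql(sql)` in the same way as the query above.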

Run the next cell to stop the ingestion and drop the table.

In [None]:
print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

### Speed up approximation by generating Apache DataSketches objects at ingestion time

The remaining high-cardinality column in the data is `username`. Similarly, IP addresses, session IDs, device IDs, and so on, can also be high cardinality and affect aggregation.

When users do not need to access the raw values, use [Apache DataSketches](https://datasketches.apache.org/) to improve the effectiveness of rollup and to speed up COUNT DISTINCT and other operations even further.

Run the cell below to replace the `username` with a new column, `username_hll`, using the [HLL Sketch](https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll) build aggregator.

In [None]:
dataSchema_metricsSpec = [
    { "name" : "count", "type" : "count" },
    { "name" : "views_sum", "type" : "longSum", "fieldName" : "views" },
    { "name" : "views_max", "type" : "longMax", "fieldName" : "views" },
    { "name" : "views_min", "type" : "longMin", "fieldName" : "views" },
    { "name" : "upvotes_sum", "type" : "longSum", "fieldName" : "upvotes" },
    { "name" : "comments_sum", "type" : "longSum", "fieldName" : "comments" },
    { "name" : "username_hll", "type" : "HllSketchBuild", "fieldName" : "username" }
    ]

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": dataSchema_granularitySpec,
    "dimensionsSpec": {
        "dimensions": [
            "edited"
            ]
        },
    "metricsSpec" : dataSchema_metricsSpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=3))

Run the next cell to post this ingestion specification to Druid and start building the table.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=druid_headers)
sql_client.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)

Run the next cell to take a peek at the data.

Using `queryGranularity` to truncate the timestamp, being specific about the `dimensions`, and adding a `metricsSpec` rather than storing raw values, you have created a fully aggregated table that maintains only essential attributes of each event.

In [None]:


sql=f'''
SELECT * FROM "{table_name}"
'''

display.sql(sql)
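
The `username_hll` column holds sketch objects rather than readable values, so `SELECT *` is of limited use for it. A sketch of one way to query it is shown below, using the `APPROX_COUNT_DISTINCT_DS_HLL` function from the DataSketches extension to estimate distinct users per hour. The aliases are hypothetical, and the table name is repeated only so that the snippet stands alone.

```python
# Same table name as used earlier in this notebook.
table_name = "example-clickstream-rollup"

sql = f'''
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "hour",
  "edited",
  APPROX_COUNT_DISTINCT_DS_HLL("username_hll") AS "approx_users",
  SUM("count") AS "source_events"
FROM "{table_name}"
GROUP BY 1, 2
ORDER BY 1, 2
'''

print(sql)
```

Inside this notebook, the statement can be passed to `display.sql(sql)`. The result is an estimate, not an exact count.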

### Ingest a table containing objects


## Filtering source data to be included

See the Druid documentation on [query filters](https://druid.apache.org/docs/latest/querying/filters).

## Filter out rows based on metrics

See the Druid documentation on [having filters](https://druid.apache.org/docs/latest/querying/having).

## Clean up

Run the following cell to stop the data generator, terminate the streaming ingestion, and drop the table used in this notebook.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

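* Setting `rollup` to `true` in the `granularitySpec` aggregates streaming data at ingestion time, much like a GROUP BY.
* Truncating timestamps with `queryGranularity` and listing only the dimensions you need in the `dimensionsSpec` lowers cardinality, which makes rollup more effective.
* A `metricsSpec` keeps measures as aggregates, and sketch aggregators such as `HllSketchBuild` can replace high-cardinality columns when the raw values are not needed.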

## Learn more

* See examples of transformations on streaming data in the notebook on [native transforms](./13-native-transforms.ipynb).
* Learn more about approximation in Druid by looking at the notebooks on approximate [ranking](../03-query/02-approx-ranking.ipynb), [COUNT DISTINCT](../03-query/03-approx-count-distinct.ipynb), and [distribution](../03-query/04-approx-distribution.ipynb).
* See more examples of [generating Data Sketches](./03-generating-sketches.ipynb) by looking at the related notebook for batch ingestion.