# Optimizing performance with streaming ingestion
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

In the [05-partitioning-data.ipynb](05-partitioning-data.ipynb) notebook you learned about Apache Druid's data sharding strategy and how it can be used to improve system health and query performance. That was all about batch ingestion, but what happens with streaming ingestion?

Streaming ingestion is inherently different because it is optimized for scalable throughput. Streaming ingestion scales up by adding more tasks. Each task ingests a portion of the data from a stream and generates segment files independently of the other tasks, so more parallel tasks cause more fragmentation. We will review how compaction tasks are used to merge segments and how compaction can apply a secondary partitioning strategy in order to optimize the data after ingestion.

In this notebook you will set up a streaming ingestion job and understand how its settings translate into segment creation. You'll learn about fragmentation of the data and how to optimize it after initial ingestion.


## Prerequisites

This tutorial was tested with Druid 27.0.0.

#### Run with Docker

<!-- Profiles are:
`druid-jupyter` - just Jupyter and Druid
`all-services` - includes Jupyter, Druid, and Kafka
 -->

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [the project on GitHub](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import json


if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"

if 'KAFKA_HOST' not in os.environ.keys():
    # Kafka bootstrap servers are host:port pairs, not URLs
    kafka_host="localhost:9092"
else:
    kafka_host=f"{os.environ['KAFKA_HOST']}:9092"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

# shortcuts for APIs
display = druid.display
sql_client = druid.sql
status_client = druid.status

# client for Data Generator API
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")

# REST client for Druid API
rest_client = druid.rest

## Set up the topic and produce data
To test parallelism we'll need a topic with multiple partitions. Apache Druid maps one or more Kafka partitions to each task. Each task consumes from its assigned partitions to serve queries and create segments.
A topic with 4 partitions, like the one created in the next cell, means we can use a maximum of 4 tasks to ingest from it.
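
To make the partition-to-task mapping concrete, here is a minimal sketch that spreads the partitions of a 4-partition topic across different task counts. It assumes a simple round-robin (modulo) assignment for illustration; `assign_partitions` is a hypothetical helper, not part of Druid or `druidapi`, and the real supervisor may differ in detail.

```python
from collections import defaultdict
from typing import Dict, List


def assign_partitions(num_partitions: int, task_count: int) -> Dict[int, List[int]]:
    """Distribute Kafka partition ids across tasks, round-robin (assumed scheme)."""
    # more tasks than partitions would leave tasks idle, so cap the count
    effective_tasks = min(task_count, num_partitions)
    assignment = defaultdict(list)
    for partition in range(num_partitions):
        assignment[partition % effective_tasks].append(partition)
    return dict(assignment)


# show the mapping for a 4-partition topic with 1, 2, 4 and 8 requested tasks
for tasks in (1, 2, 4, 8):
    print(tasks, assign_partitions(4, tasks))
```

With more tasks requested than partitions available, the extra tasks would have nothing to read, which is why the partition count caps the useful task count.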


In [None]:
from kafka.admin import KafkaAdminClient, NewTopic

topic_name='social'
admin_client = KafkaAdminClient(
    bootstrap_servers=kafka_host, 
    client_id='admin'
)

topic_list = []
topic_list.append(NewTopic(name=topic_name, num_partitions=4, replication_factor=1))
admin_client.create_topics(new_topics=topic_list, validate_only=False)

### Feed the Topic
Add an hour's worth of social posts to the topic.

In [None]:
from datetime import datetime, timedelta

#simulate events with timestamps starting at the top of the hour
gen_now = datetime.now().replace(microsecond=0, second=0, minute=0)
gen_start_time = gen_now.strftime("%Y-%m-%d %H:%M:%S")

# generate 1 million events
total_row_count=1000000
headers = {
  'Content-Type': 'application/json'
}
datagen_request = {
    "name": "social_stream",
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": topic_name  },
    "config_file": "social/social_posts.json", 
    "total_events":total_row_count,
    "concurrency":500,
    "time_type": gen_start_time
}
datagen.post("/start", json.dumps(datagen_request), headers=headers)

The next cell monitors the data generation until it completes publishing messages into the Kafka topic.
It will take a few minutes to generate 1 million messages. You can see the progress in the output field `total_records`.

In [None]:
import time

from IPython.display import clear_output

# wait for the messages to be fully published 
done = False
while not done:
    result = datagen.get_json("/status/social_stream",'')
    clear_output(wait=True)
    print(json.dumps(result, indent=2))
    if result["status"] == 'COMPLETE':
        done = True
    else:
        time.sleep(1)

### Helper functions
These functions will help us measure multiple attempts at ingesting the data.

In [None]:
# monitor ingestion by counting the rows ingested until the expected number of rows have been loaded
def monitor_ingestion( target_table:str, target_rows:int):
    row_count=0
    while row_count<target_rows:
        res = sql_client.sql(f'SELECT count(1) as "count" FROM {target_table}')
        clear_output(wait=True)
        print(json.dumps(res, indent=2))
        row_count = res[0]['count']
        time.sleep(1)
        
# suspend the streaming ingestion job and wait for tasks to publish their segments
def stop_streaming_job( target_table: str, reset_offsets: bool = False):
    print(f'Pause streaming ingestion: [{druid.rest.post(f"/druid/indexer/v1/supervisor/{target_table}/suspend","", require_ok=False)}]')
    

    tasks = druid.tasks.tasks(state='running', table=target_table)
    tasks_done = 0
    while tasks_done<len(tasks):
        tasks_done = 0
        clear_output( wait=True)
        print(f'Waiting for running tasks to publish their segments ...')
        for task in tasks:
            status = druid.tasks.task_status(task['id'])
            print(f"Task [{task['id']}] Status:{status['status']['statusCode']} RunnerStatus:{status['status']['runnerStatusCode']}")
            if (status['status']['statusCode']!='RUNNING'): 
                tasks_done += 1 
        time.sleep(1)
            
    if reset_offsets:
        print(f'Reset offsets for re-runnability: [{druid.rest.post(f"/druid/indexer/v1/supervisor/{target_table}/reset","", require_ok=False)}]')
    print(f'Terminate streaming ingestion: [{druid.rest.post(f"/druid/indexer/v1/supervisor/{target_table}/terminate","", require_ok=False)}]')

# Remove table data and metadata from Druid
def drop_table( target_table: str):
    # mark segments as unused 
    druid.datasources.drop(target_table)
    # remove segment metadata and data for unused segments
    headers = {'Content-Type': 'application/json'}
    kill_task = {
      "type": "kill",
      "dataSource": target_table,
      "interval" : "2000-09-12/2999-09-13"
    }
    print(druid.rest.post(f"/druid/indexer/v1/task", json.dumps(kill_task),require_ok=False, headers=headers))

### Ingest the data with a single task
Next, we'll try ingesting the same data with different numbers of tasks.
The Docker Compose environment file is configured with `druid_worker_capacity=4`, meaning we can have up to 4 tasks running concurrently. This configuration oversubscribes the CPUs available to Docker, so it is not intended to show scale-up, but rather the effects on segment generation when using more tasks.

The first test uses 1 task. After that, we'll run with 2 tasks and then 4 tasks to see which segments are produced in each case.

In [None]:
target_table = 'social_media_1task'
kafka_ingestion_spec = {
    "type": "kafka",
    "spec": { 
        "ioConfig": { "type": "kafka", "consumerProperties": { "bootstrap.servers": kafka_host },
            "topic": topic_name,  "taskCount": 1, "useEarliestOffset": True,
            "inputFormat": { "type": "json"} 
        },
        "tuningConfig": { "type": "kafka"},
        "dataSchema": {
            "dataSource": target_table,
            "timestampSpec": { "column": "time", "format": "iso" },
            "dimensionsSpec": {
                "dimensions": [
                    "username",
                    "post_title",
                    { "type": "long", "name": "views"},
                    { "type": "long", "name": "upvotes"},
                    { "type": "long", "name": "comments"},
                    "edited"
                ]
            },
            "granularitySpec": { "queryGranularity": "none",  "rollup": False, "segmentGranularity": "hour" }
        }
    }
}
supervisor = rest_client.post("/druid/indexer/v1/supervisor", json.dumps(kafka_ingestion_spec), headers=headers)
print(f'Start supervisor response: [{supervisor.status_code}]')
print(f'Waiting for tasks to start streaming...')
# wait for table creation and ingestion start
druid.sql.wait_until_ready(target_table, verify_load_status=False) 
start_time = datetime.now()
monitor_ingestion( target_table, total_row_count)  # wait for the 1000000 rows to be loaded
end_time = datetime.now()
print(f'Total load time = {(end_time-start_time).total_seconds()}')

My result for this run was: 
```
[
  {
    "count": 1000000
  }
]
Total load time = 8.326123
```

Let's stop that ingestion to free up the worker slots and start the next one. 
Stopping the ingestion involves waiting for the currently running tasks to finish building and publishing their segments. This could take up to a minute while the Coordinator picks up the new segments and hands them off to the Historical:

In [None]:
stop_streaming_job('social_media_1task', True)

### Ingest the data with two and four tasks
The next task ingests the same data into a different table so we can compare the results.
This one runs with 2 tasks:

In [None]:
target_table = 'social_media_2task'
kafka_ingestion_spec = {
    "type": "kafka",
    "spec": { 
        "ioConfig": { "type": "kafka", "consumerProperties": { "bootstrap.servers": kafka_host },
            "topic": topic_name,  "taskCount": 2, "useEarliestOffset": True,
            "inputFormat": { "type": "json"} 
        },
        "tuningConfig": { "type": "kafka"},
        "dataSchema": {
            "dataSource": target_table,
            "timestampSpec": { "column": "time", "format": "iso" },
            "dimensionsSpec": {
                "dimensions": [
                    "username",
                    "post_title",
                    { "type": "long", "name": "views"},
                    { "type": "long", "name": "upvotes"},
                    { "type": "long", "name": "comments"},
                    "edited"
                ]
            },
            "granularitySpec": { "queryGranularity": "none",  "rollup": False, "segmentGranularity": "hour" }
        }
    }
}
supervisor = rest_client.post("/druid/indexer/v1/supervisor", json.dumps(kafka_ingestion_spec), headers=headers)
print(f'Start supervisor response: [{supervisor.status_code}]')
print(f'Waiting for tasks to start streaming...')
# wait for table creation and ingestion start
druid.sql.wait_until_ready(target_table, verify_load_status=False) 
start_time = datetime.now()
monitor_ingestion( target_table, total_row_count)  # wait for the 1000000 rows to be loaded
end_time = datetime.now()
print(f'Total load time = {(end_time-start_time).total_seconds()}')

My results:
```
[
  {
    "count": 1000000
  }
]
Total load time = 6.341758
```

I see some improvement, which means that some of the work is being parallelized, but it is likely that I/O is the long pole in the tent as I only have one drive. This would be different in a real cluster, but hey, 120,000+ msgs/second with 1 task and 157,000+ with 2 isn't bad for my local setup.
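
Those rates are just the row count divided by the load time. A quick sketch of the arithmetic, using the load times from my two runs above (your numbers will differ):

```python
def throughput(rows: int, seconds: float) -> float:
    """Average ingestion rate in rows per second."""
    return rows / seconds


total_rows = 1_000_000
# load times in seconds, copied from the results of the 1-task and 2-task runs
load_times = {1: 8.326123, 2: 6.341758}

for tasks, seconds in load_times.items():
    print(f"{tasks} task(s): {throughput(total_rows, seconds):,.0f} msgs/second")

# how much faster the 2-task run was than the 1-task run
speedup = load_times[1] / load_times[2]
print(f"Speedup from 1 to 2 tasks: {speedup:.2f}x")
```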

Let's stop this ingestion and try with 4:

In [None]:
stop_streaming_job('social_media_2task', True)


In [None]:
target_table = 'social_media_4task'
kafka_ingestion_spec = {
    "type": "kafka",
    "spec": { 
        "ioConfig": { "type": "kafka", "consumerProperties": { "bootstrap.servers": kafka_host },
            "topic": topic_name,  "taskCount": 4, "useEarliestOffset": True,
            "inputFormat": { "type": "json"} 
        },
        "tuningConfig": { "type": "kafka"},
        "dataSchema": {
            "dataSource": target_table,
            "timestampSpec": { "column": "time", "format": "iso" },
            "dimensionsSpec": {
                "dimensions": [
                    "username",
                    "post_title",
                    { "type": "long", "name": "views"},
                    { "type": "long", "name": "upvotes"},
                    { "type": "long", "name": "comments"},
                    "edited"
                ]
            },
            "granularitySpec": { "queryGranularity": "none",  "rollup": False, "segmentGranularity": "hour" }
        }
    }
}
supervisor = rest_client.post("/druid/indexer/v1/supervisor", json.dumps(kafka_ingestion_spec), headers=headers)
print(f'Start supervisor response: [{supervisor.status_code}]')
print(f'Waiting for tasks to start streaming...')
# wait for table creation and ingestion start
druid.sql.wait_until_ready(target_table, verify_load_status=False) 
start_time = datetime.now()
monitor_ingestion( target_table, total_row_count)  # wait for the 1000000 rows to be loaded
end_time = datetime.now()
print(f'Total load time = {(end_time-start_time).total_seconds()}')

In [None]:
stop_streaming_job('social_media_4task', True)

My result with 4 tasks:
```
[
  {
    "count": 1000000
  }
]
Total load time = 5.541399
```
While it got a little faster, the returns are clearly diminishing. Remember that testing performance with this laptop setup doesn't make much sense: all processes share my laptop's resources, so parallelism is limited and resource isolation is non-existent. As an example of this variability, in another run this ingestion turned out to be 1 second slower than the 2-task run.

We are focusing on the effect that the number of tasks has on segment production, so let's [look at the datasources](http://localhost:8888/unified-console.html#segments) in the Druid console. Notice that the number of segments produced matches the number of tasks that were running the ingestion.

![](assets/datasources-streaming-diif-tasks.png)
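
If you prefer to check this from the notebook instead of the console, one option is to query Druid's `sys.segments` metadata table. The sketch below only builds the query text; `segment_summary_sql` is a hypothetical helper, and the commented-out call assumes the `sql_client` created during initialization.

```python
def segment_summary_sql(table_prefix: str) -> str:
    """Build a query that counts segments and rows per datasource."""
    return f"""
SELECT "datasource", COUNT(*) AS "segments", SUM("num_rows") AS "total_rows"
FROM sys.segments
WHERE "datasource" LIKE '{table_prefix}%'
GROUP BY 1
ORDER BY 1
""".strip()


query = segment_summary_sql("social_media_")
print(query)

# In the notebook, run it with the SQL client from the initialization cell:
# print(json.dumps(sql_client.sql(query), indent=2))
```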

In general, streaming tasks generate at least one segment file per task duration period, and they can generate more. If the `__time` column of the data received spans multiple segment granularity time chunks, there will be at least one segment for each time chunk touched by the data. There can be more than one segment per time chunk if the number of rows received exceeds `maxRowsPerSegment`. So there are a few sources of data fragmentation in streaming ingestion that are inherent to its scalable, high-throughput design.
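
As a rough back-of-the-envelope sketch of those rules, the function below estimates a lower bound on the number of segments for a single task duration period. It assumes rows are spread evenly across tasks and time chunks, and it assumes a `maxRowsPerSegment` of 5 million; `estimate_min_segments` is a hypothetical helper, so check your own `tuningConfig` before relying on the numbers.

```python
import math


def estimate_min_segments(task_count: int,
                          time_chunks: int,
                          rows_per_task_per_chunk: int,
                          max_rows_per_segment: int = 5_000_000) -> int:
    """Rough lower bound on segments produced in one task duration period."""
    # each task writes at least one segment per time chunk it receives data for,
    # and needs additional segments once it exceeds max_rows_per_segment
    segments_per_task_chunk = max(
        1, math.ceil(rows_per_task_per_chunk / max_rows_per_segment)
    )
    return task_count * time_chunks * segments_per_task_chunk


# the three runs in this notebook: 1 million rows in a single hourly time chunk
total_rows = 1_000_000
for tasks in (1, 2, 4):
    print(tasks, estimate_min_segments(tasks, 1, total_rows // tasks))
```

For the three runs in this notebook the estimate equals the task count, which matches what the console shows.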

## Optimize segments through compaction

In a real cluster, each streaming task runs on independent resources, so increasing the number of tasks increases the overall ingestion throughput. But keeping the task count to the minimum needed is better in terms of the number of segments generated. So there's a balance between the throughput that you need in order to reduce or eliminate lag and the number of segments generated.

Notice that the `Shard Type` column shows "Numbered". This means that the segment partitions within a segment granularity time chunk (one hour in our case) are not split in any particular way: each task consumed from its assigned Kafka partitions and built one or more segments from the data it received.

The [partitioning-data](05-partitioning-data.ipynb) notebook talks about ideal segment sizes and secondary partitioning (also called clustering) which reorganizes the data into segments that are more efficient at query time. Streaming ingestion does not create ideal segments. In order to optimize segments that have been ingested through streaming ingestion, it is a best practice to use [auto compaction](https://druid.apache.org/docs/latest/design/coordinator#automatic-compaction). 
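
For reference, an auto compaction configuration for one of these tables might look roughly like the sketch below. The values are illustrative assumptions, not recommendations, and the walkthrough itself continues with manual compaction. The commented-out call assumes the `rest_client` from the initialization cell and the Coordinator's compaction configuration endpoint.

```python
import json

# illustrative auto compaction config; all values are assumptions for this sketch
auto_compaction_config = {
    "dataSource": "social_media_4task",
    # skip the most recent hour so compaction does not compete with
    # streaming tasks that may still be writing to it
    "skipOffsetFromLatest": "PT1H",
    "granularitySpec": {"segmentGranularity": "hour"},
}

payload = json.dumps(auto_compaction_config)
print(json.dumps(auto_compaction_config, indent=2))

# To submit it from the notebook:
# rest_client.post("/druid/coordinator/v1/config/compaction", payload,
#                  headers={'Content-Type': 'application/json'})
```

An auto compaction config can also carry a `tuningConfig`, which is where a `partitionsSpec` for secondary partitioning would go.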

For now, let's do a [manual compaction](https://druid.apache.org/docs/latest/data-management/compaction#setting-up-manual-compaction) to demonstrate its effects and then we'll review the setup for auto compaction. 

In [None]:
target_table = "social_media_4task"
gen_start_time = gen_now.strftime("%Y-%m-%dT%H:%M:%S")
gen_end_time = (gen_now + timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S")

interval = gen_start_time + '/' + gen_end_time
compaction_task = {
  "type": "compact",
  "dataSource": target_table,
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": interval
    }
  },
  "granularitySpec": {
    "segmentGranularity": "hour"
  }
}
compaction_response = rest_client.post("/druid/indexer/v1/task", json.dumps(compaction_task), headers=headers)
print(f'Start compaction response: [{compaction_response}]')


[Look at the segments](http://localhost:8888/unified-console.html#segments/datasource=social_media_4task) for `social_media_4task` now (it might take a few seconds to show the change). You'll see that the 4 segments have been compacted into one. 

Compaction can also apply secondary partitioning to reorganize the data within each time chunk for better segment pruning at query time. 

Since the table only has 1 million records and the default `targetRowsPerSegment` during compaction is 5 million, you'll need to lower that value to see the effects of secondary partitioning with this table.

Here's the compaction task. Notice that the `partitionsSpec` in `tuningConfig` specifies `range` partitioning with `username` as the partitioning column. It also sets `targetRowsPerSegment` to 250,000 so that we'll see the partitioning. This is just for illustration; keeping it at 5 million or so is a good initial target in a real scenario.

In [None]:
compaction_task = {
  "type": "compact",
  "dataSource": target_table,
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": interval
    }
  }, 
  "tuningConfig": {
    "type":"compaction",
    "forceGuaranteedRollup": True,
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": [
        "username"
      ],
      "targetRowsPerSegment": 250000
    }
  } 
}


compaction_response = rest_client.post("/druid/indexer/v1/task", json.dumps(compaction_task), headers=headers)
print(f'Start compaction response: [{compaction_response}]')


[Look at the segments](http://localhost:8888/unified-console.html#segments/datasource=social_media_4task) again. You'll see that it is back to 4 segments, but this time they have more information in the Shard Spec column. The compaction process sorted the data by `username` and split it up into ranges of the column's values, attempting to keep each segment file as close as possible to the 250k rows we requested. Given the low number of user names in the data and their slight skew, we ended up with uneven segment sizes. Try to use columns with enough cardinality to avoid this issue, or add other columns to the partitioning spec to increase cardinality.

The columns used for secondary partitioning should be the most common filter criteria in your queries, other than time filters. This helps query performance because the Broker prunes the list of segments needed and only submits the query to the Historicals that hold the relevant segments.

![](assets/segments-range-partitioned.png)
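
To see why a skewed, low-cardinality column yields uneven segments, and how value ranges enable pruning, here is a simplified simulation. It is a greedy stand-in and not Druid's actual partitioning algorithm; the user names, row counts, and both helper functions are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

Range = Tuple[str, str, int]  # (first value, last value, row count)


def range_partition(value_counts: Counter, target_rows: int) -> List[Range]:
    """Greedily split sorted values into ranges of roughly target_rows rows."""
    ranges: List[Range] = []
    first = last = None
    rows = 0
    for value in sorted(value_counts):
        count = value_counts[value]
        # rows sharing a value stay together, so close the current range
        # when adding this value would overshoot the target
        if rows > 0 and rows + count > target_rows:
            ranges.append((first, last, rows))
            first, rows = None, 0
        if first is None:
            first = value
        last = value
        rows += count
    if rows > 0:
        ranges.append((first, last, rows))
    return ranges


def prune(ranges: List[Range], username: str) -> List[Range]:
    """Keep only the ranges that could contain the given username."""
    return [r for r in ranges if r[0] <= username <= r[1]]


# made-up, skewed distribution of 1 million rows over 6 user names
counts = Counter({"alice": 300_000, "bob": 100_000, "carol": 120_000,
                  "dave": 200_000, "erin": 80_000, "frank": 200_000})
segments = range_partition(counts, 250_000)
for segment in segments:
    print(segment)
print("segments to scan for carol:", prune(segments, "carol"))
```

In this toy data a single user already exceeds the target, so that range cannot be split further, while a filter on one user name only needs the one range that covers it.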

## Clean up

Run the following cell to remove the tables created by this notebook from Druid and the topic from Kafka.

In [None]:
# drop tables

drop_table("social_media_1task")
drop_table("social_media_2task")
drop_table("social_media_4task")

# remove topic from Kafka
try:
    admin_client.delete_topics(topics=[topic_name])
    print("Topic Deleted Successfully")
except  Exception as e:
    print(e)



## Summary

* You learned that
  * Streaming ingestion increases throughput by using more tasks
  * More tasks mean more segments being produced
  * You need to find the right balance between throughput and segment count for your use case
  * Segments are not optimally organized for querying after streaming ingestion
  * Compaction should follow every streaming ingestion
  * Compaction has the same partitioning and clustering capabilities as batch ingestion
  
## Learn more

* If you haven't already, review the [05-partitioning-data.ipynb](05-partitioning-data.ipynb) notebook which takes a look at how to address partitioning and clustering for batch ingestion.
* Read about:
  * [Streaming Ingestion]()
  * [Compaction and Auto Compaction]()