# Ingest, query, and visualize streaming data

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This tutorial introduces you to streaming ingestion in Apache Druid using the Apache Kafka event streaming platform.
Follow along to learn how to create and load data into a Kafka topic, start ingesting data from the topic into Druid, and query results over time. This tutorial assumes you have a basic understanding of Druid ingestion, querying, and API requests.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and the learning environment.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import druidapi
import os
import time

# Fall back to localhost when DRUID_HOST is not set.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status
druid_headers = {'Content-Type': 'application/json'}

status_client.version

Run the next cell to set the address of the Apache Kafka broker.

In [None]:
# Kafka bootstrap servers use the host:port form, so no URL scheme is added.
kafka_host = f"{os.environ.get('KAFKA_HOST', 'localhost')}:9092"
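
The host lookups above can also be written with `os.environ.get` and a default value. The following is a minimal sketch, not part of the notebook: `service_address` and its parameters are hypothetical names, and the `environ` argument exists only so that the function can be tried without changing real environment variables.

```python
import os

def service_address(env_var, port, scheme="", default_host="localhost", environ=None):
    """Build 'host:port', optionally prefixed with a scheme, from an environment variable."""
    environ = os.environ if environ is None else environ
    host = environ.get(env_var, default_host)  # fall back when the variable is unset
    prefix = f"{scheme}://" if scheme else ""
    return f"{prefix}{host}:{port}"

# The Druid router is reached over HTTP, so its address carries a scheme.
druid_address = service_address("DRUID_HOST", 8888, scheme="http")
# Kafka bootstrap servers are conventionally given as host:port.
kafka_address = service_address("KAFKA_HOST", 9092)
print(druid_address, kafka_address)
```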

Finally, run this cell to prepare the connection to the data generator and to set up variables used in later cells.

In [None]:
import requests
import json

datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}
datagen_job = "example-social-quickstart"
datagen_topic = datagen_job

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic },
    "config_file": "social/social_posts.json",
    "concurrency": 100
}
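
The request body is a plain Python dictionary that is serialized to JSON before it is sent. As a sketch, the hypothetical helper `build_datagen_request` below assembles the same structure from its inputs and prints the resulting JSON. The field names come from the cell above; the helper itself and the `localhost:9092` endpoint are assumptions for illustration.

```python
import json

def build_datagen_request(job_name, kafka_endpoint, topic, config_file, concurrency):
    """Assemble a data generator job request shaped like the one in this notebook."""
    return {
        "name": job_name,
        "target": {"type": "kafka", "endpoint": kafka_endpoint, "topic": topic},
        "config_file": config_file,
        "concurrency": concurrency,
    }

request_body = build_datagen_request(
    job_name="example-social-quickstart",
    kafka_endpoint="localhost:9092",  # assumed endpoint, for illustration only
    topic="example-social-quickstart",
    config_file="social/social_posts.json",
    concurrency=100,
)
payload = json.dumps(request_body, indent=2)
print(payload)
```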

## Load example data

This section uses the data generator included in the Docker application to stream messages into an Apache Kafka topic named `example-social-quickstart`. You then set up an ongoing ingestion into Druid.

### Use the data generator to produce sample data

Run the following cell to initialize data generation.

In [None]:
requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)

### Create a specification for the ingestion

Ingest data from an Apache Kafka topic into Apache Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

Run the next cell to set up an object that holds the first part of the supervisor spec, the [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig). Once you submit the complete spec, Druid starts an ingestion supervisor that creates and manages the ingestion tasks running on the cluster.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": True }

The configuration tells the supervisor:

* What type of ingestion tasks to create and manage - in this case, `kafka` consumers.
* How each of the tasks will connect to the input source - here, you set the bootstrap servers for each consumer using the Kafka host set earlier.
* What parser the tasks need to use when reading the data - the data generator is sending `json` data so this is set in the `inputFormat`.
* Whether the supervisor will instruct its consumers to read from the beginning of the stream or the end - in this notebook, you start at the beginning by using the earliest offset.
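
To make these four settings concrete, the sketch below builds an equivalent `ioConfig` with a hypothetical helper, `build_io_config`, and prints its JSON form. The keys match the cell above. The helper name, the example host, and the use of a Python boolean for `useEarliestOffset` (so that `json.dumps` emits a JSON `true`) are choices made for this illustration.

```python
import json

def build_io_config(bootstrap_servers, topic, use_earliest_offset=True):
    """Return a Kafka ioConfig dictionary with the four settings described above."""
    return {
        "type": "kafka",                                                 # kind of ingestion task
        "consumerProperties": {"bootstrap.servers": bootstrap_servers},  # how to connect
        "topic": topic,
        "inputFormat": {"type": "json"},                                 # how to parse messages
        "useEarliestOffset": use_earliest_offset,                        # where to start reading
    }

io_config = build_io_config("localhost:9092", "example-social-quickstart")
serialized = json.dumps(io_config)
print(serialized)
```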

The next part of the ingestion specification, [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig), contains tuning parameters. This notebook only needs to set the type. Run the next cell.

In [None]:
tuningConfig = { "type": "kafka" }

The final part, [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema), tells the supervisor what table to put the data into and what each task should do with the raw data, including how to partition it when it's being optimized and stored. Run the next cell to set up an object to hold this configuration, which includes:

* What table to put the data into - this notebook uses the same name as the topic that the data generator pushes data to.
* The primary timestamp and the format that it is in.
* How to partition the data using the timestamp - this is set to hourly partitions.
* What dimensions to ingest - this is set to automatic schema discovery.

In [None]:
target_table = datagen_topic

dataSchema = {
    "dataSource": target_table,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": False, "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery": True }
    }
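
An hourly `segmentGranularity` means that each row is grouped with other rows whose timestamps fall in the same hour. The sketch below only illustrates that bucketing idea with the standard library. It is not how Druid implements segments, `hourly_interval` is a hypothetical name, and the timestamp is a made-up example that is assumed to carry a UTC offset.

```python
from datetime import datetime, timedelta, timezone

def hourly_interval(iso_timestamp):
    """Return the (start, end) of the UTC hour that contains an ISO 8601 timestamp."""
    # Replace a trailing 'Z' so that older Python versions can parse the value.
    parsed = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    parsed = parsed.astimezone(timezone.utc)
    start = parsed.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
    return start, start + timedelta(hours=1)

start, end = hourly_interval("2024-01-01T10:23:45Z")
print(start.isoformat(), end.isoformat())
```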

Run the following cell to assemble the supervisor spec from its constituent parts.

In [None]:
ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}
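
Before submitting a spec, it can help to confirm that all the parts assembled in this notebook are present. The following is a sketch of such a check. `missing_parts` is a hypothetical helper that looks only for the three keys used in this tutorial, and it does not replace the validation that Druid performs when it receives the spec.

```python
# The three parts that this notebook assembles into the supervisor spec.
EXPECTED_PARTS = ("ioConfig", "tuningConfig", "dataSchema")

def missing_parts(supervisor_spec):
    """Return the names of the expected parts that are absent from spec['spec']."""
    inner = supervisor_spec.get("spec", {})
    return [part for part in EXPECTED_PARTS if part not in inner]

# A deliberately incomplete example spec, for illustration only.
example_spec = {"type": "kafka", "spec": {"ioConfig": {}, "dataSchema": {}}}
print(missing_parts(example_spec))
```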

### Start the ingestion in Apache Druid

Send the spec to Apache Druid to start the supervisor. The supervisor will launch consumer tasks to read from the topic.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
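
This notebook calls several supervisor endpoints that share one base path: the bare path to submit a spec, and the `suspend`, `reset`, and `terminate` actions during cleanup. As a sketch, the hypothetical helper `supervisor_url` below builds those URLs in one place. The paths are the ones used in this notebook; the helper name and the example base URL are assumptions.

```python
SUPERVISOR_PATH = "/druid/indexer/v1/supervisor"

def supervisor_url(base_url, supervisor_id=None, action=None):
    """Build a supervisor API URL, optionally for one supervisor and one action."""
    url = base_url.rstrip("/") + SUPERVISOR_PATH
    if supervisor_id is not None:
        url += f"/{supervisor_id}"
        if action is not None:
            url += f"/{action}"
    return url

base = "http://localhost:8888"  # assumed router address, for illustration only
print(supervisor_url(base))
print(supervisor_url(base, "example-social-quickstart", "suspend"))
```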

Run the following cell to wait until ingestion has started and the new table is ready to query.

In [None]:
druid.sql.wait_until_ready(target_table, verify_load_status=False)
print("Ready to go!")

## Query the table and visualize the results

There are two mechanisms for querying table data:

1. The [Druid interactive SQL API](https://druid.apache.org/docs/latest/api-reference/sql-api).
2. The [Druid asynchronous SQL API](https://druid.apache.org/docs/latest/api-reference/sql-ingestion-api) (experimental).

In this section, you will use the interactive API to visualize query results using the Matplotlib and Seaborn visualization libraries. Run the following cell to import these packages.

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Run a simple query to view a subset of rows.

In [None]:
sql = f'''SELECT * FROM "{target_table}" LIMIT 5'''
display.sql(sql)

In this social media scenario, each incoming event represents a post on social media, for which you collect the timestamp, username, and post metadata. You are interested in analyzing the total number of upvotes for all posts, compared between users.

Run the next cell to execute the query, store the results (`response`), and display them.

In [None]:
sql = f'''
SELECT
  COUNT(post_title) as num_posts,
  SUM(upvotes) as total_upvotes,
  username
FROM "{target_table}"
GROUP BY username
ORDER BY num_posts
'''

response = sql_client.sql_query(sql)
response.show()
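
To see what the aggregation computes, the sketch below reproduces the same grouping in plain Python. The rows are made up for illustration and are not output from the data generator, and `summarize` is a hypothetical helper. Like `COUNT(post_title)` in SQL, it counts only rows where `post_title` is not null.

```python
from collections import defaultdict

# Made-up rows, for illustration only.
rows = [
    {"username": "alice", "post_title": "first post", "upvotes": 3},
    {"username": "bob", "post_title": "hello", "upvotes": 10},
    {"username": "alice", "post_title": "second post", "upvotes": 5},
]

def summarize(rows):
    """Group rows by username, count posts, sum upvotes, and order by post count."""
    totals = defaultdict(lambda: {"num_posts": 0, "total_upvotes": 0})
    for row in rows:
        entry = totals[row["username"]]
        if row["post_title"] is not None:  # COUNT(column) skips null values
            entry["num_posts"] += 1
        entry["total_upvotes"] += row["upvotes"]
    result = [{"username": user, **values} for user, values in totals.items()]
    return sorted(result, key=lambda r: r["num_posts"])  # ORDER BY num_posts

for line in summarize(rows):
    print(line)
```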

Visualize the total number of upvotes per user using a line plot.

The next cell stores the results in a Pandas dataframe, sorts them by username, and plots the dataframe. Note that the order of users may vary as new results arrive.

In [None]:
df = pd.DataFrame(response.json)
df = df.sort_values('username')

df.plot(x='username', y='total_upvotes', marker='o')
plt.xticks(rotation=45, ha='right')
plt.ylabel("Total number of upvotes")
plt.gca().get_legend().remove()
plt.show()

The total number of upvotes likely depends on the total number of posts created per user. To better assess the relative impact per user, compare the total number of upvotes (line plot) with the total number of posts (bar plot).

In [None]:
matplotlib.rc_file_defaults()
sns.set_style(style=None, rc=None)

fig, ax1 = plt.subplots()
plt.xticks(rotation=45, ha='right')


sns.lineplot(
    data=df, x='username', y='total_upvotes',
    marker='o', ax=ax1, label="Sum of upvotes")
ax1.get_legend().remove()

ax2 = ax1.twinx()
sns.barplot(data=df, x='username', y='num_posts',
            order=df['username'], alpha=0.5, ax=ax2, log=True,
            color="orange", label="Number of posts")


# ask matplotlib for the plotted objects and their labels
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, bbox_to_anchor=(1.55, 1))

You should see a correlation between total number of upvotes and total number of posts. In order to track user impact on a more equal footing, normalize the total number of upvotes relative to the total number of posts, and plot the result:

In [None]:
df['upvotes_normalized'] = df['total_upvotes']/df['num_posts']

df.plot(x='username', y='upvotes_normalized', marker='o', color='green')
plt.xticks(rotation=45, ha='right')
plt.ylabel("Number of upvotes (normalized)")
plt.gca().get_legend().remove()
plt.show()

So far, you've worked with a single snapshot of the data, taken when you ran the last query. Run the same query again and store the output in `response2`, which you will compare with the previous results:

In [None]:
response2 = sql_client.sql_query(sql)
response2.show()

Normalizing the data also helps you evaluate trends over time more consistently on the same plot axes. Plot the normalized data again, this time alongside the results from the previous snapshot:

In [None]:
df2 = pd.DataFrame(response2.json)
df2 = df2.sort_values('username')
df2['upvotes_normalized'] = df2['total_upvotes']/df2['num_posts']

ax = df.plot(x='username', y='upvotes_normalized', marker='o', color='green', label="Time 1")
df2.plot(x='username', y='upvotes_normalized', marker='o', color='purple', ax=ax, label="Time 2")
plt.xticks(rotation=45, ha='right')
plt.ylabel("Number of upvotes (normalized)")
plt.show()
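
The comparison in the plot can also be expressed numerically. The sketch below normalizes two snapshots as upvotes per post and reports the change for each user that appears in both. The snapshot values are made up for illustration, and `normalized_upvotes` and `snapshot_change` are hypothetical helpers.

```python
def normalized_upvotes(rows):
    """Map each username to upvotes per post, skipping users with no posts."""
    return {
        row["username"]: row["total_upvotes"] / row["num_posts"]
        for row in rows
        if row["num_posts"] > 0
    }

def snapshot_change(first, second):
    """Return the change in normalized upvotes for users present in both snapshots."""
    before = normalized_upvotes(first)
    after = normalized_upvotes(second)
    return {user: after[user] - before[user] for user in sorted(before.keys() & after.keys())}

# Made-up snapshots, for illustration only.
snapshot_1 = [
    {"username": "alice", "num_posts": 2, "total_upvotes": 8},
    {"username": "bob", "num_posts": 1, "total_upvotes": 10},
]
snapshot_2 = [
    {"username": "alice", "num_posts": 4, "total_upvotes": 20},
    {"username": "bob", "num_posts": 2, "total_upvotes": 14},
]
print(snapshot_change(snapshot_1, snapshot_2))
```

A positive value means that the user's average upvotes per post grew between the two snapshots, and a negative value means that it declined.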

This plot shows how some users maintain relatively consistent social media impact between the two query snapshots, whereas other users grow or decline in their influence.

## Cleanup

The following cell stops data generation, shuts down the ingestion tasks and supervisor, and removes the datasource from Druid.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print('Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=target_table)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=target_table)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(target_table)}]")
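
The task shutdown loop above polls until no running tasks remain. A variation that pauses between polls and gives up after a timeout is sketched below. `wait_until_empty` is a hypothetical helper, and the `fetch_items` callable stands in for a call such as the task listing used above. The example uses a canned sequence instead of a real Druid connection.

```python
import time

def wait_until_empty(fetch_items, interval=1.0, timeout=60.0):
    """Poll fetch_items() until it returns an empty result or the timeout passes.

    Returns True if the result became empty, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not fetch_items():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # pause so the API is not polled in a tight loop

# Stand-in for a task listing: two tasks, then one, then none.
remaining = [["task-a", "task-b"], ["task-b"], []]
finished = wait_until_empty(lambda: remaining.pop(0), interval=0.01, timeout=5.0)
print(finished)
```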

## Learn more

This tutorial showed you how to send a simulated stream of data to a Kafka topic using a data generator, ingest the stream into Druid with a supervisor, and query and visualize results over time. For more information, see the following resources:

* [Apache Kafka ingestion](https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html)
* [Querying data](https://druid.apache.org/docs/latest/tutorials/tutorial-query.html)
* [Tutorial: Run with Docker](https://druid.apache.org/docs/latest/tutorials/docker.html)