# Streaming ingestion from multiple topics
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This notebook demonstrates a feature introduced in Druid 28.0.0 that lets a single streaming ingestion job consume messages from multiple Kafka topics. For details, see the [documentation for this feature](https://druid.apache.org/docs/latest/development/extensions-core/kafka-supervisor-reference#ingesting-from-multiple-topics).

In this notebook you will:
- Create a topic and initiate a data feed on it
- Create a multi-topic ingestion job that uses `topicPattern` to find new topics dynamically
- Create an additional topic that fits the same topic name pattern 
- Query multiple topic data

## Prerequisites

This tutorial works with Druid 28.0.0 or later.

### Run with Docker

<!-- Profiles are:
`druid-jupyter` - just Jupyter and Druid
`all-services` - includes Jupyter, Druid, and Kafka
 -->

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import json
import time

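# Use the hosts from the environment when set (as in the Docker Compose setup),
# otherwise fall back to localhost.
if 'DRUID_HOST' not in os.environ.keys():
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

# Kafka endpoints are host:port pairs, not URLs, so there is no http:// prefix.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"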

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status
rest_client = druid.rest

# client for Data Generator API
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")


status_client.version


## Create first Kafka topic and feed it

In [None]:
headers = {
  'Content-Type': 'application/json'
}

datagen_request = {
    "name": "gen_social_twitter",
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": "social-twitter" },
    "config_file": "social/social_posts.json", 
    "time":"1h",
    "concurrency":100,
    "time_type":"REAL"
}
datagen.post("/start", json.dumps(datagen_request), headers=headers)


Check the status of the job with the following cell:

In [None]:
time.sleep(1) # avoid race between start of the job and its status being available
response = datagen.get('/status/gen_social_twitter')
response.json()
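
The request body sent to the data generator has the same shape for every topic in this notebook, so it can be built by a small function. The following is a minimal sketch: `make_datagen_request` is a hypothetical helper, not part of the data generator or `druidapi`, and the `kafka:9092` endpoint is only an example value.

```python
import json


def make_datagen_request(name, topic, kafka_endpoint, concurrency=100, duration="1h"):
    """Build the body for the data generator's /start endpoint (hypothetical helper)."""
    return {
        "name": name,
        "target": {"type": "kafka", "endpoint": kafka_endpoint, "topic": topic},
        "config_file": "social/social_posts.json",
        "time": duration,
        "concurrency": concurrency,
        "time_type": "REAL",
    }


# Example endpoint; in the notebook this would be the kafka_host variable.
request = make_datagen_request("gen_social_twitter", "social-twitter", "kafka:9092")
print(json.dumps(request, indent=2))
```

In the notebook, the resulting dictionary would be passed to `datagen.post("/start", json.dumps(request), headers=headers)` exactly as in the cell above.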

## Create multi-topic streaming ingestion job

A multi-topic streaming spec is almost identical to a single-topic ingestion spec. The only difference is:
- single-topic ingestion uses the `"topic"` property to specify the topic to read from
- multi-topic ingestion uses the `"topicPattern"` property instead, with a regular expression that is matched against the topics available in Kafka

The following spec uses `"topicPattern": "social-.*"`, which matches any topic whose name begins with `social-`. If you prefer a specific list of topics, use regex alternation, for example `"social-twitter|social-linkedin"`.
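
Before submitting a spec, it can help to check which topic names a pattern would select. The sketch below uses Python's `re` module as a stand-in for the matching that Druid performs. It assumes the pattern must match the whole topic name and that the pattern avoids syntax where Python and Java regular expressions differ. The `matching_topics` helper and the list of topic names are illustrative only.

```python
import re


def matching_topics(topic_pattern, topics):
    """Return the topics whose full name matches the pattern (assumed full-match semantics)."""
    compiled = re.compile(topic_pattern)
    return [t for t in topics if compiled.fullmatch(t)]


# Hypothetical list of topics available in Kafka.
available = ["social-twitter", "social-linkedin", "social_archive", "orders"]

# Prefix pattern: selects every topic that starts with "social-".
print(matching_topics("social-.*", available))

# Explicit alternation: selects only the listed topics.
print(matching_topics("social-twitter|social-linkedin", available))

# Note that a hyphen and an underscore are different characters to the regex.
print(matching_topics("social_.*", available))
```

Both of the first two calls select `social-twitter` and `social-linkedin`, while the underscore pattern selects only `social_archive`, which shows why the separator in the pattern has to agree with the topic names.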

The ingestion job automatically adds a column called `kafka.topic`, which is available as a dimension and is captured by schema discovery. Schema discovery is enabled in the following spec:


In [None]:
kafka_ingestion_spec = {
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" },
        
      "topicPattern": "social-.*",   # this is new 
        
      "inputFormat": { "type": "kafka","valueFormat": { "type": "json" } }
    },
    "tuningConfig": {
      "type": "kafka"
    },
    "dataSchema": {
      "dataSource": "example-social-media",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [ ],
        "useSchemaDiscovery": True
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": False,
        "segmentGranularity": "hour"
      }
    }
  }
}

Send the spec to Druid to start the streaming ingestion from Kafka:

In [None]:
headers = {
  'Content-Type': 'application/json'
}

supervisor = rest_client.post("/druid/indexer/v1/supervisor", json.dumps(kafka_ingestion_spec), headers=headers)
print(supervisor.status_code)

A `200` response indicates that the request was successful. You can view the running ingestion task and the new datasource in the web console's [ingestion view](http://localhost:8888/unified-console.html#ingestion).

The following cell pauses further execution until the ingestion has started and the datasource is available for querying:

In [None]:
druid.sql.wait_until_ready('example-social-media', verify_load_status=False)

You can see the supervisor job in the [Druid Console Supervisor View](http://localhost:8888/unified-console.html#supervisors). Click the magnifying glass icon to view the status of the job. The `startingOffsets` property lists the Kafka partitions that the job is consuming from. Each partition is identified as `"<topic-name>:<partition #>"`, so the list covers all partitions from all topics discovered so far. The following sample output is illustrative; the datasource name and task ID in your environment will differ:
```
{
  "dataSource": "social_media",
  "stream": "social-.*",
  "partitions": 1,
  "replicas": 1,
  "durationSeconds": 3600,
  "activeTasks": [
    {
      "id": "index_kafka_social_media_77e5722b1640edd_cnnmjgpc",
      "startingOffsets": {
        "social-twitter:0": 0   <<<<<<<<<<<<<<<<<<< SEE DISCOVERED TOPICS HERE
      },
      "startTime": "2023-11-03T20:52:41.173Z",
      ...
}
```
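
The `"<topic-name>:<partition #>"` keys can also be read programmatically. The following sketch extracts the distinct topic names from a status payload shaped like the sample above. The `discovered_topics` helper is hypothetical, and the `sample` dictionary is hand-written; its `social-linkedin:0` entry is an assumption about how a later-discovered topic would appear.

```python
def discovered_topics(status_payload):
    """Collect topic names from the "<topic>:<partition>" keys of each active task."""
    topics = set()
    for task in status_payload.get("activeTasks", []):
        for key in task.get("startingOffsets", {}):
            # Split on the last colon so the partition number is dropped.
            topic, _, _partition = key.rpartition(":")
            topics.add(topic)
    return sorted(topics)


# Hand-written sample, not real supervisor output.
sample = {
    "stream": "social-.*",
    "activeTasks": [
        {"startingOffsets": {"social-twitter:0": 0, "social-linkedin:0": 0}}
    ],
}
print(discovered_topics(sample))
```

Splitting on the last colon rather than the first keeps the helper correct even if a topic name were to contain a colon itself.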

The following query shows the last few minutes of activity from the topics captured so far. At this point, the data only includes rows where `kafka.topic` is `social-twitter`:

In [None]:
sql = '''
    SELECT TIME_FLOOR("__time", 'PT1M') as "minute", 
       "kafka.topic",   SUM(views) as "total_views" 
    FROM "example-social-media" 
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
    GROUP BY 1,2
    ORDER BY 1 DESC, 3 DESC
    LIMIT 5
'''
display.sql(sql)

## Add another topic that matches the pattern

The supervisor job that controls the streaming ingestion will periodically check for new topics that match the `topicPattern` and automatically begin consuming from them when they appear.
The following cell initiates a second topic called `social-linkedin` and begins streaming data to it:

In [None]:
headers = {
  'Content-Type': 'application/json'
}

datagen_request = {
    "name": "gen_social_linkedin",
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": "social-linkedin" },
    "config_file": "social/social_posts.json", 
    "time":"1h",
    "concurrency":500,
    "time_type":"REAL"
}
datagen.post("/start", json.dumps(datagen_request), headers=headers)
time.sleep(1) # avoid race between start of the job and its status being available
response = datagen.get('/status/gen_social_linkedin')
response.json()

List all data generator jobs to confirm that both are running:

In [None]:
datagen.get_json('/jobs')

## Query the multi-topic data
Run the following query a few times. Initially it only shows `social-twitter` activity. Once the supervisor picks up the new topic, `social-linkedin` appears in the results. This can take a couple of minutes.

In [None]:
sql = '''
    SELECT TIME_FLOOR("__time", 'PT1M') as "minute", 
       "kafka.topic",   SUM(views) as "total_views" 
    FROM "example-social-media" 
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
    GROUP BY 1,2
    ORDER BY 1 DESC, 3 DESC
    LIMIT 5
'''
display.sql(sql)
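
Instead of re-running the query by hand, the wait for the new topic can be automated with a polling loop. The following is a minimal sketch: `wait_for_topic` is a hypothetical helper, and `fake_fetch` is a stand-in for a real query so that the example runs without a Druid connection.

```python
import time


def wait_for_topic(fetch_topics, topic, timeout_s=180, interval_s=10, sleep=time.sleep):
    """Call fetch_topics() until it reports `topic` or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if topic in fetch_topics():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Stand-in for a real query: reports the new topic on the third call.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"social-twitter"}
    return {"social-twitter", "social-linkedin"}


found = wait_for_topic(fake_fetch, "social-linkedin", timeout_s=5, interval_s=0)
print(found, calls["n"])
```

In the notebook, `fetch_topics` could be a function that runs `SELECT DISTINCT "kafka.topic" FROM "example-social-media"` and returns the values as a set. How the rows are retrieved depends on the SQL client, so that part is left as an assumption here.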

## Cleanup 
The following cell stops the data generators, stops the ingestion job, and removes the datasource from Druid.

In [None]:
print(f"Stop streaming generator: [{datagen.post('/stop/gen_social_linkedin','',require_ok=False)}]")
print(f"Stop streaming generator: [{datagen.post('/stop/gen_social_twitter','',require_ok=False)}]")

print(f'Pause streaming ingestion: [{druid.rest.post("/druid/indexer/v1/supervisor/example-social-media/suspend","", require_ok=False)}]')
print('Shutting down running tasks ...')
tasks = druid.tasks.tasks(state='running', table='example-social-media')
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    time.sleep(1) # give the tasks a moment to stop before checking again
    tasks = druid.tasks.tasks(state='running', table='example-social-media')
print(f'Reset offsets for re-runnability: [{druid.rest.post("/druid/indexer/v1/supervisor/example-social-media/reset","", require_ok=False)}]')
print(f'Terminate streaming ingestion: [{druid.rest.post("/druid/indexer/v1/supervisor/example-social-media/terminate","", require_ok=False)}]')

print(f"Drop datasource: [{druid.datasources.drop('example-social-media')}]")


## Learn more

This tutorial showed you how to send simulated streams of data to multiple Kafka topics using a data generator, ingest from all of them with a single supervisor that uses `topicPattern`, and query the combined data by `kafka.topic`. For more information, see the following resources:

* [Apache Kafka ingestion](https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html)
* [Querying data](https://druid.apache.org/docs/latest/tutorials/tutorial-query.html)
* [Tutorial: Run with Docker](https://druid.apache.org/docs/latest/tutorials/docker.html)