# Use the data generator to send sample data to Apache Kafka

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
This notebook walks through sending sample data from the [Imply Data Generator](https://github.com/implydata/druid-datagenerator) directly to a topic in Apache Kafka. While the generator is running, Druid can read from the topic directly, so you can use the generated data in streaming ingestion.

## Prerequisites

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

Run the next cell to set up the connection to the data generator and to Apache Kafka.

In [None]:
import requests
import json
import os
import kafka

datagenUrl = "http://datagen:9999"
datagenHeaders = {'Content-Type': 'application/json'}

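# os.environ.get returns None when the variable is unset,
# whereas os.environ['KAFKA_HOST'] would raise a KeyError.
if os.environ.get('KAFKA_HOST') is None:
    kafka_host = "kafka:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"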
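
The host lookup in the cell above can also be written as a small helper, which makes the fallback behavior easy to check in isolation. The following is a minimal sketch: `resolve_kafka_host` and its parameters are hypothetical names, not part of the notebook or of any library, and it assumes the broker listens on port 9092 as in the cell above.

```python
import os


def resolve_kafka_host(env=None, default_host="kafka", port=9092):
    """Return "host:port" for the Kafka broker (hypothetical helper).

    Reads KAFKA_HOST from the given mapping (os.environ by default) and
    falls back to default_host when the variable is unset or empty.
    """
    env = os.environ if env is None else env
    host = env.get("KAFKA_HOST") or default_host
    return f"{host}:{port}"


# Passing a plain dict stands in for the process environment.
print(resolve_kafka_host({}))
print(resolve_kafka_host({"KAFKA_HOST": "broker.example"}))
```

Accepting the environment as a parameter is a design choice for this sketch only; it lets you try both branches without changing real environment variables.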

## Generate simulated data

Run the following cell to create the JSON object that you pass to the `start` endpoint to begin generating data.

In [None]:
job_name="example_clickstream"

target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": job_name
}

datagen_request = {
    "name": job_name,
    "target": target,
    "config_file": "clickstream/clickstream.json",
    "time": "1h",
    "concurrency":10,
    "time_type": "REAL"
}

display(datagen_request)

The JSON object contains:

* `name`: a job identifier.
* `target`: Kafka connection details and the topic to write to.
* `config_file`: the type of simulation to run.
* `time_type`: the timestamps to use.
* `time`: period of time to run the simulation for.
* `concurrency`: the number of state machines running the simulation.

Because `time_type` is `REAL`, each row is stamped with the current time when it is generated.

Within `target` you only provide the `endpoint` and `topic`, but [other options are available](https://github.com/implydata/druid-datagenerator#target-object).
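
To see how the fields listed above fit together, the request can be assembled by a small function. This is a sketch only: `build_datagen_request` is a hypothetical helper, and the check that `concurrency` is at least 1 is an assumption made for illustration rather than a documented rule of the data generator.

```python
import json


def build_datagen_request(job_name, kafka_endpoint, config_file,
                          run_time="1h", concurrency=10, time_type="REAL"):
    """Assemble the request body for the start endpoint (hypothetical helper)."""
    # Assumed sanity check; the generator's own validation may differ.
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return {
        "name": job_name,
        # The topic is named after the job, as in the cell above.
        "target": {"type": "kafka", "endpoint": kafka_endpoint, "topic": job_name},
        "config_file": config_file,
        "time": run_time,
        "concurrency": concurrency,
        "time_type": time_type,
    }


request_body = build_datagen_request(
    "example_clickstream", "kafka:9092", "clickstream/clickstream.json"
)
print(json.dumps(request_body, indent=2))
```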

Run the following cell to start a simulation by posting the `datagen_request` configuration to the `start` endpoint.

In [None]:
requests.post(f"{datagenUrl}/start", json.dumps(datagen_request), headers=datagenHeaders)

Use the `/status` endpoint to get the status of the simulation.

In [None]:
display(requests.get(f"{datagenUrl}/status/{job_name}").json())
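
If the job name is wrong or the service is unreachable, calling `.json()` on the response can fail in a confusing way. One possible approach, sketched below, is to check the HTTP status before parsing. `get_job_status` and its `http_get` parameter are hypothetical names introduced here; the sketch makes no assumption about which fields the status document contains.

```python
def get_job_status(base_url, job_name, http_get=None, timeout=10):
    """Fetch the status document for one job (hypothetical helper).

    http_get is any callable shaped like requests.get; it is injectable
    so the function can be exercised without a running service.
    """
    if http_get is None:
        import requests  # imported lazily so a stub can be used instead
        http_get = requests.get
    response = http_get(f"{base_url}/status/{job_name}", timeout=timeout)
    # Raise on 4xx/5xx responses instead of parsing an error body as JSON.
    response.raise_for_status()
    return response.json()
```

In this notebook, the call would look like `display(get_job_status(datagenUrl, job_name))`.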

## View the data

With the simulator writing directly to Apache Kafka, run the following cell to display the sample data.

This code creates a KafkaConsumer (`consumer`), subscribes to the topic in Kafka, and then enters a loop to display only five events from the simulator. Finally, it unsubscribes the KafkaConsumer from the topic.

In [None]:
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    bootstrap_servers=kafka_host
)

# subscribe() takes a list of topic names.
consumer.subscribe(topics=[job_name])
count = 0

for message in consumer:
    count += 1
    print ("%d:%d: v=%s" % (message.partition,
                            message.offset,
                            message.value))
    # Stop only after the fifth event has been printed.
    if count == 5:
        break

consumer.unsubscribe()
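
The cell above prints each `message.value` as raw bytes. To work with individual fields, the bytes need to be decoded first. The sketch below assumes that each value is a UTF-8 encoded JSON object; `decode_event` is a hypothetical helper, and the sample payload uses made-up field names rather than the real clickstream schema.

```python
import json


def decode_event(raw):
    """Decode one Kafka message value into a dict (hypothetical helper).

    Assumes the value is a UTF-8 encoded JSON object.
    """
    return json.loads(raw.decode("utf-8"))


# Illustrative payload only; these field names are invented for the example.
sample_value = b'{"example_field": "example_value", "example_count": 3}'
event = decode_event(sample_value)
print(event["example_count"])
```

With kafka-python, the same function can be passed to the consumer as `KafkaConsumer(bootstrap_servers=kafka_host, value_deserializer=decode_event)`, so that `message.value` arrives already decoded.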

Note that the `time` property configures the simulation to run for one hour.

Run the following cell to stop data generation before then by using the `stop` endpoint.

In [None]:
display(requests.post(f"{datagenUrl}/stop/{job_name}", '').json())

## Summary

* The `target` property of the data generator configuration allows you to send data directly to a Kafka topic.
* Timestamps in the data are configured using the `time_type` property.
* The simulator will run for a period specified in `time`.
* Stop data generation using the `stop` endpoint.

## Learn more

* Read more about the data generator in the [official repository](https://github.com/implydata/druid-datagenerator).
* Start a simulation and then connect to the topic data directly from Apache Druid using Kafka ingestion.