# (Result) by (action) using (feature)
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Introductory paragraph - for example:

This tutorial demonstrates how to work with [feature](link to feature doc). In this tutorial you perform the following tasks:

- Task 1
- Task 2
- Task 3
- etc

## Prerequisites

This tutorial works with Druid XX.0.0 or later.

<!-- Profiles are:
`druid-jupyter` - just Jupyter and Druid
`all-services` - includes Jupyter, Druid, and Kafka
 -->

Launch this tutorial and all prerequisites using the `<PLACE PROFILE NAME HERE>` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and the learning environment.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ.keys():
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version
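
As a minimal sketch of the host lookup used above, the same fallback can be written with `os.environ.get`. The helper name `resolve_host` and its parameters are hypothetical and are not part of `druidapi`.

```python
import os

def resolve_host(env_var, port, scheme="http", default_host="localhost"):
    """Build a service address from an environment variable.

    Falls back to `default_host` when the variable is not set.
    Pass scheme="" for services that expect a bare host:port address.
    """
    host = os.environ.get(env_var, default_host)
    prefix = f"{scheme}://" if scheme else ""
    return f"{prefix}{host}:{port}"

# Hypothetical usage: the Druid router address used by this notebook.
print(resolve_host("DRUID_HOST", 8888))
```

A single helper keeps the fallback logic in one place if the notebook connects to several services.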

### Set up a connection to Apache Kafka

<!-- Include these cells if your notebook uses Kafka. -->

Run the next cell to set up the connection to Apache Kafka.

In [None]:
# Kafka bootstrap addresses take the form host:port, without a URL scheme.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

### Set up a connection to the Data Generator

<!-- Include these cells if your notebook uses the data generator. -->

Run the next cell to set up the connection to the Data Generator.

In [None]:
import requests
import json

datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}

<!-- Include these cells if you need additional Python modules -->

### Import additional modules

Run the following cell to import additional Python modules that you will use to X, Y, Z.

In [None]:
# Add your modules here, remembering to align this with the prerequisites section

import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Create a table using batch ingestion

<!-- Use these cells if you are using batch ingestion for your notebook. -->

Run the following cell to create a table using batch ingestion. Notice {the use of X as a timestamp | only required columns are ingested | WHERE / expressions / GROUP BY are front-loaded | partitions on X period and clusters by Y}.

When completed, you'll see a description of the final table.

In [None]:
# Replace example-dataset-topic with a unique table name for this notebook.

# - Always prefix your table name with `example-`
# - If using the standard example datasets, use the following standard values for `dataset`:

#     wikipedia       wikipedia
#     koalas          KoalasToTheMax one day
#     koalanest       KoalasToTheMax one day (nested)
#     nyctaxi3        NYC Taxi cabs (3 files)
#     nyctaxi         NYC Taxi cabs (all files)
#     flights         FlightCarrierOnTime (1 month)

# Remember to apply good data modelling practice to your INSERT / REPLACE.

table_name = 'example-dataset-topic'

sql=f'''
REPLACE INTO {table_name}...
'''

display.run_task(sql)
sql_client.wait_until_ready(f'{table_name}')
display.table(f'{table_name}')
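
The `REPLACE INTO` statement above is left as a placeholder. As a hedged sketch of what a complete statement might look like, the following helper assembles one from an HTTP input source. The helper name `build_replace_sql`, the table name, the URI, and the column list are all illustrative assumptions; adjust them to your dataset.

```python
import json

def build_replace_sql(table_name, uri, columns, time_column):
    """Assemble a Druid SQL REPLACE statement that reads JSON over HTTP.

    `columns` is a list of (name, sql_type) pairs describing the input rows.
    `time_column` names the input column parsed into `__time`.
    """
    input_source = json.dumps({"type": "http", "uris": [uri]})
    input_format = json.dumps({"type": "json"})
    extend = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns)
    # Select every declared column except the raw timestamp column.
    select_cols = ",\n  ".join(
        f'"{name}"' for name, _ in columns if name != time_column
    )
    return f'''
REPLACE INTO "{table_name}" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN('{input_source}', '{input_format}')
  ) EXTEND ({extend})
)
SELECT
  TIME_PARSE("{time_column}") AS "__time",
  {select_cols}
FROM "ext"
PARTITIONED BY DAY
'''

# Illustrative values only; none of these come from the template itself.
sql = build_replace_sql(
    "example-wikipedia-sketch",
    "https://druid.apache.org/data/wikipedia.json.gz",
    [("timestamp", "VARCHAR"), ("channel", "VARCHAR"), ("added", "BIGINT")],
    "timestamp",
)
print(sql)
```

Listing only the columns you need in `columns` keeps the ingestion limited to required data.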

## Create a table using streaming ingestion

In this section, you use the data generator to generate a stream of messages into an Apache Kafka topic. Next, you set up an ongoing ingestion into Druid.

### Use the data generator to populate a Kafka topic

Run the following cell to instruct the data generator to start producing data.

In [None]:
# For more information on the available configurations and settings for the data generator, see the dedicated notebook in "99-contributing"

datagen_job = f"{table_name}"
datagen_topic = f"{table_name}"
datagen_config = "social/social_posts.json"

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic  },
    "config_file": datagen_config, 
    "time":"10m",
    "concurrency":100
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
# Print the job status reported by the data generator.
print(json.dumps(requests.get(f"{datagen_host}/status/{datagen_job}").json(), indent=2))

### Use streaming ingestion to populate the table

Ingest data from an Apache Kafka topic into Apache Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

Run the next cell to set up the [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig), [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig), and [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). Notice {the use of X as a timestamp | only required columns are ingested | WHERE / expressions / GROUP BY are front-loaded | partitions on X period and clusters by Y}.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": True }

tuningConfig = { "type": "kafka" }

# Replace example-dataset-topic with a unique table name for this notebook.

# - Always prefix your table name with `example-`
# - If using the standard example datasets, use the following standard values for `dataset`:

#     social/socialposts            social
#     clickstream/clickstream       clickstream

# Remember to apply good data modelling practice to your data schema.

table_name = 'example-dataset-topic'

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": False, "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery": True }
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
display.table(f'{table_name}')
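
If you want to inspect the specification before submitting it, one option is to build it in a function and print it as JSON. This is a sketch under assumptions: the function name `build_kafka_supervisor_spec` and the sample values are hypothetical, and the fields simply mirror the cell above.

```python
import json

def build_kafka_supervisor_spec(table_name, topic, bootstrap_servers):
    """Assemble a Kafka supervisor spec as a plain dictionary."""
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": {
                "type": "kafka",
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "topic": topic,
                "inputFormat": {"type": "json"},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
            "dataSchema": {
                "dataSource": table_name,
                "timestampSpec": {"column": "time", "format": "iso"},
                "granularitySpec": {"rollup": False, "segmentGranularity": "hour"},
                "dimensionsSpec": {"useSchemaDiscovery": True},
            },
        },
    }

# Illustrative values only.
spec = build_kafka_supervisor_spec(
    "example-social-sketch", "example-social-sketch", "localhost:9092"
)
print(json.dumps(spec, indent=2))
```

Python booleans serialize to JSON `true` and `false`, so the printed output shows exactly what the supervisor API would receive.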

## Awesome!

The main body of your notebook goes here!

### This is a step

Here things get done

### And so is this!

Wow! Awesome!

## Clean up

Run the following cell to remove the XXX used in this notebook from the database.

In [None]:
# Use this for batch ingested tables

druid.datasources.drop(f"{table_name}")

# Use this when doing streaming with the data generator

print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print('Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")
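
The shutdown loop above repeats until no running tasks remain. As a hedged sketch, the same idea can be bounded with a timeout so that a stuck task does not block the cell indefinitely. The helper name `shut_down_all` and its parameters are hypothetical; the two callables stand in for the task-listing and task-shutdown calls.

```python
import time

def shut_down_all(list_running, shut_down, timeout_s=60.0, poll_interval_s=1.0):
    """Ask every running task to stop, polling until none remain.

    `list_running` returns a list of dicts that each have an "id" key.
    `shut_down` takes a task id. Raises TimeoutError if tasks are still
    running once `timeout_s` seconds have passed.
    """
    deadline = time.monotonic() + timeout_s
    stopped = []
    tasks = list_running()
    while tasks:
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{len(tasks)} task(s) still running after {timeout_s} seconds"
            )
        for task in tasks:
            shut_down(task["id"])
            stopped.append(task["id"])
        time.sleep(poll_interval_s)
        tasks = list_running()
    return stopped

# In-memory stand-in for a task service, for illustration only.
running = {"task-a", "task-b"}
print(shut_down_all(
    lambda: [{"id": t} for t in sorted(running)],
    running.discard,
    poll_interval_s=0,
))
```

In this notebook, `list_running` could be `lambda: druid.tasks.tasks(state='running', table=table_name)` and `shut_down` could be `druid.tasks.shut_down_task`.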

## Summary

* You learned this
* Remember this

## Learn more

* Try this out on your own data
* Solve for problem X that isn't covered here
* Read docs pages
* Watch or read something cool from the community
* Do some exploratory stuff on your own

In [None]:
# Here are some useful code elements that you can re-use.

# When just wanting to display some SQL results
sql = f'''SELECT * FROM "{table_name}" LIMIT 5'''
display.sql(sql)

# When ingesting data and wanting to describe the schema
display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

# When you want to show the native version of a SQL statement
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

# When you want a simple plot
df = pd.DataFrame(sql_client.sql(sql))
df.plot(x='x-axis', y='y-axis', marker='o')
plt.xticks(rotation=45, ha='right')
plt.gca().get_legend().remove()
plt.show()

# When you want to add some query context parameters
req = sql_client.sql_request(sql)
req.add_context("useApproximateTopN", "false")
resp = sql_client.sql_query(req)

# When you want to compare two different sets of results
df3 = df1.compare(df2, keep_equal=True)
df3

# When you want to see some messages from a Kafka topic
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers=kafka_host)
consumer.subscribe(topics=[datagen_topic])
count = 0
for message in consumer:
    # Print each message as partition:offset: value, stopping after five.
    print(f"{message.partition}:{message.offset}: v={message.value}")
    count += 1
    if count == 5:
        break
consumer.unsubscribe()
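
As an alternative sketch for reading a fixed number of messages, `itertools.islice` can cap any iterable of records without a manual counter. The helper name `format_first_messages` and the `Record` stand-in are hypothetical; the sketch only assumes each record has `partition`, `offset`, and `value` attributes.

```python
from collections import namedtuple
from itertools import islice

def format_first_messages(messages, limit=5):
    """Format at most `limit` records as 'partition:offset: v=value' strings."""
    return [
        f"{m.partition}:{m.offset}: v={m.value}"
        for m in islice(messages, limit)
    ]

# Stand-in records for illustration; a Kafka consumer would supply real ones.
Record = namedtuple("Record", ["partition", "offset", "value"])
sample = (Record(0, i, f"payload-{i}".encode()) for i in range(10))
for line in format_first_messages(sample, limit=3):
    print(line)
```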