# Define schemas for incoming stream data
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

During streaming ingestion, the schema for events written into a table from a stream is set in the `dimensionsSpec`. This tutorial demonstrates various ways to work with the [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) against an example stream of events.

In this tutorial, you perform the following tasks:

- Set up a streaming ingestion from Apache Kafka.
- Start an ingestion that consumes specific dimensions and writes them into a table.
- Amend the ingestion to consume all but specific dimensions.
- Run an ingestion using automatic schema discovery.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import requests
from datetime import datetime, timedelta

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Run the next cell to set up the connection to Apache Kafka and Data Generator, and import helper functions for use later in the tutorial.

In [None]:
import json
import kafka
from kafka import KafkaConsumer

datagenUrl = "http://datagen:9999"

generalHeaders = {'Content-Type': 'application/json'}

# Fall back to the default broker address when KAFKA_HOST is not set.
if 'KAFKA_HOST' not in os.environ:
    kafka_host = "kafka:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

### Start a data stream

Run the next cell to use the learn-druid Data Generator to create a stream that we can consume from.

This creates sample social media post data for an hour and publishes it to a Kafka topic for Druid to consume.

In [None]:
job_name="example-social-dimensions"
topic_name = job_name

target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": topic_name
}

datagen_request = {
    "name": topic_name,
    "target": target,
    "config_file": "social/social_posts.json",
    "time": "1h",
    "concurrency":10,
    "time_type": "REAL"
}

requests.post(f"{datagenUrl}/start", json.dumps(datagen_request), headers=generalHeaders)

### Set up ingestion specification basics

Run the following cell to create an [ioConfig](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig) object that sets up the connection from Druid to the topic, along with a minimal [tuningConfig](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig) object for the ingestion.

In [None]:
ioConfig = {
  "type": "kafka",
  "consumerProperties": {
    "bootstrap.servers": "kafka:9092"
  },
  "topic": topic_name,
  "inputFormat": {
    "type": "json"
  },
  "useEarliestOffset": False
}

tuningConfig = { "type": "kafka" }

The third part of the ingestion specification is the [dataSchema](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). In the cells that follow, you define its three parts:

* [timestampSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) uses the `time` column from the generated data as the primary timestamp.
* [granularitySpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec) uses the primary timestamp to write data into hourly partitions.
* [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) defines what data to create inside the target table.

## Configure the timestamp and partitioning scheme

Run the next cell to see a sample of the raw data being emitted from the Data Generator.

This cell uses a simple consumer to subscribe to the topic and show the first five rows that appear.

In [None]:
consumer = KafkaConsumer(
 bootstrap_servers=kafka_host
)

consumer.subscribe(topics=[topic_name])
count = 0

# Print the first five messages, then stop reading.
for message in consumer:
    count += 1
    print ("%d:%d: v=%s" % (message.partition,
                            message.offset,
                            message.value))
    if count == 5:
        break

consumer.unsubscribe()
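
When deciding which fields to ingest, it can help to list the field names found in a single message. The following is a minimal sketch, assuming each message value is a UTF-8 encoded JSON object. The `field_names` helper and the sample payload are hypothetical and are not part of the tutorial environment.

```python
import json

def field_names(raw_value: bytes) -> list:
    """Return the top-level field names of one JSON-encoded event."""
    event = json.loads(raw_value.decode("utf-8"))
    return list(event.keys())

# Hypothetical payload for illustration only; real events arrive as message.value.
sample_value = b'{"time": "2024-01-01T00:00:00.000", "username": "example_user", "views": 10}'
print(field_names(sample_value))
```

Inside the consumer loop above, the same helper could be applied to `message.value`.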

Each event includes a timestamp in the `time` field in ISO standard format.

Run the following cell to set the primary timestamp to this field with the [format](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) set as "iso".

In [None]:
dataSchema_timestampSpec = {
    "column": "time",
    "format": "iso"
    }
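
Before relying on the "iso" format, you may want to confirm that a sample `time` value parses as ISO 8601. This is a rough sketch using only the Python standard library. The `parse_iso_time` helper and the sample value are hypothetical, and the check says nothing about how Druid itself parses timestamps.

```python
from datetime import datetime

def parse_iso_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp string, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat only accepts 'Z' from Python 3.11, so normalize it first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)

# Hypothetical value for illustration only.
parsed = parse_iso_time("2024-01-01T12:34:56.789Z")
print(parsed.isoformat())
```

A `ValueError` from this helper would suggest that the field needs a different `format` in the `timestampSpec`.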

Run the next cell to set the primary partitioning for your table to `HOUR`.

Read more about this important design consideration in the official documentation on [partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning) and [segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization).

Notice that you also disable ingestion-time aggregation ([rollup](https://druid.apache.org/docs/latest/ingestion/rollup)) inside the `granularitySpec`.

In [None]:
dataSchema_granularitySpec = {
    "rollup": False,
    "segmentGranularity": "hour"
    }
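
To picture what hourly partitioning means, the sketch below computes the one-hour interval that a given timestamp falls into. It only illustrates the idea of time-based bucketing and is not Druid's implementation. The `hour_bucket` helper is hypothetical.

```python
from datetime import datetime, timedelta

def hour_bucket(ts: datetime):
    """Return the (start, end) of the one-hour interval that contains ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)

# Two events in the same hour share a bucket; a later event lands in the next one.
print(hour_bucket(datetime(2024, 1, 1, 12, 5, 0)))
print(hour_bucket(datetime(2024, 1, 1, 12, 59, 59)))
print(hour_bucket(datetime(2024, 1, 1, 13, 0, 0)))
```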

You have now created the first two parts of the `dataSchema` that deal with treatment and use of a primary timestamp.

Having reviewed the sample data, you can now turn your attention to the options for the final part of the `dataSchema`: the `dimensionsSpec`.

## Explicitly set dimensions

Run the next cell to create a `dimensionsSpec` object that uses the "explicit" method for ingesting events.

Notice that it is made up of [dimension objects](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimension-objects) inside a `dimensions` list. The `edited` field has been left out intentionally.

There are two flavors of dimension object:

* Dimensions that are ingested using all defaults, bringing in data as a string with a bitmap index.
* Dimensions that have specific types.

In [None]:
dataSchema_dimensionsSpec = {
    "dimensions": [
        "username",
        "post_title",
        {
            "name" : "views",
            "type" : "long" },
        {
            "name" : "upvotes",
            "type" : "long" },
        {
            "name" : "comments",
            "type" : "long" }
        ]
      }
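
The same structure can be generated from a simple mapping of column names to types, which may be convenient when there are many columns. This is a sketch only: `make_dimensions` is a hypothetical helper, and it assumes that string columns are written as plain names while other types are written as dimension objects, matching the two flavors described above.

```python
def make_dimensions(columns: dict) -> dict:
    """Build a dimensionsSpec from a {column name: type} mapping."""
    dimensions = []
    for name, col_type in columns.items():
        if col_type == "string":
            # Default treatment: a bare name is enough.
            dimensions.append(name)
        else:
            # Typed treatment: use a dimension object.
            dimensions.append({"name": name, "type": col_type})
    return {"dimensions": dimensions}

spec = make_dimensions({
    "username": "string",
    "post_title": "string",
    "views": "long",
    "upvotes": "long",
    "comments": "long",
})
print(spec)
```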

Run the next cell to create the final `dataSchema`. Notice that the table name is also defined here.

The cell then combines it with the `ioConfig` and `tuningConfig` to create a native [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec).

In [None]:
table_name = topic_name

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))
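
This assembly is repeated later in the tutorial with a different `dimensionsSpec` each time, so it could be wrapped in a function. The following is a minimal sketch of that idea. `build_ingestion_spec` is a hypothetical helper, not part of the notebook, and the argument values below simply mirror the objects defined in this tutorial.

```python
import json

def build_ingestion_spec(table_name, timestamp_spec, dimensions_spec,
                         granularity_spec, io_config, tuning_config):
    """Combine the individual parts into one Kafka ingestion specification."""
    data_schema = {
        "dataSource": table_name,
        "timestampSpec": timestamp_spec,
        "dimensionsSpec": dimensions_spec,
        "granularitySpec": granularity_spec,
    }
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": io_config,
            "tuningConfig": tuning_config,
            "dataSchema": data_schema,
        },
    }

example_spec = build_ingestion_spec(
    table_name="example-social-dimensions",
    timestamp_spec={"column": "time", "format": "iso"},
    dimensions_spec={"dimensions": ["username", "post_title"]},
    granularity_spec={"rollup": False, "segmentGranularity": "hour"},
    io_config={"type": "kafka", "topic": "example-social-dimensions"},
    tuning_config={"type": "kafka"},
)
print(json.dumps(example_spec, indent=2))
```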

Run the next cell to start ingesting raw data from Kafka into Druid.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)
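
A 200 status code only confirms that the specification was accepted. To look at the supervisor itself, you could query its status endpoint. The sketch below uses only the standard library. The helper names are hypothetical, and the assumption that the state appears under `payload.state` in the response should be checked against your Druid version.

```python
import json
from urllib.request import urlopen

def supervisor_status_url(druid_host: str, supervisor_id: str) -> str:
    """Build the URL of the supervisor status endpoint."""
    return f"{druid_host.rstrip('/')}/druid/indexer/v1/supervisor/{supervisor_id}/status"

def extract_state(status: dict) -> str:
    """Read the supervisor state, assuming it is reported under payload.state."""
    return status.get("payload", {}).get("state", "UNKNOWN")

def fetch_supervisor_state(druid_host: str, supervisor_id: str, timeout: float = 10) -> str:
    """Call the status endpoint and return the reported state (needs a running Druid)."""
    with urlopen(supervisor_status_url(druid_host, supervisor_id), timeout=timeout) as response:
        return extract_state(json.load(response))

print(supervisor_status_url("http://localhost:8888", "example-social-dimensions"))
```

In this notebook, `fetch_supervisor_state(druid_host, table_name)` would be the corresponding call, since the supervisor ID matches the table name.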

Run the following cell to wait until the ingestion has started and the new table is ready for query.

In [None]:
druid.sql.wait_until_ready(table_name, verify_load_status=False)
print("Ready to go!")

Run the following cell to get details about the table you have created.

In [None]:
display.table(table_name)

The `type` tells you how Druid will interpret the data in SQL. Notice that the `dimensionsSpec` has caused Druid to apply a type of BIGINT to `views`, `upvotes`, and `comments`.

Learn more about data types in the dedicated [notebook on data types](../02-ingestion/04-table-datatypes.ipynb).
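
The same column types can also be read with SQL from `INFORMATION_SCHEMA.COLUMNS`. The following sketch only builds the query text. `column_types_sql` is a hypothetical helper, and the quote escaping is a basic precaution rather than a complete defense against malformed input.

```python
def column_types_sql(table_name: str) -> str:
    """Return a SQL statement listing column names and SQL types for a table."""
    # Double any single quotes so the table name stays inside the string literal.
    safe_name = table_name.replace("'", "''")
    return (
        'SELECT "COLUMN_NAME", "DATA_TYPE"\n'
        'FROM INFORMATION_SCHEMA.COLUMNS\n'
        f"WHERE \"TABLE_SCHEMA\" = 'druid' AND \"TABLE_NAME\" = '{safe_name}'"
    )

print(column_types_sql("example-social-dimensions"))
```

The resulting statement could then be run through the `sql_client` created at the start of this notebook.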

Run the next cell to stop ingestion and drop the table.

In [None]:
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/suspend","")}]')
print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

## Explicitly exclude dimensions

Run the next cell to create a `dimensionsSpec` object that uses the "exclusion" method for ingesting events.

Notice that it is made up of the names of dimensions to exclude from the incoming data inside the `dimensionExclusions` list.

In [None]:
dataSchema_dimensionsSpec = {
    "dimensionExclusions": [
        "username",
        "edited"
        ]
      }
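
To reason about which fields end up as dimensions under the exclusion method, the sketch below filters a list of field names. It is a simplified model, not Druid's code: it assumes that every field is ingested unless it is listed in `dimensionExclusions` or is the timestamp column. The `schemaless_dimensions` helper is hypothetical, and the field list uses only names mentioned in this tutorial, so real events may contain more.

```python
def schemaless_dimensions(event_fields, exclusions, timestamp_column):
    """Return the fields that remain as dimensions after applying exclusions."""
    excluded = set(exclusions) | {timestamp_column}
    return [field for field in event_fields if field not in excluded]

fields = ["time", "username", "post_title", "views", "upvotes", "comments", "edited"]
remaining = schemaless_dimensions(fields, ["username", "edited"], "time")
print(remaining)
```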

Now incorporate this adaptation into the overall ingestion specification by running the next cell.

In [None]:
dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

Submit the revised specification for this table to Druid by running the next cell.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
print("Ready to go!")

Run the next cell to view the schema for the table.

In [None]:
display.table(table_name)

Notice that `views`, `upvotes`, and `comments` have a type of VARCHAR. Because the `dimensionsSpec` no longer declares their types, Druid ingests them as strings.

As before, stop ingestion and drop the table by running the next cell.

In [None]:
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/suspend","")}]')
print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

## Use automatic schema discovery

Now set up your `dimensionsSpec` to instruct Druid to discover dimensions and determine a data type automatically by running the next cell.

In [None]:
dataSchema_dimensionsSpec = {
    "useSchemaDiscovery" : True }

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))
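
As a rough intuition for type-aware discovery, the sketch below guesses a type for each field from the Python type of its JSON value. This is a deliberately simplified illustration and not Druid's discovery algorithm. The helper names and the sample event are hypothetical, and anything other than strings, integers, and floats is reported as "other".

```python
def guess_druid_type(value) -> str:
    """Map a decoded JSON value to a coarse type name."""
    # bool is a subclass of int in Python, so rule it out before the int check.
    if isinstance(value, bool):
        return "other"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "other"

def guess_schema(event: dict, timestamp_column: str = "time") -> dict:
    """Guess a type for every field except the timestamp column."""
    return {name: guess_druid_type(value)
            for name, value in event.items()
            if name != timestamp_column}

# Hypothetical event for illustration only.
sample_event = {"time": "2024-01-01T00:00:00.000", "username": "example_user", "views": 10}
print(guess_schema(sample_event))
```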

Submit the revised specification for this table to Druid by running the next cell.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
print("Ready to go!")

Review the schema for the table by running the next cell.

In [None]:
display.table(table_name)

Notice that `views`, `upvotes`, and `comments` have been detected as BIGINT.

Stop ingestion and drop the table by running the next cell.

In [None]:
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/suspend","")}]')
print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

## Clean up

Run the following cell to stop the data generator.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagenUrl}/stop/{job_name}','')}]")

## Summary

* The schema of incoming data is defined in the `dimensionsSpec` and is realized in the target table.
* Dimensions can be explicitly included and typed, explicitly excluded, or automatically detected and typed.

## Learn more

* Review the documentation on the [`dimensionsSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec).
* Review the documentation on [partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning) and [segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization).
* Run through the dedicated [notebook on data types](../02-ingestion/04-table-datatypes.ipynb).
* Learn about [changing schemas](https://druid.apache.org/docs/latest/data-management/schema-changes) in Druid.
* Experiment with combining batch and streaming data in the same table.