# Define schemas for incoming stream data
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Druid tables have an evolving schema that is realized dynamically from the data that you ingest.

In streaming ingestion, the schema of the data is defined in the `dimensionsSpec`, and you can change this over time.

This tutorial demonstrates various ways to work with the [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) from an example stream of events, showing schema evolution in action.

In this tutorial, you perform the following tasks:

- Set up a streaming ingestion from Apache Kafka.
- Start an ingestion that consumes specific dimensions and writes them into a table.
- Amend the ingestion to consume all but specific dimensions.
- Run an ingestion using automatic schema discovery.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import requests
from datetime import datetime, timedelta

# Use the DRUID_HOST environment variable when it is set; otherwise default to localhost.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Run the next cell to set up the connection to Apache Kafka and Data Generator, and import helper functions for use later in the tutorial.

In [None]:
import json
import kafka
from kafka import KafkaConsumer

datagen_host = "http://datagen:9999"

datagen_headers = {'Content-Type': 'application/json'}

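# Fall back to the default Kafka host when KAFKA_HOST is not set.
# os.environ.get avoids the KeyError that os.environ['KAFKA_HOST'] raises for a missing variable.
if os.environ.get('KAFKA_HOST') is None:
    kafka_host = "kafka:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"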

### Start a data stream

Run the next cell to use the learn-druid Data Generator to create a stream for Druid to consume.

This creates sample social media post data for an hour and publishes it to a Kafka topic for Druid to consume.

In [None]:
datagen_job="example-social-dimensions"
kafka_topic = datagen_job

target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": kafka_topic
}

datagen_starttime = "2020-01-01 00:00:00"

datagen_request = {
    "name": datagen_job,
    "target": target,
    "config_file": "social/social_posts.json",
    "concurrency":50,
    "time_type": datagen_starttime
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
requests.get(f"{datagen_host}/status/{datagen_job}").json()

### Set up ingestion specification basics

A streaming ingestion specification contains three parts:

- [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig): sets the connection to the source data.
- [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig): sets specific tuning options for the ingestion.
- [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema): controls what happens to the data as it arrives and what the output should be.

Run the following cell to create two objects to represent the `ioConfig` and `tuningConfig` that you will use throughout this notebook.

In [None]:
ioConfig = {
  "type": "kafka",
  "consumerProperties": {
    "bootstrap.servers": "kafka:9092"
  },
  "topic": kafka_topic,
  "inputFormat": {
    "type": "json"
  },
  "useEarliestOffset": False
}

tuningConfig = { "type": "kafka" }

The `dataSchema` is the focus of this notebook. You will work with three of its parts:

1. [timestampSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) defines the primary timestamp (`__time`).
2. [granularitySpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec) defines how to use the primary timestamp to partition data.
3. [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) defines which other measures and attributes from the incoming data to add to the table.

In this notebook you will work with all three parts to set up the timestamp and dimensions of an example table.

## Configure the primary timestamp

The primary timestamp is required in every table, and is set in the `timestampSpec`. Because the primary timestamp is also the primary partitioning dimension, the same field determines the initial partitioning of your data. Use `granularitySpec` to define how this is done.

Run the next cell to set up a simple consumer and peek at the raw data being emitted from the Data Generator.

In [None]:
# Create a consumer that reads from the Kafka broker.
consumer = KafkaConsumer(
    bootstrap_servers=kafka_host
)

consumer.subscribe(topics=[kafka_topic])
count = 0

# Print the first five messages, then stop reading.
for message in consumer:
    print("%d:%d: v=%s" % (message.partition,
                           message.offset,
                           message.value))
    count += 1
    if count == 5:
        break

consumer.unsubscribe()
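
The `message.value` printed by the consumer is a raw byte string. As a minimal sketch, the following decodes one such payload and lists the fields that are available as candidate dimensions. The payload is a hypothetical, shortened event used for illustration, not actual generator output, and `event_fields` is a helper name invented for this example.

```python
import json

# Hypothetical payload standing in for the bytes held in message.value.
raw_value = b'{"time": "2020-01-01T00:00:05.250", "username": "example_user", "views": 10}'


def event_fields(raw: bytes) -> list:
    """Decode a JSON-encoded Kafka message value and return its field names, sorted."""
    event = json.loads(raw.decode("utf-8"))
    return sorted(event.keys())


print(event_fields(raw_value))
```

In the notebook, a similar effect can be had by passing `value_deserializer=lambda v: json.loads(v.decode("utf-8"))` to `KafkaConsumer`, so that each `message.value` arrives as a dictionary.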

Each event includes a timestamp in the `time` field in ISO 8601 format that you will use as the `__time` field.

Run the following cell to set up an object that you will incorporate into your final `dataSchema`. The `column` is set to `time`, which is the column from the generated data you will use as the primary timestamp. The [format](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) is set to `iso`.

In [None]:
dataSchema_timestampSpec = {
    "column": "time",
    "format": "iso"
    }
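
To illustrate what parsing an `iso` timestamp involves, the sketch below converts an ISO 8601 string into milliseconds since the Unix epoch, which is how Druid represents `__time` internally. This is a plain Python illustration rather than Druid's parser; the helper name `iso_to_epoch_millis` and the sample value are made up for this example, and treating timestamps without an offset as UTC is an assumption of the sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_millis(value: str) -> int:
    """Parse an ISO 8601 timestamp and return milliseconds since the Unix epoch."""
    # Python versions before 3.11 reject a trailing "Z", so normalize it first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption for this sketch: a timestamp without an offset is in UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Dividing two timedeltas with // yields an exact integer.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(iso_to_epoch_millis("2020-01-01T00:00:05.250"))
```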

Next, you will define how your incoming events will be partitioned. Read more about this important design consideration in the official documentation on [partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning) and [segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization).

Run the next cell to create an object that will be incorporated into the `dataSchema` as the `granularitySpec`. Notice that the primary partitioning for your table will be `HOUR`, and that ingestion-time aggregation ([rollup](https://druid.apache.org/docs/latest/ingestion/rollup)) will not be used.

In [None]:
dataSchema_granularitySpec = {
    "rollup": False,
    "segmentGranularity": "hour"
    }
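
As a rough illustration of what a `segmentGranularity` of `hour` means, the following sketch assigns sample timestamps to the hourly time chunk they fall into. It only mimics the bucketing idea in plain Python; the helper name `hour_chunk_start` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def hour_chunk_start(timestamp: datetime) -> datetime:
    """Return the start of the hourly time chunk that contains the timestamp."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


# Hypothetical event times: two fall in the 00:00 hour and one in the 01:00 hour.
sample_times = [
    datetime(2020, 1, 1, 0, 15, 0),
    datetime(2020, 1, 1, 0, 59, 59),
    datetime(2020, 1, 1, 1, 0, 0),
]

# Count how many events land in each hourly chunk.
chunks = Counter(hour_chunk_start(t) for t in sample_times)
for start, event_count in sorted(chunks.items()):
    print(start.isoformat(), event_count)
```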

## Configure dimensions

You have now created two objects that set up the primary timestamp and partitioning. Now turn your attention to the third part of the `dataSchema`: the `dimensionsSpec`. Here you set which attributes and measures from the source data are inserted into the table.

You will see examples of:

* Setting the dimensions explicitly.
* Excluding dimensions specifically.
* Using automatic schema detection.

### Use `dimensions` to explicitly set the schema

Use an array of [dimension objects](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimension-objects) to prescribe the specific attributes and measures that will be inserted and their type.

Run the next cell to create a `dimensionsSpec` object that contains a `dimensions` array of dimension objects, each with a name and target data type. Dimensions listed by name only, such as `username` and `post_title`, default to the string type.

In [None]:
dataSchema_dimensionsSpec = {
    "dimensions": [
        "username",
        "post_title",
        {
            "name" : "views",
            "type" : "long" },
        {
            "name" : "upvotes",
            "type" : "long" },
        {
            "name" : "comments",
            "type" : "long" }
        ]
      }
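
To make the effect of an explicit `dimensions` list concrete, here is a simplified Python sketch of selecting and typing fields from one event. It is an illustration only, not how Druid is implemented: the helper `apply_explicit_dimensions` and the sample event values are hypothetical, and only the `long` and string types are handled.

```python
def apply_explicit_dimensions(event, dimensions, timestamp_column="time"):
    """Keep only the listed dimensions, converting each to its declared type."""
    row = {"__time": event[timestamp_column]}
    for dimension in dimensions:
        # A bare name is shorthand for a string-typed dimension.
        if isinstance(dimension, str):
            name, dim_type = dimension, "string"
        else:
            name, dim_type = dimension["name"], dimension.get("type", "string")
        value = event.get(name)
        if value is not None:
            value = int(value) if dim_type == "long" else str(value)
        row[name] = value
    return row


dimensions = [
    "username",
    "post_title",
    {"name": "views", "type": "long"},
    {"name": "upvotes", "type": "long"},
    {"name": "comments", "type": "long"},
]

# Hypothetical event; the real generator output may differ.
sample_event = {
    "time": "2020-01-01T00:00:05.250",
    "username": "example_user",
    "post_title": "example title",
    "views": 10,
    "upvotes": 2,
    "comments": 1,
    "edited": "False",
}

print(apply_explicit_dimensions(sample_event, dimensions))
```

Fields that are not listed, such as `edited` in this sample, do not appear in the resulting row.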

Run the next cell to create the final `dataSchema` by combining the `timestampSpec`, `granularitySpec`, and `dimensionsSpec`, along with the `dataSource` set to the target name for your table.

In [None]:
table_name = kafka_topic

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

print(json.dumps(dataSchema,indent=3))

Run the next cell to incorporate this with the `ioConfig` and `tuningConfig` to create a native [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec).

In [None]:
ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

Run the next cell to start ingesting raw data from Kafka to Druid.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=datagen_headers)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)

Notice that the `dimensionsSpec` has caused Druid to apply a type of BIGINT to `views`, `upvotes`, and `comments`.
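
One way to check the column types directly is to query Druid's `INFORMATION_SCHEMA.COLUMNS` table. The sketch below only builds the SQL text so that it runs without a Druid connection; `column_types_sql` is a hypothetical helper name, and the table name in the example call is the one used in this notebook.

```python
def column_types_sql(table_name: str) -> str:
    """Build a query that lists each column of a Druid table with its SQL data type."""
    return f'''
SELECT "COLUMN_NAME", "DATA_TYPE"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE "TABLE_SCHEMA" = 'druid'
  AND "TABLE_NAME" = '{table_name}'
'''


print(column_types_sql("example-social-dimensions"))
```

In the notebook, the generated statement could be passed to `display.sql(column_types_sql(table_name))`.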

Learn more about data types in the dedicated [notebook on data types](../02-ingestion/04-table-datatypes.ipynb).

Before moving on, stop the data generator.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")

### Use `dimensionExclusions` to explicitly exclude dimensions

Run the next cell to switch the `dimensionsSpec` object to use the "exclusion" method for ingesting events.

Notice that the `dimensionExclusions` array contains the names of the dimensions that Druid ignores in the incoming events.

In [None]:
dataSchema_dimensionsSpec = {
    "dimensionExclusions": [
        "username",
        "edited"
        ]
      }
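
The following simplified sketch shows the idea behind the exclusion method: every field except the timestamp and the excluded names is kept, and each kept value is treated as a string. It is an illustration in plain Python under those assumptions, not Druid's implementation; `apply_exclusions` and the sample event are hypothetical.

```python
def apply_exclusions(event, exclusions, timestamp_column="time"):
    """Keep every field except the excluded ones, storing each value as a string."""
    skipped = set(exclusions) | {timestamp_column}
    row = {"__time": event[timestamp_column]}
    for name, value in event.items():
        if name in skipped:
            continue
        row[name] = None if value is None else str(value)
    return row


# Hypothetical event; the real generator output may differ.
sample_event = {
    "time": "2021-01-01T00:00:05.250",
    "username": "example_user",
    "post_title": "example title",
    "views": 10,
    "upvotes": 2,
    "comments": 1,
    "edited": "False",
}

print(apply_exclusions(sample_event, ["username", "edited"]))
```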

Incorporate this into an ingestion specification by running the next cell.

In [None]:
dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

Submit the revised specification for this table to Druid by running the next cell.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=datagen_headers)
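
The cell above does not inspect the response, so a rejected specification would go unnoticed. As a hedged sketch, the following builds the same POST request with the standard library and wraps the submission in a function that raises on an HTTP error. The helper names `build_supervisor_request` and `submit_supervisor` are hypothetical, and the standard library is used here only to keep the example dependency-free.

```python
import json
import urllib.request


def build_supervisor_request(druid_host: str, ingestion_spec: dict) -> urllib.request.Request:
    """Build the POST request that submits a supervisor specification to Druid."""
    return urllib.request.Request(
        url=f"{druid_host}/druid/indexer/v1/supervisor",
        data=json.dumps(ingestion_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_supervisor(druid_host: str, ingestion_spec: dict) -> dict:
    """Send the request; urlopen raises urllib.error.HTTPError on a 4xx or 5xx status."""
    request = build_supervisor_request(druid_host, ingestion_spec)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


# Build, but do not send, a request for a placeholder specification.
example_request = build_supervisor_request("http://localhost:8888", {"type": "kafka"})
print(example_request.get_method(), example_request.full_url)
```

With the `requests` library used in this notebook, keeping the return value of `requests.post(...)` and calling `raise_for_status()` on it gives a similar check.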

Restart the data generator by running the next cell.

Notice that the `time_type` is a year later, meaning that the new set of events will have a later timestamp.

In [None]:
datagen_starttime = "2021-01-01 00:00:00"

datagen_request = {
    "name": datagen_job,
    "target": target,
    "config_file": "social/social_posts.json",
    "concurrency":50,
    "time_type": datagen_starttime
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
requests.get(f"{datagen_host}/status/{datagen_job}").json()

The table will now contain two sets of events:

* Events from 2020 that were ingested using an explicit `dimensionsSpec`.
* Events from 2021 that are currently being ingested using `dimensionExclusions`.

With the exclusion method, Druid ingests the remaining fields as strings, so `views`, `upvotes`, and `comments` in the 2021 events have an internal VARCHAR type.

Run the next cell to show the difference in the data:

In [None]:
sql=f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2020-01-01/PT1H')
'''

print("Using explicit inclusions:")
display.sql(sql)

sql=f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2021-01-01/PT1H')
'''

print("Using explicit exclusions:")
display.sql(sql)

Stop the data generator by running the next cell.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")

### Use automatic schema discovery

Now set up your `dimensionsSpec` to instruct Druid to discover dimensions and determine their data types automatically. Run the next cell, which sets `useSchemaDiscovery` to `true`.

In [None]:
dataSchema_dimensionsSpec = {
    "useSchemaDiscovery" : True }

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

Submit the revised specification for this table to Druid by running the next cell.

Because this specification uses automatic schema discovery, Druid detects the type of each dimension from the data, and `views`, `upvotes`, and `comments` will be set as BIGINT.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=datagen_headers)
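
As a rough mental model of type-aware schema discovery, the sketch below infers a type for each field from the value found in a decoded JSON event. It is a deliberately simplified illustration, not Druid's algorithm: it covers only integers, floating-point numbers, and strings, the helper names are hypothetical, and the sample event assumes that the generator emits `views`, `upvotes`, and `comments` as JSON numbers.

```python
# Druid native types and the SQL types they are reported as.
SQL_TYPE = {"LONG": "BIGINT", "DOUBLE": "DOUBLE", "STRING": "VARCHAR"}


def infer_druid_type(value) -> str:
    """Map a decoded JSON value to a Druid native type (simplified)."""
    # bool is a subclass of int in Python, so reject it before the int check.
    if isinstance(value, bool):
        raise TypeError("booleans are not covered by this sketch")
    if isinstance(value, int):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"type {type(value).__name__} is not covered by this sketch")


def discover_schema(event: dict, timestamp_column: str = "time") -> dict:
    """Return the inferred SQL type of every field except the timestamp."""
    return {
        name: SQL_TYPE[infer_druid_type(value)]
        for name, value in event.items()
        if name != timestamp_column
    }


# Hypothetical event; the real generator output may differ.
sample_event = {
    "time": "2022-01-01T00:00:05.250",
    "username": "example_user",
    "views": 10,
    "upvotes": 2,
    "comments": 1,
}

print(discover_schema(sample_event))
```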

Run the next cell to restart the data generator, this time for 2022.

In [None]:
datagen_starttime = "2022-01-01 00:00:00"

datagen_request = {
    "name": datagen_job,
    "target": target,
    "config_file": "social/social_posts.json",
    "concurrency":50,
    "time_type": datagen_starttime
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
requests.get(f"{datagen_host}/status/{datagen_job}").json()

Run the next cell to compare the events ingested with each of the three methods.

In [None]:
sql=f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2020-01-01/PT1H')
'''

print("Using explicit inclusions:")
display.sql(sql)

sql=f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2021-01-01/PT1H')
'''

print("Using explicit exclusions:")
display.sql(sql)

sql=f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time",'2022-01-01/PT1H')
'''

print("Using automatic schema discovery:")
display.sql(sql)
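
The three queries differ only in their start date, so they can also be generated in a loop. The following is a small sketch of that idea; `interval_sql` is a hypothetical helper, and the start dates are the ones passed to the data generator in this notebook.

```python
def interval_sql(table_name: str, start: str, duration: str = "PT1H") -> str:
    """Build a query that selects the rows in the interval starting at `start`."""
    return f'''
SELECT *
FROM "{table_name}"
WHERE TIME_IN_INTERVAL("__time", '{start}/{duration}')
'''


# Start date of each generator run, paired with the method used to ingest it.
methods_by_start = {
    "2020-01-01": "explicit inclusions",
    "2021-01-01": "explicit exclusions",
    "2022-01-01": "automatic schema discovery",
}

for start, method in methods_by_start.items():
    print(f"Using {method}:")
    print(interval_sql("example-social-dimensions", start))
```

In the notebook, replacing the second `print` with `display.sql(interval_sql(table_name, start))` would run each query.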

## Clean up

Run the following cell to stop the data generator, terminate the streaming ingestion, and drop the table.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")

print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/suspend","")}]')
print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

## Summary

* The schema of incoming data is defined in the `dimensionsSpec` and is realized in the target table.
* Dimensions can be explicitly included and typed, explicitly excluded, or automatically detected and typed.
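
As a compact recap, the three forms of `dimensionsSpec` used in this notebook can be written side by side. The dictionary name `dimensions_spec_variants` is introduced here for illustration only, and the explicit variant is shortened to two dimensions.

```python
# The three dimensionsSpec styles covered in this notebook.
dimensions_spec_variants = {
    # Explicit: list each dimension, with a type where it is not a string.
    "explicit": {
        "dimensions": [
            "username",
            {"name": "views", "type": "long"},
        ]
    },
    # Exclusion: ingest every field except the ones named here.
    "exclusion": {
        "dimensionExclusions": ["username", "edited"]
    },
    # Discovery: let Druid detect dimensions and their types.
    "discovery": {
        "useSchemaDiscovery": True
    },
}

for label, spec in dimensions_spec_variants.items():
    print(label, "->", sorted(spec.keys()))
```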

## Learn more

* Review the documentation on the [`dimensionsSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec).
* Review the documentation on [partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning) and [segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization).
* Run through the dedicated [notebook on data types](../02-ingestion/04-table-datatypes.ipynb).
* Learn about [changing schemas](https://druid.apache.org/docs/latest/data-management/schema-changes) in Druid.
* Experiment with combining batch and streaming data in the same table.