# Define schemas for incoming stream data
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

During streaming ingestion, the schema for events written into a table from a stream is set in the `dimensionsSpec`. This tutorial demonstrates various ways to work with the [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) against an example stream of events.

In this tutorial you perform the following tasks:

- Set up a streaming ingestion from Apache Kafka.
- Start an ingestion that consumes specific dimensions and writes them into a table.
- Update the ingestion to set the specific data type for some of the dimensions.
- Amend the ingestion to consume all but specific dimensions.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [16]:
import druidapi
import os
import requests
from datetime import datetime

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Opening a connection to http://router:8888.


'29.0.1'

Run the next cell to set up the connection to Apache Kafka and to the Data Generator, and to import some helper functions for later in the tutorial.

In [17]:
import json
import kafka
from kafka import KafkaConsumer

datagenUrl = "http://datagen:9999"

generalHeaders = {'Content-Type': 'application/json'}

# Use .get() so a missing KAFKA_HOST falls back to the default instead of raising KeyError.
if os.environ.get('KAFKA_HOST') is None:
    kafka_host="kafka:9092"
else:
    kafka_host=f"{os.environ['KAFKA_HOST']}:9092"
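
The host lookups above follow one pattern: use an environment variable when it is set, otherwise fall back to a default. A minimal sketch of that pattern follows. `resolve_endpoint` is a hypothetical helper (it is not part of `druidapi` or `kafka`), and the defaults are simply the ones this notebook assumes.

```python
import os

def resolve_endpoint(env_var, default_host, port, scheme="", env=None):
    """Build 'host:port' (optionally prefixed by a scheme) from an environment variable.

    `env` defaults to os.environ; pass a plain dict to try the helper
    without touching the real environment.
    """
    env = os.environ if env is None else env
    host = env.get(env_var, default_host)
    return f"{scheme}{host}:{port}"

# Hypothetical usage mirroring the two cells above.
druid_endpoint = resolve_endpoint("DRUID_HOST", "localhost", 8888, scheme="http://")
kafka_endpoint = resolve_endpoint("KAFKA_HOST", "kafka", 9092)
print(druid_endpoint, kafka_endpoint)
```

Because the lookup goes through `.get()`, an unset variable never raises a `KeyError`.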

### Start a data stream

Run the following cell to use the learn-druid Data Generator to create a stream that we can consume from.

This creates sample social media post data for an hour and publishes it to a topic in Apache Kafka for Apache Druid to consume.

In [18]:
job_name="example-social-dimensions"
topic_name = job_name

target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": topic_name
}

datagen_request = {
    "name": topic_name,
    "target": target,
    "config_file": "social/social_posts.json",
    "time": "1h",
    "concurrency":10,
    "time_type": "REAL"
}

requests.post(f"{datagenUrl}/start", json.dumps(datagen_request), headers=generalHeaders)

<Response [200]>

## Set up ingestion specification basics

Run the following cell to create an [ioConfig](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig) object that connects Apache Druid to the topic, along with a minimal [tuningConfig](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig) object.

In [19]:
ioConfig = {
  "type": "kafka",
  "consumerProperties": {
    "bootstrap.servers": "kafka:9092"
  },
  "topic": topic_name,
  "inputFormat": {
    "type": "json"
  },
  "useEarliestOffset": "false"
}

tuningConfig = { "type": "kafka" }
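
The `ioConfig` above hard-codes the broker address and passes `useEarliestOffset` as a string. The sketch below is a hedged variant, assuming you would rather pass the broker address in and use a JSON boolean. `build_io_config` is a hypothetical helper, and the field names are the ones already used in this notebook.

```python
import json

def build_io_config(topic, bootstrap_servers, use_earliest_offset=False):
    """Return a Kafka ioConfig dict with the same fields as the cell above."""
    return {
        "type": "kafka",
        "consumerProperties": {"bootstrap.servers": bootstrap_servers},
        "topic": topic,
        "inputFormat": {"type": "json"},
        # A Python bool serializes to JSON true/false rather than a string.
        "useEarliestOffset": use_earliest_offset,
    }

example_io_config = build_io_config("example-social-dimensions", "kafka:9092")
print(json.dumps(example_io_config, indent=2))
```

In the notebook you could pass `topic_name` and `kafka_host` instead of the literal values shown here.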

The third part of the ingestion specification is the [dataSchema](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). In the cells that follow, you define three parts of it:

* A [timestampSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) which uses the `time` column from the generated data as the primary timestamp.
* A [granularitySpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec) which uses the primary timestamp to write data into hourly partitions.
* A [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) which defines what data to create inside the target table.

### Set the timestamp and partitioning scheme

Run the next cell to see a sample of the raw data being emitted from the Data Generator.

This cell uses a simple consumer to subscribe to the topic and print the first four messages that arrive; the loop exits when it reaches the fifth.

In [22]:
consumer = KafkaConsumer(
 bootstrap_servers=kafka_host
)

consumer.subscribe(topics=topic_name)
count = 0

for message in consumer:
    count += 1
    if count == 5:
        break
    print ("%d:%d: v=%s" % (message.partition,
                            message.offset,
                            message.value))

consumer.unsubscribe()

0:435: v=b'{"time":"2024-06-27T08:49:31.204","username":"willow","post_title":"KLj.fQGv::O1Mieg7GEjjfWbWuBITuqs\'G,3adPcVfdJFnl:FwX6GC!1p4qpCpcd9FD2T9hBNbtVLcxyVyfj_KGfzhmOQlfm5,IJsrfCc\'ua,woNtt","views":659,"upvotes":64,"comments":10,"edited":"True"}'
0:436: v=b'{"time":"2024-06-27T08:49:31.205","username":"gus","post_title":"k,NL1Uj,WhdMLkthsBoYdqHvOv7G4wy5r64W3z9uxvCntc;SMthPsJadr9z\'C_kuCJpgtODG","views":27479,"upvotes":79,"comments":18,"edited":"True"}'
0:437: v=b'{"time":"2024-06-27T08:49:31.205","username":"gus","post_title":"Jp1XyghfyiLxpC:yZjXrEhGCreZ0kVWf2uGG:CpYMgfVLOKVp;oOrE!dW,b!DU2fs\'5plYxjZiqR2u1RS1W2cn_8","views":758,"upvotes":51,"comments":14,"edited":"False"}'
0:438: v=b'{"time":"2024-06-27T08:49:31.206","username":"miette","post_title":"Qpb8vzW6tcqddGP,acKDhS09J5yQZNeUL,\'uQ6I!SOK;\'nDKUcRVubYDH49sxJhNrK7elAju;Q3\'uKkMFftH\'rWRP1YNSNS!WZrrnT4sI.y_A!7O.EjYV\'y7LLx","views":21242,"upvotes":53,"comments":-1,"edited":"False"}'


Each event includes a timestamp in the `time` field in ISO 8601 format.
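
As a quick check outside Druid, you can parse one event and its `time` value in Python. This sketch uses a shortened record modeled on the output above (the `post_title` value is a placeholder). `datetime.fromisoformat` stands in for Druid's `iso` parser here, and treating the two as equivalent is an assumption that applies only to values shaped like this one.

```python
import json
from datetime import datetime

# Shortened record modeled on the consumer output; post_title is a placeholder.
raw_message = (
    b'{"time":"2024-06-27T08:49:31.205","username":"gus",'
    b'"post_title":"placeholder","views":27479,"upvotes":79,'
    b'"comments":18,"edited":"True"}'
)

event = json.loads(raw_message)
event_time = datetime.fromisoformat(event["time"])

print(sorted(event.keys()))
print(event_time.isoformat(timespec="milliseconds"))
```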

Run the following cell to set the primary timestamp to this field with the [format](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) set as "iso".

In [21]:
dataSchema_timestampSpec = {
    "column": "time",
    "format": "iso"
    }

For the purposes of this notebook, set the primary partitioning for your table using the `granularitySpec` to `HOUR` by running the next cell.

Read more about this important design consideration in the official documentation on [partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning) and [segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization).

Notice that the `granularitySpec` also controls ingestion-time aggregation ([rollup](https://druid.apache.org/docs/latest/ingestion/rollup)), which this cell disables.

In [23]:
dataSchema_granularitySpec = {
    "rollup": "false",
    "segmentGranularity": "hour"
    }
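
To build intuition for `segmentGranularity`, the following sketch truncates event timestamps to the hour, which is roughly how events are grouped into hourly time chunks. It is an illustration in plain Python, not Druid's implementation. `hour_bucket` is a hypothetical helper and the timestamps are made up.

```python
from collections import Counter
from datetime import datetime

def hour_bucket(timestamp):
    """Truncate a datetime to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)

# Timestamps chosen for illustration only.
timestamps = [
    datetime(2024, 6, 27, 8, 49, 31),
    datetime(2024, 6, 27, 8, 59, 59),
    datetime(2024, 6, 27, 9, 0, 0),
]

events_per_bucket = Counter(hour_bucket(ts) for ts in timestamps)
for bucket, count in sorted(events_per_bucket.items()):
    print(bucket.isoformat(), count)
```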

You have now created the first two parts of the `dataSchema` that deal with treatment and use of a primary timestamp.

Having reviewed the sample data, you can now turn to the final part of the `dataSchema`: the `dimensionsSpec`.

## Explicitly set dimensions

Run the next cell to create a `dimensionsSpec` object that uses the "explicit" method for ingesting events.

Notice that it is made up of [dimension objects](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimension-objects) inside a `dimensions` list - the "edited" field has been left out intentionally.

There are two flavors of dimension object:

* Dimensions that are ingested using all defaults, bringing in data as a string with a bitmap index.
* Dimensions that have specific types.

In [24]:
dataSchema_dimensionsSpec = {
    "dimensions": [
        "username",
        "post_title",
        {
            "name" : "views",
            "type" : "long" },
        {
            "name" : "upvotes",
            "type" : "long" },
        {
            "name" : "comments",
            "type" : "long" }
        ]
      }
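
Both flavors describe the same kind of object. The sketch below expands the shorthand names into full dimension objects, assuming (as the list above states) that a bare name is ingested as a `string` dimension. `normalize_dimensions` is a hypothetical helper for illustration only.

```python
def normalize_dimensions(dimensions):
    """Expand bare dimension names into {'name', 'type'} objects."""
    normalized = []
    for dimension in dimensions:
        if isinstance(dimension, str):
            # Assumption: a bare name means a string-typed dimension.
            normalized.append({"name": dimension, "type": "string"})
        else:
            normalized.append(dict(dimension))
    return normalized

example_dimensions = [
    "username",
    "post_title",
    {"name": "views", "type": "long"},
]

for dimension in normalize_dimensions(example_dimensions):
    print(dimension["name"], dimension["type"])
```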

Run the next cell to create the final `dataSchema`. Notice that the table name is also defined here.

Beneath this it is combined with the `ioConfig` and `tuningConfig` to create a native [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec).

In [26]:
table_name = topic_name

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

{
     "type": "kafka",
     "spec": {
          "ioConfig": {
               "type": "kafka",
               "consumerProperties": {
                    "bootstrap.servers": "kafka:9092"
               },
               "topic": "example-social-dimensions",
               "inputFormat": {
                    "type": "json"
               },
               "useEarliestOffset": "false"
          },
          "tuningConfig": {
               "type": "kafka"
          },
          "dataSchema": {
               "dataSource": "example-social-dimensions",
               "timestampSpec": {
                    "column": "time",
                    "format": "iso"
               },
               "dimensionsSpec": {
                    "dimensions": [
                         "username",
                         "post_title",
                         {
                              "name": "views",
                              "type": "long"
                         },
                       

Start ingesting the raw data from Apache Kafka by running the cell below, which submits this object to Apache Druid.

In [27]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

200


Run the following cell to wait until the ingestion has started and the new table is ready for query.

In [28]:
druid.sql.wait_until_ready(table_name, verify_load_status=False)
print("Ready to go!")

Ready to go!


Run the following cell to get details about the table you have created.

In [31]:
sql=f'''
SELECT
  "COLUMN_NAME",
  "ORDINAL_POSITION",
  "DATA_TYPE",
  "NUMERIC_PRECISION",
  "NUMERIC_PRECISION_RADIX",
  "DATETIME_PRECISION",
  "CHARACTER_SET_NAME",
  "JDBC_TYPE"
FROM "INFORMATION_SCHEMA"."COLUMNS"
WHERE "TABLE_NAME" = '{table_name}'
'''

display.sql(sql)

COLUMN_NAME,ORDINAL_POSITION,DATA_TYPE,NUMERIC_PRECISION,NUMERIC_PRECISION_RADIX,DATETIME_PRECISION,CHARACTER_SET_NAME,JDBC_TYPE
__time,1,TIMESTAMP,,,3.0,,93
username,2,VARCHAR,,,,UTF-16LE,12
post_title,3,VARCHAR,,,,UTF-16LE,12
views,4,BIGINT,19.0,10.0,,,-5
upvotes,5,BIGINT,19.0,10.0,,,-5
comments,6,BIGINT,19.0,10.0,,,-5


The type shown in the `DATA_TYPE` column tells you how Druid will interpret the data in SQL. Notice that the `dimensionsSpec` has caused Druid to apply a type of BIGINT to `views`, `upvotes`, and `comments`.
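
The relationship between the types in the `dimensionsSpec` and the `DATA_TYPE` values above can be written down as a small lookup. The pairs below reflect the standard types table in the Druid SQL documentation. Treat them as a convenience summary and check that page for the authoritative list. `expected_sql_type` is a hypothetical helper.

```python
# Native Druid column type -> SQL DATA_TYPE, for the types relevant to this tutorial.
NATIVE_TO_SQL = {
    "string": "VARCHAR",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
}

def expected_sql_type(dimension):
    """Return the SQL type expected for a dimension name or dimension object."""
    if isinstance(dimension, str):
        native_type = "string"  # bare names are string dimensions
    else:
        native_type = dimension.get("type", "string")
    return NATIVE_TO_SQL[native_type]

print(expected_sql_type("username"))
print(expected_sql_type({"name": "views", "type": "long"}))
```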

> As shown in the [documentation](https://druid.apache.org/docs/latest/querying/sql-data-types), these SQL types map to more fundamental types inside Druid itself. Take a moment to review the [standard types](https://druid.apache.org/docs/latest/querying/sql-data-types#standard-types) to understand how each `DATA_TYPE` in the `example-social-dimensions` table maps to an internal Druid runtime type.

### Explicitly exclude dimensions

Run the next cell to create a `dimensionsSpec` object that uses the "exclusion" method for ingesting events.

Notice that it is made up of the names of dimensions to exclude from the incoming data, listed inside `dimensionExclusions`.

In [32]:
dataSchema_dimensionsSpec = {
    "dimensionExclusions": [
        "username",
        "edited"
        ]
      }
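
With no `dimensions` list, Druid discovers dimensions from the incoming fields and skips the excluded names. The following sketch models that selection for a single event. It is a simplification: it also skips the timestamp column, and it says nothing about how the discovered columns are typed. `discovered_dimensions` is a hypothetical helper.

```python
def discovered_dimensions(event, exclusions, timestamp_column="time"):
    """List the fields of one event that would remain as dimensions."""
    skipped = set(exclusions) | {timestamp_column}
    return [field for field in event if field not in skipped]

# Field names match the sample events; values are placeholders.
sample_event = {
    "time": "2024-06-27T08:49:31.205",
    "username": "gus",
    "post_title": "placeholder",
    "views": 27479,
    "upvotes": 79,
    "comments": 18,
    "edited": "True",
}

print(discovered_dimensions(sample_event, ["username", "edited"]))
```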

Now incorporate this adaptation into the overall ingestion specification by running the next cell.

In [33]:
dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

{
     "type": "kafka",
     "spec": {
          "ioConfig": {
               "type": "kafka",
               "consumerProperties": {
                    "bootstrap.servers": "kafka:9092"
               },
               "topic": "example-social-dimensions",
               "inputFormat": {
                    "type": "json"
               },
               "useEarliestOffset": "false"
          },
          "tuningConfig": {
               "type": "kafka"
          },
          "dataSchema": {
               "dataSource": "example-social-dimensions",
               "timestampSpec": {
                    "column": "time",
                    "format": "iso"
               },
               "dimensionsSpec": {
                    "dimensionExclusions": [
                         "username",
                         "edited"
                    ]
               },
               "granularitySpec": {
                    "rollup": "false",
                    "segmentGranularity": "hour"
  

Submit the revised specification for this table to Druid by running the next cell.

In [34]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

200


Run the next cell to see the effect on the most recently ingested rows.

In [42]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT __time, username, views, upvotes, comments
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1S/{time_now}')
ORDER BY __time DESC
'''

print("Rows ingested with `username` excluded:")
display.sql(sql)

Rows ingested with `username` excluded:


__time,username,views,upvotes,comments
2024-06-27T09:09:36.906Z,,1389,68,4
2024-06-27T09:09:36.905Z,,7311,66,16
2024-06-27T09:09:36.904Z,,2278,69,6
2024-06-27T09:09:36.904Z,,1520,38,0
2024-06-27T09:09:36.904Z,,1622,56,5
2024-06-27T09:09:36.903Z,,11258,44,12
2024-06-27T09:09:36.903Z,,34264,62,4
2024-06-27T09:09:36.903Z,,955,78,9
2024-06-27T09:09:36.903Z,,849,80,11
2024-06-27T09:09:36.900Z,,1353,69,11


Notice that `username` is empty in these rows because it is now excluded at ingestion time.



## Filter data using an equality filter

In this section you will use an [equality filter](https://druid.apache.org/docs/latest/querying/filters#equality-filter) to create a table that only contains records where someone searches for a product.

Run the following cell to get a preview of the data that we want to have in our new table.

In [None]:
sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE "event_type" = 'search'
LIMIT 10
'''

print(sql)

display.sql(sql)

Run the next cell to create a new object that represents the [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec). This will be added to the dataSchema in the ingestion specification, instructing Druid to apply a filter to the incoming data as it arrives.

Here, only one [filter](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#filter) will be applied to the data as it arrives.

* The `type` of `selector` looks for an exact match.
* The check will be against the `dimension` of `event_type`, looking for a `value` of "search".

Only rows that pass this test will be added to the table.

In [None]:
dataSchema_transformSpec = {
    "filter":
    {
              "type": "selector",
              "dimension": "event_type",
              "value": "search"
    }
}
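
In plain Python terms, a `selector` filter behaves like an equality test on one field. The sketch below applies the same check to a few made-up rows. The rows and the `matches_selector` helper are hypothetical and only illustrate the semantics described above.

```python
def matches_selector(row, filter_spec):
    """Return True when the row's dimension equals the filter's value."""
    return row.get(filter_spec["dimension"]) == filter_spec["value"]

selector_filter = {"type": "selector", "dimension": "event_type", "value": "search"}

# Made-up rows for illustration.
rows = [
    {"event_type": "search", "client_country": "NZ"},
    {"event_type": "login", "client_country": "FR"},
    {"client_country": "DE"},  # no event_type at all
]

kept = [row for row in rows if matches_selector(row, selector_filter)]
print(kept)
```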

Now run the following cell to build a new ingestion specification:

* A new table will be created, as set in `table_searches` and then used in the `dataSource` name in the `dataSchema`.
* The `dataSchema` is rebuilt using the new table name, the existing `timestampSpec`, `dimensionsSpec`, and `granularitySpec`, and the new `transformSpec`.

In [None]:
table_searches = topic_name + "-search"

dataSchema = {
    "dataSource": table_searches,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))


Review the output above to see where the `transformSpec` has been added inside the `dataSchema`.

Submit the new specification for ingestions from this Apache Kafka topic by running the cell below. As well as submitting the new ingestion task, it will print "ready to go" when the table is ready for querying.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

druid.sql.wait_until_ready(table_searches, verify_load_status=False)
print("Ready to go!")

Run the query below to see the effect this has had on the data.

In [41]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT __time, username, views, upvotes, comments
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1S/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT __time, username, views, upvotes, comments
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1S/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)

2024-06-27T09:08:58
This data is filtered at ingestion time:


__time,username,views,upvotes,comments
2024-06-27T09:08:57.534Z,,25487,71,6
2024-06-27T09:08:57.533Z,,5500,44,4
2024-06-27T09:08:57.533Z,,2837,82,7
2024-06-27T09:08:57.532Z,,4926,53,5
2024-06-27T09:08:57.532Z,,2351,55,14
2024-06-27T09:08:57.531Z,,6086,70,7
2024-06-27T09:08:57.530Z,,8124,75,9
2024-06-27T09:08:57.529Z,,513,78,14
2024-06-27T09:08:57.529Z,,5850,100,11
2024-06-27T09:08:57.528Z,,2879,66,9


This data is unfiltered:


__time,username,views,upvotes,comments
2024-06-27T09:08:57.534Z,,25487,71,6
2024-06-27T09:08:57.533Z,,5500,44,4
2024-06-27T09:08:57.533Z,,2837,82,7
2024-06-27T09:08:57.532Z,,4926,53,5
2024-06-27T09:08:57.532Z,,2351,55,14
2024-06-27T09:08:57.531Z,,6086,70,7
2024-06-27T09:08:57.530Z,,8124,75,9
2024-06-27T09:08:57.529Z,,513,78,14
2024-06-27T09:08:57.529Z,,5850,100,11
2024-06-27T09:08:57.528Z,,2879,66,9


## Filter data using an in filter

In this section you will use an [in filter](https://druid.apache.org/docs/latest/querying/filters#in-filter) to create a table that only contains actions where someone adds or drops an item from their cart.

Run the following cell which executes SQL to give a preview of the data destined for the new table.

In [None]:
sql=f'''
SELECT
  "__time",
  "event_type"
FROM "{table_name}"
WHERE "event_type" IN ('add_to_cart', 'drop_from_cart')
LIMIT 10
'''

display.sql(sql)

Use an "explain" against a Druid SQL statement to view the native representation of the query. From here, you are able to pinpoint the specific filter that has been applied. You can use the Druid console to "explain" an SQL statement in the query tab.

Run the following cell to use the Druid API to explain the SQL query above.

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

Leveraging the `filter` section above, run the next cell to create a new [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec) that contains this filter.

In [None]:
dataSchema_transformSpec = {
    "filter": {
        "type": "in",
        "dimension": "event_type",
        "values": [
          "add_to_cart",
          "drop_from_cart"
        ]
      }
}
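
The `in` filter is the set-membership counterpart of the SQL `IN` clause. A minimal sketch of the same test in Python, using made-up event types and a hypothetical `matches_in` helper:

```python
def matches_in(row, filter_spec):
    """Return True when the row's dimension is one of the listed values."""
    return row.get(filter_spec["dimension"]) in set(filter_spec["values"])

in_filter = {
    "type": "in",
    "dimension": "event_type",
    "values": ["add_to_cart", "drop_from_cart"],
}

# Made-up rows for illustration.
cart_rows = [
    {"event_type": "add_to_cart"},
    {"event_type": "view_cart"},
    {"event_type": "drop_from_cart"},
    {"event_type": "search"},
]

kept_types = [row["event_type"] for row in cart_rows if matches_in(row, in_filter)]
print(kept_types)
```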

Run this cell to build an `ingestionSpec` object, this time including the `transformSpec` above. It will also resurrect the `event_type` column to enable add and drop actions to be differentiated in the data.

In [None]:
table_cart = topic_name + "-cart"

dataSchema_dimensionsSpec = {
        "dimensions": [
          "user_id",
          "event_type",
          "client_ip",
          "client_device",
          "client_lang",
          "client_country",
          "referrer",
          "keyword",
          "product"
        ]
      }

dataSchema = {
    "dataSource": table_cart,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

Submit the new specification for ingestions from this Apache Kafka topic by running the cell below.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

druid.sql.wait_until_ready(table_cart, verify_load_status=False)
print("Ready to go!")

Run the query below to see the effect this has had on the data.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_cart}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)
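
The `TIME_IN_INTERVAL` argument above is an ISO 8601 interval written as a duration followed by an end time, so `PT1M/<end>` covers the minute leading up to `<end>`. The sketch below shows that arithmetic in Python. `interval_ending_at` is a hypothetical helper, and it expresses the duration in seconds (`PT60S`), which denotes the same length of time as `PT1M`.

```python
from datetime import datetime, timedelta

def interval_ending_at(end, seconds):
    """Return (interval_string, start) for a window of `seconds` ending at `end`."""
    end_text = end.strftime("%Y-%m-%dT%H:%M:%S")
    start = end - timedelta(seconds=seconds)
    return f"PT{seconds}S/{end_text}", start

interval, start = interval_ending_at(datetime(2024, 6, 27, 9, 8, 58), 60)
print(interval)
print(start.isoformat())
```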

Notice above that "view_cart" is not being ingested into the table.

Run the following cell to switch from an "in" type filter to a "[like](https://druid.apache.org/docs/29.0.1/querying/filters/#like-filter)" filter.

In [None]:
dataSchema_transformSpec = {
    "filter": {
        "type": "like",
        "dimension": "event_type",
        "pattern" : "%cart%"
      }
}

dataSchema = {
    "dataSource": table_cart,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

druid.sql.wait_until_ready(table_cart, verify_load_status=False)
print("Ready to go!")
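
The `%cart%` pattern uses SQL `LIKE` wildcards: `%` matches any run of characters and `_` matches exactly one. The sketch below translates such a pattern into a regular expression to show which event types it would keep. It ignores escape characters, the event types are made up, and `like_to_regex` is a hypothetical helper rather than anything Druid exposes.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)

cart_pattern = like_to_regex("%cart%")
event_types = ["add_to_cart", "drop_from_cart", "view_cart", "search"]
matching = [name for name in event_types if cart_pattern.fullmatch(name)]
print(matching)
```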

Wait for a few moments for the old tasks to terminate, and for new tasks to start using this new configuration.

When done, run the next cell to see the effect.

Notice that, in the method used in this example, "view_cart" actions are only included from this point forward in the stream.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_cart}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)

## Clean up

Run the following cell to stop the data generator, stop ingestion from the topic, and to remove the table used in this notebook from the database.

In [15]:
# The Data Generator, not Druid, serves the /stop endpoint.
print(f"Stop streaming generator: [{requests.post(f'{datagenUrl}/stop/{topic_name}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{topic_name}/suspend","")}]')

print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

tasks = druid.tasks.tasks(state='running', table=table_searches)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_searches)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_searches}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_searches}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_searches)}]")

tasks = druid.tasks.tasks(state='running', table=table_cart)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_cart)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_cart}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_cart}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_cart)}]")

Stop streaming generator: [<Response [404]>]
Pause streaming ingestion: [<Response [404]>]
Shutting down running tasks ...
Reset offsets for re-runnability: [<Response [404]>]
Terminate streaming ingestion: [<Response [404]>]
Drop datasource: [None]


NameError: name 'table_searches' is not defined
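
The three shutdown loops in the clean-up cell differ only in the table name, so they can be folded into one function. The sketch below is one way to do that, assuming only the two `druid.tasks` calls already used in this notebook: `tasks(state=..., table=...)` and `shut_down_task(id)`. `stop_running_tasks` and the `FakeTasksClient` stand-in used to exercise it are hypothetical.

```python
def stop_running_tasks(tasks_client, table):
    """Shut down running tasks for a table until none remain; return their ids."""
    stopped = []
    tasks = tasks_client.tasks(state="running", table=table)
    while len(tasks) > 0:
        for task in tasks:
            tasks_client.shut_down_task(task["id"])
            stopped.append(task["id"])
        tasks = tasks_client.tasks(state="running", table=table)
    return stopped


class FakeTasksClient:
    """In-memory stand-in so the sketch runs without a Druid cluster."""

    def __init__(self, running):
        self.running = dict(running)  # task id -> table name

    def tasks(self, state=None, table=None):
        return [{"id": task_id} for task_id, name in self.running.items() if name == table]

    def shut_down_task(self, task_id):
        self.running.pop(task_id, None)


fake = FakeTasksClient({"task-a": "example-cart", "task-b": "example-search"})
print(stop_running_tasks(fake, "example-cart"))
```

In the notebook you would pass `druid.tasks` together with each table variable that has actually been defined, which also avoids the `NameError` shown above when a section was skipped.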

## Summary

* Filters can be applied to data from Apache Kafka as soon as it arrives.
* Typical SQL WHERE filtering has native counterparts that you can use as a filter in the `transformSpec`.
* Unless the topic offset is reset manually, filters only apply to new data as it arrives.

## Learn more

* Try using [logical expression filters](https://druid.apache.org/docs/latest/querying/filters#logical-expression-filters) to add AND and OR conditions in your filters.
* Read about more advanced filters, such as [regular expression](https://druid.apache.org/docs/latest/querying/filters#regular-expression-filter) and [expression](https://druid.apache.org/docs/latest/querying/filters#expression-filter) filters.
* Check out the notebook on transforming data at ingestion time using [expressions](13-native-transforms.ipynb) and then combine what you've learned here with an [extraction filter](https://druid.apache.org/docs/latest/querying/filters#extraction-filter).
* Re-run this notebook, but manually hard reset the supervisor between posting a new ingestion specification. You can do this either with a [POST](https://druid.apache.org/docs/latest/api-reference/supervisor-api#reset-a-supervisor) or [through the console](https://druid.apache.org/docs/latest/operations/web-console#supervisors). What do you expect to happen?
* Refer to the documentation on [native transform expressions](https://druid.apache.org/docs/latest/querying/math-expr).