# Filtering incoming stream data using native functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

During streaming ingestion, you can filter incoming data using Apache Druid native filters within the ingestion spec. This tutorial demonstrates how to apply native [filters](https://druid.apache.org/docs/latest/querying/filters) to a stream of events.

In this tutorial you perform the following tasks:

- Set up a streaming ingestion from Apache Kafka.
- Create two alternative tables from the same topic that contain filtered versions of the source data.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Set up a connection to Apache Kafka

Run the next cell to set up the connection to Apache Kafka.

In [None]:
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host="localhost:9092"
else:
    kafka_host=f"{os.environ['KAFKA_HOST']}:9092"
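
The host lookups above share one pattern: read an environment variable, fall back to `localhost`, and append a port. A minimal sketch of that pattern follows. `service_address` is a hypothetical helper, not part of `druidapi`, and it assumes that Kafka's `bootstrap.servers` takes a bare `host:port` pair while the Druid router is addressed by URL.

```python
import os

def service_address(env_var, port, scheme=None, default_host="localhost"):
    """Build an address from an environment variable, falling back to default_host."""
    host = os.environ.get(env_var, default_host)
    address = f"{host}:{port}"
    # HTTP services such as the Druid router need a scheme; Kafka brokers do not.
    return f"{scheme}://{address}" if scheme else address

print(service_address("DRUID_HOST", 8888, scheme="http"))
print(service_address("KAFKA_HOST", 9092))
```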

### Set up a connection to the data generator

Run the next cell to set up the connection to the data generator.

In [None]:
import requests
import json

datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}

## Create a table using streaming ingestion

In this section, you use the data generator to generate a stream of messages into a Kafka topic. Next, you set up an ongoing ingestion into Druid.

### Use the data generator to populate a Kafka topic

Run the following cell to instruct the data generator to start producing data.

This creates clickstream sample data for an hour and publishes the data to a Kafka topic for Druid to consume from.

In [None]:
datagen_topic = "example-clickstream-filters"
datagen_job = datagen_topic
datagen_config = "clickstream/clickstream.json"

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic  },
    "config_file": datagen_config,
    "concurrency":10,
    "time_type": "REAL"
}

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)
requests.get(f"{datagen_host}/status/{datagen_job}").json()

### Use streaming ingestion to populate the table

Ingest data from the Kafka topic into Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

Run the next cell to set up the [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig), [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig), and [`dataSchema`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema) components of the ingestion spec and submit it to Druid to start the ingestion.

When finished, you will see the table description.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": "false" }

tuningConfig = { "type": "kafka" }

table_name = datagen_topic

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": "false", "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery" : "true"}
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
sql_client.wait_until_ready(table_name, verify_load_status=False)
display.table(table_name)
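
The cell above discards the response from the supervisor API, so a rejected spec may go unnoticed until `wait_until_ready` stalls. The following sketch shows one way to surface errors early. `submit_supervisor` is a hypothetical helper that takes any callable with the same signature as `requests.post`, and it assumes that a successful submission returns a JSON body containing the supervisor `id`.

```python
import json

def submit_supervisor(post, druid_host, spec, headers):
    """POST an ingestion spec and return the supervisor ID, raising on failure.

    `post` is any callable with the same signature as `requests.post`.
    """
    response = post(
        f"{druid_host}/druid/indexer/v1/supervisor",
        json.dumps(spec),
        headers=headers,
    )
    # With requests, this raises HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    # Assumption: on success the API returns the supervisor ID as {"id": "..."}.
    return response.json().get("id")
```

In this notebook, the call would look like `submit_supervisor(requests.post, druid_host, ingestion_spec, druid_headers)`.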

## Filter data using an equality filter

In this section, you use an [equality filter](https://druid.apache.org/docs/latest/querying/filters#equality-filter) to create a table that only contains records where someone searches for a product.

Run the following cell to get a preview of the data that we want to have in our new table.

In [None]:
sql=f'''
SELECT
  "event_type",
  COUNT(*) AS events
FROM "{table_name}"
GROUP BY 1
'''

display.sql(sql)

Run the next cell to create a new object that represents the [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec). This will be added to the `dataSchema` in the ingestion specification, instructing Druid to apply a filter to the incoming data as it arrives.

Here, only one [filter](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#filter) will be applied to the data as it arrives.

* The `type` of `selector` looks for an exact match.
* The check will be against the `dimension` of `event_type`, looking for a `value` of "search".

Only rows that pass this test will be added to the table.

In [None]:
dataSchema_transformSpec = {
    "filter":
        {
            "type": "selector",
            "dimension": "event_type",
            "value": "search"
        }
    }
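
To make the effect of this filter concrete, the sketch below applies the same test to a few rows in plain Python. It approximates the filter's behavior and is not Druid's implementation: `selector_matches` is a hypothetical helper, the sample events are invented, and null handling is ignored.

```python
def selector_matches(row, spec):
    """Return True when the row's dimension value equals the filter's value."""
    # Values are compared as strings here; a missing dimension never matches.
    value = row.get(spec["dimension"])
    return value is not None and str(value) == spec["value"]

selector = {"type": "selector", "dimension": "event_type", "value": "search"}

# Hypothetical sample events; real clickstream rows carry more fields.
events = [
    {"event_type": "search", "client_device": "mobile"},
    {"event_type": "add_to_cart", "client_device": "desktop"},
    {"event_type": "search", "client_device": "tablet"},
]

kept = [e for e in events if selector_matches(e, selector)]
print(len(kept), "of", len(events), "rows pass the filter")
```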

Run the next cell to build a new ingestion specification.

Notice the new table name, `table_searches`, and that while the `timestampSpec`, `granularitySpec`, and `dimensionsSpec` remain the same, a `transformSpec` is added.

In [None]:
table_searches = table_name + "-search"

dataSchema = {
    "dataSource": table_searches,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": "false", "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery" : "true"},
    "transformSpec" : dataSchema_transformSpec
    }

ingestion_spec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestion_spec, indent=5))

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
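
The ingestion specs in this notebook differ only in their data source name and `transformSpec`. As a sketch, the repeated dictionaries could be produced by a single function. `build_kafka_spec` is a hypothetical helper, and the `ioConfig` values in the demonstration are placeholders rather than the notebook's live settings.

```python
import json

def build_kafka_spec(datasource, io_config, tuning_config, transform_spec=None):
    """Assemble a Kafka supervisor spec, adding a transformSpec only when given."""
    data_schema = {
        "dataSource": datasource,
        "timestampSpec": {"column": "time", "format": "iso"},
        "granularitySpec": {"rollup": "false", "segmentGranularity": "hour"},
        "dimensionsSpec": {"useSchemaDiscovery": "true"},
    }
    if transform_spec is not None:
        data_schema["transformSpec"] = transform_spec
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": io_config,
            "tuningConfig": tuning_config,
            "dataSchema": data_schema,
        },
    }

# Placeholder ioConfig for illustration only.
demo_io = {"type": "kafka", "topic": "example-clickstream-filters"}
demo_filter = {"filter": {"type": "selector", "dimension": "event_type", "value": "search"}}
demo_spec = build_kafka_spec("example-clickstream-filters-search", demo_io, {"type": "kafka"}, demo_filter)
print(json.dumps(demo_spec["spec"]["dataSchema"], indent=2))
```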

Review the output above to see where the `transformSpec` and its `filter` have been added inside the `dataSchema`.

Run the next cell to confirm the table has been populated before moving on.

In [None]:
sql_client.wait_until_ready(table_searches, verify_load_status=False)
display.table(table_searches)

Run the cell below a few times.

You will see that events continue being ingested into the original table, but that the new table only contains new events that match the filter.

In [None]:
from datetime import datetime

time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_searches}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

print("This data is being filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

print("This data is not being filtered:")
display.sql(sql)
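
The `TIME_IN_INTERVAL` argument in the queries above is an ISO 8601 interval in duration/end form: `PT15S/<end>` covers the 15 seconds leading up to `<end>`. Note that `datetime.now()` returns local time with no offset. Assuming Druid's default SQL time zone of UTC, a kernel running in another time zone could produce a window that misses recent rows. The sketch below builds the interval from UTC instead; `recent_interval` and `interval_bounds` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

def recent_interval(seconds, now=None):
    """Return an ISO 8601 duration/end interval covering the last `seconds` seconds."""
    end = now or datetime.now(timezone.utc)
    return f"PT{seconds}S/{end.strftime('%Y-%m-%dT%H:%M:%S')}"

def interval_bounds(seconds, end):
    """Return the explicit (start, end) datetimes that the interval spans."""
    return end - timedelta(seconds=seconds), end

# A fixed timestamp keeps the example reproducible.
fixed = datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
print(recent_interval(15, fixed))  # PT15S/2024-01-01T12:00:30
start, end = interval_bounds(15, fixed)
print(start.isoformat(), "to", end.isoformat())
```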

Run the next cell to stop the ingestion job and drop the table.

In [None]:
tasks = druid.tasks.tasks(state='running', table=table_searches)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_searches)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_searches}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_searches}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_searches)}]")

## Filter data using an in filter

In this section, you use an [in filter](https://druid.apache.org/docs/latest/querying/filters#in-filter) to create a table that only contains actions where someone interacts with their cart.

Run the following cell, which executes SQL to preview the data destined for the new table.

In [None]:
sql=f'''
SELECT
  "__time",
  "event_type"
FROM "{table_name}"
WHERE "event_type" IN ('add_to_cart', 'view_cart')
LIMIT 10
'''

display.sql(sql)

Using [EXPLAIN PLAN](https://druid.apache.org/docs/latest/querying/sql#explain-plan), it's possible to see the native representation of any Druid SQL statement, allowing you to pinpoint reusable elements for native queries or ingestion.

Run the next cell to use the Druid API to execute an EXPLAIN PLAN statement for the SQL query above.

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))
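
The plan printed above is a JSON array in which each element wraps a native query. As a sketch, the `filter` object can be pulled out programmatically instead of being copied by hand. `extract_filters` is a hypothetical helper, and the sample plan below is hand-written and abbreviated for illustration; the actual output depends on the Druid version and may use a different filter type.

```python
import json

def extract_filters(plan_json):
    """Return the `filter` object of each native query in an EXPLAIN PLAN result."""
    plan = json.loads(plan_json)
    return [
        entry["query"]["filter"]
        for entry in plan
        if "filter" in entry.get("query", {})
    ]

# Hypothetical, abbreviated plan for illustration only.
sample_plan = json.dumps([
    {
        "query": {
            "queryType": "scan",
            "dataSource": {"type": "table", "name": "example-clickstream-filters"},
            "filter": {
                "type": "in",
                "dimension": "event_type",
                "values": ["add_to_cart", "view_cart"],
            },
            "limit": 10,
        }
    }
])

for native_filter in extract_filters(sample_plan):
    print(json.dumps(native_filter, indent=2))
```

Against the live cluster, the input would be `sql_client.explain_sql(sql)['PLAN']`, as used in the cell above.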

Leveraging the `filter` section above, run the next cell to create a new [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec) that contains this filter.

In [None]:
dataSchema_transformSpec = {
    "filter": {
        "type": "in",
        "dimension": "event_type",
        "values": [
          "add_to_cart",
          "view_cart"
        ]
      }
    }

Run this cell to build an `ingestion_spec` object, this time including the `transformSpec` above.

In [None]:
table_cart = table_name + "-cart"

dataSchema = {
    "dataSource": table_cart,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": "false", "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery" : "true"},
    "transformSpec" : dataSchema_transformSpec
    }

ingestion_spec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
sql_client.wait_until_ready(table_cart, verify_load_status=False)
display.table(table_cart)

Run the query below to see the effect this has had on the data.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_cart}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

print("This data is being filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

print("This data is not being filtered:")
display.sql(sql)

Taking a look back at the original data, you may notice we have missed an event type!

Run the following cell to switch from an "in" type filter to a "[like](https://druid.apache.org/docs/29.0.1/querying/filters/#like-filter)" filter to catch the missing `event_type`, `drop_from_cart`.

In [None]:
dataSchema_transformSpec = {
    "filter": {
        "type": "like",
        "dimension": "event_type",
        "pattern" : "%cart%"
      }
}

dataSchema = {
    "dataSource": table_cart,
    "timestampSpec": { "column": "time", "format": "iso" },
    "granularitySpec": { "rollup": "false", "segmentGranularity": "hour" },
    "dimensionsSpec": { "useSchemaDiscovery" : "true"},
    "transformSpec" : dataSchema_transformSpec
    }

ingestion_spec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
sql_client.wait_until_ready(table_cart, verify_load_status=False)
display.table(table_cart)
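
To see why the `%cart%` pattern catches all three cart events, the sketch below translates a LIKE pattern into a regular expression and tests it against the event types mentioned in this notebook. It is an approximation for illustration, not Druid's implementation: `like_to_regex` is a hypothetical helper, and escape characters are not handled.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern into an anchored regular expression."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")  # % matches any run of characters, including none
        elif char == "_":
            parts.append(".")   # _ matches exactly one character
        else:
            parts.append(re.escape(char))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)

cart_pattern = like_to_regex("%cart%")
event_types = ["search", "add_to_cart", "view_cart", "drop_from_cart"]
matching = [e for e in event_types if cart_pattern.match(e)]
print(matching)  # ['add_to_cart', 'view_cart', 'drop_from_cart']
```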

Wait for a few moments for the old consumers to [stop and hand off their data](https://druid.apache.org/docs/latest/design/storage#indexing-and-handoff), and for the supervisor to start new consumer tasks with the new configuration.

Run the next cell to see the effect.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_cart}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)

Adjust the TIME_IN_INTERVAL filters above to cover different time periods.

Notice that, with the method used in this example, the filter takes effect only from this point forward: historical `drop_from_cart` events are not captured.

## Clean up

Run the following cell to stop the data generator, stop ingestion from the topic, and remove the remaining tables used in this notebook from the database.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

tasks = druid.tasks.tasks(state='running', table=table_cart)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_cart)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_cart}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_cart}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_cart)}]")
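
The clean-up steps above repeat the same sequence for each table. As a sketch, they could be wrapped in one function. `stop_and_drop` is a hypothetical helper that uses only the client calls and endpoints already shown in this notebook; `post` is any callable with the same signature as `requests.post`.

```python
def stop_and_drop(table, druid, post, druid_host):
    """Shut down running tasks, reset and terminate the supervisor, then drop the table."""
    tasks = druid.tasks.tasks(state="running", table=table)
    while len(tasks) > 0:
        for task in tasks:
            print(f"...stopping task [{task['id']}]")
            druid.tasks.shut_down_task(task["id"])
        # Poll again until no running tasks remain for this table.
        tasks = druid.tasks.tasks(state="running", table=table)

    supervisor = f"{druid_host}/druid/indexer/v1/supervisor/{table}"
    post(f"{supervisor}/reset", "")      # reset offsets for re-runnability
    post(f"{supervisor}/terminate", "")  # terminate streaming ingestion
    druid.datasources.drop(table)        # drop the datasource
```

With this in place, the clean-up reduces to calling `stop_and_drop(table, druid, requests.post, druid_host)` for each of `table_name` and `table_cart`.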

## Summary

* Filters can be applied to data from Apache Kafka as soon as it arrives.
* Typical SQL WHERE filtering has native counterparts that you can use as a filter in the `transformSpec`.
* Unless the topic offset is reset manually, filters only apply to new data as it arrives.

## Learn more

* Try using [logical expression filters](https://druid.apache.org/docs/latest/querying/filters#logical-expression-filters) to add AND and OR conditions in your filters.
* Read about more advanced filters, such as [regular expression](https://druid.apache.org/docs/latest/querying/filters#regular-expression-filter) and [expression](https://druid.apache.org/docs/latest/querying/filters#expression-filter) filters.
* Check out the notebook on transforming data at ingestion time using [expressions](13-native-transforms.ipynb) and then combine what you've learned here with an [extraction filter](https://druid.apache.org/docs/latest/querying/filters#extraction-filter).
* Re-run this notebook, but manually hard reset the supervisor each time you post a new ingestion specification. You can do this either with a [POST](https://druid.apache.org/docs/latest/api-reference/supervisor-api#reset-a-supervisor) request or [through the console](https://druid.apache.org/docs/latest/operations/web-console#supervisors). What do you expect to happen?
* Review the documentation on [native transform expressions](https://druid.apache.org/docs/latest/querying/math-expr).
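
As a starting point for the first suggestion in the list above, the sketch below combines two filters from this notebook with a logical `or`, so that a single table would receive both search and cart events. It assumes the native `or` filter takes its child filters in a `fields` list. The combination is illustrative and is not part of the tutorial's own flow.

```python
import json

search_filter = {"type": "selector", "dimension": "event_type", "value": "search"}
cart_filter = {"type": "like", "dimension": "event_type", "pattern": "%cart%"}

# An "or" filter passes a row when any of its child filters matches.
dataSchema_transformSpec = {
    "filter": {"type": "or", "fields": [search_filter, cart_filter]}
}

print(json.dumps(dataSchema_transformSpec, indent=2))
```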