# Filtering incoming stream data using native functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

During streaming ingestion, you can filter incoming data by adding a native filter to the `transformSpec`. This tutorial demonstrates how to apply native [filters](https://druid.apache.org/docs/latest/querying/filters) to a stream of events.

In this tutorial you perform the following tasks:

- Set up a streaming ingestion from Apache Kafka.
- Create two alternative tables from the same topic that contain filtered versions of the source data.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import requests
from datetime import datetime

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Run the next cell to set up the connection to Apache Kafka and to the Data Generator, and to import some helper functions for later in the tutorial.

In [None]:
import requests
import json
import os
import kafka
from kafka import KafkaConsumer

datagenUrl = "http://datagen:9999"

generalHeaders = {'Content-Type': 'application/json'}

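# Use .get() so that a missing KAFKA_HOST falls back to the default instead of raising KeyError.
if os.environ.get('KAFKA_HOST') is None:
    kafka_host = "kafka:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"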

### Start a data stream

Run the following cell to use the learn-druid Data Generator to create a stream that we can consume from.

This will create clickstream sample data for an hour and publish it to a topic in Apache Kafka for Apache Druid to consume from.

In [None]:
topic_name = "example-clickstream-filters"

job_name="example_clickstream"

target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": topic_name
}

datagen_request = {
    "name": topic_name,
    "target": target,
    "config_file": "clickstream/clickstream.json",
    "time": "1h",
    "concurrency":10,
    "time_type": "REAL"
}

requests.post(f"{datagenUrl}/start", json.dumps(datagen_request), headers=generalHeaders)

Run the next cell to see a sample of the raw data being emitted from the Data Generator.

This cell uses a simple consumer to subscribe to the topic and to show the first 5 rows that appear.

In [None]:
consumer = KafkaConsumer(
    bootstrap_servers=kafka_host
)

consumer.subscribe(topics=[topic_name])
count = 0

# Print the first 5 messages, then stop reading.
for message in consumer:
    print("%d:%d: v=%s" % (message.partition,
                           message.offset,
                           message.value))
    count += 1
    if count == 5:
        break

consumer.unsubscribe()
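Each `message.value` printed above is a raw byte string. The following is a minimal sketch of decoding one into a Python dictionary, assuming the payload is UTF-8 encoded JSON. The sample record and the `decode_event` helper are hypothetical; the record only borrows column names used later in this tutorial, and real values will differ.

```python
import json

# Hypothetical payload shaped like a clickstream event; values are illustrative only.
raw_value = (
    b'{"time": "2024-01-01T00:00:00.000", "user_id": "1234", '
    b'"event_type": "search", "client_country": "Norway"}'
)

def decode_event(value: bytes) -> dict:
    """Decode a Kafka message value, assumed to be UTF-8 encoded JSON."""
    return json.loads(value.decode("utf-8"))

event = decode_event(raw_value)
print(event["event_type"])
```

Inspecting a decoded event in this way can make it easier to decide which columns and values to filter on.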

## Build a native ingestion specification

Run the following cell to create an [ioConfig](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig) object that sets the connection to the topic from Apache Druid.

In [None]:
ioConfig = {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "kafka:9092"
      },
      "topic": topic_name,
      "inputFormat": {
        "type": "json"
      },
      "useEarliestOffset": "false"
    }

Run the next cell to create a very simple [tuningConfig](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig) object for the tuning configuration for the ingestion.

In [None]:
tuningConfig = { "type": "kafka" }

Run the next cell to create a series of objects that will ultimately be used to define the table through a [dataSchema](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). These include:

* A [timestampSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) which uses the `time` column from the generated data as the primary timestamp.
* A [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec). In this example, we manually specify all of the columns from the incoming data.
* A [granularitySpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec) which puts data into daily segment files without any ingestion-time aggregation ([rollup](https://druid.apache.org/docs/latest/ingestion/rollup)).

In the final statement, the `dataSchema` is built from these three objects.

In [None]:
table_name = topic_name

dataSchema_timestampSpec = {
    "column": "time",
    "format": "iso"
    }

dataSchema_dimensionsSpec = {
        "dimensions": [
          "user_id",
          "event_type",
          "client_ip",
          "client_device",
          "client_lang",
          "client_country",
          "referrer",
          "keyword",
          "product"
        ]
      }

dataSchema_granularitySpec = {
        "queryGranularity": "none",
        "rollup": "false",
        "segmentGranularity": "day"
      }

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

Now run the next cell to create the final native [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec).

In [None]:
ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

Start the ingestion of the raw data from Apache Kafka by submitting this object to Apache Druid by running the cell below.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

Run the following cell to wait until the ingestion has started and the new table is ready for query.

In [None]:
druid.sql.wait_until_ready(table_name, verify_load_status=False)
print("Ready to go!")

## Filter data using an equality filter

In this section you will use an [equality filter](https://druid.apache.org/docs/latest/querying/filters#equality-filter) to create a table that only contains records where someone searches for a product.

Run the following cell to get a preview of the data that we want to have in our new table.

In [None]:
sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE "event_type" = 'search'
LIMIT 10
'''

print(sql)

display.sql(sql)

Run the next cell to create a new object that represents the [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec). This will be added to the dataSchema in the ingestion specification, instructing Druid to apply a filter to the incoming data as it arrives.

Here, only one [filter](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#filter) will be applied to the data as it arrives.

* The `type` of `selector` looks for an exact match.
* The check will be against the `dimension` of `event_type`, looking for a `value` of "search".

Only rows that pass this test will be added to the table.

In [None]:
dataSchema_transformSpec = {
    "filter":
    {
              "type": "selector",
              "dimension": "event_type",
              "value": "search"
    }
}
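To build intuition for what this filter keeps, here is a minimal local sketch that approximates the exact-match test in plain Python. It does not call Druid, it ignores null handling, and the `selector_matches` helper and the sample rows are hypothetical.

```python
def selector_matches(row: dict, dimension: str, value: str) -> bool:
    """Approximate a selector filter: keep the row only on an exact match."""
    return row.get(dimension) == value

# Hypothetical rows; only the event_type values come from this tutorial.
sample_rows = [
    {"event_type": "search", "keyword": "shoes"},
    {"event_type": "view_cart"},
    {"event_type": "add_to_cart", "product": "socks"},
]

selector_filter = {"type": "selector", "dimension": "event_type", "value": "search"}

kept = [
    row for row in sample_rows
    if selector_matches(row, selector_filter["dimension"], selector_filter["value"])
]
print(kept)
```

Only the row whose `event_type` is exactly `search` survives, which mirrors the `WHERE "event_type" = 'search'` preview query.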

Now run the following cell to build a new ingestion specification:

* A new table will be created, as set in `table_searches` and then used in the `dataSource` name in the `dataSchema`.
* The `dataSchema` is rebuilt using the new table name, the same `timestampSpec`, `dimensionsSpec`, and `granularitySpec` as before, and the new `transformSpec`.

In [None]:
table_searches = topic_name + "-search"

dataSchema = {
    "dataSource": table_searches,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))


Review the output above and you will see where the `transformSpec`, with its `filter`, has been added inside the `dataSchema`.
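This notebook assembles the same `dataSchema` and `ingestionSpec` structure several times. One way to cut the repetition is a small helper function. The sketch below is an assumption about how you might organize the code, not part of the Druid API; `build_kafka_ingestion_spec` is a hypothetical name, and the values passed to it are shortened stand-ins so that the block runs on its own.

```python
import json

def build_kafka_ingestion_spec(data_source, io_config, tuning_config,
                               timestamp_spec, dimensions_spec,
                               granularity_spec, transform_spec=None):
    """Assemble a Kafka supervisor spec; the transformSpec is optional."""
    data_schema = {
        "dataSource": data_source,
        "timestampSpec": timestamp_spec,
        "dimensionsSpec": dimensions_spec,
        "granularitySpec": granularity_spec,
    }
    # Only add a transformSpec when a filter (or transform) is supplied.
    if transform_spec is not None:
        data_schema["transformSpec"] = transform_spec
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": io_config,
            "tuningConfig": tuning_config,
            "dataSchema": data_schema,
        },
    }

# Stand-in values so that the sketch is self-contained.
example_spec = build_kafka_ingestion_spec(
    data_source="example-clickstream-filters-search",
    io_config={"type": "kafka", "topic": "example-clickstream-filters"},
    tuning_config={"type": "kafka"},
    timestamp_spec={"column": "time", "format": "iso"},
    dimensions_spec={"dimensions": ["user_id", "event_type"]},
    granularity_spec={"segmentGranularity": "day"},
    transform_spec={
        "filter": {"type": "selector", "dimension": "event_type", "value": "search"}
    },
)
print(json.dumps(example_spec, indent=2))
```

In the notebook itself you would pass the `ioConfig`, `tuningConfig`, and `dataSchema_*` objects defined earlier in place of the stand-ins.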

Submit the new specification for ingestion from this Apache Kafka topic by running the cell below. As well as submitting the new supervisor specification, it prints "Ready to go!" when the table is ready for querying.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

druid.sql.wait_until_ready(table_searches, verify_load_status=False)
print("Ready to go!")

Run the query below to see the effect this has had on the data.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_searches}"
WHERE TIME_IN_INTERVAL(__time,'PT30S/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT30S/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)
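The queries above limit results to the most recent 30 seconds by passing an ISO 8601 interval in "duration/end" form to `TIME_IN_INTERVAL`. As a minimal sketch, the interval string could be built by a small helper instead of by string concatenation. The `recent_interval` function is hypothetical, and a fixed timestamp is used here only so that the output is repeatable.

```python
from datetime import datetime

def recent_interval(end: datetime, seconds: int) -> str:
    """Return an ISO 8601 'duration/end' interval covering the last `seconds` seconds."""
    return f"PT{seconds}S/{end.strftime('%Y-%m-%dT%H:%M:%S')}"

# Fixed end time for illustration; the notebook uses datetime.now() instead.
fixed_end = datetime(2024, 1, 1, 12, 0, 0)
interval = recent_interval(fixed_end, 30)

where_clause = f"WHERE TIME_IN_INTERVAL(__time, '{interval}')"
print(where_clause)
```

Reusing the same interval string in both queries keeps the filtered and unfiltered results aligned to the same window of time.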

## Filter data using an in filter

In this section you will use an [in filter](https://druid.apache.org/docs/latest/querying/filters#in-filter) to create a table that only contains actions where someone adds or drops an item from their cart.

Run the following cell which executes SQL to give a preview of the data destined for the new table.

In [None]:
sql=f'''
SELECT
  "__time",
  "event_type"
FROM "{table_name}"
WHERE "event_type" IN ('add_to_cart', 'drop_from_cart')
LIMIT 10
'''

display.sql(sql)

Use an "explain" against a Druid SQL statement to view the native representation of the query. From here, you are able to pinpoint the specific filter that has been applied. You can use the Druid console to "explain" an SQL statement in the query tab.

Run the following cell to use the Druid API to explain the SQL query above.

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

Leveraging the `filter` section above, run the next cell to create a new [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec) that contains this filter.

In [None]:
dataSchema_transformSpec = {
    "filter": {
        "type": "in",
        "dimension": "event_type",
        "values": [
          "add_to_cart",
          "drop_from_cart"
        ]
      }
}
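An `in` filter can be read as shorthand for an `or` of exact matches, one per listed value. The sketch below illustrates that reading with a small local evaluator. It is a simplified model that ignores null handling and does not call Druid; the `matches` function and the sample rows are hypothetical.

```python
in_filter = {
    "type": "in",
    "dimension": "event_type",
    "values": ["add_to_cart", "drop_from_cart"],
}

# The same condition written as an "or" of selector filters.
or_filter = {
    "type": "or",
    "fields": [
        {"type": "selector", "dimension": in_filter["dimension"], "value": value}
        for value in in_filter["values"]
    ],
}

def matches(row: dict, flt: dict) -> bool:
    """Evaluate a small subset of native filter types against one row."""
    kind = flt["type"]
    if kind == "selector":
        return row.get(flt["dimension"]) == flt["value"]
    if kind == "in":
        return row.get(flt["dimension"]) in flt["values"]
    if kind == "or":
        return any(matches(row, field) for field in flt["fields"])
    raise ValueError(f"Unsupported filter type in this sketch: {kind}")

# Hypothetical rows using event types that appear in this tutorial.
sample_rows = [
    {"event_type": "add_to_cart"},
    {"event_type": "drop_from_cart"},
    {"event_type": "view_cart"},
    {"event_type": "search"},
]

for row in sample_rows:
    print(row["event_type"], matches(row, in_filter), matches(row, or_filter))
```

Both forms agree on every sample row; the `in` form is simply more compact when several values are listed for one dimension.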

Run this cell to build an `ingestionSpec` object, this time including the `transformSpec` above. The `dimensionsSpec` is restated with `event_type` included so that add and drop actions can be differentiated in the data.

In [None]:
table_cart = topic_name + "-cart"

dataSchema_dimensionsSpec = {
        "dimensions": [
          "user_id",
          "event_type",
          "client_ip",
          "client_device",
          "client_lang",
          "client_country",
          "referrer",
          "keyword",
          "product"
        ]
      }

dataSchema = {
    "dataSource": table_cart,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

print(json.dumps(ingestionSpec, indent=5))

Submit the new specification for ingestion from this Apache Kafka topic by running the cell below.

In [None]:
supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

druid.sql.wait_until_ready(table_cart, verify_load_status=False)
print("Ready to go!")

Run the query below to see the effect this has had on the data.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_cart}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)

Notice above that "view_cart" is not being ingested into the table.

Run the following cell to switch from an "in" type filter to a "[like](https://druid.apache.org/docs/29.0.1/querying/filters/#like-filter)" filter.

In [None]:
dataSchema_transformSpec = {
    "filter": {
        "type": "like",
        "dimension": "event_type",
        "pattern" : "%cart%"
      }
}

dataSchema = {
    "dataSource": table_cart,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestionSpec = {
  "type": "kafka",
  "spec": {
    "ioConfig": ioConfig,
    "tuningConfig": tuningConfig,
    "dataSchema": dataSchema
  }
}

supervisor = requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestionSpec), headers=generalHeaders)
print(supervisor.status_code)

druid.sql.wait_until_ready(table_cart, verify_load_status=False)
print("Ready to go!")
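In a like pattern, `%` matches any number of characters and `_` matches any single character. To preview which event types the `%cart%` pattern would admit, the following minimal sketch translates a like pattern into a Python regular expression. It is a local approximation only: it ignores the optional escape character, `like_to_regex` is a hypothetical helper, and `login` is an invented event type included as a non-matching example.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any number of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

cart_pattern = like_to_regex("%cart%")

# "login" is hypothetical; the other event types appear in this tutorial.
event_types = ["add_to_cart", "drop_from_cart", "view_cart", "search", "login"]
matched = [e for e in event_types if cart_pattern.fullmatch(e)]
print(matched)
```

The `view_cart` event type now passes the test, which the earlier `in` filter excluded.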

Wait for a few moments for the old tasks to terminate, and for new tasks to start using this new configuration.

When done, run the next cell to see the effect.

Notice that, with the method used in this example, `view_cart` actions are only included from this point forward in the stream.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_cart}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is filtered at ingestion time:")
display.sql(sql)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT1M/''' + time_now + '''')
ORDER BY __time DESC
'''

print("This data is unfiltered:")
display.sql(sql)

## Clean up

Run the following cell to stop the data generator, stop ingestion from the topic, and remove the tables used in this notebook from the database.

In [None]:
# The stop request goes to the Data Generator, not to Druid.
print(f"Stop streaming generator: [{requests.post(f'{datagenUrl}/stop/{topic_name}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{topic_name}/suspend","")}]')

print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

tasks = druid.tasks.tasks(state='running', table=table_searches)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_searches)
        
print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_searches}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_searches}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_searches)}]")

tasks = druid.tasks.tasks(state='running', table=table_cart)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_cart)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_cart}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{table_cart}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_cart)}]")

## Summary

* Filters can be applied to data from Apache Kafka as soon as it arrives.
* Typical SQL WHERE filtering has native counterparts that you can use as a filter in the `transformSpec`.
* Unless the topic offset is reset manually, a new filter only applies to new data as it arrives.

## Learn more

* Try using [logical expression filters](https://druid.apache.org/docs/latest/querying/filters#logical-expression-filters) to add AND and OR conditions in your filters.
* Read about more advanced filters, such as [regular expression](https://druid.apache.org/docs/latest/querying/filters#regular-expression-filter) and [expression](https://druid.apache.org/docs/latest/querying/filters#expression-filter) filters.
* Check out the notebook on transforming data at ingestion time using [expressions](13-native-transforms.ipynb) and then combine what you've learned here with an [extraction filter](https://druid.apache.org/docs/latest/querying/filters#extraction-filter).
* Re-run this notebook, but manually hard reset the supervisor each time you post a new ingestion specification. You can do this either with a [POST](https://druid.apache.org/docs/latest/api-reference/supervisor-api#reset-a-supervisor) or [through the console](https://druid.apache.org/docs/latest/operations/web-console#supervisors). What do you expect to happen?
* Refer to the documentation on [native transform expressions](https://druid.apache.org/docs/latest/querying/math-expr).
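
As a starting point for the logical expression filters suggested above, here is a minimal sketch of how the filters from this tutorial might be combined using `and` and `not`. The country condition, including the value `Norway`, is hypothetical and only illustrates the structure; this `transformSpec` has not been run as part of this notebook.

```python
import json

# The "in" filter used earlier in this tutorial.
cart_filter = {
    "type": "in",
    "dimension": "event_type",
    "values": ["add_to_cart", "drop_from_cart"],
}

# Hypothetical condition: the value "Norway" is illustrative only.
country_filter = {
    "type": "selector",
    "dimension": "client_country",
    "value": "Norway",
}

# Keep cart additions and removals, except those matching the country condition.
combined_transform_spec = {
    "filter": {
        "type": "and",
        "fields": [
            cart_filter,
            {"type": "not", "field": country_filter},
        ],
    }
}

print(json.dumps(combined_transform_spec, indent=2))
```

An `and` or `or` filter takes a list of filters in `fields`, while a `not` filter wraps a single filter in `field`.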