# Transforming incoming stream data using native functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

During streaming ingestion, you can transform incoming data using Apache Druid native functions within the `transformSpec`. This tutorial demonstrates how to apply these functions as [transforms](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transforms) from a stream of events.

In this tutorial, you perform the following tasks:

- Set up a streaming ingestion from Apache Kafka.
- Apply some example transformations to update incoming data.
- Create a new dimension using data from other dimensions.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells prepare the notebook and learning environment for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

if 'DRUID_HOST' not in os.environ.keys():
    druid_host=f"http://localhost:8888"
else:
    druid_host=f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Set up a connection to Apache Kafka

Run the next cell to set up the connection to Kafka.

In [None]:
# Kafka bootstrap servers are plain host:port pairs, so no URL scheme is used.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

### Set up a connection to the Data Generator

Run the next cell to set up the connection to the Data Generator.

In [None]:
import requests
import json

datagen_host = "http://datagen:9999"
datagen_headers = {'Content-Type': 'application/json'}

## Create a table using streaming ingestion

In this section, you use the data generator to generate a stream of messages into a Kafka topic. Next, you set up an ongoing ingestion into Druid.

### Use the data generator to populate a Kafka topic

Run the following cell to instruct the data generator to start producing data.

This creates clickstream sample data for an hour and publishes it to a Kafka topic for Druid to consume from.

In [None]:
datagen_topic = "example-clickstream-transforms"
datagen_job = f"{datagen_topic}"
# Generator configuration that produces the clickstream sample data.
datagen_config = "clickstream/clickstream.json"

datagen_request = {
    "name": datagen_job,
    "target": { "type": "kafka", "endpoint": kafka_host, "topic": datagen_topic },
    "config_file": datagen_config,
    "time": "1h",
    "concurrency":10,
    "time_type": "REAL"
}

print(datagen_request)

requests.post(f"{datagen_host}/start", json.dumps(datagen_request), headers=datagen_headers)

In [None]:
requests.get(f"{datagen_host}/status/{datagen_job}").json()

### Use streaming ingestion to populate the table

Ingest data from the Kafka topic into Druid by submitting an [ingestion specification](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html) to the [streaming ingestion supervisor API](https://druid.apache.org/docs/latest/api-reference/supervisor-api).

Run the next cell to set up the [`ioConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#ioconfig) and [`tuningConfig`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#tuningconfig) properties of the ingestion specification. These configurations connect to the same Kafka host used by the data generator and consume the JSON data being published to the data generator topic.

In [None]:
ioConfig = {
    "type": "kafka",
    "consumerProperties": { "bootstrap.servers": kafka_host },
    "topic": datagen_topic,
    "inputFormat": { "type": "json" },
    "useEarliestOffset": "true" }

tuningConfig = { "type": "kafka" }

Run the next cell to create a series of objects that will ultimately be used to define the table through [dataSchema](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dataschema). These include:

* [timestampSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec): uses the `time` column from the generated data as the primary timestamp.
* [dimensionsSpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec): specifies all of the columns from the incoming data.
* [granularitySpec](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec): puts data into daily segment files without any ingestion-time aggregation ([rollup](https://druid.apache.org/docs/latest/ingestion/rollup)).

In the final statement, the `dataSchema` is built from these three objects.

In [None]:
dataSchema_timestampSpec = { "column": "time", "format": "iso" }
dataSchema_granularitySpec = { "rollup": "false", "segmentGranularity": "day" }
dataSchema_dimensionsSpec = {
        "dimensions": [
          "user_id",
          "event_type",
          "client_ip",
          "client_device",
          "client_lang",
          "client_country",
          "referrer",
          "keyword",
          "product"
        ]
      }

table_name = datagen_topic

dataSchema = {
      "dataSource": table_name,
      "timestampSpec": dataSchema_timestampSpec,
      "dimensionsSpec": dataSchema_dimensionsSpec,
      "granularitySpec": dataSchema_granularitySpec
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

Run the next cell to submit the ingestion spec to Druid and start ingesting raw data from Kafka.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
display.table(f'{table_name}')

## Transform data using string functions

In this section, you use a [native expression](https://druid.apache.org/docs/latest/querying/math-expr) to transform new data as it arrives. While this example uses a string function, the same mechanism applies to other native functions, including numeric, IP, and date and time functions.

Take a look at the current data by running the following SQL query.

In [None]:
sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
LIMIT 10
'''

display.sql(sql)

Add a [`transformSpec`](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec) object to the ingestion specification to turn all country names into upper case as data arrives.

Run the following cell to add an "upper" [expression](https://druid.apache.org/docs/latest/querying/math-expr#string-functions) to a collection (`transforms`) inside a new object that will then be incorporated into the ingestion specification.

* The upper `expression` is calculated using `client_country`.
* The `name` instructs Druid to write the result back into `client_country`.

In [None]:
dataSchema_transformSpec = {
    "transforms":
    [
        {
            "type": "expression",
            "name": "client_country",
            "expression": "upper(client_country)"
        }
    ]
}
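To get a feel for what this transform does to each event, the sketch below emulates it in plain Python. It is only a local illustration: Druid evaluates the expression itself during ingestion, and the helper name `apply_upper_transform` and the sample row are made up for this example.

```python
# Local emulation only; Druid evaluates the real expression during ingestion.
def apply_upper_transform(row, source="client_country", name="client_country"):
    """Return a copy of the row with `name` set to the upper-cased `source` value."""
    out = dict(row)
    value = row.get(source)
    # Assumption for this sketch: missing or non-string values pass through unchanged.
    out[name] = value.upper() if isinstance(value, str) else value
    return out

# Made-up sample event, not taken from the generator output.
sample_row = {"event_type": "view_product", "client_country": "New Zealand"}
transformed = apply_upper_transform(sample_row)
print(transformed)
```

Because `name` and `source` are the same here, the upper-cased value replaces the original, which mirrors how the transform above overwrites `client_country`.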

Now run this cell to rebuild the `ingestion_spec` object, this time including the `transformSpec` object above.

In [None]:
dataSchema = {
    "dataSource": table_name,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

print(json.dumps(ingestion_spec, indent=5))

Review the output above to see where the `transforms` have been added within the `dataSchema`.
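Since this notebook rebuilds the specification several times, one option is to wrap the assembly in a small helper. The following is a minimal sketch, not part of the notebook: the function name `build_kafka_spec` and the placeholder values passed to it are assumptions for illustration.

```python
import json

def build_kafka_spec(table, timestamp_spec, dimensions_spec, granularity_spec,
                     io_config, tuning_config, transform_spec=None):
    """Assemble a Kafka supervisor spec, adding transformSpec only when one is given."""
    data_schema = {
        "dataSource": table,
        "timestampSpec": timestamp_spec,
        "dimensionsSpec": dimensions_spec,
        "granularitySpec": granularity_spec,
    }
    if transform_spec is not None:
        data_schema["transformSpec"] = transform_spec
    return {
        "type": "kafka",
        "spec": {
            "ioConfig": io_config,
            "tuningConfig": tuning_config,
            "dataSchema": data_schema,
        },
    }

# Placeholder values for illustration; in the notebook you would pass the existing objects.
example_spec = build_kafka_spec(
    table="example-table",
    timestamp_spec={"column": "time", "format": "iso"},
    dimensions_spec={"dimensions": ["client_country"]},
    granularity_spec={"rollup": "false", "segmentGranularity": "day"},
    io_config={"type": "kafka", "topic": "example-topic"},
    tuning_config={"type": "kafka"},
    transform_spec={"transforms": [
        {"type": "expression", "name": "client_country",
         "expression": "upper(client_country)"}
    ]},
)
print(json.dumps(example_spec, indent=2))
```

Keeping `transform_spec` optional lets the same helper produce both the initial specification and the later ones that include transforms.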

Run the next cell to submit the new specification for ingestion from this Kafka topic.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
display.table(f'{table_name}')

Wait for a moment or two. This allows Druid to apply the new configuration for the ingestion by shutting down the existing task and starting a new ingestion task with the transforms.
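Rather than waiting a fixed amount of time, you could poll the supervisor status endpoint. The sketch below is a hedged illustration: the helper `wait_for_supervisor_state` is hypothetical, it assumes the status response carries the state under `payload.state`, and it is exercised here with simulated responses instead of a live call. A `RUNNING` supervisor does not by itself guarantee that rows from the new task are already queryable.

```python
import time

def wait_for_supervisor_state(fetch_status, target="RUNNING", timeout_s=60.0, poll_s=2.0):
    """Poll fetch_status() until payload.state equals target; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Assumed response shape: {"payload": {"state": "..."}}
        state = fetch_status().get("payload", {}).get("state")
        if state == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

# With the notebook's variables, fetch_status could be (not run here):
# lambda: requests.get(f"{druid_host}/druid/indexer/v1/supervisor/{table_name}/status").json()

# Simulated responses stand in for the API in this sketch.
fake_responses = iter([
    {"payload": {"state": "PENDING"}},
    {"payload": {"state": "RUNNING"}},
])
reached = wait_for_supervisor_state(lambda: next(fake_responses), poll_s=0.0)
print(reached)
```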

Then, run the following query to see the effect this has had on the data.

In [None]:
from datetime import datetime, timedelta

time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(time_now)

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

display.sql(sql)

Since the new ingestion specification continued where the old one finished, the function has only been applied to new data.

Run the following cell to see the values for `client_country` in data from five minutes ago.

In [None]:
time_then = (datetime.now() -  timedelta(hours=0, minutes=5)).strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT
  "__time",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_then}')
ORDER BY __time DESC
'''

display.sql(sql)
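Both queries above filter on an ISO 8601 interval written as a duration followed by an end time, such as `PT15S/<end>`. The sketch below shows one way to build that string. The helper `trailing_interval` is hypothetical, and it converts the end time to UTC on the assumption that Druid reads a timestamp without a zone as UTC, which matters if the notebook runs on a machine whose local time is not UTC.

```python
from datetime import datetime, timedelta, timezone

def trailing_interval(end, seconds=15):
    """Return an ISO 8601 'duration/end' interval covering the seconds before `end`, in UTC."""
    # A naive datetime is treated as local time by astimezone().
    end_utc = end.astimezone(timezone.utc)
    return f"PT{seconds}S/{end_utc.strftime('%Y-%m-%dT%H:%M:%S')}"

now_utc = datetime.now(timezone.utc)
print(trailing_interval(now_utc))                         # the most recent 15 seconds
print(trailing_interval(now_utc - timedelta(minutes=5)))  # 15 seconds ending five minutes ago
```

The resulting string can be placed inside `TIME_IN_INTERVAL(__time, '...')` in the same way as `time_now` and `time_then` above.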

## Use CASE to generate NULL values

In this section, you use a case function to catch raw data that has a placeholder value of "None" or "unknown" and replace that value with NULL.

* `case_searched()` checks whether the value of `keyword` is "None". If it is, the expression returns null; otherwise, it returns the existing value.
* The same method is applied to `product` and, with "unknown" as the placeholder, to `referrer`.
* Because each `name` matches the dimension being evaluated, the result overwrites the original value in that dimension.
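The logic of these three transforms can be sketched in plain Python. This is only a local emulation for illustration: the helper `null_if` and the sample row are made up, and Druid evaluates the actual `case_searched` expressions during ingestion.

```python
def null_if(value, sentinel):
    """Mimic case_searched(value == sentinel, null, value) for a single field."""
    return None if value == sentinel else value

# Placeholder value to replace with null, per dimension.
sentinels = {"keyword": "None", "product": "None", "referrer": "unknown"}

# Made-up sample event for illustration.
sample_row = {"keyword": "None", "product": "Widget", "referrer": "unknown"}

cleaned = {
    field: null_if(value, sentinels[field]) if field in sentinels else value
    for field, value in sample_row.items()
}
print(cleaned)
```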

Run the next cell to change the `transformSpec` object, rebuild the `dataSchema`, and build the final `ingestion_spec`.

In [None]:
dataSchema_transformSpec = {
    "transforms":
    [
        {
            "type": "expression",
            "name": "keyword",
            "expression": "case_searched((\"keyword\" == 'None'),null,\"keyword\")"
        },
        {
            "type": "expression",
            "name": "product",
            "expression": "case_searched((\"product\" == 'None'),null,\"product\")"
        },
        {
            "type": "expression",
            "name": "referrer",
            "expression": "case_searched((\"referrer\" == 'unknown'),null,\"referrer\")"
        }
    ]
}

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

print(json.dumps(ingestion_spec, indent=5))

Run the next cell to submit the updated ingestion specification.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
display.table(f'{table_name}')

Wait for a moment or two. This allows Druid to apply the new configuration for the ingestion by shutting down the existing task and starting a new ingestion task with the transforms.

Run the following cell a number of times to see how it affects the data in the table.

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT
  "__time",
  "keyword",
  "referrer",
  "product"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

display.sql(sql)

## Use a function to add a new column

Since all the examples above use the same `name` as an existing column, the supplied `expression` overwrites the data in that column. In this example, you use a new column name to create a new dimension from the incoming data.

This is a two-step process: first, add the new column to the table schema, then set the function to evaluate.

Run the next cell to achieve this.

* The `dimensionsSpec` has a new dimension called "time_friendly".
* The `transformSpec` has a new transform that applies a date formatting function.
* The `dataSchema` is then built from these new objects.
* The `ingestion_spec` is then built up from the new `dataSchema`.

In [None]:
dataSchema_dimensionsSpec = {
        "dimensions": [
          "user_id",
          "event_type",
          "client_ip",
          "client_device",
          "client_lang",
          "client_country",
          "referrer",
          "keyword",
          "product",
          "time_friendly"
        ]
      }

dataSchema_transformSpec = {
    "transforms":
    [
        {
            "type": "expression",
            "name": "keyword",
            "expression": "case_searched((\"keyword\" == 'None'),null,\"keyword\")"
        },
        {
            "type": "expression",
            "name": "product",
            "expression": "case_searched((\"product\" == 'None'),null,\"product\")"
        },
        {
            "type": "expression",
            "name": "referrer",
            "expression": "case_searched((\"referrer\" == 'unknown'),null,\"referrer\")"
        },
        {
            "type": "expression",
            "name": "time_friendly",
            "expression": "timestamp_format(\"__time\",'E dd MM yyyy','UTC')"
        }
    ]
}

dataSchema = {
    "dataSource": table_name,
    "timestampSpec": dataSchema_timestampSpec,
    "transformSpec" : dataSchema_transformSpec,
    "dimensionsSpec": dataSchema_dimensionsSpec,
    "granularitySpec": dataSchema_granularitySpec
    }

ingestion_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": ioConfig,
        "tuningConfig": tuningConfig,
        "dataSchema": dataSchema
    }
}

print(json.dumps(ingestion_spec, indent=5))
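To preview the kind of value the `time_friendly` transform produces, the sketch below approximates the `'E dd MM yyyy'` pattern with Python's `strftime`. It is an approximation for illustration only: the helper name is made up, the mapping of the pattern letters to `%a %d %m %Y` is an assumption, and `%a` depends on the Python locale.

```python
from datetime import datetime, timezone

def time_friendly(epoch_millis):
    """Approximate timestamp_format(__time, 'E dd MM yyyy', 'UTC') for a millisecond timestamp."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    # Assumed mapping: E -> %a (short day name), dd -> %d, MM -> %m, yyyy -> %Y
    return moment.strftime("%a %d %m %Y")

# __time is held as milliseconds since the epoch, so build an example value that way.
example_millis = int(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc).timestamp() * 1000)
print(time_friendly(example_millis))
```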

Run the next cell to submit this new ingestion specification for the topic.

In [None]:
requests.post(f"{druid_host}/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=druid_headers)
druid.sql.wait_until_ready(table_name, verify_load_status=False)
display.table(f'{table_name}')

While waiting for the new ingestion to be applied, try this trick for finding the exact expression to put into your ingestion specifications.

* Go to [http://localhost:8888/](http://localhost:8888/) to open the Druid console.
* Switch to the query view and build a SQL statement against the table using a function you know well.
* Instead of running the SQL, use the button with the three dots to find "Explain SQL query".
* Notice the "virtual columns" section contains the equivalent native expression.
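A similar plan can be requested from code by prefixing a statement with `EXPLAIN PLAN FOR` and posting it to the Druid SQL API. The sketch below only builds the request body; the helper `explain_payload` is hypothetical, and the commented-out call assumes the `druid_host` variable from earlier in this notebook.

```python
import json

def explain_payload(sql):
    """Wrap a SQL statement so that Druid returns its query plan instead of results."""
    return {"query": f"EXPLAIN PLAN FOR {sql.strip()}"}

payload = explain_payload(
    'SELECT UPPER("client_country") FROM "example-clickstream-transforms"'
)
print(json.dumps(payload, indent=2))

# To send it with the notebook's variables (not run here):
# requests.post(f"{druid_host}/druid/v2/sql", json=payload).json()
```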

Now run this cell to see the result of your updated ingestion:

In [None]:
time_now = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT
  "__time",
  "time_friendly"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_now}')
ORDER BY __time DESC
'''

display.sql(sql)

Run the next cell to see that the same query for an older time period returns NULL values.

In [None]:
time_then = (datetime.now() -  timedelta(hours=0, minutes=5)).strftime('%Y-%m-%dT%H:%M:%S')

sql=f'''
SELECT
  "__time",
  "time_friendly"
FROM "{table_name}"
WHERE TIME_IN_INTERVAL(__time,'PT15S/{time_then}')
ORDER BY __time DESC
'''

display.sql(sql)

In the example above, a new column was created from the primary timestamp. As noted in the [ingestion spec reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec), you can use the same mechanism to overwrite the timestamp (`__time`). For example, you can convert a non-standard datetime representation into a compliant timestamp before the data lands in the database.
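As a rough illustration of overwriting the timestamp, the sketch below defines a transform whose `name` is `__time`. The input column `raw_time` and its `dd/MM/yyyy HH:mm:ss` layout are hypothetical and do not exist in the clickstream data. Because Druid applies the `timestampSpec` before any transforms, the `timestampSpec` would still need to yield a usable timestamp for each row.

```python
import json

# Hypothetical example: "raw_time" and its date layout are not part of this tutorial's data.
time_transform = {
    "type": "expression",
    "name": "__time",  # using __time as the name replaces the row's primary timestamp
    "expression": "timestamp_parse(\"raw_time\",'dd/MM/yyyy HH:mm:ss','UTC')"
}

example_transformSpec = {"transforms": [time_transform]}
print(json.dumps(example_transformSpec, indent=2))
```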

## Clean up

Run the following cell to stop the data generator, stop ingestion from the topic, and remove the table used in this notebook from the database.

In [None]:
print(f"Stop streaming generator: [{requests.post(f'{datagen_host}/stop/{datagen_job}','')}]")
print(f'Pause streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/suspend","")}]')

print(f'Shutting down running tasks ...')

tasks = druid.tasks.tasks(state='running', table=table_name)
while len(tasks)>0:
    for task in tasks:
        print(f"...stopping task [{task['id']}]")
        druid.tasks.shut_down_task(task['id'])
    tasks = druid.tasks.tasks(state='running', table=table_name)

print(f'Reset offsets for re-runnability: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/reset","")}]')
print(f'Terminate streaming ingestion: [{requests.post(f"{druid_host}/druid/indexer/v1/supervisor/{datagen_topic}/terminate","")}]')
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

## Summary

* You can apply functions to Kafka data as soon as it arrives.
* SQL functions have native counterparts that you can use as a transform in `transformSpec`.
* Expressions can be used to overwrite data or to create new columns.
* Unless the topic offset is reset manually, expressions only apply to new data as it arrives.

## Learn more

* Try to use some more native functions such as numeric, IP, and more complex string functions.
* Re-run this notebook, but manually hard reset the supervisor each time you post a new ingestion specification. You can do this either with a [POST](https://druid.apache.org/docs/latest/api-reference/supervisor-api#reset-a-supervisor) request or [through the console](https://druid.apache.org/docs/latest/operations/web-console#supervisors). What do you expect to happen?
* Review the documentation on [native transform expressions](https://druid.apache.org/docs/latest/querying/math-expr).