# Enriching and updating data using Kafka lookup tables
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

[Lookups](https://druid.apache.org/docs/latest/querying/lookups) are [key/value-pair tables](https://druid.apache.org/docs/latest/querying/datasource#lookup) that Druid broadcasts to query processes and that you can update regularly, either manually or automatically. In this notebook, you extend your knowledge of these tables to lookups populated from an [Apache Kafka topic](https://druid.apache.org/docs/latest/querying/kafka-extraction-namespace) and walk through some simple queries.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

Before running through this notebook, you may want to familiarise yourself with key concepts by running through the [general lookups notebook](./06-lookup-tables.ipynb).

The extension for Kafka lookup tables, [`druid-kafka-extraction-namespace`](https://druid.apache.org/docs/latest/querying/kafka-extraction-namespace), must be added to the extensions load list prior to being used. In the learning environment, this has been added for you.

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Fall back to localhost when DRUID_HOST is not set.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Set up a connection to Apache Kafka

Run the next cell to set up the connection to Apache Kafka.

In [None]:
# Kafka bootstrap servers are host:port pairs, not URLs, so there is no "http://" prefix.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

## Create a table using batch ingestion

Run the following cell to create a table using batch ingestion.

The same principles in this notebook also apply to tables receiving events from stream sources.

When completed, you'll see a description of the final table.

In [None]:
table_name = 'example-flights-kafkalookup'

sql='''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Reporting_Airline",
  "Tail_Number",
  "Distance",
  "Origin",
  "Dest"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready(table_name)
display.table(table_name)

## Create a lookup table

Run the following cell to create some helper functions for using the lookup API.

* You will use the `postLookup` function to call the [lookup configuration API](https://druid.apache.org/docs/latest/api-reference/lookups-api) to create and update lookup tables.
* The `waitForLookup` function will be used to give you feedback on Druid's progress in [distributing](https://druid.apache.org/docs/latest/querying/lookups#configuration-propagation-behavior) the lookup table around the query-serving processes in Druid.

In [None]:
import requests
import time

def postLookup(definition):
    x = requests.post(f"{druid_host}/druid/coordinator/v1/lookups/config", json=definition)

    if "error" in x.text:
        raise Exception('Not able to complete the request. \n\n'+x.text)
    else:
        print('Successfully submitted the lookup request.')

def waitForLookup(tier, name, ticsMax):

    # The default time period between checks of lookup definition changes (druid.manager.lookups.period)
    # is two minutes. The notebook environment reduces this for learning purposes.
    # 
    # https://druid.apache.org/docs/latest/configuration/#lookups-dynamic-configuration

    tics = 0
    ticsWait = 1    
    ticsMax = min(ticsMax,360)
    ticsSpinner = "/-\\|"
    
    apicall = f"{druid_host}/druid/coordinator/v1/lookups/status/{tier}/{name}?detailed=true"

    x = requests.get(apicall)

    while (x.text != '{"loaded":true,"pendingNodes":[]}' and tics < ticsMax):
        print(f"{x.text} {ticsSpinner[tics%len(ticsSpinner)]} [ {str(ticsMax-tics)} ]     ", end='\r')
        time.sleep(ticsWait)
        tics += 1
        x = requests.get(apicall) 

    if (tics == ticsMax):
        raise Exception(f"\nTimeout waiting for Druid to load the {name} lookup to {tier} tier. Run the cell again.")
    else:
        print(f"\nSuccess. {name} lookup in {tier} tier is fully available.")
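
The `waitForLookup` helper compares the status response to one exact string, so it depends on key order and whitespace. As a minimal sketch of a more tolerant check, the following parses the body as JSON. It assumes the response has the shape used in the comparison above, with `loaded` and `pendingNodes` fields. The name `is_lookup_loaded` is hypothetical and not part of any Druid client.

```python
import json

def is_lookup_loaded(status_text):
    """Return True when a status body reports the lookup as loaded with no pending nodes.

    Hypothetical helper: assumes a JSON object with "loaded" and "pendingNodes" fields.
    A missing or empty "pendingNodes" is treated as "nothing pending".
    """
    try:
        status = json.loads(status_text)
    except json.JSONDecodeError:
        # Anything that is not valid JSON counts as "not loaded yet".
        return False
    if not isinstance(status, dict):
        return False
    return status.get("loaded") is True and not status.get("pendingNodes")

# The exact string that waitForLookup expects.
print(is_lookup_loaded('{"loaded":true,"pendingNodes":[]}'))
# Same content with different key order and spacing.
print(is_lookup_loaded('{"pendingNodes": [], "loaded": true}'))
# Still propagating to one node (the node name is made up).
print(is_lookup_loaded('{"loaded":false,"pendingNodes":["example-node:8083"]}'))
```

Inside `waitForLookup`, the loop condition could then call `is_lookup_loaded(x.text)` rather than comparing strings.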

### Initialize lookups

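Run the following cell, which posts an empty JSON object to the configuration API. This initializes the lookup configuration, which must happen once before you create your first lookup.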

In [None]:
empty_post = {}
postLookup(empty_post)

### Generate some lookup values in a Kafka topic

Create events that Druid can consume into a lookup map.

Run the cell below to create a Kafka producer and send one event to a new topic, `example-lookup-airportsize`. The message carries a binary key and a binary value, which become a key and its value in the lookup map.

In [None]:
from kafka import KafkaProducer

kafka_producer = KafkaProducer(bootstrap_servers=kafka_host)
kafka_topic = "example-lookup-airportsize"

kafka_producer.send(kafka_topic, key=b"CLE", value=b"Medium Airport")
kafka_producer.flush()
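
The cell above writes the key and value as byte literals. When the entries start out as ordinary Python strings, they need encoding first. The following is a minimal sketch that assumes UTF-8, which matches the ASCII byte literals used above. The helper name `to_kafka_bytes` and the sample dictionary are hypothetical.

```python
def to_kafka_bytes(text):
    """Encode a string as UTF-8 bytes, passing None through unchanged."""
    return None if text is None else text.encode("utf-8")

# Hypothetical entries held as plain strings.
entries = {"CLE": "Medium Airport", "ABQ": "Large Airport"}

# Build (key, value) byte pairs ready to hand to a producer.
encoded = [(to_kafka_bytes(k), to_kafka_bytes(v)) for k, v in entries.items()]
print(encoded)
```

Each pair could then be sent with `kafka_producer.send(kafka_topic, key=k, value=v)`, followed by one `kafka_producer.flush()`.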

### Create a lookup table

In this section, you will create a `lookup_post` object that can then be posted to the API as JSON. It will contain:

* The [tier](https://druid.apache.org/docs/latest/querying/lookups#dynamic-configuration) to which the table belongs - this will be the standard '__default'.
* A name for the table.
* A definition of the lookup itself. In this case, a "kafka" type lookup.

In [None]:
from datetime import datetime, timezone

# Use UTC so that the trailing "Z" (the UTC designator) is accurate.
lookup_post_version = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
lookup_tier = "__default"
lookup_name = "example-flights-airportsizes"
lookup_type = "kafka"

lookup_post = {
    lookup_tier: {
        lookup_name: {
            "version": lookup_post_version,  
            "lookupExtractorFactory": {
                "type": lookup_type,
                "kafkaTopic":kafka_topic,
                "kafkaProperties":{ "bootstrap.servers":kafka_host }
            }
        }
    }
}

postLookup(lookup_post)
waitForLookup(lookup_tier, lookup_name, 60)
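
The `version` field is an arbitrary string, but when you update an existing lookup definition Druid expects a lexicographically higher version than the one already stored. A timestamp in the format used above satisfies this because later times sort after earlier ones as plain strings. The following sketch illustrates that property. The function name `make_lookup_version` and the sample dates are hypothetical.

```python
from datetime import datetime, timezone

def make_lookup_version(moment):
    """Format a timezone-aware datetime as a sortable UTC version string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Two made-up points in time, half an hour apart.
earlier = make_lookup_version(datetime(2024, 1, 5, 9, 30, 0, tzinfo=timezone.utc))
later = make_lookup_version(datetime(2024, 1, 5, 10, 0, 0, tzinfo=timezone.utc))

print(earlier, later)
# Plain string comparison: the later timestamp is the "higher" version.
print(earlier < later)
```

Because the field widths are fixed and ordered from year down to second, string order and time order agree.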

### Query the lookup

Run the following cell to see what data is in the lookup.

In [None]:
sql=f'''
SELECT *
FROM lookup."{lookup_name}"
'''

display.sql(sql)

## Add a new key to the lookup map

With the lookup posted and defined in Druid, run the next cell to send some new keys to the Kafka topic.

In [None]:
kafka_producer.send(kafka_topic, key=b"ABI", value=b"Medium Airport")
kafka_producer.send(kafka_topic, key=b"ABQ", value=b"Large Airport")

kafka_producer.flush()

Now run the query below to see the resulting map.

In [None]:
sql=f'''
SELECT *
FROM lookup."{lookup_name}"
'''

display.sql(sql)

As with other lookups, you can JOIN between standard and lookup tables at query time.

The cell below contains an example that you can run to see statistics about flights in a specific time period, enriched with the values from the Kafka lookup.

In [None]:
sql=f'''
SELECT
    b.v AS "airportSize",
    COUNT(DISTINCT a.Origin) AS "airports",
    COUNT(*) AS "flights",
    SUM(a.Distance) AS "totalDistance"
FROM "{table_name}" a
LEFT JOIN lookup."{lookup_name}" b ON a.Origin = b.k
WHERE TIME_IN_INTERVAL(__time,'2005-11-30T11:00:00/2015-11-30T08:00:00')
GROUP BY 1
'''

display.sql(sql)
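
Because the query uses a LEFT JOIN, flights from airports that have no key in the lookup are kept and grouped under a null `airportSize`. The following plain-Python sketch mimics that behavior on a few made-up rows. It is a conceptual model only, not how Druid executes the join, and all of the sample data is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for the flights table: (Origin, Distance).
flights = [("CLE", 300), ("CLE", 450), ("ABQ", 900), ("JFK", 1200)]

# Hypothetical lookup contents: key (k) -> value (v).
airport_sizes = {"CLE": "Medium Airport", "ABQ": "Large Airport"}

totals = defaultdict(lambda: {"airports": set(), "flights": 0, "totalDistance": 0})
for origin, distance in flights:
    # LEFT JOIN semantics: an unmatched key yields None instead of dropping the row.
    size = airport_sizes.get(origin)
    group = totals[size]
    group["airports"].add(origin)
    group["flights"] += 1
    group["totalDistance"] += distance

# (distinct airports, flights, total distance) per airport size.
summary = {
    size: (len(g["airports"]), g["flights"], g["totalDistance"])
    for size, g in totals.items()
}
print(summary)
```

The `None` group here corresponds to the null row in the query results, which covers every origin airport that has not yet been sent to the Kafka topic.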

## Send a new value to the lookup map

To replace the value of a key, post a new value for the key to the Kafka topic.

Run the cell below to change the value of "ABI" to "Large Airport" and to see the effect that this has on the data.

In [None]:
kafka_producer.send(kafka_topic, key=b"ABI", value=b"Large Airport")
kafka_producer.flush()

sql=f'''
SELECT
    b.v AS "airportSize",
    COUNT(DISTINCT a.Origin) AS "airports",
    COUNT(*) AS "flights",
    SUM(a.Distance) AS "totalDistance"
FROM "{table_name}" a
LEFT JOIN lookup."{lookup_name}" b ON a.Origin = b.k
WHERE TIME_IN_INTERVAL(__time,'2005-11-30T11:00:00/2015-11-30T08:00:00')
GROUP BY 1
'''

display.sql(sql)

## Remove a key from the lookup map

Run the cell below to add a new key and value to the map.

In [None]:
kafka_producer.send(kafka_topic, key=b"PIE", value=b"The best food ever!")
kafka_producer.flush()

sql=f'''
SELECT *
FROM lookup."{lookup_name}"
'''

display.sql(sql)

Oops! You have added a review of a foodstuff to the lookup of airport sizes!

Remove an entry from Druid's lookup map by sending a tombstone: a message with the same key and a null value (`None` in the Python client).

In [None]:
kafka_producer.send(kafka_topic, key=b"PIE", value=None)
kafka_producer.flush()

sql=f'''
SELECT *
FROM lookup."{lookup_name}"
'''

display.sql(sql)
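
Taken together, the messages sent in this notebook behave like a log replayed into a dictionary: a new key adds an entry, a repeated key overwrites its value, and a null value deletes the key. The following sketch models that with the same keys and values used above. It is a conceptual illustration that assumes UTF-8 keys and values, not Druid's actual implementation, and the name `apply_messages` is hypothetical.

```python
def apply_messages(messages):
    """Replay (key, value) byte pairs into a dict, treating a None value as a tombstone."""
    lookup = {}
    for key, value in messages:
        k = key.decode("utf-8")
        if value is None:
            # Tombstone: remove the key if it is present.
            lookup.pop(k, None)
        else:
            # Insert a new key or overwrite the existing value.
            lookup[k] = value.decode("utf-8")
    return lookup

# The messages sent in this notebook, in order.
messages = [
    (b"CLE", b"Medium Airport"),
    (b"ABI", b"Medium Airport"),
    (b"ABQ", b"Large Airport"),
    (b"ABI", b"Large Airport"),        # update
    (b"PIE", b"The best food ever!"),  # mistake
    (b"PIE", None),                    # tombstone
]

final_map = apply_messages(messages)
print(final_map)
```

After replaying all six messages, the sketch holds three entries, with `ABI` carrying its updated value and `PIE` gone.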

## Clean up

Run the following cell to remove the table and the lookup used in this notebook from Druid.

In [None]:
print(f"Drop datasource: [{druid.datasources.drop(table_name)}]")

x = requests.delete(f"{druid_host}/druid/coordinator/v1/lookups/config/{lookup_tier}/{lookup_name}")
print(f"Drop lookup: {x}")

## Summary

* This feature requires that the `druid-kafka-extraction-namespace` extension be loaded.
* Lookup maps can be automatically populated based on Kafka topics.
* Keys can be added by sending messages to the Kafka topic.
* Values can be updated by sending new versions of the key and value pair as events to the topic.
* Keys can be removed from the cached map by sending a tombstone (a message with a null value) to the topic.
* The values in the lookup table can be used in queries via a JOIN.

## Learn more

* Find out about other lookup features in the [general lookups notebook](./06-lookup-tables.ipynb).