# Using TABLE(APPEND) for UNION operations to address multiple TABLEs in the same query

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
While working with Druid, you may need to include rows from different tables in a single result set, or to treat multiple tables as a single input to a query. This notebook introduces the TABLE(APPEND) operator, and shows you how to use it for these purposes.

## Prerequisites

This tutorial works with Druid 30.0.0 or later.

#### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

Run the next cell to attempt a connection to Druid services. If successful, the output shows the Druid version number.

In [None]:
import druidapi
import os

# Use DRUID_HOST when it is set; otherwise fall back to localhost.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Finally, run the following cell to import the Python JSON module.

In [None]:
import json

## Using TABLE(APPEND) to concatenate tables

As the name suggests, TABLE(APPEND) appends the contents of one table to the end of another. TABLE(APPEND) respects the schemas of the incoming tables automatically, which makes it simpler to work with than the [UNION](https://druid.apache.org/docs/26.0.0/querying/datasource.html#union) operator, which requires you to specify the schemas for the tables explicitly.

When schemas differ, TABLE(APPEND) returns NULL for columns that exist in one table but not in the other. Druid also reconciles data types: when columns of the same name have different data types in different tables, Druid uses the most permissive data type so that the result set can accommodate all the data.
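
The NULL-filling behavior can be pictured with a small pure-Python sketch. It does not call Druid: `append_rows` is a hypothetical helper that only mimics the described semantics on lists of dictionaries, with `None` standing in for NULL. The sample rows are made up for illustration.

```python
def append_rows(*tables):
    """Concatenate row lists, filling columns missing from a table with None."""
    # Collect the union of column names, preserving first-seen order.
    columns = []
    for rows in tables:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Emit every row with the full column set; absent columns become None.
    return [
        {name: row.get(name) for name in columns}
        for rows in tables
        for row in rows
    ]


# Two made-up tables whose schemas differ in one column name.
table_a = [{"page": "Druid", "countryName": "Chile"}]
table_b = [{"page": "Apache", "country": "Peru"}]

combined = append_rows(table_a, table_b)
for row in combined:
    print(row)
```

Each output row carries all three columns, and the column that a source table lacks is `None`, which is the shape you should expect from TABLE(APPEND) when schemas differ.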

Run the following cell to prepare the SQL that ingests all the edits made by bots as a new datasource. Note the static column `tableName`, which is used later to determine which table a given record came from after appending.

In [None]:
sql='''
REPLACE INTO "example-wikipedia-table-append1" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  'table1' AS tableName,
  "isRobot",
  "channel",
  "page",
  "commentLength",
  "countryName",
  "user"
FROM "ext"
WHERE "isRobot"=TRUE
PARTITIONED BY DAY
'''

Run the following cell to start the ingestion and, once it is complete, query the new datasource.

In [None]:
display.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-table-append1')

sql = '''
SELECT *
FROM "example-wikipedia-table-append1"
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
ORDER BY __time LIMIT 5
'''

display.sql(sql)

Run the next cell to ingest another datasource, this time with all the edits made by people rather than bots, and then query it.


In [None]:
sql='''
REPLACE INTO "example-wikipedia-table-append2" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  'table2' AS tableName,
  "isRobot",
  "channel",
  "page",
  "commentLength",
  "countryName",
  "user"
FROM "ext"
WHERE "isRobot"=FALSE
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-table-append2')

sql = '''
SELECT *
FROM "example-wikipedia-table-append2"
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
ORDER BY __time LIMIT 5
'''

display.sql(sql)

You can now use TABLE(APPEND) to append results from one example table to results from the other.

The query in the following cell returns the ten earliest edits across both example tables. The `tableName` column indicates the source table for each record.

In [None]:
sql = '''
SELECT *
FROM TABLE(APPEND('example-wikipedia-table-append1','example-wikipedia-table-append2'))
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
ORDER BY __time LIMIT 10
'''

display.sql(sql)
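
When the list of tables is held in a Python variable, the TABLE(APPEND) clause can be assembled programmatically. The following is a minimal sketch: `table_append_clause` is a hypothetical helper, not part of `druidapi`, and it assumes that TABLE(APPEND) accepts any number of table names as single-quoted string literals.

```python
def table_append_clause(table_names):
    """Build a TABLE(APPEND(...)) clause from a list of datasource names."""
    if not table_names:
        raise ValueError("at least one table name is required")
    # Double any single quotes so each name stays a valid SQL string literal.
    quoted = ", ".join("'" + name.replace("'", "''") + "'" for name in table_names)
    return f"TABLE(APPEND({quoted}))"


tables = ["example-wikipedia-table-append1", "example-wikipedia-table-append2"]
clause = table_append_clause(tables)

append_sql = f'''
SELECT *
FROM {clause}
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
ORDER BY __time LIMIT 10
'''

print(append_sql)
```

You can pass the resulting `append_sql` string to `display.sql` in the same way as the hand-written query.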

Optionally, run the next cell to show the [EXPLAIN PLAN](https://druid.apache.org/docs/latest/querying/sql-translation#interpreting-explain-plan-output) for the query. The output contains one `query` execution plan, in addition to Druid's optimized plan for the outer query. A UNION operation would produce two query plans.

In [None]:
print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), indent=2))

Run the following cell to get the total number of edits made by humans and the total number made by bots. Without TABLE(APPEND), you would need to query each table separately or combine them with UNION. With TABLE(APPEND), a single GROUP BY query returns the total for each type of edit.

In [None]:
sql = '''
SELECT 
    "isRobot",
    "tableName",
    COUNT("tableName") AS "edits"
FROM TABLE(APPEND('example-wikipedia-table-append1','example-wikipedia-table-append2'))
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
GROUP BY 1, 2
ORDER BY 2 DESC
'''

display.sql(sql)
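
For comparison, the sketch below builds what the same aggregation might look like with a table-level UNION ALL, where each branch has to list its columns explicitly. It only constructs and prints the SQL string. It has not been run against Druid here, so treat the exact form as an assumption to verify against the UNION documentation for your Druid version.

```python
tables = ["example-wikipedia-table-append1", "example-wikipedia-table-append2"]
columns = ['"__time"', '"isRobot"', '"tableName"']

# Every UNION ALL branch selects the same explicit column list.
column_list = ", ".join(columns)
branches = "\n  UNION ALL\n".join(
    f'  SELECT {column_list} FROM "{name}"' for name in tables
)

union_sql = f'''
SELECT
    "isRobot",
    "tableName",
    COUNT("tableName") AS "edits"
FROM (
{branches}
)
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
GROUP BY 1, 2
ORDER BY 2 DESC
'''

print(union_sql)
```

The column list has to be repeated and kept in sync for every table, which is the bookkeeping that TABLE(APPEND) removes.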

## Handling disparate column names for the same data

You can use TABLE(APPEND) to append one table to another when columns have different names but hold the same data, for example "countryName" and "country" for the country of origin of an edit.

Run the following cell to ingest the Wikipedia example data again as two new tables, this time with the country of origin under a different column name in each.

In [None]:
sql='''
REPLACE INTO "example-wikipedia-table-append-country" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  'country' AS tableName,
  "isRobot",
  "channel",
  "page",
  "commentLength",
  "countryName" AS "country",
  "user"
FROM "ext"
WHERE "isRobot"=TRUE
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-table-append-country')
display.table('example-wikipedia-table-append-country')

sql='''
REPLACE INTO "example-wikipedia-table-append-country-name" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  'countryName' AS tableName,
  "isRobot",
  "channel",
  "page",
  "commentLength",
  "countryName",
  "user"
FROM "ext"
WHERE "isRobot"=FALSE
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-wikipedia-table-append-country-name')
display.table('example-wikipedia-table-append-country-name')

In [None]:
sql = '''
SELECT
    *
FROM TABLE(APPEND('example-wikipedia-table-append-country','example-wikipedia-table-append-country-name'))
WHERE "countryName" IS NOT NULL AND TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
ORDER BY __time LIMIT 10
'''

display.sql(sql)

Run the cell below to use [COALESCE](https://druid.apache.org/docs/latest/querying/math-expr/#:~:text=still%20be%20null\).-,coalesce,-coalesce(exprs)%20returns) to combine the two columns with different names into a single column.

In [None]:
sql = '''
SELECT
    "tableName",
    COALESCE("countryName","country") AS "countryName"
FROM TABLE(APPEND('example-wikipedia-table-append-country','example-wikipedia-table-append-country-name'))
WHERE COALESCE("countryName","country") IS NOT NULL AND TIME_IN_INTERVAL(__time, '2016-06-27/2016-06-28')
ORDER BY __time LIMIT 10
'''

display.sql(sql)
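
To see what COALESCE contributes, here is a pure-Python sketch of the same idea applied to rows shaped like the appended result. It does not call Druid: the `coalesce` function and the sample rows are illustrative stand-ins, with `None` playing the role of NULL.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Made-up rows: each source table fills only its own country column.
rows = [
    {"tableName": "country", "country": "Chile", "countryName": None},
    {"tableName": "countryName", "country": None, "countryName": "Peru"},
    {"tableName": "countryName", "country": None, "countryName": None},
]

merged = [
    {
        "tableName": row["tableName"],
        "countryName": coalesce(row["countryName"], row["country"]),
    }
    for row in rows
]

for row in merged:
    print(row)
```

Rows from either table end up with the country in a single `countryName` field, and a row stays empty only when both source columns are empty.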

## Clean up

Run the following cell to remove tables created for this exercise.

In [None]:
druid.datasources.drop("example-wikipedia-table-append1")
druid.datasources.drop("example-wikipedia-table-append2")
druid.datasources.drop("example-wikipedia-table-append-country")
druid.datasources.drop("example-wikipedia-table-append-country-name")

## Conclusion

* `TABLE(APPEND)` is an alternative to `UNION` in Druid for combining the contents of tables
* `UNION` creates two queries in native JSON whereas `TABLE(APPEND)` only creates one
* `TABLE(APPEND)` automatically respects incoming schemas whereas for a `UNION` the schema of the combined table must be manually specified

## Learn more
* Read about [union](https://druid.apache.org/docs/26.0.0/querying/datasource.html#union) datasources in the documentation