# Ingest data to a Druid table using SQL-based batch
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

[SQL-based](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#sql-reference) ingestion reads raw data from files or other external batch sources and transforms it into time-partitioned, fully indexed [Druid segment files](https://druid.apache.org/docs/latest/design/segments).

In this notebook on the basics of batch ingestion in Druid, you will:

- Ingest data from several files.
- Apply some context parameters to control how the ingestion executes.
- Incorporate some filters and transformations.

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the [Learn Druid project page](https://github.com/implydata/learn-druid/#readme).


## Initialization

In [None]:
import druidapi
import os

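# Fall back to the default router host when DRUID_HOST is not set.
# os.environ.get() returns None for a missing variable instead of raising KeyError.
if os.environ.get('DRUID_HOST') is None:
    druid_host = "http://router:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"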

druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Run a simple batch ingestion from one file

SQL-based ingestion adds to or replaces data in a TABLE. Take a look at the example SQL in the cell below.

* REPLACE with OVERWRITE ALL means that the whole table will be overwritten.
* WITH sets up and defines the source, including an EXTERN [input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html) and its [input format](https://druid.apache.org/docs/latest/ingestion/data-formats.html). In this case, the source is a file served over HTTP in JSON format.
* [EXTEND](https://druid.apache.org/docs/latest/multi-stage-query/reference#extern-function) describes the input schema.
* SELECT defines the transformations and schema of the resulting Druid table. The `__time` field is required.
* PARTITIONED BY tells Druid how to lay out the table data internally.

Run the following cell to load data from an external file into the "example-wikipedia-batch" table.

In [None]:
table_name = "example-wikipedia-batch"

sql = '''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS 
(
    SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
        '{"type":"json"}'
      )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  *
FROM "ext"
PARTITIONED BY DAY
'''
druid.display.run_task(sql)
druid.sql.wait_until_ready(table_name)
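
The EXTERN arguments and the EXTEND column list in the statement above are plain strings, so they can also be generated from Python structures rather than typed by hand. The following is a minimal sketch, not part of the `druidapi` library: `extern_clause` is a hypothetical helper, and only three of the Wikipedia columns are listed to keep the output short.

```python
import json

def extern_clause(uris, columns):
    """Return a TABLE(EXTERN(...)) EXTEND (...) fragment for JSON files served over HTTP.

    Hypothetical helper for illustration; it does not escape single quotes,
    so it assumes the URIs and column names contain none.
    """
    # json.dumps produces the JSON text that EXTERN expects inside a SQL string literal.
    input_source = json.dumps({"type": "http", "uris": list(uris)})
    input_format = json.dumps({"type": "json"})
    # Each EXTEND entry is a double-quoted column name followed by its SQL type.
    extend = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    return (
        "TABLE(\n"
        f"  EXTERN(\n    '{input_source}',\n    '{input_format}'\n  )\n"
        f") EXTEND ({extend})"
    )

# A subset of the input schema used in this notebook.
columns = {"timestamp": "VARCHAR", "channel": "VARCHAR", "added": "BIGINT"}
fragment = extern_clause(["https://druid.apache.org/data/wikipedia.json.gz"], columns)
print(fragment)
```

Generating the JSON with `json.dumps` also avoids the brace-escaping that an f-string would need, which is one reason the cells in this notebook build the statement by string concatenation.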

Run the following SQL to see some of the data in the table.

In [None]:
sql = f'''
SELECT channel,  count(*) num_events
FROM "{table_name}"
WHERE TIME_IN_INTERVAL ("__time", '2016-06-27/2016-06-28')
GROUP BY 1 
ORDER BY 2 DESC 
LIMIT 10
'''

druid.display.sql(sql)

Run the following cell to drop the table.

In [None]:
druid.datasources.drop(table_name, True)

## Run a parallelized batch ingestion

[Druid Input Sources](https://druid.apache.org/docs/latest/ingestion/native-batch.html#splittable-input-sources) allow you to specify multiple files as input to an ingestion job.

In the following SQL, the EXTERN statement has been updated. Now there are several `uris`, each pointing to a file to ingest. To keep the example simple, the same file is listed three times.

Run the cell to ingest data from all three files.

In [None]:
table_name = "example-wikipedia-bigbatch"

sql = '''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS 
(
    SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"http",
          "uris":[ "https://druid.apache.org/data/wikipedia.json.gz",
                   "https://druid.apache.org/data/wikipedia.json.gz",
                   "https://druid.apache.org/data/wikipedia.json.gz"
                 ]
         }',
        '{"type":"json"}'
      )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  *
FROM "ext"
PARTITIONED BY DAY
'''
druid.display.run_task(sql)
druid.sql.wait_until_ready(table_name)
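
When the list of files grows, it can be easier to build the input source from a Python list than to edit the JSON by hand. This is a small sketch under that assumption; `http_input_source` is a hypothetical helper, and the URL is the same Wikipedia file used above, repeated three times.

```python
import json

WIKIPEDIA_URL = "https://druid.apache.org/data/wikipedia.json.gz"

def http_input_source(uris):
    """Return the JSON text of an HTTP input source listing every URI (hypothetical helper)."""
    uris = list(uris)
    if not uris:
        raise ValueError("at least one URI is required")
    return json.dumps({"type": "http", "uris": uris}, indent=2)

# Same file three times, mirroring the cell above.
input_source = http_input_source([WIKIPEDIA_URL] * 3)
print(input_source)
```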

Let's look at the data now. The quantities are 3 times larger than before because we loaded the same file three times:

In [None]:
sql = f'''
SELECT channel, count(*) num_events
FROM "{table_name}" 
WHERE TIME_IN_INTERVAL("__time", '2016-06-27/2016-06-28')
GROUP BY 1 
ORDER BY 2 DESC 
LIMIT 10
'''

display.sql(sql)

Run the following cell to drop the table.

In [None]:
druid.datasources.drop(table_name, True)

Use [context parameters](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#context-parameters) to control how the ingestion executes.

The learning environment has been [configured](https://druid.apache.org/docs/latest/operations/basic-cluster-tuning#task-count) with the capacity to run four tasks concurrently.

Run the following cell to:

* Create the SQL to pass to Druid as `sql`.
* Create a request object, `request`.
* Add a context parameter for `maxNumTasks`, set to the maximum of 4, to allow all files to be read in parallel.
* Run the task.
* Wait until the data has been distributed around the database.

In [None]:
table_name = "example-wikipedia-4-batch"

sql = '''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS 
(
    SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"http",
          "uris":[ "https://druid.apache.org/data/wikipedia.json.gz",
                   "https://druid.apache.org/data/wikipedia.json.gz",
                   "https://druid.apache.org/data/wikipedia.json.gz"
                 ]
         }',
        '{"type":"json"}'
      )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  *
FROM "ext"
PARTITIONED BY DAY
'''

request = druid.sql.sql_request(sql)
request.add_context('maxNumTasks', 4)

druid.display.run_task(request)
druid.sql.wait_until_ready(table_name)
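
For reference, the SQL task API accepts a JSON body with a `query` field and an optional `context` object, and `maxNumTasks` counts the controller task as well as the workers. The sketch below shows that shape and the resulting worker count. It is an illustration only: `build_task_payload` and `worker_count` are hypothetical helpers, and it is an assumption that the `druidapi` request object produces an equivalent body.

```python
import json

def build_task_payload(sql, max_num_tasks):
    """Return a request body with the statement and a maxNumTasks context parameter."""
    if max_num_tasks < 2:
        # One controller plus at least one worker is the minimum.
        raise ValueError("maxNumTasks must be at least 2")
    return {"query": sql, "context": {"maxNumTasks": max_num_tasks}}

def worker_count(max_num_tasks):
    """maxNumTasks includes the controller, so one task is not available for reading input."""
    return max_num_tasks - 1

# "SELECT 1" is a stand-in for the REPLACE statement built in the cell above.
payload = build_task_payload("SELECT 1", 4)
print(json.dumps(payload, indent=2))
print("workers:", worker_count(4))
```

With `maxNumTasks` set to 4, three worker tasks remain, which matches the three input files.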

Inspect the generated segments in the Segments view of the Druid web console or by querying Druid's `sys.segments` table.

Run the next cell to get information about the table that you just created.

Notice that, even though there were multiple tasks ingesting the data, a shuffle stage brought all the data together into one segment.

In [None]:
sql=f'''
SELECT
  "start",
  "end",
  "size",
  "num_rows"
FROM sys.segments
WHERE datasource = '{table_name}'
ORDER BY 1
'''

display.sql(sql)

Run the next cell to drop the table.

In [None]:
druid.datasources.drop(table_name, True)

## Filter and transform data during ingestion

Use SQL functions and filters to carry out calculations on, and filter unwanted data from, the source data.

In this section you will see some examples of filtering and expressions being applied.

### Use WHERE to filter incoming data

When you need to cleanse data or are only interested in a subset of it, the ingestion job can filter the data through a WHERE clause.

Run the next cell to only ingest rows where the event wasn't generated by a robot.

In [None]:
table_name = "example-wikipedia-only-human"

sql = '''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS 
(
    SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"http",
          "uris":[ "https://druid.apache.org/data/wikipedia.json.gz"]
         }',
        '{"type":"json"}'
      )
) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  *
FROM "ext"
WHERE "isRobot"='false'
PARTITIONED BY DAY
'''

druid.display.run_task(sql)
druid.sql.wait_until_ready(table_name)
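
Because `isRobot` is declared as VARCHAR in the EXTEND clause, the filter compares against the string `'false'` rather than a boolean. The sketch below mirrors that comparison in plain Python. The rows are made-up illustrative values, not data from the Wikipedia file, and `keep_human_edits` is a hypothetical helper.

```python
# Illustrative rows only; the values are invented for this sketch.
sample_rows = [
    {"channel": "#en.wikipedia", "isRobot": "false"},
    {"channel": "#en.wikipedia", "isRobot": "true"},
    {"channel": "#fr.wikipedia", "isRobot": "false"},
]

def keep_human_edits(rows):
    """Keep rows whose isRobot string equals 'false', like WHERE "isRobot"='false'."""
    return [row for row in rows if row["isRobot"] == "false"]

human_rows = keep_human_edits(sample_rows)
print(len(human_rows), "of", len(sample_rows), "rows kept")
```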

Run the following cell to see the results.

In [None]:
sql=f'''
SELECT isRobot, channel, count(*) num_events
FROM "{table_name}"
GROUP BY 1,2 
ORDER BY 3 DESC 
LIMIT 10
'''

display.sql(sql)

Run the next cell to drop the table.

In [None]:
druid.datasources.drop(table_name, True)

### Use functions to transform data at ingestion time

SQL provides a rich [set of functions](https://druid.apache.org/docs/latest/querying/sql-scalar.html) that you can apply to input columns to transform the data as it is ingested. All scalar SQL functions are available for normal ingestion. Rollup ingestion, which also uses aggregate functions at ingestion time, is covered in the [Rollup Notebook](05-rollup.ipynb).

Some common functions include:
* [Time parsing and manipulation functions](https://druid.apache.org/docs/latest/querying/sql-scalar.html#date-and-time-functions).
* CASE statements to resolve complex logic and prepare columns for certain query patterns.
* [String manipulation functions](https://druid.apache.org/docs/latest/querying/sql-scalar.html#string-functions).
* Nested object (JSON) functions.

Take a look at the next cell to see some of these functions being incorporated into a batch ingestion.

In [None]:
table_name = "example-kttm-transform-batch"

sql = '''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS 
(
    SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"http","uris":["https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz"]}',
        '{"type":"json"}'
      )
    ) EXTEND ("timestamp" VARCHAR, "session" VARCHAR, "number" VARCHAR, "event" TYPE('COMPLEX<json>'), "agent" TYPE('COMPLEX<json>'), "client_ip" VARCHAR, "geo_ip" TYPE('COMPLEX<json>'), "language" VARCHAR, "adblock_list" VARCHAR, "app_version" VARCHAR, "path" VARCHAR, "loaded_image" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "server_ip" VARCHAR, "screen" VARCHAR, "window" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR)
)
SELECT
  session, 
  number,
  TIME_PARSE("timestamp") AS "__time",
  TIMESTAMPDIFF(DAY, TIME_FLOOR(TIME_PARSE("timestamp"), 'P1W'), TIME_PARSE("timestamp")) AS days_since_week_start,
  TIME_FLOOR(TIME_PARSE("timestamp"), 'P1W') AS week_start,
  TIME_CEIL(TIME_PARSE("timestamp"), 'P1W') AS week_end,
  TIME_SHIFT(TIME_FLOOR(TIME_PARSE("timestamp"), 'P1D'),'P1D', -1) AS start_of_yesterday,
  
  JSON_VALUE("event", '$.percentage' RETURNING BIGINT) as percent_cleared,
  JSON_VALUE("geo_ip", '$.city') AS city,
  
  CASE WHEN UPPER("adblock_list")='NOADBLOCK' THEN 0 ELSE 1 END AS adblock_count,
  CASE WHEN UPPER("adblock_list")='EASYLIST' THEN 1 ELSE 0 END AS easylist_count,
  
  REPLACE(REGEXP_EXTRACT("app_version", '[^\.]*\.'),'.','') AS major_version,
  ARRAY_ORDINAL(STRING_TO_ARRAY("app_version",'\.'),2) AS minor_version,
  ARRAY_ORDINAL(STRING_TO_ARRAY("app_version",'\.'),3) AS patch_version,
  session_length
FROM "ext"
PARTITIONED BY DAY
'''

druid.display.run_task(sql)
druid.sql.wait_until_ready(table_name)
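
To make the intent of some of these expressions easier to follow, here is a rough Python equivalent of the version-splitting, CASE, and week-flooring logic. It is a sketch under stated assumptions: the helper names are hypothetical, the version string `1.9.6` is an invented three-part example, and weeks are assumed to start on Monday, as ISO weeks do.

```python
import re
from datetime import datetime, timedelta, timezone

def split_version(app_version):
    """Mirror the REGEXP_EXTRACT / STRING_TO_ARRAY expressions for a three-part version."""
    # Text up to and including the first dot, with the dot removed.
    major = re.search(r"[^\.]*\.", app_version).group(0).replace(".", "")
    # ARRAY_ORDINAL is 1-based, so ordinals 2 and 3 map to Python indexes 1 and 2.
    parts = re.split(r"\.", app_version)
    return major, parts[1], parts[2]

def adblock_count(adblock_list):
    """Mirror CASE WHEN UPPER("adblock_list")='NOADBLOCK' THEN 0 ELSE 1 END."""
    return 0 if (adblock_list or "").upper() == "NOADBLOCK" else 1

def week_start(ts):
    """Floor a timestamp to midnight on the Monday of its week (assumed week start)."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return day - timedelta(days=day.weekday())

ts = datetime(2019, 8, 25, 13, 30, tzinfo=timezone.utc)
print(split_version("1.9.6"))
print(adblock_count("NoAdblock"), adblock_count("EasyList"))
print(week_start(ts), (ts - week_start(ts)).days)
```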

Run the next cell to see what time of day shows the highest user activity.

In [None]:
sql = f'''
SELECT EXTRACT( HOUR FROM "__time") time_hour, city, count(distinct "session") session_count
FROM "{table_name}"
WHERE "city" IS NOT NULL AND "city" <> ''
GROUP BY 1,2 
ORDER BY 3 DESC 
LIMIT 10
'''

display.sql(sql)

Drop the table by running the next cell.

In [None]:
druid.datasources.drop(table_name, True)

## Summary

Druid's [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) enables scalable batch ingestion from a large variety of [data sources](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html) and [formats](https://druid.apache.org/docs/latest/ingestion/data-formats.html). The familiarity and expressiveness of SQL let you quickly transform, filter, and enrich data directly in the cluster.

## Learn more

* Try out some more functions, like [date time](../03-query/07-functions-datetime.ipynb), [strings](../03-query/08-functions-strings.ipynb), and [CASE](../03-query/09-functions-case.ipynb).
* Incorporate a [GROUP BY](../03-query/01-groupby.ipynb) to aggregate rows at ingestion time.
* Work through the notebook on [PARTITIONED BY and CLUSTERED BY](./06-partitioning-data.ipynb).
* See how Druid works with [nested data](./05-working-with-nested-columns.ipynb).
* Deep dive into the [data types](./04-table-datatypes.ipynb) notebook.