# Data types in Druid
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

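This tutorial demonstrates how to work with [data types](https://druid.apache.org/docs/latest/querying/sql-data-types) in Apache Druid. In this tutorial you perform the following tasks:

- Inspect the data types of a table's columns using `INFORMATION_SCHEMA`.
- Convert timestamps, booleans, and numbers using CAST and related functions.
- Compare string and numeric versions of the same column.
- Convert between arrays and strings, and see how sketches are stored as COMPLEX types.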

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Run the following cell to create a table called `example-flights-cast`.

Notice that the statement ingests only the columns that this notebook needs, and that it adds a calculated `isSilicon` column.

When completed, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-flights-cast" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
  "Year",
  "Reporting_Airline",
  "Origin",
  CASE WHEN "Origin" = 'SFO' THEN true ELSE false END AS "isSilicon",
  "Dest",
  "Distance"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-cast')
display.table('example-flights-cast')

### Import additional modules

Run the following cell to import additional Python modules used in this notebook.

In [None]:
import json

## Use INFORMATION_SCHEMA to look at data types

At the end of the ingestion above, you ran the following function to bring up a simple description of the TABLE:

In [None]:
display.table('example-flights-cast')

For more detail, use the `INFORMATION_SCHEMA` schema's `COLUMNS` table.

Run the following query, showing further information about the underlying structure of the table.

In [None]:
sql='''
SELECT
  "COLUMN_NAME",
  "ORDINAL_POSITION",
  "DATA_TYPE",
  "NUMERIC_PRECISION",
  "NUMERIC_PRECISION_RADIX",
  "DATETIME_PRECISION",
  "CHARACTER_SET_NAME",
  "JDBC_TYPE"
FROM "INFORMATION_SCHEMA"."COLUMNS"
WHERE "TABLE_NAME" = 'example-flights-cast'
'''

display.sql(sql)

The type shown in the `DATA_TYPE` column is the type used in Druid SQL. As shown in the [documentation](https://druid.apache.org/docs/latest/querying/sql-data-types), data is stored inside a segment using a fundamental data type:

* [Numbers](#numbers) (e.g. TIMESTAMP, BOOLEAN, DOUBLE, FLOAT, LONG)
* [Strings](#strings) (e.g. STRING)
* [Arrays](#arrays) (ARRAY)
* [Other](#other) (COMPLEX)

Before going further in this notebook, take a moment to use the [documentation](https://druid.apache.org/docs/latest/querying/sql-data-types#standard-types) to understand how each `DATA_TYPE` in the `example-flights-cast` TABLE maps to an internal Druid runtime type.
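
As a quick reference while you read the documentation, the following sketch records part of that mapping in a Python dictionary. The dictionary and the `native_type` helper are illustrative names made up for this notebook, not part of Druid or `druidapi`, and the entries summarise the linked documentation page rather than replace it.

```python
# Partial SQL-to-native type mapping, summarised from the Druid SQL data
# types documentation. SQL_TO_NATIVE and native_type are illustrative names.
SQL_TO_NATIVE = {
    "CHAR": "STRING",
    "VARCHAR": "STRING",
    "BIGINT": "LONG",
    "INTEGER": "LONG",
    "TIMESTAMP": "LONG",
    "DATE": "LONG",
    "BOOLEAN": "LONG",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "ARRAY": "ARRAY",
    "OTHER": "COMPLEX",
}


def native_type(sql_type: str) -> str:
    """Return the native type for a SQL type name, or 'UNKNOWN' if unlisted."""
    return SQL_TO_NATIVE.get(sql_type.upper(), "UNKNOWN")


for sql_type in ("TIMESTAMP", "BIGINT", "VARCHAR", "BOOLEAN"):
    print(f"{sql_type:>10} -> {native_type(sql_type)}")
```

Notice that TIMESTAMP and BOOLEAN both map to LONG, which is why they appear under Numbers in this notebook.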

## Numbers

Timestamps, booleans, and numeric data are stored as LONG, FLOAT, or DOUBLE inside segments.

This section of the notebook contains examples of conversion between these and other types.

### Timestamps

Review the SQL statement above used to create the TABLE.

```sql
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
```

The `__time` column has been recognised internally as a true TIMESTAMP, whereas `__time_arrival` has been stored as a BIGINT.

To use `__time_arrival` with timestamp functions, you must therefore convert it to a TIMESTAMP.

Run the following two cells to see two different approaches to this: one using the MILLIS_TO_TIMESTAMP function, and one using a CAST.

In [None]:
sql='''
SELECT
  TIME_FLOOR(MILLIS_TO_TIMESTAMP("__time_arrival"),'P1D') AS "day",
  AVG(TIMESTAMPDIFF(SECOND, "__time", NVL(MILLIS_TO_TIMESTAMP("__time_arrival"),"__time"))) AS "averageFlightLength_s"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-01/P7D')
GROUP BY 1
'''

display.sql(sql)

In [None]:
sql='''
SELECT
  TIME_FLOOR(CAST("__time_arrival" AS TIMESTAMP),'P1D') AS "day",
  AVG(TIMESTAMPDIFF(SECOND, "__time", NVL(CAST("__time_arrival" AS TIMESTAMP),"__time"))) AS "averageFlightLength_s"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-01/P7D')
GROUP BY 1
'''

display.sql(sql)
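
Both approaches rely on the same idea: the stored number is a count of milliseconds since the Unix epoch. The following pure-Python sketch illustrates that idea outside of Druid. The helper names and the sample departure time are made up for illustration, and the code does not show how Druid itself performs the conversion.

```python
from datetime import datetime, timezone


def millis_to_timestamp(millis: int) -> datetime:
    """Interpret an integer as milliseconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def timestamp_to_millis(ts: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    return round(ts.timestamp() * 1000)


# Made-up departure time; the arrival is stored as a plain number two hours later.
departure = datetime(2005, 11, 1, 8, 0, tzinfo=timezone.utc)
arrival_millis = timestamp_to_millis(departure) + 2 * 60 * 60 * 1000

# The number only becomes usable for date arithmetic after conversion.
arrival = millis_to_timestamp(arrival_millis)
print(arrival.isoformat())
print((arrival - departure).total_seconds())
```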

A CAST from a string type to TIMESTAMP expects standard SQL datetime formatting. Other [functions](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) support different formats.

In [None]:
sql='''
SELECT CAST('2000-01-02 03:04:05' AS TIMESTAMP) AS "castFromSqlString"
FROM "example-flights-cast"
LIMIT 1
'''

display.sql(sql)

Run the following cell to see a CAST from TIMESTAMP to a string, as well as an example of the TIME_FORMAT function to create the same output.

For more examples of TIME_FORMAT check out the [dedicated notebook](../03-query/07-functions-datetime.ipynb) on datetime functions.

In [None]:
sql='''
SELECT
    CAST("__time" AS VARCHAR) AS "__time_VARCHAR",
    TIME_FORMAT("__time", 'yyyy-MM-dd HH:mm:ss') AS "__time_FORMAT"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/PT10M')
'''

display.sql(sql)
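
TIME_FORMAT takes a Joda-style pattern, in which `HH` is the hour on a 24-hour clock and `hh` is the hour on a 12-hour clock. The following sketch shows the same distinction using Python's `strftime`, where `%H` and `%I` play those two roles. The timestamp is made up, and the Python directives are only an analogy for the Joda pattern letters.

```python
from datetime import datetime

# Made-up afternoon timestamp, chosen so the two clocks differ.
ts = datetime(2005, 11, 8, 15, 4, 5)

twenty_four_hour = ts.strftime("%Y-%m-%d %H:%M:%S")  # %H: 00-23
twelve_hour = ts.strftime("%Y-%m-%d %I:%M:%S")       # %I: 01-12

print(twenty_four_hour)  # 2005-11-08 15:04:05
print(twelve_hour)       # 2005-11-08 03:04:05
```

Without an AM/PM marker, the 12-hour form is ambiguous, so a 24-hour pattern is the safer choice when the output needs to match a plain CAST to VARCHAR.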

### Booleans

In the ingestion SQL, the following CASE statement outputs either true or false:

```sql
CASE WHEN "Origin" = 'SFO' THEN true ELSE false END AS "isSilicon",
```

Run the cell below to see how this plays out in a SQL statement:

In [None]:
sql='''
SELECT
  "isSilicon",
  COUNT(*) AS "flights"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P7D')
GROUP BY 1
'''

display.sql(sql)

The results you see reflect the base internal type for this column: BIGINT.

CASTs ensure Druid applies the configured [strict boolean behaviour](https://druid.apache.org/docs/latest/querying/sql-data-types#boolean-logic), which affects how Druid [vectorizes](https://druid.apache.org/docs/latest/misc/math-expr/#vectorization-support) operations in queries and how it [handles logical operations](https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes). This is particularly important when controlling how NULLs are handled.

Run the following cell to see examples of how CAST enables you to treat this data in a true boolean fashion.

In [None]:
sql='''
SELECT
  CAST("isSilicon" AS BOOLEAN) AS "isSilicon-LONG",
  CAST("Distance" AS BOOLEAN) AS "isDistant-BIGINT",
  COUNT(*) AS "flights"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P7D')
GROUP BY "isSilicon", CAST("Distance" AS BOOLEAN)
'''

display.sql(sql)

Notice that, in the query above, `Distance` was CAST to a BOOLEAN in the GROUP BY clause. Because of the way Druid determines whether something is true or false, this resulted in only two emitted rows, rather than rows being emitted for every possible value of `Distance` in the source data.
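
The effect on the number of groups can be sketched in plain Python. The rows below are made up, and Python's `bool()` is only a stand-in for Druid's own conversion rules, which are described in the documentation linked above. For the positive distances used here, every value converts to true.

```python
from collections import Counter

# Made-up rows of (isSilicon, Distance); not taken from the flights data.
rows = [(1, 337), (0, 2586), (0, 967), (1, 414), (0, 337)]

# Grouping on the raw distance gives one group per distinct pair.
by_distance = Counter((is_silicon, distance) for is_silicon, distance in rows)

# Grouping on the boolean form collapses every positive distance into True.
by_boolean = Counter((is_silicon, bool(distance)) for is_silicon, distance in rows)

print(len(by_distance))  # 5 groups
print(len(by_boolean))   # 2 groups
print(by_boolean)
```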

The following cell shows examples of how CAST handles STRING-to-BOOLEAN conversion:

In [None]:
sql='''
SELECT
  CAST('I am a string' AS BOOLEAN) AS "isStringy-STRING",
  CAST('true' AS BOOLEAN) AS "isTrulyString-STRING",
  CAST('' AS BOOLEAN) AS "isNotStringy-STRING",
  CAST('false' AS BOOLEAN) AS "isStillNotStringy-STRING"
FROM "example-flights-cast"
LIMIT 1
'''

display.sql(sql)

Review the SQL in the following REPLACE statement. It includes a specific CAST to BOOLEAN of the CASE expression.

What do you suspect the `DATA_TYPE` will be?

In [None]:
sql='''
REPLACE INTO "example-flights-cast" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
  "Year",
  "Reporting_Airline",
  "Origin",
  CAST((CASE WHEN "Origin" = 'SFO' THEN true ELSE false END) AS BOOLEAN) AS "isSilicon",
  "Dest",
  "Distance"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-cast')
display.table('example-flights-cast')

### Numbers

Run the following ingestion that contains new calculations:

* The first adds a field called `DayofMonth_Filter`, which is a VARCHAR cast of the numeric data in the `DayofMonth` source column.
* The second takes the difference between the two parsed timestamps and emits the time taken in seconds as `timeTaken_seconds`.
* The third applies this same calculation and divides it by `Distance`, without casting `Distance`, to produce `speed_1`.
* The fourth applies this same calculation but CASTs `Distance` to DOUBLE and emits `speed_2`.
* The fifth CASTs `Distance` to FLOAT for `speed_3`.

In [None]:
sql='''
REPLACE INTO "example-flights-cast" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
  "DayofMonth",
  CAST("DayofMonth" AS VARCHAR) AS "DayofMonth_Filter",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) AS "timeTaken_seconds",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) / "Distance" AS "speed_1",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) / CAST("Distance" AS DOUBLE) AS "speed_2",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) / CAST("Distance" AS FLOAT) AS "speed_3",
  "Reporting_Airline",
  "Origin",
  CASE WHEN "Origin" = 'SFO' THEN true ELSE false END AS "isSilicon",
  "Dest",
  "Distance"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-cast')
display.table('example-flights-cast')

Firstly, notice that `DayofMonth_Filter` has been stored as a VARCHAR explicitly because of the CAST function.

Since Druid automatically creates [bitmap indexes](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimension-objects) for STRING-type columns at ingestion time, it can be useful to create VARCHAR versions of columns that will be used for filtering in addition to any CLUSTERED BY column, especially when that column will never be used for any sort of mathematical calculation.

Run the following cell to see an example of a query that makes use of this filtering column.

In [None]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '10') AS "10thDayCount",
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '9') AS "9thDayCount"
FROM "example-flights-cast"
'''

display.sql(sql)
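
To get a feel for why an index on string values helps with filters like these, consider the following toy inverted index. It is a simplified illustration only: the column values are made up, and Druid's actual bitmap indexes are compressed structures built at ingestion time, not Python sets.

```python
from collections import defaultdict

# Made-up column values; the list position acts as the row id.
day_of_month_filter = ["9", "10", "9", "11", "10", "9"]

# Build the index once: each distinct value points at the rows that contain it.
index = defaultdict(set)
for row_id, value in enumerate(day_of_month_filter):
    index[value].add(row_id)


def count_matching(value: str) -> int:
    """Count matching rows with a single lookup instead of scanning every row."""
    return len(index.get(value, set()))


print(count_matching("10"))
print(count_matching("9"))
```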

Secondly, notice that only the `speed_2` and `speed_3` columns are recognised as being anything other than BIGINT.

In the TABLE, the data therefore differs between `speed_1`, `speed_2`, and `speed_3`.

In [None]:
sql='''
SELECT
  "Origin",
  "Dest",
  "speed_1",
  "speed_2",
  "speed_3"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/PT15M')
'''

display.sql(sql)
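
The differences come from integer division and from floating-point precision. The following Python sketch mimics the three calculations with made-up values: `//` stands in for dividing one BIGINT by another, a normal Python float stands in for DOUBLE, and a round trip through `struct` stands in for the 32-bit FLOAT. It is an analogy for the behaviour seen above, not Druid's implementation.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


time_taken_seconds = 7200  # made-up value
distance = 967             # made-up value

speed_1 = time_taken_seconds // distance                         # integer division
speed_2 = time_taken_seconds / float(distance)                   # 64-bit result
speed_3 = to_float32(time_taken_seconds / to_float32(distance))  # 32-bit result

print(speed_1)
print(speed_2)
print(speed_3)
```

The integer result discards the fractional part entirely, while the 32-bit result agrees with the 64-bit result only to about seven significant digits.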

## Strings

CHAR and VARCHAR data is stored as a string, and can be understood as holding single or [multiple](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings) values.

[Multi-value strings](https://druid.apache.org/docs/latest/querying/multi-value-dimensions) exhibit [special behaviour](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings-behavior). A full exploration is out of the scope of this notebook.

Building on the previous example, run the following cell to see a CAST that converts an internally stored string to a number so that it can be compared with a numeric value:

In [None]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '10') AS "10thDayCount",
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '9') AS "9thDayCount-1",
    COUNT(*) FILTER (WHERE CAST("DayofMonth_Filter" AS BIGINT) = 9) AS "9thDayCount-2",
    COUNT(*) FILTER (WHERE CAST("DayofMonth_Filter" AS BIGINT) = 8) AS "8thDayCount"
FROM "example-flights-cast"
'''

display.sql(sql)
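
The same distinction between comparing as text and comparing as numbers exists in most languages. The following Python sketch uses made-up values to show that equality filters give the same counts either way, while ordering does not: as strings, `'10'` sorts before `'8'`.

```python
# Made-up string values, as they might appear in a VARCHAR column.
day_of_month_filter = ["8", "9", "10", "9"]

# Equality works on the string form or, after conversion, on the numeric form.
string_matches = sum(1 for v in day_of_month_filter if v == "9")
numeric_matches = sum(1 for v in day_of_month_filter if int(v) == 9)
print(string_matches, numeric_matches)

# Ordering differs: strings sort character by character, numbers by value.
print(sorted(day_of_month_filter))
print(sorted(day_of_month_filter, key=int))
```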

## Arrays

[Arrays](https://druid.apache.org/docs/latest/querying/sql-data-types#arrays) are specially recognised in Druid, constructed using appropriate [aggregations](https://druid.apache.org/docs/latest/querying/sql-aggregations) and then worked with using dedicated SQL [functions](https://druid.apache.org/docs/latest/querying/sql-array-functions).

A full exploration is out of the scope of this notebook.

Rather than using CAST functions, use the [documented functions](https://druid.apache.org/docs/latest/querying/sql-array-functions) to convert between ARRAY and VARCHAR types.

Run the following cell.

* `origin-array` contains the output of an ARRAY_AGG function, which constructs an ARRAY at query time from 15 minutes' worth of source data.
* The same function is then used with the ARRAY_TO_STRING function to create a comma-separated string, output as `origin-array-string`.
* Finally, that function is used inside STRING_TO_ARRAY to convert the string back into an array again as `origin-array-string-array`.

In [None]:
sql='''
SELECT
  ARRAY_AGG(DISTINCT "Origin") AS "origin-array",
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT "Origin"),',') AS "origin-array-string",
  STRING_TO_ARRAY(ARRAY_TO_STRING(ARRAY_AGG(DISTINCT "Origin"),','),',') AS "origin-array-string-array"
FROM "example-flights-cast"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/PT15M')
'''

display.sql(sql)
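
The round trip in this query has a close analogue in Python, sketched below with made-up airport codes. A sorted set stands in for `ARRAY_AGG(DISTINCT ...)`, `join` for ARRAY_TO_STRING, and `split` for STRING_TO_ARRAY. Sorting is used here only to make the output predictable; it is not a statement about the order Druid returns.

```python
# Made-up origin values with duplicates.
origins = ["SFO", "JFK", "SFO", "ORD", "JFK"]

origin_array = sorted(set(origins))                          # distinct values
origin_array_string = ",".join(origin_array)                 # array -> string
origin_array_string_array = origin_array_string.split(",")   # string -> array

print(origin_array)
print(origin_array_string)
print(origin_array_string_array == origin_array)
```

The round trip is only lossless when no element contains the delimiter, which is worth checking before choosing a separator for your own data.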

## Other

Special types of data are stored as COMPLEX or OTHER, including Apache DataSketches.

Run the following ingestion to create a table where this type can be seen.

For more information on how this cell works, see the dedicated notebook on [generating sketches at ingestion time](./03-generating-sketches.ipynb).

In [None]:
sql='''
REPLACE INTO "example-flights-cast" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("depaturetime"), 'PT1H') AS "__time",
  TIME_FLOOR(TIME_PARSE("arrivalime"), 'PT1H') AS "__arrival_time",
  "Reporting_Airline",
  DS_HLL("Tail_Number") AS "Tail_Number_HLL",
  DS_THETA("Tail_Number") AS "Tail_Number_THETA"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
GROUP BY 1, 2, 3
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("finalize", "false")
req.add_context("finalizeAggregations", "false")

display.run_task(req)
sql_client.wait_until_ready('example-flights-cast')
display.table('example-flights-cast')

Notice that both the sketched columns are recognised as COMPLEX types.
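
One reason to store a sketch object rather than a finished count is that distinct counts cannot be added together across rows, whereas sketches can be merged. The following sketch illustrates this with made-up tail numbers, using exact Python sets as a stand-in for sketches. Real sketches are approximate, compact structures, so this shows only the merging idea.

```python
# Made-up tail numbers seen in two hourly rows for the same airline.
hourly_tail_numbers = {
    "08:00": {"N101", "N202", "N303"},
    "09:00": {"N202", "N303", "N404"},
}

# Adding per-hour distinct counts double-counts aircraft seen in both hours.
sum_of_hourly_counts = sum(len(s) for s in hourly_tail_numbers.values())

# Merging the underlying sets first gives the true distinct count.
merged = set().union(*hourly_tail_numbers.values())
distinct_over_both_hours = len(merged)

print(sum_of_hourly_counts)      # 6
print(distinct_over_both_hours)  # 4
```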

## Clean up

Run the following cell to remove the table that was created by this notebook from the database.

In [None]:
druid.datasources.drop("example-flights-cast")

## Summary

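* You used `INFORMATION_SCHEMA.COLUMNS` to see the SQL data type of each column in a table.
* You used CAST, MILLIS_TO_TIMESTAMP, and TIME_FORMAT to convert between timestamps, numbers, and strings.
* Remember that a calculation on BIGINT columns stays a BIGINT unless you CAST an operand to DOUBLE or FLOAT.
* Remember to use the dedicated array functions, rather than CAST, to convert between ARRAY and VARCHAR.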

## Learn more

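* Try these conversions out on your own data.
* Read about [SQL data types](https://druid.apache.org/docs/latest/querying/sql-data-types) in the Druid documentation.
* Read about [multi-value dimensions](https://druid.apache.org/docs/latest/querying/multi-value-dimensions) and [array functions](https://druid.apache.org/docs/latest/querying/sql-array-functions).
* Work through the notebooks on [datetime functions](../03-query/07-functions-datetime.ipynb) and [generating sketches at ingestion time](./03-generating-sketches.ipynb).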
