# Data types in Druid
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This tutorial demonstrates how Druid stores [different types of data](https://druid.apache.org/docs/latest/querying/sql-data-types/) inside the database. There are examples of different [SQL-based ingestions](https://druid.apache.org/docs/latest/multi-stage-query/), each generating data types in resulting TABLEs. You will also see examples of conversion functions, like [CAST](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions), being used on various types of data.

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Fall back to localhost when DRUID_HOST is not set in the environment.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Create and describe a Druid table

Run the following cell to create a table called `example-flights-types-1`.

Notice that the statement is specific about the columns to ingest, selecting only what is needed for this notebook.

In [None]:
sql='''
REPLACE INTO "example-flights-types-1" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "Year",
  "Reporting_Airline",
  "Origin",
  "Dest",
  "Distance"
FROM "ext"
WHERE "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-types-1')

You can query the `INFORMATION_SCHEMA` schema's `COLUMNS` table to [get information](https://druid.apache.org/docs/latest/querying/sql-metadata-tables#columns-table) for any TABLE in Druid.

Run this cell which queries this table to return further information about the underlying structure of the `example-flights-types-1` table.

_What would you expect the data types to be for each of the columns?_

In [None]:
sql='''
SELECT
  "COLUMN_NAME",
  "ORDINAL_POSITION",
  "DATA_TYPE",
  "NUMERIC_PRECISION",
  "NUMERIC_PRECISION_RADIX",
  "DATETIME_PRECISION",
  "CHARACTER_SET_NAME",
  "JDBC_TYPE"
FROM "INFORMATION_SCHEMA"."COLUMNS"
WHERE "TABLE_NAME" = 'example-flights-types-1'
'''

display.sql(sql)
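Because the same metadata lookup is useful for every table created in this notebook, it can be wrapped in a small helper. The following is a minimal sketch: `columns_query` is a hypothetical helper name, not part of `druidapi`, and it only builds the SQL string for a trimmed-down column list.

```python
def columns_query(table_name: str) -> str:
    """Build an INFORMATION_SCHEMA.COLUMNS query for one table.

    Hypothetical helper for illustration only; not part of druidapi.
    """
    # Double any single quotes so the name is a valid SQL string literal.
    safe_name = table_name.replace("'", "''")
    return f'''
SELECT
  "COLUMN_NAME",
  "DATA_TYPE",
  "JDBC_TYPE"
FROM "INFORMATION_SCHEMA"."COLUMNS"
WHERE "TABLE_NAME" = '{safe_name}'
'''


sql = columns_query("example-flights-types-1")
print(sql)
```

With the connection from the initialization cell in place, the resulting string could be passed to `display.sql(sql)` in the same way as the query above.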

The type shown in the `DATA_TYPE` column tells you how Druid will interpret the data in SQL.

> As shown in [documentation](https://druid.apache.org/docs/latest/querying/sql-data-types), these SQL types map to more fundamental types inside Druid itself. Take a moment to review the [documentation](https://druid.apache.org/docs/latest/querying/sql-data-types#standard-types) to understand how each `DATA_TYPE` in the `example-flights-types-1` TABLE maps to an internal Druid runtime type.

The rest of this notebook works through examples of these data types:

* [Numbers](#numbers) (e.g. TIMESTAMP, BOOLEAN, DOUBLE, FLOAT, LONG)
* [Strings](#strings) (e.g. STRING)
* [Arrays](#arrays) (ARRAY)
* [Other](#other) (COMPLEX)
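The mapping from SQL types to Druid's internal runtime types can be sketched as a simple lookup table. The entries below are transcribed from the standard types table in the linked documentation, so treat them as a convenience copy and check them against the documentation for your Druid version. `SQL_TO_NATIVE` and `native_type` are names made up for this illustration.

```python
# SQL type -> Druid runtime type, copied from the "Standard types" table
# in the Druid SQL data types documentation (verify for your version).
SQL_TO_NATIVE = {
    "CHAR": "STRING",
    "VARCHAR": "STRING",
    "DECIMAL": "DOUBLE",
    "FLOAT": "FLOAT",
    "REAL": "DOUBLE",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "LONG",
    "TINYINT": "LONG",
    "SMALLINT": "LONG",
    "INTEGER": "LONG",
    "BIGINT": "LONG",
    "TIMESTAMP": "LONG",
    "DATE": "LONG",
    "ARRAY": "ARRAY",
    "OTHER": "COMPLEX",
}


def native_type(sql_type: str) -> str:
    """Return the runtime type for a SQL type, or 'UNKNOWN' if not listed."""
    return SQL_TO_NATIVE.get(sql_type.upper(), "UNKNOWN")


# Look up a few of the SQL types that appear in this notebook.
for sql_type in ("TIMESTAMP", "BIGINT", "VARCHAR", "BOOLEAN"):
    print(f"{sql_type:<10} -> {native_type(sql_type)}")
```

Notice that TIMESTAMP, BOOLEAN, and BIGINT all share the same runtime type, which is why several of the columns in the following sections are reported as BIGINT.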

## Numbers

Run the following cell to create a new TABLE called `example-flights-types-2`.

This statement incorporates:

* An additional timestamp - `__time_arrival` - derived via [TIME_PARSE](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) from `arrivalime` in the source data.
* A [CASE](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions) function, outputting a boolean result to a column called `isSilicon`.
* Two fields containing numeric data, `Year` and `Distance`.

In [None]:
sql='''
REPLACE INTO "example-flights-types-2" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
  "Year",
  "Reporting_Airline",
  "Origin",
  CASE WHEN "Origin" = 'SFO' THEN true ELSE false END AS "isSilicon",
  "Dest",
  "Distance"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-types-2')

Now run this cell to get a description of the table similar to that available in `INFORMATION_SCHEMA.COLUMNS`.

Take a moment to correlate the data types shown for each column with the ingestion SQL.

In [None]:
display.table('example-flights-types-2')

### Timestamps

Review the SQL statement above used to create the TABLE.

```sql
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
```

Notice that:

* The `__time` column has been recognised internally as a true TIMESTAMP.
* `__time_arrival` is being stored as a BIGINT.

To use secondary time data, such as `__time_arrival`, convert it to a true TIMESTAMP.

Run the following cell, where the TIME_FLOOR function is applied to the `__time_arrival` column which has been converted to a timestamp in two ways:

* With the [MILLIS_TO_TIMESTAMP](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) function.
* With [CAST](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions).

In [None]:
sql='''
SELECT
  TIME_FLOOR(MILLIS_TO_TIMESTAMP("__time_arrival"),'P1D') AS "day-1",
  TIME_FLOOR(CAST("__time_arrival" AS TIMESTAMP),'P1D') AS "day-2",
  AVG(TIMESTAMPDIFF(SECOND, "__time", NVL(MILLIS_TO_TIMESTAMP("__time_arrival"),"__time"))) AS "averageFlightLength_s"
FROM "example-flights-types-2"
WHERE TIME_IN_INTERVAL("__time",'2005-11-01/P7D')
GROUP BY 1, 2
'''

display.sql(sql)
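A secondary timestamp stored as a BIGINT holds milliseconds since the Unix epoch, which is why it can be converted back to a timestamp. The sketch below mirrors that idea in plain Python rather than in Druid: `millis_to_timestamp` and `floor_to_day` are illustrative names that loosely correspond to MILLIS_TO_TIMESTAMP and a `P1D` TIME_FLOOR in UTC, and the sample arrival time is made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def millis_to_timestamp(millis: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


def floor_to_day(millis: int) -> int:
    """Round epoch milliseconds down to the start of the UTC day."""
    return millis - (millis % MILLIS_PER_DAY)


# A made-up arrival time, converted to epoch milliseconds with integer maths.
arrival = datetime(2005, 11, 8, 13, 45, tzinfo=timezone.utc)
arrival_millis = (arrival - EPOCH) // timedelta(milliseconds=1)

print(arrival_millis)
print(millis_to_timestamp(arrival_millis))
print(millis_to_timestamp(floor_to_day(arrival_millis)))
```

Working on the integer value keeps the flooring step exact, since no floating-point conversion is involved.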

To parse string data to timestamps, you can:

* Apply a [CAST](https://druid.apache.org/docs/latest/querying/sql-scalar#other-scalar-functions) from a SQL-standard datetime.
* Use the dedicated [date and time functions](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions), which support different formats.

The SQL below expands on the statement above by adding a third column, `day-3`, which is built up as follows:

1. MILLIS_TO_TIMESTAMP converts `__time_arrival` to a timestamp.
2. TIME_FORMAT converts the timestamp from (1) to a string in SQL format.
3. CAST converts the string from (2) into a timestamp.
4. TIME_FLOOR rounds the timestamp from (3) to a day.

In [None]:
sql='''
SELECT
  TIME_FLOOR(MILLIS_TO_TIMESTAMP("__time_arrival"),'P1D') AS "day-1",
  TIME_FLOOR(CAST("__time_arrival" AS TIMESTAMP),'P1D') AS "day-2",
  TIME_FLOOR(CAST(TIME_FORMAT(MILLIS_TO_TIMESTAMP("__time_arrival"),'YYYY-MM-dd HH:mm:ss') AS TIMESTAMP),'P1D') AS "day-3",
  AVG(TIMESTAMPDIFF(SECOND, "__time", NVL(MILLIS_TO_TIMESTAMP("__time_arrival"),"__time"))) AS "averageFlightLength_s"
FROM "example-flights-types-2"
WHERE TIME_IN_INTERVAL("__time",'2005-11-01/P7D')
GROUP BY 1, 2, 3
'''

display.sql(sql)
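When a timestamp takes a round trip through a string, the format pattern decides whether the original value can be recovered. The following Python sketch is an analogy only: it uses Python's `strftime` codes, not the Joda-style patterns that TIME_FORMAT accepts, and the sample time is made up. It contrasts a 24-hour pattern with a 12-hour pattern that has no AM/PM marker.

```python
from datetime import datetime

FORMAT_24H = "%Y-%m-%d %H:%M:%S"  # 24-hour clock
FORMAT_12H = "%Y-%m-%d %I:%M:%S"  # 12-hour clock, no AM/PM marker

arrival = datetime(2005, 11, 8, 13, 45, 0)

# The 24-hour pattern round-trips without losing information.
as_text_24h = arrival.strftime(FORMAT_24H)
round_trip_24h = datetime.strptime(as_text_24h, FORMAT_24H)

# The 12-hour text, read back as if it were 24-hour, lands on the wrong hour.
as_text_12h = arrival.strftime(FORMAT_12H)
round_trip_12h = datetime.strptime(as_text_12h, FORMAT_24H)

print(as_text_24h, "->", round_trip_24h)
print(as_text_12h, "->", round_trip_12h)
```

In this example the date part survives either way, so flooring to a day gives the same answer, but any finer granularity would be affected by the lost half of the day.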

### Booleans

The result of the boolean CASE function is also a BIGINT.

```sql
    CASE WHEN "Origin" = 'SFO' THEN true ELSE false END AS "isSilicon",
```

Run the cell below to see what this looks like in the data.

In [None]:
sql='''
SELECT
  "isSilicon",
  COUNT(*) AS "flights"
FROM "example-flights-types-2"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P7D')
GROUP BY 1
'''

display.sql(sql)
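Storing a boolean as a BIGINT means that each row holds a 1 or a 0. As a rough Python analogy of the CASE expression and the grouped count above, using made-up origin codes rather than the real flight data:

```python
from collections import Counter

# Made-up sample of origin airport codes.
origins = ["SFO", "LAX", "SFO", "JFK", "ORD"]

# Equivalent of CASE WHEN "Origin" = 'SFO' THEN true ELSE false END,
# kept as an integer flag in the way the table stores it.
is_silicon = [1 if origin == "SFO" else 0 for origin in origins]

# Equivalent of COUNT(*) ... GROUP BY "isSilicon".
flights_by_flag = Counter(is_silicon)

print(is_silicon)
print(dict(flights_by_flag))
```

The grouping keys are the integers 1 and 0 rather than true and false, which matches what the query results show for `isSilicon`.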

To convert to a true boolean, use the CAST function.

> CAST to BOOLEAN ensures Druid applies [strict boolean behaviour](https://druid.apache.org/docs/latest/querying/sql-data-types#boolean-logic), [vectorization](https://druid.apache.org/docs/latest/misc/math-expr/#vectorization-support), and the configured approach to [logical operations](https://druid.apache.org/docs/latest/misc/math-expr/#logical-operator-modes).

Run the following cell to see an example.

In [None]:
sql='''
SELECT
  "isSilicon",
  CAST("isSilicon" AS BOOLEAN) AS "isSilicon-1"
FROM "example-flights-types-2"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P7D')
GROUP BY 1
'''

display.sql(sql)

Review the documentation to learn about the behaviour of [CAST to boolean from other data types](https://druid.apache.org/docs/latest/querying/math-expr#logical-operator-modes).

Run the following cell to see more examples of CAST to boolean.

In [None]:
sql='''
SELECT
  MOD(TIME_EXTRACT("__time",'DAY'),2) AS "oddDay",
  CAST(MOD(TIME_EXTRACT("__time",'DAY'),2) AS BOOLEAN) AS "oddDay-boolean",
  CASE
    WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 0 THEN 'badger'
    WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 1 THEN 'true'
  END AS "oddDay-1",
  CAST(
    CASE WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 0 THEN 'badger'
         WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 1 THEN 'true'
    END AS BOOLEAN) AS "oddDay-1-boolean",
  CASE
    WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 0 THEN 1
    WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 1 THEN 'mushroom'
  END AS "oddDay-2",
  CAST(
    CASE WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 0 THEN 1
         WHEN MOD(TIME_EXTRACT("__time",'DAY'),2) = 1 THEN 'mushroom'
    END AS BOOLEAN) AS "oddDay-2-boolean",
  COUNT(*) AS "flights"
FROM "example-flights-types-2"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P7D')
GROUP BY MOD(TIME_EXTRACT("__time",'DAY'),2)
'''

display.sql(sql)

The following cell shows a CAST to boolean from numeric data.

In [None]:
sql='''
SELECT
  MOD("Distance",3) AS "distanceRemainder3",
  CAST(MOD("Distance",3) AS BOOLEAN) AS "distanceRemainder3-boolean",
  COUNT(*) AS "flights"
FROM "example-flights-types-2"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P7D')
GROUP BY MOD("Distance",3)
'''

display.sql(sql)

Run the following cell to create a new TABLE, `example-flights-types-3`. It includes a specific CAST of the output of the CASE function to a BOOLEAN.

_What do you expect the resulting data type to be?_

In [None]:
sql='''
REPLACE INTO "example-flights-types-3" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
  "Year",
  "Reporting_Airline",
  "Origin",
  CAST((CASE WHEN "Origin" = 'SFO' THEN true ELSE false END) AS BOOLEAN) AS "isSilicon",
  "Dest",
  "Distance"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-types-3')
display.table('example-flights-types-3')

### Numbers

Run the following ingestion to create a new table, `example-flights-types-4`.

The SQL statement includes a set of calculations:

* An explicit CAST of `Distance` to a DOUBLE, emitted as `doubleDistance`.
* Various timestamp functions that emit numeric values:
  * `timeTaken_seconds` uses two timestamps to emit the time taken in seconds.
  * `speed` applies the same calculation, divided by distance, without a CAST on `Distance`.
  * `doubleSpeed` is the same calculation as `speed` but applies a CAST to DOUBLE on `Distance`.
  * `floatSpeed` does the same as `doubleSpeed` except it CASTs to FLOAT.
* A CAST from a numeric to VARCHAR on the `DayofMonth` source column, emitted as `DayofMonth_Filter`.

_Can you predict what the resulting data types will be?_
_What will the data look like in the table?_

In [None]:
sql='''
REPLACE INTO "example-flights-types-4" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  TIME_PARSE("arrivalime") AS "__time_arrival",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) AS "timeTaken_seconds",
  "Distance",
  CAST("Distance" AS DOUBLE) AS "doubleDistance",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) / "Distance" AS "speed",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) / CAST("Distance" AS DOUBLE) AS "doubleSpeed",
  TIMESTAMPDIFF(SECOND, TIME_PARSE("depaturetime"), NVL(TIME_PARSE("arrivalime"),TIME_PARSE("depaturetime"))) / CAST("Distance" AS FLOAT) AS "floatSpeed",
  "DayofMonth",
  CAST("DayofMonth" AS VARCHAR) AS "DayofMonth_Filter",
  "Reporting_Airline",
  "Origin",
  CASE WHEN "Origin" = 'SFO' THEN true ELSE false END AS "isSilicon",
  "Dest"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-types-4')
display.table('example-flights-types-4')

Notice the data types of `Distance` and of the three `speed` columns.

* `Distance` has a data type of BIGINT.
* `doubleDistance` has a data type of DOUBLE.
* `speed` inherits the data type of `Distance`.
* `doubleSpeed` inherits the data type of `Distance` CAST to a DOUBLE.
* `floatSpeed` inherits the data type of `Distance` CAST to a FLOAT.

Run the following query to see how the data differs between each one.

In [None]:
sql='''
SELECT
  "Origin",
  "Dest",
  "Distance",
  "doubleDistance",
  "speed",
  "doubleSpeed",
  "floatSpeed"
FROM "example-flights-types-4"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/PT15M')
'''

display.sql(sql)
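The differences between the three speed columns come down to integer division versus floating-point division, and to 64-bit versus 32-bit precision. The sketch below reproduces that effect in plain Python with made-up numbers. It is not how Druid evaluates the expressions internally: `to_float32` is an illustrative helper that rounds a Python float to 32-bit precision using the standard `struct` module.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Made-up values standing in for "timeTaken_seconds" and "Distance".
time_taken_seconds = 7200
distance = 337

# Integer divided by integer: the fractional part is discarded.
speed = time_taken_seconds // distance

# Dividing by a floating-point distance keeps the fraction.
double_speed = time_taken_seconds / float(distance)

# The same value held at 32-bit precision differs slightly.
float_speed = to_float32(double_speed)

print(speed)
print(double_speed)
print(float_speed)
```

For positive values, Python's floor division matches truncating integer division, which is why `//` is used for the integer case here.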

## Strings

CHAR and VARCHAR data is stored as a string, and can be understood as holding single or [multiple](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings) values.

> [Multi-value strings](https://druid.apache.org/docs/latest/querying/multi-value-dimensions) exhibit [special behaviour](https://druid.apache.org/docs/latest/querying/sql-data-types#multi-value-strings-behavior). A full exploration is out of the scope of this notebook.

In the table above, the `DayofMonth_Filter` column is recognized as a VARCHAR because of the CAST function.

> See [schema design tips](https://druid.apache.org/docs/latest/ingestion/schema-design/#string-vs-numeric-dimensions) to learn more about the differences between storing data as numeric or a string.

Run the following cell to see a simple query that makes use of this filtering column.

In [None]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '10') AS "10thDayCount",
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '9') AS "9thDayCount"
FROM "example-flights-types-4"
'''

display.sql(sql)

Building on the previous example, run the following cell to see a CAST that converts the internally stored string to a BIGINT so that it can be compared with a numeric value:

In [None]:
sql='''
SELECT
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '10') AS "10thDayCount",
    COUNT(*) FILTER (WHERE "DayofMonth_Filter" = '9') AS "9thDayCount-1",
    COUNT(*) FILTER (WHERE CAST("DayofMonth_Filter" AS BIGINT) = 9) AS "9thDayCount-2",
    COUNT(*) FILTER (WHERE CAST("DayofMonth_Filter" AS BIGINT) = 8) AS "8thDayCount"
FROM "example-flights-types-4"
'''

display.sql(sql)
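Comparing values as strings and comparing them as numbers can give different answers, which is one reason the choice of type matters. A small Python illustration with made-up day values (this shows general string-versus-number behaviour, not Druid's own comparison rules):

```python
# Made-up day-of-month values held as text, including a zero-padded one.
days_as_text = ["9", "10", "8", "09"]
days_as_numbers = [int(day) for day in days_as_text]

# Text sorts character by character; numbers sort by value.
print(sorted(days_as_text))
print(sorted(days_as_numbers))

# Equality differs too: '09' is not the same text as '9',
# but both are the number 9 once converted.
text_matches = [day for day in days_as_text if day == "9"]
numeric_matches = [day for day in days_as_numbers if day == 9]

print(len(text_matches), len(numeric_matches))
```

A string filter is an exact text match, whereas converting to a number first makes differently formatted values of the same number compare as equal.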

You can also convert from a TIMESTAMP to a string.

Run the following cell to see a CAST from TIMESTAMP to a string, as well as an example of the TIME_FORMAT function creating a formatted string from the same value.

> For more examples of TIME_FORMAT check out the dedicated notebook on [date and time functions](../03-query/07-functions-datetime.ipynb).

In [None]:
sql='''
SELECT
    CAST("__time" AS VARCHAR) AS "__time_VARCHAR",
    TIME_FORMAT(__time, 'YYYY-MM-dd hh:mm:ss a') AS "__time_FORMAT"
FROM "example-flights-types-4"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/PT10M')
'''

display.sql(sql)

## Arrays

[Arrays](https://druid.apache.org/docs/latest/querying/sql-data-types#arrays) are specially recognised in Druid, constructed using appropriate [aggregations](https://druid.apache.org/docs/latest/querying/sql-aggregations) and then worked with using dedicated SQL [functions](https://druid.apache.org/docs/latest/querying/sql-array-functions).

A full exploration is out of the scope of this notebook.

Rather than using CAST functions, use the [documented functions](https://druid.apache.org/docs/latest/querying/sql-array-functions) to convert between ARRAY and VARCHAR types.

Run the following cell.

* `origin-array` contains the output of an ARRAY_AGG function, which constructs an ARRAY at query-time from 15-minutes-worth of source data.
* The same function is then used with the ARRAY_TO_STRING function to create a comma-separated string, output as `origin-array-string`.
* Finally, that function is used inside STRING_TO_ARRAY to convert the string back into an array again as `origin-array-string-array`.

In [None]:
sql='''
SELECT
  ARRAY_AGG(DISTINCT "Origin") AS "origin-array",
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT "Origin"),',') AS "origin-array-string",
  STRING_TO_ARRAY(ARRAY_TO_STRING(ARRAY_AGG(DISTINCT "Origin"),','),',') AS "origin-array-string-array"
FROM "example-flights-types-4"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/PT15M')
'''

display.sql(sql)
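The round trip from array to delimited string and back can be pictured with ordinary Python lists. This is an analogy using made-up values, with `join` and `split` standing in for ARRAY_TO_STRING and STRING_TO_ARRAY. It also shows a caveat that applies to any delimiter-based conversion: if an element contains the delimiter, the round trip does not return the original array.

```python
# Made-up origin codes; the set removes duplicates, like DISTINCT.
origins = ["SFO", "LAX", "SFO", "JFK"]
origin_array = sorted(set(origins))

# Array -> comma-separated string -> array again.
origin_array_string = ",".join(origin_array)
origin_array_string_array = origin_array_string.split(",")

print(origin_array)
print(origin_array_string)
print(origin_array_string_array)

# Caveat: an element that contains the delimiter is split in two.
city_array = ["Washington, DC", "San Francisco"]
city_round_trip = ",".join(city_array).split(",")
print(city_round_trip)
```

Choosing a delimiter that cannot occur in the data avoids the problem shown in the last step.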

## Other

Special types of data are stored as COMPLEX or OTHER, including Apache DataSketches.

Run the following ingestion to create a table where this type can be seen.

> For more information on how this cell works, see the dedicated notebook on [generating sketches at ingestion time](./03-generating-sketches.ipynb).

In [None]:
sql='''
REPLACE INTO "example-flights-types-5" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("depaturetime"), 'PT1H') AS "__time",
  TIME_FLOOR(TIME_PARSE("arrivalime"), 'PT1H') AS "__arrival_time",
  "Reporting_Airline",
  DS_HLL("Tail_Number") AS "Tail_Number_HLL",
  DS_THETA("Tail_Number") AS "Tail_Number_THETA"
FROM "ext"
WHERE "arrivalime" <> 0
AND "depaturetime" <> 0
GROUP BY 1, 2, 3
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("finalize", "false")
req.add_context("finalizeAggregations", "false")

display.run_task(req)
sql_client.wait_until_ready('example-flights-types-5')
display.table('example-flights-types-5')

Notice that both the sketched columns are recognised as COMPLEX types, and need to be addressed using their dedicated functions and aggregations.

Run the following cell to see an approximate count being used on the sketch stored in `Tail_Number_HLL`.

> For more examples, see the dedicated notebook on [approximate COUNT DISTINCT](../03-query/03-approxCountDistinct.ipynb) operations.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT6H') AS "period",
  APPROX_COUNT_DISTINCT_DS_HLL(Tail_Number_HLL) AS "approxCount"
FROM "example-flights-types-5"
WHERE TIME_IN_INTERVAL("__time",'2005-11-08/P1D')
GROUP BY 1
'''

display.sql(sql)
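One reason to store a sketch rather than a finished count is that distinct counts from separate rows cannot simply be added together, whereas sketches can be merged. The following Python sketch illustrates that idea with exact sets standing in for the sketches. Real HLL and Theta sketches are approximate and compact, and the tail numbers here are made up.

```python
# Made-up tail numbers seen in each hourly row.
tails_by_hour = {
    "2005-11-08T00": {"N101", "N202"},
    "2005-11-08T01": {"N202", "N303"},
    "2005-11-08T02": {"N101", "N303", "N404"},
}

# Adding per-hour distinct counts double-counts aircraft seen in several hours.
sum_of_counts = sum(len(tails) for tails in tails_by_hour.values())

# Merging the underlying sets first gives the true distinct count.
merged = set().union(*tails_by_hour.values())

print(sum_of_counts)
print(len(merged))
```

In the query above, grouping hourly rows into six-hour periods relies on the same property: the stored sketches are merged before the approximate count is taken.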

## Clean up

Run the following cell to remove the tables that were created by this notebook from the database.

In [None]:
druid.datasources.drop("example-flights-types-1")
druid.datasources.drop("example-flights-types-2")
druid.datasources.drop("example-flights-types-3")
druid.datasources.drop("example-flights-types-4")
druid.datasources.drop("example-flights-types-5")

## Summary

* Druid tables are created through ingestion.
* Data types for columns are inferred from the SELECT statement, conforming to SQL types.
* SQL types map to internal types in Druid.
* Only the primary timestamp, `__time`, is recognized as a true TIMESTAMP.
* You can use type-specific functions, as well as CAST, to specify and convert between types at ingestion and query time.

## Learn more

* [Read more](https://druid.apache.org/docs/latest/querying/sql-metadata-tables#columns-table) about the `INFORMATION_SCHEMA.COLUMNS` table.
* See the documentation around [data types in Druid](https://druid.apache.org/docs/latest/querying/sql-data-types/).
* Experiment with ingestions that re-use existing column names but change their data type.
* Read more about [automatic schema detection and schemaless ingestion](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec) for JSON-based ingestion in the documentation.
* Try out the notebook on [approximate COUNT DISTINCT](../03-query/03-approxCountDistinct.ipynb) and on [generating sketches at ingestion time](./03-generating-sketches.ipynb).
* Take a look at more [date and time functions](../03-query/07-functions-datetime.ipynb).