# Optimizing performance using PARTITIONED BY and CLUSTERED BY
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This notebook is a review of Apache Druid's data organization strategy and how it can be used to improve system health and query performance.

[Apache Druid stores data in Segment files](https://druid.apache.org/docs/latest/design/segments). Druid segments have a columnar structure designed for high-performance analytic queries. Segments are organized into consistent time intervals, called the `Segment Granularity` of the table, which is typically expressed as HOUR, DAY, MONTH, or YEAR. In this notebook you will review how this setting affects the number of segments created during ingestion and the size of those segments. Druid works best when segments are balanced in size; a general rule of thumb is that each segment should contain approximately 5 million rows.

[Pruning](https://druid.apache.org/docs/latest/design/architecture#query-processing) is the process of reducing the number of segment files that need to be inspected in order to resolve a query. To process a query, Druid inspects all the time chunks that overlap the query's filter condition on the `__time` column.
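
As a rough illustration of time-based pruning, the sketch below models each time chunk as a half-open interval and keeps only the chunks that overlap a query's `__time` filter. This is a toy model, not Druid's implementation; the `overlaps` and `chunks_for_query` helpers and the sample daily chunks are hypothetical.

```python
from datetime import datetime, timedelta

def overlaps(chunk_start, chunk_end, query_start, query_end):
    """True when two half-open intervals [start, end) share any instant."""
    return chunk_start < query_end and query_start < chunk_end

def chunks_for_query(chunks, query_start, query_end):
    """Return only the time chunks a query on [query_start, query_end) must inspect."""
    return [c for c in chunks if overlaps(c[0], c[1], query_start, query_end)]

# Hypothetical table with DAY segment granularity covering November 2005.
first_day = datetime(2005, 11, 1)
daily_chunks = [
    (first_day + timedelta(days=d), first_day + timedelta(days=d + 1))
    for d in range(30)
]

# A filter on __time from Nov 10 (inclusive) to Nov 13 (exclusive).
needed = chunks_for_query(daily_chunks, datetime(2005, 11, 10), datetime(2005, 11, 13))
print(f"{len(needed)} of {len(daily_chunks)} time chunks need to be inspected")
```

In this toy model, every chunk outside the filter's interval is skipped without being read, which is the effect pruning aims for.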

Within a time chunk, [segment files can be organized based on another set of columns](https://druid.apache.org/docs/latest/ingestion/partitioning). In SQL-based ingestion, this is specified using a [CLUSTERED BY clause](https://druid.apache.org/docs/latest/multi-stage-query/reference#clustered-by). When you use this feature, Druid organizes the segments within a time chunk so that each one covers a range of values from the clustering columns. At query time, the Broker can prune additional segments within a time chunk if the query includes a filter condition on the clustering columns.

## Prerequisites

This tutorial was tested with Druid 27.0.0.

#### Run with Docker

<!-- Profiles are:
`druid-jupyter` - just Jupyter and Druid
`all-services` - includes Jupyter, Druid, and Kafka
 -->

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [the project on GitHub](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
from datetime import datetime
from statistics import mean 

if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version


## Time partitioning

Normally, ideal segments contain around 5 million rows. Because this notebook runs on a laptop and is meant as a fast demonstration, we'll set our "ideal" to only 50,000 rows.


### Use PARTITIONED BY to adjust the coverage of segments

#### Too granular
Using a `Segment Granularity` that is too small produces too many segments. The following batch ingestion demonstrates this. It takes about 1 minute to complete:

In [None]:
sql='''
REPLACE INTO "flights-hour" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "arrivalime", "Year", "Quarter", "Month", "DayofMonth", "DayOfWeek", "FlightDate", "Reporting_Airline", "DOT_ID_Reporting_Airline",
  "IATA_CODE_Reporting_Airline", "Tail_Number", "Flight_Number_Reporting_Airline", 
  "OriginAirportID", "OriginAirportSeqID", "OriginCityMarketID", "Origin", "OriginCityName", "OriginState", "OriginStateFips", "OriginStateName", "OriginWac",  
  "DestAirportID", "DestAirportSeqID", "DestCityMarketID", "Dest", "DestCityName", "DestState", "DestStateFips", "DestStateName", "DestWac", 
  "CRSDepTime", "DepTime", "DepDelay", "DepDelayMinutes", "DepDel15", "DepartureDelayGroups", "DepTimeBlk",
  "TaxiOut", "WheelsOff", "WheelsOn", "TaxiIn", "CRSArrTime", "ArrTime", "ArrDelay", "ArrDelayMinutes", "ArrDel15", "ArrivalDelayGroups", "ArrTimeBlk",
  "Cancelled", "CancellationCode", "Diverted", "CRSElapsedTime", "ActualElapsedTime", "AirTime", "Flights", "Distance", "DistanceGroup", "CarrierDelay",
  "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "FirstDepTime", "TotalAddGTime", "LongestAddGTime", "DivAirportLandings", "DivReachedDest",
  "DivActualElapsedTime",  "DivArrDelay",  "DivDistance",  "Div1Airport",  "Div1AirportID",  "Div1AirportSeqID",  "Div1WheelsOn",  "Div1TotalGTime",
  "Div1LongestGTime",  "Div1WheelsOff",  "Div1TailNum",  "Div2Airport",  "Div2AirportID",  "Div2AirportSeqID",  "Div2WheelsOn",  "Div2TotalGTime",
  "Div2LongestGTime",  "Div2WheelsOff",  "Div2TailNum",  "Div3Airport",  "Div3AirportID",  "Div3AirportSeqID",  "Div3WheelsOn",  "Div3TotalGTime",
  "Div3LongestGTime",  "Div3WheelsOff",  "Div3TailNum",  "Div4Airport",  "Div4AirportID",  "Div4AirportSeqID",  "Div4WheelsOn",  "Div4TotalGTime",
  "Div4LongestGTime",  "Div4WheelsOff",  "Div4TailNum",  "Div5Airport",  "Div5AirportID",  "Div5AirportSeqID",  "Div5WheelsOn",  "Div5TotalGTime",
  "Div5LongestGTime",  "Div5WheelsOff",  "Div5TailNum"
FROM "ext"
PARTITIONED BY HOUR'''

display.run_task(sql)
sql_client.wait_until_ready('flights-hour')

You can see the segments that were created on [the Apache Druid console in the Segments view](http://localhost:8888/unified-console.html#segments/datasource~flights-hour).

Here's what it should look like. Notice that each segment corresponds to a single one-hour time interval, and that the number of rows per segment is small and highly variable:
![](assets/segments-hourly.png)
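
If you prefer to check the segment layout from the notebook rather than the console, one option is to query Druid's `sys.segments` metadata table. The sketch below only builds the SQL text; the `segment_summary_sql` helper name is an assumption made for illustration, and the statement is meant to be run with the same `display.sql` call used elsewhere in this notebook.

```python
def segment_summary_sql(datasource: str) -> str:
    """Build a query that summarizes segment count and row counts for one table."""
    # sys.segments holds one row per segment; num_rows is that segment's row count.
    return f'''
SELECT "datasource",
       COUNT(*) AS "segments",
       MIN("num_rows") AS "min_rows",
       MAX("num_rows") AS "max_rows",
       SUM("num_rows") AS "total_rows"
FROM sys.segments
WHERE "datasource" = '{datasource}'
  AND "is_overshadowed" = 0
GROUP BY 1
'''

summary_sql = segment_summary_sql("flights-hour")
print(summary_sql)
# With the Druid connection from the setup cell you could then run:
# display.sql(summary_sql)
```

Passing a different table name to the helper gives a quick way to compare segment counts and sizes across the tables created in this notebook.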

#### Too coarse
In this example you overcorrect by using a granularity of YEAR:

In [None]:
sql='''
REPLACE INTO "flights-year" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "arrivalime", "Year", "Quarter", "Month", "DayofMonth", "DayOfWeek", "FlightDate", "Reporting_Airline", "DOT_ID_Reporting_Airline",
  "IATA_CODE_Reporting_Airline", "Tail_Number", "Flight_Number_Reporting_Airline", 
  "OriginAirportID", "OriginAirportSeqID", "OriginCityMarketID", "Origin", "OriginCityName", "OriginState", "OriginStateFips", "OriginStateName", "OriginWac",  
  "DestAirportID", "DestAirportSeqID", "DestCityMarketID", "Dest", "DestCityName", "DestState", "DestStateFips", "DestStateName", "DestWac", 
  "CRSDepTime", "DepTime", "DepDelay", "DepDelayMinutes", "DepDel15", "DepartureDelayGroups", "DepTimeBlk",
  "TaxiOut", "WheelsOff", "WheelsOn", "TaxiIn", "CRSArrTime", "ArrTime", "ArrDelay", "ArrDelayMinutes", "ArrDel15", "ArrivalDelayGroups", "ArrTimeBlk",
  "Cancelled", "CancellationCode", "Diverted", "CRSElapsedTime", "ActualElapsedTime", "AirTime", "Flights", "Distance", "DistanceGroup", "CarrierDelay",
  "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "FirstDepTime", "TotalAddGTime", "LongestAddGTime", "DivAirportLandings", "DivReachedDest",
  "DivActualElapsedTime",  "DivArrDelay",  "DivDistance",  "Div1Airport",  "Div1AirportID",  "Div1AirportSeqID",  "Div1WheelsOn",  "Div1TotalGTime",
  "Div1LongestGTime",  "Div1WheelsOff",  "Div1TailNum",  "Div2Airport",  "Div2AirportID",  "Div2AirportSeqID",  "Div2WheelsOn",  "Div2TotalGTime",
  "Div2LongestGTime",  "Div2WheelsOff",  "Div2TailNum",  "Div3Airport",  "Div3AirportID",  "Div3AirportSeqID",  "Div3WheelsOn",  "Div3TotalGTime",
  "Div3LongestGTime",  "Div3WheelsOff",  "Div3TailNum",  "Div4Airport",  "Div4AirportID",  "Div4AirportSeqID",  "Div4WheelsOn",  "Div4TotalGTime",
  "Div4LongestGTime",  "Div4WheelsOff",  "Div4TailNum",  "Div5Airport",  "Div5AirportID",  "Div5AirportSeqID",  "Div5WheelsOn",  "Div5TotalGTime",
  "Div5LongestGTime",  "Div5WheelsOff",  "Div5TailNum"
FROM "ext"
PARTITIONED BY YEAR'''

display.run_task(sql)
sql_client.wait_until_ready('flights-year')

Here's the link to view the segments for this one on [the Apache Druid console in the Segments view](http://localhost:8888/unified-console.html#segments/datasource~flights-year).

Notice that now you have all the rows in a single segment and that it has over 500,000 rows. Normally this would still be a small segment, but our target for this discussion is 50,000, so this segment is about 10 times larger than the target.
![](assets/segments-yearly.png)


### Almost there...using `rowsPerSegment` to control segment size
Because this example contains a single month of data, MONTH produces the same result as YEAR. DAY is too granular: it yields segments about 30 times smaller, when you only need them about 10 times smaller. You could try `P3D` to group the data into 3-day intervals and get close to the target.
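
The reasoning in the previous paragraph is simple arithmetic. The sketch below works it through, assuming roughly 500,000 rows spread evenly over the 30 days of November 2005. Both the round row count and the even spread are approximations, and the names are illustrative only.

```python
TOTAL_ROWS = 500_000   # approximate row count for the month (assumption)
DAYS_IN_DATA = 30      # November has 30 days
TARGET_ROWS = 50_000   # this notebook's demonstration target

# Days of data that fall into one time chunk at each candidate granularity.
# MONTH and YEAR both hold the entire 30 days of this dataset.
days_per_chunk = {"DAY": 1, "P3D": 3, "MONTH": 30, "YEAR": 30}

rows_per_chunk = {
    granularity: TOTAL_ROWS * days // DAYS_IN_DATA
    for granularity, days in days_per_chunk.items()
}

for granularity, rows in rows_per_chunk.items():
    ratio = rows / TARGET_ROWS
    print(f"{granularity:>5}: ~{rows:,} rows per time chunk ({ratio:.1f}x target)")
```

Under these assumptions only the 3-day interval lands on the target, while DAY undershoots and MONTH or YEAR overshoot by about 10x.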

You can also change the execution parameters so that the large segment is split into multiple segments of about 50,000 rows each. This is done with the query context parameter `rowsPerSegment`:

In [None]:
sql='''
REPLACE INTO "flights-year-50k" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "arrivalime", "Year", "Quarter", "Month", "DayofMonth", "DayOfWeek", "FlightDate", "Reporting_Airline", "DOT_ID_Reporting_Airline",
  "IATA_CODE_Reporting_Airline", "Tail_Number", "Flight_Number_Reporting_Airline", 
  "OriginAirportID", "OriginAirportSeqID", "OriginCityMarketID", "Origin", "OriginCityName", "OriginState", "OriginStateFips", "OriginStateName", "OriginWac",  
  "DestAirportID", "DestAirportSeqID", "DestCityMarketID", "Dest", "DestCityName", "DestState", "DestStateFips", "DestStateName", "DestWac", 
  "CRSDepTime", "DepTime", "DepDelay", "DepDelayMinutes", "DepDel15", "DepartureDelayGroups", "DepTimeBlk",
  "TaxiOut", "WheelsOff", "WheelsOn", "TaxiIn", "CRSArrTime", "ArrTime", "ArrDelay", "ArrDelayMinutes", "ArrDel15", "ArrivalDelayGroups", "ArrTimeBlk",
  "Cancelled", "CancellationCode", "Diverted", "CRSElapsedTime", "ActualElapsedTime", "AirTime", "Flights", "Distance", "DistanceGroup", "CarrierDelay",
  "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "FirstDepTime", "TotalAddGTime", "LongestAddGTime", "DivAirportLandings", "DivReachedDest",
  "DivActualElapsedTime",  "DivArrDelay",  "DivDistance",  "Div1Airport",  "Div1AirportID",  "Div1AirportSeqID",  "Div1WheelsOn",  "Div1TotalGTime",
  "Div1LongestGTime",  "Div1WheelsOff",  "Div1TailNum",  "Div2Airport",  "Div2AirportID",  "Div2AirportSeqID",  "Div2WheelsOn",  "Div2TotalGTime",
  "Div2LongestGTime",  "Div2WheelsOff",  "Div2TailNum",  "Div3Airport",  "Div3AirportID",  "Div3AirportSeqID",  "Div3WheelsOn",  "Div3TotalGTime",
  "Div3LongestGTime",  "Div3WheelsOff",  "Div3TailNum",  "Div4Airport",  "Div4AirportID",  "Div4AirportSeqID",  "Div4WheelsOn",  "Div4TotalGTime",
  "Div4LongestGTime",  "Div4WheelsOff",  "Div4TailNum",  "Div5Airport",  "Div5AirportID",  "Div5AirportSeqID",  "Div5WheelsOn",  "Div5TotalGTime",
  "Div5LongestGTime",  "Div5WheelsOff",  "Div5TailNum"
FROM "ext"
PARTITIONED BY YEAR'''
# use a request so that you can specify the query context
req = sql_client.sql_request(sql)
req.add_context("rowsPerSegment", "50000")  
display.run_task(req)
sql_client.wait_until_ready('flights-year-50k')

The [segment view for this table](http://localhost:8888/unified-console.html#segments/datasource~flights-year-50k) shows 12 segments that are very close to our ideal segment size. 

But let's say that the application you are building is for year-long analytics for individual airlines and some airline-to-airline comparisons. So let's move on to clustering...

## Use CLUSTERED BY to apply secondary partitioning 

As mentioned in the introduction, the time interval of a given segment granularity can be subdivided into many segments. You achieved this in the example above, but there is no logic to the partitioning: the data was just cut into 12 roughly equal segments, each as close to the 50k target as possible.

Queries on this data will be frequently filtered on `IATA_CODE_Reporting_Airline` in order to provide individual airline analytics. So instead of just splitting the segments by quantity of rows, you can reorganize the data such that you can improve pruning when filtering on a given airline or a few airlines.
You just need to add `CLUSTERED BY IATA_CODE_Reporting_Airline` to the ingestion request:

In [None]:
sql='''
REPLACE INTO "flights-year-IATA" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "arrivalime", "Year", "Quarter", "Month", "DayofMonth", "DayOfWeek", "FlightDate", "Reporting_Airline", "DOT_ID_Reporting_Airline",
  "IATA_CODE_Reporting_Airline", "Tail_Number", "Flight_Number_Reporting_Airline", 
  "OriginAirportID", "OriginAirportSeqID", "OriginCityMarketID", "Origin", "OriginCityName", "OriginState", "OriginStateFips", "OriginStateName", "OriginWac",  
  "DestAirportID", "DestAirportSeqID", "DestCityMarketID", "Dest", "DestCityName", "DestState", "DestStateFips", "DestStateName", "DestWac", 
  "CRSDepTime", "DepTime", "DepDelay", "DepDelayMinutes", "DepDel15", "DepartureDelayGroups", "DepTimeBlk",
  "TaxiOut", "WheelsOff", "WheelsOn", "TaxiIn", "CRSArrTime", "ArrTime", "ArrDelay", "ArrDelayMinutes", "ArrDel15", "ArrivalDelayGroups", "ArrTimeBlk",
  "Cancelled", "CancellationCode", "Diverted", "CRSElapsedTime", "ActualElapsedTime", "AirTime", "Flights", "Distance", "DistanceGroup", "CarrierDelay",
  "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "FirstDepTime", "TotalAddGTime", "LongestAddGTime", "DivAirportLandings", "DivReachedDest",
  "DivActualElapsedTime",  "DivArrDelay",  "DivDistance",  "Div1Airport",  "Div1AirportID",  "Div1AirportSeqID",  "Div1WheelsOn",  "Div1TotalGTime",
  "Div1LongestGTime",  "Div1WheelsOff",  "Div1TailNum",  "Div2Airport",  "Div2AirportID",  "Div2AirportSeqID",  "Div2WheelsOn",  "Div2TotalGTime",
  "Div2LongestGTime",  "Div2WheelsOff",  "Div2TailNum",  "Div3Airport",  "Div3AirportID",  "Div3AirportSeqID",  "Div3WheelsOn",  "Div3TotalGTime",
  "Div3LongestGTime",  "Div3WheelsOff",  "Div3TailNum",  "Div4Airport",  "Div4AirportID",  "Div4AirportSeqID",  "Div4WheelsOn",  "Div4TotalGTime",
  "Div4LongestGTime",  "Div4WheelsOff",  "Div4TailNum",  "Div5Airport",  "Div5AirportID",  "Div5AirportSeqID",  "Div5WheelsOn",  "Div5TotalGTime",
  "Div5LongestGTime",  "Div5WheelsOff",  "Div5TailNum"
FROM "ext"
PARTITIONED BY YEAR
CLUSTERED BY IATA_CODE_Reporting_Airline
'''
# use a request so that you can specify the query context
req = sql_client.sql_request(sql)
req.add_context("rowsPerSegment", "50000")  # still target this notebook's 50k rows per segment; the MSQ default is 3,000,000
display.run_task(req)
sql_client.wait_until_ready('flights-year-IATA')

The [segment view for this table](http://localhost:8888/unified-console.html#segments/datasource~flights-year-IATA) shows 12 segments that are still very close to our ideal segment size, but now they have some new metadata based on the clustering columns in the `Shard Spec` column. It shows which range of values of the IATA_CODE_Reporting_Airline column are available in the segment. This is very useful for pruning when filtering on that column because Druid can prune to 1 or 2 segment files when looking for a single airline.

![](assets/segments-iata.png)
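
To make the pruning idea concrete, here is a toy model of range-based pruning. It is not Druid's implementation, and the segment ranges below are invented for illustration rather than copied from the `Shard Spec` column. Each segment is described by a start and end value of the clustering column, with `None` standing for an open-ended boundary.

```python
from typing import List, Optional, Tuple

# (start, end) of the clustering column for one segment; None means unbounded.
Range = Tuple[Optional[str], Optional[str]]

def may_contain(segment_range: Range, value: str) -> bool:
    """True when value falls inside the range, treating both ends as inclusive.

    Both ends are inclusive in this toy model so that rows for a boundary
    value may sit on either side of a split.
    """
    start, end = segment_range
    return (start is None or start <= value) and (end is None or value <= end)

def prune(segments: List[Range], value: str) -> List[int]:
    """Return the indexes of the segments that could hold rows for value."""
    return [i for i, r in enumerate(segments) if may_contain(r, value)]

# Hypothetical ranges for a table clustered by airline code.
segments = [(None, "AA"), ("AA", "B6"), ("B6", "DL"), ("DL", "OO"), ("OO", None)]

print(prune(segments, "AA"))  # a boundary value can appear in two segments
print(prune(segments, "CO"))  # an interior value maps to a single segment
```

A filter on one airline code touches at most two of the five hypothetical segments here, which mirrors the "1 or 2 segment files" behavior described for the clustered table.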

## Measuring the impact of segment layout on query performance
Now that the data has been loaded using a few different partitioning strategies, you will compare the query performance of each approach.
The `useCache` and `populateCache` query context parameters are set to false in this part of the notebook so that you can measure the performance of the same query against each table without cached results.

### The query
It is well known that flight traffic follows a weekly pattern as well as a seasonal pattern. Given that you only have a month of data, take a look at the maximum and average delay for a single airline by day of the week so that you can see the weekly pattern and how it affects delays. You'll only consider the departure delay for now, and only count a flight as "delayed" when its departure delay is greater than 120 minutes, which matches the `DepDelay>120` filter in the query.

The following cell defines a function you can use to measure performance:

In [None]:
# The measure_query function runs the specified SQL query multiple times
# and measures the duration of each run.
# The output is a string with the results including the mean, minimum,
# and maximum durations measured.
def measure_query(sql: str, iterations: int) -> str:
    req = sql_client.sql_request(sql)
    req.add_context("populateCache", "false")  # run without caching results to get a real sense of performance
    req.add_context("useCache", "false")  # do not use cached results
    stats = []
    for _ in range(iterations):
        start = datetime.now()
        sql_client.sql(req)
        end = datetime.now()
        stats.append((end - start).total_seconds() * 10**3)  # add run time in milliseconds
    return f"Results = avg:{mean(stats)} ms   min:{min(stats)} ms  max:{max(stats)} ms"

Run the test SQL to see that in November 2005, American Airlines had the highest percentage of delayed flights on Sundays (7), although the worst average delay occurred on Fridays (5):

In [None]:
sql = '''
SELECT EXTRACT( DOW FROM __time) as day_of_week, 
       AVG(DepDelay) FILTER ( WHERE DepDelay>120) as avg_delay,
       MAX(DepDelay) FILTER ( WHERE DepDelay>120) as max_delay,
       ROUND(COUNT(1) FILTER ( WHERE DepDelay>120) * 100.0 / COUNT(1), 1) as percent_delayed
FROM "{}"
WHERE "IATA_CODE_Reporting_Airline" = 'AA'
GROUP BY 1 
ORDER BY 1 ASC
'''

display.sql(sql.format("flights-hour"))

### Performance results

In [None]:
# run 100 queries with each table
use_table="flights-hour"
print (f"{use_table} {measure_query(sql.format(use_table), 100)}")

use_table="flights-year"
print (f"{use_table} {measure_query(sql.format(use_table), 100)}")

use_table="flights-year-50k"
print (f"{use_table} {measure_query(sql.format(use_table), 100)}")

use_table="flights-year-IATA"
print (f"{use_table} {measure_query(sql.format(use_table), 100)}")

### Single concurrency results

The characteristics of your laptop or other execution environment are likely different from the `Apple M2 Pro` that this notebook was built on, so your results will vary. My results were:
```
flights-hour Results      = avg:28.848 ms   min:25.198 ms   max:64.379 ms
flights-year Results      = avg:15.451 ms   min:13.891 ms   max:17.940 ms
flights-year-50k Results  = avg:12.844 ms   min:11.594 ms   max:15.130 ms
flights-year-IATA Results = avg:12.574 ms   min:11.221 ms   max:15.204 ms
```

- `flights-hour` - With a segment granularity of hour, there are approximately 30 * 24 = 720 segment files to scan and no pruning. Even though the segments are tiny, each one is examined separately and the results are then merged. Doing that 720 times makes this the slowest option.
- `flights-year` - Here the query uses a single segment. It is much larger at 500k+ rows, but Druid can still use the indexes on the dimension columns to process it quickly, and it only does so once, so this is faster than hourly. This shows that the number of segments a query needs is a factor in performance. Fewer segments to process tends to improve performance, but not always.
- `flights-year-50k` - Now each query has to process twelve 50k-row segments, but because the Historical processes them concurrently on multiple threads, the parallelism is enough to improve on the single 500k-row segment of the `flights-year` table.
- `flights-year-IATA` - Given the shard spec of the 12 resulting segments, you can see that rows with IATA code "AA" appear in only 2 of the segments. Both are processed in parallel and there are fewer segments to process, so this is the fastest option.
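
To put these numbers in perspective, the following sketch computes each table's speedup relative to `flights-hour`. The averages are copied from my run above; substitute your own measurements, since they will differ.

```python
# Average latencies (ms) copied from the single concurrency results above.
avg_ms = {
    "flights-hour": 28.848,
    "flights-year": 15.451,
    "flights-year-50k": 12.844,
    "flights-year-IATA": 12.574,
}

baseline = avg_ms["flights-hour"]

# Speedup of each table relative to the hourly table (higher is faster).
speedup = {table: baseline / ms for table, ms in avg_ms.items()}

for table, factor in speedup.items():
    print(f"{table:<18} {avg_ms[table]:7.3f} ms  {factor:.2f}x vs flights-hour")
```

On my numbers, moving from hourly to yearly segments roughly halves the average latency, while the two 12-segment layouts are within a few percent of each other.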

### Effects of higher concurrency

Above, the results for `flights-year-IATA` seem very similar to the `flights-year-50k` results, but a real system processes multiple queries at once. Expanding the test to run with higher concurrency will tell us more. The expectation is that the organization of the data that requires less work per query will ultimately be better for higher concurrency.

You'll need another set of functions to drive the queries in parallel.


In [None]:
from threading import Thread
from datetime import datetime
from statistics import mean, median


# Custom thread class that runs a query multiple times so that you can execute concurrently on multiple threads
class QueryThread(Thread):
    # constructor
    def __init__(self, sql: str, iterations: int ):
        # execute the base constructor
        Thread.__init__(self)
        # set a default value
        self.stats = []
        self.sql = sql
        self.iterations = iterations
 
    # function executed in a new thread
    def run(self):
        self.stats = self.measure_query( self.sql, self.iterations)
        
    def measure_query( self, sql: str, iterations: int ):
        req = sql_client.sql_request(sql)
        req.add_context("populateCache", "false")  # run without caching results to get a real sense of performance
        req.add_context("useCache", "false")  # do not use cached results
        stats=[]
        for _ in range(iterations):
            start = datetime.now()
            sql_client.sql(req)
            end = datetime.now()
            stats.append((end - start).total_seconds() * 10**3)  # add run time in milliseconds
        return stats

# This function spawns "numThreads" parallel threads that each run
# the specified "sql" "iterations" times, then reports the combined results.
def test_in_parallel( sql: str, numThreads: int, iterations: int):
    threads=[]
    while numThreads>0:
        try:
            thrd = QueryThread( sql, iterations)
            thrd.start()
            threads.append(thrd)
        except RuntimeError as err:
            # Thread.start() raises RuntimeError when a thread cannot be started
            print(f"Error: unable to start thread: {err}")
        numThreads-=1
    total_stats = []
    for thrd in threads:
        thrd.join()
        total_stats.extend(thrd.stats)
    # report the mean so that the value matches its "avg" label
    return f"Results = count:{len(total_stats)}  avg:{mean(total_stats):.3f} ms   min:{min(total_stats):.3f} ms  max:{max(total_stats):.3f} ms"
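
As an alternative, the same fan-out pattern can be sketched with the standard library's `concurrent.futures` and `time.perf_counter`. This is a minimal sketch rather than the notebook's method: `time_calls`, `run_concurrently`, and `fake_query` are hypothetical names, and `fake_query` is a stand-in workload so that the example runs without a Druid cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def time_calls(run_query, iterations):
    """Call run_query() `iterations` times and return each latency in ms."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()  # monotonic clock, suited to short intervals
        run_query()
        timings.append((time.perf_counter() - start) * 1000)
    return timings


def run_concurrently(run_query, num_threads, iterations):
    """Run time_calls on num_threads workers and pool all the timings."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(time_calls, run_query, iterations)
                   for _ in range(num_threads)]
        # result() re-raises any exception raised in the worker thread
        return [ms for future in futures for ms in future.result()]


# Stand-in workload (an assumption) so the sketch runs without Druid.
def fake_query():
    time.sleep(0.001)


timings = run_concurrently(fake_query, num_threads=2, iterations=5)
print(f"count:{len(timings)}  avg:{mean(timings):.3f} ms  "
      f"min:{min(timings):.3f} ms  max:{max(timings):.3f} ms")
```

To drive real queries, you could pass a callable such as `lambda: sql_client.sql(req)`, with `req` built as in `measure_query` above.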

#### Concurrency = 2
In this test you will use the best table designs from the previous step (`flights-year-50k`, `flights-year-IATA`) to process queries in two concurrent threads.

In [None]:
use_table="flights-year-50k"
print (f"{use_table} {test_in_parallel(sql.format(use_table), 2, 50)}")

use_table="flights-year-IATA"
print (f"{use_table} {test_in_parallel(sql.format(use_table), 2, 50)}")

My results:
```
flights-year-50k Results  = count:100  avg:13.481 ms   min:11.139 ms  max:38.034 ms
flights-year-IATA Results = count:100  avg:13.073 ms   min:11.251 ms  max:20.746 ms
```

The avg suffered a bit for both tables when compared to the single concurrency test. 
The interesting change is in the max, where the IATA clustered table is about `1.8` x faster (38.034 ms vs 20.746 ms).
Let's take it up another notch...

#### Concurrency = 4
Double the concurrency and re-run the test:

In [None]:
use_table="flights-year-50k"
print (f"{use_table} {test_in_parallel(sql.format(use_table), 4, 50)}")

use_table="flights-year-IATA"
print (f"{use_table} {test_in_parallel(sql.format(use_table), 4, 50)}")

My results:
```
flights-year-50k Results  = count:200  avg:14.543 ms   min:11.341 ms  max:68.145 ms
flights-year-IATA Results = count:200  avg:13.909 ms   min:11.019 ms  max:33.341 ms
```
The avg continues to grow, meaning that you've likely hit some resource limit. The minimum remains unchanged, which also makes sense because the first time the query runs it probably doesn't need to wait. Subsequent parallel requests need to wait in a queue while resources are released, and as you increase the concurrency, there are more queries in the queue.
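
One way to picture this queueing effect is a deliberately simplified model. It assumes a fixed number of processing slots, a constant service time per query, and all queries arriving at once. None of these assumptions reflects Druid's actual scheduler, and the numbers are hypothetical; the sketch only shows why the minimum stays flat while the average and maximum grow.

```python
from statistics import mean


def simulate_latencies(num_queries, slots, service_ms):
    """Latency of each query when queries are served in waves of `slots`."""
    latencies = []
    for i in range(num_queries):
        wave = i // slots + 1  # 1 for the first `slots` queries, 2 for the next, ...
        latencies.append(wave * service_ms)
    return latencies


# Hypothetical numbers: 2 processing slots, 12 ms per query.
for concurrency in (2, 4, 8):
    lat = simulate_latencies(concurrency, slots=2, service_ms=12.0)
    print(f"concurrency={concurrency}  min:{min(lat):.1f} ms  "
          f"avg:{mean(lat):.1f} ms  max:{max(lat):.1f} ms")
```

In this toy model, a layout that needs less work per query corresponds to a smaller `service_ms`, which shortens every wave and so reduces the maximum the most.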

The important difference continues to be in the max value, which is now about `2.04` x faster on the IATA table.
Let's get one more data point...

#### Concurrency = 8

In [None]:
use_table="flights-year-50k"
print (f"{use_table} {test_in_parallel(sql.format(use_table), 8, 50)}")

use_table="flights-year-IATA"
print (f"{use_table} {test_in_parallel(sql.format(use_table), 8, 50)}")

My results:
```
flights-year-50k Results  = count:400  avg:25.969 ms   min:11.595 ms  max:97.475 ms
flights-year-IATA Results = count:400  avg:24.171 ms   min:11.979 ms  max:43.507 ms
```
Again, the avg continues to grow and the minimum continues unchanged.

The important difference continues to be in the max value, which is now about `2.24` x faster on the IATA table.

As you increase concurrency, the IATA table continues to handle the load more effectively. This effect will be more noticeable in a clustered deployment with more independent resources and as segment sizes grow to their more typical 3-10 million rows each.
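
The trend in the max values can be checked with a short calculation. The figures below are copied from my results at each concurrency level, including the single concurrency run; replace them with your own measurements.

```python
# (flights-year-50k max, flights-year-IATA max) in ms, keyed by concurrency.
max_ms = {
    1: (15.130, 15.204),
    2: (38.034, 20.746),
    4: (68.145, 33.341),
    8: (97.475, 43.507),
}

# Ratio above 1 means the IATA clustered table had the lower worst-case latency.
ratios = {c: m50k / iata for c, (m50k, iata) in max_ms.items()}

for concurrency, ratio in ratios.items():
    print(f"concurrency={concurrency}  max ratio (50k / IATA) = {ratio:.2f}")
```

On my numbers the ratio starts near parity at single concurrency and grows with each doubling of concurrency.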

## Clean up

Run the following cell to remove the tables created in the database.

In [None]:
druid.datasources.drop("flights-hour")
druid.datasources.drop("flights-year")
druid.datasources.drop("flights-year-50k")
druid.datasources.drop("flights-year-IATA")

## Summary

* You learned that
  * Druid works best with fewer optimally sized segments.
  * Optimal segment size will vary based on the use case, but 5 million rows is a good starting point.
  * PARTITIONED BY is used to define the segment granularity of a table.
  * CLUSTERED BY is used to organize segments within a segment granularity time interval.
  * CLUSTERED BY columns should be selected based on the most common filter criteria for the queries it will serve.
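
As a reminder of how the two clauses fit together, here is a minimal sketch that assembles a SQL-based ingestion statement as a string. `build_replace_sql` is a hypothetical helper, not part of the Druid API, and re-ingesting from an existing table with `SELECT *` is an assumption made for illustration; it is not necessarily how the tables in this notebook were built.

```python
def build_replace_sql(target, source, granularity, cluster_by=()):
    """Build a REPLACE statement that rewrites `source` into `target`.

    Hypothetical helper for illustration only.
    """
    sql = (
        f'REPLACE INTO "{target}" OVERWRITE ALL\n'
        f'SELECT * FROM "{source}"\n'
        f'PARTITIONED BY {granularity}'  # segment granularity: HOUR, DAY, MONTH, YEAR
    )
    if cluster_by:
        # CLUSTERED BY organizes segments within each time chunk
        columns = ", ".join(f'"{col}"' for col in cluster_by)
        sql += f'\nCLUSTERED BY {columns}'
    return sql


statement = build_replace_sql(
    "flights-year-IATA", "flights-year", "YEAR",
    cluster_by=["IATA_CODE_Reporting_Airline"],
)
print(statement)
```

The clustering column here matches the filter used by the test query, which is what allows the Broker to prune segments within the time chunk.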
  
## Learn more

* Test performance of different test queries against each of the partitioning strategies.
* Take a look at the [06-partitioning-while-streaming.ipynb](06-partitioning-while-streaming.ipynb) notebook, which shows how to address partitioning and clustering for streaming.
* Read about:
  * [SQL Based Ingestion](https://druid.apache.org/docs/latest/multi-stage-query/)
  * [Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning)
  * [PARTITIONED BY and CLUSTERED BY docs](https://druid.apache.org/docs/latest/multi-stage-query/reference)
