# Handling updates and upserts in Druid 
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Druid ingests data in batch or in real time and builds immutable segment files, which it publishes to deep storage. The key word here is _immutable_: once created, a segment file cannot change. So how do you go about updating data?

The answer is that segments need to be rebuilt and republished. This tutorial demonstrates how to work with REPLACE SQL to execute data updates. 

In this tutorial you perform the following tasks:

- Ingest a data set with 30 days of history.
- Update specific rows.
- Delete rows.
- Replace a whole timeframe of data with a new set of data.
- Perform upserts from a change data set that includes updates to existing rows and new rows.
- Replace the history of events for one entity in a multi-entity dataset.


## Prerequisites

This tutorial works with Druid 28.0.0 or later.

#### Run with Docker


Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells set up the notebook and learning environment ready for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os
import json

# Use DRUID_HOST when it is set; otherwise fall back to localhost
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)


# Datagen client 
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")


display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load History

Druid stores data in immutable segment files. Data updates come in two forms:

- Rewriting existing segment files with the changes applied.
- Creation of segment files that overlay small portions of a larger time chunk.

Run the following cells to create a table called `example-flights-updates`, which holds 30 days of `flights` data. The departure time column (`depaturetime` in the source file) is mapped to the Druid `__time` column. In the resulting table, individual rows are uniquely identified by `__time`, `airline`, and `flight_number`.
When the ingestion completes, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-flights-updates" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  "IATA_CODE_Reporting_Airline" as "airline", 
  "Flight_Number_Reporting_Airline" as "flight_number", 
  "arrivalime" as "arrival_time", 
  "Tail_Number" as "tail_number", 
  "Origin" as "origin", 
  "Dest" as "destination", 
  "DepDelayMinutes" as "departure_delay"
FROM "ext"
WHERE "Cancelled" = 0  -- only interested in flown flights
PARTITIONED BY DAY'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates')
display.table('example-flights-updates')

Run the following cell to see the segment stats. Columns like `num_rows` are populated asynchronously in the metadata, so you may need to run this a few times to see the correct values for each of the segments.

Given that the data is partitioned by day, you will see 30 1-day segments spanning 11/1/2005 to 11/30/2005.

In [None]:
sql='''
SELECT "start", "end", "num_rows", "version" 
FROM sys.segments 
WHERE datasource='example-flights-updates'
'''
display.sql(sql)
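
As a rough cross-check of the `start` and `end` values returned above, the expected one-day intervals can be enumerated in plain Python. This is only a sketch: the helper name `day_intervals` is illustrative and not part of `druidapi`, and it assumes the data covers every day of November 2005.

```python
from datetime import date, timedelta

def day_intervals(first_day, last_day):
    """Return (start, end) ISO-8601 strings for each one-day time chunk; end is exclusive."""
    intervals = []
    day = first_day
    while day <= last_day:
        next_day = day + timedelta(days=1)
        intervals.append((f"{day.isoformat()}T00:00:00.000Z",
                          f"{next_day.isoformat()}T00:00:00.000Z"))
        day = next_day
    return intervals

expected = day_intervals(date(2005, 11, 1), date(2005, 11, 30))
print(len(expected))              # number of one-day chunks
print(expected[0], expected[-1])  # first and last interval
```

Comparing this list with the query output is a quick way to spot a missing or unexpected time chunk.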

## Updating a single row


You can update a single row in a table by using a [REPLACE statement](https://druid.apache.org/docs/28.0.0/multi-stage-query/reference#replace) that combines the existing rows with the changed row, resulting in a new set of rows that cover a small portion of time. The new segment that is created overshadows that portion of time when the time interval is queried.  

As an example, update a row that has an incorrect tail_number because the aircraft was replaced.
- The flight was incorrectly recorded as using the plane with Tail_Number 'N063AA' 
- There was a change of aircraft to Tail_Number 'N073AA' 
- The unique key in this table involves __time (departure time), airline and flight_number
- The flight in question was the `AA` `513` flight that left at `2005-11-01T00:15:00.000Z`

In a traditional SQL database you would do something like:
```
UPDATE "example-flights-updates"
    SET "tail_number"='N073AA'
  WHERE "__time"= '2005-11-01T00:15:00.000Z' 
        AND "airline"='AA' 
        AND "flight_number"=513
```

Run the following cell to look at that row as it exists in the table before an update:

In [None]:

sql='''
SELECT * FROM "example-flights-updates" 
WHERE "__time"='2005-11-01T00:15:00.000Z' 
  AND "airline"='AA' 
  AND "flight_number"=513
'''
display.sql(sql)


### Update rows using partial segment overshadowing

At query time Druid selects the segments that are relevant to a query. Typically this involves a set of segment files that cover regular time intervals based on the segment granularity that was ingested. Up to this point you have ingested the data into DAY partitions. 

One of the lesser known capabilities in Druid is the ability to overlay segments of different granularity. This capability allows you to change the contents of a portion of the timeline within a particular segment partition by adding a smaller time granularity segment on top of existing data. 

![](assets/partial-overshadow.png)

The functionality in general is called [overshadowing](https://druid.apache.org/docs/28.0.0/ingestion/tasks#overshadowing-between-segments). In this case we are taking advantage of partial segment overshadowing at query time, which uses the existing segment for most of the data and the new one for a small portion of the time interval. 
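
The idea can be illustrated with a small in-memory model. This is a simplified sketch, not Druid's implementation: segments are plain dictionaries, versions are ISO timestamp strings compared lexicographically, and every row other than the `AA` `513` flight is made up for the example.

```python
def visible_rows(segments, start, end):
    """Return rows with start <= __time < end, reading each instant from the newest segment covering it."""
    result = []
    for seg in segments:
        for row in seg["rows"]:
            ts = row["__time"]
            if not (start <= ts < end):
                continue
            # A row is hidden when a segment with a newer version covers its timestamp.
            hidden = any(
                other["start"] <= ts < other["end"] and other["version"] > seg["version"]
                for other in segments
            )
            if not hidden:
                result.append(row)
    return sorted(result, key=lambda r: (r["__time"], r["airline"], r["flight_number"]))

# Original one-day segment (hypothetical rows and versions).
day_segment = {
    "start": "2005-11-01T00:00:00.000Z", "end": "2005-11-02T00:00:00.000Z",
    "version": "2024-01-01T00:00:00.000Z",
    "rows": [
        {"__time": "2005-11-01T00:15:00.000Z", "airline": "AA", "flight_number": 513, "tail_number": "N063AA"},
        {"__time": "2005-11-01T00:15:00.000Z", "airline": "XX", "flight_number": 1, "tail_number": "N000XX"},
        {"__time": "2005-11-01T09:30:00.000Z", "airline": "XX", "flight_number": 2, "tail_number": "N001XX"},
    ],
}
# Newer one-second segment that rewrites every row in that second.
second_segment = {
    "start": "2005-11-01T00:15:00.000Z", "end": "2005-11-01T00:15:01.000Z",
    "version": "2024-01-02T00:00:00.000Z",
    "rows": [
        {"__time": "2005-11-01T00:15:00.000Z", "airline": "AA", "flight_number": 513, "tail_number": "N073AA"},
        {"__time": "2005-11-01T00:15:00.000Z", "airline": "XX", "flight_number": 1, "tail_number": "N000XX"},
    ],
}

rows = visible_rows([day_segment, second_segment],
                    "2005-11-01T00:00:00.000Z", "2005-11-02T00:00:00.000Z")
for row in rows:
    print(row)
```

The model also shows why the smaller segment has to carry every row for its interval: anything left out of it would no longer be visible for that second.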

In [None]:
# The smallest slice of a time chunk that can be updated is 1 second
# Druid's segment overlapping and versioning strategy requires that the new segment include
# all of the rows in the time interval where a row is going to be changed. 
# This SQL shows which other rows occur in the same second.
sql='''
SELECT * FROM "example-flights-updates" 
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:15:00.000Z/PT1S') 
'''
display.sql(sql)


In [None]:
#before updating anything, get a count and a checksum so we can validate data consistency
sql='''
SELECT count(*) "total_flights_day", SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:00:00.000Z/P1D')  -- totals for the day
'''
display.sql(sql)

For this update, given that it is a single row, you can get away with creating a segment that only covers 1 second of time.
As you can see above, besides the record that needs updating, there is only one other record during that second that you need to include. 

A few important points on how to use this method:

- Use a SQL REPLACE statement that overwrites the minimum timeframe where this update occurs. That's `00:15:00 <= t < 00:15:01` of `2005-11-01` of the timeline as expressed in the OVERWRITE WHERE clause in the SQL below. 
- The SELECT portion of the request must read the data for the target timeframe, including the row that is being updated and any other rows that exist in the same timeframe. In the SQL below it is: `WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:15:00.000Z/PT1S')`
- The `PARTITIONED BY` clause must fit exactly within the overwrite timeframe. The SQL below uses `FLOOR(__time TO SECOND)`, which creates second-level partitions.
- The updated columns can be expressed as conditional CASE expressions that change the value of the column when the key columns match the target row(s). The expression used in the SQL below is:
  
  ```
      CASE 
         WHEN "__time"= '2005-11-01 00:15:00.000Z' 
                AND "airline"='AA' 
                AND "flight_number"=513 
         THEN 'N073AA' 
      ELSE 
         "tail_number" 
      END as "tail_number"
  ```
  This expression will specify the value `N073AA` as the tail number for the `AA 513` flight that left at `2005-11-01 00:15:00`. For all other rows, it leaves the values as is.
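
Because the OVERWRITE bounds and the SELECT filter must describe the same one-second window, it can help to derive both from a single timestamp. The helper below is a hypothetical sketch (the name `one_second_window` and the templates are not part of `druidapi`), and it assumes the `datetime` passed in is already in UTC.

```python
from datetime import datetime, timedelta

OVERWRITE_TEMPLATE = """OVERWRITE WHERE "__time" >= TIMESTAMP'{lo}' AND "__time" < TIMESTAMP'{hi}'"""
FILTER_TEMPLATE = """WHERE TIME_IN_INTERVAL("__time", '{iso}/PT1S')"""

def one_second_window(ts):
    """Return the OVERWRITE clause and the SELECT filter for the second that contains ts."""
    start = ts.replace(microsecond=0)      # truncate to the start of the second
    end = start + timedelta(seconds=1)     # exclusive upper bound
    overwrite = OVERWRITE_TEMPLATE.format(
        lo=start.strftime("%Y-%m-%d %H:%M:%S"),
        hi=end.strftime("%Y-%m-%d %H:%M:%S"),
    )
    select_filter = FILTER_TEMPLATE.format(iso=start.strftime("%Y-%m-%dT%H:%M:%S") + ".000Z")
    return overwrite, select_filter

overwrite, select_filter = one_second_window(datetime(2005, 11, 1, 0, 15, 0))
print(overwrite)
print(select_filter)
```

Generating both clauses from one value avoids the easy mistake of overwriting one interval while selecting rows from another.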

In [None]:
sql = '''
REPLACE INTO "example-flights-updates" 
   OVERWRITE -- only overwrite the second for the __time of the updated row
       WHERE "__time" >= TIMESTAMP'2005-11-01 00:15:00' AND "__time" < TIMESTAMP'2005-11-01 00:15:01'
SELECT 
   "__time", 
   "airline", 
   "flight_number",  
   "arrival_time", 
      
      -- the following expression only makes the change to the row with the correct key
      -- any columns being updated require a similar expression
      CASE 
         WHEN "__time"= '2005-11-01 00:15:00.000Z' 
                AND "airline"='AA' 
                AND "flight_number"=513 
         THEN 'N073AA' 
      ELSE 
         "tail_number" 
      END as "tail_number",
    
  "origin", 
  "destination", 
  "departure_delay"
FROM "example-flights-updates" 
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:15:00.000Z/PT1S')
PARTITIONED BY FLOOR(__time TO SECOND) 
'''
display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates')

Query the time interval again: you should still see two rows, now with the updated `tail_number` on the American Airlines flight.

In [None]:
sql='''
SELECT * FROM "example-flights-updates" 
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:15:00.000Z/PT1M') 
'''
display.sql(sql)


Take a look at the segments for the whole day at this point:

In [None]:
# how did the segments change for the 1 day timeframe for 11/01/2005 <= t < 11/02/2005
sql='''
SELECT "start", "end", "num_rows", "version" 
FROM sys.segments 
WHERE datasource='example-flights-updates'
  AND "start" >=  '2005-11-01T00:00:00.000Z' AND "start" < '2005-11-02T00:00:00.000Z'
'''
display.sql(sql)

The full day segment with a timeframe of `2005-11-01 00:00:00` <= t < `2005-11-02 00:00:00` is still the same immutable file it was before the update. But now there is a second segment with only 2 rows that covers the one second timeframe `2005-11-01 00:15:00` <= t < `2005-11-01 00:15:01`. At query time, since the new segment has a newer version, it overshadows any rows within the larger segment that fall into that second.

Check that the count and checksum still match with the following SQL:

In [None]:
# verify the checksums
sql='''
SELECT count(*) "total_flights_day", SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:00:00.000Z/P1D')  -- totals for the day
'''
display.sql(sql)

## Delete rows using partial segment overshadowing
Multiple layers of segment overlays can occur. The newest version is always used.

To delete the same row in a traditional SQL database, you would use:
   ```
   DELETE FROM "example-flights-updates"
   WHERE "__time"= '2005-11-01 00:15:00.000Z' 
     AND "airline"='AA' 
     AND "flight_number"=513
   ```
Run the following cell to achieve the same result by using a REPLACE statement. Again, keep the following points in mind:
- Use a SQL REPLACE statement that overwrites the minimum timeframe where this delete occurs. That's `00:15:00 <= t < 00:15:01` of `2005-11-01` of the timeline as expressed in the OVERWRITE WHERE clause in the SQL below. 
- The SELECT portion of the request must read the data for the target timeframe, including all the rows that exist in that time interval except for the row(s) being deleted. In the SQL below that is:
   ```
   WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:15:00.000Z/PT1S')
         AND
            NOT (                                        -- exclude row for the key
                "__time"= '2005-11-01 00:15:00.000Z' 
                AND "airline"='AA' 
                AND "flight_number"=513
                 )
   ```
- The `PARTITIONED BY` clause must fit exactly within the overwrite timeframe. The SQL below uses `FLOOR(__time TO SECOND)` which creates partitions that are one second long.
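
The delete-by-replace pattern can be mimicked on an in-memory list to make the row-level effect concrete. This is a sketch under stated assumptions: `replace_interval` and `delete_by_key` are hypothetical helpers, timestamps are ISO strings in a single format so string comparison orders them, and the `XX` rows are invented.

```python
def replace_interval(table, start, end, new_rows):
    """Mimic REPLACE ... OVERWRITE WHERE: drop rows in [start, end), then add new_rows."""
    kept = [r for r in table if not (start <= r["__time"] < end)]
    return kept + list(new_rows)

def delete_by_key(table, start, end, key):
    """Delete one row by rewriting its interval with every other row from that interval."""
    survivors = [
        r for r in table
        if start <= r["__time"] < end
        and (r["__time"], r["airline"], r["flight_number"]) != key
    ]
    return replace_interval(table, start, end, survivors)

table = [
    {"__time": "2005-11-01T00:15:00.000Z", "airline": "AA", "flight_number": 513},
    {"__time": "2005-11-01T00:15:00.000Z", "airline": "XX", "flight_number": 1},  # hypothetical row
    {"__time": "2005-11-01T09:30:00.000Z", "airline": "XX", "flight_number": 2},  # hypothetical row
]

after = delete_by_key(
    table,
    "2005-11-01T00:15:00.000Z", "2005-11-01T00:15:01.000Z",
    ("2005-11-01T00:15:00.000Z", "AA", 513),
)
print(len(table), len(after))
```

Note that the row outside the one-second interval is never read or rewritten, which is what keeps this kind of delete cheap.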

In [None]:
sql = '''
REPLACE INTO "example-flights-updates" 
   OVERWRITE                                       -- only overwrite the second for the __time of the deleted row
       WHERE "__time" >= TIMESTAMP'2005-11-01 00:15:00' AND "__time" < TIMESTAMP'2005-11-01 00:15:01'
SELECT 
   "__time", 
   "airline", 
   "flight_number",  
   "arrival_time", 
   "tail_number",    
   "origin", 
   "destination", 
   "departure_delay"
FROM "example-flights-updates" 
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:15:00.000Z/PT1S')
  AND NOT ( "__time"= '2005-11-01 00:15:00.000Z'                   -- exclude the row being deleted
                AND "airline"='AA' 
                AND "flight_number"=513 )
PARTITIONED BY FLOOR(__time TO SECOND) 
'''
display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates')

In [None]:
# how did the segments change for the 1 day timeframe for 11/01/2005 <= t < 11/02/2005
sql='''
SELECT "start", "end", "num_rows", "version" 
FROM sys.segments 
WHERE datasource='example-flights-updates'
  AND "start" >=  '2005-11-01T00:00:00.000Z' AND "start" < '2005-11-02T00:00:00.000Z'
'''
display.sql(sql)

The 1-second segment was overshadowed by the new 1-second segment. Examine the version you see here and compare it to the version in the prior 1-second segment shown earlier in the notebook. The version is a newer timestamp and the new segment completely covers the time-frame of the prior segment. Since the prior segment is overshadowed and no longer needed, Druid removes it.

Notice that the new 1-second segment only contains one row. Query the checksums now to see the change:

In [None]:
sql='''
SELECT count(*) "total_flights_day", SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-01T00:00:00.000Z/P1D')  -- totals for the day
'''
display.sql(sql)
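
The count-and-sum check used in the cells above can be reasoned about with a tiny Python model. The rows and delay values below are invented for illustration; only the tail numbers come from the tutorial.

```python
def checksum(rows):
    """Return (row count, total departure delay), like COUNT(*) and SUM("departure_delay")."""
    return len(rows), sum(r["departure_delay"] for r in rows)

original = [
    {"flight_number": 513, "tail_number": "N063AA", "departure_delay": 5},
    {"flight_number": 1, "tail_number": "N000XX", "departure_delay": 0},
    {"flight_number": 2, "tail_number": "N001XX", "departure_delay": 12},
]

# Updating tail_number touches neither the row count nor the summed column.
updated = [dict(r, tail_number="N073AA") if r["flight_number"] == 513 else r for r in original]

# Deleting the row changes both values.
deleted = [r for r in updated if r["flight_number"] != 513]

print(checksum(original))
print(checksum(updated))
print(checksum(deleted))
```

A checksum like this only detects changes to the columns it aggregates, so an update to `tail_number` passes the check while a delete does not.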

## Replacing a complete time interval with new data 
Sometimes a full set of the data is delivered as an updated batch file that covers a known timeframe, for example, a whole day.

Since the flight sample data we loaded does not come with a change set, the following cell simulates one by writing a day's worth of flights to another table while introducing a change to the `departure_delay` values. The generated table only includes the data for the day `2005-11-02` as expressed in the following SQL `WHERE TIME_IN_INTERVAL("__time", '2005-11-02T00:00:00.000Z/P1D')`:

In [None]:
sql='''
REPLACE INTO "example-flights-updates-changeset-day" 
   OVERWRITE ALL
SELECT
  "__time", 
  "airline", 
  "flight_number", 
  "arrival_time", 
  "tail_number", 
  "origin", 
  "destination", 
   0 "departure_delay" -- setting all delays to zero to show the change
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-02T00:00:00.000Z/P1D') 
PARTITIONED BY DAY    -- replacing a whole day so partition by day
'''
display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates-changeset-day')

Calculate a checksum for `2005-11-02` with the following SQL to see the data before the change:

In [None]:
sql='''
SELECT count(*) "total_flights_day", SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-02T00:00:00.000Z/P1D')  -- totals for the day
'''
display.sql(sql)

The next cell ingests the generated data overwriting the existing data for `2005-11-02`.
In traditional SQL this would be:
   ```
   DELETE FROM "example-flights-updates"
    WHERE "__time" >= '2005-11-02T00:00:00.000Z' AND "__time" < '2005-11-03T00:00:00.000Z'
   ;

   INSERT INTO "example-flights-updates"
     SELECT * FROM "example-flights-updates-changeset-day"
   ;
   ```

You can achieve this in Druid with a SQL REPLACE as follows:
- Use a REPLACE statement that overwrites the whole day. In the SQL below that is:
   ```
   OVERWRITE 
   WHERE "__time" >= TIMESTAMP'2005-11-02 00:00:00' AND "__time" < TIMESTAMP'2005-11-03 00:00:00'
   ```
- The SELECT portion of the request only needs to read the data from the changeset table because it only contains data for the day being replaced.
- Since the statement is meant to replace the whole day, the `PARTITIONED BY` clause uses `DAY`.
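
This approach relies on the change set containing rows for the replaced day only. A minimal way to reason about that assumption is to look for rows that fall outside the day's bounds. The helper `rows_outside_day` and the sample rows are hypothetical; in the notebook the equivalent check would be a query against the changeset table.

```python
from datetime import datetime, timedelta

def rows_outside_day(rows, day_start):
    """Return the rows whose __time falls outside [day_start, day_start + 1 day)."""
    day_end = day_start + timedelta(days=1)
    return [r for r in rows if not (day_start <= r["__time"] < day_end)]

changeset = [
    {"__time": datetime(2005, 11, 2, 0, 0, 0), "flight_number": 1},
    {"__time": datetime(2005, 11, 2, 23, 59, 59), "flight_number": 2},
]
stray = {"__time": datetime(2005, 11, 3, 0, 0, 0), "flight_number": 3}

print(rows_outside_day(changeset, datetime(2005, 11, 2)))
print(rows_outside_day(changeset + [stray], datetime(2005, 11, 2)))
```

The upper bound is exclusive, so a row stamped exactly at midnight of the next day belongs to the next day's time chunk.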

In [None]:
sql='''
REPLACE INTO "example-flights-updates" 
   OVERWRITE 
   WHERE "__time" >= TIMESTAMP'2005-11-02 00:00:00' AND "__time" < TIMESTAMP'2005-11-03 00:00:00'
SELECT
  "__time", 
  "airline", 
  "flight_number", 
  "arrival_time", 
  "tail_number", 
  "origin", 
  "destination", 
  "departure_delay"
FROM "example-flights-updates-changeset-day"
PARTITIONED BY DAY    -- replacing a whole day so partition by day
'''
display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates')

Take a look at the segments for that day now with the following SQL.
Notice that the version has changed and the number of rows is still the same because the change set had the same number of rows.

The ingestion fully replaced the segment for that day with the new set of data, deleting the old rows.

In [None]:
sql='''
SELECT "start", "end", "num_rows", "version" 
FROM sys.segments 
WHERE datasource='example-flights-updates'
  AND "start" >=  '2005-11-02T00:00:00.000Z' AND "start" < '2005-11-03T00:00:00.000Z'
'''
display.sql(sql)

Calculate the checksum again and notice that while the row count is still consistent, the checksum `total_time_lost` is now zero, reflecting the changes to `departure_delay`:

In [None]:
sql='''
SELECT count(*) "total_flights_day", SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-02T00:00:00.000Z/P1D')  -- totals for the day
'''
display.sql(sql)

## UPSERTs 

A more common change data set doesn't contain all the rows in the time interval. Instead, it only includes changed rows and new rows.
In traditional SQL this would be done using a MERGE like:
   ```
   MERGE INTO "example-flights-updates" t
   USING "example-flights-updates-upsert-changeset" s
      ON t."__time"=s."__time"
     AND t."airline"=s."airline"
     AND s."flight_number"=t."flight_number" 
   WHEN MATCHED THEN
        UPDATE SET "origin"=s."origin", "destination"=s."destination", "departure_delay"=s."departure_delay"
   WHEN NOT MATCHED THEN
        INSERT (  "__time",  "airline",  "flight_number",  "origin",  "destination", "departure_delay")
        VALUES (s."__time",s."airline",s."flight_number",s."origin",s."destination",s."departure_delay")
   ```
In Druid you can do the same operation using a REPLACE statement.

Create another change set that contains some updated rows and some new rows with the following SQL:

In [None]:
sql='''
REPLACE INTO "example-flights-updates-upsert-changeset" 
   OVERWRITE ALL
SELECT
  "__time", 
  "airline", 
  
   -- create some new rows by changing one of the key values only in rows with flight_number > 4900 
  CASE       
     WHEN "flight_number">4900 THEN 5000 
     ELSE "flight_number" 
  END as "flight_number", 
  
  "arrival_time", 
  "tail_number", 
  "origin", 
  "destination", 
  
  -- set departure_delay to -10 so we can see updates in individual rows 
  -10 "departure_delay"                              
  
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-03T00:00:00.000Z/P1D') 

  -- only use some of the rows for the day to demonstrate upsert
  AND "airline"='TZ'                                   
PARTITIONED BY DAY                                     
'''
display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates-upsert-changeset')

Take a look at the data before applying the changes:

In [None]:
sql='''
SELECT min("flight_number") as "min_flight", max("flight_number") as "max_flight", 
       count(*) "total_flights_day", 
       SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-03T00:00:00.000Z/P1D')  
  AND "airline"='TZ'
'''
display.sql(sql)

Also look at a few rows so you can see the changes:

In [None]:
sql='''
SELECT __time, "airline", "flight_number", "arrival_time", "departure_delay"
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-03T00:00:00.000Z/P1D')  
  AND "airline"='TZ'
GROUP BY 1,2,3,4,5
ORDER BY "flight_number" DESC  -- since you added new rows with bigger flight numbers you'll want to compare these
LIMIT 20
'''
display.sql(sql)

You can achieve an UPSERT operation using a REPLACE statement by combining existing data with the new data to replace the portion of time affected by the updated and new rows:
- Given that the change set spans a day, use a REPLACE statement that overwrites that day. In the SQL below that is:

   ```
   OVERWRITE 
   WHERE "__time" >= TIMESTAMP'2005-11-03 00:00:00' AND "__time" < TIMESTAMP'2005-11-04 00:00:00'
   ```
   
- The SELECT portion of the request uses a FULL OUTER JOIN between the target table and the change set so that we can include existing rows which are not modified, update rows that exist in both tables and insert new rows. In the following SQL that is:

   ```
   FROM "example-flights-updates" t 
     FULL OUTER JOIN
       "example-flights-updates-upsert-changeset" s
     ON t."__time"=s."__time"
    AND t."airline"=s."airline"
    AND t."flight_number"=s."flight_number" 
   ```
   
- The column expressions for the key columns in the SELECT use a COALESCE to select the existing value for updated or untouched rows, and the new value for new rows being inserted:

   ```
   COALESCE(t."__time", s."__time") as "__time",
   COALESCE(t."airline", s."airline") as "airline",
   COALESCE(t."flight_number", s."flight_number") as "flight_number"
   ```
   
- Given that the values for non-key columns in either the target or source tables could be NULL, and you need to address values for new rows, updated rows, and untouched rows, use a CASE expression that checks for these conditions and applies the correct value:

   ```
   CASE WHEN (t."__time" IS NULL OR t."__time"=s."__time") -- new and updated rows get the new value
             THEN s."arrival_time"                       
        ELSE t."arrival_time"                              -- existing untouched rows get the existing value
   END AS "arrival_time" 
   ```
   
- Only include the rows from the existing table that fall into the update timeframe and new rows where the join results in NULL for the `t."__time"` column:

   ```
   WHERE t."__time" IS NULL OR TIME_IN_INTERVAL(t."__time", '2005-11-03T00:00:00.000Z/P1D')  
   ```
- Since the operation covers a whole day, the `PARTITIONED BY` clause uses `DAY`.
- Set the `sqlJoinAlgorithm` context parameter to `sortMerge`.
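
The row-level outcome of the steps above can be sketched in plain Python. This is an illustration of the join logic only, not how Druid executes it: the function `upsert` and the sample rows are hypothetical, timestamps are ISO strings, and the change set is assumed to fall entirely inside the overwritten interval.

```python
def upsert(target, changes, start, end):
    """Emulate the FULL OUTER JOIN upsert for rows with start <= __time < end."""
    def key(row):
        return (row["__time"], row["airline"], row["flight_number"])

    in_range = {key(r): r for r in target if start <= r["__time"] < end}
    incoming = {key(r): r for r in changes}

    merged = []
    for k in sorted(in_range.keys() | incoming.keys()):
        t, s = in_range.get(k), incoming.get(k)
        # New rows (t is None) and updated rows (both present) take the change set values;
        # untouched rows (s is None) keep the existing values.
        merged.append(dict(s) if s is not None else dict(t))

    outside = [r for r in target if not (start <= r["__time"] < end)]
    return outside + merged

target = [
    {"__time": "2005-11-02T10:00:00.000Z", "airline": "TZ", "flight_number": 100, "departure_delay": 7},
    {"__time": "2005-11-03T08:00:00.000Z", "airline": "TZ", "flight_number": 100, "departure_delay": 7},
    {"__time": "2005-11-03T09:00:00.000Z", "airline": "TZ", "flight_number": 4950, "departure_delay": 3},
]
changes = [
    {"__time": "2005-11-03T08:00:00.000Z", "airline": "TZ", "flight_number": 100, "departure_delay": -10},
    {"__time": "2005-11-03T09:00:00.000Z", "airline": "TZ", "flight_number": 5000, "departure_delay": -10},
]

result = upsert(target, changes, "2005-11-03T00:00:00.000Z", "2005-11-04T00:00:00.000Z")
for row in result:
    print(row)
```

As in the tutorial's change set, the row whose key was altered to `5000` arrives as a new row, so its original (`4950` here) has no match and is left untouched.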

Run the following SQL to apply the upsert operation:

In [None]:
sql='''
REPLACE INTO "example-flights-updates" 
   OVERWRITE 
   WHERE "__time" >= TIMESTAMP'2005-11-03 00:00:00' AND "__time" < TIMESTAMP'2005-11-04 00:00:00'
SELECT
  COALESCE(t."__time", s."__time") as "__time",
  COALESCE(t."airline", s."airline") as "airline",
  COALESCE(t."flight_number", s."flight_number") as "flight_number",
  CASE WHEN (t."__time" IS NULL OR t."__time" = s."__time") THEN s."arrival_time"       
       ELSE t."arrival_time"                                  
  END AS "arrival_time",
  CASE WHEN (t."__time" IS NULL OR t."__time" = s."__time") THEN s."tail_number"
       ELSE t."tail_number"    
  END AS "tail_number", 
  CASE WHEN (t."__time" IS NULL OR t."__time" = s."__time") THEN s."origin"
       ELSE t."origin"          
  END AS "origin", 
  CASE WHEN (t."__time" IS NULL OR t."__time" = s."__time") THEN s."destination"
       ELSE t."destination"     
  END AS "destination", 
  CASE WHEN (t."__time" IS NULL OR t."__time" = s."__time") THEN s."departure_delay"
       ELSE t."departure_delay" 
  END AS "departure_delay"
FROM "example-flights-updates" t 
  FULL OUTER JOIN
    "example-flights-updates-upsert-changeset" s
  ON t."__time"=s."__time"
  AND t."airline"=s."airline"
  AND t."flight_number"=s."flight_number" 
WHERE t."__time" IS NULL OR TIME_IN_INTERVAL(t."__time", '2005-11-03T00:00:00.000Z/P1D') 
PARTITIONED BY DAY    -- replacing a whole day so partition by day
'''
req = sql_client.sql_request(sql)
req.add_context("sqlJoinAlgorithm", 'sortMerge')
display.run_task(req)
sql_client.wait_until_ready('example-flights-updates')

Check the data changes and new rows with the same aggregation SQL as before:

In [None]:
sql='''
SELECT min("flight_number") as "min_flight", max("flight_number") as "max_flight", 
       count(*) "total_flights_day", 
       SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-03T00:00:00.000Z/P1D')  
  AND "airline"='TZ'
'''
display.sql(sql)

Notice in the result above that:
- There are now 9 new rows.
- The `departure_delay` values have been updated, producing `total_time_lost` = -1020.

You can check the 9 new rows (`flight_number = 5000`), the next 9 rows that were left as is (`4900 < flight_number < 5000`) with `departure_delay` unchanged, and some of the updated rows that show `departure_delay = -10`. Run the following SQL:

In [None]:
sql='''
SELECT __time, "airline", "flight_number", "arrival_time", "departure_delay"
FROM "example-flights-updates"
WHERE TIME_IN_INTERVAL("__time", '2005-11-03T00:00:00.000Z/P1D')  
  AND "airline"='TZ'
GROUP BY 1,2,3,4,5
ORDER BY "flight_number" DESC  -- since you added new rows with bigger flight numbers you'll want to compare these
LIMIT 20
'''
display.sql(sql)

## Revisionist history update
Imagine, for example, that you have years of credit transaction activity data and that, for a period of a few months, a person's identity was stolen. The person worked it all out with the banks and now there is new validated data for what their transactions were. So you need to do an update across a large portion of the timeline, removing the rows for this person and inserting the new ones across the same period of time.
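
In row terms, the intended result is simple to state: every row for the affected entity is dropped, the corrected rows are added, and all other entities are left alone. The sketch below shows that outcome on an in-memory list, using `airline` as the entity as the rest of this section does. The function `replace_entity_history` and the rows are hypothetical and only describe the target state, not the Druid statement that produces it.

```python
def replace_entity_history(target, corrected, entity):
    """Drop every row for the entity, add its corrected history, and keep other entities untouched."""
    others = [r for r in target if r["airline"] != entity]
    return sorted(
        others + list(corrected),
        key=lambda r: (r["__time"], r["airline"], r["flight_number"]),
    )

target = [
    {"__time": "2005-11-01T08:00:00.000Z", "airline": "TZ", "flight_number": 10},
    {"__time": "2005-11-01T09:00:00.000Z", "airline": "HA", "flight_number": 20},
    {"__time": "2005-11-15T08:00:00.000Z", "airline": "TZ", "flight_number": 11},
]
corrected = [
    {"__time": "2005-11-10T12:00:00.000Z", "airline": "TZ", "flight_number": 99},
]

result = replace_entity_history(target, corrected, "TZ")
for row in result:
    print(row)
```

Unlike the upsert, no key matching is involved: old rows for the entity disappear even when the corrected history has no row with the same key.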

Run the following cell to create a change dataset that replaces all history for an airline. 

This SQL creates a change data set that replaces the history of `airline`=`'TZ'` flights by using the history of flights from `airline` = `'HA'` and just replacing the "airline" value with `'TZ'`. 

In [None]:
sql='''
REPLACE INTO "example-flights-updates-replace-history" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip"]}',
    '{"type":"csv","findColumnsFromHeader":true}'
  )
) EXTEND ("depaturetime" VARCHAR, "arrivalime" VARCHAR, "Year" BIGINT, "Quarter" BIGINT, "Month" BIGINT, "DayofMonth" BIGINT, "DayOfWeek" BIGINT, "FlightDate" VARCHAR, "Reporting_Airline" VARCHAR, "DOT_ID_Reporting_Airline" BIGINT, "IATA_CODE_Reporting_Airline" VARCHAR, "Tail_Number" VARCHAR, "Flight_Number_Reporting_Airline" BIGINT, "OriginAirportID" BIGINT, "OriginAirportSeqID" BIGINT, "OriginCityMarketID" BIGINT, "Origin" VARCHAR, "OriginCityName" VARCHAR, "OriginState" VARCHAR, "OriginStateFips" BIGINT, "OriginStateName" VARCHAR, "OriginWac" BIGINT, "DestAirportID" BIGINT, "DestAirportSeqID" BIGINT, "DestCityMarketID" BIGINT, "Dest" VARCHAR, "DestCityName" VARCHAR, "DestState" VARCHAR, "DestStateFips" BIGINT, "DestStateName" VARCHAR, "DestWac" BIGINT, "CRSDepTime" BIGINT, "DepTime" BIGINT, "DepDelay" BIGINT, "DepDelayMinutes" BIGINT, "DepDel15" BIGINT, "DepartureDelayGroups" BIGINT, "DepTimeBlk" VARCHAR, "TaxiOut" BIGINT, "WheelsOff" BIGINT, "WheelsOn" BIGINT, "TaxiIn" BIGINT, "CRSArrTime" BIGINT, "ArrTime" BIGINT, "ArrDelay" BIGINT, "ArrDelayMinutes" BIGINT, "ArrDel15" BIGINT, "ArrivalDelayGroups" BIGINT, "ArrTimeBlk" VARCHAR, "Cancelled" BIGINT, "CancellationCode" VARCHAR, "Diverted" BIGINT, "CRSElapsedTime" BIGINT, "ActualElapsedTime" BIGINT, "AirTime" BIGINT, "Flights" BIGINT, "Distance" BIGINT, "DistanceGroup" BIGINT, "CarrierDelay" BIGINT, "WeatherDelay" BIGINT, "NASDelay" BIGINT, "SecurityDelay" BIGINT, "LateAircraftDelay" BIGINT, "FirstDepTime" VARCHAR, "TotalAddGTime" VARCHAR, "LongestAddGTime" VARCHAR, "DivAirportLandings" VARCHAR, "DivReachedDest" VARCHAR, "DivActualElapsedTime" VARCHAR, "DivArrDelay" VARCHAR, "DivDistance" VARCHAR, "Div1Airport" VARCHAR, "Div1AirportID" VARCHAR, "Div1AirportSeqID" VARCHAR, "Div1WheelsOn" VARCHAR, "Div1TotalGTime" VARCHAR, "Div1LongestGTime" VARCHAR, "Div1WheelsOff" VARCHAR, "Div1TailNum" VARCHAR, "Div2Airport" VARCHAR, "Div2AirportID" VARCHAR, "Div2AirportSeqID" VARCHAR, "Div2WheelsOn" VARCHAR, "Div2TotalGTime" 
VARCHAR, "Div2LongestGTime" VARCHAR, "Div2WheelsOff" VARCHAR, "Div2TailNum" VARCHAR, "Div3Airport" VARCHAR, "Div3AirportID" VARCHAR, "Div3AirportSeqID" VARCHAR, "Div3WheelsOn" VARCHAR, "Div3TotalGTime" VARCHAR, "Div3LongestGTime" VARCHAR, "Div3WheelsOff" VARCHAR, "Div3TailNum" VARCHAR, "Div4Airport" VARCHAR, "Div4AirportID" VARCHAR, "Div4AirportSeqID" VARCHAR, "Div4WheelsOn" VARCHAR, "Div4TotalGTime" VARCHAR, "Div4LongestGTime" VARCHAR, "Div4WheelsOff" VARCHAR, "Div4TailNum" VARCHAR, "Div5Airport" VARCHAR, "Div5AirportID" VARCHAR, "Div5AirportSeqID" VARCHAR, "Div5WheelsOn" VARCHAR, "Div5TotalGTime" VARCHAR, "Div5LongestGTime" VARCHAR, "Div5WheelsOff" VARCHAR, "Div5TailNum" VARCHAR, "Unnamed: 109" VARCHAR))
SELECT
  TIME_PARSE("depaturetime") AS "__time",
  'TZ' as "airline",                                                -- replace airline with TZ
  "Flight_Number_Reporting_Airline" as "flight_number", 
  "arrivalime" as "arrival_time",                                
  "Tail_Number" as "tail_number", 
  "Origin" as "origin", 
  "Dest" as "destination", 
  "DepDelayMinutes" as "departure_delay"
FROM "ext"
WHERE "Cancelled" = 0  AND "IATA_CODE_Reporting_Airline"='HA'       --  filtering for airline = 'HA' flights
PARTITIONED BY DAY'''

display.run_task(sql)
sql_client.wait_until_ready('example-flights-updates-replace-history')
display.table('example-flights-updates-replace-history')

Take a look at the change set:

In [None]:
sql='''
SELECT "airline", 
        min("__time") first_flight, max("__time") last_flight, 
        min("flight_number") as "min_flight_n", max("flight_number") as "max_flight_n", 
       count(*) "total_flights", 
       SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates-replace-history"
WHERE  "airline" in ('TZ', 'HA')
GROUP BY 1
'''
display.sql(sql)

...and check the target table before you change it:

In [None]:
sql='''
SELECT "airline", 
        min("__time") first_flight, max("__time") last_flight, 
        min("flight_number") as "min_flight_n", max("flight_number") as "max_flight_n", 
       count(*) "total_flights", 
       SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE "airline" in ('TZ', 'HA')
GROUP BY 1
'''
display.sql(sql)

In traditional SQL this might look like:
```
   DELETE FROM "example-flights-updates"
   WHERE "airline" = 'TZ'
   ;
   INSERT INTO "example-flights-updates"
   SELECT * FROM "example-flights-updates-replace-history"
   ;
```

You can achieve this effect with a single REPLACE that combines the existing data with the new data and overwrites the portion of time affected by the change set. You need to exclude the existing rows for the entity being replaced and merge the remaining rows with the new data using a FULL OUTER JOIN. So for this REPLACE:
-  Use a REPLACE statement that overwrites the portion of the timeline that you have new history for. In this example it is ALL:

   ```
   OVERWRITE ALL
   ```
   
- Select the existing rows from the target table in a subquery that excludes rows for the key being replaced, then use a FULL OUTER JOIN to combine that subquery with the replacement rows in the change set. In the following SQL that is:

   ```
   FROM ( SELECT * FROM "example-flights-updates" WHERE "airline"!='TZ' ) t
   FULL OUTER JOIN
        "example-flights-updates-replace-history" s
     ON t."__time"=s."__time"
    AND t."airline"=s."airline"
    AND t."flight_number"=s."flight_number" 
   ```
   
- Given the content of the tables and the FULL OUTER JOIN, each row's values come from either the target table or the source table, never both. For that reason, the column expressions for the key columns in the SELECT use COALESCE:

   ```
   COALESCE(t."__time", s."__time") as "__time",
   COALESCE(t."airline", s."airline") as "airline",
   COALESCE(t."flight_number", s."flight_number") as "flight_number"
   ```
   
- For non-key columns, use a CASE expression that checks a key column to determine whether the row is existing or new, and assigns the corresponding value from the target or source table:

   ```
   CASE WHEN t."__time" IS NULL THEN s."arrival_time"      -- new rows get new value
        ELSE t."arrival_time"                              -- existing rows get existing value
   END AS "arrival_time"
   ```
- Set the [context parameter to use `sqlJoinAlgorithm=sortMerge`](https://druid.apache.org/docs/28.0.0/multi-stage-query/reference#sort-merge) to enable the FULL OUTER JOIN.
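
The steps above can be illustrated outside Druid with a minimal plain-Python sketch. It is not how Druid executes the query; it only mimics the logic of the subquery filter, the FULL OUTER JOIN, and the COALESCE/CASE column selection on a few made-up rows. The function name `replace_history`, the toy rows, and the assumption that each key appears at most once per table are all hypothetical.

```python
# Toy rows (made up for illustration). The join key is
# (__time, airline, flight_number); "origin" stands in for the non-key columns.
target = [
    {"__time": "2005-11-01T00:00", "airline": "HA", "flight_number": 1, "origin": "HNL"},
    {"__time": "2005-11-01T06:00", "airline": "TZ", "flight_number": 7, "origin": "MDW"},
]
change_set = [
    {"__time": "2005-11-01T00:00", "airline": "TZ", "flight_number": 1, "origin": "HNL"},
]

KEY = ("__time", "airline", "flight_number")


def key_of(row):
    """Build the join key for a row."""
    return tuple(row[k] for k in KEY)


def replace_history(target_rows, new_rows, entity):
    """Mimic the REPLACE: keep other entities, swap in the new history for `entity`.

    Assumes each key is unique within each table (hypothetical simplification).
    """
    # Subquery: existing rows, excluding the entity whose history is replaced.
    kept = {key_of(r): r for r in target_rows if r["airline"] != entity}
    new = {key_of(r): r for r in new_rows}
    merged = []
    # FULL OUTER JOIN: visit every key found on either side.
    for k in sorted(kept.keys() | new.keys()):
        t, s = kept.get(k), new.get(k)
        # CASE WHEN t."__time" IS NULL THEN s.col ELSE t.col END
        merged.append(dict(s if t is None else t))
    return merged


result = replace_history(target, change_set, "TZ")
for row in result:
    print(row)
```

The original `TZ` row (flight 7) is dropped by the filter, the `HA` row passes through unchanged, and the change set row becomes the new `TZ` history.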

Run the following SQL to replace the history for airline `TZ`:

In [None]:
sql='''
REPLACE INTO "example-flights-updates" OVERWRITE ALL
SELECT 
      COALESCE(t."__time",       s."__time"        ) as "__time",
      COALESCE(t."airline",       s."airline"       ) as "airline",
      COALESCE(t."flight_number", s."flight_number" ) as "flight_number",
      CASE WHEN t."__time" IS NULL THEN s."arrival_time"    ELSE t."arrival_time"    END  as "arrival_time",
      CASE WHEN t."__time" IS NULL THEN s."tail_number"     ELSE t."tail_number"     END  as "tail_number", 
      CASE WHEN t."__time" IS NULL THEN s."origin"          ELSE t."origin"          END  as "origin", 
      CASE WHEN t."__time" IS NULL THEN s."destination"     ELSE t."destination"     END  as "destination", 
      CASE WHEN t."__time" IS NULL THEN s."departure_delay" ELSE t."departure_delay" END  as "departure_delay" 
FROM ( SELECT * FROM "example-flights-updates" WHERE "airline"!='TZ' ) t
  FULL OUTER JOIN
    "example-flights-updates-replace-history" s
  ON t."__time"=s."__time"
  AND t."airline"=s."airline"
  AND t."flight_number"=s."flight_number" 
PARTITIONED BY DAY    
'''
req = sql_client.sql_request(sql)
req.add_context("sqlJoinAlgorithm", 'sortMerge')
display.run_task(req)
sql_client.wait_until_ready('example-flights-updates')


Look at the data now. Because the change set was created by copying the history of flights from airline `HA`, the aggregate results for both airlines should now match:

In [None]:
sql='''
SELECT "airline", 
        min("__time") first_flight, max("__time") last_flight, 
        min("flight_number") as "min_flight_n", max("flight_number") as "max_flight_n", 
       count(*) "total_flights", 
       SUM("departure_delay") "total_time_lost" 
FROM "example-flights-updates"
WHERE "airline" in ('TZ', 'HA')
GROUP BY 1
'''
display.sql(sql)
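
If you prefer to check the match programmatically rather than by eye, a small helper like the following could compare the two aggregate rows. This is a sketch under assumptions: it expects the query results as a list of dictionaries, one per airline, and it does not show how to fetch them from Druid. The helper name `aggregates_match` and the sample numbers are hypothetical.

```python
def aggregates_match(rows, a, b, key="airline"):
    """Return True when the rows for entities `a` and `b` agree on every column except `key`."""
    by_entity = {r[key]: {c: v for c, v in r.items() if c != key} for r in rows}
    return a in by_entity and b in by_entity and by_entity[a] == by_entity[b]


# Hypothetical rows shaped like the query output; the numbers are made up.
rows = [
    {"airline": "HA", "total_flights": 10, "total_time_lost": 42},
    {"airline": "TZ", "total_flights": 10, "total_time_lost": 42},
]
print(aggregates_match(rows, "HA", "TZ"))
```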

## Clean up

Run the following cell to remove the tables used in this tutorial from the database.

In [None]:
druid.datasources.drop("example-flights-updates")
druid.datasources.drop("example-flights-updates-upsert-changeset")
druid.datasources.drop("example-flights-updates-changeset-day")
druid.datasources.drop("example-flights-updates-replace-history")

## Summary

You learned that using REPLACE you can:
* UPDATE rows using granular segment overlays
* DELETE rows
* UPDATE whole timeframes of data with newer data
* UPSERT rows like MERGE does in traditional SQL
* Replace the history of an entity across a broad timeframe

## Learn more

* Use [Compaction or Auto-compaction](https://druid.apache.org/docs/28.0.0/data-management/compaction) to merge small granularity update segments with the larger base segments after doing small updates and deletes.
* Check out the [SQL-based ingestion docs](https://druid.apache.org/docs/28.0.0/multi-stage-query/) for everything you want to know about the REPLACE statement.