
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# 4.5 Advanced Delta Lake

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>
* [Partition](https://docs.databricks.com/delta/best-practices.html#language-sql) your table by a column
* Evolve the schema of the table
* [Time travel](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html)!
* [DELETE](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-delete-from.html) records

In [0]:
%run ../Includes/Classroom-Setup

Create a temporary view with our Parquet file.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW fireCallsParquet
USING parquet
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-clean.parquet/"
)

### Partitioning

Create a Delta table [partitioned by](https://docs.databricks.com/delta/best-practices.html#language-sql) `City`.

You are not required to partition your Delta table, but doing so can drastically speed up queries that filter on the partition column. From the Delta [docs](https://docs.delta.io/latest/best-practices.html#choose-the-right-partition-column), there are two rules of thumb for deciding which column to partition by:
  * If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
  * Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
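
The two rules of thumb can be turned into a quick sanity check. The sketch below is illustrative only: the function name, the 10,000-value cutoff for "very high" cardinality, and the sample table size are assumptions, not figures from the Delta docs.

```python
ONE_GB = 1024 ** 3  # "1 GB" is treated as 2**30 bytes here


def is_reasonable_partition_column(distinct_values, total_bytes,
                                   max_distinct=10_000,
                                   min_partition_bytes=ONE_GB):
    """Apply the two rules of thumb to a candidate partition column.

    max_distinct is an assumed cutoff for "very high" cardinality.
    """
    if distinct_values <= 0:
        return False
    if distinct_values > max_distinct:
        return False  # rule 1: cardinality is too high
    # Rule 2: each partition should hold at least ~1 GB on average.
    average_partition_bytes = total_bytes / distinct_values
    return average_partition_bytes >= min_partition_bytes


# Hypothetical 500 GB table.
print(is_reasonable_partition_column(1_000_000, 500 * ONE_GB))  # userId-like column
print(is_reasonable_partition_column(50, 500 * ONE_GB))         # low-cardinality column
```

With these assumed numbers, the first call prints `False` (1M distinct values) and the second prints `True` (about 10 GB per partition).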

In [0]:
%sql
CREATE DATABASE IF NOT EXISTS Databricks;
USE Databricks;
DROP TABLE IF EXISTS fire_Calls_Delta;

CREATE TABLE fire_Calls_Delta
USING DELTA
PARTITIONED BY (City)
AS 
  SELECT * FROM fireCallsParquet

num_affected_rows,num_inserted_rows


**Let's take a look at how the underlying files are partitioned on disk by the `City` column**

In [0]:
%fs ls dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/

path,name,size
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=AI/,City=AI/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=BN/,City=BN/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=Brisbane/,City=Brisbane/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=DALY CITY/,City=DALY CITY/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=DC/,City=DC/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=Daly City/,City=Daly City/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=FM/,City=FM/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=FORT MASON/,City=FORT MASON/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=Fort Mason/,City=Fort Mason/,0
dbfs:/user/hive/warehouse/databricks.db/fire_calls_delta/City=HP/,City=HP/,0


**The following query reads only from the `City=Daly City` partition**

In [0]:
%sql
SELECT * FROM fire_Calls_delta WHERE City="Daly City" limit 10

Call_Number,Unit_ID,Incident_Number,Call_Type,Call_Date,Watch_Date,Received_DtTm,Entry_DtTm,Dispatch_DtTm,Response_DtTm,On_Scene_DtTm,Transport_DtTm,Hospital_DtTm,Call_Final_Disposition,Available_DtTm,Address,City,Zipcode_of_Incident,Battalion,Station_Area,Box,Original_Priority,Priority,Final_Priority,ALS_Unit,Call_Type_Group,Number_of_Alarms,Unit_Type,Unit_sequence_in_call_dispatch,Fire_Prevention_District,Supervisor_District,Neighborhooods_-_Analysis_Boundaries,Location,RowID
141522600,E33,14052318,Medical Incident,06/01/2014,06/01/2014,06/01/2014 07:06:20 PM,06/01/2014 07:08:06 PM,06/01/2014 07:12:31 PM,06/01/2014 07:14:34 PM,06/01/2014 07:16:48 PM,,,No Merit,06/01/2014 07:30:44 PM,CALL BOX:,Daly City,,B09,33,9117,2,2,2,True,Non Life-threatening,1,ENGINE,1,,,,"(37.7070568539773, -122.458070201117)",141522600-E33
150463306,AM16,15018073,Medical Incident,02/15/2015,02/15/2015,02/15/2015 08:19:15 PM,02/15/2015 08:20:06 PM,02/15/2015 08:20:41 PM,02/15/2015 08:21:18 PM,02/15/2015 08:30:25 PM,02/15/2015 08:36:20 PM,02/15/2015 09:09:30 PM,Code 2 Transport,02/15/2015 09:33:57 PM,"GENEVA AV/SANTOS ST, DC",Daly City,94134.0,B09,43,6246,2,2,2,False,Non Life-threatening,1,PRIVATE,1,,10.0,,"(37.7083216255827, -122.42054528003)",150463306-AM16
151920037,E33,15073098,Medical Incident,07/11/2015,07/10/2015,07/11/2015 12:13:07 AM,07/11/2015 12:13:07 AM,07/11/2015 12:13:12 AM,07/11/2015 12:13:30 AM,07/11/2015 12:18:04 AM,,,Patient Declined Transport,07/11/2015 12:26:44 AM,CALL BOX:,Daly City,,B09,33,9922,3,3,3,False,Potentially Life-Threatening,1,ENGINE,1,,,,"(37.7049649190969, -122.462393901191)",151920037-E33
182063518,RC4,18087382,Medical Incident,07/25/2018,07/25/2018,07/25/2018 06:39:17 PM,07/25/2018 06:40:56 PM,07/25/2018 06:41:32 PM,07/25/2018 06:41:49 PM,07/25/2018 06:45:51 PM,,,Code 2 Transport,07/25/2018 06:58:03 PM,"SCHWERIN ST/VELASCO AV, DC",Daly City,,B09,44,9272,E,3,3,True,Potentially Life-Threatening,1,RESCUE CAPTAIN,1,,,,"(37.70828445105653, -122.41230609249264)",182063518-RC4
143412997,E44,14121268,Structure Fire,12/07/2014,12/07/2014,12/07/2014 07:18:22 PM,12/07/2014 07:18:22 PM,12/07/2014 07:18:28 PM,12/07/2014 07:19:47 PM,12/07/2014 07:24:10 PM,,,Fire,12/07/2014 07:24:25 PM,"GENEVA AV/SANTOS ST, DC",Daly City,94134.0,B09,43,6246,3,3,3,False,Alarm,1,ENGINE,1,,10.0,,"(37.7083216255827, -122.42054528003)",143412997-E44
141282327,KM10,14043569,Traffic Collision,05/08/2014,05/08/2014,05/08/2014 03:43:51 PM,05/08/2014 03:48:28 PM,05/08/2014 03:53:16 PM,05/08/2014 03:55:20 PM,05/08/2014 04:09:07 PM,,,Cancelled,05/08/2014 04:10:18 PM,300 Block of SAINT CHARLES AV,Daly City,94132.0,B09,15,8313,3,3,3,False,Non Life-threatening,1,PRIVATE,1,9.0,7.0,Oceanview/Merced/Ingleside,"(37.7083764026475, -122.469270912671)",141282327-KM10
152091203,E43,15079567,Structure Fire,07/28/2015,07/28/2015,07/28/2015 10:01:54 AM,07/28/2015 10:01:54 AM,07/28/2015 10:03:50 AM,07/28/2015 10:04:50 AM,07/28/2015 10:07:49 AM,,,Fire,07/28/2015 10:08:07 AM,"GENEVA AV/SANTOS ST, DC",Daly City,94134.0,B09,43,6246,3,3,3,True,Alarm,1,ENGINE,1,,10.0,,"(37.7083216255827, -122.42054528003)",152091203-E43
183040277,E33,18127325,Medical Incident,10/31/2018,10/30/2018,10/31/2018 03:14:54 AM,10/31/2018 03:16:34 AM,10/31/2018 03:16:47 AM,10/31/2018 03:19:07 AM,10/31/2018 03:22:48 AM,,,Code 2 Transport,10/31/2018 03:41:03 AM,CALL BOX:,Daly City,,B09,33,9117,3,3,3,True,Potentially Life-Threatening,1,ENGINE,1,,,,"(37.707056853977306, -122.45807020111693)",183040277-E33
163110968,E44,16123962,Medical Incident,11/06/2016,11/05/2016,11/06/2016 07:49:35 AM,11/06/2016 07:49:35 AM,11/06/2016 07:51:09 AM,11/06/2016 07:51:34 AM,11/06/2016 07:56:48 AM,,,Code 2 Transport,11/06/2016 07:59:27 AM,CALL BOX:,Daly City,,B09,44,9271,3,3,3,True,Potentially Life-Threatening,1,ENGINE,1,,,,"(37.708286429739, -122.416313596189)",163110968-E44
142342453,89,14081391,Medical Incident,08/22/2014,08/22/2014,08/22/2014 04:26:08 PM,08/22/2014 04:26:08 PM,08/22/2014 04:30:03 PM,08/22/2014 04:31:19 PM,08/22/2014 04:56:06 PM,08/22/2014 04:58:59 PM,08/22/2014 05:23:10 PM,Code 2 Transport,08/22/2014 05:58:18 PM,0 Block of LINCOLN CT,Daly City,94112.0,B09,43,6231,2,2,2,True,Non Life-threatening,1,MEDIC,1,9.0,11.0,Excelsior,"(37.7084685450499, -122.442697907962)",142342453-89
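
To see why a filter on the partition column can skip most of the data, here is a toy model of partition pruning in plain Python. It is only a sketch: the `prune_partitions` helper is hypothetical, and Delta Lake resolves partitions from table metadata rather than by matching directory names like this. The directory names are copied from the listing above.

```python
# Partition directories from the listing above (Hive-style "column=value" names).
partition_dirs = [
    "City=AI/", "City=BN/", "City=Brisbane/", "City=DALY CITY/", "City=DC/",
    "City=Daly City/", "City=FM/", "City=FORT MASON/", "City=Fort Mason/",
    "City=HP/",
]


def prune_partitions(dirs, column, value):
    """Return only the directories whose partition value equals `value`."""
    wanted = f"{column}={value}/"
    return [d for d in dirs if d == wanted]


selected = prune_partitions(partition_dirs, "City", "Daly City")
print(selected)
```

Only `City=Daly City/` is selected. The match is an exact string comparison, so `City=DALY CITY/` is a separate partition that this query does not touch.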


### Schema Enforcement & Evolution
**Schema enforcement**, also known as schema validation, is a safeguard to ensure data quality.  Delta Lake uses schema validation *on write*, which means that all new writes to a table are checked for compatibility with the target table’s schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written), and raises an exception to let the user know about the mismatch.

**Schema evolution** is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. Most commonly, it’s used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.

To determine whether a write to a table is compatible, Delta Lake uses the following rules. The DataFrame to be written:
* Cannot contain any additional columns that are not present in the target table’s schema. 
* Cannot have column data types that differ from the column data types in the target table.
* Cannot contain column names that differ only by case.
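
The three rules can be expressed as a small check. This is a simplified sketch in plain Python, not Delta Lake's actual validation logic: schemas are modeled as dictionaries mapping column name to type string, and `schema_violations` is a hypothetical helper.

```python
def schema_violations(table_schema, frame_schema):
    """Check a DataFrame schema against a target table schema.

    Both arguments map column name -> data type string.
    Returns a list of violations (empty if the write is compatible).
    """
    violations = []

    # Rule 3: no column names that differ only by case.
    seen = {}
    for name in frame_schema:
        key = name.lower()
        if key in seen:
            violations.append(
                f"columns {seen[key]!r} and {name!r} differ only by case")
        else:
            seen[key] = name

    for name, dtype in frame_schema.items():
        if name not in table_schema:
            # Rule 1: no columns that are absent from the target table.
            violations.append(f"extra column {name!r}")
        elif table_schema[name] != dtype:
            # Rule 2: data types must match the target table.
            violations.append(
                f"column {name!r} is {dtype}, table expects {table_schema[name]}")
    return violations


table = {"Call_Number": "int", "Unit_ID": "string"}
frame = {"Call_Number": "int", "Unit_ID": "string", "Neighborhoods": "string"}
print(schema_violations(table, frame))
```

Here the extra `Neighborhoods` column is reported as a violation, which mirrors the kind of mismatch that schema enforcement rejects.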

If we look at the schema, we can see that someone added a few too many `o`s to the column `Neighborhooods_-_Analysis_Boundaries`. Let's create a new column called `Neighborhoods`.

In [0]:
%sql
DESCRIBE fire_Calls_Delta

col_name,data_type,comment
Call_Number,int,
Unit_ID,string,
Incident_Number,int,
Call_Type,string,
Call_Date,string,
Watch_Date,string,
Received_DtTm,string,
Entry_DtTm,string,
Dispatch_DtTm,string,
Response_DtTm,string,


In [0]:
%sql
INSERT OVERWRITE TABLE fire_Calls_Delta SELECT *, `Neighborhooods_-_Analysis_Boundaries` AS Neighborhoods FROM fire_Calls_Delta

Our write failed because the query adds a `Neighborhoods` column that is not in the table's schema. Let's enable `autoMerge`.

In [0]:
%sql
SET spark.databricks.delta.schema.autoMerge.enabled=TRUE

key,value
spark.databricks.delta.schema.autoMerge.enabled,True


Let's try the same command and see the difference.

In [0]:
%sql
INSERT OVERWRITE TABLE fire_Calls_Delta SELECT *, `Neighborhooods_-_Analysis_Boundaries` AS Neighborhoods FROM fire_Calls_Delta;

SELECT * FROM fire_Calls_Delta

### Time Travel
Now, let's query earlier versions of the table with `VERSION AS OF`. Version 0 is the table as it was originally created, before we added the `Neighborhoods` column. You can also query by time with `TIMESTAMP AS OF`:
```
SELECT * FROM table_identifier TIMESTAMP AS OF timestamp_expression
SELECT * FROM table_identifier VERSION AS OF version
```

In [0]:
%sql
SELECT * 
FROM fire_Calls_Delta 
  VERSION AS OF 0 LIMIT 5

Call_Number,Unit_ID,Incident_Number,Call_Type,Call_Date,Watch_Date,Received_DtTm,Entry_DtTm,Dispatch_DtTm,Response_DtTm,On_Scene_DtTm,Transport_DtTm,Hospital_DtTm,Call_Final_Disposition,Available_DtTm,Address,City,Zipcode_of_Incident,Battalion,Station_Area,Box,Original_Priority,Priority,Final_Priority,ALS_Unit,Call_Type_Group,Number_of_Alarms,Unit_Type,Unit_sequence_in_call_dispatch,Fire_Prevention_District,Supervisor_District,Neighborhooods_-_Analysis_Boundaries,Location,RowID
170531952,AP,17022925,Other,02/22/2017,02/22/2017,02/22/2017 12:54:13 PM,02/22/2017 12:54:13 PM,02/22/2017 12:54:13 PM,02/22/2017 12:54:13 PM,02/22/2017 12:54:13 PM,,,Fire,02/22/2017 12:54:13 PM,CALL BOX: SF INTERNATIONAL AIRPORT,,,B09,44,6913,3,3,3,False,Alarm,1,AIRPORT,1,,,,"(37.6168823239251, -122.384094238098)",170531952-AP
142253505,AP,14078286,Other,08/13/2014,08/13/2014,08/13/2014 09:06:57 PM,08/13/2014 09:06:57 PM,08/13/2014 09:06:57 PM,08/13/2014 09:06:57 PM,08/13/2014 09:06:57 PM,,,Fire,08/13/2014 09:07:42 PM,CALL BOX: SF INTERNATIONAL AIRPORT,,,B99,44,6913,3,3,3,False,Alarm,1,AIRPORT,1,,,,"(37.6168823239251, -122.384094238098)",142253505-AP
141381435,AP,14047259,Other,05/18/2014,05/18/2014,05/18/2014 11:44:35 AM,05/18/2014 11:44:35 AM,05/18/2014 11:44:35 AM,05/18/2014 11:44:35 AM,05/18/2014 11:44:35 AM,,,Fire,05/18/2014 11:49:03 AM,CALL BOX: SF INTERNATIONAL AIRPORT,,,B99,44,6913,3,3,3,False,Alarm,1,AIRPORT,1,,,,"(37.6168823239251, -122.384094238098)",141381435-AP
170523860,AP,17022754,Other,02/21/2017,02/21/2017,02/21/2017 10:57:18 PM,02/21/2017 10:57:18 PM,02/21/2017 10:57:18 PM,02/21/2017 10:57:18 PM,02/21/2017 10:57:18 PM,,,Fire,02/21/2017 10:57:18 PM,CALL BOX: SF INTERNATIONAL AIRPORT,,,B09,44,6913,3,3,3,False,Alarm,1,AIRPORT,1,,,,"(37.6168823239251, -122.384094238098)",170523860-AP
160033127,AP,16001376,Other,01/03/2016,01/03/2016,01/03/2016 09:46:42 PM,01/03/2016 09:46:42 PM,01/03/2016 09:46:42 PM,01/03/2016 09:46:42 PM,01/03/2016 09:46:42 PM,,,Fire,01/03/2016 09:46:42 PM,CALL BOX: SF INTERNATIONAL AIRPORT,,,B99,44,6913,3,3,3,False,Alarm,1,AIRPORT,1,,,,"(37.6168823239251, -122.384094238098)",160033127-AP


In [0]:
%sql
SELECT * 
FROM fire_Calls_Delta 
  VERSION AS OF 1 LIMIT 5

Call_Number,Unit_ID,Incident_Number,Call_Type,Call_Date,Watch_Date,Received_DtTm,Entry_DtTm,Dispatch_DtTm,Response_DtTm,On_Scene_DtTm,Transport_DtTm,Hospital_DtTm,Call_Final_Disposition,Available_DtTm,Address,City,Zipcode_of_Incident,Battalion,Station_Area,Box,Original_Priority,Priority,Final_Priority,ALS_Unit,Call_Type_Group,Number_of_Alarms,Unit_Type,Unit_sequence_in_call_dispatch,Fire_Prevention_District,Supervisor_District,Neighborhooods_-_Analysis_Boundaries,Location,RowID,Neighborhoods
93040219,E37,9091742,Structure Fire,10/31/2009,10/31/2009,10/31/2009 03:39:39 PM,10/31/2009 03:40:24 PM,10/31/2009 03:41:01 PM,10/31/2009 03:42:02 PM,10/31/2009 03:44:12 PM,,,Other,10/31/2009 04:45:27 PM,MISSOURI ST/WATCHMAN WY,SF,94107,B10,37,2566,3,3,3,False,,1,ENGINE,1,10,10,Potrero Hill,"(37.7556214464967, -122.395762627012)",093040219-E37,Potrero Hill
31470136,M36,3042085,Medical Incident,05/27/2003,05/27/2003,05/27/2003 10:20:47 AM,05/27/2003 10:23:27 AM,05/27/2003 10:24:33 AM,05/27/2003 10:25:53 AM,05/27/2003 10:27:43 AM,05/27/2003 10:37:56 AM,05/27/2003 10:42:17 AM,Other,05/27/2003 10:52:22 AM,8TH ST/HOWARD ST,SF,94103,B02,36,2335,3,3,3,True,,1,MEDIC,1,2,6,South of Market,"(37.7762213544451, -122.411606113878)",031470136-M36,South of Market
140180049,T03,14006063,Alarms,01/18/2014,01/17/2014,01/18/2014 02:51:14 AM,01/18/2014 02:52:51 AM,01/18/2014 02:52:59 AM,01/18/2014 02:54:26 AM,01/18/2014 02:56:40 AM,,,Other,01/18/2014 02:58:18 AM,300 Block of MASON ST,SF,94102,B01,3,1411,3,3,3,False,Alarm,1,TRUCK,1,1,3,Tenderloin,"(37.7864220856559, -122.409797430148)",140180049-T03,Tenderloin
21370255,94,2040554,Medical Incident,05/17/2002,05/17/2002,05/17/2002 02:33:08 PM,05/17/2002 02:33:54 PM,05/17/2002 02:35:20 PM,05/17/2002 02:35:53 PM,05/17/2002 02:38:59 PM,05/17/2002 02:54:43 PM,05/17/2002 03:08:44 PM,Other,05/17/2002 03:36:40 PM,400 Block of MINNA ST,SF,94103,B03,1,2251,3,3,3,True,,1,MEDIC,1,3,6,South of Market,"(37.7810688918781, -122.407387172098)",021370255-94,South of Market
140640156,E10,14021617,Other,03/05/2014,03/05/2014,03/05/2014 12:21:30 PM,03/05/2014 12:22:20 PM,03/05/2014 12:22:54 PM,03/05/2014 12:24:51 PM,03/05/2014 12:28:42 PM,,,Fire,03/05/2014 12:40:51 PM,200 Block of BALBOA ST,SF,94118,B07,31,7123,3,3,3,False,Alarm,1,ENGINE,1,7,1,Inner Richmond,"(37.7773012252008, -122.461328348553)",140640156-E10,Inner Richmond
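
One way to picture why both versions remain queryable: Delta Lake records each commit in a transaction log, and a given version is reconstructed from the commits up to that point. The sketch below is a deliberately simplified toy model in plain Python, not the Delta implementation. The file names and the `files_at_version` helper are hypothetical.

```python
# Each commit lists the data files added and removed by that transaction.
# File names are made up for illustration.
commits = [
    {"add": ["part-000.parquet"], "remove": []},                    # version 0: CREATE TABLE
    {"add": ["part-001.parquet"], "remove": ["part-000.parquet"]},  # version 1: INSERT OVERWRITE
]


def files_at_version(log, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit["remove"])
        live |= set(commit["add"])
    return live


print(files_at_version(commits, 0))
print(files_at_version(commits, 1))
```

In this model, version 1 no longer includes the original file, but version 0 still refers to it, which is why the older snapshot can still be read.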


### Delete

With laws such as [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and [CCPA](https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act), individuals have the right to be forgotten and to have their data erased.

For the sake of example, let's delete the record corresponding to Incident Number `14055109`. Luckily, this is very easy with Delta Lake: we do not need to read and rewrite the entire dataset just to remove one record.

In [0]:
%sql
SELECT * FROM fire_Calls_Delta WHERE Incident_Number = "14055109"

Call_Number,Unit_ID,Incident_Number,Call_Type,Call_Date,Watch_Date,Received_DtTm,Entry_DtTm,Dispatch_DtTm,Response_DtTm,On_Scene_DtTm,Transport_DtTm,Hospital_DtTm,Call_Final_Disposition,Available_DtTm,Address,City,Zipcode_of_Incident,Battalion,Station_Area,Box,Original_Priority,Priority,Final_Priority,ALS_Unit,Call_Type_Group,Number_of_Alarms,Unit_Type,Unit_sequence_in_call_dispatch,Fire_Prevention_District,Supervisor_District,Neighborhooods_-_Analysis_Boundaries,Location,RowID,Neighborhoods
141600888,65,14055109,Traffic Collision,06/09/2014,06/09/2014,06/09/2014 09:35:33 AM,06/09/2014 09:36:46 AM,06/09/2014 09:37:43 AM,06/09/2014 09:37:55 AM,06/09/2014 09:43:37 AM,06/09/2014 09:59:00 AM,06/09/2014 10:34:12 AM,Code 2 Transport,06/09/2014 11:14:57 AM,OAKDALE AV/TOLAND ST,San Francisco,94124,B10,9,6377,2,2,2,True,Non Life-threatening,1,MEDIC,1,10,10,Bayview Hunters Point,"(37.740961928907, -122.401555700705)",141600888-65,Bayview Hunters Point


In [0]:
%sql
DELETE FROM fire_Calls_Delta WHERE Incident_Number = "14055109";

SELECT * FROM fire_Calls_Delta WHERE Incident_Number = "14055109"


Call_Number,Unit_ID,Incident_Number,Call_Type,Call_Date,Watch_Date,Received_DtTm,Entry_DtTm,Dispatch_DtTm,Response_DtTm,On_Scene_DtTm,Transport_DtTm,Hospital_DtTm,Call_Final_Disposition,Available_DtTm,Address,City,Zipcode_of_Incident,Battalion,Station_Area,Box,Original_Priority,Priority,Final_Priority,ALS_Unit,Call_Type_Group,Number_of_Alarms,Unit_Type,Unit_sequence_in_call_dispatch,Fire_Prevention_District,Supervisor_District,Neighborhooods_-_Analysis_Boundaries,Location,RowID,Neighborhoods


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>