
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# 4.4 Delta Lake

**Delta Lake is an open-format storage layer that delivers reliability, security, and performance on your data lake for both streaming and batch operations.**

Delta Lake replaces existing data silos with a central repository for structured, semi-structured, and unstructured data, providing the foundation for a cost-effective and highly scalable Lakehouse.




## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>
* Create a medallion architecture (bronze, silver, gold) with [Delta Lake](http://delta.io/)
* Analyze the Delta [transaction log](https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html)
* [UPDATE](https://databricks.com/blog/2020/09/29/diving-into-delta-lake-dml-internals-update-delete-merge.html) existing data

![](https://files.training.databricks.com/images/davis/delta_multihop.png)

In [0]:
%run ../Includes/Classroom-Setup

Create a temporary view with our Parquet file.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW fireCallsParquet
USING parquet
OPTIONS (
  path "/mnt/davis/fire-calls/fire-calls-clean.parquet/"
)

## Write raw data into Delta Bronze

<img width="75px" src="https://files.training.databricks.com/images/davis/images_bronze.png">

All we need to do to create a Delta table is to specify `USING DELTA`.

In [0]:
%sql
CREATE DATABASE IF NOT EXISTS Databricks;
USE Databricks;
DROP TABLE IF EXISTS fireCallsBronze;

CREATE TABLE fireCallsBronze
USING DELTA
AS 
  SELECT * FROM fireCallsParquet

num_affected_rows,num_inserted_rows


Take a look at 10 rows of the bronze table to get familiar with the data. The query below keeps only rows with a known neighborhood.

In [0]:
%sql
SELECT * FROM fireCallsBronze WHERE `Neighborhooods_-_Analysis_Boundaries` <> "None" LIMIT 10

Call_Number,Unit_ID,Incident_Number,Call_Type,Call_Date,Watch_Date,Received_DtTm,Entry_DtTm,Dispatch_DtTm,Response_DtTm,On_Scene_DtTm,Transport_DtTm,Hospital_DtTm,Call_Final_Disposition,Available_DtTm,Address,City,Zipcode_of_Incident,Battalion,Station_Area,Box,Original_Priority,Priority,Final_Priority,ALS_Unit,Call_Type_Group,Number_of_Alarms,Unit_Type,Unit_sequence_in_call_dispatch,Fire_Prevention_District,Supervisor_District,Neighborhooods_-_Analysis_Boundaries,Location,RowID
141600888,65,14055109,Traffic Collision,06/09/2014,06/09/2014,06/09/2014 09:35:33 AM,06/09/2014 09:36:46 AM,06/09/2014 09:37:43 AM,06/09/2014 09:37:55 AM,06/09/2014 09:43:37 AM,06/09/2014 09:59:00 AM,06/09/2014 10:34:12 AM,Code 2 Transport,06/09/2014 11:14:57 AM,OAKDALE AV/TOLAND ST,San Francisco,94124,B10,9,6377,2,2,2,True,Non Life-threatening,1,MEDIC,1,10,10,Bayview Hunters Point,"(37.740961928907, -122.401555700705)",141600888-65
162743687,E01,16108733,Medical Incident,09/30/2016,09/30/2016,09/30/2016 08:05:57 PM,09/30/2016 08:07:53 PM,09/30/2016 08:08:14 PM,09/30/2016 08:08:23 PM,09/30/2016 08:10:37 PM,,,Code 2 Transport,09/30/2016 08:32:48 PM,1100 Block of MISSION ST,San Francisco,94103,B02,36,2318,2,2,2,True,Non Life-threatening,1,ENGINE,1,2,6,South of Market,"(37.7777124404316, -122.412736707425)",162743687-E01
102210202,75,10069623,Medical Incident,08/09/2010,08/09/2010,08/09/2010 01:28:40 PM,08/09/2010 01:30:47 PM,08/09/2010 01:31:53 PM,08/09/2010 01:32:29 PM,08/09/2010 01:52:28 PM,08/09/2010 02:19:50 PM,08/09/2010 02:30:50 PM,Code 2 Transport,08/09/2010 03:05:25 PM,100 Block of CLINTON PARK,SF,94103,B02,36,5126,1,1,2,True,,1,MEDIC,1,2,8,Mission,"(37.7692677111289, -122.423396856968)",102210202-75
160681260,E42,16027085,Medical Incident,03/08/2016,03/08/2016,03/08/2016 10:42:26 AM,03/08/2016 10:42:57 AM,03/08/2016 10:43:20 AM,03/08/2016 10:50:11 AM,03/08/2016 10:50:11 AM,,,Patient Declined Transport,03/08/2016 10:58:43 AM,2400 Block of SAN BRUNO AVE,San Francisco,94134,B10,42,6362,3,3,3,True,Potentially Life-Threatening,1,ENGINE,1,10,9,Portola,"(37.7318198889718, -122.405412091734)",160681260-E42
113200298,E18,11106370,Structure Fire,11/16/2011,11/16/2011,11/16/2011 05:13:01 PM,11/16/2011 05:13:01 PM,11/16/2011 05:13:09 PM,11/16/2011 05:13:54 PM,11/16/2011 05:16:30 PM,,,Fire,11/16/2011 05:17:14 PM,41ST AV/NORIEGA ST,SF,94122,B08,18,7633,3,3,3,True,,1,ENGINE,1,8,4,Sunset/Parkside,"(37.7532010612329, -122.500044865619)",113200298-E18
162584030,KM07,16101787,Medical Incident,09/14/2016,09/14/2016,09/14/2016 08:41:50 PM,09/14/2016 08:41:50 PM,09/14/2016 08:43:07 PM,09/14/2016 08:43:34 PM,09/14/2016 08:57:26 PM,,,Against Medical Advice,09/14/2016 09:35:01 PM,400 Block of 7TH ST,San Francisco,94103,B03,8,231,2,2,2,False,Non Life-threatening,1,PRIVATE,1,3,6,South of Market,"(37.7746534767072, -122.405118197249)",162584030-KM07
133150239,KM09,13107112,Medical Incident,11/11/2013,11/11/2013,11/11/2013 01:41:24 PM,11/11/2013 01:44:26 PM,11/11/2013 01:44:41 PM,11/11/2013 01:45:10 PM,11/11/2013 01:51:53 PM,,,Gone on Arrival,11/11/2013 01:53:52 PM,16TH ST/MISSION ST,SF,94103,B02,7,5236,1,1,2,False,Non Life-threatening,1,PRIVATE,1,2,9,Mission,"(37.7650513381945, -122.419668973861)",133150239-KM09
170553524,66,17023871,Medical Incident,02/24/2017,02/24/2017,02/24/2017 07:31:51 PM,02/24/2017 07:34:55 PM,02/24/2017 07:35:42 PM,02/24/2017 07:35:51 PM,02/24/2017 07:49:49 PM,02/24/2017 08:24:58 PM,02/24/2017 08:50:07 PM,Code 2 Transport,02/24/2017 09:16:13 PM,2000 Block of 46TH AVE,San Francisco,94116,B08,23,7663,2,2,2,True,Potentially Life-Threatening,1,MEDIC,1,8,4,Sunset/Parkside,"(37.7482979256317, -122.505150416998)",170553524-66
52970075,M01,5079960,Medical Incident,10/24/2005,10/24/2005,10/24/2005 09:03:35 AM,10/24/2005 09:05:37 AM,10/24/2005 09:05:55 AM,10/24/2005 09:07:09 AM,10/24/2005 09:12:02 AM,,,Cancelled,10/24/2005 09:12:42 AM,400 Block of GOLDEN GATE AVE,SF,94102,B02,3,1644,3,3,3,True,,1,MEDIC,1,2,6,Tenderloin,"(37.7812873999159, -122.41796607229)",052970075-M01
32960250,M15,3084979,Medical Incident,10/23/2003,10/23/2003,10/23/2003 02:46:24 PM,10/23/2003 02:47:44 PM,10/23/2003 02:48:13 PM,10/23/2003 02:49:55 PM,10/23/2003 02:51:27 PM,10/23/2003 03:11:07 PM,10/23/2003 03:45:03 PM,Other,10/23/2003 03:56:51 PM,CALL BOX: SAN JOSE AV/SANTA YNEZ AV,SF,94112,B09,15,8276,3,3,3,True,,1,MEDIC,1,9,11,Outer Mission,"(37.7258249736518, -122.442324422614)",032960250-M15


Navigate to the `Data` tab and take a look at the `fireCallsBronze` table in the `Databricks` database.

<img width="550px" src="https://s3-us-west-2.amazonaws.com/files.training.databricks.com/images/davis/firecallsbronze.png">

The `Details` tab shows when the table was created and last modified, how many partitions it has, the size of the data, and so on.

<img width="290px" src="https://s3-us-west-2.amazonaws.com/files.training.databricks.com/images/davis/firecallsbronzedetails.png">

There is also a `History` tab that shows all of the versions of the Delta table.


![](https://s3-us-west-2.amazonaws.com/files.training.databricks.com/images/davis/firecallsbronzehistory.png)

We can also look at the underlying files that were generated. There are 8 Parquet data files, corresponding to the 8 partitions of the data, and a `_delta_log` directory.

In [0]:
%fs ls dbfs:/user/hive/warehouse/databricks.db/firecallsbronze

path,name,size
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log/,_delta_log/,0
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00000-a14ecbd1-8d47-4124-b2b2-32dae6369daa-c000.snappy.parquet,part-00000-a14ecbd1-8d47-4124-b2b2-32dae6369daa-c000.snappy.parquet,6328234
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00001-71f90e6c-cd13-4844-a6af-69c1fef0afa0-c000.snappy.parquet,part-00001-71f90e6c-cd13-4844-a6af-69c1fef0afa0-c000.snappy.parquet,6323710
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00002-b4c272e4-8ccf-44c3-97eb-9ce5bfaf7d98-c000.snappy.parquet,part-00002-b4c272e4-8ccf-44c3-97eb-9ce5bfaf7d98-c000.snappy.parquet,6340643
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00003-15b7dfd3-3eca-4553-a0aa-f0f580612994-c000.snappy.parquet,part-00003-15b7dfd3-3eca-4553-a0aa-f0f580612994-c000.snappy.parquet,6334530
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00004-6e0ec15c-96f1-4ca9-99f5-644a2cc206bb-c000.snappy.parquet,part-00004-6e0ec15c-96f1-4ca9-99f5-644a2cc206bb-c000.snappy.parquet,6336196
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00005-454b0fed-db18-4ea7-b02f-a9d02e9d495f-c000.snappy.parquet,part-00005-454b0fed-db18-4ea7-b02f-a9d02e9d495f-c000.snappy.parquet,6335106
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00006-f10290e9-761a-4070-9845-08557a70d9cd-c000.snappy.parquet,part-00006-f10290e9-761a-4070-9845-08557a70d9cd-c000.snappy.parquet,6338722
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/part-00007-ae4dc3d5-f9c4-4d4e-b2a7-b29c1732ff8e-c000.snappy.parquet,part-00007-ae4dc3d5-f9c4-4d4e-b2a7-b29c1732ff8e-c000.snappy.parquet,6345461
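
As a rough illustration of how to read this listing, the sketch below separates the Parquet data files from the `_delta_log` directory and totals their sizes. It is plain Python over values copied from the output above, not a Databricks API call; the file names are shortened to their `part-0000N` prefix (the UUID suffixes are omitted), and the variable names are made up for this example.

```python
# (name, size in bytes) pairs copied from the listing above.
# File names are shortened here; the real ones carry a UUID suffix.
listing = [
    ("_delta_log/", 0),
    ("part-00000.snappy.parquet", 6328234),
    ("part-00001.snappy.parquet", 6323710),
    ("part-00002.snappy.parquet", 6340643),
    ("part-00003.snappy.parquet", 6334530),
    ("part-00004.snappy.parquet", 6336196),
    ("part-00005.snappy.parquet", 6335106),
    ("part-00006.snappy.parquet", 6338722),
    ("part-00007.snappy.parquet", 6345461),
]

# Data files end in .parquet; everything else here is transaction log metadata.
data_files = [(name, size) for name, size in listing if name.endswith(".parquet")]
total_bytes = sum(size for _, size in data_files)

print(f"{len(data_files)} data files, {total_bytes} bytes in total")
```

The data files hold the table contents, while `_delta_log` records which of those files belong to each version of the table.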


**Let's dig into the `_delta_log` directory and look at the files it contains, including the JSON record of the first commit.**

In [0]:
%fs ls dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log

path,name,size
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log/.s3-optimization-0,.s3-optimization-0,0
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log/.s3-optimization-1,.s3-optimization-1,0
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log/.s3-optimization-2,.s3-optimization-2,0
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log/00000000000000000000.crc,00000000000000000000.crc,93
dbfs:/user/hive/warehouse/databricks.db/firecallsbronze/_delta_log/00000000000000000000.json,00000000000000000000.json,28623
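
The JSON file names in this listing follow a simple pattern: the table version, zero-padded to 20 digits, followed by `.json`. The sketch below shows that mapping in plain Python. The helper names `commit_file_name` and `version_from_file_name` are made up for this example and are not part of any Delta Lake library.

```python
def commit_file_name(version: int) -> str:
    """Return the JSON commit file name for a table version (20 digits, zero-padded)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def version_from_file_name(name: str) -> int:
    """Recover the version number from a commit file name such as the ones listed above."""
    stem, _, extension = name.partition(".")
    if extension != "json" or not stem.isdigit():
        raise ValueError(f"not a JSON commit file: {name}")
    return int(stem)


# Version 0 is the initial CREATE TABLE commit; each later commit adds the next number.
print(commit_file_name(0))
print(version_from_file_name(commit_file_name(1)))
```

Files such as the `.crc` checksum or the `.s3-optimization-*` markers do not match the pattern, so `version_from_file_name` rejects them.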


## Refine bronze tables, write to Delta Silver

<img width="75px" src="https://files.training.databricks.com/images/davis/images_silver.png">

Drop unnecessary columns and filter out rows with a null `City` or an unknown neighborhood.

In [0]:
%sql
DROP TABLE IF EXISTS fireCallsSilver;

CREATE TABLE fireCallsSilver 
USING DELTA
AS 
  SELECT Call_Number, Call_Type, Call_Date, Received_DtTm, Address, City, Zipcode_of_Incident, Unit_Type, `Neighborhooods_-_Analysis_Boundaries`
  FROM fireCallsBronze
  WHERE (City IS NOT null) AND (`Neighborhooods_-_Analysis_Boundaries` <> "None");
  


num_affected_rows,num_inserted_rows


**Query the Silver table**

In [0]:
%sql 
SELECT * FROM fireCallsSilver LIMIT 10;

Call_Number,Call_Type,Call_Date,Received_DtTm,Address,City,Zipcode_of_Incident,Unit_Type,Neighborhooods_-_Analysis_Boundaries
182260374,Medical Incident,08/14/2018,08/14/2018 05:58:34 AM,2100 Block of MARKET ST,San Francisco,94114,ENGINE,Castro/Upper Market
42070078,Traffic Collision,07/25/2004,07/25/2004 09:06:13 AM,36TH AV/FULTON ST,SF,94121,MEDIC,Outer Richmond
91630169,Medical Incident,06/12/2009,06/12/2009 01:01:25 PM,800 Block of 42ND AVE,SF,94121,ENGINE,Outer Richmond
70530217,Medical Incident,02/22/2007,02/22/2007 12:19:11 PM,2600 Block of SUTTER ST,SF,94115,TRUCK,Presidio Heights
170372387,Medical Incident,02/06/2017,02/06/2017 03:34:53 PM,600 Block of SANSOME ST,San Francisco,94111,MEDIC,Chinatown
60400382,Medical Incident,02/09/2006,02/09/2006 11:56:01 PM,1700 Block of FILLMORE ST,SF,94115,MEDIC,Japantown
80040078,Electrical Hazard,01/04/2008,01/04/2008 06:46:01 AM,900 Block of LAKE ST,SF,94118,ENGINE,Inner Richmond
30270280,Citizen Assist / Service Call,01/27/2003,01/27/2003 03:40:25 PM,3100 Block of PIERCE ST,SF,94123,TRUCK,Marina
150633936,Citizen Assist / Service Call,03/04/2015,03/04/2015 10:09:51 PM,2100 Block of 25TH AVE,San Francisco,94116,TRUCK,Sunset/Parkside
122830141,Medical Incident,10/09/2012,10/09/2012 11:16:03 AM,1300 Block of MARKET ST,SF,94102,PRIVATE,Tenderloin


The output shows that more cleaning is needed (e.g. converting all occurrences of `SF` to `San Francisco` for `City`).

Let's fix it and make an updated version of this Silver table using [UPDATE](https://docs.databricks.com/delta/delta-update.html).

In [0]:
%sql
UPDATE fireCallsSilver SET City = "San Francisco" WHERE (City = "SF") OR (City = "SAN FRANCISCO")

num_affected_rows
307974


We can see how this is reflected in the transaction log.

In [0]:
%sql
DESCRIBE HISTORY fireCallsSilver

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata
1,2022-02-12T10:10:52.000+0000,5779868209403466,birhan.hailu.tigray@gmail.com,UPDATE,Map(predicate -> ((City#3377 = SF) OR (City#3377 = SAN FRANCISCO))),,List(676651046033313),0212-095200-uaajeyhl,0.0,WriteSerializable,False,"Map(numRemovedFiles -> 8, numCopiedRows -> 108505, numAddedChangeFiles -> 0, executionTimeMs -> 14926, scanTimeMs -> 4230, numAddedFiles -> 8, numUpdatedRows -> 307974, rewriteTimeMs -> 10695)",
0,2022-02-12T10:09:33.000+0000,5779868209403466,birhan.hailu.tigray@gmail.com,CREATE TABLE AS SELECT,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(676651046033313),0212-095200-uaajeyhl,,WriteSerializable,True,"Map(numFiles -> 8, numOutputBytes -> 10260144, numOutputRows -> 416479)",
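
The `operationMetrics` of the two versions can be cross-checked with a little arithmetic. The sketch below copies the numbers from the history output above into plain Python dictionaries (the variable names are made up for this example). It assumes the usual copy-on-write behavior of `UPDATE`, where every data file containing a matching row is rewritten and the non-matching rows in those files are copied unchanged.

```python
# Metrics copied from the DESCRIBE HISTORY output above.
create_metrics = {"numFiles": 8, "numOutputRows": 416479}
update_metrics = {
    "numRemovedFiles": 8,
    "numAddedFiles": 8,
    "numUpdatedRows": 307974,
    "numCopiedRows": 108505,
}

# Rows written by the UPDATE: the modified rows plus the rows copied unchanged.
rows_rewritten = update_metrics["numUpdatedRows"] + update_metrics["numCopiedRows"]

# All 8 original files were replaced, so the rewritten rows should cover the whole table.
all_files_rewritten = update_metrics["numRemovedFiles"] == create_metrics["numFiles"]
covers_whole_table = rows_rewritten == create_metrics["numOutputRows"]

share_updated = update_metrics["numUpdatedRows"] / create_metrics["numOutputRows"]
print(all_files_rewritten, covers_whole_table, f"{share_updated:.1%} of rows updated")
```

Under that assumption, the numbers are consistent: the updated and copied rows together account for every row written in version 0, which suggests that each of the 8 files contained at least one row to fix.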


**Let's examine the actual JSON log entry generated by the update**

In [0]:
%fs head dbfs:/user/hive/warehouse/databricks.db/firecallssilver/_delta_log/00000000000000000001.json
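
A commit file holds one JSON object per line, and each object has a single top-level key naming the action (for example `commitInfo`, `add`, or `remove`). The sketch below summarizes such a file in plain Python. The sample content is illustrative only: the metric values come from the history output above, but the file paths are made up and a real commit contains more actions and more fields. `summarize_commit` is a hypothetical helper, not a Delta Lake API.

```python
import json
from collections import Counter

# Illustrative commit content (paths are invented for this example).
sample_actions = [
    {"commitInfo": {"operation": "UPDATE",
                    "operationMetrics": {"numUpdatedRows": "307974",
                                         "numCopiedRows": "108505"}}},
    {"remove": {"path": "part-00000-old.snappy.parquet", "dataChange": True}},
    {"add": {"path": "part-00000-new.snappy.parquet", "dataChange": True}},
]
sample_commit = "\n".join(json.dumps(action) for action in sample_actions)


def summarize_commit(text: str) -> dict:
    """Parse newline-delimited JSON actions and count them by action type."""
    actions = [json.loads(line) for line in text.splitlines() if line.strip()]
    # The single top-level key of each object is the action type.
    action_counts = Counter(next(iter(action)) for action in actions)
    operation = next(
        (a["commitInfo"].get("operation") for a in actions if "commitInfo" in a),
        None,
    )
    return {"operation": operation, "action_counts": dict(action_counts)}


summary = summarize_commit(sample_commit)
print(summary)
```

On a real table, the text returned by `%fs head` for the commit file could be passed to the same function, provided the output is not truncated.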

## Aggregate data, write to Delta Gold

<img width="75px" src="https://files.training.databricks.com/images/davis/images_gold.png">

Count calls by neighborhood and call type. The query automatically reads the latest version of the `fireCallsSilver` table.

In [0]:
%sql
DROP TABLE IF EXISTS fireCallsGold;

CREATE TABLE fireCallsGold 
USING DELTA
AS 
  SELECT `Neighborhooods_-_Analysis_Boundaries` as Neighborhoods, Call_Type, count(*) as Count
  FROM fireCallsSilver
  GROUP BY Neighborhoods, Call_Type

num_affected_rows,num_inserted_rows
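
To make the aggregation concrete, here is a minimal plain-Python sketch of the same `GROUP BY` / `count(*)` logic. The rows are a tiny hypothetical sample shaped like the Silver table, not real query results, and the variable names are made up for this example.

```python
from collections import Counter

NEIGHBORHOOD_COL = "Neighborhooods_-_Analysis_Boundaries"  # column name as spelled in the data

# Hypothetical sample rows with the two columns the Gold query uses.
silver_rows = [
    {NEIGHBORHOOD_COL: "Mission", "Call_Type": "Medical Incident"},
    {NEIGHBORHOOD_COL: "Mission", "Call_Type": "Medical Incident"},
    {NEIGHBORHOOD_COL: "Mission", "Call_Type": "Structure Fire"},
    {NEIGHBORHOOD_COL: "Chinatown", "Call_Type": "Alarms"},
]

# Group by (neighborhood, call type) and count the rows in each group.
gold_counts = Counter((row[NEIGHBORHOOD_COL], row["Call_Type"]) for row in silver_rows)

for (neighborhood, call_type), count in sorted(gold_counts.items()):
    print(neighborhood, call_type, count)
```

Each distinct pair of neighborhood and call type becomes one row of the Gold table, with `Count` holding the number of Silver rows in that group.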


**Read some rows from the aggregated Gold table**

In [0]:
%sql
SELECT * FROM fireCallsGold LIMIT 10


Neighborhoods,Call_Type,Count
Bernal Heights,Medical Incident,5520
Chinatown,Structure Fire,949
Japantown,Other,101
Presidio Heights,Gas Leak (Natural and LP Gases),29
Presidio Heights,Outside Fire,38
Russian Hill,Alarms,678
Chinatown,Alarms,1054
Bayview Hunters Point,Structure Fire,2133
Treasure Island,Alarms,145
Outer Richmond,Structure Fire,573


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>