
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# 2.3 Caching

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>
* Cache data for increased performance
* Read the Spark UI

## File Statistics

Let's see how large our file is on disk.

In [0]:
%run ../Includes/Classroom-Setup

In [0]:
%fs ls /mnt/davis/fire-calls/fire-calls-truncated-comma.csv

path,name,size
dbfs:/mnt/davis/fire-calls/fire-calls-truncated-comma.csv,fire-calls-truncated-comma.csv,89222803


## Count

Let's see how long it takes to count all of the records in our dataset.

In [0]:
%sql
SELECT count(*) FROM fireCalls

count(1)
240613


## Cache Table

In [0]:
%sql
CACHE TABLE fireCalls

## Spark UI

Wow! Caching our data took a long time. Let's take a look at the cached table in the Storage tab of the Spark UI.

You'll notice that our cached data actually takes up less space than our file on disk! That is thanks to Project Tungsten, which stores data in a compact binary format instead of as JVM objects. You can learn more about Tungsten from Josh Rosen's Spark Summit presentation, "Deep Dive into Project Tungsten Bringing Spark Closer to Bare Metal ...".

The cached table takes up ~59 MB in memory, while the CSV file takes up ~90 MB on disk!
<br><br>

<div><img src="https://files.training.databricks.com/images/davis/cache_memory_2.3.png" style="height: 300px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>

In [0]:
%fs ls /mnt/davis/fire-calls/fire-calls-truncated-comma.csv

path,name,size
dbfs:/mnt/davis/fire-calls/fire-calls-truncated-comma.csv,fire-calls-truncated-comma.csv,89222803
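
As a quick sanity check on the figures above, this plain-Python sketch converts the byte count from the listing into megabytes and compares it with the cached size. The ~59 MB value is the approximate figure quoted above rather than a measured one, and the variable names are only illustrative.

```python
# Size of the CSV on disk, in bytes, as reported by `%fs ls` above.
file_size_bytes = 89_222_803

# Approximate in-memory size quoted above (assumed to mean 59 * 10**6 bytes).
cached_size_bytes = 59 * 10**6

file_size_mb = file_size_bytes / 10**6    # decimal megabytes
file_size_mib = file_size_bytes / 2**20   # binary mebibytes

# Fraction of the on-disk size that the cached table occupies.
ratio = cached_size_bytes / file_size_bytes

print(f"On disk: {file_size_mb:.1f} MB ({file_size_mib:.1f} MiB)")
print(f"Cached size is roughly {ratio:.0%} of the on-disk size")
```

Under these assumptions, the cached table occupies about two thirds of the space the CSV file needs on disk.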


## Count (Again)

Although it took a while to cache our data, every query against the cached table should now be much faster, because Spark reads from memory instead of re-reading the CSV file. See how long it takes to run the same query!

In [0]:
%sql
SELECT count(*) FROM fireCalls

count(1)
240613
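
The notebook already reports how long each cell took, but if you want to compare the before and after timings in code, a small helper like the one below may be handy. This is a minimal sketch: `time_query` is a hypothetical name, and it assumes the `spark` session that Databricks notebooks provide.

```python
import time


def time_query(spark, query):
    """Run `query` via `spark.sql` and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = spark.sql(query).collect()  # collect() forces the query to execute
    elapsed = time.perf_counter() - start
    return rows, elapsed


# Example usage in a notebook cell where `spark` is already defined:
# rows, seconds = time_query(spark, "SELECT count(*) FROM fireCalls")
# print(rows[0][0], f"{seconds:.2f}s")
```

Calling the helper once before `CACHE TABLE` and once after should make the speed-up visible as two numbers you can compare directly.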


## Uncache Table

Wow! That was a lot faster. Now, let's remove our table from the cache.

In [0]:
%sql
UNCACHE TABLE fireCalls

## Lazy Cache

Instead of waiting a minute to cache this dataset, we could do a "lazy cache". A lazy cache only marks the table for caching: each partition is cached the first time a query actually reads it.

In [0]:
%sql
CACHE LAZY TABLE fireCalls

## Small Query

The `CACHE LAZY TABLE` command was a lot faster to run. But did it cache our entire dataset? Let's run a small query and find out.

In [0]:
%sql
SELECT * FROM fireCalls LIMIT 10

Call Number,Unit ID,Incident Number,Call Type,Call Date,Watch Date,Received DtTm,Entry DtTm,Dispatch DtTm,Response DtTm,On Scene DtTm,Transport DtTm,Hospital DtTm,Call Final Disposition,Available DtTm,Address,City,Zipcode of Incident,Battalion,Station Area,Box,Original Priority,Priority,Final Priority,ALS Unit,Call Type Group,Number of Alarms,Unit Type,Unit sequence in call dispatch,Fire Prevention District,Supervisor District,Neighborhooods - Analysis Boundaries,Location,RowID
1030118,E08,30625,Medical Incident,04/12/2000,04/12/2000,04/12/2000 09:27:45 PM,04/12/2000 09:28:58 PM,04/12/2000 09:29:21 PM,04/12/2000 09:31:26 PM,04/12/2000 09:32:34 PM,,,Other,04/12/2000 09:45:28 PM,4TH ST/CHANNEL ST,SF,,B03,8,2226,3,3,3,False,,1,ENGINE,1,3.0,6,,"(37.7750268633971, -122.392346204303)",001030118-E08
1030122,M18,30630,Medical Incident,04/12/2000,04/12/2000,04/12/2000 09:31:55 PM,04/12/2000 09:33:48 PM,04/12/2000 09:34:10 PM,04/12/2000 09:35:59 PM,04/12/2000 09:45:22 PM,,,Other,04/12/2000 09:49:52 PM,1800 Block of IRVING ST,SF,94122.0,B08,22,7424,1,1,2,False,,1,MEDIC,1,8.0,4,Sunset/Parkside,"(37.763482287794, -122.477678638767)",001030122-M18
1030154,M36,30662,Medical Incident,04/12/2000,04/12/2000,04/12/2000 10:43:54 PM,04/12/2000 10:45:53 PM,04/12/2000 10:49:59 PM,04/12/2000 10:50:35 PM,04/12/2000 10:53:18 PM,04/12/2000 11:11:36 PM,04/12/2000 11:22:17 PM,Other,04/12/2000 11:42:43 PM,0 Block of SOUTH VAN NESS AVE,SF,94103.0,B02,36,5117,1,1,2,False,,1,MEDIC,1,2.0,6,Mission,"(37.7741251002903, -122.418810211803)",001030154-M36
1040007,E12,30697,Structure Fire,04/13/2000,04/12/2000,04/13/2000 12:19:54 AM,04/13/2000 12:29:24 AM,04/13/2000 12:29:35 AM,04/13/2000 12:31:25 AM,04/13/2000 12:32:36 AM,,,Other,04/13/2000 12:33:18 AM,CLAYTON ST/PARNASSUS AV,SF,94117.0,B05,12,5151,3,3,3,True,,1,ENGINE,1,5.0,5,Haight Ashbury,"(37.7651387353822, -122.44763462758)",001040007-E12
1040021,M14,30711,Medical Incident,04/13/2000,04/12/2000,04/13/2000 01:17:25 AM,04/13/2000 01:18:44 AM,04/13/2000 01:20:02 AM,04/13/2000 01:21:40 AM,04/13/2000 01:24:05 AM,04/13/2000 01:56:02 AM,04/13/2000 02:33:33 AM,Other,04/13/2000 02:40:25 AM,500 Block of 38TH AVE,SF,94121.0,B07,34,7255,3,3,3,True,,1,MEDIC,1,7.0,1,Outer Richmond,"(37.778489948235, -122.498662035969)",001040021-M14
1040061,M43,30749,Medical Incident,04/13/2000,04/12/2000,04/13/2000 07:51:29 AM,04/13/2000 07:55:35 AM,04/13/2000 07:55:54 AM,04/13/2000 07:59:58 AM,,04/13/2000 08:16:30 AM,04/13/2000 08:28:37 AM,Other,04/13/2000 09:26:54 AM,200 Block of MADRID ST,SF,94112.0,B09,43,613,3,3,3,True,,1,MEDIC,3,9.0,11,Excelsior,"(37.7255316247491, -122.429925994016)",001040061-M43
1040079,E10,30766,Alarms,04/13/2000,04/13/2000,04/13/2000 09:31:19 AM,04/13/2000 09:33:04 AM,04/13/2000 09:34:10 AM,04/13/2000 09:35:52 AM,04/13/2000 09:37:59 AM,,,Other,04/13/2000 09:39:36 AM,2800 Block of BROADWAY,SF,94123.0,B04,10,4226,3,3,3,False,,1,ENGINE,1,4.0,2,Pacific Heights,"(37.7931736175933, -122.444028632879)",001040079-E10
1040143,M43,30832,Medical Incident,04/13/2000,04/13/2000,04/13/2000 01:01:56 PM,04/13/2000 01:04:12 PM,04/13/2000 01:13:08 PM,,04/13/2000 01:29:19 PM,04/13/2000 01:36:34 PM,,Other,04/13/2000 01:16:20 PM,2500 Block of OCEAN AVE,SF,94132.0,B08,19,8452,1,1,2,True,,1,MEDIC,1,8.0,7,West of Twin Peaks,"(37.7314853147957, -122.472647880057)",001040143-M43
1040170,T16,30855,Structure Fire,04/13/2000,04/13/2000,04/13/2000 02:09:54 PM,04/13/2000 02:12:27 PM,04/13/2000 02:15:00 PM,,,,,Other,04/13/2000 02:22:01 PM,POLK ST/UNION ST,SF,94109.0,B04,4,3131,3,3,3,False,,1,TRUCK,2,4.0,3,Russian Hill,"(37.7987615790944, -122.422336952094)",001040170-T16
1040233,E48,30914,Alarms,04/13/2000,04/13/2000,04/13/2000 05:23:03 PM,04/13/2000 05:23:58 PM,04/13/2000 05:25:02 PM,,04/13/2000 05:29:16 PM,,,Other,04/13/2000 05:52:48 PM,CALL BOX: FS TI,TI,94130.0,B03,48,2931,3,3,3,False,,1,ENGINE,1,,6,Treasure Island,"(37.8225682263653, -122.371537518925)",001040233-E48


## Spark UI Part 2

Take another look at the Spark UI: only one partition was cached. Why? It turns out that to display 10 records, Spark doesn't need to read our entire dataset. It only needed to materialize (and therefore cache) one partition.
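
To make the idea concrete, here is a toy model in plain Python of a table that caches partitions only when they are read. It is a sketch for intuition, not Spark's actual implementation: the class `LazyCachedTable`, its methods, and the four 25-row partitions are all hypothetical.

```python
class LazyCachedTable:
    """Toy model of a lazily cached table; not Spark's actual implementation."""

    def __init__(self, partitions):
        self._partitions = partitions   # list of lists of rows ("on disk")
        self.cached = {}                # partition index -> cached rows

    def _read(self, index):
        # Cache a partition the first time it is read.
        if index not in self.cached:
            self.cached[index] = list(self._partitions[index])
        return self.cached[index]

    def limit(self, n):
        # Read partitions in order and stop as soon as n rows are collected.
        rows = []
        for index in range(len(self._partitions)):
            if len(rows) >= n:
                break
            rows.extend(self._read(index))
        return rows[:n]

    def count(self):
        # A count must visit every partition, so all of them end up cached.
        return sum(len(self._read(i)) for i in range(len(self._partitions)))


# Four hypothetical partitions of 25 rows each.
table = LazyCachedTable([list(range(i * 25, (i + 1) * 25)) for i in range(4)])

first_rows = table.limit(10)
partitions_after_limit = len(table.cached)   # only the first partition

total = table.count()
partitions_after_count = len(table.cached)   # every partition

print(partitions_after_limit, partitions_after_count, total)  # prints: 1 4 100
```

In this model, a small `limit` touches a single partition, while a full count touches all of them.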

## Cache I

Now let's do a `count()` on our dataset. A count has to go through every record of every partition, so it ensures the entire dataset will be cached.

In [0]:
%sql
SELECT count(*) FROM fireCalls

count(1)
240613
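
The same "mark as cached, then count" pattern can also be written in Python through the `spark.catalog` API, where `cacheTable` is lazy in the same way as `CACHE LAZY TABLE`. The following is a minimal sketch, assuming the notebook's `spark` session and the `fireCalls` table from the setup script; the helper name `cache_and_materialize` is hypothetical.

```python
def cache_and_materialize(spark, table_name):
    """Mark `table_name` as cached, then count it to fill the cache.

    Returns (row_count, is_cached).
    """
    spark.catalog.cacheTable(table_name)           # lazy: nothing is read yet
    row_count = spark.table(table_name).count()    # scans every partition
    return row_count, spark.catalog.isCached(table_name)


# Example usage in a notebook cell where `spark` is already defined:
# n, cached = cache_and_materialize(spark, "fireCalls")
# spark.catalog.uncacheTable("fireCalls")  # or spark.catalog.clearCache()
```

Note that `isCached` reports whether the table is marked for caching, so checking the Spark UI is still the way to confirm how many partitions are actually in memory.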


## Count

Look at how fast this call to `count()` is now!

In [0]:
%sql
SELECT count(*) FROM fireCalls

count(1)
240613


## Clear Cache

Let's remove any data that is currently cached.

In [0]:
%sql
CLEAR CACHE


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>