Make sure the cluster is configured as described in the tutorial, with the following Spark options (set them in the cluster's advanced settings):
    
spark.databricks.cluster.profile singleNode
spark.hadoop.mapred.max.split.size 256000000
spark.master local[*, 4]
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.delta.preview.enabled true
spark.databricks.io.cache.compression.enabled false
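
To double-check that the options above were actually applied, they can be compared against the running session. The following is only a sketch: `parse_spark_conf` and `find_mismatches` are hypothetical helpers that are not part of the tutorial, and it assumes the notebook's predefined `spark` session exposes these keys through `spark.conf.get` (some cluster-level options may be reported differently or not at all).

```python
# Hypothetical helpers (not part of the original notebook) for comparing
# the expected cluster options with the live Spark session.
EXPECTED_CONF_TEXT = """
spark.databricks.cluster.profile singleNode
spark.hadoop.mapred.max.split.size 256000000
spark.master local[*, 4]
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.delta.preview.enabled true
spark.databricks.io.cache.compression.enabled false
"""


def parse_spark_conf(text):
    """Parse 'key value' lines into a dict; the value may contain spaces."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so "local[*, 4]" stays intact.
        key, value = line.split(None, 1)
        conf[key] = value.strip()
    return conf


def find_mismatches(expected, get_value):
    """Return {key: (expected, actual)} for every option that differs."""
    mismatches = {}
    for key, want in expected.items():
        have = get_value(key)
        if have != want:
            mismatches[key] = (want, have)
    return mismatches


expected_conf = parse_spark_conf(EXPECTED_CONF_TEXT)

# `spark` is predefined in a Databricks notebook; elsewhere this step is skipped.
if "spark" in globals():
    print(find_mismatches(expected_conf, lambda k: spark.conf.get(k, None)))
```

An empty dictionary would mean that every listed option matches the session; any entry shows the expected value next to the one the session reports.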

In [0]:
from pyspark.sql.functions import col

In [0]:
# read a large (approx. 60 GB) dataset into a Spark DataFrame
flight_data = spark.read.format("csv") \
                        .option("header", "true") \
                        .option("inferSchema", "true") \
                        .load("/databricks-datasets/asa/airlines/2005.csv")

In [0]:
# write the DataFrame as Parquet files to DBFS
flight_data.write.parquet("dbfs:/FileStore/tables/parquet_flight_data")

In [0]:
# the Delta cache works with both Delta and Parquet tables (any Parquet data is supported),
# so register the Parquet files written above as a table
spark.sql("CREATE TABLE flight_data using PARQUET LOCATION 'dbfs:/FileStore/tables/parquet_flight_data'")

In [0]:
%sql

SELECT * FROM flight_data

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2005,1,28,5,1603.0,1605,1741.0,1759,UA,541,N935UA,158.0,174,131.0,-18.0,-2.0,BOS,ORD,867,4,23,0,,0,0,0,0,0,0
2005,1,29,6,1559.0,1605,1736.0,1759,UA,541,N941UA,157.0,174,136.0,-23.0,-6.0,BOS,ORD,867,6,15,0,,0,0,0,0,0,0
2005,1,30,7,1603.0,1610,1741.0,1805,UA,541,N342UA,158.0,175,131.0,-24.0,-7.0,BOS,ORD,867,9,18,0,,0,0,0,0,0,0
2005,1,31,1,1556.0,1605,1726.0,1759,UA,541,N326UA,150.0,174,129.0,-33.0,-9.0,BOS,ORD,867,11,10,0,,0,0,0,0,0,0
2005,1,2,7,1934.0,1900,2235.0,2232,UA,542,N902UA,121.0,152,106.0,3.0,34.0,ORD,BOS,867,5,10,0,,0,0,0,0,0,0
2005,1,3,1,2042.0,1900,9.0,2232,UA,542,N904UA,147.0,152,97.0,97.0,102.0,ORD,BOS,867,3,47,0,,0,23,0,0,0,74
2005,1,4,2,2046.0,1900,2357.0,2232,UA,542,N942UA,131.0,152,100.0,85.0,106.0,ORD,BOS,867,5,26,0,,0,46,0,0,0,39
2005,1,5,3,,1900,,2232,UA,542,000000,,152,,,,ORD,BOS,867,0,0,1,B,0,0,0,0,0,0
2005,1,6,4,2110.0,1900,8.0,2223,UA,542,N920UA,118.0,143,101.0,105.0,130.0,ORD,BOS,867,2,15,0,,0,16,0,0,0,89
2005,1,7,5,1859.0,1900,2235.0,2223,UA,542,N340UA,156.0,143,96.0,12.0,-1.0,ORD,BOS,867,4,56,0,,0,0,0,0,0,0


Looking at the job details and storage metrics, around 1.3 GB was loaded from storage to run this query, and the cache was not used at all. Now let's run a highly selective query.

In [0]:
%sql

SELECT AirTime, Dest FROM flight_data WHERE Origin="PHL"

AirTime,Dest
222.0,DEN
239.0,DEN
239.0,DEN
254.0,DEN
243.0,DEN
218.0,DEN
222.0,DEN
241.0,DEN
241.0,DEN
227.0,DEN


In [0]:
# enable delta cache
spark.conf.set("spark.databricks.io.cache.enabled", "true")

In [0]:
%sql
CACHE SELECT * FROM flight_data
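
`CACHE SELECT *` preloads every column of the table. When only a few columns are queried, a narrower statement may warm the cache with less data. The sketch below assumes that `CACHE SELECT` accepts a column list and an optional `WHERE` clause, as described in the Databricks documentation; `build_cache_statement` is a hypothetical helper introduced only for illustration.

```python
def build_cache_statement(table, columns, where=None):
    """Build a CACHE SELECT statement for a subset of columns (hypothetical helper)."""
    statement = f"CACHE SELECT {', '.join(columns)} FROM {table}"
    if where:
        statement += f" WHERE {where}"
    return statement


# Cache only the columns touched by the selective query in this notebook.
stmt = build_cache_statement("flight_data", ["AirTime", "Dest", "Origin"])
print(stmt)

# `spark` is predefined in a Databricks notebook; elsewhere this step is skipped.
if "spark" in globals():
    spark.sql(stmt)
```

Whether this is worthwhile depends on the workload: caching everything is simpler, while caching a subset leaves more of the local disk budget for other data.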

In [0]:
%sql

SELECT AirTime, Dest FROM flight_data WHERE Origin="PHL"

AirTime,Dest
222.0,DEN
239.0,DEN
239.0,DEN
254.0,DEN
243.0,DEN
218.0,DEN
222.0,DEN
241.0,DEN
241.0,DEN
227.0,DEN


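Now the same query runs much faster than before, because the data is served from the Delta cache on the cluster's local disk instead of being read from remote storage.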
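
The speed-up can be measured rather than just observed. The following is a minimal sketch, not the tutorial's own benchmark: `time_call` is a hypothetical helper, and the Spark part assumes a Databricks notebook with a predefined `spark` session, the `flight_data` table created earlier, and Spark 3.0+ for the `noop` write format (used here only to force the query to run fully without collecting rows to the driver).

```python
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


QUERY = "SELECT AirTime, Dest FROM flight_data WHERE Origin = 'PHL'"

# `spark` is predefined in a Databricks notebook; elsewhere this step is skipped.
if "spark" in globals():
    def run_query():
        # The "noop" sink executes the whole query but discards the output.
        spark.sql(QUERY).write.format("noop").mode("overwrite").save()

    for enabled in ("false", "true"):
        # Disabling the cache stops reads from it without dropping cached data.
        spark.conf.set("spark.databricks.io.cache.enabled", enabled)
        print("cache enabled =", enabled, time_call(run_query))
```

Running the query several times per setting helps separate the effect of the cache from one-off costs such as JVM warm-up.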