#### Day 10

#### Analyze Query Plan (EXPLAIN)

EXPLAIN prints the physical plan for a query without running it, so you can reason about performance before optimizing.

In [0]:
%sql
EXPLAIN
SELECT *
FROM ecommerce_catalog.silver.events
WHERE category_code = 'electronics';


plan
"== Physical Plan ==
*(1) ColumnarToRow
+- PhotonResultStage
   +- PhotonScan parquet ecommerce_catalog.silver.events[event_time#14138,event_type#14139,product_id#14140,category_id#14141L,category_code#14142,brand#14143,price#14144,user_id#14145,user_session#14146] DataFilters: [isnotnull(category_code#14142), (category_code#14142 = electronics)], DictionaryFilters: [(category_code#14142 = electronics)], Format: parquet, Location: PreparedDeltaFileIndex(1 paths)[s3://dbstorage-prod-scelz/uc/da97e662-87b2-45d5-aee5-bee7766f5344..., OptionalDataFilters: [], PartitionFilters: [], ReadSchema: struct"
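The two fields worth reading first in this plan are `DataFilters` and `PartitionFilters`. As a minimal sketch, the helper below pulls both out of a plan string with a regular expression. The function name `extract_filters` and the shortened `PLAN` excerpt are illustrative assumptions, not part of any Spark or Databricks API.

```python
import re

# Excerpt of the plan text shown above (shortened for readability).
PLAN = (
    "PhotonScan parquet ecommerce_catalog.silver.events "
    "DataFilters: [isnotnull(category_code#14142), "
    "(category_code#14142 = electronics)], "
    "DictionaryFilters: [(category_code#14142 = electronics)], "
    "Format: parquet, OptionalDataFilters: [], PartitionFilters: []"
)


def extract_filters(plan: str) -> dict:
    """Return the raw text inside the DataFilters and PartitionFilters lists."""
    found = {}
    for key in ("DataFilters", "PartitionFilters"):
        # \b stops "DataFilters" from matching inside "OptionalDataFilters".
        match = re.search(rf"\b{key}: \[(.*?)\]", plan)
        found[key] = match.group(1).strip() if match else ""
    return found


filters = extract_filters(PLAN)
# An empty PartitionFilters list means no partition pruning took place.
uses_partition_pruning = bool(filters["PartitionFilters"])
print(filters)
print("partition pruning:", uses_partition_pruning)
```

Here `PartitionFilters` is empty, which is expected: the source table is not partitioned, so the `category_code` predicate is applied only as a data filter during the scan.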


#### Create a Partitioned Silver Table

Partitioning helps Spark skip irrelevant data during queries.

We will partition by event_type (low cardinality, frequently filtered).

In [0]:
%sql
CREATE TABLE ecommerce_catalog.silver.events_part
USING DELTA
PARTITIONED BY (event_type)
AS
SELECT *
FROM ecommerce_catalog.silver.events;


num_affected_rows,num_inserted_rows


Queries that filter on event_type now scan only the matching partitions. Filters on other columns, such as category_code, get no partition pruning from this layout.
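To make the pruning idea concrete, here is a toy model in plain Python, assuming a handful of made-up rows. It treats each partition as a bucket of rows and counts how many rows a scan has to read. It deliberately ignores Delta's file-level statistics and data skipping, so it is a sketch of partition pruning only, not of how the engine actually reads files.

```python
from collections import defaultdict

# Hypothetical sample rows: (event_type, category_code, user_id).
ROWS = [
    ("view", "electronics.smartphone", 1),
    ("view", "appliances.kitchen", 2),
    ("cart", "electronics.smartphone", 1),
    ("purchase", "electronics.smartphone", 3),
    ("purchase", "appliances.kitchen", 2),
]


def partition_by_event_type(rows):
    """Group rows into one bucket per event_type, like one directory per partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return partitions


def scan(partitions, event_type=None, category_prefix=None):
    """Return (matching_rows, rows_read) for a filtered scan."""
    # A filter on the partition column selects whole partitions up front.
    if event_type is not None:
        selected = [partitions.get(event_type, [])]
    else:
        selected = list(partitions.values())
    rows_read = sum(len(p) for p in selected)
    matches = [
        row
        for p in selected
        for row in p
        if category_prefix is None or row[1].startswith(category_prefix)
    ]
    return matches, rows_read


partitions = partition_by_event_type(ROWS)
_, read_pruned = scan(partitions, event_type="purchase")
_, read_full = scan(partitions, category_prefix="electronics")
print(read_pruned, read_full)  # prints "2 5"
```

The filter on the partition column reads 2 of the 5 rows, while the filter on category_code has to visit every partition.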

#### Check partitions

In [0]:
%sql
SHOW PARTITIONS ecommerce_catalog.silver.events_part;


event_type
purchase
view
cart


#### Apply OPTIMIZE and ZORDER

ZORDER BY colocates rows with similar values in the same files, so filters on those columns can skip more files.

We will ZORDER on user_id and product_id.

In [0]:
%sql
OPTIMIZE ecommerce_catalog.silver.events_part
ZORDER BY (user_id, product_id);

path,metrics
,"List(24, 8, List(45786855, 64458457, 5.355870425E7, 24, 1285408902), List(138640523, 183759967, 1.77837971875E8, 8, 1422703775), 3, List(minCubeSize(107374182400), List(0, 0), List(10, 1479271925), 0, List(8, 1422703775), 1, null), null, 0, 1, 10, 2, false, 0, 0, 1768845933606, 1768845952898, 8, 1, null, List(0, 0), null, 9, 9, 70748, 0, null)"
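For intuition about what ordering on two columns at once means, the sketch below builds a generic Z-order (Morton) key by interleaving the bits of two integers and sorts by it. This shows the general space-filling-curve idea only. It is not claimed to match Delta's internal implementation, and the function name `interleave_bits` and the sample pairs are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x takes the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y takes the odd bit positions
    return key


# Hypothetical small (user_id, product_id) pairs, not values from the dataset.
pairs = [(3, 7), (0, 0), (2, 6), (7, 1), (1, 1), (6, 0)]
z_sorted = sorted(pairs, key=lambda p: interleave_bits(p[0], p[1]))
print(z_sorted)
# prints [(0, 0), (1, 1), (6, 0), (7, 1), (2, 6), (3, 7)]
```

Pairs that are close in both columns end up next to each other in the sorted order, which is why a filter on either column tends to touch fewer files after the rewrite.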


#### Benchmark (Before vs After)

In [0]:
%sql
-- Sanity check on the optimized table (this user_id returns no rows)
SELECT *
FROM ecommerce_catalog.silver.events_part
WHERE user_id = 12345;


event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session


In [0]:
%sql
-- Before
SELECT *
FROM ecommerce_catalog.silver.events
WHERE user_id = 537312895;

event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
2019-10-31T06:08:12.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:08:36.000Z,view,1005002,2053013555631882655,electronics.smartphone,huawei,244.51,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:10:40.000Z,view,1005186,2053013555631882655,electronics.smartphone,samsung,771.94,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:01:46.000Z,view,1004785,2053013555631882655,electronics.smartphone,huawei,256.41,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:11:46.000Z,view,1004870,2053013555631882655,electronics.smartphone,samsung,275.4,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:02:02.000Z,view,1005002,2053013555631882655,electronics.smartphone,huawei,244.51,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:07:54.000Z,view,1004849,2053013555631882655,electronics.smartphone,huawei,947.0,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:09:57.000Z,view,1005002,2053013555631882655,electronics.smartphone,huawei,244.51,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:06:35.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:04:06.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880


In [0]:
%sql
-- After
SELECT *
FROM ecommerce_catalog.silver.events_part
WHERE user_id = 537312895;

event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
2019-10-31T06:10:40.000Z,view,1005186,2053013555631882655,electronics.smartphone,samsung,771.94,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:08:12.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:08:36.000Z,view,1005002,2053013555631882655,electronics.smartphone,huawei,244.51,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:11:46.000Z,view,1004870,2053013555631882655,electronics.smartphone,samsung,275.4,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:07:54.000Z,view,1004849,2053013555631882655,electronics.smartphone,huawei,947.0,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:09:57.000Z,view,1005002,2053013555631882655,electronics.smartphone,huawei,244.51,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:06:35.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:04:06.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:07:33.000Z,view,1004849,2053013555631882655,electronics.smartphone,huawei,947.0,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880
2019-10-31T06:06:03.000Z,view,1004321,2053013555631882655,electronics.smartphone,huawei,321.5,537312895,d7d2b1c9-6eca-44c8-b186-8f8507865880


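After OPTIMIZE with ZORDER, the query on events_part ran faster than the same query on events and needed fewer tasks. Note that this compares the original table with one that is both partitioned and Z-ordered, so the gain cannot be attributed to ZORDER alone.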
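A single run is a noisy measurement, so it helps to repeat each query and compare medians. The helper below is a minimal sketch using only the standard library. The name `time_runs` is an assumption, and the workload is a stand-in so the code runs anywhere. In a notebook you would pass a callable that executes the query instead.

```python
import statistics
import time


def time_runs(run, repeats: int = 3):
    """Call `run` `repeats` times and return (last_result, median_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = run()
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)


# Stand-in workload so the sketch runs anywhere. In a notebook you might pass
# something like: lambda: spark.sql("SELECT ...").collect()
result, seconds = time_runs(lambda: sum(range(100_000)))
print(f"median of 3 runs: {seconds:.6f}s")
```

Running the helper once per table gives two medians to compare, which is more robust than eyeballing one execution of each query.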

#### Cache for Repeated Analysis (not supported on the serverless compute used here)

In [0]:
%sql
CACHE TABLE ecommerce_catalog.silver.events_part;

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-7301021014885685>, line 1
----> 1 get_ipython().run_cell_magic('sql', '', 'CACHE TABLE ecommerce_catalog.silver.events_part;\n')

File /databricks/python/lib/python3.12/site-packages/IPython/core/interactiveshell.py:2541, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2539 with self.builtin_trap:
   2540     args = (magic_arg_s, cell)
-> 2541     result = fn(*args, **kwargs)
[1;32m   2543[0m [38;5;66;03