# Micro-partitions

Snowflake stores a table in many small chunks called micro-partitions. 
For each micro-partition, Snowflake keeps metadata like the MIN and MAX values of each column inside that chunk.
 
 * Each micro-partition holds 50-500 MB of uncompressed data.
 * Data is stored in columnar format, not by rows.
 * The value ranges of a column can repeat across micro-partitions, which causes ***overlapping***.
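
To make the metadata idea concrete, here is a minimal Python sketch of what "keep the MIN and MAX of a column per chunk" could look like. It is a toy model, not Snowflake's internal format: the `PartitionMeta` class, the `build_metadata` helper, and the sample dates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PartitionMeta:
    """Toy stand-in for the metadata kept for one column of one micro-partition."""
    name: str
    min_value: date
    max_value: date
    row_count: int


def build_metadata(name: str, values: list[date]) -> PartitionMeta:
    # Only the value range and the row count are recorded here;
    # the rows themselves stay inside the chunk.
    return PartitionMeta(name, min(values), max(values), len(values))


# Hypothetical order_date values stored in two chunks of a table.
p1 = build_metadata("P1", [date(2024, 1, 3), date(2024, 1, 1), date(2024, 1, 10)])
p2 = build_metadata("P2", [date(2024, 1, 20), date(2024, 1, 11)])

for meta in (p1, p2):
    print(meta.name, meta.min_value, meta.max_value, meta.row_count)
```

The point of the sketch is that a range check against this small record can be answered without reading the chunk itself.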
  

## Overlapping micro-partitions

If we look at one column (for example, order_date), each micro-partition covers a range of values:

 * Micro-partition P1 contains dates from Jan 1 to Jan 10
 * Micro-partition P2 contains dates from Jan 11 to Jan 20

These two ranges do not overlap at all.

However, consider these ranges:

 * P1: [Jan 01 … Jan 20]
 * P2: [Jan 10 … Feb 05]
 * P3: [Jan 15 … Jan 25]

Jan 15 appears in the ranges of P1, P2 and P3, so the ranges overlap.
If you search for order_date = Jan 15, all 3 partitions will be scanned.
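
The range check behind this can be sketched in a few lines of Python. This is only an illustration of min/max pruning using the P1, P2 and P3 ranges above; the `partitions_to_scan` function is a hypothetical helper, and the year 2024 is an assumption because the example gives only month and day.

```python
from datetime import date

# (min, max) of order_date per micro-partition, from the overlapping example.
# The year is an assumption; the example gives only month and day.
partitions = {
    "P1": (date(2024, 1, 1), date(2024, 1, 20)),
    "P2": (date(2024, 1, 10), date(2024, 2, 5)),
    "P3": (date(2024, 1, 15), date(2024, 1, 25)),
}


def partitions_to_scan(ranges, wanted):
    """Return the partitions whose [min, max] range could contain `wanted`."""
    return [name for name, (lo, hi) in ranges.items() if lo <= wanted <= hi]


print(partitions_to_scan(partitions, date(2024, 1, 15)))  # ['P1', 'P2', 'P3']
print(partitions_to_scan(partitions, date(2024, 2, 1)))   # ['P2']
```

A partition is skipped only when the searched value falls outside its range, so the more the ranges overlap, the fewer partitions can be skipped.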


In [None]:
%%sql -r dataframe_4
ALTER WAREHOUSE COMPUTE_WH SET AUTO_RESUME = TRUE;
ALTER WAREHOUSE COMPUTE_WH RESUME IF SUSPENDED;

In [None]:
%%sql -r dataframe_3
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.orders limit 10

In [None]:
%%sql -r dataframe_2

use role sysadmin;
create or replace database sf_cert_prep;

create or replace table sf_cert_prep.public.t_orders_bad
as 
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.orders order by o_comment  asc ;

-- Create the table ordered by o_comment on purpose, even though all searches will filter on o_orderdate.


In [None]:
%%sql -r dataframe_6

ALTER SESSION SET USE_CACHED_RESULT = FALSE; -- disabling query cache
-- cleaning warehouse cache
ALTER WAREHOUSE COMPUTE_WH SUSPEND;
ALTER WAREHOUSE COMPUTE_WH RESUME;

select * from sf_cert_prep.public.t_orders_bad where o_orderdate = '1993-09-08';
-- pruning statistics of the previous query (the select * above)
select
    operator_statistics:pruning.partitions_scanned::integer as partitions_scanned,
    operator_statistics:pruning.partitions_total::integer as partitions_total,
    partitions_scanned / partitions_total as overlap -- fraction of partitions scanned
from table(get_query_operator_stats(last_query_id(-1)))
where operator_type = 'TableScan';

In [None]:
%%sql -r dataframe_10
create or replace table sf_cert_prep.public.t_orders_good
as 
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.orders order by o_orderdate  asc ;

In [None]:
%%sql -r dataframe_11


ALTER SESSION SET USE_CACHED_RESULT = FALSE; -- disabling query cache
-- cleaning warehouse cache
ALTER WAREHOUSE COMPUTE_WH SUSPEND;
ALTER WAREHOUSE COMPUTE_WH RESUME;

select * from sf_cert_prep.public.t_orders_good where o_orderdate = '1993-09-08';
-- pruning statistics of the previous query (the select * above)
select
    operator_statistics:pruning.partitions_scanned::integer as partitions_scanned,
    operator_statistics:pruning.partitions_total::integer as partitions_total,
    partitions_scanned / partitions_total as overlap -- fraction of partitions scanned
from table(get_query_operator_stats(last_query_id(-1)))
where operator_type = 'TableScan';

## Explanation

Usually we don't have the luxury of rewriting a production table in the order we intended. Merges, updates, and reloads from the source will gradually undo the way the partitions were "optimized".
As an alternative for those cases, we can use "clustering keys". Reclustering writes new micro-partitions organized by the keys or expressions that we provide.
* Old partitions remain unchanged. Once they are no longer used, they are marked for deletion after the Time Travel / Fail-safe period.
* New partitions replace the old ones in the table metadata, so query plans read them instead.
* Reclustering does not recreate all partitions, only the ones that will benefit the most.
* Reclustering runs in the background, so a table that was just created with a clustering key may not show better pruning right away.
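
The effect of row order on pruning can be imitated with a small Python model. It is a sketch under simplifying assumptions, not how Snowflake reclusters: partitions are fixed chunks of 10 rows, the "bad" load order is a made-up permutation, and `sorted` stands in for an idealized reclustering that rewrites everything. The names `to_partitions` and `scanned` are hypothetical helpers.

```python
from datetime import date, timedelta

ROWS_PER_PARTITION = 10  # tiny on purpose; real micro-partitions hold far more rows


def to_partitions(rows):
    """Cut rows into fixed-size chunks and keep only each chunk's (min, max)."""
    chunks = [rows[i:i + ROWS_PER_PARTITION]
              for i in range(0, len(rows), ROWS_PER_PARTITION)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def scanned(ranges, wanted):
    """Count the chunks whose range could contain the wanted value."""
    return sum(1 for lo, hi in ranges if lo <= wanted <= hi)


start = date(1993, 1, 1)
# 100 distinct order dates stored in an order unrelated to the date
# (a stand-in for a table that was loaded sorted by another column).
unordered = [start + timedelta(days=(k * 37) % 100) for k in range(100)]
# What a rewrite ordered by date, or an idealized reclustering, would produce.
ordered = sorted(unordered)

wanted = start + timedelta(days=42)
before = scanned(to_partitions(unordered), wanted)
after = scanned(to_partitions(ordered), wanted)
print(f"partitions scanned before: {before} of 10, after: {after} of 10")
```

In this model the date-ordered layout needs a single chunk for the lookup, while the unordered layout has wide, overlapping ranges and needs more. The SQL cells in this notebook measure the same thing on real tables.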


In [None]:
%%sql -r dataframe_13
create or replace table sf_cert_prep.public.t_orders_bad_but_clustered 
cluster by (o_orderdate)
as 
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.orders order by o_comment  asc ;

In [None]:
%%sql -r dataframe_14


ALTER SESSION SET USE_CACHED_RESULT = FALSE; -- disabling query cache
-- cleaning warehouse cache
ALTER WAREHOUSE COMPUTE_WH SUSPEND;
ALTER WAREHOUSE COMPUTE_WH RESUME;

select * from sf_cert_prep.public.t_orders_bad_but_clustered where o_orderdate = '1993-09-08';
-- pruning statistics of the previous query (the select * above)
select
    operator_statistics:pruning.partitions_scanned::integer as partitions_scanned,
    operator_statistics:pruning.partitions_total::integer as partitions_total,
    partitions_scanned / partitions_total as overlap -- fraction of partitions scanned
from table(get_query_operator_stats(last_query_id(-1)))
where operator_type = 'TableScan';

In [None]:
%%sql -r dataframe_9
select operator_statistics:pruning.partitions_scanned from table(get_query_operator_stats('01c20d9d-0207-99b5-0019-9c5300165a46'))
where operator_type = 'TableScan'
;

In [None]:
%%sql -r dataframe_5
select system$clustering_information('sf_cert_prep.public.t_orders_bad', '(o_orderdate)');

## Clustering depth

In [None]:
%%sql -r dataframe_1
