## Module 3. Auto Clustering for Performance Tuning

Module 3 focuses on exploring and discovering auto clustering keys that can benefit the reporting workload.

### 3.1 Analyze table size

First, let's analyze the table sizes. We can run `SHOW TABLES` or query the `INFORMATION_SCHEMA.TABLES` view to get that information.

In [None]:
-- for operations & analysis
use warehouse WH_SUMMIT25_PERF_OPS; 

show tables in schema sql_perf_optimization.public;

In [None]:
select 
    table_name,
    row_count, bytes
from information_schema.tables
where 
    table_catalog = 'SQL_PERF_OPTIMIZATION'
    and table_schema = 'PUBLIC'
    and table_name in (
        'QUESTION', 'USER_PROFILE', 
        'CATEGORY', 'TRAFFIC'
    )
;

Of the four main tables (`CATEGORY`, `QUESTION`, `TRAFFIC`, and `USER_PROFILE`), only `USER_PROFILE` and `TRAFFIC` are large enough to be worth auto clustering.
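As a rough sketch of this comparison, the same view can be sorted by size with the byte count converted to gigabytes. The `size_gb` alias and the 1024^3 divisor are illustrative choices, not part of the lab.

```sql
-- Sketch: rank the four main tables by size.
-- Assumes the current database is SQL_PERF_OPTIMIZATION, as in the query above.
select
    table_name,
    row_count,
    round(bytes / power(1024, 3), 2) as size_gb  -- illustrative alias
from information_schema.tables
where
    table_catalog = 'SQL_PERF_OPTIMIZATION'
    and table_schema = 'PUBLIC'
    and table_name in ('QUESTION', 'USER_PROFILE', 'CATEGORY', 'TRAFFIC')
order by bytes desc;
```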

### 3.2 Analyze Query Filters and Join Filters

In the real world, you may not have been involved in the table design, and the workload queries have usually been modified over time. You want to find the most common local filters and join filters without reading every query in the workload, which could be a large number of queries. This step mimics that common scenario.

So let's analyze the query filters and join filters used in the queries we ran in the earlier step, to discover table columns that may be good auto clustering key columns. Use the query below to identify the filter and join conditions frequently used in reporting queries on the `TRAFFIC` table.

In practice, an analysis query like the one below may take several iterations as you narrow down its predicates (e.g., `join_condition ilike '%t.%'`) and understand your queries better. That work has been done for you here: the conditions below yield a representative set of filters, and from the reporting queries we know that the aliases `T` and `Traffic` refer to the `TRAFFIC` table.

In [None]:
-- analyze filters
WITH filter_conditions AS (
  SELECT
    query_id,
    query_tag,
    CAST(
      GET_PATH (operator_attributes, 'filter_condition') AS TEXT
    ) AS filter_condition
  FROM
    base_query_stats
  WHERE
    operator_type ILIKE '%Filter%'
    and ( filter_condition not ilike '%USER_PROFILE%'
     and filter_condition not ilike '%QUESTION%')
),
join_conditions AS (
    SELECT
    query_id,
    operator_attributes,
    CAST(
      GET_PATH (operator_attributes, 'equality_join_condition') AS TEXT
    ) AS join_condition
  FROM
    base_query_stats
    where join_condition is not null
    and ( join_condition ilike '%t.%' or 
    join_condition ilike '%traffic.%' )
), 
filters as (
    SELECT
      'Filter' AS condition_type,
      query_id,
      filter_condition AS condition
    FROM
      filter_conditions
    WHERE
      NOT filter_condition IS NULL
    UNION ALL
    SELECT
      'Join' AS condition_type,
      query_id,
      join_condition AS condition
    FROM
      join_conditions
    WHERE
      NOT join_condition IS NULL
)
select distinct condition
from filters
;

The result for the `TRAFFIC` table looks similar to the following. The local filters and join filters in use are:
- ```(T.TIMESTAMP >= '2025-01-01 00:00:00.000000000Z') AND (T.TIMESTAMP <= '2025-02-01 00:00:00.000000000Z')```
- ```(TO_TIMESTAMP_LTZ(T1.TIMESTAMP)) >= '2024-04-11 15:01:15.523000000Z'```
- ```(COUNT(DISTINCT T1.UUID)) > 10```
- ```(DATE_DIFFTIMESTAMPINDAYS(MIN(T.TIMESTAMP), MAX(T.TIMESTAMP))) > 0```
- ```(C.ID = T.CATEGORY_ID)```
- ```(T.UUID = USER_PROFILE.UUID)```
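The distinct list shows which conditions exist, but not how often each one is used. A minimal sketch for ranking filter conditions by the number of queries that use them could look like the following. It assumes `base_query_stats` exposes `query_id`, `operator_type`, and `operator_attributes` as in the analysis query; the `query_count` alias is illustrative.

```sql
-- Sketch: rank filter conditions by how many distinct queries use them.
select
    cast(
      get_path(operator_attributes, 'filter_condition') as text
    ) as filter_condition,
    count(distinct query_id) as query_count  -- illustrative alias
from base_query_stats
where
    operator_type ilike '%Filter%'
    and get_path(operator_attributes, 'filter_condition') is not null
group by 1
order by query_count desc;
```

The same pattern applies to join conditions by reading `equality_join_condition` instead of `filter_condition`.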

Focus on the `TRAFFIC` columns that appear in these conditions: `TIMESTAMP`, `CATEGORY_ID`, and `UUID`. Find the distinct count for each of these three columns. Because queries typically filter by date rather than by exact timestamp, count the distinct values of `TO_DATE(TIMESTAMP)` instead of `TIMESTAMP` itself.


In [None]:
select
    approx_count_distinct(to_date(timestamp)),
    approx_count_distinct(category_id),
    approx_count_distinct(uuid)
from sql_perf_optimization.public.traffic;

To determine which columns are good candidates for clustering keys, we need to consider the following:

- They should be actively used in selective filters.
- They should have a large enough number of distinct values to enable effective pruning.
- They should have a small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.

From the result above, we can see that the distinct count of the `UUID` column is far too large. The distinct count of the `CATEGORY_ID` column is far too small: many partitions would share the same key value, which would not improve pruning. `TO_DATE(TIMESTAMP)` is a better choice. However, given that we have filters on both `CATEGORY_ID` and `TIMESTAMP`, a combined clustering key on those two columns is a much better choice.
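To get a feel for the combined key, one option is to count its distinct value pairs and the average number of rows per pair. This is only a sketch: the aliases (`event_date`, `row_cnt`, and the output column names) are illustrative, and the exact count scans the whole table, so it may be slower than the approximate query above.

```sql
-- Sketch: cardinality of the candidate key (TO_DATE(TIMESTAMP), CATEGORY_ID).
select
    count(*)     as distinct_key_values,
    sum(row_cnt) as total_rows,
    avg(row_cnt) as avg_rows_per_key_value
from (
    select
        to_date(timestamp) as event_date,
        category_id,
        count(*) as row_cnt
    from sql_perf_optimization.public.traffic
    group by 1, 2
);
```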

### 3.3 Add Auto Clustering Keys

- Based on the analysis, `TIMESTAMP` and `CATEGORY_ID` are frequently filtered and suitable as clustering key columns, so the auto clustering key design for the `TRAFFIC` table is `(TO_DATE(TIMESTAMP), CATEGORY_ID)`.
- Remember that the order is important. In our use case, putting `TO_DATE(TIMESTAMP)` first will give us better performance because most of the queries filter on date.

Let’s first check the clustering information for the `TRAFFIC` table (you might want to copy the result set into a text editor to see the full result).

In [None]:
SELECT SYSTEM$CLUSTERING_INFORMATION(
    'SQL_PERF_OPTIMIZATION.PUBLIC.TRAFFIC' , 
    '(TO_DATE(TIMESTAMP),CATEGORY_ID)'
);

The result should look something like the following:

```json
{
  "cluster_by_keys" : "LINEAR(TO_DATE(TIMESTAMP),CATEGORY_ID)",
  "total_partition_count" : 1014,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 1013.0,
  "average_depth" : 1014.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0,
    "01024" : 1014
  },
  "clustering_errors" : [ ]
}
```

A couple of things here to note:

- The cardinality (the number of distinct values) of those two columns is very high. This is due to `TIMESTAMP`, which records values down to the millisecond. This is fine for distributing the data across partitions, but it would be very expensive to maintain: a single new value that falls in the middle of a partition would cause all data with later `TIMESTAMP` values to be shifted across all the remaining partitions.
- There are 1014 partitions in the table, and the values of `average_overlaps` and `average_depth` are very close to this number. This means that nearly all partitions have overlapping values, so this table is not well clustered at all.
- For more information on how to interpret the clustering information, please refer to Clustering Information Maintained for Micro-partitions (https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions) and SYSTEM$CLUSTERING_INFORMATION (https://docs.snowflake.com/en/sql-reference/functions/system_clustering_information)
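Because `SYSTEM$CLUSTERING_INFORMATION` returns a JSON string, the headline numbers can also be pulled into columns instead of being read from the raw text. The following is a minimal sketch; the CTE name `info` and the column alias `ci` are illustrative.

```sql
-- Sketch: extract the key metrics from the clustering information JSON.
with info as (
    select parse_json(system$clustering_information(
        'SQL_PERF_OPTIMIZATION.PUBLIC.TRAFFIC',
        '(TO_DATE(TIMESTAMP),CATEGORY_ID)'
    )) as ci
)
select
    ci:total_partition_count::number as total_partition_count,
    ci:average_overlaps::float       as average_overlaps,
    ci:average_depth::float          as average_depth
from info;
```

If only the average depth is needed, `SYSTEM$CLUSTERING_DEPTH` accepts the same table name and key expression and returns that single value.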

We can use the system function `SYSTEM$ESTIMATE_AUTOMATIC_CLUSTERING_COSTS` to estimate the cost of auto clustering a table. It might take some time to run (more than 10 minutes). To save time, we will skip this step, but you are welcome to run it after the lab.

The command is as follows:

```sql
SELECT SYSTEM$ESTIMATE_AUTOMATIC_CLUSTERING_COSTS(
    'TRAFFIC', 
    '(TO_DATE(TIMESTAMP),CATEGORY_ID)'
);
```

Now we are ready to add clustering keys to the `TRAFFIC` table.

DO NOT RUN the following command in the lab. We have already applied it to the `TRAFFIC_CLUSTERED` table, which will be used in the following labs. The auto clustering service takes time to kick in, and it would not finish within the allotted lab time.

```sql
ALTER TABLE TRAFFIC CLUSTER BY (TO_DATE(TIMESTAMP), CATEGORY_ID);
```
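After a clustering key is defined, reclustering runs in the background. As a hedged sketch, the read-only statements below show one way to check whether a table has a clustering key and what the service has done recently. They target `TRAFFIC_CLUSTERED` so nothing in the lab is changed. The seven-day window is an arbitrary choice, the history function may require elevated privileges (for example, a role with the MONITOR USAGE privilege), and it may return no rows if no reclustering ran in that window.

```sql
-- Sketch: the cluster_by and automatic_clustering columns in the output
-- show the key definition and whether auto clustering is enabled.
show tables like 'TRAFFIC_CLUSTERED' in schema sql_perf_optimization.public;

-- Sketch: recent auto clustering activity and the credits it consumed.
select
    start_time,
    end_time,
    credits_used,
    num_bytes_reclustered,
    num_rows_reclustered
from table(information_schema.automatic_clustering_history(
    date_range_start => dateadd('day', -7, current_timestamp()),
    table_name => 'SQL_PERF_OPTIMIZATION.PUBLIC.TRAFFIC_CLUSTERED'
))
order by start_time;
```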

Confirm clustering information for the table `TRAFFIC_CLUSTERED`.

Since the clustering columns have already been defined on this table, we no longer need to pass the column list to the `SYSTEM$CLUSTERING_INFORMATION` function.

In [None]:
SELECT SYSTEM$CLUSTERING_INFORMATION(
    'TRAFFIC_CLUSTERED'
);

Since the table was newly built, the `TRAFFIC_CLUSTERED` table should be well clustered from the start, with clustering information similar to the following:

```json
{
  "cluster_by_keys" : "LINEAR(TO_DATE(TIMESTAMP), CATEGORY_ID)",
  "total_partition_count" : 871,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 1.8422,
  "average_depth" : 2.248,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 201,
    "00003" : 359,
    "00004" : 217,
    "00005" : 82,
    "00006" : 12,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  },
  "clustering_errors" : [ ]
}
```

We can see that `average_overlaps` dropped from 1013 to 1.8 and `average_depth` dropped from 1014 to 2.2.

Different accounts might get different results, but they should not differ by much.
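Lower depth and overlap should show up as better partition pruning. As a quick sanity check, a sketch like the following runs one date-filtered query and then reads the pruning statistics of its table scan. The date range is taken from the filter conditions found in section 3.2, and the column aliases are illustrative. Both statements need to run back to back in the same session so that `LAST_QUERY_ID()` points at the first one. If the first query is served from the result cache, no `TableScan` operator will be reported.

```sql
-- Sketch: a date-filtered query against the clustered table.
select count(*)
from traffic_clustered t
where t.timestamp >= '2025-01-01'
  and t.timestamp <= '2025-02-01';

-- Sketch: compare partitions scanned with total partitions for that query.
select
    operator_type,
    operator_statistics:pruning:partitions_scanned::number as partitions_scanned,
    operator_statistics:pruning:partitions_total::number   as partitions_total
from table(get_query_operator_stats(last_query_id()))
where operator_type = 'TableScan';
```

Running the same pair against `TRAFFIC` gives a baseline for comparison.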

We are now ready to test the workload again and compare the results.