## Module 7. Explore JOIN Skewness [Bonus]

As usual, let's disable the query result cache so that each run is actually executed rather than served from cache.

In [None]:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

### 7.1 Background

Three main types of skewness can arise during query execution and slow down query performance:

1. Data distribution skewness

This can happen when the underlying data for a table is not evenly distributed across multiple files. Different workers then end up with different amounts of data after the TableScan operator, which causes uneven processing between workers while the query executes.

2. Join key skewness

This can happen when we join two tables whose data is unevenly distributed across the join key values. Since the HashJoin operator distributes data by the join key(s) during processing, different workers receive different amounts of data in such a case, which can result in processing skewness.

3. Probe side skewness

This happens when the table on the Probe side of the join requires only a small number of file scans, either because the table is small or because a large table has restrictive filter conditions. Since one file/partition can only be read by a single worker instance, the number of worker instances involved in query processing is limited to the number of instances used in the table scan. This skewness can carry over to the downstream operators if this path does not become the Build side of a later join (i.e., the same workers are reused without redistributing the data), and can therefore result in processing skewness throughout the whole query.

In this module, we will focus on the third scenario mentioned above.


### 7.2 Prepare the data

Please run the following scripts to populate the data.

In [None]:
use warehouse WH_SUMMIT25_PERF_OPS;

create or replace table t1 (a int, b varchar, c timestamp)
as 
select
    uniform(1, 1000, random()),
    randstr(uniform(1, 10, random()), random()), 
    current_timestamp
from table(generator(rowcount => 20000));

create or replace table t2 (a int, b varchar, c timestamp)
as 
select
    uniform(1, 1000, random()) as a,
    randstr(uniform(1, 50, random()), random()), 
    current_timestamp
from table(generator(rowcount => 50000))
order by a
;

create or replace table t3 (a int, b varchar, c timestamp)
as 
select
    uniform(1, 1000, random()),
    randstr(uniform(1, 1000, random()), random()), 
    current_timestamp
from table(generator(rowcount => 10000));
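
Since probe side skewness depends on how few files/partitions a table scan touches, it can help to check how many micro-partitions each test table ended up with. The statements below are a minimal sketch, assuming the current database and schema are the ones where `t1`, `t2`, and `t3` were just created. They use `SYSTEM$CLUSTERING_INFORMATION`, which returns a JSON string; the choice of column `(a)` is only there because the function needs a column list for tables without a clustering key.

```sql
-- Sketch: report the micro-partition count for each test table.
-- Assumption: the tables live in the current database/schema.
select 't1' as table_name,
       parse_json(system$clustering_information('t1', '(a)')):total_partition_count::number as total_partitions;

select 't2' as table_name,
       parse_json(system$clustering_information('t2', '(a)')):total_partition_count::number as total_partitions;

select 't3' as table_name,
       parse_json(system$clustering_information('t3', '(a)')):total_partition_count::number as total_partitions;
```

A table that reports only one or a handful of partitions can be scanned by at most that many worker instances, which is the precondition for the probe side skewness described in 7.1.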

### 7.3 Execute the query and analyze the problem

In [None]:
use warehouse WH_SUMMIT25_PERF_SKEWNESS;

select * from t1 -- SKEWNESS QUERY BEFORE
join t2 on (t1.a = t2.a)
join t3 on (t1.a = t3.a)
where t1.a between 1 and 800
;

Please follow the quickstart guide to check the query profile details.
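
As a complement to the query profile UI, the operator-level statistics can also be pulled with SQL. The query below is a sketch, assuming it runs in the same session immediately after the "SKEWNESS QUERY BEFORE" statement, so that `LAST_QUERY_ID()` points at it; if other statements ran in between, pass the query ID explicitly instead. It does not show per-worker load directly, but the number of partitions scanned by each TableScan hints at how many worker instances could take part.

```sql
-- Sketch: per-operator statistics for the most recent query in this session.
-- Assumption: the previous statement was the "SKEWNESS QUERY BEFORE" select.
select
    operator_id,
    operator_type,
    operator_statistics:input_rows::number                 as input_rows,
    operator_statistics:output_rows::number                as output_rows,
    operator_statistics:pruning:partitions_scanned::number as partitions_scanned,
    operator_statistics:pruning:partitions_total::number   as partitions_total,
    execution_time_breakdown:overall_percentage::float     as overall_pct
from table(get_query_operator_stats(last_query_id()))
order by operator_id;
```

The pruning columns are only populated for TableScan operators; for joins, compare `input_rows` and `output_rows` and look at which operators account for most of `overall_pct`.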

### 7.4 The solution

Please refer to the quickstart guide for an explanation of the solution to this issue.

Then you can just run the solution query below.

In [None]:
use warehouse WH_SUMMIT25_PERF_SKEWNESS;

select * from ( -- SKEWNESS QUERY AFTER
    select * from t1 order by random()
) t1 
join t2 on (t1.a = t2.a)
join t3 on (t1.a = t3.a)
where t1.a between 1 and 800
;

### 7.5 Let's check the result

In [None]:
USE WAREHOUSE WH_SUMMIT25_PERF_OPS;

select 
    query_id,
    query_text,
    'SKEWNESS QUERY BEFORE' as query_tag,
    start_time,
    round(execution_time/1000, 2) as execution_time_sec,
FROM TABLE(
    INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
        WAREHOUSE_NAME =>'WH_SUMMIT25_PERF_SKEWNESS'
    )
)  
where
    execution_time > 0
    and query_text like '%SKEWNESS QUERY BEFORE%'
    and error_code is null 
    and query_type = 'SELECT'
qualify row_number() over (partition by query_tag order by start_time desc) = 1

union all

select 
    query_id,
    query_text,
    'SKEWNESS QUERY AFTER' as query_tag,
    start_time,
    round(execution_time/1000, 2) as execution_time_sec,
FROM TABLE(
    INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
        WAREHOUSE_NAME =>'WH_SUMMIT25_PERF_SKEWNESS'
    )
)  
where
    execution_time > 0
    and query_text like '%SKEWNESS QUERY AFTER%'
    and error_code is null 
    and query_type = 'SELECT'
qualify row_number() over (partition by query_tag order by start_time desc) = 1

order by execution_time_sec;

We can see that query performance has improved further: the rewritten query saves a few more seconds on this sample data.