### 2.2. Collect Baseline Metrics

Part 2 of Module 2 collects statistics for the base workload that was executed in the previous steps.

Let's check the performance of the last workload run using the `INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE` table function. We extract the tag `BASE WORKLOAD QUERY - NN` (where `NN` is the two-digit query number) from the query text to identify each query, and we use the `ROW_NUMBER()` function to keep only the latest run of each one. No matter how many times you ran each query, the script below always returns its most recent run. Note that `execution_time` is reported in milliseconds.

In [None]:
use warehouse WH_SUMMIT25_PERF_OPS; -- for operations & analysis 
select
    query_id,
    REGEXP_SUBSTR(query_text, 'BASE WORKLOAD QUERY - [0-9]{2}') as my_tag,
    start_time,
    execution_time
from table(
    INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
        WAREHOUSE_NAME =>'WH_SUMMIT25_PERF_BASE'
    )
)
where 
    execution_time > 0
    and query_text like '%BASE WORKLOAD QUERY%'
qualify row_number() over (partition by my_tag order by start_time desc) = 1
order by my_tag;
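
To see how the tag extraction and the latest-run filter work together, here is a minimal sketch on a few made-up rows. The `sample_history` CTE and its query IDs, query texts, and timestamps are hypothetical stand-ins for the function output; only `REGEXP_SUBSTR` and `QUALIFY ROW_NUMBER()` mirror the cell above.

```sql
-- Hypothetical sample rows standing in for QUERY_HISTORY_BY_WAREHOUSE output.
with sample_history as (
    select
        column1 as query_id,
        column2 as query_text,
        column3::timestamp_ntz as start_time
    from values
        ('q-001', 'BASE WORKLOAD QUERY - 01 (first run)',  '2025-06-01 10:00:00'),
        ('q-002', 'BASE WORKLOAD QUERY - 01 (second run)', '2025-06-01 11:00:00'),
        ('q-003', 'BASE WORKLOAD QUERY - 02 (only run)',   '2025-06-01 10:05:00')
)
select
    query_id,
    -- Pull out the tag, including the two-digit query number.
    regexp_substr(query_text, 'BASE WORKLOAD QUERY - [0-9]{2}') as my_tag,
    start_time
from sample_history
-- Keep only the most recent row for each tag.
qualify row_number() over (partition by my_tag order by start_time desc) = 1
order by my_tag;
```

With these rows, tag 01 has two runs and only the later one (`q-002`) is kept, alongside `q-003` for tag 02.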

Let's also collect the operator-level statistics of each query into a table, so that we can analyze how well each query performed in terms of micro-partitions scanned.

To do this, we create a stored procedure that collects query statistics with the `INFORMATION_SCHEMA.GET_QUERY_OPERATOR_STATS` function and stores the ones we need in a table for later analysis.
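
Before wrapping this in a procedure, it may help to look at the raw operator stats for a single query. The sketch below assumes the base workload ran recently enough for its stats to still be available. The session variable name `sample_query_id` is ours, and the lookup simply picks the most recent tagged query on the base warehouse.

```sql
-- Assumption: pick the most recent base workload query as the sample.
set sample_query_id = (
    select query_id
    from table(
        information_schema.query_history_by_warehouse(
            warehouse_name => 'WH_SUMMIT25_PERF_BASE'
        )
    )
    where execution_time > 0
      and query_text like '%BASE WORKLOAD QUERY%'
    order by start_time desc
    limit 1
);

-- One row per operator in the query plan; pruning stats appear on table scans only.
select
    operator_id,
    operator_type,
    operator_attributes:table_name::string                  as table_name,
    operator_statistics:pruning:partitions_scanned::number  as mp_scanned,
    operator_statistics:pruning:partitions_total::number    as mp_total
from table(get_query_operator_stats($sample_query_id))
order by operator_id;
```

Operators other than table scans will typically show `NULL` in the pruning columns, which is why the analysis later filters on `mp_total is not null`.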


In [None]:
CREATE OR REPLACE PROCEDURE insert_multiple_query_stats (
    WH_NAME VARCHAR, 
    TARGET_TABLE_NAME VARCHAR,
    NOTEBOOK_NAME VARCHAR
)
RETURNS TEXT
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS 
$$    
    // Get the ID, tag, start time and warehouse of every workload SELECT run from the given notebook
    var query_ids_sql = `
        SELECT 
            DISTINCT 
            query_id, 
            REGEXP_SUBSTR(query_text, '[A-Z]{1,} WORKLOAD QUERY - [0-9]{2}') as my_tag, 
            start_time, 
            warehouse_name   
        FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(WAREHOUSE_NAME =>'${WH_NAME}', RESULT_LIMIT =>10000))
        WHERE 
            parse_json(query_tag):"StreamlitName" = 'SQL_PERF_OPTIMIZATION.PUBLIC.${NOTEBOOK_NAME}' 
            and query_text like '%WORKLOAD QUERY%'
            and query_type = 'SELECT'
    `;

    var query_ids_stmt = snowflake.createStatement({        
        sqlText: query_ids_sql
    });

    var query_ids_result = query_ids_stmt.execute();    
    var processed = 0;
    var skipped = 0;
    var inserted = 0;

    snowflake.createStatement({
        sqlText: `create table if not exists SQL_PERF_OPTIMIZATION.PUBLIC.${TARGET_TABLE_NAME} (
            query_id string, 
            start_time timestamp_ntz(0),
            operator_type string, 
            execution_time_breakdown string, 
            operator_attributes variant, 
            operator_id string,
            operator_statistics variant, 
            parent_operators array, 
            step_id string, 
            query_tag string
        );`
    }).execute();

    while (query_ids_result.next()) {
        var current_query_id = query_ids_result.getColumnValueAsString(1);    
        var current_query_tag = query_ids_result.getColumnValueAsString(2);
        var current_start_time = query_ids_result.getColumnValueAsString(3);
        var exists_check_query = 
            "SELECT COUNT(1) FROM SQL_PERF_OPTIMIZATION.PUBLIC." + TARGET_TABLE_NAME + " WHERE query_id = ?";
        var exists_stmt = snowflake.createStatement({            
            sqlText: exists_check_query,            
            binds: [ current_query_id ]
        });
        var exists_result = exists_stmt.execute();        
        exists_result.next();                
        if (exists_result.getColumnValue(1) === 0) {
            // Not collected yet: copy this query's operator stats into the target table.
            var insert_query = `
                INSERT INTO SQL_PERF_OPTIMIZATION.PUBLIC.${TARGET_TABLE_NAME}
                (
                    operator_type, execution_time_breakdown, operator_attributes, operator_id,
                    operator_statistics, parent_operators, query_id, start_time, step_id, query_tag
                )
                SELECT 
                    operator_type, execution_time_breakdown, 
                    operator_attributes, operator_id, 
                    operator_statistics, parent_operators,
                    query_id, '${current_start_time}', 
                    step_id, '${current_query_tag}'
                FROM TABLE(INFORMATION_SCHEMA.GET_QUERY_OPERATOR_STATS(?))
            `;

            var insert_stmt = snowflake.createStatement({                
                sqlText: insert_query,                
                binds: [current_query_id] });                        
            insert_stmt.execute();            
            inserted++;        
        } 
        else {            
            skipped++;
        }        
        processed++;    
    }       

    return "Processing complete. Total processed: " + processed + ", Inserted: " + inserted + ", Skipped: " + skipped;
$$
;

Once the stored procedure is created, let's collect the stats for our last run.

In [None]:
CALL insert_multiple_query_stats(
    'WH_SUMMIT25_PERF_BASE', 
    'BASE_QUERY_STATS', 
    'MODULE2_PART1_BASE_WORKLOAD'
);
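
As a quick sanity check, the following sketch summarizes what the procedure stored. It uses only the table and columns created by the procedure; the output column names `runs_captured`, `operator_rows`, and `latest_run` are ours.

```sql
-- One row per workload tag: how many runs were captured and how many operator rows they produced.
select
    query_tag,
    count(distinct query_id) as runs_captured,
    count(*)                 as operator_rows,
    max(start_time)          as latest_run
from SQL_PERF_OPTIMIZATION.PUBLIC.BASE_QUERY_STATS
group by query_tag
order by query_tag;
```

Because the procedure skips any `query_id` that is already in the table, calling it again after a new workload run should only add the new runs and leave earlier rows untouched.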

The query below goes through the data we collected in the `BASE_QUERY_STATS` table and finds the tables for which a query scanned more than 80% of the micro-partitions.

In [None]:
with latest_query_each_tag as (
    select query_id
    from base_query_stats
    qualify row_number() over (partition by query_tag order by start_time desc) = 1
)
select 
    distinct
    s.query_id,
    query_tag,
    operator_attributes:table_name::string as table_name,
    operator_statistics:pruning:partitions_scanned::number as mp_scanned,
    operator_statistics:pruning:partitions_total::number as mp_total,
    round(mp_scanned/mp_total, 4) * 100 as scan_rate
from base_query_stats s
join latest_query_each_tag q on (
    s.query_id = q.query_id
)
where 
    mp_total is not null
    and scan_rate > 80
order by query_tag
;

From the result above, we can see that every scan against the tables involved in the workload was a full table scan. This is not ideal: it means our tables are not clustered in a way that matches the filters used in those queries, so very few micro-partitions can be pruned.
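
A per-table rollup can make the same pattern easier to read when the workload has many queries. The sketch below reuses `base_query_stats` and the JSON paths from the query above. The CTE name `table_scans` and the output column names are ours, and the explicit `::number` casts and `DIV0` are defensive choices rather than something the workshop requires.

```sql
with latest_query_each_tag as (
    -- Latest run of each tagged query, as in the cell above.
    select query_id
    from base_query_stats
    qualify row_number() over (partition by query_tag order by start_time desc) = 1
),
table_scans as (
    -- Keep only operators that report pruning stats (table scans).
    select
        s.operator_attributes:table_name::string                 as table_name,
        s.operator_statistics:pruning:partitions_scanned::number as mp_scanned,
        s.operator_statistics:pruning:partitions_total::number   as mp_total
    from base_query_stats s
    join latest_query_each_tag q on s.query_id = q.query_id
    where s.operator_statistics:pruning:partitions_total is not null
)
select
    table_name,
    count(*)        as scan_operators,
    sum(mp_scanned) as total_mp_scanned,
    sum(mp_total)   as total_mp_available,
    -- DIV0 returns 0 instead of failing when a table reports zero partitions.
    round(div0(sum(mp_scanned), sum(mp_total)) * 100, 2) as scan_rate_pct
from table_scans
group by table_name
order by scan_rate_pct desc;
```

Tables near the top of this list are the ones where pruning helps least, so they are natural candidates to revisit once the workload's filter columns are known.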

Please go back to the quickstart guide and follow section 2.3 to add some of the monitoring queries to a Snowflake dashboard for query performance analysis.