## Module 2. Baseline Monitoring for Reporting Workload 

### 2.1. Run Base Workload of Reporting Queries

Let's run the base workload of 9 reporting queries. First, let's disable the result cache so that each run measures actual query execution rather than a cached result (this matters if you run this notebook more than once).

Make sure this notebook runs on the warehouse `WH_SUMMIT25_PERF_BASE`, so that later reporting can separate the different workloads by warehouse. You can also double-check the warehouse in the Notebook settings menu.

In [None]:
alter session set use_cached_result = false;
use schema sql_perf_optimization.public;
use warehouse WH_SUMMIT25_PERF_BASE;
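
To confirm that the settings above took effect, a quick check like the following can help. It is only a sketch: it relies on the standard `SHOW PARAMETERS` command and context functions, and it is not part of the base workload.

```sql
-- The value column should show false after the ALTER SESSION above
show parameters like 'USE_CACHED_RESULT' in session;

-- Should return SQL_PERF_OPTIMIZATION, PUBLIC, and WH_SUMMIT25_PERF_BASE
select current_database(), current_schema(), current_warehouse();
```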

Query 1: Reports the top 100 URLs by visit count in each category for one demographic segment (male, aged 20 to 30, income 50K to 100K)

In [None]:
with age_20_to_30 as ( -- BASE WORKLOAD QUERY - 01
    select distinct uuid
    from user_profile
    where question_id = 3 -- DOB question
        and value::date between dateadd(year, -30, current_date) 
            and dateadd(year, -20, current_date)
),
gender_male as (
    select distinct uuid
    from user_profile
    where question_id = 4 -- Gender question
        and value::string = 'M'
),
income_50K_to_100K as (
    select distinct uuid
    from user_profile
    where question_id = 10 -- Income question
        and value::int between 50000 and 100000
)
select
    c.name,
    url,
    count(1) as visits
from traffic t
join category c on (
    c.id = t.category_id
)
join age_20_to_30 a on (
    a.uuid = t.uuid
)
join gender_male g on (
    g.uuid = t.uuid
)
join income_50K_to_100K i on (
    i.uuid = t.uuid
)
where
    t.timestamp between '2025-01-01' and '2025-02-01'
group by all
qualify row_number() over (
    partition by c.name order by visits desc
) <= 100
order by c.name, visits desc;
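
Query 1 keeps the top rows per category with `QUALIFY`, which filters on a window function after grouping. The sketch below shows the same pattern on a few made-up rows; the inline `VALUES` data and the cutoff of 2 are assumptions for illustration only.

```sql
-- Keep the 2 most-visited URLs per category (illustrative data, not lab data)
select category, url, visits
from (values
        ('news', 'a.com', 10),
        ('news', 'b.com', 7),
        ('news', 'c.com', 3),
        ('shop', 'x.com', 5)
     ) as v (category, url, visits)
qualify row_number() over (
    partition by category order by visits desc
) <= 2
order by category, visits desc;
```

Here `c.com` is dropped because it is third within `news`, while `shop` keeps its only row.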

Query 2: Reports website ranking information within each category:
 1. Counts both unique visitors (using uuid) and total visits per URL in each category
 2. Uses window function RANK() to rank URLs within each category based on total visits
 3. Limits results to the top 10 URLs per category
 4. Shows category name, URL, visitor counts, and ranking
 5. Orders results by category name and rank for easy reading

In [None]:
WITH url_stats AS ( -- BASE WORKLOAD QUERY - 02
  SELECT
    c.name AS category_name,
    t.url,
    COUNT(DISTINCT t.uuid) AS unique_visitors,
    COUNT(*) AS total_visits,
    RANK() OVER (
      PARTITION BY c.name
      ORDER BY
        COUNT(*) DESC
    ) AS rank_in_category
  FROM
    traffic AS t
    JOIN category AS c ON t.category_id = c.id
  WHERE
    t.timestamp between '2025-01-01' and '2025-02-01'
  GROUP BY
    c.name,
    t.url
)
SELECT
  category_name,
  url,
  unique_visitors,
  total_visits,
  rank_in_category
FROM
  url_stats
WHERE
  rank_in_category <= 10
ORDER BY
  category_name,
  rank_in_category;
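
One detail worth keeping in mind about `RANK()`: tied rows share the same rank, so a filter such as `rank_in_category <= 10` can return more than 10 rows per category. A minimal sketch with made-up rows (the `VALUES` data is an assumption for illustration):

```sql
-- a.com and b.com tie on total_visits, so both receive rank 1
select
    category_name,
    url,
    total_visits,
    rank() over (
        partition by category_name order by total_visits desc
    ) as rank_in_category
from (values
        ('news', 'a.com', 10),
        ('news', 'b.com', 10),
        ('news', 'c.com', 4)
     ) as v (category_name, url, total_visits)
qualify rank_in_category <= 1;
```

This "top 1" filter returns two rows. If exactly N rows per category are required, `ROW_NUMBER()` with a tie-breaking sort column is the usual alternative.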

Query 3: Creates a denormalized view of user information for quick lookups by `uuid`
 - Includes demographic data (gender, age)
 - Provides visit statistics (number of sites visited, first/last visit)
 - Can be easily filtered or joined with other tables
 - Optimized for quick lookups by `uuid`

In [None]:
WITH user_profile_age as ( -- BASE WORKLOAD QUERY - 03
    select 
        uuid,
        timestampdiff(year, value::date, current_date()) as value
    from user_profile
    where question_id = 3
),
user_profile_gender as (
    select 
        uuid,
        value::string as value
    from user_profile
    where question_id = 4
),
user_lookup AS (-- Q03
  SELECT
    t.uuid,
    gender.value AS gender,
    age.value AS age,
    COUNT(DISTINCT t.url) AS visited_sites,
    MIN(t.timestamp) AS first_visit,
    MAX(t.timestamp) AS last_visit
  FROM
    traffic AS t
    LEFT JOIN user_profile_age AS age ON t.uuid = age.uuid
    LEFT JOIN user_profile_gender AS gender ON t.uuid = gender.uuid
  WHERE
    t.timestamp between '2025-01-01' and '2025-02-01'
  GROUP BY ALL
)
SELECT
  *
FROM
  user_lookup
WHERE
  --NOT uuid IS NULL
  uuid in ('64d91ddc-cad4-4bc3-993e-53f050984738', '2d738ba3-9c32-4fe5-92ba-e8fdb3f2f6e6',
  'a628176d-dec0-453d-8f7a-33d4682bd26f','a9adda58-5762-4e1b-85b8-02b4eb5e9970',
  '93d96769-4cc8-4521-9cfb-8b81472f70e5','cb4645ef-454c-4af3-9fcd-24c48d4ad0be',
  'aabf7d37-51ee-4ce6-abc7-c07c2533a0b7','c75a4784-eddb-4e9c-b3d5-89115855bce9',
  '9484fac2-fdf6-4fa7-8084-149ec921f1af','34c8e1e4-d05e-4d71-af07-e53ec3a0d87c',
  '6a524226-0c88-4874-bdcf-a489d3b7707b','67617602-28a7-4cc4-802c-a0a871d3a967',
  '1da9d743-c674-4dd8-8c41-66b45d877afb','6c0514c7-a44e-4e14-96cd-a5f93bf9f00c'
  )
ORDER BY
  last_visit DESC;

Query 4: Analyzes week-over-week growth rates for each category
  - Analyzes 6 months of traffic data
  - Calculates daily unique visitors, logged-in users, and total visits per category
  - Computes week-over-week growth rates for each category
  - Uses window functions for time-series analysis
  - The Query Acceleration Service (QAS) would likely improve this query's performance

In [None]:
WITH daily_stats AS ( -- BASE WORKLOAD QUERY - 04
  SELECT
    DATE_TRUNC ('DAY', t.timestamp) AS visit_date,
    c.name AS category_name,
    COUNT(DISTINCT t.uuid) AS unique_visitors,
    COUNT(DISTINCT t.user_id) AS logged_in_users,
    COUNT(*) AS total_visits
  FROM
    traffic AS t
    LEFT JOIN category AS c ON t.category_id = c.id
  WHERE
    t.timestamp >= DATEADD (MONTH, -6, CURRENT_TIMESTAMP())
  GROUP BY
    1,
    2
)
SELECT
  visit_date,
  category_name,
  unique_visitors,
  logged_in_users,
  total_visits,
  LAG (unique_visitors, 7) OVER (
    PARTITION BY category_name
    ORDER BY
      visit_date
  ) AS prev_week_visitors,
  ROUND(
    100.0 * (
      unique_visitors - LAG (unique_visitors, 7) OVER (
        PARTITION BY category_name
        ORDER BY
          visit_date
      )
    ) / LAG (unique_visitors, 7) OVER (
      PARTITION BY category_name
      ORDER BY
        visit_date
    ),
    2
  ) AS weekly_growth_rate
FROM
  daily_stats
ORDER BY
  visit_date DESC,
  unique_visitors DESC;
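
The week-over-week calculation relies on `LAG(..., 7)`, which looks back 7 rows rather than 7 calendar days, so it assumes one row per category per day. The sketch below reproduces the idea on generated data. The data is made up, and the `NULLIF` guard against division by zero is an addition for illustration, not part of the base workload query.

```sql
with days as (
    -- n = 0..9, one row per day
    select row_number() over (order by seq4()) - 1 as n
    from table(generator(rowcount => 10))
),
daily_stats as (
    -- Made-up visitor counts: 100, 110, ..., 190
    select
        dateadd(day, n, '2025-01-01'::date) as visit_date,
        100 + 10 * n as unique_visitors
    from days
),
with_prev as (
    select
        visit_date,
        unique_visitors,
        lag(unique_visitors, 7) over (order by visit_date) as prev_week_visitors
    from daily_stats
)
select
    visit_date,
    unique_visitors,
    prev_week_visitors,  -- NULL for the first 7 days
    round(
        100.0 * (unique_visitors - prev_week_visitors)
            / nullif(prev_week_visitors, 0),
        2
    ) as weekly_growth_rate
from with_prev
order by visit_date;
```

If a category has no traffic on some days, the 7-row offset no longer lines up with the same weekday, which is worth checking before trusting the growth rates.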

Query 5: Analyzes the typical length of stay (session duration) and related user behavior

This query:
- Analyzes user session durations and visit patterns
- Segments users based on their engagement duration
- Provides metrics on average visit days and pages viewed
- Helps identify patterns that might indicate different types of travelers

In [None]:
WITH gender_male as ( -- BASE WORKLOAD QUERY - 05
    select distinct uuid
    from user_profile
    where question_id = 4 -- Gender question
        and value::string = 'M'
),
user_sessions AS ( 
  SELECT
    t.uuid,
    DATE_TRUNC ('DAY', t.timestamp) AS visit_date,
    DATEDIFF (HOUR, MIN(t.timestamp), MAX(t.timestamp)) AS session_duration,
    COUNT(t.url) AS pages_visited
  FROM
    traffic AS t
  WHERE
    visit_date between '2023-01-01' and '2025-01-01'
  GROUP BY ALL
),
engagement_patterns AS (
  SELECT
    s.uuid,
    COUNT(DISTINCT visit_date) AS total_visit_days,
    AVG(session_duration) AS avg_session_hours,
    AVG(pages_visited) AS avg_pages_per_session,
    DATEDIFF (DAY, MIN(visit_date), MAX(visit_date)) AS date_range
  FROM
    user_sessions s
    join gender_male m on (m.uuid = s.uuid)
  GROUP BY
    s.uuid
)
SELECT
  CASE
    WHEN avg_session_hours < 1 THEN 'Brief Visitor'
    WHEN avg_session_hours BETWEEN 1
    AND 4 THEN 'Medium Stay'
    ELSE 'Long Stay'
  END AS engagement_type,
  COUNT(DISTINCT uuid) AS user_count,
  AVG(total_visit_days) AS avg_visit_days,
  AVG(avg_session_hours) AS avg_session_duration,
  AVG(avg_pages_per_session) AS avg_pages_viewed
FROM
  engagement_patterns
GROUP BY
  CASE
    WHEN avg_session_hours < 1 THEN 'Brief Visitor'
    WHEN avg_session_hours BETWEEN 1
    AND 4 THEN 'Medium Stay'
    ELSE 'Long Stay'
  END
ORDER BY
  avg_session_duration DESC
;

Query 6: Analyzes the geographic distribution of guests and their behavior patterns
- Identifies visitor origins from profile data
- Calculates engagement metrics by region
- Lists regions with at least 5 visitors, ordered by visitor count
- Includes metrics for average engagement and long-stay visitors
- Helps understand geographic distribution of guests and their behavior patterns

In [None]:
WITH geographic_segments AS ( -- BASE WORKLOAD QUERY - 06
  SELECT
    p.uuid,
    p.value AS origin_location,
    COUNT(DISTINCT t.url) AS pages_visited,
    DATEDIFF (HOUR, MIN(t.timestamp), MAX(t.timestamp)) AS engagement_hours
  FROM
    user_profile AS p
    JOIN question AS q ON (p.question_id = q.id and q.label = 'Country')
    JOIN traffic AS t ON p.uuid = t.uuid
  WHERE
    NOT p.value IS NULL
    and t.timestamp between '2025-03-01' and '2025-04-01'
  GROUP BY
    p.uuid,
    p.value
),
location_analysis AS (
  SELECT
    origin_location,
    COUNT(DISTINCT uuid) AS visitor_count,
    AVG(pages_visited) AS avg_pages_per_visitor,
    AVG(engagement_hours) AS avg_engagement_hours,
    COUNT(
      DISTINCT CASE
        WHEN engagement_hours >= 72 THEN uuid
      END
    ) AS long_stay_visitors
  FROM
    geographic_segments
  WHERE
    NOT origin_location IS NULL
  GROUP BY
    origin_location
)
SELECT
  origin_location,
  visitor_count,
  avg_pages_per_visitor,
  avg_engagement_hours,
  ROUND(
    (
      CAST(long_stay_visitors AS FLOAT) / visitor_count
    ) * 100,
    2
  ) AS long_stay_percentage
FROM
  location_analysis
WHERE
  visitor_count >= 5
  /* Filter for meaningful sample sizes */
ORDER BY
  visitor_count DESC
;

Query 7: Identifies booking patterns to optimize pricing and availability strategies

This query:
- Analyzes the time between users' first and last interactions
- Segments users into booking horizon categories
- Provides metrics on visit frequency and engagement
- Shows the distribution of users across different lead time segments
- Helps identify booking patterns to optimize pricing and availability strategies


In [None]:
WITH user_journey AS ( -- BASE WORKLOAD QUERY - 07
  SELECT
    t.uuid,
    MIN(t.timestamp) AS first_visit,
    MAX(t.timestamp) AS last_visit,
    COUNT(DISTINCT DATE_TRUNC ('DAY', t.timestamp)) AS visit_days,
    DATEDIFF (DAY, MIN(t.timestamp), MAX(t.timestamp)) AS planning_horizon,
    COUNT(DISTINCT t.url) AS pages_viewed
  FROM
    traffic AS t
  WHERE
    t.timestamp between '2025-03-01' and '2025-04-01'
  GROUP BY
    t.uuid
),
lead_time_analysis AS (
  SELECT
    CASE
      WHEN planning_horizon < 7 THEN 'Last Minute (<1 week)'
      WHEN planning_horizon BETWEEN 7
      AND 30 THEN 'Short Term (1-4 weeks)'
      WHEN planning_horizon BETWEEN 31
      AND 90 THEN 'Medium Term (1-3 months)'
      ELSE 'Long Term (>3 months)'
    END AS booking_horizon,
    COUNT(DISTINCT uuid) AS user_count,
    AVG(visit_days) AS avg_visit_days,
    AVG(pages_viewed) AS avg_pages_viewed,
    AVG(planning_horizon) AS avg_days_leadtime
  FROM
    user_journey
  WHERE
    planning_horizon > 0
  GROUP BY
    CASE
      WHEN planning_horizon < 7 THEN 'Last Minute (<1 week)'
      WHEN planning_horizon BETWEEN 7
      AND 30 THEN 'Short Term (1-4 weeks)'
      WHEN planning_horizon BETWEEN 31
      AND 90 THEN 'Medium Term (1-3 months)'
      ELSE 'Long Term (>3 months)'
    END
)
SELECT
  booking_horizon,
  user_count,
  ROUND(avg_visit_days, 1) AS avg_visit_days,
  ROUND(avg_pages_viewed, 1) AS avg_pages_viewed,
  ROUND(avg_days_leadtime, 1) AS avg_leadtime_days,
  ROUND(
    (user_count * 100.0 / SUM(user_count) OVER ()),
    2
  ) AS percentage_of_users
FROM
  lead_time_analysis
ORDER BY
  avg_days_leadtime;

Query 8: Analyzes yearly traffic hits by category 

This query:
- Aggregates yearly traffic hits by category from the `traffic` table for the month of December
- Performs a heavy table scan over 5 years of historical data, with window functions for additional aggregation metrics

In [None]:
SELECT -- BASE WORKLOAD QUERY - 08
  DATE_PART (YEAR, t.timestamp) AS Year,
  c.id AS Category_ID,
  c.name AS Category_Name,
  SUM(COUNT(*)) OVER (
    PARTITION BY DATE_PART (YEAR, t.timestamp),
    c.id
  ) AS Total_Hits
FROM
  traffic AS t,
  category AS c
WHERE
  t.category_id = c.id
  AND t.timestamp >= DATEADD (YEAR, -5, CURRENT_TIMESTAMP())
  AND MONTH (t.timestamp) = 12
GROUP BY
  DATE_PART (YEAR, t.timestamp),
  c.name,
  c.id
ORDER BY
  1,
  4,
  2
LIMIT
  200;
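
The `SUM(COUNT(*)) OVER (...)` expression in Query 8 is a window function applied on top of a grouped aggregate: `COUNT(*)` is computed per group first, then summed across the groups in each partition. A small sketch with made-up rows (the `VALUES` data and column names are assumptions for illustration):

```sql
select
    yr,
    category_id,
    count(*) as hits_in_group,                           -- per (yr, category_id)
    sum(count(*)) over (partition by yr) as hits_in_year -- summed across groups in a year
from (values
        (2023, 1),
        (2023, 1),
        (2023, 2),
        (2024, 1)
     ) as v (yr, category_id)
group by yr, category_id
order by yr, category_id;
```

In this sketch the partition is wider than the grouping, so the two counts differ. When the partition columns identify a single group, as they would in Query 8 if each category id has exactly one name, the windowed sum equals the plain `COUNT(*)`.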

Query 9: Reports, for each user, the number of distinct months with visits and the average number of page visits per month

In [None]:
WITH user_sessions AS ( -- BASE WORKLOAD QUERY - 09
  SELECT
    user_id,
    TO_CHAR (timestamp, 'YYYYMM') AS visit_month,
    COUNT(*) AS total_visits,
    COUNT(t.url) AS pages_visited
  FROM
    traffic AS t
  WHERE visit_month between '202308' and '202412'
    and user_id BETWEEN 1000000 AND 2000000
  GROUP BY ALL
)
  SELECT
    s.user_id,
    COUNT(DISTINCT visit_month) AS total_visit_month,
    AVG(pages_visited) AS avg_pages_per_session 
  FROM
    user_sessions s
  GROUP BY
    s.user_id
    ;
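
Because every query above carries a `BASE WORKLOAD QUERY - NN` comment and runs on a dedicated warehouse, the baseline can be summarized from query history. The following is a minimal sketch, assuming the `INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE` table function is accessible to the current role; the 1000-row limit and the choice of metrics are assumptions, not part of the lab.

```sql
select
    -- Extract the two-digit query number from the tagging comment
    regexp_substr(
        query_text, 'BASE WORKLOAD QUERY - ([0-9]+)', 1, 1, 'e', 1
    ) as workload_query,
    count(*) as runs,
    round(avg(total_elapsed_time) / 1000, 2) as avg_elapsed_seconds,
    round(avg(bytes_scanned) / power(1024, 3), 2) as avg_gb_scanned
from table(information_schema.query_history_by_warehouse(
    warehouse_name => 'WH_SUMMIT25_PERF_BASE',
    result_limit   => 1000
))
where query_text ilike '%BASE WORKLOAD QUERY - %'
  -- Skip this monitoring query itself, which also contains the tag text
  and query_text not ilike '%query_history_by_warehouse%'
  and execution_status = 'SUCCESS'
group by workload_query
order by workload_query;
```

Running the same summary after each tuning step gives a like-for-like comparison, provided the result cache stays disabled for every run.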