
# Common analytical questions and SQL templates for answering them

## Finding the n-th event in a series of events with window functions

* Many user interactions are stored as events (e.g., impressions, clicks, checkouts, cab called, cab boarded, cab dismounted, etc.)

* Analytical questions often involve identifying one or more of these events and associating them with a past event.

* For example, if a customer purchases a product, how did that customer land on the product page (Google, ads, Bing, etc.)? This is known as attribution.

[ref: utm](https://blog.hubspot.com/customers/understanding-basics-utm-parameters)
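
As a rough illustration of associating an event with a past event, here is a minimal last-touch attribution sketch. It assumes SQLite (via Python's built-in `sqlite3` module, SQLite 3.25+ for window functions) rather than the database used in the notebook cells, and the `page_visits` and `purchases` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: hypothetical toy tables; SQLite 3.25+ is needed for window functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_visits (user_id INTEGER, visit_time TEXT, utm_source TEXT);
CREATE TABLE purchases (user_id INTEGER, purchase_time TEXT);
INSERT INTO page_visits VALUES
    (1, '2024-07-01 09:00:00', 'google'),
    (1, '2024-07-01 09:30:00', 'ads'),
    (1, '2024-07-01 11:00:00', 'bing'),
    (2, '2024-07-01 10:00:00', 'bing');
INSERT INTO purchases VALUES
    (1, '2024-07-01 10:00:00'),
    (2, '2024-07-01 10:30:00');
""")

# For each purchase, rank the visits that happened at or before it,
# most recent first, and keep only the most recent one (last touch).
last_touch_sql = """
WITH ranked_visits AS (
    SELECT
        p.user_id,
        p.purchase_time,
        v.utm_source,
        ROW_NUMBER() OVER (
            PARTITION BY p.user_id, p.purchase_time
            ORDER BY v.visit_time DESC
        ) AS recency_rank
    FROM purchases p
    JOIN page_visits v
        ON v.user_id = p.user_id
        AND v.visit_time <= p.purchase_time
)
SELECT user_id, purchase_time, utm_source
FROM ranked_visits
WHERE recency_rank = 1
ORDER BY user_id;
"""
attributed = conn.execute(last_touch_sql).fetchall()
for row in attributed:
    print(row)
```

Ordering by `visit_time ASC` instead would give first-touch attribution; which one is appropriate depends on the business question.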




## Find the n-th click in a series of user clicks

* Assume we have a `clickstream` table with a `user_id` and a `click_time` (the time the user clicked on our web page). We can use ranking functions to pick each user's 3rd (or any n-th) click.

* Finding the n-th event in a series of events is useful for
	* Marketing attribution
	* Debugging issues with late-arriving data



For example, let's find the 3rd click in a series of clicks:

![3-rd click](../../images/3click.png)

In [None]:
%%sql
WITH clickstream AS (
    SELECT
        1 AS user_id, '2024-07-01 10:00:00' AS click_time UNION ALL
    SELECT
        1 AS user_id, '2024-07-01 10:05:00' AS click_time UNION ALL
    SELECT
        1 AS user_id, '2024-07-01 10:10:00' AS click_time UNION ALL
    SELECT
        2 AS user_id, '2024-07-01 10:15:00' AS click_time UNION ALL
    SELECT
        2 AS user_id, '2024-07-01 10:20:00' AS click_time UNION ALL
    SELECT
        2 AS user_id, '2024-07-01 10:25:00' AS click_time
),
ranked_clicks AS (
    SELECT
        user_id,
        click_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY click_time) AS click_rank
    FROM
        clickstream
)
SELECT
    user_id,
    click_time,
    click_rank
FROM
    ranked_clicks
WHERE
    click_rank = 3;
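
The choice of ranking function matters when two events share the same timestamp. The sketch below compares `ROW_NUMBER`, `RANK`, and `DENSE_RANK` on a tied click. It assumes SQLite (via Python's built-in `sqlite3` module, SQLite 3.25+) rather than the notebook's database, and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite 3.25+ for window functions; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clickstream (user_id INTEGER, click_time TEXT);
INSERT INTO clickstream VALUES
    (1, '2024-07-01 10:00:00'),
    (1, '2024-07-01 10:05:00'),
    (1, '2024-07-01 10:05:00'),
    (1, '2024-07-01 10:10:00');
""")

# Compute all three ranking functions over the same partition and ordering.
ranked = conn.execute("""
SELECT
    click_time,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY click_time) AS row_num,
    RANK()       OVER (PARTITION BY user_id ORDER BY click_time) AS rnk,
    DENSE_RANK() OVER (PARTITION BY user_id ORDER BY click_time) AS dense_rnk
FROM clickstream
ORDER BY click_time, row_num;
""").fetchall()

for row in ranked:
    print(row)
```

With the tie at 10:05:00, `RANK` assigns 1, 2, 2, 4 and skips 3, so filtering on `rnk = 3` would return no row. `ROW_NUMBER` always yields exactly one row per position, although the order among tied rows is arbitrary unless a tie-breaker column is added to the `ORDER BY`.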


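* This pattern (ROW_NUMBER + ORDER BY a unique key) can also remove duplicate rows: partition by the columns that identify a duplicate and keep only the rows where the row number is 1.

* Note: some databases offer dedicated deduplication syntax (e.g., `DISTINCT ON` in PostgreSQL and DuckDB).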

Let's see how we can drop duplicates with this approach:

![Remove duplicates](../../images/dupclick.png)
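
A minimal sketch of the deduplication pattern, assuming SQLite (via Python's built-in `sqlite3` module, SQLite 3.25+) rather than the notebook's database. The `loaded_at` column is a hypothetical load timestamp added here to decide which copy of a duplicate to keep.

```python
import sqlite3

# Assumption: `loaded_at` is a hypothetical column recording when a row was ingested.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clickstream (user_id INTEGER, click_time TEXT, loaded_at TEXT);
INSERT INTO clickstream VALUES
    (1, '2024-07-01 10:00:00', '2024-07-01 10:01:00'),
    (1, '2024-07-01 10:00:00', '2024-07-01 12:00:00'),
    (1, '2024-07-01 10:05:00', '2024-07-01 10:06:00'),
    (2, '2024-07-01 10:15:00', '2024-07-01 10:16:00'),
    (2, '2024-07-01 10:15:00', '2024-07-01 12:00:00');
""")

# Rows with the same (user_id, click_time) are duplicates.
# Number them newest-load-first and keep only the first one.
dedup_sql = """
WITH numbered AS (
    SELECT
        user_id,
        click_time,
        loaded_at,
        ROW_NUMBER() OVER (
            PARTITION BY user_id, click_time
            ORDER BY loaded_at DESC
        ) AS dup_rank
    FROM clickstream
)
SELECT user_id, click_time, loaded_at
FROM numbered
WHERE dup_rank = 1
ORDER BY user_id, click_time;
"""
deduped = conn.execute(dedup_sql).fetchall()
for row in deduped:
    print(row)
```

The `PARTITION BY` columns define what counts as a duplicate, and the `ORDER BY` decides which copy survives (here, the most recently loaded one).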



## Converting row values into individual columns (aka PIVOT)

* Commonly used for easy visual summarization

* Used extensively by business folks to inspect value distributions

![](./pivot.png)

## Use GROUP BY + CASE WHEN to replicate PIVOT in SQL

* Pivots take values in rows and convert them into columns.

* We can create this logic in SQL with a CASE WHEN inside a GROUP BY 

* Only columns with a low number of unique values (aka low cardinality) are pivoted.

* Example: convert the `o_orderpriority` column values into individual columns and calculate the monthly average order price for each priority.


In [None]:
%%sql
SELECT strftime(o_orderdate, '%Y-%m') AS ordermonth,
       ROUND(AVG(CASE
                     WHEN o_orderpriority = '1-URGENT' THEN o_totalprice
                     ELSE NULL
                 END), 2) AS urgent_order_avg_price,
       ROUND(AVG(CASE
                     WHEN o_orderpriority = '2-HIGH' THEN o_totalprice
                     ELSE NULL
                 END), 2) AS high_order_avg_price,
       ROUND(AVG(CASE
                     WHEN o_orderpriority = '3-MEDIUM' THEN o_totalprice
                     ELSE NULL
                 END), 2) AS medium_order_avg_price,
       ROUND(AVG(CASE
                     WHEN o_orderpriority = '4-NOT SPECIFIED' THEN o_totalprice
                     ELSE NULL
                 END), 2) AS not_specified_order_avg_price,
       ROUND(AVG(CASE
                     WHEN o_orderpriority = '5-LOW' THEN o_totalprice
                     ELSE NULL
                 END), 2) AS low_order_avg_price
FROM orders
GROUP BY strftime(o_orderdate, '%Y-%m');



Some databases (DuckDB, for example) support a native `PIVOT` statement, which avoids writing one CASE WHEN per value:


In [None]:
%%sql
PIVOT
  (SELECT *,
          strftime(o_orderdate, '%Y-%m') AS order_month
   FROM orders) ON o_orderpriority USING AVG(o_totalprice)
GROUP BY order_month
LIMIT 10;

## Most analytical dashboards need period-over-period comparison

* Take a look at these popular analytical websites. You will see a few key numbers in big fonts next to a smaller `+/-number` indicating the change percentage.

* People are interested in seeing how performance has changed over time

* Dashboards show metrics for a certain period and often show how they have changed compared to the prior period.

![](./dash.png)


## Use GROUP BY to create metrics and a window function to compare the current period with the previous period

* Write a query on the `orders` table that has the following output:
	1. order_month (in YYYY-MM format)
	2. revenue: Sum of `o_totalprice` for that month
	3. revenue_MOM_change: The current month's revenue - the previous month's revenue
	4. perc_revenue_MOM_change: revenue_MOM_change as a fraction of the previous month's revenue



In [None]:
%%sql
SELECT order_month,
       revenue,
       revenue - lag(revenue) OVER (
                                    ORDER BY order_month) AS revenue_MOM_change,
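       -- Relative change is measured against the previous month's revenue; it is NULL for the first month
       ROUND((revenue - lag(revenue) OVER (
                                           ORDER BY order_month)) / lag(revenue) OVER (
                                                                                       ORDER BY order_month), 2) AS perc_revenue_MOM_change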
FROM
  (SELECT strftime(o_orderdate, '%Y-%m') AS order_month,
          SUM(o_totalprice) AS revenue
   FROM orders
   GROUP BY 1)
ORDER BY 1 ;
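
To make the arithmetic concrete, here is a small worked sketch on three made-up months. It assumes SQLite (via Python's built-in `sqlite3` module, SQLite 3.25+) rather than the notebook's database, a hypothetical pre-aggregated `monthly_revenue` table, and that the relative change is measured against the previous month's revenue.

```python
import sqlite3

# Assumption: `monthly_revenue` is a hypothetical table with made-up numbers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (order_month TEXT, revenue REAL);
INSERT INTO monthly_revenue VALUES
    ('2024-01', 100.0),
    ('2024-02', 120.0),
    ('2024-03', 90.0);
""")

# LAG fetches the previous row's revenue; the first month has no
# previous row, so its change columns come back as NULL (None).
mom_sql = """
WITH with_prev AS (
    SELECT
        order_month,
        revenue,
        LAG(revenue) OVER (ORDER BY order_month) AS prev_revenue
    FROM monthly_revenue
)
SELECT
    order_month,
    revenue,
    revenue - prev_revenue AS revenue_mom_change,
    ROUND((revenue - prev_revenue) / prev_revenue, 2) AS perc_revenue_mom_change
FROM with_prev
ORDER BY order_month;
"""
mom = conn.execute(mom_sql).fetchall()
for row in mom:
    print(row)
```

February is up 20 on a base of 100 (0.2), and March is down 30 on a base of 120 (-0.25). Computing `prev_revenue` once in a CTE also avoids repeating the `LAG` expression.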


## Exercise

* Scenario: You are designing a data set for a dashboard. The dashboard should be able to show metrics at day, week, month, and year levels (assume these are drop-downs on the dashboard).

* Assume that you are the data engineer assigned to build the table needed for the dashboard.

* Question 1: What clarifying questions would you ask the dashboard team?

* Question 2: How would you design the table to be used by the dashboard software? What are the considerations you need to be mindful of?



## Data access concerns

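* Query patterns: expected queries per second (QPS) and the filters the dashboard applies besides the time period

* Performance considerations: the size of the granular data that each query must scan

* Pre-aggregation: trade-offs around data freshness, and whether each metric is additive or non-additive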
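
The additive vs. non-additive distinction can be sketched as follows. This assumes SQLite via Python's built-in `sqlite3` module; the `toy_orders` table, its columns, and the `daily_metrics` pre-aggregate are hypothetical names used only for illustration.

```python
import sqlite3

# Assumption: `toy_orders` and `daily_metrics` are hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_orders (order_date TEXT, customer_id INTEGER, total_price REAL);
INSERT INTO toy_orders VALUES
    ('2024-07-01', 1, 10.0),
    ('2024-07-01', 2, 20.0),
    ('2024-07-02', 1, 30.0),
    ('2024-07-02', 3, 40.0);

-- Pre-aggregate to one row per day.
CREATE TABLE daily_metrics AS
SELECT
    order_date,
    SUM(total_price) AS revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM toy_orders
GROUP BY order_date;
""")

# Roll the daily rows up to the whole period by summing both metrics.
rolled_up = conn.execute(
    "SELECT SUM(revenue), SUM(unique_customers) FROM daily_metrics"
).fetchone()

# Compute the same metrics directly from the granular rows.
from_granular = conn.execute(
    "SELECT SUM(total_price), COUNT(DISTINCT customer_id) FROM toy_orders"
).fetchone()

print("rolled up from daily:", rolled_up)
print("from granular data:  ", from_granular)
```

Revenue is additive, so summing the daily rows matches the granular total. The unique-customer count is non-additive: customer 1 ordered on both days, so summing the daily counts gives 4 while the true count is 3. Non-additive metrics need to be computed from granular data or stored at each level the dashboard offers.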

## Recap

* Find the nth event in a series of events with ranking window functions

* Pivot in SQL with a CASE WHEN inside an aggregate function

* Calculate period-over-period change with value (lead/lag) window functions

* Window functions are expensive; if your use case requires repeated use of window functions, consider pre-aggregating your data



## Read these

1. [Code and slides are available here](https://github.com/josephmachado/adv_data_transformation_in_sql)

2. [Subscribe to the Startdataengineering newsletter](https://www.startdataengineering.com/news-letter/)

## General pointers

* Practice patterns/mental models for real work 

* Practice LeetCode-style (LC) problems for interviews

## Q & A

* Do you have any questions about what we went through in this session?
	* Window functions: aggregate, value, ranking
	* CTEs: readability and DRY 
	* Common analytical patterns

## What's next

* Want to learn more? -> 
	1. Learning foundational concepts
	2. Building end-to-end projects using industry-standard tools
	3. Hosting project on AWS cloud
	4. Targeted data portfolio 
	5. Job hunt strategies


[Join my Data Engineering Bootcamp waitlist](https://astounding-architect-5764.ck.page/684e1f422f)