## Intro & CodeSpaces Setup

- Workshop format: code along, with in-session exercises & take-home exercises
- Set up Codespaces with GitHub
- All answers will be sent after the workshop
- I will stay on this call afterwards for more questions
- Videos will be transcribed and uploaded to Podia, where you will have access
- We will use TPC-H data, which is commonly used for data warehouse benchmarking
- Restarting Codespaces
- Switch off your Codespace when not in use to save your free usage quota
- You can also run locally with Docker Compose

## Advanced SQL is making complex logic easy to understand

- Advanced SQL is writing code that is understandable
- Clear thinking/pseudo code + locality of behaviour -> easy to maintain code
    
### Use window functions to compare values between rows in a table ~ 1h

- All transformations in SQL (except window functions) operate on one row at a time (e.g. lower(), round()), on a group of rows (GROUP BY + min/max/avg/count/sum), or combine rows from multiple tables (e.g. joins)
- Only window functions let you compare values across rows while keeping every row in the output
- You can do something similar with a self join, but window functions are simpler to use and battle tested
- If the logic requires looping through a subset of a table's rows and performing an operation that involves values from multiple rows -> window functions
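
The difference between GROUP BY and a window function can be sketched with a tiny in-memory table. This is a minimal illustration, assuming SQLite (via Python's `sqlite3`) as a stand-in for the workshop's SQL engine; the `orders` table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-01", 10.0),
        ("alice", "2024-01-02", 30.0),
        ("bob", "2024-01-01", 20.0),
    ],
)

# GROUP BY collapses each customer's rows into a single output row.
grouped = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# A window function keeps every input row and attaches the per-customer total to it.
windowed = conn.execute(
    """
    SELECT customer, order_date, amount,
           SUM(amount) OVER (PARTITION BY customer) AS customer_total
    FROM orders
    ORDER BY customer, order_date
    """
).fetchall()

print(grouped)   # 2 rows: one per customer
print(windowed)  # 3 rows: one per order, each with its customer's total
```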

add: image showing difference between group by and window function

#### Partition to define a set of rows to work on, Window frame to define rows in the partition to be used in the computation

- Anatomy of a window function: function, OVER, PARTITION BY, ORDER BY, and window frame
- Partition is optional (without it the entire table is treated as one partition). Window frame is optional too: without ORDER BY the whole partition is used; with ORDER BY most engines default to the rows from the start of the partition up to the current row
- Visualize a window function as a function that gets applied to one row at a time (in the order defined by ORDER BY, if specified)
- When you see looping, ranking, or running aggregates, think window function
- Percentage change over months -> a common pattern for window functions. I was asked this in an interview many years ago and failed.
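
The percentage-change-over-months pattern can be sketched with `LAG`. This is one possible shape, assuming SQLite through Python's `sqlite3`; the `monthly_revenue` table and its values are hypothetical. There is no PARTITION BY here, so the whole table is one partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 120.0), ("2024-03", 90.0)],
)

# LAG(revenue) reads the previous row's revenue, where "previous" is defined
# by the ORDER BY inside OVER (...). The first month has no previous row,
# so its percentage change is NULL (None in Python).
pct_change = conn.execute(
    """
    SELECT month,
           revenue,
           ROUND(
               (revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
               / LAG(revenue) OVER (ORDER BY month),
               1
           ) AS pct_change
    FROM monthly_revenue
    ORDER BY month
    """
).fetchall()

for row in pct_change:
    print(row)
```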

Exercise, 15m, Try it now: Self join to Window function with range to define window frame

Complex self join and GROUP BY to get the running sum for each row (add: image). How would you convert it to a window function?
Discuss simplicity, with the caveat that if people don't understand window functions this might make the code hard to read. But most DEs are familiar with window functions, even if they don't know how/when to use them appropriately.
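
A minimal sketch of the two approaches side by side, assuming SQLite through Python's `sqlite3` and a hypothetical `daily_sales` table with one row per date. It is not the workshop's exercise data, only an illustration of the conversion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 5)],
)

# Self join + GROUP BY: pair every row with all rows on or before its date,
# then sum the paired amounts.
self_join = conn.execute(
    """
    SELECT cur.sale_date, cur.amount, SUM(prev.amount) AS running_sum
    FROM daily_sales AS cur
    JOIN daily_sales AS prev ON prev.sale_date <= cur.sale_date
    GROUP BY cur.sale_date, cur.amount
    ORDER BY cur.sale_date
    """
).fetchall()

# Window function: same result, with the frame stated explicitly.
window = conn.execute(
    """
    SELECT sale_date, amount,
           SUM(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_sum
    FROM daily_sales
    ORDER BY sale_date
    """
).fetchall()

print(self_join == window)
```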

- We can partition by multiple columns; every unique combination of values is treated as a partition (add: image)
- You would typically only partition on columns with low cardinality (a low number of unique values)
- The order of columns in PARTITION BY does not matter, e.g. (date, state) and (state, date) give you the same partitions
- The order of rows specified by the ORDER BY clause matters a lot! Without it, the output of order-dependent functions (ranking, lead/lag, running aggregates) is non-deterministic (i.e. it can change each time you run it). If you really don't need an ORDER BY clause, consider whether you can get away with GROUP BY instead of a window function.
- Without ORDER BY you'd typically be looking for some aggregate or the first/last value in a partition, both of which can be done with GROUP BY (if your DB has the functions)

Window frame:

- A window frame lets you define the set of rows within a partition over which the function is applied.
- Note that the ORDER BY clause applies to the entire partition; the window frame typically uses that order to decide which rows to include in the computation
- There are 2 main ways to specify a window frame: ROWS and RANGE
- ROWS lets you define how many rows before and after the current row are considered for the function
- RANGE lets you define a range of values (relative to the current row's ORDER BY value) over which the function is run
- A RANGE with a value offset only works with numeric and date-based ORDER BY columns, where you specify the range of values to include in the computation
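
The difference between ROWS and RANGE shows up as soon as the ORDER BY values have gaps. A small sketch, assuming SQLite through Python's `sqlite3` (RANGE with a numeric offset needs SQLite 3.28 or newer) and a hypothetical `readings` table where days 3 and 4 are missing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day_num INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10), (2, 20), (5, 30), (6, 40)],
)

frames = conn.execute(
    """
    SELECT day_num,
           amount,
           -- ROWS: the current row plus the one physical row before it
           SUM(amount) OVER (
               ORDER BY day_num ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS rows_sum,
           -- RANGE: every row whose day_num is within [day_num - 1, day_num]
           SUM(amount) OVER (
               ORDER BY day_num RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS range_sum
    FROM readings
    ORDER BY day_num
    """
).fetchall()

for row in frames:
    print(row)
```

For day 5 the two frames differ: ROWS still picks up the row for day 2 (the previous physical row), while RANGE finds no row with `day_num` 4 and sums only day 5.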

add: image on row and range based window selection
add: exercise 10m on how to use range

**Note**: Some DB engines have more ways to define window frame, such as GROUPS in

#### Aggregate other row values with aggregate functions, rank between rows with ranking functions, & access data from other rows with value functions

- Now that we've seen when to use window functions, let's dig into the types of functions you can use
- There are 3 types of functions
- Aggregate: typically all the functions you can use with your DB's GROUP BY (such as min/max/avg/sum/count)
- Ranking: rank rows based on the values in one or more columns (e.g. rank, dense_rank, row_number)
- Value: get another row's value. You specify which row with a value function such as first_value, last_value, lead, or lag
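
One query can show all three function types next to each other. A minimal sketch, assuming SQLite through Python's `sqlite3`; the `scores` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, game INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("ann", 1, 50), ("ann", 2, 70), ("ann", 3, 60), ("bob", 1, 40), ("bob", 2, 40)],
)

result = conn.execute(
    """
    SELECT player,
           game,
           score,
           -- aggregate function: average over the whole partition
           AVG(score) OVER (PARTITION BY player) AS avg_score,
           -- ranking function: position of this score within the player's scores
           RANK() OVER (PARTITION BY player ORDER BY score DESC) AS score_rank,
           -- value function: the score from the player's previous game
           LAG(score) OVER (PARTITION BY player ORDER BY game) AS prev_score
    FROM scores
    ORDER BY player, game
    """
).fetchall()

for row in result:
    print(row)
```

Note that bob's two equal scores both get rank 1, since `RANK` gives ties the same rank.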

Exercise, 15m, Try it now: Use agg, rank & value func

add: references

Break ~ 10m

### Avoid creating partial or duplicate data by building idempotent pipelines ~ 1h

scenario: (15m)
* You build this pipeline https://github.com/josephmachado/idempotent-data-pipeline/blob/main/parking_violation_data_pipeline.py which runs once a day
* You realize there is an issue with upstream data, and you have to re-run it for the previous 3 days
* Check the output with `ls -ltha`. What do you see? Why do you have stale data?
* How do you avoid this scenario without having to manually clean up the output each time you run a backfill?

#### If running your pipeline multiple times with the same input produces partial/duplicate data, it is not idempotent

- Idempotency may sound like a "fancy" FP concept, but it's critical for your sanity
- As the number of pipelines increases and your workload grows, you want systems that fix themselves
- Idempotent pipelines are critical for this
- Typically the main part that dictates whether a pipeline is idempotent is the logic you use to write out data
- The output data is the output of your pipeline
- Backfills are supported by multiple orchestration systems (e.g. Airflow), but idempotency depends on how you design your pipeline
- **NOTE** If you have multiple pipelines (with different computation logic) writing to the same table, it's nearly impossible to make those pipelines idempotent
- Remember to enable dynamic partition overwrite (Spark specific, `spark.sql.sources.partitionOverwriteMode=dynamic`), else the entire table will be overwritten

#### Date & timestamps of inputs are key attributes used to enforce output idempotency

- Most tables are partitioned (we will cover this in detail later) by date or some combination of date, time, and other attributes
- For most pipelines, simply overwriting entire partitions will make them idempotent
- Note that an overwrite removes all existing data in the target and inserts the new data
- There will be cases, like the one below, where the input loses one or more partitions on a re-run; the output will then not be idempotent, since errant data may still remain
add: image example where the old input had day_33 and so the output has a day_33 partition, but the new input does not have day_33, so the output is incorrect because day_33 still remains
- The key insight is that when reprocessing a dataset, you typically want to completely replace the output rather than just overwrite the overlapping partitions. But this is a rare case, as input data systems more often underproduce data than overproduce it.
- Even if they overproduce, it's typically duplicates, so your insert overwrite strategy will handle these
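
The partition-overwrite idea can be sketched without Spark, using folders on local disk. This is a simplified stand-in, not the linked pipeline's code: `write_partition`, the `date=...` folder naming, and the CSV rows are all hypothetical choices for the illustration.

```python
import shutil
import tempfile
from pathlib import Path


def write_partition(output_root: Path, run_date: str, rows: list[str]) -> Path:
    """Overwrite the partition for run_date: drop whatever is there, then write fresh data."""
    partition_dir = output_root / f"date={run_date}"
    if partition_dir.exists():
        shutil.rmtree(partition_dir)  # remove stale output from earlier runs
    partition_dir.mkdir(parents=True)
    (partition_dir / "part-0000.csv").write_text("\n".join(rows) + "\n")
    return partition_dir


output_root = Path(tempfile.mkdtemp())

# First run, then a backfill of the same day after upstream data was corrected.
write_partition(output_root, "2024-01-01", ["a,1", "b,2"])
write_partition(output_root, "2024-01-01", ["a,1", "b,3"])

files = sorted(p.relative_to(output_root).as_posix() for p in output_root.rglob("*.csv"))
content = (output_root / "date=2024-01-01" / "part-0000.csv").read_text()
print(files)    # a single file for the day, no leftovers from the first run
print(content)  # only the corrected rows
```

Because the write path is keyed on the run date and replaces the whole folder, re-running any day any number of times leaves exactly one copy of that day's output.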

##### Use Insert overwrite for facts and snapshot dimensions; Use MERGE INTO for SCD2 dimensions

- For fact tables the insert overwrite approach works well, since your data is typically split by day/time
- For snapshot dimension tables (i.e. an entire copy created per run) insert overwrite works as well, although you typically don't need to backfill snapshot dimensions, as usually only the latest snapshot is used
- For SCD2 dimensions (add: scd2 image) you want to use MERGE INTO
- MERGE INTO allows you to combine multiple update/insert/delete operations into one SQL query
- Previously, if one statement in a sequence of update/insert/delete statements failed, you would have ended up with partial data in the output
- With MERGE INTO, either all of the changes are applied or none are
- add: example https://www.startdataengineering.com/post/create-scd2-table-with-merge-into-with-spark-iceberg/ with images
- add: image showing matched, not matched, not matched by source
- Show how to build SCD2 with merge into: simple case
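
The matched / not matched logic behind an SCD2 merge can be sketched in plain Python. This is not MERGE INTO itself and it does not model atomicity; it only mimics the branching on in-memory rows. The function name `scd2_merge` and the row fields (`key`, `value`, `valid_from`, `valid_to`, `is_current`) are hypothetical.

```python
from datetime import date


def scd2_merge(dimension, incoming, effective):
    """Return a new SCD2 dimension after applying `incoming` (key -> attribute value)."""
    result = []
    open_keys = set()  # keys that still have a current row after the "matched" pass
    for row in dimension:
        row = dict(row)  # copy so the input rows are never mutated
        new_value = incoming.get(row["key"])
        if row["is_current"] and new_value is not None and new_value != row["value"]:
            # MATCHED and changed: close the current version
            row["valid_to"] = effective
            row["is_current"] = False
        if row["is_current"]:
            open_keys.add(row["key"])
        result.append(row)
    for key, value in incoming.items():
        if key not in open_keys:
            # NOT MATCHED (new key), or a key whose old version was just closed:
            # insert a new current version
            result.append(
                {"key": key, "value": value, "valid_from": effective,
                 "valid_to": None, "is_current": True}
            )
    return result


dim = [{"key": 1, "value": "NY", "valid_from": date(2024, 1, 1),
        "valid_to": None, "is_current": True}]
incoming = {1: "CA", 2: "TX"}

once = scd2_merge(dim, incoming, date(2024, 2, 1))
twice = scd2_merge(once, incoming, date(2024, 2, 1))
print(len(once), once == twice)
```

Applying the same input a second time changes nothing, which is the idempotency property the exercise asks for.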

Exercise, 15m, Try it now: Write an idempotent query that writes data into an SCD2 table (hint: add complexities with time range comparison)

Break ~ 10m

## Optimization is reducing data to process & maximizing cluster utilization

### Distributed data processing systems process data in parallel & move data between processes (aka shuffle) only when necessary ~ 15m

- How distributed data is stored in storage systems
- How distributed data processing systems, read from storage, process, shuffle (when needed) and write outputs
- How the read -> process -> write loop is designed to stream data 

#### Spark has its specific jargon for things all distributed data processing systems do 

- How Spark does this with the JVM & parallel threads
- Spark: how applications -> jobs -> stages -> tasks fit together, and why Spark is lazily evaluated

#### Narrow tx process data in parallel & Wide tx require shuffle

- Characteristics of narrow tx and wide tx (shuffle)
- Spark tries to minimize the amount of data shuffled, as transferring data across the network is expensive and can overload machine memory
- Spark uses knowledge about the data (aka metadata) to read only the data it needs, and uses Adaptive Query Execution (AQE) to figure out at runtime how to minimize data transfer
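
Narrow vs wide can be illustrated without a cluster by treating Python lists as partitions. This is a toy model, not Spark code: the two lists stand in for data on two workers, and `user % num_targets` stands in for the hash partitioning a real engine would use.

```python
from collections import defaultdict

# Two "partitions" of (user_id, amount) rows, as if they lived on two workers.
partitions = [
    [(1, 10), (2, 5), (1, 7)],
    [(2, 3), (3, 8), (1, 1)],
]

# Narrow transformation: each partition is processed on its own, no data moves.
doubled = [[(user, amount * 2) for user, amount in part] for part in partitions]

# Wide transformation: a per-user sum needs every row of a user in one place,
# so rows are first shuffled to a target partition chosen by key.
num_targets = 2
shuffled = [[] for _ in range(num_targets)]
for part in partitions:
    for user, amount in part:
        shuffled[user % num_targets].append((user, amount))

# After the shuffle, each target partition can aggregate independently.
totals = []
for part in shuffled:
    sums = defaultdict(int)
    for user, amount in part:
        sums[user] += amount
    totals.append(dict(sums))

print(doubled)
print(totals)
```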

Exercise, 5m, Multiple choice questions: Which of the following queries will result in wide transformations?

### Spark creates a plan to process data only when we ask it to write an output ~ 30m

- Spark waits until it really needs to process the data (i.e. when we ask it to write output), as this gives Spark the ability to see everything we asked it to do and create an optimal plan

#### Use EXPLAIN & Spark UI to spot bottlenecks in Spark's plan and badly distributed input data

- Before Spark runs a query we can see what it plans to do using the query plan
- Look for shuffles and try to minimize them
- Check if data is filtered at the first step (aka filter/predicate pushdown), as it reduces the total amount of data to be processed.
- Key parts of a query plan: scan, filter, project, shuffle, join. Check for a broadcast join if one of the tables is small
- Check the Spark UI to see in-flight processing
- Spot bottlenecks via the DataFrame sections and see if Spark is doing more work than necessary. Spark AQE is smart, but you know the data, logic, and intention. add: screenshot & example to see live
- Use the Spark UI to see if data processing is evenly distributed: is one executor overloaded while others wait? How can we redistribute? This is usually a sign of bad input data storage/distribution. add: screenshot & example to see live

Exercise, 15m, Scenario: Spot the bottlenecks, let's figure out how to handle them; use python loop with spark sql -> how to optimize

### Data storage pattern should depend on data's use case ~ 30m

- We saw how Spark tries to only read the data it has to with filter pushdowns
- By storing data in the right pattern we can enable Spark to read less data

#### In data warehouse: data is read way more than it is processed; So optimize storage for reads

- In data warehousing (which is what most data pipelines are built to handle), data is read many times but written only a few times
- Given that the number of reads >> number of writes, optimizing data storage for read patterns will significantly impact your warehouse performance

#### Appropriate data storage pattern will reduce the amount of data to be processed

- When we designate a storage pattern to a table, it is stored as part of that table's metadata
- Spark will use this metadata to create the query plan
- In addition to how data is stored, there is also data encoding
- In a data warehouse columnar encoding is key, as analytical queries typically use only a few of the many columns most tables have
- While there are a few columnar formats (ORC, Parquet, etc.), the most popular one is Parquet
- The distributed data is stored as individual files (ideally around 128MB each, a size that individual Spark tasks can process effectively) encoded as Parquet
- The Parquet format is extensive, but at a high level it has 3 key sections
- 1. Footer: holds the column value ranges (min/max) present in the file & the offsets of the data (think of an offset as a number saying how far from the top of the file the data sits). The data in the file is split into row groups, with each row group containing chunks of column data.
- 2. Row groups: each represents a set of rows of the table. Each row group has one column chunk per column.
- 3. Column chunks: each column's values are stored consecutively, so that a reader can read just the chunks it needs (when selecting specific columns)
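
How a reader can use footer statistics to skip data can be sketched with plain Python structures. This is a heavily simplified, hypothetical model of the layout described above, not the real Parquet format or any Parquet library's API: `row_groups`, `footer`, and `read_amounts_greater_than` are made up for the illustration.

```python
# Each row group stores its data column by column (one "column chunk" per column).
row_groups = [
    {"order_id": [1, 2, 3], "amount": [10.0, 12.5, 9.0]},
    {"order_id": [4, 5, 6], "amount": [50.0, 75.0, 60.0]},
    {"order_id": [7, 8, 9], "amount": [20.0, 110.0, 30.0]},
]

# The "footer" records the min/max of the amount column for every row group.
footer = [
    {"row_group": i, "amount_min": min(g["amount"]), "amount_max": max(g["amount"])}
    for i, g in enumerate(row_groups)
]


def read_amounts_greater_than(threshold):
    """Read only the `amount` chunk of row groups that can contain matching values."""
    scanned, matches = [], []
    for stats in footer:
        if stats["amount_max"] <= threshold:
            continue  # skip: no value in this row group can pass the filter
        scanned.append(stats["row_group"])
        chunk = row_groups[stats["row_group"]]["amount"]  # order_id is never touched
        matches.extend(v for v in chunk if v > threshold)
    return scanned, matches


scanned, matches = read_amounts_greater_than(70.0)
print(scanned, matches)
```

The first row group is never read because its maximum (12.5) cannot satisfy the filter, and within the scanned row groups only the `amount` column chunk is read.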

add: query metadata to see partition schema
add: column format
add: parquet: footer, rowgroup, column chunk

#### Understand when to partition, cluster, or sort your data

Scenario: Assume you have the files, with data always being queried by date add: query
How will you store the data such that only the data for the requested dates is read?

add: image

- partitioning is the idea of storing data as a set of folders, each folder representing a partition.
- You can partition a table by multiple columns, but this will impact how efficient the query is.
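
Folder-based partitioning and the pruning it enables can be sketched on local disk. This is an illustration under assumptions: the `date=...` folder naming, the CSV files, and `read_for_date` are hypothetical, and a real engine resolves partitions from table metadata rather than by building paths.

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# Write one folder per date; each folder is a partition.
data = {
    "2024-01-01": ["a,1"],
    "2024-01-02": ["b,2", "c,3"],
    "2024-01-03": ["d,4"],
}
for day, rows in data.items():
    partition_dir = root / f"date={day}"
    partition_dir.mkdir()
    (partition_dir / "part-0000.csv").write_text("\n".join(rows) + "\n")


def read_for_date(table_root, day):
    """Partition pruning: open only the folder matching the filter, ignore the rest."""
    files = sorted((table_root / f"date={day}").glob("*.csv"))
    rows = [line for f in files for line in f.read_text().splitlines()]
    return files, rows


files_read, rows = read_for_date(root, "2024-01-02")
print(len(files_read), rows)  # one of the three files is opened
```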

add scenario: reduce partitions scanned when partitioning by date, vs date & some state id?

- Partitioning is a good fit for columns with low cardinality
- For columns with high cardinality you can bucket them

Scenario: same files, but the query is often by age
You can bucket the data into age buckets as well as partition it.

- In Spark, bucketing lets you split data into n distinct buckets based on the values of one or more columns
- This is helpful to reduce data shuffle during joins

e.g. if you bucket 2 tables by some user id, then at join time there is no need to shuffle data, since the rows to be joined are already in the same bucket
add: image
add example: Group bys will not require shuffle 
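
Why bucketing removes the shuffle from a join can be shown with a toy model. This is not Spark code: `bucketize` and `NUM_BUCKETS` are hypothetical, and `user_id % NUM_BUCKETS` stands in for the hash function an engine would apply. The point is that both tables use the same bucketing rule and bucket count.

```python
NUM_BUCKETS = 4


def bucketize(rows, key_index=0):
    """Place each row in a bucket chosen from its key; same key -> same bucket number."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[row[key_index] % NUM_BUCKETS].append(row)
    return buckets


users = [(1, "ann"), (2, "bob"), (5, "cat"), (6, "dan")]
orders = [(1, 10.0), (5, 7.5), (6, 3.0), (6, 4.0), (9, 1.0)]

# Both tables are bucketed once, at write time.
user_buckets = bucketize(users)
order_buckets = bucketize(orders)

# At join time, bucket i of users only ever needs bucket i of orders,
# so each pair can be joined independently with no data movement.
joined = []
for user_bucket, order_bucket in zip(user_buckets, order_buckets):
    names = dict(user_bucket)
    for user_id, amount in order_bucket:
        if user_id in names:
            joined.append((user_id, names[user_id], amount))
joined.sort()

print(joined)
```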

- Another technique to optimize storage for high cardinality columns (columns with a lot of unique values, like revenue) is sorting
- With sorted data sets Spark can identify the chunks of data that contain the values for a range filter
- In addition, the sort-merge join strategy, which requires a sort on the join key, can skip the sort step if the data is already sorted by the join key
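
The benefit of sorted data for range filters can be sketched with binary search. This is an analogy in plain Python, not how Spark is implemented: `revenue_sorted` and `range_filter` are hypothetical, and the list stands in for a sorted column.

```python
from bisect import bisect_left, bisect_right

# A column that is stored sorted.
revenue_sorted = [5, 12, 12, 30, 45, 80, 120, 200]


def range_filter(sorted_values, low, high):
    """Return the values in [low, high] by locating the boundaries, not scanning everything."""
    start = bisect_left(sorted_values, low)    # first position with value >= low
    end = bisect_right(sorted_values, high)    # first position with value > high
    return sorted_values[start:end], end - start


values, rows_touched = range_filter(revenue_sorted, 12, 80)
print(values, rows_touched)
```

Only the contiguous slice between the two boundaries is read; on unsorted data every value would have to be checked.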

- Typically multiple approaches are used together
- Date/hour and some commonly filtered column are used as partitions, and within each partition the data is sorted by the key numeric metric to support range filters

Exercise, 15m, Try it now: Partition or cluster for this data and this use case

Break ~ 5m

## Outro, Next steps, Assignments, Slack, Feedback ~ 5m

