# 🧠 SQL Query Execution Order — Complete & Annotated Guide


---

## 🔁 Execution Flow

```text
WITH → FROM → JOIN → ON → WHERE → GROUP BY → ROLLUP/CUBE/GROUPING SETS → HAVING → WINDOW → SELECT → PIVOT/UNPIVOT → CASE → DISTINCT → UNION/INTERSECT/EXCEPT → ORDER BY → OFFSET → FETCH/LIMIT
```

---

## 🔹 1. `WITH` (Common Table Expressions)

- Defines temporary named result sets
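- Logically resolved first, before the main query's `FROM`, so later clauses can reference the name
- Many engines inline a CTE into the main query instead of materializing it, so "first" describes name resolution, not necessarily physical execution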

```sql
WITH top_sellers AS (
  SELECT seller_id, SUM(sales) AS total_sales
  FROM orders
  GROUP BY seller_id
)
SELECT * FROM top_sellers WHERE total_sales > 10000;
```

---

## 🔹 2. `FROM`

- Identifies source tables or subqueries

```sql
SELECT * FROM employees;
```

---

## 🔹 3. `JOIN` + `ON`

- Combines rows from multiple tables

```sql
SELECT e.name, d.name
FROM employees e
JOIN departments d ON e.dept_id = d.id;
```

---

## 🔹 4. `WHERE`

- Filters rows before aggregation

```sql
SELECT * FROM orders WHERE status = 'shipped';
```

---

## 🔹 5. `GROUP BY`

- Aggregates rows into groups

```sql
SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id;
```

---

## 🔹 6. `ROLLUP`, `CUBE`, `GROUPING SETS`

- Multi-level aggregation and dimensional summaries

```sql
SELECT region, product, SUM(sales)
FROM orders
GROUP BY ROLLUP(region, product);
```

```sql
SELECT region, product, SUM(sales)
FROM orders
GROUP BY CUBE(region, product);
```

```sql
SELECT region, product, SUM(sales)
FROM orders
GROUP BY GROUPING SETS (
  (region, product),
  (region),
  ()
);
```
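One way to read `GROUPING SETS` is as shorthand for several `GROUP BY` queries glued together with `UNION ALL`. The sketch below illustrates that reading with Python's built-in `sqlite3` module; SQLite has no `GROUPING SETS`, so the expansion is written out by hand. The three sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("North", "A", 10), ("North", "B", 20), ("South", "A", 5)],  # made-up data
)

# Hand-written expansion of GROUPING SETS ((region, product), (region), ()).
# Columns that are not part of a grouping set are returned as NULL.
rows = conn.execute(
    """
    SELECT region, product, SUM(sales) FROM orders GROUP BY region, product
    UNION ALL
    SELECT region, NULL, SUM(sales) FROM orders GROUP BY region
    UNION ALL
    SELECT NULL, NULL, SUM(sales) FROM orders
    """
).fetchall()

for row in rows:
    print(row)
```

This particular combination of grouping sets is also what `ROLLUP(region, product)` produces: detail rows, per-region subtotals, and a grand total.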

---

## 🔹 7. `HAVING`

- Filters groups after aggregation

```sql
SELECT dept_id, COUNT(*) AS total
FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 10;
```
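The difference between `WHERE` and `HAVING` is easiest to see when both appear in one query. This is a minimal sketch using Python's built-in `sqlite3` module; the table contents and thresholds are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [  # made-up data
        ("Ann", 1, 120), ("Bob", 1, 80), ("Cid", 1, 40),
        ("Dee", 2, 90), ("Eve", 2, 30),
    ],
)

# WHERE removes individual rows first (salary >= 50),
# GROUP BY then counts what is left per department,
# and HAVING finally drops groups with fewer than 2 remaining rows.
rows = conn.execute(
    """
    SELECT dept_id, COUNT(*) AS total
    FROM employees
    WHERE salary >= 50
    GROUP BY dept_id
    HAVING COUNT(*) >= 2
    ORDER BY dept_id
    """
).fetchall()

print(rows)
```

Department 2 starts with two employees, but only one survives the `WHERE` filter, so the group fails the `HAVING` test and disappears from the result.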

---

## 🔹 8. `WINDOW` (Analytic Functions)

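- Calculates values over partitions without collapsing rows
- Evaluated after `WHERE`, `GROUP BY`, and `HAVING`, so a window function cannot appear in those clauses; filter on its result from an outer query or CTE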

```sql
SELECT name, salary,
       RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS salary_rank  -- avoid `rank` as an alias: it is reserved in some dialects
FROM employees;
```
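Because window functions are computed late in the logical order, filtering on their result usually means wrapping the query in a CTE or subquery. The sketch below shows that pattern with Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", 1, 120), ("Bob", 1, 80), ("Dee", 2, 90), ("Eve", 2, 30)],  # made-up data
)

# The rank is computed inside the CTE; the outer query can then filter on it
# with an ordinary WHERE, because by that point it is just a column.
top_paid = conn.execute(
    """
    WITH ranked AS (
      SELECT name, dept_id,
             RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS salary_rank
      FROM employees
    )
    SELECT name, dept_id
    FROM ranked
    WHERE salary_rank = 1
    ORDER BY dept_id
    """
).fetchall()

print(top_paid)
```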

---

## 🔹 9. `SELECT`

- Projects columns and expressions

```sql
SELECT name, salary FROM employees;
```

---

## 🔹 10. `PIVOT` / `UNPIVOT`

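- Reshapes data between wide and tall formats
- In SQL Server, `PIVOT` and `UNPIVOT` are table operators written inside the `FROM` clause, so logically they are evaluated together with `FROM`, before `WHERE` and `SELECT`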

```sql
-- PIVOT (SQL Server)
SELECT *
FROM (
  SELECT year, region, sales FROM orders
) AS src
PIVOT (
  SUM(sales) FOR region IN ([North], [South], [East], [West])
) AS pvt;
```

```sql
-- UNPIVOT
SELECT year, region, sales
FROM (
  SELECT year, North, South, East, West FROM sales_summary
) AS src
UNPIVOT (
  sales FOR region IN (North, South, East, West)
) AS unpvt;
```
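`PIVOT` syntax is dialect-specific, but the same reshaping can be expressed portably with conditional aggregation. The following is a sketch of that alternative technique (not the `PIVOT` operator itself), run through Python's built-in `sqlite3` module on invented data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (year INTEGER, region TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [  # made-up data
        (2023, "North", 10), (2023, "South", 20), (2023, "North", 5),
        (2024, "North", 7), (2024, "South", 3),
    ],
)

# One SUM(CASE ...) per output column: each CASE keeps only the rows for its
# region, and GROUP BY year collapses them into one row per year.
pivoted = conn.execute(
    """
    SELECT year,
           SUM(CASE WHEN region = 'North' THEN sales END) AS north,
           SUM(CASE WHEN region = 'South' THEN sales END) AS south
    FROM orders
    GROUP BY year
    ORDER BY year
    """
).fetchall()

print(pivoted)
```

With no `ELSE` branch, a year that has no rows for a region yields `NULL` in that column rather than 0.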

---

## 🔹 11. `CASE`

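- Applies conditional logic, most often during projection
- `CASE` is an expression rather than a clause: it is evaluated as part of whichever clause contains it (`SELECT`, `WHERE`, `ORDER BY`, and so on)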

```sql
SELECT name,
       CASE
         WHEN salary > 100000 THEN 'High'
         WHEN salary > 50000 THEN 'Medium'
         ELSE 'Low'
       END AS salary_band
FROM employees;
```

---

## 🔹 12. `DISTINCT`

- Removes duplicate rows from result

```sql
SELECT DISTINCT dept_id FROM employees;
```

---

## 🔹 13. `UNION` / `INTERSECT` / `EXCEPT`

- Combines or compares result sets vertically

```sql
SELECT name FROM customers
UNION
SELECT name FROM vendors;

SELECT name FROM customers
INTERSECT
SELECT name FROM vendors;

SELECT name FROM customers
EXCEPT
SELECT name FROM vendors;
```
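The set operators differ mainly in how they treat duplicates and which side's rows they keep. This sketch runs each one against two tiny invented tables using Python's built-in `sqlite3` module; the helper `names` is a hypothetical convenience function, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.execute("CREATE TABLE vendors (name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)", [("Ann",), ("Bob",), ("Bob",)])
conn.executemany("INSERT INTO vendors VALUES (?)", [("Bob",), ("Cid",)])


def names(sql):
    """Run a single-column query and return its values as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


# UNION removes duplicates; UNION ALL keeps every row from both sides.
union = names("SELECT name FROM customers UNION SELECT name FROM vendors")
union_all = names("SELECT name FROM customers UNION ALL SELECT name FROM vendors")
# INTERSECT keeps names present on both sides; EXCEPT keeps left-only names.
intersect = names("SELECT name FROM customers INTERSECT SELECT name FROM vendors")
except_ = names("SELECT name FROM customers EXCEPT SELECT name FROM vendors")

print(union, union_all, intersect, except_)
```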

---

## 🔹 14. `ORDER BY`

- Sorts final result set

```sql
SELECT name FROM employees ORDER BY salary DESC;
```

---

## 🔹 15. `OFFSET` / `FETCH` / `LIMIT`

- Applies pagination

```sql
-- Standard SQL syntax (SQL Server 2012+, Oracle 12c+, PostgreSQL)
SELECT name FROM employees ORDER BY salary DESC
OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY;
-- LIMIT syntax (MySQL, PostgreSQL, SQLite)
SELECT name FROM employees ORDER BY salary DESC LIMIT 5 OFFSET 10;
```
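Pagination only gives stable pages when the `ORDER BY` is deterministic, so a tie-breaker column is usually worth adding. Below is a minimal sketch with Python's built-in `sqlite3` module; `fetch_page` is a hypothetical helper and the employee rows are generated for illustration.

```python
import sqlite3


def fetch_page(conn, page, page_size):
    """Return one page of names, highest salary first (page is 1-based)."""
    offset = (page - 1) * page_size
    cursor = conn.execute(
        # `name` acts as a tie-breaker so equal salaries always sort the same way.
        "SELECT name FROM employees ORDER BY salary DESC, name LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(f"emp{i:02d}", 1000 - i) for i in range(12)],  # 12 made-up employees
)

page_1 = fetch_page(conn, 1, 5)  # rows 1-5
page_3 = fetch_page(conn, 3, 5)  # rows 11-12: the last page is shorter

print(page_1)
print(page_3)
```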

---

## 🧭 Final Visual Summary

| Step | Clause                      | Purpose                              |
|------|-----------------------------|--------------------------------------|
| 1    | `WITH`                      | Define temporary result sets         |
| 2    | `FROM`                      | Identify source tables               |
| 3    | `JOIN` + `ON`               | Combine tables                       |
| 4    | `WHERE`                     | Filter rows                          |
| 5    | `GROUP BY`                  | Aggregate rows                       |
| 6    | `ROLLUP/CUBE/GROUPING SETS` | Multi-level aggregation              |
| 7    | `HAVING`                    | Filter groups                        |
| 8    | `WINDOW`                    | Compute analytics                    |
| 9    | `SELECT`                    | Project columns and expressions (including `CAST` and aliases) |
| 10   | `PIVOT/UNPIVOT`             | Reshape data                         |
| 11   | `CASE`                      | Apply conditional logic              |
| 12   | `DISTINCT`                  | Remove duplicates                    |
| 13   | `UNION/INTERSECT/EXCEPT`    | Set operations                       |
| 14   | `ORDER BY`                  | Sort results                         |
| 15   | `OFFSET/FETCH/LIMIT`        | Paginate results                     |

---




# 🧠 SQL vs PySpark Keyword Comparison — Unified Cheat Sheet

This guide maps each SQL clause or keyword to its PySpark DataFrame API equivalent, with usage notes and examples. Perfect for learners, architects, and DataGym contributors.

---

## 🔍 SQL vs PySpark Mapping Table

| SQL Keyword              | PySpark Equivalent                             | Purpose / Notes                                                                 | Example                                                  |
|--------------------------|------------------------------------------------|----------------------------------------------------------------------------------|----------------------------------------------------------|
| `SELECT`                 | `df.select(...)`                               | Column projection                                                               | `df.select("name", "salary")`                           |
| `FROM`                   | `spark.read`, `spark.table`, `df`              | Load data from source                                                           | `spark.read.csv("file.csv")`                            |
| `WHERE`                  | `df.filter(...)`                               | Row-level filtering                                                             | `df.filter("age > 30")`                                 |
| `JOIN`                   | `df.join(df2, on=..., how=...)`                | Combine rows from multiple DataFrames                                           | `df1.join(df2, "id", "inner")`                          |
| `ON`                     | `on=` parameter in `.join()`                   | Join condition                                                                  | `df1.join(df2, df1.id == df2.id)`                       |
| `GROUP BY`               | `df.groupBy(...).agg(...)`                     | Aggregate rows                                                                  | `df.groupBy("dept").agg(F.sum("salary"))`              |
| `HAVING`                 | `.filter(...)` after `.groupBy().agg()`        | Filter aggregated results; alias the aggregate so the filter can name it        | `df.groupBy("dept").agg(F.sum("salary").alias("total")).filter("total > 1000")` |
| `ORDER BY`               | `df.orderBy(...)`                              | Sort result set                                                                 | `df.orderBy("salary", ascending=False)`                |
| `LIMIT` / `OFFSET`       | `df.limit(n)`, `df.offset(n)` (Spark 3.4+)     | Limit or skip rows                                                              | `df.orderBy("salary").offset(10).limit(5)`              |
| `DISTINCT`               | `df.distinct()`                                | Remove duplicate rows                                                           | `df.select("dept").distinct()`                         |
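| `UNION` / `UNION ALL`    | `df1.union(df2)` (add `.distinct()` for `UNION`) | `union` keeps duplicates like `UNION ALL` and matches columns by position      | `df1.union(df2).distinct()`                             |
| `EXCEPT`                 | `df1.subtract(df2)`                            | Distinct rows in `df1` not in `df2`; `exceptAll` keeps duplicates (`EXCEPT ALL`) | `df1.subtract(df2)`                                     |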
| `INTERSECT`              | `df1.intersect(df2)`                           | Common rows between `df1` and `df2`                                             | `df1.intersect(df2)`                                    |
| `WITH` (CTE)             | Intermediate DataFrame variable, or `df.createOrReplaceTempView("name")` | Name an intermediate result; Spark SQL itself also accepts `WITH` | `df.createOrReplaceTempView("sales_view")`             |
| `CASE`                   | `F.when(...).otherwise(...)`                  | Conditional logic                                                               | `df.withColumn("band", F.when(...).otherwise(...))`    |
| `WINDOW` Functions       | `Window.partitionBy(...).orderBy(...)`         | Analytic functions (rank, row_number, etc.)                                     | `F.rank().over(Window.partitionBy(...))`               |
| `ROLLUP`                 | `df.rollup(...).agg(...)`                      | Hierarchical aggregation                                                        | `df.rollup("region", "product").agg(F.sum("sales"))`   |
| `CUBE`                   | `df.cube(...).agg(...)`                        | Cross-dimensional aggregation                                                   | `df.cube("region", "product").agg(F.sum("sales"))`     |
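| `GROUPING SETS`          | `df.groupingSets([...], *cols).agg(...)` (PySpark 4.0+) | Custom group combinations; on older versions use `GROUP BY GROUPING SETS` via `spark.sql` | `df.groupingSets([["region"], ["product"]], "region", "product").agg(F.sum("sales"))` |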
| `PIVOT`                  | `df.groupBy(...).pivot(...).agg(...)`          | Convert rows to columns                                                         | `df.groupBy("year").pivot("region").sum("sales")`      |
| `UNPIVOT`                | `df.unpivot(...)` / `df.melt(...)` (Spark 3.4+) | Convert columns to rows; on older versions use `stack` through `selectExpr`     | `df.unpivot("year", ["North", "South"], "region", "sales")` |

---

## 🧬 Notes on Execution

- PySpark uses **lazy evaluation**: transformations are staged until an action like `.show()`, `.collect()`, or a write such as `.write.parquet(...)` is triggered.
- The **Catalyst optimizer** may **reorder operations** for performance (e.g., push filters before joins).
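- Recursive CTEs have historically required **manual workarounds** in PySpark. `UNPIVOT` and `OFFSET` need workarounds only on versions before Spark 3.4, which added `DataFrame.unpivot` and `DataFrame.offset`.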
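
Lazy evaluation can be illustrated without a cluster. The sketch below is a plain-Python analogy using generators, not Spark code: building the pipeline runs nothing, and only consuming it (the "action") pulls rows through. The names `read_rows` and `log` are hypothetical.

```python
log = []


def read_rows():
    """Pretend data source that records every row it actually reads."""
    for n in [1, 2, 3, 4]:
        log.append(f"read {n}")
        yield n


# "Transformations": a filter and a projection are staged, nothing is read yet.
pipeline = (n * 10 for n in read_rows() if n % 2 == 0)
staged = list(log)  # snapshot of the log before any action; still empty

# "Action": materializing the result finally drives the whole pipeline.
result = list(pipeline)

print(staged)
print(result)
print(log)
```

Spark goes further than this analogy: because it sees the whole staged plan before running it, Catalyst can also rearrange the steps, which a generator chain cannot do.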

---

## 🔍 SQL vs PySpark — Side-by-Side Example

### 🧾 SQL
```sql
SELECT region, product, SUM(sales) AS total_sales
FROM orders
WHERE status = 'shipped'
GROUP BY CUBE(region, product)
HAVING SUM(sales) > 10000
ORDER BY total_sales DESC
LIMIT 10;
```

### 🧪 PySpark
```python
from pyspark.sql import functions as F

df = spark.read.csv("orders.csv", header=True, inferSchema=True)

result = (
    df.filter("status = 'shipped'")
      .cube("region", "product")
      .agg(F.sum("sales").alias("total_sales"))
      .filter("total_sales > 10000")
      .orderBy(F.desc("total_sales"))
      .limit(10)
)
result.show()
```

---

## 🧠 Key Differences

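- **SQL is declarative**: you describe the result you want and the engine picks the plan.
- **PySpark's DataFrame API reads procedurally**: you chain transformations step by step, but Spark still compiles the chain into a plan and optimizes it.
- **SQL's logical clause order is fixed**: in both SQL and PySpark, the optimizer may still reorder physical operations when the result is unchanged.
- **CTEs (`WITH`)** are supported in Spark SQL (for example through `spark.sql(...)`). The DataFrame API has no `WITH` method, so you assign an intermediate DataFrame to a variable or register a temp view instead.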

---

## 🔍 Bonus: How Spark Executes It

1. **Unresolved Logical Plan**: Based on your code.
2. **Analyzed (Resolved) Logical Plan**: Spark resolves column names and types against the catalog.
3. **Optimized Logical Plan**: Catalyst applies rules (e.g., predicate pushdown).
4. **Physical Plan**: Spark decides how to execute (e.g., shuffle, broadcast).
5. **Execution**: Spark runs the plan across the cluster.

Inspect it using:

```python
df.explain(True)
```

---



# 🧠 PySpark for SQL Pros: Transition Guide

This guide helps you bridge the gap between SQL’s declarative style and PySpark’s transformation-first, distributed mindset. It highlights what feels familiar, what behaves differently, and where PySpark offers unique superpowers.

---

## 🔹 What Feels Familiar (SQL-like in PySpark)

| SQL Concept         | PySpark Equivalent             | Notes |
|---------------------|--------------------------------|-------|
| `SELECT`            | `df.select(...)`               | Column projection |
| `WHERE`             | `df.filter(...)`               | Row-level filtering |
| `GROUP BY` + `HAVING` | `df.groupBy().agg().filter()` | Aggregation + post-filter |
| `ORDER BY`          | `df.orderBy(...)`              | Sorting |
| `JOIN`              | `df.join(...)`                 | Supports all join types |
| `CASE`              | `F.when(...).otherwise(...)`   | Conditional logic |
| `DISTINCT`          | `df.distinct()`                | Removes duplicates |
| `UNION ALL`         | `df.union(...)`                | Vertical merge that keeps duplicates; add `.distinct()` for `UNION` |
| `LIMIT`             | `df.limit(n)`                  | Row slicing |

✅ These will feel intuitive—you’re just swapping SQL syntax for PySpark methods.

---

## ⚠️ What Behaves Differently (Surprise Zones)

| PySpark Feature         | SQL Equivalent | Surprise Factor |
|-------------------------|----------------|-----------------|
| `withColumn()`          | ❌             | Add/transform columns dynamically |
| `withColumnRenamed()`   | ❌             | Rename columns outside `SELECT` |
| `selectExpr()`          | ❌             | SQL-like expressions in strings |
| `explode()`             | ❌             | Flatten arrays/maps into rows |
| `split()`               | ❌             | Turns strings into arrays |
| `array()`, `create_map()`, `struct()` | ❌     | Constructs complex types |
| `get_json_object()`     | ❌             | Parses JSON strings inline |
| `drop()`                | ❌             | Removes columns mid-pipeline |
| `alias()`               | ✅             | More flexible in PySpark |
| `cast()`                | ✅             | Same logic, different syntax |
| `filter()` chaining     | ✅             | Procedural, not declarative |
| `Window` functions      | ✅             | Requires explicit `WindowSpec` |
| `UDFs` / `Pandas UDFs`  | ❌             | Custom Python logic on columns |

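💡 Rows marked ❌ have no dedicated standard-SQL clause, although several of these functions (`explode`, `split`, `get_json_object`) can also be called from Spark SQL. This is where PySpark becomes a **data engineering toolkit**, not just a query language.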

---

## 🔧 Types of Transformations in PySpark

### 1. **Narrow Transformations**
- Operate on a single partition
- No shuffle required
- ✅ Fast and efficient

Examples:
```python
df.select("name")
df.filter("age > 30")
df.withColumn("discount", F.col("price") * 0.9)
```

### 2. **Wide Transformations**
- Require data movement across partitions
- Trigger shuffles
- ⚠️ Expensive if not optimized

Examples:
```python
df.groupBy("dept").agg(F.sum("salary"))
df.join(other_df, "id")
df.orderBy("salary")
```
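
To make the narrow/wide distinction concrete, here is a plain-Python simulation (not Spark code) in which each inner list stands for one partition. The data, the toy partitioner, and the helper `shuffle_by_key` are all invented for illustration.

```python
from collections import defaultdict

# Two "partitions" of (dept, salary) rows, as if stored on different workers.
partitions = [
    [("eng", 100), ("ops", 50)],
    [("eng", 70), ("ops", 30), ("hr", 40)],
]

# Narrow: every output partition is computed from exactly one input partition,
# so no rows have to move between workers.
filtered = [[row for row in part if row[1] >= 50] for part in partitions]


def shuffle_by_key(parts, num_partitions):
    """Wide step: route every row to the partition that owns its key."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_partitions  # toy partitioner
            shuffled[target].append((key, value))
    return shuffled


# After the shuffle, all rows for a key sit in one partition, so each
# partition can compute its sums independently.
totals = {}
for part in shuffle_by_key(partitions, 2):
    sums = defaultdict(int)
    for key, value in part:
        sums[key] += value
    totals.update(sums)

print(filtered)
print(totals)
```

The filter touches each partition in isolation, while the aggregation first has to move `eng` and `ops` rows so that equal keys meet. That data movement is the cost a shuffle adds.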

---

## 🔍 Execution Model Differences

| Concept               | SQL Engine                  | PySpark Engine                     |
|-----------------------|-----------------------------|------------------------------------|
| Evaluation            | Declarative                 | Lazy (until `.show()`, `.collect()`) |
| Optimization          | Rule-based + cost-based     | Catalyst optimizer + DAG scheduler |
| Execution             | Single-node or MPP          | Distributed across cluster         |
| Schema enforcement    | Strict                      | Flexible, inferred or defined      |
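| Error handling        | Immediate                   | Analysis errors (e.g., unknown column) usually surface when the transformation is defined; runtime errors are deferred until an action |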

---

## 🔁 Does Execution Order Matter in PySpark?

### ✅ Yes—but differently than SQL.

### 🔹 In SQL:
- Execution order is **fixed and logical**.
- You write `SELECT` first, but it’s evaluated after `FROM`, `JOIN`, `WHERE`, etc.
- The engine reorders and optimizes internally.

### 🔸 In PySpark:
- You **chain transformations** in procedural order.
- Each step **depends on the previous one**.
- Spark builds a **logical plan**, then optimizes it—but doesn’t rewrite your logic unless safe.

---

## 🧪 Examples of Chaining Sensitivity

### ✅ Correct:
```python
(df.withColumn("discounted", F.col("price") * 0.9)   # parentheses allow the chain to span lines
   .filter(F.col("discounted") > 100))
```

### ❌ Incorrect:
```python
df.filter(F.col("discounted") > 100)  # 'discounted' doesn't exist yet!
```

### ⚠️ Semantic Shift:
```python
df.orderBy("salary").limit(10)  # Sort full dataset, then take top 10
df.limit(10).orderBy("salary")  # Take 10 arbitrary rows, then sort only those
```
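
The same semantic shift can be reproduced with an ordinary Python list, which makes the difference easy to check by hand. This is an analogy on invented numbers, not Spark code; note that a list slice always takes the first items, whereas Spark's `limit` without a preceding sort may return any rows.

```python
salaries = [50, 90, 10, 70, 30]  # made-up data, in arrival order

# Sort everything, then keep two rows: the two lowest salaries overall.
sort_then_limit = sorted(salaries)[:2]

# Keep two rows first, then sort: only those two rows are ever considered.
limit_then_sort = sorted(salaries[:2])

print(sort_then_limit)
print(limit_then_sort)
```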

---

## 🔍 Summary: SQL vs PySpark Execution Logic

| Aspect                     | SQL                          | PySpark                          |
|----------------------------|------------------------------|----------------------------------|
| Execution order            | Fixed (logical)              | Procedural (chained)             |
| Optimizer reordering       | Yes, when results are unchanged | Yes (Catalyst), when results are unchanged |
| Dependency between steps   | Abstracted                   | Explicit                         |
| Mistakes from wrong order  | Rare (the grammar enforces clause order) | Common (breaks or misbehaves) |

---

## 🧠 Transition Tips

- ✅ Think in **transformations**, not queries.
- ✅ Use `df.explain(True)` to inspect plans.
- ✅ Cache intermediate results if reused.
- ✅ Prefer built-in functions (`F.*`) over UDFs for performance.
- ✅ Watch out for **wide transformations**—they’re your new bottlenecks.

---

