In [21]:
%run tools.ipynb

```SQL
SELECT
    Employee,
    Department,
    Sales,
    DENSE_RANK() OVER (PARTITION BY Department ORDER BY Sales DESC) AS Rank
FROM employee_sales;
```


In [22]:
windowSpec = Window.partitionBy("age").orderBy(col("key").desc())


In [26]:

df_ranked = df.withColumn("dense_rank", dense_rank().over(windowSpec))\
.withColumn("Rank", rank().over(windowSpec))\
.withColumn("sum", sum(col("age")).over(windowSpec))

# Show Output
df_ranked.show()

+---+----------+---------+---+-------------+----------+----+---+
|key|first_name|last_name|age|         city|dense_rank|Rank|sum|
+---+----------+---------+---+-------------+----------+----+---+
| 41|     Syzmr|  Veqyann| 18|  Los Angeles|         1|   1| 18|
| 31|     Nqwlo|  Eblxujd| 18|     New York|         2|   2| 36|
| 47|     Lhzsl|  Ucykpsb| 19|        Miami|         1|   1| 19|
| 30|     Omhcr|  Mgjoyqa| 19|       Boston|         2|   2| 38|
| 11|     Rbynr|  Wivyegj| 19|      Chicago|         3|   3| 57|
| 93|     Hzycx|  Rpgoexz| 20|      Seattle|         1|   1| 20|
| 78|     Hbbcb|  Cqthngf| 20|      Seattle|         2|   2| 40|
| 63|     Urlnz|  Oekczoy| 20|       Boston|         3|   3| 60|
| 54|     Afszn|  Ymsskrw| 20|       Boston|         4|   4| 80|
| 46|     Qxwjh|  Cwmazca| 20|  Los Angeles|         5|   5|100|
| 34|     Wqull|  Ogxknus| 20|        Miami|         6|   6|120|
| 75|     Mebup|  Nampubi| 22|      Chicago|         1|   1| 22|
| 21|     Aznkq|  Ovqzdmt

### **Scenario: Advanced Sales Analysis**

We have an `employee_sales` dataset with columns:

- Employee
- Department
- Month
- Sales

**Objectives**

1. Rank employees by Sales within each Department (`DENSE_RANK`).
2. Calculate a running total of Sales for each Department (`SUM`).
3. Compute each Department's average Sales on every row, as the basis for the difference between an employee's Sales and that average (`AVG`).
4. Get the previous month's Sales for each Employee (`LAG`).


```SQL
SELECT
    Employee,
    Department,
    Month,
    Sales,
    DENSE_RANK() OVER (PARTITION BY Department ORDER BY Sales DESC) AS Rank,
    SUM(Sales) OVER (PARTITION BY Department ORDER BY Month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotalSales,
    AVG(Sales) OVER (PARTITION BY Department) AS DeptAvgSales,
    LAG(Sales, 1) OVER (PARTITION BY Employee ORDER BY Month) AS PreviousMonthSales
FROM employee_sales;

```
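
Objective 3 needs one step beyond what the query returns: subtracting the windowed average from each row's `Sales`. The following is a minimal plain-Python sketch of that arithmetic (no Spark involved), using the IT rows from the sample data in the next cell. The variable names are illustrative only.

```python
# IT department rows: (Employee, Month, Sales), taken from the sample data.
it_rows = [
    ("Alice", "Jan", 5000),
    ("Bob", "Jan", 7000),
    ("Charlie", "Jan", 6000),
    ("Alice", "Feb", 6500),
    ("Bob", "Feb", 8000),
    ("Charlie", "Feb", 7500),
]

# AVG(Sales) OVER (PARTITION BY Department): one value shared by every row.
dept_avg = sum(sales for _, _, sales in it_rows) / len(it_rows)

# Difference between each row's Sales and the department average.
diffs = [(emp, month, round(sales - dept_avg, 2)) for emp, month, sales in it_rows]

print(round(dept_avg, 2))  # 6666.67
for row in diffs:
    print(row)
```

In SQL or PySpark the same result would come from subtracting the `DeptAvgSales` expression from `Sales` in the same select.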


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank, sum, avg, lag

# Create Spark Session
spark = SparkSession.builder.appName("ComplexWindowFunctions").getOrCreate()

# Sample Data
data = [
    ("Alice", "IT", "Jan", 5000),
    ("Bob", "IT", "Jan", 7000),
    ("Charlie", "IT", "Jan", 6000),
    ("Alice", "IT", "Feb", 6500),
    ("Bob", "IT", "Feb", 8000),
    ("Charlie", "IT", "Feb", 7500),
    ("David", "HR", "Jan", 4000),
    ("Eve", "HR", "Jan", 4500),
    ("David", "HR", "Feb", 4800),
    ("Eve", "HR", "Feb", 5300),
]

# Define Schema
columns = ["Employee", "Department", "Month", "Sales"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()

# Define Window Specifications
dept_window = Window.partitionBy("Department").orderBy(
    col("Sales").desc()
)  # For ranking
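# NOTE: Month is a string, so orderBy("Month") sorts alphabetically
# ("Feb" before "Jan"), not chronologically. Rows in the same month are also
# tied, so the order in which the running total adds them is arbitrary.
running_total_window = (
    Window.partitionBy("Department")
    .orderBy("Month")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)  # Running total
dept_avg_window = Window.partitionBy("Department")  # Department-wise average
# Same caveat: with string months, lag() returns Jan's value on the Feb row
# only if Jan sorts first, which it does not alphabetically.
previous_sales_window = Window.partitionBy("Employee").orderBy(
    "Month"
)  # Get previous month's sales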

# Apply Window Functions
df_result = (
    df.withColumn("Rank", dense_rank().over(dept_window))
    .withColumn("RunningTotalSales", sum("Sales").over(running_total_window))
    .withColumn("DeptAvgSales", avg("Sales").over(dept_avg_window))
    .withColumn("PreviousMonthSales", lag("Sales", 1).over(previous_sales_window))
)

# Show Result
df_result.show()

+--------+----------+-----+-----+
|Employee|Department|Month|Sales|
+--------+----------+-----+-----+
|   Alice|        IT|  Jan| 5000|
|     Bob|        IT|  Jan| 7000|
| Charlie|        IT|  Jan| 6000|
|   Alice|        IT|  Feb| 6500|
|     Bob|        IT|  Feb| 8000|
| Charlie|        IT|  Feb| 7500|
|   David|        HR|  Jan| 4000|
|     Eve|        HR|  Jan| 4500|
|   David|        HR|  Feb| 4800|
|     Eve|        HR|  Feb| 5300|
+--------+----------+-----+-----+

+--------+----------+-----+-----+----+-----------------+-----------------+------------------+
|Employee|Department|Month|Sales|Rank|RunningTotalSales|     DeptAvgSales|PreviousMonthSales|
+--------+----------+-----+-----+----+-----------------+-----------------+------------------+
|   Alice|        IT|  Feb| 6500|   4|            22000|6666.666666666667|              NULL|
|   Alice|        IT|  Jan| 5000|   6|            40000|6666.666666666667|              6500|
|     Bob|        IT|  Feb| 8000|   1|             80
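
The `NULL` in Alice's Feb row above follows from `Month` being a string: the window orders months alphabetically, so Feb comes before Jan and Feb has no "previous" row. The plain-Python sketch below reproduces that effect and shows one possible fix, ordering by a numeric month. `MONTH_NUM` and `lag_by` are hypothetical helpers for illustration and are not part of the notebook.

```python
# Hypothetical lookup table; not part of the original notebook.
MONTH_NUM = {"Jan": 1, "Feb": 2}

# Alice's rows from the sample data: (Month, Sales).
alice = [("Jan", 5000), ("Feb", 6500)]


def lag_by(rows, key):
    """Sort rows with `key`, then pair each row with the previous row's Sales."""
    ordered = sorted(rows, key=key)
    result = []
    previous_sales = None  # the first row has no predecessor, like SQL NULL
    for month, sales in ordered:
        result.append((month, sales, previous_sales))
        previous_sales = sales
    return result


alphabetical = lag_by(alice, key=lambda row: row[0])
chronological = lag_by(alice, key=lambda row: MONTH_NUM[row[0]])

print(alphabetical)   # [('Feb', 6500, None), ('Jan', 5000, 6500)]
print(chronological)  # [('Jan', 5000, None), ('Feb', 6500, 5000)]
```

In the DataFrame, the equivalent would be to add a numeric or date-typed month column and order the window by that column instead.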

### **Which Functions Can We Use as Window Functions?**

In PySpark and SQL, **window functions** allow us to perform calculations across a set of table rows that are related to the current row. These functions can be classified into **three main categories**:

---

## **1. Ranking Functions**

These functions assign a rank or position to each row within a partition, based on the window's `ORDER BY`.

| Function | Description | Example |
|----------|------------|---------|
| `RANK()` | Tied rows share the same rank; the ranks that follow **skip** ahead, leaving gaps. | 1, 2, 2, **4** |
| `DENSE_RANK()` | Like `RANK()`, but **without gaps** after ties. | 1, 2, 2, **3** |
| `ROW_NUMBER()` | Assigns a unique sequential number to each row; ties are broken arbitrarily. | 1, 2, 3, 4 |
| `NTILE(n)` | Divides rows into **n** buckets that are as equal in size as possible. | `NTILE(4)` assigns quartiles 1 to 4 |

✅ **PySpark Example**

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, row_number, ntile

window_spec = Window.partitionBy("Department").orderBy("Sales")

df.withColumn("Rank", rank().over(window_spec)) \
  .withColumn("DenseRank", dense_rank().over(window_spec)) \
  .withColumn("RowNum", row_number().over(window_spec)) \
  .withColumn("Quartile", ntile(4).over(window_spec)) \
  .show()
```
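
When the row count is not divisible by `n`, `NTILE(n)` gives the earlier buckets one extra row each. The sketch below models that rule in plain Python. The helper name `ntile_buckets` is hypothetical, and this describes the standard SQL behavior rather than Spark's internal implementation.

```python
def ntile_buckets(row_count, n):
    """Return the bucket number assigned to each of `row_count` ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


# The IT department has 6 rows and HR has 4 in the sample data.
print(ntile_buckets(6, 4))  # [1, 1, 2, 2, 3, 4]
print(ntile_buckets(4, 4))  # [1, 2, 3, 4]
```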

✅ **SQL Example**

```sql
SELECT Employee, Department, Sales,
       RANK() OVER (PARTITION BY Department ORDER BY Sales) AS Rank,
       DENSE_RANK() OVER (PARTITION BY Department ORDER BY Sales) AS DenseRank,
       ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Sales) AS RowNum,
       NTILE(4) OVER (PARTITION BY Department ORDER BY Sales) AS Quartile
FROM employee_sales;
```
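
To make the tie handling concrete, here is a plain-Python sketch that computes all three numbering schemes over one already-partitioned list of values. It illustrates the semantics only and is not how Spark evaluates these functions; `rank_rows` is a hypothetical helper name.

```python
def rank_rows(values):
    """Return (value, row_number, rank, dense_rank) tuples, highest value first."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that compares unequal to every real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK advances by exactly one
            previous = value
        result.append((value, row_number, rank, dense))
    return result


for row in rank_rows([7000, 6000, 6000, 5000]):
    print(row)
# (7000, 1, 1, 1)
# (6000, 2, 2, 2)
# (6000, 3, 2, 2)
# (5000, 4, 4, 3)
```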

---

## **2. Aggregate Functions**

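These functions aggregate over the window frame. If the window has an `ORDER BY` and no explicit frame, the default frame runs from the start of the partition to the current row, so the result is cumulative; without an `ORDER BY`, the frame is the whole partition.

| Function  | Description                                              |
| --------- | -------------------------------------------------------- |
| `SUM()`   | Sum over the frame; a running total on an ordered window. |
| `AVG()`   | Average over the frame.                                  |
| `MIN()`   | Minimum value in the frame.                              |
| `MAX()`   | Maximum value in the frame.                              |
| `COUNT()` | Number of non-null values in the frame.                  |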

✅ **PySpark Example**

```python
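from pyspark.sql.window import Window
from pyspark.sql.functions import sum, avg, min, max, count  # shadows Python built-ins

# Ordered window with an explicit frame: running total within each department.
running_window = (
    Window.partitionBy("Department").orderBy("Sales")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
# Unordered window: the frame is the whole partition, matching the SQL below.
partition_window = Window.partitionBy("Department")

df.withColumn("RunningTotal", sum("Sales").over(running_window)) \
  .withColumn("AvgSales", avg("Sales").over(partition_window)) \
  .withColumn("MinSales", min("Sales").over(partition_window)) \
  .withColumn("MaxSales", max("Sales").over(partition_window)) \
  .withColumn("Count", count("Sales").over(partition_window)) \
  .show()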
```

✅ **SQL Example**

```sql
SELECT Employee, Department, Sales,
       SUM(Sales) OVER (PARTITION BY Department ORDER BY Sales ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal,
       AVG(Sales) OVER (PARTITION BY Department) AS AvgSales,
       MIN(Sales) OVER (PARTITION BY Department) AS MinSales,
       MAX(Sales) OVER (PARTITION BY Department) AS MaxSales,
       COUNT(Sales) OVER (PARTITION BY Department) AS Count
FROM employee_sales;
```
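
The difference between a running aggregate and a whole-partition aggregate comes down to which rows are in the frame. A minimal plain-Python sketch, using the HR sales values from the sample data sorted by `Sales`, shows both cases. It models the frame semantics only and involves no Spark.

```python
from itertools import accumulate

# HR partition from the sample data, ordered by Sales.
sales = [4000, 4500, 4800, 5300]

# Frame = UNBOUNDED PRECEDING .. CURRENT ROW: each row sees itself and all earlier rows.
running_total = list(accumulate(sales))

# Frame = whole partition: every row sees the same set of rows.
partition_avg = sum(sales) / len(sales)
partition_min = min(sales)
partition_max = max(sales)

print(running_total)  # [4000, 8500, 13300, 18600]
print(partition_avg)  # 4650.0
print(partition_min, partition_max)  # 4000 5300
```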

---

## **3. Offset (Lag/Lead) and First/Last Functions**

These functions help **compare** rows with previous or next ones.

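| Function             | Description                                                                                   |
| -------------------- | --------------------------------------------------------------------------------------------- |
| `LAG(col, n)`        | Retrieves the value from `n` rows before the current row (default 1).                         |
| `LEAD(col, n)`       | Retrieves the value from `n` rows after the current row (default 1).                          |
| `FIRST_VALUE(col)`   | Gets the first value in the window frame.                                                     |
| `LAST_VALUE(col)`    | Gets the last value in the window frame; extend the frame to `UNBOUNDED FOLLOWING` to get the partition's last value. |
| `NTH_VALUE(col, n)`  | Retrieves the **nth** value in the window frame.                                              |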

✅ **PySpark Example**

```python
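from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead, first, last

# Per-employee ordering for LAG/LEAD; Month must sort chronologically.
employee_window = Window.partitionBy("Employee").orderBy("Month")
# Full-partition frame so last() sees every row, not only rows up to the current one.
full_window = (
    Window.partitionBy("Department").orderBy("Sales")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

df.withColumn("PrevMonthSales", lag("Sales", 1).over(employee_window)) \
  .withColumn("NextMonthSales", lead("Sales", 1).over(employee_window)) \
  .withColumn("FirstSale", first("Sales").over(full_window)) \
  .withColumn("LastSale", last("Sales").over(full_window)) \
  .show()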
```

✅ **SQL Example**

```sql
SELECT Employee, Department, Sales,
       LAG(Sales, 1) OVER (PARTITION BY Employee ORDER BY Month) AS PrevMonthSales,
       LEAD(Sales, 1) OVER (PARTITION BY Employee ORDER BY Month) AS NextMonthSales,
       FIRST_VALUE(Sales) OVER (PARTITION BY Department ORDER BY Sales) AS FirstSale,
       LAST_VALUE(Sales) OVER (PARTITION BY Department ORDER BY Sales ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastSale
FROM employee_sales;
```
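
A common surprise with `LAST_VALUE` is the default frame: when a window has an `ORDER BY` but no explicit frame, the frame ends at the current row, so "last" means the current row's value rather than the partition's last value. The plain-Python sketch below models both frames for one partition. It assumes distinct `Sales` values, which keeps tied rows out of the picture.

```python
# One partition from the sample data (HR), already ordered by Sales.
sales = [4000, 4500, 4800, 5300]

# Default frame with ORDER BY: partition start up to the current row,
# so the last value in the frame is the current row itself.
last_default = [sales[: i + 1][-1] for i in range(len(sales))]

# Explicit frame UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING:
# every row sees the whole partition.
last_full = [sales[-1] for _ in sales]

print(last_default)  # [4000, 4500, 4800, 5300]
print(last_full)     # [5300, 5300, 5300, 5300]
```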

---

### **🚀 Summary Table: All Window Functions**

| Category        | Function            | Description                                   |
| --------------- | ------------------- | --------------------------------------------- |
| **Ranking**     | `RANK()`            | Ties share a rank; gaps follow ties           |
|                 | `DENSE_RANK()`      | Ties share a rank; no gaps                    |
|                 | `ROW_NUMBER()`      | Unique row number per partition               |
|                 | `NTILE(n)`          | Splits rows into n near-equal buckets         |
| **Aggregation** | `SUM()`             | Sum over the frame (running total if ordered) |
|                 | `AVG()`             | Average over the frame                        |
|                 | `MIN()`             | Minimum value in the frame                    |
|                 | `MAX()`             | Maximum value in the frame                    |
|                 | `COUNT()`           | Count of non-null values in the frame         |
| **Offset**      | `LAG(col, n)`       | Value from n rows before the current row      |
|                 | `LEAD(col, n)`      | Value from n rows after the current row       |
|                 | `FIRST_VALUE(col)`  | First value in the frame                      |
|                 | `LAST_VALUE(col)`   | Last value in the frame                       |
|                 | `NTH_VALUE(col, n)` | Nth value in the frame                        |

---

### **How Many Functions Can We Apply at Once?**

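There is **no strict limit** on the number of window functions you can apply in a single query, but each one has a cost:

- **Performance Consideration**: Each distinct window specification (a different `PARTITION BY` or `ORDER BY`) generally needs its own shuffle and sort, so cost tends to grow with the number of distinct specifications rather than the number of functions. Functions that share a specification can usually be evaluated together.
- **Memory Constraints**: All rows of a partition are processed together, so a few very large partitions (for example, from skewed keys) can put pressure on executor memory.

💡 **Best Practice:** Apply only the functions you need, and reuse the same window specification across functions where possible.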

