## Set Operations in Spark SQL

Like traditional relational databases, Spark SQL provides set operations for comparing and combining datasets: `UNION`, `INTERSECT`, and `EXCEPT` (also available as `MINUS`). Both inputs must have the same number of columns, and columns are matched by position rather than by name.

- **UNION**

  The `UNION` operation merges two datasets by appending rows vertically. It comes in two forms:

  - `UNION DISTINCT` (or simply `UNION`): eliminates duplicate rows in the result.
  - `UNION ALL`: retains all rows, including duplicates, from both datasets.

- **INTERSECT**

  The `INTERSECT` operation returns only the rows that exist in both datasets, with duplicates removed. It is commonly used to find overlapping records.

- **EXCEPT (MINUS)**

  The `EXCEPT` operation returns the distinct rows from the first dataset that do not appear in the second dataset. The order of the two datasets matters.

- **PIVOT**

  Although it is not a set operation, the `PIVOT` clause is covered here as well. It reshapes data by turning row values into columns, which makes side-by-side comparison easier.

In [0]:
CREATE OR REPLACE TEMP VIEW customers_january AS
SELECT *
FROM VALUES
  (1, 'Alice'),
  (2, 'Bob'),
  (3, 'Charlie'),
  (4, 'David') AS t(customer_id, customer_name);


-- February customers: IDs 3 and 4 also appear in January
CREATE OR REPLACE TEMP VIEW customers_february AS
SELECT * 
FROM VALUES
  (3, 'Charlie'),
  (4, 'David'),
  (5, 'Eve'),
  (6, 'Frank') AS t(customer_id, customer_name);

### UNION ALL - Combine all customers

In [0]:
SELECT * FROM customers_january
UNION ALL
SELECT * FROM customers_february;

### UNION DISTINCT - Combine and remove duplicates

In [0]:
SELECT * FROM customers_january
UNION
SELECT * FROM customers_february;
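
To make the difference between the two forms concrete, the following sketch counts the rows each one returns. It assumes the `customers_january` and `customers_february` views defined above; the `operation` and `row_count` column names and the subquery aliases are illustrative only.

```sql
-- Count the rows produced by each form of UNION.
SELECT 'UNION ALL' AS operation, COUNT(*) AS row_count
FROM (
  SELECT * FROM customers_january
  UNION ALL
  SELECT * FROM customers_february
) AS all_rows

UNION ALL

SELECT 'UNION' AS operation, COUNT(*) AS row_count
FROM (
  SELECT * FROM customers_january
  UNION
  SELECT * FROM customers_february
) AS distinct_rows;

-- With the sample data: UNION ALL -> 8 rows, UNION -> 6 rows
-- (customers 3 and 4 appear in both months).
```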

### INTERSECT - Customers present in both months

In [0]:
SELECT * FROM customers_january
INTERSECT
SELECT * FROM customers_february;

### EXCEPT - Customers only in January

In [0]:
SELECT * FROM customers_january
MINUS -- or EXCEPT
SELECT * FROM customers_february;
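
Because `EXCEPT` is not symmetric, swapping the two inputs answers a different question. As a small sketch using the same two views, this returns the customers who appear only in February:

```sql
-- Customers present in February but not in January.
SELECT * FROM customers_february
EXCEPT
SELECT * FROM customers_january;

-- With the sample data this yields (5, 'Eve') and (6, 'Frank');
-- row order is not guaranteed without an ORDER BY.
```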

### PIVOT - Show total amount spent per customer across months

In [0]:
CREATE OR REPLACE TEMP VIEW customer_purchases AS
SELECT * 
FROM VALUES
  (1, 'January', 120),
  (1, 'February', 80),
  (2, 'January', 150),
  (3, 'January', 200),
  (3, 'February', 180),
  (4, 'February', 220)
AS t(customer_id, month, amount);

### PIVOT Query

The query syntax for generating the pivot table involves the following steps:

#### 1. **Selecting the Input Data**

You start by selecting the data you want to transform. This input can be:

- A base table
- A subquery that filters or joins data

**Example**

```SELECT customer_id, month, amount FROM customer_purchases```

This defines the **rows** you want to pivot.

#### 2. **The PIVOT Clause**
The `PIVOT` block contains three critical components:

#### a) **Aggregation Function**

This defines:

* **What aggregation you want to perform** (like `SUM`, `AVG`, `COUNT`)
* **On which column**

**Example:**

```SUM(amount)```

This means: *For each customer, sum the `amount` that falls under each pivoted month.*

#### b) **FOR Subclause**

This specifies:

* The **pivot column** whose values will become the *new columns* in your output.

**Example:**

```FOR month```

This means: *Turn the unique values of `month` (January, February) into columns.*

#### c) **IN Operator**

This specifies:

* The **distinct values of the pivot column** you want as output columns.
* An optional **alias for each value**, which becomes the name of the output column.

**Example:**

```IN ('January' AS January, 'February' AS February)```

This will produce **two columns named January and February** in your result. If you omit an alias, the column is named after the value itself.

### Putting it All Together

In [0]:
SELECT 
  customer_id,
  COALESCE(January, 0) AS January,
  COALESCE(Feb, 0) AS February
FROM (
  SELECT customer_id, month, amount
  FROM customer_purchases
)
PIVOT (
  SUM(amount)
  FOR month IN ('January', 'February' AS Feb)
)
ORDER BY customer_id;

You can also wrap the pivot inside a CTE or subquery if you prefer:

In [0]:
WITH pivoted AS (
  SELECT *
  FROM (
    SELECT customer_id, month, amount
    FROM customer_purchases
  )
  PIVOT (
    SUM(amount)
    FOR month IN ('January', 'February' AS Feb)
  )
)

SELECT
  customer_id,
  COALESCE(January, 0) AS January,
  COALESCE(Feb, 0) AS February
FROM pivoted
ORDER BY customer_id;
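
For intuition about what `PIVOT` is doing, the same result can be sketched with conditional aggregation and an explicit `GROUP BY`. This is an illustrative equivalent for this particular dataset, not a description of how Spark executes the pivot internally.

```sql
-- Conditional aggregation: one SUM(CASE ...) per pivoted month value.
SELECT
  customer_id,
  COALESCE(SUM(CASE WHEN month = 'January'  THEN amount END), 0) AS January,
  COALESCE(SUM(CASE WHEN month = 'February' THEN amount END), 0) AS February
FROM customer_purchases
GROUP BY customer_id
ORDER BY customer_id;

-- With the sample data:
--   1 | 120 |  80
--   2 | 150 |   0
--   3 | 200 | 180
--   4 |   0 | 220
```

The `GROUP BY customer_id` here plays the role of the implicit grouping in `PIVOT`, where every input column that is neither aggregated nor named in the `FOR` subclause becomes a grouping column.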
