# Advanced Transformations

Here is the schema of the bookstore dataset used in this notebook:

![bookstore dataset schema](../Includes/images/image1.png)

In [0]:
%run ../Includes/Copy-Datasets

## Working with JSON struct

In [0]:
SELECT * FROM customers

The `profile` column stores customer data as a JSON string. Its `address` field is itself a nested JSON object that contains the street, city, and country.

In [0]:
DESCRIBE customers

Spark SQL has built-in functionality to directly interact with JSON data stored as strings. Colon syntax can be used to traverse nested data structures.

In [0]:
SELECT customer_id, profile:first_name, profile:address:country
FROM customers
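
To make the traversal concrete, here is a minimal plain-Python sketch of what the colon syntax does conceptually: parse the JSON string, then follow one key per colon. It is not Spark code. The `json_path` helper and the `customer_id` value are assumptions made for illustration, and the profile string is the sample value used later in this notebook.

```python
import json

# Hypothetical row: the customer_id is made up for illustration; the profile
# string is the sample value that appears later in this notebook.
row = {
    "customer_id": "C00001",
    "profile": (
        '{"first_name":"Susana","last_name":"Gonnely","gender":"Female",'
        '"address":{"street":"760 Express Court","city":"Obrenovac","country":"Serbia"}}'
    ),
}


def json_path(json_string, *keys):
    """Parse a JSON string and follow the keys; return None if a key is missing."""
    value = json.loads(json_string)
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Conceptually similar to profile:first_name and profile:address:country
first_name = json_path(row["profile"], "first_name")
country = json_path(row["profile"], "address", "country")
print(row["customer_id"], first_name, country)
```

Note that the string is parsed on every lookup in this sketch, which is one reason to convert the column to a struct type when many fields are needed.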

Spark SQL also has the ability to parse JSON objects into struct types. **Struct** is a native Spark type with nested attributes. This can be done with the `from_json` function:

In [0]:
SELECT from_json(profile) AS profile_struct
  FROM customers

The query fails because `from_json` requires the schema of the JSON object. Fortunately, the schema can be derived from the current data. For that, a sample JSON value with non-null fields is needed:

In [0]:
SELECT profile
  FROM customers
  LIMIT 1

This sample data can be copied and provided to the `schema_of_json` function:

In [0]:
CREATE OR REPLACE TEMP VIEW parsed_customers AS
  SELECT customer_id, from_json(profile, schema_of_json('{"first_name":"Susana","last_name":"Gonnely","gender":"Female","address":{"street":"760 Express Court","city":"Obrenovac","country":"Serbia"}}')) AS profile_struct
  FROM customers;

SELECT * FROM parsed_customers
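
The idea of deriving a schema from one sample value can be sketched in plain Python. This is an illustration only, not how Spark implements `schema_of_json`: the `infer_schema` helper, the type names it prints, and the `UNKNOWN` placeholder are all assumptions of this sketch. It does show why the sample needs non-null fields: a null value carries no type information.

```python
import json


def infer_schema(value):
    """Return a DDL-like type string for a parsed JSON value (illustrative only)."""
    if isinstance(value, dict):
        fields = ", ".join(f"{name}: {infer_schema(v)}" for name, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # Assumes homogeneous arrays; an empty array gives no element type.
        element = infer_schema(value[0]) if value else "UNKNOWN"
        return f"ARRAY<{element}>"
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if value is None:
        return "UNKNOWN"  # placeholder of this sketch: null has no type to infer
    return "STRING"


sample = (
    '{"first_name":"Susana","last_name":"Gonnely","gender":"Female",'
    '"address":{"street":"760 Express Court","city":"Obrenovac","country":"Serbia"}}'
)
schema = infer_schema(json.loads(sample))
print(schema)

# A sample with a null field cannot tell us the field's type.
schema_with_null = infer_schema(json.loads('{"gender": null}'))
print(schema_with_null)
```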

The first thing to notice when working with a struct type is the ability to interact with the nested object. 

In [0]:
DESCRIBE parsed_customers

The new column has a struct data type, and the `address` field is also a struct. With a struct type, subfields are accessed using standard dot (period) syntax instead of the colon syntax:

In [0]:
SELECT customer_id, profile_struct.first_name, profile_struct.address.country
FROM parsed_customers

Once a JSON string is converted to struct type, the star (*) operation can be used to flatten fields into columns:

In [0]:
CREATE OR REPLACE TEMP VIEW customer_final AS
  SELECT customer_id, profile_struct.*
  FROM parsed_customers;

SELECT * FROM customer_final

`first_name`, `last_name`, `gender`, and `address` are now separate columns.
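
As a rough analogy in plain Python, a struct behaves like a nested dictionary: dot syntax corresponds to key lookups, and the star operation promotes the top-level fields to columns while leaving deeper levels nested. The `expand_struct` helper and the `customer_id` value below are hypothetical and exist only for this sketch.

```python
# Hypothetical parsed row; the customer_id is made up for illustration.
parsed_row = {
    "customer_id": "C00001",
    "profile_struct": {
        "first_name": "Susana",
        "last_name": "Gonnely",
        "gender": "Female",
        "address": {"street": "760 Express Court", "city": "Obrenovac", "country": "Serbia"},
    },
}


def expand_struct(row, struct_column):
    """Replace one struct column with its top-level fields (one level only)."""
    flat = {name: value for name, value in row.items() if name != struct_column}
    flat.update(row[struct_column])
    return flat


# Analogous to profile_struct.address.country
country = parsed_row["profile_struct"]["address"]["country"]

# Analogous to SELECT customer_id, profile_struct.*
customer_final_row = expand_struct(parsed_row, "profile_struct")
print(list(customer_final_row))
```

Only one level is expanded, so `address` remains a nested value in the result, just as it remains a struct column in `customer_final`.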

## explode() function

In [0]:
SELECT order_id, customer_id, books
  FROM orders

The `books` column is an array of structs. Spark SQL has many functions for working with arrays. The most important one is the `explode` function, which **puts each element of an array on its own row**.

In [0]:
SELECT order_id, customer_id, explode(books) AS book
FROM orders
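
The row-multiplying effect of `explode` can be sketched in plain Python. The sample orders below are made up for illustration, and the `explode_books` helper is hypothetical. Each array element yields one output row that repeats the other selected columns.

```python
# Made-up sample data: each order holds an array of book structs.
orders = [
    {"order_id": "O1", "customer_id": "C1",
     "books": [{"book_id": "B08", "quantity": 1}, {"book_id": "B02", "quantity": 2}]},
    {"order_id": "O2", "customer_id": "C2",
     "books": [{"book_id": "B09", "quantity": 1}]},
    {"order_id": "O3", "customer_id": "C2", "books": []},
]


def explode_books(rows):
    """Emit one row per element of the books array, repeating the order columns."""
    exploded = []
    for row in rows:
        for book in row["books"]:  # an empty array contributes no rows
            exploded.append({
                "order_id": row["order_id"],
                "customer_id": row["customer_id"],
                "book": book,
            })
    return exploded


exploded_orders = explode_books(orders)
for exploded_row in exploded_orders:
    print(exploded_row)
```

Order `O1` has two books and therefore appears twice in the output, while `O3` has an empty array and does not appear at all.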

## collect_set() function

The `collect_set()` aggregate function collects the unique values of a field, including fields within arrays:

In [0]:
SELECT customer_id,
  collect_set(order_id) AS orders_set,
  collect_set(books.book_id) AS book_set
FROM orders
GROUP BY customer_id

The `book_set` column is actually an array of arrays. Can we flatten this array?

Yes, we can. Additionally, only the distinct values can be kept. For example, when `B08` appears in two elements of the array, it is kept only once after flattening instead of twice.

In [0]:
SELECT customer_id,
  collect_set(books.book_id) AS before_flatten,
  array_distinct(flatten(collect_set(books.book_id))) AS after_flatten
FROM orders
GROUP BY customer_id

The result table shows the values before and after flattening.
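
The three steps (collect unique arrays, flatten, deduplicate) can be modeled in plain Python. The sample data is made up, and the helper functions are simplified stand-ins named after the Spark functions. In particular, this sketch keeps first-seen order, which should not be assumed of `collect_set` in Spark.

```python
from collections import defaultdict

# Made-up sample data: one array of book ids per order.
orders = [
    {"customer_id": "C1", "order_id": "O1", "book_ids": ["B08", "B02"]},
    {"customer_id": "C1", "order_id": "O2", "book_ids": ["B08", "B09"]},
    {"customer_id": "C2", "order_id": "O3", "book_ids": ["B03"]},
]


def collect_set(values):
    """Keep each distinct value once (first-seen order in this sketch)."""
    unique = []
    for value in values:
        if value not in unique:
            unique.append(value)
    return unique


def flatten(arrays):
    """Turn an array of arrays into a single array."""
    return [item for array in arrays for item in array]


def array_distinct(values):
    """Remove duplicate elements while keeping first-seen order."""
    return list(dict.fromkeys(values))


# GROUP BY customer_id
groups = defaultdict(list)
for order in orders:
    groups[order["customer_id"]].append(order["book_ids"])

result = {}
for customer_id, book_arrays in groups.items():
    before_flatten = collect_set(book_arrays)
    after_flatten = array_distinct(flatten(before_flatten))
    result[customer_id] = (before_flatten, after_flatten)
    print(customer_id, before_flatten, after_flatten)
```

For customer `C1`, `B08` occurs in both inner arrays before flattening and only once afterwards.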

## JOIN Operations

Spark SQL supports JOIN operations:
* INNER JOIN
* OUTER JOIN
* LEFT JOIN
* RIGHT JOIN
* ANTI JOIN
* CROSS JOIN
* SEMI JOIN

The following query joins the result of the explode operation with the `books` lookup table to retrieve book information such as the title and the author's name:

In [0]:
CREATE OR REPLACE VIEW orders_enriched AS
SELECT *
FROM (
  SELECT *, explode(books) AS book
  FROM orders) o
INNER JOIN books b 
ON o.book.book_id = b.book_id;

SELECT * FROM orders_enriched

The join type (`INNER JOIN`) is specified explicitly, and the results are stored in a view (`orders_enriched`).

For each book, the title, author, and category are retrieved.
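
The lookup performed by the inner join can be sketched in plain Python. All rows, titles, authors, and categories below are placeholders invented for illustration, and `inner_join` is a hypothetical helper. The join key is the nested `book.book_id` on the left side and `book_id` on the right side.

```python
# Made-up exploded order rows (left side of the join).
exploded_orders = [
    {"order_id": "O1", "customer_id": "C1", "book": {"book_id": "B08", "quantity": 1}},
    {"order_id": "O1", "customer_id": "C1", "book": {"book_id": "B02", "quantity": 2}},
    {"order_id": "O2", "customer_id": "C2", "book": {"book_id": "B99", "quantity": 1}},
]

# Made-up lookup table (right side of the join); values are placeholders.
books = [
    {"book_id": "B02", "title": "Title A", "author": "Author A", "category": "Category A"},
    {"book_id": "B08", "title": "Title B", "author": "Author B", "category": "Category B"},
]


def inner_join(left_rows, right_rows):
    """Join on left.book.book_id = right.book_id, keeping only matching rows."""
    right_by_id = {}
    for right in right_rows:
        right_by_id.setdefault(right["book_id"], []).append(right)
    joined = []
    for left in left_rows:
        for right in right_by_id.get(left["book"]["book_id"], []):
            joined.append({**left, **right})
    return joined


orders_enriched = inner_join(exploded_orders, books)
for enriched_row in orders_enriched:
    print(enriched_row["order_id"], enriched_row["book"]["book_id"], enriched_row["title"])
```

The row for `B99` has no match in the lookup table, so an inner join drops it.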

## SET Operations

Spark SQL also supports set operations such as `UNION`.

The following query returns the union of the old and the new data of the `orders` table:

In [0]:
CREATE OR REPLACE TEMP VIEW orders_updates
AS SELECT * FROM parquet.`${dataset.bookstore}/orders-new`;

SELECT * FROM orders
UNION
SELECT * FROM orders_updates

In the same way, the `INTERSECT` operation returns all rows found in both relations:

In [0]:
SELECT * FROM orders
INTERSECT
SELECT * FROM orders_updates

In this case, there are 700 records because these updates were already inserted into the `orders` table in the previous notebook.

The `MINUS` operation returns only the `orders` data without these 700 new records:

In [0]:
SELECT * FROM orders
MINUS
SELECT * FROM orders_updates
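
The three set operations map directly onto Python's built-in set operators. In this small sketch the rows are made-up tuples, and, mirroring the situation in this notebook, every row of `orders_updates` is assumed to already exist in `orders`.

```python
# Made-up rows as (order_id, customer_id) tuples. As in this notebook, the
# updates are assumed to have been inserted into orders already.
orders = {("O1", "C1"), ("O2", "C2"), ("O3", "C1"), ("O4", "C3")}
orders_updates = {("O3", "C1"), ("O4", "C3")}

union_rows = orders | orders_updates      # UNION: rows from either side, duplicates removed
intersect_rows = orders & orders_updates  # INTERSECT: rows found in both sides
minus_rows = orders - orders_updates      # MINUS: rows in orders but not in the updates

print(sorted(union_rows))
print(sorted(intersect_rows))
print(sorted(minus_rows))
```

Because the updates are a subset of `orders`, the union equals `orders`, the intersection equals the updates, and the difference contains only the older rows.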

## PIVOT Clause

Spark SQL also supports the `PIVOT` clause, which changes the perspective of the data: aggregated values are computed for specific values of a column, and those values become separate columns in the `SELECT` result. The `PIVOT` clause is specified after the table name or subquery. The query has two parts:
* `SELECT * FROM (...)`: the subquery that provides the input for the pivot
* `PIVOT (...)`: the first argument is an aggregate function applied to the column to be aggregated. The pivot column is then specified in the `FOR` subclause, and the `IN` operator lists the pivot column values

In [0]:
CREATE OR REPLACE TABLE transactions AS

SELECT * FROM (
  SELECT
    customer_id,
    book.book_id AS book_id,
    book.quantity AS quantity
  FROM orders_enriched
) PIVOT (
  SUM(quantity) FOR book_id IN (
    'B01', 'B02', 'B03', 'B04', 'B05', 'B06',
    'B07', 'B08', 'B09', 'B10', 'B11', 'B12'
  )
);

SELECT * FROM transactions

The `PIVOT` clause has been used to create a new `transactions` table that flattens the information contained in the `orders` table for each customer.

Such a flattened data format is useful for dashboards, and also for applying ML algorithms for inference or prediction.
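
The reshaping done by the pivot can be modeled in plain Python as a grouped sum spread across one column per pivot value. The input rows are made up, `pivot_sum` is a hypothetical helper, and only three book ids are listed to keep the output short. Missing combinations stay `None` here, standing in for SQL `NULL`.

```python
# Made-up input rows as (customer_id, book_id, quantity).
rows = [
    ("C1", "B01", 1),
    ("C1", "B01", 2),
    ("C1", "B03", 1),
    ("C2", "B02", 4),
]
pivot_values = ["B01", "B02", "B03"]  # plays the role of the IN (...) list


def pivot_sum(input_rows, values):
    """Sum quantity per customer, with one column per listed book id."""
    table = {}
    for customer_id, book_id, quantity in input_rows:
        # One output row per customer, with every pivot column starting as None.
        columns = table.setdefault(customer_id, {value: None for value in values})
        if book_id in columns:
            columns[book_id] = (columns[book_id] or 0) + quantity
    return table


transactions = pivot_sum(rows, pivot_values)
for customer_id, columns in transactions.items():
    print(customer_id, columns)
```

Each customer ends up with exactly one row and a fixed set of columns, which is the wide layout that dashboards and many ML libraries expect.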