
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Reshaping Data Lab

In this lab, you will create a **`clickpaths`** table that aggregates the number of times each user took a particular action in **`events`** and then join this information with the flattened view of **`transactions`** created in the previous notebook.

You'll also explore a new higher-order function to flag items recorded in **`sales`** based on information extracted from item names.

## Learning Objectives
By the end of this lab, you should be able to:
- Pivot and join tables to create clickpaths for each user
- Apply higher-order functions to flag types of products purchased

## Run Setup

The setup script will create the data and declare necessary values for the rest of this notebook to execute.

In [0]:
%run ../Includes/Classroom-Setup-4.9L

## Reshape Datasets to Create Click Paths
This operation will join data from your **`events`** and **`transactions`** tables in order to create a record of all actions a user took on the site and what their final order looked like.

The **`clickpaths`** table should contain all the fields from your **`transactions`** table, as well as a count of every **`event_name`** in its own column. Each user that completed a purchase should have a single row in the final table. Let's start by pivoting the **`events`** table to get counts for each **`event_name`**.

### 1. Pivot **`events`** to count actions for each user
We want to count how many times each user performed each event listed in the **`event_name`** column. To do this, group by **`user`** and pivot on **`event_name`** so that every event type gets its own count column, resulting in the schema below.

| field | type | 
| --- | --- | 
| user | STRING |
| cart | BIGINT |
| pillows | BIGINT |
| login | BIGINT |
| main | BIGINT |
| careers | BIGINT |
| guest | BIGINT |
| faq | BIGINT |
| down | BIGINT |
| warranty | BIGINT |
| finalize | BIGINT |
| register | BIGINT |
| shipping_info | BIGINT |
| checkout | BIGINT |
| mattresses | BIGINT |
| add_item | BIGINT |
| press | BIGINT |
| email_coupon | BIGINT |
| cc_info | BIGINT |
| foam | BIGINT |
| reviews | BIGINT |
| original | BIGINT |
| delivery | BIGINT |
| premium | BIGINT |
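
To make the target shape concrete, here is a minimal plain-Python sketch of what grouping by user and pivoting on event name computes. The toy rows and the shortened **`event_names`** list are assumptions for illustration only; the lab itself does this in Spark SQL on the full list of event names.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows standing in for (user_id, event_name) pairs from `events`.
events = [
    ("u1", "main"),
    ("u1", "cart"),
    ("u1", "main"),
    ("u2", "login"),
]

# Shortened, illustrative list of pivot columns (the lab uses 23 event names).
event_names = ["cart", "login", "main"]

# Group by user and count how often each event name occurs.
counts = defaultdict(Counter)
for user, event_name in events:
    counts[user][event_name] += 1

# One row per user, one column per event name. Events a user never
# triggered become None, mirroring the NULLs in a pivoted result.
events_pivot = [
    {"user": user, **{name: per_user.get(name) for name in event_names}}
    for user, per_user in counts.items()
]

for row in events_pivot:
    print(row)
```

Note that a missing event shows up as a null rather than a zero, which matches the empty cells you should expect in the pivoted output.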

A list of the event names is provided below.

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE VIEW events_pivot AS
SELECT * FROM (
  SELECT user_id user, event_name 
  FROM events
) PIVOT ( count(*) FOR event_name IN (
    "cart", "pillows", "login", "main", "careers", "guest", "faq", "down", "warranty", "finalize", 
    "register", "shipping_info", "checkout", "mattresses", "add_item", "press", "email_coupon", 
    "cc_info", "foam", "reviews", "original", "delivery", "premium" ))

In [0]:
%sql
select * from events_pivot
limit 5

user,cart,pillows,login,main,careers,guest,faq,down,warranty,finalize,register,shipping_info,checkout,mattresses,add_item,press,email_coupon,cc_info,foam,reviews,original,delivery,premium
UA000000105622646,,,,1.0,,,,,,,,,,,,,1.0,,,,,,
UA000000103338608,1.0,,,2.0,,1.0,,,,,,1.0,1.0,,1.0,,,,,1.0,2.0,,
UA000000106711687,,,,,,,,,,,,,,2.0,,,,,,,1.0,,
UA000000104458062,,,,1.0,,,,,,,,,,,,,,,,,,1.0,1.0
UA000000106770848,,1.0,,,,,,,,,,,,,,,,,,,,,


**NOTE**: We'll use Python to run checks occasionally throughout the lab. The helper function below raises an error with a message describing what needs to change if you have not followed the instructions. No output means that you have completed the step.

In [0]:
%python
def check_table_results(table_name, column_names, num_rows):
    # spark.table() raises an AnalysisException on its own if the table or view is missing
    df = spark.table(table_name)
    assert df.columns == column_names, "Please name the columns in the order provided above"
    assert df.count() == num_rows, f"The table should have {num_rows} records"

Run the cell below to confirm the view was created correctly.

In [0]:
%python
event_columns = ['user', 'cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty', 'finalize', 'register', 'shipping_info', 'checkout', 'mattresses', 'add_item', 'press', 'email_coupon', 'cc_info', 'foam', 'reviews', 'original', 'delivery', 'premium']
check_table_results("events_pivot", event_columns, 204586)

### 2. Join event counts and transactions for all users

Next, join **`events_pivot`** with **`transactions`** to create **`clickpaths`**. The result should contain the event count columns from **`events_pivot`** created above, followed by the columns from **`transactions`**, as shown below.

| field | type | 
| --- | --- | 
| user | STRING |
| cart | BIGINT |
| ... | ... |
| user_id | STRING |
| order_id | BIGINT |
| transaction_timestamp | BIGINT |
| total_item_quantity | BIGINT |
| purchase_revenue_in_usd | DOUBLE |
| unique_items | BIGINT |
| P_FOAM_K | BIGINT |
| M_STAN_Q | BIGINT |
| P_FOAM_S | BIGINT |
| M_PREM_Q | BIGINT |
| M_STAN_F | BIGINT |
| M_STAN_T | BIGINT |
| M_PREM_K | BIGINT |
| M_PREM_F | BIGINT |
| M_STAN_K | BIGINT |
| M_PREM_T | BIGINT |
| P_DOWN_S | BIGINT |
| P_DOWN_K | BIGINT |

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE VIEW clickpaths AS
SELECT * 
FROM events_pivot a
JOIN transactions b 
  ON a.user = b.user_id

In [0]:
%sql
select * from clickpaths
limit 5

user,cart,pillows,login,main,careers,guest,faq,down,warranty,finalize,register,shipping_info,checkout,mattresses,add_item,press,email_coupon,cc_info,foam,reviews,original,delivery,premium,user_id,order_id,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,P_FOAM_K,M_STAN_Q,P_FOAM_S,M_PREM_Q,M_STAN_F,M_STAN_T,M_PREM_K,M_PREM_F,M_STAN_K,M_PREM_T,P_DOWN_S,P_DOWN_K
UA000000105169226,1,,,1.0,,1.0,,,,1,,1,1,1.0,1,,1.0,1,,,,,,UA000000105169226,396914,1593351573411942,1,940.5,1,,1.0,,,,,,,,,,
UA000000103791856,1,,1.0,,,,,,1.0,1,,1,1,3.0,1,,1.0,1,,,,,,UA000000103791856,335075,1592877927559277,1,1795.5,1,,,,,,,1.0,,,,,
UA000000103823790,1,,,1.0,,,3.0,,1.0,1,1.0,1,1,,1,,,1,1.0,,,,,UA000000103823790,307928,1592684321889135,1,595.0,1,,,,,,1.0,,,,,,
UA000000103482437,1,,,1.0,,1.0,,,,1,,1,1,1.0,1,,1.0,1,,,,1.0,,UA000000103482437,335790,1592887477130904,1,107.1,1,,,,,,,,,,,1.0,
UA000000105277962,1,,,1.0,,,,,,1,1.0,1,1,,1,,,1,,,,,,UA000000105277962,376180,1593191829786271,1,1045.0,1,,1.0,,,,,,,,,,


Run the cell below to confirm the view was created correctly.

In [0]:
%python
clickpath_columns = event_columns + ['user_id', 'order_id', 'transaction_timestamp', 'total_item_quantity', 'purchase_revenue_in_usd', 'unique_items', 'P_FOAM_K', 'M_STAN_Q', 'P_FOAM_S', 'M_PREM_Q', 'M_STAN_F', 'M_STAN_T', 'M_PREM_K', 'M_PREM_F', 'M_STAN_K', 'M_PREM_T', 'P_DOWN_S', 'P_DOWN_K']
check_table_results("clickpaths", clickpath_columns, 9085)

## Flag Types of Products Purchased
Here, you'll use the higher-order function **`EXISTS`** to create boolean columns **`mattress`** and **`pillow`** that indicate whether any item purchased was a mattress or pillow product.

For example, if **`item_name`** from the **`items`** column ends with the string **`"Mattress"`**, the column value for **`mattress`** should be **`true`** and the value for **`pillow`** should be **`false`**. Here are a few examples of items and the resulting values.

|  items  | mattress | pillow |
| ------- | -------- | ------ |
| **`[{..., "item_id": "M_PREM_K", "item_name": "Premium King Mattress", ...}]`** | true | false |
| **`[{..., "item_id": "P_FOAM_S", "item_name": "Standard Foam Pillow", ...}]`** | false | true |
| **`[{..., "item_id": "M_STAN_F", "item_name": "Standard Full Mattress", ...}]`** | true | false |

See documentation for the <a href="https://docs.databricks.com/sql/language-manual/functions/exists.html" target="_blank">exists</a> function.  
You can use the condition expression **`item_name LIKE "%Mattress"`** to check whether the string **`item_name`** ends with the word "Mattress".
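
As a rough mental model, **`EXISTS`** over an array behaves like Python's `any()` over a list, and a `LIKE` pattern with a leading `%` behaves like `str.endswith()`. The sketch below uses hypothetical toy rows (dicts standing in for the structs in **`items`**) and a hypothetical helper name; it ignores null handling, which may differ in SQL.

```python
# Hypothetical toy rows; each `items` value is a list of structs, shown here as dicts.
sales = [
    {"items": [{"item_id": "M_PREM_K", "item_name": "Premium King Mattress"}]},
    {"items": [{"item_id": "P_FOAM_S", "item_name": "Standard Foam Pillow"}]},
    {"items": [
        {"item_id": "M_PREM_F", "item_name": "Premium Full Mattress"},
        {"item_id": "P_FOAM_S", "item_name": "Standard Foam Pillow"},
    ]},
]


def has_item_ending_with(items, suffix):
    # any() plays the role of EXISTS; endswith() plays the role of LIKE "%<suffix>".
    return any(item["item_name"].endswith(suffix) for item in items)


sales_product_flags = [
    {
        "mattress": has_item_ending_with(row["items"], "Mattress"),
        "pillow": has_item_ending_with(row["items"], "Pillow"),
    }
    for row in sales
]

for flags in sales_product_flags:
    print(flags)
```

An order that contains both a mattress and a pillow gets `True` in both columns, since each flag is evaluated independently over the whole array.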

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TABLE sales_product_flags AS
SELECT
  items,
  EXISTS(items, i -> i.item_name LIKE "%Mattress") AS mattress,
  EXISTS(items, i -> i.item_name LIKE "%Pillow") AS pillow
FROM sales

num_affected_rows,num_inserted_rows


In [0]:
%sql
select * from sales_product_flags
limit 5

items,mattress,pillow
"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 1071.0, 595.0, 2))",True,False
"List(List(NEWBED10, M_STAN_F, Standard Full Mattress, 850.5, 945.0, 1))",True,False
"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))",True,False
"List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, P_FOAM_S, Standard Foam Pillow, 59.0, 59.0, 1))",True,True
"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",True,False


Run the cell below to confirm the table was created correctly.

In [0]:
%python
check_table_results("sales_product_flags", ['items', 'mattress', 'pillow'], 10539)
product_counts = spark.sql("SELECT sum(CAST(mattress AS INT)) num_mattress, sum(CAST(pillow AS INT)) num_pillow FROM sales_product_flags").first().asDict()
assert product_counts == {'num_mattress': 10015, 'num_pillow': 1386}, "There should be 10015 rows where mattress is true, and 1386 where pillow is true"
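
The check above counts true flags by casting each boolean to an integer and summing. A minimal plain-Python sketch of the same idea, on hypothetical toy flags, looks like this:

```python
# Hypothetical toy flags standing in for rows of `sales_product_flags`.
flags = [
    {"mattress": True, "pillow": False},
    {"mattress": False, "pillow": True},
    {"mattress": True, "pillow": True},
]

# Casting a boolean to int gives 1 for True and 0 for False,
# so the sum is the number of rows where the flag is set.
product_counts = {
    "num_mattress": sum(int(row["mattress"]) for row in flags),
    "num_pillow": sum(int(row["pillow"]) for row in flags),
}

print(product_counts)
```

Because a row can set both flags, the two counts can add up to more than the number of rows, as they do in the lab's expected values (10015 + 1386 is greater than 10539).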

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>