
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Reshaping Data Lab

In this lab, you will aggregate the number of times each user took a particular action in **`events`**, then join these counts with a flattened view of **`transactions`** to create a **`clickpaths`** table that records each user's actions and final purchases.

The **`clickpaths`** table should contain all the fields from **`transactions`**, as well as a count of every **`event_name`** from **`events`** in its own column. This table should contain a single row for each user who completed a purchase.

## Learning Objectives
By the end of this lab, you should be able to:
- Pivot and join tables to create clickpaths for each user

## Run Setup

The setup script will create the data and declare necessary values for the rest of this notebook to execute.

In [0]:
%run ../Includes/Classroom-Setup-02.6L

We'll use Python to run checks occasionally throughout the lab. The helper function below raises an error, with a message describing what needs to change, if you have not followed the instructions. No output means that you have completed the step.

In [0]:
%python
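def check_table_results(table_name, num_rows, column_names):
    # spark.table raises an AnalysisException if the table or view cannot be found
    assert spark.table(table_name), f"Table named **`{table_name}`** does not exist"
    # Compare column names as sets, so column order does not matter
    assert set(spark.table(table_name).columns) == set(column_names), "Please name the columns as shown in the schema above"
    # Confirm the expected number of records
    assert spark.table(table_name).count() == num_rows, f"The table should have {num_rows} records"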
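
To see how the column and row-count checks behave without a Spark session, here is a minimal plain-Python sketch. The function **`check_results`** and its arguments are hypothetical stand-ins for this illustration only; they are not part of the lab's helpers.

```python
def check_results(actual_columns, actual_rows, expected_columns, expected_rows):
    """Plain-Python analogue of the column and row-count assertions."""
    # Sets ignore ordering, so only the column names have to match.
    assert set(actual_columns) == set(expected_columns), "Please name the columns as shown in the schema above"
    assert actual_rows == expected_rows, f"The table should have {expected_rows} records"

# Same names in a different order: passes silently, with no output.
check_results(["cart", "user"], 3, ["user", "cart"], 3)

# A wrong row count raises AssertionError carrying the explanatory message.
try:
    check_results(["user", "cart"], 2, ["user", "cart"], 3)
except AssertionError as err:
    message = str(err)
    print(message)
```

Because the comparison uses sets, a view whose columns appear in a different order than the schema still passes the check.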

## Pivot events to get event counts for each user

Let's start by pivoting the **`events`** table to get counts for each **`event_name`**.

We want to aggregate the number of times each user performed a specific event, specified in the **`event_name`** column. To do this, group by **`user_id`** and pivot on **`event_name`** to provide a count of every event type in its own column, resulting in the schema below. Note that **`user_id`** is renamed to **`user`** in the target schema.

| field | type | 
| --- | --- | 
| user | STRING |
| cart | BIGINT |
| pillows | BIGINT |
| login | BIGINT |
| main | BIGINT |
| careers | BIGINT |
| guest | BIGINT |
| faq | BIGINT |
| down | BIGINT |
| warranty | BIGINT |
| finalize | BIGINT |
| register | BIGINT |
| shipping_info | BIGINT |
| checkout | BIGINT |
| mattresses | BIGINT |
| add_item | BIGINT |
| press | BIGINT |
| email_coupon | BIGINT |
| cc_info | BIGINT |
| foam | BIGINT |
| reviews | BIGINT |
| original | BIGINT |
| delivery | BIGINT |
| premium | BIGINT |

A list of the event names is provided in the TODO cells below.
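
As a rough illustration of what the pivot computes, the following plain-Python sketch counts events per user and spreads the counts across one column per event name. The rows in **`toy_events`** and the function **`pivot_counts`** are hypothetical and only use three event names; the sketch assumes that a user with no occurrences of an event gets an empty value (`None` here, standing in for SQL `NULL`) rather than 0, which is how Spark typically reports missing pivot combinations.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows standing in for the events table: (user_id, event_name).
toy_events = [
    ("u1", "main"),
    ("u1", "cart"),
    ("u1", "main"),
    ("u2", "login"),
]

def pivot_counts(rows, event_names):
    """Return one dict per user with a count column for each event name."""
    counts = defaultdict(Counter)
    for user_id, event_name in rows:
        counts[user_id][event_name] += 1
    pivoted = []
    for user_id in sorted(counts):
        row = {"user": user_id}  # user_id is renamed to user, as in the target schema
        for name in event_names:
            # .get returns None for events the user never triggered.
            row[name] = counts[user_id].get(name)
        pivoted.append(row)
    return pivoted

toy_pivot = pivot_counts(toy_events, ["cart", "login", "main"])
for row in toy_pivot:
    print(row)
```

Each user ends up with exactly one row, which is why the pivoted view has one record per distinct **`user_id`**.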

### Solve with SQL

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TEMP VIEW events_pivot AS
SELECT * FROM (
  SELECT user_id user, event_name 
  FROM events
) PIVOT ( count(*) FOR event_name IN (
    "cart", "pillows", "login", "main", "careers", "guest", "faq", "down", "warranty", "finalize", 
    "register", "shipping_info", "checkout", "mattresses", "add_item", "press", "email_coupon", 
    "cc_info", "foam", "reviews", "original", "delivery", "premium" ))

### Solve with Python

In [0]:
%python
# ANSWER
(spark.read.table("events")
    .groupBy("user_id")
    .pivot("event_name")
    .count()
    .withColumnRenamed("user_id", "user")
    .createOrReplaceTempView("events_pivot"))
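
A possible variation, sketched under assumptions: **`pivot`** also accepts an explicit list of values, which mirrors the `IN (...)` list of the SQL solution and lets Spark skip the extra pass it otherwise needs to discover the distinct event names. The names **`EVENT_NAMES`** and **`pivot_event_counts`** are hypothetical, introduced only for this example.

```python
# The 23 event names from the target schema, in the order used by the SQL solution.
EVENT_NAMES = [
    "cart", "pillows", "login", "main", "careers", "guest", "faq", "down",
    "warranty", "finalize", "register", "shipping_info", "checkout",
    "mattresses", "add_item", "press", "email_coupon", "cc_info", "foam",
    "reviews", "original", "delivery", "premium",
]

def pivot_event_counts(events_df, event_names=EVENT_NAMES):
    """Pivot a Spark DataFrame of events into one count column per event name."""
    return (events_df
        .groupBy("user_id")
        # Passing the values explicitly also fixes the order of the output columns.
        .pivot("event_name", event_names)
        .count()
        .withColumnRenamed("user_id", "user"))

# Intended usage inside the notebook, where `spark` is already defined:
# pivot_event_counts(spark.read.table("events")).createOrReplaceTempView("events_pivot")
```

With an explicit list, event names that are missing from the list are dropped from the result, so the list must cover every column in the target schema.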

### Check your work
Run the cell below to confirm the view was created correctly.

In [0]:
%python
check_table_results("events_pivot", 204586, ['user', 'cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty', 'finalize', 'register', 'shipping_info', 'checkout', 'mattresses', 'add_item', 'press', 'email_coupon', 'cc_info', 'foam', 'reviews', 'original', 'delivery', 'premium'])

## Join event counts and transactions for all users

Next, join **`events_pivot`** with **`transactions`** to create **`clickpaths`**. The result should have the same event name columns as the **`events_pivot`** view created above, followed by the columns from the **`transactions`** table, as shown below.

| field | type | 
| --- | --- | 
| user | STRING |
| cart | BIGINT |
| ... | ... |
| user_id | STRING |
| order_id | BIGINT |
| transaction_timestamp | BIGINT |
| total_item_quantity | BIGINT |
| purchase_revenue_in_usd | DOUBLE |
| unique_items | BIGINT |
| P_FOAM_K | BIGINT |
| M_STAN_Q | BIGINT |
| P_FOAM_S | BIGINT |
| M_PREM_Q | BIGINT |
| M_STAN_F | BIGINT |
| M_STAN_T | BIGINT |
| M_PREM_K | BIGINT |
| M_PREM_F | BIGINT |
| M_STAN_K | BIGINT |
| M_PREM_T | BIGINT |
| P_DOWN_S | BIGINT |
| P_DOWN_K | BIGINT |
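
The effect of the join can be sketched in plain Python. The toy rows and the function **`inner_join`** below are hypothetical and far smaller than the real tables; the sketch only illustrates that an inner join keeps users present on both sides and places the left-hand columns before the right-hand ones.

```python
# Hypothetical toy rows; the real tables have many more columns.
toy_events_pivot = [
    {"user": "u1", "cart": 1, "main": 2},
    {"user": "u2", "cart": None, "main": 1},
]
toy_transactions = [
    {"user_id": "u1", "order_id": 1001, "total_item_quantity": 2},
]

def inner_join(left_rows, right_rows, left_key, right_key):
    """Keep only pairs of rows whose key values match."""
    joined = []
    # Nested loops are fine for a toy example; Spark uses far more efficient strategies.
    for left in left_rows:
        for right in right_rows:
            if left[left_key] == right[right_key]:
                # Left columns first, then right columns, as with SELECT *.
                joined.append({**left, **right})
    return joined

toy_clickpaths = inner_join(toy_events_pivot, toy_transactions, "user", "user_id")
print(toy_clickpaths)
```

User `u2` has events but no transaction, so it drops out of the result. Both **`user`** and **`user_id`** survive the join, which is why the target schema lists both columns.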

### Solve with SQL

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TEMP VIEW clickpaths AS
SELECT * 
FROM events_pivot a
JOIN transactions b 
  ON a.user = b.user_id

### Solve with Python

In [0]:
%python
# ANSWER
from pyspark.sql.functions import col
(spark.read.table("events_pivot")
    .join(spark.table("transactions"), col("events_pivot.user") == col("transactions.user_id"), "inner")
    .createOrReplaceTempView("clickpaths"))

### Check your work
Run the cell below to confirm the view was created correctly.

In [0]:
%python
check_table_results("clickpaths", 9085, ['user', 'cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty', 'finalize', 'register', 'shipping_info', 'checkout', 'mattresses', 'add_item', 'press', 'email_coupon', 'cc_info', 'foam', 'reviews', 'original', 'delivery', 'premium', 'user_id', 'order_id', 'transaction_timestamp', 'total_item_quantity', 'purchase_revenue_in_usd', 'unique_items', 'P_FOAM_K', 'M_STAN_Q', 'P_FOAM_S', 'M_PREM_Q', 'M_STAN_F', 'M_STAN_T', 'M_PREM_K', 'M_PREM_F', 'M_STAN_K', 'M_PREM_T', 'P_DOWN_S', 'P_DOWN_K'])

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>