
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Complex Types

Explore built-in functions for working with collections and strings.

##### Objectives
1. Apply collection functions to process arrays
1. Union DataFrames together

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `unionByName`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>:
  - Aggregate: `collect_set`
  - Collection: `array_contains`, `element_at`, `explode`
  - String: `split`

In [0]:
%run ./Includes/Classroom-Setup

In this demo, we're going to use the sales data set.

In [0]:
df = spark.read.parquet(salesPath)
display(df)

### String Functions
Here are some of the built-in functions available for manipulating strings.

| Method | Description |
| --- | --- |
| translate | Replaces each character in the source string that appears in the matching string with the corresponding character in the replacement string |
| regexp_replace | Replace all substrings of the specified string value that match regexp with rep |
| regexp_extract | Extract a specific group matched by a Java regex, from the specified string column |
| ltrim | Removes the leading space characters from the specified string column |
| lower | Converts a string column to lowercase |
| split | Splits str around matches of the given pattern |
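
As a rough illustration of what these string functions do, here is a minimal sketch using Python standard-library analogues rather than Spark itself. The sample value is hypothetical, and Spark's functions take Java regular expressions, whose syntax can differ from Python's `re` module in some cases.

```python
import re

# Hypothetical item name with leading spaces (not taken from the sales data).
item_name = "  Premium Queen Mattress"

trimmed = item_name.lstrip(" ")                         # like ltrim
lowered = trimmed.lower()                               # like lower
words = re.split(" ", trimmed)                          # like split (pattern is a regex)
dashed = re.sub(r"\s+", "-", trimmed)                   # like regexp_replace
first_word = re.search(r"^(\w+)", trimmed).group(1)     # like regexp_extract, group 1
swapped = trimmed.translate(str.maketrans("PQ", "pq"))  # like translate

print(trimmed)     # Premium Queen Mattress
print(lowered)     # premium queen mattress
print(words)       # ['Premium', 'Queen', 'Mattress']
print(dashed)      # Premium-Queen-Mattress
print(first_word)  # Premium
print(swapped)     # premium queen Mattress
```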

### Collection Functions

Here are some of the built-in functions available for working with arrays.

| Method | Description |
| --- | --- |
| array_contains | Returns null if the array is null, true if the array contains value, and false otherwise. |
| element_at | Returns element of array at given index. Array elements are numbered starting with **1**. |
| explode | Creates a new row for each element in the given array or map column. |
| collect_set | Aggregate function that returns an array of the distinct values in a group (see Aggregate Functions below). |
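
The null handling of `array_contains` and the 1-based indexing of `element_at` are easy to get wrong, so here is a minimal plain-Python model of both rules. The helper names `array_contains_model` and `element_at_model` are hypothetical and are not part of PySpark. Returning `None` for an out-of-range position is an assumption that mirrors Spark with ANSI mode off; with ANSI mode on, Spark raises an error instead.

```python
from typing import Any, Optional, Sequence


def array_contains_model(arr: Optional[Sequence[Any]], value: Any) -> Optional[bool]:
    """Plain-Python model of the array_contains rule described in the table."""
    if arr is None:
        return None  # null array -> null result
    return value in arr


def element_at_model(arr: Sequence[Any], index: int) -> Any:
    """Plain-Python model of element_at on an array: positions start at 1."""
    if index == 0:
        raise ValueError("element_at positions start at 1; 0 is not valid")
    if abs(index) > len(arr):
        return None  # assumption: mirrors Spark with ANSI mode off
    # Positive positions count from the front, negative ones from the back.
    return arr[index - 1] if index > 0 else arr[index]


details = ["Premium", "Queen", "Mattress"]  # hypothetical sample value

print(array_contains_model(details, "Mattress"))  # True
print(array_contains_model(None, "Mattress"))     # None
print(element_at_model(details, 1))               # Premium
print(element_at_model(details, -1))              # Mattress
```

Note that position 1 in `element_at` corresponds to index 0 in a Python list.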

### Aggregate Functions

Here are some of the built-in aggregate functions available for creating arrays, typically from GroupedData.

| Method | Description |
| --- | --- |
| collect_list | Returns an array consisting of all values within the group. |
| collect_set | Returns an array consisting of all unique values within the group. |
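
To see the difference between the two aggregates, here is a small plain-Python sketch that groups hypothetical `(email, size)` rows by hand. It only models the result of each function; it is not how Spark computes it, and Spark does not guarantee the order of elements in either array.

```python
from collections import defaultdict

# Hypothetical (email, size) rows standing in for a grouped DataFrame.
rows = [
    ("a@example.com", "Queen"),
    ("a@example.com", "Queen"),
    ("a@example.com", "King"),
    ("b@example.com", "Twin"),
]

collected_list = defaultdict(list)  # like collect_list: keeps every value
collected_set = defaultdict(set)    # like collect_set: keeps unique values only

for email, size in rows:
    collected_list[email].append(size)
    collected_set[email].add(size)

print(collected_list["a@example.com"])         # ['Queen', 'Queen', 'King']
print(sorted(collected_set["a@example.com"]))  # ['King', 'Queen']
```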

# User Purchases

List all size and quality options purchased by each buyer.
1. Extract item details from purchases
2. Extract size and quality options from mattress purchases
3. Extract size and quality options from pillow purchases
4. Combine data for mattress and pillows
5. List all size and quality options bought by each user

### 1. Extract item details from purchases

- Explode the **`items`** field in **`df`** with the results replacing the existing **`items`** field
- Select the **`email`** and **`items.item_name`** fields
- Split the words in **`item_name`** into an array and alias the column to "details"

Assign the resulting DataFrame to **`detailsDF`**.

In [0]:
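# Import only the functions this notebook uses; a wildcard import
# would shadow Python built-ins such as sum and max.
from pyspark.sql.functions import (array_contains, col, collect_set,
                                   element_at, explode, split)

detailsDF = (df
             .withColumn("items", explode("items"))  # one row per purchased item
             .select("email", "items.item_name")
             .withColumn("details", split(col("item_name"), " "))  # words as an array
            )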
display(detailsDF)

The **`details`** column is now an array holding each item's quality, size, and object type as separate elements.
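
To make the explode-then-split step concrete, here is a minimal plain-Python sketch of the same two transformations. The `purchases` records and their item names are hypothetical stand-ins for the sales data, and the loop only models the row-level effect of `explode` and `split`, not Spark's implementation.

```python
import re

# Hypothetical purchase records shaped like the sales data: one row per
# purchase, with an array of item structs in "items".
purchases = [
    {"email": "a@example.com",
     "items": [{"item_name": "Premium Queen Mattress"},
               {"item_name": "Standard Foam Pillow"}]},
    {"email": "b@example.com",
     "items": [{"item_name": "King Down Pillow"}]},
]

details_rows = []
for purchase in purchases:
    # explode: emit one output row per element of the array.
    for item in purchase["items"]:
        name = item["item_name"]
        details_rows.append({
            "email": purchase["email"],
            "item_name": name,
            # split: the separator is treated as a regex pattern.
            "details": re.split(" ", name),
        })

for row in details_rows:
    print(row["email"], row["details"])
```

Two purchases become three rows, one per item. A purchase with an empty `items` array would produce no rows at all, which matches how `explode` behaves.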

### 2. Extract size and quality options from mattress purchases

- Filter **`detailsDF`** for records where **`details`** contains "Mattress"
- Add a **`size`** column by extracting the element at position 2
- Add a **`quality`** column by extracting the element at position 1

Save the result as **`mattressDF`**.

In [0]:
mattressDF = (detailsDF
              .filter(array_contains(col("details"), "Mattress"))
              .withColumn("size", element_at(col("details"), 2))
              .withColumn("quality", element_at(col("details"), 1))
             )
display(mattressDF)

Next we're going to do the same thing for pillow purchases.

### 3. Extract size and quality options from pillow purchases
- Filter **`detailsDF`** for records where **`details`** contains "Pillow"
- Add a **`size`** column by extracting the element at position 1
- Add a **`quality`** column by extracting the element at position 2

Note the positions of **`size`** and **`quality`** are switched for mattresses and pillows.

Save the result as **`pillowDF`**.

In [0]:
pillowDF = (detailsDF
            .filter(array_contains(col("details"), "Pillow"))
            .withColumn("size", element_at(col("details"), 1))
            .withColumn("quality", element_at(col("details"), 2))
           )
display(pillowDF)

### 4. Combine data for mattress and pillows

- Perform a union on **`mattressDF`** and **`pillowDF`** by column names
- Drop the **`details`** column

Save the result as **`unionDF`**.

<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html" target="_blank">`union`</a> method resolves columns by position, as in standard SQL. You should use it only if the two DataFrames have exactly the same schema, including the column order. In contrast, the DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html" target="_blank">`unionByName`</a> method resolves columns by name.
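
The difference between the two methods shows up only when the column order differs, so here is a minimal plain-Python model of both. The helper names `union_by_position` and `union_by_name`, the column lists, and the rows are hypothetical; the sketch illustrates the resolution rules and is not Spark's implementation.

```python
# Two hypothetical tables with the same columns in a different order.
mattress_cols = ["email", "size", "quality"]
mattress_rows = [("a@example.com", "Queen", "Premium")]

pillow_cols = ["email", "quality", "size"]
pillow_rows = [("b@example.com", "Foam", "Standard")]


def union_by_position(left_rows, right_rows):
    """Model of union: stack rows as-is; the left column names win."""
    return left_rows + right_rows


def union_by_name(left_cols, left_rows, right_cols, right_rows):
    """Model of unionByName: reorder the right rows to the left column order."""
    if set(left_cols) != set(right_cols):
        raise ValueError("column names must match")
    order = [right_cols.index(name) for name in left_cols]
    return left_rows + [tuple(row[i] for i in order) for row in right_rows]


by_position = union_by_position(mattress_rows, pillow_rows)
by_name = union_by_name(mattress_cols, mattress_rows, pillow_cols, pillow_rows)

print(by_position)  # second row: ('b@example.com', 'Foam', 'Standard')
print(by_name)      # second row: ('b@example.com', 'Standard', 'Foam')
```

In the positional result, "Foam" silently lands in the `size` column. Resolving by name puts each value under the correct column.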

In [0]:
unionDF = mattressDF.unionByName(pillowDF).drop("details")
display(unionDF)

### 5. List all size and quality options bought by each user

- Group rows in **`unionDF`** by **`email`**
  - Collect the set of all items in **`size`** for each user and alias the column to "size options"
  - Collect the set of all items in **`quality`** for each user and alias the column to "quality options"

Save the result as **`optionsDF`**.

In [0]:
optionsDF = (unionDF
             .groupBy("email")
             .agg(collect_set("size").alias("size options"),
                  collect_set("quality").alias("quality options"))
            )
display(optionsDF)

### Clean up classroom

Lastly, we'll clean up the classroom.

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>