<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Additional Functions

##### Objectives
1. Apply built-in functions to generate data for new columns
1. Apply DataFrame NA functions to handle null values
1. Join DataFrames

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a>: `fill`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>:
  - Aggregate: `collect_set`
  - Collection: `explode`
  - Non-aggregate and miscellaneous: `col`, `lit`

### DataFrameNaFunctions
<a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a> is a class that groups the DataFrame methods for handling null values. Obtain an instance of DataFrameNaFunctions by accessing the `na` attribute of a DataFrame.

| Method | Description |
| --- | --- |
| drop | Returns a new DataFrame omitting rows with any, all, or a specified number of null values, considering an optional subset of columns |
| fill | Returns a new DataFrame replacing null values with the specified value, considering an optional subset of columns |
| replace | Returns a new DataFrame replacing a value with another value, considering an optional subset of columns |

### Non-aggregate and Miscellaneous Functions
Here are a few additional non-aggregate and miscellaneous built-in functions.

| Method | Description |
| --- | --- |
| col / column | Returns a Column based on the given column name. |
| lit | Creates a Column holding a literal value |
| isnull | Returns true if and only if the column is null |
| rand | Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0) |

### Joining DataFrames
The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html?highlight=join#pyspark.sql.DataFrame.join" target="_blank">`join`</a> method joins two DataFrames based on a given join expression. Several different types of joins are supported. For example:

```
# Inner join based on equal values of a shared column called 'name' (i.e., an equi join)
df1.join(df2, 'name')

# Inner join based on equal values of the shared columns called 'name' and 'age'
df1.join(df2, ['name', 'age'])

# Full outer join based on equal values of a shared column called 'name'
df1.join(df2, 'name', 'outer')

# Left outer join based on an explicit column expression
df1.join(df2, df1['customer_name'] == df2['account_name'], 'left_outer')
```

# Abandoned Carts Lab
Get the abandoned cart items for emails that have no purchases.
1. Get emails of converted users from transactions
2. Join emails with user IDs
3. Get cart item history for each user
4. Join cart item history with emails
5. Filter for emails with abandoned cart items

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `join`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: `collect_set`, `explode`, `lit`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a>: `fill`

### Setup
Run the cells below to create DataFrames **`salesDF`**, **`usersDF`**, and **`eventsDF`**.

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
# sale transactions at BedBricks
salesDF = spark.read.parquet(salesPath)
display(salesDF)

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
257437,kmunoz@powell-duran.com,1592194221828900,1,1995.0,1,"List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))"
282611,bmurillo@hotmail.com,1592504237604072,1,940.5,1,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))"
257448,bradley74@gmail.com,1592200438030141,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
257440,jameshardin@campbell-morris.biz,1592197217716495,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
283949,whardin@hotmail.com,1592510720760323,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"
257444,emily88@cobb.com,1592199040703476,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
257449,craig61@luna-oliver.com,1592200459769596,1,1195.0,1,"List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))"
257441,johnsonashley@mcclain.com,1592197729873798,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
264191,maxwelltara@edwards.com,1592306255847870,2,993.6,2,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1), List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1))"
286727,rojasjorge@yahoo.com,1592533048926949,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"


In [0]:
# user IDs and emails at BedBricks
usersDF = spark.read.parquet(usersPath)
display(usersDF)

user_id,user_first_touch_timestamp,email
UA000000102357305,1592182691348767,
UA000000102357308,1592183287634953,
UA000000102357309,1592183302736627,
UA000000102357321,1592184604178702,david23@orozco-parker.com
UA000000102357325,1592185154063628,
UA000000102357335,1592186122660210,
UA000000102357338,1592186300091435,
UA000000102357348,1592187663145345,phillipmorgan@hotmail.com
UA000000102357350,1592187732257656,
UA000000102357356,1592188311375015,


In [0]:
# events logged on the BedBricks website
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### 1-A: Get emails of converted users from transactions
- Select the **`email`** column in **`salesDF`** and remove duplicates
- Add a new column **`converted`** with the value **`True`** for all rows

Save the result as **`convertedUsersDF`**.

In [0]:
# TODO
from pyspark.sql.functions import col, collect_set, explode, lit

convertedUsersDF = (salesDF
  .select("email")
  .drop_duplicates()
  .withColumn("converted", lit(True))
)
display(convertedUsersDF)

email,converted
zacharyfisher@brown.com,True
flowersrhonda@paul.com,True
tanya8857@yahoo.com,True
serranoerika@brooks-lawson.com,True
bishopamber@yahoo.com,True
michael915@gmail.com,True
keithterrance@martinez-mitchell.com,True
preston96@robinson.com,True
jimmy37@hotmail.com,True
andrewsantiago@yahoo.com,True


#### 1-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["email", "converted"]

expectedCount = 210370

assert convertedUsersDF.columns == expectedColumns, "convertedUsersDF does not have the correct columns"

assert convertedUsersDF.count() == expectedCount, "convertedUsersDF does not have the correct number of rows"

assert convertedUsersDF.select(col("converted")).first()[0] == True, "converted column not correct"

### 2-A: Join emails with user IDs
- Perform an outer join on **`convertedUsersDF`** and **`usersDF`** with the **`email`** field
- Filter for users where **`email`** is not null
- Fill null values in **`converted`** as **`False`**

Save the result as **`conversionsDF`**.

In [0]:
# TODO
conversionsDF = (usersDF
  .join(convertedUsersDF, "email", "outer")
  .filter(col("email").isNotNull())
  .na.fill(False, ["converted"])
)
display(conversionsDF)

email,user_id,user_first_touch_timestamp,converted
aabbott@fischer-thompson.info,UA000000107293930,1593868005679801,False
aacevedo@moss-young.com,UA000000103755561,1592671212475050,False
aacosta11@gmail.com,UA000000106362980,1593540790039008,False
aadams9@gmail.com,UA000000103384927,1592575968245258,False
aadams@coleman.org,UA000000107105749,1593795399348718,False
aadams@howard.biz,UA000000104562958,1592928837244180,False
aadams@parker.net,UA000000106086190,1593449235669977,False
aadams@perry.info,UA000000107015487,1593779574016960,False
aadams@robinson.com,UA000000104689093,1592952097460118,False
aadkins@hill.biz,UA000000104672436,1592947174229318,True


#### 2-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["email", "user_id", "user_first_touch_timestamp", "converted"]

expectedCount = 782749

expectedFalseCount = 572379

assert conversionsDF.columns == expectedColumns, "Columns are not correct"

assert conversionsDF.filter(col("email").isNull()).count() == 0, "Email column contains null"

assert conversionsDF.count() == expectedCount, "There is an incorrect number of rows"

assert conversionsDF.filter(col("converted") == False).count() == expectedFalseCount, "There is an incorrect number of false entries in converted column"

### 3-A: Get cart item history for each user
- Explode the **`items`** field in **`eventsDF`** with the results replacing the existing **`items`** field
- Group by **`user_id`**
  - Collect a set of all **`items.item_id`** objects for each user and alias the column to "cart"

Save the result as **`cartsDF`**.

In [0]:
# TODO
from pyspark.sql.functions import collect_set, explode

cartsDF = (eventsDF
  .withColumn("items", explode("items"))
  .groupBy("user_id")
  .agg(collect_set("items.item_id").alias("cart"))
)
display(cartsDF)

user_id,cart
UA000000102358054,List(M_STAN_T)
UA000000102360011,List(M_STAN_Q)
UA000000102360488,List(M_STAN_Q)
UA000000102360715,List(M_STAN_T)
UA000000102360871,List(M_STAN_T)
UA000000102362166,List(M_STAN_K)
UA000000102362400,List(M_STAN_Q)
UA000000102362558,List(M_STAN_K)
UA000000102365562,List(M_STAN_K)
UA000000102366240,List(M_PREM_T)


#### 3-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["user_id", "cart"]

expectedCount = 488403

assert cartsDF.columns == expectedColumns, "Incorrect columns"

assert cartsDF.count() == expectedCount, "Incorrect number of rows"

assert cartsDF.select(col("user_id")).drop_duplicates().count() == expectedCount, "Duplicate user_ids present"

### 4-A: Join cart item history with emails
- Perform a left join on **`conversionsDF`** and **`cartsDF`** on the **`user_id`** field

Save the result as **`emailCartsDF`**.

In [0]:
# TODO
emailCartsDF = conversionsDF.join(cartsDF,'user_id','left')
display(emailCartsDF)

user_id,email,user_first_touch_timestamp,converted,cart
UA000000102359878,barkertristan@yahoo.com,1592205011208543,False,
UA000000102363779,adrianpowers@gmail.com,1592210398967864,False,
UA000000102369408,aaron5239@gmail.com,1592214661753053,False,
UA000000102371479,aaron55@hotmail.com,1592215774061633,False,
UA000000102372965,adamnelson@hotmail.com,1592216538174065,False,
UA000000102377012,aburnett@clarke.com,1592218354105458,False,
UA000000102383281,adamsanne@hotmail.com,1592220624077876,False,
UA000000102383849,aaronhall@johnson.com,1592220808344386,True,
UA000000102393763,abowers@morales-huffman.org,1592223655901976,False,
UA000000102393817,christinahayes@mooney-holland.com,1592223667891994,True,


#### 4-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["user_id", "email", "user_first_touch_timestamp", "converted", "cart"]

expectedCount = 782749

expectedCartNullCount = 397799

assert emailCartsDF.columns == expectedColumns, "Columns do not match"

assert emailCartsDF.count() == expectedCount, "Counts do not match"

assert emailCartsDF.filter(col("cart").isNull()).count() == expectedCartNullCount, "Cart null counts incorrect from join"

### 5-A: Filter for emails with abandoned cart items
- Filter **`emailCartsDF`** for users where **`converted`** is False
- Filter for users with non-null carts

Save the result as **`abandonedCartsDF`**.

In [0]:
# TODO
abandonedCartsDF = (emailCartsDF
  .filter(col("converted") == False)
  .filter(col("cart").isNotNull())
)
display(abandonedCartsDF)

user_id,email,user_first_touch_timestamp,converted,cart
UA000000102358054,markfitzpatrick@hotmail.com,1592198812458125,False,List(M_STAN_T)
UA000000102367817,russellpamela@yahoo.com,1592213618560512,False,List(M_PREM_K)
UA000000102369539,karenwright@jennings.com,1592214729249771,False,List(M_STAN_K)
UA000000102374838,kyle50@huang.com,1592217432667557,False,List(M_STAN_Q)
UA000000102376621,ubrown55@yahoo.com,1592218189654409,False,List(M_PREM_Q)
UA000000102379071,nelsonchristopher@yahoo.com,1592219147404116,False,List(M_PREM_Q)
UA000000102385440,thomaswatkins@yahoo.com,1592221304639231,False,List(M_STAN_K)
UA000000102386796,lukemiller@hotmail.com,1592221721420940,False,List(M_STAN_Q)
UA000000102393288,brandonwalters@holt.info,1592223527280745,False,List(M_STAN_F)
UA000000102402429,justin6717@gmail.com,1592225763260617,False,List(M_STAN_T)


#### 5-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["user_id", "email", "user_first_touch_timestamp", "converted", "cart"]

expectedCount = 204272

assert abandonedCartsDF.columns == expectedColumns, "Columns do not match"

assert abandonedCartsDF.count() == expectedCount, "Counts do not match"

### 6-A: Bonus Activity
Plot the number of abandoned cart items by product.

In [0]:
# TODO
from pyspark.sql.functions import explode

# Explode each cart so every abandoned item gets its own row, then count rows per item
abandonedItemsDF = (abandonedCartsDF
  .withColumn("items", explode("cart"))
  .groupBy("items")
  .count()
  .sort("items")
)
display(abandonedItemsDF)


#### 6-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
abandonedItemsDF.count()

In [0]:
expectedColumns = ["items", "count"]

expectedCount = 12

assert abandonedItemsDF.count() == expectedCount, "Counts do not match"

assert abandonedItemsDF.columns == expectedColumns, "Columns do not match"

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>