In [None]:
import json
import hashlib
import pyspark.sql.functions as f

**Initial Note:** Initialize a cluster with Databricks Runtime 13.3 LTS (Spark version 3.4.1).

## Project Overview

Let's do a quick recap of the project: 
- You'll be working with the [Google Analytics Sample dataset](https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?inv=1&invt=AbmlmQ), which contains real data from the [Google Merchandise Store](https://shop.googlemerchandisestore.com/), a real ecommerce store that sells Google-branded merchandise. This data is similar to what you deal with in your day-to-day work as a data analyst.
- In Part 1 you downloaded the data, preprocessed it and answered some simple analytics questions.

### Part 2 - Advanced Analytics

In this part of the project, you will use the data you prepared in Part 1 to answer medium to hard analytics questions about the data.

### Task Completion and Validation
Throughout the notebooks, you will be asked to complete a series of tasks and answer questions. You’ll encounter empty cells where you need to implement the solution, as well as commented-out cells that you should uncomment and fill in with your responses. Afterward, assertion cells will check whether you've completed the tasks correctly.

This way you can have immediate feedback on your work, and you can ask questions if you get stuck.
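The assertion cells in this notebook compare a SHA-256 digest of your answer with a stored digest, so the expected value never appears in plain text. A minimal sketch of that idea is shown here; `matches_digest` and the example answer are hypothetical names used only for illustration.

```python
import hashlib


def matches_digest(answer: str, expected_hex: str) -> bool:
    """Return True when the SHA-256 digest of `answer` equals `expected_hex`."""
    return hashlib.sha256(answer.encode("utf-8")).hexdigest() == expected_hex


# Hypothetical stored digest: a notebook would keep only this value, not the answer.
stored_digest = hashlib.sha256("example-answer".encode("utf-8")).hexdigest()

print(matches_digest("example-answer", stored_digest))  # True
print(matches_digest("Example-Answer", stored_digest))  # False
```

Because the digest depends on the exact characters, an answer with different casing, ordering or rounding will not match.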

## Load the data

If you have completed the first part of this project, you should already have the data saved in the DBFS.

In [None]:
df_main = spark.read.parquet('/FileStore/final_project/ga_sessions_main.parquet')
df_hits = spark.read.parquet('/FileStore/final_project/ga_sessions_hits.parquet')
df_network = spark.read.parquet('/FileStore/final_project/ga_sessions_network.parquet')

### [OPTIONAL] In case you don't have the data yet, run the cells below

In [None]:
%sh wget https://raw.githubusercontent.com/inesmcm26/lp-big-data-mercedes/main/data/ga_sessions.zip

In [None]:
%sh unzip ga_sessions.zip

In [None]:
dbutils.fs.cp('file:/databricks/driver/ga_sessions_main.parquet', 'dbfs:/FileStore/final_project/ga_sessions_main.parquet')
dbutils.fs.cp('file:/databricks/driver/ga_sessions_network.parquet', 'dbfs:/FileStore/final_project/ga_sessions_network.parquet')
dbutils.fs.cp('file:/databricks/driver/ga_sessions_hits.parquet', 'dbfs:/FileStore/final_project/ga_sessions_hits.parquet')

df_main = spark.read.parquet('/FileStore/final_project/ga_sessions_main.parquet')
df_hits = spark.read.parquet('/FileStore/final_project/ga_sessions_hits.parquet')
df_network = spark.read.parquet('/FileStore/final_project/ga_sessions_network.parquet')

### Data Cleaning

Run the cells below to apply the data cleaning operations you implemented in the first part of this project.

In [None]:
df_main = (
    df_main
    .fillna('Direct', subset=['channelGrouping'])
)

df_network = (
    df_network
    .withColumn(
        'geoNetwork',
        f.col('geoNetwork').withField('continent', f.lower(f.col('geoNetwork').getField('continent')))
    )
)

## Datasets Overview

The column that identifies a session and is **common to all tables** is the `sessionId` column.


**Main dataset:**
Besides the session id, the main dataset contains the following columns:
- **visitorId**: The unique identifier for a visitor
- **visitNumber**: The visit number of this visitor. If this is the first visit to the website, then this is set to 1.
- **visitStartTime**: The timestamp (expressed as POSIX time) of the beginning of the session
- **totals**: A struct with statistics about the session, such as total number of hits, time on site, number of transactions and revenue, etc.
- **channelGrouping**: The channel via which the user came to the Store
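Since `visitStartTime` is stored as POSIX time, it may help to see how such a value maps to a readable timestamp. The sketch below uses plain Python and assumes the value is expressed in seconds; `posix_to_utc` and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def posix_to_utc(seconds: int) -> datetime:
    """Convert POSIX seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Made-up sample value, not taken from the dataset.
example = posix_to_utc(1_500_000_000)
print(example.isoformat())  # 2017-07-14T02:40:00+00:00
```

On a Spark dataframe, a function such as `f.from_unixtime` could play the same role, again assuming the column holds seconds.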

**Hits dataset:**
Besides the session id, the hits dataset contains the following column:
- **hits**: An array of structs representing all the hits in this session. A hit is an interaction that results in data being sent to Google Analytics. It can either be a page visit or an interaction with some page element. Each struct is a hit defined by the following fields:
    - **hitNumber**: The number of this hit in the session
    - **type**: Type of the hit: 'PAGE' (Page visit) or 'EVENT' (Interaction with some element on the page)
    - **hour**: Hour of the hit
    - **minute**: Minute of the hit
    - **time**: Time spent on the hit
    - **page**: Structured information about the page
    - **contentGroup**: Information about the content categorization of the page on the website
    - **product**: Array of structs with product information of all products displayed on the page
    - **eventInfo**: If hit is of type 'EVENT', this field contains information about the event
    - **promotion**: Array of structs with promotion information of all promotions displayed on the page.
    - **promotionActionInfo**: Present when there is a promotion on the hit. It indicates whether the promotion was clicked (a hit of type 'EVENT' whose event action is 'Promotion Click') or was only viewed on the page without being clicked.
    - **transaction**: Information about the transaction when the hit is an event 'Confirm Checkout'. Null otherwise.

**Network dataset:** Besides the session id, the network dataset contains the following columns:

- **trafficSource**: A struct with information about the source of the session, as well as ads and campaign information
- **device**: A struct with information about the device used in the session
- **geoNetwork**: A struct with information about the geographic location of the user. Most of this information is obscured and only continent, country and city are available.
- **customDimensions**: Extra traffic information. You can ignore this column.



## Answer business questions

### Medium questions

5. What are the top 5 products that are most added to the cart?

**NOTES:**
- Use the hits dataset to answer this question.
- A hit of type 'EVENT' can correspond to one of the following event actions (`eventAction` field of `eventInfo`):
    - Product Click
    - Add to Cart -> This event is the one you're looking for
    - Remove from Cart
    - Quickview Click
    - Onsite Click
    - Promotion Click
- A product is identified by its SKU value. You can find this value in the `productSKU` field of a product. Remember that the `product` column is an array of structs, one for each product involved in a hit.
- When a hit is of type 'EVENT' and the event action is 'Add to Cart', the `productSKU` field of the product struct contains the SKU of the product added to the cart.
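The filtering and counting logic described in the notes can be sketched in plain Python on toy data. This is only an illustration of the idea, not the Spark solution: the sessions, SKUs and the helper `count_added_skus` are made up, and on the real dataframe the nested arrays would first need to be flattened (for example with `f.explode`).

```python
from collections import Counter

# Toy sessions mimicking the `hits` column; every value is made up.
sessions = [
    [
        {"type": "PAGE", "eventInfo": None,
         "product": [{"productSKU": "SKU_A"}, {"productSKU": "SKU_B"}]},
        {"type": "EVENT", "eventInfo": {"eventAction": "Add to Cart"},
         "product": [{"productSKU": "SKU_A"}]},
    ],
    [
        {"type": "EVENT", "eventInfo": {"eventAction": "Add to Cart"},
         "product": [{"productSKU": "SKU_B"}]},
        {"type": "EVENT", "eventInfo": {"eventAction": "Add to Cart"},
         "product": [{"productSKU": "SKU_A"}]},
        {"type": "EVENT", "eventInfo": {"eventAction": "Remove from Cart"},
         "product": [{"productSKU": "SKU_A"}]},
    ],
]


def count_added_skus(sessions):
    """Count how often each SKU appears in 'Add to Cart' event hits."""
    counts = Counter()
    for hits in sessions:
        for hit in hits:
            info = hit["eventInfo"]
            is_add = hit["type"] == "EVENT" and info and info["eventAction"] == "Add to Cart"
            if is_add:
                for product in hit["product"] or []:
                    counts[product["productSKU"]] += 1
    return counts


# Keep the two most frequently added SKUs, most frequent first.
top = [sku for sku, _ in count_added_skus(sessions).most_common(2)]
print(top)  # ['SKU_A', 'SKU_B']
```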

In [None]:
# WRITE YOUR CODE HERE

In [None]:
# top_products = ["product_sku_1", "product_sku_2", ...] REPLACE THE VALUES WITH THE CORRECT ONES

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(top_products, list)
    assert len(top_products) == 5
    for product in top_products:
        assert isinstance(product, str)
except:
    print("The variable `top_products` should be a list with 5 strings")

try:
    assert hashlib.sha256(json.dumps(''.join(top_products)).encode()).hexdigest() == '2758b9ec187f6a3f3d13a35a194c82b92e12ba7b995037a84c3f4262dacac35f'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')
    print('Make sure the product SKUs are in the correct order')

6. What is the average time spent by users on the 'Shopping Cart' page in sessions where a purchase was made?

Answer with 2 decimal places.

**NOTES:**
- To determine sessions where purchases were made, filter the main dataframe by checking the `transactions` field of the `totals` column. If the field is non-null and greater than 0, it indicates that a purchase occurred during the session.
- You can determine if a hit corresponds to a user being on the 'Shopping Cart' page by checking the `pageTitle` field of the `page` struct in the `hits` column. Only hits of type 'PAGE' should be considered.
- The time spent on a hit is available in the `time` field of the `hits` column.
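As a rough illustration of these notes, the sketch below combines the two filters on toy data in plain Python. All values and the helper `average_cart_time` are hypothetical, and the average is taken over individual hits, which is one possible reading of the question.

```python
# Toy data; every value is made up for illustration.
transactions_by_session = {"s1": 1, "s2": None, "s3": 2}
hits_by_session = {
    "s1": [
        {"type": "PAGE", "page": {"pageTitle": "Shopping Cart"}, "time": 10},
        {"type": "PAGE", "page": {"pageTitle": "Home"}, "time": 99},
    ],
    "s2": [  # no purchase in this session, so it is skipped
        {"type": "PAGE", "page": {"pageTitle": "Shopping Cart"}, "time": 500},
    ],
    "s3": [
        {"type": "PAGE", "page": {"pageTitle": "Shopping Cart"}, "time": 25},
        {"type": "PAGE", "page": {"pageTitle": "Shopping Cart"}, "time": 30},
        {"type": "EVENT", "page": {"pageTitle": "Shopping Cart"}, "time": 70},
    ],
}


def average_cart_time(transactions_by_session, hits_by_session):
    """Average `time` of 'Shopping Cart' PAGE hits in sessions with a purchase."""
    times = []
    for session_id, hits in hits_by_session.items():
        transactions = transactions_by_session.get(session_id)
        if transactions is None or transactions <= 0:
            continue  # keep only sessions where a purchase was made
        for hit in hits:
            if hit["type"] == "PAGE" and hit["page"]["pageTitle"] == "Shopping Cart":
                times.append(hit["time"])
    return round(sum(times) / len(times), 2) if times else None


avg = average_cart_time(transactions_by_session, hits_by_session)
print(avg)  # 21.67
```

The call to `round(..., 2)` matters here because the validation cell hashes `str(avg_time)`, so the value must be rounded to exactly 2 decimal places.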

In [None]:
# WRITE YOUR CODE HERE

In [None]:
# avg_time = WRITE THE SOLUTION HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(avg_time, float)
except:
    print('You should assign a float value to the variable `avg_time`')

try:
    assert hashlib.sha256(str(avg_time).encode('utf-8')).hexdigest() == '5434c334d8669be826679ae262c9feb5b895a269d1afde5fd2bdb3fe24fa2d2c'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')

### Hard questions

7. In sessions where a promotion was clicked and at least one product was added to the cart, what is the ID of the most clicked promotion?

**NOTES:**
- Use the hits dataset to answer this question.
- Promotion clicks and adding a product to the cart correspond to hits of type 'EVENT'.
- To determine if a hit corresponds to a product being added to the cart or a promotion click, analyze the `eventInfo` column. There are these possible `eventAction` values:
    - Product Click
    - Add to Cart -> This event must be present in the session
    - Remove from Cart
    - Quickview Click
    - Onsite Click
    - Promotion Click -> This event must be present in the session
- For hits where there was a promotion click, the `promotion` column contains an array with only one element: the details of the clicked promotion. The promotion id can be found in the `promoId` field within that element.
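The session-level condition in these notes (both events must occur in the same session) can be sketched in plain Python on toy data. The sessions, promotion ids and helper names below are hypothetical and only illustrate the logic, not the Spark solution.

```python
from collections import Counter


def event_action(hit):
    """Return the event action of an EVENT hit, or None for any other hit."""
    info = hit["eventInfo"]
    return info["eventAction"] if hit["type"] == "EVENT" and info else None


def click(promo_id):
    """Build a made-up 'Promotion Click' hit for the toy data."""
    return {"type": "EVENT", "eventInfo": {"eventAction": "Promotion Click"},
            "promotion": [{"promoId": promo_id}]}


ADD = {"type": "EVENT", "eventInfo": {"eventAction": "Add to Cart"}, "promotion": None}

# Toy sessions; every value is made up.
sessions = [
    [click("P1"), ADD],
    [click("P2"), click("P2")],  # no 'Add to Cart', so this session is excluded
    [click("P1"), click("P3"), ADD],
]


def most_clicked_promo(sessions):
    """Most clicked promoId among sessions containing both required events."""
    counts = Counter()
    for hits in sessions:
        actions = {event_action(hit) for hit in hits}
        if {"Add to Cart", "Promotion Click"} <= actions:
            for hit in hits:
                if event_action(hit) == "Promotion Click":
                    counts[hit["promotion"][0]["promoId"]] += 1
    return counts.most_common(1)[0][0] if counts else None


print(most_clicked_promo(sessions))  # P1
```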

In [None]:
# WRITE YOUR CODE HERE

In [None]:
# promo_id = WRITE THE SOLUTION HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(promo_id, str)
except:
    print('You should assign a string value to the variable `promo_id`')

try:
    assert hashlib.sha256(promo_id.encode('utf-8')).hexdigest() == '2f34cf7b9ce1f5062f0fe6f8f9a6d073214fc6869ba4a85014bab1cf672e80cc'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')

8. Identify the user with the most sessions in which promotions were viewed but never clicked.

To solve this, you'll need to count the number of sessions where each user viewed a promotion but never clicked on it. The user with the highest count is the one you're looking for.

Use a UDF to answer the question.

**NOTES:**
- Check the `promoIsView` field in `promotionActionInfo` to determine if promotions were viewed on a hit.
- Check the `promoIsClick` field in `promotionActionInfo` to determine if a user clicked on a promotion on a hit.
- Start by creating a boolean column that indicates if promotions were viewed during a session but none was clicked. To create this column you should apply a udf directly to the `hits` column on the hits dataframe.
- Then calculate the total number of sessions where this occurred for each user. The visitor id for each session is on the main dataframe.
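The body of a UDF is ordinary Python, so the per-session check can be prototyped and tested without Spark. The sketch below is one possible shape for it, assuming `promoIsView` and `promoIsClick` are truthy when set and that `promotionActionInfo` may be null; the function name and toy data are hypothetical.

```python
from collections import Counter


def viewed_but_never_clicked(hits):
    """True if some hit shows a promotion view and no hit shows a promotion click."""
    viewed = False
    for hit in hits or []:
        info = hit["promotionActionInfo"]
        if info is None:
            continue  # no promotion on this hit
        if info["promoIsClick"]:
            return False  # a single click disqualifies the whole session
        if info["promoIsView"]:
            viewed = True
    return viewed


VIEW = {"promotionActionInfo": {"promoIsView": True, "promoIsClick": None}}
CLICK = {"promotionActionInfo": {"promoIsView": None, "promoIsClick": True}}
PLAIN = {"promotionActionInfo": None}

# Toy sessions and a toy session-to-visitor mapping; every value is made up.
sessions = {"s1": [VIEW], "s2": [VIEW, CLICK], "s3": [PLAIN], "s4": [PLAIN, VIEW]}
visitor_by_session = {"s1": "v1", "s2": "v1", "s3": "v2", "s4": "v1"}

# Count qualifying sessions per visitor.
counts = Counter(
    visitor_by_session[session_id]
    for session_id, hits in sessions.items()
    if viewed_but_never_clicked(hits)
)
print(counts.most_common(1)[0][0])  # v1
```

Indexing with `hit["promotionActionInfo"]` works for both dictionaries and Spark `Row` objects, so a function like this could be wrapped with `f.udf(viewed_but_never_clicked, BooleanType())`, with `BooleanType` imported from `pyspark.sql.types`, and applied to the `hits` column.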

In [None]:
# WRITE YOUR CODE HERE

In [None]:
# visitor_id = WRITE THE SOLUTION HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(visitor_id, str)
except:
    print('You should assign a string value to the variable `visitor_id`')

try:
    assert hashlib.sha256(visitor_id.encode('utf-8')).hexdigest() == '230dd9fdc397961e79b1a698614b91f948ac7ab685261b7d21bf842387768fe9'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')