In [None]:
import hashlib
import json
import pyspark.sql.functions as f
from pyspark.sql import Window

**Initial Note:** Initialize a cluster with runtime 13.3 LTS and Spark version 3.4.1.

## Project Introduction

In this final project, you'll apply the skills and knowledge you've gained throughout the course in a real-world scenario, simulating your work as a Data Analyst at Mercedes-Benz.io.

The goal of this project is to guide you in manipulating and analyzing Google Analytics data to gain insights into user interactions on a website.

For that, you'll work with the [Google Analytics Sample dataset](https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?inv=1&invt=AbmlmQ), which contains real data from the [Google Merchandise Store](https://shop.googlemerchandisestore.com/), an e-commerce store that sells Google-branded merchandise.

The dataset represents typical traffic patterns for an e-commerce website and includes the following types of information:
- **Traffic source data**: information about where website visitors come from, including organic search, paid search, and display advertising traffic.
- **Content data**: information about the behavior of users on the website, such as the pages they visit, how they interact with content, etc.
- **Transactional data**: information about purchases made on the Google Merchandise Store, including transaction details and revenue metrics.

### Part 1 - Data Cleaning and Easy Analytics

In this first notebook you'll get familiar with the data, perform some preprocessing and cleaning tasks, and then start analyzing the data to answer some easy business questions.


### Task Completion and Validation
Throughout the notebooks, you will be asked to complete a series of tasks and answer questions. You’ll encounter empty cells where you need to implement the solution, as well as commented-out cells that you should uncomment and fill in with your responses. Afterward, assertion cells will check whether you've completed the tasks correctly.

This way you can have immediate feedback on your work, and you can ask questions if you get stuck.
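
As a rough illustration of how these assertion cells work, the sketch below mimics the check in plain Python: the answer is converted to a string, hashed with SHA-256, and compared with a stored digest, so the expected value never appears in clear text. The value `42` and the helper name `digest` are made up for this example and are not part of the project.

```python
import hashlib

def digest(answer):
    """Hash the string form of an answer, similar to the integer checks in this notebook."""
    return hashlib.sha256(str(answer).encode('utf-8')).hexdigest()

# Hypothetical stored digest; in the notebook it is a hard-coded hex string
expected = digest(42)

print(digest(42) == expected)    # a matching answer passes the comparison
print(digest(43) == expected)    # any other value produces a different digest
print(digest('42') == expected)  # the string '42' hashes to the same digest as the integer
```

Because only the string form is hashed, `42` and `'42'` produce the same digest, which is why the test cells also check the type of the answer with `isinstance`.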

## Download the data

The data is available in a zip file. This zip contains three Parquet files:
- `ga_sessions_main.parquet`: the main information about each session
- `ga_sessions_hits.parquet`: detailed information about hits in each session
- `ga_sessions_network.parquet`: information about traffic sources, device and geographic information

**NOTE:** To make things a bit easier, only data from the first 15 days of August 2016 was included in the dataset. Also, some noisy information about `hits` was removed from the original data.

Let's download the data and save it to the Databricks File System (DBFS).

In [None]:
%sh wget https://raw.githubusercontent.com/inesmcm26/lp-big-data-mercedes/main/data/ga_sessions.zip

In [None]:
%sh unzip ga_sessions.zip

In [None]:
dbutils.fs.cp('file:/databricks/driver/ga_sessions_main.parquet', 'dbfs:/FileStore/final_project/ga_sessions_main.parquet')
dbutils.fs.cp('file:/databricks/driver/ga_sessions_network.parquet', 'dbfs:/FileStore/final_project/ga_sessions_network.parquet')
dbutils.fs.cp('file:/databricks/driver/ga_sessions_hits.parquet', 'dbfs:/FileStore/final_project/ga_sessions_hits.parquet')

Run the following cell to load each dataset into a Spark DataFrame.

In [None]:
df_main = spark.read.parquet('/FileStore/final_project/ga_sessions_main.parquet')
df_hits = spark.read.parquet('/FileStore/final_project/ga_sessions_hits.parquet')
df_network = spark.read.parquet('/FileStore/final_project/ga_sessions_network.parquet')

## Datasets Overview

The column that identifies a session and is **common to all tables** is the `sessionId` column.

#### Main dataset

In [None]:
df_main.printSchema()

Besides the session id, the main dataset contains the following columns:
- **visitorId**: The unique identifier for a visitor
- **visitNumber**: The visit number of this visitor. If this is the first visit to the website, then this is set to 1.
- **visitStartTime**: The timestamp (expressed as POSIX time) of the beginning of the session
- **totals**: A struct with statistics about the session, such as total number of hits, time on site, number of transactions and revenue, etc.
- **channelGrouping**: The channel via which the user came to the Store

#### Hits dataset

In [None]:
df_hits.printSchema()

Besides the session id, the hits dataset contains the following column:

- **hits**: An array of structs representing all the hits in this session. A hit is an interaction that results in data being sent to Google Analytics. It can either be a page visit or an interaction with some page element. Each struct is a hit defined by the following fields:
    - **hitNumber**: The number of this hit in the session
    - **type**: Type of the hit: 'PAGE' (Page visit) or 'EVENT' (Interaction with some element on the page)
    - **hour**: Hour of the hit
    - **minute**: Minute of the hit
    - **time**: Time spent on the hit
    - **page**: Structured information about the page
    - **contentGroup**: Information about the content categorization of the page on the website
    - **product**: Array of structs with product information of all products displayed on the page
    - **eventInfo**: If hit is of type 'EVENT', this field contains information about the event
    - **promotion**: Array of structs with promotion information of all promotions displayed on the page.
    - **promotionActionInfo**: Present when the hit includes a promotion. It indicates whether the promotion was clicked (a hit of type 'EVENT' whose event is a 'Promotion Click') or only viewed on the page without being clicked.
    - **transaction**: Information about the transaction when the hit is an event 'Confirm Checkout'. Null otherwise.



#### Network dataset


In [None]:
df_network.printSchema()

Besides the session id, the network dataset contains the following columns:

- **trafficSource**: A struct with information about the source of the session, as well as ads and campaign information
- **device**: A struct with information about the device used in the session
- **geoNetwork**: A struct with information about the geographic location of the user. Most of this information is obscured and only city, country and continent are available.
- **customDimensions**: Extra traffic information. You can ignore this column.


## Dataset analysis and cleaning

Start by checking how many rows the dataframes have.

In [None]:
# WRITE THE CODE HERE

In [None]:
# nr_rows = WRITE THE SOLUTION HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(nr_rows, int)
    assert hashlib.sha256(str(nr_rows).encode('utf-8')).hexdigest() == '5a0d24b5fcc0584bfd8d51a4fa0ebb838ef1e9769707e17c7e8002438909e383'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')

Now, check whether there are any missing values in the main dataset.

In [None]:
# WRITE THE CODE HERE

Assume that if the channel grouping is missing, the user arrived at the store via the 'Direct' channel. Fill in any missing values accordingly.

In [None]:
# WRITE THE CODE HERE

In [None]:
# Run this test to verify that the answer is correct
channel_grouping_values = df_main.select('channelGrouping').rdd.flatMap(lambda x: x).collect()
try:
    assert None not in channel_grouping_values
    assert channel_grouping_values.count('Direct') == 9587
    print('Good job! You managed to fill the missing values')
except:
    print('Try again')

Next, examine the `geoNetwork` column in the network dataframe. Count the number of sessions originating from each continent.

In [None]:
# WRITE THE CODE HERE

Do you see the problem? Continent names are sometimes written in lowercase and sometimes in uppercase. Standardize the data by ensuring that the continent names are always in lowercase.

**Hint:** You've previously learned how to convert a string column to lowercase using the `f.lower` function. However, in this case, you need to apply this function to the `continent` field within the `geoNetwork` struct column. At the same time, you want to replace its value directly in the struct. You can check the PySpark column methods [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html). One of them will help you solve the problem.
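
To see why inconsistent casing distorts the per-continent counts, here is a small plain-Python sketch, independent of Spark. The list of continent names is made up for the example and is not taken from the dataset.

```python
from collections import Counter

# Made-up continent values with inconsistent casing
raw_continents = ['Europe', 'europe', 'Asia', 'EUROPE', 'asia']

# Grouping the raw strings treats each spelling as a separate continent
raw_counts = Counter(raw_continents)

# Lowercasing first merges the spellings into one group per continent
clean_counts = Counter(name.lower() for name in raw_continents)

print(len(raw_counts), 'groups before standardizing')
print(len(clean_counts), 'groups after standardizing')
```

The same reasoning applies to a grouped count in Spark: string comparison is case-sensitive, so the values must be standardized before grouping.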

In [None]:
# WRITE THE CODE HERE

In [None]:
# Run this test to verify that the answer is correct
continents = df_network.select(f.col('geoNetwork').getField('continent').alias('continent')).distinct().rdd.flatMap(lambda x: x).collect()

try:
    for continent in continents:
        assert continent == continent.lower()
    print('Good job! You managed to standardize the data')
except:
    print('Try again')

## Answer business questions

### Easy questions


Users access the store through different channels, and each session has a corresponding transaction revenue value.

1. Which channel generates the highest total revenue across all sessions?

Notes:
- Use the `channelGrouping` column in the main dataset for channel types.
- Calculate the revenue using the `totalTransactionRevenue` field within the `totals` column.

In [None]:
# WRITE THE CODE HERE

In [None]:
# channel = WRITE THE ANSWER HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(channel, str)
    assert hashlib.sha256(channel.encode('utf-8')).hexdigest() == 'aeb7b00433003f75c286e214eccd11a1e4ba6fbd0d0413cb35864749821cc8e0'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')

2. Users access the store through different browsers. Which are the top 3 browsers ranked by the total time users spent on the site?

Notes:
- You can find the browser used by a user during a session in the `device` column of the network dataframe.
- The total time spent on site during a session is recorded in the `totals` column of the main dataframe.

In [None]:
# WRITE THE CODE HERE

In [None]:
# top_browsers = ['Browser1', 'Browser2', 'Browser3'] REPLACE THE VALUES WITH THE CORRECT ONES

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(top_browsers, list)
    assert len(top_browsers) == 3
    for browser in top_browsers:
        assert isinstance(browser, str)
    assert hashlib.sha256(json.dumps(''.join(top_browsers)).encode()).hexdigest() == '13a1c6bec3d2d1c3d8be28dbf835880972effd2ff9619f1b15d6fee8d505ff5c'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')
    print('Check that each browser name starts with a capital letter')

3. Analyse the website traffic (total number of sessions) per **hour of the day** and **day of the week**.

Visualize the result using a pivot table.

**NOTE:** The start time of each session is available in the `visitStartTime` column of the main dataframe, and is saved in UNIX time. You have to first transform it into a date format before being able to extract the hour and day of week.
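
As a minimal sketch of what this conversion means, the plain-Python example below turns UNIX timestamps into a day of the week and an hour, then counts sessions per combination. It is independent of Spark. The timestamps are made up, the helper name `day_and_hour` is hypothetical, and the times are interpreted in UTC, which is an assumption: a different time zone shifts the hours.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def day_and_hour(unix_seconds):
    """Return (day name, hour) for a UNIX timestamp, interpreted in UTC."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return DAY_NAMES[moment.weekday()], moment.hour

# Made-up session start times (seconds since 1970-01-01 00:00:00 UTC)
sample_starts = [1470082500, 1470083100, 1470168900]

# Count sessions per (day of week, hour) pair, the cells of the pivot table
traffic = Counter(day_and_hour(ts) for ts in sample_starts)

for (day, hour), sessions in sorted(traffic.items()):
    print(day, hour, sessions)
```

Each key of `traffic` corresponds to one cell of the pivot table, with the day of the week on one axis and the hour on the other.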

What is the total number of sessions registered at 8 pm on Tuesdays?

In [None]:
# WRITE THE CODE HERE

In [None]:
# nr_sessions = WRITE THE SOLUTION HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(nr_sessions, int)
    assert hashlib.sha256(str(nr_sessions).encode('utf-8')).hexdigest() == '5426d2ca50f244fb43fe9eafc82da08f33f3b4f8d9140802bd0102e780b629d6'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')

4. Identify the `visitorId` of the user with the highest average time gap **in days** between two consecutive sessions. Consider only visitors that have more than 6 registered sessions.

**NOTE:** Again the start time of each session is saved in UNIX time, so you need to transform it into a date format to calculate the time gap in days between two consecutive sessions.
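
The arithmetic behind this question can be sketched for a single visitor in plain Python, independent of Spark. The sketch assumes that the gap is measured in whole calendar days between session dates and that timestamps are interpreted in UTC. Both are assumptions, the timestamps are made up, and the helper name `average_gap_days` is hypothetical.

```python
from datetime import datetime, timezone

def average_gap_days(unix_starts):
    """Average gap in calendar days between consecutive sessions.

    Expects at least two session start times for the same visitor.
    """
    # Convert each UNIX timestamp to a calendar date and order the sessions
    dates = sorted(
        datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in unix_starts
    )
    # Pair each session with the next one and measure the gap in days
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return sum(gaps) / len(gaps)

# Made-up sessions on 1, 3 and 7 August 2016 (UTC): gaps of 2 and 4 days
visitor_starts = [1470009600, 1470182400, 1470528000]
print(average_gap_days(visitor_starts))
```

In the project the same idea is applied per visitor: order the sessions by start time, compare each session with the previous one, and average the resulting gaps.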

In [None]:
# WRITE THE CODE HERE

In [None]:
# visitor_id = WRITE THE SOLUTION HERE

In [None]:
# Run this test to verify that the answer is correct
try:
    assert isinstance(visitor_id, str)
    assert hashlib.sha256(visitor_id.encode('utf-8')).hexdigest() == '5abdd1c91bc52becf7267a3bacfb9d5e979f54f234919d982ed0c1f61cb6425a'
    print('Good job! The answer is correct')
except:
    print('The answer is not right yet! Try again')

Good job! Next week we will continue with more advanced questions.