# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

# TODOS: 
# 1) import any other libraries you might need
# 5) write code to answer the quiz questions 

In [4]:
# 2) instantiate a Spark session 
spark = SparkSession.builder \
    .appName("Sparkify") \
    .getOrCreate()
# 3) read in the data set located at the path "data/sparkify_log_small.json"
data_path = "data/sparkify_log_small.json"
df = spark.read.json(data_path)

In [5]:
# 4) create a view to use with your SQL queries
df.createOrReplaceTempView("df")

# Question 1

Which page did user id "" (empty string) NOT visit?

In [7]:
# TODO: write your code to answer question 1

df_q1 = spark.sql("""
            SELECT DISTINCT page
            FROM df
            EXCEPT
            SELECT DISTINCT page
            FROM df
            WHERE userId = ''
            """)

# Show the result
df_q1.show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
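
The `EXCEPT` query above is a set difference: every page in the log minus the pages recorded for the empty user id. As a sanity check that does not need a Spark session, the same statement can be run on a few made-up rows with Python's built-in `sqlite3` module. This is only a sketch: SQLite is a stand-in engine here, not Spark, and the rows are hypothetical rather than taken from the Sparkify log.

```python
import sqlite3

# Hypothetical (userId, page) rows; NOT taken from the Sparkify data.
rows = [
    ("", "Home"),
    ("", "Login"),
    ("1", "Home"),
    ("1", "NextSong"),
    ("2", "Logout"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (userId TEXT, page TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# Same shape as the Spark SQL query: all pages minus the empty user's pages.
query = """
    SELECT DISTINCT page FROM df
    EXCEPT
    SELECT DISTINCT page FROM df WHERE userId = ''
"""
not_visited = sorted(page for (page,) in conn.execute(query))
print(not_visited)  # ['Logout', 'NextSong']
```

"Home" drops out because the empty user id visited it, even though other users did too.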



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [6]:
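# TODO: write your code to answer question 3
# Note: the query below lists the pages and auth states recorded for the
# empty user id; it does not count female users.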

df_q2 = spark.sql("""
            SELECT DISTINCT page, auth
            FROM df
            WHERE userId = ''
            """)

df_q2.show()

+-----+----------+
| page|      auth|
+-----+----------+
|Login|Logged Out|
| Help|Logged Out|
| Home|     Guest|
| Home|Logged Out|
|About|Logged Out|
|About|     Guest|
+-----+----------+
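
The question asks for a count of users, so one possible query counts distinct `userId` values after filtering on gender. The sketch below assumes the log has a `gender` column that uses the value `'F'`; neither appears in the cells of this notebook, so treat both as assumptions. It runs on hypothetical rows with Python's built-in `sqlite3` module as a stand-in for Spark.

```python
import sqlite3

# Hypothetical (userId, gender, page) rows; the `gender` column and its
# 'F' value are assumptions, and the rows are NOT from the Sparkify data.
rows = [
    ("1", "F", "Home"),
    ("1", "F", "NextSong"),
    ("2", "M", "Home"),
    ("3", "F", "NextSong"),
    ("", None, "Login"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (userId TEXT, gender TEXT, page TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

# DISTINCT matters: each user appears on many log rows.
female_query = """
    SELECT COUNT(DISTINCT userId) AS female_users
    FROM df
    WHERE gender = 'F'
"""
(female_users,) = conn.execute(female_query).fetchone()
print(female_users)  # 2
```

In the notebook, the same query string could be passed to `spark.sql(...)` against the `df` view.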



# Question 4

How many songs were played from the most played artist?

In [9]:
# TODO: write your code to answer question 4

df_q4 = spark.sql("""
            SELECT *
            FROM (
                SELECT *, RANK() OVER (ORDER BY num_songs DESC) AS rank
                FROM (
                    SELECT artist, COUNT(song) AS num_songs
                    FROM df
                    GROUP BY artist
                ) AS counts
            ) AS ranked
            WHERE rank = 1
            """)

# Show the result
df_q4.show()

+--------+---------+----+
|  artist|num_songs|rank|
+--------+---------+----+
|Coldplay|       83|   1|
+--------+---------+----+
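
Filtering on `RANK() ... = 1` returns every artist tied for first place, whereas `ORDER BY num_songs DESC LIMIT 1` would return only one of them. The sketch below illustrates the tie behavior on hypothetical rows, with Python's built-in `sqlite3` module as a stand-in for Spark. It needs SQLite 3.25 or newer for window functions, and the alias `play_rank` is a name chosen for this example.

```python
import sqlite3

# Hypothetical (artist, song) rows; NOT taken from the Sparkify data.
# The (None, None) row mimics a log entry that is not a song play.
rows = [
    ("Artist A", "s1"),
    ("Artist A", "s2"),
    ("Artist B", "s3"),
    ("Artist B", "s4"),
    ("Artist C", "s5"),
    (None, None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (artist TEXT, song TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# COUNT(song) skips NULL songs, so the NULL artist group gets a count of 0.
query = """
    SELECT artist, num_songs
    FROM (
        SELECT artist, num_songs,
               RANK() OVER (ORDER BY num_songs DESC) AS play_rank
        FROM (
            SELECT artist, COUNT(song) AS num_songs
            FROM df
            GROUP BY artist
        ) AS counts
    ) AS ranked
    WHERE play_rank = 1
    ORDER BY artist
"""
top_artists = conn.execute(query).fetchall()
print(top_artists)  # [('Artist A', 2), ('Artist B', 2)]
```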



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [20]:
# Register a UDF that returns 1 for 'Home' page rows and 0 for all other rows
spark.udf.register("ishome_udf", lambda ishome: int(ishome == 'Home'), IntegerType())

df_q5 = spark.sql("""
WITH cusum AS (
    SELECT userId, page, ts,
           ishome_udf(page) AS homevisit,
           SUM(ishome_udf(page)) OVER (
               PARTITION BY userId
               ORDER BY ts DESC
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS period
    FROM df
    WHERE page IN ('NextSong', 'Home')
)

, cusum_nextsong AS (
    SELECT userId, page, ts, homevisit, period
    FROM cusum
    WHERE page = 'NextSong'
)

, result AS (
    SELECT userId, period, COUNT(period) AS period_count
    FROM cusum_nextsong
    GROUP BY userId, period
)

SELECT ROUND(AVG(period_count),0) AS avg_period_count
FROM result
""")

df_q5.show()


+----------------+
|avg_period_count|
+----------------+
|             7.0|
+----------------+
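
The query above groups songs by a running count of Home visits. Walking each user's events from newest to oldest, every Home visit increases the count, so all songs that share a count fall between the same pair of Home visits. The plain-Python trace below is a minimal sketch of that idea for a single user with hypothetical events. It is not the Spark implementation, and the timestamps are made up.

```python
from collections import Counter

# Hypothetical (ts, page) events for one user; NOT from the Sparkify log.
events = [
    (1, "Home"),
    (2, "NextSong"),
    (3, "NextSong"),
    (4, "Home"),
    (5, "NextSong"),
    (6, "NextSong"),
    (7, "NextSong"),
    (8, "NextSong"),
    (9, "Home"),
]

# Walk from newest to oldest (like ORDER BY ts DESC) and keep a running
# count of Home visits; that count is the "period" label for each song.
period = 0
songs_per_period = Counter()
for ts, page in sorted(events, reverse=True):
    if page == "Home":
        period += 1
    elif page == "NextSong":
        songs_per_period[period] += 1

# Average songs per period, matching AVG(period_count) in the query.
average = sum(songs_per_period.values()) / len(songs_per_period)
print(dict(songs_per_period), average)  # {1: 4, 2: 2} 3.0
```

Periods with no songs never appear in the counter, just as they produce no rows after the `page = 'NextSong'` filter in the SQL.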

