# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [1]:
from pyspark.sql import SparkSession

# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

In [4]:
# Create Spark session

spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL") \
    .getOrCreate()

# Create Spark dataframe

# NOTE: local path; adjust it to wherever your copy of the data set lives
path = "D:/OneDrive/Data Engineering/Udacity/Data Engineering/DataEngineering_Repo/Spark/sparkify_event_data.json"
user_log = spark.read.json(path)

# Create temporary view

# createOrReplaceTempView returns None, so there is nothing to assign
user_log.createOrReplaceTempView("user_log_table")

# Question 1

Which page did user id "" (empty string) NOT visit?

In [10]:
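# TODO: write your code to answer question 1

# List the pages the empty-string user DID visit; any page in the full log
# that is missing from this list is a page that user did not visit.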

spark.sql('''
        SELECT page
        FROM user_log_table
        WHERE userID = ""
        GROUP BY page
        ORDER BY page
        '''
        ).collect()

[Row(page='About'),
 Row(page='Error'),
 Row(page='Help'),
 Row(page='Home'),
 Row(page='Login'),
 Row(page='Register'),
 Row(page='Submit Registration')]
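
The query above returns the pages the empty-string user did visit, so the pages that user did not visit are the set difference between all pages and that list. The sketch below shows the idea with `EXCEPT`. It uses Python's built-in `sqlite3` module and a handful of invented rows, purely so it runs without a Spark session; the toy data and the `not_visited` name are assumptions for illustration. The same `EXCEPT` query should carry over to `spark.sql` against `user_log_table`.

```python
import sqlite3

# Toy stand-in for the user_log_table view (rows are invented for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO user_log_table VALUES (?, ?)",
    [
        ("", "Home"),
        ("", "Login"),
        ("", "About"),
        ("10", "Home"),
        ("10", "NextSong"),
        ("11", "Settings"),
        ("11", "About"),
    ],
)

# All pages in the log, minus the pages visited by the empty-string user
query = """
    SELECT page FROM user_log_table
    EXCEPT
    SELECT page FROM user_log_table WHERE userId = ''
    ORDER BY page
"""
not_visited = [row[0] for row in conn.execute(query)]
print(not_visited)
```

With the toy rows, only `NextSong` and `Settings` remain, because no row with an empty `userId` has those pages.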

# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [24]:
# TODO: write your code to answer question 3

spark.sql('''
        SELECT COUNT(DISTINCT(userId))
        FROM user_log_table
        WHERE gender = "F"
        '''
        ).collect()

[Row(count(DISTINCT userId)=198)]

# Question 4

How many songs were played by the most played artist?

In [27]:
# TODO: write your code to answer question 4

#spark.sql('''
#        SELECT song, artist
#        FROM user_log_table
#        GROUP BY 2,1
#        ORDER BY 2
#        '''
#        ).collect()

spark.sql('''
        SELECT artist, COUNT(artist) AS count_artist
        FROM user_log_table
        GROUP BY artist
        ORDER BY count_artist DESC
        LIMIT 1
        '''
        ).collect()

[Row(artist='Kings Of Leon', count_artist=3497)]

# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [29]:
# TODO: write your code to answer question 5

# Flag every Home visit with 1 and keep only NextSong and Home events
is_home = spark.sql('''
        SELECT userID, page, ts,
               CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
        FROM user_log_table
        WHERE page = 'NextSong' OR page = 'Home'
        ''')

# keep the results in a new view
is_home.createOrReplaceTempView("is_home_table")

# cumulative sum of is_home per user, newest event first: rows sharing a
# period value are the songs played between two consecutive Home visits
cumulative_sum = spark.sql('''
        SELECT *,
               SUM(is_home) OVER (PARTITION BY userID ORDER BY ts DESC
                                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
        FROM is_home_table
        ''')

# keep the results in a view
cumulative_sum.createOrReplaceTempView("period_table")

# find the average count of NextSong events per (user, period) group
spark.sql('''
        SELECT AVG(count_results)
        FROM (SELECT COUNT(*) AS count_results
              FROM period_table
              GROUP BY userID, period, page
              HAVING page = 'NextSong') AS counts
        ''').show()

+------------------+
|avg(count_results)|
+------------------+
|23.672591053264792|
+------------------+

Rounded to the closest integer, users listen to about 24 songs between visits to the home page.
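
To see how the cumulative sum splits each user's history into periods, the following sketch runs the same three steps on a tiny invented log. It uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer) only so the example runs without Spark; the rows and the `avg_songs` name are assumptions for illustration, and the steps are chained as CTEs rather than separate temporary views.

```python
import sqlite3

# Toy event log (invented rows): two users, Home visits and song plays
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT, ts INTEGER)")
events = [
    ("1", "Home", 1), ("1", "NextSong", 2), ("1", "NextSong", 3),
    ("1", "Home", 4), ("1", "NextSong", 5), ("1", "NextSong", 6),
    ("1", "NextSong", 7), ("1", "NextSong", 8),
    ("2", "NextSong", 1), ("2", "Home", 2), ("2", "NextSong", 3),
    ("2", "NextSong", 4), ("2", "NextSong", 5),
]
conn.executemany("INSERT INTO user_log_table VALUES (?, ?, ?)", events)

query = """
    WITH is_home_table AS (
        -- step 1: flag Home visits, keep only Home and NextSong events
        SELECT userId, page, ts,
               CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
        FROM user_log_table
        WHERE page = 'NextSong' OR page = 'Home'
    ),
    period_table AS (
        -- step 2: running count of Home visits, newest event first
        SELECT *,
               SUM(is_home) OVER (
                   PARTITION BY userId ORDER BY ts DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS period
        FROM is_home_table
    )
    -- step 3: count songs per (user, period), then average the counts
    SELECT AVG(count_results)
    FROM (SELECT COUNT(*) AS count_results
          FROM period_table
          WHERE page = 'NextSong'
          GROUP BY userId, period) AS counts
"""
avg_songs = conn.execute(query).fetchone()[0]
print(avg_songs)
```

In the toy log, user 1 has song runs of 4 and 2 and user 2 has runs of 3 and 1, so the average is 10 / 4 = 2.5. Note that, because of the descending order, period 0 holds the songs played after a user's most recent Home visit.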

