# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark DataFrames.

In [1]:
from pyspark.sql import SparkSession

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import desc
from pyspark.sql.functions import asc

# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL") \
    .getOrCreate()

In [4]:
path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)

In [5]:
# Create a view 
user_log.createOrReplaceTempView('user_log_table')

# Question 1

Which page did the user with id "" (empty string) NOT visit?

In [7]:
spark.sql('''
SELECT DISTINCT page
    FROM user_log_table
    WHERE userId = ''
''').show()

+-----+
| page|
+-----+
| Home|
|About|
|Login|
| Help|
+-----+
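
The query above lists the pages the empty-string user did visit, so the answer still has to be read off by comparing against the full list of pages. One way to get it directly is a set difference. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark so it runs without a cluster; the table and column names mirror the notebook, but the rows are invented for illustration. Spark SQL also supports `EXCEPT`, so the same statement could be passed to `spark.sql`.

```python
import sqlite3

# Toy stand-in for user_log_table; these rows are hypothetical, not the quiz data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO user_log_table VALUES (?, ?)",
    [
        ("", "Home"),
        ("", "Login"),
        ("", "Help"),
        ("10", "Home"),
        ("10", "NextSong"),
        ("11", "Settings"),
        ("11", "NextSong"),
    ],
)

# All distinct pages minus the pages seen by the empty-string user.
not_visited_sql = """
SELECT DISTINCT page FROM user_log_table
EXCEPT
SELECT DISTINCT page FROM user_log_table WHERE userId = ''
"""
not_visited = sorted(row[0] for row in conn.execute(not_visited_sql))
print(not_visited)
```

On the toy rows this leaves only the pages that appear for other users and never for the empty-string user.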



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [9]:
spark.sql("""
SELECT COUNT(DISTINCT(userId))
FROM user_log_table
WHERE gender = 'F'
""").show()

+----------------------+
|count(DISTINCT userId)|
+----------------------+
|                   462|
+----------------------+
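
The `DISTINCT` matters here because the log has one row per event, not one row per user. A small sketch of the difference, again using `sqlite3` as a stand-in for Spark and made-up rows:

```python
import sqlite3

# Hypothetical event rows: user "1" appears twice, user "2" once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO user_log_table VALUES (?, ?)",
    [("1", "F"), ("1", "F"), ("2", "F"), ("3", "M"), ("3", "M")],
)

# Counting rows counts events, not users.
row_count = conn.execute(
    "SELECT COUNT(*) FROM user_log_table WHERE gender = 'F'"
).fetchone()[0]

# Counting distinct ids counts each user once.
user_count = conn.execute(
    "SELECT COUNT(DISTINCT userId) FROM user_log_table WHERE gender = 'F'"
).fetchone()[0]

print(row_count, user_count)
```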



# Question 4

How many songs were played from the most played artist?

In [10]:
spark.sql("""
SELECT artist,
       count(*)
FROM user_log_table
GROUP BY artist
ORDER BY 2 DESC
""").show()

+--------------------+--------+
|              artist|count(1)|
+--------------------+--------+
|                null|    1653|
|            Coldplay|      83|
|       Kings Of Leon|      69|
|Florence + The Ma...|      52|
|               Björk|      46|
|       Dwight Yoakam|      45|
|       Justin Bieber|      43|
|      The Black Keys|      40|
|         OneRepublic|      37|
|                Muse|      36|
|        Jack Johnson|      36|
|           Radiohead|      31|
|        Taylor Swift|      29|
|          Lily Allen|      28|
|               Train|      28|
|Barry Tuckwell/Ac...|      28|
|           Daft Punk|      27|
|           Metallica|      27|
|          Nickelback|      27|
|          Kanye West|      26|
+--------------------+--------+
only showing top 20 rows
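
The top row in this output is `null`, presumably from log events that are not song plays, so the answer is the first non-null artist. A hedged sketch of a query that returns only that row, using `sqlite3` as a stand-in for Spark and invented rows; the `plays` alias is an illustrative name, not one used in the notebook:

```python
import sqlite3

# Hypothetical rows: non-song events carry no artist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (artist TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO user_log_table VALUES (?, ?)",
    [
        (None, "Home"),
        (None, "Login"),
        (None, "Help"),
        (None, "Home"),
        ("Coldplay", "NextSong"),
        ("Coldplay", "NextSong"),
        ("Coldplay", "NextSong"),
        ("Muse", "NextSong"),
    ],
)

# Drop rows without an artist, then keep only the most played one.
top_artist = conn.execute("""
SELECT artist, COUNT(*) AS plays
FROM user_log_table
WHERE artist IS NOT NULL
GROUP BY artist
ORDER BY plays DESC
LIMIT 1
""").fetchone()
print(top_artist)
```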



# Question 5 (challenge)

How many songs do users listen to on average between visits to our home page? Please round your answer to the closest integer.

In [19]:
# Flag each Home visit with 1 and each song play with 0; other pages are dropped
is_home = spark.sql("""
    SELECT userID, page, ts,
           CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
    FROM user_log_table
    WHERE page = 'NextSong' OR page = 'Home'
""")

# keep the results in a new view
is_home.createOrReplaceTempView("is_home_table")

# Running sum of is_home per user, newest event first: every song gets the
# number of Home visits that follow it, so each value of period marks one
# stretch of listening between Home visits
cumulative_sum = spark.sql("""
    SELECT *,
           SUM(is_home) OVER (
               PARTITION BY userID
               ORDER BY ts DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS period
    FROM is_home_table
""")

# keep the results in a view
cumulative_sum.createOrReplaceTempView("period_table")

# Count the songs in each (user, period) group, then average those counts
spark.sql("""
    SELECT AVG(count_results)
    FROM (
        SELECT COUNT(*) AS count_results
        FROM period_table
        GROUP BY userID, period, page
        HAVING page = 'NextSong'
    ) AS counts
""").show()

+------------------+
|avg(count_results)|
+------------------+
| 6.898347107438017|
+------------------+
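
Rounded to the closest integer, the result above is 7. To see how the `period` column groups songs, the following sketch replays the same three steps on a handful of invented events. It is an illustration under assumptions, not the notebook's code: it uses `sqlite3` as a stand-in for Spark (window functions need SQLite 3.25 or newer), chains the steps with CTEs in place of temp views, and filters with `WHERE` before grouping.

```python
import sqlite3

# Hypothetical events: u1 plays 2 songs between two Home visits, then 4 more;
# u2 visits Home once and then plays 4 songs.
events = [
    ("u1", "Home", 1), ("u1", "NextSong", 2), ("u1", "NextSong", 3),
    ("u1", "Home", 4), ("u1", "NextSong", 5), ("u1", "NextSong", 6),
    ("u1", "NextSong", 7), ("u1", "NextSong", 8),
    ("u2", "Home", 1), ("u2", "NextSong", 2), ("u2", "NextSong", 3),
    ("u2", "NextSong", 4), ("u2", "NextSong", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userID TEXT, page TEXT, ts INTEGER)")
conn.executemany("INSERT INTO user_log_table VALUES (?, ?, ?)", events)

avg_sql = """
WITH is_home_table AS (
    SELECT userID, page, ts,
           CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
    FROM user_log_table
    WHERE page IN ('NextSong', 'Home')
),
period_table AS (
    SELECT *,
           SUM(is_home) OVER (
               PARTITION BY userID
               ORDER BY ts DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS period
    FROM is_home_table
)
SELECT AVG(count_results)
FROM (
    SELECT COUNT(*) AS count_results
    FROM period_table
    WHERE page = 'NextSong'
    GROUP BY userID, period
)
"""

# Song counts per (user, period) are 4 and 2 for u1 and 4 for u2.
avg_songs = conn.execute(avg_sql).fetchone()[0]
rounded = round(avg_songs)
print(avg_songs, rounded)
```

Because the sum runs from the newest event backwards, songs played after a user's last Home visit fall into period 0, and each earlier Home visit starts a new period.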



In [22]:
spark.sql("""
    SELECT *, SUM(is_home) OVER
        (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
    FROM is_home_table
""").show()

+------+--------+-------------+-------+------+
|userID|    page|           ts|is_home|period|
+------+--------+-------------+-------+------+
|  1436|NextSong|1513783259284|      0|     0|
|  1436|NextSong|1513782858284|      0|     0|
|  2088|    Home|1513805972284|      1|     1|
|  2088|NextSong|1513805859284|      0|     1|
|  2088|NextSong|1513805494284|      0|     1|
|  2088|NextSong|1513805065284|      0|     1|
|  2088|NextSong|1513804786284|      0|     1|
|  2088|NextSong|1513804555284|      0|     1|
|  2088|NextSong|1513804196284|      0|     1|
|  2088|NextSong|1513803967284|      0|     1|
|  2088|NextSong|1513803820284|      0|     1|
|  2088|NextSong|1513803651284|      0|     1|
|  2088|NextSong|1513803413284|      0|     1|
|  2088|NextSong|1513803254284|      0|     1|
|  2088|NextSong|1513803057284|      0|     1|
|  2088|NextSong|1513802824284|      0|     1|
|  2162|NextSong|1513781246284|      0|     0|
|  2162|NextSong|1513781065284|      0|     0|
|  2162|NextS

In [21]:
cumulative_sum

DataFrame[userID: string, page: string, ts: bigint, is_home: int, period: bigint]