# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc

# TODOs:
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

In [3]:
spark = SparkSession \
    .builder \
    .appName("quiz_wrangling") \
    .getOrCreate()

In [4]:
PATH = "data/sparkify_log_small.json"
user_log = spark.read.json(PATH)

In [5]:
user_log.take(1)

[Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046')]

In [6]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [7]:
user_log.createOrReplaceTempView("user_log_table")

# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [9]:
spark.sql("""SELECT DISTINCT t.page
             FROM user_log_table t
             WHERE t.page NOT IN (SELECT DISTINCT s.page
                                  FROM user_log_table s
                                  WHERE s.userId = '')
          """).show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
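
The same "all pages minus the pages this user visited" logic can also be written as a set difference. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module and a handful of made-up rows instead of Spark and the real log, so the table contents and the `not_visited` variable are assumptions. `EXCEPT` is standard SQL and is also supported by Spark SQL, so the query text should carry over to `spark.sql(...)`, though that is not verified here.

```python
import sqlite3

# Toy stand-in for the notebook's view; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO user_log_table VALUES (?, ?)",
    [
        ("", "Home"),
        ("", "Login"),
        ("1046", "Home"),
        ("1046", "NextSong"),
        ("1046", "Settings"),
    ],
)

# EXCEPT returns the distinct rows of the first query that are absent from the second.
rows = conn.execute(
    """
    SELECT page FROM user_log_table
    EXCEPT
    SELECT page FROM user_log_table WHERE userId = ''
    """
).fetchall()

not_visited = sorted(page for (page,) in rows)
print(not_visited)  # ['NextSong', 'Settings']
```

One reason to consider this form is that `NOT IN` returns no rows at all if the subquery ever yields a NULL, whereas `EXCEPT` does not have that failure mode.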



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [10]:
spark.sql("""SELECT COUNT(DISTINCT t.userId, t.firstName, t.lastName)
             FROM user_log_table t
             WHERE t.gender = 'F'
          """).show()

+-------------------------------------------+
|count(DISTINCT userId, firstName, lastName)|
+-------------------------------------------+
|                                        462|
+-------------------------------------------+



# Question 4

How many songs by the most played artist were played?

In [19]:
spark.sql("""SELECT t.artist, COUNT(*) AS numPlayed
             FROM user_log_table t
             WHERE t.page = 'NextSong'
             GROUP BY t.artist
             ORDER BY numPlayed DESC
             LIMIT 1
          """).show()

+--------+---------+
|  artist|numPlayed|
+--------+---------+
|Coldplay|       83|
+--------+---------+



# Question 5 (challenge)

How many songs do users listen to on average between visits to our home page? Please round your answer to the closest integer.

In [None]:
# TODO: write your code to answer question 5
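
One possible approach, offered as a sketch and not as the official answer: flag each `Home` visit, take a running sum of that flag per user ordered by `ts` so that every row gets a "period" number, count the `NextSong` rows in each period, and average those counts. To keep the example runnable without a Spark cluster, it uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer) and a few invented rows. The toy data, the CTE names, and the decision to count songs played before a user's first `Home` visit as their own period are all assumptions. The query uses only `WITH`, `CASE`, and `SUM(...) OVER (...)`, which Spark SQL also supports, so it should be adaptable to `spark.sql(...)` against `user_log_table`.

```python
import sqlite3

# Toy stand-in for the notebook's view; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO user_log_table VALUES (?, ?, ?)",
    [
        ("u1", "Home", 1), ("u1", "NextSong", 2), ("u1", "NextSong", 3),
        ("u1", "Home", 4), ("u1", "NextSong", 5),
        ("u2", "NextSong", 1), ("u2", "Home", 2), ("u2", "NextSong", 3),
        ("u2", "NextSong", 4), ("u2", "NextSong", 5),
    ],
)

QUERY = """
WITH home_flags AS (
    -- Keep only the two relevant page types and mark Home visits with 1.
    SELECT userId, page, ts,
           CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
    FROM user_log_table
    WHERE page IN ('Home', 'NextSong')
),
periods AS (
    -- Running count of Home visits per user: rows between two Home
    -- visits share the same period number.
    SELECT userId, page,
           SUM(is_home) OVER (
               PARTITION BY userId ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS period
    FROM home_flags
),
songs_per_period AS (
    -- Number of songs played in each (user, period) pair.
    SELECT COUNT(*) AS num_songs
    FROM periods
    WHERE page = 'NextSong'
    GROUP BY userId, period
)
SELECT AVG(num_songs) FROM songs_per_period
"""

avg_songs = conn.execute(QUERY).fetchone()[0]
print(avg_songs)         # 1.75 on the toy rows above
print(round(avg_songs))  # 2
```

On the toy rows the per-period song counts are 2 and 1 for `u1` and 1 and 3 for `u2`, which gives the 1.75 average. Periods that contain no songs at all are not counted; whether they should be is a judgment call about how the question is read.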