# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions from the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [11]:
from pyspark.sql import SparkSession

# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 
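# Instantiate (or reuse) a Spark session.
spark = SparkSession.builder \
    .appName("Spark SQL Quiz") \
    .getOrCreate()

# Read the log data and register it as a temporary view for SQL queries.
df = spark.read.json("data/sparkify_log_small.json")
df.createOrReplaceTempView('log_table')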

# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [12]:
# TODO: write your code to answer question 1
spark.sql('''
select * from log_table limit 5
''').show()

+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|              artist|     auth|firstName|gender|itemInSession| lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       Showaddywaddy|Logged In|  Kenneth|     M|          112| Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|          Lily Allen|Logged In|Elizabeth|     F|            7|    Chase|195.23873| free|Shreveport-Bossie...|   PUT

In [13]:
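# All pages in the log, minus the pages visited by the empty-string user.
# Note: `not in` returns no rows if the subquery yields a null page.
spark.sql('''
select distinct page from log_table
where page not in
(select page from log_table
where userId = '')
''').show()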

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
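
The query above is a set difference: every page that appears in the log, minus the pages the empty-string user visited. As a rough illustration only, the same logic can be sketched in plain Python on a few made-up rows. The `events` list and the `pages_not_visited_by` helper are hypothetical and not part of the quiz data or of Spark.

```python
# Toy (page, userId) events; these rows are made up for illustration.
events = [
    ("Home", ""),
    ("Login", ""),
    ("Home", "1046"),
    ("NextSong", "1046"),
    ("Logout", "1046"),
]


def pages_not_visited_by(events, user_id):
    """Return the pages seen in the log that `user_id` never visited."""
    all_pages = {page for page, _ in events}
    visited = {page for page, uid in events if uid == user_id}
    # Set difference plays the role of `page not in (subquery)`.
    return sorted(all_pages - visited)


print(pages_not_visited_by(events, ""))
```

On this toy input the empty-string user only reaches `Home` and `Login`, so the remaining pages are the ones reported.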



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [14]:
# TODO: write your code to answer question 3
spark.sql('''

select count(distinct userID) from log_table
where gender = 'F'

''').show()

+----------------------+
|count(DISTINCT userID)|
+----------------------+
|                   462|
+----------------------+



# Question 4

How many songs were played from the most played artist?

In [16]:
# TODO: write your code to answer question 4
spark.sql('''

select count(artist) as total
from log_table
where artist is not null
group by artist
order by 1 desc
limit 1

''').show()

+-----+
|total|
+-----+
|   83|
+-----+



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [33]:
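# TODO: write your code to answer question 5
# Step 1: keep only song plays and home-page visits, flagging home visits with 1.
is_home = spark.sql('''
select userId, page, ts, (case when page = 'Home' then 1 else 0 end) as ishome
from log_table
where page = 'NextSong' or page = 'Home'
''')
is_home.createOrReplaceTempView('is_home')

# Step 2: per user, keep a running count of home visits, walking back from the latest event.
# Rows that share a (userId, cumsum) pair fall between the same two home visits.
cumsum = spark.sql('''
select userId, ishome,
       sum(ishome) over (
           partition by userId
           order by ts desc
           range between unbounded preceding and current row
       ) as cumsum
from is_home
''')
cumsum.createOrReplaceTempView('cumsum')

# Step 3: count the songs in each period, then average across all periods.
# The quiz asks for the nearest integer, so the value printed below rounds to 7.
spark.sql('''
select avg(total)
from
(select userId, cumsum, count(cumsum) as total from cumsum
where ishome = 0
group by 1,2) t
''').show()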

+-----------------+
|       avg(total)|
+-----------------+
|6.898347107438017|
+-----------------+
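
The running-sum approach to question 5 can be hard to picture, so here is a minimal plain-Python sketch of the same idea on made-up events. It walks each user's events from latest to earliest, bumps a counter at every `Home` visit, and uses that counter as a period label for the songs in between. The `events` rows and the `songs_per_period` helper are hypothetical, and the sketch assumes timestamps are unique per user.

```python
from collections import Counter

# Made-up (userId, ts, page) events for illustration only.
events = [
    ("u1", 1, "Home"),
    ("u1", 2, "NextSong"),
    ("u1", 3, "NextSong"),
    ("u1", 4, "Home"),
    ("u1", 5, "NextSong"),
    ("u2", 1, "NextSong"),
    ("u2", 2, "Home"),
    ("u2", 3, "NextSong"),
    ("u2", 4, "NextSong"),
    ("u2", 5, "NextSong"),
]


def songs_per_period(events):
    """Count songs per (userId, period), where the period is the running
    number of Home visits seen while walking back from the latest event."""
    counts = Counter()
    home_seen = Counter()  # running Home count per user
    # Latest events first within each user, mirroring `order by ts desc`.
    for user, ts, page in sorted(events, key=lambda e: (e[0], -e[1])):
        if page == "Home":
            home_seen[user] += 1
        elif page == "NextSong":
            counts[(user, home_seen[user])] += 1
    return counts


counts = songs_per_period(events)
average = sum(counts.values()) / len(counts)
print(dict(counts))
print(average)
```

Here the four periods contain 1, 2, 3 and 1 songs, so the average is 1.75. Periods with no songs at all never appear in `counts`, which matches the `where ishome = 0` filter followed by `group by` in the SQL version.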

