### Spark SQL Practice

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import isnan, count, when, col, desc, udf, sort_array, asc, avg
from pyspark.sql.functions import sum as Fsum
from pyspark.sql import Window

import datetime

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
sparkSesh = (
    SparkSession.builder.appName("app Name")
    .config("config option", "config value")
    .master("local[*]")
    .getOrCreate()
)



23/08/27 16:45:51 WARN Utils: Your hostname, rambino-AERO-15-XD resolves to a loopback address: 127.0.1.1; using 192.168.2.54 instead (on interface wlp48s0)
23/08/27 16:45:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/08/27 16:45:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
read_path = "./sparkify_log_small.json"
log_data = sparkSesh.read.json(read_path)

In [4]:
# A Spark DataFrame cannot be referenced by name in a SQL query until it is registered, so we create a temporary 'view'

log_data.createOrReplaceTempView("log_table")

In [5]:
sparkSesh.sql(
    """
    SELECT * FROM log_table limit 3;
    """
).show()

# Note: For more information on "show()" vs "collect()" vs "take()" in Spark for returning data:
# https://stackoverflow.com/questions/41000273/spark-difference-between-collect-take-and-show-outputs-after-conversion

+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|              artist|     auth|firstName|gender|itemInSession| lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       Showaddywaddy|Logged In|  Kenneth|     M|          112| Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|          Lily Allen|Logged In|Elizabeth|     F|            7|    Chase|195.23873| free|Shreveport-Bossie...|   PUT

In [6]:
# Note: To use a UDF inside Spark SQL, register it under a name with udf.register() instead of wrapping it with udf():
sparkSesh.udf.register(
    "hour_from_epoch", lambda x: int(datetime.datetime.fromtimestamp(x / 1000).hour)
)

<function __main__.<lambda>(x)>
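
The registered lambda relies on `datetime.datetime.fromtimestamp`, which converts to the local time zone of the machine running the code, so the hourly counts can shift between machines. A minimal sketch of the same conversion with an explicit time zone follows; the function name `hour_from_epoch_ms` and the UTC default are my own assumptions, not part of the course material.

```python
import datetime


def hour_from_epoch_ms(ts_ms, tz=datetime.timezone.utc):
    """Return the hour (0-23) of a millisecond epoch timestamp in the given time zone."""
    # The log stores 'ts' in milliseconds, while fromtimestamp() expects seconds.
    return datetime.datetime.fromtimestamp(ts_ms / 1000, tz=tz).hour


# 'ts' value taken from the first row of the log shown above.
print(hour_from_epoch_ms(1513720872284))  # 22 (in UTC)
```

A function like this could presumably be registered the same way as the lambda. Passing `IntegerType()` as the third argument of `udf.register` should make the UDF return an integer instead of the default string type, which would remove the need for `cast(hour as int)` in the query.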

In [7]:
sparkSesh.sql(
    """
    SELECT hour_from_epoch(ts) AS hour, count(*) as plays_per_hour
    FROM log_table
    WHERE page = "NextSong"
    GROUP BY hour
    ORDER BY cast(hour as int) ASC
    """
).show()

+----+--------------+
|hour|plays_per_hour|
+----+--------------+
|   0|           147|
|   1|           225|
|   2|           216|
|   3|           179|
|   4|           141|
|   5|           151|
|   6|           113|
|   7|           180|
|   8|            93|
|  23|           205|
+----+--------------+



                                                                                

#### Notes:
It's quite easy to convert a Spark DataFrame to a Pandas DataFrame with the `toPandas()` method. Note that `toPandas()` collects every row onto the driver, so it only works when the result fits in the driver's memory. For small data like this, using Pandas or Spark DataFrames is mostly a matter of preference.

However, the course instructor recommends that **I choose one API (Pandas or Spark / Spark SQL) and practice it consistently** rather than trying to learn a little about all the APIs (i.e., specialize).

>It makes most sense to me to specialize in Spark / Spark SQL. I see it used more often for Data Engineering than Pandas.

## Challenges

Which pages did the user with ID "" (empty string) NOT visit?

In [8]:
sparkSesh.sql(
    """
    SELECT DISTINCT page
    FROM log_table
    WHERE page NOT IN (
        SELECT DISTINCT page
        FROM log_table
        WHERE userID = ""
    )
    """
).show()

+--------------+
|          page|
+--------------+
|     Downgrade|
|        Logout|
| Save Settings|
|      Settings|
|      NextSong|
|       Upgrade|
|         Error|
|Submit Upgrade|
+--------------+



How many female users do we have in the data set?

In [9]:
sparkSesh.sql(
    """
    SELECT COUNT(DISTINCT userId), gender
    FROM log_table
    GROUP BY gender
    """
).show()

+----------------------+------+
|count(DISTINCT userId)|gender|
+----------------------+------+
|                   127|     F|
|                     1|  null|
|                   155|     M|
+----------------------+------+



How many songs were played from the most played artist?

In [10]:
sparkSesh.sql(
    """
    SELECT COUNT(*) AS plays, artist
    FROM log_table
    WHERE artist IS NOT NULL
    GROUP BY artist
    ORDER BY plays DESC
    """
).show()

+-----+--------------------+
|plays|              artist|
+-----+--------------------+
|   17|       Kings Of Leon|
|   16|            Coldplay|
|   15|Florence + The Ma...|
|   13|        Jack Johnson|
|   10|               Björk|
|   10|       Justin Bieber|
|   10|      The Black Keys|
|    9|          Lily Allen|
|    9|           Daft Punk|
|    9|            Tub Ring|
|    8|         OneRepublic|
|    7|           Radiohead|
|    7|     Alliance Ethnik|
|    6|        Taylor Swift|
|    6|          Kanye West|
|    6|             Rihanna|
|    6|         Miley Cyrus|
|    6|     Michael Jackson|
|    6|Red Hot Chili Pep...|
|    6|      Arctic Monkeys|
+-----+--------------------+
only showing top 20 rows



How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

### My Solution

In [22]:
sparkSesh.sql(
    """
    WITH songs_home AS (
        SELECT *
        ,SUM(isHome) OVER 
            (PARTITION BY userID ORDER BY ts DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS period
        FROM (
            SELECT page
            ,userID
            ,ts
            ,CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS isHome
            FROM log_table
            WHERE page = "Home" OR page = "NextSong"
        )
    )
    SELECT COUNT(period) AS count, userID, period, page
    FROM songs_home
    GROUP BY userID, period, page
    HAVING page = 'NextSong'
    """
).createOrReplaceTempView("songs_home_counts")

In [23]:
sparkSesh.sql(
    """
    SELECT AVG(count)
    FROM songs_home_counts
    """
).show()

+-----------------+
|       avg(count)|
+-----------------+
|5.625925925925926|
+-----------------+



### Instructor's Solution

In [19]:
# SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
is_home = sparkSesh.sql(
    """
    SELECT userID
    ,page
    ,ts
    ,CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
    FROM log_table
    WHERE (page = 'NextSong') or (page = 'Home')
"""
)

# keep the results in a new view
is_home.createOrReplaceTempView("is_home_table")

In [20]:
# find the cumulative sum over the is_home column
cumulative_sum = sparkSesh.sql(
    """
    SELECT *
    ,SUM(is_home) OVER
        (PARTITION BY userID ORDER BY ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
    FROM is_home_table
"""
)

# keep the results in a view
cumulative_sum.createOrReplaceTempView("period_table")

In [21]:
# find the average count for NextSong
sparkSesh.sql(
    """
    SELECT AVG(count_results)
    FROM
        (SELECT COUNT(*) AS count_results FROM period_table
        GROUP BY userID, period, page HAVING page = 'NextSong') AS counts
"""
).show()

+------------------+
|avg(count_results)|
+------------------+
| 5.625925925925926|
+------------------+



## Bug: Different results when sorting differently

Note: I get different results if I order the data DESCENDING vs. ASCENDING within my window function.
I *think* I figured out why: some 'Home' page visits have exactly the same timestamp as a 'NextSong' event for the same user.
For example, this is the case for user ID 1079. When I order in one direction, the 'NextSong' event is placed BEFORE the 'Home' visit. When I order the other way, it is placed AFTER the 'Home' visit.

This happens because `ts` alone does not fully determine the row order. Spark makes no guarantee about the relative order of rows that tie on the `ORDER BY` key, so a `ROWS BETWEEN ...` frame can put the tied 'NextSong' event on either side of the 'Home' visit, and the running sum that defines `period` changes accordingly.
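
To convince myself that a single tie can change the final average, here is a minimal pure-Python sketch (not Spark). The event list and the helper names `songs_per_period` and `average` are made up for illustration. It replays the "running count of Home visits" logic under the two possible orderings of a tied pair.

```python
from collections import Counter


def songs_per_period(ordered_events):
    """Count 'NextSong' events per period; a period is the running count of 'Home' visits."""
    period = 0
    counts = Counter()
    for _ts, page in ordered_events:
        if page == "Home":
            period += 1  # mirrors the cumulative SUM over the is_home flag
        else:
            counts[period] += 1
    return dict(counts)


def average(counts):
    """Average number of songs over the periods that contain at least one song."""
    return sum(counts.values()) / len(counts)


# Hypothetical events for one user; 'NextSong' and 'Home' tie at ts=3.
events = [(1, "Home"), (2, "NextSong"), (3, "NextSong"), (3, "Home"), (4, "Home")]

# Two equally valid orderings by ts that break the tie at ts=3 differently.
song_first = sorted(events, key=lambda e: (e[0], e[1] == "Home"))
home_first = sorted(events, key=lambda e: (e[0], e[1] != "Home"))

print(average(songs_per_period(song_first)))  # 2.0
print(average(songs_per_period(home_first)))  # 1.0
```

In this toy case the tied song either joins the earlier period (one period with two songs) or starts a period of its own (two periods with one song each), so the average moves from 2.0 to 1.0.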

In [16]:
# This query lists the users who have 2 events sharing exactly the same timestamp
sparkSesh.sql(
    """
    SELECT COUNT(ts) AS count, userID, ts
    FROM log_table
    GROUP BY userID, ts
    HAVING count > 1
    ORDER BY count DESC
"""
).show()

+-----+------+-------------+
|count|userID|           ts|
+-----+------+-------------+
|    2|  1955|1513735032284|
|    2|  2047|1513755094284|
|    2|  2219|1513723220284|
|    2|  1079|1513749231284|
|    2|  2813|1513754919284|
|    2|  1138|1513729066284|
|    2|  2089|1513748348284|
+-----+------+-------------+



### Solved
If we remove all data for the user IDs that have two or more events with exactly the same timestamp, we no longer get different answers depending on whether we sort ASC or DESC.
>Conclusion: If these were production data, I would challenge the event-logging system that produces events with the exact same timestamp. It doesn't make sense to me how two events like visiting 'Home' and 'NextSong' could happen simultaneously. If that event-logging system cannot be changed, then I would need to write my code more carefully so that it accounts for two events happening at the same time.
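
One way to make the code account for simultaneous events, instead of dropping the affected users, might be to add an explicit tiebreaker to the window's `ORDER BY`. The sketch below uses Python's built-in `sqlite3` module as a stand-in SQL engine so that it runs without a Spark session (this assumes the bundled SQLite is 3.25 or newer, which added window functions). The table `events` and its rows are hypothetical. Treating a 'Home' visit as happening first on a tie is an assumption on my part and not something the data confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userID TEXT, ts INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", 1, "Home"),
        ("u1", 2, "NextSong"),
        ("u1", 3, "NextSong"),  # ties with the 'Home' visit below
        ("u1", 3, "Home"),
        ("u1", 4, "Home"),
    ],
)

query = """
    SELECT AVG(song_count)
    FROM (
        SELECT userID, period, COUNT(*) AS song_count
        FROM (
            SELECT userID, page,
                   SUM(CASE WHEN page = 'Home' THEN 1 ELSE 0 END) OVER (
                       PARTITION BY userID
                       -- second sort key: on equal ts, 'Home' rows come first (assumption)
                       ORDER BY ts ASC, CASE WHEN page = 'Home' THEN 1 ELSE 0 END DESC
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                   ) AS period
            FROM events
        )
        WHERE page = 'NextSong'
        GROUP BY userID, period
    )
"""
avg_songs = conn.execute(query).fetchone()[0]
print(avg_songs)  # 1.0
```

With the second sort key the row order inside each partition is fully determined, so the result no longer depends on how the engine happens to break ties. The same `ORDER BY` clause should carry over to the Spark SQL window in the cells above, though I have not rerun it there.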

In [17]:
badIDs = sparkSesh.sql(
    """
    SELECT COUNT(ts) AS count, userID, ts
    FROM log_table
    GROUP BY userID, ts
    HAVING count > 1
    ORDER BY count DESC
"""
)

badIDs.createOrReplaceTempView("badIDs")

In [18]:
# Replace the view with the 'bad IDs' filtered out. Rerun the solution cells with this data to confirm that
# the sort direction of 'ts' no longer changes the result.

log_table = sparkSesh.sql(
    """
    SELECT *
    FROM log_table
    WHERE userID NOT IN (SELECT userID from badIDs)
"""
)

log_table.createOrReplaceTempView("log_table")