# Data Wrangling with SQL: Spark SQL

The PySpark Python API feels familiar to pandas users, as long as we keep its functional programming style in mind and avoid explicit `for` loops.

PySpark offers a second way to do the same things as the Python API: the SQL interface. It has these advantages:

- We can use largely the same SQL queries we'd use in a relational database.
- The SQL query is optimized under the hood for better performance.
- SQL queries avoid the need to learn a new API or library.
- Any data analyst who knows SQL can easily use PySpark.

> You might prefer SQL over data frames because the syntax is clearer, especially for teams already experienced in SQL.

> Spark data frames give you more control. You can break down your queries into smaller steps, which can make debugging easier. You can also [cache](https://unraveldata.com/to-cache-or-not-to-cache/) intermediate results or [repartition](https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4) intermediate results.

The major difference when using SQL is that we first need to register the loaded DataFrame with `df.createOrReplaceTempView("tableName")`. This step creates a temporary view in Spark, which lets us analyze the data with SQL statements.

Once the table(s) we want have been loaded and registered, the interface looks like this:

```python
user_log = spark.read.json("../data/sparkify_log_small.json")
user_log.createOrReplaceTempView("user_log_table")
# We can use any kind of SQL query
spark.sql("SELECT * FROM user_log_table LIMIT 2").show() # Print the result (2 rows here; show() prints at most 20 by default)
spark.sql("""
          SELECT * 
          FROM user_log_table
          LIMIT 2"""
          ).collect() # Return all rows of the result as a list of Row objects
```

Also, note that we can chain several retrieval/analysis functions one after the other. However, due to the *lazy evaluation* principle, these are not executed until we call an action such as `show()` or `collect()`:

- `show()` prints the first `n` rows of the result (20 by default) and returns `None`; use it for exploration.
- `collect()` returns the complete result as a list of `Row` objects on the driver; use it only when needed.

Important links with API information:

- [Spark SQL built-in functions](https://spark.apache.org/docs/latest/api/sql/index.html)
- [Spark SQL guide](https://spark.apache.org/docs/latest/sql-getting-started.html)

**Table of Contents**

- [1. Setup](#1.-Setup)
- [2. Create a View And Run Queries](#2.-Create-a-View-And-Run-Queries)
- [3. User Defined Functions](#3.-User-Defined-Functions)
- [4. Converting Results to Pandas](#4.-Converting-Results-to-Pandas)


## 1. Setup

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import desc
from pyspark.sql.functions import asc
from pyspark.sql.functions import sum as Fsum

import datetime

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL") \
    .getOrCreate()


In [3]:
path = "../data/sparkify_log_small.json"
user_log = spark.read.json(path)

In [17]:
user_log.take(1)

[Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046')]

In [18]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



## 2. Create a View And Run Queries

The code below creates a temporary view against which you can run SQL queries.

In [4]:
# Register the DataFrame as a temporary view.
# This is necessary for SQL data wrangling:
# it lets us query the data with SQL syntax within the current session.
# Notes:
# - By default, the contents of user_log are not registered in the session catalog!
# - user_log is linked to the session
# - We create a temporary view of user_log named "user_log_table" in the catalog
user_log.createOrReplaceTempView("user_log_table")

In [5]:
# Once the table(s) we want have been loaded and registered,
# the interface is .sql()
# BUT because of *lazy evaluation*,
# we need to either show() or collect():
# - show() prints the first n rows (20 by default) and returns None; use for exploration
# - collect() returns the complete result as a list of Row elements; use only when needed
spark.sql("SELECT * FROM user_log_table LIMIT 2").show()

+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|Showaddywaddy|Logged In|  Kenneth|     M|          112|Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|   Lily Allen|Logged In|Elizabeth|     F|            7|   Chase|195.23873| free|Shreveport-Bossie...|   PUT|NextSong|1512718541284|     5027|      

In [6]:
# Multi-line queries
spark.sql('''
          SELECT * 
          FROM user_log_table 
          LIMIT 2
          '''
          ).show()

+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|Showaddywaddy|Logged In|  Kenneth|     M|          112|Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|   Lily Allen|Logged In|Elizabeth|     F|            7|   Chase|195.23873| free|Shreveport-Bossie...|   PUT|NextSong|1512718541284|     5027|      

In [7]:
spark.sql('''
          SELECT COUNT(*) 
          FROM user_log_table 
          '''
          ).show()

+--------+
|count(1)|
+--------+
|   10000|
+--------+



In [8]:
spark.sql('''
          SELECT userID, firstname, page, song
          FROM user_log_table 
          WHERE userID = '1046'
          '''
          ).collect()

[Row(userID='1046', firstname='Kenneth', page='NextSong', song='Christmas Tears Will Fall'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Be Wary Of A Woman'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Public Enemy No.1'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Reign Of The Tyrants'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Father And Son'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='No. 5'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Seventeen'),
 Row(userID='1046', firstname='Kenneth', page='Home', song=None),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='War on war'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Killermont Street'),
 Row(userID='1046', firstname='Kenneth', page='NextSong', song='Black & Blue'),
 Row(userID='1046', firstname='Kenneth', page='Logout', song=None),
 Row(userID='1046', firstname='Kenneth'

In [9]:
# All unique pages
spark.sql('''
          SELECT DISTINCT page
          FROM user_log_table 
          ORDER BY page ASC
          '''
          ).show()

+----------------+
|            page|
+----------------+
|           About|
|       Downgrade|
|           Error|
|            Help|
|            Home|
|           Login|
|          Logout|
|        NextSong|
|   Save Settings|
|        Settings|
|Submit Downgrade|
|  Submit Upgrade|
|         Upgrade|
+----------------+



## 3. User Defined Functions

In [10]:
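# We can also use User-Defined Functions (UDFs),
# but we need to register them so they can be used
# as part of the SQL statement.
# Note: fromtimestamp() converts to the local time zone of the machine running the code.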
spark.udf.register("get_hour",
                   lambda x: int(datetime.datetime.fromtimestamp(x / 1000.0).hour))

<function __main__.<lambda>(x)>

In [20]:
spark.sql('''
          SELECT *, get_hour(ts) AS hour
          FROM user_log_table 
          LIMIT 1
          '''
          ).collect()

[Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046', hour='22')]
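
In the output above, `hour` comes back as the string `'22'`: a Python function registered without a return type yields strings by default, which is why the aggregation query further down casts `hour` to an integer before sorting. A possible variant, sketched here under assumptions, declares an integer return type and pins the time zone to UTC so that the result does not depend on the machine. The names `get_hour_utc` and `register_get_hour_utc` are hypothetical and not part of the notebook.

```python
import datetime


def get_hour_utc(ts_ms):
    """Return the hour (0-23, in UTC) of a Unix timestamp given in milliseconds."""
    moment = datetime.datetime.fromtimestamp(ts_ms / 1000.0,
                                             tz=datetime.timezone.utc)
    return moment.hour


def register_get_hour_utc(spark):
    """Register the function for use in SQL with an explicit integer return type."""
    # Imported here so that the pure-Python function above works without Spark
    from pyspark.sql.types import IntegerType
    spark.udf.register("get_hour_utc", get_hour_utc, IntegerType())


# The ts value of the first row in the log, checked without Spark
example_hour = get_hour_utc(1513720872284)
```

After calling `register_get_hour_utc(spark)`, a query could use `get_hour_utc(ts)` and order by the resulting column directly, without a cast.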

In [12]:
# SQL statement with the freshly defined UDF
# Note that the statement is not evaluated
# due to the *lazy evaluation* principle.
# We need to show/collect the query to get the results.
songs_in_hour = spark.sql('''
          SELECT get_hour(ts) AS hour, COUNT(*) as plays_per_hour
          FROM user_log_table
          WHERE page = "NextSong"
          GROUP BY hour
          ORDER BY cast(hour as int) ASC
          '''
          )

In [13]:
songs_in_hour.show()



+----+--------------+
|hour|plays_per_hour|
+----+--------------+
|   0|           375|
|   1|           456|
|   2|           454|
|   3|           382|
|   4|           302|
|   5|           352|
|   6|           276|
|   7|           348|
|   8|           358|
|   9|           375|
|  10|           249|
|  11|           216|
|  12|           228|
|  13|           251|
|  14|           339|
|  15|           462|
|  16|           479|
|  17|           484|
|  18|           430|
|  19|           362|
+----+--------------+
only showing top 20 rows



## 4. Converting Results to Pandas

In [14]:
# toPandas() is also an action: the query is executed
# and the complete result is collected on the driver
# as a pd.DataFrame, so use it only for small results.
songs_in_hour_pd = songs_in_hour.toPandas()

In [29]:
print(songs_in_hour_pd)

   hour  plays_per_hour
0     0             456
1     1             454
2     2             382
3     3             302
4     4             352
5     5             276
6     6             348
7     7             358
8     8             375
9     9             249
10   10             216
11   11             228
12   12             251
13   13             339
14   14             462
15   15             479
16   16             484
17   17             430
18   18             362
19   19             295
20   20             257
21   21             248
22   22             369
23   23             375
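
Once the result is a pandas DataFrame, the usual pandas tools apply. The sketch below rebuilds the printed table by hand, with the values copied from the output above and the hours assumed to be integers, and then looks up the busiest hour. The name `songs_in_hour_demo` is hypothetical; in the notebook, the same operations would run on `songs_in_hour_pd`.

```python
import pandas as pd

# Values copied from the printed output above (hours 0 to 23)
plays = [456, 454, 382, 302, 352, 276, 348, 358, 375, 249, 216, 228,
         251, 339, 462, 479, 484, 430, 362, 295, 257, 248, 369, 375]
songs_in_hour_demo = pd.DataFrame({"hour": range(24), "plays_per_hour": plays})

# Row with the highest number of plays
peak = songs_in_hour_demo.loc[songs_in_hour_demo["plays_per_hour"].idxmax()]
peak_hour = int(peak["hour"])
peak_plays = int(peak["plays_per_hour"])

# Share of all plays that falls into each hour
total_plays = songs_in_hour_demo["plays_per_hour"].sum()
songs_in_hour_demo["share"] = songs_in_hour_demo["plays_per_hour"] / total_plays
```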
