<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Spark SQL

Demonstrate fundamental concepts in Spark SQL using the DataFrame API.

##### Objectives
1. Run a SQL query
1. Create a DataFrame from a table
1. Write the same query using DataFrame transformations
1. Trigger computation with DataFrame actions
1. Convert between DataFrames and SQL

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#spark-session-apis" target="_blank">SparkSession</a>: `sql`, `table`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>:
  - Transformations:  `select`, `where`, `orderBy`
  - Actions: `show`, `count`, `take`
  - Other methods: `printSchema`, `schema`, `createOrReplaceTempView`

In [0]:
%run ./Includes/Classroom-Setup-SQL

## Multiple Interfaces
Spark SQL is a module for structured data processing with multiple interfaces.  

We can interact with Spark SQL in two ways:
1. Executing SQL queries
1. Working with the DataFrame API

**Method 1: Executing SQL queries**  

This is how we interacted with Spark SQL in the previous lesson.

In [0]:
%sql
SELECT name, price
FROM products
WHERE price < 200
ORDER BY price

**Method 2: Working with the DataFrame API**

We can also express Spark SQL queries using the DataFrame API.  
The following cell returns a DataFrame containing the same results as those retrieved above.

In [0]:
display(spark.table("products")
  .select("name", "price")
  .where("price < 200")
  .orderBy("price"))

We'll cover the DataFrame API syntax later in the lesson. For now, note that this builder design pattern lets us chain a sequence of operations that closely mirror the clauses of the SQL query.

## Query Execution
We can express the same query using either interface. In both cases, the Spark SQL engine generates the same query plan, which it then optimizes and executes on our Spark cluster.

![query execution engine](https://files.training.databricks.com/images/aspwd/spark_sql_query_execution_engine.png)

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> Resilient Distributed Datasets (RDDs) are the low-level representation of datasets processed by a Spark cluster. In early versions of Spark, you had to write <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html" target="_blank">code manipulating RDDs directly</a>. In modern versions of Spark you should instead use the higher-level DataFrame APIs, which Spark automatically compiles into low-level RDD operations.

## Spark API Documentation

To learn how we work with DataFrames in Spark SQL, let's first look at the Spark API documentation.  
The main Spark [documentation](https://spark.apache.org/docs/latest/) page includes links to API docs and helpful guides for each version of Spark.  

The [Scala API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html) and [Python API](https://spark.apache.org/docs/latest/api/python/index.html) are most commonly used, and it's often helpful to reference the documentation for both languages.  
Scala docs tend to be more comprehensive, and Python docs tend to have more code examples.

#### Navigating Docs for the Spark SQL Module
Find the Spark SQL module by navigating to `org.apache.spark.sql` in the Scala API or `pyspark.sql` in the Python API.  
The first class we'll explore in this module is the `SparkSession` class. You can find this by entering "SparkSession" in the search bar.

## SparkSession
The `SparkSession` class is the single entry point to all functionality in Spark using the DataFrame API. 

In Databricks notebooks, the SparkSession is created for you and stored in a variable called `spark`.

In [0]:
spark

The example from the beginning of this lesson used the SparkSession method `table` to create a DataFrame from the `products` table. Let's save this in the variable `productsDF`.

In [0]:
productsDF = spark.table("products")

Below are several additional methods we can use to create DataFrames. All of these can be found in the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html" target="_blank">documentation</a> for `SparkSession`.

#### `SparkSession` Methods
| Method | Description |
| --- | --- |
| sql | Returns a DataFrame representing the result of the given query | 
| table | Returns the specified table as a DataFrame |
| read | Returns a DataFrameReader that can be used to read data in as a DataFrame |
| range | Creates a DataFrame with a single column containing elements in a range from start to end (exclusive), with the given step value and number of partitions |
| createDataFrame | Creates a DataFrame from a list of tuples, primarily used for testing |

Let's use a SparkSession method to run SQL.

In [0]:
resultDF = spark.sql("""
SELECT name, price
FROM products
WHERE price < 200
ORDER BY price
""")

display(resultDF)

## DataFrames
Recall that expressing our query using methods in the DataFrame API returns results in a DataFrame. Let's store this in the variable `budgetDF`.

A **DataFrame** is a distributed collection of data grouped into named columns.

In [0]:
budgetDF = (spark.table("products")
  .select("name", "price")
  .where("price < 200")
  .orderBy("price"))

We can use `display()` to output the results of a DataFrame.

In [0]:
display(budgetDF)

The **schema** defines the column names and types of a DataFrame.

Access a DataFrame's schema using the `schema` attribute.

In [0]:
budgetDF.schema

View a more readable version of this schema using the `printSchema()` method.

In [0]:
budgetDF.printSchema()

## Transformations
When we created `budgetDF`, we used a series of DataFrame transformation methods, e.g., `select`, `where`, and `orderBy`.

```
productsDF
  .select("name", "price")
  .where("price < 200")
  .orderBy("price")
```
Transformations operate on and return DataFrames, allowing us to chain transformation methods together to construct new DataFrames.  
However, transformations do not execute on their own, because they are **lazily evaluated**: Spark only records them until an action requests a result.

Running the following cell does not trigger any computation.

In [0]:
(productsDF
  .select("name", "price")
  .where("price < 200")
  .orderBy("price"))

## Actions
Conversely, DataFrame actions are methods that **trigger computation**.  
Actions are needed to trigger the execution of any DataFrame transformations. 

The `show` action causes the following cell to execute transformations.

In [0]:
(productsDF
  .select("name", "price")
  .where("price < 200")
  .orderBy("price")
  .show())

Below are several examples of <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#dataframe-apis" target="_blank">DataFrame</a> actions.

### DataFrame Action Methods
| Method | Description |
| --- | --- |
| show | Displays the top n rows of the DataFrame in tabular form |
| count | Returns the number of rows in the DataFrame |
| describe, summary | Computes basic statistics for numeric and string columns |
| first, head | Returns the first row |
| collect | Returns a list that contains all rows in this DataFrame |
| take | Returns a list of the first n rows in the DataFrame |

`count` returns the number of records in a DataFrame.

In [0]:
budgetDF.count()

`collect` returns a list of all rows in a DataFrame.

In [0]:
budgetDF.collect() 

## Convert between DataFrames and SQL

`createOrReplaceTempView` creates a temporary view based on the DataFrame. The lifetime of the temporary view is tied to the SparkSession that was used to create the DataFrame.

In [0]:
budgetDF.createOrReplaceTempView("budget")

In [0]:
display(spark.sql("SELECT * FROM budget"))

# Spark SQL Lab

##### Tasks
1. Create a DataFrame from the `events` table
1. Display the DataFrame and inspect its schema
1. Apply transformations to filter and sort `macOS` events
1. Count results and take the first 5 rows
1. Create the same DataFrame using a SQL query

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html?highlight=sparksession" target="_blank">SparkSession</a>: `sql`, `table`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a> transformations: `select`, `where`, `orderBy`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a> actions: `show`, `count`, `take`
- Other <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a> methods: `printSchema`, `schema`, `createOrReplaceTempView`

### 1. Create a DataFrame from the `events` table
- Use SparkSession to create a DataFrame from the `events` table

In [0]:
# ANSWER
eventsDF = spark.table("events")

### 2. Display DataFrame and inspect schema
- Use methods above to inspect DataFrame contents and schema

In [0]:
# ANSWER
display(eventsDF)

In [0]:
# ANSWER
eventsDF.printSchema()

### 3. Apply transformations to filter and sort `macOS` events
- Filter for rows where `device` is `macOS`
- Sort rows by `event_timestamp`

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> Use single and double quotes in your filter SQL expression

In [0]:
# ANSWER
macDF = (eventsDF
         .where("device == 'macOS'")
         .sort("event_timestamp")
        )

### 4. Count results and take first 5 rows
- Use DataFrame actions to count and take rows

In [0]:
# ANSWER
numRows = macDF.count()
rows = macDF.take(5)

**CHECK YOUR WORK**

In [0]:
from pyspark.sql import Row

assert numRows == 1938215
assert len(rows) == 5
assert isinstance(rows[0], Row)

### 5. Create the same DataFrame using SQL query
- Use SparkSession to run a SQL query on the `events` table
- Use SQL commands to write the same filter and sort query used earlier

In [0]:
# ANSWER
macSQLDF = spark.sql("""
SELECT *
FROM events
WHERE device = 'macOS'
ORDER BY event_timestamp
""")

display(macSQLDF)

**CHECK YOUR WORK**
- You should only see `macOS` values in the `device` column
- The fifth row should be an event with timestamp `1592539226602157`

In [0]:
verify_rows = macSQLDF.take(5)
assert (macSQLDF.select("device").distinct().count() == 1 and len(verify_rows) == 5 and verify_rows[0]['device'] == "macOS"), "Incorrect filter condition"
assert (verify_rows[4]['event_timestamp'] == 1592539226602157), "Incorrect sorting"
del verify_rows

### Classroom Cleanup

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>