
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Spark Review

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Create a Spark DataFrame
 - Analyze the Spark UI
 - Cache data
 - Convert between pandas and Spark DataFrames

![](https://files.training.databricks.com/images/sparkcluster.png)

In [0]:
%run "./Includes/Classroom-Setup"

## Spark DataFrame

In [0]:
from pyspark.sql.functions import col, rand

df = (spark.range(1, 1000000)
      .withColumn("id", (col("id") / 1000).cast("integer"))
      .withColumn("v", rand(seed=1)))

Why were no Spark jobs kicked off above? Transformations such as **`withColumn`** are lazy: they only build up a query plan. Because we have not yet called an action that needs to "touch" the data, Spark had nothing to execute across the cluster.

In [0]:
display(df.sample(.001))

## Views

How can we query this DataFrame in SQL? Register it as a temporary view.

In [0]:
df.createOrReplaceTempView("df_temp")

In [0]:
%sql
SELECT * FROM df_temp LIMIT 10

## Count

Let's see how many records we have.

In [0]:
df.count()

## Spark UI

Open the Spark UI and find the job for the count above. What do the shuffle read and shuffle write fields tell you? The command below should give you a clue.

In [0]:
df.rdd.getNumPartitions()

## Cache

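For repeated access, reading from a cache is usually much faster than recomputing the DataFrame. Note that **`cache()`** is lazy: the data is only materialized by the first action that follows, which is why we call **`count()`** right after it.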

In [0]:
df.cache().count()

## Re-run Count

Wow! Look at how much faster it is now! This count reads the cached partitions instead of regenerating the data.

In [0]:
df.count()

## Collect Data

When you pull data back to the driver (e.g. by calling **`.collect()`**, **`.toPandas()`**, etc.), be careful about how much data you bring back. Otherwise, the driver might run out of memory and throw OOM (out of memory) exceptions!

A best practice is to explicitly limit the number of records before calling **`.collect()`** or **`.toPandas()`**, unless you know your data set is small.

In [0]:
df.limit(10).toPandas()

## What's new in <a href="https://www.youtube.com/watch?v=l6SuXvhorDY&feature=emb_logo" target="_blank">Spark 3.0</a>

* <a href="https://www.youtube.com/watch?v=jzrEc4r90N8&feature=emb_logo" target="_blank">Adaptive Query Execution</a>
  * Dynamic query optimization that happens in the middle of your query based on runtime statistics
    * Dynamically coalesce shuffle partitions
    * Dynamically switch join strategies
    * Dynamically optimize skew joins
  * Enable it with: **`spark.sql.adaptive.enabled=true`** (on by default since Spark 3.2)
* Dynamic Partition Pruning (DPP)
  * Skip scanning partitions based on the results of other fragments of the same query
* Join Hints
* <a href="https://www.youtube.com/watch?v=UZl0pHG-2HA&feature=emb_logo" target="_blank">Improved Pandas UDFs</a>
  * Type Hints
  * Iterators
  * Pandas Function API (**`mapInPandas`**, **`applyInPandas`**, etc.)
* And many more! See the <a href="https://spark.apache.org/docs/latest/api/python/migration_guide/pyspark_2.4_to_3.0.html" target="_blank">migration guide</a> and resources linked above.

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>