# Spark Review

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Create a Spark DataFrame
 - Analyze the Spark UI
 - Cache data
 - Go between Pandas and Spark DataFrames

![](https://files.training.databricks.com/images/sparkcluster.png)

In [0]:
## Put your name here
username = "renato"

dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dsacademy_embedded_wave3_{username}")
spark.sql(f"USE dsacademy_embedded_wave3_{username}")
spark.conf.set("spark.sql.shuffle.partitions", 40)

spark.sql("SET spark.databricks.delta.formatCheck.enabled = false")
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Out[1]: DataFrame[key: string, value: string]

## Spark DataFrame

In [0]:
from pyspark.sql.functions import col, rand

df = (spark.range(1, 1000000)
      .withColumn("id", (col("id") / 1000).cast("integer"))
      .withColumn("v", rand(seed=1)))

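Why were no Spark jobs kicked off above? `range` and `withColumn` are transformations, which Spark evaluates lazily: it only records them in a query plan. No action (such as `count()` or `display()`) has asked for results yet, so Spark didn't have to "touch" the data or execute anything across the cluster.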

In [0]:
#display(df.sample(.001))
df.sample(.001).limit(10).display()

id,v
0,0.9982156280218047
3,0.3126434553037355
4,0.5348328238467555
4,0.0315472106758006
5,0.0723630583300772
8,0.56473145019343
9,0.9072631815405384
10,0.6758363663444936
11,0.4381899095630061
12,0.9695095299912684


## Views

How can I access this in SQL?

In [0]:
df.createOrReplaceTempView("df_temp")

In [0]:
%sql
SELECT * FROM df_temp LIMIT 10

id,v
0,0.6363787615254752
0,0.5993846534021868
0,0.134842710012538
0,0.076841639054609
0,0.8539211111755448
0,0.7167704217972344
0,0.2473902407597975
0,0.1367450741851369
0,0.3869569887491171
0,0.6051540605040805


## Count

Let's see how many records we have.

In [0]:
df.count()

Out[6]: 999999

## Spark UI

Open up the Spark UI - what are the shuffle read and shuffle write fields? The command below should give you a clue.

In [0]:
df.rdd.getNumPartitions()

Out[7]: 4

## Cache

For repeated access, it will be much faster if we cache our data.

`cache()` and `persist()` are optimization techniques for iterative and interactive Spark applications. Both store the intermediate result of a DataFrame computation so that subsequent actions can reuse it instead of recomputing it. `cache()` is shorthand for `persist()` with the default storage level, while `persist()` also accepts an explicit storage level (memory only, memory and disk, and so on).

Although Spark is often cited as running up to 100x faster than traditional MapReduce jobs, performance still degrades on very large datasets if jobs recompute the same intermediate results over and over. Looking at the stages in the Spark UI and caching reused data is one way to improve performance.

When you persist a dataset, each node stores the partitions it computes in memory and reuses them in other actions on that dataset. Persisted data is also fault-tolerant: if any partition is lost, Spark automatically recomputes it using the original transformations that created it. Both methods are lazy, so nothing is stored until an action runs, which is why `count()` is called right after `cache()` in the next cell.

In [0]:
df.cache().count()

Out[8]: 999999

## Re-run Count

Wow! Look at how much faster it is now!

In [0]:
df.count()

Out[9]: 999999

## Examining Available Datasets

In [0]:
%fs mounts

mountPoint,source,encryptionType
/databricks-datasets,databricks-datasets,sse-kms
/databricks/mlflow-tracking,databricks/mlflow-tracking,sse-kms
/databricks-results,databricks-results,sse-kms
/databricks/mlflow-registry,databricks/mlflow-registry,sse-kms
/,DatabricksRoot,sse-kms


In [0]:
#%fs ls /databricks-datasets/
files = dbutils.fs.ls("/databricks-datasets") 
display(files)

path,name,size,modificationTime
dbfs:/databricks-datasets/COVID/,COVID/,0,1666267249065
dbfs:/databricks-datasets/README.md,README.md,976,1532502324000
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0,1666267249065
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359,1511905961000
dbfs:/databricks-datasets/adult/,adult/,0,1666267249065
dbfs:/databricks-datasets/airlines/,airlines/,0,1666267249065
dbfs:/databricks-datasets/amazon/,amazon/,0,1666267249065
dbfs:/databricks-datasets/asa/,asa/,0,1666267249065
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0,1666267249065
dbfs:/databricks-datasets/bikeSharing/,bikeSharing/,0,1666267249065


In [0]:
%fs ls /databricks-datasets/iot-stream/data-device/

path,name,size,modificationTime
dbfs:/databricks-datasets/iot-stream/data-device/part-00000.json.gz,part-00000.json.gz,2610922,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00001.json.gz,part-00001.json.gz,2612478,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00002.json.gz,part-00002.json.gz,2619023,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00003.json.gz,part-00003.json.gz,2620016,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00004.json.gz,part-00004.json.gz,2618699,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00005.json.gz,part-00005.json.gz,2619772,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00006.json.gz,part-00006.json.gz,2619027,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00007.json.gz,part-00007.json.gz,2619832,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00008.json.gz,part-00008.json.gz,2617893,1532502336000
dbfs:/databricks-datasets/iot-stream/data-device/part-00009.json.gz,part-00009.json.gz,2619764,1532502336000


## Debug Slow Query: Spark UI

Why is the query below slow? How can you speed it up?

In [0]:
DFjson = spark.read.json("/databricks-datasets/iot-stream/data-device/")

In [0]:
DFjson.count()

Out[12]: 1000000

In [0]:
DFjson.cache().count()

Out[13]: 1000000

In [0]:
DFjson.count()

Out[14]: 1000000

## Collect Data

When you pull data back to the driver (e.g. by calling `.collect()` or `.toPandas()`), be careful about how much data you bring back. The full result has to fit in the driver's memory; otherwise, you might get out-of-memory (OOM) exceptions!

A best practice is to explicitly limit the number of records before calling `.collect()` or `.toPandas()`, unless you know your data set is small.

In [0]:
df.limit(10).toPandas()

Unnamed: 0,id,v
0,250,0.531121
1,250,0.286137
2,250,0.494431
3,250,0.455371
4,250,0.87924
5,250,0.364463
6,250,0.450197
7,250,0.419973
8,250,0.705159
9,250,0.015088


Some material adapted from Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.