
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Intro To DataFrames, Part #5

**Technical Accomplishments:**
* Introduce the concept of Broadcast Joins

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Data Source

This lesson uses the **Pageviews By Seconds** data set.

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# I've already gone through the exercise to determine
# how many partitions I want and in this case it is...
partitions = 8

# Make sure wide operations don't repartition to 200
spark.conf.set("spark.sql.shuffle.partitions", str(partitions))

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

# The directory containing our parquet files.
parquetFile = source + "/wikipedia/pageviews/pageviews_by_second.parquet/"

In [0]:
# Create our initial DataFrame. We can let it infer the 
# schema because the cost for parquet files is really low.
pageviewsDF = (spark.read
  .option("inferSchema", "true")                # The default, but not costly w/Parquet
  .parquet(parquetFile)                         # Read the data in
  .repartition(partitions)                      # From 7 >>> 8 partitions
  .withColumnRenamed("timestamp", "capturedAt") # rename and convert to timestamp datatype
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
  .orderBy( col("capturedAt"), col("site") )    # sort our records
  .cache()                                      # Cache the expensive operation
)
# materialize the cache
pageviewsDF.count()

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Broadcast Joins

If you saw the section on UDFs, you know that we can **aggregate by the Day-Of-Week**.

We **first used a UDF** only to discover that there was a built in function to do the exact same thing.

We saw that **Monday had more data** than any other day of the week.

We then forked the `DataFrame` so as to compare **Mobile Requests to Desktop Requests**.

Next, we **joined those two `DataFrames`** into one so that we could easily compare the two sets of data.

We know that the problem with the data **has nothing to do with Mobile vs Desktop**.

So we don't need that type of join (two ~large `DataFrames`).

However, what if we wanted to **reproduce our first exercise** (counts per day-of-week)...
* without a UDF...
* with a lookup table for the day-of-week...
* with a join between the pageviews and the lookup table.

What's different about this example is that we are **joining a big `DataFrame` to a small `DataFrame`**.

In this scenario, Spark can optimize the join and **avoid the expensive shuffle** with a **Broadcast Join**.

Let's start with two `DataFrames`
* The first we will derive from our original `DataFrame`. In this case, we will use a simple number for the day-of-week.
* The second `DataFrame` will map that number (1-7) to the labels **Monday**, **Tue**, **W**, or whatever...

Let's take a look at our first `DataFrame`.

In [0]:
columnTrans = date_format(col("capturedAt"), "u").alias("dow")

pageviewsWithDowDF = (pageviewsDF
    .withColumn("dow", columnTrans)  # Add the column dow
)
(pageviewsWithDowDF
  .cache()                           # mark the data as cached
  .count()                           # materialize the cache
)
display(pageviewsWithDowDF)

All we did was add one column, **dow**, that has the value **1** for **Monday**, **2** for **Tuesday**, etc.

Next, we are going to load a mapping of 1, 2, 3, etc. to Mon, Tue, Wed, etc. from a **REALLY** small `DataFrame`.
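
To make the numbering concrete, here is a minimal plain-Python sketch of the same Monday = 1 convention. It is only an illustration and not part of the lesson's pipeline: the function name `day_of_week_number` and the sample date are assumptions, and the timestamp layout mirrors the `yyyy-MM-dd'T'HH:mm:ss` pattern used when the data was loaded.

```python
from datetime import datetime


def day_of_week_number(timestamp_text: str) -> int:
    """Return 1 for Monday through 7 for Sunday (hypothetical helper)."""
    # Parse a timestamp laid out like the capturedAt values in this lesson.
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%dT%H:%M:%S")
    # isoweekday() numbers the days 1 (Monday) to 7 (Sunday).
    return parsed.isoweekday()


# 2015-03-16 is an arbitrary example date that falls on a Monday.
print(day_of_week_number("2015-03-16T00:00:00"))  # prints 1
```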

In [0]:
labelsDF = spark.read.parquet(source + "/day-of-week")

display(labelsDF) # view our labels

Now that we have two `DataFrames`...

We can execute a join between the two `DataFrames`

In [0]:
joinedDowDF = (pageviewsWithDowDF
  .join(labelsDF, pageviewsWithDowDF["dow"] == labelsDF["dow"])
  .drop( pageviewsWithDowDF["dow"] )
)
display(joinedDowDF)

Now that the data is joined, we can aggregate by any (or all) of the various labels representing the day-of-week.

Notice that we are not losing the numerical **dow** column which we can use to sort.

And when we graph this, we can graph by any one of the labels...

In [0]:
aggregatedDowDF = (joinedDowDF
  .groupBy(col("dow"), col("longName"), col("abbreviated"), col("shortName"))  
  .sum("requests")                                             
  .withColumnRenamed("sum(requests)", "Requests")
  .orderBy(col("dow"))
)
# Display and then graph...
display(aggregatedDowDF)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Already Broadcasted

Believe it or not, that was a broadcast join.

The proof can be seen by looking at the physical plan.

Run the `explain(..)` below and then look for **BroadcastHashJoin** and/or **BroadcastExchange**.

In [0]:
aggregatedDowDF.explain()

From the code perspective, it looks just like other joins.

So what's the difference between a standard join and a broadcast join?

## Standard Join

* In a standard join, **ALL** the data is shuffled
* This can be really expensive
<br/><br/>
<div style="text-align:center"><img src="https://files.training.databricks.com/images/join-standard.png" style="max-height:400px"/></div>
<p>Here we see how all the records keyed by "green" are moved to the same partition.<br/>The process would be repeated for "red" and "blue" records.</p>
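
The movement described above can be imitated in plain Python. This is a rough sketch of the idea, not Spark's implementation: the function names, the three-partition default, and the sample rows are all assumptions made for illustration. Both inputs are re-partitioned by hashing the join key, so every row on both sides is moved before any matching happens.

```python
from collections import defaultdict


def shuffle_partition(rows, key, num_partitions):
    """Route every row to the partition chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_join(left_rows, right_rows, key, num_partitions=3):
    """Inner join in which BOTH sides are shuffled by key first."""
    left_parts = shuffle_partition(left_rows, key, num_partitions)
    right_parts = shuffle_partition(right_rows, key, num_partitions)
    joined = []
    for pid in range(num_partitions):
        # Rows with equal keys now sit in the same partition on both sides.
        lookup = defaultdict(list)
        for right in right_parts[pid]:
            lookup[right[key]].append(right)
        for left in left_parts[pid]:
            for right in lookup.get(left[key], []):
                joined.append({**left, **right})
    return joined


# Hypothetical sample data keyed by color, as in the diagram.
left = [{"color": "green", "n": 1}, {"color": "red", "n": 2}, {"color": "green", "n": 3}]
right = [{"color": "green", "label": "G"}, {"color": "red", "label": "R"}]
print(shuffle_join(left, right, "color"))
```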

## Broadcast Join
* In a Broadcast Join, only the "small" data is moved.
* It duplicates the "small" data across all executors.
* But the "big" data is left untouched.
* If the "small" data is small enough, this can be **VERY** efficient.
<br/><br/>
<div style="text-align:center"><img src="https://files.training.databricks.com/images/join-broadcasted.png" style="max-height:400px"/></div>

<p>Here we see the records keyed by "red" being replicated into the first partition.<br/>
   The process would be repeated for each executor.<br/>
   The entire process would be repeated again for "green" and "blue" records.</p>
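
For comparison, here is a minimal plain-Python sketch of the broadcast idea. It is an illustration under stated assumptions rather than Spark's actual code: the function name `broadcast_hash_join` and the sample values are hypothetical, while the column names `dow`, `requests`, and `longName` follow this lesson. The small side is turned into a hash table once, a copy is handed to each partition, and the big side is joined where it already sits.

```python
def broadcast_hash_join(big_partitions, small_rows, key):
    """Inner join that copies the small side to every partition of the big side."""
    # Build one hash table from the small side.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in big_partitions:
        # Stands in for the copy each executor receives; the big rows never move.
        local_copy = dict(lookup)
        joined_partitions.append([
            {**big, **small}
            for big in partition
            for small in local_copy.get(big[key], [])
        ])
    return joined_partitions


# Hypothetical sample data: two partitions of pageviews and a tiny label table.
pageview_partitions = [
    [{"dow": "1", "requests": 10}],
    [{"dow": "2", "requests": 5}, {"dow": "1", "requests": 7}],
]
labels = [{"dow": "1", "longName": "Monday"}, {"dow": "2", "longName": "Tuesday"}]
print(broadcast_hash_join(pageview_partitions, labels, "dow"))
```

Note that the output keeps the same partition layout as the big input, which is the point of the technique: only the small table was copied around.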

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Broadcasted, How?

Behind the scenes, Spark is analyzing our two `DataFrames`.

It attempts to estimate whether either or both are smaller than 10 MB (the default threshold).

We can see/change this threshold value with the config **spark.sql.autoBroadcastJoinThreshold**. 

The documentation reads as follows:
> Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled...

In [0]:
threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
print(f"Threshold: {threshold}")

When one side is under the threshold, Spark takes the small `DataFrame` (`labelsDF` in our case) and:
* Sends the entire `DataFrame` to every **Executor**
* Then does the join against the local copy of `labelsDF`

Compare that to taking our big `DataFrame`, `pageviewsWithDowDF`, and shuffling it across all executors.
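
The size check described above can be modeled in a few lines of plain Python. This is a deliberately simplified sketch: Spark's planner works from its own size statistics and considers more than this single comparison. The function name `would_auto_broadcast` is an assumption; the default of 10 MB and the use of -1 to disable broadcasting come from the documentation quoted earlier.

```python
# 10 MB expressed in bytes, matching the default threshold discussed above.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def would_auto_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model: broadcast only when the estimate fits under the threshold."""
    if threshold_bytes < 0:
        # A threshold of -1 disables automatic broadcasting.
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes


print(would_auto_broadcast(4 * 1024))          # a tiny lookup table
print(would_auto_broadcast(50 * 1024 * 1024))  # a table well over the default
print(would_auto_broadcast(4 * 1024, -1))      # broadcasting disabled
```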

We can see proof of this by dropping the threshold:

In [0]:
# -1 is the documented value for disabling automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

Run the `explain(..)` below and then look for the **ABSENCE OF** the **BroadcastHashJoin** and/or **BroadcastExchange**.

In [0]:
(joinedDowDF
  .groupBy(col("dow"), col("longName"), col("abbreviated"), col("shortName"))  
  .sum("requests")                                             
  .withColumnRenamed("sum(requests)", "Requests")
  .orderBy(col("dow"))
  .explain()
)

And now that we are done, let's restore the original threshold:

In [0]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold)
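
If a cell fails between changing the threshold and restoring it, the session keeps the altered value. One possible safeguard, sketched here as an assumption rather than as part of the lesson, is a small context manager that always restores the setting. `temporary_conf` and `FakeConf` are hypothetical names; `FakeConf` only stands in for an object with the same `get`/`set` methods used on `spark.conf` above, so the sketch runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set conf[key] to value for the duration of the block, then restore it."""
    original = conf.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        # Runs even if the body raises, so the old value always comes back.
        conf.set(key, original)


class FakeConf:
    """Dictionary-backed stand-in exposing get/set (for illustration only)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


conf = FakeConf({"spark.sql.autoBroadcastJoinThreshold": "10485760"})
with temporary_conf(conf, "spark.sql.autoBroadcastJoinThreshold", "-1"):
    print(conf.get("spark.sql.autoBroadcastJoinThreshold"))  # prints -1
print(conf.get("spark.sql.autoBroadcastJoinThreshold"))      # prints 10485760
```

In a notebook, `spark.conf` would be passed in place of the stand-in object.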

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) broadcast(..)

What if I wanted to broadcast the data and it was over the 10MB [default] threshold?

We can specify that a `DataFrame` is to be broadcasted by using the `broadcast(..)` function from the `pyspark.sql.functions` module.

However, **it is only a hint**. Spark is allowed to ignore it.

In [0]:
pageviewsWithDowDF.join(broadcast(labelsDF), pageviewsWithDowDF["dow"] == labelsDF["dow"])

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>