
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Transformations & Actions Lab
## Exploring T&As in the Spark UI

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions
0. Run the cell below.<br/>***Note:*** *There is no real rhyme or reason to this code.*<br/>*It simply includes a couple of actions and a handful of narrow and wide transformations.*
0. Answer each of the questions.
  * All the answers can be found in the **Spark UI**.
  * Not every part of the **Spark UI** may have been covered with you yet - that's OK.
  * The goal is to get familiar with diagnosing applications.
0. Submit your answers for review.
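
The narrow/wide distinction mentioned in the first step determines where stage boundaries tend to fall: a narrow transformation works within each partition, while a wide one needs a shuffle. The following is a minimal sketch in plain Python, not Spark's own logic. The `WIDE` set and the `estimate_stages` helper are hypothetical names introduced for illustration, and the sketch assumes each wide operation adds exactly one shuffle, which the real optimizer does not guarantee.

```python
# Toy classification of DataFrame operations; hypothetical helper, not a Spark API.
# Assumption: every operation listed in WIDE introduces exactly one shuffle.
WIDE = {"groupBy", "orderBy", "distinct", "repartition"}

def estimate_stages(operations):
    """Estimate the stage count as one more than the number of wide operations."""
    shuffles = sum(1 for op in operations if op in WIDE)
    return shuffles + 1

ops = ["withColumn", "where", "groupBy", "drop", "orderBy", "select", "filter"]
print(estimate_stages(ops))  # 3
```

Treat the result as a rough expectation only. An action can add its own shuffle, and Spark may split the work across more than one job, so the **Spark UI** remains the authority.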

**WARNING:** Run the following cell only once. Running it multiple times will change some of the answers and make validation a little harder.

In [0]:
initialDF = (spark                                                       
  .read                                                                     
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")   
  .cache()
)
initialDF.foreach(lambda x: None) # materialize the cache

displayHTML("All done<br/><br/>")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Round #1 Questions
0. How many jobs were triggered?
0. Open the Spark UI and select the **Jobs** tab.
  0. What action triggered the first job?
  0. What action triggered the second job?
0. Open the details for the second job. How many MB of data were read in? Hint: Look at the **Input** column.
0. Open the details for the first stage of the second job. How many records were read in? Hint: Look at the **Input Size / Records** column.

In [0]:
from pyspark.sql.functions import col, upper

someDF = (initialDF
  # upper-cased first character of the article title
  .withColumn("first", upper( col("article").substr(0,1)) )
  # keep only rows whose first character is a letter A-Z
  .where( col("first").isin("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z") )
  # sum() with no arguments sums every numeric column per group
  .groupBy("project", "first").sum()
  .drop("sum(bytes_served)")
  .orderBy("first", "project")
  .select( col("first"), col("project"), col("sum(requests)").alias("total"))
  .filter( col("total") > 10000)
)
total = someDF.count()

displayHTML("All done<br/><br/>")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Round #2 Questions
0. How many jobs were triggered?
0. How many actions were executed?
0. Open the **Spark UI** and select the **Jobs** tab.
  0. What action triggered the first job?
  0. What action triggered the second job?
0. Open the **SQL** tab - what is the relationship between these two jobs?
0. For the first job...
  0. How many stages are there?
  0. Open the **DAG Visualization**. What do you suppose the green dot refers to?
0. For the second job...
  0. How many stages are there?
  0. Open the **DAG Visualization**. Why do you suppose the first stage is grey?
  0. Can you figure out which transformation triggers the shuffle at the end of...
    0. The first stage? Hint: If you can't figure it out, look at the **SQL** tab again. Exchange means shuffle. What happened after the shuffle?
    0. The second stage?
    0. The third stage? Hint: It's not a transformation but an action.
0. For the second job, the second stage, how many records (total) 
  0. Were read in as a result of the previous shuffle operation?
  0. Were written out as a result of this shuffle operation?  
  Hint: look for the **Aggregated Metrics by Executor**
0. Open the **Event Timeline** for the second stage of the second job.
  * Make sure to turn on all metrics under **Show Additional Metrics**.
  * Note that there were 200 tasks executed.
  * Visually compare the **Scheduler Delay** to the **Executor Computing Time**
  * Then in the **Summary Metrics**, compare the median **Scheduler Delay** to the median **Duration** (aka **Executor Computing Time**)
  * Which takes the longest: scheduling, execution, task deserialization, garbage collection, result serialization, or getting the result?
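
The comparison asked for in the last bullets amounts to taking a median per metric and seeing which one is largest. A minimal sketch in plain Python follows. The millisecond values are made-up placeholders, not measurements from this lab, and `slowest_metric` is a hypothetical helper, so substitute the figures shown in your own **Summary Metrics** table.

```python
from statistics import median

def slowest_metric(task_metrics):
    """Return (metric name, median) for the metric with the largest median."""
    medians = {name: median(values) for name, values in task_metrics.items()}
    name = max(medians, key=medians.get)
    return name, medians[name]

# Placeholder per-task timings in milliseconds (illustrative only, not lab results).
sample = {
    "Scheduler Delay": [4, 6, 5, 7, 5],
    "Task Deserialization Time": [1, 2, 1, 1, 2],
    "Executor Computing Time": [3, 2, 4, 3, 3],
}
print(slowest_metric(sample))
```

The median is used rather than the mean because a few straggler tasks can distort an average across 200 tasks.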

In [0]:
someDF.take(total)

displayHTML("All done<br/><br/>")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Round #3 Questions
0. Collectively, `someDF.count()` produced 2 jobs and 6 stages.  
However, `someDF.take(total)` produced only 1 job and 2 stages.  
  0. Why did it only produce 1 job?
  0. Why did the last job only produce 2 stages?
0. Look at the **Storage** tab. How many partitions were cached?
0. True or False: The cached data is fairly evenly distributed.
0. How many MB of data is being used by our cache?
0. How many total MB of data is available for caching?
0. Go to the **Executors** tab. How many executors do you have?
0. Go to the **Executors** tab. How many total cores do you have available?
0. Go to the **Executors** tab. What is the IP Address of your first executor?
0. How many tasks is your cluster able to execute simultaneously?
0. What is the path to your **Java Home**?
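
The question about simultaneous tasks can be reasoned out from the **Executors** tab: each task occupies `spark.task.cpus` cores (1 by default), so the number of task slots follows from the core count. A minimal sketch, assuming every executor has the same number of cores; `task_slots` and the example numbers are hypothetical and are not this cluster's values.

```python
def task_slots(executors, cores_per_executor, task_cpus=1):
    """Number of tasks that can run at once, assuming uniform executors."""
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    # Each executor can host as many tasks as whole groups of task_cpus cores.
    return executors * (cores_per_executor // task_cpus)

print(task_slots(executors=2, cores_per_executor=4))  # 8
```

Plug in the executor and core counts you read from the **Executors** tab to check your own answer.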

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>