## A Worked Example of Transformations and Actions

To illustrate these concepts, and most importantly **transformations** and **actions**, let's go through a more thorough example, this time using `DataFrames` and a CSV file.

Let's load the popular diamonds dataset as a Spark `DataFrame`, then take a look at the data we'll be working with.

In [2]:
dataPath = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
diamonds = sqlContext.read.format("com.databricks.spark.csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load(dataPath)

Let's get an idea of what our data looks like by using the `display` function.

In [4]:
display(diamonds)

_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
6,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
7,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
8,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
9,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
10,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


With Databricks, we can easily create more sophisticated graphs by clicking the graphing icon that appears below the cell output. Here's a plot that allows us to compare price, color, and cut.

In [6]:
display(diamonds)

(Plot comparing price, color, and cut; rendered interactively in the notebook from the same `diamonds` data shown above.)


Create several transformations and then an action:
* First group by two variables: cut and color
* Then compute the average price
* Then join the resulting data set to the original dataset (`diamonds`) on the column `color`
* Then select two variables from the new dataset: average price and carat

Which of these operations are transformations?  Which are actions?

In [8]:
df1 = diamonds.groupBy("cut", "color").avg("price") # a simple grouping

df2 = df1\
  .join(diamonds, on='color', how='inner')\
  .select("`avg(price)`", "carat")

As we have seen several times before, we don't see any results when running this command.  The operations were all transformations, and Spark is still waiting for us to request an action that requires the engine to compute the transformations.
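The lazy behavior described above can be mimicked in a few lines of plain Python. This is only a toy sketch, not how Spark is implemented: `LazyDataset`, `is_premium`, and the three sample rows are hypothetical names and data made up for illustration. Transformations only record a step, and the `count` action is what finally runs them.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: records transformations, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data (a list of dicts)
        self._steps = tuple(steps)   # recorded (name, function) pairs, not yet applied

    # Transformations: return a new LazyDataset and do no work on the data.
    def filter(self, predicate):
        step = ("filter", lambda rows: [r for r in rows if predicate(r)])
        return LazyDataset(self._rows, self._steps + (step,))

    def select(self, *columns):
        step = ("select", lambda rows: [{c: r[c] for c in columns} for r in rows])
        return LazyDataset(self._rows, self._steps + (step,))

    def plan(self):
        """Names of the recorded steps, in execution order."""
        return [name for name, _ in self._steps]

    # Action: only now is the recorded pipeline executed.
    def count(self):
        rows = self._rows
        for _, fn in self._steps:
            rows = fn(rows)
        return len(rows)


evaluated = []  # records every row the predicate was actually called on


def is_premium(row):
    evaluated.append(row["carat"])
    return row["cut"] == "Premium"


rows = [
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.29, "cut": "Premium", "price": 334},
]

lazy = LazyDataset(rows).filter(is_premium).select("carat", "price")
calls_before_action = len(evaluated)  # the predicate has not been called yet
plan = lazy.plan()                    # the steps are known even though none has run
n = lazy.count()                      # the action triggers the work
calls_after_action = len(evaluated)   # now the predicate has seen every row
```

Building `lazy` touches no rows at all; the predicate is called only once `count()` runs, which is the same division of labor as between `df2` and a later action.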

This is not to say that Spark did not do any work.  In fact it did make a plan for how to compute the transformations, should the need arise.  Use the `explain` method to see this plan.

In [10]:
df2.explain()

The output of the plan is shown on top, while the input of the plan is shown at the bottom of each branch of the plan.  Each branch starts by reading the CSV, and applies the specified transformations.  Because we performed a join in order to create `df2`, the plan comes together, and at the very top of the plan we see the projection of `avg(price)` and `carat`.

Execute an action and Spark will execute this plan. (Hint: try `count`.)

In [12]:
df2.count()

This will execute the plan that Apache Spark built up previously. Click the little arrow next to where it says `(2) Spark Jobs` after that cell finishes executing and then click the `View` link. This brings up the Apache Spark Web UI right inside of your notebook. 

![img](http://training.databricks.com/databricks_guide/gentle_introduction/spark-dag-ui.png)
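To make concrete what the plan computes, here is a plain-Python sketch of the same group, average, join, and select steps, applied only to the ten sample rows displayed earlier. It runs eagerly and on a tiny sample, so it is an illustration of the logic rather than of Spark's execution, and its row count says nothing about the count for the full dataset. The variable names are made up for this sketch.

```python
from collections import defaultdict

# The ten sample rows displayed earlier, reduced to the columns the query uses.
sample = [
    {"cut": "Ideal", "color": "E", "carat": 0.23, "price": 326},
    {"cut": "Premium", "color": "E", "carat": 0.21, "price": 326},
    {"cut": "Good", "color": "E", "carat": 0.23, "price": 327},
    {"cut": "Premium", "color": "I", "carat": 0.29, "price": 334},
    {"cut": "Good", "color": "J", "carat": 0.31, "price": 335},
    {"cut": "Very Good", "color": "J", "carat": 0.24, "price": 336},
    {"cut": "Very Good", "color": "I", "carat": 0.24, "price": 336},
    {"cut": "Very Good", "color": "H", "carat": 0.26, "price": 337},
    {"cut": "Fair", "color": "E", "carat": 0.22, "price": 337},
    {"cut": "Very Good", "color": "H", "carat": 0.23, "price": 338},
]

# Equivalent of groupBy("cut", "color").avg("price")
prices_by_group = defaultdict(list)
for row in sample:
    prices_by_group[(row["cut"], row["color"])].append(row["price"])

averages = [
    {"cut": cut, "color": color, "avg(price)": sum(prices) / len(prices)}
    for (cut, color), prices in prices_by_group.items()
]

# Equivalent of join(diamonds, on="color", how="inner")
# followed by select("avg(price)", "carat")
joined = [
    {"avg(price)": group["avg(price)"], "carat": row["carat"]}
    for group in averages
    for row in sample
    if group["color"] == row["color"]
]

result_count = len(joined)
print(len(averages), result_count)
```

Because the join key is `color` alone, every group of a given color is paired with every original row of that color, so the joined result has more rows than the input sample.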

### Caching

One of the most significant features of Apache Spark is its ability to store data in memory during computation. This is a neat trick that you can use to speed up access to commonly queried tables or pieces of data. It is also great for iterative algorithms that work over and over again on the same data.

Use the `cache` method to cache a DataFrame or RDD.

In [15]:
df2.cache()

Caching, like a transformation, is performed lazily. It won't store the data in memory until you call an action on that dataset. 

Now that we have asked Spark to cache the `df2` data set, request an action to force Spark to put the dataset into memory.

In [17]:
df2.count()

Now count again.

In [19]:
df2.count()

See that the count ran much faster the second time.  Compare plans and you will notice that some steps are skipped when using the cached data.
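The cache-then-count sequence above can be modeled with a small plain-Python sketch. It is a toy under stated assumptions, not Spark's caching mechanism: `CachedDataset` and `expensive_plan` are hypothetical names, and the list comprehension merely stands in for reading the CSV, grouping, and joining.

```python
class CachedDataset:
    """Toy model: cache() only sets a flag; the first action fills the cache."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the rows (the "plan")
        self._cache_requested = False
        self._cached_rows = None
        self.computations = 0          # how many times the plan actually ran

    def cache(self):
        # Lazy: mark the dataset for caching, but do not compute anything yet.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows   # served from memory, plan skipped
        self.computations += 1
        rows = self._compute()
        if self._cache_requested:
            self._cached_rows = rows   # keep the result for later actions
        return rows

    def count(self):
        return len(self._rows())


def expensive_plan():
    # Stand-in for the real work (reading, grouping, joining).
    return [x * x for x in range(1000)]


ds = CachedDataset(expensive_plan)
ds.cache()
runs_after_cache = ds.computations   # cache() alone did no work
first = ds.count()                   # runs the plan and stores the rows
second = ds.count()                  # reuses the stored rows
runs_after_counts = ds.computations  # the plan ran only once across both counts
```

The second `count()` skips the computation entirely, which is the toy analogue of the skipped steps you see when comparing the plans.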