<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://cdn2.hubspot.net/hubfs/438089/docs/training/dblearning-banner.png" alt="Databricks Learning" width="555" height="64">
</div>

#Introduction to Transformations and Actions

**Technical Accomplishments:**
* Develop familiarity with the `DataFrame` APIs
* Introduce transformations and actions
* Sharing/exporting notebooks

The data is located at `dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv`.

In [4]:
%fs ls /mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv

path,name,size
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv,product.csv,3449


In [5]:
csvDir = "dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv"

retailDF = (spark                    # Our SparkSession & Entry Point
           .read                     # Our DataFrameReader
           .option("header", "true")
           .option("inferSchema", "true")
           .csv(csvDir))           # Returns an instance of DataFrame

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Our Data

Let's continue by taking a look at the type of data we have. 

We can do this with the `printSchema()` command:

In [7]:
retailDF.printSchema()

In [8]:
retailDF.show(3)

We should now be able to see that we have eight columns of data:
* **product_id** (*string*): Unique product identifier.
* **category** (*string*): The product is either a tablet or a laptop.
* **brand** (*string*): The name of the product's brand.
* **model** (*string*): The name of the product's model.
* **price** (*double*): Price of the product.
* **processor** (*string*): The type of processor the product uses.
* **size** (*string*): The size of the product in inches.
* **display** (*string*): The aspect-ratio of the display.

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) show(..)

The `show(..)` method has two optional parameters:
* **n**: The number of records to print to the console, the default being 20.
* **truncate**: If true, values longer than 20 characters are truncated. The default is true.

Let's take a look at the data in our `DataFrame` with the `show()` command.

[Python Docs](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=show#pyspark.sql.DataFrame.show)

[Scala Docs](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Dataset)

In [11]:
retailDF.show()

In the cell above, change the parameters of the show command to:
* print only the first 5 records
* disable truncation
* print only the first 10 records and disable truncation

**Note:** The function `show(..)` is an **action** which triggers a job.

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) display(..)

The `show(..)` command is part of the core Spark API and simply prints the results to the console.

Our notebooks have a slightly more elegant alternative.

Instead of calling `show(..)` on an existing `DataFrame`, we can pass our `DataFrame` to the `display(..)` command:

In [15]:
display(retailDF)

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) limit(..)

Both `show(..)` and `display(..)` are **actions** that trigger jobs (though in slightly different ways).

If you recall, `show(..)` has a parameter to control how many records are printed but `display(..)` does not.

We can address that difference with our first transformation, `limit(..)`.

If you look at the API docs, `limit(..)` is described like this:
> Returns a new Dataset by taking the first n rows...

`show(..)`, like many actions, does not return anything. 

On the other hand, transformations like `limit(..)` return a **new** `DataFrame`:

In [17]:
limitedDF = retailDF.limit(5) # "limit" the number of records to the first 5

### Nothing Happened
* Notice how "nothing" happened, that is, no job was triggered.
* This is because we are simply defining the second step in our transformations.
  1. Read in the CSV file (represented by **retailDF**).
  2. Limit those records to just the first 5 (represented by **limitedDF**).
* It's not until we induce an action that a job is triggered and the data is processed.

We can induce a job by calling either the `show(..)` or the `display(..)` actions:

In [19]:
limitedDF.show(100, False) #show up to 100 records and don't truncate the columns

In [20]:
display(limitedDF) # defaults to the first 1000 records

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,


### Why is Laziness So Important?

Lazy evaluation is a common pattern in functional programming as well as in Big Data frameworks.
* We see it in Scala as part of its core design.
* Java 8 introduced the concept with its Streams API.
* And many functional programming languages have similar APIs.

It has a number of benefits:
* We are not forced to load all of the data at step #1.
  * Doing so is technically impossible with **REALLY** large datasets.
* Operations are easier to parallelize.
  * N different transformations can be processed on a single data element, on a single thread, on a single machine.
* Most importantly, it allows the framework to automatically apply various optimizations.
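
As a rough analogy in plain Python (this is not Spark itself), generators are lazy in much the same way: defining the pipeline does no work, and elements only flow through the steps once something consumes the result. The names `read_records` and `processed` are made up for this sketch.

```python
from itertools import islice

processed = []  # records which elements were actually read


def read_records(n):
    """Hypothetical data source: yields n integers and logs each read."""
    for i in range(n):
        processed.append(i)
        yield i


# "Transformations": nothing is read yet, we only describe the steps.
source = read_records(1_000_000)
doubled = (x * 2 for x in source)
limited = islice(doubled, 5)

print(len(processed))  # 0, because no element has been read so far

# "Action": consuming the pipeline finally pulls data through it.
result = list(limited)
print(result)          # [0, 2, 4, 6, 8]
print(len(processed))  # 5, only the elements that were needed
```

Each element passes through both steps before the next one is read, and the remaining 999,995 elements are never touched.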

### Catalyst Optimizer

Because our API is declarative, a large number of optimizations are available to us.

Some examples include:
  * Optimizing data types for storage
  * Rewriting queries for performance
  * Predicate pushdown

![Catalyst](https://files.training.databricks.com/images/105/catalyst-diagram.png)

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) count()

How many rows are in our dataset? Let's use the `count()` action to find out!

Take a look at the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html) to see how to use `count()`.

In [24]:
total = retailDF.count()

print("Record Count: {:,}".format( total ))

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) select(..)

In our case, the `processor`, `size`, and `display` columns are of no value to us.

We can disregard them by selecting only the 5 columns that we want:

In [26]:
minusOneDF = retailDF.select("product_id", "category", "brand", "model", "price")
  
minusOneDF.printSchema()

Again, notice how the call to `select(..)` does not trigger a job. That's because `select(..)` is a transformation.

Let's go ahead and invoke the action `show(..)` and take a look at the result.

In [28]:
minusOneDF.show(5) #Since we did not specify an argument for Truncate, it defaults to True

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) drop(..)

As a side note, you will quickly discover there are many ways to accomplish the same task.

Instead of selecting everything we want, `drop(..)` allows us to specify the columns we don't want.

Here we drop only the `processor` column; to reproduce the result of the last exercise exactly, we would also drop `size` and `display`:

In [30]:
droppedDF = retailDF.drop("processor")

droppedDF.printSchema()

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) withColumnRenamed(..)

There are many ways to rename columns of a DataFrame in Spark. 

`withColumnRenamed(oldName, newName)` lets us rename columns one at a time.

In [32]:
droppedDF.withColumnRenamed("product_id", "prodID").printSchema()

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) withColumn(..)

`withColumn` allows us to add new columns, or overwrite the values of existing columns.

Let's make a new column which equals double the `price` field. There are a few ways we can do this.

One is by importing the `col` (column) function and applying it to the `price` column (recommended).

In [34]:
from pyspark.sql.functions import col

doublePriceDF = droppedDF.withColumn("doublePrice", col("price") * 2)

doublePriceDF.show(3)

Another way is to use Pandas-style indexing, i.e. `droppedDF["price"]`, as shown below.

In [36]:
droppedDF.withColumn("doublePrice", droppedDF["price"] * 2).show(3)

Above, we used the `col("price") * 2` or `droppedDF["price"] * 2` syntax.

Passing a plain string instead will not work:
`droppedDF.withColumn("doublePrice", "price")`

* The problem is that `.withColumn(..)` expects a `Column` object as its second argument, not a string, hence the notation
`col("price") * 2` or `droppedDF["price"] * 2`.

Refer to: 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) The Column Class

A `Column` object encompasses more than just the name of the column; it also carries column-level transformations, such as sorting in descending order.

The first question to ask is: how do we create a `Column` object?

In Python we have these options:

In [39]:
# Scala & Python both support accessing a column from a known DataFrame
columnA = retailDF["price"]
print(columnA)

columnB = retailDF.price
print(columnB)

# The $"column-name" version that works for Scala does not work in Python

# If we import from pyspark.sql.functions, we get a few more options:
from pyspark.sql.functions import col, expr, lit

# This uses the col(..) function
columnC = col("price")
print(columnC)

# This uses the expr(..) function which parses an SQL Expression
columnD = expr("a + 1")
print(columnD)

# This uses the lit(..) to create a literal (constant) value.
columnE = lit("abc")
print(columnE)

In the case of Python, the cleanest version is the **col("column-name")** variant.

***Note:*** *We are introducing `pyspark.sql.functions` specifically for creating `Column` objects.*<br/>
*We will be reviewing the multitude of other commands available from this part of the API in future notebooks.*

In [41]:
display(droppedDF.withColumn("doubleReq", col("price") * 2))

product_id,category,brand,model,price,size,display,doubleReq
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,2999.97998046875
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,2599.97998046875
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,2999.97998046875
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,3899.97998046875
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,3599.97998046875
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,5319.97998046875
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,2999.97998046875
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,3999.97998046875
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,1999.97998046875
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,1999.97998046875


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) selectExpr(..)

`selectExpr` is very slick - it allows you to select columns, rename columns, and create new columns all in one.

In [43]:
display(retailDF.selectExpr("product_id as prodID", "category", "brand","price", "price*2 as 2xPrice"))

prodID,category,brand,price,2xPrice
1,Laptops,HP,1499.989990234375,2999.97998046875
2,Laptops,Microsoft,1299.989990234375,2599.97998046875
3,Laptops,Microsoft,1499.989990234375,2999.97998046875
4,Laptops,Dell,1949.989990234375,3899.97998046875
5,Laptops,Lenovo,1799.989990234375,3599.97998046875
6,Laptops,Apple,2659.989990234375,5319.97998046875
7,Laptops,Apple,1499.989990234375,2999.97998046875
8,Laptops,Apple,1999.989990234375,3999.97998046875
9,Laptops,Apple,999.989990234375,1999.97998046875
10,Laptops,HP,999.989990234375,1999.97998046875


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) filter

Let's filter out all of the records where the price is less than 300.

In [45]:
display(retailDF.filter("price >= 300"))

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) groupBy

Let's group by the `brand` field in our dataset.

Look at the [docs](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData) to see all of the different methods we can call on `GroupedData`.

In [47]:
display(retailDF.groupBy("brand").count().filter("brand = 'Microsoft'"))

brand,count
Microsoft,5


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) take(n)

Use `take` if you want to retrieve just a few records of your data.

In [49]:
retailDF.take(10)

##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) collect()

`collect()` returns all of the data to the driver. Calling it on a large dataset is one of the easiest ways to run the driver out of memory and crash your Spark application, so be very careful when you use `collect()`.

In [51]:
retailDF.groupBy("brand").count().limit(40).collect()

## Next Step

[Streaming]($../3-Streaming/3-01 Streaming)

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>