## Quick Start Using Python
* Uses a Databricks notebook to showcase RDD operations in Python
* Reference: http://spark.apache.org/docs/latest/quick-start.html

In [2]:
# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))

path,name,size
dbfs:/databricks-datasets/samples/docs/README.md,README.md,3137


In [3]:
# Set up the textFile RDD to read the README.md file
#   Note: this is lazy; the file is not read until an action runs
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

RDDs have ***actions***, which return values, and ***transformations***, which return pointers to new RDDs.
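The split between the two kinds of operation can be sketched in plain Python, without a cluster. The `MiniRDD` class below is a hypothetical stand-in written for illustration only; it is not part of Spark and ignores partitioning and distribution. It only shows the shape of the idea: `filter` hands back a new object, while `count` and `first` return values.

```python
class MiniRDD:
    """Hypothetical, single-machine stand-in for an RDD (not a Spark class)."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator
        self._source = source

    def filter(self, predicate):
        # Transformation: returns a new MiniRDD and reads no data yet
        return MiniRDD(lambda: (x for x in self._source() if predicate(x)))

    def count(self):
        # Action: iterates over the data and returns a value
        return sum(1 for _ in self._source())

    def first(self):
        # Action: returns the first element
        return next(self._source())


# Made-up sample lines, standing in for the contents of a text file
lines = ["# Apache Spark", "Spark is fast", "Hello world"]
text_file = MiniRDD(lambda: iter(lines))

lines_with_spark = text_file.filter(lambda line: "Spark" in line)
is_new_rdd = isinstance(lines_with_spark, MiniRDD)  # a new "RDD", not a value
spark_count = lines_with_spark.count()              # a value
```

Here `is_new_rdd` is `True` and `spark_count` is `2`, mirroring how a transformation yields another RDD while an action yields a result.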

In [5]:
# Performing an action (like a count) is what triggers reading textFile and computing the aggregate
#    Click on [View] to see the stages and executors
textFile.count()

## Python Count (Job)
![Python Count - Job](https://sparkhub.databricks.com/wp-content/uploads/2015/12/Python-Count-Job.png)

## Python Count (Stages)
* Notice how the file is read during the *.count()* action
* Many Spark operations are lazy and execute only when an action is called

![Python Count - Stages](https://sparkhub.databricks.com/wp-content/uploads/2015/12/Python-Count-Stages.png)

In [8]:
# Output the first line from the text file
textFile.first()

Now we're using a filter ***transformation*** to return a new RDD with a subset of the items in the file.

In [10]:
# Filter the lines of the RDD, keeping only those that contain "Spark"
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Notice that this completes quickly because it is only a transformation; no action has run yet.
* When you perform the actions below (e.g. count, take), you will see the jobs execute.
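A rough analogy for this deferred execution, using only plain Python generators rather than Spark, is sketched below. The names `contains_spark` and `calls` and the sample lines are made up for illustration. The counter records how often the predicate actually runs, which shows that defining the filter does no work and that pulling results does.

```python
from itertools import islice

calls = 0  # how many times the predicate has been evaluated


def contains_spark(line):
    """Predicate that also records each invocation."""
    global calls
    calls += 1
    return "Spark" in line


# Made-up sample lines, standing in for the contents of a text file
lines = ["Spark one", "other", "Spark two", "Spark three", "last"]

# Defining the filter is like a transformation: the predicate has not run yet
filtered = (line for line in lines if contains_spark(line))
calls_after_filter = calls

# Pulling results is like an action: the predicate runs only now
first_two = list(islice(filtered, 2))
calls_after_take = calls
```

After the generator is defined, `calls_after_filter` is still `0`; only when two results are requested does the predicate run. This is an analogy only: Spark schedules jobs over partitions on executors, and `take` scans partitions rather than pulling one element at a time.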

In [12]:
# Perform a count (action) 
linesWithSpark.count()

In [13]:
# Take (action) the first five lines that contain "Spark"
linesWithSpark.take(5)