# An Introduction to Apache Spark

## The Data Interface

There are several key interfaces that you should understand before you start using Spark.

-   **The Dataset**
    -   The Dataset is Apache Spark's newest distributed collection and can be considered a combination of DataFrames and RDDs. It provides the typed interface that is available in RDDs while offering many of the conveniences of DataFrames. It will be the core abstraction going forward.
-   **The DataFrame**
    -   The DataFrame is a collection of distributed `Row` types. DataFrames provide a flexible interface and are similar in concept to the DataFrames you may be familiar with in Python (pandas) as well as in the R language.
-   **The RDD (Resilient Distributed Dataset)**
    -   Apache Spark's first abstraction was the RDD, or Resilient Distributed Dataset. Essentially, it is an interface to a sequence of data objects, consisting of one or more types, that are located across a variety of machines in a cluster. RDDs can be created in a variety of ways and are the "lowest level" API available to the user. While this is the original data structure made available, new users should focus on Datasets, as those will be a superset of the current RDD functionality.

## Entry Point into a Spark Job

In [None]:
sqlContext

In [None]:
sc

In [None]:
# Also available as

spark

## Generating Data

In [None]:
firstDataFrame = sqlContext.range(1000000)
print(firstDataFrame)

## Why can't we see the values? Lazy Execution
## Types of Operations in Spark:
* ## Transformations - operations that just record the logic
* ## Actions - operations that execute logic

![transformations and actions](http://training.databricks.com/databricks_guide/gentle_introduction/trans_and_actions.png)
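
The split between recording logic and executing it can be illustrated without Spark at all. The following is a toy model in plain Python, not how Spark is implemented; the class `LazyNumbers` and its methods are hypothetical names chosen to mirror `selectExpr` and `take`.

```python
class LazyNumbers:
    """Toy stand-in for a DataFrame: records steps, runs them only on demand."""

    def __init__(self, n, steps=()):
        self.n = n
        self.steps = tuple(steps)  # recorded transformations, not yet executed

    def select(self, fn):
        # Transformation: return a new object with one more recorded step.
        return LazyNumbers(self.n, self.steps + (fn,))

    def take(self, k):
        # Action: only now is the recorded logic applied, and only to k values.
        out = []
        for value in range(min(k, self.n)):
            for fn in self.steps:
                value = fn(value)
            out.append(value)
        return out


first = LazyNumbers(1000000)
second = first.select(lambda x: x * 2)  # nothing is computed here
print(len(second.steps))                # one recorded step
print(second.take(5))                   # [0, 2, 4, 6, 8]
```

Even though the range has a million elements, `take(5)` touches only five of them, which is the practical benefit of deferring execution until an action is called.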

In [None]:
# An example of a transformation
# select the ID column values and multiply them by 2
secondDataFrame = firstDataFrame.selectExpr("(id * 2) as value")

In [None]:
# an example of an action
# take the first 5 values that we have in our firstDataFrame
print(firstDataFrame.take(5))
# take the first 5 values that we have in our secondDataFrame
print(secondDataFrame.take(5))

## Why are jobs divided into transformations and actions?
###  -> Spark can optimize the whole data processing pipeline at once instead of each step separately

![transformations and actions](http://training.databricks.com/databricks_guide/gentle_introduction/pipeline.png)

In [None]:
iris = sqlContext.read.format("com.databricks.spark.csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load("../files/iris.csv")

In [None]:
iris.head()

In [None]:
iris.head(10)

In [None]:
iris.toPandas().head()

## Transformations vs actions

In [None]:
df1 = iris.groupBy("species").avg("sepal_length") # a simple grouping
print(df1)

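No result! `groupBy` and `avg` are transformations, so printing `df1` shows only its schema; nothing is computed until an action is called.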

In [None]:
df1.count()

In [None]:
df1.head(3)

### Caching - Saving an intermediate transformation state

In [None]:
df1.cache()

Caching, like a transformation, is performed lazily: no caching happens until an action is called.

In [None]:
df1.count()

In [None]:
df1.count()

## What do we notice?

## Exercise: Read the `spam.csv` file and count the number of hams and spams
### Hint: Use the `groupBy` and `count` transformations

In [None]:
spam = sqlContext.read.format("com.databricks.spark.csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load("../files/spam.csv")

In [None]:
# enter code here