# Spark basics

In this chapter we will cover the following topics:

- What is an RDD?
- RDD basics
   - Operations
   - Actions
   - Lineage

## RDD (Resilient Distributed Dataset)

- Resilient - if data in memory is lost, it can be recreated
- Distributed - processed across the cluster
- Dataset - initial data can come from a file or be created programmatically


__RDD is the fundamental unit of data in Spark__

# Creating an RDD

 - We can create an RDD in three ways:
   1. Using existing data (for example: read from files).
   2. By transforming another RDD.
   3. Generating data in memory.

# Creating an RDD

The Spark context (`sc`) provides calls to create RDDs:

 - `sc.textFile("/some/hdfs-ish/path")`<br>
   This creates an RDD in which each item is a line from the text file(s).
   In this cluster environment, HDFS is the default for file locations. To read from the local filesystem, specify the path as follows: `sc.textFile("file:///some/fully/specified/path")`
 - `rdd.doTransformation(transformationFunction)`<br>
   This is pseudocode: calling a transformation such as `map` or `filter` on an existing RDD returns a new RDD.
 - `sc.parallelize(["An", "Example", "Collection"])`<br>
   This creates an RDD from the collection you supply. (This isn't 'big' data, but it can be useful for things like seeding machine-learning algorithms or testing functions while exploring data.)

# RDD from files

The `sc.textFile` method reads the data line by line, using the default line separator `\n`. Every line in the file becomes a record in the RDD.

But what about multiline data like JSON or XML?  

For that, the method `sc.wholeTextFiles("/some/directory")` is available. It creates an RDD of (file path, file content) pairs, one pair per file. With the paths shortened to file names, the structure looks like this:

```
(file1.json, {"name": "Gerard", "id": 123456, "age": 46})
(file2.json, {"name": "Michael", "id": 534623, "age": 26})
(file5.json, {"name": "Ronald", "id": 1344, "age": 16})
(file8.json, {"name": "Daisy", "id": 34534})
```

# Transforming an RDD

There are many ways of transforming an RDD to produce a new one. Some key transformations:

 - `map`
 - `filter`
 - `flatMap`
 - `sortBy`
 - `distinct`
 - `groupBy`
 - `intersection`
 
The API documentation describes all of these and many more. Transformations are the methods that _return an RDD_.

# RDDs are Lazy

 - When we create an RDD, its content is _not_ evaluated.
 - Instead, a _lineage_ is constructed: each RDD knows which parent RDD it depends on and what it needs to do with it, but it won't do anything until its content is actually required.
 
When is the content of the RDD actually required?

# RDD Actions

 - Actions trigger an RDD to evaluate its content. The evaluation follows the lineage recursively, back through the parent RDDs.
 - These are normally methods on the RDD that _don't_ return an RDD.
 - Some examples include:
   - `collect`
   - `first`
   - `take`
   - `top`
   - `count`
   - `isEmpty`

```python
import findspark
findspark.init()
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .getOrCreate()
)
sc = spark.sparkContext
```

# Putting it Together…

```python
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                          12, 13, 14, 15, 16, 17, 18, 19, 20])
odd_numbers = numbers.filter(lambda n: n % 2 == 1)
odd_numbers.isEmpty()
```

```
False
```

```python
odd_numbers.take(3)
```

```
[1, 3, 5]
```

```python
odd_numbers.count()
```

```
10
```

```python
odd_numbers.collect()
```

```
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

When is the content of `odd_numbers` evaluated?

# This lineage thing

- Nothing happens until an action is called => lazy execution

- The RDD lineage is a graph of all the parent RDDs of an RDD. It is built up as transformations are applied and forms a logical execution plan.

- The data is not cached by default: running the same action twice executes all the steps in the lineage again.

- The lineage can be inspected with the `toDebugString` method of the RDD:

        RDD.toDebugString()


```python
print(odd_numbers.toDebugString())
```

```
b'(1) PythonRDD[4] at collect at <ipython-input-5-efa00ed215ae>:1 []\n |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475 []'
```


# Summary

In this chapter we covered:

- RDD operations/actions/lineage