## Resilient Distributed Datasets (RDDs)

RDDs are the building blocks of Spark. They are useful for processing unstructured data, such as text or images.

### RDDs have three main properties:

1. **Resilient**: RDDs are fault-tolerant. Data lost to a node failure can be recomputed from the recorded chain of transformations (the lineage).
2. **Distributed**: RDDs are distributed across multiple nodes in a cluster.
3. **Operated in parallel**: RDDs can be operated on in parallel, with each partition processed independently.
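
The three properties above can be illustrated with a small plain-Python sketch. This is a conceptual analogy only, not Spark code: the names `source`, `lineage`, and `compute_partition` are hypothetical, and real Spark tracks lineage internally rather than through a list of functions.

```python
# Plain-Python sketch of lineage-based recovery (hypothetical names, not Spark's API).
source = [[1, 2], [3, 4, 5]]   # source data, split into two partitions
lineage = [
    lambda x: x + 1,           # recorded transformations, applied in order
    lambda x: x * 10,
]

def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    values = source[index]
    for fn in lineage:
        values = [fn(v) for v in values]
    return values

# Distributed and parallel: each partition is computed independently of the others.
partitions = [compute_partition(i) for i in range(len(source))]

# Resilient: simulate losing partition 1, then recompute only that partition.
partitions[1] = None
partitions[1] = compute_partition(1)

print(partitions)  # [[20, 30], [40, 50, 60]]
```

Only the lost partition is rebuilt; the surviving partition is left untouched.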

## Start Coding with PySpark

#### `SparkSession`

It is the entry point to Spark SQL. It is used to create DataFrames, register DataFrames as tables, execute SQL over tables, read Parquet files, and so on.

#### `sparkContext`

`sparkContext` within `SparkSession` is the connection to the Spark cluster and can be used to create and transform RDDs.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#### `sparkContext.parallelize()`

Creates an RDD from a local Python collection, such as a list. An optional second argument sets the number of partitions to split the data into. If it is omitted, Spark uses its default parallelism, which in local mode is typically the number of cores on the machine.

In [None]:
# default number of partitions; dataset_name is a placeholder for a local collection such as a list
rdd_par = spark.sparkContext.parallelize(dataset_name)

#### `sparkContext.textFile()`

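Creates an RDD from a text file, with each line of the file becoming one element. The optional second argument is the minimum number of partitions.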

In [None]:
# request at least 10 partitions
rdd_txt = spark.sparkContext.textFile("file_name.txt", 10)

#### `getNumPartitions()`

Returns the number of partitions in the RDD.

In [None]:
rdd_txt.getNumPartitions()
# output: 10 (the argument is a minimum, so a large file may produce more)

#### `.stop()`

Stops the SparkSession and its underlying SparkContext. Call it once all work is finished.

In [None]:
spark.stop()

## Transformations

#### `map()`

Applies a function to each element in the RDD and returns a new RDD of the results.

In [None]:
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+1)
# output RDD [2,3,4,5,6]

If the RDD contains tuples, the lambda can index into each tuple to transform a specific field.

In [None]:
# input RDD [(1,2,3),(4,5,6),(7,8,9)]
rdd.map(lambda x: (x[0]+1, x[1], x[2]))
# output RDD [(2,2,3),(5,5,6),(8,8,9)]

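It can also be used to create a new RDD by selecting a specific column, which can then be aggregated.

In [None]:
# select the third field of each tuple, then sum the values with reduce()
sum_gpa = rdd.map(lambda x: x[2]).reduce(lambda x,y: x+y)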

#### `filter()`

Applies a condition to each element in the RDD and returns a new RDD containing only the elements that satisfy it.

In [None]:
# input RDD [1,2,None,4,5]
rdd.filter(lambda x: x is not None)
# output RDD [1,2,4,5]

[Spark Transformation Documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)

#### `collect()`

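Returns all elements in the RDD to the driver as a list. Strictly speaking, `collect()` is an action rather than a transformation; it is listed here because it is the simplest way to display the result of a transformation.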

In [None]:
rdd.filter(lambda x: x is not None).collect()
# output [1,2,4,5]

## Actions

Spark executes transformations only when an **action** is called; in other words, it evaluates them **lazily**. In pandas, by contrast, the data is transformed immediately, or **eagerly**.

In [None]:
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+1).filter(lambda x: x>3)

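No computation happens in the cell above: Spark only records the transformations in the RDD's lineage. When an action is eventually called, Spark pipelines `map` and `filter` so that each element passes through both in a single pass, without materializing the intermediate RDD. Because the lambdas passed to RDD transformations are opaque to Spark, it does not reorder them; reordering optimizations, such as applying a filter before other work, are performed by the Catalyst optimizer on DataFrames.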
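
Lazy evaluation can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code: Python 3's built-in `map()` and `filter()` also return lazy iterators, and the hypothetical `calls` list records when the mapped function actually runs.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
calls = []  # records every time the mapped function actually runs

def add_one(x):
    calls.append(x)
    return x + 1

data = [1, 2, 3, 4, 5]

# Like RDD transformations, map() and filter() only describe the work:
# building the pipeline does not call add_one at all.
pipeline = filter(lambda x: x > 3, map(add_one, data))
calls_before = len(calls)

# Consuming the iterator plays the role of an action such as collect().
result = list(pipeline)
calls_after = len(calls)

print(calls_before)  # 0
print(result)        # [4, 5, 6]
print(calls_after)   # 5
```

The function runs only once the result is requested, which mirrors how Spark defers work until an action is called.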

#### `take()`

Returns the first `n` elements of the RDD. It is useful for debugging and usually preferable to `collect()`, which brings every element back to the driver and can exhaust its memory on a large dataset.

In [None]:
# input RDD [1,2,3,4,5]
rdd.take(3)
# output [1,2,3]

#### `reduce()`

Combines the elements of the RDD pairwise using a two-argument function and returns a single value.

The following code snippet uses the `reduce()` function to sum all the elements in the RDD.

In [None]:
# input RDD [1,2,3,4,5]
rdd.reduce(lambda x,y: x+y)
# output 15

#### `count()`

Returns the number of elements in the RDD.

In [None]:
# input RDD [1,2,3,4,5]
rdd.count()
# output 5

## Associative and Commutative Properties

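`reduce()` is a powerful aggregation function, but the function passed to it must be associative and commutative. Spark reduces each partition separately and then merges the partial results, so neither the grouping nor the order of the elements may affect the answer.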

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [1,2,3,4,5]
for i in range(1,5):
    rdd = spark.sparkContext.parallelize(data, i)
    print('partition: ', rdd.glom().collect())
    print('addition: ', rdd.reduce(lambda a,b: a+b))

```text
partition:  [[1, 2, 3, 4, 5]]
addition:  15
partition:  [[1, 2], [3, 4, 5]]
addition:  15
partition:  [[1], [2, 3], [4, 5]]
addition:  15
partition:  [[1], [2], [3], [4, 5]]
addition:  15
```

In [None]:
for i in range(1,5):
    rdd = spark.sparkContext.parallelize(data, i)
    print('partition: ', rdd.glom().collect())
    print('division: ', rdd.reduce(lambda a,b: a/b))

```text
partition:  [[1, 2, 3, 4, 5]]
division:  0.008333333333333333
partition:  [[1, 2], [3, 4, 5]]
division:  3.3333333333333335
partition:  [[1], [2, 3], [4, 5]]
division:  1.875
partition:  [[1], [2], [3], [4, 5]]
division:  0.20833333333333331
```

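Notice in the output that no matter how the list is partitioned, the sum is always 15, whereas the division produces a different result for each partitioning. Addition is associative and commutative; division is neither, so its result depends on how the elements are grouped.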
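
The differing results can be reproduced in plain Python. The following is a minimal sketch, assuming each partition is reduced on its own and the partial results are then merged in partition order; `partitioned_reduce` is a hypothetical helper, not part of Spark, and Spark itself does not guarantee the order in which partial results are merged.

```python
from functools import reduce

def partitioned_reduce(partitions, fn):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions]
    return reduce(fn, partials)

# Partition layouts copied from the output above.
layouts = [
    [[1, 2, 3, 4, 5]],
    [[1, 2], [3, 4, 5]],
    [[1], [2, 3], [4, 5]],
    [[1], [2], [3], [4, 5]],
]

sums = [partitioned_reduce(p, lambda a, b: a + b) for p in layouts]
quotients = [partitioned_reduce(p, lambda a, b: a / b) for p in layouts]

print(sums)       # the same total for every layout
print(quotients)  # a different value for every layout
```

With two partitions, for example, division computes `(1/2) / (3/4/5)` rather than `1/2/3/4/5`, which is why the grouping changes the answer.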