# Table of Contents

1. [PySpark Basics](#PySpark-Basics)
    1. [Data Abstractions in PySpark](#Data-Abstractions-in-PySpark)
    2. [RDDs](#RDDs)
        1. [Lazy Evaluation](#Lazy-Evaluation)
        2. [Caching](#Caching)
        3. [Transformations](#Transformations)
        4. [Paired RDDs](#Paired-RDDs)
        5. [Sequential Computation VS Spark Parallel Batch Processing](#Sequential-Computation-VS-Spark-Parallel-Batch-Processing)

# PySpark Basics

[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is the Python API for Spark. It lets developers write Spark applications, with access to Spark's functionality, in a 'Pythonic' way.

It supports most of Spark's features, such as Spark SQL and DataFrames, Streaming, MLlib (Machine Learning), and Spark Core.

![PySpark Components](https://spark.apache.org/docs/3.1.1/api/python/_images/pyspark-components.png)

To access these features, import the corresponding modules:


```
from pyspark.sql import ...
from pyspark.mllib import ...
from pyspark.streaming import ...
```


**Spark SQL** is the most widely used Spark module. It enables structured and semi-structured data processing using SQL queries and the DataFrame API.

Spark SQL introduces **DataFrames**, which are **distributed collections of data organized into named columns**, offering a higher-level abstraction for working with structured data.

With features like the Catalyst optimizer, Spark SQL optimizes query execution plans to enhance performance. It unifies batch and streaming data processing, integrates with external systems, and includes Structured Streaming for continuous, real-time data processing with fault tolerance and scalability.

Overall, Spark SQL simplifies data processing tasks and enables seamless integration of SQL queries and DataFrame operations within Spark applications.

## Data Abstractions in PySpark

**RDDs (Resilient Distributed Datasets)** are fundamental data structures in Spark, representing **distributed collections of objects** across a set of nodes of a cluster. They are low-level abstractions that provide fault tolerance and parallel operations on data.

RDDs support two types of operations: transformations (creating new RDDs from existing ones) and actions (triggering computation and returning results).

However, they have some limitations:

- **Low-Level API**: RDDs leave tuning largely to the developer and offer few opportunities for automatic, high-level optimization. They only allow two types of operations: transformations and actions. These are powerful, but they are also verbose and complex to use.
- **Lack of Structure**: RDDs don't have a schema or a pre-defined structure. This means they cannot take advantage of some optimizations and features available for structured data, such as predicate pushdown and columnar storage in Spark SQL.
- **Limited interoperability**: RDDs cannot be queried directly through the higher-level APIs, such as Spark SQL and DataFrames. They have to be converted first (for example with `.toDF()`), which makes integration with other Spark components less convenient.

To address these limitations, Spark introduced **DataFrames**, which are high-level abstractions built on top of RDDs. DataFrames provide a more user-friendly API, better performance, and seamless integration with other Spark modules.

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database. They provide a higher-level API for working with structured data, enabling users to perform complex data processing tasks with ease. They also leverage the Catalyst optimizer to optimize query execution plans and improve performance.

Here are the main differences between RDDs and DataFrames:

| Feature | RDD | Spark DataFrame |
|---------|-----|----------------|
| Abstraction | Low-level | High-level |
| Data type | Any type | Structured or semi-structured |
| Schema | No schema | Has schema |
| Manipulation methods | Functional programming methods | Relational or SQL-like methods |
| Immutability | Immutable | Immutable |
| Laziness | Lazy | Lazy |
| Flexibility | More flexible and expressive | Less flexible and expressive |
| Efficiency | Less efficient and optimized | More efficient and optimized |
| Use case | Complex or custom data transformations, unstructured or non-tabular data | Simple or standard data transformations, structured or semi-structured data |
| Conversion to the other abstraction | `.toDF()` | `.rdd` |

It's also useful to think of a DataFrame as a collection of `columns` rather than a collection of `records` (rows), as was the case with RDDs.

The DataFrame API focuses on manipulating the columns to transform the data. Because of this, we can simplify how we reason about data transformations by thinking about what operations to perform and which columns will be impacted by them.

![RDDs VS DataFrames](https://drek4537l1klr.cloudfront.net/rioux/v-7/Figures/ch02-rdd-vs-dataframe.png)

For more information on RDDs and DataFrames, you can refer to this [databricks article](https://www.databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html).

## RDDs

This lab covers RDDs in depth.

First, we need to recall what the `SparkContext` is. The **SparkContext** is the entry point to any Spark functionality. When we run a Spark application, a driver program starts and your **SparkContext** is initialized. This is also the object that allows you to interact with RDDs.

RDDs are a fundamental abstraction in Spark Core, which is the foundational component of the Apache Spark framework. They represent the distributed data collections that Spark operates on, forming the backbone of Spark's distributed processing capabilities. **They are immutable distributed collections of objects, and each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.**

Formally, an RDD is a **read-only, partitioned** collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. They are fault-tolerant because they track lineage information to rebuild lost data in case of failure.
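To make the lineage idea more concrete, here is a toy model in plain Python (it is not Spark code and does not use the Spark API). The names `source_partitions`, `lineage`, and `compute_partition` are invented for this sketch. Each partition can be rebuilt at any time by replaying the recorded transformations on its source data, which is roughly how a lost partition can be recovered without replicating the computed results.

```python
# Toy model of lineage-based recovery (plain Python, not the Spark API).
source_partitions = [[0, 1, 2], [3, 4, 5]]      # data on "stable storage"
lineage = [lambda x: x + 1, lambda x: x * 10]   # recorded transformations, in order


def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    records = source_partitions[index]
    for func in lineage:
        records = [func(x) for x in records]
    return records


# Compute every partition once.
computed = [compute_partition(i) for i in range(len(source_partitions))]

# Simulate a failure that loses partition 1, then rebuild only that partition.
computed[1] = None
computed[1] = compute_partition(1)
```

Only the lost partition is recomputed; partition 0 is left untouched.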

There are two ways to create RDDs − parallelizing an existing collection in your driver program using the **SparkContext**, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. We'll go through the first method in this lab.

RDDs support two types of operations:

**Transformations** − operations applied to an RDD to create a new RDD. `filter`, `groupBy`, and `map` are examples of transformations.

**Actions** − operations applied to an RDD that instruct Spark to perform the computation and send the result back to the driver. `collect` is an example of an action.

A very important concept in Spark is **lazy evaluation**: Spark computes transformations only when an action requires a result to be returned to the driver program. Deferring the work lets Spark plan the execution over the whole chain of transformations instead of running them one at a time. This is possible because of the RDD's lineage graph, which tracks the sequence of transformations applied to the RDD.

Here is a visualization of a Spark DAG (Directed Acyclic Graph) that represents the sequence of transformations applied to an RDD:

![Example of a Spark DAG](https://spark.apache.org/docs/3.3.0/img/JobPageDetail2.png)

### Lazy Evaluation

Lazy evaluation is an evaluation strategy that delays the evaluation of an expression until its value is actually needed. In Spark, transformations are not executed immediately. Instead, they are stored as a graph of operations on the RDD. When an action is called, the graph is executed.
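The same strategy can be observed in plain Python, without Spark: the built-in `map` returns a lazy iterator, and the function only runs when the iterator is consumed. The sketch below is an analogy, not Spark code; the `calls` list is just a hypothetical counter used to show when the function actually runs.

```python
calls = []  # records every real invocation of square


def square(x):
    calls.append(x)
    return x ** 2


# Built-in map is lazy: this line only describes the work to be done.
pending = map(square, range(5))
calls_before = len(calls)

# Consuming the iterator plays the role of a Spark action.
result = list(pending)
calls_after = len(calls)
```

Here `calls_before` is 0 because nothing has run yet, and `calls_after` is 5 once the result has been requested.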

Let's see an example of lazy evaluation in action.

First we need to create an RDD. For that, we will use the **SparkContext** to parallelize a list of numbers.

In Databricks clusters, the **SparkContext** is already created for you when the cluster is started and is available in notebooks in the variable `sc`. If you were running this code on your local machine, you would need to create a **SparkContext** first.
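For reference, a minimal sketch of how a local **SparkContext** could be created outside Databricks. It assumes PySpark and a Java runtime are installed; the function name `create_local_context` and the application name `"rdd-lab"` are placeholders chosen for this example.

```python
def create_local_context(app_name="rdd-lab"):
    """Return a SparkContext that runs Spark locally, using all available cores."""
    # Imported inside the function so that defining it does not require Spark.
    from pyspark import SparkConf, SparkContext

    # "local[*]" runs Spark in-process with one worker thread per core.
    conf = SparkConf().setMaster("local[*]").setAppName(app_name)

    # getOrCreate reuses an already running context instead of creating a second one.
    return SparkContext.getOrCreate(conf)
```

Calling `sc = create_local_context()` once at the top of a local notebook would then provide the `sc` variable used throughout this lab.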

In [None]:
# Create the list of numbers
data = range(0, 11)

# Spread the data out into the cluster, transforming it into an RDD
rdd = sc.parallelize(data, 1)

With the RDD created, we will apply a transformation to it. We will use the `map` transformation to square each number in the RDD.

Since RDDs are immutable, a transformation never modifies the original RDD. Instead, it returns a new RDD with the transformation applied.

In [None]:
def square(x):
    return x**2

squaredRDD = rdd.map(square)

The map transformation applies a function to each element in the RDD and returns a new RDD with the transformed elements.

Notice that the transformation is not executed immediately. 

Finally, we will call an action on the RDD. We will use the `collect` action to return all elements in the RDD to the driver program.

When the action is called, Spark will execute the transformations in the lineage graph and return the result to the driver program.

In [None]:
squaredRDD.collect()

Let's look at another example to see the lazy evaluation in action.

Let's define a function that takes a lot of time to execute.

In [None]:
import time
def slow_squared(x):
    time.sleep(2)
    return x**2

Now we will reuse the RDD created above and apply a map transformation to it. The map transformation will apply the slow function to each element in the RDD. This cell returns quickly because the transformation is not executed immediately.

In [None]:
slowRDD = rdd.map(slow_squared)

Finally, we will call the `collect` action on the RDD. This will trigger the execution of the slow function on each element in the RDD, causing the program to take a long time to complete.

In [None]:
slowRDD.collect()

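Did you notice the difference? Defining `slowRDD` returned immediately, while `collect` had to wait for `slow_squared` to run on every element: 11 elements at 2 seconds each, in a single partition, is roughly 22 seconds.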

### Caching

Sometimes we may want to cache an RDD in memory to avoid recomputing it every time an action is called. This can be useful when an RDD is used multiple times in the program or when the RDD is expensive to compute.

To cache an RDD, we can use the `cache` method. Calling it does not compute anything: it only marks the RDD to be kept in memory. The RDD is computed the first time an action is called on it, and the result is then stored in memory for later actions.

Note that caching does not guarantee that the RDD stays in memory forever. If there is not enough memory available, Spark can evict cached partitions and recompute them from the lineage when they are needed again.
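The behavior described above can be modeled in a few lines of plain Python. This is a toy illustration, not the Spark API: the class `ToyCachedDataset` and its `evict` method are hypothetical names used only to show that marking for caching is separate from computing, and that eviction leads to recomputation.

```python
class ToyCachedDataset:
    """Toy model of cache semantics (plain Python, not the Spark API)."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.cache_requested = False
        self.cached = None
        self.compute_count = 0  # how many times the data was really computed

    def cache(self):
        # Only marks the dataset for caching; nothing is computed here.
        self.cache_requested = True
        return self

    def collect(self):
        # Serve from the cache when a stored result is available.
        if self.cached is not None:
            return self.cached
        self.compute_count += 1
        result = self.compute_fn()
        if self.cache_requested:
            self.cached = result
        return result

    def evict(self):
        # Simulates the cached data being dropped under memory pressure.
        self.cached = None


dataset = ToyCachedDataset(lambda: [x ** 2 for x in range(5)]).cache()
count_after_cache = dataset.compute_count

dataset.collect()
dataset.collect()
count_after_two_collects = dataset.compute_count

dataset.evict()
dataset.collect()
count_after_eviction = dataset.compute_count
```

The counter stays at 0 after `cache()`, reaches 1 after two `collect()` calls, and goes to 2 once the cached result has been evicted and is needed again.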

In [None]:
slowRDD.cache()

Now run the cell below three times. The first run will take a long time, but the next two should be much faster because the RDD is cached in memory.

In [None]:
slowRDD.collect()

### Transformations

Transformations are operations that are applied to an RDD to create a new RDD. They are lazy operations, meaning that they are not executed immediately. Instead, they are stored as a graph of operations on the RDD.

We have already seen the `map` transformation, which applies a function to each element in the RDD and returns a new RDD with the transformed elements.

Let's use it one more time in the example below.

In [None]:
sentences = ["Hello world", "This is Spark with Python", "Learning Spark is fun"]

# Create an RDD from the list of sentences
sentenceRDD = sc.parallelize(sentences, 1)

# Use the map transformation to split each sentence into words
wordsRDD = sentenceRDD.map(lambda sentence: sentence.split(" "))

# Use the collect action to view the list of words
wordsRDD.collect()

The `map` transformation receives a function as a parameter and returns a new RDD with the function applied to each element in the original RDD. The new RDD will contain the exact same number of elements as the original RDD, with the function applied to each element.

Let's look at a common mistake when using the `map` transformation.

In [None]:
def broken_function(x):
    doubledDataList = []
    # Iterate through the input list and double each value
    for i in x:
        doubledDataList.append(i * 2)
    return doubledDataList

# Create an RDD from the list of numbers
data = range(0, 11)

rdd = sc.parallelize(data, 1)

# Use the map transformation to double each value in the list
doubledRDD = rdd.map(broken_function)

The cell ran without errors, so all looks good, right? Not quite. Remember that transformations are lazy operations, so the function has not been executed yet. Any error will only be raised when an action is called on the RDD.

In [None]:
doubledRDD.collect()

Now we see the error. This is something to be careful about when working with Spark: code that only defines transformations can run without errors and still be wrong, because the failure surfaces only once an action forces the computation.

Any idea what is wrong with the code?

The problem is that `map` applies the function to each element of the RDD, one at a time. However, the broken function assumes its input is a list and iterates over it. Since the elements are integers, not lists, Python raises a `TypeError` when the action is called on the RDD.

Run the cell below to understand what is happening when `collect` is called on the RDD.

In [None]:
broken_function(0)
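
One possible fix, sketched here in plain Python so it can be checked without a cluster, is to write the function for a single element and let `map` take care of the iteration. The name `double` is chosen for this example.

```python
def double(x):
    # Receives ONE element (an integer here), not the whole collection.
    return x * 2


# Plain-Python check of the per-element behavior. With the RDD from the
# cells above, the equivalent call would be rdd.map(double).collect().
doubled = list(map(double, range(0, 11)))
```

The result contains one doubled value per input element, from 0 up to 20.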

___

Another common transformation is the `flatMap` transformation. The `flatMap` transformation is similar to the `map` transformation, but it flattens the result. This means that the function applied to each element can return multiple elements, and the resulting RDD will contain all the elements returned by the function.

Let's see an example of the `flatMap` transformation in action.

In [None]:
# Use the flatMap transformation to split each sentence into words
flatWordsRDD = sentenceRDD.flatMap(lambda sentence: sentence.split(" "))

# Use the collect action to view the list of words
flatWordsRDD.collect()

See the difference? The `flatMap` transformation returns a new RDD with the elements flattened, while the `map` transformation returns a new RDD with the elements nested in lists. This means that the `flatMap` transformation can return more (or fewer) elements than the original RDD, while the `map` transformation always returns the same number of elements.

It is worth mentioning that the `flatMap` transformation only removes one level of nesting from the elements. Let's see what this means with an example.
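For comparison, the two behaviors can be reproduced in plain Python on the same sentences. This is an analogy using the standard library, not Spark code.

```python
from itertools import chain

sentences = ["Hello world", "This is Spark with Python", "Learning Spark is fun"]

# Like map: one output element (a list of words) per input sentence.
mapped = [sentence.split(" ") for sentence in sentences]

# Like flatMap: the per-sentence lists are concatenated into one flat sequence.
flat_mapped = list(chain.from_iterable(sentence.split(" ") for sentence in sentences))
```

`mapped` keeps 3 elements (one list per sentence), while `flat_mapped` holds the 11 individual words.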

In [None]:
# Create the data
dataNested = [[0, [1, 2, 3, 4, 5],7], [8], [9, [10, [11, 12]]]]

# Load the data into an RDD
dataNestedRDD = sc.parallelize(dataNested, 1)

# Apply flatMap with the identity function and return the values
dataNestedRDD.flatMap(lambda x: x).collect()

Take a look at the number 8, which was inside a list nested in the main data list. After the `flatMap`, 8 sits directly in the resulting list. However, `[1, 2, 3, 4, 5]` is still a list inside the result, because it was nested two levels deep and only one level was removed.
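The sketch below reproduces the one-level flattening in plain Python and contrasts it with a recursive helper that removes every level. `flatten_fully` is a hypothetical helper written for this example; it is not part of Spark.

```python
from itertools import chain

data_nested = [[0, [1, 2, 3, 4, 5], 7], [8], [9, [10, [11, 12]]]]

# One level of nesting removed, as flatMap(lambda x: x) does.
one_level = list(chain.from_iterable(data_nested))


def flatten_fully(item):
    """Hypothetical helper: recursively yield every value that is not a list."""
    if isinstance(item, list):
        for child in item:
            yield from flatten_fully(child)
    else:
        yield item


fully_flat = list(flatten_fully(data_nested))
```

Since `flatMap` expects a function that returns an iterable for each element, a helper like this could presumably be passed to it directly, as in `dataNestedRDD.flatMap(flatten_fully)`, to flatten all levels in one pass.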

### Paired RDDs

Paired RDDs are RDDs where each element is a key-value pair. They are useful for operations that require data to be grouped by a key, such as aggregations and joins.

In PySpark, we use tuples to represent key-value pairs. The first element of the tuple is the key, and the second element is the value.

Spark operations work on RDDs containing any type of objects. However, there are some extra operations available for Paired RDDs:
- groupByKey: groups the values of the RDD by key.
- reduceByKey: aggregates the values of the RDD by key using a function.
- mapValues: applies a function to the values of the RDD, without changing the keys.

Let's see an example of a Paired RDD.

In [None]:
# Create a list of tuples
orders = [
    ("item1", 10),
    ("item2", 5),
    ("item1", 10),
    ("item1", 20),
    ("item2", 15),
    ("item3", 22),
    ("item2", 3),
    ("item3", 15),
    ("item1", 25),
    ("item3", 5),
    ("item1", 2),
    ("item2", 5),
    ("item3", 10),
    ("item1", 10),
    ("item3", 5),
    ("item1", 15),
    ("item2", 20),
    ("item3", 22),
    ("item2", 3),
    ("item3", 15),
    ("item1", 25),
    ("item3", 5),
    ("item1", 2),
    ("item2", 5),
    ("item3", 10)
]

# Load the data into an RDD
ordersPairedRDD = sc.parallelize(orders, 1)

Apply the `groupByKey` transformation to the Paired RDD. The `groupByKey` transformation groups the values of the RDD by key, returning a new RDD with the keys and an iterable of values.

In [None]:
groupRDD = ordersPairedRDD.groupByKey()

groupRDD.collect()

On the other hand, the `reduceByKey` transformation aggregates the values of the RDD by key using a function. The function is applied to the values of each key, and the result is a new RDD with the keys and the aggregated values.

This function needs to be associative and commutative, so that the result does not depend on how the values are grouped or ordered. This matters because Spark applies the function in parallel, first within each partition and then across the partial results, and the order in which elements are combined is not guaranteed.

The function needs to accept two arguments, which are the values of two observations with the same key. The function needs to return a single value, which will be the result of the aggregation.

In [None]:
sumRDD = ordersPairedRDD.reduceByKey(lambda x, y: x + y)

sumRDD.collect()

If we use the `map` transformation on a Paired RDD, it will apply the function to each key-value pair in the RDD. However, we are often interested in applying the function only to the values, leaving the keys unchanged. This is where the `mapValues` transformation comes in.

In [None]:
mapRDD = ordersPairedRDD.mapValues(lambda x: (x+1, 1))

mapRDD.collect()

### Sequential Computation VS Spark Parallel Batch Processing

In the example below, we will compare the time it takes to compute the square of all numbers in a list sequentially and using Spark parallel batch processing.

In [None]:
import time

In [None]:
def run_sequentially(numbers_list):
    start_time = time.time()

    results = []
    for number in numbers_list:
        results.append(number**2)
    
    end_time = time.time()

    print(f'Sequential execution time: {end_time - start_time} seconds')
    return results

In [None]:
def run_parallel(numbers_list):
    start_time = time.time()

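    # Let Spark choose the number of partitions: forcing a single
    # partition would run the whole job as one task, with no parallelism.
    rdd = sc.parallelize(numbers_list)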
    results = rdd.map(lambda x: x**2).collect()

    end_time = time.time()

    print(f'Parallel execution time: {end_time - start_time} seconds')
    return results

Let's start by comparing the two methods for a small list of numbers.

In [None]:
numbers_list = range(10)

run_sequentially(numbers_list)
run_parallel(numbers_list)

The sequential computation took less time than the Spark parallel batch processing for a small list of numbers. This is expected since the list is small and the computation is fast. In this case, the overhead introduced by distributing the data and processing it in parallel is greater than the time saved by parallel processing.

Now let's compare the two methods for a large list of numbers.

In [None]:
numbers_list = range(0, 1000000000)

run_sequentially(numbers_list)
run_parallel(numbers_list)

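This time the Spark job should take less time than the sequential loop, provided the data is split across several partitions and the cluster has several cores to process them. This shows how Spark's distributed computing capabilities can reduce the time it takes to process large datasets. In general, the larger the dataset and the more work there is per record, the more the parallelism pays off. Keep in mind that both functions return every result to the driver, so lower the upper bound of the range if the driver runs out of memory.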

----

Congratulations! You have learned the basics of PySpark and RDDs. Now you can practice your skills with the exercises in the `exercises.ipynb` notebook.