# Spark Basics

This notebook covers the basics you need to run a Spark program and illustrates how a Spark program works:
- **SparkContext:** How does Spark *access compute clusters*?
- **Resilient Distributed Datasets (RDDs):** How does Spark *distribute data*?
- **RDD Operations:** How does Spark *perform compute*?
- **RDD Parallelization:** How does Spark *distribute compute*?

Finally, we illustrate how you can speed up your program by distributing your computation across the nodes of your cluster, e.g. across CPU cores on a local cluster.

## SparkContext

The first thing a Spark program must do is to create a `SparkContext` object, which tells Spark how to access a cluster. In this case we are using a `local` cluster by specifying
```python
master="local[4]"
```
`local[4]` runs Spark on the local machine with 4 worker threads, which act as the 4 compute nodes of our local cluster.

In [1]:
from pyspark.context import SparkContext

sc = SparkContext(appName="SparkBasics", master="local[4]")

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/10/28 14:57:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/10/28 14:57:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/10/28 14:57:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
21/10/28 14:57:16 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
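
The master string encodes how many local worker threads Spark uses: `local` runs a single thread, `local[K]` runs `K` threads, and `local[*]` uses as many threads as the machine has logical cores. The following is a minimal sketch of building such a string; `local_master` is a hypothetical helper for illustration and not part of PySpark.

```python
import os


def local_master(num_threads=None):
    """Build a Spark master string for local mode (hypothetical helper)."""
    if num_threads is None:
        # "local[*]" lets Spark use one worker thread per logical core
        return "local[*]"
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return f"local[{num_threads}]"


print(local_master(4))
print(local_master())
print(f"Logical cores on this machine: {os.cpu_count()}")
```

The returned string could be passed as the `master` argument of `SparkContext`, as done with `"local[4]"` above.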


## Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD).
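- **resilient**: fault-tolerant; an RDD is an immutable collection of your data, so a lost partition can be recomputed from the operations that created it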
- **distributed**: data can be partitioned across nodes in your cluster

An RDD can be created by parallelizing existing collections:

In [2]:
existing_collection = list(range(16))
simple_rdd = sc.parallelize(existing_collection)
simple_rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

We can retrieve the data of an RDD as a Python list by calling `.collect()`:

In [3]:
simple_rdd.collect()

                                                                                

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

We can also collect the data grouped by partition (one partition per compute node here) by calling `.glom()` and then `.collect()`:

In [4]:
simple_rdd.glom().collect()


[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

We can see that the **data** on `simple_rdd` is **distributed evenly** across 4 partitions, one for each of our 4 compute nodes.
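
To make the even split concrete, here is a minimal pure-Python sketch that cuts a list into contiguous, near-equal slices. It reproduces the `glom()` output above for 16 items and 4 partitions, but it is only an illustration: `split_into_partitions` is a hypothetical helper, and Spark's internal slicing logic may differ in detail.

```python
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, near-equal slices."""
    n = len(items)
    return [
        # Integer arithmetic keeps the slice boundaries evenly spaced
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(list(range(16)), 4)
print(partitions)
```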

## RDD Operations
- **Transformations** are operations that create a new RDD from an existing one, e.g. `map`, `filter`, `groupBy`.
  Transformations are lazy: they only define a plan of execution, which runs when an action requires the results of the transformation.
- **Actions** are operations that compute a result from the data of an RDD and return it to the calling program, e.g. `collect`, `reduce`, `count`, `min`, `max` and `variance`.
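
The difference between lazy transformations and actions can be sketched in plain Python. The `LazyCollection` class below is a hypothetical toy model, not Spark's implementation: `map` and `filter` only record a step, and nothing is executed until `collect` is called.

```python
class LazyCollection:
    """Toy model of lazy transformations (not Spark's implementation)."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step, return a new collection
        return LazyCollection(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, return a new collection
        return LazyCollection(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: run all recorded steps and return actual data
        result = self._data
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []  # records every execution of square()


def square(x):
    calls.append(x)
    return x * x


lazy = LazyCollection(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # square() has not run yet
result = lazy.collect()           # now the recorded steps are executed
print(calls_before_action, result, len(calls))
```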

Let's dive into the `map` transformation. `map` requires a hook function which defines the logic of the mapping. 

Here, our logic is to take an item, wait one second, and return the item.

In [5]:
from time import sleep, time

# hook function for the map transformation
def wait_one_sec(x):
    sleep(1.0)
    return x

In [6]:
# Define a map() transformation
t0 = time()
transformed_rdd = simple_rdd.map(wait_one_sec)
print(f"Took {(time()-t0):.4f} sec to define the transformation.")


# Define a collect() action
t0 = time()
result_collect = transformed_rdd.collect()
print(f"Took {(time()-t0):.4f} sec to collect the data of transformed_rdd.")


# Define a sum() action
result_sum = simple_rdd.sum()


# Results
print(f"\ntransformed_rdd type: {type(transformed_rdd)}")
print(f"result_collect: {result_collect} of type {type(result_collect)}")
print(f"result_sum: {result_sum} of type {type(result_sum)}")

Took 0.0019 sec to define the transformation.

Took 4.0709 sec to collect the data of transformed_rdd.

transformed_rdd type: <class 'pyspark.rdd.PipelinedRDD'>
result_collect: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] of type <class 'list'>
result_sum: 120 of type <class 'int'>


                                                                                

We see three things:
- `wait_one_sec` is scheduled by the `map` transformation but executed by the `collect` action.
- Transformations return RDD types.
- Actions return actual data, e.g. `list`, `int`, `float` types.

Additionally, the work scheduled by the `map` transformation is distributed over our 4 compute nodes. `wait_one_sec` is executed once for each of the 16 items in `simple_rdd`, i.e. **16 seconds** of waiting in total. Because each of the 4 compute nodes processes its 4 items while the others process theirs, the program finished after roughly **4 seconds**.
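
The same timing behaviour can be reproduced without Spark. The sketch below uses Python's `ThreadPoolExecutor` as a stand-in for the 4 compute nodes and a shortened delay (`DELAY`, an assumed value of 0.1 seconds instead of 1 second); it is an analogy, not how Spark schedules tasks internally.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep

DELAY = 0.1  # shortened stand-in for the 1-second wait in wait_one_sec


def wait_and_return(x):
    sleep(DELAY)
    return x


def process_partition(partition):
    # Items within one partition are processed one after another
    return [wait_and_return(x) for x in partition]


partitions = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

t0 = perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # The 4 partitions are processed at the same time
    processed = list(pool.map(process_partition, partitions))
elapsed = perf_counter() - t0

total_wait = DELAY * sum(len(p) for p in partitions)  # 16 waits in total
print(f"Total waiting: {total_wait:.1f} sec, wall-clock time: {elapsed:.2f} sec")
```

The wall-clock time should be close to 4 waits rather than 16, mirroring the roughly 4 seconds measured above.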

Let's take a look at **parallelization** with another example that performs actual computation:

## RDD Parallelization

Let's consider the task of performing some computation on an array of 100,000 random numbers:

In [7]:
import numpy as np
random_numbers = np.random.rand(100000)
random_numbers[:5]

array([0.01140603, 0.16433057, 0.25681613, 0.6764169 , 0.67450782])

In [8]:
# Create a single-partition RDD (numSlices=1) and an RDD with the default number of partitions (4 for local[4])
rdd_not_distribed = sc.parallelize(random_numbers, numSlices=1)
rdd = sc.parallelize(random_numbers)

In [9]:
from math import cos

def cosine_compute(x):
    """
    This function defines the compute we perform for each item.
    """
    # Throwaway work that simulates an expensive computation; the results are discarded
    [cos(j*x) for j in range(100)]
    return cos(x)

In [10]:
# Distributed rdd
t0 = time()
rdd.map(cosine_compute).collect()
print(f'Took {(time()-t0):.4f} sec to execute "cosine_compute" on "rdd".')


# Not distributed rdd
t0 = time()
rdd_not_distribed.map(cosine_compute).collect()
print(f'Took {(time()-t0):.4f} sec to execute "cosine_compute" on "rdd_not_distribed".')


# Python's native list comprehension
t0 = time()
[cosine_compute(x) for x in random_numbers]
print(f'Took {(time()-t0):.4f} sec to execute "cosine_compute" with python\'s list comprehension.')

21/10/28 14:57:25 WARN TaskSetManager: Stage 5 contains a task of very large size (1870 KiB). The maximum recommended task size is 1000 KiB.


Took 1.4424 sec to execute "cosine_compute" on "rdd".


                                                                                

Took 3.2258 sec to execute "cosine_compute" on "rdd_not_distribed".
Took 3.0310 sec to execute "cosine_compute" with python's list comprehension.


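By distributing the compute across our 4 local compute nodes, we **sped up the computation by roughly 2.2 times** (1.44 sec compared to 3.23 sec on a single partition). The speedup stays below the ideal factor of 4, presumably because of overhead such as serializing the data and scheduling the tasks.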
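
The speedup can be computed directly from the timings printed above. The short calculation below also derives the parallel efficiency, i.e. the speedup divided by the number of compute nodes; the variable names are chosen for illustration only.

```python
# Timings in seconds, copied from the output above
t_distributed = 1.4424       # rdd (4 partitions)
t_single_partition = 3.2258  # rdd_not_distribed (1 partition)
t_plain_python = 3.0310      # native list comprehension
num_nodes = 4

speedup_vs_single_partition = t_single_partition / t_distributed
speedup_vs_plain_python = t_plain_python / t_distributed
efficiency = speedup_vs_single_partition / num_nodes

print(f"Speedup vs. single partition: {speedup_vs_single_partition:.2f}x")
print(f"Speedup vs. plain Python:     {speedup_vs_plain_python:.2f}x")
print(f"Parallel efficiency:          {efficiency:.0%}")
```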