# Spark Parallelism

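Change the current working directory (CWD) from `Apache-Spark-with-Python/notebooks` to `Apache-Spark-with-Python`. This assumes Jupyter was launched from the project root, so that the `PWD` environment variable points there.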

In [1]:
import os
os.chdir(os.environ.get("PWD"))

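Then import from `src` as needed. This notebook only needs a `SparkContext`, created here in local mode with 4 worker threads: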

In [2]:
from pyspark.context import SparkContext
# local[N] runs Spark on this machine with N worker threads, not N separate nodes.
n_nodes = 4
sc = SparkContext(appName="SparkBasics", master=f"local[{n_nodes}]")

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/10/27 17:34:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/10/27 17:34:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/10/27 17:34:28 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


## Resilient Distributed Datasets (RDDs)

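Spark revolves around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
- `resilient`: the collection is immutable, so a lost partition can be recomputed from the source data and the recorded transformations
- `distributed`: the data can be partitioned across the nodes in your cluster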
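The two properties can be illustrated with a toy model in plain Python. This is a greatly simplified sketch, not how Spark is implemented: `ToyRDD` and its methods are hypothetical names, and everything runs in a single process.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable source partitions plus a lineage of functions."""

    def __init__(self, partitions, lineage=()):
        # Store the source data as tuples so it cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)
        self._lineage = tuple(lineage)

    def map(self, fn):
        # A transformation never changes existing data; it returns a new
        # ToyRDD that remembers one more function in its lineage.
        return ToyRDD(self._partitions, self._lineage + (fn,))

    def compute_partition(self, index):
        # Any partition can be (re)computed from its source data and the lineage.
        values = self._partitions[index]
        for fn in self._lineage:
            values = tuple(fn(v) for v in values)
        return list(values)

    def collect(self):
        # Gather all partitions, in order, into a single list.
        result = []
        for i in range(len(self._partitions)):
            result.extend(self.compute_partition(i))
        return result


base = ToyRDD([[0, 1], [2, 3]])
squared = base.map(lambda v: v * v)

print(squared.collect())             # [0, 1, 4, 9]
print(base.collect())                # [0, 1, 2, 3]  (the source is unchanged)
print(squared.compute_partition(1))  # [4, 9]  (one partition rebuilt on its own)
```

In this toy, "resilient" corresponds to `compute_partition` being able to rebuild a single partition without touching the others, and "distributed" corresponds to the data being stored as separate partitions.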

1. An RDD can be created by parallelizing existing collections:

In [3]:
rdd = sc.parallelize(range(16))
rdd

PythonRDD[1] at RDD at PythonRDD.scala:53

In [4]:
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

In [5]:
rdd.glom().collect()

[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
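The `glom()` output above shows each of the 4 partitions holding a contiguous block of the input. A minimal pure-Python sketch of that kind of even, contiguous split follows. `split_contiguous` is a hypothetical helper written for illustration and is not Spark's own slicing code.

```python
def split_contiguous(items, num_slices):
    """Split items into num_slices contiguous chunks of near-equal size."""
    items = list(items)
    n = len(items)
    # Chunk i covers the index range [i*n // num_slices, (i+1)*n // num_slices).
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


partitions = split_contiguous(range(16), 4)
print(partitions)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```

With 16 elements and 4 slices every chunk has the same length; when the length is not divisible by the number of slices, this sketch makes some chunks one element longer than others.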

In [6]:
import numpy as np

In [7]:
n_samples = 100_000
random_numbers = np.random.rand(n_samples)

In [8]:
rdd = sc.parallelize(random_numbers, numSlices=1)
rdd_dist = sc.parallelize(random_numbers)

In [9]:
print(len(rdd.glom().collect()))
print(len(rdd_dist.glom().collect()))

21/10/27 17:34:32 WARN TaskSetManager: Stage 2 contains a task of very large size (1870 KiB). The maximum recommended task size is 1000 KiB.


1
4


In [16]:
%%time
rdd.collect()[:5]

CPU times: user 44.2 ms, sys: 7.47 ms, total: 51.6 ms
Wall time: 129 ms


21/10/27 17:35:45 WARN TaskSetManager: Stage 9 contains a task of very large size (1870 KiB). The maximum recommended task size is 1000 KiB.


[0.7840254904184738,
 0.5476920052273782,
 0.8545235372396548,
 0.7555378303106948,
 0.057690424283767805]

In [17]:
%%time
rdd_dist.collect()[:5]

CPU times: user 49.6 ms, sys: 7.4 ms, total: 57 ms
Wall time: 119 ms


[0.7840254904184738,
 0.5476920052273782,
 0.8545235372396548,
 0.7555378303106948,
 0.057690424283767805]

In [18]:
from math import cos

def taketime(x):
    # Simulate an expensive function: evaluate 100 cosines and discard them.
    [cos(j*x) for j in range(100)]
    return cos(x)

In [19]:
%%time
rdd.map(lambda x: taketime(x)).collect()
print()

21/10/27 17:36:03 WARN TaskSetManager: Stage 11 contains a task of very large size (1870 KiB). The maximum recommended task size is 1000 KiB.
[Stage 11:>                                                         (0 + 1) / 1]


CPU times: user 14.6 ms, sys: 6.47 ms, total: 21 ms
Wall time: 3.52 s

In [20]:
%%time
rdd_dist.map(lambda x: taketime(x)).collect()
print()

[Stage 12:>                                                         (0 + 4) / 4]


CPU times: user 15.4 ms, sys: 6.47 ms, total: 21.8 ms
Wall time: 1.32 s

In [21]:
%%time
[taketime(x) for x in random_numbers]
print()


CPU times: user 3.14 s, sys: 25.2 ms, total: 3.17 s
Wall time: 3.23 s


In [None]:
# Speedup: plain-Python wall time (3.23 s) over 4-partition Spark wall time (1.32 s).
f"{(3.23 / 1.32):.4f}"
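The ratio in the last cell is a speedup. A short sketch, using the wall times printed by the timed cells in this notebook, computes the speedup and the parallel efficiency (speedup divided by the number of worker threads). The function and variable names are illustrative, and the timings come from a single run, so they will vary between runs.

```python
def speedup(t_baseline, t_parallel):
    """How many times faster the parallel run is than the baseline."""
    return t_baseline / t_parallel


def efficiency(t_baseline, t_parallel, n_workers):
    """Speedup per worker; 1.0 would be perfect linear scaling."""
    return speedup(t_baseline, t_parallel) / n_workers


# Wall times in seconds, copied from the notebook output above.
t_python = 3.23   # plain list comprehension
t_spark_1 = 3.52  # Spark map over 1 partition
t_spark_4 = 1.32  # Spark map over 4 partitions on local[4]

print(f"{speedup(t_python, t_spark_4):.2f}")        # 2.45
print(f"{speedup(t_spark_1, t_spark_4):.2f}")       # 2.67
print(f"{efficiency(t_python, t_spark_4, 4):.2f}")  # 0.61
```

A speedup below 4 on 4 worker threads is consistent with fixed costs such as shipping the data to the Python workers and scheduling tasks, although this notebook does not measure those costs separately.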