### Apache Spark

* Apache Spark is an open-source, distributed processing system used for big data workloads.


* It's an enhancement to Hadoop's MapReduce.
  * Processes and retains data in memory for subsequent steps, instead of writing intermediate results to disk
  * For smaller workloads, Spark’s data processing speeds are up to 100x faster than Hadoop's MapReduce.

* Spark was written in Scala
  * Runs on the JVM
 
* Includes libraries to support SQL queries, machine learning (MLlib), graph data analysis (GraphX) and streaming data analysis.
  * Tools for pipeline construction and evaluation

* Enhanced security.

* Ideal for real-time processing as it utilizes in-memory caching and optimized query execution for fast queries against data of any size.  

* Provides many operators beyond map and reduce
  * A plethora of functions for SQL-like operations, ML and working with graph data
  * Over 80 high level operators beyond Map and Reduce.

### Spark and Functional Programming

* Spark uses functional programming to manipulate data
  * Paradigm used in popular languages such as Common Lisp, Scheme, Clojure, OCaml and Haskell
  
* Functional programming is a data-oriented paradigm.
  * Functional programming decomposes a problem into a set of functions.
  * Logic is implemented by applying and composing functions.

* It's centered around the idea that data should be manipulated by functions without maintaining any external state.
  * No global variables
  * Lambda functions are a natural fit: they typically take inputs and return a value without modifying outside state.

* In functional programming, we always return new data instead of manipulating the data in place.
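
As a rough illustration of this idea in plain Python (not Spark-specific; the function names `add_tax` and `add_tax_in_place` and the sample prices are made up for this sketch), the functional version below returns a new list, while the in-place version changes its argument:

```python
def add_tax_in_place(prices, rate):
    """Non-functional style: mutates the list that was passed in."""
    for i, price in enumerate(prices):
        prices[i] = round(price * (1 + rate), 2)


def add_tax(prices, rate):
    """Functional style: leaves the input untouched and returns new data."""
    return [round(price * (1 + rate), 2) for price in prices]


original = [10.0, 20.0, 30.0]
taxed = add_tax(original, 0.1)

print(original)  # the input list is unchanged
print(taxed)     # a new list holds the result
```

Because `add_tax` never touches its input, it can be applied to different chunks of data in any order, which is the property Spark relies on when it ships functions to workers.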

### Spark Operations and RDDs

* Provides its own distributed data framework called resilient distributed datasets or RDDs.
   * RDD is an abstraction that represents a read-only collection of objects that are partitioned across machines.
   * RDDs are fault tolerant and are accessed via parallel operations.

* RDDs can be cached in memory, which is effective when iterating over the same data
  * Ex. optimization or ML algorithms
* Fast operation speed makes it ideal for command-line-based queries.

### Core Components of Spark

![](https://www.dropbox.com/s/azebxe8nv5nsqne/spark_architecture.png?dl=1)
* Spark Core
  * Basic functionality: 
    * APIs that define RDDs
    * operations and actions to process RDDs
    
* Spark SQL
  * APIs to interact with Spark using the Apache Hive variant of SQL, called Hive Query Language (HiveQL).
  * Database tables are represented as RDDs, and Spark SQL queries are transformed into Spark operations

* Spark Streaming
  * Enables the processing and manipulation of live streams of data in real time.
    
* MLlib
  * Implementation of machine learning algorithms using Spark on RDDs
  * Somewhat basic algorithms for classification, regression, etc.
* GraphX
  * Functionality for manipulating graphs and performing parallel graph operations and computations
  * A sort of large-scale Neo4J
  
  

### Spark Paradigm

A Spark program typically follows a simple paradigm:

1. The main program is the `driver` 
2. The program has one or more workers, called executors
  * these run code sent to them by the driver on their partitions of the RDD
  * in local mode (or Spark in-process), the executor is the same as the driver (my laptop here)
  
3. Results are then sent back to the driver for aggregation or compilation.
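
The three steps above can be mimicked on a single machine. The sketch below is only an analogy in plain Python, not Spark code: threads stand in for executors, and the helper names `split_into_partitions` and `count_long_words` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the data into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def count_long_words(partition):
    """Executor side: the code sent by the driver, run on one partition."""
    return sum(1 for word in partition if len(word) > 4)


words = ["spark", "is", "a", "distributed", "processing",
         "system", "for", "big", "data"]
partitions = split_into_partitions(words, 3)

# Each worker thread stands in for an executor working on its own partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_long_words, partitions))

# The driver aggregates the partial results sent back by the executors.
total = sum(partial_counts)
print(partial_counts, total)
```

In real Spark the partitions live on different machines and the function is serialized before being sent, but the split, process, aggregate shape is the same.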



<!-- These steps are outlined as follows:

    Invoke operations on the RDD by passing closures (functions) to each element of the RDD. Spark offers over 
    Use the resulting RDDs with actions (e.g. count, collect, save, etc.). Actions kick off the computing on the cluster.

When Spark runs a closure on a worker, any variables used in the closure are copied to that node, but are maintained within the local scope of that closure.

Spark provides two types of shared variables that can be interacted with by all workers in a restricted fashion:

    Broadcast variables are distributed to all workers, but are read-only. These variables can be used as lookup tables or stopword lists.
    Accumulators are variables that workers can “add” to using associative operations and are typically used as counters.
 -->

### Spark Execution


<!-- ![](https://www.dropbox.com/s/n07i9wlbqti5ptx/spark_execution.png?dl=1) -->

* Interact with the Scala interface using the PySpark Python library
  * A wrapper that uses almost exactly the same function names and attributes

In [None]:
### Cluster Mode

### Setting a Docker Cluster

* Manually installing Spark and all its components can be a daunting task.
  * Manually deploying, configuring and optimizing a Spark cluster is complex and time consuming.
* It is easier to use Docker to install Spark locally, for example with the `jupyter/all-spark-notebook` image:

```
docker run --rm -p 4040:4040 -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook
```

* This configuration runs the Spark master and the compute (executor) processes locally, inside a single Docker container

* There are other Docker images, including `jupyter/pyspark-notebook`

  * does not include the jobs dashboard `http://localhost:4040`
  
* You can log into the running container using: `docker exec -it <CONTAINER_ID> bash`

  * where `<CONTAINER_ID>` is the ID of the container currently running the `jupyter/all-spark-notebook` image



In [2]:
from pyspark import SparkContext
# sc.stop()
sc = SparkContext()


21/09/20 09:38:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [13]:
SparkContext?

In [3]:
print(f"Spark version is {sc.version}")

print(f"Python version is {sc.pythonVer}")

print(f"The name of the master is {sc.master}")


Spark version is 3.1.2
Python version is 3.9
The name of the master is local[*]


In [4]:
sc.getConf().getAll()


[('spark.app.startTime', '1632130707598'),
 ('spark.executor.id', 'driver'),
 ('spark.app.id', 'local-1632130708911'),
 ('spark.app.name', 'pyspark-shell'),
 ('spark.driver.extraJavaOptions',
  '-Dio.netty.tryReflectionSetAccessible=true'),
 ('spark.driver.port', '36521'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.host', '4eedec98e4d1'),
 ('spark.executor.extraJavaOptions',
  '-Dio.netty.tryReflectionSetAccessible=true'),
 ('spark.ui.showConsoleProgress', 'true')]

### Creating a Text RDD

* You can create RDDs in a number of ways: 
    * The `parallelize()` function can transform Python data structures like lists and tuples into RDDs
    * It makes the passed list fault-tolerant and distributed.


* Another easy way to create RDDs is to read in a file with `textFile()`

* This creates an RDD where every object is a line of the input text file

* Once an RDD is created, we can access its `map`, `reduce` and `filter` methods
  * Operations such as `map` and `filter`, and others we will cover, are called `transformations`
  * A transformation on an RDD yields a new RDD
  * `flatMap` is also commonly used; it behaves like `map` followed by a flattening step, similar to `itertools.chain.from_iterable()`
 
* To get results back, we need to perform an `action`
  * Operations like `take()`, `takeSample()`, `count()` and `reduce()` return data to the driver.
  
* More on transformations later.  
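
To make the difference between `map` and `flatMap` concrete, here is a small plain-Python analogy (no Spark involved; the two sample lines are made up). It uses the built-in `map` and `filter` together with `itertools.chain.from_iterable` to imitate what the RDD methods of the same name would produce:

```python
from itertools import chain

lines = ["to be or", "not to be"]

# map keeps one output element per input element: here, one list per line.
mapped = list(map(str.split, lines))

# flatMap-like behaviour: apply the function, then flatten the results.
flat_mapped = list(chain.from_iterable(map(str.split, lines)))

# filter keeps only the elements for which the predicate is true.
two_letter_words = list(filter(lambda w: len(w) == 2, flat_mapped))

print(mapped)
print(flat_mapped)
print(two_letter_words)
```

The `list(...)` calls play the role of an action here: Python's `map` and `filter` are lazy as well and compute nothing until their results are consumed.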


In [81]:
# Install using the following if not already installed 
# !pip install randomuser
from randomuser import RandomUser

# Generate a single user
user = RandomUser({"nat": "us"})

def get_user_info(u):

    user_dict = {
        "user_id": u.get_id()["number"], 
        "first_name": u.get_first_name(), 
        "last_name": u.get_last_name(), 
        "state": u.get_state(),
        "zip": u.get_zipcode()
    }
    return user_dict

get_user_info(user)

{'user_id': '485-72-3761',
 'first_name': 'Jackson',
 'last_name': 'Vasquez',
 'state': 'Louisiana',
 'zip': 91496}

In [87]:
my_users = RandomUser.generate_users(250, {"nat": "us"})
my_users[0:3]

[<randomuser.RandomUser at 0x7f19dd6ac9d0>,
 <randomuser.RandomUser at 0x7f19ddd3c3a0>,
 <randomuser.RandomUser at 0x7f19dd759850>]

In [88]:
# Convert the 250 random users into a list of dictionaries

user_dicts = list(map(get_user_info, my_users))
user_dicts[0:3]




[{'user_id': '606-72-0736',
  'first_name': 'Richard',
  'last_name': 'Morales',
  'state': 'Indiana',
  'zip': 73780},
 {'user_id': '314-10-6439',
  'first_name': 'Brandy',
  'last_name': 'Frazier',
  'state': 'Massachusetts',
  'zip': 98770},
 {'user_id': '750-17-4776',
  'first_name': 'Jar',
  'last_name': 'Snyder',
  'state': 'Delaware',
  'zip': 23469}]

In [89]:
users_rdd = sc.parallelize(user_dicts)
print(users_rdd.count())
users_rdd.takeSample(False, 3)

250


[{'user_id': '000-35-7329',
  'first_name': 'Cherly',
  'last_name': 'Hale',
  'state': 'Idaho',
  'zip': 58952},
 {'user_id': '719-30-1515',
  'first_name': 'Tiffany',
  'last_name': 'Henderson',
  'state': 'Georgia',
  'zip': 25120},
 {'user_id': '510-30-7858',
  'first_name': 'Heather',
  'last_name': 'Fernandez',
  'state': 'Oregon',
  'zip': 64756}]

In [90]:
select_users_rdd = users_rdd.filter(lambda x: x['state'] in ["Nebraska", "Hawaii", "Idaho"])
select_users_rdd.collect()

[{'user_id': '219-42-2936',
  'first_name': 'Clara',
  'last_name': 'Hanson',
  'state': 'Idaho',
  'zip': 54726},
 {'user_id': '477-31-3680',
  'first_name': 'Sue',
  'last_name': 'Ramirez',
  'state': 'Hawaii',
  'zip': 99043},
 {'user_id': '662-16-2597',
  'first_name': 'Terra',
  'last_name': 'Carpenter',
  'state': 'Hawaii',
  'zip': 42923},
 {'user_id': '963-57-9165',
  'first_name': 'Sofia',
  'last_name': 'Ryan',
  'state': 'Idaho',
  'zip': 29804},
 {'user_id': '878-39-7497',
  'first_name': 'Derek',
  'last_name': 'Coleman',
  'state': 'Idaho',
  'zip': 60498},
 {'user_id': '010-83-2249',
  'first_name': 'Pauline',
  'last_name': 'Martinez',
  'state': 'Nebraska',
  'zip': 32048},
 {'user_id': '000-35-7329',
  'first_name': 'Cherly',
  'last_name': 'Hale',
  'state': 'Idaho',
  'zip': 58952},
 {'user_id': '353-00-5562',
  'first_name': 'Tamara',
  'last_name': 'Spencer',
  'state': 'Hawaii',
  'zip': 46969},
 {'user_id': '468-63-1606',
  'first_name': 'Louis',
  'last_name': 

In [91]:
# Build an RDD from a text file.
text = sc.textFile('data/pride_and_prejudice.txt', minPartitions=4)
# Number of partitions in the RDD
text.getNumPartitions()

4

In [92]:
print(text.count())

print(len(open("data/pride_and_prejudice.txt").readlines()))


14579
14579


In [94]:
print(text.take(10))

['The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen', '', 'This eBook is for the use of anyone anywhere in the United States and', 'most other parts of the world at no cost and with almost no restrictions', 'whatsoever. You may copy it, give it away or re-use it under the terms', 'of the Project Gutenberg License included with this eBook or online at', 'www.gutenberg.org. If you are not located in the United States, you', 'will have to check the laws of the country where you are located before', 'using this eBook.', '']


In [113]:
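import re

def clean_split_line(line):
    # Remove digits
    a = re.sub(r'\d+', '', line)
    # Replace runs of non-word characters (punctuation, whitespace) with a single space
    b = re.sub(r'[\W]+', ' ', a)
    # Upper-case the line and split it into words
    return b.upper().split()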

words = text.flatMap(clean_split_line)
words.take(60)

['THE',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'OF',
 'PRIDE',
 'AND',
 'PREJUDICE',
 'BY',
 'JANE',
 'AUSTEN',
 'THIS',
 'EBOOK',
 'IS',
 'FOR',
 'THE',
 'USE',
 'OF',
 'ANYONE',
 'ANYWHERE',
 'IN',
 'THE',
 'UNITED',
 'STATES',
 'AND',
 'MOST',
 'OTHER',
 'PARTS',
 'OF',
 'THE',
 'WORLD',
 'AT',
 'NO',
 'COST',
 'AND',
 'WITH',
 'ALMOST',
 'NO',
 'RESTRICTIONS',
 'WHATSOEVER',
 'YOU',
 'MAY',
 'COPY',
 'IT',
 'GIVE',
 'IT',
 'AWAY',
 'OR',
 'RE',
 'USE',
 'IT',
 'UNDER',
 'THE',
 'TERMS',
 'OF',
 'THE',
 'PROJECT',
 'GUTENBERG',
 'LICENSE',
 'INCLUDED']

In [19]:
# Pair every word with a count of 1, the usual first step of a word count

words_mapped = words.map(lambda x: (x,1))
words_mapped.take(10)

[('THE', 1),
 ('PROJECT', 1),
 ('GUTENBERG', 1),
 ('EBOOK', 1),
 ('OF', 1),
 ('PRIDE', 1),
 ('AND', 1),
 ('PREJUDICE', 1),
 ('BY', 1),
 ('JANE', 1)]

In [25]:
sorted_map = words_mapped.sortByKey()



In [30]:
sorted_map.take(20000)[18000:18080]

[('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTH', 1),
 ('BOTTLE', 1),
 ('BOTTLE', 1),
 ('BOTTOM', 1),
 ('BOUGHT', 1),
 ('BOUGHT', 1),
 ('BOUGHT', 1),
 ('BOUGHT', 1),
 ('BOUGHT', 1),
 ('BOUND', 1),
 ('BOUND', 1),
 ('BOUND', 1),
 ('BOUNDARY', 1),
 ('BOUNDARY', 1),
 ('BOUNDLESS', 1),
 ('BOUNDS', 1),
 ('BOUNTY', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1),
 ('BOURGH', 1)

In [35]:
from operator import add
print(add(1, 2))
print(add(189, 11))


3
200


In [39]:
counts = words_mapped.reduceByKey(add)
counts.take(100)

[('PRIDE', 52),
 ('UNITED', 22),
 ('OTHER', 227),
 ('WORLD', 68),
 ('NO', 501),
 ('GIVE', 127),
 ('LICENSE', 18),
 ('WWW', 9),
 ('ARE', 361),
 ('TO', 4245),
 ('DATE', 5),
 ('UPDATED', 2),
 ('ENGLISH', 1),
 ('CHARACTER', 65),
 ('PRODUCED', 13),
 ('ILLUSTRATED', 1),
 ('THAT', 1555),
 ('POSSESSION', 10),
 ('LITTLE', 187),
 ('KNOWN', 58),
 ('VIEWS', 11),
 ('CONSIDERED', 23),
 ('AS', 1193),
 ('ONE', 273),
 ('THEIR', 439),
 ('MR', 784),
 ('JUST', 72),
 ('TOLD', 69),
 ('ME', 427),
 ('ANSWER', 65),
 ('WHO', 288),
 ('TELL', 71),
 ('HEARING', 24),
 ('ENOUGH', 106),
 ('WHY', 53),
 ('YOUNG', 130),
 ('MONDAY', 8),
 ('FOUR', 35),
 ('MUCH', 327),
 ('AGREED', 13),
 ('MICHAELMAS', 2),
 ('SERVANTS', 13),
 ('WEEK', 29),
 ('NAME', 34),
 ('BINGLEY', 307),
 ('OH', 96),
 ('FIVE', 32),
 ('YEAR', 29),
 ('FINE', 31),
 ('CAN', 223),
 ('NONSENSE', 8),
 ('THEREFORE', 75),
 ('VISIT', 53),
 ('PERHAPS', 76),
 ('PARTY', 58),
 ('THAN', 285),
 ('CONSIDER', 33),
 ('YOUR', 446),
 ('ESTABLISHMENT', 6),
 ('WILLIAM', 46),
 (

Since functional programming always returns new data instead of manipulating the data in place, we can rewrite the above more cleanly by chaining the calls:
    
```

counts_test_2 = text.flatMap(clean_split_line).map(lambda x: (x,1)).reduceByKey(add)
counts_test_2.take(100)
```
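
For readers without a Spark session at hand, the same `flatMap` → `map` → `reduceByKey` chain can be approximated in plain Python. This is only a single-machine sketch: `reduce_by_key` is a hypothetical helper written for this example (it is not part of PySpark), and the two input lines are made-up sample text.

```python
import re
from operator import add


def clean_split_line(line):
    """Same cleaning as above: drop digits and punctuation, upper-case, split."""
    no_digits = re.sub(r"\d+", "", line)
    no_punct = re.sub(r"[\W]+", " ", no_digits)
    return no_punct.upper().split()


def reduce_by_key(pairs, func):
    """Merge the values of each key with func (single-machine stand-in)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["It is a truth universally acknowledged,",
         "that a single man in 1813 is a man."]

words = [w for line in lines for w in clean_split_line(line)]  # flatMap
pairs = [(w, 1) for w in words]                                # map
counts = reduce_by_key(pairs, add)                             # reduceByKey

print(counts["A"], counts["IS"], counts["MAN"])
```

Each step builds a new collection from the previous one and nothing is modified in place, which is why the Spark version can be written as one chained expression.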


In [42]:
counts_test_2 = text.flatMap(clean_split_line).map(lambda x: (x,1)).reduceByKey(add)
counts_test_2.take(100)


[('PRIDE', 52),
 ('UNITED', 22),
 ('OTHER', 227),
 ('WORLD', 68),
 ('NO', 501),
 ('GIVE', 127),
 ('LICENSE', 18),
 ('WWW', 9),
 ('ARE', 361),
 ('TO', 4245),
 ('DATE', 5),
 ('UPDATED', 2),
 ('ENGLISH', 1),
 ('CHARACTER', 65),
 ('PRODUCED', 13),
 ('ILLUSTRATED', 1),
 ('THAT', 1555),
 ('POSSESSION', 10),
 ('LITTLE', 187),
 ('KNOWN', 58),
 ('VIEWS', 11),
 ('CONSIDERED', 23),
 ('AS', 1193),
 ('ONE', 273),
 ('THEIR', 439),
 ('MR', 784),
 ('JUST', 72),
 ('TOLD', 69),
 ('ME', 427),
 ('ANSWER', 65),
 ('WHO', 288),
 ('TELL', 71),
 ('HEARING', 24),
 ('ENOUGH', 106),
 ('WHY', 53),
 ('YOUNG', 130),
 ('MONDAY', 8),
 ('FOUR', 35),
 ('MUCH', 327),
 ('AGREED', 13),
 ('MICHAELMAS', 2),
 ('SERVANTS', 13),
 ('WEEK', 29),
 ('NAME', 34),
 ('BINGLEY', 307),
 ('OH', 96),
 ('FIVE', 32),
 ('YEAR', 29),
 ('FINE', 31),
 ('CAN', 223),
 ('NONSENSE', 8),
 ('THEREFORE', 75),
 ('VISIT', 53),
 ('PERHAPS', 76),
 ('PARTY', 58),
 ('THAN', 285),
 ('CONSIDER', 33),
 ('YOUR', 446),
 ('ESTABLISHMENT', 6),
 ('WILLIAM', 46),
 (

### Spark and Lazy Evaluation


* Transformations on an RDD are delayed until an action is performed

  * Similar to a Python generator

  * This is called lazy evaluation 

* You can chain many transformations on the same RDD without causing any execution. 
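
The comparison with Python generators can be made concrete with a few lines of plain Python (an analogy only; no Spark objects are involved and the sample numbers are made up). Building the generator does no work, so the faulty division goes unnoticed until the values are consumed:

```python
numbers = [1, 2, 0, 4]

# Building the generator runs nothing, so the division by zero goes unnoticed.
inverses = (1 / n for n in numbers)
print("generator created without error")

# Consuming it plays the role of an action: the work happens here.
results = []
try:
    for value in inverses:
        results.append(value)
except ZeroDivisionError as exc:
    print(f"error raised only on consumption: {exc}")

print(results)  # values computed before the failing element
```

Spark behaves in a similar way: a faulty transformation is accepted silently, and the error only surfaces once an action forces the computation.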





In [108]:
text.take(1)

['The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen']

In [112]:
# The following won't raise an error until an action is performed
text.map(lambda x: len(x)/0).filter(lambda x: x>0)


PythonRDD[133] at RDD at PythonRDD.scala:53

In [107]:
# The following generates an error, because the action forces the faulty
# transformation to be executed.
# The Python exception (a `TypeError` in the output recorded here) is buried in many Scala error messages.

data.take(2)

21/09/20 18:49:38 ERROR Executor: Exception in task 0.0 in stage 68.0 (TID 179)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/rdd.py", line 1560, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_33/2990878127.py", line 1, in <lambda>
TypeError: unsupported operand type(s) for /: 'str' and 'int'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
	at

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 68.0 failed 1 times, most recent failure: Lost task 0.0 in stage 68.0 (TID 179) (4eedec98e4d1 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/rdd.py", line 1560, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_33/2990878127.py", line 1, in <lambda>
TypeError: unsupported operand type(s) for /: 'str' and 'int'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$1(PythonRDD.scala:166)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at jdk.internal.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/rdd.py", line 1560, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_33/2990878127.py", line 1, in <lambda>
TypeError: unsupported operand type(s) for /: 'str' and 'int'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$1(PythonRDD.scala:166)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more


### The Spark Computation DAG

* Lazy evaluation is possible because Spark maintains a directed acyclic graph (DAG) of the transformations
* The transformations in the graph are optimized and executed once an action is triggered

* A simple example of an execution is:

```python
data_2 =  data_1.map(lambda x: x+2)
# do some work here
data_3 =  data_2.map(lambda x: x-2)
```
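* Nothing runs when these lines are defined; Spark only records the transformations in the DAG.
  * The two maps cancel each other out, so `data_3` holds the same values as `data_1`
  * Once an action is triggered, Spark can plan the whole chain at once, for example by pipelining both maps into a single pass over each partition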

* See the following blog post about the Catalyst optimizer.

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
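
To give a feel for how recording transformations makes lazy evaluation possible, here is a toy sketch in plain Python. `LazyDataset` is a hypothetical class written for this illustration; it is not how Spark is implemented and it performs no optimization. It only stores the chain of steps and replays them in a single pass when the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of (kind, function) steps

    def map(self, func):
        # Return a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._lineage + (("filter", func),))

    def collect(self):
        # Action: replay the recorded steps in a single pass over the data.
        results = []
        for item in self._source:
            keep = True
            for kind, func in self._lineage:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []

def add_two(x):
    calls.append(x)  # record when the function actually runs
    return x + 2

data_1 = LazyDataset([1, 2, 3])
data_2 = data_1.map(add_two)
data_3 = data_2.map(lambda x: x - 2)

print(len(calls))        # no call has happened yet
print(data_3.collect())  # the action triggers both maps
print(len(calls))
```

Because every transformation returns a new object that remembers its parent's steps, the full chain is known before anything runs, which is what gives a scheduler the chance to plan the work as a whole.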
