### The case for distributed computing

* What happens if the data to be analyzed is too large?
  * e.g. cannot be stored on a single machine

* What if the computation is too complex?
  * e.g., in interactive mode, it is unacceptably slow

* What if you have to deal with both situations?

* Can you scale up? Can you scale out?

### Scaling Up: Local Machine

* Scaling up occurs within the same system hosting the data and running the computation
  * Simple to carry out from a physical standpoint
  * From a programmatic standpoint, it's managed by the operating system and programming libraries; it does not require additional frameworks
* Scalability is typically limited by the OS and the physical resources of the system.
  * The RAM upper bound is determined by the OS or the number of slots available on the motherboard
* The cost of a single machine at the highest configuration may be prohibitive

* May not meet the demands of the workload at hand


### Distributed Systems

* The hardware specs may not be the same across machines, which adds another layer of complexity when they differ

<img src="https://www.dropbox.com/s/8mncw4ffe8uajol/networking.jpg?dl=1" width="700" height="600">


### Key Requirements for Distributed Systems

* A distributed system needs to meet several crucial criteria, including:

  * Node Communication: Enabling seamless interaction among nodes in the network.
  * Resilience to Faults: Ensuring that both data and operations remain unaffected by system failures.
  * Scalability: The ability to scale computational resources to handle growing workloads.


### Apache Spark

* Apache Spark is an open-source, distributed processing system used for big data workloads.
  * Runs on a cluster

* It's an enhancement to Hadoop's MapReduce
  * Processes and retains data in memory for subsequent steps
  * For smaller workloads, Spark’s data processing speeds are up to 100x faster than Hadoop's MapReduce

* Written in Scala and runs in the JVM


### Apache Spark - Cont'd

* Ideal for real-time processing as it utilizes in-memory caching and optimized query execution for fast queries against data of any size.  

* Provides a richer ecosystem of functionality than Hadoop's MapReduce
  * Over 80 high-level operators beyond map and reduce
    * Tools for pipeline construction and evaluation
  * Libraries to support SQL queries, machine learning (MLlib), graph data analysis (GraphX), and streaming data analysis


### What is Apache Spark

* [See Video](https://www.databricks.com/spark/about)

### Spark and Functional Programming

* To manipulate data, Spark uses functional programming
  * The functional programming paradigm is used in many popular languages including Common Lisp, Scheme, Clojure, OCaml, and Haskell
  
* Functional programming is a data oriented paradigm
  * Decomposes a problem into a set of functions.
  * Logic is implemented by applying and composing functions.

* The idea is that functions should be able to manipulate data without maintaining any external state.
  * No external or global variables
 
* In functional programming, we need to always return new data instead of manipulating the data in-place.
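
The contrast between returning new data and mutating in place can be sketched in plain Python. This is only an illustration of the paradigm, not Spark code, and the function names are made up for the example.

```python
def double_all(values):
    """Pure function: returns a new list and leaves the input untouched."""
    return [v * 2 for v in values]


def double_all_in_place(values):
    """Non-functional style: mutates the caller's list."""
    for i in range(len(values)):
        values[i] *= 2


original = [1, 2, 3]
doubled = double_all(original)
print(original)  # [1, 2, 3]
print(doubled)   # [2, 4, 6]
```

Because `double_all` never touches its input, the same input can safely be reused by several computations, which is the property Spark relies on when it re-runs work on a partition.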

### Spark Operations and RDDs: An Overview

What are RDDs?
 * Resilient Distributed Datasets (RDDs) are the core data representation in Spark.
   * An RDD is conceptually divided into one or more partitions.
     Each partition is a self-contained unit of data that can be operated on independently and in parallel.
     These partitions are distributed across the nodes in a Spark cluster for parallel computation.

  * RDDs are read-only collections, partitioned across multiple computing nodes for optimized performance.
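
As a rough, pure-Python illustration of partitioning (not how Spark slices data internally; `split_into_partitions` is a hypothetical helper), a collection can be cut into chunks that are each processed independently and then combined:

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

# Each partition can be processed without looking at the others.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

print(partitions)    # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(partial_sums)  # [10, 18, 27]
print(total)         # 55
```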


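### Spark Operations and RDDs: An Overview - Cont'd
  * Partitions can also be replicated across nodes.
    * Replication is opt-in: it is requested through the storage level used when persisting an RDD (e.g., `MEMORY_ONLY_2` keeps two copies); otherwise, a lost partition is recomputed from its lineage.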
  * Partitioning enhances fault tolerance and boosts the efficiency of data operations.
  * This allows RDDs to be accessed through parallel operations
    * Data operations can be executed on all partitions at the same time, speeding up data tasks.
    
* In-Memory Caching
  * RDDs are stored in memory when they fit into the nodes' memory, facilitating quick iterations over the same dataset.
    * Disk spilling: if an RDD is too large, Spark spills whatever doesn't fit in RAM to disk



### Overview of Apache Spark's Core Components

* Spark Core: The foundation of the entire Spark ecosystem.
  * Defines the basic data structure (RDDs).
  * Provides a set of operations (transformations and actions) to process RDDs.
  * Enables distributed data processing, fault tolerance, and in-memory computations.

* Spark SQL: Spark's SQL engine for structured data.
  * Supports ANSI SQL standards for query language.
  * Transforms SQL queries into Spark operations.
  * Allows for SQL-like querying on large datasets, bridging the gap between traditional databases and big data processing.



### Overview of Apache Spark's Core Components - Cont'd

* Spark Streaming: Real-time data processing module in Spark.
  * Processes live streaming data.
  * Offers seamless integration with other Spark components.
  * Enables real-time analytics and data processing, vital for applications like fraud detection, monitoring, and recommendation systems.

* MLlib: Spark's machine learning library.
  * Implements machine learning algorithms on RDDs.
  * Provides algorithms for classification, regression, clustering, and more.
  * Allows for scalable machine learning tasks, leveraging Spark's distributed computing power.

* GraphX: Spark's library for graph processing.
  * Manages and manipulates graph structures.
  * Performs parallel graph operations and computations.
  * Enables graph analytics at scale, useful for social network analysis, recommendation systems, and more.
  

### Core Components of Spark
<img src="https://www.dropbox.com/s/azebxe8nv5nsqne/spark_architecture.png?dl=1" width="900" height="600">


### Spark High-level Components

* Cluster manager: Manages the resources across the Spark cluster.
  * Responsible for allocating resources like CPU and memory to Spark applications.
  
* Application driver: The central orchestrator of the Spark program, housing the main application logic.
  * Contains an instance of the Spark context
    * Requests resources from the Cluster Manager to launch executors.
    * Coordinates the overall data processing workflow.
  
* Executors (workers): These are the worker nodes to which tasks are delegated.
  * They execute the code sent by the driver, specifically focusing on their designated partitions of the dataset.
  * Executors communicate with the Cluster Manager to report status and failures.


![](https://spark.apache.org/docs/latest/img/cluster-overview.png)


### Spark Program Flow

* A typical Spark program adheres to the following structure:
  * Application Driver: Central point of the Spark program, where the main application logic resides. 
    * Responsible for coordinating the entire data processing workflow.
  * Executors (Workers): The worker nodes that tasks are delegated to.
    * Execute the code sent by the driver, each focusing on its designated partitions of the dataset.
      * The driver sends smaller, more specific operations that the executors can carry out.
        * Referred to as a task plan.
  * Result Aggregation: Results are sent by the executors to the application driver for aggregation.
    * Often a final layer of computation to produce the output.
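
The flow above can be mimicked on a single machine with a thread pool. This is a loose analogy only: `driver` and `run_task` are hypothetical names, and real executors are separate processes managed by Spark rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Work done by one 'executor' on its own partition: count the even numbers."""
    return sum(1 for x in partition if x % 2 == 0)


def driver(partitions):
    """Hand one task per partition to a pool of workers, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(run_task, partitions))
    # Final aggregation step performed by the 'driver'.
    return sum(partial_counts)


partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
even_count = driver(partitions)
print(even_count)  # 5
```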


### Spark Application Lifecycle

1. Python Program Initialization
    * The `SparkContext` object is created; it's a client to interact with the Spark cluster.
  
2. Resource Request to Cluster Manager:
    * `SparkContext` contacts the Cluster Manager to request resources (CPU, memory) for your application.
  
3. Cluster Manager Allocates Resources:
    * The Cluster Manager allocates the necessary resources for the application and decides where to place the executors across the cluster's nodes.
  
4. Application Driver Initialization
    * The main logic of your Spark application is in the Application Driver.
    * It becomes the master node for your application, coordinating tasks between the cluster and your Python program.
5. Executor Launch
    * Based on the resources allocated by the Cluster Manager, executors for the Spark application are launched on the worker nodes.
6. Task Division and Execution Plan
    * The Application Driver divides the job into tasks and builds an execution plan.
7. Sending Task Plans to Executors
    * Instead of sending raw code, the Application Driver sends the execution plan (or task plans) to the Executors.
8. Task Execution on Executors
    * Executors run the tasks on their designated partition of the dataset.
9. Result Aggregation
    * After task execution, the Executors send the results back to the Application Driver.
10. Final Computation and Output Retrieval
    * The Application Driver may perform some final computations.
    * Results are returned to your Python program through the `SparkContext` object.
11. Resource Release
    * Once the application completes, the resources are released back to the Cluster Manager for use by other applications.


### Setting up a Docker Cluster
 
* Installing Spark and all its components manually can be challenging and time-consuming.
  * Configuring and optimizing a Spark cluster from scratch requires substantial effort.
* Easier deployment options are available on cloud services, such as:
  * [Amazon's EMR](https://aws.amazon.com/emr/features/spark/) for Spark support
  * [Databricks' Community Edition](https://www.databricks.com/product/faq/community-edition) and paid offerings
  * Other providers like Google's Dataproc, Microsoft's HDInsight, etc.

* We'll utilize Databricks Community Edition.
  [Brief Demo](https://www.databricks.com)

### Installing Via Docker For ICS438 (Optional)

* Note that the following is optional. All the work we will do using PySpark will be done on a Databricks cluster in Community Edition. Also, this solution assumes that you have Docker running on your machine.

* Installing Spark locally with Docker is easy. We will use the following Docker image:
  
```  
jupyter/all-spark-notebook
```
* Other Docker images are available, including `jupyter/pyspark-notebook`, which does not include the jobs dashboard at `http://localhost:4040`

* We will run the infrastructure as follows:

```
docker run --rm -p 4040:4040 -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook
```

* This configuration creates a master and compute nodes locally in a Docker instance
 
* While you're probably not going to need to, you can log into the running container using: `docker exec -it <CONTAINER_ID> bash`

 * where `<CONTAINER_ID>` is the ID of the container currently running the `jupyter/all-spark-notebook` image


* The Docker instance has all the libraries installed and ready to go.

* Make sure you run a Jupyter notebook on the Docker instance
  * If the code below fails, this means you're not running in the Docker instance.

In [4]:

# pip install pyspark

from pyspark import SparkContext
sc = SparkContext()



In [5]:
SparkContext?

Init signature:
SparkContext(
    master: Optional[str] = None,
    appName: Optional[str] = None,
    sparkHome: Optional[str] = None,
    pyFiles: Optional[List[str]] = None,
    environment: Optional[Dict[str, Any]] = None,
    batchSize: int = 0,

In [None]:
print(f"Spark version is {sc.version}")

print(f"Python version is {sc.pythonVer}")

print(f"The name of the master is {sc.master}")


In [None]:
sc.getConf().getAll()


### Creating a Text RDD in Spark

* Resilient Distributed Datasets (RDDs) can be created in various ways in Spark. Here are two commonly used methods:

    * `parallelize()`: This function allows you to transform Python collections, such as lists or arrays, into an RDD.
        * It distributes the elements of the passed collection across multiple nodes, making the RDD fault-tolerant.

    * `textFile()`: This function reads in a text file and creates an RDD where each object corresponds to a line in the file.

* Example using `parallelize()` and `textFile()`:
  ```python
  from pyspark import SparkContext
  # Reuse the existing context if one is already running
  sc = SparkContext.getOrCreate()
  my_list = [1, 2, 3, 4, 5]
  my_rdd = sc.parallelize(my_list)
  # or
  my_text_rdd = sc.textFile("path/to/text/file.txt")
  ```

### Fundamental Operations on RDDs

* Several fundamental operations are available to manipulate and transform Spark RDDs.
  * Those that return a new RDD are called 'transformations'; those that return a value to the driver are called 'actions.'
    * `map` (transformation): Applies a given function to each element of the RDD and returns a new RDD consisting of the results.
    * `filter` (transformation): Returns a new RDD containing only the elements that satisfy a given predicate.
    * `reduce` (action): Aggregates the elements of the RDD using a given function and returns the result to the driver.
      * The function should be commutative and associative so that it can be computed in parallel.

* Each transformation on an RDD produces a new RDD without modifying the original one; RDDs are immutable.

* `flatMap`: Another commonly used transformation, which first applies a function to all elements of the RDD and then flattens the results. It is conceptually equivalent to applying `map` and then flattening with Python's `itertools.chain.from_iterable()`.



In [3]:
my_rdd = sc.parallelize([1,2,3,4,5,6,7,8])

In [4]:
doubled_rdd = my_rdd.map(lambda x: x * 2)

In [5]:
even_rdd = my_rdd.filter(lambda x: x % 2 == 0)

In [6]:
sum_of_elements = my_rdd.reduce(lambda a, b: a + b)

In [7]:
flat_rdd = my_rdd.flatMap(lambda x: (x, x * 3))
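
For reference, the same four operations can be reproduced on a plain Python list. This is a local analogue, not Spark code; the values should correspond to what the RDD versions above produce once collected, assuming element order is preserved.

```python
from functools import reduce
from itertools import chain

data = [1, 2, 3, 4, 5, 6, 7, 8]

doubled = list(map(lambda x: x * 2, data))
evens = list(filter(lambda x: x % 2 == 0, data))
total = reduce(lambda a, b: a + b, data)
# flatMap = map, then flatten one level
flat = list(chain.from_iterable(map(lambda x: (x, x * 3), data)))

print(doubled)  # [2, 4, 6, 8, 10, 12, 14, 16]
print(evens)    # [2, 4, 6, 8]
print(total)    # 36
print(flat)     # [1, 3, 2, 6, 3, 9, 4, 12, 5, 15, 6, 18, 7, 21, 8, 24]
```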

### Transformations and Actions in Spark

* Transformations: These operations transform your data and produce a new RDD. They come in two types:
    * Narrow-dependency transformations:
        * Transformations where each partition of the parent RDD is used to build only one partition of the new RDD, e.g., `map` and `filter`.
        * This means the operation can be performed independently on each partition, which allows for better parallelism and less data movement.
    * Wide-dependency transformations:
        * Transformations where a single partition of the parent RDD may be used to build multiple partitions of the child RDD, e.g., `groupBy` and `reduceByKey`.
        * These operations are generally expensive in terms of performance, as they typically require shuffling (redistributing) data across partitions.

* Actions: These operations trigger the computation and execute the job, producing a result.
    * Each action initiates a Spark job, which may consist of multiple stages depending on the transformations involved.

#### Why Distinguish Between Transformations and Actions?

  * One reason is query optimization. For instance, performing a `filter` operation before a `groupBy` is usually more efficient than doing it afterward.
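
A small pure-Python sketch of why filter placement matters. The records and the `group_by_carrier` helper are invented for illustration. Both orders give the same answer here, but filtering first means fewer rows have to be grouped, and in Spark fewer rows would have to be shuffled.

```python
from collections import defaultdict

# Hypothetical (carrier, duration) records
records = [("AA", 90), ("AA", 150), ("DL", 200), ("DL", 45), ("UA", 130), ("UA", 60)]


def group_by_carrier(rows):
    """Group (carrier, duration) rows by carrier; stands in for a shuffle-heavy groupBy."""
    groups = defaultdict(list)
    for carrier, duration in rows:
        groups[carrier].append(duration)
    return dict(groups)


# Option A: group all 6 rows, then filter inside each group.
grouped_all = group_by_carrier(records)
late_filter = {c: [d for d in ds if d > 120] for c, ds in grouped_all.items()}

# Option B: filter first, then group only the 3 remaining rows.
long_flights = [(c, d) for c, d in records if d > 120]
early_filter = group_by_carrier(long_flights)

print(late_filter == early_filter)       # True
print(len(records), len(long_flights))   # 6 3
```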


* Before Join

* Partition 1 of RDD1: 
```
(1, "apple")
(2, "banana")
```
* Partition 2 of RDD1: 
```
(3, "cherry")
(4, "date")
```
  
* Partition 1 of RDD2: 
```
(3, "red")
```
* Partition 2 of RDD2: 
```
(4, "brown")
(2, "yellow")
```

Compute a join between RDD1 and RDD2 based on the keys
1. Shuffle the data around to group all occurrences of the same key together.
* Data after shuffle

  * Shuffle output targeting Partition 1 (from RDD1 and RDD2):
    ```
    (2, "banana")
    (2, "yellow")
    ```
    
  * Shuffle output targeting Partition 2 (from RDD1 and RDD2):
    ```
    (3, "cherry")
    (3, "red")
    (4, "date")
    (4, "brown")
    ```

* Data Partitions After Join

* Partition 1: 
```
(2, ("banana", "yellow"))
```
* Partition 2
```
(3, ("cherry", "red")), (4, ("date", "brown"))
```
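
The shuffle-then-join steps above can be traced in plain Python. This is a sketch under stated assumptions: `target_partition` is a hypothetical range partitioner chosen so that the key placement matches the example (Spark's default partitioner is hash-based), and partitions are numbered from 0 here.

```python
from collections import defaultdict

rdd1 = [[(1, "apple"), (2, "banana")], [(3, "cherry"), (4, "date")]]
rdd2 = [[(3, "red")], [(4, "brown"), (2, "yellow")]]
NUM_PARTITIONS = 2


def target_partition(key):
    """Hypothetical range partitioner: keys 1-2 go to partition 0, keys 3-4 to partition 1."""
    return 0 if key <= 2 else 1


def shuffle(partitions):
    """Move every (key, value) pair to the partition chosen by target_partition."""
    shuffled = [[] for _ in range(NUM_PARTITIONS)]
    for partition in partitions:
        for key, value in partition:
            shuffled[target_partition(key)].append((key, value))
    return shuffled


def join_partition(left, right):
    """Inner-join two co-located partitions on their keys."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key[k]]


left_shuffled, right_shuffled = shuffle(rdd1), shuffle(rdd2)
joined = [join_partition(l, r) for l, r in zip(left_shuffled, right_shuffled)]
print(joined)
# [[(2, ('banana', 'yellow'))], [(3, ('cherry', 'red')), (4, ('date', 'brown'))]]
```

Note that `(1, "apple")` is also moved during the shuffle but has no matching key in RDD2, so an inner join drops it.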

### Understanding the Concept of a Stage in Spark

* A "stage" is a sequence of transformations that can be executed in parallel.
* Stages are separated by operations that require data to be rearranged, known as "wide dependencies." 
  * Managing this complexity is easier when tasks are grouped into stages.
* Each stage either reads data, performs computations, or writes data.

* Stages are executed in sequence, one after the other.
  * Within each stage, tasks are executed in parallel.
* Computation moves to where the data resides.


### Example Query: Decomposition into Stages


```python
flights_df = spark.read.option("header", "true").option("inferSchema", "true").csv("flights_info.csv")

# Column names below are assumed to match the CSV header
flights_data_partitioned_df = flights_df.repartition(4)
counts_df = (
    flights_data_partitioned_df.where("duration > 120")
    .select("dep", "dest", "carrier", "duration")
    .groupBy("carrier")
    .count()
)
counts_df.collect()
```


### Stages
1. Reading Data
* Reads the partitioned data into memory.
  * A task for each partition.
* This task has no dependency
2. Filter, Select, and GroupBy
* Applies .where(), .select(), and .groupBy().
* Each task applies all these transformations on a partition.
* GroupBy is a "wide" dependency (it needs to shuffle data between partitions for grouping)
3. Count
* Applies .count() to each group.
* Each task calculates the count for groups in its partition.
  * Each group is guaranteed to be in the same partition
* This task has no dependency
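
The decomposition above can be modeled with a few lines of Python. This is a deliberately simplified model, not Spark's scheduler: each operation is tagged by hand as narrow or wide, and a wide operation is assumed to close the stage it belongs to.

```python
def split_into_stages(operations):
    """Group (name, is_wide) operations into stages; a wide operation closes its stage."""
    stages, current = [], []
    for name, is_wide in operations:
        current.append(name)
        if is_wide:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages


# Hand-labeled operations from the example query (True = wide dependency)
operations = [
    ("read", False),
    ("repartition", True),
    ("where", False),
    ("select", False),
    ("groupBy", True),
    ("count", False),
]
stages = split_into_stages(operations)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
# 1 ['read', 'repartition']
# 2 ['where', 'select', 'groupBy']
# 3 ['count']
```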


### PySpark: Job, Stages and Tasks

<img src="https://www.dropbox.com/s/5qa1fb7p867i787/Page5.jpg?dl=1" width="900" height="600">

    

In [8]:
pip install randomuser

Collecting randomuser
  Downloading randomuser-1.6.tar.gz (5.0 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: randomuser
  Building wheel for randomuser (setup.py) ... done
  Created wheel for randomuser: filename=randomuser-1.6-py3-none-any.whl size=5066 sha256=6e7c0224df743d27edd7642c6960a3314f192ced0afc8cb8cb7ff654a164c569
  Stored in directory: /Users/mahdi/Library/Caches/pip/wheels/41/6f/23/878c103a235dc2d4e85a3965c124aae8a28470c541b81aa2ba
Successfully built randomuser
DEPRECATION: pytorch-lightning 1.6.5 has a non-standard dependency specifier torch>=1.8.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: randomuser
Successfully in

In [9]:
# Install with `pip install randomuser` if not already installed

from randomuser import RandomUser

# Generate a single user
user = RandomUser({"nat": "us"})
print(f"user object  is {user}")
def get_user_info(u):

    user_dict = {
        "user_id": u.get_id()["number"], 
        "first_name": u.get_first_name(), 
        "last_name": u.get_last_name(), 
        "state": u.get_state(),
        "zip": u.get_zipcode(),
        "lat_long": u.get_coordinates()
    }
    return user_dict

user_json = get_user_info(user)
print(f"user json representation  is\n {user_json}")



user object  is <randomuser.RandomUser object at 0x118d40ca0>
user json representation  is
 {'user_id': '263-97-9076', 'first_name': 'Devon', 'last_name': 'Chapman', 'state': 'Kentucky', 'zip': 64821, 'lat_long': {'latitude': '59.4647', 'longitude': '-45.3609'}}


In [2]:
my_users = RandomUser.generate_users(5001, {"nat": "us"})
print(len(my_users))
my_users[0:3]

In [18]:
# Convert each RandomUser object into a dictionary

user_dicts = list(map(get_user_info, my_users))

user_dicts[0:3]

[{'user_id': '792-23-2937',
  'first_name': 'Mildred',
  'last_name': 'Day',
  'state': 'Arkansas',
  'zip': 39155,
  'lat_long': {'latitude': '80.0027', 'longitude': '133.7742'}},
 {'user_id': '432-50-8553',
  'first_name': 'Kurt',
  'last_name': 'Wilson',
  'state': 'Tennessee',
  'zip': 66358,
  'lat_long': {'latitude': '14.9553', 'longitude': '-49.4870'}},
 {'user_id': '145-17-4371',
  'first_name': 'Brianna',
  'last_name': 'Alvarez',
  'state': 'Mississippi',
  'zip': 76395,
  'lat_long': {'latitude': '-76.4153', 'longitude': '-97.7275'}}]

In [13]:
users_rdd = sc.parallelize(user_dicts)
users_rdd_size  = users_rdd.count()
print(f"The number of objects in my RDD is: {users_rdd_size}")
users_rdd.takeSample(False, 3)

The number of objects in my RDD is: 5000


[{'user_id': '675-04-0332',
  'first_name': 'Oscar',
  'last_name': 'Chambers',
  'state': 'Oregon',
  'zip': 17604,
  'lat_long': {'latitude': '60.8608', 'longitude': '-82.8117'}},
 {'user_id': '302-13-3578',
  'first_name': 'Julie',
  'last_name': 'Watson',
  'state': 'Georgia',
  'zip': 86710,
  'lat_long': {'latitude': '87.7597', 'longitude': '-27.8451'}},
 {'user_id': '707-49-4477',
  'first_name': 'Manuel',
  'last_name': 'Burns',
  'state': 'Florida',
  'zip': 83279,
  'lat_long': {'latitude': '64.0851', 'longitude': '99.6517'}}]

In [19]:
select_users_rdd = users_rdd.filter(lambda x: x['state'] in ["Hawaii", "Idaho"])
select_users_rdd


In [14]:
# collect() gathers the results from all partitions back to the driver
select_users_rdd.collect()[:10]

[{'user_id': '048-73-1753',
  'first_name': 'Albert',
  'last_name': 'Scott',
  'state': 'Nebraska',
  'zip': 94069,
  'lat_long': {'latitude': '73.8760', 'longitude': '-13.5365'}},
 {'user_id': '018-56-7439',
  'first_name': 'Veronica',
  'last_name': 'Woods',
  'state': 'Idaho',
  'zip': 85540,
  'lat_long': {'latitude': '35.9805', 'longitude': '-0.2080'}},
 {'user_id': '808-14-7364',
  'first_name': 'June',
  'last_name': 'Baker',
  'state': 'Nebraska',
  'zip': 89660,
  'lat_long': {'latitude': '-76.5984', 'longitude': '-51.9512'}},
 {'user_id': '753-04-9746',
  'first_name': 'Marvin',
  'last_name': 'Beck',
  'state': 'Idaho',
  'zip': 62505,
  'lat_long': {'latitude': '37.4552', 'longitude': '-169.8550'}},
 {'user_id': '553-71-2902',
  'first_name': 'Anita',
  'last_name': 'Hernandez',
  'state': 'Nebraska',
  'zip': 37841,
  'lat_long': {'latitude': '-0.7319', 'longitude': '-57.2282'}},
 {'user_id': '770-54-5308',
  'first_name': 'Adam',
  'last_name': 'Powell',
  'state': 'Idah

In [15]:
# Build an RDD from a text file.
text = sc.textFile('data/pride_and_prejudice.txt', minPartitions=4)
# Number of partitions in the RDD
text.getNumPartitions()

4

In [16]:
text_rdd_size = text.count()
print(f"number of objects in the RDD is {text_rdd_size}")

with open("data/pride_and_prejudice.txt") as f:
    nb_lines = len(f.readlines())
print(f"number of lines in the text file is {nb_lines}")


number of objects in the RDD is 14579
number of lines in the text file is 14579


In [17]:
subset_x = text.take(10)
print(f"len of subset_x is: {len(subset_x)}\n")
print(f"type of subset_x is: {type(subset_x)}\n")
print(f"subset_x is:\n{subset_x}")

len of subset_x is: 10

type of subset_x is: <class 'list'>

subset_x is:
['The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen', '', 'This eBook is for the use of anyone anywhere in the United States and', 'most other parts of the world at no cost and with almost no restrictions', 'whatsoever. You may copy it, give it away or re-use it under the terms', 'of the Project Gutenberg License included with this eBook or online at', 'www.gutenberg.org. If you are not located in the United States, you', 'will have to check the laws of the country where you are located before', 'using this eBook.', '']


In [18]:
import re

def clean_split_line(line):
    # Remove digits, replace runs of non-word characters with a space, then upper-case and split
    a = re.sub(r'\d+', '', line)
    b = re.sub(r'\W+', ' ', a)
    return b.upper().split()

words = text.map(clean_split_line)
words.take(60)

[['THE',
  'PROJECT',
  'GUTENBERG',
  'EBOOK',
  'OF',
  'PRIDE',
  'AND',
  'PREJUDICE',
  'BY',
  'JANE',
  'AUSTEN'],
 [],
 ['THIS',
  'EBOOK',
  'IS',
  'FOR',
  'THE',
  'USE',
  'OF',
  'ANYONE',
  'ANYWHERE',
  'IN',
  'THE',
  'UNITED',
  'STATES',
  'AND'],
 ['MOST',
  'OTHER',
  'PARTS',
  'OF',
  'THE',
  'WORLD',
  'AT',
  'NO',
  'COST',
  'AND',
  'WITH',
  'ALMOST',
  'NO',
  'RESTRICTIONS'],
 ['WHATSOEVER',
  'YOU',
  'MAY',
  'COPY',
  'IT',
  'GIVE',
  'IT',
  'AWAY',
  'OR',
  'RE',
  'USE',
  'IT',
  'UNDER',
  'THE',
  'TERMS'],
 ['OF',
  'THE',
  'PROJECT',
  'GUTENBERG',
  'LICENSE',
  'INCLUDED',
  'WITH',
  'THIS',
  'EBOOK',
  'OR',
  'ONLINE',
  'AT'],
 ['WWW',
  'GUTENBERG',
  'ORG',
  'IF',
  'YOU',
  'ARE',
  'NOT',
  'LOCATED',
  'IN',
  'THE',
  'UNITED',
  'STATES',
  'YOU'],
 ['WILL',
  'HAVE',
  'TO',
  'CHECK',
  'THE',
  'LAWS',
  'OF',
  'THE',
  'COUNTRY',
  'WHERE',
  'YOU',
  'ARE',
  'LOCATED',
  'BEFORE'],
 ['USING', 'THIS', 'EBOOK'],
 [],
 [

In [20]:
import re

def clean_split_line(line):
    # Same cleaning as before; flatMap below flattens the per-line word lists
    a = re.sub(r'\d+', '', line)
    b = re.sub(r'\W+', ' ', a)
    return b.upper().split()

words = text.flatMap(clean_split_line)
words.take(60)

['THE',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'OF',
 'PRIDE',
 'AND',
 'PREJUDICE',
 'BY',
 'JANE',
 'AUSTEN',
 'THIS',
 'EBOOK',
 'IS',
 'FOR',
 'THE',
 'USE',
 'OF',
 'ANYONE',
 'ANYWHERE',
 'IN',
 'THE',
 'UNITED',
 'STATES',
 'AND',
 'MOST',
 'OTHER',
 'PARTS',
 'OF',
 'THE',
 'WORLD',
 'AT',
 'NO',
 'COST',
 'AND',
 'WITH',
 'ALMOST',
 'NO',
 'RESTRICTIONS',
 'WHATSOEVER',
 'YOU',
 'MAY',
 'COPY',
 'IT',
 'GIVE',
 'IT',
 'AWAY',
 'OR',
 'RE',
 'USE',
 'IT',
 'UNDER',
 'THE',
 'TERMS',
 'OF',
 'THE',
 'PROJECT',
 'GUTENBERG',
 'LICENSE',
 'INCLUDED']

In [21]:
words.count()

126018

In [20]:
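# Pair each word with a count of 1.
# Note: the output shown below was captured when `words` came from `map` (one list per line),
# so each key is a list; with the `flatMap` version of `words`, each key is a single word.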

words_mapped = words.map(lambda x: (x,1))
words_mapped.take(10)

[(['THE',
   'PROJECT',
   'GUTENBERG',
   'EBOOK',
   'OF',
   'PRIDE',
   'AND',
   'PREJUDICE',
   'BY',
   'JANE',
   'AUSTEN'],
  1),
 ([], 1),
 (['THIS',
   'EBOOK',
   'IS',
   'FOR',
   'THE',
   'USE',
   'OF',
   'ANYONE',
   'ANYWHERE',
   'IN',
   'THE',
   'UNITED',
   'STATES',
   'AND'],
  1),
 (['MOST',
   'OTHER',
   'PARTS',
   'OF',
   'THE',
   'WORLD',
   'AT',
   'NO',
   'COST',
   'AND',
   'WITH',
   'ALMOST',
   'NO',
   'RESTRICTIONS'],
  1),
 (['WHATSOEVER',
   'YOU',
   'MAY',
   'COPY',
   'IT',
   'GIVE',
   'IT',
   'AWAY',
   'OR',
   'RE',
   'USE',
   'IT',
   'UNDER',
   'THE',
   'TERMS'],
  1),
 (['OF',
   'THE',
   'PROJECT',
   'GUTENBERG',
   'LICENSE',
   'INCLUDED',
   'WITH',
   'THIS',
   'EBOOK',
   'OR',
   'ONLINE',
   'AT'],
  1),
 (['WWW',
   'GUTENBERG',
   'ORG',
   'IF',
   'YOU',
   'ARE',
   'NOT',
   'LOCATED',
   'IN',
   'THE',
   'UNITED',
   'STATES',
   'YOU'],
  1),
 (['WILL',
   'HAVE',
   'TO',
   'CHECK',
   'THE',
   'LA

In [21]:
sorted_map = words_mapped.sortByKey()
sorted_map

PythonRDD[21] at RDD at PythonRDD.scala:53

In [31]:
sample = sorted_map.sample(withReplacement=False, fraction= 0.001)
sample.collect()

[('A', 1),
 ('A', 1),
 ('A', 1),
 ('ALL', 1),
 ('AND', 1),
 ('AND', 1),
 ('AND', 1),
 ('ASSEMBLED', 1),
 ('AT', 1),
 ('BE', 1),
 ('BE', 1),
 ('BEFORE', 1),
 ('BEFORE', 1),
 ('BENNET', 1),
 ('BUT', 1),
 ('BUT', 1),
 ('BY', 1),
 ('CANDOUR', 1),
 ('COACH', 1),
 ('COLONEL', 1),
 ('COLONEL', 1),
 ('CONFIRMATION', 1),
 ('CONTRIVED', 1),
 ('CRIED', 1),
 ('DARCY', 1),
 ('DAY', 1),
 ('DESIRED', 1),
 ('DIFFIDENCE', 1),
 ('DON', 1),
 ('EBOOK', 1),
 ('EDWARD', 1),
 ('ELIZABETH', 1),
 ('EMBARRASSMENT', 1),
 ('EVERYTHING', 1),
 ('EXAGGERATION', 1),
 ('HAD', 1),
 ('HAD', 1),
 ('HAD', 1),
 ('HE', 1),
 ('HER', 1),
 ('HER', 1),
 ('HERTFORDSHIRE', 1),
 ('HIM', 1),
 ('HIS', 1),
 ('HIS', 1),
 ('HOPE', 1),
 ('IN', 1),
 ('IN', 1),
 ('INDEED', 1),
 ('INDUCEMENT', 1),
 ('IS', 1),
 ('IT', 1),
 ('IT', 1),
 ('ITS', 1),
 ('KNOW', 1),
 ('LAUGHED', 1),
 ('LINES', 1),
 ('LOOSE', 1),
 ('MANNERS', 1),
 ('ME', 1),
 ('MORE', 1),
 ('MUST', 1),
 ('MYSELF', 1),
 ('NO', 1),
 ('NOT', 1),
 ('OCCASION', 1),
 ('OF', 1),
 ('OF', 

In [33]:
counts = words_mapped.reduceByKey(lambda x,y: x+y)
counts.collect()[:50]

[('PRIDE', 52),
 ('UNITED', 22),
 ('OTHER', 227),
 ('WORLD', 68),
 ('NO', 501),
 ('GIVE', 127),
 ('LICENSE', 18),
 ('WWW', 9),
 ('ARE', 361),
 ('TO', 4245),
 ('DATE', 5),
 ('UPDATED', 2),
 ('ENGLISH', 1),
 ('CHARACTER', 65),
 ('PRODUCED', 13),
 ('ILLUSTRATED', 1),
 ('THAT', 1555),
 ('POSSESSION', 10),
 ('LITTLE', 187),
 ('KNOWN', 58),
 ('VIEWS', 11),
 ('CONSIDERED', 23),
 ('AS', 1193),
 ('ONE', 273),
 ('THEIR', 439),
 ('MR', 784),
 ('JUST', 72),
 ('TOLD', 69),
 ('ME', 427),
 ('ANSWER', 65),
 ('WHO', 288),
 ('TELL', 71),
 ('HEARING', 24),
 ('ENOUGH', 106),
 ('WHY', 53),
 ('YOUNG', 130),
 ('MONDAY', 8),
 ('FOUR', 35),
 ('MUCH', 327),
 ('AGREED', 13),
 ('MICHAELMAS', 2),
 ('SERVANTS', 13),
 ('WEEK', 29),
 ('NAME', 34),
 ('BINGLEY', 307),
 ('OH', 96),
 ('FIVE', 32),
 ('YEAR', 29),
 ('FINE', 31),
 ('CAN', 223)]
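
To make the behavior of `reduceByKey` more concrete, here is a pure-Python sketch of the idea: values are first combined per key within each partition, and the partial results are then merged across partitions. The helper names and the tiny input are made up for illustration, and this is not Spark's actual implementation.

```python
def combine_partition(pairs, func):
    """Map-side step: merge values per key within a single partition."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def reduce_by_key(partitions, func):
    """Merge the per-partition results into one dictionary of key -> reduced value."""
    result = {}
    for partition in partitions:
        for key, value in combine_partition(partition, func).items():
            result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical (word, 1) pairs spread over two partitions
partitions = [
    [("THE", 1), ("PRIDE", 1), ("THE", 1)],
    [("AND", 1), ("THE", 1), ("PRIDE", 1)],
]
word_counts = reduce_by_key(partitions, lambda x, y: x + y)
print(word_counts)  # {'THE': 3, 'PRIDE': 2, 'AND': 1}
```

Combining inside each partition first is what keeps the amount of shuffled data small, and it is also why the function should be associative and commutative.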

In [35]:
%%time
# Because each transformation returns a new RDD instead of modifying data in place, the steps above can be chained:
counts_test_2 = text.flatMap(clean_split_line).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
counts_test_2.take(100)


CPU times: user 40.9 ms, sys: 7.41 ms, total: 48.4 ms
Wall time: 2.9 s


[('PRIDE', 52),
 ('UNITED', 22),
 ('OTHER', 227),
 ('WORLD', 68),
 ('NO', 501),
 ('GIVE', 127),
 ('LICENSE', 18),
 ('WWW', 9),
 ('ARE', 361),
 ('TO', 4245),
 ('DATE', 5),
 ('UPDATED', 2),
 ('ENGLISH', 1),
 ('CHARACTER', 65),
 ('PRODUCED', 13),
 ('ILLUSTRATED', 1),
 ('THAT', 1555),
 ('POSSESSION', 10),
 ('LITTLE', 187),
 ('KNOWN', 58),
 ('VIEWS', 11),
 ('CONSIDERED', 23),
 ('AS', 1193),
 ('ONE', 273),
 ('THEIR', 439),
 ('MR', 784),
 ('JUST', 72),
 ('TOLD', 69),
 ('ME', 427),
 ('ANSWER', 65),
 ('WHO', 288),
 ('TELL', 71),
 ('HEARING', 24),
 ('ENOUGH', 106),
 ('WHY', 53),
 ('YOUNG', 130),
 ('MONDAY', 8),
 ('FOUR', 35),
 ('MUCH', 327),
 ('AGREED', 13),
 ('MICHAELMAS', 2),
 ('SERVANTS', 13),
 ('WEEK', 29),
 ('NAME', 34),
 ('BINGLEY', 307),
 ('OH', 96),
 ('FIVE', 32),
 ('YEAR', 29),
 ('FINE', 31),
 ('CAN', 223),
 ('NONSENSE', 8),
 ('THEREFORE', 75),
 ('VISIT', 53),
 ('PERHAPS', 76),
 ('PARTY', 58),
 ('THAN', 285),
 ('CONSIDER', 33),
 ('YOUR', 446),
 ('ESTABLISHMENT', 6),
 ('WILLIAM', 46),
 (

In [75]:
text.take(1)

['The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen']

In [None]:
# The following won't raise an error until an action is performed

data_s1 = text.map(lambda x: len(x)/0)
data_s2 = data_s1.filter(lambda x: x>0)



In [1]:
# The following will generate an error because the transformation dividing by 0
# is now executed.
# The `ZeroDivisionError: division by zero` is buried among many Scala error messages.

data_s2.collect()
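
Python generators show the same deferred behavior on a single machine, which may help build intuition. This is an analogy rather than Spark code: building the pipeline does no work, and the error only appears when the results are consumed.

```python
lines = ["The Project Gutenberg eBook", "", "This eBook is for the use of anyone"]

# Building the pipeline performs no work, so no error is raised here.
lengths = (len(line) / 0 for line in lines)
positive = (x for x in lengths if x > 0)

# Consuming the pipeline plays the role of an action such as collect().
try:
    result = list(positive)
    error_name = None
except ZeroDivisionError as exc:
    result = None
    error_name = type(exc).__name__

print(error_name)  # ZeroDivisionError
```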

### Understanding Spark's Computation DAG

* Spark's lazy evaluation model is enabled by its use of a Directed Acyclic Graph (DAG) to represent transformations.
* Transformations within the DAG are optimized and executed only when an action is called.

* Consider the following code example:

    ```python
    data_2 = data_1.map(lambda x: x + 2)
    # Additional operations here
    data_3 = data_2.map(lambda x: x - 2)
    ```

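* In this scenario, the transformations are not immediately executed; Spark only records them in the DAG. Because the two `map` calls are narrow transformations, they can be pipelined into a single pass over each partition.
  * Since the two functions negate each other, `data_3` would effectively contain the same values as `data_1`.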
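
A toy model of this deferred execution, written in plain Python. `LazyDataset` is a hypothetical class invented for this sketch and is far simpler than a real RDD: it records each transformation in a lineage and replays the lineage only when `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations and runs them only on collect()."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of functions, applied in order

    def map(self, func):
        # Return a new dataset with a longer lineage; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (func,))

    def collect(self):
        # The "action": replay the recorded lineage over every element.
        results = []
        for item in self._source:
            for func in self._lineage:
                item = func(item)
            results.append(item)
        return results


calls = []  # records every function invocation so we can see when work happens


def add_two(x):
    calls.append("add")
    return x + 2


def subtract_two(x):
    calls.append("subtract")
    return x - 2


data_1 = LazyDataset([1, 2, 3])
data_2 = data_1.map(add_two)
data_3 = data_2.map(subtract_two)

calls_before = len(calls)     # 0: nothing has run yet
collected = data_3.collect()  # both functions run here, in one pass per element
print(calls_before, collected, len(calls))  # 0 [1, 2, 3] 6
```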
