# Spark Handbook
## Apache Spark: A Comprehensive Guide for Data Engineers
This handbook provides a comprehensive overview of [Apache Spark](glossary.md#apache-spark), a powerful [distributed data processing framework](glossary.md#distributed-data-processing-framework) designed for handling [big data](glossary.md#big-data) workloads with speed, ease of use, and flexibility.

## Table of Contents
- [What is Apache Spark?](#what-is-apache-spark)
- [Main Abstractions of Apache Spark](#main-abstractions-of-apache-spark)
- [Spark Architecture](#spark-architecture)
- [Spark Core Components and Libraries](#spark-core-components-and-libraries)
- [Why Do Data Engineers Need Spark?](#why-do-data-engineers-need-spark)
- [Typical Use Cases](#typical-use-cases)
- [RDDs and DataFrames in Apache Spark](#rdds-and-dataframes-in-apache-spark)
  - [Introduction](#1-introduction)
  - [RDD: Resilient Distributed Dataset](#2-rdd-resilient-distributed-dataset)
  - [DataFrames](#3-dataframes)
  - [Conversion Between RDD and DataFrame](#4-conversion-between-rdd-and-dataframe)
  - [RDD vs. DataFrame - Comparison](#5-rdd-vs-dataframe---comparison)
  - [Use Case Summary](#6-use-case-summary)
  - [Conclusion](#7-conclusion)
- [Apache Spark Local Setup](#apache-spark-local-setup)
  - [Installing Spark Locally (Native Installation)](#1-installing-spark-locally-native-installation)
  - [Using Docker to Set Up Spark](#2-using-docker-to-set-up-spark)

## What is Apache Spark?

**Apache Spark** is an open-source, distributed analytics engine designed for large-scale data processing and machine learning. It is renowned for its speed, versatility, and ability to scale from a single machine to large clusters of computers. Spark offers APIs in several popular languages, including Python (using PySpark), Scala, Java, and R, making it accessible to a wide audience of data professionals.

## Main Abstractions of Apache Spark

### Resilient Distributed Datasets (RDDs)

RDDs are the fundamental data structure in Spark. They represent immutable, distributed collections of objects partitioned across the cluster. RDDs support two types of operations:

- **Transformations:** Lazy operations that define a new RDD from an existing one (e.g., `map`, `filter`).
- **Actions:** Operations that trigger computation and return results (e.g., `collect`, `count`).

RDDs enable fault tolerance by tracking lineage, allowing the system to recompute lost data partitions in case of node failure.
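
The interplay of lazy transformations, actions, and lineage can be sketched in plain Python. `ToyRDD` below is a hypothetical, single-machine stand-in written for illustration; it is not Spark's API or implementation, and it ignores partitioning entirely. It only shows that transformations record steps, that an action runs them, and that the recorded chain is enough to recompute the data.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source: Optional[List] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None,
                 name: str = "source"):
        self._source = source   # only set on the root dataset
        self._parent = parent   # lineage pointer to the parent dataset
        self._op = op           # how to derive this dataset from its parent
        self.name = name

    # Transformations: record the step, compute nothing.
    def map(self, fn):
        return ToyRDD(parent=self, op=lambda rows: (fn(r) for r in rows), name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: (r for r in rows if pred(r)), name="filter")

    def lineage(self) -> List[str]:
        """Return the chain of steps from the root to this dataset."""
        steps = [] if self._parent is None else self._parent.lineage()
        return steps + [self.name]

    def _compute(self):
        # Replaying the lineage is also how lost data could be rebuilt.
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._compute())

    # Actions: trigger the computation.
    def collect(self) -> List:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # still 0: transformations are lazy
result = even_squares.collect()   # the action runs the whole chain
print(even_squares.lineage(), calls_before_action, result)
```

Running the sketch prints the lineage `['source', 'map', 'filter']`, `0` calls before the action, and the result `[4, 16]`.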

### Directed Acyclic Graph (DAG)

Spark uses a DAG to represent the sequence of transformations applied to RDDs. When a job is submitted, Spark’s DAG scheduler breaks the computation into stages, splitting wherever data must be shuffled between partitions. Each stage consists of tasks that can be executed in parallel, one per partition. This DAG-based execution plan enables optimization and efficient job scheduling.
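
How a chain of operations is cut into stages can be illustrated with a small sketch. It assumes the usual rule that a stage boundary falls wherever a shuffle (a "wide" operation) is required, and it simplifies by placing the whole wide operation in the downstream stage. `split_into_stages` and `WIDE_OPS` are hypothetical names for this illustration, not part of Spark.

```python
from typing import List

# Operations that need data from many partitions and therefore force a shuffle.
# An illustrative subset, not an exhaustive list.
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "distinct", "repartition"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear chain of operations into stages.

    Narrow operations are pipelined into the current stage;
    a wide operation closes it and opens a new one.
    """
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary
        stages[-1].append(op)
    return stages


word_count = ["textFile", "flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(word_count)
for i, stage in enumerate(stages):
    print(f"stage {i}: {' -> '.join(stage)}")
```

For the word-count chain this yields two stages: `textFile -> flatMap -> map`, then `reduceByKey -> filter`.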

## Spark Architecture

Apache Spark follows a **master-worker** architecture composed of several key components:

### 1. Spark Driver

The driver is the central coordinator of a Spark application. It:
- Maintains the lifecycle of the application.
- Converts user code into a logical execution plan (DAG).
- Schedules tasks and monitors their execution.
- Communicates with the cluster manager to request resources.
- Collects and aggregates results from executors.

The driver contains components such as SparkContext, DAG Scheduler, Task Scheduler, and Block Manager.

### 2. Cluster Manager

The cluster manager oversees resource allocation in the cluster. It manages CPUs, memory, and executors across worker nodes. Spark can run on various cluster managers such as:
- Apache YARN (Hadoop ecosystem)
- Apache Mesos (deprecated since Spark 3.2)
- Kubernetes
- Spark's standalone cluster manager

### 3. Executors

Executors are worker processes launched on cluster nodes. Each executor:
- Executes tasks assigned by the driver.
- Performs computations on partitions of data.
- Caches data in memory or on disk as needed.
- Reports task status and results back to the driver.

Executors live for the duration of a Spark application and enable parallel task execution.
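
The division of labor between driver and executors can be approximated on one machine. In the sketch below a thread pool stands in for the executors and the calling code stands in for the driver. `make_partitions` and `run_task` are hypothetical helpers for this illustration; real executors are separate JVM processes, usually on other machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def make_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into roughly equal contiguous slices, one per partition."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition: List[int]) -> int:
    """One task: compute a partial result over a single partition."""
    return sum(x * x for x in partition)


data = list(range(1, 11))
partitions = make_partitions(data, num_partitions=3)

# The pool plays the role of the executors; this code plays the driver,
# handing out one task per partition and combining the partial results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(run_task, partitions))

total = sum(partial_sums)
print(partitions, partial_sums, total)
```

Each task sees only its own partition, and the driver-side code needs only the small partial results (`[30, 110, 245]`) to produce the total of `385`.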

### 4. Worker Nodes

Worker nodes are the physical or virtual machines in the cluster where executors run. Each worker node hosts one or more executors, which run tasks in parallel.

### 5. SparkContext

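SparkContext is the entry point through which the driver interacts with the cluster. It lives inside the driver process and, since Spark 2.0, is usually obtained from a `SparkSession` (as `spark.sparkContext`). It: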
- Connects to the cluster manager.
- Creates RDDs and manages their lifecycle.
- Coordinates job execution.

---

## Spark Core Components and Libraries

### Spark SQL

Spark SQL is Spark’s module for working with structured data. It supports:
- Queries in standard SQL.
- Queries in Hive Query Language (HiveQL).
- Numerous data sources, including Hive tables, Parquet, and JSON.

Spark SQL integrates SQL queries with Spark’s programmatic APIs (RDDs, DataFrames) in Python, Scala, and Java. This tight integration supports complex analytics and interactive querying within a unified application framework.

### MLlib

MLlib is Spark’s scalable machine learning library. It provides:
- Algorithms for classification, regression, clustering, and collaborative filtering.
- Utilities for model evaluation and data import.
- Low-level primitives such as a generic gradient descent optimization algorithm.

### GraphX

GraphX is Spark’s graph processing library, enabling:
- Creation and manipulation of graphs with properties on vertices and edges.
- Graph-parallel computations like PageRank and triangle counting.
- Operators such as subgraph extraction and vertex mapping.

## Why Do Data Engineers Need Spark?

### 1. Speed and Performance
- Spark performs in-memory computing, reducing costly disk read/write operations.
- For iterative and interactive workloads that fit in memory, it can be up to 100× faster than Hadoop MapReduce.

### 2. Scalability
- Spark scales from a single machine to thousands of cluster nodes.
- Handles petabyte-scale data through distributed processing.

### 3. Unified Processing Engine
- Supports batch processing, real-time streaming, SQL querying, machine learning, and graph analytics all within one platform.

### 4. Language Flexibility and Ease of Use
- Provides APIs in Python, Scala, Java, and R.
- Abstractions at several levels (low-level RDDs, higher-level DataFrames and Datasets) simplify complex data transformations.

### 5. Ecosystem and Integration
- Integrates with Hadoop HDFS, Amazon S3, Apache Kafka, and other platforms.
- Supports multiple cluster managers for flexible deployment.

### 6. Essential for Modern Workloads
- Enables ETL pipelines, real-time dashboards, machine learning workflows, and large-scale interactive queries.

---

## Typical Use Cases
- ETL pipelines for big data ingestion and transformation
- Scalable machine learning model training and deployment
- Real-time data stream processing (e.g., fraud detection, log analysis)
- Graph analytics for social network analysis and recommendations

---

# RDDs and DataFrames in Apache Spark

## 1. Introduction
Apache Spark has two core abstractions for working with distributed data:
- **RDD (Resilient Distributed Dataset):** The original low-level distributed data structure.
- **DataFrame:** A high-level abstraction built on top of RDDs, offering a tabular data structure similar to a database table or Pandas DataFrame.

## 2. RDD: Resilient Distributed Dataset

### 2.1 What is an RDD?
An RDD is an immutable distributed collection of objects that can be processed in parallel.

### 2.2 Key Features
- Fault-tolerant
- Lazy evaluation
- Supports transformations (`map`, `filter`, etc.) and actions (`collect`, `count`, etc.)
- Type-safe (in Scala/Java)
- No built-in schema

### 2.3 Creating or Loading Data into an RDD

#### Creating an RDD (PySpark):

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])

#### Loading Data into an RDD

In [None]:
# Load file (skip header)
rdd = sc.textFile("./data/customers.csv")
header = rdd.first()
rdd_data = rdd.filter(lambda line: line != header)

### 2.4 RDD Transformation and Actions

In [None]:
# Split CSV into fields
customers_rdd = rdd_data.map(lambda line: line.split(","))
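
A plain `split(",")` is enough for simple files, but it breaks when a quoted field contains a comma. One possible alternative, sketched here with Python's standard `csv` module, parses each line individually. `parse_csv_line` is a hypothetical helper and the sample line is made up; the real `customers.csv` may look different.

```python
import csv
from typing import List


def parse_csv_line(line: str) -> List[str]:
    """Parse one CSV line, honoring quoted fields that contain commas."""
    return next(csv.reader([line]))


# Hypothetical sample line with a comma inside a quoted last name.
sample = '7,Ana,"Silva, Jr.",ana@example.com,'

naive_fields = sample.split(",")        # the quoted name is split in two
parsed_fields = parse_csv_line(sample)  # five fields, name intact

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

In the pipeline above, this would mean calling `rdd_data.map(parse_csv_line)` in place of the `split` lambda. Quoted fields that span several lines are still not handled, because `textFile` yields one record per line.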

In [None]:
# View Sample
customers_rdd.take(3)

In [None]:
# Count missing join dates
missing_dates = customers_rdd.filter(lambda x: x[4] == "").count()
print(f"Missing join dates: {missing_dates}")

In [None]:
# Extract customer names
names = customers_rdd.map(lambda x: f"{x[1]} {x[2]}").collect()
print(names)

## 3. DataFrames

### 3.1 What is a DataFrame?
A DataFrame is a distributed collection of data organized into named columns, like a SQL table.

### 3.2 Key Features
- Schema-aware (columns and types)
- Optimized by the Catalyst optimizer
- Supports SQL queries via `spark.sql()`
- Interoperable with RDDs and Pandas
- Better performance than RDDs for most use cases

### 3.3 Creating or Loading Data into a DataFrame

#### Reading CSV into DataFrame

In [None]:
df = spark.read.option("header", True).csv("./data/customers.csv")
df.show()

### 3.4 Common DataFrame Operations

In [None]:
# Print schema
df.printSchema()

In [None]:
# Select specific columns
df.select("first_name", "email").show()

In [None]:
# Filter customers with missing join dates
df.filter(df.join_date.isNull()).show()

In [None]:
# Count customers that have a join date
df.filter(df.join_date.isNotNull()).count()

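In [None]:
# Build full names with the DataFrame API
# (assumes the CSV header contains first_name and last_name columns)
from pyspark.sql.functions import concat_ws

df.select(concat_ws(" ", "first_name", "last_name").alias("full_name")).show()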

## 4. Conversion Between RDD and DataFrame

### From RDD to DataFrame

In [None]:
from pyspark.sql import Row

# Convert RDD to Row RDD
row_rdd = customers_rdd.map(lambda x: Row(
    customer_id=int(x[0]),
    first_name=x[1],
    last_name=x[2],
    email=x[3],
    join_date=x[4] if x[4] != "" else None
))

df_from_rdd = spark.createDataFrame(row_rdd)
df_from_rdd.show()

### From DataFrame to RDD

In [None]:
rdd_from_df = df.rdd
rdd_from_df.take(3)

## 5. RDD vs. DataFrame - Comparison

| Feature           | RDD                        | DataFrame               |
| ----------------- | -------------------------- | ----------------------- |
| Abstraction Level | Low                        | High                    |
| API Style         | Functional                 | SQL-like                |
| Schema            | Not enforced               | Schema-aware            |
| Performance       | Lower                      | Optimized with Catalyst |
| Best for          | Custom, fine-grained logic | Queries, aggregations   |

## 6. Use Case Summary

| Task                                     | Recommended |
| ---------------------------------------- | ----------- |
| Load structured CSV data                 | DataFrame   |
| Filter or select fields efficiently      | DataFrame   |
| Custom parsing, transformation, or logic | RDD         |
| SQL-like querying and grouping           | DataFrame   |

## 7. Conclusion

- Use DataFrames when working with structured data like CSV, JSON, or Parquet.
- Use RDDs when you need custom logic, fine-grained control over how data is processed, or low-level transformations that the DataFrame API cannot express.

The examples in this section, built on `customers.csv`, show how both abstractions work and when to use each.

# Apache Spark Local Setup

In this section, we'll cover two common ways to set up Apache Spark on a local development machine:

1. **Installing Spark Locally (Native Installation)**
2. **Using Docker to Set Up Spark**

## 1. Installing Spark Locally (Native Installation)

This method involves manually installing Spark and its dependencies on your machine.

### 1.1 Prerequisites
- **Java (JDK 8, 11, or 17):** Spark runs on the JVM; Spark 3.4 supports these three Java versions.
- **Python 3.7 or later:** Required for PySpark.

### 1.2 Download and Install Spark
- Download Spark from the [Official Apache Spark website](https://spark.apache.org/downloads.html).
    - Choose a version (e.g., Spark 3.4.1) and a pre-built package for Hadoop (e.g., "Pre-built for Apache Hadoop 3.3 and later").

#### Extract the archive to a directory of your choice

**On Linux:**

In [None]:
tar -xzf spark-3.4.1-bin-hadoop3.tgz -C /path/to/your/directory

**On Windows:**
- Use a tool like 7-Zip or WinRAR.
    - Right-click the downloaded `.tgz` file
    - Select "Extract Here" or "Extract to spark-3.4.1-bin-hadoop3"
    - Move the extracted folder to your desired location

### 1.3 Set Environment Variables
Set the following environment variables so your system can find Spark and Java.

In [None]:
# Linux (add to ~/.bashrc or ~/.zshrc)
export SPARK_HOME=/path/to/your/directory/spark-3.4.1-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
export JAVA_HOME=/path/to/your/java

On Windows, set environment variables via System Properties > Environment Variables.
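
Before moving on, it can help to confirm that the variables are actually visible to Python. The sketch below is one way to do that; `missing_spark_vars` and `spark_bin_on_path` are hypothetical helpers, and the sample environment uses a made-up `/opt/spark` path. Pass `os.environ` instead to check your real environment.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("SPARK_HOME", "JAVA_HOME")


def missing_spark_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def spark_bin_on_path(env: Mapping[str, str]) -> bool:
    """Check whether $SPARK_HOME/bin appears as an entry of PATH."""
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        return False
    expected = os.path.normpath(os.path.join(spark_home, "bin"))
    entries = env.get("PATH", "").split(os.pathsep)
    return any(os.path.normpath(e) == expected for e in entries if e)


# Hypothetical environment in which JAVA_HOME was forgotten.
sample_env = {
    "SPARK_HOME": "/opt/spark",
    "PATH": os.pathsep.join(["/usr/bin", "/opt/spark/bin"]),
}

print("Missing variables:", missing_spark_vars(sample_env))
print("Spark bin on PATH:", spark_bin_on_path(sample_env))
```

With the sample environment, the check reports `JAVA_HOME` as missing and confirms that the Spark `bin` directory is on `PATH`.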

### 1.4 Install Required Python Libraries

In [None]:
!pip install pyspark findspark

### 1.5 Test Your Installation
Start the PySpark shell from a terminal (the shell is interactive, so it is not suited to a notebook cell):

In [None]:
pyspark

Or test with a small script:

In [None]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
print(spark.range(5).collect())

---
## 2. Using Docker to Set Up Spark

Alternatively, you can run Spark inside Docker containers. This avoids manual setup and gives you a clean, reproducible environment.

### 2.1 Prerequisites
- Docker installed on your system ([Install Docker](https://docs.docker.com/get-docker/))

### 2.2 Standalone Setup
#### 2.2.1 Pull a Spark Docker image
You can use an existing image from Docker Hub or customize it using a Dockerfile.

In [None]:
docker pull bitnami/spark

#### 2.2.2 Run a Spark Container
Start a Spark standalone container:

In [None]:
docker run -it bitnami/spark pyspark

### 2.3 Set Up Spark Cluster
You can create a local Spark cluster with [`docker-compose.yaml`](./docker-compose.yaml).

#### 2.3.1 Start the Cluster
Run the following command to start the cluster:

In [None]:
docker compose up -d

#### 2.3.2 Access the Spark Web UI
- Master: [http://localhost:8080](http://localhost:8080)
- Worker: [http://localhost:8081](http://localhost:8081)

#### 2.3.3 Submit Jobs
You can submit jobs using the spark-submit tool or run a PySpark shell inside the container:

In [None]:
docker exec -it spark-master pyspark --master spark://spark-master:7077

#### 2.3.4 Setting Up Jupyter Notebook Container for Spark (Optional)

Running a Jupyter Notebook container alongside your Spark services is a great way to interactively test Spark code using PySpark.

- Uncomment the `jupyter` service block in the [`docker-compose.yaml`](./docker-compose.yaml) file.
- Ensure the `notebooks` directory exists in the same location as your `docker-compose.yaml`:
    ```bash
    mkdir notebooks
    ```
  This directory will be mounted into the Jupyter container so that your notebooks are saved persistently.

- To start the whole cluster (including Jupyter):
    ```bash
    docker compose up -d
    ```
- To start only the Jupyter container (after the cluster is running):
    ```bash
    docker compose up -d jupyter
    ```
- You can now access the notebook UI at: [http://localhost:8888](http://localhost:8888)

  Use the token shown in the terminal (when the Jupyter container starts) to log in.

#### 2.3.5 Test Notebook Code

In a new notebook, run:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NotebookSpark") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

spark.range(5).show()