# 🔥 Introduction to Apache Spark
Welcome to the world of **Apache Spark** — an open-source, distributed computing system that enables large-scale data processing and analytics. This notebook is designed for **beginners** and will help you understand Spark concepts step by step.

## What is Apache Spark?
**Apache Spark** is an open-source, unified analytics engine for large-scale data processing. It provides high-level APIs in Python, Java, Scala, and R, and an optimized engine that supports general computation graphs.

In simpler terms, Spark lets you process large datasets **in memory** across multiple machines. It was developed at **UC Berkeley’s AMPLab** and later became an Apache project.

### Example:
Imagine you have terabytes of log data stored across multiple servers. Spark helps you load, process, and analyze this data, often faster than disk-based systems like Hadoop MapReduce because intermediate results can stay in memory.

## Spark Research
Apache Spark originated from research at the **AMPLab at UC Berkeley**. The main research paper, *Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing*, introduced a new way to process large-scale data efficiently.

The key research contribution was the **RDD (Resilient Distributed Dataset)** abstraction — a distributed collection of elements that can be operated on in parallel.

### Reference:
- Zaharia, M., Chowdhury, M., Das, T., et al. *Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing* (NSDI 2012)

## Spark Timeline
| Year | Milestone |
|------|------------|
| 2009 | Spark project started at UC Berkeley AMPLab |
| 2010 | First Spark paper, *Spark: Cluster Computing with Working Sets*, published (HotCloud 2010) |
| 2012 | RDD paper published at NSDI 2012 |
| 2013 | Spark became an Apache Incubator project |
| 2014 | Apache Spark 1.0 released |
| 2015–2018 | Spark SQL, DataFrames, MLlib, and Streaming matured |
| 2020+ | Spark 3.x introduced Adaptive Query Execution and better Python integration |

## Spark vs Hadoop
| Feature | Apache Spark | Hadoop MapReduce |
|----------|---------------|-----------------|
| **Processing Model** | In-memory computation | Disk-based computation |
| **Speed** | Up to 100x faster on some in-memory workloads (project-reported figure) | Slower due to disk I/O between stages |
| **Ease of Use** | High-level APIs in Python, Scala, Java, R | Lower-level API, mostly Java |
| **Streaming** | Near-real-time micro-batches (Spark Streaming) | Batch only |
| **Fault Tolerance** | Through RDD lineage | Through data replication |
| **Use Case** | Machine Learning, Streaming, ETL | Batch processing |

## Working with Spark
When you start using Spark, you typically create a **SparkSession** — your entry point to interact with all Spark features.

Simplified workflow:
1. Load data into Spark (from CSV, JSON, Parquet, etc.)
2. Perform transformations (filter, map, join, etc.)
3. Execute actions (collect, count, save, etc.)

Spark runs on a **cluster manager** (its built-in standalone manager, YARN, Kubernetes, or Mesos, which is deprecated in recent releases) and executes computations in parallel across the cluster's nodes.

## Spark RDDs (Resilient Distributed Datasets)
**RDDs** are the core data structure of Spark. They are immutable, distributed collections of objects that can be processed in parallel.

Each RDD is divided into **partitions**, and each partition is processed by one task on a cluster node.
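
To make the partition-to-task mapping concrete, here is a plain-Python toy model (not Spark code). A list is cut into partitions and a thread pool stands in for the cluster, running one task per partition. The helper names `split_into_partitions` and `sum_partition` are made up for this sketch, and real Spark chooses partition boundaries in its own way.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Cut a list into roughly equal contiguous chunks."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def sum_partition(partition):
    """The 'task': it sees only the elements of its own partition."""
    return sum(partition)


data = list(range(1, 11))                    # the numbers 1..10
partitions = split_into_partitions(data, 3)  # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]

# One task per partition; the thread pool plays the role of the executors.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum_partition, partitions))

total = sum(partial_sums)  # the driver combines the per-partition results
print(partitions, partial_sums, total)
```

Each task works independently of the others, which is what lets Spark spread partitions over many nodes.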

### Example:
Loading a text file as an RDD:
```python
from pyspark import SparkContext
lines = SparkContext.getOrCreate().textFile('data.txt')  # one RDD element per line; nothing is read yet
```
Now you can perform transformations (like filtering lines containing a keyword) or actions (like counting total lines).

## RDD Operations
There are two main types of operations on RDDs:

1. **Transformations** – Create a new RDD from an existing one (lazy).
2. **Actions** – Trigger computation and return a result to the driver program or write to storage.

### Common Transformations and Actions
| Type | Operation | Description |
|------|------------|-------------|
| **Transformation** | `map()` | Applies a function to each element and returns a new RDD |
| **Transformation** | `filter()` | Returns elements that satisfy a condition |
| **Transformation** | `flatMap()` | Maps each element to zero or more outputs and flattens them into one RDD |
| **Transformation** | `union()` | Combines two RDDs (duplicates are kept) |
| **Transformation** | `distinct()` | Removes duplicates |
| **Transformation** | `groupByKey()` | Groups data with the same key |
| **Transformation** | `reduceByKey()` | Aggregates values for each key |
| **Action** | `collect()` | Returns all elements to the driver |
| **Action** | `count()` | Returns the number of elements |
| **Action** | `first()` | Returns the first element |
| **Action** | `take(n)` | Returns first n elements |
| **Action** | `saveAsTextFile()` | Saves RDD content to external storage |
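
The difference between `map()` and `flatMap()` is easiest to see on a tiny input. The sketch below uses plain Python lists in place of RDDs, so it only illustrates the shape of the results, not distributed execution. The sample sentences are made up.

```python
from itertools import chain

lines = ["spark is fast", "rdds are immutable"]

# map-like: exactly one output per input, so each line becomes a list of words
mapped = [line.split() for line in lines]

# flatMap-like: each input may yield many outputs, flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['spark', 'is', 'fast'], ['rdds', 'are', 'immutable']]
print(flat_mapped)  # ['spark', 'is', 'fast', 'rdds', 'are', 'immutable']
```

In PySpark the corresponding calls would be `rdd.map(lambda line: line.split())` and `rdd.flatMap(lambda line: line.split())`.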

## Lazy Evaluation
Spark does **not** execute transformations immediately. Instead, it builds a **DAG (Directed Acyclic Graph)** of transformations.

The actual computation happens only when an **action** (like `count()` or `collect()`) is called.

💡 **Example:**
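```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile('data.txt')                  # lazy: nothing is read yet
filtered = rdd.filter(lambda x: 'error' in x)  # lazy: only recorded in the DAG
filtered.count()  # Computation happens here!
```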
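This design lets Spark plan the whole job before running it. For example, it can pipeline consecutive transformations such as `map()` and `filter()` into a single stage, so each partition is read only once.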
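
The idea of recording transformations and running them only on an action can be modeled in a few lines of plain Python. This is a toy stand-in, not how Spark is implemented: the class `LazyDataset` and the helper `has_error` are hypothetical, and the recorded plan here is a simple linear chain instead of a general DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)   # the recorded plan

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        # Stream each element through every recorded step in one pass.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def count(self):
        """Action: the only place where work actually happens."""
        return sum(1 for _ in self._run())


calls = []  # records every line the predicate is actually applied to


def has_error(line):
    calls.append(line)
    return "error" in line


logs = LazyDataset(["ok", "error: disk", "ok", "error: net"])
errors = logs.filter(has_error)    # recorded only
calls_before_action = len(calls)   # the predicate has not run yet
n_errors = errors.count()          # triggers the computation
print(calls_before_action, n_errors, len(calls))
```

The predicate is not called at all until `count()` runs, which mirrors the behavior of the PySpark snippet above.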

## Passing Functions to Spark
Spark allows you to pass **lambda functions** or named functions to transformations.

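Example with a named function:
```python
from pyspark import SparkContext

def is_even(x):
    return x % 2 == 0

sc = SparkContext.getOrCreate()
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
even_numbers = numbers.filter(is_even)  # lazy; even_numbers.collect() returns [2, 4, 6]
```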

## Working with Key-Value Pairs
Many operations in Spark use **pair RDDs** (key-value pairs).

### Example:
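```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
data = [('A', 10), ('B', 20), ('A', 15)]
rdd = sc.parallelize(data)
result = rdd.reduceByKey(lambda a, b: a + b)  # sums the values for each key
print(sorted(result.collect()))  # [('A', 25), ('B', 20)]
```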

## Shuffle Operations
**Shuffle** is the process of redistributing data across partitions — often needed during operations like `groupByKey()` or `reduceByKey()`.

Shuffles involve disk and network I/O, which makes them expensive.

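💡 **Tip:** When aggregating values per key, prefer `reduceByKey()` over `groupByKey()`. It combines values within each partition before the shuffle, so less data crosses the network.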
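
Why `reduceByKey()` tends to shuffle less can be seen in a plain-Python toy model. Two lists stand in for two partitions, and "shuffled records" are the key-value pairs that would leave a partition. The function names are hypothetical, and the count ignores record sizes and serialization, so treat it as an illustration only.

```python
from collections import defaultdict

# Two 'partitions' of (key, value) pairs living on different nodes.
partitions = [
    [("A", 1), ("B", 1), ("A", 1), ("A", 1)],
    [("A", 1), ("B", 1), ("B", 1), ("C", 1)],
]


def shuffle_group_by_key(parts):
    """groupByKey-style: every record is sent across the network as-is."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(parts):
    """reduceByKey-style: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # combine before the shuffle
        shuffled.extend(local.items())   # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


group_totals, group_records = shuffle_group_by_key(partitions)
reduce_totals, reduce_records = shuffle_reduce_by_key(partitions)
print(group_totals, group_records)
print(reduce_totals, reduce_records)
```

Both strategies produce the same totals, but in this toy input the second one moves 5 records instead of 8. The gap grows when keys repeat often within a partition.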

## RDD Persistence (Caching)
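By default, an RDD is recomputed from its lineage each time you run an action on it. To avoid this, Spark lets you **cache** or **persist** an RDD: `cache()` keeps it in memory, while `persist()` accepts a storage level (for example, memory and disk).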

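```python
from pyspark import SparkContext

rdd = SparkContext.getOrCreate().textFile('data.txt')
rdd.cache()  # only marks the RDD; it is stored when the first action computes it
```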
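
The cost that caching avoids can be shown with a plain-Python toy model. The class `ToyRDD` and the function `load_lines` are hypothetical names for this sketch. It counts how often the data is rebuilt and does not reflect Spark's actual storage or eviction logic.

```python
class ToyRDD:
    """Toy model of an RDD that is recomputed per action unless cached."""

    def __init__(self, compute):
        self._compute = compute      # function that rebuilds the data
        self._use_cache = False
        self._cached = None
        self.computations = 0        # how many times the data was rebuilt

    def cache(self):
        self._use_cache = True       # only a marker; nothing is computed yet
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]


def load_lines():
    # Hypothetical expensive step, e.g. reading and parsing a large file.
    return ["line one", "line two", "line three"]


uncached = ToyRDD(load_lines)
uncached.count()
uncached.first()

cached = ToyRDD(load_lines).cache()
cached.count()
cached.first()

print(uncached.computations, cached.computations)
```

Without caching, each of the two actions rebuilds the data. With caching, the first action computes it and the second reuses the stored copy.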

## PySpark
**PySpark** is the Python API for Apache Spark.

It allows Python developers to leverage Spark’s distributed computation capabilities without writing Scala or Java code.

PySpark provides modules for:
- Spark SQL
- DataFrames
- Streaming
- Machine Learning (MLlib)
- Graph processing (GraphX itself has no Python API; the separate GraphFrames package is the usual Python option)

## Working with PySpark in Jupyter Notebooks (Interactive Environment)
Jupyter Notebooks provide an **interactive environment** for running PySpark commands.

### Steps:
1. Install PySpark using pip (a Java runtime must also be available on the machine):
   ```bash
   pip install pyspark
   ```
2. Launch Jupyter Notebook:
   ```bash
   jupyter notebook
   ```
3. Create a SparkSession inside the notebook:
   ```python
   from pyspark.sql import SparkSession
   spark = SparkSession.builder.appName('MyApp').getOrCreate()
   ```