# PySpark and Big Data Overview

## Understanding Big Data

**Big Data** refers to datasets so large or complex that traditional, single-machine data processing applications cannot store or process them efficiently. These datasets come from sources such as social media, sensors, and transaction systems.

### The 4 Vs of Big Data

| Dimension | Description |
|------------|--------------|
| **Volume** | Refers to the massive amount of data generated every second. Organizations deal with terabytes to petabytes of data. |
| **Velocity** | Describes the speed at which new data is generated, collected, and processed, often in real time. |
| **Variety** | Represents the different types of data: structured, semi-structured, and unstructured (text, images, audio, etc.). |
| **Veracity** | Refers to the trustworthiness and accuracy of the data collected. |

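### Data Size Units

The table below uses the binary convention common in computing, where each unit is 1,024 (2^10) times the previous one. The SI (decimal) convention uses a factor of 1,000 instead, so storage vendors and operating systems may report slightly different numbers for the same capacity.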

| Unit | Size |
|------|------|
| 1 Byte | 8 bits |
| 1 Kilobyte (KB) | 1,024 Bytes |
| 1 Megabyte (MB) | 1,024 KB |
| 1 Gigabyte (GB) | 1,024 MB |
| 1 Terabyte (TB) | 1,024 GB |
| 1 Petabyte (PB) | 1,024 TB |
| 1 Exabyte (EB) | 1,024 PB |
| 1 Zettabyte (ZB) | 1,024 EB |
| 1 Yottabyte (YB) | 1,024 ZB |
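
The conversions in the table can be sketched in a few lines of Python. This is a minimal illustration that assumes the binary (1,024-based) convention shown above; the names `UNITS`, `to_bytes`, and `human_readable` are hypothetical helpers, not part of any library.

```python
# Unit names in the same order as the table above (binary convention).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1,024 bytes)."""
    exponent = UNITS.index(unit.upper())
    return value * 1024 ** exponent


def human_readable(num_bytes):
    """Format a byte count using the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(to_bytes(1, "TB"))            # 1099511627776
print(human_readable(5 * 1024**5))  # 5.00 PB
```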

### Value Proposition of Big Data

Big Data allows organizations to:
- Gain deeper business insights
- Improve operational efficiency
- Personalize customer experiences
- Enable predictive analytics and real-time decision making

---

## Apache Spark in Big Data Processing

Apache Spark is an open-source **distributed data processing framework** designed for speed, scalability, and ease of use.

### Key Features of Spark
- **In-memory computation** that caches intermediate data in RAM, reducing disk I/O for iterative and interactive workloads.
- **Supports multiple languages** – Python (PySpark), Scala, Java, R, and SQL.
- **Rich APIs** for streaming, machine learning, and graph processing.
- **Scales horizontally** across a cluster of machines.

### Spark Ecosystem Components
- **Spark Core** – The foundation, responsible for distributed computation.
- **Spark SQL** – For querying structured data with SQL and the DataFrame API.
- **Spark Streaming** – For near-real-time processing of data streams.
- **MLlib** – For machine learning operations.
- **GraphX** – For graph processing (exposed through the Scala and Java APIs, not PySpark).

---

## RDD – Resilient Distributed Dataset

An RDD is the **fundamental data structure in Spark**. It represents an **immutable, distributed collection of objects** that can be processed in parallel.

### Characteristics of RDDs
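- **Resilient** – Fault-tolerant: lost partitions are recomputed automatically from the RDD's lineage (the recorded chain of transformations).
- **Distributed** – Data is split into partitions spread across multiple nodes.
- **Immutable** – Once created, an RDD cannot be modified; transformations produce new RDDs.
- **Lazy Evaluation** – Transformations are not executed until an action is triggered.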

### RDD Operations

1. **Transformations** – Return a new RDD from an existing one (lazy).
   - Examples: `map()`, `filter()`, `flatMap()`, `distinct()`
2. **Actions** – Trigger computation and either return a result to the driver or write it to storage.
   - Examples: `count()`, `collect()`, `first()`, `saveAsTextFile()`

---


## Setting up Spark using Docker

The following steps explain how to set up **Apache Spark** using Docker.

> ⚙️ **Note:** Docker must be installed on your machine before running these commands.

### Step 1: Pull the Spark Docker Image
```bash
docker pull jupyter/all-spark-notebook:95f855f8e55f
```

### Step 2: Run the Spark Container
```bash
docker run -p 8888:8888 --name spark jupyter/all-spark-notebook:95f855f8e55f
```

This command:
- Publishes port **8888** of the container on port **8888** of your local machine.
- Starts a container named **spark**.
- Launches Jupyter Notebook with Spark pre-configured.

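Once the container is running, open Jupyter Notebook in your browser using the URL displayed in the terminal. It typically starts with `http://127.0.0.1:8888` and includes an access token as a query parameter, so copy the full URL rather than typing the address by hand.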

---
