# RDD

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark. It is a fault-tolerant, distributed collection of elements that can be operated on in parallel.

**Key Characteristics:**

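- Immutable: a transformation returns a new RDD instead of modifying the existing one
- Lazily evaluated: transformations only run when an action needs their result
- Fault tolerant: lost partitions are recomputed from lineage information
- Partitioned across cluster nodes
- Can be cached in memory for reuse across actions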

### SparkContext and SparkConf


`SparkContext` is the entry point for Spark functionality. It represents the connection to a Spark cluster and is used to create RDDs.

#### `SparkConf`

- Holds the configuration for a Spark application as key-value pairs and is passed to `SparkContext`

**Common settings:**

- `setMaster("local[*]")` – Use local mode with all cores
- `setAppName("RDDExample")` – Application name

### Transformations

Transformations create a new RDD from an existing one. They are lazy – not executed until an action is triggered.

| Transformation      | Description                                                              |
| ------------------- | ------------------------------------------------------------------------ |
| `map(func)`         | Returns a new RDD by applying `func` to each element                     |
| `filter(func)`      | Keeps only the elements for which `func` returns true                    |
| `flatMap(func)`     | Like `map`, but `func` returns a sequence and the results are flattened  |
| `distinct()`        | Removes duplicate elements                                               |
| `union(rdd)`        | Combines the elements of two RDDs, keeping duplicates                    |
| `groupByKey()`      | Groups the values for each key in an RDD of (key, value) pairs           |
| `reduceByKey(func)` | Aggregates the values for each key using `func`                          |
| `sortBy(func)`      | Sorts the RDD by the key that `func` computes for each element           |


### Actions

Actions trigger computation and return results or write data.

| Action                 | Description                                                          |
| ---------------------- | -------------------------------------------------------------------- |
| `collect()`            | Returns all elements to the driver                                   |
| `count()`              | Returns the number of elements                                       |
| `first()`              | Returns the first element                                            |
| `take(n)`              | Returns the first `n` elements                                       |
| `reduce(func)`         | Reduces elements using a commutative and associative binary function |
| `saveAsTextFile(path)` | Writes the RDD as text files under the directory `path`              |



Reference: [Spark RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
