# Add Project Root to `sys.path`

In [1]:
# Automatically reload modules when they change
%load_ext autoreload
%autoreload 2

Find the project root folder and add it to `sys.path` so that project modules can be imported

In [2]:
import os
import sys

def find_project_root(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """
    Walks up from start_path until it finds one of the marker files/folders.
    Returns the path of the project root.
    """
    if start_path is None:
        start_path = os.getcwd()

    current_path = os.path.abspath(start_path)

    while True:
        # check if any marker exists in current path
        if any(os.path.exists(os.path.join(current_path, marker)) for marker in markers):
            return current_path

        new_path = os.path.dirname(current_path)  # parent folder
        if new_path == current_path:  # reached root of filesystem
            raise FileNotFoundError(f"None of the markers {markers} found at or above {start_path}")
        current_path = new_path

project_root = find_project_root()
print("Project root:", project_root)

if project_root not in sys.path:
    sys.path.insert(0, project_root)


Project root: c:\ds_analytics_projects\darshil_course\apache-pyspark\darshil-pyspark
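The same upward search can be sketched with `pathlib`, which some readers find easier to follow. This is only an illustrative variant: the name `find_project_root_pathlib` and the temporary folder layout are assumptions made for the demo and are not part of the notebook.

```python
import tempfile
from pathlib import Path


def find_project_root_pathlib(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """Return the nearest folder at or above start_path that contains a marker."""
    start = Path(start_path or Path.cwd()).resolve()
    # Check the start folder first, then each ancestor up to the filesystem root
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"None of the markers {markers} found at or above {start}")


# Throwaway layout (hypothetical): <tmp>/pyproject.toml and <tmp>/notebooks/rdd
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "pyproject.toml").touch()
    nested = root / "notebooks" / "rdd"
    nested.mkdir(parents=True)
    found = find_project_root_pathlib(nested)
    found_matches = found == root

print(found_matches)  # True
```

Starting the search from a nested folder shows that the function stops at the first ancestor containing a marker, which is the behavior the notebook relies on.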


# Import Libraries

Import packages

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

Import a project module (resolvable because the project root is now on `sys.path`)

In [4]:
from utils.file_utils import get_project_path

In [5]:
from pyspark.sql import SparkSession

# Create SparkSession
# enableHiveSupport() already sets spark.sql.catalogImplementation to "hive",
# so a separate .config(...) call for it is not needed
spark = SparkSession.builder \
    .appName("RDD") \
    .enableHiveSupport() \
    .getOrCreate()

# 📒 Resilient Distributed Dataset (RDD)

**Subject:** Apache Spark

**Topics:** #spark #bigdata #rdd

---

### 🔎 Step 1: What is an RDD?

- **RDD** = Resilient Distributed Dataset
- The **original Spark API** (before DataFrames).
- Represents an **immutable, partitioned collection** of objects that can be processed in parallel across the cluster.
- Each record is just a **Python/Scala/Java object** (unlike DataFrames, which have schemas).

👉 All Spark workloads (DataFrames, SQL) compile down to **RDD operations** internally.

---

### 🔎 Step 2: Why (or When) to Use RDDs?

Use RDDs **only when**:

1. You need **fine-grained control** over data distribution (custom partitioning).
2. You're maintaining **legacy codebases** from Spark 1.x.
3. You need to **manipulate objects directly** (e.g., complex, unstructured formats).

👉 Otherwise, **prefer DataFrames**: Spark optimizes them automatically (Catalyst optimizer, Tungsten execution engine, compressed storage).

---

### 🔎 Step 3: Types of RDDs

1. **Generic RDD**: collection of objects.
2. **Key-Value RDD**: (key, value) pairs, with extra functions like `reduceByKey`, `groupByKey`, `partitionBy`.
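To make the difference between `reduceByKey` and `groupByKey` concrete, here is a plain-Python model that needs no Spark. It is a sketch of the semantics, not Spark's implementation: the two hard-coded partitions and the helper names (`combine_locally`, `reduce_by_key_model`, `group_by_key_model`) are assumptions made for illustration.

```python
from collections import defaultdict
from operator import add

# Two hypothetical partitions of (word, 1) pairs
partitions = [
    [("spark", 1), ("rdd", 1), ("spark", 1)],
    [("rdd", 1), ("spark", 1)],
]


def combine_locally(partition, func):
    """Merge values per key inside a single partition (map-side combine)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key_model(parts, func):
    # Step 1: combine within each partition, before any data moves
    partials = [combine_locally(p, func) for p in parts]
    # Step 2: merge the per-partition results for each key
    result = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = func(result[key], value) if key in result else value
    return result, partials


def group_by_key_model(parts):
    # Every (key, value) pair is moved as-is and gathered per key
    grouped = defaultdict(list)
    for partition in parts:
        for key, value in partition:
            grouped[key].append(value)
    return dict(grouped)


counts, partials = reduce_by_key_model(partitions, add)
groups = group_by_key_model(partitions)

# Records that would have to cross partitions in each model
moved_by_reduce = sum(len(p) for p in partials)    # 4
moved_by_group = sum(len(p) for p in partitions)   # 5

print(counts)  # {'spark': 3, 'rdd': 2}
print(groups)  # {'spark': [1, 1, 1], 'rdd': [1, 1]}
```

Because `reduceByKey` combines values inside each partition first, fewer records need to move between partitions than with `groupByKey`, which is why it is usually the better choice for aggregations.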

---

### 🔎 Step 4: Properties of an RDD

Each RDD has:

- **Partitions** (data split across cluster).
- **Computation function** for each split.
- **Dependencies** (on parent RDDs).
- **Optional Partitioner** (for key-value RDDs).
- **Preferred locations** (where partitions should be processed, e.g., data locality in HDFS).

---

### 🔎 Step 5: Creating RDDs

### (a) From a DataFrame / Dataset

In [6]:
# Convert DataFrame to RDD
df = spark.range(10).toDF("id")
rdd = df.rdd.map(lambda row: row[0])
print(rdd.collect())  # [0, 1, 2, ..., 9]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


### (b) From a Local Collection

In [7]:
words = spark.sparkContext.parallelize("My name is Darshil".split(" "), 2)
words.setName("myWordsRDD")
print(words.collect())

['My', 'name', 'is', 'Darshil']


### (c) From a File

In [8]:
# Each line of file = one record
rdd_text = spark.sparkContext.textFile(get_project_path('data', 'sample', 'sample_text_file_1.txt'))
print(rdd_text.collect())

# Each file = one record
rdd_whole = spark.sparkContext.wholeTextFiles(get_project_path('data', 'sample'))
print(rdd_whole.collect())

['Apache Spark is great for big data processing.', 'RDDs are the low-level API in Spark.']
[('file:/c:/ds_analytics_projects/darshil_course/apache-pyspark/darshil-pyspark/data/sample/sample_text_file_1.txt', 'Apache Spark is great for big data processing.\r\nRDDs are the low-level API in Spark.'), ('file:/c:/ds_analytics_projects/darshil_course/apache-pyspark/darshil-pyspark/data/sample/sample_text_file_2.txt', 'DataFrames provide a higher-level abstraction.\r\nThis is a sample text file for SparkContext.textFile.')]


---

### 🔎 Step 6: RDD Transformations (Lazy)

Transformations define **how to change data**, but do not execute until an action is called.
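Python's built-in `map` is lazy in a similar way, so it works as a small analogy for how a transformation is recorded first and only run when an action asks for results. This is an analogy only, not how Spark schedules work; `traced_upper` and `words_local` are names invented for the demo.

```python
log = []


def traced_upper(word):
    log.append(word)  # record every time the function actually runs
    return word.upper()


words_local = ["My", "name", "is", "Darshil"]

# Building the pipeline runs nothing: map() only returns a lazy iterator
pipeline = map(traced_upper, words_local)
calls_before = len(log)

# Consuming the iterator plays the role of an action such as collect()
result = list(pipeline)
calls_after = len(log)

print(calls_before, calls_after)  # 0 4
print(result)  # ['MY', 'NAME', 'IS', 'DARSHIL']
```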

- **distinct** (removes duplicates; it shuffles data, so the output order is not guaranteed)

In [9]:
words.distinct().collect()

['name', 'is', 'My', 'Darshil']

- **filter**

In [10]:
words.filter(lambda w: w.startswith("D")).collect()

['Darshil']

- **map**

In [11]:
words2 = words.map(lambda w: (w, w[0], w.startswith("D")))
print(words2.take(3))

[('My', 'M', False), ('name', 'n', False), ('is', 'i', False)]


- **flatMap** (expands into multiple outputs)

In [12]:
chars = words.flatMap(lambda w: list(w))
print(chars.take(5))  # ['M','y','n','a','m']

['M', 'y', 'n', 'a', 'm']


- **sortBy**

In [13]:
words.sortBy(lambda w: len(w), ascending=False).take(2)

['Darshil', 'name']

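- **randomSplit**<br>
Returns a list of RDDs, one per weight, that you can manipulate individually. Weights are normalized if they do not sum to 1, and each split's size is only approximately proportional to its weight.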

In [14]:
split_rdds = words.randomSplit([0.5, 0.5], seed=42)
print(split_rdds)

[PythonRDD[29] at RDD at PythonRDD.scala:53, PythonRDD[30] at RDD at PythonRDD.scala:53]


---

### 🔎 Step 7: RDD Actions (Eager)

Actions **trigger execution** of transformations.

- **reduce**

In [15]:
nums = spark.sparkContext.parallelize(range(1, 21))
print(nums.reduce(lambda x, y: x + y))  # sum = 210

210
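`reduce` expects a function that is commutative and associative, because Spark reduces each partition on its own and then combines the partial results. The plain-Python model below sketches why that matters. The contiguous slicing in `chunk` is an assumption for illustration and is not meant to reproduce how Spark assigns elements to partitions.

```python
from functools import reduce
from operator import add, sub


def chunk(values, num_partitions):
    """Split values into num_partitions contiguous slices (illustrative only)."""
    size, extra = divmod(len(values), num_partitions)
    slices, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        slices.append(values[start:end])
        start = end
    return slices


def partitioned_reduce(values, func, num_partitions):
    # Reduce inside each non-empty partition, then reduce the partial results
    partials = [reduce(func, part) for part in chunk(values, num_partitions) if part]
    return reduce(func, partials)


nums_local = list(range(1, 21))

sums = {n: partitioned_reduce(nums_local, add, n) for n in (1, 2, 4)}
diffs = {n: partitioned_reduce(nums_local, sub, n) for n in (1, 2, 4)}

print(sums)   # {1: 210, 2: 210, 4: 210}
print(diffs)  # {1: -208, 2: 80, 4: 116}
```

Addition gives 210 regardless of how the data is split, while subtraction gives a different answer for each partition count, so it is not a safe function to pass to `reduce`.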


- **count**

In [16]:
print(words.count())

4


- **first**

In [17]:
print(words.first())

My


- **max / min**

In [18]:
print(nums.max())  # 20
print(nums.min())  # 1

20
1


- **take / takeOrdered / top**

In [19]:
print(words.take(5))
print(words.takeOrdered(5))  # up to 5 smallest elements, in ascending order
print(words.top(5))          # up to 5 largest elements, in descending order

['My', 'name', 'is', 'Darshil']
['Darshil', 'My', 'is', 'name']
['name', 'is', 'My', 'Darshil']
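The ordering above may look odd at first: strings are compared by character code, so uppercase letters sort before lowercase ones. The plain-Python sketch below reproduces the same ordering with `heapq` on a local list (`words_local` is a stand-in for the RDD contents) and shows a case-insensitive variant.

```python
import heapq

words_local = ["My", "name", "is", "Darshil"]

# Default string comparison: 'D' < 'M' < 'i' < 'n'
smallest = heapq.nsmallest(5, words_local)
largest = heapq.nlargest(5, words_local)
print(smallest)  # ['Darshil', 'My', 'is', 'name']
print(largest)   # ['name', 'is', 'My', 'Darshil']

# A key function changes the comparison, here ignoring case
by_lower = heapq.nsmallest(5, words_local, key=str.lower)
print(by_lower)  # ['Darshil', 'is', 'My', 'name']
```

`takeOrdered` and `top` also accept an optional `key` argument, so `words.takeOrdered(5, key=str.lower)` should give the case-insensitive order on the RDD itself.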


- **countByValue**

In [20]:
print(words.countByValue())

defaultdict(<class 'int'>, {'My': 1, 'name': 1, 'is': 1, 'Darshil': 1})


- **takeSample**

In [21]:
sample = words.takeSample(withReplacement=True, num=3, seed=10)
print(sample)

['My', 'is', 'name']


---

### 🔎 Step 8: Saving RDDs

**👉 `saveAsTextFile` fails if the target directory already exists, so delete that directory before saving.**

In [22]:
# Save as text file (each partition -> 1 file)
words.saveAsTextFile("tmp/words_output")
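One way to clear the target directory first is a small helper built on `shutil.rmtree`. This is a sketch that assumes the output path is on the local filesystem (as with `tmp/words_output` here); the helper name `clear_output_dir` is hypothetical, and paths on HDFS or object storage would need a different tool.

```python
import shutil
import tempfile
from pathlib import Path


def clear_output_dir(path):
    """Remove a local output directory if it exists; return True if it was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demonstration on a throwaway directory that mimics a previous save
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "words_output"
    out_dir.mkdir()
    (out_dir / "part-00000").write_text("My\nname\n")

    removed_first = clear_output_dir(out_dir)   # directory existed
    removed_again = clear_output_dir(out_dir)   # nothing left to remove
    still_exists = out_dir.exists()

print(removed_first, removed_again, still_exists)  # True False False
```

In this notebook, calling `clear_output_dir("tmp/words_output")` just before `words.saveAsTextFile("tmp/words_output")` would make the save cell safe to re-run.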

---

### 🔎 Step 9: Special Functions

- **glom()**
    - Collects the elements of each partition into a list, so `collect()` returns one list per partition (useful for checking how data is distributed).

In [23]:
print(spark.sparkContext.parallelize(["Hello", "World"], 2).glom().collect())
# [['Hello'], ['World']]

[['Hello'], ['World']]


---

### 🔎 Step 10: Key Insights

- **RDDs = flexible, lower-level control**, but harder to optimize.
- **DataFrames = structured, optimized, easier to use**.
- Best practice:
    - Use **RDDs only when absolutely needed**.
    - For everything else, **use DataFrame / Dataset APIs**.

---

✅ **In simple words:**

RDDs are like Spark's "raw ingredients." They give you complete control, but you lose automatic optimizations. DataFrames are like "ready-to-use recipes." Spark will optimize them for you.