# Add Project Root to `sys.path`

In [2]:
# Automatically reload modules when they change
%load_ext autoreload
%autoreload 2

Find the project root folder and insert it into `sys.path` so that project modules can be imported

In [3]:
import os
import sys

def find_project_root(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """
    Walks up from start_path until it finds one of the marker files/folders.
    Returns the path of the project root.
    """
    if start_path is None:
        start_path = os.getcwd()

    current_path = os.path.abspath(start_path)

    while True:
        # check if any marker exists in current path
        if any(os.path.exists(os.path.join(current_path, marker)) for marker in markers):
            return current_path

        new_path = os.path.dirname(current_path)  # parent folder
        if new_path == current_path:  # reached root of filesystem
            raise FileNotFoundError(f"None of the markers {markers} found at or above {start_path}")
        current_path = new_path

project_root = find_project_root()
print("Project root:", project_root)

if project_root not in sys.path:
    sys.path.insert(0, project_root)


Project root: c:\ds_analytics_projects\darshil_course\apache-pyspark\darshil-pyspark
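
The same upward search can also be written with `pathlib`. This is a minimal sketch, not a replacement for the function above: the name `find_project_root_pathlib` and the throwaway `demo_project` layout are assumptions made only for illustration.

```python
import tempfile
from pathlib import Path

def find_project_root_pathlib(start=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """Return the nearest folder at or above `start` that contains a marker."""
    start = Path(start or Path.cwd()).resolve()
    # Check `start` itself first, then each parent up to the filesystem root
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"None of the markers {markers} found at or above {start}")

# Demo on a temporary directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "demo_project"
    nested = root / "notebooks" / "rdd"
    nested.mkdir(parents=True)
    (root / "pyproject.toml").touch()  # marker file at the project root

    found = find_project_root_pathlib(nested)
    print(found == root)
```

Starting the search two levels below the marker shows that the walk stops at the first folder containing a marker, not at the filesystem root.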


# Import Libraries

Import packages

In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

Project-local import (works because the project root is now on `sys.path`)

In [5]:
from utils.file_utils import get_project_path

In [6]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("RDD") \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

# 📒 Advanced RDDs

---

### 🔎 Step 1: Why Advanced RDDs?

- Basic RDD operations (`map`, `filter`, `reduce`) treat every record as a generic object.
- But **many big data tasks are about grouping and aggregating by keys** (like SQL `GROUP BY`), joining datasets, or controlling partitions for performance.
- That's where **Pair RDDs (key–value RDDs)** come in.

👉 Example:

If we want to **count words**, it's natural to turn them into `(word, 1)` pairs and then aggregate by word.
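
The idea can be sketched in plain Python before involving Spark at all. The sentence below is an illustrative assumption (chosen because it repeats words), and the dictionary loop only models what a key-based aggregation does.

```python
from collections import defaultdict

sentence = "to be or not to be"  # illustrative input, not from the notebook

# Step 1: turn each word into a (word, 1) pair
pairs = [(word, 1) for word in sentence.split(" ")]

# Step 2: add up the values that share the same key
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
print(word_counts)
```

Every step after the pairing only looks at the key, which is exactly what makes the `(word, 1)` shape convenient.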

---

### 🔎 Step 2: Creating Key–Value RDDs

Any RDD becomes a key–value RDD once you `map` each record to a 2-tuple `(key, value)`.

### Example – Word Dataset

In [7]:
words = spark.sparkContext.parallelize(
    "My Name is Darshil and I love Spark".split(" "), 2
)

# (word, 1)
pairs = words.map(lambda w: (w.lower(), 1))
print(pairs.take(5))

[('my', 1), ('name', 1), ('is', 1), ('darshil', 1), ('and', 1)]


📌 Explanation:

- Each record is now `(key=word, value=1)`.
- This makes it possible to use key-based aggregations like `reduceByKey`.

---

### 🔎 Step 3: Flexible Keys with `keyBy`

Instead of manually creating `(key, value)` pairs, we can generate keys dynamically.

In [8]:
# Key = first letter of word
keyword = words.keyBy(lambda w: w.lower()[0])
print(keyword.take(5))

[('m', 'My'), ('n', 'Name'), ('i', 'is'), ('d', 'Darshil'), ('a', 'and')]


📌 Explanation:

- `keyBy` transforms an RDD into key–value pairs, where the key is derived by applying the given function to each element and the original element is kept as the value.
- Example: `"Spark"` → `("s", "Spark")`.
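
👉 A plain-Python model of the same idea, assuming a hypothetical helper `key_by` (not part of PySpark) and using word length as the key instead of the first letter:

```python
def key_by(records, key_func):
    """Model of keyBy: pair each record with key_func(record)."""
    return [(key_func(record), record) for record in records]

words_local = "My Name is Darshil and I love Spark".split(" ")

# Key = length of the word; the word itself stays as the value
by_length = key_by(words_local, len)
print(by_length[:3])
```

Any function of the record can serve as the key; the record itself is never modified.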

---

### 🔎 Step 4: Working with Keys and Values

Once you have key–value pairs, you can manipulate them more directly:

In [9]:
# Convert values to uppercase
print(keyword.mapValues(lambda w: w.upper()).take(5))

# Extract only keys
print(keyword.keys().take(5))

# Extract only values
print(keyword.values().take(5))

# Lookup by key
print(keyword.lookup("s"))  # returns ['Spark']

[('m', 'MY'), ('n', 'NAME'), ('i', 'IS'), ('d', 'DARSHIL'), ('a', 'AND')]
['m', 'n', 'i', 'd', 'a']
['My', 'Name', 'is', 'Darshil', 'and']
['Spark']


📌 Explanation:

- `.mapValues()` modifies only the **values** (keeps keys intact).
- `.keys()` and `.values()` let you extract just one side.
- `.lookup(key)` fetches all values for a given key.
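
👉 What these four operations do can be modelled with ordinary lists. The helpers `map_values` and `lookup` below are hypothetical stand-ins written for this sketch, and the extra pair `("s", "SQL")` is made up to show a key with more than one value.

```python
# Hypothetical pairs; ("s", "SQL") is added only to show a key with two values
pairs_local = [("m", "My"), ("n", "Name"), ("s", "Spark"), ("s", "SQL")]

def map_values(pairs, func):
    """Apply func to each value and leave the key untouched."""
    return [(key, func(value)) for key, value in pairs]

def lookup(pairs, wanted_key):
    """Return every value stored under wanted_key (empty list if none)."""
    return [value for key, value in pairs if key == wanted_key]

upper_pairs = map_values(pairs_local, str.upper)
keys_only = [key for key, _ in pairs_local]
values_only = [value for _, value in pairs_local]
s_values = lookup(pairs_local, "s")

print(upper_pairs)
print(keys_only, values_only)
print(s_values, lookup(pairs_local, "z"))
```

Note that a lookup returns a list: a key may map to several values, or to none.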

---

### 🔎 Step 5: Aggregations

### Count characters example

In [10]:
chars = words.flatMap(lambda w: w.lower())
KVchars = chars.map(lambda c: (c, 1))

Now we can aggregate:

- **countByKey**

In [11]:
print(KVchars.countByKey())

defaultdict(<class 'int'>, {'m': 2, 'y': 1, 'n': 2, 'a': 4, 'e': 2, 'i': 3, 's': 3, 'd': 2, 'r': 2, 'h': 1, 'l': 2, 'o': 1, 'v': 1, 'p': 1, 'k': 1})


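📌 `countByKey` is an action: it returns a dictionary to the driver with the number of records per key, so it is best suited to RDDs with a modest number of distinct keys.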
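
👉 The counts above can be cross-checked without Spark. This sketch rebuilds the `(char, 1)` pairs from the same sentence and counts keys with `collections.Counter`; the `_local` names are assumptions used to avoid clashing with the notebook variables.

```python
from collections import Counter

words_local = "My Name is Darshil and I love Spark".split(" ")

# Same shape as KVchars: one (char, 1) pair per lowercase character
kv_chars_local = [(c, 1) for w in words_local for c in w.lower()]

# Counting by key ignores the values and counts how often each key occurs
key_counts = Counter(key for key, _ in kv_chars_local)
print(key_counts["a"], key_counts["i"], key_counts["k"])
```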

---

### groupByKey vs reduceByKey

- **groupByKey (⚠️ risky for skewed data)**

In [12]:
from functools import reduce
def addFunc(x, y):
    return x + y

print(
    KVchars.groupByKey().map(lambda row: (row[0], reduce(addFunc, row[1]))).take(5)
)

[('y', 1), ('i', 3), ('s', 3), ('d', 2), ('r', 2)]


📌 Problem: `groupByKey` shuffles every record and then holds **all values for a key in memory** on a single executor → can cause **OutOfMemoryError** if one key has too many values (data skew).

---

- **reduceByKey (✅ preferred)**

In [13]:
print(KVchars.reduceByKey(addFunc).take(5))

[('y', 1), ('i', 3), ('s', 3), ('d', 2), ('r', 2)]


📌 Why better?

- Combines values **within each partition first** before shuffling.
- Less data movement → faster & safer.

👉 Rule of thumb:

- Use `reduceByKey` when the operation is associative and commutative (sum, count, min, max).
- Use `groupByKey` only if you really need all values together (rare).
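
👉 The difference in data movement can be illustrated with a small plain-Python simulation. This is only a model under stated assumptions: the two input partitions are made up, and "shuffled records" simply counts how many `(key, value)` pairs would have to leave their partition.

```python
from collections import defaultdict

# Two hypothetical input partitions of (char, 1) pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

def shuffled_without_combine(parts):
    """groupByKey-style: every record crosses the shuffle as-is."""
    return [record for part in parts for record in part]

def shuffled_with_combine(parts):
    """reduceByKey-style: sum per key inside each partition, then shuffle."""
    out = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        out.extend(local.items())
    return out

def final_sums(records):
    """Reduce side: add up whatever arrived for each key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

plain = shuffled_without_combine(partitions)
combined = shuffled_with_combine(partitions)

print(len(plain), len(combined))                   # records crossing the shuffle
print(final_sums(plain) == final_sums(combined))   # same final answer
```

Both paths produce the same totals, but the pre-combined path sends at most one record per key per partition, which is why the gap widens as keys repeat more often.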

---

### 🔎 Step 6: Joins

You can join two key–value RDDs just like SQL joins:

In [14]:
import random

# distinct letters keyed with random numbers
distinctChars = chars.distinct()
keyedChars = distinctChars.map(lambda c: (c, random.random()))

# Inner join
print(KVchars.join(keyedChars, 10).take(5))

[('m', (1, 0.22564860562641875)), ('m', (1, 0.22564860562641875)), ('y', (1, 0.8805029896320028)), ('n', (1, 0.30812983281156814)), ('n', (1, 0.30812983281156814))]


📌 Explanation:

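- `join` combines records with the same key: each matching pair produces one output record, which is why `'m'` (two occurrences in `KVchars`) appears twice.
- The second argument (`10`) sets the number of output partitions.
- The right-hand values come from `random.random()`, so they differ on every run.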
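
👉 The matching rule of an inner join can be modelled in a few lines of plain Python. The helper `inner_join` and the sample pairs are assumptions for illustration; fixed numbers replace the random values so the result is reproducible.

```python
from collections import defaultdict

def inner_join(left, right):
    """Model of an inner join on (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # One output record per matching (left, right) combination
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

left = [("m", 1), ("y", 1), ("m", 1), ("z", 1)]   # "z" has no partner
right = [("m", 0.25), ("y", 0.75)]

joined = inner_join(left, right)
print(joined)
```

Keys that exist on only one side (`"z"` here) are dropped, and a key that occurs twice on the left yields two joined records.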

---

### 🔎 Step 7: Controlling Partitions

Partitioning = **how Spark splits data across executors**.

- **coalesce** → reduce partitions without shuffle

In [15]:
print(words.coalesce(1).getNumPartitions())  # collapses to 1 partition

1


- **repartition** → change partitions with shuffle

In [16]:
print(words.repartition(10).getNumPartitions())  # now 10 partitions

10


📌 Explanation:

- Use `coalesce` to reduce the number of partitions cheaply (existing partitions are merged, no full shuffle).
- Use `repartition` when you need more parallelism or a more even spread; it always performs a full shuffle.
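
👉 A rough plain-Python model of the two behaviours. The functions `coalesce_model` and `repartition_model` are hypothetical, and the grouping and round-robin rules are simplifying assumptions; Spark's real placement logic is more involved.

```python
def coalesce_model(parts, n):
    """Merge whole partitions into n groups; records stay with their neighbours."""
    merged = [[] for _ in range(n)]
    for index, part in enumerate(parts):
        merged[index * n // len(parts)].extend(part)
    return merged

def repartition_model(parts, n):
    """Redistribute records one by one (round-robin), as a full shuffle could."""
    out = [[] for _ in range(n)]
    records = [record for part in parts for record in part]
    for index, record in enumerate(records):
        out[index % n].append(record)
    return out

# Four hypothetical partitions holding the words of the example sentence
parts = [["My", "Name"], ["is", "Darshil"], ["and", "I"], ["love", "Spark"]]

coalesced = coalesce_model(parts, 2)
repartitioned = repartition_model(parts, 3)

print([len(p) for p in coalesced])
print([len(p) for p in repartitioned])
```

In the first case each input partition moves as a block, which is what makes shrinking cheap; in the second every record may end up anywhere, which is the cost of a shuffle.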

---

### 🔎 Step 8: Custom Partitioning

With the default hash partitioning, a few very frequent keys can make one partition much larger than the rest (**data skew**).

Custom partitioning helps you **control data distribution**.

In [17]:
data = [
    ("10001", "A123", "Apple",     5, 10.0, 17850, "UK"),
    ("10002", "B456", "Banana",   10,  5.5, 12583, "UK"),
    ("10003", "C789", "Carrot",    2,  2.0, 17850, "France"),
    ("10004", "D111", "Dates",     1, 20.0, 11111, "USA"),
]

columns = ["InvoiceNo", "StockCode", "Description", "Quantity", "UnitPrice", "CustomerID", "Country"]

df = spark.createDataFrame(data, columns)

rdd = df.rdd
first_row = rdd.first()
print(first_row)
print("Row length:", len(first_row))


Row(InvoiceNo='10001', StockCode='A123', Description='Apple', Quantity=5, UnitPrice=10.0, CustomerID=17850, Country='UK')
Row length: 7


In [18]:
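keyedRDD = rdd.keyBy(lambda row: row.CustomerID)

def partitionFunc(key):
    """Send the two heavy customers to partition 0, spread the rest over 1 and 2."""
    import random
    if key in [17850, 12583]:  # skewed customers
        return 0
    else:
        # NOTE: random placement is not deterministic, so the same key can land
        # in different partitions; acceptable here only because we inspect sizes
        return random.randint(1, 2)

# partitionBy(numPartitions, partitionFunc) places each record by partitionFunc(key)
partitioned = keyedRDD.partitionBy(3, partitionFunc)
# glom() turns each partition into a list, so map(len) gives records per partition
print(partitioned.glom().map(len).collect())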


[3, 1, 0]


📌 Explanation:

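- A custom partition function can only be supplied on key–value RDDs (`partitionBy`); DataFrames only let you repartition by column expressions.
- Here, the two heavy customers (`17850`, `12583`) are routed to partition 0 and the remaining keys are spread over partitions 1 and 2, so the heavy keys do not slow down the rest.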
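
👉 If the same key must always land in the same partition (for example before a join or a per-key aggregation), the partition function should be deterministic. The sketch below is a plain-Python model: `deterministic_partition_func` and `partition_by_model` are hypothetical names, integer keys are assumed, and the bucket rule `partition_func(key) % num_partitions` is assumed to mirror how `partitionBy` picks a partition.

```python
HEAVY_CUSTOMERS = {17850, 12583}  # same customer IDs as in the example above

def deterministic_partition_func(key):
    """Heavy customers go to partition 0; other keys go to 1 or 2, stable per key."""
    if key in HEAVY_CUSTOMERS:
        return 0
    return 1 + key % 2  # integer keys assumed

def partition_by_model(pairs, num_partitions, partition_func):
    """Model of partitionBy: bucket index = partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

keyed = [(17850, "Apple"), (12583, "Banana"), (17850, "Carrot"), (11111, "Dates")]

buckets = partition_by_model(keyed, 3, deterministic_partition_func)
sizes = [len(bucket) for bucket in buckets]
print(sizes)
```

Because the function depends only on the key, repeated runs give the same layout, which the random version cannot guarantee.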

---

### 🔎 Step 9: Key Takeaways

1. **Key–Value RDDs** allow aggregations and joins like SQL.
2. **reduceByKey > groupByKey** for performance.
3. **Partitioning control** is the biggest reason to still use RDDs.
4. Custom partitioners help fight **data skew** in large-scale jobs.

---

✅ **In simple words:**

Advanced RDDs let you treat data as `(key, value)` pairs, which unlocks aggregations and joins. The real superpower is **custom partitioning** — you can decide how data is spread across the cluster to avoid skew. This level of control is why RDDs still matter, even though DataFrames should be your default choice.