# Add Project Root to the Python Path

In [1]:
# Automatically reload modules when they change
%load_ext autoreload
%autoreload 2

Locate the project root folder and insert it into `sys.path` so that project modules can be imported.

In [2]:
import os
import sys

def find_project_root(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """
    Walks up from start_path until it finds one of the marker files/folders.
    Returns the path of the project root.
    """
    if start_path is None:
        start_path = os.getcwd()

    current_path = os.path.abspath(start_path)

    while True:
        # check if any marker exists in current path
        if any(os.path.exists(os.path.join(current_path, marker)) for marker in markers):
            return current_path

        new_path = os.path.dirname(current_path)  # parent folder
        if new_path == current_path:  # reached root of filesystem
            raise FileNotFoundError(f"None of the markers {markers} found above {start_path}")
        current_path = new_path

project_root = find_project_root()
print("Project root:", project_root)

if project_root not in sys.path:
    sys.path.insert(0, project_root)


Project root: c:\ds_analytics_projects\darshil_course\apache-pyspark\darshil-pyspark
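
The same upward search can also be sketched with `pathlib`. This is a minimal sketch and not part of the original notebook: the name `find_project_root_pathlib` is hypothetical, and it assumes the same marker defaults as `find_project_root`.

```python
from pathlib import Path

# Hypothetical helper; mirrors find_project_root but works with Path objects.
def find_project_root_pathlib(start_path=None, markers=(".git", "pyproject.toml", "requirements.txt")):
    """Return the nearest directory at or above start_path that contains a marker."""
    start = Path(start_path) if start_path is not None else Path.cwd()
    start = start.resolve()

    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate

    raise FileNotFoundError(f"None of the markers {markers} found at or above {start}")
```

It returns a `Path`, so convert the result with `str(...)` before inserting it into `sys.path`.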


# Import Libraries

Import packages

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

Import a project module (resolved through the project root added to `sys.path`)

In [4]:
from utils.file_utils import get_project_path

In [5]:
from pyspark.sql import SparkSession

# Create SparkSession. enableHiveSupport() already sets
# spark.sql.catalogImplementation to "hive", so no separate config call is needed.
spark = (
    SparkSession.builder
    .appName("lowerLevelApi")
    .enableHiveSupport()
    .getOrCreate()
)

# 📒 Lower-Level APIs

## 🔎 Step 1: What Are Lower-Level APIs?
Spark provides two "layers" of APIs:
* **High-level APIs (Structured APIs):**
   * DataFrame, Dataset, and Spark SQL (the typed Dataset API exists only in Scala and Java; PySpark exposes DataFrames).
   * Provide declarative syntax, optimized by the Catalyst optimizer.
   * Easier, safer, and preferred in most cases.
* **Low-level APIs:**
   * **RDDs (Resilient Distributed Datasets):** Primitive distributed collections with transformations and actions.
   * **Distributed shared variables:**
      * **Broadcast variables** (read-only shared data across executors).
      * **Accumulators** (variables that tasks can only add to and that only the driver can read, e.g., counters).
   * **SparkContext** (entry point to cluster-level functionality).

👉 Every **DataFrame operation** internally compiles down to **RDD transformations and actions**.
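
One way to observe this link is the `rdd` attribute of a DataFrame, which exposes its rows as an RDD of `Row` objects. The sketch below is illustrative only: `doubled_ids` is a hypothetical helper, and it assumes the DataFrame has a numeric column named `id`, as produced by `spark.range`.

```python
def doubled_ids(df):
    """Drop from a DataFrame to its RDD of Row objects and map over it.

    Assumes `df` has a numeric column named "id" (for example, spark.range(5)).
    """
    rows_rdd = df.rdd                                   # RDD of pyspark.sql.Row
    doubled = rows_rdd.map(lambda row: row["id"] * 2)   # lazy transformation
    return doubled.collect()                            # action: runs the job
```

With the session created above, `doubled_ids(spark.range(5))` should return `[0, 2, 4, 6, 8]`. Everything after `df.rdd` runs as plain RDD code without Catalyst optimization, so this route is best kept for cases the DataFrame API cannot express.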

## 🔎 Step 2: When Should You Use Low-Level APIs?
Use RDDs, accumulators, or broadcast variables only when:
1. **Functionality missing in Structured APIs**
   * e.g., fine-grained control of data placement, custom partitioning, byte-level data manipulation.
2. **Legacy code**
   * Old Spark jobs written with RDDs still need maintenance.
3. **Custom shared variable manipulation**
   * e.g., debugging counters, global accumulators, broadcasting lookup tables.

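Otherwise, stick with the **DataFrame/Dataset** APIs: they are optimized by Catalyst, less error-prone, and usually faster, especially in PySpark, where RDD functions run in Python worker processes outside the JVM.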
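
As an illustration of the custom partitioning case, a minimal sketch using `RDD.partitionBy` with a user-supplied partition function follows. The region keys and the helper names `region_partitioner` and `partition_by_region` are assumptions made for this example.

```python
# Hypothetical region keys, used only for illustration.
REGION_TO_PARTITION = {"EU": 0, "US": 1, "APAC": 2}
NUM_PARTITIONS = len(REGION_TO_PARTITION) + 1  # last partition collects unknown regions

def region_partitioner(key):
    """Map a region key to a partition index; unknown keys go to the last partition."""
    return REGION_TO_PARTITION.get(key, NUM_PARTITIONS - 1)

def partition_by_region(sc, records):
    """Place (region, value) pairs into partitions chosen by region_partitioner.

    `sc` is a SparkContext; `records` is a local list of (region, value) tuples.
    Returns one list per partition so the placement can be inspected.
    """
    pairs = sc.parallelize(records)
    placed = pairs.partitionBy(NUM_PARTITIONS, region_partitioner)
    return placed.glom().collect()
```

Calling `partition_by_region(spark.sparkContext, [("EU", 1), ("US", 2), ("EU", 3)])` returns one list per partition, with both `EU` records in the first list.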

## 🔎 Step 3: How to Use Low-Level APIs
### SparkContext
* **SparkContext** is the main entry point for lower-level operations.
* Available from `SparkSession` via `spark.sparkContext`.

In [6]:
# Access SparkContext from SparkSession
sc = spark.sparkContext

# Parallelize Python collection into an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.collect())

[1, 2, 3, 4, 5]


👉 SparkContext is responsible for:
* Connecting to the cluster manager.
* Requesting resources (executors).
* Creating RDDs from local collections or external sources.
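
To see what the context is connected to and how it splits a local collection, a few of its attributes and methods can be called directly. A small sketch, where the helper names `describe_context` and `split_into_partitions` are hypothetical:

```python
def describe_context(sc):
    """Collect basic facts about a SparkContext into a dict."""
    return {
        "app_name": sc.appName,
        "master": sc.master,                        # cluster manager URL, e.g. local[*]
        "spark_version": sc.version,
        "default_parallelism": sc.defaultParallelism,
    }

def split_into_partitions(sc, data, num_slices):
    """Create an RDD from a local collection and show how it was split."""
    rdd = sc.parallelize(data, num_slices)
    return rdd.glom().collect()                     # one inner list per partition
```

`describe_context(sc)` reports the master URL and default parallelism for this session, and `split_into_partitions(sc, range(6), 3)` shows the six numbers spread over three partitions. External sources are read in a similar way, for example with `sc.textFile(path)`.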

## 🔎 Step 4: RDDs (Resilient Distributed Datasets)
* **RDD = low-level distributed collection of objects.**
* Immutable and partitioned across the cluster.
* Supports lazy transformations (`map`, `filter`, `flatMap`) and actions (`collect`, `count`, `reduce`) that trigger execution.

Example:

In [7]:
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformation (lazy)
rdd_squared = rdd.map(lambda x: x * x)

# Action (executes)
print(rdd_squared.collect())  # [1, 4, 9, 16, 25]

[1, 4, 9, 16, 25]
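
Transformations can be chained before a single action triggers the whole pipeline. The word count below is a sketch of that pattern using `flatMap`, `filter`, `map`, and `reduceByKey`; the stop-word set and the helper names `tokenize` and `word_counts` are assumptions for illustration.

```python
from operator import add

STOP_WORDS = {"the", "a", "an"}  # hypothetical stop-word list

def tokenize(line):
    """Split a line into lowercase words (one input line yields many words)."""
    return line.lower().split()

def word_counts(sc, lines):
    """Count words in a local list of strings using RDD transformations."""
    counts = (
        sc.parallelize(lines)
        .flatMap(tokenize)                         # lines -> words
        .filter(lambda w: w not in STOP_WORDS)     # drop stop words
        .map(lambda w: (w, 1))                     # word -> (word, 1)
        .reduceByKey(add)                          # sum the 1s per word
    )
    # Nothing has run yet; collect() is the action that executes the chain.
    return dict(counts.collect())
```

For example, `word_counts(sc, ["the rdd", "a lazy rdd"])` should return `{"rdd": 2, "lazy": 1}`.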


## 🔎 Step 5: Distributed Shared Variables
### Broadcast Variables
* Allow sharing large read-only data across executors without sending it with every task.

In [8]:
broadcast_var = sc.broadcast([1, 2, 3])

print("Broadcast value:", broadcast_var.value)

Broadcast value: [1, 2, 3]
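
Broadcast variables are most useful when tasks read them, for example as a lookup table inside `map`. The following is a sketch of that usage; the helper name `label_codes` and the sample codes are hypothetical.

```python
def label_codes(sc, codes, code_to_name):
    """Attach a name to each code using a broadcast lookup table.

    `code_to_name` is a plain dict on the driver; it is shipped to each
    executor once instead of being serialized with every task.
    """
    lookup = sc.broadcast(code_to_name)
    labeled = sc.parallelize(codes).map(
        lambda code: (code, lookup.value.get(code, "unknown"))  # read-only access
    )
    result = labeled.collect()
    lookup.unpersist()  # release the cached copies on the executors
    return result
```

For instance, `label_codes(sc, ["US", "XX"], {"US": "United States"})` should return `[("US", "United States"), ("XX", "unknown")]`.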


### Accumulators
* Add-only variables for aggregations such as counters or sums: tasks can only add to them, and only the driver can read the result.

In [9]:
accum = sc.accumulator(0)

def add_num(x):
    global accum
    accum += x

rdd.foreach(add_num)
print("Accumulator value:", accum.value)

Accumulator value: 15
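
A common use is counting records that fail a check. The sketch below updates the accumulator inside an action (`foreach`), where Spark applies each task's update only once; inside transformations, an update can be applied more than once if a task or stage is re-executed. The validity rule and the helper names `is_valid` and `count_invalid` are assumptions for illustration.

```python
def is_valid(line):
    """Illustrative rule: a record is valid if it parses as an integer."""
    try:
        int(line)
        return True
    except ValueError:
        return False

def count_invalid(sc, lines):
    """Count invalid records with an accumulator updated inside an action."""
    invalid = sc.accumulator(0)

    def check(line):
        if not is_valid(line):
            invalid.add(1)          # tasks can only add to the accumulator

    sc.parallelize(lines).foreach(check)   # action: runs on the executors
    return invalid.value                    # read back on the driver
```

For example, `count_invalid(sc, ["1", "x", "3", ""])` should return `2`.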


## 🔎 Step 6: Why Does This Matter?
* All Spark jobs eventually **compile down to RDD operations**.
* Understanding low-level APIs helps in:
   * Debugging query plans.
   * Optimizing performance at the physical level.
   * Maintaining old codebases.
* But in modern Spark: **prefer Structured APIs** unless you really need the control.

✅ **In simple words:**
High-level APIs (DataFrame/SQL) are the default and most efficient way to work in Spark. But under the hood, everything runs on **RDDs**. You only use RDDs, broadcast variables, or accumulators when you need fine-grained control, maintain legacy code, or share data across tasks.