In [0]:
# Databricks Notebook

# Apache Spark RDD Operations with Detailed Explanation

This notebook demonstrates various RDD (Resilient Distributed Dataset) operations in Apache Spark using Databricks. It includes explanations of how each operation works and its significance in distributed data processing.

In [0]:
# Get the SparkContext from the SparkSession that Databricks exposes as `spark`
sc = spark.sparkContext

## Creating an RDD
**What is happening?**
- We use `parallelize()` to create an RDD from a Python list.
- RDD (Resilient Distributed Dataset) is a fundamental data structure in Spark, representing an immutable distributed collection of objects.

In [0]:
num = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(num)
print("RDD Contents:", num_rdd.collect())

RDD Contents: [1, 2, 3, 4, 5]


## Finding Maximum Value
**What is happening?**
- We use the `reduce()` function to find the maximum value in the RDD.
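- `reduce()` combines the elements pairwise with a binary function until a single value remains. Because Spark reduces each partition separately and then merges the partial results, the function should be commutative and associative.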

In [0]:
num_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
num_rdd = sc.parallelize(num_list)
max_value = num_rdd.reduce(lambda x, y: x if x > y else y)
print("Maximum Value:", max_value)

Maximum Value: 15
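
Why the function given to `reduce()` should be commutative and associative can be sketched in plain Python, without a cluster. The helper `simulate_reduce` and both partition layouts below are hypothetical illustrations, not Spark APIs; they only mimic the "reduce each partition, then merge the partial results" pattern.

```python
from functools import reduce

def simulate_reduce(partitions, func):
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

# Two hypothetical ways the same 15 numbers could be split across partitions.
layout_a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
layout_b = [[1, 2], [3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15]]

pick_max = lambda x, y: x if x > y else y  # commutative and associative
subtract = lambda x, y: x - y              # neither commutative nor associative

max_a = simulate_reduce(layout_a, pick_max)
max_b = simulate_reduce(layout_b, pick_max)
sub_a = simulate_reduce(layout_a, subtract)
sub_b = simulate_reduce(layout_b, subtract)

print("Max is layout-independent:", max_a == max_b)
print("Subtraction is layout-independent:", sub_a == sub_b)
```

The maximum is the same for both layouts, while the subtraction result changes with the layout, which is why such a function is unsafe to pass to `reduce()`.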


## Applying `map()` Transformation
**What is happening?**
- `map()` is a transformation that applies a given function to each element of the RDD.
- Here, we multiply each element by 2.
- `map()` returns a new RDD containing the transformed values; `collect()` then brings them back to the driver as a Python list.

In [0]:
n_rdd = sc.parallelize([1, 2, 3, 4, 5])
final_res = n_rdd.map(lambda y: y * 2).collect()
print("Doubled Values:", final_res)

Doubled Values: [2, 4, 6, 8, 10]
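
Because `map()` is a transformation, Spark only records it; the function runs when an action such as `collect()` is called. A rough plain-Python analogy uses the built-in lazy `map`. The `calls` list and the `double` function are hypothetical aids for illustration, and a Python iterator is only an approximation of Spark's lazy evaluation.

```python
calls = []  # records every time the function actually runs

def double(y):
    calls.append(y)
    return y * 2

# Building the iterator runs nothing yet, much like declaring a transformation.
lazy_result = map(double, [1, 2, 3, 4, 5])
calls_before = len(calls)

# Forcing the iterator plays the role of an action such as collect().
doubled = list(lazy_result)
calls_after = len(calls)

print("Calls before forcing:", calls_before)
print("Calls after forcing:", calls_after)
print("Doubled Values:", doubled)
```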


## Filtering Even Numbers
**What is happening?**
- `filter()` is used to keep only elements that satisfy a given condition.
- We keep the even numbers by checking whether each element is divisible by 2; odd numbers are dropped.

In [0]:
rdd = sc.parallelize(range(1, 21))  # RDD from 1 to 20
final_res = rdd.filter(lambda x: x % 2 == 0).collect()
print("Even Numbers:", final_res)

Even Numbers: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]


## Sorting RDD Based on Elements
**What is happening?**
- We use `sortBy()` to sort elements based on a specific index in a tuple.
- Sorting can be done in ascending or descending order.
- Here, we sort by the third element (ascending, then descending) and, separately, by the second element (ascending).

In [0]:
pairs = [("a", 5, 10), ("d", 3, 9), ("c", 2, 11), ("b", 3, 9)]
rdd = sc.parallelize(pairs)

sorted_by_third_asc = rdd.sortBy(lambda x: x[2])
print("Sorted by Third Element Ascending:", sorted_by_third_asc.collect())

sorted_by_third_desc = rdd.sortBy(lambda x: x[2], ascending=False)
print("Sorted by Third Element Descending:", sorted_by_third_desc.collect())

sorted_by_second_asc = rdd.sortBy(lambda x: x[1])
print("Sorted by Second Element Ascending:", sorted_by_second_asc.collect())

Sorted by Third Element Ascending: [('d', 3, 9), ('b', 3, 9), ('a', 5, 10), ('c', 2, 11)]
Sorted by Third Element Descending: [('c', 2, 11), ('a', 5, 10), ('d', 3, 9), ('b', 3, 9)]
Sorted by Second Element Ascending: [('c', 2, 11), ('d', 3, 9), ('b', 3, 9), ('a', 5, 10)]
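
When ties matter (for example, `('d', 3, 9)` and `('b', 3, 9)` share the same second and third elements), a tuple can serve as a composite sort key. The sketch below uses plain Python's `sorted()` so it runs without a cluster; `composite_key` is a hypothetical name introduced here.

```python
pairs = [("a", 5, 10), ("d", 3, 9), ("c", 2, 11), ("b", 3, 9)]

# Tuple key: compare the third element first, then break ties on the first element.
composite_key = lambda x: (x[2], x[0])

sorted_pairs = sorted(pairs, key=composite_key)
print("Sorted by Third, then First:", sorted_pairs)
```

The same key function could be passed to `rdd.sortBy(composite_key)`, since `sortBy()` accepts any function that returns a comparable key.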


## Sorting RDD Using `sortByKey()`
**What is happening?**
- `sortByKey()` sorts an RDD based on the key in (key, value) pairs.
- Sorting can be in ascending or descending order.

In [0]:
key_value_pairs = sc.parallelize([(3, "apple"), (1, "banana"), (2, "cherry"), (4, "date")])

sorted_asc = key_value_pairs.sortByKey()
print("Sorted by Key Ascending:", sorted_asc.collect())

sorted_desc = key_value_pairs.sortByKey(ascending=False)
print("Sorted by Key Descending:", sorted_desc.collect())

Sorted by Key Ascending: [(1, 'banana'), (2, 'cherry'), (3, 'apple'), (4, 'date')]
Sorted by Key Descending: [(4, 'date'), (3, 'apple'), (2, 'cherry'), (1, 'banana')]


## Using a Function with `map()`
**What is happening?**
- Instead of using a lambda function, we define a function `eve_function()` that doubles a value.
- We pass this function to `map()` to transform the RDD.

In [0]:
def eve_function(x):
    return x * 2

n_rdd = sc.parallelize([1, 2, 3, 4, 5])
final_res = n_rdd.map(eve_function).collect()
print("Function-based Mapping Result:", final_res)

Function-based Mapping Result: [2, 4, 6, 8, 10]


## Filtering Even Numbers with `filter()`
**What is happening?**
- We use `filter()` to keep only even numbers from the RDD.

In [0]:
listt = [1, 2, 3, 4, 5]
rdd = sc.parallelize(listt)
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print("Filtered Even Numbers:", even_rdd.collect())

Filtered Even Numbers: [2, 4]
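
As with `map()`, the predicate given to `filter()` can be a named function instead of a lambda. A minimal plain-Python sketch, assuming a hypothetical helper `is_even`; Python's built-in `filter` keeps the elements for which the predicate returns `True`, mirroring the RDD method.

```python
def is_even(x):
    """Return True for elements that should be kept."""
    return x % 2 == 0

values = [1, 2, 3, 4, 5]
kept = list(filter(is_even, values))
dropped = [x for x in values if not is_even(x)]

print("Kept:", kept)
print("Dropped:", dropped)
```

With an RDD, the same function could be passed as `rdd.filter(is_even)`.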


## `flatMap()` Transformation
**What is happening?**
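- Unlike `map()`, which returns exactly one output per input, `flatMap()` lets each element produce zero or more values and flattens them into a single RDD.
- Here, each `x` expands to `range(1, x)`, that is, the values 1 through `x - 1`, so the element 1 contributes nothing.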

In [0]:
num_rdd = sc.parallelize([1, 2, 3, 4, 5])
final_res = num_rdd.flatMap(lambda x: range(1, x)).collect()
print("Flattened Values:", final_res)

Flattened Values: [1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
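
The difference between `map()` and `flatMap()` can be seen by applying the same function both ways. This is a plain-Python sketch that mimics the two behaviours with a list comprehension and `itertools.chain`; the name `expand` is hypothetical.

```python
from itertools import chain

nums = [1, 2, 3, 4, 5]
expand = lambda x: list(range(1, x))

# map-like: one (possibly empty) list per input element, nesting preserved.
mapped = [expand(x) for x in nums]

# flatMap-like: the per-element lists are concatenated into one flat sequence.
flattened = list(chain.from_iterable(mapped))

print("Mapped:", mapped)
print("Flattened:", flattened)
```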


## Removing Duplicates with `distinct()`
**What is happening?**
- `distinct()` removes duplicate elements from an RDD. It requires a shuffle, so the order of the returned elements is not guaranteed.

In [0]:
num_rdd = sc.parallelize([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
final_res = num_rdd.distinct().collect()
print("Distinct Elements:", final_res)

Distinct Elements: [1, 2, 3, 4, 5]
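
If a deterministic order is needed after removing duplicates, the result can be sorted explicitly. A plain-Python sketch of the idea follows; the variable names are hypothetical, and Python's `set` stands in for `distinct()` only as an analogy, since it does not promise any order either.

```python
values = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

# Deduplicate, then sort to get a stable, predictable order.
unique_sorted = sorted(set(values))

# Alternative: keep the order in which each value first appeared.
unique_first_seen = list(dict.fromkeys(values))

print("Distinct, sorted:", unique_sorted)
print("Distinct, first-seen order:", unique_first_seen)
```

With an RDD, sorting the collected list, as in `sorted(num_rdd.distinct().collect())`, gives the same guarantee.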
